Using Azure SQL Partition Magic to Power Bulk Data Upload

Bulk importing data from Azure Blob Storage into an Azure SQL partitioned table provides a fast, streamlined way to load high-volume data, significantly saving time and resources. This approach leverages the inherent strengths of both Azure Blob Storage and Azure SQL, optimizing the data transfer and storage workflow.

The bulk import process from Azure Blob Storage to Azure SQL partitioned tables enables the seamless movement of data in large batches. This method capitalizes on parallel processing, allowing for the simultaneous transfer of multiple chunks of data, resulting in a dramatic reduction in transfer times. Additionally, the partitioned nature of the SQL table means that the data is strategically organized based on specified criteria, such as date ranges or specific attributes, optimizing query performance and retrieval times.

[Diagram created with draw.io]

Let’s look at the steps required to bulk import data from a CSV file in Azure Blob Storage to a partitioned table named Data_Report.

1. Create Database Credential:

You can create a database scoped credential to access Azure Storage using Managed Identity for secure and controlled access. With Managed Identity, the server authenticates with its own Azure AD identity, so no secret needs to be stored.

CREATE DATABASE SCOPED CREDENTIAL ReportingCredential
WITH IDENTITY = 'Managed Identity';
-- No SECRET is needed with Managed Identity; SECRET is only used
-- with a SAS token (IDENTITY = 'SHARED ACCESS SIGNATURE').
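Note that a database scoped credential requires a database master key to exist in the database. If yours does not already have one, here is a minimal sketch (replace the placeholder password):

-- Create a database master key if one does not already exist
-- (required before creating a database scoped credential)
IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##')
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password-here>';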

2. Create External Data Source:

CREATE EXTERNAL DATA SOURCE Data_Report
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://xxxxx.blob.core.windows.net/reporting',
    CREDENTIAL = ReportingCredential
);
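To confirm that the credential and data source were created correctly, you can query the catalog views (a quick sanity check, not required for the import itself):

-- Verify the credential and external data source exist
SELECT name, credential_identity FROM sys.database_scoped_credentials;
SELECT name, type_desc, location FROM sys.external_data_sources;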

3. Partitioned Table Schema:

The T-SQL script below creates a partition function [PF_Date] on a DATE column with a separate partition for each month, a corresponding partition scheme, and a partitioned table that uses the scheme.

The partition function defines the boundary values of the initial set of partitions and the data type of the partitioned column. The partition scheme maps partitions to filegroups (in Azure SQL DB, always PRIMARY). The partitioned table is tied to a particular scheme when it is created.

CREATE PARTITION FUNCTION [PF_Date] (DATE)
AS RANGE RIGHT FOR VALUES ('20210201', '20210301', '20210401',
    '20210501', '20210601', '20210701', '20210801',
    '20210901', '20211001', '20211101', '20211201', '20220101');

CREATE PARTITION SCHEME PS_Date
AS PARTITION PF_Date
ALL TO ([PRIMARY]);

CREATE TABLE Data_Report
(
    ID INT,
    Name VARCHAR(100),
    Num_Orders DECIMAL(18, 2),
    Transaction_Date DATE
)
ON PS_Date (Transaction_Date);

In this example, the Data_Report table has four columns: ID, Name, Num_Orders, and Transaction_Date. The table is partitioned on the Transaction_Date column using the RANGE RIGHT strategy, with boundary values spanning 2021 so that each month's data lands in its own partition.
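To verify how rows will be distributed, the $PARTITION function shows which partition a given date maps to, and sys.partitions reports per-partition row counts. A quick verification sketch:

-- Which partition does a given date map to?
SELECT $PARTITION.PF_Date('20210615') AS partition_number; -- June 2021

-- Row counts per partition of Data_Report
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('Data_Report') AND index_id IN (0, 1);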

4. SQL Query to Bulk Import from Azure Blob Storage:

Assuming you have a CSV file named Data_Report.csv in Azure Blob Storage, here’s a T-SQL query using the BULK INSERT statement to import data into the partitioned table:

BULK INSERT Data_Report
FROM 'Data_Report.csv'
WITH (
    DATA_SOURCE = 'Data_Report',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000,
    TABLOCK,
    MAXERRORS = 0
);

In this query:

- Data_Report is the name of the destination table.

- The FROM clause specifies the path of the CSV file relative to the external data source's LOCATION.

- DATA_SOURCE references the external data source created in step 2, which supplies the storage URL and credential.

- FIELDTERMINATOR and ROWTERMINATOR define the delimiters for fields and rows in the CSV file.

- BATCHSIZE determines the number of rows per batch during the bulk insert; each batch is committed as a separate transaction.

- TABLOCK takes a bulk update lock on the table, reducing lock overhead during the load.

- MAXERRORS is set to 0, meaning no errors are tolerated during the bulk insert.
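If your CSV file includes a header row, BULK INSERT can skip it with the FIRSTROW option, and the FORMAT option enables proper parsing of quoted fields. A small variant of the query above (assuming a single header line):

BULK INSERT Data_Report
FROM 'Data_Report.csv'
WITH (
    DATA_SOURCE = 'Data_Report',
    FORMAT = 'CSV',   -- RFC 4180 parsing, handles quoted fields
    FIRSTROW = 2,     -- skip the header row
    BATCHSIZE = 10000,
    TABLOCK
);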

Enforcing Data Retention:

Dropping old partitions in a partitioned table is fast and concurrent because it leverages the structure of partitioning, minimal logging, and efficient metadata updates described below. This makes it a practical and scalable solution for managing and purging historical reporting data, especially for enforcement of data retention policies.

Partition Elimination:

  • Partitioned tables are divided into logical segments based on a specified set of columns such as date ranges.
  • When dropping a partition, the database engine can quickly identify the specific partition to be removed based on the partitioning key.
  • Because the operation touches only that targeted subset of data, it is much faster than row-by-row deletion.

Minimal Logging:

  • Microsoft Azure SQL DB uses minimal logging when dropping partitions.
  • Minimal logging means that the database engine only records the transaction log entries necessary to redo the operation, significantly reducing the amount of logging overhead.
  • This contrasts with a row-by-row DELETE of the same data, which would be fully logged.

No Data Movement, Just a Metadata Update:

  • Dropping a partition typically involves deallocating the storage space associated with that partition and updating the system catalog metadata to reflect its removal.
  • Unlike some operations, such as deleting individual rows, dropping a partition doesn’t involve moving data around within the table.
  • This lack of data movement contributes to the speed of the operation.

Concurrency:

  • Partitioned tables often support concurrent operations, allowing other transactions to proceed unaffected by the partition drop.
  • The database system can manage locks efficiently, minimizing contention and ensuring that other users or processes can access the table during the partition drop.

Say, for example, we want to remove the January 2021 data from the table. First identify the partition number that holds it, then truncate that partition as follows.

SELECT MIN(p.partition_number) AS min_partition_number,
       MAX(p.partition_number) AS max_partition_number
FROM sys.partitions p
INNER JOIN sys.tables t ON t.object_id = p.object_id
INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.partition_schemes ps ON ps.data_space_id = i.data_space_id
INNER JOIN sys.partition_functions pf ON pf.function_id = ps.function_id
INNER JOIN sys.partition_range_values prv
    ON prv.function_id = pf.function_id
    AND prv.boundary_id = p.partition_number -- RANGE RIGHT: partition N sits below boundary N
WHERE pf.name = 'PF_Date'
    AND t.name = 'Data_Report'
    AND CAST(prv.value AS DATE) = '20210201'; -- the boundary that closes January 2021

-- Substitute the partition numbers returned by the query above
TRUNCATE TABLE Data_Report WITH (PARTITIONS (<min_partition_number> TO <max_partition_number>));
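Once a partition has been emptied, you can optionally merge away its boundary so the partition count does not grow unbounded, and split in a new boundary for upcoming months. A sketch of this rolling-window maintenance (the boundary dates are illustrative):

-- Remove the now-empty January 2021 boundary
ALTER PARTITION FUNCTION PF_Date() MERGE RANGE ('20210201');

-- Prepare the scheme and add a boundary for a future month
ALTER PARTITION SCHEME PS_Date NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_Date() SPLIT RANGE ('20220201');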

In summary, the bulk import from Azure Blob Storage to Azure SQL partitioned tables represents a time-saving solution that capitalizes on the strengths of both the storage and database services. It combines the scalability and flexibility of Azure Blob Storage with the optimized storage and retrieval capabilities of partitioned tables in Azure SQL, providing a robust and efficient data transfer mechanism and an effective data retention enforcement strategy.
