Sliding Window… I like that name. It’s reminiscent of summer evenings, cracking the window next to my bed after a rain shower and inhaling the heady aromas of dirt, trees, and the general outdoors. It instills a sense of calm, of ease, and that all is well in the world. So, is it an appropriate name for a technique used to load data? Yea, I think so…
The sliding window approach to loading data is a method where the primary goal is to minimize the impact on users or applications accessing that data from our target table or tables. Yes, those darn users, again. They always seem to be cropping up adding hitherto unfathomed levels of complexity to what really could be a simple or even menial task. Ah, but as we know, these users’ necessities tend to sire inventions that can play an unintended, yet significant role further on. And that is certainly the case here.
A typical recurring data load into, say, a dimensionalized data warehouse’s fact table is, at its roots, simply the collection of the new or updated data from its various sources followed by the merging of that data into the fact table. Again, this is not really a big deal under most circumstances. However, once you throw a couple twists in, the size of said deal can grow quite substantially. For instance, what if the volume of data that we’re loading is extremely large? Large enough that the process of loading this data can take a significant amount of time… perhaps hours. What if the table we’re loading is being utilized by a global enterprise that has established users in most time zones around the globe? These users expect accurate, timely data at any given time of the day or night. They deserve accurate, timely data, right? But how can we load new data for them without having a negative impact on the availability of that table and the vital information contained in it? And let’s not forget about the other applications in our organization that rely on being able to query this table. If there are locks on this table for an extended amount of time while we are updating it, we are very likely to hear from both grumpy users, and grumpy applications.
The sliding window loading approach comes to our rescue here by merely loading the data into a different table. “Wait… what? Then how do the users and applications get to use it?” I’m glad you asked; through the magic of table partitions and partition switching.
At its heart the sliding window technique works by utilizing table partitions. Continuing our example of a data warehouse fact table, the first step would be to partition the fact table, in this case, based on a date range. The date range will correspond to the timing of our loads, which in turn, correspond to the timing of the reporting being performed against our fact table. Let’s assume we’re loading shop floor data for daily reporting. We’ll also need to create 2 other tables which will act as staging tables for both the loading AND archiving of our fact data. Our load staging table will have data inserted by normal means… typically a collection of SSIS packages and stored procedures. However, it’s how the data gets from the staging table and into the fact table that should cause you to raise an eyebrow and crack a smile. Let’s get started…
In order to partition our table, we’ll need to build both a Partition Function as well as a Partition Scheme to utilize it. We’ll then need to tell our fact table to use the Partition Scheme, too. Let’s assume we are starting from scratch, and have not yet built our fact table. This being the case, we’ll start by constructing the Partition Function:
CREATE PARTITION FUNCTION
pf_FactDemoFunction(INT)
AS RANGE LEFT FOR VALUES(20100901);
GO
Now, as you can see, pf_FactDemoFunction is based upon an integer representation of a date and is set as a RANGE LEFT. I’d like to point out that an integer representation of fact based date keys is great for any dates in your fact table. It’s good practice for several reasons, and will pay dividends when/if you choose to partition. You’ll also note that, although not necessary, I’ve gone ahead and identified the date of my initial partition boundary based on the underlying data in our fact table.
Next we create the Partition Scheme as follows:
CREATE PARTITION SCHEME
ps_FactDemoScheme
AS PARTITION pf_FactDemoFunction ALL TO ([PRIMARY]);
GO
You’ll see that I’m just using the Primary file group for the purposes of this demo and that I’m referencing my newly created Partition Function.
Next we’ll create a very basic fact table:
CREATE TABLE dbo.factDemo
(
ID1 INT NOT NULL
,ID2 INT NOT NULL
,FactDate INT NOT NULL
,Measure1 MONEY NOT NULL
,Measure2 INT NOT NULL
)
ON PS_FactDemoScheme(FactDate);
GO
CREATE INDEX idxFactDate ON dbo.factDemo(FactDate) ON ps_FactDemoScheme(FactDate)
GO
A few things to observe, here… You’ll see that, once again, we’ve chosen to represent our date keys as integers. Remember, these integers will NOT be a simple surrogate key. They will be an actual integer representation of the date. This is to improve query speed, and allow us to use this column as the basis of our partitioning strategy. Also take note of the ON clause which is referencing the Partition Scheme we previously created. Last, we’ve also built an index off of our FactDate date key. Pay special attention to the fact that this index is utilizing our ps_FactDemoScheme Partition Scheme, as well.
Up until now, we really haven’t deviated from the typical activities involved with any “run of the mill” partitioning strategy, but here is where things start to get interesting. We’re now going to build 2 more tables. Other than in naming, each of these tables will exactly mimic our fact table.
CREATE TABLE dbo.factDemo_STGArch
(
ID1 INT NOT NULL
,ID2 INT NOT NULL
,FactDate INT NOT NULL
,Measure1 MONEY NOT NULL
,Measure2 INT NOT NULL
)
ON PS_FactDemoScheme(FactDate);
GO
CREATE INDEX idxFactDate_STGArch ON dbo.factDemo_STGArch(FactDate) ON ps_FactDemoScheme(FactDate)
GO
And…
CREATE TABLE dbo.factDemo_STGLoad
(
ID1 INT NOT NULL
,ID2 INT NOT NULL
,FactDate INT NOT NULL
,Measure1 MONEY NOT NULL
,Measure2 INT NOT NULL
)
ON PS_FactDemoScheme(FactDate);
GO
CREATE INDEX idxFactDate_STGLoad ON dbo.factDemo_STGLoad(FactDate) ON ps_FactDemoScheme(FactDate)
GO
Again, note that these tables and indexes are identical to our original, and were both built on the ps_FactDemoScheme Partition Scheme. This means that they are also partitioned, though they only contain a single boundary… the most current. This is because we are actually going to swap the partition in our staging tables (which really amounts to being the whole table) with the corresponding partition in our fact table. We will switch in the load staging table with new most recent empty partition, and we will switch out the oldest partition from the fact table with the empty archiving stage table. This, in effect, is why the technique is called a “sliding window”… in with the new, out with the old, all in the same process.
Now that our basic infrastructure is in place, we need to turn our thoughts to how to utilize this infrastructure to get our data loaded. To that end, we’ll now need to create the stored procedure responsible for moving our new data into the fact table and our expired data out of it.
CREATE PROCEDURE
dbo.SlideRangeLeftWindow_INT
@RetentionDays INT
,@RunDate DATE = NULL
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE
@Error INT,
@RowCount BIGINT,
@ErrorLine INT,
@Message VARCHAR(255),
@ExpirationDate DATE,
@PartitionBoundaryDate DATE;
SET @Error = 0;
Our stored procedure will take a couple of parameters. One will help it to identify the current run date, and another which will allow it to track the number of days to keep data in the fact table before sending them to be archived.
Next, we will begin a try-catch block and start it off by checking the contents of the @RunDate parameter. This parameter will allow us to manually pass in a specific date to start the load from. Otherwise, if it’s NULL, then we’ll just assume that we are only loading data from the previous period. We’ll also use the @RunDate parameter in concert with the @RetentionDays parameter to identify the date at which the process will need to archive off the expired records.
BEGIN TRY
IF @RunDate IS NULL
BEGIN
--Use current date if no date specified
SET @RunDate = CAST(GETDATE() AS DATE);
END
--Set date boundary to latest possible date of previous period
SET @RunDate = DATEADD(DAY, -1, @RunDate);
--Calculate expiration date based on @RetentionDays and @RunDate
SET @ExpirationDate = DATEADD(DAY, @RetentionDays * -1, @RunDate);
Next, we’ll need to query a couple of the partition specific catalog views in order to determine which partition contains our oldest data.
--Get oldest existing boundary
SELECT
@PartitionBoundaryDate =
CAST(CAST(MIN(prv.value) AS VARCHAR(10)) AS DATE)
FROM sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction';
Now we’ll want to start a new transaction and lock down our destination fact table with an exclusive, transaction-long table lock. “But wait, I thought the point of this was to minimize the effect of locking on the users or external applications. That’s a pretty hefty lock!” You are correct, and yes it is. The idea, here, isn’t to minimize the amount of locking on the table. That could open us up to dirty reads and deadlocking issues. We’re going to lock it down quite heavily, in fact, but only for a VERY small amount of time, which is, frankly, the point of this method. So, let’s throw a hint out there that will lock that entire table, again, for the duration of the transaction:
BEGIN TRAN;
--Acquire exclusive table lock to prevent deadlocking with concurrent activity
SELECT TOP 1
@error = 0
FROM
dbo.factDemo WITH (TABLOCKX, HOLDLOCK);
Now we’ll identify the upper and lower boundaries for the date range that we will be loading. This is necessary in order to create a check constraint on the data load staging table. The reason for doing this is to ensure that we don’t have any incoming data that falls outside of the range of our current partition. If we attempt to insert new data into a partition of our fact table that already has data in it, resource usage will spike tremendously as the optimizer tries to identify, row by row, where the data should live. For this reason we always want to insert new fact records into an empty partition. And, again, to confirm that we do this, we create the check constraint on the staging table to make darn sure that our date values all fall in the range of this empty partition.
Since we don’t know what the upper and lower bounds will be before creating the new check constraints, we’ll need to create them using dynamic SQL. No, I’m not usually a big fan of dynamic SQL, but it does have its place. Since we’re not using it for DML, it’s not too much of a problem here. If an error occurs due to there being values outside of the intended range, the catch block will halt further processing and the transaction will roll back making sure that no data was written to the fact table. The life span of this constraint will be very short lived. We only want to issue it to verify our data is in range, and then we’ll remove it.
Once we’ve identified that we don’t have any dates out of range, the magic of partition switching comes into play. With a simple ALTER TABLE statement, our full load staging table trades places with the appropriate partition in the fact table resulting in a full partition in the fact table and an empty staging table… all in the blink of an eye!
At this point we can now safely remove the check constraint we added.
-- Load new records from STGLoad
DECLARE @UpperVal INT;
DECLARE @UpperBoundID INT;
DECLARE @LowerVal INT;
DECLARE @AlterSQL VARCHAR(150);
SELECT
@UpperVal = CAST(prv.value AS INT)
,@UpperBoundID = prv.boundary_id
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction'
AND prv.value =
(SELECT
MAX(prv.value)
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction')
SELECT
@LowerVal = CAST(prv.value AS INT)
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction'
AND boundary_id = @UpperBoundID - 1
IF EXISTS
(SELECT 1
FROM sys.check_constraints
WHERE object_id = OBJECT_ID(N'[dbo].[CK_FactDate]')
AND parent_object_id = OBJECT_ID(N'[dbo].[factDemo_STGLoad]'))
ALTER TABLE [dbo].[factDemo_STGLoad] DROP CONSTRAINT [CK_FactDate]
SET @AlterSQL = 'ALTER TABLE dbo.factDemo_STGLoad ADD CONSTRAINT CK_FactDate CHECK (FactDate > ' +
CAST(@LowerVal AS VARCHAR(8)) +
' AND FactDate <= ' + CAST(@UpperVal AS VARCHAR(8)) + ')';
EXEC(@AlterSQL);
ALTER TABLE dbo.factDemo_STGLoad SWITCH PARTITION @UpperBoundID
TO dbo.factDemo PARTITION @UpperBoundID;
IF EXISTS
(SELECT 1
FROM sys.check_constraints
WHERE object_id = OBJECT_ID(N'[dbo].[CK_FactDate]')
AND parent_object_id = OBJECT_ID(N'[dbo].[factDemo_STGLoad]'))
ALTER TABLE [dbo].[factDemo_STGLoad] DROP CONSTRAINT [CK_FactDate]
At this point, it’s time to initiate some spring cleaning and purge the old, moldy data out of fact table. This can mean writing it off to tape, deleting it, or just copying it to a slower drive to free up the fast hardware for data that gets queried more often. To do this we’ll, first, compare our oldest partition boundary to our calculated expiration date. If that boundary is later than our expiration date, we’ll split the partition, creating a new one that only contains expired data.
--If the first partition boundary is later than the expiration date, split the first partition so that it contains only expired data
IF @PartitionBoundaryDate > @ExpirationDate
OR @PartitionBoundaryDate IS NULL
BEGIN
SET @PartitionBoundaryDate = @ExpirationDate;
ALTER PARTITION SCHEME ps_FactDemoScheme
NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_FactDemoFunction()
SPLIT RANGE(CAST(CONVERT(VARCHAR(10), @PartitionBoundaryDate, 112) AS INT));
END;
Now we’ll double check that our archive staging table is empty of all data, and then execute ourALTER TABLE statement to switch the oldest partition out with it. Follow that step by merging our now empty fact table partition with the next to last partition, eliminating the trailing empty partition and the meat of our loading process is accomplished. We have switched in new data and switched out expired data in the same table faster than you can say, “Table partitioning is cool”.
--Switch and merge partitions older that retention period
WHILE @PartitionBoundaryDate <= @ExpirationDate
BEGIN
--Ensure target staging partition is empty
TRUNCATE TABLE dbo.factDemo_STGArch;
ALTER TABLE dbo.factDemo SWITCH PARTITION 1
TO dbo.factDemo_STGArch PARTITION 1;
--Purge data permanently
TRUNCATE TABLE dbo.factDemo_STGArch;
--Merge first and second partitions
ALTER PARTITION FUNCTION pf_FactDemoFunction()
MERGE RANGE(CAST(CONVERT(VARCHAR(10), @PartitionBoundaryDate, 112) AS INT));
--Get oldest partition boundary for next iteration
SET @PartitionBoundaryDate = NULL;
SELECT
@PartitionBoundaryDate = CAST(CAST(prv.value AS VARCHAR(8)) AS DATE)
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction';
END;
Finally, we’ll wrap everything up by taking care of a bit of preemptive safeguarding, making sure we have the appropriate empty partitions ready for tomorrow’s load as well as completing our TRY block and fleshing out our CATCH block.
--Make sure we have individual partitons for tomorrow and future days.
SET @PartitionBoundaryDate = DATEADD(DAY, 1, @RunDate);
WHILE @PartitionBoundaryDate <= DATEADD(DAY, 2, @RunDate)
BEGIN
IF NOT EXISTS
(
SELECT
prv.value
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction'
AND CAST(CAST(prv.value AS VARCHAR(8)) AS DATE) = @PartitionBoundaryDate
)
BEGIN
ALTER PARTITION SCHEME ps_FactDemoScheme NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_FactDemoFunction() SPLIT RANGE(CAST(CONVERT(VARCHAR(10),@PartitionBoundaryDate, 112) AS INT));
END;
--Calc boundary date for next iteration
SET @PartitionBoundaryDate = DATEADD(DAY, 1, @PartitionBoundaryDate);
END
COMMIT;
END TRY
BEGIN CATCH
SELECT
@Error = ERROR_NUMBER(),
@Message = ERROR_MESSAGE(),
@ErrorLine = ERROR_LINE();
SET @Message =
'Partition maintenenace failed with error %d at line %d: ' + @Message;
RAISERROR(@Message, 16, 1, @Error, @ErrorLine) WITH NOWAIT;
IF @@TRANCOUNT > 0
BEGIN
ROLLBACK;
END;
END CATCH;
RETURN @error;
GO
So now let’s take it for a test drive. I have prepared a few simple INSERT statements to act as our ETL process, loading into our load staging table. These statements are grouped into days, mimicking the recurring load. I have also created a couple of SELECT statements which will give us some insight into the number of partitions, and the amount of data contained in each one allowing us to track the processing of our loads into the fact table.
For each day, execute the INSERT statement. Note that the records are inserted into the staging table. Next, execute the sliding window stored procedure. The procedure should execute without issue and apply the sliding window load into the fact table. In between each day’s load, query the fact table to see the new data added. Also, run the below SELECT statements to see the partition break down and where the inserted rows are now living in each partition.
-- Day 1:
INSERT INTO dbo.factDemo_STGLoad VALUES(1,1,20100901,100,500);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-01';
-- Day 2:
INSERT INTO dbo.factDemo_STGLoad VALUES(3,3,20100902,200,600);
INSERT INTO dbo.factDemo_STGLoad VALUES(2,2,20100902,150,550);
INSERT INTO dbo.factDemo_STGLoad VALUES(4,4,20100902,250,650);
INSERT INTO dbo.factDemo_STGLoad VALUES(5,5,20100902,300,700);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-02';
-- Day 3:
INSERT INTO dbo.factDemo_STGLoad VALUES(6,6,20100903,350,750);
INSERT INTO dbo.factDemo_STGLoad VALUES(7,7,20100903,400,800);
INSERT INTO dbo.factDemo_STGLoad VALUES(8,8,20100903,450,850);
INSERT INTO dbo.factDemo_STGLoad VALUES(9,9,20100903,500,900);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-03';
-- Day 4:
INSERT INTO dbo.factDemo_STGLoad VALUES(10,10,20100904,550,950);
INSERT INTO dbo.factDemo_STGLoad VALUES(11,11,20100904,600,1000);
INSERT INTO dbo.factDemo_STGLoad VALUES(12,12,20100904,700,1050);
INSERT INTO dbo.factDemo_STGLoad VALUES(13,13,20100904,750,1100);
INSERT INTO dbo.factDemo_STGLoad VALUES(14,14,20100904,800,1150);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-04';
-- Day 5:
INSERT INTO dbo.factDemo_STGLoad VALUES(15,15,20100905,850,1200);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-05';
-- Day 5:
INSERT INTO dbo.factDemo_STGLoad VALUES(15,15,20100906,850,1200);
EXEC dbo.SlideRangeLeftWindow_INT @RetentionDays = 1, @RunDate = '2010-09-06';
-- Execute the following between each day’s load
SELECT [ID1]
,[ID2]
,[FactDate]
,[Measure1]
,[Measure2]
FROM [PartitionDemo].[dbo].[factDemo]
SELECT
CAST(prv.value AS INT) AS PartitionBoundary
FROM
sys.partition_functions AS pf
INNER JOIN sys.partition_range_values AS prv
ON prv.function_id = pf.function_id
WHERE
pf.name = 'pf_FactDemoFunction'
SELECT
index_id
,partition_number
,rows
FROM
sys.partitions
WHERE
object_id = OBJECT_ID('dbo.factDemo')
ORDER BY
index_id
,partition_number