Recently I needed to create a stored procedure to process and cleanse a huge set of call center data on a daily basis: more than half a million calls a day. And because the data could be backed up and unavailable at times, a single run sometimes had to cover multiple days, which meant many millions of rows.
While I was still working out the cleansing logic, my procedure was pretty slow; even one day's run took 10 minutes. That meant every time I needed to tweak the logic, I had to wait at least 10 minutes to examine the results!
That kind of work pattern gets old quickly. The problem with taking a 10-minute break all the time is that task switching is a pain and makes you forget things. It's much easier to sustain focus when you don't have to spend 10 minutes out of every 20 working on something else.
So, how to speed up the procedure and get rid of all those inefficient breaks? At this stage of the work, it was fine to use a much smaller subset of data, as long as I got a good sampling of records, and a little hack gave me exactly that. My first step in the procedure was creating a temporary table to bring in the records to be processed, so while I was still working out the logic, I added a where clause like this to pull in a much smaller subset:
where
    convert(date, CallEnd) in (select CallDate from @DatesToFetch)
    -- Just get 1 out of 100 records for now
    and CustomerCallID % 100 = 0
This uses the modulo operator (%), which returns the remainder of dividing one number by another. In this case, CustomerCallID was an identity value increasing by 1 with each row, so the filter returned every CustomerCallID ending in 00 (i.e., where the remainder of CustomerCallID divided by 100 was 0). This cut the time required to run the procedure dramatically, yet still gave a great sampling of records. Then, when you're ready to run the full set of data, just comment out the filter.
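To show the whole pattern in one place, here's a minimal sketch of how that filter fits into the temp-table load. The @DatesToFetch, CallEnd, and CustomerCallID names come from the snippet above; the source table dbo.CallDetail, the temp table #CallsToProcess, and the sample dates are made-up names and values for illustration:

declare @DatesToFetch table (CallDate date);
insert into @DatesToFetch values ('2014-06-01'), ('2014-06-02');

-- Load only the sampled rows into the working temp table
select *
into #CallsToProcess
from dbo.CallDetail
where
    convert(date, CallEnd) in (select CallDate from @DatesToFetch)
    -- Just get 1 out of 100 records for now
    and CustomerCallID % 100 = 0;

Because the sampling condition is a single line, switching back to the full data set is just a matter of commenting it out.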