Monday, September 8, 2014

Troubleshooting slow SQL Server stored procedures - using the Modulo function to improve productivity



 Recently I needed to create a stored procedure to process and cleanse a huge set of call center data on a daily basis - more than half a million calls a day. And because data could be backed up and unavailable at times, a run covering multiple days would sometimes be necessary, which meant many millions of rows. 

While I was still working out the cleansing logic, my procedure was pretty slow. Even one day's run took 10 minutes. This meant that every time I tweaked the logic, it took at least 10 minutes before I could examine the results! 

That kind of work pattern gets old quickly. The problem with having to take a 10-minute break all the time is that task switching is a pain and makes you forget things. It's easier to sustain focus if you don't have to spend 10 minutes out of every 20 working on something else.

So, how to speed up the procedure and get rid of all those inefficient breaks? For this stage of my work, it would have been fine to use a much smaller subset of the data, as long as I got a good sampling of records. A little hack gave me exactly that. My first step was creating a temporary table to bring in the records to be processed, so while I was still working out the logic, I added a where clause like this to pull in a much smaller subset:

 where   
   convert(date, CallEnd) in (select CallDate from @DatesToFetch)  
   -- Just get 1 out of 100 records for now  
   and CustomerCallID % 100 = 0   

This uses the modulo operator (%), which gives the remainder of dividing one number by another. In the example above, CustomerCallID was an identity value increasing by 1 each time, so the filter returns every CustomerCallID ending in 00 (that is, where the remainder of CustomerCallID divided by 100 is 0). This was a great way to cut back dramatically on the time required to run the procedure, yet still get a good sampling of records. Then, when you're ready to run the full set of data, just comment out the filter.
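
For context, here's roughly how that filter fit into the temp table load. This is just a sketch - the source table name (dbo.CallDetail) and the column list are hypothetical stand-ins, not the real schema:

 -- Hypothetical sketch: load a 1-in-100 sample into a temp table for faster iteration  
 declare @DatesToFetch table (CallDate date)  
 insert into @DatesToFetch values ('20140905')  
   
 select CustomerCallID, CallStart, CallEnd, AgentID  -- placeholder column list  
 into #CallsToProcess  
 from dbo.CallDetail  -- hypothetical source table  
 where   
   convert(date, CallEnd) in (select CallDate from @DatesToFetch)  
   -- Just get 1 out of 100 records for now  
   and CustomerCallID % 100 = 0  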






Sunday, August 17, 2014

Microsoft SQL Server online documentation is improving

Previously, when I was looking for quick information concerning syntax or other SQL Server details, I avoided the Microsoft SQL Server online documentation. Why? Because it was far too difficult to extract the information you needed from the gobs and gobs of text, with every possible variation of every single parameter.

Check out this page from the SQL Server 2008 official documentation, and see how long it takes you to get a quick example of how to create a unique constraint on a table.

Did you notice that there are 764 lines on the page? And all the examples are at the bottom. The specific example I was looking for was on line 508, in a list of 14 examples.
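
For reference, the quick example I was hoping to find boils down to a couple of lines - here's a sketch with made-up table and column names:

 -- Hypothetical table and column names  
 ALTER TABLE dbo.Employee  
   ADD CONSTRAINT UQ_Employee_Email UNIQUE (Email);  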

That's why I usually went straight to third party sites such as StackOverflow and w3schools. They have short, single-topic pages, with examples front and center - you can get what you need in a few seconds.

But I was recently reviewing the online documentation, specifically for SQL Server 2014, and it's improved tremendously. Take a look at this page, on creating a check constraint. It's short, to the point, and the examples are not difficult to find. Honestly, I'd still go to the third party sites before I'd go to the official documentation, just because they're still more concise and to-the-point. But now the official documentation at least has a chance.


************************
EDIT 11/11/2014
************************
So, Microsoft documentation has been slowly improving, which is great. But I've recently been playing with Redshift, the Amazon AWS data warehouse product. Now there's some well-done documentation. I had been reading up on a new feature they'd added - IF NOT EXISTS for table creation and dropping. Since I was working through getting a Redshift cluster set up, I decided to try the new feature, following this page of the documentation.
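
The documented syntax is about as simple as it sounds - something along these lines (a quick sketch with a made-up table name):

 -- Create the table only if it doesn't already exist  
 CREATE TABLE IF NOT EXISTS staging_calls (  
   call_id   INTEGER,  
   call_date DATE  
 );  
   
 -- Drop the table only if it does exist  
 DROP TABLE IF EXISTS staging_calls;  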

But the feature didn't work. Since every page of the online documentation has a Feedback button, I sent some feedback saying that the feature didn't work. Within 2 hours, I got a friendly email back, saying that I had discovered an "undocumented feature", and that they were working on fixing it. Wow!  Very impressive.







Friday, May 16, 2014

How to search for a string in your SQL Server database code


*****************************************************************
EDIT May 27, 2014: Forget this!  Really, just forget it and use the Red Gate SQL Search tool.  I just downloaded it and have been using it all morning. It's free, functional, searches EVERYTHING at the same time (job steps, tables, columns, etc), and has a very usable interface. This will replace ALL of my queries against system objects.

One tip - if you need to search different servers, you may find it difficult to switch database servers within the Red Gate SQL Search tool. The drop-down doesn't appear to work. I've found that the only way to switch is to close the tool, go to the server you want in Object Explorer, and then open the tool again. You will now be searching the correct server.
*****************************************************************

As anyone who's been a developer for any length of time knows, what you're doing most frequently is NOT working with a fresh slate on a brand new project. No, it's less fun but much more common to be enhancing, fixing, modifying, or otherwise updating code that's already out there. And this is another thing you figure out very early in your career - documentation for the average database product that's internal to a business is almost always either non-existent or inaccurate.

So, a very important skill for the developer is the ability to research, and figure out what's out there already.

I like to use the following query to search for code. It will return all the code (including views, stored procedures, functions, etc.) that contains the string (table name, view name, stored procedure name, etc.) you're searching for. There are add-ons with similar functionality, but I'm not always able to install an add-on.

A huge bonus with this query is the XML field that contains the definition. When you click on it, you can see the actual text of the definition in a new window - no sp_helptext needed!

 SELECT  
   Type =  
     case  
       when OBJECTPROPERTY(object_id, 'IsProcedure') = 1 then 'Procedure'  
       when OBJECTPROPERTY(object_id, 'IsView') = 1 then 'View'  
       when OBJECTPROPERTY(object_id, 'IsTableFunction') = 1 then 'TableFunction'  
       when OBJECTPROPERTY(object_id, 'IsScalarFunction') = 1 then 'ScalarFunction'  
       else 'other'  
     end  
   ,Name = OBJECT_NAME(object_id)   
   ,Definition = CAST((SELECT definition AS [processing-instruction(x)] FOR XML PATH('')) AS XML)   
 FROM sys.sql_modules  
 WHERE  
   -- Replace 'hier' with the string you want to find  
   Object_definition(object_id) LIKE '%hier%'  
 order by 2, 1 
This is the output you'll get. Try clicking on the definition field to open it up in a new window.










Monday, April 14, 2014

Easily parse output of Set Statistics IO in SQL Server


****************************************************************
EDIT June 19, 2014 - I saw a new tool online that looks easier than using my code. Check it out - http://www.statisticsparser.com/
****************************************************************


Recently I was performance tuning a very involved, gigantic stored procedure. Figuring out exactly where the performance problems were was a challenge, so I turned on Statistics IO. I'm not going to get into the details of Statistics IO here; you can easily find that online. The bottom line is that I wanted an easy way to get the most important metric, Logical Reads. But scanning the enormous volume of text generated was tedious, so I looked online to see what options I had to make the parsing easier and faster.
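
If you haven't used it before, turning it on is a one-liner (the query below is just a placeholder for whatever you're actually tuning):

 -- Report logical reads, physical reads, etc. for each statement in this session  
 SET STATISTICS IO ON;  
   
 SELECT * FROM sys.objects;  -- placeholder query  
   
 SET STATISTICS IO OFF;  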

Vicky Harp has an interesting article on using Excel to parse the output of Statistics IO.  I tried it out, and it works, but I preferred to stick with SQL Server tools if possible, instead of needing to copy and paste formulas into Excel.

So I wrote the stored procedure below to parse the output.  After doing some data manipulations, it uses XML to easily parse the desired values out of the Statistics IO output, and returns a result set with the table name and metrics for each line of output in Set Statistics IO.

Note!!  The output of Set Statistics IO includes single quotes around the table names.  For instance:

 Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table 'sysobjrdb'. Scan count 1, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table 'sysschobjs'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table 'spt_values'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  

Since the parameter you pass to the stored procedure is itself wrapped in single quotes, any single quotes inside the text will break the string literal, so you'll need to replace them with an empty string (or double them up). I'm skipping the details of how to do this - if you know enough about SQL Server to be worrying about Logical Reads, you don't need step-by-step instructions.
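
That said, if you'd rather stay in T-SQL for that step, one option (just a sketch - @Raw and @Clean are made-up variable names) is to paste the output into a variable with the quotes doubled and let REPLACE strip them out:

 -- '' inside a string literal is an escaped single quote  
 declare @Raw varchar(max) =  
   'Table ''Worktable''. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.'  
   
 -- Strip the quotes, then pass the cleaned text to the procedure below  
 declare @Clean varchar(max) = replace(@Raw, '''', '')  
 exec dbo.sp_StatisticsIO_OutputParse @Output = @Clean  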

Once you've replaced the single quotes, use that text as a parameter for the stored procedure below, sp_StatisticsIO_OutputParse.  For example:

 exec sp_StatisticsIO_OutputParse  
 '  
 Table Worktable. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table sysobjrdb. Scan count 1, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table sysschobjs. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table spt_values. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table Worktable. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 Table sysscalartypes. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
 '  

I hope this helps in your troubleshooting endeavors! And if you can figure out a way to avoid replacing the single quotes, please let me know.

 Create procedure dbo.sp_StatisticsIO_OutputParse @Output varchar(max)   
 as   
 /*  
 This stored procedure is designed to parse the output you get when you turn on   
 Set Statistics IO (under Tools, Options, SQL Server Execution, Advanced, or by running Set Statistics IO in a query window).   
 ! IMPORTANT - I didn't figure out how to deal with the single quotes around the table name in the output of Statistics IO.   
 So, in order to run this properly, you need to replace all single quotes with an empty string.  
 Copy the output without the single quotes, then use the output as the parameter (surrounding the whole output with single quotes) for   
 this procedure, sp_StatisticsIO_OutputParse.  
 */  
 Select @Output = replace (@Output, ' reads', '_reads')  
 Select @Output = replace (@Output, 'Table ', '<Detail Table="')  
 Select @Output = replace (@Output, 'reads ', 'reads="')  
 Select @Output = replace (@Output, '. Scan count', '" Scancount="')  
 Select @Output = replace (@Output, '.', '" />')  
 Select @Output = replace (@Output, ',', '"')  
 Select @Output = replace (@Output, 'lob ', 'lob_')  
 Select @Output = replace (@Output, '-', '_')  
 Declare @OutputXML xml = '<Output> ' + @Output + ' </Output>'  
 SELECT   
   A.value('@Table[1]', 'VARCHAR(20)') as table_name  
   ,A.value('@Scancount[1]', 'int') as scan_count  
   ,A.value('@logical_reads[1]', 'int') as logical_reads  
   ,A.value('@physical_reads[1]', 'int') as physical_reads      
   ,A.value('@read_ahead_reads[1]', 'int') as read_ahead_reads  
   ,A.value('@lob_logical_reads[1]', 'int') as lob_logical_reads    
 into #tmpOutput    
 FROM @OutputXML.nodes('Output/Detail') AS A(A)  
 Select * from #tmpOutput  

Thursday, March 27, 2014

Domain data: three tips for more effective management


I ran into a situation recently that required lots of troubleshooting, inconvenience and painstaking research. I was merging and normalizing some data, and using a particular data set (let's call it the Employee table) that should have been standardized.  In other words, there should have been one set of data that everyone at the company used (ignoring for now issues like sensitive fields, such as Salary).

Instead, different groups in the company were using their own version of the Employee table - with different primary keys (when there were primary keys), categorized differently, some with the latest data, some with data only up to the end of last year. Overall, the differences between the data sets weren't huge - probably 15% or less - but when you're creating a dataset that ultimately needs to be joined with a dataset from another group, you want to be able to match up without too much pain. Otherwise the numbers won't match, nobody will trust your data, and reporting and cross-division communication can grind to a halt.

So, what can we do about this?   Here are some potential fixes to try:

  • Be very clear with upper management about the costs of multiple data sources for domain data that should have only one master. It's tricky because the costs can be hidden - cross-departmental work takes everyone much longer, and everyone has to research discrepancies in datasets that should match but don't.
  • Have a company-wide "Data" Q & A forum or email alias (a forum is preferable, because it's more easily searchable) where people can post data-related questions. Transparency is always good, and the more people know about the various domain data sources, the better decisions they can make. 
  • Get the group responsible for domain data management to be as responsive as possible to user requests. Usually people create their own datasets because they can't get the original owner of the data to respond to their needs.

The underlying problem, which I believe is related to the lack of choice and competition internally within companies, is a more difficult one. I'm hoping to write up another post on it soon.