Tuesday, July 26, 2022

The magic of Query Store for tracking application timeouts

If I were writing a click-baity title for an article on this, it would be "One Crazy Trick To Find Application Timeouts That Will Blow Your Mind!!!" It really is that useful. 

The bottom line is, Query Store is useful for far more than you might think, and specifically researching application timeouts.  

A system I've been working on lately has had many application timeouts, cause by the default setting (timeout after 30 seconds). The tool we were using for error tracking had some flaws, and it was very difficult to actually come up with the stored procedure name, that had the timeout. 

I had set up the Extended Event session for timeouts (on sqlserver.attention). However, potentially due to certain issues with the front end setup, the sqlserver.sql_text value was almost always blank when there was a timeout. I didn't investigate in depth, and moved on to other tasks.

But once I started spending more time on Query Store and using it more, I discovered the execution_type_desc field in query_store_runtime_stats. What does it contain?

Well, 99.9% of the time it has this string "regular". But occasionally, it contains this one "aborted". And these are your timeouts! It's so easy to find timeouts with this method that I'm surprised it's not more well known.

By using the execution_type_desc field in query_store_runtime_stats, I was able to find clusters of linked timeouts, and then dig deeper (also using query_store_runtime_stats, but that's another story) to find the root cause. 

Here's the query I frequently use, commenting out lines when necessary, to get the details I need. This will show 1 row for each QueryID and PlanID. If it's a stored procedure with many separate statements, you may want to filter on one particular QueryID or SQL text.


 SELECT   
   ObjectName = Object_Name(q.object_id)  
   ,RunTimeFirst = convert(varchar(16), rs.first_execution_time,121)  
   ,RunTimeLast = convert(varchar(16), rs.first_execution_time,121)  
   ,TotalExecutions = rs.count_executions  
   ,AverageCPU = round(rs.avg_cpu_time , 0)  
   ,AverageLogicalReads = round(rs.avg_logical_io_reads ,0)  
   ,AverageDuration = round(rs.avg_duration ,0)  
   ,q.query_id   
   ,p.plan_id   
   ,p.Is_Forced_Plan  
   ,rs.runtime_stats_id  
   ,execution_type_desc  
   ,qt.query_sql_text  
   -- Comment out unless necessary...takes a long time.   
   -- ,ExecutionPlan = CAST(p.query_plan as XML)  
 FROM sys.query_store_runtime_stats rs  
 JOIN sys.query_store_plan p   
   ON p.plan_id = rs.plan_id  
 JOIN sys.query_store_query q   
   ON q.query_id = p.query_id  
 JOIN sys.query_store_query_text qt   
   ON q.query_text_id = qt.query_text_id  
 WHERE   
   1=1 -- make it easier to comment/uncomment
   -- just show last 2 days by default  
   and first_execution_time >= dateadd(d, -2, getdate())  
   and Object_Name(q.object_id) = 'API_GetUserLearningPlanObjects'  
   and execution_type_desc = 'Aborted'  
   -- and q.query_id in (4013401)  
   -- and qt.query_sql_text like '%c.CourseID as ID%'  
 Order by RunTimeFirst desc