Thursday, June 8, 2023

Clustered indexes—how to make the right decisions

Oftentimes, with database systems, nobody is specifically in charge of the database. It's just seen as one of the many tools that's being used by the front end code, and as long as it functions, it's fine. 

But sometimes when the system has been running and growing for a few years, problems become evident. Deadlocks. Slow query plans. Timeouts. By default, the primary key is made into the clustered index, unless it's specifically created as a nonclustered index. This can make sense, because you really want a clustered index on a table, and the primary key is sometimes a good choice. But definitely not always! 

I've run into some situations like this. The clustered index is created on a field that is probably rarely used in searches. But how to be sure? In order to more easily research these situations, and figure out which field might be a better potential candidate for clustereding, I wrote the following helpful query. 

It queries the system tables to see which objects are referencing the target table, and then returns the objects (usually stored procedures, but sometimes functions) in order of decreasing CPU usage. This way, if you have hundreds of objects that use the table and don't have time to check each one, you can make sure that you at least check the most resource intensive objects. 

You need to have Query Store turned on (everybody should, in any case, if you need performance data), because it uses the table sys.query_store_runtime_stats in order to show the results in order of CPU usage. 

It returns a handy column that you can click on to easily see the object definition. Thus, you can see the SQL behind the object, and double check whether the clustered index change you're thinking of is useful. If the current field that has the clustered index is rarely used in filtering, and the new field is used heavily, then that's a sign you should switch.

 ;with Main as (  
      -- Get data on everything that's using this table and/or field  
   SELECT   
     ObjectName = Object_Name(Query.object_id)  
     ,TotalExecutions = sum((RuntimeStats.count_executions))  
     ,TotalCPU = sum((RuntimeStats.count_executions * RuntimeStats.avg_cpu_time))  
     ,TotalLogicalReads = sum((RuntimeStats.count_executions * RuntimeStats.avg_logical_io_reads))  
     ,TotalDuration = sum((RuntimeStats.count_executions * RuntimeStats.avg_duration))  
     ,QueryStorePlan.plan_id  
     ,Query.query_id  
   FROM sys.query_store_runtime_stats RuntimeStats  
   JOIN sys.query_store_plan QueryStorePlan ON QueryStorePlan.plan_id = RuntimeStats.plan_id  
   JOIN sys.query_store_query Query ON Query.query_id = QueryStorePlan.query_id  
   JOIN sys.query_store_query_text QueryText ON Query.query_text_id = QueryText.query_text_id  
   WHERE   
     1=1  
     and first_execution_time >= dateadd(d, -2, getdate())  
     and QueryText.query_sql_text like '%' + @TableName + '%'  
           and QueryText.query_sql_text like '%' + @FieldName + '%'  
   GROUP BY Object_Name(Query.object_id), Query.query_id, QueryStorePlan.Plan_id  
 )  
 -- Now return the data, along with resource usage stats and some handy columns you can click on to view the SQL,  
 -- and the query plan  
 Select   
      ObjectName   
      ,TotalExecutions   
      ,TotalCPU   
      ,TotalLogicalReads  
      ,TotalDuration   
   ,SQLTextXML = CAST((SELECT query_sql_text AS [processing-instruction(x)] FOR XML PATH('')) AS XML)    
      ,QueryPlanXML = CAST(query_plan as XML)   
   ,QueryID = Query.query_id  
   ,PlanID = QueryStorePlan.plan_id  
 from Main  
   JOIN sys.query_store_plan QueryStorePlan ON QueryStorePlan.plan_id = Main.plan_id  
   JOIN sys.query_store_query Query ON Query.query_id = QueryStorePlan.query_id  
   JOIN sys.query_store_query_text QueryText ON Query.query_text_id = QueryText.query_text_id  
 Order by TotalCPU desc