When the Phone Rings

A production database is slow. Users are complaining. Every minute costs money. Here is a structured approach.

Immediate Triage (First 5 Minutes)

1. Check if It Is Actually SQL Server

Before diving into query plans, rule out:

  • Network latency between app server and database
  • Application server CPU/memory exhaustion
  • Connection pool exhaustion
  • DNS resolution issues

2. Check Running Queries

SELECT
    r.session_id,
    r.blocking_session_id,
    r.cpu_time,
    r.total_elapsed_time,
    r.reads,
    r.writes,
    r.logical_reads,
    t.text AS query_text,
    r.status
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id > 50;

Look for:

  • Queries with high elapsed time but low CPU (waiting on locks or I/O)
  • Blocking chains (blocking_session_id > 0)
  • Suspended queries (status = 'suspended')

3. Check Waits

SELECT TOP 10
    wait_type,
    wait_time_ms / 1000 AS wait_time_sec,
    waiting_tasks_count,
    max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'
  AND wait_type NOT IN ('CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;

The most common performance-related wait types:

  • PAGEIOLATCH_*: Disk I/O bottleneck — check storage performance
  • LCK_M_*: Locking — review blocking chains
  • WRITELOG: Transaction log bottleneck — check log disk performance
  • CXPACKET: Parallelism — review MAXDOP configuration
  • ASYNC_NETWORK_IO: Application is consuming results slowly

4. Check Current Locks

SELECT
    request_session_id,
    resource_type,
    resource_description,
    request_mode,
    request_status
FROM sys.dm_tran_locks
WHERE request_session_id > 50
  AND resource_database_id = DB_ID();

5. Check TempDB Contention

TempDB is a shared resource. Contention here affects every query.

SELECT
    session_id,
    wait_type,
    wait_resource
FROM sys.dm_exec_requests
WHERE wait_resource LIKE '2:%' -- TempDB ID is 2
  AND wait_type = 'PAGELATCH_%';

Second-Tier Investigation (Next 15 Minutes)

6. Identify Expensive Queries by I/O

SELECT TOP 10
    total_logical_reads / execution_count AS avg_logical_reads,
    total_physical_reads / execution_count AS avg_physical_reads,
    execution_count,
    total_worker_time / execution_count / 1000 AS avg_cpu_ms,
    SUBSTRING(text, (statement_start_offset/2)+1,
        ((CASE statement_end_offset WHEN -1 THEN DATALENGTH(text)
        ELSE statement_end_offset END - statement_start_offset)/2) + 1) AS query_text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY total_logical_reads DESC;

7. Check Missing Indexes

SELECT
    migs.avg_user_impact,
    migs.user_seeks + migs.user_scans AS total_accesses,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.avg_user_impact DESC;

Be cautious. Missing index recommendations are useful hints, not commands. Adding indexes can hurt write performance.

8. Review SQL Server Error Log

EXEC sp_readerrorlog 0, 1, 'error';
EXEC sp_readerrorlog 0, 1, 'deadlock';
EXEC sp_readerrorlog 0, 1, 'I/O';

Root Cause Checklist (After the Fire)

  1. Was there a deployment? Check deployment timestamps.
  2. Was there a data change? Sudden data volume changes can invalidate execution plans.
  3. Was there a statistics update? Auto-update statistics can change plans unexpectedly.
  4. Was there a configuration change? MAXDOP, cost threshold for parallelism, memory settings.
  5. Did a maintenance job run? Index rebuilds, DBCC CHECKDB, backups.

Prevention

  • Set up Query Store and baseline normal performance
  • Configure alerts for blocking > 30 seconds
  • Monitor disk latency (avg. disk sec/read and avg. disk sec/write)
  • Keep statistics updated manually during maintenance windows
  • Document the normal state so you can recognize abnormal faster

One Final Tip

When you cannot find the problem, check the application code. Many "database performance" issues are actually N+1 query problems, missing connection pooling, or transactions held open by application logic.