Monday, April 18, 2016

What is the proper key for the clustered index?

In data modeling a table one of the more difficult questions is: should my primary key (PK) have a clustered or non-clustered index? One of the best answers I've seen for this question has been posted on StackExchange. The answer starts by rephrasing the question:

What is the proper key for the clustered index?  

Here is cited copy of response. With a link below to the full listing.

The question is not 'when should the PK be NC', but instead you should ask 'what is the proper key for the clustered index'?  And the answer really depends on how do you query the data. 

The clustered index has an advantage over all other indexes: since it always includes all columns, is always covering. Therefore queries that can leverage the clustered index certainly do not need to use lookups to satisfy some of the projected columns and/or predicates. 

Another piece of the puzzle is how can an index be used? There are three typical patterns: 

  • probes, when a single key value is seek-ed in the index 
  • range scans, when a range of key values is retrieved 
  • order by requirements, when an index can satisfy an order by w/o requiring a stop-and-go sort 

So if you analyze your expected load (the queries) and discover that a large number of queries would use a particular index because they use a certain pattern of access that benefits from an index, it makes sense to propose that index as the clustered index. 

Yet another factor is that the clustered index key is the lookup key used by all non-clustered indices and therefore a wide clustered index key creates a ripple effect and widens all the non-clustered indices and wide indices mean more pages, more I/O, more memory, less goodness. 

A good clustered index is stable, it does not change during the lifetime of the entity, because a change in the clustered index key values means the row has to be deleted and inserted back. 

And a good clustered index grows in order not randomly (each newly inserted key value is larger than the preceding value) as to avoid page splits and fragmentation (without messing around with FILLFACTORs). 

So now that we know what a good clustered index key is, does the primary key (which is a data modelling logical property) match the requirements? If yes, then the PK should be clustered. If no, then the PK should be non-clustered. 

To give an example, consider a sales facts table. Each entry has an ID that is the primary key. But the vast majority of queries ask for data between a date and another date, therefore the best clustered index key would be the sales date, not the ID. Another example of having a different clustered index from the primary key is a very low selectivity key, like a 'category', or a 'state', a key with only very few distinct values. Having a clustered index key with this low selectivity key as the leftmost key, e.g. (state, id), often makes sense because of ranges scans that look for all entries in a particular 'state'. 

One last note about the possibility of a non-clustered primary key over a heap (i.e. there is no clustered index at all). This may be a valid scenario, the typical reason is when bulk insert performance is critical, since heaps have significantly better bulk insert throughput when compared to clustered indices.

this answer edited Dec 1 '12 at 4:31 Paul White
answered Nov 10 '11 at 22:39 Remus Rusanu 
originally posted at:

You can read the full listing on Stackexchange
here: When should a primary key be declared non-clustered?

No comments: