Published Aug 16 2019 10:44 AM 51.1K Views

In SQL Server 2019 CTP 3.1 a new index option was added called OPTIMIZE_FOR_SEQUENTIAL_KEY that is intended to address throughput issues that may occur when a workload has a high number of concurrent inserts. These workloads commonly face an issue known as last page insert contention.

In SQL Server, indexes are stored as B-trees which are ordered by the index key. In a clustered index, the leaf level contains the data pages themselves, ordered by the index key. This means that when you are inserting rows into a table that has a clustered index where the key is a value that is always increasing, such as an identity column or a datetime column, all the new rows will be inserted on the last page of the B-tree. As concurrency increases, contention for this page in memory will increase, which limits scalability of such workloads. This problem can be identified by many threads waiting with a wait type of PAGELATCH_xx on the same page resource in a user database. In an OLTP system that is frequently inserting data, PAGELATCH waits will often be observed, particularly on a busy system.

There are many different methods to address this issue that have been suggested in the past, most of them involve making changes to either the application or the structure of the index. The following article outlines some of the most common solutions:

How to resolve last-page insert PAGELATCH_EX contention in SQL Server

When thinking about addressing this problem automatically in SQL Server, these options as well as others were considered, but most of them require some sort of performance trade-off and would involve major changes to the database engine. What we decided to do instead was to look closer at the contention itself and see if there was some way to improve scalability without making major changes to the way the data is physically stored.

While PAGELATCH contention does introduce additional latency in an application, it is something called a latch convoy that is truly detrimental to application scalability. This blog article gives a good description of convoys and why they happen. Traffic jams are a common analogy used to describe the problem. If you have a road that is at maximum capacity, as long as all the traffic continues to move at the same speed, throughput will be consistent. If you introduce a bottleneck on the road that forces all the cars to go through a single lane, this will cause traffic to back up behind the bottleneck. Traffic will continue to make progress at a slower rate, but if the rate of cars entering the highway remains constant, the traffic jam will get larger and larger unless the bottleneck is removed. In this situation, if something occurs that causes the drivers entering the single lane to hit the brakes, such as a slow driver or a hazard on the road, the rate of throughput drops severely and traffic really piles up. At this point, throughput won't recover until the rate of cars entering the road slows down dramatically, much lower than what the road would normally be able to handle.

With last page insert contention, as the number of insert threads increases, the queue for the page latch increases which in turn increases latency. Throughput will also decrease, but if something slows down one of the threads that is holding the latch, this can trigger a convoy and throughput suddenly falls off a cliff. This typically happens when a page fills up and a new page must be added to the index (also known as a page split). The insert that triggers the new page will naturally have to hold the latch for longer than normal while the new page operation completes. This causes the queue to build up behind the latch. Adding a new page also requires an exclusive latch on the parent page, which can cause latch requests to queue at that level as well. At this point, throughput falls off a cliff.

OPTIMIZE_FOR_SEQUENTIAL_KEY aims to do two things – control the rate at which new threads are allowed to request the latch, and favor threads that are likely to keep the throughput high. These techniques will not prevent the contention or reduce latency, but they will help keep throughput consistent as concurrency increases.


When using this option, you may not see a decrease in PAGELATCH waits, in fact you may even see an increase in waits with a new wait type called BTREE_INSERT_FLOW_CONTROL. Despite these waits, you should be able to achieve much better throughput, and scale to a much higher number of concurrent threads, without hitting the proverbial cliff. If you're not experiencing the convoy phenomenon in your workload, you may not see a huge benefit from this option, and you may even see a slight degradation in performance due to the new flow control waits. You should only use this option if you have a very heavily contentious workload – one where the number of threads inserting into the index is much higher than the number of schedulers – on a clustered index with a sequential key (note that non-clustered indexes can experience this problem as well, but because they have a smaller row size they don’t have as high a tendency to form convoys so they are less likely to benefit from this option). We are also working on providing some new reporting in Azure Data Studio and SQL Server Management Studio that will help detect this scenario and recommend the option where appropriate.


Hopefully that’s enough of an explanation for the average user, but I know many of you out there are thinking “Yeah but what are you DOING under the covers? How does this really work?” To dig a little deeper, as you may have guessed from the name of the new wait type, enabling OPTIMIZE_FOR_SEQUENTIAL_KEY turns on what we call flow control for the index in question. When OPTIMIZE_FOR_SEQUENTIAL_KEY is set to ON for an index, we introduce an upstream flow control mechanism for all the threads that request to enter the critical section of SQL Server code that is protected by a page latch. Unlike a semaphore which simply controls the traffic based on a fixed number of threads, this new mechanism controls the flow of traffic based on a set of conditions such as which CPU the thread is on, the state of the thread, and/or the rate at which the thread is completing work.


With OPTIMIZE_FOR_SEQUENTIAL_KEY, we will limit the number of threads allowed to request the latch to one per scheduler (i.e. CPU core), which should reduce time spent in the runnable queue once the latch is acquired. In addition to this, we will also favor threads that are more likely to make progress once they acquire the latch. This “unfair” mechanism gives preference to threads that are likely to complete their work in a single quantum over threads that will undergo several context switches while holding the latch. While this unfairness may result in some threads experiencing additional latency, overall throughput will be increased because less time will be spent waiting for other resources while holding the latch and blocking other threads.


Keep in mind that this flow control is not going to improve individual insert latency, in fact quite the opposite. Think of it like the traffic lights on a highway on-ramp – the cars coming onto the highway are going to wait longer before they are allowed to enter. It’s going to make them a bit slower than normal, but the result is the traffic keeps moving at a constant (albeit slower) rate instead of piling up. Our initial testing shows up to 30% improvement in throughput on some workloads, but of course your mileage may vary. The other solutions to last page insert contention outlined in the KB article referenced above are also still valid, and in many cases will yield an even better throughput improvement over OPTIMIZE_FOR_SEQUENTIAL_KEY. The best way to address the problem is to test the various options and choose the one that works best for your application and environment.


If you'd like to see an example of OPTIMIZE_FOR_SEQUENTIAL_KEY in action, head over to the SQL Server Samples repository on GitHub and download the sample demo.

Occasional Visitor
Seems like an interesting way to solve the issue. I’m just really curious why you even went to the effort to implement it? Firstly there are many known solutions, which often work much better (As you already mentioned in this article). Secondly, given the existence of Hekaton, I would have expected there to be a massively more elegant solution. Not that many people what this issue, in my experience. As such there doesn’t; seem to be a real urge to throw together a solution. Would it not be worth investing the time in something like a Hekaton merge table? I mean imagine a lock-less, latch-less insert system (Oh yeah it exists). Then every (Randomly pulled out of thin air) 512 inserts you merge those into the main table. Once that the merge completes you remove that merge table. In the mean time you create a new merge table. I mean the solution is really quite simple, exists in other database engines (Look at WiredTiger, which MongoDB bought - I really wish a better company had bought it). Merge tables aren’t new or exciting technology. They do however work and have done for a long time. Oh would it also be possible to stop half the data in non-leaf levels moving to new pages with page splits in an identity index please? I don’t see why all non-leaf level pages have to be 50% full until the next index rebuild. Especially because I shouldn’t need to ever rebuild an identity index. Given that would be most clustered indexes that I have that would make quite a difference to maintenance windows. Cool that you’re doing things to make SQL Server better. I would just prefer it to be a bit longer term and more solution oriented rather than sticking plaster oriented.
Occasional Visitor
Oh cool my comment lost all my formatting. I also seem unable to edit it... So a little correction. Where I wrote “Not many people what this issue”. I meant to write “Not many people have this issue”. Now I’ll see if more carriage returns allow me to format my text.
Occasional Visitor

@Rmwin - good suggestions.  For a sequential index yes the lowest level pages should not be 50% fill that's usually inefficient, except possibly for some odd configurations.  However I believe the B-tree will still have to be re-balanced occasionally even with an optimized insert that minimizes half full pages; that might be done very efficiently by just rearranging a few sub-trees.


@Rmwin Thanks for the feedback! 


Let me address your comment about page splits first. When a new page is added to the end of any level of an index (could be leaf or intermediate level), we do not move any rows around, we simply add a new empty page. The only time pages are left 50% full is when a new page is added in the middle of the level. You can see this in action by creating an empty table with a clustered index on a sequential key and inserting just enough rows to fill one page, then add one more row and look at the resulting pages. You'll see three pages: the first full page, the new empty page, and a new root page that points to the two other pages which now form the leaf level of the index. If you do this again with a clustered index on a non-sequential key such as a uniqueidentifier, after the page split happens you'll see that both the leaf pages contain 50% of the rows. I can post a demo script for this later if it would be helpful.


As I mentioned, we did discuss (at length, I assure you :)) many other options for solving this problem. While a Hekaton merge table seems like a straightforward solution, there are actually a lot of complexities to manage such as different transactional semantics, different indexing etc. that make this challenging to do automatically. This is something you can actually implement yourself, and we do have some customers that use a similar solution (it's often called the shock absorber). 


While you may not have encountered this problem frequently, we do see many customers facing this issue, particularly on servers with a high core count (e.g. 128 cores or more). The hope is that this new flow control mechanism may also be used to improve throughput in other areas of the engine that have a tendency to form convoys under contention.


@jahummer B-trees are self-balancing due to the way they are built from the bottom up (i.e. leaf first, then push up a new level when needed), so they never need to be "re-balanced" per se. Even a sequential index may need to be reorganized, however, if there are a lot of in place updates that cause rows to be moved from one page to another.

Occasional Contributor
Let's hope those customers with many cores are running software from vendors that are advanced enough to implement something like this. I've had discussions with vendors to try to convince them that using cascading updates is a better way to cascade foreign key changes than "instead of" triggers. And I've seen vendors not use foreign keys at all, instead opting to provide a routine for "checking links". I just had a vendor explain to me that snapshot isolation is not necessary because they always set the isolation mode to read uncommitted for all of their reports. There are so many great SQL Server features that go unused in the typical application.
Occasional Visitor

I really don't understand how the first method in the kb4460004 article is supposed to resolve the issue. You just move the contention from the clustered index to the nonclustered one whose key is still ordered by the monotonically increasing IDENTITY value. So all new inserts will be hitting the same page, just of a different index. Hmm...


@nikb33 You're correct that it's possible you will shift some contention to the non-clustered index, but as long as the non-clustered index is narrow (i.e. only one key column and few or no included columns), the contention on that index shouldn't be as bad. Since the clustered index contains the data rows themselves, there are a lot more page splits on these pages as you insert data. It's the page splits that tend to cause the convoy to form, which is what makes this problem escalate quickly.

Occasional Visitor

@Pam Lahoud 


Please try the below and you will see that your reply to me is incorrect. SQL Server has never fully populated the intermediate pages in a B-Tree index even with an identity. You have to rebuild the index for this to happen.



** Create a test DB and then a test table (I'll leave craeting the DB to you


create table dbo.test (

       i int identity(1, 1) primary key clustered

  , filler char(500) default 'a'




** Populate the table


insert into dbo.test default values;

go 1000


insert into dbo.test (filler)

select filler

from dbo.test;

go 10



** Grab the root page


select concat(N'dbcc page(', db_name(), N', ', rp.[file_id], N', ', rp.page_id, N', 3) with tableresults;')

from sys.partitions as p

       inner join sys.system_internals_allocation_units as au

              on p.[partition_id] = au.container_id

       cross apply sys.fn_PhysLocCracker(au.root_page) as rp

where p.index_id = 1

  and p.[object_id] = object_id(N'dbo.test', 'U');



** From here go to any page in the middle of the level below



Also Awesome, the comment preview doesn't work for me so I don't know if this will finally accept the formatting this time around. I hope it does, if not then it would be great if this supported the latest Safari browser.

Occasional Visitor

@Pam Lahoud Oops I missed this bit: -


While you may not have encountered this problem frequently, we do see many customers facing this issue, particularly on servers with a high core count (e.g. 128 cores or more). The hope is that this new flow control mechanism may also be used to improve throughput in other areas of the engine that have a tendency to form convoys under contention.


I assure you I have seen this problem occur in production. We have always resolved the issue at the companies that I was at during that time. I have seen the issue in 2008 R2, 2012 and 2014 in various production environments. There are many solutions and I honestly still cannot see why Microsoft would spend time engineering a solution to a trivial problem. I mean you could simply use a UUID and the random nature of where the insert takes place would resolve the issue. Something like 128 partitions on a table where you use a modulus of the identity or sequence would also work really well (It would be even better if we had true partition level stats, I don't recall seeing that in 2019 yet, maybe I'm wrong here? I don't have 2019 in production so my caring is limited right now).


If you have customers that cannot architect a database correctly for performance then maybe it would be a good idea to have a proper high end qualification, like the MCM once was, again? There is nothing hard about scaling almost any workload. Just ignorance gets in the way. If Microsoft are planning on coding a universal solution to customer ignorance then they'll get frustrated after trying for a very long time indeed.


There is no need to code a shock absorber or anything like it. The tools are already available and have been since the introduction of a UUID (So a long time ago) to deal with this in almost any situation. For the rare cases where this isn't enough, there are other solutions available already. I wonder if you can scale the inserts with this new "Solution" to being superior to simply using a UUID? I mean I was able to drive over a million inserts a second on our test server back on SQL 2008 R2. If you then partition the table as well (Several solutions to this) you ended up pinning the log long before you hit latch contention on the table. Now that you can put the tail of the log on persistent memory this would most likely be resolved. At which point I have also been able to pin the log cache access spin lock on serveral occasions. Yes I've seen that in production as well. Also I've seen the log flush spin lock pinned in production.


Beyond any of that you would want to use in memory tables or avoid an RDBMS and write to a caching layer first and persist on a regular basis to batch your inserts. This would resolve the issue anyway. At that point, however, you might as well just simply move to NoSQL and the problem is resolved (Well as long as you have people who can architect a solution and the likes).


There are a load of problems in SQL Server and useful ways to expand it and yet, for some reason, Microsoft love to code fixes to non-issues. I don't get why you waste engineering effort there. Heaps have been around for ages and have never gotten any love from Microsoft despite them having some very useful properties. I mean how long did it take before you could rebuild a heap? Wasn't that in 2008 R2? So a long time. Then Microsoft managed to do a rebuild of a heap in such a way that with each rebuild not only does the heap retain all its previous IAM pages but it gets new ones. So in time you end up with an IAM chain in the thousands, even for a very small heap. If you don't access you heap for a while then the next time you try SQL Server will load the entire IAM chain into the cache before accessing a single record (Even with a select top (1) * from .....) which can take 20 seconds or even lead to a timeout just to load the IAM pages - stunning. Of course the solution there is to use a clustered index. A bit like preventing edge insert page latch contention is resolved with random inserts. However one gets attention from Microsoft and the other is ignored (And for a lot longer).


You also have an issue with the Query Store, I haven't checked to see if it's resolved with 2019 yet (Like I wrote above, we don't have it in production yet) where if you have a lot of dynamic SQL (Written long before I got to the company I am currently at) then you end up with the in memory hash table growing out of control. You cannot then clear the entries in this hash table with anything other than a restart of SQL Server - not a good idea in a production environment. Let's not even bother to look at the cluster f*** that is XML in SQL Server. However rather than sort that we got JSON (In a relational DB, why oh why oh why?). Why not just provide simple wrappers to decompose and recompose these so that they can be stored in normal tables. This would allow simple indexing (Awesome) by any DBA, developer or even the existing tools without any need to re-write anything. Oh yeah, that's basically how XML indexes, just you cannot add your own indexes because that would be stupid.


Other simple examples: -


  • How long did it take to get trim()?
  • How long did it take to get split_string()?
  • How long did it take to get a half-ways decent version of concat so you didn't want to use an XML and Stuff hack to pivot rows into a single column?


This is what I mean with wasted engineering effort. Please don't just cater for people who don't design solutions, cater for the people who would actually like to use SQL Server as an RDBMS. Make it easy to for what it is not a one size fits all jack of all trades that slowly becomes irrelevant in most every possible scenario where you might want to use it. You have other products that can do that better in your Azure offerings.

Occasional Visitor

@Pam Lahoud, thanks for an insightful article on yet another option for dealing with a phenomenon I've seen countless times over the years.

Frequent Visitor

if promised throughput improvement is just 20-30%...
I'm wondering if we still need to use or  combine OPTIMIZE_FOR_SEQUENTIAL_KEY feature with different solutions.
Most probably hash partitioning (Method #5)
or sequence + bit reversal + clr function (not T-SQL, I will be slower)



@-Michael_Karpenko- that is true, if you can re-architect the database using something like memory-optimized tables or hash partitioning, you will likely get better performance. This solution is designed for customers who cannot re-architect for some reason. It's not a huge improvement in throughput, but it is an improvement in stability of the workload, which can avoid other downstream issues in addition to improving throughput.

Occasional Visitor

Is there any option to avoid latches on Datetime column with nonclustered index that has default getdate()?. My table is already partitioned by that column so partition solutions will not work for me.


Version history
Last update:
‎Aug 20 2019 01:13 PM
Updated by: