Synthetic Accelerations in a Nutshell – Windows Server 2016

Published May 08 2019 06:00 AM 23.6K Views

Hi folks,


Dan Cuomo back for our next installment in this blog series on synthetic accelerations.  Windows Server 2016 marked an inflection point in the synthetic acceleration world on Windows, so in this article we’ll talk about the architectural changes, new capabilities, and changes in configuration requirements compared to the last couple operating systems.


Before we begin, here are the pointers to the previous blogs:

Keep in mind that due to changes in Windows Server 2016, many of the details in Windows Server 2012/R2 become irrelevant while some become even more important.  For example, vRSS becomes more important while the benefits of Dynamic VMQ, as originally implemented in Windows Server 2012 R2, are surpassed in this release. We’ll wrap up this series in a future post that covers Windows Server 2019 (and the new and improved Dynamic VMMQ)!


But before we get to the grand finale in Windows Server 2019, let’s talk about the groundwork that was laid in Windows Server 2016 and the big advances it brought in synthetic (through the virtual switch) network performance.  As a quick review, this is the synthetic datapath; all ingress traffic must traverse the virtual switch (vSwitch) in the parent partition prior to being received by the guest:




NIC Architecture in Windows Server 2016

Windows Server 2016 brought a new architecture in the NIC that affects the implementation of VMQ.  If you remember the article on Windows Server 2012, you may remember that the NIC creates a filter based on each vmNIC’s MAC and VLAN combination on the Hyper-V vSwitch.  Ergo, every MAC and VLAN combination registered on the Hyper-V vSwitch would be passed to the NIC to request a VMQ be mapped.


Note: Well not exactly every combination.  VMQs must be requested using the virtual machine properties so only VMs that have the appropriate properties would be passed to the NIC for allocation of a queue.  The rest, as you may recall, land in the default queue.

Now however, some added intelligence in the form of an embedded Ethernet switch (NIC Switch) was built into the physical network adapters.  When Windows detects that a NIC has a NIC Switch, it asks the NIC to map a NIC Switch port (vPort) to a queue (instead of assigning a queue to the MAC + VLAN filter mentioned previously).  Here’s a look at the Legacy VMQ (MAC + VLAN filtered) architecture:




Here’s the new architecture.  If you look at the NIC at the bottom of the picture, you can see a queue is now mapped to a vPort and that vPort maps to a processor.




In previous operating systems, the NIC Switch was used only when SR-IOV was enabled on the virtual switch.  However, in Windows Server 2016 we split the use of the NIC Switch away from SR-IOV and now leverage it to enable some great advantages.


vRSS without the offload

If you’re running Windows Server 2016, you’re likely NOT in this exact situation.  However, this section is important to understand the great leap in performance you’ll see in the next section (so don’t skip this section 😊).


Virtual Receive Side Scaling (vRSS) was first introduced in Windows Server 2012 R2.  As a review, vRSS has two primary responsibilities on the host:

  • Creating the mapping of VMQs to processors (known as the indirection table)
  • Packet distribution onto processors

For packet distribution with Legacy VMQ in Windows Server 2012 R2, vRSS performs RSS spreading onto additional processors, surpassing the throughput that a single VMQ and CPU could deliver.




While this enabled much improved throughput over Windows Server 2012, the packet distribution (RSS spreading) was performed in software and incurred CPU processing (this is covered in the 2012 R2 article so I won’t spend much time on this here).  This capped the potential throughput to a virtual NIC to about 15 Gbps in Windows Server 2012 R2.


Note: If you simply compare vRSS and VMQ on Windows Server 2012 R2 to Windows Server 2016, you’ll notice that your throughput with the same accelerations enabled naturally improves.  This is for a couple of reasons but predominantly boils down to the improvements in drivers and operating system efficiency.  This is one benefit of upgrading to the latest operating system and installing the latest drivers/firmware.


(I said don’t skip the last section!  If you did, go back and read that first!)



The big advantage of the NIC Switch is the ability to offload the packet distribution function performed by vRSS to the NIC.  This capability is known as Virtual Machine Multi Queues (VMMQ) and allows us to assign multiple VMQs to the same vPort in the NIC Switch, which you can see in the picture below.




Now instead of a messy software solution that incurs overhead on the CPU and processes the packets in multiple places, the NIC simply places the packet directly onto the intended processor.  RSS hashing is used to spread the incoming traffic between the queues assigned to the vPort.


In the last article, we discussed how VMQ and RSS had a budding friendship (see the “Orange Mocha Frappuccinos” reference ;)).  Now you should see just how integral RSS is to VMQ’s vitality; They’re like a frog and a bear singing America.  They’re movin' right along; they're truly birds of a feather; they’re in this together.




Now the packets only need to be processed once before heading towards their destination – the virtual NIC.  The result is a much more performant host and increased throughput to the virtual NIC.  With enough available processing power, you can exceed 50 Gbps.

Management of a Virtual NIC’s vRSS and VMMQ Properties

In Windows Server 2016, VMMQ was disabled by default.  It was after all, a brand-new feature and after the “Great Windows Server 2012 VMQ” debacle, we thought it best to disable new offloads by default until the NIC and OS had an opportunity to prove they could play nicely together.

Verify vRSS is Enabled for a vNIC

This should already be enabled (remember this is the OS component that directs the hardware offload, VMMQ) but you can see the settings by running this command for a vmNIC:
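The original screenshot is not reproduced here, so the following is only a sketch of what such a query could look like. VM01 is a placeholder VM name, and the wildcard simply selects every property whose name begins with Vrss.

```powershell
# List the vRSS-related properties of each vmNIC attached to a VM.
# 'VM01' is a placeholder; substitute the name of your own VM.
Get-VMNetworkAdapter -VMName 'VM01' |
    Format-List -Property Name, Vrss*
```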




And now a host vNIC – Oh yea! We enabled vRSS for host vNICs in this release as well!  This was not previously available on 2012 R2.
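A comparable sketch for host vNICs, assuming you want to inspect every virtual NIC in the management OS, swaps the VM name for the -ManagementOS switch:

```powershell
# List the vRSS-related properties of every host (management OS) vNIC.
Get-VMNetworkAdapter -ManagementOS |
    Format-List -Property Name, Vrss*
```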




Note: If this was accidentally disabled and you’d like to reenable it, please see Set-VMNetworkAdapter.

To Enable VMMQ

First, check to see if VMMQ is enabled.  As previously mentioned, it is not enabled by default on Server 2016.
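One way to run that check, shown here as a sketch with VM01 as a placeholder VM name, is to list the properties that begin with Vmmq:

```powershell
# Show the VMMQ state and queue-pair counts for each vmNIC of a VM.
# 'VM01' is a placeholder VM name.
Get-VMNetworkAdapter -VMName 'VM01' |
    Format-List -Property Name, Vmmq*
```

On a default Windows Server 2016 configuration, VmmqEnabled would be expected to report False at this point.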




Now let’s check that the prerequisite hardware features are enabled.  To do this, use Get-NetAdapterAdvancedProperty to query the device (or use the device manager properties for the adapter) for the following three properties:


  • Virtual Switch RSS
  • Virtual Machine Queues
  • Receive Side Scaling


If any of the above settings are disabled (0) make sure to enable them with Set-NetAdapterAdvancedProperty.
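As a sketch of that check and fix: the adapter name pNIC01 is a placeholder, and the keyword-to-display-name mapping in the comments (in particular *RssOnHostVPorts for Virtual Switch RSS) reflects my understanding of the standardized keywords, so confirm it against the DisplayName column on your own adapter.

```powershell
# 'pNIC01' is a placeholder for the physical adapter bound to the vSwitch.
$nic = 'pNIC01'

# Query the three prerequisite features:
#   *RssOnHostVPorts -> Virtual Switch RSS (assumed mapping)
#   *VMQ             -> Virtual Machine Queues
#   *RSS             -> Receive Side Scaling
Get-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword '*RssOnHostVPorts', '*VMQ', '*RSS' |
    Format-Table -Property Name, DisplayName, RegistryKeyword, RegistryValue

# Enable a feature that reported 0 (repeat for each keyword that is disabled).
Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword '*RssOnHostVPorts' -RegistryValue 1
```

The leading asterisk may also be treated as a wildcard by the query, so it can return extra keywords that share the same suffix; the RegistryKeyword column shows which rows are the ones of interest.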


Now let’s enable VMMQ on the vNIC by running Set-VMNetworkAdapter:
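A minimal sketch of that command, with VM01 as a placeholder VM name:

```powershell
# Enable VMMQ on every vmNIC of the VM. 'VM01' is a placeholder VM name.
Set-VMNetworkAdapter -VMName 'VM01' -VmmqEnabled $true

# For a host vNIC, target the management OS instead, for example:
# Set-VMNetworkAdapter -ManagementOS -VmmqEnabled $true
```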


Finally, verify that VMMQ is enabled:
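The verification can be sketched as the same property query used earlier (VM01 is again a placeholder):

```powershell
# Confirm VMMQ is on and compare requested vs. granted queue pairs.
Get-VMNetworkAdapter -VMName 'VM01' |
    Format-List -Property Name, Vmmq*
```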



One last thing…In the picture above you can see the number of VMMQ Queue Pairs (see the 2012 article on NIC Architecture – a queue is technically a queue pair) assigned for this virtual NIC ☹.  It shows the request was for 16, so why did only 8 get allocated?


First, understand the perspective of each property’s output.  VMMQQueuePairsRequested is what vRSS requested on your behalf – in this case 16.  VMMQQueuePairs is the actual number granted by the hardware.


As you know from our previous Get-NetAdapterAdvancedProperty cmdlet above, each network adapter has defined defaults that govern its properties.  The *MaxRssProcessors property (intuitively) defines the maximum number of RSS processors that can be assigned for any adapter, virtual NICs included.




Lastly, the *NumRSSQueues property defines how many queues can be assigned.




We can remedy this by changing these properties using Set-NetAdapterAdvancedProperty or device manager.
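As a sketch of that change, assuming a placeholder adapter named pNIC01 and a target of 16 to match the 16 queue pairs requested in this example:

```powershell
# 'pNIC01' is a placeholder for the physical adapter behind the vSwitch.
$nic = 'pNIC01'

# Inspect the current limits.
Get-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword '*MaxRssProcessors', '*NumRSSQueues' |
    Format-Table -Property Name, DisplayName, RegistryKeyword, RegistryValue

# Raise both limits so the hardware can grant all 16 requested queue pairs.
Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword '*MaxRssProcessors' -RegistryValue 16
Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword '*NumRSSQueues' -RegistryValue 16
```

Changing an advanced property typically restarts the adapter, so expect a brief interruption; the values an adapter accepts depend on the vendor and driver.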









Now check that the virtual NIC has 16 queue pairs:




Note: Ever wonder why there is an asterisk (*) in front of the Get-NetAdapterAdvancedProperty properties?


These are known as well-known advanced registry keywords, which are a standardized software contract between the operating system and the network adapter.  Any keyword listed here without an asterisk in front of it is defined by the vendor and may be different (or not exist) depending on the adapter and driver you use.

By lowering the number of VMMQQueuePairsRequested you have an easy mechanism to manage the available throughput into a VM.  You should assign 1 Queue Pair for every 4 Gbps you’d like a VM to have.
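To illustrate the 1-queue-pair-per-4-Gbps guideline, here is a small sketch. The 20 Gbps target and the VM name VM01 are made-up values for illustration only.

```powershell
# Hypothetical target: roughly 20 Gbps into the VM.
$targetGbps = 20

# Guideline from the text: 1 queue pair per 4 Gbps, rounded up.
# 20 Gbps / 4 Gbps per pair = 5 queue pairs.
$queuePairs = [int][math]::Ceiling([double]$targetGbps / 4)

# Request that many queue pairs for the VM's vmNICs ('VM01' is a placeholder).
Set-VMNetworkAdapter -VMName 'VM01' -VmmqQueuePairs $queuePairs
```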




If you choose to do this, keep in mind two things.  First, VMMQ is not a true QoS mechanism.  Your mileage may vary as the actual throughput will depend on the system and available resources.


Second, VMMQ scales considerably better than VMQ alone in large part because of the improvements to the default queue outlined in the next section so you may not need to manage the allocation of queues quite as stringently as you have in the past.


Management of default queues

In the last section, we enabled VMMQ for a specific virtual NIC.  However, you may want to enable VMMQ for the default queue.  As a general best practice, we recommend that you enable VMMQ for the default queue.


As a reminder, the default queue is the queue that can apply to multiple devices.  More specifically, a device that doesn’t get its own VMMQ queues uses the default queue.  In the past, all VMs that were unable to receive their own VMQ shared the default queue.  Now, they share multiple queues (e.g. VMMQ).



You can see above the same setup is required for the DefaultQueueVMMQPairs.  The only difference is that, instead of setting the configuration on the virtual NIC using Set-VMNetworkAdapter, you set the configuration on the virtual switch like this:
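A sketch of the virtual switch configuration follows. The switch name SETswitch is a placeholder, and 16 queue pairs is simply the value used in this article's example.

```powershell
# Enable VMMQ for the default queue and request 16 queue pairs.
# 'SETswitch' is a placeholder vSwitch name.
Set-VMSwitch -Name 'SETswitch' -DefaultQueueVmmqEnabled $true -DefaultQueueVmmqQueuePairs 16

# Review the resulting default-queue settings.
Get-VMSwitch -Name 'SETswitch' |
    Format-List -Property Name, DefaultQueue*
```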





Now any virtual machine that is unable to receive its own queues will benefit from having 16 (or whatever you configure) available to share the workload.


Processor Arrays

In Windows Server 2016, you are no longer required to set the processor arrays with Set-NetAdapterVMQ or Set-NetAdapterRSS.  I’ve been asked if you still can configure these settings if you have a desire to, and the answer is yes.  However, the scenarios when this is useful are few and far between.  For general use, this is no longer a requirement.


As you can see in the picture below, the RSSProcessorArray includes processors by default and they are ordered by NUMA Distance.
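If you want to look at the processor array on your own system, a query along these lines should show it (pNIC01 is a placeholder adapter name):

```powershell
# Display the RSS settings for an adapter, including RssProcessorArray.
# 'pNIC01' is a placeholder adapter name.
Get-NetAdapterRss -Name 'pNIC01'
```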





You may have seen our minor Twitter campaign about eliminating LBFO from your environment…




There are numerous reasons for this recommendation, however it boils down to this:

LBFO is our older teaming technology that will not see future investment, is not compatible with numerous advanced capabilities, and has been exceeded in both performance and stability by our new technologies (SET).


Note: If you're new to Switch Embedded Teaming (SET) you can review this guide for an overview

We’ll have a blog that will unpack that statement quite a bit more but let’s talk about it in terms of synthetic accelerations.  LBFO doesn’t support offloads like VMMQ.  VMMQ is an advanced capability that lowers the host CPU consumption and enables far better network throughput to virtual machines.  In other words, your users (or customers) will be happier with VMMQ in that they can generally get the throughput they want, when they want it, so long as you have the processing power and network hardware to meet their demands.  If you want to use VMMQ and you’re looking for a teaming technology, you must use Switch Embedded Teaming (SET).


To do this, simply add -EnableEmbeddedTeaming $true to your New-VMSwitch cmdlet.
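For example, a SET-based virtual switch could be created as sketched below. The switch name and the two adapter names are placeholders.

```powershell
# Create a vSwitch that teams two physical adapters with Switch Embedded Teaming.
# 'SETswitch', 'pNIC01' and 'pNIC02' are placeholder names.
New-VMSwitch -Name 'SETswitch' -NetAdapterName 'pNIC01', 'pNIC02' -EnableEmbeddedTeaming $true

# Inspect the resulting team (members, teaming mode, load-balancing algorithm).
Get-VMSwitchTeam -Name 'SETswitch'
```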




Summary of Requirements

We continue to chip away at the requirements on the system administrator as compared to previous releases.


  • Install latest drivers and firmware


  • Processor Array engaged by default – CPU0.  This was originally changed in 2012 R2 to enable vRSS (on the host), however this now includes host virtual NICs and hardware queues (VMQ/VMMQ) as well.


  • Configure the system to avoid CPU0 on non-hyperthreaded systems and CPU0 and CPU1 on hyperthreaded systems (e.g. BaseProcessorNumber should be 1 or 2 depending on hyperthreading).


  • Configure the MaxProcessorNumber to establish that an adapter cannot use a processor higher than this.  The system will now manage this automatically in Windows Server 2016 and so we recommend you do not modify the defaults.


  • Configure MaxProcessors to establish how many processors out of the available list a NIC can spread VMQs across simultaneously.  This is unnecessary due to the enhancements in the default queue.  You may still choose to do this if you’re limiting the queues as a rudimentary QoS mechanism as noted earlier but it is not required.


  • Enable VMMQ on Virtual NICs and the Virtual Switch.  This is new as this capability didn't exist prior to this release and as mentioned, we disabled new offloads.


  • Test customer workload
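The base processor requirement in the list above can be sketched as follows for a hyperthreaded host. The adapter name pNIC01 is a placeholder; on a non-hyperthreaded host the value would be 1 instead of 2.

```powershell
# Steer VMQ and RSS away from CPU0/CPU1 on a hyperthreaded system.
# 'pNIC01' is a placeholder adapter name.
Set-NetAdapterVmq -Name 'pNIC01' -BaseProcessorNumber 2
Set-NetAdapterRss -Name 'pNIC01' -BaseProcessorNumber 2

# Confirm the new base processor.
Get-NetAdapterVmq -Name 'pNIC01'
```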


Summary of Advantages

  • Spreading across virtual CPUs (vRSS in the Guest) – The virtual processors have been removed as a bottleneck (originally implemented in 2012 R2).
  • vRSS Packet Placement Offload – Additional CPUs can be engaged by vRSS (creation of the indirection table implemented in 2012 R2). Now packet placement onto the correct processor can be done in the NIC improving the performance of an individual virtual NIC to +50 Gbps with adequate available resources. This represents another 3x improvement over Windows Server 2012 R2 (and 6x over Windows Server 2012)!

Note: Some have commented that they have no need for an individual virtual machine to receive +50 Gbps on its own.  While these scenarios are (currently) less common, that misses the point.  The actual benefit is that +50 Gbps can be processed by the system, whether that be 50 Gbps / 100 VMs = 0.5 Gbps per VM or +50 Gbps for 1 VM.  You choose how to divvy up the available bandwidth.

  • Multiple queues for the Default Queue – Previously the default queue was a bottleneck for all virtual machines that couldn’t receive their own dedicated queue.  Now those virtual machines can leverage VMMQ in the default queue enabling greater scaling of the system.


  • Management of the Default Queue – You can choose how many queues for the default queue and each virtual NIC


  • Pre-allocated queues – By pre-allocating the queues to virtual machines, we’re able to meet the demand of bursty network workloads.

Summary of Disadvantages

  • VMMQ is disabled by default – You need to enable VMMQ individually or use tooling to enable it.
  • No dynamic assignment of Queues – Dynamic VMQ is effectively deprecated with the use of VMMQ which means that once a queue has been mapped to a processor it will not be moved in response to changing system conditions.
  • Pre-allocated queues – This is also a disadvantage because we may be wasting system resources without the ability to reassign them.

The redesigned NIC architecture (NIC Switch) enabled VMMQ, which represents another big jump in synthetic system performance and efficiency.  If you’re using the synthetic datapath, you’ll receive a huge boost over what vRSS and Dynamic VMQ alone can bring, enabling another 3x improvement in performance.  In addition, VMMQ enables improved system density (packing more VMs onto the same host) and a more consistent user experience as they will be able to leverage VMMQ in the default queue.  Lastly, Switch Embedded Teaming (SET) has officially become our recommended teaming option in this release, in part due to its support for advanced offloads like VMMQ.


I hope you enjoyed the first three articles in this series.  Next week we’ll wrap up with the final installment describing Dynamic VMMQ and our first acceleration that isn't named VMQ or RSS!


Dan "Accelerating" Cuomo

Occasional Visitor

@Dan Cuomo If SET is the recommended way to go instead of LBFO for all teaming, does that mean that the Hyper-V role should be deployed on all physical servers?  A couple of example situations would be a physical DC, a physical SCVMM server, or an S2D converged cluster (with or without RDMA support).  It seems to me that if Microsoft wants people to move towards SET for everything, then it should be decoupled from the Hyper-V role and included in the base OS as a core networking technology like LBFO is.  Wouldn't that be more in harmony with the best-practice of minimizing what you install on a system to reduce the attack surface?


Hi @Gregg_H  - LBFO is still required for native scenarios (e.g. without Hyper-V) and this article is in the context that LBFO does not support VMMQ (and many other advanced features, "is not compatible with numerous advanced capabilities").  LBFO, should be considered in maintenance mode; we are not bringing advancements to this feature ("will not see future investment") but to be clear, LBFO is still supported (including on Windows Server 2019).


Summary: Where possible, we recommend Switch Embedded Teaming (not LBFO).

Occasional Visitor

@Dan Cuomo Thanks for clarifying that LBFO is still the recommended teaming solution for non-Hyper-V physical server scenarios.  What about the converged S2D scenario I mentioned before?  If you are leveraging RDMA for your SMB traffic, I would think that you would definitely want SET for your S2D nodes (and obviously your compute ones too).  If RDMA isn't available, would the recommendation still be to use SET?  In either scenario, if SET is the recommended teaming solution you will end up installing Hyper-V on servers that don't really need it other than to gain access to SET.


@Gregg_H - Again, good questions.  I would not say that I recommend LBFO, it's just your only option on bare-metal systems.  Some customers do not require Hyper-V and wish to leave the system without it.  In that case, LBFO is your only option.  However those scenarios are dwindling between on-premises Azure Stack (and Azure Stack HCI), and Azure cloud opportunities.  Quite honestly, i'm not sure that there is a true, measurable benefit to using a bare-metal host if that means you have to use LBFO.


Regarding S2D - Storage Spaces Direct team recommends using iWARP RDMA adapters.  RDMA is one of those advanced features that doesn't work with LBFO.  So if you're teaming your RDMA capable adapters, you MUST use SET.  Furthermore, storage is generally considered a critical workload for most customers.  Given my previous statement about stability, I would recommend that you install Hyper-V and use SET even if you're not running virtual machines or using RDMA on those S2D nodes.


In all scenarios where you can install Hyper-V, I recommend the use of Switch Embedded Teaming (SET) over LBFO.

Occasional Visitor

@Dan Cuomo Ok, I understand your position and recommendations a bit more clearly now.  In my particular environment, we typically only use bare metal for DCs, SCVMM servers, S2D clusters, and Hyper-V clusters to run, monitor, and manage our virtual environment (hence the scenarios I asked about).  The vast majority of our workloads are virtualized.  Given all your responses, I think my original statement stands: it would make sense to decouple SET from Hyper-V as it adds unnecessary bloat and potential exposure to servers that only need a single networking feature that the Hyper-V role enables.


@Gregg_H - Thanks for the feedback!

Add "double check that your SET team is using HyperVPort algorithm" to Summary of Requirements? ;)

Hi @martych - We do recommend using Hyper-V Port over Dynamic, however this is not a requirement.  In fact, our RDMA validation tool, (unrelated to this article) does default to verifying that your team is in this mode.

Established Member

@Dan Cuomo In Server 2016, now that we are no longer concerned with MaxProcessors and MaxProcessorNumber, in sum-of-queues are we still avoiding overlapping of processors and spanning NUMA nodes when setting BaseProcessorNumber, or do we just set every pNIC as BaseProcessorNumber = 2?

Established Member

@Dan Cuomo When working with Set-NetAdapterAdvancedProperty for *MaxRssProcessors, I see in device manager my NIC can do 32, should we use VMMQQueuePairsRequested to identify the minimum Processors needed or just max this out to 32?  Thanks!


EDIT: To help anybody else... I found some info explaining MaxRssProcessors is limited in relationship to the Numa, so in my case the system cant use more than 16 anyways, I left the setting at 16

Established Member

@Dan Cuomo I am clarifying my previous question... I apologize for all the questions, sorry to be a pain, there is A LOT of contradictory and confusing info out there on these topics, plus I've been working with MS support and they are just as confused on these topics, they keep giving different answers!!  I really appreciate your help and these great articles!  Thank you!


In your article you mentioned "you are no longer required to set the processor arrays with Set-NetAdapterVMQ or Set-NetAdapterRSS"


At the summary section you also mentioned we should still "Configure the system to avoid CPU0 on non-hyperthreaded systems and CPU0 and CPU1 on hyperthreaded systems (e.g. BaseProcessorNumber should be 1 or 2 depending on hyperthreading)"


It seems like you are indicating (assuming VMMQ is setup) we no longer need to worry about calculating VMQs to multiple different base processors but we still must modify them all from the default of 0 to either 1 or 2 (depending on hyper threading), is that correct?


Furthermore, in a situation where I have 4 NICs and 2 NUMA, do I still need to be concerned with NUMA, putting 2 NICs base at 1 or 2 and the other 2 at the base processor for the second NUMA or do I just put all 4 NICs with a base of 1 or 2 (depending on hyper threading)?


Hi @Samuel Miller - Thanks for reading and reaching out. I rewrote these articles because of all the different information out there. Generally this (and the 2019 article) should be your go to. Here's a few responses to your questions:

  • NUMA - Most systems today do not need to concern themselves with tuning/modifying the RSS/VMQ NUMA configuration on the system. The defaults will suffice.
  • In 2016, you must modify the base processor to avoid CPU0. In 2019 you can optionally do this as well (assuming that Dynamic VMMQ is configured on your system which may alleviate this problem). Even in 2019, the safest approach is to modify the base proc to avoid CPU0.
  • Assuming you're using Switch Embedded Teaming (e.g. Server 2016 or higher) you no longer need to concern yourselves with sum-of-queues, overlapping processors, and (in 99% of cases) spanning NUMA nodes. LBFO (which has been deprecated for Azure Stack HCI in the future) is a different story.
Regular Visitor

@Dan Cuomo, hello!

I want to use VMMQ in 2016, but i am limited to 16 queues for all VMs and VM Switch. If i don't use VMMQ on same hardware i have 31 queues per physical NIC in team. Is this a normal situation?


Established Member

@Dan Cuomo thank you so much for the clarification!!  BTW yes I'm using SET and hyper threading, so I will just configure all my pNIC's with base processor of 2 and be done with it.


I am seeing one other strange behavior, I have configured both *MaxRSSProcessors and *NumRSSQueues as 16 per your instructions, even rebooted the host, yet I am still seeing VMMQQueuePairRequested 16 and VMMQQueuePair 4 for all my vNIC's including managementos.  Although my vSwitch DefaultQueueVMMQPairs does get the full 16 requested, any idea what is going on, is there something else I should check?


One note, when I created the vSwitch I used -EnableIov $true, is this the problem should VMQ VM's not be on an IOV enabled vSwitch, do I need separate vSwitches, 1 for IOV VM's and 1 for VMQ VM's?



Separate IOV question, I don't see any IOV info in PowerShell on managementos vNICs, the fields don't even show when the vNIC's are | fl * , is IOV not available for managementos vNIC in Server 2016?


And one last thing, PacketDirect, I have seen this option noted in powershell but cant seem to find lots of info on it e.g. real-world results on the benefits are to be had, if possible can you point me to any good info on this if you think its worthwhile?


Hi @Samuel Miller - Regarding VMMQQueuePairRequested vs VMMQQueuePair...Does VMMQQueuePair go to 16 if you give it a bunch of traffic? E.g. Run NTTTCP with at least 64 connections and see if you get more. I would expect the value to max at 16.


Also, check the perf counter: Hyper-V Virtual Network Adapter VRSS – ReceiveProcessor – VM01_Network Adapter_Entry…

to see how many queues you have and which CPU each is assigned to. We have a guide on validating Dynamic VMMQ but the perf counters section will all be the same.

Regular Visitor

Ok, now i understand why this happens, this is from Emulex manual:

In VMMQ, *MaxRssProcessors registry key controls the number of receive side scaling (RSS) CPUs used by each VPORT, and, by extension, the maximum number of QPs used by a VPORT. The number of QPs used per VPORT determines the number of VPORTs capable of VMMQ. The counts are fluid. For example, a 10Gb/s NIC adapter supports 32 VMMQ QPs. There are always two VPORTS that are VMMQ capable, one for default VPORT and one for non-default PF based VPORT. If you set MaxRssProcessors = 4, there can be 32/4 = 8 VMMQ-capable VPORTs. If you set MaxRssProcessors = 8, there can be 32/8 = 4 VMMQ-capable VPORTs.

Ergo, if MaxRssProcessors = 16, VMMQ will work only for two VMs or for one VSwitch and one VM.


@Loco23 That's not exactly the correct interpretation. It's saying that at max a virtual nic (which is assigned to a vPort) could only scale across 4 CPUs at a time. You're not limited on number of VMs that are assigned a queue (until of course you run out of queues). Additionally, the default queue is a shared queue as explained in the blogs that can be used by multiple vPorts.


There may be some vendor quirks that modify how it is working on your system and I would say if you find something that doesn't make sense, you should open a support case so that we can review properly. It's possible that the adapter has a quirk but that doesn't mean that's how it should work (e.g. limiting you to only 2 VMs with 16 queues each)! Our support teams can work with the partner (NIC vendor) to get this corrected in their driver.

Regular Visitor

@Dan Cuomo is there a way to determine the number of queues on a specific NIC?

In my case i use HPE 556FLR-SFP+ (Emulex) NIC and the above is correlated with my experience, see my screenshot above: one VPORT of *MaxRssProcessors VMMQ QPs per VM.

Regular Visitor

@Dan Cuomo if i have 32 queues and will assign 4 per VM, then I will only be able to enable VMMQ on 8 VMs, right? Or queues can be shared?


The default queue (which is a property of the vmswitch e.g. Get-VMSwitch) are queues that can be shared. Other queues not part of the default queue are directly assigned to the vNIC's vPort and so are not shared.

Senior Member

Hi Dan,

I have noticed that SET switch is not very smart when it is used for Mgmt. Networks on Hyper-V host in terms of NIC failover. I have configured SET Team with 2 members with Hyper-V Port (used dynamic as well) and if we disable one of the NICs, it experiences 7-9 RTOs to resume the connection. SET works absolutely fine for VMs connected to it, the failover is seamless but it only struggles when we create a VNIC on host for mgmt/cluster/backup which are not used by the VMs itself. If we use LBFO teaming, the failover is seamless and we do not experience any RTOs during failover. With that said, is it okay to use LBFO teaming for those networks which are not utilized by VMs but only used for host management purposes?

Version history
Last update: May 06 2019 04:13 PM