In the first article High Availability in Azure SQL Managed Instance: General Purpose service tier we have described the principles behind the High Availability and dived into the specifics of the General Purpose service tier. In this post we will continue covering the Business Critical service tier, discussing not only how High Availability is achieved but also how it is improved compared to General Purpose, reducing the possible downtime, and unlocking new scenarios such as off-loading read-only workloads from the Primary replica.
High availability in the Business Critical service tier is basically achieved by combining both compute resources (hosting the sqlservr.exe process) and storage (locally attached SSD) in the same node and then replicating that node as part of a four-node cluster. It is worth mentioning that we are referring to node in an Azure Service Fabric cluster, which is a network-connected set of virtual or physical machines into which microservices are deployed and managed. A machine or VM that is part of a cluster is called a cluster node. To know more about Azure Service Fabric clusters, see the following article from Microsoft Docs: Create clusters on Windows Server and Linux - Azure Service Fabric. This architecture of high availability is also shared with Azure SQL Database in the Business Critical and Premium service tier and it is known as model, please see the following article to know more about this availability model: Business Critical service tier - Azure SQL Database & Azure SQL Managed Instance.
You might find this architecture very similar to SQL Server Always On availability groups (to know more: What is an Always On availability group? in the sense that we have a single Primary Replica that receives the read-write operations from the applications, and three secondary replicas containing a copy of the data.
In the traditional deployment (without using Failover Groups) all replicas use the Synchronous commit mode and they are always Synchronized. However, in order to optimize the impact on the application only two replicas need to acknowledge that they have "hardened" (saved to disk) the transaction log in order to tell the application that the transaction was committed. This is called a "Quorum Commit" instead of "Full Commit" to know more about Quorum in a Distributed Environment see: https://en.wikipedia.org/wiki/Quorum_(distributed_computing). In this way if the Primary Replica fails, we have two secondary replicas, with everything that was committed before the failure (achieving a ) that we can failover to.
It is also important to establish the difference between ZDL (zero data loss) and zero-committed data loss. Sometimes both terms are (wrongly in my opinion) used interchangeably, but ZDL would imply that you would not lose any data no matter what, a very noble goal but extremely hard to accomplish. However, when using Synchronous replication you would have zero-commited data loss. ZDL or, in other words, an RPO of zero would only apply to disk writes and only if the disaster requires recovery to the most recent point in time. An RPO of zero on disk writes does not capture any transactions in-flight such as those in the memory of the application, hypervisor stack or in transit between the data-centers. Therefore zero-commited data loss is a more realistic goal and it means that since the transaction block is saved in another replica (before committing it in the Primary Replica) that group of transactions (transaction block) would not be lost if the Primary Replica fails. Of course, this only applies when you have Synchronous replication.
Unlike the General Purpose service tier, in Business Critical we have attached SSDs with all the databases on them (system and user databases) which provide a very low-latency and high throughput IO. This also means that in every replica we have a copy of all the databases (unlike the Standard Availability model in this case we don’t use Shared Storage).
The following diagram shows the architecture of this High Availability solution:
In terms of failure detection, the responsibility is shared between Azure SQL MI and Azure Service Fabric. Azure Service Fabric will perform what we might call an "external monitoring", making sure that the node is alive and responding (to know more about this take a look at the following article on Service Fabric disaster recovery).
The "internal monitoring" is performed by Azure SQL MI and when a high severity error appears within the SQL Engine a dump is triggered. That would either cause the SQL process to be restarted in the same replica or (especially if it is the Primary Replica that is having the issue) an API call to Azure Service Fabric would be made to let Azure Service Fabric know that something “serious” had occurred and a Failover should be made. In that case that replica will be marked as "Not healthy", a Failover will occur, one of the Secondary replica will become the new Primary Replica and also another node would be provisioned to make sure enough nodes are available and a Quorum Set is achieved. Once the failover process is complete, Azure SQL Managed Instance connections are automatically redirected to the new Primary Replica.
Azure SQL Managed Instance always tries to minimize the impact and downtime even when a failure is detected performing the fastest recovery action possible. In this sense there is a concept internally called "restart replica" where only a specific database in a specific node is "restarted" (close/open database operation and database recovery process initiated).
Similarly, to the General Purpose service tier, the failover process could also be initiated manually as we saw in the previous article (High Availability in Azure SQL Managed Instance: General Purpose service tier).
To see all the replicas and the role assigned we can query the sys.dm_hard_fabric_replica_states DMV as shown in the following capture:
As we can see for every database that we have in our Azure SQL Managed Instance the query returns one row for the Primary Replica and three rows corresponding to the three additional Secondary Replicas. Also, you can see that the health state of the synchronization is “HEALTHY” (which is of course the normal desired state). To know more about monitoring HA using DMVs you can check the following article: How-to monitor SQL MI Failover Group & Replicas availability
It is important to make the distinction between having the database “SYNCRHONIZED” and having the same data available for queries in every replica (Primary and Secondary). The “SYNCRHONIZED” state means that the log blocks were “hardened” (persisted to disk) in the Synchronous Secondaries Replicas, however the Redo process, which would reflect those changes in the database file is always an Asynchronous process. This means that even when the state of the database is “SYNCRHONIZED” you would not see those changes in the database file until the Redo process completes.
The Business Critical service tier not only provides low-latency responses, due to the underlying SSD storage (1-2 ms in average), and fast recovery if the underlying infrastructure fails, it also provides to ability to redirect queries/analytical workload and any workload that is read-only to one of the secondaries replicas. This feature allows to take the load of read-only queries and ease the Primary replica:
This feature is known as Read Scale-Out and allows customers to avoid impacting the primary replica with read-only queries.
Azure SQL Managed Instance features a built-in high availability solution, that is deeply integrated with the Azure platform. It is dependent on Service Fabric for failure detection and recovery, and on Azure Blob storage for data protection. The combination of these technologies enables applications to fully realize the benefits of a mixed storage model and support the most demanding SLAs.