Microsoft SQL Server HA/DR on AWS

Video size:

Abstract

In this talk, we will go through several options for high availability and disaster recovery for MSSQL Server, making it one of most popular relational database technologies for enterprises’ mission critical workloads. However, these options are primarily designed for traditional on-premises environments.

This session explains those options in the context and Cloud deployments, pros and cons of each, and what other possibilities emerge in a modern Cloud platform, that otherwise would have been unthinkable when on-premise

Summary

You can enable your DevOps for reliability with chaos native. Create your free account at Chaos native Litmus Cloud. Good morning, good afternoon wherever you are.
Asif Mujawar: In today's session, we'll look at the introduction of SQL Server on AWS. We'll look into the HADR options on AWS for Microsoft SQL Server. And we'll also look at how to reduce a way to reduce the SQL server license cost on Amazon.
There are multiple options when it comes to habibability and disaster recovery. HADR can be categorized into three areas: simple, standard and machine critical. We'll dive into each option in a deep as we go through the slide deck.
If the hardware goes down, the hyperwiser has an issue or the VM fails, then the VM will move v motion to the other side. This is what currently we do onpremises. How about if I bring it to AWS, we have a single AZ deployment. It can scale as the demand grows.
There are two major types of backups. Server level backups are usually the hardware assisted. The other type is a native SQL server capability and does not need any special hardware. When it comes to backup and restore on premises versus AWS.
RDS supports both types of backups server level as well as SQL server DB level backups. Apart from automated backups, you could also create user initiative backup. This is especially useful if you have compliance requirements that mandate retention of backups for longer than 35 days.
In AWS, several backups translate to EBS volume snapshots. Fast restore snapshot enables you to get full performance immediately after restoring volume. You could also create an AWS level SQL server backups. AWS gives you a way to efficiently do such tasks at any scale.
FCI is something that Microsoft introduced since SQL 2008. It gives you resilience against OS and SQL server instance failures. Multi AZ architecture provides higher resilience as compared to on premises. For mission critical databases, these databases should be deployed in a multi AZ deployment.
Storage Space Direct STD allows you to build and deploy software defined NAS across multiple instances. With log shipping, it doesn't matter whether your secondary instance is in the cloud or onpremises. Dr. Cloud enables you to spin up the apps configured with the desired data in your region.
There is something called ability groups which Max have introduced in SQL Server 2012 to take over the mirroring. With the basic ability groups you can only have one database in that availability groups. Compared to these advanced availability groups you have the multiple databases that can be added in the single ability groups.
availability groups come into play when it comes to protecting my SQL server workloads. FCI gives instance level protection for H and Ag Ag. Whereas Ag gives you devs level protection in terms of replication. We'll dive into the mission critical HaDR options.
Multi-AZs have a single point of failure if an instance goes down. It will fail over from your primary role to a secondary instance which is sitting in AZP. Application will continue to work without noticing that instance has been replaced. HADR and RDS options available for work on EC two with backup restore.
Cloud enables to optimize licensing cost. How we can help optimizing the licensing cons when it comes to SQL server deployment. In previous model you are using twelve cores. Now you have an option of downgrading in an instance edition from enterprise to standard.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Are you an sre? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with chaos native. Create your free account at Chaos native Litmus Cloud good morning, good afternoon wherever you are in the world. Welcome to today's session on HADR options for Microsoft SQL Server on AWS my name is Asif Mujawar. I'm a senior specialist solutions architect database here at AWS. So in today's session, what we're going to cover, we're going to talk about we'll look at the introduction of SQL Server on AWS, how SQL Server is deployed on AWS and what are the options we have to deploy SQL Server on AWS. Then we'll look into the HADR options on AWS for SQL Server and then we'll look at how to reduce a way to reduce the SQL server license cost on AWS. So let's start with the SQL Server on AWS. SQL Server on AWS typically you can deploy in two flavors. So you can deploy on the Amazon RDS for SQL Server which is our managed service offering with the single click capability and license included instances only. On the other hand, we have SQL Server on Amazon EC two which is self managed vm exactly the same way what you currently do on premises and with the manual high availability and license included, or bring your own license instances. Now let's just dive a bit more deeper into what are the options for deployments SQL Server on Amazon on AWS. So if you look at the right hand side, SQL Server on Amazon EC two is basically all the admin tasks which basically doesn't really add any value towards your business processes such as scaling, the habibability, the backups, the patching, the device install and maintenance and always level patching and those firefighting issues to maintain the space or add a space onto the database service, that's basically what you currently, currently do on premises. To maintain a SQL server infrastructure is something you will continue doing. So when it comes to an EC two, what we will do, the underlying data center operations of power racking and stacking, the hypervisor level and the os install and maintenance, that is something will be offered out of the box on EC two. But whereas everything else like the OS patching database install, the database patching, the backups, the hatcha, the scaling, you need to do that yourself. And this is where RDS comes into play, where all those undifferentiated heavy lifting tasks is what we call it at Amazon. When you don't really add any value to your end user experience in terms of doing the OS level patching or doing the device install and maintenance or datalets patching or backups, the ha, the scaling, all these tasks are taken care for you on the other side which is Amazon RDS. These comes as offering as a managed service, so it comes just out of the box for you. And what that gives you that basically releases the resources to focus upon business value tasks and do the high level schema tuning or schema optimization. So that way the tasks which are schema tuning and all of that currently on premises will be the last thing you worry about because you have the production issues, the patching or the backups or the space issues to tackle. Before you get to these tuning issues, what RDS does. RDS then takes care of everything those operational issues for you and leave you to focus on those tuning which directly contributes towards your end user experience. So we'll just look at quickly at the features on both sides. So EC two is supported with either by Ol or bring your own license or bring your own software or pre configured AMis that is basically virtual machines or pre configured options will the pre configured options will now have as many versions or options supported AWS custom build solution that you might build yourself. So at the moment, how many of you have struggled either because of expenses or just technical challenges in the setting of multi site high availability option for SQL server instance with RDS for SQL Server it's as simple as checking as a checkbox when you launch an Amazon RDS instance to set up multi AZ SQL server cluster that leverages synchronous replication between AZs or using devs mirroring, we'll dive into this as we go along the session. But if you need a different ha solution like always on or log shipping, then choose SQL Server on Amazon EC two and these manage that yourself. Both platform support storage encryption for all Edition using kms and those of you running Enterprise edition you can use TDE on both platform that are transparent encryption. If you need to install third party tools such as Ola Haleigan scripts for maintenance plan or use, then run the SQL server on Amazon EC two. Otherwise RDS will take care of everything for you and if you want to take advantage of automated patching then choose Amazon RDS for SQL Server. Otherwise you need to manage those tedious maintenance tasks yourself with SQL Server on Amazon EC two. With that folks, let's dive into the HADR options in a glance. So typically there are multiple options when it comes to habibability and disaster recovery. But we can categorize them into three areas. Let's just say that simple HADR, standard HADR and machine critical HADR. With these simple HADR, it's the VM level protection where if you have a failover or a hyper as a level, then you are protected at the guest level but no protection at the SQL server or guest os layer. Whereas the standard layer, standard HADR, we take that into consideration where we'll provide you the VM level protection plus SQL server instance level protection. Plus you'll get the guest os layer protection as well. And with the basic availability groups, you can have the device level protection as well. Now when it comes to mission critical HaDR, it's a combination of simple plus standard plus what you go in there, you go into the device level protection. Now, what is the different SQL server protection levels we'll cover in these next slide? But briefly, it's about what's been protected at what level in terms of your failures, like whether it's at the os layer, whether it's a guest layer or it's a SQL instance layer or the database layer. Now machinery called HADR will cover everything. Plus it will give you seconds of RTO with zero data loss and with the minimum downtime for plant maintenance and offer you the horizontal read scaling. Now let's dive into the protection levels as we spoke about. So if you look at the VM level protection or the VM failover that protects you from the hardware failure at the host itself or the hypervisor level failures, or if you have a guest os level failures, if now if I want to take it to the next level, that's your VM failure is a simple HaDR. Now if I want to take it to the next level, or I want to say, hey, I want to protect my SQL server level failure if my instance is hunk, how about that? Now that's something that can be covered with the FCI, these failure cluster instances, which we'll dive in a slide or two in terms of diving into what the FCIs and what the options available for you to leverage the high vability at these cluster level, which is the failure clustering instances. Now if I want to protect my databases, so I protected my hardware, I protected hyperwaza, I protected for the os level and also protected for SQL server instance level. But what if my database gets hung or my database itself becomes unfunctional? Then how about that? So we have the availability groups or the mirroring that could help you to tackle that scenario and fail over to the other partner. We'll dive into this every single option in a deep as we go through the slide deck. So let's talk about simple HaDR. Now, simple HADR, we spoke about VM or guest level failures. So for example, at the moment onpremises you have the VM running on SQL server, running on OEM, that's using the VMDA or VMDK files or the VHDX or RDM files that are hosted on this Ovsan. Or if you have done the RDM directly on the SAN, which is presented from the same location, now you have the hyperwiser level and the hardware underneath. Now if the hardware goes down, the hyperwiser has an issue or the VM fails, then the VM will move v motion to the other side and it will start running on another hypervisor or hardware level. So it's like in a v motion to the other host to cover yourself upon the underlying hardware failure. And this is on premises. This is what currently we do onpremises. How about if I bring it to AWS, we have a single AZ deployment. It's basically you have an ABS volume. That's where the EC two instance is running. And that's basically you have an auto scaling group. It can scale as the demand grows, or at the same time you have the auto heal functionality within the VM. So if something goes wrong with the EC two instance, it will self heal the instance itself or it will move within the AZ. If the instance goes down, the underlying host has a problem. The EC two instance will move from one host to the other host like you have vmotion onpremises. Now let's talk about the backup and restore other simple HaDR options. Generally speaking, there are two major types of backups. Those are backing of the entire SQL instance, backing of the individual database, or that is basically server level backups are usually the hardware assisted. Using the special storage technologies, you can snapshot the underlying storage where a SQL server instance stores its files. Using these snapshots, you can restore a database instance to the point that you can perform a point in time when the snapshot was taken. The other type we have, that is the database level backup is a native SQL server capability and does not need any special hardware. It can be either full backup of an entire database or a differential backup that keeps changes since the previously taken differential or full backup or transaction log backup which stores the transaction logs and enables replaying those transaction logs on top of a previously taken full or differential backup. And therefore restores a database to any point in time. So let's talk about the two options, what we have when it comes to backup and restore on premises versus AWS. So at the moment you have a regular backup plan running on on premises. You can have the same thing on RDS. On AWS, for example, you can have weekly full daily differentials and transaction logs. Backup every 15 minutes. You can run the backup on a local volume, you can do the same thing on AWS, you can run the backup on EBIS volume, you can move the backups on archived this is what I currently do on premises. When it comes to AWS, you move the backups to s three. Now within s three you have different options available for you to move your backups. For example, you have retention policies of the backup that you need to retain for compliance purposes for more than ten years. Now you don't want to stick that backup into an s three bucket and then pay additional charge just for having those backups in a standard tier. Now standard tier will have a premium cost to it, whereas if you move these backups which are not used, say for example, my weekly backups get moved after two weeks into infrequently accessed bucket because those backups are not going to need it anymore. And after say six to eight weeks, I will then move those backup into glacier. So that way I'm not constantly paying the premium cost to store the backups. It's always moving to the lower cost tiers as the file goes. As the time grows or the backup becomes older and older, we just keep shifting that to the cost effective storage where possibilities. Basically what you can do is point it down. Recovery is available using SQL server backups. You can do the differential backups to diminish the backup time and space. And some of the best practices are in every time transaction log is cleared, full backup should be taken and in AWS. I would rather suggest if you want to do the server level backups then consider using the VSS enabled EBS snapshots than looking into the back end of the entire database. Now let's talk about the backups on RDS. So RDS supports both types of backups server level as well as SQL server DB level backups. So when you create an Amazon RDS SQL instance, it already includes an automated server backups out of the box. It takes daily snapshot of the entire instance and by default retains these snapshot for seven days. You could also configure it to another retention period that works better for your needs up to 35 days in Amazon RDS, these are called automated backups, and they are entirely managed by Amazon RDS. Apart from automated backups, you could also create user initiative backup. Similar to an automated backups, user initiated backups are also server level snapshots of the underlying storage blocks. However, unlike the automated backups, the user initiated backups are not automatically deleted by Amazon RDS, even when you delete your entire RDS SQL instance. Although all associated automated backups are deleted, any user initiated backups remain available until you explicitly delete them. This is especially useful if you have compliance requirements that mandate retention of backups for longer than 35 days. You could create an automated job that takes user initiated backups on a regular schedule and deletes expired backups after a given time, for example, ten years or 15 years or more. So RDS for SQL Server also automatically stakes and maintains the transaction lock backups every few minutes. So what that does, that basically enables you to restore an Amazon RDS instance to point in time example, you could use Amazon RDS console to restore a SQL server instance to a specific time into a new instance. In this case, RDS spins up a new SQL server instance, restores last backup from these specified time, and then replays the transition logs up to a specified point in time. And the end result is you have an MSSQL server instance with all the databases restored to specified time. Now, RDS also supports the native backups for an individual databases. It allows you to create or restore a full backup, and these backup files are stored in an s three bucket. The backup can also be a multifile, which is especially useful for very large databases since the maximum single object size in S three is five gig. So when you're backing up or restoring various large database, the multifile backups might be the only possible solution. These full backups can be used to restore SQL server databases on Amazon RDS instance, or move the databases across RDS instance, or migrate the database off or from onto onpremises instances. So that's RDS. Now let's talk about SQL Server on EC two. On SQL Server on EC two, you are responsible to handle some of the backup maintenance capabilities that comes as turnkey solution in RDS. What we looked at however, in AWS, there are still plenty of options and services enabling you to do that. Several backups do translate to EBS volume snapshots, which can be created by invoking single API or through AWS console. The EBS snapshots are incremental, meaning only changes since the last snapshots are captured for subsequent snapshots. Therefore saving on storage costs and reducing time it takes to snapshot. EV snapshot are also a point in time, which means Esnapshot capture all the blocks of associated EBS volumes as at the exact time when the snapshot was triggered. When you restore EBS volume from a snapshot, it becomes immediately available for application to access it. So in this case, it means you can immediately bring SQL server online after restoring one or more of its underlying EBbS volumes from a snapshot. So by default, the restore volumes these underlying blocks from S three at the first time when the application tries to read the data, read them. So this means there can be a lag in performance, while after an EB's volume is restored from a snapshot, the volume eventually catches up with these nominal performance. However, you could avoid that lag by using fast restore snapshot. This feature enables you to get full performance immediately after restoring volume. You could also create an AWS level SQL server backups. It's much more similar to how you do currently on premises. You could have in AWS you have highly durable and highly scalable and cost object store. So that is s three to store the backup files. And these s three gives you high durability and high availability. And using different storage tiers in s three you can also cost optimize your backup solution. You could also use AWS backup to automate server level backups for SQL server instances. Now AWS backup manages CBS volume snapshots at regular intervals and you specify for that you specify. So for example, you have a consistent tag. So what you need to have for AWS backups is the consistent tagging policy across your SQL server instances. Now AWS backups is simple and easy to use. However, to achieve this consistency for SQL Server DB level within volume snapshot, the best practice is to use to use the VSS enabled snapshot. Now let's talk about the VSS, the Windows volume Shadow copy service. It's a solution that orchestrates the os and application access to underlying I O now enabling crash consistent snapshot of underlying storage without any application downtime. SQL Server is VSS aware. Therefore you can use VSS agent to snapshot the SQL server instances. This is made much easier for you to run SQL Server on Amazon EC two. You can use AWS system manager to run a command not only to install the VSS agent on your SQL server instance, but also to invoke the entire workflow of flushing the os and application buffer to the disk. Pausing the I O operation and taking a pointer snapshot of an EBS volume, then resuming the I O. And all of this what I just described could be achieved in a matter of less than 10 seconds and using a single API command through AWS system manager. Now this run command document creates a snapshot of all the EBS volumes attached to the target instance. You also have these option to exclude the root volume as most often the user database files are stored on the other volumes in case you stripe the EBS volume across multiple volumes to create a single file system for SQL Server files. EBS also support crash consistent multi volume snapshot using a single API command. So one of the boldest characteristics of the cloud is to either these large and sometimes rapidly growing environment at scale. If you have one or two SQL server instances, you could easily log into each instance, schedule a backup job via SQL agent, and somehow manage to transfer the backup files to s three bucket when they should be where they should be restored and retained. However, if you have tens and hundreds of even more instances, scheduling the job on each single instance can alone be quite onerous. Fortunately though, AWS gives you a way to efficiently do such tasks at any scale. So you can use AWS system manager state manager to schedule a regular backups and also have a backup files transferred to a target s three bucket. So state manager associations can target individual instances or all instances that have a common tag applied to them. For example, you can use hundreds of UC two instances to have key called SQL Server backup or message SQL backup and have value of automated applied to them. Now then using system manager state manager, you can target all of those instances at once. Combining that with the system manager maintenance window enables you to ensure the backup job doesn't disrupt normal business operations and at the same time, system manager also knows the rate control policies. For example, you can indicate only few instances should be backed up at one time, therefore avoiding all devices being impacted by backup jobs at the same time. Now the system manager can handle hybrid environment. You can use these same solution to manage your SQL server instances running on premises or any other cloud steel that's these video system manager. Now that's all good about backups. How about restore? So backup is half story, the other half is the capability to restore these backups. Every business needs well defined, well rehearsed processes for taking backups as well as restoring them. So restoration procedures can also depend on customers requirements. Basically whether the backup files have to be restored on an existing instance, whether the backup has wherever the backup was originally taken, or on a different instance altogether, or entirely new instance should be created for the purpose of restoring this backup. All of these can be defined and automated using AWS system manager automation document. System manager automation can create new EC two instance or discover identify an instance on which the backup should be restored. Then it can invoke the SSM run command which instructs the target instance to fetch the backup file from S three and restore on the SQL server. Using a proven and tested automation workflow is basically eliminating those human errors during an actual time of disaster, which also reduces the meantime to action and resolution of incidents or change requests. Now that was simple Hadr folks so far. How about standard? Now let's just take it to the next level where we have covered ourselves from guest level failures or we doing the traditional backup and restore well tested approach. Now I want to take you to the next level where I want to talk about I want to cover myself from the hardware failures plus the VM guest table failures, plus the SQL server instances. That's when FCI comes into play. This is something that Microsoft introduced since SQL 2008. That's when they came out with failover cluster instances. So let's look at the typical on premises implementation of always on failover cluster instances. So typically you have two nodes. One is online, the other one is in a hibernate mode or on a standby which is sharing the same storage, either that's on a CSV, SMB or NFS. So it provides you the high vibration of the instance level. Typically you deploy in a single DC and resilience against gives you against SQL server instance failures or the os level failures. Now standard edition only supports two instances, whereas you can go up to 50 instances on enterprise. If I were to do that same thing in AWS cloud, how would I do that? Now you have the AEBs volume that can be multi attached to the different instances at the same time, and you can have the Windows HadR cluster built on top of those two EC two instances, single AZ usually sitting in the same subnet. That gives you resilience against these OS and SQL server instance level failures. And now you can use a typically network balancer virtual IP for WSC static IP in the same subnet. Now let's just take it to the next level, right? I want to go multi AZ. This is where I'm going to realize more value because I'm still running into the single point of failures. Like, and if you look at the single DC deployment, if the entire DC goes down on premises, I don't have a single point of failure. Similar sort of thing. If I go to the single AZ deployments into EC two, I've got a single point of failure. If the AZ goes down, can't do anything. So hence the reason we normally recommend customers to think about multi AZ deployment for mission critical databases where you have lower RT and RPO and you can't afford a downtime, these databases should be deployed in a multi AZ deployment. So FCI or multi AZ again you have the storage comes out of the box like SMB, these .0 sTd Iscazi basically what you do is create the multi AZ on a multi subnet. Resilience gives you the os level 60 failures or the AZ failures because of the low latency network. Within the multi AZ architecture, the stretch cluster is a practical option, provides higher resilience as compared to on premises. If you're looking at the stretch cluster, there is a lot of network intricacies that needs to be handled and looked at before you decide to do a stretch cluster. I mean I've been bitten myself in the past when I did decide to do these stretch cluster and then there could be a lot of network level issues and blips that could cause a false alarm on a SQL server and SQL server to failure. So hence the reason most of the time onpremises stretch clusters are kind of no no practice. But because we have the low latency multi AZ architecture, implementing on Multi AZ SQL Server FCI as a stretch cluster is a possibility. Now let's take it to the next level. We have something called a service called FSX for Windows. This is a file share service that is built and optimized for the Windows workloads. So you can, rather than having have a multi attached EBS volume, you can say I'm going to create a file system which is where I'm going to store all of my data files and it just becomes an SMB share file share that is attached to a SQL cluster, the Windows cluster. So you're still going to have the Multi AZ architecture here where your primary instance is sitting in AZ one, your secondary instance is sitting in AZ two, and you create the FSX file share and then provision it. Now the good thing about these file share is you can provision the throughput AWS compared to what you had in the EBS volume where you have to configure individual EBS volumes to configure throughput, whereas here you get collectively throughput what I need to achieve AWS my entire file store. That's a good thing about this. And then you can have a witness file share in AZ three that can keep an eye on the cluster itself to make sure the cluster fails over and the quorum is protected for any failures rather than having AZ one taking care of. So if you have a corum sitting on node one, then you kind of end up in a split bin scenario, possibly. So to avoid that you can go into AZ three. That's where the Windows file share system is. Witness file share is that could keep an eye on the Cluster gorum and decide to failure nodes from the primary to secondary in a timely manner. So folks, that was the FCIs and the FCI options we have on the SQL server on AWS. So there's another way we can achieve the SQL server high availability using the storage develop options. So storage Space Direct STD is another Windows technology that allows you to build and deploy software defined NAS across multiple instances as a converged or hyper converged model. You can have the SQL server instance and SDD together, or you have SDD and a SQL server on different instances. Now it supports these ntfs and refs and supports up to 16 servers or 400 drives would scale up to six petabyte. And this is on ABS volumes. You can do SD on the instance storage as well using the I three, the D two instances. It's a simple setup, high performance, up to 200 kiosks because of the underlying nvmes you're using the instance local store built in resistance for against these disk or server failures. Now these are ethermal. So far we covered about high vability. How about my disaster recovery? High vability will give me the protection around the hardware failures. What if my entire region goes out? Or what if my entire DC goes out? What are my disaster recurruption? Backup restore is an option, but backup restore has got higher rtos and rpos. What are my options for lower rtos and rpos? So one of the options traditionally is the lock shipping. So this is a typical implementation of lock shipping onpremises. So typically our option for SQL server since back in the days of SQL 2000, those who are old enough to have worked on it, I'm being one of them. So with log shipping, it doesn't matter whether your secondary instance is in the cloud or onpremises. The secondary database is offline. Restoring the logs received from the primary. So log shipping has four steps. First, you're backing up the data on your primary instance. That is basically your transaction log backups. Secondly, you're shipping or copying that transaction log backups from the local server or the primary server into a central file server location. And then basically the step three is to download that backup file from the central shared location or file share into the secondary server. And step four is to apply those transactions or replay those transaction logs onto the secondary server. Now you have full control over these jobs. So you can delay the log copy or log apply to secondary. If you have detected some problem in the primary and you don't want to apply those logs to a secondary, you can do so. So you can choose the job schedule as low as five minutes when the jobs are the frequency of backups on the primary copy and copy to secondary and restore. Or you can go AWS has 60 minutes for example in terms of log backup and copy and restore. It has a direct impact on your RT and RPA. So be mindful of picking up the schedule. So these best thing is to when it cons to copying and applying the secondary, the logs on the secondary, it gives you that option of mitigating if there is a corruption or any logical data that has been deleted, then you can pause it. And then the only thing that you need to reestablish it, the lock shipping, if you pause it, cool. So coming into AWS, if you were to do this setup in AWS, so you have the primarily sitting on premises and you want to do a lights on. Dr. Then you have a lot of shipping. So you can use similar sort of thing where you have the on premises model, where on premises you're backing up the data pushing across into an s three bucket. And then we are scoping from the s three bucket and storing into applying it to an EC two instance. SQL server running on an EC two. This compute can be very minimal because all you're doing is restoring the data so it doesn't demand the same amount of compute as primary. Or you can have the AWS instance sitting in primary into AWS EC two instance. Or you can copy the data back into onpremises if you like. Or you can have a multidr multiregion Dr. Deployment strategy using lock shipping as well. But if you're in the same region, I would rather suggest you to do the EBS snapshot, the VSS and EBS snapshot that is much more quicker and efficient in the same region to be able to do the disaster recovery. Now we have another option is cloud NDO. Cloud NDO for Azadir. Now what that is basically it's a service. It operates at the block level as well. AWS, the VM level, it replicates all the blocks in the VM, including app data files at a block level and it keeps a track of the block changes into the staging environment. These are maintained in the EBS volume at the time of Dr. Cloud endure enables you quickly spin up the instance with all the apps configured along with its data in your desired region. For example, SQL Server instance will be up and running along with the data, ready to accept the connections. And cloud endure for DR is a paid service for Dr. But when it comes to migration, it is for free. That's something you need to keep in your mind. Now let's look at the basic availability groups, right, the AWS, we looked at the HA in terms of and the disaster recovery in terms of what SQL Server all the underlying storage level has to offer. Now there is something called ability groups which Max have introduced in SQL Server 2012 to take over the mirroring as mirroring is, as we all know is deprecated from 2014 onwards. So what is basic availability groups is basically similar sort of thing. What you have multi AZ fcis for example, but it's not really an FCI. The obligatory group runs on top of these windows failure clustering technology. But in FCI you have the shared storage with availability groups. You have independent storage with the basic availability groups that is available in the standard edition. The only limit is two replicas, primary and secondary. You can't access your secondary replica, you cannot do a backup on the second replica. You cannot run any integrity checks or anything. You cannot upgrade your basic ability groups from say to an advanced availability groups. You basically have to drop and re add. And basic ability groups cannot be part of the distributed ability groups which we'll cover in the slider two so basic average group, just think as one database. So in the availability groups it's just another mirroring setup, but it's just underlying. We are using the independent storage copies and a failover clustering technology to be able to facilitate the failovers from one server to the other server. And with the basic ability groups you can only have one database in that availability groups. But whereas compared to these advanced availability groups, you have the multiple databases that can be added in the single ability groups. That's the key difference here folks. So with that we'll dive into the mission critical HaDR options. What do I have in terms of if I want to achieve low RT and RPO? And I want to cover almost all possible scenarios when it comes to protecting my SQL server workloads. So that's when the availability groups comes into play. These are, like I said, it's introduced in SQL Server 2012 onwards, it has taken over these mirroring and taken to these next level and combining with the Windows failover cluster technologies. So let's talk about a pattern of deployment. So we have what we looked at in the previous slides with an FCI where we have multi AZ deployment with the failure cluster instance which was sharing the same storage and the storage was moving between the AZ. Now with the availability groups you can have independent storage per AC and SQL Server is replicating the data at the instance level. That's basically a synchronous replication with automatic failover using the same Windows technologies, Windows failover Cluster and then you get an endpoint available for you to application to connect it. So when there is a problem on AZ one, you can simply fail over to AZ two. But the application doesn't notice anything because it's connecting a listener or a whip. So it's seamless to just say hey, I was talking to AZ one, database one. Now instance one, I'm going to AZ two, instance two, but for the application it still continue to talk to the database. That's the difference here. So that's one pattern here. So if you look at this is a multi AZ within the same region. Let's take it to next level. I want to go multiregion, how about that? You can still do that. So you can have your synchronous replication within the region with the automatic failover between AZ two and AZ three. But you can add another region and you can create a stretch cluster, goes across the region and have the transit gate to enable both accounts within the regions to talk to each other, get the vpcs communicating to each other. And that way you can start replicating the data into a secondary region in Aziban. But this will be asynchronous replication because you can have only one copy, asynchronous, asynchronous replica. If you're going across the region it will be asynchronous and it will be a manual failover. So that's one pattern. What we have another pattern, AWS, a distributed availability groups. So what you're going to have, you're going to have two separate windows failover clusters in region A and a region B. And between these nodes they will talk to each other on a synchronous replication. Now as we looked at in previous configuration that we have synchronous replication going from Az one to Az two in a region a. As soon as we introduce another region, it will be an asynchronous replication with manual failure but what if I have another cluster which is sitting in region B? Can I do that? Because in the previous example we had a one cluster going across the region. Now we are bringing in two different clusters, so we don't have to worry about stretching the cluster, stretching the ad objects and all of that. So we'll have two different independent clusters and these that way what you're doing here, you're going from primary replica in Aza in Region A into Region B, Az one as a forwarder. So that is asynchronously replicated. But the secondary replica in Region B in Az two, these synchronously replicate the data between Az one, region B to a z two in region B, rather than going all the way back to region A, Az one. So a replica in Az one in region B plays the role of forwarder. So you can fail over between two Az. And then because the requirements, you might have to run the workloads six months from Region A, you can run the workloads from region B. So when you fail over to Region B, you still have that same level of protection what you had in region A. In the previous example, you only had protection in region A, but the region B only had one instance. So that's how you can tackle another option available for you if you have the requirement to run the workloads from both regions six monthly. So let's just go dive into the comparison between AWS versus FCI. So FCI gives you instance level protection for H and Dr. Whereas Ag gives you devs level protection in terms of replication. It's fully automated because in a way you can have the hardware storage level replication playing in a play, or your FCI is basically saying the same storage, same master databases that can go in across to the other side, whereas oblivion groups, you'll have to manually replicate the instance level object like an example, logins and link servers and et cetera. Scale up and down in FCI is not supported. It's supported, I'm sorry, because you can add the extra nodes if you like. On AG, same thing. You can add extra nodes and extra copies. Apart from the basic availability groups read scale out FCI is not available because you can only have one primary and the rest are passive nodes, whereas the AGs, you have the read replicas available where you can add reader nodes onto the enterprises edition, non standard asynchronous replication. Again, you have the FCI, the block level storage, you can replicate the block level storage to the other node or other storage, and you can bring up the storage on the other side and off you go. Whereas on these Ag it's on the database level. So you can combine the FCI and Ag in a one cluster. So some of the things which you need to watch out for is always on FCI in a single AZ for Ha and Ag on a second AZ for Dr. You can use the FCI inside a cluster placement group and AG for a Dr. In a second AZ. You can use the FCI across multi AZ for HA and averted groups or ag to the second region for Dr. Or you can use FCI for HA and Ag for read replicas. So failover between FCI in an Ag is always manual in terms of when it comes to Dr. If you combine these two together, you will not be able to have an automatic failure if you combine FC and AG together. So that was all about hivability and a Dr. On AEC two. Now how about rds, how things work when it comes to HADR and RDS? So you have similar sort of sets up. You have a region and you have three AZ where app servers are running and you have RDS for SQL server running. In Aza, for example, all the apps are talking to the AZA. What if SQL server instance goes down? Now we have things like health monitor and self recovery capabilities in a single AZ deployment that will self heal the instance itself, it comes back on and it starts working. But what if your entire AZ goes down and this is where the problem is, you have a single point of failure. Your application sitting in AZb and a C can't work anymore because you lost these SQL server sitting in Aza, not the instance, the entire AZ has gone down. So you have a single point of failure there to address that. What we say customers to deploy the SQL server in multi AZ in RDS setup. So similar sort of thing. You have the app stalking from different AZ into Aza. These we create another instance in AZB and we start replicating the data. Mirroring or always on. Now mirroring is anything less than SQL Server 2016. Or if you are beyond so and turn 2017 then you will have the always on groups available, be it standard or be it enterprise. So what happens if your secondary fails? That's these instance level failure. That's fine. Your application is still continuing to talk to primary service which is sitting in Aza and it will continue to do so and application wouldn't notice anything. And once the secondary AZ instance recovers, it will still start receiving the data from the primary server as it had before. Now what happens if the primarily AZ goes down. That's interesting. Now what's going to happen? You're going to have the failover kick in. It will fail over from your primary role to a secondary instance which is sitting in AZP. Now that takes over AWS, a primary, and then all the application servers are started pointing to that new primary. Now here, because you're using the endpoint, you wouldn't necessarily, your application wouldn't notice anything that has underlying these episode host or instance have changed. It will just continue to work as it was working on the failover cluster instances, sorry, the always on availability pros on EC two or currently on premises with a listener. Similar sort of thing happens here. It will just simply seamlessly fail over the applications to these second instance and your application will continue to work without noticing that instance has been replaced underneath. And once your primary AZ comes back on or the instance comes back on, we will replicate the data from AZB into Aza. So this is all happen automatically. You don't have to set up anything with a single click, whereas the options, what we looked at so far on EC two, those are built and managed by you, whereas in RDs everything is out of the box. You just click one button under the hood, we will do all that automation of creating another instance, backing up your primary databases, replicating that to secondary instance and configuring that always on, mirroring on top of it, and these allowing you to say hey, your database is protected now in the multi AZ. So with a single click, let's look at the failure timeline. So when you look at the multi AZ failure timelines on a standard edition with the Davis mirroring, typically one to two minutes, and because it has genus propagation, if you look at the with always on, it's literally under 10 seconds. Because there is no DNS propagation, it's simply just switch it over to the other side and off he goes. So HadR options, what we looked at so far, we looked at the backup and restore, we looked at the snapshot, we looked at the lock shipping, we looked at the storage replica, the cloud India for Rico. Dr. We looked at the FCIS, we looked at the ags and we looked at the district AG script. And these are the options available for you to work on EC two with the backup restore. You can use that on RDS as well, or RDS itself has an automated backup and multi AZ for what we looked at in terms of your disaster recovery ha scenarios. With that folks, we come to the last topic, which is cloud enables to optimize licensing cost. How we can help optimizing the licensing cons when it comes to SQL server deployment. So this is a typical scenario. On premises you have an enterprise edition SQL server installed in an OLTP environment and it has got two nodes sitting inside a and site b. Now in a primary site you have your OLTP workloads constantly writing the data synchronously replicated to the secondary ac, and we have opened up the secondary site for reporting purposes and you're backing up the data from these. So basically in a nutshell, as soon as you open up the secondary instance you need to pay for the licensing for that. So both sites are constantly used, hence these reason you have licensed both sites. So say six cons on this side and six cons on that side. So total twelve cons implementation with always on ages on premises. Now let's replicate that into ECT on AWS going to do the same number of licenses because you have doing the same thing. There is no improvements on premises in terms of licensing cost. Now let's go into the change a bit. Now what I say is basically my OLTP workloads are coming in the primary, but my secondary instance itself is passive and it's waiting to take over. But there is no activity on that instance. So I don't have to pay for the licensing for those. So I only need six got in this scenario I can create a snapshot, the VSS snapshot of the secondary instance and I spin up another instance from that. That could be enterprise edition or standard edition depending on what features I need opt in and I can only pay for this instance when I use. So it's basically license included instance. You don't have to pay for licensing from your own point of view as a license included instance. So you don't have to pay for extra licensing cons. When you don't use secondary from going from active to passive replica, it may be possible to downgrade from enterprise to standard. So it's basically an option for you to optimize licensing. At the moment you are only onpremises. In previous model you are using twelve cores. Now you have an option of downgrading in an instance edition from enterprise to standard and you're using six for BYOL, whereas the other instance we're using for reporting or the OLED workloads. You can spin up the instance when you need to and that way you can control the cost around the SQL server licensing or also you can use the license related instances with that folks that bring me to the end of my talk. I hope you enjoyed it as much aws I did. And it was lovely talking to you. Have a good day. Thank you.

See all 48 talks at this event!

Conf42 Site Reliability Engineering 2021 - Online

September 30 2021

Microsoft SQL Server HA/DR on AWS

Video size:

Abstract

Summary

Transcript

Asif Mujawar

Specialist Solutions Architect - Database @ AWS

Join the community!

Featured event

2026

2025

Info

Conf42 Site Reliability Engineering 2021 - Online

September 30 2021

Microsoft SQL Server HA/DR on AWS

Video size:

Abstract

Summary

Transcript

Asif Mujawar

Specialist Solutions Architect - Database @ AWS

Join the community!