This post opens a series describing the tech stack we run on.
Our help desk software backend runs (mostly) on Microsoft's stack. The databases are powered by SQL Server, and this post is a detailed description of the data storage architecture behind our "hosted" helpdesk - for our customers, potential customers, and any DBAs who'd like to know more about hosting databases in Amazon's cloud.
All our database servers are hosted by Amazon Web Services (AWS) and use the "i3" family of instances, mainly because "i3" offers a super-fast network connection and modern, insanely fast "NVMe" local SSD drives.
We don't use Amazon's "RDS" instances (the managed database service offered by AWS) because RDS is basically the same virtual server, "optimized" for database hosting. We have developed tons of custom performance tweaks of our own that make the cloud work even faster than Amazon's DB-optimized service. Alex has blogged about this a lot in the past.
Now, in case you're not aware, Amazon cloud servers come with two types of drives - "local" drives and "EBS" ("Elastic Block Store"). Local drives are nothing but regular SSD disks that live on the same physical machine. The data on these drives disappears every time you stop and start your cloud instance (because your cloud server literally launches on new hardware every time). "EBS" drives, on the other hand, are essentially network drives that you "attach" to your machine, and the data persists across restarts.
EBS drives are great. They are immune to hardware failures, so even if your server melts down - you just launch a new one and "re-attach" the disk to the fresh instance... But reliability comes at a price: the biggest problem with EBS drives is that they are expensive and slow.
That is why one of the best database performance tricks we ever came up with was using SQL Server's "buffer pool extension" feature on the local SSD drive. This can speed up a database server by up to 250%. Basically, you're creating an extra cache layer between the server's memory and the network disk, and that cache lives on the lightning-fast local NVMe disk.
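For reference, turning this on is a one-liner in SQL Server 2014 and later (the drive letter, path, and size below are placeholder values, not our actual configuration):

```sql
-- Enable buffer pool extension, backed by a file on the local NVMe drive
-- (path and size are example values - size it to a multiple of your RAM)
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'F:\SSDCache\BufferPoolExtension.bpe', SIZE = 64 GB);
```

Since the extension file is just a cache, losing the local drive on a restart is harmless - SQL Server simply rebuilds it.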
Each database server has:
All disks are encrypted. And (surprise!) we have actually discovered that the encrypted disks perform faster than the "plain" ones, presumably due to a more efficient "warmup" procedure.
We tested many backup strategies, including S3 backups, a "continuous disk-snapshots" strategy, and so on, but finally settled on a very traditional "old-school" scheme.
The database recovery mode is set to "Full" and since every database server has its own special "backup" drive, this is how our databases are backed up to it:
In case something goes wrong, we restore the freshest "full" backup, then restore the latest "diff" backup taken after it (differentials are cumulative, so only the newest one is needed), then restore the "log" backups one by one. This way we can restore the database to an exact "point in time".
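In T-SQL, that restore chain looks roughly like this (the database name, file paths, and timestamp are made up for illustration):

```sql
-- 1. Restore the latest full backup, leaving the database in "restoring" state
RESTORE DATABASE HelpDesk
    FROM DISK = 'E:\Backups\HelpDesk_full.bak'
    WITH NORECOVERY;

-- 2. Restore the newest differential backup taken after that full backup
RESTORE DATABASE HelpDesk
    FROM DISK = 'E:\Backups\HelpDesk_diff.bak'
    WITH NORECOVERY;

-- 3. Replay log backups one by one, stopping at the desired point in time
RESTORE LOG HelpDesk
    FROM DISK = 'E:\Backups\HelpDesk_log1.trn'
    WITH NORECOVERY;

RESTORE LOG HelpDesk
    FROM DISK = 'E:\Backups\HelpDesk_log2.trn'
    WITH STOPAT = '2017-06-01 14:30:00', RECOVERY;
```

The key detail is `WITH NORECOVERY` on every step except the last one - it keeps the database in a restoring state so the next backup in the chain can be applied.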
(Can I just add that we have never, ever had to do that in 9 years in business?)
The backup drive itself is also backed up daily using Amazon's built-in "disk snapshot" feature.
Interesting find: we discovered that the backup drive should actually be slower than the database storage drive. This way you're naturally "throttling" the backup procedure so it does not slow your app down... You want the "write to backup" operation to be the performance bottleneck, not the "read from the database" one. It's OK for the database disk to sit and "wait" while the data is being written to the backup (because this does not affect your end users). What's not OK is backing up to a super-fast SSD drive, so that reading speed becomes the bottleneck: the backup procedure will suck all the available I/O from your database disk, and your users will suffer. That is exactly why we can afford to use the cheapest EBS drive from AWS - the "cold HDD" (sc1) disk type, which will only cost you about $25/month for a 1TB drive.
Our last recommendation to everyone running in the cloud is to always keep a stopped copy of every server you have. A stopped instance costs you almost nothing (you only pay for its EBS storage) - it's basically just a line in some config file deep inside Amazon's infrastructure. But in case things go downhill, you don't have to waste time investigating what went wrong, which driver caused the BSOD, or trying to revive the server. You just shoot the dying horse and switch the disk drives over to the emergency server.