What’s the real RPO for databases on AWS RDS and why you’re probably wrong?

Recently I’ve been doing some research on how MySQL databases on AWS RDS can be restored. The customer requirement was an RPO (recovery point objective) of 15 minutes for databases. Can we meet the requirement with AWS RDS alone?

At first I thought so, but then I considered what happens when someone does an “accidental deletion” of the RDS instance. Yeah, I know we can introduce additional preventive measures to make this event highly unlikely (and I’ll describe them in this post), but… when it comes to backup and recovery I think we need to be prepared for everything :)

But let’s start from the basics :)

AWS RDS snapshots and point-in-time recovery basics

When you create a new RDS instance in AWS you can enable automatic backups, which I highly recommend. You set a backup window and a retention period and AWS will take an automatic snapshot of your database once a day and also delete snapshots older than your retention period. Nice :)

Having automatic backups enabled also gives you the ability to do a point-in-time restore: AWS backs up transaction logs every 5 minutes, so you can restore your instance to any point in time within the retention period, up to the last log backup, which is typically no more than 5 minutes old. You cannot restore to a state older than the oldest snapshot. That makes things even nicer :)

That means the RPO of an RDS instance is about 5 minutes.
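
To check a running instance against the 15-minute requirement, one option is to compare the current time with the instance's latest restorable time. Below is a minimal sketch, assuming the timestamp has already been read from the LatestRestorableTime field that the RDS API reports for an instance. The function names and the RPO_TARGET constant are made up for illustration:

```python
from datetime import datetime, timedelta

# Target taken from the customer requirement mentioned at the start of the post.
RPO_TARGET = timedelta(minutes=15)


def data_loss_window(latest_restorable: datetime, now: datetime) -> timedelta:
    """Return how much recent data could not be restored right now."""
    # Clamp at zero in case of small clock differences.
    return max(now - latest_restorable, timedelta(0))


def meets_rpo(latest_restorable: datetime, now: datetime,
              target: timedelta = RPO_TARGET) -> bool:
    """True when the current data loss window fits within the target."""
    return data_loss_window(latest_restorable, now) <= target
```

A monitoring job could call meets_rpo periodically and alert when it returns False, but that wiring is outside the scope of this sketch.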

The above-mentioned features handle 99% of customers’ needs, but let’s consider what other problems come with this way of running a database and how we can lose the data anyway :)

What if…

The first thing to consider is that snapshots are made for the whole instance. It means you cannot easily restore a single database. It’s all-or-nothing.

It’s also worth mentioning that the restore process creates a new RDS instance, so you have to reconfigure your applications to point to that new instance. That makes the RTO for your business service longer and also makes it harder to e.g. restore some, but not all, databases hosted on an RDS instance.

Of course you can spin up a new instance and just copy data from the restored database to the original one, but it’s not an out-of-the-box solution that RDS provides and you have to take special care of the integrity of the data you copy.

It’s also wrong to think that after you delete the instance you can restore it from one of the automatic backups. You cannot:

All automated backups are deleted when you delete a DB instance. After you delete a DB instance, the automated backups can’t be recovered. If you choose to have Amazon RDS create a final DB snapshot before it deletes your DB instance, you can use that to recover your DB instance.
Manual snapshots are not deleted.
[ source ]

Thus, for additional safety of your data it may be a good idea to also take manual snapshots of your instance.
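
A minimal sketch of taking such a manual snapshot from a script, assuming rds_client is a boto3 RDS client (e.g. boto3.client("rds")). The helper names and the "-manual-" naming scheme are my own assumptions and not anything RDS requires:

```python
from datetime import datetime, timezone


def snapshot_id(instance_id: str, when: datetime) -> str:
    """Build a snapshot identifier such as 'mydb-manual-20200101-1200'."""
    return f"{instance_id}-manual-{when:%Y%m%d-%H%M}"


def take_manual_snapshot(rds_client, instance_id: str) -> str:
    """Request a manual snapshot and return its identifier.

    rds_client is assumed to be a boto3 RDS client; the call only starts
    the snapshot, it does not wait for it to become available.
    """
    identifier = snapshot_id(instance_id, datetime.now(timezone.utc))
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=identifier,
        DBInstanceIdentifier=instance_id,
    )
    return identifier
```

Run on a schedule, this gives you recovery points that survive deletion of the instance, although only as fresh as the schedule itself.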

An interesting case about deletion of the instance was mentioned on the AWS forum:

Just one warning: if you have a failed instance, please don’t delete it before restoring it into a new instance. Automated backups and binlogs are all deleted if your instance is deleted (only manual snapshots and final snapshot are retained, and the final snapshot may not be useful (or RDS may not be able to take it). So you want to restore first, then delete the original (you can rename the new instance later, so it takes the place of the failed instance after you delete it).
[ source ]

Now, let’s play devil’s advocate and assume the worst case:

Deletion of the instance and other bad things

This can happen. A quick search turns up many cases like that, e.g. #1, #2, #3. Actually I’m surprised how often it happens :P

You should also consider that your account can be compromised and an intruder may delete your database.

Before I show you how to protect your databases from accidental deletion, let me say again what happens when you delete an RDS instance.

As I quoted above: you’ll lose all automatic snapshots and the ability to do a point-in-time restore. If the deletion happened without making a final snapshot and you don’t have any manual snapshots, you’ll lose all the data from the RDS instance without a way to restore it.
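
To make that concrete, here is a small sketch that works out what is left after a deletion. It only encodes the rule quoted earlier (automated backups are gone, manual and final snapshots stay); the function names are hypothetical and nothing here talks to AWS:

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_recovery_point(manual_snapshots: Iterable[datetime],
                          final_snapshot: Optional[datetime] = None
                          ) -> Optional[datetime]:
    """Newest snapshot that survives deletion, or None if nothing is left.

    Automated backups are deliberately not an input: they are deleted
    together with the instance.
    """
    candidates = list(manual_snapshots)
    if final_snapshot is not None:
        candidates.append(final_snapshot)
    return max(candidates, default=None)


def data_lost(deleted_at: datetime,
              manual_snapshots: Iterable[datetime],
              final_snapshot: Optional[datetime] = None
              ) -> Optional[timedelta]:
    """Time span of lost data, or None when everything is lost."""
    point = latest_recovery_point(manual_snapshots, final_snapshot)
    return None if point is None else deleted_at - point
```

With a single manual snapshot taken the day before, data_lost reports a full day, which is a long way from 5 minutes.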

If this is a production database it’s called a “career-ending event” :P

The above means that you have to take into account some additional preventive measures to avoid such events.

How to protect your databases from deletion?

There are a couple of ways you can minimize the risk of data loss in case of database deletion:

Firstly, enable database deletion protection: this feature will reduce the risk of accidental deletion, especially when using scripts, Terraform (but make sure that Terraform won’t change the deletion_protection flag!) etc.
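
A rough sketch of how one might audit and fix this flag from a script. It assumes rds_client is a boto3 RDS client and that describe_response has the shape returned by its describe_db_instances call; the function names are hypothetical:

```python
from typing import Dict, List


def unprotected_instances(describe_response: Dict) -> List[str]:
    """Return identifiers of instances without deletion protection.

    describe_response is assumed to look like the result of
    rds_client.describe_db_instances(); results are paginated, so a
    real script would have to walk through all pages.
    """
    return [
        db["DBInstanceIdentifier"]
        for db in describe_response.get("DBInstances", [])
        if not db.get("DeletionProtection", False)
    ]


def enable_deletion_protection(rds_client, instance_id: str) -> None:
    """Switch deletion protection on for a single instance."""
    rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DeletionProtection=True,
        ApplyImmediately=True,
    )
```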

Secondly, take manual snapshots: this way you can restore an instance from a snapshot. It won’t recover all the data, as anything written after the snapshot time will be lost, but it’s better than nothing.

Thirdly, ship snapshots to another account:

Today’s big news is that you can now share unencrypted MySQL, Oracle, SQL Server, and PostgreSQL snapshots with other AWS accounts. If you, like many sophisticated AWS customers, use separate AWS accounts for development, testing, and production, you can now share snapshots between AWS accounts in a controlled fashion. If a late-breaking bug is discovered in a production system, you can create a database snapshot and then share it with select developers so that they can diagnose the problem without having to have access to the production account or system.
[ source ]

Unfortunately at the moment it doesn’t work with encrypted snapshots, but I hope AWS will introduce such a feature soon :(
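
A sketch of how sharing a manual snapshot could look in code, assuming boto3 RDS clients for the source and target accounts. The helper names are hypothetical. Note that a shared snapshot still belongs to the source account, so the second helper copies it inside the target account to get a copy that does not depend on the source account:

```python
import re


def share_snapshot(rds_client, snapshot_id: str, account_id: str) -> None:
    """Let another AWS account restore (and copy) a manual snapshot."""
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    rds_client.modify_db_snapshot_attribute(
        DBSnapshotIdentifier=snapshot_id,
        AttributeName="restore",
        ValuesToAdd=[account_id],
    )


def copy_shared_snapshot(target_rds_client, source_snapshot_arn: str,
                         target_snapshot_id: str) -> None:
    """Run in the target account: copy the shared snapshot so it is owned there."""
    target_rds_client.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source_snapshot_arn,
        TargetDBSnapshotIdentifier=target_snapshot_id,
    )
```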

Fourthly, make sure no one can delete your database using IAM. For production I suggest adding an explicit deny statement for all privileged users and roles (an explicit deny takes precedence over any allow statements):

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Deny",
         "Action":"rds:DeleteDBInstance",
         "Resource":"arn:aws:rds:us-west-2:123456789012:db:my-mysql-instance"
      }
   ]
}
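
To illustrate why the explicit deny wins, here is a deliberately simplified toy model of policy evaluation. It is not the real IAM algorithm: it ignores principals, conditions, NotAction, permission boundaries and so on, and it matches names case-sensitively. All function names are made up for this example:

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def _as_list(value) -> List[str]:
    """Policy fields may hold a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def _matches(statement: Dict, action: str, resource: str) -> bool:
    """True when the statement's Action and Resource patterns both match."""
    return (
        any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        and any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"]))
    )


def is_allowed(statements: List[Dict], action: str, resource: str) -> bool:
    """Toy evaluation: any matching Deny wins, otherwise an Allow is needed."""
    matching = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)
```

Even if a user has an administrator-style "allow everything" statement, adding the deny statement above makes is_allowed return False for rds:DeleteDBInstance on that instance, while other actions stay allowed.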

If you want an even higher security level then you may be interested in AWS Organizations Service Control Policies. This way you can blacklist some actions for customer-created principals across a whole AWS account. Below you can find a sample policy to blacklist deletion of all databases on the account that the SCP is attached to:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowsAllActions",
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        },
        {
            "Sid": "DenyRDSDeletion", 
            "Effect": "Deny",
            "Action": "rds:DeleteDBInstance",
            "Resource": "*"
        }
    ]
}

Using an SCP gives you a very high level of security, because to delete an RDS instance you have to sign in to the master account and alter the policy first, so it can be a very effective strategy to protect your RDS instances :)

Fifthly, do some old-school MySQL dumps and ship them to S3: if you cannot sleep at night because you’re afraid of AWS losing an availability zone or even a region (highly unlikely, I know :P) it may be a good idea to just do a dump of the database and store it in S3 or maybe… even in another cloud provider’s bucket? :)
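
A minimal sketch of such a dump job, assuming mysqldump is installed on the machine running it, credentials come from an option file such as ~/.my.cnf, and s3_client is a boto3 S3 client. The key layout, the default /tmp working directory and the function names are my own assumptions:

```python
import subprocess
from datetime import datetime
from typing import List


def dump_command(host: str, user: str, database: str) -> List[str]:
    """Build the mysqldump command line for a consistent InnoDB dump."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot without locking InnoDB tables
        "--routines",            # include stored procedures and functions
        f"--host={host}",
        f"--user={user}",
        database,
    ]


def dump_key(database: str, when: datetime) -> str:
    """Build a dated S3 object key for the dump file."""
    return f"mysql-dumps/{database}/{when:%Y/%m/%d}/{database}-{when:%H%M%S}.sql"


def dump_to_s3(s3_client, bucket: str, host: str, user: str,
               database: str, when: datetime, workdir: str = "/tmp") -> str:
    """Dump one database to a local file, upload it, and return the S3 key."""
    local_path = f"{workdir}/{database}.sql"
    with open(local_path, "wb") as out:
        subprocess.run(dump_command(host, user, database), stdout=out, check=True)
    key = dump_key(database, when)
    s3_client.upload_file(local_path, bucket, key)
    return key
```

Keep in mind that the RPO of this approach equals the interval between dumps, so it complements snapshots rather than replacing them.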

Sixthly, if you use Terraform make sure that the RDS instance has proper lifecycle settings to prevent destroying it:

resource "aws_db_instance" "main" {
  ...
  deletion_protection = true

  lifecycle {
    prevent_destroy = true
  }
}

The lifecycle setting will cause Terraform to fail on any operation that would delete the resource, and the deletion_protection flag will block deletion on the AWS side, as described in the first point.

So, what’s the RPO for your databases?

As you can see above, RDS has a nice RPO of 5 min, but I wouldn’t consider a database service built upon RDS alone, without proper preventive measures, as a service with a 5-minute RPO. In case of any “accidental deletion” you are in serious trouble, with possible data loss included.

That’s why I strongly suggest not depending on RDS alone, but introducing additional security policies as well as an additional backup strategy. You know, just in case :)

After we ensure that no one can delete our instance we can say with relief that we have a 5 min RPO :)