When Just IOPS Aren’t Enough – Optimal EBS Bandwidth Calculation

The current EBS offering in the Amazon EC2 environment provides really good performance, and the marketing material and documentation emphasise how many IOPS the different EBS types deliver, but a less talked-about aspect is the bandwidth EBS can offer. Traditional SATA-attached magnetic hard disks can usually provide speeds of 40-150 MiB/s but a really low number of IOPS – somewhere between 50 and 150 depending on rotation speed. With the Provisioned IOPS and GP2 EBS volume types it’s easy to focus on just getting enough IOPS, but it’s important not to forget how much bandwidth the instance can get from the disks.

The EBS bandwidth in EC2 is limited by a few different factors:

  • The maximum bandwidth of the EBS block device
  • EC2 instance type
  • Whether the EBS-Optimised flag is turned on for the instance

In most instance types the EBS traffic and the instance network traffic flow over the same physical NIC. In these cases the EBS-Optimised flag simply adds a QoS marker to the EBS packets. According to the AWS documentation some instance types have a dedicated network interface for EBS traffic, so they don’t need to offer a separate EBS-Optimised mode; the documentation lists these instances as natively always EBS-Optimised. The c4, d2 and m4 instance types seem to have this dedicated EBS NIC in the physical host.

The EBS volume type itself has bandwidth limitations. According to the documentation GP2 has a maximum throughput of 160 MiB/s, Provisioned IOPS volumes 320 MiB/s and magnetic volumes 40-90 MiB/s. GP2 and Provisioned IOPS bandwidth is determined by the volume size, so a small volume will not achieve the maximum bandwidth.

So to get maximum bandwidth from EBS in a single EC2 instance you should choose an instance type which is always EBS-Optimised, calculate the maximum bandwidth of your EBS volume types, and usually combine several EBS volumes together with striped LVM to get the best performance. For example a c4.4xlarge has a maximum EBS bandwidth of 250 MiB/s and would require two GP2 EBS volumes (2 * 160 MiB/s) to max it out. According to my tests a single striped LVM logical volume backed by two EBS volumes can sustain a constant read or write speed of 250 MiB/s while also transferring at 1.8 Gbps over the Ethernet network to another machine.
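
The arithmetic above can be sketched as a small script, together with the striped-LVM setup it implies. The device and volume group names in the comments are illustrative assumptions, not from any particular setup:

```shell
#!/bin/sh
# How many GP2 volumes does a c4.4xlarge need to saturate its EBS link?
INSTANCE_BW=250   # MiB/s, c4.4xlarge EBS-Optimised bandwidth cap
GP2_MAX_BW=160    # MiB/s, per-volume GP2 throughput ceiling
# Integer ceiling division: (250 + 159) / 160 = 2
VOLUMES=$(( (INSTANCE_BW + GP2_MAX_BW - 1) / GP2_MAX_BW ))
echo "GP2 volumes needed: $VOLUMES"

# Combining the volumes with striped LVM (device names are hypothetical):
#   pvcreate /dev/xvdf /dev/xvdg
#   vgcreate data_vg /dev/xvdf /dev/xvdg
#   lvcreate --stripes 2 --stripesize 256k --extents 100%FREE --name data_lv data_vg
#   mkfs.xfs /dev/data_vg/data_lv
```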

Q&A on MongoDB backups on Amazon EC2

I recently got this question from one of my readers, and I’m posting my response here for future reference:

I was impressed about your post about MongoDB because I have similar setup at my company and I was thinking maybe you could give me an advice.

We have production servers with mongodb 2.6 and replica set. /data /journal /log all separate EBS volumes. I wrote a script that taking snapshot of production secondary /data volume every night. The /data volume 600GB and it takes 8 hours to snapshot using aws snapshot tool. In the morning I restore that snapshot to QA environment mongodb and it takes 1 minute to create volume from snapshot and attach volume to qa instance. Now my boss saying that taking snapshot on running production mongodb drive might bring inconsistency and invalidity of data. I found on internet that db.fsynclock would solve the problem. But what is going to happen if apply fsynclock on secondary (replica set) for 8 hours no one knows.
We store all data (data+journal+logs) on the same EBS volume. That’s also what the MongoDB documentation suggests: “To get a correct snapshot of a running mongod process, you must have journaling enabled and the journal must reside on the same logical volume as the other MongoDB data files.” (that doc is for 3.0 but it also applies to 2.x)
I suggest that you switch to having data+journal on the same EBS volume, and after that you should be just fine doing snapshots. The current GP2 SSD disks allow big volumes and a large number of IOPS, so you might be able to get away with having just one EBS volume instead of combining several volumes with LVM. If you end up using LVM, make sure you use the LVM snapshot sequence which I described in my blog http://www.juhonkoti.net/2015/01/26/tips-and-caveats-for-running-mongodb-in-production
I also suggest that you take snapshots more often than just once per night. The EBS snapshot system stores only the new modifications, so the more often you take snapshots, the faster each one completes. We do it once per hour.
Also, after the EBS snapshot API call has completed and the snapshot process has started, you can resume all operations on the disk that was just snapshotted. In other words: the data is frozen at an atomic moment during the EBS snapshot API call, and the snapshot will contain exactly the data as it was at that moment. The snapshot progress only tells you when you can restore a new EBS volume from the snapshot, and your volume IO performance is degraded a bit while the snapshot is being copied to S3 behind the scenes.
If you want to use fsyncLock (which, by the way, should not be required if you use the MongoDB journal) then implement the following sequence and you are fine:
  1. fsyncLock (db.fsyncLock())
  2. XFS freeze (xfs_freeze -f /mount/point)
  3. EBS snapshot
  4. XFS unfreeze (xfs_freeze -u /mount/point)
  5. fsyncUnlock (db.fsyncUnlock())
The entire process should not take more than a dozen or so seconds.
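
The five steps above can be sketched as a shell script. By default it only prints the commands instead of running them (DRY=echo); the mount point and volume id are placeholder assumptions:

```shell
#!/bin/sh
# Dry-run sketch of the fsyncLock + XFS freeze + EBS snapshot sequence.
# Set DRY= (empty) to actually execute; defaults to printing the commands.
DRY=${DRY:-echo}
MOUNT=${MOUNT:-/data}                        # placeholder mount point
VOL_ID=${VOL_ID:-vol-0123456789abcdef0}      # placeholder EBS volume id

$DRY mongo --eval 'db.fsyncLock()'           # 1. flush writes and lock mongod
$DRY xfs_freeze -f "$MOUNT"                  # 2. freeze the XFS filesystem
$DRY aws ec2 create-snapshot --volume-id "$VOL_ID" \
     --description "mongodb-backup"          # 3. point-in-time EBS snapshot
$DRY xfs_freeze -u "$MOUNT"                  # 4. unfreeze XFS
$DRY mongo --eval 'db.fsyncUnlock()'         # 5. let mongod resume writes
```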


Tips and caveats for running MongoDB in production

A friend of mine recently asked about tips and caveats when he was planning a production MongoDB installation. It’s a really nice database for many use cases, but like every other database it has its quirks. As we have been running MongoDB for several years we have encountered quite a few bugs and issues with it. Most have been fixed over the years, but some still persist. Here’s my take on a few good-to-know points:

A slave with replication lag can receive slaveOk=true queries

MongoDB replication is asynchronous. The master stores every operation in an oplog, which the slaves read one operation at a time, applying the commands to their own data. If a slave can’t keep up it will fall behind and thus won’t contain all the updates the master has already seen. If you are running software which issues queries with slaveOk=true, then mongos and some of the client drivers can direct those queries to one of the slaves. If your slave is lagging behind with its replication, there’s a very good chance that your application will get stale data and thus might end up corrupting your data set logically. A ticket has been acknowledged but not scheduled for implementation: 3346.

There are two options: you can dynamically check the replication lag in your application and drop the slaveOk=true property when the lag grows too big, or you can reconfigure your cluster and hide the lagging slave so that mongos will not route slaveOk queries to it. This brings us to the second problem:

Reconfiguring the cluster often causes it to drop the primary for 10-15 seconds.

There’s really no other way of saying this: this sucks. There are a number of operations which still, after all these years, cause the MongoDB cluster to throw its hands in the air, drop the primary from the cluster and completely rethink who should be the new master – usually ending up keeping the exact same master as before. There have been numerous Jira issues about this, but they’re half closed as duplicates and half resolved: 6572, 5788, 7833, plus more.

Keep your oplog big enough, but cluster size small enough.

If your database is getting thousands of updates per second, the time window the oplog can hold will start to shrink. If your database is also growing, some operations might take so long that they can no longer complete within the oplog time window. Repairs, relaunches and backup restores are the main problems. We had one database whose 100 GB oplog could hold just about 14 hours of operations – not nearly enough to let the ops guys sleep well. Another problem is that in some cases the oplog will mostly live in active memory, which penalises overall database performance as the hot cacheable data set shrinks.

Solutions? Either manually partition your collections into several distinct MongoDB clusters or start using sharding.

A word on backups

This is not a MongoDB-specific issue: backups can be hard to implement. After a few tries, here’s our way, which has served us really well. We use AWS, so we’re big fans of Provisioned IOPS volumes. We mount several EBS volumes into the machine, keeping each volume under 300 GB if possible so that AWS EBS snapshots won’t take forever. We then use LVM with striping to combine the EBS volumes into one LVM volume group. On top of that we create a logical volume which spans 80% of the available space, and we create an XFS filesystem on it. The remaining 20% is left both for backup snapshots and as emergency space if we need to quickly enlarge the volume. XFS allows growing the filesystem without unmounting it, right on a live production system.

A snapshot is then done with the following sequence:

  1. Create a new LVM snapshot. Internally this does an XFS lock and fsync, ensuring the filesystem is in a fully consistent state. This causes MongoDB to freeze for around four seconds.
  2. Create EBS snapshots of each underlying EBS volume. We tag each volume with a timestamp, its position in the stripe, a stripe id and a “lineage” which we use to identify the data living on the volume set.
  3. Remove the LVM snapshot. The EBS volume performance is now degraded until the snapshots complete. This is one of the reasons why we want to keep each EBS volume small enough; we usually have 2-4 EBS volumes per LVM group.
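
The three-step sequence above, sketched as a dry-run script. The volume group, logical volume and EBS volume ids are all hypothetical, and the tags are reduced to a description string for brevity:

```shell
#!/bin/sh
# Dry-run sketch of the LVM + EBS backup sequence; set DRY= to execute.
DRY=${DRY:-echo}
VG=${VG:-data_vg}                 # hypothetical volume group name
LV=${LV:-data_lv}                 # hypothetical logical volume name
STAMP=$(date +%Y%m%d%H%M)

# 1. LVM snapshot into the space left unallocated in the volume group
#    (this is the step that freezes MongoDB for a few seconds)
$DRY lvcreate --snapshot --size 10G --name "${LV}_snap" "$VG/$LV"

# 2. EBS snapshot of every volume in the stripe, labelled so a restore
#    can find the matching set later (volume ids are placeholders)
for vol in vol-aaaa1111 vol-bbbb2222; do
  $DRY aws ec2 create-snapshot --volume-id "$vol" \
       --description "lineage=mongo-prod stripe-member=$vol time=$STAMP"
done

# 3. Drop the LVM snapshot; EBS keeps copying to S3 in the background
$DRY lvremove -f "$VG/${LV}_snap"
```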

Restore is done in reverse order:

  1. Use the AWS API to find the most recent set of EBS snapshots for a given lineage which contains all the EBS volumes and whose snapshots have completed successfully.
  2. Create new EBS volumes from the snapshots and mount the volumes into the machine.
  3. Spin up the LVM so that the kernel finds the new volumes. The volume group will contain the actual filesystem logical volume and the snapshot. The filesystem volume is inconsistent and cannot be used as-is.
  4. Restore the snapshot into the volume. The snapshot contains the consistent state we want to use, so we merge it into the volume the snapshot was taken from.
  5. The volume is now ready to use. Remove the snapshot.
  6. Start MongoDB. It will replay the journal and then start reading the oplog from the master to catch up with the rest of the cluster. Because the volumes were created from snapshots, the new disks will be slow for at least an hour, so don’t be alarmed if mongostat says the new slave isn’t doing anything. It will, eventually.
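
In LVM terms, steps 3-5 of the restore map to activating the volume group and merging the snapshot back into its origin with lvconvert --merge. A dry-run sketch with hypothetical ids and names:

```shell
#!/bin/sh
# Dry-run sketch of the restore sequence; set DRY= to execute for real.
DRY=${DRY:-echo}
VG=${VG:-data_vg}                 # hypothetical volume group name
LV=${LV:-data_lv}                 # hypothetical logical volume name

# 2. Create volumes from the chosen snapshots and attach them (ids hypothetical)
$DRY aws ec2 create-volume --snapshot-id snap-aaaa1111 --availability-zone us-east-1a
$DRY aws ec2 attach-volume --volume-id vol-new11111 --instance-id i-0abc123 --device /dev/xvdf

# 3. Let the kernel discover and activate the restored volume group
$DRY vgscan
$DRY vgchange -ay "$VG"

# 4.+5. Merge the snapshot back into the origin volume; the snapshot
#       is removed automatically once the merge finishes
$DRY lvconvert --merge "$VG/${LV}_snap"

# 6. Mount and start MongoDB; it replays the journal and then syncs the oplog
$DRY mount "/dev/$VG/$LV" /data
$DRY mongod --config /etc/mongod.conf
```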

Watch out for orphan Map-Reduce operations

If a client doing a map-reduce gets killed, the map-reduce operation might get stuck and keep using resources. You can kill these operations, but even the kill can take some time. Just keep an eye out for them.
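
One way to keep that eye out is to poll db.currentOp() periodically and flag long-running map-reduce commands. The 600-second threshold and the operation field names checked here are assumptions, so verify them against your MongoDB version before actually killing anything:

```shell
#!/bin/sh
# Dry-run sketch: list map-reduce operations running longer than 10 minutes.
# Set DRY= to execute; the printed opids can then be fed to db.killOp().
DRY=${DRY:-echo}
$DRY mongo --quiet --eval '
  db.currentOp().inprog.forEach(function (op) {
    if (op.query && op.query.mapreduce && op.secs_running > 600) {
      print("long map-reduce: opid=" + op.opid + " secs=" + op.secs_running);
      // db.killOp(op.opid);   // uncomment to actually kill it
    }
  })'
```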