I'm running a three-machine MongoDB 2.2 cluster in Amazon EC2. One of these machines is used to take hourly snapshots with LVM and EBS, and I noticed a rare bug which leads to silent data corruption during the restore phase. I'm using Rightscale to configure the machines with my own ServerTemplate, which I enhance with some Rightscale Chef recipes for automated snapshots and restores. Rightscale needs to support multiple platforms, of which AWS is just one, and they have carefully constructed these steps to perform the snapshot.
Each machine has two provisioned IOPS EBS volumes attached. The Rightscale block_device::setup_block_device recipe creates an LVM volume group on top of these raw devices. Because EC2 can't take an atomic snapshot across multiple EBS volumes simultaneously, an LVM snapshot is used for this. So the backup steps are (see the shell sketch after the list):
- Lock MongoDB against writes and flush the journal to disk to form a checkpoint, using the db.fsyncLock() command.
- Lock the underlying XFS filesystem
- Take the LVM snapshot
- Unlock the XFS filesystem
- Unlock MongoDB with db.fsyncUnlock()
- Perform an EBS snapshot of each underlying EBS volume
- Delete the LVM snapshot, so that it doesn't take up disk space.
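For illustration, here is roughly what that sequence looks like in shell form. The volume group name (vg-data), logical volume name (lvol0), mount point (/mnt/storage), snapshot size and volume ids are my own placeholders, not the actual Rightscale recipe code:

# 1. Block writes and flush a checkpoint
mongo --eval 'db.fsyncLock()'
# 2. Freeze the XFS filesystem so nothing else reaches the block device
xfs_freeze -f /mnt/storage
# 3. Take the LVM snapshot; this is the single atomic point in time
lvcreate --snapshot --size 10G --name lvol0-snap /dev/vg-data/lvol0
# 4. Thaw the filesystem
xfs_freeze -u /mnt/storage
# 5. Let MongoDB resume writes
mongo --eval 'db.fsyncUnlock()'
# 6. Snapshot each underlying EBS volume
ec2-create-snapshot vol-aaaaaaaa
ec2-create-snapshot vol-bbbbbbbb
# 7. Drop the LVM snapshot so it stops consuming disk space
lvremove -f /dev/vg-data/lvol0-snap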
Notice that the main volume will start getting writes after step 5, before the EBS volumes have been snapshotted by Amazon EC2. This point is crucial for understanding the bug later. The restore procedure, implemented in the block_device::do_primary_restore Chef recipe, does the following steps (again, see the sketch after the list):
- Order EC2 to create new EBS volumes from each EBS snapshot and wait until the API says that the volumes have attached correctly
- Spin up the LVM volume group
- Mount the snapshot first read-write so that XFS can replay its journal, then remount it read-only
- Mount the main data volume
- Use rsync to sync from the snapshot into the main data volume: rsync -av --delete --inplace --no-whole-file /mnt/snapshot /mnt/storage
- Delete the LVM snapshot
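A similarly rough sketch of the restore, with the same placeholder names (snapshot and instance ids are also placeholders):

# 1. Create fresh EBS volumes from the snapshots, attach them, and wait until
#    the API reports them as attached
ec2-create-volume --snapshot snap-aaaaaaaa -z us-east-1a
ec2-create-volume --snapshot snap-bbbbbbbb -z us-east-1a
ec2-attach-volume vol-cccccccc -i i-12345678 -d /dev/sdf
ec2-attach-volume vol-dddddddd -i i-12345678 -d /dev/sdg
# 2. Bring up the LVM volume group on the new devices
vgscan
vgchange -a y vg-data
# 3. Mount the snapshot read-write so XFS can replay its journal, then remount it read-only
mount /dev/vg-data/lvol0-snap /mnt/snapshot
mount -o remount,ro /dev/vg-data/lvol0-snap /mnt/snapshot
# 4. Mount the main data volume (XFS may need -o nouuid when the origin and its
#    snapshot are mounted side by side)
mount -o nouuid /dev/vg-data/lvol0 /mnt/storage
# 5. Sync the consistent snapshot over the inconsistent main volume
rsync -av --delete --inplace --no-whole-file /mnt/snapshot /mnt/storage
# 6. Drop the LVM snapshot
lvremove -f /dev/vg-data/lvol0-snap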
Actual bug
MongoDB uses the mmap() syscall to memory-map its data files from disk into memory. This makes the file-layer implementation easier, but it creates other problems. The biggest issue is that the MongoDB daemon can't know when the kernel flushes the writes to disk. This is also crucial information needed to understand this bug.
Rsync is very smart about optimizing the comparison and the sync. By default, rsync first looks at a file's last modification time and size to determine whether the file has changed at all; only then does it run the more sophisticated comparison that sends the changed data over the network. This makes it very fast.
Now the devil of this bug comes from the kernel itself. It turns out that the kernel has a long-standing bug (dating all the way back to 2004!) where, on XFS, the mtime (last modification time) of an mmap()'ed file is in some cases not updated when the kernel flushes the writes to disk. Because of this, some of the data which MongoDB writes to the memory-mapped files after the LVM snapshot and before the EBS snapshot is flushed to disk, but the file mtime is not updated.
Because the Rightscale restoration procedure uses rsync to sync the inconsistent main data volume from the consistent snapshot, rsync does not notice that some of these files have changed. Because of this, rsync in fact fails to properly reconstruct the data volume, and the result is corrupted data.
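As an aside, this also shows the knob that would have sidestepped the problem: telling rsync not to trust timestamps at all. This is just an illustration, not the official fix:

# --checksum compares file contents instead of the mtime+size quick check,
# so files whose mtime the kernel never updated still get synced
rsync -av --delete --inplace --no-whole-file --checksum /mnt/snapshot /mnt/storage
# --ignore-times skips the quick check entirely and runs the delta transfer on every file
rsync -av --delete --inplace --no-whole-file --ignore-times /mnt/snapshot /mnt/storage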
When you try to start MongoDB on top of this kind of corrupted data, you will encounter weird assertion errors like these:
Assertion: 10334:Invalid BSONObj size: -286331154 (0xEEEEEEEE) first element
and
ERROR: writer worker caught exception: BSONElement: bad type 72 on:
I first thought that this was a bug in the MongoDB journal playback. The guys at 10gen were happy to assist me, and after more careful digging I became more suspicious of the snapshot method itself. It took quite a lot of detective work, by trial and error, until I finally started to suspect the rsync phase of the restoration.
The kernel bug thread had a link to this program, which reproduces the actual bug in Linux, and running it confirmed that my system was still suffering from this very same bug:
[root@ip-10-78-10-22 tmp]# uname -a
Linux ip-10-78-10-22 2.6.18-308.16.1.el5.centos.plusxen #1 SMP Tue Oct 2 23:25:27 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-10-78-10-22 tmp]# dd if=/dev/urandom of=/mnt/storage/foo.txt
20595+0 records in
20594+0 records out
10544128 bytes (11 MB) copied, 1.75957 seconds, 6.0 MB/s
[root@ip-10-78-10-22 tmp]# /tmp/mmap-test /mnt/storage/foo.txt
Modifying /mnt/storage/foo.txt...
Flushing data using sync()...
Failure: time not changed.
Not modifying /mnt/storage/foo.txt...
Flushing data using msync()...
Success: time not changed.
Not modifying /mnt/storage/foo.txt...
Flushing data using fsync()...
Success: time not changed.
Modifying /mnt/storage/foo.txt...
Flushing data using msync()...
Failure: time not changed.
Modifying /mnt/storage/foo.txt...
Flushing data using fsync()...
Failure: time not changed.
Conclusions
I'm now working with both Rightscale and 10gen so that others won't suffer from this bug. Mainly this means a few documentation tweaks on the 10gen side and maybe a change to the backup procedures on the Rightscale side. Note that this bug does not happen unless you are snapshotting a database which uses mmap(), so at least MySQL and PostgreSQL are not affected.
This issue is a reminder that you should periodically test your backup methods by doing a restore and verifying that your data is intact. Debugging this strange bug took me about a week. It contained all the classic pieces of a good detective story: false leads, dead ends, a sudden plot twist and a happy ending :)
Can you post the MongoDB JIRA ticket (if you have one) so we can follow?
Also, you don't say what the correct procedure should be.
Hi Matt!
Here are the relevant tickets:
https://jira.mongodb.org/browse/SERVER-8518
https://jira.mongodb.org/browse/DOCS-1212
You can check whether your system is affected by running this: https://bugzilla.kernel.org/attachment.cgi?id=14430
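Roughly like this (the attachment is a small C test program; the file names here are just examples):

# Download and build the test case from the kernel bugzilla
wget -O mmap-test.c 'https://bugzilla.kernel.org/attachment.cgi?id=14430'
gcc -o mmap-test mmap-test.c
# Create a test file on the XFS filesystem you snapshot, then run the check against it
dd if=/dev/urandom of=/mnt/storage/foo.txt bs=512 count=20480
./mmap-test /mnt/storage/foo.txt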
If it's not, then you can just keep doing xfs_freeze and snapshotting the underlying data. As this is fixed in today's kernels, the best bet is simply to upgrade.
The most important lesson is to test your backup integrity with file-based checksumming every once in a while, to make sure that you really do have proper backups when you really need them.
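Even something as simple as comparing checksum manifests of a restored copy against the snapshot it was made from will catch this class of corruption. A sketch, with example paths:

# Build a checksum manifest of the snapshot and of the restored data, then compare
( cd /mnt/snapshot && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/snapshot.md5
( cd /mnt/storage && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/restored.md5
diff /tmp/snapshot.md5 /tmp/restored.md5   # any output means the restore is not intact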
Because I'm combining multiple EBS volumes into one big usable volume with LVM. This has two benefits:
1) More EBS volumes means more IOPS
2) More EBS volumes means that each volume can be smaller, which means much faster snapshots and much shorter snapshot recovery time
But to get a consistent snapshot I need to snapshot the LVM volume first, so that when I take the EBS snapshots they are all from precisely the same point in time. Without LVM the EBS snapshots would be a few seconds apart, resulting in corrupted data.
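As a sketch, building the combined volume looks something like this (device names and sizes are examples, not the exact Rightscale recipe):

# Two provisioned-IOPS EBS volumes attached as raw block devices
pvcreate /dev/xvdf /dev/xvdg
# One volume group spanning both, striped across them for more IOPS
vgcreate vg-data /dev/xvdf /dev/xvdg
lvcreate --extents 90%VG --stripes 2 --name lvol0 vg-data
# The remaining extents stay free for LVM snapshots; then create the filesystem
mkfs.xfs /dev/vg-data/lvol0
mount /dev/vg-data/lvol0 /mnt/storage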