The Challenge of Backups in EDA Environments

Written by Brian O'Neill | May 25, 2018 1:00:00 PM

Backing up data is always a challenge. There are different ways to do backups (some convenient, others complicated, depending on how your data is stored), and you often need to strike a balance between the two to accommodate recovery and business needs. Electronic Design Automation (EDA) environments are even more challenging, mainly because of the massive quantity of files that get generated as part of design and simulation.

First, repeat after me: RAID IS NOT BACKUP. A RAID array provides some protection against certain types of failures, namely failures of individual disks. Multiple disk failures, or failures in the controllers or the storage server, can still result in corruption and data loss. And RAID certainly doesn't protect against the human factor: accidentally overwriting or deleting a file.

A backup is a separate, independent copy of the data on a different storage system. Ideally, it can easily be replicated to another location as well.

There are different ways one can do backups these days, and some methods may depend on the type of storage used, whether you are using virtual machines, or how long you want to retain the data, etc. But to simplify the discussion, we can generally divide backup methods into one of two classes, file-based and block-based.

File-based Backups

The "traditional" backup solutions are file-based. A backup agent scans the filesystem, looking at each file and determining whether it has changed in some way. If it has, the agent copies the contents of the file into the backup store. If not, it skips the file. These systems are conceptually simple from both a backup and a restore point of view: to restore a file, you simply retrieve its contents from the backup store. The approach works with practically any sort of filesystem, and can often back up files on a network fileshare where you can't run the agent directly on the device.

These file-based solutions often start to fail once the number of files in your filesystems grows significantly. Since they must scan each file's metadata to look for changes, any filesystem overhead in checking that information lengthens the scan. As the scan time increases, so does the chance that the files in the backup are inconsistent with one another. For instance, while scanning 5 related data files, we may have already scanned two and determined we didn't need to back them up; then all 5 get changed while we are still checking, and we determine we need to back up the other 3. We end up backing up only the three changes we detected, while the other two remain in the backup store with older data from a previous backup. Not good when we go to restore all 5. Some filesystems offer ways to mitigate this problem (like Volume Shadow Copies in Windows) that prevent the data from changing, from the backup process's point of view, during the scan.
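To make the scan concrete, here is a minimal sketch of the kind of walk a file-based agent performs. It assumes the only change test is "modification time newer than the last backup"; real agents typically check more than that, and the function name find_changed_files is made up for illustration.

```python
import os


def find_changed_files(root, last_backup_time):
    """Return paths under root modified after last_backup_time (epoch seconds).

    Illustrative only: the change test here is mtime alone, which is an
    assumption, not how any particular backup product decides.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)  # one metadata lookup per file
            except FileNotFoundError:
                continue  # the file vanished while we were scanning
            if info.st_mtime > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Note that every file costs a metadata lookup whether or not it changed, and nothing stops files from changing between the moment they are checked and the moment the walk finishes.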

EDA processes can result in an enormous number of files, often in the millions. I've had a single volume contain over 40 million files! Now imagine that it takes 1 millisecond per file to determine if it needs to be backed up...that adds up to over 11 hours just to scan, not including copying the actual data! In addition, we may regenerate files frequently, which gets picked up as changes, and we need to back them all up again. Most of these systems back up the entire file again even if a single byte changes, and unless your system supports deduplication you end up with lots and lots of data in your backup archive, which in some cases can make restores slow as well. (Don't forget, people don't care how fast your backups are, only how fast you can restore their data!)
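The arithmetic behind that scan-time figure is easy to rerun with your own numbers. This is just a back-of-the-envelope helper (the name scan_hours is mine); the 1 millisecond per file is the illustrative figure from above, not a measurement.

```python
def scan_hours(file_count, seconds_per_file):
    """Estimated hours spent only on checking metadata, before any data is copied."""
    return file_count * seconds_per_file / 3600


# 40 million files at 1 ms each: 40,000 seconds of scanning
print(f"{scan_hours(40_000_000, 0.001):.1f} hours")  # prints "11.1 hours"
```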

Block-based Backups

The relatively "new" (to us IT...elders? Grumpy Old Admins? Grups?¹) way of backing up is the block-level backup. Rather than scan the metadata of every file, we look at the entire storage volume and determine which blocks specifically have changed. Then we only need to back up the actual blocks and not entire files for small changes. This minimizes the amount of data we back up — and since we aren't having to look at the files themselves, it doesn't matter at all how many there are! There is no longer a long, slow scan of file metadata. This is a huge boon to EDA data backups. Most of these also work by taking a "snapshot" of the disk — an instantaneous capture of the disk state, providing a true "moment in time" backup for data consistency.
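As a rough illustration of the idea, the sketch below treats a volume as a byte string, cuts it into fixed-size blocks, and reports only the blocks whose hash differs from the previous backup. The 4096-byte block size and the function names are assumptions for the example, and real systems generally track changed blocks as writes happen rather than rehashing the whole volume.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of a volume image."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old_hashes, new_data, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that differ from the last backup."""
    changed = {}
    for index, digest in enumerate(block_hashes(new_data, block_size)):
        if index >= len(old_hashes) or old_hashes[index] != digest:
            start = index * block_size
            changed[index] = new_data[start:start + block_size]
    return changed
```

Nothing in this code knows or cares how many files live on the volume: a one-byte change costs one block, whether the volume holds ten files or 40 million.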

However, your storage system needs to support this. The support might be at the filesystem level (such as with NetApp snapshots, ZFS, or Btrfs), or in a virtual volume/disk system (such as VMware virtual disks).

VMware Example

Let's take for example the combination of a VMware vSphere environment and Datto’s SIRIS backups. vSphere provides the means to make snapshots of a given VM, providing for convenient rollback after updates, testing of changes, etc. What happens in VMware is that the virtual disk devices, which themselves exist as files on the underlying storage, stop getting new writes, and instead all future changes to the virtual disks are written to new files called deltas. This means the state of the virtual disks is preserved at the moment the snapshot is made. Now a backup product such as Datto SIRIS can take the original virtual disk files, determine which blocks have changed since the last time it backed them up, and copy only those blocks. When the backups are done, the snapshot is released and the changed blocks written to the delta files get consolidated back into the originals. Note: While the snapshot exists, more disk space is used to write all the changes to the delta files, so make sure you have sufficient disk space for this to occur for the length of the backup run. This will depend on how much data changes during that time frame, so quiet periods are best.
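The snapshot-and-delta behavior can be modeled in a few lines. This is a toy model written for this post, not VMware's implementation or on-disk format; the class and method names are invented for illustration.

```python
class ToyVirtualDisk:
    """Toy model of a virtual disk with a single snapshot delta (illustrative only)."""

    def __init__(self, blocks):
        self.base = list(blocks)  # the "original" virtual disk
        self.delta = None         # None means no snapshot is active

    def snapshot(self):
        """Freeze the base disk; later writes land in the delta."""
        self.delta = {}

    def write(self, index, data):
        if self.delta is not None:
            self.delta[index] = data  # base stays untouched during the backup
        else:
            self.base[index] = data

    def read(self, index):
        """The VM sees the delta layered over the frozen base."""
        if self.delta is not None and index in self.delta:
            return self.delta[index]
        return self.base[index]

    def release(self):
        """Consolidate the delta back into the base and drop the snapshot."""
        if self.delta is not None:
            for index, data in self.delta.items():
                self.base[index] = data
            self.delta = None
```

While the snapshot is active, a backup tool can read the base safely because it no longer changes. The delta dictionary grows with every block written in the meantime, which is the extra disk space the note above warns about.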

Added bonus - Datto SIRIS takes the backups it makes and can run the VMs directly on the SIRIS system itself, or even in the cloud! This provides business continuity along with the ability to restore.

Some of these systems which support snapshots may also support an inherent ability to make a limited form of backup sometimes referred to as "snapshot shipping". The idea is that if you already have a copy of a logical device on another system, and we take snapshots of the original, we can send those snapshots over to the other system and recreate the changes there. The snapshots already contain only the changes to the filesystem, so the amount of data to transfer is already minimal and can be suitable for transferring directly offsite depending on overall size and the speed of your connection.

NetApp Filers support this through their SnapMirror and SnapVault products. The ZFS filesystem (supported on Solaris and BSD-derived systems, and as an add-on to Linux) does this as well using the "zfs send" and "zfs recv" commands. The best part is that in the case of loss of the original storage system, these backup systems are generally ready to go immediately — no need for an explicit restore to be done.
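For the ZFS case, an incremental snapshot ship usually boils down to a "zfs send -i" piped into a "zfs recv" on the other side. The sketch below only builds that command line as a string so you can inspect it before running anything; the dataset, snapshot, and host names are hypothetical, and using ssh as the transport is an assumption.

```python
import shlex


def incremental_send_pipeline(dataset, prev_snap, new_snap, dest_host, dest_dataset):
    """Build a shell pipeline that ships the changes between two snapshots.

    All names passed in are placeholders; ssh as the transport is an assumption.
    """
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    recv = ["ssh", dest_host, "zfs", "recv", dest_dataset]
    return f"{shlex.join(send)} | {shlex.join(recv)}"


# Hypothetical names: pool "tank", host "backup01"
print(incremental_send_pipeline("tank/eda", "monday", "tuesday", "backup01", "backup/eda"))
# zfs send -i tank/eda@monday tank/eda@tuesday | ssh backup01 zfs recv backup/eda
```

The destination must already hold the earlier snapshot for an incremental send to apply, which is the "you already have a copy on another system" condition described above.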

The drawback is that typically the destination needs to be similar hardware and a compatible OS release, and depending on the rate of new data being written (and in an EDA environment, that can be large) and the size of the backup system, you may not be able to use this mechanism for long-term archives or transfer over slow network links. Consider a secondary backup system for long-term storage — but keep in mind you can run the archival backups off the backup system and not impact the primary system, so how long it takes is less of a consideration.

When you do go looking for a backup system, be aware that some of the more specialized systems that are able to do block-level backups may not support backing up anything other than the type of system they were designed for, or may carry additional costs for backing up different storage systems. For hybrid environments consisting of varying types of storage, it may be difficult to find a solution that is fast, economical, and takes care of everything. You may wish to consider more than one solution if your environment incorporates lots of different systems, rather than choose a solution which is suboptimal all around.

In the end, there are many different options to choose from. Finding the right balance for your environment will be a challenge. Pay attention to what the vendor options say, but always keep in mind that in YOUR environment, the number of files may be more important than the total amount of data you need to back up, and vendors tend to say very little about that. If you can, choose a backup system that can deal with the underlying block structure of your storage instead of the traditional file methods, and if you are looking for new storage as well, consider how you are going to back it up!

1. Bonk bonk on the head if you missed that nerd reference.
