Why we have a Time Machine

One of the more interesting projects I've worked on while at STAGES has been our time machine.

I know what you're thinking, but it's not a real time machine; it's just a way for us to look into the past and see the state of our site on a given day. The idea is that if a user says they did something on June 15th but the data isn't there, we can go back and say "at 6 PM the data wasn't there" and then use one of the mid-day backups to investigate further. This way, instead of having to restore lots of backups to determine the date the data was deleted, we have easy-to-access data that our support department can look at without even having to involve DevOps.

How it works

The idea is that every day we create a backup of our live data (we make 4+ backups every day) and copy it over to our time machine server along with a copy of the site code as it was when the backup was made. We then use this data to create a new "version" of the site so we can track what it looked like.

The general process is:

  1. Delete Old Sites/Databases - Lesson learned: if you do this last and your drive is running low on space, you will run into problems.
  2. Set up the new database - Every day gets its own database so it can easily be deleted in step 1
  3. Import the data
  4. Create a config file.

Step 4 might be optional depending on your software, but with STAGES we need to create a file that defines the connection settings for the database. We have a special MySQL user that only has access to the time machine MySQL server and doesn't have access to live. We also don't have any live MySQL users on the time machine server (this prevents accidents).
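
Putting steps 1 through 4 together, here's a minimal sketch of what the nightly job could look like in shell. The database names, paths, and config file contents are placeholders, not our actual setup, and it only handles the daily retention tier:

#!/bin/bash
# Hypothetical sketch of the nightly time machine job -- names, paths, and the
# config format are placeholders. Assumes MySQL credentials are in ~/.my.cnf.
set -e

TODAY=$(date +%Y%m%d)
OLDEST=$(date -d '40 days ago' +%Y%m%d)   # only handles the daily tier
SITE_DIR="/var/www/live/${TODAY:0:4}/${TODAY:4:2}/${TODAY:6:2}"

# 1. Delete old sites/databases first so there's room for the new one.
rm -rf "/var/www/live/${OLDEST:0:4}/${OLDEST:4:2}/${OLDEST:6:2}"
mysql -e "DROP DATABASE IF EXISTS tm_${OLDEST};"

# 2. Set up the new database and give the read-only user access to it.
mysql -e "CREATE DATABASE tm_${TODAY};"
mysql -e "GRANT SELECT ON tm_${TODAY}.* TO 'timemachine'@'localhost'; FLUSH PRIVILEGES;"

# 3. Import the data.
gunzip -c "/backups/${TODAY}/backup.sql.gz" | mysql "tm_${TODAY}"

# 4. Unpack the matching code and write a config file pointing at today's database.
mkdir -p "$SITE_DIR"
tar -xzf "/backups/${TODAY}/code.tar.gz" -C "$SITE_DIR"
cat > "${SITE_DIR}/config.php" <<EOF
<?php
// Placeholder config format -- the real file defines the DB connection settings.
define('DB_NAME', 'tm_${TODAY}');
define('DB_USER', 'timemachine');
EOF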

Determine how long to keep data

Another thing that's dependent on your application is how long to keep each of the sites. We have yearly backups that we keep forever, monthly backups that we keep for a year, weekly backups that we keep for a month, and daily backups that we keep for 40 days, but this is dependent on how large the backups are. The horrible part about this process is that in a normal application your data is always growing, and because of this your time machine sites are also going to get bigger and bigger.

I wish there was a hard and fast equation I could give you, but because data growth is usually so unpredictable, I have to do the following calculation every time we need to change the retention periods.

total size of disk / (average size of database on disk + average size of site on disk) = total number of possible sites

I use the most current database size and code size for this calculation. When your DB gets big enough the code size doesn't even matter, so I've stopped including it.
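
To make that concrete with made-up numbers: on a 700 GB drive where each day's database plus code copy averages around 12 GB on disk, that's roughly 700 / 12 ≈ 58 possible sites, which then has to be divided up among the yearly, monthly, weekly, and daily tiers.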

Linux

Linux has been super agreeable to having 50+ low-usage sites on the same server, but given that cheap hosting providers most likely cram even more than this onto a single box, it's not really a huge surprise to me.

The one "issue" that we've had Linux wise on this server is that by default 5% of a drive is reserved for the system. On a 10 GB drive that's not a huge problem but when it's a 700 GB drive it's eating up several potential time machine sites. In order to reclaim this space you can use tune2fs to set it to something lower (we're using 1%).

We use Rackspace to host this server, and in order to spend the least amount possible we have it running on the lowest level of Linux server they offer. This has worked great, but the downside is that the smallest instance doesn't have much disk space, so we added a large external SATA drive. Adding a large SATA drive doesn't cost us too much when you compare it to the 2ish programming hours it takes to restore a single backup. Even if it only saves us one restore a month, it's still much cheaper.

Apache

I didn't want to have to create a new site config file, delete old config files, and reload Apache every night, so I used Apache's VirtualDocumentRoot directive, which allows you to define how a hostname is converted to a folder. Below is the important piece of our Apache config.

<VirtualHost *:80>
    DocumentRoot "/var/www/site"
    ServerAlias *.example.com

    <Directory "/var/www/site">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>

    UseCanonicalName Off
    VirtualDocumentRoot /var/www/%1.9+/%1.1%1.2%1.3%1.4/%1.5%1.6/%1.7%1.8/public
</VirtualHost>

The important pieces are:

ServerAlias *.example.com

This tells Apache to use this virtual host for any hostnames that match *.example.com, so when a user attempts to access the site they're automatically lumped into this virtual host (we also have to set up a wildcard DNS entry so all requests to *.example.com are sent to this server).

UseCanonicalName Off

This setting tells Apache to use the hostname and port supplied by the client instead of the ones configured on the server, which is what makes the next setting work:

VirtualDocumentRoot /var/www/%1.9+/%1.1%1.2%1.3%1.4/%1.5%1.6/%1.7%1.8/public

I hate this line because it looks like it might make your head explode.

We set up the server so every time machine site's hostname is YYYYmmddlive.example.com (this is just how we picked it; you could do it any way you want). VirtualDocumentRoot breaks the hostname up into pieces at the periods (YYYYmmddlive, example, and com), which lets you reference each piece by position. If we just wanted to put the sites into top-level domain (TLD), domain name, and subdomain folders, we could use VirtualDocumentRoot /var/www/%3/%2/%1/public.

We wanted each year, each month, and each day to have its own folder, which made things easier for us to manage but created the crazy string above. %1.1 references the first character of the first piece, %1.2 references the second character of the first piece, and so on. %1.9+ references the 9th character and everything after it. The end result is that all our sites are stored in directories of the form /var/www/live/YYYY/mm/dd/.
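
If it helps, here's the same split done by hand in shell for a hypothetical hostname, just to show where each piece of the pattern ends up:

# Illustration only: mimic what VirtualDocumentRoot does with the
# hypothetical hostname 20240615live.example.com.
host="20240615live.example.com"
sub="${host%%.*}"      # "20240615live" -- this is %1
year="${sub:0:4}"      # %1.1 through %1.4
month="${sub:4:2}"     # %1.5%1.6
day="${sub:6:2}"       # %1.7%1.8
rest="${sub:8}"        # %1.9+ -- the 9th character onward, "live"
echo "/var/www/${rest}/${year}/${month}/${day}/public"
# prints /var/www/live/2024/06/15/public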

MySQL

MySQL has handled this load really well. I know some people have HUGE MySQL databases, but it's interesting to see 50+ databases running on the same server. The only real problem we had is that our ORM used information_schema to build relationships between our objects dynamically, and with thousands of tables it took several minutes to gather the information it needed. This was easily solved by caching that information on the server.

Limiting Access

We wanted to make sure each time machine site was a read-only version of our data, so we did this when we created the databases:

grant select on dbname.* to timemachine@localhost;
flush privileges;

This allowed our timemachine user to only select from the database and not update anything, which kept the data in a pristine state. Initially, we had to tell our support department not to save and to ignore any errors they received if a save occurred, but over the years we've added special code so writes don't happen on the time machine server.
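
Those grant statements assume the timemachine user already exists; if you're setting this up from scratch, creating it once up front looks something like this (the password is obviously a placeholder):

# One-time setup on the time machine MySQL server.
mysql -uroot -p -e "CREATE USER 'timemachine'@'localhost' IDENTIFIED BY 'pick-a-real-password';"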

Special Backups

Originally we just took a mysqldump backup of our site and restored it to the time machine server. It was quick and it worked really well, but we wanted to use as little space as possible for each site, so we started deleting all the data from our log tables since there would always be a copy of them on our live server (we then ran optimize table to free up the space).

Eventually we started running into problems with this because it still required 30 GB+ of free space for the restore before we could reclaim any of that space. To fix this we switched to making a special backup (actually two backups) of our database: the first one backs up all the tables except our log tables, and the second dumps just the structure of those log tables.

 mysqldump -uroot -ppassword live --ignore-table=live.emaillog .... | gzip > backup.sql.gz
 mysqldump -uroot -ppassword live --no-data emaillog | gzip >> backup.sql.gz

This also severely reduced the amount of time it took to make the backup and to restore the data on the time machine server.
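
Because the two gzip streams are just appended to the same file, gunzip will happily decompress them back-to-back, so the restore on the time machine side is still a single pass (the database name here is a placeholder):

# Both appended gzip members are decompressed in order: the data dump first,
# then the structure-only dump of the log tables.
gunzip -c backup.sql.gz | mysql -uroot -p tm_20240615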

PHP

We originally tried to run all the sites using the same copy of the code, but this quickly fell apart (think how much your code base has changed just this week, let alone over months or years). We found that we needed to keep a full copy of the source code for each date so the code version matched the database version, especially once we switched to continuous deployment.

Memcached (caching)

Another issue we ran into is that we were using a caching layer for a lot of pieces of the site. On a single-site/single-server setup this wasn't a problem, but it caused weird bugs when running a bunch of copies of the same site on the same server, because they all shared the same cache keys. To fix this we had to make sure a unique cache key prefix was used for each time machine site.

Firewall

One of the things we worried about was an end user finding a time machine site, trying to add data to it, and then being mad when it didn't work. We found the best solution was to just block every IP address except our office's. We could have done something crafty like a VPN, but this was a very simple solution where we didn't have to worry about people remembering additional passwords or installing VPN software.
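
As a rough sketch, with iptables that can be as simple as allowing the office address and dropping everything else (203.0.113.10 is a placeholder for your office IP):

# Allow web traffic from the office only; drop everyone else.
iptables -A INPUT -p tcp --dport 80 -s 203.0.113.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j DROP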

Other Stuff

Lastly, in order to make it clear which version of the site people were using, we added the date to our header image using ImageMagick's convert program. Your settings might need to change depending on your header image, but this is what worked for us:

convert header.png -gravity southwest -fill red -pointsize 30 -annotate +350+0 "YYYY-mm-dd" header.png

The -pointsize changes the size of the font (duh), the -fill sets the text color, and the -annotate +350+0 sets where the text is located.