Wednesday, March 05, 2008

Automatic change tracking with cron and Mercurial

A while back I had to set up a system for sharing ANDRILL's expedition data among a geographically distributed group of scientists. For simplicity's sake, I settled on serving the data via HTTP (for read-only access) and WebDAV (for read/write access). It's not the most robust system, but it works, is relatively well supported across modern operating systems, and doesn't require users to install any special software.

The only problem is that allowing read/write access to a large group of people means that something is going to get inadvertently changed or deleted. To combat this, I needed an automatic way to track changes that would allow me to revert or roll back any accidents.

The first step was to put the expedition data into a standard version control system. My VCS of choice is Mercurial, so I opted to use that. The initial import took a bit of time because there was about 30GB of files.

Once I had all of the data safely imported into a Mercurial repository, I needed a way to automatically detect and commit changes. I worked up a fairly simple shell script that gets run periodically by cron to check for and commit any changes:

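#!/bin/sh

# space-separated list of repositories to check
REPOS="/home/projects/sms/"
DATE=`date`

for r in $REPOS
do
    echo "Working on $r"

    # clean up temporary "._*" files, skipping anything under .hg
    find "$r" -type f -name "._*" \! -wholename "*.hg*" -exec rm {} \;

    # record added and deleted files, then commit everything that changed
    hg addremove -R "$r"
    hg commit -m "Automated commit @ $DATE" -R "$r"
done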


The script is relatively straightforward. It walks through a list of repositories (in this case, only one is configured). For each repository, it first deletes any dangling "._*" files that WebDAV clients leave behind, skipping the Mercurial metadata directory. Then it runs hg addremove to detect any added or deleted files. Finally it commits the changes, including modifications to existing files, with an automated commit message containing the current date and time.

The script is configured to run periodically (hourly) by cron. This has the side effect that if the same file is modified two separate times within the same hour window, the intermediate changes won't be caught. For my needs, this wasn't a big deal. You could always run the script more often to have a better chance of catching more changes.
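
As a rough sketch of the scheduling, the crontab entry could look like the one printed below. The script path /home/projects/hg-tc/hg-tc-commit.sh is an assumption for illustration, since the post doesn't say where the script lives:

```shell
#!/bin/sh
# Prints a crontab line that runs the commit script at the top of every hour.
# The script path is hypothetical; adjust it to wherever the script is saved.
print_cron_entry() {
    printf '%s\n' '0 * * * * /home/projects/hg-tc/hg-tc-commit.sh >/dev/null 2>&1'
}

print_cron_entry
# To install it alongside any existing entries:
#   (crontab -l 2>/dev/null; print_cron_entry) | crontab -
```

On cron implementations that support step values, changing the first field from 0 to */15 would run the script every 15 minutes instead.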

The final piece of the puzzle was to send out a daily email to the science team to notify them of the changes. For this, I developed another script that is run once a day:

#!/bin/sh

REPOS="/home/projects/sms/"

for r in $REPOS
do
    # revision numbers of all changesets from the last day, one per line
    LOG=`hg log -d -1 --template '{rev}\n' -R $r`
    if [ -z "$LOG" ]; then
        echo "No changes"
    else
        # printf handles \n portably; echo -e does not work in every /bin/sh
        printf 'The following files have changed in the last 24 hours:\n\n' > /tmp/hg-tc-daily.log
        for c in $LOG
        do
            hg log --rev $c --template 'Changeset {rev} ({date|isodate}):\n' -R $r >> /tmp/hg-tc-daily.log
            hg status --rev `expr $c - 1`:$c -R $r >> /tmp/hg-tc-daily.log
            echo "" >> /tmp/hg-tc-daily.log
        done
        # append the legend explaining the status codes, then send the report
        cat /home/projects/hg-tc/summary.legend >> /tmp/hg-tc-daily.log
        mail -s "Expedition Data Changes" foo@bar.com < /tmp/hg-tc-daily.log
    fi
done


If you can grok that in a single glance, you're a better shell scripter than me. It took me a couple passes to write the script and get it working the way I wanted it to. There's a bit of magic going on here, but I'll break it down.

This script walks through a set of repositories, just as the previous script did. The first magical incantation we run into is:
LOG=`hg log -d -1 --template '{rev}\n' -R $r`

This line asks Mercurial to tell us the revision numbers for all changes in the last day (-d -1) and stores them, one to a line, in the LOG variable. We then check to see whether the LOG variable contains anything. If it doesn't, there is nothing to report for that repository, so the script prints "No changes" and moves on.

If the LOG variable does contain something, then we need to send out an email to notify people of the changes. We start building up our email in /tmp/hg-tc-daily.log. Next we walk through our list of revision numbers, and for each one we include some information using this incantation:

hg log --rev $c --template 'Changeset {rev} ({date|isodate}):\n' -R $r >> /tmp/hg-tc-daily.log
hg status --rev `expr $c - 1`:$c -R $r >> /tmp/hg-tc-daily.log


The first line simply prints out "Changeset 53 (2008-03-04 15:01 -0600):" or something similar for every changeset that occurred in the last 24 hours. The second line prints out the list of changed files and the change type (added, deleted, modified) relative to the previous revision (--rev `expr $c - 1`:$c).
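
One edge case is worth noting: for revision 0, `expr $c - 1` yields -1, which Mercurial reads as the tip rather than "nothing". A hedged sketch of a guard follows, using a hypothetical helper named prev_rev together with Mercurial's reserved null revision (the empty parent of revision 0). It assumes linear history, which is what the automated commits produce:

```shell
#!/bin/sh
# Hypothetical helper: print the revision to compare against for changeset $1.
# Assumes linear history, where the parent of revision N is N-1.
prev_rev() {
    if [ "$1" -eq 0 ]; then
        echo null            # reserved name for the empty revision before 0
    else
        echo $(( $1 - 1 ))   # POSIX shell arithmetic, no expr needed
    fi
}

# Usage inside the loop:
#   hg status --rev `prev_rev $c`:$c -R $r >> /tmp/hg-tc-daily.log
```

Newer Mercurial releases also accept hg status --change $c, which lists the files changed by a single revision and would sidestep the arithmetic altogether.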

Finally the script adds a helpful legend to the email and sends it out.
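
The report is built in a fixed path under /tmp, which works but can collide if two runs overlap. A small sketch of an alternative, assuming mktemp is available on the system (it is not part of the script above); the REPORT variable name is made up for this example:

```shell
#!/bin/sh
# Sketch: build the report in a private temporary file instead of a fixed path.
REPORT=$(mktemp "${TMPDIR:-/tmp}/hg-tc-daily.XXXXXX") || exit 1
trap 'rm -f "$REPORT"' EXIT    # remove the file when the script exits

printf 'The following files have changed in the last 24 hours:\n\n' > "$REPORT"
# ... append the per-changeset output to "$REPORT" here ...
# mail -s "Expedition Data Changes" foo@bar.com < "$REPORT"
```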

Overall, it's a bit hacky, but it works. The team can continue to collaborate in a central location with the relative safety of using a VCS to track changes. The best part is they don't have to do anything different; the change tracking is all automatic. I've even had the occasion to revert inadvertent changes and deletions, so it was a good investment of an afternoon.
