Wayback Machine Archiver: Backup Pages with Python

June 04, 2019 #my-projects

A photo of the Library of Congress in 1902.

The Internet Archive runs a service called the Wayback Machine to create a digital archive of the entire internet. It contains the first snapshot of my website, as well as many others. I love the ability to go back and see how my site has evolved, and know that even if I stop hosting it, the archive will live on. That’s why I want the Wayback Machine to archive every change my site goes through.

The Internet Archive makes it easy to submit a site for archival, but doing this manually is time-consuming. So I built Wayback Machine Archiver to automate it.

Wayback Machine Archiver

The Archiver runs on either Python 2.7 or 3.4+. It can be installed with pip:

pip install wayback-machine-archiver

After that, submitting pages for archival is as easy as:

archiver https://alexgude.com https://alexgude.com/blog/  # etc.

This is not much of an improvement over doing it manually, since we still have to find each URL by hand. Luckily, most sites already have a list of all their pages in a sitemap.xml.

For example, my sitemap is at https://alexgude.com/sitemap.xml. It is automatically generated by Jekyll when I update my site. Archiver can read a sitemap and submit all of the pages listed in it, which makes submitting my blog for archiving as easy as:

archiver --sitemaps https://alexgude.com/sitemap.xml
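To make the sitemap step concrete, here is a minimal sketch of pulling page URLs out of a sitemap and turning each one into a save request. It is not Archiver's actual code: the function names `extract_urls` and `to_save_request` are made up for illustration, the example sitemap is hand-written, and using the Wayback Machine's public `https://web.archive.org/save/` prefix is an assumption about how a submission could be made.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumption: the Wayback Machine's public "Save Page Now" URL prefix.
SAVE_PREFIX = "https://web.archive.org/save/"


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def to_save_request(url):
    """Build the URL that would ask the Wayback Machine to archive `url`."""
    return SAVE_PREFIX + url


# Hand-written example sitemap; a real one would be downloaded first.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://alexgude.com/</loc></url>
  <url><loc>https://alexgude.com/blog/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for page in extract_urls(EXAMPLE_SITEMAP):
        print(to_save_request(page))
```

The sketch only builds the request URLs and never touches the network, so it is safe to run as-is; fetching those URLs is the part that would actually trigger a snapshot.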

Archiver has additional options, such as using multiple threads, logging, and archiving the sitemap itself as well as the pages it links to. Check out the README on GitHub, or use the --help flag.
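Multiple threads help because each submission spends most of its time waiting on the network. The following is a rough sketch of that idea using Python's standard `concurrent.futures` module, not a description of how Archiver does it internally. The `submit` function is a hypothetical stand-in that returns a string instead of making a real request, and `submit_all` and `max_workers=4` are likewise illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def submit(url):
    """Hypothetical stand-in for the request that asks for a page to be archived."""
    return "submitted " + url


def submit_all(urls, max_workers=4):
    """Run submit() over all URLs using a small pool of worker threads."""
    # Executor.map runs the calls concurrently but yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit, urls))


if __name__ == "__main__":
    pages = ["https://alexgude.com/", "https://alexgude.com/blog/"]
    for line in submit_all(pages):
        print(line)
```

In a real version `submit` would perform the HTTP request, and a modest worker count would keep the load on the Internet Archive reasonable.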

Scheduling Archiver

Archiver follows the Unix philosophy of “Make each program do one thing well”, so it leaves the scheduling to another program. I recommend cron. Here is how I back up my site, and a few others, each week:

# Back up three websites weekly
@weekly archiver --sitemaps https://alexgude.com/sitemap.xml --archive-sitemap-also --log INFO
@weekly archiver --sitemaps http://charles.uno/sitemap.xml --archive-sitemap-also --log INFO
@weekly archiver https://www.radiokeysmusic.com --sitemaps https://www.radiokeysmusic.com/sitemap.xml --archive-sitemap-also --log INFO

If you find my Archiver useful, please consider donating to the Internet Archive; none of this would be possible without them! Your company may even match your donation like mine does!