Juha-Matti Santala
Community Builder. Dreamer. Adventurer.

🍿 Archiving a site from the web

Snacks (🍿) are my collection of recipes for solving problems. Often recorded (and cleaned up) from actual discussions where I've been involved in helping others with technical problems. Not all solutions are mine but I started collecting them in one place because you never know when you need one.


Sometimes there's a site of yours on the web that you've lost access to the code for, but you still need to archive it. I picked up this technique from Adam Marcus's blog. First, you need wget (on a Mac, I installed it with brew install wget).

Here's a wget incantation that does a lot of good things (https://example.com/ is a placeholder; replace it with the URL of the site you want to archive):

wget -P . --recursive --page-requisites --adjust-extension --convert-links --wait=1 --random-wait --restrict-file-names=ascii,windows https://example.com/
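
By default, wget drops the mirrored files into a directory named after the host, so archiving https://example.com/ would produce an example.com directory on disk with the site's pages inside it.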

Flags explained

-P

Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is ‘.’ (the current directory).

Technically I think this isn't necessary since . is the default, but I guess it's good to be explicit.
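
If you'd rather keep the archive out of the current directory, you can point -P somewhere else (my-archive is just an example name here, and the URL is a placeholder):

wget -P ./my-archive --recursive --page-requisites --adjust-extension --convert-links https://example.com/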

--recursive

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

The maximum depth to which the retrieval may descend is specified with the ‘-l’ option. The default maximum depth is five layers.

With the recursive flag, wget will follow links. When I tried this with a project, I used the default depth of five layers, but I'm not sure how I would confirm that it was enough.
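
If you suspect five layers isn't enough, you can set the depth yourself with --level; if I read the manual right, inf means no limit at all (URL is a placeholder again):

wget --recursive --level=inf --page-requisites --adjust-extension --convert-links --wait=1 --random-wait https://example.com/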

--page-requisites

This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

Downloads JavaScript, CSS, images and so on to make the archived site a complete one.
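
The same flag is handy on its own when you only want a self-contained copy of a single page instead of a whole site; something like this (made-up URL) grabs the page plus everything it needs to render:

wget --page-requisites --adjust-extension --convert-links https://example.com/some-post/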

--adjust-extension

If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename.

This is very handy for archiving sites where there's no .html at the end of paths and filenames.
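
For example (a made-up URL), a page served at https://example.com/about would end up on disk as example.com/about.html, so it opens straight in a browser.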

--convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Turns absolute links into relative, local links.
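
Once the download finishes, the archive can be browsed completely offline; on a Mac, for example, something like this opens the front page (assuming wget created an example.com directory):

open example.com/index.html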

--wait=1

Wait n seconds between retrievals—the same as ‘-w n’.

Adds a 1-second delay between retrievals.

--random-wait

Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the ‘--wait’ option, in order to mask Wget’s presence from such analysis.

Makes it harder for the server to tell whether the requests come from a bot or not.
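
Combined with --wait=1, this means each request is delayed by somewhere between 0.5 and 1.5 seconds. If the site is slow or fragile, you can be gentler by raising the base wait (the 3 here is just an example):

wget --recursive --page-requisites --adjust-extension --convert-links --wait=3 --random-wait https://example.com/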

--restrict-file-names

Change which characters found in remote URLs must be escaped during generation of local filenames. Characters that are restricted by this option are escaped, i.e. replaced with ‘%HH’, where ‘HH’ is the hexadecimal number that corresponds to the restricted character. This option may also be used to force all alphabetical cases to be either lower- or uppercase.

I guess this one makes sure all the local filenames are ASCII and Windows compatible.
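
For example, with a made-up Finnish path like https://example.com/blogi/päivä, the non-ASCII characters get percent-escaped and the file is saved roughly as example.com/blogi/p%C3%A4iv%C3%A4.html, which keeps the filenames portable across filesystems.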