Journal backup with comments

Dec 09, 2007 23:11

LiveJournal has long had the facility to export the content of a journal or community to either CSV or XML. The only real drawback is that the export includes only the posts, not the comments.

Previously this has only been an irritation for people, but given this year's litany of errors, with deletion of users' data and poorly implemented access controls, the inability to easily back up all data may be more than merely irritating for some.

On that front, though, I have some good news and some bad news.



The good news is I've worked out just the right selection of wget flags to download an entire journal. That is the entire journal, including the comments and stylesheets (along with anything else required for the pages to render properly), but without the padded repetition of storing all pages with a reply field or thread expansion.

The bad news is this is only useful for people running a Unix or Unix-like operating system (Mac OS X counts), or who have access to one. The Windows users are still up the proverbial creek (yes, I know there are workarounds, like Cygwin).

In the examples that follow, all references to LJ_USER need to be replaced with the LiveJournal username and all references to LJ_PASS with the corresponding password. I recommend creating a subdirectory to run the command in, because wget will create its own subdirectories for any other sites it accesses.
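A minimal sketch of that preparation step. The helper name enter_backup_dir and the default directory name lj-backup are assumptions made for illustration; any empty directory will do.

```shell
# Hypothetical helper: create a dedicated working directory and move into it,
# so the per-host subdirectories wget creates all land in one place.
enter_backup_dir() {
    dir="${1:-lj-backup}"          # "lj-backup" is an assumed default name
    mkdir -p "$dir" && cd "$dir"   # only change directory if mkdir succeeded
}
```

Calling enter_backup_dir (optionally with a directory name) before running wget keeps the download out of your home directory.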

wget --http-user="LJ_USER" --http-passwd="LJ_PASS" --user-agent="Backup/LJ_USER" -t 5 -r -l 10 -p -R "*reply*","*thread*" http://LJ_USER.livejournal.com/

That's the most streamlined version of the command, but there are plenty of other options available with wget, so there are alternate approaches. For example the mirroring option (-m) could be selected instead of using the -r and -l flags.
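As a rough sketch of the mirroring variant: the wget manual describes -m as shorthand for -r -N -l inf --no-remove-listing, so it replaces both -r and -l and also switches on time-stamping. The helper below only prints the command for review rather than running it. The name lj_mirror_cmd is made up for this example, and the password flag keeps the spelling used in this post (newer wget manuals, as far as I know, document it as --http-password).

```shell
# Hypothetical helper: print (not run) the mirroring form of the command.
# -m stands in for the "-r -l 10" pair used in the streamlined version.
lj_mirror_cmd() {
    # $1 = LiveJournal username, $2 = password
    printf 'wget --http-user="%s" --http-passwd="%s"' "$1" "$2"
    printf ' --user-agent="Backup/%s" -t 5 -m -p' "$1"
    printf ' -R "*reply*","*thread*" http://%s.livejournal.com/\n' "$1"
}
```

Running lj_mirror_cmd LJ_USER LJ_PASS prints one line that can be checked and then pasted into the shell. A password containing quotes or a dollar sign would need escaping first.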

The user-agent flag is really optional, but it's advisable to set one in case the journal has been set to ban robots. If that's the case wget won't run on that site; changing the user-agent circumvents that issue.

A more expanded view, covering some of the flags more likely to be used, is shown below. Optional flags are enclosed in square brackets.

wget --http-user="LJ_USER" --http-passwd="LJ_PASS" [--user-agent="Backup/LJ_USER"] [-c] [-t 0|5] -[m|r] [-l inf|5|20] [-p] [-H] -R "*reply*"[,"*thread*"] [--domains=example.com,example.net] [--exclude-domains=example.com.au,example.net.au] http://LJ_USER.livejournal.com/

Obviously the time it takes and the amount of disk space required will be determined by the size of the journal being copied. The bandwidth used will always be greater than the data backed up because all those pages with "reply" or "thread" in their URLs are downloaded, just not stored.
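To see what a finished backup actually occupies, something like the following sketch could be run afterwards. The helper name backup_stats is an assumption for illustration. It measures only what was stored on disk, so it says nothing about the extra bandwidth spent on the rejected pages.

```shell
# Hypothetical helper: report what a finished backup occupies on disk.
# Prints "<kilobytes> <file count>" for the given directory (default: current).
backup_stats() {
    dir="${1:-.}"
    kb=$(du -sk "$dir" | awk '{print $1}')             # total size in KB
    files=$(find "$dir" -type f | wc -l | tr -d ' ')   # number of stored files
    echo "$kb $files"
}
```

For example, backup_stats LJ_USER.livejournal.com reports on the directory wget creates for the journal itself.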

Backing up a community one maintains is done in a similar manner, just with a slight variation to prevent accidental downloading of other communities. In this example LJ_COMM needs to be replaced with the name of the community.

wget --http-user="LJ_USER" --http-passwd="LJ_PASS" [--user-agent="Backup/LJ_COMM"] [-c] [-t 0|5] -[m|r] [-l inf|5|20] [-p] [-H] -R "*reply*"[,"*thread*"] [--domains=example.com,example.net] [--exclude-domains=example.com.au,example.net.au] -np http://communities.livejournal.com/LJ_COMM/

The only real difference is the inclusion of -np, so it doesn't try to recursively grab anything above that community's directory, and the slightly different URL. The user-agent field was modified too, but that's not really important.
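One possible concrete reading of the community template, as a sketch only: it picks -c, -t 5, -r, -l 20 and -p from the bracketed choices, leaves out -H and the domain lists, and keeps -np. The helper name lj_community_cmd is made up for this example, and it prints the command rather than running it.

```shell
# Hypothetical helper: print (not run) one concrete form of the community backup.
# Choices made here: -c (continue), -t 5 (retries), -r -l 20 (recursion depth),
# -p (page requisites), and -np so wget never climbs above the community.
lj_community_cmd() {
    # $1 = LiveJournal username, $2 = password, $3 = community name
    printf 'wget --http-user="%s" --http-passwd="%s"' "$1" "$2"
    printf ' --user-agent="Backup/%s" -c -t 5 -r -l 20 -p' "$3"
    printf ' -R "*reply*","*thread*" -np http://communities.livejournal.com/%s/\n' "$3"
}
```

Running lj_community_cmd LJ_USER LJ_PASS LJ_COMM prints the full line, ready to be checked before use.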

The output should be uploadable to any regular web server, but it is not likely to be readily imported into another blogging site, even one that uses the same codebase as LiveJournal.

So, if you have any concerns about the current policies being pursued on LiveJournal, here's a way to minimise any potential damage. Alternatively, if you've just always wanted a complete backup of your journal, there is a solution.

In case you're wondering, in spite of everything, I'm still more in the latter category. There's plenty of time to see where the current regime change is heading.

geeky, meta
