Tag all your old entries based on regexp patterns

threeleet in lj_nifty

Jun 29, 2005 23:34

Got hundreds of old entries, and don't feel like going through every single one of them to add tags? Discovered a cool new tag that you want to add, but don't want to look through lots of entries to figure out where to add it? Then look no further...

What it is:
LJ Tagger is a Perl script that can be run in any command-line environment that has Perl (Linux/Unix shell, OS X Terminal, Cygwin, etc.).

What it does:
Specify a regular expression pattern and a tag, and LJ Tagger will search through your offline archives to find potential matches. If a match is found, it'll display the subject & date, along with the matching paragraph, with the match(es) highlighted in red. You can then decide if the match is appropriate for tagging, and if you hit 'y', it'll update that entry, adding the tag you specified.
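
To make the "matching paragraph, highlighted in red" idea concrete, here is a minimal sketch of how that step could look. It is not code from ljtagger.pl: the helper name highlight_matches is made up for illustration, and it assumes paragraphs are separated by blank lines and that the terminal understands ANSI colour escapes.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from ljtagger.pl: return every paragraph of
# $text that matches $re, with each match wrapped in ANSI "red" escape codes.
sub highlight_matches {
    my ($text, $re) = @_;
    my ($red, $reset) = ("\e[31m", "\e[0m");
    my @hits;
    # Assumption: paragraphs are separated by one or more blank lines.
    for my $para (split /\n\s*\n/, $text) {
        next unless $para =~ $re;
        (my $marked = $para) =~ s/($re)/$red$1$reset/g;
        push @hits, $marked;
    }
    return @hits;
}

# Only the first paragraph matches, so only that one is printed.
my $entry = "Went to see a movie on Friday.\n\nCompiled mysql all weekend.";
print "$_\n" for highlight_matches($entry, qr/(?i:watch|movie)/);
```

Showing only the paragraph that matched, rather than the whole entry, keeps the y/n decision quick even for long entries.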

What it needs:

Perl, with the following extra modules:
- Date::Parse (Debian package libtimedate-perl)
- Getopt::Mixed (libgetopt-mixed-perl)
- HTTP::Cookies (libwww-perl)
- HTTP::Headers (libwww-perl)
- Term::ReadKey (libterm-readkey-perl)
- XML::Simple (libxml-simple-perl)
- XMLRPC::Lite (libsoap-lite-perl)
(To determine whether you have a given module, run 'perl -MFoo::Bar -e print', replacing Foo::Bar with any of the above. If nothing happens, you've got the module. If you get an error like "Can't locate Foo/Bar.pm in @INC [...]", you'll need to install it. The easiest way is via your system's packages on package-based systems (Linux, Fink, etc.); otherwise, run 'perl -MCPAN -e shell' and then 'install Foo::Bar' for each missing module.)
One or more exported XML files of your journal entries. Currently the formats generated by LiveJournal's export page (XML format) and Logjam are supported, since that's what I have; it shouldn't be too hard to add support for other, slightly different export formats.
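
If checking seven modules one at a time sounds tedious, the module check described above can be done in a single pass. The following is only a sketch: the helper name missing_modules is invented here, and it simply tries to load each module and reports the ones that fail.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names from @_ that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Convert Foo::Bar to Foo/Bar.pm, the file-name form that require()
        # accepts, so no string eval is needed.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @needed = qw(Date::Parse Getopt::Mixed HTTP::Cookies HTTP::Headers
                Term::ReadKey XML::Simple XMLRPC::Lite);
my @missing = missing_modules(@needed);
print @missing ? "Missing: @missing\n" : "All required modules found.\n";
```

Anything listed as missing still has to be installed through your package manager or the CPAN shell, as described above.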

Example:
Say you want to tag entries that mention watching movies with the tag 'movies'. You could use something like this:
ljtagger.pl -i -e 'seen?\b|watch|movie|imdb.com/' -t movies *.xml
You'll then see something like this:

In entry 'one more week...' on 2005-01-24 16:12:00: Pattern /(?i:seen?\b|watch|movie|imdb.com/)/ matched at:
Went to see <a href="[...]imdb.com/title/tt0385004/combined">House of Flying Daggers</a> on Friday, [...]
Add tag 'movies'? (y/n/(s)kip this entry/q)

Hit 'y' and it'll update that entry, adding 'movies' to the taglist. If 'movies' is already there, it'll skip it and move on to the next entry.
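
The "add the tag unless it is already there" step could be sketched roughly as follows. This is an illustration, not the script's actual code: the helper name add_tag is hypothetical, and it assumes the taglist is a comma-separated string such as "work, friends".

```perl
use strict;
use warnings;

# Hypothetical helper. Assumption: the taglist is a comma-separated string;
# ljtagger.pl may represent it differently.
# Returns the (possibly updated) taglist and a flag saying whether it changed.
sub add_tag {
    my ($taglist, $tag) = @_;
    my @tags;
    for my $t (split /,/, defined $taglist ? $taglist : '') {
        $t =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        push @tags, $t if length $t;   # ignore empty items
    }
    # Tag already present: return the list unchanged.
    return (join(', ', @tags), 0) if grep { $_ eq $tag } @tags;
    return (join(', ', @tags, $tag), 1);
}

my ($new, $changed) = add_tag('work, friends', 'movies');
print $changed ? "Updated taglist: $new\n" : "Already tagged, skipping.\n";
```

Returning a "changed" flag lets the caller skip the network update entirely when the entry already carries the tag.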

If you end up getting a match like:
Back to watching mysql compile (stupid new versions aren't compiled w/ ssl)...
which doesn't fit with the 'movies' tag, just hit 'n' to move on to the next match within the same entry, or 's' to skip to the next entry.

The pattern can be any Perl regular expression, or just a simple word (e.g. 'Bob' to tag all entries that mention your friend Bob), and the -w and -i options enable whole-word and case-insensitive matching, respectively.
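
For the curious, here is one way the -w and -i options could be folded into a pattern. It is a sketch under assumptions, not the script's real code: build_pattern and its option names are invented, the (?i:...) wrapper mirrors the form shown in the sample output above, and using \b for whole-word matching is a guess.

```perl
use strict;
use warnings;

# Hypothetical helper showing one way -w and -i could be applied.
# Assumption: whole-word matching is done with \b word boundaries.
sub build_pattern {
    my ($pattern, %opt) = @_;
    $pattern = "\\b(?:$pattern)\\b" if $opt{whole_word};
    $pattern = "(?i:$pattern)"      if $opt{ignore_case};
    return qr/$pattern/;
}

my $re = build_pattern('bob', whole_word => 1, ignore_case => 1);
for my $text ('Lunch with Bob today', 'Went bobsledding') {
    printf "%-22s %s\n", $text, $text =~ $re ? 'match' : 'no match';
}
```

With both options on, 'Bob' matches regardless of case, while 'bobsledding' does not, because the pattern must match a whole word.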

Usage/Disclaimer
The script is GPL'd, reasonably well commented, and comes with a detailed --help option which explains all the options (as well as verbose & debug modes for even more output). It can also read some of the options (e.g. username/password) from a config file if you want. It uses challenge/response authentication to generate a session cookie for updates, so your password is never sent over the network; the cookie can then be expired with '-x' when you're all done.
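
To show why challenge/response keeps the password off the wire, here is a minimal sketch of the response calculation. It assumes LiveJournal's documented scheme, in which the response is the hex MD5 of the server's challenge string concatenated with the hex MD5 of the password; the function name auth_response is hypothetical, the values are placeholders, and ljtagger.pl's own code may well differ.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Sketch only. Assumption: response = MD5(challenge . MD5(password)), in hex.
# The function name is made up for this example.
sub auth_response {
    my ($challenge, $password) = @_;
    return md5_hex($challenge . md5_hex($password));
}

# Placeholder values: a real challenge string is issued by the server.
my $response = auth_response('example-challenge', 'example-password');
print "auth_response: $response\n";
```

Only the challenge and the resulting hash travel over the network, and since each challenge is meant to be used once, a captured response is of little use to an eavesdropper.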

I've already run the script on a bunch of my own entries, and tested it on both my Debian/Linux box and my Mac, and haven't run into any problems. If anything bad does happen, you've always got your backup XML files anyway (which this script won't work without). If you're extra paranoid, you could make a separate copy of your backup files, even though this script doesn't modify them at all. Obviously, if you're really scared about messing with large numbers of entries, and unsure whether you can restore them from your backups, then don't run this script.

Enough already, gimme!
You can download the script here (4879 bytes, ~12k unzipped). If you find it useful, or have any ideas on how to make it better, leave a comment. :)