A Hobbit (diaryofahobbit) wrote in lj_dev,

livejournal archival utility (perl)

Hey all,

I just wrote a little utility for extracting all your livejournal entries into one big .html file. I figured others would find it useful for their own archival purposes, or as a starting point for writing their own custom archiver.

You can get it here.

Here's what it does:

  • gets the html for all your entries (public, private, and protected) and puts them in one page, earliest to latest.
  • entries are grouped by month. each month has forward and back links so you can jump around.
  • at the end of the resulting .html page, some basic stats on your journal entries are given.
  • the date/time stamp of each entry is linked to the page on livejournal.com, so you can see the comments. this is much easier than attempting to download all the comments for an entry, though i haven't really looked into that possibility much.
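For anyone using this as a starting point, here is a rough sketch of the month grouping and forward/back links described above. It is written in Python rather than perl and is not the utility's actual code. The tuple layout `(timestamp, url, html)`, the `mYYYY-MM` anchor names, and both function names are assumptions made for illustration; the entry URL is expected to be supplied by the caller.

```python
from html import escape


def group_by_month(entries):
    """Group (timestamp, url, html) tuples by 'YYYY-MM', earliest first.

    Timestamps are assumed to look like 'YYYY-MM-DD HH:MM:SS', so plain
    string sorting puts them in chronological order.
    """
    months = {}
    for stamp, url, body in sorted(entries, key=lambda e: e[0]):
        months.setdefault(stamp[:7], []).append((stamp, url, body))
    return months


def render_page(months):
    """Render one HTML fragment with a back/forward link row per month."""
    keys = list(months)
    out = []
    for i, key in enumerate(keys):
        nav = []
        if i > 0:
            nav.append('<a href="#m%s">&lt;&lt; %s</a>' % (keys[i - 1], keys[i - 1]))
        if i < len(keys) - 1:
            nav.append('<a href="#m%s">%s &gt;&gt;</a>' % (keys[i + 1], keys[i + 1]))
        out.append('<h2 id="m%s">%s</h2>' % (key, key))
        out.append('<p>%s</p>' % ' | '.join(nav))
        for stamp, url, body in months[key]:
            # The date/time stamp links back to the entry's page (and its comments).
            out.append('<h3><a href="%s">%s</a></h3>' % (escape(url), escape(stamp)))
            out.append(body)  # entry html is passed through untouched
    return '\n'.join(out)
```

Keeping the grouping separate from the rendering means the per-month lists can also feed the stats section at the end of the page.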

Here's what it doesn't do, but would be nice:

  • have it extract only the entries that have changed since the last time it was run. a good implementation of this might use the client-server protocol's facility for listing items changed since a given time (the syncitems mode, i believe).
  • mood and current music information is not saved in the resulting page. this would be pretty easy to implement.
  • on protected entries, it doesn't tell you which friend group(s) can see it.
  • lj-user tags aren't converted, so when you view the resulting page in your browser, those links won't show up as neat little people icons. this actually wouldn't be too hard to implement... just a search/replace on a regexp. actually that's probably true for many of the lj-specific tags.
  • a calendar view (possibly in a left-nav frame) would be a nice way to browse entries on the main output page.
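The lj-user search/replace mentioned above might look something like this. It is a sketch in Python, not part of the utility. The output markup (a plain bold link, no icon) and the `http://www.livejournal.com/users/` base URL are assumptions, as is the function name `expand_lj_user`; the pattern only covers the simple `<lj user="name">` form, with or without quotes.

```python
import re

# Matches <lj user="name">, <lj user='name'>, <lj user=name> and <lj user="name" />.
LJ_USER = re.compile(r"""<lj\s+user\s*=\s*["']?([A-Za-z0-9_]+)["']?\s*/?>""",
                     re.IGNORECASE)


def expand_lj_user(html, base="http://www.livejournal.com/users/"):
    """Replace lj-user tags with an ordinary link to that user's journal.

    `base` is an assumed URL prefix; adjust it to whatever the site uses.
    """
    def to_link(match):
        name = match.group(1)
        return '<a href="%s%s/"><b>%s</b></a>' % (base, name, name)

    return LJ_USER.sub(to_link, html)
```

Running it over each entry body before writing the page is enough; text without lj-user tags comes back unchanged.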

Here are some other thoughts:

  • a lot of people put IMG links inside their journal entries. it would be nice if the tool automatically retrieved local copies of those, too... so you could, say, put your journal on CD and have everything viewable without being connected to the net.
  • lj-poll tags would be nice to expand. i don't think there's a way in the client-server protocol to get more information on what's inside a poll. the best thing i can think of is to hit the webpage for the poll and snarf the images and text that the lj server expands it to.
  • getting your journal in .xml format would be nice, and some people have done this. but the problems i see with it are 1) the average person isn't going to want to run it through an xslt processor and 2) entries often include broken html, and most of them aren't xhtml-compliant. that makes translation that much more difficult. that said, having an archival tool with a built-in xslt engine and some sample stylesheets would be cool at some point.
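As a first step toward the local-image idea above, here is a small Python sketch that collects the IMG sources from an entry's html so they could be downloaded later. It uses the standard library's `html.parser`, which is lenient about unclosed tags and unquoted attributes, so the broken html mentioned above should not stop it. The names `ImgCollector` and `image_urls` are made up for this example, and the actual downloading and link rewriting are left out.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(entry_html):
    """Return the image URLs found in one entry, in document order."""
    parser = ImgCollector()
    parser.feed(entry_html)
    parser.close()
    return parser.sources
```

The returned list keeps duplicates, so a caller that fetches the images would probably want to dedupe it first.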
