kypeli ([info]kypeli) wrote in [info]lj_dev,

Downloading all public articles from a blog

Hi all,

I am looking for a way (for academic research) to download all articles from a blog in XML/JSON format. Is this somehow possible? I know of the RSS feed mentioned on http://www.livejournal.com/bots/ but then I only get the most recent articles from a blog, not all of them.

I would also be interested in a way to get a specific article's content in XML/JSON format, if I know the direct URL to it. I haven't found a way to do this (other than, again, the RSS feed, but the article might not be listed in the feed as it shows only the recent articles from that blog).

I am new to LiveJournal so any help is highly appreciated :) Thanks!
Tags: client, client: export, client: unauthenticated access, code: perl

  • 10 comments

[info]va_dev

January 28 2012, 20:50:20 UTC 3 months ago

There is an API that you can use: http://www.livejournal.com/doc/server/index.html. This requires some scripting/coding. Does this make sense?

[info]kypeli

January 28 2012, 21:06:06 UTC 3 months ago

Thanks for your reply! Does it make sense for my use case? I am not sure :)

If I understood the documentation right, those API calls would let me interface with my own blog entries by authenticating myself first with the server. But I could not find relevant information from that link on how I could take any public blog on livejournal.com and download content from it. But maybe I just didn't understand how to read the docs?

Maybe I missed something?

[info]va_dev

January 28 2012, 21:15:01 UTC 3 months ago

The best way I know of is the XML-RPC protocol. There are existing client implementations in various programming languages, but you can write your own too. This page, http://www.livejournal.com/doc/server/ljp.csp.xml-rpc.protocol.html, lists the methods you can use to query anything you need from a journal. In your particular case you can use the getevents method in combination with others. The catch is that the number of events (entries) returned per query is limited to 50; however, you can still fetch all blog entries step by step using the API.
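A minimal sketch of the "step by step" idea, in Python. It is not taken from the thread: the parameter names (`selecttype`, `howmany`, `beforedate`) and the response fields (`events`, `itemid`, `eventtime`) are assumptions based on the protocol page linked above, and `fetch_all_events` is a hypothetical helper. The XML-RPC call itself is passed in as a function, so the paging logic can be tried without touching the server.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50  # per-query cap mentioned in the comment above


def fetch_all_events(getevents: Callable[[dict], dict],
                     base_params: dict) -> List[dict]:
    """Page backwards through a journal, PAGE_SIZE entries per call.

    `getevents` is any callable that takes a parameter dict and returns a
    dict with an "events" list (assumed response shape).
    """
    seen: Dict[int, dict] = {}
    before = None
    while True:
        params = dict(base_params, selecttype="lastn", howmany=PAGE_SIZE)
        if before is not None:
            params["beforedate"] = before  # assumed "yyyy-mm-dd hh:mm:ss"
        events = getevents(params).get("events", [])
        new = [e for e in events if e["itemid"] not in seen]
        if not new:
            break  # nothing we have not already stored
        for e in new:
            seen[e["itemid"]] = e
        # Next page: everything older than the oldest entry seen so far.
        before = min(e["eventtime"] for e in new)
        if len(events) < PAGE_SIZE:
            break  # a short page means we reached the start of the journal
    return sorted(seen.values(), key=lambda e: e["eventtime"])


# Possible wiring against the real endpoint (not run here; auth fields and
# the endpoint URL should be checked against the protocol docs):
#   import xmlrpc.client
#   server = xmlrpc.client.ServerProxy(
#       "http://www.livejournal.com/interface/xmlrpc")
#   entries = fetch_all_events(server.LJ.XMLRPC.getevents,
#                              {"username": "...", "ver": 1})
```

Entries are de-duplicated by `itemid`, so it does not matter whether the server treats `beforedate` as inclusive; entries that share the exact same timestamp at a page boundary may still need separate handling.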

[info]int

January 28 2012, 21:17:30 UTC 3 months ago

You could do it via the LJ protocol and syncitems/getevents, then output it all in whatever format you want. This would mean you'd have to have the username/password of a user to get their items though, which I'm guessing you don't want to do as you mentioned pulling all public items.
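For a journal you do have credentials for, the syncitems half of that approach might look roughly like the Python sketch below. The field names (`syncitems`, `item`, `time`, `count`, `total`, `lastsync`) and the "L-<number>" convention for journal entries are assumptions drawn from the protocol documentation, not from this thread, and `list_entry_ids` is a hypothetical helper. As noted above, this does not help with other people's public journals.

```python
from typing import Callable, List


def list_entry_ids(syncitems: Callable[[dict], dict],
                   base_params: dict) -> List[int]:
    """Collect journal entry ids by calling syncitems until caught up."""
    ids: List[int] = []
    lastsync = None
    while True:
        params = dict(base_params)
        if lastsync is not None:
            params["lastsync"] = lastsync
        reply = syncitems(params)
        items = reply.get("syncitems", [])
        if not items:
            break
        for it in items:
            # Assumed format "<type>-<number>"; "L" marks a journal entry.
            kind, _, num = it["item"].partition("-")
            if kind == "L" and int(num) not in ids:
                ids.append(int(num))
        # Resume from the newest modification time seen in this batch.
        lastsync = max(it["time"] for it in items)
        if reply.get("count", len(items)) >= reply.get("total", len(items)):
            break  # the server reports nothing further to send
    return ids
```

The resulting ids could then be fetched with getevents and written out in whatever format is needed.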

[info]kypeli

January 28 2012, 21:37:57 UTC 3 months ago

That is correct. I am interested in analyzing certain (public) blogs and their content but I am not the admin of these blogs.

So basically there isn't really a way to do what I would like to do?

[info]andy

January 29 2012, 07:30:06 UTC 3 months ago

Scraping HTML is the way to do it; LJ is fine with that, assuming your system behaves itself and doesn't create too much strain on the servers. This Perl script used to be able to save a given journal to a set of disk files: http://pastebin.com/1CaVmEij. I haven't checked if it still works, but reviewing it may give you some ideas.
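For anyone who would rather not read Perl, here is a rough Python sketch of the same idea. It is not a port of the linked script. The entry URL pattern (`http://<user>.livejournal.com/<number>.html`), the two-second delay, and the placeholder User-Agent string are all assumptions; check the bot policy page linked in the original post for the actual rules.

```python
import re
import time
import urllib.request
from html.parser import HTMLParser

# Assumed shape of a permanent entry link.
ENTRY_URL = re.compile(r"^https?://[\w-]+\.livejournal\.com/\d+\.html$")


class EntryLinkParser(HTMLParser):
    """Collect unique entry links from a journal page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Drop comment anchors and query strings such as ?thread=123#t123.
        href = href.split("#")[0].split("?")[0]
        if ENTRY_URL.match(href) and href not in self.links:
            self.links.append(href)


def extract_entry_links(html: str) -> list:
    parser = EntryLinkParser()
    parser.feed(html)
    return parser.links


def polite_get(url: str, delay_seconds: float = 2.0,
               user_agent: str = "research-scraper (contact: you@example.org)") -> str:
    """Fetch one page, sleeping first so requests stay well spaced."""
    time.sleep(delay_seconds)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A crawl would call `polite_get` on a journal's index pages, pass each result to `extract_entry_links`, and then fetch the entries one at a time.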

[info]kypeli

January 29 2012, 07:44:58 UTC 3 months ago

Thanks! I was afraid it would go to scraping HTML, but that Perl should be very helpful. Cheers!

[info]kypeli

January 29 2012, 13:02:26 UTC 3 months ago

Thanks again for the Perl script! It worked perfectly!

[info]andy

January 29 2012, 13:08:58 UTC 3 months ago

I'm glad I was able to help!

[info]undyingking

January 29 2012, 14:57:10 UTC 3 months ago

Thanks for that script link, v useful!