There's been talk for some time of a general mechanism for applications to get access to LiveJournal data, whether it be for cool meme toys, statistical analysis or whatever. After a lot of consideration of the options, I've come to believe that RDF is the way to go for this. The reason is that RDF is designed to represent exactly the kinds of data and relationships we want to describe. In addition, there are already lots of tools out there for dealing with RDF, and lots of RDF applications we can borrow from to make the data more useful to non-LJ-specific applications.
However, RDF taken to the extreme can make it hard for people to write “quick-and-dirty” data extraction tools. What I propose is that we decide which top-level entity types we want to represent and offer an interface in which a given URI (such as /data/user/mart or /data/user/mart/journal) provides an RDF document with one specific entity at the top level of the document (directly under rdf:RDF), with a promise that it'll always stay there. This way, people who aren't interested in the whole “RDF thing” will be able to use a simple XML parser to find out about one person, or one entry, or whatever. These single-entity RDF documents will be like the per-user FOAF data we have already.
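To illustrate how little a “quick-and-dirty” tool would need, here is a minimal Python sketch that reads the promised top-level entity with nothing but a namespace-aware XML parser. The sample document is an assumption about what /data/user/mart might return; no schema has been agreed, and only the rdf: and foaf: namespace URIs are real.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# Hypothetical response body for /data/user/mart (the real schema is undecided).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:nick>mart</foaf:nick>
  </foaf:Person>
</rdf:RDF>
"""


def top_level_entity(document):
    """Return the first child of rdf:RDF, where the entity is promised to live."""
    root = ET.fromstring(document)
    if root.tag != f"{{{RDF_NS}}}RDF":
        raise ValueError("document root is not rdf:RDF")
    children = list(root)
    if not children:
        raise ValueError("rdf:RDF element is empty")
    return children[0]


entity = top_level_entity(SAMPLE)
nick = entity.findtext(f"{{{FOAF_NS}}}nick")
print(entity.tag, nick)
```

Because the entity's position is fixed, the tool never has to interpret the document as a graph; a full RDF processor would read the same bytes as ordinary RDF/XML.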
So what are the main entity types we want to represent? As far as I can see, we have these:
- People
- Communities
- Journals
- Entries
- Comments
The separation of people and communities from journals isn't one we usually make, but it allows us to provide a journal as a separate entity including a list of recent entries separately from a person's or community's profile, plus it allows us to change the relationship between journals and users in the future if it's ever desirable to allow one user to have several journals.
There are plenty of existing RDF applications we can use for these:
- People and Communities
  - The obvious one for these is FOAF, with Person entities for people and Group entities for communities. We already do this in the FOAF data.
- Journals, Entries and Comments
  - A journal itself is a local concept, so we'll have to make up an RDF vocabulary to describe it. Views of the journal, though, such as a recent-entries feed, can be provided as RSS 1.0 with Dublin Core elements. We can borrow the old Forumzilla schema, or invent something similar, to represent the parent-child relationships between entries and comments.
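As a rough idea of how parent-child relationships between entries and comments could be consumed, the Python sketch below rebuilds a reply map from RSS 1.0 items. The `t:inReplyTo` property and its namespace are placeholders invented for this example (they are not taken from Forumzilla or any existing schema), and the feed is a trimmed fragment rather than a complete RSS 1.0 document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
# Hypothetical threading vocabulary; nothing like this has been defined yet.
THREAD = "http://example.org/lj-thread#"

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:t="http://example.org/lj-thread#">
  <item rdf:about="http://example.org/entry/1">
    <title>An entry</title>
    <dc:creator>mart</dc:creator>
  </item>
  <item rdf:about="http://example.org/entry/1#c1">
    <dc:creator>alice</dc:creator>
    <t:inReplyTo rdf:resource="http://example.org/entry/1"/>
  </item>
  <item rdf:about="http://example.org/entry/1#c2">
    <dc:creator>bob</dc:creator>
    <t:inReplyTo rdf:resource="http://example.org/entry/1#c1"/>
  </item>
</rdf:RDF>
"""


def children_by_parent(document):
    """Map each parent URI to the URIs of the items that reply to it."""
    root = ET.fromstring(document)
    replies = defaultdict(list)
    for item in root.findall(f"{{{RSS}}}item"):
        parent = item.find(f"{{{THREAD}}}inReplyTo")
        if parent is not None:
            parent_uri = parent.get(f"{{{RDF}}}resource")
            replies[parent_uri].append(item.get(f"{{{RDF}}}about"))
    return dict(replies)


thread = children_by_parent(SAMPLE)
```

Here the entry has one direct reply (c1), which in turn has one reply (c2); items without a parent link are treated as top-level entries.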
Of course, the applications we borrow can be supplemented by our own vocabularies to describe relationships unique to LiveJournal, such as the “current music” field associated with entries, to pick a trivial example. Users of the data will need an HTTP library of some kind coupled with either a namespace-aware XML parser (of which there are plenty) or a full-blown RDF processing tool for more in-depth analysis.
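A client along those lines might look like the Python sketch below: an HTTP fetch plus a namespace-aware XML parser. The `lj:` namespace URI, the `lj:Entry` type and the `lj:currentMusic` property are all assumptions made up for the example; only the rdf: and Dublin Core namespaces are real. The `fetch` helper is shown for completeness and is not called here.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    # Hypothetical LiveJournal vocabulary; URI and property names are assumptions.
    "lj": "http://example.org/lj-schema#",
}

# Hypothetical single-entry document, as the data interface might serve it.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:lj="http://example.org/lj-schema#">
  <lj:Entry rdf:about="http://example.org/entry/1">
    <dc:title>Hello</dc:title>
    <lj:currentMusic>Some Band - Some Song</lj:currentMusic>
  </lj:Entry>
</rdf:RDF>
"""


def fetch(url):
    """Download a document over HTTP and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def entry_summary(document):
    """Pull a borrowed (dc:) and an LJ-specific (lj:) field from one entry."""
    root = ET.fromstring(document)
    entry = root[0]  # the entity sits directly under rdf:RDF
    return {
        "title": entry.findtext("dc:title", namespaces=NS),
        "music": entry.findtext("lj:currentMusic", namespaces=NS),
    }


summary = entry_summary(SAMPLE)
```

A tool that only understands Dublin Core would still get the title and could ignore the LJ-specific element entirely, which is the point of mixing vocabularies by namespace.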
We already supply a load of data, but only the FOAF data is actually in some standard format for which there are general tools available. We can, by all means, keep the simpler LJ-specific protocols such as the “fdata” script around for more trivial tools, but for this data interface to be genuinely useful I think it must offer all of the data in a similar way so that a single tool, if it is general enough, can process all of the data in the same way.
We can use the rdfs:seeAlso property to give full RDF processors the links between our separate entity-rooted RDF documents, and to link to the existing FOAF data, since there's no point in duplicating that code. It might be useful, though, to give our foaf:Person and foaf:Group entities URIs so that they are no longer anonymous and we can cross-reference them.
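For a sense of how a crawler might follow those links, here is a small Python sketch that collects the rdfs:seeAlso targets from one document. The URLs and the `rdf:about` identifier are made up for illustration; the rdf:, rdfs: and foaf: namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical user document; the person is named with rdf:about rather than anonymous.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/data/user/mart#person">
    <foaf:nick>mart</foaf:nick>
    <rdfs:seeAlso rdf:resource="http://example.org/data/user/mart/journal"/>
    <rdfs:seeAlso rdf:resource="http://example.org/data/user/mart/foaf"/>
  </foaf:Person>
</rdf:RDF>
"""


def see_also_links(document):
    """Return the rdf:resource target of every rdfs:seeAlso, in document order."""
    root = ET.fromstring(document)
    links = []
    for element in root.iter(f"{{{RDFS}}}seeAlso"):
        target = element.get(f"{{{RDF}}}resource")
        if target is not None:
            links.append(target)
    return links


links = see_also_links(SAMPLE)
```

Once the Person has a URI of its own, another document (a journal, say) can point back at it with `rdf:resource`, which is not possible while the node is anonymous.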
I had considered creating an example schema to show off before writing this, but I thought it better to get some discussion going before anyone starts on that, so for now I'm “all talk”. Ultimately, it would be nice if all data that can be read by humans in the HTML can be read by software in the data interface, aside from some obvious exceptions like the obfuscated email addresses which are intended to be hard for machines to read.