Anatoly Vorobey (avva) wrote in lj_dev,

journal backup/export

A few hours ago we held an IRC brainstorming session on the subject of journal backup, journal export, and the syncitems protocol mode (which was recently disabled temporarily). We came up with a plan which will hopefully soon lead to the creation of a new system to be used for both backup and user-initiated export of journals (including entries and comments).

The complete log of the irc session is available for your reading pleasure; below is a quick summary of what we agreed upon, for now. Feedback is invited. Questions are welcome.

Generally, we're going to start storing copies of journals, for backup and export purposes, in standalone files, one file per user/account/journal. The file will contain the copy of the journal in a light database format (DBM, possibly Berkeley DB or a similar format — to be determined). It will certainly contain all the entries posted in that journal and all the comments to all of them; possibly, it'll also contain miscellaneous data from the userinfo page, etc.
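To make the one-file-per-journal idea concrete, here is a minimal sketch using Python's standard `dbm.dumb` module as a stand-in for whatever DBM flavour ends up being chosen. The key scheme (`entry:ID`, `comment:ID`), the JSON values and the function names are all assumptions made for illustration; nothing about the real format has been decided.

```python
import dbm.dumb
import json
import os

def open_backup(backup_dir, user):
    """Open (creating if needed) the one backup file belonging to `user`."""
    return dbm.dumb.open(os.path.join(backup_dir, user), "c")

def store_item(db, kind, item_id, record):
    """Save an entry or comment under a key such as 'entry:42' (hypothetical scheme)."""
    db[f"{kind}:{item_id}"] = json.dumps(record)

def load_item(db, kind, item_id):
    """Read an item back; DBM values come back as bytes, which json.loads accepts."""
    return json.loads(db[f"{kind}:{item_id}"])
```

Keeping entries and comments in the same file under different key prefixes means a single file is enough to copy, export or restore one journal.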

The DBM files for all journals will be updated asynchronously: new entries and comments in a journal aren't immediately mirrored in the DBM backup copy. Rather, a maintenance job will check which journals need synchronization with their backup copies and perform it.
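A rough sketch of what one pass of such a maintenance job could look like is below. The set of "dirty" journal names and the two callbacks (`fetch_changes`, `apply_changes`) are hypothetical placeholders for the main DB and the backup file; how journals actually get flagged as needing a sync is still to be worked out.

```python
def sync_dirty_journals(dirty, fetch_changes, apply_changes):
    """Run one maintenance pass; return the journals that were synced.

    dirty          -- set of journal names flagged as out of date (assumed)
    fetch_changes  -- callable(user) -> pending changes from the main DB
    apply_changes  -- callable(user, changes) that writes them to the backup
    """
    synced = []
    for user in sorted(dirty):
        changes = fetch_changes(user)
        apply_changes(user, changes)
        synced.append(user)
    # Clear the flags only after the backup copies have been written.
    dirty.difference_update(synced)
    return synced
```

Posting an entry or comment then only has to add the journal to the dirty set, which is cheap; the expensive copying happens later, in the background.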

Client programs will be able to request that a user's journal be exported in an XML format — either the whole journal from the beginning, or all updates made to the journal since a particular time ("updates" means new entries, new comments to new or old entries, edited old entries, etc.). The XML format into which the DBM data will be converted on-the-fly for downloading by the client program will be "shallow XML" rather than "nested XML", which means that it'll include new entries, comments etc. keyed by numeric IDs and not nested into each other in the form of a tree.

(Example for clarification: suppose there's an entry written in September, with lots of comments left on it in September and some new comments left in October. If the client requests all updates starting from Oct 1, the server isn't going to send the September entry and September comments, only the new October comments, which on their own are impossible to organise into a coherent tree structure. Presumably the client which requests updates from October already has all the September data stored locally on the user's computer, and will be able to construct a meaningful "nested XML" representation from the "shallow XML" pieces it received from the server for various periods of time.)
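The September/October scenario can be sketched with Python's standard `xml.etree.ElementTree`. The element names (`journal`, `entry`, `comment`), the `id`/`entry`/`parent` attributes and the numeric `modified` timestamps are invented for this illustration; the real export schema hasn't been designed yet. `export_shallow` plays the server, while `merge_shallow` and `build_nested` play a client that keeps a local store.

```python
import xml.etree.ElementTree as ET

def export_shallow(items, since=0):
    """Serialize items modified at or after `since` as a flat XML list."""
    root = ET.Element("journal")
    for item in items:
        if item["modified"] < since:
            continue
        attrs = {"id": str(item["id"])}
        for ref in ("entry", "parent"):  # links are by numeric ID, not nesting
            if item.get(ref) is not None:
                attrs[ref] = str(item[ref])
        ET.SubElement(root, item["kind"], attrs).text = item["text"]
    return ET.tostring(root, encoding="unicode")

def merge_shallow(store, xml_text):
    """Fold one shallow export into the client's local store."""
    for el in ET.fromstring(xml_text):
        store[(el.tag, el.get("id"))] = dict(el.attrib, text=el.text or "")

def build_nested(store):
    """Rebuild an entry -> comment -> reply tree from the local store."""
    root = ET.Element("journal")
    nodes = {}
    for (kind, item_id), rec in store.items():
        node = ET.Element(kind, id=item_id)
        node.text = rec["text"]
        nodes[(kind, item_id)] = (node, rec)
    for (kind, item_id), (node, rec) in nodes.items():
        if kind == "entry":
            root.append(node)
        elif rec.get("parent"):
            nodes[("comment", rec["parent"])][0].append(node)
        else:
            nodes[("entry", rec["entry"])][0].append(node)
    return root
```

Note that `build_nested` raises `KeyError` if a comment's parent was never downloaded, which is exactly the case of a client that asks for October updates without holding the September data.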

Eventually we may also make it possible to download the entire journal in a "nice" publishing format such as PDF, generated on-the-fly from the XML representation, itself generated on-the-fly from the DBM file. Due to bandwidth concerns, we will naturally have to put some restrictions on such downloads: we don't want users to download their entire journals spanning several years every day "just in case". The normal procedure for user backup should be incremental backup via the XML-based export explained above, probably with the help of a smart client program.

The DBM file will also have all the information necessary to restore the user's journal in the main databases in case of catastrophic DB corruption, etc.

The fate of the current syncitems protocol mode remains undecided for now. We cannot continue to provide syncitems as it is, because it doesn't scale to a large number of users syncing a large number of journals (too hard on the DB servers). Basically, the XML export feature we're planning will be the same as syncitems, only, err, in XML (and including comments; and up to the last backup date for this journal, rather than up to the last entry currently in the DB). It'll be possible to reimplement syncitems so that it works with the DBM export files; we're not yet sure this will be worth the effort. If everyone ends up liking XML export and writing useful tools/clients to support it, we may not need to preserve the current syncitems mode.