December 28th, 2003

getting comment data

I've been thinking about making comment data accessible in XML. Follow with me as I ramble.

URLs could be of form:
to mimic:

But we don't want to include all the text/props, because that could get huge with popular posts like on news. It'd be shitty to make people poll all that.

What if the 234.xml file by default just listed metadata for each comment? So if there were 5,000 comments (typical max for news), and each comment's metadata were one line like:

<commentref id='353' state='A' parent='0' datepost='yyyymmdd hhmmss' />

If things like state/parent default to 'A' (approved) and zero (top level), then that's like 80 bytes/comment * 5,000 comments.... hell, still 390k (uncompressed, at least). That's a lot to poll.
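For concreteness, a little sketch of emitting those commentref lines with the default attributes left out to save bytes. The dict keys and helper name here are made up for illustration, not real LJ code:

```python
# Hypothetical sketch: emit one <commentref/> per comment, omitting
# attributes that hold their default values (state='A', parent=0).
from xml.sax.saxutils import quoteattr

def commentref_xml(comment):
    attrs = [("id", str(comment["id"])),
             ("datepost", comment["datepost"])]
    if comment.get("state", "A") != "A":      # default: approved
        attrs.append(("state", comment["state"]))
    if comment.get("parent", 0) != 0:         # default: top-level
        attrs.append(("parent", str(comment["parent"])))
    return "<commentref %s />" % " ".join(
        "%s=%s" % (k, quoteattr(v)) for k, v in attrs)

print(commentref_xml({"id": 353, "datepost": "20031228 103000"}))
# -> <commentref id="353" datepost="20031228 103000" />
```

Most comments are approved and top-level, so most lines shrink to just id + date.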

We could do something like making the base URL

only return the first 1000 comments by default, with a continuation URL in a certain XML element, like:

(to get talkids in that post after talkid 3454)

But the problem then is comments' state can change. (between approved, screened, and deleted) That's the only thing about a comment that can change.

What I've been thinking about for the next version of LJ's backend data storage is that each account has a global counter that's incremented for any change: new posts, new comments, any modifications. Then the URL could be: 234.xml?after_revid=123456 and it'd include comment state changes too.
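The client side of that after_revid polling could be as dumb as this loop. The URL shape and the fetch()/apply_changes() interfaces are invented for illustration:

```python
# Hypothetical client loop for the after_revid idea: keep asking for
# everything past the highest revision we've seen, until a response
# comes back empty.
def sync_comments(fetch, apply_changes, last_revid=0):
    """fetch(url) -> (changes, max_revid); returns the final revid."""
    while True:
        changes, max_revid = fetch("234.xml?after_revid=%d" % last_revid)
        if not changes:
            return last_revid          # caught up
        apply_changes(changes)         # new comments AND state changes
        last_revid = max_revid
```

The nice property is that a client who synced last week just resumes from its saved revid instead of re-polling everything.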

But currently, we have no index on when comment state changes happen.

Alternatively, we could ignore the state field altogether and only provide it when requested. (maybe as one big packed string, but that's lame in XML.... *coughSVGcough*)

Then the next issue: authentication. I like the idea of making the URLs easy to get at ("REST") as above, though I'm also cool with making it SOAP and/or XML-RPC. (I'm liking SOAP more, having played with C#/Mono.) With SOAP/XML-RPC we can do Digest auth (avva should be checking it in any day now). So maybe we can do the same with the REST URLs, instead of requiring people to do hacky stuff to get LJ login cookies.

Perhaps a globally recognized URL parameter to force Digest auth and set up $remote, even if you don't have LJ login cookies:

Now the user sends their Digest auth credentials and we show them all comments, including screened ones.
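For reference, the client's side of Digest auth (the no-qop flavor from RFC 2617) boils down to three MD5s; the function names here are mine, and the realm/nonce would come from the server's WWW-Authenticate challenge:

```python
# Sketch of computing an RFC 2617 Digest auth response (no qop).
import hashlib

def md5hex(s):
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(user, password, realm, nonce, method, uri):
    ha1 = md5hex("%s:%s:%s" % (user, realm, password))      # identity
    ha2 = md5hex("%s:%s" % (method, uri))                   # request
    return md5hex("%s:%s:%s" % (ha1, nonce, ha2))           # response
```

The password never crosses the wire, which is the whole point versus shipping LJ login cookies around.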

Thoughts? What does everybody want in terms of interface?

fetching comments: Use cases

Another round of thoughts on all the ways/reasons to fetch comments. It basically falls into two groups: Downloading all comments for a journal, and all for a certain post. Policy is also discussed below.

Use case #1: Downloading all comments for a journal.


-- your journal: obviously.

-- a community: probably, especially if you're a member/poster of it. any reason why not?

-- an arbitrary person's account: questionable, both for server load and privacy. do people want to give other people such easy access to all their comments?


Comments never change, except for two things:

-- states. these change between screened, approved, and deleted. in most cases, this doesn't matter, unless it's your journal, or one you're an admin of. there are 3 values currently used, so 2 bits, but let's make it simpler and assume this is one byte.

-- posterid. with jesse's upcoming patch, the posterid of a talk2 row can change from 0 (anonymous) to the auto-created pseudo-account of a validated email address. a userid is 4 bytes.

So, the fluid part of a comment is 5 bytes. jtalkids are 3 bytes, but they increase in order from 1 for each account, so we can leave them off in the whole-journal case. If we were to provide all that data, packed, for an account, it'd be 5 bytes * number of comments. The news account has 37,000 comments, so that's 180k uncompressed. Not much bandwidth, but a shitload of disk seeks potentially. Probably need paging, both for disk seeks and bandwidth. I imagine people will want XML and not binary blobs they have to unpack. Well, probably text too, one item per line... make everybody happy.
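Here's roughly what that packed 5-bytes-per-comment blob could look like; the exact layout (state byte first, then posterid) is my invention, just matching the byte math above:

```python
# Hypothetical packed whole-journal blob: 5 bytes per comment,
# jtalkid implied by position (record N is jtalkid N+1).
import struct

STATES = {"A": 0, "S": 1, "D": 2}     # approved / screened / deleted
REC = struct.Struct(">BI")            # 1-byte state + 4-byte posterid

def pack_journal(comments):
    """comments: list of (state_char, posterid); index = jtalkid - 1."""
    return b"".join(REC.pack(STATES[s], pid) for s, pid in comments)

def unpack_journal(blob):
    rev = {v: k for k, v in STATES.items()}
    return [(rev[s], pid) for s, pid in REC.iter_unpack(blob)]
```

Leaving the jtalkid implicit is what gets it down to 5 bytes instead of 8.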

Anyway, once they get the list of all talkids and their states/posterids, they need the text/props. We could provide URLs to fetch either a range (1-100), a single item, or a set of items. (more below)

Use case #2: Downloading all comments for a single entry

Policy: should be allowed by anybody for anything (given proper security), since you can already read all comments with the HTML interface.


We already keep a memcache blob of all comments, packed, for a post. We could pretty easily just take that and dump it out to the user, removing some bits for security reasons. (don't show non-owners screened metadata)

So this comes down to, I think, 8 bytes per comment: 3 talkid, 1 state, 4 userid.

After that request, the client can use the same mode as before for fetching all the metadata for a set of comments (whatever they don't already have).
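The 8-byte per-comment record could pack/unpack like this; the layout is invented to match the byte counts above (struct has no 3-byte format code, hence the slicing):

```python
# Hypothetical 8-byte record for a single entry's comment list:
# 3-byte jtalkid, 1-byte state, 4-byte posterid, big-endian.
import struct

def pack_comment(jtalkid, state, posterid):
    assert 0 <= jtalkid < 1 << 24            # must fit in 3 bytes
    return struct.pack(">I", jtalkid)[1:] + struct.pack(">BI", state, posterid)

def unpack_comment(rec):
    jtalkid = int.from_bytes(rec[:3], "big")
    state, posterid = struct.unpack(">BI", rec[3:])
    return jtalkid, state, posterid
```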

Summary of request types

URL to fetch the 5 bytes per comment for a whole journal. This works almost okay as 5 bytes (news is kinda big, and only getting bigger). If XML, it'd probably need to be paged to keep responses at 200k or so per page. Probably best to page it in any case.

URL to fetch the 8 bytes for all comments in an entry. We could do this unpaged and binary, or paged and XML, perhaps going backwards.

URL to fetch entirety of a single comment, set of comments, or range of comments, limited to so many comments per response.

Final thought on server abuse...

We could restrict queries to a certain speed per user, and only provide blocks of 1000 at a time or something. We could even use all the new LJ::get_secret() stuff we're using for challenge/response and Digest to "sign" continuation tokens. That is, we let any account (even anonymous) start downloading from the beginning of a journal (we could even memcache that first block of 5000 bytes to blunt attacks), and then we provide in the response a wait time to sleep and a token to include in the next request to get the next block. The token contains:

-- what the next request should return

-- the current time

-- how long the client must wait before continuing (say, less than a minute)

-- when the token expires (oh, 30 minutes)

-- some random junk

-- the server's current hour id

-- the signature: the current hour's secret value MD5'ed with the rest of the token

Then when they present the token back, we validate it and give them the results.
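A toy version of that signed-token scheme, with get_secret() standing in for LJ::get_secret() and the field layout made up (MD5 because that's what's described above; today you'd reach for an HMAC):

```python
# Hypothetical self-describing continuation token: colon-separated
# fields plus an MD5 over a per-hour server secret and the fields.
import hashlib, os, time

SECRETS = {}  # hourid -> secret; stands in for LJ::get_secret()

def get_secret(hourid):
    return SECRETS.setdefault(hourid, os.urandom(16).hex())

def make_token(next_id, wait_secs=30, ttl=1800, now=None):
    now = int(time.time() if now is None else now)
    hourid = now // 3600
    # next-id : issued : not-before : expires : random junk : hour id
    body = "%d:%d:%d:%d:%s:%d" % (next_id, now, now + wait_secs,
                                  now + ttl, os.urandom(4).hex(), hourid)
    sig = hashlib.md5((get_secret(hourid) + body).encode()).hexdigest()
    return body + ":" + sig

def check_token(token, now=None):
    now = int(time.time() if now is None else now)
    body, sig = token.rsplit(":", 1)
    next_id, _issued, notbefore, expires, _junk, hourid = body.split(":")
    want = hashlib.md5((get_secret(int(hourid)) + body).encode()).hexdigest()
    if sig != want or now < int(notbefore) or now > int(expires):
        return None          # forged, too eager, or expired
    return int(next_id)
```

The server stays stateless: everything it needs to validate the request is in the token itself.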

The problem with all this is that if we require them to start at the beginning and use "go forward" tokens, there's no way for them to be nice and start at the location they know they already have. Instead, it might be better to only give "go backward" tokens, since the recent data they'd be starting from is presumably all in memcache if they pound it too fast.

Also, none of this plays well with anonymous "give me all these comments" requests. Perhaps require authentication and rate-limit that way? (And don't give me the "oh, you're just being paranoid" crap.... most of running LJ is dealing with bots and server hell. Programming all this stuff is trivial in comparison. :-))


So the trade-off is between:

a) strict, protocol-encumbered rate-limiting with anonymous access
b) less strict, per-account rate-limiting with a cleaner/easier API, but requiring auth


We could put all the comment fetching stuff off a bit until we get "revid" support in all the LJ tables. The idea is to have a per-account revision number that increases on any new content or edit, with a separate index on (journalid, revid) to make global journal export efficient. But this would only help the "all comments in journal" download, so we could ignore it for now. All we'd need to do in the meantime is add that one extra dumb protocol mode to get the state/posterid for each comment, and we could keep supporting that going forward with the new data format, so I guess there's no reason to delay.
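A toy model of the revid idea, just to show why a (journalid, revid) index makes "everything after revid N" cheap; no real schema implied:

```python
# One counter per journal, bumped on any new comment or edit, with
# changes kept ordered by revid so catch-up is a single range scan.
class Journal:
    def __init__(self):
        self.revid = 0                # per-account change counter
        self.log = []                 # log[i] = (revid i+1, change)

    def record(self, change):
        self.revid += 1
        self.log.append((self.revid, change))
        return self.revid

    def changes_after(self, after_revid):
        # revids are dense (1, 2, 3, ...) so this is a plain slice;
        # in SQL it'd be an index range scan on (journalid, revid)
        return self.log[after_revid:]
```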