Brad Fitzpatrick (bradfitz) wrote in lj_dev,

fetching comments: Use cases

Another round of thoughts on all the ways/reasons to fetch comments. They basically fall into two groups: downloading all comments for a journal, and downloading all comments for a single entry. Policy is also discussed below.

Use case #1: Downloading all comments for a journal.


your journal: obviously.

a community: probably. especially if you're a member/poster of it. any reason why not?

an arbitrary person account: questionable, both for server load and privacy. do people want to give other people such easy access to all their comments?


Comments never change, except for two things:

-- states. these change between screened, approved, and deleted. in most cases, this doesn't matter, unless it's your journal, or one you're an admin of. there are 3 values currently used, so 2 bits, but let's make it simpler and assume this is one byte.

-- posterid. with jesse's upcoming patch, the posterid of a talk2 row can change from 0 (anonymous) to the auto-created pseudo-account of a validated email address. a userid is 4 bytes.

So, the fluid part of a comment is 5 bytes. jtalkids are 3 bytes, but they increase in order from 1 for each account, so we can leave them off in the whole-journal case. If we were to provide all the data, packed, for an account, it'd be 5 bytes * number of comments. The news account has 37,000 comments, so that's about 180k uncompressed. Not much bandwidth, but potentially a shitload of disk seeks. Probably need paging, both for disk seeks and bandwidth. I imagine people will want XML and not binary blobs they have to unpack. Well, probably text too, one item per line... make everybody happy.
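A rough sketch of the 5-byte-per-comment packing, in Python purely for illustration. The numeric state codes, the big-endian byte order, and the function names are all assumptions, not anything LJ actually defines. The jtalkid is left out of the blob and recovered from position, as described above.

```python
import struct

# Hypothetical state codes; the post only says three values are in use.
STATE_CODES = {"approved": 0, "screened": 1, "deleted": 2}

# 1-byte state + 4-byte posterid, big-endian, no padding: 5 bytes total.
RECORD = struct.Struct(">BI")


def pack_journal(comments):
    """Pack (state, posterid) pairs, given in jtalkid order starting at 1."""
    return b"".join(
        RECORD.pack(STATE_CODES[state], posterid) for state, posterid in comments
    )


def unpack_journal(blob):
    """Yield (jtalkid, state_code, posterid); jtalkid is implied by position."""
    for jtalkid, (state, posterid) in enumerate(RECORD.iter_unpack(blob), start=1):
        yield jtalkid, state, posterid


def as_text(blob):
    """The 'one item per line' text form of the same data."""
    return "\n".join(f"{j} {s} {p}" for j, s, p in unpack_journal(blob))
```

At 5 bytes a record, 37,000 comments come to 185,000 bytes, which is where the "about 180k" figure comes from.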

Anyway, once they get the list of all talkids and their states/posterids, they need the text/props. We could provide URLs to fetch either a range (1-100), a single item, or a set of items. (more below)

Use case #2: Downloading all comments for a single entry

Policy: should be allowed by anybody for anything (given proper security), since you can already read all comments with the HTML interface.


We already keep a memcache blob of all comments, packed, for a post. We could pretty easily just take that and dump it out to the user, removing some bits for security reasons. (don't show non-owners screened metadata)

So this comes down to I think 8 bytes per comment: 3 talkid, 1 state, 4 userid.
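Something like the following could produce the 8-byte records for one entry. This is only a sketch: the real layout of the memcache blob isn't specified here, the state code for "screened" is made up, and dropping screened comments entirely for non-owners is just one guess at what "removing some bits" would mean.

```python
import struct

SCREENED = 1  # hypothetical state code, not LJ's actual value


def pack_entry(comments, viewer_is_owner):
    """Pack (talkid, state, userid) tuples as 3 + 1 + 4 = 8 bytes each.

    Assumption: screened comments are omitted entirely for non-owners.
    """
    out = bytearray()
    for talkid, state, userid in comments:
        if state == SCREENED and not viewer_is_owner:
            continue
        # 3-byte talkid, then 1-byte state and 4-byte userid, all big-endian.
        out += talkid.to_bytes(3, "big") + struct.pack(">BI", state, userid)
    return bytes(out)


def unpack_entry(blob):
    """Inverse of pack_entry: return a list of (talkid, state, userid)."""
    records = []
    for off in range(0, len(blob), 8):
        talkid = int.from_bytes(blob[off:off + 3], "big")
        state, userid = struct.unpack_from(">BI", blob, off + 3)
        records.append((talkid, state, userid))
    return records
```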

After that request, the client can use the same mode as before to fetch the text/props for a set of comments (whatever they don't already have).

Summary of request types

URL to fetch the 5 bytes per comment for a whole journal. This works almost okay as 5 bytes (news is kinda big, and only getting bigger). If XML, it'd probably need to be paged to keep responses at 200k or so per page. Probably best to page it in any case.

URL to fetch the 8 bytes for all comments in an entry. We could do this unpaged and binary, or paged and XML, perhaps going backwards.

URL to fetch entirety of a single comment, set of comments, or range of comments, limited to so many comments per response.
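To make the single/set/range request concrete, here's a sketch of how a server might expand a selector and cap it. The selector syntax ("42", "3,7,9", "1-100") and the limit of 100 are invented for the example; no URL format has been decided.

```python
from itertools import islice


def iter_ids(selector):
    """Expand a selector like '1-3,7,9' into individual comment ids, lazily."""
    for part in selector.split(","):
        lo, sep, hi = part.partition("-")
        if sep:
            yield from range(int(lo), int(hi) + 1)
        else:
            yield int(lo)


def select_ids(selector, limit=100):
    """Return (ids for this response, whether the request was truncated)."""
    ids = iter_ids(selector)
    page = list(islice(ids, limit))
    # Anything left over means the client asked for more than one response holds.
    truncated = next(ids, None) is not None
    return page, truncated
```

Expanding lazily means a request for a huge range never builds more than one response's worth of ids.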

Final thought on server abuse,...

We could restrict queries to a certain speed per user, and only provide blocks of 1000 at a time or something. We could even use all the new LJ::get_secret() stuff we're using for challenge/response and Digest to "sign" continuation tokens. That is, we let any account (even anonymous) start downloading from the beginning of a journal, and we could even memcache that first block of 5000 bytes to avoid attacks, and then we provide in the response a wait time to sleep and a token to include in the next request to get the next block. The token contains within it what the next request should return, the current time, how long the client must wait before continuing (say, less than a minute), when the token expires (oh, 30 minutes), some random junk, the server's current hour id, and the signature of the current hour's secret value MD5'ed with the rest of the token. Then when they present the token back, we validate it and give them the results.
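A minimal sketch of the signed continuation token described above. Everything concrete here is an assumption: the colon-separated layout, the get_secret() stand-in (the real thing is LJ::get_secret() on the Perl side), and the default wait and expiry. The signature is plain MD5 over the hour's secret plus the token body, as described; an HMAC would be the more usual construction.

```python
import hashlib
import hmac
import os
import time

_SECRETS = {}  # hypothetical in-memory stand-in for LJ::get_secret()


def get_secret(hour_id):
    """Return the per-hour secret, creating one on first use."""
    return _SECRETS.setdefault(hour_id, os.urandom(16).hex())


def _sign(hour_id, body):
    return hashlib.md5((get_secret(hour_id) + body).encode()).hexdigest()


def make_token(next_start, now=None, wait=30, ttl=1800):
    """Build 'next:issued:wait:expires:junk:hourid:signature'."""
    now = int(time.time()) if now is None else now
    hour_id = now // 3600
    junk = os.urandom(4).hex()
    body = f"{next_start}:{now}:{wait}:{now + ttl}:{junk}:{hour_id}"
    return f"{body}:{_sign(hour_id, body)}"


def check_token(token, now=None):
    """Return the next start position, or None if the token is unacceptable."""
    now = int(time.time()) if now is None else now
    try:
        body, sig = token.rsplit(":", 1)
        next_start, issued, wait, expires, _junk, hour_id = body.split(":")
        next_start, issued, wait, expires, hour_id = (
            int(next_start), int(issued), int(wait), int(expires), int(hour_id)
        )
    except ValueError:
        return None  # malformed token
    if not hmac.compare_digest(sig.encode(), _sign(hour_id, body).encode()):
        return None  # forged or altered
    if now < issued + wait or now > expires:
        return None  # came back too early, or too late
    return next_start
```

The server keeps no per-client state here; the only thing it has to remember is the hourly secrets.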

The problem with all this is that if we require them to start at the beginning and use "go forward" tokens, there's no way for them to be nice and start at the location they know they have. Instead, it might be better to only give "go backward" tokens, which presumably is all in memcache if they pound it too fast.

Also, none of this plays well with anonymous "give me all these comments". Perhaps require authentication and rate-limit that way? (And don't give me the "oh, you're just being paranoid" crap.... most of running LJ is dealing with bots and server hell. Programming all this stuff is trivial in comparison. :-))


a) strict, protocol-encumbered rate-limiting with anonymous access
b) less strict, per-account rate-limiting with cleaner/easier API, but requiring auth


We could put all the comment fetching stuff off a bit until we get "revid" support in all the LJ tables. The idea is to have a per-account revision number that increases on any new content or edit, and have a separate index on that (journalid, revid) to make global journal export efficient. But this would only help the "all comments in journal" download, so we could ignore it for now. All we'd need to do in the meantime is make that one extra dumb protocol mode to get the state/posterid for each comment, and we could support that going into the future with the new data format, so I guess there's no reason to delay.
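For what it's worth, the revid idea boils down to something like this toy in-memory model. The class and method names are made up, and a real version would lean on the (journalid, revid) index in the database rather than scanning rows.

```python
class Journal:
    """Toy model of one account with a per-account revision counter."""

    def __init__(self):
        self.revid = 0
        self.rows = {}  # jtalkid -> (revid, state, posterid)

    def touch(self, jtalkid, state, posterid):
        """Record a new comment or an edit; every change bumps the revid."""
        self.revid += 1
        self.rows[jtalkid] = (self.revid, state, posterid)
        return self.revid

    def changed_since(self, revid):
        """Everything a client holding `revid` still needs, oldest change first."""
        changed = [
            (r, jtalkid, state, posterid)
            for jtalkid, (r, state, posterid) in self.rows.items()
            if r > revid
        ]
        return sorted(changed)
```

A client would remember the highest revid it has seen and ask only for what changed after it.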