We haven't made userinfo.bml use the memcache API, so it's pretty DB intensive. (well, more so than none)
This arose in email after my own toy, TrustFlow (described in trustmetrics), was blocked from LJ after fetching too many userinfo.bml pages, despite caching the fetched data for 24 hours.
I'd prefer to make a new "web services" API for you to get exactly what you need, without resorting to costly screen scraping (costly: bandwidth, extra CPU to render the template around the page, extra DB hits for what you don't need, etc.).
So, what info do you need?
I'm currently collecting friends lists and the status of a journal (i.e. whether it is a community, a syndicated feed, deleted, or a normal user). Other things (e.g. kelvin, joule, LJ connect) use "Friends of". People also scrape names, email addresses, and birth dates. It seems clear that all the fields listed in userinfo.bml?mode=full should ultimately be made available through the API, but those are probably the ones to start with.
Designing the protocol for this is pretty trivial, but there are a few possible variations:
1) What should the server return? XML seems like an obvious choice, but it might be overkill: if what is returned is a friends list, for example, wouldn't a list of usernames one per line be easier to generate and parse?
2) What should the URL structure be? www.livejournal.com/webservices/friends?u
3) Should every bit of information have its own URL (e.g. email address, birthdate), should they be gathered into sensible bundles, or should the protocol allow you to specify a list of what you want and bundle them together to save on round trips?
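To make the trade-off in (1) concrete, here is a rough sketch in Python of what a client would have to do for each format. Both response bodies are made up for illustration (the usernames, the "friends"/"friend" element names and the "user" attribute are my assumptions, not anything LJ serves today).

```python
import xml.etree.ElementTree as ET

# Hypothetical response bodies for a friends list; neither format exists on LJ.
PLAIN_BODY = "alice\nbob\ncarol\n"
XML_BODY = """<friends user="example">
  <friend>alice</friend>
  <friend>bob</friend>
  <friend>carol</friend>
</friends>"""


def parse_plain(body):
    """Parse one username per line, ignoring blank lines."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def parse_xml(body):
    """Parse the (assumed) <friends><friend>name</friend>...</friends> layout."""
    root = ET.fromstring(body)
    return [el.text for el in root.findall("friend")]


if __name__ == "__main__":
    print(parse_plain(PLAIN_BODY))
    print(parse_xml(XML_BODY))
```

The plain-text parser is a one-liner, but it has nowhere to put anything other than the usernames. The XML version costs a few more lines and can later grow attributes or extra elements without breaking clients that only look for "friend".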
My current inclination is:
1) XML is worth it; it's general, extensible and there are parsers for everything.
3) Give everything its own URL; the simplicity is worth the extra cost of multiple fetches.
but I don't know the internals of LJ at all and I'd be interested to hear other opinions before making a detailed proposal. Thanks!
Update: Many thanks to everyone for their help with this so far. After looking at the XMLRPC interface, my answers have changed to
1) Use XMLRPC
2) Use /interface/xmlrpc like everyone else. Or make /interface/anon-xmlrpc an alternative that accepts only those commands you can do without authentication, in case LJ wants to treat such requests differently (e.g. for load balancing) without reaching into the request to find its content.
3) Give everything its own RPC call, but support system.multicall() to reduce round trips.
4) Require no authentication, but encourage client authors to use meaningful Referer headers including a URL. Nogoodniks would sidestep any stricter security by going back to screen scraping, and then everybody loses.
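As a rough sketch of how (3) and (4) might look from a client, here is some Python using the standard xmlrpc.client module. The two method names are hypothetical placeholders (no such anonymous calls exist yet), the Referer URL is a made-up example, and whether the LJ server will answer system.multicall at all is exactly what is being proposed here, not something I have verified.

```python
import xmlrpc.client

# Hypothetical per-item calls; these method names are placeholders only.
CALLS = [
    {"methodName": "LJ.XMLRPC.anon_getfriends", "params": [{"user": "example"}]},
    {"methodName": "LJ.XMLRPC.anon_getjournaltype", "params": [{"user": "example"}]},
]


def build_multicall(calls):
    """Serialise several calls into a single system.multicall request body."""
    # system.multicall takes one argument: a list of {methodName, params} structs.
    return xmlrpc.client.dumps((calls,), methodname="system.multicall")


def send_multicall(calls, url="https://www.livejournal.com/interface/xmlrpc"):
    """Send the batch in one round trip (not called here; needs network access).

    The Referer value is an example of the 'meaningful header including a URL'
    idea; the headers argument requires Python 3.8 or later.
    """
    proxy = xmlrpc.client.ServerProxy(
        url, headers=[("Referer", "http://example.org/my-lj-tool")]
    )
    return proxy.system.multicall(calls)


if __name__ == "__main__":
    payload = build_multicall(CALLS)
    params, method = xmlrpc.client.loads(payload)  # round-trip to check the shape
    print(method, len(params[0]))
```

Each item keeps its own RPC call, so the server side stays simple, while a client that wants several items still pays for only one HTTP request.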
Update: Further discussion in lj_dev from bradfitz
Update: First draft of proposal now available.