Oh, and check out the network architecture picture in lj_backend if you haven't already.
(17:07:09) bradfitz: want to hear my current sick idea?
(17:07:37) bradfitz: in short, i hate our DB architecture. users are "clustered" now, so the DB is partitioned horizontally, but now we have n points of failure
(17:07:39) avva: sure
(17:07:43) bradfitz: and with memcached, we never use slaves
(17:07:44) avva: true
(17:07:53) bradfitz: so all users are on 2-3 machines, but only one is used.
(17:08:08) bradfitz: and as happens lots lately (due to fuck-ups), the slaves go corrupt
(17:08:15) bradfitz: and we have to take down the master to resync the slave
(17:08:18) bradfitz: but the slave's useless
(17:08:28) bradfitz: now, consider our other problem: no way of letting users download logs
(17:08:40) bradfitz: i mean, not logs, but their journal/comments/etc
(17:08:44) bradfitz: and worse, not incrementally
(17:08:45) avva: yeah
(17:08:47) bradfitz: from a point in time
(17:08:57) bradfitz: but: we can't even do incremental right.
(17:09:09) bradfitz: consider unscreening a comment: we don't update a modtime
(17:09:15) bradfitz: so here's my idea:
(17:09:30) bradfitz: each user has a log of transactions. each transaction adds something, modifies something, etc
(17:09:37) bradfitz: users can download their transaction log. it never shrinks.
(17:09:51) bradfitz: and now, instead of each user being on one cluster
(17:09:56) bradfitz: each user is on two random clusters.
(17:10:01) bradfitz: their transactions are played on both
(17:10:17) avva: ok, now you lost me.
(17:10:20) bradfitz: each transaction updates that DB's transaction number for that user
(17:10:30) bradfitz: by some separate mechanism. not Mysql replication.
(17:10:47) bradfitz: now, when we need to read, a master register (gm?) keeps track of that user's max transaction number
(17:10:54) bradfitz: so we pick a random host.
(17:10:56) bradfitz: one of the random 2
(17:10:58) bradfitz: for that user
(17:11:00) bradfitz: or maybe 3.
(17:11:01) bradfitz: whatever.
(17:11:07) bradfitz: then we check to see if that host is caught up
(17:11:11) bradfitz: for that user.
(17:11:15) bradfitz: if so, reads are safe.
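A minimal sketch of the read check described above, assuming plain dictionaries stand in for gm and the DB hosts. The names `gm_max_trans`, `host_trans`, `user_hosts`, `is_caught_up`, and `pick_read_host` are hypothetical, and falling back to the user's other host when the first pick is behind is an assumption the chat does not spell out.

```python
import random

# Hypothetical stand-ins: gm's max transaction number per user, and the
# position each DB host has reached for that user.
gm_max_trans = {"brad": 120}
host_trans = {
    "db1": {"brad": 120},
    "db2": {"brad": 95},  # this host is behind for brad
}
user_hosts = {"brad": ["db1", "db2"]}


def is_caught_up(host, user):
    """A host is safe to read from once it has reached gm's max for the user."""
    return host_trans[host].get(user, -1) >= gm_max_trans[user]


def pick_read_host(user, rng=random):
    """Try the user's hosts in random order; return a caught-up one or None."""
    hosts = list(user_hosts[user])
    rng.shuffle(hosts)
    for host in hosts:
        if is_caught_up(host, user):
            return host
    return None
```

With this data `pick_read_host("brad")` can only ever return `db1`, because `db2` has not reached transaction 120 yet.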
(17:11:28) bradfitz: now, say a whole host dies.
(17:11:31) avva: where do we store the transaction history?
(17:11:44) bradfitz: NetApps or something... *shrug*
(17:11:53) bradfitz: this is all very hand-wavy. but it's developing.
(17:12:06) bradfitz: NetApps can be set up with active-active replication between two nodes
(17:12:13) avva: so, a whole host dies...
(17:12:13) bradfitz: already, their uptime with 1 is insane
(17:12:23) bradfitz: a whole host dies, we just replay each user's transactions
(17:12:26) bradfitz: to put them back there
(17:12:34) bradfitz: async. without an admin copying files.
(17:12:45) bradfitz: and the host can be in service during that time.
(17:12:52) bradfitz: users that are caught up start servicing requests.
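The replay idea can be sketched roughly as follows, assuming each user's log is a list of (key, value) writes and the transaction number is simply the count of entries applied so far (a simplification for illustration). `replay_user` and `rebuild_host` are hypothetical names, not existing LJ code.

```python
def replay_user(log, host, user):
    """Apply every logged transaction this host has not yet seen for the user.

    Returns the host's transaction position for the user afterwards.
    """
    applied = host["pos"].get(user, 0)
    pending = log[user][applied:]
    for txn_no, (key, value) in enumerate(pending, start=applied + 1):
        host["data"].setdefault(user, {})[key] = value
        host["pos"][user] = txn_no  # position advances one transaction at a time
    return host["pos"].get(user, 0)


def rebuild_host(log, users):
    """Bring an empty replacement host up to date, one user at a time."""
    host = {"data": {}, "pos": {}}
    for user in users:
        replay_user(log, host, user)
    return host
```

Because the position is tracked per user, a user whose replay has finished can be served from the new host while other users are still being replayed.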
(17:13:10) bradfitz: you heard of the database idea pervayer or prevlayor or whatever?
(17:13:12) bradfitz: i forget its name
(17:13:21) bradfitz: it's an in-memory database where each transaction is an object, and those objects are logged
(17:13:26) bradfitz: this is kinda like that
(17:13:39) bradfitz: basically, i want DBs to be just as much a commodity as memcache machines
(17:13:45) bradfitz: even though they're more expensive
(17:14:13) bradfitz: and now, people can download their transaction history too... and recover from accidental deletions as well.
(17:14:14) avva: no, haven't heard of it.
but I dig your idea.
it seems very promising.
the transaction handling would be a separate module between LJ code and DBI
(17:14:36) bradfitz: yeah, we'd have some transaction object we dispatch
(17:14:38) avva: LJ code would use it instead of DBI whenever updating "clustered" stuff
(17:14:50) bradfitz: which then serializes it, logs it, plays it everywhere,
(17:14:57) bradfitz: something like that
(17:15:03) bradfitz: so then gm is the only machine that'd be important
(17:15:09) bradfitz: one point of failure, but we could throw money at that
(17:15:11) avva: and we get to scrap mysql replication, yay
(17:15:17) bradfitz: fancy SAN and two hosts connected to it
(17:15:29) avva: *nod*
(17:16:07) bradfitz: another parameter for each user could be what netapp has their transaction log
(17:17:22) bradfitz: but here's my other concern: mysql recently bought this company that does HA clustering, and mysql's building a table handler on top of it
(17:17:27) bradfitz: maybe all this work would be for naught
(17:17:34) bradfitz: if they open sourced something better in early 2004
(17:17:37) avva: HA?
(17:17:45) bradfitz: high-availability
(17:17:53) avva: hmm
(17:18:21) avva: think they'll tell you if you ask them about it?
(17:20:44) bradfitz: probably. i just wanted to bounce it off you first.
(17:20:57) bradfitz: this might be all wrong. maybe what i should focus on is doing it at a lower layer.
(17:21:06) bradfitz: looking into the linux-ha stuff... distributed replicated block devices
(17:22:45) avva: well, I like your idea... especially the serialised storable transaction log, and db redundancy.
these are very attractive.
wanna write up an lj_dev post about it or something?
(17:23:02) bradfitz: or post this chat log?
(17:23:28) avva: yeah
(17:23:49) bradfitz: the other difficult part would be migration to it... but i suppose it'd just be a new dversion
(17:23:57) bradfitz: lock user,
(17:24:03) bradfitz: make fake log from past events
(17:24:10) bradfitz: replicate to a second host
(17:24:21) bradfitz: oh, could even unlock before that
(17:24:32) bradfitz: let the background "catch up daemon" replicate to second host in background
(17:24:37) bradfitz: web nodes won't use it until it's caught up
(17:25:40) avva: yes, that can be figured out pretty easily.
we should make the fake log as nice as possible
order it in server time order (not just between posts and comments separately, but put them all into sequence)
(17:26:14) avva: anyway, that's not hard.
(17:26:51) bradfitz: where would we track what users are in need of replay?
(17:26:53) bradfitz: couldn'
(17:26:58) avva: so, when you update a transaction on a user's server
where do you update the transaction number for that user? does it mean we double the number of writes, just to up the transaction number all the time?
(17:27:00) bradfitz: couldn't be a "cluster master" because there no longer is one
(17:27:49) bradfitz: the transaction number wouldn't be allocated like LJ::alloc_user_counter-style... i'd say it's the byte position of transaction log where we're about to write the transaction
(17:28:09) avva: need some central place to match users to "clusters" anyway
a user won't have clusterid anymore, they'll have an array of clusters, normally of length 2
(17:28:12) bradfitz: so we have a "transaction log server" that mounts the NetApp(s) on NFS, handles locking,
(17:28:19) bradfitz: looks at the transaction size, returns a number
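A rough sketch of a transaction number that is just a byte offset, assuming one append-only log per user held in memory rather than on an NFS-mounted NetApp. The `TransactionLog` class and its 4-byte length prefix are illustrative assumptions, not a described format.

```python
import io
import threading


class TransactionLog:
    """Append-only log; a transaction's number is the byte offset where it starts."""

    def __init__(self):
        self._buf = io.BytesIO()
        self._lock = threading.Lock()  # stands in for the server-side locking

    def append(self, payload: bytes) -> int:
        """Write one serialized transaction and return its starting offset."""
        with self._lock:
            pos = self._buf.seek(0, io.SEEK_END)
            # Assumed framing: 4-byte big-endian length, then the payload.
            self._buf.write(len(payload).to_bytes(4, "big") + payload)
            return pos

    def end(self) -> int:
        """Offset where the next transaction would be written."""
        with self._lock:
            return self._buf.seek(0, io.SEEK_END)
```

Offsets only ever grow, so comparing two of them tells you which copy is further along without any separate counter allocation.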
(17:29:06) bradfitz: the last query of any transaction is updating transpos.userid = [transaction point completed]
(17:29:11) bradfitz: on that database
(17:29:25) bradfitz: so gm doesn't track where everything's at,
(17:29:27) bradfitz: the dbs themselves do
(17:29:43) avva: yeah
that's a lot of new writes though
(17:29:44) bradfitz: if the web nodes can't play the transaction on both nodes,
(17:29:57) bradfitz: then they set a flag on gm
(17:30:08) bradfitz: saying "please sic the catch-up-daemon on this user"
(17:30:14) avva: yep
(17:30:22) bradfitz: which will then ask the transaction server for the latest transaction point
(17:30:31) bradfitz: and go around asking every server in user's array where they're at
(17:31:15) bradfitz: transactions: SQL and/or named code blocks with parameters. (ala cmdbuffer)
(17:31:26) bradfitz: so we can do things like "delete_post(34)"
(17:31:30) bradfitz: which deletes all comments, etc
(17:31:33) avva: well, that adds the transaction server as an additional bottleneck.
everyone will be asking it all the time for the current transaction point of users.
(17:31:37) bradfitz: without serializing all that SQL?
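One way to picture "named code blocks with parameters" is a small registry of handlers, where only the name and arguments are serialized and each host runs the same handler. This is a sketch under assumptions: JSON as the wire format, dictionaries as the database, and the names `HANDLERS`, `transaction`, `serialize`, and `play` are all hypothetical.

```python
import json

HANDLERS = {}  # transaction name -> function(db, *args)


def transaction(name):
    """Register a function as a named, replayable transaction."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@transaction("delete_post")
def delete_post(db, post_id):
    # One logical operation: the post and all of its comments go together.
    db["posts"].pop(post_id, None)
    db["comments"] = {
        cid: c for cid, c in db["comments"].items() if c["post_id"] != post_id
    }


def serialize(name, *args):
    """Log record holds only the name and parameters, not the expanded SQL."""
    return json.dumps({"op": name, "args": list(args)})


def play(db, record):
    """Replay one logged record against one host's data."""
    txn = json.loads(record)
    HANDLERS[txn["op"]](db, *txn["args"])
```

Playing `serialize("delete_post", 34)` against each of a user's hosts should leave them in the same state. Rolling back a handler that fails partway is not addressed here.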
(17:32:12) bradfitz: well, in the ideal case the web nodes themselves complete the transactions on all 2-3 source nodes
(17:32:12) avva: (yes, obviously transactions are more than 1 SQL statement. we want them to be atomic logical operations on data from user's point of view)
(17:32:17) bradfitz: 2-3 user nodes
(17:33:07) bradfitz: without using real SQL transactions, doing transactions is kinda hard... each named code block would have to understand rolling back
(17:33:15) bradfitz: which could be hard if it just deleted most of the comments
(17:33:20) bradfitz: and failed on the last few
(17:33:28) avva: yeah, but web nodes need transaction points for reading also, no?
say I want to build a friend view. I choose a random host for this user and get the data. do I at this point compare the transaction point of this database to the global one from the transaction server?
(17:34:39) avva: of course, we can and should memcache the current transaction point of user. but still.
(17:34:58) avva: I'm worried about the operation "get the current transaction point for this user" becoming a bottleneck.
(17:35:13) bradfitz: that's just a single row select from both gm and the random node
(17:35:17) bradfitz: to see if it's caught up
(17:35:34) bradfitz: gm or transaction server, i guess.
(17:35:44) bradfitz: but i'd use DB there... faster probably
(17:36:22) avva: yeah
(17:36:37) bradfitz: back to memcached, though: You've never cared about my netsplit problem
(17:36:41) bradfitz: which has affected LJ a few times
(17:36:47) bradfitz: i've mailed you about it
(17:36:55) bradfitz: you remember the details? i have a fix idea.
(17:37:03) bradfitz: it requires changing the memcached protocol a tiny bit, though.
(17:37:06) avva: I didn't know that it actually affected data in practice
(17:37:15) bradfitz: yeah, it has.
(17:37:15) avva: I remember the details, yeah
(17:37:39) bradfitz: we have memcached machines flop a lot. it happened a lot more before i changed the perl modules
(17:37:57) bradfitz: we'd put a new memcached up after dying, and web nodes would randomly select another memcached node for 20-30 seconds
(17:38:11) bradfitz: now i only mark memcache hosts down on connect error
(17:38:13) bradfitz: not other errors
(17:38:22) bradfitz: so if i have a cached handle from earlier,
(17:38:25) bradfitz: and try to re-use it and it fails
(17:38:29) bradfitz: we used to mark it dead for 30 seconds
(17:38:41) avva: oh, I see
(17:38:43) bradfitz: now i just return undef, kill cached handle, and memcached tries to reconnect later
(17:38:50) bradfitz: only then do we mark dead
(17:38:57) bradfitz: anyway, that only lessens the problem a bit
(17:38:59) bradfitz: it still exists.
(17:39:20) bradfitz: Here's the answer: clients of memcached need to send along their hashing value (the bucket) to servers.
(17:39:32) bradfitz: then servers keep track of what they handle, and broadcast it to other nodes.
(17:39:41) bradfitz: but no more than 1 bucket announcement every few seconds
(17:39:57) bradfitz: then other servers know, "any data I stored for bucket 8 before that is invalid"
(17:40:05) bradfitz: and upon them returning it, they delete it.
(17:40:10) bradfitz: if its create time is too old
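A toy model of the bucket idea, assuming each stored item remembers the bucket the client sent and its create time, and each server remembers when another server last announced a bucket. `CacheServer` and its methods are hypothetical, timestamps are passed in explicitly, and the UDP announcement and its rate limit are left out.

```python
class CacheServer:
    """Toy cache that drops items stored before another server claimed their bucket."""

    def __init__(self):
        self.items = {}           # key -> (value, bucket, created_at)
        self.foreign_claims = {}  # bucket -> time another server announced it

    def set(self, key, value, bucket, now):
        """Store a value along with the client's bucket number and create time."""
        self.items[key] = (value, bucket, now)

    def hear_announcement(self, bucket, now):
        """Another server said "I'm handling bucket N!" at time `now`."""
        self.foreign_claims[bucket] = now

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, bucket, created_at = entry
        # Anything stored before the other server's claim may be stale.
        if created_at < self.foreign_claims.get(bucket, float("-inf")):
            del self.items[key]  # delete on the way out instead of returning it
            return None
        return value
```

Items written after the announcement are still served, so a server that legitimately gets the bucket back starts clean without a full flush.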
(17:40:25) bradfitz: follow?
(17:40:29) avva: sec.
(17:40:56) bradfitz: i can illustrate if necessary.
(17:41:05) bradfitz: work through an example
(17:41:08) avva: no, I think I follow.
(17:41:34) avva: I don't really like the idea of memcached servers talking to each other though
(17:41:42) avva: can't clients do that with some new command?
(17:41:45) bradfitz: they don't have to, well... just UDP packets
(17:41:59) bradfitz: broadcast/multicast
(17:42:17) bradfitz: "I'm handling bucket 8!"
(17:42:25) avva: yeah
maybe have a command-line option for it.
(17:42:54) bradfitz: bucket announce port
(17:43:18) avva: and server's awareness of client hashing is also kinda ugly.
even though it's gonna be a transparent value for the server, right?
(17:43:25) bradfitz: yup
(17:44:04) avva: the client doesn't have to send it with every request, does it?... only after connect?
(17:44:21) bradfitz: hm, i hadn't thought about that
(17:44:22) avva: or if it reshuffles buckets
(17:44:23) bradfitz: but true
(17:44:25) bradfitz: it could be a new command
(17:44:28) avva: yeah
(17:44:48) bradfitz: well, no.
(17:44:56) bradfitz: the command is like:
(17:45:05) bradfitz: IM_USING_FOR_MY_BUCKET 8
(17:45:12) bradfitz: so we can't do that on connect
(17:45:23) bradfitz: because well, it could be many buckets
(17:45:37) bradfitz: our config lets bigger servers handle multiple buckets
(17:45:45) bradfitz: i'd do it on all set/add/replace
(17:46:43) avva: let me think about it
I need to replay this in my head a few times
(17:46:48) bradfitz: k