Tim (visions) wrote in lj_dev,

okay, since posting to lj_support got little feedback.. i'll post here.

recently, something (be it data replication or something else) has resulted in the paid servers offering lagged/stale/outdated/incorrect data to the users.

now, one of the points of the paid servers is that paid users get faster and more dedicated service. that is a nice benefit... right? i would say so.

now, one would say.. okay... i can understand/deal with a tiny bit of lag in the paid servers' data since it IS being replicated. but what about when that delay is on the order of 10-20 minutes.. and sometimes perhaps even longer?

what happens when you are getting comments from free users on posts that you have made.. and the posts themselves are not even visible yet on the paid servers? is that acceptable? situations like that would make me want to stop using the fast servers entirely, just so that i could see the posts when everyone else could.

with that said, it seems that the underlying architecture and design for paid versus free users is flawed. based on an operational dissection of what is happening (purely observed), it seems as if the schema it operates under is this (see the sketch after the list):

  • all posts go to the main database, regardless of the cookie (the cookie only determines which webserver is posting to the database)

  • that database is periodically replicated to backup databases which serve the paid users.

  • the main database is what is polled if you do not have ljfastserver defined in your cookie.
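
to make that concrete, here is a rough sketch of the read-routing as it appears to behave.. this is purely hypothetical perl, not the actual lj code; the host names, credentials, and cookie check are all made up:

    #!/usr/bin/perl
    # hypothetical sketch of the *observed* read routing.. not the real lj code.
    # dsn strings, credentials, and the cookie check are invented for illustration.
    use strict;
    use warnings;
    use DBI;

    my $master_dsn      = "DBI:mysql:database=livejournal;host=db-master";
    my @paid_slave_dsns = map { "DBI:mysql:database=livejournal;host=db-slave$_" } 1 .. 3;

    sub reader_handle {
        my ($cookies) = @_;
        # paid users carry the ljfastserver cookie, so their reads appear to hit
        # the periodically-replicated slaves (which can lag by many minutes),
        # while everyone else polls the master directly.
        my $dsn = $cookies->{ljfastserver}
            ? $paid_slave_dsns[ int rand @paid_slave_dsns ]
            : $master_dsn;
        return DBI->connect($dsn, "lj", "secret", { RaiseError => 1 });
    }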



if that is the case, i think it is a drastic design flaw. people are in effect being penalized for paying by getting out-of-date data. perhaps i am misinterpreting something, but from the observed behavior.. the model seems to be as i described above.

in my opinion it should be reversed... operating under a schema such as this (sketch after the list):


  • all posts go to the main database regardless of user status

  • paid user servers (ljfastserver defined) directly query the main database, or perhaps one that is replicated on a much quicker basis (on the order of less than a minute of lag as a worst case).

  • all other servers (non-paid users or people that explicitly decide not to use the ljfastserver cookie) query replicated databases at ALL times.
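
a minimal sketch of that reversal, reusing the made-up $master_dsn from the earlier sketch (again hypothetical, not a patch):

    # sketch of the reversed routing: the slaves now serve the free users and
    # the master serves the paid users. names are still invented.
    my @slave_dsns = map { "DBI:mysql:database=livejournal;host=db-slave$_" } 1 .. 3;

    sub reader_handle {
        my ($cookies) = @_;
        my $dsn = $cookies->{ljfastserver}
            ? $master_dsn                              # paid: fresh data from the master
            : $slave_dsns[ int rand @slave_dsns ];     # free: replicated slaves, always
        return DBI->connect($dsn, "lj", "secret", { RaiseError => 1 });
    }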



anyway, i hope this makes sense.. i was interrupted a lot while typing it. thoughts?

update:
since there was a little confusion, i will clarify... the schema i presented as what is currently going on was an OPERATIONAL schema. by that i mean it is what the site appears to be operating as, not necessarily how it is actually implemented.

secondly... i will refine my "ideal" situation since i didn't explain it well the first time...

as noted here (sketch after the list)...


  • all posts go to the main database regardless of user status

  • paid user servers (ljfastserver defined) directly query the main database until a certain load level is reached; at that point queries are load-delegated to one that is replicated on a much quicker basis (on the order of less than a minute of lag as a worst case). once the load on the main database drops back to an acceptable level, the load is rebalanced.

  • all other servers (non-paid users or people that explicitly decide not to use the ljfastserver cookie) query replicated databases at ALL times.
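
a sketch of just the load-delegation part, with made-up names and an arbitrary threshold.. how "load" actually gets measured is left open:

    # paid reads prefer the master until it gets too busy, then spill over to a
    # quickly-replicated slave (sub-minute lag). names and numbers are invented.
    my $fast_slave_dsn = "DBI:mysql:database=livejournal;host=db-fastslave";
    my $load_threshold = 0.75;    # whatever counts as an "acceptable" master load

    sub paid_reader_handle {
        my ($master_load) = @_;   # however load ends up being measured
        my $dsn = $master_load < $load_threshold
            ? $master_dsn         # normal case: fresh reads from the master
            : $fast_slave_dsn;    # overloaded: delegate to the fast slave
        return DBI->connect($dsn, "lj", "secret", { RaiseError => 1 });
    }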



another possibility is to have a pool of slave dbs allocated to paid members (as i believe is currently the case) and, based on the load on those servers, dynamically re-allocate slave dbs from the free user pool to the paid member pool to deal with the load.. putting at most n-1 (or n-2, or whatever the minimum acceptable is given the number of servers in the pool) servers in the paid member pool. any time the maximum number of slave dbs is in the paid member pool, some paid dev person (dormando, i imagine) should be paged, emailed, or whatever.. all automatically (sketch below).
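
something along these lines, with everything (pool contents, load numbers, the alert hook) made up:

    # dynamic slave-pool reallocation, as a sketch only.
    sub rebalance_pools {
        my ($paid_pool, $free_pool, $paid_load) = @_;

        # when the paid pool is overloaded, borrow a slave from the free pool,
        # but never shrink the free pool below one server (the n-1 cap above).
        if ($paid_load > 0.90 && @$free_pool > 1) {
            push @$paid_pool, shift @$free_pool;
        }

        # once the free pool is down to its minimum, every further overload
        # should page/email someone automatically instead of silently degrading.
        if (@$free_pool <= 1) {
            warn "paid db pool at maximum size; page the on-call\n";   # stand-in for a real pager hook
        }
    }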

----

explanation of issue:

from brad (here):

normally all db-slaves are 0-2 seconds behind. replication is pretty much instant.

then we turned on synchronous key writes (or rather, disabled async key writes) because otherwise, mysql shutdowns take 5 minutes or so. while we were fixing all the database key corruption caused by the power outage, we wanted to be able to restart quickly to test.

we would've changed it back sooner, but we were debugging another issue and changing too many things at once wouldn't help our analysis of the problem. besides, the site was holding up fine even with sync key writes. until today. i leave, i come back, things be fucked. dormando made mackey (paid db) be async. i took all traffic off it, let it catch back up, then gave it traffic again. dormando's also going around turning it off on the others.

NOW ... here's the real problem: our load balancing for db connections sucks.

i wrote dbselector to monitor them all and tell web slaves where to go dynamically (web slaves get leases on handles, have to revalidate them every 'n' seconds) but it's not in use yet because I want more people to audit it.

avva's going to audit/test/improve it.

we're aware of our weaknesses. we know how to fix them. we just need more manpower.

if you want to work on bin/dbselectd.pl, it'd be much appreciated.
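
just to picture the lease mechanism brad mentions, here is a rough illustration of the idea.. this is not bin/dbselectd.pl, and the names and numbers are invented:

    use strict;
    use warnings;

    my $lease_seconds = 10;   # the 'n' in brad's description; value invented here
    my %lease;                # dsn currently leased to each web slave

    sub grant_lease {
        my ($web_slave, $best_dsn) = @_;   # $best_dsn picked by whatever monitors slave lag/load
        $lease{$web_slave} = { dsn => $best_dsn, expires => time() + $lease_seconds };
        return $lease{$web_slave};
    }

    sub dsn_for {
        my ($web_slave, $best_dsn) = @_;
        my $l = $lease{$web_slave};
        # a web slave keeps using its leased dsn until the lease runs out,
        # then has to come back and revalidate (possibly getting a new dsn).
        return $l->{dsn} if $l && $l->{expires} > time();
        return grant_lease($web_slave, $best_dsn)->{dsn};
    }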