One of the other big architecture pieces we've added to help us scale
was this notion of a precomputed cache.
We found ourselves running these queries to generate the hot page
for Reddit over and over and over again.
You may cache it for a minute but then once that minute expired we had to recalculate it.
We had a job that would run, compute it, and put the stored value in memcached--that worked okay,
but then we had to do the same thing for all of our users' pages.
Every user had their own listings of things they've submitted and liked
and their top things, and every reddit had a new page and a hot page and a bunch of different sorts.
So we started precomputing everything.
The way we did that is that we have this whole other database stack.
These are the replicas of the link database--basically, more link databases.
They could lag a little bit. It wasn't a big deal.
Every time a vote would come in, we put it in this queue--a queue is just a list of things to be done--
and we have this machine that basically manages a huge list of things,
and we had a couple of other machines that we called the precompute servers.
What these things would do is take jobs off the queue--say, "this link has been voted on."
Actually, what the apps would do is, when a link was voted on,
they would add a number of jobs to the queue.
The jobs might be to recompute Reddit's front page, recompute this user's liked page,
recompute this user's top page, recompute Reddit's top page.
There are all sorts of different listings that are affected by a particular vote.
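The fan-out described above can be sketched roughly like this. The queue, the job names, and the `on_vote` helper are all hypothetical stand-ins (a plain `deque` in place of whatever queue service Reddit actually ran), just to show one vote turning into several recompute jobs:

```python
from collections import deque

# Hypothetical in-memory stand-in for the real job queue.
vote_queue = deque()

def jobs_for_vote(link_id, voter_id, subreddit):
    # One vote fans out into several listing-recompute jobs:
    # the reddit's own listings plus the voter's personal listings.
    # Job names here are invented for illustration.
    return [
        ("recompute_hot", subreddit),
        ("recompute_top", subreddit),
        ("recompute_liked", voter_id),
        ("recompute_user_top", voter_id),
    ]

def on_vote(link_id, voter_id, subreddit):
    # The app servers don't run the heavy queries themselves;
    # they just enqueue work for the precompute servers to pick up.
    for job in jobs_for_vote(link_id, voter_id, subreddit):
        vote_queue.append(job)

on_vote(link_id=42, voter_id="alice", subreddit="programming")
print(len(vote_queue))  # 4 jobs fanned out from a single vote
```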
These machines would pull off these jobs
and then they would run those queries against the database.
They would just, mercilessly, as fast as they could, take a job off the queue
and run the query against these databases.
These databases were really, really hot and served no real-time requests.
No request from the Internet actually ever touched these precompute machines.
Only these precompute servers would touch these precompute databases.
When the job was done running, we would take the results and store them to memcached.
That way almost every page you looked at on Reddit would be fetched out from memcached.
There are very few things you could do on Reddit that would actually directly manipulate a database.
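A minimal sketch of that worker loop, under the same assumptions: a `deque` stands in for the queue, a plain dict plays the role of memcached, and `run_listing_query` fakes the expensive SQL against the replica databases:

```python
from collections import deque

# Hypothetical stand-ins: the job queue and a dict playing memcached.
job_queue = deque([("hot", "programming"), ("top", "programming")])
memcached = {}

def run_listing_query(sort, subreddit):
    # Stand-in for the expensive query against the precompute replicas;
    # returns link ids ordered by score (fake data for illustration).
    fake_scores = {"a1": 50, "b2": 75, "c3": 10}
    return sorted(fake_scores, key=fake_scores.get, reverse=True)

def precompute_worker():
    # Drain the queue as fast as possible: run each listing query
    # against the replica databases, then store the finished listing
    # in memcached, where the web servers read it from.
    while job_queue:
        sort, subreddit = job_queue.popleft()
        listing = run_listing_query(sort, subreddit)
        memcached[f"{subreddit}:{sort}"] = listing

precompute_worker()
print(memcached["programming:hot"])  # ['b2', 'a1', 'c3']
```

The point of the design is that the slow queries happen off the request path entirely; a page view is just a memcached read.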
Once we got to that point of scaling, things got a lot easier.
These are really just kind of the last resort primary sources of data,
but any data you can access on Reddit in real time is actually served out of memcache.
Every single listing is precomputed and stored in memcached for Reddit on the whole site.
This is the reason why now you can't go back beyond
about a thousand links on any particular listing.
It used to be, you could go to Reddit's front page and hit next, next, next, next, next
and go all the way back to the beginning of time, which would just really, really trounce
our databases, do a lot of damage, slow the site down, etc., etc.--you can't do that anymore.
We only store the top thousand for each sort, which is one of the limitations of doing this
precompute thing, but on the upside the cycle is very, very fast.
There are very few legitimate reasons to go all the way back to the beginning of time on Reddit anyway.
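The cap itself is trivial; the talk says "about a thousand," so the constant here is an assumption, as is the helper's name:

```python
MAX_LISTING_LENGTH = 1000  # "about a thousand" per the talk; exact value assumed

def truncate_listing(listing):
    # Only the top N entries per sort are precomputed and stored,
    # so paging stops there instead of walking the database
    # all the way back to the beginning of time.
    return listing[:MAX_LISTING_LENGTH]

full = list(range(5000))
print(len(truncate_listing(full)))  # 1000
```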
This worked out really nicely, and the site to this day still has this general structure,
although a lot of the technologies have changed, and that's what we're going to talk to Neil about.