>>Ninh Bui: So hi guys. We're Phusion, a Ruby and Rails web application deployment
shop in Amsterdam. Today we're gonna talk a little bit about
building a more efficient Ruby Interpreter. With me here today is Hongli Lai. I myself
am Ninh Bui. But we first, before we go on to talk about
building a more efficient Ruby Interpreter, it might be interesting to talk a little bit
about Ruby itself. So for those who are not familiar with Ruby,
it's a dynamic language, highly, it's actually highly dynamic, strongly typed. You can, you
have closures and other features like that as well.
It resembles Python somewhat, you could say; and it's still a growing language. By
the time of this presentation, actually, there's an estimate of about 400,000 Ruby
programmers out there, and they expect this number to grow to 5 million
by 2013. So yeah, so it's still growing rapidly and
we believe that this is in part actually thanks to Rails, which is a popular web application
development framework which allows you to create, well, stunning web applications in
a model-view-controller manner. So there are several Ruby implementations
out there, and the main one, or the best well known one and most widely deployed one, is
actually Matz's Ruby Interpreter. Also abbreviated as MRI. It was named after its creator, Yukihiro
Matsumoto. And even though it does a great job in many things, it has some issues though.
It has a reputation of being slow - [laughter]
like some other virtual machines. And it has a reputation of being a memory hog, basically.
So we wanted to make Ruby better. And in particular we wanted to make it better for servers because
we enjoy web development and we enjoy working with Rails.
So in 2007 we started hacking on the Ruby garbage collector in order to at least make
it more efficient for budget servers that we had to use back then. Nowadays that's a
totally different story. But, so yeah, we started hacking on the garbage
collector, and this resulted in a series of patches that we eventually molded into a usable
product. And this usable product is called Ruby Enterprise Edition.
And today Ruby Enterprise Edition includes various patches from other contributors as
well, who share this passion for making MRI suitable for server environments.
So in essence you can basically consider Ruby Enterprise Edition as being an MRI optimized
for server environments. So now let's talk a little bit about actually
tweaking the garbage collector, and I'd like to give Hongli the honors to do that.
>>Hongli Lai: So hi, guys. So our initial motivation for making the Ruby
garbage collector copy-on-write friendly was to optimize the memory usage of Rails.
But before we discuss that, let's see how Rails applications work.
So actually a Rails application has an architecture that resembles something
like this: there are multiple Rails processes and there's a single web server, or maybe
multiple web servers and load balancers or proxies in front. And the web server, it will
forward HTTP requests to one of the Rails processes. So here we see it working.
Then one of those Rails processes will generate a certain response and then this response
is then sent back to the web browser. And if you look inside the Rails processes then
you will see that each Rails process is actually single-threaded and handles one concurrent
request. So if you want to achieve true concurrency,
then you have to run multiple Rails processes. And this is actually no longer strictly true,
because since Rails 2.2, Rails has become thread safe. So these days you can actually
run multiple threads in a single Rails process. But what we talked about here is still true
for most setups out there. And if you look at how much the single "Hello
World" Rails application weighs, and it actually weighs about 25 to 30 megabytes, and memory
usage increases linearly with each process. So if you want high concurrency and you have
to spawn lots of Rails processes, then memory usage can quickly add up.
So we looked at how we can reduce memory usage and make Rails less bloated. There are several
ways to do that. For example, you could optimize Rails, or you could optimize Ruby.
But both of these ways are very tedious; you have to put a lot of work into them; so we
thought: "Maybe we can cheat. Maybe we can use the fork system call and copy-on-write."
And that works like this. Suppose that you have a parent process with some data, say
a variable A with value 42. And then you fork a child process. And then, initially that
child process will share all of its memory with the parent process. So if you access
the value of A in child process, then you're actually referring to the same memory page
as the one in the parent process. And when either the parent or the child writes
to that memory page, then that page is copied by the operating system and then written to.
We say that this page is made dirty and the child now still shares most of its memory
with the parent, but not that one memory page that it just wrote to.
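The copy-on-write behaviour just described can be sketched with a minimal fork(2) demo (POSIX only; the function and variable names here are ours, not from the talk):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that reads, then overwrites, an inherited value.
 * The child's write triggers copy-on-write: the kernel copies just
 * that page for the child, so the parent's copy of `a` is untouched.
 * Returns the parent's value of `a` after the child has exited. */
static int cow_demo(void) {
    int a = 42;                  /* shared (read-only) after fork */
    pid_t pid = fork();
    if (pid == 0) {              /* child */
        int seen = a;            /* reads the shared page: still 42 */
        a = 1000;                /* write -> kernel copies the page */
        _exit(seen == 42 ? 0 : 1);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WEXITSTATUS(status) != 0) return -1;
    return a;                    /* parent's page was never touched: 42 */
}
```

The child's write to `a` forces the kernel to copy only that one page; everything else stays shared with the parent.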
And our past experience with dynamic languages, mostly Perl, has shown that most memory is
actually occupied by storing code; for storing the parse tree, for example.
And Perl already uses copy-on-write to save memory, and it works like this: it
loads as many Perl modules in the Apache parent process as possible; then when Apache forks
child processes the memory occupied by those pre-loaded Perl modules will be shared among
all Apache worker processes. [pause]
And then we thought: "Can we do the same thing with Rails?" Well that really depends on how
much memory in Rails application is occupied by actually storing Rails code. So let's see.
"Hello World" in Ruby – it's about half a megabyte of memory: its RSS is about one and
a half megabytes and about one megabyte of that is shared memory, so we end up with about
half a megabyte of actual unshared memory. And if we load in the Rails libraries,
then we end up with a process that eats about 25 megabytes and that's already with shared
memory taken into account. So we had already measured how the "Hello
World" Rails application eats about 25 to 30 megabytes, so it seems plausible that we
can use copy-on-write to save memory. Well, but, there's a problem, because unfortunately
Ruby's garbage collector – it's not quite copy-on-write friendly. And every time that
Ruby runs its garbage collector, then old memory pages will be written to and that causes
copy-on-write. So if you take a look at Ruby's garbage collector,
it's actually a simple mark and sweep system and here's an example. Suppose you create
a foo object and then you refer to it from a local variable which is part of the root
set. [pause]
And then you create a bar object and you change the reference to bar. Then when the garbage
collector is invoked, the garbage collector will follow all pointers that are reachable
from the root set and then it marks all objects that it encounters while doing this. And this
marking is done by setting the FL_MARK flag on a bit field inside the object.
And then next comes the sweep phase of garbage collection, and in this phase, garbage collector
will free all objects that are not marked because apparently you cannot reach them anywhere
from within the program. And the thing is, setting this FL_mark thing
inside the object, it will actually make the entire memory page of that object dirty, because
it writes to that page. And everything that's nearby is affected too.
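A toy mark-and-sweep over a fixed object pool makes this concrete: the mark is a flag written into each object's header, so a collection cycle writes to (and thereby dirties) every page that holds a live object. This layout is our simplification, not MRI's actual RVALUE structure:

```c
#include <stddef.h>

#define FL_MARK   0x01
#define POOL_SIZE 4

typedef struct obj {
    unsigned long flags;   /* writing FL_MARK here dirties the object's page */
    struct obj *ref;       /* single outgoing reference, for simplicity */
    int in_use;
} obj_t;

static obj_t pool[POOL_SIZE];

/* Mark phase: follow references, set the in-object flag. */
static void mark(obj_t *o) {
    while (o && !(o->flags & FL_MARK)) {
        o->flags |= FL_MARK;   /* the copy-on-write-unfriendly write */
        o = o->ref;
    }
}

/* Sweep phase: free everything unmarked, clear marks for the next
 * cycle. Returns the number of objects swept. */
static int sweep(void) {
    int swept = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].in_use && !(pool[i].flags & FL_MARK)) {
            pool[i].in_use = 0;
            swept++;
        }
        pool[i].flags &= ~FL_MARK;
    }
    return swept;
}
```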
So the Ruby AST nodes, they are also garbage collected and so they too will be copied when
you run the garbage collector. And the fix to make sure that this doesn't happen is to
move the marking data away from the bit field inside objects and to a separate memory region,
like a mark table. This sounds easy, but it's actually trickier than
one might think. And if we go back to the example with the processes, then you see,
for example, a parent process and a child process and they refer to some bar data. And
this, the bar data could be Rails AST nodes. Then, whenever the garbage collector runs,
the garbage collector will mark all those AST nodes and this causes copy-on-write, so
you end up with a copy of bar data, data that's actually identical. You don't really need
to create this copy, it just wastes memory. [pause]
And it's all because of this FL_MARK flag. So when trying to make the garbage collector
copy-on-write friendly, we, we encountered some caveats. For example, like how to measure
the dirty memory pages in the first place. Because on Linux we have this /proc/self/smaps,
and this is a virtual file made by the kernel that allows us to inspect a process's total
private dirty memory. But there are no tools to see which individual
pages are dirty. And other operating systems are even worse, because they don't even seem
to allow viewing your total private dirty memory usage on the processes. So reducing
the dirty pages involves a lot of guesswork. [laughter]
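On Linux, that total can be obtained by summing the Private_Dirty fields of /proc/self/smaps. Here is a hedged sketch of such a parser; it reads from a string so the sketch stays portable, but in practice the input would come from the smaps file:

```c
#include <stdio.h>
#include <string.h>

/* Sum the Private_Dirty fields (in kB) from smaps-formatted text.
 * On Linux the real input would be the contents of /proc/self/smaps. */
static long total_private_dirty_kb(const char *smaps_text) {
    long total = 0, kb;
    const char *line = smaps_text;
    while (line) {
        /* only lines that literally start with "Private_Dirty:" match */
        if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
            total += kb;
        line = strchr(line, '\n');
        if (line) line++;
    }
    return total;
}
```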
And we used the following test script to measure the effectiveness of copy-on-write, because,
what this test script does is, it first loads Rails and then after loading Rails it runs
the garbage collector, so that whatever garbage was created during all this time is freed
before the fork rather than in the child. And then it forks a child process, and in the
child process it runs the garbage collector again to see whether the garbage collector is
copy-on-write friendly, and then it measures the process's private dirty RSS.
[pause]
So during our first attempt at making the garbage collector copy-on-write friendly, we did
not know how much effect it was going to have. We initially used a hash table to implement
the mark table, and everything that is in this
table is considered marked. Luckily Ruby had a built-in generic hash table
implementation, so we did not have to write our own.
And it took a while to make it all work, but we eventually succeeded. It saved about 15
to 20 percent memory in the child process, though we were a little bit disappointed because
the garbage collector also became about 30 to 40 percent slower.
[pause] So we tried to optimize this thing and it
turns out that the mark table itself is very large and we did not expect this. When Rails
is loaded there are about 150,000 objects even after garbage collection.
So each hash table entry occupies about four words and on x86 this is about 16 bytes. And
if you count in malloc overhead too, that's eight bytes, then you end up with 24 bytes
per entry. And multiply, multiply that, by the number of objects that we have, then we
get about three and a half megabytes and the hash table's bucket allows a depth of five
before resizing. So if we count in that overhead too, then we end up with about this number,
3.7 megabytes, and that's just to mark the objects. And what's more, this 3.7 megabytes consists
of many small objects, so all this memory is actually not returned to the operating
system when you free them; malloc just doesn't do that.
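A quick back-of-the-envelope check of those numbers, using the figures from the talk (the bucket-array overhead estimate is ours):

```c
/* Mark-table-as-hash-table cost on 32-bit x86, per the talk:
 * ~150,000 live objects after loading Rails, 4 words (16 bytes) per
 * hash entry, ~8 bytes malloc overhead per allocation, and roughly
 * one bucket pointer per 5 entries before the table resizes. */
static long mark_hash_bytes(long objects) {
    long per_entry = 16 + 8;              /* entry + malloc overhead   */
    long buckets   = (objects / 5) * 8;   /* bucket array (estimate)   */
    return objects * per_entry + buckets;
}
```

For 150,000 objects this comes to 3,840,000 bytes, i.e. roughly the 3.7 megabytes quoted above.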
So the result of all this is whenever you run the garbage collector, it generates 3.7
megabytes of dirty pages. And then we realized, yeah, we did not really
need a full hash table, because we just want to know when an object is marked, so we just
need a set. [pause]
And a hash table entry consists of these members: you have hash, key, record, next. And
we can get rid of record because we were only mapping object addresses to true.
And hash is only used to speed up hash table resizing so we can get rid of that too in
return for making resizing a little bit slower. And then this new data structure, we call
it PointerSet. And entries are now only 16 bytes on x86, and that's including malloc
overhead. So if you multiply that by the number of objects and count in the bucket
overhead too,
then we end up with about 2.4 megabytes of memory.
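A minimal pointer-set sketch along those lines, with two-word chained entries (this is our illustration, not the actual Ruby Enterprise Edition code):

```c
#include <stdlib.h>

/* A PointerSet entry needs only the pointer itself and a chain link:
 * two words, versus the hash table's four (hash, key, record, next). */
typedef struct ps_entry {
    void *ptr;
    struct ps_entry *next;
} ps_entry;

#define PS_BUCKETS 1024

typedef struct {
    ps_entry *buckets[PS_BUCKETS];
} pointer_set;

static size_t ps_hash(void *p) {
    /* objects are aligned, so discard the low bits before bucketing */
    return ((size_t)p >> 4) % PS_BUCKETS;
}

static int ps_contains(pointer_set *s, void *p) {
    for (ps_entry *e = s->buckets[ps_hash(p)]; e; e = e->next)
        if (e->ptr == p) return 1;
    return 0;
}

static void ps_add(pointer_set *s, void *p) {
    if (ps_contains(s, p)) return;       /* set semantics: no duplicates */
    ps_entry *e = malloc(sizeof *e);
    e->ptr = p;
    e->next = s->buckets[ps_hash(p)];
    s->buckets[ps_hash(p)] = e;
}
```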
So the hash table, I mean sorry, I mean the, the garbage collector now only uses 2.4
megabytes of memory for the mark table. And the garbage collection speed did not change,
but the copy-on-write
efficiency went up to about 30 percent. So this is definitely an improvement. But 30
percent is still a bit, maahh, not very good. So we tried to optimize this even further.
And if you recall that all the set entries are allocated with malloc, and we know that
there are, there's a lot of them, then we thought: "Well, maybe you can optimize this
by using a memory pool." By using a memory pool, we not only get rid
of the malloc space overhead, but it also allows fast allocation because each entry
in the pool has a constant size. And this allows our pool to use a simpler algorithm
than malloc does. And the pool is allocated with mmaps so that
old memory can be released back to the OS and the results were pretty encouraging because
copy-on- write savings went up to about 40 percent, and we even got a 15 percent performance
improvement. So this is really nice. But, yeah, we were not satisfied, yet. Because
we thought: "Yeah, this can be better." And then we thought: "Hum, maybe we can save even
more memory by not using a set but actually a bit field as a mark table."
[pause] So if you look at how Ruby objects are stored
in memory - Ruby objects they are allocated on so-called Ruby heaps. And Ruby heaps are
not the same as the system heap that malloc uses, but Ruby heaps are themselves allocated
on the system heap. So a Ruby heap consists of multiple equally sized slots and each slot
is capable of storing a single Ruby object. And we could add a bit field at the beginning
of each Ruby heap and this bit field is then used as mark table. So this bit field would
have the same number of bits as the number of slots in the Ruby heap. And a one in the
bit field just means that the object at that slot in that heap is marked.
Now, not all slots are occupied by objects.
There were about 250,000 objects in our test script after loading Rails and after running
the garbage collector. And, but even so, altogether the bit fields only consumed 31 megabytes,
sorry, 31 kilobytes of memory and that's a far, far cry from all previous attempts.
With this, copy-on-write savings went up to 70 percent. And garbage collector performance
also improved by 60 percent compared to previous attempts. So this is a lot better.
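A sketch of the bit-field mark table idea (our simplified layout, not MRI's): one bit per slot, kept at the start of the heap, so marking writes only to these few words instead of to every object's page.

```c
#include <string.h>

#define SLOTS_PER_HEAP 10000
#define BITS_PER_WORD  (8 * sizeof(unsigned long))

/* One mark bit per heap slot, stored in a bit field at the start of
 * the heap instead of inside each object. */
typedef struct {
    unsigned long marks[(SLOTS_PER_HEAP + BITS_PER_WORD - 1) / BITS_PER_WORD];
    /* ...followed by SLOTS_PER_HEAP object slots in the real layout... */
} heap_header;

static void mark_slot(heap_header *h, int slot) {
    h->marks[slot / BITS_PER_WORD] |= 1UL << (slot % BITS_PER_WORD);
}

static int is_marked(heap_header *h, int slot) {
    return (h->marks[slot / BITS_PER_WORD] >> (slot % BITS_PER_WORD)) & 1UL;
}

static void clear_marks(heap_header *h) {
    memset(h->marks, 0, sizeof h->marks);
}
```

For 250,000 slots this needs 250,000 / 8 = 31,250 bytes, which matches the roughly 31 kilobytes mentioned above.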
[pause] And the 70 percent savings that we saw in
our test script, is a theoretical maximum because real life applications, they usually
create a lot of garbage for their own stuff, so the savings in practice are a little less
spectacular. At this point, we were able to save about
15 to 20 percent memory in real life applications, on average, against a five percent
overall performance overhead. But we can do even better, because there's
still some dirty pages left; but where are they? They are very hard to find. And after
a lot of pulling hair out of our heads and stuff like that, it turned out that the glibc
memory allocator was partially to blame. That's ptmalloc2.
Because if you take a look at this C code, it allocates one kilobyte of memory. A real
child process initially has a private dirty RSS of 125 kilobytes. But after executing
this code then the private dirty RSS suddenly jumps up to seven megabytes. Something is
definitely wrong here. And we suspect that ptmalloc2 makes a lot
of internal bookkeeping structures dirty, and that's what causing all these dirty pages.
So we researched other memory allocators that we could use. We tried nedmalloc first, but
we couldn't get it to work. Then we tried jemalloc. This is the memory allocator used
by FreeBSD and Firefox. But this did not seem to reduce the memory usage at all.
So eventually we settled for Google's tcmalloc and with this, practical copy-on-write
savings went up to 33 percent. And overall performance went up about 20 percent. So it's even
faster than normal Ruby. And we concluded that tcmalloc is faster than ptmalloc2 for
our workloads. The garbage collector is actually still slower
than normal Ruby, but because of the allocator we still have an overall positive performance
difference. [pause]
And the 33 percent memory savings in practice is confirmed by other parties as well. For
example, Shopify, an ecommerce shopping cart service. They testified that they saved a
lot of hardware resources with this. And we've also integrated this copy-on-write technology
with Phusion Passenger, a Ruby web application deployment software.
If you use Phusion Passenger with Ruby Enterprise Edition, then it will automatically use the
copy-on-write savings. So to sum up, by using a bit field as a mark
table for the garbage collector, and by using tcmalloc as memory allocator, we were not
only able to save 33 percent of memory on average in Rails apps, but we were also able
to make Ruby about 20 percent faster. And this concludes our part about the garbage
collector. Next up is Ninh about threading improvement.
[applause] >>Ninh Bui: So next I'd like to talk to you
guys a little about, a little bit about improving threading performance. And this is actually
a patch that was contributed by Joe Damato and Aman Gupta, who are the authors of
EventMachine, which is an IO library mainly used in Ruby.
And this is actually a good example, we believe, of a patch or contribution that allows us
to optimize MRI for server environments. So first off, let's go over a bit on, you
know, the threading model of Ruby 1.8. So as you can see, this is a typical green thread
setup, where you have n userspace threads all mapping to one kernel thread. And this
has some advantages as well as disadvantages. One of the biggest disadvantages of this in
this multi-core era is probably that the userspace threads can't utilize
symmetric multi-processing. Ruby 1.9, however, uses kernel threads. But
unfortunately you have a global interpreter lock which still prevents you from using multiple
cores. You know, Implementations such as Mac Ruby,
however, have been able to remove the global interpreter lock allowing for true utilization
of the hardware. And if you want to know more about that you
should probably speak to Laurent Sansonetti, who is sitting over there. He can tell you
probably a lot more about this than I can. So as for the scheduling, it is a pre-emptive
scheduler, which basically means that every userspace thread gets a certain amount of
time to complete its task before it gets preempted, and another thread will be scheduled
in its place. And this is done either through an itimer, a signal, or through a timer
thread, depending on the platform that you're using.
Another important thing about the Ruby 1.8
scheduler is that you can apply explicit yielding. So basically what you can do is if one
userspace thread is executing, you can explicitly yield your execution. So you can force
a context switch by using Thread.pass.
however, if we do a blocking call in one of these userspace threads, we are actually blocking
the entire kernel thread as well, and basically blocking all the userspace threads here as
well. And we need to solve this, in particular for
IO, but we'll get there in a few seconds actually. So to illustrate this, if you have a blocking_call
for example on the left hand side thread over there, then all the other userspace threads
will have to wait until that blocking_call finishes. And this may, may or may not be
forever. So basically you are blocking the whole system.
So this is important to fix for IO in particular, because you can have a certain
read system call which has to wait, for example, for data to come in.
And Ruby solved, solves this by using non-blocking IO. And the way it does that is when it detects
a blocking operation, for example, if you are reading from a file descriptor where there's
no data to be read from, then basically it will just schedule another thread for execution
in its place. So like I said, there are some advantages
of using userspace threads over kernel threads. It's a trade-off that you need to make.
And some of the advantages in this scenario are actually that they're fast and they are
cheap to spawn because everything happens in userspace so there's no kernel involvement
at all. This also means that you can shut them down
as fast as you want. And basically you have granular control over the context switching
as well, because you implement this yourself, so you can make it as fast or slow as you
want, basically.
But this is not particularly the case for Ruby, unfortunately. And this is what the
guys from EventMachine found out when they worked on improving this.
Because EventMachine allocates some data on the stack to read and write from, and they
noticed that the larger the stack became, the slower the context switch became.
And as an analogous example, I've put up here a C function that should illustrate the
problem that they encountered.
So first off, you see here that we allocate
50 kilobytes of data on the stack and we use memset here just to zero fill it so that GCC
doesn't perform any optimization such as removing this piece of data, because it's unreferenced
in later code. Basically what this, this C Function does
is basically just allocating 50 kilobytes on the stack and invoking the, the block that
you give it, as you can see here. So if we were to try to invoke the C extension
function inside Ruby, in this scenario with threads, then we see that here we allocate
50 kilobytes on the stack first; we do some silly calculations for 200,000 times and after
each iteration we explicitly force a context switch; then joining all these threads takes
about 13 seconds. Now the interesting part comes in, in play
when we actually remove the invocation of the C extension method, thus actually removing
the execution of, or actually, the allocation of 50 kilobytes on the stack.
Then suddenly the execution time drops down to 4.2 seconds. So there is actually a weird
correlation going on here, and definitely something going on with regards to allocating
stuff on the stack and what consequences it has on context switching.
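The transcript doesn't include the slide's code, but the function described above would look roughly like this (a hedged reconstruction; the names are ours, and in the real C extension the callback would be Ruby's rb_yield):

```c
#include <string.h>

static int calls = 0;
static void count_call(void) { calls++; }   /* stand-in for rb_yield */

/* Burn ~50 KB of C stack, then invoke the given callback. The memset
 * keeps GCC from optimising the otherwise-unreferenced buffer away. */
static void big_stack_frame(void (*callback)(void)) {
    char buf[50 * 1024];
    memset(buf, 0, sizeof buf);
    callback();
}
```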
So profiling the results in an EventMachine scenario with threads, which is very similar
to the example that we just showed you with Google perftools, you get something like this.
And here we've ordered it in, in order of, you know, frequency of invocations.
So first off you see the Ruby time slice handler; nothing weird going on there. At some later
point apparently they're doing a hash table lookup, then the Ruby scheduler, and at some
other later point you get your Ruby AST interpreter. This is not weird at all. However, this guy
over here is. Because apparently at some particular point it needs to invoke memcpy, and
in order to understand what it's using it for, we better take a look at what happens when
a context switch takes place.
it needs to store the state of the current thread, basically.
So what it does first is save the CPU registers with setjmp. Then apparently it
saves the stack frames to the heap. So basically it's taking the stack frames from the C stack
and it's copying it actually to the heap. That's kind of a WTF, but we'll get to that
a little later. At some later point it will save some VM globals
and for restoring it's pretty much doing the reverse of this. So, it's restoring the VM
globals; it's restoring the stack frames using memcpy, by the way; and then it's restoring
the CPU registers. And this has some implications actually, because
it's using memcpy to copy the stack data from and to the heap. And basically what this
means is if you have a larger C stack, then a context switch will take longer.
So that's actually the symptom that we saw over there when we had this 50 kilobyte stack;
then you saw that it drastically increased the context switch time.
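A stripped-down sketch of the save half of that dance (our simplification of the green-thread machinery, not MRI's code): registers go into a jmp_buf, and the live stack region is memcpy'd to a heap buffer, which is exactly why switch cost grows with stack size.

```c
#include <stdlib.h>
#include <string.h>
#include <setjmp.h>

/* Saved state of one green thread: CPU registers plus a heap copy of
 * its live C stack region. */
typedef struct {
    jmp_buf regs;          /* filled by setjmp in the real scheduler */
    char *stack_copy;      /* heap buffer holding the saved stack */
    size_t stack_len;
} saved_thread;

/* Copy the live stack region [stack_lo, stack_lo + len) to the heap.
 * The memcpy makes every context switch cost O(len). */
static void save_stack(saved_thread *t, const char *stack_lo, size_t len) {
    t->stack_copy = realloc(t->stack_copy, len);
    t->stack_len = len;
    memcpy(t->stack_copy, stack_lo, len);
}
```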
Now Ruby 1.9 is unaffected by this problem because it uses native threads, but Ruby 1.8
is still the most widely deployed version at this time.
So we wanted to solve this. So in order to understand how, how this can
be solved, you need to see how Ruby uses the stack. And an important element in that is
to know that the Ruby stack frame is basically just the C stack frame; it's sharing this.
And in particular you can determine what the stack frame size of this thing is by, for
example, using gdb and subtracting the base pointer from the stack pointer. That gives
you the frame size. Every function call, as a result, will put
roughly around one kilobyte on the stack. So if you have some Ruby code on the left
hand side, it will allocate one kilobyte for each of these invocations on the stack.
And it's important to fix this, we believe, because a typical real stack trace looks kind
of like this; and it's pretty much 65 levels deep, basically. So if we can solve this problem,
then for future versions that may use threading, for example, inside Rails - and as you can
see Rails uses the stack extensively - then context switching should be much faster.
This has some unexpected consequences during context switches as well, because the stack
contents can change ad hoc during run time, which means that it's very unsafe to refer
to a value on the stack from a native thread, because at a context switch the stack
contents can be changed.
So you can see that here, for example, if we have a value on stack here in this function
and we were to pass it, pass its address on to a thread function and spawn the thread,
then it would not be very safe to refer to that value because when the context switch
takes place, its contents could change drastically. So the fix for this is of course, to instead
of copying the current stack from and to the heap - like so - we can just not copy that
and just change the current stack pointer. So upon a context switch you just basically
change the stack pointer to point to the stack that you want to use on the heap. But this
requires
some platform specific code. [pause]
And before we dive into that, it's good to do a little refresher course here. Because
on x86_64 you have to remember that the heap grows from, grows upward. So from low address
numbers to high address numbers. And the stack, however, grows from high address numbers to
low address numbers. So as you can see the heap grows upwards whereas
the stack grows downwards. So if we were to allocate some memory on the
heap, then actually the result using either malloc or mmap will give us a pointer
to that address. And that address is actually the stack top.
This is a small implementation detail. And so to solve this, for example, for x86_64
you can use the following inline assembly that GCC supports. So as you can see below,
the last two lines are actually the input/output list, and we're referring to the
input/output list in the assembly string in a tokenized manner, similar to printf, if
you are familiar with that. So percent zero will refer to that value and
percent one will refer to the other value. So what this code does basically is it's,
it's moving the stack pointer to the value of stk_base which is kind of - we'll be able
to discuss a little bit about that later, if you have some questions on that.
And eventually it will invoke the function rb_thread_start with this new stack.
So there are some caveats to this actually,
because native stacks grow automatically. Our operating system usually takes care of
this, so it grows as it needs to. Our stack doesn't, however, because we're
allocating this on the heap and we need to manage this ourselves.
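The patch does this with inline assembly, but the same idea can be sketched portably with the POSIX ucontext API: run a function on a heap-allocated stack, then swap back. This is our illustration (works on Linux), not the patch's code:

```c
#include <ucontext.h>
#include <stdlib.h>

static ucontext_t main_ctx, thread_ctx;
static int ran_on_heap_stack = 0;

static void thread_body(void) {
    ran_on_heap_stack = 1;   /* runs with the stack pointer inside the heap block */
}

/* Allocate a stack on the heap, point a context at it, and run
 * thread_body there; uc_link brings control back here afterwards. */
static int run_on_heap_stack(size_t stack_size) {
    void *stack = malloc(stack_size);
    if (!stack) return 0;
    getcontext(&thread_ctx);
    thread_ctx.uc_stack.ss_sp = stack;
    thread_ctx.uc_stack.ss_size = stack_size;
    thread_ctx.uc_link = &main_ctx;
    makecontext(&thread_ctx, thread_body, 0);
    swapcontext(&main_ctx, &thread_ctx);   /* like the patch's stack-pointer swap */
    free(stack);
    return ran_on_heap_stack;
}
```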
So this raises a question, like how large must our thread stacks be? Also, how do we
allocate these thread stacks? Do we use mmap or do we use malloc? And lastly, how do we
handle stack overflows? Because it's very easy to fall off your stack, for example,
and all bad things could happen, basically. So this led to some decisions and some special
cases. First off, it was decided upon that the Ruby thread stack size would default to
one megabyte, which is about a thousand function calls deep, which we believe
is adequate for most use cases. The size for this is configurable during run
time for advanced users if they should need to do that.
Also the decision has been made to use mmap instead of malloc so that we don't have to
incur the malloc overhead that Hongli just talked about; and so that the
memory is guaranteed to be released back to the operating system as well.
As for stack overflows, we fixed this problem by putting a guard page with PROT_NONE
at the end of the stack to catch potential overflows. So if you try to read or write from
this guard page then you will get a segmentation fault.
As a result of that as well, the signal handler must actually run on a separate stack,
because a signal handler is just a function and can also reference the stack, and it
could also reference that PROT_NONE page that caused the signal handler to fire in the
first place. So you need something like sigaltstack to prevent that.
So in terms of benchmarks to see what the
results of this are, we can use the Alioth thread-ring test and instead of trying to
explain to you guys what this boring code does, it's probably better to do this through
a small animation. I love Keynote, by the way.
So what this benchmark basically does, it initializes a number to, for example, 50 million;
it then spawns 403 threads and runs all these threads sequentially in such a way
that each will subtract one from the number, and it will continue
to do this until at some particular point, the number will be zero.
And the next thread to be spun up will notice this, and for example, if this were to be
like, for example, thread number 13, then it will print out this number to the console
and it will exit. So the results of this benchmark are pretty
self-explanatory. As you can see, Ruby 1.8, the original version, takes about 1400 seconds
to execute this; whereas Ruby 1.9, takes about half of this.
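Stripped of the threads, the ring logic amounts to this (a sequential simulation of ours, useful only for sanity-checking who ends up with the token; the benchmark's actual cost is the context switching):

```c
/* A token holding a counter is passed around `nthreads` threads
 * numbered 1..nthreads; each pass decrements the counter, and the
 * thread holding the token when it reaches zero is the "winner". */
static int thread_ring_winner(long counter, int nthreads) {
    int id = 1;
    while (--counter > 0)
        id = (id % nthreads) + 1;   /* pass the token to the next thread */
    return id;
}
```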
The patched version of Ruby 1.8, however, is very similar to the Ruby 1.9 version,
and so we can infer that there is a delta there of about two point three times faster.
So things are definitely looking to shape
up here. But to truly understand how this improves the threading performance, or actually
the scheduling, we could spice things up a little bit by introducing this function, which
just recursively calls itself about a hundred times and yields the block that you
give to it. And as we already discussed earlier on, every method invocation that you do in
Ruby, allocates about a kilobyte on the C stack.
So basically what this code does it will grow the stack by about 100 kilobytes.
So when we rerun the test with this function that we just introduced, then you can see
that Ruby 1.8 will now take two hours to finish this code. Whereas Ruby 1.9 will take
12 minutes, and Ruby 1.8, the patched version, is still similar to Ruby 1.9 actually,
at 13 minutes.
But as you can see from this example the Delta now is nine point four times so it's, it's
definitely a cool patch, we believe. So as for threading in its current situation
in Ruby 1.8 and in particular in Ruby Enterprise Edition, the patch that Joe and Aman have
contributed is currently in Ruby Enterprise Edition and it's available for you now.
And it's currently only available on x86 and x86_64, but other platforms could be
supported provided that you supply the assembly and the proper stack growth calculations.
So basically, in conclusion, Ruby Enterprise Edition is not a fork, but it's a branch,
so regular merging does occur with upstream. Contrary to Ruby proper, we have
a more liberal patch acceptance policy, so if you guys have a cool patch that could
contribute to Ruby in a server environment, then we'd be more than happy to take it into
consideration.
And eventually, of course, at one particular point we do really hope that these patches
will find their way back to upstream; but until that point we're, we're maintaining
Ruby Enterprise Edition. So there are some other interesting Ruby Enterprise
Edition patches as well, by the community. RailsBench, for example, and GC statistics
and MBARI patches in particular, which improve the, the conservative garbage collector of
Ruby. Ruby has a conservative garbage collector
which scans the entire stack to determine whether or not a value on that stack is a
pointer to something on the Ruby heap. The MBARI patch has actually optimized finding
correct pointers to the Ruby heap. And lastly, we have a nice one from Philippe
Hanrigou: caller-for-all-threads, which will definitely help you when debugging.
So that's basically it, and if you have any questions, we'd be more than happy to try
and answer them. [pause]
>>Hongli Lai: Well, that's a surprise. >>Ninh Bui: Nobody's gonna ask if we're gonna
have beers afterwards. Because - [laughter]
Okay. Well, that's basically it. Thank you for listening and we hope you enjoyed it.
[applause] [techno music]