So basically, in this second part we are mainly going to look at how to handle some locking problems that can arise in the kernel. Two kinds of pretty common problems you can get with locks are described here.
The first one is called Lock Order Reversal (LOR).
When you have, for example, a thread A which owns a lock, for example L1, and another thread B which owns the lock L2. Then thread A tries to..
Right.. it's wrong.
The slide is wrong.
Thread A tries to acquire L2, but obviously it sleeps because it is owned by thread B, and then thread B tries to acquire the lock L1 and it sleeps because it's owned by thread A.
This is a situation that never ends, and it's pretty well documented in Cormen and in the traditional literature.
It's a classical deadlock.
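As a minimal sketch of the situation (not from the slides; using POSIX mutexes and hypothetical names for illustration):

    /* A lock order reversal (ABBA deadlock) between two threads. */
    #include <pthread.h>

    static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    static void *
    thread_a(void *arg)
    {
        pthread_mutex_lock(&L1);    /* A owns L1 */
        pthread_mutex_lock(&L2);    /* ...and blocks: B owns L2 */
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
        return (NULL);
    }

    static void *
    thread_b(void *arg)
    {
        pthread_mutex_lock(&L2);    /* B owns L2 */
        pthread_mutex_lock(&L1);    /* ...and blocks: A owns L1 */
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return (NULL);
    }

    /*
     * If both threads take their first lock before either takes its
     * second, neither can ever proceed: a classical deadlock.
     */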
This means that, as everybody who has ever read a book about operating systems knows, locks should maintain an ordering with regard to each other. That's not very simple when you are speaking about a kernel.
From this point of view, the fact that there are three classes of locks is going to count, because two different kinds of locks can only be mixed in one fixed direction. For example, a spinlock and a mutex can be mixed in this way: you can have the mutex first and the spinlock later, while the opposite is not actually allowed. So you will see that these kinds of deadlocks are possible only within the same class of locks, like for example two mutexes or two spin mutexes, or such.
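For example, a minimal sketch of the allowed direction with the FreeBSD mutex(9) API (the lock names are hypothetical; only the ordering matters here):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx m_sleep;      /* default (blocking) mutex */
    static struct mtx m_spin;       /* spin mutex */

    static void
    example_order(void)
    {
        mtx_init(&m_sleep, "example sleep mtx", NULL, MTX_DEF);
        mtx_init(&m_spin, "example spin mtx", NULL, MTX_SPIN);

        /* Allowed: blocking mutex first, spin mutex later. */
        mtx_lock(&m_sleep);
        mtx_lock_spin(&m_spin);
        mtx_unlock_spin(&m_spin);
        mtx_unlock(&m_sleep);

        /*
         * The opposite is not allowed: while holding a spin mutex
         * we must not acquire a blocking mutex, because that could
         * sleep; WITNESS would flag such an attempt.
         */
    }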
Also, even if it's not very well documented, spinlocks in FreeBSD, for example, have a way to identify such kinds of deadlocks, and it's actually implemented, a feature enabled in the code: they just count how many times they are spinning, and if the count exceeds a very large threshold, it means they are probably in a deadlock, and the system panics.
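A minimal sketch of that idea (hypothetical names and threshold, not the actual FreeBSD code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <machine/atomic.h>
    #include <machine/cpu.h>

    #define SPIN_THRESHOLD  10000000        /* "exaggerated" spin count */

    static void
    spinlock_enter(volatile u_int *lk)
    {
        long count = 0;

        /* Spin until we flip the lock word from 0 to 1. */
        while (atomic_cmpset_int(lk, 0, 1) == 0) {
            if (++count > SPIN_THRESHOLD)
                panic("spinlock: possible deadlock");
            cpu_spinwait();     /* e.g. the PAUSE instruction */
        }
    }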
Another common problem with locking is when you have a wait channel, like a condition variable (condvar), which is protected by a lock. Condition variables are usually paired with some preliminary conditions, like a waiters counter that has to be updated any time a thread goes to sleep on the channel, or flags to be set. If these conditions are not protected by the same lock, you can end up having some threads sleeping on this wait channel and nobody is going to wake them up again.
This is usually called a missed wakeup
and it's a pretty common mistake
that leads to a deadlock.
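As a minimal sketch of the pattern, in POSIX terms and with hypothetical names (the point is only that the counter, the flag, and the sleep all happen under the same lock):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static int nwaiters;        /* waiters counter */
    static int done;            /* the condition flag */

    static void
    wait_for_done(void)
    {
        pthread_mutex_lock(&m);
        nwaiters++;                     /* updated under the same lock */
        while (!done)
            pthread_cond_wait(&cv, &m);
        nwaiters--;
        pthread_mutex_unlock(&m);
    }

    static void
    signal_done(void)
    {
        pthread_mutex_lock(&m);
        done = 1;
        if (nwaiters > 0)               /* checked under the same lock */
            pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
    }

    /*
     * If done or nwaiters were touched without holding m, a thread
     * could check the condition, lose the race with the signal, and
     * then sleep forever: nobody will ever wake it up again.
     */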
The problem is that it's very difficult to differentiate between a missed wakeup and, for example, the legitimate forever-sleep of a thread that is just not likely to be awakened. So these kinds of deadlocks are very, very difficult to discover, and they require a bit of work that we will see right now.
For example, using some kernel subsystems and some facilities integrated into the debugger.
In FreeBSD, we have quite a lot of good mechanisms we can use to cope with kernel locking problems. The first one (and the most important) is called WITNESS. It was introduced in the context of SMPng and has been rewritten in the recent past, mainly through a contribution by Isilon Systems, who contributed their rewrite of WITNESS back.
This subsystem is very important because it tracks down exactly every ordering of the locks, so that if there is an ordering violation, like a LOR, it's going to tell the system. You can even set it to panic directly if it finds some deadlocks, or even just some possible deadlocks.
Another important feature is that it can keep track of read/write locks; doing that, we can possibly identify deadlocks even on the readers' paths.
We could say that WITNESS is pretty big, so activating it on a production system is never an option. It's mainly used when you are developing a new kernel feature and are going to test it heavily, in particular if it has some relation to locking.
We could also say that, with the new code provided by Isilon and Nokia, the overhead imposed by WITNESS is greatly reduced, to about a tenth of what we had before.
WITNESS is very good at tracking LORs, but it's not very good at, for example, helping you in the case of lost wakeups, mainly because of its nature.
It has very good integration with the DDB debugger, and in the 8.0 release we have new features that help you print out backtraces of the threads contending on the LORs and their orderings, and it can show graphs of the lock relations even from user space. You don't have to go into the kernel debugger to look at its output.
Well, I see that sometimes there is confusion about what information to report with regard to deadlock conditions and what help users can provide to developers about that. So we are going to see all the relevant information to collect when there is a deadlock in the kernel.
Usually, if you want to find a deadlock that's happening in the kernel, your first line of analysis starts from DDB rather than from a post-mortem analysis, and that's even more important here: using DDB you will get more precise and better information.
The most important pieces of information for finding the deadlock are the LORs reported by WITNESS, which let you see if there is something strange that can be happening.
You want to know the state of all the threads running on the system that is deadlocking. You can tell you're deadlocking if you see that there are just idle threads on the runqueues, because that is like saying that the runqueues are completely flushed and all the threads are sleeping in their own containers.
You need to know the exact locks that are acquired in the system, which is something WITNESS provides, and, very importantly, why the threads are stopping. So one of the most important things is retrieving what the threads were doing when they were put to sleep.
The backtraces of all the threads involved are printed out in order to identify deadlocks. In the case that the buffer cache and VFS are probably part of the deadlock, you should also print out the information about vnodes: what we're interested in is which vnodes are locked, which are actually referenced, and in which way they were locked.
So, this is an example of the thread states in the case of a deadlock. This is a real example of a deadlock, even if what you see is not totally complete, but you can see that all the threads are sleeping.
This one is the message used by the wait channel on which they're sleeping, or used by the container, like the turnstile or the sleepqueue.
If I recall correctly, it's a forced unmount that was deadlocking at some point, but I'm not really sure because I should look at it again. You can see that the relevant command here is ps, which DDB supports.
Another important thing is getting the situation of all the CPUs. As you can see there, that matters because you can have some corrupted data structures in the per-CPU data. That's a very common situation where you can get deadlocks, because, for example, leaving a corrupted LDT can lead to bigger, massive breakage like double faults and things like that. Usually it's always a good idea to look at all the CPUs in the system.
The command is "show allpcpu".
This one is a WITNESS-specific command, "show alllocks", and it's going to show all the locks in the system: who the owner is, like in this case a mount, and the thread is this one; which lock it is holding, that's the address; and where it was acquired, given as file and line.
Actually, that's only possible with WITNESS, because otherwise trying to keep all this information in a general-purpose kernel would be very expensive for the locking subsystem.
Then, the most important thing is the backtrace for any thread.
It's going to show the backtrace
for all the threads.
In this case, the thread with these addresses, TID and PID, basically went to sleep on a vnode. You will see the backtrace in this case: that's the context-switching function, those are the sleepqueue functions of the container that is holding the threads, and this one is what it was going to do before, in this case mounting a filesystem.
You will see that in a complete listing you will have a lot of these kinds of traces, but they are very important for the developers in order to understand what is going on.
These ones are the locked vnodes, which are also very important when a deadlock happens in VFS or in the buffer cache. You will see that these are the refcounts linked to the vnodes; they are specific to some handling of the vnodes, such as recycling and complete freeing.
That's the mount point the vnodes belong to, and that is the backtrace of what happened when the vnode was acquired.
You can see that this command also gives you information about the lock linked to the vnode. For example, it tells you that the lock is held in exclusive mode and that it has some shared waiters on its queues. That's also the inode number.
There is also other information you could retrieve from DDB that is linked to debugging deadlocks, like the sleep chains for any wait channel, if you have its address. You can also print the whole table of lock relations from WITNESS, but it's mostly never useful, because you should already know the ordering; you will just need to know which is the one that can give the trouble.
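Summing up, a quick reference of the DDB commands mentioned in this part, as documented in ddb(4) (check your release for the exact set):

    ps                    (state of all the threads)
    show allpcpu          (the situation of all the CPUs)
    show alllocks         (WITNESS: all acquired locks, owners, file and line)
    trace                 (backtrace; alltrace prints all threads' backtraces)
    show lockedvnods      (the locked vnodes, refcounts and mount points)
    show sleepchain       (the sleep chain, given a wait channel address)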
So, if you are going to submit reports, usually through GNATS, about problems that are probably deadlocks in kernel space, those are the pieces of information that we actually need.
Now, it's very rare to see really good reports about deadlocks, so I think it is a very good thing to talk about this.
Along with WITNESS, we have another important mechanism that can help us with deadlocks, and it's called KTR. KTR is basically a kernel logger of events. It's highly configurable: you can, for example, handle different classes of events. In FreeBSD we have classes linked to the scheduler, to locking, to the VFS, to callouts, and the events of each subsystem are packed in their own class, so they can be selectively enabled or masked.
For example, you can compile in several classes, like ten classes of KTR events, and then, if you are just interested in three of them, you mask the others: even if all ten are actually tracked, you are just going to see three of them.
Um, an important thing is that KTR doesn't deep-copy things like pointers: it doesn't store the information pointed to, it just stores the pointer itself. The same goes for strings: it doesn't make copies, so you need to pass pointers that stay persistent in memory, otherwise KTR won't be able to access them later.
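A minimal sketch of what that means with the CTR macros (the function and variables are hypothetical; the macro is the real ktr(9) interface):

    #include <sys/param.h>
    #include <sys/ktr.h>
    #include <sys/proc.h>

    static const char *state = "running";   /* persistent: OK to log */

    static void
    example(struct thread *td)
    {
        char local[32];                     /* stack buffer: NOT OK */

        /* CTR2() records the format string and two arguments. */
        CTR2(KTR_PROC, "thread %p is %s", td, state);

        /*
         * CTR1(KTR_PROC, "%s", local) would record only the pointer,
         * not a copy; by the time you dump the trace, the stack
         * memory behind it is long gone.
         */
    }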
The good thing about KTR is that you can also look at it from user space, through the ktrdump interface. Why is that important for locking? Because, after the fact, it can tell you what happened, on which CPU it happened, and the order it happened in. This is very important when you're going to track down races, for example, when you are not sure about the order of operations and how they happened: it's going to tell you.
For example, that is a typical KTR trace, where you have the CPU where the event happened, that's the index, and that's a timestamp; I think it's retrieved directly from the TSC, but I'm actually not sure.
In this case, I was tracking the scheduler class, so I was interested mainly in scheduler workloads, and I could see, for example, that a context switch happened scheduling the idle thread, and then other information, like the load of CPU 1 and some priority boosts, like this one, and other things.
You can enable the option KTR, but you must handle it carefully. For example, there is an option I didn't include here, KTR_ENTRIES, which is the length of the buffer used to store the entries, and you should specify enough entries to have reliable tracking. If you are going to track a lot of events, a short queue is not an option, because you are going to lose some information. A typical length is 2K (2048) entries.
These other options let you compile in some classes, or mask them, or even mask CPUs if you have a big SMP environment, so that you can selectively enable some of them. For example, this is very good for tracking down races in the sleepqueues. You can find references here.
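A sketch of how those kernel configuration options could look (the class names and values here are just examples; see ktr(4) and sys/ktr.h for the real masks):

    options KTR
    options KTR_ENTRIES=2048                   # buffer length, in entries
    options KTR_COMPILE=(KTR_SCHED|KTR_LOCK)   # classes compiled in
    options KTR_MASK=KTR_SCHED                 # classes enabled at boot
    options KTR_CPUMASK=0x3                    # trace only CPUs 0 and 1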
So, I will spend the last part of the speech speaking about possible improvements to our locking system, which is not very bad. Well, I think that actually our locking system is pretty complete, but it's also pretty confusing for newcomers and it's not widely documented, so maybe we should spend a good amount of time on documentation.
As you can see, even in this presentation, which is not very long, there are many things to say, and they are not very simple to understand, in particular for people who just need to do simple tasks. For example, I saw a lot of guys coming from the Linux world who wanted to use spinlocks all the time; it's obvious they were missing something about our architecture.
From a purely technical point of view, it would be very good if we could remove legacy and overlapping support. For example, we have lockmgr and sx locks, which are both read/write locks and are both served by sleepqueues. They have some differences, obviously, but we could mainly merge the missing bits and just use one of the two interfaces.
In the same way, as I told you before, the sleeping primitives tsleep, rw_sleep and sx_sleep should probably be replaced with condvars and dropped from our kernel, and we should probably drop sema too, because it is obsolete and can be replaced by condvars and mutexes.
From a strictly technical point of view, as you can see, we spent a lot of time on optimizing our blocking primitives, but very little on our spinning primitives. That's obviously because blocking primitives are our first choice, but spinlocks could be improved too, using techniques such as the so-called back-off algorithms and queued spinlocks.
I have a patch for that, but I would really need to test it and tune it on a big SMP environment; I don't think that, for now, it can handle such an environment.
In addition, in case you are not familiar with these algorithms: basically, back-off algorithms try to reduce the cache pressure of the threads contending for the lock by waiting for some time between attempts, with the waiting time increasing as time passes. Queued spinlocks, instead, have each thread spin on a local variable which is not shared with the other threads.
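A minimal sketch of the back-off idea (hypothetical names and constants, not the patch I mentioned):

    #include <sys/param.h>
    #include <machine/atomic.h>
    #include <machine/cpu.h>

    static void
    backoff_lock(volatile u_int *lk)
    {
        u_int delay = 1;

        for (;;) {
            /* Test first, then test-and-set, to avoid bus traffic. */
            if (*lk == 0 && atomic_cmpset_acq_int(lk, 0, 1))
                return;
            /*
             * Back off: spin locally for a while, doubling the wait
             * on each failure to reduce cache-line contention.
             */
            for (u_int i = 0; i < delay; i++)
                cpu_spinwait();
            if (delay < 1024)
                delay <<= 1;
        }
    }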
Another interesting thing would be benchmarking different wake-up algorithms for the blocking primitives. We have an algorithm that has proven to be quite good, but we have not compared it with other kinds of wake-ups that could have a higher overhead but could give improvements on a big SMP environment.
Another thing that would be very interesting to fix is the priority inversion problem in the case of read locks. There is an approach, called owner of record, implemented in FreeBSD, but I saw that it's pretty slow for our fast path. In FreeBSD all our locking primitives are broken into a fast path and a hard path, where the fast path is often just a single atomic operation, and if it fails, it falls down to the hard path, which does all the hard work with the sleepqueues. In this case, the owner-of-record technique was going to make the fast path not so simple anymore. Basically, it just considers the readers as a single owner, which can switch in and out, and it practically lends the priority to this owner of record, which does the right thing.
Another important thing, obviously, is improving locking where the optimal approach was not chosen. I see a lot of places in which the primitives chosen by the developers are not the most suitable ones, and we should switch to the right one: like, for example, the indiscriminate usage of spinlocks where blocking primitives would do, just to handle cases like the one we saw before with malloc needing to sleep.
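A minimal sketch of that malloc(9) case (the function is hypothetical; the point is that M_WAITOK may sleep, so it cannot be used while holding a mutex):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/malloc.h>

    MALLOC_DEFINE(M_EXAMPLE, "example", "example allocations");

    /* Called with *m held by the caller. */
    static void *
    alloc_under_lock(struct mtx *m, size_t len)
    {
        void *p;

        /* Under the lock we must not sleep, so try M_NOWAIT first. */
        p = malloc(len, M_EXAMPLE, M_NOWAIT);
        if (p == NULL) {
            mtx_unlock(m);                      /* drop the lock... */
            p = malloc(len, M_EXAMPLE, M_WAITOK);  /* ...then sleep */
            mtx_lock(m);        /* reacquire and revalidate state */
        }
        return (p);
    }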
Any questions?