So basically, in this second part we are mainly going to look at how to handle some locking problems that can arise in the kernel. Two kinds of pretty common problems you can get with locks are described here.
The first one is called Lock Order Reversal (LOR).
When you have, for example, a thread A which owns a lock, for example L1, and another thread B which owns the lock L2. Then thread A tries to..
Right.. it's wrong.
The slide is wrong.
Thread A tries to acquire L2, but obviously it sleeps because it is owned by thread B, and then thread B tries to acquire the lock L1 and it sleeps because it's owned by thread A.
This is a situation that never ends, and it's pretty well documented in Cormen and in the traditional literature.
It's a classical deadlock.
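As a minimal sketch of the situation (not from the slides; using POSIX mutexes and hypothetical names for illustration):

    /* A lock order reversal (ABBA deadlock) between two threads. */
    #include <pthread.h>

    static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    static void *
    thread_a(void *arg)
    {
        pthread_mutex_lock(&L1);    /* A owns L1 */
        pthread_mutex_lock(&L2);    /* ...and blocks: B owns L2 */
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
        return (NULL);
    }

    static void *
    thread_b(void *arg)
    {
        pthread_mutex_lock(&L2);    /* B owns L2 */
        pthread_mutex_lock(&L1);    /* ...and blocks: A owns L1 */
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return (NULL);
    }

    /*
     * If both threads take their first lock before either takes its
     * second, neither can ever proceed: a classical deadlock.
     */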
This means that, as everybody who has ever read a book about operating systems knows, locks should maintain an ordering with regard to each other. That's not very simple when you are speaking about a kernel.
From this point of view, the fact that there are three classes of locks is going to count, because two different kinds of locks can only be mixed in one fixed direction. For example, a spinlock and a mutex can be mixed in this way: you can have the mutex first and the spinlock later, while the opposite is not actually allowed. So you will see that these kinds of deadlocks are possible only within the same class of locks, like for example two mutexes or two spin mutexes, or such.
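For example, a minimal sketch of the allowed direction with the FreeBSD mutex(9) API (the lock names are hypothetical; only the ordering matters here):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx m_sleep;      /* default (blocking) mutex */
    static struct mtx m_spin;       /* spin mutex */

    static void
    example_order(void)
    {
        mtx_init(&m_sleep, "example sleep mtx", NULL, MTX_DEF);
        mtx_init(&m_spin, "example spin mtx", NULL, MTX_SPIN);

        /* Allowed: blocking mutex first, spin mutex later. */
        mtx_lock(&m_sleep);
        mtx_lock_spin(&m_spin);
        mtx_unlock_spin(&m_spin);
        mtx_unlock(&m_sleep);

        /*
         * The opposite is not allowed: while holding a spin mutex
         * we must not acquire a blocking mutex, because that could
         * sleep; WITNESS would flag such an attempt.
         */
    }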
Also, even if it's not very well documented, spinlocks in FreeBSD, for example, have a way to identify such kinds of deadlocks, and it's actually implemented, a feature enabled in the code: they just count how many times they are spinning, and if the count exceeds a very large threshold, it means they are probably in a deadlock, and the system panics.
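A minimal sketch of that idea (hypothetical names and threshold, not the actual FreeBSD code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <machine/atomic.h>
    #include <machine/cpu.h>

    #define SPIN_THRESHOLD  10000000        /* "exaggerated" spin count */

    static void
    spinlock_enter(volatile u_int *lk)
    {
        long count = 0;

        /* Spin until we flip the lock word from 0 to 1. */
        while (atomic_cmpset_int(lk, 0, 1) == 0) {
            if (++count > SPIN_THRESHOLD)
                panic("spinlock: possible deadlock");
            cpu_spinwait();     /* e.g. the PAUSE instruction */
        }
    }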
Another common problem with locking is when you have a wait channel, like a condition variable (condvar), which is protected by a lock. Condition variables are usually paired with some preliminary conditions, like a waiters counter that has to be updated any time a thread goes to sleep on the channel, or flags to be set. If these conditions are not protected by the same lock, you can end up having some threads sleeping on this wait channel and nobody is going to wake them up again.
This is usually called a missed wakeup
and it's a pretty common mistake
that leads to a deadlock.
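As a minimal sketch of the pattern, in POSIX terms and with hypothetical names (the point is only that the counter, the flag, and the sleep all happen under the same lock):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static int nwaiters;        /* waiters counter */
    static int done;            /* the condition flag */

    static void
    wait_for_done(void)
    {
        pthread_mutex_lock(&m);
        nwaiters++;                     /* updated under the same lock */
        while (!done)
            pthread_cond_wait(&cv, &m);
        nwaiters--;
        pthread_mutex_unlock(&m);
    }

    static void
    signal_done(void)
    {
        pthread_mutex_lock(&m);
        done = 1;
        if (nwaiters > 0)               /* checked under the same lock */
            pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
    }

    /*
     * If done or nwaiters were touched without holding m, a thread
     * could check the condition, lose the race with the signal, and
     * then sleep forever: nobody will ever wake it up again.
     */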
The problem is that it's very difficult to differentiate between a missed wakeup and, for example, the legitimate forever-sleep of a thread that is just not likely to be awakened. So these kinds of deadlocks are very, very difficult to discover, and they require a bit of work that we will see right now.
For example, using some kernel subsystems and some facilities integrated into the debugger.
In FreeBSD, we have quite a lot of good mechanisms we can use to cope with kernel locking problems. The first one (and the most important) is called WITNESS. It was introduced in the context of SMPng and has been rewritten in the recent past, mainly through a contribution by Isilon Systems, who contributed their rewrite of WITNESS back.
This subsystem is very important because it tracks down exactly every ordering of the locks, so that if there is an ordering violation, like a LOR, it's going to tell the system. You can even set it to panic directly if it finds some deadlocks, or even just some possible deadlocks.
Another important feature is that it can keep track of read/write locks; doing that, we can possibly identify deadlocks even on the readers' paths.
We could say that WITNESS is pretty big, so activating it on a production system is never an option. It's mainly used when you are developing a new kernel feature and are going to test it heavily, in particular if it has some relation to locking.
We could also say that, with the new code provided by Isilon and Nokia, the overhead imposed by WITNESS is greatly reduced, to about a tenth of what we had before.
WITNESS is very good at tracking LORs, but it's not very good at, for example, helping you in the case of lost wakeups, mainly because of its nature.
It has very good integration with the DDB debugger, and in the 8.0 release we have new features that help you print out backtraces of the threads contending on the LORs and their orderings, and it can show graphs of the lock relations even from user space. You don't have to go into the kernel debugger to look at its output.
Well, I see that sometimes there is confusion about what information to report with regard to deadlock conditions and what help users can provide to developers about that. So we are going to see all the relevant information to collect when there is a deadlock in the kernel.
Usually, if you want to find a deadlock that's happening in the kernel, your first line of analysis starts from DDB rather than from a post-mortem analysis, and that's even more important here: using DDB you will get more precise and better information.
The most important pieces of information for finding the deadlock are the LORs reported by WITNESS, which let you see if there is something strange that can be happening.
You want to know the state of all the threads running on the system that is deadlocking. You can tell you're deadlocking if you see that there are just idle threads on the runqueues, because that is like saying that the runqueues are completely flushed and all the threads are sleeping in their own containers.
You need to know the exact locks that are acquired in the system, which is something WITNESS provides, and, very importantly, why the threads are stopping. So one of the most important things is retrieving what the threads were doing when they were put to sleep.
The backtraces of all the threads involved are printed out in order to identify deadlocks. In the case that the buffer cache and VFS are probably part of the deadlock, you should also print out the information about vnodes: what we're interested in is which vnodes are locked, which are actually referenced, and in which way they were locked.
So, this is an example of the thread states in the case of a deadlock. This is a real example of a deadlock, even if what you see is not totally complete, but you can see that all the threads are sleeping.
This one is the message used by the wait channel on which they're sleeping, or used by the container, like the turnstile or the sleepqueue.
If I recall correctly, it's a forced unmount that was deadlocking at some point, but I'm not really sure because I should look at it again. You can see that the relevant command here is ps, which DDB supports.
Another important thing is getting the situation of all the CPUs. As you can see there, that matters because you can have some corrupted data structures in the per-CPU data. That's a very common situation where you can get deadlocks, because, for example, leaving a corrupted LDT can lead to bigger, massive breakage like double faults and things like that. Usually it's always a good idea to look at all the CPUs in the system.
The command is "show allpcpu".
This one is a WITNESS-specific command, "show alllocks", and it's going to show all the locks in the system: who the owner is, like in this case a mount, and the thread is this one; which lock it is holding, that's the address; and where it was acquired, given as file and line.
Actually, that's only possible with WITNESS, because otherwise trying to keep all this information in a general-purpose kernel would be very expensive for the locking subsystem.
Then, the most important thing is the backtrace for any thread.
It's going to show the backtrace
for all the threads.
In this case, the thread with these addresses, TID and PID, basically went to sleep on a vnode. You will see the backtrace in this case: that's the context-switching function, those are the sleepqueue functions of the container that is holding the threads, and this one is what it was going to do before, in this case mounting a filesystem.
You will see that in a complete listing you will have a lot of these kinds of traces, but they are very important for the developers in order to understand what is going on.
These ones are the locked vnodes, which are also very important when a deadlock happens in VFS or in the buffer cache. You will see that these are the refcounts linked to the vnodes; they are specific to some handling of the vnodes, such as recycling and complete freeing.
That's the mount point the vnodes belong to, and that is the backtrace of what happened when the vnode was acquired.
You can see that this command also gives you information about the lock linked to the vnode. For example, it tells you that the lock is held in exclusive mode and that it has some shared waiters on its queues. That's also the inode number.
There is also other information you could retrieve from DDB that is linked to debugging deadlocks, like the sleep chains for any wait channel, if you have its address. You can also print the whole table of lock relations from WITNESS, but it's mostly never useful, because you should already know the ordering; you will just need to know which is the one that can give the trouble.
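Summing up, a quick reference of the DDB commands mentioned in this part, as documented in ddb(4) (check your release for the exact set):

    ps                    (state of all the threads)
    show allpcpu          (the situation of all the CPUs)
    show alllocks         (WITNESS: all acquired locks, owners, file and line)
    trace                 (backtrace; alltrace prints all threads' backtraces)
    show lockedvnods      (the locked vnodes, refcounts and mount points)
    show sleepchain       (the sleep chain, given a wait channel address)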
So, if you are going to submit reports, usually through GNATS, about problems that are probably deadlocks in kernel space, those are the pieces of information that we actually need.
Now, it's very rare to see really good reports about deadlocks, so I think it is a very good thing to talk about this.
Along with WITNESS, we have another important mechanism that can help us with deadlocks, and it's called KTR. KTR is basically a kernel logger of events. It's highly configurable: you can, for example, handle different classes of events. In FreeBSD we have classes linked to the scheduler, to locking, to the VFS, to callouts, and the events of each subsystem are packed in their own class, so they can be selectively enabled or masked.
For example, you can compile in several classes, like ten classes of KTR events, and then, if you are just interested in three of them, you mask the others: even if all ten are actually tracked, you are just going to see three of them.
Um, an important thing is that KTR doesn't deep-copy things like pointers: it doesn't store the information pointed to, it just stores the pointer itself. The same goes for strings: it doesn't make copies, so you need to pass pointers that stay persistent in memory, otherwise KTR won't be able to access them later.
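A minimal sketch of what that means with the CTR macros (the function and variables are hypothetical; the macro is the real ktr(9) interface):

    #include <sys/param.h>
    #include <sys/ktr.h>
    #include <sys/proc.h>

    static const char *state = "running";   /* persistent: OK to log */

    static void
    example(struct thread *td)
    {
        char local[32];                     /* stack buffer: NOT OK */

        /* CTR2() records the format string and two arguments. */
        CTR2(KTR_PROC, "thread %p is %s", td, state);

        /*
         * CTR1(KTR_PROC, "%s", local) would record only the pointer,
         * not a copy; by the time you dump the trace, the stack
         * memory behind it is long gone.
         */
    }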
The good thing about KTR is that you can also look at it from user space, through the ktrdump interface. Why is that important for locking? Because, after the fact, it can tell you what happened, on which CPU it happened, and the order it happened in. This is very important when you're going to track down races, for example, when you are not sure about the order of operations and how they happened: it's going to tell you.
For example, that is a typical KTR trace, where you have the CPU where the event happened, that's the index, and that's a timestamp; I think it's retrieved directly from the TSC, but I'm actually not sure.
In this case, I was tracking the scheduler class, so I was interested mainly in scheduler workloads, and I could see, for example, that a context switch happened scheduling the idle thread, and then other information, like the load of CPU 1 and some priority boosts, like this one, and other things.
You can enable the option KTR, but you must handle it carefully. For example, there is an option I didn't include here, KTR_ENTRIES, which is the length of the buffer used to store the entries, and you should specify enough entries to have reliable tracking. If you are going to track a lot of events, a short queue is not an option, because you are going to lose some information. A typical length is 2K (2048) entries.
These other options let you compile in some classes, or mask them, or even mask CPUs if you have a big SMP environment, so that you can selectively enable some of them. For example, this is very good for tracking down races in the sleepqueues. You can find references here.
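A sketch of how those kernel configuration options could look (the class names and values here are just examples; see ktr(4) and sys/ktr.h for the real masks):

    options KTR
    options KTR_ENTRIES=2048                   # buffer length, in entries
    options KTR_COMPILE=(KTR_SCHED|KTR_LOCK)   # classes compiled in
    options KTR_MASK=KTR_SCHED                 # classes enabled at boot
    options KTR_CPUMASK=0x3                    # trace only CPUs 0 and 1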
So, I will spend the last part of the speech speaking about possible improvements to our locking system, which is not very bad. Well, I think that actually our locking system is pretty complete, but it's also pretty confusing for newcomers and it's not widely documented, so maybe we should spend a good amount of time on documentation.
As you can see, even in this presentation, which is not very long, there are many things to say, and they are not very simple to understand, in particular for people who just need to do simple tasks. For example, I saw a lot of guys coming from the Linux world who wanted to use spinlocks all the time; it's obvious they were missing something about our architecture.
From a purely technical point of view, it would be very good if we could remove legacy and overlapping support. For example, we have lockmgr and sx locks, which are both read/write locks and are both served by sleepqueues. They have some differences, obviously, but we could mainly merge the missing bits and just use one of the two interfaces.
In the same way, as I told you before, the sleeping primitives tsleep, rw_sleep and sx_sleep should probably be replaced with condvars and dropped from our kernel, and we should probably drop sema too, because it is obsolete and can be replaced by condvars and mutexes.
From a strictly technical point of view, as you can see, we spent a lot of time on optimizing our blocking primitives, but very little on our spinning primitives. That's obviously because blocking primitives are our first choice, but spinlocks could be improved too, using techniques such as the so-called back-off algorithms and queued spinlocks.
I have a patch for that, but I would really need to test it and tune it on a big SMP environment; I don't think that, for now, it can handle such an environment.
In addition, in case you are not familiar with these algorithms: basically, back-off algorithms try to reduce the cache pressure of the threads contending for the lock by waiting for some time between attempts, with the waiting time increasing as time passes. Queued spinlocks, instead, have each thread spin on a local variable which is not shared with the other threads.
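A minimal sketch of the back-off idea (hypothetical names and constants, not the patch I mentioned):

    #include <sys/param.h>
    #include <machine/atomic.h>
    #include <machine/cpu.h>

    static void
    backoff_lock(volatile u_int *lk)
    {
        u_int delay = 1;

        for (;;) {
            /* Test first, then test-and-set, to avoid bus traffic. */
            if (*lk == 0 && atomic_cmpset_acq_int(lk, 0, 1))
                return;
            /*
             * Back off: spin locally for a while, doubling the wait
             * on each failure to reduce cache-line contention.
             */
            for (u_int i = 0; i < delay; i++)
                cpu_spinwait();
            if (delay < 1024)
                delay <<= 1;
        }
    }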
Another interesting thing would be benchmarking different wake-up algorithms for the blocking primitives. We have an algorithm that has proven to be quite good, but we have not compared it with other kinds of wake-ups that could have a higher overhead but could give improvements on a big SMP environment.
Another thing that would be very interesting to fix is the priority inversion problem in the case of read locks. There is an approach, called owner of record, implemented in FreeBSD, but I saw that it's pretty slow for our fast path. In FreeBSD all our locking primitives are broken into a fast path and a hard path, where the fast path is often just a single atomic operation, and if it fails, it falls down to the hard path, which does all the hard work with the sleepqueues. In this case, the owner-of-record technique was going to make the fast path not so simple anymore. Basically, it just considers the readers as a single owner, which can switch in and out, and it practically lends the priority to this owner of record, which does the right thing.
Another important thing, obviously, is improving locking where the optimal approach was not chosen. I see a lot of places in which the primitives chosen by the developers are not the most suitable ones, and we should switch to the right one: like, for example, the indiscriminate usage of spinlocks where blocking primitives would do, just to handle cases like the one we saw before with malloc needing to sleep.
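A minimal sketch of that malloc(9) case (the function is hypothetical; the point is that M_WAITOK may sleep, so it cannot be used while holding a mutex):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/malloc.h>

    MALLOC_DEFINE(M_EXAMPLE, "example", "example allocations");

    /* Called with *m held by the caller. */
    static void *
    alloc_under_lock(struct mtx *m, size_t len)
    {
        void *p;

        /* Under the lock we must not sleep, so try M_NOWAIT first. */
        p = malloc(len, M_EXAMPLE, M_NOWAIT);
        if (p == NULL) {
            mtx_unlock(m);                      /* drop the lock... */
            p = malloc(len, M_EXAMPLE, M_WAITOK);  /* ...then sleep */
            mtx_lock(m);        /* reacquire and revalidate state */
        }
        return (p);
    }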
Any questions?