I do apologize for the EuroBSDCon slides; I'd redone the title page and made some changes to the slides, and they didn't make it through for approval by this afternoon. Okay.
So I'm going to be talking about isolating jobs for performance and predictability in clusters. Before I get into that, I want to talk a little bit about who we are and what our problem space is like, because that has an effect on our solution space.
I work for The Aerospace Corporation. We operate a federally funded research and development center in the area of national security space. In particular, we work with the Air Force Space and Missile Systems Center and with the National Reconnaissance Office, and our engineers support a wide variety of activities within that area.
We have a bit over twenty-four hundred engineers, in virtually every discipline. As you would expect, we have our rocket scientists; we have people who build satellites; we have people who build the sensors that go on satellites, and people who study the sorts of things you see when you use those sensors. We also have civil engineers, electronic engineers, and computer-process people. We literally do everything related to space, and all sorts of things you might not expect to be related to space; for instance, we also help build ground systems, because satellites aren't very useful if there isn't anything to talk to them.
Since these engineers are solving all these different problems, we have engineering applications of virtually every size you can think of, ranging from little spreadsheet things that you might not think of as engineering applications (but they are), to MATLAB programs or a lot of C code, traditional serial code, and then large parallel applications: either in-house things, genetic algorithms and that sort of thing, or the classic parallel code you'd run on a Cray or something, material simulation or fluid flow.
So we have this big application space. I wanted to give a little introduction to that, because it does come back and influence the sort of solutions we look at. (We skipped a slide; there we are, that's a little better.)
Now, what I'm interested in: I do high performance computing at the company, and I provide high performance computing resources to our users as part of my role in our technical computing services organization.
Our primary resource at this point is the Fellowship cluster, named for the Fellowship of the Ring. It's eleven racks of nodes around the core systems, and over here there's a large Cisco switch; actually, today there are two Cisco 6509s, because we couldn't get the port density we wanted otherwise. It's primarily a gigabit Ethernet system, and it runs FreeBSD, currently 6.0 because we haven't upgraded it yet. We're planning to move probably to 7.1, or maybe slightly past 7.1 if we want to get the latest hwpmc changes in.
We use the Sun Grid Engine scheduler, which is one of the two main options for open source resource managers on clusters, the other being the TORQUE and Maui combination from Cluster Resources.
We also have storage: 40 TB, though that's really the raw number on a Sun Thumper; it's about 32 TB usable once you start using RAID-Z2, since you might actually like to still have your data should a disk fail, and with today's disks RAID 5 doesn't really cut it. We also have some other resources coming on, two smaller clusters (unfortunately probably running Linux) and some SMPs, but I'm going to be concentrating here on the work we're doing on our FreeBSD-based cluster.
First of all, I want to talk about why we want to share resources; it should be fairly obvious, but I'll talk about it a little bit. Then, what goes wrong when you start sharing resources. After that, I'll talk about some different solutions to those problems, and some fairly trivial experiments we've done so far in terms of enhancing the scheduler and using operating system features to mitigate those problems, and then conclude with some future work.
Obviously, if you have a resource the size of our cluster, roughly fourteen hundred cores, you probably want to share it. Unless you purpose-built it for a single application, you're going to want your users sharing it, and you don't want to just say "you get it on Monday"; that's probably not a very effective option, especially not when we have as many users as we do. We also can't afford to buy another one every time a user shows up. One of our senior VPs said a while back that we could probably afford to buy just about anything we could need, once; we can't buy ten of them, though. If we really, really needed it, dropping small numbers of millions of dollars on computing resources wouldn't be impossible, but we can't just have every engineer who wants one call up Dell and say "ship me ten racks". It's not going to work.
The other thing is that we need to provide quick turnaround for some users, so we can't have one user hogging the system until they're done before the next one can run. We have users who'll come in and say "I need to run for three months", and we've had users come in and literally run, pretty much using the entire system, for three months. So we've had to provide some ability for other users to still get their work done; we do have to have some sharing.
However, when you start to share any resource like this, you start getting contention: users need the same thing at the same time, they fight back and forth for it, and they can't get what they want, so you have to balance them a bit. Also, some jobs lie when they request resources, and actually need more than they ask for, which can cause problems: we schedule them, we say "you're going to fit here fine", and they run off and use more than they said. If we don't have a mechanism to constrain them, we have problems.
Likewise, once these users start to contend, it doesn't just result in the jobs taking longer in terms of wall clock time because they're running slowly; there's overhead related to that contention. They get swapped out due to pressure on various systems. If you run out of memory, for instance, you go into swap, and you end up wasting all your cycles pulling junk in and out of disk, wasting your bandwidth on that. So there are resource costs to the contention, not merely a delay in returning results.
Now I'm going to switch gears and talk a little bit about different solutions to these contention issues, and look at different ways of solving the problem. Most of these are things that have already been done, but I want to go through the different approaches and then evaluate them in our context.
A classic solution to the problem is gang scheduling. It's basically conventional Unix process context switching written really big: you have your parallel job running on a system, it runs for a while, and after a certain amount of time you kick it off all the nodes and let the next one come in. Typically, when people do this, they do it on the order of hours, because the context switch time is extremely high. It's not just like swapping a process in and out: you suddenly have to coordinate the context switch across all your processes. If you're running, say, MPI over TCP, you actually need to tear down the TCP sessions, because you can't just have TCP timers sitting around, that sort of thing. So there's a lot of overhead associated with this; you take a long context switch. If all of your infrastructure supports it, it's fairly effective, and it does allow jobs to avoid interfering with each other, which is nice; you don't have issues, because you're typically allocating whole swaths of the system. And for properly written applications, partial results can be returned, which for some of our users is really important. Where you're doing a refinement, users want to look at the results and say: okay, is this just going off into the weeds, or does it look like it's actually converging on some sort of useful solution? They don't want to just wait until the end.
The downside, of course, is that the context switch costs are very high, and, most importantly, there's a real lack of useful implementations. A number of platforms have implemented this in the past, but in practice, on modern clusters built on commodity hardware with communication libraries written on standard protocols, the tools just aren't there, so it's not very practical.
It also doesn't make a lot of sense for small jobs. One of the things we've found is that we have users with embarrassingly parallel problems, where they need to look at, say, twenty thousand studies. They could write something that looked more like a conventional parallel application, where they wrote a scheduler, set up MPI (the Message Passing Interface), and handed out tasks to pieces of their job. But then they would be running a scheduler, and they would probably do a bad job of it; it turns out to be fairly difficult to do right, even in the trivial case. So what they do instead is just submit twenty thousand jobs to Grid Engine and say: okay, whatever, deal with it. In earlier versions that might have been a problem; current versions of the code easily handle a million jobs, so it's not really a big deal.
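Just to show the mechanics, an array-job submission like that is a one-liner in SGE; the script and program names here are made up:

    # Submit a single array job with 20,000 tasks; Grid Engine queues
    # and tracks them, so the user never writes their own scheduler.
    qsub -t 1-20000 sweep.sh

    # Inside sweep.sh, each task picks its own study from $SGE_TASK_ID:
    #   ./run_study "input.${SGE_TASK_ID}"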
But those sorts of users wouldn't fit well into a gang-scheduled environment, at least not a conventional one where you do gang scheduling at the granularity of jobs, so from that perspective it wouldn't work very well. If you have all the pieces in place and you're running big parallel applications, though, it is in fact an extremely effective approach.
Another option, which is sort of related (in fact, it takes an even coarser granularity), is single-application or single-project clusters or sub-clusters. This is used at some national labs, where you're given a cycle allocation for a year based on your grant proposals, and what your cycle allocation actually comes to you as is: here's your cluster, here's a frontend, here's this chunk of nodes, they're yours, go to it. Install your own OS, whatever you want; it's yours.
At a somewhat finer scale, there are things like Emulab, the network emulation system, which also does OS install and configuration, so you could do dynamic allocation that way; Sun's Project Hedeby (actually, I think the productized version is now called Service Domain Manager); or Clusters on Demand, who were actually talking about web hosting clusters. These are things that allow rapid deployment at a more granular level than the allocate-once-a-year approach, but nonetheless let you give people whole clusters to work with.
One nice thing about it is that the isolation between the processes is complete, so you don't have to worry about users stomping on each other. It's their own system; they can trash it all they want. If they flood the network or run the nodes into swap, well, that's their problem. It also has the advantage that you can tailor the images on the nodes, the operating systems, to meet the exact needs of the application.
The downside, of course, is its coarse granularity; in our environment that doesn't work very well, since we do have all these different types of jobs. Context switches are also pretty expensive, certainly on the order of minutes; Emulab typically claims something like ten minutes. There are some systems out there, for instance if you use coreboot (I think that's what they're calling it today; it used to be LinuxBIOS), where you can actually deploy a system in tens of seconds, mostly by getting rid of all that junk the BIOS writers wrote; the OS boots pretty fast if you don't have all that stuff to waylay you. But in practice, on off-the-shelf hardware, the context switch times are quite high. Users can, of course, still interfere with themselves. You can argue that's not a problem, but ideally you would like to prevent it.
One of the things I have to deal with is that my users are almost universally not trained as computer scientists or programmers. They're trained in their domain area, and they're really good in that area, but their concepts of the way hardware and software work don't match reality in many cases.
(Inaudible question.) It's pretty rare in practice. I've heard of one lab that does it significantly, but they do it on a yearly allocation basis and throw the hardware away after two or three years. You do typically have some sort of deployment system in place, and in those cases your application usually comes with "here's what we're going to spend, on this many people, on this project", so it's a big resource allocation.
One other issue with this is that there's no real easy way to capture underutilized resources. For example, say you have an application which is single-threaded and uses a ton of memory, running on a machine; the machines we're buying these days are eight-core, so that's wasting a lot of CPU cycles, just generating a lot of heat doing nothing. Ideally, you would like a scheduler that says: okay, you're using seven of the eight gigabytes of RAM, but we've got these jobs sitting here that need next to nothing, a hundred megabytes, so we'll slap seven of those in along with the big job and backfill. In this mechanism there's no good way to do that. Obviously, if one user has that application mix, they can do it themselves, but it's not something where we can easily bring in more jobs and have a mix that takes advantage of the different resources.
A related approach is to install virtualization software on the equipment. This is the essence of what cloud computing is at the moment: it's Amazon providing Xen hosting for relatively arbitrary OS images. It does have the advantage of allowing rapid deployment, and, in theory, if your application is scalable, it provides for extremely high scalability, particularly if you aren't us and can therefore use somebody else's hardware. In our applications' case that's not very practical, so we can't do that.
It also has the advantage that people can have their own image, which is tightly resource-constrained, and you can run more than one of them on a node. For instance, you can give one job four cores and another job two cores, and have a couple of single-core jobs. In theory, you can get fairly strong isolation there. Obviously there are shared resources underneath, and you probably can't afford to completely isolate, say, network bandwidth at the bottom layer; you can do some, but if you go overboard you can spend all your time on accounting.
You can also, again, tailor the images to the job, and in this environment you can do that even more strongly than in the sub-cluster approach, in that you can often run a five-year-old or ten-year-old operating system if you're using full virtualization. That can allow obsolete code with weird baselines to work, which is important in our space, because our average project runs ten years or more, and as a result you might have to go rerun a program that was written way back on some ancient version of Windows or whatever.
It also provides the ability to recover resources, as I was talking about before, which you can't do easily with sub-clusters: here you can just slip another image onto the node, say "you can use anything left over", and essentially give that image idle priority.
The downside, of course, is that the isolation is incomplete, in that there is shared hardware. You're not likely to find, I don't think, any virtualization system out there right now that virtualizes your segment of memory bandwidth or your share of cache space, so users can in fact interfere with themselves and each other in this environment.
It's also not really efficient for small jobs; the cost of running an entire OS for every job is fairly high. Even with relatively light Unix-like OSes, you're still looking at a couple hundred megabytes in practice once you get everything up and running, unless you run something totally stripped down. There's significant overhead. There's CPU slowdown: typical estimates are in the twenty percent range, with numbers really ranging from five percent to fifty percent depending on exactly what you're doing, possibly even lower, or higher. And because you have the whole OS there, there's a lot of duplicated stuff. The various vendors have their answers; they claim "we can merge that: you're running the same kernel, so we'll share the memory", but at some level it's all going to get duplicated.
A related option comes from the Internet hosting industry, which is the technology from virtual private servers. The example everyone here is probably familiar with is jails, where you can provide your own file system root, your own network interface, and whatnot. The nice thing about this is that, unlike full virtualization, the overhead is very small: basically it costs you an entry in your process table and an entry in a few structures. There are some extra tests in the kernel, but otherwise there's not a huge overhead for the virtualization, and you don't need an extra kernel for every image. So you see the difference here: you might be able to squeeze two hundred VMware images onto a machine (the VMware people say "no, no, don't do that", but we have machines running nearly that many), while on the other hand there are people out there who run thousands of virtual hosts on a single machine using this technique. That's a big difference in resource use, especially in the lightly loaded case. In our environment we're looking at running a very small number of them per node, but the overhead difference is still significant.
You still have some ability to tailor the images to a job's needs. You could have a custom root: for instance, you could be running FreeBSD 6.0 in one virtual server and 7.0 in another. You have to be running a 7.0 or 8.0 kernel to make that work, of course, but it allows you to do that. In principle, we can also do evil things like a 64-bit kernel with 32-bit user spaces, for when you have applications or libraries you can't find the source to anymore; there are interesting possibilities there. And the other nice thing is that, since you're doing a very lightweight and incomplete virtualization, you don't have to virtualize the things you don't care about, so you don't have the overhead of virtualizing everything.
The downsides, of course, are incomplete isolation (you are running processes on the same kernel, and they can interfere with each other) and dubious flexibility; obviously, I don't think anyone should have the ability to run Windows in a jail. There's some NetBSD support, but I don't think it's really gotten to that point.
One final area, which diverges a bit from these, is the classic Unix solution to the problem on a single machine: using existing resource limits and resource partitioning techniques. For example, all Unix-like systems have per-process resource limits, and cluster schedulers typically support the common ones, so you can set a memory limit or a CPU time limit on your process, and the schedulers typically provide at least launch-time support for those limits on a given set of processes that's part of a job. There are also a number of forms of resource partitioning available as standard features. Memory disks are one of them: if you want to create a file system space that's limited in size, create a memory disk and back it with a file or with swap, partitioning disk use. And there are techniques like CPU affinities, where you can lock processes to a single processor or a set of processors, so they can't interfere with processes running on other processors.
The nice thing about this, first, is that you're using existing facilities, so you don't have to write lots of new features for a niche application, and they tend to integrate well with existing schedulers; in many cases parts of them are already implemented. In fact, the experiments I'll talk about later all use this type of technique.
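As a flavor of what's already there, here's a minimal sh sketch of these facilities; the sizes and core numbers are invented for illustration:

    # Per-process resource limits, e.g. from a job prologue:
    ulimit -t 3600          # at most an hour of CPU time
    ulimit -d 2097152       # data segment capped at 2 GB (value in KB)

    # Resource partitioning: a fixed-size, swap-backed file system:
    mdmfs -s 512m md /scratch

    # CPU affinity: lock this shell and its children to cores 4-7
    # (FreeBSD 7.1+ cpuset(1)):
    cpuset -l 4-7 -p $$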
The cons are, of course, incomplete isolation again, and that there's typically no unified framework for the concept of a job, where a job is composed of a set of processes. There are a number of data structures within the kernel, for instance the session, which sort of aggregate processes, but there isn't one in BSD or Linux at this point which allows you to place resource limits on them the way you can on a process. IRIX did have support like that: they had a job ID, and there could be a job limit. Solaris projects are sort of similar, but not quite the same: processes are part of a project, but it's not quite the same inherited relationship. And there typically aren't limits on things like bandwidth. There was a sort of bandwidth-limiting, nice(1)-style interface that I saw posted as a research project many years ago, I think in the 2.x days, where you could say "this process can have five megabits" or whatever, but I haven't really seen anything take off. That would be a pretty neat thing to have.
Actually, one other exception: on IRIX, again, the XFS file system supported guaranteed data rates on file handles. You could open a file and say "I need ten megabits read" or "ten megabits write" or whatever, and it would say okay or no, and then you could read and write, and it would do evil things at the file system layer in some cases, all to ensure that you got that streaming data rate.
So now I'm going to talk about what we've done. What we needed was a solution to handle a wide range of job types. Of the options we looked at, single-application or single-project clusters provide isolation that is essentially unparalleled, but in our environment we would probably have to virtualize in order to be efficient, in terms of being able to handle our job mix and the fact that our users tend to have spikes in their use on a large scale. For instance, the GPS people will show up and say "we need to run for a month", and then some indeterminate number of months later they'll do it again. For that sort of quick demand we would really need something virtualized, and then we'd have to pay the price of the overhead. And again, it doesn't handle small jobs well, and those are a large portion of our job mix: of the quarter million or so jobs we've run on our cluster, I would guess that more than half were submitted in batches of more than ten thousand; they just pop up. The other methods we looked at use resource limits. The nice thing there is that they achieve useful isolation, and they're implementable with either existing functionality or small extensions, so that's what we've been concentrating on.
We've also been doing some thinking about whether we could take those techniques and combine them with jails or related features, maybe bulking jails up to be more like Solaris Zones (or Containers, I think they're calling them this week). We're looking at that as well, to be able to provide per-user operating environments, potentially isolating users from upgrades: as we upgrade the kernel, users can continue using old images when they don't have time to rebuild their applications, and we can handle the updates in libraries and whatnot. These also have the potential to provide strong isolation for security purposes, which could be useful in the future.
The nice thing about these mechanisms is that the resource limits and partitioning scheme and the virtual private servers have very similar implementation requirements. Setup is a fair bit more expensive in the VPS case, but nonetheless they're fairly similar.
So, what we've been doing is we've taken Sun Grid Engine. We originally intended to actually extend Grid Engine and modify its daemons to do the work; what we ended up doing instead is realizing that we can specify an alternate program to run instead of the shepherd. The shepherd is the process that starts the script for each job on a given node; it collects usage, forwards signals to the children, and is also responsible for starting remote components. So a shepherd is started, and, traditionally, Grid Engine starts its own rsh daemon and jobs connect over that. These days it has its own mechanism, which is secure, not using the crufty old rsh code.
So we've implemented a wrapper script which allows a pre-command hook to run before the shepherd starts (a command wrapper: before we start the shepherd, we can run something like env(1), or true(1), or whatever, to set up the environment it runs in, or the cpuset setup I'll show later) and a post-command hook for cleanup. It's implemented in Ruby because I felt like it.
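The real wrapper is Ruby and has more plumbing, but the shape of it fits in a few lines of shell; the paths and hook names here are purely illustrative:

    #!/bin/sh
    # Stand-in for sge_shepherd: run a setup hook, then the real
    # shepherd, then a cleanup hook, preserving the exit status.
    [ -x /usr/local/etc/sge/pre_cmd ]  && /usr/local/etc/sge/pre_cmd
    /opt/sge/bin/sge_shepherd "$@"
    rc=$?
    [ -x /usr/local/etc/sge/post_cmd ] && /usr/local/etc/sge/post_cmd
    exit $rc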
The first thing we implemented was memory-backed temporary directories. The motivation for this is that we've had problems with users filling up /tmp on the nodes. The way we have the nodes configured, they do have disks, and most of the disk is available as /tmp. We had some cases, particularly early on, where users would fill up the disks and not delete anything (their job would crash, or they would forget to add cleanup code, or whatever), and then other jobs would fail strangely. You might expect you'd just get a nice error message, but, programmers being programmers, people don't do their error handling correctly. A number of libraries have issues here: for instance, the PVM library unexpectedly fails and reports a completely strange error if it can't create a file in /tmp, because it needs to create a Unix domain socket so it can talk to itself.
Now, it turns out that Sun Grid Engine actually creates a per-job temporary directory, typically under /tmp though you can change that, and points TMPDIR at that location. We've educated most of our users to use that location correctly, so they'll use that variable and create their files under $TMPDIR, and when the job exits, Grid Engine deletes the directory and it all gets cleaned up. The problem, of course, is that if multiple jobs are running on the same node at the same time, one of them can still fill /tmp.
The solution was pretty simple: the wrapper script, at the beginning of the job, creates a swap-backed md(4) file system of a user-requestable size, with a default, and mounts it there. This has a number of advantages. The biggest one, of course, is that it's fixed-size, so the user gets what they asked for, and once they run out of space, they run out of space; too bad, they should have asked for more. The other advantage is a side effect: now that we're running swap-backed memory file systems for temp, users who only use a fairly small amount of temp space should see vastly improved performance, because they're working in memory rather than writing to disk.
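What the pre-command hook boils down to is roughly the following; mdmfs(8) is the real mechanism, while the variable names are invented:

    # Mount a fixed-size, swap-backed file system over the job's
    # Grid-Engine-created temp directory. JOB_TMP_MB would come from
    # the job's resource request, falling back to a default.
    JOB_TMP_MB=${JOB_TMP_MB:-100}
    mdmfs -s "${JOB_TMP_MB}m" md "$TMPDIR"
    chown "$JOB_USER" "$TMPDIR"    # JOB_USER: the submitting user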
A quick example: we have a little job script here that prints TMPDIR and the amount of space available, and we submit the job with a request saying we want a hundred megabytes of temp space. (Here's the live demo.) If you look at the output, you can see that it does in fact create a memory file system.
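Reconstructing the demo, the job script and submission look roughly like this; the resource name memtmp is made up for illustration:

    #!/bin/sh
    # memdemo.sh: show where TMPDIR points and how big it is.
    echo "TMPDIR = $TMPDIR"
    df -h "$TMPDIR"

    # Request 100 MB of memory-backed temp space at submission:
    #   qsub -l memtmp=100M memdemo.sh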
I attempted to be clever and have the available space come out to roughly what the user asked for. The version I had when I first attempted this was not entirely accurate: trying to guess what all the UFS overhead would be, the result was not quite consistent, and I couldn't figure out an easy function for it. It does a better job than it did to start with, but it's not perfect. For today, though, that's a good enough fix, and we're going to deploy it pretty soon; it works pretty well. Sometimes, however, it's not enough.
The biggest issue is that there are badly designed programs all over the world that don't use TMPDIR like they're supposed to. (Inaudible question.) There are all these applications that still need /tmp itself, say during startup, that sort of thing, so we have problems with those. Realistically, we can't change all of them; it's just not going to happen. So we still have problems with people running out of resources.
So we feel that the most general solution is a per-job /tmp, virtualizing that portion of the file system namespace, and variant symlinks can do that. So we said, okay, let's give it a shot. Just to introduce the concept for people who aren't familiar with them: variant symlinks are basically symlinks that contain variables which are expanded at run time, which allows paths to be different for different processes. For example, you create some files, you create a symlink whose contents are a variable with a shell-style default value, and you get different results with different variable settings.
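A per-job /tmp might then look like the sketch below. The syntax just follows what's described in the talk (percent variables with shell-style defaults); it doesn't correspond to any shipped FreeBSD release, and the mechanism for setting the kernel-level variable is assumed:

    # Real directories, one per job, plus a shared fallback:
    mkdir -p /var/jobtmp/shared /var/jobtmp/1234

    # /tmp becomes a variant symlink with a default
    # (assuming the old /tmp has been moved aside):
    ln -s "/var/jobtmp/%{SGE_JOB_ID:-shared}" /tmp

    # A process whose kernel-level SGE_JOB_ID variable is 1234 resolves
    # /tmp to /var/jobtmp/1234; everything else falls back to
    # /var/jobtmp/shared.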
To talk about the implementation we've done: it's derived from the NetBSD implementation, and most of the data structures are identical. However, I've made a number of changes. The biggest one is that we took the concept of scopes and turned it entirely around. In NetBSD, there's a system scope, which is overridden by a user scope, and that by a process scope. The problem with that: think about just the system scope, and suppose you decide you want to do something clever, like a root file system where /lib points to different things for different architectures. That works quite nicely until the users come along and set their arch variable up for you. If you have, say, a setuid program and you don't code defensively, the obvious bad things happen. Obviously you would write your code not to do that, and I believe they did, but there's a whole class of problems where it's easy to screw up and do something wrong there, so by reversing the order we can reduce the risks. At the moment we don't have a user scope; to be honest, I just don't like the idea of a user scope. The problem is that you then have per-user state in the kernel that just sits around forever; you can never garbage collect it except administratively. That just doesn't seem like a great idea to me. And jail scope just hasn't been implemented, because it wasn't entirely clear what the semantics should be.
I also added default-variable support, again shell-style. To some extent that undoes the scope change, in that the default value becomes a sort of system scope which is overridden by everything, but there are cases where we need it: in particular, to implement a /tmp which varies, we have to do something like this, because /tmp needs to work even if the job variables aren't set. I also decided to use percent instead of dollar sign, to avoid confusion with shell variables, because these are a separate namespace, in the kernel; you can't do it the way Domain/OS did and do all the evaluation in user space, as that's a classic vulnerability, in the CVE database for instance. And we're not using @, to avoid confusion with AFS and with the NetBSD implementation, which does not allow user- or administratively-settable values.
I don't have any automatic variables, such as the %sys value, which is universally set in the NetBSD implementation, or the UID variable, which they also have. And it currently doesn't allow setting values in other processes; you can only set them in your own process and have them inherited. That may change, but one of my goals here, because there are subtle ways to make dumb mistakes and cause security vulnerabilities, has been to slim the feature set down to the point where you have some reasonable chance of not doing that if you start building systems on this for deployment.
The final area we've worked on moves away from the file system namespace and into CPU sets. Jeff Roberson implemented the cpuset functionality, which allows you to create a CPU set, put a process into it, and then set the affinity of that CPU set. By default, every process is in an anonymous CPU set inherited from its parent.
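From userland, cpuset(1) makes this straightforward; for example (the job name and PID are placeholders):

    # Create a new CPU set restricted to cores 6 and 7, run a job in it:
    cpuset -l 6,7 ./my_job

    # Restrict an existing process (here, PID 1234) to cores 4-7:
    cpuset -l 4-7 -p 1234

    # Show which CPUs a process may run on:
    cpuset -g -p 1234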
For a little background here: in a typical SGE configuration, every node has one slot per CPU. There are a number of other ways you can configure it; basically, a slot is something a job can run in, and a parallel job crosses slots, so it can be in more than one slot. For instance, with applications where the code tends to spend a fair bit of time waiting for I/O, you might want more than one slot per CPU, so two slots per core is not uncommon. But probably the most common configuration, and the one you get out of the box when you install Grid Engine, is one slot per CPU, and that's how we run, because we want users to have a whole CPU for whatever they want to do with it. So jobs are allocated one or more slots, depending on whether they're sequential or parallel jobs and how many they ask for. But this is just a convention; there's no actual connection between slots and CPUs, so it's quite possible to submit a non-parallel job that goes off, spawns a zillion threads, and sucks up all the CPUs on the whole system. In some early versions of Grid Engine there actually was support for tying slots to CPUs if you set it up that way. There was a sensible implementation for IRIX, and then things got weirder and weirder as people tried to implement it on other platforms, which had vastly different CPU binding semantics, and at this point it's entirely broken on every platform as far as I can tell. So we decided: okay, we've got this wrapper, let's see what we can do in terms of making things work.
We now have the wrapper store allocations in the file system, and we have a (not yet recursive) allocation algorithm: what we try to do is find the best-fitting set of adjacent cores, and if that doesn't work, we take the largest run and repeat until we've got enough slots. The goal is to minimize fragmentation. We haven't done any analysis to determine whether that's actually an appropriate algorithm, but offhand it seems fine, given that I thought about it over lunch.
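For the curious, the pass is roughly the following (the production version lives in the Ruby wrapper; this sh/awk rendering is my own sketch). It scans a free-core mask for the smallest run of adjacent free cores that fits, and otherwise falls back to the largest run:

    # allocate_cores "11110011" 2  ->  prints a cpuset(1)-style list,
    # here "6-7" (mask character i = 1 means core i-1 is free).
    allocate_cores() {
        echo "$1" | awk -v need="$2" '{
            n = length($0); runLen = 0; bestLen = 0; bigLen = 0
            for (i = 1; i <= n + 1; i++) {
                # sentinel "0" past the end flushes the final run
                c = (i <= n) ? substr($0, i, 1) : "0"
                if (c == "1") { runLen++; continue }
                start = i - runLen
                # best fit: smallest run that still holds the request
                if (runLen >= need && (bestLen == 0 || runLen < bestLen)) {
                    bestLen = runLen; bestStart = start
                }
                if (runLen > bigLen) { bigLen = runLen; bigStart = start }
                runLen = 0
            }
            if (bestLen > 0)
                printf "%d-%d\n", bestStart - 1, bestStart + need - 2
            else if (bigLen > 0)   # nothing fits whole; take the largest
                printf "%d-%d\n", bigStart - 1, bigStart + bigLen - 2
            # the caller repeats with the remainder until the request
            # is filled; production would also prefer high-numbered
            # cores to stay off CPU 0 (see below)
        }'
    }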
Should we need to do this on our other OSes: it turns out that the FreeBSD cpuset API and the Linux one differ only in very small details. They're semantically essentially identical, which is convenient, so converting between them is pretty straightforward. So I did a set of benchmarks to demonstrate the effectiveness of cpuset; they also happen to exercise the wrapper, but that's not really relevant here.
We used a little eight-core Intel Xeon box, running a 7.1 pre-release that had cpuset backported from 8.0 by John Bjorkman shortly before the release (well, not so shortly; it was supposed to be shortly before), and SGE 6.2. For a simple integer benchmark, we used an n-queens program: for instance, on an 8x8 board, place the 8 queens so they can't capture each other. It's a simple load benchmark. We ran a small version of the problem as our measured command; to generate load, we ran a larger version that runs for much longer.
Some results. For a baseline, the most interesting thing is that you see some variance, but it's not really very high; not surprising, since the program doesn't really do anything except consume CPU. There's really not much going on. In the next case, we've got seven load processes and a single test process running. We see things slow down slightly, and the standard deviation goes up a bit; there's a little deviation from baseline. The obvious explanation is that we're just context switching a bit more, because we no longer have CPUs that are doing nothing at all, and there's some extra load from the system as well, since the kernel has to run and background tasks have to run.
In this case, we have a badly behaved application: we now have eight load processes, which suck up all the CPUs, and then we try to run our measurement process. We see a substantial performance decrease, about in the range we would expect. Then, to see if we got any improvement, we fired it up with cpuset. The interesting thing here is that we see no statistically significant difference from the baseline case with seven load processes: if we use CPU sets, we don't see the variance, which is nice to know. We actually see a slight performance improvement, and a reduction in variance, so cpuset is actually improving performance even when we're not overloaded. And in the overloaded case it's the same, because the other processes are stuck on the other CPUs.
One interesting side note, actually: when I was doing some tests early on, I tried the baseline, and the baseline with cpuset, using the original algorithm, which grabbed CPU 0, and you saw a significant performance decline, because there's a lot of stuff that ends up running on CPU 0. That led to the quick observation that you want to allocate from the large numbers down, so that you use the CPUs which are not running the random processes that get stuck on zero (or, on some architectures, get all the interrupts), and avoid core 0 in particular.
So, some conclusions. I think we have a useful proof of concept. We're certainly going to be deploying the memory-backed temp stuff soon, and once we upgrade to 7 we'll definitely be deploying the CPU sets, since they improve performance in both the contended and the uncontended case.
In the future, we'd like to do some more work with the virtual private server stuff. In particular, it would be really interesting to be able to run different FreeBSD versions in jails, or to run, for instance, CentOS images in jails, since we're running CentOS on our Linux-based systems. There could be some really interesting things there: for instance, we could potentially DTrace Linux applications, which is never going to happen on native Linux. There's also another example: Paul Saab was doing some benchmarking recently, and relative to Linux on the same hardware he was seeing a three-and-a-half-times improvement in basic matrix multiplication on -CURRENT, because of the recently added superpages functionality, which vastly reduces the number of TLB entries needed. That sort of thing can apply even to our Linux-using population, and could give FreeBSD some real wins there.
I'd also like to look more at isolating users from kernel upgrades. One of the issues we've had is that when you do a version bump, we have users who depend on all sorts of libraries, and the vendors like to rev them and make stupid API-breaking changes fairly regularly. It would be nice if users could get all the benefits of kernel upgrades and then upgrade the rest of their environment at their leisure, so we're hoping to do that in the future as well.
We'd also like to see more limits on bandwidth-type resources. For instance, it's fairly easy to limit the bandwidth of the sockets a process owns, but it's hard to place a total limit on the network bandwidth used by a particular process: when almost all of our storage is on NFS, how do you classify that traffic without a fair bit of change to the kernel and somehow tagging it? It's an interesting challenge.
We'd also like to see someone implement something like the IRIX job ID, to allow the scheduler to just tag processes as part of a job. Currently, Grid Engine uses a clever but evil hack: they add an extra group to the process, from a range of groups set aside for the purpose. The group gets inherited, and users can't drop it, so that allows them to track the processes; but it's an ugly hack, and with the current limits on the number of groups it can become a real problem.
Actually, before I take questions, I do want to put in one quick point: if you think this is interesting, you live in the area, and you're looking for a job, we are trying to hire a few people. It's difficult to hire good people; we do have some openings, and we're looking for BSD people and general sysadmin people. So, questions?
(Inaudible question.) I would expect that to happen, but it's not something I've attempted to test. What I would really like is a topology-aware allocator, so that you can request "I want to share cache" or "I don't want to share cache", "I want to share memory bandwidth" or "I don't want to share memory bandwidth". Open MPI 1.3, on the Linux side, has a topology-aware wrapper for the CPU affinity functionality, something called PLPA, the Portable Linux Processor Affinity library, if that's what the acronym actually is. In essence, they have to work around the fact that there were three different kernel APIs for the same syscall for CPU affinity, because the vendors all did it themselves somehow; they use the same syscall number, but they're completely incompatible. When you first load the application, it calls the syscall and tries to figure out which one it is by what errors it returns depending on the arguments it's given. Completely evil. I think people should just port their API and have their library work, but we don't need to do that junk, because we didn't make that mistake. So I'd like to see the topology-aware stuff in particular.
(Inaudible question.) The trick is that it's fairly easy to limit application bandwidth; it becomes more difficult when your interfaces are shared between application traffic and, say, NFS. Classifying that is going to be trickier: you'd have to tag the traffic and add a fair bit of code to trace that down through the kernel. It's certainly doable, though.
(inaudible question)
I have contemplated doing just that
or in fact the other thing we consider doing
more as a research project than is a practical thing
would be actually how
would be
independent VLANs
because then we could do
things like
give each process a VLAN they couldn't even
share at the internet layer
once the images’ in place for instance we will be able to do that
and that say you know you've got your interfaces it’s yours whatever
but then we could limit it
we could rate limit that at the kernel we can also have
we’d have a physically isolated we’d have a logically isolated network as well
with some of the latest switches we could actually rate limit
at the switch as well
(Inaudible questions.) To the first question: we don't run multiple sensitivity levels of data on these clusters; it's an unclassified cluster, so we've avoided that problem by not allowing it. It is a real issue; it's just not one we've had to deal with. In practice, stuff that's sensitive has handling requirements such that you can't touch the same hardware without a scrub, so you'd need a very coarse granularity and a ridiculously aggressive re-imaging process that removed all of the data. If I were to do that, I would probably get rid of the disks and just go diskless; that would also get rid of my number-one failure case. That would be pretty good, but we haven't done it.
On NFS failures: we've had occasional problems with NFS overloading, but no real problems. We're all on a local network, fairly tightly contained, so we haven't had problems with the server going down for extended periods and causing everything to hang. It's been more the issue that Panasas describes as incast: you can take out any NFS server. We had the BlueArc guys come in with their FPGA-based stuff with multiple ten-gig links, and I said, you know, I've got to try this, and they said "can we not try this with your whole cluster?", because if you've got three hundred and fifty gigabit Ethernet interfaces going into the system, even ten-gig you can saturate pretty trivially. At that level there's an inherent problem: if we need to handle that kind of bandwidth, we've got to get a parallel or cluster file system, or for the streaming stuff we could go via a SAN or something.
Anyone else? Thank you, everyone. (Applause.)