I do apologize for the EuroBSDCon slides; I'd redone the title page and made some changes to the slides, and they didn't make it through for approval by this afternoon. Okay.
So I'm going to be talking about isolating jobs for performance and predictability in clusters. Before I get into that, I want to talk a little bit about who we are and what our problem space is like, because that has an effect on our solution space.
I work for The Aerospace Corporation. We operate a federally funded research and development center in the area of national security space. In particular, we work with the Air Force Space and Missile Systems Center and with the National Reconnaissance Office, and our engineers support a wide variety of activities within that area.
We have a bit over twenty-four hundred engineers, in virtually every discipline. As you would expect, we have our rocket scientists; we have people who build satellites; we have people who build the sensors that go on satellites, and people who study the sorts of things you see when you use those sensors. We also have civil engineers, electronic engineers, and computer-process people. We literally do everything related to space, and all sorts of things you might not expect to be related to space; for instance, we also help build ground systems, because satellites aren't very useful if there isn't anything to talk to them.
Since these engineers are solving all these different problems, we have engineering applications of virtually every size you can think of, ranging from little spreadsheet things that you might not think of as engineering applications (but they are), to MATLAB programs or a lot of C code, traditional serial code, and then large parallel applications: either in-house things, genetic algorithms and that sort of thing, or the classic parallel code you'd run on a Cray or something, material simulation or fluid flow.
So we have this big application space. I wanted to give a little introduction to that, because it does come back and influence the sort of solutions we look at. (We skipped a slide; there we are, that's a little better.)
Now, what I'm interested in: I do high performance computing at the company, and I provide high performance computing resources to our users as part of my role in our technical computing services organization.
Our primary resource at this point is the Fellowship cluster, named for the Fellowship of the Ring. It's eleven racks of nodes around the core systems, and over here there's a large Cisco switch; actually, today there are two Cisco 6509s, because we couldn't get the port density we wanted otherwise. It's primarily a gigabit Ethernet system, and it runs FreeBSD, currently 6.0 because we haven't upgraded it yet. We're planning to move probably to 7.1, or maybe slightly past 7.1 if we want to get the latest hwpmc changes in.
We use the Sun Grid Engine scheduler, which is one of the two main options for open source resource managers on clusters, the other being the TORQUE and Maui combination from Cluster Resources.
We also have storage: 40 TB, though that's really the raw number on a Sun Thumper; it's about 32 TB usable once you start using RAID-Z2, since you might actually like to still have your data should a disk fail, and with today's disks RAID 5 doesn't really cut it. We also have some other resources coming on, two smaller clusters (unfortunately probably running Linux) and some SMPs, but I'm going to be concentrating here on the work we're doing on our FreeBSD-based cluster.
First of all, I want to talk about why we want to share resources; it should be fairly obvious, but I'll talk about it a little bit. Then, what goes wrong when you start sharing resources. After that, I'll talk about some different solutions to those problems, and some fairly trivial experiments we've done so far in terms of enhancing the scheduler and using operating system features to mitigate those problems, and then conclude with some future work.
Obviously, if you have a resource the size of our cluster, roughly fourteen hundred cores, you probably want to share it. Unless you purpose-built it for a single application, you're going to want your users sharing it, and you don't want to just say "you get it on Monday"; that's probably not a very effective option, especially not when we have as many users as we do. We also can't afford to buy another one every time a user shows up. One of our senior VPs said a while back that we could probably afford to buy just about anything we could need, once; we can't buy ten of them, though. If we really, really needed it, dropping small numbers of millions of dollars on computing resources wouldn't be impossible, but we can't just have every engineer who wants one call up Dell and say "ship me ten racks". It's not going to work.
The other thing is that we need to provide quick turnaround for some users, so we can't have one user hogging the system until they're done before the next one can run. We have users who'll come in and say "I need to run for three months", and we've had users come in and literally run, pretty much using the entire system, for three months. So we've had to provide some ability for other users to still get their work done; we do have to have some sharing.
However, when you start to share any resource like this, you start getting contention: users need the same thing at the same time, they fight back and forth for it, and they can't get what they want, so you have to balance them a bit. Also, some jobs lie when they request resources, and actually need more than they ask for, which can cause problems: we schedule them, we say "you're going to fit here fine", and they run off and use more than they said. If we don't have a mechanism to constrain them, we have problems.
Likewise, once these users start to contend, it doesn't just result in the jobs taking longer in terms of wall clock time because they're running slowly; there's overhead related to that contention. They get swapped out due to pressure on various systems. If you run out of memory, for instance, you go into swap, and you end up wasting all your cycles pulling junk in and out of disk, wasting your bandwidth on that. So there are resource costs to the contention, not merely a delay in returning results.
Now I'm going to switch gears and talk a little bit about different solutions to these contention issues, and look at different ways of solving the problem. Most of these are things that have already been done, but I want to go through the different approaches and then evaluate them in our context.
A classic solution to the problem is gang scheduling. It's basically conventional Unix process context switching written really big: you have your parallel job running on a system, it runs for a while, and after a certain amount of time you kick it off all the nodes and let the next one come in. Typically, when people do this, they do it on the order of hours, because the context switch time is extremely high. It's not just like swapping a process in and out: you suddenly have to coordinate the context switch across all your processes. If you're running, say, MPI over TCP, you actually need to tear down the TCP sessions, because you can't just have TCP timers sitting around, that sort of thing. So there's a lot of overhead associated with this; you take a long context switch. If all of your infrastructure supports it, it's fairly effective, and it does allow jobs to avoid interfering with each other, which is nice; you don't have issues, because you're typically allocating whole swaths of the system. And for properly written applications, partial results can be returned, which for some of our users is really important. Where you're doing a refinement, users want to look at the results and say: okay, is this just going off into the weeds, or does it look like it's actually converging on some sort of useful solution? They don't want to just wait until the end.
The downside, of course, is that the context switch costs are very high, and, most importantly, there's a real lack of useful implementations. A number of platforms have implemented this in the past, but in practice, on modern clusters built on commodity hardware with communication libraries written on standard protocols, the tools just aren't there, so it's not very practical.
It also doesn't make a lot of sense for small jobs. One of the things we've found is that we have users with embarrassingly parallel problems, where they need to look at, say, twenty thousand studies. They could write something that looked more like a conventional parallel application, where they wrote a scheduler, set up MPI (the Message Passing Interface), and handed out tasks to pieces of their job. But then they would be running a scheduler, and they would probably do a bad job of it; it turns out to be fairly difficult to do right, even in the trivial case. So what they do instead is just submit twenty thousand jobs to Grid Engine and say: okay, whatever, deal with it. In earlier versions that might have been a problem; current versions of the code easily handle a million jobs, so it's not really a big deal.
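Just to show the mechanics, an array-job submission like that is a one-liner in SGE; the script and program names here are made up:

    # Submit a single array job with 20,000 tasks; Grid Engine queues
    # and tracks them, so the user never writes their own scheduler.
    qsub -t 1-20000 sweep.sh

    # Inside sweep.sh, each task picks its own study from $SGE_TASK_ID:
    #   ./run_study "input.${SGE_TASK_ID}"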
But those sorts of users wouldn't fit well into a gang-scheduled environment, at least not a conventional one where you do gang scheduling at the granularity of jobs, so from that perspective it wouldn't work very well. If you have all the pieces in place and you're running big parallel applications, though, it is in fact an extremely effective approach.
Another option, which is sort of related (in fact, it takes an even coarser granularity), is single-application or single-project clusters or sub-clusters. This is used at some national labs, where you're given a cycle allocation for a year based on your grant proposals, and what your cycle allocation actually comes to you as is: here's your cluster, here's a frontend, here's this chunk of nodes, they're yours, go to it. Install your own OS, whatever you want; it's yours.
At a somewhat finer scale, there are things like Emulab, the network emulation system, which also does OS install and configuration, so you could do dynamic allocation that way; Sun's Project Hedeby (actually, I think the productized version is now called Service Domain Manager); or Clusters on Demand, who were actually talking about web hosting clusters. These are things that allow rapid deployment at a more granular level than the allocate-once-a-year approach, but nonetheless let you give people whole clusters to work with.
One nice thing about it is that the isolation between the processes is complete, so you don't have to worry about users stomping on each other. It's their own system; they can trash it all they want. If they flood the network or run the nodes into swap, well, that's their problem. It also has the advantage that you can tailor the images on the nodes, the operating systems, to meet the exact needs of the application.
The downside, of course, is its coarse granularity; in our environment that doesn't work very well, since we do have all these different types of jobs. Context switches are also pretty expensive, certainly on the order of minutes; Emulab typically claims something like ten minutes. There are some systems out there, for instance if you use coreboot (I think that's what they're calling it today; it used to be LinuxBIOS), where you can actually deploy a system in tens of seconds, mostly by getting rid of all that junk the BIOS writers wrote; the OS boots pretty fast if you don't have all that stuff to waylay you. But in practice, on off-the-shelf hardware, the context switch times are quite high. Users can, of course, still interfere with themselves. You can argue that's not a problem, but ideally you would like to prevent it.
One of the things I have to deal with is that my users are almost universally not trained as computer scientists or programmers. They're trained in their domain area, and they're really good in that area, but their concepts of the way hardware and software work don't match reality in many cases.
(Inaudible question.) It's pretty rare in practice. I've heard of one lab that does it significantly, but they do it on a yearly allocation basis and throw the hardware away after two or three years. You do typically have some sort of deployment system in place, and in those cases your application usually comes with "here's what we're going to spend, on this many people, on this project", so it's a big resource allocation.
One other issue with this is that there's no real easy way to capture underutilized resources. For example, say you have an application which is single-threaded and uses a ton of memory, running on a machine; the machines we're buying these days are eight-core, so that's wasting a lot of CPU cycles, just generating a lot of heat doing nothing. Ideally, you would like a scheduler that says: okay, you're using seven of the eight gigabytes of RAM, but we've got these jobs sitting here that need next to nothing, a hundred megabytes, so we'll slap seven of those in along with the big job and backfill. In this mechanism there's no good way to do that. Obviously, if one user has that application mix, they can do it themselves, but it's not something where we can easily bring in more jobs and have a mix that takes advantage of the different resources.
A related approach is to install virtualization software on the equipment. This is the essence of what cloud computing is at the moment: it's Amazon providing Xen hosting for relatively arbitrary OS images. It does have the advantage of allowing rapid deployment, and, in theory, if your application is scalable, it provides for extremely high scalability, particularly if you aren't us and can therefore use somebody else's hardware. In our applications' case that's not very practical, so we can't do that.
It also has the advantage that people can have their own image, which is tightly resource-constrained, and you can run more than one of them on a node. For instance, you can give one job four cores and another job two cores, and have a couple of single-core jobs. In theory, you can get fairly strong isolation there. Obviously there are shared resources underneath, and you probably can't afford to completely isolate, say, network bandwidth at the bottom layer; you can do some, but if you go overboard you can spend all your time on accounting.
You can also, again, tailor the images to the job, and in this environment you can do that even more strongly than in the sub-cluster approach, in that you can often run a five-year-old or ten-year-old operating system if you're using full virtualization. That can allow obsolete code with weird baselines to work, which is important in our space, because our average project runs ten years or more, and as a result you might have to go rerun a program that was written way back on some ancient version of Windows or whatever.
It also provides the ability to recover resources, as I was talking about before, which you can't do easily with sub-clusters: here you can just slip another image onto the node, say "you can use anything left over", and essentially give that image idle priority.
The downside, of course, is that the isolation is incomplete, in that there is shared hardware. You're not likely to find, I don't think, any virtualization system out there right now that virtualizes your segment of memory bandwidth or your share of cache space, so users can in fact interfere with themselves and each other in this environment.
It's also not really efficient for small jobs; the cost of running an entire OS for every job is fairly high. Even with relatively light Unix-like OSes, you're still looking at a couple hundred megabytes in practice once you get everything up and running, unless you run something totally stripped down. There's significant overhead. There's CPU slowdown: typical estimates are in the twenty percent range, with numbers really ranging from five percent to fifty percent depending on exactly what you're doing, possibly even lower, or higher. And because you have the whole OS there, there's a lot of duplicated stuff. The various vendors have their answers; they claim "we can merge that: you're running the same kernel, so we'll share the memory", but at some level it's all going to get duplicated.
A related option comes from the Internet hosting industry, which is the technology from virtual private servers. The example everyone here is probably familiar with is jails, where you can provide your own file system root, your own network interface, and whatnot. The nice thing about this is that, unlike full virtualization, the overhead is very small: basically it costs you an entry in your process table and an entry in a few structures. There are some extra tests in the kernel, but otherwise there's not a huge overhead for the virtualization, and you don't need an extra kernel for every image. So you see the difference here: you might be able to squeeze two hundred VMware images onto a machine (the VMware people say "no, no, don't do that", but we have machines running nearly that many), while on the other hand there are people out there who run thousands of virtual hosts on a single machine using this technique. That's a big difference in resource use, especially in the lightly loaded case. In our environment we're looking at running a very small number of them per node, but the overhead difference is still significant.
You still have some ability to tailor the images to a job's needs. You could have a custom root: for instance, you could be running FreeBSD 6.0 in one virtual server and 7.0 in another. You have to be running a 7.0 or 8.0 kernel to make that work, of course, but it allows you to do that. In principle, we can also do evil things like a 64-bit kernel with 32-bit user spaces, for when you have applications or libraries you can't find the source to anymore; there are interesting possibilities there. And the other nice thing is that, since you're doing a very lightweight and incomplete virtualization, you don't have to virtualize the things you don't care about, so you don't have the overhead of virtualizing everything.
The downsides, of course, are incomplete isolation (you are running processes on the same kernel, and they can interfere with each other) and dubious flexibility; obviously, I don't think anyone should have the ability to run Windows in a jail. There's some NetBSD support, but I don't think it's really gotten to that point.
One final area, which diverges a bit from these, is the classic Unix solution to the problem on a single machine: using existing resource limits and resource partitioning techniques. For example, all Unix-like systems have per-process resource limits, and cluster schedulers typically support the common ones, so you can set a memory limit or a CPU time limit on your process, and the schedulers typically provide at least launch-time support for those limits on a given set of processes that's part of a job. There are also a number of forms of resource partitioning available as standard features. Memory disks are one of them: if you want to create a file system space that's limited in size, create a memory disk and back it with a file or with swap, partitioning disk use. And there are techniques like CPU affinities, where you can lock processes to a single processor or a set of processors, so they can't interfere with processes running on other processors.
The nice thing about this, first, is that you're using existing facilities, so you don't have to write lots of new features for a niche application, and they tend to integrate well with existing schedulers; in many cases parts of them are already implemented. In fact, the experiments I'll talk about later all use this type of technique.
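As a flavor of what's already there, here's a minimal sh sketch of these facilities; the sizes and core numbers are invented for illustration:

    # Per-process resource limits, e.g. from a job prologue:
    ulimit -t 3600          # at most an hour of CPU time
    ulimit -d 2097152       # data segment capped at 2 GB (value in KB)

    # Resource partitioning: a fixed-size, swap-backed file system:
    mdmfs -s 512m md /scratch

    # CPU affinity: lock this shell and its children to cores 4-7
    # (FreeBSD 7.1+ cpuset(1)):
    cpuset -l 4-7 -p $$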
The cons are, of course, incomplete isolation again, and that there's typically no unified framework for the concept of a job, where a job is composed of a set of processes. There are a number of data structures within the kernel, for instance the session, which sort of aggregate processes, but there isn't one in BSD or Linux at this point which allows you to place resource limits on them the way you can on a process. IRIX did have support like that: they had a job ID, and there could be a job limit. Solaris projects are sort of similar, but not quite the same: processes are part of a project, but it's not quite the same inherited relationship. And there typically aren't limits on things like bandwidth. There was a sort of bandwidth-limiting, nice(1)-style interface that I saw posted as a research project many years ago, I think in the 2.x days, where you could say "this process can have five megabits" or whatever, but I haven't really seen anything take off. That would be a pretty neat thing to have.
Actually, one other exception: on IRIX, again, the XFS file system supported guaranteed data rates on file handles. You could open a file and say "I need ten megabits read" or "ten megabits write" or whatever, and it would say okay or no, and then you could read and write, and it would do evil things at the file system layer in some cases, all to ensure that you got that streaming data rate.
So now I'm going to talk about what we've done. What we needed was a solution to handle a wide range of job types. Of the options we looked at, single-application or single-project clusters provide isolation that is essentially unparalleled, but in our environment we would probably have to virtualize in order to be efficient, in terms of being able to handle our job mix and the fact that our users tend to have spikes in their use on a large scale. For instance, the GPS people will show up and say "we need to run for a month", and then some indeterminate number of months later they'll do it again. For that sort of quick demand we would really need something virtualized, and then we'd have to pay the price of the overhead. And again, it doesn't handle small jobs well, and those are a large portion of our job mix: of the quarter million or so jobs we've run on our cluster, I would guess that more than half were submitted in batches of more than ten thousand; they just pop up. The other methods we looked at use resource limits. The nice thing there is that they achieve useful isolation, and they're implementable with either existing functionality or small extensions, so that's what we've been concentrating on.
We've also been doing some thinking about whether we could take those techniques and combine them with jails or related features, maybe bulking jails up to be more like Solaris Zones (or Containers, I think they're calling them this week). We're looking at that as well, to be able to provide per-user operating environments, potentially isolating users from upgrades: as we upgrade the kernel, users can continue using old images when they don't have time to rebuild their applications, and we can handle the updates in libraries and whatnot. These also have the potential to provide strong isolation for security purposes, which could be useful in the future.
The nice thing about these mechanisms is that the resource limits and partitioning scheme and the virtual private servers have very similar implementation requirements. Setup is a fair bit more expensive in the VPS case, but nonetheless they're fairly similar.
So, what we've been doing is we've taken Sun Grid Engine. We originally intended to actually extend Grid Engine and modify its daemons to do the work; what we ended up doing instead is realizing that we can specify an alternate program to run instead of the shepherd. The shepherd is the process that starts the script for each job on a given node; it collects usage, forwards signals to the children, and is also responsible for starting remote components. So a shepherd is started, and, traditionally, Grid Engine starts its own rsh daemon and jobs connect over that. These days it has its own mechanism, which is secure, not using the crufty old rsh code.
So we've implemented a wrapper script which allows a pre-command hook to run before the shepherd starts (a command wrapper: before we start the shepherd, we can run something like env(1), or true(1), or whatever, to set up the environment it runs in, or the cpuset setup I'll show later) and a post-command hook for cleanup. It's implemented in Ruby because I felt like it.
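The real wrapper is Ruby and has more plumbing, but the shape of it fits in a few lines of shell; the paths and hook names here are purely illustrative:

    #!/bin/sh
    # Stand-in for sge_shepherd: run a setup hook, then the real
    # shepherd, then a cleanup hook, preserving the exit status.
    [ -x /usr/local/etc/sge/pre_cmd ]  && /usr/local/etc/sge/pre_cmd
    /opt/sge/bin/sge_shepherd "$@"
    rc=$?
    [ -x /usr/local/etc/sge/post_cmd ] && /usr/local/etc/sge/post_cmd
    exit $rc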
The first thing we implemented was memory-backed temporary directories. The motivation for this is that we've had problems with users filling up /tmp on the nodes. The way we have the nodes configured, they do have disks, and most of the disk is available as /tmp. We had some cases, particularly early on, where users would fill up the disks and not delete anything (their job would crash, or they would forget to add cleanup code, or whatever), and then other jobs would fail strangely. You might expect you'd just get a nice error message, but, programmers being programmers, people don't do their error handling correctly. A number of libraries have issues here: for instance, the PVM library unexpectedly fails and reports a completely strange error if it can't create a file in /tmp, because it needs to create a Unix domain socket so it can talk to itself.
Now, it turns out that Sun Grid Engine actually creates a per-job temporary directory, typically under /tmp though you can change that, and points TMPDIR at that location. We've educated most of our users to use that location correctly, so they'll use that variable and create their files under $TMPDIR, and when the job exits, Grid Engine deletes the directory and it all gets cleaned up. The problem, of course, is that if multiple jobs are running on the same node at the same time, one of them can still fill /tmp.
The solution was pretty simple: the wrapper script, at the beginning of the job, creates a swap-backed md(4) file system of a user-requestable size, with a default, and mounts it there. This has a number of advantages. The biggest one, of course, is that it's fixed-size, so the user gets what they asked for, and once they run out of space, they run out of space; too bad, they should have asked for more. The other advantage is a side effect: now that we're running swap-backed memory file systems for temp, users who only use a fairly small amount of temp space should see vastly improved performance, because they're working in memory rather than writing to disk.
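What the pre-command hook boils down to is roughly the following; mdmfs(8) is the real mechanism, while the variable names are invented:

    # Mount a fixed-size, swap-backed file system over the job's
    # Grid-Engine-created temp directory. JOB_TMP_MB would come from
    # the job's resource request, falling back to a default.
    JOB_TMP_MB=${JOB_TMP_MB:-100}
    mdmfs -s "${JOB_TMP_MB}m" md "$TMPDIR"
    chown "$JOB_USER" "$TMPDIR"    # JOB_USER: the submitting user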
A quick example: we have a little job script here that prints TMPDIR and the amount of space available, and we submit the job with a request saying we want a hundred megabytes of temp space. (Here's the live demo.) If you look at the output, you can see that it does in fact create a memory file system.
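Reconstructing the demo, the job script and submission look roughly like this; the resource name memtmp is made up for illustration:

    #!/bin/sh
    # memdemo.sh: show where TMPDIR points and how big it is.
    echo "TMPDIR = $TMPDIR"
    df -h "$TMPDIR"

    # Request 100 MB of memory-backed temp space at submission:
    #   qsub -l memtmp=100M memdemo.sh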
I attempted to be clever and have the available space come out to roughly what the user asked for. The version I had when I first attempted this was not entirely accurate: trying to guess what all the UFS overhead would be, the result was not quite consistent, and I couldn't figure out an easy function for it. It does a better job than it did to start with, but it's not perfect. For today, though, that's a good enough fix, and we're going to deploy it pretty soon; it works pretty well. Sometimes, however, it's not enough.
The biggest issue is that there are badly designed programs all over the world that don't use TMPDIR like they're supposed to. (Inaudible question.) There are all these applications that still need /tmp itself, say during startup, that sort of thing, so we have problems with those. Realistically, we can't change all of them; it's just not going to happen. So we still have problems with people running out of resources.
So we feel that the most general solution is a per-job /tmp, virtualizing that portion of the file system namespace, and variant symlinks can do that. So we said, okay, let's give it a shot. Just to introduce the concept for people who aren't familiar with them: variant symlinks are basically symlinks that contain variables which are expanded at run time, which allows paths to be different for different processes. For example, you create some files, you create a symlink whose contents are a variable with a shell-style default value, and you get different results with different variable settings.
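A per-job /tmp might then look like the sketch below. The syntax just follows what's described in the talk (percent variables with shell-style defaults); it doesn't correspond to any shipped FreeBSD release, and the mechanism for setting the kernel-level variable is assumed:

    # Real directories, one per job, plus a shared fallback:
    mkdir -p /var/jobtmp/shared /var/jobtmp/1234

    # /tmp becomes a variant symlink with a default
    # (assuming the old /tmp has been moved aside):
    ln -s "/var/jobtmp/%{SGE_JOB_ID:-shared}" /tmp

    # A process whose kernel-level SGE_JOB_ID variable is 1234 resolves
    # /tmp to /var/jobtmp/1234; everything else falls back to
    # /var/jobtmp/shared.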
To talk about the implementation we've done: it's derived from the NetBSD implementation, and most of the data structures are identical. However, I've made a number of changes. The biggest one is that we took the concept of scopes and turned it entirely around. In NetBSD, there's a system scope, which is overridden by a user scope, and that by a process scope. The problem with that: think about just the system scope, and suppose you decide you want to do something clever, like a root file system where /lib points to different things for different architectures. That works quite nicely until the users come along and set their arch variable up for you. If you have, say, a setuid program and you don't code defensively, the obvious bad things happen. Obviously you would write your code not to do that, and I believe they did, but there's a whole class of problems where it's easy to screw up and do something wrong there, so by reversing the order we can reduce the risks. At the moment we don't have a user scope; to be honest, I just don't like the idea of a user scope. The problem is that you then have per-user state in the kernel that just sits around forever; you can never garbage collect it except administratively. That just doesn't seem like a great idea to me. And jail scope just hasn't been implemented, because it wasn't entirely clear what the semantics should be.
I also added default-variable support, again shell-style. To some extent that undoes the scope change, in that the default value becomes a sort of system scope which is overridden by everything, but there are cases where we need it: in particular, to implement a /tmp which varies, we have to do something like this, because /tmp needs to work even if the job variables aren't set. I also decided to use percent instead of dollar sign, to avoid confusion with shell variables, because these are a separate namespace, in the kernel; you can't do it the way Domain/OS did and do all the evaluation in user space, as that's a classic vulnerability, in the CVE database for instance. And we're not using @, to avoid confusion with AFS and with the NetBSD implementation, which does not allow user- or administratively-settable values.
I don't have any automatic variables, such as the %sys value, which is universally set in the NetBSD implementation, or the UID variable, which they also have. And it currently doesn't allow setting values in other processes; you can only set them in your own process and have them inherited. That may change, but one of my goals here, because there are subtle ways to make dumb mistakes and cause security vulnerabilities, has been to slim the feature set down to the point where you have some reasonable chance of not doing that if you start building systems on this for deployment.
The final area we've worked on moves away from the file system namespace and into CPU sets. Jeff Roberson implemented the cpuset functionality, which allows you to create a CPU set, put a process into it, and then set the affinity of that CPU set. By default, every process is in an anonymous CPU set inherited from its parent.
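From userland, cpuset(1) makes this straightforward; for example (the job name and PID are placeholders):

    # Create a new CPU set restricted to cores 6 and 7, run a job in it:
    cpuset -l 6,7 ./my_job

    # Restrict an existing process (here, PID 1234) to cores 4-7:
    cpuset -l 4-7 -p 1234

    # Show which CPUs a process may run on:
    cpuset -g -p 1234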
For a little background here: in a typical SGE configuration, every node has one slot per CPU. There are a number of other ways you can configure it; basically, a slot is something a job can run in, and a parallel job crosses slots, so it can be in more than one slot. For instance, with applications where the code tends to spend a fair bit of time waiting for I/O, you might want more than one slot per CPU, so two slots per core is not uncommon. But probably the most common configuration, and the one you get out of the box when you install Grid Engine, is one slot per CPU, and that's how we run, because we want users to have a whole CPU for whatever they want to do with it. So jobs are allocated one or more slots, depending on whether they're sequential or parallel jobs and how many they ask for. But this is just a convention; there's no actual connection between slots and CPUs, so it's quite possible to submit a non-parallel job that goes off, spawns a zillion threads, and sucks up all the CPUs on the whole system. In some early versions of Grid Engine there actually was support for tying slots to CPUs if you set it up that way. There was a sensible implementation for IRIX, and then things got weirder and weirder as people tried to implement it on other platforms, which had vastly different CPU binding semantics, and at this point it's entirely broken on every platform as far as I can tell. So we decided: okay, we've got this wrapper, let's see what we can do in terms of making things work.
We now have the wrapper store allocations in the file system, and we have a (not yet recursive) allocation algorithm: what we try to do is find the best-fitting set of adjacent cores, and if that doesn't work, we take the largest run and repeat until we've got enough slots. The goal is to minimize fragmentation. We haven't done any analysis to determine whether that's actually an appropriate algorithm, but offhand it seems fine, given that I thought about it over lunch.
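For the curious, the pass is roughly the following (the production version lives in the Ruby wrapper; this sh/awk rendering is my own sketch). It scans a free-core mask for the smallest run of adjacent free cores that fits, and otherwise falls back to the largest run:

    # allocate_cores "11110011" 2  ->  prints a cpuset(1)-style list,
    # here "6-7" (mask character i = 1 means core i-1 is free).
    allocate_cores() {
        echo "$1" | awk -v need="$2" '{
            n = length($0); runLen = 0; bestLen = 0; bigLen = 0
            for (i = 1; i <= n + 1; i++) {
                # sentinel "0" past the end flushes the final run
                c = (i <= n) ? substr($0, i, 1) : "0"
                if (c == "1") { runLen++; continue }
                start = i - runLen
                # best fit: smallest run that still holds the request
                if (runLen >= need && (bestLen == 0 || runLen < bestLen)) {
                    bestLen = runLen; bestStart = start
                }
                if (runLen > bigLen) { bigLen = runLen; bigStart = start }
                runLen = 0
            }
            if (bestLen > 0)
                printf "%d-%d\n", bestStart - 1, bestStart + need - 2
            else if (bigLen > 0)   # nothing fits whole; take the largest
                printf "%d-%d\n", bigStart - 1, bigStart + bigLen - 2
            # the caller repeats with the remainder until the request
            # is filled; production would also prefer high-numbered
            # cores to stay off CPU 0 (see below)
        }'
    }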
Should we need to do this on our other OSes: it turns out that the FreeBSD cpuset API and the Linux one differ only in very small details. They're semantically essentially identical, which is convenient, so converting between them is pretty straightforward. So I did a set of benchmarks to demonstrate the effectiveness of cpuset; they also happen to exercise the wrapper, but that's not really relevant here.
We used a little eight-core Intel Xeon box, running a 7.1 pre-release that had cpuset backported from 8.0 by John Bjorkman shortly before the release (well, not so shortly; it was supposed to be shortly before), and SGE 6.2. For a simple integer benchmark, we used an n-queens program: for instance, on an 8x8 board, place the 8 queens so they can't capture each other. It's a simple load benchmark. We ran a small version of the problem as our measured command; to generate load, we ran a larger version that runs for much longer.
Some results. For a baseline, the most interesting thing is that you see some variance, but it's not really very high; not surprising, since the program doesn't really do anything except consume CPU. There's really not much going on. In the next case, we've got seven load processes and a single test process running. We see things slow down slightly, and the standard deviation goes up a bit; there's a little deviation from baseline. The obvious explanation is that we're just context switching a bit more, because we no longer have CPUs that are doing nothing at all, and there's some extra load from the system as well, since the kernel has to run and background tasks have to run.
In this case, we have a badly behaved application: we now have eight load processes, which suck up all the CPUs, and then we try to run our measurement process. We see a substantial performance decrease, about in the range we would expect. Then, to see if we got any improvement, we fired it up with cpuset. The interesting thing here is that we see no statistically significant difference from the baseline case with seven load processes: if we use CPU sets, we don't see the variance, which is nice to know. We actually see a slight performance improvement, and a reduction in variance, so cpuset is actually improving performance even when we're not overloaded. And in the overloaded case it's the same, because the other processes are stuck on the other CPUs.
One interesting side note, actually: when I was doing some tests early on, I tried the baseline, and the baseline with cpuset, using the original algorithm, which grabbed CPU 0, and you saw a significant performance decline, because there's a lot of stuff that ends up running on CPU 0. That led to the quick observation that you want to allocate from the large numbers down, so that you use the CPUs which are not running the random processes that get stuck on zero (or, on some architectures, get all the interrupts), and avoid core 0 in particular.
So, some conclusions. I think we have a useful proof of concept. We're certainly going to be deploying the memory-backed temp stuff soon, and once we upgrade to 7 we'll definitely be deploying the CPU sets, since they improve performance in both the contended and the uncontended case.
In the future, we'd like to do some more work with the virtual private server stuff. In particular, it would be really interesting to be able to run different FreeBSD versions in jails, or to run, for instance, CentOS images in jails, since we're running CentOS on our Linux-based systems. There could be some really interesting things there: for instance, we could potentially DTrace Linux applications, which is never going to happen on native Linux. There's also another example: Paul Saab was doing some benchmarking recently, and relative to Linux on the same hardware he was seeing a three-and-a-half-times improvement in basic matrix multiplication on -CURRENT, because of the recently added superpages functionality, which vastly reduces the number of TLB entries needed. That sort of thing can apply even to our Linux-using population, and could give FreeBSD some real wins there.
I'd also like to look more at isolating users from kernel upgrades. One of the issues we've had is that when you do a version bump, we have users who depend on all sorts of libraries, and the vendors like to rev them and make stupid API-breaking changes fairly regularly. It would be nice if users could get all the benefits of kernel upgrades and then upgrade the rest of their environment at their leisure, so we're hoping to do that in the future as well.
We'd also like to see more limits on bandwidth-type resources. For instance, it's fairly easy to limit the bandwidth of the sockets a process owns, but it's hard to place a total limit on the network bandwidth used by a particular process: when almost all of our storage is on NFS, how do you classify that traffic without a fair bit of change to the kernel and somehow tagging it? It's an interesting challenge.
We'd also like to see someone implement something like the IRIX job ID, to allow the scheduler to just tag processes as part of a job. Currently, Grid Engine uses a clever but evil hack: they add an extra group to the process, from a range of groups set aside for the purpose. The group gets inherited, and users can't drop it, so that allows them to track the processes; but it's an ugly hack, and with the current limits on the number of groups it can become a real problem.
Actually, before I take questions, I do want to put in one quick point: if you think this is interesting, you live in the area, and you're looking for a job, we are trying to hire a few people. It's difficult to hire good people; we do have some openings, and we're looking for BSD people and general sysadmin people. So, questions?
(Inaudible question.) I would expect that to happen, but it's not something I've attempted to test. What I would really like is a topology-aware allocator, so that you can request "I want to share cache" or "I don't want to share cache", "I want to share memory bandwidth" or "I don't want to share memory bandwidth". Open MPI 1.3, on the Linux side, has a topology-aware wrapper for the CPU affinity functionality, something called PLPA, the Portable Linux Processor Affinity library, if that's what the acronym actually is. In essence, they have to work around the fact that there were three different kernel APIs for the same syscall for CPU affinity, because the vendors all did it themselves somehow; they use the same syscall number, but they're completely incompatible. When you first load the application, it calls the syscall and tries to figure out which one it is by what errors it returns depending on the arguments it's given. Completely evil. I think people should just port their API and have their library work, but we don't need to do that junk, because we didn't make that mistake. So I'd like to see the topology-aware stuff in particular.
(Inaudible question.) The trick is that it's fairly easy to limit application bandwidth; it becomes more difficult when your interfaces are shared between application traffic and, say, NFS. Classifying that is going to be trickier: you'd have to tag the traffic and add a fair bit of code to trace that down through the kernel. It's certainly doable, though.
(inaudible question)
I have contemplated doing just that
or in fact the other thing we consider doing
more as a research project than is a practical thing
would be actually how
would be
independent VLANs
because then we could do
things like
give each process a VLAN they couldn't even
share at the internet layer
once the images’ in place for instance we will be able to do that
and that say you know you've got your interfaces it’s yours whatever
but then we could limit it
we could rate limit that at the kernel we can also have
we’d have a physically isolated we’d have a logically isolated network as well
with some of the latest switches we could actually rate limit
at the switch as well
(Inaudible questions.) To the first question: we don't run multiple sensitivity levels of data on these clusters; it's an unclassified cluster, so we've avoided that problem by not allowing it. It is a real issue; it's just not one we've had to deal with. In practice, stuff that's sensitive has handling requirements such that you can't touch the same hardware without a scrub, so you'd need a very coarse granularity and a ridiculously aggressive re-imaging process that removed all of the data. If I were to do that, I would probably get rid of the disks and just go diskless; that would also get rid of my number-one failure case. That would be pretty good, but we haven't done it.
On NFS failures: we've had occasional problems with NFS overloading, but no real problems. We're all on a local network, fairly tightly contained, so we haven't had problems with the server going down for extended periods and causing everything to hang. It's been more the issue that Panasas describes as incast: you can take out any NFS server. We had the BlueArc guys come in with their FPGA-based stuff with multiple ten-gig links, and I said, you know, I've got to try this, and they said "can we not try this with your whole cluster?", because if you've got three hundred and fifty gigabit Ethernet interfaces going into the system, even ten-gig you can saturate pretty trivially. At that level there's an inherent problem: if we need to handle that kind of bandwidth, we've got to get a parallel or cluster file system, or for the streaming stuff we could go via a SAN or something.
Anyone else? Thank you, everyone. (Applause.)