1. We are here at QCon London 2012 and I am sitting here with Rupert Smith. Rupert, who are you?

My name is Rupert Smith, and I am currently working for Rapid Addition. Kevin Houstoun and I have been here today giving a talk on our FIX engine, specifically looking at low latency performance.

2. So, why don't you like latency?

Latency
can mean you miss your trade: if somebody is selling something in the market, you have got to be the quickest to get to it if you want to have a chance. But it's not necessarily about being the quickest - as we were saying before, it's about making the software real time, which simply means making it more predictable. There are things like garbage collection and other sources of jitter that mean you have a certain chance of missing a trade,
so even though our FIX engine might perform quite well and have low latency, there will
always be that section of high latency outliers that you might run into. So we initially started out just with the aim of making the FIX engine more predictable and removing the high latency outliers, and what we actually found is that we managed to lower the latency baseline as well. So it's about being able to give people predictable execution and to avoid running
into latency spikes.

3. So what's your approach to avoiding these? Do you use Java?

The FIX engine was originally written in C#, and prior to
this I worked on Apache Qpid, where I developed various ideas about how messaging software should be written but hadn't necessarily been able to apply them at the time. When I went to be interviewed at Rapid Addition they told me about their ideas, and we merged our ideas together, so a lot of the ideas I have implemented are not my own; I have taken what existed in the C# engine. One thing that they suggested doing
which is quite controversial is to make the engine garbage free. The whole point of C# and Java is that these languages have garbage collectors, so that you can allocate objects and not have to worry: they get cleaned up after you. But every time the garbage collector runs it can pause your program and inject latency into it. So they suggested that we make the engine garbage free. When Java first came out - we might be talking pre Java 1 - people actually used to make their programs garbage free, because object allocation was very slow in some of the original versions. But now object pooling, for example, is considered to be an anti-pattern, something you shouldn't do because it will make your program slower, so it's quite a controversial thing to attempt to do.
4. So how did you do it? Did you use object pooling, or did you rewrite your code to just use primitives?

We've got a hierarchy of four different techniques. The first relies on the fact that the Java
compiler itself is smart enough to eliminate some of the garbage: if you allocate an object within just one method and you don't return it, the compiler says, well, that object never escapes from the scope it is in, so I can just allocate its fields on the stack, and then it never has to be allocated on the heap and cleaned up afterwards. So usually I write the program, then run it under a profiler and only eliminate the garbage that is actually there - which might be specific to a particular implementation of Java, because there is not just the Oracle JDK, there's IBM and others as well, so I've just gone for Oracle Java. That's the first thing we do: just let the compiler take care of it. The second thing you can do is make your objects mutable, and
so you can change the fields. Generally that's bad programming practice: the libraries in Java are immutable when they can be. For example, if I pass a String to a method that you've written and your method were to change the String without telling me, that would be very annoying. But you can make an object mutable and reuse it, changing its fields at a later point in time. That makes it easier to introduce bugs into your program, so it's not necessarily a nice thing to do, but it saves on garbage. The next level you might go to
is object pooling, which is basically mutable objects that you keep in an array somewhere. You take some out, use them, and when you are finished with them you put them back again; you might use that in a situation where you don't know how many you need in advance. And the highest level technique would be reference counting, for when you need to pass data from one thread to another: if you have two threads operating on the same data and you don't know which one is going to finish first, then when the reference count goes back to zero you know that they have both finished and it's safe to put the object back into the pool. In fact we don't use plain object pooling very much; on its own it is only used in one place in the code, where I don't know how many of a particular kind of object I have got to allocate, and the reference counting is really only used on the actual FIX message objects where they're handed from one thread to another.
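The mutable-object, pooling and reference-counting levels fit together naturally. Below is a minimal sketch of the combination - the names `PooledMessage` and `MessagePool` are illustrative, not Rapid Addition's actual API - in which the last thread to release a shared message returns it to the pool:

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// A mutable message object whose fields are reused instead of reallocated.
// Hypothetical names for illustration only.
class PooledMessage {
    long price;    // mutable fields, overwritten for each new message
    int quantity;

    private final AtomicInteger refCount = new AtomicInteger(0);
    private final MessagePool pool;

    PooledMessage(MessagePool pool) { this.pool = pool; }

    void retain() { refCount.incrementAndGet(); }

    void release() {
        // When the count drops to zero, every thread has finished with the
        // message, so it is safe to return it to the pool for reuse.
        if (refCount.decrementAndGet() == 0) {
            pool.checkIn(this);
        }
    }
}

// A simple pool: objects are only allocated when the pool runs dry,
// so in steady state the hot path creates no garbage.
class MessagePool {
    private final ArrayDeque<PooledMessage> free = new ArrayDeque<>();

    synchronized PooledMessage checkOut() {
        PooledMessage m = free.poll();
        if (m == null) {
            m = new PooledMessage(this); // allocation only on a pool miss
        }
        m.retain();
        return m;
    }

    synchronized void checkIn(PooledMessage m) { free.push(m); }

    synchronized int available() { return free.size(); }
}
```

A caller would `checkOut()` a message, `retain()` it once more before handing it to a second thread, and have each thread `release()` when done; the final `release()` puts the object back in the pool.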
5. But all of these techniques - they don't actually guarantee that there will never be a garbage collection, because if you use some library that does naughty things there might still be one.

Well, there is no third party open source code in our FIX engine; we wrote all the libraries that we use ourselves. Obviously we used what is in the Java standard library, but if you are trying to be garbage free you might run into problems because something you are using in the standard library is not actually garbage free, and then you have got to write your own implementation. I've been fairly lucky with that; it caused me a few headaches, but mostly I was able to work around them.
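As an illustration of the kind of standard-library gap this creates: `Integer.parseInt` only accepts a `String`, so parsing a number out of a received buffer forces you to allocate a `String` first. A sketch of a garbage-free replacement that reads digits straight out of a byte array (the `AsciiInt` helper is hypothetical, not code from the actual engine):

```java
// Garbage-free integer parsing: no String is created on the happy path.
final class AsciiInt {
    private AsciiInt() {}

    // Parse a non-negative decimal integer from buf[offset, offset + length).
    static int parse(byte[] buf, int offset, int length) {
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            byte b = buf[i];
            if (b < '0' || b > '9') {
                // The error path still allocates an exception; a production
                // engine might avoid even that.
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + (b - '0'); // no intermediate objects created
        }
        return value;
    }
}
```

For example, given the raw bytes of a tag/value pair such as `38=250`, `AsciiInt.parse(buf, 3, 3)` yields the quantity 250 without touching the heap.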
6. So basically you have to look at all of the data paths and code paths and see if they contain any 'new' operator?

Well, 'new' is OK if you are going to hang on to the object for a long time; it's creating stuff and then quickly throwing it away that's the problem. If I try to make some code garbage free, what I generally do is run it under JProfiler and look at the allocation hot spots. That tells me where objects are being allocated continuously and how much memory I am using, so I can focus quite quickly on where the garbage is coming from. You can also use command line options with Java, like -Xloggc, and it will print out information every time it does a garbage collection; I run with that and check that it's not actually running the garbage collector. I do things like leave it running for the weekend, then come back on Monday and check that it managed to keep running all weekend with no garbage collection.
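Alongside the -Xloggc output, the JVM can also report collector activity from inside the process through the standard `java.lang.management` API. A small sketch of a soak-test check in that spirit - the harness itself is hypothetical, not the one used at Rapid Addition:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Counts how many times any collector has run since JVM start, so a soak
// test can assert that the hot path triggered no collections at all.
public class GcWatch {
    static long totalCollections() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) {          // getCollectionCount() may return -1 if unsupported
                count += c;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        // ... run the supposedly garbage-free hot path here for as long as you like ...
        long after = totalCollections();
        System.out.println(after == before
                ? "no GC during the run"
                : (after - before) + " collections occurred");
    }
}
```

The counts are monotonically increasing, so comparing a snapshot before and after a weekend run gives the same answer as scanning the GC log.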
7. That's very interesting - that is a pure code approach to low latency, no GCs basically - so you don't use any kind of realtime VM. Have you used realtime virtual machines?

No, I haven't at all, because a realtime virtual machine will only let you allocate things on the stack. I think the realtime Java VM is not free, though, so if we produced a version that required a realtime VM, that might put people off attempting to use it, because then they would have to buy the realtime version.

8. So you are doing interesting things with
FPGAs, programmable hardware. Why do you need that - isn't the CPU good enough?

The CPU is pretty good; CPUs are very fast - they're clocked up to 4 GHz, and an FPGA might only be clocked at about 600 MHz - but you can do things in parallel in hardware that you can't do in software. Despite the lower clock speed you can actually get a speed-up if you target your implementation at the very specific problem you are trying to solve.

9. So what kind of problems do you solve with
the FPGA?

I suppose one thing is the serialization and deserialization of the message, which we have written in software. It's an ASCII-based protocol, and we are constantly converting the ASCII message into a binary format to give to the application: you'd write your application to expect the price, for example, to be a decimal number - it doesn't want ASCII - and likewise when you send an order it gets converted back into ASCII. So we are going to implement the ASCII-to-binary and binary-to-ASCII part in hardware to make it as quick as possible. In software a cycle of that loop may take about eight microseconds; in hardware we are looking to get it down to much less than that.

10. So, how do you actually use the hardware? Do you write Verilog or something?

Rupert Smith: It's all written in Verilog, but I am not actually
doing that, somebody else is the hardware guy.
Werner Schuster: That's always good.
Rupert Smith: Don't ask me too many questions about that.
11. So how do you approach it - how do you design for an FPGA? Do you have different algorithms? How does that work?

The FPGA is really cool because in a program written in a hardware definition language the entire thing runs in parallel: every line of code is executing simultaneously. It's quite a bizarre thing to get your head around to begin with, but once you begin to understand it, it is really cool.

12. So with an FPGA do you have to write everything from scratch, or can you use libraries?

Rupert Smith: Well, there are libraries for things,
but it's not quite like software, where you have got lots of libraries. You are doing some very primitive operations with bits - that's why it can be fast - because you are designing an electrical circuit made out of gates, and ultimately that's juggling bits, so you can get right down to the lowest level of detail and make everything just how you want it. Basically the FIX engine has layers: it has got a session layer on top, which handles
things like logons and heartbeat messages for you - mostly messages just pass through and we give them to the application. Below that it's got the translation layer that does the serialization/deserialization between ASCII and binary, and below that we've got a network layer. We're moving the bottom two layers of that stack down into the hardware, so the message will arrive from the network, we turn it into a binary format and hand it straight up to the session layer, which really just passes it on to the application. So it's like zero-copy IO: there will not be any copying going on internally in the stack. During my talk I had a slide showing some of the copying that can actually go on between the network card and the point where the information arrives at the application level - within the TCP stack itself and within the FIX engine. We are eliminating all copying so that the data is directly handed over; that's a thing we can do in hardware that we can't do in software. The same goes for cut-through. TCP
is not really designed for cut-through, because there is a message length and a checksum in the header, so you can't write those out until you have finished writing the body of the message - they go at the beginning. But when a message is arriving off the network we can certainly stream it into the FIX core straight away, before we have even received the last bit of it off the network; if we encounter an error in the message later on, after we've started passing it to the FIX core, we send a signal saying "the last message is corrupt, by the way" and forget that last one. So even though the protocol is not designed with cut-through in mind, there are tricks we can do to get at least some kind of cut-through on it. Likewise with
the PCI bus: the PCI bus protocol itself is not really designed for cut-through either. I am not actually too familiar with the format; I just know that it isn't. I read a paper comparing the PCI bus to the HyperTransport bus, and HyperTransport was designed with cut-through: you'd have a header on your message, then a message body, and then a tail at the end, and the checksum goes in the tail, so you don't need to know it up front before you even start sending data. I don't know for certain, but I think the Intel bus is also much better designed for cut-through. Still, we can do a similar trick: when you are sending a message you put it on the PCI bus, and we can start feeding it into the FIX core straight away; we may find out that the message is corrupt and have to cancel the operation, but at least in one direction we are getting some use out of cut-through. So that lets us merge the layers of the stack a bit and have them running in parallel. That's where we get an advantage with the FPGA.
Werner Schuster: That's a really powerful solution.

Rupert Smith: Yes, you can't do these things in software; you don't have that option.

Werner Schuster: Because you need signals to race ahead. OK, that's very useful.
Rupert Smith: Yes, and Intel have been very helpful. Of course they have their QPI bus as well, and they have been quite keen to promote that to us as an option: you'd have the network cable going straight into something that plugs into the CPU socket and talks directly to the bus. That would be a lower latency solution than PCI. The implementation we are doing now on the FPGA is going to be on a PCI card.

13. So these are all very useful techniques
and I think we should all check out FPGAs for our low latency needs.

I would recommend getting a book and reading it; it's really quite fascinating. Although I am not doing the hardware implementation myself - someone else has been employed to do that - I did find out quite a lot about it so I can understand it, and I was quite blown away by what you can do. It's quite a cool thing.

14. So it seems to be a new avenue for lots
of things.

Yes, I mean you can buy a Xilinx board on eBay for thirty dollars - a really basic starter board - and a book, and have a play around with it.
15. But you have to learn Verilog or VHDL.

Rupert Smith: Yes.

Werner Schuster: That's a high-level-ish language. It has curly braces.

Rupert Smith: I think VHDL is considered to be a bit more high level than Verilog - it has a type system, although I don't actually know VHDL. Verilog is considered to be the slightly lower level, get-down-and-dirty kind of language compared with VHDL.