>> Good afternoon everyone.
Welcome to today's lunch hour lecture.
My name is Jack Ashby from the Grant Museum of Zoology
and I'm here as your host today.
And [inaudible] welcome the speaker.
Our speaker, Sofia Olhede, from the Department
of Statistical Science here
at UCL, is a professor of statistics.
She is also an honorary professor in the Department
of Computer Science here at UCL,
so two professorships in one person.
Her research focuses on the analysis of extremely large
and complex processes, known in the business
simply as big data.
And her findings have [inaudible] two concepts
in biometrics and ecology, which places her perfectly to talk
about today's topic, which is patterns in nature.
If you could join me in welcoming Sofia Olhede.
>> [Applause] Thank you all for coming.
Hopefully you can all hear me.
I take a sadistic pleasure
in having [inaudible] pronounce my surname
because it's particularly difficult for people
from an Anglo-Saxon culture.
So, I've had my kick for the day,
so I guess now I have to pay the piper.
So, most of the work that I'm going to talk
about today is collaborative.
I work across a lot of other fields with many
of my collaborators in the UK and many further afield.
So I thought I would go about it by first talking a little bit
about what do I mean by a pattern and what kind
of patterns do we find?
So, an area I worked on for a long while is oceanography.
So if you've seen that plot, the first reaction you have is like,
"Whoa, that's a lot of color."
So, the observations here are trajectories.
So, each colored line, which hopefully you can make out,
is actually representing the path of a piece of kit,
which someone's thrown in the ocean.
And then it's tracked using satellites and the position
of the piece of kit is recorded.
What people want to do is to figure
out what's the state of the oceans?
What's the ocean's circulation like?
And they do that by analyzing each squiggly line.
So, that's kind of an example we're going
to be returning to in a little while.
When they try to understand these patterns,
they need more than statistics.
So, what you see here is the output of a numerical model.
It basically uses techniques
from applied mathematics to try to create the sort
of patterns that we are going to be extracting
from the real data that you just saw.
So that's oceans, now one of the wonderful things about maps is
that you can actually take techniques that are applicable
to one field and you can apply them
to a completely different field, which might be operating
at different length scales
with different characteristic times et cetera, et cetera.
So one of the other fields I work on is trying
to understand recordings from prematurely born infants.
And I actually see my collaborator
in the audience waving happily at me.
Hopefully I didn't do anything bad with the analysis this week.
What we see here is recordings taken from more than one infant
after it's been stimulated, either using tactile stimulation
or painful stimulation, because they have
to draw blood samples from the infants.
Yeah, we're not allowed to do arbitrary painful experiments
on poor little infants or if they are,
I haven't been told anyway.
Of course, data can come in many different forms.
Another type of data that I find fascinating is looking
at relationships between individuals.
So, what you see on the right hand side is a depiction
of linkages between political bloggers in the US.
And what people are trying to understand there is --
are there special patterns in the linkages?
Can you identify groupings or communities of the individuals
and try to understand the patterns of how they interact?
So that type of data is very different because it comes
in the form of being absent or present.
There's a link -- there's not a link.
It's not like a continuous recording with an EEG
where we actually have continuous readings
of possible outputs.
So it has different challenges.
I wonder why Obama is looking so unhappy.
I wonder if they're worried about the fiscal cliff or not.
Maybe he is too.
Finally, another piece of data, which I find fascinating,
is looking at the ecology of trees.
So I saw my PhD student in the audience.
He very bravely went to the rainforest actually
and collected some data on his own.
And that's where the picture comes from,
but the whole dataset we study is a long study that's been
taking place in Panama over many years.
And we're trying to understand if species of trees aggregate --
i.e. like to be closer together -- or disaggregate --
i.e. they actually repel each other.
And what you see on the right hand side [inaudible] the output
of that type of analysis
where, if you squint very carefully, you can make
out little red blocks in the middle.
Well, every red block is representing a preference
for a tree species to be nearer to another tree species,
whereas a green block is showing
that the tree species prefers not
to be near the other type of species.
So here again we're trying to understand communities just
like with the political blogs and we're trying to make out --
do these patterns change over time?
What are the, sort of unifying aspects of these patterns?
And trying to understand if they're going to be altering
with other [inaudible] quantities
that we have from these forests.
So these have some archetypical problems.
What's unifying about them is that we're trying
to understand aspects of them across time and across space.
This is when I start to worry if I turned off Skype.
I'm not going to worry about that.
If someone phones, I'm going to say I'm busy.
So, how do we learn to make sense
of these different types of patterns?
Well, you might ask first, what is a pattern?
I've kind of shown you pretty pictures with lots
of different color blobs.
What is a pattern?
Well to me, a pattern is an apparent structure.
So a pattern is kind of a diffuse thing.
You take up a picture that you made of the [inaudible] signal.
And there seems to be some relationship.
There seems to be some mechanistic structure
that forces a kind of behavior given some other behaviors
taking place.
And the first question you might ask
when you see this apparent pattern is,
"What makes these patterns happen?
Are they really there and what's causing them?"
So are they structural changes that we might control, affect
and understand, or are they more the domain of my field,
namely statistics: random variation,
which we really can't expect to affect
because we can't govern randomness?
And then we might ask what structured patterns are
actually interesting?
What patterns can we actually make out and try to use
in our understanding of the world?
And which patterns don't people care about?
Right. So, you might be able to find a pattern,
but it's not relevant to the discipline
where the data was collected and they won't really care.
Or the pattern can just be significant
in a statistical sense in that, it does not appear to be random,
but it doesn't fit in to any important scientific question.
So, a typical statistical question would be --
could this pattern have been caused just by random variation
or does it actually require a more complicated
mechanism to explain it? While the scientific question would be
-- is it scientifically interesting?
Do I care about this pattern?
So, to go back a little bit in history, here is R.A. Fisher,
who was a well-known statistician at UCL.
He arrived in 1933.
He was one of the founders of statistics as a discipline.
He worked in what would later on become my department
and really founded the discipline, and he sort of says
that the main key idea of statistics is to be able
to discover patterns using the rules of mathematics.
So, it's important to be able
to actually create formal mechanisms so we can quantify --
we've stopped being qualitative.
We actually want to be able to be precise
and quantify the apparent patterns that we see around us.
So, the main question
of a statistician is, first, very simple.
We just look at data.
So on the right hand side is a plot of some
of the oceanographic data that I'm going to revisit later.
So, one of the first questions we might ask is --
we see this data, how do we summarize this data?
The second question is -- okay, we've summarized it,
we've captured key features, what do we do next?
Well, we actually want to try to explain
or understand what we see.
And that corresponds to modeling
because a model is a quantitative mechanism
that produces a pattern.
So the second question after we've looked at key features
that appear to be in the data is --
how do we make up simple rules that actually create a pattern?
And one of the most important questions a statistician has
to answer is -- how do these two things match?
I might calculate something from the data that looks interesting,
but can I actually match this up with my notion
of how the data must have been created.
So, having gone through the basics of statistics, we are now probably
up to the 1960s at least, and I'm now going
to talk a little bit about one of my main fields of study,
which is the study of time patterns.
So, I'm interested in phenomena that seem to happen across time
and alter their behavior depending
on what time they were observed at.
So a typical thing here might be the temperature.
You wouldn't expect to have widely varying temperatures
between two time points that are near each other.
So, we expect to have more
or less the same temperature today as yesterday.
And if we study it across many years, we also expect
to see features that remain the same until you go
to even longer time scales, like ice ages.
And if you're a statistician or a mathematician,
if you try to understand data in the context of a model
and you're making up a model that's supposed
to describe the data, then you need
to construct a full pattern.
You need to make up a way of getting out the full curve
or an event that's happening across many time points
because you're not just interested in the single value,
you want to see how does this process,
this whole phenomena change over time.
So you can think of that as generating curves rather
than just generating numbers, making up single values.
So if we're making up a value, we might think
of a typical object that we're observing
such as people's heights.
So if we randomly select people's heights,
well we're not going to get negative heights,
but we're going to get values that range between --
I don't know -- one meter and three meters.
And we're going to get a value for each person in the room.
If we're studying phenomena across time, then we expect
to generate [inaudible] curves that are evolving across time.
So, the notion here is we're trying
to understand more than one value.
We want to understand a whole set of values,
which are chained together.
I mean, time is connected, we expect to see a degree
of relatedness as we grow across time.
So, what sort of scenarios might these be typical situations?
Well, again, we talked about the oceanography a little while ago,
here we see -- well, actually you'll see this is very recent
data -- September 2013.
These are the measurement devices that people have chucked
in the oceans, which they're actually studying ocean
circulation with.
So, each of these objects is collecting a location
and we're going to be using that to study the oceans.
So, the type of observations that come [inaudible] as we go
across time -- so if we just go back one-step,
which I'm sure will make you slightly dizzy.
But here we have a location for each recording device
at any one time point as we start to go across time.
What we get is the full curve for each recording device.
So, we start to see a full pattern, which is corresponding
to movements or pattern behavior that's evolving across time.
So, that might look like a [inaudible],
what you're actually seeing is lots of different trajectories,
lines that are made up of pattern behavior
as the measurement device is moving in space across time.
So, the question is -- okay, we now have these phenomena,
they're evolving across time, we're trying to understand them
as they roll across time.
What do we actually have to understand?
Well, normally when we observe phenomena,
we have to understand something which has been aggregated.
We see a plethora of effects that are occurring
because the eventual output is an aggregation.
So the whole is basically a sum of the parts.
And what you see in the top right hand corner there is a
drifter observation, so it's one of those measurement objects
as it's going across space in time.
Now, when we model in mathematics, we want to come
up with a mechanism, a way of generating that whole pattern
as things change across time.
So the first question we want to ask is --
to make up this pattern, what are actually the parts
that will enable us to make up these patterns?
And the first thing you see as you look at the blue curve
as it changes across time because it's moving is
that it's actually slowly meandering.
You're seeing where the drifter has been
over the course of time.
So what we need is some meandering drift.
And that could, for example, be the green curve,
which would be one particular mathematical model of making
up drift across time.
Secondly, we need wiggles.
So, typical wiggles we might see in that data could correspond
to -- if you in space get stuck near a [inaudible], which is one
of the phenomena we study or it could be due
to other oscillatory phenomena that evolve and change
across time like the second green curve.
So to make a mathematical model, you have to make up a rule
which creates these sorts of patterns.
And to make them useful in a statistical sense, the rule has
to create slightly different outputs every time you generate
a curve.
So, aggregating all of these different components to end
up with a single line like the blue thing you see on top
of the right hand corner,
you have an unobserved components model
because the output is an aggregation of lots
of different components.
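That unobserved components idea can be sketched numerically. This is a minimal illustration, not the actual oceanographic model: the particular drift, wiggle and noise mechanisms below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                       # number of time points
t = np.arange(n)

# Component 1: slow meandering drift (a random walk with small steps)
drift = np.cumsum(rng.normal(0, 0.05, n))

# Component 2: wiggles (a regular oscillation)
wiggles = 0.5 * np.sin(2 * np.pi * t / 25)

# Component 3: measurement noise
noise = rng.normal(0, 0.1, n)

# The observed signal is the aggregation of the unobserved components
signal = drift + wiggles + noise
```

Each run with a different seed gives a slightly different curve, which is exactly the property a statistical model needs.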
Okay, so now we've decided that we're actually going
to create the signal
by aggregating lots of different effects.
How are we actually going to come
up with these different mathematical models
that yield outputs that look more or less the same?
Now this is kind of an interactive process
where you need to talk to the collaborator,
you need to understand what phenomena you need
to reproduce in order to get back the structure they need
to extract from the actual signal.
So, a first step is to talk a lot to your collaborator
and figure out what parts of the signal need to be preserved.
The second thing is to realize
that you need a statistical model
because each drifter looks a little bit different.
You can't have a model
which gives you back exactly the same curve every time.
And therefore, you need to have a way
of generating some random variation into your model.
So that's sort of the second [inaudible]
in the actual modeling.
So taking all of these components together,
you then have to worry about how to make
up realistic time evolution, realistic changes in time
and then aggregate the different components.
So just to show you an example of why it makes sense
to ask these types of questions and then look at real data,
what we see on the left hand side is two different models --
two different mathematical models
that are generating different outputs.
And they kind of give insight into how real data could tell us what
kind of model we'd need, because they look quite different
in terms of their output.
So you see two squiggly lines where one of them seems
to drift more while the other blue squiggly line is staying
closer to its typical value.
And if you look to the right of that, you see again
that the red squiggly line, the way we made that up,
it's happy to meander further away from where it started
and is allowed to drift further.
So, as you go from [inaudible], you just want to make up ways
of getting squiggly lines with different characteristics.
To be useful to applications, you'd have --
basically have to have something in that mechanism that makes
up the squiggly lines that allows you
to understand whether observations are allowed
to drift further out like the red squiggly line compared
to the blue squiggly line that's tightly sticking near
to the point where it started.
So a sort of typical thing to do would be to sit
down with the person you're working with
and understand whether they'd be happy for the squiggly lines
to move far away, or whether you need to come
up with a mathematical mechanism that forces the squiggly lines
to stay really where they started.
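Those two kinds of squiggly line can be sketched with two standard mechanisms: a random walk, which is free to meander, and a mean-reverting AR(1) process, which is pulled back toward its starting level. These are illustrative stand-ins, not the specific models behind the plots in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
eps = rng.normal(0, 1, n)      # the same random shocks drive both mechanisms

# Mechanism 1: a random walk -- allowed to drift arbitrarily far
walk = np.cumsum(eps)

# Mechanism 2: a mean-reverting AR(1) process -- pulled back toward zero
phi = 0.9                      # persistence of past values (|phi| < 1)
ar1 = np.zeros(n)
for k in range(1, n):
    ar1[k] = phi * ar1[k - 1] + eps[k]

# The random walk's spread grows with time, while the AR(1) process
# sticks near its typical value
spread_walk, spread_ar1 = np.std(walk), np.std(ar1)
```

Sitting down with the collaborator then amounts to deciding which of the two mechanisms better reflects how far real trajectories are allowed to wander.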
And it's a little bit more interesting in oceanography
because there you have more than just real data.
So, on the top left hand side you see here some plots
of real velocities from the oceans calculated
from those colored squiggly lines we saw before.
Well, if we have a statistical model that's trying
to reproduce those features, we see realizations
from such a model on the top right.
So, that's coming up with a statistical way of trying
to reproduce the structure you see in the real data,
which is on the top left hand side.
While these kinds
of observations can also be modeled numerically using
applied maths, and if we look at the bottom, we see the type
of output that such a model would give.
And if we were interested
in describing the differences we could take a statistical model,
a method of generating realizations and we could try
to match it also to that applied [inaudible] model and try
to understand how they're different.
So we would, when faced with that data --
to try to understand what mechanism is generating the
observations -- we would actually try to match the model
that comes from the statistical mechanism to the real data
or to other types of output and understand --
how do they look different when they're being generated?
And to give you some more flavor
of the things you might be interested in --
if you look on the top and the bottom here,
the top is real data and I'll explain what it is in a second.
The bottom is model output.
Again, you're trying to look at one of those trajectories
of drifters as it's evolving across time
and what you're trying to describe is how wiggly it is.
So just think of the axis that's going up and down
in this direction as a measure
of how fast this observation seems to be wiggling.
While one way to actually check whether a statistical model is
fitting or reproducing the features of the data well,
would be to see what type of wiggliness would
that model reproduce compared to reality.
And as you can see, they're actually matching very well.
So, the model output here is in a sense wiggling
in the same way as the real data.
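One simple version of that wiggliness check is to compare the dominant oscillation rate, the peak of the periodogram, for data and model output. Both series below are synthetic stand-ins generated at a known rate; the real comparison would use the drifter records themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1024
t = np.arange(n)

# Stand-in "real data": an oscillation at 0.05 cycles/sample plus noise
data = np.sin(2 * np.pi * 0.05 * t) + rng.normal(0, 0.5, n)

# A candidate model realization built to wiggle at the same rate
model = np.sin(2 * np.pi * 0.05 * t + 1.0) + rng.normal(0, 0.5, n)

def peak_frequency(x):
    """Frequency at which the periodogram (squared FFT) is largest."""
    spec = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(len(x))
    return freqs[np.argmax(spec)]

# If the model reproduces the data's wiggliness, the peak rates match
rate_data, rate_model = peak_frequency(data), peak_frequency(model)
```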
So, [inaudible] talked about trying
to get a model that can reproduce, basically, effects
that we see in the real data.
The next question might be -- why would we want to do this?
Okay. It could be a fun thing to do on a rainy Sunday afternoon,
but why do we actually want to spend our time trying to make
up new observations or mechanisms
that can reproduce the variation we see in real life?
And one of the key aspects here is trying
to understand what variation can be considered normal
and what variation seems to be abnormal.
Can we tell which of the trajectories are not typical
for a particular region?
And can we understand what sort of behavior we would expect
to see in a given region?
So, it's really about trying to find ways
of describing large sets of data using simple statistical models.
And one of the things we can do once we've figured
out what normal behavior is: I'm going to show you a video
where we compare how we would expect a trajectory to look
when there was no special behavior with something
where there's an abnormal pattern.
So I'm now going to show you a video -- if you wait one second.
So, what you see here is a numerical simulation,
again of ocean circulation, and what you see
in the actual movie is vortices generated
from the applied mathematical model, and we've put trajectories
or drifter-like things in this numerical simulation.
And what we're doing is extracting
or identifying vortices,
which are in this fluid, automatically.
And what you see is all of the colored hoops
that are being identified are actually the perimeters
of these vortices.
So a vortex is a swirl of water which you might see in your tap
that you [inaudible] bathroom tap on.
And we're trying to understand if we can extract, identify
and actually quantify the properties
of vortices in the ocean.
So of course, you could not just put
down measurement devices all over the oceans.
You really do need to have floating measurement devices
in order to understand what's actually going on in the oceans.
Okay. So, we're now going to have a change of pace.
We're now going to go away from the oceans and we're now going
to talk about a little bit of neuroscience.
So, we're going to take questions at the end anyway,
but I would just warn you
that we're now changing topics.
So, I work with a number of other scientists at UCL
and they're taking measurements
of, basically, prematurely born infants, and they're trying
to understand the developing brain using these
different measurements.
And one of the main goals is actually to discriminate
between responses to different stimuli
and to understand how those responses actually change
with time.
And the main measurement we take here is
electroencephalography, and that characterizes electrical
activity in the brain.
So, the typical observation here is quite noisy,
so they actually look a little bit
like the trajectories we saw before from the oceans.
They're at completely different length and time scales.
In the oceans we observe ocean currents over years;
we've got very long time scales.
The shortest thing we could go down to would be minutes.
Here we're operating under milliseconds;
everything's very fast.
So, three thousand milliseconds, of course, is just three seconds.
And what are we actually observing?
Well, we're trying to characterize --
how do infants respond
when they're getting tactile stimulation, [inaudible] touched,
and when they're getting noxious stimulation,
i.e. they're getting painfully stimulated.
And we're trying to identify the types of temporal patterns.
We're trying to identify if we're getting similar responses
or if the responses change depending on the age
of the actual infant we're recording.
So, here on the left and right hand side,
you see different trajectories if you want to compare them
with the oceanographic data or different time signals
where we've indicated the actual age of the infant
and we've identified a typical type of response to stimuli,
which is getting compared
to the actual time course that we're observing.
Of course, there's more responses than just those
to the painful stimuli.
So we're going to see more than one typical signal or wave form
and my collaborators and I --
as we were trying to analyze these types of data --
we also noticed that there were wiggles.
So wiggles correspond to a special kind of oscillation.
And oscillations are associated with a typical period
of oscillation, and what we saw was depending
on if there was tactile stimulation i.e. touch
or if there was painful stimulation,
you had a different rate of wiggles in the response either
to the painful or the tactile stimulation.
So, what you see on the top are courses across time,
the signals across time and what we've zoomed
in on there are special kinds of wiggles.
And we've used a description of the wiggle,
we've taken the signal and we've tried to [inaudible]
into the typical rates that wiggles can [inaudible]
and we've identified if there's a special kind
of response depending on how fast the wiggle is wiggling.
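A sketch of that kind of decomposition: split a signal's energy into slow and fast wiggle rates using a Fourier transform. The sampling rate, frequencies and band edges here are all invented for illustration; the actual EEG analysis may well use different tools.

```python
import numpy as np

rng = np.random.default_rng(3)
fs = 1000                         # samples per second (millisecond resolution)
t = np.arange(0, 1, 1 / fs)

# Toy signal: a slow response plus a burst of faster wiggles plus noise
signal = (np.sin(2 * np.pi * 3 * t)
          + 0.5 * np.sin(2 * np.pi * 40 * t)
          + rng.normal(0, 0.2, t.size))

def band_energy(x, fs, lo, hi):
    """Fraction of the signal's energy between lo and hi Hz."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, 1 / fs)
    band = (freqs >= lo) & (freqs < hi)
    return spec[band].sum() / spec.sum()

slow = band_energy(signal, fs, 1, 10)    # energy of the slow response
fast = band_energy(signal, fs, 30, 50)   # energy of the faster wiggles
```

Comparing those two energies across infants of different ages is one way of asking whether the rate of wiggling changes with age.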
To be very precise.
So what do we actually get out of this?
I hope you can see some of it.
I should probably have zoomed in a little bit more.
Well, what we did was we studied, depending on the type
of stimulus -- so what you have
on the far right there is noxious stimulation,
and we've got tactile stimulation, which is touch -- and then we tried
to see: is the response changing?
Are we seeing more wiggles?
Are we seeing more of the smooth curve
that we just discussed before?
So we look at the type of stimulation,
which is the two different columns, and we compare
the incidence of response with a particular kind of wiggle
or type of signal that we're trying to identify.
What occurs is that there is a clear [inaudible]:
what you see is curves that go down or go
up because the specific stimulus response becomes more
or less common depending on the age of the infant.
So the type of wiggles,
which actually characterize the oceanographic signal we saw
before is now being used to describe the response
to stimuli in infants.
So, I should probably check how we're doing for time.
>> 10 more minutes.
>> 10 more minutes.
So, I've talked a little bit
about modeling processes across time.
So, modeling requires two important factors.
The first thing is you need to have clear communication
and understanding with the person who is actually trying
to understand the data.
You can sit around and wait until the cows come home,
but you're not going
to do anything useful unless you understand what people are
trying to get out of the data.
And the main key aspect there is you're trying
to understand variation or apparent patterns in your data
and connect them to a scientific question of interest.
Key to actually understanding that variation, or how
to actually determine whether the pattern is real, is the ability
to make up fake patterns, and that's where mathematics
and statistics become so important, because we have ways
of quantifying or making up artificial structures
which can explain the artificial,
or the real, variation we actually see in data.
Thank you.
[ Applause ]
>> Thank you very much, Sofia, for genuinely [inaudible].
We do have time for questions.
If you do have any questions, you could wait for a microphone
to reach you so people watching online can hear your questions.
Does anyone have any questions?
>> With the brain graphs, how did you make the sort of,
what looks like a linear pattern
into more like a circular one?
So from the top to the bottom, how did you make that?
>> That's a very good question.
So I was actually slightly [inaudible] two
different concepts.
So, we were looking at single signals across time.
I [inaudible] variable changing across time
and we were also looking at positions
so two locations across time.
So that circular pattern you saw was actually the two
coordinates in space of something evolving in time,
which was plotted as the third axis.
Now, whether something is going back and not meandering off --
the sort of linear increase you saw was the process
actually changing in terms of how much energy it had.
It was getting bigger and bigger and bigger.
So there are fundamental aspects you have to put
into the mathematics which make something return
to typical values that it had before [inaudible], rather
than just go off and take arbitrarily large values.
So there were two aspects there,
and part of it was whether you were modeling a single object
across time or two objects, like a location,
but also whether you were allowing something
to just drift off or forcing it to return to previous values.
So in finance, for instance, people assume
that things are going to return to previous values,
but some processes are not like that.
They don't have that characteristic;
they might just meander off and take any value.
There's no actual constraint on their value.
So that's two different mathematical models
which create different characteristics.
And you need to look at real data to see
which would be more appropriate for a given scenario.
>> [Inaudible].
>> Taking the brain model again, are your data independent
of the wakefulness of the infant and the state of their feeding,
to take one example? And to take your oceanography example,
what's the factor of temperature change in the water?
>> So, there things are a little bit --
I'll start with oceanography.
Things are a little bit confounded.
So, you do have things like bathymetry and what things are
like underneath the water.
They will affect how things move in the oceans [inaudible].
There are also different speeds because there are currents.
And temperature also will have an effect on the local system.
So you don't expect to get the same sort
of trajectories irrespective of where you are in space.
And that's really a confluence
of many different effects.
I think, yes you're right.
You do expect to see different characteristics.
As for the infants, some of the other information you have
you should use in the analysis;
sometimes you don't have other information.
With an infant, it's not so easy for them to communicate,
so you can't put in their state of being [inaudible].
I mean, you can try to observe whether they look happy or not,
but that wouldn't be a very rigorous process.
>> Behind you over here.
>> Do you as a person ever worry about the cases,
especially with complex data sets,
where the math returns, like, a type one or a type two error?
Would you just have to chalk it up to, like,
an occupational hazard and sort of relax about it?
>> You may need to clarify the question a little bit.
Are you saying, do I analyze a lot
of different data sets and...
>> Yeah, so the cases where you might end up with a type one
or a type two error,
simply where the math returns that to be the case.
And do you worry about the fact that that can happen
or do you just have to kind of accept
that as part of what you do?
>> Sofia, would you like to start
by explaining what a type one and type two error is?
>> That's a very good point.
So in statistics, you worry about two different things.
So you set up a very rigorous framework
and you have what's known as a null hypothesis.
So, that's a little bit like a [inaudible].
It's something that you don't expect to be true,
but which you say is true just so you can see
if something's different from it.
So making a type one error
or a type two error means you either say
that the simple null model is false when it was actually true,
or on the other hand you fail to show
that the null model is false
when it actually was false.
Statistics is a little bit backwards.
It sets up these logical sort of flip-flops in order
to be rigorous about how it compares two different
hypotheses when one is simpler than the other.
So one of the errors you can make, for instance,
is to reject the null:
to say the simple null model was false when it was actually true.
And if you have a lot of data, like in some of the examples,
and you do lots of different checks
to see whether the null was false,
that means the probability that you make an error
in your actual testing goes up.
But [inaudible] statistical theory has come up with ways
of dealing with testing many, many times and the techniques
to deal with the error.
Of course, you control it within study.
So if I study the oceanography data and then go
on to study some other data set, I don't worry
about having done many tests across my different studies,
I just worry about the many tests within the study.
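One of the standard techniques for controlling error over many tests within a study is the Bonferroni correction; this is a sketch of the general idea, since the speaker doesn't name the specific method used. Each individual test is simply held to a threshold of alpha divided by the number of tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each null only if its p-value clears alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# 20 tests at overall level alpha = 0.05: each test must now clear
# 0.05 / 20 = 0.0025, so a p-value of 0.03 no longer counts as significant
p_values = [0.001, 0.03, 0.2, 0.0004] + [0.5] * 16
rejections = bonferroni(p_values)   # only the 0.001 and 0.0004 tests reject
```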
So, the answer is yes and no.
>> [Multiple speakers].
>> A lady in the box at the top.
>> It's nice to be called the lady in the box at the top.
It sounds like [inaudible].
That was a very interesting talk.
How powerfully can these models of analysis apply
to general and popular understanding, as opposed
to scientific understanding, to change popular perceptions
of global warming and climate change?
I mean, it seems to me if they were explained as you explained
them, so clearly, they might have more effect
on understanding.
Does that make sense?
>> Yes. I think the problem with things
like global warming is it's very hard
because you don't have enough data.
So, I have a colleague that I was visiting a few months ago
and he was trying to understand how to take account
of spatial effects: observations
that are near each other
in space can't be considered independent.
And I asked him, "Do you worry about this
and do you have enough data to actually understand
whether some of these complicated effects [inaudible]?"
And he said, "I don't know.
I don't even know how much data we need to collect in order
to answer that question."
So climate is a dangerous topic to have opinions about,
and the things that I talked about, like the oceanography,
have implications for climate, but they don't go out and say,
"Oh, the temperature is going to increase by five degrees."
And that's just because the systems are a little bit
too complicated.
So I think we're all hoping that we're going
to have more observations which will certainly help us
to understand things better, but we're a long way away
from having simple methods of understanding very, very big
and complicated systems.
So the things I was talking about here would correspond
to a tiny bit of a global climate model
that I'm happy to contribute to.
But then, there are lots of people like me who are putting
in different [inaudible].
And we don't all know each other
or understand how what we do relates.
So, I think at the moment, there are very few people
who have a good understanding
of what big climate models are actually doing.
And we all need to work harder and contribute
to make a difference, because I think --
it's like when I said to my friend the scientist,
"So how can you work on these problems
when you haven't even figured
out if there's enough data to answer?"
You know? And he said, "Well, if I don't try,
I'm certainly not going to learn."
So, and he also said, "I only have one big project left
that I can do before I retire and I'd like it to be this one."
>> Okay, I'm afraid that is all we've got time for.
I want to thank you all for coming,
thank you for your questions, but mostly thank you
to our speaker Sofia Olhede [applause].