Movie Line Monday with Steve - Big data, small data and machine learning

Hi, my name's Steve Malmskog with Netskope, and I'd like to look into another Movie Line Monday. The topic today is machine learning. And today's quote is "I'm sorry, Dave. I'm afraid I can't do that." It comes from the movie 2001: A Space Odyssey. And for those of you who are familiar with the movie, you know that that's actually not said by a person, but by a computer. And today, we're going to be talking about machine learning. We're not going to be talking about designing systems that are as sophisticated as HAL, but we are going to be talking about how machine learning is used in enterprise and other commercial solutions, and give you a brief introduction to what that is.
And I want to start with a definition of machine learning. There are some more formal definitions. The department head of the Machine Learning Department at CMU, Carnegie Mellon, Tom Mitchell-- he actually has a very good formal definition. We're going to be a little lighter weight here. I want to use something that doesn't bog us down too much. And so the definition that I'm adopting here is simply this. Machine learning is the study of designing systems that learn from data. And the two key words here that I want to look at in this video are the idea of learning and the idea of data.
And if we start with data, if you're at all familiar with where things have headed in the last few years, the whole world of data is actually predicated on this idea of big data. And if you looked on Google Trends, for example, the number of searches for big data has gone up by, I think, more than 10-fold in the last 24 months. So that's a lot of people who are interested in this area. But the reason it's interesting is not because of big data itself, but actually because of small data, which one of my colleagues, Ron, talked about in another Movie Line Monday. That's the idea of taking big data and putting it in some format that you can make sense of. 
And one of the tools that we use to make that possible is this idea of machine learning. So machine learning is one of those tools in the toolbox where we can take things like big data and make sense of it.
So we have data on one side. And then the other half is this idea of learning. Machine learning is not simply about processing data. For example, you could have scripts that are run to compress data. You could have things that are substituting words or doing word searches. Those are just data processing tasks. Machine learning is more about actually learning from the data. And when we say learning, what we mean by that is improvement over time. So we start at some point in time, and as time goes on, the actual results that we get are better than they were before.
And this idea of learning in machine learning is very similar to how we learn ourselves. For example, if you were learning to play a piece on the piano-- let's say, Beethoven's "Moonlight Sonata"-- you might start out the first time playing and you might only get 40% of the notes right. But over time, you keep practicing, you keep practicing, your brain adapts to the process of playing that piece until you hopefully master it at some point. In the same way, machine learning algorithms can mimic that behavior, applying data to the algorithms themselves to give that same improvement over time.
And a great example that I like to use for that is this idea of email. So if you had an email account with an internet provider 15 years ago, you probably saw just about as much spam in your email as you saw real email. And it reached a point where, in many cases, email was becoming nearly unusable. And that's mainly because a lot of commercial vendors had not yet adopted the means of figuring out what was real email and what was spam. But today, in your modern email inbox, you probably see very, very few pieces of spam at all. 
You might see a few once in a while. But for the most part, it's not there. And what changed in that time is really the ability to take email as an input and apply it to a machine learning algorithm. And at the output, we get either an actual email, or we figure out that it's actual spam. The process of making this possible is really due to the machine learning algorithm. And in fact, it's not that the spam has disappeared. If you actually go into your spam folder and take a look, there are hundreds and hundreds of spam messages that you are still getting. But machine learning has made a tool that would by now be completely unusable perfectly functional, because of this ability to improve over time and to take inputs of data and turn them into something useful: either keep a message because it's email or reject it because it's spam.
So just to wrap up here, what I want to talk about is this relationship that exists between the data and the machine learning algorithm. As we apply data to the machine learning algorithm itself, there are two things that need to happen to get a satisfactory result. The first thing is that the data itself has to be good data. Even though the term "big data" is very popular, what's really important at the end of the day is that the data is actually usable. You can do something with it. Having a lot of data that you can't actually run a machine learning algorithm against is not very useful. At the same time, you need to have a machine learning algorithm that is also good. You need to choose a machine learning algorithm that actually works for the problem that you have at hand. When you combine these two together, you get really good results. If you mess up on either one of these, you can expect poor results.
So that about wraps it up for another Movie Line Monday. If you have questions, either about machine learning or maybe another topic that you'd like to see, feel free to email us at movielinemonday@netskope.com. 
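Editor's sketch: the spam example from this episode (email goes in, a learning algorithm sits in the middle, "email" or "spam" comes out) can be illustrated with a very small program. The video does not say which algorithm any real mail provider uses, so the code below swaps in a toy naive Bayes classifier purely for illustration. The class name TinySpamFilter, the TRAINING_MESSAGES list, and all message text are made up for this sketch. Before it has seen any data the filter lets everything through; after learning from a handful of labeled messages it starts catching spam, which is the "improvement over time" idea described in the video.

```python
import math
from collections import Counter


class TinySpamFilter:
    """Toy naive Bayes text classifier (hypothetical, for illustration only)."""

    LABELS = ("spam", "email")

    def __init__(self):
        # Per-label word frequencies and message totals, built up as we learn.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def learn(self, text, label):
        """Update the counts from one labeled message."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        """Return 'spam' or 'email'; ties (e.g. no training yet) go to 'email'."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["email"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in self.LABELS:
            # Add-one smoothing keeps unseen words from zeroing out a score.
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            scores[label] = score
        return "spam" if scores["spam"] > scores["email"] else "email"


# Made-up labeled examples standing in for the "good data" the video mentions.
TRAINING_MESSAGES = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap pills free shipping", "spam"),
    ("meeting notes for monday", "email"),
    ("lunch on friday", "email"),
    ("project status report attached", "email"),
]

spam_filter = TinySpamFilter()
before = spam_filter.classify("claim your free money")  # nothing learned yet

for text, label in TRAINING_MESSAGES:
    spam_filter.learn(text, label)

after = spam_filter.classify("claim your free money")
print("before learning:", before)
print("after learning:", after)
```

Both halves of the closing point show up here: with no data, or with mislabeled data, the same algorithm gives poor answers, and a different problem might call for a different algorithm entirely.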
Again, I'm Steve Malmskog with Netskope and thanks for watching.