>> Welcome back everyone to the Xamarin show.
I'm your host James Montemagno.
And today I'm super-crazy excited to have
my very good friend Jim Bennett all the way from,
I have no idea where you're living nowadays.
Cloud Developer Advocate here at
Microsoft who's coming on to talk about
this awesome blog post/Custom Vision/we got BB-8 here.
>> We got all the toys.
>> What are we even talking about?
>> What are we even talking about?
Yeah, we are talking about classifying images.
So one of the cool new AI tools
that's come out of the Azure Cognitive Services
is an image classifier.
>> Oh, cool.
>> And it's a magic little tool,
you teach it what something looks like,
and it can then identify them.
>> Nice. So first off Jim before we get into it,
maybe talk on a little bit about yourself.
>> Okay, so yeah.
I'm Jim Bennett, I'm
a Cloud Developer Advocate here at Microsoft,
this is my fourth week on board.
>> Nice.
>> I've known James probably for
the last four years or so through the Xamarin community.
I've been a Xamarin MVP,
and then a Microsoft MVP.
I've been involved in the community.
The author of "Xamarin in Action" The guy who
wrote the foreword from that is a particularly nice chap.
>> I heard it's pretty good.
>> It is pretty good book, and,
yeah just super-stoked to finally be on the show.
>> Awesome. Yeah, it's good to have you here.
This is something I'm really passionate about.
Often I'm talking about machine learning and building
models, and I literally know nothing about it.
So I have my good friend Frank,
who does a podcast with me. He's the expert.
He knows everything, and we've gone
into all sorts of different details about,
what does it mean to make a model versus image,
and how do you build hot dog or not, essentially.
Things like that. Well, you did
something way more interesting than a hot dog or not,
so maybe you can cue it up by
telling people what you built and why you built it.
>> So, what I've built is
a little tool for helping to identify my daughter's toys.
So I've got a little girl,
she turns five tomorrow.
So Happy Birthday, Evie, for tomorrow.
And she's got a whole raft of different cuddly toys,
and I lose track of what they're called.
Even she does. She picks a name for her toy and then
she can't remember what she's called that toy.
So I thought it'd be great to build
a classifier to allow me to use
my phone to take a picture of
a toy and have it tell me the name.
And that's a nice idea for me,
and also thinking things like,
we've recently moved to the UK -- you
say you don't know where I live now,
I keep moving around the world.
Recently moved to the UK. We spend a lot of
time with my parents who are based in the UK.
They have no clue what her different toys are.
So again they can use this app to
kind of sneakily learn that
her favorite toy this week is called, "Baby Duck."
>> Got it.
>> So they can know.
That was my inspiration for building this.
>> That's pretty cool because then you could also
take it to the next level,
which is you could then start tagging [inaudible]
you could create an app where
not only could you scan the toy,
but you could then create a database entry,
like this costs this much, this is when we got it.
This is when it was no longer the favorite, right?
You can kind of see trends.
That's actually a really great idea.
>> It's great if you have
other parents with kids and their kid likes your toy.
Where did you get the toy, right? Take a quick photo.
Look it up. Oh yeah,
we bought this toy in this particular shop.
>> So I assume then that you're a data scientist.
You're a machine learning expert.
Things like that. Right?
>> Nope. Not in the slightest.
As far as I'm concerned this is a magic box.
>> Got it.
>> And this what is really cool about
the whole Azure Cognitive Services,
where I've been playing with them.
You don't need to be a data scientist.
>> Yeah.
>> Now I'm sure you could get someone like
Paige Bailey on here to explain how it all works.
I have no clue, but I don't need to have a clue.
>> Magic. I like it when magical things just do magic. Yes.
>> It's all magical. It's a magic box.
>> Yes. So what are we looking at?
What do you have up here right now?
>> So what I've got here is
the Azure Custom Vision Service.
You can get to it at customvision.ai.
Just log in with a standard Microsoft account.
And from this you can train
a neural network to identify
images just using a small number of images.
So, if you started from scratch, you would need
thousands upon thousands of different images.
>> Yeah.
>> And over time you train this network
to recognize certain things.
What Microsoft have done is pre-train
these models -- I don't
understand how it works --
but with that pre-training,
you give it literally five or six pictures and that's
enough for it to learn that
this is this toy, this is this toy.
You think about the classic hot dog,
no hot dog, you give it five hot dog pictures,
five no hot dog pictures,
and it can tell straightaway which
one is which when you take another photograph.
>> Yeah. From my understanding,
it's important to have what it is,
but it's almost more important to
have what it's not, right?
Because it's going to keep learning.
If this is not a hot dog,
not a hot dog, not a hot dog,
it's going to get really good at knowing
what a hot dog is,
because it knows what it's not.
Is that about correctish?
>> I believe so. It's called overfitting, I think.
I'm still not 100% sure on how that bit works.
I just rely on the magic box side of things.
>> So how would I get started here? What is this?
>> So to get started, you go to customvision.ai.
You click on "New Projects."
Give it a name, "My Project," give it a description,
"Thing." And then you choose the domain.
So there's a number of different domains these models
have been pre-set for.
So "General" means, 'I don't really know what it is.'
You can train it up, whatever you like.
There's a "Food" one,
which has been trained with food items.
So, if you're doing hot dog, not hot dog,
or apples vs. oranges, that's a great one for that.
They've got landmarks, they've got retail,
they've got adult, which is
great if you want to build content filtering.
So you want to build a social media app
or an image-sharing app,
and you want to make sure that people
are not sharing things they shouldn't be.
>> Got it.
>> Especially if it's one for adult content.
And you'll notice the bottom
three are repetitions of
three at the top, marked as "Compact."
>> Yeah, what does that mean?
>> So that means these models are small.
>> Oh, got it,
because models are like
usually the biggest thing in the world,
so how do you put that on a mobile device, right?
>> You couldn't until recently. Now, with Core ML on iOS
and TensorFlow on Android, you can take these models and
run them on device; they run on the GPU.
>> Got it.
>> So if you train the normal, general model,
that runs in the cloud,
where the resources are, you know,
bejillions of different specialist processors to run it.
That's really cool, but you can't do that on a phone.
>> Got it.
>> The compact model is not as good as the normal model,
but it's still good enough for what you need.
>> Yeah.
>> On a small electronic device, that is hugely important.
>> It starts compact,
then I assume that you're building up
the models because you're going to add
a bunch of images to it, right?
>> Yes.
>> Got it. Cool.
>> So I'm not going to create a new project.
I'm going to use one I prepared earlier.
>> Okay.
>> So, I've got one here that uses my daughter's toys.
And I've loaded up not that many toys.
So this is it. This is enough to do it.
>> Porgs?
>> Yep, some Porgs there.
She does love her Porg.
We got a cat, a cuddly duck.
We've got BB-8 and I got a cuddly tiger.
>> The question is, is the Porg or is
the BB-8 the favorite, currently?
Did she see "The Last Jedi"?
>> She's too young for that.
>> Oh, okay.
>> She turns five tomorrow.
>> But all we know is that the Porg is adorable.
>> She loves the Porg. She got really
upset the other day when she
learned that Porgs are not real.
>> Oh my, yeah.
>> Yeah. "I really want a real Porg, Daddy."
>> Spoiler alert.
>> Sorry, they're not real.
But yeah, this is all it took to train it.
So you give it these pictures,
and then you click the "Magic Train" button,
and that's enough for it to go off and figure
out: This is what a tiger looks like,
this is what BB-8 looks like.
>> What are these things, these tags
over here? Because, how does it know?
>> Right. So when you upload an image,
you give it a number of images and you give it a tag.
And that tag is what is in the image.
>> Oh. Got it.
>> So I uploaded this one
here which she calls 'Tiny
Tig,' and I've tagged it as "Tiny Tig."
>> Got it. So immediately then,
it knows what's not the tiger,
because it doesn't have those tags.
>> Yeah. So this one here has got a tag of "BB-8."
We've got the Porg here.
She calls it 'Porgy,' it's got a tag of "Porgy."
>> Got it.
>> So the tag's what you use to identify what's there.
You upload apples, you tag those as "Apples."
You upload oranges, you tag those as "Oranges."
So the tags are to say
"When you learn what this thing is,
this thing is called this.
This one is called this. This one is called this."
>> Okay.
>> That's kind of it, really.
>> So, what does "Train" do when you hit "Train"?
>> So, when you hit the "Train" button, this will go off,
and this will take the half-complete model,
and then specialize that model
based on the images that are there.
>> So, if I was to get another toy,
would I then Upload, and then retrain?
>> Yeah. If you want more images, you'd retrain.
And every single time you do a training,
you get a thing called an iteration.
So, where are they?
So, you have these different iterations.
>> Okay.
>> So, the very first iteration I had,
I had only trained with two particular toys.
>> Got it.
>> And the next iteration, I added another toy,
and then I added a couple more images of
a different toy, then trained it again.
So, when you do a prediction,
you actually see the images
used in the predictions appear here.
>> Oh, okay, cool.
>> And I can say, well, actually, this one here,
it predicted it was a Porg,
and I can then tag it and say,
yes, definitely, this is the Porg.
>> So, it knows.
>> So, it knows. So, especially
if it finds one doesn't recognize,
I can say this is this,
or if it's a new toy, if I want to take
photographs of the monkey here,
I could then add that one
as a new toy, and add a new tag.
>> Got it.
>> And then I click the "Train" button again,
it runs another iteration, and learns more about [inaudible].
>> Oh, that's cool. So, you can essentially keep training
the model as you're using the model?
>> Yes.
>> It's cool. That's very nice.
>> Yeah, because you kind of need
to specialize the model.
If you take a photograph of something,
and it doesn't know whether that's a hot dog or not,
you say yes it is, no it isn't.
You train the model again. So, you can keep
the model growing and improving.
>> Got it.
>> Keep making it more and more specialized.
>> So, how do I take this?
I mean, it seems like, here, I could
upload an image or something.
But how do I get it in my app?
Because like I care about putting
it inside of a Xamarin app. How do I do that?
>> That's what I care about. So, I do have a quick test here,
where I can just choose an image.
This is the image I took earlier.
>> Okay.
>> And this sometimes
doesn't come back straight away. Oh, it did.
And if it doesn't, I just click that one more time.
There we go, 100 percent probability it's BB-8.
So, everything is probability based.
It doesn't say, I think this is X.
It says, here is a probability that it is this tag,
this tag, this tag.
>> It's pretty good.
It's 100 percent sure it's BB-8.
>> 100 percent sure it's BB-8,
and it's never seen this image before.
So, now, if I get back to predictions,
you will see the image appears there.
>> Got it. You try and tag it. Yeah.
>> So, that's cool. But, obviously,
it's more fun in a mobile app.
>> Yeah.
>> Because you're going to be around the grandparents,
they're going to have the toy,
and they'll need the tagging stuff.
>> Yes.
>> And actually, it is
really, really easy to get this in.
>> I like easy things. I'm a big fan.
>> Yeah. It's a NuGet package.
>> Love that.
>> Literally, that's
Microsoft.Cognitive.CustomVision.Prediction.
It's a NuGet package.
You bring that in, you then
create an instance of a PredictionEndpoint,
you give it an API key that comes from Custom Vision,
you call PredictImage,
and you pass it a stream,
and that stream contains the image.
So, if you were to use, for example,
the media plugin that you've written,
that spits out a file,
we can then get a stream from that,
you pass the stream in here.
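For reference, a minimal sketch of the call being described, assuming the Microsoft.Cognitive.CustomVision.Prediction package of that era; the project ID, prediction key, class name, and GetBestTagAsync wrapper are placeholders, not Jim's exact code.

```csharp
// Minimal sketch: send an image stream to the Custom Vision prediction endpoint.
// ProjectId and PredictionKey are placeholders from your own Custom Vision project.
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Cognitive.CustomVision.Prediction;

public class ToyClassifier
{
    static readonly Guid ProjectId = new Guid("00000000-0000-0000-0000-000000000000");
    const string PredictionKey = "<your-prediction-key>";

    public async Task<string> GetBestTagAsync(Stream photoStream)
    {
        // The endpoint is authenticated with the prediction key from the portal.
        var endpoint = new PredictionEndpoint { ApiKey = PredictionKey };

        // Sends the image stream (for example, from the media plugin) to the
        // service; the result holds one prediction per tag, each with a
        // probability from 0 to 1.
        var result = await endpoint.PredictImageAsync(ProjectId, photoStream);

        return result.Predictions
                     .OrderByDescending(p => p.Probability)
                     .FirstOrDefault()?.Tag;
    }
}
```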
>> So, what is this hitting?
Is it hitting an endpoint, or what's happening?
When I call this predict image, what happens?
>> So, behind the scenes, when you call PredictImage,
it will go off to
Custom Vision pretty much
exactly the same as when I did the quick test.
You upload that image to the endpoint --
so obviously you have to have an internet connection --
it will then run this in the cloud and download the results.
>> Okay.
>> And you get this straight away.
So, if I just run this on my phone here,
we should see, hopefully, if the demo gods are with me,
we see it in action with BB-8 there.
>> And so this is going to go through,
and then this returns this.
What's an image tag prediction model?
Is that something you created, or is that coming back?
>> So, this comes back as part of the API.
So, as part of the NuGet package we
have these prediction models.
They contain the name of the tag and the probability.
So, when we predict,
we get back -- let me put
a breakpoint on here so we can see it.
So, if I just run my app here,
I'll take a photograph of BB-8,
so we're using the media plugin,
upload that, we hit my endpoint here,
and it's going to go off to the cloud,
do its thing, come back,
and then I really should-.
>> You've simplified your code so much.
>> I've simplified the code so much,
I can't see what's coming,
but if I run this, oh.
>> Don't know what it is.
>> Don't know what it is. Let's try this again.
Let's use the front of BB-8 this time.
Remember, these models don't understand what it is;
they understand images, they understand different colors.
So, there we go.
Probably because of the black eye on the front there.
>> Go ahead, and pull that up here.
>> So, let's bring them up.
It finds it's BB-8.
>> So, can you go and add a break point
then to see the result there,
if we do it again?
So, you're calling this from somewhere, this GetBestTag.
>> Yes.
>> So, I'm really interested in what's coming back.
>> Yeah.
>> There we go. Oh, okay, interesting. So, you're
refactoring code live like that?
>> That's the way to do it.
Refactor live. Okay, there's a breakpoint there.
>> Yes.
>> And let's run this again.
And we can see what comes back.
>> Because to me what's interesting is,
I'm really interested in the results of what
actually literally just happened.
So here, you get some predictions,
and then this just seems to be super easy code,
and then you're using LINQ
essentially in your query, like,
I want to find the highest probability,
or what are you determining for the-?
>> I want to find the highest probability.
>> Got it.
>> Essentially, but it's not just that. I've got a threshold on here,
because I could end up taking a photograph of you,
for example, and it thinks there's a
0.1 percent chance that you're BB-8,
and a 0.5 percent chance that you're a Porg.
>> Yeah.
>> And so-.
>> High probabilities as well.
>> Yes. I don't want that to come through,
I wanted to say I don't know who you are.
>> Got it.
>> So, I've got a threshold set at 50 percent.
So, I'm just going to run this again,
take the same photo,
and hopefully now, when I
hit this button, we see what comes back.
Essentially what comes back is the same as we got
from- there we go.
>> So, five predictions.
>> So five predictions. So the first one
has a probability of one.
So, it's a double that goes from zero to one,
one being a 100 percent chance,
and it comes out with the tag BB-8.
>> So, that's the tag that we saw
in the Custom Vision. Oh, Okay.
>> Yeah. And there we go. The Porg is a 1.4 times 10
to the minus 17 chance.
>> That one always throws me,
because you have to go back to your math class, right?
Because it's actually ten to the exponent of negative 17.
So, it's basically zero, to be honest, yeah.
>> Yes, pretty much yeah. And again
Tiny Tig is pretty much zero,
Baby Duck is pretty much zero, and My Mema.
>> So, do these get returned in
specific order or random from your findings?
>> They're in the order of the tags.
>> Okay, got it.
>> So, the way I have my tags defined here,
it does it in the order that
the tags were originally added.
As far as I know, things might change.
>> Try my face. We'll see what happens on my face.
>> Okay, let's try.
>> The beard, maybe the beard will be a porg.
>> Okay. Let's see what it says.
I hope it comes up as a porg.
>> I'm excited.
>> You're excited. Okay, so, what have we got?
>> I think it does come back in order of
confidence, actually, because there is a
2.3 times 10 to the minus 6 chance that you are a Porg.
>> Not bad.
>> You're going to be in the next Star Wars film.
Well done, you.
And 5 times 10 to the minus 12 that you're a tiger.
>> Okay.
>> And then, 4 times 10 to the minus 13 that
you're a small cuddly cat,
and then 3 times 10 to the minus 13 a BB-8.
So the closest thing you are is a Porg,
but you're looking at 2 times 10 to the minus 6.
So it's a very, very small probability that you're a Porg.
So, what will happen here is
I will filter out based on probability threshold and say,
"I don't know who you are".
>> Got it.
>> You're not at 50 percent or anything.
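A minimal sketch of the threshold-plus-LINQ logic being described; the helper name is illustrative, and the tuple stands in for whatever prediction type the endpoint returns, with the Tag and Probability properties used in the episode.

```csharp
// Minimal sketch: take the highest-probability prediction, but only
// trust it above the 50 percent threshold described in the episode.
using System.Collections.Generic;
using System.Linq;

static class PredictionFilter
{
    // Anything below this is treated as "I don't know who you are."
    const double ProbabilityThreshold = 0.5;

    public static string GetBestTag(IEnumerable<(string Tag, double Probability)> predictions) =>
        predictions
            .Where(p => p.Probability >= ProbabilityThreshold)
            .OrderByDescending(p => p.Probability)
            .Select(p => p.Tag)
            .FirstOrDefault();
}
```

With a photo of a face, nothing clears the 50 percent threshold, so the helper returns null and the app can say "I don't know who you are."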
>> Now, this required an internet connection
to make that call.
I mean to me, what if I'm in a place,
I'm in the middle of a cabin in the woods,
I don't have Internet connection,
I still want to do this. Couldn't I -- or is there
something that Custom Vision gives me to do that?
Because I know that built into both iOS and Android,
I have, like you mentioned earlier, Core ML,
which is Core Machine Learning from Apple,
and then TensorFlow, which is
essentially Google's version of that.
So like, do I get to use those,
or do I have to go reinvent the wheel?
>> Yes. You very much get to use those.
So, one thing that's very cool about
the Custom Vision service is you can export your models.
>> Oh, okay.
>> So, when I look at my iterations,
you will see there's an "Export" button on top here.
>> I like that.
>> Choose this. iOS 11 Core ML, Android TensorFlow.
>> Okay.
>> So I can download my models.
This is why I went for a compact model.
If I'd gone for the non-compact model,
I couldn't download it.
>> Because it'd be so big, you can't really upload
a gig app to the app store anyways.
>> Yes.
>> So.
>> I think there's a limit of like 500 megs of
model size for Core ML. I don't know about TensorFlow.
>> So, this model here, you download it.
Now, the nice thing is, I had assumed that
since Xamarin gives you direct bindings,
are you then just using
the Core ML APIs themselves? How does that work?
>> Yes. So I've got another version of the app here.
And as part of this app, there's
a NuGet package, a Xamarin plugin, for
making this really, really easy.
>> Oh, cool.
>> Because it's kind
of quite hard if you don't know how these APIs work.
>> Yeah.
>> So on iOS,
what you do is you download the model.
You then have to compile it; it comes as
an uncompiled version and you have to compile it.
I don't know why. I'm sure
someone who knows the API better can tell me.
I get this model, and then
what I do is load it.
So everything is all written for you;
all the APIs I'm using are
the Core ML APIs, accessed via Vision.
>> Yeah.
>> Which is the VNCoreMLRequest,
which is a request
that makes Core ML accessible through Vision.
>> Yeah.
>> It's all asynchronous.
So I get the image:
I take the stream I get
back and convert that to a UIImage.
I then pass that to
the Core ML request, this VNCoreMLRequest,
which is designed to identify things inside images.
It's pretty much built for this particular purpose.
>> Got it.
>> And then it has a handler
that calls back to return some results.
And these results are essentially
a list of classifications,
which are text-probability pairs.
>> Yeah.
>> It's kind of identical output. A different class type,
because it's an iOS API, but,
essentially, the output is the same.
Now, one thing to note about this "ToCVPixelBuffer":
these models don't actually understand images.
They don't know about images; they just
understand raw binary data.
So it doesn't see a picture as such.
>> The ones and zeros of the image, not the image itself.
>> Yes.
>> Yeah.
>> And it expects images to be a certain size.
Custom Vision works with images that are 227 by 227,
so I have to convert it to the right size,
and then convert it to essentially a pixel
buffer that just holds floats,
floating-point numbers for the RGB values.
When you pump that in there,
it runs the neural network,
does the magic in the magic box,
whatever the magic box does,
and then spits out your tags and probabilities.
>> Oh, cool.
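A rough Xamarin.iOS sketch of the Core ML and Vision flow Jim outlines; the model file name and the class are placeholders, and this version leans on Vision to do the resizing and pixel-buffer conversion rather than a hand-rolled ToCVPixelBuffer helper, so treat it as an approximation of the approach, not his code.

```csharp
// Rough sketch of the Core ML + Vision flow (Xamarin.iOS).
// "Toys.mlmodel" is a placeholder for the model exported from Custom Vision.
using System;
using System.Linq;
using CoreML;
using Foundation;
using UIKit;
using Vision;

public class CoreMLToyClassifier
{
    readonly VNCoreMLModel model;

    public CoreMLToyClassifier()
    {
        // The exported .mlmodel has to be compiled before it can be loaded.
        var modelUrl = NSBundle.MainBundle.GetUrlForResource("Toys", "mlmodel");
        var compiledUrl = MLModel.CompileModel(modelUrl, out var compileError);
        var mlModel = MLModel.Create(compiledUrl, out var loadError);

        // Wrap the Core ML model so Vision can drive it over images.
        model = VNCoreMLModel.FromMLModel(mlModel, out var wrapError);
    }

    public void Classify(UIImage image, Action<string, double> onResult)
    {
        // VNCoreMLRequest runs the model and calls back with classification
        // observations: tag text plus a confidence from 0 to 1.
        var request = new VNCoreMLRequest(model, (vnRequest, error) =>
        {
            var best = vnRequest.GetResults<VNClassificationObservation>()
                                .OrderByDescending(o => o.Confidence)
                                .FirstOrDefault();
            if (best != null)
                onResult(best.Identifier, best.Confidence);
        });

        // Vision handles scaling the image and converting it to a pixel buffer.
        var handler = new VNImageRequestHandler(image.CGImage, new VNImageOptions());
        handler.Perform(new VNRequest[] { request }, out var performError);
    }
}
```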
>> And that's really cool. And on Android,
it's essentially the same thing.
So, there is a TensorFlow inference library,
which is the equivalent of
this Vision-based Core ML library,
which is designed for processing images.
It's been wrapped, it's
been bound, there's a binding for it.
It's now available as a NuGet package
you can just go and get at the moment,
this TensorFlow for Xamarin package.
>> Oh, cool.
>> And I'm using that in my code.
Pretty much, it's exactly the same.
I have a bitmap, an Android bitmap.
I scale it down to 227 by 227,
and it spits out a whole lot of floats.
Pretty much I'm just building a big array --
>> Got it.
>> -- of floats essentially,
floats for the RGB values.
I feed it in, I run the model,
I fetch the output, so I get some outputs.
The difference between the models for iOS and Android:
Core ML has the tag text built in.
The Android one doesn't.
It just knows that the zeroth item
has a probability of X, the first one a probability of Y.
>> Got it.
>> But with the model, you get a labels file.
And this labels file just gives you the labels for
the tags, so you know that the
zeroth one is going to be the first label --
>> Got it.
>> -- and the next one is going to be the next label.
>> So it's kind of a nice-.
>> All on-device.
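A rough Xamarin.Android sketch of the TensorFlow flow being described, using the TensorFlowInferenceInterface binding; the input and output node names ("Placeholder" and "loss"), the asset file names, and the exact pixel-to-float conversion are assumptions to check against the model and labels file you export from Custom Vision.

```csharp
// Rough sketch of the on-device TensorFlow flow (Xamarin.Android).
// "model.pb", "labels.txt", and the node names are assumptions -- check your export.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Android.Content.Res;
using Android.Graphics;
using Org.Tensorflow.Contrib.Android;

public class TensorFlowToyClassifier
{
    const int InputSize = 227;   // compact Custom Vision models expect 227x227 input
    readonly TensorFlowInferenceInterface inference;
    readonly List<string> labels;

    public TensorFlowToyClassifier(AssetManager assets)
    {
        inference = new TensorFlowInferenceInterface(assets, "model.pb");

        // labels.txt maps output index 0, 1, 2... back to the tag names.
        using (var reader = new StreamReader(assets.Open("labels.txt")))
            labels = reader.ReadToEnd()
                           .Split('\n')
                           .Select(l => l.Trim())
                           .Where(l => l.Length > 0)
                           .ToList();
    }

    public (string Label, float Probability) Classify(Bitmap bitmap)
    {
        // Scale to the size the model expects and flatten to per-channel floats.
        // (The exact normalization the exported model wants may differ.)
        var scaled = Bitmap.CreateScaledBitmap(bitmap, InputSize, InputSize, false);
        var pixels = new int[InputSize * InputSize];
        scaled.GetPixels(pixels, 0, InputSize, 0, 0, InputSize, InputSize);

        var input = new float[InputSize * InputSize * 3];
        for (int i = 0; i < pixels.Length; i++)
        {
            var p = pixels[i];
            input[i * 3 + 0] = (p >> 16) & 0xFF;   // R
            input[i * 3 + 1] = (p >> 8) & 0xFF;    // G
            input[i * 3 + 2] = p & 0xFF;           // B
        }

        // Feed the image in, run the graph, fetch one probability per label.
        inference.Feed("Placeholder", input, 1, InputSize, InputSize, 3);
        inference.Run(new[] { "loss" });
        var output = new float[labels.Count];
        inference.Fetch("loss", output);

        var bestIndex = Array.IndexOf(output, output.Max());
        return (labels[bestIndex], output[bestIndex]);
    }
}
```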
>> So if I have an Internet connection,
I may want to use that because
that model might be updated, right?
>> Yes.
>> Because literally, I
mean, the nice thing is, if I'm offline --
what you're saying is not only can I use TensorFlow
or Core ML to actually
get it so I can be completely offline,
but if I use the online version,
I can continually iterate it,
and then also my model online may
be better than the model here.
>> Yeah.
>> So would you recommend a strategy of
maybe checking first locally
and then going online? What would be the best strategy,
combining the two, or one or the other?
>> It depends.
>> Okay.
>> I know, it sounds like a cop-out.
>> I'm assuming the Core ML is going to be
way faster than making a Web request, right?
>> The advantage to doing it in Core ML or TensorFlow
is you can do it over a live video feed.
>> Got it.
>> So, one great example I've seen for this,
I saw a demo at NDC Sydney last year,
where they were using ultrasound scans to
look for bladder defects in babies in utero.
So, as in the ultrasound,
they'll see certain defects.
>> Scanning it, yeah.
>> If you know what you're looking for.
And a model was built very
quickly to identify these.
Now, if you wanted to have that as a constant live feed,
so the live ultrasound feed as you're doing the scan,
if it spots it, it pings you straight away.
>> Yeah. It can't be making Web requests every frame, per se.
>> It can't, and if you think,
if you're a big flash hospital
in America, in the first world,
you have radiologists who can spot these things.
If you're out in the middle of Africa, for example,
and you're a very general doctor, maybe with Médecins Sans Frontières,
and you're helping to train up
the local midwives to do ultrasounds,
you've got an ultrasound
plugged into an Android tablet.
>> Yeah.
>> You can't make those Internet connections
to do the scans, so you put the model on the device.
>> Maybe Doctors Without Borders are going out,
you're going to scan something.
You do it right on the device.
>> Yeah, and you don't need to necessarily
train everybody to look for everything.
This device will find it for you.
And if it spots something, it will ping one of us
and we can come and actually scan it further.
>> Yeah.
>> So, it depends on what the use case is.
If you want to have a live video feed,
you probably want to do it on-device.
>> On-device.
>> But if you want to continue to improve the model,
then you want to have it remotely.
Of course, you can always just redownload the model,
yeah, and redo it.
>> And like you say, you can mix
and match, so you can kind of work it out.
>> That's awesome. I love it.
I think it's a really cool use case
more than just hotdog or not,
and is all the source code on GitHub, or how do we get it?
>> Yep. So, all the source code for this is up on GitHub.
We can put it in the show notes, and the NuGet package as well.
>> Put it in the show notes below.
>> Put it in the show notes below.
>> Awesome.
>> It's all there and it's so easy to use.
>> I love it. It looks easy. I really
like it. Awesome, Jim.
>> Two lines of code.
>> Yeah.
>> The image is recognized.
>> Well, thanks for coming all the way over
to show off this Custom Vision AI.
I love it. I'm going to go become
a machine learning expert now.
So thanks again, Jim, for coming on.
>> Thank you for having me on the show.
>> Yep. And thanks everyone for tuning in.
It has been yet another episode of The Xamarin Show.
Make sure you subscribe up over there,
over there, down there, ding
the bell, you know what to do.
Subscribe so you get all the latest episodes
right in your inbox,
and feed, each and every week.
Until next time. I'm James Montemagno,
and thanks for watching.