>> Good morning, everybody. >> How are you guys, awake or sober?
>> Who slept in this room last night, and that's the only reason you are here?
>> One guy. >> JAIME FILSON: Okay. So this is GitDigger,
I'm WiK. >> ROB FULLER: I'm Mubix.
>> JAIME FILSON: So last night, at random, well, not random for Mubix, but we ran into
a taxi line and decided to go with him over to Pawn Stars. Everyone knows Pawn Stars?
So inside, we're walking around. We're looking at the souvenirs and all of a sudden we notice
this kiosk. Everybody is using it. What's that?
Well, we walk up to it and it has a camera. You can take a picture of yourself and they
allow you to log in with your user name and password to Facebook, Twitter, to send an
image to yourself or to tweet it out to the public.
(Chuckles). So I email to myself. I'm not giving them
anything. And this is the result on the screen.
>> ROB FULLER: Legit, right? >> JAIME FILSON: So I did most of the research.
I did all the research! >> ROB FULLER: That's me.
>> JAIME FILSON: Yeah, that's him. >> ROB FULLER: So we are not the first ones
to make wordlists. Sebastian French something, he's an awesome guy. I'm not trying to make
fun of him, and also all of Matt Weir's stuff. If you haven't used his keyboard dictionary,
it's one of the best ones for finding people who just walk their fingers along the keyboard.
And to the other people who make awesome wordlists: you rock. Moving on.
>> JAIME FILSON: So we weren't the first ones to go digging through source code. SVN digger
was released; they went through a ton of SVN repositories and then published the frequency
count of all the files and all the directories that they found and pulled down ‑‑ I forget
exactly where they pulled them down from.
>> ROB FULLER: Google Code. Just to point out really quick, if you take a picture of
that QR code, we are not trying to hack you. It's linked to the information.
>> JAIME FILSON: I made them, not him. So they are good to go.
>> ROB FULLER: The only problem with using Google Code and stuff like that is they like
to put these CAPTCHAs in, which makes it hard to automate stuff.
So this is ‑‑ >> JAIME FILSON: So this is how everything
got started. 2:00 in the morning, somebody posts a link to SVN digger. Everybody thinks
it's cool. I haven't seen anything like it before then. And Rob was like that's awesome.
That one line, that's why he's standing up here right now, because of that one line of
code. So I'm like, oh, this is awesome. I can do this crap, 30 minutes or so, I will
go to bed, wake up in the morning and the code will be done and I will have an awesome
wordlist. So my first problem was that I couldn't find ‑‑ at 2:00 in the morning, mind
you ‑‑ a good way to get all the repositories. So I went to GitHub's list of the most-forked
projects and used some basic Python to start web scraping it. I'm saving the user names and
project names in SQLite, and then I just set my computer loose cloning all the repositories.
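A rough sketch of what that first scraper could have looked like ‑‑ the HTML pattern, table layout, and function names here are assumptions, not the speaker's actual code:

```python
import re
import sqlite3
import subprocess

# Pattern for "user/project" links on a listing page -- the exact HTML
# layout is an assumption, not GitHub's real markup.
REPO_LINK = re.compile(r'href="/([\w.-]+)/([\w.-]+)"')

def make_repo_db():
    """SQLite table holding one row per (user, project) pair."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE repos (user TEXT, project TEXT, "
               "UNIQUE(user, project))")
    return db

def scrape_page(html, db):
    """Pull (user, project) pairs out of one listing page into SQLite."""
    db.executemany("INSERT OR IGNORE INTO repos VALUES (?, ?)",
                   sorted(set(REPO_LINK.findall(html))))
    db.commit()

def clone_all(db, dest="/data/repos"):
    """Set the machine loose cloning everything that was scraped."""
    for user, project in db.execute("SELECT user, project FROM repos"):
        url = "https://github.com/%s/%s.git" % (user, project)
        subprocess.call(["git", "clone", url,
                         "%s/%s_%s" % (dest, user, project)])
```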
So now what do I do with it? I have these repositories. I'm using os.walk to go through
each repository and keep a count of the file names and the directories. I'm doing
a whole lot of sed, grep, awk, just trying to clean everything up and make it nice and easy.
There was a ton of manual review, because I thought it would be easy to go through and
pull out all the user names and passwords, and email addresses I found in this code.
So I spent about 17 hours total on my 30-minute project, all kinds of hours trying to pull
out user names and passwords, and I've got a mile-long sed line that I just copy and paste
and come back to later. So os.walk was taking forever to go through
and find everything. I thought, there's got to be a better way to do this. After some
Google-fu, I found betterwalk, which claims that os.walk makes unnecessary OS calls ‑‑
is this a folder, is this a file? We don't know, API, please tell me ‑‑ and cuts those
out of the loop, which speeds things up to two and a half times.
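betterwalk's trick was later merged into the standard library as os.scandir (PEP 471): directory entries already carry a file-vs-folder flag from the OS, so the walk can skip the extra stat() call per name. A sketch of a count done that way:

```python
import os

def fast_walk_count(root):
    """Count files and directories using os.scandir, which reuses the
    type flag the OS already returned with each directory entry
    instead of issuing a separate stat() call per name."""
    files = dirs = 0
    stack = [root]
    while stack:
        for entry in os.scandir(stack.pop()):
            if entry.is_dir(follow_symlinks=False):
                dirs += 1
                stack.append(entry.path)
            else:
                files += 1
    return files, dirs
```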
So the good news is, I've got some awesome wordlists. And I posted them out on IRC. Everybody
loved them. I was like, great. But the bad news is I only have some repositories. I have
maybe the most popular repositories and that's it. SQL transactions were extremely slow.
It took maybe about 30 seconds to go: is this already in my table? Yes? Okay. Let's add
one to the count. And the 17 hours of manual labor, really sucked
because I am the laziest *** on the planet. If I could have got my goon to carry me in
here, I would have. And my hard drive was full. I've had terabytes
of this data. So everybody liked it. So I'm like, okay, let's get a little serious. How
can I make this better? How can I streamline it? How can I not do 17 hours of manual labor.
First problem, storage. How am I going to store all the data? So my first thought, I
did some Googling and found Bitcasa: awesome, $99 a year, unlimited space. Built‑in indexing
so I can give people access to all the code and they can search for whatever in the world
they want and get it. At that time, six months ago, at that time,
there was only a Windows client. It crashed every time I tried to launch a robocopy or
just simple copy and paste, and it was extremely slow, because they encrypted all the data
on the upswing. So what might have taken me six days to upload a terabyte with my slow
*** connection would have taken, like, a month. The next option, which I thought was the option
was to have a NAS. Everything was stored in one place. It was protected. I could download
directly to it, but it's hard to get free money for these things. So I had three terabytes
already. So my solution: right there are the first ten terabytes of all the data.
(Chuckles) >> ROB FULLER: That's awesome!
>> JAIME FILSON: So the next problem is how can I make downloading these repositories
better, easier? How can I get all of the repositories? So when I was actually awake, I found the
API, which I felt incredibly stupid not knowing about. And it's nice because the API gives
you all kinds of nice, useful information. The only thing I haven't found: they will
tell you it's a fork of a project, but they don't tell you which was the main project, who
it was forked from. So I can keep track of how popular a project is, but I have no idea
which guy was the original. So database, SQLite sucks really bad when
you are trying to store a lot of data. I switched to MySQL. I've had questions in the past
about why I didn't use PostgreSQL ‑‑ I know MySQL, and again, I'm lazy. I didn't want to learn
something new. So let's put this all together now. So now
I have two main scripts. I've got the first Python script that's threaded, goes through,
downloads all the data. It's got another mode that will go through and process all of that
data. And then I have another script which I will talk a little bit more about that just
takes a long list of user names, passwords, email addresses, and I pass it to the table
name and it just goes and dumps all the data into that table. In the MySQL database, I
created a table to keep track of more project information, and the user names and passwords
and everything now have their own tables.
And I'm keeping track of the last seen ID so that I don't have to start over or repeat
myself. So here's how the downloading works. Downloader
goes out to the API and says, give me 100 repositories. I saw ‑‑ I have already
seen 5,000. So GitHub comes back at you and says, okay, here's the next 100. So it downloads
it, dumps it into the database that I've got it and then automatically clones the repository
to my hard drive. Unfortunately, the processing got a little
better, but there's still a lot of manual work. So now the processor mode is checking
my database, going, okay, I don't have this repository, but I know it exists. It downloads
it. Great. Or it goes through and auto‑pulls it. It does a betterwalk on it.
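The download loop just described maps onto GitHub's public `/repositories` endpoint, whose `since` parameter takes the last repository ID already seen ‑‑ which is what makes the crawl resumable. A sketch, with the per-repository handler left as a placeholder:

```python
import json
import urllib.request

API = "https://api.github.com/repositories"

def page_url(last_seen_id):
    # `since` = highest repository ID already processed, so the
    # crawl can resume without repeating itself.
    return "%s?since=%d" % (API, last_seen_id)

def crawl_page(last_seen_id, handle_repo):
    """Fetch the next page of public repositories and hand each one
    off (store it in the database, then clone it); return the new
    last-seen ID to persist for the next run."""
    with urllib.request.urlopen(page_url(last_seen_id)) as resp:
        repos = json.load(resp)
    for repo in repos:
        handle_repo(repo)          # e.g. insert into MySQL, git clone
        last_seen_id = repo["id"]
    return last_seen_id
```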
And now if you notice the red line, that's all of my manual work. So I have to grep all
of this data, pull out user names, passwords, emails, RSA keys, all kinds of fun stuff,
and then clean it up ‑‑ one day's grep session can take four days for me to go through
and clean it all up and dump it into the database.
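That grep stage boils down to pattern matching over every text file; here is an illustrative version in Python ‑‑ these patterns are examples, not the speaker's actual expressions:

```python
import re

# Illustrative patterns only -- not the actual ones used in the talk.
PATTERNS = {
    "password": re.compile(r"""password\s*[:=]\s*['"]?\S+""", re.I),
    "email":    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "rsa_key":  re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}

def scan_line(line):
    """Return which kinds of secrets a line appears to contain."""
    return [name for name, pat in PATTERNS.items() if pat.search(line)]
```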
And then I have a Bash script that will connect to the database and dump everything and create
the wordlists and automatically send it back up to GitHub which is a real irony. I'm downloading
all of their data and yet storing it on GitHub. So the updated news. I now have all the repositories.
I can now get every single public one. Generating the wordlists with Bash script
takes minutes once everything is in the database. Because of the updates I did to the database,
I can store the repositories. It will tell me which one to go to get. The sucky part
about that is if I want to go back and grep for more stuff, I have to get this giant hub
and plug all of these hard drives in at the same time.
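The export step is a Bash script in the talk; sketched here in Python instead (table and column names are assumed), the idea is just to dump each table ordered by frequency so the most common entries sit at the top of the wordlist, where crackers try them first:

```python
import sqlite3

def dump_wordlist(db, table, path):
    """Write one table to disk ordered by descending frequency, so the
    most common entries come first in the wordlist."""
    with open(path, "w") as out:
        # Table name interpolation is fine for a private script; it
        # would be injectable in anything user-facing.
        for (word,) in db.execute(
                "SELECT word FROM %s ORDER BY count DESC" % table):
            out.write(word + "\n")
```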
>> ROB FULLER: It's awesome. You should see it.
>> JAIME FILSON: Yeah. I'm estimating that it will take about 30 terabytes to download
all the repositories, however, I'm pulling that number out of my butt based off of the
first ‑‑ the amounts of repositories I got from the first 10 terabytes, because
everybody is uploading new stuff every single day. I could probably continue with this project
forever and never see the end of GitHub. >> ROB FULLER: So this is the big data drinking
game. If you just heard me say, "big data" drink, but you guys are all hungover. So I
won't ask you to do it. So obviously this is a build up to the actual
wordlist. What did we get out of it? So anyone with kids knows exactly how this goes. So
how does this go. Dun, dun, duuunnnnn! You can get the movie and just fast forward it
to that part of the movie. It's the best part. >> ROB FULLER: These are pretty straightforward
lists but the cool thing is what we see inside of them and we're not just talking about password
lists. That's the obvious use, right? I'm going to have a set of passwords that I'm
going to use against it. The all directories list and all files list is awesome, when you
are talking about web application attacks, and the user names. I didn't know that so
many people loved Bob, but they do. More than admin. So stats.
Pretty pictures. >> JAIME FILSON: I promise, this is the only
stats slide. I just wanted to give an overview of how many passwords are in the database,
versus how many are actually unique to each section.
>> ROB FULLER: So this is where it gets relevant to what I do. I'm a senior red teamer and
one of the things ‑‑ I just break stuff. I already talked about forced browsing. The
SVN digger kind of started that whole thing. The great thing about forced browsing is when
you get a set of the directories or wordlists or stuff like that, you can just exactly like
DirBuster, you can go through and find it. You can use these wordlists with DirBuster.
The small default password wordlist is not exactly what I would have expected as the
default passwords ‑‑ it starts with root, toor, blah. Static salts,
it's hilarious when you have a salt for passwords and then that repository is used as an application
out there in the real world. >> JAIME FILSON: I actually stopped pulling
out static salts, because there's so many! And I'm never going to get this done in time
to do a CFP on the project if all I did was pull out the static salts.
>> ROB FULLER: So five minutes? So number 22 on the list of files is exception.php.
I never, ever, looked for that when I was looking at a web application, even a php one.
But after WiK had done his research and shared the list, I got code execution because it
was loading the exception information and you could identify any list you want. That's
brute force browsing. And this is pretty awesome. This is one of
my favorites, NTLM SSO magic, do you know what that does? It has your user name and
password statically assigned in there. So it does NTLM.
All right. So real world stuff? Anyone see this release? The secret tokens for rails?
If you have a secret token stored in your repository and it's also used in your production,
without you changing it, it's direct remote code execution.
So this is the gentleman, and I'm going to butcher his name ‑‑ I won't butcher his
name. He sent out an email to all 1,000 users who had this in their repositories.
>> JAIME FILSON: I'm much too lazy to do all of that.
>> ROB FULLER: You start parsing every file from the git repository's history. Right now
WiK isn't, but if you store your password and then, like the gentleman just said, remove
it, you can still go back in the history if you don't nuke it.
Mass static code analysis: You can find a
ton of things really quickly. And .svn, when you convert an svn repository
into a git repository, sometimes people forget to delete those things, and they can have
configs, including database configs and all kinds of things. .gitignore is an amazing
little file that tells your git repository which files never to commit.
Those are exactly the files that I want to look for. Because those are the things that
are important. So I usually look for that. 403 on an empty directory: on GitHub, or in
git as well as SVN, it doesn't let you create a directory and commit it unless there's
something in it. So empty placeholder files and .DS_Store files are usually how some people do it.
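Acting on the .gitignore idea could look like this sketch: parse an exposed .gitignore and turn its entries into paths worth probing, since ignored files are often the configs and secrets that never got committed but do exist on the deployed server. (Simplified ‑‑ real .gitignore syntax also has globs and negations this ignores.)

```python
def gitignore_targets(text):
    """Turn .gitignore entries into candidate paths to probe."""
    targets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        targets.append(line.lstrip("/"))  # make entries URL-relative
    return targets
```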
Another thing is running OCR on all the images. We actually found a gentleman or a girl that
had their password stored in an image for their repository. It was awesome!
Using the list of text files, grepping out all the emails, which he already does ‑‑ and
I'm stopping there, because that gives you all the ideas, and we're done!
(Applause). >> JAIME FILSON: Thank you.
I actually want to give a quick thank you to NoVA Hackers. Are there any NoVA Hackers
in the room? >> ROB FULLER: Boo! You all suck.
>> JAIME FILSON: They suck. But without their help and support, encouragement, I would have
never kept going with this project, because they helped me out with resources. I now have
a file server which can store up to 34 terabytes of data. So once I get the original 10 terabytes
switched over, I'm going to start downloading, and pulling out some more stuff.
>> ROB FULLER: Cool stuff? No? Everyone is waiting for the next talk? Questions? All
right. Cool. >> JAIME FILSON: Thanks. Thanks for coming.
(Applause) >> So for those of you filtering into the
room and looking, Made Open Hacking is about to start in ten minutes. The schedule for
Track 2 is really messed up. There are some tracks that didn't even make it on to the
schedule. Please stop by the Information Booth if you want an updated schedule in about an
hour. They're getting PDFs printed right now and they should have them in about an hour.
If you want the schedule right now, the one on the website is the most up to date, however,
they are doing a weird thing where they are telling you the ends of talk and not the start
times. So the start time of a talk is 10 minutes after the one preceding it ends.
Yeah.