>> It's like Saturday morning at eleven o'clock. Is everyone awake yet? Have you had enough
>> NOAH: What is it? Saturday? >> Yes.
>> NOAH: I've been here since Friday of last week. I got sunburned. It's right there. You
weren't paying attention. >> Raspberry rum. It goes good with coke.
So welcome to the Dark Arts of OSINT. This would be Dr. Noah Schiffman, aka Security
Freak. He is the academic of our team, the one who actually finished college.
(laughter) Yes, he is way more intelligent than I am,
a snappy dresser, and an absolutely wonderful guy. (Ring)
>> Dude, that's so not cool. >> SKYDOG: Could I have a red shirt kick the
*** out of this guy? There we go. Like they didn't tell us that earlier.
And I'm SkyDog, of course, by the picture there. We are part of the Dead Bunny Club.
It's the pseudophilanthropic arm of everything SkyDog does. We got together. I met you a
couple of years ago and we found that we're fast friends and we have a lot of fun getting
together and getting into trouble. >> Sometimes a little more than friends.
>> We weren't going to talk about that. I took that out of the presenter notes, dude.
>> NOAH: Sorry. >> This is my 11th year, back in the AP days.
Round of applause. Everyone's a n00b. >> NOAH: I just heard about DEF CON two weeks
ago. >> SKYDOG: So I get to celebrate, ironically,
at my 11th year. I've actually been a goon for nine years. For my 11th year here
I get to celebrate three firsts. Don't worry, I haven't lost my virginity.
>> Don't worry, it will happen soon. >> SKYDOG: I understand I have to talk to
a girl, though, and I'm not ready for that. (laughter)
(applause) >> So the first one it was really wonderful.
My son got to participate in DEF CON kids. I'm old enough now that I have offspring.
>> NOAH: Cooper. >> SKYDOG: He placed fourth in social engineering
and second in hacker jeopardy. (applause)
>> SKYDOG: My second would be my first Mohawk ever. I got to participate in Mohawk Con this
year. Round of applause for those guys. (applause)
>> SKYDOG: I had to leave Vanderbilt to make that happen. This is an honor. I did find
out they require you to submit a paper. I didn't read the fine print, but here we are.
We're talking about our live demo. >> NOAH: There is this live demo thing that
we maybe kind of discussed in the CFP. Well, I don't know how many people here are familiar
with something called Matlab or R or ‑‑ I don't know, other letters of the alphabet.
Yes, what's your favorite letter? So I didn't have a licensed copy of Matlab and went with
Octave and got into a battle with Octave and they won and I lost. We're doing a different
kind of live demo that's sort of audience‑participation based. It will be really fun and everyone
will get to meet people sitting next to you. It will be a fun icebreaker opportunity ‑‑
no, it's not. But it's going to be a demo that we can all participate in and make a
point. I hate Octave. I hate it. Ready?
>> SKYDOG: Get loose. Here we go. So our talk today is about the Dark Arts of OSINT. So
the path we'll take, we'll talk about what is OSINT? We're going to move on to ‑‑
Evan, if you call me again, I'll *** kill you. I swear. *** kill you.
I digress. So we'll speak about what is OSINT. We'll talk about some acquisition tools and
techniques. I am then going to sit down and the guy with the math background is going
to speak about anonymizing data. You don't remember? Uh‑huh. I'm going to leave the
stage and Noah is going to speak about anonymizing and de‑anonymizing data.
Open source intelligence. Thank you for putting the pause in there. Did you get the transitions
in there? >> NOAH: I did some. I forgot.
>> SKYDOG: The cool one that wipes? >> NOAH: It dissolves.
>> SKYDOG: You pay for the dissolve. So what is open source intelligence? Essentially open
source intelligence is anything out there that you can reach without having to be a
LEO or something similar or belong to a large organization that requires paperwork to get
to it. Anything you can get to online or readily available. Why do you care? You had a picture
taken by some *** with a camera, not one of our photographers, but someone with a phone
or whatever. Guess what, you're now hooked up with open source. The information is out
there. You appear in a picture. Now it's something I can catalog and index. So congratulations.
>> NOAH: Prism. >> SKYDOG: Weren't going to talk about ‑‑
>> NOAH: Sorry. >> SKYDOG: And so how can it be optimized?
We're looking at big data sets and crunching the numbers and actually extracting some information ‑‑
some interesting information out of what's available, readily available.
So OSINT comprises many things. One of them will be text, whether it is e‑mails that
you sent back in '73 when you were talking about something bizarre? Did you send anything
back? Never mind. I've gone back and found things I've done on forums way, way back in
the day using a different name that I was able to actually find online, things that
probably would have shown me how ignorant I was at the time. But anyway, you have text
that's out there that could be searched for. You also have imagery. We have Facebook. We
have appearing at DEF CON. Whether you realize it or not, you probably had a picture taken
of you at some point in time, video. I think last night Evan played the VR system and the
robot, which is an absolute hoot, which will appear on YouTube with some captioning later
on. The Black Hat robot. We have audio. The video we have here of this
presentation will be available on DVD later. They also put the audio up of that
so if you're not into driving and looking at your iPhone, you can listen to the audio.
And then you have geospatial, which would be the images you take from a device that's
GPS enabled and records your longitude and latitude and fun things like that, other information
that doesn't always get removed from imagery when it's put online.
There is a certain signal to noise ratio. If you've been online, if you're looking for
data, there may be some really bizarre things. Noah never lived in Henderson, Nevada, but
for some reason my name and phone number are associated with it. There is certain information
out there that doesn't really fall into place correctly. You have to go through and decrease
the noise to get the true signal. So out of that, once you clean up enough data,
you're able to go through and put enough things together, layer them together, find where
the high points in the graph appear. You will find actual data. Anyone in the law enforcement
community, which I am not, anyone who is in that community realizes that when enough data
is collected, it becomes actionable. Then it becomes intelligence, something that can
be used to actually do something. (Prism) Sorry, I got a little cough there. Furball.
>> NOAH: No, I don't want to drink any more. >> SKYDOG: Wait until you get on stage. Media
had newspapers. Clippings from other parts of the United States, write a report on it.
We moved into the radio age. Things were transcribed and cataloged and indexed. The search time
on information like that was a little long, if you want to complain about Oracle or
MySQL. It got compressed down to videotape and things of that nature.
Like I said, I recently worked for Vanderbilt. They have the largest compendium of news broadcasts.
They go back farther than anyone else. That information can also be searched by metadata.
Of course, we're down to the Internet age where every *** can get out there and
dance and then put online their robot at a large security conference. That's coming back
to haunt you, *** hat. So the evolution is news sources, of course,
with radio and print. Then we move to government repositories. For some reason they decided
it would be a good idea to collect information and store it. Who knew?
Then you went to academic publications where they sorted data and put it together. Theoretically
they anonymized it. Now we've moved into the electronic databases where we know everything
about you. Those are sexy. Those will get you laid, definitely.
So the current forms and uses of OSINT are definitely tool sets, websites you can go
to, and of course databases you can get your hands onto, depending on what your flavor
is. Who has ever used Maltego? Show of hands. Cool. Okay. So Maltego is basically used ‑‑
you put a click. Yeah. Next time I'll let you do this, Bart.
Maltego is used to dig down on an organization. You can look at WHOIS records and DNS and
IPs and e‑mails and things of that nature. I'll have someone else come up here and stomp
your ***, too. Maltego is really good for drilling down on
a company by looking at e‑mail addresses and things to compile a large amount of data.
Who has used FOCA? If you haven't played with FOCA ‑‑ FOCA is a lot of fun ‑‑
basically it looks at the metadata in Microsoft Office documents, PDFs. It will do OpenOffice.
It looks at the metadata in pictures. You can begin to compile information just in the
hidden information in all the documents you can get ahold of.
Randy from accounting puts out some sort of a document and inside that it contains information
about where it's stored on the local network, and it actually makes it to the outside world
and gives me some information about how the interior network is built.
So that one's a really nice fun one to play with. SearchDiggity. Anyone use that one?
>> NOAH: Not in my backyard. >> SKYDOG: Do what? Apparently SearchDiggity
isn't used as much as everyone would like. It basically is another form of being able
to sift through data. It takes information from Bing and Google and compiles it into
a nice interface to get to it. A lot of nice pieces of software. Anyone heard of Recorded
Future? This is one of those that makes you kind of cringe a little bit. It's a temporal
analysis engine. It forecasts and does analysis to predict future events based on information
from social networks and patterns that they can find.
They're able to go in and put some information in and actually determine what could possibly
happen based on information that's flowing right now. Of course there's Facebook. Who
has put their music preferences on ‑‑ who uses Facebook? It's all right, we're among
friends. You can raise your hands. Big mistake. Could we get a picture of that?
So if you've put onto Facebook, hey, I like REO Speedwagon. For all the young guys in
the crowd, that's a rocking band. Well, I can go back in with graph search now and say,
hey, I want to know anyone who lives in Tennessee who likes REO Speedwagon. And then I can mine
some data out, and I guess give you a jingle and say, hey, why don't you sit around and
listen to records, at which point you would probably run.
There are things actually being put out there now for you to be able to look at the
data and try to grind through it. There are other websites, Social Mention, Spokeo, I
have my own personal preferences on what to use. Johnny Long isn't here, but who has ever
seen the Google hacking database? Okay. So a bunch of things that people have put together.
If you're looking for certain types of information, they've put query structures together for
you to use. This is what it's like to hang out with Noah at any point in time.
Basically you have three different types of public data. You have cooperatively provided
data, which would be this is my name and this is what I like, which is social networking,
it's what I put on Facebook. I like REO Speedwagon and Smurfs. It's things that you put out there, your personal preferences
or posts that you make, that can be mined and looked at, but you've willingly given it up.
Did I say that right? >> NOAH: Yes.
>> SKYDOG: Okay. Just checking. Things that are confidentially provided, a session ID.
I had to log in to give that information. I filled out a questionnaire or survey. I
said, "Yes, I'm more than happy to allow you to look at this information." I put something
in there enough that it's very identifiable, be it my address, my phone number, my credit
card, things of that nature. So you have to actually ‑‑ it's a site with a privacy
policy where you say I agree to it. So you've given that information up and you've agreed
to their legal statement there. Then you have the unknowingly provided or
where did they get this from? It's the DMV records. It's other information. Maybe it's
your medical records or how the *** did they get my APGAR scores? He was slow at birth
and it never got better. Government and academia ‑‑ whoever participated
in something in college where you paid $20 for an *** probe or something for research.
So they take that data and they put it into a database and they put it online. Theoretically
your name's not associated with it. So who publishes these data sets? A lot of
times it's government. There's academia. There is a commercial market for data that's been
pieced together. For a certain fee, you can go in and cruise through that data. The more
you pay, the more granular your data becomes and the more revealing it is.
Why are these data sets published? For statistical analysis, we want to go back and look at the
information and do some predictions. Looking for trends and patterns that are out there.
Retrospective outcomes. We struggle trying to find the proper example of this. We decided
on which is better, *** or Cialis. We look at the information and see the satisfaction ‑‑
I guess that's not the right terminology. >> NOAH: I said ***.
>> SKYDOG: A buddy of mine, I swear, said Cialis.
>> NOAH: It was a friend of mine, too. >> SKYDOG: No, no, it wasn't.
>> NOAH: Evan? >> SKYDOG: Where did Evan go? He's hiding.
That's good. So of course this information is used for
decision‑making for future things, maybe it is product design or coming up with something
new, whether it's actually going to be popular in any way, shape, or form.
A lot of the things that are using here, the tools on the websites, I don't do the math.
That's this gentleman's side of things. Occasionally I get asked to find things. Who in the crowd,
who finished high school? Show of hands. It's okay. All right. Who went to college? Now,
who finished college? Okay. This is your crowd. So anyway ‑‑
(laughter) Do you want to do that? No. So I did not finish
college. I had a hell of a lot of fun while I was there, per my GPA, but what I did not
learn while I was at college is what you can and can't do. It was not taught out of me,
oh, you can't do it that way. So I never heard that before. And I don't pay attention to
it, so it makes it a lot easier for me to do some things like drill data on somebody.
Occasionally I'll get a phone call and I'll get a couple pieces of criteria, and they
say "find someone." And I've become very adept at doing that using open source.
Is anyone staying at the Bellagio? Cabana by the refrigerated pool. If at any point
in your lifetime you can make that happen, definitely do it. I'm in the sun. I've got
the MacBook Air with me. I'm trying to get on the *** wireless there that doesn't
work, and there's a gentleman to my immediate right. He notices I have a computer, which
for all of us is the sticking point to, yeah, dude, my computer at home doesn't work. Who
has ever answered that question? I'm in a swimsuit by the pool and a guy starts
talking to me. Okay. I'll bite. No problem. So we start discussing China, politics, the
economy, fun things like that to really make you happy. We have a few drinks. And he says,
"You know, so, you're in Vegas. Are you here for business or pleasure?"
And I said, "Currently for pleasure." I would think that's the case if I'm by the pool.
And he says, "So you're here for pleasure. That's good."
And I said, "Well, actually, in two or three weeks I'm coming back out to the largest hacker
conference in the United States called DEF CON."
And you could hear his *** pucker in the seat.
(laughter) That's one of those things where who in the
crowd hasn't had to explain what that means? Put your hand down, *** hat.
So I began to explain what DEF CON is. Since we didn't have the documentary, it was very
interesting to explain it to him. >> NOAH: It's the hearing impaired con.
>> SKYDOG: I got to explain to him what we do and why we get together for all that. And
then his *** friend shows up. He had come to Vegas to go to the Pawn Stars place downtown.
And he said, "Dude, I got to meet Hoss. Okay, let's go get a steak."
We're going to head off to get a steak at so‑and‑so place, nice meeting you. Later.
I said your name is Brian, and your family owns a civil construction firm in Seattle,
Washington. And the guy says, "Yeah." And I said, "I'll send you an e‑mail to
your work e‑mail within the next 48 hours." Again, you could hear his *** pucker.
And I said, "Don't worry, I'm gonna show you. I have two bits of information on you. I don't
have your last name. I don't have much more than that, but I'm gonna send you an e‑mail
and show you what's possible." So we went out and had a nice dinner, went
out to the pool the next day. At some point I thought, I got to go find Brian. So I sit
down on the bed and fire up the laptop. In 45 minutes, I owned this guy. I have where
he lives. Pictures of his house, what he paid for. Pictures of all of his relatives. I then
took it upon myself to scan the exterior of his network and tell his system administrator
you probably should change this; it's not good to have this open.
Brian never responded to the e‑mail, oddly enough. I didn't send him an invoice. I did
it gratis. But that's a good example of I had two bits of information on the guy. Fortunately,
one of them was unique enough, it allowed me to find him. I was able to correlate civil
construction, oddly enough, against the YouTube video which I was able to pick this guy out
in, and from there just went to town on him. So I guess if you get an e‑mail from a guy
that you met by the pool who is a hacker and he says he has a picture of the house from
the driveway, it might be a little unnerving. >> NOAH: Was that legal?
>> SKYDOG: I don't give a ***. (laughter)
I don't have to have a court order, and apparently no one else does.
(laughter) Anyhow, the open source side of it can be
a lot of fun. One of the things that Noah is going to discuss is finding outliers in
the data. Brian had enough for me to be able to find. Had he said my name is John, it would
probably be a little bit more difficult. If he says, yeah, I work at Starbucks, not as
much of an outlier, but it took me about 45 minutes to track him down. If you ever get bored and you're by the pool
at Bellagio, just wait for someone to come by. It's a lot of fun.
>> NOAH: You like talking to guys at pools, don't you?
(laughter) >> SKYDOG: Have you ever been given a wedgie
on stage? (laughter)
>> SKYDOG: You take the microphone. >> NOAH: Wow.
(laughter) >> NOAH: Sky claimed that I'm gonna talk about
a lot of things that I don't know where he got that from, but ‑‑
>> SKYDOG: You were really, really drunk. >> NOAH: I know a little bit of math, some
basic addition, subtraction stuff. I'm not going to talk about anything hard or advanced,
because that's for smart people. A lot of these slides ‑‑ hello? Hello? Where is
the echo. I don't like that echo. >> SKYDOG: I picked them out of other people's
sets. Have fun. >> NOAH: Dammit.
Okay. These slides are semi new to me, but I think I did make them. So let's go through
them. Data science. The science of data. Science has been around for a long time. Data has
been around for a long time. You put them together and it's ‑‑
(laughter) ‑‑ it's emerged mostly over the past decade to be really the real data
science, information scientists. It's been the past decade kind of thing. It sort of
came out of the whole business analytics competitive intelligence. Like everything else, driven
by big business, because they're just looking out for our best interest. So all of a sudden
people who were data mining and mathematical analysis are very valuable to big businesses
and other entities that like to analyze large data sets. Are there other entities that collect
lots of data? >> SKYDOG: None that I've heard of.
>> NOAH: I haven't heard of any either. But I'm sure there are organizations out there
that are collecting lots of data and doing something with this.
>> SKYDOG: Purely for benevolent reasons. >> NOAH: Yeah, exactly. But it's mostly to
enhance our shopping experience; right? Like other people who bought this also bought this.
Statistics, just you're given data, try to come up with a model, probability, given a
model, let's try to predict the data. Simple concept.
Okay. Here is a little graphic demonstrating what I just said, and it's useless. Historic
data model, ignore. Data sources. Okay. These are some random
examples of readily available public data sets. We've actually gone from, like, having
database information to databases that are cataloging the databases of information. It's
increasing exponentially. My favorite was Freebase, which I came across when I was searching
for something else, but apparently it's a database.
(laughter) I also like Infochimps. Big data. Not just
any data, but big data. Buzzword. Who thinks it's a buzzword? Some other people think
it's a legitimate, real thing? That's cool. I don't judge.
Well, I don't know. It's hard to define what that really means, big data, like is it big
data ‑‑ is it in the Cloud? >> SKYDOG: It's a large type face.
>> NOAH: What's the cutoff for being big? 8 inches? 10 inches? When does it become really
big? Sky, how big is your data? >> SKYDOG: My data is huge.
>> NOAH: I work with a very small data set, and I'm okay with that.
(laughter) >> SKYDOG: And at this point this is yet another
presentation we cannot put in our portfolio for public speaking.
>> NOAH: Oh, boy, that's true. So technically, at least what I found is that it's sort of defined
as incredibly large amounts of data that are being rapidly generated and have
lots of variability. Okay. Sure. But it's still big at that time.
But the interesting thing about it, from our perspective, is that the creation of big data
has also sort of brought forth the development of tools to work with big data to analyze
these big data sets. All these new mathematical advanced platforms for performing all kinds
of functions on big data, which is of interest to us. We're going to look at that in a few
minutes. Okay. Terminology. That means sort of
defining words. >> SKYDOG: We Googled it backstage.
>> NOAH: A lot of Googling. Depending on what publication you read or what book, the terms for anonymization
mean the same thing. The terms for de‑anonymization kind of mean the same thing. Some groups will switch them.
For the purposes of our talk, yeah, they're synonymous within each group, but the two groups are sort of antonyms,
opposite meanings. You reverse one of these processes, you revert
to the other. Anyone with a fifth grade background should be able to do it. This is simple stuff.
Data, when it's initially collected, a lot of times it contains personally identifiable
information, like Social Security number or address or something else, your name. That
would be personally identifiable. So there needs to be some kind of process that takes
this data and makes it sort of anonymous. I love you, too. Oh, what was that? Ten.
>> SKYDOG: That was ten. >> NOAH: Holy crap. Oh, dude, you took up
all the damn time. Damn. Wow. Okay. So we need to find a way to make this personal
information ‑‑ what? Okay. Make it into anonymous public data. So there's a couple
of different ways it can be done in general. Removing variables altogether: a variable
that actually is unique enough to be identifying by itself, like, you know, I've had eight
kids and been in ***, Octomom, remove those. Global re‑coding, local suppression, where you're
re‑coding certain variables or suppressing certain values in certain columns that are
really identifiable, a whole bunch of different ways.
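The three techniques just named (removing a variable, global re‑coding, local suppression) can be sketched in a few lines of Python. This is only an illustration under assumed data: the records, the column names, and the helper functions `remove_variable`, `global_recode_age`, and `local_suppress` are all hypothetical and do not come from the talk or from any particular anonymization library.

```python
from collections import Counter

# Hypothetical toy records; none of this data comes from the talk.
records = [
    {"name": "A. Smith", "age": 34, "zip": "37201", "children": 2},
    {"name": "B. Jones", "age": 37, "zip": "37201", "children": 1},
    {"name": "C. Brown", "age": 52, "zip": "37215", "children": 8},
    {"name": "D. White", "age": 58, "zip": "37215", "children": 2},
]

def remove_variable(rows, column):
    """Drop a directly identifying column from every row."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]

def global_recode_age(rows, width=10):
    """Replace each exact age with a coarse range such as '30-39'."""
    out = []
    for row in rows:
        low = (row["age"] // width) * width
        out.append({**row, "age": f"{low}-{low + width - 1}"})
    return out

def local_suppress(rows, column, min_count=2):
    """Blank out values in one column that occur fewer than min_count times."""
    counts = Counter(row[column] for row in rows)
    return [
        {**row, column: row[column] if counts[row[column]] >= min_count else None}
        for row in rows
    ]

anonymized = local_suppress(
    global_recode_age(remove_variable(records, "name")), "children"
)
for row in anonymized:
    print(row)
```

In this toy set the "eight kids" value is the kind of outlier that local suppression blanks out, while the age ranges show what global re‑coding costs in detail.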
Okay. Anonymization metrics. We have to look at the way we anonymize data. Is this working?
Is it making the data anonymous while at the same time keeping it usable, the whole utility?
That's a balance right there. So two metrics: disclosure risk, like revealing data in the
public set, and then information retention, the utility of that data. So we take away
all this information. Oh, it's anonymous, but is it still usable. That's a balance you
have to strike. It's a tough problem. You want to minimize disclosure risk, maximize
information retention. Easier said than done. Information entropy, and not the entropy
from thermodynamics, which I spent a long semester trying to go through.
Information theory, so the idea is ‑‑ ten minutes. I have like a million slides
to go through. Basically the amount of information is
the number of bits it takes to single out one state from the total number of possibilities,
like the ‑‑ I use an eight‑sided die as an example that obviously you can roll
and you get, like, one through eight because it's got eight sides. The information will be
three bits, yeah, because two to the third is eight. So population of the world, let's just say 8 billion, that's like 33 bits.
Awesome website, 33bits.org. I'll cruise over.
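The two figures above can be checked with a base‑2 logarithm. A minimal sketch, assuming every outcome is equally likely; the function name `bits_to_identify` is made up for this illustration.

```python
import math

def bits_to_identify(possibilities):
    """Bits of information needed to single out one of N equally likely states."""
    return math.log2(possibilities)

die_bits = bits_to_identify(8)      # eight-sided die
world_bits = bits_to_identify(8e9)  # world population, rounded to 8 billion as in the talk

print(f"eight-sided die: {die_bits:.1f} bits")
print(f"8 billion people: {world_bits:.1f} bits")
```

The second value comes out just under 33, which is where the "33 bits" shorthand comes from.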
Everyone participate in something real quick. We got to do something. Get up and dance.
(applause) Do we have time for this?
>> SKYDOG: I think we have all the time we want.
>> NOAH: You got that pull? >> SKYDOG: I didn't do that. Wrong? Let me
get the radio and get a couple of red shirts in there.
>> NOAH: We're on slide 26. >> SKYDOG: We were gonna look at audience
participation and kind of go through and sort people out based on some criteria. We can
skip it, if you want, or if you want to stand up and raise your hand. Do you want to do
that? >> NOAH: All right. Cool. How about this.
First question. Everyone here who this is their first time attending DEF CON, please
stand up. >> SKYDOG: Nope, nope, nope, nope, nope.
>> NOAH: Come up with west coast, east coast or age cut off?
>> SKYDOG: Tell you what. Anyone from the east coast stay standing. Everyone else sit
down. You guys paid the highest airfares, thank you very much.
>> NOAH: Anyone here from New Jersey up? I didn't say what to do. I said anyone from
New Jersey up? >> SKYDOG: Simon says.
>> NOAH: No, you can sit down. Okay. >> SKYDOG: Have we got seven or eight or ten
people? >> NOAH: What are the states below New Jersey.
>> SKYDOG: No, no, no, I was going to say if you have a hangover. I guess that's not
publicly available data. If you're female, raise your hand. *** data set.
>> NOAH: No, and actually, that would be the unit. There we go.
>> SKYDOG: Say 29 years of age or younger. All the old *** in the room, sit down. Oh,
man. >> NOAH: Anyone living below North Carolina
or South Carolina border sit down. >> SKYDOG: Did we do New Jersey and up?
>> NOAH: Yeah, we're now between North Carolina and Jersey.
(laughter) who do we have? >> SKYDOG: You said New Jersey and up. You're
in the upper quadrant. >> NOAH: I don't know what the *** I'm doing.
>> SKYDOG: We can't do male female. Who got laid last night? That's bad data set, too.
(laughter) >> SKYDOG: So how many people are we up to?
Who is remaining standing? Count them off. I can't see for the lights. How many people
are in the room right now? 7‑ or 8,000. Out of that, we're down to three people remaining
standing. How many questions? Like five questions. >> NOAH: Well, it was maybe four or five questions.
The entropy for those questions, west coast/east coast, is one bit.
>> SKYDOG: First time at DEF CON. >> NOAH: Two bits.
>> SKYDOG: Anyone above New Jersey and above? >> NOAH: Pretty much I think all the questions
were two bit. So five. Basically five bits of entropy and we are able to narrow down
the population to three or four people. And it's all innocuous information. The point
is that the combination of all this innocuous information can actually be quite identifiable.
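As a rough check on the audience exercise, the information gained by each question can be measured as the base‑2 logarithm of how much the group shrinks. This sketch takes the numbers quoted from the stage at face value (about 8,000 people in the room, three left standing); the function name `bits_gained` and the 100‑person examples are assumptions added for illustration.

```python
import math

def bits_gained(before, after):
    """Bits of information gained when a population shrinks from `before` to `after`."""
    return math.log2(before / after)

# Figures quoted from the stage: roughly 8,000 people in the room, three left standing.
total_bits = bits_gained(8000, 3)

# A question that splits the remaining group evenly yields one bit;
# a question that keeps only a quarter of the group yields two bits.
even_split = bits_gained(100, 50)
quarter_kept = bits_gained(100, 25)

print(f"8,000 -> 3 people: {total_bits:.1f} bits")
print(f"even split: {even_split:.0f} bit, quarter kept: {quarter_kept:.0f} bits")
```

On those figures the reduction works out to a little over 11 bits, so the handful of questions was collectively more selective than one or two bits apiece. Either way, the point stands that a few innocuous answers are enough.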
>> SKYDOG: Thank you for participating. A round of applause for yourselves.
>> NOAH: I have 20. Outliers. Single outliers are easy to pick up; if you have them in combinations
or sets, they're a little bit trickier to detect, but it's mathematically possible. Graphical example
of an outlier. Everyone here in the audience was an outlier.
Data sets. You have sets of data. You have set A, set B, what's the intersection there?
A. Look at that A and B. Amazing. Now you add C, look what you have. A and C, B and
C. What do you have in the middle? Holy crap. Isn't that amazing?
(laughter) >> SKYDOG: That's the math thing happening.
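The overlapping circles map directly onto set intersection. A small sketch with made‑up record IDs; the three set names are hypothetical and simply echo the audience questions asked earlier.

```python
# Hypothetical record IDs that match each innocuous attribute.
first_time_attendees = {101, 102, 103, 104, 105}
east_coast = {102, 103, 105, 108}
under_30 = {103, 105, 109}

# Pairwise overlap, then the region in the middle of all three circles.
a_and_b = first_time_attendees & east_coast
all_three = first_time_attendees & east_coast & under_30

print(sorted(a_and_b))
print(sorted(all_three))
```

Each extra set can only shrink the middle region, which is why stacking attributes narrows a crowd so quickly.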
>> NOAH: Unique variable overlap. You know, yeah, if you have outliers for different types
of data and they ‑‑ you know what, just move on. Mathematical attacks, with three minutes.
>> SKYDOG: Slow down. Just do it. Who gives a ***. I got it covered.
>> NOAH: Sweet. Inferential analysis, and the example of it, remember the targeted advertising,
the teenage woman who was pregnant and was getting all this targeted advertising based
on her purchasing behavior to her household, and then her dad was upset that she was getting
targeted ads for Enfamil and diapers and got all pissed off at the manager. And she was
pregnant and that's not a good way to find out and tell your parents you're pregnant
is through targeted ads. That's not how I'll tell my parents.
The whole Netflix, IMDb. You all remember that? The census, you know they don't come
to the door. >> SKYDOG: I don't answer the door. Me and
my 12 roommates. >> NOAH: They still come to the door? Another
reason not to answer the door. Actually, so a researcher, Latanya Sweeney, using
the information from the 1990 census data, date of birth, zip code, gender, showed that more than 80 percent of the population
was unique, based on principles of information entropy. Exposed healthcare records of the
governor of Massachusetts at the time, which is kind of funny. And screw you. Zip code,
there's 43,000 zip codes; birthdate, 365; birth year, about 70 different ages; two
different genders. About 30 bits of entropy, which is enough to cover all the population in the U.S. Simple as that.
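The arithmetic behind that claim can be sketched by adding up the bits each attribute contributes. Two assumptions are made here and are not from the talk: every value of an attribute is treated as equally likely (so the total is an upper bound), and the U.S. population is taken as roughly 316 million.

```python
import math

# Counts as quoted in the talk; treating every value as equally likely is a
# simplifying assumption, so this is an upper bound on the real entropy.
quasi_identifiers = {
    "zip code": 43_000,
    "day of birth": 365,
    "birth year": 70,
    "gender": 2,
}

available_bits = sum(math.log2(n) for n in quasi_identifiers.values())

# Assumed U.S. population of about 316 million (not a figure from the talk).
US_POPULATION = 316_000_000
needed_bits = math.log2(US_POPULATION)

print(f"bits available from the attributes: {available_bits:.1f}")
print(f"bits needed to single out one U.S. resident: {needed_bits:.1f}")
```

Under these assumptions the attributes supply a few more bits than are needed, which is consistent with most people being unique on zip code, date of birth, and gender.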
PGP. Ever heard of PGP? Personal Genome Project? This is where people voluntarily submit all
this genetic information about themselves. They want to correlate genotype, phenotype
to learn about themselves. Oh, dude. Look, anyway. Again, this is a project gone bad.
(laughter) No one saw that. That didn't happen.
Record linkage. >> SKYDOG: This is a cool background.
>> NOAH: Take care of him. He's stressing me out. Record linkage. This is where you
have a public data set and a private data set. The public data set maybe is metadata
that's publicly available and might have some innocuous but identifying information about
an individual. The private data set, well, that's got personal information you don't
want people to know. Through record linkage it's possible to actually correlate the two
and discover sort of these anonymous or so‑called anonymous traits about a person by combining
the two data sets. And I'll get to how to do that mathematically in a second, or not if I get
kicked off stage. All right. Flying through these slides. Vectors.
This is where you get into the math. So either go to sleep or those ‑‑ anyone math torn?
Okay. Your data points now become a vector. Your record, attributes, yeah, boom. Okay.
We're now dealing with vector math. Take it one step further. The whole database
is a matrix, boom: records, people, attributes, database. Cool. Again we now can apply matrix
math, inversions, and dot products, all kinds of wonderful things like that. Actually, so
cosine similarity, measuring the angular differences, math, math, math, math, math.
One thing we did was the actual mathematical formula for the similarity function, in
case anyone wants to try it at home or see me after class and we'll discuss it. Yeah.
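For anyone who does want to try it at home, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is not the formula from the slide, which is not reproduced in the transcript; the attribute encoding, the tiny database, and the function name `cosine_similarity` are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical database matrix: one row per record, one column per yes/no attribute.
database = [
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]

# Attribute vector for the person being looked up (also made up).
target = [1, 0, 1, 1]

scores = [cosine_similarity(row, target) for row in database]
best_match = max(range(len(database)), key=lambda i: scores[i])

print([round(s, 3) for s in scores])
print("closest record index:", best_match)
```

A score near 1 means the two records point in almost the same direction in attribute space, so the highest‑scoring row is the most likely match.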
Venn diagrams, this is really cool. So to be able to visually understand and identify
overlapping data sets, you have two data sets, A, B, with multiple variables that were
in common that were the same descriptive traits. Looked at the intersections of them. Noted
here by these little lines across. Okay. So these data sets have independent descriptive variables
in common. Then we take those little sections that are in common and we Venn the Venns, as
we say. So take those and watch this. Bam, bam, bam, bam! Look at that.
And then based on that we can actually now actually ‑‑ the subspace defined by that
area is the intersection of all of these groups and actually identifies records for which
all attributes are identical and actually identifies an actual person.
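The record linkage described above amounts to joining a named public data set to an "anonymous" one on the attributes they share, and keeping the rows where exactly one candidate matches. A minimal sketch under assumed data: the records, field names, and the function `link_records` are hypothetical.

```python
# Hypothetical data sets; the field names and values are illustrative only.
public_records = [
    {"name": "Pat Example", "zip": "37201", "birth_year": 1975, "gender": "F"},
    {"name": "Sam Sample", "zip": "37215", "birth_year": 1982, "gender": "M"},
]
anonymized_records = [
    {"zip": "37201", "birth_year": 1975, "gender": "F", "diagnosis": "asthma"},
    {"zip": "37215", "birth_year": 1990, "gender": "M", "diagnosis": "fracture"},
]

SHARED = ("zip", "birth_year", "gender")

def link_records(public, private, keys=SHARED):
    """Pair each named public record with the private record that agrees on
    every shared key, keeping only unambiguous (exactly one) matches."""
    index = {}
    for row in private:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in public:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:
            linked.append({**row, **matches[0]})
    return linked

reidentified = link_records(public_records, anonymized_records)
for row in reidentified:
    print(row["name"], "->", row["diagnosis"])
```

Requiring a single match is a design choice for the sketch: when several private rows share the same attribute values, the link is ambiguous and nobody is identified.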
(laughter) Wait. Okay. In summation, the dark side of
OSINT. So big data, big problem, big data. Lots of tools are being used for analysis
and visualization. More data sets are being developed and these mathematical attacks
are going to become easier and easier. It's another weapon for social engineering tool
kits. This is information about individuals that we'll be able to ascertain and they're not
going to be aware of it. And they're not voluntarily giving this information, but it's going to
be actually sort of reidentified about them from these anonymous data sets. So cool for
us, bad for them. What can we do to defend against the dark
arts? (laughter)
Proper sanitization methods. There are not ‑‑ there's no way ‑‑ there's no standards
to actually implement anonymizing metrics that provide true anonymity. We need access controls
or my recommendation is to falsify everything and just make *** up. So that's what I would
do. (applause)
In conclusion ‑‑ yeah. (applause)
>> SKYDOG: Questions and answers will be handled at the bar. You guys are buying!
>> Ladies and gentlemen, the full presentation will be seen at SkyDog.com a little later.
Thank you for the speaker for letting us go a little long.
>> How can we take out SkyDog and his buddy? >> Head to the chillout cafe for question
and answer and more elucidation on the slides.