3. Nan Zhang - Data breaches cybersecurity

[ Music ] >> We come now to talk about this from a technical perspective. What exactly are the possible implications of breaches like that, and what are the possible reasons for them? At this time, the information that we know about the cause of this breach is actually quite limited. We don't know exactly what caused the system to be hacked into and the data to be disclosed. But if you look at the rumors, like the comments provided by a security analyst from Gartner, it seems that a payment aggregator, which is responsible for aggregating credit card payments for a lot of cab companies in New York City, actually had an administrative account compromised. This was not through any sophisticated technology but because the adversary was able to correctly answer the knowledge-based authentication questions, like the ones many of you may have on your e-mail accounts. This is coupled with what is kind of a coincidence: the payment aggregator happened to move its services from locally hosted to a cloud provider, which is Amazon EC2 in this case, just a few months ago. And this company, which provides the payment aggregation service for NYC cabs, also happened to provide the encryption for the authentication between Global Payments and these cloud service providers. So it may be just a sheer coincidence, and it may not be the actual reason why this breach happened, but if you connect all the dots together, it seems like a reasonable story, at least given that we don't know the technical cause. So what I would like to comment on is what it would mean if this were actually the cause of the attack. Even if it isn't, it still tells us something about the current practice of all the different authentication services and the potential implications of similar attacks in the future. The second thing I want to talk about is the implications of all these breaches of data repositories on the internet.
That is, what an adversary can do with all the disclosed data. So for the first point, suppose the authentication service was breached, hypothetically in this case, just because an adversary was able to answer the knowledge-based authentication questions. That tells us two things. One is that knowledge-based authentication is perhaps not really a good idea, if for no other reason than the amount of information that people can find about you on the web. It is a lot, as you will see if you actually try to search your name on the web. And if you dig a little deeper and find the many data sources that have information about you, there is actually an amazing amount of information that someone has entered about you on the internet. So setting up knowledge-based authentication questions like which high school you attended or which city you got married in isn't really very safe. A lot of people will be able to answer those questions just by looking through that information on the web. This is one thing. The second is, if you look at the authentication services provided for a lot of users, there seems to be a trend now that regular user accounts have stricter and stricter requirements on the kind of passwords you have to set and the kind of questions you have to answer to pass knowledge-based authentication. In contrast, the constraints on administrative accounts are looser and looser; they do not enforce the same kind of rules that regular account holders have to follow. You may wonder why. It is because these administrative accounts are often shared by multiple users. It's not that only one user has one account. Multiple users may have to access the same account to get business done. And this problem is actually made worse by the trend of moving a lot of services from locally hosted to a cloud provider.
That is because it's one thing to pick up your phone, call the IT department and say, "I lost my password, can you reset it so I can log in to the system?" It's a totally different issue when you have to call a cloud service provider, convince the provider that you are who you claim to be, and have the actual password reset carried out. In a lot of these cases, when the services are based in the cloud, the cloud service providers cannot offer you a very sophisticated authentication service. Instead, what happens, and possibly what happened in this case, is that some simple knowledge-based authentication questions are used to reset the password. In other words, an account can be compromised as long as someone can somehow get the answers to those questions. That, of course, is not a fact that we know. It's only speculation at this time, but it points to an alarming trend that may be happening, especially as we move all of our IT services to the cloud. And that is an issue that perhaps has to be addressed by the technical community, in the sense that authentication services may have to receive a lot more attention from the academic community and the research community in general, as well as from other perspectives, such as business and legal perspectives. The second thing I want to talk about is, what exactly are the implications of having all these things disclosed to potential adversaries? In this particular case, as Howard just mentioned, only the track 2 data was disclosed, which means that hopefully, based on the facts that we know, the account holder's name, address, Social Security number, and other information were not actually disclosed to the adversaries. So seemingly, apart from the fact that you have to replace your credit card and change to another credit card number, there's not much information about you being disclosed in this case. But it seems that is not where the real danger of all these breaches lies.
The real danger is not really what an adversary can do with just one batch of data records that is breached or disclosed in one instance. Rather, it is how an adversary, with a lot of other auxiliary data sources, either already available on the web or breached in multiple instances, can connect the dots together and infer much more serious information about you, things that you yourself may not even know. The database research community, for example, has studied for quite some time how one can connect the dots from multiple data sources to infer information about you that you think is not available. For example, some of the very first studies on this issue were by Sweeney and colleagues in Massachusetts. What they did was look at one public data source, which is the health insurance benefits of all the state employees of Massachusetts. In that data source, there is no personally identifiable information disclosed. So you cannot see the name of a person or the Social Security number. All of that was masked because of privacy concerns, because health is very sensitive information. The only information available there is the zip code, the date of birth, and the gender of a person, along with all the other health insurance information. Now, what these researchers did was take that data source and crunch it together with another data source, which basically shows the zip code, date of birth, and gender of state employees in Massachusetts. You might say that there are a lot of people who have the same zip code as you, a lot of people who were born on the same date as you, and, of course, a lot who have the same gender. But their research actually showed that 75 percent of all the people in the United States can be uniquely identified by the combination of zip code, date of birth, and gender.
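The linking step described here can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' actual method or data: the field names (`zip`, `dob`, `gender`, `diagnosis`, `name`), the records, and the helper functions `quasi_key`, `unique_fraction`, and `link_records` are all made up for the example, and it assumes that the second data source carries names alongside the three shared attributes.

```python
from collections import Counter, defaultdict

# Attributes that appear in both data sources (hypothetical field names).
QUASI_IDENTIFIERS = ("zip", "dob", "gender")


def quasi_key(record):
    """Return the (zip, dob, gender) combination for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def unique_fraction(records):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(quasi_key(r) for r in records)
    unique = sum(1 for r in records if counts[quasi_key(r)] == 1)
    return unique / len(records)


def link_records(anonymous, identified):
    """Join the two sources on the quasi-identifiers.

    Only unambiguous matches are kept: an anonymous record is linked
    to a name when exactly one identified person shares its key.
    """
    by_key = defaultdict(list)
    for person in identified:
        by_key[quasi_key(person)].append(person)

    matches = []
    for record in anonymous:
        candidates = by_key.get(quasi_key(record), [])
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


# Made-up "anonymized" health records: names removed, quasi-identifiers kept.
health = [
    {"zip": "02138", "dob": "1950-01-02", "gender": "M", "diagnosis": "condition A"},
    {"zip": "02139", "dob": "1980-05-06", "gender": "F", "diagnosis": "condition B"},
    {"zip": "02139", "dob": "1975-03-04", "gender": "F", "diagnosis": "condition C"},
]

# Made-up second source that includes names (an assumption of this sketch).
roster = [
    {"name": "Person One", "zip": "02138", "dob": "1950-01-02", "gender": "M"},
    {"name": "Person Two", "zip": "02139", "dob": "1980-05-06", "gender": "F"},
    {"name": "Person Three", "zip": "02139", "dob": "1980-05-06", "gender": "F"},
    {"name": "Person Four", "zip": "02140", "dob": "1990-07-08", "gender": "M"},
]

if __name__ == "__main__":
    print(link_records(health, roster))
    print(unique_fraction(roster))
```

In this toy data, only the first health record is re-identified: the second matches two people in the roster and is therefore ambiguous, and the third matches nobody. The more people who are unique on the three attributes, the more records can be linked this way.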
That means that when they crunch the two data sources together, they know the health insurance information or the hospital visits of the governor of Massachusetts, if they're from Massachusetts. This basically just illustrates the danger of having multiple data sources containing information about you available on the web. There are a lot of major studies; you can find them easily in the literature. One of them links the data that Netflix disclosed, although in an anonymous fashion, about which movies its subscribers actually rented, with the data source from imdb.com. In that case, the researchers were also able to link a user at imdb.com with a subscriber of Netflix, so as to infer additional information about which movies you have rented, viewed, or commented on. So it seems the real danger of these data breaches really lies in the ability of the adversary to crunch all the data about you together and then infer sensitive information. Now, the problem with this from a technical perspective is that we don't yet know exactly how an adversary can do these things. For example, there's no technology available for me to actually test which information about myself is available on the web. Before you set a knowledge-based authentication question, maybe you want to know whether this question can be answered by someone searching for you on Google. There is no tool available to test these things, and maybe that is actually something that the academic community can address in the future. [ Music ]