Robert Chadduck: Thank you, Richard. Ken, in the context of both your career experience spanning multiple interests in government information and records, and now your current role as a distinguished visiting scientist at NIST ITL, are there any perspectives or technical opportunities that you would like to draw our attention to?
Ken Thibodeau: Yes, but as an archivist, first I have to set the record straight. I am not here today representing NIST or any other part of the US government, but speaking as a private citizen. So I get to say whatever I want.
Robert Chadduck: That is why you are visiting, Ken. Today you are not visiting.
Ken Thibodeau: You might also have deduced, from the fact that I am the only panelist who has a full suit and tie on, that having spent nearly 40 years in the government, I carry a lot of baggage from that experience. And I have a prejudice regarding today's topic, so I want to confess that up front. That prejudice is simply this: for most of my forty years, most of my effort was spent on pushing the use of computer technology for the conduct of the government's business and for the management of the information that results from that business.
Within that context, I don't believe that technology can solve the major problems of access by the press or by the public to government information, because they are not technological problems. They are the kinds of things we heard about from the first panel. There are legal barriers. There are bureaucratic barriers. There are organizational and cultural barriers. And there is something that I can only call American barriers, because we are simply such a large and heterogeneous nation that for almost anything you propose doing, you will find dissonant voices who want to do something different.
We have already had an example of that this morning: two completely legitimate voices, but dissonant, where David wants government data available in a form that is easily analyzable and easily, what's the verb, mashupable? Whereas if we give it to him in that form, we are creating what Sarah Cohen calls "false records": things that never existed as part of the way the government did its business. Hopefully there is some way of balancing those out, but I think the fundamental solution has to come prior to the technology, and from the perspective of an old-time bureaucrat, I would say Gary Bass hit on one of the most fundamental things that has to be done. If you really want the government to proactively and aggressively make information available, first of all to the citizens, because that is our fundamental responsibility, and secondly to the press, because they are an important intermediary in helping citizens understand government information, you have to make it in the self-interest of the agencies to do that.
Because under the current situation, by and large, there is no reward. There is not even recompense, and in many situations there are not even resources available specifically for responding to outsider requests for information. As we heard from Bill, the situation at DOD is no different from others, except I was surprised to learn that DOD actually does fund the FOIA office. There are other organizations where you simply assign people to do the work, and they scrounge whatever they can, and they don't have communications within the agency.
Having said that, let me shift to the technology. I do love the technology; I would not have spent my career on it otherwise. Technology could, once you get over that barrier, help the situation a lot. But I can only say it could, because as you can see in my prepared remarks, there are pluses and minuses whichever way you look. I think there are three basic questions: Can the technology make more government information available? Secondly, if it is available, can the technology actually help you get access to it? And thirdly, if you get access to it, can the technology help you do with it what you want to do? There are pluses and minuses in all of those.
Undoubtedly, the technology makes more information available. There is simply an explosion of data as a result of the use of computers, and that is true of the whole world as well as the federal government. But it's easy to find a voluminous literature on the dangers that IT, and digital technologies specifically, poses, because it makes it so easy to destroy or alter information, or so hard to carry it forward in time in a way that is usable, given technology change.
On the other side, I think there are very simple methods that can be, and frequently are, implemented in agencies that do a better job of guaranteeing the survival of government information. My best example: Oliver North shredded his paper notes, but he couldn't even touch the backup tapes that held multiple digital copies of those notes in their original form. And as a result, those notes are in the National Archives today.
The Reagan Administration was actually the first administration from which we got email, and we got the Reagan and first Bush email at the same time. It was a collection of about a couple hundred thousand messages. And I have to tell you, the way NARA dealt with it at the time was, from a technological standpoint, primitive, and from the personal standpoint of the people who had to do that work, which included poor Bob, who was our lead engineer at the time, at the very least you'd have to say it was gruesome. I won't go into details, but I would in the breaks, if you are interested in hearing horror stories about how you process government information when you have no system available, and when, because of the sensitivity of White House information, you are only allowed to create a system in which your staff have no way to see the data they are processing. We couldn't print; we couldn't dump to screen.
Contrast that with the George W. Bush Administration. We got in excess of 200 million emails from the Bush White House, a total of some 82 terabytes of electronic records. In less than a year, roughly eight months, we had not only ingested all of that into an archival system we had recently deployed specifically for Presidential electronic records, but on the way in we indexed it all, so you can do searches on the full content of all of it.
But actually, if you know the situation, that's not very good. Searches of the kind you can do with Google and Bing and other internet search engines are very good at finding you pieces of relevant information, but they are not very good at weeding out the chaff. And our archivists in Presidential Libraries, if they get a request from the White House or the Congress or the courts, don't want a result set of 400,000 hits if only maybe 10,000 of them are really responsive.
And in fact, the other problem exists as well. You may get a lot of stuff that is really not what you are looking for, and conversely, the probability is that you are not going to get much of what you are looking for. NARA's chief litigator, who is a personal friend of mine, has in his spare time, which I figure is some time between 2:30 and 3:30 a.m., been involved for several years in a nationwide research project on search mechanisms that can be used in e-discovery, which is very important in litigation. What they found in this research is that you are really lucky, using current Boolean-type searches, if you get 41% of what is really relevant in the collection that you are searching.
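In information-retrieval terms, that 41% is a recall figure: the fraction of the truly relevant documents that a search actually returns. A minimal sketch of the arithmetic, with made-up document IDs purely for illustration:

```python
# Hypothetical e-discovery result set; the document IDs are invented.
relevant = {f"doc{i:02d}" for i in range(1, 11)}   # 10 truly responsive docs
retrieved = {"doc01", "doc03", "doc05", "doc07",    # 4 true hits...
             "doc11", "doc12", "doc13"}             # ...plus 3 false positives

true_hits = relevant & retrieved
recall = len(true_hits) / len(relevant)      # share of the relevant set we found
precision = len(true_hits) / len(retrieved)  # share of what we found that is relevant

print(f"recall={recall:.0%}, precision={precision:.0%}")  # recall=40%, precision=57%
```

A Boolean search with 41% recall misses nearly six of every ten responsive documents, which is exactly the problem being described.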
So the current technology is not very good. It is also not very good for the reasons that Pittham has discovered: computers today, by and large, can't handle context, and context is often definitive. So I was very happy to hear that Pittham is promoting efforts to make use of context in the discovery and management of government information.
A really extreme example of how poor today's computer systems are, and not run-of-the-mill but advanced computer systems: it was the lack of understanding of context that led IBM's Watson, on Jeopardy, to identify Toronto as an American city in answering a question about airports in American cities. Watson is the computer that beat the best human Jeopardy contestants.
So things are not ideal, but they are getting better. In addition to indexing all of the Bush electronic records on the fly as they came in, we were able, for at least 90% of them, to associate the records with metadata and contextual information that allows faceted search. So if I am doing a search on White House email, I can search for *** Cheney, but I can also tell the computer, "I want to search on messages in which *** Cheney was discussed, or in which he was the author, a recipient, a carbon copy, or a blind copy." I can even use the metadata to say I am only interested in *** Cheney's email if he actually read it, because that data is sitting there on the computer. So it is a much richer way of finding information, and that will continue in the future.
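As a rough illustration of what faceted search over email metadata buys you, here is a minimal sketch; the record structure and field names are invented for the example, not NARA's actual schema:

```python
from dataclasses import dataclass, field

# Invented record structure for illustration; not NARA's actual schema.
@dataclass
class EmailRecord:
    subject: str
    author: str
    recipients: list = field(default_factory=list)
    cc: list = field(default_factory=list)
    bcc: list = field(default_factory=list)
    was_read: bool = False
    body: str = ""

def faceted_search(records, name, role="any", must_be_read=False):
    """Filter by the role a person played in a message, not just full text."""
    def matches(r):
        roles = {
            "author":    name == r.author,
            "recipient": name in r.recipients,
            "cc":        name in r.cc,
            "bcc":       name in r.bcc,
            "mentioned": name in r.body,
        }
        hit = roles.get(role, any(roles.values()))   # "any" falls through
        return hit and (r.was_read or not must_be_read)
    return [r for r in records if matches(r)]
```

A full-text engine can only tell you that a name appears somewhere; the metadata facets let you ask in what capacity it appears, which is the distinction being drawn here.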
And I want to finish with the third question: what about actually being able to use the information, often very voluminous information, that you could get from the government? There I think things are very promising, and I can cite several examples, actually from NARA. For many years now, actually since we got the Bush and Reagan email, we have been sponsoring research by Bill Underwood down at Georgia Tech on the use of automated techniques to characterize and even describe electronic records.
The tools Bill and his team are developing can do things like distinguish the type of an electronic document: is it a policy document, is it correspondence? And distinguish subtypes: if it is correspondence, is it letters or memoranda? They can even go to a finer level and say what a letter is about, and tell you: this set of letters relates to personnel appointments that need Senate confirmation, that set are responses to inquiries from citizens, that set is correspondence with major stakeholders about a legislative initiative. So the tools actually allow you to understand a lot of what's in a collection before you have read any of it.
There is work supported by DOE at the Los Alamos Laboratory, by Jorge Raymond and Shelly Spering, that is looking at managing records in terms of what they contain, emphasizing entire collections rather than simply individual documents. They can look at a collection of, say, a Cabinet Secretary's correspondence over four years and tell you what major themes show up in that correspondence, and also how those themes evolve over time. Then you can do mashups between things like the evolution of themes and the political or international events going on in the world that might have caused the interest of high-level officials to shift from one area to another.
The National Science Foundation and NARA are also supporting research at the Texas Advanced Computing Center by Dr. Maria Esteva and her colleagues, who are developing methods to visualize information about large collections, and we are talking millions of documents, so that you can interactively explore, based on the visualization, what's in a collection and what subsets of that collection you might actually be interested in.
But going back to that fundamental problem, that you need to incentivize agencies to make information available: I think one of the things that needs to happen in the government, to make more of our records available to citizens and the press, is an infrastructure change, using a common technologists' technique: if you can't solve a problem, go to a higher level of abstraction, because it's often easier to solve it at that higher level. Or, another way of putting it, introduce levels of indirection. So I think it would be good if the government's IT infrastructure were to move so that we could introduce more and more indirection between the information we have in storage and the applications that we use to conduct government business.
Because when you come into a government agency and you ask for information, you are at least competing, if not conflicting, with the use of that system to do our current business. And that is a basic problem.
But if we could separate those, and we had either an app or a service that could look at the information in storage in response to requests from the public, then you wouldn't be interfering with the officer who is responsible for awarding the contract, or administering the contract, or getting the grant out, or deciding whether the Medicaid cost is legitimate, and so on. It wouldn't be interfering with their work, but hopefully you would be making more government information available to more Americans. Thank you.
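One way to picture the indirection Thibodeau describes is a read-only public query service sitting on a replicated snapshot of the records, so that outside requests never touch the transactional system agency staff are using. A minimal sketch, with all class and method names invented for illustration:

```python
import copy

class TransactionalStore:
    """The system of record that agency staff use for current business."""
    def __init__(self):
        self.records = {}

    def write(self, record_id, data):
        self.records[record_id] = data

class PublicQueryService:
    """A read-only layer of indirection that serves outside requests
    from a periodic snapshot, not from the live operational system."""
    def __init__(self, store):
        self._store = store
        self._snapshot = {}

    def refresh(self):
        # Copy released records out of the operational system.
        self._snapshot = copy.deepcopy(self._store.records)

    def search(self, keyword):
        return [rid for rid, data in self._snapshot.items()
                if keyword.lower() in str(data).lower()]
```

The design point is the decoupling: public searches hit the snapshot, so the contract officer's live system never competes with citizen requests.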
Robert Chadduck: Thank you, Ken. I will have no further questions about what constitutes gruesome engineering challenges. Looking forward, all of the interest is in technology that eliminates the gruesomeness of what the government confronts. So in that respect, Ken, thank you.
George, there is no surprise that we have moved towards the middle. What we have spoken about in the last hour is the interest in making government information available and, as I would informally put it, repurposable: the aspect of context-aware computing and technologies. And particularly from your vantage point, George, in the NITRD program, being able to look across the coordinated research of multiple agencies, I'd welcome your perspectives. Thank you.
George Strawn: Thank you, Bob. Good morning, ladies and gentlemen. A special good morning to Matt and Leslie up there, two of my long-term colleagues from the National Science Foundation Office of the General Counsel. I think the NSF Office of General Counsel might serve as a metaphor for the activity that we are discussing today. I was very surprised, when I first came to NSF, to find that the purpose of the Office of the General Counsel was to help me get my job done, not to say no. And I find it an exemplary Office of General Counsel.
Well, if you check the schedule you will note that we are supposed to be identifying specific technical problems. I am going to tell you about one big one. And technical solutions, or research agendas to find solutions to these problems? I am going to tell you one thing. Just like the tiresome father in the old movie The Graduate, who had just one thing to say to the hapless graduate, I will tell you one thing.
The sub-bullets that we might also talk about mention quality of records. Whoa! You don't know how big a problem that is, ladies and gentlemen. Now that we are going to make records transparent, we are going to find out. And ten years from now, government records are going to be better than private-sector records, because ours will be transparent and theirs won't, and theirs will still have the garbage in them.
For government agencies, what can we do about documents that are not born digital? Well, there is a religious metaphor for that: they have got to be born again. (Laughter) So we are working on scanning technologies that will take analog stuff and turn it into digital, and that is getting better too.
Finally, insights into emerging technologies that may help, blah, blah, blah. Well, that is the one big thing that I am going to feel free to enlighten you with.
I am going to start with one quick analogy between the public and the private sector and end up with one further analogy between the two. James Q. Wilson, a political scientist at UCLA, wrote a book a number of years ago contrasting the private sector and the public sector; the title was something like "The Bureaucracy and Why It Works That Way." And we have alluded to a number of his points in this and the previous panel.
First of all, we, the government, have unusual authorities, in particular to tax the citizens and to send the citizens to war. Those are both pretty awesome responsibilities, and they impose some unusual obligations upon us. Also, since we are a democratic society, we in the government have to take matters of equity extremely seriously, probably more seriously than non-governmental entities have to. Because of these types of issues, we are considerably subject to accountability, and openness and transparency are there, in general, to support our accountability, for all of these appropriate reasons.
Oh, by the way, there is a fourth thing we are supposed to do, which is the same thing as the private sector, the only thing the private sector is supposed to do: be efficient. The private sector has to be efficient enough to stay in business. Well, we have got to be efficient as well as doing those particular things.
So we are in a digital world, and the three big dimensions of digital stuff are access to the data, data quality, and turning data into knowledge, which is another phrase for some of the things that we have been talking about already. Access to data, well, that's a policy issue and a culture issue, the culture of sharing we have been referring to, which is still aborning, if it ever arrives. Since policy isn't my business, I can glibly say: if it is FOIAble, let's webize it. Anything that is FOIAble, let's just put it up on the web and be done with it. But that's not my business, so we'll see how that works out.
I have already mentioned data quality, and I think that is the unsuspected 800-pound gorilla sitting over in the corner that we will find out about. It isn't just in the government. How about hospitals, where it is reported that a hundred thousand accidental deaths occur each year in the nation's hospitals, a good part of them because of bad information in the patients' records?
But turning data into knowledge is one of the interesting things that the IT people are trying to help with right now. Here we are talking about how we make useful information, make knowledge, available to the public and the press, and how we get beyond the fact that bad news is redundant.
I guess the way I think we need to do it, and one of my agenda items, is to preach to the agency administrations in the federal government: "You have got to get on top of this information." Don't forget, this is new stuff. If you think the agency directors and secretaries have full access to all the information that makes their agencies run, you have another think coming. We are able to keep records at a low level, which allows transaction processing to occur. But filtering up the data, turning it into knowledge that is useful to the directors of the agencies, that is still something we are trying to get done, too.
So, my advice to agency directors is: "You have got to get on top of this new method of turning data into knowledge, because it will help you manage your agencies better. Oh, by the way, there is another reason why you ought to be on top of these techniques and technologies: others are getting on top of them, too, and they might use that information to whack you. The more you know about it, you will a) get things straightened out and b) be able to respond knowledgeably to any nefarious characterizations of information from your agency," etc., etc.
So, I think the way we make the information more available to the public and the press is to also make it more available to agency leadership, who will find out it's useful, and then it will trickle out more automatically than it has been able to do heretofore.
Alright, what's the one big problem that I see, that we have all talked about, but that I want to make explicit? The one big problem is that there is too much data. There is too much data to read. Okay?
Therefore our silicon friends have to do more for us. We have to find ways to get our computers and our networks to help us more, to help us zero in on the data that we want. That means getting rid of the superfluous stuff, as was previously mentioned, and maybe even more than that, as I'll talk about in a minute.
So, one of the sound bites of this is that we need to find ways to turn the data deluge into a data cornucopia. Or, to use my other metaphor, to turn the data deluge into a knowledge cornucopia. What's the ideal solution? Well, we don't need to worry about implementing it for a while, because it is a ways off. Watson, this computer program that successfully trounced two Jeopardy champions, points in that direction of semantic search: can we understand enough about the context that, when somebody asks a plain old question in normal natural language, we can figure out what they really mean, and do we have access to all of the data and information in the world needed to construct the correct answer?
Well, Watson shows, as other Turing tests show, how easy it is to fool a person into thinking there may be a person there when there is just a computer. But the Toronto example shows that there will be a knee-slapper every now and then, when some little bit of context is missing and the machine gives a surprisingly silly answer.
Alright, so let's say that semantic search a la the computers in science fiction's Star Trek and other movies is a ways away, probably not to worry about now, although there are plenty of researchers busy in the laboratories who will be surprising us in times to come. What do we have now?
Well, I think there is a partial solution emerging now. I particularly liked the comment earlier that the Federal government needs to act as the wholesaler, not the retailer. Because that is starting to happen.
If we put out the official records in a web-readable format of one sort or another, and then others add value to that information, you get by the conundrum of "we produce non-records because we massage the data." It may well be that we just give you the federal records and others massage the data. And we have our cake and eat it too.
Now, this is starting to happen. Perhaps you are familiar with data.gov, where all of the agencies have been exhorted to put their public data on one site that can be used as a pivot point to reference from. There is also USASpending.gov, which is a mandatory thing, not a bully-pulpit thing, where every agency has to post all of their spending data, grants, contracts, intramurals, and what have you on one big site. That is aborning, and yes, it has data quality issues right now. And yes, there were some unfortunate decisions, like truncating research project abstracts to 80 characters; that doesn't tell you too much about what a particular research project is. But it's a first step, folks, it's a first step.
And then we have more special-purpose, more successful steps at the moment. NIH has NIH RePORTER, which has a record of every grant they ever made, up there for you to look at. NSF also has something called Research.gov, which has every grant they have made for 20 years, showing you the full abstract, not just the first 80 characters, and all sorts of other pertinent information.
Now, these are wonderful bases to begin with. They've got value in and of themselves, and they are wonderful bases for others to start adding value. And this is nothing totally new. If you are familiar with the patent system, the US PTO has been putting out patent application information for a long time. Some years ago I became aware of private-sector ventures to take that patent information and make it really useful to others wondering what patents have been filed and whether they can go ahead and file their own patents, and so on and so forth. Now, that was for the big players, as I recall, because it wasn't exactly something the average citizen could afford, with annual subscription fees in the hundreds of thousands of dollars.
Another area where similar things are happening: you perhaps saw the article in The New York Times a week or two ago on e-discovery, and the hand-wringing going on about the legions of lawyers who are going to be displaced when discovery for legal trials can be done electronically, rather than by young lawyers searching over mounds of haystacks looking for the needle that will be the smoking gun that allows their side to win.
But more specifically to this area, let's go back to data.gov for a second. That's a bunch of data, and you can say, well, it's not in a format that helps me very much. Some researchers have jumped in. I'm particularly familiar with Jim Hendler and his group at Rensselaer Polytechnic, but there are others as well, who are taking the data out of data.gov and putting it up in what is now called linked open data format. Now, this is pretty new, and this is getting close to the one big thing that I am going to preach about today. "Linked open data" is sort of the new phrase to try to sell to a skeptical public something called the semantic web.
Well, what is the semantic web? It's something for computers to read, whereas the regular web is something for people to read. So it attempts to get at this issue of there being too much data to read: let's give the data to the computers to read rather than having to read it yourself.
And several comments have been made about there being no universal standard for data, whether the cry is about different government forms or multiple competing standards; this is a very troubling situation. Well, it turns out that the semantic web provides a common standard for a whole lot of data of different formats. That is nothing totally new, either.
Whether you stop to think about it or not, the Internet itself was a similar thing. There were a whole bunch of different physical networks out there, Ethernet and Token Ring and SNA and DECnet, all sorts of proprietary networks. Along came the Internet, with a software network on top that hid the differences among the various proprietary networks and made them interact and play together. That is one of the reasons it was called the Internet: it was originally designed to interconnect three separate networks in the 1970s.
So we need something that can interconnect things of different formats. Well, I will give you the shortest tutorial you have ever heard about the semantic web. It is a common format for heterogeneous data, including both tabular data and textual data. All data, to oversimplify, can be either tables, which have rows and columns, or free text. And let's just talk about English, because there aren't any other languages in the world that matter. Wouldn't it be nice if there were a common format for all tables, regardless of how many columns they had, and all text, at least all English text? Well, that is what the semantic web provides. What is it?
Well, it's a table with exactly three columns. Now, how can you make different tables of different sizes, and text, fit into a table with three columns? Well, extremely briefly, for tables, the three columns are the row name, the column name, and the cell value. So you sort of explode the table out; it becomes three times as big as it was in its regular format, but hey, storage is cheap, going on free these days, so we don't worry about that.
How about text? Well, go visit your friends who do natural language processing, and you will find out we are getting pretty close to the ability to linguistically analyze sentences. I have seen an existence proof at the National Library of Medicine: a linguist there has created a system called "Semantic MEDLINE," based on the MEDLINE database of research abstracts, which can find maybe 75% of what we will call the key sentences in the abstracts of those articles. That means a subject, a verb, and an object. Hey, that's a triple. So in terms of text, your three columns are the subject, the verb, and the object. And if you can identify what you mean by the key sentences and you can put them in that format, all of a sudden you have put tabular information and textual information in exactly the same form.
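A minimal sketch of that unification, with a made-up table and a made-up key sentence; the subject-verb-object extraction is done by hand here, standing in for a real NLP parser:

```python
# Invented example: one table row and one "key sentence" both become
# (subject, predicate, object) triples, the same three-column form.

# Tabular data: row name, column name, cell value.
grants_table = {
    "grant-0042": {"agency": "NSF", "amount": 150000, "year": 2010},
}
table_triples = [
    (row, column, value)
    for row, cells in grants_table.items()
    for column, value in cells.items()
]
# [('grant-0042', 'agency', 'NSF'), ('grant-0042', 'amount', 150000), ...]

# Textual data: a key sentence reduced to subject / verb / object,
# of the kind a system like Semantic MEDLINE extracts automatically.
sentence_triples = [("NSF", "funds", "basic research")]

# Both kinds of data now fit one uniform three-column structure.
for s, p, o in table_triples + sentence_triples:
    print(s, p, o)
```

Note the blowup the speaker mentions: a one-row, three-column table becomes three triples, roughly tripling the storage.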
Now, have I told you all there is to it? No, and I don't intend to. But I will say that in addition to that, you have the ability to do the dictionary stuff, the data stuff, the tabular stuff, the index stuff. In the big time we call those ontologies, but it just amounts to what I am talking about.
How do you get your hands on the vocabulary and know what this word means and what that word means, when two words are the same, when they are different, or when two different words mean the same thing? If you can do all that, then you can search the data, not like a Google search, where you get back a million hits of things you might want to read and just hope they are in a good order, but more like a modern database search, where something called SQL, or Sequel, the Structured Query Language, goes out to a database, constructs an answer, and brings it back to you, rather than letting you construct your own answer by reading materials.
Well, the semantic web has something called SPARQL, which is the semantic equivalent of SQL: it constructs an answer for you and brings it back. Well, by the time you do all this, you have got a wonderful possibility.
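For a concrete sense of how SPARQL constructs an answer rather than returning documents, here is a minimal sketch using Python's rdflib library; the tiny graph and the example.org URIs are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace

# An invented miniature graph of triples.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.grant0042, EX.agency, Literal("NSF")))
g.add((EX.grant0042, EX.amount, Literal(150000)))
g.add((EX.grant0099, EX.agency, Literal("NIH")))

# SPARQL goes out to the graph and brings back the answer itself,
# instead of handing you a pile of documents to read.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?grant WHERE { ?grant ex:agency "NSF" . }
""")
for row in results:
    print(row.grant)  # -> http://example.org/grant0042
```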
Now, these are early days for linked open data and its use in data.gov and so forth, but there are already a couple of big applications. You have heard of Facebook, I'll bet. Did you know that that's a semantic web application? The technology under the hood that makes it work is the semantic web. Those of you in academia may have heard of something called VIVO, a developing way, under NIH support, developed originally at Cornell and now spreading out to a larger partnership, that allows universities and their faculty to have a Facebook for university professors and their work; if you will, a scholarly Facebook. And of course this is a semantic web application.
So this stuff is starting, and it is only ten years old. Usually it takes ten years for something to come out of the laboratory and another ten years, if it is any good, to become a billion-dollar industry. Well, we are ten years away from that.
Don't forget, the Internet didn't come on the public scene until thirty years after they started working on it, 1965 to 1995. So these are early days for new technologies like the semantic web. Not that researchers haven't been working on the artificial intelligence and language processing and linguistic and logical stuff for 30 or 40 years as well; they have just adapted it to the semantic web. Some people say the only thing new about the semantic web is the web, and of course that's 15 years old now.
So this is, I think, a technology to watch. In the hype cycle, don't expect it to do everything for you, especially not yet. But in terms of turning more information into knowledge, both for agency administrators and for the public and their press representatives, this is going to be, I believe, an important dimension of the future as we go along.
As I said, I was going to end with one more relationship between the public and the private sector, so let me do that very quickly. Sometimes, when I am giving a curmudgeon speech to federal contractors and so forth, I point out that I know perfectly well how we could ruin the private sector in the United States: by making them follow the federal acquisition regulations and the OPM hiring restrictions. I guess maybe I could add to that, for these purposes, making them follow the proposed federal standards for transparency and openness. Wouldn't it be fun to have similar standards for openness and transparency in the private sector as we are all here diligently seeking in the public sector?
Robert Chadduck: Thank you, George. Thank you for your insights, and as is commonly the case, George, in many respects, I have no additional questions.