Robert Chadduck: Thank you, Richard. Ken, in the context of both your career experience spanning multiple interests in government information and records, and now your current role as a distinguished visiting scientist at NIST ITL, are there any perspectives or technical opportunities that you would like to draw our attention to?
Ken Thibodeau: Yes, but as an archivist, first I have to set the record straight. I am not here today representing NIST or any other part of the US government, but speaking as a private citizen. So I get to say whatever I want.
Robert Chadduck: That is why you are visiting, Ken. Today you are not visiting.
Ken Thibodeau: You might also have deduced, from the fact that I am the only panelist who has a full suit and tie on, that having spent nearly 40 years in the government, I carry a lot of baggage from that experience. And I have a prejudice regarding today's topic, so I want to confess that up front. That prejudice is simply this: for most of my forty years, most of my effort was spent on pushing the use of computer technology for the conduct of the government's business and for the management of the information that results from that business.
Within that context, I don't believe that technology can solve the major problems of access by the press or by the public to government information, because they are not technological problems. They are the kinds of things we heard about from the first panel. There are legal barriers. There are bureaucratic barriers. There are organizational and cultural barriers. And there is something that I can only call American barriers, because we are simply such a large and heterogeneous nation that for almost anything you propose doing, you will find dissonant voices who want to do something different.
We have already had an example of that this morning: two completely legitimate voices, but dissonant, where David wants government data available in a form that is easily analyzable and easily, what's the verb, mashupable? Whereas if we give it to him in that form, we are creating what Sarah Cohen calls "false records": things that never existed as part of the way the government did its business. Hopefully there is some way of balancing those out, but I think the fundamental solution has to come prior to the technology, and from the perspective of an old-time bureaucrat, I would say Gary Bass hit on one of the most fundamental things that has to be done. If you really want the government to proactively and aggressively make information available, first of all to the citizens, because that is our fundamental responsibility, and secondly to the press, because they are an important intermediary in helping citizens understand government information, you have to make it in the self-interest of the agencies to do that.
Because under the current situation, by and large, there is no reward. There is not even recompense, and in many situations there are not even resources available specifically for responding to outsider requests for information. As we heard from Bill, the situation at DOD is no different from others, except I was surprised to learn that DOD actually does fund the FOIA office. There are other organizations where you simply assign people to do the work, and they scrounge whatever they can, and they don't have communications within the agency.
Having said that, let me shift to the technology. I do love the technology; I would not have spent my career on it otherwise. Technology could, once you get over that barrier, help the situation a lot. But I can only say it could, because as you can see in my prepared remarks, there are pluses and minuses whichever way you look. I think there are three basic questions: Can the technology make more government information available? Secondly, if it is available, can the technology actually help you get access to it? And thirdly, if you get access to it, can the technology help you do with it what you want to do? There are pluses and minuses in all of those.
Undoubtedly, the technology makes more information available. There is simply an explosion of data as a result of the use of computers, and that is true of the whole world as well as the federal government. But it's easy to find a voluminous literature on the dangers that IT, and digital technologies specifically, poses, because it makes it so easy to destroy or alter information, or so hard to carry it forward in time in a way that is usable, given technology change.
On the other side, I think there are very simple methods that can be, and frequently are, implemented in agencies that do a better job of guaranteeing the survival of government information. My best example: Oliver North shredded his paper notes, but he couldn't even touch the backup tapes that held multiple digital copies of those notes in their original form. And as a result, those notes are in the National Archives today.
The Reagan Administration was actually the first administration from which we got email, and we got the Reagan and first Bush email at the same time. It was a collection of about a couple hundred thousand messages. And I have to tell you, the way NARA dealt with it at the time was, from a technological standpoint, primitive, and from the personal standpoint of the people who had to do that work, which included poor Bob, who was our lead engineer at the time, at the very least you'd have to say it was gruesome. I won't go into details, but I would in the breaks, if you are interested in hearing horror stories about how you process government information when you have no system available, and when, because of the sensitivity of White House information, you are only allowed to create a system in which your staff have no way to see the data they are processing. We couldn't print; we couldn't dump to screen.
Contrast that with the George W. Bush Administration. We got in excess of 200 million emails from the Bush White House, a total of some 82 terabytes of electronic records. In less than a year, roughly eight months, we had not only ingested all of that into an archival system we had recently deployed specifically for Presidential electronic records, but on the way in we indexed it all, so you can do searches on the full content of all of it.
But actually, if you know the situation, that's not very good. Searches of the kind you can do with Google and Bing and other internet search engines are very good at finding you pieces of relevant information, but they are not very good at weeding out the chaff. And our archivists in Presidential Libraries, if they get a request from the White House or the Congress or the courts, don't want a result set of 400,000 hits if only maybe 10,000 of them are really responsive.
And in fact, the other problem exists as well. You may get a lot of stuff that is really not what you are looking for, and conversely, the probability is that you are not going to get much of what you are looking for. NARA's chief litigator, who is a personal friend of mine, has in his spare time, which I figure is some time between 2:30 and 3:30 a.m., been involved for several years in a nationwide research project on search mechanisms that can be used in e-discovery, which is very important in litigation. What they found in this research is that you are really lucky, using current Boolean-type searches, if you get 41% of what is really relevant in the collection that you are searching.
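In information-retrieval terms, that 41% is a recall figure: the fraction of the truly relevant documents that a search actually returns. A minimal sketch of the arithmetic, with made-up document IDs purely for illustration:

```python
# Hypothetical e-discovery result set; the document IDs are invented.
relevant = {f"doc{i:02d}" for i in range(1, 11)}   # 10 truly responsive docs
retrieved = {"doc01", "doc03", "doc05", "doc07",    # 4 true hits...
             "doc11", "doc12", "doc13"}             # ...plus 3 false positives

true_hits = relevant & retrieved
recall = len(true_hits) / len(relevant)      # share of the relevant set we found
precision = len(true_hits) / len(retrieved)  # share of what we found that is relevant

print(f"recall={recall:.0%}, precision={precision:.0%}")  # recall=40%, precision=57%
```

A Boolean search with 41% recall misses nearly six of every ten responsive documents, which is exactly the problem being described.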
So the current technology is not very good. It is also not very good for the reasons that Pittham has discovered: computers today, by and large, can't handle context, and context is often definitive. So I was very happy to hear that Pittham is promoting efforts to make use of context in the discovery and management of government information.
A really extreme example of how poor today's computer systems are, and not run-of-the-mill but advanced computer systems: it was the lack of understanding of context that led IBM's Watson, on Jeopardy, to identify Toronto as an American city in answering a question about airports in American cities. Watson is the computer that beat the best human Jeopardy contestants.
So things are not ideal, but they are getting better. In addition to indexing all of the Bush electronic records on the fly as they came in, we were able, for at least 90% of them, to associate the records with metadata and contextual information that allows faceted search. So if I am doing a search on White House email, I can search for *** Cheney, but I can also tell the computer, "I want to search on messages in which *** Cheney was discussed, or in which he was the author, a recipient, a carbon copy, or a blind copy." I can even use the metadata to say I am only interested in *** Cheney's email if he actually read it, because that data is sitting there on the computer. So it is a much richer way of finding information, and that will continue in the future.
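As a rough illustration of what faceted search over email metadata buys you, here is a minimal sketch; the record structure and field names are invented for the example, not NARA's actual schema:

```python
from dataclasses import dataclass, field

# Invented record structure for illustration; not NARA's actual schema.
@dataclass
class EmailRecord:
    subject: str
    author: str
    recipients: list = field(default_factory=list)
    cc: list = field(default_factory=list)
    bcc: list = field(default_factory=list)
    was_read: bool = False
    body: str = ""

def faceted_search(records, name, role="any", must_be_read=False):
    """Filter by the role a person played in a message, not just full text."""
    def matches(r):
        roles = {
            "author":    name == r.author,
            "recipient": name in r.recipients,
            "cc":        name in r.cc,
            "bcc":       name in r.bcc,
            "mentioned": name in r.body,
        }
        hit = roles.get(role, any(roles.values()))   # "any" falls through
        return hit and (r.was_read or not must_be_read)
    return [r for r in records if matches(r)]
```

A full-text engine can only tell you that a name appears somewhere; the metadata facets let you ask in what capacity it appears, which is the distinction being drawn here.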
And I want to finish with the third question: what about actually being able to use the information, often very voluminous information, that you could get from the government? There I think things are very promising, and I can cite several examples, actually from NARA. For many years now, actually since we got the Bush and Reagan email, we have been sponsoring research by Bill Underwood down at Georgia Tech on the use of automated techniques to characterize and even describe electronic records.
The tools Bill and his team are developing can do things like distinguish the type of an electronic document: is it a policy document, is it correspondence? And distinguish subtypes: if it is correspondence, is it letters or memoranda? They can even go to a finer level and say what a letter is about, and tell you: this set of letters relates to personnel appointments that need Senate confirmation, that set are responses to inquiries from citizens, that set is correspondence with major stakeholders about a legislative initiative. So the tools actually allow you to understand a lot of what's in a collection before you have read any of it.
There is work supported by DOE at the Los Alamos Laboratory, by Jorge Raymond and Shelly Spering, that is looking at managing records in terms of what they contain, emphasizing entire collections rather than simply individual documents. They can look at a collection of, say, a Cabinet Secretary's correspondence over four years and tell you what major themes show up in that correspondence, and also how those themes evolve over time. Then you can do mashups between things like the evolution of themes and the political or international events going on in the world that might have caused the interest of high-level officials to shift from one area to another.
The National Science Foundation and NARA are also supporting research at the Texas Advanced Computing Center by Dr. Maria Esteva and her colleagues, who are developing methods to visualize information about large collections, and we are talking millions of documents, so that you can interactively explore, based on the visualization, what's in a collection and what subsets of that collection you might actually be interested in.
But going back to that fundamental problem, that you need to incentivize agencies to make information available: I think one of the things that needs to happen in the government, to make more of our records available to citizens and the press, is an infrastructure change, using a common technologists' technique: if you can't solve a problem, go to a higher level of abstraction, because it's often easier to solve it at that higher level. Or, another way of putting it, introduce levels of indirection. So I think it would be good if the government's IT infrastructure were to move so that we could introduce more and more indirection between the information we have in storage and the applications that we use to conduct government business.
Because when you come into a government agency and you ask for information, you are at least competing, if not conflicting, with the use of that system to do our current business. And that is a basic problem.
But if we could separate those, and we had either an app or a service that could look at the information in storage in response to requests from the public, then you wouldn't be interfering with the officer who is responsible for awarding the contract, or administering the contract, or getting the grant out, or deciding whether the Medicaid cost is legitimate, and so on. It wouldn't be interfering with their work, but hopefully you would be making more government information available to more Americans. Thank you.
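One way to picture the indirection Thibodeau describes is a read-only public query service sitting on a replicated snapshot of the records, so that outside requests never touch the transactional system agency staff are using. A minimal sketch, with all class and method names invented for illustration:

```python
import copy

class TransactionalStore:
    """The system of record that agency staff use for current business."""
    def __init__(self):
        self.records = {}

    def write(self, record_id, data):
        self.records[record_id] = data

class PublicQueryService:
    """A read-only layer of indirection that serves outside requests
    from a periodic snapshot, not from the live operational system."""
    def __init__(self, store):
        self._store = store
        self._snapshot = {}

    def refresh(self):
        # Copy released records out of the operational system.
        self._snapshot = copy.deepcopy(self._store.records)

    def search(self, keyword):
        return [rid for rid, data in self._snapshot.items()
                if keyword.lower() in str(data).lower()]
```

The design point is the decoupling: public searches hit the snapshot, so the contract officer's live system never competes with citizen requests.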
Robert Chadduck: Thank you, Ken. I will have no further questions about what constitutes gruesome engineering challenges. Looking forward, all of the interest is in technology that eliminates the gruesomeness of what the government confronts. So in that respect, Ken, thank you.
George, there is no surprise that we have moved towards the middle. What we have spoken about in the last hour is the interest in making government information available and, as I would informally put it, repurposable: the aspect of context-aware computing and technologies. And particularly from your vantage point, George, in the NITRD program, being able to look across the coordinated research of multiple agencies, I'd welcome your perspectives. Thank you.
George Strawn: Thank you, Bob. Good morning, ladies and gentlemen. A special good morning to Matt and Leslie up there, two of my long-term colleagues from the National Science Foundation Office of the General Counsel. I think the NSF Office of General Counsel might serve as a metaphor for the activity that we are discussing today. I was very surprised, when I first came to NSF, to find that the purpose of the Office of the General Counsel was to help me get my job done, not to say no. And I find it an exemplary Office of General Counsel.
Well, if you check the schedule you will note that we are supposed to be identifying specific technical problems. I am going to tell you about one big one. And technical solutions, or research agendas to find solutions to these problems? I am going to tell you one thing. Just like the tiresome father in the old movie The Graduate, who had just one thing to say to the hapless graduate, I will tell you one thing.
The sub-bullets that we might also talk about mention quality of records. Whoa! You don't know how big a problem that is, ladies and gentlemen. Now that we are going to make records transparent, we are going to find out. And ten years from now, government records are going to be better than private-sector records, because ours will be transparent and theirs won't, and theirs will still have the garbage in them.
For government agencies, what can we do about documents that are not born digital? Well, there is a religious metaphor for that: they have got to be born again. (Laughter) So we are working on scanning technologies that will take analog stuff and turn it into digital, and that is getting better too.
Finally, insights into emerging technologies that may help, blah, blah, blah. Well, that is the one big thing that I am going to feel free to enlighten you with.
I am going to start with one quick analogy between the public and the private sector and end up with one further analogy between the two. James Q. Wilson, a political scientist at UCLA, wrote a book a number of years ago contrasting the private sector and the public sector; the title was something like "The Bureaucracy and Why It Works That Way." And we have alluded to a number of his points in this and the previous panel.
First of all, we, the government, have unusual authorities, in particular to tax the citizens and to send the citizens to war. Those are both pretty awesome responsibilities, and they impose some unusual obligations upon us. Also, since we are a democratic society, we in the government have to take matters of equity extremely seriously, probably more seriously than non-governmental entities have to. Because of these types of issues, we are considerably subject to accountability, and openness and transparency are there, in general, to support our accountability, for all of these appropriate reasons.
Oh, by the way, there is a fourth thing we are supposed to do, which is the same thing as the private sector, the only thing the private sector is supposed to do: be efficient. The private sector has to be efficient enough to stay in business. Well, we have got to be efficient as well as doing those particular things.
So we are in a digital world, and the three big dimensions of digital stuff are access to the data, data quality, and turning data into knowledge, which is another phrase for some of the things that we have been talking about already. Access to data, well, that's a policy issue and a culture issue, the culture of sharing we have been referring to, which is still aborning, if it ever arrives. Since policy isn't my business, I can glibly say: if it is FOIAble, let's webize it. Anything that is FOIAble, let's just put it up on the web and be done with it. But that's not my business, so we'll see how that works out.
I have already mentioned data quality, and I think that is the unsuspected 800-pound gorilla sitting over in the corner that we will find out about. It isn't just in the government. How about hospitals, where it is reported that a hundred thousand accidental deaths occur each year in the nation's hospitals, a good part of them because of bad information in the patients' records?
But turning data into knowledge is one of the interesting things that the IT people are trying to help with right now. Here we are talking about how we make useful information, make knowledge, available to the public and the press, and how we get beyond the fact that bad news is redundant.
I guess the way I think we need to do it, and one of my agenda items, is to preach to the agency administrations in the federal government: "You have got to get on top of this information." Don't forget, this is new stuff. If you think the agency directors and secretaries have full access to all the information that makes their agencies run, you have another think coming. We are able to keep records at a low level, which allows transaction processing to occur. But filtering up the data, turning it into knowledge that is useful to the directors of the agencies, that is still something we are trying to get done, too.
So, my advice to agency directors is: "You have got to get on top of this new method of turning data into knowledge, because it will help you manage your agencies better. Oh, by the way, there is another reason why you ought to be on top of these techniques and technologies: others are getting on top of them, too, and they might use that information to whack you. The more you know about it, you will a) get things straightened out and b) be able to respond knowledgeably to any nefarious characterizations of information from your agency," etc., etc.
So, I think the way we make the information more available to the public and the press is to also make it more available to agency leadership, who will find out it's useful, and then it will trickle out more automatically than it has been able to do heretofore.
Alright, what's the one big problem that I see, that we have all talked about, but that I want to make explicit? The one big problem is that there is too much data. There is too much data to read. Okay?
Therefore our silicon friends have to do more for us. We have to find ways to get our computers and our networks to help us more, to help us zero in on the data that we want. That means getting rid of the superfluous stuff, as was previously mentioned, and maybe even more than that, as I'll talk about in a minute.
So, one of the sound bites of this is that we need to find ways to turn the data deluge into a data cornucopia. Or, to use my other metaphor, to turn the data deluge into a knowledge cornucopia. What's the ideal solution? Well, we don't need to worry about implementing it for a while, because it is a ways off. Watson, this computer program that successfully trounced two Jeopardy champions, points in that direction of semantic search: can we understand enough about the context that, when somebody asks a plain old question in normal natural language, we can figure out what they really mean, and do we have access to all of the data and information in the world needed to construct the correct answer?
Well, Watson shows, as other Turing tests show, how easy it is to fool a person into thinking there may be a person there when there is just a computer. But the Toronto example shows that there will be a knee-slapper every now and then, when some little bit of context is missing and the machine gives a surprisingly silly answer.
Alright, so let's say that semantic search a la the computers in science fiction's Star Trek and other movies is a ways away, probably not to worry about now, although there are plenty of researchers busy in the laboratories who will be surprising us in times to come. What do we have now?
Well, I think there is a partial solution emerging now. I particularly liked the comment earlier that the Federal government needs to act as the wholesaler, not the retailer. Because that is starting to happen.
If we put out the official records in a web-readable format of one sort or another, and then others add value to that information, you get by the conundrum of "we produce non-records because we massage the data." It may well be that we just give you the federal records and others massage the data. And we have our cake and eat it too.
Now, this is starting to happen. Perhaps you are familiar with data.gov, where all of the agencies have been exhorted to put their public data on one site that can be used as a pivot point to reference from. There is also USASpending.gov, which is a mandatory thing, not a bully-pulpit thing, where every agency has to post all of their spending data, grants, contracts, intramurals, and what have you on one big site. That is aborning, and yes, it has data quality issues right now. And yes, there were some unfortunate decisions, like truncating research project abstracts to 80 characters; that doesn't tell you too much about what a particular research project is. But it's a first step, folks, it's a first step.
And then we have more special-purpose, more successful steps at the moment. NIH has NIH RePORTER, which has a record of every grant they ever made, up there for you to look at. NSF also has something called Research.gov, which has every grant they have made for 20 years, showing you the full abstract, not just the first 80 characters, and all sorts of other pertinent information.
Now, these are wonderful bases to begin with. They've got value in and of themselves, and they are wonderful bases for others to start adding value. And this is nothing totally new. If you are familiar with the patent system, the US PTO has been putting out patent application information for a long time. Some years ago I became aware of private-sector ventures to take that patent information and make it really useful to others wondering what patents have been filed and whether they can go ahead and file their own patents, and so on and so forth. Now, that was for the big players, as I recall, because it wasn't exactly something the average citizen could afford, with annual subscription fees in the hundreds of thousands of dollars.
Another area where similar things are happening: you perhaps saw the article in The New York Times a week or two ago on e-discovery, and the hand-wringing going on about the legions of lawyers who are going to be displaced when discovery for legal trials can be done electronically, rather than by young lawyers searching over mounds of haystacks looking for the needle that will be the smoking gun that allows their side to win.
But more specifically to this area, let's go back to data.gov for a second. That's a bunch of data, and you can say, well, it's not in a format that helps me very much. Some researchers have jumped in. I'm particularly familiar with Jim Hendler and his group at Rensselaer Polytechnic, but there are others as well, who are taking the data out of data.gov and putting it up in what is now called linked open data format. Now, this is pretty new, and this is getting close to the one big thing that I am going to preach about today. "Linked open data" is sort of the new phrase to try to sell to a skeptical public something called the semantic web.
Well, what is the semantic web? It's something for computers to read, whereas the regular web is something for people to read. So it attempts to get at this issue of there being too much data to read: let's give the data to the computers to read rather than having to read it yourself.
And several comments have been made about there being no universal standard for data, whether the cry is about different government forms or multiple competing standards; this is a very troubling situation. Well, it turns out that the semantic web provides a common standard for a whole lot of data of different formats. That is nothing totally new, either.
Whether you stop to think about it or not, the Internet itself was a similar thing. There were a whole bunch of different physical networks out there, Ethernet and Token Ring and SNA and DECnet, all sorts of proprietary networks. Along came the Internet, with a software network on top that hid the differences among the various proprietary networks and made them interact and play together. That is one of the reasons it was called the Internet: it was originally designed to interconnect three separate networks in the 1970s.
So we need something that can interconnect things of different formats. Well, I will give you the shortest tutorial you have ever heard about the semantic web. It is a common format for heterogeneous data, including both tabular data and textual data. All data, to oversimplify, can be either tables, which have rows and columns, or free text. And let's just talk about English, because there aren't any other languages in the world that matter. Wouldn't it be nice if there were a common format for all tables, regardless of how many columns they had, and all text, at least all English text? Well, that is what the semantic web provides. What is it?
Well, it's a table with exactly three columns. Now, how can you make different tables of different sizes, and text, fit into a table with three columns? Well, extremely briefly, for tables, the three columns are the row name, the column name, and the cell value. So you sort of explode the table out; it becomes three times as big as it was in its regular format, but hey, storage is cheap, going on free these days, so we don't worry about that.
How about text? Well, go visit your friends who do natural language processing, and you will find out we are getting pretty close to the ability to linguistically analyze sentences. I have seen an existence proof at the National Library of Medicine: a linguist there has created a system called "Semantic MEDLINE," based on the MEDLINE database of research abstracts, which can find maybe 75% of what we will call the key sentences in the abstracts of those articles. That means a subject, a verb, and an object. Hey, that's a triple. So in terms of text, your three columns are the subject, the verb, and the object. And if you can identify what you mean by the key sentences and you can put them in that format, all of a sudden you have put tabular information and textual information in exactly the same form.
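A minimal sketch of that unification, with a made-up table and a made-up key sentence; the subject-verb-object extraction is done by hand here, standing in for a real NLP parser:

```python
# Invented example: one table row and one "key sentence" both become
# (subject, predicate, object) triples, the same three-column form.

# Tabular data: row name, column name, cell value.
grants_table = {
    "grant-0042": {"agency": "NSF", "amount": 150000, "year": 2010},
}
table_triples = [
    (row, column, value)
    for row, cells in grants_table.items()
    for column, value in cells.items()
]
# [('grant-0042', 'agency', 'NSF'), ('grant-0042', 'amount', 150000), ...]

# Textual data: a key sentence reduced to subject / verb / object,
# of the kind a system like Semantic MEDLINE extracts automatically.
sentence_triples = [("NSF", "funds", "basic research")]

# Both kinds of data now fit one uniform three-column structure.
for s, p, o in table_triples + sentence_triples:
    print(s, p, o)
```

Note the blowup the speaker mentions: a one-row, three-column table becomes three triples, roughly tripling the storage.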
Now, have I told you all there is to it? No, and I don't intend to. But I will say that in addition to that, you have the ability to do the dictionary stuff, the data stuff, the tabular stuff, the index stuff. In the big time we call those ontologies, but it just amounts to what I am talking about.
How do you get your hands on the vocabulary and know what this word means and what that word means, when two words are the same, when they are different, or when two different words mean the same thing? If you can do all that, then you can search the data, not like a Google search, where you get back a million hits of things you might want to read and just hope they are in a good order, but more like a modern database search, where something called SQL, or Sequel, the Structured Query Language, goes out to a database, constructs an answer, and brings it back to you, rather than letting you construct your own answer by reading materials.
Well, the semantic web has something called SPARQL, which is the semantic equivalent of SQL: it constructs an answer for you and brings it back. Well, by the time you do all this, you have got a wonderful possibility.
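For a concrete sense of how SPARQL constructs an answer rather than returning documents, here is a minimal sketch using Python's rdflib library; the tiny graph and the example.org URIs are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace

# An invented miniature graph of triples.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.grant0042, EX.agency, Literal("NSF")))
g.add((EX.grant0042, EX.amount, Literal(150000)))
g.add((EX.grant0099, EX.agency, Literal("NIH")))

# SPARQL goes out to the graph and brings back the answer itself,
# instead of handing you a pile of documents to read.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?grant WHERE { ?grant ex:agency "NSF" . }
""")
for row in results:
    print(row.grant)  # -> http://example.org/grant0042
```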
Now, these are early days for linked open data and its use in data.gov and so forth, but there are already a couple of big applications. You have heard of Facebook, I'll bet. Did you know that that's a semantic web application? The technology under the hood that makes it work is the semantic web. Those of you in academia may have heard of something called VIVO, a developing way, under NIH support, developed originally at Cornell and now spreading out to a larger partnership, that allows universities and their faculty to have a Facebook for university professors and their work; if you will, a scholarly Facebook. And of course this is a semantic web application.
So this stuff is starting, and it is only ten years old. Usually it takes ten years for something to come out of the laboratory and another ten years, if it is any good, to become a billion-dollar industry. Well, we are ten years away from that.
Don't forget, the Internet didn't come on the public scene until thirty years after they started working on it, 1965 to 1995. So these are early days for new technologies like the semantic web. Not that researchers haven't been working on the artificial intelligence and language processing and linguistic and logical stuff for 30 or 40 years as well; they have just adapted it to the semantic web. Some people say the only thing new about the semantic web is the web, and of course that's 15 years old now.
So this is, I think, a technology to watch. In the hype cycle, don't expect it to do everything for you, especially not yet. But in terms of turning more information into knowledge, both for agency administrators and for the public and their press representatives, this is going to be, I believe, an important dimension of the future as we go along.
As I said, I was going to end with one more relationship between the public and the private sector, so let me do that very quickly. Sometimes, when I am giving a curmudgeon speech to federal contractors and so forth, I point out that I know perfectly well how we could ruin the private sector in the United States: by making them follow the federal acquisition regulations and the OPM hiring restrictions. I guess maybe I could add to that, for these purposes, making them follow the proposed federal standards for transparency and openness. Wouldn't it be fun to have similar standards for openness and transparency in the private sector as we are all here diligently seeking in the public sector?
Robert Chadduck: Thank you, George. Thank you for your insights, and as is commonly the case, George, in many respects, I have no additional questions.