In this podcast,
I'm going to explain a little bit more about keyword searching.
Chronicling America uses optical character recognition, or
OCR, to make newspaper pages searchable. Basically, OCR is a computer program that
reads the page
and does its best to figure out what words are on it. It then creates a text document that
is attached to each
image and used by the search engine to find your search terms. Let's click on an
image to open the text document. The text document is composed of words and
letters the OCR recognized.
As you can see, there are some strings of letters that are not
recognizable words, and sometimes the words the OCR recognizes are inaccurate:
they are not the words that appear on the image. For more information about this,
see the podcast titled "Overcoming Historical Language Barriers."
Because the text document underlying each image is not created by humans,
errors and imperfect OCR are unavoidable. Sometimes it is strange fonts,
imperfections in the page such as creases, low contrast,
or minimal column dividers that throw off sentence structure, which leads
to unexpected search results.
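
To make this concrete, here is a minimal Python sketch of exact keyword matching against OCR text. It is only an illustration under stated assumptions, not how the Chronicling America search engine actually works: the page names, the sample text, and the `search_pages` function are all hypothetical.

```python
# Hypothetical OCR text for three newspaper pages (made-up sample data).
ocr_pages = {
    "paper_a_page_1": "President McKinley shot at Buffalo",
    # Assumed OCR error: "in" misread as "m".
    "paper_b_page_1": "President McKmley shot at Buffalo",
    # Assumed OCR error: "i" misread as "1".
    "paper_c_page_4": "Born to Mr and Mrs Sm1th a daughter",
}


def search_pages(pages, term):
    """Return the names of pages whose OCR text contains the term as a whole word."""
    needle = term.lower()
    # Compare against lowercase, whitespace-separated words from the OCR text.
    return sorted(
        name for name, text in pages.items() if needle in text.lower().split()
    )


print(search_pages(ocr_pages, "McKinley"))
print(search_pages(ocr_pages, "Smith"))
```

In this sketch, a search for "McKinley" still finds one of the two pages that carry the story, while a search for "Smith" finds nothing, because the only page that mentions the name was misread.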
If you search for a more obscure news item, such as a birth announcement that would
only be found in one paper,
and the OCR missed it in that one paper, then the article will be missed altogether.
By contrast, if you search for a news item such as McKinley's assassination
and the OCR misses it in one paper, you will likely still get many results from
other papers, because McKinley's assassination was news
of national importance. Nevertheless,
keyword searchability, even if it doesn't work one hundred percent of the time,
does greatly increase access and is definitely time-saving when compared to
working with print
or microfilmed newspapers. And that concludes our podcast about
understanding keyword searchability.
Check out our other podcasts for more information about Chronicling America.