NDNP Podcast 11 - Understanding keyword searching

In this podcast, I'm going to explain a little bit more about keyword searching. Chronicling America uses optical character recognition, or OCR, to make newspaper pages searchable. Basically, OCR is a computer program that reads the page and does its best to figure out what words are on it. It then creates a text document that is attached to each image and used by the search engine to find your search terms.

Let's click on an image to open the text document. The text document is composed of the words and letters the OCR recognized. As you can see, there are some strings of letters that are not recognizable words, and sometimes the words the OCR recognizes are not accurate and are not the words that appear on the image. For more information about this, see the podcast titled Overcoming Historical Language Barriers.

Because the text document underlying each image is not created by humans, errors and imperfect OCR are unavoidable. Sometimes it is strange fonts, or imperfections in the page such as creases, low contrast, or minimal column dividers, that throw off sentence structure, which leads to unexpected search results. If you search for a more obscure news item, such as a birth announcement that would only be found in one paper, and the OCR missed it in that one paper, then the article will be missed altogether. By contrast, if you search for a news item such as McKinley's assassination and the OCR misses it in one paper, you will likely still get many results from other papers, because McKinley's assassination was news of national importance.

Nevertheless, keyword searchability, even if it doesn't work one hundred percent of the time, does greatly increase access and is definitely time-saving when compared to working with print or microfilmed newspapers.

And that concludes our podcast about understanding keyword searchability. Check out our other podcasts for more information about Chronicling America.