This video will give you an overview of CINCH: a tool you can use to automate downloading files from the internet in a preservation-friendly manner.
The development of CINCH was made possible by an Institute of Museum and Library Services Sparks! Ignition grant.
CINCH (which stands for Capture, Ingest, Checksum) is a tool that automates the transfer of online content to a repository, using ingest technologies appropriate for digital preservation.
More familiarly, CINCH grabs freely available online content, authenticates it, extracts metadata, and readies it for repository ingest.
We developed CINCH for the widest possible audience. To this end, it is modular, flexible, easy to use, repository-neutral, and open source.
North Carolina libraries can use a hosted version of CINCH through NC LIVE beginning in July of 2012.
This is an overview of what CINCH does. We’ll walk through all of these steps individually, so don’t worry about examining this workflow now.
OK, let’s start at the top.
First, the user logs in to CINCH. They upload a list of URLs that point to the files they want to retrieve from the internet.
CINCH currently supports the file formats listed here. The files must be freely accessible (that is, not password protected).
CINCH then locates the URLs in that file list. You may be asking how you might create your file list.
Here are a few suggestions. If you currently subscribe to the Internet Archive’s Archive-It service
and are looking for .pdfs, you can use the .pdf report generated by that service.
If you crawl websites using an alternative tool, you may also be able to generate a list using that tool.
Sitemap generators are also a good way to find all of the files on a specific website.
The output of a sitemap generator can be used to create your file list for CINCH.
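As a rough illustration of that last suggestion, here is a minimal sketch that pulls URLs out of a standard sitemap.xml document and keeps only the ones with the extensions you want. The function name `urls_from_sitemap` and the default `.pdf` filter are assumptions made for this example; they are not part of CINCH.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemaps.org sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text, wanted_extensions=(".pdf",)):
    """Return the <loc> URLs whose path ends in one of the wanted extensions."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        # Compare extensions case-insensitively, ignoring any query string.
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext in wanted_extensions:
            urls.append(url)
    return urls
```

Writing the returned URLs to a text file, one per line, would give you a list to upload, assuming that is the layout your file list needs.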
Once you have uploaded your file list and CINCH has located the requested files, it performs the following actions on the files in their remote location.
It calculates the checksum or hash of the remote file. If this isn’t successful, a note is made in the event list
and the file is not downloaded.
It checks to make sure the file you’re requesting has an allowable extension. Remember, we mentioned allowable files earlier in this video.
It checks the name of your file to see if it matches the name of any previous files you’ve downloaded.
If it finds a match, the file is moved to a problem_files folder for later review.
It assesses the size of the file. At this time CINCH can only handle files up to .4 GB in size.
Anything larger will not be downloaded. A message to this effect will be added to the event list.
It also assigns a unique modified file name to the remote file. I’ll explain later how it chooses those names.
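The extension, name, and size checks just described can be sketched roughly like this. This is not CINCH's own code: the function `precheck`, its return values, and the order of the checks are assumptions for illustration, and the allowed extensions and size limit are passed in as parameters rather than hard-coded.

```python
import posixpath
from urllib.parse import urlparse

def precheck(url, size_bytes, seen_names, allowed_extensions, max_bytes):
    """Return (decision, note) for one remote file.

    decision is "download", "skip", or "problem" (hypothetical labels).
    seen_names is a set of file names from previous downloads.
    """
    name = posixpath.basename(urlparse(url).path)
    ext = posixpath.splitext(name)[1].lower()
    if ext not in allowed_extensions:
        return "skip", f"{name}: extension {ext or '(none)'} is not allowed"
    if size_bytes > max_bytes:
        return "skip", f"{name}: {size_bytes} bytes exceeds the size limit"
    if name in seen_names:
        # Name clashes are set aside for review rather than dropped.
        return "problem", f"{name}: name matches a previous download"
    seen_names.add(name)
    return "download", f"{name}: passed pre-download checks"
```

Each returned note is the kind of message that could be written to the event list.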
CINCH then downloads your files.
Once your files are downloaded, CINCH does the following:
First, last modified dates and times are verified against the information from before the files were downloaded.
If the dates and times do not match, they are reset so that they do. This does not result in any change to the file’s checksum value.
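To see why resetting the date does not affect the checksum, here is a small sketch. A checksum is computed from the file's contents only, while the last modified time is stored by the file system alongside the file. The helper names and the choice of SHA-256 are assumptions for this example; the video does not say which hash algorithm CINCH uses.

```python
import hashlib
import os

def sha256_of(path):
    """Hash the file's bytes; file-system timestamps are not part of the input."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def reset_mtime(path, remote_mtime):
    """Set the local last modified time to the remote value if they differ."""
    st = os.stat(path)
    if int(st.st_mtime) != int(remote_mtime):
        # Keep the access time, change only the modification time.
        os.utime(path, (st.st_atime, remote_mtime))
```

Calling `sha256_of` before and after `reset_mtime` on the same file should give the same value, since only the timestamp changes.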
Next, it scans those files for viruses. Any that are found to have viruses are deleted and a note is made in the event list.
A checksum is calculated for the local file, and compared with the checksum calculated before the file was downloaded.
If the checksum has changed, CINCH moves the file to the problem_files folder.
CINCH checks to make sure there are no files with duplicate checksums in your current batch.
If there are, they are moved to the problem_files folder.
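The two checksum checks can be sketched as follows. This is an illustration only. The function name and tuple layout are hypothetical, and the sketch assumes that the first file with a given checksum is kept and only later copies are set aside, which the video does not specify.

```python
def find_problem_files(batch):
    """batch: list of (name, remote_checksum, local_checksum) tuples.

    Returns a dict mapping each problem file name to the reason it was flagged.
    """
    problems = {}
    first_seen = {}  # checksum -> name of the first file that had it
    for name, remote_sum, local_sum in batch:
        if remote_sum != local_sum:
            problems[name] = "checksum changed during download"
        elif local_sum in first_seen:
            problems[name] = f"duplicate of {first_seen[local_sum]}"
        else:
            first_seen[local_sum] = name
    return problems
```

Files named in the returned dict are the ones that would go to the problem_files folder.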
Finally, metadata is extracted from the file. I’ll go over the metadata CINCH retrieves in a moment.
When these actions are complete, CINCH packages your files into one or more zip files.
Each zip file can have up to .5 GB of content.
If your current request has more than that, you’ll receive more than one zip file. Each zip file has a manifest, an event list, and a metadata file.
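One simple way to split a request across several zip files is to fill each one in order until the next file would push it over the limit. The sketch below shows that idea. The function `plan_batches` is hypothetical, and CINCH may divide files differently.

```python
def plan_batches(files, max_bytes):
    """files: list of (name, size_in_bytes). Returns a list of name lists,
    one per zip file, each kept at or under max_bytes where possible."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Start a new zip when adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then get its own manifest, event list, and metadata file.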
CINCH will email you, letting you know that your files are ready.
You can then download your file or files, to be processed however you’d like. You have 30 days to download your files.
Now I’d like to show you an example of a zip file you’d receive from CINCH.
I uploaded my list of URLs pointing to online .pdf documents, and received a message from CINCH that my files are ready.
This is what I see when I log in to CINCH.
Here are the contents of my .zip file. You’ll note that inside is a folder containing problem files (if any).
Remember, those are files that may have duplicate checksums, or files for which CINCH was unable to calculate a checksum.
Next are my metadata documents. There’s an event list, detailing every action
CINCH took with each individual file, a simple file manifest, and a metadata document.
Below that you’ll see the requested files.
Remember I mentioned that CINCH renames each file with a unique file name?
Here you can see examples of how CINCH performed that action.
Each of these file names comes in part from the original URL from which the file was retrieved, and the original file name.
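To give a feel for how a name can be built from the URL and the original file name, here is a minimal sketch. The exact naming pattern CINCH uses is shown on screen and is not reproduced here, so the scheme below (host and path segments joined with underscores) is purely an assumption for illustration.

```python
import re
from urllib.parse import urlparse

def modified_name(url):
    """Build a file name from the URL's host, path, and original file name."""
    parts = urlparse(url)
    segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
    joined = "_".join(segments)
    # Replace anything that is awkward in a file name with an underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", joined)
```

With this scheme, two files called annual.pdf from different sites or folders end up with different names, so neither overwrites the other.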
If I open the Event list file, I see that CINCH kept track of each event it performed on each file, as well as the event date and time.
Finally, let's take a look at the metadata document. Inside you’ll see a number of metadata fields.
If you’ve requested .pdf files, as I did, the metadata document may include fields like author, creation date, creator, etc.
Most of these fields come from the file’s properties, or the embedded metadata in each file header.
Fields like possible title and possible keywords include extracted information that's just a best guess.
Your metadata document will be different based on the types of files you're requesting from CINCH.
Information about where each of these fields comes from can be found within CINCH’s documentation.
CINCH is freely available for download and use at this address. You can also find more information at our website.
As always, feel free to contact us at digital.info@ncdcr.gov if you have any comments or questions. And thanks for listening!