CONTENTdm Flex Loader

Learn how to set up and use the CONTENTdm Flex Loader to batch import newspapers and eBooks.

Flex Loader is a Windows application that enables users to efficiently batch import newspapers and eBooks with OCR transcript text encoded in ALTO XML and packaged in METS XML.

See below for instructions for installing and using the CONTENTdm Flex Loader. The Flex Loader supports the import of data into CONTENTdm from the METS/ALTO format. Currently supported METS/ALTO formats are newspapers and monographs (eBooks).

Download Flex Loader Version 6.2

Save the file below to your computer and run the executable to install.

Right-click to save: InstallCONTENTdmFlexLoader62.exe

Requirements

Check that you have the following before installing the Flex Loader:

Recommended: Windows XP with Service Pack 3 or later (32- or 64-bit), or Windows Vista Business Edition with Service Pack 1 or later (32- or 64-bit).
- The following have also been tested and will work with Flex Loader: Windows XP Professional Edition, Windows Vista Ultimate, Windows Vista Enterprise, Windows 2003 R2 Enterprise, Windows 2003 R2 Standard, Windows 2008 Enterprise, Windows 2008 Data Center, Windows 2008 R2 Enterprise, Windows 7 Ultimate 32-bit.
Connectivity to http://www.worldcat.org. Click the link or open a browser and type in the URL to check.
CONTENTdm Server version 5.1 or later
Access to CONTENTdm Administration with permissions to approve items and build the index of the target collection.
Data in one of the supported formats (currently CCS Newspapers, CCS Monographs or NDNP Newspapers).
- Data processed into the METS/ALTO format by the vendors iArchives and Perfect Image have been tested and successfully ingested using the NDNP Newspaper format.
- Monograph format has been used with ebooks, and has been tested with book metadata from Microsoft Academic Live and CCS.
Minimum 2 GB RAM.
Intel Core 2 Duo 2.0 GHz (Intel Core 2 Quad 2.5 GHz recommended).
500 GB storage space (fast 100 GB 3.0 Gb/s 7200 RPM disk recommended).
- If you will be running more than one instance of the Flex Loader, we recommend that the storage space is on your local drive.
High-speed Ethernet network (gigabit network recommended). A very fast network is recommended if you are going to process and upload large files.

Installing

You need administrative rights on your Windows workstation to install the Flex Loader.

Install the Flex Loader

Double-click the .exe file.
Follow the InstallShield screens to install. You will need to enter a valid CONTENTdm license code. Use the same license code as you use for your CONTENTdm Server. (Your code is displayed in CONTENTdm Administration under the Server tab on the About page.)
Start the application by going to the Start menu > All Programs > OCLC > CONTENTdm Flex Loader.
The CONTENTdm Flex Loader screen displays.

Setting up your files and collection

Your data must be in one of the supported formats:

CCS Newspaper
CCS Monograph
NDNP Newspaper formats

More information about ALTO and METS:

ALTO is a specification for encoding technical metadata for Optical Character Recognition (OCR).
METS, the Metadata Encoding and Transmission Standard, is a specification for encoding structural metadata to transport descriptive, administrative, and technical metadata.

Before you begin adding files to the Flex Loader, make sure each newspaper or monograph (i.e., all the files related to a single issue or book) is saved in a single folder (subfolders within the single, parent folder can also be used). Files are added to the Flex Loader at the folder level — you cannot add individual files.

For example, the following folder contains six issues of a newspaper. Folder 1887090101 is one issue and is the folder that you would select to add to the processing queue in step two of Using the Flex Loader (below).

The folder 1887090101 contains the following files — these are all of the files for this issue of the newspaper. If you’re using subfolders within a single parent folder, one subfolder might contain all the image files and a second subfolder might contain all the XML files.

Please note the following additional file set-up requirements:

All XML files should have the .xml file extension.
All ALTO files should contain "ALTO" in the file name.
For NDNP Newspaper format, the Flex Loader expects two XML files in the METS format: an issue file and an articles file. The primary (issue) file should have a file name that includes an underscore but does NOT start with "articles_" and does NOT contain "-ALTO" (e.g., 1887090101_1.xml). The articles file name should start with "articles_".
Note: If you do not have an articles file, Flex Loader will still process the issue file, and the articles file is treated as empty (the issue is not segmented into articles).
CCS formats should have "-mets" in the file name of the .xml file.
All other supported formats should have "mets" (without the hyphen required for CCS formats, see above) in the file name of the .xml file.
If you want to provide a PDF file of a complete newspaper issue or book, include the PDF file in the folder and name it “all.pdf”. The file is uploaded as-is and will become an option for downloading and printing when users view your collection.
If you want to use Flex Loader to create a PDF file of a complete newspaper issue or book from single-page JP2 or TIFF files (instead of using the all.pdf option described above), check the box labeled Create Print PDF in step 2, and then click Add. The PDF file will become an option for downloading and printing when users view your collection.
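
Before adding a folder to the queue, it can help to check its file names against the NDNP Newspaper rules above. The following is a minimal sketch, not part of Flex Loader: the function name classify_ndnp_folder and the result keys are hypothetical, and treating the "ALTO" match as case-sensitive is an assumption.

```python
from pathlib import Path


def classify_ndnp_folder(folder):
    """Sort the files of one issue folder by the naming rules described above.

    Hypothetical helper for pre-flight checks; Flex Loader itself may apply
    the rules differently (for example, with different case handling).
    """
    result = {"issue": [], "articles": [], "alto": [], "other_xml": [], "all_pdf": False}
    # rglob also walks subfolders, which the parent issue folder may contain.
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        name = path.name
        if name.lower() == "all.pdf":
            result["all_pdf"] = True
        elif path.suffix.lower() != ".xml":
            continue  # images and other files are not classified here
        elif "ALTO" in name:
            result["alto"].append(name)
        elif name.startswith("articles_"):
            result["articles"].append(name)
        elif "_" in name:
            result["issue"].append(name)
        else:
            result["other_xml"].append(name)
    return result
```

For an NDNP folder, you would typically expect exactly one entry under "issue" and at most one under "articles"; anything under "other_xml" is worth a second look before processing.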

Setting up your collection

Flex Loader only imports metadata from fields in the XML that are mapped to the collection fields. View and modify the fields, if necessary, in CONTENTdm Administration. We recommend you set up your collection with a full text search field. Although it is not required unless you have transcripts (and want to include the full text in the collection), a full text field enables special features for viewing and searching in your collection (including searching within the document itself). For more information, see Edit field properties.

Before using the Flex Loader, also confirm your collection configuration settings for the creation of display images. For more information, see the About Image Files section later in this document and the Help topic Display image settings.

Using the Flex Loader

From the main Flex Loader screen, follow the steps outlined on the screen:

Sign in to your CONTENTdm Server and select the collection to which you want to add your files.
Optionally, before adding folders to the processing queue, you can view and update metadata fields, including selecting auto data to automatically fill out fields for your items (Flex Loader only imports metadata from fields in the XML that are mapped to the collection fields [mapped field names do not have to match]). Click View and customize metadata fields for this collection to get started.

First, select the field to customize. You can enter any combination of text and/or auto data. To add auto data, select the auto data from the list and then click Add to add it to the Edit Metadata text box for the selected field. When you have finished entering text and/or auto data, click OK to save your changes and close the screen. (These metadata field settings will be saved so you can use them for future uploads to the collection.)

For more information about metadata field mappings from auto data to XML data, see the CONTENTdm Flex Loader metadata elements.
Note: You can edit metadata for the compound object-level record or for the page-level records by selecting the appropriate tab while editing.

For example, you may want to customize the compound-object level record Date field to use the auto data $(ISSUEDATE).

And you may want to customize the page-level record for the Identifier field to use $(REELNUMBER) auto data. This example also includes the text label “Reel” preceding the auto data.

Metadata format notes:
- Auto data availability may vary depending on format and source files. If auto data is not available for a field, that field is left blank.
- Supported date formats (based on the ISO 8601 standard) are YYYY-MM-DD, DD.MM.YYYY, MM/DD/YYYY, YYYY-MM and YYYY.
- For NDNP newspaper format: When mapping Title to $(ISSUETITLE), the LABEL attribute value from the METS XML file is used for the Title field of an issue.
- For CCS monograph and newspaper formats: When mapping Title to $(ISSUETITLE), the MODS Title value is concatenated and used for the Title field.
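
To check whether the dates in your source files fall within the supported formats listed above, a small parser can be used. This is a sketch only: the function name parse_supported_date is hypothetical, and the order in which the formats are tried is an assumption, not documented Flex Loader behavior. Parts that a format does not carry are returned as None, mirroring the "if available" wording used for $(ISSUEDAY) and $(ISSUEMONTH).

```python
from datetime import datetime

# (strptime pattern, has month, has day) for each supported format.
_FORMATS = [
    ("%Y-%m-%d", True, True),   # YYYY-MM-DD
    ("%d.%m.%Y", True, True),   # DD.MM.YYYY
    ("%m/%d/%Y", True, True),   # MM/DD/YYYY
    ("%Y-%m", True, False),     # YYYY-MM
    ("%Y", False, False),       # YYYY
]


def parse_supported_date(text):
    """Return (year, month, day) with None for missing parts, or None if unsupported."""
    for pattern, has_month, has_day in _FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue  # try the next supported format
        return (
            parsed.year,
            parsed.month if has_month else None,
            parsed.day if has_day else None,
        )
    return None
```

A value that returns None here is not in any of the listed formats and may need to be normalized in the source XML before loading.
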
Then from the main screen, select the folders that contain the files and add them to the processing queue. Remember to specify the format of your files. For more information about which folder to select, see Setting up your files and collection.
To remove folders from the queue, right-click on the folder name and select Remove. A dialog asks you to confirm the removal.
- To see information about the selected folder, including any error message, click Information.
Next, click Start to process the files. Files are processed by the Web service and added to the approval queue on your CONTENTdm Server.
- Click Stop to cancel the upload process. Flex Loader stops the upload after the current folder processing completes.
When processing of all folders has completed, go to CONTENTdm Administration and approve and index the files for the collection.
- You can access CONTENTdm Administration by clicking the Access CONTENTdm Administration link on the screen, to the left of the Start button.
- For more information, see Approve items and Build a collection index.

Note: You can run more than one instance of the Flex Loader so you can process multiple publications at the same time. We recommend you work from data on a local drive to ensure good performance.

We recommend you plan to work in batches. Processing can take some time, and if you close the Flex Loader before all processing has completed in the queue, the queue list is not retained.

About image files

The CONTENTdm Flex Loader uses JPEG2000 or TIFF image file formats.

If JPEG2000 images are available in the import folder, the Flex Loader uses these images first. If JPEG2000 images are not available, the Flex Loader uses TIFF images instead.
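
The image preference described above can be sketched as follows, for example to preview which files would be used from a folder. The function name pick_page_images and the file extensions (.jp2, .tif, .tiff) are assumptions for illustration; the extensions Flex Loader actually recognizes are not stated here.

```python
from pathlib import Path

# Assumed extensions; adjust to match your own files.
JP2_EXTENSIONS = {".jp2"}
TIFF_EXTENSIONS = {".tif", ".tiff"}


def pick_page_images(folder):
    """Return JPEG2000 files if any exist in the folder, otherwise TIFF files."""
    files = [p for p in sorted(Path(folder).rglob("*")) if p.is_file()]
    jp2_files = [p for p in files if p.suffix.lower() in JP2_EXTENSIONS]
    if jp2_files:
        return jp2_files  # JPEG2000 takes precedence when present
    return [p for p in files if p.suffix.lower() in TIFF_EXTENSIONS]
```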

When TIFF images are processed, they are converted to display images, based on the collection configuration settings selected on the CONTENTdm Server. (If display image creation is not enabled, display images in JPEG format are automatically created.)

Note: The default configuration on the server is to generate display images in JPEG2000 format, with a compression ratio set to 10:1.

For more information, see Display image settings.

CONTENTdm Flex Loader metadata elements

The following is a list of the data in the XML files that match the auto data fields in the metadata mapping interface in Flex Loader. It also documents which data is automatically extracted and added to CONTENTdm fields, if not mapped using the auto data fields.

Compound Object Metadata

The following are not extracted from XML files:

$(DATE) Current system/computer date, example "Wednesday, June 10, 2009".
$(DATEMMDDYY) Current system/computer date, example "06102009".
$(DATEYYYYMMDD) Current system/computer date, example "20090610".
$(ISODATEYYYYMMDD) Current system/computer date, example "2009-06-10".
$(USERNAME) Username running the application, example "Administrator".
$(YEAR) Current system/computer date, example "2009".
$(ZEROAPPEND1) Appends "0"; for example, "1" becomes "10".
$(ZEROAPPEND2) Appends "00"; for example, "1" becomes "100".
$(ZEROAPPEND3) Appends "000"; for example, "1" becomes "1000".
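
To illustrate how text and auto data combine in a field, the sketch below expands a few of the system-derived tokens listed above in a template string. It is an approximation for illustration, not Flex Loader's implementation: the function name expand_auto_data is hypothetical, only a subset of tokens is covered, and the day and month names depend on the active locale. Tokens with no available value expand to an empty string, following the note that unavailable auto data leaves the field blank.

```python
import re
from datetime import date


def expand_auto_data(template, today, username):
    """Replace $(NAME) tokens in a template with system-derived values."""
    values = {
        "DATE": f"{today:%A, %B} {today.day}, {today.year}",
        "DATEYYYYMMDD": f"{today:%Y%m%d}",
        "ISODATEYYYYMMDD": today.isoformat(),
        "YEAR": str(today.year),
        "USERNAME": username,
    }
    # Tokens not in the table (for example, XML-derived ones) become empty.
    return re.sub(
        r"\$\(([A-Z0-9]+)\)",
        lambda match: values.get(match.group(1), ""),
        template,
    )


print(expand_auto_data("Loaded $(ISODATEYYYYMMDD) by $(USERNAME)",
                       date(2009, 6, 10), "Administrator"))
# prints "Loaded 2009-06-10 by Administrator"
```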

The following are extracted from XML files:

$(EDITION) : Number of the edition in chronological order, default is “1”. Note: In the NDNP specs, this is referred to as “Edition Order.”
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem/mods:part/mods:detail[@type="edition"]/mods:number
$(EDITIONLABEL) : Description of the edition, as printed, example “Final Edition”
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem/mods:part/mods:detail[@type="edition"]/mods:caption
$(LCCN) : Library of Congress Catalog Number, example “sn83031150”
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem/mods:identifier/@type
$(ISSUEDATE) : Document issue date, example “1959-01-05”
- To locate: Find > mets > dmdSec > MdSecType > MdWrap > XmlData
$(ISSUEDAY) : Day of the month that the document was issued, example “05” for the 5th day
- To locate: Same as $(ISSUEDATE) but extracts the day portion of the date, if available.
$(ISSUEMONTH) : Month that the document was issued, example “12” for December
- To locate: Same as $(ISSUEDATE) but extracts the month portion of the date, if available.
$(ISSUEYEAR) : Year of the document, example “1959”
- To locate: Same as $(ISSUEDATE) but extracts the year portion of the date, if available.
$(ISSUEPRESENT) : Valid values are: Present, Not digitized, published; Not digitized, not published; Not digitized, publishing unknown. In the NDNP standard, this is known as “Issue Present Indicator.”
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:note
$(ISSUETITLE) : Name of the document, example “The Seattle Times.”
- To locate: Find > mets > extract contents of LABEL attribute.
$(TITLEVOLUME) : Document volume, example “5”
- To locate: For iArchives: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem/mods:part/mods:detail[@type="volume"]/mods:number.
  
  For CCS: Find > dmdSec > mdWrap > xmlData > mods > relatedItem > identifier > find type="local" > extract contents.
$(TITLENUMBER) : Document number, example “2”
- To locate: For iArchives: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="issueModsBib"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem/mods:part/mods:detail[@type="issue"]/mods:number.
  
  For CCS: Find > dmdSec > mdWrap > xmlData > mods > titleInfo > partNumber > extract contents.
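
To verify that a METS issue file contains the values these paths point to, you can evaluate the same paths yourself. The sketch below uses Python's standard xml.etree.ElementTree module with the METS and MODS namespaces. The sample XML is a made-up, heavily trimmed fragment built from the example values in this list, and the helper name extract_detail is hypothetical; real NDNP files contain many more elements.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Illustrative fragment only, not a complete or validated NDNP file.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"
    LABEL="The Seattle Times">
  <mets:dmdSec ID="issueModsBib">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:relatedItem type="host">
            <mods:part>
              <mods:detail type="edition">
                <mods:number>1</mods:number>
                <mods:caption>Final Edition</mods:caption>
              </mods:detail>
              <mods:detail type="volume">
                <mods:number>5</mods:number>
              </mods:detail>
            </mods:part>
          </mods:relatedItem>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""

# Path relative to the mets:mets root element.
DETAIL_PATH = (
    "mets:dmdSec[@ID='issueModsBib']/mets:mdWrap/mets:xmlData/mods:mods/"
    "mods:relatedItem/mods:part/mods:detail[@type='{kind}']/mods:{leaf}"
)


def extract_detail(root, kind, leaf):
    """Return the text of mods:detail[@type=kind]/mods:leaf, or "" if absent."""
    node = root.find(DETAIL_PATH.format(kind=kind, leaf=leaf), NS)
    return node.text.strip() if node is not None and node.text else ""


root = ET.fromstring(SAMPLE)
print(extract_detail(root, "edition", "number"))   # $(EDITION)
print(extract_detail(root, "edition", "caption"))  # $(EDITIONLABEL)
print(extract_detail(root, "volume", "number"))    # $(TITLEVOLUME), iArchives path
print(root.get("LABEL"))                           # $(ISSUETITLE)
```

An empty result for a path suggests the corresponding auto data would be left blank for that issue.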

Page level metadata

The following are not extracted from XML files:

$(EXTENSION) : File extension of the pages of the document, example “.jpg”.
$(FILENAME) : File name without extension, example “clarkcountypage1”.
$(PAGETYPE) : "Cover" for compound objects or "page" for the pages.
$(PATH) : Path to original location of the document.
$(SIZE) : Document page size in bytes, example, "95230" bytes.
$(SIZEKB) : Document page size in kilobytes, example, "95.230" kilobytes.

The following are extracted from XML files:

$(PAGENUMBER) : Page number of document, example “3”.
- To locate: Find > structMap > div ID=“PAGEID” > find TYPE=“PAGE” > extract contents of ORDERLABEL attribute.
$(REELNUMBER) : Library of Congress microfilm reel number.
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="pageModsBib1"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem[@type="original"]/mods:identifier[@type="reel number"]
$(SOURCEREPOSITORY) : Owner of digitized source; city and state postal abbreviations.
- To locate: mets:mets[@TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue"]/mets:dmdSec[@ID="pageModsBib1"]/mets:mdWrap/mets:xmlData/mods:mods/mods:relatedItem[@type="original"]/mods:location/mods:physicalLocation/@displayLabel
$(TRANSCRIPT) : Page transcript.
- To locate: Find Layout > PrintSpace > TextBlock > TextLine > String > extract CONTENT attribute.
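
The $(TRANSCRIPT) lookup above can be approximated by collecting the CONTENT attribute of every String element inside each TextLine of an ALTO file. The following is a rough sketch under stated assumptions: the function names are hypothetical, element names are matched without regard to namespace (ALTO namespaces differ by version), words are joined with single spaces, and hyphenation elements are ignored. Flex Loader's actual handling may differ.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the element's tag without any {namespace} prefix."""
    return element.tag.rsplit("}", 1)[-1]


def page_transcript(alto_xml):
    """Build a plain-text transcript, one output line per ALTO TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element) != "TextLine":
            continue
        # Only direct String children carry the recognized words.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Made-up sample page for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Clark"/><SP/><String CONTENT="County"/></TextLine>
    <TextLine><String CONTENT="News"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(page_transcript(SAMPLE_ALTO))
```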