Digital datasets

This section describes the production of the digital datasets, focusing on the technical standards being proposed for this project. It describes the capture of the digital images; generation of OCR text; creation of metadata; and the overall production workflow and Quality Assurance (QA) activities.

To inform discussions with JSTOR and decisions about technical standards, a wide selection of approximately 150 19th century pamphlets was transported from the University of Bristol to BOPCRIS at Southampton. These were examined by BOPCRIS, with several pamphlets scanned and others matched with previous scans of analogous material. The Bristol pamphlets exhibited most of the characteristics checked in the format and condition survey (e.g. greyscale information, tight binding, unusual typefaces) and they helped inform the criteria. Further issues not represented in the Bristol selection were observed during visits to the collection (e.g. the early 19th century long “s”). Some of these were tested using similar materials drawn from Southampton.

Digital images

This project will use the full range of book scanners available within the digitisation laboratory at BOPCRIS:

  • Minolta PS 7000 greyscale scanners for bitonal or greyscale capture from volumes or separate pamphlets that can be opened flat
  • Digitising Line Suprascan colour scanner for colour capture and for grey or bitonal capture from volumes or individual pamphlets requiring special support (e.g. loose boards or tight binding)
  • Digitising Line robotic colour scanner for sturdy bound volumes where the bulk of the volume is being scanned and the pamphlets are of similar paper weight and have been trimmed to fit the volume

All BOPCRIS staff are trained in handling materials appropriately and in operating the scanners. Given the nature of the collections, only a small proportion of the 19th century pamphlets could be captured on the robotic scanner, with most requiring a PS 7000 or Suprascan scanner. This has increased the cost of digital capture in the revised bid.

In the course of this scoping study, video- and phone-conferences were held with JSTOR to discuss the digital capture specifications. These discussions were informed by sample images BOPCRIS had generated from the Bristol collection or from similar materials.

JSTOR’s standard approach is to capture pages of text as 600dpi bitonal scans and pages with grey or colour (e.g. illustrations or photographs) as 300dpi 8-bit grey or 24-bit colour scans. JSTOR’s delivery images are downsized from the TIFF masters and delivered as GIF images (for text) or JPEG images (where grey or colour is present). The latter images are often created by overlaying and compositing 8-bit or 24-bit illustrations with bitonal text.

Project partners considered capturing the entire pamphlet collection in greyscale or colour, in recognition of the pamphlets’ value as historic objects as well as information carriers. However, the larger file sizes would vastly increase both the storage requirements and the download times for users. Although manageable for 1 million images, this would make it very difficult to scale the collection beyond the current project. It was decided to stay with the mixed bitonal and grey/colour specifications proposed in the initial bid, but to capture annotations in grey or colour where significant.
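
To give a rough sense of the storage trade-off, the sketch below estimates the uncompressed pixel data for a single page under each capture specification. The page dimensions (5.5 x 8.5 inches) and the function name are illustrative assumptions, not measurements from the collection, and TIFF header overhead is ignored.

```python
import math

def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw pixel data size for one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded to a whole number of bytes, as in uncompressed TIFF.
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px

# Hypothetical pamphlet page size in inches (an assumption for illustration).
PAGE_W, PAGE_H = 5.5, 8.5

# Capture specifications taken from the text: (resolution in dpi, bit depth).
specs = {
    "600dpi bitonal (1-bit)": (600, 1),
    "300dpi grey (8-bit)": (300, 8),
    "300dpi colour (24-bit)": (300, 24),
}

sizes = {
    name: uncompressed_bytes(PAGE_W, PAGE_H, dpi, bpp)
    for name, (dpi, bpp) in specs.items()
}

for name, size in sizes.items():
    print(f"{name}: {size / 1_000_000:.1f} MB")
```

Under these assumptions a grey page is roughly twice the size of a bitonal page, and a colour page roughly six times the size, before any compression is applied.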

The table below indicates the digital image standards adopted for this project.

Table 3.1 Image specifications

 

| Page type | Master images (archived by JSTOR; available to contributing libraries and JISC) | Delivery images (JSTOR standard delivery formats) |
| --- | --- | --- |
| Pages of text | 100% capture at 600dpi bitonal (1-bit) images saved in uncompressed TIFF 6.0 format. | 760 pixel wide GIF images (with large PDF versions or full size TIFF images downloadable) |
| Pages with grey or colour information (including significant annotations) | 100% capture at 300dpi grey (8-bit) or colour (24-bit) images saved in uncompressed TIFF 6.0 format. | 760 pixel wide JPEG images (with large PDF versions or full size TIFF images downloadable) |

OCR text

There was some discussion during the scoping study about where it was best to undertake the OCR work: BOPCRIS or JSTOR. JSTOR generally requires its vendors to use PrimeOCR, while BOPCRIS has Abbyy software built into its Agora production system. Pages from some of the sample pamphlets were OCRed and the data was shared with JSTOR. This data was in the .idx file format that Agora generates using Abbyy, which includes word coordinates (not currently used by JSTOR). In addition to the standard Abbyy 8.0 software, an ‘Old English’ add-on was trialled during the scoping study. This specialist software is designed to capture older typefaces.

It was agreed that the OCR was best done by BOPCRIS as part of its production workflow, using Abbyy and, where necessary, Abbyy Old English (used selectively, because its high licence cost is based on the number of pages OCRed). JSTOR would be supplied with both plain text (.txt) and coordinated text (.idx) files. JSTOR would use the .txt files initially, but would be likely to use the .idx files if it moved to a delivery system that highlights words from the text (it anticipates doing so in the future).

Several tests were made to determine the level of accuracy achievable from the sample pamphlets. The tests suggested that a high level was possible for much of the material, with even difficult texts achieving up to 99.9% character accuracy. Further tests would be done were the project to proceed, but these initial tests suggested an average character accuracy of 97-98% for the 600dpi bitonal images and 300dpi grey/colour images specified for this project. For this project and material JSTOR would apply an average accuracy level rather than a minimum acceptance level and were happy with the averages BOPCRIS were obtaining. While rescanning or re-OCRing may occasionally be necessary, the project does not envisage having to re-key any data.
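
The difference between an average accuracy level and a minimum acceptance level can be shown with a small sketch. The per-page figures and both thresholds below are invented for illustration; they are not project results or JSTOR's actual criteria.

```python
from statistics import mean

# Hypothetical per-page character accuracies (percent). The low value
# stands in for a title page set in a complex font.
page_accuracy = [98.7, 97.6, 99.1, 98.2, 75.8, 99.4, 98.9, 99.0, 98.5, 99.3]

AVERAGE_TARGET = 96.0  # assumed target for the batch average
MINIMUM_TARGET = 96.0  # assumed per-page floor, shown only for comparison

def passes_average(accuracies, target):
    """Accept the batch if its mean accuracy reaches the target."""
    return mean(accuracies) >= target

def passes_minimum(accuracies, target):
    """Accept the batch only if every single page reaches the target."""
    return min(accuracies) >= target

print("average rule:", passes_average(page_accuracy, AVERAGE_TARGET))
print("minimum rule:", passes_minimum(page_accuracy, MINIMUM_TARGET))
```

With these made-up figures the batch is accepted under the average rule but would be rejected under a per-page minimum, because one difficult page pulls the minimum down without greatly affecting the mean.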

Currently any problems with the OCR are picked up visually and the accuracy levels determined manually by comparing the OCR output with the original page or its digital image. BOPCRIS hope to introduce accuracy monitoring software into their automated workflow at an early stage of this project. This software would make it easier to detect any issues and change the variables in order to optimise the OCR.
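
As an indication of what automated accuracy monitoring might compute, the sketch below derives a character accuracy figure from the edit distance between a reference transcription and the OCR output. This is one common definition of character accuracy, not necessarily the method used by BOPCRIS or by any particular monitoring product; the function names and the hand-corrected reference line are assumptions.

```python
def edit_distance(reference, ocr):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

def character_accuracy(reference, ocr):
    """Percentage of reference characters recovered, floored at zero."""
    if not reference:
        return 100.0 if not ocr else 0.0
    errors = edit_distance(reference, ocr)
    return max(0.0, 100.0 * (1 - errors / len(reference)))

# Assumed hand-corrected reference versus a line of raw OCR output.
reference_line = "As in the vast expanse we see"
ocr_line = "As in the vast expanse we sec*"

print(edit_distance(reference_line, ocr_line))
print(f"{character_accuracy(reference_line, ocr_line):.1f}%")
```

A reference transcription is still needed for each sampled page, so a measure of this kind would complement rather than replace visual checking.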

The following tables show examples of OCR drawn from an early 19th century satirical pamphlet. The first two show the same text scanned at 600dpi and 300dpi (the latter would usually be used where illustrations are present). Both have OCRed well for this typical text. The third table (3.2:C) shows the title page of this pamphlet and illustrates the impact of complex fonts. These usually occur on title pages, where any OCR errors will be compensated for by the presence of bibliographic metadata.

Table 3.2:A   600dpi bitonal using Abbyy 8.0 – 97.6% character accuracy

THE TOTAL ECLIPSE,

Courteous Reader:

This eclipse, Tho’ Moore so deep in science clips, Was not foretold; but it is said That Moore and Andrews both are dead. So sly withal, so deep and ohary> Was that Royston Luminary, Never could I, broad or ‘t home, Once in conjunction with him come. Had I his ultramundane store, I’d write i’ th’ style of mystic lore, Of Capricorn, of knees, and hams*, Of Cancer, Leo, bulls, and rams ; Of legs and shoulders, arms and hips, But to proceed with the eclipse. As in the vast expanse we sec* The orbs all varying in degree, Revolving round some fixed star, Now nearer us, now from us far; Sustained by Gravitation’s laws, Enacted by the great First Cause:-*-* So do Terrestrial objects bend And toward one common centre tend ; Nor let a thought be entertained, The simile is overstrained. Terrestrial bodies have their sigHs^ One darkens and another shines; Sometimes we see they change their places, Slide across each other’s faces ; Sometimes these terrestrial ‘clipses Brown us all like herds of’gypsies:


Table 3.2:B   300dpi greyscale using Abbyy 8.0 – 98.7% character accuracy

THE TOTAL ECLIPSE,

Courteous Reader:

This eclipse* Tho’ Moore so deep in science clips, Was not foretold; but it is said That Moore and Andrews both are dead. So sly withal, so deep and ohary> Was that Royston Luminary, Never could I, broad or ‘t home, Once in conjunction with him come. Had I his ultramundane store, I’d write i’ th’ style or’mystic lore, Of Capricorn, of knees, and hams, Of Cancer, Leo, bulls, and rams ; Of legs and shoulders, arms and hips, But to proceed with the, eclipse. As in the vast expanse we sec* The orbs all varying in degree, Revolving round some fixed star, Now nearer us, now from us far; Sustained by Gravitation’s laws, Enacted by the great First Cause:-”* So do Terrestrial objects bend Aud toward one common centre tend ; Nor let a thought be entertained, The simile is overstrained. Terrestrial bodies have their sighs^ One darkens and another shines; Sometimes we see they change their places, Slide across each other’s faces ; Sometimes these terrestrial ‘clipses Brown us all like herds of gypsies:

Table 3.2:C   300dpi greyscale using Abbyy 8.0 – 75.8% character accuracy

THE TOTAL ECLIPSE:

A GRAND 3Pol(ttco=>ronomical pjtttonttttott, WHICH OCCURRED IN THE YEAR 1820;

WITH A SERIES OF IBKK&NB&WI&*B8»

TO DEMONSTRATE Cfte &mtftguration of tje fMattel^

TO WHICH IS ADDED, AN HIEROGLYPHIC, Adapted to these Wonderful Times J

, v\ \> m

<t

m

w

“ All tongues speak of him, and the bleared sights “ Are spectacled to see him.”

LONDON : PUBLISHED BY THOMAS DOLBY, ^STRAND, 30, HOLYWELL

STREET, AND 34, WARDOUR SrKELl.

ONE SHILLING.

Metadata

BOPCRIS has a German production system called Agora, which uses its own proprietary XML (Extensible Markup Language) metadata format. This metadata would be customised to ensure that all the necessary information is incorporated. Once complete, it would be exported and transformed (via software routines) into standards-compliant XML. This standard metadata would be delivered to JSTOR for archiving and delivery to libraries or JISC (on request) and for further transformation into JSTOR’s own metadata standard for delivery to end users.

The table below details the metadata standards adopted for this project.

Table 3.3  Metadata specifications

 

| Metadata type | Metadata standard | Comments |
| --- | --- | --- |
| Bibliographic Metadata | MODS (Metadata Object Description Schema) or MARCXML | BOPCRIS would take bibliographic records from the CURL or Copac databases: probably in the MODS format, but possibly as the more extensive MARCXML. MODS (currently version 3) is a good choice for this project because: (1) Copac is moving to MODS by the end of 2006; and (2) MODS was especially developed to hold a simplified set of MARC data for use within digital library collections. |
| Technical Metadata | MIX (NISO Metadata for Images in XML) | BOPCRIS would probably use selective elements from the new MIX standard to record information about the digital images. MIX is an encoding of the very extensive NISO data dictionary (Z39.87). MIX is still in draft (currently version 0.2). |
| Preservation Metadata | PREMIS (Preservation Metadata Implementation Strategies Working Group) | BOPCRIS would use elements of the PREMIS data dictionary, which is being widely adopted as a means of recording information to support the preservation of digital resources. PREMIS is currently still undergoing a period of trial use. |
| Structural Metadata | METS (Metadata Encoding & Transmission Standard) | BOPCRIS would use METS, which is now an established standard for structuring complex digital resources (e.g. publications with multiple pages) and wrapping other sets of metadata. It is often used with MODS. |
| Delivery Metadata – JSTOR | NLM (National Library of Medicine) | JSTOR are migrating to the NLM standard, which is widely used for journals but also has an adaptation for monographs, such as pamphlets. BOPCRIS would work with JSTOR to develop a suitable mapping (via a DTD or schema) to enable the rich archival metadata set (MODS, MIX and PREMIS in METS) to be transformed into JSTOR’s delivery metadata standard. |
| Delivery Metadata – Copac and OPACs | MODS, MARCXML or MARC21 | The project will enable access to the digitised pamphlets via Copac and the OPACs of holding libraries. This is achieved by adding a link to existing MODS, MARCXML or MARC21 records. |
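
To illustrate how METS can wrap a MODS record and describe page structure, the sketch below builds a minimal document with Python's standard library. It is not the Agora export or the project's actual profile: the function name, the ID and TYPE values and the sample title are assumptions, the MIX and PREMIS sections are omitted, and the output has not been validated against the METS or MODS schemas.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)

def build_mets(title, page_count):
    """Build a skeletal METS record for one pamphlet (illustrative only)."""
    mets = ET.Element(f"{{{METS_NS}}}mets")

    # Descriptive metadata section wrapping a very small MODS record.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", MDTYPE="MODS")
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title

    # Structural map: one div for the pamphlet, one child div per page.
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="physical")
    pamphlet = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", TYPE="pamphlet", DMDID="DMD1"
    )
    for number in range(1, page_count + 1):
        ET.SubElement(
            pamphlet, f"{{{METS_NS}}}div", TYPE="page", ORDER=str(number)
        )
    return mets

record = build_mets("The Total Eclipse", page_count=3)
print(ET.tostring(record, encoding="unicode"))
```

A fuller record would add a file section pointing at the TIFF masters and link each page div to its image and OCR files.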

Production and QA workflow

Quality Assurance (QA) is a key part of any digitisation workflow (see TASI’s QA documentation). For this project, BOPCRIS would undertake QA at several stages during the production phase, and then further QA would be done by JSTOR when they receive the dataset.

The workflow follows this order:

  1. Images are logged onto the Agora production system and passed on to Scanning Operators.
  2. Scanning Operators scan each page, checking as they go (the first QA, for images). Images are rescanned as necessary. Once complete, the set of files is passed on to Indexers.
  3. Indexers check that all the pages are present and that the images are of good quality (the second QA, for images). If there are any issues they request a rescan from a Scanning Operator.
  4. Indexers initiate the XML generation, which incorporates the data necessary for later export and transformation into the standards.
  5. Indexers identify any non-English text or older typefaces and flag these in the production system so that the appropriate software settings (e.g. Abbyy Old English) are triggered when the images are OCRed. The dataset then enters the automatic OCR workflow.
  6. The production system picks up the images and metadata, OCRs the images, and generates associated .idx and .txt files.
  7. Currently OCR is spot-checked, but it is anticipated that automated OCR-checking would be introduced at this point in the workflow for this project (a third QA, for OCR).
  8. JSTOR do further QA on the images and OCR (the fourth QA, for images and OCR), checking an average of 10% of images and OCR files. JSTOR would liaise with BOPCRIS to address any issues. Because some rescanning might be necessary, collections would be held at Southampton until signed off by JSTOR.

See a diagram of this workflow
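
The final sampling step of the workflow could be sketched as follows. This is an assumption about how a 10% check might be drawn, not a description of JSTOR's actual procedure: the text specifies an average of 10%, whereas this sketch simply takes 10% of every batch, and the function name and file names are hypothetical.

```python
import math
import random

def select_qa_sample(filenames, percent=10, seed=None):
    """Pick a random subset of files for checking, at least one per batch."""
    if not filenames:
        return []
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    sample_size = max(1, math.ceil(len(filenames) * percent / 100))
    return sorted(rng.sample(filenames, sample_size))

# Hypothetical batch of 40 page images from one pamphlet.
batch = [f"pamphlet0001_p{n:03d}.tif" for n in range(1, 41)]
to_check = select_qa_sample(batch, seed=2006)

for name in to_check:
    print(name)
```

Recording the seed alongside the batch would allow the same selection to be reproduced if a result is later queried.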
