British newspapers 1620-1900
This report describes all of the stages and issues that occurred during a second complex mass newspaper digitisation project. The project was an innovative and challenging example of a public/private partnership between Gale Cengage Learning, CCS and the British Library.
Executive Summary
The workflow begins with the selection of content and pre-generated metadata elements from the newspaper catalogue; the original newspapers are repaired and stabilised, and additional metadata is captured directly. An internal British Library (BL) department then creates new microfilms from print for 90% of the content, and another internal department scans from originals for the remaining 10%. The digitisation supplier scans the microfilm to create master images and produces the XML and METS files, then sends the results to the BL for QA and sign-off. This report does not address in detail the development of the website, as one already exists and the upload of this content is managed by a separate team within the library.
The British Newspapers 1620-1900 project was an innovative and challenging example of a public/private partnership between Gale Cengage Learning, CCS and the British Library. Different cultural emphases between the three project partners were managed in order to allow a free flow of communication. As in the first project, 'JISC 1', there were many discussions between the partners about efficiencies, and this second phase of newspaper digitisation, funded by the JISC in 2007, has been a major success. All of the selected titles were digitised except for the Irish sub-cluster, which had to be dropped because it was in too poor a condition to be handled. The final list of titles to be delivered will be confirmed by Gale Cengage when they complete the upload of the content in autumn this year.
The project has left a good legacy for the future. At the end of the project an inventory of digitised pages was created and archived within both the host organisation and the web hosting partner. We know the number of items, pages and articles that have been digitised and where there are duplications and gaps in the runs (1,157,349 pages including 18th century material; 192,030 issues and 2,266 reels created). We have extended provision of access to dispersed material and contributed to the development of technical standards through improvements to workflows, enriched metadata and evidence of best practice. As in the BL’s approach to 'JISC 1', the second project engaged with users through academic representatives on the project board and user consultation during the bidding process.
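As an illustration of how duplication and gaps in a run might be identified from such an inventory, the following is a minimal sketch only. The function name `audit_run`, the use of plain issue dates as input and the assumed fixed publication interval are hypothetical and are not taken from the project's own tooling.

```python
from collections import Counter
from datetime import date, timedelta


def audit_run(issue_dates, interval_days=1):
    """Return (duplicates, gaps) for one title's run of issue dates.

    Assumes (hypothetically) that the title appeared at a fixed interval,
    e.g. 1 day for a daily or 7 days for a weekly.
    """
    counts = Counter(issue_dates)
    # Dates that appear more than once in the inventory.
    duplicates = sorted(d for d, n in counts.items() if n > 1)

    unique = sorted(counts)
    step = timedelta(days=interval_days)
    gaps = []
    # Walk between consecutive issues and record every expected date that is absent.
    for earlier, later in zip(unique, unique[1:]):
        missing = earlier + step
        while missing < later:
            gaps.append(missing)
            missing += step
    return duplicates, gaps


# Invented sample data for a daily title.
run = [date(1850, 1, 1), date(1850, 1, 2), date(1850, 1, 2), date(1850, 1, 5)]
duplicates, gaps = audit_run(run)
```

In practice many titles did not publish on every day of the week, so a real check would need a publication calendar for each title rather than a single fixed interval.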
The overall approach can be summarised as retaining best practice from the first project model of engagement with the source material and front-loading QA at the start of the workflow, plus recognising the added benefit of using opportunities offered by new public/private partnerships. The workflow was changed during the shape phase to allow for direct scanning of one original newspaper (The Standard) by a new digitisation studio set up at the BL. Direct ingest of BL-captured metadata by the vendor and a review of the existing vendor and QA processes were also felt to be necessary. This resulted in a complete reprofiling of the budget to ensure that the change was budget neutral. The search for an ideal ‘core’ workflow of automated processes continues, and it is the experience of the project that suppliers will continuously improve their workflows as they work with innovative solutions to exceptional content.
An alternative future workflow needs to be more flexible in order to adapt to the diversity of the source material. Consideration should be given to using digital cameras as well as flatbed scanners to improve quality; preservation microfilms could be generated later. Consideration could also be given to disbinding the thicker bound volumes into smaller bound sections, as has been done at KBL for handling items in its newspaper digitisation project. This gives greater control over the reduction of gutter shadow and any subsequent loss of text due to skew.
It is possible to experience a wide range of inconsistencies in both scanning and the preparation of XML, which might be due to high turnover of personnel or variations between different production facilities. Often the vendor’s solution is to automate more, but there will still be cases where many more questions arise, and these are better addressed at the start of production with a pilot or at key break points in the contract. Creating a workflow that allows for assessment of the collection as a whole, such as reviewing all of the scans before zoning and OCR, is recommended. For example, a break point after scanning allows a decision to be made on which images must be rescanned.
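A break point of this kind could be sketched as a simple triage step run over the whole batch of scans before any zoning or OCR begins. This is an illustrative sketch under stated assumptions: the `ScanCheck` fields, the `split_at_break_point` function and the skew threshold are all hypothetical and do not describe the supplier's or the BL's actual QA criteria.

```python
from dataclasses import dataclass


@dataclass
class ScanCheck:
    """Hypothetical QA record for one master image."""
    image_id: str
    skew_degrees: float
    text_lost_in_gutter: bool


# Assumed tolerance for illustration only; a real project would agree this with the vendor.
MAX_SKEW_DEGREES = 2.0


def split_at_break_point(checks, max_skew=MAX_SKEW_DEGREES):
    """Split a batch into images to rescan and images cleared for zoning and OCR."""
    rescan, proceed = [], []
    for check in checks:
        if check.text_lost_in_gutter or abs(check.skew_degrees) > max_skew:
            rescan.append(check.image_id)
        else:
            proceed.append(check.image_id)
    return rescan, proceed


# Invented sample batch.
batch = [
    ScanCheck("reel0001_0001", 0.4, False),
    ScanCheck("reel0001_0002", 3.1, False),
    ScanCheck("reel0001_0003", 0.2, True),
]
rescan, proceed = split_at_break_point(batch)
```

Running the check over the full batch, rather than image by image during later stages, is what allows the rescan decision to be made once for the collection as a whole.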
Including some user-defined metadata would help to keep improving the searchability of the source material. It would also increase the popularity of the resource and help create a sense of ownership amongst users.