This project will extend and establish the generality of the mark-up and meta-data extraction from scanned literature developed by Lu et al. (2008), targeting the biodiversity domain. Meta-data will focus on proper nouns (taxon, people and place names) and dates. We will enhance the searchability of those terms using associative techniques from Natural Language Processing (NLP) combined with knowledge of likely Optical Character Recognition (OCR) errors, for example by allowing a search for Pica to recover Pioa, provided the context of Pioa is a bird, ideally a magpie.

Automatic Biodiversity Literature Enhancement

The overall aim of the project is to establish and extend information extraction techniques for scanned taxonomic literature in the Biodiversity Heritage Library (www.biodiversitylibrary.org). Scanned texts contain errors introduced by imperfect Optical Character Recognition (OCR) and other sources, so techniques are required that are robust in the face of such errors.

The ABLE project aims to extract mark-up and meta-data from scanned literature in the biodiversity domain. The meta-data we aim to extract includes proper nouns (taxon, people and place names) and dates. We also intend to enhance the searchability of those terms using associative techniques from Natural Language Processing combined with knowledge of likely OCR errors. For example, a search for Pica should recover Pioa, provided the context of Pioa is a bird, ideally a magpie.
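The Pica/Pioa example can be illustrated with a minimal sketch. It is not the project's software: the confusion table, the context vocabulary and the function names are all hypothetical, and it assumes OCR errors are single-character substitutions that preserve word length.

```python
# Hypothetical pairs of characters that OCR is assumed to confuse.
# These pairs are illustrative, not taken from the ABLE project.
OCR_CONFUSIONS = {
    frozenset("co"),
    frozenset("ce"),
    frozenset("il"),
    frozenset("l1"),
    frozenset("un"),
}

# Hypothetical context vocabulary associated with each searchable name.
CONTEXT_TERMS = {
    "pica": {"bird", "magpie", "corvid", "plumage", "nest"},
}


def is_plausible_ocr_variant(query: str, token: str) -> bool:
    """True if token equals query up to assumed OCR character confusions."""
    q, t = query.lower(), token.lower()
    if len(q) != len(t):
        return False
    return all(
        a == b or frozenset((a, b)) in OCR_CONFUSIONS for a, b in zip(q, t)
    )


def search(query: str, text: str, window: int = 5) -> list[tuple[int, str]]:
    """Return (word index, word) for exact hits and context-supported variants."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    expected = CONTEXT_TERMS.get(query.lower(), set())
    hits = []
    for i, word in enumerate(words):
        if not is_plausible_ocr_variant(query, word):
            continue
        # Words within `window` positions on either side of the candidate.
        context = set(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
        # Exact matches always count; variants need supporting context.
        if word == query.lower() or context & expected:
            hits.append((i, word))
    return hits


if __name__ == "__main__":
    print(search("Pica", "The magpie, Pioa, is a bird of the crow family."))
    print(search("Pica", "Pioa was the name of the ship."))
```

In this sketch the first query recovers Pioa because "magpie" and "bird" appear nearby, while the second returns nothing because no supporting context is present. A real system would presumably weight confusions by likelihood and handle insertions and deletions as well.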

If fully successful, the software developed in the ABLE project will be applied to the BHL collection of over 7 million pages.

Lead site: The Open University

Project Plan

Download the Project Plan (pdf)

Documents & Multimedia

Summary
Start date: 1 October 2008
End date: 30 September 2009
Funding programme: Digitisation and Content
Strand: Enriching Digital Resources 2008-09
Project website
Committees
  • JISC Content Services committee
Topic