Automatic Biodiversity Literature Enhancement
The overall aim of the project is to establish and extend techniques for extracting information from scanned taxonomic literature in the Biodiversity Heritage Library (www.biodiversitylibrary.org). Scanned texts contain errors introduced by imperfect Optical Character Recognition (OCR) and other sources, so the techniques must be robust to such errors.
The ABLE project aims to extract mark-up and meta-data from scanned literature in the biodiversity domain. The meta-data we aim to extract includes proper nouns (taxon, people and place names) and dates. We also intend to make those terms easier to search by combining associative techniques from Natural Language Processing with knowledge of likely OCR errors. For example, a search for Pica should also recover the misrecognised form Pioa, provided the context of Pioa indicates a bird, ideally a magpie.
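The Pica/Pioa example can be illustrated with a minimal sketch. It is not the ABLE software: the confusion table, the context vocabulary, and the function names `ocr_match` and `search` are all assumptions made for illustration, and only single-character substitutions are handled.

```python
# Hypothetical table: true character -> characters OCR may produce instead.
OCR_CONFUSIONS = {
    "c": {"o", "e"},
    "l": {"1", "i"},
    "i": {"l", "1"},
    "u": {"n"},
}

# Hypothetical context vocabulary associated with each search term.
CONTEXT_TERMS = {"pica": {"bird", "magpie", "corvid"}}


def ocr_match(query, token):
    """Return True if token equals query up to known OCR substitutions."""
    if len(query) != len(token):
        return False
    for q, t in zip(query.lower(), token.lower()):
        if q != t and t not in OCR_CONFUSIONS.get(q, set()):
            return False
    return True


def search(query, text, window=5):
    """Return tokens matching query, accepting OCR variants only in context."""
    tokens = [w.strip(".,;:()") for w in text.split()]
    lowered = [t.lower() for t in tokens]
    wanted = CONTEXT_TERMS.get(query.lower(), set())
    hits = []
    for i, word in enumerate(lowered):
        if not ocr_match(query, word):
            continue
        # Words within `window` positions on either side of the candidate.
        context = set(lowered[max(0, i - window): i + window + 1])
        # Exact matches always count; OCR variants need supporting context.
        if word == query.lower() or context & wanted:
            hits.append(tokens[i])
    return hits


print(search("Pica", "The magpie, Pioa pica, is a common bird."))
print(search("Pica", "Pioa was a village near the river."))
```

In this sketch the first call returns both Pioa and pica because "magpie" and "bird" appear nearby, while the second returns nothing because the surrounding words give no indication of a bird. A fuller approach would presumably weight confusions by likelihood and handle insertions and deletions as well.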
If fully successful, the software developed in the ABLE project will be applied to the BHL's collection of over 7 million pages.
Lead site: The Open University
Project Plan
Download the Project Plan (pdf)