The aim of this project was to investigate the needs of the academic chemistry research community with respect to how data associated with theses may best be managed and to facilitate routine and automatic extraction of domain-specific data, its transformation into metadata and ingest into institutional repositories.

Submission, preservation & exposure of Chemistry teaching & research data from theses

Executive Summary

Much of the experimental data generated by postgraduate researchers in chemistry and related departments are conventionally reported in theses. Although such theses might describe up to 50 novel chemical syntheses, with full characterisation of the synthesised compounds, much of this is not communicated to the scientific community through peer-reviewed publication in an appropriate form (numbers are reduced to points on diagrams, tables are converted to graphs in pixel form), and a significant proportion of preparative procedures (anecdotally estimated at 80%) are never formally submitted at all. Even where the bare outline of a synthesis is published, the detailed experimental recipes (as found in the thesis) are often omitted.

This project was funded as a proof of concept to develop software that automatically extracts chemical terms and objects contained within electronic theses (e-theses). We have shown that it is possible to reliably identify organic chemical terms in theses in both Portable Document Format (PDF) and Office Open XML (DOCX) format, and to extract and deposit these within a Resource Description Framework (RDF) triplestore. Semantic Web standards for searching data have been developed by W3C, and we have explored the viability of RDF-based semantic querying to enable re-use of the data contained within chemistry e-theses. Although the internal structure of PDF did not permit the identification of chemical objects (e.g. spectral assignments and physical properties), we were able to capture them from DOCX format e-theses as Chemical Markup Language (CML) data files. These files were deposited in APP-enabled data repositories, each being URI-linked to a searchable named chemical entity in the RDF triplestore.
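The linking described above, in which a named chemical entity in a triplestore points by URI to a deposited CML file, can be illustrated with a minimal sketch. This is not the project's software: the toy in-memory store, the `http://example.org/` namespace, the predicate names and the sample data are all assumptions made for illustration, and a real deployment would use an RDF triplestore queried through a W3C query language rather than this hand-written pattern matcher.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and predicate names, for illustration only.
EX = "http://example.org/thesis/"
HAS_NAME = EX + "hasChemicalName"
HAS_CML = EX + "hasCMLFile"
FOUND_IN = EX + "foundInThesis"


class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in sorted(self._triples):
            if (
                (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)
            ):
                yield t


def cml_files_for_name(store: TripleStore, name: str) -> List[str]:
    """Return the CML file URIs linked to entities carrying the given name."""
    files: List[str] = []
    for entity, _, _ in store.match(p=HAS_NAME, o=name):
        for _, _, cml_uri in store.match(s=entity, p=HAS_CML):
            files.append(cml_uri)
    return files


# Invented sample data: one entity with a CML file, one without.
store = TripleStore()
store.add(EX + "entity/1", HAS_NAME, "benzaldehyde")
store.add(EX + "entity/1", HAS_CML, "http://example.org/repo/cml/1.cml")
store.add(EX + "entity/1", FOUND_IN, EX + "thesis/42")
store.add(EX + "entity/2", HAS_NAME, "toluene")

print(cml_files_for_name(store, "benzaldehyde"))
```

The two-step lookup (name to entity, entity to file) mirrors a simple graph-pattern query: the entity URI is the join point between the searchable name and the deposited data file.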

We have demonstrated:

  • routine and automatic extraction of Chemical Objects (e.g. molecules, spectra) and named chemical entities in high volumes, transformation into metadata and their capture into data repositories and triplestores
  • exploration of the viability of RDF-based semantic querying
  • review of current document format practice in the deposition of chemistry theses and how this influences ease of data extraction

This machine-based identification of chemical terms was achieved using a modified version of the OSCAR3 processing software. Because OSCAR3 relies in part on the ChEBI chemistry ontology, it is specific to the ‘small molecule’ organic structures typically found in synthetic organic chemistry theses. This indicates a need to develop ontologies for other domains of chemistry.
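To give a rough idea of why term identification depends on the vocabulary available, the following minimal sketch spots chemical names by simple dictionary lookup. It is not OSCAR3 and does not reproduce its methods or API; the tiny lexicon, the placeholder identifiers and the function name `find_terms` are all hypothetical. Names absent from the lexicon are simply not found, which is the limitation that motivates broader ontologies.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon mapping chemical names to placeholder identifiers.
# A real system would draw on a curated ontology, not a hand-written dict.
LEXICON: Dict[str, str] = {
    "benzaldehyde": "ID:0001",
    "ethyl acetate": "ID:0002",
    "toluene": "ID:0003",
}


def find_terms(
    text: str, lexicon: Dict[str, str] = LEXICON
) -> List[Tuple[int, str, str]]:
    """Return (offset, matched text, identifier) for each lexicon name found."""
    if not lexicon:
        return []
    # Try longer names first so multi-word names are matched as a whole.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Invented sample sentence in the style of an experimental section.
sample = "Benzaldehyde was dissolved in toluene and extracted with ethyl acetate."
for hit in find_terms(sample):
    print(hit)
```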

This project was funded by JISC's Digital Repositories Programme as a joint project between Cambridge University Library and the chemistry departments of the University of Cambridge and Imperial College.

Report available electronically only
