Metadata Generation for Resource Discovery
Background
There is no single tool or suite of tools to which portal and repository managers can turn to meet most of their metadata generation requirements. The available tools generally handle a narrow range of digital formats, generate a restricted element set and, in the case of extraction algorithms, are mostly effective only within narrow subject domains or for documents of a predictable layout or genre. There is no registry or trusted body of documentation that rates the quality of metadata generation tools or identifies the most effective tool(s) for any given task. Benchmarks and reliable evaluation studies are conspicuously lacking. Moreover, few tools have APIs, so they cannot be called automatically in a flexible manner. A single metadata record will therefore usually require the merging of output from several tools, each of which must be invoked manually. Because the quality of their output is generally variable, metadata generation tools will probably be used mostly to pre-populate a cataloguing form, which cataloguers then amend. However, conclusions about whether such a hybrid human-machine metadata creation environment is more effective than a purely human cataloguing approach can only be drawn after the better tools, such as Data Fountains, have been developed as Web services, thus enabling a fair comparison to be made.
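The merging step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each tool's output has already been converted to a dictionary mapping element names to lists of values; the function name `merge_tool_outputs` and the sample element names are hypothetical and do not correspond to any tool evaluated by the project.

```python
from typing import Dict, List


def merge_tool_outputs(outputs: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-tool element dictionaries into one draft record.

    Values are de-duplicated case-insensitively within each element,
    keeping the first spelling seen. The result is only a draft intended
    to pre-populate a cataloguing form for later human amendment.
    """
    record: Dict[str, List[str]] = {}
    seen: Dict[str, set] = {}
    for output in outputs:
        for element, values in output.items():
            for value in values:
                key = value.strip().lower()
                # Skip empty values and values already recorded for this element.
                if not key or key in seen.setdefault(element, set()):
                    continue
                seen[element].add(key)
                record.setdefault(element, []).append(value.strip())
    return record


# Hypothetical outputs from two separately invoked tools.
harvested = {"title": ["Annual Report"], "subject": ["Archives"]}
extracted = {"subject": ["archives", "Digitisation"], "title": [" Annual Report "]}
draft_record = merge_tool_outputs([harvested, extracted])
```

A real workflow would also need to record which tool supplied each value, so that cataloguers can judge how far to trust it; that is omitted here for brevity.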
Overview
The creation of high-quality resource discovery metadata is crucial for the use, sharing and repurposing of digital resources. However, manual metadata creation by trained, professional cataloguers is believed to be expensive. Automated metadata generation is sometimes posited as a solution. This project evaluates auto-generation techniques such as: 1) the harvesting of metatags from document headers; 2) content (e.g. keyword) extraction from the body of documents; 3) automatic metadata enhancement using controlled vocabularies; and 4) text and data mining. Consideration is given to workflow issues and Web services approaches. The project considers a range of textual and non-textual resource types.
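As an illustration of the first technique, the harvesting of metatags from document headers, the following is a minimal sketch using only the Python standard library. The class and function names (`MetaTagHarvester`, `harvest_metatags`) and the sample HTML are assumptions made for this example; it does not represent any of the tools evaluated by the project.

```python
from html.parser import HTMLParser


class MetaTagHarvester(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") and attributes.get("content"):
            # Element names are lower-cased so "DC.creator" and "dc.creator" match.
            name = attributes["name"].lower()
            self.metadata.setdefault(name, []).append(attributes["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            title = "".join(self._title_parts).strip()
            if title:
                self.metadata.setdefault("title", []).append(title)
            self._title_parts = []

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def harvest_metatags(html: str) -> dict:
    """Return a dictionary of element name -> list of harvested values."""
    harvester = MetaTagHarvester()
    harvester.feed(html)
    harvester.close()
    return harvester.metadata


# Hypothetical sample document.
sample_html = (
    "<html><head><title>Sample Page</title>"
    '<meta name="DC.creator" content="A. Author">'
    '<meta name="keywords" content="metadata, cataloguing">'
    "</head><body><p>Text</p></body></html>"
)
harvested = harvest_metatags(sample_html)
```

Harvesting of this kind can only return what document authors have supplied, so its output depends entirely on the presence and quality of the embedded metatags.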
Aims and objectives
The project identifies and evaluates the following:
- The metadata needs of the JISC Information Environment (JISC IE);
- The metadata generation and creation processes that are currently used there;
- Currently available tools that are not being used by the JISC portals and repositories, and the reasons for the lack of uptake;
- The areas most urgently requiring new or improved metadata generation tools;
- The most promising new techniques and approaches emerging from recent experimental research into automated metadata generation;
- How and where such recent innovations can be profitably harnessed within the JISC IE to maximise the effectiveness of automated metadata generation.
Project methodology
The project uses the following methods:
- A questionnaire and in-depth interviews with portal managers to identify metadata needs and current metadata generation/creation processes.
- An examination of the schema and metadata documentation provided on JISC IE portal web pages.
- A full-scale review of the product literature and documentation relating to existing automated generation tools.
- A gap analysis to identify the areas most urgently requiring new or improved metadata generation tools.
- An in-depth review of the literature from recent experimental research into automated metadata generation to identify key trends, opportunities and solutions.
- Interviews with acknowledged experts in the subject.
Anticipated outputs and outcomes
The project's output is a report submitted to JISC that identifies the opportunities and scope for future research and development that may facilitate increased automated metadata generation within the JISC IE. The report contains a set of recommendations for future work within the JISC IE, based on an assessment of the severity of the need, the suitability of the emerging solutions, and the likely cost-benefits.
The project has studied:
- Automated metatag harvesting
- Automated content extraction
- Automated metadata enhancement using controlled vocabularies
- Automatic indexing/clustering
- Text mining
- Content-based image retrieval (CBIR)
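To make one of the techniques listed above more concrete, the sketch below shows a very simple form of metadata enhancement using a controlled vocabulary: free-text keywords are mapped to preferred terms, and anything unmatched is set aside for a cataloguer to review. The function name `enhance_with_vocabulary` and the tiny vocabulary are hypothetical; a real vocabulary would be far larger and would normally be loaded from a thesaurus or subject heading scheme.

```python
from typing import Dict, List, Tuple


def enhance_with_vocabulary(
    keywords: List[str], vocabulary: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Map keywords to preferred terms from a controlled vocabulary.

    `vocabulary` maps lower-cased variant forms to the preferred term.
    Returns (preferred_terms, unmatched_keywords), each in first-seen order.
    """
    preferred: List[str] = []
    unmatched: List[str] = []
    for keyword in keywords:
        term = vocabulary.get(keyword.strip().lower())
        if term is None:
            # No match: leave the keyword for a human cataloguer to review.
            unmatched.append(keyword)
        elif term not in preferred:
            preferred.append(term)
    return preferred, unmatched


# Hypothetical vocabulary of variant -> preferred term.
toy_vocabulary = {
    "cataloging": "Cataloguing",
    "cataloguing": "Cataloguing",
    "cbir": "Content-based image retrieval",
}
terms, leftovers = enhance_with_vocabulary(
    ["Cataloging", "CBIR", "cataloguing", "folksonomy"], toy_vocabulary
)
```

Exact-match lookup is the simplest possible approach and is used here only to show the idea; it does not handle stemming, multi-word phrases or ambiguity.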
Lead Institution
Arts & Humanities Data Service (AHDS)
Project partners
University College London, School of Library Archive and Information Studies (SLAIS)
Project Staff
Project Manager
- Andrew Wilson, Arts & Humanities Data Service, tel: 020 7848 1988, fax: 020 7848 1989. Andrew Wilson is no longer with the AHDS, so please direct all correspondence to: malcolm.polfreman@ahds.ac.uk
Project Team
- Vanda Broughton, University College London, School of Library Archive and Information Studies (SLAIS), tel: 020 7679 2291, fax: 020 7383 0557, v.broughton@ucl.ac.uk