Metadata is often defined as ‘data about data’ or ‘information about information’. It is usually structured textual information that describes something about the creation, content, or context of a digital resource – be it a single file, part of a single file, or a collection of many files. Metadata is the glue which links information and data across the World Wide Web. It is the tool that helps people to discover, manage, describe, preserve and build relationships with and between digital resources.
Metadata might take the form of a controlled term, carefully constructed or chosen from a formal list, and entered into a pre-established category. Or it may simply be a free text description or set of keywords used to annotate or ‘tag’ a resource. It could be information that is provided manually by a real person, or it could be information derived automatically from a machine or a piece of hardware.
Metadata might describe something objective and straightforward, such as the file size of a digital file; or something more complex, such as the subject matter of a resource or legal rights associated with its use. Metadata is often held within databases, but it can take other forms - it can just as easily be found embedded within a digital file itself.
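As a minimal sketch of the difference between automatically derived and manually supplied metadata, the Python below reads the file size and guesses the media type from the file itself, then adds free text tags supplied by a person. The function name `describe_file`, the element names and the sample file are assumptions made for illustration only; they are not taken from any particular schema.

```python
import mimetypes
import os
import tempfile


def describe_file(path, tags):
    """Combine automatically derived metadata with manually supplied tags."""
    return {
        # Derived automatically from the file system
        "file_name": os.path.basename(path),
        "file_size_bytes": os.path.getsize(path),
        # Guessed from the file extension, not from the file content
        "media_type": mimetypes.guess_type(path)[0],
        # Provided manually by a person
        "tags": list(tags),
    }


# Create a throwaway 2048-byte file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "0023.jpg")
    with open(path, "wb") as handle:
        handle.write(b"\x00" * 2048)
    record = describe_file(path, ["portrait", "woman"])

print(record)
```

The file size is objective and needs no human judgement, whereas the tags depend entirely on what the person describing the resource chooses to record.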
Metadata is selective
Metadata invariably offers a selective or simplified description of a resource. The Oxford English Dictionary defines metadata as 'data that operates at a higher level of abstraction'.
If a picture paints a thousand words, a text-based description can only ever partially capture the information or meaning held within a given resource, let alone all the other information that might be associated with it (for example, the history of its creation, its relationship to other resources, or possible uses to which it might be put).
The challenge for those producing metadata for a digital resource is, therefore, to establish the information that is the most important to record for a particular use.
Metadata is structured data
Metadata is usually structured in some way. Rather than associating terms randomly with a resource, it is common to use a set of generic categories (eg ‘Creator’, ‘Title’, ‘Subject’) and then assign specific terms within those categories (eg ‘Creator: Leonardo da Vinci’, ‘Title: Mona Lisa’, ‘Subject: woman’).
Metadata schema categories | Metadata vocabulary terms |
---|---|
Creator | Leonardo da Vinci |
Title | Mona Lisa |
Subject | woman, portrait, Renaissance... |
This approach has several advantages:
- It makes it easier to understand the metadata, making it obvious to a user, for example, that it is Leonardo who has created the work rather than Mona Lisa
- It makes it easier to retrieve the image in a search, since the search query can be specific, targeting relevant categories rather than searching across all of the metadata
- It can make it easier to share the resource and its metadata with other resources, as long as common categories and terminologies have been used
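The retrieval advantage can be illustrated with a short Python sketch. It assumes metadata records are held as simple dictionaries keyed by element name; the second record and the function names are hypothetical and exist only to show the contrast between a targeted search and a search across all of the metadata.

```python
# Hypothetical records: the element names follow the table above,
# the second record is invented purely for contrast.
records = [
    {"Creator": "Leonardo da Vinci", "Title": "Mona Lisa",
     "Subject": ["woman", "portrait", "Renaissance"]},
    {"Creator": "Jane Smith", "Title": "Leonardo da Vinci: a biography",
     "Subject": ["biography"]},
]


def _values(record, element):
    """Return the values of one element as a list of strings."""
    value = record.get(element, [])
    return value if isinstance(value, list) else [value]


def search_element(records, element, term):
    """Match the term only against the named element."""
    term = term.lower()
    return [r for r in records
            if any(term in v.lower() for v in _values(r, element))]


def search_all(records, term):
    """Match the term against every element of every record."""
    term = term.lower()
    return [r for r in records
            if any(term in v.lower() for e in r for v in _values(r, e))]


by_creator = search_element(records, "Creator", "leonardo")
anywhere = search_all(records, "leonardo")
print(len(by_creator), len(anywhere))  # prints: 1 2
```

Searching only the 'Creator' element returns the painting alone, while the unstructured search also returns the biography, because 'Leonardo' appears in its title.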
Metadata implementers sometimes use confusing terminology
Sometimes metadata categories are referred to as metadata ‘elements’ or ‘units’ and the full set of categories used to describe a resource is called a metadata ‘schema’, ‘data structure’ or ‘format’. Metadata can seem unduly complicated because there are no fixed rules for the use of these terms. As a result, their meaning can be ambiguous: terms can be used interchangeably or can have different uses in other contexts.
In this infokit the term 'element' will be used to refer to a single category, for example 'Title' or 'Description', and the term 'schema' will be used to refer to the full set of categories used in a given resource.
Another phrase often used in metadata circles is 'controlled vocabulary'. A controlled vocabulary is a pre-defined list of terms (eg a thesaurus) from which the terms used within a metadata element are drawn. The phrase is also applied to terms that have been constructed according to a standard set of rules (eg 'enter the creator name in this form: Surname, Forenames'). Using a controlled vocabulary and/or following a pre-determined set of rules for creating the terms used in metadata elements helps cataloguers and resource managers to:
- Enter data consistently
- Make searching and browsing digital collections easier and more efficient
- Relate content in one archive or collection with that in another, providing the same or similar rules have been applied
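A minimal Python sketch of both kinds of control is given below: checking terms against a pre-defined list, and constructing a term according to a rule. The list `SUBJECT_TERMS` and both function names are hypothetical; a real project would normally draw its terms from a published thesaurus rather than a hand-written set.

```python
# Hypothetical controlled list for the 'Subject' element.
SUBJECT_TERMS = {"woman", "portrait", "Renaissance"}


def check_subjects(terms, allowed=SUBJECT_TERMS):
    """Split terms into those on the controlled list and those that are not."""
    accepted = [t for t in terms if t in allowed]
    rejected = [t for t in terms if t not in allowed]
    return accepted, rejected


def format_creator(forenames, surname):
    """Apply the rule 'Surname, Forenames' to a creator name."""
    return f"{surname.strip()}, {forenames.strip()}"


accepted, rejected = check_subjects(["portrait", "lady", "woman"])
print(accepted)   # prints: ['portrait', 'woman']
print(rejected)   # prints: ['lady']
print(format_creator("Jane", "Smith"))  # prints: Smith, Jane
```

Rejecting 'lady' in favour of the listed term 'woman' is what keeps data entry consistent, and it is this consistency that allows one collection to be related to another.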
Digital resources can be complex and their metadata reflects this
Aggregation: Collection → Item → Part of an item
Metadata can be used to describe an individual digital file, a discrete part of an individual file, or an aggregation or collection of many digital files. Metadata used to describe a collection or aggregation of many digital files is often referred to as 'collection level metadata'. Examples of this could be: an online learning resource that contains many digital objects; a digital archive collection that contains many digital files relating to a person or subject; or a music album that contains many individual songs.
Item level metadata - the description of an individual digital file - is comparatively self-explanatory, and can include, for example, metadata for a single still or moving image file. It could be a user requirement, however, to describe a particular scene from a moving image, in which case metadata would also be needed to describe a component part of the single moving image file.
Metadata schemas have been developed to meet the aggregation challenge in different ways. Some schemas advocate separate metadata records to describe individual ‘things’ (collection, single item, part of an item) which can then be linked and related to one another (eg the Dublin Core schema), while others have been developed to describe different aggregations within a single metadata record (eg the SEPIADES schema).
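The two approaches can be sketched in Python using the music album example. This is an illustration of the general idea only: the identifiers and the element names `level`, `is_part_of`, `items` and `parts` are hypothetical and do not reproduce the actual syntax of Dublin Core or SEPIADES.

```python
# Approach 1: separate records for each 'thing', linked by identifier.
linked_records = {
    "album-1": {"title": "Example Album", "level": "collection",
                "is_part_of": None},
    "song-1": {"title": "First Song", "level": "item",
               "is_part_of": "album-1"},
    "song-1-chorus": {"title": "Chorus of First Song", "level": "part",
                      "is_part_of": "song-1"},
}


def ancestry(records, record_id):
    """Follow 'is_part_of' links from a record up to its collection."""
    chain = []
    while record_id is not None:
        chain.append(records[record_id]["title"])
        record_id = records[record_id]["is_part_of"]
    return chain


# Approach 2: one record that nests every level of aggregation.
nested_record = {
    "title": "Example Album", "level": "collection",
    "items": [
        {"title": "First Song", "level": "item",
         "parts": [{"title": "Chorus of First Song", "level": "part"}]},
    ],
}

print(ancestry(linked_records, "song-1-chorus"))
# prints: ['Chorus of First Song', 'First Song', 'Example Album']
print(nested_record["items"][0]["parts"][0]["title"])
# prints: Chorus of First Song
```

Linked records can be created and shared independently but rely on the links being maintained, whereas a nested record keeps everything together at the cost of a larger, more complex record.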
Describing digital and analogue
It is often the case that the digital file that is being described will be a representation of something that exists in the analogue world: a painting, a landscape, a person or a building, for example. It is also sometimes the case that a further analogue representation of the real world object may exist in the form of a photographic print, slide or transparency, or an audio or video tape.
Unsurprisingly, metadata sometimes has to reflect this complexity too. Take again the example of Leonardo’s Mona Lisa. In this case there might be (a) an original art work (the painting), (b) a photographic reproduction of that art work (a slide), and (c) a digital representation of that work (a digital file). The table below shows how the metadata might differ according to the different 'content layer' being described.
Element | Original image | Slide image | Digital image |
---|---|---|---|
Creator | Leonardo da Vinci | Jane Smith [photographer] | John Brown [scanning technician] |
Format | Painting | Photographic transparency | JPEG image |
Location | Louvre Museum | University slide collection | A:\images\0023.jpg |
Metadata producers therefore have to decide which digital or analogue 'thing' is being described, and how the resulting metadata will be organised and translated into clear and unambiguous information for the user. As was the case in the aggregation example above, metadata schemas have been developed to tackle the digital and analogue issue in different ways.
Some schemas have sub-categories within a given element: for example, 'Creator' could be sub-divided into Creator.Analogue (of the painting), Creator.Analogue.Surrogate (of the slide) and Creator.Digital (of the digital image). Some advocate separate elements that remain within the same schema (eg Artist; Photographer; Scanning Technician). Others use completely separate but linkable records for each description.
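The first two of these approaches can be sketched in Python using the names from the Mona Lisa table. The dictionary layout, the mapping `ROLE_ELEMENTS` and the function `by_layer` are assumptions made for illustration; they show one possible way of holding such data rather than the rules of any specific schema.

```python
# Sub-categories within a single 'Creator' element.
qualified = {
    "Creator.Analogue": "Leonardo da Vinci",
    "Creator.Analogue.Surrogate": "Jane Smith",
    "Creator.Digital": "John Brown",
}


def by_layer(record, element):
    """Return the values of one element, keyed by content layer."""
    prefix = element + "."
    return {key[len(prefix):]: value
            for key, value in record.items() if key.startswith(prefix)}


# Hypothetical mapping from the sub-categories to separate elements
# that stay within the same schema.
ROLE_ELEMENTS = {
    "Creator.Analogue": "Artist",
    "Creator.Analogue.Surrogate": "Photographer",
    "Creator.Digital": "Scanning Technician",
}

separate = {ROLE_ELEMENTS[key]: value for key, value in qualified.items()}

print(by_layer(qualified, "Creator")["Digital"])  # prints: John Brown
print(separate["Photographer"])                   # prints: Jane Smith
```

Either form makes it unambiguous which 'thing' each name belongs to, which is the decision the metadata producer has to make before any of the values are entered.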
Metadata can have many purposes
It is no exaggeration to say that metadata is the axis on which the wheels of the Internet turn. As users of digital resources, it enables us to find what we are looking for (resource discovery metadata) or tells us what resources are (descriptive metadata). It might tell us where the resource has come from, who owns it and how it can be used (provenance and rights metadata). It might describe how the digital resource was created (technical metadata), how it is managed (administrative metadata) and how it can be kept into the future (preservation metadata). Or it might help us to relate and link this digital resource with other resources (structural metadata).
While some of these functions overlap, in practice it can be convenient to use labels like 'descriptive' or 'administrative' to characterise the different metadata schemas in existence. Some schemas tend to be more focused on resource description and resource discovery (eg Dublin Core), while others include a larger proportion of administrative categories (eg Categories for the Description of Works of Art). There are also schemas that focus on particular purposes (eg PREMIS Preservation Metadata).
These distinctions are useful to keep in mind when developing a metadata framework. The fundamental question to ask is: what activities are to be supported? And following on from that: what particular metadata elements will be required to facilitate these activities? The broad distinction between 'descriptive metadata' and 'administrative metadata' is also a useful reminder that some metadata is aimed particularly at the end users of a digital resource and other metadata will centre primarily around the institutional management of the digital resource. Descriptive metadata is likely to be searched and displayed within a public interface, while much of the administrative metadata will be hidden from public display.
Notwithstanding some overlap, it is broadly helpful to think of metadata in four purpose types, namely:
- Descriptive metadata - used to find, identify and understand a resource (includes resource discovery metadata)
- Administrative metadata - used to manage the resource (includes technical and preservation metadata)
- Structural metadata - used to record and facilitate relationships between or within digital resources
- Use metadata - metadata collected from or about the users of a resource (eg web-logs and statistics, often server generated, but sometimes provided by users themselves through comments)
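One practical consequence of these purpose types is that a system can decide what to show publicly and what to keep for internal management. The Python sketch below assumes each element has been assigned to one of the four types; the element names, the sample values and the function `public_view` are hypothetical.

```python
# Hypothetical assignment of elements to the four purpose types.
PURPOSE = {
    "Title": "descriptive",
    "Subject": "descriptive",
    "File format": "administrative",
    "Part of": "structural",
    "Download count": "use",
}

record = {
    "Title": "Mona Lisa",
    "Subject": "portrait",
    "File format": "JPEG image",
    "Part of": "University slide collection",
    "Download count": 42,
}


def public_view(record, purpose=PURPOSE):
    """Keep only the descriptive elements for display to end users."""
    return {element: value for element, value in record.items()
            if purpose.get(element) == "descriptive"}


print(public_view(record))
# prints: {'Title': 'Mona Lisa', 'Subject': 'portrait'}
```

In practice the boundary is a matter of policy: an institution might also choose to display some structural or administrative elements, so the filter here is only one possible rule.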
Metadata often reflects the community it has come from
Digital collections are often created by and housed within particular communities, such as libraries, archives, museums and educational institutions. Many of the formal metadata schemas in use have been developed by those communities, and if a digital resource is based within a particular community, it probably makes sense to adopt the metadata schema commonly used there, at least as a starting point.
However, it is also true that the approaches and biases built into community-based metadata schemas may not be suitable for all digital resources. Some resources may span communities, or at other times a lighter touch may be required. In these cases, it will usually be necessary to take a more generic approach, which often results in a compromise as to the depth of information being recorded.
A third approach could be to assign metadata based on the format of the digital collection as opposed to the community it has come from. Metadata schemas have been developed to deal specifically with particular media types: still image, time-based media and text. These may offer advantages in describing a scenario unique to a media type (say, a part of an audio file), but may lack some of the administrative and descriptive detail inherent in the community-based approach.