Gateway to text and data mining
Investigating development options for a platform to facilitate computational analysis of textual corpora.
Ended 31 Jul 2019
£50k to £250k
We are looking at options for a text and data mining (TDM) service across two existing Jisc services - CORE and Journal Archives. This potential service would leverage computational research methods to generate new forms of research outcomes, or to deliver traditional outcomes more efficiently and with a higher level of accuracy.
What we're doing
The UK government defines text and data mining as "the process of deriving information from machine-read material. It works by copying large quantities of material, extracting the data, and recombining it to identify patterns". In research, TDM offers significant opportunities for cutting-edge investigation. By carrying out computational analyses on large corpora, researchers can make new discoveries, make the research process more accurate and efficient, and advance scholarship.
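As a purely illustrative sketch of the "extract and recombine" idea in the definition above, the Python snippet below pulls terms out of three made-up sentences and counts which pairs of terms appear together in more than one of them. The sample sentences, the stopword list and the function name `extract_terms` are assumptions made for this example; they do not describe any Jisc tool or dataset.

```python
import re
from collections import Counter
from itertools import combinations

# Made-up stand-ins for "machine-read material" (assumption for illustration).
documents = [
    "aspirin reduces fever and inflammation",
    "ibuprofen reduces inflammation and pain",
    "aspirin may relieve pain",
]

# A tiny, hypothetical stopword list; real toolkits ship much larger ones.
STOPWORDS = {"and", "may"}


def extract_terms(text):
    """Extract the set of lower-cased content words from one document."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


# Recombine: count how many documents each pair of terms shares.
pair_counts = Counter()
for text in documents:
    terms = sorted(extract_terms(text))
    pair_counts.update(combinations(terms, 2))

# Identify patterns: keep only pairs seen together in more than one document.
recurring = {pair: n for pair, n in pair_counts.items() if n > 1}
print(recurring)
```

At the scale of a real corpus the same three steps apply, but the extraction step would typically rely on dedicated language-processing tools rather than a regular expression.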
Despite the favourable legal context for TDM, brought about by the text and data mining copyright exception, creating the infrastructure to process research literature at scale is a significant organisational and technical challenge, a challenge Jisc is uniquely placed to address.
We are currently investigating the opportunities for a possible Jisc-delivered text and data mining service and analysing options for how we might deliver it, initially across two existing Jisc services: CORE and Journal Archives. These two platforms already deliver significant amounts of digital content in the form of scholarly articles, and their combined corpus could immediately facilitate new research opportunities and lines of inquiry if text mining techniques were applied to it.
As more machine-readable content has become available, many scholarly disciplines have steadily grown their level of engagement with text and data mining and the practice has moved increasingly into the mainstream of research. From entity recognition through sentiment analysis to corpus linguistics, TDM takes a wide range of forms, used to differing ends by just as many disciplines.
This growth may be explained by the benefits that TDM has the potential to confer. As evidenced in the Jisc value and benefits of text mining report, these include:
- Increased researcher efficiency
- Unlocking hidden information and developing new knowledge
- Exploring new horizons
- Improved research and evidence base
- Improving the research process and quality
Broader economic and societal benefits include cost savings and productivity gains, innovative new service development, new business models and new medical treatments.
How we're doing it
The goal of gateway to text and data mining is to provide a solution which supports both experienced and novice practitioners of TDM – a service that is intuitive and easy to understand, yet with capabilities sufficient to be of real value to the majority of users.
Whilst the project is still in its formative stages, a broad approach has been identified, which seeks to combine three key elements:
- A mechanism allowing users to define and create their own, task-specific corpora from a ‘pool’ dataset created initially from CORE and Journal Archives
- A workflow environment in which corpora can be interrogated and processed using TDM components from an available toolkit. To begin with, this range of tools will be limited and will concentrate on providing those most widely used
- Training materials and helpdesk assistance, as well as mechanisms to share experiences and best practices, in order to encourage and support a community of practitioners
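The first two elements above can be pictured with a minimal Python sketch: a user selects a task-specific corpus from a shared pool, then runs one simple toolkit component (a term-frequency count) over it. Everything named here is hypothetical and chosen for illustration only, including the `Document` record, the sample pool, and the functions `build_corpus` and `term_frequencies`; none of it reflects an actual CORE or Journal Archives interface.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one item in the pooled dataset."""
    source: str  # e.g. "CORE" or "Journal Archives"
    year: int
    text: str


# Made-up pool standing in for content drawn from both services.
POOL = [
    Document("CORE", 2015, "Text mining reveals patterns in research literature."),
    Document("Journal Archives", 1890, "The society met to discuss the mining of coal."),
    Document("CORE", 2018, "Sentiment analysis and entity recognition are text mining methods."),
]


def build_corpus(pool, predicate):
    """Element 1: define a task-specific corpus by filtering the pool."""
    return [doc for doc in pool if predicate(doc)]


def term_frequencies(corpus):
    """Element 2: one simple toolkit component, counting words across a corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(re.findall(r"[a-z]+", doc.text.lower()))
    return counts


# A user-defined selection: recent items from one source only.
corpus = build_corpus(POOL, lambda doc: doc.source == "CORE" and doc.year >= 2010)
freqs = term_frequencies(corpus)
print(len(corpus), freqs["text"], freqs["mining"])
```

Keeping corpus definition separate from processing means that the same selection could be passed to other components, such as entity recognition or sentiment analysis, without being rebuilt.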
We are aiming to place the user experience firmly at the centre of future product development, so that technology solutions are built, in the first place, around user need rather than around solely technical considerations.
To support this, we will be testing ideas and prototypes, in order to analyse and integrate user requirements, assumptions and expectations of a potential gateway to text and data mining service.
This project ran until summer 2019