JISC’s recent report on text mining shows that text mining techniques have huge potential benefits for the UK economy, particularly in areas such as the digital economy, knowledge infrastructure, health and the Semantic Web. But the report also shows that their use is being held back by barriers related to copyright law.
But what exactly is text mining? It comprises a number of steps, ranging from retrieving the documents most relevant to a user’s needs to extracting meaningful (semantic) nuggets of information. It does all this by looking at text in a new way – not by reading it as a human would, but by turning text into abstract representations which can be analysed using a battery of computational methods. One such method is ‘parsing’ – working out how words are combined into phrases and how phrases are combined into a sentence. There are also techniques that enable computers to distinguish, say, names of people from names of locations or of genes.
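As a rough illustration of what ‘turning text into abstract representations’ can mean, the sketch below splits a sentence into tokens and labels mentions of people, locations and genes. It is a toy under stated assumptions, not how any particular text mining system works: the lookup lists, the example sentence and the function names are all invented for this example, and real systems generally learn these distinctions from data rather than from fixed lists.

```python
import re

# Hypothetical lookup lists, invented for illustration only.
GAZETTEER = {
    "person": {"Jane Smith"},
    "location": {"Manchester", "Japan"},
    "gene": {"TP53"},
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag_entities(text):
    """Return (mention, type) pairs for listed names, in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), name, label))
    # Sort by character offset so mentions come out in reading order.
    return [(name, label) for _, name, label in sorted(found)]

# An invented example sentence.
sentence = "Jane Smith studies TP53 in Manchester."
print(tokenize(sentence))
print(tag_entities(sentence))
```

The token list is the simplest kind of abstract representation; a parser would go further and group those tokens into phrases.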
Professor Sophia Ananiadou, director of the National Centre for Text Mining (NaCTeM), is better placed than most to explain how it works. She says, “Applying text mining on a massive collection opens up the content at a semantic level, allowing you to carry out a more precise and informative search than is possible with conventional keyword-based search. For example, I might be researching diabetes in the research papers. I may be looking for a particular angle, such as diabetes in children, or I may not even know what I am looking for, and I might easily be overwhelmed by the volume of material a conventional search would return.
“However, advanced search engines equipped to exploit the results from literature that has been text mined will enable me to rapidly focus my search and drill down to precise concepts expressed in their many various ways in text, or even to find ‘semantic snippets’ that would ‘fill in the blanks’ for me in queries such as ‘What causes diabetes?’.
“Furthermore, text mining allows researchers to search for meaningful associations using different terms to bring up new connections and facilitate knowledge discovery. No-one can possibly keep up with all the articles being published in their field – and we must not forget that, increasingly, researchers are having to engage in multi- and interdisciplinary research to tackle the tough challenges of science. Text mining allows you to see the path through the forest, not get lost in the trees, and it also lets you see unsuspected side paths that lead you on to new adventures of discovery.”
Drug discovery offers a great example. It is typically a very long process where one of the most time-consuming procedures is searching the literature for new associations between genes, proteins, chemical compounds, symptoms and diseases. Not so long ago, a reader would have laboriously carried out the literature search, making inferences such as that A may affect C – maybe because they had read that A affects B and B affects C in different research papers. Now, however, text mining provides more efficient ways to find such jewels of even unsuspected information in the vast and ever-growing biomedical literature.
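The ‘A affects B, B affects C, so A may affect C’ reasoning described above can be sketched in a few lines. This is a minimal illustration, assuming the relations have already been extracted from the literature; the entity names, paper identifiers and the function name are placeholders, not real findings or a real API.

```python
from collections import defaultdict

# Hypothetical relations, each imagined as extracted from a different paper.
EXTRACTED = [
    ("A", "affects", "B", "paper-1"),
    ("B", "affects", "C", "paper-2"),
    ("C", "affects", "D", "paper-3"),
]

def candidate_links(relations):
    """Suggest indirect links (a, c) where a affects b and b affects c.

    Links already stated directly in the input are skipped. Each suggestion
    records the intermediate entity and the two supporting papers.
    """
    targets = defaultdict(set)
    sources = {}
    for subj, _verb, obj, paper in relations:
        targets[subj].add(obj)
        sources[(subj, obj)] = paper

    suggestions = []
    for a, mids in targets.items():
        for b in sorted(mids):
            for c in sorted(targets.get(b, ())):
                if c != a and c not in targets[a]:
                    suggestions.append(
                        (a, c, b, sources[(a, b)], sources[(b, c)])
                    )
    return suggestions

for a, c, via, p1, p2 in candidate_links(EXTRACTED):
    print(f"{a} may affect {c} (via {via}; see {p1} and {p2})")
```

Each suggestion is only a candidate for a researcher to check, not a conclusion; the gain is that the two supporting statements no longer need to be found by the same reader.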
JISC’s recent report goes beyond giving evidence of how text mining can improve individual research practices. It also outlines the considerable benefits text mining offers for UK research and society. Torsten Reimer, JISC programme manager, explains, “Text mining is already producing efficiencies and new knowledge in areas as diverse as biological science, particle physics, media and communications.” And Professor Douglas Kell, chief executive of the Biotechnology and Biological Sciences Research Council (BBSRC), points to the ‘enormous value’ that text mining could add to the UK economy – “as long as the text is freely available and unencumbered.” He says that without text mining there is a real risk that we will miss discoveries that could have significant social and economic impact.
Current copyright law, however, imposes serious restrictions on text mining, because text mining involves a range of computerised analytical processes that are not all readily permitted within UK intellectual property law. In order to be ‘mined’, text must be accessed, copied, analysed, annotated and related to existing information and understanding. Even if the user has access rights to the material, making annotated copies can currently be illegal without the permission of the copyright holder. Like other readers of the JISC report, Professor Kell stresses the importance of implementing the Hargreaves Review recommendation to allow an exception in copyright law for non-commercial data and text mining.
His view was shared by the majority of the audience at JISC’s recent text mining event. Philip Ditchfield, of GlaxoSmithKline, explained: “There are about 7,000 diseases out there and we can cure about 1% as an industry at the moment. We’re all patients at the end of the day and we need to discover medicines. That’s the priority. We’re a very compliant industry and we want to work with publishers, not undermine their intellectual property.”
Text mining is one way to make sense of the big data we all face.
JISC is currently funding an international competition, Digging into Data, which asks “what do you do with a million books?” Or a million pages of newspapers? Or a million photographs of artworks?
UK public data could already be worth £16bn per year, according to the Cabinet Office.
JISC funded the National Centre for Text Mining for seven years, between 2004 and 2011. The Centre, based at the University of Manchester, is now independent of JISC funding, but what does that mean for the researchers who rely on its services? And what can others learn from its new business model? Nicola Yeeles from JISC caught up with centre director Sophia Ananiadou to find out.
The call, however, is not just to work together with publishers to open up content, but also to look for imaginative ways to exploit what the technology can already do. Sophia explains that text mining adds value to existing deals between publishers and academic institutions. Moreover, she sees a potential win in joining up information about research papers and other documents held by our universities to make them more searchable.
She says, “I think we could be looking at a very different scenario if different institutional repositories were talking to one another.” The NaCTeM, which is now becoming independent of JISC having secured sustainable funding, is inviting partnerships with academics and also with industry interested in mining data, so that their endeavours might then be fed back into benefiting the research community. “It is a continuous process,” Sophia explains. “When we finish a piece of collaborative research with a business partner, such as AstraZeneca, we then go on to develop text mining services for publicly available datasets, so that the whole academic community also profits from the industrial funding we have received. These services are free to academia. Thus, the potential for improving the competitiveness of the UK economy is vast.
“Very often people have different needs depending on the applications, so we welcome ideas about customising our services. We can explore ideas in common even before they have secured funding, by collaborating with a research team or institution to bid for funding to develop a different application or service that might look into a particular type of semantic association or search previously untapped content.”
Sophia adds, “A good example of how the results of several years of engagement with users have fed into incremental improvement of our text mining services can be seen in our work for UK PubMed Central (www.ukpmc.ac.uk), a large archive of full text content in Biomedicine and Health. Our EvidenceFinder service for UKPMC offers a unique way of conducting a semantic search for evidence expressed in the form of facts: we process on a regular basis the growing open access subset of UKPMC, such that, when the user enters an initial query, EvidenceFinder automatically generates questions involving elements of that query, for the user to explore. Crucially, however, these generated questions are unlike the autocompleted queries commonly found in conventional search engines; instead, the questions we generate from the underlying analysed facts are known to have answers. This approach can be applied to any domain.”
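The idea of offering only questions that are known to have answers can be illustrated with a small sketch. It is not EvidenceFinder’s implementation, which the article does not describe in detail; it simply assumes that facts have already been extracted as subject, verb and object triples, and every fact, name and function below is a placeholder invented for the example.

```python
# Hypothetical facts, standing in for the output of text mining an archive.
FACTS = [
    ("factor X", "causes", "diabetes"),
    ("treatment Y", "treats", "diabetes"),
    ("factor X", "causes", "condition Z"),
]

def questions_for(query, facts):
    """Map each generated question to its stored answers.

    A question is only created when at least one fact answers it, so
    every key in the returned dictionary has a non-empty answer list.
    """
    answers = {}
    wanted = query.lower()
    for subj, verb, obj in facts:
        if obj.lower() == wanted:
            answers.setdefault(f"What {verb} {obj}?", []).append(subj)
        if subj.lower() == wanted:
            question = f"{subj[0].upper()}{subj[1:]} {verb} what?"
            answers.setdefault(question, []).append(obj)
    return answers

for question, found in questions_for("diabetes", FACTS).items():
    print(question, "->", found)
```

Because questions are built from the stored facts rather than from what other users have typed, a suggestion such as ‘What causes diabetes?’ cannot lead to an empty result, which is the contrast with autocompleted queries drawn in the quote above.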
Text mining remains easier in other countries: it is permitted in Japan, while academics and technology companies in the US can argue fair use. According to Dr Diane McDonald, one of the authors of the JISC report, “there is a risk that new and innovative companies might move to the US if text mining is too hard to do here, losing the potential for jobs and revenue.”
Can we afford for that to happen? We welcome your comments below.
