Digital technology is having a huge impact on how researchers work and the skills they need, from data analytics to data management. Tony Hey, chief data scientist of the Science and Technology Facilities Council and former vice-president of Microsoft Research, explains the challenges facing data-intensive scientific discovery, from data curation to preservation, and the need for a UK e-infrastructure strategy.
At Digifest you're going to be talking about data-intensive scientific discovery, something you call the Fourth Paradigm. What is that?
"Almost every field in science is now being overwhelmed by the amount of data it is creating. We've gone from having very little data to having more than we can cope with.
Take the environment, for example. The National Science Foundation in the US is funding an exciting project called the Ocean Observatories Initiative. They're putting a fibre optic cable on the seabed on the Juan de Fuca tectonic plate just off Seattle. Instead of sending out a ship with post-doc researchers who return after a few weeks with some data on a USB stick to take back to the lab, researchers will now have data streaming in 24 hours a day, 365 days a year. Seattle is in an earthquake zone and at risk of a really big earthquake about every 200 years, so it seems like a good idea to be able to monitor the seabed in real time!
The notion of the Fourth Paradigm came from my late colleague and Turing award winner, Jim Gray. Jim observed that if you've now got huge amounts of data, researchers need to know how to organise it, how to reorganise it, how to access it, how to visualise it, and how to do sophisticated data analytics on it. For scientists drowning in data, life gets enormously complicated!
You now need to teach researchers new skills of data management and analytics. That's what Jim meant by the Fourth Paradigm, following on from the three other paradigms - experimental science, theoretical analysis and computer simulation. The Fourth Paradigm doesn't supersede the other three - researchers still need to know something about each - but increasingly, many scientists need to be equipped with techniques for managing and analysing data."
Jim Gray’s dream was to have a world in which "all of the science literature is online, all of the science data is online and they interoperate with each other". How close are we to that?
"I think 2013 was a very significant year. One of the things that I pursued as a dean at Southampton University was a repository providing open access to research papers. I was keen to establish a research repository because, as dean of engineering, I had 200 engineers and about 500 post-docs and students in my faculty, all writing research papers - and each year the library would ask me which journals I wanted to cancel. In fact, the university could not afford to subscribe to all the journals in which my staff published - yet I was supposed to write a review of the faculty's research at the end of each year.
Journal prices have been going up by something like 10-15% on average per year over the last 10 or 15 years while library budgets have been going up by at most 3% per year, so there's clearly a real mismatch and something has to change in research publishing. So I started by insisting we keep a digital full-text copy of all the faculty’s research papers in a faculty research repository. Although I started this initiative, it was other academics and librarians at Southampton who made this vision a reality.
A major role for the library at a research-led university is surely to manage the research repository and be the guardian of the research output of the university. Running such a repository is a key part of a research university’s "reputation management".
But what you really need is to be able to get from the paper to the data on which the authors have based their results. For me, the most significant event in 2013 was the US White House memorandum requiring every significant federal research funding agency in the US to come up with a plan to increase public access to the results of their research. And the definition of research included not only research papers but also the relevant data sets. This memo has kicked off a worldwide focus on open science with research papers linked to the relevant data sets.
So I absolutely do believe in Jim Gray's vision: the future is all about putting both the paper and the relevant data online."
And, over here, what do you think about the current open access landscape? How have things evolved since the Finch Report?
"I think the Finch Report made a mistake in advocating only gold open access. I have always advocated green open access, and this is not because I hate publishers - I don't. I edited journals for non-open access publishers for many years, but as dean, I just could not afford what they were selling me.
We need to have a new partnership with publishers. I don’t want to do all the things they do, but I do want to have a service that is good value and that I can afford. That’s all I want! But of course, at the moment, we’re in the middle of a revolution.
So, currently, it’s all very confusing but, at the end of the day when the dust has settled, the general public will be able to have access to the research papers that their taxes have paid for without having to pay out large sums to publishers. That’s surely a good thing."
So you think the open access road is going in the right direction?
"Yes, I think we are definitely going in the right direction. It’s just very confusing as to exactly what funding model will emerge at the end of this revolution! If you talk to different people, you get different answers.
The Finch Report just proposes one possible answer. However, curating, publishing and archiving data is much more complicated than keeping just research papers in repositories, certainly in terms of how this process should be funded. It is clearly not sensible to keep all the data scientists create, so how much data should you keep? I believe that you have to have the people who were involved in generating the data – the scientists – also involved in deciding which of their data should be kept.
But there is another problem: in computational and data-intensive science, some of the most important people are the scientists who write the scientific software or who do the sophisticated data management and write the clever data analysis codes. But these people are not the principal investigators, so they don't get the limelight.
How do you give attractive career paths to these talented people who are now, increasingly, becoming essential to the progress of modern science? I think this is a real challenge for universities and the research community. It’s especially a challenge because if you train people in data intensive science, they can go and get a job in industry at twice the salary!"
You're working on the UK infrastructure strategy – what does that involve?
"At a national level we need to have an infrastructure capable of supporting first class research. On the network, data and software side, I think there are six key elements. The first is a high bandwidth research network, which is what the Janet network attempts to provide at the moment. However, I think we need to do more with the Janet network in order to support data-intensive applications adequately.
We also need to have a robust authentication and authorisation infrastructure so that the right people can get access to the right systems. Then we need to worry about the production and sustainability of research software. A fourth component consists of technologies and standards for data provenance, curation and preservation. Only then can we realise Jim Gray’s vision for open access to publications and data via research repositories. And finally, we need to create appropriate and powerful discovery tools capable of searching efficiently across globally distributed repositories.
Complementary to this network, software and data infrastructure is the provision of high performance computing resources. In the UK we must provide our computational scientists with access to supercomputers that are competitive on a world scale. In addition, we need to provide appropriate high performance computing resources at both regional and university levels. We then must put this all together and train the people we need to make efficient and effective use of this research infrastructure.
The E-infrastructure Leadership Council, which I co-chair with the minister for science and universities, Jo Johnson, is all about how industry can make appropriate use of this expensive research infrastructure. Can we make the e-infrastructure really usable for industry so that our companies can become more competitive and productive? For me, this is a very exciting challenge."
More generally, how do you see Jisc's role with the infrastructure strategy?
"I see Jisc as capable of playing a leadership role in providing a high performance research network.
One of the challenges we face is the need to have a high bandwidth network that is able to move high volumes of data direct from the data source right to where the researcher needs it. The problem is that you can have a very high bandwidth wide area network, but when the data stream reaches the researcher's university it has to go through the firewall and gets mixed up with email, web traffic, students watching videos for teaching and so on.
What you need is what they call in the US a ‘Science DMZ’ - a science demilitarised zone. This architecture means that when the data gets to the firewall, it doesn't go through the firewall but goes straight to where you need it. In this way, you solve the ‘last mile’ problem and can have sustained high end-to-end bandwidth. This is critical to enable you to transfer lots of data in a reasonable time as opposed to taking weeks.
In the US, they have implemented this type of data stream architecture at over 100 universities under the NSF’s Campus Cyberinfrastructure initiative. In the UK we have no comparable initiative involving university CIOs. Traditionally, Jisc’s responsibility has stopped at the university firewall so there is a worrying gap in our UK e-infrastructure. To address the needs of the new generation of data-intensive scientists, I think that Jisc needs to partner with universities to implement high speed end-to-end data links.
However, Jisc’s development of robust authorisation and authentication middleware is extremely important and allows companies as well as academics to access our national research infrastructure as authorised users. Jisc has also played a key role in promoting repositories for research publications and data. Finally, the Digital Curation Centre (or DCC) is funded by Jisc and has been a pioneer in data curation and a key source of expertise for the research community for more than 10 years."
Join Tony at Digifest
This is the first in a series of features from our speakers for this year's Digifest.
Join the debate
The views expressed by contributors to Jisc Inform are theirs alone and not necessarily those of Jisc. You might not agree with everything that the contributors say but you are guaranteed to read something that will raise questions and spark debate while you're at Digifest - and beyond.