Data is everywhere – but where does it come from, where is it being stored, and is it worth keeping? It is crucial to take stock.
The UK Government has recently published a sweeping new National Data Strategy, which revealed plans for an unprecedented audit and digitisation of public datasets. The £120m project will include the digitisation and assessment of government documents and datasets from the NHS, police, fire and rescue, and education.
How the National Data Strategy affects universities and colleges
Educational institutions such as universities and colleges need to preserve large amounts of data.
Think about digital research outputs such as research articles, research data and PhD projects, as well as special collections, archives and electronic records management. On top of that, institutions have statutory obligations to keep a multitude of records, such as financial, tax, staff, student and governance data. A first step in managing these datasets is to get a handle on the extent of the data at hand.
Auditing data sounds cumbersome but has become an essential tool to assess whether an organisation’s data is fit for purpose. The government is keen to get on top of the ever-growing problem of what to scan, store or shred as a great many organisations are generating more data than they dispose of, creating a gargantuan virtual landfill.
Preserving your data
Digital preservation is a significant challenge, needing continuing assessment as both technologies and usage change.
Clever tools, such as Jisc’s preservation service, automatically reformat files so they are readable with new and as-yet unbuilt software. Once in the preservation system, the files are automatically ‘recognised’ and then processed according to pre-set rules into an appropriate format that is as future proof as possible.
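As a rough illustration of what 'processed according to pre-set rules' can mean, here is a minimal Python sketch. It is not how Jisc's service works internally: the rule table, the format pairings and the function name `plan_normalisation` are all hypothetical, and the sketch recognises files by extension only, whereas real preservation systems generally inspect the file contents as well.

```python
from pathlib import Path

# Hypothetical rule table: recognised source format -> preferred preservation format.
# These pairings are illustrative assumptions, not any real service's policy.
PRESERVATION_RULES = {
    ".doc": ".pdf",
    ".docx": ".pdf",
    ".xls": ".csv",
    ".bmp": ".tiff",
    ".wav": ".flac",
}


def plan_normalisation(filename: str) -> str:
    """Return the action a rule-based system might plan for one file."""
    # 'Recognise' the file by its extension (a simplification, see lead-in).
    suffix = Path(filename).suffix.lower()
    target = PRESERVATION_RULES.get(suffix)
    if target is None:
        # No rule matched: leave the file untouched rather than guess.
        return "keep original (no rule matched)"
    return f"convert to {target} and keep original"


if __name__ == "__main__":
    for name in ["thesis.DOCX", "budget.xls", "notes.txt"]:
        print(name, "->", plan_normalisation(name))
```

Keeping the original alongside the converted copy is a deliberate choice in this sketch, so that a poor conversion never destroys the source file.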
Tools of this sort help preserve the types of documents that form the building blocks of our history. Birth, death and marriage records used to be kept in paper form, giving insights into what life was like many years ago. These records are relatively easily preserved if they are well kept, and their conversion to an accessible digital format is relatively straightforward.
The threats of obsolescence or loss are amplified where the technical challenges are high and when it’s not clear who’s responsible for preserving the data, for instance when there are multiple stakeholders involved or platforms change.
Think about old-fashioned Telex [1] messages that often used to be the first source for breaking news. These Telex messages are on the Digital Preservation Coalition’s 'bit list' of digitally endangered species. This list highlights digital materials that are most at risk of extinction, as well as those which are relatively safe thanks to digital preservation.
Organising and auditing data
Over the past decades, data collecting has evolved in an organic fashion. Data is now preserved in a myriad of ways, as the amount of information that needs to be stored has increased and preservation systems have changed. And the idea ‘if it doesn’t exist in three places it doesn’t exist’ simply causes a tripling of the problem.
People are (and will probably continue to be) one of the biggest problems when it comes to organising and managing data. What is a logically ordered dataset for one may be totally incomprehensible to others.
Questions to ask when auditing
A data audit allows institutions and individuals to address the following questions:
- What data do you have, and what do you generate?
- Where is the data stored?
- Is it in the most appropriate place?
- Is it known about by the people who should know about it?
Quite often all sorts of hidden, unregulated data comes to light.
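For file-based data, a first pass at the 'what do you have and where is it stored' questions can be automated. The Python sketch below is a minimal example under stated assumptions: the function names `inventory` and `find_duplicates` are hypothetical, the folder passed in is whatever shared drive or directory you choose to audit, and it covers only files on disk, not databases or cloud services.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def inventory(root):
    """List every file under root with its location, size and checksum."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch,
            # but large files would need to be hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            records.append(
                {"path": str(path), "bytes": path.stat().st_size, "sha256": digest}
            )
    return records


def find_duplicates(records):
    """Group paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for record in records:
        by_digest[record["sha256"]].append(record["path"])
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    records = inventory(".")  # audit the current folder as an example
    print(len(records), "files,", sum(r["bytes"] for r in records), "bytes in total")
    for group in find_duplicates(records):
        print("identical copies:", group)
```

The checksum column is what brings hidden duplicates to light: two files with different names in different folders are reported together if their contents match.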
- Who generated or is generating the data?
- Who has the data, and who is using it?
- And who could or should be using the data?
It is not uncommon for information that could be widely used across a whole institution to die in isolated silos.
- What did the data cost to produce?
- What does it cost to keep?
- What would it cost if it were lost?
- Is it worth keeping in limited storage?
Understanding the value of the data generated and stored can help in managing budgets and form a source of income - especially when it is processed, aggregated, and enhanced.
Current data regulations, especially those pertaining to GDPR and personally identifiable information relating to living individuals, mean what used to be adequate and appropriate systems and services for storing data are no longer fit for purpose.
The risks associated with data loss or exposure are often misunderstood at best, ignored at worst - and, unfortunately, they’re ever increasing as data and infrastructures become more connected.
How vulnerable is the data and how can it be protected?
Formats change and become obsolete. Digital data deteriorates. An audit can help establish where your data sits on the data curation maturity scale (my model can be seen below) and, more importantly, it allows you to formulate a roadmap to progress up the scale.
Data curation maturity scale
- Chaotic - No idea what data exists, where it is, its value, or the risks associated with this ignorance [danger zone]
- Aware - Know what and where most of the data is [danger zone]
- Organised - All data in appropriate systems: deduplicated, preserved, curated and compliant with regulations
- Empowered - Know the value of the data, how to realise that value and the risks (negative value) associated with it
- Enriched - Realising value from data
In the words of 19th-century textile designer, poet and socialist activist William Morris: ‘Have nothing in your houses that you do not know to be useful or believe to be beautiful.’ Perhaps this is a time to reassess the value of our possessions, be they physical or virtual.
To find out more, book your free place at Jisc's webinar 'What is digital preservation and should you be worried?' on 5 November 2020.
- 1 Read more about Telex on Wikipedia https://en.wikipedia.org/wiki/Telex