"As well as providing a foundation upon which management information can be generated and displayed to users, the structured approach to data usage that the project has engendered has a number of additional benefits…"
University of Liverpool
All BI projects are heavily reliant on the data they have access to. The eleven projects we supported used data from a wide variety of sources to build their BI models. Most of these sources were internal, but a significant proportion of the projects also used external data sources.
1. It cannot be assumed that external datasets will be immediately usable by a BI project
The University of Sheffield and Durham University projects were both heavily reliant on external data sources: Sheffield looked at open government datasets, while Durham worked with HESA to map organisational structures against academic cost centres to allow better and more detailed benchmarking. Other projects also used some elements of the HEIDI datasets provided by HESA.
A common problem with data sources such as HEIDI and data.gov.uk sites was not the data quality but the usability of the datasets and the difficulty of navigating the sites. The University of East London (UEL) and Durham both noted that before the data extracted from HEIDI could be used it had to be preprocessed into a format more suitable for their applications.
If you are planning to make use of external data sources within your BI project, it is wise to undertake some initial feasibility testing as early as possible. This will give you a better idea of the challenges you may face and of their likely impact on your overall project planning.
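As an illustration of the kind of preprocessing described above, the following is a minimal sketch of an early feasibility test on an external extract. It assumes a hypothetical wide-format CSV export with one column per academic year. The column names, institution names and figures are invented for illustration and do not reflect the actual HEIDI or data.gov.uk formats. The sketch reshapes the extract into one record per institution and year, and lists any gaps it finds rather than filling them.

```python
import csv
import io

# Hypothetical wide-format export: one row per institution, one column per year.
# All names and figures here are invented for illustration only.
WIDE_EXPORT = """institution,2010/11,2011/12
Example University A,1200,1250
Example University B,,980
"""


def wide_to_long(text, id_column="institution"):
    """Reshape one-column-per-year rows into (institution, year, value) records.

    Returns the usable records plus a list of (institution, year) gaps,
    so that missing values are reported rather than guessed.
    """
    records = []
    skipped = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, raw in row.items():
            if column == id_column:
                continue
            if raw is None or raw.strip() == "":
                skipped.append((row[id_column], column))
                continue
            records.append(
                {"institution": row[id_column], "year": column, "value": int(raw)}
            )
    return records, skipped


records, skipped = wide_to_long(WIDE_EXPORT)
print(len(records), "usable values;", len(skipped), "gaps")
```

Running a check of this kind against a sample download early in the project gives a rough measure of how much reshaping and gap handling an external source is likely to need.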
2. There may be issues with data quality and timeliness when repurposing data from internal systems
All projects that use data from internal sources are likely to experience some level of difficulty with the data they use. For these projects the difficulties included the timeliness of the data, its format and its accuracy. The Universities of Central Lancashire (UCLan) and Bolton both noted that data deemed of sufficient quality by the data owners was inadequate for the needs of a BI project.
UCLan in particular found that staff creating data across the University had very little awareness or appreciation of the wider uses to which that data was being put, or of the resulting requirement for it to be in place and correct in a timely manner. The University of East London, Bolton, UCLan and the University of Manchester all found that before they could use their data they had to clean it, reformat it, or negotiate with the data originators (who were not always the data owners) in order to increase its quality and accuracy. The complexities surrounding the agreement of data definitions and the importance of such work in a BI context are addressed in the next section.
Another issue with data was gaining access to the information needed in a timely manner. Both Manchester and UCLan had to negotiate with data controllers in order to gain access to the information needed for their project within time frames that were compatible with other data sources. In one extreme example a dataset was updated monthly while complementary datasets were being updated daily.
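A mismatch in refresh cycles of this kind can be made visible with a simple freshness check. The sketch below is illustrative only: the source names, dates and the seven-day threshold are assumptions, not details taken from any of the projects.

```python
from datetime import date, timedelta


def stale_sources(last_updated, as_of, max_age):
    """Return the names of sources whose latest refresh is older than max_age."""
    return sorted(
        name
        for name, updated in last_updated.items()
        if as_of - updated > max_age
    )


# Hypothetical refresh dates; the source names are illustrative only.
last_updated = {
    "student_records": date(2013, 3, 14),
    "finance": date(2013, 3, 14),
    "research_outputs": date(2013, 2, 28),
}

stale = stale_sources(
    last_updated,
    as_of=date(2013, 3, 15),
    max_age=timedelta(days=7),
)
print(stale)
```

A report built from these sources would combine day-old figures with figures more than two weeks old, which is the sort of incompatibility that had to be negotiated with data controllers.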
It is also worth bearing in mind that the reuse of data for purposes other than those for which it was originally created may risk straying into potentially complex legal territory. If personal data is involved (for example data relating to staff or students) it will be necessary to consider the requirements of the Data Protection Act 1998 and in particular the implications of Principle 2:
“Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes”.
In addition to direct legal issues there may also be questions of an ethical nature to be addressed in the usage of data about students (for example) in a predictive capacity, especially where decisions are then taken on the basis of such data (such as selecting particular cohorts of students for special measures who are seen from patterns of behaviour as being at a particular risk of ‘dropping out’). These are complex and emerging issues, but are definitely worthy of consideration. The Cetis paper: Legal, Risk and Ethical Aspects of Analytics in Higher Education has further information on this area.
3. Extracting data from ‘silo’ systems into a data warehouse will almost always result in benefits
What is a data warehouse?
"In computing, a data warehouse… is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources.
Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons."
Data was commonly sourced from corporate systems such as student, finance, HR and CRIS (current research information system) systems, and was consequently seen as existing in ‘silos’. Almost without exception the projects opted to consolidate the data they required into a data warehouse before using it to generate the required business intelligence. Getting data into the warehouse in a usable format was an issue for several of the projects. However, once the data was in the warehouse it was generally found to be more straightforward to manage than data extracted from the individual corporate systems in real time.
Additionally, some of the projects preprocessed the extracted data to construct some compound data items which were also stored in the data warehouse. There was general agreement that the data was best extracted and stored in the warehouse at the lowest level of aggregation or ‘as raw as possible’.
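The ‘as raw as possible’ principle can be sketched with an in-memory SQLite database. The table, column and module names below are hypothetical. Note also that this sketch derives the compound item through a view rather than storing it, which is one possible design choice; storing the derived values in the warehouse, as some of the projects did, is equally valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw table: one row per student per module, the lowest level of aggregation.
# Table and column names are hypothetical.
conn.execute(
    """
    CREATE TABLE enrolment_raw (
        student_id TEXT,
        module_code TEXT,
        academic_year TEXT,
        credits INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO enrolment_raw VALUES (?, ?, ?, ?)",
    [
        ("s1", "MOD101", "2012/13", 20),
        ("s1", "MOD102", "2012/13", 20),
        ("s2", "MOD101", "2012/13", 20),
    ],
)

# Compound item derived from the raw rows; the raw data stays untouched,
# so other aggregations can be added later without re-extracting.
conn.execute(
    """
    CREATE VIEW credits_per_student AS
    SELECT student_id, academic_year, SUM(credits) AS total_credits
    FROM enrolment_raw
    GROUP BY student_id, academic_year
    """
)

rows = conn.execute(
    "SELECT student_id, total_credits FROM credits_per_student ORDER BY student_id"
).fetchall()
print(rows)
```

Because the raw rows are retained, a different question (for example, enrolments per module) can be answered from the same table without going back to the source system.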
As the project progressed the Open University team found that they had been in part responsible for a larger organisation-wide project to develop a multipurpose data warehouse. While the team generally welcomed this development, it did mean that the specific demands of the project were subsumed into the more general information needs of the organisation. ‘Mission creep’ of this nature is not unusual in a BI project. Although it can be viewed as evidence of success, given the obvious impact the project is having on the organisation, it can also, without careful management, begin to derail the planned objectives of the original project.
For these projects, the additional overhead of extracting data from ‘silo’ systems into a data warehouse proved to be the most efficient and cost-effective solution. This does not make it a universal truth, and organisations are urged to carry out their own analysis before starting a warehousing project.