Grey research data: The new frontier

Ahead of the Grey Research Data workshop that APO will run at the eResearch Australasia Conference in Brisbane this October, we set out the state of play for research data publishing and citation in the open access context.

‘Citation needed’ by futureatlas.com on Wikimedia (CC BY 2.0)

What is ‘Grey Research Data’?

Research data management practices have matured over the last 20 years. In many research and publishing contexts, datasets are now treated just like scholarly articles. Datasets need to be published. They need to be catalogued. And they need to be cited within articles (and even by other datasets!). Research data is no longer something used just to generate graphs and other data presentations for publication and then relegated to local storage ‘just in case’. Research data now enjoys a similar status, as a publication in its own right.

Data publishing has arisen within a broader ‘open science’ context that is flourishing. Initiatives such as the Research Data Alliance and Open Science Framework advocate for and support preregistration of data collection activities, and open-notebook practices that make data openly accessible prior to publication. A key measure for research data management is the set of FAIR principles: data should be Findable, Accessible, Interoperable and Reusable, not only after publication (as a distinct object and/or presented within an article), but at any time over the lifecycle of research outputs.

Grey research data is research data published directly by organisations. We discuss special challenges for grey research data below as part of the open access context. But first, what is the payoff for publishing data?

Benefits of data publishing

Discoverability – data can be found beyond the context of articles that base analysis on the data. There are more contexts, repositories and platforms in which to find data that is independently published. There is more metadata available to match the data with information needs.

Timeliness – data is available for others to use, scrutinise or extend, well before inclusion in scholarly articles. A common criticism of scholarly research, by users in both policy and industry, is the time it takes from inception to publication of findings.

Credibility – making data available for independent scrutiny adds validity to the findings that cite the data. Policy that claims to be evidence based can do so with greater backing if the data is findable and accessible, and has already been so for some period leading up to publication and implementation.

Reliability – making datasets, and the methods and tools used to generate them, available makes research easier to reproduce. Where the whole research activity is reproducible, greater external reliability is possible.

Cite-ability – citing data brings discoverability, credibility and reproducibility into sharp focus; citing data makes data:

  • discoverable by engaging with incentives, such as adding to impact factor metrics;
  • credible by standardising name authorities for investigators, and institutions; and
  • reliable for those already engaged in traditional research outputs who wish to extend or validate the research.

The Joint Declaration of Data Citation Principles (Martone, 2014) provides further rationale for data publishing and citation, while Socha (2013) asserts that the benefits reach beyond what is usually considered to be the research context:

‘… many members of the public will wish to use some of the research data for other, even unintended, purposes, to create new applications, to educate themselves or others, and to pursue other serendipitous results’ (section 6.2.7).

Data publishing and open access

The trend towards data publishing is not uniform. Within the academy, data publishing uptake ‘has come at different rates in different fields and disciplines’ (Socha, 2013).

Publishing contexts and associated incentive systems account for some of the differences in data publishing practices. Taking only one segment of this problem, let’s consider three environments where research is published: closed access (or traditional) journals, open access journals, and direct publishing.

Traditional journals have provided incentives for authors to share and cite data and this is a significant driver behind the ‘dramatic increase’ in data sharing and citation practices (Castro et al. 2017). However, the same cannot be said for open access journals. Somewhat paradoxically, data publishing practices have fallen behind in open access journals due to ‘weak adoption’ of data publishing policies (ibid, p 82). 

The third context, direct publishing, exhibits yet another set of disciplinary and incentive mechanisms. The relatively low rate of data publishing by agencies (including public, commercial and non-government) that practise direct publishing is perhaps unsurprising, given the relative lack of standardisation in publishing practices (such as persistent identifiers for authors, institutions and articles, as well as bibliographic style guides) in those communities. This is notwithstanding significant efforts to provide repositories and services for publishing data that comes from direct publishing workflows, including Research Data Australia and the Australian Data Archive – third-party services that store, describe and identify datasets generated in (mostly) publicly funded research activities. A significant offering from these services is the handling or minting of Digital Object Identifiers (DOIs) for data, along with pre-formatted citations that include those DOIs.
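To make the idea of a pre-formatted citation concrete, here is a minimal Python sketch of how a citation string that embeds a DOI might be assembled from basic metadata. It is an illustration only, not a description of how Research Data Australia or the Australian Data Archive generate citations. The `Dataset` class, its field names, the creator/year/title/publisher/identifier ordering and all example values (including the DOI) are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Minimal descriptive metadata for a published dataset (hypothetical schema)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str


def normalise_doi(raw: str) -> str:
    """Reduce common ways of writing a DOI to the bare '10.xxxx/...' form."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def format_citation(ds: Dataset) -> str:
    """Build a citation string that ends with a resolvable DOI link."""
    creators = "; ".join(ds.creators)
    doi_url = f"https://doi.org/{normalise_doi(ds.doi)}"
    return f"{creators} ({ds.year}): {ds.title}. {ds.publisher}. {doi_url}"


# Placeholder values only; this is not a real dataset or a real DOI.
example = Dataset(
    creators=["Example A", "Sample B"],
    year=2019,
    title="Example survey dataset",
    publisher="Example Data Archive",
    doi="doi.org/10.1234/example",
)
print(format_citation(example))
```

Writing the DOI as a full `https://doi.org/` link means a reader can follow the citation straight to the dataset's landing page, which is the practical benefit of a persistent identifier.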

Workshopping data citation in grey research

Notwithstanding the data management services described above, data citation within direct publishing models remains sporadic at best. This problem has not gone unnoticed at APO – a grey literature repository largely exhibiting direct-published research. As part of the Linked Semantic Platforms project, APO has taken steps to surface relationships between research articles and the data that they are based on by creating links from article metadata to published datasets. Where possible, we link to a DOI, or some other trustworthy location. Results are mixed – hunting down datasets to link to articles presents recurring problems, such as whether the data is:

  • published, as a distinct object, at all
  • formally cited from within an article of interest
  • openly accessible (e.g. no sign-in is required)
  • available in an open format
  • persistently identified
  • persistently located
  • at a useful granularity or aggregation level
  • reasonably free of terms and conditions.
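One way to record these checks consistently for each dataset is a simple checklist structure. The Python sketch below is hypothetical: the `DatasetCheck` class and its field names are invented here to mirror the list above, and do not describe any actual APO tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetCheck:
    """One yes/no answer per question in the list above (hypothetical names)."""
    published_as_distinct_object: bool
    formally_cited_in_article: bool
    openly_accessible: bool
    open_format: bool
    persistently_identified: bool
    persistently_located: bool
    useful_granularity: bool
    free_of_restrictive_terms: bool


def unmet_criteria(check: DatasetCheck) -> list[str]:
    """Return the names of the criteria that a dataset fails, in list order."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


# Illustrative values: a dataset that is published and open,
# but has no DOI and sits behind a URL that may change.
check = DatasetCheck(
    published_as_distinct_object=True,
    formally_cited_in_article=True,
    openly_accessible=True,
    open_format=True,
    persistently_identified=False,
    persistently_located=False,
    useful_granularity=True,
    free_of_restrictive_terms=True,
)
print(unmet_criteria(check))
```

Recording the answers per dataset, rather than as free-text notes, would make it easier to count how often each problem recurs across a collection.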

These are some of the challenges that APO has encountered. To explore data citation issues further, we are running a Grey Research Data workshop at the Brisbane Convention and Exhibition Centre on Monday 21 October 2019. Workshop participants will have the opportunity to expand our ‘problem set’ by sharing their challenges finding, accessing and citing datasets. We will also work with participants to identify data citation best practices. This workshop will inform APO strategy for evolving standards and practices, especially within direct publishing workflows.

To participate in the Grey Research Data workshop, please register at eResearch Australasia 2019.

References

Castro E, Crosas M, Garnett A, Sheridan K, Altman M. ‘Evaluating and Promoting Open Data Practices in Open Access Journals’. Journal of Scholarly Publishing, Vol 49 Issue 1, October 2017, pp 66-88. DOI: doi.org/10.3138/jsp.49.1.66

Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M (ed.) San Diego CA: FORCE11; 2014. DOI: doi.org/10.25490/a97f-egyk

Socha YM. ‘Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data’. Data Science Journal, Vol 12, 2013, pp CIDCR1–CIDCR7. DOI: doi.org/10.2481/dsj.OSOM13-043