Data reuse and the open data citation advantage
- PMID: 24109559
- PMCID: PMC3792178
- DOI: 10.7717/peerj.175
Abstract
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: the benefit was clearest for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed had reused a dataset by year 2, 100 by year 4, and more than 150 by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
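The abstract says reuse was identified by mentions of GEO or ArrayExpress accession numbers in full text. As a rough illustration only, the sketch below scans a string for accession-like tokens. The patterns (GEO `GSE`/`GDS` followed by digits, ArrayExpress `E-XXXX-` followed by digits), the function name `find_accessions`, and the sample sentence are assumptions for this example, not the authors' actual search procedure.

```python
import re

# Assumed accession shapes (illustrative, not the authors' query):
#   GEO series / curated datasets: GSE123, GDS456
#   ArrayExpress experiments:      E-MEXP-123
ACCESSION_RE = re.compile(r"\b(?:GSE\d+|GDS\d+|E-[A-Z]{4}-\d+)\b")


def find_accessions(full_text):
    """Return the distinct accession-like strings mentioned in a text."""
    return set(ACCESSION_RE.findall(full_text))


# Hypothetical sentence from a paper's methods section.
sample = "We reanalysed GSE2034 and E-MEXP-882; GSE2034 was normalised with RMA."
print(sorted(find_accessions(sample)))
```

A match of this kind shows only that an accession string appears in the text. Deciding whether the paper reused the data, and whether the authors were third parties, would need further checks that this sketch does not attempt.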
Keywords: Bibliometrics; Data archiving; Data repositories; Data reuse; Gene expression microarray; Incentives; Information science; Open data.
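The reported citation benefit of 9% (95% confidence interval: 5% to 13%) is a relative difference. If the regression outcome were log-transformed citation counts (an assumption here; the abstract does not state the model's scale), a coefficient beta on the data-availability indicator would correspond to a percent difference of (exp(beta) - 1) x 100. The helper names below are hypothetical, and the sketch only converts between the two scales.

```python
import math


def percent_change(beta):
    """Percent difference implied by a coefficient on a log-scale outcome."""
    return (math.exp(beta) - 1.0) * 100.0


def coefficient_for(percent):
    """Inverse of percent_change: the log-scale coefficient for a percent difference."""
    return math.log(1.0 + percent / 100.0)


# Back out the log-scale values implied by the reported 9% (5% to 13%),
# under the log-outcome assumption stated above.
for label, pct in [("estimate", 9), ("lower bound", 5), ("upper bound", 13)]:
    beta = coefficient_for(pct)
    print(f"{label}: beta = {beta:.4f}, percent = {percent_change(beta):.1f}%")
```

On this reading, the interval would be symmetric on the log scale rather than on the percent scale, although for effects this small the two are nearly indistinguishable.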