doi: 10.7717/peerj.175. eCollection 2013.

Data reuse and the open data citation advantage

Heather A Piwowar et al. PeerJ.

Abstract

Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
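
The abstract says reuse was identified through mentions of GEO or ArrayExpress accession numbers in full text. The following is a minimal sketch of that kind of text mining, not the authors' actual pipeline. The accession patterns (GEO identifiers such as "GSE1234", ArrayExpress identifiers such as "E-MEXP-567") and the function names `find_accessions` and `count_reuse_mentions` are assumptions made for illustration.

```python
import re

# Assumed accession formats (not specified in the abstract):
#   GEO:          GSE / GDS / GSM / GPL followed by digits, e.g. "GSE1234"
#   ArrayExpress: "E-" + four uppercase letters + "-" + digits, e.g. "E-MEXP-567"
ACCESSION_RE = re.compile(r"\b(?:G(?:SE|DS|SM|PL)\d+|E-[A-Z]{4}-\d+)\b")


def find_accessions(full_text: str) -> set[str]:
    """Return the distinct accession-like strings mentioned in one paper."""
    return set(ACCESSION_RE.findall(full_text))


def count_reuse_mentions(papers: dict[str, str]) -> dict[str, int]:
    """Map each accession to the number of papers that mention it.

    `papers` maps a hypothetical paper identifier to its full text.
    Each paper counts at most once per accession.
    """
    counts: dict[str, int] = {}
    for text in papers.values():
        for accession in find_accessions(text):
            counts[accession] = counts.get(accession, 0) + 1
    return counts
```

A real analysis would also need to separate reuse by the original data collectors from third-party reuse, which a pattern match alone cannot do.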

Keywords: Bibliometrics; Data archiving; Data repositories; Data reuse; Gene expression microarray; Incentives; Information science; Open data.

Figures

Figure 1. Citation density for papers with and without publicly available microarray data, by year of study publication.
Figure 2. Increased citation count for studies with publicly available data, by year of publication.
Estimates are from the multivariate analysis; lines indicate 95% confidence intervals.
Figure 3. Cumulative number of datasets deposited in GEO each year, and cumulative number of third-party reuse papers published that directly attribute GEO data published each year, log scale.
Figure 4. Number of papers mentioning GEO accession numbers.
Each panel represents reuse of a particular year of dataset submissions, with number of mentions on the y axis, years since the initial publication on the x axis, and a line for reuses by the data collection team and a line for third-party investigators.
Figure 5. Cumulative number of third-party reuse papers, by date of reuse paper publication.
Separate lines are displayed for different dataset submission years.
Figure 6. Scatterplot of year of publication of third-party reuse paper (with jitter) vs number of GEO datasets mentioned in the paper (log scale).
The line connects the mean number of datasets attributed in reuse papers vs publication year.
Figure 7. Proportion of data reused by third-party papers vs year of data submission.
These estimates are a lower bound: they only considered reuse by papers in PubMed Central, and only when reuse was attributed through direct mention of a GEO accession number.
Figure 8. Proportion of data submissions that contributed to data reuse papers, by year of reuse paper publication and dataset submission.
Each panel includes a cohort of data reuse papers published in a given year. The lines indicate the proportion of datasets that were mentioned, in aggregate, by the data reuse papers, by the year of dataset publication. The proportion is relative to the total number of datasets submitted in a given year.
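
The proportions described for Figures 7 and 8 are relative to the number of datasets submitted in a given year. A minimal sketch of that calculation follows, assuming a hypothetical mapping from dataset identifier to submission year and a set of identifiers that were mentioned by at least one third-party paper. The function name `proportion_reused` and the input shapes are illustrative and not taken from the paper.

```python
from collections import defaultdict


def proportion_reused(submission_year: dict[str, int],
                      reused: set[str]) -> dict[int, float]:
    """Fraction of datasets per submission year that were reused at least once.

    submission_year: hypothetical mapping of dataset id -> year of submission.
    reused: ids of datasets mentioned by at least one third-party paper.
    The result is a lower bound whenever `reused` is incomplete.
    """
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for dataset_id, year in submission_year.items():
        totals[year] += 1
        if dataset_id in reused:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Because reuse that goes undetected only shrinks the numerator, incomplete detection makes each proportion an underestimate, which matches the lower-bound caveat in the Figure 7 caption.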

Grants and funding

This study was funded by DataONE (OCI-0830944), Dryad (DBI-0743720), and a Discovery grant to Michael Whitlock from the Natural Sciences and Engineering Research Council of Canada. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
