“Pre-prints” are sometimes neither Pre nor Print (cf. https://doi.org/10.12688/f1000research.11408.1), but they do go on to be published in journals. While researchers may have different motivations for posting a preprint, such as establishing a record of priority or seeking rapid feedback, the primary motivation appears to be timely sharing of results prior to journal publication.
Where preprints end up being published is a simple question, but until now we have not had an easy way to answer it, or to see how the answer varies across disciplines, preprint repositories, and journals. Crossref metadata provides an open and easy way to do so, with up-to-date data for the latest results.
Crossref asks preprint repositories to update their metadata once a preprint has been published by adding the article link to its record via the “is-preprint-of” relation. As the record is processed, we make the link available in both directions, while preserving the provenance of the statement in the metadata output (“asserted-by”: “subject” or “asserted-by”: “object”). This results in bidirectional assertions in the Crossref REST API, where search engines, analytics providers, indexes, etc. can get from the preprint to the article (“is-preprint-of”) as well as vice versa (“has-preprint”), making both easier to find, cite, link, and assess.
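To make the bidirectional assertions concrete, here is a minimal Python sketch of reading them from a work record. It assumes the record carries a `relation` object keyed by relation type, with each entry holding `id-type`, `id`, and `asserted-by` fields. The helper name `related_dois` and the two sample records (using the `10.5555` test prefix) are hypothetical and only illustrate the shape of the data.

```python
from __future__ import annotations

from typing import Any


def related_dois(work: dict[str, Any], relation_type: str) -> list[tuple[str, str]]:
    """Return (doi, asserted_by) pairs for one relation type on a work record.

    Assumes the record layout described above; entries that are not DOIs
    are skipped.
    """
    relations = work.get("relation", {})
    pairs = []
    for entry in relations.get(relation_type, []):
        if entry.get("id-type") == "doi":
            pairs.append((entry["id"].lower(), entry.get("asserted-by", "unknown")))
    return pairs


# Hypothetical records, not real Crossref data.
preprint = {
    "DOI": "10.5555/preprint.1",
    "relation": {
        "is-preprint-of": [
            {"id-type": "doi", "id": "10.5555/article.1", "asserted-by": "subject"}
        ]
    },
}
article = {
    "DOI": "10.5555/article.1",
    "relation": {
        "has-preprint": [
            {"id-type": "doi", "id": "10.5555/preprint.1", "asserted-by": "object"}
        ]
    },
}

print(related_dois(preprint, "is-preprint-of"))
print(related_dois(article, "has-preprint"))
```

In this sketch the preprint record carries the statement its repository deposited ("subject"), while the article record carries the mirrored statement ("object").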
Using rOpenSci’s R library for the Crossref REST API (rcrossref), we pulled all articles connected to a previous preprint (https://api.crossref.org/works?filter=relation.type:has-preprint&facet=publisher-name:*&rows=0) and then aggregated them by journal via their ISSNs (https://api.crossref.org/works?filter=relation.type:has-preprint&facet=issn:*), tallying the results in a tidy table with the journal name (e.g., PeerJ (https://api.crossref.org/journals/2167-8359)).
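The analysis itself was done in R with rcrossref. As a rough equivalent, the same facet queries could be sketched in Python with only the standard library. The function names below are hypothetical. The response layout (`message.facets.<facet>.values` as a mapping from value to count) and the ISSN URI prefix are assumptions about the API output, and the sample response is invented for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.crossref.org/works"
# Assumption: ISSN facet keys may be returned as URIs with this prefix.
ISSN_PREFIX = "http://id.crossref.org/issn/"


def build_facet_url(facet: str) -> str:
    """Build the facet query for articles that have a linked preprint."""
    params = {"filter": "relation.type:has-preprint", "facet": f"{facet}:*", "rows": 0}
    # Keep ':' and '*' unescaped so the URL matches the form quoted above.
    return f"{API_ROOT}?{urlencode(params, safe=':*')}"


def top_facet_values(response: dict, facet: str, n: int = 20) -> list:
    """Return the n largest (value, count) pairs from a facet response."""
    values = response["message"]["facets"][facet]["values"]
    cleaned = {key.removeprefix(ISSN_PREFIX): count for key, count in values.items()}
    return sorted(cleaned.items(), key=lambda kv: kv[1], reverse=True)[:n]


def fetch(url: str) -> dict:
    """Fetch and decode a JSON response (requires network access)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


# Hypothetical response, used here instead of a live request.
sample = {
    "message": {
        "facets": {
            "issn": {
                "value-count": 2,
                "values": {
                    "http://id.crossref.org/issn/1111-1111": 12,
                    "2222-2222": 30,
                },
            }
        }
    }
}

print(build_facet_url("issn"))
print(top_facet_values(sample, "issn"))
```

Against the live API, `top_facet_values(fetch(build_facet_url("issn")), "issn")` would give the per-ISSN tallies, which can then be matched to journal names through the `/journals/{issn}` route.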
So without further delay, let’s look at the results for the 20 journals with the highest number of preprints associated with their articles (data from August 21, 2018):
| Publisher | Journal | Articles with a linked preprint |
|---|---|---|
| Springer Nature | Scientific Reports | 394 |
| Proceedings of the National Academy of Sciences | PNAS | 205 |
| PLOS | PLOS Computational Biology | 196 |
| Springer Nature | Nature Communications | 187 |
| The Genetics Society of America | Genetics | 168 |
| Oxford University Press | Nucleic Acids Research | 148 |
| Oxford University Press | Bioinformatics | 138 |
| The Genetics Society of America | Genetics | 120 |
| The Genetics Society of America | G3: Genes, Genomes, Genetics | 104 |
| Cold Spring Harbor Laboratory | Genome Research | 104 |
| Oxford University Press | Molecular Biology and Evolution | 100 |
| Springer Nature | BMC Genomics | 92 |
| MDPI AG | International Journal of Molecular Sciences | 86 |
| JMIR Publications | Journal of Medical Internet Research | 83 |
This list has not been normalized or weighted based on the size of the journal. The following observations are informed speculations, as we can only infer so much from the raw data:
One major consideration in these results concerns what’s missing in the data. The gaps fall into two camps: incomplete member data and incomplete membership coverage.
We have been working with our members to deposit preprints using the proper content type, and to provide links to published articles in their metadata. However, not all have yet done so (e.g., SSRN), which leaves holes in our research nexus graph and detracts from the completeness of the data.
We celebrate the preprint repositories that update their metadata, as required, when an article is published from a preprint, thereby populating the map with critical bridges between preprints and articles. Crossref participation benefits not only the content owner, but also the membership at large and all the systems across the research ecosystem powered by Crossref metadata.
Lastly, this data depends on the coverage of the preprint repositories that register content with us. We are thrilled that the Center for Open Science, our newest preprints addition, which represents 21 community repositories, has recently filled in swaths of the map. But there remain dead zones in the research graph from repositories that are not Crossref members (e.g., arXiv). Their disciplines, as a result, are underrepresented in these results.
As to the question of “where do preprints get published?”, anyone can in fact answer it using the metadata that Crossref, as an open infrastructure provider, collects and provides to the community. We encourage the community to explore and analyze the data further alongside other available datasets to glean more insights into how scholarly communication is changing with the growth of preprints. For example, the results across all journals represented can be weighted by the number of articles each journal publishes.
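One way such a weighting might look, as a minimal sketch: divide each journal's count of preprint-linked articles by its total article count. The function name `preprint_share`, the journal names, and all numbers below are hypothetical placeholders rather than results from this analysis. Real totals would need to come from a separate source, such as per-journal work counts.

```python
def preprint_share(linked: dict, totals: dict) -> list:
    """Rank journals by the share of their articles that have a linked preprint.

    Journals with no known (or zero) total are skipped rather than guessed.
    """
    shares = {}
    for journal, n_linked in linked.items():
        total = totals.get(journal, 0)
        if total > 0:
            shares[journal] = n_linked / total
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical counts, for illustration only.
linked = {"Journal A": 400, "Journal B": 100, "Journal C": 50}
totals = {"Journal A": 20000, "Journal B": 1000}

for journal, share in preprint_share(linked, totals):
    print(f"{journal}: {share:.1%}")
```

In this made-up example, the journal with fewer linked articles ranks first once size is taken into account, which is the kind of shift a raw tally cannot show.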
Crossref data is open for all to examine and reuse through our REST API. Please dive in and share your findings with us!