Data guidelines
Procedures for making data available in PAGES-acknowledged publications
Since 31 October 2018
Contents
1. PAGES Data Stewardship and FAIR and CARE Data Principles
2. Archive the data
2.1.1 Input data obtained from public sources
2.1.2 Input data obtained from a third party or previous publications
2.1.3 Output data generated by the study
2.1.4 Computer code and workflows
3. Cite the data
4. Include a "Data Availability" statement
5. Where should I deposit my data?
1. PAGES Data Stewardship and FAIR and CARE Data Principles
Global change research requires a high level of data integration. To advance the goal of accelerating discovery in global paleosciences, PAGES is committed to making data openly available and intelligently reusable, while curtailing the scientific loss of valuable data. PAGES’ Data Stewardship Integrative Activity aims to develop and facilitate leading practices for maximizing the long-term scientific benefit of the data generated as part of all PAGES-related activities, while satisfying PAGES’ obligation to funders. The PAGES Scientific Steering Committee (SSC) and the International Project Office (IPO) recognize that data stewardship requires effort. We appreciate the community’s foresight and dedication to these data-availability procedures. If you have any questions or suggestions about them, or if you foresee any problems applying them to research facilitated by PAGES, please email the International Project Office (pages@pages.unibe.ch).
The importance of openly available, quality-controlled data for assuring the integrity and advancement of science underlies the data policies of scientific journals and research funders, as well as these procedures. PAGES is united with other international scientific organizations in its commitment to making data publicly accessible.
In August 2018, PAGES became a Partner Member of the World Data System (WDS), an interdisciplinary body of the International Science Council (formerly the International Council for Science). As such, PAGES works with WDS and its fellow members, including NOAA-Paleoclimatology, PANGAEA and Neotoma, to enable access to quality-assured paleoenvironmental data and metadata, ensure long-term data preservation, and promote the development and use of agreed-upon data conventions.
PAGES is also an early signatory to the FAIR (findable, accessible, interoperable, and reusable) guiding principles for data stewardship, which build on the work of the Coalition for Publishing Data in Earth and Space Sciences (COPDESS). PAGES is committed to working with researchers, publishers, and repositories to translate the aspirations of open and useful data from policy into practice. Increased access to data by the community, in turn, supports the synthesis science projects that are essential to PAGES’ mission.
Publication is a crucial, high-value stage for data stewardship. The PAGES data-availability procedures described below are for authors, reviewers, and editors of new publications. They apply to all peer-reviewed articles that acknowledge PAGES, but are superseded by any stricter policies imposed by the funders, research institutions, and journals specific to the work. The procedures are based on FAIR principles, which have been endorsed by scientific organizations globally. They are adapted for paleoscience from the Author Guidelines that are now being implemented by all major publishers of Earth and Space Sciences, as motivated by the Enabling FAIR Data Project. Useful FAQs about the FAIR Author Guidelines are now available to explain the rationale behind the FAIR principles, and to address common questions about best practices for data sharing.
2. Archive the data
2.1 What to archive
According to FAIR data principles, all essential input data and results that are reported in an article must be available through a community-recognized, publicly accessible, long-term data repository. Identifying the "essential data" is not always straightforward, and can only be done within the context of the unique contribution of a study. The ideal goal is full reproducibility: all input data, output data, and code should be stored, and should allow others to replicate the published data analyses and to readily compare the outcome with future studies. Guidelines are summarized below; specific examples of open-data implementation can be found in the interactive discussions of papers comprising recent PAGES-led special issues of the journal Climate of the Past, as described here. Importantly, data that are not publicly available are not acceptable as part of publications that acknowledge PAGES.
2.1.1 Input data obtained from public sources. Data obtained from a public repository and used as input data without modification should be cited (see below) without resubmission to a repository. If the data are modified slightly from the original source, the simple process used to modify the data (e.g. truncation, conversion to anomalies) should be noted. Data obtained from a public repository and modified significantly as input data should be submitted to a repository, and data citations for both the original and the modified versions should be included (cross referenced). Data obtained from online tools used to access, process, or display a dataset must cite both the underlying data and software (online tool) including the final version used to generate it, and the date accessed.
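As an illustration of the "simple process" mentioned above, the following minimal Python sketch truncates a record and converts it to anomalies, then records the steps in a one-line note that could accompany the data citation. The record values, the reference period, and the function names `truncate` and `to_anomalies` are hypothetical and are not part of the PAGES procedures.

```python
from statistics import mean


def truncate(years, values, start, end):
    """Keep only the samples whose year lies in [start, end]."""
    kept = [(y, v) for y, v in zip(years, values) if start <= y <= end]
    return [y for y, _ in kept], [v for _, v in kept]


def to_anomalies(years, values, ref_start, ref_end):
    """Subtract the mean of the reference period from every value."""
    ref = [v for y, v in zip(years, values) if ref_start <= y <= ref_end]
    if not ref:
        raise ValueError("reference period contains no samples")
    baseline = mean(ref)
    return [v - baseline for v in values]


# Hypothetical record: years (CE) and temperatures (deg C).
years = [1900, 1950, 1960, 1970, 1980, 1990]
temps = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

years_t, temps_t = truncate(years, temps, 1950, 1990)
anoms = to_anomalies(years_t, temps_t, 1960, 1980)

# A short note of this kind is what should be reported with the citation.
note = "Truncated to 1950-1990 CE; anomalies relative to the 1960-1980 CE mean."
```

Because both steps are fully described by the note, the modified series would not need to be resubmitted to a repository under the guideline above; a more substantial modification would.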
2.1.2 Input data obtained from a third party or previous publications. PAGES products are based only on publicly available data. Therefore, PAGES working groups, and authors of articles that acknowledge PAGES, cannot accept or include use-restricted data in their research.
Following FAIR data principles, all essential underlying data and metadata that have not previously been lodged at a public repository must be made publicly available. Take the repository's processing time into account and submit the data to a trusted data repository before submitting the paper (it is possible to keep the data under embargo). Datasets must receive a persistent identifier on publication of the article. If the results of previous studies are the basis for a significant new conclusion, but are not yet publicly available, authors of PAGES-acknowledged articles should facilitate the transfer of the data from previous publications to a repository to receive and cite a persistent identifier, with credit given to the original data generator. If possible, this should be done in collaboration with the data generator early in the development of the study, with the goal of preserving the original data, rather than extracting data from a scanned image.
In this way, authors of PAGES-endorsed activities are asked to serve as stewards of other data relevant to their published work, even if used only for comparison, by rescuing the valued data, assigning credit to the data generator, attaching essential metadata, and transferring them to a public repository. Without this effort, the data may never again be discovered and used by scientists, or might be used without proper credit attributed to the data generator. This is true even for data that are currently in supplements of published papers, and are therefore not aligned with FAIR principles. For large-scale synthesis products, PAGES encourages the use of data-oriented publications as a means to include many data generators in the production of a value-added data product with shared and inclusive authorship.
2.1.3 Output data generated by the study. Original raw and processed data generated by a study, results of numerical or statistical analyses, and analyses of model output must be transferred to a repository. Ideally, the data used to plot every substantive figure should be archived, especially if they will be useful in digital form for future studies; for example, to test the sensitivity of the results to different assumptions, or to incorporate the data into a future data synthesis. For manuscripts that include various renditions of the output data, such as the outcomes of different data-processing routines or model runs, the version that is favored by the author, and the one that would most likely be reused in future research, should be top priority for archiving. Very large files may present special circumstances. Studies that feature multiple datasets as the primary outcome should transfer the entire suite of data to a data repository and, where possible, organize the data on a single landing page under one persistent identifier. In addition to the data themselves, include the essential metadata needed to maximize discoverability and facilitate the accurate reuse of the dataset. Conventions for what constitutes essential metadata for paleo data have not yet been fully developed, and vary for different data types and purposes. Data appearing for the first time in a new publication, and newly archived data that have been rescued from previous publications, will have different metadata needs.
2.1.4 Computer code and workflows. The guidelines and procedures for archiving paleoenvironmental-science-related code are less well developed than for data, and the platforms for doing so evolve rapidly. Nonetheless, authors are strongly encouraged to deposit significant code and other underlying digital assets in suitable repositories that guarantee long-term archival, that track both versions and branching (such as GitHub), and that allow the code to be cited using a data citation or a link to a persistent identifier. More information on software citation principles is available here.
2.2 When to transfer the data
Data and associated metadata should be made available to reviewers when the manuscript is submitted, so reviewers have an opportunity to evaluate the content for possible errors, completeness, and adherence to conventions. Most repositories hold data with restricted access while the associated publication is under review. If a repository does not allow restricted access during review, then the data should be made available to the reviewers directly, and to the repository with adequate time for evaluation prior to publication. Regardless of the specific procedure during the review period, data and metadata, with few exceptions, must be made publicly available concurrently with the publication of an article. We encourage journal editors to not accept a paper for publication until the data have been received and approved by a repository, and its persistent identifier cited within the manuscript (below).
2.2.1 Data embargoes. In special cases in which an agreement is made with the journal editor and the data repository, and provided it does not conflict with other applicable policies, data may be held in trust by a repository for a period of up to one year following the publication of an article. Such special circumstances might involve previously unpublished data generated by a third party, or data that are part of an early-career scientist’s body of work that is only partly interpreted in the article. A viable data citation must be available at the time of publication and the repository must have a mechanism to automatically release the data at the end of the embargo period. Ideally, the basic metadata (e.g. study site coordinates) should be publicly accessible, along with a notice informing viewers about the embargo timeline and author contact information.
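The one-year limit can be checked mechanically. The sketch below is only an illustration: it assumes that "up to one year" means the same calendar day in the following year, and the function names `embargo_end` and `embargo_is_acceptable` are hypothetical. Repositories may compute release dates differently.

```python
from datetime import date


def embargo_end(publication: date) -> date:
    """Latest release date: the same calendar day one year after publication."""
    try:
        return publication.replace(year=publication.year + 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year (assumption: use 28 Feb).
        return date(publication.year + 1, 2, 28)


def embargo_is_acceptable(publication: date, release: date) -> bool:
    """True if the planned release falls within one year after publication."""
    return publication <= release <= embargo_end(publication)


# Hypothetical article published on 1 March 2019.
published = date(2019, 3, 1)
latest_release = embargo_end(published)
ok = embargo_is_acceptable(published, date(2019, 9, 1))
too_late = embargo_is_acceptable(published, date(2020, 3, 2))
```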
2.3 Where to archive the data
Data should be archived digitally in public repositories in accordance with FAIR data principles, as described by the FAIR Author Guidelines. Archiving data through article (digital) supplements does not satisfy the FAIR principles. PAGES’ WDS Partners, including NOAA-Paleoclimatology, PANGAEA and Neotoma, have demonstrated their compliance with international standards for trusted data repositories, as have other discipline-specific repositories allied with paleosciences, including those listed in the Registry of Research Data Repositories (re3data). The use of general, non-discipline-specific repositories (e.g. FigShare or university-hosted servers) is discouraged. Instead, community-specific repositories are encouraged because they provide a high level of data curation that advances FAIR data principles. Recipients of the PAGES Data Stewardship Scholarships must submit their data to one of these trusted repositories, or to another repository of their choice that meets our guidelines, at the end of their project.
3. Cite the data
According to FAIR data practices, all essential data, whether generated by the study or input to the study, must be cited using a persistent, unique, machine-readable identifier, usually a DOI, as assigned by a data repository. These "data citations" appear in the main text alongside, and in the same way as, a bibliographic citation, and they are included in the reference section of the paper. Some journals subdivide the reference section into bibliographic citations and data citations, so the two types can be consumed separately by readers and by machines. Data citations track the provenance of a dataset and give credit to the data generator, which might be someone other than the author of an article that interprets the data. For DOIs and datasets that support versioning, include the version identifier (e.g. PAGES 2k temperature database v2.0.0). In the reference section, a data citation includes: Creators, Title, Repository, Identifier, Submission Year. More information about data citations can be found here, and an example can be found below.
Example text and corresponding data and bibliographic citations (Climate of the Past reference style):
Text body
The PAGES 2k Consortium (2017a) assembled a large global dataset of temperature-sensitive proxy records (PAGES 2k Consortium, 2017b). Among the records is the paleo-temperature reconstruction from Laguna Chepical (de Jong et al., 2016), which was described by de Jong et al. (2013).
References
de Jong, R., von Gunten, L., Maldonado, A., and Grosjean, M.: Late Holocene summer temperatures in the central Andes reconstructed from the sediments of high-elevation Laguna Chepical, Chile (32° S), Climate of the Past, 9, 1921-1932, 2013.
de Jong, R., von Gunten, L., Maldonado, A., and Grosjean, M.: Laguna Chepical summer temperature reconstruction, World Data Center for Paleoclimatology, https://www.ncdc.noaa.gov/paleo/study/20366, 2016.
PAGES 2k Consortium: A global multiproxy database for temperature reconstructions of the Common Era, Scientific Data, 4, 170088, 2017a.
PAGES 2k Consortium: PAGES 2k global 2,000 year multiproxy database, version 2.0.0, https://www.ncdc.noaa.gov/paleo/study/21171, 2017b.
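The five elements of a data citation listed in this section (Creators, Title, Repository, Identifier, Submission Year) can be assembled programmatically. The following Python sketch is one possible approach; the `DataCitation` class and the `format_citation` function are hypothetical helpers, and the output pattern simply imitates the Climate of the Past style of the example entries rather than any official specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    creators: List[str]            # e.g. "de Jong, R."
    title: str
    repository: str
    identifier: str                # persistent identifier (DOI or stable URL)
    year: int                      # submission year
    version: Optional[str] = None  # included when the dataset is versioned


def join_creators(names: List[str]) -> str:
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) <= 1:
        return "".join(names)
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + ", and " + names[-1]


def format_citation(c: DataCitation) -> str:
    """Creators: Title[, version X], Repository, Identifier, Year."""
    title = f"{c.title}, version {c.version}" if c.version else c.title
    return (f"{join_creators(c.creators)}: {title}, "
            f"{c.repository}, {c.identifier}, {c.year}.")


# Fields taken from the Laguna Chepical data citation in the example.
chepical = DataCitation(
    creators=["de Jong, R.", "von Gunten, L.", "Maldonado, A.", "Grosjean, M."],
    title="Laguna Chepical summer temperature reconstruction",
    repository="World Data Center for Paleoclimatology",
    identifier="https://www.ncdc.noaa.gov/paleo/study/20366",
    year=2016,
)
citation_text = format_citation(chepical)
```

Keeping the fields separate, rather than storing only the formatted string, makes it easier to re-render the same citation in another journal's reference style.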
4. Include a "Data Availability" statement
According to FAIR Author Guidelines, every publication must include a statement specifying how the underlying data used in, and produced by, the study can be accessed. For most journals, this information is provided as a separate section (typically titled "Data Availability") in which the data citations are gathered and reiterated from the text along with explanations about versions and notes on reuse.
We recommend that a manuscript should not be accepted into the review stage by an editor unless it contains a statement dedicated to the availability of the underlying data. "Data available upon request from the author" is not considered acceptable as part of a "Data Availability" statement.
In unusual cases where data access is restricted, authors must explain these restrictions in the Data Availability statement at the time of submission. Such restrictions might be determined by law, institution policies, funder terms, privacy, intellectual property and licensing agreements, or the ethical context of the research. If the data cannot be made fully publicly available, the reasons for the restrictions (e.g. identity disclosure of human subjects) must be specified, and the data should still be preserved in a FAIR-compliant repository, with appropriate access and controls in place.
5. Where should I deposit my data?
We recommend (and, for PAGES activities, request) that all new data be deposited with an archive of the ICSU-World Data System (WDS). For paleodata, the primary WDS archives include NOAA-Paleoclimatology, PANGAEA and Neotoma, the WDS partners named in Section 2.3.
We also recommend the use of International Geo Sample Numbers (IGSNs), where appropriate. The IGSN is a persistent identifier that allows tracking a sample through its history.