Data Set Repositories

AGU’s enactment of an open data policy for all papers in its journals has moved up a notch. Under the current enforcement of the policy, “available upon request from the author” is no longer allowed. The data that you use in a paper, and on which you base your advances in our understanding of the space environment, must be available to others. Remember, “data” is not just observed values but numerically generated values as well.

For many observational data sets, openness is required by the major US funding agencies, NASA and NSF. In fact, even for small grants, they now require data management plans about how the data produced by the project will be stored and made accessible to others. NOAA has a lot of its data freely available through several avenues. If you do simulations at the CCMC, then your run output is available to all at that website. That is, for many things, we can simply list a website and call it good.

The issue is for small data sets, like laboratory experiments or temporary instrument installations, and in-house simulation results. Authors using such data need to make the numbers available to readers without the reader being required to go through the author. Furthermore, the website where the data are available needs to be a permanent and independent repository, not the author’s personal site. We need others to be able to independently check our results, reproduce our plots and tables, and verify our claims.

Big institutions, like mine, are creating open repositories for their researchers. For instance, the University of Michigan has a site called Deep Blue. We are putting data bricks there from specific, published papers.

Many have asked about “public repositories” that will accept a data brick accompanying a journal article. There are several. AGU is associated with COPDESS, the Coalition on Publishing Data in the Earth and Space Sciences, which is an organization that maintains a list of scientific repositories. It is easily searchable and includes heliophysics and space physics as taxonomy groups. One database listed there for space physics is NCEI, the National Centers for Environmental Information, which has a front page for data archive submissions. AGU also recommends general ones like Zenodo, Dryad, or Figshare; each can assign a DOI for deposited data. GitHub is becoming a common place to share not only code but also code output.


The AGU Data Policy FAQ page has a lot of good information about current implementation and additional suggestions of repositories willing to host your data.

Another question that I get is, “how much to upload?” My common answer is, “As much as you can.” Seriously, though, some numerical simulations produce hundreds of GB of output, and some statistical surveys of observational data can cover several TB of values. I don’t want to quote all of the policies for every data repository, but there are some out there that will take very large data sets. The minimum set should be “those data used in the paper.” This includes the values behind any plot, table, or value in the paper.
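As one possible way to assemble that minimum set, here is a rough Python sketch that writes the values behind each figure or table to a CSV file and then records every file in a small manifest. None of this is prescribed by AGU or by any repository: the function names, the one-CSV-per-figure layout, and the MANIFEST.json file name are all assumptions made for illustration.

```python
import csv
import hashlib
import json
from pathlib import Path


def write_figure_data(out_dir, name, header, rows):
    """Write the values behind one figure or table to <out_dir>/<name>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)   # column names, ideally with units
        writer.writerows(rows)    # one row per plotted or tabulated point
    return path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(out_dir, description):
    """List every CSV in out_dir with its size and checksum.

    MANIFEST.json is a hypothetical file name chosen for this sketch,
    not a format that any repository requires.
    """
    out_dir = Path(out_dir)
    entries = [
        {"name": p.name, "bytes": p.stat().st_size, "sha256": sha256_of(p)}
        for p in sorted(out_dir.glob("*.csv"))
    ]
    manifest_path = out_dir / "MANIFEST.json"
    manifest_path.write_text(
        json.dumps({"description": description, "files": entries}, indent=2)
    )
    return manifest_path
```

For example, calling write_figure_data("data_brick", "figure1", ["time_s", "density_cm3"], rows) once per figure and then write_manifest("data_brick", "Values behind the figures and tables") would leave a folder that can be zipped and uploaded. The checksums let a reader confirm that the files they downloaded match what was deposited.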

