> 
> On Mar 11, 2024, at 9:01 AM, Eric Lease Morgan wrote:
> 
> To what degree is it unethical or unprofessional to deposit data sets in multiple repositories?

> A long time ago, in a galaxy far far away, the preservation of books and journals was ensured when multiple libraries included books and journals in their collections. This philosophy of preservation was well-articulated with the advent of LOCKSS when they said, "Lots of copies keep stuff safe." See: https://www.lockss.org/
> 
> Nowadays, we relegate the preservation of the scholarly record -- whether that be books, journals, or data sets -- to centralized networked services. Hmmm.
> 
> For decades I have been using the Internet to provide access to library collections and services, and one of the things this experience has taught me is, links WILL break. Thus, if I deposit my data sets in multiple Internet locations, then the probability of losing access to the data sets decreases. Yet, just as publishing an article in multiple journals is seen as unethical, would publishing data sets in multiple locations be seen in the same light? One problem with multiple deposits would be the generation of multiple DOIs, which raises the question, "Which DOI is the authoritative one?"
> 
> Put more simply, is it okay for me to deposit my data sets in my university's institutional repository as well as something like Zenodo?

Many years ago, I published an alignment of FRBR with scientific data:

https://doi.org/10.1002/meet.2008.14504503102

Although it has some issues with “Active Data” (constantly growing or otherwise being modified) and with granularity (to be honest, I don’t think FRBR ever handled collections very well), I think the question we need to ask is “Is this actually a duplicate?”

Some domain repositories will insist on the data being put into a specific format for use by their community… so although it may be the same “data”, it’s actually a different Expression or Manifestation of that data.  If the data values are still the same but the packaging is different (e.g., saved as GeoTIFF vs. NetCDF vs. FITS vs. CDF), it’s a new Manifestation.  If you had to re-grid the data to align with a different reference system, it’s a new Expression, too.

If it’s a bitwise duplicate (same exact file packaging, no additional metadata, etc), then it’s the same Manifestation of the same data, but a different Item.  So maybe it’s a duplicate… but the access is different, so it’s still useful.
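
A rough way to test the “bitwise duplicate” case is to compare checksums of the two deposited files. The following is only a minimal Python sketch: the function names (`sha256_of`, `same_manifestation`) are made up for illustration, and matching digests only show that the bytes are identical. They say nothing about whether one repository has wrapped the file in additional metadata.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so multi-GB files never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_manifestation(path_a, path_b):
    """True when the two files are byte-for-byte identical (hypothetical helper)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the repositories publish checksums alongside the files, comparing those published values would avoid downloading either copy, though which algorithm a given repository reports is something to check case by case.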

In all of these cases, I would make use of the Alternate Identifier in Zenodo to link to other copies/variants of the data.  In some cases, I might also look to see if ARKs (Archival Resource Keys) would be appropriate to declare that it’s the same digital object in multiple locations:

https://arks.org/about/
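
One possible way to record these links is with DataCite-style relation types, which Zenodo accepts in its related-identifier metadata. The Python sketch below is an assumption-laden illustration, not a prescribed mapping: the helper names are hypothetical, the choice of `isDerivedFrom` for a re-gridded copy is my own guess at the closest fit, and the dictionary layout (`identifier`, `relation`) should be checked against Zenodo’s current documentation before use.

```python
def relation_for(same_bytes, same_values):
    """Pick a DataCite-style relation type for another copy of a data set.

    same_bytes  -- the files are bitwise identical (same Manifestation, new Item)
    same_values -- the data values are unchanged and only the packaging differs
    """
    if same_bytes:
        return "isIdenticalTo"
    if same_values:
        # Same data, different packaging: a new Manifestation.
        return "isVariantFormOf"
    # Re-gridded or otherwise transformed: treated here as a new Expression.
    # Mapping this case to isDerivedFrom is an assumption, not a standard.
    return "isDerivedFrom"


def related_identifier(identifier, same_bytes, same_values):
    """Build one related-identifier entry (assumed layout, verify before use)."""
    return {
        "identifier": identifier,
        "relation": relation_for(same_bytes, same_values),
    }
```

For example, `related_identifier("ark:/99999/fk4example", True, True)` (a placeholder ARK, not a real object) would describe the other copy as identical to this one.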

To get back to the ‘submitting to multiple journals’ comparison… there are overlay journals that republish articles in their field of interest.  Yes, there is technically one journal that’s authoritative, but domain repositories are a bit special as they often provide a service by indexing the data in a specific way to make it findable and usable by their specific community.  They may also add value over time by adding/updating metadata that’s useful for their community (findable, usable, documenting use caveats, etc).  In this way, even though the ‘Data Object’ may stay the same, the ‘Information Object’ (per OAIS) is no longer the same as what was published in the other repositories.

It’s like if you had two copies of the same physics textbook, one of which was marked up by Richard Feynman.  They may have the same ISBN, but the marked up one may have additional value to a given community.

Because of that potential for extra value, I don’t fault them for creating a new DOI.  But I do believe that they need to track the alternate identifiers / locations for the data, especially when you’re dealing with multi-TB collections, so people don’t download two copies and only then realize the time & bandwidth they just wasted.

-Joe
(Currently unaffiliated)

PS.  There were a few people who argued that data isn’t a Creative Work, and therefore had no business being aligned with FRBR… but if you ever hear the stories about how scientists calibrate their instruments, you would agree that most data is a Creative Work.  (Even the raw data in some cases, when you find out what they have to do to get 20-year-old instruments to continue to produce data … or brand new instruments that took a long calibration image as part of commissioning just when a solar flare happened and damaged the detector)