Current challenges
Data and metadata
Data types
Framework and good practices
Second French Plan for Open Science1
France is committed to ensuring that the results of scientific research are open to all, researchers, companies and citizens, without hindrance, without delay, without payment.
Heterogeneity (data types, origin, standards) and diversity of “objects” to be linked together1
Loss of information over time2
Figure from CNRS prospective
The aim of this note is to support producers and reusers of biodiversity data and metadata in:
This note is aimed above all at researchers, managers, engineers and data technicians.
Reproducibility is about results that can be obtained by someone else (or you in the future) given the same data and the same code. This is a technical problem.
This is what we call computational reproducibility.
Computational reproducibility frequently refers to the ability to generate equivalent analytical outcomes from the same data set using the same code and software1.
[…] all raw data and metadata, code, programming scripts, and bespoke software necessary for fully replicating any analyses that lead to inferences made in a published study2.
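As a minimal sketch of what "same data, same code, same result" can look like in practice, the Python example below runs a toy analysis and stores, next to the result, a checksum of the raw data and a description of the software environment. The data values, the `mean_abundance` analysis and the report fields are all invented for illustration; this is one possible way to record provenance, not a prescribed method.

```python
import hashlib
import json
import platform


def sha256_of_bytes(content: bytes) -> str:
    """Return the SHA-256 hex digest identifying the exact raw data used."""
    return hashlib.sha256(content).hexdigest()


def mean_abundance(rows: list[dict]) -> float:
    """Toy analysis (hypothetical): mean of the 'count' column."""
    return sum(row["count"] for row in rows) / len(rows)


def run_analysis(raw_csv: bytes) -> dict:
    """Parse the raw data, run the analysis, and record its provenance."""
    lines = raw_csv.decode("utf-8").strip().splitlines()
    header = lines[0].split(",")
    rows = []
    for line in lines[1:]:
        values = dict(zip(header, line.split(",")))
        rows.append({"site": values["site"], "count": int(values["count"])})
    return {
        "result": mean_abundance(rows),
        "data_sha256": sha256_of_bytes(raw_csv),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


# Invented raw data standing in for a real data file.
RAW = b"site,count\nA,4\nB,6\nC,8\n"

report = run_analysis(RAW)
print(json.dumps(report, indent=2))
```

Anyone rerunning the script on the same bytes should obtain the same `result` and the same `data_sha256`; a different checksum signals that the data have changed, and a different `python_version` or `platform` points to an environment difference worth investigating.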
An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.
Claerbout & Karrenbach (1992)1
Reproducibility has the potential to serve as a minimum standard for judging scientific claims (…).
Peng (2011)2
Sharing the code and the data is now a prerequisite for publishing in many journals
Each degree of reproducibility requires additional skills and time. While some of those skills (e.g. literate programming, version control, setting up environments) pay off in the long run, they can require a high up-front investment.
According to Wilson et al. (2017)1, good practices for better reproducibility can be organized into the following six topics:
Data management
Project organization
Tracking changes
Collaboration
Manuscript
Code & Software
Website available at: https://rdatatoolbox.github.io/
What kind of data and/or metadata are you using for your research?
DATA
METADATA
“Research data are defined as factual records in the form of figures, texts, images and sounds which are used as the main sources for scientific research and which the scientific community generally recognizes as being necessary to validate research results”1.
“Metadata, which can be simply defined as ‘data about data,’ is a way of naming things and representing data and their relationships […] Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource”2.
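To make the distinction concrete, here is a small Python sketch that keeps a data table and its metadata side by side and checks that every variable in the data is described in the metadata. The dataset, the field names (`title`, `description`, `variables`) and the helper `undocumented_variables` are hypothetical; they do not follow EML or any other metadata standard.

```python
import json

# DATA: the factual records themselves (invented values).
data = [
    {"site": "A", "count": 4},
    {"site": "B", "count": 6},
]

# METADATA: structured information describing the data (hypothetical layout).
metadata = {
    "title": "Example abundance survey",
    "description": "Counts of individuals per site (invented values).",
    "variables": {
        "site": {"type": "string", "definition": "Site identifier"},
        "count": {
            "type": "integer",
            "unit": "individuals",
            "definition": "Number of individuals observed",
        },
    },
}


def undocumented_variables(records: list[dict], meta: dict) -> list[str]:
    """Return the variables used in the data but not described in the metadata."""
    documented = set(meta["variables"])
    used = set()
    for record in records:
        used.update(record)
    return sorted(used - documented)


print(json.dumps(metadata, indent=2))
print("Undocumented variables:", undocumented_variables(data, metadata))
```

A check of this kind is cheap to run before depositing a dataset and catches columns that a reuser would otherwise have to guess the meaning of.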
Link this information with your biodiversity research projects
Imagine: you are wondering about the distribution of a given species.
but also: what are the intra-population variations? DNA, trait measurements? phylogeny?
where are they spatially? in what proportions?
when were they observed or sampled? repeated measures? time stamp or global period?
how were they collected? biases? pseudoreplication? true absences?
why and who: citizen science? opportunistic? funding?
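The questions above can be turned into a quick completeness check on an occurrence record. The Python sketch below uses Darwin Core term names (`scientificName`, `decimalLatitude`, `decimalLongitude`, `eventDate`, `samplingProtocol`, `recordedBy`); the mapping from each question to these terms is an assumption made for illustration, and the record itself is invented.

```python
# Assumed mapping from each question to the Darwin Core terms that answer it.
QUESTION_TERMS = {
    "what": ["scientificName"],
    "where": ["decimalLatitude", "decimalLongitude"],
    "when": ["eventDate"],
    "how": ["samplingProtocol"],
    "who": ["recordedBy"],
}

# Invented occurrence record: the collection method and observer are missing.
record = {
    "scientificName": "Parus major",
    "decimalLatitude": 43.61,
    "decimalLongitude": 3.87,
    "eventDate": "2021-05-14",
}


def unanswered_questions(occurrence: dict) -> list[str]:
    """Return the questions for which at least one term is missing or empty."""
    missing = []
    for question, terms in QUESTION_TERMS.items():
        if any(occurrence.get(term) in (None, "") for term in terms):
            missing.append(question)
    return missing


print(unanswered_questions(record))
```

For this record the check reports that "how" and "who" are unanswered, which is exactly the information needed to judge sampling biases or true absences later on.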
Format1
Protocols and methods2
Thematic and/or “other classifications”
According to the EML classification1
According to Kissling WD et al. (2018)
Repeated measures, such as long-term surveys, which make it possible to follow changes (in abundance, biomass, etc.) over time.
Time series analyses
Generic term that applies as soon as spatial coordinates (longitude/latitude) are associated with an observation. May also refer to remote sensing images and GIS.
Again a very generic term, ranging from spreadsheets of occurrence data to DNA sequences and text mining (systematic reviews, web scraping, etc.).
Data life cycle
FAIR Principles
Flux and stocks of data
The data lifecycle is the set of stages of management, conservation and dissemination of research data, associated with research activities.
For more information: PNDB
A set of guiding principles for managing research data to make it Findable, Accessible, Interoperable, and Reusable by humans and machines1.
This is the way!
For more information: PNDB
Sharing data and metadata from research activities requires making them available in repositories.
It is recommended to prioritize deposits in trusted thematic repositories (e.g. InDoRES, SEANOE, …), or failing that in institutional repositories (e.g. Data SUD, CIRAD Dataverse, …), or failing that in generic repositories such as the Recherche Data Gouv repository. Resources exist to help choose a suitable, recommended data repository.
Repository name | Supported by | Type (thematic, institutional, generic) | Disciplinary fields | Accepted data (keywords) | Embargo | Persistent identifier | Volume limit |
---|---|---|---|---|---|---|---|
InDoRES | CNRS-Ecology, MNHN | thematic (and institutional) | Ecology, Environment, Bio-archaeology | Environmental, ecological and geographical data | yes | DOI | 2 GB per data set, planned to increase to 4 or 5 GB soon |
EaSy Data | Data Terra, BRGM | thematic | Earth and Environmental Sciences | Long-tail data from the earth system and environment (example: project issues) | yes (2 years max.) | DOI | 5 GB per file, 100 GB per deposit; larger volumes possible on request |
SEANOE | Ifremer | thematic (and institutional) | Oceanography | Georeferenced marine data | yes (2 years max.) | DOI | 100 GB |
Data SUD | IRD | institutional | all fields covered by IRD agents | ??? | ??? | DOI | ??? |
GBIF | the international GBIF community | thematic | Life sciences, Biodiversity, Animal biology, Plant biology, Ecology, Environment, Ecosystems | Taxa, occurrence data and sampling data, all standardized according to the Darwin Core or ABCD standards | yes | DOI | no |
Recherche Data Gouv | Recherche Data Gouv | generic | all fields | all | yes | DOI | ??? |
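The recommended order of preference (thematic first, then institutional, then generic) can be sketched as a small ranking function in Python. The repository types are taken from the table above, with InDoRES and SEANOE simplified to "thematic"; the function `rank_repositories` and the idea of passing in a list of candidate repositories that accept your data are assumptions made for illustration.

```python
# Order of preference stated in the recommendation.
PRIORITY = ["thematic", "institutional", "generic"]

# Repository types from the table (InDoRES and SEANOE are listed there as
# "thematic (and institutional)"; simplified to "thematic" here).
REPOSITORIES = {
    "InDoRES": "thematic",
    "EaSy Data": "thematic",
    "SEANOE": "thematic",
    "GBIF": "thematic",
    "Data SUD": "institutional",
    "Recherche Data Gouv": "generic",
}


def rank_repositories(candidates: list[str]) -> list[str]:
    """Sort the candidate repositories from most to least recommended.

    Unknown names are ignored. Candidates of the same type keep their
    input order because Python's sort is stable.
    """
    known = [name for name in candidates if name in REPOSITORIES]
    return sorted(known, key=lambda name: PRIORITY.index(REPOSITORIES[name]))


# Hypothetical case: three repositories would accept a marine dataset.
print(rank_repositories(["Recherche Data Gouv", "Data SUD", "SEANOE"]))
```

The ranking only encodes the type of repository; embargo rules, volume limits and accepted data types from the table still have to be checked by hand before depositing.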
PNDB - Data Terra