General context

Open science, Data lifecycle & FAIR principles

November 2025

Olivier Norvez, Animation coordinator, @PNDB @DataTerra @Theia

Sophie Pamerlon, Data engineer, @GBIF-France

Nicolas Casajus, Senior data scientist, @FRB-CESAB

Current challenges

Data and metadata

Data types

Framework and good practices

Current challenges in biodiversity data

Open Science

Second French Plan for Open Science¹

France is committed to ensuring that the results of scientific research are open to all, researchers, companies and citizens, without hindrance, without delay, without payment.

Axis 1: Generalize open access to publications
Axis 2: Structure, share and open research data
Axis 3: Open and promote source code produced by research
Axis 4: Transform practices to make open science the default principle

Heterogeneity and loss of information

Heterogeneity (data types, origin, standards) & diversity of “objects” to be linked together¹

Loss of information over time²

A complex landscape

Diversity of tools and practices (historical practices)
Different supervisions
Flow and storage of data and metadata
Difficulty for data producers to identify the right information systems to deposit their data according to the scopes and themes
Difficulty for users to know where and how to search for data

Figure from Garnier & Pavoine et al., 2025

A complex landscape

A note aims to support producers and reusers of biodiversity data and metadata in:

Understanding the landscape (actors, who does what, who is who, etc.)
Sharing (meta)data (where to deposit, what types of data, etc.)
Using (meta)data by drawing on the complementarity of information systems, according to themes and/or target audiences

The note is aimed above all at researchers, managers, engineers and data technicians

Version 1 is available here
Version 2 will be available soon in January 2025

Reproducibility concepts

What is reproducibility?

Reproducibility is about results that can be obtained by someone else (or you in the future) given the same data and the same code. This is a technical problem.

This is what we call computational reproducibility.

Computational reproducibility frequently refers to the ability to generate equivalent analytical outcomes from the same data set using the same code and software¹.

[…] all raw data and metadata, code, programming scripts, and bespoke software necessary for fully replicating any analyses that lead to inferences made in a published study².
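A minimal sketch of what "same data, same code, same result" can look like in practice. The analysis (a bootstrap estimate of a mean), the function names `run_analysis` and `fingerprint`, and the data are all hypothetical; the point is only that fixing the random seed and avoiding hidden global state lets two runs be compared exactly.

```python
import hashlib
import json
import random


def run_analysis(data, seed=42, n_boot=200):
    """Toy analysis (hypothetical): bootstrap estimate of the mean."""
    rng = random.Random(seed)  # local generator: no hidden global state
    boot_means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        boot_means.append(sum(sample) / len(sample))
    boot_means.sort()
    return {
        "mean": sum(data) / len(data),
        "ci_low": boot_means[int(0.025 * n_boot)],
        "ci_high": boot_means[int(0.975 * n_boot) - 1],
    }


def fingerprint(result):
    """Hash a result so that two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


abundances = [12, 7, 3, 18, 9, 11, 5, 14]  # made-up data
first_run = fingerprint(run_analysis(abundances))
second_run = fingerprint(run_analysis(abundances))
print(first_run == second_run)
```

The comparison only holds within the same software environment, which is why the definitions above mention software alongside data and code.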

Why does it matter?

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.

Claerbout & Karrenbach (1992)¹

Reproducibility has the potential to serve as a minimum standard for judging scientific claims (…).

Peng (2011)²

Sharing the code and the data is now a prerequisite for publishing in many journals

Reproducibility spectrum

Each degree of reproducibility requires additional skills and time. While some of those skills (e.g. literate programming, version control, setting up environments) pay off in the long run, they can require a high up-front investment.

Concepts

According to Wilson et al. (2017)¹, good practices for better reproducibility can be organized into the following six topics:

Data management

Project organization

Tracking changes

Collaboration

Manuscript

Code & Software

Tools

Website available at: https://rdatatoolbox.github.io/

Data vs. metadata

Exercise

What kind of data / metadata are you using for your research?

Definitions

“Research data are defined as factual records in the form of figures, texts, images and sounds which are used as the main sources for scientific research and which the scientific community generally recognizes as being necessary to validate research results”¹.

“Metadata, which can be simply defined as “data about data,” is a way of naming things and representing data and their relationships […] Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource”².
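The distinction can be illustrated with a small sketch. The records, the metadata keys (`title`, `creator`, `license`, `columns`) and the helper `undocumented_columns` are illustrative assumptions; they do not follow a formal metadata standard.

```python
import json

# Data: the factual records themselves (made-up values).
data = [
    {"site": "A", "date": "2024-05-02", "species": "Parus major", "count": 4},
    {"site": "B", "date": "2024-05-02", "species": "Parus major", "count": 0},
]

# Metadata: structured information describing the data ("data about data").
metadata = {
    "title": "Example bird counts",
    "creator": "Jane Doe",
    "license": "CC-BY-4.0",
    "columns": {
        "site": "Site code",
        "date": "Observation date (ISO 8601)",
        "species": "Scientific name",
        "count": "Number of individuals counted",
    },
}


def undocumented_columns(records, meta):
    """Return the columns present in the data but not described in the metadata."""
    used = set()
    for record in records:
        used.update(record)
    return sorted(used - set(meta["columns"]))


print(undocumented_columns(data, metadata))
print(json.dumps(metadata, indent=2))
```

Without the `columns` descriptions, a reuser could not tell whether `count` is a number of individuals, of nests or of visits.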

Defining biodiversity data

Artifacts and symbols collected, stored, and disseminated in order to document the constituent elements of life, extant or extinct, at any scale of time and space.

These elements are biological and encompass anything that contributes to the structure and function of life at any scale, from the microscopic to the level of the planet, and to the continuation of life, that is, its ability to persist and evolve over time.

Source: Garnier E et al. (2025) https://doi.org/10.1016/j.tree.2025.06.004

Data, datasets, databases and co

Legend: (a) Data and metadata make up (b) datasets. Multiple datasets in one location form a (c) database. (d) Aggregators compile data from many databases and (e) repackagers transform data in a way that makes it more accessible for all audiences (i.e. lay and professional). (f) External users (e.g. scientists, industries, government agencies etc.) access raw data by (g) going through any or all of these data sharing portals.

Source: Blair J et al. (2020) https://doi.org/10.3897/BDJ.8.e32765
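The first levels of this hierarchy can be sketched as plain data structures. The class names, fields and the `aggregate` function are hypothetical and only mirror items (a) to (d) of the legend.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """(b) A dataset: (a) data records together with their metadata."""
    metadata: dict
    records: list


@dataclass
class Database:
    """(c) A database: several datasets held in one location."""
    name: str
    datasets: list = field(default_factory=list)


def aggregate(databases):
    """(d) Aggregator: compile records from many databases, keeping provenance."""
    compiled = []
    for db in databases:
        for ds in db.datasets:
            for rec in ds.records:
                compiled.append({
                    **rec,
                    "source_database": db.name,
                    "source_dataset": ds.metadata.get("title"),
                })
    return compiled


birds = Dataset({"title": "Bird counts"}, [{"species": "Parus major", "count": 4}])
plants = Dataset({"title": "Plant plots"}, [{"species": "Quercus robur", "count": 2}])
compiled = aggregate([Database("db_one", [birds]), Database("db_two", [plants])])
print(len(compiled), compiled[0]["source_database"])
```

Keeping the source database and dataset on every compiled record is one way to avoid losing information as data move up the chain.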

Data classification(s)

Exercise

How would you classify data?

Characterizing data

Link this information with your biodiversity research projects

Imagine: you’re wondering about the distribution of a given species.

Characterizing data

What species? → observation data: presence/absence, abundance, density, biomass?

But also: what are the intra-population variations? DNA, trait measurements? Phylogeny?

Where are they spatially? In which proportions?
When were they observed or sampled? Repeated measures? Time stamp or global period?
How were they collected? Biases? Pseudoreplication? True absences?
Why and who: citizen science? Opportunistic? Funding?
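One way to make these questions concrete is to map them onto the fields of an occurrence record. The sketch below uses a few Darwin Core term names (`scientificName`, `occurrenceStatus`, `individualCount`, `decimalLatitude`, `decimalLongitude`, `eventDate`, `samplingProtocol`, `basisOfRecord`, `recordedBy`); the `QUESTIONS` mapping, the `unanswered` helper and the record values are illustrative assumptions.

```python
# Each question mapped to Darwin Core terms that can help answer it.
# The mapping itself is an illustrative assumption, not part of the standard.
QUESTIONS = {
    "what": ["scientificName", "occurrenceStatus", "individualCount"],
    "where": ["decimalLatitude", "decimalLongitude"],
    "when": ["eventDate"],
    "how": ["samplingProtocol", "basisOfRecord"],
    "who": ["recordedBy"],
}

record = {  # made-up occurrence
    "scientificName": "Lutra lutra",
    "occurrenceStatus": "present",
    "decimalLatitude": 47.21,
    "decimalLongitude": -1.55,
    "eventDate": "2023-06-14",
    "basisOfRecord": "HumanObservation",
}


def unanswered(rec):
    """List the questions for which the record carries no information at all."""
    return [
        question
        for question, terms in QUESTIONS.items()
        if not any(rec.get(term) not in (None, "") for term in terms)
    ]


print(unanswered(record))
```

Here the record says nothing about who made the observation, which is exactly the kind of gap that limits reuse later on.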

Classifying data by…

Format¹

Data Table
Spatial raster
Spatial vector
Databases
Other entities

Protocols and methods²

Citizen science
Sensors
DNA-based techniques
Satellite remote sensing
Others (simulated data)

Thematic and/or “other classifications”

Temporal information (e.g. longitudinal data / time series (LTS))
Spatial information (spatial data: GIS, occurrences, remote sensing, etc.)
Textual data (csv, …)
Taxonomic data
Trait data
(non-exhaustive list)

Data categories: by format

According to the EML classification¹

Data Table
Spatial raster
Spatial vector
Databases
Other entities

Data categories: by type

According to Kissling WD et al. (2018)

Citizen science
Sensors
DNA-based techniques
Satellite remote sensing
Others (simulated data)

Data categories: thematic and/or “other classifications”

Longitudinal data

Repeated measures, such as long-term surveys, that make it possible to follow changes (in abundance, biomass, etc.) over time.

Typically analysed with time series methods.
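As a minimal sketch of what repeated measures allow, the example below computes year-to-year relative change and a simple linear trend from yearly counts. The counts and the function names are made up; a real analysis would need to account for detection, effort and autocorrelation.

```python
# Made-up yearly abundance counts from a hypothetical long-term survey.
counts = {2019: 120, 2020: 135, 2021: 128, 2022: 150, 2023: 165}


def yearly_change(series):
    """Relative change between consecutive years, as a fraction of the earlier year."""
    years = sorted(series)
    return {
        later: (series[later] - series[earlier]) / series[earlier]
        for earlier, later in zip(years, years[1:])
    }


def linear_trend(series):
    """Ordinary least-squares slope, in individuals per year."""
    years = sorted(series)
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(series[y] for y in years) / n
    cov = sum((y - mean_x) * (series[y] - mean_y) for y in years)
    var = sum((y - mean_x) ** 2 for y in years)
    return cov / var


print(yearly_change(counts))
print(linear_trend(counts))
```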

Spatial data

Generic term that applies whenever spatial coordinates (longitude/latitude) are associated with an observation. May also refer to remote sensing images or GIS data.
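A minimal sketch of two basic checks on coordinate data, assuming coordinates in decimal degrees stored under hypothetical `lon` and `lat` keys. The bounding box is a rough, assumed rectangle around metropolitan France, used only for illustration.

```python
def has_valid_coordinates(record):
    """True if the record has a plausible longitude/latitude pair in decimal degrees."""
    lon, lat = record.get("lon"), record.get("lat")
    if not isinstance(lon, (int, float)) or not isinstance(lat, (int, float)):
        return False
    return -180 <= lon <= 180 and -90 <= lat <= 90


def in_bounding_box(record, west, south, east, north):
    """True if the record falls inside the given rectangle."""
    return west <= record["lon"] <= east and south <= record["lat"] <= north


observations = [  # made-up records
    {"id": 1, "lon": 2.35, "lat": 48.86},
    {"id": 2, "lon": 248.0, "lat": 48.86},   # impossible longitude
    {"id": 3, "lon": -73.57, "lat": 45.50},  # valid, but outside the box
    {"id": 4, "lon": None, "lat": None},     # missing coordinates
]

BBOX = (-5.5, 41.0, 10.0, 51.5)  # west, south, east, north (assumed, approximate)

valid = [o for o in observations if has_valid_coordinates(o)]
inside = [o["id"] for o in valid if in_bounding_box(o, *BBOX)]
print([o["id"] for o in valid], inside)
```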

Textual data

Again a very generic term: it ranges from spreadsheets of occurrence data to DNA sequences and text mining (systematic reviews, web scraping, etc.).

Frameworks and Good practices

Data life cycle

FAIR Principles

Flux and stocks of data

Data life cycle

The data lifecycle is the set of stages of management, conservation and dissemination of research data, associated with research activities.

For more information: PNDB

FAIR principles

A set of guiding principles for managing research data to make it Findable, Accessible, Interoperable, and Reusable by humans and machines¹.

This is the way!

For more information: PNDB
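As a rough illustration, each letter of FAIR can be turned into a simple question asked of a metadata record. The checks below are deliberately simplistic assumptions and are not the official FAIR indicators; the keys (`identifier`, `access_url`, `standard`, `license`) and the example values are hypothetical.

```python
# Illustrative checks only: these are NOT the official FAIR indicators,
# just one simple question per letter on a hypothetical metadata record.
CHECKS = {
    "Findable": lambda m: bool(m.get("identifier")),      # e.g. a DOI
    "Accessible": lambda m: bool(m.get("access_url")),    # where to retrieve it
    "Interoperable": lambda m: bool(m.get("standard")),   # e.g. "Darwin Core"
    "Reusable": lambda m: bool(m.get("license")),         # conditions of reuse
}


def fair_report(meta):
    """Return, for each principle, whether the simplistic check passes."""
    return {principle: check(meta) for principle, check in CHECKS.items()}


example = {  # placeholder values, not a real dataset
    "identifier": "doi:10.1234/example",
    "access_url": "https://example.org/dataset/1",
    "standard": "Darwin Core",
    "license": "",
}

report = fair_report(example)
print(report)
```

In this made-up record the missing license is what blocks reuse, even though the dataset can be found and retrieved.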

FAIR principles

For more information: PNDB and GBIF

Flux and stocks

Sharing data and metadata from research activities requires making them available in repositories.

It is recommended to deposit first in trusted thematic repositories (e.g. InDoRES, SEANOE, …), failing that in institutional repositories (e.g. Data SUD, CIRAD Dataverse, …), and failing that in generic repositories such as Recherche Data Gouv. Resources exist to help choose a suitable, recommended data repository.

Ouvrir la science (2024)

Norvez et al. (2024)

Recherche Data Gouv (2024)
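The recommended order of preference (thematic, then institutional, then generic) can be sketched as a small selection function. The function name and the idea of passing a pre-filtered list of candidate repositories are assumptions; deciding which repositories actually accept a given dataset is left to the depositor.

```python
# Order of preference described above: thematic, then institutional, then generic.
PRIORITY = ["thematic", "institutional", "generic"]


def choose_repository(candidates):
    """Pick the preferred repository among (name, kind) pairs that accept the dataset.

    Returns None when there is no candidate.
    """
    ranked = sorted(candidates, key=lambda c: PRIORITY.index(c[1]))
    return ranked[0][0] if ranked else None


# Candidates assumed to accept a given marine dataset (illustrative only).
candidates = [
    ("Recherche Data Gouv", "generic"),
    ("Data SUD", "institutional"),
    ("SEANOE", "thematic"),
]

print(choose_repository(candidates))
print(choose_repository(candidates[:2]))
```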

Flux and stocks

Table 1: Examples of biodiversity and environmental data repositories, adapted from Ouvrir la science (2024)
Repository name	Supported by	Type (thematic, institutional, generic)	Disciplinary fields	Accepted data (keywords)	Embargo	Persistent identifier	Volume limit
InDoRES	CNRS-Ecology, MNHN	thematic (and institutional)	Ecology, Environment, Bio-archaeology	Environmental, ecological and geographical data	yes	DOI	2 GB per dataset, with an increase to 4 or 5 GB planned
EaSy Data	Data Terra, BRGM	thematic	Earth and Environmental Sciences	Long tail data from the earth system and environment (e.g. project outputs)	yes (2 years max.)	DOI	5 GB per file, 100 GB per deposit; larger volumes possible on request
SEANOE	Ifremer	thematic (and institutional)	Oceanography	Georeferenced marine data	yes (2 years max.)	DOI	100 GB
Data SUD	IRD	institutional	all fields covered by IRD agents	???	???	DOI	???
GBIF	the international GBIF community	thematic	Life sciences, Biodiversity, Animal biology, Plant biology, Ecology, Environment, Ecosystems	Taxa, occurrence data, sampling data, all standardized according to the Darwin Core or ABCD standards	yes	DOI	no
Recherche Data Gouv	Recherche Data Gouv	generic	all fields	all	yes	DOI	???

Take home message

Heterogeneity (data types, origin, standards) & diversity of “objects” to be linked together¹
Loss of information over time²
Toward better open science and reproducibility³ ⁴

Resources

FRB-CESAB

Data Terra

GBIF

Recherche Data Gouv
