Biodiversity data: General context

Open science, complex landscape & reproducibility

November 2024

Olivier Norvez, Animation coordinator, @PNDB @DataTerra

Sophie Pamerlon, Data engineer, @GBIF-France

Nicolas Casajus, Data scientist, @FRB-CESAB

Table of contents


Current challenges


Data and metadata


Data types


Framework and good practices

Current challenges in biodiversity data

Open Science

Second French Plan for Open Science

France is committed to ensuring that the results of scientific research are open to all, researchers, companies and citizens, without hindrance, without delay, without payment.




  • Axis 1: Generalize open access to publications
  • Axis 2: Structure, share and open research data
  • Axis 3: Open and promote source code produced by research
  • Axis 4: Transform practices to make open science the default principle

Heterogeneity and loss of information

Heterogeneity (data types, origin, standards) & diversity of “objects” to be linked together

Loss of information over time

A complex landscape


  • Diversity of tools and practices (including historical practices)
  • Different supervisory bodies
  • Flow and storage of data and metadata
  • Difficulty for data producers in identifying the right information system in which to deposit their data, depending on scope and theme
  • Difficulty for users in knowing where and how to search for data

Figure from CNRS prospective

A complex landscape


A guidance note aims to support producers and reusers of biodiversity data and metadata in:

  • Understanding the landscape (actors, who does what, who is who, etc.)
  • Sharing (meta)data (where to deposit, what types of data, etc.)
  • Using these (meta)data through the complementarity of information systems, according to theme and/or target audience


This note is aimed primarily at researchers, managers, engineers and data technicians.


  • Version 1 is available here
  • Version 2 is planned for January 2025

Reproducibility concepts

What is reproducibility?


Reproducibility is about results that can be obtained by someone else (or you in the future) given the same data and the same code. This is a technical problem.


 This is referred to as computational reproducibility.

What is reproducibility?

Computational reproducibility frequently refers to the ability to generate equivalent analytical outcomes from the same data set using the same code and software.

[…] all raw data and metadata, code, programming scripts, and bespoke software necessary for fully replicating any analyses that lead to inferences made in a published study.
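As a minimal illustration of the definition above (same data, same code, same result), the sketch below runs a small analysis twice and compares a fingerprint of the two results. The abundance values, the function names `run_analysis` and `fingerprint`, and the choice of a bootstrap are all assumptions made for the example, not material from the course.

```python
import hashlib
import json
import random
import statistics


def run_analysis(abundances, seed=42, n_boot=200):
    """Bootstrap the mean abundance with a fixed random seed."""
    rng = random.Random(seed)  # local generator: no hidden global state
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(abundances) for _ in abundances]
        boot_means.append(statistics.mean(resample))
    return {
        "mean": round(statistics.mean(abundances), 6),
        "boot_sd": round(statistics.stdev(boot_means), 6),
    }


def fingerprint(result):
    """Hash a result so that two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical abundance counts, invented for illustration only.
counts = [12, 7, 0, 3, 25, 9, 14, 1]

first = fingerprint(run_analysis(counts))
second = fingerprint(run_analysis(counts))
print(first == second)  # identical when data, code and seed are unchanged
```

Fixing the seed only covers the random part of the analysis; software versions and the operating environment can still change results, which is why the definitions above also mention software.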

Why does it matter?

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.

Claerbout & Karrenbach (1992)


Reproducibility has the potential to serve as a minimum standard for judging scientific claims (…).

Peng (2011)

 Sharing code and data is now a prerequisite for publication in many journals.

Reproducibility spectrum


Source: Peng (2011)


Each degree of reproducibility requires additional skills and time. While some of those skills (e.g. literate programming, version control, setting up environments) pay off in the long run, they can require a high up-front investment.

Concepts

According to Wilson et al. (2017), good practices for better reproducibility can be organized into the following six topics:

  • Data management
  • Project organization
  • Tracking changes
  • Collaboration
  • Manuscript
  • Code & Software

Tools





Website available at: https://rdatatoolbox.github.io/

Data vs. metadata

What kind of data and/or metadata are you using for your research?


DATA

METADATA

Definitions

Research data are defined as factual records in the form of figures, texts, images and sounds which are used as the main sources for scientific research and which the scientific community generally recognizes as being necessary to validate research results.

Metadata, which can be simply defined as “data about data,” is a way of naming things and representing data and their relationships […] Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource.
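To make the distinction concrete, here is a small sketch, assuming a tiny invented data table and a hand-written metadata record. The keys used in `metadata` ("title", "creator", "license", "columns") are illustrative choices, not a metadata standard, and the helper `undocumented_columns` is hypothetical.

```python
# DATA: the factual records themselves (invented values).
data = [
    {"species": "Parus major", "count": 4, "date": "2024-05-02"},
    {"species": "Erithacus rubecula", "count": 2, "date": "2024-05-02"},
]

# METADATA: structured information that describes the data.
metadata = {
    "title": "Hypothetical garden bird counts",
    "creator": "Jane Doe (placeholder)",
    "license": "CC-BY-4.0",
    "columns": {
        "species": "Scientific name of the observed species",
        "count": "Number of individuals seen (unit: individuals)",
        "date": "Observation date, ISO 8601 (YYYY-MM-DD)",
    },
}


def undocumented_columns(rows, meta):
    """Return the column names used in the data but absent from the metadata."""
    used = {key for row in rows for key in row}
    return sorted(used - set(meta["columns"]))


print(undocumented_columns(data, metadata))
```

A check like this is one simple way to notice when a data file has drifted away from its description.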

Data classification(s)

Characterizing data







  Link this information with your biodiversity research projects

Imagine: you are wondering about the distribution of a given species.

  • What species? -> Observation data: presence/absence, abundance, density, biomass? But also: what are the intra-population variations (DNA, trait measurements, phylogeny)?

  • Where are they in space, and in what proportions?

  • When were they observed or sampled? Repeated measures? Exact time stamp or broad period?

  • How were they collected? Biases? Pseudoreplication? True absences?

  • Why and by whom: citizen science? Opportunistic records? Funding?
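The what / where / when / how / who questions above can be sketched as the fields of a single occurrence record. The class below is a hypothetical illustration: the class name, the checks in `problems`, and the example record are assumptions, while the keys returned by `to_darwin_core` reuse term names from the Darwin Core standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Occurrence:
    scientific_name: str                    # what
    latitude: float                         # where (decimal degrees)
    longitude: float
    event_date: date                        # when
    sampling_protocol: str                  # how
    recorded_by: str                        # who
    present: bool = True                    # False means a recorded absence
    individual_count: Optional[int] = None  # abundance, if counted

    def problems(self):
        """List basic consistency issues found in the record."""
        issues = []
        if not -90 <= self.latitude <= 90:
            issues.append("latitude out of range")
        if not -180 <= self.longitude <= 180:
            issues.append("longitude out of range")
        if self.event_date > date.today():
            issues.append("event date in the future")
        if not self.present and self.individual_count:
            issues.append("absence record with a positive count")
        return issues

    def to_darwin_core(self):
        """Map the record to Darwin Core term names."""
        return {
            "scientificName": self.scientific_name,
            "decimalLatitude": self.latitude,
            "decimalLongitude": self.longitude,
            "eventDate": self.event_date.isoformat(),
            "samplingProtocol": self.sampling_protocol,
            "recordedBy": self.recorded_by,
            "occurrenceStatus": "present" if self.present else "absent",
            "individualCount": self.individual_count,
        }


# Invented record, for illustration only.
record = Occurrence("Lynx lynx", 46.5, 6.1, date(2023, 6, 14),
                    "camera trap", "J. Doe (placeholder)")
print(record.problems())
print(record.to_darwin_core()["eventDate"])
```

Writing the questions down as explicit fields makes gaps visible early, for instance a data set that records presences but has no way to express a true absence.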

Classifying data by…


Format

  • Data Table
  • Spatial raster
  • Spatial vector
  • Databases
  • Other entities

Protocols and methods

  • Citizen science
  • Sensors
  • DNA-based techniques
  • Satellite remote sensing
  • Others (simulated data)

Thematic and/or “other classifications”

  • Temporal information (e.g. longitudinal data / long time series (LTS))
  • Spatial information (GIS, occurrences, remote sensing, etc.)
  • Textual data (csv, …)
  • Taxonomic data
  • Trait data
  • (non-exhaustive list)

Data categories: by format

According to the EML classification

  • Data Table
  • Spatial raster
  • Spatial vector
  • Databases
  • Other entities

Data categories: by type


According to Kissling et al. (2018)

  • Citizen science
  • Sensors
  • DNA-based techniques
  • Satellite remote sensing
  • Others (simulated data)

Data categories: thematic and/or “other classifications”

  • Longitudinal data

Repeated measures, such as long-term surveys, which make it possible to track changes (in abundance, biomass, etc.) over time. Typically analysed with time series methods.

  • Spatial data

Generic term that applies whenever spatial coordinates (longitude/latitude) are associated with an observation. May also refer to remote sensing images or GIS layers.


  • Textual data

Again a very generic term, ranging from spreadsheets of occurrence data to DNA sequences and text mining (systematic reviews, web scraping, etc.).
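As a small illustration of the longitudinal category described above, the sketch below summarises repeated counts by year. The survey dates and counts are invented, and `yearly_mean` is a hypothetical helper, not a function from any package mentioned in this course.

```python
from collections import defaultdict

# Invented repeated surveys of one site: (ISO date, individuals counted).
surveys = [
    ("2019-05-10", 14),
    ("2019-06-02", 11),
    ("2020-05-12", 9),
    ("2021-05-09", 6),
    ("2021-06-01", 8),
]


def yearly_mean(records):
    """Average the counts of all surveys made in the same year."""
    by_year = defaultdict(list)
    for iso_date, count in records:
        by_year[int(iso_date[:4])].append(count)
    return {year: sum(values) / len(values)
            for year, values in sorted(by_year.items())}


print(yearly_mean(surveys))
```

Grouping by year before comparing hides differences in survey effort between years, so a real analysis would also need the sampling information discussed earlier.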

Frameworks and Good practices


Data life cycle


FAIR Principles




Data flows and stocks


Data life cycle




The data life cycle is the set of stages in the management, preservation and dissemination of research data, in connection with research activities.


 For more information: PNDB

FAIR principles



A set of guiding principles for managing research data to make it Findable, Accessible, Interoperable, and Reusable by humans and machines.

This is the way!

 For more information: PNDB

FAIR principles

 For more information: PNDB and GBIF

Data flows and stocks

Sharing data and metadata from research activities requires making them available in repositories.

It is recommended to deposit first in a trusted thematic repository (e.g. InDoRES, SEANOE, …); failing that, in an institutional repository (e.g. Data SUD, CIRAD Dataverse, …); and failing that, in a generic repository such as Recherche Data Gouv. Resources exist to help choose a suitable, recommended data repository.
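The order of preference described above (thematic, then institutional, then generic) can be sketched as a simple lookup. The catalogue below is a deliberately simplified assumption built from the repositories named on this slide; the `fields` and `institution` values and the function `suggest_repository` are illustrative and do not reflect any official selection tool.

```python
# Simplified, hypothetical catalogue for illustration only.
REPOSITORIES = [
    {"name": "InDoRES", "kind": "thematic",
     "fields": {"ecology", "environment"}},
    {"name": "SEANOE", "kind": "thematic",
     "fields": {"oceanography"}},
    {"name": "Data SUD", "kind": "institutional", "institution": "IRD"},
    {"name": "CIRAD Dataverse", "kind": "institutional",
     "institution": "CIRAD"},
    {"name": "Recherche Data Gouv", "kind": "generic"},
]


def suggest_repository(field, institution=None):
    """Suggest a repository: thematic first, then institutional, then generic."""
    field = field.lower()
    for repo in REPOSITORIES:
        if repo["kind"] == "thematic" and field in repo["fields"]:
            return repo["name"]
    for repo in REPOSITORIES:
        if repo["kind"] == "institutional" and repo["institution"] == institution:
            return repo["name"]
    for repo in REPOSITORIES:
        if repo["kind"] == "generic":
            return repo["name"]
    return None


print(suggest_repository("Oceanography"))
print(suggest_repository("linguistics", institution="IRD"))
print(suggest_repository("linguistics"))
```

In practice the choice also depends on accepted data types, embargo rules and volume limits, which a lookup this simple does not capture.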

Data flows and stocks

Table 1: Examples of biodiversity and environmental data repositories, adapted from Ouvrir la science (2024)

Repository | Supported by | Type (thematic, institutional, generic) | Disciplinary fields | Accepted data (keywords) | Embargo | Persistent identifier | Volume limit
InDoRES | CNRS-Ecology, MNHN | thematic (and institutional) | Ecology, Environment, Bio-archaeology | Environmental, ecological and geographical data | yes | DOI | 2 GB per data set, with an increase to 4 or 5 GB planned
EaSy Data | Data Terra, BRGM | thematic | Earth and Environmental Sciences | Long-tail data from the earth system and environment (e.g. data from projects) | yes (2 years max.) | DOI | 5 GB per file, 100 GB per deposit; larger volumes possible on request
SEANOE | Ifremer | thematic (and institutional) | Oceanography | Georeferenced marine data | yes (2 years max.) | DOI | 100 GB
Data SUD | IRD | institutional | all fields covered by IRD agents | ??? | ??? | DOI | ???
GBIF | the international GBIF community | thematic | Life sciences, Biodiversity, Animal biology, Plant biology, Ecology, Environment, Ecosystems | Taxa, occurrence data, sampling data, all standardized according to the Darwin Core or ABCD standards | yes | DOI | no
Recherche Data Gouv | Recherche Data Gouv | generic | all fields | all | yes | DOI | ???

General context: take-home message

  • Heterogeneity (data types, origin, standards) & diversity of “objects” to be linked together
  • Loss of information over time
  • Toward better open science and reproducibility

General context: resources