Accessing biodiversity data

Web portals, Scripting, API & Web scraping

November 2025

Nicolas Casajus

Camille Coux

Data scientists
@FRB-CESAB    

Data lifecycle

Table of contents

  • Web portals
  • Scripting
  • API
  • Exercise
  • Web scraping

  Web portals

Web portals

Any online platform (website) that supports users in accessing collections of open data. Portals can be governmental (e.g. data.gouv.fr), run by organizations or NGOs, or stem from an individual initiative.


 You access data in a web browser (by clicking)


  • Click on a ready-to-download file (e.g. GADM)
  • Fill a form to download a user-specific file (e.g. GBIF)
  • Fill a form to get data through a URL (e.g. BirdLife)
  • Sometimes, you need to register (e.g. TRY)

Case study

You’re building species distribution models for metropolitan France. You already have a list of species and their occurrences in space and time.

Now, you need to retrieve:

  • Spatial boundaries of French regions to map the species occurrences
  • Climate data (temperature and precipitation) to fit models

Access data - France regions

The GADM data portal is a good option to get the spatial boundaries of any country in the world at different administrative levels.

They have a nice description of file formats.


Let’s use the GeoPackage format (.gpkg)

Access data - France regions

Let’s list the layers available in the gadm41_FRA.gpkg file


sf::st_layers(
  here::here("data", "gadm41_FRA.gpkg")
)
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields crs_name
## 1  ADM_ADM_0 Multi Polygon        1      2   WGS 84
## 2  ADM_ADM_1 Multi Polygon       13     11   WGS 84
## 3  ADM_ADM_2 Multi Polygon       96     13   WGS 84
## 4  ADM_ADM_3 Multi Polygon      350     16   WGS 84
## 5  ADM_ADM_4 Multi Polygon     3728     14   WGS 84
## 6  ADM_ADM_5 Multi Polygon    36611     15   WGS 84

All these layers correspond to different levels of subdivisions.

Layer name   Description
ADM_ADM_0    France contours
ADM_ADM_1    Region contours
ADM_ADM_2    Department contours
ADM_ADM_3    Commune contours

Access data - France regions

Let’s import the regional subdivision


regions <- sf::st_read(
  dsn = here::here("data", "gadm41_FRA.gpkg"),
  layer = "ADM_ADM_1"
)

head(regions)
## Simple feature collection with 6 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 50.16764
## Geodetic CRS:  WGS 84
##     GID_1 GID_0 COUNTRY                  NAME_1 VARNAME_1 NL_NAME_1 TYPE_1
## 1 FRA.1_1   FRA  France    Auvergne-Rhône-Alpes        NA        NA Région
## 2 FRA.2_1   FRA  France Bourgogne-Franche-Comté        NA        NA Région
## 3 FRA.3_1   FRA  France                Bretagne        NA        NA Région
## 4 FRA.4_1   FRA  France     Centre-Val de Loire        NA        NA Région
## 5 FRA.5_1   FRA  France                   Corse   Corsica        NA Région
## 6 FRA.6_1   FRA  France               Grand Est        NA        NA Région
##   ENGTYPE_1 CC_1 HASC_1  ISO_1                           geom
## 1    Region   NA  FR.AR     NA MULTIPOLYGON (((5.415834 44...
## 2    Region   NA  FR.BF     NA MULTIPOLYGON (((5.256271 46...
## 3    Region   NA  FR.BT FR-BRE MULTIPOLYGON (((-3.248194 4...
## 4    Region   NA  FR.CN FR-CVL MULTIPOLYGON (((2.063459 46...
## 5    Region   NA  FR.CE FR-20R MULTIPOLYGON (((9.102084 41...
## 6    Region   NA  FR.AO     NA MULTIPOLYGON (((7.178251 47...

Let’s plot the regions with ggplot2


library(ggplot2)

ggplot(regions) + 
  geom_sf(aes(fill = NAME_1)) + 
  theme_bw()


Access data - Climate data


You need to know what you are looking for: resolution, time period, monthly or yearly averages, geographic coverage, etc.

Trait data - Pandora’s box

 TRY data portal for plant traits

To access the TRY database, you’ll need to:

  1. register
  2. submit a request with a short description of your project
  3. wait to receive a text file by email within a few days

Retrieving data from repositories



This usually happens when you want to retrieve data from a paper or study.



The FORCIS database (Chaabane et al. 2023) is published on Zenodo.
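
Zenodo exposes direct file URLs, so such deposits can also be retrieved by code (see the next section). A minimal sketch, assuming a hypothetical record ID and file name (check the actual record page for the real values):


# Download a file from a Zenodo record (hypothetical ID and file name) ----
download.file(
  url      = "https://zenodo.org/records/0000000/files/forcis_db.csv?download=1",
  destfile = here::here("data", "forcis_db.csv"),
  mode     = "wb"
)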

  Scripting

Download data directly from R

In the previous section, we saw how to manually download data from a web browser. However, when possible, we recommend performing this task with code (scripting).

 Reproducibility & Automation


In R, the function download.file() can be used to download a file from the Internet.


download.file(
  url = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg",
  destfile = here::here("data", "gadm41_FRA.gpkg"),
  mode = "wb"
)

Set the argument mode to "wb" (write binary), especially if you use Windows.
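
A small reproducibility tip (a sketch, not from the original slides): skip the download if the file is already present, and raise the timeout for large files.


# Download only if the file is not already present ----
gpkg_path <- here::here("data", "gadm41_FRA.gpkg")

if (!file.exists(gpkg_path)) {
  options(timeout = 600)   # allow up to 10 minutes for big downloads
  download.file(
    url      = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg",
    destfile = gpkg_path,
    mode     = "wb"
  )
}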

Where can I find the file URL?

 Just right-click on a link and copy the URL.

What about compressed files?

You can use the helper functions unzip() and gzfile() after downloading the compressed file.


download.file(
  url = "http://www.sociopatterns.org/wp-content/uploads/2015/07/Friendship-network_data_2013.csv.gz",
  destfile = here::here("data", "friends.gz"),
  mode = "wb"
)

friends <- read.table(gzfile(here::here("data", "friends.gz")))
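
For .zip archives, the base function unzip() extracts files after download. A minimal sketch, with a hypothetical archive URL:


# Download and extract a ZIP archive (hypothetical URL) ----
download.file(
  url      = "https://example.org/data/archive.zip",
  destfile = here::here("data", "archive.zip"),
  mode     = "wb"
)

unzip(
  zipfile = here::here("data", "archive.zip"),
  exdir   = here::here("data")
)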


But decompression is unnecessary for most reading functions, which can read .gz files directly


friends <- read.table(here::here("data", "friends.gz"))


Other useful packages: data.table, vroom, readr, etc.

The package geodata

Some packages have been tailored specifically to facilitate data access, using the download.file() function internally.


 This is the case of the geodata package

The package geodata

For WorldClim data


fr <- geodata::worldclim_country(
  country = "France",
  var = "tavg",
  path = tempdir()  # folder where the data is downloaded
)

# Have a look at this object:
fr

The package geodata

Use the terra package to handle and plot the raster (GeoTIFF format)

# plot the first monthly average:
terra::plot(fr$"FRA_wc2.1_30s_tavg_1")
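
To tie this back to the case study, the climate raster can be cropped to the French regions imported earlier. A sketch, assuming the regions object from the GADM section is still in memory:


# Crop and mask the climate raster to the French regions ----
fr_regions <- terra::vect(regions)

fr_tavg <- terra::crop(fr, fr_regions) |> 
  terra::mask(fr_regions)

terra::plot(fr_tavg$"FRA_wc2.1_30s_tavg_1")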

  API

Accessing data from the Web

User interface (clicking in a web browser) vs. code interface (web APIs)

(RESTful) Web API

Definition: A service accessed from a client device (mobile phone, laptop, etc.) to a Web server using the Hypertext Transfer Protocol (HTTP).

  • The client sends an HTTP request to the Web server
  • The Web server sends back a response in JSON or XML format (raw data)
  • The Web server exposes one or more endpoints (predefined request/response)


Advantages (for the user)

  • Users can create their own client
  • The client can be developed in any language
  • Users can include the service in a bigger project
  • Users access raw data



Writing code means automation and reproducibility


Limitations

  • Each API has its own specification
  • Authentication method (free or not)
    • Token
    • Login and password
  • Some restrictions
    • Number of requests per day/month
    • Number of records per request
    • Incomplete data
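
Request quotas can often be respected directly from the client. For instance, the httr2 package (introduced below) provides helpers for throttling and retrying; a minimal sketch, assuming a hypothetical endpoint:


# Respect rate limits with httr2 (hypothetical endpoint) ----
req <- httr2::request("https://api.example.org/records") |> 
  httr2::req_throttle(rate = 10 / 60) |>   # at most 10 requests per minute
  httr2::req_retry(max_tries = 3)          # retry on transient failures

resp <- httr2::req_perform(req)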

Examples of APIs

This is a non-exhaustive list


Biodiversity data

  • Global Biodiversity Information Facility (GBIF) - API
  • IUCN Red List - API
  • Fishbase - API
  • Species+/CITES Checklist - API
  • Knowledge Network for Biocomplexity (KNB) - API

Taxonomy

  • Encyclopedia of Life (EOL) - API
  • Integrated Taxonomic Information System (ITIS) - API
  • Barcode of Life Data (BOLD) - API
  • World Register of Marine Species (WoRMS) - API

Scientific literature

  • Web of Science - API
  • Scopus - API
  • CrossRef - API
  • OpenAlex - API

Others

  • Wikipedia - API
  • OpenStreetMap - API
  • Zenodo - API

How does it work?

Requesting an API follows the client-server protocol

API Client

It’s the tool you use to request the API and parse the response (data).


If you are lucky, an API client will already be available.


Non-exhaustive list of packages



 Otherwise you will have to build your own client.

Building an API client


Available at: https://httr2.r-lib.org/


# Install 'httr2' package ----
install.packages("httr2")


Table: Main functions of httr2
Function              Description
request()             Create an HTTP request
req_url_query()       Add parameters to an HTTP request
req_perform()         Send the HTTP request to an API
resp_status()         Check the HTTP response status
resp_content_type()   Check the content type of the response
resp_body_json()      Parse the response content (JSON format)
resp_body_xml()       Parse the response content (XML format)

Building an API client

Example with the OpenStreetMap Nominatim API

Retrieve coordinates from a location (city, address, building, etc.)

Nominatim API client

1. Build the HTTP request

# Nominatim API endpoint ----
endpoint <- "https://nominatim.openstreetmap.org/search"

# Prepare the HTTP request ----
http_request <- httr2::request(endpoint)
https://nominatim.openstreetmap.org/search

# Append request parameters ----
http_request <- http_request |> 
  httr2::req_url_query(city    = "Montpellier") |> 
  httr2::req_url_query(country = "France")
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France

# Append response parameters ----
http_request <- http_request |> 
  httr2::req_url_query(format = "json") |> 
  httr2::req_url_query(limit  = 1)
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France&format=json&limit=1

2. Send the HTTP request

# Send HTTP request  ----
http_response <- httr2::req_perform(http_request)
Status: 200 OK
Content-Type: application/json

3. Check response status

# Check response status ----
httr2::resp_status(http_response)
## [1] 200

4. Check response content type

# Check response content type ----
httr2::resp_content_type(http_response)
## [1] "application/json"

5. Parse response content

# Parse response content ----
content <- httr2::resp_body_json(http_response)

content
## [[1]]
## [[1]]$place_id
## [1] 77908475
## 
## [[1]]$licence
## [1] "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright"
## 
## [[1]]$osm_type
## [1] "relation"
## 
## [[1]]$osm_id
## [1] 28722
## 
## [[1]]$lat
## [1] "43.6112422"
## 
## [[1]]$lon
## [1] "3.8767337"
## 
## [[1]]$class
## [1] "boundary"
## 
## [[1]]$type
## [1] "administrative"
## 
## [[1]]$place_rank
## [1] 16
## 
## [[1]]$importance
## [1] 0.6880008
## 
## [[1]]$addresstype
## [1] "city"
## 
## [[1]]$name
## [1] "Montpellier"
## 
## [[1]]$display_name
## [1] "Montpellier, Hérault, Occitanie, France métropolitaine, France"
## 
## [[1]]$boundingbox
## [[1]]$boundingbox[[1]]
## [1] "43.5667083"
## 
## [[1]]$boundingbox[[2]]
## [1] "43.6533580"
## 
## [[1]]$boundingbox[[3]]
## [1] "3.8070597"
## 
## [[1]]$boundingbox[[4]]
## [1] "3.9413208"


# Object type ----
class(content)
## [1] "list"


# Object dimensions ----
length(content)
## [1] 1


6. Clean data

# Clean output ----
content <- content[[1]]

content <- data.frame(
  "name" = content$"name",
  "lon" = as.numeric(content$"lon"),
  "lat" = as.numeric(content$"lat")
)

content
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124
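
As an aside, resp_body_json() forwards its arguments to jsonlite, so (an assumption worth checking against the httr2 documentation) simplifyVector = TRUE can return a data.frame directly:


# Parse the JSON response directly into a data.frame ----
content_df <- httr2::resp_body_json(
  http_response,
  simplifyVector = TRUE
)

content_df[ , c("name", "lon", "lat")]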

Visualization

# Install required package ----
install.packages("maps")


# Map France boundary ----
maps::map(
  regions = "France", 
  fill = TRUE, 
  col = "black"
)

# Add retrieved coordinates ----
points(
  x = content$"lon", 
  y = content$"lat", 
  pch = 19, 
  cex = 1, 
  col = "red"
)

# Add retrieved name ----
text(
  x = content$"lon", 
  y = content$"lat", 
  labels = content$"name", 
  pos = 2, 
  col = "white", 
  family = "serif"
)

Code factorisation

Function definition

get_coords_from_location <- function(city, country) {
  
  # Nominatim API endpoint ----
  endpoint <- "https://nominatim.openstreetmap.org/search"

  # Prepare the HTTP request ----
  http_request <- httr2::request(endpoint)
  
  # Append request parameters ----
  http_request <- http_request |> 
    httr2::req_url_query(city    = city) |> 
    httr2::req_url_query(country = country) |> 
    httr2::req_url_query(format = "json") |> 
    httr2::req_url_query(limit  = 1)
  
  # Send HTTP request  ----
  http_response <- httr2::req_perform(http_request)
  
  # Check response status ----
  httr2::resp_check_status(http_response)
  
  # Parse response content ----
  content <- httr2::resp_body_json(http_response)
  
  # Clean output ----
  content <- content[[1]]
  content <- data.frame(
    "name" = content$"name",
    "lon" = as.numeric(content$"lon"),
    "lat" = as.numeric(content$"lat")
  )

  content
}

Function usage

# Retrieve coordinates ----
get_coords_from_location(city = "Montpellier", country = "France")
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124


Automation

# List of cities ----
cities <- c("Montpellier", "Paris", "Strasbourg", "Grenoble", "Bourges")

# Retrieve coordinates ----
coords <- data.frame()

for (city in cities) {
  coord <- get_coords_from_location(city = city, country = "France")
  coords <- rbind(coords, coord)
}

coords
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124
## 2       Paris 2.348391 48.85350
## 3  Strasbourg 7.750713 48.58461
## 4    Grenoble 5.735782 45.18756
## 5     Bourges 2.399125 47.08117
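
Note that Nominatim’s usage policy caps clients at about one request per second. A hedged variant of the loop above that pauses between requests and uses lapply():


# Retrieve coordinates politely (~1 request per second) ----
coords <- lapply(cities, function(city) {
  Sys.sleep(1)   # pause between requests (Nominatim usage policy)
  get_coords_from_location(city = city, country = "France")
})

coords <- do.call(rbind, coords)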

  Exercise (40 min)

Exercise - Accessing data

Part 1: Download the PanTHERIA database, a species-level database of life history, ecology, and geography of extant and recently extinct mammals, available here.

  Use the function download.file() to download the dataset


Part 2: Download GBIF occurrences for two bat species endemic to the islands of New Zealand:

  • Mystacina tuberculata (New Zealand lesser short-tailed bat)  

  • Chalinolobus tuberculatus (New Zealand long-tailed bat)  

  Use the function rgbif::occ_search() to download species occurrences
  Do not forget to export the data


Bonus A: Download New Zealand boundaries from GADM (GeoJSON Level 0).

  Use the function download.file()


Bonus B: Plot a New Zealand map with GBIF occurrences.

  Use the packages sf and ggplot2

Correction

Part 1

## Download PanTHERIA database ----

esa_url  <- "https://esapubs.org/archive/ecol/E090/184/"
filename <- "PanTHERIA_1-0_WR05_Aug2008.txt"

download.file(
  url = paste0(esa_url, filename), 
  destfile = filename,
  mode = "wb"
)

Part 2

## Species names ----

species_names <- c(
  "Mystacina tuberculata", 
  "Chalinolobus tuberculatus"
)

## Download GBIF occurrences ----

occ <- rgbif::occ_search(
  scientificName = species_names, 
  fields = "minimal",
  hasCoordinate = TRUE,
  hasGeospatialIssue = FALSE
)

## Append occurrences -----

occ <- rbind(
  occ$`Mystacina tuberculata`$"data",
  occ$`Chalinolobus tuberculatus`$"data"
)

## Export occurrences ----

write.csv(
  x = occ,
  file = "gbif_occurrences.csv",
  row.names = FALSE
)

Correction

Bonus A

## Download administrative boundaries of NZL ----

gadm_url <- "https://geodata.ucdavis.edu/gadm/gadm4.1/json/"
filename <- "gadm41_NZL_0.json"

download.file(
  url  = paste0(gadm_url, filename), 
  destfile = filename,
  mode = "wb"
)

Correction

Bonus B

## Import NZL boundaries (GeoJSON) ----

nzl <- sf::st_read("gadm41_NZL_0.json") |> 
  sf::st_transform(crs = 27200)


## Convert occurrences into sf object ----

occ_sf <- occ |> 
  sf::st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = 4326
  ) |> 
  sf::st_transform(crs = 27200)


## Map occurrences ----

library("ggplot2")

ggplot() +
  geom_sf(data = nzl) +
  geom_sf(data = occ_sf, mapping = aes(color = scientificName)) +
  theme_bw()

  Web scraping

What is web scraping?

  • A method to automatically extract data from web pages
  • Converts unstructured web content into structured data
  • Also known as screen scraping


If an API is available, you should use it: it will give you more reliable data.


What is a web page?

  • HTML: content structuring
  • CSS: formatting
  • JavaScript: dynamism & interactivity

HTML basics

A web page is described and structured by the HTML language (HyperText Markup Language)


HTML code

<!DOCTYPE html>
<html>
  
  ...
      
</html>


An HTML page always has this structure


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    ...
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <html> contains two children:

  • <head> contains page metadata
  • <body> contains page content


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <head> can contain different metadata:

  • <title> contains the page title
  • <meta> is used to specify the encoding, authors, keywords, etc.
  • <link> is used to call external resources

N.B. This section is not necessarily interesting for web scraping


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
  
    <h1 id='section-1'>Header A</h1>
    
    <p class='my-class'>A paragraph with <b>bold text</b>.</p>
    
    <p>
      A second paragraph with a 
      <a href='https://google.com'>link</a>.
    </p>
    
    <img src='images/my-img.png' width='150' height='150' />
    
  </body>
      
</html>


  • Except for some elements (<img />), all HTML tags are paired.
  • Some elements are block tags (<h1>, <p>, etc.), others are inline tags (<b>, <a>, etc.)
  • Some elements can have attributes: id, class, href, src, etc.


 Web scraping consists of detecting HTML elements by their tag, class, or (unique) id, and extracting their content or attributes (href, src, etc.).

The rvest package


# Install 'rvest' package ----
install.packages("rvest")


Table: Main functions of rvest
Function         Description
read_html()      Read and parse HTML content
html_element()   Extract HTML element(s)
html_attr()      Extract HTML attribute(s)
html_text2()     Extract the content of element(s)
html_table()     Extract a table & convert it into a data.frame
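
To see these selectors in action before scraping a real page, here is a small sketch applying rvest to the toy HTML document from the slides above:


# Parse the toy HTML page shown earlier ----
doc <- rvest::read_html('
  <html>
    <body>
      <h1 id="section-1">Header A</h1>
      <p class="my-class">A paragraph with <b>bold text</b>.</p>
      <p>A second paragraph with a <a href="https://google.com">link</a>.</p>
    </body>
  </html>')

# Select by tag ----
rvest::html_element(doc, css = "h1") |> rvest::html_text2()
## [1] "Header A"

# Select by class ----
rvest::html_element(doc, css = "p.my-class") |> rvest::html_text2()
## [1] "A paragraph with bold text."

# Extract an attribute ----
rvest::html_element(doc, css = "a") |> rvest::html_attr(name = "href")
## [1] "https://google.com"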

Scrape a table

 Example with the Wikipedia page Liste des communes de France les plus peuplées


0. Install additional packages

# 'janitor' to clean dirty data ----
install.packages("janitor")

# 'dplyr' to handle data ----
install.packages("dplyr")

1. Build the HTTP request

# Wikipedia URL ----
base_url <- "https://fr.wikipedia.org"

# Page URL ----
page_url <- paste0(base_url,
                   "/wiki/",
                   "Liste_des_communes_de_France_les_plus_peuplées")
https://fr.wikipedia.org/wiki/Liste_des_communes_de_France_les_plus_peuplées

2. Scrape the HTML page

# Scrape web page ----
content <- rvest::read_html(page_url)
{html_document}

3. Extract HTML tables

# Extract HTML tables ----
tables <- rvest::html_table(content)


# Type of output ----
class(tables)
## [1] "list"


# Element length ----
length(tables)
## [1] 7


4. Extract the right table

# Extract second table ----
datas <- tables[[2]]

5. Clean output

# Explore data ----
datas
## # A tibble: 288 × 15
##    Rang2022 CodeInsee Commune     Département  Statut Région `Population légale`
##    <chr>    <chr>     <chr>       <chr>        <chr>  <chr>  <chr>              
##  1 Rang2022 CodeInsee Commune     Département  Statut Région 2022[1]            
##  2 1        75056     Paris[d]    Paris[d]     Préfe… Île-d… 2 113 705          
##  3 2        13055     Marseille   Bouches-du-… Préfe… Prove… 877 215            
##  4 3        69123     Lyon        Métropole d… Préfe… Auver… 520 774            
##  5 4        31555     Toulouse    Haute-Garon… Préfe… Occit… 511 684            
##  6 5        06088     Nice        Alpes-Marit… Préfe… Prove… 353 701            
##  7 6        44109     Nantes      Loire-Atlan… Préfe… Pays … 325 070            
##  8 7        34172     Montpellier Hérault      Préfe… Occit… 307 101            
##  9 8        67482     Strasbourg  Bas-Rhin     Préfe… Grand… 291 709            
## 10 9        33063     Bordeaux    Gironde      Préfe… Nouve… 265 328            
## # ℹ 278 more rows
## # ℹ 8 more variables: `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>


# Select top 10 cities ----
top10 <- datas[2:11, ]

# Filter columns ----
top10 <- top10[ , c(1, 3:4, 7)]


# Clean column names ----
top10 <- janitor::clean_names(top10)
top10
## # A tibble: 10 × 4
##    rang2022 commune     departement          population_legale
##    <chr>    <chr>       <chr>                <chr>            
##  1 1        Paris[d]    Paris[d]             2 113 705        
##  2 2        Marseille   Bouches-du-Rhône     877 215          
##  3 3        Lyon        Métropole de Lyon[e] 520 774          
##  4 4        Toulouse    Haute-Garonne        511 684          
##  5 5        Nice        Alpes-Maritimes      353 701          
##  6 6        Nantes      Loire-Atlantique     325 070          
##  7 7        Montpellier Hérault              307 101          
##  8 8        Strasbourg  Bas-Rhin             291 709          
##  9 9        Bordeaux    Gironde              265 328          
## 10 10       Lille       Nord                 238 695


# Rename specific column ----
top10 <- dplyr::rename(top10,
                       pop2021 = population_legale)

colnames(top10)
## [1] "rang2022"    "commune"     "departement" "pop2021"

5. Clean output (continued)

# Explore data ----
top10
## # A tibble: 10 × 4
##    rang2022 commune     departement          pop2021  
##    <chr>    <chr>       <chr>                <chr>    
##  1 1        Paris[d]    Paris[d]             2 113 705
##  2 2        Marseille   Bouches-du-Rhône     877 215  
##  3 3        Lyon        Métropole de Lyon[e] 520 774  
##  4 4        Toulouse    Haute-Garonne        511 684  
##  5 5        Nice        Alpes-Maritimes      353 701  
##  6 6        Nantes      Loire-Atlantique     325 070  
##  7 7        Montpellier Hérault              307 101  
##  8 8        Strasbourg  Bas-Rhin             291 709  
##  9 9        Bordeaux    Gironde              265 328  
## 10 10       Lille       Nord                 238 695


# Convert 'rang2022' to integer ----
top10$"rang2022" <- as.integer(top10$"rang2022")


# Remove footnotes in 'commune' ----
top10$"commune"     <- gsub("\\[[a-z]\\]", "", top10$"commune")

# Remove footnotes in 'departement' ----
top10$"departement" <- gsub("\\[[a-z]\\]", "", top10$"departement")

top10
## # A tibble: 10 × 4
##    rang2022 commune     departement       pop2021  
##       <int> <chr>       <chr>             <chr>    
##  1        1 Paris       Paris             2 113 705
##  2        2 Marseille   Bouches-du-Rhône  877 215  
##  3        3 Lyon        Métropole de Lyon 520 774  
##  4        4 Toulouse    Haute-Garonne     511 684  
##  5        5 Nice        Alpes-Maritimes   353 701  
##  6        6 Nantes      Loire-Atlantique  325 070  
##  7        7 Montpellier Hérault           307 101  
##  8        8 Strasbourg  Bas-Rhin          291 709  
##  9        9 Bordeaux    Gironde           265 328  
## 10       10 Lille       Nord              238 695

# Convert 'pop2021' to numeric ----
top10$"pop2021" <- gsub(" ", "", top10$"pop2021")
top10$"pop2021" <- as.numeric(top10$"pop2021")

top10
## # A tibble: 10 × 4
##    rang2022 commune     departement       pop2021
##       <int> <chr>       <chr>               <dbl>
##  1        1 Paris       Paris             2113705
##  2        2 Marseille   Bouches-du-Rhône   877215
##  3        3 Lyon        Métropole de Lyon  520774
##  4        4 Toulouse    Haute-Garonne      511684
##  5        5 Nice        Alpes-Maritimes    353701
##  6        6 Nantes      Loire-Atlantique   325070
##  7        7 Montpellier Hérault            307101
##  8        8 Strasbourg  Bas-Rhin           291709
##  9        9 Bordeaux    Gironde            265328
## 10       10 Lille       Nord               238695

Scrape other elements

Detect HTML element by tag

# Extract content of h1 element ----
rvest::html_element(content, css = "h1") |> 
  rvest::html_text2()
## [1] "Liste des communes de France les plus peuplées"


Detect HTML elements by tag

# Extract content of the first h2 element ----
rvest::html_element(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"


# Extract content of all h2 elements ----
rvest::html_elements(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"                                                 
## [2] "Cadre des données"                                        
## [3] "Vue d'ensemble"                                           
## [4] "Communes de plus de 30 000 habitants"                     
## [5] "Communes ayant compté plus de 30 000 habitants avant 2025"
## [6] "Notes et références"                                      
## [7] "Voir aussi"

Detect HTML element by ID

# Extract content of the h2 element detected by its id ----
rvest::html_element(content, css = "#Cadre_des_données") |> 
  rvest::html_text2()
## [1] "Cadre des données"

Extract attribute

# Extract URL of the first image ----
image_url <- rvest::html_element(content, css = "img") |> 
  rvest::html_attr(name = "src")
image_url
## [1] "/static/images/icons/wikipedia.png"


# Build image full URL ----
image_url <- paste0(base_url, image_url)
image_url
## [1] "https://fr.wikipedia.org/static/images/icons/wikipedia.png"


# Download image ----
download.file(url      = image_url,
              destfile = "wikipedia_logo.png",
              mode     = "wb")

Finding the right selector

  • Press CTRL + U (Firefox) to display the HTML code of the page
  • Right click on a page element and click on Inspect
  • Install the SelectorGadget bookmarklet in your browser


Dynamic web pages

  • Have a look at the function session() of the package rvest (see the sketch below)
  • Have a look at the package RSelenium
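
A minimal session() sketch (hedged: the URL and selectors are illustrative, not from the original slides):


# Navigate a site statefully with a rvest session ----
s <- rvest::session("https://fr.wikipedia.org")

# A session can be parsed like a regular page ----
rvest::html_element(s, css = "h1") |> 
  rvest::html_text2()

# Follow the first link on the page (illustrative) ----
s <- rvest::session_follow_link(s, css = "a")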

Ethics and legalities

Well… it’s complicated.



 Read Chapter 24.2 of the book R for Data Science by Wickham, Cetinkaya-Rundel & Grolemund.


 Be nice on the web with the package polite
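
A minimal sketch of polite (hedged against the package documentation): bow() introduces your client to the host and checks its robots.txt, then scrape() performs a rate-limited request.


# Install 'polite' package ----
install.packages("polite")

# Introduce yourself to the host and check robots.txt ----
session <- polite::bow(
  url        = "https://fr.wikipedia.org",
  user_agent = "me@example.org"   # hypothetical contact
)

# Scrape politely (rate-limited, cached) ----
page <- polite::scrape(session)

rvest::html_element(page, css = "h1") |> 
  rvest::html_text2()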

Thanks