Accessing biodiversity data

Web portals, scripting, API & Web scraping

November 2024

Nicolas Casajus

Camille Coux

Data scientists
@FRB-CESAB    

Table of contents



Web portals

Scripting

API



Exercise

Web scraping

  Web portals

Imagine: you’re doing some species distribution models.

You have a list of species and their occurrences in space and time.

You need:

  • spatial data of France, to map the occurrences

  • trait information for each species, for 3 specific traits

Some portals are more straightforward than others…

Example: GADM maps (https://gadm.org/)

They have a nice description of file formats too.

Open in R:

sf::st_layers(here::here("data/gadm41_FRA.gpkg"))
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields crs_name
## 1  ADM_ADM_0 Multi Polygon        1      2   WGS 84
## 2  ADM_ADM_1 Multi Polygon       13     11   WGS 84
## 3  ADM_ADM_2 Multi Polygon       96     13   WGS 84
## 4  ADM_ADM_3 Multi Polygon      350     16   WGS 84
## 5  ADM_ADM_4 Multi Polygon     3728     14   WGS 84
## 6  ADM_ADM_5 Multi Polygon    36611     15   WGS 84

Open in R:

library(ggplot2)
fr <- sf::read_sf(here::here("data/gadm41_FRA.gpkg"), layer = "ADM_ADM_1")
head(fr)
## Simple feature collection with 6 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 50.16764
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 12
##   GID_1   GID_0 COUNTRY NAME_1 VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 CC_1  HASC_1
##   <chr>   <chr> <chr>   <chr>  <chr>     <chr>     <chr>  <chr>     <chr> <chr> 
## 1 FRA.1_1 FRA   France  Auver… NA        NA        Région Region    NA    FR.AR 
## 2 FRA.2_1 FRA   France  Bourg… NA        NA        Région Region    NA    FR.BF 
## 3 FRA.3_1 FRA   France  Breta… NA        NA        Région Region    NA    FR.BT 
## 4 FRA.4_1 FRA   France  Centr… NA        NA        Région Region    NA    FR.CN 
## 5 FRA.5_1 FRA   France  Corse  Corsica   NA        Région Region    NA    FR.CE 
## 6 FRA.6_1 FRA   France  Grand… NA        NA        Région Region    NA    FR.AO 
## # ℹ 2 more variables: ISO_1 <chr>, geom <MULTIPOLYGON [°]>

Open in R:

ggplot(fr) + geom_sf(aes(fill = NAME_1)) + theme_bw() 

Imagine: you’re doing some species distribution models.

Now you have:

  • spatial data of France, to map the occurrences          

But you still need:

  • trait information for each species, for 3 specific traits


Before anything else:

Imagine precisely what kind of data you need. Draw the table you want to get.

TRY: a database for plant traits

TRY data portal: https://www.try-db.org/TryWeb/dp.php

1. Understand how the data is structured

Check out the data explorer section.

What are the traits like? How are they measured? If there are several measures of the same trait, which one will you choose?

Explore the trait table.

Let’s try “leaf area”: 88 traits contain “leaf area” in their description!

Notice the Trait ID column: this is what you’ll need to query the trait(s) you select.

Trait ID examples: LAI = 3116, Flower size = 3568, Photosynthesis: intercellular CO2 concentration = 49

Other table fields:

  • ObsNum: Number of Observations

  • ObsGRNum: Number of geo-referenced Observations

  • AccSpecNum: Number of Accepted Species

2. Look at the species list

Get species IDs:

  • Arabidopsis thaliana : 4341

  • Bellis perennis : 7173

  • Quercus ilex : 45402



Get data

Register, then go to “request data”. Now you can fill in the trait IDs.

Submit the request. Write a short description of your project. You’ll receive a text file by email within a few days.

Requesting data can take several days.

You may encounter encoding issues…

# A first naive import fails because of the encoding ----
t <- read.csv(here::here("data/36266.txt"), sep = "\t")
## Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, : invalid multibyte string at '<dc>lo'

# Guess the file encoding ----
readr::guess_encoding(here::here("data/36266.txt"))
## # A tibble: 2 × 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.4 
## 2 ISO-8859-2       0.21

# Re-import, declaring the detected encoding ----
t <- utils::read.csv2(file = file(here::here("data/36266.txt"), encoding = 'ISO-8859-1'), sep = "\t")
colnames(t)
##  [1] "LastName"              "FirstName"             "DatasetID"            
##  [4] "Dataset"               "SpeciesName"           "AccSpeciesID"         
##  [7] "AccSpeciesName"        "ObservationID"         "ObsDataID"            
## [10] "TraitID"               "TraitName"             "DataID"               
## [13] "DataName"              "OriglName"             "OrigValueStr"         
## [16] "OrigUnitStr"           "ValueKindName"         "OrigUncertaintyStr"   
## [19] "UncertaintyName"       "Replicates"            "StdValue"             
## [22] "UnitName"              "RelUncertaintyPercent" "OrigObsDataID"        
## [25] "ErrorRisk"             "Reference"             "Comment"              
## [28] "StdValueStr"           "X"

Alternative import via R:

The quite handy R package rtry solves the encoding problem:

t2 <- rtry::rtry_import(here::here("data/36266.txt"))
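A quick sanity check on the imported table (a minimal sketch; the column names come from the request shown above, and trait ID 3116 is the LAI example mentioned earlier):

# Which traits were returned? ----
table(t2$"TraitName")

# Standardized values for one trait (LAI, TraitID 3116) ----
summary(t2$"StdValue"[t2$"TraitID" == 3116])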

And this takes us to the next section…

  Scripting

Queries from R

R lets you download the data directly if you have the URL, using the download.file() function:

download.file("https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg", destfile = here::here("data", "fr_2.gpkg"))


This sometimes throws strange errors on Windows; simply add mode = "wb":

download.file("https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg", destfile = here::here("data", "fr_2.gpkg"), mode = "wb")

What if the file is compressed?

You can use helper functions such as unzip(), gzfile(), etc., but this is unnecessary for most R reading functions:

download.file("http://www.sociopatterns.org/wp-content/uploads/2015/07/Friendship-network_data_2013.csv.gz", destfile = here::here("data", "friends.gz"))

# all of these work fine: 
friends <- read.table(gzfile( here::here("data", "friends.gz")))
friends <- read.csv2(gzfile( here::here("data", "friends.gz")), sep = " ")
friends <- read.table(here::here("data", "friends.gz"))


Other useful packages: data.table, vroom, …
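For instance, vroom reads compressed files transparently (a minimal sketch; the space-delimited layout is the one used in the example above):

# 'vroom' decompresses .gz files on the fly ----
friends <- vroom::vroom(here::here("data", "friends.gz"), 
                        delim = " ", col_names = FALSE)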

  API

Accessing data from the Web





Two ways to access data on the Web: through a user interface (the web portal) or through a code interface (the API).

(RESTful) Web API

Definition: A service accessed from a client device (mobile phones, laptops, etc.) to a Web server using the Hypertext Transfer Protocol (HTTP).

  • The client sends an HTTP request to the Web server
  • The Web server sends back a response in JSON or XML format (raw data)
  • The Web server exposes one or more endpoints (predefined request/response)
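As a one-line illustration of this request/response cycle (a sketch using the public GBIF species-match endpoint; jsonlite parses the JSON response into an R object):

# Send an HTTP request and parse the JSON response ----
gbif_match <- jsonlite::fromJSON("https://api.gbif.org/v1/species/match?name=Quercus%20ilex")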


Advantages (for the user)

  • Users can create their own client
  • The client can be developed in any language
  • Users can include the service in a bigger project
  • Users access the raw data



Writing code means automation and reproducibility


Limitations

  • Each API has its own specification
  • Authentication method (free or not)
    • Token
    • Login and password
  • Some restrictions
    • Number of requests per day/month
    • Number of records per request
    • Incomplete data
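To stay within such request limits, requests can be throttled, e.g. with the httr2 package introduced below (a minimal sketch; the endpoint URL is hypothetical):

# Limit the request rate to 1 request per second ----
req <- httr2::request("https://api.example.org") |> 
  httr2::req_throttle(rate = 1)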

Examples of APIs

This is a non-exhaustive list


Biodiversity data

  • Global Biodiversity Information Facility (GBIF) - API
  • IUCN Red List - API
  • Fishbase - API
  • Species+/CITES Checklist - API
  • Knowledge Network for Biocomplexity (KNB) - API

Taxonomy

  • Encyclopedia of Life (EOL) - API
  • Integrated Taxonomic Information System (ITIS) - API
  • Barcode of Life Data (BOLD) - API
  • World Register of Marine Species (WoRMS) - API

Scientific literature

  • Web of Science - API
  • Scopus - API
  • CrossRef - API
  • OpenAlex - API

Others

  • Wikipedia - API
  • OpenStreetMap - API
  • Zenodo - API

How does it work?

Requesting an API is based on the client-server protocol

API Client

It’s the tool you use to request the API and parse the response (data).


If you are lucky, an API client will already be available.


Non-exhaustive list of packages



 Otherwise you will have to build your own client.

Building an API client


We will use the httr2 package, available at: https://httr2.r-lib.org/


# Install 'httr2' package ----
install.packages("httr2")


Table: Main functions of httr2

  Function              Description
  request()             Create an HTTP request
  req_url_query()       Add parameters to an HTTP request
  req_perform()         Send the HTTP request to an API
  resp_status()         Check the HTTP response status
  resp_content_type()   Check the content type of the response
  resp_body_json()      Parse the response content (JSON format)
  resp_body_xml()       Parse the response content (XML format)

Building an API client

Example with the OpenStreetMap Nominatim API

Retrieve coordinates from a location (city, address, building, etc.)

Nominatim API client

1. Build the HTTP request

# Nominatim API endpoint ----
endpoint <- "https://nominatim.openstreetmap.org/search"

# Prepare the HTTP request ----
http_request <- httr2::request(endpoint)

http_request
https://nominatim.openstreetmap.org/search


# Append request parameters ----
http_request <- http_request |> 
  httr2::req_url_query(city    = "Montpellier") |> 
  httr2::req_url_query(country = "France")

http_request
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France


# Append response parameters ----
http_request <- http_request |> 
  httr2::req_url_query(format = "json") |> 
  httr2::req_url_query(limit  = 1)

http_request
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France&
format=json&limit=1

2. Send the HTTP request

# Send HTTP request  ----
http_response <- httr2::req_perform(http_request)

http_response
Status: 200 OK
Content-Type: application/json


3. Check response status

# Check response status ----
httr2::resp_status(http_response)
## [1] 200


4. Check response content type

# Check response content type ----
httr2::resp_content_type(http_response)
## [1] "application/json"

5. Parse response content

# Parse response content ----
content <- httr2::resp_body_json(http_response)

content
## [[1]]
## [[1]]$place_id
## [1] 104509929
## 
## [[1]]$licence
## [1] "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright"
## 
## [[1]]$osm_type
## [1] "relation"
## 
## [[1]]$osm_id
## [1] 28722
## 
## [[1]]$lat
## [1] "43.6112422"
## 
## [[1]]$lon
## [1] "3.8767337"
## 
## [[1]]$class
## [1] "boundary"
## 
## [[1]]$type
## [1] "administrative"
## 
## [[1]]$place_rank
## [1] 16
## 
## [[1]]$importance
## [1] 0.6880008
## 
## [[1]]$addresstype
## [1] "city"
## 
## [[1]]$name
## [1] "Montpellier"
## 
## [[1]]$display_name
## [1] "Montpellier, Hérault, Occitanie, France métropolitaine, France"
## 
## [[1]]$boundingbox
## [[1]]$boundingbox[[1]]
## [1] "43.5667083"
## 
## [[1]]$boundingbox[[2]]
## [1] "43.6533580"
## 
## [[1]]$boundingbox[[3]]
## [1] "3.8070597"
## 
## [[1]]$boundingbox[[4]]
## [1] "3.9413208"


# Object type ----
class(content)
## [1] "list"


# Object dimensions ----
length(content)
## [1] 1


6. Clean data

# Clean output ----
content <- content[[1]]

content <- data.frame("name" = content$"name",
                      "lon"  = as.numeric(content$"lon"),
                      "lat"  = as.numeric(content$"lat"))
content
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124

Visualization

# Install required package ----
install.packages("maps")


# Map France boundary ----
maps::map(regions = "France", 
          fill    = TRUE, 
          col     = "black")

# Add retrieved coordinates ----
points(x   = content$"lon", 
       y   = content$"lat", 
       pch = 19, 
       cex = 1, 
       col = "red")

# Add retrieved name ----
text(x      = content$"lon", 
     y      = content$"lat", 
     labels = content$"name", 
     pos    = 2, 
     col    = "white", 
     family = "serif")

Code factorisation

Function definition

get_coords_from_location <- function(city, country) {
  
  # Nominatim API endpoint ----
  endpoint <- "https://nominatim.openstreetmap.org/search"

  # Prepare the HTTP request ----
  http_request <- httr2::request(endpoint)
  
  # Append request parameters ----
  http_request <- http_request |> 
    httr2::req_url_query(city    = city) |> 
    httr2::req_url_query(country = country) |> 
    httr2::req_url_query(format = "json") |> 
    httr2::req_url_query(limit  = 1)
  
  # Send HTTP request  ----
  http_response <- httr2::req_perform(http_request)
  
  # Check response status ----
  httr2::resp_check_status(http_response)
  
  # Parse response content ----
  content <- httr2::resp_body_json(http_response)
  
  # Clean output ----
  content <- content[[1]]
  content <- data.frame("name" = content$"name",
                        "lon"  = as.numeric(content$"lon"),
                        "lat"  = as.numeric(content$"lat"))
  content
}

Function usage

# Retrieve coordinates ----
get_coords_from_location(city = "Montpellier", country = "France")
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124


Automation

# List of cities ----
cities <- c("Montpellier", "Paris", "Strasbourg", "Grenoble", "Bourges")

# Retrieve coordinates ----
coords <- data.frame()

for (city in cities) {
  
  coord <- get_coords_from_location(city = city, country = "France")
  coords <- rbind(coords, coord)
}

coords
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124
## 2       Paris 2.320041 48.85889
## 3  Strasbourg 7.750713 48.58461
## 4    Grenoble 5.735782 45.18756
## 5     Bourges 2.399125 47.08117
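N.B. This loop sends one request per city; public APIs like Nominatim ask users to pause between calls (a minimal sketch of the same loop, made polite):

# Retrieve coordinates with a pause between requests ----
coords <- data.frame()

for (city in cities) {
  
  coord  <- get_coords_from_location(city = city, country = "France")
  coords <- rbind(coords, coord)
  Sys.sleep(1)   # wait one second between two requests
}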

  Exercise (40 min)

Accessing data

Part 1: Download New Zealand boundaries from https://gadm.org/ (GeoJSON Level 0).

  Use the function download.file().


Part 2: Download GBIF occurrences for two bat species endemic to the islands of New Zealand:

  • Mystacina tuberculata (New Zealand lesser short-tailed bat)  

  • Chalinolobus tuberculatus (New Zealand long-tailed bat)  

      Use the function rgbif::occ_search().
      Do not forget to export the data.


Part 3: Download the PanTHERIA database, a species-level database of life history, ecology, and geography of extant and recently extinct mammals, available at https://esapubs.org/archive/ecol/E090/184/.

  Use the function download.file() and the function readr::read_delim() to import the database.


Bonus: Plot a New Zealand map with GBIF occurrences.

  Use the packages sf and ggplot2.

Correction

Part 1

## Download administrative boundaries of NZL ----

gadm_url <- "https://geodata.ucdavis.edu/gadm/gadm4.1/json/"
filename <- "gadm41_NZL_0.json"

download.file(url      = paste0(gadm_url, filename), 
              destfile = filename,
              mode     = "wb")


Part 2

## Download GBIF occurrences ----

species_names <- c("Mystacina tuberculata", 
                   "Chalinolobus tuberculatus")

occ <- rgbif::occ_search(scientificName     = species_names, 
                         fields             = "minimal",
                         hasCoordinate      = TRUE,
                         hasGeospatialIssue = FALSE)


## Append occurrences -----

occ <- rbind(occ$`Mystacina tuberculata`$"data",
             occ$`Chalinolobus tuberculatus`$"data")

Part 2 (continued)

## Export occurrences ----

write.csv(x         = occ, 
          file      = "gbif_occurrences.csv", 
          row.names = FALSE)


Part 3

## Download PanTHERIA database ----

esa_url  <- "https://esapubs.org/archive/ecol/E090/184/"
filename <- "PanTHERIA_1-0_WR05_Aug2008.txt"

download.file(url      = paste0(esa_url, filename), 
              destfile = filename,
              mode     = "wb")

Correction

Bonus

## Import NZL boundaries (GeoJSON) ----

nzl <- sf::st_read("gadm41_NZL_0.json") |> 
  sf::st_transform(crs = 27200)


## Convert occurrences into sf object ----

occ_sf <- occ |> 
  sf::st_as_sf(coords = c("decimalLongitude", "decimalLatitude"),
               crs    = 4326) |> 
  sf::st_transform(crs = 27200)


## Map occurrences ----

library("ggplot2")

ggplot() +
  
  geom_sf(data = nzl) +
  
  geom_sf(data = occ_sf, mapping = aes(color = scientificName)) +
  
  theme_bw()

  Web scraping

What is web scraping?

  • A method to automatically extract data from web pages
  • Converts unstructured web content into structured data
  • Also known as screen scraping


 If an API is available, you should use it: it will give you more reliable data.


 What is a web page?

  • HTML: content structure
  • CSS: formatting
  • JavaScript: dynamism & interactivity

HTML basics

A web page is described and structured by the HTML language (HyperText Markup Language)


HTML code

<!DOCTYPE html>
<html>
  
  ...
      
</html>


An HTML page always has this structure

<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    ...
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <html> contains two children:

  • <head> contains page metadata
  • <body> contains page content

<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <head> can contain different metadata:

  • <title> contains the page title
  • <meta> is used to specify the encoding, authors, keywords, etc.
  • <link> is used to call external resources

N.B. This section is usually not very interesting for web scraping

<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
  
    <h1 id='section-1'>Header A</h1>
    
    <p class='my-class'>A paragraph with <b>bold text</b>.</p>
    
    <p>
      A second paragraph with a 
      <a href='https://google.com'>link</a>.
    </p>
    
    <img src='images/my-img.png' width='150' height='150' />
    
  </body>
      
</html>


  • Except for some elements (<img />), all HTML tags come in pairs (an opening and a closing tag).
  • Some elements are block tags (<h1>, <p>, etc.), others are inline tags (<b>, <a>, etc.)
  • Some elements can have attributes: id, class, href, src, etc.


 Web scraping consists in detecting HTML elements by their tag, class, or (unique) id, in order to extract their content or attributes (href, src, etc.).

The rvest package


# Install 'rvest' package ----
install.packages("rvest")


Table: Main functions of rvest

  Function         Description
  read_html()      Read and parse HTML content
  html_element()   Extract HTML element(s)
  html_attr()      Extract HTML attribute(s)
  html_text2()     Extract the content of element(s)
  html_table()     Extract a table & convert it into a data.frame

Scrape a table

 Example with the Wikipedia page Liste des communes de France les plus peuplées


0. Install additional packages

# 'janitor' to clean dirty data ----
install.packages("janitor")

# 'dplyr' to handle data ----
install.packages("dplyr")


1. Build the HTTP request

# Wikipedia URL ----
base_url <- "https://fr.wikipedia.org"

# Page URL ----
page_url <- paste0(base_url,
                   "/wiki/",
                   "Liste_des_communes_de_France_les_plus_peuplées")
https://fr.wikipedia.org/wiki/Liste_des_communes_de_France_les_plus_peuplées


2. Scrape the HTML page

# Scrape web page ----
content <- rvest::read_html(page_url)
{html_document}

3. Extract HTML tables

# Extract HTML tables ----
tables <- rvest::html_table(content)


# Type of output ----
class(tables)
## [1] "list"


# Element length ----
length(tables)
## [1] 6


4. Extract the right table

# Extract second table ----
datas <- tables[[2]]
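If you are unsure which element of the list is the right table, a quick look at their dimensions can help (a minimal sketch):

# Inspect the dimensions of each extracted table ----
lapply(tables, dim)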

5. Clean output

# Explore data ----
datas
## # A tibble: 288 × 14
##    Rang2024 CodeInsee Commune     Département  Statut Région `Population légale`
##    <chr>    <chr>     <chr>       <chr>        <chr>  <chr>  <chr>              
##  1 Rang2024 CodeInsee Commune     Département  Statut Région 2021[1]            
##  2 1        75056     Paris[a]    Paris[a]     Préfe… Île-d… 2 133 111          
##  3 2        13055     Marseille   Bouches-du-… Préfe… Prove… 873 076            
##  4 3        69123     Lyon        Métropole d… Préfe… Auver… 522 250            
##  5 4        31555     Toulouse    Haute-Garon… Préfe… Occit… 504 078            
##  6 5        06088     Nice        Alpes-Marit… Préfe… Prove… 348 085            
##  7 6        44109     Nantes      Loire-Atlan… Préfe… Pays … 323 204            
##  8 7        34172     Montpellier Hérault      Préfe… Occit… 302 454            
##  9 8        67482     Strasbourg  Bas-Rhin     Préfe… Grand… 291 313            
## 10 9        33063     Bordeaux    Gironde      Préfe… Nouve… 261 804            
## # ℹ 278 more rows
## # ℹ 7 more variables: `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>


# Select top 10 cities ----
top10 <- datas[2:11, ]

# Filter columns ----
top10 <- top10[ , c(1, 3:4, 7)]


# Clean column names ----
top10 <- janitor::clean_names(top10)
top10
## # A tibble: 10 × 4
##    rang2024 commune     departement          population_legale
##    <chr>    <chr>       <chr>                <chr>            
##  1 1        Paris[a]    Paris[a]             2 133 111        
##  2 2        Marseille   Bouches-du-Rhône     873 076          
##  3 3        Lyon        Métropole de Lyon[b] 522 250          
##  4 4        Toulouse    Haute-Garonne        504 078          
##  5 5        Nice        Alpes-Maritimes      348 085          
##  6 6        Nantes      Loire-Atlantique     323 204          
##  7 7        Montpellier Hérault              302 454          
##  8 8        Strasbourg  Bas-Rhin             291 313          
##  9 9        Bordeaux    Gironde              261 804          
## 10 10       Lille       Nord                 236 710


# Rename specific column ----
top10 <- dplyr::rename(top10,
                       pop2021 = population_legale)

colnames(top10)
## [1] "rang2024"    "commune"     "departement" "pop2021"

5. Clean output (continued)

# Explore data ----
top10
## # A tibble: 10 × 4
##    rang2024 commune     departement          pop2021  
##    <chr>    <chr>       <chr>                <chr>    
##  1 1        Paris[a]    Paris[a]             2 133 111
##  2 2        Marseille   Bouches-du-Rhône     873 076  
##  3 3        Lyon        Métropole de Lyon[b] 522 250  
##  4 4        Toulouse    Haute-Garonne        504 078  
##  5 5        Nice        Alpes-Maritimes      348 085  
##  6 6        Nantes      Loire-Atlantique     323 204  
##  7 7        Montpellier Hérault              302 454  
##  8 8        Strasbourg  Bas-Rhin             291 313  
##  9 9        Bordeaux    Gironde              261 804  
## 10 10       Lille       Nord                 236 710


# Convert 'rang2024' to numeric ----
top10$"rang2024" <- as.integer(top10$"rang2024")


# Remove footnotes in 'commune' ----
top10$"commune"     <- gsub("\\[[a-z]\\]", "", top10$"commune")

# Remove footnotes in 'departement' ----
top10$"departement" <- gsub("\\[[a-z]\\]", "", top10$"departement")

top10
## # A tibble: 10 × 4
##    rang2024 commune     departement       pop2021  
##       <int> <chr>       <chr>             <chr>    
##  1        1 Paris       Paris             2 133 111
##  2        2 Marseille   Bouches-du-Rhône  873 076  
##  3        3 Lyon        Métropole de Lyon 522 250  
##  4        4 Toulouse    Haute-Garonne     504 078  
##  5        5 Nice        Alpes-Maritimes   348 085  
##  6        6 Nantes      Loire-Atlantique  323 204  
##  7        7 Montpellier Hérault           302 454  
##  8        8 Strasbourg  Bas-Rhin          291 313  
##  9        9 Bordeaux    Gironde           261 804  
## 10       10 Lille       Nord              236 710

# Convert 'pop2021' to numeric ----
top10$"pop2021" <- gsub(" ", "", top10$"pop2021")
top10$"pop2021" <- as.numeric(top10$"pop2021")

top10
## # A tibble: 10 × 4
##    rang2024 commune     departement       pop2021
##       <int> <chr>       <chr>               <dbl>
##  1        1 Paris       Paris             2133111
##  2        2 Marseille   Bouches-du-Rhône   873076
##  3        3 Lyon        Métropole de Lyon  522250
##  4        4 Toulouse    Haute-Garonne      504078
##  5        5 Nice        Alpes-Maritimes    348085
##  6        6 Nantes      Loire-Atlantique   323204
##  7        7 Montpellier Hérault            302454
##  8        8 Strasbourg  Bas-Rhin           291313
##  9        9 Bordeaux    Gironde            261804
## 10       10 Lille       Nord               236710
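As a quick check of the cleaned (now numeric) values, you could plot them (a minimal sketch):

# Quick visual check of the cleaned data ----
barplot(height    = top10$"pop2021", 
        names.arg = top10$"commune", 
        las       = 2)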

Scrape other elements

Detect HTML element by tag

# Extract content of h1 element ----
rvest::html_element(content, css = "h1") |> 
  rvest::html_text2()
## [1] "Liste des communes de France les plus peuplées"


Detect HTML elements by tag

# Extract content of the first h2 element ----
rvest::html_element(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"


# Extract content of all h2 elements ----
rvest::html_elements(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"                                                 
## [2] "Cadre des données"                                        
## [3] "Vue d'ensemble"                                           
## [4] "Communes de plus de 30 000 habitants"                     
## [5] "Communes ayant compté plus de 30 000 habitants avant 2024"
## [6] "Notes et références"                                      
## [7] "Voir aussi"

Detect HTML element by ID

# Extract content of the h2 element detected by its id ----
rvest::html_element(content, css = "#Cadre_des_données") |> 
  rvest::html_text2()
## [1] "Cadre des données"


Extract attribute

# Extract URL of the first image ----
image_url <- rvest::html_element(content, css = "img") |> 
  rvest::html_attr(name = "src")
image_url
## [1] "/static/images/icons/wikipedia.png"


# Build image full URL ----
image_url <- paste0(base_url, image_url)
image_url
## [1] "https://fr.wikipedia.org/static/images/icons/wikipedia.png"


# Download image ----
download.file(url      = image_url,
              destfile = "wikipedia_logo.png",
              mode     = "wb")

Getting the right selector

  • Press CTRL + U (Firefox) to display the HTML code of the page
  • Right click on a page element and click on Inspect
  • Install the SelectorGadget bookmarklet in your browser


Dynamic web pages

  • Have a look at the function session() of the package rvest.
  • Have a look at the package RSelenium
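For stateful navigation, rvest sessions keep cookies and let you follow links (a minimal sketch, reusing the page_url defined above):

# Create a session and navigate ----
s <- rvest::session("https://fr.wikipedia.org")
s <- rvest::session_jump_to(s, page_url)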

Ethics and legalities

Well… it’s complicated.



 Read Chapter 24.2 of the book R for Data Science by Wickham, Çetinkaya-Rundel & Grolemund.


 Be nice on the web with the package polite
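A minimal sketch with polite: bow() introduces you to the host and reads its robots.txt; scrape() then retrieves the page while respecting rate limits:

# Scrape politely ----
session <- polite::bow("https://fr.wikipedia.org")
page    <- polite::scrape(session)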

Thanks