Accessing biodiversity data

Web portals, Scripting, API & Web scraping

November 2025

Nicolas Casajus

Camille Coux

Data scientists
@FRB-CESAB    

Data lifecycle

Table of contents

  • Web portals
  • Scripting
  • API
  • Exercise
  • Web scraping

  Web portals

Web portals

Any online platform (website) that supports users in accessing collections of open data. Portals can be governmental (e.g. data.gouv.fr), run by organizations or NGOs, or stem from an individual initiative.


 You access data in a web browser (by clicking)


  • Click on a ready-to-download file (e.g. GADM)
  • Fill a form to download a user-specific file (e.g. GBIF)
  • Fill a form to get data through a URL (e.g. BirdLife)
  • Sometimes, you need to register (e.g. TRY)

Case study

You’re building species distribution models for metropolitan France. You already have a list of species and their occurrences in space and time.

Now, you need to retrieve:

  • Spatial boundaries of French regions to map the species occurrences
  • Climate data (temperature and precipitation) to fit models

Access data - France regions

The GADM data portal is a good option to get the spatial boundaries of any country in the world at different administrative levels.

They have a nice description of file formats.


Let’s use the GeoPackage format (.gpkg)

Access data - France regions

Let’s list the layers available in the gadm41_FRA.gpkg file


sf::st_layers(
  here::here("data", "gadm41_FRA.gpkg")
)
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields crs_name
## 1  ADM_ADM_0 Multi Polygon        1      2   WGS 84
## 2  ADM_ADM_1 Multi Polygon       13     11   WGS 84
## 3  ADM_ADM_2 Multi Polygon       96     13   WGS 84
## 4  ADM_ADM_3 Multi Polygon      350     16   WGS 84
## 5  ADM_ADM_4 Multi Polygon     3728     14   WGS 84
## 6  ADM_ADM_5 Multi Polygon    36611     15   WGS 84

All these layers correspond to different levels of subdivisions.

Layer name   Description
ADM_ADM_0    France contours
ADM_ADM_1    Region contours
ADM_ADM_2    Department contours
ADM_ADM_3    Commune contours

Access data - France regions

Let’s import the regional subdivision


regions <- sf::st_read(
  dsn = here::here("data", "gadm41_FRA.gpkg"),
  layer = "ADM_ADM_1"
)

head(regions)
## Simple feature collection with 6 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 50.16764
## Geodetic CRS:  WGS 84
##     GID_1 GID_0 COUNTRY                  NAME_1 VARNAME_1 NL_NAME_1 TYPE_1
## 1 FRA.1_1   FRA  France    Auvergne-Rhône-Alpes        NA        NA Région
## 2 FRA.2_1   FRA  France Bourgogne-Franche-Comté        NA        NA Région
## 3 FRA.3_1   FRA  France                Bretagne        NA        NA Région
## 4 FRA.4_1   FRA  France     Centre-Val de Loire        NA        NA Région
## 5 FRA.5_1   FRA  France                   Corse   Corsica        NA Région
## 6 FRA.6_1   FRA  France               Grand Est        NA        NA Région
##   ENGTYPE_1 CC_1 HASC_1  ISO_1                           geom
## 1    Region   NA  FR.AR     NA MULTIPOLYGON (((5.415834 44...
## 2    Region   NA  FR.BF     NA MULTIPOLYGON (((5.256271 46...
## 3    Region   NA  FR.BT FR-BRE MULTIPOLYGON (((-3.248194 4...
## 4    Region   NA  FR.CN FR-CVL MULTIPOLYGON (((2.063459 46...
## 5    Region   NA  FR.CE FR-20R MULTIPOLYGON (((9.102084 41...
## 6    Region   NA  FR.AO     NA MULTIPOLYGON (((7.178251 47...

Let’s plot the regions with ggplot2


library(ggplot2)

ggplot(regions) + 
  geom_sf(aes(fill = NAME_1)) + 
  theme_bw()


Access data - Climate data


You need to know what you are looking for: resolution, time period, monthly or yearly averages, geographic coverage, etc.

Trait data - Pandora’s box

 TRY data portal for plant traits

To access the TRY database, you’ll need to:

  1. register
  2. submit a request with a short description of your project
  3. wait to receive a text file by email within a few days

Retrieving data from repositories



This usually happens when you want to retrieve data from a paper or study.



The FORCIS database (Chaabane et al. 2023) is published on Zenodo.
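
Zenodo exposes direct file URLs, so such deposits can also be retrieved by code (see the next section). A minimal sketch, assuming a hypothetical record ID and file name (check the actual record page for the real values):


# Download a file from a Zenodo record (hypothetical ID and file name) ----
download.file(
  url      = "https://zenodo.org/records/0000000/files/forcis_db.csv?download=1",
  destfile = here::here("data", "forcis_db.csv"),
  mode     = "wb"
)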

  Scripting

Download data directly from R

In the previous section, we saw how to manually download data from a web browser. However, when possible, we recommend performing this task with code (scripting).

 Reproducibility & Automation


In R, the function download.file() can be used to download a file from the Internet.


download.file(
  url = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg",
  destfile = here::here("data", "gadm41_FRA.gpkg"),
  mode = "wb"
)

Set the argument mode to "wb" (write binary), especially if you use Windows.
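
A small reproducibility tip (a sketch, not from the original slides): skip the download if the file is already present, and raise the timeout for large files.


# Download only if the file is not already present ----
gpkg_path <- here::here("data", "gadm41_FRA.gpkg")

if (!file.exists(gpkg_path)) {
  options(timeout = 600)   # allow up to 10 minutes for big downloads
  download.file(
    url      = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_FRA.gpkg",
    destfile = gpkg_path,
    mode     = "wb"
  )
}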

Where can I find the file URL?

 Just right-click on a link and copy the URL.

What about compressed files?

You can use the helper functions unzip() and gzfile() after downloading the compressed file.


download.file(
  url = "http://www.sociopatterns.org/wp-content/uploads/2015/07/Friendship-network_data_2013.csv.gz",
  destfile = here::here("data", "friends.gz"),
  mode = "wb"
)

friends <- read.table(gzfile(here::here("data", "friends.gz")))
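
For .zip archives, the base function unzip() extracts files after download. A minimal sketch, with a hypothetical archive URL:


# Download and extract a ZIP archive (hypothetical URL) ----
download.file(
  url      = "https://example.org/data/archive.zip",
  destfile = here::here("data", "archive.zip"),
  mode     = "wb"
)

unzip(
  zipfile = here::here("data", "archive.zip"),
  exdir   = here::here("data")
)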


But decompression is unnecessary for most reading functions, which can read .gz files directly


friends <- read.table(here::here("data", "friends.gz"))


Other useful packages: data.table, vroom, readr, etc.

The package geodata

Some packages have been tailored specifically to facilitate data access, using the download.file() function internally.


 This is the case of the geodata package

The package geodata

For WorldClim data


fr <- geodata::worldclim_country(
  country = "France",
  var = "tavg",
  path = tempdir()  # folder where the data is downloaded
)

# Have a look at this object:
fr

The package geodata

Use the terra package to handle and plot the raster (GeoTIFF format)

# plot the first monthly average:
terra::plot(fr$"FRA_wc2.1_30s_tavg_1")
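
To tie this back to the case study, the climate raster can be cropped to the French regions imported earlier. A sketch, assuming the regions object from the GADM section is still in memory:


# Crop and mask the climate raster to the French regions ----
fr_regions <- terra::vect(regions)

fr_tavg <- terra::crop(fr, fr_regions) |> 
  terra::mask(fr_regions)

terra::plot(fr_tavg$"FRA_wc2.1_30s_tavg_1")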

  API

Accessing data from the Web

User interface (clicking in a web browser) vs. code interface (web APIs)

(RESTful) Web API

Definition: A service accessed from a client device (mobile phone, laptop, etc.) to a Web server using the Hypertext Transfer Protocol (HTTP).

  • The client sends an HTTP request to the Web server
  • The Web server sends back a response in JSON or XML format (raw data)
  • The Web server exposes one or more endpoints (predefined request/response)


Advantages (for the user)

  • Users can create their own client
  • The client can be developed in any language
  • Users can include the service in a bigger project
  • Users access raw data



Writing code means automation and reproducibility


Limitations

  • Each API has its own specification
  • Authentication method (free or not)
    • Token
    • Login and password
  • Some restrictions
    • Number of requests per day/month
    • Number of records per request
    • Incomplete data
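
Request quotas can often be respected directly from the client. For instance, the httr2 package (introduced below) provides helpers for throttling and retrying; a minimal sketch, assuming a hypothetical endpoint:


# Respect rate limits with httr2 (hypothetical endpoint) ----
req <- httr2::request("https://api.example.org/records") |> 
  httr2::req_throttle(rate = 10 / 60) |>   # at most 10 requests per minute
  httr2::req_retry(max_tries = 3)          # retry on transient failures

resp <- httr2::req_perform(req)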

Examples of APIs

This is a non-exhaustive list


Biodiversity data

  • Global Biodiversity Information Facility (GBIF) - API
  • IUCN Red List - API
  • Fishbase - API
  • Species+/CITES Checklist - API
  • Knowledge Network for Biocomplexity (KNB) - API

Taxonomy

  • Encyclopedia of Life (EOL) - API
  • Integrated Taxonomic Information System (ITIS) - API
  • Barcode of Life Data (BOLD) - API
  • World Register of Marine Species (WoRMS) - API

Scientific literature

  • Web of Science - API
  • Scopus - API
  • CrossRef - API
  • OpenAlex - API

Others

  • Wikipedia - API
  • OpenStreetMap - API
  • Zenodo - API

How does it work?

Requesting an API follows the client-server protocol

API Client

It’s the tool you use to request the API and parse the response (data).


If you are lucky, an API client will already be available.


Non-exhaustive list of packages



 Otherwise you will have to build your own client.

Building an API client


Available at: https://httr2.r-lib.org/


# Install 'httr2' package ----
install.packages("httr2")


Table: Main functions of httr2
Function              Description
request()             Create an HTTP request
req_url_query()       Add parameters to an HTTP request
req_perform()         Send the HTTP request to an API
resp_status()         Check the HTTP response status
resp_content_type()   Check the content type of the response
resp_body_json()      Parse the response content (JSON format)
resp_body_xml()       Parse the response content (XML format)

Building an API client

Example with the OpenStreetMap Nominatim API

Retrieve coordinates from a location (city, address, building, etc.)

Nominatim API client

1. Build the HTTP request

# Nominatim API endpoint ----
endpoint <- "https://nominatim.openstreetmap.org/search"

# Prepare the HTTP request ----
http_request <- httr2::request(endpoint)
https://nominatim.openstreetmap.org/search

# Append request parameters ----
http_request <- http_request |> 
  httr2::req_url_query(city    = "Montpellier") |> 
  httr2::req_url_query(country = "France")
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France

# Append response parameters ----
http_request <- http_request |> 
  httr2::req_url_query(format = "json") |> 
  httr2::req_url_query(limit  = 1)
https://nominatim.openstreetmap.org/search?city=Montpellier&country=France&format=json&limit=1

2. Send the HTTP request

# Send HTTP request  ----
http_response <- httr2::req_perform(http_request)
Status: 200 OK
Content-Type: application/json

3. Check response status

# Check response status ----
httr2::resp_status(http_response)
## [1] 200

4. Check response content type

# Check response content type ----
httr2::resp_content_type(http_response)
## [1] "application/json"

5. Parse response content

# Parse response content ----
content <- httr2::resp_body_json(http_response)

content
## [[1]]
## [[1]]$place_id
## [1] 77908475
## 
## [[1]]$licence
## [1] "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright"
## 
## [[1]]$osm_type
## [1] "relation"
## 
## [[1]]$osm_id
## [1] 28722
## 
## [[1]]$lat
## [1] "43.6112422"
## 
## [[1]]$lon
## [1] "3.8767337"
## 
## [[1]]$class
## [1] "boundary"
## 
## [[1]]$type
## [1] "administrative"
## 
## [[1]]$place_rank
## [1] 16
## 
## [[1]]$importance
## [1] 0.6880008
## 
## [[1]]$addresstype
## [1] "city"
## 
## [[1]]$name
## [1] "Montpellier"
## 
## [[1]]$display_name
## [1] "Montpellier, Hérault, Occitanie, France métropolitaine, France"
## 
## [[1]]$boundingbox
## [[1]]$boundingbox[[1]]
## [1] "43.5667083"
## 
## [[1]]$boundingbox[[2]]
## [1] "43.6533580"
## 
## [[1]]$boundingbox[[3]]
## [1] "3.8070597"
## 
## [[1]]$boundingbox[[4]]
## [1] "3.9413208"


# Object type ----
class(content)
## [1] "list"


# Object dimensions ----
length(content)
## [1] 1


6. Clean data

# Clean output ----
content <- content[[1]]

content <- data.frame(
  "name" = content$"name",
  "lon" = as.numeric(content$"lon"),
  "lat" = as.numeric(content$"lat")
)

content
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124
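
As an aside, resp_body_json() forwards its arguments to jsonlite, so (an assumption worth checking against the httr2 documentation) simplifyVector = TRUE can return a data.frame directly:


# Parse the JSON response directly into a data.frame ----
content_df <- httr2::resp_body_json(
  http_response,
  simplifyVector = TRUE
)

content_df[ , c("name", "lon", "lat")]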

Visualization

# Install required package ----
install.packages("maps")


# Map France boundary ----
maps::map(
  regions = "France", 
  fill = TRUE, 
  col = "black"
)

# Add retrieved coordinates ----
points(
  x = content$"lon", 
  y = content$"lat", 
  pch = 19, 
  cex = 1, 
  col = "red"
)

# Add retrieved name ----
text(
  x = content$"lon", 
  y = content$"lat", 
  labels = content$"name", 
  pos = 2, 
  col = "white", 
  family = "serif"
)

Code factorisation

Function definition

get_coords_from_location <- function(city, country) {
  
  # Nominatim API endpoint ----
  endpoint <- "https://nominatim.openstreetmap.org/search"

  # Prepare the HTTP request ----
  http_request <- httr2::request(endpoint)
  
  # Append request parameters ----
  http_request <- http_request |> 
    httr2::req_url_query(city    = city) |> 
    httr2::req_url_query(country = country) |> 
    httr2::req_url_query(format = "json") |> 
    httr2::req_url_query(limit  = 1)
  
  # Send HTTP request  ----
  http_response <- httr2::req_perform(http_request)
  
  # Check response status ----
  httr2::resp_check_status(http_response)
  
  # Parse response content ----
  content <- httr2::resp_body_json(http_response)
  
  # Clean output ----
  content <- content[[1]]
  content <- data.frame(
    "name" = content$"name",
    "lon" = as.numeric(content$"lon"),
    "lat" = as.numeric(content$"lat")
  )

  content
}

Function usage

# Retrieve coordinates ----
get_coords_from_location(city = "Montpellier", country = "France")
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124


Automation

# List of cities ----
cities <- c("Montpellier", "Paris", "Strasbourg", "Grenoble", "Bourges")

# Retrieve coordinates ----
coords <- data.frame()

for (city in cities) {
  coord <- get_coords_from_location(city = city, country = "France")
  coords <- rbind(coords, coord)
}

coords
##          name      lon      lat
## 1 Montpellier 3.876734 43.61124
## 2       Paris 2.348391 48.85350
## 3  Strasbourg 7.750713 48.58461
## 4    Grenoble 5.735782 45.18756
## 5     Bourges 2.399125 47.08117
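
Note that Nominatim’s usage policy caps clients at about one request per second. A hedged variant of the loop above that pauses between requests and uses lapply():


# Retrieve coordinates politely (~1 request per second) ----
coords <- lapply(cities, function(city) {
  Sys.sleep(1)   # pause between requests (Nominatim usage policy)
  get_coords_from_location(city = city, country = "France")
})

coords <- do.call(rbind, coords)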

  Exercise (40 min)

Exercise - Accessing data

Part 1: Download the PanTHERIA database, a species-level database of life history, ecology, and geography of extant and recently extinct mammals, available here.

  Use the function download.file() to download the dataset


Part 2: Download GBIF occurrences for two bat species endemic to the islands of New Zealand:

  • Mystacina tuberculata (New Zealand lesser short-tailed bat)  

  • Chalinolobus tuberculatus (New Zealand long-tailed bat)  

  Use the function rgbif::occ_search() to download species occurrences
  Do not forget to export the data


Bonus A: Download New Zealand boundaries from GADM (GeoJSON Level 0).

  Use the function download.file()


Bonus B: Plot a New Zealand map with GBIF occurrences.

  Use the packages sf and ggplot2

Correction

Part 1

## Download PanTHERIA database ----

esa_url  <- "https://esapubs.org/archive/ecol/E090/184/"
filename <- "PanTHERIA_1-0_WR05_Aug2008.txt"

download.file(
  url = paste0(esa_url, filename), 
  destfile = filename,
  mode = "wb"
)

Part 2

## Species names ----

species_names <- c(
  "Mystacina tuberculata", 
  "Chalinolobus tuberculatus"
)

## Download GBIF occurrences ----

occ <- rgbif::occ_search(
  scientificName = species_names, 
  fields = "minimal",
  hasCoordinate = TRUE,
  hasGeospatialIssue = FALSE
)

## Append occurrences -----

occ <- rbind(
  occ$`Mystacina tuberculata`$"data",
  occ$`Chalinolobus tuberculatus`$"data"
)

## Export occurrences ----

write.csv(
  x = occ,
  file = "gbif_occurrences.csv",
  row.names = FALSE
)

Correction

Bonus A

## Download administrative boundaries of NZL ----

gadm_url <- "https://geodata.ucdavis.edu/gadm/gadm4.1/json/"
filename <- "gadm41_NZL_0.json"

download.file(
  url  = paste0(gadm_url, filename), 
  destfile = filename,
  mode = "wb"
)

Correction

Bonus B

## Import NZL boundaries (GeoJSON) ----

nzl <- sf::st_read("gadm41_NZL_0.json") |> 
  sf::st_transform(crs = 27200)


## Convert occurrences into sf object ----

occ_sf <- occ |> 
  sf::st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = 4326
  ) |> 
  sf::st_transform(crs = 27200)


## Map occurrences ----

library("ggplot2")

ggplot() +
  geom_sf(data = nzl) +
  geom_sf(data = occ_sf, mapping = aes(color = scientificName)) +
  theme_bw()

  Web scraping

What is web scraping?

  • A method to automatically extract data from web pages
  • Converts unstructured web content into structured data
  • Also known as screen scraping


If an API is available, you should use it: it will give you more reliable data.


What is a web page?

  • HTML: content structuring
  • CSS: formatting
  • JavaScript: dynamism & interactivity

HTML basics

A web page is described and structured by the HTML language (HyperText Markup Language)


HTML code

<!DOCTYPE html>
<html>
  
  ...
      
</html>


An HTML page always has this structure


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    ...
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <html> contains two children:

  • <head> contains page metadata
  • <body> contains page content


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
    ...
  </body>
      
</html>


The tag <head> can contain different metadata:

  • <title> contains the page title
  • <meta> is used to specify the encoding, authors, keywords, etc.
  • <link> is used to call external resources

N.B. This section is not necessarily interesting for web scraping


<!DOCTYPE html>
<html>
  
  <!-- Document metadata -->
  <head>
    <meta charset="UTF-8">
    <title>Page title</title>
  </head>
  
  <!-- Document content -->
  <body>
  
    <h1 id='section-1'>Header A</h1>
    
    <p class='my-class'>A paragraph with <b>bold text</b>.</p>
    
    <p>
      A second paragraph with a 
      <a href='https://google.com'>link</a>.
    </p>
    
    <img src='images/my-img.png' width='150' height='150' />
    
  </body>
      
</html>


  • Except for some elements (<img />), all HTML tags are paired.
  • Some elements are block tags (<h1>, <p>, etc.), others are inline tags (<b>, <a>, etc.)
  • Some elements can have attributes: id, class, href, src, etc.


 Web scraping consists of detecting HTML elements by their tag, class, or (unique) id, and extracting their content or attributes (href, src, etc.).

The rvest package


# Install 'rvest' package ----
install.packages("rvest")


Table: Main functions of rvest
Function         Description
read_html()      Read and parse HTML content
html_element()   Extract HTML element(s)
html_attr()      Extract HTML attribute(s)
html_text2()     Extract the content of element(s)
html_table()     Extract a table & convert it into a data.frame
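
To see these selectors in action before scraping a real page, here is a small sketch applying rvest to the toy HTML document from the slides above:


# Parse the toy HTML page shown earlier ----
doc <- rvest::read_html('
  <html>
    <body>
      <h1 id="section-1">Header A</h1>
      <p class="my-class">A paragraph with <b>bold text</b>.</p>
      <p>A second paragraph with a <a href="https://google.com">link</a>.</p>
    </body>
  </html>')

# Select by tag ----
rvest::html_element(doc, css = "h1") |> rvest::html_text2()
## [1] "Header A"

# Select by class ----
rvest::html_element(doc, css = "p.my-class") |> rvest::html_text2()
## [1] "A paragraph with bold text."

# Extract an attribute ----
rvest::html_element(doc, css = "a") |> rvest::html_attr(name = "href")
## [1] "https://google.com"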

Scrape a table

 Example with the Wikipedia page Liste des communes de France les plus peuplées


0. Install additional packages

# 'janitor' to clean dirty data ----
install.packages("janitor")

# 'dplyr' to handle data ----
install.packages("dplyr")

1. Build the HTTP request

# Wikipedia URL ----
base_url <- "https://fr.wikipedia.org"

# Page URL ----
page_url <- paste0(base_url,
                   "/wiki/",
                   "Liste_des_communes_de_France_les_plus_peuplées")
https://fr.wikipedia.org/wiki/Liste_des_communes_de_France_les_plus_peuplées

2. Scrape the HTML page

# Scrape web page ----
content <- rvest::read_html(page_url)
{html_document}

3. Extract HTML tables

# Extract HTML tables ----
tables <- rvest::html_table(content)


# Type of output ----
class(tables)
## [1] "list"


# Element length ----
length(tables)
## [1] 7


4. Extract the right table

# Extract second table ----
datas <- tables[[2]]

5. Clean output

# Explore data ----
datas
## # A tibble: 288 × 15
##    Rang2022 CodeInsee Commune     Département  Statut Région `Population légale`
##    <chr>    <chr>     <chr>       <chr>        <chr>  <chr>  <chr>              
##  1 Rang2022 CodeInsee Commune     Département  Statut Région 2022[1]            
##  2 1        75056     Paris[d]    Paris[d]     Préfe… Île-d… 2 113 705          
##  3 2        13055     Marseille   Bouches-du-… Préfe… Prove… 877 215            
##  4 3        69123     Lyon        Métropole d… Préfe… Auver… 520 774            
##  5 4        31555     Toulouse    Haute-Garon… Préfe… Occit… 511 684            
##  6 5        06088     Nice        Alpes-Marit… Préfe… Prove… 353 701            
##  7 6        44109     Nantes      Loire-Atlan… Préfe… Pays … 325 070            
##  8 7        34172     Montpellier Hérault      Préfe… Occit… 307 101            
##  9 8        67482     Strasbourg  Bas-Rhin     Préfe… Grand… 291 709            
## 10 9        33063     Bordeaux    Gironde      Préfe… Nouve… 265 328            
## # ℹ 278 more rows
## # ℹ 8 more variables: `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>,
## #   `Population légale` <chr>, `Population légale` <chr>


# Select top 10 cities ----
top10 <- datas[2:11, ]

# Filter columns ----
top10 <- top10[ , c(1, 3:4, 7)]


# Clean column names ----
top10 <- janitor::clean_names(top10)
top10
## # A tibble: 10 × 4
##    rang2022 commune     departement          population_legale
##    <chr>    <chr>       <chr>                <chr>            
##  1 1        Paris[d]    Paris[d]             2 113 705        
##  2 2        Marseille   Bouches-du-Rhône     877 215          
##  3 3        Lyon        Métropole de Lyon[e] 520 774          
##  4 4        Toulouse    Haute-Garonne        511 684          
##  5 5        Nice        Alpes-Maritimes      353 701          
##  6 6        Nantes      Loire-Atlantique     325 070          
##  7 7        Montpellier Hérault              307 101          
##  8 8        Strasbourg  Bas-Rhin             291 709          
##  9 9        Bordeaux    Gironde              265 328          
## 10 10       Lille       Nord                 238 695


# Rename specific column ----
top10 <- dplyr::rename(top10,
                       pop2021 = population_legale)

colnames(top10)
## [1] "rang2022"    "commune"     "departement" "pop2021"

5. Clean output (continued)

# Explore data ----
top10
## # A tibble: 10 × 4
##    rang2022 commune     departement          pop2021  
##    <chr>    <chr>       <chr>                <chr>    
##  1 1        Paris[d]    Paris[d]             2 113 705
##  2 2        Marseille   Bouches-du-Rhône     877 215  
##  3 3        Lyon        Métropole de Lyon[e] 520 774  
##  4 4        Toulouse    Haute-Garonne        511 684  
##  5 5        Nice        Alpes-Maritimes      353 701  
##  6 6        Nantes      Loire-Atlantique     325 070  
##  7 7        Montpellier Hérault              307 101  
##  8 8        Strasbourg  Bas-Rhin             291 709  
##  9 9        Bordeaux    Gironde              265 328  
## 10 10       Lille       Nord                 238 695


# Convert 'rang2022' to integer ----
top10$"rang2022" <- as.integer(top10$"rang2022")


# Remove footnotes in 'commune' ----
top10$"commune"     <- gsub("\\[[a-z]\\]", "", top10$"commune")

# Remove footnotes in 'departement' ----
top10$"departement" <- gsub("\\[[a-z]\\]", "", top10$"departement")

top10
## # A tibble: 10 × 4
##    rang2022 commune     departement       pop2021  
##       <int> <chr>       <chr>             <chr>    
##  1        1 Paris       Paris             2 113 705
##  2        2 Marseille   Bouches-du-Rhône  877 215  
##  3        3 Lyon        Métropole de Lyon 520 774  
##  4        4 Toulouse    Haute-Garonne     511 684  
##  5        5 Nice        Alpes-Maritimes   353 701  
##  6        6 Nantes      Loire-Atlantique  325 070  
##  7        7 Montpellier Hérault           307 101  
##  8        8 Strasbourg  Bas-Rhin          291 709  
##  9        9 Bordeaux    Gironde           265 328  
## 10       10 Lille       Nord              238 695

# Convert 'pop2021' to numeric ----
top10$"pop2021" <- gsub(" ", "", top10$"pop2021")
top10$"pop2021" <- as.numeric(top10$"pop2021")

top10
## # A tibble: 10 × 4
##    rang2022 commune     departement       pop2021
##       <int> <chr>       <chr>               <dbl>
##  1        1 Paris       Paris             2113705
##  2        2 Marseille   Bouches-du-Rhône   877215
##  3        3 Lyon        Métropole de Lyon  520774
##  4        4 Toulouse    Haute-Garonne      511684
##  5        5 Nice        Alpes-Maritimes    353701
##  6        6 Nantes      Loire-Atlantique   325070
##  7        7 Montpellier Hérault            307101
##  8        8 Strasbourg  Bas-Rhin           291709
##  9        9 Bordeaux    Gironde            265328
## 10       10 Lille       Nord               238695

Scrape other elements

Detect HTML element by tag

# Extract content of h1 element ----
rvest::html_element(content, css = "h1") |> 
  rvest::html_text2()
## [1] "Liste des communes de France les plus peuplées"


Detect HTML elements by tag

# Extract content of the first h2 element ----
rvest::html_element(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"


# Extract content of all h2 elements ----
rvest::html_elements(content, css = "h2") |> 
  rvest::html_text2()
## [1] "Sommaire"                                                 
## [2] "Cadre des données"                                        
## [3] "Vue d'ensemble"                                           
## [4] "Communes de plus de 30 000 habitants"                     
## [5] "Communes ayant compté plus de 30 000 habitants avant 2025"
## [6] "Notes et références"                                      
## [7] "Voir aussi"

Detect HTML element by ID

# Extract content of the h2 element detected by its id ----
rvest::html_element(content, css = "#Cadre_des_données") |> 
  rvest::html_text2()
## [1] "Cadre des données"

Extract attribute

# Extract URL of the first image ----
image_url <- rvest::html_element(content, css = "img") |> 
  rvest::html_attr(name = "src")
image_url
## [1] "/static/images/icons/wikipedia.png"


# Build image full URL ----
image_url <- paste0(base_url, image_url)
image_url
## [1] "https://fr.wikipedia.org/static/images/icons/wikipedia.png"


# Download image ----
download.file(url      = image_url,
              destfile = "wikipedia_logo.png",
              mode     = "wb")

Finding the right selector

  • Press CTRL + U (Firefox) to display the HTML code of the page
  • Right click on a page element and click on Inspect
  • Install the SelectorGadget bookmarklet in your browser


Dynamic web pages

  • Have a look at the function session() of the package rvest (see the sketch below)
  • Have a look at the package RSelenium
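
A minimal session() sketch (hedged: the URL and selectors are illustrative, not from the original slides):


# Navigate a site statefully with a rvest session ----
s <- rvest::session("https://fr.wikipedia.org")

# A session can be parsed like a regular page ----
rvest::html_element(s, css = "h1") |> 
  rvest::html_text2()

# Follow the first link on the page (illustrative) ----
s <- rvest::session_follow_link(s, css = "a")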

Ethics and legalities

Well… it’s complicated.



 Read Chapter 24.2 of the book R for Data Science by Wickham, Cetinkaya-Rundel & Grolemund.


 Be nice on the web with the package polite
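
A minimal sketch of polite (hedged against the package documentation): bow() introduces your client to the host and checks its robots.txt, then scrape() performs a rate-limited request.


# Install 'polite' package ----
install.packages("polite")

# Introduce yourself to the host and check robots.txt ----
session <- polite::bow(
  url        = "https://fr.wikipedia.org",
  user_agent = "me@example.org"   # hypothetical contact
)

# Scrape politely (rate-limited, cached) ----
page <- polite::scrape(session)

rvest::html_element(page, css = "h1") |> 
  rvest::html_text2()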

Thanks