Raw vs. derived data
Data cleaning
Working with taxonomy
Exercise
General recommendations
download.file(), wget, curl, etc.
README, metadata, etc.
tidy data
Proposed files organization
.
│
├── data/
│   ├── raw-data/
│   │   ├── raw-data-1.csv
│   │   ├── ...
│   │   └── README.md
│   │
│   └── derived-data/
│       ├── derived-data-1.RData
│       └── ...
│
└── code/
    ├── process-raw-data-1.R
    └── ...
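This layout can be created from R. The following is a minimal sketch, assuming a temporary directory as the project root and a toy CSV standing in for a downloaded raw file (the project name, the toy values, and the filtering rule are illustrative, not part of the original material):

```r
## Create the proposed folders in a temporary project root ----
root <- file.path(tempdir(), "my-project")   # hypothetical project location
dirs <- file.path(root, c("data/raw-data", "data/derived-data", "code"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

## Store a raw file (toy data standing in for a download) ----
raw_file <- file.path(root, "data", "raw-data", "raw-data-1.csv")
write.csv(data.frame(site = c("a", "b"), count = c(3, 5)),
          raw_file, row.names = FALSE)

## Derive data with code, never by editing the raw file ----
raw     <- read.csv(raw_file)
derived <- raw[raw$count > 3, ]
save(derived, file = file.path(root, "data", "derived-data",
                               "derived-data-1.RData"))
```

Keeping the derivation in a script under code/ means the derived file can always be rebuilt from the untouched raw file.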
Alternative
.
│
├── data/
│   ├── raw-data-1.csv
│   ├── ...
│   └── README.md
│
├── outputs/
│   ├── output-1.RData
│   └── ...
│
└── code/
    ├── process-raw-data-1.R
    └── ...
and many more…
Available at: https://sfirke.github.io/janitor/
Function | Description |
---|---|
clean_names() | Clean the column names of a data.frame |
remove_constant() | Remove constant columns |
remove_empty() | Remove empty rows and/or columns |
get_dupes() | Identify rows with identical values in the specified columns |
single_value() | Check if a column has only a single value |
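A short illustration of three of these functions. This is a sketch, not the original slide code: the toy table dat is made up, and the object name iris_nice mirrors the one used in the dplyr example further on (that it was created this way is an assumption).

```r
## Clean column names of the iris dataset ----
iris_nice <- janitor::clean_names(iris)
names(iris_nice)

## Toy table with one empty and one constant column ----
dat <- data.frame(site  = c("a", "b", "c"),
                  count = c(1, 2, 3),
                  notes = NA,
                  unit  = "cm")

dat_clean <- dat |>
  janitor::remove_empty(which = "cols") |>   # drops 'notes' (all NA)
  janitor::remove_constant()                 # drops 'unit' (single value)

names(dat_clean)
```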
Available at: https://dplyr.tidyverse.org/
Function | Description |
---|---|
arrange() | Order rows using column values |
filter() | Keep rows that match a condition |
select() | Keep or drop columns using their names |
mutate() | Create, modify, and delete columns |
distinct() | Keep distinct/unique rows |
slice() | Subset rows using their positions |
rename() | Rename columns |
pull() | Extract a single column |
group_by() | Group by one or more variables |
summarise() | Summarise each group down to one row |
The R package dplyr is very useful to:
select columns by name only (no need for data$var or data[ , "var"]): dplyr::select(column1, column2)
filter specific rows of the selected columns: dplyr::filter(level1 == a, level2 > x)
apply a function to subsets of the table: dplyr::group_by(factor_column) and then dplyr::mutate(new_column = old_column * 2)
put the results back together in a data.frame, tibble, etc.
dplyr example
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
## Manipulate data ----
## (iris_nice is assumed to be iris with snake_case column names)
iris_nice |>
  dplyr::select(sepal_length, sepal_width, species) |>
  dplyr::filter(sepal_length <= 5) |>
  dplyr::group_by(species) |>
  dplyr::mutate(mean_length = mean(sepal_length),
                mean_width  = mean(sepal_width))
# A tibble: 32 × 5
# Groups:   species [3]
   sepal_length sepal_width species mean_length mean_width
          <dbl>       <dbl> <fct>         <dbl>      <dbl>
 1          4.9         3   setosa         4.76       3.20
 2          4.7         3.2 setosa         4.76       3.20
 3          4.6         3.1 setosa         4.76       3.20
 4          5           3.6 setosa         4.76       3.20
 5          4.6         3.4 setosa         4.76       3.20
 6          5           3.4 setosa         4.76       3.20
 7          4.4         2.9 setosa         4.76       3.20
 8          4.9         3.1 setosa         4.76       3.20
 9          4.8         3.4 setosa         4.76       3.20
10          4.8         3   setosa         4.76       3.20
# ℹ 22 more rows
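The example above uses mutate(), which keeps one row per observation and repeats the group mean on each row. For comparison, here is a minimal sketch of the summarise() variant, which collapses each group to a single row. It uses the built-in iris dataset with its original column names so that it runs on its own; the object name iris_summary is illustrative.

```r
## Summarise sepal length by species, then sort ----
iris_summary <- iris |>
  dplyr::group_by(Species) |>
  dplyr::summarise(n           = dplyr::n(),
                   mean_length = mean(Sepal.Length)) |>
  dplyr::arrange(dplyr::desc(mean_length))

iris_summary
```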
The family of dplyr::*_join() functions can handle all types of data.frame merges. It’s the R equivalent of SQL joins.
Let’s load the band_instruments and band_members datasets from dplyr.
left_join() keeps all rows in x.
inner_join() only keeps rows from x that have a matching key in y.
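The two behaviours can be compared on the band datasets. A minimal sketch, assuming name is used as the join key (the object names left and inner are illustrative):

```r
## Datasets shipped with dplyr ----
band_members     <- dplyr::band_members
band_instruments <- dplyr::band_instruments

## Keep all rows of x (band_members) ----
left <- dplyr::left_join(band_members, band_instruments, by = "name")

## Keep only rows of x that have a match in y ----
inner <- dplyr::inner_join(band_members, band_instruments, by = "name")

left
inner
```

With left_join(), members without a recorded instrument are kept and get NA in the plays column; with inner_join() they are dropped.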
dplyr cheatsheet
Available at: https://tidyr.tidyverse.org/
tidyr::pivot_longer()
tidyr::pivot_wider()
The function tidyr::pivot_longer() gathers information spread over multiple columns into a single column. This is commonly needed to reformat datasets that were optimized for ease of data entry or comparison rather than ease of analysis.
This long format is the one preferred by ggplot2 functions.
# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
## Pivot to longer ----
relig_income <- tidyr::relig_income   # dataset shipped with tidyr
newtab <- relig_income |>
  tidyr::pivot_longer(cols      = !religion,
                      names_to  = "income",
                      values_to = "count")
newtab
# A tibble: 180 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows
The function tidyr::pivot_wider() does the opposite of tidyr::pivot_longer(): it spreads the values of one column over multiple columns.
## Pivot to wider ----
back <- newtab |>
  tidyr::pivot_wider(names_from  = income,
                     values_from = count)
back
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
tidyr cheatsheet
The function paste()
The function strsplit()
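A short illustration of these two base R functions. This is a sketch with made-up object names; the species name is borrowed from the taxonomy section of this document.

```r
## paste(): combine strings ----
genus    <- "Acanthurus"
epithet  <- "lineatus"
binomial <- paste(genus, epithet)              # default separator is a space
key      <- paste(genus, epithet, sep = "_")   # custom separator

## strsplit(): split a string into pieces (always returns a list) ----
parts <- strsplit(binomial, split = " ")[[1]]
```

Because strsplit() returns a list with one element per input string, [[1]] is needed to get the character vector for a single string.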
Available at: https://stringr.tidyverse.org/
Function | Description |
---|---|
str_detect() | Detect the presence/absence of a pattern |
str_extract() | Extract the first match from each string |
str_extract_all() | Extract all matches from each string |
str_replace() | Replace the first match with a text |
str_replace_all() | Replace all matches with a text |
str_remove() | Remove the first match |
str_remove_all() | Remove all matches |
str_split() | Split up a string into pieces |
str_to_upper() | Convert a string to upper case |
str_to_lower() | Convert a string to lower case |
The stringr package uses a standardized set of functions applied to strings:
Detect pattern
Subset matches
Replace pattern: str_replace() replaces the first occurrence
Character case: str_to_upper() converts to upper case
And many more…
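One function per family, as a sketch. The input vector and object names are illustrative, and str_subset() is only one possible choice for the "subset matches" family.

```r
x <- c("Acanthurus lineatus", "Ctenodon lineatus")

## Detect pattern ----
has_genus <- stringr::str_detect(x, "Acanthurus")

## Subset matches ----
synonyms <- stringr::str_subset(x, "^Ctenodon")

## Replace pattern (first occurrence only) ----
keys <- stringr::str_replace(x, " ", "_")

## Character case ----
upper <- stringr::str_to_upper(x)
```

All of these take the string vector as first argument and the pattern as second, which is what makes them easy to chain with the pipe.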
How can we specify generic patterns?
Regular expressions, or regex: a concise language for describing patterns in strings.
Examples
[[1]]
 [1] "M" "y" "e" "m" "a" "i" "l" "i" "s" "c" "a" "m" "i" "l" "l" "e" "g" "m" "a"
[20] "i" "l" "c" "o" "m"
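The code that produced this output is not shown. The sketch below gives the same 24 letters, assuming the input string was "My email is camille@gmail.com" (inferred from the output, not confirmed by the source). The e-mail pattern is deliberately simplified and is not a full validator.

```r
x <- "My email is camille@gmail.com"   # assumed input, inferred from the output

## Extract every letter (character class A-Z and a-z) ----
letters_only <- stringr::str_extract_all(x, "[A-Za-z]")

## Extract something that looks like an e-mail address ----
email <- stringr::str_extract(x, "[A-Za-z0-9.]+@[A-Za-z0-9.]+")

letters_only
email
```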
stringr cheatsheet
Available at: https://lubridate.tidyverse.org/
Function | Description |
---|---|
ymd(), ydm(), etc. | Parse dates with year, month, and day |
ms(), hm(), hms() | Parse periods |
ymd_hms(), dmy_hm() | Parse date-times |
year() | Get the year component of a date-time |
month() | Get the month component of a date-time |
hour() | Get the hour component of a date-time |
week() | Get the week component of a date-time |
now() | Get the current date and time |
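A few of these functions in action, as a sketch with illustrative dates:

```r
## Parse dates written in different orders ----
d1 <- lubridate::ymd("2024-11-05")
d2 <- lubridate::dmy("05/11/2024")

## Parse a date-time (UTC by default) ----
dt <- lubridate::ymd_hms("2008-01-15 17:27:08")

## Extract components ----
yr <- lubridate::year(d1)
mo <- lubridate::month(d1)
hr <- lubridate::hour(dt)
```

The parser name spells out the order of the components in the input, so the same calendar day written two ways parses to the same Date.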
lubridate cheatsheet
Many taxonomic data sources available
Most of these taxonomic databases provide an API
Most of these taxonomic databases provide an R package
rgbif example
Dedicated vignette on Working with taxonomic names
Accepted name
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | orderKey | familyKey | genusKey | speciesKey | synonym | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5209244 | Acanthurus lineatus (Linnaeus, 1758) | Acanthurus lineatus | SPECIES | ACCEPTED | 99 | EXACT | Animalia | Chordata | Perciformes | Acanthuridae | Acanthurus | Acanthurus lineatus | 1 | 44 | 587 | 4233 | 2379647 | 5209244 | FALSE | Acanthurus lineatus |
Synonym name
usageKey | acceptedUsageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | orderKey | familyKey | genusKey | speciesKey | synonym | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5209245 | 5209244 | Ctenodon lineatus (Linnaeus, 1758) | Ctenodon lineatus | SPECIES | SYNONYM | 98 | EXACT | Animalia | Chordata | Perciformes | Acanthuridae | Acanthurus | Acanthurus lineatus | 1 | 44 | 587 | 4233 | 2379647 | 5209244 | TRUE | Ctenodon lineatus |
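In the two results above, the column acceptedUsageKey is only present for the synonym. The sketch below shows one way to derive a single accepted key from rows shaped like these. The data frame is typed by hand from the tables (no call to GBIF is made), the NA for the accepted name is what stacking the two results would be expected to give, and the column name accepted_key is made up.

```r
## Rows copied by hand from the two tables above (subset of columns) ----
matches <- data.frame(
  verbatim_name    = c("Acanthurus lineatus", "Ctenodon lineatus"),
  usageKey         = c(5209244, 5209245),
  acceptedUsageKey = c(NA, 5209244),
  status           = c("ACCEPTED", "SYNONYM"),
  species          = c("Acanthurus lineatus", "Acanthurus lineatus")
)

## Use acceptedUsageKey for synonyms, usageKey otherwise ----
matches$accepted_key <- ifelse(is.na(matches$acceptedUsageKey),
                               matches$usageKey,
                               matches$acceptedUsageKey)

matches[, c("verbatim_name", "status", "accepted_key")]
```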
worrms example
Accepted name
AphiaID | url | scientificname | authority | status | unacceptreason | taxonRankID | rank | valid_AphiaID | valid_name | valid_authority | parentNameUsageID | kingdom | phylum | class | order | family | genus | citation | lsid | isMarine | isBrackish | isFreshwater | isTerrestrial | isExtinct | match_type | modified |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
159582 | https://www.marinespecies.org/aphia.php?p=taxdetails&id=159582 | Acanthurus lineatus | (Linnaeus, 1758) | accepted | NA | 220 | Species | 159582 | Acanthurus lineatus | (Linnaeus, 1758) | 125908 | Animalia | Chordata | Teleostei | Acanthuriformes | Acanthuridae | Acanthurus | Froese, R. and D. Pauly. Editors. (2024). FishBase. Acanthurus lineatus (Linnaeus, 1758). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=159582 on 2024-11-05 | urn:lsid:marinespecies.org:taxname:159582 | 1 | 0 | 0 | 0 | NA | like | 2008-01-15T17:27:08.177Z |
Synonym name
AphiaID | url | scientificname | authority | status | unacceptreason | taxonRankID | rank | valid_AphiaID | valid_name | valid_authority | parentNameUsageID | kingdom | phylum | class | order | family | genus | citation | lsid | isMarine | isBrackish | isFreshwater | isTerrestrial | isExtinct | match_type | modified |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
317534 | https://www.marinespecies.org/aphia.php?p=taxdetails&id=317534 | Ctenodon lineatus | (Linnaeus, 1758) | unaccepted | NA | 220 | Species | 159582 | Acanthurus lineatus | (Linnaeus, 1758) | 296440 | Animalia | Chordata | Teleostei | Acanthuriformes | Acanthuridae | Ctenodon | Froese, R. and D. Pauly. Editors. (2024). FishBase. Ctenodon lineatus (Linnaeus, 1758). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=317534 on 2024-11-05 | urn:lsid:marinespecies.org:taxname:317534 | 1 | 0 | 0 | 0 | NA | like | 2008-02-28T13:41:07.550Z |
Classification
AphiaID | rank | scientificname |
---|---|---|
2 | Kingdom | Animalia |
1821 | Phylum | Chordata |
146419 | Subphylum | Vertebrata |
1828 | Infraphylum | Gnathostomata |
152352 | Parvphylum | Osteichthyes |
10194 | Gigaclass | Actinopterygii |
843664 | Superclass | Actinopteri |
293496 | Class | Teleostei |
1517548 | Order | Acanthuriformes |
125515 | Family | Acanthuridae |
125908 | Genus | Acanthurus |
159582 | Species | Acanthurus lineatus |
rotl example
Accepted name
search_string | unique_name | approximate_match | score | ott_id | is_synonym | flags | number_matches |
---|---|---|---|---|---|---|---|
acanthurus lineatus | Acanthurus lineatus | FALSE | 1 | 93141 | FALSE | | 1 |
## Species to look for ----
species <- "Ctenodon lineatus"

## Check name in GBIF database ----
gbif <- rgbif::name_backbone(species) |>
  dplyr::select(species, acceptedUsageKey) |>
  dplyr::rename(gbif_acc_name = species,
                gbif_acc_id   = acceptedUsageKey)

## Check name in WORMS database ----
worms <- worrms::wm_records_name(species) |>
  dplyr::select(valid_name, valid_AphiaID) |>
  dplyr::rename(worms_acc_name = valid_name,
                worms_acc_id   = valid_AphiaID)

## Check name in OTL database ----
otl <- rotl::tnrs_match_names(species) |>
  dplyr::select(unique_name, ott_id) |>
  dplyr::rename(otl_acc_name = unique_name,
                otl_acc_id   = ott_id)

## Append results ----
data.frame("original_name" = species, gbif, worms, otl)
original_name | gbif_acc_name | gbif_acc_id | worms_acc_name | worms_acc_id | otl_acc_name | otl_acc_id |
---|---|---|---|---|---|---|
Ctenodon lineatus | Acanthurus lineatus | 5209244 | Acanthurus lineatus | 159582 | Acanthurus lineatus | 93141 |
Part 1 - Clean the PanTHERIA database from the previous exercise
Use readr::read_delim() to import data
Select the columns MSW05_Binomial, 5-1_AdultBodyMass_g, 8-1_AdultForearmLen_mm & 3-1_AgeatFirstBirth_d
Rename these columns to pan_binomial_name, pan_adult_body_mass_g, pan_adult_forearm_len_mm & pan_age_at_first_birth_d
Replace -999 by NA
Part 2 - Create a reference table from GBIF occurrences downloaded previously
Extract species names from the column scientificName
Part 3 - Merge this table with PanTHERIA (cleaned) to add trait values to these two species
Part 1
library(dplyr)

## Open PanTHERIA data ----
pantheria <- readr::read_delim("PanTHERIA_1-0_WR05_Aug2008.txt")

## Clean PanTHERIA data ----
pantheria <- pantheria |>
  select(MSW05_Binomial,
         `5-1_AdultBodyMass_g`,
         `8-1_AdultForearmLen_mm`,
         `3-1_AgeatFirstBirth_d`) |>
  rename(binomial_name            = MSW05_Binomial,
         pan_adult_body_mass_g    = `5-1_AdultBodyMass_g`,
         pan_adult_forearm_len_mm = `8-1_AdultForearmLen_mm`,
         pan_age_at_first_birth_d = `3-1_AgeatFirstBirth_d`) |>
  mutate(across(starts_with("pan_"), ~ ifelse(.x == -999, NA, .x)))
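The replacement of -999 by NA can be checked without the PanTHERIA file. The following is a minimal sketch on a toy table: the species names and trait values are made up, and dplyr::na_if() is used here as an alternative to the ifelse() call, not as the original solution.

```r
## Toy table using -999 as missing-value code (values are made up) ----
toy <- data.frame(binomial_name            = c("Aus bus", "Cus dus"),
                  pan_adult_body_mass_g    = c(12.5, -999),
                  pan_age_at_first_birth_d = c(-999, 365))

## Replace -999 by NA in all trait columns ----
toy_clean <- toy |>
  dplyr::mutate(dplyr::across(dplyr::starts_with("pan_"),
                              ~ dplyr::na_if(.x, -999)))

toy_clean
```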
Part 2
## Open GBIF occurrences ----
gbif_occ <- read.csv("gbif_occurrences.csv")

## Extract & clean species names ----
## (binomial = first two words; duplicates are removed after extraction)
gbif_species <- gbif_occ |>
  select(scientificName) |>
  mutate(scientificName = stringr::str_extract(scientificName,
                                               "^[A-Za-z]+\\s[a-z]+")) |>
  rename(binomial_name = scientificName) |>
  distinct()
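The extraction of binomial names can be tried on the scientific names shown in the rgbif tables earlier. This is a small sketch; note that a character class written [A-z] would also match a few punctuation characters ([, \, ], ^, _ and `), which is why [A-Za-z] is used here.

```r
## Scientific names with authorship, as returned by GBIF ----
sci_names <- c("Acanthurus lineatus (Linnaeus, 1758)",
               "Ctenodon lineatus (Linnaeus, 1758)")

## Keep the first two words (genus + specific epithet) ----
binomials <- stringr::str_extract(sci_names, "^[A-Za-z]+\\s[a-z]+")

binomials
```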
Part 3
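The solution for this part is not included here. A possible approach, sketched on toy stand-ins for the two tables built in Parts 1 and 2 (species names and trait values are made up; in the exercise, gbif_species and pantheria would be the objects created above), is a left join on binomial_name:

```r
## Toy stand-ins for the objects built in Parts 1 and 2 ----
gbif_species <- data.frame(binomial_name = c("Aus bus", "Cus dus"))

pantheria <- data.frame(binomial_name         = c("Aus bus", "Eus fus"),
                        pan_adult_body_mass_g = c(12.5, 300))

## Add trait values to the species list ----
species_traits <- dplyr::left_join(gbif_species, pantheria,
                                   by = "binomial_name")

species_traits
```

A left join keeps every species of the reference table, so a species missing from PanTHERIA shows up with NA traits instead of being silently dropped.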