| Title: | Access Nature Media Repositories |
|---|---|
| Description: | Streamline searching/downloading of nature media files (e.g. audio, photos) from online repositories. The package offers functions for obtaining media metadata from online repositories, downloading associated media files and updating data sets with new records. |
| Authors: | Marcelo Araya-Salas [aut, cre] (ORCID: <https://orcid.org/0000-0003-3594-619X>), Jorge Elizondo-Calvo [aut] (ORCID: <https://orcid.org/0009-0004-5873-9585>), Alejandro Rico-Guevara [aut] (ORCID: <https://orcid.org/0000-0003-4067-5312>) |
| Maintainer: | Marcelo Araya-Salas <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.2.0 |
| Built: | 2026-03-20 01:32:14 UTC |
| Source: | https://github.com/ropensci/suwo |
download_media downloads media files from online repositories.

    download_media(
      metadata,
      path = ".",
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      verbose = getOption("suwo_verbose", TRUE),
      overwrite = FALSE,
      folder_by = NULL
    )

metadata |
data frame previously obtained from any suwo query
function (i.e. |
path |
Directory path where the output media files will be saved.
By default files are saved into the current working directory ( |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
overwrite |
Logical. If TRUE, existing files (in |
folder_by |
Character string with the name of a character or factor
column in the metadata data frame. If supplied the function will use the
unique values in that column to create subfolders within |
This function takes the output data frame of any of the
"query_reponame()" functions and downloads the associated media files. The
function downloads all files into a single directory
(argument "path"). The download process can be interrupted and
resumed later as long as the working directory is the same. Users only need
to rerun the same function call. By default only the missing files will be
downloaded when resuming. It can also be used on an updated query output
(see update_metadata()) to add the new media files to the
existing media pool.
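As a rough illustration of the folder_by argument, the sketch below shows how the unique values of a metadata column could map to subfolders of path. The toy data frame, its column names and the direct use of column values as folder names are assumptions made for illustration; download_media() may name or sanitize folders differently.

```r
# Toy metadata (hypothetical values; real suwo output has more columns)
metadata <- data.frame(
  species = c("Turdus iliacus", "Turdus iliacus", "Calypte costae"),
  format = c("sound", "image", "image"),
  stringsAsFactors = FALSE
)

path <- tempdir()
folder_by <- "format"

# One subfolder per unique value in the chosen column
subfolders <- file.path(path, unique(metadata[[folder_by]]))

# Create the subfolders that do not exist yet
for (f in subfolders) {
  if (!dir.exists(f)) dir.create(f, recursive = TRUE)
}

basename(subfolders)
```

Grouping by a column such as format keeps sounds and images apart on disk, which can simplify later batch processing.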
Downloads media files into the supplied directory path
("path") and returns (invisibly) the input data frame with
two additional columns: downloaded_file_name with the name of
the downloaded file (if downloaded or already in the directory), and
download_status with the result of the download process for each
file (either "saved", "overwritten", "already there (not downloaded)",
or "failed").
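The two added columns make it easy to audit a download run. The sketch below uses a toy stand-in for the returned data frame; the key column and all row values are hypothetical, and only the documented column names and status labels are taken from the text above.

```r
# Toy stand-in for the data frame returned by download_media()
# (hypothetical rows; only the two documented columns are mimicked)
downloaded <- data.frame(
  key = c("rec1", "rec2", "rec3", "rec4"),
  downloaded_file_name = c("rec1.mp3", "rec2.mp3", NA, "rec4.mp3"),
  download_status = c("saved", "already there (not downloaded)",
                      "failed", "saved"),
  stringsAsFactors = FALSE
)

# Count outcomes per status
status_counts <- table(downloaded$download_status)
print(status_counts)

# Keep the rows whose download failed, e.g. to inspect them
failed <- downloaded[downloaded$download_status == "failed", ]
failed$key
```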
Marcelo Araya-Salas ([email protected])
query_gbif(), query_macaulay()

    a_zambiana <- query_inaturalist(species = "Amanita zambiana", format = "image")

    # run if query didn't fail
    if (!is.null(a_zambiana)) {
      # download the first two files
      phae_anth_downl <- download_media(metadata = a_zambiana[1:2, ], path = tempdir())
    }

find_duplicates detects possible duplicated entries in merged
metadata from several repositories.

    find_duplicates(
      metadata,
      sort = TRUE,
      criteria = paste("country > 0.8", "locality > 0.5", "user_name > 0.8",
                       "time == 1", "date == 1", sep = " & "),
      verbose = getOption("suwo_verbose", TRUE)
    )

metadata |
data frame obtained from combining the output metadata of
two or more suwo query functions using the |
sort |
Logical argument indicating if the output data frame should be
sorted by the |
criteria |
A character string indicating the criteria to use to
determine duplicates. By default, the criteria is set to
|
verbose |
Logical argument that determines if text is shown in
console. Default is |
This function compares the information in the entries of a
combined metadata data frame (typically the output of
merge_metadata()) and labels those possible duplicates with
a common index in a new column named duplicate_group. The comparison is
based on the similarity of the following fields: user_name, locality,
time and country. Only rows with no missing data for those fields will
be considered. The function uses the RecordLinkage package to perform a
fuzzy matching comparison and identify potential duplicates based on
predefined similarity thresholds (see argument 'criteria'). The function
only spots duplicates from different repositories and assumes those
duplicates should have the same format and date. This function is
useful for curating the data obtained by merging data from multiple sources,
as the same observation might be recorded in different repositories.
This is a common issue in citizen science repositories, where users might
upload the same observation to different platforms. This can also occur as
some repositories automatically share data with other repositories,
particularly with GBIF.
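To give a feel for what a similarity threshold such as "locality > 0.5" means, the sketch below computes a normalized edit-distance similarity with base R. This is only an illustration: the helper string_similarity() is hypothetical, the example strings are invented, and RecordLinkage may rely on a different string metric. The last lines build a stricter criteria string, assuming the argument accepts the same expression syntax as its default.

```r
# Normalized edit-distance similarity in [0, 1] using base R's adist().
# NOTE: illustration only; RecordLinkage may use a different string metric.
string_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  1 - d / max(nchar(a), nchar(b))
}

# Invented example values for two records of the same observation
loc_sim <- string_similarity("Monteverde, Puntarenas", "Monteverde Puntarenas")
user_sim <- string_similarity("M. Araya-Salas", "Marcelo Araya-Salas")

# Thresholds mirroring the default 'criteria'
loc_ok <- loc_sim > 0.5    # "locality > 0.5"
user_ok <- user_sim > 0.8  # "user_name > 0.8"

# A stricter criteria string built the same way as the default
strict_criteria <- paste("country > 0.9", "locality > 0.8", "user_name > 0.9",
                         "time == 1", "date == 1", sep = " & ")
strict_criteria
```

Under this toy metric the abbreviated user name falls below the 0.8 threshold, which shows how abbreviations can keep true duplicates from being flagged.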
A data frame with the input data frame and an additional column
named duplicate_group indicating potential duplicates with a common index.
Entries without potential duplicates are labeled as NA in this new column.
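The duplicate_group column can be summarized with base R. The sketch below works on a toy labeled data frame; the key column and its values are hypothetical, and only the duplicate_group column and its NA convention come from the description above.

```r
# Toy output of find_duplicates() (hypothetical keys and groups)
labeled <- data.frame(
  key = c("xc1", "gb7", "in3", "gb9", "wa2"),
  duplicate_group = c(1, 1, NA, 2, 2),
  stringsAsFactors = FALSE
)

# Entries flagged as potential duplicates have a non-NA group index
has_dup <- !is.na(labeled$duplicate_group)
sum(has_dup)

# Number of entries in each duplicate group
group_sizes <- table(labeled$duplicate_group)
group_sizes

# List the entries that belong to each group
groups <- split(labeled$key[has_dup], labeled$duplicate_group[has_dup])
groups
```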
Marcelo Araya-Salas ([email protected])

    # get metadata from 2 repos
    gb <- query_gbif(species = "Turdus rufiventris", format = "sound")
    inat <- query_inaturalist(species = "Turdus rufiventris", format = "sound")

    # run if queries didn't fail
    if (!is.null(gb) && !is.null(inat)) {
      # combine metadata
      merged_metadata <- merge_metadata(inat, gb)

      # find duplicates
      label_dup_metadata <- find_duplicates(metadata = merged_metadata)
    }

map_locations creates maps to visualize the geographic spread of
media records.

    map_locations(
      metadata,
      cluster = FALSE,
      palette = grDevices::hcl.colors,
      by = "species"
    )

metadata |
data frame previously obtained from any suwo query
function (i.e. |
cluster |
Logical to control if icons are clustered by locality.
Default is |
palette |
Color palette function used for location markers. By default
it uses the viridis palette ( |
by |
Name of column to be used for coloring markers. Default is "species". |
This function creates maps for visualizing the geographic spread of observations. Note that only observations with geographic coordinates are displayed.
An interactive map with the locations of the observations.
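Since only observations with geographic coordinates are displayed, it can help to check beforehand how many records are mappable. The sketch below uses a toy data frame; the column names latitude and longitude and all values are assumptions, so adjust them to the columns actually present in the query output.

```r
# Toy metadata (hypothetical coordinates; column names are assumptions)
obs <- data.frame(
  species = rep("Entoloma hochstetteri", 3),
  latitude = c(-41.29, NA, -36.85),
  longitude = c(174.78, 175.10, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows with both coordinates available
mappable <- obs[complete.cases(obs[, c("latitude", "longitude")]), ]
nrow(mappable)

# One color per level of the grouping column, using the default palette function
by_values <- unique(mappable$species)
marker_cols <- grDevices::hcl.colors(length(by_values))
marker_cols
```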
Marcelo Araya-Salas ([email protected]) and Grace Smith Vidaurre

    # search in GBIF
    e_hochs <- query_gbif(species = "Entoloma hochstetteri", format = "image")

    # run if query didn't fail
    if (!is.null(e_hochs)) {
      # create map
      map_locations(e_hochs)
    }

merge_metadata merges metadata data frames from suwo queries.

    merge_metadata(..., check_columns = TRUE)

... |
two or more data frames (each one as a separate entry) referring
to the metadata obtained from suwo query functions ( |
check_columns |
Logical argument indicating if the function should
check that all input data frames have the required basic columns.
Default is |
This function combines metadata from multiple sources
(e.g. WikiAves and xeno-canto) into a single data frame for easier analysis
and comparison. Each input data frame must be obtained from one of the
suwo query functions (e.g., query_wikiaves(), query_xenocanto(), etc.)
with raw_data = FALSE.
A single data frame with the data from all input data frames
combined and with an additional column named source indicating the
original data frame from which each row originated. The column source will
contain the name provided for each data frame (either as individual data
frames or in a list). If no names were provided, the object names will
be used instead.
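The shape of the returned data frame can be pictured with a base R sketch. This is not the package's implementation: it simply row-binds two toy data frames (hypothetical columns and values) and tags each row with the name of the input it came from, mirroring the source column described above.

```r
# Toy query outputs (hypothetical columns and values)
wa <- data.frame(species = "Glaucis dohrnii", format = "sound",
                 repository = "WikiAves", stringsAsFactors = FALSE)
gb <- data.frame(species = rep("Glaucis dohrnii", 2), format = "sound",
                 repository = "GBIF", stringsAsFactors = FALSE)

# Named inputs: the names become the values of the 'source' column
inputs <- list(wikiaves = wa, gbif = gb)

# Tag each data frame with its name, then stack them
tagged <- Map(function(df, nm) {
  df$source <- nm
  df
}, inputs, names(inputs))
merged <- do.call(rbind, tagged)
rownames(merged) <- NULL

table(merged$source)
```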
Marcelo Araya-Salas ([email protected])

    # get metadata from 2 repos
    wa <- query_wikiaves(species = "Glaucis dohrnii", format = "sound")
    gb <- query_gbif(species = "Glaucis dohrnii", format = "sound")

    # run if queries didn't fail
    if (!is.null(wa) && !is.null(gb)) {
      # combine metadata using single data frames
      merged_mt <- merge_metadata(wa, gb)

      # combine metadata using named single data frames
      merged_mt <- merge_metadata(wikiaves = wa, gbif = gb)

      # combine metadata using a list of data frames
      mt_list <- list(wikiaves = wa, gbif = gb)
      merged_mt <- merge_metadata(mt_list)
    }

query_gbif searches for metadata from GBIF.

    query_gbif(
      species = getOption("suwo_species"),
      format = getOption("suwo_format", c("image", "sound", "video", "interactive resource")),
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      verbose = getOption("suwo_verbose", TRUE),
      dataset = NULL,
      all_data = getOption("suwo_all_data", FALSE),
      raw_data = getOption("suwo_raw_data", FALSE)
    )

species |
Character string with the scientific name of a species in
the format: "Genus epithet". Required. Can be set globally for the current
R session via the "suwo_species" option
(e.g. |
format |
Character vector with the media format to query for.
Options are 'sound', 'image', 'video' and 'interactive resource'.
Can be set globally for the current R session via the "suwo_format"
option (e.g. |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
dataset |
The name of a specific dataset in which to focus the query (by default it searches across all available datasets). Users can check available dataset names by downloading this CSV file: https://api.gbif.org/v1/dataset/search/export?format=CSV&. See https://www.gbif.org/dataset/search?q= for more details. |
all_data |
Logical argument that determines if all data available
from database is shown in the results of search. Default is |
raw_data |
Logical argument that determines if the raw data from the
repository is returned (e.g. without any manipulation).
Default is |
This function queries for species observation info in the
open-access
online repository GBIF.
GBIF (the Global Biodiversity Information Facility) is an international
network and data infrastructure funded by the world's governments and
aimed at providing open access to data about all types of life on Earth.
Note that some of the records returned by this function could be duplicates
of records returned by other suwo functions
(e.g., query_inaturalist()).
The function returns a data frame with the metadata of the media
files matching the search criteria. If all_data = TRUE, all metadata
fields (columns) are returned. If raw_data = TRUE, the raw data as
obtained from the repository is returned (without any formatting).
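The usage above shows that several arguments take their defaults from session options through getOption(). The sketch below sets two of those options with base R and shows how the lookup falls back to the hard-coded default when an option is unset. The option names come from the usage section; the species and format values are just examples.

```r
# Set session-wide defaults (previous values are saved for restoring later)
old_opts <- options(suwo_species = "Turdus rufiventris", suwo_format = "sound")

# These lookups mirror the defaults in the function signature
species_default <- getOption("suwo_species")
format_default <- getOption("suwo_format",
                            c("image", "sound", "video", "interactive resource"))

# An option that has not been set falls back to the second argument
all_data_default <- getOption("suwo_all_data", FALSE)

species_default
format_default

# Restore the previous option values
options(old_opts)
```

With these options set, a call without explicit species or format arguments would pick up the session values, which avoids repeating them across several query functions.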
Marcelo Araya-Salas ([email protected])
GBIF.org (2024), GBIF Home Page. Available from: https://www.gbif.org/

    # search dink frog sound files

query_inaturalist searches for metadata from
iNaturalist.

    query_inaturalist(
      species = getOption("suwo_species"),
      format = getOption("suwo_format", c("image", "sound")),
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      verbose = getOption("suwo_verbose", TRUE),
      all_data = getOption("suwo_all_data", FALSE),
      raw_data = getOption("suwo_raw_data", FALSE),
      identified = FALSE,
      verifiable = FALSE
    )

species |
Character string with the scientific name of a species in
the format: "Genus epithet". Required. Can be set globally for the current
R session via the "suwo_species" option
(e.g. |
format |
Character vector with the media format to query for.
Currently 'image' and 'sound' are available. Can be set globally for
the current R session via the "suwo_format" option
(e.g. |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
all_data |
Logical argument that determines if all data available
from database is shown in the results of search. Default is |
raw_data |
Logical argument that determines if the raw data from the
repository is returned (e.g. without any manipulation).
Default is |
identified |
Logical argument to define if search results are categorized as identified by iNaturalist. |
verifiable |
Logical argument to define if search results are categorized as verifiable by iNaturalist. |
This function queries for species observation info in the open-access online repository iNaturalist. iNaturalist is a free, crowdsourced online platform for nature enthusiasts to document and identify plants, animals, fungi, and other organisms in the wild. Note that iNaturalist observations do not include a 'country' field.
The function returns a data frame with the metadata of the media
files matching the search criteria. If all_data = TRUE, all metadata
fields (columns) are returned. If raw_data = TRUE, the raw data as
obtained from the repository is returned (without any formatting).
Marcelo Araya-Salas ([email protected])
iNaturalist. Available from https://www.inaturalist.org. (Accessed on 10-02-2026)

    # search Bleeding Tooth mushroom images

query_macaulay searches for metadata from
Macaulay Library.

    query_macaulay(
      species = getOption("suwo_species"),
      taxon_code = NULL,
      format = getOption("suwo_format", c("image", "sound", "video")),
      verbose = getOption("suwo_verbose", TRUE),
      all_data = getOption("suwo_all_data", FALSE),
      raw_data = getOption("suwo_raw_data", FALSE),
      path = ".",
      files = NULL,
      dates = NULL,
      taxon_code_info = ml_taxon_code
    )

species |
Character string with the scientific name of a species in
the format: "Genus epithet". Required. Can be set globally for the current
R session via the "suwo_species" option
(e.g. |
taxon_code |
Optional character string with the Macaulay Library taxon code (see vignette for more details). If provided, 'species' is ignored. |
format |
Character vector with the media format to query for. Options
are 'sound', 'image' or 'video'. Can be set globally for the current R
session via the "suwo_format" option (e.g. |
verbose |
Logical argument that determines if text is shown in
console. Default is |
all_data |
Logical argument that determines if all data available
from database is shown in the results of search. Default is |
raw_data |
Logical argument that determines if the raw data from the
repository is returned (e.g. without any manipulation).
Default is |
path |
Directory path where the .csv file will be saved. By default it
is saved into the current working directory ( |
files |
Optional character vector with the name(s) of the .csv file(s) to read. If provided, the function will import the data from the .csv files instead of opening the Macaulay Library search page in a browser ('species' is ignored if supplied). |
dates |
Optional numeric vector with years to split the search. If
provided, the function will perform separate queries for each date range
(between consecutive date values) and combine the results. Useful for
queries that return a large number of results (i.e. above the 10000-result limit).
For example, to search for the species from 2010 to 2020 and from 2021
to 2025 use |
taxon_code_info |
Data frame containing the taxon code information.
By default the function will use the internal data frame
|
This function queries for species observation info in the
Macaulay Library online
repository and returns the metadata of media files matching the query. The
Macaulay Library is the world’s largest repository of digital media
(audio, photo, and video) of wildlife (mostly birds but also other
vertebrates and invertebrates), and their habitats. The archive
hosts more than 77 million images, 3 million sound recordings, and
350k videos, from more than 80k contributors, and is integrated with
eBird, the world’s largest biodiversity dataset.
For bird species the species name must be valid according to the Macaulay
Library taxonomy (which follows the Clements checklist). For non-bird
species users must use the argument taxon_code. The species taxon code
can be found by running a search at the
Macaulay Library's search page and
checking the URL of the species page. For instance, the URL when searching
for jaguar (Panthera onca) is
'https://search.macaulaylibrary.org/catalog?taxonCode=t-11032765'
so the taxon code is "t-11032765". If all_data = TRUE, all metadata
fields (columns) are returned. If raw_data = TRUE, the raw data as
obtained from the repository is returned (without any formatting).
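The taxon code can also be pulled out of a copied search URL programmatically. The sketch below is a minimal base R example using the jaguar URL quoted above; the regular expression assumes the code appears as the value of a taxonCode query parameter.

```r
# URL copied from the Macaulay Library search page (jaguar example above)
ml_url <- "https://search.macaulaylibrary.org/catalog?taxonCode=t-11032765"

# Extract the value of the 'taxonCode' query parameter
taxon_code <- sub(".*[?&]taxonCode=([^&]+).*", "\\1", ml_url)
taxon_code
#> [1] "t-11032765"
```

The resulting string is the kind of value expected by the taxon_code argument.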
Here are some instructions for using this function properly:
- Valid bird species names can be checked at suwo:::ml_taxon_code$SCI_NAME.
- Users must save the .csv file manually.
- If the file is saved overwriting a pre-existing file (i.e. same file name) the function will not detect it.
- A maximum of 10000 records per query can be returned, but this can be bypassed by using the dates argument to split the search into smaller date ranges.
- Users must log in to the Macaulay Library/eBird account in order to access large batches of observations.
This is an interactive function which opens a browser window to the
Macaulay Library's search page, where the user must download a .csv file
with the metadata. The function then reads the .csv file and returns a data
frame with the metadata. The function can also import previously downloaded
metadata (in csv format) with the argument files.
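To see how a dates vector translates into separate queries, the sketch below pairs consecutive values into ranges with base R. The vector is the one used in the package example; how the function treats range boundaries (inclusive or exclusive) is not stated here, so the table only shows the nominal start and end of each range.

```r
# Years used to split the search (taken from the package example)
dates <- c(1976, 2019, 2022, 2024, 2025, 2026)

# Each pair of consecutive values defines one query range
date_ranges <- data.frame(
  start = head(dates, -1),
  end = dates[-1]
)
date_ranges

# Number of separate queries (and .csv files to save) this would imply
nrow(date_ranges)
```

Narrower ranges are mainly useful for recent years, where most records tend to accumulate and a single range is more likely to hit the record limit.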
Marcelo Araya-Salas ([email protected])
Scholes III, Ph.D. E (2015). Macaulay Library Audio and Video Collection. Cornell Lab of Ornithology. Occurrence dataset https://doi.org/10.15468/ckcdpy accessed via GBIF.org on 2024-05-09.
Clements, J. F., P. C. Rasmussen, T. S. Schulenberg, M. J. Iliff, J. A. Gerbracht, D. Lepage, A. Spencer, S. M. Billerman, B. L. Sullivan, M. Smith, and C. L. Wood. 2025. The eBird/Clements checklist of Birds of the World: v2025. Downloaded from https://www.birds.cornell.edu/clementschecklist/download/

    if (interactive()) {
      # query sounds
      tur_ili <- query_macaulay(species = "Turdus iliacus", format = "sound",
                                path = tempdir())

      # test a query with more than 10000 results paging by date
      # this example splits by entire year intervals
      cal_cos <- query_macaulay(species = "Calypte costae", format = "image",
                                path = tempdir(),
                                dates = c(1976, 2019, 2022, 2024, 2025, 2026))

      # this example splits by year-month intervals (as dates have decimals)
      cal_cos <- query_macaulay(species = "Calypte costae", format = "image",
                                path = tempdir(),
                                dates = seq(2020, 2026, length.out = 10))

      # update Clements list (note that this is actually the same list used in the
      # current 'suwo' version, just for the sake of the example)
      # url to the Clements list version October 2024
      # (split so it is not truncated by CRAN)
      clements_url <- paste0(
        "https://www.birds.cornell.edu/clementschecklist/wp-content/uploads/2024/10",
        "/Clements-v2024-October-2024-rev.csv"
      )

      # read list from url
      new_clements <- utils::read.csv(clements_url)

      # provide "updated" Clements list to query_macaulay()
      tur_ili2 <- query_macaulay(species = "Turdus iliacus", format = "sound",
                                 taxon_code_info = new_clements, path = tempdir())

      # query using taxon code
      # this is the URL when querying jaguars:
      # https://search.macaulaylibrary.org/catalog?taxonCode=t-11032765
      p_onca <- query_macaulay(taxon_code = "t-11032765", format = "image")
    }

query_wikiaves searches for metadata from
WikiAves.

    query_wikiaves(
      species = getOption("suwo_species"),
      format = getOption("suwo_format", c("image", "sound")),
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      verbose = getOption("suwo_verbose", TRUE),
      all_data = getOption("suwo_all_data", FALSE),
      raw_data = getOption("suwo_raw_data", FALSE)
    )

species |
Character string with the scientific name of a species in
the format: "Genus epithet". Required. Can be set globally for the current
R session via the "suwo_species" option
(e.g. |
format |
Character vector with the media format to query for.
Options are 'image' or 'sound'. Can be set globally for
the current R session via the "suwo_format" option
(e.g. |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
all_data |
Logical argument that determines if all data available
from database is shown in the results of search. Default is |
raw_data |
Logical argument that determines if the raw data from the
repository is returned (e.g. without any manipulation).
Default is |
This function queries for avian digital media in the open-access online repository WikiAves and returns its metadata. WikiAves is a Brazilian online platform and citizen science project that serves as the largest community for birdwatchers in Brazil. It functions as a collaborative, interactive encyclopedia of Brazilian birds, where users contribute georeferenced photographs and sound recordings, which are then used to build a vast database for research and conservation.
The function returns a data frame with the metadata of the media
files matching the search criteria. If all_data = TRUE, all metadata
fields (columns) are returned. If raw_data = TRUE, the raw data as
obtained from the repository is returned (without any formatting).
Marcelo Araya-Salas ([email protected])
Schubert, Stephanie Caroline, Lilian Tonelli Manica, and André De Camargo Guaraldo. 2019. Revealing the potential of a huge citizen-science platform to study bird migration. Emu-Austral Ornithology 119.4: 364-373.

    # search
    p_nattereri <- query_wikiaves(species = "Phaethornis nattereri", format = "image")

query_xenocanto searches for metadata from
Xeno-Canto.

    query_xenocanto(
      species = getOption("suwo_species"),
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      verbose = getOption("suwo_verbose", TRUE),
      all_data = getOption("suwo_all_data", FALSE),
      raw_data = getOption("suwo_raw_data", FALSE),
      api_key = Sys.getenv("xc_api_key")
    )

species |
Character string with the scientific name of a species in
the format: "Genus epithet". Required. Can be set globally for the current
R session via the "suwo_species" option (e.g.
|
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
all_data |
Logical argument that determines if all data available
from database is shown in the results of search. Default is |
raw_data |
Logical argument that determines if the raw data from the
repository is returned (e.g. without any manipulation).
Default is |
api_key |
Character string referring to the key assigned by Xeno-Canto
as authorization for searches. Get yours at
https://xeno-canto.org/account.
Required. Avoid setting your API key directly in the function call to
prevent exposing it in your code. Instead, set it as an environment variable
(e.g., in your .Renviron file using
|
This function queries metadata for animal sound recordings in the open-access online repository Xeno-Canto. Xeno-Canto hosts sound recordings of birds, frogs, non-marine mammals and grasshoppers. Complex queries can be constructed using the Xeno-Canto advanced query syntax (see examples).
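Advanced query strings of the kind used in the examples (tag:"value" pairs separated by spaces) can be assembled with a small helper. The function build_xc_query() below is hypothetical and not part of suwo; it only reproduces the string format seen in the package examples and does not validate tag names.

```r
# Hypothetical helper (not part of suwo): builds 'tag:"value"' query strings
build_xc_query <- function(species, ...) {
  tags <- list(...)
  parts <- sprintf('sp:"%s"', species)
  # Iterate by position so that repeated tag names (e.g. two 'type' tags) work
  for (i in seq_along(tags)) {
    parts <- c(parts, sprintf('%s:"%s"', names(tags)[i], tags[[i]]))
  }
  paste(parts, collapse = " ")
}

q_country <- build_xc_query("Phaethornis anthophilus", cnt = "Panama")
q_female <- build_xc_query("Thryothorus ludovicianus",
                           type = "song", type = "female")

q_country
q_female
```

The resulting strings would be passed through the species argument, as in the package examples.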
The function returns a data frame with the metadata of the media
files matching the search criteria. If all_data = TRUE, all metadata
fields (columns) are returned. If raw_data = TRUE, the raw data as
obtained from the repository is returned (without any formatting).
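Because api_key defaults to the xc_api_key environment variable, a quick check before querying can avoid a failed request. The helper has_xc_key() below is hypothetical and uses only base R; it does not verify that the key is valid, only that the variable is set.

```r
# Hypothetical helper (not part of suwo): TRUE if the environment variable
# used as the default for 'api_key' holds a non-empty value
has_xc_key <- function() {
  nzchar(Sys.getenv("xc_api_key"))
}

if (has_xc_key()) {
  message("Xeno-Canto API key found in 'xc_api_key'.")
} else {
  message("No API key found; set 'xc_api_key' before calling query_xenocanto().")
}
```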
Marcelo Araya-Salas ([email protected])
Planqué, Bob, & Willem-Pier Vellinga. 2008. Xeno-canto: a 21st-century way to appreciate Neotropical bird song. Neotrop. Birding 3: 17-23.
query_gbif(), query_wikiaves(),
query_inaturalist()

    if (interactive()) {
      # An API key is required. Get yours at https://xeno-canto.org/account.
      # run this in the console but don't save it in a script
      Sys.setenv(xc_api_key = "YOUR_API_KEY_HERE")

      # Simple search for a species
      p_anth <- query_xenocanto(species = "Phaethornis anthophilus")

      # Search for the same species and specify a country
      p_anth_cr <- query_xenocanto(
        species = 'sp:"Phaethornis anthophilus" cnt:"Panama"',
        raw_data = TRUE)

      # Search for female songs of a species
      femsong <- query_xenocanto(
        species = 'sp:"Thryothorus ludovicianus" type:"song" type:"female"')
    }

remove_duplicates removes duplicated media records.

    remove_duplicates(
      metadata,
      same_repo = FALSE,
      cores = getOption("suwo_cores", 1),
      pb = getOption("suwo_pb", TRUE),
      repo_priority = c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library",
                        "WikiAves", "Observation"),
      verbose = getOption("suwo_verbose", TRUE)
    )

metadata |
data frame obtained from possible duplicates with the
function |
same_repo |
Logical argument indicating if observations labeled
as duplicates that belong to the same repository should be removed. Default
is |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "suwo_cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
repo_priority |
Character vector indicating the priority of
repositories when selecting which observation to retain when duplicates
are found. Default is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
When compiling data from multiple repositories, duplicated media
records are a common issue, particularly for sound recordings. These
duplicates arise both from data sharing between repositories such as
Xeno-Canto and GBIF, and from users uploading the same file to multiple
platforms. In such cases the multiple observations appear to refer to the
same media file, so only one copy is needed. This function
removes duplicate observations identified with the function
find_duplicates(). When duplicates are found, one observation
from each group of duplicates is retained in the output data frame.
However, if multiple observations from the same repository are
labeled as duplicates, by default (same_repo = FALSE) all of them
are retained in the output data frame. This is useful because
observations from the same repository are unlikely to be true
duplicates (e.g. different recordings uploaded to Xeno-Canto with
the same date, time and location by the same user); more likely they have
not been documented with enough precision to be told apart. This behavior
can be modified: if same_repo = TRUE, only one of the duplicated
observations from the same repository is retained in the output data
frame. By default the function gives priority to repositories in which media
downloading is more straightforward (Xeno-Canto and GBIF), but this can be
changed with the argument 'repo_priority'.
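The retention rule described above can be illustrated with a minimal base R sketch on toy data. This is not the package's implementation: the function `sketch_dedup()` and the column names `key`, `repository` and `dup_group` are hypothetical, and the sketch assumes one possible reading of the rule, namely that within each duplicate group the rows from the highest-priority repository are kept (all of them when `same_repo = FALSE`, only the first when `same_repo = TRUE`).

```r
# Toy metadata. 'dup_group' is a hypothetical stand-in for the label added by
# find_duplicates(); NA means the row was not flagged as a duplicate.
toy <- data.frame(
  key = c("a1", "a2", "b1", "c1", "d1", "e1"),
  repository = c("Xeno-Canto", "Xeno-Canto", "GBIF",
                 "iNaturalist", "GBIF", "WikiAves"),
  dup_group = c(1, 1, 1, 2, 2, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows from the highest-priority repository
# within each duplicate group (an assumed reading of the documented rule)
sketch_dedup <- function(x, same_repo = FALSE,
                         repo_priority = c("Xeno-Canto", "GBIF",
                                           "iNaturalist")) {
  # Lower rank = higher priority; unlisted repositories rank last
  repo_rank <- match(x$repository, repo_priority)
  repo_rank[is.na(repo_rank)] <- length(repo_priority) + 1

  # Rows never flagged as duplicates are always kept
  keep <- is.na(x$dup_group)

  for (g in unique(x$dup_group[!is.na(x$dup_group)])) {
    idx <- which(x$dup_group == g)
    best <- idx[repo_rank[idx] == min(repo_rank[idx])]
    # With same_repo = TRUE only one row per group survives
    if (same_repo) best <- best[1]
    keep[best] <- TRUE
  }
  x[keep, , drop = FALSE]
}

kept_default <- sketch_dedup(toy)
kept_strict <- sketch_dedup(toy, same_repo = TRUE)
kept_reordered <- sketch_dedup(
  toy, repo_priority = c("iNaturalist", "GBIF", "Xeno-Canto")
)
```

In this toy case the default call keeps both Xeno-Canto rows of group 1, while `same_repo = TRUE` keeps only the first of them. Reordering `repo_priority` changes which repository wins in each group.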
A single data frame with the subset of 'metadata' observations that were determined not to be duplicates.
Marcelo Araya-Salas ([email protected])
find_duplicates(), merge_metadata()
# get metadata from 2 repos
gb <- query_gbif(species = "Turdus rufiventris", format = "sound")

if(interactive()){
  key <- "YOUR XENO CANTO API KEY"
  xc <- query_xenocanto(species = "Turdus rufiventris", api_key = key)

  # combine metadata
  merged_metadata <- merge_metadata(xc, gb)

  # find duplicates
  label_dup_metadata <- find_duplicates(metadata = merged_metadata)

  # remove duplicates
  dedup_metadata <- remove_duplicates(label_dup_metadata)
}
update_metadata updates metadata from previous queries.
update_metadata(
  metadata,
  path = ".",
  cores = getOption("suwo_cores", 1),
  pb = getOption("suwo_pb", TRUE),
  verbose = getOption("suwo_verbose", TRUE),
  api_key = NULL,
  dates = NULL
)
metadata |
data frame previously obtained from any suwo query
function (i.e. |
path |
Directory path where the .csv file will be saved. Only
applicable for |
cores |
Numeric vector of length 1. Controls whether parallel computing
is applied by specifying the number of cores to be used. Default is 1
(i.e. no parallel computing). Can be set globally for the current R session
via the "mc.cores" option (e.g. |
pb |
Logical argument to control if progress bar is shown. Default
is |
verbose |
Logical argument that determines if text is shown in
console. Default is |
api_key |
Character string referring to the key assigned by
Xeno-Canto as authorization for searches. Get yours at
https://xeno-canto.org/account. Only
needed if the input metadata comes from |
dates |
Optional numeric vector with years used to split the search. If
provided, the function will perform separate queries for each date range
(between consecutive date values) and combine the results. Useful for
queries that return a large number of results (i.e. more than the 10000
results limit). For example, to search for the species between 2010 and 2020
and between 2021 and 2025 use |
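As a rough illustration of how a vector of years can define ranges "between consecutive date values", the base R sketch below pairs each value with the next one. The years are arbitrary, and how the package treats the boundary years of adjacent ranges is not shown here; this only illustrates the pairing idea.

```r
# Arbitrary example years (not taken from the package documentation)
years <- c(2000, 2010, 2020)

# Pair each value with the next one: n values give n - 1 ranges
ranges <- data.frame(
  from = head(years, -1),
  to = tail(years, -1)
)

# One query would then be issued per row and the results combined
n_queries <- nrow(ranges)
```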
This function updates the metadata from a previous query by
adding new entries found in the source repository. All observations must
belong to the same repository. The function adds the column new_entry, which
labels those entries that are new (i.e., not present in the input metadata).
The input data frame must have been obtained from one of the query
functions with the argument raw_data = FALSE. The function uses the same
query species and format as in the original query. If no new entries are
found, the function returns the original metadata and prints a message. If
some old entries are not returned in the new query, they are still retained.
The function assumes that no new files are added to existing repository
entries.
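The update logic described above can be sketched in base R with toy data frames. This is an illustration under assumptions, not the package's code: the `key` column used to match records is hypothetical, and `fresh` stands in for the result of re-running the original query. Only the `new_entry` column name comes from the description above.

```r
# Metadata from a previous query (toy data; 'key' is a hypothetical ID column)
old <- data.frame(
  key = c("k1", "k2", "k3"),
  species = "Amanita gioiosa",
  stringsAsFactors = FALSE
)

# Stand-in for the result of re-running the same query today.
# Note that "k1" is no longer returned by the repository.
fresh <- data.frame(
  key = c("k2", "k3", "k4", "k5"),
  species = "Amanita gioiosa",
  stringsAsFactors = FALSE
)

# Entries present in the new query but absent from the old metadata
added <- fresh[!fresh$key %in% old$key, , drop = FALSE]

# Label old and new entries, then append the new ones
old$new_entry <- FALSE
added$new_entry <- rep(TRUE, nrow(added))
updated <- rbind(old, added)
rownames(updated) <- NULL
```

The old entry `k1` is still present in `updated` even though the new query did not return it, which mirrors the retention behavior described above.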
A data frame similar to the input 'metadata' with new data appended.
Marcelo Araya-Salas ([email protected])
# query metadata
a_gioiosa <- query_gbif(species = "Amanita gioiosa", format = "image")

# run if query didn't fail
if (!is.null(a_gioiosa)) {
  # remove last 3 rows to test update_metadata
  sub_a_gioiosa <- a_gioiosa[1:(nrow(a_gioiosa) - 3), ]

  # update
  up_a_gioiosa <- update_metadata(metadata = sub_a_gioiosa)

  # check number of rows is the same
  # nrow(up_a_gioiosa) == nrow(a_gioiosa)
}