Title: | Tools for Working with 'Taxonomic' Databases |
---|---|
Description: | Tools for working with 'taxonomic' databases, including utilities for downloading databases, loading them into various 'SQL' databases, cleaning up files, and providing a 'SQL' connection that can be used to do 'SQL' queries directly or used in 'dplyr'. |
Authors: | Scott Chamberlain [aut], Zebulun Arendsee [aut], Tamás Stirling [ctb, cre] |
Maintainer: | Tamás Stirling <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.1 |
Built: | 2024-11-27 03:24:41 UTC |
Source: | https://github.com/ropensci/taxizedb |
Taxonomic databases interface
All data sources use SQLite as the database
NCBI: text files are provided by NCBI, which we stitch into a SQLite database
ITIS: they provide a SQLite dump, which we use here
The PlantList: created by stitching together CSV files. This source is no longer updated as far as we can tell; they say they've moved focus to the World Flora Online
Catalogue of Life: created from a Darwin Core Archive dump, using the latest monthly edition via http://www.catalogueoflife.org/DCA_Export/archive.php
GBIF: created from a Darwin Core Archive dump. Right now we only have the taxonomy table (called gbif), but will add the other tables in the Darwin Core Archive later
Wikidata: aggregated taxonomy of Open Tree of Life, GloBI and Wikidata. Hosted on Zenodo, created by Jorrit Poelen of GloBI.
World Flora Online: http://www.worldfloraonline.org/
NCBI: since db_download_ncbi creates the database when the function is called, it's updated whenever you run the function
ITIS: since ITIS provides the SQLite database as a download, you can delete the old file and run db_download_itis to get a new dump; ITIS appears to update the dumps roughly every month
The PlantList: no longer updated, so you shouldn't need to download this after the first download
Catalogue of Life: a GitHub Actions job runs once a day at 00:00 UTC, building the latest COL data into a SQLite database that's hosted on Amazon S3
GBIF: a GitHub Actions job runs once a day at 00:00 UTC, building the latest GBIF data into a SQLite database that's hosted on Amazon S3
Wikidata: last updated April 6, 2018. Scripts are available to update the data if you prefer to do it yourself.
World Flora Online: since db_download_wfo creates the database when the function is called, it's updated whenever you run the function
NCBI: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
ITIS: https://www.itis.gov/downloads/index.html
The PlantList: http://www.theplantlist.org/
Catalogue of Life: via http://www.catalogueoflife.org/content/annual-checklist-archive
GBIF: http://rs.gbif.org/datasets/backbone/
Wikidata: https://zenodo.org/record/1213477
World Flora Online: http://www.worldfloraonline.org/
## Not run:
library(dplyr)

# data source: NCBI
db_download_ncbi()
src <- src_ncbi()
df <- tbl(src, "names")
filter(df, name_class == "scientific name")

# data source: ITIS
## download ITIS database
db_download_itis()
## connect to the ITIS database
src <- src_itis()
## use SQL syntax
sql_collect(src, "select * from hierarchy limit 5")
### or pipe the src to sql_collect
src %>% sql_collect("select * from hierarchy limit 5")
## use dplyr verbs
src %>%
  tbl("hierarchy") %>%
  filter(ChildrenCount > 1000)
## or create tbl object for repeated use
hiers <- src %>% tbl("hierarchy")
hiers %>% select(TSN, level)

# data source: The PlantList
## download tpl database
db_download_tpl()
## connect to the tpl database
src <- src_tpl()
## do queries
tpl <- tbl(src, "tpl")
filter(tpl, Family == "Pinaceae")

# data source: Catalogue of Life
## download col database
db_download_col()
## connect to the col database
src <- src_col()
## do queries
names <- tbl(src, "taxa")
select(names, taxonID, scientificName)

# data source: GBIF
## download gbif database
db_download_gbif()
## connect to the gbif database
src <- src_gbif()
## do queries
df <- tbl(src, "gbif")
select(df, taxonID, scientificName)

# data source: Wikidata
db_download_wikidata()
src <- src_wikidata()
df <- tbl(src, "wikidata")
filter(df, rank_id == "Q7432")

# data source: World Flora Online
db_download_wfo()
src <- src_wfo()
df <- tbl(src, "wfo")
filter(df, taxonID == "wfo-0000000010")
## End(Not run)
Retrieve immediate descendants of a taxon
children(x, db = "ncbi", verbose = TRUE, ...)
x |
(character) Vector of taxon keys for the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, col, or wfo |
verbose |
(logical) Print verbose messages |
... |
Additional arguments passed to database specific function. |
list of tibbles with the columns: id, name, rank. This is exactly equivalent to the output of taxize::children()
## Not run:
children(c(3700, 2))
children(c(154395, 154357), db = "itis")
children("wfo-4000032377", db = "wfo")
children(2877951, db = "gbif")
children("C66T4", db = "col") # Abies Mill.
## End(Not run)
This function is equivalent to the taxize::classification() function, except that it uses a local database (so is much faster). The output is identical to taxize::classification()
classification(x, db = "ncbi", verbose = TRUE, ...)
x |
(character) Vector of taxon keys for the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, col, or wfo |
verbose |
(logical) Print verbose messages |
... |
Additional arguments passed to database specific classification functions. |
list of data.frames with the columns: name, rank, and id. This is exactly equivalent to the output of taxize::classification()
## Not run:
classification(c(3702, 9606))
classification(c(154395, 154357), db = "itis")
classification(c("wfo-0000291463", "wfo-7000000057"), db = "wfo")
classification(2878586, db = "gbif")
classification(c(2878586, 2704179), db = "gbif")
classification("C66T4", db = "col") # Abies Mill.
## End(Not run)
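To make the documented return shape concrete, here is a small sketch that works on a hand-built stand-in for one element of the list that classification() returns. The data.frame is typed in by hand for illustration: it is not output captured from the package, and a real lineage has many more rows. `rank_name()` is a hypothetical helper, not part of taxizedb.

```r
# Hand-built stand-in for one element of the list returned by classification().
# Columns follow the documented layout: name, rank, id.
lineage <- data.frame(
  name = c("Brassicaceae", "Arabidopsis", "Arabidopsis thaliana"),
  rank = c("family", "genus", "species"),
  id   = c("3700", "3701", "3702"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the name recorded at a given rank,
# or NA_character_ when the lineage has no row at that rank.
rank_name <- function(lineage, rank) {
  hit <- lineage$name[lineage$rank == rank]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

family_name <- rank_name(lineage, "family")
order_name  <- rank_name(lineage, "order")
```

Because every element shares the same three columns, a helper like this can be mapped over the whole list with lapply() or vapply().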
Download taxonomic databases
db_download_ncbi(verbose = TRUE, overwrite = FALSE)
db_download_itis(verbose = TRUE, overwrite = FALSE)
db_download_tpl(verbose = TRUE, overwrite = FALSE)
db_download_wfo(verbose = TRUE, overwrite = FALSE)
db_download_col(verbose = TRUE, overwrite = FALSE)
db_download_gbif(verbose = TRUE, overwrite = FALSE)
db_download_wikidata(verbose = TRUE, overwrite = FALSE)
verbose |
(logical) Print messages. Default: |
overwrite |
(logical) If |
Downloads the SQL database, cleans up unneeded files, and returns the path to the SQL file
(character) path to the downloaded SQL database
## Not run:
# ITIS
# db_download_itis()
# src_itis()

# Plantlist
# db_download_tpl()
# db_download_tpl(overwrite=TRUE) # overwrite - download again
# src_tpl()

# COL
# db_download_col()
# src_col()

# GBIF
# db_download_gbif()
# src_gbif()

# NCBI
# db_download_ncbi()
# src_ncbi()

# Wikidata
# db_download_wikidata()
# db_download_wikidata(overwrite=TRUE) # overwrite - download again
# src_wikidata()

# World Flora Online
# db_download_wfo()
# src_wfo()
## End(Not run)
Use db_download then src_taxizedb
db_load_itis(...)
db_load_tpl(...)
db_load_col(...)
db_load_gbif(...)
db_load_ncbi(...)
db_load_wikidata(...)
... |
ignored |
database path
db_path(db)
db |
(character) db name. one of: itis, tpl, col, gbif, ncbi, wikidata, wfo. required |
This function is nearly equivalent to the taxize::downstream() function
downstream(x, db = "ncbi", verbose = TRUE, ...)
x |
(character) Vector of taxon keys for the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, col, or wfo |
verbose |
(logical) Print verbose messages |
... |
Additional arguments passed to database specific downstream functions |
list of data.frames with the columns: childtaxa_id, childtaxa_name, and rank. This is exactly equivalent to the output of taxize::downstream()
## Not run:
# get descendants from all ranks
# downstream(c(3700, 9605)) # takes a while

# limit results to species
downstream(c(3700, 9605), downto='species')

# allow ambiguous species but no ambiguous nodes
downstream(
  c(3700, 9605),
  downto='species',
  ambiguous_nodes=FALSE,
  ambiguous_species=TRUE
)

# ITIS
id <- name2taxid('Aves', db = "itis")
downstream(id, db = "itis", downto = "family")
downstream(id, db = "itis", downto = "genus")
id <- name2taxid('Bombus', db = "itis")
downstream(id, db = "itis", downto = "species")

# COL
id <- name2taxid('Chordata', db = "col")
downstream(id, db = "col", downto = "family")

# GBIF
id <- name2taxid('Pinaceae', db = "gbif")
downstream(id, db = "gbif", downto = "genus")

# World Flora Online
id <- name2taxid('Pinaceae', db = "wfo")
downstream(id, db = "wfo", downto = "species")
## End(Not run)
name2taxid() returns a vector and dies if there are any ambiguous names. name2taxid_map() returns a data.frame mapping names to ids
name2taxid(x, db = "ncbi", verbose = TRUE, out_type = c("uid", "summary"), ...)
x |
(character) Vector of taxon names to look up in the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, wfo, or tpl |
verbose |
(logical) Print verbose messages |
out_type |
(character) "uid" for an ID vector, "summary" for a table with columns 'tax_id' and 'tax_name'. |
... |
Additional arguments passed to database specific classification functions. |
The NCBI taxonomy database includes common names, synonyms and misspellings. However, the database is a little inconsistent. For some species, such as Arabidopsis thaliana, the misspelling Arabidopsis_thaliana is included, but the same is NOT done for humans. Underscores are nevertheless supported when querying through Entrez, as is done in taxize, which implies Entrez is replacing underscores with spaces, so we do the same. A corner case appears when an organism uses underscores as part of the name, not just as a stand-in for a space ("haloarchaeon 3A1_DGR"). To deal with this case, we replace underscores with spaces ONLY if there are no spaces in the original name.
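The underscore rule described above can be sketched in a few lines of base R. This is an illustration of the stated rule only, not the package's internal code, and `normalize_underscores()` is a hypothetical name introduced for this example.

```r
# Hypothetical helper illustrating the rule: underscores become spaces
# only in names that contain no space to begin with.
normalize_underscores <- function(x) {
  has_space <- grepl(" ", x, fixed = TRUE)
  x[!has_space] <- gsub("_", " ", x[!has_space], fixed = TRUE)
  x
}

queries <- c("Arabidopsis_thaliana", "haloarchaeon 3A1_DGR", "Homo sapiens")
normalized <- normalize_underscores(queries)
```

Here the first name has its underscore replaced, while "haloarchaeon 3A1_DGR" is left untouched because it already contains a space.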
## Not run:
name2taxid(c('Arabidopsis thaliana', 'pig'))
name2taxid(c('Arabidopsis thaliana', 'pig'), out_type="summary")
name2taxid(x=c('Arabidopsis thaliana', 'Apis mellifera'), db = "itis")
name2taxid(x=c('Arabidopsis thaliana', 'Apis mellifera'), db = "itis",
  out_type="summary")
name2taxid(x=c('Arabidopsis thaliana', 'Quercus kelloggii'), db = "wfo")
name2taxid(x=c('Arabidopsis thaliana', 'Quercus kelloggii'), db = "wfo",
  out_type="summary")
name2taxid("Austrobaileyaceae", db = "wfo")
name2taxid("Quercus kelloggii", db = "gbif")
name2taxid(c("Quercus", "Fabaceae", "Animalia"), db = "gbif")
name2taxid(c("Abies Mill.", "Pinales Gorozh.", "Tracheophyta"), db = "col")
name2taxid(c("Abies mangifica", "Acanthopale aethiogermanica",
  "Acanthopale albosetulosa"), db = "tpl")
## End(Not run)
Query and get data back into a data.frame
sql_collect(src, query, ...)
src |
(src) An |
query |
(character) A SQL query |
... |
further args passed on to |
We run dplyr::tbl(), then dplyr::collect()
## Not run:
src <- src_itis()
sql_collect(src, "select * from hierarchy limit 5")
## or pipe the src to sql_collect
src %>% sql_collect("select * from hierarchy limit 5")
## End(Not run)
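The two-step pattern noted for sql_collect() (dplyr::tbl(), then dplyr::collect()) can be tried without downloading anything. The sketch below approximates that pattern and is not the package's source. It assumes the DBI, RSQLite, dplyr and dbplyr packages are installed, and it uses a throwaway in-memory table named hierarchy whose rows are invented for illustration.

```r
library(DBI)
library(dplyr)

# Throwaway in-memory SQLite database with a toy "hierarchy" table
# (made-up rows; the real ITIS table is far larger).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  con, "hierarchy",
  data.frame(TSN = c(1L, 2L, 3L), level = c(1L, 2L, 2L))
)

# Step 1: wrap the raw SQL in a lazy table; step 2: pull the rows into R.
lazy <- dplyr::tbl(con, dplyr::sql("select * from hierarchy limit 2"))
res <- dplyr::collect(lazy)

DBI::dbDisconnect(con)
```

Nothing is fetched until collect() is called, so the query runs inside SQLite and only the result rows are brought into memory.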
src - dplyr src objects
src_itis(path = db_path("itis"), ...)
src_tpl(path = db_path("tpl"), ...)
src_col(path = db_path("col"), ...)
src_gbif(path = db_path("gbif"), ...)
src_ncbi(path = db_path("ncbi"), ...)
src_wikidata(path = db_path("wikidata"), ...)
src_wfo(path = db_path("wfo"), ...)
path |
(character) path to SQLite database. by default
we use the function |
... |
Further args passed on to |
an src object
## Not run:
# src_itis()
# src_tpl()
# src_col()
# src_gbif()
# src_ncbi()
# src_wikidata()
# src_wfo()
## End(Not run)
Get taxa at specific scientific ranks
taxa_at(x, rank, db = "ncbi", missing = "lower", verbose = TRUE, warn = TRUE, ...)
x |
(character) Vector of taxon keys (ids) for the given database. required |
rank |
(character) A target rank for which to fetch data. required |
db |
(character) The database to search, one of ncbi, itis, gbif, col, or wfo |
missing |
(character) if no data found at the given rank and input key,
should we get the next closest lower than that given in |
verbose |
(logical) Print verbose messages |
warn |
(logical) If |
... |
Additional arguments passed to database specific classification functions |
list of data.frames, one per input taxon key, each with the fields: name, rank, id. When no results are found, an empty data.frame is returned
## Not run:
taxa_at(186803, rank = "order", db = "ncbi", missing = "lower")
taxa_at(c(186803, 541000, 216572, 186804, 31979, 186806),
  rank = "family", missing = "lower")
taxa_at(c(154395, 154357, 23041, 154396), rank = "family", db="itis")
taxa_at(c('wfo-4000032377', 'wfo-0000541830'), rank = "family", db = "wfo")
taxa_at("wfo-7000000057", rank = "order", db = "wfo")
taxa_at(2877951, rank = "phylum", db = "gbif")
taxa_at(c(2877951, 5386), rank = "family", db = "gbif")
taxa_at(c("C66T4", "C7ZVH", "TP"), rank = "family", db = "col")
## End(Not run)
Convert taxon IDs to scientific names
taxid2name(x, db = "ncbi", verbose = TRUE, warn = TRUE, ...)
x |
(character) Vector of taxon keys for the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, col, wfo, or tpl |
verbose |
(logical) Print verbose messages |
warn |
(logical) If |
... |
Additional arguments passed to database specific classification functions |
character vector of scientific names
## Not run:
taxid2name(c(3702, 9606))
taxid2name(c(154395, 154357, 23041, 154396), db = "itis")
taxid2name(c('wfo-0000541830', 'wfo-0000291463'), db = "wfo")
taxid2name("wfo-7000000057", db = "wfo")
taxid2name(2877951, db = "gbif")
taxid2name(c(2877951, 5386), db = "gbif")
taxid2name(c("C66T4", "C7ZVH", "TP"), db = "col")
taxid2name(c("kew-2614538", "kew-2895433", "kew-2615007"), db = "tpl")
## End(Not run)
Convert taxon IDs to scientific ranks
taxid2rank(x, db = "ncbi", verbose = TRUE, warn = TRUE, ...)
x |
(character) Vector of taxon keys (name or id) for the given database |
db |
(character) The database to search, one of ncbi, itis, gbif, col, or wfo |
verbose |
(logical) Print verbose messages |
warn |
(logical) If |
... |
Additional arguments passed to database specific classification functions |
character vector of ranks in the same order as the inputs
## Not run:
taxid2rank(c(3701, 9606))
taxid2rank(c(154395, 154357, 23041, 154396), db = "itis")
taxid2rank(c('wfo-4000032377', 'wfo-0000541830'), db = "wfo")
taxid2rank("wfo-7000000057", db = "wfo")
taxid2rank(2877951, db = "gbif")
taxid2rank(c(2877951, 5386), db = "gbif")
taxid2rank(c("C66T4", "C7ZVH", "TP"), db = "col")
## End(Not run)
Manage cached taxizedb files with hoardr
cache_delete only accepts 1 file name, while cache_delete_all doesn't accept any names, but deletes all files. For deleting many specific files, use cache_delete in a lapply() type call
tdb_cache$cache_path_get(): get cache path
tdb_cache$cache_path_set(): set cache path
tdb_cache$list(): returns a character vector of full path file names
tdb_cache$files(): returns file objects with metadata
tdb_cache$details(): returns files with details
tdb_cache$delete(): delete specific files
tdb_cache$delete_all(): delete all files, returns nothing
## Not run:
tdb_cache

# list files in cache
tdb_cache$list()

# delete certain database files
# tdb_cache$delete("file path")
# tdb_cache$list()

# delete all files in cache
# tdb_cache$delete_all()
# tdb_cache$list()
## End(Not run)
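The lapply()-style deletion of several specific files mentioned above might look like the following sketch. `delete_matching()` and `make_fake_cache()` are hypothetical helpers written for this example. The fake cache is a stand-in that only mimics the list() and delete() entries described above, so the code runs without touching any real files.

```r
# Hypothetical helper: delete every cached file whose base name matches
# `pattern`, issuing one cache$delete() call per file.
delete_matching <- function(cache, pattern) {
  files <- cache$list()
  targets <- files[grepl(pattern, basename(files))]
  invisible(lapply(targets, function(f) cache$delete(f)))
  targets
}

# Stand-in cache object (an assumption for illustration, not hoardr itself):
# it keeps file names in a closure instead of on disk.
make_fake_cache <- function(files) {
  store <- files
  list(
    list = function() store,
    delete = function(f) store <<- setdiff(store, f)
  )
}

fake <- make_fake_cache(c(
  "/tmp/cache/ITIS.sqlite",
  "/tmp/cache/NCBI.sql",
  "/tmp/cache/gbif.sqlite"
))
removed <- delete_matching(fake, "\\.sqlite$")
remaining <- fake$list()
```

With taxizedb loaded, passing tdb_cache in place of the fake object is the intended use. Whether delete() expects a full path or a bare file name is worth checking against the hoardr documentation before running it on a real cache.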