Title: | High Performance Interface to 'GBIF' |
---|---|
Description: | A high-performance interface to the Global Biodiversity Information Facility, 'GBIF'. In contrast to 'rgbif', which accesses small subsets of 'GBIF' data through web-based queries to a central server, 'gbifdb' provides enhanced performance for R users performing large-scale analyses on servers and cloud computing providers, with full support for arbitrary 'SQL' or 'dplyr' operations on the complete 'GBIF' data tables (now over 1 billion records, and over a terabyte in size). 'gbifdb' accesses a copy of the 'GBIF' data in 'parquet' format, which is readily available in commercial computing clouds such as the Amazon Open Data portal and the Microsoft Planetary Computer; the data can be queried directly without downloading, or downloaded to any server with suitable bandwidth and storage space. The high-performance techniques for local and remote access are described in <https://duckdb.org/why_duckdb> and <https://arrow.apache.org/docs/r/articles/fs.html>, respectively. |
Authors: | Carl Boettiger [aut, cre] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | Apache License (>= 2) |
Version: | 1.0.0 |
Built: | 2024-11-27 03:39:44 UTC |
Source: | https://github.com/ropensci/gbifdb |
The default location can be set with the environment variable GBIF_HOME; otherwise the default provided by tools::R_user_dir() is used.
gbif_dir()
the path to the GBIF home directory
gbif_dir()
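A minimal sketch of overriding the storage location via the environment variable described above (the path shown is purely illustrative):

```r
library(gbifdb)

# Point gbifdb at a custom storage location (hypothetical path)
Sys.setenv(GBIF_HOME = "/data/gbif")
gbif_dir()

# Unset to fall back to the tools::R_user_dir() default
Sys.unsetenv("GBIF_HOME")
gbif_dir()
```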
Sync a local directory with a selected release of the AWS copy of GBIF
gbif_download( version = gbif_version(), dir = gbif_dir(), bucket = gbif_default_bucket(), region = "" )
version |
Release date (YYYY-MM-DD) which should be synced. Will detect latest version by default. |
dir |
path to the local directory where parquet files should be stored.
Fine to leave at the default, gbif_dir(). |
bucket |
Name of the regional S3 bucket desired. Default is "gbif-open-data-us-east-1". Select a bucket closer to your compute location for improved performance, e.g. European researchers may prefer "gbif-open-data-eu-central-1" etc. |
region |
bucket region (usually unnecessary; setting the bucket appropriately suffices) |
Syncs parquet files from the GBIF public data catalog, https://registry.opendata.aws/gbif/.
Note that the data can also be found on the Microsoft Cloud, https://planetarycomputer.microsoft.com/dataset/gbif.
Some users may prefer to download the data using an alternative interface, or to work on a cloud-hosted machine where the data are already available. Note that these data include all CC0- and CC-BY-licensed data in GBIF that have coordinates and passed automated quality checks; see https://github.com/gbif/occurrence/blob/master/aws-public-data.md.
logical indicating success or failure.
gbif_download()
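For instance, a sketch of syncing a specific snapshot from the European bucket to a custom directory; the version date and path here are illustrative, not a recommendation:

```r
library(gbifdb)

# Sync the parquet snapshot to a local directory.
# Version date and directory are illustrative placeholders.
gbif_download(
  version = "2023-10-01",
  dir = "/data/gbif",
  bucket = "gbif-open-data-eu-central-1"
)
```

Note the full snapshot is over a terabyte, so ensure sufficient bandwidth and disk space before syncing.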
Return a path to the directory containing GBIF example parquet data
gbif_example_data()
The example data are taken from the first 1000 rows of the 2021-11-01 release of the parquet data.
path to the example occurrence data installed with the package.
gbif_example_data()
Local connection to a downloaded GBIF Parquet database
gbif_local( dir = gbif_parquet_dir(version = gbif_version(local = TRUE)), tblname = "gbif", backend = c("arrow", "duckdb"), safe = TRUE )
dir |
location of downloaded GBIF parquet files |
tblname |
name for the database table |
backend |
choose duckdb or arrow. |
safe |
logical, default TRUE. Should potentially problematic columns be excluded? |
A summary of this GBIF data, along with column meanings can be found at https://github.com/gbif/occurrence/blob/master/aws-public-data.md
a remote tibble (tbl_sql class object)
gbif <- gbif_local(gbif_example_data())
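Once connected, the table behaves like any dplyr-compatible lazy table. A small sketch using the bundled example data; the `species` column name is assumed from the GBIF occurrence schema:

```r
library(dplyr)
library(gbifdb)

# Connect to the small example parquet data shipped with the package
gbif <- gbif_local(gbif_example_data())

# Count records per species (a standard GBIF occurrence column),
# then pull the top rows into memory with collect()
gbif |>
  count(species, sort = TRUE) |>
  head(10) |>
  collect()
```

Queries are evaluated lazily by the backend; only `collect()` materializes results in R.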
Connect to GBIF remote directly. Can be much faster than downloading for one-off use or when using the package from a server in the same region as the data. See Details.
gbif_remote( version = gbif_version(), bucket = gbif_default_bucket(), safe = TRUE, unset_aws = getOption("gbif_unset_aws", TRUE), endpoint_override = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"), backend = c("arrow", "duckdb"), ... )
version |
GBIF snapshot date |
bucket |
GBIF bucket name (including region). A default can also be set using
the option gbif_default_bucket. |
safe |
logical, default TRUE. Should potentially problematic columns be excluded? |
unset_aws |
Unset AWS credentials? GBIF is provided in a public bucket,
so credentials are not needed, but having AWS_ACCESS_KEY_ID or other AWS
environment variables set can cause the connection to fail. By default,
any such environment variables are unset for the duration of the R session.
This behavior can also be turned off globally by setting
options(gbif_unset_aws = FALSE). |
endpoint_override |
optional parameter passed on to arrow::s3_bucket(). |
backend |
duckdb or arrow |
... |
additional parameters passed on to arrow::s3_bucket(). |
Query performance is dramatically improved for queries that return only a subset of columns. Consider using explicit select() commands to return only the columns you need.
A summary of this GBIF data, along with column meanings can be found at https://github.com/gbif/occurrence/blob/master/aws-public-data.md
a remote tibble (tbl_sql class object).
gbif <- gbif_remote()
gbif
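Following the performance advice above, a sketch that selects only the needed columns before filtering; the column names are assumed from the GBIF occurrence schema and the species value is illustrative:

```r
library(dplyr)
library(gbifdb)

# Connect directly to the remote parquet snapshot (no download)
gbif <- gbif_remote()

# Select only the columns needed before filtering: this greatly
# reduces the data scanned over the network.
gbif |>
  select(species, decimallatitude, decimallongitude, year) |>
  filter(species == "Puma concolor", year > 2015) |>
  collect()
```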
Get the latest GBIF snapshot version. Can also return the latest locally downloaded version, or list all available versions.
gbif_version( local = FALSE, dir = gbif_dir(), bucket = gbif_default_bucket(), all = FALSE, ... )
local |
Search only local versions? logical, default FALSE. |
dir |
path to the local directory; the default is gbif_dir(). |
bucket |
Which remote bucket (region) should be checked? |
all |
show all versions? (logical, default FALSE) |
... |
additional arguments passed to arrow::s3_bucket() |
A default version can be set using the option gbif_default_version.
the latest available GBIF version, as a string
## Latest local version available:
gbif_version(local = TRUE)

## default version
options(gbif_default_version = "2021-01-01")
gbif_version()

## Latest online version available:
gbif_version()

## All online versions:
gbif_version(all = TRUE)