Title: Keep a Collection of Sparkly Data Resources
Description: Tools to get and maintain a data repository from third-party data providers.
Authors: Ben Raymond [aut, cre], Michael Sumner [aut], Miles McBain [rev, ctb], Leah Wasser [rev, ctb]
Maintainer: Ben Raymond <[email protected]>
License: MIT + file LICENSE
Version: 0.16.4
Built: 2024-12-11 01:25:40 UTC
Source: https://github.com/ropensci/bowerbird
Generate a bowerbird data source object for an Australian Antarctic Data Centre data set
bb_aadc_source(metadata_id, eds_id, id_is_metadata_id = FALSE, ...)
metadata_id: string: the metadata ID of the data set. Browse the AADC's collection at https://data.aad.gov.au/metadata/records/ to find the relevant metadata ID
eds_id: integer: specify one or more EDS IDs
id_is_metadata_id: logical: if TRUE, use the metadata_id as the source ID
...: passed to bb_source
A tibble containing the data source definition, as would be returned by bb_source
## Not run:
## generate the source def for the "AADC-00009" dataset
## (Antarctic Fur Seal Populations on Heard Island, Summer 1987-1988)
src <- bb_aadc_source("AADC-00009")

## download it to a temporary directory
data_dir <- tempfile()
dir.create(data_dir)
res <- bb_get(src, local_file_root = data_dir, verbose = TRUE)
res$files
## End(Not run)
Add new data sources to a bowerbird configuration
bb_add(config, source)
config: bb_config: a bowerbird configuration (as returned by bb_config)
source: data.frame: one or more data source definitions, as returned by bb_source
configuration object
## Not run:
cf <- bb_config("/my/file/root") %>% bb_add(bb_example_sources())
## End(Not run)
A function for removing unwanted files after downloading. This function is not intended to be called directly, but rather is specified as a postprocess option in bb_source.
bb_cleanup(pattern, recursive = FALSE, ignore_case = FALSE, all_files = FALSE, ...)
pattern: string: regular expression, passed to list.files
recursive: logical: should the cleanup recurse into subdirectories?
ignore_case: logical: should pattern matching be case-insensitive?
all_files: logical: should the cleanup include hidden files?
...: extra parameters passed automatically by bb_sync
This function can be used to remove unwanted files after a data source has been synchronized. The pattern specifies a regular expression that is passed to list.files to find matching files, which are then deleted. Note that only files in the data source's own directory (i.e. its subdirectory of the local_file_root specified in bb_config) are subject to deletion. But beware! Some data sources may share directories, which can lead to unexpected file deletion. Be as specific as you can with the pattern parameter.
a list, with components status (TRUE on success) and deleted_files (character vector of paths of files that were deleted)
bb_source, bb_config, bb_decompress
## Not run:
## remove .asc files after synchronization
my_source <- bb_source(..., postprocess = list(list("bb_cleanup", pattern = "\\.asc$")))
## End(Not run)
The configuration object controls the behaviour of the bowerbird synchronization process, run via bb_sync(my_config). The configuration object defines the data sources that will be synchronized, where the data files from those sources will be stored, and a range of options controlling how the synchronization process is conducted. The parameters provided here are repository-wide settings, and will affect all data sources that are subsequently added to the configuration.
bb_config(
  local_file_root,
  wget_global_flags = list(restrict_file_names = "windows", progress = "dot:giga"),
  target_s3_args = list(),
  http_proxy = NULL,
  ftp_proxy = NULL,
  clobber = 1
)
local_file_root: string: location of data repository on local file system
wget_global_flags: list: wget flags that will be applied to all data sources that call bb_wget
target_s3_args: list: arguments to pass to
http_proxy: string: URL of HTTP proxy to use e.g. 'http://your.proxy:8080' (NULL for no proxy)
ftp_proxy: string: URL of FTP proxy to use e.g. 'http://your.proxy:21' (NULL for no proxy)
clobber: numeric: 0 = do not overwrite existing files, 1 = overwrite if the remote file is newer than the local copy, 2 = always overwrite existing files
Note that the local_file_root directory need not actually exist when the configuration object is created, but when bb_sync is run, either the directory must exist or create_root = TRUE must be passed (i.e. bb_sync(..., create_root = TRUE)).
configuration object
## Not run:
cf <- bb_config("/my/file/root") %>% bb_add(bb_example_sources())
## save to file
saveRDS(cf, file = "my_config.rds")
## load previously saved config
cf <- readRDS(file = "my_config.rds")
## End(Not run)
Return the local directory of each data source in a configuration. Files from each data source are stored locally in the associated directory. Note that if a data source has multiple source_url values, this function might return multiple directory names (depending on whether those source_urls map to the same directory or not).
bb_data_source_dir(config)
config: bb_config: configuration as returned by bb_config
character vector of directories
cf <- bb_config("/my/file/root") %>% bb_add(bb_example_sources())
bb_data_source_dir(cf)
Gets or sets the data sources contained in a bowerbird configuration object.
bb_data_sources(config)
bb_data_sources(config) <- value
config: bb_config: a bowerbird configuration (as returned by bb_config)
value: data.frame: new data sources to set (e.g. as returned by bb_example_sources)
Note that an assignment along the lines of bb_data_sources(cf) <- new_sources replaces all of the sources in the configuration with new_sources. If you wish to modify the existing sources, read them, modify as needed, and then rewrite the whole lot back into the configuration object.
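As a sketch of that read-modify-rewrite pattern (the particular modification here is purely illustrative):

```r
## read the current sources, modify them, and write the whole lot back
cf <- bb_config("/my/file/root")
cf <- bb_add(cf, bb_example_sources())
srcs <- bb_data_sources(cf)      ## read the existing sources
srcs$name <- trimws(srcs$name)   ## modify (illustrative only)
bb_data_sources(cf) <- srcs      ## rewrite them into the configuration
```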
a tibble with columns as specified by bb_source
bb_config, bb_source, bb_example_sources
## create a configuration and add data sources
cf <- bb_config(local_file_root = "/your/data/directory")
cf <- bb_add(cf, bb_example_sources())
## examine the sources contained in cf
bb_data_sources(cf)
## replace the sources with different ones
## Not run:
bb_data_sources(cf) <- new_sources
## End(Not run)
Functions for decompressing files after downloading. These functions are not intended to be called directly, but rather are specified as a postprocess option in bb_source. bb_unzip, bb_untar, bb_gunzip, bb_bunzip2, and bb_uncompress are convenience wrappers around bb_decompress that specify the method.
bb_decompress(method, delete = FALSE, ...)
bb_unzip(...)
bb_gunzip(...)
bb_bunzip2(...)
bb_uncompress(...)
bb_untar(...)
method: string: one of "unzip", "gunzip", "bunzip2", "decompress", "untar"
delete: logical: delete the compressed files after extracting their contents?
...: extra parameters passed automatically by bb_sync
Tar files can be compressed (i.e. file extensions .tar, .tgz, .tar.gz, .tar.bz2, or .tar.xz). Support for tar files may depend on your platform (see untar).
If the data source delivers compressed files, you will most likely want to decompress them after downloading. These functions will do this for you. By default, these do not delete the compressed files after decompressing. The reason for this is so that on the next synchronization run, the local (compressed) copy can be compared to the remote compressed copy, and the download can be skipped if nothing has changed. Deleting local compressed files will save space on your file system, but may result in every file being re-downloaded on every synchronization run.
a list with components status (TRUE on success), files (character vector of paths to extracted files), and deleted_files (character vector of paths of files that were deleted)
bb_source, bb_config, bb_cleanup
## Not run:
## decompress .zip files after synchronization but keep zip files intact
my_source <- bb_source(..., postprocess = list("bb_unzip"))
## decompress .zip files after synchronization and delete zip files
my_source <- bb_source(..., postprocess = list(list("bb_unzip", delete = TRUE)))
## End(Not run)
These example sources are useful as data sources in their own right, but are primarily provided as demonstrations of how to define data sources. See also vignette("bowerbird") for further examples and discussion.
bb_example_sources(sources)
sources: character: names or identifiers of one or more sources to return. See Details for the list of example sources and a brief explanation of each
Example data sources:
"NOAA OI SST V2" - a straightforward data source that requires a simple one-level recursive download
"Australian Election 2016 House of Representatives data" - an example of a recursive download that uses additional criteria to restrict what is downloaded
"CMEMS global gridded SSH reprocessed (1993-ongoing)" - a data source that requires a username and password
"Oceandata SeaWiFS Level-3 mapped monthly 9km chl-a" - an example data source that uses the bb_handler_oceandata method
"Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 3" - an example data source that uses the bb_handler_earthdata method
"Bathymetry of Lake Superior" - another example that passes extra flags to the bb_handler_rget call in order to restrict what is downloaded
a tibble with columns as specified by bb_source
See the doc_url and citation fields in each row of the returned tibble for references associated with these particular data sources
bb_config, bb_handler_rget, bb_handler_oceandata, bb_handler_earthdata, bb_source_us_buildings
## define a configuration and add the 2016 election data source to it
cf <- bb_config("/my/file/root") %>%
  bb_add(bb_example_sources("Australian Election 2016 House of Representatives data"))
## Not run:
## synchronize (download) the data
bb_sync(cf)
## End(Not run)
This function will return the path to the wget executable if it can be found on the local system, and optionally install it if it is not found. Installation (if required) currently only works on Windows platforms. The wget.exe executable will be downloaded from https://eternallybored.org/misc/wget/ and installed into your appdata directory (typically something like C:/Users/username/AppData/Roaming/).
bb_find_wget(install = FALSE, error = TRUE)
install: logical: attempt to install the executable if it is not found? (Windows only)
error: logical: if wget is not found, raise an error. If FALSE, return NULL instead
the path to the wget executable, or (if error is FALSE) NULL if it was not found
https://eternallybored.org/misc/wget/
## Not run:
wget_path <- bb_find_wget()
wget_path <- bb_find_wget(install = TRUE) ## install (on Windows) if needed
## End(Not run)
The bb_fingerprint function, given a data repository configuration, will return the timestamp of download and hashes of all files associated with its data sources. This is intended as a general helper for tracking data provenance: for all of these files, we have information on where they came from (the data source ID), when they were downloaded, and a hash so that later versions of those files can be compared to detect changes. See also vignette("data_provenance").
bb_fingerprint(config, hash = "sha1")
config: bb_config: configuration as returned by bb_config
hash: string: algorithm to use to calculate file hashes: "md5", "sha1", or "none". Note that file hashing can be slow for large file collections
a tibble with columns:
filename - the full path and filename of the file
data_source_id - the identifier of the associated data source (as per the id argument to bb_source)
size - the file size
last_modified - last modified date of the file
hash - the hash of the file (unless hash = "none" was specified)
vignette("data_provenance")
## Not run:
cf <- bb_config("/my/file/root") %>% bb_add(bb_example_sources())
bb_fingerprint(cf)
## End(Not run)
This is a convenience function that provides a shorthand method for synchronizing a small number of data sources. The call bb_get(...) is roughly equivalent to bb_sync(bb_add(bb_config(...), ...), ...) (don't take the dots literally here; they are just indicating argument placeholders).
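Written out, that rough equivalence looks like this (a sketch; my_source stands in for any data source tibble):

```r
## these two are roughly equivalent ways to fetch a source:
res1 <- bb_get(my_source, local_file_root = "/my/file/root", create_root = TRUE)

cf <- bb_add(bb_config("/my/file/root"), my_source)
res2 <- bb_sync(cf, create_root = TRUE)
```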
bb_get(
  data_sources,
  local_file_root,
  clobber = 1,
  http_proxy = NULL,
  ftp_proxy = NULL,
  create_root = FALSE,
  verbose = FALSE,
  confirm_downloads_larger_than = 0.1,
  dry_run = FALSE,
  ...
)
data_sources: tibble: one or more data sources to download, as returned by e.g. bb_example_sources
local_file_root: string: location of data repository on local file system
clobber: numeric: 0 = do not overwrite existing files, 1 = overwrite if the remote file is newer than the local copy, 2 = always overwrite existing files
http_proxy: string: URL of HTTP proxy to use e.g. 'http://your.proxy:8080' (NULL for no proxy)
ftp_proxy: string: URL of FTP proxy to use e.g. 'http://your.proxy:21' (NULL for no proxy)
create_root: logical: should the data root directory be created if it does not exist? If this is FALSE and the directory does not exist, an error will be generated
verbose: logical: if TRUE, provide additional progress output
confirm_downloads_larger_than: numeric or NULL: if non-negative, the user will be asked to confirm the download of any data source whose size is greater than this value (in GB)
dry_run: logical: if TRUE, no files will be downloaded; instead, a report of what would be done is produced
...: additional parameters passed through to bb_sync
Note that the local_file_root directory must exist, or create_root = TRUE must be passed.
a tibble, as for bb_sync
bb_config, bb_example_sources, bb_source, bb_sync
## Not run:
my_source <- bb_example_sources("Australian Election 2016 House of Representatives data")
status <- bb_get(local_file_root = tempdir(), data_sources = my_source, verbose = TRUE)
## the files that have been downloaded:
status$files[[1]]

## Define a new source: Geelong bicycle paths from data.gov.au
my_source <- bb_source(
  name = "Bike Paths - Greater Geelong",
  id = "http://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
  doc_url = "https://data.gov.au/dataset/geelong-bike-paths",
  citation = "See https://data.gov.au/dataset/geelong-bike-paths",
  source_url = "https://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
  license = "CC-BY",
  method = list("bb_handler_rget", accept_download = "\\.zip$", level = 1),
  postprocess = list("bb_unzip"))

## get the data
status <- bb_get(data_sources = my_source, local_file_root = tempdir(), verbose = TRUE)

## find the .shp file amongst the files, and plot it
shpfile <- status$files[[1]]$file[grepl("shp$", status$files[[1]]$file)]
library(sf)
bx <- st_read(shpfile)
plot(bx)
## End(Not run)
This is a handler function to be used with AWS S3 data providers. This function is not intended to be called directly, but rather is specified as a method option in bb_source. Note that this currently only works with public data sources that are accessible without an S3 key.
The method arguments accepted by bb_handler_aws_s3 are currently:
"bucket" string: name of the bucket (defaults to "")
"base_url" string: as for s3HTTP
"region" string: as for s3HTTP
"use_https" logical: as for s3HTTP
"prefix" string: as for get_bucket; only keys in the bucket that begin with the specified prefix will be processed
and other parameters passed to the bb_rget function, including "accept_download", "accept_download_extra", "reject_download"
Note that the "prefix", "accept_download", "accept_download_extra", "reject_download" parameters can be used to restrict which files are downloaded from the bucket.
bb_handler_aws_s3(...)
...: parameters, see Description
A tibble with columns ok, files, message
## Not run:
## an example AWS S3 data source
src <- bb_source(
  name = "SILO climate data",
  id = "silo-open-data",
  description = "Australian climate data from 1889 to yesterday. This source includes a single example monthly rainfall data file. Adjust the 'accept_download' parameter to change this.",
  doc_url = "https://www.longpaddock.qld.gov.au/silo/gridded-data/",
  citation = "SILO datasets are constructed by the Queensland Government using observational data provided by the Australian Bureau of Meteorology and are available under the Creative Commons Attribution 4.0 license",
  license = "CC-BY 4.0",
  method = list("bb_handler_aws_s3", region = "silo-open-data.s3",
                base_url = "amazonaws.com", prefix = "Official/annual/monthly_rain/",
                accept_download = "2005\\.monthly_rain\\.nc$"),
  comment = "The unusual specification of region and base_url is a workaround for an aws.s3 issue, see https://github.com/cloudyr/aws.s3/issues/318",
  postprocess = NULL,
  collection_size = 0.02,
  data_group = "Climate")

temp_root <- tempdir()
status <- bb_get(src, local_file_root = temp_root, verbose = TRUE)
## End(Not run)
This is a handler function to be used with data sets from Copernicus Marine. This function is not intended to be called directly, but rather is specified as a method option in bb_source.
bb_handler_copernicus(product, ctype = "stac", ...)
product: string: the desired Copernicus marine product
ctype: string: most likely "stac" for a dataset containing multiple files, or "file" for a single file
...: additional parameters
Note that users will need a Copernicus login.
TRUE on success
https://help.marine.copernicus.eu/en/collections/4060068-copernicus-marine-toolbox
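No example is given above, so here is a minimal sketch of a Copernicus Marine source definition. The product identifier and descriptive fields are hypothetical placeholders (substitute a real Copernicus Marine product ID), and a Copernicus login is required:

```r
## sketch only: "SOME_PRODUCT_ID" and the descriptive fields are placeholders
my_source <- bb_source(
  name = "Example Copernicus Marine product",
  id = "example-copernicus-product",
  doc_url = "https://data.marine.copernicus.eu/",
  citation = "See the product page for citation details",
  license = "See https://marine.copernicus.eu/user-corner/service-commitments-and-licence",
  method = list("bb_handler_copernicus", product = "SOME_PRODUCT_ID"),
  postprocess = NULL)
```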
This is a handler function to be used with data sets from NASA's Earthdata system. This function is not intended to be called directly, but rather is specified as a method option in bb_source.
bb_handler_earthdata(...)
...: parameters passed to bb_rget
This function uses bb_rget, and so data sources using this function will need to provide appropriate bb_rget parameters. Note that curl v5.2.1 introduced a breaking change to the default value of the 'unrestricted_auth' option: see https://github.com/jeroen/curl/issues/260. Your Earthdata source definition might require 'allow_unrestricted_auth = TRUE' as part of the method parameters.
TRUE on success
https://wiki.earthdata.nasa.gov/display/EL/How+To+Register+With+Earthdata+Login
## Not run:
## note that the full version of this data source is provided as part of bb_example_data_sources()
my_source <- bb_source(
  name = "Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 3",
  id = "10.5067/IJ0T7HFHB9Y6",
  description = "NSIDC provides this data set ... [truncated; see bb_example_data_sources()]",
  doc_url = "https://nsidc.org/data/NSIDC-0192/versions/3",
  citation = "Stroeve J, Meier WN (2018) ... [truncated; see bb_example_data_sources()]",
  source_url = "https://daacdata.apps.nsidc.org/pub/DATASETS/nsidc0192_seaice_trends_climo_v3/",
  license = "Please cite, see http://nsidc.org/about/use_copyright.html",
  authentication_note = "Requires Earthdata login, see https://urs.earthdata.nasa.gov/. Note that you will also need to authorize the application 'nsidc-daacdata' (see 'My Applications' at https://urs.earthdata.nasa.gov/profile)",
  method = list("bb_handler_earthdata", level = 4, relative = TRUE,
                accept_download = "\\.(s|n|png|txt)$", allow_unrestricted_auth = TRUE),
  user = "your_earthdata_username",
  password = "your_earthdata_password",
  collection_size = 0.02)
## End(Not run)
This is a handler function to be used with data sets from NASA's Oceandata system. This function is not intended to be called directly, but rather is specified as a method option in bb_source.
bb_handler_oceandata(search, dtype, sensor, ...)
search: string: (required) the search string to pass to the oceancolor file searcher (https://oceandata.sci.gsfc.nasa.gov/api/file_search)
dtype: string: (optional) the data type (e.g. "L3m") to pass to the oceancolor file searcher. Valid options at the time of writing are L0, L1, L2, L3b (for binned data), L3m (for mapped data), MET (for ancillary data), misc (for sundry products)
sensor: string: (optional) the sensor (e.g. "seawifs") to pass to the oceancolor file searcher. Valid options at the time of writing are aquarius, seawifs, aqua, terra, meris, octs, czcs, hico, viirs (for snpp), viirsj1, s3olci (for sentinel-3a), s3bolci (see https://oceancolor.gsfc.nasa.gov/data/download_methods/)
...: extra parameters passed automatically by bb_sync
Note that users will need an Earthdata login, see https://urs.earthdata.nasa.gov/. Users will also need to authorize the application 'OB.DAAC Data Access' (see 'My Applications' at https://urs.earthdata.nasa.gov/profile)
Oceandata uses standardized file naming conventions (see https://oceancolor.gsfc.nasa.gov/docs/format/), so once you know which products you want you can construct a suitable file name pattern to search for. For example, "S*L3m_MO_CHL_chlor_a_9km.nc" would match monthly level-3 mapped chlorophyll data from the SeaWiFS satellite at 9km resolution, in netcdf format. This pattern is passed as the search argument. Note that bb_handler_oceandata does not need 'source_url' to be specified in the bb_source call.
TRUE on success
https://oceandata.sci.gsfc.nasa.gov/
my_source <- bb_source(
  name = "Oceandata SeaWiFS Level-3 mapped monthly 9km chl-a",
  id = "SeaWiFS_L3m_MO_CHL_chlor_a_9km",
  description = "Monthly remote-sensing chlorophyll-a from the SeaWiFS satellite at 9km spatial resolution",
  doc_url = "https://oceancolor.gsfc.nasa.gov/",
  citation = "See https://oceancolor.gsfc.nasa.gov/citations",
  license = "Please cite",
  method = list("bb_handler_oceandata", search = "S*L3m_MO_CHL_chlor_a_9km.nc"),
  postprocess = NULL,
  collection_size = 7.2,
  data_group = "Ocean colour")
This is a general handler function that is suitable for a range of data sets. This function is not intended to be called directly, but rather is specified as a method option in bb_source.
bb_handler_rget(...)
...: parameters passed to bb_rget
This handler function makes calls to the bb_rget function. Arguments provided to bb_handler_rget are passed through to bb_rget.
TRUE on success
my_source <- bb_source(
  name = "Australian Election 2016 House of Representatives data",
  id = "aus-election-house-2016",
  description = "House of Representatives results from the 2016 Australian election.",
  doc_url = "http://results.aec.gov.au/",
  citation = "Copyright Commonwealth of Australia 2017. As far as practicable, material for which the copyright is owned by a third party will be clearly labelled. The AEC has made all reasonable efforts to ensure that this material has been reproduced on this website with the full consent of the copyright owners.",
  source_url = "http://results.aec.gov.au/20499/Website/HouseDownloadsMenu-20499-Csv.htm",
  license = "CC-BY",
  method = list("bb_handler_rget", level = 1, accept_download = "csv$"),
  collection_size = 0.01)

my_data_dir <- tempdir()
cf <- bb_config(my_data_dir)
cf <- bb_add(cf, my_source)

## Not run:
bb_sync(cf, verbose = TRUE)
## End(Not run)
This is a general handler function that is suitable for a range of data sets. This function is not intended to be called directly, but rather is specified as a method option in bb_source.
bb_handler_wget(...)
...: parameters passed to bb_wget
This handler function makes calls to the wget utility via the bb_wget function. Arguments provided to bb_handler_wget are passed through to bb_wget.
TRUE on success
my_source <- bb_source(
  id = "gshhg_coastline",
  name = "GSHHG coastline data",
  description = "A Global Self-consistent, Hierarchical, High-resolution Geography Database",
  doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
  citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical, High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
  source_url = "ftp://ftp.soest.hawaii.edu/gshhg/*",
  license = "LGPL",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, accept = "*bin*.zip,README.TXT"),
  postprocess = list("bb_unzip"),
  collection_size = 0.6)
This is a helper function to install wget. Currently it only works on Windows platforms. The wget.exe executable will be downloaded from https://eternallybored.org/misc/wget/ and saved to either a temporary directory or your user appdata directory (see the use_appdata_dir parameter).
bb_install_wget(force = FALSE, use_appdata_dir = FALSE)
force |
logical: force reinstallation if wget already exists |
use_appdata_dir |
logical: by default, wget.exe is saved to a temporary directory, which does not persist between R sessions. Set this to TRUE to save it to your user appdata directory instead |
the path to the installed executable
https://eternallybored.org/misc/wget/
## Not run:
bb_install_wget()

## confirm that it worked:
bb_wget("help")
## End(Not run)
This is a helper function designed to make it easier to modify an already-defined data source. Generally, parameters passed here will replace existing entries in src if they exist, or will be added if not. The method and postprocess parameters are slightly different: see Details, below.
bb_modify_source(src, ...)
src |
data.frame or tibble: a single-row data source (as returned by bb_source) |
... |
: parameters as for bb_source |
With the exception of the method and postprocess parameters, any parameter provided here will entirely replace its equivalent in the src object. Pass a new value of NULL to remove an existing parameter.

The method and postprocess parameters are lists, and modification for these takes place at the list-element level: any element of the new list will replace its equivalent element in the list in src. If the src list does not contain that element, it will be added. To illustrate, say that we have created a data source with:
src <- bb_source(method=list("bb_handler_rget", parm1 = value1, parm2 = value2), ...)
Calling
bb_modify_source(src, method = list(parm1 = newvalue1))
will result in a new method value of list("bb_handler_rget", parm1 = newvalue1, parm2 = value2).
Modifying postprocess elements is similar. Note that it is not currently possible to entirely remove a postprocess component using this function. If you need to do so, you'll need to do it manually.
as for bb_source: a tibble with columns as per the bb_source function arguments (excluding warn_empty_auth)
## this pre-defined source requires a username and password
src <- bb_example_sources(
    "Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 3")

## add username and password
src <- bb_modify_source(src, user = "myusername", password = "mypassword")

## or using the pipe operator
src <- bb_example_sources(
    "Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 3") %>%
    bb_modify_source(user = "myusername", password = "mypassword")

## remove the existing "data_group" component
src %>% bb_modify_source(data_group = NULL)

## change just the 'level' setting of an existing method definition
src %>% bb_modify_source(method = list(level = 3))

## remove the 'level' component of an existing method definition
src %>% bb_modify_source(method = list(level = NULL))
This function is not intended to be called directly, but rather is specified as a postprocess option in bb_source.
bb_oceandata_cleanup(...)
... |
: extra parameters passed automatically by |
This function will remove near-real-time (NRT) files from an oceandata collection that have been superseded by their non-NRT versions.
a list, with components status (TRUE on success) and deleted_files (character vector of paths of files that were deleted)
This function provides similar, but simplified, functionality to the command-line wget utility. It is based on the rvest package.
bb_rget(
    url,
    level = 0,
    wait = 0,
    accept_follow = c("(/|\\.html?)$"),
    reject_follow = character(),
    accept_download = bb_rget_default_downloads(),
    accept_download_extra = character(),
    reject_download = character(),
    user,
    password,
    clobber = 1,
    no_parent = TRUE,
    no_parent_download = no_parent,
    no_check_certificate = FALSE,
    relative = FALSE,
    remote_time = TRUE,
    verbose = FALSE,
    show_progress = verbose,
    debug = FALSE,
    dry_run = FALSE,
    stop_on_download_error = FALSE,
    retries = 0,
    force_local_filename,
    use_url_directory = TRUE,
    no_host = FALSE,
    cut_dirs = 0L,
    link_css = "a",
    curl_opts,
    target_s3_args
)

bb_rget_default_downloads()
url |
string: the URL to retrieve |
level |
integer >=0: recursively download to this maximum depth level. Specify 0 for no recursion |
wait |
numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block users making too many requests in a short period of time |
accept_follow |
character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs matching all entries will be followed during the spidering process. Note that the first URL (provided via the |
reject_follow |
character: as for |
accept_download |
character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs that match all entries will be accepted for download. By default the |
accept_download_extra |
character: character vector with one or more entries. If provided, URLs will be accepted for download if they match all entries in |
reject_download |
character: as for |
user |
string: username used to authenticate to the remote server |
password |
string: password used to authenticate to the remote server |
clobber |
numeric: 0=do not overwrite existing files, 1=overwrite if the remote file is newer than the local copy, 2=always overwrite existing files |
no_parent |
logical: if |
no_parent_download |
logical: similar to |
no_check_certificate |
logical: if |
relative |
logical: if |
remote_time |
logical: if |
verbose |
logical: print trace output? |
show_progress |
logical: if |
debug |
logical: if |
dry_run |
logical: if |
stop_on_download_error |
logical: if |
retries |
integer: number of times to retry a request if it fails with a transient error (similar to curl, a transient error means a timeout, an FTP 4xx response code, or an HTTP 5xx response code) |
force_local_filename |
character: if provided, then each |
use_url_directory |
logical: if |
no_host |
logical: if |
cut_dirs |
integer: if |
link_css |
string: css selector that identifies links (passed as the |
curl_opts |
named list: options to use with |
target_s3_args |
list: named list or arguments to provide to |
NOTE: this is still somewhat experimental.
a list with components 'ok' (TRUE/FALSE), 'files', and 'message' (error or other messages)
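The bb_rget entry has no worked example. The following minimal sketch uses only parameters and return components documented above; the URL is reused from the GSHHG example elsewhere in this manual, and dry_run = TRUE avoids actually downloading anything:

```r
## Not run:
## a minimal sketch of calling bb_rget directly
res <- bb_rget("http://www.soest.hawaii.edu/pwessel/gshhg/",
               level = 1,
               accept_download = "README|bin.*\\.zip$",
               dry_run = TRUE,  ## report what would be done, without downloading
               verbose = TRUE)
res$ok     ## TRUE/FALSE
res$files  ## per the return value described above
## End(Not run)
```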
Gets or sets a bowerbird configuration object's settings. These are repository-wide settings that are applied to all data sources added to the configuration. Use this function to alter the settings of a configuration previously created using bb_config.
bb_settings(config) bb_settings(config) <- value
config |
bb_config: a bowerbird configuration (as returned by bb_config) |
value |
list: new values to set |
Note that an assignment along the lines of bb_settings(cf) <- new_settings replaces all of the settings in the configuration with new_settings. The most common usage pattern is to read the existing settings, modify them as needed, and then rewrite the whole lot back into the configuration object (as per the examples here).
named list
cf <- bb_config(local_file_root = "/your/data/directory")

## see current settings
bb_settings(cf)

## add an http proxy
sets <- bb_settings(cf)
sets$http_proxy <- "http://my.proxy"
bb_settings(cf) <- sets

## change the current local_file_root setting
sets <- bb_settings(cf)
sets$local_file_root <- "/new/location"
bb_settings(cf) <- sets
This function is used to define a data source, which can then be added to a bowerbird data repository configuration. Passing the configuration object to bb_sync will trigger a download of all of the data sources in that configuration.
bb_source(
    id,
    name,
    description = NA_character_,
    doc_url,
    source_url,
    citation,
    license,
    comment = NA_character_,
    method,
    postprocess,
    authentication_note = NA_character_,
    user = NA_character_,
    password = NA_character_,
    access_function = NA_character_,
    data_group = NA_character_,
    collection_size = NA,
    warn_empty_auth = TRUE
)
id |
string: (required) a unique identifier of the data source. If the data source has a DOI, use that. Otherwise, if the original data provider has an identifier for this dataset, that is probably a good choice here (include the data version number if there is one). The ID should be something that changes when the data set changes (is updated). A DOI is ideal for this |
name |
string: (required) a unique name for the data source. This should be a human-readable but still concise name |
description |
string: a plain-language description of the data source, provided so that users can get an idea of what the data source contains (for full details they can consult the doc_url) |
doc_url |
string: (required) URL to the metadata record or other documentation of the data source |
source_url |
character vector: one or more source URLs. Required for |
citation |
string: (required) details of the citation for the data source |
license |
string: (required) description of the license. For standard licenses (e.g. creative commons) include the license descriptor ("CC-BY", etc) |
comment |
string: comments about the data source. If only part of the original data collection is mirrored, mention that here |
method |
list (required): a list object that defines the function used to synchronize this data source. The first element of the list is the function name (as a string or function). Additional list elements can be used to specify additional parameters to pass to that function. Note that |
postprocess |
list: each element of |
authentication_note |
string: if authentication is required in order to access this data source, make a note of the process (include a URL to the registration page, if possible) |
user |
string: username, if required |
password |
string: password, if required |
access_function |
string: can be used to suggest to users an appropriate function to read these data files. Provide the name of an R function or even a code snippet |
data_group |
string: the name of the group to which this data source belongs. Useful for arranging sources in terms of thematic areas |
collection_size |
numeric: approximate disk space (in GB) used by the data collection, if known. If the data are supplied as compressed files, this size should reflect the disk space used after decompression. If the data_source definition contains multiple source_url entries, this size should reflect the overall disk space used by all combined |
warn_empty_auth |
logical: if |
The method parameter defines the handler function used to synchronize this data source, and any extra parameters that need to be passed to it.
Parameters marked as "required" are the minimal set needed to define a data source. Other parameters are either not relevant to all data sources (e.g. postprocess, user, password) or provide metadata to users that is not strictly necessary to allow the data source to be synchronized (e.g. description, access_function, data_group). Note that three of the "required" parameters (namely citation, license, and doc_url) are not strictly needed by the synchronization code, but are treated as "required" because of their fundamental importance to reproducible science.
See vignette("bowerbird") for more examples and discussion of defining data sources.
a tibble with columns as per the function arguments (excluding warn_empty_auth)
bb_config, bb_sync, vignette("bowerbird")
## a minimal definition for the GSHHG coastline data set:
my_source <- bb_source(
    id = "gshhg_coastline",
    name = "GSHHG coastline data",
    doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
    citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical, High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
    source_url = "ftp://ftp.soest.hawaii.edu/gshhg/",
    license = "LGPL",
    method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"))

## a more complete definition, which unzips the files after downloading and also
## provides an indication of the size of the dataset
my_source <- bb_source(
    id = "gshhg_coastline",
    name = "GSHHG coastline data",
    description = "A Global Self-consistent, Hierarchical, High-resolution Geography Database",
    doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
    citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical, High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
    source_url = "ftp://ftp.soest.hawaii.edu/gshhg/*",
    license = "LGPL",
    method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"),
    postprocess = list("bb_unzip"),
    collection_size = 0.6)

## define a data repository configuration
cf <- bb_config("/my/repo/root")

## add this source to the repository
cf <- bb_add(cf, my_source)

## Not run:
## sync the repo
bb_sync(cf)
## End(Not run)
This function constructs a data source definition for the Microsoft US Buildings data set. This data set contains 124,885,597 computer-generated building footprints in all 50 US states. NOTE: currently, the downloaded zip files will not be unzipped automatically. Work in progress.
bb_source_us_buildings(states)
states |
character: (optional) one or more US state names for which to download data. If missing, data from all states will be downloaded. See the reference page for valid state names |
a tibble with columns as specified by bb_source
https://github.com/Microsoft/USBuildingFootprints
bb_example_sources, bb_config, bb_handler_rget
## Not run:
## define a configuration and add this buildings data source to it
## only including data for the District of Columbia and Hawaii
cf <- bb_config(tempdir()) %>%
    bb_add(bb_source_us_buildings(states = c("District of Columbia", "Hawaii")))

## synchronize (download) the data
bb_sync(cf)
## End(Not run)
Keep only selected data_sources in a bowerbird configuration
bb_subset(config, idx)
config |
bb_config: a bowerbird configuration (as returned by bb_config) |
idx |
logical or numeric: index vector of data_source rows to retain |
configuration object
## Not run:
cf <- bb_config("/my/file/root") %>%
    bb_add(bb_example_sources()) %>%
    bb_subset(1:2)
## End(Not run)
This function produces a summary of a bowerbird configuration in HTML or Rmarkdown format. If you are maintaining a data collection on behalf of other users, or even just for yourself, it may be useful to keep an up-to-date HTML summary of your repository in an accessible location. Users can refer to this summary to see which data are in the repository and some details about them.
bb_summary(
    config,
    file = tempfile(fileext = ".html"),
    format = "html",
    inc_license = TRUE,
    inc_auth = TRUE,
    inc_size = TRUE,
    inc_access_function = TRUE,
    inc_path = TRUE
)
config |
bb_config: a bowerbird configuration (as returned by bb_config) |
file |
string: path to file to write summary to. A temporary file is used by default |
format |
string: produce HTML ("html") or Rmarkdown ("Rmd") file? |
inc_license |
logical: include each source's license and citation details? |
inc_auth |
logical: include information about authentication for each data source (if applicable)? |
inc_size |
logical: include each source's size (disk space) information? |
inc_access_function |
logical: include each source's access function? |
inc_path |
logical: include each source's local file path? |
path to the summary file in HTML or Rmarkdown format
## Not run:
cf <- bb_config("/my/file/root") %>%
    bb_add(bb_example_sources())
browseURL(bb_summary(cf))
## End(Not run)
This function takes a bowerbird configuration object and synchronizes each of the data sources defined within it. Data files will be downloaded if they are not present on the local machine, or if the configuration has been set to update local files.
bb_sync(
    config,
    create_root = FALSE,
    verbose = FALSE,
    catch_errors = TRUE,
    confirm_downloads_larger_than = 0.1,
    dry_run = FALSE
)
config |
bb_config: configuration as returned by bb_config |
create_root |
logical: should the data root directory be created if it does not exist? If this is |
verbose |
logical: if |
catch_errors |
logical: if |
confirm_downloads_larger_than |
numeric or NULL: if non-negative, |
dry_run |
logical: if |
Note that when bb_sync is run, the local_file_root directory must exist, or create_root = TRUE must be specified (i.e. bb_sync(..., create_root = TRUE)). If create_root = FALSE and the directory does not exist, bb_sync will fail with an error.
a tibble with the name, id, source_url, sync success status, and files of each data source. Data sources that contain multiple source URLs will appear as multiple rows in the returned tibble, one per source_url. files is a tibble with columns url (the URL the file was downloaded from), file (the path to the file), and note (either "downloaded" for a file that was downloaded, "local copy" for a file that was not downloaded because there was already a local copy, or "decompressed" for files that were extracted from a downloaded (or already-locally-present) compressed file). url will be NA for "decompressed" files.
## Not run:
## Choose a location to store files on the local file system.
## Normally this would be an explicit choice by the user, but here
## we just use a temporary directory for example purposes.
td <- tempdir()
cf <- bb_config(local_file_root = td)

## Bowerbird must then be told which data sources to synchronize.
## Let's use data from the Australian 2016 federal election, which is provided as one
## of the example data sources:
my_source <- bb_example_sources("Australian Election 2016 House of Representatives data")

## Add this data source to the configuration:
cf <- bb_add(cf, my_source)

## Once the configuration has been defined and the data source added to it,
## we can run the sync process.
## We set verbose = TRUE so that we see additional progress output:
status <- bb_sync(cf, verbose = TRUE)

## The files in this data set have been stored in a data-source specific
## subdirectory of our local file root:
status$files[[1]]

## We can run this at any later time and our repository will update if the source has changed:
status2 <- bb_sync(cf, verbose = TRUE)
## End(Not run)
This function is an R wrapper to the command-line wget utility, which is called using either the exec_wait or the exec_internal function from the sys package. Almost all of the parameters to bb_wget are translated into command-line flags to wget. Call bb_wget("help") to get more information about wget's command line flags. If required, command-line flags without equivalent bb_wget function parameters can be passed via the extra_flags parameter.
bb_wget(
    url,
    recursive = TRUE,
    level = 1,
    wait = 0,
    accept,
    reject,
    accept_regex,
    reject_regex,
    exclude_directories,
    restrict_file_names,
    progress,
    user,
    password,
    output_file,
    robots_off = FALSE,
    timestamping = FALSE,
    no_if_modified_since = FALSE,
    no_clobber = FALSE,
    no_parent = TRUE,
    no_check_certificate = FALSE,
    relative = FALSE,
    adjust_extension = FALSE,
    retr_symlinks = FALSE,
    extra_flags = character(),
    verbose = FALSE,
    capture_stdout = FALSE,
    quiet = FALSE,
    debug = FALSE
)
url |
string: the URL to retrieve |
recursive |
logical: if true, turn on recursive retrieving |
level |
integer >=0: recursively download to this maximum depth level. Only applicable if recursive = TRUE |
wait |
numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block multiple successive requests, by introducing a delay between requests |
accept |
character: character vector with one or more entries. Each entry specifies a comma-separated list of filename suffixes or patterns to accept. Note that if any of the wildcard characters '*', '?', '[', or ']' appear in an element of accept, it will be treated as a filename pattern, rather than a filename suffix. In this case, you have to enclose the pattern in quotes, for example |
reject |
character: as for |
accept_regex |
character: character vector with one or more entries. Each entry provides a regular expression that is applied to the complete URL. Matching URLs will be accepted for download |
reject_regex |
character: as for |
exclude_directories |
character: character vector with one or more entries. Each entry specifies a comma-separated list of directories you wish to exclude from download. Elements may contain wildcards |
restrict_file_names |
character: vector of one of more strings from the set "unix", "windows", "nocontrol", "ascii", "lowercase", and "uppercase". See https://www.gnu.org/software/wget/manual/wget.html#index-Windows-file-names for more information on this parameter. |
progress |
string: the type of progress indicator you wish to use. Legal indicators are "dot" and "bar". "dot" prints progress with dots, with each dot representing a fixed amount of downloaded data. The style can be adjusted: "dot:mega" will show 64K per dot and 3M per line; "dot:giga" shows 1M per dot and 32M per line. See https://www.gnu.org/software/wget/manual/wget.html#index-dot-style for more information |
user |
string: username used to authenticate to the remote server |
password |
string: password used to authenticate to the remote server |
output_file |
string: save wget's output messages to this file |
robots_off |
logical: by default wget considers itself to be a robot, and therefore won't recurse into areas of a site that are excluded to robots. This can cause problems with servers that exclude robots (accidentally or deliberately) from parts of their sites containing data that we want to retrieve. Setting |
timestamping |
logical: if |
no_if_modified_since |
logical: applies when retrieving recursively with timestamping (i.e. only downloading files that have changed since last download, which is achieved using |
no_clobber |
logical: if |
no_parent |
logical: if |
no_check_certificate |
logical: if |
relative |
logical: if |
adjust_extension |
logical: if a file of type 'application/xhtml+xml' or 'text/html' is downloaded and the URL does not end with .htm or .html, this option will cause the suffix '.html' to be appended to the local filename. This can be useful when mirroring a remote site that has file URLs that conflict with directories (e.g. http://somewhere.org/this/page which has further content below it, say at http://somewhere.org/this/page/more. If "somewhere.org/this/page" is saved as a file with that name, that name can't also be used as the local directory name in which to store the lower-level content. Setting |
retr_symlinks |
logical: if |
extra_flags |
character: character vector of additional command-line flags to pass to wget |
verbose |
logical: print trace output? |
capture_stdout |
logical: if |
quiet |
logical: if |
debug |
logical: if |
the result of the system call (or if bb_wget("--help") was called, a message will be issued). The returned object will have components 'status' and (if capture_stdout was TRUE) 'stdout' and 'stderr'
## Not run:
## get help about wget command line parameters
bb_wget("help")
## End(Not run)
Generate a bowerbird data source object for a Zenodo data set
bb_zenodo_source(id, use_latest = FALSE)
id |
: the ID of the data set |
use_latest |
logical: if |
A tibble containing the data source definition, as would be returned by bb_source
## Not run:
## generate the source object for the dataset
## 'Ichtyological data of Station de biologie des Laurentides 2019'
src <- bb_zenodo_source(3533328)

## download it to a temporary directory
data_dir <- tempfile()
dir.create(data_dir)
res <- bb_get(src, local_file_root = data_dir, verbose = TRUE)
res$files
## End(Not run)
Often it's desirable to have local copies of third-party data sets. Fetching data on the fly from remote sources can be a great strategy, but for speed or other reasons it may be better to have local copies. This is particularly common in environmental and other sciences that deal with large data sets (e.g. satellite or global climate model products). Bowerbird is an R package for maintaining a local collection of data sets from a range of data providers.
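The core workflow that this manual documents (configure, add sources, synchronize) can be sketched in a few lines; the local file root here is a placeholder path, and the example source is one of the package's built-in examples:

```r
## Not run:
library(bowerbird)

## a repository configuration rooted at a local directory (placeholder path)
cf <- bb_config(local_file_root = "/path/to/data/repo")

## add one of the package's example data sources
cf <- bb_add(cf, bb_example_sources(
    "Australian Election 2016 House of Representatives data"))

## download (or, on later runs, update) the local copies
status <- bb_sync(cf, verbose = TRUE)
## End(Not run)
```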
Maintainer: Ben Raymond [email protected]
Authors:
Michael Sumner
Other contributors:
Miles McBain [email protected] [reviewer, contributor]
Leah Wasser [reviewer, contributor]
https://github.com/AustralianAntarcticDivision/bowerbird
Useful links:
Report bugs at https://github.com/ropensci/bowerbird/issues