Package 'biomartr'

Title: Genomic Data Retrieval
Description: Perform large scale genomic data retrieval and functional annotation retrieval. This package aims to provide users with a standardized way to automate genome, proteome, 'RNA', coding sequence ('CDS'), 'GFF', and metagenome retrieval from 'NCBI RefSeq', 'NCBI Genbank', 'ENSEMBL', and 'UniProt' databases. Furthermore, an interface to the 'BioMart' database (Smedley et al. (2009) <doi:10.1186/1471-2164-10-22>) allows users to retrieve functional annotation for genomic loci. In addition, users can download entire databases such as 'NCBI RefSeq' (Pruitt et al. (2007) <doi:10.1093/nar/gkl842>), 'NCBI nr', 'NCBI nt', 'NCBI Genbank' (Benson et al. (2013) <doi:10.1093/nar/gks1195>), etc. with only one command.
Authors: Hajk-Georg Drost [aut, cre], Haakon Tjeldnes [aut, ctb]
Maintainer: Hajk-Georg Drost <[email protected]>
License: GPL-2
Version: 1.0.9
Built: 2024-12-12 22:23:05 UTC
Source: https://github.com/ropensci/biomartr


Genomic Data Retrieval

Description

This package interacts with a suite of web Application Programming Interfaces and FTP sites to perform automated genomic data retrieval and annotation information retrieval.

About

To automate the retrieval process on a meta-genomic scale, this package provides useful interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses.

In detail, biomartr aims to provide users with an easy-to-use framework to obtain genome, proteome, CDS, GFF (annotation), genome assembly quality, and metagenome project data. Furthermore, an interface to the Ensembl BioMart database allows users to retrieve functional annotation for genomic loci. Users can download entire databases such as

  • NCBI RefSeq

  • NCBI nr

  • NCBI nt

  • NCBI Genbank

  • Ensembl

  • Ensembl Genomes

  • UniProt

Author(s)

Hajk-Georg Drost [email protected]


Main BioMart Query Function

Description

This function takes a set of gene ids and the biomart specifications and performs a biomart query for the given set of gene ids.

Usage

biomart(genes, mart, dataset, attributes, filters, mute_citation = FALSE, ...)

Arguments

genes

a character vector storing the gene ids of an organism of interest to be queried against BioMart.

mart

a character string specifying the mart to be used. Users can obtain available marts using getMarts.

dataset

a character string specifying the dataset within the mart to be used, e.g. dataset = "hsapiens_gene_ensembl".

attributes

a character vector specifying the attributes that shall be used, e.g. attributes = c("start_position","end_position","description").

filters

a character vector specifying the filter (query key) for the BioMart query, e.g. filters = "ensembl_gene_id".

mute_citation

logical value indicating whether citation message should be muted.

...

additional parameters for the getBM function.

Details

This function is the main query function of the biomartr package.

It provides fast access to annotations of a given gene set and is based on the biomaRt package implemented by Steffen Durinck et al.

Value

A data.table storing the initial query gene vector in the first column, the output gene vector in the second column, and all attributes in the following columns.

Author(s)

Hajk-Georg Drost

See Also

Other biomaRt: getAttributes(), getDatasets(), getMarts(), organismBM(), organismFilters()

Examples

## Not run: 
# 1) select a mart
getMarts()

# we will select mart 'plants_mart' and search for available datasets
getDatasets(mart = "plants_mart")

# we choose dataset 'athaliana_eg_gene' and run biomart()
# using mart: 'plants_mart', dataset: "athaliana_eg_gene"
# attributes: c("start_position","end_position","description")
# for an example gene set of Arabidopsis thaliana:
# c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120",
#    "AT1G06130", "AT1G06200")

biomart(genes      = c("AT1G06090", "AT1G06100",
                       "AT1G06110", "AT1G06120",
                       "AT1G06130", "AT1G06200"),
        mart       = "plants_mart",
        dataset    = "athaliana_eg_gene",
        attributes = c("start_position","end_position","description"),
        filters    = "ensembl_gene_id")

## End(Not run)
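The coordinate attributes requested in the example above lend themselves to simple downstream calculations. A minimal sketch in base R, assuming a result table shaped like the one described under Value; the column name query_id and all coordinate values are made-up placeholders, not real annotation data or biomart() output:

```r
# Toy stand-in for a biomart() result; the column 'query_id' and all numbers
# are hypothetical placeholders, not values returned by BioMart.
toy_result <- data.frame(
  query_id        = c("geneA", "geneB"),
  ensembl_gene_id = c("geneA", "geneB"),
  start_position  = c(100L, 2500L),
  end_position    = c(199L, 2999L),
  stringsAsFactors = FALSE
)

# Ensembl-style coordinates are 1-based and inclusive, hence the + 1.
toy_result$gene_length <- toy_result$end_position - toy_result$start_position + 1L

toy_result[, c("ensembl_gene_id", "gene_length")]
```
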

Get directory to store back end files like kingdom summaries etc

Description

Get directory to store back end files like kingdom summaries etc

Usage

cachedir(non_temp_cache = "~/.biomartr_cache_dir.rds")

Arguments

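non_temp_cache

path to the rds file in which the location of a non-temporary cache directory is stored; default is "~/.biomartr_cache_dir.rds".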

Value

Reads the rds file and returns the path of the local cache directory; if the rds file does not exist, tempdir() is used.

See Also

Other cachedir: cachedir_set()

Examples

cachedir()

Set directory to store back end files like kingdom summaries etc

Description

Set directory to store back end files like kingdom summaries etc

Usage

cachedir_set(path)

Arguments

path

the path to cache dir, example "~/Bio_data/biomartr_cache/"

Value

invisible(NULL); the function only saves the file to the path location.

See Also

Other cachedir: cachedir()

Examples

# By default it is tempdir()
cachedir()
# cachedir_set("~/Bio_data/biomartr_cache/")
cachedir()
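
The interplay of cachedir() and cachedir_set() can be pictured as a small round trip through an rds file. The following base R sketch is an illustration only, not biomartr's implementation; the helper names toy_cachedir_set and toy_cachedir and the temporary pointer file are assumptions:

```r
# Hypothetical helpers mimicking the documented behaviour: the chosen cache
# path is saved in an rds "pointer" file, and the getter falls back to
# tempdir() when that file does not exist.
toy_cachedir_set <- function(path, pointer_file) {
  saveRDS(path, file = pointer_file)
  invisible(NULL)
}

toy_cachedir <- function(pointer_file) {
  if (file.exists(pointer_file)) readRDS(pointer_file) else tempdir()
}

# Use a throwaway pointer file so "~/.biomartr_cache_dir.rds" stays untouched.
pointer <- tempfile(fileext = ".rds")

before <- toy_cachedir(pointer)   # no pointer file yet, so tempdir()
toy_cachedir_set(file.path(tempdir(), "biomartr_cache"), pointer)
after  <- toy_cachedir(pointer)   # now the stored path
```
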

Check whether an annotation file contains outlier lines

Description

Some annotation files include lines with character lengths greater than 65000. This causes problems when trying to import such annotation files into R using import. To overcome this issue, this function screens for such lines in a given annotation file and removes these lines so that import can handle the file.

Usage

check_annotation_biomartr(annotation_file, remove_annotation_outliers = FALSE)

Arguments

annotation_file

a file path to the annotation file.

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# download an example annotation file from NCBI RefSeq
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana")
# run annotation file check on the downloaded file
biomartr::check_annotation_biomartr(Ath_path)
# several outlier lines were detected, thus we re-run the
# function using 'remove_annotation_outliers = TRUE'
# to remove the outliers and overwrite the file
biomartr::check_annotation_biomartr(Ath_path, remove_annotation_outliers = TRUE)

## End(Not run)
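
The screening step described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's code: the helper name screen_long_lines, the name of the outlier file, and the toy GFF-like input are all hypothetical; only the 65000-character limit comes from the description.

```r
# Hypothetical helper: report (and optionally drop) lines longer than a limit.
screen_long_lines <- function(annotation_file, max_chars = 65000, remove = FALSE) {
  lines <- readLines(annotation_file, warn = FALSE)
  is_outlier <- nchar(lines) > max_chars
  if (remove && any(is_outlier)) {
    # Keep the removed lines in tempdir() for later inspection,
    # then overwrite the input file without them.
    outlier_file <- file.path(tempdir(),
                              paste0(basename(annotation_file), "_outliers.txt"))
    writeLines(lines[is_outlier], outlier_file)
    writeLines(lines[!is_outlier], annotation_file)
  }
  which(is_outlier)
}

# Small self-made GFF-like file with one oversized line (line 2).
gff <- tempfile(fileext = ".gff")
writeLines(c("chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
             strrep("A", 70000),
             "chr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=g2"), gff)

outliers_found <- screen_long_lines(gff)   # report only
screen_long_lines(gff, remove = TRUE)      # overwrite the file
remaining <- length(readLines(gff))
```
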

Download an NCBI Database to Your Local Hard Drive

Description

This function allows users to download a database selected by listDatabases to their local hard drive.

Usage

download.database(db, path = "database")

Arguments

db

a character string specifying the database that shall be downloaded (selected from listDatabases).

path

a character string specifying the location (a folder) in which the corresponding database shall be stored. Default is path = "database". In case this folder does not exist yet, it will be created.

Details

This function downloads large databases to your hard drive. For this purpose a folder named database (default) is created and the corresponding database is then stored in this folder.

Value

File path to the downloaded database file.

Author(s)

Hajk-Georg Drost

See Also

download.database.all, listDatabases

Examples

## Not run: 
  # search for available NCBI nr databases
  listNCBIDatabases(db = "nr")
  # select NCBI nr version 27 =  "nr.27.tar.gz"
  # and download it to your hard drive
  # -> please note that large databases take some time for download!
  download.database(db = "nr.27.tar.gz")

## End(Not run)
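
The folder handling described under Details (create the target folder if it is missing, then store the file inside it) can be sketched without any download. The helper name prepare_db_path is hypothetical and the code only illustrates the documented behaviour; it does not contact NCBI:

```r
# Hypothetical helper: make sure the target folder exists and return the
# file path a database archive would be written to. No download happens here.
prepare_db_path <- function(db, path = "database") {
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  file.path(path, db)
}

# Work inside tempdir() so nothing is created in the working directory.
target_dir <- file.path(tempdir(), "database")
dest <- prepare_db_path("nr.27.tar.gz", path = target_dir)
```
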

Download all elements of an NCBI database

Description

The download.database function allows users to retrieve individual packages of an NCBI database. This function is designed to retrieve the entire database selected by the users (hence all packages corresponding to this database).

Usage

download.database.all(db, path = NULL)

Arguments

db

a character string specifying the database that shall be downloaded (selected from listDatabases).

path

a character string specifying the location (a folder) in which the corresponding database shall be stored. In case this folder does not exist yet, it will be created.

Value

A character vector storing the file paths of the downloaded databases.

Author(s)

Hajk-Georg Drost

See Also

download.database, listNCBIDatabases

Examples

## Not run: 
# search for available NCBI databases
  listNCBIDatabases(db = "all")
# choose database NCBI nr and download complete database
  download.database.all(db = "nr", path = "nr")

## End(Not run)

List all available ENSEMBL divisions

Description

Retrieve a list of available databases (divisions) on ENSEMBL for which information can be retrieved with get.ensembl.info.

Usage

ensembl_divisions()

Author(s)

Hajk-Georg Drost

Examples

ensembl_divisions()

Helper function to retrieve species information from the ENSEMBL API

Description

This function interfaces with the ENSEMBL API (https://rest.ensembl.org/info/species?content-type=application/json) and internally stores the output to use this information for subsequent retrieval function calls.

Usage

get.ensembl.info(update = FALSE, division)

Arguments

update

logical, default FALSE. If TRUE, force re-download of info.

division

the ENSEMBL database (division) for which information shall be retrieved (available options can be obtained with ensembl_divisions).

Author(s)

Hajk-Georg Drost

See Also

ensembl_divisions, getKingdomAssemblySummary, getENSEMBLInfo

Examples

## Not run: 
# Look at available ENSEMBL division options
ensembl_divisions()
# Retrieve available information for EnsemblVertebrates
example <- get.ensembl.info(division = "EnsemblVertebrates")
example
# Update information file stored in the tempdir() folder.
example_update <- get.ensembl.info(division = "EnsemblVertebrates", update = TRUE)
example_update

## End(Not run)
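
The effect of the update argument can be illustrated with a small caching sketch in base R. It is not the package's implementation: cached_info and fake_fetch are hypothetical names, and fake_fetch merely stands in for a request to the ENSEMBL API.

```r
# Hypothetical helper: return a cached object unless 'update = TRUE'
# (or no cache exists yet), in which case 'fetch' is called again.
cached_info <- function(cache_file, fetch, update = FALSE) {
  if (!update && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  info <- fetch()
  saveRDS(info, cache_file)
  info
}

# Stand-in for a REST request; counts how often it is actually called.
n_requests <- 0
fake_fetch <- function() {
  n_requests <<- n_requests + 1
  data.frame(division = "EnsemblVertebrates", request = n_requests)
}

cache <- tempfile(fileext = ".rds")
first  <- cached_info(cache, fake_fetch)                 # fetches
second <- cached_info(cache, fake_fetch)                 # served from cache
third  <- cached_info(cache, fake_fetch, update = TRUE)  # fetches again
```
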

Genome Assembly Stats Retrieval

Description

Main genome assembly stats retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding genome assembly stats file storing the assembly statistics of the organism of interest can be downloaded and stored locally. Genome assembly stats files can be retrieved from several databases.

Usage

getAssemblyStats(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  type = "download",
  path = file.path("_ncbi_downloads", "genomeassembly_stats"),
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

a character string specifying the scientific name of the organism of interest, e.g. organism = "Homo sapiens".

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights for the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

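a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For example, release = 75 for human would give the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.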

type

shall the file only be downloaded (default, type = "download"), or shall it be downloaded and subsequently imported (type = "import")?

path

a character string specifying the location (a folder) in which the corresponding file shall be stored. Default is path = file.path("_ncbi_downloads","genomeassembly_stats").

mute_citation

logical value indicating whether citation message should be muted.

Details

Internally this function loads the overview.txt file from NCBI:

refseq: refseq_genbank_ftp_server_url_genome_specific("refseq")

genbank: refseq_genbank_ftp_server_url_genome_specific("genbank")

to retrieve available scientific names of organisms and creates a directory '_ncbi_downloads/genomeassembly_stats' to store the Genome Assembly Stats of interest as a text file for future processing. In case the corresponding stats file already exists within the '_ncbi_downloads/genomeassembly_stats' folder and is accessible within the workspace, no download process will be performed.

An example genome assembly stats file can be found here: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt.

Value

File path to downloaded genome assembly stats file.

Author(s)

Hajk-Georg Drost

See Also

getGenome, getProteome, getCDS, getGFF, getRNA, getCollection, meta.retrieval, read_assemblystats

Examples

## Not run: 
# download the genome assembly stats file of Saccharomyces cerevisiae
# from NCBI RefSeq
# and store the corresponding stats file in
# '_ncbi_downloads/genomeassembly_stats'
file_path <- getAssemblyStats( db = "refseq",
                 organism = "Saccharomyces cerevisiae",
                 path = file.path("_ncbi_downloads","genomeassembly_stats"))
# import the raw file as it is downloaded
Scerevisiae.stats <- read_assemblystats(file_path, type = "raw")

# download the genome assembly stats file of Saccharomyces cerevisiae
# from NCBI RefSeq
# and import overall statistics of the genome assembly
Scerevisiae.stats.import <- getAssemblyStats( db = "refseq",
                 organism = "Saccharomyces cerevisiae",
                 type = "import",
                 path = file.path("_ncbi_downloads","genomeassembly_stats"))

## End(Not run)
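
The constraint on the release argument (NULL for the most recent version, otherwise a value > 46) can be written down as a small check. The helper release_is_valid is hypothetical and only mirrors the wording of the argument description; the package may validate its input differently.

```r
# Hypothetical helper reflecting the documented constraints on 'release':
# NULL means "use the most recent release"; otherwise a single number > 46.
release_is_valid <- function(release = NULL) {
  if (is.null(release)) return(TRUE)
  is.numeric(release) && length(release) == 1L && !is.na(release) && release > 46
}

release_is_valid(NULL)  # most recent release
release_is_valid(75)    # e.g. the GRCh37 release for human
release_is_valid(40)    # too old according to the documentation
```
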

Retrieve All Available Attributes for a Specific Dataset

Description

This function queries the BioMart Interface and returns a table storing all available attributes for a specific dataset.

Usage

getAttributes(mart, dataset, mute_citation = FALSE)

Arguments

mart

a character string specifying the database (mart) for which datasets shall be listed.

dataset

a character string specifying the dataset for which attributes shall be listed.

mute_citation

logical value indicating whether citation message should be muted.

Author(s)

Hajk-Georg Drost

See Also

Other biomaRt: biomart(), getDatasets(), getMarts(), organismBM(), organismFilters()

Examples

## Not run: 
# search for available datasets
getMarts()

# choose database (mart): ENSEMBL_MART_ENSEMBL
# and get a table of all available datasets from this BioMart database
head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10)

# choose dataset: "hsapiens_gene_ensembl"
head(getAttributes(mart = "ENSEMBL_MART_ENSEMBL",
                   dataset = "hsapiens_gene_ensembl") , 5)

## End(Not run)

A wrapper to all bio getters, selected with 'type' argument

Description

A wrapper to all bio getters, selected with 'type' argument

Usage

getBio(
  db = "refseq",
  organism,
  type,
  reference = FALSE,
  release = NULL,
  gunzip = FALSE,
  update = FALSE,
  skip_bacteria = TRUE,
  path = paste0("set_", toupper(type)),
  remove_annotation_outliers = FALSE,
  analyse_genome = FALSE,
  assembly_type = "toplevel",
  format = "gff3",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

type

biological sequence type. (alternatives are: genome, gff, cds, rna, proteome, assembly_stats, repeat_masker, collection (all the others))

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For example, release = 75 for human would give the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Indicates whether the cached backend files shall be updated. Usually keep this FALSE to make the function run much faster. Only set it to TRUE if you believe your cache is outdated (e.g. a species only exists in the newest release).

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights for the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

path

character, default location is paste0("set_", toupper(type))

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

analyse_genome

logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome

assembly_type

character, used for ensembl only, specifies the genome assembly type. The default shown in Usage is "toplevel". When c("primary_assembly", "toplevel") is supplied, both primary and toplevel are searched for and, if both are found, the first by order is used (so primary is prioritized in that order). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

format

"gff3", alternative "gtf" for ensembl.

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly and sequence data are located via an overview file taken from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type; for fasta genomes this is '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. The same rule applies to other file types.

For ensembl, it fetches an overview per type from the REST API:

ensembl: https://rest.ensembl.org

Value

File path to downloaded genome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getCDS(), getCollection(), getGFF(), getGenome(), getProteome(), getRNA()
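
The three ways of specifying the organism argument (scientific name, accession identifier, taxonomic identifier) can be told apart with simple pattern matching. The sketch below is an assumption about how such a distinction could be made, not the package's own logic; the helper classify_organism_id and its regular expressions are hypothetical and only cover NCBI-style GCF_/GCA_ accessions.

```r
# Hypothetical helper: guess which of the three documented selector styles
# a given 'organism' value uses. The regular expressions are assumptions.
classify_organism_id <- function(organism) {
  if (grepl("^[0-9]+$", organism)) {
    "taxid"
  } else if (grepl("^GC[AF]_[0-9]+(\\.[0-9]+)?$", organism)) {
    "accession"
  } else {
    "scientific_name"
  }
}

vapply(c("Homo sapiens", "GCF_000001405.37", "9606"),
       classify_organism_id, character(1))
```
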


Generic Bio data set extractor

Description

Usually you want to use one of the specific set extractors

Usage

getBioSet(
  db = "refseq",
  organisms,
  set_type,
  reference = FALSE,
  release = NULL,
  gunzip = TRUE,
  update = FALSE,
  skip_bacteria = TRUE,
  path = paste0("set_", toupper(set_type)),
  remove_annotation_outliers = FALSE,
  assembly_type = "toplevel",
  format = "gff3",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

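a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:

  • by scientific name: e.g. "Homo sapiens"

  • by database specific accession identifier: e.g. "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. "9606" (= taxid of Homo sapiens)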

set_type

the biological sequence type that shall be retrieved. Available options are

  • set_type = "genome"

  • set_type = "proteome"

  • set_type = "cds"

  • set_type = "gff"

  • set_type = "rna"

  • set_type = "assembly_stats"

  • set_type = "repeat_masker"

  • set_type = "collection" (all the others)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For example, release = 75 for human would give the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Indicates whether the cached backend files shall be updated. Usually keep this FALSE to make the function run much faster. Only set it to TRUE if you believe your cache is outdated (e.g. a species only exists in the newest release).

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights for the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

path

character, default location is paste0("set_", toupper(set_type))

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

assembly_type

character, used for ensembl only, specifies the genome assembly type. The default shown in Usage is "toplevel". When c("primary_assembly", "toplevel") is supplied, both primary and toplevel are searched for and, if both are found, the first by order is used (so primary is prioritized in that order). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

format

"gff3", alternative "gtf" for ensembl.

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly and sequence data are located via an overview file taken from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type; for fasta genomes this is '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. The same rule applies to other file types.

For ensembl, it fetches an overview per type from the REST API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded genomes.
The returned character vector has names that are either:
- 'new' (file was downloaded now)
- 'old' (file already existed)

Author(s)

Hajk-Georg Drost

See Also

getBio

Other getBioSet: getCDSSet(), getCollectionSet(), getGFFSet(), getGenomeSet(), getProteomeSet(), getRNASet()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)
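
The 'new'/'old' naming of the returned vector can be illustrated with a base R sketch in which a file is only written when it does not exist yet. The helper fetch_set, the writer argument, and the file names are hypothetical; no data is downloaded and the real function may organise this differently.

```r
# Hypothetical helper: "download" each missing file and label every path
# as 'new' (created now) or 'old' (already present). 'writer' stands in
# for the real download step.
fetch_set <- function(dest_files, writer) {
  status <- ifelse(file.exists(dest_files), "old", "new")
  for (f in dest_files[status == "new"]) writer(f)
  stats::setNames(dest_files, status)
}

out_dir <- file.path(tempdir(), "set_CDS")
dir.create(out_dir, showWarnings = FALSE)
files <- file.path(out_dir, c("speciesA_cds.fna", "speciesB_cds.fna"))
unlink(files)                            # start from a clean folder
writeLines(">seq1\nATG", files[1])       # pretend one file already exists

paths <- fetch_set(files, writer = function(f) writeLines(">seq1\nATG", f))
names(paths)
```
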

Coding Sequence Retrieval

Description

Main retrieval function for coding sequences (CDS) of an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the CDS information for the organism of interest can be downloaded and stored locally. CDS files can be retrieved from several databases.

Usage

getCDS(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  gunzip = FALSE,
  path = file.path("_ncbi_downloads", "CDS"),
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights for the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For example, release = 75 for human would give the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

path

a character string specifying the location (a folder) in which the corresponding CDS file shall be stored. Default is path = file.path("_ncbi_downloads","CDS").

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly and sequence data are located via an overview file taken from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type; for fasta genomes this is '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. The same rule applies to other file types.

For ensembl, it fetches an overview per type from the REST API:

ensembl: https://rest.ensembl.org

Value

File path to downloaded CDS file.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCollection(), getGFF(), getGenome(), getProteome(), getRNA()

Other cds: getCDSSet(), read_cds()

Examples

## Not run: 
# download the CDS of Arabidopsis thaliana from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
file_path <- getCDS( db       = "refseq",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","CDS"))

Ath_CDS <- read_cds(file_path, format = "fasta")


## End(Not run)
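
Since gunzip = FALSE is the default, the returned file path may point to a compressed fasta file. A quick sanity check that does not depend on read_cds() is to count the header lines. The helper count_fasta_records and the toy file below are hypothetical and only sketch the idea:

```r
# Hypothetical helper: count FASTA records by counting header lines.
# gzfile() reads plain text as well as gzip-compressed files.
count_fasta_records <- function(fasta_path) {
  con <- gzfile(fasta_path, open = "rt")
  on.exit(close(con))
  sum(startsWith(readLines(con, warn = FALSE), ">"))
}

# Self-made two-record FASTA file; the sequences are placeholders.
toy_fasta <- tempfile(fileext = ".fna")
writeLines(c(">cds1", "ATGAAATAA", ">cds2", "ATGCCCTGA"), toy_fasta)
n_records <- count_fasta_records(toy_fasta)
```
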

CDS retrieval of multiple species

Description

Main CDS retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the CDS of the organisms of interest will be downloaded and stored locally. CDS files can be retrieved from several databases.

Usage

getCDSSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  gunzip = TRUE,
  update = FALSE,
  path = "set_CDS"
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

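a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:

  • by scientific name: e.g. "Homo sapiens"

  • by database specific accession identifier: e.g. "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. "9606" (= taxid of Homo sapiens)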

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For example, release = 75 for human would give the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Indicates whether the cached backend files shall be updated. Usually keep this FALSE to make the function run much faster. Only set it to TRUE if you believe your cache is outdated (e.g. a species only exists in the newest release).

path

character, default location is "set_CDS".

Details

Assembly and sequence data are located via an overview file taken from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type; for fasta genomes this is '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. The same rule applies to other file types.

For ensembl, it fetches an overview per type from the REST API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded CDS files.
The returned character vector has names that are either:
- 'new' (file was downloaded now)
- 'old' (file already existed)

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCollectionSet(), getGFFSet(), getGenomeSet(), getProteomeSet(), getRNASet()

Other cds: getCDS(), read_cds()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)
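
Because the names of the returned vector repeat ('new' or 'old' for every file), selecting the freshly downloaded files is best done by comparing names. A minimal sketch with a toy vector; the file names are placeholders, not real download results:

```r
# Toy return value shaped like the one documented under Value.
paths <- c(old = "set_CDS/speciesA_cds.fna",
           new = "set_CDS/speciesB_cds.fna",
           new = "set_CDS/speciesC_cds.fna")

# Names may repeat, so subset by comparing names; paths["new"] would
# return only the first match.
new_files <- unname(paths[names(paths) == "new"])
```
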

Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats

Description

Main collection retrieval function for an organism of interest. By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally. Collections can be retrieved from several databases. For full set of collection elements, see: biomartr:::supported_biotypes(db)

Usage

getCollection(
  db = "refseq",
  organism,
  reference = TRUE,
  skip_bacteria = TRUE,
  release = NULL,
  assembly_type = "toplevel",
  analyse_genome = FALSE,
  remove_annotation_outliers = FALSE,
  gunzip = FALSE,
  path = file.path("_db_downloads", "collections"),
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

assembly_type

character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

analyse_genome

logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

gunzip

a logical, indicating whether or not files should be unzipped.

path

a character string specifying the location (a folder) in which the corresponding collection shall be stored. Default is path = file.path("_db_downloads","collections").

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org
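
The "no download if the file already exists" rule described above can be illustrated with a small sketch. The helper needs_download() and the file name are hypothetical and only mimic the documented behaviour; they are not the package's internal implementation.

```r
# Illustrative helper (not part of biomartr): a download is only needed
# when the target file is not already present on disk.
needs_download <- function(file_path) {
  !file.exists(file_path)
}

# A temporary directory stands in for '_ncbi_downloads/genomes'.
download_dir <- file.path(tempdir(), "_ncbi_downloads", "genomes")
dir.create(download_dir, recursive = TRUE, showWarnings = FALSE)
target <- file.path(download_dir, "example_genomic.fna.gz")  # hypothetical file name
unlink(target)  # make sure the demonstration starts from a clean state

needs_download(target)  # checked before the file exists
file.create(target)
needs_download(target)  # checked again once the file is on disk
```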

Value

File path to downloaded genome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getGFF(), getGenome(), getProteome(), getRNA()

Other collection: getCollectionSet()

Examples

## Not run: 
# download the collection of Homo sapiens from refseq
# and store the corresponding files in '_db_downloads/collections'
Hsap_collection <- getCollection( db       = "refseq",
             organism = "Homo sapiens",
             path = file.path("_db_downloads","collections"))
# download the collection of Homo sapiens from genbank
# and store the corresponding files in '_db_downloads/collections'
Hsap_collection <- getCollection( db       = "genbank",
             organism = "Homo sapiens",
             path = file.path("_db_downloads","collections"))
# download the collection of Homo sapiens from ensembl
# and store the corresponding files in '_db_downloads/collections'
Hsap_collection <- getCollection( db       = "ensembl",
             organism = "Homo sapiens",
             path = file.path("_db_downloads","collections"))

## End(Not run)

Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats of multiple species

Description

Main collection retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, and AssemblyStats file of each organism can be downloaded and stored locally. Collections can be retrieved from several databases.

Usage

getCollectionSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  skip_bacteria = TRUE,
  gunzip = TRUE,
  update = FALSE,
  remove_annotation_outliers = TRUE,
  path = "set_collections",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

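a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism: by scientific name, by database specific accession identifier, or by taxonomic identifier from NCBI Taxonomy.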

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Whether the cached backend files should be updated. Usually keep this FALSE to make it run much faster. Only set to TRUE if you believe your cache is outdated (e.g. the species only exists in the newest release).

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

path

a character string specifying the location (a folder) in which the corresponding collection shall be stored. Default is path = file.path("_db_downloads","collections").

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded files.
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' (file did already exist)

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCDSSet(), getGFFSet(), getGenomeSet(), getProteomeSet(), getRNASet()

Other collection: getCollection()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)
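
The example above calls getBioSet(); a direct call to getCollectionSet() might look like the following sketch. The argument names come from the Usage block, the organism list is illustrative, and the run_download switch is an assumption added here so that nothing is fetched unless it is explicitly enabled.

```r
# Arguments taken from the Usage block; the organism list is illustrative.
collection_args <- list(
  db        = "refseq",
  organisms = c("Arabidopsis thaliana",
                "Arabidopsis lyrata",
                "Capsella rubella"),
  path      = "set_collections"
)

# Hypothetical guard: set to TRUE to contact the database (needs network access).
run_download <- FALSE
if (run_download && requireNamespace("biomartr", quietly = TRUE)) {
  collection_files <- do.call(biomartr::getCollectionSet, collection_args)
  collection_files
}
```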

Retrieve All Available Datasets for a BioMart Database

Description

This function queries the BioMart API and returns a table storing all available datasets for a selected BioMart database.

Usage

getDatasets(mart, mute_citation = FALSE)

Arguments

mart

a character string specifying the database (mart) for which datasets shall be listed.

mute_citation

logical value indicating whether citation message should be muted.

Author(s)

Hajk-Georg Drost

See Also

Other biomaRt: biomart(), getAttributes(), getMarts(), organismBM(), organismFilters()

Examples

## Not run: 
# search for available datasets
# getMarts()
# choose database: "ENSEMBL_MART_ENSEMBL"
head(getDatasets("ENSEMBL_MART_ENSEMBL"), 10)

## End(Not run)

Download sequence or annotation from ENSEMBL

Description

Backend function for retrieving sequence and annotation files from the ENSEMBL ftp server.

Usage

getENSEMBL(
  organism,
  type = "dna",
  id.type = "toplevel",
  release = NULL,
  path,
  format
)

Arguments

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

type

character, biological sequence type (e.g. "dna", "cds")

id.type

a character, default "toplevel". id type of assembly, either "toplevel" or "primary_assembly" for genomes. Can be other strings, for non genome objects.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

path

location where file shall be stored.

format

"gff3", alternative "gtf" for ensembl.

Value

either a character path to downloaded file, or a logical FALSE, specifying failure.
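
Because the return value is either a path or FALSE, callers may want to check it before using it. The helper below is a hypothetical sketch of such a check and is not part of biomartr.

```r
# Illustrative helper (not part of biomartr): stop early if a download failed,
# otherwise hand back the file path unchanged.
require_download <- function(result) {
  if (identical(result, FALSE)) {
    stop("Download failed: the function returned FALSE.", call. = FALSE)
  }
  if (!is.character(result)) {
    stop("Unexpected return value: expected a character file path.", call. = FALSE)
  }
  result
}

# Stand-in value for a successful download (hypothetical path).
require_download("ensembl_downloads/Homo_sapiens.dna.toplevel.fa.gz")
```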

Author(s)

Hajk-Georg Drost


Helper function for retrieving gtf files from ENSEMBL

Description

This function downloads gtf files of query organisms from ENSEMBL.

Usage

getENSEMBL.gtf(organism, type = "dna", path, release = NULL)

Arguments

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

type

character, biological sequence type (e.g. "dna", "cds")

path

location where file shall be stored.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

Value

character file path to the downloaded file; returns FALSE if the download failed.

Author(s)

Hajk-Georg Drost


Helper function for retrieving biological sequence files from ENSEMBL

Description

This function downloads biological sequence files of query organisms from ENSEMBL.

Usage

getENSEMBL.Seq(
  organism,
  type = "dna",
  id.type = "toplevel",
  release = NULL,
  path
)

Arguments

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

type

character, biological sequence type (e.g. "dna", "cds")

id.type

a character, default "toplevel". id type of assembly, either "toplevel" or "primary_assembly" for genomes. Can be other strings, for non genome objects.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

path

location where file shall be stored.

Value

either a character path to downloaded file, or a logical FALSE, specifying failure.
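
The constraint on the release argument (NULL, or a number greater than 46) can be checked before a download is attempted. The function validate_release() below is a hypothetical sketch of that rule, not a function exported by biomartr.

```r
# Illustrative check (not part of biomartr) for the documented release rule:
# NULL selects the most recent release, otherwise a single number > 46 is expected.
validate_release <- function(release = NULL) {
  if (is.null(release)) {
    return(TRUE)
  }
  is.numeric(release) && length(release) == 1L && !is.na(release) && release > 46
}

validate_release(NULL)  # most recent release
validate_release(75)    # the GRCh37 example from the argument description
validate_release(46)    # at the documented lower bound, which is excluded
```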

Author(s)

Hajk-Georg Drost


Retrieve ENSEMBLGENOMES info file

Description

Retrieve species and genome information from http://rest.ensemblgenomes.org/info/species?content-type=application/json/.

Usage

getENSEMBLGENOMESInfo()

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
info.file <- getENSEMBLGENOMESInfo()
info.file

## End(Not run)

Retrieve ENSEMBL info file

Description

Retrieve species and genome information from http://rest.ensembl.org/info/species?content-type=application/json/.

Usage

getENSEMBLInfo(update = FALSE, divisions = ensembl_divisions())

Arguments

update

logical, default FALSE. If TRUE, update cached list, if FALSE use existing cache (if it exists). For cache location see cachedir()

divisions

character, name of divisions to check, default is all from ensembl_divisions(). If NULL, also all is used.

Value

a tibble table storing info for all available ENSEMBL divisions.

Author(s)

Hajk-Georg Drost

See Also

ensembl_divisions, get.ensembl.info, getKingdomAssemblySummary

Examples

## Not run: 
# look at available divisions
ensembl_divisions()
# retrieve information for all ENSEMBL divisions at once
test <- getENSEMBLInfo()
test
# retrieve information for a particular ENSEMBL division (e.g. EnsemblVertebrates)
test_vertebrates <- get.ensembl.info(update = TRUE, division = "EnsemblVertebrates")
test_vertebrates

## End(Not run)

Retrieve All Available Filters for a Specific Dataset

Description

This function queries the BioMart API and returns a table storing all available filters for a specific dataset.

Usage

getFilters(mart, dataset, mute_citation = FALSE)

Arguments

mart

a character string specifying the database (mart) for which datasets shall be listed.

dataset

a character string specifying the dataset for which filters shall be listed.

mute_citation

logical value indicating whether citation message should be muted.

Author(s)

Hajk-Georg Drost

See Also

getMarts, getDatasets, getAttributes, organismBM, organismFilters, organismAttributes

Examples

## Not run: 
# search for available datasets
# getMarts()
# choose database (mart): "ENSEMBL_MART_ENSEMBL"
# head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10)
# choose dataset: "hsapiens_gene_ensembl"
head(getFilters(mart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl") , 5)

## End(Not run)

Genome Retrieval

Description

Main genome retrieval function for an organism of interest. By specifying the scientific name of an organism of interest, the corresponding fasta-file storing the genome of that organism can be downloaded and stored locally. Genome files can be retrieved from several databases. In addition, the genome summary statistics for the retrieved species are stored locally to provide users with insights regarding the genome assembly quality (see summary_genome for details). This is useful when comparing genomes with large differences in genome assembly quality.

Usage

getGenome(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  gunzip = FALSE,
  path = file.path("_ncbi_downloads", "genomes"),
  assembly_type = "toplevel",
  mute_citation = FALSE,
  analyse_genome = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

path

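a character string specifying the location (a folder) in which the corresponding genome shall be stored. Default is path = file.path("_ncbi_downloads","genomes").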

assembly_type

character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

analyse_genome

logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org
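
The priority rule described for assembly_type (use the first preferred type that is actually available) can be sketched as follows. pick_assembly_type() is a hypothetical helper written for illustration and does not reproduce the package's internal code.

```r
# Illustrative helper (not part of biomartr): return the first entry of
# 'preferred' that also occurs in 'available', or NA if none matches.
pick_assembly_type <- function(available,
                               preferred = c("primary_assembly", "toplevel")) {
  hits <- preferred[preferred %in% available]
  if (length(hits) == 0L) {
    return(NA_character_)
  }
  hits[1]
}

# Both files exist: the primary assembly comes first in the preference order.
pick_assembly_type(c("toplevel", "primary_assembly"))
# Only the toplevel file exists for this (hypothetical) species.
pick_assembly_type("toplevel")
```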

Value

File path to downloaded genome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getCollection(), getGFF(), getProteome(), getRNA()

Other genome: getGenomeSet(), read_genome()
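
A typical call might look like the following sketch, which retrieves a genome from refseq and then imports it with read_genome(). The argument values are illustrative, and the run_download switch is an assumption added here so that no network access happens unless it is enabled.

```r
# Arguments taken from the Usage block; the organism is illustrative.
genome_args <- list(
  db       = "refseq",
  organism = "Arabidopsis thaliana",
  path     = file.path("_ncbi_downloads", "genomes")
)

# Hypothetical guard: set to TRUE to contact the database (needs network access).
run_download <- FALSE
if (run_download && requireNamespace("biomartr", quietly = TRUE)) {
  # getGenome() returns the file path to the downloaded genome
  genome_path <- do.call(biomartr::getGenome, genome_args)
  # import the downloaded fasta file into the R session
  genome <- biomartr::read_genome(genome_path)
}
```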


Retrieve NCBI GENOME_REPORTS file

Description

Retrieves NCBI GENOME_REPORTS file from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/overview.txt.

Usage

getGENOMEREPORT(
  local_file = file.path(cachedir(), "_ncbi_downloads", "overview.txt")
)

Arguments

local_file

character, file path, default: file.path(cachedir(), "_ncbi_downloads", "overview.txt")

Value

a tibble object with the report

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
report <- getGENOMEREPORT()
report

## End(Not run)

Genome Retrieval of multiple species

Description

Main genome retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, the corresponding fasta-files storing the genomes of these organisms will be downloaded and stored locally. Genome files can be retrieved from several databases.

Usage

getGenomeSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  skip_bacteria = TRUE,
  gunzip = TRUE,
  update = FALSE,
  path = "set_genomes",
  assembly_type = "toplevel",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

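a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism: by scientific name, by database specific accession identifier, or by taxonomic identifier from NCBI Taxonomy.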

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Whether the cached backend files should be updated. Usually keep this FALSE to make it run much faster. Only set to TRUE if you believe your cache is outdated (e.g. the species only exists in the newest release).

path

a character string specifying the location (a folder) in which the corresponding genomes shall be stored. Default is path = "set_genomes".

assembly_type

character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded genomes.
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' (file did already exist)

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCDSSet(), getCollectionSet(), getGFFSet(), getProteomeSet(), getRNASet()

Other genome: getGenome(), read_genome()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)

Genome Annotation Retrieval (GFF3)

Description

Main retrieval function for GFF files of an organism of interest. By specifying the scientific name of an organism of interest the corresponding gff file storing the annotation for the organism of interest can be downloaded and stored locally. GFF files can be retrieved from several databases.

Usage

getGFF(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  gunzip = FALSE,
  remove_annotation_outliers = FALSE,
  path = file.path("_ncbi_downloads", "annotation"),
  mute_citation = FALSE,
  format = "gff3"
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

path

a character string specifying the location (a folder) in which the corresponding annotation file shall be stored. Default is path = file.path("_ncbi_downloads","annotation").

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

format

"gff3", alternative "gtf" for ensembl.

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org

Value

File path to the downloaded annotation file.
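
The three organism selector forms listed under Arguments (scientific name, accession identifier, taxonomic identifier) can be told apart with simple pattern checks. The helper below is a hypothetical heuristic written for illustration; the patterns (digits only for a taxid, a GCF_/GCA_ prefix for NCBI assembly accessions) are assumptions and not necessarily how the package resolves identifiers.

```r
# Illustrative heuristic (not part of biomartr): guess which kind of
# organism selector a single string represents.
classify_organism_id <- function(x) {
  if (grepl("^[0-9]+$", x)) {
    "taxid"                                   # e.g. "9606"
  } else if (grepl("^GC[AF]_[0-9]+\\.[0-9]+$", x)) {
    "accession"                               # e.g. "GCF_000001405.37"
  } else {
    "scientific_name"                         # e.g. "Homo sapiens"
  }
}

vapply(c("Homo sapiens", "GCF_000001405.37", "9606"),
       classify_organism_id, character(1))
```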

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getCollection(), getGenome(), getProteome(), getRNA()

Other gff: getGFFSet(), read_gff()

Examples

## Not run: 
# download the annotation of Arabidopsis thaliana from refseq
# and store the corresponding annotation file in '_ncbi_downloads/annotation'
Athal_gff <- getGFF( db       = "refseq",
               organism = "Arabidopsis thaliana",
               path = file.path("_ncbi_downloads","annotation"),
               remove_annotation_outliers = TRUE)
Athal_gff_import <- read_gff(Athal_gff)


# download the annotation of Arabidopsis thaliana from genbank
# and store the corresponding annotation file in '_ncbi_downloads/annotation'
Athal_gff <- getGFF( db       = "genbank",
               organism = "Arabidopsis thaliana",
               path = file.path("_ncbi_downloads","annotation"),
               remove_annotation_outliers = TRUE)
Athal_gff_import <- read_gff(Athal_gff)

# download the annotation of Homo sapiens from ensembl
# and store the corresponding annotation file in '_ncbi_downloads/annotation'
Hsap_gff <- getGFF( db       = "ensembl",
               organism = "Homo sapiens",
               path = file.path("_ncbi_downloads","annotation"),
               remove_annotation_outliers = TRUE)
Hsap_gff_import <- read_gff(Hsap_gff)


## End(Not run)

GFF retrieval of multiple species

Description

Main GFF retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, the corresponding GFF files storing the annotation of these organisms will be downloaded and stored locally. GFF files can be retrieved from several databases.

Usage

getGFFSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  skip_bacteria = TRUE,
  gunzip = TRUE,
  remove_annotation_outliers = FALSE,
  update = FALSE,
  format = "gff3",
  path = "set_GFF",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

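a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism: by scientific name, by database specific accession identifier, or by taxonomic identifier from NCBI Taxonomy.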

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data in the standard format before that.

skip_bacteria

Due to its enormous size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

gunzip

a logical, indicating whether or not files should be unzipped.

remove_annotation_outliers

shall outlier lines be removed from the input annotation_file? If yes, then the initial annotation_file will be overwritten and the removed outlier lines will be stored at tempdir for further exploration.

update

logical, default FALSE. Whether the cached backend files should be updated. Usually keep this FALSE to make it run much faster. Only set to TRUE if you believe your cache is outdated (e.g. the species only exists in the newest release).

format

"gff3", alternative "gtf" for ensembl.

path

character, default location is paste0("set_", toupper(set_type))

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Fetching of assembly / sequence data is done by fetching an overview file from metadata of given database:
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory relative to the file type wanted; if you get fasta genomes it will be '_ncbi_downloads/genomes', etc. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download process will be performed. For other file types the same rule applies.

For ensembl it fetches overview per type from the rest API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded genomes.
The elements of the returned character vector are named either:
- 'new' (the file was downloaded now)
- 'old' (the file already existed)
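
Because the returned vector is named by download status, the two groups can be separated by name. The following is a minimal sketch with invented file paths (they are assumptions for illustration, not output of the package):

```r
# Illustrative stand-in for the value returned by a *Set() function;
# the file paths are invented for this sketch.
file_paths <- c(new = "set_gff/Arabidopsis_thaliana.gff3",
                old = "set_gff/Arabidopsis_lyrata.gff3",
                new = "set_gff/Capsella_rubella.gff3")

# Group the paths by their name ('new' or 'old')
by_status <- split(unname(file_paths), names(file_paths))

by_status$new  # files downloaded in this call
by_status$old  # files that already existed locally
```

Splitting by name keeps the original order within each group, which can be handy when only freshly downloaded files need further processing.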

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCDSSet(), getCollectionSet(), getGenomeSet(), getProteomeSet(), getRNASet()

Other gff: getGFF(), read_gff()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)

Gene Ontology Query

Description

This function takes a gene id as character vector from a given query organism and returns the corresponding GO terms and additional GO information.

Usage

getGO(organism, genes, filters, ...)

Arguments

organism

a character string specifying the scientific name of a query organism.

genes

a character vector storing the gene ids of an organism of interest to be queried against Ensembl Biomart.

filters

a character vector specifying the filter (query key) for the Ensembl Biomart query, e.g. filters = "ensembl_gene_id".

...

additional parameters that can be passed to the biomart function.

Details

This function takes the scientific name of a query organism, a set of genes for which GO terms and additional information shall be retrieved, and a filter argument that specifies the attribute for the query genes.

Author(s)

Hajk-Georg Drost

See Also

biomart, organismFilters, organismBM, getBM, getMarts, getDatasets, getFilters

Examples

## Not run:  
GO_tbl <- getGO(organism = "Arabidopsis thaliana", 
                genes    = c("AT1G06090", "AT1G06100"),
                filters  = "ensembl_gene_id")

# look at the result
head(GO_tbl)

## End(Not run)

Retrieve available groups for a kingdom of life (only available for NCBI RefSeq and NCBI Genbank)

Description

A short list of available groups for a kingdom of life.

Usage

getGroups(db = "refseq", kingdom)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

Default is db = "refseq".

kingdom

a character string specifying for which kingdom of life groups shall be retrieved. See getKingdoms for details.

Author(s)

Hajk-Georg Drost

See Also

meta.retrieval, getGenome, getProteome, getCDS, getKingdoms

Examples

# get possible kingdom names
getKingdoms(db = "refseq")
## Not run: 
# retrieve subgroups for vertebrate_mammalian available from refseq
getGroups(db = "refseq", kingdom = "vertebrate_mammalian")

# get possible kingdom names
getKingdoms(db = "genbank")
# retrieve subgroups for vertebrate_mammalian available from genbank
getGroups(db = "genbank", kingdom = "vertebrate_mammalian")

## End(Not run)

Genome Annotation Retrieval (GTF)

Description

Main retrieval function for GTF files of an organism of interest. By specifying the scientific name of an organism of interest the corresponding GTF file storing the annotation for the organism of interest can be downloaded and stored locally. GTF files can be retrieved from several databases.

Usage

getGTF(
  db = "ensembl",
  organism,
  remove_annotation_outliers = FALSE,
  path = file.path("ensembl", "annotation"),
  release = NULL,
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

remove_annotation_outliers

a logical value indicating whether outlier lines shall be removed from the input annotation_file. If TRUE, the initial annotation_file will be overwritten and the removed outlier lines will be stored in tempdir for further exploration.

path

a character string specifying the location (a folder) in which the corresponding annotation file shall be stored. Default is path = file.path("ensembl","annotation").

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly / sequence data are located by first fetching an overview file from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type, e.g. '_ncbi_downloads/genomes' for fasta genomes. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download will be performed. The same rule applies to other file types.

For ensembl, it fetches the overview per type from the REST API:

ensembl: https://rest.ensembl.org
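
The "no download if the file already exists" rule described above can be pictured with a small base R sketch. The helper name download_if_missing and the example URL are hypothetical and only illustrate the idea; this is not the package's internal code:

```r
# Hypothetical helper (not a biomartr function): download `url` to `dest`
# only when `dest` does not exist yet, and return the local path.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    message("File already exists, skipping download: ", dest)
    return(dest)
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest
}

# An already existing file is returned as-is, without touching the network
existing <- tempfile(fileext = ".gtf")
file.create(existing)
local_path <- download_if_missing("https://example.org/unused.gtf", existing)
```

Checking for the local file first makes repeated calls cheap and keeps analyses reproducible as long as the download folder is kept.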

Value

File path to downloaded genome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getCollection(), getGenome(), getProteome(), getRNA()

Other gff: getGFFSet(), read_gff()

Examples

## Not run: 
# download the annotation of Homo sapiens from ensembl
# and store the corresponding annotation file in 'ensembl/annotation'
getGTF(db            = "ensembl",
       organism      = "Homo sapiens",
       path          = file.path("ensembl","annotation"))

# download the annotation from a specific Ensembl release
# (for human, release = 75 corresponds to GRCh37)
getGTF(db            = "ensembl",
       organism      = "Homo sapiens",
       path          = file.path("ensembl","annotation"),
       release       = 75)


## End(Not run)

Retrieve and summarise the assembly_summary.txt files from NCBI for all kingdoms

Description

Retrieval function of the assembly_summary.txt file from NCBI for all kingdoms. The assembly_summary.txt files store available species on NCBI.

Usage

getKingdomAssemblySummary(
  db,
  skip_bacteria = TRUE,
  file = assemblies_info_path(db)
)

Arguments

db

database name. E.g. refseq or genbank.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

file

path, local path to total summary file, default is in tmp folder.
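
The effect of skip_bacteria can be sketched as a simple selection over kingdom names. The helper select_kingdoms is hypothetical and the kingdom names are only illustrative (getKingdoms lists the real ones); the sketch does not reproduce the package's internals:

```r
# Hypothetical sketch: decide for which kingdoms a summary file is loaded.
select_kingdoms <- function(kingdoms, skip_bacteria = TRUE) {
  if (skip_bacteria) {
    # drop the (very large) bacterial summary unless explicitly requested
    kingdoms <- setdiff(kingdoms, "bacteria")
  }
  kingdoms
}

# Illustrative kingdom names
kingdoms <- c("archaea", "bacteria", "fungi", "plant")
without_bacteria <- select_kingdoms(kingdoms)
with_bacteria    <- select_kingdoms(kingdoms, skip_bacteria = FALSE)
```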

Author(s)

Hajk-Georg Drost

See Also

getSummaryFile, getMetaGenomeSummary, get.ensembl.info

Examples

## Not run: 
# This example will run the default version of this function
# whereby information for Bacteria are not downloaded
test <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = TRUE)
test
# Users can then retrieve information for Bacteria by skip_bacteria = FALSE
test2 <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = FALSE)
test2

## End(Not run)

Retrieve available kingdoms of life

Description

A short list of available kingdoms of life

Usage

getKingdoms(db = "refseq")

Arguments

db

a character string specifying the database from which the genome shall be retrieved: db = "refseq", db = "genbank", db = "ensembl", db = "ensemblgenomes". Default is db = "refseq".

Author(s)

Hajk-Georg Drost

See Also

meta.retrieval, getGenome, getProteome, getCDS, getGroups

Examples

# retrieve kingdoms available from refseq
getKingdoms(db = "refseq")

# retrieve kingdoms available from genbank
getKingdoms(db = "genbank")

Retrieve information about available Ensembl Biomart databases

Description

This function queries the Ensembl Biomart API and returns a table storing information about all available Ensembl Biomart databases.

Usage

getMarts(update = FALSE)

Arguments

update

logical, default FALSE. If FALSE, use the cached file if it exists. Set to TRUE to force an update.

Author(s)

Hajk-Georg Drost

See Also

Other biomaRt: biomart(), getAttributes(), getDatasets(), organismBM(), organismFilters()

Examples

## Not run: 
# get a table of all available databases from Ensembl Biomart
getMarts()
 
## End(Not run)

Retrieve annotation *.gff files for metagenomes from NCBI Genbank

Description

Retrieve available annotation *.gff files for metagenomes from NCBI Genbank. NCBI Genbank allows users to download entire metagenomes and their annotations of several metagenome projects. This function downloads the available annotation files; the metagenomes themselves can be downloaded via getMetaGenomes.

Usage

getMetaGenomeAnnotations(
  name,
  path = file.path("_ncbi_downloads", "metagenome", "annotations"),
  metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name &
    total_gene_count > 0)
)

Arguments

name

metagenome name retrieved by listMetaGenomes.

path

a character string specifying the location (a folder) in which the corresponding metagenome annotations shall be stored. Default is path = file.path("_ncbi_downloads","metagenome","annotations").

metagenomes.members

a tibble with selected assemblies, default: dplyr::filter(getMetaGenomeSummary(), organism_name == name & total_gene_count > 0). This differs from getMetaGenomes in that it requires a gff file to exist; most genbank assemblies provide .gbff files only, which are usually not useful.
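
The default value of metagenomes.members is a row filter on the summary table. The sketch below applies the same condition with base R to a toy data frame; the rows are invented, and only the two columns named in the default (organism_name and total_gene_count) are assumed:

```r
# Toy stand-in for the tibble returned by getMetaGenomeSummary();
# all values are invented for this sketch.
summary_tbl <- data.frame(
  organism_name    = c("human gut metagenome", "human gut metagenome",
                       "marine metagenome"),
  total_gene_count = c(1200, 0, 800),
  stringsAsFactors = FALSE
)
name <- "human gut metagenome"

# Base R equivalent of the dplyr::filter() default: keep assemblies of the
# requested metagenome that have at least one annotated gene
members <- summary_tbl[summary_tbl$organism_name == name &
                         summary_tbl$total_gene_count > 0, ]
nrow(members)
```

Passing a custom table of this shape as metagenomes.members would restrict the download to a hand-picked subset of assemblies.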

Author(s)

Hajk-Georg Drost

See Also

getMetaGenomes, listMetaGenomes, getGFF

Examples

## Not run: 
# First, retrieve a list of available metagenomes
listMetaGenomes()

# Now, retrieve the 'human gut metagenome'
getMetaGenomeAnnotations(name = "human gut metagenome")

## End(Not run)

Retrieve metagenomes from NCBI Genbank

Description

Retrieve available metagenomes from NCBI Genbank. NCBI Genbank allows users to download entire metagenomes of several metagenome projects. This function downloads the available metagenomes for a metagenome name listed by listMetaGenomes.

Usage

getMetaGenomes(
  name,
  path = file.path("_ncbi_downloads", "metagenome"),
  metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name)
)

Arguments

name

metagenome name retrieved by listMetaGenomes.

path

a character string specifying the location (a folder) in which the corresponding metagenome shall be stored. Default is path = file.path("_ncbi_downloads","metagenome").

metagenomes.members

a tibble of metagenome assemblies, default: dplyr::filter(getMetaGenomeSummary(), organism_name == name)

Author(s)

Hajk-Georg Drost

See Also

getMetaGenomeAnnotations, listMetaGenomes

Examples

## Not run: 
# First, retrieve a list of available metagenomes
listMetaGenomes()

# Now, retrieve the 'human gut metagenome'
getMetaGenomes(name = "human gut metagenome")

## End(Not run)

Retrieve the assembly_summary.txt file from NCBI genbank metagenomes

Description

Retrieval function of the assembly_summary.txt file from NCBI genbank metagenomes. This file stores all available metagenome projects on NCBI Genbank.

Usage

getMetaGenomeSummary(
  local_file = file.path(cachedir(), "assembly_summary_metagenomes_genbank.txt")
)

Arguments

local_file

where to store this backend file, default: file.path(cachedir(), "assembly_summary_metagenomes_genbank.txt")

Value

a tibble

Author(s)

Hajk-Georg Drost

See Also

getKingdomAssemblySummary, getSummaryFile

Examples

## Not run: 
meta.summary <- getMetaGenomeSummary()
meta.summary

## End(Not run)

Proteome Retrieval

Description

Main proteome retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the proteome of the organism of interest can be downloaded and stored locally. Proteome files can be retrieved from several databases.

Usage

getProteome(
  db = "refseq",
  organism,
  reference = TRUE,
  skip_bacteria = TRUE,
  release = NULL,
  gunzip = FALSE,
  update = TRUE,
  path = file.path("_ncbi_downloads", "proteomes"),
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

  • db = "uniprot"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default TRUE (UniProt only for now). If the species info file already exists, it is not downloaded again; this is faster, but the file can be old, i.e. no longer as complete as it could be.

path

a character string specifying the location (a folder) in which the corresponding proteome shall be stored. Default is path = file.path("_ncbi_downloads","proteomes").

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly / sequence data are located by first fetching an overview file from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type, e.g. '_ncbi_downloads/genomes' for fasta genomes. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download will be performed. The same rule applies to other file types.

For ensembl, it fetches the overview per type from the REST API:

ensembl: https://rest.ensembl.org
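
The three ways of specifying organism described under Arguments (scientific name, accession identifier, taxonomic identifier) could be told apart roughly as sketched below. The helper organism_id_type is hypothetical, and the patterns are assumptions: accessions are taken to start with GCF_ (RefSeq) or GCA_ (GenBank), and taxids to consist of digits only. The package may resolve identifiers differently:

```r
# Hypothetical helper (not part of biomartr): guess how an organism
# was specified, based on simple patterns.
organism_id_type <- function(organism) {
  organism <- as.character(organism)
  if (grepl("^GC[AF]_[0-9]+(\\.[0-9]+)?$", organism)) {
    "accession"        # e.g. "GCF_000001405.37"
  } else if (grepl("^[0-9]+$", organism)) {
    "taxid"            # e.g. "9606"
  } else {
    "scientific_name"  # e.g. "Homo sapiens"
  }
}

id_types <- vapply(c("Homo sapiens", "GCF_000001405.37", "9606"),
                   organism_id_type, character(1))
id_types
```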

Value

File path to downloaded proteome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getCollection(), getGFF(), getGenome(), getRNA()

Other proteome: getProteomeSet(), read_proteome()

Examples

## Not run: 
# download the proteome of Arabidopsis thaliana from NCBI RefSeq
# and store the corresponding proteome file in '_ncbi_downloads/refseq/proteomes'
file_path <- getProteome( db       = "refseq",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","refseq","proteomes") )
# import proteome into R session
Ath_proteome <- read_proteome(file_path, format = "fasta")

# download the proteome of Arabidopsis thaliana from NCBI Genbank
# and store the corresponding proteome file in '_ncbi_downloads/genbank/proteomes'
file_path <- getProteome( db       = "genbank",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","genbank","proteomes") )
# import proteome into R session
Ath_proteome <- read_proteome(file_path, format = "fasta")

# download the proteome of Arabidopsis thaliana from UniProt
# and store the corresponding proteome file in '_downloads/uniprot/proteomes'
file_path <- getProteome( db       = "uniprot",
             organism = "Arabidopsis thaliana",
             path     = file.path("_downloads","uniprot","proteomes") )
# import proteome into R session
Ath_proteome <- read_proteome(file_path, format = "fasta")

# download the proteome of Arabidopsis thaliana from ENSEMBL
# and store the corresponding proteome file in '_downloads/ensembl/proteomes'
file_path <- getProteome( db       = "ensembl",
             organism = "Arabidopsis thaliana",
             path     = file.path("_downloads","ensembl","proteomes") )
# import proteome into R session
Ath_proteome <- read_proteome(file_path, format = "fasta")

## End(Not run)

Proteome retrieval of multiple species

Description

Main proteome retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the proteomes of the organisms of interest will be downloaded and stored locally. Proteome files can be retrieved from several databases.

Usage

getProteomeSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  skip_bacteria = TRUE,
  gunzip = TRUE,
  update = FALSE,
  path = "set_proteomes",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

  • db = "uniprot"

organisms

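a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:

  • by scientific name: e.g. "Homo sapiens"

  • by database specific accession identifier: e.g. "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. "9606" (= taxid of Homo sapiens)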

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Set to TRUE to refresh the cached backend files. Usually keep this FALSE, which makes the function run much faster. Only set it to TRUE if you believe your cache is outdated (e.g. a species only exists in the newest release).

path

a character string specifying the location (a folder) in which the corresponding proteomes shall be stored. Default is path = "set_proteomes".

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly / sequence data are located by first fetching an overview file from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type, e.g. '_ncbi_downloads/genomes' for fasta genomes. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download will be performed. The same rule applies to other file types.

For ensembl, it fetches the overview per type from the REST API:

ensembl: https://rest.ensembl.org
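
The constraints on the release argument (NULL for the most recent release, otherwise a number greater than 46) can be checked before starting a long download. The validator check_release below is a hypothetical sketch, not a function of the package; getReleases("ensembl") can be used to look up the releases that actually exist:

```r
# Hypothetical validator (not a biomartr function) for the `release` argument
check_release <- function(release = NULL) {
  if (is.null(release)) {
    return(invisible(NULL))   # NULL means: use the most recent release
  }
  if (!is.numeric(release) || length(release) != 1 || is.na(release)) {
    stop("`release` must be a single number or NULL")
  }
  if (release <= 46) {
    stop("`release` must be > 46")
  }
  release
}

check_release()      # most recent release
check_release(75)    # for human, the GRCh37 release
```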

Value

character vector, the file paths to the downloaded genomes.
The elements of the returned character vector are named either:
- 'new' (the file was downloaded now)
- 'old' (the file already existed)

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCDSSet(), getCollectionSet(), getGFFSet(), getGenomeSet(), getRNASet()

Other proteome: getProteome(), read_proteome()

Examples

## Not run: 
# download the proteomes of three different species at the same time
#### Database: NCBI RefSeq
file_paths <- getProteomeSet(db = "refseq", organisms = c("Arabidopsis thaliana",
                                      "Arabidopsis lyrata",
                                       "Capsella rubella"))
# look at file paths
file_paths

#### Database: NCBI Genbank
file_paths <- getProteomeSet(db = "genbank", organisms = c("Arabidopsis thaliana",
                                      "Arabidopsis lyrata",
                                       "Capsella rubella"))
# look at file paths
file_paths

# download the proteomes of three different species at the same time
#### Database: ENSEMBL
file_paths <- getProteomeSet(db = "ensembl", organisms = c("Homo sapiens",
                                      "Mus musculus",
                                       "Caenorhabditis elegans"))
# look at file paths
file_paths

# download the proteomes of three different species at the same time
#### Database: UniProt
file_paths <- getProteomeSet(db = "uniprot", organisms = c("Homo sapiens",
                                      "Mus musculus",
                                       "Caenorhabditis elegans"))
# look at file paths
file_paths

## End(Not run)

Retrieve available database releases or versions of ENSEMBL

Description

Retrieve available database releases or versions of ENSEMBL.

Usage

getReleases(db = "ensembl")

Arguments

db

a character string specifying the database from which available release versions shall be retrieved:

  • db = "ensembl"

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# retrieve available release versions of ENSEMBL
getReleases("ensembl")

## End(Not run)

Repeat Masker Retrieval

Description

Main Repeat Masker output retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding Repeat Masker file storing the genome of the organism of interest can be downloaded and stored locally. Repeat Masker files can be retrieved from several databases.

Usage

getRepeatMasker(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  gunzip = FALSE,
  path = file.path("_ncbi_downloads", "repeatmasker"),
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

organism

a character string specifying the scientific name of the organism of interest, e.g. organism = "Homo sapiens".

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

gunzip

a logical, indicating whether or not files should be unzipped.

path

a character string specifying the location (a folder) in which the corresponding file shall be stored. Default is path = file.path("_ncbi_downloads","repeatmasker").

mute_citation

logical value indicating whether citation message should be muted.

Details

Internally this function loads the overview.txt file from NCBI:

refseq: refseq_genbank_ftp_server_url_genome_specific("refseq")

genbank: refseq_genbank_ftp_server_url_genome_specific("genbank")

and creates a directory '_ncbi_downloads/repeatmasker' to store the files of interest as fasta file for future processing. In case the corresponding fasta file already exists within the '_ncbi_downloads/repeatmasker' folder and is accessible within the workspace, no download process will be performed.

Value

File path to downloaded Repeat Masker output file.

Author(s)

Hajk-Georg Drost

See Also

getGenome, getProteome, getCDS, getGFF, getRNA, getCollection, meta.retrieval, read_rm

Examples

## Not run: 

# download the Repeat Masker output file of Homo sapiens from refseq
# and store the corresponding file in '_ncbi_downloads/repeatmasker'
file_path <- getRepeatMasker( db       = "refseq",
             organism = "Homo sapiens",
             path = file.path("_ncbi_downloads","repeatmasker"))

Hsap_repeatmasker <- read_rm(file_path)


## End(Not run)

RNA Sequence Retrieval

Description

Main retrieval function for RNA sequences of an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the RNA information for the organism of interest can be downloaded and stored locally. RNA files can be retrieved from several databases.

Usage

getRNA(
  db = "refseq",
  organism,
  reference = FALSE,
  skip_bacteria = TRUE,
  release = NULL,
  assembly_type = "toplevel",
  path = file.path("_ncbi_downloads", "RNA"),
  gunzip = FALSE,
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organism

Organism selector id, there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

assembly_type

character, default c("primary_assembly", "toplevel"). Used for ensembl only; specifies the genome assembly type. Both primary and toplevel are searched for, and if both are found, the first in order is used (so primary is prioritized by default). The primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (it is only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases the 'toplevel' file is used. For more details see: ensembl tutorial

path

a character string specifying the location (a folder) in which the corresponding RNA file shall be stored. Default is path = file.path("_ncbi_downloads","RNA").

gunzip

a logical, indicating whether or not files should be unzipped.

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly / sequence data are located by first fetching an overview file from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type, e.g. '_ncbi_downloads/genomes' for fasta genomes. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download will be performed. The same rule applies to other file types.

For ensembl, it fetches the overview per type from the REST API:

ensembl: https://rest.ensembl.org
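
The selection rule described for assembly_type (search for the requested types and use the first one found, in the order given) can be sketched as follows. The helper pick_assembly_type and the vector available, standing for the assembly types found for a species, are hypothetical and not the package's internal code:

```r
# Hypothetical helper (not a biomartr function): pick the assembly type to use.
# `available` stands for the assembly types found for a species.
pick_assembly_type <- function(available,
                               assembly_type = c("primary_assembly", "toplevel")) {
  hits <- assembly_type[assembly_type %in% available]
  if (length(hits) == 0) {
    stop("none of the requested assembly types is available")
  }
  hits[1]  # first match in the order given by `assembly_type`
}

# both types exist: the primary assembly is preferred
both_found <- pick_assembly_type(c("toplevel", "primary_assembly"))
# only toplevel exists: fall back to it
only_top   <- pick_assembly_type("toplevel")
```

Keeping the priority in the order of the assembly_type vector means that passing c("toplevel", "primary_assembly") would reverse the preference.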

Value

File path to downloaded genome.

Author(s)

Hajk-Georg Drost

See Also

Other getBio: getBio(), getCDS(), getCollection(), getGFF(), getGenome(), getProteome()

Other rna: getRNASet(), read_rna()

Examples

## Not run: 
# download the RNA of Arabidopsis thaliana from refseq
# and store the corresponding RNA file in '_ncbi_downloads/RNA'
file_path <- getRNA( db       = "refseq",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","RNA"))

Ath_RNA <- read_rna(file_path, format = "fasta")

## End(Not run)

RNA Retrieval of multiple species

Description

Main RNA retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the RNA of the organisms of interest will be downloaded and stored locally. RNA files can be retrieved from several databases.

Usage

getRNASet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  skip_bacteria = TRUE,
  gunzip = TRUE,
  update = FALSE,
  path = "set_RNAs",
  mute_citation = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

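a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:

  • by scientific name: e.g. "Homo sapiens"

  • by database specific accession identifier: e.g. "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. "9606" (= taxid of Homo sapiens)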

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

a numeric, the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used. For human, release = 75 gives the stable GRCh37 release in Ensembl. The value must be > 46, since Ensembl did not structure its data in the standard format before that.

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE, in which case the bacterial summary file will be downloaded.

gunzip

a logical, indicating whether or not files should be unzipped.

update

logical, default FALSE. Set to TRUE to refresh the cached backend files. Usually keep this FALSE, which makes the function run much faster. Only set it to TRUE if you believe your cache is outdated (e.g. a species only exists in the newest release).

path

a character string specifying the location (a folder) in which the corresponding RNAs shall be stored. Default is path = "set_RNAs".

mute_citation

logical, default FALSE, indicating whether citation message should be muted.

Details

Assembly / sequence data are located by first fetching an overview file from the metadata of the given database.
For NCBI (refseq/genbank):
Internally this function loads the overview.txt file from NCBI:

refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/

It will then create a directory according to the requested file type, e.g. '_ncbi_downloads/genomes' for fasta genomes. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download will be performed. The same rule applies to other file types.

For ensembl, it fetches the overview per type from the REST API:

ensembl: https://rest.ensembl.org

Value

character vector, the file paths to the downloaded files.
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' (file already existed)
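
Because the names of the returned vector carry the download status, the two groups can be separated with ordinary name-based subsetting. The following is a minimal sketch; the file paths are hypothetical placeholders that only mimic the documented shape of the return value.

```r
# Hypothetical stand-in for the return value of a set retrieval call;
# the paths are placeholders, only the 'new'/'old' names follow the docs.
paths <- c(
  new = "set_RNAs/organism_a_rna.fna.gz",
  old = "set_RNAs/organism_b_rna.fna.gz",
  new = "set_RNAs/organism_c_rna.fna.gz"
)

# Files fetched during this call
fresh <- unname(paths[names(paths) == "new"])

# Files that were already present and therefore not downloaded again
cached <- unname(paths[names(paths) == "old"])

# Quick tally of both groups
status_counts <- table(names(paths))
print(status_counts)
```

Subsetting by name rather than by position keeps the code robust when the order of organisms changes between runs.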

Author(s)

Hajk-Georg Drost

See Also

Other getBioSet: getBioSet(), getCDSSet(), getCollectionSet(), getGFFSet(), getGenomeSet(), getProteomeSet()

Other rna: getRNA(), read_rna()

Examples

## Not run: 
getBioSet("refseq", organisms = c("Arabidopsis thaliana",
                                  "Arabidopsis lyrata",
                                  "Capsella rubella"),
                                  set_type = "cds")

## End(Not run)

Helper function to retrieve the assembly_summary.txt file from NCBI

Description

Retrieval function of the assembly_summary.txt file from NCBI.

Usage

getSummaryFile(db, kingdom, file = assemblies_info_path(db, kingdom))

Arguments

db

database name. E.g. refseq or genbank.

kingdom

kingdom for which assembly_summary.txt file shall be retrieved. See also getKingdoms.

file

path, local path to total summary file, default is in tmp folder.

Author(s)

Hajk-Georg Drost

See Also

getKingdomAssemblySummary, getMetaGenomeSummary

Examples

## Not run: 
test <- getSummaryFile("refseq","plant")
test

## End(Not run)

Get uniprot info from organism

Description

Get uniprot info from organism

Usage

getUniProtInfo(organism, path = cachedir(), update = TRUE)

Arguments

organism

character, name of organism

path

path at which the info file shall be stored locally.

update

shall the internal cachedir file be deleted and the info file freshly downloaded from the UniProt API?


Retrieve UniProt Database Information File (STATS)

Description

UniProt stores a STATS file that summarises all available information for its reference proteomes. Users can download this file and process it with biomartr.

Usage

getUniProtSTATS(update = FALSE)

Arguments

update

shall the internal cachedir file be deleted and the STATS file freshly downloaded from the UniProt FTP servers?

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# retrieve STATS file from UniProt
uniprot_info <- getUniProtSTATS(update = TRUE)
# look at results
uniprot_info

## End(Not run)

Check Genome Availability

Description

This function checks the availability of a given genome, specified by scientific name, on the NCBI servers.

Usage

is.genome.available(
  db = "refseq",
  organism,
  skip_bacteria = TRUE,
  details = FALSE
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

  • db = "uniprot"

organism

there are three options to characterize an organism:

  • by scientific name: e.g. organism = "Homo sapiens"

  • by database specific accession identifier: e.g. organism = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organism = "9606" (= taxid of Homo sapiens)

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE; the bacterial summary file will then be downloaded.

details

a logical value specifying whether or not details on genome size, kingdom, etc. shall be printed to the console instead of a boolean value.

Details

Internally this function calls the listGenomes function to detect all available genomes and checks whether or not the specified organism is available for download.

Value

a logical value specifying whether or not the genome of the input organism is available. In case details = TRUE, only a character string specifying the genome details is returned.
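
Since the default return value is a single logical, the check lends itself to filtering a list of candidate organisms before starting a download. The sketch below is only an illustration: `filter_available` and `fake_checker` are hypothetical helpers that are not part of biomartr, and the checker is passed in as an argument so the example runs without network access.

```r
# Keep only those organisms for which the availability check returns TRUE.
# `is_available` is any function taking `db` and `organism` and returning
# a single logical, which is the documented behaviour of
# is.genome.available() when details = FALSE.
filter_available <- function(organisms, db = "refseq", is_available) {
  ok <- vapply(
    organisms,
    function(org) isTRUE(is_available(db = db, organism = org)),
    logical(1)
  )
  organisms[ok]
}

# Offline stand-in (hypothetical) so the sketch runs without network access
fake_checker <- function(db, organism) organism %in% c("Homo sapiens")

available <- filter_available(
  c("Homo sapiens", "Not a real species"),
  is_available = fake_checker
)
print(available)

# With the package installed and network access, one would pass the real
# function instead:
# filter_available(my_organisms, is_available = biomartr::is.genome.available)
```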

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# checking whether the Homo sapiens genome is stored on NCBI
is.genome.available(organism = "Homo sapiens", db = "refseq")

# and printing details
is.genome.available(organism = "Homo sapiens", db = "refseq", details = TRUE)

# checking whether the Homo sapiens genome is stored on ENSEMBL
is.genome.available(organism = "Homo sapiens", db = "ensembl")

# and printing details
is.genome.available(organism = "Homo sapiens",
                    details = TRUE,
                    db = "ensembl")

## End(Not run)

Retrieve a List of Available NCBI Databases for Download

Description

This function allows you to retrieve a list of database names and versions that can be downloaded from the corresponding servers.

Database retrieval is crucial for most biological studies and analyses. There is a vast diversity of databases that can be accessed remotely or that can be downloaded to your local machine. This function provides an interface to databases that can be downloaded from NCBI servers and lists all available databases and their database version to be able to select an appropriate database for download with download.database.

Usage

listDatabases(db = "nr", update = FALSE)

listNCBIDatabases(db = "nr", update = FALSE)

Arguments

db

a character string specifying the name of the database that shall be searched for.

update

a logical value specifying whether or not the local listDatabases.txt file shall be updated by remote access to NCBI.

Author(s)

Hajk-Georg Drost

See Also

download.database, download.database.all

Examples

## Not run: 
# retrieve all versions of the NCBI 'nr' database that can be downloaded
listNCBIDatabases(db = "nr")

# analogous:
# listNCBIDatabases(db = "cdd")
# listNCBIDatabases(db = "nt")
# listNCBIDatabases(db = "gss")
# listNCBIDatabases(db = "refseq_protein")

## End(Not run)

List All Available Genomes either by kingdom, group, or subgroup

Description

This function retrieves the names of all genomes available on the NCBI ftp:// server and stores the results in a file named 'overview.txt' inside the directory '_ncbi_downloads' that is built inside the workspace.

Usage

listGenomes(
  db = "refseq",
  type = "all",
  subset = NULL,
  details = FALSE,
  update = FALSE,
  skip_bacteria = FALSE
)

Arguments

db

a character string specifying the database for which genome availability shall be checked. Available options are:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

type

a character string specifying a potential filter of available genomes. Available options are:

  • type = "all", no subset

  • type = "kingdom", subset on kingdom

  • type = "group", subset on group

  • type = "subgroup", subset on subgroup

subset

a character string or character vector specifying a subset of type. E.g. if users are interested in retrieving all Eukaryota species, they can specify: type = "kingdom" and subset = "Eukaryota".

details

a boolean value specifying whether only the scientific names of stored genomes shall be returned (details = FALSE) or all information such as

  • organism_name

  • kingdoms

  • group

  • subgroup

  • file_size_MB, etc.

update

logical, default FALSE. If TRUE, update cached list, if FALSE use existing cache (if it exists). For cache location see cachedir()

skip_bacteria

Due to its enormous dataset size (> 700MB as of July 2023), the bacterial summary file is no longer loaded by default. Users who wish to gain insights into the bacterial kingdom need to actively specify skip_bacteria = FALSE; the bacterial summary file will then be downloaded.

Details

Internally this function loads the overview.txt file from NCBI and creates a directory '_ncbi_downloads' in the tempdir() folder to store the overview.txt file for future processing. In case the overview.txt file already exists within the '_ncbi_downloads' folder and is accessible within the workspace, no download process will be performed again.
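
The download-once behaviour described here follows a common caching pattern. The sketch below illustrates that pattern in base R and is not the package's actual implementation; `cached_fetch` and `fake_fetch` are hypothetical names, and the "download" is simulated by a callback so that the example runs offline.

```r
# Return the lines of `file`, calling `fetch()` only when the file is
# missing or when `update = TRUE`. `fetch` is a hypothetical callback that
# returns a character vector (in practice: the downloaded overview).
cached_fetch <- function(file, fetch, update = FALSE) {
  if (update || !file.exists(file)) {
    dir.create(dirname(file), recursive = TRUE, showWarnings = FALSE)
    writeLines(fetch(), file)
  }
  readLines(file)
}

# Demonstration with a counter instead of a real download
n_fetches <- 0
fake_fetch <- function() {
  n_fetches <<- n_fetches + 1
  c("organism_a", "organism_b")
}

cache_file <- file.path(tempfile("demo_cache_"), "_ncbi_downloads", "overview.txt")
first  <- cached_fetch(cache_file, fake_fetch)                 # triggers one fetch
second <- cached_fetch(cache_file, fake_fetch)                 # served from the cached file
third  <- cached_fetch(cache_file, fake_fetch, update = TRUE)  # forces a refetch
```

The `update` flag plays the same role as the update argument documented above: it bypasses an existing cache file when the cached copy is suspected to be stale.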

Note

Please note that the ftp:// connection relies on the NCBI or ENSEMBL server and cannot be accurately accessed via a proxy.

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# print details for refseq
listGenomes(db = "refseq")
# print details for all plants in refseq
listGenomes(db = "refseq", type = "kingdom")
# print details for all plant groups in refseq
listGenomes(db = "refseq", type = "group")
# print details for all plant subgroups in refseq
listGenomes(db = "refseq", type = "subgroup")
# Ensembl
listGenomes(db = "ensembl", type = "kingdom", subset = "EnsemblVertebrates")

## End(Not run)

List number of available genomes in each taxonomic group

Description

Users can retrieve the available number of sequenced genomes per group. Only available for db = "refseq" and db = "genbank".

Usage

listGroups(db = "refseq", kingdom = "all", details = FALSE)

Arguments

db

a character string specifying the database for which genome availability shall be checked. Available options are:

  • db = "refseq"

  • db = "genbank"

kingdom

a kingdom specification retrieved by getKingdoms.

details

shall all species corresponding to the specified kingdom be returned? Default is details = FALSE.

Author(s)

Hajk-Georg Drost

See Also

listGenomes, is.genome.available, listKingdoms

Examples

## Not run: 
# example for refseq
listGroups(db = "refseq")
# example for genbank
listGroups(db = "genbank")
### in case groups should be specified by kingdom
# first, retrieve available kingdom names
listKingdoms()
# now we choose kingdom "bacteria"
listGroups(db = "refseq", kingdom = "bacteria")
# or
listGroups(db = "genbank", kingdom = "bacteria")

## End(Not run)

List number of available genomes in each kingdom of life

Description

Users can retrieve the available number of sequenced genomes per kingdom.

Usage

listKingdoms(db = "refseq")

Arguments

db

a character string specifying the database for which genome availability shall be checked, e.g. db = "refseq", db = "genbank", db = "ensembl".

Author(s)

Hajk-Georg Drost

See Also

listGenomes, is.genome.available, listGroups

Examples

## Not run: 
# list number of available genomes in refseq for each kingdom of life
listKingdoms(db = "refseq")
# example for genbank
listKingdoms(db = "genbank")
# example for ensembl
listKingdoms(db = "ensembl")

## End(Not run)

List available metagenomes on NCBI Genbank

Description

List available metagenomes on NCBI genbank. NCBI genbank allows users to download entire metagenomes of several metagenome projects. This function lists all available metagenomes that can then be downloaded via getMetaGenomes.

Usage

listMetaGenomes(details = FALSE)

Arguments

details

a boolean value specifying whether only the scientific names of stored metagenomes shall be returned (details = FALSE) or all information such as "organism_name","bioproject", etc (details = TRUE).

Author(s)

Hajk-Georg Drost

See Also

getMetaGenomes, getMetaGenomeSummary

Examples

## Not run: 
# retrieve available metagenome projects at NCBI Genbank
listMetaGenomes()
# retrieve detailed information on available metagenome projects 
# at NCBI Genbank
listMetaGenomes(details = TRUE)

## End(Not run)

Perform Meta-Genome Retrieval

Description

Download genomes, proteomes, cds, gff, rna, or assembly stats files of all species within a kingdom of life.

Usage

meta.retrieval(
  db = "refseq",
  kingdom,
  group = NULL,
  type = "genome",
  restart_at_last = TRUE,
  reference = FALSE,
  combine = FALSE,
  path = NULL
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

kingdom

a character string specifying the kingdom of the organisms of interest, e.g.

  • For NCBI RefSeq:

    • kingdom = "archaea"

    • kingdom = "bacteria"

    • kingdom = "fungi"

    • kingdom = "invertebrate"

    • kingdom = "plant"

    • kingdom = "protozoa"

    • kingdom = "viral"

    • kingdom = "vertebrate_mammalian"

    • kingdom = "vertebrate_other"

  • For NCBI Genbank:

    • kingdom = "archaea"

    • kingdom = "bacteria"

    • kingdom = "fungi"

    • kingdom = "invertebrate"

    • kingdom = "plant"

    • kingdom = "protozoa"

    • kingdom = "vertebrate_mammalian"

    • kingdom = "vertebrate_other"

  • For ENSEMBL:

    • kingdom = "EnsemblVertebrates"

    • kingdom = "EnsemblPlants"

    • kingdom = "EnsemblFungi"

    • kingdom = "EnsemblMetazoa"

    • kingdom = "EnsemblBacteria"

    • kingdom = "EnsemblProtists"

Available kingdoms can be retrieved with getKingdoms.

group

only species belonging to this subgroup will be downloaded. Groups can be retrieved with getGroups.

type

type of sequences that shall be retrieved. Options are:

  • type = "genome" : (for genome assembly retrieval; see also getGenome),

  • type = "proteome" : (for proteome retrieval; see also getProteome),

  • type = "cds" : (for coding sequence retrieval; see also getCDS),

  • type = "gff" : (for annotation file retrieval in gff format; see also getGFF),

  • type = "gtf" : (for annotation file retrieval in gtf format (only for ensembl and ensemblgenomes); see also getGTF)

  • type = "rna" : (for RNA file retrieval in fasta format; see also getRNA),

  • type = "rm" : (for Repeat Masker output file retrieval; see also getRepeatMasker),

  • type = "assemblystats" : (for genome assembly quality stats file retrieval; see also getAssemblyStats).

restart_at_last

a logical value indicating whether or not meta.retrieval should pick up at the last species when re-running the function.

  • If restart_at_last = TRUE (Default) then meta.retrieval will skip all organisms that are already present in the folder and will start downloading all remaining species. However, this way meta.retrieval will not be able to check whether already downloaded organism files are corrupted or not by checking the md5 checksum.

  • If restart_at_last = FALSE then meta.retrieval will start from the beginning and crawl through already downloaded organism files and check whether already downloaded organism files are corrupted or not by checking the md5 checksum. After checking existing files the function will start downloading all remaining organisms.
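
The md5 checksum comparison mentioned for restart_at_last = FALSE can be sketched with tools::md5sum() from base R. This is a minimal illustration under stated assumptions, not the package's internal code: `file_is_intact` is a hypothetical helper, and the reference checksum is computed locally for the demonstration, whereas a real workflow would take it from the server.

```r
# TRUE if the MD5 checksum of `path` equals `expected_md5`.
# tools::md5sum() returns NA for files it cannot read.
file_is_intact <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path))
  !is.na(actual) && identical(actual, tolower(expected_md5))
}

# Demonstration on a temporary file
demo_file <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGT"), demo_file)
reference_md5 <- unname(tools::md5sum(demo_file))

intact_before <- file_is_intact(demo_file, reference_md5)

# Simulate a corrupted download by altering the file content
writeLines(c(">seq1", "ACGA"), demo_file)
intact_after <- file_is_intact(demo_file, reference_md5)
```

A file that fails such a check would be a candidate for re-download, which is the trade-off between the two restart_at_last settings: skipping the check is faster, running it catches corrupted files.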

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. Options are:

  • reference = FALSE (Default): all organisms (reference, representative, and non-representative genomes) are downloaded.

  • reference = TRUE: organisms that are downloaded must be either a reference or representative genome. Thus, most genomes which are usually non-reference genomes will not be downloaded.
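
The effect of the reference argument can be pictured as a row filter on an assembly summary table. The sketch below assumes the filter is based on the refseq_category column that NCBI provides in assembly_summary.txt; the rows are invented and `select_assemblies` is a hypothetical helper, so this illustrates the idea rather than the package's code.

```r
# Toy table; `refseq_category` is the column NCBI uses in
# assembly_summary.txt, the rows themselves are invented.
assemblies <- data.frame(
  organism_name   = c("organism_a", "organism_b", "organism_c"),
  refseq_category = c("reference genome", "na", "representative genome"),
  stringsAsFactors = FALSE
)

# reference = FALSE keeps everything; reference = TRUE keeps only
# reference and representative genomes.
select_assemblies <- function(tbl, reference = FALSE) {
  if (!reference) {
    return(tbl)
  }
  keep <- tbl$refseq_category %in% c("reference genome", "representative genome")
  tbl[keep, , drop = FALSE]
}

all_rows <- select_assemblies(assemblies, reference = FALSE)
ref_rows <- select_assemblies(assemblies, reference = TRUE)
print(ref_rows$organism_name)
```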

combine

only in case type = "assemblystats" is specified: shall assembly stats of individual species be imported and combined to a data.frame?

path

path to the folder in which downloaded genomes shall be stored. By default the kingdom name is used to name the output folder.

Details

This function aims to perform bulk retrieval of the genomes, proteomes, cds, etc. of species that belong to the same kingdom of life or to the same subgroup.

Value

a character vector storing the file paths of the retrieved files.

Author(s)

Hajk-Georg Drost

See Also

Other meta_retrival: meta.retrieval.all()

Examples

## Not run: 
# get all available kingdoms for refseq
getKingdoms(db = "refseq")
# download all mammalian vertebrate genomes from refseq
meta.retrieval(kingdom = "vertebrate_mammalian",
               db = "refseq",
               type = "genome")

# get all available kingdoms for genbank
getKingdoms(db = "genbank")
# download all mammalian vertebrate genomes from genbank
meta.retrieval(kingdom = "vertebrate_mammalian",
               db = "genbank",
               type = "genome")


# In case users do not wish to retrieve genomes from an entire kingdom,
# but rather from a subgroup (e.g. from species belonging to the
# Gammaproteobacteria class, a subgroup of the bacteria kingdom),
# they can use the following workflow:
# First, users can again consult the getKingdoms() function to retrieve
# kingdom information.
getKingdoms(db = "refseq")

# In this example, we will choose the bacteria kingdom.
# Now, the getGroups() function allows users to obtain available
# subgroups of the bacteria kingdom.
getGroups(db = "refseq", kingdom = "bacteria")

# Now we choose the group Gammaproteobacteria and specify
# the group argument in the meta.retrieval() function
meta.retrieval(kingdom = "bacteria",
   group = "Gammaproteobacteria",
   db = "refseq",
   type = "genome")

## End(Not run)

Perform Meta-Genome Retrieval of all organisms in all kingdoms of life

Description

Download genomes, proteomes, cds, gff, rna, or assembly stats files of individual species of all kingdoms of life.

Usage

meta.retrieval.all(db = "refseq", type = "genome", reference = FALSE)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

  • db = "ensemblgenomes"

type

type of sequences that shall be retrieved. Options are:

  • type = "genome" : (for genome assembly retrieval; see also getGenome),

  • type = "proteome" : (for proteome retrieval; see also getProteome),

  • type = "cds" : (for coding sequence retrieval; see also getCDS),

  • type = "gff" : (for annotation file retrieval in gff format; see also getGFF),

  • type = "gtf" : (for annotation file retrieval in gtf format (only for ensembl and ensemblgenomes); see also getGTF),

  • type = "rna" : (for RNA file retrieval in fasta format; see also getRNA),

  • type = "rm" : (for Repeat Masker output file retrieval; see also getRepeatMasker),

  • type = "assemblystats" : (for genome assembly quality stats file retrieval; see also getAssemblyStats).

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. Options are:

  • reference = FALSE (Default): all organisms (reference, representative, and non-representative genomes) are downloaded.

  • reference = TRUE: organisms that are downloaded must be either a reference or representative genome. Thus, most genomes which are usually non-reference genomes will not be downloaded.

Details

This function aims to perform bulk retrieval of all genomes of species for all kingdoms of life.

Value

a character vector storing the file paths of the retrieved files.

Author(s)

Hajk-Georg Drost

See Also

Other meta_retrival: meta.retrieval()

Examples

## Not run: 
# download all genomes from refseq
meta.retrieval.all(db = "refseq", type = "genome")
# download all genomes from genbank
meta.retrieval.all(db = "genbank", type = "genome")
# download all genomes from ensemblgenomes
meta.retrieval.all(db = "ensemblgenomes", type = "genome")

## End(Not run)

Retrieve Ensembl Biomart attributes for a query organism

Description

In addition to the organismBM function, this function returns all available attributes that can be accessed through different marts and datasets for a given query organism.

Usage

organismAttributes(organism, update = FALSE, topic = NULL)

Arguments

organism

a character string specifying the scientific name of a query organism.

update

a logical value specifying whether or not the local listMart.txt, listDatasets.txt, and listAttributes_organism.txt files shall be updated by remote access to BioMart.

topic

a character string specifying a topic (category) of attributes, e.g. topic = "id".

Details

For a given query organism, this function retrieves all available attributes that can be accessed through different marts and datasets.

Sometimes the same attribute names correspond to different datasets and marts, causing problems when using getMarts. The approach introduced by this function provides (again) an organism-centric way of accessing organism-specific attributes.

The topic argument allows the user to search for specific attribute topics/categories for faster filtering.
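
Conceptually, a topic search narrows the attribute table to rows whose names mention the topic. The sketch below assumes a plain substring match on a name column, which may differ from how the package matches topics internally; the table contents and the helper `filter_by_topic` are hypothetical and serve only to illustrate the idea offline.

```r
# Invented attribute table with the kind of columns described under Value
attrs <- data.frame(
  name = c("ensembl_gene_id", "description", "entrezgene_id", "chromosome_name"),
  description = c("Gene stable ID", "Gene description",
                  "NCBI gene ID", "Chromosome/scaffold name"),
  stringsAsFactors = FALSE
)

# Keep rows whose attribute name contains `topic` (assumed matching rule);
# topic = NULL returns the table unchanged.
filter_by_topic <- function(tbl, topic = NULL) {
  if (is.null(topic)) {
    return(tbl)
  }
  tbl[grepl(topic, tbl$name, fixed = TRUE), , drop = FALSE]
}

id_attrs <- filter_by_topic(attrs, topic = "id")
print(id_attrs$name)
```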

Value

a data.frame storing corresponding attribute names, description, datasets, and marts.

Note

When you run this function for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt file within the tempdir directory named "_biomart/listMarts.txt", "_biomart/listDatasets.txt", and "_biomart/listAttributes_organism.txt", allowing subsequent queries to perform much faster.

Author(s)

Hajk-Georg Drost

References

http://biomart.org/

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).

See Also

organismFilters, organismBM, biomart, listAttributes

Examples

## Not run:  
# search for attribute topic id
head(organismAttributes("Homo sapiens", topic = "id"), 20)

## End(Not run)

Retrieve Ensembl Biomart marts and datasets for a query organism

Description

This function returns either all available biomart connections for all available organisms for which biomart access is possible, or (when specified) returns all organism specific biomart connections.

Usage

organismBM(organism = NULL, update = FALSE, mute_citation = TRUE)

Arguments

organism

a character string specifying the scientific name of a query organism. Default is organism = NULL. In this case all available biomart connections are returned.

update

a logical value specifying whether or not the local listMart.txt and listDatasets.txt files shall be updated by remote access to BioMart.

mute_citation

logical value indicating whether citation message should be muted.

Details

This function collects all available biomart connections and returns a table storing the organism for which biomart connections are available as well as the corresponding mart and database.

Note

When you run this function for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt file named "_biomart/listDatasets.txt" in the tempdir directory, allowing subsequent queries to perform much faster.

Author(s)

Hajk-Georg Drost

References

http://biomart.org/

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).

See Also

Other biomaRt: biomart(), getAttributes(), getDatasets(), getMarts(), organismFilters()

Examples

## Not run: 
# returning all available biomart connections
head(organismBM(), 20)
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Homo sapiens")
# you can also update the downloaded version using
# the "update = TRUE" argument
head(organismBM(update = TRUE), 20)

## End(Not run)

Retrieve Ensembl Biomart filters for a query organism

Description

In addition to the organismBM and organismAttributes functions, this function returns all available filters that can be accessed through different marts and datasets for a given query organism.

Usage

organismFilters(organism, update = FALSE, topic = NULL)

Arguments

organism

a character string specifying the scientific name of a query organism.

update

a logical value specifying whether or not the local listMart.txt, listDatasets.txt, and listFilters_organism.txt files shall be updated by remote access to BioMart.

topic

a character string specifying a topic (category) of filters, e.g. topic = "id".

Details

For a given query organism, this function retrieves all available filters that can be accessed through different marts and datasets.

Sometimes the same filter names correspond to different datasets and marts, causing problems when using getMarts. The approach introduced by this function provides (again) an organism-centric way of accessing organism-specific filters.

The topic argument allows the user to search for specific filter topics/categories for faster selection.

Value

a data.frame storing corresponding filter names, description, datasets, and marts.

Note

When you run this function for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt file within the tempdir directory named "_biomart/listMarts.txt", "_biomart/listDatasets.txt", and "_biomart/listFilters_organism.txt", allowing subsequent queries to perform much faster.

Author(s)

Hajk-Georg Drost

References

http://biomart.org/

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).

See Also

Other biomaRt: biomart(), getAttributes(), getDatasets(), getMarts(), organismBM()

Examples

## Not run: 
# search for filter topic "id"
head(organismFilters("Homo sapiens", topic = "id"), 20)

## End(Not run)

Import Genome Assembly Stats File

Description

This function reads an organism specific Genome Assembly Stats file that was retrieved with getAssemblyStats.

Usage

read_assemblystats(file, type = "raw")

Arguments

file

a character string specifying the path to the file storing the Genome Assembly Stats file.

type

a character string specifying what is imported as a tibble object: either type = "raw" to import the entire genome assembly stats file or type = "stats" to import overall statistics including all chromosomes, mitochondria and plastids.

Details

This function takes a string specifying the path to the Genome Assembly Stats file of interest (e.g. the path returned by getAssemblyStats) and imports it.

Author(s)

Hajk-Georg Drost

See Also

getAssemblyStats, read_genome, read_proteome, read_cds, read_gff


Import CDS as Biostrings or data.table object

Description

This function reads an organism specific CDS stored in a defined file format.

Usage

read_cds(
  file,
  format = "fasta",
  obj.type = "Biostrings",
  delete_corrupt = FALSE,
  ...
)

Arguments

file

a character string specifying the path to the file storing the CDS.

format

a character string specifying the file format used to store the CDS, e.g. format = "fasta" (default) or format = "gbk".

obj.type

a character string specifying the object type in which the CDS shall be represented. Either as obj.type = "Biostrings" (default) or as obj.type = "data.table".

delete_corrupt

a logical value specifying whether potential CDS sequences whose length is not divisible by 3 shall be excluded from the dataset. Default is delete_corrupt = FALSE.

...

additional arguments that are used by read.fasta.

Details

The read_cds function takes a string specifying the path to the CDS file of interest as first argument.

It is possible to read in different CDS file standards such as fasta or genbank.

CDS stored in fasta files can be downloaded from http://www.ensembl.org/info/data/ftp/index.html.
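
The delete_corrupt option refers to coding sequences whose length is not a multiple of three. The base R sketch below illustrates that criterion on invented sequences; it is not the package's implementation and does not use Biostrings, it only shows which entries such a filter would drop.

```r
# Invented example sequences; gene2 has 8 nucleotides and is therefore
# not a whole number of codons.
cds <- c(gene1 = "ATGAAATAG", gene2 = "ATGAAATA", gene3 = "ATGTAA")

# A sequence counts as well-formed here if its length is divisible by 3
divisible_by_three <- nchar(cds) %% 3 == 0

# Sequences that would be kept, and the names of those that would be dropped
cds_clean <- cds[divisible_by_three]
dropped   <- names(cds)[!divisible_by_three]

print(names(cds_clean))
print(dropped)
```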

Value

A data.table storing the gene id in the first column and the corresponding sequence as string in the second column.

Author(s)

Hajk-Georg Drost

See Also

Other cds: getCDS(), getCDSSet()

Other readers: read_genome(), read_gff(), read_proteome(), read_rna()


Import Genome Assembly as Biostrings or data.table object

Description

This function reads an organism specific genome stored in a defined file format.

Usage

read_genome(file, format = "fasta", obj.type = "Biostrings", ...)

Arguments

file

a character string specifying the path to the file storing the genome.

format

a character string specifying the file format used to store the genome, e.g. format = "fasta" (default) or format = "gbk".

obj.type

a character string specifying the object type in which the genomic sequence shall be represented. Either as obj.type = "Biostrings" (default) or as obj.type = "data.table".

...

additional arguments that are used by the read.fasta function.

Details

This function takes a string specifying the path to the genome file of interest as first argument (e.g. the path returned by getGenome).

Value

Either a Biostrings or data.table object.

Author(s)

Hajk-Georg Drost

See Also

Other genome: getGenome(), getGenomeSet()

Other readers: read_cds(), read_gff(), read_proteome(), read_rna()


Import GFF File

Description

This function reads an organism specific GFF file.

Usage

read_gff(file)

Arguments

file

a character string specifying the path to the GFF file.

Details

This function takes a string specifying the path to the GFF file of interest (e.g. the path returned by getGFF).

Value

Either a Biostrings or data.table object.

Author(s)

Hajk-Georg Drost

See Also

Other gff: getGFF(), getGFFSet()

Other readers: read_cds(), read_genome(), read_proteome(), read_rna()


Import Proteome as Biostrings or data.table object

Description

This function reads an organism specific proteome stored in a defined file format.

Usage

read_proteome(file, format = "fasta", obj.type = "Biostrings", ...)

Arguments

file

a character string specifying the path to the file storing the proteome.

format

a character string specifying the file format used to store the proteome, e.g. format = "fasta" (default) or format = "gbk".

obj.type

a character string specifying the object type in which the proteome shall be represented. Either as obj.type = "Biostrings" (default) or as obj.type = "data.table".

...

additional arguments that are used by read.fasta.

Details

This function takes a string specifying the path to the proteome file of interest as first argument.

It is possible to read in different proteome file standards such as fasta or genbank.

Value

Either a Biostrings or data.table object.

Author(s)

Hajk-Georg Drost

See Also

Other readers: read_cds(), read_genome(), read_gff(), read_rna()

Other proteome: getProteome(), getProteomeSet()


Import Repeat Masker output file

Description

This function reads an organism specific Repeat Masker output file.

Usage

read_rm(file)

Arguments

file

a character string specifying the path to the file storing the Repeat Masker output (e.g. retrieved with getRepeatMasker).

Details

This function takes a string specifying the path to the Repeat Masker output file of interest as first argument.

Author(s)

Hajk-Georg Drost

See Also

getRepeatMasker, read_genome, read_proteome, read_gff, read_rna


Import RNA as Biostrings or data.table object

Description

This function reads an organism specific RNA stored in a defined file format.

Usage

read_rna(file, format = "fasta", obj.type = "Biostrings", ...)

Arguments

file

a character string specifying the path to the file storing the RNA.

format

a character string specifying the file format used to store the RNA, e.g. format = "fasta" (default) or format = "gbk".

obj.type

a character string specifying the object type in which the RNA sequences shall be represented. Either as obj.type = "Biostrings" (default) or as obj.type = "data.table".

...

additional arguments that are used by read.fasta.

Details

This function takes a string specifying the path to the RNA file of interest as first argument. It is possible to read in different RNA file standards such as fasta or GenBank (gbk).

Value

Either a Biostrings or data.table object, depending on obj.type. With obj.type = "data.table", the gene id is stored in the first column and the corresponding sequence as a string in the second column.

Author(s)

Hajk-Georg Drost

See Also

Other rna: getRNA(), getRNASet()

Other readers: read_cds(), read_genome(), read_gff(), read_proteome()
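
Examples

The following is a minimal sketch of a typical call, not an example shipped with the package. It writes a tiny FASTA file with two made-up transcripts (the names transcript_1 and transcript_2 and their sequences are hypothetical) and reads it back with the default arguments shown under Usage. The call into biomartr is guarded so that the snippet still runs when the package is not installed.

```r
# Write a tiny FASTA file with two hypothetical transcripts (toy data)
rna_file <- tempfile(fileext = ".fa")
writeLines(
  c(">transcript_1", "ACGGACCAGGAC",
    ">transcript_2", "GGCACCAGCAAG"),
  con = rna_file
)

# Only call into biomartr when it is installed
if (requireNamespace("biomartr", quietly = TRUE)) {
  # default arguments as listed under Usage
  rna <- biomartr::read_rna(file = rna_file, format = "fasta",
                            obj.type = "Biostrings")
  print(rna)
}
```

In practice the file argument would usually be the path returned by getRNA() rather than a hand-written file.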


Retrieve All Organism Names Stored on RefSeq

Description

This function extracts all organism names (scientific names) for which genomes, proteomes, and CDS files are stored on the NCBI RefSeq server.

Usage

refseqOrganisms()

Value

a character vector of html from kingdom URLs

Author(s)

Hajk-Georg Drost


Retrieve summary statistics for a coding sequence (CDS) file

Description

Summary statistics of specific CDS features are returned.

Usage

summary_cds(file, organism)

Arguments

file

file path to a CDS file in fasta format.

organism

character string specifying the organism at hand.

Details

The summary statistics include:

  • total_seqs: The total number of coding sequences in the CDS file

  • nnn_abs: The total number of NNN's (over all chromosomes/scaffolds/contigs) in all coding sequences combined

  • nnn_perc: The percentage (relative frequency) of NNN's (over all chromosomes/scaffolds/contigs) compared to the total number of nucleotides of all coding sequences
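
To make the relationship between these values concrete, here is a base R sketch on toy data. It is not the package's implementation: it assumes that each ambiguous base N is counted individually and that nnn_perc is that count divided by the total number of nucleotides, times 100. The sequences and the helper count_chars() are hypothetical.

```r
# Hypothetical coding sequences (toy data, not from a real CDS file)
cds <- c(
  gene1 = "ATGAAANNNTAA",   # 12 nt, 3 N
  gene2 = "ATGCCCGGGTGA",   # 12 nt, 0 N
  gene3 = "ATGNNNNNNTAG"    # 12 nt, 6 N
)

# Count the characters of one class per sequence by deleting all others
# (count_chars is an illustrative helper, not a biomartr function)
count_chars <- function(x, class) {
  nchar(gsub(paste0("[^", class, "]"), "", x))
}

total_seqs <- length(cds)                    # number of coding sequences
total_nt   <- sum(nchar(cds))                # nucleotides over all sequences
nnn_abs    <- sum(count_chars(cds, "N"))     # assumption: every N counts once
nnn_perc   <- 100 * nnn_abs / total_nt       # share of N in percent

c(total_seqs = total_seqs, nnn_abs = nnn_abs, nnn_perc = nnn_perc)
```

If the package counts NNN triplets rather than single N bases, the absolute number differs, but the percentage is still a ratio against the total nucleotide count.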

Author(s)

Hajk-Georg Drost

See Also

getCollection, getCDS, read_cds, summary_genome


Retrieve summary statistics for a genome assembly file

Description

Summary statistics of specific genome features are generated. These statistics are useful to assess the quality of retrieved genome assemblies when performing comparative genomics tasks. This way, users can assess whether patterns found in genome comparisons are genuine or merely a technical artifact of differences in genome assembly quality.

Usage

summary_genome(file, organism)

Arguments

file

file path to a genome assembly file in fasta format.

organism

character string specifying the organism at hand.

Details

The summary statistics include:

  • genome_size_mbp: Genome size in mega base pairs

  • n50_mbp: The N50 contig size of the genome assembly in mega base pairs

  • n_seqs: The number of chromosomes/scaffolds/contigs of the genome assembly file

  • n_nnn: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly file

  • rel_nnn: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file

  • genome_entropy: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies)

  • n_gc: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly file

  • rel_gc: The relative frequency of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file
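
Several of these statistics follow standard definitions and can be illustrated in base R on toy data. The sketch below is not the package's implementation. It assumes the usual N50 definition (the length of the shortest sequence in the smallest set of longest sequences covering at least half of the assembly) and a Shannon entropy in bits over the observed characters of each sequence; the package may use a different alphabet or logarithm base. The contigs and the helpers n50() and shannon_entropy() are hypothetical.

```r
# Toy assembly: three hypothetical contigs (not real data)
contigs <- c(
  contig1 = "ACGTACGTACGTACGTACGT",  # 20 bp
  contig2 = "GGGCCCNNNAAA",          # 12 bp
  contig3 = "ATATAT"                 # 6 bp
)

seq_lengths <- nchar(contigs)

# Genome size (in bp and Mbp) and number of sequences
genome_size_bp  <- sum(seq_lengths)
genome_size_mbp <- genome_size_bp / 1e6
n_seqs          <- length(contigs)

# N50: walk down the sequences from longest to shortest and return the
# length at which the running total first reaches half of the assembly
n50 <- function(lengths) {
  sorted <- sort(lengths, decreasing = TRUE)
  unname(sorted[which(cumsum(sorted) >= sum(sorted) / 2)[1]])
}

# Shannon entropy (bits) of the character composition of one sequence
shannon_entropy <- function(s) {
  freqs <- table(strsplit(s, "")[[1]]) / nchar(s)
  -sum(freqs * log2(freqs))
}

# Median over the per-sequence entropies, as described for genome_entropy
genome_entropy <- median(vapply(contigs, shannon_entropy, numeric(1)))

n50_bp <- n50(seq_lengths)
```

For this toy assembly the total length is 38 bp, so the N50 is already reached by the longest contig (20 bp). A highly fragmented assembly would need many sequences to reach the halfway mark and would therefore report a much smaller N50.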

Author(s)

Hajk-Georg Drost

See Also

summary_cds, getCollection, getGenome, read_genome

Examples

## Not run: 
# retrieve genome from NCBI RefSeq
Sc <- biomartr::getGenome(db = "refseq", organism = "Saccharomyces cerevisiae")
# compute genome assembly summary statistics
Sc_genome_summary <- summary_genome(file = Sc, organism = "Saccharomyces cerevisiae")
# look at results
Sc_genome_summary

## End(Not run)