| Title: | Genomic Data Retrieval | 
|---|---|
| Description: | Perform large scale genomic data retrieval and functional annotation retrieval. This package aims to provide users with a standardized way to automate genome, proteome, 'RNA', coding sequence ('CDS'), 'GFF', and metagenome retrieval from 'NCBI RefSeq', 'NCBI Genbank', 'ENSEMBL', and 'UniProt' databases. Furthermore, an interface to the 'BioMart' database (Smedley et al. (2009) <doi:10.1186/1471-2164-10-22>) allows users to retrieve functional annotation for genomic loci. In addition, users can download entire databases such as 'NCBI RefSeq' (Pruitt et al. (2007) <doi:10.1093/nar/gkl842>), 'NCBI nr', 'NCBI nt', 'NCBI Genbank' (Benson et al. (2013) <doi:10.1093/nar/gks1195>), etc. with only one command. | 
| Authors: | Hajk-Georg Drost [aut, cre] (ORCID: <https://orcid.org/0000-0002-1567-306X>), Haakon Tjeldnes [aut, ctb] | 
| Maintainer: | Hajk-Georg Drost <[email protected]> | 
| License: | GPL-2 | 
| Version: | 1.0.11 | 
| Built: | 2025-10-17 05:59:20 UTC | 
| Source: | https://github.com/ropensci/biomartr | 
This package interacts with a suite of web Application Programming Interfaces and FTP sites to perform automated genomic data retieval and annotation information retrieval.
To automate the retrieval process on a meta-genomic scale, this package
provides useful interface functions for genomic sequence retrieval and
functional annotation retrieval.
The major aim of biomartr is to facilitate computational
reproducibility and large-scale handling of genomic data for
(meta-)genomic analyses.
In detail, biomartr aims to provide users with an easy to use
framework to obtain genome, proteome, CDS, GFF (annotation), genome
assembly quality, and metagenome project data. Furthermore, an interface to
the Ensembl Biomart database allows users to retrieve functional annotation
for genomic loci.
Users can download entire databases
such as
NCBI RefSeq
NCBI nr
NCBI nt
NCBI Genbank
NCBI nt
Ensembl
Ensembl Genomes
UniProt
Hajk-Georg Drost [email protected]
Useful links:
Report bugs at https://github.com/ropensci/biomartr/issues
This function takes a set of gene ids and the biomart specifications and performs a biomart query for the given set of gene ids.
biomart(genes, mart, dataset, attributes, filters, mute_citation = FALSE, ...)biomart(genes, mart, dataset, attributes, filters, mute_citation = FALSE, ...)
| genes | a character vector storing the gene ids of a organisms of interest to be queried against BioMart. | 
| mart | a character string specifying the mart to be used. Users
can obtain available marts using  | 
| dataset | a character string specifying the dataset within the mart to
be used, e.g.  | 
| attributes | a character vector specifying the attributes that shall be
used, e.g.  | 
| filters | a character vector specifying the filter (query key) for the
BioMart query, e.g.  | 
| mute_citation | logical value indicating whether citation message should be muted. | 
| ... | additional parameters for the
 | 
This function is the main query function of the biomartr package.
It enables to fastly access annotations of a given gene set based on the biomaRt package implemented by Steffen Durinck et al.
A data.table storing the initial query gene vector in the first column, the output gene vector in the second column, and all attributes in the following columns.
Hajk-Georg Drost
Other biomaRt: 
getAttributes(),
getDatasets(),
getMarts(),
organismBM(),
organismFilters()
## Not run: # 1) select a mart getMarts() # we will select mart 'plants_mart' and search for available datasets getDatasets(mart = "plants_mart") # we choose dataset 'athaliana_eg_gene' and run biomart() # using mart: 'plants_mart', dataset: "athaliana_eg_gene" # attributes: c("start_position","end_position","description") # for an example gene set of Arabidopsis thaliana: # c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", # "AT1G06130", "AT1G06200") biomart(genes = c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", "AT1G06130", "AT1G06200"), mart = "plants_mart", dataset = "athaliana_eg_gene", attributes = c("start_position","end_position","description"), filters = "ensembl_gene_id") ## End(Not run)## Not run: # 1) select a mart getMarts() # we will select mart 'plants_mart' and search for available datasets getDatasets(mart = "plants_mart") # we choose dataset 'athaliana_eg_gene' and run biomart() # using mart: 'plants_mart', dataset: "athaliana_eg_gene" # attributes: c("start_position","end_position","description") # for an example gene set of Arabidopsis thaliana: # c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", # "AT1G06130", "AT1G06200") biomart(genes = c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", "AT1G06130", "AT1G06200"), mart = "plants_mart", dataset = "athaliana_eg_gene", attributes = c("start_position","end_position","description"), filters = "ensembl_gene_id") ## End(Not run)
Get directory to store back end files like kingdom summaries etc
cachedir(non_temp_cache = "~/.biomartr_cache_dir.rds")cachedir(non_temp_cache = "~/.biomartr_cache_dir.rds")
| non_temp_cache | "~/.biomartr_cache_dir.rds", | 
reads the rds file, and returns the path for local cache, if not existing, use tempdir().
Other cachedir: 
cachedir_set()
cachedir()cachedir()
Set directory to store back end files like kingdom summaries etc
cachedir_set(path)cachedir_set(path)
| path | the path to cache dir, example "~/Bio_data/biomartr_cache/" | 
invisible(NULL), only save the file to path location
Other cachedir: 
cachedir()
# By default it is tempdir() cachedir() # cachedir_set("~/Bio_data/biomartr_cache/") cachedir()# By default it is tempdir() cachedir() # cachedir_set("~/Bio_data/biomartr_cache/") cachedir()
Some annotation files include lines with character lengths greater than 65000. 
This causes problems when trying to import such annotation files into R using import.
To overcome this issue, this function screens for such lines
in a given annotation file and removes these lines so that
import can handle the file.
check_annotation_biomartr(annotation_file, remove_annotation_outliers = FALSE)check_annotation_biomartr(annotation_file, remove_annotation_outliers = FALSE)
| annotation_file | a file path to the annotation file. | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
Hajk-Georg Drost
## Not run: # download an example annotation file from NCBI RefSeq Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana") # run annotation file check on the downloaded file biomartr::check_annotation_biomartr(Ath_path) # several outlier lines were detected, thus we re-run the # function using 'remove_annotation_outliers = TRUE' # to remove the outliers and overwrite the file biomartr::check_annotation_biomartr(Ath_path, remove_annotation_outliers = TRUE) ## End(Not run)## Not run: # download an example annotation file from NCBI RefSeq Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana") # run annotation file check on the downloaded file biomartr::check_annotation_biomartr(Ath_path) # several outlier lines were detected, thus we re-run the # function using 'remove_annotation_outliers = TRUE' # to remove the outliers and overwrite the file biomartr::check_annotation_biomartr(Ath_path, remove_annotation_outliers = TRUE) ## End(Not run)
This function allows users to download a database selected by
listDatabases to their local hard drive.
download.database(db, path = "database")download.database(db, path = "database")
| db | a character string specifying the database that shall be downloaded
(selected from  | 
| path | a character string specifying the location (a folder) in
which the corresponding database shall be stored.
Default is  | 
This function downloads large databases to your hard drive.
For this purpose a folder
named database (default) is created and the correspondning
database then stored in this folder.
File path to the downloaded database file.
Hajk-Georg Drost
download.database.all, listDatabases
## Not run: # search for available NCBI nr databases listNCBIDatabases(db = "nr") # select NCBI nr version 27 = "nr.27.tar.gz" # and download it to your hard drive # -> please note that large databases take some time for download! download.database(db = "nr.27.tar.gz") ## End(Not run)## Not run: # search for available NCBI nr databases listNCBIDatabases(db = "nr") # select NCBI nr version 27 = "nr.27.tar.gz" # and download it to your hard drive # -> please note that large databases take some time for download! download.database(db = "nr.27.tar.gz") ## End(Not run)
The download.database functions allows users to
retrieve individual packages of a NCBI database. This function is designed to
retrieve the entire database selected by the users (hence all packages 
corresponding to this database).
download.database.all(db, path = NULL)download.database.all(db, path = NULL)
| db | a character string specifying the database that shall be downloaded
(selected from  | 
| path | a character string specifying the location (a folder) in which the corresponding database shall be stored. In case this folder does not exist yet, it will be created. | 
A character vector storing the file paths of the downloaded databases.
Hajk-Georg Drost
download.database, listNCBIDatabases
## Not run: # search for available NCBI databases listNCBIDatabases(db = "all") # choose database NCBI nr and download compelete database download.database.all(db = "nr", path = "nr") ## End(Not run)## Not run: # search for available NCBI databases listNCBIDatabases(db = "all") # choose database NCBI nr and download compelete database download.database.all(db = "nr", path = "nr") ## End(Not run)
Retrieve a list of available databases on ENSEMBL for which get.ensembl.info can be retrieved.
ensembl_divisions()ensembl_divisions()
Hajk-Georg Drost
ensembl_divisions()ensembl_divisions()
This function interfaces with the ENSEMBL API (https://rest.ensembl.org/info/species?content-type=application/json) and internally stores the output to use this information for subsequent retrieval function calls.
get.ensembl.info(update = FALSE, division)get.ensembl.info(update = FALSE, division)
| update | logical, default FALSE. If TRUE, force re-download of info. | 
| division | the ENSEMBL database (division) for which information shall
be retrieved (available options can be obtained with  | 
Hajk-Georg Drost
ensembl_divisions, getKingdomAssemblySummary, getENSEMBLInfo
## Not run: # Look at available ENSEMBL division options ensembl_divisions() # Retrieve available information for EnsemblVertebrates example <- get.ensembl.info(division = "EnsemblVertebrates") example # Update information file stored in the tempdir() folder. example_update <- get.ensembl.info(division = "EnsemblVertebrates", update = TRUE) example_update ## End(Not run)## Not run: # Look at available ENSEMBL division options ensembl_divisions() # Retrieve available information for EnsemblVertebrates example <- get.ensembl.info(division = "EnsemblVertebrates") example # Update information file stored in the tempdir() folder. example_update <- get.ensembl.info(division = "EnsemblVertebrates", update = TRUE) example_update ## End(Not run)
Main genome assembly stats retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding genome assembly stats file storing the assembly statistics of the organism of interest can be downloaded and stored locally. Genome assembly stats files can be retrieved from several databases.
getAssemblyStats( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, type = "download", path = file.path("_ncbi_downloads", "genomeassembly_stats"), mute_citation = FALSE )getAssemblyStats( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, type = "download", path = file.path("_ncbi_downloads", "genomeassembly_stats"), mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | a character string specifying the scientific name of the
organism of interest, e.g.  | 
| reference | a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | most recent database version is used. release = 75 would for human would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data if the standard format before that. | 
| type | shall only the file be retrieved (default)
 | 
| path | a character string specifying the location (a folder) in
which the corresponding file shall be stored. Default is
 | 
| mute_citation | logical value indicating whether citation message should be muted. | 
Internally this function loads the the overview.txt file from NCBI:
refseq: refseq_genbank_ftp_server_url_genome_specific("refseq")
genbank: refseq_genbank_ftp_server_url_genome_specific("genbank")
to retrieve available scientific names of organisms and creates a directory '_ncbi_downloads/genomeassembly_stats' to store the Genome Assembly Stats of interest as text file for future processing. In case the corresponding fasta file already exists within the '_ncbi_downloads/genomeassembly_stats' folder and is accessible within the workspace, no download process will be performed.
An example genome assembly stats file can be found here: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt.
File path to downloaded genome assembly stats file.
Hajk-Georg Drost
getGenome, getProteome, getCDS,
getGFF, getRNA, getCollection, meta.retrieval,
read_assemblystats
## Not run: # download the genome assembly stats file of Saccharomyces cerevisiae # from NCBI RefSeq # and store the corresponding genome file in # '_ncbi_downloads/genomeassembly_stats' file_path <- getAssemblyStats( db = "refseq", organism = "Saccharomyces cerevisiae", path = file.path("_ncbi_downloads","genomeassembly_stats")) # import the raw file as it is downloaded Scerevisiae.stats <- read_assemblystats(file_path, type = "raw") # download the genome assembly stats file of Saccharomyces cerevisiae # from NCBI RefSeq # and import overall statistics of the genome assembly Scerevisiae.stats.import <- getAssemblyStats( db = "refseq", organism = "Saccharomyces cerevisiae", type = "import", path = file.path("_ncbi_downloads","genomeassembly_stats")) ## End(Not run)## Not run: # download the genome assembly stats file of Saccharomyces cerevisiae # from NCBI RefSeq # and store the corresponding genome file in # '_ncbi_downloads/genomeassembly_stats' file_path <- getAssemblyStats( db = "refseq", organism = "Saccharomyces cerevisiae", path = file.path("_ncbi_downloads","genomeassembly_stats")) # import the raw file as it is downloaded Scerevisiae.stats <- read_assemblystats(file_path, type = "raw") # download the genome assembly stats file of Saccharomyces cerevisiae # from NCBI RefSeq # and import overall statistics of the genome assembly Scerevisiae.stats.import <- getAssemblyStats( db = "refseq", organism = "Saccharomyces cerevisiae", type = "import", path = file.path("_ncbi_downloads","genomeassembly_stats")) ## End(Not run)
This function queries the BioMart Interface and returns a table storing all available attributes for a specific dataset.
getAttributes(mart, dataset, mute_citation = FALSE)getAttributes(mart, dataset, mute_citation = FALSE)
| mart | a character string specifying the database (mart) for which datasets shall be listed. | 
| dataset | a character string specifying the dataset for which attributes shall be listed. | 
| mute_citation | logical value indicating whether citation message should be muted. | 
Hajk-Georg Drost
Other biomaRt: 
biomart(),
getDatasets(),
getMarts(),
organismBM(),
organismFilters()
## Not run: # search for available datasets getMarts() # choose database (mart): ENSEMBL_MART_ENSEMBL # and get a table of all available datasets from this BioMart database head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10) # choose dataset: "hsapiens_gene_ensembl" head(getAttributes(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") , 5) ## End(Not run)## Not run: # search for available datasets getMarts() # choose database (mart): ENSEMBL_MART_ENSEMBL # and get a table of all available datasets from this BioMart database head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10) # choose dataset: "hsapiens_gene_ensembl" head(getAttributes(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") , 5) ## End(Not run)
A wrapper to all bio getters, selected with 'type' argument
getBio( db = "refseq", organism, type, reference = FALSE, release = NULL, gunzip = FALSE, update = FALSE, skip_bacteria = TRUE, path = paste0("set_", toupper(type)), remove_annotation_outliers = FALSE, analyse_genome = FALSE, assembly_type = "toplevel", format = "gff3", mute_citation = FALSE )getBio( db = "refseq", organism, type, reference = FALSE, release = NULL, gunzip = FALSE, update = FALSE, skip_bacteria = TRUE, path = paste0("set_", toupper(type)), remove_annotation_outliers = FALSE, analyse_genome = FALSE, assembly_type = "toplevel", format = "gff3", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| type | biological sequence type. (alternatives are: genome, gff, cds, rna, proteome, assembly_stats, repeat_masker, collection (all the others)) | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| path | character, default location is paste0("set_", toupper(type)) | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| analyse_genome | logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| format | "gff3", alternative "gtf" for ensembl. | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getCDS(),
getCollection(),
getGFF(),
getGenome(),
getProteome(),
getRNA()
Usually you want to use one of the specific set extractors
getBioSet( db = "refseq", organisms, set_type, reference = FALSE, release = NULL, gunzip = TRUE, update = FALSE, skip_bacteria = TRUE, path = paste0("set_", toupper(set_type)), remove_annotation_outliers = FALSE, assembly_type = "toplevel", format = "gff3", mute_citation = FALSE )getBioSet( db = "refseq", organisms, set_type, reference = FALSE, release = NULL, gunzip = TRUE, update = FALSE, skip_bacteria = TRUE, path = paste0("set_", toupper(set_type)), remove_annotation_outliers = FALSE, assembly_type = "toplevel", format = "gff3", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| set_type | the biological sequence type that shall be retrieved. Available options are 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| path | character, default location is paste0("set_", toupper(set_type)) | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| format | "gff3", alternative "gtf" for ensembl. | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getCDSSet(),
getCollectionSet(),
getGFFSet(),
getGenomeSet(),
getProteomeSet(),
getRNASet()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
Main retrieval function for coding sequences (CDS) of an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the CDS information for the organism of interest can be downloaded and stored locally. CDS files can be retrieved from several databases.
getCDS( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "CDS"), mute_citation = FALSE )getCDS( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "CDS"), mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| path | a character string specifying the location (a folder)
in which the corresponding CDS file shall be stored.
Default is  | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCollection(),
getGFF(),
getGenome(),
getProteome(),
getRNA()
Other cds: 
getCDSSet(),
read_cds()
## Not run: # download the genome of Arabidopsis thaliana from refseq # and store the corresponding genome CDS file in '_ncbi_downloads/CDS' file_path <- getCDS( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","CDS")) Ath_CDS <- read_cds(file_path, format = "fasta") ## End(Not run)## Not run: # download the genome of Arabidopsis thaliana from refseq # and store the corresponding genome CDS file in '_ncbi_downloads/CDS' file_path <- getCDS( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","CDS")) Ath_CDS <- read_cds(file_path, format = "fasta") ## End(Not run)
Main CDS retrieval function for a set of organism of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the CDS of the organisms of interest will be downloaded and stored locally. CDS files can be retrieved from several databases.
getCDSSet( db = "refseq", organisms, reference = FALSE, release = NULL, gunzip = TRUE, update = FALSE, path = "set_CDS" )getCDSSet( db = "refseq", organisms, reference = FALSE, release = NULL, gunzip = TRUE, update = FALSE, path = "set_CDS" )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| path | character, default location is paste0("set_", toupper(set_type)) | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCollectionSet(),
getGFFSet(),
getGenomeSet(),
getProteomeSet(),
getRNASet()
Other cds: 
getCDS(),
read_cds()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
Main collection retrieval function for an organism of interest. By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally. Collections can be retrieved from several databases. For full set of collection elements, see: biomartr:::supported_biotypes(db)
getCollection( db = "refseq", organism, reference = TRUE, skip_bacteria = TRUE, release = NULL, assembly_type = "toplevel", analyse_genome = FALSE, remove_annotation_outliers = FALSE, gunzip = FALSE, path = file.path("_db_downloads", "collections"), mute_citation = FALSE )getCollection( db = "refseq", organism, reference = TRUE, skip_bacteria = TRUE, release = NULL, assembly_type = "toplevel", analyse_genome = FALSE, remove_annotation_outliers = FALSE, gunzip = FALSE, path = file.path("_db_downloads", "collections"), mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| analyse_genome | logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| path | a character string specifying the location (a folder) in which
the corresponding collection shall be stored. Default is
 | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getGFF(),
getGenome(),
getProteome(),
getRNA()
Other collection: 
getCollectionSet()
## Not run: # download the collection of Homo sapiens from refseq # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "refseq", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) # download the collection of Homo sapiens from genbank # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "genbank", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) # download the collection of Homo sapiens from ensembl # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "ensembl", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) ## End(Not run)## Not run: # download the collection of Homo sapiens from refseq # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "refseq", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) # download the collection of Homo sapiens from genbank # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "genbank", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) # download the collection of Homo sapiens from ensembl # and store the corresponding genome file in '_ncbi_downloads/collection' Hsap_collection <- getCollection( db = "ensembl", organism = "Homo sapiens", path = file.path("_db_downloads","collections")) ## End(Not run)
Main collection retrieval function for an organism of interest. By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally. Collections can be retrieved from several databases.
getCollectionSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, remove_annotation_outliers = TRUE, path = "set_collections", mute_citation = FALSE )getCollectionSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, remove_annotation_outliers = TRUE, path = "set_collections", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| path | a character string specifying the location (a folder) in which
the corresponding collection shall be stored. Default is
 | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCDSSet(),
getGFFSet(),
getGenomeSet(),
getProteomeSet(),
getRNASet()
Other collection: 
getCollection()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
This funcion queries the BioMart API and returns a table storing all available datasets for a selected BioMart databases.
getDatasets(mart, mute_citation = FALSE)getDatasets(mart, mute_citation = FALSE)
| mart | a character string specifying the database (mart) for which datasets shall be listed. | 
| mute_citation | logical value indicating whether citation message should be muted. | 
Hajk-Georg Drost
Other biomaRt: 
biomart(),
getAttributes(),
getMarts(),
organismBM(),
organismFilters()
## Not run: # search for available datasets # getMarts() # choose database: "ENSEMBL_MART_ENSEMBL" head(getDatasets("ENSEMBL_MART_ENSEMBL"), 10) ## End(Not run)## Not run: # search for available datasets # getMarts() # choose database: "ENSEMBL_MART_ENSEMBL" head(getDatasets("ENSEMBL_MART_ENSEMBL"), 10) ## End(Not run)
Backend function for retrieving files sequence and annotation files from the ENSEMBL ftp server
getENSEMBL( organism, type = "dna", id.type = "toplevel", release = NULL, path, format )getENSEMBL( organism, type = "dna", id.type = "toplevel", release = NULL, path, format )
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| type | character, biological sequence type (e.g. "dna", "cds") | 
| id.type | a character, default "toplevel". id type of assembly, either "toplevel" or "primary_assembly" for genomes. Can be other strings, for non genome objects. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| path | location where file shall be stored. | 
| format | "gff3", alternative "gtf" for ensembl. | 
either a character path to downloaded file, or a logical FALSE, specifying failure.
Hajk-Georg Drost
This function downloads gff files of query organisms from ENSEMBL.
getENSEMBL.gtf(organism, type = "dna", path, release = NULL)getENSEMBL.gtf(organism, type = "dna", path, release = NULL)
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| type | character, biological sequence type (e.g. "dna", "cds") | 
| path | location where file shall be stored. | 
| release | a numeric, the database release version of ENSEMBL ( | 
character filepath to download file, returns FALSE if failed.
Hajk-Georg Drost
This function downloads gff files of query organisms from ENSEMBL.
getENSEMBL.Seq( organism, type = "dna", id.type = "toplevel", release = NULL, path )getENSEMBL.Seq( organism, type = "dna", id.type = "toplevel", release = NULL, path )
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| type | character, biological sequence type (e.g. "dna", "cds") | 
| id.type | a character, default "toplevel". id type of assembly, either "toplevel" or "primary_assembly" for genomes. Can be other strings, for non genome objects. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| path | location where file shall be stored. | 
either a character path to downloaded file, or a logical FALSE, specifying failure.
Hajk-Georg Drost
Retrieve species and genome information from http://rest.ensemblgenomes.org/info/species?content-type=application/json/.
getENSEMBLGENOMESInfo()getENSEMBLGENOMESInfo()
Hajk-Georg Drost
## Not run: info.file <- getENSEMBLGENOMESInfo() info.file ## End(Not run)## Not run: info.file <- getENSEMBLGENOMESInfo() info.file ## End(Not run)
Retrieve species and genome information from http://rest.ensembl.org/info/species?content-type=application/json/.
getENSEMBLInfo(update = FALSE, divisions = ensembl_divisions())getENSEMBLInfo(update = FALSE, divisions = ensembl_divisions())
| update | logical, default FALSE. If TRUE, update cached list,
if FALSE use existing cache (if it exists). For cache location see
 | 
| divisions | character, name of divisions to check, default is all from
 | 
a tibble table storing info for all available ENSEMBL divisions.
Hajk-Georg Drost
ensembl_divisions, get.ensembl.info, getKingdomAssemblySummary
## Not run: # look at available divisions ensembl_divisions() # retrieve information for all ENSEMBL divisions at once test <- getENSEMBLInfo() test # retrieve information for a particular ENSEMBL division (e.g. EnsemblVertebrates) test_vertebrates <- get.ensembl.info(update = TRUE, division = "EnsemblVertebrates") test_vertebrates ## End(Not run)## Not run: # look at available divisions ensembl_divisions() # retrieve information for all ENSEMBL divisions at once test <- getENSEMBLInfo() test # retrieve information for a particular ENSEMBL division (e.g. EnsemblVertebrates) test_vertebrates <- get.ensembl.info(update = TRUE, division = "EnsemblVertebrates") test_vertebrates ## End(Not run)
This funcion queries the BioMart API and returns a table storing all available filters for a specific dataset.
getFilters(mart, dataset, mute_citation = FALSE)getFilters(mart, dataset, mute_citation = FALSE)
| mart | a character string specifying the database (mart) for which datasets shall be listed. | 
| dataset | a character string specifying the dataset for which filters shall be listed. | 
| mute_citation | logical value indicating whether citation message should be muted. | 
Hajk-Georg Drost
getMarts, getDatasets,
getAttributes, organismBM,
organismFilters, organismAttributes
## Not run: # search for available datasets # getMarts() # choose database (mart): "ENSEMBL_MART_ENSEMBL" # head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10) # choose dataset: "hsapiens_gene_ensembl" head(getFilters(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") , 5) ## End(Not run)## Not run: # search for available datasets # getMarts() # choose database (mart): "ENSEMBL_MART_ENSEMBL" # head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL"), 10) # choose dataset: "hsapiens_gene_ensembl" head(getFilters(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") , 5) ## End(Not run)
Main genome retrieval function for an organism of interest.
By specifying the scientific name of an organism of interest the
corresponding fasta-file storing the genome of the organism of interest
can be downloaded and stored locally. Genome files can be retrieved from
several databases. In addition, the genome summary statistics for the
retrieved species is stored locally to provide users with
insights regarding the genome assembly quality (see summary_genome for details).
This is useful when comparing genomes with large difference in genome assembly qualities.
getGenome( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "genomes"), assembly_type = "toplevel", mute_citation = FALSE, analyse_genome = FALSE )getGenome( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "genomes"), assembly_type = "toplevel", mute_citation = FALSE, analyse_genome = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| path | character, default location is paste0("set_", toupper(type)) | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
| analyse_genome | logical, default FALSE. If TRUE, get general genome statistics like gc content etc. For more details, see ?summary_genome | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getCollection(),
getGFF(),
getProteome(),
getRNA()
Other genome: 
getGenomeSet(),
read_genome()
Retrieves NCBI GENOME_REPORTS file from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/overview.txt.
getGENOMEREPORT( local_file = file.path(cachedir(), "_ncbi_downloads", "overview.txt") )getGENOMEREPORT( local_file = file.path(cachedir(), "_ncbi_downloads", "overview.txt") )
| local_file | character, file path, default: file.path(cachedir(), "_ncbi_downloads", "overview.txt") | 
a tibble object with the report
Hajk-Georg Drost
## Not run: report <- getGENOMEREPORT() report ## End(Not run)## Not run: report <- getGENOMEREPORT() report ## End(Not run)
Main genome retrieval function for a set of organism of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the genome of the organisms of interest will be downloaded and stored locally. Genome files can be retrieved from several databases.
getGenomeSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_genomes", assembly_type = "toplevel", mute_citation = FALSE )getGenomeSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_genomes", assembly_type = "toplevel", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| path | a character string specifying the location (a folder) in which
the corresponding genomes shall be stored. Default is
 | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCDSSet(),
getCollectionSet(),
getGFFSet(),
getProteomeSet(),
getRNASet()
Other genome: 
getGenome(),
read_genome()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
Main retrieval function for GFF files of an organism of interest. By specifying the scientific name of an organism of interest the corresponding gff file storing the annotation for the organism of interest can be downloaded and stored locally. GFF files can be retrieved from several databases.
getGFF( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, remove_annotation_outliers = FALSE, path = file.path("_ncbi_downloads", "annotation"), mute_citation = FALSE, format = "gff3" )getGFF( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, remove_annotation_outliers = FALSE, path = file.path("_ncbi_downloads", "annotation"), mute_citation = FALSE, format = "gff3" )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| path | a character string specifying the location (a folder) in which
the corresponding annotation file shall be stored. Default is
 | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
| format | "gff3", alternative "gtf" for ensembl. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getCollection(),
getGenome(),
getProteome(),
getRNA()
Other gff: 
getGFFSet(),
read_gff()
## Not run: # Simple and small yeast download from refseq getGFF( db = "refseq", organism = "Saccharomyces cerevisiae", path = tempdir(), mute_citation = TRUE) # download the annotation of Arabidopsis thaliana from refseq # and store the corresponding genome file in '_ncbi_downloads/annotation' Athal_gff <- getGFF( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Athal_gff_import <- read_gff(Athal_gff) # download the genome of Arabidopsis thaliana from genbank # and store the corresponding genome file in '_ncbi_downloads/annotation' Athal_gff <- getGFF( db = "genbank", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Athal_gff_import <- read_gff(Athal_gff) # download the genome of Homo sapiens from ensembl # and store the corresponding genome file in '_ncbi_downloads/annotation' Hsap_gff <- getGFF( db = "ensembl", organism = "Homo sapiens", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Hsap_gff_import <- read_gff(Hsap_gff) ## End(Not run)## Not run: # Simple and small yeast download from refseq getGFF( db = "refseq", organism = "Saccharomyces cerevisiae", path = tempdir(), mute_citation = TRUE) # download the annotation of Arabidopsis thaliana from refseq # and store the corresponding genome file in '_ncbi_downloads/annotation' Athal_gff <- getGFF( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Athal_gff_import <- read_gff(Athal_gff) # download the genome of Arabidopsis thaliana from genbank # and store the corresponding genome file in '_ncbi_downloads/annotation' Athal_gff <- getGFF( db = "genbank", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Athal_gff_import <- read_gff(Athal_gff) # download the genome of Homo sapiens from ensembl # and store the corresponding genome file in '_ncbi_downloads/annotation' Hsap_gff <- getGFF( db = "ensembl", organism = "Homo sapiens", path = file.path("_ncbi_downloads","annotation"), remove_annotation_outliers = TRUE) Hsap_gff_import <- read_gff(Hsap_gff) ## End(Not run)
Main GFF retrieval function for a set of organism of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the GFF of the organisms of interest will be downloaded and stored locally. GFF files can be retrieved from several databases.
getGFFSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, remove_annotation_outliers = FALSE, update = FALSE, format = "gff3", path = "set_GFF", mute_citation = FALSE )getGFFSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, remove_annotation_outliers = FALSE, update = FALSE, format = "gff3", path = "set_GFF", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| format | "gff3", alternative "gtf" for ensembl. | 
| path | character, default location is paste0("set_", toupper(set_type)) | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCDSSet(),
getCollectionSet(),
getGenomeSet(),
getProteomeSet(),
getRNASet()
Other gff: 
getGFF(),
read_gff()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
This function takes a gene id as character vector from a given query organism and returns the corresponding GO terms and additional GO information.
getGO(organism, genes, filters, ...)getGO(organism, genes, filters, ...)
| organism | a character string specifying the scientific name of a query organism. | 
| genes | a character vector storing the gene ids of a organisms of interest to be queried against Ensembl Biomart. | 
| filters | a character vector specifying the filter (query key) for 
the Ensembl Biomart query, e.g.  | 
| ... | additional parameters that can be passed to the 
 | 
This function takes the scientific name of a query organism, a set of genes for which GO terms and additional information shall be retrieved, and a filter argument that specifies the attribute for the query genes.
Hajk-Georg Drost
biomart, organismFilters, 
organismBM, getBM, 
getMarts, getDatasets, 
getFilters
## Not run: GO_tbl <- getGO(organism = "Arabidopsis thaliana", genes = c("AT1G06090", "AT1G06100"), filters = "ensembl_gene_id") # look at the result head(GO_tbl) ## End(Not run)## Not run: GO_tbl <- getGO(organism = "Arabidopsis thaliana", genes = c("AT1G06090", "AT1G06100"), filters = "ensembl_gene_id") # look at the result head(GO_tbl) ## End(Not run)
A short list of available groups for a kingdom of life.
getGroups(db = "refseq", kingdom)getGroups(db = "refseq", kingdom)
| db | a character string specifying the database from which the genome shall be retrieved: 
 Default is  | 
| kingdom | a character string specifying for which kingdom of life 
groups shall be retrieved. See  | 
Hajk-Georg Drost
meta.retrieval, getGenome, 
getProteome, getCDS, getKingdoms
# get possible kigdom names getKingdoms(db = "refseq") ## Not run: # retrieve subgroups for vertebrate_mammalian available from refseq getGroups(db = "refseq", kingdom = "vertebrate_mammalian") # get possible kigdom names getKingdoms(db = "genbank") # retrieve subgroups for vertebrate_mammalian available from genbank getGroups(db = "genbank", kingdom = "vertebrate_mammalian") ## End(Not run)# get possible kigdom names getKingdoms(db = "refseq") ## Not run: # retrieve subgroups for vertebrate_mammalian available from refseq getGroups(db = "refseq", kingdom = "vertebrate_mammalian") # get possible kigdom names getKingdoms(db = "genbank") # retrieve subgroups for vertebrate_mammalian available from genbank getGroups(db = "genbank", kingdom = "vertebrate_mammalian") ## End(Not run)
Main retrieval function for GTF files of an organism of interest. By specifying the scientific name of an organism of interest the corresponding GTF file storing the annotation for the organism of interest can be downloaded and stored locally. GTF files can be retrieved from several databases.
getGTF( db = "ensembl", organism, reference = FALSE, remove_annotation_outliers = FALSE, path = file.path("ensembl", "annotation"), release = NULL, mute_citation = FALSE )getGTF( db = "ensembl", organism, reference = FALSE, remove_annotation_outliers = FALSE, path = file.path("ensembl", "annotation"), release = NULL, mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| remove_annotation_outliers | shall outlier lines be removed from the input  | 
| path | a character string specifying the location (a folder) in which
the corresponding annotation file shall be stored. Default is
 | 
| release | a numeric, the database release version of ENSEMBL ( | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getCollection(),
getGenome(),
getProteome(),
getRNA()
Other gff: 
getGFFSet(),
read_gff()
## Not run: # download the annotation of Homo sapiens from ensembl # and store the corresponding genome file in 'ensembl/annotation' getGTF(db = "ensembl", organism = "Homo sapiens", path = file.path("ensembl","annotation")) getGTF(db = "ensembl", organism = "Homo sapiens", path = file.path("ensembl","annotation"), assembly_type = "primary_assembly") ## End(Not run)## Not run: # download the annotation of Homo sapiens from ensembl # and store the corresponding genome file in 'ensembl/annotation' getGTF(db = "ensembl", organism = "Homo sapiens", path = file.path("ensembl","annotation")) getGTF(db = "ensembl", organism = "Homo sapiens", path = file.path("ensembl","annotation"), assembly_type = "primary_assembly") ## End(Not run)
Retrieval function of the assembly_summary.txt file from NCBI for all kingdoms. The assembly_summary.txt files store available species on NCBI.
getKingdomAssemblySummary( db, skip_bacteria = TRUE, file = assemblies_info_path(db) )getKingdomAssemblySummary( db, skip_bacteria = TRUE, file = assemblies_info_path(db) )
| db | database name. E.g.  | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| file | path, local path to total summary file, default is in tmp folder. | 
Hajk-Georg Drost
getSummaryFile, getMetaGenomeSummary, get.ensembl.info
## Not run: # This example will run the default version of this function # whereby information for Bacteria are not downloaded test <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = TRUE) test # Users can then retrieve information for Bacteria by skip_bacteria = FALSE test2 <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = FALSE) test2 ## End(Not run)## Not run: # This example will run the default version of this function # whereby information for Bacteria are not downloaded test <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = TRUE) test # Users can then retrieve information for Bacteria by skip_bacteria = FALSE test2 <- getKingdomAssemblySummary(db = "genbank", skip_bacteria = FALSE) test2 ## End(Not run)
A short list of available kingdoms of life
getKingdoms(db = "refseq")getKingdoms(db = "refseq")
| db | a character string specifying the database from which the genome
shall be retrieved:  | 
Hajk-Georg Drost
meta.retrieval, getGenome,
getProteome, getCDS, getGroups
# retrieve kingdoms available from refseq getKingdoms(db = "refseq") # retrieve kingdoms available from genbank getKingdoms(db = "genbank")# retrieve kingdoms available from refseq getKingdoms(db = "refseq") # retrieve kingdoms available from genbank getKingdoms(db = "genbank")
This funcion queries the Ensembl Biomart API and returns a table storing information about all available Ensembl Biomart databases.
getMarts(update = FALSE)getMarts(update = FALSE)
| update | logical, default FALSE. If FALSE, use cached file if it exists. Set to TRUE to force new update | 
Hajk-Georg Drost
Other biomaRt: 
biomart(),
getAttributes(),
getDatasets(),
organismBM(),
organismFilters()
## Not run: # get a table of all available databases from Ensembl Biomart getMarts() ## End(Not run)## Not run: # get a table of all available databases from Ensembl Biomart getMarts() ## End(Not run)
Retrieve available annotation *.gff files for metagenomes
from NCBI Genbank. NCBI Genbank allows users
to download entire metagenomes and their annotations of several metagenome
projects. This function downloads available metagenomes that can then be
downloaded via getMetaGenomes.
getMetaGenomeAnnotations( name, path = file.path("_ncbi_downloads", "metagenome", "annotations"), metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name & total_gene_count > 0) )getMetaGenomeAnnotations( name, path = file.path("_ncbi_downloads", "metagenome", "annotations"), metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name & total_gene_count > 0) )
| name | metagenome name retrieved by  | 
| path | a character string specifying the location (a folder)
in which the corresponding metagenome annotations shall be stored.
Default is
 | 
| metagenomes.members | a tibble with selected assemblies, default: dplyr::filter(getMetaGenomeSummary(), organism_name == name & total_gene_count > 0)). This is different to getMetaGenome since it requires a gff to exist, most genbank assemblies are .gbff files only, which are usually not useful. | 
Hajk-Georg Drost
getMetaGenomes, listMetaGenomes,
getGFF
## Not run: # Frist, retrieve a list of available metagenomes listMetaGenomes() # Now, retrieve the 'human gut metagenome' getMetaGenomeAnnotations(name = "human gut metagenome") ## End(Not run)## Not run: # Frist, retrieve a list of available metagenomes listMetaGenomes() # Now, retrieve the 'human gut metagenome' getMetaGenomeAnnotations(name = "human gut metagenome") ## End(Not run)
Retrieve available metagenomes from NCBI Genbank.
NCBI Genbank allows users to download entire metagenomes of several
metagenome projects. This function downloads available metagenomes that can
then be downloaded via getMetaGenomes.
getMetaGenomes( name, path = file.path("_ncbi_downloads", "metagenome"), metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name) )getMetaGenomes( name, path = file.path("_ncbi_downloads", "metagenome"), metagenomes.members = dplyr::filter(getMetaGenomeSummary(), organism_name == name) )
| name | metagenome name retrieved by  | 
| path | a character string specifying the location (a folder) in
which the corresponding metagenome shall be stored.
Default is  | 
| metagenomes.members | a tibble of metagenome assemblies, default: dplyr::filter(getMetaGenomeSummary(), organism_name == name) | 
Hajk-Georg Drost
getMetaGenomeAnnotations,
listMetaGenomes
## Not run: # Frist, retrieve a list of available metagenomes listMetaGenomes() # Now, retrieve the 'human gut metagenome' getMetaGenomes(name = "human gut metagenome") ## End(Not run)## Not run: # Frist, retrieve a list of available metagenomes listMetaGenomes() # Now, retrieve the 'human gut metagenome' getMetaGenomes(name = "human gut metagenome") ## End(Not run)
Retrieval function of the assembly_summary.txt file from NCBI genbank metagenomes. This files stores all available metagenome projects on NCBI Genbank.
getMetaGenomeSummary( local_file = file.path(cachedir(), "assembly_summary_metagenomes_genbank.txt") )getMetaGenomeSummary( local_file = file.path(cachedir(), "assembly_summary_metagenomes_genbank.txt") )
| local_file | where to store this backend file, default: file.path(cachedir(), "assembly_summary_metagenomes_genbank.txt") | 
a tibble table
Hajk-Georg Drost
getKingdomAssemblySummary,
getSummaryFile
## Not run: meta.summary <- getMetaGenomeSummary() meta.summary ## End(Not run)## Not run: meta.summary <- getMetaGenomeSummary() meta.summary ## End(Not run)
Main proteome retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the proteome of the organism of interest can be downloaded and stored locally. Proteome files can be retrieved from several databases.
getProteome( db = "refseq", organism, reference = TRUE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, update = TRUE, path = file.path("_ncbi_downloads", "proteomes"), mute_citation = FALSE )getProteome( db = "refseq", organism, reference = TRUE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, update = TRUE, path = file.path("_ncbi_downloads", "proteomes"), mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default TRUE. (Uniprot only for now!) If species info file exists already, do not re download, makes it faster but the file can be old, i.e. no longer as complete as it could be. | 
| path | a character string specifying the location (a folder) in which
the corresponding proteome shall be stored. Default is
 | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded proteome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getCollection(),
getGFF(),
getGenome(),
getRNA()
Other proteome: 
getProteomeSet(),
read_proteome()
## Not run: # download the proteome of Arabidopsis thaliana from NCBI RefSeq # and store the corresponding proteome file in '_ncbi_downloads/refseq/proteomes' file_path <- getProteome( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","refseq","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # download the proteome of Arabidopsis thaliana from NCBI Genbank # and store the corresponding proteome file in '_ncbi_downloads/genbank/proteomes' file_path <- getProteome( db = "genbank", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","genbank","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # and store the corresponding proteome file in '_downloads/uniprot/proteomes' file_path <- getProteome( db = "uniprot", organism = "Arabidopsis thaliana", path = file.path("_downloads","uniprot","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # download the proteome of Arabidopsis thaliana from ENSEMBL # and store the corresponding proteome file in '_downloads/ensembl/proteomes' file_path <- getProteome( db = "ensembl", organism = "Arabidopsis thaliana", path = file.path("_downloads","ensembl","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") ## End(Not run)## Not run: # download the proteome of Arabidopsis thaliana from NCBI RefSeq # and store the corresponding proteome file in '_ncbi_downloads/refseq/proteomes' file_path <- getProteome( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","refseq","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # download the proteome of Arabidopsis thaliana from NCBI Genbank # and store the corresponding proteome file in '_ncbi_downloads/genbank/proteomes' file_path <- getProteome( db = "genbank", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","genbank","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # and store the corresponding proteome file in '_downloads/uniprot/proteomes' file_path <- getProteome( db = "uniprot", organism = "Arabidopsis thaliana", path = file.path("_downloads","uniprot","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") # download the proteome of Arabidopsis thaliana from ENSEMBL # and store the corresponding proteome file in '_downloads/ensembl/proteomes' file_path <- getProteome( db = "ensembl", organism = "Arabidopsis thaliana", path = file.path("_downloads","ensembl","proteomes") ) # import proteome into R session Ath_proteome <- read_proteome(file_path, format = "fasta") ## End(Not run)
Main proteome retrieval function for a set of organism of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the proteome of the organisms of interest will be downloaded and stored locally. proteome files can be retrieved from several databases.
getProteomeSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_proteomes", mute_citation = FALSE )getProteomeSet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_proteomes", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| path | a character string specifying the location (a folder) in which
the corresponding proteomes shall be stored. Default is  | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCDSSet(),
getCollectionSet(),
getGFFSet(),
getGenomeSet(),
getRNASet()
Other proteome: 
getProteome(),
read_proteome()
## Not run: # download the proteomes of three different species at the same time #### Database: NCBI RefSeq file_paths <- getProteomeSet(db = "refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")) # look at file paths file_paths #### Database: NCBI Genbank file_paths <- getProteomeSet(db = "genbank", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")) # look at file paths file_paths # download the proteomes of three different species at the same time #### Database: ENSEMBL file_paths <- getProteomeSet(db = "ensembl", organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans")) # look at file paths file_paths # download the proteomes of three different species at the same time #### Database: UniProt file_paths <- getProteomeSet(db = "uniprot", organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans")) # look at file paths file_paths ## End(Not run)## Not run: # download the proteomes of three different species at the same time #### Database: NCBI RefSeq file_paths <- getProteomeSet(db = "refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")) # look at file paths file_paths #### Database: NCBI Genbank file_paths <- getProteomeSet(db = "genbank", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")) # look at file paths file_paths # download the proteomes of three different species at the same time #### Database: ENSEMBL file_paths <- getProteomeSet(db = "ensembl", organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans")) # look at file paths file_paths # download the proteomes of three different species at the same time #### Database: UniProt file_paths <- getProteomeSet(db = "uniprot", organisms = c("Homo sapiens", "Mus musculus", "Caenorhabditis elegans")) # look at file paths file_paths ## End(Not run)
Retrieve available database releases or versions of ENSEMBL.
getReleases(db = "ensembl")getReleases(db = "ensembl")
| db | a character string specifying the database from which available resease versions shall be retrieved: 
 | 
Hajk-Georg Drost
## Not run: # retrieve available resease versions of ENSEMBL getReleases("ensembl") ## End(Not run)## Not run: # retrieve available resease versions of ENSEMBL getReleases("ensembl") ## End(Not run)
Main Repeat Masker output retrieval function for an organism of interest. By specifying the scientific name of an organism of interest the corresponding Repeat Masker file storing the genome of the organism of interest can be downloaded and stored locally. Repeat Masker files can be retrieved from several databases.
getRepeatMasker( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "repeatmasker"), mute_citation = FALSE )getRepeatMasker( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "repeatmasker"), mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | a character string specifying the scientific name of the
organism of interest, e.g.  | 
| reference | a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | most recent database version is used. release = 75 would for human would give the stable GRCh37 release in ensembl. Value must be > 46, since ensembl did not structure their data if the standard format before that. | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| path | a character string specifying the location (a folder) in which
the corresponding file shall be stored. Default is
 | 
| mute_citation | logical value indicating whether citation message should be muted. | 
Internally this function loads the the overview.txt file from NCBI:
refseq: refseq_genbank_ftp_server_url_genome_specific("refseq")
genbank: refseq_genbank_ftp_server_url_genome_specific("genbank")
and creates a directory '_ncbi_downloads/repeatmasker' to store the files of interest as fasta file for future processing. In case the corresponding fasta file already exists within the '_ncbi_downloads/repeatmasker' folder and is accessible within the workspace, no download process will be performed.
File path to downloaded Repeat Masker output file.
Hajk-Georg Drost
getGenome, getProteome, getCDS,
getGFF, getRNA, getCollection, meta.retrieval,
read_rm
## Not run: # download the Repeat Masker output file of Homo sapiens from refseq # and store the corresponding genome file in '_ncbi_downloads/genomes' file_path <- getRepeatMasker( db = "refseq", organism = "Homo sapiens", path = file.path("_ncbi_downloads","repeatmasker")) Hsap_repeatmasker <- read_rm(file_path) ## End(Not run)## Not run: # download the Repeat Masker output file of Homo sapiens from refseq # and store the corresponding genome file in '_ncbi_downloads/genomes' file_path <- getRepeatMasker( db = "refseq", organism = "Homo sapiens", path = file.path("_ncbi_downloads","repeatmasker")) Hsap_repeatmasker <- read_rm(file_path) ## End(Not run)
Main retrieval function for RNA sequences of an organism of interest. By specifying the scientific name of an organism of interest the corresponding fasta-file storing the RNA information for the organism of interest can be downloaded and stored locally. RNA files can be retrieved from several databases.
getRNA( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, assembly_type = "toplevel", path = file.path("_ncbi_downloads", "RNA"), gunzip = FALSE, mute_citation = FALSE )getRNA( db = "refseq", organism, reference = FALSE, skip_bacteria = TRUE, release = NULL, assembly_type = "toplevel", path = file.path("_ncbi_downloads", "RNA"), gunzip = FALSE, mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | Organism selector id, there are three options to characterize an organism: 
 | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| release | a numeric, the database release version of ENSEMBL ( | 
| assembly_type | character, default c("primary_assembly", "toplevel"). Used for ensembl only, specifies the genome assembly type. Searches for both primary and toplevel, and if both are found, uses the first by order (so primary is prioritized by default). The Primary assembly should usually be used if it exists. The "primary assembly" contains all the top-level sequence regions, excluding alternative haplotypes and patches. If the primary assembly file is not present for a species (only defined for standard model organisms), that indicates that there were no haplotype/patch regions, and in such cases, the 'toplevel file is used. For more details see: ensembl tutorial | 
| path | a character string specifying the location (a folder) in which
the corresponding
CDS file shall be stored. Default is
 | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
File path to downloaded genome.
Hajk-Georg Drost
Other getBio: 
getBio(),
getCDS(),
getCollection(),
getGFF(),
getGenome(),
getProteome()
Other rna: 
getRNASet(),
read_rna()
## Not run: # download the RNA of Arabidopsis thaliana from refseq # and store the corresponding RNA file in '_ncbi_downloads/RNA' file_path <- getRNA( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","RNA")) Ath_RNA <- read_rna(file_path, format = "fasta") ## End(Not run)## Not run: # download the RNA of Arabidopsis thaliana from refseq # and store the corresponding RNA file in '_ncbi_downloads/RNA' file_path <- getRNA( db = "refseq", organism = "Arabidopsis thaliana", path = file.path("_ncbi_downloads","RNA")) Ath_RNA <- read_rna(file_path, format = "fasta") ## End(Not run)
Main RNA retrieval function for a set of organism of interest. By specifying the scientific names of the organisms of interest the corresponding fasta-files storing the RNA of the organisms of interest will be downloaded and stored locally. RNA files can be retrieved from several databases.
getRNASet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_RNAs", mute_citation = FALSE )getRNASet( db = "refseq", organisms, reference = FALSE, release = NULL, skip_bacteria = TRUE, gunzip = TRUE, update = FALSE, path = "set_RNAs", mute_citation = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organisms | a character vector storing the names of the organisms than shall be retrieved. There are three available options to characterize an organism: | 
| reference | a logical value indicating whether or not a genome shall be a candidate for downloaded if it isn't marked in the database as either a reference genome or a representative genome. This is helpful if you don't want to allow "partial genomes" etc. | 
| release | a numeric, the database release version of ENSEMBL ( | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| gunzip | a logical, indicating whether or not files should be unzipped. | 
| update | logical, default FALSE. Updated backend cached files needed. Usually keep this false, to make ut run much faster. Only set to TRUE, if you believe you cache is outdated (Species only exist in newest release etc) | 
| path | a character string specifying the location (a folder) in which
the corresponding RNAs shall be stored. Default is  | 
| mute_citation | logical, default FALSE, indicating whether citation message should be muted. | 
Fetching of assembly / sequence data is done by fetching an overview file
from metadata of given database:
For NCBI (refseq/genbank): 
Internally this function loads the the overview.txt file from NCBI:
refseq: ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp.ncbi.nlm.nih.gov/genomes/genbank/
It will then create a directory relative to file type wanted,
if you get fasta genomes it will be
_ncbi_downloads/genomes' etc.
In case the corresponding fasta file already exists within the
'_ncbi_downloads/genomes' folder and is accessible within the workspace,
no download process will be performed.
For other file types the same rule applies.
For ensembl it fetches overview per type from the rest API:
ensembl: https://rest.ensembl.org
character vector, the file path to the downloaded genomes,
The returned character vector has names as either:
- 'new' (file was downloaded now)
- 'old' files did already exist)
Hajk-Georg Drost
Other getBioSet: 
getBioSet(),
getCDSSet(),
getCollectionSet(),
getGFFSet(),
getGenomeSet(),
getProteomeSet()
Other rna: 
getRNA(),
read_rna()
## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)## Not run: getBioSet("refseq", organisms = c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella"), set_type = "cds") ## End(Not run)
Retrieval function of the assembly_summary.txt file from NCBI.
getSummaryFile(db, kingdom, file = assemblies_info_path(db, kingdom))getSummaryFile(db, kingdom, file = assemblies_info_path(db, kingdom))
| db | database name. E.g.  | 
| kingdom | kingdom for which assembly_summary.txt file shall be
retrieved. See also  | 
| file | path, local path to total summary file, default is in tmp folder. | 
Hajk-Georg Drost
getKingdomAssemblySummary,
getMetaGenomeSummary
## Not run: test <- getSummaryFile("refseq","plant") test ## End(Not run)## Not run: test <- getSummaryFile("refseq","plant") test ## End(Not run)
Get uniprot info from organism
getUniProtInfo(organism, path = cachedir(), update = TRUE)getUniProtInfo(organism, path = cachedir(), update = TRUE)
| organism | character, name of organism | 
| path | path at which the info file shall be stored locally. | 
| update | shall the internal  | 
The UniProt stores a STATS file to summarise all available information for their reference proteomes.
Users can now download this file and process it with biomartr.
getUniProtSTATS(update = FALSE)getUniProtSTATS(update = FALSE)
| update | shall the internal  | 
Hajk-Georg Drost
## Not run: # retrieve STATS file from UniProt uniprot_info <- getUniProtSTATS(update = TRUE) # look at results uniprot_info ## End(Not run)## Not run: # retrieve STATS file from UniProt uniprot_info <- getUniProtSTATS(update = TRUE) # look at results uniprot_info ## End(Not run)
This function checks the availability of a given genome on the NBCI servers specified as scientific name.
is.genome.available( db = "refseq", organism, skip_bacteria = TRUE, details = FALSE )is.genome.available( db = "refseq", organism, skip_bacteria = TRUE, details = FALSE )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| organism | there are three options to characterize an organism: 
 | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
| details | a logical value specifying whether or not details on genome size, kingdom, etc. shall be printed to the console intead of a boolean value. | 
Internally this function calls the listGenomes function to
detect all available genomes and checks whether or not the specified organism
is available for download.
a logical value specifing whether or not the genome of the input
organism is available. In case details = TRUE only a character
string specifying the genome details is being returned.
Hajk-Georg Drost
## Not run: # checking whether the Homo sapiens genome is stored on NCBI is.genome.available(organism = "Homo sapiens", db = "refseq") # and printing details is.genome.available(organism = "Homo sapiens", db = "refseq", details = TRUE) # checking whether the Homo sapiens genome is stored on ENSEMBL is.genome.available(organism = "Homo sapiens", db = "ensembl") # and printing details is.genome.available(organism = "Homo sapiens", details = TRUE, db = "ensembl") ## End(Not run)## Not run: # checking whether the Homo sapiens genome is stored on NCBI is.genome.available(organism = "Homo sapiens", db = "refseq") # and printing details is.genome.available(organism = "Homo sapiens", db = "refseq", details = TRUE) # checking whether the Homo sapiens genome is stored on ENSEMBL is.genome.available(organism = "Homo sapiens", db = "ensembl") # and printing details is.genome.available(organism = "Homo sapiens", details = TRUE, db = "ensembl") ## End(Not run)
This function allows you to retrieve a list of database names and versions that can be downloaded from correspondning servers.
Database retrieval is crucial for most biological studies and analyses.
There is a vast diversity of databases that can be accessed remotely or that
can be downloaded to your local machine. This function provides an interface
to databases that can be downloaded from NCBI servers and lists all available
databases and their database version to be able to select an appropriate
database for download with download.database.
listDatabases(db = "nr", update = FALSE) listNCBIDatabases(db = "nr", update = FALSE)listDatabases(db = "nr", update = FALSE) listNCBIDatabases(db = "nr", update = FALSE)
| db | a character string specifying the name of the database that shall be searched for. | 
| update | a logical value specifying whether or not the local listDatabases.txt file shall be updated by remote access to NCBI. | 
Hajk-Georg Drost
download.database, download.database.all
## Not run: # retrieve all versions of the NCBI 'nr' database that can be downloaded listNCBIDatabases(db = "nr") # analogous: # listNCBIDatabases(db = "cdd") # listNCBIDatabases(db = "nt") # listNCBIDatabases(db = "gss") # listNCBIDatabases(db = "refseq_protein") ## End(Not run)## Not run: # retrieve all versions of the NCBI 'nr' database that can be downloaded listNCBIDatabases(db = "nr") # analogous: # listNCBIDatabases(db = "cdd") # listNCBIDatabases(db = "nt") # listNCBIDatabases(db = "gss") # listNCBIDatabases(db = "refseq_protein") ## End(Not run)
This function retrieves the names of all genomes available on the NCBI ftp:// server and stores the results in a file named 'overview.txt' inside the directory _ncbi_downloads' that is built inside the workspace.
listGenomes( db = "refseq", type = "all", subset = NULL, details = FALSE, update = FALSE, skip_bacteria = FALSE )listGenomes( db = "refseq", type = "all", subset = NULL, details = FALSE, update = FALSE, skip_bacteria = FALSE )
| db | a character string specifying the database for which genome availability shall be checked. Available options are: 
 | 
| type | a character string specifying a potential filter of available genomes. Available options are: 
 | 
| subset | a character string or character vector specifying a subset of
 | 
| details | a boolean value specifying whether only the scientific names of stored genomes shall be returned (details = FALSE) or all information such as 
 | 
| update | logical, default FALSE. If TRUE, update cached list,
if FALSE use existing cache (if it exists). For cache location see
 | 
| skip_bacteria | Due to its enormous dataset size (> 700MB as of July 2023),
the bacterial summary file will not be loaded by default anymore. If users
wish to gain insights for the bacterial kingdom they needs to actively specify  | 
Internally this function loads the the overview.txt file from NCBI
and creates a directory '_ncbi_downloads' in the temdir()
folder to store the overview.txt file for future processing. In case the
overview.txt file already exists within the '_ncbi_downloads' folder and is
accessible within the workspace, no download process will be performed again.
Please note that the ftp:// connection relies on the NCBI or ENSEMBL server and cannot be accurately accessed via a proxy.
Hajk-Georg Drost
## Not run: # print details for refseq listGenomes(db = "refseq") # print details for all plants in refseq listGenomes(db = "refseq", type = "kingdom") # print details for all plant groups in refseq listGenomes(db = "refseq", type = "group") # print details for all plant subgroups in refseq listGenomes(db = "refseq", type = "subgroup") # Ensembl listGenomes(db = "ensembl", type = "kingdom", subset = "EnsemblVertebrates") ## End(Not run)## Not run: # print details for refseq listGenomes(db = "refseq") # print details for all plants in refseq listGenomes(db = "refseq", type = "kingdom") # print details for all plant groups in refseq listGenomes(db = "refseq", type = "group") # print details for all plant subgroups in refseq listGenomes(db = "refseq", type = "subgroup") # Ensembl listGenomes(db = "ensembl", type = "kingdom", subset = "EnsemblVertebrates") ## End(Not run)
Users can retrieve the available number of sequenced 
genomes per group. Only available for db = "refseq" and 
db = "genbank".
listGroups(db = "refseq", kingdom = "all", details = FALSE)listGroups(db = "refseq", kingdom = "all", details = FALSE)
| db | a character string specifying the database for which genome availability shall be checked. Available options are: 
 | 
| kingdom | a kingdom specification retrieved by 
 | 
| details | shall all species corresponding to the specified 
 | 
Hajk-Georg Drost
listGenomes, is.genome.available, 
listKingdoms
## Not run: # retrieve a list of available kingdoms getKingdoms() # [1] "archaea" "bacteria" "fungi" "invertebrate" "plant" # [6] "protozoa" "vertebrate_mammalian" "vertebrate_other" "viral" # example for refseq listGroups(db = "refseq") # example for genbank listGroups(db = "genbank") ### in case groups should be specified by kingdom # first, retrieve available kingdom names listKingdoms() # now we choose kingdom "bacteria" listGroups(db = "refseq", kingdom = "bacteria") # or listGroups(db = "genbank", kingdom = "bacteria") ## End(Not run)## Not run: # retrieve a list of available kingdoms getKingdoms() # [1] "archaea" "bacteria" "fungi" "invertebrate" "plant" # [6] "protozoa" "vertebrate_mammalian" "vertebrate_other" "viral" # example for refseq listGroups(db = "refseq") # example for genbank listGroups(db = "genbank") ### in case groups should be specified by kingdom # first, retrieve available kingdom names listKingdoms() # now we choose kingdom "bacteria" listGroups(db = "refseq", kingdom = "bacteria") # or listGroups(db = "genbank", kingdom = "bacteria") ## End(Not run)
Users can retrieve the available number of sequenced genomes per kingdom.
listKingdoms(db = "refseq")listKingdoms(db = "refseq")
| db | a character string specifying the database for which genome 
availability shall be checked, 
e.g.  | 
Hajk-Georg Drost
listGenomes, is.genome.available, 
listGroups
## Not run: # list number of available genomes in refseq for each kingdom of life listKingdoms(db = "refseq") # example for genbank listKingdoms(db = "genbank") # example for ensembl listKingdoms(db = "ensembl") ## End(Not run)## Not run: # list number of available genomes in refseq for each kingdom of life listKingdoms(db = "refseq") # example for genbank listKingdoms(db = "genbank") # example for ensembl listKingdoms(db = "ensembl") ## End(Not run)
List available metagenomes on NCBI genbank. NCBI genbank 
allows users to download entire metagenomes of several metagenome projects. 
This function lists all available metagenomes that can then be downloaded via
getMetaGenomes.
listMetaGenomes(details = FALSE)listMetaGenomes(details = FALSE)
| details | a boolean value specifying whether only the scientific names 
of stored metagenomes shall be returned ( | 
Hajk-Georg Drost
getMetaGenomes, getMetaGenomeSummary
## Not run: # retrieve available metagenome projects at NCBI Genbank listMetaGenomes() # retrieve detailed information on available metagenome projects # at NCBI Genbank listMetaGenomes(details = TRUE) ## End(Not run)## Not run: # retrieve available metagenome projects at NCBI Genbank listMetaGenomes() # retrieve detailed information on available metagenome projects # at NCBI Genbank listMetaGenomes(details = TRUE) ## End(Not run)
Download genomes, proteomes, cds, gff, rna, or assembly stats files of all species within a kingdom of life.
meta.retrieval( db = "refseq", kingdom, group = NULL, type = "genome", restart_at_last = TRUE, reference = FALSE, combine = FALSE, path = NULL )meta.retrieval( db = "refseq", kingdom, group = NULL, type = "genome", restart_at_last = TRUE, reference = FALSE, combine = FALSE, path = NULL )
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| kingdom | a character string specifying the kingdom of the organisms of interest, e.g. 
 Available kingdoms can be retrieved with  | 
| group | only species belonging to this subgroup will be downloaded.
Groups can be retrieved with  | 
| type | type of sequences that shall be retrieved. Options are: 
 | 
| restart_at_last | a logical value indicating whether or not  
 | 
| reference | a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. Options are: 
 | 
| combine | just in case  | 
| path | path to the folder in which downloaded genomes shall be stored. By default the kingdom name is used to name the output folder. | 
This function aims to perform bulk retrieval of the genomes, proteomes, cds, etc. of species that belong to the same kingdom of life or to the same subgroup.
a character vector storing the file paths of the retrieved files.
Hajk-Georg Drost
Other meta_retrival: 
meta.retrieval.all()
## Not run: # get all available kingdoms for refseq getKingdoms(db = "refseq") # download all vertebrate genomes from refseq meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") # get all available kingdoms for genbank getKingdoms(db = "genbank") # download all vertebrate genomes from genbank meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome") # In case users do not wish to retrieve genomes from an entire kingdom, # but rather from a subgoup (e.g. from species belonging to the # Gammaproteobacteria class, a subgroup of the bacteria kingdom), # they can use the following workflow" # First, users can again consult the getKingdoms() function to retrieve # kingdom information. getKingdoms(db = "refseq") # In this example, we will choose the bacteria kingdom. # Now, the getGroups() function allows users to obtain available # subgroups of the bacteria kingdom. getGroups(db = "refseq", kingdom = "bacteria") # Now we choose the group Gammaproteobacteria and specify # the group argument in the meta.retrieval() function meta.retrieval(kingdom = "bacteria", roup = "Gammaproteobacteria", db = "refseq", type = "genome") ## End(Not run)## Not run: # get all available kingdoms for refseq getKingdoms(db = "refseq") # download all vertebrate genomes from refseq meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") # get all available kingdoms for genbank getKingdoms(db = "genbank") # download all vertebrate genomes from genbank meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome") # In case users do not wish to retrieve genomes from an entire kingdom, # but rather from a subgoup (e.g. from species belonging to the # Gammaproteobacteria class, a subgroup of the bacteria kingdom), # they can use the following workflow" # First, users can again consult the getKingdoms() function to retrieve # kingdom information. getKingdoms(db = "refseq") # In this example, we will choose the bacteria kingdom. # Now, the getGroups() function allows users to obtain available # subgroups of the bacteria kingdom. getGroups(db = "refseq", kingdom = "bacteria") # Now we choose the group Gammaproteobacteria and specify # the group argument in the meta.retrieval() function meta.retrieval(kingdom = "bacteria", roup = "Gammaproteobacteria", db = "refseq", type = "genome") ## End(Not run)
Download genomes, proteomes, cds, gff, rna, or assembly stats files of individual species of all kingdoms of life.
meta.retrieval.all(db = "refseq", type = "genome", reference = FALSE)meta.retrieval.all(db = "refseq", type = "genome", reference = FALSE)
| db | a character string specifying the database from which the genome shall be retrieved: 
 | 
| type | type of sequences that shall be retrieved. Options are: 
 | 
| reference | a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome. Options are: 
 | 
This function aims to perform bulk retrieval of all genomes of species for all kingdoms of life.
a character vector storing the file paths of the retrieved files.
Hajk-Georg Drost
Other meta_retrival: 
meta.retrieval()
## Not run: # download all genomes from refseq meta.retrieval.all(db = "refseq", type = "genome") # download all vertebrate genomes from genbank meta.retrieval.all(db = "genbank", type = "genome") # download all vertebrate genomes from ensemblgenomes meta.retrieval.all(db = "genbank", type = "ensemblgenomes") ## End(Not run)## Not run: # download all genomes from refseq meta.retrieval.all(db = "refseq", type = "genome") # download all vertebrate genomes from genbank meta.retrieval.all(db = "genbank", type = "genome") # download all vertebrate genomes from ensemblgenomes meta.retrieval.all(db = "genbank", type = "ensemblgenomes") ## End(Not run)
In addition to the organismBM function, 
this function returns all available attributes that can be accessed through 
different marts and datasets for a given query organism.
organismAttributes(organism, update = FALSE, topic = NULL)organismAttributes(organism, update = FALSE, topic = NULL)
| organism | a character string specifying the scientific name of a query organism. | 
| update | a logical value specifying whether or not the local listMart.txt, listDatasets.txt, and listAttributes_organism.txt files shall be updated by remote access to BioMart. | 
| topic | a character string specifying a topic (category) of attributes, 
e.g.  | 
For a given query organism, this function retrieves all available attributes that can be accessed through different marts and datasets.
Sometimes the same attribute names correspond to different datasets and 
marts causing problems when using getMarts. The approach 
introduced by this function provides (again) a organism centric way of 
accessing organism specific attributes.
The topic argument allows the user to search for specific attribute 
topics/categories for faster filtering.
a data.frame storing corresponding attribute names, description, datasets, and marts.
When you run this function for the first time, the data retrieval procedure 
will take some time, due to the remote access to BioMart. The corresponding 
result is then saved in a *.txt file within the tempdir
directory named "_biomart/listMarts.txt","_biomart/listDatasets.txt", and 
"_biomart/listAttributes_organism.txt", allowing subsequent queries to 
perform much faster.
Hajk-Georg Drost
Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).
BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).
organismFilters, organismBM, 
biomart, listAttributes
## Not run: # search for attribute topic id head(organismAttributes("Homo sapiens", topic = "id"), 20) ## End(Not run)## Not run: # search for attribute topic id head(organismAttributes("Homo sapiens", topic = "id"), 20) ## End(Not run)
This function returns either all available biomart connections for all available organisms for which biomart access is possible, or (when specified) returns all organism specific biomart connections.
organismBM(organism = NULL, update = FALSE, mute_citation = TRUE)organismBM(organism = NULL, update = FALSE, mute_citation = TRUE)
| organism | a character string specifying the scientific name of a
query organism. Default is  | 
| update | a logical value specifying whether or not the local listMart.txt and listDatasets.txt files shall be updated by remote access to BioMart. | 
| mute_citation | logical value indicating whether citation message should be muted. | 
This function collects all available biomart connections and returns a table storing the organism for which biomart connections are available as well as the corresponding mart and database.
When you run this function for the first time, the data retrieval procedure
will take some time, due to the remote access to BioMart. The corresponding
result is then saved in a *.txt file named "_biomart/listDatasets.txt" in the
tempdir directory, allowing subsequent queries to perform
much faster.
Hajk-Georg Drost
Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).
BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).
Other biomaRt: 
biomart(),
getAttributes(),
getDatasets(),
getMarts(),
organismFilters()
## Not run: # returning all available biomart connections head(organismBM(), 20) # retrieving all available datasets and biomart connections for # a specific query organism (scientific name) organismBM(organism = "Homo sapiens") # you can also update the downloaded version using # the "update = TRUE" argument head(organismBM(update = TRUE), 20) ## End(Not run)## Not run: # returning all available biomart connections head(organismBM(), 20) # retrieving all available datasets and biomart connections for # a specific query organism (scientific name) organismBM(organism = "Homo sapiens") # you can also update the downloaded version using # the "update = TRUE" argument head(organismBM(update = TRUE), 20) ## End(Not run)
In addition to the organismBM and
organismAttributes functions, this function
returns all available filters that can be accessed through different marts
and datasets for a given query organism.
organismFilters(organism, update = FALSE, topic = NULL)organismFilters(organism, update = FALSE, topic = NULL)
| organism | a character string specifying the scientific name of a query organism. | 
| update | a logical value specifying whether or not the local listMart.txt, listDatasets.txt, and listFilters_organism.txt files shall be updated by remote access to BioMart. | 
| topic | a character string specifying a topic (category) of filters,
e.g.  | 
For a given query organism, this function retrieves all available filters that can be accessed through different marts and datasets.
Sometimes the same filter names correspond to different datasets and
marts causing problems when using getMarts.
The approach introduced by this function provides (again) a organism centric
way of accessing organism specific filters.
The topic argument allows the user to search for specific filters
topics/categories for faster selection.
a data.frame storing corresponding filter names, description, datasets, and marts.
When you run this function for the first time, the data retrieval procedure
will take some time, due to the remote access to BioMart. The corresponding
result is then saved in a *.txt file within the tempdir
directory  named "_biomart/listMarts.txt","_biomart/listDatasets.txt", and
"_biomart/listFilters_organism.txt", allowing subsequent queries to perform
much faster.
Hajk-Georg Drost
Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).
BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).
Other biomaRt: 
biomart(),
getAttributes(),
getDatasets(),
getMarts(),
organismBM()
## Not run: # search for filter topic "id" head(organismFilters("Homo sapiens", topic = "id"), 20) ## End(Not run)## Not run: # search for filter topic "id" head(organismFilters("Homo sapiens", topic = "id"), 20) ## End(Not run)
This function reads an organism specific Genome Assembly
Stats file that was retrieved with getAssemblyStats.
read_assemblystats(file, type = "raw")read_assemblystats(file, type = "raw")
| file | a character string specifying the path to the file storing the Genome Assembly Stats file. | 
| type | a tibble object, either  | 
This function takes a string specifying the path to the Genome
Assembly Stats file of interest (e.g. the path returned by
getAssemblyStats) and imports it.
Hajk-Georg Drost
getAssemblyStats, read_genome,
read_proteome, read_cds, read_gff
This function reads an organism specific CDS stored in a defined file format.
read_cds( file, format = "fasta", obj.type = "Biostrings", delete_corrupt = FALSE, ... )read_cds( file, format = "fasta", obj.type = "Biostrings", delete_corrupt = FALSE, ... )
| file | a character string specifying the path to the file storing the CDS. | 
| format | a character string specifying the file format used to store
the genome, e.g.  | 
| obj.type | a character string specifying the object stype in which the
genomic sequence shall be represented.
Either as  | 
| delete_corrupt | a logical value specifying whether potential CDS
sequences that cannot be divided by 3 shall be
be excluded from the the dataset. Default is  | 
| ... | additional arguments that are used by
 | 
The read.cds function takes a string specifying the path
to the cds file of interest as first argument.
It is possible to read in different proteome file standards such as fasta or genebank.
CDS stored in fasta files can be downloaded from http://www.ensembl.org/info/data/ftp/index.html.
A data.table storing the gene id in the first column and the corresponding sequence as string in the second column.
Hajk-Georg Drost
Other cds: 
getCDS(),
getCDSSet()
Other readers: 
read_genome(),
read_gff(),
read_proteome(),
read_rna()
This function reads an organism specific genome stored in a defined file format.
read_genome(file, format = "fasta", obj.type = "Biostrings", ...)read_genome(file, format = "fasta", obj.type = "Biostrings", ...)
| file | a character string specifying the path to the file storing the genome. | 
| format | a character string specifying the file format used to store
the genome, e.g.  | 
| obj.type | a character string specifying the object stype in which
the genomic sequence shall be represented. Either as
 | 
| ... | additional arguments that are used by the
 | 
This function takes a string specifying the path to the genome file
of interest as first argument (e.g. the path returned by
getGenome).
Either a Biostrings or data.table object.
Hajk-Georg Drost
Other genome: 
getGenome(),
getGenomeSet()
Other readers: 
read_cds(),
read_gff(),
read_proteome(),
read_rna()
This function reads an organism specific CDS stored in a defined file format.
read_gff(file)read_gff(file)
| file | a character string specifying the path to the file storing the CDS. | 
This function takes a string specifying the path to the GFF file
of interest (e.g. the path returned by getGFF).
Either a Biostrings or data.table object.
Hajk-Georg Drost
Other gff: 
getGFF(),
getGFFSet()
Other readers: 
read_cds(),
read_genome(),
read_proteome(),
read_rna()
This function reads an organism specific proteome stored in a defined file format.
read_proteome(file, format = "fasta", obj.type = "Biostrings", ...)read_proteome(file, format = "fasta", obj.type = "Biostrings", ...)
| file | a character string specifying the path to the file storing the proteome. | 
| format | a character string specifying the file format used to store the
genome, e.g.  | 
| obj.type | a character string specifying the object stype in which the
genomic sequence shall be represented.
Either as  | 
| ... | additional arguments that are used by
 | 
This function takes a string specifying the path to the proteome file of interest as first argument.
It is possible to read in different proteome file standards such as fasta or genebank.
Either a Biostrings or data.table object.
Hajk-Georg Drost
Other readers: 
read_cds(),
read_genome(),
read_gff(),
read_rna()
Other proteome: 
getProteome(),
getProteomeSet()
This function reads an organism specific Repeat Masker output file.
read_rm(file)read_rm(file)
| file | a character string specifying the path to the file 
storing the Repeat Masker output (e.g. retrieved with  | 
This function takes a string specifying the path to the Repeat Masker output file of interest as first argument.
Hajk-Georg Drost
getRepeatMasker, read_genome, 
read_proteome, read_gff, read_rna
This function reads an organism specific RNA stored in a defined file format.
read_rna(file, format = "fasta", obj.type = "Biostrings", ...)read_rna(file, format = "fasta", obj.type = "Biostrings", ...)
| file | a character string specifying the path to the file storing the RNA. | 
| format | a character string specifying the file format used to store the
genome, e.g.  | 
| obj.type | a character string specifying the object stype in which the
genomic sequence shall be represented.
Either as  | 
| ... | additional arguments that are used by
 | 
This function takes a string specifying the path to the RNA file of interest as first argument. It is possible to read in different proteome file standards such as fasta or genebank.
A data.table storing the gene id in the first column and the corresponding sequence as string in the second column.
Hajk-Georg Drost
Other rna: 
getRNA(),
getRNASet()
Other readers: 
read_cds(),
read_genome(),
read_gff(),
read_proteome()
This function extracts all organism names (scientific names) for which genomes, proteomes, and CDS files are stored on the NCBI refseq server.
refseqOrganisms()refseqOrganisms()
a character vector of html from kingdom URLs
Hajk-Georg Drost
A summary statistics of specific CDS features is returned.
summary_cds(file, organism)summary_cds(file, organism)
| file | file path to a CDS file in  | 
| organism | character string specifying the organism at hand. | 
The summary statistics include:
total_seqs: 
nnn_abs: The total number of NNN's 
(over all chromosomes/scaffolds/contigs) in all coding sequences combined
nnn_perc: The percentage (relative frequency) of NNN's 
(over all chromosomes/scaffolds/contigs) compared to the total number of 
nucleotides of all coding sequences
Hajk-Georg Drost
getCollection, getCDS, read_cds, summary_genome
A summary statistics of specific genome features is generated. These statistics are useful to assess the genome quality of retrieved genome assemblies when performing comparative genomics tasks. This way, users can assess whether or not patterns found based on genome comparisons aren't just a technical artifact of differences in genome assembly quality.
summary_genome(file, organism)summary_genome(file, organism)
| file | file path to a genome assembly file in  | 
| organism | character string specifying the organism at hand. | 
The summary statistics include:
genome_size_mbp: Genome size in mega base pairs
n50_mbp: The N50 contig size of the genome assembly in mega base pairs
n_seqs: The number of chromosomes/scaffolds/contigs of the genome assembly file
n_nnn: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly file
rel_nnn: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of
nucleotides in the genome assembly file
genome_entropy: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies)
n_gc: The total number of GCs
(over all chromosomes or scaffolds or contigs) in the genome assembly file
rel_gc: The (relative frequency) of GCs
(over all chromosomes or scaffolds or contigs) compared to the total number of
nucleotides in the genome assembly file
Hajk-Georg Drost
summary_cds, getCollection, getGenome, read_genome
## Not run: # retrieve genome from NCBI RefSeq Sc <- biomartr::getGenome(db = "refseq", organism = "Saccharomyces cerevisiae") # compute genome assembly summary statistics Sc_genome_summary <- summary_genome(file = Sc, organism = "Saccharomyces cerevisiae") # look at results Sc_genome_summary ## End(Not run)## Not run: # retrieve genome from NCBI RefSeq Sc <- biomartr::getGenome(db = "refseq", organism = "Saccharomyces cerevisiae") # compute genome assembly summary statistics Sc_genome_summary <- summary_genome(file = Sc, organism = "Saccharomyces cerevisiae") # look at results Sc_genome_summary ## End(Not run)