Package 'rentrez'

Title: 'Entrez' in R
Description: Provides an R interface to the NCBI's 'EUtils' API, allowing users to search databases like 'GenBank' <https://www.ncbi.nlm.nih.gov/genbank/> and 'PubMed' <https://pubmed.ncbi.nlm.nih.gov/>, process the results of those searches and pull data into their R sessions.
Authors: David Winter [aut, cre], Scott Chamberlain [ctb], Han Guangchun [ctb]
Maintainer: David Winter <[email protected]>
License: MIT + file LICENSE
Version: 1.2.3
Built: 2024-11-27 03:54:11 UTC
Source: https://github.com/ropensci/rentrez

Help Index


Fetch pubmed ids matching specially formatted citation strings

Description

Fetch pubmed ids matching specially formatted citation strings

Usage

entrez_citmatch(bdata, db = "pubmed", retmode = "xml", config = NULL)

Arguments

bdata

character, containing citation data. Each citation must be represented in a pipe-delimited format: journal_title|year|volume|first_page|author_name|your_key|. The final field, "your_key", is arbitrary and can be used as you see fit. Fields can be left empty, but be sure to keep all 6 pipes.

db

character, the database to search. Defaults to pubmed, the only database currently available

retmode

character, file format to retrieve. Defaults to xml, as per the API documentation, though note the API only returns plain text

config

vector, configuration options passed to httr::GET

Value

A character vector containing PMIDs

See Also

config for available configs

Examples

## Not run: 
ex_cites <- c("proc natl acad sci u s a|1991|88|3248|mann bj|test1|",
              "science|1987|235|182|palmenberg ac|test2|")
entrez_citmatch(ex_cites)

## End(Not run)
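
The pipe-delimited format expected by bdata can also be assembled programmatically. The following is a minimal sketch in base R; make_citation is a hypothetical helper written for illustration, not a function exported by rentrez.

```r
# Hypothetical helper (not part of rentrez): build citation strings in the
# journal_title|year|volume|first_page|author_name|your_key| format.
make_citation <- function(journal = "", year = "", volume = "",
                          first_page = "", author = "", key = "") {
  # The trailing "" makes paste() emit the sixth (final) pipe.
  paste(journal, year, volume, first_page, author, key, "", sep = "|")
}

cites <- make_citation(
  journal    = c("proc natl acad sci u s a", "science"),
  year       = c(1991, 1987),
  volume     = c(88, 235),
  first_page = c(3248, 182),
  author     = c("mann bj", "palmenberg ac"),
  key        = c("test1", "test2")
)
cites
```

The resulting character vector can be passed to entrez_citmatch as bdata. Fields left at their default are emitted as empty strings, so every citation keeps its six pipes.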

List available search fields for a given database

Description

Fetch a list of search fields that can be used with a given database. Fields can be used as part of the term argument to entrez_search

Usage

entrez_db_searchable(db, config = NULL)

Arguments

db

character, name of database to get search field from

config

config vector passed to httr::GET

Value

An eInfoSearch object (subclassed from list) summarizing the search fields available for the database. Can be coerced to a data-frame with as.data.frame. Printing the object shows only the names of each available search field.

See Also

entrez_search

Other einfo: entrez_db_links(), entrez_db_summary(), entrez_dbs(), entrez_info()

Examples

## Not run: 
pmc_fields <- entrez_db_searchable("pmc")
pmc_fields[["AFFL"]]
entrez_search(db="pmc", term="Otago[AFFL]", retmax=0)
entrez_search(db="pmc", term="Auckland[AFFL]", retmax=0)

sra_fields <- entrez_db_searchable("sra")
as.data.frame(sra_fields)

## End(Not run)
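
Search terms that use these fields follow the pattern value[FIELD], optionally joined with boolean operators. As a rough sketch, such terms can be built from a named vector; build_term is a hypothetical helper for illustration only and is not part of rentrez.

```r
# Hypothetical helper (not part of rentrez): turn named values into a
# field-tagged search term, e.g. c(AFFL = "Otago") -> "Otago[AFFL]".
build_term <- function(..., op = "AND") {
  parts <- c(...)
  stopifnot(!is.null(names(parts)), all(nzchar(names(parts))))
  tagged <- paste0(parts, "[", names(parts), "]")
  paste(tagged, collapse = paste0(" ", op, " "))
}

term <- build_term(Organism = "Gastropoda", Gene = "COI")
term
```

The string produced here could then be supplied as the term argument to entrez_search. Which field names are valid depends on the database, so check them with entrez_db_searchable first.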

Retrieve summary information about an NCBI database

Description

Retrieve summary information about an NCBI database

Usage

entrez_db_summary(db, config = NULL)

Arguments

db

character, name of database to summarise

config

config vector passed to httr::GET

Value

Character vector with the following data

DbName Name of database

Description Brief description of the database

Count Number of records contained in the database

MenuName Name in web-interface to EUtils

DbBuild Unique ID for current build of database

LastUpdate Date of most recent update to database

See Also

Other einfo: entrez_db_links(), entrez_db_searchable(), entrez_dbs(), entrez_info()

Examples

## Not run: 
entrez_db_summary("pubmed")

## End(Not run)

List databases available from the NCBI

Description

Retrieves the names of databases available through the EUtils API

Usage

entrez_dbs(config = NULL)

Arguments

config

config vector passed to httr::GET

Value

character vector listing available dbs

See Also

Other einfo: entrez_db_links(), entrez_db_searchable(), entrez_db_summary(), entrez_info()

Examples

## Not run: 
entrez_dbs()

## End(Not run)

Download data from NCBI databases

Description

Pass unique identifiers to an NCBI database and receive data files in a variety of formats. A set of unique identifiers must be specified with either the id argument (which directly specifies the IDs as a numeric or character vector) or a web_history object as returned by entrez_link, entrez_search or entrez_post.

Usage

entrez_fetch(
  db,
  id = NULL,
  web_history = NULL,
  rettype,
  retmode = "",
  parsed = FALSE,
  config = NULL,
  ...
)

Arguments

db

character, name of the database to use

id

vector (numeric or character), unique ID(s) for records in database db. In the case of sequence databases these IDs can take the form of an NCBI accession followed by a version number (e.g. AF123456.1 or AF123456.2).

web_history

a web_history object

rettype

character, format in which to get data (eg, fasta, xml...)

retmode

character, mode in which to receive data, defaults to an empty string (corresponding to the default mode for rettype).

parsed

logical, should entrez_fetch attempt to parse the resulting file? Only works with xml records (including those with rettypes other than "xml") at present

config

vector, httr configuration options passed to httr::GET

...

character, additional terms to add to the request, see NCBI documentation linked to in references for a complete list

Details

The format for returned records is set by the arguments rettype (for a particular format) and retmode (for a general format: JSON, XML, text, etc.). See Table 1 in the linked reference for the set of formats available for each database. In particular, note that sequence databases (nuccore, protein and their relatives) use specific format names (e.g. "native", "ipg") for different flavours of xml.

For the most part, this function returns a character vector containing the fetched records. For XML records (including 'native', 'ipg', 'gbc' sequence records), setting parsed to TRUE will return an XMLInternalDocument.

Value

character string containing the file created

XMLInternalDocument a parsed XML document if parsed=TRUE and rettype is a flavour of XML.

References

https://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_

See Also

config for available 'httr' configs

Examples

## Not run: 
katipo <- "Latrodectus katipo[Organism]"
katipo_search <- entrez_search(db="nuccore", term=katipo)
katipo_seqs <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="fasta")
# the same records as XML
katipo_xml <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="native")

## End(Not run)

Find the number of records that match a given term across all NCBI Entrez databases

Description

Find the number of records that match a given term across all NCBI Entrez databases

Usage

entrez_global_query(term, config = NULL, ...)

Arguments

term

the search term to use

config

vector, configuration options passed to httr::GET

...

additional arguments to add to the query

Value

a named vector with counts for each database

See Also

config for available configs

Examples

## Not run:  
NCBI_data_on_best_butterflies_ever <- entrez_global_query(term="Heliconius")

## End(Not run)

Get information about EUtils databases

Description

Gather information about EUtils generally, or a given EUtils database. Note: The most common use cases for the einfo util are finding the list of search fields available for a given database or the other NCBI databases to which records in a given database might be linked. Both these use cases are implemented in higher-level functions that return just this information (entrez_db_searchable and entrez_db_links respectively). Consequently most users will not have a reason to use this function (though it is exported by rentrez for the sake of completeness).

Usage

entrez_info(db = NULL, config = NULL)

Arguments

db

character database about which to retrieve information (optional)

config

config vector passed on to httr::GET

Value

XMLInternalDocument with information describing either all the databases available in Eutils (if db is not set) or one particular database (set by 'db')

See Also

config for available httr configurations

Other einfo: entrez_db_links(), entrez_db_searchable(), entrez_db_summary(), entrez_dbs()

Examples

## Not run: 
all_the_data <- entrez_info()
XML::xpathSApply(all_the_data, "//DbName", XML::xmlValue)
entrez_dbs()

## End(Not run)

Post IDs to Eutils for later use

Description

Post IDs to Eutils for later use

Usage

entrez_post(db, id = NULL, web_history = NULL, config = NULL, ...)

Arguments

db

character Name of the database from which the IDs were taken

id

vector with unique ID(s) for records in database db.

web_history

A web_history object. Can be used to add additional identifiers to an existing web environment on the NCBI

config

vector of configuration options passed to httr::GET

...

character Additional terms to add to the request, see NCBI documentation linked to in references for a complete list

References

https://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_

See Also

config for available httr configurations

Examples

## Not run:   
so_many_snails <- entrez_search(db="nuccore", 
                      "Gastropoda[Organism] AND COI[Gene]", retmax=200)
upload <- entrez_post(db="nuccore", id=so_many_snails$ids)
first <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
                      retmax=10)
second <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
                       retstart=10, retmax=10)

## End(Not run)
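
The paging pattern in the example above (retstart plus retmax) generalizes to any number of records. The sketch below computes the retstart offsets in base R; batch_starts is a hypothetical helper, not part of rentrez, and the fetch loop is shown only as comments because it needs network access and a web_history object.

```r
# Hypothetical helper (not part of rentrez): zero-based offsets for paging
# through n_records in chunks of batch_size.
batch_starts <- function(n_records, batch_size) {
  stopifnot(n_records >= 0, batch_size > 0)
  if (n_records == 0) return(numeric(0))
  seq(0, n_records - 1, by = batch_size)
}

starts <- batch_starts(200, 50)
starts

## Not run (requires network access and the `upload` object from above):
# for (start in starts) {
#   chunk <- entrez_fetch(db = "nuccore", rettype = "fasta",
#                         web_history = upload,
#                         retstart = start, retmax = 50)
#   cat(chunk, file = "snails.fasta", append = TRUE)
# }
```

Here "snails.fasta" is an arbitrary output file name chosen for the illustration.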

Get summaries of objects in NCBI datasets from a unique ID

Description

The NCBI offers two distinct formats for summary documents. Version 1.0 is a relatively limited summary of a database record based on a shared Document Type Definition. Version 1.0 summaries are only available as XML and are not available for some newer databases. Version 2.0 summaries generally contain more information about a given record, but each database has its own distinct format. 2.0 summaries are available for records in all databases and as JSON and XML files. As of version 0.4, rentrez fetches version 2.0 summaries by default and uses JSON as the exchange format (as JSON objects can be more easily converted into native R types). Existing scripts which relied on the structure and naming of the "Version 1.0" summary files can be updated by setting the new version argument to "1.0".

Usage

entrez_summary(
  db,
  id = NULL,
  web_history = NULL,
  version = c("2.0", "1.0"),
  always_return_list = FALSE,
  retmode = NULL,
  config = NULL,
  ...
)

Arguments

db

character Name of the database to search for

id

vector with unique ID(s) for records in database db. In the case of sequence databases these IDs can take the form of an NCBI accession followed by a version number (e.g. AF123456.1 or AF123456.2)

web_history

A web_history object

version

either "1.0" or "2.0"; see above for a description

always_return_list

logical, return a list of esummary objects even when only one ID is provided (see Details for a note about this option)

retmode

either "xml" or "json". By default, xml will be used for version 1.0 records, json for version 2.0.

config

vector, configuration options passed to httr::GET

...

character Additional terms to add to the request, see NCBI documentation linked to in references for a complete list

Details

By default, entrez_summary returns a single record when only one ID is passed and a list of such records when multiple IDs are passed. This can lead to unexpected behaviour when the results of a variable number of IDs (perhaps the result of entrez_search) are processed with an apply family function or in a for-loop. If you use this function as part of a function or script that generates a variably-sized vector of IDs setting always_return_list to TRUE will avoid these problems. The function extract_from_esummary is provided for the specific case of extracting named elements from a list of esummary objects, and is designed to work on single objects as well as lists.
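
The pitfall described above can be reproduced without contacting the NCBI. The sketch below uses plain lists as stand-ins for esummary records (mock data, not real NCBI output), and get_titles is a hypothetical function written for the illustration.

```r
# Plain lists standing in for esummary records (mock data).
one_record   <- list(uid = "1", title = "First title")
many_records <- list(list(uid = "1", title = "First title"),
                     list(uid = "2", title = "Second title"))

# A typical apply-family extraction written with a list of records in mind.
get_titles <- function(recs) {
  vapply(recs, function(r) r$title, character(1))
}

titles_many <- get_titles(many_records)

# With a single record the loop visits the record's fields ("1", "First
# title") rather than the record itself, so `$title` fails.
res_single <- tryCatch(get_titles(one_record), error = function(e) e)
inherits(res_single, "error")

# Wrapping the single record in a list restores the expected shape, which is
# the effect always_return_list = TRUE is described as having.
titles_one <- get_titles(list(one_record))
```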

Value

A list of esummary records (if multiple IDs are passed or always_return_list is TRUE) or a single record (if one ID is passed and always_return_list is FALSE).

file XMLInternalDocument xml file containing the entire record returned by the NCBI.

References

https://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESummary_

See Also

config for available configs

extract_from_esummary which can be used to extract elements from a list of esummary records

Examples

## Not run: 
 pop_ids = c("307082412", "307075396", "307075338", "307075274")
 pop_summ <- entrez_summary(db="popset", id=pop_ids)
 extract_from_esummary(pop_summ, "title")
 
 # clinvar example
 res <- entrez_search(db = "clinvar", term = "BRCA1", retmax=10)
 cv <- entrez_summary(db="clinvar", id=res$ids)
 cv
 extract_from_esummary(cv, "title", simplify=FALSE)
 extract_from_esummary(cv, "trait_set")[1:2] 
 extract_from_esummary(cv, "gene_sort") 

## End(Not run)

Extract elements from a list of esummary records

Description

Extract elements from a list of esummary records

Usage

extract_from_esummary(esummaries, elements, simplify = TRUE)

Arguments

esummaries

Either an esummary or an esummary_list (as returned by entrez_summary).

elements

the names of the elements to extract

simplify

logical, if possible return a vector

Value

List or vector containing requested elements

See Also

entrez_summary for examples of this function in action.


Extract URLs from an elink object

Description

Extract URLs from an elink object

Usage

linkout_urls(elink)

Arguments

elink

elink object (returned by entrez_link) containing Urls

Value

list of character vectors, one per ID, each containing the URLs for that ID.

See Also

entrez_link


Summarize an XML record from pubmed.

Description

Note: this function assumes all records are of the type "PubmedArticle" and will return an empty record for any other type (including books).

Usage

parse_pubmed_xml(record)

Arguments

record

Either an XMLInternalDocument or a character vector: the record to be parsed (expected to come from entrez_fetch)

Value

Either a single pubmed_record object, or a list of several

Examples

hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
hox_rel <- entrez_link(db="pubmed", dbfrom="pubmed", id=hox_paper$ids)
recs <- entrez_fetch(db="pubmed", 
                       id=hox_rel$links$pubmed_pubmed[1:3], 
                       rettype="xml")
parse_pubmed_xml(recs)

rentrez

Description

rentrez provides functions to search for, discover and download data from the NCBI's databases using their EUtils function.

Details

Users are expected to know a little bit about the EUtils API, which is well documented: https://www.ncbi.nlm.nih.gov/books/NBK25500/

The NCBI will ban IPs that don't use EUtils within their user guidelines. In particular:

1. Don't send more than three requests per second (rentrez enforces this limit).
2. If you plan on sending a sequence of more than ~100 requests, do so outside of peak times for the US.
3. For large requests, use the web history method (see the examples for entrez_search, or use entrez_post to upload IDs).


Set the ENTREZ_KEY variable to be used by all rentrez functions

Description

The NCBI allows users to access more records (10 per second) if they register for and use an API key. This function allows users to set this key for all calls to rentrez functions during a particular R session. See the vignette section "Using API keys" for a detailed description.

Usage

set_entrez_key(key)

Arguments

key

character. Value to set ENTREZ_KEY to (i.e. your API key).

Value

A logical of length one: TRUE if the value was set, FALSE if not. The value is returned inside invisible(), i.e. it is not printed to screen when the function is called.
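
As a small sketch, a script can check whether a key is already available before deciding how fast to send requests. This assumes ENTREZ_KEY is stored as an environment variable readable with Sys.getenv; has_entrez_key is a hypothetical helper, not part of rentrez, and the key shown is a placeholder.

```r
# Hypothetical helper (not part of rentrez). Assumes ENTREZ_KEY is an
# environment variable, as the name of set_entrez_key() suggests.
has_entrez_key <- function() {
  nzchar(Sys.getenv("ENTREZ_KEY"))
}

## Not run (requires rentrez and a real key; "my_api_key" is a placeholder):
# set_entrez_key("my_api_key")

if (has_entrez_key()) {
  message("API key found.")
} else {
  message("No API key set.")
}
```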