Title: | Phylogenetic Reconstruction and Time-dating |
---|---|
Description: | The phruta R package is designed to simplify the basic phylogenetic pipeline. Specifically, all code is run within the same program and data from intermediate steps are saved in independent folders. Furthermore, all code is run within the same environment which increases the reproducibility of your analysis. phruta retrieves gene sequences, combines newly downloaded and local gene sequences, and performs sequence alignments. |
Authors: | Cristian Roman Palacios [aut, cre] , Anna Krystalli [rev], Rayna Harris [rev], Frederick Boehm [rev] |
Maintainer: | Cristian Roman Palacios <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3 |
Built: | 2024-12-27 06:16:40 UTC |
Source: | https://github.com/ropensci/phruta |
Retrieve accession numbers and titles for complex searches conducted in genbank.
Note that this function depends on acc.retrieve
as it actually uses expand.grid
to find all the relevant organism/gene combinations in genbank.
acc.table.retrieve( clades = NULL, species = NULL, genes = NULL, speciesLevel = NULL, npar = 2, nSearchesBatch = 499 )
acc.table.retrieve( clades = NULL, species = NULL, genes = NULL, speciesLevel = NULL, npar = 2, nSearchesBatch = 499 )
clades |
A vector of clade names (character). Note that this can either be the name of a clade (e.g. Apis) or the code in NCBI (e.g. txid7459). |
species |
A vector of species names (logical). Note that this can either be the name of a clade (e.g. Apis) or the code in NCBI (e.g. txid7459). |
genes |
A vector of gene names (character; optional). |
speciesLevel |
Whether the result should be a species-level dataset (logical). |
npar |
Number of parallel searches (the default is probably the best option). |
nSearchesBatch |
Number of searches per batch |
This function returns an object of class data.frame
that
includes the following columns. First, Species
with the
species name listed in GenBank. Second, Ti
including the title
of the accession in GenBank. Third, Acc
listing the accession
number. Fourth, gene
including the name of the target gene.
## Not run: acc.table.retrieve( clades = c('Felis', 'Vulpes', 'Phoca'), species = 'Manis_pentadactyla' , genes = c("A2AB","ADORA3","ADRB2","APOB", "APP","ATP7","BCHE","BDNF", "BMI1","BRCA1","BRCA2","CNR1", "COI","CREM","CYTB","DMP1", "EDG1","ENAM","FBN1","GHR", "IRBP","ND1","ND2","PLCB4", "PNOC","RAG1a","RAG1b","RAG2", "TTN","TYR1","VWF"), speciesLevel = FALSE ) ## End(Not run)
## Not run: acc.table.retrieve( clades = c('Felis', 'Vulpes', 'Phoca'), species = 'Manis_pentadactyla' , genes = c("A2AB","ADORA3","ADRB2","APOB", "APP","ATP7","BCHE","BDNF", "BMI1","BRCA1","BRCA2","CNR1", "COI","CREM","CYTB","DMP1", "EDG1","ENAM","FBN1","GHR", "IRBP","ND1","ND2","PLCB4", "PNOC","RAG1a","RAG1b","RAG2", "TTN","TYR1","VWF"), speciesLevel = FALSE ) ## End(Not run)
This function retrieves the gene names for given organism in genbank. This function is
useful to explore the genes that are widely sampled for a given taxon
in genbank. However, note that the performance of gene.sampling.retrieve
is entirely dependant on the quality of submissions to genbank. For instance, if
most of the sequences in genbank (nucleotide) don't have information on the
sequenced region, this function won't provide reliable estimates of the sampling
frequency of each gene in the database. We recommend using this function with caution,
and only when there is no additional information on the genes that are widely sampled
for the target group(s) of interest.
gene.sampling.retrieve( organism, speciesSampling = TRUE, npar = 2, nSearchesBatch = 499 )
gene.sampling.retrieve( organism, speciesSampling = TRUE, npar = 2, nSearchesBatch = 499 )
organism |
A vector of taxonomic groups (character). |
speciesSampling |
Whether the results should be at the species-level (logical). |
npar |
Number of simultaneous searches (character; optional). The default 2 is generally fast enough and does not exceed the maximum number of connections to genbank. |
nSearchesBatch |
Number of searches per batch |
This function returns a data.frame
that comprises the following
columns. First, Gene
, including the name of the relevant
gene sampled in the target taxonomic groups. Second,
Sampled in N species
includes the number of species where the
gene is sampled among the target taxa. Third, PercentOfSampledSpecies
indicates the percentage of species where the gene is sampled (assuming
GeneBank's taxonomic backbone).
## Not run: test.spp <- gene.sampling.retrieve(organism = "Puma", speciesSampling = TRUE) test.pop <- gene.sampling.retrieve(organism = "Puma", speciesSampling = FALSE) ## End(Not run)
## Not run: test.spp <- gene.sampling.retrieve(organism = "Puma", speciesSampling = TRUE) test.pop <- gene.sampling.retrieve(organism = "Puma", speciesSampling = FALSE) ## End(Not run)
This function adds sequences located in a particular folder
(default is "0.AdditionalSequences"
) to
fasta files in another folder (default is "0.Sequences"
).
Files must be in FASTA format and names
of the files should perfectly match the ones in the previously
downloaded folder (e.g. "0.Sequences"
).
This function creates a third new folder "0.0.OriginalDownloaded"
containing the originally
downloaded sequences. The sequences in the "0.Sequences"
folder get
replaced with the combined ones.
Note that new sequences can be added even for just a fraction of the
originally downloaded genes.
sq.add(folderDownloaded = "0.Sequences", folderNew = "0.AdditionalSequences")
sq.add(folderDownloaded = "0.Sequences", folderNew = "0.AdditionalSequences")
folderDownloaded |
Name of the folder with downloaded sequences. |
folderNew |
Name of the folder with local sequences. |
None
## Not run: sq.add(folderDownloaded = "0.Sequences", folderNew = "0.AdditionalSequences") ## End(Not run)
## Not run: sq.add(folderDownloaded = "0.Sequences", folderNew = "0.AdditionalSequences") ## End(Not run)
Perform multiple sequence alignment on all the fasta files
saved in a given folder. Alignment is conducted using
"BiocManager::install("DECIPHER")"
. Note that this function first
optimizes the direction of the sequences, then aligns
using "BDECIPHER"
, and finally masks the resulting
alignment (optionally but does it per default). The
masking step includes removing common gaps across all the species and removing
highly ambiguous positions. The resulting aligned sequences
are stored to a new folder
"2.Alignments"
.'
sq.aln( folder = "1.CuratedSequences", FilePatterns = "renamed", sqs.object = NULL, mask = TRUE, maxFractionGapsSpecies = 0.01, ... )
sq.aln( folder = "1.CuratedSequences", FilePatterns = "renamed", sqs.object = NULL, mask = TRUE, maxFractionGapsSpecies = 0.01, ... )
folder |
Name of the folder where the sequences to align are stored (character). |
FilePatterns |
A string that is common to all the target
files in the relevant folder (character). Note that
this argument can be set to |
sqs.object |
A list of sequences generated from sq.curate. Only use if you're not interested in download sequences locally. |
mask |
Removes ambiguous sites (Logical, TRUE or FALSE). |
maxFractionGapsSpecies |
Maximum fraction of gaps per species (when masked) |
... |
Arguments passed to |
This function will return an object of class list
including the
original and renamed sequence alignments. Optionally, this object will
also include the masked alignments.
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") ## End(Not run)
After downloading sequences from genbank, this function
curates sequences based on taxonomic
information. Note that this function provides two summary datasets.
First, the accession numbers.
Second, the taxonomic information for each species in the database.
The taxonomy strictly follows
the gbif taxonomic backbone. The resulting files are saved
to "1.CuratedSequences"
. The
resulting files also have the most recent curated taxonomy
following the gbif (or selected database) taxonomic backbone.
sq.curate( filterTaxonomicCriteria = NULL, mergeGeneFiles = NULL, database = "gbif", kingdom = NULL, folder = "0.Sequences", sqs.object = NULL, removeOutliers = TRUE, minSeqs = 5, threshold = 0.05, ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species") )
sq.curate( filterTaxonomicCriteria = NULL, mergeGeneFiles = NULL, database = "gbif", kingdom = NULL, folder = "0.Sequences", sqs.object = NULL, removeOutliers = TRUE, minSeqs = 5, threshold = 0.05, ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species") )
filterTaxonomicCriteria |
A single string of terms (delimited using "|") listing all the strings that could be used to identify the species that should be in the dataset (character). |
mergeGeneFiles |
A named list, with each element being a character vector
indicating the names of the files in |
database |
A name of a database with taxonomic information. Although 'gbif' is faster, it only has information for animals and plants. Other databases follow taxize::classification. |
kingdom |
Optional and only used when database='gbif'. Two possible options: "animals" or "plants." |
folder |
The name of the folder where the original sequences are located (character). |
sqs.object |
A list of sequences generated from |
removeOutliers |
Whether |
minSeqs |
minimum number of sequences per locus |
threshold |
Relative to |
ranks |
The taxonomic ranks used to examine the taxonomy of the species
in the |
This function will return an object of class list
with the
following elements. First, the curated sequences with original names.
Second, the curated sequences with species-level names. Third,
the accession numbers table. Fourth, a summary of taxonomic
information for all the species sampled in the files.
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", database = "gbif", kingdom = "animals", folder = "0.Sequences" ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", database = "gbif", kingdom = "animals", folder = "0.Sequences" ) ## End(Not run)
This function runs partitionfinder v1 within phruta. For now, all analyses are based on genes. Please note that you need at least two gene regions to run partitionfinder.
sq.partitionfinderv1( folderAlignments = "2.Alignments", FilePatterns = "Masked", folderPartitionFinder = "2.1.PartitionFinderv1", models = "all", run = TRUE )
sq.partitionfinderv1( folderAlignments = "2.Alignments", FilePatterns = "Masked", folderPartitionFinder = "2.1.PartitionFinderv1", models = "all", run = TRUE )
folderAlignments |
Name of the folder where the sequences to align are stored (character). |
FilePatterns |
A string that is common to all the target files in the
relevant folder (character). Note that
this argument can be set to |
folderPartitionFinder |
Name of the new folder where the output files are stored (string). |
models |
Models to run in partitionfinder (string). |
run |
Run partitionfinder? |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") sq.partitionfinderv1( folderAlignments = "2.Alignments", FilePatterns = "Masked", models = "all" ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") sq.partitionfinderv1( folderAlignments = "2.Alignments", FilePatterns = "Masked", models = "all" ) ## End(Not run)
Downloads sequences from genbank (nucleotide database) for particular taxa
and genes into a folder called "0.Sequences"
.
sq.retrieve.direct( clades = NULL, species = NULL, genes = NULL, db = "itis", maxseqs = 1, maxlength = 5000 )
sq.retrieve.direct( clades = NULL, species = NULL, genes = NULL, db = "itis", maxseqs = 1, maxlength = 5000 )
clades |
A vector listing taxonomic groups of interest (character). |
species |
A vector listing additional species interest (character). This argument can be used to define additional target species in the ingroup or species to be sampled in the outgroup (character). |
genes |
A vector listing gene names of interest (character). |
db |
Follows |
maxseqs |
Maximum number of sequences to retrieve per search (taxa + gene) (numeric). |
maxlength |
Maximum length of the gene sequence (numeric). |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) ## End(Not run)
Downloads sequences from genbank (nucleotide database) for particular taxa
and genes into a folder called "0.Sequences"
.
sq.retrieve.indirect(acc.table, download.sqs = FALSE)
sq.retrieve.indirect(acc.table, download.sqs = FALSE)
acc.table |
An accession table, ideally generated using |
download.sqs |
Logical indicating whether sequences should be downloaded locally or returned as a list. |
None
## Not run: acc.table.retrieve( clades = c('Felis', 'Vulpes', 'Phoca'), species = 'Manis_pentadactyla' , genes = c("A2AB","ADORA3","ADRB2","APOB", "APP","ATP7","BCHE","BDNF", "BMI1","BRCA1","BRCA2","CNR1", "COI","CREM","CYTB","DMP1", "EDG1","ENAM","FBN1","GHR", "IRBP","ND1","ND2","PLCB4", "PNOC","RAG1a","RAG1b","RAG2", "TTN","TYR1","VWF"), speciesLevel=TRUE ) sq.retrieve.indirect(test) ## End(Not run)
## Not run: acc.table.retrieve( clades = c('Felis', 'Vulpes', 'Phoca'), species = 'Manis_pentadactyla' , genes = c("A2AB","ADORA3","ADRB2","APOB", "APP","ATP7","BCHE","BDNF", "BMI1","BRCA1","BRCA2","CNR1", "COI","CREM","CYTB","DMP1", "EDG1","ENAM","FBN1","GHR", "IRBP","ND1","ND2","PLCB4", "PNOC","RAG1a","RAG1b","RAG2", "TTN","TYR1","VWF"), speciesLevel=TRUE ) sq.retrieve.indirect(test) ## End(Not run)
Performs tree inference under "RAxML"
for aligned fasta sequences in
a given folder (default is "2.Alignments"
).
tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((ingroup), outgroup);", outgroup = NULL )
tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((ingroup), outgroup);", outgroup = NULL )
taxonomy_folder |
Name of the folder where the 1.Taxonomy file is stored. |
targetColumns |
Where to find |
Topology |
A string summarizing the desired topological constraint in newick format. |
outgroup |
Optional and only required when the topology argument is not "((ingroup), outgroup);". |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((ingroup), outgroup);", outgroup = "Manis_pentadactyla" ) tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((Felis), (Phoca));" ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((ingroup), outgroup);", outgroup = "Manis_pentadactyla" ) tree.constraint( taxonomy_folder = "1.CuratedSequences", targetColumns = c("kingdom", "phylum", "class", "order", "family", "genus", "species_names"), Topology = "((Felis), (Phoca));" ) ## End(Not run)
Performs tree dating under "treePL"
or "PATHd-8"
based on secondary calibrations.
Note that "treePL"
or "PATHd-8"
must be installed in your PATH.
How to install "PATHd-8"
in mac
"https://gist.github.com/cromanpa94/a43bc710a17220f71d796d6590ea7fe4"
and "treePL"
can be installed using homebrew
(brew install brewsci/bio/treepl). Thanks to Brian O'Meara and Jonathan
Chang, respectively.
tree.dating( taxonomyFolder = "1.CuratedSequences", phylogenyFolder = "3.Phylogeny", ... )
tree.dating( taxonomyFolder = "1.CuratedSequences", phylogenyFolder = "3.Phylogeny", ... )
taxonomyFolder |
Name of the folder where |
phylogenyFolder |
Name of the folder where
|
... |
Arguments passed to |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) tree.dating( taxonomyFolder = "1.CuratedSequences", phylogenyFolder = "3.Phylogeny", scale = "treePL" ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) tree.dating( taxonomyFolder = "1.CuratedSequences", phylogenyFolder = "3.Phylogeny", scale = "treePL" ) ## End(Not run)
Performs tree inference under "RAxML"
for aligned fasta sequences in
a given folder (default is "2.Alignments"
). Note that you need at
least two gene regions to run a partitioned analysis.
tree.raxml( folder = "2.Alignments", FilePatterns = "Masked_", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup, partitioned = FALSE, ... )
tree.raxml( folder = "2.Alignments", FilePatterns = "Masked_", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup, partitioned = FALSE, ... )
folder |
Name of the folder where the sequences to align are stored (character). |
FilePatterns |
A string that is common to all the target files
in the relevant folder (character). Note that
this argument can be set to |
raxml_exec |
Where to find |
Bootstrap |
Number of bootstrap replicates (numeric). |
outgroup |
A single string of comma-separated tip labels to be
used as outgroup in |
partitioned |
Whether analyses should be partitioned by gene (Logical). |
... |
Arguments passed to |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) ## End(Not run)
Implements the RogueNaRok algorithm for rogue taxon identification within phruta
tree.roguetaxa(folder = "3.Phylogeny", ...)
tree.roguetaxa(folder = "3.Phylogeny", ...)
folder |
Name of the folder where the sequences to align are stored (character). |
... |
Arguments passed to |
None
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) tree.roguetaxa(folder = "3.Phylogeny") ## End(Not run)
## Not run: sq.retrieve.direct( clades = c("Felis", "Vulpes", "Phoca"), species = "Manis_pentadactyla", genes = c("ADORA3", "CYTB") ) sq.curate( filterTaxonomicCriteria = "Felis|Vulpes|Phoca|Manis", kingdom = "animals", folder = "0.Sequences" ) sq.aln(folder = "1.CuratedSequences") tree.raxml( folder = "2.Alignments", FilePatterns = "Masked", raxml_exec = "raxmlHPC", Bootstrap = 100, outgroup = "Manis_pentadactyla" ) tree.roguetaxa(folder = "3.Phylogeny") ## End(Not run)