| Title: | Find R Packages Matching Either Descriptions or Other R Packages |
|---|---|
| Description: | Find R packages from 'CRAN', 'rOpenSci', or 'Bioconductor' corpora. Packages can be matched to general text descriptions, to names of installed packages, or to local paths to entire source repositories. The package is used to list the most similar packages for each new submission to the 'rOpenSci' software peer-review program <doi:10.5281/zenodo.18885936>. |
| Authors: | Mark Padgham [aut, cre] (ORCID: <https://orcid.org/0000-0003-2172-5265>), Davis Vaughan [ctb] |
| Maintainer: | Mark Padgham <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.5.2.089 |
| Built: | 2026-04-02 11:46:32 UTC |
| Source: | https://github.com/ropensci-review-tools/pkgmatch |
This function generates a selection of test data for the "cran" corpus, to allow functions to be run offline, without having to download the large datasets otherwise required for the package to function.
Note that these data are randomly generated, and results will be generally meaningless. They are generated solely to demonstrate how the package functions, and are not intended to derive meaningful outputs.
generate_pkgmatch_example_data()
(Invisibly) The path to the temporary directory containing the package data.
Other utils:
head.pkgmatch(),
pkgmatch_browse(),
pkgmatch_load_data(),
pkgmatch_update_cache(),
print.pkgmatch()
    generate_pkgmatch_example_data ()
    input <- "curl" # Name of a single installed package
    pkgmatch_similar_pkgs (input, corpus = "cran")
Head method for 'pkgmatch' objects
    ## S3 method for class 'pkgmatch'
    head(x, n = 5L, ...)

| Argument | Description |
|---|---|
| x | Object for which head is to be printed |
| n | Number of rows of full |
| ... | Not used |

A (usually) smaller version of x, with all columns displayed.
Other utils:
generate_pkgmatch_example_data(),
pkgmatch_browse(),
pkgmatch_load_data(),
pkgmatch_update_cache(),
print.pkgmatch()
    ## Not run:
    input <- "Download open spatial data from NASA"
    p <- pkgmatch_similar_pkgs (input)
    p # Default print method, lists 5 best matching packages
    head (p) # Shows first 5 rows of full `data.frame` object
    ## End(Not run)
BM25 values match single inputs to document corpora by weighting terms by their inverse frequencies, so that relatively rare words contribute more to match scores than common words. For each input, the BM25 value is the sum of relative frequencies of each term in the input multiplied by the Inverse Document Frequency (IDF) of that term in the entire corpus. See the Wikipedia page at https://en.wikipedia.org/wiki/Okapi_BM25 for further details.
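The weighting described above can be sketched in a few lines of base R. This is a minimal illustration of the standard Okapi BM25 formula and not the code used inside the package: the toy corpus, the `tokenize()` and `bm25_scores()` helpers, and the parameter values `k1 = 1.2` and `b = 0.75` are all assumptions made for this example.

```r
# Toy corpus: package names and texts are invented for illustration only.
docs <- c (
    pkgA = "download spatial data from remote servers",
    pkgB = "parse and plot spatial raster data",
    pkgC = "fit linear models to data"
)

# Hypothetical tokenizer: lower-case, split on non-alphanumerics, and drop
# tokens shorter than `minchar` characters.
tokenize <- function (x, minchar = 3L) {
    tokens <- strsplit (tolower (x), "[^a-z0-9]+") [[1]]
    tokens [nchar (tokens) >= minchar]
}

# Standard Okapi BM25 with conventional (assumed) parameters k1 and b.
bm25_scores <- function (input, docs, k1 = 1.2, b = 0.75) {
    doc_tokens <- lapply (docs, tokenize)
    doc_len <- lengths (doc_tokens)
    avg_len <- mean (doc_len)
    n_docs <- length (docs)
    query <- unique (tokenize (input))

    scores <- vapply (seq_along (doc_tokens), function (i) {
        total <- 0
        for (term in query) {
            # Number of documents containing the term:
            n_t <- sum (vapply (
                doc_tokens, function (d) term %in% d, logical (1L)
            ))
            # Rare terms receive larger IDF weights than common ones:
            idf <- log ((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf <- sum (doc_tokens [[i]] == term)
            total <- total + idf * tf * (k1 + 1) /
                (tf + k1 * (1 - b + b * doc_len [i] / avg_len))
        }
        total
    }, numeric (1L))

    out <- data.frame (package = names (docs), bm25 = scores)
    out [order (-out$bm25), ]
}

res <- bm25_scores ("download spatial data", docs)
res
```

In this toy corpus "download" occurs in one document only, so it contributes more to the score than "data", which occurs in all three.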
pkgmatch_bm25(input, txt = NULL, idfs = NULL, corpus = NULL, minchar = 3L)

| Argument | Description |
|---|---|
| input | A single character string to match against the second parameter of all input documents. |
| txt | An optional list of input documents. If not specified, data will be loaded as specified by the |
| idfs | Optional list of Inverse Document Frequency weightings generated by the internal |
| corpus | If |
| minchar | Minimal number of characters; tokens with fewer than this number are discarded. |

A data.frame of package names and 'BM25' measures against text
from whole packages both with and without function descriptions.
Other bm25:
pkgmatch_bm25_fn_calls()
    # The following function simulates remote data in a temporary directory, to
    # enable package usage without downloading. Do not run for normal usage.
    generate_pkgmatch_example_data ()
    input <- "curl" # Name of a single installed package
    pkgmatch_bm25 (input, corpus = "cran")
    # Or pre-load document-frequency weightings and pass those:
    idfs <- pkgmatch_load_data ("idfs", corpus = "cran", fns = FALSE)
    # Those have token frequencies for both "full" text, and for descriptions
    # only "descs_only":
    pkgmatch_bm25 (input, corpus = "cran", idfs = idfs$full)
    pkgmatch_bm25 (input, corpus = "cran", idfs = idfs$descs_only)
See ?pkgmatch_bm25 for details of BM25 ranks. This function
calculates "BM25" ranks from function-call frequencies between a local R
package and all packages in the specified corpus. Values are thus higher for
packages with similar patterns of function calls, weighted by inverse
frequencies, so functions called infrequently across the entire corpus
contribute more than common functions.
Note that the results of this function are entirely different from those of
pkgmatch_bm25 with corpus = "ropensci-fns" or corpus = "bioc-fns". The latter returns BM25 values from text descriptions of all
functions in all rOpenSci or Bioconductor packages, whereas this function
returns BM25 values based on frequencies of function calls within packages.
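The idea of weighting function calls by their inverse frequencies can be illustrated with a small base R sketch. It is a simplified IDF-weighted relative-frequency score under stated assumptions, not the package's own calculation: all call counts and package names are invented, and the output column is named "score" because it omits the document-length normalisation of full BM25.

```r
# Hypothetical call counts for three corpus packages (invented values).
corpus_calls <- list (
    pkgA = c ("base::paste" = 10, "curl::curl_fetch_memory" = 4),
    pkgB = c ("base::paste" = 7, "stats::lm" = 3),
    pkgC = c ("base::paste" = 5, "curl::curl_fetch_memory" = 1, "stats::lm" = 1)
)
# Hypothetical call counts for the local package being matched:
local_calls <- c ("base::paste" = 8, "curl::curl_fetch_memory" = 6)

# Inverse document frequency of each function across the corpus:
all_fns <- unique (unlist (lapply (corpus_calls, names)))
n_pkgs <- length (corpus_calls)
n_with_fn <- vapply (all_fns, function (f) {
    sum (vapply (corpus_calls, function (p) f %in% names (p), logical (1L)))
}, integer (1L))
idf <- log ((n_pkgs - n_with_fn + 0.5) / (n_with_fn + 0.5) + 1)

# Score each corpus package by the IDF-weighted relative frequencies of the
# functions which it shares with the local package:
scores <- vapply (corpus_calls, function (p) {
    shared <- intersect (names (local_calls), names (p))
    rel_freq <- p [shared] / sum (p)
    sum (idf [shared] * rel_freq)
}, numeric (1L))

res <- data.frame (package = names (scores), score = unname (scores))
res <- res [order (-res$score), ]
res
```

Here "base::paste" is called by every package and so carries little weight, while the less common "curl::curl_fetch_memory" separates the packages that call it from the one that does not.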
pkgmatch_bm25_fn_calls(path, corpus = NULL)

| Argument | Description |
|---|---|
| path | Local path to source code of an R package. |
| corpus | One of "ropensci" or "cran" |

A data.frame of two columns:
"package" naming the package from the specified corpus; and
"bm25" with the "BM25" index value for the nominated packages, where high values indicate greater overlap in term frequencies.
Other bm25:
pkgmatch_bm25()
    ## Not run:
    u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
    path <- file.path (tempdir (), basename (u))
    download.file (u, destfile = path)
    bm25 <- pkgmatch_bm25_fn_calls (path)
    ## End(Not run)
Open web pages for pkgmatch results
pkgmatch_browse(p, n = NULL)

| Argument | Description |
|---|---|
| p | A |
| n | Number of top-matching entries which should be opened. Defaults to the value passed to the main functions. |

(Invisibly) A named vector of integers, with 0 for all pages able to be successfully opened, and 1 otherwise.
Other utils:
generate_pkgmatch_example_data(),
head.pkgmatch(),
pkgmatch_load_data(),
pkgmatch_update_cache(),
print.pkgmatch()
    ## Not run:
    input <- "genomics and transcriptomics sequence data"
    p <- pkgmatch_similar_pkgs (input)
    pkgmatch_browse (p) # Open main package pages on rOpenSci
    p <- pkgmatch_similar_pkgs (input, corpus = "cran")
    pkgmatch_browse (p) # Open main package pages on CRAN
    p <- pkgmatch_similar_fns (input)
    pkgmatch_browse (p) # Open pages for best-matching rOpenSci functions
    ## End(Not run)
Load pre-computed data for a specified corpus. Data types are:
"idfs" for Inverse Document Frequency weightings;
"functions" for frequency tables for text descriptions of function calls; or
"calls" for frequency tables for actual function calls.
This function is called within the main pkgmatch_similar_pkgs function to load required data there, and should not generally need to be explicitly called.
pkgmatch_load_data(what = "idfs", corpus = "ropensci", fns = FALSE, raw = FALSE)

| Argument | Description |
|---|---|
| what | One of the three data types described above: "idfs", "functions", or "calls". |
| corpus | Must be specified as one of "ropensci", "cran", or "bioc" (for Bioconductor). If |
| fns | If |
| raw | Only has effect of |

The loaded data.
Other utils:
generate_pkgmatch_example_data(),
head.pkgmatch(),
pkgmatch_browse(),
pkgmatch_update_cache(),
print.pkgmatch()
    ## Not run:
    idfs <- pkgmatch_load_data ("idfs")
    idfs_fns <- pkgmatch_load_data ("idfs", fns = TRUE)
    ## End(Not run)
Function matching is only available for functions from the corpora of rOpenSci or Bioconductor packages, and not for CRAN packages.
pkgmatch_similar_fns(input, corpus = "ropensci", n = 5L, browse = FALSE)

| Argument | Description |
|---|---|
| input | A text string. |
| corpus | One of "ropensci" or "bioc" (for Bioconductor). It is not possible to match functions against CRAN packages. |
| n | When the result of this function is printed to screen, the top |
| browse | If |

A modified data.frame object of class "pkgmatch". The data.frame
has 3 columns:
"pkg_fn" with the name of the function in the form "package::function";
"simil" with a similarity score between 0 and 1; and
"rank" as an integer index, with the highest rank of 1 as the first row.
The return object has a default print method which prints the names only
of the first 5 best matching functions; see ?print.pkgmatch for details.
Other main:
pkgmatch_similar_pkgs()
    ## Not run:
    input <- "Process raster satellite images"
    p <- pkgmatch_similar_fns (input)
    p # Default print method, lists 5 best matching functions
    head (p) # Shows first 5 rows of full `data.frame` object
    ## End(Not run)
This function accepts as input either a text description, the
name of a locally-installed package, or a path to a local directory containing an R package.
It ranks all R packages within the specified corpus in terms of how well they
match that input. The "corpus" argument can specify either rOpenSci's package suite,
CRAN, or
Bioconductor.
Ranks are derived from "Best Match 25" (BM25) scores based on document token frequencies.
Ranks are generally obtained by matching both for full package text from the specified corpus, including all long-form documentation, and by matching package descriptions only. The function returns a single rank derived by combining individual ranks using the Reciprocal Rank Fusion (RRF) algorithm.
Finally, all components of this function are locally cached for each call
(by the memoise package), so additional calls to this function with
the same input and corpus should be much faster than initial calls.
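The combination of separate ranks into a single rank can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion under stated assumptions: the package names, the two rank columns, and the `rrf()` helper are hypothetical, and the constant `k = 60` is the value conventionally used for RRF, not necessarily the one used in this package.

```r
# Hypothetical ranks for four packages from two separate matches: one against
# full package text, and one against package descriptions only.
ranks <- data.frame (
    package = c ("pkgA", "pkgB", "pkgC", "pkgD"),
    rank_full = c (1L, 3L, 2L, 4L),
    rank_descs = c (2L, 1L, 4L, 3L)
)

# Reciprocal Rank Fusion: sum 1 / (k + rank) over all rankings.
rrf <- function (rank_cols, k = 60) {
    rowSums (1 / (k + as.matrix (rank_cols)))
}

ranks$rrf <- rrf (ranks [, c ("rank_full", "rank_descs")])
# Higher fused scores receive better (lower) final ranks:
ranks$rank <- rank (-ranks$rrf, ties.method = "first")
ranks <- ranks [order (ranks$rank), ]
ranks
```

A package ranked well in both lists ("pkgA" here) ends up ahead of one ranked first in only one of them ("pkgB").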
pkgmatch_similar_pkgs(input, corpus = NULL, idfs = NULL, n = 5L, browse = FALSE)

| Argument | Description |
|---|---|
| input | Either a text string, a path to local source code of an R package, or the name of any installed R package. |
| corpus | Must be specified as one of "ropensci", "cran", or "bioc" (for Bioconductor). If |
| idfs | Inverse Document Frequency tables for a suite of packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory. |
| n | When the result of this function is printed to screen, the top |
| browse | If |

A data.frame with a "package" column naming packages, and a
column of package ranks, with 1 being most similar. For the CRAN corpus, a
column of package versions is also included.
The returned object has a default print method which prints the best 5
matches directly to the screen, yet returns information on all packages
within the specified corpus. There is also a head method to print the
first few entries of these full data (default n = 5). To see all data, use
as.data.frame().
The first time this function is run without passing idfs, required
values will be automatically downloaded and stored in a locally persistent
cache directory. Especially for the "cran" corpus, this downloading may take
quite some time.
Other main:
pkgmatch_similar_fns()
    # The following function simulates remote data in a temporary directory, to
    # enable package usage without downloading. Do not run for normal usage.
    generate_pkgmatch_example_data ()
    input <- "curl" # Name of a single installed package
    p <- pkgmatch_similar_pkgs (input, corpus = "cran")
    p # Default print method, lists 5 best matching packages
    head (p) # Shows first 5 rows of full `data.frame` object
This function uses "treesitter" (https://github.com/tree-sitter/tree-sitter) to tag all function calls made within a local package, and to associate those calls with package namespaces.
This is used as input to the pkgmatch_bm25_fn_calls function, to enable function calls within a local package to be inversely weighted by frequencies within all packages within a corpus. The results of applying this function to the full corpora used in this package are contained within the data listed on https://github.com/ropensci-review-tools/pkgmatch/releases/tag/v0.5.2, as "fn-calls-ropensci.Rds" and "fn-calls-cran.Rds".
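The kind of output produced by tagging function calls can be approximated with base R alone. The sketch below deliberately swaps tree-sitter for R's own parser via `utils::getParseData()`, so it is only a stand-in for the approach described above: it lists the names of called functions and their positions in lines and columns, but neither resolves namespaces nor reports byte offsets. The example function `my_fn` is invented.

```r
# Invented source code containing three function calls.
code <- '
my_fn <- function (x) {
    y <- stats::median (x)
    paste0 ("median: ", round (y, 1))
}
'

# Parse data are only retained when `keep.source = TRUE`:
parsed <- parse (text = code, keep.source = TRUE)
pd <- utils::getParseData (parsed)

# Tokens of type "SYMBOL_FUNCTION_CALL" are the names of called functions:
calls <- pd [pd$token == "SYMBOL_FUNCTION_CALL", c ("text", "line1", "col1")]
calls
```

The "text" column then holds "median", "paste0", and "round", with "line1" and "col1" giving the position of each call within the parsed text.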
pkgmatch_treesitter_fn_tags(path)

| Argument | Description |
|---|---|
| path | Path to local package, or |

A data.frame of all function calls made within the package, with
the following columns:
"fn" Name of the package function within which the call is made, including namespace identifiers of "::" for exported functions and ":::" for non-exported functions.
"name" Name of the function being called, including namespace.
"start" Byte number within file corresponding to start of definition.
"end" Byte number within file corresponding to end of definition.
"file" Name of file in which the function call is defined.
    # Get function calls made within locally-installed packages:
    fn_tags <- pkgmatch_treesitter_fn_tags ("curl") # Name of installed package
    fn_tags <- pkgmatch_treesitter_fn_tags ("cli") # Name of installed package
    # Or get calls from full source code:
    u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
    path <- file.path (tempdir (), basename (u))
    ## Not run:
    download.file (u, destfile = path)
    fn_tags <- pkgmatch_treesitter_fn_tags (path)
    ## End(Not run)
This function forces all locally-cached data to be updated with the latest version of the remote data provided in the latest release of the GitHub repository at https://github.com/ropensci-review-tools/pkgmatch/releases.
Caching strategies are described in the "Data Caching and Updating"
vignette, accessible either locally via
vignette("data-caching-and-updating", package = "pkgmatch"), or online at
https://docs.ropensci.org/pkgmatch/articles/B_data-caching-and-updating.html.
In short, locally-cached data used by this package are updated
by default every 30 days (with the vignette describing how to modify this
default behaviour). This function forces all locally-cached data to be
updated, regardless of update frequencies.
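An age-based update check of this kind can be sketched in base R. The helper `cache_is_stale()` and its `max_age_days` parameter are hypothetical names introduced for illustration; the package's actual caching logic is described in the vignette and may differ.

```r
# Hypothetical helper: a cached file is treated as stale if it is missing, or
# if it was last modified more than `max_age_days` days ago.
cache_is_stale <- function (path, max_age_days = 30) {
    if (!file.exists (path)) {
        return (TRUE)
    }
    age <- difftime (Sys.time (), file.info (path)$mtime, units = "days")
    as.numeric (age) > max_age_days
}

f <- tempfile (fileext = ".Rds")
saveRDS (list (), f)
fresh <- cache_is_stale (f) # Newly written file

# Set the modification time back by 31 days to simulate an old cache file:
Sys.setFileTime (f, Sys.time () - 31 * 24 * 60 * 60)
old <- cache_is_stale (f)
c (fresh = fresh, old = old)
```

A forced update corresponds to skipping this check entirely and re-downloading every file.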
pkgmatch_update_cache()
(Invisibly) A list of full local paths to all files which were updated.
Other utils:
generate_pkgmatch_example_data(),
head.pkgmatch(),
pkgmatch_browse(),
pkgmatch_load_data(),
print.pkgmatch()
    ## Not run:
    pkgmatch_update_cache ()
    ## End(Not run)
This function is intended for internal rOpenSci use only. Calls
by unauthorized users will error and have no effect unless run with
upload = FALSE, in which case updated data will be created in the
sub-directory "pkgmatch-results" of R's current temporary directory. This
updating may take a very long time!
The function does not update the Bioconductor data. Because those are fixed
to specific Bioconductor releases, they are only updated manually with the
internal pkgmatch_generate_bioc() function.
Note that this function is categorically different from
pkgmatch_update_cache. This function updates the internal data used
by the pkgmatch package, and should only ever be run by package
maintainers. The pkgmatch_update_cache function downloads the latest versions
of these data to a local cache for use in this package.
pkgmatch_update_data(upload = TRUE, local_cran_mirror = NULL, local_ropensci_mirror = NULL)

| Argument | Description |
|---|---|
| upload | If |
| local_cran_mirror | Optional path to a local directory with full CRAN mirror. If specified, data will use packages from this local source for updating. Default behaviour if not specified is to download new packages into tempdir, and delete once data have been updated. |
| local_ropensci_mirror | Optional path to a local directory with full rOpenSci mirror. If specified, data will use repositories from this local source for updating. Default behaviour if not specified is to clone new repositories into tempdir, and delete once data have been updated. |

Local path to directory containing updated results.
    ## Not run:
    pkgmatch_update_data (upload = FALSE)
    ## End(Not run)
The main pkgmatch function, pkgmatch_similar_pkgs,
returns data.frame objects of class "pkgmatch". This class exists
primarily to enable this print method, which summarises by default the top 5
matching packages or functions. Objects can be converted to standard
data.frames with as.data.frame().
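The pattern of a data.frame subclass with its own print and head methods can be sketched as follows. The class name "mymatch", the constructor `new_match()`, and the "n" attribute are hypothetical, chosen so as not to imply anything about how the "pkgmatch" class is implemented internally.

```r
# Hypothetical constructor: adds a class and stores the number of entries to
# print as an attribute.
new_match <- function (x, n = 5L) {
    attr (x, "n") <- n
    class (x) <- c ("mymatch", "data.frame")
    x
}

# Print method: show only the names of the top `n` entries.
print.mymatch <- function (x, ...) {
    n <- min (attr (x, "n"), nrow (x))
    top <- x$package [seq_len (n)]
    print (top, ...)
    invisible (top)
}

# Head method: show the first `n` rows with all columns.
head.mymatch <- function (x, n = 5L, ...) {
    utils::head (as.data.frame (x), n = n)
}

m <- new_match (data.frame (package = paste0 ("pkg", 1:10), rank = 1:10))
out <- print (m) # Names of the 5 top entries only
h <- head (m, n = 3L) # First 3 rows of the full data.frame
```

Because the object still inherits from "data.frame", `as.data.frame()` drops the extra class and returns all rows and columns.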
    ## S3 method for class 'pkgmatch'
    print(x, ...)

| Argument | Description |
|---|---|
| x | Object to be printed |
| ... | Additional parameters passed to default 'print' method. |

The result of printing x, in form of either a single character
vector, or a named list of character vectors.
Other utils:
generate_pkgmatch_example_data(),
head.pkgmatch(),
pkgmatch_browse(),
pkgmatch_load_data(),
pkgmatch_update_cache()
    ## Not run:
    input <- "Download open spatial data from NASA"
    p <- pkgmatch_similar_pkgs (input)
    p # Default print method, lists 5 best matching packages
    head (p) # Shows first 5 rows of full `data.frame` object
    ## End(Not run)