Package 'pkgmatch'

Title: Find R Packages Matching Either Descriptions or Other R Packages
Description: Find R packages matching either descriptions or other R packages.
Authors: Mark Padgham [aut, cre], Davis Vaughan [ctb]
Maintainer: Mark Padgham <[email protected]>
License: MIT + file LICENSE
Version: 0.4.3.097
Built: 2025-03-27 12:32:05 UTC
Source: https://github.com/ropensci-review-tools/pkgmatch

Help Index


Generate example data to use with pkgmatch

Description

This function generates a selection of test data for the "cran" corpus, to allow functions to be run offline, without having to download the large datasets otherwise required for the package to function.

Note that these data are randomly generated, so results will generally be meaningless. They are generated solely to demonstrate how the package functions, and are not intended to produce meaningful outputs.

Usage

generate_pkgmatch_example_data()

Value

(Invisibly) The path to the temporary directory containing the package data.

See Also

Other utils: head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), pkgmatch_update_cache(), print.pkgmatch(), text_is_code()

Examples

generate_pkgmatch_example_data ()
input <- "curl" # Name of a single installed package
pkgmatch_similar_pkgs (input, corpus = "cran")

Get the URL for local ollama API

Description

Return the URL of the specified ollama API. The default is "127.0.0.1:11434".

Usage

get_ollama_url()

Value

The ollama API URL

See Also

set_ollama_url

Other ollama: ollama_check(), set_ollama_url()


Head method for 'pkgmatch' objects

Description

Head method for 'pkgmatch' objects

Usage

## S3 method for class 'pkgmatch'
head(x, n = 5L, ...)

Arguments

x

Object for which head is to be printed

n

Number of rows of full pkgmatch object to be displayed

...

Not used

Value

A (usually) smaller version of x, with all columns displayed.

See Also

Other utils: generate_pkgmatch_example_data(), pkgmatch_browse(), pkgmatch_load_data(), pkgmatch_update_cache(), print.pkgmatch(), text_is_code()

Examples

## Not run: 
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object

## End(Not run)

Check ollama installation

Description

Performs the following checks:

  • Check that ollama is installed

  • Check that ollama is running

  • Check that ollama has the required models, and download if not

The required models are the Jina AI embeddings: https://ollama.com/jina/jina-embeddings-v2-base-en for text embeddings, and https://ollama.com/ordis/jina-embeddings-v2-base-code for code embeddings.

Note that the URL of a locally-running ollama instance is presumed by default to be "127.0.0.1:11434". Other values can be set using the set_ollama_url function.

Usage

ollama_check(sudo = is_docker_sudo())

Arguments

sudo

Set to TRUE if ollama is running in docker with sudo privileges.

Value

TRUE if everything works okay, otherwise the function will error before returning, and issue an informative error message.

See Also

Other ollama: get_ollama_url(), set_ollama_url()

Examples

## Not run: 
chk <- ollama_check ()

## End(Not run)

The "Best Matching 25" (BM25) ranking function.

Description

BM25 values match single inputs to document corpora by weighting terms by their inverse frequencies, so that relatively rare words contribute more to match scores than common words. For each input, the BM25 value is the sum of relative frequencies of each term in the input multiplied by the Inverse Document Frequency (IDF) of that term in the entire corpus. See the Wikipedia page at https://en.wikipedia.org/wiki/Okapi_BM25 for further details.
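The scoring idea can be sketched in base R. This is a minimal illustration only, assuming the standard Okapi BM25 formula from the Wikipedia page cited above with the conventional constants k1 = 1.2 and b = 0.75. The helper names (`tokenize`, `bm25_scores`) and the toy documents are hypothetical and are not part of pkgmatch; the package's internal implementation may differ.

```r
# Toy corpus: names are document identifiers, values are document texts.
docs <- c (
    a = "download spatial data from remote servers",
    b = "parse and plot spatial raster data",
    c = "send http requests to remote servers"
)

# Hypothetical helper: lower-case the text and split on non-letters.
tokenize <- function (x) {
    tok <- strsplit (tolower (x), "[^a-z]+") [[1]]
    tok [nzchar (tok)]
}

# Hypothetical helper: Okapi BM25 score of `input` against each document.
bm25_scores <- function (input, docs, k1 = 1.2, b = 0.75) {
    toks <- lapply (docs, tokenize)
    n_docs <- length (toks)
    doc_len <- vapply (toks, length, integer (1))
    avg_len <- mean (doc_len)
    terms <- unique (tokenize (input))

    # Number of documents containing each input term
    n_t <- vapply (terms, function (term) {
        sum (vapply (toks, function (d) term %in% d, logical (1)))
    }, numeric (1))
    # Rare terms get larger inverse document frequency weights
    idf <- log ((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)

    scores <- vapply (seq_along (toks), function (i) {
        tf <- vapply (terms, function (term) {
            sum (toks [[i]] == term)
        }, numeric (1))
        sum (idf * tf * (k1 + 1) /
            (tf + k1 * (1 - b + b * doc_len [i] / avg_len)))
    }, numeric (1))
    names (scores) <- names (docs)
    sort (scores, decreasing = TRUE)
}

bm25_scores ("spatial raster data", docs)
```

Document "b" contains all three input terms and so scores highest, while document "c" shares no terms with the input and scores zero.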

Usage

pkgmatch_bm25(input, txt = NULL, idfs = NULL, corpus = NULL)

Arguments

input

A single character string to match against all documents given in the second parameter, txt.

txt

An optional list of input documents. If not specified, data will be loaded as specified by the corpus parameter.

idfs

Optional list of Inverse Document Frequency weightings generated by the internal bm25_idf function. If not specified, values for the rOpenSci corpus will be automatically downloaded and used.

corpus

If txt is not specified, data for nominated corpus will be downloaded to local cache directory, and BM25 values calculated against those. Must be one of "ropensci", "ropensci-fns", or "cran". Note that the "ropensci-fns" corpus contains entries for every single function of every rOpenSci package, and the resulting BM25 values can be used to determine the best-matching function. The other two corpora are package-based, and the results can be used to find the best-matching package.

Value

A data.frame of package names and 'BM25' measures against text from whole packages both with and without function descriptions.

See Also

Other bm25: pkgmatch_bm25_fn_calls()

Examples

# The following function simulates remote data in temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()

input <- "curl" # Name of a single installed package
pkgmatch_bm25 (input, corpus = "cran")
# Or pre-load document-frequency weightings and pass those:
idfs <- pkgmatch_load_data ("idfs", corpus = "cran", fns = FALSE)
pkgmatch_bm25 (input, corpus = "cran", idfs = idfs)

The "Best Matching 25" (BM25) ranking function for function calls

Description

See ?pkgmatch_bm25 for details of BM25 ranks. This function calculates "BM25" ranks from function-call frequencies between a local R package and all packages in specified corpus. Values are thus higher for packages with similar patterns of function calls, weighted by inverse frequencies, so functions called infrequently across the entire corpus contribute more than common functions.

Note that the results of this function are entirely different from pkgmatch_bm25 with corpus = "ropensci-fns". The latter returns BM25 values from text descriptions of all functions in all rOpenSci packages, whereas this function returns BM25 values based on frequencies of function calls within packages.

Usage

pkgmatch_bm25_fn_calls(path, corpus = NULL)

Arguments

path

Local path to source code of an R package.

corpus

One of "ropensci" or "cran"

Value

A data.frame of two columns:

  • package Name of the package from the specified corpus;

  • bm25 The "BM25" index value for the nominated packages, where high values indicate greater overlap in term frequencies.

See Also

Other bm25: pkgmatch_bm25()

Examples

## Not run: 
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
path <- file.path (tempdir (), basename (u))
download.file (u, destfile = path)
bm25 <- pkgmatch_bm25_fn_calls (path)

## End(Not run)

Open web pages for pkgmatch results

Description

Open web pages for pkgmatch results

Usage

pkgmatch_browse(p, n = NULL)

Arguments

p

A pkgmatch object returned from either pkgmatch_similar_pkgs or pkgmatch_similar_fns.

n

Number of top-matching entries which should be opened. Defaults to the value passed to the main functions.

Value

(Invisibly) A named vector of integers, with 0 for all pages able to be successfully opened, and 1 otherwise.

See Also

Other utils: generate_pkgmatch_example_data(), head.pkgmatch(), pkgmatch_load_data(), pkgmatch_update_cache(), print.pkgmatch(), text_is_code()

Examples

## Not run: 
input <- "genomics and transcriptomics sequence data"
p <- pkgmatch_similar_pkgs (input)
pkgmatch_browse (p) # Open main package pages on rOpenSci
p <- pkgmatch_similar_pkgs (input, corpus = "cran")
pkgmatch_browse (p) # Open main package pages on CRAN
p <- pkgmatch_similar_fns (input)
pkgmatch_browse (p) # Open pages for best-matching rOpenSci functions

## End(Not run)

Return raw embeddings from package text and function definitions.

Description

This function accepts a vector of either names of installed packages, or paths to local source code directories, and calculates language model (LM) embeddings for both text descriptions within the package (documentation, including of functions), and for the entire code base. Embeddings may also be calculated separately for all function descriptions.

The embeddings are currently retrieved from a local 'ollama' server (https://ollama.com) running Jina AI embeddings (https://ollama.com/jina/jina-embeddings-v2-base-en for text, and https://ollama.com/ordis/jina-embeddings-v2-base-code for code).

Usage

pkgmatch_embeddings_from_pkgs(packages = NULL, functions_only = FALSE)

Arguments

packages

A vector of either names of installed packages, or local paths to directories containing R packages.

functions_only

If TRUE, calculate embeddings for function descriptions only. This is intended to generate a separate set of embeddings which can then be used to match plain-text queries of functions, rather than entire packages.

Value

If !functions_only, a list of two matrices of embeddings: one for the text descriptions of the specified packages, including individual descriptions of all package functions, and one for the entire code base. For functions_only, a single matrix of embeddings for all function descriptions.

See Also

Other embeddings: pkgmatch_embeddings_from_text()

Examples

packages <- "curl"
emb_fns <- pkgmatch_embeddings_from_pkgs (packages, functions_only = TRUE)
colnames (emb_fns) # All functions of the package
emb_pkg <- pkgmatch_embeddings_from_pkgs (packages, functions_only = FALSE)
names (emb_pkg)
colnames (emb_pkg$text_with_fns) # "curl"

Return raw embeddings from a vector of text strings.

Description

This function accepts a vector of character strings, packages, or paths to local source code directories, and calculates language model (LM) embeddings for each string within the vector.

The embeddings are currently retrieved from a local 'ollama' server (https://ollama.com) running Jina AI text embeddings (https://ollama.com/jina/jina-embeddings-v2-base-en).

Usage

pkgmatch_embeddings_from_text(input = NULL)

Arguments

input

A vector of one or more text strings for which embeddings are to be extracted.

Value

A matrix of embeddings, one column for each input item, and a fixed number of rows defined by the embedding length of the language models.

See Also

Other embeddings: pkgmatch_embeddings_from_pkgs()

Examples

## Not run: 
input <- "Download open spatial data from NASA"
emb <- pkgmatch_embeddings_from_text (input = input)

## End(Not run)

Load 'pkgmatch' data for specified corpus.

Description

Load pre-computed data for a specified corpus. Data types are:

  • "embeddings" for language model embeddings;

  • "idfs" for Inverse Document Frequency weightings;

  • "functions" for frequency tables for text descriptions of function calls; or

  • "calls" for frequency tables for actual function calls.

This function is called within the main pkgmatch_similar_pkgs and pkgmatch_similar_fns functions to load required data there, and should not generally need to be explicitly called.

Usage

pkgmatch_load_data(
  what = "embeddings",
  corpus = "ropensci",
  fns = FALSE,
  raw = FALSE
)

Arguments

what

One of the four data types described above: "embeddings", "idfs", "functions", or "calls".

corpus

Must be specified as one of "ropensci" or "cran". If embeddings or idfs parameters are not specified, they will be automatically downloaded for the corpus specified by this parameter. The function will then return the most similar package from the specified corpus. Note that calculations with corpus = "cran" will generally take longer, because the corpus is much larger.

fns

If FALSE (default), load embeddings for all packages; otherwise load (considerably larger dataset of) embeddings for all individual functions.

raw

Only has an effect if what = "calls", in which case the default of FALSE loads a single Inverse Document Frequency table for the entire corpus; if TRUE, loads raw function call counts for each package in the corpus.

Value

The loaded data.

See Also

Other utils: generate_pkgmatch_example_data(), head.pkgmatch(), pkgmatch_browse(), pkgmatch_update_cache(), print.pkgmatch(), text_is_code()

Examples

## Not run: 
embeddings <- pkgmatch_load_data ("embeddings")
embeddings_fns <- pkgmatch_load_data ("embeddings", fns = TRUE)
idfs <- pkgmatch_load_data ("idfs")
idfs_fns <- pkgmatch_load_data ("idfs", fns = TRUE)

## End(Not run)

Identify R functions best matching a given input string

Description

Function matching is only available for functions from the corpus of rOpenSci packages. Function matching is also based on LM output only, and unlike package matching does not combine LM output with BM25 word-frequency matching.

Usage

pkgmatch_similar_fns(input, embeddings = NULL, n = 5L, browse = FALSE)

Arguments

input

A text string.

embeddings

Large Language Model embeddings for a suite of packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.

n

When the result of this function is printed to screen, the top n functions will be displayed.

browse

If TRUE, automatically open webpages of the top n matches in local browser.

Value

A modified data.frame object of class "pkgmatch". The data.frame has 3 columns:

  1. "function" with the name of the function in the form "<package>::<function>";

  2. "simil" with a similarity score between 0 and 1; and

  3. "rank" as an integer index, with the highest rank of 1 as the first row.

The returned object has a default print method which prints only the names of the 5 best-matching functions; see ?print.pkgmatch for details.

See Also

Other main: pkgmatch_similar_pkgs()

Examples

## Not run: 
input <- "Process raster satellite images"
p <- pkgmatch_similar_fns (input)
p # Default print method, lists 5 best matching functions
head (p) # Shows first 5 rows of full `data.frame` object

## End(Not run)

Find R packages matching an input of either text or another package

Description

This function accepts as input either a text description, or a path to a local R package, and ranks all R packages within the specified corpus in terms of how well they match that input. The "corpus" argument can specify either rOpenSci's package suite, or CRAN.

Ranks are obtained from scores derived from:

  • Cosine similarities between Language Model (LM) embeddings for the input, and corresponding embeddings for the specified corpus.

  • "Best Match 25" (BM25) scores based on document token frequencies.
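The first of these scores can be illustrated with a short base R sketch. It assumes embeddings stored as a matrix with one column per package, in the layout described for pkgmatch_embeddings_from_text. The `cosine_sim` helper, the package names and the random values are hypothetical, and this is not the package's own code.

```r
# Cosine similarity between one input embedding (a numeric vector) and every
# column of a corpus embedding matrix.
cosine_sim <- function (input_emb, corpus_emb) {
    num <- as.vector (crossprod (corpus_emb, input_emb))
    denom <- sqrt (colSums (corpus_emb^2)) * sqrt (sum (input_emb^2))
    stats::setNames (num / denom, colnames (corpus_emb))
}

set.seed (1)
# Hypothetical corpus: 3 packages, embedding length of 8
corpus_emb <- matrix (
    rnorm (8 * 3),
    nrow = 8,
    dimnames = list (NULL, c ("pkgA", "pkgB", "pkgC"))
)
# An input identical to the "pkgB" column has similarity 1 with that column.
input_emb <- corpus_emb [, "pkgB"]
simil <- cosine_sim (input_emb, corpus_emb)
rank (-simil) # Rank of 1 is the best match
```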

For text input, ranks are generally obtained for packages both including and excluding function descriptions as part of the package text, giving two sets of ranks for a given input. Where input is an entire R package, separate ranks are also calculated for package code and text, thus giving four distinct ranks. The function ultimately returns a single rank, derived by combining individual ranks using the Reciprocal Rank Fusion (RRF) algorithm. The additional parameter of lm_proportion determines the extent to which the final ranking weights the LM versus BM25 components.
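The fusion step can be sketched as follows. Reciprocal Rank Fusion scores each item by summing 1 / (k + rank) over the individual rankings, with k = 60 as the conventional constant. The `rrf_combine` helper, the toy ranks, and in particular the way lm_proportion is applied as a weight are assumptions made for this illustration; the package's internal weighting may differ.

```r
# Reciprocal Rank Fusion of two rankings. Each ranking contributes
# 1 / (k + rank), weighted here (by assumption) by `lm_proportion`.
rrf_combine <- function (lm_rank, bm25_rank, lm_proportion = 0.5, k = 60) {
    score <- lm_proportion / (k + lm_rank) +
        (1 - lm_proportion) / (k + bm25_rank)
    # Highest fused score gets rank 1
    rank (-score, ties.method = "first")
}

pkgs <- c ("pkgA", "pkgB", "pkgC", "pkgD")
lm_rank <- c (1, 2, 3, 4)
bm25_rank <- c (4, 1, 2, 3)

data.frame (
    package = pkgs,
    equal = rrf_combine (lm_rank, bm25_rank),
    lm_only = rrf_combine (lm_rank, bm25_rank, lm_proportion = 1),
    bm25_only = rrf_combine (lm_rank, bm25_rank, lm_proportion = 0)
)
```

With lm_proportion = 1 the fused ranks reproduce the LM ranks, and with lm_proportion = 0 they reproduce the BM25 ranks, which matches the behaviour described for that parameter.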

Finally, all components of this function are locally cached for each call (by the memoise package), so additional calls to this function with the same input and corpus should be much faster than initial calls. This means the effect of changing lm_proportion can easily be examined by simply repeating calls to this function.

Usage

pkgmatch_similar_pkgs(
  input,
  corpus = NULL,
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  lm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)

Arguments

input

Either a text string, a path to local source code of an R package, or the name of any installed R package.

corpus

Must be specified as one of "ropensci" or "cran". If embeddings or idfs parameters are not specified, they will be automatically downloaded for the corpus specified by this parameter. The function will then return the most similar package from the specified corpus. Note that calculations with corpus = "cran" will generally take longer, because the corpus is much larger.

embeddings

Large Language Model embeddings for a suite of packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.

idfs

Inverse Document Frequency tables for a suite of packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory.

input_is_code

A binary flag indicating whether input is code or plain text. Ignored if input is a path to a local package; otherwise can be used to force appropriate interpretation of input type.

lm_proportion

A value between 0 and 1 to control the relative contributions of results from Language Models ("LMs") versus results from traditional token-frequency models. Final rankings are generated by combining these two kinds of results, so that lm_proportion = 0 will return results from token frequency analyses only, while lm_proportion = 1 will return results from LMs only.

n

When the result of this function is printed to screen, the top n packages will be displayed.

browse

If TRUE, automatically open webpages of the top n matches in local browser.

Value

A data.frame with a "package" column naming packages, and one or more columns of package ranks in terms of text similarity and, if input is an R package, of similarity in code structure.

The returned object has a default print method which prints the best 5 matches directly to the screen, yet returns information on all packages within the specified corpus. This information is in the form of a data.frame, with one column for the package name, and one or more additional columns of integer ranks for each package. There is also a head method to print the first few entries of these full data (default n = 5). To see all data, use as.data.frame(). See the example below for how to manipulate these objects.

Note

The first time this function is run without passing either embeddings or idfs, required values will be automatically downloaded and stored in a locally persistent cache directory. Especially for the "cran" corpus, this downloading may take quite some time.

See Also

input_is_code

Other main: pkgmatch_similar_fns()

Examples

# The following function simulates remote data in temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()

input <- "curl" # Name of a single installed package
p <- pkgmatch_similar_pkgs (input, corpus = "cran")
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object

# This second call modifies default combining of results equally from language
# model and token frequency (BM25) results. It will be much faster than first
# call, because previously generated embeddings are re-used.
p2 <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 0.25)

# Example demonstrating how to combine results using different values of
# `lm_proportion`. Input is a package, so result has columns for "text_rank"
# and "code_rank".
lm_props <- 0:10 / 10
res <- lapply (lm_props, function (p) {
    nm_text <- sprintf ("text_rank_p%02.0f", p * 10)
    nm_code <- sprintf ("code_rank_p%02.0f", p * 10)
    res <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = p) |>
        dplyr::rename ({{nm_text}} := "text_rank", {{nm_code}} := "code_rank") |>
        dplyr::arrange (package)
    if (p > 0) {
        res <- dplyr::select (res, -package, -version)
    }
    return (res)
})
res <- do.call (cbind, res)

# That then has paired columns of (text rank, code rank) for each of the
# 11 values of `lm_props`.
head (res)

Identify all function calls made within a package.

Description

This function uses "treesitter" (https://github.com/tree-sitter/tree-sitter) to tag all function calls made within a local package, and to associate those calls with package namespaces.

This is used as input to the pkgmatch_bm25_fn_calls function, to enable function calls within a local package to be inversely weighted by frequencies within all packages within a corpus. The results of applying this function to the full corpora used in this package are contained within the data listed on https://github.com/ropensci-review-tools/pkgmatch/releases/tag/v0.4.0, as "fn-calls-ropensci.Rds" and "fn-calls-cran.Rds".
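The general idea of tagging function calls can be illustrated without tree-sitter. The sketch below swaps in base R's own parser via utils::getParseData(), which labels call sites with the token type "SYMBOL_FUNCTION_CALL". It is not how this function is implemented, and the `code` snippet is a hypothetical example.

```r
# Tag function calls in a snippet of R code using base R's parser
# (a stand-in for tree-sitter in this sketch).
code <- "
f <- function (x) {
    y <- paste0 (x, collapse = '')
    nchar (y) + utils::head (x, 1)
}
"
# `keep.source = TRUE` is needed so that parse data are retained
parsed <- parse (text = code, keep.source = TRUE)
tokens <- utils::getParseData (parsed)
calls <- tokens [
    tokens$token == "SYMBOL_FUNCTION_CALL",
    c ("line1", "col1", "text")
]
calls

# Call frequencies, the raw material for inverse-frequency weighting
table (calls$text)
```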

Usage

pkgmatch_treesitter_fn_tags(path)

Arguments

path

Path to local package, or .tar.gz file of package source.

Value

A data.frame of all function calls made within the package, with the following columns:

  • fn Name of the package function within which the call is made, including namespace identifiers of "::" for exported functions and ":::" for non-exported functions.

  • name Name of function being called, including namespace.

  • start Byte number within file corresponding to start of definition.

  • end Byte number within file corresponding to end of definition.

  • file Name of file in which the function call is defined.

Examples

# Get function calls made within locally-installed packages:
fn_tags <- pkgmatch_treesitter_fn_tags ("curl") # Name of installed package
fn_tags <- pkgmatch_treesitter_fn_tags ("cli") # Name of installed package

# Or get calls from full source code:
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
path <- file.path (tempdir (), basename (u))
## Not run: 
download.file (u, destfile = path)
fn_tags <- pkgmatch_treesitter_fn_tags (path)

## End(Not run)

Update all locally-cached pkgmatch data to latest versions.

Description

This function forces all locally-cached data to be updated with the latest version of the remote data provided in the latest release of the GitHub repository at https://github.com/ropensci-review-tools/pkgmatch/releases.

Caching strategies are described in the "Data Caching and Updating" vignette, accessible either locally via vignette("data-caching-and-updating", package = "pkgmatch"), or online at https://docs.ropensci.org/pkgmatch/articles/C_data-caching-and-updating.html. In short, locally-cached data used by this package are updated by default every 30 days (with the vignette describing how to modify this default behaviour). This function forces all locally-cached data to be updated, regardless of update frequencies.
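An age-based update rule of this kind can be sketched in base R. The `cache_is_stale` helper and its `max_age_days` parameter are hypothetical and serve only to illustrate the idea; the package's actual caching logic is described in the vignette and may differ.

```r
# Hypothetical check of whether a cached file is missing or older than a
# given number of days.
cache_is_stale <- function (path, max_age_days = 30) {
    if (!file.exists (path)) {
        return (TRUE)
    }
    age <- difftime (Sys.time (), file.mtime (path), units = "days")
    as.numeric (age) > max_age_days
}

f <- tempfile (fileext = ".Rds")
saveRDS (mtcars, f)
cache_is_stale (f) # Freshly written file

# Pretend the file was last modified 31 days ago:
Sys.setFileTime (f, Sys.time () - 31 * 24 * 3600)
cache_is_stale (f)
```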

Usage

pkgmatch_update_cache()

See Also

Other utils: generate_pkgmatch_example_data(), head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), print.pkgmatch(), text_is_code()

Examples

## Not run: 
pkgmatch_update_cache ()

## End(Not run)

Update pkgmatch corpus data on GitHub

Description

This function is intended for internal rOpenSci use only. Usage by any unauthorized users will error and have no effect unless run with upload = FALSE, in which case updated data will be created in the sub-directory "pkgmatch-results" of R's current temporary directory. This updating may take a very long time!

Note that this function is categorically different from pkgmatch_update_cache. This function updates the internal data used by the pkgmatch package, and should only ever be run by package maintainers. The pkgmatch_update_cache function downloads the latest versions of these data to a local cache for use in this package.

Usage

pkgmatch_update_data(upload = TRUE)

Arguments

upload

If TRUE, upload updated results to GitHub release.

Value

Local path to directory containing updated results.

Examples

## Not run: 
pkgmatch_update_data (upload = FALSE)

## End(Not run)

Print method for 'pkgmatch' objects

Description

The main pkgmatch functions, pkgmatch_similar_pkgs and pkgmatch_similar_fns, return data.frame objects of class "pkgmatch". This class exists primarily to enable this print method, which summarises by default the top 5 matching packages or functions. Objects can be converted to standard data.frames with as.data.frame().
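The pattern of a data.frame subclass with its own print and head methods can be sketched with a toy S3 class. The class name "toymatch" and the constructor `new_toymatch` are hypothetical, and the methods below only imitate the behaviour described here rather than reproduce the package's code.

```r
# A toy S3 class wrapping a data.frame, storing the number of rows to print.
new_toymatch <- function (df, n = 5L) {
    attr (df, "n") <- n
    class (df) <- c ("toymatch", class (df))
    df
}

# Print only the names of the top `n` rows.
print.toymatch <- function (x, ...) {
    n <- min (attr (x, "n"), nrow (x))
    top <- x$package [seq_len (n)]
    print (top, ...)
    invisible (top)
}

# Show the first `n` rows with all columns, as a plain data.frame.
head.toymatch <- function (x, n = 5L, ...) {
    utils::head (as.data.frame (x), n = n, ...)
}

res <- new_toymatch (data.frame (
    package = paste0 ("pkg", 1:8),
    rank = 1:8
))
res # Names of the top 5 entries only
head (res, 3) # First 3 rows with all columns
class (as.data.frame (res)) # Plain "data.frame"
```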

Usage

## S3 method for class 'pkgmatch'
print(x, ...)

Arguments

x

Object to be printed

...

Additional parameters passed to default 'print' method.

Value

The result of printing x, in form of either a single character vector, or a named list of character vectors.

See Also

Other utils: generate_pkgmatch_example_data(), head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), pkgmatch_update_cache(), text_is_code()

Examples

## Not run: 
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object

## End(Not run)

Set the URL for local ollama API

Description

Set the URL for local ollama API

Usage

set_ollama_url(ollama_url)

Arguments

ollama_url

The desired ollama API URL

Value

The ollama API URL

See Also

get_ollama_url()

Other ollama: get_ollama_url(), ollama_check()


Estimate whether input text is code or English prose text.

Description

This function is used as part of the input of many functions, to determine whether the input is text or whether it is code. All such functions use it via an input parameter named input_is_code, which is set by default to the value returned from this function. That value can always be over-ridden by specifying a fixed value of either TRUE or FALSE for input_is_code.

Values from this function are only approximate: some software packages give false negatives and are identified as prose (like rOpenSci's "geonames" package), and some prose may be wrongly identified as code.

Usage

text_is_code(txt)

Arguments

txt

Single input text string

Value

Logical value indicating whether or not txt was identified as code.

See Also

Other utils: generate_pkgmatch_example_data(), head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), pkgmatch_update_cache(), print.pkgmatch()

Examples

txt <- "Some text without any code"
text_is_code (txt)
txt <- "this_is_code <- function (x) { x }"
text_is_code (txt)