--- title: "Example applications" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Example applications} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set ( collapse = TRUE, comment = "#>" ) ``` This vignette attempts to answer the question of why you should use the `pkgmatch` package, by describing a couple of example applications. ## Text searches for R packages ### Using search engines Anybody wanting an answer to the question, "_Is there an R package that does that?_" will most commonly use a search engine. Here we'll consider the following example search: > R package to return web search engine results into R as strings or URLs Note that there is currently no package which does that, nor is there likely to be, because search results are not generally retrievable via APIs, and in the rare cases in which they are, they are always restricted to authorized access only, and thus require API keys (and commonly also payment). Given that we expect no direct match, it is then not surprising that most search engines will then deliver a [pile of links to pages on web _scraping_](https://duckduckgo.com/?q=R%20package%20to%20return%20web%20search%20engine%20results%20into%20R%20as%20strings%20or%20URLs%20r%20programming), even though that word is not even part of the search. If you're lucky, [the `searcher` package](https://r-pkg.thecoatlessprofessor.com/searcher/) may appear in the results, although that package does not actually return search results (for reasons described above, because of which it merely open links in web browsers). There is also an R-specific search engine, ["rseek.org"](https://rseek.org), but even that largely fails to deliver [any useful results](https://rseek.org/?q=R%20package%20to%20return%20web%20search%20engine%20results%20in%20R%20as%20strings%20or%20URLs%20). The first actual package mentioned is [the `stringdist` package](https://journal.r-project.org/archive/2014-1/loo.pdf), which is in no way related to our query (and even then, the link is to the R-journal article describing the package, and not the package itself). Finally, GitHub has excellent search facilities, and yet searching for our string there simply returns [no results matching entire repositories](https://github.com/search?q=R%20package%20to%20return%20web%20search%20engine%20results%20into%20R%20as%20strings%20or%20URLs&type=repositories). Although there are huge numbers of matches in other aspects, such as code or issues, clicking on those produces very little or no useful information in attempting to identify repositories matching the search string. These search engine results illustrate the general difficulty of searching for particular _types_ of result, in our case R packages. Search engines are inherently broad and generic, and use string comparisons to match outputs to inputs, largely regardless of the type of output. This means that search engines are generally poor tools for identifying specific kinds of objects or results, and generally yield mostly "noise" which must be extensively filtered before the desired kinds of objects can be identified and compared. In summary: - Search engine results are general, and require extensive filtering to be useful. ### Using language models Many people now use language model interfaces, such as [phind](https://www.phind.com/search/cmbqhybqt0000206imod5zpqo) or [perplexity](https://www.perplexity.ai/search/r-package-to-return-web-search-9mTr_z_eT2iqkKutaU3YYA), for web searching. These use complex language embeddings to match inputs to outputs, and so will generally be more likely to return actual R packages as outputs. Clicking on those links shows both to return actual packages, with most results including general web-scraping packages such as [rvest](https://rvest.tidyverse.org), along with more specific packages such as [searcher](https://r-pkg.thecoatlessprofessor.com/searcher/) or [googleSearchR](https://github.com/irfanalidv/GoogleSearchR). A notable limitation of language model results is nevertheless that training data are collated regardless of age, and so results may frequently include old or obsolete packages (such as [RSelenium](https://github.com/ropensci/RSelenium) or [RCrawler](https://github.com/salimk/Rcrawler/)). Mis-matches may also occur, such as confusion between [google's "serp-api" for their search engine](https://serpapi.com/), and the R package named ["serp"](https://github.com/ejikeugba/serp), which is completely unrelated. There are also potential ethical ramifications of many language models, notably including that models capable of reproducing code should respect licensing conditions of that code. This may prevent models from identifying packages which were not used within their training data due to licensing restrictions. In summary: - Language model results may be out-of-date - Language model results may return false matches - Language model results may be restricted only to packages with appropriate licenses ### Using 'pkgmatch' Compared to the true generality of web search engines or language model interfaces, `pkgmatch` is very restricted in scope, but it overcomes some of the limitations described above because: - Results are always and only the names of R packages matching input queries - Results are always up-to-date - pkgmatch can return names of any package with a CRAN-compliant license However, like language models, pkgmatch may also return false matches, the computational reasons for which are described in the vignettes, [_How does pkgmatch work?_](https://docs.ropensci.org/pkgmatch/articles/B_how-does-it-work.html) and [_Why are the results not what I expect?_](https://docs.ropensci.org/pkgmatch/articles/E_why-are-the-results-not-what-i-expect.html). We nevertheless hope that these advantages make pkgmatch a uniquely useful tool in searching for R packages. Now let's look at how it responds to the same input query used above: ```{r initial-search, eval = FALSE} text <- "R package to return web search engine results into R as strings or URLs" pkgmatch::pkgmatch_similar_pkgs (text, corpus = "cran") ``` ```{r initial-search-out, echo = FALSE} c ("RWsearch", "ore", "sos", "gghilbertstrings", "rjsoncons") ``` Of those top five matches: - The ['RWsearch' package](https://cran.r-project.org/web/packages/RWsearch/index.html) is directly related. - The ['ore' package](https://github.com/jonclayden/ore) is a regular expression interface, and so seems mis-matched, yet nevertheless provides a host of search-related functions, including many function names which include "search". - The ['sos' package](https://github.com/sbgraves237/sos) is also directly related. The other two seem unrelated, but: - The ['gghilbertstrings' package](https://github.com/Sumidu/gghilbertstrings) seems entirely unrelated, but it does present an [example based on "search-engine-results"](https://github.com/Sumidu/gghilbertstrings), and so maybe appears for that reason? - The ['rjsoncons' package](https://github.com/mtmorgan/rjsoncons) is also not clearly related to our search term, but like [the 'ore' package](https://github.com/jonclayden/ore) also provides a wealth of functions related to querying and extraction. These results illustrate a contrast between the overly-general results of search engines or language models, and pkgmatch results which are highly specific because they are restricted to R packages only. Although we expected no precise match for reasons described above, pkgmatch nevertheless revealed the presence of two very clearly related packages, ['RWsearch'](https://cran.r-project.org/web/packages/RWsearch/index.html) and ['sos'](https://github.com/sbgraves237/sos), neither of which were returned in results from either [search engines](https://duckduckgo.com/?q=R%20package%20to%20return%20web%20search%20engine%20results%20into%20R%20as%20strings%20or%20URLs%20r%20programming), or [large](https://www.phind.com/search/cmbqhybqt0000206imod5zpqo) language [models](https://www.perplexity.ai/search/r-package-to-return-web-search-9mTr_z_eT2iqkKutaU3YYA). ## Searches based on function code pkgmatch can also be used when writing R functions or scripts, to answer the question of which R packages may already exist which do what I am trying to code. The easiest way to illustrate this is with a concrete example, like the following attempt to use [the `httr2` package](https://httr2.r-lib.org/) to extract data from a search API: ```{r search-code} search_api_results <- function (q) { url <- "https://mysearchapi.com/search" q <- httr2::httr_request (url) |> httr2::req_url_query (q = q) resp <- httr2::req_perform (url) check <- httr2::resp_check_status (resp) body <- httr2::resp_body_json (resp) httr2::req_body_json (body) } ``` We can pass this function definition to the [`pkgmatch_similar_pkgs()` function](https://docs.ropensci.org/pkgmatch/reference/pkgmatch_similar_pkgs.html), first converting it to a single character string, and also calling the function with the explicit `input_is_code = TRUE` parameter to ensure that our input is interpreted as code and not text. ```{r fn-search, eval = FALSE} s <- paste0 (deparse (search_api_results), collapse = "\\n") pkgs <- pkgmatch_similar_pkgs (s, corpus = "cran", input_is_code = TRUE) pkgs ``` ```{r fn-search-out, echo = FALSE} c ("nominatimlite", "wikiTools", "httr2", "searcher", "httptest") ``` We can then call `pkgmatch_browse (pkgs)` to open the CRAN page for each of those packages in our default browser, and follow links from there to see whether any of those packages might help develop our function. We can also search for the closest matching _functions_ instead of packages, although function matching is restricted to the rOpenSci corpus only. ```{r fn-fn-search, eval = FALSE} fns <- pkgmatch_similar_fns (s) fns ``` ```{r fn-fn-search-out, echo = FALSE} c ( "crul::curl-options", "rnassqs::nassqs_GET", "webmockr::build_httr2_request", "rtweet::search_tweets", "webmockr::build_httr_request" ) ``` Again, web pages for all of those functions can be opened by passing the `fns` result to [the `pkgmatch_browse()` function](https://docs.ropensci.org/pkgmatch/reference/pkgmatch_browse.html). That will open the documentation pages on [docs.ropensci.org](https://docs.ropensci.org), with all function documentation entries include a link to the corresponding source code at the top of the page. This demonstrates how pkgmatch can be used to identify similar source code to any given code input. ## Searches based on entire packages Entire packages can also be used as input to pkgmatch functions. The simplest way to do this is to submit the name of an installed package, like this: ```{r whole-pkg-search, eval = FALSE} pkgs <- pkgmatch_similar_pkgs ("crul", corpus = "cran") ``` As explained in [the _How does pkgmatch work?_ vignette](https://docs.ropensci.org/pkgmatch/articles/B_how-does-it-work.html), pkgmatch extracts all code and text from the nominated packages and uses these to generate three sets of embeddings: of all package code, and of all text both with and without text from individual function descriptions. Matches with other packages are based on combinations of matches with these three sets of embeddings, as well as matches based on inverse document frequencies for package text (as also explained in [the _How does pkgmatch work?_ vignette](https://docs.ropensci.org/pkgmatch/articles/B_how-does-it-work.html#token-frequency-similarities)). The above call yields this result: ```{r whole-pkg-search-output-fakey, eval = FALSE} pkgs ``` ```{r whole-pkg-search-output, echo = FALSE} list ( text = c ("factormodel", "crul", "healthequal", "httr2", "crumble"), code = c ("shinyKGode", "igraphinshiny", "crul", "httr2", "sparkavro") ) ``` As expected, the `crul` package appears in the top results for matching both text and code (although not in first place, for reasons described in [the _Why are the results not what I expect?_ vignette](https://docs.ropensci.org/pkgmatch/articles/E_why-are-the-results-not-what-i-expect.html)). Moreover, the only two packages matched on both code and text are `crul` itself, and the highly-similar [`httr2` package](https://httr2.r-lib.org/). Finally, the ability to pass entire packages to [the `pkgmatch_similar_pkgs()` function](https://docs.ropensci.org/pkgmatch/reference/pkgmatch_similar_pkgs.html) reflects the original motivation for this package, which is to provide a useful tool for [rOpenSci's software peer review process](https://ropensci.org/software-review/), through enabling editors to easily assess similarity of new submissions with all previous rOpenSci packages.