Package: textreuse 0.1.5

Yaoxiang Li

textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
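As a rough illustration of the shingled n-gram tokenizer and the similarity functions named in the description, here is a minimal sketch. It assumes the package is installed and uses two short example sentences chosen only for illustration; the variable names are placeholders.

```r
# Assumes the textreuse package is already installed.
library(textreuse)

# Two short example sentences, used here only for illustration.
a <- "How many roads must a man walk down"
b <- "How many roads must a man walk down before you call him a man"

# Shingled n-grams: overlapping runs of n = 3 words (lowercased by default).
tokens_a <- tokenize_ngrams(a, n = 3)
tokens_b <- tokenize_ngrams(b, n = 3)

# Jaccard similarity: size of the intersection divided by size of the union.
sim <- jaccard_similarity(tokens_a, tokens_b)
print(sim)
```

The first sentence yields six trigrams and the second yields twelve, all six of the first appearing in the second, so the score here should come out to 0.5.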

Authors: Yaoxiang Li [aut, cre], Lincoln Mullen [aut]

textreuse_0.1.5.tar.gz
textreuse_0.1.5.zip (r-4.6), textreuse_0.1.5.zip (r-4.5), textreuse_0.1.5.zip (r-4.4)
textreuse_0.1.5.tgz (r-4.5-x86_64), textreuse_0.1.5.tgz (r-4.5-arm64), textreuse_0.1.5.tgz (r-4.4-x86_64), textreuse_0.1.5.tgz (r-4.4-arm64)
textreuse_0.1.5.tar.gz (r-4.6-arm64), textreuse_0.1.5.tar.gz (r-4.6-x86_64), textreuse_0.1.5.tar.gz (r-4.5-arm64), textreuse_0.1.5.tar.gz (r-4.5-x86_64)
textreuse_0.1.5.tgz (r-4.4-emscripten)
textreuse.pdf | textreuse.html
textreuse/json (API)
NEWS

# Install 'textreuse' in R:
install.packages('textreuse', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Reviews: rOpenSci Software Review #20

Bug tracker: https://github.com/ropensci/textreuse/issues

Pkgdown site: https://docs.ropensci.org

Uses libs:
  • c++: GNU Standard C++ Library v3

peer-reviewed, cpp

9.29 score, 201 stars, 233 scripts, 481 downloads, 1 mention, 43 exports, 27 dependencies

Last updated 4 months ago from: 895b5ff299 (on master). Checks: 3 OK, 11 NOTE. Indexed: yes.

Target | Result | Total time
pkgdown docs | OK | 164
source / vignettes | OK | 233
linux-devel-x86_64 | NOTE | 147
linux-devel-arm64 | NOTE | 170
linux-release-x86_64 | NOTE | 150
linux-release-arm64 | NOTE | 169
macos-release-x86_64 | NOTE | 223
macos-release-arm64 | NOTE | 79
macos-oldrel-x86_64 | NOTE | 226
macos-oldrel-arm64 | NOTE | 130
windows-devel | NOTE | 202
windows-release | NOTE | 241
windows-oldrel | NOTE | 169
wasm-release | OK | 148

Exports: align_local, content, content<-, filenames, has_content, has_hashes, has_minhashes, has_tokens, hash_string, hashes, hashes<-, is.TextReuseCorpus, is.TextReuseTextDocument, jaccard_bag_similarity, jaccard_dissimilarity, jaccard_similarity, lsh, lsh_candidates, lsh_compare, lsh_probability, lsh_query, lsh_subset, lsh_threshold, meta, meta<-, minhash_generator, minhashes, minhashes<-, pairwise_candidates, pairwise_compare, ratio_of_matches, rehash, skipped, TextReuseCorpus, TextReuseTextDocument, tokenize, tokenize_ngrams, tokenize_sentences, tokenize_skip_ngrams, tokenize_words, tokens, tokens<-, wordcount

Dependencies: assertthat, BH, cli, cpp11, digest, dplyr, fansi, generics, glue, lifecycle, magrittr, NLP, pillar, pkgconfig, purrr, R6, Rcpp, RcppProgress, rlang, stringi, stringr, tibble, tidyr, tidyselect, utf8, vctrs, withr

Introduction to the textreuse package

Rendered from textreuse-introduction.Rmd using knitr::rmarkdown on May 10 2025.

Last update: 2020-05-12
Started: 2015-10-22

Minhash and locality-sensitive hashing

Rendered from textreuse-minhash.Rmd using knitr::rmarkdown on May 10 2025.

Last update: 2015-10-31
Started: 2015-10-22

Pairwise comparisons for document similarity

Rendered from textreuse-pairwise.Rmd using knitr::rmarkdown on May 10 2025.

Last update: 2015-10-31
Started: 2015-10-22

Text Alignment

Rendered from textreuse-alignment.Rmd using knitr::rmarkdown on May 10 2025.

Last update: 2015-10-22
Started: 2015-10-22

Readme and manuals

Help Manual

Help page | Topics
textreuse: Detect Text Reuse and Document Similarity | textreuse-package textreuse
Local alignment of natural language texts | align_local
Convert candidates data frames to other formats | as.matrix.textreuse_candidates
Filenames from paths | filenames
Hash a string to an integer | hash_string
Locality sensitive hashing for minhash | lsh
Candidate pairs from LSH comparisons | lsh_candidates
Compare candidates identified by LSH | lsh_compare
Probability that a candidate pair will be detected with LSH | lsh_probability lsh_threshold
Query a LSH cache for matches to a single document | lsh_query
List of all candidates in a corpus | lsh_subset
Generate a minhash function | minhash_generator
Candidate pairs from pairwise comparisons | pairwise_candidates
Pairwise comparisons among documents in a corpus | pairwise_compare
Recompute the hashes for a document or corpus | rehash
Measure similarity/dissimilarity in documents | jaccard_bag_similarity jaccard_dissimilarity jaccard_similarity ratio_of_matches similarity-functions
TextReuseCorpus | is.TextReuseCorpus skipped TextReuseCorpus
TextReuseTextDocument | has_content has_hashes has_minhashes has_tokens is.TextReuseTextDocument TextReuseTextDocument
Accessors for TextReuse objects | hashes hashes<- minhashes minhashes<- TextReuseTextDocument-accessors tokens tokens<-
Recompute the tokens for a document or corpus | tokenize
Split texts into tokens | tokenizers tokenize_ngrams tokenize_sentences tokenize_skip_ngrams tokenize_words
Count words | wordcount