This release brings together several years of maintenance and feature work to make textreuse easier to use on current R installations and more practical for larger document collections.
This is a CRAN resubmission that fixes a moved README URL reported by CRAN incoming checks.
- TextReuseTextDocument() and TextReuseCorpus() now accept an encoding
  argument, making it easier to read source files whose text encoding is known
  or differs from the platform default.
- TextReuseCorpus() now keeps skipped-document bookkeeping deterministic.
  Skipped documents are reported consistently, and skip metadata is available
  even when skip_short = FALSE.
- align_local() now returns an empty local alignment instead of throwing an
  error when two texts have no matching words. This makes batch alignment
  workflows easier to run because no-match pairs can be represented directly.
- align_local() gains preserve_punctuation, allowing displayed alignments to
  keep punctuation from the original texts when that context is useful.
- count_matches() and matching_tokens() helpers expose absolute match
  counts and the matched tokens themselves, so users can inspect what drove a
  similarity score rather than relying only on a ratio.
- pairwise_candidates() and matrix conversion now preserve all document IDs,
  including documents without returned candidate pairs.
- as_sparse_matrix() provides a sparse matrix representation of candidate
  results, which is more convenient for downstream modeling, graph analysis, and
  workflows with many documents.
- lsh_add() can add new documents to an existing LSH bucket cache, so users can
  extend an index without rebuilding it from scratch.
- lsh_compare() can run comparisons in parallel on non-Windows platforms when
  options(mc.cores) is set.
- shingle_ngrams()
- lsh() on corpora without minhashes
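
As a rough illustration of the lsh_compare() item above, the sketch below sets options(mc.cores) before running a small LSH workflow. The three texts, the core count, and the minhash settings are made up for the example, and the calls use only the long-standing textreuse interface; whether and how the work is split across cores depends on the installed version and platform.

```r
library(textreuse)

# Assumption: two cores are available; the option has no parallel effect on Windows.
options(mc.cores = 2L)

# Made-up example texts; "a" and "b" are near duplicates, "c" is unrelated.
texts <- c(
  a = "the quick brown fox jumps over the lazy dog near the old river bank",
  b = "the quick brown fox jumps over the lazy dog near the old stone bridge",
  c = "completely different words appear in this short unrelated example sentence here"
)

# 200 minhashes split into 50 bands of 4 rows each.
minhash <- minhash_generator(n = 200, seed = 1)

corpus <- TextReuseCorpus(
  text = texts,
  tokenizer = tokenize_ngrams, n = 3,
  minhash_func = minhash,
  progress = FALSE
)

buckets <- lsh(corpus, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)

# Score each candidate pair with Jaccard similarity.
scores <- lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)
print(scores)
```

Setting the option once at the top of a script keeps the comparison call itself unchanged, so the same code can run serially where parallel execution is unavailable.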