Package: tokenizers
Type: Package
Title: Fast, Consistent Tokenization of Natural Language Text
Version: 0.3.1
Date: 2024-03-27
Description: Convert natural language text into tokens. Includes tokenizers for
    shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs,
    characters, shingled characters, lines, Penn Treebank, regular
    expressions, as well as functions for counting characters, words, and
    sentences, and a function for splitting longer texts into separate
    documents, each with the same number of words. The tokenizers have a
    consistent interface, and the package is built on the 'stringi' and 'Rcpp'
    packages for fast yet correct tokenization in 'UTF-8'.
License: MIT + file LICENSE
LazyData: yes
Authors@R: c(person("Thomas", "Charlon", role = c("aut", "cre"),
        email = "charlon@protonmail.com",
        comment = c(ORCID = "0000-0001-7497-0470")),
    person("Lincoln", "Mullen", role = c("aut"),
        email = "lincoln@lincolnmullen.com",
        comment = c(ORCID = "0000-0001-5103-6917")),
    person("Os", "Keyes", role = c("ctb"),
        email = "ironholds@gmail.com",
        comment = c(ORCID = "0000-0001-5196-609X")),
    person("Dmitriy", "Selivanov", role = c("ctb"),
        email = "selivanov.dmitriy@gmail.com"),
    person("Jeffrey", "Arnold", role = c("ctb"),
        email = "jeffrey.arnold@gmail.com",
        comment = c(ORCID = "0000-0001-9953-3904")),
    person("Kenneth", "Benoit", role = c("ctb"),
        email = "kbenoit@lse.ac.uk",
        comment = c(ORCID = "0000-0002-0797-564X")))
URL: https://docs.ropensci.org/tokenizers/,
    https://github.com/ropensci/tokenizers
BugReports: https://github.com/ropensci/tokenizers/issues
RoxygenNote: 7.3.1
Depends: R (>= 3.1.3)
Imports: stringi (>= 1.0.1), Rcpp (>= 0.12.3), SnowballC (>= 0.5.1)
LinkingTo: Rcpp
Encoding: UTF-8
Suggests: covr, knitr, rmarkdown, stopwords (>= 0.9.0), testthat
VignetteBuilder: knitr
Config/pak/sysreqs: libicu-dev
Repository: https://ropensci.r-universe.dev
Date/Publication: 2024-03-27 09:33:34 UTC
RemoteUrl: https://github.com/ropensci/tokenizers
RemoteRef: master
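RemoteSha: b80863d088d4b39695b602ca11e061ac34770ec7
NeedsCompilation: yes
Packaged: 2026-07-01 08:14:52 UTC; root
Author: Thomas Charlon [aut, cre] (ORCID: <https://orcid.org/0000-0001-7497-0470>),
    Lincoln Mullen [aut] (ORCID: <https://orcid.org/0000-0001-5103-6917>),
    Os Keyes [ctb] (ORCID: <https://orcid.org/0000-0001-5196-609X>),
    Dmitriy Selivanov [ctb],
    Jeffrey Arnold [ctb] (ORCID: <https://orcid.org/0000-0001-9953-3904>),
    Kenneth Benoit [ctb] (ORCID: <https://orcid.org/0000-0002-0797-564X>)
Maintainer: Thomas Charlon <charlon@protonmail.com>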