Package: tokenizers 0.3.1
tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
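A minimal sketch of that consistent interface (example text invented for illustration): each `tokenize_*` function takes a character vector and returns a list of character vectors, one element per input document.

```r
library(tokenizers)

docs <- c("The quick brown fox.", "It jumped over the lazy dog!")

# One list element per input document; by default words are
# lowercased and punctuation is stripped.
tokenize_words(docs)
#> [[1]]
#> [1] "the"   "quick" "brown" "fox"
#>
#> [[2]]
#> [1] "it"     "jumped" "over"   "the"    "lazy"   "dog"
```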
Authors:
Downloads:
- Source: tokenizers_0.3.1.tar.gz
- Windows binaries: tokenizers_0.3.1.zip (r-4.5), tokenizers_0.3.1.zip (r-4.4), tokenizers_0.3.1.zip (r-4.3)
- macOS binaries: tokenizers_0.3.1.tgz (r-4.4-x86_64), tokenizers_0.3.1.tgz (r-4.4-arm64), tokenizers_0.3.1.tgz (r-4.3-x86_64), tokenizers_0.3.1.tgz (r-4.3-arm64)
- Linux binaries: tokenizers_0.3.1.tar.gz (r-4.5-noble), tokenizers_0.3.1.tar.gz (r-4.4-noble)
- WebAssembly binaries: tokenizers_0.3.1.tgz (r-4.4-emscripten), tokenizers_0.3.1.tgz (r-4.3-emscripten)
Documentation: tokenizers.pdf | tokenizers.html
API: tokenizers/json
NEWS
# Install 'tokenizers' in R:
install.packages('tokenizers', repos = c('https://packages.ropensci.org', 'https://cloud.r-project.org'))
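After installing, a quick smoke test (a minimal sketch; the example sentence is arbitrary):

```r
library(tokenizers)

# count_words() returns one integer per input document.
count_words("Convert natural language text into tokens.")
#> [1] 6
```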
Bug tracker: https://github.com/ropensci/tokenizers/issues
Datasets:
- mobydick: The text of Moby Dick
Keywords: nlp, peer-reviewed, text-mining, tokenizer
Last updated 7 months ago from b80863d088 (on master). Checks: OK: 1, NOTE: 8. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Sep 25 2024 |
R-4.5-win-x86_64 | NOTE | Sep 25 2024 |
R-4.5-linux-x86_64 | NOTE | Sep 25 2024 |
R-4.4-win-x86_64 | NOTE | Sep 25 2024 |
R-4.4-mac-x86_64 | NOTE | Sep 25 2024 |
R-4.4-mac-aarch64 | NOTE | Sep 25 2024 |
R-4.3-win-x86_64 | NOTE | Sep 25 2024 |
R-4.3-mac-x86_64 | NOTE | Sep 25 2024 |
R-4.3-mac-aarch64 | NOTE | Sep 25 2024 |
Exports: chunk_text, count_characters, count_sentences, count_words, tokenize_character_shingles, tokenize_characters, tokenize_lines, tokenize_ngrams, tokenize_paragraphs, tokenize_ptb, tokenize_regex, tokenize_sentences, tokenize_skip_ngrams, tokenize_word_stems, tokenize_words
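As a hedged sketch of two of these exports (texts invented for illustration): `chunk_text()` splits a longer text into documents of equal word counts, and `tokenize_ngrams()` with `n_min < n` yields shingled n-grams.

```r
library(tokenizers)

# Split 100 words into four documents of 25 words each.
long_text <- paste(rep("one two three four five", 20), collapse = " ")
chunks <- chunk_text(long_text, chunk_size = 25)
length(chunks)               # 4 chunks
count_words(unlist(chunks))  # 25 words in each chunk

# Shingled n-grams: every n-gram from length n_min up to length n.
tokenize_ngrams("one two three four", n = 3, n_min = 2)[[1]]
```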
Introduction to the tokenizers Package
Rendered from introduction-to-tokenizers.Rmd using knitr::rmarkdown on Sep 25 2024.
Last update: 2022-12-19. Started: 2016-08-11.
The Text Interchange Formats and the tokenizers Package
Rendered from tif-and-tokenizers.Rmd using knitr::rmarkdown on Sep 25 2024.
Last update: 2022-09-23. Started: 2018-03-14.
Readme and manuals
Help Manual
Help page | Topics |
---|---|
Basic tokenizers | basic-tokenizers, tokenize_characters, tokenize_lines, tokenize_paragraphs, tokenize_regex, tokenize_sentences, tokenize_words |
Chunk text into smaller segments | chunk_text |
Count words, sentences, characters | count_characters, count_sentences, count_words |
The text of Moby Dick | mobydick |
N-gram tokenizers | ngram-tokenizers, tokenize_ngrams, tokenize_skip_ngrams |
Character shingle tokenizers | tokenize_character_shingles |
Penn Treebank tokenizer | tokenize_ptb |
Word stem tokenizer | tokenize_word_stems |
Tokenizers | tokenizers-package, tokenizers |
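Tying several of these help topics together, a short hedged sketch using the bundled `mobydick` dataset (assumed here to be a length-one character vector holding the full text of the novel):

```r
library(tokenizers)

# Load the bundled text of Moby Dick (assumption: a length-one
# character vector).
data("mobydick", package = "tokenizers")

# Total word count of the novel.
count_words(mobydick)

# Penn Treebank-style tokens from the opening of the text.
head(tokenize_ptb(mobydick)[[1]], 10)

# Word stem and character shingle tokenizers follow the same
# list-of-character-vectors convention.
head(tokenize_word_stems(mobydick)[[1]], 10)
head(tokenize_character_shingles("Moby Dick", n = 3)[[1]])
```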