Package: tokenizers 0.3.1
tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
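A minimal sketch of that consistent interface (example text invented for illustration): each `tokenize_*` function takes a character vector and returns a list of character vectors, one element per input document.

```r
library(tokenizers)

docs <- c("The quick brown fox.", "It jumped over the lazy dog!")

# One list element per input document; by default words are
# lowercased and punctuation is stripped.
tokenize_words(docs)
#> [[1]]
#> [1] "the"   "quick" "brown" "fox"
#>
#> [[2]]
#> [1] "it"     "jumped" "over"   "the"    "lazy"   "dog"
```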
Authors:
Downloads:
- Source: tokenizers_0.3.1.tar.gz
- Windows binaries: tokenizers_0.3.1.zip (r-4.5), tokenizers_0.3.1.zip (r-4.4), tokenizers_0.3.1.zip (r-4.3)
- macOS binaries: tokenizers_0.3.1.tgz (r-4.4-x86_64), tokenizers_0.3.1.tgz (r-4.4-arm64), tokenizers_0.3.1.tgz (r-4.3-x86_64), tokenizers_0.3.1.tgz (r-4.3-arm64)
- Linux binaries: tokenizers_0.3.1.tar.gz (r-4.5-noble), tokenizers_0.3.1.tar.gz (r-4.4-noble)
- WebAssembly binaries: tokenizers_0.3.1.tgz (r-4.4-emscripten), tokenizers_0.3.1.tgz (r-4.3-emscripten)
Documentation: tokenizers.pdf | tokenizers.html
API: tokenizers/json
NEWS
# Install 'tokenizers' in R:
install.packages('tokenizers', repos = c('https://packages.ropensci.org', 'https://cloud.r-project.org'))
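After installing, a quick smoke test (a minimal sketch; the example sentence is arbitrary):

```r
library(tokenizers)

# count_words() returns one integer per input document.
count_words("Convert natural language text into tokens.")
#> [1] 6
```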
Bug tracker: https://github.com/ropensci/tokenizers/issues
Datasets:
- mobydick: The text of Moby Dick
Keywords: nlp, peer-reviewed, text-mining, tokenizer
Last updated 7 months ago from b80863d088 (on master). Checks: OK: 1, NOTE: 8. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Sep 25 2024 |
R-4.5-win-x86_64 | NOTE | Sep 25 2024 |
R-4.5-linux-x86_64 | NOTE | Sep 25 2024 |
R-4.4-win-x86_64 | NOTE | Sep 25 2024 |
R-4.4-mac-x86_64 | NOTE | Sep 25 2024 |
R-4.4-mac-aarch64 | NOTE | Sep 25 2024 |
R-4.3-win-x86_64 | NOTE | Sep 25 2024 |
R-4.3-mac-x86_64 | NOTE | Sep 25 2024 |
R-4.3-mac-aarch64 | NOTE | Sep 25 2024 |
Exports: chunk_text, count_characters, count_sentences, count_words, tokenize_character_shingles, tokenize_characters, tokenize_lines, tokenize_ngrams, tokenize_paragraphs, tokenize_ptb, tokenize_regex, tokenize_sentences, tokenize_skip_ngrams, tokenize_word_stems, tokenize_words
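As a hedged sketch of two of these exports (texts invented for illustration): `chunk_text()` splits a longer text into documents of equal word counts, and `tokenize_ngrams()` with `n_min < n` yields shingled n-grams.

```r
library(tokenizers)

# Split 100 words into four documents of 25 words each.
long_text <- paste(rep("one two three four five", 20), collapse = " ")
chunks <- chunk_text(long_text, chunk_size = 25)
length(chunks)               # 4 chunks
count_words(unlist(chunks))  # 25 words in each chunk

# Shingled n-grams: every n-gram from length n_min up to length n.
tokenize_ngrams("one two three four", n = 3, n_min = 2)[[1]]
```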
Introduction to the tokenizers Package
Rendered from introduction-to-tokenizers.Rmd using knitr::rmarkdown on Sep 25 2024.
Last update: 2022-12-19. Started: 2016-08-11.
The Text Interchange Formats and the tokenizers Package
Rendered from tif-and-tokenizers.Rmd using knitr::rmarkdown on Sep 25 2024.
Last update: 2022-09-23. Started: 2018-03-14.
Readme and manuals
Help Manual
Help page | Topics |
---|---|
Basic tokenizers | basic-tokenizers, tokenize_characters, tokenize_lines, tokenize_paragraphs, tokenize_regex, tokenize_sentences, tokenize_words |
Chunk text into smaller segments | chunk_text |
Count words, sentences, characters | count_characters, count_sentences, count_words |
The text of Moby Dick | mobydick |
N-gram tokenizers | ngram-tokenizers, tokenize_ngrams, tokenize_skip_ngrams |
Character shingle tokenizers | tokenize_character_shingles |
Penn Treebank tokenizer | tokenize_ptb |
Word stem tokenizer | tokenize_word_stems |
Tokenizers | tokenizers-package, tokenizers |
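Tying several of these help topics together, a short hedged sketch using the bundled `mobydick` dataset (assumed here to be a length-one character vector holding the full text of the novel):

```r
library(tokenizers)

# Load the bundled text of Moby Dick (assumption: a length-one
# character vector).
data("mobydick", package = "tokenizers")

# Total word count of the novel.
count_words(mobydick)

# Penn Treebank-style tokens from the opening of the text.
head(tokenize_ptb(mobydick)[[1]], 10)

# Word stem and character shingle tokenizers follow the same
# list-of-character-vectors convention.
head(tokenize_word_stems(mobydick)[[1]], 10)
head(tokenize_character_shingles("Moby Dick", n = 3)[[1]])
```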