| Title: | Download and Process Public Domain Works from Project Gutenberg |
|---|---|
| Description: | Download and process public domain works in the Project Gutenberg collection <https://www.gutenberg.org/>. Includes metadata for all Project Gutenberg works, so that they can be searched and retrieved. |
| Authors: | Jordan Bradford [aut, cre] (ORCID: <https://orcid.org/0009-0000-8570-3474>), Jon Harmon [aut] (ORCID: <https://orcid.org/0000-0003-4781-4346>), Myfanwy Johnston [aut], David Robinson [aut, cph] |
| Maintainer: | Jordan Bradford <[email protected]> |
| License: | GPL-2 |
| Version: | 0.4.1.9000 |
| Built: | 2026-02-20 08:33:33 UTC |
| Source: | https://github.com/ropensci/gutenbergr |
Identifies section markers (chapters, cantos, letters, etc.) in Project Gutenberg texts and adds a column indicating which section each line belongs to. Sections are forward-filled, so all text between markers belongs to the previous section.
gutenberg_add_sections(data, pattern, ignore_case = TRUE, format_fn = NULL, group_by = "auto", section_col = "section")
data |
A tibble::tibble with a |
pattern |
A regex pattern to identify headers. Must match the specific formatting of your book. See Details and Examples for common patterns. |
ignore_case |
Logical; should pattern matching be case-insensitive?
Default is TRUE. |
format_fn |
Optional function to format section text. Receives the matched text and returns formatted text. Common options include stringr::str_to_title and stringr::str_to_upper but a custom function can also be provided. |
group_by |
Character vector of column names to group by before filling
sections, or |
section_col |
Character string specifying the name of the section column
to create. Defaults to "section". |
Different books use different formatting for their section markers. Here are patterns for common formats:
Chapters with Roman numerals: "^Chapter [IVXLCDM]+"
Chapters with Arabic numerals: "^Chapter [0-9]+"
Plays with both Roman and Arabic numerals: "^(ACT|SCENE) [IVXLCDM0-9]+"
Books (e.g., Paradise Lost): "^BOOK [IVXLCDM]+"
Cantos (e.g., Dante's Inferno): "^CANTO [IVXLCDM]+"
Staves (e.g., A Christmas Carol): "^STAVE [IVXLCDM]+"
Multiple formats (e.g., Frankenstein): "^(Letter|Chapter) [0-9]+"
Use gutenberg_works() to search for books and examine a few lines with
gutenberg_download() to determine the exact format before writing your pattern.
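Before committing to a pattern, it can help to try it against a few representative lines. The sketch below uses base R's grepl() as a stand-in for whatever matching gutenberg_add_sections() performs internally (an assumption), and the sample lines are made up for illustration.

```r
# Toy lines standing in for the text column of a downloaded book (made up).
lines <- c(
  "CANTO I",
  "Midway upon the journey of our life",
  "Chapter 12",
  "chapter iv",
  "  CHAPTER XXI."
)

# Patterns from the list above, plus a variant that tolerates leading spaces.
patterns <- c(
  roman  = "^Chapter [IVXLCDM]+",
  arabic = "^Chapter [0-9]+",
  canto  = "^CANTO [IVXLCDM]+",
  padded = "^\\s*CHAPTER [IVXLCDM]+"
)

# For each pattern, report which lines match when case is ignored.
hits <- lapply(patterns, function(p) {
  which(grepl(p, lines, ignore.case = TRUE, perl = TRUE))
})
hits
```

With case-insensitive matching, a class such as [IVXLCDM] also accepts lowercase letters, so a prose line that begins "Chapter" followed by a word starting with one of those letters would match too. Anchoring the end of the pattern or setting ignore_case = FALSE may be worth considering for such books.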
A tibble::tibble with an added column named according to
section_col, containing the section marker for each row. Rows before the
first section marker will have NA.
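The forward-fill behaviour can be illustrated in base R. This is a minimal sketch of the idea, not the package's implementation; the text vector is made up.

```r
# Toy text column (made up): prose and blank lines sit between markers.
text <- c("Preface line", "CANTO I", "first line", "", "CANTO II", "second line")

# Step 1: flag the rows that match the section pattern.
is_marker <- grepl("^CANTO [IVXLCDM]+", text)

# Step 2: count the markers seen so far; 0 means "before the first marker".
marker_index <- cumsum(is_marker)

# Step 3: look up the marker text for each row; index 0 becomes NA.
markers <- text[is_marker]
section <- markers[ifelse(marker_index == 0, NA, marker_index)]

data.frame(text = text, section = section)
```

Every row after a marker inherits that marker until the next one appears, and the row before the first marker is NA.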
    # Dante's "Inferno" - Cantos with Roman numerals
    inferno <- gutenberg_download(1001) |>
      gutenberg_add_sections(pattern = "^CANTO [IVXLCDM]+")

    # Mary Shelley's "Frankenstein"
    # Letters and Chapters with Arabic numerals, normalized to title case
    frankenstein <- gutenberg_download(84) |>
      gutenberg_add_sections(
        pattern = "^(Letter|Chapter) [0-9]+",
        format_fn = stringr::str_to_title
      )

    # Classic Brontë sisters' works
    # Chapters with Roman numerals, with trailing periods removed from section text
    # Consider using `options(gutenbergr_cache_type = "persistent")`
    # to prevent redownloading in the future.
    bronte_sisters <- gutenberg_download(
      c(
        767, # "Agnes Grey" by Anne Brontë
        768, # "Wuthering Heights" by Emily Brontë
        969, # "The Tenant of Wildfell Hall" by Anne Brontë
        1260, # "Jane Eyre" by Charlotte Brontë
        9182 # "Villette" by Charlotte Brontë
      ),
      meta_fields = c("author", "title")
    ) |>
      gutenberg_add_sections(
        pattern = "^\\s*CHAPTER [IVXLCDM]+",
        format_fn = function(x) stringr::str_remove(x, "\\.$")
      )

    # Leo Tolstoy's "War and Peace"
    # Add two custom named columns for hierarchical sections
    war_and_peace <- gutenberg_download(2600) |>
      gutenberg_add_sections(
        pattern = "^BOOK [A-Z]+",
        section_col = "book"
      ) |>
      gutenberg_add_sections(
        pattern = "^CHAPTER [IVXLCDM]+",
        section_col = "chapter"
      )
Data frame with metadata about each author of a Project Gutenberg work. Although the Project Gutenberg raw data also includes metadata on contributors, editors, illustrators, etc., this dataset contains only people who have been the single author of at least one work.
gutenberg_authors
A tibble::tibble() with one row for each
author, with the columns:
gutenberg_author_id: Unique identifier for the author that can be used to join with the gutenberg_metadata dataset
author: The agent_name field from the original metadata
alias: Alias
birthdate: Year of birth
deathdate: Year of death
wikipedia: Link to Wikipedia article on the author. If there are multiple, they are "|"-delimited
aliases: Character vector of aliases. If there are multiple, they are "/"-delimited
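The two delimited fields can be split into lists with base R. A minimal sketch, using made-up values and hypothetical variable names (wiki_links, alias_field) rather than the real dataset:

```r
# Made-up values in the two delimited formats described above.
wiki_links  <- c("https://example.org/a|https://example.org/b", NA, "https://example.org/c")
alias_field <- c("Alias One/Alias Two", "Solo Alias", NA)

# fixed = TRUE matters for "|", which is otherwise a regex alternation operator.
link_list  <- strsplit(wiki_links, "|", fixed = TRUE)
alias_list <- strsplit(alias_field, "/", fixed = TRUE)

# Count entries per author, treating a missing field as zero entries.
n_links   <- ifelse(is.na(wiki_links), 0L, lengths(link_list))
n_aliases <- ifelse(is.na(alias_field), 0L, lengths(alias_list))
```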
To find the date on which this metadata was last updated,
run attr(gutenberg_authors, "date_updated").
gutenberg_metadata, gutenberg_subjects
    # See date last updated
    attr(gutenberg_authors, "date_updated")
Deletes all cached .rds files in the directory currently returned by
gutenberg_cache_dir().
gutenberg_cache_clear_all(verbose = TRUE)
verbose |
Whether to show the status message confirming the path. |
The number of files deleted (invisibly).
    # Clear entire current cache
    gutenberg_cache_clear_all()
Calculates the path to the directory where Gutenberg files are stored,
based on the current gutenbergr_cache_type and gutenbergr_base_cache_dir
options.
gutenberg_cache_dir()
A character string representing the path to the cache directory.
The following options control caching behavior:
gutenbergr_cache_type: Character string indicating how downloaded works
are cached. Must be either "session" (default) or "persistent".
gutenbergr_base_cache_dir: Base directory used for persistent caching when
gutenbergr_cache_type = "persistent".
By default, this is an OS-specific cache directory determined by
tools::R_user_dir("gutenbergr", "cache"). Advanced users may set this
to a custom path.
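One way the two options could combine into a path is sketched below. The helper sketch_cache_dir() is hypothetical and is not the package's code; in particular, the location used for the "session" type (a folder under tempdir()) is an assumption.

```r
# Hypothetical helper: resolve a cache directory from the two options.
sketch_cache_dir <- function() {
  type <- getOption("gutenbergr_cache_type", "session")
  if (!type %in% c("session", "persistent")) {
    stop('gutenbergr_cache_type must be "session" or "persistent"')
  }
  if (identical(type, "persistent")) {
    # Fall back to the OS-specific user cache directory when no base is set.
    getOption(
      "gutenbergr_base_cache_dir",
      tools::R_user_dir("gutenbergr", "cache")
    )
  } else {
    # Assumption: a session cache lives under the per-session temp directory.
    file.path(tempdir(), "gutenbergr_cache")
  }
}

# Usage: point persistent caching at a custom (made-up) base directory.
old <- options(
  gutenbergr_cache_type = "persistent",
  gutenbergr_base_cache_dir = file.path(tempdir(), "my_books")
)
persistent_path <- sketch_cache_dir()
options(gutenbergr_cache_type = "session")
session_path <- sketch_cache_dir()
options(old)  # restore the previous option values
```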
    # Get current cache directory
    gutenberg_cache_dir()
Provides a detailed list of files currently stored in the directory
returned by gutenberg_cache_dir().
gutenberg_cache_list(verbose = TRUE)
verbose |
Whether to show the status message showing the cache directory path. |
A tibble::tibble() with the following columns:
The title of the work.
The author(s) of the work.
The filename.
Size of the file in megabytes.
The last modification time.
The file's absolute path.
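The file-level columns can be approximated with base R tools. The sketch below builds a throwaway directory of .rds files and lists them; the column names (file, size_mb, modified, path) and the megabyte convention are illustrative assumptions, not the function's documented output.

```r
# Stand-in for a cache directory: a fresh temporary folder.
cache_dir <- tempfile("toy_cache_")
dir.create(cache_dir)

# Write two small .rds files named after made-up IDs.
saveRDS(letters, file.path(cache_dir, "1.rds"))
saveRDS(LETTERS, file.path(cache_dir, "2.rds"))

# List the .rds files and collect basic file details.
paths <- list.files(cache_dir, pattern = "\\.rds$", full.names = TRUE)
info <- file.info(paths)
listing <- data.frame(
  file = basename(paths),
  size_mb = info$size / 1024^2, # assumption: 1 MB = 1024^2 bytes
  modified = info$mtime,
  path = normalizePath(paths),
  stringsAsFactors = FALSE
)

# Delete the files and count the successful removals.
n_deleted <- sum(file.remove(paths))
```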
    # List all works in the currently set cache
    gutenberg_cache_list()

    # Suppress the directory path message
    gutenberg_cache_list(verbose = FALSE)
Deletes the cached files for specific Gutenberg IDs from the current cache.
gutenberg_cache_remove_ids(ids, verbose = TRUE)
ids |
A numeric or character vector of Gutenberg IDs to remove from the current cache. |
verbose |
Whether to show the status messages. |
The number of files successfully deleted (invisibly).
    # Remove specific books from cache
    gutenberg_cache_remove_ids(c(1, 2))

    # Remove silently
    gutenberg_cache_remove_ids(1, verbose = FALSE)
Configures whether the cache should be temporary (per-session) or persistent across sessions.
gutenberg_cache_set(type = getOption("gutenbergr_cache_type", "session"), verbose = TRUE)
type |
Either "session" or "persistent". |
verbose |
Whether to show the status message confirming the path. |
The active cache path (invisibly).
The following options control caching behavior:
gutenbergr_cache_type: Character string indicating how downloaded works
are cached. Must be either "session" (default) or "persistent".
gutenbergr_base_cache_dir: Base directory used for persistent caching when
gutenbergr_cache_type = "persistent".
By default, this is an OS-specific cache directory determined by
tools::R_user_dir("gutenbergr", "cache"). Advanced users may set this
to a custom path.
    # Set to persistent (survives R sessions)
    gutenberg_cache_set("persistent")

    # Set back to session cache (temporary)
    gutenberg_cache_set("session")

    # Check current cache location
    gutenberg_cache_dir()
Download one or more works by their Project Gutenberg IDs into a data frame
with one row per line per work. This can be used to download a single work of
interest or multiple at a time. You can look up the Gutenberg IDs of a work
using gutenberg_works() or the gutenberg_metadata dataset.
gutenberg_download(gutenberg_id, mirror = gutenberg_get_mirror(verbose = verbose), strip = TRUE, meta_fields = character(), verbose = TRUE, use_cache = TRUE)
gutenberg_id |
A vector of Project Gutenberg IDs, or a data frame
containing a |
mirror |
A mirror URL to retrieve the books from. By default uses the
mirror from gutenberg_get_mirror(). |
strip |
Whether to strip suspected headers and footers using
gutenberg_strip(). |
meta_fields |
Additional fields describing each book, such as |
verbose |
Whether to show messages about the Project Gutenberg mirror that was chosen. |
use_cache |
Whether to use caching. Defaults to TRUE. |
A two-column tbl_df (see tibble::tibble()) with one row for each
line of the text or texts, with columns:
gutenberg_id: Integer column with the Project Gutenberg ID of each text
text: A character vector of lines of text
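The "one row per line per work" shape can be pictured without downloading anything. The sketch below stacks two made-up texts into a plain data frame with the same two columns; the IDs are reused from the examples only as labels.

```r
# Made-up stand-ins for the lines of two works, keyed by ID.
texts <- list(
  `768`  = c("line a", "line b", "line c"),
  `1260` = c("line x", "line y")
)

# Repeat each ID once per line, then stack the lines in order.
rows <- data.frame(
  gutenberg_id = rep(as.integer(names(texts)), lengths(texts)),
  text = unlist(texts, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Lines per work, analogous to counting rows by ID or title.
lines_per_work <- table(rows$gutenberg_id)
```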
    # Download "The Count of Monte Cristo"
    gutenberg_download(1184)

    # Download two books: "Wuthering Heights" and "Jane Eyre"
    books <- gutenberg_download(c(768, 1260), meta_fields = "title")
    books
    dplyr::count(books, title)

    # Download all books from Jane Austen
    austen <- gutenberg_works(author == "Austen, Jane") |>
      gutenberg_download(meta_fields = "title")
    austen
    dplyr::count(austen, title)
Get all mirror data from https://www.gutenberg.org/MIRRORS.ALL. This only includes mirrors reported to Project Gutenberg and verified to be relatively stable. For more information on mirroring and getting your own mirror listed, see https://www.gutenberg.org/help/mirroring.html.
gutenberg_get_all_mirrors()
A tibble::tibble() of Project Gutenberg mirrors and related data,
or NULL (invisibly) if the mirror list cannot be retrieved or parsed.
If a tibble::tibble() is returned, it contains:
Continent where the mirror is located
Nation where the mirror is located
Location of the mirror
Provider of the mirror
URL of the mirror
Special notes
gutenberg_get_all_mirrors()
Get the recommended mirror for Gutenberg files and set the global
gutenberg_mirror option.
gutenberg_get_mirror(verbose = TRUE)
verbose |
Whether to show messages about the Project Gutenberg mirror that was chosen. |
A character vector with the URL for the chosen mirror.
gutenberg_get_mirror()
Data frame with metadata about the languages of each Project Gutenberg work.
gutenberg_languages
A tibble::tibble() with one row for each
work-language pair, with the columns:
Unique identifier for the work that can be used to join with the gutenberg_metadata dataset
Language ISO 639 code. Two letter code if one exists, otherwise three letter.
Number of languages for this work.
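The work-language pairing can be related to the "/"-separated language strings described for gutenberg_metadata. The sketch below expands such strings into one row per pair; the input table and its column names (id, language, n_languages) are made up for illustration and are not the dataset's actual columns.

```r
# Made-up works with "/"-separated language codes.
works <- data.frame(
  id = c(1L, 2L),
  language = c("en", "en/fr"),
  stringsAsFactors = FALSE
)

# Split each string into its codes, then emit one row per (work, language).
parts <- strsplit(works$language, "/", fixed = TRUE)
pairs <- data.frame(
  id = rep(works$id, lengths(parts)),
  language = unlist(parts),
  n_languages = rep(lengths(parts), lengths(parts)),
  stringsAsFactors = FALSE
)
pairs
```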
To find the date on which this metadata was last updated,
run attr(gutenberg_languages, "date_updated").
gutenberg_metadata, gutenberg_subjects
    # See date last updated
    attr(gutenberg_languages, "date_updated")
Selected fields of metadata about each of the Project Gutenberg works.
gutenberg_metadata
A tibble::tibble() with one row for each work in Project
Gutenberg and the following columns:
gutenberg_id: Numeric ID, used to retrieve works from Project Gutenberg
title: Title
author: Author, if a single one given. Given as last name first (e.g. "Doyle, Arthur Conan")
gutenberg_author_id: Project Gutenberg author ID
language: Language ISO 639 code, separated by / if multiple. Two letter code if one exists, otherwise three letter. See https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
gutenberg_bookshelf: Which collection or collections this is found in, separated by / if multiple
rights: Generally one of three options: "Public domain in the USA." (the most common by far), "Copyrighted. Read the copyright notice inside this book for details.", or "None"
has_text: Whether there is a file containing digits followed by
.txt in Project Gutenberg for this record (as opposed to, for
example, audiobooks). If not, the work cannot be retrieved with
gutenberg_download().
To find the date on which this metadata was last updated, run
attr(gutenberg_metadata, "date_updated").
gutenberg_works(), gutenberg_authors, gutenberg_subjects
    library(dplyr)
    library(stringr)

    gutenberg_metadata

    gutenberg_metadata |>
      count(author, sort = TRUE)

    # Look for Shakespeare, excluding collections (containing "Works") and
    # translations
    shakespeare_metadata <- gutenberg_metadata |>
      filter(
        author == "Shakespeare, William",
        language == "en",
        !str_detect(title, "Works"),
        has_text,
        !str_detect(rights, "Copyright")
      ) |>
      distinct(title)

    # Note that the gutenberg_works() function filters for English
    # non-copyrighted works and does de-duplication by default:
    shakespeare_metadata2 <- gutenberg_works(
      author == "Shakespeare, William",
      !str_detect(title, "Works")
    )

    # See date last updated
    attr(gutenberg_metadata, "date_updated")
Strip header and footer content from a Project Gutenberg book. This is based on formatting heuristics (regular expression guesses), so it may not be perfect.
gutenberg_strip(text)
text |
A character vector where each element is a line of a book. |
This function identifies the Project Gutenberg "start" and "end" markers. It also attempts to strip out initial metadata paragraphs (such as "Produced by...", "Transcribed from...", etc.).
Note that this will not strip:
Tables of contents
Prologues or introductions
Other author-written text that appears at the start of a book
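The marker-based approach can be sketched on a toy text. This is a simplified illustration of the idea, not the heuristics gutenberg_strip() actually uses; the sample lines and the regular expressions are assumptions.

```r
# Made-up raw text with start/end markers and a metadata paragraph.
raw <- c(
  "The Project Gutenberg eBook of A Toy Book",
  "",
  "*** START OF THE PROJECT GUTENBERG EBOOK A TOY BOOK ***",
  "Produced by An Imaginary Volunteer",
  "",
  "CHAPTER I",
  "It was a short book.",
  "*** END OF THE PROJECT GUTENBERG EBOOK A TOY BOOK ***",
  "Licence text follows..."
)

# Locate the first start marker and the first end marker.
start <- grep("^\\*\\*\\* ?START OF", raw)[1]
end <- grep("^\\*\\*\\* ?END OF", raw)[1]
stopifnot(!is.na(start), !is.na(end), start < end)

# Keep only the lines strictly between the two markers.
body <- raw[(start + 1):(end - 1)]

# Drop leading blank lines and "Produced by"/"Transcribed from" style lines.
while (length(body) > 0 &&
  (body[1] == "" || grepl("^(Produced|Transcribed) ", body[1]))) {
  body <- body[-1]
}
body
```

A table of contents or an introduction would survive this kind of stripping, since nothing distinguishes it from the rest of the author's text.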
A character vector with Project Gutenberg headers and footers removed.
    library(dplyr)

    # Download a book without stripping to see the headers
    book <- gutenberg_works(title == "Pride and Prejudice") |>
      gutenberg_download(strip = FALSE)

    # Look at the raw header and footer
    head(book$text, 20)
    tail(book$text, 20)

    # Manually strip the text
    text_stripped <- gutenberg_strip(book$text)

    # Check the cleaned results
    head(text_stripped, 10)
    tail(text_stripped, 10)
Gutenberg metadata about the subject of each work, particularly Library of Congress Classifications (lcc) and Library of Congress Subject Headings (lcsh).
gutenberg_subjects
A tibble::tibble() with one row for each pairing
of work and subject, with columns:
gutenberg_id: ID describing a work that can be joined with gutenberg_metadata
subject_type: Either "lcc" (Library of Congress Classification) or "lcsh" (Library of Congress Subject Headings)
subject: Subject
Find more information about Library of Congress Classifications here: https://www.loc.gov/catdir/cpso/lcco/, and about Library of Congress Subject Headings here: https://id.loc.gov/authorities/subjects.html.
To find the date on which this metadata was last updated,
run attr(gutenberg_subjects, "date_updated").
gutenberg_metadata, gutenberg_authors
    library(dplyr)
    library(stringr)

    gutenberg_subjects |>
      filter(subject_type == "lcsh") |>
      count(subject, sort = TRUE)

    sherlock_holmes_subjects <- gutenberg_subjects |>
      filter(str_detect(subject, "Holmes, Sherlock"))
    sherlock_holmes_subjects

    sherlock_holmes_metadata <- gutenberg_works() |>
      filter(author == "Doyle, Arthur Conan") |>
      semi_join(sherlock_holmes_subjects, by = "gutenberg_id")
    sherlock_holmes_metadata

    holmes_books <- gutenberg_download(sherlock_holmes_metadata$gutenberg_id)
    holmes_books

    # See date last updated
    attr(gutenberg_subjects, "date_updated")
Get a table of Gutenberg work metadata that has been filtered by some common (settable) defaults, along with the option to add additional filters. This function is for convenience when working with common conditions when pulling a set of books to analyze. For more detailed filtering of the entire Project Gutenberg metadata, use the gutenberg_metadata and related datasets.
gutenberg_works(..., languages = "en", only_text = TRUE, rights = c("Public domain in the USA.", "None"), distinct = TRUE, all_languages = FALSE, only_languages = TRUE)
... |
Additional filters, given as expressions using the variables in
the gutenberg_metadata dataset (e.g. |
languages |
Vector of languages to include. |
only_text |
Whether the works must have Gutenberg text attached. Works
without text (e.g. audiobooks) cannot be downloaded with
gutenberg_download(). |
rights |
Values to allow in the |
distinct |
Whether to return only one distinct combination of each title
and |
all_languages |
Whether, if multiple languages are given, all of them
need to be present in a work. For example, if |
only_languages |
Whether to exclude works that have other languages
besides the ones provided. For example, whether to include |
By default, returns:
English-language works.
Works that are in text format in Gutenberg (as opposed to audio).
Works whose text is not under copyright.
At most one record for each title/author pair.
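The interaction between languages, all_languages and only_languages can be made concrete with a small base R function. The helper match_languages() below is hypothetical and encodes one reading of the argument descriptions; it is not the package's filtering code.

```r
# Hypothetical helper: decide which "/"-separated language strings pass.
match_languages <- function(language, languages,
                            all_languages = FALSE, only_languages = TRUE) {
  parts <- strsplit(language, "/", fixed = TRUE)
  vapply(parts, function(p) {
    present <- languages %in% p
    # all_languages = TRUE requires every requested language to be present.
    wanted <- if (all_languages) all(present) else any(present)
    # only_languages = TRUE rejects works with languages outside the request.
    no_extras <- all(p %in% languages)
    wanted && (!only_languages || no_extras)
  }, logical(1))
}

lang <- c("en", "fr", "en/fr", "en/fr/de")
default_match <- match_languages(lang, c("en", "fr"))
all_match <- match_languages(lang, c("en", "fr"), all_languages = TRUE)
loose_match <- match_languages(lang, c("en", "fr"), only_languages = FALSE)
```

Under this reading, the defaults keep "en", "fr" and "en/fr" but drop "en/fr/de", while all_languages = TRUE keeps only "en/fr".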
A tibble::tibble() with one row for each work, in the same format
as gutenberg_metadata.
    library(dplyr)

    # Default: English, text-based, public domain works
    gutenberg_works()

    # Filter conditions using ...
    gutenberg_works(author == "Shakespeare, William")

    # Language specifications
    gutenberg_works(languages = "es") |>
      count(language, sort = TRUE)

    # Filter for works that are specifically English AND French
    gutenberg_works(languages = c("en", "fr"), all_languages = TRUE)
A tibble::tibble() of book text for two sample books, generated using
gutenberg_download().
sample_books
A tibble::tibble() with one row for each
line of text from each book, with columns:
gutenberg_id: Unique identifier for the work that can be used to join with the gutenberg_metadata dataset.
text: A character vector of lines of text.
title: The title of this work.
author: The author of this work.
This code was used to download the books:
gutenberg_download(c(109, 105), meta_fields = c("title", "author"))