--- title: "Introduction to rtika" author: "Sasha Goodman" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to rtika} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibliography.bib --- ```{r setup, include = FALSE} # only evaluate code if "NOT_CRAN" NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true") knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) if(NOT_CRAN){ if(is.na(rtika::tika_jar())){ rtika::install_tika() } } ``` # A Digital Babel Fish ``` .----. ____ __\\\\\\__ \___'--" .-. /___>------rtika '0' /____,--. \----B "".____ ___-" // / / ]/ ``` Apache Tika is similar to the Babel fish in Douglas Adam's book, "The Hitchhikers' Guide to the Galaxy" [@mattmann2011tika p. 3]. The Babel fish translates any natural language to any other. While Apache Tika does not yet translate natural languages, it starts to tame the tower of babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows a computer to extract text and objects from Microsoft Word. The world of digital file formats is like a place where each community has their own language. Academic, business, government, and online communities use anywhere from a few file types to thousands. Unfortunately, attempts to unify groups around a single format are often fruitless [@mattmann2013computing]. This plethora of document formats has become a common concern. Tika is a common library to address this issue. Starting in Apache Nutch in 2005, Tika became its own project in 2007 and then a component of other Apache projects including Lucene, Jackrabbit, Mahout, and Solr [@mattmann2011tika p. 17]. With the increased volume of data in digital archives, and terabyte sized data becoming common, Tika's design goals include keeping complexity at bay, low memory consumption, and fast processing [@mattmann2011tika p. 18]. The `rtika` package is an interface to Apache Tika that leverages Tika's batch processor module to parse many documents fairly efficiently. Therefore, I recommend using batches whenever possible. # Extract Plain Text Video, sound and images are important, and yet much meaningful data remains numeric or textual. Tika can parse many formats and extract alpha-numeric characters, along with a few characters to control the arrangement of text, like line breaks. I recommend an analyst start with a directory on the computer and get a vector of paths to each file using `base::list.files()`. The commented code below has a recipe. Here, I use test files that are included with the package. ```{r, eval=NOT_CRAN} library('rtika') library('magrittr') # Code to get ALL the files in my_path: # my_path <- "~" # batch <- file.path(my_path, # list.files(path = my_path, # recursive = TRUE)) # pipe the batch into tika_text() # to get plain text # test files batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) text <- batch %>% tika_text() # normal syntax also works: # text <- tika_text(batch) ``` The output is a R character vector of the same length and order as the input files. 
In the example above, there are several seconds of overhead to start up the Tika batch processor and then process the output. Most of that cost falls on the first file, which is why it is the most expensive one. Large batches are parsed more quickly. For example, when parsing thousands of 1-5 page Word documents, I've measured 1/100th of a second per document on average.

Occasionally, files are not parsable and the returned value for the file will be `NA`. Causes include corrupt files, disk input/output issues, empty files, password protection, unhandled formats, broken document structures, or unexpected variations within a format. These issues should be rare. Tika works well on most documents, but in a very large archive a small percentage of files may be unparsable, and you might want to handle those.

```{r, eval=NOT_CRAN}
# Find which files had an issue
# Handle them if needed
batch[which(is.na(text))]
```

Plain text is easy to search using `base::grep()`.

```{r, eval=NOT_CRAN}
length(text)

search <- text[grep(pattern = ' is ', x = text)]

length(search)
```

With plain text, a variety of interesting analyses are possible, ranging from word counting to constructing matrices for deep learning. Much of this text processing is handled easily with the well documented `tidytext` package [@silge2017text]. Among other things, it handles tokenization and creating term-document matrices.
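As a starting point, here is a minimal sketch of a `tidytext` workflow that tokenizes the extracted text into words and counts them per file. It assumes the `tidytext` and `dplyr` packages are installed, so it is shown but not run here.

```{r, eval=FALSE}
library('tidytext')
library('dplyr')

# one row per document, then one row per word, then word counts per file
word_counts <-
  tibble(file = basename(batch), text = text) %>%
  filter(!is.na(text)) %>%
  unnest_tokens(word, text) %>%
  count(file, word, sort = TRUE)

word_counts
```

From there, functions such as `tidytext::cast_dtm()` can turn the counts into a document-term matrix.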
# Preserve Content-Type when Downloading

A general suggestion is to use `tika_fetch()` when downloading files from the Internet, to preserve the server's Content-Type information in a file extension. Tika's Content-Type detection is improved by file extensions (Tika also relies on other signals, such as magic bytes, which are distinctive byte patterns in the file header). The `tika_fetch()` function tries to preserve the Content-Type reported by the download server by finding a matching extension in Tika's database.

```{r, eval=NOT_CRAN}
download_directory <- tempfile('rtika_')

dir.create(download_directory)

urls <- c('https://tika.apache.org/',
          'https://cran.rstudio.com/web/packages/keras/keras.pdf')

downloaded <- urls %>%
  tika_fetch(download_directory)

# it will add the appropriate file extension to the downloads
downloaded
```

This `tika_fetch()` function is used internally by the `tika()` functions when processing URLs. By calling `tika_fetch()` explicitly with a specified directory, you can also save the files and return to them later.

# Settings for Big Datasets

Large jobs are possible with `rtika`. However, with hundreds of thousands of documents, the R object returned by the `tika()` functions can be too big for RAM. In such cases, it is better to lean on the computer's disk, since running out of RAM slows the computer down. I suggest changing two parameters in any of the `tika()` parsers. First, set `return = FALSE` to prevent returning a big R character vector of text. Second, specify an existing directory on the file system using `output_dir`, pointing to where the processed files will be saved. The files can be dealt with in smaller batches later on, as sketched below. Another option is to increase the number of threads, setting `threads` to something like the number of processors minus one.

```{r, eval=NOT_CRAN}
# create a directory not already in use.
my_directory <- tempfile('rtika_')

dir.create(my_directory)

# pipe the batch to tika_text()
batch %>%
  tika_text(threads = 4,
            return = FALSE,
            output_dir = my_directory)

# list all the file locations
processed_files <- file.path(
  normalizePath(my_directory),
  list.files(path = my_directory,
             recursive = TRUE)
)
```

The location of each file in `output_dir` follows a convention from the Apache Tika batch processor: the full path to each processed file mirrors the original file's path, only inside `output_dir`.

```{r, eval=NOT_CRAN}
processed_files
```

Note that `tika_text()` produces `.txt` files, `tika_xml()` produces `.xml` files, `tika_html()` produces `.html` files, and both `tika_json()` and `tika_json_text()` produce `.json` files.
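To illustrate working in smaller batches, here is a minimal sketch that reads a couple of the processed `.txt` files back into a character vector; the slice of `processed_files` is arbitrary, and only base R and `magrittr` are used.

```{r, eval=NOT_CRAN}
# read a small batch of the processed files back into R
text_again <- processed_files[1:2] %>%
  lapply(readLines, warn = FALSE) %>%
  lapply(paste, collapse = '\n') %>%
  unlist()

# number of characters recovered from each file
nchar(text_again)
```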
# Get a Structured XHTML Rendition

Plain text falls short for some purposes. For example, pagination might be important for selecting a particular page in a PDF. The Tika authors chose HTML as a universal format because it offers semantic elements that are common or familiar. The hyperlink, for instance, is represented in HTML by the anchor element `<a>` with the attribute `href`. The HTML in Tika preserves this metadata:

```{r, eval=NOT_CRAN}
library('xml2')

# get XHTML text
html <- batch %>%
  tika_html() %>%
  lapply(xml2::read_html)

# parse links from documents
links <- html %>%
  lapply(xml2::xml_find_all, '//a') %>%
  lapply(xml2::xml_attr, 'href')

sample(links[[1]], 10)
```

Each type of file has different information preserved by Tika's internal parsers. The particular aspects vary. Some notes:

* PDF files retain pagination, with each page starting with the XHTML element `<div class="page">`.
* PDFs retain hyperlinks in the anchor element `<a>` with the attribute `href`.
* Word and Excel documents retain tabular data as a `<table>` element. The `rvest` package has a function to get tables of data with `rvest::html_table()` (see the sketch after these notes).
* Multiple Excel sheets are preserved as multiple XHTML tables. Ragged tables, where rows have differing numbers of cells, are not supported.

Note that `tika_html()` and `tika_xml()` both produce the same strict form of HTML called XHTML, and either works essentially the same for all the documents I've tried.
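As a minimal sketch of table extraction, the chunk below pulls the table out of the Word document's XHTML rendition, which was parsed above into `html[[3]]` (the third file in `batch`, `table.docx`). It assumes the `rvest` package is installed, so it is shown but not run here.

```{r, eval=FALSE}
library('rvest')

# html[[3]] is the parsed XHTML rendition of table.docx
tables <- html[[3]] %>%
  rvest::html_table()

# a list with one data frame per <table> element
tables[[1]]
```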
# Access Metadata in the XHTML

The `tika_html()` and `tika_xml()` functions are focused on extracting strict, structured HTML as XHTML. In addition, metadata can be accessed in the `meta` tags of the XHTML. Common metadata fields include `Content-Type`, `Content-Length`, `Creation-Date`, and `Content-Encoding`.

```{r, eval=NOT_CRAN}
# Content-Type
html %>%
  lapply(xml2::xml_find_first, '//meta[@name="Content-Type"]') %>%
  lapply(xml2::xml_attr, 'content') %>%
  unlist()

# Creation-Date
html %>%
  lapply(xml2::xml_find_first, '//meta[@name="Creation-Date"]') %>%
  lapply(xml2::xml_attr, 'content') %>%
  unlist()
```

# Get Metadata in JSON Format

Metadata can also be accessed with `tika_json()` and `tika_json_text()`. Consider all that can be found from a single image:

```{r, eval=NOT_CRAN}
library('jsonlite')

# batch <- system.file("extdata", "calculator.jpg", package = "rtika")

# a list of data.frames
metadata <- batch %>%
  tika_json() %>%
  lapply(jsonlite::fromJSON)

# look at metadata for an image
str(metadata[[6]])
```

In addition, each specific format can have its own specialized metadata fields. For example, photos sometimes store latitude and longitude:

```{r, eval=NOT_CRAN}
metadata[[6]]$'geo:lat'

metadata[[6]]$'geo:long'
```

# Get Metadata from "Container" Documents

Some types of documents can have multiple objects within them. For example, a `.zip` archive may contain many other files. The `tika_json()` and `tika_json_text()` functions have a special ability that the others do not: they recurse into a container and examine each file within it. The Tika authors call the format `jsonRecursive` for this reason.

In the following example, I created a compressed archive of the Apache Tika homepage, using the command line programs `wget` and `zip`. The small archive includes the HTML page, its images, and required files.

```{r, eval=NOT_CRAN}
# wget gets a webpage and other files.
# sys::exec_wait('wget', c('--page-requisites', 'https://tika.apache.org/'))

# Put it all into a .zip file
# sys::exec_wait('zip', c('-r', 'tika.apache.org.zip', 'tika.apache.org'))

batch <- system.file("extdata", "tika.apache.org.zip", package = "rtika")

# a list of data.frames
metadata <- batch %>%
  tika_json() %>%
  lapply(jsonlite::fromJSON)

# The structure is very long. See it on your own with:
str(metadata)
```

Here are some of the main metadata fields of the recursive `json` output:

```{r, eval=NOT_CRAN}
# the 'X-TIKA:embedded_resource_path' field
embedded_resource_path <- metadata %>%
  lapply(function(x){ x$'X-TIKA:embedded_resource_path' })

embedded_resource_path
```

The `X-TIKA:embedded_resource_path` field tells you where in the document hierarchy each object resides. The first item in the character vector is the root, which is the container itself. The other items are embedded one layer down, as indicated by the forward slash `/`. These are not literal directory paths within a file system; in reality, the image `icon_info_sml.gif` sits inside a folder called `images`. Rather, the number of forward slashes indicates the level of recursion within the document. One slash `/` marks a first set of embedded documents. Additional slashes `/` indicate that the parser has recursed into an embedded document within an embedded document.

```{r, eval=NOT_CRAN}
content_type <- metadata %>%
  lapply(function(x){ x$'Content-Type' })

content_type
```

The `Content-Type` metadata reveals that the first item is the container itself, with the type `application/zip`. The items after that are deeper and include web formats such as `application/xhtml+xml`, `image/png`, and `text/css`.

```{r, eval=NOT_CRAN}
content <- metadata %>%
  lapply(function(x){ x$'X-TIKA:content' })

str(content)
```

The `X-TIKA:content` field includes the XHTML rendition of each object. It is possible to get plain text in the `X-TIKA:content` field instead by calling `tika_json_text()`. That is the only difference between `tika_json()` and `tika_json_text()`.

It may be surprising to learn that Word documents are containers, at least the modern `.docx` variety. By parsing them with `tika_json()` or `tika_json_text()`, the various images and embedded objects can be analyzed. However, there is an added complexity: each document may produce a long vector of `Content-Type` values, one for each embedded file, instead of the single `Content-Type` for the container that `tika_xml()` and `tika_html()` report.

# Extending rtika

Out of the box, `rtika` uses all the available Tika Detectors and Parsers and runs with sensible defaults. For most users, this will work well.

In a future version, parsing may be customized with a Tika configuration file. This config file option is on hold in `rtika`, because Tika's batch module is still new and the config file format will likely change in backward-incompatible ways. Please stay tuned.

There is also room for improvement with the document formats common in the R community, especially LaTeX and Markdown. Tika currently reads and writes these formats just fine, captures metadata, and recognizes the MIME type when downloading with `tika_fetch()`. However, Tika does not have parsers that fully understand the LaTeX or Markdown document structure, render it to XHTML, and extract the plain text while ignoring markup. For those cases, Pandoc will be more useful (see: https://pandoc.org/demos.html ).

You may find these resources useful:

* Current Tika issues and progress can be seen here: https://issues.apache.org/jira/projects/TIKA
* The Tika Wiki is here: https://cwiki.apache.org/confluence/display/tika/
* Tika source code: https://github.com/apache/tika

# References