Title: | R Interface to 'Apache Tika' |
---|---|
Description: | Extract text or metadata from over a thousand file types, using Apache Tika <https://tika.apache.org/>. Get either plain text or structured XHTML content. |
Authors: | Sasha Goodman [aut, cre], The Apache Software Foundation [aut, cph], Julia Silge [rev] (Reviewed the package for rOpenSci, see https://github.com/ropensci/software-review/issues/191/), David Gohel [rev] (Reviewed the package for rOpenSci, see https://github.com/ropensci/software-review/issues/191/) |
Maintainer: | Sasha Goodman <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 2.7.0 |
Built: | 2024-10-28 06:06:44 UTC |
Source: | https://github.com/ropensci/rtika |
jar
This downloads and installs the Tika App jar
(~60 MB) into a user directory,
and verifies the integrity of the file using a checksum.
The default settings should work fine.
install_tika( version = "2.7.0", digest = paste0("7fefbe5570a95900d39193134e8277aec99e5450a8", "cecbb5787b3d6651ebf735e460ccccddb49bdc2990", "8a9058fc36e4689aed6da6d63a1cf70ca09ccf26bcca"), mirrors = c("https://ftp.wayne.edu/apache/tika/", "http://mirrors.ocf.berkeley.edu/apache/tika/", "http://apache.cs.utah.edu/tika/", "http://mirror.cc.columbia.edu/pub/software/apache/tika/"), retries = 2, url = character() )
install_tika( version = "2.7.0", digest = paste0("7fefbe5570a95900d39193134e8277aec99e5450a8", "cecbb5787b3d6651ebf735e460ccccddb49bdc2990", "8a9058fc36e4689aed6da6d63a1cf70ca09ccf26bcca"), mirrors = c("https://ftp.wayne.edu/apache/tika/", "http://mirrors.ocf.berkeley.edu/apache/tika/", "http://apache.cs.utah.edu/tika/", "http://mirror.cc.columbia.edu/pub/software/apache/tika/"), retries = 2, url = character() )
version |
The declared Tika version |
digest |
The sha512 checksum. Set to an empty string |
mirrors |
A vector of Apache mirror sites. One is picked randomly. |
retries |
The number of times to try the download. |
url |
Optional url to a particular location of the tika app. Setting this to any character string overrides downloading from random mirrors. |
Logical if the installation was successful.
The default settings of install_tika()
should typically be left as they are.
This function will download the version of the Tika jar
tested to work
with this package, and can verify file integrity using a checksum.
It will normally download from a random Apache mirror.
If the mirror fails,
it tries the archive at http://archive.apache.org/dist/tika/
.
You can also enter a value for url
directly to override this.
It will download into a directory determined
by tools::R_user_dir("rtika", which = "data")
,
specific to the operating system.
If tika()
is stopping with an error compalining about the jar
,
try running install_tika()
again.
If you are uninstalling the entire rtika
package
and want to remove the Tika App jar
also,
run:
unlink(tools::R_user_dir("rtika", which = "data"), recursive = TRUE)
Alternately, navigate to the install folder and delete it manually.
It is the file path returned by
tools::R_user_dir("rtika", which = "data")
.
The path is OS specific.
Tika is distributed under the Apache License Version 2.0, which generally permits distribution of the code "Object" without the "Source". The master copy of the Apache Tika source code is held in GIT. You can fetch (clone) the large source from GitHub ( https://github.com/apache/tika ).
install_tika()
install_tika()
Gets the system command needed to run Java from the command line, as a string.
Typically, this is the string: 'java'.
However, if the R session has the JAVA_HOME
environmental variable set, it will use that to locate java instead.
This can be persisted over sessions (see the Details below).
java()
java()
The system command needed to invoke Java, as a string.
This function is used by all of the tika()
functions internally as the default value to its java
parameter.
This function tries to find an environmental variable using Sys.getenv("JAVA_HOME")
.
It looks for the java
executable inside the bin
directory in the JAVA_HOME
directory.
If you want to use a specific version of Java, set the JAVA_HOME
variable using Sys.setenv(JAVA_HOME = 'my path')
,
where 'my path' is the path to a folder that has a bin
directory with a java
executable.
For example, on Windows 10 JAVA_HOME
might be C:/Program Files (x86)/Java/jre1.8.0_171
.
On Ubuntu and OS X, it might be the /usr
directory.
The JAVA_HOME
variable can also be set to persist over sessions.
Add the path to the .Rprofile
by adding Sys.setenv(JAVA_HOME = 'my path')
, and it will use that every time R is started.
# Typically, this function returns the string 'java'. # If JAVA_HOME is set, it's a path to java in a 'bin' folder. java()
# Typically, this function returns the string 'java'. # If JAVA_HOME is set, it's a path to java in a 'bin' folder. java()
Extract text or metadata from over a thousand file types. Get either plain text or structured XHTML content.
If you have not done so already, finish installing rtika by typing in the R console:
install_tika()
The tika_text
function will extract plain text from many types of documents. It is a good place to start. Please read the Vignette also.
Other main functions include tika_xml
and tika_html
that get a structured XHMTL rendition. The tika_json
function gets metadata as '.json', with XHMTL content.
The tika_json_text
function gets metadata as '.json', with plain text content.
tika
is the main function the others above inherit from.
Use tika_fetch
to download files with a file extension matching the Content-Type.
Extract text or metadata from over a thousand file types.
Get either plain text or structured XHTML
.
Metadata includes Content-Type
, character encoding, and Exif data from
jpeg or tiff images. See the long list of supported file types,
click the "Supported Formats" link on this page :
https://tika.apache.org/.
tika( input, output = c("text", "jsonRecursive", "xml", "html")[1], output_dir = "", return = TRUE, java = rtika::java(), jar = rtika::tika_jar(), threads = 2, max_restarts = integer(), timeout = 3e+05, max_file_size = integer(), config = system.file("extdata", "ocr.xml", package = "rtika"), args = character(), quiet = TRUE, cleanup = TRUE, lib.loc = .libPaths() )
tika( input, output = c("text", "jsonRecursive", "xml", "html")[1], output_dir = "", return = TRUE, java = rtika::java(), jar = rtika::tika_jar(), threads = 2, max_restarts = integer(), timeout = 3e+05, max_file_size = integer(), config = system.file("extdata", "ocr.xml", package = "rtika"), args = character(), quiet = TRUE, cleanup = TRUE, lib.loc = .libPaths() )
input |
Character vector describing the paths to the input documents. Strings starting with 'http://','https://', or 'ftp://' are downloaded to a temporary directory. On Windows, the local paths cannot span drives because of a Windows convention. |
output |
Optional character vector of the output format. The default,
|
output_dir |
Optional directory path to save the converted files in. Tika may overwrite files so an empty directory is best. See the 'Output Details' section before using. |
return |
Logical if an R object should be returned. Defaults to TRUE. If set to FALSE, and output_dir (above) must be specified. |
java |
Optional command to invoke Java. For example, it can be the full path to a particular Java version. See the Configuration section below. |
jar |
Optional alternative path to a |
threads |
Integer of the number of file consumer threads Tika uses. Defaults to 2. |
max_restarts |
Integer of the maximum number of times the watchdog process will restart the child process. The default is no limit. |
timeout |
Integer of the number of milliseconds allowed to a parse before the process is killed and restarted. Defaults to 300000. |
max_file_size |
Integer of the maximum bytes allowed. Do not process files larger than this. The default is unlimited. |
config |
Path to the XML config file. Defaults to |
args |
Optional character vector of additional arguments passed to Tika,
that may not yet be implemented in this R interface, in the pattern of
|
quiet |
Logical if Tika command line messages and errors are to be
suppressed. Defaults to |
cleanup |
Logical to clean up temporary files after running the command,
which can accumulate. Defaults to |
lib.loc |
Optional character vector describing the library paths. Normally, it's best to leave this parameter alone. The parameter is included mainly for package testing. |
A character vector in the same order and with the same length as
input
. Unprocessed files are as.character(NA)
.
If return = FALSE
, then a NULL
value is invisibly returned.
See the Output Details section below.
If an input file did not exist, could not be downloaded, was a directory, or
Tika could not process it, the result will be as.character(NA)
for
that file.
By default, output = "text"
and this produces plain text with no
metadata. Some formatting is preserved in this case using tabs, newlines and
spaces.
Setting output
to either "xml"
or the shortcut "x"
will
produce a strict form of HTML
known as XHTML
, with metadata in
the head
node and formatted text in the body
.
Content retains more formatting with "xml"
. For example, a Word or
Excel table will become a HTML table
, with table data as text in
td
elements. The "html"
option and its shortcut "h"
seem to produce the same result as "xml"
.
Parse XHTML output with xml2::read_html
.
Setting output
to "jsonRecursive"
or its shortcut "J"
produces a tree structure in 'json'. Metadata fields are at the top level.
The XHTML
or plain text will be found in the X-TIKA:content
field. By default the text is XHTML
. This can be changed to plain
text like this: output=c("jsonRecursive","text")
or
output=c("J","t")
. This syntax is meant to mirror Tika's. Parse
json
with jsonlite::fromJSON
.
If output_dir
is specified, then the converted files will also be
saved to this directory. It's best to use an empty directory because Tika
may overwrite existing files. Tika seems to add an extra file extension to
each file to reduce the chance, but it's still best to use an empty
directory. The file locations within the output_dir
maintain the same
general path structure as the input files. Downloaded files have a path
similar to the 'tempdir()' that R uses. The original paths are now relative
to output_dir
. Files are appended with .txt
for the default
plain text, but can be .json
, .xml
, or .html
depending
on the output
setting. One way to get a list of the processed files
is to use list.files
with recursive=TRUE
.
If output_dir
is not specified, files are saved to a volatile temp
directory named by tempdir()
and will be deleted when R shuts down.
If this function will be run on very large batches repeatedly, these
temporary files can be cleaned up every time by adding
cleanup=TRUE
.
Tika is a foundational library for several Apache projects such as the Apache Solr search engine. It has been in development since at least 2007. The most efficient way I've found to process many thousands of documents is Tika's 'batch' mode, which is the only mode used in 'rtika'. There are potentially more things that can be done, given enough time and attention, because Apache Tika includes many libraries and methods in its .jar file. The source is available at: https://tika.apache.org/.
Tika requires Java 8.
Java installation instructions are at https://openjdk.org/install/ or https://www.java.com/en/download/help/download_options.xml.
By default, this R package internally invokes Java by calling the java
command from the command line. To specify the path to a particular Java
version, set the path in the java
attribute of the tika
function.
#extract text batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) text = tika(batch) cat(substr(text[1],45,450)) #more complex metadata if(requireNamespace('jsonlite')){ json = tika(batch,c('J','t')) # 'J' is shortcut for jsonRecursive # 't' for text metadata = lapply(json, jsonlite::fromJSON ) #embedded resources lapply(metadata, function(x){ as.character(x$'Content-Type') }) lapply(metadata, function(x){ as.character(x$'Creation-Date') }) lapply(metadata, function(x){ as.character(x$'X-TIKA:embedded_resource_path') }) }
#extract text batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) text = tika(batch) cat(substr(text[1],45,450)) #more complex metadata if(requireNamespace('jsonlite')){ json = tika(batch,c('J','t')) # 'J' is shortcut for jsonRecursive # 't' for text metadata = lapply(json, jsonlite::fromJSON ) #embedded resources lapply(metadata, function(x){ as.character(x$'Content-Type') }) lapply(metadata, function(x){ as.character(x$'Creation-Date') }) lapply(metadata, function(x){ as.character(x$'X-TIKA:embedded_resource_path') }) }
This is used by install_tika()
internally,
or can be called directly on a jar
file.
The latest jar
files and checksums are at https://tika.apache.org/download.html.
tika_check(digest, jar = tika_jar(), algo = "sha512")
tika_check(digest, jar = tika_jar(), algo = "sha512")
digest |
Character vector of length one with the target checksum. |
jar |
Optional alternative path to a Tika |
algo |
Optional algorithm used to create checksum. Defaults to SHA512. |
logical if the jar
checksum matches digest
.
On the Internet, Content-Type information is mainly communicated via the server's headers.
This is an issue if a file is saved to disk without examining the headers.
The file can have a missing or incorrect file extension.
For example, a URL ending in a slash (/
) can produce file with the Content-Type of text/html
.
The same URL might also produce a image/jpeg
or application/pdf
file.
URLs ending in .php
, .cfm
can produce any Content-Type.
The downloaded file will lose the server's declared Content-Type unless its appended as a file extension.
tika_fetch()
gets a file from the URL, examines the server headers,
and appends the matching file extension
from Tika's database.
tika_fetch( urls, download_dir = tempdir(), ssl_verifypeer = TRUE, retries = 1, quiet = TRUE )
tika_fetch( urls, download_dir = tempdir(), ssl_verifypeer = TRUE, retries = 1, quiet = TRUE )
urls |
Character vector of one or more URLs to be downloaded. |
download_dir |
Character vector of length one describing the path to the directory to save the results. |
ssl_verifypeer |
Logical, with a default of TRUE. Some server SSL certificates might not be recognized by the host system, and in these rare cases the user can ignore that if they know why. |
retries |
Integer of the number of times to retry each url after a failure to download. |
quiet |
Logical if download warnings should be printed. Defaults to FALSE. |
Character vector of the same length and order as input with the paths describing the locations of the downloaded files. Errors are returned as NA.
tika_fetch('https://tika.apache.org/') # a unique file name with .html appended to it
tika_fetch('https://tika.apache.org/') # a unique file name with .html appended to it
If output_dir
is specified, files will have the .html
file extension.
tika_html(input, ...)
tika_html(input, ...)
input |
Character vector describing the paths and/or urls to the input documents. |
... |
Other parameters to be sent to |
A character vector in the same order and with the same length as input
, of unparsed XHTML
. Unprocessed files are as.character(NA)
.
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) html <- tika_html(batch)
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) html <- tika_html(batch)
Gets the path to the Tika App .jar
installed by tika_install()
.
tika_jar()
tika_jar()
A string describing the file path to the Tika App .jar
file. If not found, NA
.
The tika_jar()
function also checks if the .jar
is actually on the file system.
The file path is used by all of the tika()
functions by default.
You can call Apache Tika directly, as shown in the examples here.
It is better to use the sys
package and avoid system2()
,
which has caused erratic, intermittent errors with Tika.
jar <- tika_jar() # see help sys::exec_wait('java',c('-jar',jar, '--help')) # detect language of web page sys::exec_wait('java',c('-jar',jar, '--language','https://tika.apache.org/'))
jar <- tika_jar() # see help sys::exec_wait('java',c('-jar',jar, '--help')) # detect language of web page sys::exec_wait('java',c('-jar',jar, '--language','https://tika.apache.org/'))
Tika can parse and extract text from almost anything, including zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, you can get the text and metadata for each file nested inside of the zip file. This recursive output is currently used for the jsonified mode. See: https://wiki.apache.org/tika/RecursiveMetadata
The document content is XHTML in the "X-TIKA:content" field.
If output_dir
is specified, files will have the .json
file extension.
tika_json(input, ...)
tika_json(input, ...)
input |
Character vector describing the paths and/or urls to the input documents. |
... |
Other parameters to be sent to |
A character vector in the same order and with the same length as input
, of unparsed json
. Unprocessed files are as.character(NA)
.
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) json <- tika_json(batch)
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) json <- tika_json(batch)
Tika can parse and extract text from almost anything, including zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, you can get the text and metadata for each file nested inside of the zip file. This recursive output is currently used for the jsonified mode. See: https://wiki.apache.org/tika/RecursiveMetadata
The document contents are plain text in the "X-TIKA:content" field.
If output_dir
is specified, files will have the .json
file extension.
tika_json_text(input, ...)
tika_json_text(input, ...)
input |
Character vector describing the paths and/or urls to the input documents. |
... |
Other parameters to be sent to |
A character vector in the same order and with the same length as input
, of unparsed json
. Unprocessed files are as.character(NA)
.
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) json <- tika_json_text(batch)
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) json <- tika_json_text(batch)
If output_dir
is specified, files will have the .txt
file extension.
tika_text(input, ...)
tika_text(input, ...)
input |
Character vector describing the paths and/or urls to the input documents. |
... |
Other parameters to be sent to |
A character vector in the same order and with the same length as input
, of plain text. Unprocessed files are as.character(NA)
.
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) text <- tika_text(batch)
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) text <- tika_text(batch)
If output_dir
is specified, files will have the .xml
file extension.
tika_xml(input, ...)
tika_xml(input, ...)
input |
Character vector describing the paths and/or urls to the input documents. |
... |
Other parameters to be sent to |
A character vector in the same order and with the same length as input
, of unparsed XHTML
. Unprocessed files are as.character(NA)
.
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) xml <- tika_xml(batch)
batch <- c( system.file("extdata", "jsonlite.pdf", package = "rtika"), system.file("extdata", "curl.pdf", package = "rtika"), system.file("extdata", "table.docx", package = "rtika"), system.file("extdata", "xml2.pdf", package = "rtika"), system.file("extdata", "R-FAQ.html", package = "rtika"), system.file("extdata", "calculator.jpg", package = "rtika"), system.file("extdata", "tika.apache.org.zip", package = "rtika") ) xml <- tika_xml(batch)