Title: | Data Quality Reporting for Temporal Datasets |
---|---|
Description: | Generate reports that enable quick visual review of temporal shifts in record-level data. Time series plots showing aggregated values are automatically created for each data field (column) depending on its contents (e.g. min/max/mean values for numeric data, no. of distinct values for categorical data), as well as overviews for missing values, non-conformant values, and duplicated rows. The resulting reports are shareable and can contribute to forming a transparent record of the entire analysis process. It is designed with Electronic Health Records in mind, but can be used for any type of record-level temporal data (i.e. tabular data where each row represents a single "event", one column contains the "event date", and other columns contain any associated values for the event). |
Authors: | T. Phuong Quan [aut, cre] , Jack Cregan [ctb], University of Oxford [cph], National Institute for Health Research (NIHR) [fnd], Brad Cannell [rev] |
Maintainer: | T. Phuong Quan <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.1.9000 |
Built: | 2024-12-01 07:58:09 UTC |
Source: | https://github.com/ropensci/daiquiri |
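As a rough illustration of the "record-level temporal data" described above, the following base R sketch builds a small data frame in which each row is one event, one column holds the event date, and the remaining columns hold associated values. All column names and values here are hypothetical and are not taken from the package.

```r
# Hypothetical record-level temporal data: one row per event.
# Values are kept as character strings, i.e. in their raw form.
events <- data.frame(
  EventID   = c("E001", "E002", "E003", "E004"),
  EventDate = c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"),
  Value     = c("5.2", "4.8", "", "6.1"),
  Category  = c("A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Inspect the structure: four events, four character columns
str(events)
```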
Aggregates a daiquiri_source_data object based on the field_types() specified at load time. The default time period for aggregation is a calendar day.
aggregate_data(source_data, aggregation_timeunit = "day", show_progress = TRUE)
source_data |
A |
aggregation_timeunit |
Unit of time to aggregate over. Specify one of
|
show_progress |
Print progress to console. Default = |
A daiquiri_aggregated_data object
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

# validate and prepare the data for aggregation
source_data <- prepare_data(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL")
)

# aggregate the data
aggregated_data <- aggregate_data(
  source_data,
  aggregation_timeunit = "day"
)

aggregated_data
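To give a feel for what aggregating by calendar day involves, the base R sketch below computes a few per-day summaries (count, number missing, min, max, mean) for a numeric column. It is a conceptual illustration only and should not be read as daiquiri's implementation; the data and the summary column names are hypothetical.

```r
# Hypothetical event data with a date column and a numeric value
events <- data.frame(
  EventDate = as.Date(c(
    "2024-01-01", "2024-01-01",
    "2024-01-02", "2024-01-02", "2024-01-02"
  )),
  Dose = c(10, 20, 5, NA, 15)
)

# Split the rows by calendar day and summarise each group
daily <- do.call(rbind, lapply(split(events, events$EventDate), function(day) {
  data.frame(
    day = day$EventDate[1],
    n = nrow(day),
    missing_n = sum(is.na(day$Dose)),
    min = min(day$Dose, na.rm = TRUE),
    max = max(day$Dose, na.rm = TRUE),
    mean = mean(day$Dose, na.rm = TRUE)
  )
}))
rownames(daily) <- NULL

daily
```

Plotting each summary column against `day` would give one time series per aggregated value, which is the kind of view the reports are built around.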
Close any active log file
close_log()
If a log file was found, the path to the log file that was closed; otherwise an empty string.
close_log()
Accepts record-level data from a data frame, validates it against the expected type of content of each column, generates a collection of time series plots for visual inspection, and saves a report to disk.
daiquiri_report(
  df,
  field_types,
  override_column_names = FALSE,
  na = c("", "NA", "NULL"),
  dataset_description = NULL,
  aggregation_timeunit = "day",
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = NULL,
  show_progress = TRUE,
  log_directory = NULL
)
df |
A data frame. Rectangular data can be read from file using
|
field_types |
|
override_column_names |
If |
na |
Vector containing strings that should be interpreted as missing values. Default = |
dataset_description |
Short description of the dataset being checked. This will appear on the report. If blank, the name of the data frame object will be used. |
aggregation_timeunit |
Unit of time to aggregate over. Specify one of
|
report_title |
Title to appear on the report |
save_directory |
String specifying directory in which to save the report. Default is current directory. |
save_filename |
String specifying filename for the report, excluding any
file extension. If no filename is supplied, one will be automatically
generated with the format |
show_progress |
Print progress to console. Default = |
log_directory |
String specifying directory in which to save log file. If no directory is supplied, progress is not logged. |
A list containing information relating to the supplied parameters as well as the resulting daiquiri_source_data and daiquiri_aggregated_data objects.
In order for the package to detect any non-conformant values in numeric or datetime fields, these should be present in the data frame in their raw character format. Rectangular data from a text file will automatically be read in as character type if you use the read_data() function. Data frame columns that are not of class character will still be processed according to the field_types specified.
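The reason raw character data matters can be sketched in base R. Once values have been converted to numeric, a non-conformant entry and a genuinely missing one both become NA and can no longer be told apart, whereas the raw strings keep that distinction. The sketch below is an assumption-laden illustration of the idea, not daiquiri's own validation logic; the example values and the `na_strings` vector are hypothetical.

```r
# Raw values as they might appear in a text file
raw_values <- c("10", "2.5", "1,000", "abc", "")

# Strings to be treated as missing (hypothetical dataset-level setting)
na_strings <- c("", "NULL")

# Missingness can be decided from the raw strings
is_missing <- raw_values %in% na_strings

# After conversion, "1,000", "abc" and "" all collapse to NA
converted <- suppressWarnings(as.numeric(raw_values))

# Non-conformant: not missing, but not parseable as a number
is_nonconformant <- !is_missing & is.na(converted)

data.frame(raw_values, converted, is_missing, is_nonconformant)
```

If the column had already been converted before this check, only `converted` would be available and the two kinds of NA would be indistinguishable.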
read_data(), field_types(), field_types_available()
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

# create a report in the current directory
daiq_obj <- daiquiri_report(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE, na = "1800-01-01"),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL"),
  dataset_description = "Example data provided with package",
  aggregation_timeunit = "day",
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = "example_data_report",
  show_progress = TRUE,
  log_directory = NULL
)
Export aggregated data to disk. Creates a separate file for each aggregated field in the dataset.
export_aggregated_data(
  aggregated_data,
  save_directory,
  save_file_prefix = "",
  save_file_type = "csv"
)
aggregated_data |
A |
save_directory |
String. Full or relative path for save folder |
save_file_prefix |
String. Optional prefix for the exported filenames |
save_file_type |
String. Filetype extension supported by |
(invisibly) The daiquiri_aggregated_data object that was passed in
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

source_data <- prepare_data(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL")
)

aggregated_data <- aggregate_data(
  source_data,
  aggregation_timeunit = "day"
)

export_aggregated_data(
  aggregated_data,
  save_directory = ".",
  save_file_prefix = "ex_"
)
Specify the names and types of fields in the source data frame. This is important because the data in each field will be aggregated in different ways, depending on its field_type. See field_types_available.
field_types(...)
... |
names and types of fields (columns) in source data. |
A field_types object
field_types_available(), template_field_types()
fts <- field_types(
  PatientID = ft_uniqueidentifier(),
  TestID = ft_ignore(),
  TestDate = ft_timepoint(),
  TestName = ft_categorical(aggregate_by_each_category = FALSE),
  TestResult = ft_numeric(),
  ResultDate = ft_datetime(),
  ResultComment = ft_freetext(),
  Location = ft_categorical()
)

fts
Specify only a subset of the names and types of fields in the source data frame. The remaining fields will be given the same 'default' type.
field_types_advanced(..., .default_field_type = ft_simple())
... |
names and types of fields (columns) in source data. |
.default_field_type |
|
A field_types object
field_types(), field_types_available(), template_field_types()
fts <- field_types_advanced(
  PrescriptionDate = ft_timepoint(),
  PatientID = ft_ignore(),
  .default_field_type = ft_simple()
)

fts
Each column in the source dataset must be assigned to a particular ft_xx depending on the type of data that it contains. This is done through a field_types() specification.
ft_timepoint(includes_time = TRUE, format = "", na = NULL)
ft_uniqueidentifier(na = NULL)
ft_categorical(aggregate_by_each_category = FALSE, na = NULL)
ft_numeric(na = NULL)
ft_datetime(includes_time = TRUE, format = "", na = NULL)
ft_freetext(na = NULL)
ft_simple(na = NULL)
ft_strata(na = NULL)
ft_ignore()
includes_time |
If |
format |
Where datetime values are not in the format |
na |
Column-specific vector of strings that should be interpreted as missing values (in addition to those specified at dataset level) |
aggregate_by_each_category |
If |
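The `na` argument described above is applied in addition to the dataset-level `na` strings. One way to picture this is as a union of the two sets of strings, as in the base R sketch below. This is only an illustration of the documented behaviour under that assumption, not the package's internal code; the variable names and values are hypothetical.

```r
# Dataset-level strings to interpret as missing
dataset_na <- c("", "NULL")

# Additional column-specific placeholder, e.g. a dummy date
column_na <- "1800-01-01"

# Assumed effective set of missing-value strings for this column
effective_na <- union(dataset_na, column_na)

# Hypothetical raw values for the column
admission_dates <- c("2024-01-05", "1800-01-01", "NULL", "2024-02-10")

# Which values would count as missing for this column
admission_dates %in% effective_na
```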
A field_type object denoting the type of data in the column
ft_timepoint() - identifies the data field which should be used as the independent time variable. There should be one and only one of these specified.
ft_uniqueidentifier() - identifies data fields which contain a (usually computer-generated) identifier for an entity, e.g. a patient. It does not need to be unique within the dataset.
ft_categorical() - identifies data fields which should be treated as categorical.
ft_numeric() - identifies data fields which contain numeric values that should be treated as continuous. Any values which contain non-numeric characters (including grouping marks) will be classed as non-conformant.
ft_datetime() - identifies data fields which contain date values that should be treated as continuous.
ft_freetext() - identifies data fields which contain free text values. Only presence/missingness will be evaluated.
ft_simple() - identifies data fields where you only want presence/missingness to be evaluated (but which are not necessarily free text).
ft_strata() - identifies a categorical data field which should be used to stratify the rest of the data.
ft_ignore() - identifies data fields which should be ignored. These will not be loaded.
field_types(), template_field_types()
fts <- field_types(
  PatientID = ft_uniqueidentifier(),
  TestID = ft_ignore(),
  TestDate = ft_timepoint(),
  TestName = ft_categorical(aggregate_by_each_category = FALSE),
  TestResult = ft_numeric(),
  ResultDate = ft_datetime(),
  ResultComment = ft_freetext(),
  Location = ft_categorical()
)

ft_simple()
Choose a directory in which to save the log file. If this is not called, no log file is created.
initialise_log(log_directory)
log_directory |
String containing directory to save log file |
Character string containing the full path to the newly-created log file
log_name <- initialise_log(".")
log_name
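The two logging functions are naturally used as a pair around the rest of a session. The sketch below assumes the daiquiri package is installed and uses only the initialise_log() and close_log() calls documented here; writing the log to tempdir() is an arbitrary choice for illustration.

```r
library(daiquiri)

# Start logging to a temporary directory (illustrative choice)
log_path <- initialise_log(tempdir())

# ... calls such as prepare_data(), aggregate_data() or report_data()
# would go here, while the log file is active ...

# Stop logging; returns the path of the log that was closed,
# or an empty string if no log file was found
closed_path <- close_log()
```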
Validate a data frame against a field_types() specification, and prepare for aggregation.
prepare_data(
  df,
  field_types,
  override_column_names = FALSE,
  na = c("", "NA", "NULL"),
  dataset_description = NULL,
  show_progress = TRUE
)
df |
A data frame |
field_types |
|
override_column_names |
If |
na |
Vector containing strings that should be interpreted as missing values. Default = |
dataset_description |
Short description of the dataset being checked. This will appear on the report. If blank, the name of the data frame object will be used. |
show_progress |
Print progress to console. Default = |
A daiquiri_source_data object
field_types(), field_types_available(), aggregate_data(), report_data(), daiquiri_report()
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

# validate and prepare the data for aggregation
source_data <- prepare_data(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL"),
  dataset_description = "Example data provided with package"
)

source_data
Popular file readers such as readr::read_delim() perform datatype conversion by default, which can interfere with daiquiri's ability to detect non-conformant values. Use this function instead to ensure optimal compatibility with daiquiri's features.
read_data(
  file,
  delim = NULL,
  col_names = TRUE,
  quote = "\"",
  trim_ws = TRUE,
  comment = "",
  skip = 0,
  n_max = Inf,
  show_progress = TRUE
)
file |
A string containing path of file containing data to load, or a
URL starting |
delim |
Single character used to separate fields within a record. E.g.
|
col_names |
Either |
quote |
Single character used to quote strings. |
trim_ws |
Should leading and trailing whitespace be trimmed from each field? |
comment |
A string used to identify comments. Any text after the comment characters will be silently ignored. |
skip |
Number of lines to skip before reading data. If |
n_max |
Maximum number of lines to read. |
show_progress |
Display a progress bar? Default = |
This function is aimed at non-expert users of R, and operates as a restricted implementation of readr::read_delim(). If you prefer to use read_delim() directly, ensure you set the following parameters: col_types = readr::cols(.default = "c") and na = character().
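For readers who do want to call readr directly, the settings named above might be applied as in the sketch below. It assumes the readr package is installed; the temporary file and its contents are hypothetical and exist only to make the example self-contained.

```r
library(readr)

# Write a small hypothetical CSV file to read back in
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,dose,date",
    "1,10,2024-01-01",
    "2,NA,2024-01-02",
    "3,abc,2024-01-03"
  ),
  tmp
)

# Read every column as character and do not convert any string to NA,
# so that raw values reach the validation step unchanged
raw_data <- read_delim(
  tmp,
  delim = ",",
  col_types = cols(.default = "c"),
  na = character()
)

raw_data
```

With these settings the string "NA" and the non-numeric "abc" both survive as plain text, leaving it to the later field_types() specification and `na` arguments to decide how they are interpreted.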
A data frame
field_types(), field_types_available(), aggregate_data(), report_data(), daiquiri_report()
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

head(raw_data)
Generate a report from previously-created daiquiri_source_data and daiquiri_aggregated_data objects.
report_data(
  source_data,
  aggregated_data,
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = NULL,
  format = "html",
  show_progress = TRUE,
  ...
)
source_data |
A |
aggregated_data |
A |
report_title |
Title to appear on the report |
save_directory |
String specifying directory in which to save the report. Default is current directory. |
save_filename |
String specifying filename for the report, excluding any
file extension. If no filename is supplied, one will be automatically
generated with the format |
format |
File format of the report. Currently only |
show_progress |
Print progress to console. Default = |
... |
Further parameters to be passed to |
A string containing the name and path of the saved report
prepare_data(), aggregate_data(), daiquiri_report()
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)

# validate and prepare the data for aggregation
source_data <- prepare_data(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL"),
  dataset_description = "Example data provided with package",
  show_progress = TRUE
)

# aggregate the data
aggregated_data <- aggregate_data(
  source_data,
  aggregation_timeunit = "day",
  show_progress = TRUE
)

# save a report in the current directory using the previously-created objects
report_data(
  source_data,
  aggregated_data,
  report_title = "daiquiri data quality report",
  save_directory = ".",
  save_filename = "example_data_report",
  show_progress = TRUE
)
Helper function to generate template code for a field_types() specification, based on the supplied data frame. All fields (columns) in the specification will be defined using the default_field_type, and the console output can be copied and edited before being used as input to daiquiri_report() or prepare_data().
template_field_types(df, default_field_type = ft_ignore())
df |
data frame including the column names for the template specification |
default_field_type |
|
(invisibly) Character string containing the template code
df <- data.frame(
  col1 = rep("2022-01-01", 5),
  col2 = rep(1, 5),
  col3 = 1:5,
  col4 = rnorm(5)
)

template_field_types(df, default_field_type = ft_numeric())