--- title: "Using Skimr" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using Skimr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction `skimr` is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are `summary()` for vectors and data frames and `fivenum()` for numeric vectors: ```{r} summary(iris) ``` ```{r} summary(iris$Sepal.Length) ``` ```{r} fivenum(iris$Sepal.Length) ``` ```{r} summary(iris$Species) ``` ## The `skim()` function The core function of `skimr` is `skim()`, which is designed to work with (grouped) data frames, and will try coerce other objects to data frames if possible. Like `summary()`, `skim()`'s method for data frames presents results for every column; the statistics it provides depend on the class of the variable. ### Skimming data frames By design, the main focus of `skimr` is on data frames; it is intended to fit well within a data [pipeline](https://r4ds.had.co.nz/pipes.html) and relies extensively on [tidyverse](https://www.tidyverse.org/) vocabulary, which focuses on data frames. Results of `skim()` are *printed* horizontally, with one section per variable type and one row per variable. ```{r, render = knitr::normal_print} library(skimr) skim(iris) ``` The format of the results are a single wide data frame combining the results, with some additional attributes and two metadata columns: - `skim_variable`: name of the original variable - `skim_type`: class of the variable Unlike many other objects within `R`, these columns are intrinsic to the `skim_df` class. Dropping these variables will result in a coercion to a `tibble`. The `is_skim_df()` function is used to assert that an object is a skim_df. ```{r} skim(iris) %>% is_skim_df() ``` ```{r, render = knitr::normal_print} skim(iris) %>% dplyr::select(-skim_type, -skim_variable) %>% is_skim_df() ``` ```{r, render = knitr::normal_print} skim(iris) %>% dplyr::select(-n_missing) %>% is_skim_df() ``` In order to avoid type coercion, columns for summary statistics for different types are prefixed with the corresponding `skim_type`. This means that the columns of the `skim_df` are somewhat sparse, with quite a few missing values. This is because for some statistics the representations for different types of variables is different. For example, the mean of a Date variable and of a numeric variable are represented differently when printing, but this cannot be supported in a single vector. The exception to this are `n_missing` and `complete_rate` (missing/number of observations) which are the same for all types of variables. ```{r, render = knitr::normal_print} skim(iris) %>% tibble::as_tibble() ``` This is in contrast to `summary.data.frame()`, which stores statistics in a `table`. The distinction is important, because the `skim_df` object is pipeable and easy to use for additional manipulation: for example, the user could select all of the variable means, or all summary statistics for a specific variable. ```{r, render = knitr::normal_print} skim(iris) %>% dplyr::filter(skim_variable == "Petal.Length") ``` Most `dplyr` verbs should work as expected. ```{r, render = knitr::normal_print} skim(iris) %>% dplyr::select(skim_type, skim_variable, n_missing) ``` The base skimmers `n_missing` and `complete_rate` are computed for all of the columns in the data. But all other type-based skimmers have a namespace. You need to use a `skim_type` prefix to refer to correct column. ```{r, render = knitr::normal_print} skim(iris) %>% dplyr::select(skim_type, skim_variable, numeric.mean) ``` `skim()` also supports grouped data created by `dplyr::group_by()`. In this case, one additional column for each grouping variable is added to the `skim_df` object. ```{r, render = knitr::normal_print} iris %>% dplyr::group_by(Species) %>% skim() ``` Individual columns from a data frame may be selected using tidyverse-style selectors. ```{r, render = knitr::normal_print} skim(iris, Sepal.Length, Species) ``` Or with common `select` helpers. ```{r, render = knitr::normal_print} skim(iris, starts_with("Sepal")) ``` If an individual column is of an unsupported class, it is treated as a character variable with a warning. ## Skimming vectors In `skimr` v2, `skim()` will attempt to coerce non-data frames (such as vectors and matrices) to data frames. In most cases with vectors, the object being evaluated should be equivalent to wrapping the object in `as.data.frame()`. For example, the `lynx` data set is class `ts`. ```{r, render = knitr::normal_print} skim(lynx) ``` Which is the same as coercing to a data frame. ```{r} all.equal(skim(lynx), skim(as.data.frame(lynx))) ``` ## Skimming matrices `skimr` does not support skimming matrices directly but coerces them to data frames. Columns in the matrix become variables. This behavior is similar to `summary.matrix()`). Three possible ways to handle matrices with `skim()` parallel the three variations of the mean function for matrices. ```{r, render = knitr::normal_print} m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3) m ``` Skimming the matrix produces similar results to `colMeans()`. ```{r, render = knitr::normal_print} colMeans(m) skim(m) # Similar to summary.matrix and colMeans() ``` Skimming the transpose of the matrix will give row-wise results. ```{r, render = knitr::normal_print} rowMeans(m) skim(t(m)) ``` And call `c()` on the matrix to get results across all columns. ```{r, render = knitr::normal_print} skim(c(m)) mean(m) ``` ### Skimming without modification `skim_tee()` produces the same printed version as `skim()` but returns the original, unmodified data frame. This allows for continued piping of the original data. ```{r, render = knitr::normal_print} iris_setosa <- iris %>% skim_tee() %>% dplyr::filter(Species == "setosa") head(iris_setosa) ``` Note, that `skim_tee()` is customized differently than `skim` itself. See below for more details. ## Reshaping the results from `skim()` As noted above, `skim()` returns a wide data frame. This is usually the most sensible format for the majority of operations when investigating data, but the package has some other functions to help with edge cases. First, `partition()` returns a named list of the wide data frames for each data type. Unlike the original data the partitioned data only has columns corresponding to the skimming functions used for this data type. These data frames are, therefore, not `skim_df` objects. ```{r, render = knitr::normal_print} iris %>% skim() %>% partition() ``` Alternatively, `yank()` selects only the subtable for a specific type. Think of it like `dplyr::select` on column types in the original data. Again, unsuitable columns are dropped. ```{r, render = knitr::normal_print} iris %>% skim() %>% yank("numeric") ``` `to_long()` returns a single long data frame with columns `variable`, `type`, `statistic` and `formatted`. This is similar but not identical to the `skim_df` object in `skimr` v1. ```{r, render = knitr::normal_print} iris %>% skim() %>% to_long() %>% head() ``` Since the `skim_variable` and `skim_type` columns are a core component of the `skim_df` class, it's possible to get unwanted side effects when using `dplyr::select()`. Instead, use `focus()` to select columns of the skimmed results and keep them as a `skim_df`; it always keeps the metadata column. ```{r, render = knitr::normal_print} iris %>% skim() %>% focus(n_missing, numeric.mean) ``` ## Rendering the results of `skim()` The `skim_df` object is a wide data frame. The display is created by default using `print.skim_df()`; users can specify additional options by explicitly calling `print([skim_df object], ...)`. For documents rendered by `knitr`, the package provides a custom `knit_print` method. To use it, the final line of your code chunk should have a `skim_df` object. ```{r} skim(Orange) ``` The same type of rendering is available from reshaped `skim_df` objects, those generated by `partition()` and `yank()` in particular. ```{r} skim(Orange) %>% yank("numeric") ``` ## Customizing print options Although its not a common use case outside of writing vignettes about `skimr`, you can fall back to default printing methods by adding the chunk option `render = knitr::normal_print`. You can also disable the `skimr` summary by setting the chunk option `skimr_include_summary = FALSE`. You can change the number of digits shown in the columns of generated statistics by changing the `skimr_digits` chunk option. ## Modifying `skim()` `skimr` is opinionated in its choice of defaults, but users can easily add, replace, or remove the statistics for a class. For interactive use, you can create your own skimming function with the `skim_with()` factory. `skimr` also has an API for extensions in other packages. Working with that is covered later. To add a statistic for a data type, create an `sfl()` (a `skimr` function list) for each class that you want to change: ```{r} my_skim <- skim_with(numeric = sfl(new_mad = mad)) my_skim(faithful) ``` As the previous example suggests, the default is to append *new* summary statistics to the preexisting set. This behavior isn't always desirable, especially when you want lots of changes. To stop appending, set `append = FALSE`. ```{r} my_skim <- skim_with(numeric = sfl(new_mad = mad), append = FALSE) my_skim(faithful) ``` You can also use `skim_with()` to remove specific statistics by setting them to `NULL`. This is commonly used to disable the inline histograms and spark graphs. ```{r} no_hist <- skim_with(ts = sfl(line_graph = NULL)) no_hist(Nile) ``` The same pattern applies to changing skimmers for multiple classes simultaneously. If you want to partially-apply function arguments, use the Tidyverse lambda syntax. ```{r} my_skim <- skim_with( numeric = sfl(total = ~ sum(., na.rm = TRUE)), factor = sfl(missing = ~ sum(is.na(.))), append = FALSE ) my_skim(iris) ``` To modify the "base" skimmers, refer to them in a similar manner. Since base skimmers are usually a small group, they must return the same type for all data types in R, `append` doesn't apply here. ```{r} my_skim <- skim_with(base = sfl(length = length)) my_skim(faithful) ``` ## Extending `skimr` Packages may wish to export their own `skim()` functions. Use `skim_with()` for this. In fact, this is how `skimr` generates its version of `skim()`. ```{r} #' @export my_package_skim <- skim_with() ``` Alternatively, defaults for another data types can be added to `skimr` with the `get_skimmers` generic. The method for your data type should return an `sfl()`. Unlike the `sfl()` used interactively, you also need to set the `skim_type` argument. It should match the method type in the function signature. ```{r} get_skimmers.my_type <- function(column) { sfl( skim_type = "my_type", total = sum ) } my_data <- data.frame( my_type = structure(1:3, class = c("my_type", "integer")) ) skim(my_data) ``` An extended example is available in the vignette *Supporting additional objects*. ## Solutions to common rendering problems The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF. The most commonly reported problems involve rendering the spark graphs (inline histogram and line chart) on Windows. One common fix is to switch your locale. The function `fix_windows_histograms()` does this for you. In order to render the sparkgraphs in html or PDF histogram you may need to change fonts to one that supports blocks or Braille (depending on which you need). Please review the separate vignette and associated template for details.