--- title: "Optimizing Storage for Version Control" author: "Thierry Onkelinx" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Optimizing Storage for Version Control} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(knitr) opts_chunk$set( collapse = TRUE, comment = "#>" ) options(width = 83) ``` ## Introduction This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient. `vignette("plain_text", package = "git2rdata")` describes all details on the actual file format. Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function. We will not illustrate the efficiency of `write_vc()` and `read_vc()`. `vignette("efficiency", package = "git2rdata")` covers those topics. ## Setup ```{r initialise} # Create a directory in tempdir root <- tempfile(pattern = "git2r-") dir.create(root) # Create dummy data set.seed(20190222) x <- data.frame( x = sample(LETTERS), y = factor( sample(c("a", "b", NA), 26, replace = TRUE), levels = c("a", "b", "c") ), z = c(NA, 1:25), abc = c(rnorm(25), NA), def = sample(c(TRUE, FALSE, NA), 26, replace = TRUE), timestamp = seq( as.POSIXct("2018-01-01"), as.POSIXct("2019-01-01"), length = 26 ), stringsAsFactors = FALSE ) str(x) ``` ## Assumptions A critical assumption made by `git2rdata` is that the dataframe itself contains all information. Each row is an observation, each column is a variable. The dataframe has `colnames` but no `rownames`. This implies that two observations switching place does not alter the information content. Nor does switching two variables. Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files. Two observations switching place in a plain text file _is_ a change, although the information content^[ _sensu_ <!-- spell-check: ignore --> `git2rdata`] doesn't change. `git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content. ## Sorting Observations Version control systems often track changes in plain text files based on row based differences. In layman's terms they record lines removed from and inserted in the file at what location. Changing an existing line implies removing the old version and inserting the new one. The minimal example below illustrates this. Original version ``` A,B 1,10 2,11 3,12 ``` Altered version. The row containing `1, 10` moves to the last line. The row containing `3,12` changed to `3,0`. ``` A,B 2,11 3,0 1,10 ``` Diff between original and altered version. Notice than we have a deletion of two lines and two insertions. ```diff A,B -1,10 2,11 -3,12 +3,0 +1,10 ``` Ensuring that the observations are always sorted in the same way thus helps minimizing the diff. The sorted version of the same altered version looks like the example below. ``` A,B 1,10 2,11 3,0 ``` Diff between original and the sorted alternate version. Notice that all changes revert to actual changes in the information content. Another benefit is that changes are easily spotted in the diff. A deletion without insertion on the next line is a removed observation. An insertion without preceding deletion is a new observation. A deletion followed by an insertion is an updated observation. ```diff A,B 1,10 2,11 -3,12 +3,0 ``` This is where the `sorting` argument comes into play. If this argument is not provided when writing a file for the first time, it will yield a warning about the lack of sorting. `write_vc()` then writes the observations in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. The changed hash for the data file illustrates this in the example below. The metadata hash remains the same. ```{r row_order} library(git2rdata) write_vc(x, file = "row_order", root = root) write_vc(x[sample(nrow(x)), ], file = "row_order", root = root) ``` `sorting` should contain a vector of variable names. The observations are automatically sorted along these variables. Now we get an error because the set of sorting variables has changed. The metadata stores the set of sorting variables. Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible. From this moment on we will store the output of `write_vc()` in an object reduce output. ```{r apply_sorting, error = TRUE} fn <- write_vc(x, "row_order", root, sorting = "y") ``` Using `strict = FALSE` turns such errors into warnings and allows to update the file. Notice that we get a new warning: the variable we used for sorting resulted in ties, thus the order of the observations is not guaranteed to be stable. The solution is to use more or different variables. We'll need `strict = FALSE` again to override the change in sorting variables. ```{r update_sorting} fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE) fn <- write_vc(x, "row_order", root, sorting = c("y", "x"), strict = FALSE) ``` Once we have defined the sorting, we may omit the `sorting` argument when writing new versions. `write_vc` uses the sorting as defined in the existing metadata. It checks for potential ties. Ties results in a warning. ```{r update_sorted} print_file <- function(file, root, n = -1) { fn <- file.path(root, file) data <- readLines(fn, n = n) cat(data, sep = "\n") } print_file("row_order.yml", root, 7) fn <- write_vc(x[sample(nrow(x)), ], "row_order", root) fn <- write_vc(x[sample(nrow(x)), ], "row_order", root, sorting = c("y", "x")) fn <- write_vc(x[sample(nrow(x), replace = TRUE), ], "row_order", root) ``` ## Sorting Variables The order of the variables (columns) has an even bigger impact on a row based diff. Let's revisit our minimal example. Suppose that we swap `A` and `B` from our [original example](#sorting-observations). The new data looks as below. ``` B,A 10,1 11,2 13,3 ``` The resulting diff is maximal because every single row changed. Yet none of the information changed. Hence, maintaining column order is crucial when storing dataframes as plain text files under version control. The `vignette("efficiency", package = "git2rdata")` vignette illustrates this on a more realistic data set. ```diff -A,B +B,A -1,10 +10,1 -2,11 +11,2 -3,13 +13,3 ``` When `write_vc()` writes a dataframe for the first time, it stores the original order of the columns in the metadata. From that moment on, `write_vc()` uses the order stored in the metadata. The example below writes the same data set twice. The second version contains identical information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file. ```{r variable_order} write_vc(x, "column_order", root, sorting = c("x", "abc")) print_file("column_order.tsv", root, n = 5) write_vc(x[sample(nrow(x)), sample(ncol(x))], "column_order", root) print_file("column_order.tsv", root, n = 5) ``` ## Handling Factors Optimized `vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how we can store a factor more efficiently when storing their index in the data file and the indices and labels in the metadata. We take this even a bit further: what happens if new data arrives and we need an extra factor level? ```{r factor} old <- data.frame(color = c("red", "blue"), stringsAsFactors = TRUE) write_vc(old, "factor", root, sorting = "color") print_file("factor.yml", root) ``` Let's add an observation with a new factor level. If we store the updated dataframe in a new file, we see that the indices are different. The factor level `"blue"` remains unchanged, but `"red"` becomes the third level and get index `3` instead of index `2`. This could lead to a large diff whereas the potential semantics (and thus the information content) are not changed. ```{r factor2} updated <- data.frame( color = c("red", "green", "blue"), stringsAsFactors = TRUE ) write_vc(updated, "factor2", root, sorting = "color") print_file("factor2.yml", root) ``` When we try to overwrite the original data with the updated version, we get an error because there is a change in factor levels and / or indices. In this specific case, we decided that the change is OK and force the writing by setting `strict = FALSE`. Notice that the original labels (`"blue"` and `"red"`) keep their index, the new level (`"green"`) gets the first available index number. ```{r factor_update, error = TRUE} write_vc(updated, "factor", root) fn <- write_vc(updated, "factor", root, strict = FALSE) print_file("factor.yml", root) ``` The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that the meta data retains the existing indices. The order of the labels and indices reflects their new ordering. ```{r factor_deleted} deleted <- data.frame( color = factor(c("red", "green"), levels = c("red", "green")) ) write_vc(deleted, "factor", root, sorting = "color", strict = FALSE) print_file("factor.yml", root) ``` Changing a factor to an ordered factor or _vice versa_ will also keep existing level indices. ```{r factor_ordered} ordered <- data.frame( color = factor(c("red", "green"), levels = c("red", "green"), ordered = TRUE) ) write_vc(ordered, "factor", root, sorting = "color", strict = FALSE) print_file("factor.yml", root) ``` ## Relabelling a Factor The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`. Notice the update of both the labels and the indices. Hence creating a large diff, where updating the labels would do. ```{r} write_vc(old, "write_vc", root, sorting = "color") print_file("write_vc.yml", root) relabeled <- old # translate the color names to Dutch levels(relabeled$color) <- c("blauw", "rood") write_vc(relabeled, "write_vc", root, strict = FALSE) print_file("write_vc.yml", root) ``` We created `relabel()`, which changes the labels in the meta data while maintaining their indices. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The name of the list must match with the variable name of a factor in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label. ```{r} write_vc(old, "relabel", root, sorting = "color") relabel("relabel", root, change = list(color = c(red = "rood", blue = "blauw"))) print_file("relabel.yml", root) relabel( "relabel", root, change = data.frame( factor = "color", old = "blauw", new = "blue", stringsAsFactors = TRUE ) ) print_file("relabel.yml", root) ``` A _caveat_: `relabel()` does not make sense when the data file uses verbose storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Hence, `relabel()` requires the optimized storage.