--- title: "Data provenance" author: "Ben Raymond, Michael Sumner" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data provenance} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Bowerbird will maintain a local collection of data files, sourced from external data providers. When one does an analysis using such files, it is useful to know *which* files were used, and the *provenance* of those files. This information will assist in making analyses reproducible. Bowerbird itself actually knows very little about the data files that it maintains, particularly if it is using the `bb_handler_rget` method. In this case, for a given data source it typically knows the URL and flags to pass to `bb_rget`, and some basic metadata (primarily intended to be read by the user). This is by design: the heavy lifting involved in mirroring a remote data source is generally handballed `bb_rget`. This simplifies both the bowerbird code as well as the process of writing and maintaining data sources. The downside of this is that bowerbird can't necessarily answer questions about data provenance in specific detail. Say that we are using data from a data source that contains many separate files, one per day, spanning many years, and we run an analysis that uses only files from the year 1999. Ideally we would like a function that could tell us exactly which files from the complete data set are needed for our particular analysis, but this is impossible without specific knowledge of how the data source is structured. Analogous situations exist with data sources that split geographic space across files, or split different parameters across files. Despite this, bowerbird can provide some general information to assist with data provenance issues. ## Which files were used in an analysis Say we have a bowerbird config: ```{r} library(bowerbird) cf <- bb_config("/some/local/path") %>% bb_add(bb_example_sources()) ``` We can ask bowerbird where these files are stored locally: ```{r} source_dirs <- bb_data_source_dir(cf) knitr::kable(data.frame(name = cf$data_sources$name, source_dir = source_dirs)) ``` The directory for each data source will contain *all* of the files associated with that data source, not just the files used in a particular analysis. (In extreme cases this directory might even contain files from other data sources, although this should only happen if multiple data sources have overlapping `source_url` paths, which ought not to be a common occurrence.) So while this directory is not the *minimal* set of files needed to reproduce an analysis, it does at least contain that set, and could be used to store those files in an online repository that assigns DOIs (such as figshare, see https://cran.r-project.org/package=rfigshare, or zenodo), or bundle into a docker image (see e.g https://github.com/o2r-project/containerit). The complete list of files within a data source's directory can be generated using a standard directory listing (e.g. `list.files(path = my_source_dir, recursive = TRUE`), but see also the `bb_fingerprint` function described below, which provides additional information about the files. ### Refining the list of files Ascertaining exactly which files were used in a particular analysis is a task that is better handled by the code being used to do the file-reading and analysis, not by the repository-management (bowerbird) code. We are unaware of any general solutions that will keep track of the files used by an arbitrary chunk of R code. The recordr package (https://github.com/NCEAS/recordr) comes close: it will collect info about which files were read or written. However, this only works for file types that have read functions implemented in recordr. At the time of writing, this does not cover netcdf files or other files typically used for environmental data, and so its application in that domain is likely to be of limited value. Packages that provide specific data access functionality (e.g. https://github.com/AustralianAntarcticDivision/raadtools) might provide mechanisms for tracking which data files are needed in order to fulfil particular data queries. ## The provenance of files used in an analysis Provenance: where the files came from, when they were downloaded, their version, etc. The `bb_fingerprint` function, given a data repository configuration, will return the timestamp of download and hashes of all files associated with its data sources. Thus, for all of these files, we have established where they came from (the data source ID), when they were downloaded, and a hash so that later versions of those files can be compared to detect changes. ### A word on digital object identifiers When a data source is defined, it should have used its DOI as its data source ID (if it has a DOI, otherwise some other unique identifier). The idea of a DOI is that a data set that has a DOI should be peristent (accessible for long term use), and the DOI should uniquely identify the data resource that it is assigned to. When a data set changes in a substantial manner, and/or it is necessary to identify both the original and the changed material, a new DOI should be assigned. Thus, knowing the DOI gives some indication of the data provenance. However, it is worth noting that the DOI might not uniquely identify the state of the data source. For example, the NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration data set (DOI http://doi.org/10.5067/U8C09DWVX9LM) is updated each day (new files are added) as new data is acquired by the satellites. At periodic intervals, these files are subjected to more rigorous quality control and post-processing, and then moved into a different data set (http://doi.org/10.5067/8GQ8LZQVL0VL). Thus, knowing the DOI of the near-real-time data set does not uniquely identify the files that were included in that data set at a given point in time. Similar ambiguity can arise for other reasons, including data corrections. Say that one data file within a large collection was found to have errors and was corrected: it is at the discretion of the data provider whether the data set as a whole receives a new DOI because of this change.