---
title: "dataset_df: Create Datasets that are Easy to Share, Exchange, and Extend"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{dataset_df: Create Datasets that are Easy to Share, Exchange, and Extend}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setupvignette, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `dataset` package extends R’s native data structures with machine-readable metadata. It follows a *semantic early-binding* approach: metadata is embedded as soon as the data is created, making datasets suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

In R, a `data.frame` is defined as a tightly coupled collection of variables that share many of the properties of matrices and lists, and it serves as the fundamental data structure for most of R’s modeling software. Users of the R ecosystem often use the term *data frame* interchangeably with *dataset*. However, the standards used in libraries, repositories, and statistical systems for publishing, exchanging, and reusing datasets require metadata that even “tidy” data frames do not provide.

This vignette introduces the `dataset_df` class and the `dataset_df()` constructor, which extend tidy data frames with a semantic layer. For details on semantically enriched vectors, see `vignette("defined", package = "dataset")`. Readers interested in the underlying ISO and W3C definitions of *dataset* will find them discussed in `vignette("design", package = "dataset")`.

## Purpose

The `dataset_df()` function helps you create **semantically rich datasets** that meet the interoperability, exchange, and reuse requirements of libraries, repositories, and statistical systems.
It defines a new S3 class, inherited from the modernised data frame of `tibble::tibble()`, that retains compatibility with existing workflows but is easier:

- for humans to understand,
- for machines to validate and process,
- to deposit, exchange, and publish,
- to share across tools, teams, and domains.

This vignette walks you through creating such a dataset using a subset of the *GDP and main aggregates – international data cooperation annual data* dataset from Eurostat (DOI: [https://doi.org/10.2908/NAIDA_10_GDP](https://doi.org/10.2908/NAIDA_10_GDP)).

## Load example data

```{r loaddata}
library(dataset)
data("gdp")
```

```{r printgdp}
print(gdp)
```

This example dataset is already in tidy format: each row represents a single observation for a country and year, and each column is a variable. `dataset_df` builds on this structure by adding semantic information to the variables and the dataset itself, ensuring that both the shape and the meaning of the data are preserved and unambiguous.

While the raw dataset represented in the `gdp` data.frame is valid and tidy, it can be hard to interpret without external documentation. For example:

- Countries are encoded in the `geo` variable.
- Reporting frequency (e.g., `A` for annual) is stored in `freq`.

## Add metadata to your dataset

The `dataset_df()` constructor enables two levels of semantic annotation for a `tbl_df` object:

- **Variable-level metadata**: label, unit, definition, namespace.
- **Dataset-level metadata**: title, author, license, description.

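
Before building a full dataset, the variable-level layer can be tried on a single vector. The following is a minimal sketch rather than part of the Eurostat example: the numbers in `gdp_sketch` are invented for illustration, and it relies only on `defined()`, `var_label()`, and `var_unit()`, which this vignette also uses elsewhere.

```r
library(dataset)

# Hypothetical values, for illustration only (not taken from the Eurostat data)
gdp_sketch <- defined(
  c(100.5, 102.3, 98.7),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# The annotations travel with the vector itself
var_label(gdp_sketch)
var_unit(gdp_sketch)
```

Because the label and unit are attached to the vector, they should remain available once the vector becomes a column of a `dataset_df`.
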
Let’s create a semantically enriched subset:

```{r createdataasetdf}
small_country_dataset <- dataset_df(
  # Variable-level metadata: each column is a defined() vector
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    gdp$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  # Dataset-level metadata, expressed in Dublin Core terms
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein, San Marino and the Faroe Islands"
  )
)
```

## Inspecting variable-level metadata

Columns created with the `defined` class store semantic information such as the label, the concept’s definition link, and the unit of measure.

Check the variable label:

```{r varlabel}
var_label(small_country_dataset$gdp)
```

And the unit of measure:

```{r varunit}
var_unit(small_country_dataset$gdp)
```

## Adding dataset-level metadata

A `dataset_df` object can also store metadata describing the dataset as a whole. This metadata follows widely adopted standards:

- Dublin Core Terms (`dublincore()`), used in libraries and data repositories.
- DataCite (`datacite()`), commonly used in research data repositories.

Each metadata field can be accessed or modified with simple accessor and assignment functions. For example, you can set the dataset language:

```{r language}
language(small_country_dataset) <- "en"
```

## Reviewing dataset-level metadata

To see the complete dataset description, you can print it as a BibTeX-style entry, which is suitable for citation or export.

```{r bibentry}
print(get_bibentry(small_country_dataset), "bibtex")
```

## Joining datasets

The previous dataset contains observations for three data subjects (Andorra, Liechtenstein, and San Marino) but does not include the Faroe Islands.

```{r feroedf}
# Plain data frame with the missing observations, without any metadata
feroe_df <- data.frame(
  geo = rep("FO", 3),
  year = 2020:2022,
  gdp = c(2523.6, 2725.8, 3013.2),
  unit = rep("CP_MEUR", 3),
  freq = rep("A", 3)
)
```

The `dataset_df` class does not allow binding two datasets directly unless their concept definitions, units of measure, and URI namespaces match. Calling base `rbind()` on the annotated dataset and the plain data frame fails:

```{r notevaluatedrbind, eval=FALSE}
rbind(small_country_dataset, feroe_df)
```

```
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match
```

While this constraint can feel restrictive during an analysis workflow, it ensures semantic consistency when the data is later published or exchanged.

This is similar in spirit to tidy data principles: when combining datasets, both structure and meaning must align. In `dataset_df`, the tidy data rule that “variables are columns” is complemented by the requirement that variables with the same name also share the same definition, units, and concept references.

To add the missing Faroe Islands data, first create a compatible dataset using the same definitions, country coding, and units of measure as the original.

```{r fereodataset}
feroe_dataset <- dataset_df(
  geo = defined(
    feroe_df$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    feroe_df$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    feroe_df$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    feroe_df$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    feroe_df$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  )
)
```

Once the new dataset is defined in this way, you can combine it with the existing one using `bind_defined_rows()`.

```{r binddefinedrows}
joined_dataset <- bind_defined_rows(small_country_dataset, feroe_dataset)
joined_dataset
```

The combined dataset behaves like a regular tibble but retains its metadata. If you convert it to a base R `data.frame`, you will lose the helper methods and built-in checks, but the metadata will remain in the object’s attributes.

```{r backwardcompatibility}
attributes(as.data.frame(joined_dataset))
```

## Conclusion

With `dataset_df()` your datasets are:

- **Self-descriptive**: variables carry labels, units, and definitions.
- **Machine-readable**: linked vocabularies and standard identifiers are embedded.
- **Ready to publish and share**: compliant with metadata standards like Dublin Core and DataCite.

This approach supports the FAIR data principles (Findable, Accessible, Interoperable, Reusable) and makes your data easier to reuse, interpret, and validate. By maintaining metadata from creation through publication, `dataset_df` helps preserve meaning across the entire data lifecycle.

The package is designed to work seamlessly with the rOpenSci [rdflib](https://github.com/ropensci/rdflib) package and complements tidyverse workflows while enabling exports to semantic web formats like RDF.
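
The matching rule described under *Joining datasets* can be approximated with a small pre-check before binding. This is only a sketch under stated assumptions: `same_semantics()` is a hypothetical helper, not a function of the `dataset` package, it compares just the label and the unit, and the numeric values and the `CP_MUSD` unit code are invented for illustration. The package's own checks in `bind_defined_rows()` may be stricter.

```r
library(dataset)

# Hypothetical helper (not part of the dataset package): TRUE when two
# defined vectors carry the same label and the same unit of measure.
same_semantics <- function(x, y) {
  identical(var_label(x), var_label(y)) &&
    identical(var_unit(x), var_unit(y))
}

# Invented values, for illustration only
gdp_meur <- defined(
  c(10, 20),
  label = "Gross Domestic Product", unit = "CP_MEUR"
)
gdp_meur_more <- defined(
  c(30, 40),
  label = "Gross Domestic Product", unit = "CP_MEUR"
)
gdp_other_unit <- defined(
  c(50, 60),
  label = "Gross Domestic Product", unit = "CP_MUSD"
)

same_semantics(gdp_meur, gdp_meur_more)   # labels and units agree
same_semantics(gdp_meur, gdp_other_unit)  # units differ
```

A check of this kind can flag mismatched columns early, before a compatible dataset is built with `dataset_df()` and passed to `bind_defined_rows()`.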