---
title: "From R to RDF"
output:
  rmarkdown::html_vignette:
    md_extensions: -autolink_bare_uris
vignette: >
  %\VignetteIndexEntry{From R to RDF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{css, echo=FALSE}
.smaller .table {
  font-size: 11px;
}
.smaller pre,
.smaller code {
  font-size: 11px;
  line-height: 1.2;
}
```

## From tidy data to RDF triples

This vignette demonstrates how to convert tidy R datasets into semantically enriched RDF triples, using the `dataset` and `rdflib` packages. These packages help you annotate variables with machine-readable concepts, units, and links to controlled vocabularies.

We’ll start with a small example of a tidy dataset representing countries (`geo`) with unique identifiers (`rowid`), and then show how to transform the dataset into RDF triples using standard vocabularies.

```{r setup}
library(dataset)
library(rdflib)
data("gdp")
```

## Creating a minimal semantically defined dataset

```{r minimaldf}
small_geo <- dataset_df(
  geo = defined(
    gdp$geo[1:3],
    label = "Geopolitical entity",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  )
)
```

The dataset has no creator or author, but the rows have identifiers that can be resolved against the `obs` namespace set in `identifier`. In real publishing scenarios, you would replace these with persistent URIs that identify actual datasets and their observations, for example a DOI-based identifier such as:

`https://doi.org/10.5281/zenodo.14917851#obs:1`

Let's see how this minimal dataset prints in R:

```{r printsmallgeodf}
print(small_geo)
```

A tidy dataset can always be pivoted to a three-column long (tidy) format, in which every cell value of the tabular dataset is expressed as a subject-predicate-object triple.
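To make the pivot concrete, here is a minimal base R sketch that does not use the `dataset` package. The data frame `toy` and the helper `to_long_triples()` are hypothetical names introduced only for illustration; the package's own conversion adds the semantic annotations and may differ in detail.

```r
# Hypothetical toy data: one identifier column and two variables
toy <- data.frame(
  rowid = c("obs1", "obs2"),
  geo = c("AD", "LI"),
  year = c(2020L, 2020L),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of `dataset` or `rdflib`):
# emit one (s, p, o) row per cell, column by column
to_long_triples <- function(df, id_col = "rowid") {
  value_cols <- setdiff(names(df), id_col)
  pieces <- lapply(value_cols, function(col) {
    data.frame(
      s = df[[id_col]],            # subject: the row identifier
      p = col,                     # predicate: the variable name
      o = as.character(df[[col]]), # object: the cell value as text
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

toy_triples <- to_long_triples(toy)
toy_triples
```

Two rows and two variables yield four triples. For `dataset_df` objects, the package's `dataset_to_triples()` function handles this step.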
```{r triplesdf, eval=FALSE}
triples_df <- dataset_to_triples(small_geo)
knitr::kable(triples_df)
```

::: smaller

```{r triplesdfprintsmall, echo=FALSE}
triples_df <- dataset_to_triples(small_geo)
knitr::kable(triples_df)
```

:::

In N-Triples format, this produces statements like:

```{r createntriples}
ntriples <- dataset_to_triples(small_geo, format = "nt")
```

```{r printtriples, eval=FALSE}
cat(ntriples, sep = "\n")
```

::: smaller

```{r printsmaller, echo=FALSE}
cat(ntriples, sep = "\n")
```

:::

Each row of your dataset becomes a **subject**, each variable a **predicate**, and each value either a **URI** or a typed literal (like a date or number), depending on how it's defined.

The first statement in the example describes the cell at the intersection of the first row (the observation identified by the `rowid`, `dataset#eg:1`) and the column [reference area](http://purl.org/linked-data/sdmx/2009/dimension#refArea): its value is [Andorra](https://www.geonames.org/countries/AD/), itself identified by a URI. The advantage of this approach is that the row and column definitions, as well as the coded cell values, have a permanent metadata definition.

### RDF triples enable interoperability

The Resource Description Framework (RDF) represents data as subject–predicate–object triples. This allows your dataset to be machine-readable, linkable to external vocabularies, and queryable via SPARQL.
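To show what a single statement looks like on disk, here is a minimal base R sketch that formats one triple in N-Triples syntax, once with a URI object and once with a typed literal. The helper `format_ntriple()` is hypothetical and is not part of `dataset` or `rdflib`; the predicates chosen are only illustrative, and the exact statements emitted by `dataset_to_triples()` may differ.

```r
# Hypothetical helper for illustration only.
# Assumes `o` contains no characters that need escaping in N-Triples.
format_ntriple <- function(s, p, o, datatype = NULL) {
  object <- if (is.null(datatype)) {
    paste0("<", o, ">")                        # object is a URI
  } else {
    paste0("\"", o, "\"^^<", datatype, ">")    # object is a typed literal
  }
  paste0("<", s, "> <", p, "> ", object, " .")
}

obs <- "https://dataset.dataobservatory.eu/examples/dataset.html#obs1"

# A coded value: the object is itself a resolvable URI
uri_statement <- format_ntriple(
  s = obs,
  p = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  o = "https://www.geonames.org/countries/AD/"
)

# A measured value: the object is a literal with an XML Schema datatype
literal_statement <- format_ntriple(
  s = obs,
  p = "http://data.europa.eu/83i/aa/GDP",
  o = "2354.8",
  datatype = "http://www.w3.org/2001/XMLSchema#decimal"
)

cat(uri_statement, literal_statement, sep = "\n")
```

Each statement is one line ending in ` .`, which is why the N-Triples output of a whole dataset can be treated as a flat text file.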
```{r ntripleexample}
n_triple(
  s = "https://dataset.dataobservatory.eu/examples/dataset.html#obs1",
  p = "http://purl.org/dc/terms/title",
  o = "Small Country Dataset"
)
```

```{r readrdf}
# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(ntriples, con = temp_file)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
rdf_graph
```

A simple, serverless scaffolding for publishing `dataset_df` objects on the web (with HTML + RDF exports) is also available, using the example from this vignette.

## Clean up

It is good practice to close connections and to clean up larger objects held in memory:

```{r cleanup}
# Clean up: delete file and clear RDF graph
unlink(temp_file)
rm(rdf_graph)
gc()
```

## Scale up

We build a slightly bigger graph, save it, and reload it.

```{r scaleup}
small_country_dataset <- dataset_df(
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "https://dd.eionet.europa.eu/vocabularyconcept/eurostat/unit/CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = gdp$unit,
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein and San Marino"
  )
)
```

```{r smallcountrydfnt}
small_country_df_nt <- dataset_to_triples(
  small_country_dataset,
  format = "nt"
)
```

The following lines read as:

- [1] `Observation #1` is a geopolitical entity, `Andorra`.
- [11] `Observation #1` has a reference time period of `2020`.
- [21] `Observation #1` has a decimal GDP value of `2354.8`.
- [31] `Observation #1` has a unit of `million euros, current prices`.
- [41] `Observation #1` has a measurement frequency that is `annual`.

:::: smaller

```{r smallcountrydfntsample}
## See rows 1, 11, 21, 31 and 41
small_country_df_nt[c(1, 11, 21, 31, 41)]
```

::::

The statements about `Observation 1`, i.e. Andorra's national economy in 2020, are not serialised consecutively in the text file. They do not need to be, because each cell is precisely connected to its *row* (the first part of the triple) and its *column* (the second part of the triple). In effect, the full mapping back to the original dataset is embedded in the flat text file, so it can easily be imported into a database.

*Note: The `.html#` in these example IRIs does not mean the resource is an HTML file. Any absolute IRI is valid in RDF. This form is used here only for illustration; in practice, a bare namespace such as `/dataset#` is more conventional.*

```{r readrdf2}
# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(small_country_df_nt, con = temp_file)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
```

```{r readrdf2print, eval = FALSE}
rdf_graph
```

:::: smaller

```{r readrdf2printsmaller, echo=FALSE}
rdf_graph
```

::::

Your dataset is now ready to be exported in a form that meets the FAIR standards, because it is:

- **self-descriptive**: variables carry labels, units, and definitions.
- **machine-readable**: it uses linked vocabularies and standard identifiers.
- **ready to publish and share**: it carries the metadata of each variable, potentially of each observation unit, and, through metadata standards like Dublin Core and DataCite, the information about the whole dataset, too.
```{r readjsonld}
# Create temporary JSON-LD output file
jsonld_file <- tempfile(fileext = ".jsonld")

# Serialize (export) the entire graph to JSON-LD format
rdf_serialize(rdf_graph, doc = jsonld_file, format = "jsonld")
```

Read it back into R for display (only the first 30 lines are shown):

:::: smaller

```{r readjsonldprint}
cat(head(readLines(jsonld_file), 30), sep = "\n")
```

::::

```{r cleanup2, echo=FALSE, message=FALSE}
# Clean up: delete both temporary files and clear the RDF graph
unlink(c(temp_file, jsonld_file))
rm(rdf_graph)
gc()
```
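The introduction to RDF above notes that triples can be queried via SPARQL. A minimal sketch of such a query with `rdflib` follows, assuming the package's `rdf_add()`, `rdf_query()` and `rdf_free()` functions. The one-statement graph `toy_graph` is a hypothetical stand-in built only for this illustration; the same query pattern would apply to a graph parsed from the N-Triples file.

```r
library(rdflib)

# Hypothetical one-statement graph reusing the IRIs of the earlier example
toy_graph <- rdf()
rdf_add(
  toy_graph,
  subject = "https://dataset.dataobservatory.eu/examples/dataset.html#obs1",
  predicate = "http://purl.org/dc/terms/title",
  object = "Small Country Dataset"
)

# Select every subject that has a Dublin Core title, with that title
sparql <- "
  SELECT ?s ?title
  WHERE { ?s <http://purl.org/dc/terms/title> ?title }
"

result <- rdf_query(toy_graph, sparql)
result

# Release the memory held by the graph
rdf_free(toy_graph)
```

The query returns one row per matching statement, so the single triple added here yields a single result row.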