---
title: "Introduction to DataSpaceR"
author: "Ju Yeong Kim"
date: "2022-06-15"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to DataSpaceR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
This package provides a thin wrapper around [Rlabkey](https://cran.r-project.org/package=Rlabkey) and connects to the the [CAVD DataSpace](https://dataspace.cavd.org) database, making it easier to fetch datasets from specific studies.
## Configuration
First, go to [DataSpace](https://dataspace.cavd.org) now and set yourself up with an account.
In order to connect to the CAVD DataSpace via `DataSpaceR`, you will need a `netrc` file in your home directory that will contain a `machine` name (hostname of DataSpace), and `login` and `password`. There are two ways to create a `netrc` file.
### Creating a netrc file with `writeNetrc`
On your R console, create a `netrc` file using a function from `DataSpaceR`:
```r
writeNetrc(
login = "yourEmail@address.com",
password = "yourSecretPassword",
netrcFile = "/your/home/directory/.netrc" # use getNetrcPath() to get the default path
)
```
This will create a `netrc` file in your home directory. Make sure you have a valid login and password.
### Manually creating a netrc file
***Alternatively***, you can manually create a netrc file.
* On Windows, this file should be named `_netrc`
* On UNIX/Mac, it should be named `.netrc`
* The file should be located in the user's home directory, and the permissions on the file should be unreadable for everybody except the owner
* To determine your home directory, run `Sys.getenv("HOME")` in R
The following three lines must be included in the `.netrc` or `_netrc` file either separated by white space (spaces, tabs, or newlines) or commas. Multiple such blocks can exist in one file.
```
machine dataspace.cavd.org
login myuser@domain.com
password supersecretpassword
```
See [here](https://www.labkey.org/wiki/home/Documentation/page.view?name=netrc) for more information about `netrc`.
## Initiate a connection
We'll be looking at study `cvd256`. If you want to use a different study, change that string. You can instantiate multiple connections to different studies simultaneously.
```r
library(DataSpaceR)
con <- connectDS()
con
#>
#> URL: https://dataspace.cavd.org
#> User: jmtaylor@scharp.org
#> Available studies: 273
#> - 77 studies with data
#> - 5049 subjects
#> - 423195 data points
#> Available groups: 3
#> Available publications: 1530
#> - 12 publications with data
```
The call to `connectDS` instantiates the connection. Printing the object shows where it's connected and the available studies.
```r
knitr::kable(head(con$availableStudies))
```
|study_name |short_name |title |type |status |stage |species |start_date |strategy |network |data_availability |ni_data_availability |
|:----------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|:--------|:----------------|:------------------|:----------|:------------------------------------|:-------|:-----------------|:--------------------|
|cor01 |NA |The correlate of risk targeted intervention study (CORTIS): A randomized, partially-blinded, clinical trial of isoniazid and rifapentine (3HP) therapy to prevent pulmonary tuberculosis in high-risk individuals identified by a transcriptomic correlate of risk |Phase III |Inactive |Assays Completed |Human |NA |NA |GH-VAP |NA |NA |
|cvd232 |Parks_RV_232 |Limiting Dose Vaginal SIVmac239 Challenge of RhCMV-SIV vaccinated Indian rhesus macaques. |Pre-Clinical NHP |Inactive |Assays Completed |Rhesus macaque |2009-11-24 |Vector vaccines (viral or bacterial) |CAVD |NA |NA |
|cvd234 |Zolla-Pazner_Mab_test1 Study |Zolla-Pazner_Mab_Test1 |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2009-02-03 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd235 |mAbs potency |Weiss mAbs potency |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2008-08-21 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd236 |neutralization assays |neutralization assays |Antibody Screening |Active |In Progress |Non-Organism Study |2009-02-03 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd238 |Gallo_PA_238 |HIV-1 neutralization responses in chronically infected individuals |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2009-01-08 |Prophylactic neutralizing Ab |CAVD |NA |NA |
`con$availableStudies` shows the available studies in the CAVD DataSpace. Check out [the reference page](https://docs.ropensci.org/DataSpaceR/reference/DataSpaceConnection.html) of `DataSpaceConnection` for all available fields and methods.
```r
cvd256 <- con$getStudy("cvd256")
cvd256
#>
#> Study: cvd256
#> URL: https://dataspace.cavd.org/CAVD/cvd256
#> Available datasets:
#> - Binding Ab multiplex assay
#> - Demographics
#> - Neutralizing antibody
#> Available non-integrated datasets:
```
`con$getStudy` creates a connection to the study `cvd256`. Printing the object shows where it's connected, to what study, and the available datasets.
```r
knitr::kable(cvd256$availableDatasets)
```
|name |label | n|integrated |
|:------------|:--------------------------|----:|:----------|
|BAMA |Binding Ab multiplex assay | 6740|TRUE |
|Demographics |Demographics | 121|TRUE |
|NAb |Neutralizing antibody | 1419|TRUE |
```r
knitr::kable(cvd256$treatmentArm)
```
|arm_id |arm_part |arm_group |arm_name |randomization |coded_label | last_day|description |
|:-------------|:--------|:---------|:--------|:-------------|:---------------|--------:|:-----------------------------------------------------------------------------------------------------|
|cvd256-NA-A-A |NA |A |A |Vaccine |Group A Vaccine | 168|DNA-C 4 mg administered IM at weeks 0, 4, and 8 AND NYVAC-C 10^7pfu/mL administered IM at week 24 |
|cvd256-NA-B-B |NA |B |B |Vaccine |Group B Vaccine | 168|DNA-C 4 mg administered IM at weeks 0 and 4 AND NYVAC-C 10^7pfu/mL administered IM at weeks 20 and 24 |
Available datasets and treatment arm information for the connection can be accessed by `availableDatasets` and `treatmentArm`.
## Fetching datasets
We can grab any of the datasets listed in the connection (`availableDatasets`).
```r
NAb <- cvd256$getDataset("NAb")
dim(NAb)
#> [1] 1419 33
colnames(NAb)
#> [1] "participant_id" "participant_visit" "visit_day"
#> [4] "assay_identifier" "summary_level" "specimen_type"
#> [7] "antigen" "antigen_type" "virus"
#> [10] "virus_type" "virus_insert_name" "clade"
#> [13] "neutralization_tier" "tier_clade_virus" "target_cell"
#> [16] "initial_dilution" "titer_ic50" "titer_ic80"
#> [19] "response_call" "nab_lab_source_key" "lab_code"
#> [22] "exp_assayid" "titer_id50" "titer_id80"
#> [25] "nab_response_id50" "nab_response_id80" "slope"
#> [28] "vaccine_matched" "study_prot" "virus_full_name"
#> [31] "virus_species" "virus_host_cell" "virus_backbone"
```
The *cvd256* object is an [`R6`](https://cran.r-project.org/package=R6) class, so it behaves like a true object. Functions (like `getDataset`) are members of the object, thus the `$` semantics to access member functions.
We can get detailed variable information using `getDatasetDescription`. `getDataset` and `getDatasetDescription` accept either the `name` or `label` field listed in `availableDatasets`.
```r
knitr::kable(cvd256$getDatasetDescription("NAb"))
```
|fieldName |caption |type |description |
|:-------------------|:-------------------------------------------|:--------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|ParticipantId |Participant ID |Text (String) |Subject identifier |
|antigen |Antigen name |Text (String) |The name of the antigen (virus) being tested. |
|antigen_type |Antigen type |Text (String) |The standardized term for the type of virus used in the construction of the nAb antigen. |
|assay_identifier |Assay identifier |Text (String) |Name identifying assay |
|clade |Virus clade |Text (String) |The clade (gene subtype) of the virus (antigen) being tested. |
|exp_assayid |Experimental Assay Design Code |Integer |Unique ID assigned to the experiment design of the assay for tracking purposes. |
|initial_dilution |Initial dilution |Number (Double) |Indicates the initial specimen dilution. |
|lab_code |Lab ID |Text (String) |A code indicating the lab performing the assay. |
|nab_lab_source_key |Data provenance |Integer |Details regarding the provenance of the assay results. |
|nab_response_ID50 |Response call ID50 |True/False (Boolean) |Indicates if neutralization is detected based on ID50 titer. |
|nab_response_ID80 |Response call ID80 |True/False (Boolean) |Indicates if neutralization is detected based on ID80 titer. |
|neutralization_tier |Neutralization tier |Text (String) |A classification specific to HIV NAb assay design, in which an antigen is assessed for its ease of neutralization (1=most easily neutralized, 3=least easily neutralized) |
|response_call |Response call |True/False (Boolean) |Indicates if neutralization is detected. |
|slope |Slope |Number (Double) |The slope calculated using the difference between 50% and 80% neutralization. |
|specimen_type |Specimen type |Text (String) |The type of specimen used in the assay. For nAb assays, this is generally serum or plasma. |
|study_prot |Study Protocol |Text (String) |Study protocol |
|summary_level |Data summary level |Text (String) |Defines the level at which the magnitude or response has been summarized (e.g. summarized at the isolate level). |
|target_cell |Target cell |Text (String) |The cell line used in the assay to determine infection (lack of neutralization). Generally TZM-bl or A3R5, but can also be other cell lines or non-engineered cells. |
|tier_clade_virus |Neutralization tier + Antigen clade + Virus |Text (String) |A combination of neutralization tier, antigen clade, and virus used for filtering. |
|titer_ID50 |Titer ID50 |Number (Double) |The adjusted value of 50% maximal inhibitory dilution (ID50). |
|titer_ID80 |Titer ID80 |Number (Double) |The adjusted value of 80% maximal inhibitory dilution (ID80). |
|titer_ic50 |Titer IC50 |Number (Double) |The half maximal inhibitory concentration (IC50). |
|titer_ic80 |Titer IC80 |Number (Double) |The 80% maximal inhibitory concentration (IC80). |
|vaccine_matched |Antigen vaccine match indicator |True/False (Boolean) |Indicates if the interactive part of the antigen was designed to match the immunogen in the vaccine. |
|virus |Virus name |Text (String) |The term for the virus (antigen) being tested. |
|virus_backbone |Virus backbone |Text (String) |Indicates the backbone used to generate the virus if from a different plasmid than the envelope. |
|virus_full_name |Virus full name |Text (String) |The full name of the virus used in the construction of the nAb antigen. |
|virus_host_cell |Virus host cell |Text (String) |The host cell used to incubate the virus stock. |
|virus_insert_name |Virus insert name |Text (String) |The amino acid sequence inserted in the virus construct. |
|virus_species |Virus species |Text (String) |A classification for virus species using informal taxonomy. |
|virus_type |Virus type |Text (String) |The type of virus used in the construction of the nAb antigen. |
|visit_day |Visit Day |Integer |Target study day defined for a study visit. Study days are relative to Day 0, where Day 0 is typically defined as enrollment and/or first injection. |
To get only a subset of the data and speed up the download, filters can be passed to `getDataset`. The filters are created using the `makeFilter` function of the `Rlabkey` package.
```r
cvd256Filter <- makeFilter(c("visit_day", "EQUAL", "0"))
NAb_day0 <- cvd256$getDataset("NAb", colFilter = cvd256Filter)
dim(NAb_day0)
#> [1] 709 33
```
See `?makeFilter` for more information on the syntax.
## Creating a connection to all studies
To fetch data from multiple studies, create a connection at the project level.
```r
cavd <- con$getStudy("")
```
This will instantiate a connection at the `CAVD` level. Most functions work cross study connections just like they do on single studies.
You can get a list of datasets available across all studies.
```r
cavd
#>
#> Study: CAVD
#> URL: https://dataspace.cavd.org/CAVD
#> Available datasets:
#> - Binding Ab multiplex assay
#> - Demographics
#> - Enzyme-Linked ImmunoSpot
#> - Intracellular Cytokine Staining
#> - Neutralizing antibody
#> - PK MAb
#> Available non-integrated datasets:
knitr::kable(cavd$availableDatasets)
```
|name |label | n|integrated |
|:------------|:-------------------------------|------:|:----------|
|BAMA |Binding Ab multiplex assay | 170320|TRUE |
|Demographics |Demographics | 5049|TRUE |
|ELISPOT |Enzyme-Linked ImmunoSpot | 5610|TRUE |
|ICS |Intracellular Cytokine Staining | 195883|TRUE |
|NAb |Neutralizing antibody | 51382|TRUE |
|PKMAb |PK MAb | 3217|TRUE |
In all-study connection, `getDataset` will combine the requested datasets. Note that in most cases, the datasets will have too many subjects for quick data transfer, making filtering of the data a necessity. The `colFilter` argument can be used here, as described in the `getDataset` section.
```r
conFilter <- makeFilter(c("species", "EQUAL", "Human"))
human <- cavd$getDataset("Demographics", colFilter = conFilter)
dim(human)
#> [1] 3142 36
colnames(human)
#> [1] "subject_id" "subject_visit"
#> [3] "species" "subspecies"
#> [5] "sexatbirth" "race"
#> [7] "ethnicity" "country_enrollment"
#> [9] "circumcised_enrollment" "bmi_enrollment"
#> [11] "agegroup_range" "agegroup_enrollment"
#> [13] "age_enrollment" "study_label"
#> [15] "study_start_date" "study_first_enr_date"
#> [17] "study_fu_complete_date" "study_public_date"
#> [19] "study_network" "study_last_vaccination_day"
#> [21] "study_type" "study_part"
#> [23] "study_group" "study_arm"
#> [25] "study_arm_summary" "study_arm_coded_label"
#> [27] "study_randomization" "study_product_class_combination"
#> [29] "study_product_combination" "study_short_name"
#> [31] "study_grant_pi_name" "study_strategy"
#> [33] "study_prot" "genderidentity"
#> [35] "studycohort" "bmi_category"
```
Check out [the reference page](https://docs.ropensci.org/DataSpaceR/reference/DataSpaceStudy.html) of `DataSpaceStudy` for all available fields and methods.
## Connect to a saved group
A group is a curated collection of participants from filtering of treatments, products, studies, or species, and it is created in [the DataSpace App](https://dataspace.cavd.org/cds/CAVD/app.view).
Let's say you are using the App to filter and visualize data and want to save them for later or explore in R with `DataSpaceR`. You can save a group by clicking the Save button on the Active Filter Panel.
We can browse available the saved groups or the curated groups by DataSpace Team via `availableGroups`.
```r
knitr::kable(con$availableGroups)
```
| group_id|label |original_label |description |created_by |shared | n|studies |
|--------:|:----------------------------------|:----------------------------------|:-------------------------------------------------------------------------------------------------------------------------|:----------|:------|---:|:------------------------------|
| 220|NYVAC durability comparison |NYVAC_durability |Compare durability in 4 NHP studies using NYVAC-C (vP2010) and NYVAC-KC-gp140 (ZM96) products. |ehenrich |TRUE | 78|cvd281, cvd434, cvd259, cvd277 |
| 228|HVTN 505 case control subjects |HVTN 505 case control subjects |Participants from HVTN 505 included in the case-control analysis |drienna |TRUE | 189|vtn505 |
| 230|HVTN 505 polyfunctionality vs BAMA |HVTN 505 polyfunctionality vs BAMA |Compares ICS polyfunctionality (CD8+, Any Env) to BAMA mfi-delta (single Env antigen) in the HVTN 505 case control cohort |drienna |TRUE | 170|vtn505 |
To fetch data from a saved group, create a connection at the project level with a group ID. For example, we can connect to the "NYVAC durability comparison" group which has group ID 220 by `getGroup`.
```r
nyvac <- con$getGroup(220)
nyvac
#>
#> Group: NYVAC durability comparison
#> URL: https://dataspace.cavd.org/CAVD
#> Available datasets:
#> - Binding Ab multiplex assay
#> - Demographics
#> - Enzyme-Linked ImmunoSpot
#> - Intracellular Cytokine Staining
#> - Neutralizing antibody
#> Available non-integrated datasets:
```
Retrieving a dataset is the same as before.
```r
NAb_nyvac <- nyvac$getDataset("NAb")
dim(NAb_nyvac)
#> [1] 4281 33
```
## Access Virus Metadata
DataSpace maintains metadata about all viruses used in Neutralizing Antibody (NAb) assays. This data can be accessed through the app on the [NAb antigen page](https://dataspace.cavd.org/cds/CAVD/app.view#learn/learn/Assay/NAB/antigens) and [NAb MAb antigen page](https://dataspace.cavd.org/cds/CAVD/app.view#learn/learn/Assay/NAB%20MAB/antigens).
We can access this metadata in `DataSpaceR` with `con$virusMetadata`:
```r
knitr::kable(head(con$virusMetadata))
```
|assay_identifier |cds_virus_id |virus |virus_type |neutralization_tier |clade |antigen_control |virus_full_name |virus_name_other |virus_species |virus_host_cell |virus_backbone |panel_names |
|:----------------|:------------|:------------|:--------------|:-------------------|:-----|:---------------|:------------------------------|:----------------|:-------------|:---------------|:--------------|:--------------------|
|NAB MAB |cds_1 |0013095-2.11 |Env Pseudotype |2 |NA |0 |0013095-2.11 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_2 |001428-2.42 |Env Pseudotype |2 |C |0 |001428-2.42 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_3 |0041.v3.c18 |Env Pseudotype |2 |C |0 |0041.v3.c18 [SG3Δenv] 293T/17 |0041.V3.C18 |HIV |293T/17 |SG3Δenv |NA |
|NAB MAB |cds_4 |0077.v1.c16 |Env Pseudotype |2 |C |0 |0077.v1.c16 [SG3Δenv] 293T/17 |0077.v1.c16 |HIV |293T/17 |SG3Δenv |NA |
|NAB |cds_252 |00836-2.5 |Env Pseudotype |1B |C |0 |00836-2.5 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_5 |0260.v5.c1 |Env Pseudotype |2 |A |0 |0260.v5.c1 [SG3Δenv] 293T/17 |0260.V5.C1 |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
## Access monoclonal antibody data
See other vignette for a tutorial on accessing monoclonal antibody data with `DataSpaceR`:
```r
vignette("Monoconal_Antibody_Data")
```
## Browse and Download Publication Data
DataSpace maintains a curated collection of relevant publications, which can be accessed through the [Publications page](https://dataspace.cavd.org/cds/CAVD/app.view?#learn/learn/Publication) through the app. Metadata about these publications can be accessed through `DataSpaceR` with `con$availablePublications`.
See Publication Data vignette for a tutorial on accessing publication data through DataSpaceR.
```r
vignette("Publication_Data")
#> Warning: vignette 'Publication_Data' not found
```
## Reference Tables
The followings are the tables of all fields and methods that work on `DataSpaceConnection` and `DataSpaceStudy` objects and could be used as a quick reference.
### `DataSpaceConnection`
| Name | Description |
| --- | --- |
| `availableStudies` | The table of available studies. |
| `availableGroups` | The table of available groups. |
| `availablePublications` | The table of available publications. |
| `mabGrid` | The filtered mAb grid. |
| `mabGridSummary` | The summarized mAb grid with updated `n_` columns and `geometric_mean_curve_ic50`. |
| `virusMetadata` | Metadata about all viruses in the DataSpace. |
| `filterMabGrid` | Filter rows in the mAb grid by specifying the values to keep in the columns found in the `mabGrid` field. |
| `resetMabGrid` | Reset the mAb grid to the unfiltered state. |
| `getMab` | Create a `DataSpaceMab` object by filtered `mabGrid`. |
| `getStudy` | Create a `DataSpaceStudy` object by study. |
| `getGroup` | Create a `DataSpaceStudy` object by group. |
| `downloadPublicationData` | Download data from a chosen publication. |
### `DataSpaceStudy`
| Name | Description |
| --- | --- |
| `study` | The study name. |
| `group` | The group name. |
| `availableDatasets` | The table of datasets available in the study object. |
| `treatmentArm` | The table of treatment arm information for the connected study. Not available for all study connection. |
| `dataDir` | The default target directory for downloading non-integrated datasets. |
| `studyInfo` | Stores the information about the study. |
| `getDataset` | Get a dataset from the connection. |
| `getDatasetDescription` | Get variable information. |
| `setDataDir` | Set default target directory for downloading non-integrated datasets. |
### `DataSpaceMab`
| Name | Description |
| --- | --- |
| `studyAndMabs` | The table of available mAbs by study. |
| `mabs` | The table of available mAbs and their attributes. |
| `nabMab` | The table of mAbs and their neutralizing measurements against viruses. |
| `studies` | The table of available studies. |
| `assays` | The table of assay status by study. |
| `variableDefinitions` | The table of variable definitions. |
## Session information
```r
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
#> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
#> [7] LC_PAPER=en_US.utf8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.14.2 DataSpaceR_0.7.5 knitr_1.37
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.8 digest_0.6.29 assertthat_0.2.1 R6_2.5.1
#> [5] jsonlite_1.8.0 magrittr_2.0.2 evaluate_0.15 highr_0.9
#> [9] httr_1.4.2 stringi_1.7.6 curl_4.3.2 tools_4.1.2
#> [13] stringr_1.4.0 Rlabkey_2.8.3 xfun_0.29 compiler_4.1.2
```