---
title: "Introduction to DataSpaceR"
author: "Ju Yeong Kim"
date: "2022-06-15"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to DataSpaceR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This package provides a thin wrapper around [Rlabkey](https://cran.r-project.org/package=Rlabkey) and connects to the [CAVD DataSpace](https://dataspace.cavd.org) database, making it easier to fetch datasets from specific studies.

## Configuration

First, go to [DataSpace](https://dataspace.cavd.org) and create an account.

To connect to the CAVD DataSpace via `DataSpaceR`, you need a `netrc` file in your home directory that contains a `machine` name (the hostname of DataSpace), a `login`, and a `password`. There are two ways to create a `netrc` file.

### Creating a netrc file with `writeNetrc`

On your R console, create a `netrc` file using a function from `DataSpaceR`:

```r
library(DataSpaceR)

writeNetrc(
  login = "yourEmail@address.com",
  password = "yourSecretPassword",
  netrcFile = "/your/home/directory/.netrc" # use getNetrcPath() to get the default path
)
```

This will create a `netrc` file in your home directory. Make sure you have a valid login and password.

### Manually creating a netrc file

***Alternatively***, you can manually create a netrc file.

* On Windows, this file should be named `_netrc`
* On UNIX/Mac, it should be named `.netrc`
* The file should be located in the user's home directory, and its permissions should make it readable only by the owner
* To determine your home directory, run `Sys.getenv("HOME")` in R

The following three lines must be included in the `.netrc` or `_netrc` file, separated by either white space (spaces, tabs, or newlines) or commas. Multiple such blocks can exist in one file.

```
machine dataspace.cavd.org
login myuser@domain.com
password supersecretpassword
```

See [here](https://www.labkey.org/wiki/home/Documentation/page.view?name=netrc) for more information about `netrc`.

## Initiate a connection

We'll be looking at study `cvd256`. If you want to use a different study, change that string. You can instantiate multiple connections to different studies simultaneously.

```r
library(DataSpaceR)
con <- connectDS()
con
#>
#>   URL: https://dataspace.cavd.org
#>   User: jmtaylor@scharp.org
#>   Available studies: 273
#>     - 77 studies with data
#>     - 5049 subjects
#>     - 423195 data points
#>   Available groups: 3
#>   Available publications: 1530
#>     - 12 publications with data
```

The call to `connectDS` instantiates the connection. Printing the object shows where it's connected and the available studies.

```r
knitr::kable(head(con$availableStudies))
```

|study_name |short_name |title |type |status |stage |species |start_date |strategy |network |data_availability |ni_data_availability |
|:----------|:----------|:-----|:----|:------|:-----|:-------|:----------|:--------|:-------|:-----------------|:--------------------|
|cor01 |NA |The correlate of risk targeted intervention study (CORTIS): A randomized, partially-blinded, clinical trial of isoniazid and rifapentine (3HP) therapy to prevent pulmonary tuberculosis in high-risk individuals identified by a transcriptomic correlate of risk |Phase III |Inactive |Assays Completed |Human |NA |NA |GH-VAP |NA |NA |
|cvd232 |Parks_RV_232 |Limiting Dose Vaginal SIVmac239 Challenge of RhCMV-SIV vaccinated Indian rhesus macaques. |Pre-Clinical NHP |Inactive |Assays Completed |Rhesus macaque |2009-11-24 |Vector vaccines (viral or bacterial) |CAVD |NA |NA |
|cvd234 |Zolla-Pazner_Mab_test1 Study |Zolla-Pazner_Mab_Test1 |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2009-02-03 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd235 |mAbs potency |Weiss mAbs potency |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2008-08-21 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd236 |neutralization assays |neutralization assays |Antibody Screening |Active |In Progress |Non-Organism Study |2009-02-03 |Prophylactic neutralizing Ab |CAVD |NA |NA |
|cvd238 |Gallo_PA_238 |HIV-1 neutralization responses in chronically infected individuals |Antibody Screening |Inactive |Assays Completed |Non-Organism Study |2009-01-08 |Prophylactic neutralizing Ab |CAVD |NA |NA |

`con$availableStudies` shows the available studies in the CAVD DataSpace. Check out [the reference page](https://docs.ropensci.org/DataSpaceR/reference/DataSpaceConnection.html) of `DataSpaceConnection` for all available fields and methods.

```r
cvd256 <- con$getStudy("cvd256")
cvd256
#>
#>   Study: cvd256
#>   URL: https://dataspace.cavd.org/CAVD/cvd256
#>   Available datasets:
#>     - Binding Ab multiplex assay
#>     - Demographics
#>     - Neutralizing antibody
#>   Available non-integrated datasets:
```

`con$getStudy` creates a connection to the study `cvd256`. Printing the object shows where it's connected, to what study, and the available datasets.

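Since several study connections can be held at once, it can be convenient to open them in one step. The following is a minimal sketch and not part of `DataSpaceR`: `getStudies` is a hypothetical helper name, and it assumes only that `con` exposes the `getStudy` method shown above. The second study name in the usage comment is purely illustrative.

```r
# Hypothetical helper (not part of DataSpaceR): open one study connection per
# study name and return the connections as a named list.
getStudies <- function(con, studyNames) {
  # Assumption: `con` has a getStudy() method, like the object from connectDS()
  studies <- lapply(studyNames, function(studyName) con$getStudy(studyName))
  names(studies) <- studyNames
  studies
}

# Usage with a live connection (requires a valid netrc file):
# con <- connectDS()
# studies <- getStudies(con, c("cvd256", "cvd232"))
# studies$cvd256$availableDatasets
```

Returning a named list keeps each study object addressable by its study name, so the rest of this vignette applies unchanged to any element of the list.
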
```r
knitr::kable(cvd256$availableDatasets)
```

|name |label | n|integrated |
|:------------|:--------------------------|----:|:----------|
|BAMA |Binding Ab multiplex assay | 6740|TRUE |
|Demographics |Demographics | 121|TRUE |
|NAb |Neutralizing antibody | 1419|TRUE |

```r
knitr::kable(cvd256$treatmentArm)
```

|arm_id |arm_part |arm_group |arm_name |randomization |coded_label | last_day|description |
|:------|:--------|:---------|:--------|:-------------|:-----------|--------:|:-----------|
|cvd256-NA-A-A |NA |A |A |Vaccine |Group A Vaccine | 168|DNA-C 4 mg administered IM at weeks 0, 4, and 8 AND NYVAC-C 10^7pfu/mL administered IM at week 24 |
|cvd256-NA-B-B |NA |B |B |Vaccine |Group B Vaccine | 168|DNA-C 4 mg administered IM at weeks 0 and 4 AND NYVAC-C 10^7pfu/mL administered IM at weeks 20 and 24 |

The available datasets and the treatment arm information for the connection can be accessed via the `availableDatasets` and `treatmentArm` fields.

## Fetching datasets

We can grab any of the datasets listed in the connection (`availableDatasets`).

```r
NAb <- cvd256$getDataset("NAb")
dim(NAb)
#> [1] 1419   33
colnames(NAb)
#>  [1] "participant_id"      "participant_visit"   "visit_day"
#>  [4] "assay_identifier"    "summary_level"       "specimen_type"
#>  [7] "antigen"             "antigen_type"        "virus"
#> [10] "virus_type"          "virus_insert_name"   "clade"
#> [13] "neutralization_tier" "tier_clade_virus"    "target_cell"
#> [16] "initial_dilution"    "titer_ic50"          "titer_ic80"
#> [19] "response_call"       "nab_lab_source_key"  "lab_code"
#> [22] "exp_assayid"         "titer_id50"          "titer_id80"
#> [25] "nab_response_id50"   "nab_response_id80"   "slope"
#> [28] "vaccine_matched"     "study_prot"          "virus_full_name"
#> [31] "virus_species"       "virus_host_cell"     "virus_backbone"
```

The *cvd256* object is an instance of an [`R6`](https://cran.r-project.org/package=R6) class, so it behaves like a true object. Functions (like `getDataset`) are members of the object, hence the `$` syntax for accessing them.

We can get detailed variable information using `getDatasetDescription`. `getDataset` and `getDatasetDescription` accept either the `name` or `label` field listed in `availableDatasets`.

```r
knitr::kable(cvd256$getDatasetDescription("NAb"))
```

|fieldName |caption |type |description |
|:---------|:-------|:----|:-----------|
|ParticipantId |Participant ID |Text (String) |Subject identifier |
|antigen |Antigen name |Text (String) |The name of the antigen (virus) being tested. |
|antigen_type |Antigen type |Text (String) |The standardized term for the type of virus used in the construction of the nAb antigen. |
|assay_identifier |Assay identifier |Text (String) |Name identifying assay |
|clade |Virus clade |Text (String) |The clade (gene subtype) of the virus (antigen) being tested. |
|exp_assayid |Experimental Assay Design Code |Integer |Unique ID assigned to the experiment design of the assay for tracking purposes. |
|initial_dilution |Initial dilution |Number (Double) |Indicates the initial specimen dilution. |
|lab_code |Lab ID |Text (String) |A code indicating the lab performing the assay. |
|nab_lab_source_key |Data provenance |Integer |Details regarding the provenance of the assay results. |
|nab_response_ID50 |Response call ID50 |True/False (Boolean) |Indicates if neutralization is detected based on ID50 titer. |
|nab_response_ID80 |Response call ID80 |True/False (Boolean) |Indicates if neutralization is detected based on ID80 titer. |
|neutralization_tier |Neutralization tier |Text (String) |A classification specific to HIV NAb assay design, in which an antigen is assessed for its ease of neutralization (1=most easily neutralized, 3=least easily neutralized) |
|response_call |Response call |True/False (Boolean) |Indicates if neutralization is detected. |
|slope |Slope |Number (Double) |The slope calculated using the difference between 50% and 80% neutralization. |
|specimen_type |Specimen type |Text (String) |The type of specimen used in the assay. For nAb assays, this is generally serum or plasma. |
|study_prot |Study Protocol |Text (String) |Study protocol |
|summary_level |Data summary level |Text (String) |Defines the level at which the magnitude or response has been summarized (e.g. summarized at the isolate level). |
|target_cell |Target cell |Text (String) |The cell line used in the assay to determine infection (lack of neutralization). Generally TZM-bl or A3R5, but can also be other cell lines or non-engineered cells. |
|tier_clade_virus |Neutralization tier + Antigen clade + Virus |Text (String) |A combination of neutralization tier, antigen clade, and virus used for filtering. |
|titer_ID50 |Titer ID50 |Number (Double) |The adjusted value of 50% maximal inhibitory dilution (ID50). |
|titer_ID80 |Titer ID80 |Number (Double) |The adjusted value of 80% maximal inhibitory dilution (ID80). |
|titer_ic50 |Titer IC50 |Number (Double) |The half maximal inhibitory concentration (IC50). |
|titer_ic80 |Titer IC80 |Number (Double) |The 80% maximal inhibitory concentration (IC80). |
|vaccine_matched |Antigen vaccine match indicator |True/False (Boolean) |Indicates if the interactive part of the antigen was designed to match the immunogen in the vaccine. |
|virus |Virus name |Text (String) |The term for the virus (antigen) being tested. |
|virus_backbone |Virus backbone |Text (String) |Indicates the backbone used to generate the virus if from a different plasmid than the envelope. |
|virus_full_name |Virus full name |Text (String) |The full name of the virus used in the construction of the nAb antigen. |
|virus_host_cell |Virus host cell |Text (String) |The host cell used to incubate the virus stock. |
|virus_insert_name |Virus insert name |Text (String) |The amino acid sequence inserted in the virus construct. |
|virus_species |Virus species |Text (String) |A classification for virus species using informal taxonomy. |
|virus_type |Virus type |Text (String) |The type of virus used in the construction of the nAb antigen. |
|visit_day |Visit Day |Integer |Target study day defined for a study visit. Study days are relative to Day 0, where Day 0 is typically defined as enrollment and/or first injection. |

To get only a subset of the data and speed up the download, filters can be passed to `getDataset`. The filters are created using the `makeFilter` function of the `Rlabkey` package.

```r
cvd256Filter <- makeFilter(c("visit_day", "EQUAL", "0"))
NAb_day0 <- cvd256$getDataset("NAb", colFilter = cvd256Filter)
dim(NAb_day0)
#> [1] 709  33
```

See `?makeFilter` for more information on the syntax.

## Creating a connection to all studies

To fetch data from multiple studies, create a connection at the project level.

```r
cavd <- con$getStudy("")
```

This will instantiate a connection at the `CAVD` level. Most functions work on cross-study connections just as they do on single-study connections.

You can get a list of datasets available across all studies.

```r
cavd
#>
#>   Study: CAVD
#>   URL: https://dataspace.cavd.org/CAVD
#>   Available datasets:
#>     - Binding Ab multiplex assay
#>     - Demographics
#>     - Enzyme-Linked ImmunoSpot
#>     - Intracellular Cytokine Staining
#>     - Neutralizing antibody
#>     - PK MAb
#>   Available non-integrated datasets:
knitr::kable(cavd$availableDatasets)
```

|name |label | n|integrated |
|:------------|:-------------------------------|------:|:----------|
|BAMA |Binding Ab multiplex assay | 170320|TRUE |
|Demographics |Demographics | 5049|TRUE |
|ELISPOT |Enzyme-Linked ImmunoSpot | 5610|TRUE |
|ICS |Intracellular Cytokine Staining | 195883|TRUE |
|NAb |Neutralizing antibody | 51382|TRUE |
|PKMAb |PK MAb | 3217|TRUE |

In an all-study connection, `getDataset` will combine the requested datasets. Note that in most cases, the datasets will have too many subjects for quick data transfer, making filtering of the data a necessity. The `colFilter` argument can be used here, as described in the "Fetching datasets" section.

```r
conFilter <- makeFilter(c("species", "EQUAL", "Human"))
human <- cavd$getDataset("Demographics", colFilter = conFilter)
dim(human)
#> [1] 3142   36
colnames(human)
#>  [1] "subject_id"                      "subject_visit"
#>  [3] "species"                         "subspecies"
#>  [5] "sexatbirth"                      "race"
#>  [7] "ethnicity"                       "country_enrollment"
#>  [9] "circumcised_enrollment"          "bmi_enrollment"
#> [11] "agegroup_range"                  "agegroup_enrollment"
#> [13] "age_enrollment"                  "study_label"
#> [15] "study_start_date"                "study_first_enr_date"
#> [17] "study_fu_complete_date"          "study_public_date"
#> [19] "study_network"                   "study_last_vaccination_day"
#> [21] "study_type"                      "study_part"
#> [23] "study_group"                     "study_arm"
#> [25] "study_arm_summary"               "study_arm_coded_label"
#> [27] "study_randomization"             "study_product_class_combination"
#> [29] "study_product_combination"       "study_short_name"
#> [31] "study_grant_pi_name"             "study_strategy"
#> [33] "study_prot"                      "genderidentity"
#> [35] "studycohort"                     "bmi_category"
```

Check out [the reference page](https://docs.ropensci.org/DataSpaceR/reference/DataSpaceStudy.html) of `DataSpaceStudy` for all available fields and methods.

## Connect to a saved group

A group is a curated collection of participants obtained by filtering on treatments, products, studies, or species, and it is created in [the DataSpace App](https://dataspace.cavd.org/cds/CAVD/app.view).

Suppose you are using the App to filter and visualize data and want to save the result for later or explore it in R with `DataSpaceR`. You can save a group by clicking the Save button on the Active Filter Panel.

We can browse the available saved groups, as well as the groups curated by the DataSpace Team, via `availableGroups`.

```r
knitr::kable(con$availableGroups)
```

| group_id|label |original_label |description |created_by |shared | n|studies |
|--------:|:-----|:--------------|:-----------|:----------|:------|---:|:-------|
| 220|NYVAC durability comparison |NYVAC_durability |Compare durability in 4 NHP studies using NYVAC-C (vP2010) and NYVAC-KC-gp140 (ZM96) products. |ehenrich |TRUE | 78|cvd281, cvd434, cvd259, cvd277 |
| 228|HVTN 505 case control subjects |HVTN 505 case control subjects |Participants from HVTN 505 included in the case-control analysis |drienna |TRUE | 189|vtn505 |
| 230|HVTN 505 polyfunctionality vs BAMA |HVTN 505 polyfunctionality vs BAMA |Compares ICS polyfunctionality (CD8+, Any Env) to BAMA mfi-delta (single Env antigen) in the HVTN 505 case control cohort |drienna |TRUE | 170|vtn505 |

To fetch data from a saved group, create a connection at the project level with a group ID. For example, we can connect to the "NYVAC durability comparison" group, which has group ID 220, with `getGroup`.

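When only the label of a group is known, its ID can be looked up in the `availableGroups` table first. The following is a minimal sketch under stated assumptions: `findGroupId` is a hypothetical helper, not a `DataSpaceR` function, and the small `groups` data frame is a stand-in for `con$availableGroups` built from the `group_id` and `label` values shown in the table above.

```r
# Hypothetical helper (not part of DataSpaceR): return the group_id whose
# label matches exactly; stop if there is no match or more than one.
findGroupId <- function(groups, label) {
  ids <- groups$group_id[which(groups$label == label)]
  if (length(ids) != 1) {
    stop("Expected exactly one group labelled '", label, "', found ", length(ids))
  }
  ids
}

# Stand-in for con$availableGroups, using values from the table above
groups <- data.frame(
  group_id = c(220, 228, 230),
  label = c(
    "NYVAC durability comparison",
    "HVTN 505 case control subjects",
    "HVTN 505 polyfunctionality vs BAMA"
  ),
  stringsAsFactors = FALSE
)

findGroupId(groups, "NYVAC durability comparison")
#> [1] 220
```

With a live connection, the same lookup on `con$availableGroups` would yield the ID that is passed to `getGroup`.
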
```r
nyvac <- con$getGroup(220)
nyvac
#>
#>   Group: NYVAC durability comparison
#>   URL: https://dataspace.cavd.org/CAVD
#>   Available datasets:
#>     - Binding Ab multiplex assay
#>     - Demographics
#>     - Enzyme-Linked ImmunoSpot
#>     - Intracellular Cytokine Staining
#>     - Neutralizing antibody
#>   Available non-integrated datasets:
```

Retrieving a dataset is the same as before.

```r
NAb_nyvac <- nyvac$getDataset("NAb")
dim(NAb_nyvac)
#> [1] 4281   33
```

## Access Virus Metadata

DataSpace maintains metadata about all viruses used in Neutralizing Antibody (NAb) assays. This data can be accessed through the app on the [NAb antigen page](https://dataspace.cavd.org/cds/CAVD/app.view#learn/learn/Assay/NAB/antigens) and [NAb MAb antigen page](https://dataspace.cavd.org/cds/CAVD/app.view#learn/learn/Assay/NAB%20MAB/antigens).

We can access this metadata in `DataSpaceR` with `con$virusMetadata`:

```r
knitr::kable(head(con$virusMetadata))
```

|assay_identifier |cds_virus_id |virus |virus_type |neutralization_tier |clade |antigen_control |virus_full_name |virus_name_other |virus_species |virus_host_cell |virus_backbone |panel_names |
|:----------------|:------------|:-----|:----------|:-------------------|:-----|:---------------|:---------------|:----------------|:-------------|:---------------|:--------------|:-----------|
|NAB MAB |cds_1 |0013095-2.11 |Env Pseudotype |2 |NA |0 |0013095-2.11 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_2 |001428-2.42 |Env Pseudotype |2 |C |0 |001428-2.42 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_3 |0041.v3.c18 |Env Pseudotype |2 |C |0 |0041.v3.c18 [SG3Δenv] 293T/17 |0041.V3.C18 |HIV |293T/17 |SG3Δenv |NA |
|NAB MAB |cds_4 |0077.v1.c16 |Env Pseudotype |2 |C |0 |0077.v1.c16 [SG3Δenv] 293T/17 |0077.v1.c16 |HIV |293T/17 |SG3Δenv |NA |
|NAB |cds_252 |00836-2.5 |Env Pseudotype |1B |C |0 |00836-2.5 [SG3Δenv] 293T/17 |NA |HIV |293T/17 |SG3Δenv |Tiered diverse panel |
|NAB MAB |cds_5 |0260.v5.c1 |Env Pseudotype |2 |A |0 |0260.v5.c1 [SG3Δenv] 293T/17 |0260.V5.C1 |HIV |293T/17 |SG3Δenv |Tiered diverse panel |

## Access monoclonal antibody data

See the monoclonal antibody vignette for a tutorial on accessing monoclonal antibody data with `DataSpaceR`:

```r
vignette("Monoconal_Antibody_Data")
```

## Browse and Download Publication Data

DataSpace maintains a curated collection of relevant publications, which can be accessed through the [Publications page](https://dataspace.cavd.org/cds/CAVD/app.view?#learn/learn/Publication) in the app. Metadata about these publications can be accessed through `DataSpaceR` with `con$availablePublications`. See the Publication Data vignette for a tutorial on accessing publication data through `DataSpaceR`.

```r
vignette("Publication_Data")
#> Warning: vignette 'Publication_Data' not found
```

## Reference Tables

The following tables list all fields and methods available on `DataSpaceConnection`, `DataSpaceStudy`, and `DataSpaceMab` objects and can be used as a quick reference.

### `DataSpaceConnection`

| Name | Description |
| --- | --- |
| `availableStudies` | The table of available studies. |
| `availableGroups` | The table of available groups. |
| `availablePublications` | The table of available publications. |
| `mabGrid` | The filtered mAb grid. |
| `mabGridSummary` | The summarized mAb grid with updated `n_` columns and `geometric_mean_curve_ic50`. |
| `virusMetadata` | Metadata about all viruses in the DataSpace. |
| `filterMabGrid` | Filter rows in the mAb grid by specifying the values to keep in the columns found in the `mabGrid` field. |
| `resetMabGrid` | Reset the mAb grid to the unfiltered state. |
| `getMab` | Create a `DataSpaceMab` object from the filtered `mabGrid`. |
| `getStudy` | Create a `DataSpaceStudy` object by study. |
| `getGroup` | Create a `DataSpaceStudy` object by group. |
| `downloadPublicationData` | Download data from a chosen publication. |

### `DataSpaceStudy`

| Name | Description |
| --- | --- |
| `study` | The study name. |
| `group` | The group name. |
| `availableDatasets` | The table of datasets available in the study object. |
| `treatmentArm` | The table of treatment arm information for the connected study. Not available for all-study connections. |
| `dataDir` | The default target directory for downloading non-integrated datasets. |
| `studyInfo` | Stores the information about the study. |
| `getDataset` | Get a dataset from the connection. |
| `getDatasetDescription` | Get variable information. |
| `setDataDir` | Set default target directory for downloading non-integrated datasets. |

### `DataSpaceMab`

| Name | Description |
| --- | --- |
| `studyAndMabs` | The table of available mAbs by study. |
| `mabs` | The table of available mAbs and their attributes. |
| `nabMab` | The table of mAbs and their neutralizing measurements against viruses. |
| `studies` | The table of available studies. |
| `assays` | The table of assay status by study. |
| `variableDefinitions` | The table of variable definitions. |

## Session information

```r
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
#>  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
#>  [7] LC_PAPER=en_US.utf8       LC_NAME=C
#>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] data.table_1.14.2 DataSpaceR_0.7.5  knitr_1.37
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8       digest_0.6.29    assertthat_0.2.1 R6_2.5.1
#>  [5] jsonlite_1.8.0   magrittr_2.0.2   evaluate_0.15    highr_0.9
#>  [9] httr_1.4.2       stringi_1.7.6    curl_4.3.2       tools_4.1.2
#> [13] stringr_1.4.0    Rlabkey_2.8.3    xfun_0.29        compiler_4.1.2
```