--- title: "Built-In Distributions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Built-In Distributions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(distionary) ``` This vignette covers built-in distribution families available in `distionary`. It provides insight into each family and its counterparts in the `stats` package. ## Built-In Distribution Families All distribution families found in the `stats` package are integrated into `distionary`, along with a few others. They are shown in the table below with notes on whether they have counterparts in the `stats` package. | Distribution | `distionary` Function | Has counterpart in `stats` | |----------------------------------|------------------------|-----------------------------| | Bernoulli | `dst_bern()` | Yes | | Beta | `dst_beta()` | Yes | | Binomial | `dst_binom()` | Yes | | Cauchy | `dst_cauchy()` | Yes | | Chi Squared | `dst_chisq()` | Yes | | Degenerate | `dst_degenerate()` | No | | Exponential | `dst_exp()` | Yes | | Empirical | `dst_empirical()` | No | | F | `dst_f()` | Yes | | Finite | `dst_finite()` | No | | Gamma | `dst_gamma()` | Yes | | Geometric | `dst_geom()` | Yes | | Generalised Extreme Value (GEV) | `dst_gev()` | No | | Generalised Pareto (GP) | `dst_gp()` | No | | Hypergeometric | `dst_hyper()` | Yes | | Log Normal | `dst_lnorm()` | Yes | | Log Pearson Type III | `dst_lp3()` | No | | Negative Binomial | `dst_nbinom()` | Yes | | Normal | `dst_norm()` | Yes | | Pearson Type III | `dst_pearson3()` | No | | Poisson | `dst_pois()` | Yes | | Student _t_ | `dst_t()` | Yes | | Uniform | `dst_unif()` | Yes | | Weibull | `dst_weibull()` | Yes | In addition, there is a special "Null" distribution object akin to a missing or unknown distribution. This is useful, for example, if an algorithm fails to return a distribution: instead of throwing an error, a Null distribution can be returned. ```{r} # Make a Null distribution. null <- dst_null() # Inspect null ``` This is the behaviour when specifying `NA` as a distribution parameter: ```{r} dst_norm(mean = 0, sd = NA) ``` The Null distribution always evaluates to `NA`. ```{r} mean(null) eval_pmf(null, at = 1:10) range(null) ``` ## The Empirical Distribution The Empirical distribution is different from the others in terms of its utility because it makes _your data_ the distribution, and is therefore far more flexible than any of the other distributions. In statistical terminology, this is because the other built-in distributions are _parametric_, defined by a fixed amount of numeric parameters. For example, exactly two parameters make up the family of Normal distributions. The empirical distribution, on the other hand, is a _non-parametric_ model, because there isn't a fixed number of numeric parameters that can describe it. The Empirical distribution is particularly advantageous when a distribution's standard form is unknown or infeasible to model, or when comparing a custom model against data is needed. Here is an example of forming an empirical distribution from a dataset generated from a Normal distribution. First, define the normal distribution and generate a sample; the first few observations are shown below. ```{r} set.seed(42) normal <- dst_norm(0, 1) x <- realise(normal, n = 100) # Inspect the first few observations head(x) ``` An Empirical distribution can be made from these values. ```{r} empirical <- dst_empirical(x) # Inspect empirical ``` Comparing its CDF to the normal distribution, one can see the two are similar: ```{r} plot(empirical, "cdf", from = -4, to = 4, n = 1000) plot(normal, "cdf", add = TRUE, col = "red") legend( "topleft", legend = c("Empirical", "Normal"), col = c("black", "red"), lty = 1 ) ``` Although this distribution is non-parametric, the `parameters()` function is still applicable because it's not tied to the statistical definition of "parametric", which is concerned with parameter dimension. In `distionary`, the dataset itself and their probabilities comprise the distribution parameters (using `str()` here for better printing): ```{r} str(parameters(empirical)) ``` One should also note that Empirical distributions are discrete, with outcomes defined by the observed values: ```{r} vtype(empirical) ```