--- title: "rrr for Multivariate Regression" author: "Chris Addy" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{rrr for Multivariate Regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} always_allow_html: yes --- This vignette assumes some familiarity on the part of the reader with principal component analysis^[ Hotelling, H. (1936). Analysis of a complex of statistical variables into principal components, *Journal of Educational Psychology*, **24**, 417-411, 498-520.], canonical variate analysis^[Hotelling, H. (1936). Relations between two sets of variates, *Biometrika*, **28**, 321-377.], and/or linear discriminant analysis^[Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems, *Annals of Eugencis*, **7**, 179-188.]. For a more thorough treatment, see the appropriate references. Here we will show a brief mathematical motivation for reduced-rank regression and then show that principal component analysis, canonical variate analysis, and linear discriminant analysis are special cases of reduced-rank regression. For a more thorough treatment of reduced-rank regression and its special cases, see *Modern Multivariate Statistical Techniques*.^[Izenman, A.J. (2008). *Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning.* Springer.] The mathematical formulation given below is a treatment of the theory to be minimally sufficient for anyone in need of introduction to reduced-rank regression to use in practice and is not intended to be a rigorous take on the concept. There are other packages that carry out principal component analysis, canonical variate analysis, and linear discrimnant analysis. As such, this package would prove most useful to readers of the above text, as it adopts the same hierarchical framework of reduced-rank regression. ```{r} library(rrr) ``` ## Classical Multivariate Regression Let $\mathbf{X} = \left(X_1, X_2, \dots, X_r\right)^\tau$ and $\mathbf{Y} = \left(Y_1, Y_2, \dots, Y_s\right)^\tau$ be jointly distributed random vectors with $$ \mathrm{E}\left\{ \begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \\ \end{pmatrix} \right\} = \begin{pmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \\ \end{pmatrix}, \quad \mathrm{cov}\left\{ \begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \\ \end{pmatrix} \right\} = \begin{pmatrix} \mathbf{\Sigma}_{XX} & \mathbf{\Sigma}_{XY} \\ \mathbf{\Sigma}_{YX} & \mathbf{\Sigma}_{YY} \\ \end{pmatrix} $$ The classical multivariate regression model is given by $$ \overset{s \times 1}{\mathbf{Y}} = \overset{s \times 1}{\boldsymbol{\mu}} + \overset{s \times r}{\mathbf{C}} \; \overset{r \times 1}{\mathbf{X}} + \overset{s \times 1}{\varepsilon} $$ with $$ \mathrm{E}\left(\varepsilon\right) = \mathbf{0}, \quad \mathrm{cov}\left(\varepsilon\right) = \mathbf{\Sigma}_{\varepsilon \varepsilon} $$ and $\varepsilon$ distributed independently of $\mathbf{X}.$ To estimate $\boldsymbol{\mu}$ and $\mathbf{C}$ we minimize the least-squares criterion $$ \mathrm{E}\left[\left(\mathbf{Y} - \boldsymbol{\mu} - \mathbf{C} \mathbf{X}\right)\left(\mathbf{Y} - \boldsymbol{\mu} - \mathbf{C}\mathbf{X}\right)^\tau\right], $$ with expecation taken over the joint distribution of $\left(\mathbf{X}^\tau, \mathbf{Y}^\tau\right)$, with the assumption that $\mathbf{\Sigma}_{XX}$ is nonsingular, and therefore invertible. 
This is minimized when

$$
\begin{aligned}
\boldsymbol{\mu} & = \boldsymbol{\mu}_Y - \mathbf{C} \boldsymbol{\mu}_X \\
\mathbf{C} & = \mathbf{\Sigma}_{YX} \mathbf{\Sigma}_{XX}^{-1}
\end{aligned}
$$

The least-squares estimator of $\mathbf{C}$ is given by

$$
\hat{\mathbf{C}} = \hat{\mathbf{\Sigma}}_{YX} \hat{\mathbf{\Sigma}}_{XX}^{-1}
$$

Note that $\mathbf{C}$ -- and hence $\hat{\mathbf{C}}$ -- contains no term that takes into account the correlation of the $Y_i$s. This is a surprising result, since we would expect correlation among the responses. In other words, to find the least-squares estimate $\hat{\mathbf{C}}$ of $\mathbf{C}$, one need only regress each $Y_i$ separately on $\mathbf{X}$ and concatenate those multiple-regression coefficient vectors into a matrix to construct the estimated coefficient matrix $\hat{\mathbf{C}}$. In a very important sense, then, the classical multivariate regression model is not *truly* multivariate.

## The `tobacco` Data Set

```{r}
library(dplyr)

data(tobacco)

tobacco <- as_data_frame(tobacco)
```

We see that the `tobacco` data set^[Anderson, R.L. and Bancroft, T.A. (1952). *Statistical Theory in Research*, New York: McGraw-Hill. p. 205.] has 9 variables and 25 observations. There are 6 $X_i$ predictor variables -- representing the percentages of nitrogen, chlorine, potassium, phosphorus, calcium, and magnesium, respectively -- and 3 $Y_j$ response variables -- representing cigarette burn rate in inches per 1,000 seconds, percent sugar in the leaf, and percent nicotine in the leaf, respectively.

```{r}
tobacco_x <- tobacco %>%
    select(starts_with("X"))

tobacco_y <- tobacco %>%
    select(starts_with("Y"))
```

Below we see that there is not only correlation among the $X_i$s but also among the $Y_i$s. The classical multivariate model will not capture that information. We can get a good idea of the correlation structure using `GGally::ggcorr()`. GGally is a package that extends the functionality of `ggplot2` and is used in `rrr` to create pairwise plots, as seen below.

```{r}
GGally::ggcorr(tobacco_x)
```

We can see correlation between the $X_i$s, which will be accounted for in the classical multivariate regression model.

```{r}
GGally::ggcorr(tobacco_y)
```

There is clearly correlation among the $Y_i$s, especially between percent nicotine and percent sugar. Any regression model that we fit should take this into account.

```{r}
## multivariate regression coefficients, t(Sigma_YX %*% solve(Sigma_XX))
x <- as.matrix(tobacco_x)
y <- as.matrix(tobacco_y)

multivar_reg <- t(cov(y, x) %*% solve(cov(x)))

## separate multiple regression of each response on X
lm1 <- lm(y[,1] ~ x)$coeff
lm2 <- lm(y[,2] ~ x)$coeff
lm3 <- lm(y[,3] ~ x)$coeff
```

As expected, the multivariate slope coefficients are the same as the multiple-regression slope coefficients for each of the $Y_i$s.

```{r}
multivar_reg
cbind(lm1, lm2, lm3)
```

## Reduced-Rank Regression

One way to introduce a multivariate component into the model is to allow for the possibility that $\mathbf{C}$ is deficient, or of *reduced rank* $t$:

$$
\mathrm{rank}\left(\mathbf{C}\right) = t \leq \mathrm{min}\left(r, s\right)
$$

In other words, we allow for the possibility that there are unknown linear constraints on $\mathbf{C}$. Without loss of generality, we consider the case when $r > s$, so that $t \leq s$. When $t = s$, the regression model is *full-rank* and can be fit using multiple regression on each $Y_i \in \mathbf{Y}$, as seen above.
When $t < s$, $\mathbf{C}$ can be decomposed into non-unique matrices $\mathbf{A}_{s \times t}$ and $\mathbf{B}_{t \times r}$ such that $\mathbf{C} = \mathbf{AB}$, and the multivariate regression model is given by

$$
\overset{s \times 1}{\mathbf{Y}} = \overset{s \times 1}{\boldsymbol{\mu}} + \overset{s \times t}{\mathbf{A}} \; \overset{t \times r}{\mathbf{B}} \; \overset{r \times 1}{\mathbf{X}} + \overset{s \times 1}{\varepsilon}
$$

Estimating $\boldsymbol{\mu}, \mathbf{A}, \mathbf{B}$, and the *reduced-rank regression coefficient* $\mathbf{C}^{\left(t\right)}$ is done by minimizing the weighted sum-of-squares criterion

$$
\mathrm{E}\left[\left(\mathbf{Y} - \boldsymbol{\mu} - \mathbf{ABX}\right)^\tau \mathbf{\Gamma}\left(\mathbf{Y} - \boldsymbol{\mu} - \mathbf{ABX}\right)\right]
$$

where $\boldsymbol{\Gamma}$ is a positive-definite symmetric $\left(s \times s\right)$-matrix of weights and the expectation is taken over the joint distribution of $\left(\mathbf{X}^\tau, \mathbf{Y}^\tau\right)^\tau$. The weighted sum-of-squares criterion is minimized when

$$
\begin{aligned}
\boldsymbol{\mu}^{\left(t\right)} & = \boldsymbol{\mu}_Y - \mathbf{A}^{\left(t\right)}\mathbf{B}^{\left(t\right)}\boldsymbol{\mu}_X \\
\mathbf{A}^{\left(t\right)} & = \mathbf{\Gamma}^{-1/2}\mathbf{V}_t \\
\mathbf{B}^{\left(t\right)} & = \mathbf{V}_t^\tau \boldsymbol{\Gamma}^{1/2}\mathbf{\Sigma}_{YX}\mathbf{\Sigma}_{XX}^{-1} \\
\end{aligned}
$$

where $\mathbf{V}_t = \left(\mathbf{v}_1, \dots, \mathbf{v}_t\right)$ is an $\left(s \times t\right)$-matrix, with $\mathbf{v}_j$ the eigenvector associated with the $j$th largest eigenvalue of

$$
\mathbf{\Gamma}^{1/2}\mathbf{\Sigma}_{YX} \mathbf{\Sigma}_{XX}^{-1} \mathbf{\Sigma}_{XY} \mathbf{\Gamma}^{1/2}
$$

We try out different values of $\mathbf{\Gamma}$ in applications. Two popular choices -- and ones that lead to interesting results, as we will see -- are $\mathbf{\Gamma} = \mathbf{I}_s$ and $\mathbf{\Gamma} = \boldsymbol{\Sigma}_{YY}^{-1}$; the latter is equivalent to performing canonical variate analysis.

Since the reduced-rank regression coefficient relies on inverting $\boldsymbol{\Sigma}_{XX}$ and, possibly, $\boldsymbol{\Sigma}_{YY}$, we want to take into consideration the cases when $\boldsymbol{\Sigma}_{XX}$ or $\boldsymbol{\Sigma}_{YY}$ is singular or difficult to invert. We can perturb the diagonal of the covariance matrices by some small constant $k$. This ensures that the covariance matrix is invertible while only slightly altering the data. The motivation for this is taken from ridge regression^[Hoerl, A.E. and Kennard, R. (1970). Ridge regression: Biased estimation for non-orthogonal problems. *Technometrics*, **12**, 55-67. Reprinted in *Technometrics*, **42** (2000), 80-86.] in the multiple-regression context, and from the idea of *softly shrunk* reduced-rank regression.^[Aldrin, M. (2000). Multivariate prediction using softly shrunk reduced-rank regression. *The American Statistician*, **54**(1), 29.] Thus, we carry out the reduced-rank regression procedure using

$$
\begin{aligned}
\hat{\boldsymbol{\Sigma}}_{XX}^{\left(k\right)} & = \hat{\boldsymbol{\Sigma}}_{XX} + k \mathbf{I}_r \\
\hat{\boldsymbol{\Sigma}}_{YY}^{\left(k\right)} & = \hat{\boldsymbol{\Sigma}}_{YY} + k \mathbf{I}_s
\end{aligned}
$$
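To make these formulas concrete, here is a minimal hand-rolled sketch of the rank-$t$ estimator with $\mathbf{\Gamma} = \mathbf{I}_s$ (so the $\mathbf{\Gamma}^{\pm 1/2}$ factors drop out), applied to the `tobacco` matrices `x` and `y` created above. The name `rank_t` is ours, and this is illustration only; `rrr()`, introduced below, performs this computation for you, up to the transpose convention noted later.

```{r}
## hand-rolled rank-t estimator with Gamma = I (sketch only; rrr() does this for you)
rank_t <- 1
s_xx <- cov(x)
s_yx <- cov(y, x)
s_xy <- cov(x, y)

## V_t: first t eigenvectors of Sigma_YX Sigma_XX^-1 Sigma_XY (Gamma = I)
v_t <- eigen(s_yx %*% solve(s_xx) %*% s_xy,
             symmetric = TRUE)$vectors[, 1:rank_t, drop = FALSE]

a_t <- v_t                              # A^(t) = Gamma^(-1/2) V_t, trivial with Gamma = I
b_t <- t(v_t) %*% s_yx %*% solve(s_xx)  # B^(t) = V_t' Gamma^(1/2) Sigma_YX Sigma_XX^-1
c_t <- a_t %*% b_t                      # rank-t coefficient matrix C^(t)
c_t
```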
### Assessing Effective Dimensionality

### Estimate $t$ and $k$ with `rank_trace()`

```{r}
args(rank_trace)
```

$\hat{\mathbf{C}}$ is calculated using sample observations. Therefore its *mathematical* rank will always be full, but it has a *statistical* rank $t$, which is an unknown hyperparameter that needs to be estimated. One method of estimating $t$ is to plot the *rank trace*. Along the $X$-axis, we plot a measure of the difference between the rank-$t$ coefficient matrix and the full-rank coefficient matrix for each value of $t$. Along the $Y$-axis, we plot the reduction in residual covariance between the rank-$t$ residuals and the full-rank residuals for each value of $t$.

```{r}
### use the identity matrix for gamma
rank_trace(tobacco_x, tobacco_y)
```

Set `plot = FALSE` to print a data frame of the rank trace coordinates.

```{r}
rank_trace(tobacco_x, tobacco_y, plot = FALSE)
```

When the weight matrix $\mathbf{\Gamma}$ takes on a more complicated form, the rank trace may plot points outside the unit square, or may not be a smooth monotonic curve. When this is the case, we can change the value of `k` to smooth the rank trace. This value of $k$ is then used as the ridge perturbation $k$ described above.

### Fitting Reduced-Rank Regression Model

The main function in the `rrr` package is `rrr()`, which fits a reduced-rank regression model and outputs its coefficients.

### Fit reduced-rank regression model with `rrr()`

```{r}
args(rrr)
```

`rrr()` takes as inputs the data frames, or matrices, of predictor and response variables, the type of reduced-rank regression procedure to perform, the rank (which defaults to full rank), and the ridge constant $k$. The argument `type` defaults to `type = "identity"`, which sets $\mathbf{\Gamma} = \mathbf{I}$, but it can be set to `type = "pca"`, `type = "cva"`, or `type = "lda"` to perform principal component analysis, canonical variate analysis, or linear discriminant analysis, respectively. `rrr()` outputs the appropriate coefficients depending on the type of reduced-rank regression performed.

```{r}
rrr(tobacco_x, tobacco_y, rank = "full")
```

We can see that `rrr()` with `rank = "full"` and `k = 0` returns the classical multivariate regression coefficients computed above. They differ only by a transpose; the coefficients are presented this way in `rrr` as a matter of convention, and it is this form that appears in the literature.^[Izenman, A.J. (2008). *Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning.* Springer.]

### Diagnostics

### Plot and Print Residuals with `residuals()`

```{r}
args(residuals)
```

### Plot Residuals

We can visually check the model assumptions with `residuals()`. The leftmost column of the scatter plot can be used to look for serial patterns in the residuals. The diagonal can be used to look at the distribution of each set of residuals and visually assess whether or not it is symmetric, has a mean of zero, etc.

```{r}
residuals(tobacco_x, tobacco_y, rank = 1)
```

To print a data frame of the residuals, set `plot = FALSE`.
```{r}
residuals(tobacco_x, tobacco_y, rank = 1, plot = FALSE)
```

## Principal Components Analysis

### PCA is a Special Case of Reduced-Rank Regression

Set

$$
\begin{aligned}
\mathbf{Y} & \equiv \mathbf{X} \\
\mathbf{\Gamma} & = \mathbf{I}_r
\end{aligned}
$$

Then, the least-squares criterion

$$
\mathrm{E}\left[\left(\mathbf{X} - \boldsymbol{\mu} - \mathbf{A}\mathbf{B} \mathbf{X}\right)\left(\mathbf{X} - \boldsymbol{\mu} - \mathbf{A}\mathbf{B} \mathbf{X}\right)^\tau\right]
$$

is minimized when

$$
\begin{aligned}
\mathbf{A}^{\left(t\right)} & = \left(\mathbf{v}_1, \dots, \mathbf{v}_t\right) \\
\mathbf{B}^{\left(t\right)} & = \mathbf{A}^{\left(t\right) \tau} \\
\boldsymbol{\mu}^{\left(t\right)} & = \left(\mathbf{I}_r - \mathbf{A}^{\left(t\right)}\mathbf{B}^{\left(t\right)}\right)\boldsymbol{\mu}_X \\
\end{aligned}
$$

where $\mathbf{v}_j = \mathbf{v}_j \left(\mathbf{\Sigma}_{XX}\right)$ is the eigenvector associated with the $j$th largest eigenvalue of $\mathbf{\Sigma}_{XX}.$

The best reduced-rank approximation to the original $\mathbf{X}$ is

$$
\begin{aligned}
\hat{\mathbf{X}}^{\left(t\right)} & = \boldsymbol{\mu}^{\left(t\right)} + \mathbf{A}^{\left(t\right)}\mathbf{B}^{\left(t\right)} \mathbf{X} \\
& \mathrm{or} \\
\hat{\mathbf{X}} & = \mathbf{A}^{\left(t\right)}\mathbf{B}^{\left(t\right)}\mathbf{X}_c \\
\end{aligned}
$$

where $\mathbf{X}_c$ is the vector $\mathbf{X}$ after mean-centering.
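As a quick, hand-rolled check of this equivalence (a sketch only, not the package's internals), we can build the rank-$t$ quantities for the `tobacco_x` predictors loaded earlier from the eigenvectors of the sample covariance matrix and confirm that the implied scores match `prcomp()`. The name `rank_t` is ours, and eigenvector signs are arbitrary, so absolute values are compared.

```{r}
## rank-t PCA reconstruction of tobacco_x by hand (sketch only)
rank_t <- 2
x_c <- scale(as.matrix(tobacco_x), center = TRUE, scale = FALSE)
a_t <- eigen(cov(x_c), symmetric = TRUE)$vectors[, 1:rank_t]  # A^(t) = (v_1, ..., v_t)
x_hat <- x_c %*% a_t %*% t(a_t)   # rows of the rank-t approximation A^(t) B^(t) X_c

## the implied PC scores agree with prcomp() up to sign
all.equal(abs(unname(x_c %*% a_t)),
          abs(unname(prcomp(tobacco_x)$x[, 1:rank_t])))
```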
## The `pendigits` Data Set

```{r message = FALSE, warning = FALSE}
data(pendigits)

digits <- as_data_frame(pendigits) %>%
    select(-V36)
```

Forty-four writers each hand-wrote 250 digits, 0-9, in random order inside 500x500 pixel boxes on a pressure-sensitive tablet with an integrated LCD screen. The first 10 digits from each writer were thrown out -- without telling the writers -- to ignore variation due to the writers gaining familiarity with the device.^[Alimoglu, F. (1995). Combining multiple classifiers for pen-based handwritten digit recognition, M.Sc. thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University, Istanbul, Turkey.] The raw coordinate data were cleaned and translated.

```{r}
digits_features <- digits %>%
    select(-V35)

digits_class <- digits %>%
    select(V35)
```

We can get a good visualization of the correlation structure using `GGally::ggcorr()`. Below we see that there is very heavy correlation among the variables.

```{r}
GGally::ggcorr(digits_features)
```

### Assessing Dimensionality

The ratio

$$
\frac{\lambda_{t + 1} + \cdots + \lambda_r}{\lambda_1 + \cdots + \lambda_r}
$$

is a goodness-of-fit measure of how well the last $r - t$ principal components explain the total variation in $\mathbf{X}$. The function `rrr()` (see below) outputs this goodness-of-fit measure.

```{r}
rrr(digits_features, digits_features, type = "pca")$goodness_of_fit
```

### Estimate $t$ and ridge constant $k$ with `rank_trace()`

```{r}
rank_trace(digits_features, digits_features, type = "pca")
```

Print a data frame of the rank trace coordinates by setting `plot = FALSE`.

```{r}
rank_trace(digits_features, digits_features, type = "pca", plot = FALSE)
```

### Plot Principal Component Scores

### Pairwise Plots with `pairwise_plot()`

```{r}
args(pairwise_plot)
```

A common method of visualizing a PCA is to plot the $j$th sample PC scores against the $k$th sample PC scores,

$$
\left(\xi_{ij}, \xi_{ik}\right) = \left(\hat{\mathbf{v}}_j^\tau \mathbf{X}_i, \hat{\mathbf{v}}_k^\tau \mathbf{X}_i\right), \quad i = 1, 2, \dots, n
$$

Since the first two principal components capture the most variance -- and hence the most useful information -- of all possible pairs of principal components, we typically set $j = 1, k = 2$ and plot the first two sample PC scores against each other. In `rrr` this is the default.

```{r}
pairwise_plot(digits_features, digits_class, type = "pca")
```

We can set the $x$- and $y$-axes to whichever pair of PC scores we would like to plot by changing the `pair_x` and `pair_y` arguments.

```{r}
pairwise_plot(digits_features, digits_class, type = "pca", pair_x = 1, pair_y = 3)
```

### Plot all pairs of PC scores with `allpairs_plot()`

```{r}
args(allpairs_plot)
```

Alternatively, we can look at structure in the data by plotting all pairs of PC scores, along with some other visual diagnostics, with `allpairs_plot()`. Along with plotting principal component scores against each other, the plot matrix also shows histograms and box plots that show how the points are distributed along the principal component axes.

```{r warning = FALSE, message = FALSE}
allpairs_plot(digits_features, digits_class, type = "pca", rank = 3)
```

### Fitting a PCA Model

### Fit model with `rrr()`

```{r}
rrr(digits_features, digits_features, type = "pca", rank = 3)
```

## Canonical Variate Analysis

### CVA as a Special Case of Reduced-Rank Regression

Canonical variate analysis^[Hotelling, H. (1936). Relations between two sets of variates, *Biometrika*, **28**, 321-377.] is a method of linear dimensionality reduction that turns the original $\mathbf{X}$ and $\mathbf{Y}$ into new variables $\boldsymbol{\xi}$ and $\boldsymbol{\omega}$, respectively. Canonical variate analysis can be performed as a special case of reduced-rank regression. Set $\mathbf{\Gamma} = \mathbf{\Sigma}_{YY}^{-1}$. Then the $t$ new pairs of canonical variables $\left(\xi_i, \omega_i\right), i = 1, \dots, t$, are calculated by fitting a reduced-rank -- rank $t$ -- regression equation. The canonical variate scores are given by

$$
\boldsymbol{\xi}^{\left(t\right)} = \mathbf{G}^{\left(t\right)}\mathbf{X}, \quad \boldsymbol{\omega}^{\left(t\right)} = \mathbf{H}^{\left(t\right)} \mathbf{Y},
$$

with

$$
\begin{aligned}
\mathbf{G}^{\left(t\right)} & = \mathbf{B}^{\left(t\right)} \\
\mathbf{H}^{\left(t\right)} & = \mathbf{A}^{\left(t\right)-} \\
\end{aligned}
$$

where $\mathbf{A}^{\left(t\right)}$ and $\mathbf{B}^{\left(t\right)}$ are the matrices from the reduced-rank regression formulation above, and $\mathbf{H}^{\left(t\right)} = \mathbf{A}^{\left(t\right)-}$ is a generalized inverse of $\mathbf{A}^{\left(t\right)}$. When $t = s$, $\mathbf{H}^{\left(s\right)} = \mathbf{A}^{\left(s\right)+}$ is the unique Moore-Penrose generalized inverse of $\mathbf{A}^{\left(s\right)}$.
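As a sanity check on this equivalence -- a sketch only, assuming the `tobacco` matrices `x` and `y` from earlier are still in the workspace, and using base R rather than the `rrr` API -- the canonical correlations implied by the eigenvalue problem above can be compared with `stats::cancor()`.

```{r}
## canonical correlations by hand: square roots of the eigenvalues of a matrix
## with the same spectrum as Gamma^(1/2) S_YX S_XX^-1 S_XY Gamma^(1/2), Gamma = S_YY^-1
m <- cov(y, x) %*% solve(cov(x)) %*% cov(x, y)
l <- solve(chol(cov(y)))   # l %*% t(l) equals solve(cov(y))
sqrt(eigen(t(l) %*% m %*% l, symmetric = TRUE)$values)

## should agree with the canonical correlations from stats::cancor()
cancor(x, y)$cor
```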
## The `COMBO17` Data Set

```{r}
### COMBO-17 galaxy data
data(COMBO17)

galaxy <- as_data_frame(COMBO17) %>%
    select(-starts_with("e."), -Nr, -UFS:-IFD) %>%
    na.omit()
```

This data set^[Wolf, C., Meisenheimer, M., Kleinheinrich, M., Borch, A., Dye, S., Gray, M., Wisotski, L., Bell, E.F., Rix, H.W., Cimatti, A., Hasinger, G., and Szokoly, G. (2004). A catalogue of the Chandra Deep Field South with multi-colour classification and photometric redshifts from COMBO-17, *Astronomy & Astrophysics*, https://arxiv.org/pdf/astro-ph/0403666.pdf] comes from a public catalogue of objects in the Chandra Deep Field South, an area of the sky. This subset of the catalogue contains all of the objects classified as galaxies, keeping only the observations with no missing values.^[Donald Richards in the Department of Statistics at Pennsylvania State University helped Dr. Izenman in understanding the variables in this data set. The package author, by extension, would also like to acknowledge Dr. Richards for help in understanding the data set used in the text *Modern Multivariate Statistical Techniques*.]

```{r}
galaxy_x <- galaxy %>%
    select(-Rmag:-chi2red)

galaxy_y <- galaxy %>%
    select(Rmag:chi2red)
```

```{r}
GGally::ggcorr(galaxy_x)
```

```{r}
GGally::ggcorr(galaxy_y)
```

We can see above the heavy correlation among the $X_i$s and among the $Y_i$s. This data set, therefore, makes a good candidate for canonical variate analysis.

### Assessing Effective Dimensionality

Estimate $t$ and $k$ with `rank_trace()`.

```{r}
rank_trace(galaxy_x, galaxy_y, type = "cva")
```

### Diagnostics

Plot and print residuals with `residuals()`, setting `type = "cva"`.

### Plot Residuals

```{r}
residuals(galaxy_x, galaxy_y, type = "cva", rank = 2, k = 0.001)
```

### Print Residuals

```{r}
residuals(galaxy_x, galaxy_y, type = "cva", rank = 2, k = 0.001, plot = FALSE)
```

### Plot Pairwise Canonical Variate Scores

Plot canonical variate scores with `pairwise_plot()`.

```{r}
pairwise_plot(galaxy_x, galaxy_y, type = "cva", pair_x = 1, k = 0.001)
```

Choose which pair of canonical variate scores to plot by changing the argument `pair_x`.

```{r}
pairwise_plot(galaxy_x, galaxy_y, type = "cva", pair_x = 6)
```

### Fit Reduced-Rank Canonical Variate Model

Fit the model with `rrr()`, setting `type = "cva"`.

```{r}
rrr(galaxy_x, galaxy_y, type = "cva", rank = 2, k = 0.001)
```

### Print Canonical Variate Scores

Print canonical variate scores with `scores()`, setting `type = "cva"`.

```{r}
scores(galaxy_x, galaxy_y, type = "cva", rank = 2, k = 0.001)
```

## Linear Discriminant Analysis

### LDA as a Special Case of CVA

Linear discriminant analysis is a classification procedure. We can turn it into a regression procedure -- specifically, a reduced-rank canonical variate procedure -- in the following way. Let each observation $i = 1, 2, \dots, n$ belong to one, and only one, of $K = s + 1$ distinct classes. We can construct an *indicator response matrix* $\mathbf{Y}$, where each row $i$ is an indicator response vector for the $i$th observation: the vector has a 1 in the column that represents the class to which the observation belongs and 0s elsewhere. We then regress this binary response matrix $\mathbf{Y}$ on the matrix $\mathbf{X}$ of predictor variables.
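To make the indicator coding concrete, here is a quick illustration -- a sketch using base R's `model.matrix()` and the `iris` data introduced in the next section, rather than anything specific to `rrr`.

```{r}
## one row per observation, one 0/1 column per class
head(model.matrix(~ Species - 1, data = iris))
```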
Linear discriminant analysis requires the assumptions that each class is normally distributed and that the covariance matrices of the classes are all equal. While these assumptions will not be met in all cases, when they are -- and when the classes are well separated -- linear discriminant analysis is a very efficient classification method.

## The `iris` Data Set

```{r}
data(iris)

iris <- as_data_frame(iris)
```

This is R.A. Fisher's classic `iris` data set, which comes packaged with base `R`.

```{r}
iris_features <- iris %>%
    select(-Species)

iris_class <- iris %>%
    select(Species)
```

### Assessing Effective Dimensionality

Assessing the rank $t$ of this reduced-rank regression is equivalent to determining the number of linear discriminant functions that best discriminate between the $K$ classes, with $\mathrm{min}\left(r, s\right) = \mathrm{min}\left(r, K - 1\right)$ the maximum number of linear discriminant functions. Generally, plotting the linear discriminant functions against each other -- e.g., the first against the second -- is used to determine whether sufficient discrimination is obtained. Plotting techniques are discussed in the following section.

### Plot LDA Pairs

Plot LDA pairs with `pairwise_plot()`, setting `type = "lda"`. A typical graphical display for multiclass LDA is to plot the $j$th discriminant scores for the $n$ points against their $k$th discriminant scores.

```{r}
pairwise_plot(iris_features, iris_class, type = "lda", k = 0.0001)
```

### Fitting LDA Models

Fit an LDA model with `rrr()`, setting `type = "lda"`.

```{r}
rrr(iris_features, iris_class, type = "lda", k = 0.0001)
```

### Print LDA Scores with `scores()`

Print the linear discriminant scores and the class means with `scores()`, setting `type = "lda"`.

```{r}
scores(iris_features, iris_class, type = "lda", k = 0.0001)
```
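As an optional cross-check -- a sketch only, since `MASS` is not a dependency of `rrr` -- the same discriminant analysis can be fit directly as a classifier with `MASS::lda()`, and its confusion matrix on the `iris` classes compared with the separation visible in the scores above.

```{r}
## optional comparison with MASS::lda(), guarded in case MASS is unavailable
if (requireNamespace("MASS", quietly = TRUE)) {
    lda_fit <- MASS::lda(Species ~ ., data = iris)
    table(predicted = predict(lda_fit)$class, actual = iris$Species)
}
```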