Evaluates a dissimilarity matrix by comparing each observation to its nearest neighbor based on side information. For continuous variables, RMSD and correlation are computed; for categorical variables, the kappa index is used.
Arguments
- diss
A symmetric dissimilarity matrix. Alternatively, a vector containing the lower triangle values (as returned by
dist).- side_info
A matrix of side information corresponding to the observations. Can be numeric (one or more columns) or character (single column for categorical data).
- d
Deprecated. Use
dissindiss_evaluate()instead.
Value
A list with the following components:
- eval
For numeric side information: a matrix with columns
rmsdandr(correlation). For categorical: a matrix with columnkappa.- global_eval
If multiple numeric side information variables are provided, summary statistics across variables.
- first_nn
A matrix with the original side information and the side information of each observation's nearest neighbor.
Details
This function assesses whether a dissimilarity matrix captures meaningful structure by examining the side information of nearest neighbor pairs (Ramirez-Lopez et al., 2013). If observations that are similar in the dissimilarity space also have similar side information values, the dissimilarity is considered effective.
For numeric side_info, the root mean square of differences (RMSD)
between each observation and its nearest neighbor is computed:
\[j(i) = NN(x_i, X^{{-i}})\] \[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - y_{j(i)})^2}\]
where \(NN(x_i, X^{-i})\) returns the index of the nearest neighbor of observation \(i\) (excluding itself), \(y_i\) is the side information value for observation \(i\), and \(m\) is the number of observations.
For categorical side_info, the kappa index is computed:
\[\kappa = \frac{p_o - p_e}{1 - p_e}\]
where \(p_o\) is the observed agreement and \(p_e\) is the agreement expected by chance.
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Examples
# \donttest{
library(prospectr)
#> prospectr version 0.2.8 -- galo
#> check the package repository at: https://github.com/l-ramirez-lopez/prospectr
data(NIRsoil)
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)
NIRsoil$spc <- sg
Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Compute PCA-based dissimilarity
d <- dissimilarity(Xr, diss_method = diss_pca(ncomp = 8))
# Evaluate using side information
ev <- diss_evaluate(d$dissimilarity, side_info = as.matrix(Yr))
ev$eval
#> rmsd r
#> side_info_1 0.644268 0.8404257
# Evaluate with multiple side information variables
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
ev_2 <- diss_evaluate(d$dissimilarity, side_info = cbind(Yr, Yr_2))
ev_2$eval
#> rmsd r
#> Yr 0.644268 0.8404257
#> Yr_2 4.718934 0.6586788
ev_2$global_eval
#> mean_standardized_rmsd mean_r
#> 1 0.5814115 0.7495522
# }
