Memory-based learning (a.k.a. instance-based learning or local regression) is a non-linear lazy learning approach for predicting a response variable from predictor variables. For each observation in a prediction set, a local regression is fitted using a subset of similar observations (nearest neighbors) from a reference set. This function does not produce a global model.
Usage
mbl(Xr, Yr, Xu, Yu = NULL,
    neighbors,
    diss_method = diss_pca(ncomp = ncomp_by_opc()),
    diss_usage = c("none", "predictors", "weights"),
    fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
    spike = NULL, group = NULL,
    gh = FALSE,
    control = mbl_control(),
    verbose = TRUE, seed = NULL, ...)
# S3 method for class 'mbl'
plot(x, what = c("validation", "gh"), metric = "rmse", ncomp = c(1, 2), ...)
get_predictions(x)
Arguments
- Xr
A matrix of predictor variables for the reference data (observations in rows, variables in columns). Column names are required.
- Yr
A numeric vector or single-column matrix of response values corresponding to Xr. NA values are not permitted.
- Xu
A matrix of predictor variables for the data to be predicted (observations in rows, variables in columns). Must have the same column names as Xr.
- Yu
An optional numeric vector or single-column matrix of response values corresponding to Xu. Used for computing prediction statistics. Default is NULL.
- neighbors
A neighbor selection object specifying how to select neighbors. Use neighbors_k() for fixed k-nearest neighbors or neighbors_diss() for dissimilarity threshold-based selection.
- diss_method
A dissimilarity method object or a precomputed dissimilarity matrix. Available constructors:
  - diss_pca(): Mahalanobis distance in PCA score space. This is the default, where the number of components is optimized using side information (see ncomp_by_opc()).
  - diss_pls(): Mahalanobis distance in PLS score space
  - diss_euclidean(): Euclidean distance
  - diss_mahalanobis(): Mahalanobis distance
  - diss_cosine(): Cosine dissimilarity
  - diss_correlation(): Correlation-based dissimilarity
A precomputed matrix can also be passed. When diss_usage = "predictors", it must be square, with nrow(Xr) + nrow(Xu) rows and columns and zeros on the diagonal. Otherwise, it must have nrow(Xr) rows and nrow(Xu) columns.
- diss_usage
How dissimilarity information is used in local models:
  - "none" (default): dissimilarities used only for neighbor selection
  - "predictors": local dissimilarity matrix columns added as predictors
  - "weights": neighbors weighted by dissimilarity using a tricubic function
- fit_method
A local fitting method object. Available constructors:
  - fit_pls(): Partial least squares regression
  - fit_wapls(): Weighted average PLS (default)
  - fit_gpr(): Gaussian process regression
- spike
An integer vector indicating indices of observations in Xr to force into (positive values) or exclude from (negative values) all neighborhoods. Default is NULL. Spiking does not change neighborhood size; forced observations displace the most distant neighbors.
- group
An optional factor assigning group labels to Xr observations (e.g., measurement batches). Used to avoid pseudo-replication in cross-validation: when one observation is held out, all observations from its group are also removed.
- gh
Logical indicating whether to compute global Mahalanobis (GH) distances. Default is FALSE. GH distances measure how far each observation lies from the center of the reference set in PLS score space. The computation uses a fixed methodology: PLS projection with the number of components selected via ncomp_by_opc() (capped at 40). This is independent of the diss_method argument.
- control
A list from mbl_control() specifying validation type, tuning options, and other settings.
- verbose
Logical indicating whether to display a progress bar. Default is TRUE. Not shown during parallel execution.
- seed
An integer for random number generation, enabling reproducible cross-validation results. Default is NULL.
- ...
Additional arguments (currently unused).
- x
An object of class mbl (as returned by mbl()).
- what
Character vector specifying what to plot. Options are "validation" (validation statistics) and/or "gh" (PLS scores used for GH distance computation). Default is both.
- metric
Character string specifying which validation statistic to plot. Options are "rmse", "st_rmse", or "r2". Only used when "validation" is in what.
- ncomp
Integer vector of length 1 or 2 specifying which PLS components to plot. Default is c(1, 2). Only used when "gh" is in what.
Value
mbl
For mbl(), a list of class mbl containing:
  - control: control parameters from control
  - fit_method: fit constructor from fit_method
  - Xu_neighbors: list with neighbor indices and dissimilarities
  - dissimilarities: dissimilarity method and matrix (if return_dissimilarity = TRUE in control)
  - n_predictions: number of predictions made
  - gh: GH distances for Xr and Xu (if gh = TRUE)
  - validation_results: validation statistics by method
  - results: list of data.frame objects with predictions, one per neighborhood size
  - seed: the seed value used
Each results table contains:
  - o_index: observation index
  - k: number of neighbors used
  - k_diss, k_original: (neighbors_diss only) threshold and original count
  - ncomp: (fit_pls only) number of PLS components
  - min_ncomp, max_ncomp: (fit_wapls only) component range
  - yu_obs, pred: observed and predicted values
  - yr_min_obs, yr_max_obs: response range in neighborhood
  - index_nearest_in_Xr, index_farthest_in_Xr: neighbor indices
  - y_nearest, y_farthest: neighbor response values
  - diss_nearest, diss_farthest: neighbor dissimilarities
  - y_nearest_pred: (NNv validation) leave-one-out prediction
  - loc_rmse_cv, loc_st_rmse_cv: (local_cv validation) CV statistics
  - loc_ncomp: (local dissimilarity only) components used locally
Details
Spiking
The spike argument forces specific reference observations into or out
of neighborhoods. Positive indices are always included; negative indices are
always excluded. When observations are forced in, the most distant neighbors
are displaced to maintain neighborhood size. See Guerrero et al. (2010).
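The displacement rule can be illustrated with a minimal base R sketch. The helper spike_neighborhood() is hypothetical and not part of the package; it only mimics the behavior described above for a single target observation, given a vector of dissimilarities to all reference observations. It assumes there are fewer forced-in observations than neighbors requested.

```r
# Hypothetical helper (not part of the package): pick k neighbors for one
# target observation while honoring forced-in / forced-out reference indices.
spike_neighborhood <- function(d, k, spike = integer(0)) {
  forced_in <- spike[spike > 0]
  forced_out <- abs(spike[spike < 0])
  # Remaining candidates, ordered from most to least similar
  candidates <- setdiff(order(d), c(forced_in, forced_out))
  # Forced observations occupy slots, so the most distant ordinary
  # neighbors are displaced and the neighborhood size stays at k
  n_free <- max(k - length(forced_in), 0)
  c(forced_in, head(candidates, n_free))
}

d <- c(0.9, 0.1, 0.4, 0.2, 0.8, 0.3)  # toy dissimilarities to 6 reference rows
plain <- spike_neighborhood(d, k = 3)                      # 3 nearest neighbors
spiked <- spike_neighborhood(d, k = 3, spike = c(1, -2))   # row 1 in, row 2 out
plain
spiked
```

In this toy case both neighborhoods contain three observations; the spiked one keeps row 1 even though it is the most dissimilar, and never contains row 2.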
Dissimilarity usage
When diss_usage = "predictors", the local dissimilarity matrix columns
are appended as additional predictor variables, which can improve predictions
(Ramirez-Lopez et al., 2013a).
When diss_usage = "weights", neighbors are weighted using a tricubic
function (Cleveland and Devlin, 1988; Naes et al., 1990):
\[W_{j} = (1 - v^{3})^{3}\]
where \(v = d(xr_i, xu_j) / \max(d)\).
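As a rough illustration of the weighting formula, the base R sketch below computes tricubic weights for one neighborhood. The function name tricubic_weights() and the toy dissimilarities are assumptions made for this example, and the sketch assumes that the maximum is taken over the neighbors of the target observation; it is not the package's internal code.

```r
# Tricubic weights for a single neighborhood: W = (1 - v^3)^3, v = d / max(d).
# 'd' holds the dissimilarities between one target observation and its
# selected neighbors (toy values below).
tricubic_weights <- function(d) {
  v <- d / max(d)
  (1 - v^3)^3
}

d_neighbors <- c(0.05, 0.10, 0.20, 0.40)
w <- tricubic_weights(d_neighbors)
round(w, 3)
```

Under this scaling the weights decrease monotonically with dissimilarity, and the most distant neighbor receives a weight of exactly zero.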
GH distance
The global Mahalanobis distance (GH) measures how far each observation lies
from the center of the reference set. It is always computed using a PLS
projection with the number of components optimized via
ncomp_by_opc() (maximum 40 components or nrow(Xr),
whichever is smaller). This methodology is fixed and independent of the
diss_method specified for neighbor selection.
GH distances are useful for identifying extrapolation: observations with high GH values lie far from the calibration space and may yield unreliable predictions.
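The idea of a distance to the center of the reference set in score space can be sketched with base R alone. Note the assumptions: this example swaps the PLS projection for a PCA projection (prcomp()) to avoid extra dependencies, uses a fixed number of components instead of ncomp_by_opc(), and reports the squared Mahalanobis distance as returned by mahalanobis(). The values will therefore differ from those computed by mbl(); only the principle is shown.

```r
set.seed(1)
# Toy data: 50 reference observations and 5 shifted target observations
Xr <- matrix(rnorm(50 * 10), nrow = 50)
Xu <- matrix(rnorm(5 * 10, mean = 2), nrow = 5)
colnames(Xr) <- colnames(Xu) <- paste0("v", 1:10)

n_comp <- 3  # assumption: fixed here rather than optimized
pca <- prcomp(Xr, center = TRUE, scale. = FALSE)
scores_r <- pca$x[, seq_len(n_comp), drop = FALSE]
scores_u <- predict(pca, Xu)[, seq_len(n_comp), drop = FALSE]

# Squared Mahalanobis distance to the center of the reference scores
center_r <- colMeans(scores_r)
cov_r <- cov(scores_r)
gh_r <- mahalanobis(scores_r, center_r, cov_r)
gh_u <- mahalanobis(scores_u, center_r, cov_r)

summary(gh_r)
gh_u
```

Target observations whose distance is large relative to the distribution of gh_r would be the ones flagged as potential extrapolations.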
Grouping
The group argument enables leave-group-out cross-validation. When
validation_type = "local_cv" in mbl_control(), the
p parameter refers to the proportion of groups (not observations)
retained per iteration.
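A minimal base R sketch of one leave-group-out iteration is given below. All names (group, p, kept_groups) and the toy labels are illustrative assumptions; the sketch only shows how sampling whole groups, rather than individual observations, keeps every group on a single side of the split.

```r
set.seed(42)
# Toy group labels for 12 neighbors (e.g., measurement batches)
group <- factor(rep(c("a", "b", "c", "d", "e"), times = c(3, 2, 4, 1, 2)))
p <- 0.6  # proportion of groups retained for fitting in this iteration

groups <- levels(group)
n_keep <- max(1, round(p * length(groups)))
kept_groups <- sample(groups, size = n_keep)

train_idx <- which(group %in% kept_groups)
holdout_idx <- which(!group %in% kept_groups)

# Groups shared by both sets: should be empty
shared <- intersect(as.character(group[train_idx]),
                    as.character(group[holdout_idx]))
shared
```

Because groups differ in size, the proportion of observations retained can deviate from p even though the proportion of groups matches it.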
References
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 83:596-610.
Guerrero, C., Zornoza, R., Gomez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma 158:66-77.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62:664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196:268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199:43-53.
Rasmussen, C.E., Williams, C.K. 2006. Gaussian Processes for Machine Learning. MIT Press.
Shenk, J., Westerhaus, M., Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy 5:223-232.
Author
Leonardo Ramirez-Lopez and Antoine Stevens
Examples
if (FALSE) { # \dontrun{
library(prospectr)
data(NIRsoil)
# Preprocess: detrend + first derivative with Savitzky-Golay
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc, wav = as.numeric(colnames(NIRsoil$spc))),
  m = 1, p = 1, w = 7
)
NIRsoil$spc_pr <- sg_det
# Split data
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
# Example 1: Spectrum-based learner (Ramirez-Lopez et al., 2013a)
ctrl <- mbl_control(validation_type = "NNv")
sbl <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  neighbors = neighbors_k(seq(40, 140, by = 20)),
  diss_method = diss_pca(ncomp = ncomp_by_opc(40)),
  fit_method = fit_gpr(),
  control = ctrl
)
sbl
plot(sbl)
get_predictions(sbl)
# Example 2: With known Yu
sbl_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  neighbors = neighbors_k(seq(40, 140, by = 20)),
  fit_method = fit_gpr(),
  control = ctrl
)
plot(sbl_2)
# Example 3: LOCAL algorithm (Shenk et al., 1997)
local_algo <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  neighbors = neighbors_k(seq(40, 140, by = 20)),
  diss_method = diss_correlation(),
  diss_usage = "none",
  fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
  control = ctrl
)
plot(local_algo)
# Example 4: Using dissimilarity as predictors
local_algo_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  neighbors = neighbors_k(seq(40, 140, by = 20)),
  diss_method = diss_pca(ncomp = ncomp_by_opc(40)),
  diss_usage = "predictors",
  fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
  control = ctrl
)
plot(local_algo_2)
# Example 5: Parallel execution
library(doParallel)
# Use at most 2 workers, but never fewer than 1 (single-core machines)
n_cores <- max(1, min(2, parallel::detectCores() - 1))
clust <- makeCluster(n_cores)
registerDoParallel(clust)
local_algo_par <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  neighbors = neighbors_k(seq(40, 140, by = 20)),
  diss_method = diss_correlation(),
  fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
  control = ctrl
)
# Return to sequential execution and release the workers
registerDoSEQ()
try(stopCluster(clust))
} # }
