Section 8 Prepare the vis-NIR augmented dataset

Here we’ll create the data that will be used for Spatial modelling. This dataset will contain

Nr: The arbitrary sample number.
ID: The factor indicating the sample IDs.
POINT_X: The X (geographical) coordinate.
POINT_Y: The Y (geographical) coordinate.
layer: A factor indicating the depth layer at which the sample was collected (A: 0-20 cm and B: 80-100 cm).
set: A factor indicating whether the sample was used for vis-NIR calibrations (train), for vis-NIR predictions (prediction) or if it belongs to model’s validation (validation). The samples labeled as validation are the same samples initially labeled as validation in the original dataset.
Ca: The exchangeable Calcium content in the sample (\(mmol_{c}\) \(kg^{−1}\), measured by conventional laboratory methods)
Clay: The percentage of clay contnet in the soil sample (measured by conventional laboratory methods).
Silt: The percentage of silt contnet in the soil sample (measured by conventional laboratory methods).
Sand: The percentage of sand contnet in the soil sample (measured by conventional laboratory methods).
alr_Clay: The additive log-ratio transformed clay contnets (measured by conventional laboratory methods).
alr_Silt: The additive log-ratio transformed silt contnets (measured by conventional laboratory methods).
Ca_spec: This is the vis-NIR augmented exchangeable Ca²⁺ contents.
alr_Clay_spec: This is the vis-NIR augmented additive log-ratio transformed clay contnets.
alr_Silt_spec: This is the vis-NIR augmented additive log-ratio transformed silt contnets.

For the vis-NIR augmented variables (alr_Clay_spc, alr_Silt_spc and Ca_spec) there are three classes of values:

The values of the samples that are labeled as train come from the conventional laboratory methods (e.g. for Ca_spec the values of these samples for this variable are identical to the corresponding values in the variable Ca).
The values of the samples that are labeled as prediction come from the predictions done with the respective vis-NIR model.
The values of the samples that are labeled as validation are treated as missing (i.e. NAs).

## samples for the set 'prediction'
vnirpredictions

## samples for the set 'train'
vnirtrain <- train[, c("ID", "POINT_X", "POINT_Y", "set", "Ca", "Clay", "Silt", 
                    "Sand", "alr_Clay", "alr_Silt")]
vnirtrain$set <- factor("train")
vnirtrain$Ca_spec <- vnirtrain$Ca
vnirtrain$alr_Clay_spec <- vnirtrain$alr_Clay
vnirtrain$alr_Silt_spec <- vnirtrain$alr_Silt

## samples for the set 'validation'
vnirvalidation <- valida[, c("ID", "POINT_X", "POINT_Y", "set", "Ca", "Clay", 
                    "Silt", "Sand", "alr_Clay", "alr_Silt")]
vnirvalidation$set <- factor(vnirvalidation$set)
vnirvalidation$Ca_spec <- NA
vnirvalidation$alr_Clay_spec <- NA
vnirvalidation$alr_Silt_spec <- NA

Now create a single data.frame containing the three data sets…

vniraugmented <- rbind(vnirtrain, vnirpredictions, vnirvalidation)

vniraugmented$layer <- factor(substr(vniraugmented$ID, 1, 1))

## Reorganize the variables
vniraugmented <- vniraugmented[, c("ID", "POINT_X", "POINT_Y", "layer", "set", 
                    "Ca", "Clay", "Silt", "Sand", "alr_Clay", "alr_Silt", "Ca_spec", "alr_Clay_spec", 
                    "alr_Silt_spec")]

Compute some statistics for the final data set…

## Names of the properties
props <- c("Ca", "Clay", "Silt", "Sand", "alr_Clay", "alr_Silt", "Ca_spec", 
                    "alr_Clay_spec", "alr_Silt_spec")

## Compute the statistics: mean, standard deviation and the quantiles ('0%',
## '25%', '50%', '75%' and'100%')
statsprops <- aggregate(vniraugmented[, props], by = list(set = vniraugmented$set, 
                    layer = vniraugmented$layer), FUN = function(x) {
                    c(mean = mean(as.matrix(x), na.rm = TRUE), sd = sd(as.matrix(x), na.rm = TRUE), 
                                        quantile(x, na.rm = TRUE))
})

## Reorganize the object containing the results of the statistics
statsprops <- lapply(props, FUN = function(x, object, ids) {
                    object <- cbind(object[, keep], as.data.frame(statsquant[[x]]))
                    
}, object = statsprops, ids = c("set", "layer"))
names(statsprops) <- props

statsprops <- do.call("rbind", statsprops)
statsprops$property <- gsub(".[0-9]", "", rownames(statsprops))
statsprops[is.na(statsprops)] <- NA

## Reorganize the order of the variables
statsprops <- statsprops[, c("set", "layer", "property", "mean", "sd", "0%", 
                    "25%", "50%", "75%", "100%")]

statsprops

Optionally, save this data in your working directory

write.table(x = vniraugmented, file = "vniraugmented.txt", sep = "\t", row.names = FALSE)