Section 5 Splitting the data
At this point we can split the data into calibration, validation, and prediction sets:
The calibration set comprises the samples identified in the previous section. The IDs of these samples are stored in the object cal_smpls.
The validation set comprises the samples in the original set that are labeled as validation. In the previous section, these samples were extracted into a separate object (valida).
The prediction set includes all the samples that were initially labeled as cal_candidate and were not selected for calibration.
To split the data you can execute the following:
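## calibration set: samples whose IDs are in cal_smpls
train <- data[as.character(data$ID) %in% cal_smpls, ]
## prediction set: the remaining calibration candidates
pred <- data[!(as.character(data$ID) %in% cal_smpls), ]
## take the first character of each ID as the layer
train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))
## store the IDs as factors (this also drops unused levels)
train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)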
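The split above can be checked on a small stand-in before running it on the real spectra. The following is only a sketch: the data frame toy, its IDs and values, and the vector toy_cal_smpls are invented for illustration and are not part of the tutorial data. It applies the same subsetting pattern and then verifies that the two sets do not overlap and together cover all rows.

```r
# Toy stand-in for `data`; IDs and values are invented for illustration only
toy <- data.frame(
  ID = c("A1", "A2", "A3", "B1", "B2", "B3"),
  value = c(0.12, 0.35, 0.27, 0.41, 0.18, 0.30),
  stringsAsFactors = FALSE
)
toy_cal_smpls <- c("A1", "B2", "B3")  # hypothetical calibration IDs

# Same subsetting pattern as in the split above
in_cal <- as.character(toy$ID) %in% toy_cal_smpls
toy_train <- toy[in_cal, ]
toy_pred <- toy[!in_cal, ]

# Layer label taken from the first character of each ID
toy_train$layer <- as.factor(substr(toy_train$ID, 1, 1))
toy_pred$layer <- as.factor(substr(toy_pred$ID, 1, 1))

# Consistency checks: no shared IDs, no lost rows, all calibration IDs found
stopifnot(
  length(intersect(toy_train$ID, toy_pred$ID)) == 0,
  nrow(toy_train) + nrow(toy_pred) == nrow(toy),
  all(toy_cal_smpls %in% toy_train$ID)
)

# Number of calibration samples per layer
table(toy_train$layer)
```

The same three checks could be run on train and pred, replacing toy with data and toy_cal_smpls with cal_smpls.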
Optionally, we can remove all the unnecessary data (R objects that will not be used from now on):
## necessary objects
reqobjects <- c("train", "pred", "valida", "cal_smpls", "o2rm")
## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]
## remove the objects
rm(list = o2rm)
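Because rm() is irreversible, it can help to try the clean-up pattern somewhere harmless first. The sketch below assumes a scratch environment (ws) standing in for the global workspace. The object names tmp_scores and old_model are hypothetical. It uses setdiff(), which is an equivalent way of listing the names that are not in the keep list.

```r
# A scratch environment stands in for the global workspace (illustration only)
ws <- new.env()
assign("train", 1:3, envir = ws)
assign("pred", 4:6, envir = ws)
assign("tmp_scores", c(0.2, 0.5), envir = ws)   # hypothetical leftover object
assign("old_model", "placeholder", envir = ws)  # hypothetical leftover object

# Names of the objects to keep
keep <- c("train", "pred")

# setdiff() returns the names present in the environment but not in `keep`
to_remove <- setdiff(ls(envir = ws), keep)

# Remove only those objects, and only from the scratch environment
rm(list = to_remove, envir = ws)

# Names that survive the clean-up
remaining <- ls(envir = ws)
remaining
```

Printing to_remove before calling rm() gives a chance to confirm that nothing needed later is about to be deleted.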
Alternatively…
## If you have saved the IDs of the calibration samples into your working
## directory you can:
cal_smpls <- readLines("calibration_samples_ids.txt")
and then rebuild the three sets from the original data:
## necessary objects
reqobjects <- c("cal_smpls", "o2rm")
## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]
## remove the objects
rm(list = o2rm)
## read again the data
nirfile <- file("https://github.com/l-ramirez-lopez/VNIR_spectroscopy_for_robust_soil_mapping/raw/master/SoilNIRSaoPaulo.rds")
data <- readRDS(nirfile)
## extract the validation samples into a new set/object
valida <- data[data$set == "validation", ]
data <- data[data$set == "cal_candidate", ]
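## calibration set: samples whose IDs are in cal_smpls
train <- data[as.character(data$ID) %in% cal_smpls, ]
## prediction set: the remaining calibration candidates
pred <- data[!(as.character(data$ID) %in% cal_smpls), ]
## take the first character of each ID as the layer
train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))
## store the IDs as factors (this also drops unused levels)
train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)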
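The alternative route above relies on the calibration IDs having been written to a text file, one ID per line. A minimal sketch of that round trip follows. The IDs are invented, and a temporary file stands in for calibration_samples_ids.txt so that nothing in the working directory is touched.

```r
# Hypothetical IDs; a temporary file stands in for "calibration_samples_ids.txt"
ids_out <- c("A1", "B2", "B3")
ids_file <- tempfile(fileext = ".txt")

# Write one ID per line
writeLines(ids_out, ids_file)

# Read the IDs back; readLines() returns a character vector
ids_in <- readLines(ids_file)

# The round trip should preserve both the values and their order
stopifnot(identical(ids_in, ids_out))

# Clean up the temporary file
unlink(ids_file)
```

Since readLines() always returns character strings, comparing against as.character(data$ID), as done in the split, keeps the match independent of whether the ID column is stored as a factor.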