Section 5 Splitting the data


At this point we can split the data into calibration, validation, and prediction sets:

  • The calibration set comprises the samples identified in the previous section. The IDs of these samples are in the object cal_smpls.

  • The validation set comprises the samples in the original set that are labeled as validation. In the previous section the validation samples were extracted into a separate object (valida).

  • The prediction set includes all the samples that were initially labeled as cal_candidate but were not selected for calibration.

To split the data you can execute the following:

train <- data[as.character(data$ID) %in% cal_smpls, ]
pred <- data[!(as.character(data$ID) %in% c(cal_smpls)), ]

train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))

train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)
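
Before moving on, it can be worth checking that the calibration and prediction sets do not overlap and that together they cover every candidate sample. The sketch below illustrates such a check on a small, made-up data frame; toy and toy_cal are hypothetical stand-ins for data and cal_smpls, and the same assertions could be adapted to the real objects.

```r
## toy stand-ins for data and cal_smpls (hypothetical values)
toy <- data.frame(
  ID = c("A1", "A2", "B1", "B2", "B3"),
  stringsAsFactors = FALSE
)
toy_cal <- c("A1", "B2")

## same split logic as above (drop = FALSE keeps a one-column data frame)
toy_train <- toy[as.character(toy$ID) %in% toy_cal, , drop = FALSE]
toy_pred <- toy[!(as.character(toy$ID) %in% toy_cal), , drop = FALSE]

## the first character of the ID is used as the layer label
toy_train$layer <- as.factor(substr(toy_train$ID, 1, 1))

## checks: no overlap, no sample lost, all calibration IDs found
stopifnot(
  length(intersect(toy_train$ID, toy_pred$ID)) == 0,
  nrow(toy_train) + nrow(toy_pred) == nrow(toy),
  all(toy_cal %in% toy_train$ID)
)
```

If any of the conditions fails, stopifnot() raises an error naming the failing expression, which makes a mismatch between the ID vector and the data easy to spot.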

Optionally, we can get rid of all the unnecessary data (R objects that will not be used from now on):

## necessary objects
reqobjects <- c("train", "pred", "valida", "cal_smpls", "o2rm")

## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]

## remove the objects
rm(list = o2rm)
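
Since rm() is irreversible within a session, the keep-list pattern can first be tried out in a separate environment. The following is a minimal sketch of that idea; the environment e, the vector keep, and the object names stored in e are made up for illustration and are not part of the workflow above.

```r
## a scratch environment with a few made-up objects
e <- new.env()
assign("train", 1, envir = e)
assign("pred", 2, envir = e)
assign("tmp_a", 3, envir = e)
assign("tmp_b", 4, envir = e)

## objects to keep
keep <- c("train", "pred")

## everything else in e is removed
drop_these <- setdiff(ls(envir = e), keep)
rm(list = drop_these, envir = e)

## names of the objects that are left in e
remaining <- ls(envir = e)
```

Here setdiff() plays the same role as the negated %in% expression used above; only the objects listed in keep survive.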

Alternatively…

## If you have saved the IDs of the calibration samples into your working
## directory you can:
cal_smpls <- readLines("calibration_samples_ids.txt")

and then…

## necessary objects
reqobjects <- c("cal_smpls", "o2rm")

## objects to be removed
o2rm <- ls()[!ls() %in% reqobjects]

## remove the objects
rm(list = o2rm)

## read the data again
nirfile <- file("https://github.com/l-ramirez-lopez/VNIR_spectroscopy_for_robust_soil_mapping/raw/master/SoilNIRSaoPaulo.rds")
data <- readRDS(nirfile)

## extract the validation samples into a new set/object
valida <- data[data$set == "validation", ]
data <- data[data$set == "cal_candidate", ]

train <- data[as.character(data$ID) %in% cal_smpls, ]
pred <- data[!(as.character(data$ID) %in% c(cal_smpls)), ]

train$layer <- as.factor(substr(train$ID, 1, 1))
valida$layer <- as.factor(substr(valida$ID, 1, 1))
pred$layer <- as.factor(substr(pred$ID, 1, 1))

train$ID <- factor(train$ID)
valida$ID <- factor(valida$ID)
pred$ID <- factor(pred$ID)
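
The alternative route above assumes that the calibration IDs were saved to a text file with one ID per line. A minimal sketch of such a round trip is given below; it uses a temporary file and made-up IDs rather than the actual calibration_samples_ids.txt, so the names ids, tmp, and ids_back are illustrative only.

```r
## made-up IDs standing in for cal_smpls
ids <- c("A1", "B2", "B7")

## write one ID per line to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(ids, tmp)

## read them back as a character vector
ids_back <- readLines(tmp)

## the IDs should survive the round trip unchanged
stopifnot(identical(ids, ids_back))

## clean up the temporary file
unlink(tmp)
```

Because readLines() always returns a character vector, comparing against as.character(data$ID), as done above, avoids mismatches when the ID column is stored as a factor.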