This note introduces how to implement the efficient and versatile kNN-LWPLSR algorithm with the Julia package Jchemo. The use of the algorithm is illustrated on the dataset challenge2018, a near-infrared (NIR) spectroscopy dataset. Note: almost all the content of this note can be directly transposed to the kNN-LWPLSR algorithm provided in the R package rchemo.
kNN-LWPLSR combines k-nearest neighbors (kNN) selection and partial least squares regression (PLSR). It is well suited when the predictive variables (X) are multicollinear and when heterogeneity in the data generates non-linear relations between X and the response variables to predict (Y).
More generally, kNN-LWPLS is a family of algorithms that can also be used for discrimination problems (i.e. when the response is categorical).
NIR spectrometry is a fast and nondestructive analytical method used in many contexts, for instance in agronomy to evaluate the nutritive quality of forages. Basically, spectral data X (a matrix of n observations and p columns representing wavelengths) are collected on samples of the material to study (e.g. forages) using a spectrometer. In parallel, a set of variables of interest (the 'response' variables) Y = { y1, …, yq } (q vectors of n observations, e.g. chemical compositions) are measured precisely in the laboratory. Regression models (or discrimination models when Y represents class memberships) are then fitted on the data {X, Y} and used to predict the response variables from new spectral observations.
Spectral data typically have highly collinear columns and, in general, matrix X is ill-conditioned. Solving the regression problem therefore requires regularization methods. A very popular approach for NIR data is partial least squares regression (PLSR). The first step of PLSR (regularization step) reduces the dimension (nb. columns) of X to a limited number a << p of orthogonal n x 1 vectors maximizing the squared covariance with Y. These vectors are referred to as the PLS scores, or latent variables (LVs), often noted t and gathered in the matrix T (n x a). The second step (regression) fits a multiple linear regression (MLR) that predicts Y from T.
PLSR is in general very efficient when the relationship between X and Y is linear. The method is fast (in particular with the Dayal & MacGregor kernel #1 algorithm), even for large data. The parameter to tune is the number of scores (nb. LVs) considered in the regression model.
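For reference, fitting a global PLSR with Jchemo typically takes only a few lines. The sketch below uses simulated data and assumes function plskern (Jchemo's implementation of the kernel #1 PLSR algorithm) together with the generic fit!/predict interface used later in this note:

## Sketch: global PLSR on simulated data (illustrative only)
using Jchemo
ntoy, ptoy = 100, 50
Xtoy = randn(ntoy, ptoy)
ytoy = randn(ntoy)
model = plskern(nlv = 15)          # nlv = nb. latent variables, the parameter to tune
fit!(model, Xtoy, ytoy)
pred = predict(model, Xtoy).pred   # predictions (here on the training data, for illustration)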
However, and especially in agronomy, new developments in data acquisition have resulted in increasingly large and complex datasets, posing new challenges for PLSR. Many modern datasets contain clusters due to variations in the collected products (categories, period, and areas of collection, etc.), often leading to non-linear relationships between X and Y. Since PLS relies on the assumption of linearity, its performance tends to decrease when applied to this type of data.
To address this challenge, the family of local PLSR methods extends standard PLSR by adapting to local patterns, making it more effective for complex datasets. The general principle is, for each new observation to predict (xnew), to pre-select the k nearest neighbors of xnew (kNN selection step) and then to fit a PLSR model to this neighborhood. The PLSR model is used to predict the response ynew. Two illustrations of neighborhood selection are presented below.
The local PLS family includes many variants, depending essentially on
(a) how the neighborhoods are selected, and
(b) the type of PLSR model implemented on the neighborhoods.
One of these variants is the kNN-LWPLSR approach described in section 1.2. This approach is implemented in the Jchemo function lwplsr, which is used in the present note. kNN-LWPLSR has the advantage of being fast, easy to tune and efficient for many types of data.
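As an illustration of the general principle described above, a 'brute force' local PLSR loop could be sketched as follows. This is not the Jchemo implementation; the data are simulated, and the calls to getknn and plskern are assumptions to be checked against their help pages:

## Sketch of the general local-PLSR principle (illustrative only)
using Jchemo
ntoy, ptoy, m = 200, 50, 5
Xtr = randn(ntoy, ptoy); ytr = randn(ntoy)   # simulated training data
Xnew = randn(m, ptoy)                        # simulated observations to predict
k = 30; nlv = 5
res = getknn(Xtr, Xnew; k, metric = :eucl)   # kNN selection step
pred = zeros(m)
for i in 1:m
    s = res.ind[i]                           # indexes of the k nearest neighbors of observation i
    model = plskern(nlv = nlv)               # PLSR fitted on the neighborhood only
    fit!(model, Xtr[s, :], ytr[s])
    pred[i] = predict(model, Xnew[i:i, :]).pred[1]
end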
kNN-LWPLSR runs locally weighted PLSR (LWPLSR) on each neighborhood, instead of standard PLSR as in usual simpler local PLSR methods. LWPLSR is a particular case of weighted PLSR (WPLSR):
In WPLSR, an n x 1 vector of weights w = ( w[1], w[2], …, w[n] ) is embedded into the PLSR equations. The PLS scores are computed by maximizing the w-weighted squared covariances between the scores and the response variables Y. The MLR prediction equation is computed by regressing Y on the scores using w-weighted least squares. The w-weighting is also embedded in the centering and any scaling of the data. Note: in standard PLSR, a uniform weight 1 / n is given to all the training observations, and therefore w can be removed from the equations (standard PLSR is thus a particular case of WPLSR).
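In linear-algebra terms, the w-weighting simply replaces unweighted means and least squares by their weighted counterparts, as in the following sketch on simulated matrices (plain Julia, not Jchemo's internal code):

## Sketch of the w-weighting used in WPLSR (simulated data, plain linear algebra)
using LinearAlgebra
ntoy, ptoy, a, q = 10, 5, 2, 1
Xtoy = randn(ntoy, ptoy); Ytoy = randn(ntoy, q)
Ttoy = randn(ntoy, a)               # stands for the PLS scores (n x a)
w = rand(ntoy); w ./= sum(w)        # weights, normalized to sum to 1
xmeans = w' * Xtoy                  # w-weighted column means (1 x p)
Xc = Xtoy .- xmeans                 # w-weighted centering
D = Diagonal(w)
B = (Ttoy' * D * Ttoy) \ (Ttoy' * D * Ytoy)   # w-weighted least squares of Ytoy on the scores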
Specifically, in LWPLSR the weight vector w is computed from a decreasing function, say f, of the dissimilarities (e.g. distances) between the n training observations and xnew, the observation to predict. The closer xi is to xnew, the higher the weight w[i] in the PLSR equations and therefore its importance in the prediction. This is the same distance-based principle as in the well-known locally weighted regression algorithm LOESS.
Compared to LWPLSR, kNN-LWPLSR simply adds a preliminary step: a neighborhood is selected in the training set around xnew, and LWPLSR is then applied to this neighborhood for the prediction.
Concerning the prediction results, kNN-LWPLSR is equivalent to running LWPLSR on the overall training data but with a double weighting: a first binary weighting (0: xi is not a neighbor, 1: xi is a neighbor) and a second, intra-neighborhood weighting defined by function f. In practice, however, and for large datasets, kNN-LWPLSR is much faster than computing LWPLSR on the whole training set.
This section details the kNN-LWPLSR pipeline as defined in function lwplsr of package Jchemo.
using Jchemo # if not loaded before
Function lwplsr has several keyword parameters that can be specified; they are presented in the help page of the function:
?lwplsr

lwplsr(; kwargs...)
lwplsr(X, Y; kwargs...)

k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).

• X : X-data (n, p).
• Y : Y-data (n, q).

Keyword arguments:
• nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
• metric : Type of dissimilarity used to select the neighbors and to compute the weights (see function getknn). Possible values are: :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).
• h : A scalar defining the shape of the weight function computed by function winvs. Lower is h, sharper is the function. See function winvs for details (keyword arguments criw and squared of winvs can also be specified here).
• k : The number of nearest neighbors to select for each observation to predict.
• tolw : For stabilization when very close neighbors.
• nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
• scal : Boolean. If true, (a) each column of the global X (and of the global Y if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Y scaling is also done within each neighborhood (local level) for the weighted PLSR.
• verbose : Boolean. If true, predicting information are printed.
[...]
The default values of the parameters can be displayed by
@pars lwplsr
Jchemo.ParLwplsr
  nlvdis: Int64 0
  metric: Symbol eucl
  h: Float64 Inf
  k: Int64 1
  criw: Float64 4.0
  squared: Bool false
  tolw: Float64 0.0001
  nlv: Int64 1
  scal: Bool false
  verbose: Bool false
The five main parameters to consider are: nlvdis, metric, h, k and nlv. They are detailed in sections 2.2 and 2.3 below.
A first step is to choose whether or not the dissimilarities between observations are computed after a dimension reduction of X. This is managed by parameter nlvdis:
If nlvdis = 0, there is no dimension reduction.
If nlvdis > 0, a preliminary global PLS with nlvdis LVs is done on the entire dataset {X, Y} and the dissimilarities are computed on the resulting score matrix T (n x nlvdis). Note: This option only concerns the dissimilarities computation. The regression steps (LWPLSR) are always computed on the original X data.
Then, the type of dissimilarities has to be chosen, with parameter metric. The available metrics are those proposed in function getknn:
• metric : Type of distance used for the query. Possible values are :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cos (cosine distance), :cor (correlation distance).
Computing Mahalanobis distances on 15-25 global PLS scores (nlvdis) is often a good choice. But the best choice of metric is often dataset-dependent and no general rule can be recommended.
Note: If X has collinear columns (which is the case for NIRS data), the use of the Mahalanobis distance requires a preliminary dimension reduction, since the inverse of the covariance matrix cov(X) cannot be computed stably.
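In practice, setting nlvdis > 0 therefore means that the dissimilarities are computed on global PLS scores rather than on the raw spectra, which makes the Mahalanobis distance well-defined. A rough sketch on simulated data (the calls to plskern, transf and getknn are assumptions; see their help pages):

## Sketch: dissimilarities computed on global PLS scores (illustrative only)
using Jchemo
ntoy, ptoy, m = 200, 50, 5
Xtr = randn(ntoy, ptoy); ytr = randn(ntoy)
Xnew = randn(m, ptoy)
nlvdis = 15
global_pls = plskern(nlv = nlvdis)       # global PLS used only for the dissimilarities
fit!(global_pls, Xtr, ytr)
Ttr = transf(global_pls, Xtr)            # global scores (n x nlvdis)
Tnew = transf(global_pls, Xnew)
res = getknn(Ttr, Tnew; k = 50, metric = :mah)   # neighbor selection on the scores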
The next parameter to set is h, which determines the shape of the weight function f for the LWPLSR models. Function f has a negative exponential shape whose sharpness is defined by h: the lower h, the sharper function f, and therefore the more the closest neighbors of xnew matter in the LWPLSR fit. The case h = Inf is the unweighted situation (all the components of w are equal), which corresponds to the more usual kNN-PLSR.
In function f, weights are computed by exp(-d / (h * MAD(d))) and are set to 0 for extreme (potentially outlier) distances, i.e. when d > Median(d) + 4 * MAD(d). Finally, the weights are standardized to their maximal value. An illustration of the effect of h is given below.
Many alternatives could have been considered to define the weight function f (e.g. bicube or tricube functions). The negative exponential form chosen in Jchemo is versatile and easily tunable (many shapes can be obtained by varying a single parameter only).
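The shape of f and the effect of h can be visualized with a quick re-implementation of the rule described above. This is a sketch, not Jchemo's winvs; in particular, the exact MAD scaling used internally is an assumption here:

## Sketch of the weight function f for several values of h (illustrative only)
using Statistics, CairoMakie
mad(x) = 1.4826 * median(abs.(x .- median(x)))   # median absolute deviation (the 1.4826 scaling is an assumption)
function fweights(d, h; criw = 4)
    w = exp.(-d ./ (h * mad(d)))                 # negative exponential shape
    w[d .> median(d) + criw * mad(d)] .= 0       # cutoff of extreme (potentially outlier) distances
    w / maximum(w)                               # standardization to the maximal value
end
d = sort(10 * rand(100))                         # simulated distances to xnew
fig = Figure(size = (500, 300))
ax = Axis(fig[1, 1], xlabel = "Distance to xnew", ylabel = "Weight")
for h in [1, 2, 6, Inf]
    lines!(ax, d, fweights(d, h); label = "h = $h")
end
axislegend(ax)
fig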
The two final parameters to set are
k: the number of observations defining the neighborhood for each observation xnew to predict.
nlv: the number of LVs considered in the LWPLSR model fitted on the neighborhood.
In this version of the algorithm, k and nlv are the same for all the observations to predict. Note that if k is larger than the number of training observations, kNN-LWPLSR reduces to LWPLSR.
Dataset challenge2018 was built for the prediction challenge organized in 2018 at the congress Chemometrics2018 in Paris. It consists of 4,075 NIR spectra collected on various materials typically analysed in agronomic laboratories: animal feed, rapeseed, corn gluten, grass silage, wheat, full-fat soya, milk powder and whey, maize, sunflower seed (ground), and soya meal. The univariate response y to predict was the protein concentration.
The dataset contains
Object X (4075 x 680): The spectra, with wavelengths of 1120-2478 nm and a 2-nm step.
Object Y (4075 x 4): Variable conc (protein concentration) and other meta-data.
## Preliminary loading of packages
using Jchemo       # if not loaded before
using JchemoData   # a library of various benchmark datasets
using JLD2         # package for loading/saving JLD2 data
using CairoMakie   # plotting backend
using FreqTables   # utilities for frequency tables
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
@names dat
(:X, :Y)
X = dat.X
@head X
... (4075, 680)
| Row | 1120     | 1122     | 1124     | 1126     | ⋯ |
|-----|----------|----------|----------|----------|---|
| 1   | 0.597482 | 0.595978 | 0.593623 | 0.59084  | ⋯ |
| 2   | 0.954192 | 0.953237 | 0.952002 | 0.950426 | ⋯ |
| 3   | 0.611137 | 0.609566 | 0.60743  | 0.604767 | ⋯ |
Y = dat.Y
@head Y
... (4075, 4)
| Row | typ | label | conc | test |
|---|---|---|---|---|
| String | String | Float64 | Int64 | |
| 1 | FRG | wheat (ung) | 12.74 | 0 |
| 2 | MPW | milk powder & whey | 35.7212 | 0 |
| 3 | FRG | wheat (ung) | 12.0 | 0 |
y = Y.conc # variable to predict (protein concentration)
4075-element Vector{Float64}:
12.7399998
35.721199
12.0
13.8449764
19.2999992
7.6191959
24.332531
36.5717583
55.9027786
11.2599993
⋮
65.1012344
56.1860695
50.8610802
8.1399994
13.3100004
11.6800003
18.2399998
67.6700745
25.0300007
wlst = names(X)              # wavelengths
wl = parse.(Float64, wlst)
680-element Vector{Float64}:
1120.0
1122.0
1124.0
1126.0
1128.0
1130.0
1132.0
1134.0
1136.0
1138.0
⋮
2462.0
2464.0
2466.0
2468.0
2470.0
2472.0
2474.0
2476.0
2478.0
ntot, p = size(X)
(4075, 680)
freqtable(string.(Y.typ, " - ", Y.label))
10-element Named Vector{Int64}
Dim1 │
──────────────────────────┼────
ANF - animal feed │ 391
CLZ - rapeseed(ung) │ 420
CNG - corn gluten │ 395
EHH - grass silage │ 422
FFS - full fat soya │ 432
FRG - wheat (ung) │ 411
MPW - milk powder & whey │ 410
PEE - maize wp │ 407
SFG - sun flower seed(gr) │ 281
TTS - soya meal │ 506
The spectra can be plotted by
## For illustration purposes, only 30 spectra (randomly chosen) are plotted
plotsp(X, wl; size = (500, 300), nsamp = 30, xlabel = "Wavelength (nm)", ylabel = "Reflectance").f
Two preprocessing steps are applied to remove potential non-informative physical effects in the spectra: a standard normal variate (SNV) transformation, followed by a second-order Savitzky-Golay derivative.
model1 = snv()
model2 = savgol(npoint = 21, deriv = 2, degree = 3)
model = pip(model1, model2)
fit!(model, X)
@head Xp = transf(model, X)
3×680 Matrix{Float64}:
-0.00393533 -0.00441755 -0.00477681 … 0.000514478 0.000481081
-0.00121436 -0.0013095 -0.00135921 0.000996321 0.000930884
-0.00355712 -0.00399785 -0.00434838 0.00044084 0.000421751
... (4075, 680)
Observation of the resulting spectra clearly indicates the presence of high heterogeneity in the data
plotsp(Xp, wl; size = (500, 300), nsamp = 30, xlabel = "Wavelength (nm)").f
which is confirmed by the highly clustered pattern observed in the PCA score space (related to the various types of materials analyzed).
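For instance, a PCA score plot of the preprocessed spectra, colored by material type, can be sketched as follows (assuming Jchemo's pcasvd and the grouped form of plotxy; see their help pages):

## PCA on the preprocessed spectra and score plot colored by material type (sketch)
model_pca = pcasvd(nlv = 5)
fit!(model_pca, Xp)
T = transf(model_pca, Xp)        # PCA scores
plotxy(T[:, 1], T[:, 2], Y.typ; size = (500, 400), xlabel = "PC1", ylabel = "PC2").f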
In this illustration, the split of the total data is the one given at the Paris prediction challenge (variable Y.test in the dataset), but any other split could be chosen by ad hoc sampling.
freqtable(Y.test)
2-element Named Vector{Int64}
Dim1 │
──────┼─────
0 │ 3701
1 │ 374
The final data are given by
s = Bool.(Y.test)    # same as: s = Y.test .== 1
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
ytrain = rmrow(y, s)
typtrain = rmrow(Y.typ, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
ytest = y[s]
typtest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = ntot, ntrain, ntest)
(ntot = 4075, ntrain = 3701, ntest = 374)
It is useful to check whether or not the test set is well represented by the training set. A first look can, for instance, be obtained by projecting the test spectra into a PCA score space built from the training spectra.
But such a 2-D (or 3-D) representation can be misleading, since differences can exist in higher dimensions. A better approach is to plot the score distances (SD) vs. the orthogonal distances (OD), which has the advantage of jointly accounting for all the dimensions of the PCA model.
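As a sketch, the SD and OD can be computed 'by hand' from a PCA of the training spectra with plain linear algebra (Jchemo also provides dedicated functions for this kind of diagnostics; see the package documentation):

## SD-OD plot of the test spectra vs. a PCA model of the training spectra (sketch)
using LinearAlgebra, Statistics
a = 10                                         # nb. PCA dimensions (arbitrary choice here)
xmeans = mean(Xtrain, dims = 1)
U, sv, V = svd(Xtrain .- xmeans)
Va = V[:, 1:a]                                 # PCA loadings computed on the training spectra
Ttrain = (Xtrain .- xmeans) * Va               # training scores
Ttest = (Xtest .- xmeans) * Va                 # test spectra projected in the training score space
S = cov(Ttrain)
sd = [sqrt(dot(t, S \ t)) for t in eachrow(Ttest)]    # score distances (Mahalanobis in the score space)
E = (Xtest .- xmeans) - Ttest * Va'                   # X-residuals of the test spectra
od = [norm(e) for e in eachrow(E)]                    # orthogonal distances
plotxy(sd, od; size = (500, 300), xlabel = "Score distance (SD)", ylabel = "Orthogonal distance (OD)").f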
It is also useful to check the representativity of the y-variable.
summ(y, Y.test)
Class: 0
1×7 DataFrame
Row │ variable mean std min max n nmissing
│ Symbol Float64 Float64 Float64 Float64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ x1 31.894 20.297 3.061 76.604 3701 0
Class: 1
1×7 DataFrame
Row │ variable mean std min max n nmissing
│ Symbol Float64 Float64 Float64 Float64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ x1 32.288 20.874 2.766 75.8559 374 0
A usual approach to model tuning in machine learning is to split the training dataset into calibration and validation sets, and then do a grid search:
the parameter space is reduced to a discrete grid of parameter combinations, over which the model performance is evaluated on the validation set(s). The combination showing the best performance can generally be considered as the best model given the available data.
The popular K-fold cross-validation (CV) is such a strategy. Nevertheless, K-fold CV requires predicting all the observations of the training set (in this illustration, ntrain = 3,701 obs.), and this for each parameter combination of the grid. This can be too time-consuming for local PLSR if the dataset and/or the grid are large. A lighter approach is to do a single 'calibration/validation' split (CAL and VAL sets, respectively) of the training data and to run the grid search on this single split. This strategy is used in the present note.
Below, the VAL set is selected by systematic sampling along the data but other sampling designs (e.g. random) could be chosen.
nval = 300
s = sampsys(1:ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
ncal = ntrain - nval
(ntot = ntot, ntrain, ntest, ncal, nval)
(ntot = 4075, ntrain = 3701, ntest = 374, ncal = 3401, nval = 300)
Then the grid of parameters is built by
## For this illustration, the grid has been built with a relatively low number of parameter combinations.
## More extended combinations could be considered.
nlvdis = [15]; metric = [:mah]
h = [1; 2; 4; 6; Inf]
k = [200; 350; 500; 1000]
nlv = 0:15
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)   # the grid
length(pars[1])   # nb. parameter combinations considered
20
The grid search can easily be run with function gridscore, a generic function that can be used with all the predictive models of package Jchemo. Note: the equivalent function for K-fold CV is gridcv.
model = lwplsr()
res = gridscore(model, Xcal, ycal, Xval, yval;
    score = rmsep,   # performance criterion computed on VAL
    pars,            # defined grid of parameters
    nlv              # parameter 'nlv' has been set out of 'pars' to decrease the computation time (see the gridscore help page)
    )
@head res   # first rows of the result table
... (320, 6)
| Row | nlvdis | metric | h | k | nlv | y1 |
|---|---|---|---|---|---|---|
| Any | Any | Any | Any | Int64 | Float64 | |
| 1 | 15 | mah | 1.0 | 200 | 0 | 2.02906 |
| 2 | 15 | mah | 1.0 | 200 | 1 | 1.51452 |
| 3 | 15 | mah | 1.0 | 200 | 2 | 1.11928 |
which gives graphically
group = string.("nlvdis=", res.nlvdis, ", h=", res.h, ", k=", res.k)
plotgrid(res.nlv, res.y1, group; step = 1, xlabel = "Nb. LVs", ylabel = "RMSEP (Validation set)").f
The final model can be defined by selecting the best parameter combination.
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
| Row | nlvdis | metric | h | k | nlv | y1 |
|---|---|---|---|---|---|---|
| Any | Any | Any | Any | Int64 | Float64 | |
| 9 | 15 | mah | 1.0 | 200 | 8 | 0.675864 |
and the final prediction of the test set is given by
model = lwplsr(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@head pred
3×1 Matrix{Float64}:
19.76328396932776
6.200351184655297
11.381457704579901
... (374, 1)
Summary of the predictions
mse(pred, ytest)
rmsep(pred, ytest)   # estimate of the generalization error
1×1 Matrix{Float64}:
0.7157501840557478
plotxy(pred, ytest; color = (:red, .5), bisect = true, xlabel = "Predictions (Test)", ylabel = "Observed data (Test)", title = "Protein concentration (%)").f
Andersson M. A comparison of nine PLS1 algorithms. J Chemom. 2009;23(10):518-529. doi:10.1002/cem.1248.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American statistical association, 74(368), 829-836. DOI: 10.1080/01621459.1979.10481038
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association, 83(403), 596-610. DOI:10.1080/01621459.1988.10478639
Davrieux F, Dufour D, Dardenne P, et al. LOCAL regression algorithm improves near infrared spectroscopy predictions when the target constituent evolves in breeding populations. J Infrared Spectrosc. 2016;24(2):109. doi:10.1255/jnirs.1213
Dayal BS, MacGregor JF. Improved PLS algorithms. J Chemom. 1997;11(1):73-85. doi:10.1002/(SICI)1099-128X(199701)11:1<73::AID-CEM435>3.0.CO;2-#.
Höskuldsson A. PLS regression methods. J Chemom. 1988;2(3):211-228. doi:10.1002/cem.1180020306.
Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011;421(2):269-274. doi:10.1016/j.ijpharm.2011.10.007
Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics n/a, e3209. https://doi.org/10.1002/cem.3209
Lesnoff, M., 2024. Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data. Chemometrics and Intelligent Laboratory Systems 244, 105031. https://doi.org/10.1016/j.chemolab.2023.105031
Manne R. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemom Intell Lab Syst. 1987;2(1-3):187-197. doi:10.1016/0169-7439(87)80096-5.
Schaal, S., Atkeson, C.G., Vijayakumar, S., 2002. Scalable Techniques from Nonparametric Statistics for Real Time Robot Learning. Applied Intelligence 17, 49–60. https://doi.org/10.1023/A:1015727715131
Shenk J, Westerhaus M, Berzaghi P. Investigation of a LOCAL calibration procedure for near infrared instruments. J Infrared Spectrosc. 1997;5(1):223. doi:10.1255/jnirs.115
Sicard E, Sabatier R. Theoretical framework for local PLS1 regression, and application to a rainfall data set. Comput Stat Data Anal. 2006;51(2):1393-1410. doi:10.1016/j.csda.2006.05.002.
Wold H. Nonlinear iterative partial least squares (NIPALS) modelling: some current developments. In: Krishnaiah PR, ed. Multivariate Analysis II. Proceedings of the symposium held at Wright State University, Dayton, Ohio, USA, June 19-24, 1972. New York: Academic Press; 1973:383-407.
Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001;58(2):109-130. doi:10.1016/S0169-7439(01)00155-1.
Yoshizaki R, Kano M, Tanabe S, Miyano T. Process parameter optimization based on LW-PLS in pharmaceutical granulation process. IFAC-PapersOnLine. 2015;48(8):303-308. doi:10.1016/j.ifacol.2015.08.198.
Zhang X, Kano M, Li Y. Locally weighted kernel partial least squares regression based on sparse nonlinear features for virtual sensing of nonlinear time-varying processes. Comput Chem Eng. 2017;104:164-171. doi:10.1016/j.compchemeng.2017.04.014.