Introduction to the kNN-LWPLSR algorithm

This note introduces how to implement the efficient and versatile kNN-LWPLSR algorithm with the Julia package Jchemo. The workflow is illustrated with the near-infrared spectroscopy (NIRS) dataset challenge2018. Note: almost all the content of this note can be directly transposed to the algorithm provided in the R package rchemo.

kNN-LWPLSR combines k-nearest-neighbors (kNN) selection and partial least squares regression (PLSR). It is well suited when the predictive variables (X) are multicollinear and when heterogeneity in the data generates non-linear relations between X and the response variables to predict (Y).

1. Preliminaries

1.1 NIRS regression

NIRS is a fast and nondestructive analytical method used in many contexts, for instance in agronomy to evaluate the nutritive quality of forages. Basically, spectral data X (a matrix of n observations and p columns representing wavelengths) are collected on samples of the material to study (e.g. forages) using a spectrometer, and targeted response variables Y = { y1, …, yq } (q vectors of n observations) (e.g. chemical compositions) are measured precisely in the laboratory. Regression models of Y on X are then fitted and used to predict the response variables from new spectral observations.

Spectral data are known to be highly collinear in columns and, in general, matrix X is ill-conditioned. Regularization methods have to be implemented to solve the regression problem. A very popular regularization used for NIRS is partial least squares regression (PLSR), which is a latent variables (LVs) approach. PLSR starts by reducing the dimension (number of columns) of X to a limited number a << p of orthogonal n x 1 vectors that maximize the squared covariance with Y, referred to as the scores T (n x a). Response Y is then regressed on the scores (the latent variables) T by a multiple linear regression (MLR). PLSR is in general very efficient when the relationship between X and Y is linear. The method is fast (in particular with the Dayal & MacGregor kernel algorithm #1), even for large data. The parameter to tune is the dimension of T (number of LVs).
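
As a reminder of the basic workflow, fitting a plain PLSR with Jchemo follows the same model / fit! / predict pattern as used throughout this note. Below is a minimal sketch on simulated data; it assumes that function plskern (kernel algorithm #1) accepts the keyword nlv, and the data sizes and nlv = 10 are arbitrary.

## Minimal PLSR sketch (simulated data; plskern and nlv = 10 are illustrative assumptions)
using Jchemo
Xs = rand(100, 50)                 # simulated spectra (100 observations, 50 variables)
ys = rand(100)                     # simulated response
Xnew = rand(5, 50)                 # new observations to predict
model = plskern(nlv = 10)          # PLSR with 10 latent variables
fit!(model, Xs, ys)
pred = predict(model, Xnew).pred   # (5 x 1) matrix of predictions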

For several years, agronomic databases (e.g. in feed, food or soil research) have been aggregating large numbers of samples of different natures or origins, leading to internal heterogeneity. This generates curvature and/or clustering in the data that can alter the linear relation between X and Y and therefore degrade the PLSR prediction performance. Local PLSR is a simple methodology that can cope with such non-linearity. The general principle is, for each new observation to predict (xnew), to pre-select the k nearest neighbors of the observation (the kNN selection step) and then to apply a PLSR to the neighborhood (i.e. the k selected neighbors). Two illustrations of neighborhood selection are presented below.

Many variants of local PLSR pipelines can be built, depending essentially on (a) how the neighborhoods are selected and (b) the type of PLSR model implemented within the neighborhoods. The kNN-LWPLSR algorithm described below is one of these variants.

1.2 Theory (summary)

kNN-LWPLSR applies locally weighted PLSR (LWPLSR), instead of common PLSR, on each neighborhood. LWPLSR is a particular case of weighted PLSR (WPLSR). In WPLSR, an n x 1 vector of weights w = ( w[1], w[2], …, w[n] ) is embedded into the PLSR equations. The PLS scores are computed by maximizing the w-weighted squared covariances between the scores and the response variables Y. The MLR prediction equation is computed by regressing Y on the scores using w-weighted least squares. The w-weighting is also embedded in the centering and possible scaling of the data. Note: in usual PLSR, a uniform weight, 1 / n, is given to all the training observations, and therefore w can be removed from the equations.

In LWPLSR, the weight vector w is computed from a decreasing function, say f, of the dissimilarities (e.g. distances) between the n training observations and xnew, the observation to predict. The closer xi is to xnew, the higher the weight w[i] in the PLSR equations and therefore its importance for the prediction. This is the same distance-based principle as in the well-known locally weighted regression algorithm LOESS.

Compared to LWPLSR, kNN-LWPLSR simply adds a preliminary step: a neighborhood is selected around xnew, and then LWPLSR is applied to this neighborhood for prediction. kNN-LWPLSR can be viewed as an LWPLSR with a double weighting: a first binary weighting (0: xi is not a neighbor, 1: xi is a neighbor) and a second, intra-neighborhood, weighting defined by function f. In practice, kNN-LWPLSR is in general much faster to compute than LWPLSR, in particular for large datasets.

2. Function lwplsr

This section details the kNN-LWPLSR pipeline as defined in function lwplsr of package Jchemo.

2.1 Keyword parameters

Function lwplsr has several keyword parameters that can be specified. The full list is provided in the function help page:

?lwplsr

  lwplsr(; kwargs...)
  lwplsr(X, Y; kwargs...)

  k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

  Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension
    reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights (see
    function getknn). Possible values are: :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral
    angular distance), :cor (correlation distance).
  • h : A scalar defining the shape of the weight function computed by function winvs. Lower is h,
    sharper is the function. See function winvs for details (keyword arguments criw and squared of
    winvs can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : For stabilization when very close neighbors.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, (a) each column of the global X (and of the global Y if there is a
    preliminary PLS dimension reduction) is scaled by its uncorrected standard deviation before
    computing the distances and the weights, and (b) the X and Y scaling is also done within each
    neighborhood (local level) for the weighted PLSR.
  • verbose : Boolean. If true, predicting information is printed.

  [...]

and the default values of the parameters can be displayed by

using Jchemo
@pars lwplsr
Jchemo.ParLwplsr
  nlvdis: Int64 0
  metric: Symbol eucl
  h: Float64 Inf
  k: Int64 1
  criw: Float64 4.0
  squared: Bool false
  tolw: Float64 0.0001
  nlv: Int64 1
  scal: Bool false
  verbose: Bool false

The five main parameters to consider are: nlvdis, metric, h, k and nlv. They are detailed below.

2.2 Neighborhood computation

A first step is to choose whether the dissimilarities between observations are computed on the raw X-data or after a dimension reduction of X. This is managed by parameter nlvdis:

  • If nlvdis = 0, there is no dimension reduction

  • If nlvdis > 0, a preliminary global PLS with nlvdis LVs is done on the entire dataset {X, Y} and the dissimilarities are computed on the resulting score matrix T (n x nlvdis)

Then, the type of dissimilarity has to be chosen with parameter metric. The available metrics are those proposed in function getknn:

metric : Type of distance used for the query. Possible values are :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).

Mahalanobis distance computed on 15-25 global PLS scores (nlvdis) is often a good choice, but this is very dataset-dependent and other choices can perform as well or better.

Note: if X has collinear columns (which is the case for NIRS data), the use of the Mahalanobis distance requires a preliminary dimension reduction, since the inverse of the covariance matrix cov(X) cannot be computed stably.
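
To make the neighborhood selection concrete, here is a small sketch of a kNN query with function getknn on simulated data. It is assumed that getknn returns the neighbor indices and distances in fields ind and d; the data sizes and k = 5 are arbitrary.

## kNN selection sketch (simulated data; output fields ind and d are assumed)
using Jchemo
Xs = rand(100, 10)                        # 'training' observations
Xnew = rand(2, 10)                        # 2 observations to predict
res = getknn(Xs, Xnew; metric = :eucl, k = 5)
res.ind   # for each new observation, the indices of its 5 nearest neighbors
res.d     # the corresponding distances (used afterwards to compute the weights)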

2.3 Weighting function

The next parameter to set is h. Weight function f has a negative exponential shape whose sharpness is defined by h: the lower h, the sharper f, and therefore the more importance the closest neighbors of xnew get in the LWPLSR fit. The case h = Inf is the unweighted situation (all the components of w are equal), corresponding to a kNN-PLSR.

More precisely, the weights are computed by exp(-d / (h * MAD(d))) and are set to 0 for extreme (potentially outlying) distances such that d > Median(d) + criw * MAD(d) (with criw = 4 by default). Finally, the weights are standardized to their maximal value. An illustration of the effect of h is given below.

Many alternative weight functions to f (e.g. bicube or tricube functions) could have been implemented. Function f was chosen to be versatile and easily tunable.
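
For understanding, the weight computation described above can be re-implemented in a few lines of base Julia. This is only an illustrative sketch (within lwplsr, the weights are computed by function winvs); the helper name fweight and the example values of h and criw are arbitrary.

## Illustrative re-implementation of the weight function (sketch only)
using Statistics
function fweight(d; h = 2, criw = 4)
    madd = median(abs.(d .- median(d)))       # MAD of the distances
    w = exp.(-d / (h * madd))                 # negative exponential shape
    w[d .> median(d) + criw * madd] .= 0      # cut-off of extreme distances
    w / maximum(w)                            # standardization to the maximal value
end
d = abs.(randn(100))        # e.g. dissimilarities to xnew
w = fweight(d; h = 2)       # lower h ==> sharper weights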

2.4 Dimensions of the neighborhood and the local models

The two final parameters to set are

  • k: the number of observations defining the neighborhood for each observation to predict

  • nlv: the number of LVs considered in the local LWPLSR model

Note that if k is larger than the training set size, kNN-LWPLSR reduces to LWPLSR.
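
Putting the five main parameters together, a kNN-LWPLSR model can for instance be specified as follows (the values are arbitrary and only illustrate the syntax; the case study below tunes them on real data):

model = lwplsr(nlvdis = 15, metric = :mah, h = 2, k = 500, nlv = 10)
## then, as usual: fit!(model, Xtrain, ytrain) and predict(model, Xnew).pred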

3. Case study

Dataset challenge2018 was built for the prediction challenge organized in 2018 at the congress Chemometrics2018 in Paris. It consists of NIR spectra collected on various plant and milk materials. The response Y to predict is univariate and corresponds to the protein concentration.

3.1 Data importation

The dataset contains

  • Object X (4075 x 680): The spectra, with wavelengths of 1120-2478 nm and a 2-nm step.

  • Object Y (4075 x 4): Variable conc (protein concentration) and other meta-data.

## Preliminary loading of packages
using Jchemo       # if not loaded before
using JchemoData   # a library of various benchmark datasets
using JLD2         # a Julia data format 
using CairoMakie   # making graphics 
using FreqTables   # utilities for frequency tables
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
@names dat
(:X, :Y)
X = dat.X
@head X
... (4075, 680)
3×680 DataFrame
580 columns omitted
 Row │ 1120      1122      1124      ⋯
     │ Float64   Float64   Float64   ⋯
─────┼──────────────────────────────────
   1 │ 0.597482  0.595978  0.593623  ⋯
   2 │ 0.954192  0.953237  0.952002  ⋯
   3 │ 0.611137  0.609566  0.60743   ⋯
Y = dat.Y
@head Y
... (4075, 4)
3×4 DataFrame
 Row │ typ     label               conc     test
     │ String  String              Float64  Int64
─────┼─────────────────────────────────────────────
   1 │ FRG     wheat (ung)         12.74        0
   2 │ MPW     milk powder & whey  35.7212      0
   3 │ FRG     wheat (ung)         12.0         0
y = Y.conc        # variable to predict (protein concentration)
4075-element Vector{Float64}:
 12.7399998
 35.721199
 12.0
 13.8449764
 19.2999992
  7.6191959
 24.332531
 36.5717583
 55.9027786
 11.2599993
  ⋮
 65.1012344
 56.1860695
 50.8610802
  8.1399994
 13.3100004
 11.6800003
 18.2399998
 67.6700745
 25.0300007
wlst = names(X)     # wavelengths
wl = parse.(Float64, wlst)
680-element Vector{Float64}:
 1120.0
 1122.0
 1124.0
 1126.0
 1128.0
 1130.0
 1132.0
 1134.0
 1136.0
 1138.0
    ⋮
 2462.0
 2464.0
 2466.0
 2468.0
 2470.0
 2472.0
 2474.0
 2476.0
 2478.0
ntot, p = size(X)
(4075, 680)
freqtable(string.(Y.typ, " - ", Y.label))
10-element Named Vector{Int64}
Dim1                      │ 
──────────────────────────┼────
ANF - animal feed         │ 391
CLZ - rapeseed(ung)       │ 420
CNG - corn gluten         │ 395
EHH - grass silage        │ 422
FFS - full fat soya       │ 432
FRG - wheat (ung)         │ 411
MPW - milk powder & whey  │ 410
PEE - maize wp            │ 407
SFG - sun flower seed(gr) │ 281
TTS - soya meal           │ 506

The spectra (random selection of 30 observations) can be plotted by

plotsp(X, wl; size = (500, 300), nsamp = 30, xlabel = "Wavelength (nm)", ylabel = "Reflectance").f

3.2 Data preprocessing

Two preprocessing steps are implemented to remove possible non-informative physical effects in the spectra: a standard normal variate (SNV) transformation, followed by a 2nd-order Savitzky-Golay derivative

model1 = snv()
model2 = savgol(npoint = 21, deriv = 2, degree = 3)
model = pip(model1, model2)
fit!(model, X)
@head Xp = transf(model, X)
3×680 Matrix{Float64}:
 -0.00393533  -0.00441755  -0.00477681  …  0.000514478  0.000481081
 -0.00121436  -0.0013095   -0.00135921     0.000996321  0.000930884
 -0.00355712  -0.00399785  -0.00434838     0.00044084   0.000421751
... (4075, 680)

The resulting spectra indicate the presence of high heterogeneity in the data

plotsp(Xp, wl; size = (500, 300), nsamp = 30, xlabel = "Wavelength (nm)").f

This heterogeneity is confirmed by the highly clustered pattern observed in the PCA score space (related to the forage and feed types considered)
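
Such a score plot can be obtained, for instance, with the sketch below. It assumes that pcasvd returns the scores through transf and that plotxy accepts a grouping vector as third argument; the number of PCs (3) is arbitrary.

## PCA score plot sketch (pcasvd/plotxy usage assumed; nb. of PCs arbitrary)
model_pca = pcasvd(nlv = 3)
fit!(model_pca, Xp)
T = transf(model_pca, Xp)     # (4075 x 3) PCA scores of the preprocessed spectra
plotxy(T[:, 1], T[:, 2], Y.typ; xlabel = "PC1", ylabel = "PC2").f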

3.3 Split training vs. test sets

In this example, the split of the total data into training and test sets is already provided in the dataset (variable Y.test, given for the Paris prediction challenge), but other splits could be built a posteriori by ad hoc sampling, as sketched below.
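
For illustration, such an ad hoc split could be drawn with base Julia as in the sketch below (a random test set of the same size; the variable names are arbitrary). In the remainder of the note, the provided split (variable Y.test) is used.

## Sketch of a random train/test split (illustration only)
using Random
s_rand = sort(randperm(ntot)[1:374])    # indices of a random test set of 374 observations
## e.g.: Xtest_rand = Xp[s_rand, :] ; Xtrain_rand = rmrow(Xp, s_rand)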

freqtable(Y.test)
2-element Named Vector{Int64}
Dim1  │ 
──────┼─────
0     │ 3701
1     │  374

The final data are given by

s = Bool.(Y.test)  # same as: s = Y.test .== 1
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
ytrain = rmrow(y, s)
typtrain = rmrow(Y.typ, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
ytest = y[s]
typtest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = ntot, ntrain, ntest)
(ntot = 4075, ntrain = 3701, ntest = 374)

It is convenient to check that the test set is globally well represented by the training set, for instance by projecting the test observations onto a PCA score space built from the training observations.
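
A sketch of this check is given below: a PCA is fitted on the training spectra and the test spectra are projected onto the resulting score space (pcasvd/transf usage as assumed above; the number of PCs is arbitrary).

## Projection of the test set on the training PCA score space (sketch)
model_pca = pcasvd(nlv = 10)
fit!(model_pca, Xtrain)
Ttrain = transf(model_pca, Xtrain)   # training scores
Ttest = transf(model_pca, Xtest)     # test observations projected on the training score space
## the first two columns of Ttrain and Ttest can then be overlaid on a same score plot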

The plot of score distances (SD) vs. orthogonal distances (OD) is even more powerful, since it involves all the dimensions of the PCA model (rather than only the two or three plotted dimensions).

It is also useful to check the representativeness of the y-variable

summ(y, Y.test)
Class: 0
1×7 DataFrame
 Row │ variable  mean     std      min      max      n      nmissing
     │ Symbol    Float64  Float64  Float64  Float64  Int64  Int64
─────┼───────────────────────────────────────────────────────────────
   1 │ x1         31.894   20.297    3.061   76.604   3701         0


Class: 1
1×7 DataFrame
 Row │ variable  mean     std      min      max      n      nmissing
     │ Symbol    Float64  Float64  Float64  Float64  Int64  Int64
─────┼───────────────────────────────────────────────────────────────
   1 │ x1         32.288   20.874    2.766  75.8559    374         0

3.4 Model tuning

A usual approach to model tuning is to do a grid search (i.e. to evaluate the model performance over a grid of parameter combinations) using a sampling design that splits the training data into calibration and validation sets. The popular K-fold cross-validation (CV) is such a sampling design. Nevertheless, K-fold CV requires predicting all of the training observations (here ntrain = 3,701 obs.) for each parameter combination, which can be too time consuming for local PLSR and large datasets. A lighter approach is to split the training set into a calibration set and a validation set, and to run the grid search on the validation set only. This is what is done in this note, using the gridscore function.

Below, the validation set is selected by systematic sampling over the data but other sampling designs (e.g. random) could be chosen

nval = 300
s = sampsys(1:ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
ncal = ntrain - nval
(ntot = ntot, ntrain, ntest, ncal, nval)
(ntot = 4075, ntrain = 3701, ntest = 374, ncal = 3401, nval = 300)

Then the grid of parameters is built by

## Below, more extensive combinations could be considered (this is a simplification for the example)
nlvdis = [15]; metric = [:mah] 
h = [1; 2; 4; 6; Inf]
k = [200; 350; 500; 1000]  
nlv = 0:15 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)  # the grid
length(pars[1])  # nb. parameter combinations considered
20

The performance criterion considered is the RMSEP computed on the validation set

model = lwplsr()
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv, verbose = false)
@head res   # first rows of the result table
... (320, 6)
3×6 DataFrame
 Row │ nlvdis  metric  h    k    nlv    y1
     │ Any     Any     Any  Any  Int64  Float64
─────┼─────────────────────────────────────────
   1 │ 15      mah     1.0  200  0      2.02906
   2 │ 15      mah     1.0  200  1      1.51452
   3 │ 15      mah     1.0  200  2      1.11928

which can be plotted by

group = string.("nlvdis=", res.nlvdis, ", h=", res.h, ", k=", res.k)
plotgrid(res.nlv, res.y1, group; step = 1, xlabel = "Nb. LVs", ylabel = "RMSEP (Validation)").f

3.5 Final predictions

The final model can be defined by selecting the best parameter combination

u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
DataFrameRow (6 columns)
 Row │ nlvdis  metric  h    k    nlv    y1
     │ Any     Any     Any  Any  Int64  Float64
─────┼──────────────────────────────────────────
   9 │ 15      mah     1.0  200  8      0.675864

and the final prediction of the test set is given by

model = lwplsr(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@head pred
3×1 Matrix{Float64}:
 19.76328396932776
  6.200351184655297
 11.381457704579901
... (374, 1)

Summary of the predictions

mse(pred, ytest)
rmsep(pred, ytest)    # estimate of generalization error
1×1 Matrix{Float64}:
 0.7157501840557478
plotxy(pred, ytest; color = (:red, .5), bisect = true, xlabel = "Predictions", 
    ylabel = "Observed test data", title = "Protein concentration (%)").f

4. Related references

  • Andersson M. A comparison of nine PLS1 algorithms. J Chemom. 2009;23(10):518-529. doi:10.1002/cem.1248.

  • Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American statistical association, 74(368), 829-836. DOI: 10.1080/01621459.1979.10481038

  • Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association, 83(403), 596-610. DOI:10.1080/01621459.1988.10478639

  • Davrieux F, Dufour D, Dardenne P, et al. LOCAL regression algorithm improves near infrared spectroscopy predictions when the target constituent evolves in breeding populations. J Near Infrared Spectrosc. 2016;24(2):109. doi:10.1255/jnirs.1213


  • Dayal BS, MacGregor JF. Improved PLS algorithms. J Chemom. 1997;11(1):73-85. doi:10.1002/(SICI)1099-128X(199701)11:1<73::AID-CEM435>3.0.CO;2-#.

  • Höskuldsson A. PLS regression methods. J Chemom. 1988;2(3):211-228. doi:10.1002/cem.1180020306.

  • Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011;421(2):269-274. doi:10.1016/j.ijpharm.2011.10.007

  • Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics n/a, e3209. https://doi.org/10.1002/cem.3209

  • Lesnoff, M., 2024. Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data. Chemometrics and Intelligent Laboratory Systems 244, 105031. https://doi.org/10.1016/j.chemolab.2023.105031

  • Manne R. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemom Intell Lab Syst. 1987;2(1-3):187-197. doi:10.1016/0169-7439(87)80096-5.

  • Schaal, S., Atkeson, C.G., Vijayakumar, S., 2002. Scalable Techniques from Nonparametric Statistics for Real Time Robot Learning. Applied Intelligence 17, 49–60. https://doi.org/10.1023/A:1015727715131

  • Shenk J, Westerhaus M, Berzaghi P. Investigation of a LOCAL calibration procedure for near infrared instruments. J Near Infrared Spectrosc. 1997;5(1):223. doi:10.1255/jnirs.115

  • Sicard E, Sabatier R. Theoretical framework for local PLS1 regression, and application to a rainfall data set. Comput Stat Data Anal. 2006;51(2):1393-1410. doi:10.1016/j.csda.2006.05.002.

  • Wold H. Nonlinear iterative partial least squares (NIPALS) modeling: some current developments. In: Krishnaiah PR, ed. Multivariate Analysis II. Proceedings of the symposium held at Wright State University, Dayton, Ohio, June 19-24, 1972. New York: Academic Press; 1973:383-407.

  • Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001;58(2):109-130. doi:10.1016/S0169-7439(01)00155-1.

  • Yoshizaki R, Kano M, Tanabe S, Miyano T. Process Parameter Optimization based on LW-PLS in Pharmaceutical Granulation Process. IFAC-PapersOnLine. 2015;48(8):303-308. doi:10.1016/j.ifacol.2015.08.198.

  • Zhang X, Kano M, Li Y. Locally weighted kernel partial least squares regression based on sparse nonlinear features for virtual sensing of nonlinear time-varying processes. Comput Chem Eng. 2017;104:164-171. doi:10.1016/j.compchemeng.2017.04.014.