Index of functions

Here is a list of all exported functions from Jchemo.jl.

For more details, click on the link and you'll be directed to the function help.

Base.summaryMethod
summary(object::Cca, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Ccawold, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Comdim, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Fda)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Kpca)

Summarize the fitted model.

  • object : The fitted model.
source
Base.summaryMethod
summary(object::Mbpca, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Mbplsr, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Mbplswest, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Pca, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Pcr, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Plscan, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Plstuck, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Rasvd, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Spca, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Union{Plsr, Splsr}, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
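
Examples

A minimal sketch of calling this summary method on a fitted PLSR model, using simulated data. It assumes the model-wrapper syntax used in the other examples of this page (e.g. summary(mod, X), as for cca or comdim), so it only illustrates the call, not a real analysis.

X = rand(30, 10)
Y = rand(30, 2)
mod = model(plskern; nlv = 3)
fit!(mod, X, Y)
res = summary(mod, X)   # summary of the Plsr model fitted above
pnames(res)             # inspect the objects returned by the summary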
source
Jchemo.aggstatMethod
aggstat(X, y; fun = mean)
aggstat(X::DataFrame; vars, groups, fun = mean)

Compute column-wise statistics by class in a dataset.

  • X : Data (n, p).
  • y : A categorical variable (n) (class membership).
  • fun : Function to compute (default = mean).

Specific for dataframes:

  • vars : Vector of the names of the variables to summarize.
  • groups : Vector of the names of the categorical variables to consider for computations by class.

Variables defined in vars and groups must be columns of X.

Return a matrix or, if only argument X::DataFrame is used, a dataframe.

Examples

using DataFrames, Statistics

n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, :auto)
y = rand(1:3, n)
res = aggstat(X, y; fun = sum)
res.X
aggstat(df, y; fun = sum).X

n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, string.("v", 1:p))
df.gr1 = rand(1:2, n)
df.gr2 = rand(["a", "b", "c"], n)
df
aggstat(df; vars = [:v1, :v2], groups = [:gr1, :gr2], fun = var)
source
Jchemo.aggsumMethod
aggsum(x::Vector, y::Vector)

Compute sub-total sums by class of a categorical variable.

  • x : A quantitative variable to sum (n).
  • y : A categorical variable (n) (class membership).

Return a vector.

Examples

x = rand(1000)
y = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
aggsum(x, y)
source
Jchemo.aicplsrMethod
aicplsr(X, y; alpha = 2, kwargs...)

Compute Akaike's (AIC) and Mallows's (Cp) criteria for univariate PLSR models.

  • X : X-data (n, p).
  • y : Univariate Y-data.

Keyword arguments:

  • Same arguments as those of function cglsr.
  • alpha : Coefficient multiplying the model complexity (df) to compute AIC.

The function uses function dfplsr_cg.

References

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 40
res = aicplsr(X, y; nlv) ;
res.crit
res.opt
res.delta

zaic = res.crit.aic
f, ax = plotgrid(0:nlv, zaic; xlabel = "Nb. LVs", ylabel = "AIC")
scatter!(ax, 0:nlv, zaic)
f
source
Jchemo.aov1Method
aov1(x, Y)

One-factor ANOVA test.

  • x : Univariate categorical (factor) data (n).
  • Y : Y-data (n, q).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
x = dat.X[:, 5]
Y = dat.X[:, 1:4]
tab(x) 

res = aov1(x, Y) ;
pnames(res)
res.SSF
res.SSR 
res.F 
res.pval
source
Jchemo.biasMethod
bias(pred, Y)

Compute the prediction bias, i.e. the opposite of the mean prediction error.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
bias(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
bias(pred, ytest)
source
Jchemo.blockscalMethod
blockscal(Xbl; kwargs...)
blockscal(Xbl, weights::Weight; kwargs...)

Scale multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • weights : Weights (n) of the observations (rows of the blocks). Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • bscal : Type of block scaling. Possible values are: :none, :frob, :mfa, :ncol, :sd. See thereafter.
  • centr : Boolean. If true, each column of blocks in Xbl is centered (before the block scaling).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

Types of block scaling:

  • :none : No block scaling.
  • :frob : Let D be the diagonal matrix of vector weights.w. Each block X is divided by its Frobenius norm = sqrt(tr(X' * D * X)). After this scaling, tr(X' * D * X) = 1.
  • :mfa : Each block X is divided by sv, where sv is the dominant singular value of X (this is the "MFA" approach).
  • :ncol : Each block X is divided by the nb. of columns of the block.
  • :sd : Each block X is divided by sqrt(sum(weighted variances of the block-columns)). After this scaling, sum(weighted variances of the block-columns) = 1.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 
Xblnew = mblock(Xnew, listbl) 
@head Xbl[3]

centr = true ; scal = true
bscal = :frob
mod = model(blockscal; centr, scal, bscal)
fit!(mod, Xbl)
zXbl = transf(mod, Xbl) ; 
@head zXbl[3]

zXblnew = transf(mod, Xblnew) ; 
zXblnew[3]
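
A quick numerical check of the :frob property described above, continuing the example. This is a sketch that assumes uniform observation weights (the default when no weights are passed to fit!).

using LinearAlgebra
D = Diagonal(mweight(ones(n)).w)   # uniform row weights (assumed default)
tr(zXbl[3]' * D * zXbl[3])         # expected ≈ 1 after the :frob block scaling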
source
Jchemo.caldsMethod
calds(X1, X2; kwargs...)

Direct standardization (DS) for calibration transfer of spectral data.

  • X1 : Spectra (n, p) to transfer to the target.
  • X2 : Target spectra (n, p).

Keyword arguments:

  • fun : Function used as transfer model.
  • Other optional arguments for function fun.

X1 and X2 must represent the same n samples ("standards").

The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method DS fits a model (defined in fun) that predicts X2 from X1.

References

Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected 
## on the same samples. 
## X2 represents the target space. 
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val

## Fitting the model
mod = model(calds; fun = plskern, nlv = 10) 
#mod = model(calds; fun = mlrpinv)   # less robust
fit!(mod, X1cal, X2cal)

## Transfer of new spectra X1val 
## expected to be close to X2val
pred = predict(mod, X1val).pred

i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
source
Jchemo.calpdsMethod
calpds(X1, X2; npoint = 5, fun = plskern, kwargs...)

Piecewise direct standardization (PDS) for calibration transfer of spectral data.

  • X1 : Spectra (n, p) to transfer to the target.
  • X2 : Target spectra (n, p).

Keyword arguments:

  • npoint : Half-window size (nb. of points left or right of the given wavelength).
  • fun : Function used as transfer model.
  • kwargs : Optional arguments for fun.

X1 and X2 must represent the same n standard samples.

The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method PDS fits models (defined in fun) that predict X2 from X1.

The window used in X1 to predict wavelength "i" in X2 is:

  • i - npoint, i - npoint + 1, ..., i, ..., i + npoint - 1, i + npoint
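
For instance, with npoint = 2, the window used to predict wavelength i = 5 of X2 covers the following columns of X1 (a direct transcription of the rule above):

npoint = 2
i = 5
collect(i - npoint:i + npoint)   # [3, 4, 5, 6, 7]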

References

Bouveresse, E., Massart, D.L., 1996. Improvement of the piecewise direct standardisation procedure for the transfer of NIR spectra for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 32, 201–213. https://doi.org/10.1016/0169-7439(95)00074-7

Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.

Wülfert, F., Kok, W.Th., Noord, O.E. de, Smilde, A.K., 2000. Correction of Temperature-Induced Spectral Variation by Continuous Piecewise Direct Standardization. Anal. Chem. 72, 1639–1644. https://doi.org/10.1021/ac9906835

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected 
## on the same samples. 
## X2 represents the target space. 
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val

## Fitting the model
mod = model(calpds; npoint = 2, fun = plskern, nlv = 2) 
fit!(mod, X1cal, X2cal)

## Transfer of new spectra X1val 
## expected to be close to X2val
pred = predict(mod, X1val).pred

i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
source
Jchemo.ccaMethod
cca(X, Y; kwargs...)
cca(X, Y, weights::Weight; kwargs...)
cca!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical correlation analysis (CCA, RCCA).

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function implements a CCA algorithm using SVD decompositions, as presented in Weenink 2003 section 2.

A continuum regularization is available (parameter tau). After block centering and scaling, the function returns block scores (Tx and Ty) that are proportional to the eigenvectors of Projx * Projy and Projy * Projx, respectively, defined as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix
  • Cy = (1 - tau) * Y'DY + tau * Iy
  • Cxy = X'DY
  • Projx = sqrt(D) * X * invCx * X' * sqrt(D)
  • Projy = sqrt(D) * Y * invCy * Y' * sqrt(D)

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use a small epsilon value (e.g. tau = 1e-8), which gives results similar to those obtained with pseudo-inverses.

The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rcc of the R packages CCA (González et al.) and mixOmics (Lê Cao et al.) with their parameters lambda1 and lambda2 set to:

  • lambda1 = lambda2 = tau / (1 - tau) * n / (n - 1)
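
For illustration, this conversion can be computed directly (a sketch that only transcribes the formula above; no Jchemo function is involved):

n = 20                                     # nb. of observations (illustration)
tau = 1e-8
lambda1 = tau / (1 - tau) * n / (n - 1)    # value to pass as lambda1 = lambda2 in rcc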

References

González, I., Déjean, S., Martin, P.G.P., Baccini, A., 2008. CCA: An R Package to Extend Canonical Correlation Analysis. Journal of Statistical Software 23, 1-14. https://doi.org/10.18637/jss.v023.i12

Hotelling, H. (1936): “Relations between two sets of variates”, Biometrika 28: pp. 321–377.

Lê Cao, K.-A., Rohart, F., Gonzalez, I., Dejean, S., Abadi, A.J., Gautier, B., Bartolo, F., Monget, P., Coquery, J., Yao, F., Liquet, B., 2022. mixOmics: Omics Data Integration Project. https://doi.org/10.18129/B9.bioc.mixOmics

Weenink, D. 2003. Canonical Correlation Analysis, Institute of Phonetic Sciences, Univ. of Amsterdam, Proceedings 25, 81-99.

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 3
bscal = :frob ; tau = 1e-8
mod = model(cca; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.ccawoldMethod
ccawold(X, Y; kwargs...)
ccawold(X, Y, weights::Weight; kwargs...)
ccawold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical correlation analysis (CCA, RCCA) - Wold Nipals algorithm.

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function implements the Nipals ccawold algorithm presented by Tenenhaus 1998 p.204 (related to Wold et al. 1984).

In this implementation, after each step of LV computation, X and Y are deflated relative to their respective scores (tx and ty).

A continuum regularization is available (parameter tau). After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix
  • Cy = (1 - tau) * Y'DY + tau * Iy

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use a small epsilon value (e.g. tau = 1e-8), which gives results similar to those obtained with pseudo-inverses.

The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rgcca of the R package RGCCA (Tenenhaus & Guillemot 2017, Tenenhaus et al. 2017).

References

Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Analysis. https://cran.r-project.org/web/packages/RGCCA/index.html

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Tenenhaus, M., Tenenhaus, A., Groenen, P.J.F., 2017. Regularized Generalized Canonical Correlation Analysis: A Framework for Sequential Multiblock Component Methods. Psychometrika 82, 737–777. https://doi.org/10.1007/s11336-017-9573-x

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(ccawold; nlv, bscal, tau, tol = 1e-10)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.centerMethod
center(X)
center(X, weights::Weight)

Column-wise centering of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(center) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
@head Xptest 
@head Xtest .- colmean(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.cglsrMethod
cglsr(X, y; kwargs...)
cglsr!(X::Matrix, y::Matrix; kwargs...)

Conjugate gradient algorithm for the normal equations (CGLS; Björck 1996).

  • X : X-data (n, p).
  • y : Univariate Y-data (n).

Keyword arguments:

  • nlv : Nb. CG iterations.
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the normal equation residual vectors is done.
  • filt : Boolean. If true, CG filter factors are computed (output F). Default = false.
  • scal : Boolean. If true, each column of X and y is scaled by its uncorrected standard deviation (default = false).

X and y are internally centered.

CGLS algorithm "7.4.1" Bjorck 1996, p.289. The part of the code computing the re-orthogonalization (Hansen 1998) and filter factors (Vogel 1987, Hansen 1998) is a transcription (with few adaptations) of the Matlab function cgls (Saunders et al. https://web.stanford.edu/group/SOL/software/cgls/; Hansen 2008).

References

Björck, A., 1996. Numerical Methods for Least Squares Problems, Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971484

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Manne R. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics Intell. Lab. Syst. 1987, 2: 187–197.

Phatak A, De Hoog F. Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. J. Chemometrics 2002; 16: 361–367.

Vogel, C. R., "Solving ill-conditioned linear systems using the conjugate gradient method", Report, Dept. of Mathematical Sciences, Montana State University, 1987.

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 5 ; scal = true
mod = model(cglsr; nlv, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod.fm) 
@head mod.fm.B
coef(mod.fm).B
coef(mod.fm).int

pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f   
source
Jchemo.coefMethod
coef(object::Cglsr)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
source
Jchemo.coefMethod
coef(object::Dkplsr; nlv = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.coefMethod
coef(object::Kplsr; nlv = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.coefMethod
coef(object::Krr; lb = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • lb : Ridge regularization parameter "lambda".
source
Jchemo.coefMethod
coef(object::Rosaplsr; nlv = nothing)

Compute the X b-coefficients of a model fitted with nlv LVs.

  • object : The fitted model.
  • nlv : Nb. LVs to consider.
source
Jchemo.coefMethod
coef(object::Rr; lb = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • lb : Ridge regularization parameter "lambda".
source
Jchemo.coefMethod
coef(object::Mlr)

Compute the coefficients of the fitted model.

  • object : The fitted model.
source
Jchemo.coefMethod
coef(object::Union{Plsr, Pcr, Splsr}; nlv = nothing)

Compute the b-coefficients of a LV model.

  • object : The fitted model.
  • nlv : Nb. LVs to consider.

For a model fitted from X(n, p) and Y(n, q), the returned object B is a matrix (p, q). If nlv = 0, B is a matrix of zeros. The returned object int is the intercept.
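
Examples

A minimal sketch on simulated data, assuming the model-wrapper syntax used elsewhere on this page (coef can be called on the model, as in the dkplsr example):

X = rand(20, 5)
Y = rand(20, 2)
mod = model(plskern; nlv = 3)
fit!(mod, X, Y)
res = coef(mod)         # b-coefficients of the model fitted with 3 LVs
size(res.B)             # (p, q) = (5, 2)
res.int                 # intercept
coef(mod; nlv = 0).B    # matrix of zeros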

source
Jchemo.colmadMethod
colmad(X)

Compute column-wise median absolute deviations (MAD) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)

colmad(X)
source
Jchemo.colmeanMethod
colmean(X)
colmean(X, weights::Weight)

Compute column-wise means of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colmean(X)
colmean(X, w)
source
Jchemo.colmedMethod
colmed(X)

Compute column-wise medians of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)

colmed(X)
source
Jchemo.colnormMethod
colnorm(X)
colnorm(X, weights::Weight)

Compute column-wise norms of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

The norm computed for a column x of X is:

  • sqrt(x' * x)

The weighted norm is:

  • sqrt(x' * D * x), where D is the diagonal matrix of weights.w.

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colnorm(X)
colnorm(X, w)
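
A check of the weighted norm defined above, continuing the example (a sketch that uses only the formula sqrt(x' * D * x), with D built from weights.w):

using LinearAlgebra
j = 1
x = X[:, j]
sqrt(x' * Diagonal(w.w) * x)   # expected to match colnorm(X, w)[j]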
source
Jchemo.colstdMethod
colstd(X)
colstd(X, weights::Weight)

Compute column-wise standard deviations (uncorrected) of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colstd(X)
colstd(X, w)
source
Jchemo.colsumMethod
colsum(X)
colsum(X, weights::Weight)

Compute column-wise sums of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colsum(X)
colsum(X, w)
source
Jchemo.colvarMethod
colvar(X)
colvar(X, weights::Weight)

Compute column-wise variances (uncorrected) of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colvar(X)
colvar(X, w)
source
Jchemo.comdimMethod
comdim(Xbl; kwargs...)
comdim(Xbl, weights::Weight; kwargs...)
comdim!(Xbl::Matrix, weights::Weight; kwargs...)

Common components and specific weights analysis (ComDim, aka CCSWA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

"SVD" algorithm of Hannafi & Qannari 2008 p.84.

The function returns several objects, in particular:

  • T : The non normed global scores.
  • U : The normed global scores.
  • W : The global loadings.
  • Tbl : The block scores (grouped by blocks, in the original scale).
  • Tb : The block scores (grouped by LV, in the metric scale).
  • Wbl : The block loadings.
  • lb : The specific weights (saliences) "lambda".
  • mu : The sum of the squared saliences.

Function summary returns:

  • explvarx : Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
  • explvarxx : Proportion of the XX' total inertia (sum of the squared norms of the products Xk * Xk') explained by each global score (= indicator "V" in Qannari et al. 2000, Hanafi et al. 2008).
  • sal2 : Proportion of the squared saliences of each block within each global score.
  • contr_block : Contribution of each block to the global scores (= proportions of the saliences "lambda" within each score).
  • explX : Proportion of the inertia of the blocks explained by each global score.
  • corx2t : Correlation between the global scores and the original variables.
  • cortb2t : Correlation between the global scores and the block scores.
  • rv : RV coefficient.
  • lg : Lg coefficient.

References

Cariou, V., Qannari, E.M., Rutledge, D.N., Vigneau, E., 2018. ComDim: From multiblock data analysis to path modeling. Food Quality and Preference, Sensometrics 2016: Sensometrics-by-the-Sea 67, 27–34. https://doi.org/10.1016/j.foodqual.2017.02.012

Cariou, V., Jouan-Rimbaud Bouveresse, D., Qannari, E.M., Rutledge, D.N., 2019. Chapter 7 - ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective, in: Cocchi, M. (Ed.), Data Handling in Science and Technology, Data Fusion Methodology and Applications. Elsevier, pp. 179–204. https://doi.org/10.1016/B978-0-444-63984-4.00007-7

Ghaziri, A.E., Cariou, V., Rutledge, D.N., Qannari, E.M., 2016. Analysis of multiblock datasets using ComDim: Overview and extension to the analysis of (K + 1) datasets. Journal of Chemometrics 30, 420–429. https://doi.org/10.1002/cem.2810

Hanafi, M., 2008. Nouvelles propriétés de l’analyse en composantes communes et poids spécifiques. Journal de la société française de statistique 149, 75–97.

Qannari, E.M., Wakeling, I., Courcoux, P., MacFie, H.J.H., 2000. Defining the underlying sensory dimensions. Food Quality and Preference 11, 151–154. https://doi.org/10.1016/S0950-3293(99)00069-5

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1]) 

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(comdim; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod) 
pnames(mod.fm)
## Global scores 
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]

res = summary(mod, Xbl) ;
pnames(res) 
res.explvarx
res.explvarxx
res.sal2 
res.contr_block
res.explX   # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t 
res.cortb2t
res.rv
source
Jchemo.confMethod
conf(pred, y; digits = 1)

Confusion matrix.

  • pred : Univariate predictions.
  • y : Univariate observed data.

Keyword arguments:

  • digits : Nb. digits used to round percentages.

Examples

using CairoMakie

y = ["d"; "c"; "b"; "c"; "a"; "d"; "b"; "d"; 
    "b"; "b"; "a"; "a"; "c"; "d"; "d"]
pred = ["a"; "d"; "b"; "d"; "b"; "d"; "b"; "d"; 
    "b"; "b"; "a"; "a"; "d"; "d"; "d"]
#y = rand(1:10, 200); pred = rand(1:10, 200)

res = conf(pred, y) ;
pnames(res)
res.cnt       # Counts (dataframe built from `A`) 
res.pct       # Row %  (dataframe built from `Apct`))
res.A         
res.Apct
res.diagpct
res.accpct    # Accuracy (% classification successes)
res.lev       # Levels

plotconf(res).f

plotconf(res; cnt = false, ptext = false).f
source
Jchemo.cor2Method
cor2(pred, Y)

Compute the squared linear correlation between data and predictions.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, ytest)
source
Jchemo.cormMethod
corm(X, weights::Weight)
corm(X, Y, weights::Weight)

Compute a weighted correlation matrix.

  • X : Data (n, p).
  • Y : Data (n, q).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

Uncorrected correlation matrix

  • of X-columns : ==> (p, p) matrix
  • or between X-columns and Y-columns : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))

corm(X, w)
corm(X, Y, w)
source
Jchemo.cosmMethod
cosm(X)
cosm(X, Y)

Compute a cosine matrix.

  • X : Data (n, p).
  • Y : Data (n, q).

The function computes the cosine matrix:

  • of the columns of X: ==> (p, p) matrix
  • or between columns of X and Y : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)

cosm(X)
cosm(X, Y)
source
Jchemo.cosvMethod
cosv(x, y)

Compute the cosine between two vectors.

  • x : vector (n).
  • y : vector (n).

Examples

n = 5
x = rand(n)
y = rand(n)

cosv(x, y)
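
For reference, the returned value is expected to match the usual cosine formula (a check, assuming cosv implements the standard definition):

using LinearAlgebra
dot(x, y) / (norm(x) * norm(y))   # expected to match cosv(x, y)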
source
Jchemo.covmMethod
covm(X, weights::Weight)
covm(X, Y, weights::Weight)

Compute a weighted covariance matrix.

  • X : Data (n, p).
  • Y : Data (n, q).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

The function computes the uncorrected weighted covariance matrix:

  • of the columns of X: ==> (p, p) matrix
  • or between columns of X and Y : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))

covm(X, w)
covm(X, Y, w)
source
Jchemo.cscaleMethod
cscale()
cscale(X)
cscale(X, weights::Weight)

Column-wise centering and scaling of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))

db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(cscale) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
colstd(Xptrain)
@head Xptest 
@head (Xtest .- colmean(Xtrain)') ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.detrendMethod
detrend(X; kwargs...)

De-trend transformation of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • degree : Polynom degree.

The function fits a polynomial regression to each observation and returns the residuals.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(detrend; degree = 2)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain, wl).f
plotsp(Xptest, wl).f
source
Jchemo.dfplsr_cgMethod
dfplsr_cg(X, y; kwargs...)

Compute the model complexity (df) of PLSR models with the CGLS algorithm.

  • X : X-data (n, p).
  • y : Univariate Y-data.

Keyword arguments:

  • Same as function cglsr.

The number of degrees of freedom (df) of the PLSR model is returned for 0, 1, ..., nlv LVs.

References

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369

Examples

## The example below reproduces the numerical illustration
## given by Kramer & Sugiyama 2011 on the Ozone data 
## (Fig. 1, center).
## Function "pls.model" used for df calculations
## in the R package "plsdof" v0.2-9 (Kramer & Braun 2019)
## automatically scales the X matrix before PLS.
## The example scales X for consistency with plsdof.

using JchemoData, JLD2, DataFrames, CairoMakie 
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ozone.jld2") 
@load db dat
pnames(dat)
X = dat.X
dropmissing!(X) 
zX = rmcol(Matrix(X), 4) 
y = X[:, 4] 
## For consistency with plsdof
xstds = colstd(zX)
zXs = fscale(zX, xstds)
## End

nlv = 12 ; gs = true
res = dfplsr_cg(zXs, y; nlv, gs) ;
res.df 
df_kramer = [1.000000, 3.712373, 6.456417, 11.633565, 
    12.156760, 11.715101, 12.349716,
    12.192682, 13.000000, 13.000000, 
    13.000000, 13.000000, 13.000000]
f, ax = plotgrid(0:nlv, df_kramer; step = 2, xlabel = "Nb. LVs", ylabel = "df")
scatter!(ax, 0:nlv, res.df; color = "red")
ablines!(ax, 1, 1; color = :grey, linestyle = :dot)
f
source
Jchemo.difmeanMethod
difmean(X1, X2; normx::Bool = false)

Compute a 1-D detrimental matrix as the difference of the column means of two X-datasets.

  • X1 : Spectra (n1, p).
  • X2 : Spectra (n2, p).

Keyword arguments:

  • normx : Boolean. If true, the column-mean vectors of X1 and X2 are normed before computing their difference.

The function returns a matrix D (1, p) computed by the difference between two mean-spectra, i.e. the column-means of X1 and X2.

D is assumed to contain the detrimental information that can be removed (by orthogonalization) from X1 and X2 for calibration transfer. For instance, D can be used as input of function eposvd.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val

## The objective is to remove a detrimental 
## information (here, D) from spaces X1 and X2
D = difmean(X1cal, X2cal).D
res = eposvd(D; nlv = 1)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M

i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
source
Jchemo.dkplskdedaMethod
dkplskdeda(X, y; kwargs...)
dkplskdeda(X, y, weights::Weight; kwargs...)

DKPLS-KDEDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDEDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function dkplslda for examples.

source
Jchemo.dkplsldaMethod
dkplslda(X, y; kwargs...)
dkplslda(X, y, weights::Weight; kwargs...)

DKPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
gamma = .1
mod = model(dkplslda; nlv, gamma) 
#mod = model(dkplslda; nlv, gamma, prior = :prop) 
#mod = model(dkplsqda; nlv, gamma, alpha = .5) 
#mod = model(dkplskdeda; nlv, gamma, a_kde = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.dkplsqdaMethod
dkplsqda(X, y; kwargs...)
dkplsqda(X, y, weights::Weight; kwargs...)

DKPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function dkplslda for examples.

source
Jchemo.dkplsrMethod
dkplsr(X, Y; kwargs...)
dkplsr(X, Y, weights::Weight; kwargs...)
dkplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Direct kernel partial least squares regression (DKPLSR) (Bennett & Embrechts 2003).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

The method builds kernel Gram matrices and then runs a usual PLSR algorithm on them. This is faster than (but not equivalent to) the "true" Nipals KPLSR algorithm (function kplsr) described in Rosipal & Trejo (2001).

References

Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.

Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
kern = :krbf ; gamma = 1e-1 ; scal = false
#gamma = 1e-4 ; scal = true
mod = model(dkplsr; nlv, kern, gamma, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f  

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
nlv = 2
gamma = 1 / 3
mod = model(dkplsr; nlv, gamma) ;
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.dkplsrdaMethod
dkplsrda(X, y; kwargs...)
dkplsrda(X, y, weights::Weight; kwargs...)

Discrimination based on direct kernel partial least squares regression (DKPLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
kern = :krbf ; gamma = .001 
scal = true
mod = model(dkplsrda; nlv, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.dmkernMethod
dmkern(X; kwargs...)

Gaussian kernel density estimation (KDE).

  • X : X-data (n, p).

Keyword arguments:

  • h_kde : Define the bandwidth, see examples.
  • a_kde : Constant for Scott's rule (default bandwidth), see thereafter.

Estimation of the probability density of X (column space) by non parametric Gaussian kernels.

Data X can be univariate (p = 1) or multivariate (p > 1). In the latter case, function dmkern computes a multiplicative kernel such as in Scott & Sain 2005 Eq.19, and the internal bandwidth matrix H is diagonal (see the code).

Note: H in the dmkern code is often noted "H^(1/2)" in the literature (e.g. Wikipedia).

The default bandwidth is computed by:

  • h_kde = a_kde * n^(-1 / (p + 4)) * colstd(X)

(a_kde = 1 in Scott & Sain 2005).
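
The default bandwidth can thus be reproduced directly from the formula (a sketch on simulated data that only transcribes the expression above, with a_kde = 1):

n, p = 150, 4
X = rand(n, p)
a_kde = 1
h_kde = a_kde * n^(-1 / (p + 4)) .* colstd(X)   # one bandwidth per column of X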

References

Scott, D.W., Sain, S.R., 2005. 9 - Multidimensional Density Estimation, in: Rao, C.R., Wegman, E.J., Solka, J.L. (Eds.), Handbook of Statistics, Data Mining and Data Visualization. Elsevier, pp. 229–261. https://doi.org/10.1016/S0169-7161(04)24009-3

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
p = nco(T)

#### Probability density in the FDA 
#### score space (2D)

mod = model(dmkern)
fit!(mod, T) 
pnames(mod.fm)
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

h_kde = .3
mod = model(dmkern; h_kde)
fit!(mod, T) 
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

h_kde = [.3; .1]
mod = model(dmkern; h_kde)
fit!(mod, T) 
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

## Bivariate distribution
npoints = 2^7
nlv = 2
lims = [(minimum(T[:, j]), maximum(T[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
m = nro(grid)
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, T) 

res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1];  title = "Density for FDA scores (Iris)", xlabel = "Score 1", 
    ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
#xlims!(ax, -15, 15) ;ylims!(ax, -15, 15)
f

## Univariate distribution
x = T[:, 1]
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, x) 
pred = predict(mod, x).pred 
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
scatter!(ax, x, vec(pred); color = :red)
f

x = T[:, 1]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, x) 
pred_grid = predict(mod, grid).pred 
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
source
Jchemo.dmnormFunction
dmnorm(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnorm!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)

Normal probability density estimation.

  • X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.

Keyword arguments:

  • mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
  • S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
  • simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.

Data X can be univariate (p = 1) or multivariate (p > 1). See examples.

When simpl = true, the determinant of the covariance matrix (object detS) and the constant (2 * pi)^(-p / 2) (object cst) in the density formula are set to 1. The function then returns a pseudo density that reduces to exp(-d / 2), where d is the squared Mahalanobis distance to the center. This can for instance be useful when the number of columns (p) of X becomes too large and when, consequently:

  • detS tends to 0 or, conversely, to infinity
  • cst tends to 0

which makes it impossible to compute the true density.

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
n, p = size(T)

#### Probability density in the FDA score space (2D)
#### Example of class Setosa 
s = y .== "setosa"
zT = T[s, :]

## Bivariate distribution
mod = model(dmnorm)
fit!(mod, zT)
fm = mod.fm
pnames(fm)
fm.Uinv 
fm.detS
pred = predict(mod, zT).pred
@head pred

mu = colmean(zT)
S = covm(zT, mweight(ones(nro(zT))))
## Direct syntax
dmnorm(; mu = mu, S = S).Uinv
dmnorm(; mu = mu, S = S).detS

npoints = 2^7
lims = [(minimum(zT[:, j]), maximum(zT[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
mod = model(dmnorm)
fit!(mod, zT)
res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1];  title = "Density for FDA scores (Iris - Setosa)", 
    xlabel = "Score 1", ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
scatter!(ax, zT[:, 1], zT[:, 2], color = :blue, markersize = 5)
#xlims!(ax, -12, 12) ;ylims!(ax, -12, 12)
f

## Univariate distribution
j = 1
x = zT[:, j]
mod = model(dmnorm)
fit!(mod, x)
pred = predict(mod, x).pred 
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
scatter!(ax, x, vec(pred); color = :red)
f

x = zT[:, j]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmnorm)
fit!(mod, x)
pred_grid = predict(mod, grid).pred 
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
source
Jchemo.dmnormlogFunction
dmnormlog(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnormlog!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)

Logarithm of the normal probability density estimation.

  • X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.

Keyword arguments:

  • mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
  • S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
  • simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.

See the help of function dmnorm.

Examples

using JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

## Example of class Setosa 
s = y .== "setosa"
zX = X[s, :]

mod = model(dmnormlog)
fit!(mod, zX)
fm = mod.fm
pnames(fm)
fm.Uinv 
fm.logdetS
pred = predict(mod, zX).pred
@head pred 

mod0 = model(dmnorm)
fit!(mod0, zX)
pred0 = predict(mod0, zX).pred
@head log.(pred0)
source
Jchemo.dummyFunction
dummy(y, T = Float64)

Compute dummy table from a categorical variable.

  • y : A categorical variable.
  • T : Type of the output dummy table Y.

Examples

y = ["d", "a", "b", "c", "b", "c"]
#y =  rand(1:3, 7)
res = dummy(y)
pnames(res)
res.Y
source
Jchemo.duplMethod
dupl(X; digits = 3)

Find duplicated rows in a dataset.

  • X : A dataset.
  • digits : Nb. digits used to round X before checking.

Examples

X = rand(5, 3)
Z = vcat(X, X[1:3, :], X[1:1, :])
dupl(X)
dupl(Z)

M = hcat(X, fill(missing, 5))
Z = vcat(M, M[1:3, :])
dupl(M)
dupl(Z)
source
Jchemo.eposvdMethod
eposvd(D; nlv = 1)

Compute an orthogonalization matrix for calibration transfer of spectral data.

  • D : Data (m, p) containing the detrimental information on which spectra (rows of a matrix X) have to be orthogonalized.

Keyword arguments:

  • nlv : Nb. of first loadings vectors of D considered for the orthogonalization.

The objective is to remove some detrimental information (e.g. humidity patterns in signals, multiple spectrometers, etc.) from an X-dataset (n, p). The detrimental information is defined by the main row-directions computed from a matrix D (m, p).

Function eposvd returns two objects:

  • P (p, nlv) : The matrix of the nlv first loading vectors of the SVD decomposition (non centered PCA) of D.
  • M (p, p) : The orthogonalization matrix, used to orthogonalize a given matrix X to the directions contained in P.

Any matrix X can then be corrected from D by:

  • X_corrected = X * M.

Matrix D can be built from many methods. For instance, two common methods are:

  • EPO (Roger et al. 2003, 2018): D is built from a set of differences between spectra collected under different conditions.
  • TOP (Andrew & Fearn 2004): Each row of D is the mean spectrum computed for a given spectrometer instrument.

A particular situation is the following. Assume that D is built from some differences between matrices X1 and X2, and that a bilinear model (e.g. PLSR) is fitted on the data {X1corrected, Y} where X1corrected = X1 * M. To predict new data X2new with the fitted model, there is no need to correct X2new.

References

Andrew, A., Fearn, T., 2004. Transfer by orthogonal projection: making near-infrared calibrations robust to between-instrument variation. Chemometrics and Intelligent Laboratory Systems 72, 51–56. https://doi.org/10.1016/j.chemolab.2004.02.004

Roger, J.-M., Chauchard, F., Bellon-Maurel, V., 2003. EPO-PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits. Chemometrics and Intelligent Laboratory Systems 66, 191-204. https://doi.org/10.1016/S0169-7439(03)00051-0

Roger, J.-M., Boulet, J.-C., 2018. A review of orthogonal projections for calibration. Journal of Chemometrics 32, e3045. https://doi.org/10.1002/cem.3045

Zeaiter, M., Roger, J.M., Bellon-Maurel, V., 2006. Dynamic orthogonal projection. A new method to maintain the on-line robustness of multivariate calibrations. Application to NIR-based monitoring of wine fermentations. Chemometrics and Intelligent Laboratory Systems, 80, 227–235. https://doi.org/10.1016/j.chemolab.2005.06.011

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val

## The objective is to remove a detrimental 
## information (here, D) from spaces X1 and X2
D = X1cal - X2cal
nlv = 2
res = eposvd(D; nlv)
res.M # orthogonalization matrix
res.P # detrimental directions (columns of matrix P = loadings of D)

## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M

i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
source
Jchemo.errpMethod
errp(pred, y)

Compute the classification error rate (ERRP).

  • pred : Predictions.
  • y : Observed data (class membership).

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
errp(pred, ytest)
source
Jchemo.euclsqMethod
euclsq(X, Y)

Squared Euclidean distances between the rows of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).

For X (n, p) and Y (m, p), the function returns a matrix (n, m) where:

  • entry (i, j) = the squared Euclidean distance between row i of X and row j of Y.

Examples

X = rand(5, 3)
Y = rand(2, 3)

euclsq(X, Y)

euclsq(X[1:1, :], Y[1:1, :])

euclsq(X[:, 1], 4)
euclsq(1, 4)
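
## Sanity check of one entry against the definition above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
D = euclsq(X, Y)
sum(abs2, X[1, :] - Y[2, :])  # should be close to D[1, 2]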
source
Jchemo.fblockscalMethod
fblockscal(Xbl, bscales)
fblockscal!(Xbl::Vector, bscales::Vector)

Scale multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, the output of function mblock applied to (n, p) data.
  • bscales : A vector (of length equal to the nb. of blocks) of the scalars dividing the blocks.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 

bscales = 10 * ones(3)
zXbl = fblockscal(Xbl, bscales) ;
@head zXbl[3]
@head Xbl[3]

fblockscal!(Xbl, bscales) ;
@head Xbl[3]
source
Jchemo.fcenterMethod
fcenter(X, v)
fcenter!(X::AbstractMatrix, v)

Center each column of X.

  • X : Data.
  • v : Centering vector.

Examples

n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
fcenter(X, xmeans)
source
Jchemo.fcscaleMethod
fcscale(X, u, v)
fcscale!(X, u, v)

Center and scale each column of X.

  • X : Data.
  • u : Centering vector.
  • v : Scaling vector.

Examples

n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
xstds = colstd(X)
fcscale(X, xmeans, xstds)
source
Jchemo.fdaMethod
fda(X, y; kwargs...)
fda(X, y, weights; kwargs...)
fda!(X::Matrix, y, weights; kwargs...)

Factorial discriminant analysis (FDA).

  • X : X-data (n, p).
  • y : y-data (n) (class membership).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of discriminant components.
  • lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

FDA by eigen factorization of Inverse(W) * B, where W is the "Within"-covariance matrix (pooled over the classes), and B the "Between"-covariance matrix.

The function maximizes the compromise:

  • p'Bp / p'Wp

i.e. max p'Bp with constraint p'Wp = 1. Vectors p (columns of P) are the linear discriminant coefficients, often referred to as "LD".

If X is ill-conditioned, a ridge regularization can be used:

  • If lb > 0, W is replaced by W + lb * I, where I is the identity matrix.

In these fda functions, observation weights (argument weights) are used to compute matrices W and B.

In the high-level version, the observation weights are automatically defined by the given priors (argument prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level versions.

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
tab(ytrain)
tab(ytest)

nlv = 2
mod = model(fda; nlv)
#mod = model(fdasvd; nlv)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
lev = fm.lev
nlev = length(lev)
aggsum(fm.weights.w, ytrain)

@head fm.T 
@head transf(mod, Xtrain)
@head transf(mod, Xtest)

## X-loadings matrix
## = coefficients of the linear discriminant function
## = "LD" of function lda of the R package MASS
fm.P
fm.P' * fm.P

## Explained variance computed by weighted PCA 
## of the class centers in transformed scale
summary(mod).explvarx

## Projections of the class centers 
## to the score space
ct = fm.Tcenters 
f, ax = plotxy(fm.T[:, 1], fm.T[:, 2], ytrain; ellipse = true, title = "FDA",
    xlabel = "Score-1", ylabel = "Score-2")
scatter!(ax, ct[:, 1], ct[:, 2], marker = :star5, markersize = 15, color = :red)  # see available_marker_symbols()
f
source
Jchemo.fdasvdMethod
fdasvd(X, y, weights; kwargs...)
fdasvd!(X::Matrix, y, weights; kwargs...)

Factorial discriminant analysis (FDA).

  • X : X-data (n, p).
  • y : y-data (n) (class membership).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of discriminant components.
  • lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

FDA by a weighted SVD factorization of the matrix of the class centers (after spherical transformation). The function gives the same results as function fda.

See function fda for details and examples.

source
Jchemo.fdifMethod
fdif(X; kwargs...)

Finite differences (discrete derivates) for each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Nb. points involved in the window for the finite differences. The range of the window (= nb. of intervals between two successive columns) is npoint - 1.

The method reduces the column-dimension:

  • (n, p) -> (n, p - npoint + 1).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(fdif; npoint = 2) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.findindexMethod
findindex(x, lev)

Replace a vector containing levels by the indexes of those levels within a set of levels.

  • x : Vector (n) of levels to replace.
  • lev : Vector (nlev) containing the levels.

Warning: The levels in x must be contained in lev.

Examples

lev = ["EHH" ; "FFS" ; "ANF" ; "CLZ" ; "CNG" ; "FRG" ; "MPW" ; "PEE" ; "SFG" ; "TTS"]
x = ["EHH" ; "TTS" ; "FRG"]
findindex(x, lev)
source
Jchemo.findmax_claMethod
findmax_cla(x)
findmax_cla(x, weights::Weight)

Find the most frequent level in x.

  • x : A categorical variable.
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

In case of ties, the function returns the first of the tied levels.

Examples

x = rand(1:3, 10)
tab(x)
findmax_cla(x)
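
## Weighted variant (a sketch, not from the original docstring; it assumes a 
## Weight object built with function mweight, as stated in the signature above):
w = mweight(rand(10))
findmax_cla(x, w)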
source
Jchemo.frobMethod
frob(X)
frob(X, weights::Weight)

Frobenius norm of a matrix.

  • X : A matrix (n, p).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

The Frobenius norm of X is:

  • sqrt(tr(X' * X)).

The Frobenius weighted norm is:

  • sqrt(tr(X' * D * X)), where D is the diagonal matrix of the weights (vector weights.w).
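
Examples

## A minimal usage sketch (not part of the original docstring); it assumes a numeric
## matrix and, for the weighted form, a Weight object built with function mweight.
X = rand(5, 3)
frob(X)               # sqrt(tr(X' * X))
w = mweight(ones(5))
frob(X, w)            # sqrt(tr(X' * D * X)), with D the diagonal matrix of w.w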
source
Jchemo.fscaleMethod
fscale(X, v)
fscale!(X::AbstractMatrix, v)

Scale each column of X.

  • X : Data.
  • v : Scaling vector.

Examples

X = rand(5, 2) 
fscale(X, colstd(X))
source
Jchemo.fweightMethod
fweight(d; typw = :bisquare, alpha = 0)

Computation of weights from distances.

  • d : Vector of distances.

Keyword arguments:

  • typw : Define the weight function.
  • alpha : Parameter of the weight function, see below.

The returned weight vector is:

  • w = f(d / q) where f is the weight function and q the 1-alpha quantile of d (Cleveland & Grosse 1991).

Possible values for typw are:

  • :bisquare: w = (1 - x^2)^2
  • :cauchy: w = 1 / (1 + x^2)
  • :epan: w = 1 - x^2
  • :fair: w = 1 / (1 + x)^2
  • :invexp: w = exp(-x)
  • :invexp2: w = exp(-x / 2)
  • :gauss: w = exp(-x^2)
  • :trian: w = 1 - x
  • :tricube: w = (1 - x^3)^3

References

Cleveland, W.S., Grosse, E., 1991. Computational methods for local regression. Stat Comput 1, 47–62. https://doi.org/10.1007/BF01890836

Examples

using CairoMakie, Distributions

d = sort(sqrt.(rand(Chi(1), 1000)))
cols = cgrad(:tab10, collect(1:9)) ;
alpha = 0
f = Figure(size = (600, 500))
ax = Axis(f, xlabel = "d", ylabel = "Weight")
typw = :bisquare
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[1])
typw = :cauchy
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[2])
typw = :epan
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[3])
typw = :fair
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[4])
typw = :gauss
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[5])
typw = :trian
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[6])
typw = :invexp
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[7])
typw = :invexp2
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[8])
typw = :tricube
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[9])
axislegend("Function", position = :lb)
f[1, 1] = ax
f
source
Jchemo.getknnMethod
getknn(Xtrain, X; metric = :eucl, k = 1)

Return the k nearest neighbors in Xtrain of each row of the query X.

  • Xtrain : Training X-data.
  • X : Query X-data.

Keyword arguments:

  • metric : Type of distance used for the query. Possible values are :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).
  • k : Number of neighbors to return.

The distances (not squared) are also returned.

Spectral angular and correlation distances between two vectors x and y:

  • Spectral angular distance (x, y) = acos(x'y / (norm(x) * norm(y))) / pi
  • Correlation distance (x, y) = sqrt((1 - cor(x, y)) / 2)

Both distances are bounded within 0 (y = x) and 1 (y = -x).

Examples

Xtrain = rand(5, 3)
X = rand(2, 3)
x = X[1:1, :]

k = 3
res = getknn(Xtrain, X; k)
res.ind  # indexes
res.d    # distances

res = getknn(Xtrain, x; k)
res.ind

res = getknn(Xtrain, X; metric = :mah, k)
res.ind
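
## Other metrics listed above (a sketch; it assumes :sam and :cor are accepted 
## as documented):
res = getknn(Xtrain, X; metric = :sam, k)
res.ind
res = getknn(Xtrain, X; metric = :cor, k)
res.d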
source
Jchemo.gridcvMethod
gridcv(mod, X, Y; segm, score, pars = nothing, nlv = nothing, lb = nothing, 
    verbose = false)

Cross-validation (CV) of a model over a grid of parameters.

  • mod : Model to evaluate.
  • X : Training X-data (n, p).
  • Y : Training Y-data (n, q).

Keyword arguments:

  • segm : Segments of observations used for the CV (output of functions segmts, segmkf, etc.).
  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars.

For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.

The function returns two outputs:

  • res : mean results
  • res_p : results per replication.

Examples

######## Regression

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Replicated K-fold CV 
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)

####-- Plsr
mod = model(plskern)
nlv = 0:30
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv) ;
pnames(rescv)
res = rescv.res 
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Adding pars 
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm,  score = rmsep, pars, nlv) ;
res = rescv.res 
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Rr 
lb = (10).^(-8:.1:3)
mod = model(rr) 
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, lb) ;
res = rescv.res 
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f     
    
## Adding pars 
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, lb) ;
res = rescv.res 
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Kplsr 
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
rescv = gridcv(mod, Xtrain, ytrain; segm,  score = rmsep, pars, nlv) ;
res = rescv.res 
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs",  ylabel = "RMSEP", 
    leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Knnr 
nlvdis = [15, 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1; 5; 10; 20; 50 ; 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
mod = model(knnr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res 
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Lwplsr 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
nlv = 0:20
mod = model(lwplsr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, nlv, verbose = true) ;
res = rescv.res 
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u], 
    h = res.h[u], k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- LwplsrAvg 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20]  
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1]) 
mod = model(lwplsravg)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res 
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f     

######## Discrimination
## The principle is the same as for regression

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Replicated K-fold CV 
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)

####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif; :prop]
pars = mpar(prior = prior)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = errp, pars, nlv)
res = rescv.res
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
source
Jchemo.gridcv_brMethod
gridcv_br(X, Y; segm, fun, score, pars, verbose = false)

Working function for gridcv.

See function gridcv for examples.

source
Jchemo.gridcv_lbMethod
gridcv_lb(X, Y; segm, fun, score, pars = nothing, lb, verbose = false)

Working function for gridcv.

Specific and faster than gridcv_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.

See function gridcv for examples.

source
Jchemo.gridcv_lvMethod
gridcv_lv(X, Y; segm, fun, score, pars = nothing, nlv, verbose = false)

Working function for gridcv.

Specific and faster than gridcv_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.

See function gridcv for examples.

source
Jchemo.gridscoreMethod
gridscore(mod, Xtrain, Ytrain, X, Y; score, pars = nothing, nlv = nothing, 
    lb = nothing, verbose = false)

Test-set validation of a model over a grid of parameters.

  • mod : Model to evaluate.
  • Xtrain : Training X-data (n, p).
  • Ytrain : Training Y-data (n, q).
  • X : Validation X-data (m, p).
  • Y : Validation Y-data (m, q).

Keyword arguments:

  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars. The score is computed over the validation set {X, Y}.

For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.

Examples

######## Regression 

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Plsr
mod = model(plskern)
nlv = 0:30
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv)
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Adding pars 
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Rr 
lb = (10).^(-8:.1:3)
mod = model(rr) 
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, lb)
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
    
## Adding pars 
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, lb)
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Kplsr 
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
    leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Knnr 
nlvdis = [15; 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1, 5, 10, 20, 50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
mod = model(knnr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Lwplsr 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
nlv = 0:20
mod = model(lwplsr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv, verbose = true)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- LwplsrAvg 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1]) 
mod = model(lwplsravg)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Mbplsr
listbl = [1:525, 526:1050]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 
Xbl_cal = mblock(Xcal, listbl) 
Xbl_val = mblock(Xval, listbl) 

mod = model(mbplsr)
bscal = [:none, :frob]
pars = mpar(bscal = bscal) 
nlv = 0:30
res = gridscore(mod, Xbl_cal, ycal, Xbl_val, yval; score = rmsep, pars, nlv)
group = res.bscal 
plotgrid(res.nlv, res.y1, group; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(mbplsr; bscal = res.bscal[u], nlv = res.nlv[u])
fit!(mod, Xbltrain, ytrain)
pred = predict(mod, Xbltest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
    
######## Discrimination
## The principle is the same as for regression

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif, :prop]
pars = mpar(prior = prior)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = errp, pars, nlv)
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
source
Jchemo.gridscoreMethod
gridscore(mod::Pipeline, Xtrain, Ytrain, X, Y; score, pars = nothing, 
    nlv = nothing, lb = nothing, verbose = false)

Test-set validation of a model pipeline over a grid of parameters.

  • mod : A pipeline of models to evaluate.
  • Xtrain : Training X-data (n, p).
  • Ytrain : Training Y-data (n, q).
  • X : Validation X-data (m, p).
  • Y : Validation Y-data (m, q).

Keyword arguments:

  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

In the present version of the function, only the last model of the pipeline (= the final predictor) is validated.

For other details, see function gridscore for simple models.

Examples

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Pipeline Snv :> Savgol :> Plsr
## Only the last model is validated
## mod1
centr = true ; scal = false
mod1 = model(snv; centr, scal)
## mod2 
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
## mod3
nlv = 0:30
mod3 = model(plskern)
##
mod = pip(mod1, mod2, mod3)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv) ;
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod3 = model(plskern; nlv = res.nlv[u])
mod = pip(mod1, mod2, mod3)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f

####-- Pipeline Pca :> Svmr
## Only the last model is validated
## mod1
nlv = 15 ; scal = true
mod1 = model(pcasvd; nlv, scal)
## mod2
kern = [:krbf]
gamma = (10).^(-5:1.:5)
cost = (10).^(1:3)
epsilon = [.1, .2, .5]
pars = mpar(kern = kern, gamma = gamma, cost = cost, epsilon = epsilon)
mod2 = model(svmr)
##
mod = pip(mod1, mod2)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod2 = model(svmr; kern = res.kern[u], gamma = res.gamma[u], cost = res.cost[u],
    epsilon = res.epsilon[u])
mod = pip(mod1, mod2) ;
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f
source
Jchemo.gridscore_brMethod
gridscore_br(Xtrain, Ytrain, X, Y; fun, score, pars, 
    verbose = false)

Working function for gridscore.

See function gridscore for examples.

source
Jchemo.gridscore_lbMethod
gridscore_lb(Xtrain, Ytrain, X, Y; fun, score, pars = nothing, 
    lb, verbose = false)

Working function for gridscore.

Specific and faster than gridscore_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.

See function gridscore for examples.

source
Jchemo.gridscore_lvMethod
gridscore_lv(Xtrain, Ytrain, X, Y; fun, score, pars = nothing, 
    nlv, verbose = false)

Working function for gridscore.

Specific and faster than gridscore_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.

See function gridscore for examples.

source
Jchemo.headMethod
head(X)

Display the first rows of a dataset.

Examples

X = rand(100, 5)
head(X)
@head X
source
Jchemo.interplMethod
interpl(X; kwargs...)

Sampling spectra by interpolation.

  • X : Matrix (n, p) of spectra (rows).

Keyword arguments:

  • wl : Values representing the column "names" of X. Must be a numeric vector of length p, or an AbstractRange.
  • wlfin : Final values (within the range of wl) where to interpolate the spectrum. Must be a numeric vector, or an AbstractRange.

The function implements a cubic spline interpolation using package DataInterpolations.jl.

References

Package DataInterpolations.jl https://github.com/PumasAI/DataInterpolations.jl https://htmlpreview.github.io/?https://github.com/PumasAI/DataInterpolations.jl/blob/v2.0.0/example/DataInterpolations.html

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

wlfin = range(500, 2400, length = 10)
#wlfin = collect(range(500, 2400, length = 10))
mod = model(interpl; wl, wlfin)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.isel!Function
isel!(mod, X, Y, wl = 1:nco(X); rep = 1, nint = 5, psamp = .3, score = rmsep)

Interval variable selection.

  • mod : Model to evaluate.
  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • wl : Optional numeric labels (p, 1) of the X-columns.

Keyword arguments:

  • rep : Number of replications of the splitting training/test.
  • nint : Nb. intervals.
  • psamp : Proportion of data used as test set to compute the score.
  • score : Function computing the prediction score.

The principle is as follows:

  • Data (X, Y) are split randomly into a training and a test set.
  • Range 1:p in X is segmented into nint intervals, of equal size when possible.
  • The model is fitted on the training set, and the score (error rate) is computed on the test set, first accounting for all the p variables (reference) and then for each of the nint intervals.
  • This process is replicated rep times. Average results are provided in the outputs, as well as the results per replication.

References

Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500

Examples

using DataFrames, JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y 
wl_str = names(X)
wl = parse.(Float64, wl_str) 
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f

s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Work on the j-th 
## y-variable 
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]

mod = model(plskern; nlv = 5)
nint = 10
res = isel!(mod, Xtrain, ytrain, wl; rep = 30, nint) ;
res.res_rep
res.res0_rep
zres = res.res
zres0 = res.res0
f = Figure(size = (650, 300))
ax = Axis(f[1, 1], xlabel = "Wawelength (nm)", ylabel = "RMSEP_Val",
    xticks = zres.lo)
scatter!(ax, zres.mid, zres.y1; color = (:red, .5))
vlines!(ax, zres.lo; color = :grey, linestyle = :dash, linewidth = 1)
hlines!(ax, zres0.y1, linestyle = :dash)
f
source
Jchemo.kdedaMethod
kdeda(X, y; kwargs...)

Discriminant analysis using non-parametric kernel Gaussian density estimation (KDE-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.

The principle is the same as functions lda and qda except that densities are estimated from function dmkern instead of function dmnorm.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

prior = :unif
#prior = :prop
mod = model(kdeda; prior)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

mod = model(kdeda; prior, a_kde = .5) 
#mod = model(kdeda; prior, h_kde = .1) 
fit!(mod, Xtrain, ytrain)
mod.fm.fm[1].H
source
Jchemo.knndaMethod
knnda(X, y; kwargs...)

k-Nearest-Neighbours weighted discrimination (KNN-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : For stabilization when very close neighbors.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

This function has the same principle as function knnr, except that a discrimination is done instead of a regression. A weighted vote is done over the neighborhood, and the prediction corresponds to the most frequent class.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 2 ; k = 10
mod = model(knnda; nlvdis, metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.knnrMethod
knnr(X, Y; kwargs...)

k-Nearest-Neighbours weighted regression (KNNR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : For stabilization when very close neighbors.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

The general principle of this function is as follows (many other variants of kNNR pipelines can be built):

For each new observation to predict, the prediction is the weighted mean over a selected neighborhood (in X) of size k. Within the selected neighborhood, the weights are defined from the dissimilarities between the new observation and the neighbors, and are computed with function wdist.

In general, for high dimensional X-data, using the Mahalanobis distance requires preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
#nlvdis = 0 ; metric = :eucl 
h = 1 ; k = 5 
mod = model(knnr; nlvdis, metric, h, k) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
mod = model(knnr; k = 15, h = 5) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.kpcaMethod
kpca(X; kwargs...)
kpca(X, weights::Weight; kwargs...)

Kernel PCA (Scholkopf et al. 1997, Scholkopf & Smola 2002, Tipping 2001).

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. principal components (PCs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The method is implemented by SVD factorization of the weighted Gram matrix:

  • D^(1/2) * Phi(X) * Phi(X)' * D^(1/2)

where X is the centered matrix and D is the diagonal matrix of the weights (weights.w) of the observations (rows of X).

References

Scholkopf, B., Smola, A., Müller, K.-R., 1997. Kernel principal component analysis, in: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (Eds.), Artificial Neural Networks, ICANN 97, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 583-588. https://doi.org/10.1007/BFb0020217

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Tipping, M.E., 2001. Sparse kernel principal component analysis. Advances in neural information processing systems, MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf

Examples

using JchemoData, JLD2 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
Xtest = X[s.test, :]

nlv = 3
kern = :krbf ; gamma = 1e-4
mod = model(kpca; nlv, kern, gamma) ;
fit!(mod, Xtrain)
pnames(mod.fm)
@head T = mod.fm.T
T' * T
mod.fm.P' * mod.fm.P

@head Ttest = transf(mod, Xtest)

res = summary(mod) ;
pnames(res)
res.explvarx
source
Jchemo.kplskdedaMethod
kplskdeda(X, y; kwargs...)
kplskdeda(X, y, weights::Weight; kwargs...)

KPLS-KDEDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDEDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function kplslda for examples.

source
Jchemo.kplsldaMethod
kplslda(X, y; kwargs...)
kplslda(X, y, weights::Weight; kwargs...)

KPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
gamma = .1
mod = model(kplslda; nlv, gamma) 
#mod = model(kplslda; nlv, gamma, prior = :prop) 
#mod = model(kplsqda; nlv, gamma, alpha = .5) 
#mod = model(kplskdeda; nlv, gamma, a_kde = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.kplsqdaMethod
kplsqda(X, y; kwargs...)
kplsqda(X, y, weights::Weight; kwargs...)

KPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function kplslda for examples.

source
Jchemo.kplsrMethod
kplsr(X, Y; kwargs...)
kplsr(X, Y, weights::Weight; kwargs...)
kplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Kernel partial least squares regression (KPLSR) implemented with a Nipals algorithm (Rosipal & Trejo, 2001).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

This algorithm becomes slow for n > 1000. Use function dkplsr instead.

References

Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
kern = :krbf ; gamma = 1e-1
mod = model(kplsr; nlv, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
nlv = 2
kern = :krbf ; gamma = 1 / 3
mod = model(kplsr; nlv, kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.kplsrdaMethod
kplsrda(X, y; kwargs...)
kplsrda(X, y, weights::Weight; kwargs...)

Discrimination based on kernel partial least squares regression (KPLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
kern = :krbf ; gamma = .001 
scal = true
mod = model(kplsrda; nlv, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.kpolMethod
kpol(X, Y; kwargs...)

Compute a polynomial kernel Gram matrix.

  • X : X-data (n, p).
  • Y : Y-data (m, p).

Keyword arguments:

  • degree : Degree of the polynomial.
  • gamma : Scale of the polynomial.
  • coef0 : Offset of the polynomial.

Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:

  • K(X, Y) = Phi(X) * Phi(Y)'.

The polynomial kernel between two vectors x and y is computed by (gamma * (x' * y) + coef0)^degree.

References

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

X = rand(5, 3)
Y = rand(2, 3)
kpol(X, Y; degree = 3, gamma = .1, coef0 = 10)
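
## Sanity check of one entry against the formula above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
using LinearAlgebra
gam = .1 ; c0 = 10 ; d = 3
K = kpol(X, Y; degree = d, gamma = gam, coef0 = c0)
(gam * dot(X[1, :], Y[2, :]) + c0)^d  # should be close to K[1, 2]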
source
Jchemo.krbfMethod
krbf(X, Y; kwargs...)

Compute a Radial-Basis-Function (RBF) kernel Gram matrix.

  • X : X-data (n, p).
  • Y : Y-data (m, p).

Keyword arguments:

  • gamma : Scale parameter.

Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:

  • K(X, Y) = Phi(X) * Phi(Y)'.

The RBF kernel between two vectors x and y is computed by exp(-gamma * ||x - y||^2).

References

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

X = rand(5, 3)
Y = rand(2, 3)
krbf(X, Y; gamma = .1)
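
## Sanity check of one entry against the formula above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
gam = .1
K = krbf(X, Y; gamma = gam)
exp(-gam * sum(abs2, X[1, :] - Y[2, :]))  # should be close to K[1, 2]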
source
Jchemo.krrMethod
krr(X, Y; kwargs...)
krr(X, Y, weights::Weight; kwargs...)
krr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Kernel ridge regression (KRR) implemented by SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

KRR is also referred to as least squares SVM regression (LS-SVMR). The method is close to the particular case of SVM regression where there is no margin excluding the observations (epsilon coefficient set to zero). The difference is that an L2-norm optimization is done, instead of L1 in SVM.

References

Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.

Cawley, G.C., Talbot, N.L.C., 2002. Reduced Rank Kernel Ridge Regression. Neural Processing Letters 16, 293-302. https://doi.org/10.1023/A:1021798002258

Krell, M.M., 2018. Generalizing, Decoding, and Optimizing Support Vector Machine Classification. arXiv:1801.04929.

Saunders, C., Gammerman, A., Vovk, V., 1998. Ridge Regression Learning Algorithm in Dual Variables, in: In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, pp. 515-521.

Suykens, J.A.K., Lukas, L., Vandewalle, J., 2000. Sparse approximation using least squares support vector machines. 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353). https://doi.org/10.1109/ISCAS.2000.856439

Welling, M., n.d. Kernel ridge regression. Department of Computer Science, University of Toronto, Toronto, Canada. https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

lb = 1e-3
kern = :krbf ; gamma = 1e-1
mod = model(krr; lb, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

coef(mod)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]

lb = 1e-3
kern = :kpol ; degree = 1
mod = model(krr; lb, kern, degree) 
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest)
rmsep(res.pred, ytest)

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
lb = 1e-1
kern = :krbf ; gamma = 1 / 3
mod = model(krr; lb, kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.krrdaMethod
krrda(X, y; kwargs...)
krrda(X, y, weights::Weight; kwargs...)

Discrimination based on kernel ridge regression (KRR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function rrda (RR-DA) except that a kernel RR (function krr), instead of a RR (function rr), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

lb = 1e-5
kern = :krbf ; gamma = .001 
scal = true
mod = model(krrda; lb, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; lb = [.1, .001]).pred
source
Jchemo.ldaMethod
lda(; kwargs...)
lda(X, y; kwargs...)
lda(X, y, weights::Weight; kwargs...)

Linear discriminant analysis (LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

In these lda functions, observation weights (argument weights) are used to compute the intra-class (= "within") covariance matrix. Argument prior is used to define the usual prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = lda()
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lgMethod
lg(X, Y; centr = true)
lg(Xbl; centr = true)

Compute the Lg coefficient between matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • Xbl : A list (vector) of matrices.

Keyword arguments:

  • centr : Boolean indicating if the matrices will be internally centered or not.

Lg(X, Y) = sum(j = 1..p) sum(k = 1..q) cov(xj, yk)^2

RV(X, Y) = Lg(X, Y) / sqrt(Lg(X, X) * Lg(Y, Y))
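
A manual computation following the definition above (a minimal sketch; the covariance is taken as the uncorrected one, which is an assumption on the normalization used internally):

using Statistics
n = 5
X = rand(n, 4)
Y = rand(n, 3)
lg(X, Y)
## Double sum of squared covariances
sum([cov(X[:, j], Y[:, k]; corrected = false)^2 for j = 1:4, k = 1:3])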

References

Escofier, B. & Pagès, J. 1984. L’analyse factorielle multiple. Cahiers du Bureau universitaire de recherche opérationnelle. Série Recherche, tome 42, p. 3-68

Escofier, B. & Pagès, J. (2008). Analyses Factorielles Simples et Multiples : Objectifs, Méthodes et Interprétation. Dunod, 4e édition.

Examples

X = rand(5, 10)
Y = rand(5, 3)
lg(X, Y)

X = rand(5, 15) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
lg(Xbl)
source
Jchemo.listMethod
list(Q, n::Integer)

Create a Vector{Q}(undef, n).

isassigned(object, i) can be used to check if cell i is empty.

Examples

list(Float64, 5)
list(Array{Float64}, 5)
list(Matrix{Int}, 5)
source
Jchemo.listMethod
list(n::Integer)

Create a Vector{Any}(nothing, n).

isnothing(object[i]) can be used to check if cell i is empty.

Examples

list(5)
source
Jchemo.locwMethod
locw(Xtrain, Ytrain, X; listnn, listw = nothing, fun, verbose = false, 
    kwargs...)

Compute predictions for a given kNN model.

  • Xtrain : Training X-data.
  • Ytrain : Training Y-data.
  • X : X-data (m observations) to predict.

Keyword arguments:

  • listnn : List (vector) of m vectors of indexes.
  • listw : List (vector) of m vectors of weights.
  • fun : Function computing the model on the m neighborhoods.
  • verbose : Boolean. If true, fitting information is printed.
  • kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).

Each component i of listnn and listw contains the indexes and weights, respectively, of the nearest neighbors of x_i in Xtrain. The sizes of the neighborhood for i = 1,...,m can be different.
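
A hypothetical minimal sketch (the neighborhoods are built by hand from Euclidean distances; whether fun = mlr can be passed as-is and the fields of the returned object are assumptions):

n = 50 ; p = 5 ; m = 3
Xtrain = rand(n, p)
Ytrain = rand(n, 2)
X = rand(m, p)
k = 10
## Indexes of the k nearest neighbors of each row of X within Xtrain
listnn = [sortperm(vec(sum((Xtrain .- X[i:i, :]).^2, dims = 2)))[1:k] for i = 1:m]
res = locw(Xtrain, Ytrain, X; listnn, fun = mlr)
res.pred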

source
Jchemo.locwlvMethod
locwlv(Xtrain, Ytrain, X; listnn, listw = nothing, fun, nlv, verbose = true, 
    kwargs...)

Compute predictions for a given kNN model.

  • Xtrain : Training X-data.
  • Ytrain : Training Y-data.
  • X : X-data (m observations) to predict.

Keyword arguments:

  • listnn : List (vector) of m vectors of indexes.
  • listw : List (vector) of m vectors of weights.
  • fun : Function computing the model on the m neighborhoods.
  • nlv : Nb. or collection of nb. of latent variables (LVs).
  • verbose : Boolean. If true, fitting information is printed.
  • kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).

Same as locw but specific and much faster for LV-based models (e.g. PLSR).

source
Jchemo.lwmlrMethod
lwmlr(X, Y; kwargs...)

k-Nearest-Neighbours locally weighted multiple linear regression (kNN-LWMLR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.

This is the same principle as function lwplsr except that MLR models are fitted on the neighborhoods, instead of PLSR models. The neighborhoods are computed directly on X (there is no preliminary dimension reduction).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
mod0 = model(pcasvd; nlv) ;
fit!(mod0, Xtrain) 
@head Ttrain = mod0.fm.T 
@head Ttest = transf(mod0, Xtest)

metric = :eucl 
h = 2 ; k = 100 
mod = model(lwmlr; metric, h, k) 
fit!(mod, Ttrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Ttest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
mod = model(lwmlr; metric = :eucl, h = 1.5, k = 20) ;
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.lwmlrdaMethod
lwmlrda(X, y; kwargs...)

k-Nearest-Neighbours locally weighted MLR-based discrimination (kNN-LWMLR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

This is the same principle as function lwmlr except that MLR-DA models, instead of MLR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

metric = :mah
h = 2 ; k = 10
mod = model(lwmlrda; metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsldaMethod
lwplslda(X, y; kwargs...)

kNN-LWPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLS-LDA models, instead of PLSR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 1 ; k = 100
mod = model(lwplslda; nlvdis, metric, h, k, prior = :prop) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsqdaMethod
lwplsqda(X, y; kwargs...)

kNN-LWPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLS-QDA models, instead of PLSR models, are fitted on the neighborhoods.

  • Warning: The present version of this function suffers from frequent stops due to non-positive definite matrices when doing QDA on neighborhoods, since some classes within a neighborhood can have very few observations. It is recommended to select a sufficiently large number of neighbors and/or to use a regularized QDA (alpha > 0).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 1 ; k = 200
mod = model(lwplsqda; nlvdis, metric, h, k, prior = :prop, alpha = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsrMethod
lwplsr(X, Y; kwargs...)

k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

Function lwplsr fits kNN-LWPLSR models such as in Lesnoff et al. 2020. The general principle of the pipeline is as follows (many other variants of pipelines can be built):

LWPLSR is a particular case of weighted PLSR (WPLSR) (e.g. Schaal et al. 2002). In WPLSR, a priori weights, different from the usual 1/n (standard PLSR), are given to the n training observations. These weights are used for calculating (i) the scores and loadings of the WPLS and (ii) the regression model that fits (by weighted least squares) the Y-response(s) to the WPLS scores. The specificity of LWPLSR (compared to WPLSR) is that the weights are computed from dissimilarities (e.g. distances) between the new observation to predict and the training observations ("L" in LWPLSR comes from "localized"). Note that in LWPLSR the weights and therefore the fitted WPLSR model change for each new observation to predict.

In the original LWPLSR, all the n training observations are used for each observation to predict (e.g. Sicard & Sabatier 2006, Kim et al 2011). This can be very time consuming when n is large. A faster (and often more efficient) strategy is to first select, in the training set, a number of k nearest neighbors to the observation to predict (= "weighting 1") and then to apply LWPLSR only to this pre-selected neighborhood (= "weighting 2"). This strategy corresponds to a kNN-LWPLSR and is the one implemented in function lwplsr.

In lwplsr, the dissimilarities used for weightings 1 and 2 are computed from the raw X-data, or after a dimension reduction, depending on argument nlvdis. In the last case, global PLS2 scores (LVs) are computed from {X, Y} and the dissimilarities are computed over these scores.

In general, for high dimensional X-data, using the Mahalanobis distance requires a preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.

References

Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.

Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics, e3209. https://doi.org/10.1002/cem.3209

Schaal, S., Atkeson, C., Vijayamakumar, S. 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.

Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
h = 1 ; k = 200 ; nlv = 15
mod = model(lwplsr; nlvdis, metric, h, k, nlv) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    
source
Jchemo.lwplsravgMethod
lwplsravg(X, Y; kwargs...)

Averaging kNN-LWPLSR models with different numbers of latent variables (kNN-LWPLSR-AVG).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : A range of nb. of latent variables (LVs) to compute for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs, such as in Lesnoff 2023. On each neighborhood, a PLSR-averaging (Lesnoff et al., see the references below) is done instead of a single PLSR.

For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ..., 10 LVs, respectively.
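
A conceptual sketch of this averaging (pure Julia, not the internal code; the per-LV predictions are stand-ins):

preds = [randn(3, 1) for nlv in 5:10]    # one prediction matrix per number of LVs
predavg = sum(preds) / length(preds)     # simple average used as the final prediction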

References

Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850

M. Lesnoff, Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data, Chemometrics and Intelligent Laboratory Systems. 244 (2023) 105031. https://doi.org/10.1016/j.chemolab.2023.105031.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
h = 1 ; k = 200 ; nlv = 4:20
mod = model(lwplsravg; nlvdis, metric, h, k, nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f  
source
Jchemo.lwplsrdaMethod
lwplsrda(X, y; kwargs...)

kNN-LWPLSR-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLSR-DA models, instead of PLSR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 2 ; k = 100
mod = model(lwplsrda; nlvdis, metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.mahsqMethod
mahsq(X, Y)
mahsq(X, Y, Sinv)

Squared Mahalanobis distances between the rows of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).
  • Sinv : Inverse of a covariance matrix S. If not given, S is computed as the uncorrected covariance matrix of X.

When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) with:

  • cell (i, j) = the squared distance between row i of X and row j of Y.

Examples

using StatsBase 

X = rand(5, 3)
Y = rand(2, 3)

mahsq(X, Y)

S = cov(X, corrected = false)
Sinv = inv(S)
mahsq(X, Y, Sinv)
mahsq(X[1:1, :], Y[1:1, :], Sinv)
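
## Manual check of cell (1, 1), following the definition above (a minimal sketch)
d = X[1, :] - Y[1, :]
d' * Sinv * d
mahsq(X, Y)[1, 1]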

mahsq(X[:, 1], 4)
mahsq(1, 4, 2.1)
source
Jchemo.mahsqcholMethod
mahsqchol(X, Y)
mahsqchol(X, Y, Uinv)

Compute the squared Mahalanobis distances (with a Cholesky factorization) between the observations (rows) of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).
  • Uinv : Inverse of the upper matrix of a Cholesky factorization of a covariance matrix S. If not given, the factorization is done on S, the uncorrected covariance matrix of X.

When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) with:

  • cell (i, j) = the squared distance between row i of X and row j of Y.

Examples

using LinearAlgebra

X = rand(5, 3)
Y = rand(2, 3)

mahsqchol(X, Y)

S = cov(X, corrected = false)
U = cholesky(Hermitian(S)).U 
Uinv = inv(U)
mahsqchol(X, Y, Uinv)
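
## Expected to agree with mahsq (both use the uncorrected covariance of X when S is not given)
mahsqchol(X, Y) ≈ mahsq(X, Y)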

mahsqchol(X[:, 1], 4)
mahsqchol(1, 4, sqrt(2.1))
source
Jchemo.matBFunction
matB(X, y, weights::Weight)

Between-class covariance matrix.

  • X : X-data (n, p).
  • y : A vector (n) defining the class membership.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute the between-class covariance matrix (output B) of X. This is the (non-corrected) covariance matrix of the weighted class centers.

Examples

using StatsBase

n = 20 ; p = 3
X = rand(n, p)
y = rand(1:3, n)
tab(y) 
weights = mweight(ones(n)) 

res = matB(X, y, weights) ;
res.B
res.priors
res.ni
res.lev

res = matW(X, y, weights) ;
res.W
res.Wi

matW(X, y, weights).W + matB(X, y, weights).B
cov(X; corrected = false)

v = mweight(collect(1:n))
matW(X, y, v).priors 
matB(X, y, v).priors 
matW(X, y, v).W + matB(X, y, v).B
covm(X, v)
source
Jchemo.matWFunction
matW(X, y, weights::Weight)

Within-class covariance matrices.

  • X : X-data (n, p).
  • y : A vector (n) defining the class membership.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute the (non-corrected) within-class and pooled covariance matrices (outputs Wi and W, respectively) of X.

If class i contains only one observation, Wi is computed by:

  • covm(X, weights).

For examples, see function matB.

source
Jchemo.mavgMethod
mavg(X; kwargs...)

Smoothing by moving averages of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Nb. points involved in the window.

The smoothing is computed by convolution with padding, using function imfilter of package ImageFiltering.jl. The centered kernel is ones(npoint) / npoint. Each returned point is located on the center of the kernel.

The function returns a matrix (n, p).

References

Package ImageFiltering.jl https://github.com/JuliaImages/ImageFiltering.jl

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(mavg; npoint = 10) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.mbconcatMethod
mbconcat(Xbl)

Concatenate horizontally multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 
Xblnew = mblock(Xnew, listbl) 
@head Xbl[3]

mod = model(mbconcat) 
fit!(mod, Xbl)
transf(mod, Xbl)
transf(mod, Xblnew)
source
Jchemo.mblockMethod
mblock(X, listbl)

Make blocks from a matrix.

  • X : X-data (n, p).
  • listbl : A vector in which each component defines the column numbers of a block in X. The length of listbl is the number of blocks.

The function returns a list (vector) of blocks.

Examples

n = 5 ; p = 10 
X = rand(n, p) 
listbl = [3:4, 1, [6; 8:10]]

Xbl = mblock(X, listbl)
Xbl[1]
Xbl[2]
Xbl[3]
source
Jchemo.mbpcaMethod
mbpca(Xbl; kwargs...)
mbpca(Xbl, weights::Weight; kwargs...)
mbpca!(Xbl::Matrix, weights::Weight; kwargs...)

Consensus principal components analysis (CPCA = MBPCA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for Nipals convergence.
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

The MBPCA global scores are equal to the scores of the PCA of the horizontal concatenation X = [X1 X2 ... Xk].
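
This equality can be checked numerically (a minimal sketch; bscal = :none, no column scaling and the default uniform weights are assumed, and the scores may differ by column signs):

n = 6
X = rand(n, 10)
listbl = [1:4, 5:7, 8:10]
Xbl = mblock(X, listbl)
mod1 = model(mbpca; nlv = 3, bscal = :none)
fit!(mod1, Xbl)
mod2 = model(pcasvd; nlv = 3)
fit!(mod2, reduce(hcat, Xbl))
## Global scores vs. scores of the PCA of the concatenation
mod1.fm.T
mod2.fm.T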

The function returns several objects, in particular:

  • T : The non-normed global scores.
  • U : The normed global scores.
  • W : The global loadings.
  • Tbl : The block scores (grouped by blocks, in original scale).
  • Tb : The block scores (grouped by LV, in the metric scale).
  • Wbl : The block loadings.
  • lb : The specific weights "lambda".
  • mu : The sum of the specific weights (= eigenvalue of the global PCA).

Function summary returns:

  • explvarx : Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
  • contr_block : Contribution of each block to the global scores.
  • explX : Proportion of the inertia of the blocks explained by each global score.
  • corx2t : Correlation between the global scores and the original variables.
  • cortb2t : Correlation between the global scores and the block scores.
  • rv : RV coefficient.
  • lg : Lg coefficient.

References

Mangamana, E.T., Cariou, V., Vigneau, E., Glèlè Kakaï, R.L., Qannari, E.M., 2019. Unsupervised multiblock data analysis: A unified approach and extensions. Chemometrics and Intelligent Laboratory Systems 194, 103856. https://doi.org/10.1016/j.chemolab.2019.103856

Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1]) 

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbpca; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod) 
pnames(mod.fm)
## Global scores 
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]

res = summary(mod, Xbl) ;
pnames(res) 
res.explvarx
res.contr_block
res.explX   # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t 
res.cortb2t
res.rv
source
Jchemo.mbplskdedaMethod
mbplskdeda(Xbl, y; kwargs...)
mbplskdeda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-KDEDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plskdeda, for multiblock X-data.

See function mbplslda for examples.

source
Jchemo.mbplsldaMethod
mbplslda(Xbl, y; kwargs...)
mbplslda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-LDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plslda, for multiblock X-data.

Examples

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 

nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplslda; nlv, bscal, scal)
#mod = model(mbplsqda; nlv, bscal, alpha = .5, scal)
#mod = model(mbplskdeda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain) 
pnames(mod) 

@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)

res = predict(mod, Xbltest) ; 
@head res.pred 
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xbltest; nlv = 1:2).pred
source
Jchemo.mbplsqdaMethod
mbplsqda(Xbl, y; kwargs...)
mbplsqda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-QDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plsqda, for multiblock X-data.

See function mbplslda for examples.

source
Jchemo.mbplsrMethod
mbplsr(Xbl, Y; kwargs...)
mbplsr(Xbl, Y, weights::Weight; kwargs...)
mbplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock PLSR (MBPLSR).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function runs a PLSR on {X, Y} where X is the horizontal concatenation of the blocks in Xbl. The function gives the same results as function mbplswest, but is much faster.
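
A numerical illustration of this equivalence (a minimal sketch; bscal = :none and no column scaling are assumed so that the concatenated data are left unchanged):

n = 6
X = rand(n, 10)
y = rand(n)
listbl = [1:4, 5:7, 8:10]
Xbl = mblock(X, listbl)
mod1 = model(mbplsr; nlv = 2, bscal = :none)
fit!(mod1, Xbl, y)
mod2 = model(plskern; nlv = 2)
fit!(mod2, reduce(hcat, Xbl), y)
## The two sets of training predictions are expected to agree
predict(mod1, Xbl).pred
predict(mod2, reduce(hcat, Xbl)).pred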

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplsr; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)

res = summary(mod, Xbltrain) ;
pnames(res) 
res.explvarx
res.corx2t 
res.rdx
source
Jchemo.mbplsrdaMethod
mbplsrda(Xbl, y; kwargs...)
mbplsrda(Xbl, y, weights::Weight; kwargs...)

Discrimination based on multiblock partial least squares regression (MBPLSR-DA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plsrda, for multiblock X-data.

Examples

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 

nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplsrda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain) 
pnames(mod) 

@head mod.fm.fm.T 
@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)

res = predict(mod, Xbltest) ; 
@head res.pred 
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xbltest; nlv = 1:2).pred
source
Jchemo.mbplswestMethod
mbplswest(Xbl, Y; kwargs...)
mbplswest(Xbl, Y, weights::Weight; kwargs...)
mbplswest!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock PLSR (MBPLSR) - Nipals algorithm.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This functions implements the MBPLSR Nipals algorithm such as in Westerhuis et al. 1998. The function gives the same results as function mbplsr.

References

Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplswest; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)

res = summary(mod, Xbltrain) ;
pnames(res) 
res.explvarx
res.corx2t 
res.cortb2t 
res.rdx
source
Jchemo.merrpMethod
merrp(pred, y)

Compute the mean intra-class classification error rate.

  • pred : Predictions.
  • y : Observed data (class membership).

ERRP (see function errp) is computed for each class. Function merrp returns the average of these intra-class ERRPs.

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
merrp(pred, ytest)
source
Jchemo.missMethod
miss(X)

Find rows with missing data in a dataset.

  • X : A dataset.

Examples

X = rand(5, 4)
zX = hcat(rand(2, 3), fill(missing, 2))
Z = vcat(X, zX)
miss(X)
miss(Z)
source
Jchemo.mlevMethod
mlev(x)

Return the sorted levels of a vector or a dataset.

Examples

x = rand(["a";"b";"c"], 20)
lev = mlev(x)
nlev = length(lev)

X = reshape(x, 5, 4)
mlev(X)

using DataFrames
n = 20
df = DataFrame(g1 = rand(1:2, n), 
    g2 = rand(["a"; "c"], n))
mlev(df)
source
Jchemo.mlrMethod
mlr(X, Y; kwargs...)
mlr(X, Y, weights::Weight; kwargs...)
mlr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a multiple linear regression model (MLR) by using the QR algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

Safe but can be a little slower than other methods.

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 2:4]
y = dat.X[:, 1]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]

mod = model(mlr)
#mod = model(mlrchol)
#mod = model(mlrpinv)
#mod = model(mlrpinvn) 
fit!(mod, Xtrain, ytrain) 
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.B
fm.int 
coef(mod) 
res = predict(mod, Xtest)
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

mod = model(mlr; noint = true)
fit!(mod, Xtrain, ytrain) 
coef(mod) 
source
Jchemo.mlrcholMethod
mlrchol(X, Y)
mlrchol(X, Y, weights::Weight)
mlrchol!(X::Matrix, Y::Matrix, weights::Weight)

Compute a multiple linear regression model (MLR) using the Normal equations and a Choleski factorization.

  • X : X-data, with nb. columns >= 2 (required by function cholesky).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute a model with intercept.

Faster but can be less accurate, since the method works on the cross-product matrix X'X (which squares the condition number).

See function mlr for examples.

source
Jchemo.mlrdaMethod
mlrda(X, y; kwargs...)
mlrda(X, y, weights::Weight)

Discrimination based on multiple linear regression (MLR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a multiple linear regression (MLR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
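
A conceptual sketch of this decision rule (pure Julia; lev and posterior are stand-in values, not Jchemo objects):

lev = ["a", "b", "c"]                # class levels
posterior = [-0.1 0.9 0.2]           # unbounded dummy-variable predictions for one observation
pred = lev[argmax(vec(posterior))]   # class with the highest estimate ("b")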

In the high-level version of the function, the observation weights used in the MLR are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = model(mlrda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.mlrpinvMethod
mlrpinv(; kwargs...)
mlrpinv(X, Y; kwargs...)
mlrpinv(X, Y, weights::Weight; kwargs...)
mlrpinv!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a multiple linear regression model (MLR) by using a pseudo-inverse.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

Safe but can be slower.

See function mlr for examples.

source
Jchemo.mlrpinvnMethod
mlrpinvn() 
mlrpinvn(X, Y)
mlrpinvn(X, Y, weights::Weight)
mlrpinvn!(X::Matrix, Y::Matrix, weights::Weight)

Compute a multiple linear regression model (MLR) by using the Normal equations and a pseudo-inverse.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Safe and fast for p not too large.

Compute a model with intercept.

See function mlr for examples.

source
Jchemo.mlrvecMethod
mlrvec(; kwargs...)
mlrvec(X, Y; kwargs...)
mlrvec(X, Y, weights::Weight; kwargs...)
mlrvec!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a simple linear regression model (univariate x).

  • x : Univariate X-data (n).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

See function mlr for examples.

source
Jchemo.modelMethod
model(fun::Function; kwargs...)

Build a model.

  • fun : The function defining the model.
  • kwargs...: Keyword arguments of fun.

Examples

X = rand(5, 10)
y = rand(5)

mod = model(detrend)  # use the default arguments of 'detrend'
#mod = model(detrend; degree = 2)
pnames(mod)
fit!(mod, X)
Xp = transf(mod, X)

mod = model(plskern; nlv = 3) 
fit!(mod, X, y)
pred = predict(mod, X).pred
source
Jchemo.mparFunction
mpar(; kwargs...)

Return a tuple with all the combinations of the parameter values defined in kwargs.

Keyword arguments:

  • kwargs : Vector(s) of the parameter(s) values.

Examples

nlvdis = 25 ; metric = [:mah] 
h = [1 ; 2 ; Inf] ; k = [500 ; 1000] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k) 
length(pars[1])
reduce(hcat, pars)
source
Jchemo.mseMethod
mse(pred, Y; digits = 3)

Summary of model performance for regression.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
mse(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
mse(pred, ytest)
source
Jchemo.msepMethod
msep(pred, Y)

Compute the mean of the squared prediction errors (MSEP).

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
msep(pred, Ytest)
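## Manual check of the definition (per response column; a minimal sketch,
## the exact output layout of msep may differ)
using Statistics
mean((pred .- Ytest).^2, dims = 1)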

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
msep(pred, ytest)
source
Jchemo.mweightMethod
mweight(x::Vector)

Return an object of type Weight containing the vector w = x / sum(x) (if a Weight object is built directly, ad hoc, its w must sum to 1).

Examples

x = rand(10)
w = mweight(x)
sum(w.w)
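## Check of the definition w = x / sum(x)
w.w ≈ x / sum(x)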
source
Jchemo.mweightclaMethod
mweightcla(x::Vector; prior::Union{Symbol, Vector} = :unif)
mweightcla(Q::DataType, x::Vector; prior::Union{Symbol, Vector} = :unif)

Compute observation weights for a categorical variable, given specified sub-total weights for the classes.

  • x : A categorical variable (n) (class membership).
  • Q : A data type (e.g. Float32).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

Return an object of type Weight (see function mweight) containing a vector w (n) that sums to 1.

Examples

x = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
tab(x)
weights = mweightcla(x)
#weights = mweightcla(x; prior = :prop)
#weights = mweightcla(x; prior = [.1, .7, .2])
aggstat(weights.w, x; fun = sum).X
source
Jchemo.nipalsMethod
nipals(X; kwargs...)
nipals(X, UUt, VVt; kwargs...)

Nipals to compute the first score and loading vectors of a matrix.

  • X : X-data (n, p).
  • UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
  • VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.

Keyword arguments:

  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.

The function finds:

  • {u, v, sv} = argmin(||X - u * sv * v'||)

with the constraints:

  • ||u|| = ||v|| = 1

using the alternating least squares algorithm to compute SVD (Gabriel & Zamir 1979).

At the end, X ~ u * sv * v', where:

  • u : left singular vector (u * sv = scores)
  • v : right singular vector (loadings)
  • sv : singular value.

When NIPALS is used on sequentially deflated matrices, vectors u and v can lose orthogonality due to the accumulation of rounding errors. Orthogonality can be rebuilt with the Gram-Schmidt method (arguments UUt and VVt).

References

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Examples

using LinearAlgebra

X = rand(5, 3)

res = nipals(X)
res.niter
res.sv
svd(X).S[1] 
res.v
svd(X).V[:, 1] 
res.u
svd(X).U[:, 1] 
source
Jchemo.nipalsmissMethod
nipalsmiss(X; kwargs...)
nipalsmiss(X, UUt, VVt; kwargs...)

Nipals to compute the first score and loading vectors of a matrix with missing data.

  • X : X-data (n, p).
  • UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
  • VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.

Keyword arguments:

  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.

See function nipals.

References

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/

Examples

X = [1. 2 missing 4 ; 4 missing 6 7 ; 
    missing 5 6 13 ; missing 18 7 6 ; 
    12 missing 28 7] 

res = nipalsmiss(X)
res.niter
res.sv
res.v
res.u
source
Jchemo.normwMethod
normw(x, weights::Weight)

Compute the weighted norm of a vector.

  • x : A vector (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

The weighted norm of vector x is computed by:

  • sqrt(x' * D * x), where D is the diagonal matrix of vector weights.w.
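
A minimal usage sketch (mweight and normw as documented above):

x = rand(10)
w = mweight(ones(10))
normw(x, w)
## Equivalent computation following the definition
sqrt(sum(w.w .* x.^2))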
source
Jchemo.occodMethod
occod(fm, X; kwargs...)

One-class classification using PCA/PLS orthogonal distance (OD).

  • fm : The preliminary model (e.g. PCA) that was fitted (object fm) on the training data, assumed to represent the training class.
  • X : Training X-data (n, p), on which was fitted the model fm.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of an observation is the orthogonal distance (OD = "X-residuals") of this observation, i.e. the Euclidean distance between the observation and its projection on the score plane defined by the fitted (e.g. PCA) model (e.g. Hubert et al. 2005, Vanden Branden & Hubert 2005 p. 66, Varmuza & Filzmoser 2009 p. 79).

See function occsd for details on outputs.

References

M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.

K. Vanden Branden, M. Hubert (2005). Robust classification in high dimension based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems, 79, 10-21.

K. Varmuza, P. Filzmoser (2009). Introduction to multivariate statistical analysis in chemometrics. CRC Press, Boca Raton.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod0 = model(pcasvd; nlv = 10) 
fit!(mod0, zXtrain) 
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 10) 
fit!(mod0, zXtrain)
fm0 = mod0.fm ;  
## Outlierness
mod = model(occod)
#mod = model(occod; mcut = :mad, cri = 4)
#mod = model(occod; mcut = :q, risk = .01) ;
#mod = model(occsdod)
fit!(mod, fm0, zXtrain) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.occsdMethod
occsd(fm; kwargs...)

One-class classification using PCA/PLS score distance (SD).

  • fm : The preliminary model (e.g. PCA) that was fitted on the training data assumed to represent the training class.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of an observation is defined by its score distance (SD), i.e. the Mahalanobis distance between the projection of the observation on the score plane defined by the fitted (e.g. PCA) model and the center of the score plane.

If a new observation has d higher than a given cutoff, the observation is assumed to not belong to the training (= reference) class. The cutoff is computed with non-parametric heuristics. Noting [d] the vector of outliernesses computed on the training class:

  • If mcut = :mad, then cutoff = median([d]) + cri * mad([d]).
  • If mcut = :q, then cutoff is estimated from the empirical cumulative distribution function computed on [d], for a given risk-I (risk).

Alternative approximate cutoffs have been proposed in the literature (e.g.: Nomikos & MacGregor 1995, Hubert et al. 2005, Pomerantsev 2008). Typically, and whatever the approximation method used to compute the cutoff, it is recommended to tune this cutoff depending on the detection objectives.

Outputs

  • pval: Estimate of the p-value (see function pval) computed from the training distribution [d].
  • dstand: standardized distance defined as d / cutoff. A value dstand > 1 may be considered as extreme compared to the distribution of the training data.
  • gh: The WinISI "GH" (usually, GH > 3 is considered as extreme).

Specific for function predict:

  • pred: class prediction
    • dstand <= 1 ==> in: the observation is expected to belong to the training class,
    • dstand > 1 ==> out: extreme value, possibly not belonging to the same class as the training.

References

M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.

Nomikos, P., MacGregor, J.F., 1995. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 37, 41-59. https://doi.org/10.1080/00401706.1995.10485888

Pomerantsev, A.L., 2008. Acceptance areas for multivariate classification derived by projection methods. Journal of Chemometrics 22, 601-609. https://doi.org/10.1002/cem.1147

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod = model(pcasvd; nlv = 10) 
fit!(mod, zXtrain) 
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 30) 
fit!(mod0, zXtrain)
fm0 = mod0.fm ;  
## Outlierness
mod = model(occsd)
#mod = model(occsd; mcut = :mad, cri = 4)
#mod = model(occsd; mcut = :q, risk = .01)
fit!(mod, fm0) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.occsdodMethod
occsdod(object, X; kwargs...)

One-class classification using a compromise between PCA/PLS score (SD) and orthogonal (OD) distances.

  • object : The preliminary model (e.g. PCA) that was fitted on the training data assumed to represent the training class.
  • X : Training X-data (n, p), on which the model was fitted.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of a given observation is a compromise between the score distance (SD) and the orthogonal distance (OD). The compromise is computed from the standardized distances by:

  • dstand = sqrt(dstand_sd * dstand_od).

See functions:

  • occsd for details of the outputs,
  • and occod for examples.
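
As a purely numerical illustration of the compromise formula above (hypothetical standardized distances, not the function's API):

dstand_sd = [0.5, 1.2, 0.8]
dstand_od = [0.9, 1.5, 0.4]
dstand = sqrt.(dstand_sd .* dstand_od)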
source
Jchemo.occstahMethod
occstah(X; kwargs...)

One-class classification using the Stahel-Donoho outlierness.

  • X : Training X-data (n, p).

Keyword arguments:

  • nlv : Nb. dimensions on which X is projected.
  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.
  • scal : Boolean. If true, each column of X is scaled as in function stah.

In this method, the outlierness d of a given observation is the Stahel-Donoho outlierness (see ?stah).

See function occsd for details on outputs.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod = model(pcasvd; nlv = 10) 
fit!(mod, zXtrain) 
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary dimension 
## Not required but often more 
## efficient
nlv = 50
mod0 = model(pcasvd; nlv) ;
fit!(mod0, zXtrain)
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
## Outlierness
mod = model(occstah; nlv, scal = true)
fit!(mod, Ttrain) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), xlabel = "Obs. index", 
    ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, Ttest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.outMethod
out(x, y)

Return whether the elements of a vector are strictly outside the range of a second vector.

  • x : Univariate data.
  • y : Univariate data on which is computed the range (min, max).

Return a BitVector.

Examples

x = [-200.; -100; -1; 0; 1; 200]
out(x, [-1; .2; 1])
out(x, (-1, 1))
source
Jchemo.pcaeigenMethod
pcaeigen(X; kwargs...)
pcaeigen(X, weights::Weight; kwargs...)
pcaeigen!(X::Matrix, weights::Weight; kwargs...)

PCA by Eigen factorization.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of X' * D * X.

See function pcasvd for examples.
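
A minimal usage sketch (random data, for illustration only), in addition to the pcasvd examples:

X = rand(20, 5)
mod = model(pcaeigen; nlv = 3)
fit!(mod, X)
@head mod.fm.T    # scores
mod.fm.P          # loadings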

source
Jchemo.pcaeigenkMethod
pcaeigenk(X; kwargs...)
pcaeigenk(X, weights::Weight; kwargs...)
pcaeigenk!(X::Matrix, weights::Weight; kwargs...)

PCA by Eigen factorization of the kernel matrix XX'.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

This is the "kernel cross-product" version of the PCA algorithm (e.g. Wu et al. 1997). For wide matrices (n << p, where p is the nb. columns) and n not too large, this algorithm can be much faster than the others.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of D^(1/2) * X * X' * D^(1/2).

See function pcasvd for examples.

References

Wu, W., Massart, D.L., de Jong, S., 1997. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems 36, 165-172. https://doi.org/10.1016/S0169-7439(97)00010-5
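
A minimal usage sketch (random wide matrix, for illustration of the n << p case), in addition to the pcasvd examples:

X = rand(10, 200)    # n << p
mod = model(pcaeigenk; nlv = 3)
fit!(mod, X)
@head mod.fm.T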

source
Jchemo.pcanipalsMethod
pcanipals(X; kwargs...)
pcanipals(X, weights::Weight; kwargs...)
pcanipals!(X::Matrix, weights::Weight; kwargs...)

PCA by NIPALS algorithm.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D by NIPALS.

See function pcasvd for examples.

References

Andrecut, M., 2009. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology 16, 1593-1599. https://doi.org/10.1089/cmb.2008.0221

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Gabriel, R. K., 2002. Le biplot - Outil d'exploration de données multidimensionnelles. Journal de la Société Française de la Statistique, 143, 5-55.

Lingen, F.J., 2000. Efficient Gram-Schmidt orthonormalisation on parallel computers. Communications in Numerical Methods in Engineering 16, 57-66. https://doi.org/10.1002/(SICI)1099-0887(200001)16:1<57::AID-CNM320>3.0.CO;2-I

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
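
A minimal usage sketch (random data, for illustration only) showing the Gram-Schmidt option, in addition to the pcasvd examples:

X = rand(20, 5)
mod = model(pcanipals; nlv = 3, gs = true)
fit!(mod, X)
@head T = mod.fm.T
T' * T    # scores stay orthogonal when gs = true (see also the pcanipalsmiss example)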

source
Jchemo.pcanipalsmissMethod
pcanipalsmiss(X; kwargs...)
pcanipalsmiss(X, weights::Weight; kwargs...)
pcanipalsmiss!(X::Matrix, weights::Weight; kwargs...)

PCA by NIPALS algorithm allowing missing data.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

References

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/

Examples

X = [1 2. missing 4 ; 4 missing 6 7 ; 
    missing 5 6 13 ; missing 18 7 6 ; 
    12 missing 28 7] 

nlv = 3 
tol = 1e-15
scal = false
#scal = true
gs = false
#gs = true
mod = model(pcanipalsmiss; nlv, tol, gs, maxit = 500, scal)
fit!(mod, X)
pnames(mod) 
pnames(mod.fm)
fm = mod.fm ;
fm.niter
fm.sv
fm.P
fm.T
## Orthogonality 
## only if gs = true
fm.T' * fm.T
fm.P' * fm.P

## Impute missing data in X
mod = model(pcanipalsmiss; nlv = 2, gs = true) ;
fit!(mod, X)
Xfit = xfit(mod.fm)
s = ismissing.(X)
X_imput = copy(X)
X_imput[s] .= Xfit[s]
X_imput
source
Jchemo.pcasphMethod
pcasph(X; kwargs...)
pcasph(X, weights::Weight; kwargs...)
pcasph!(X::Matrix, weights::Weight; kwargs...)

Spherical PCA.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Spherical PCA (Locantore et al. 1999, Maronna 2005, Daszykowski et al. 2007). Matrix X is centered by the spatial median computed by function Jchemo.colmedspa.

References

Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016

Locantore N., Marron J.S., Simpson D.G., Tripoli N., Zhang J.T., Cohen K.L. Robust principal component analysis for functional data, Test 8 (1999) 1–7

Maronna, R., 2005. Principal components and orthogonal regression based on robust scales, Technometrics, 47:3, 264-273, DOI: 10.1198/004017005000000166

Examples

using JchemoData, JLD2, CairoMakie 
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2") 
@load db dat
pnames(dat)
X = dat.X 
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)

nlv = 6
mod = model(pcasph; nlv)  
#mod = model(pcasvd; nlv) 
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
transf(mod, X)

i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = "PC1", 
    ylabel = "PC2").f
source
Jchemo.pcasvdMethod
pcasvd(X; kwargs...)
pcasvd(X, weights::Weight; kwargs...)
pcasvd!(X::Matrix, weights::Weight; kwargs...)

PCA by SVD factorization.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing a SVD factorization of sqrt(D) * X:

  • sqrt(D) * X ~ U * S * V'

Outputs are:

  • T = D^(-1/2) * U * S
  • P = V
  • The diagonal of S

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
@head Xtrain = X[s.train, :]
@head Xtest = X[s.test, :]

nlv = 3
mod = model(pcasvd; nlv)
#mod = model(pcaeigen; nlv)
#mod = model(pcaeigenk; nlv)
#mod = model(pcanipals; nlv)
fit!(mod, Xtrain)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
@head transf(mod, X)
T' * T
@head P = mod.fm.P
P' * P

@head Ttest = transf(mod, Xtest)

res = summary(mod, Xtrain) ;
pnames(res)
res.explvarx
res.contr_var
res.coord_var
res.cor_circle
source
Jchemo.pcrMethod
pcr(X, Y; kwargs...)
pcr(X, Y, weights::Weight; kwargs...)
pcr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Principal component regression (PCR) with a SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
mod = model(pcr; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.pipMethod
pip(args...)

Build a pipeline of models.

  • args... : Successive models, see examples.

Examples

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Pipeline Snv :> Savgol :> Pls :> Svmr

mod1 = model(snv; centr = true, scal = true)
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
mod3 = model(plskern; nlv = 15)
mod4 = model(svmr; gamma = 1e3, cost = 100, epsilon = .9)
mod = pip(mod1, mod2, mod3, mod4)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f
source
Jchemo.plotconfMethod
plotconf(object; size = (500, 400), cnt = true, ptext = true, 
    fontsize = 15, coldiag = :red, )

Plot a conf matrix.

  • object : Output of function conf.

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • cnt : Boolean. If true, plot the occurrences, else plot the row %s.
  • ptext : Boolean. If true, display the value in each cell.
  • fontsize : Font size when ptext = true.
  • coldiag : Font color when ptext = true.

See examples in the help page of function conf.
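
A minimal sketch (hypothetical class vectors, assuming conf accepts plain vectors as in the examples of its help page):

using CairoMakie
ytest = ["a"; "a"; "b"; "b"; "c"]
pred = ["a"; "b"; "b"; "b"; "c"]
res = conf(pred, ytest)
plotconf(res)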

source
Jchemo.plotgridMethod
plotgrid(indx::AbstractVector, r; 
    size = (500, 300), step = 5, color = nothing, 
    kwargs...)
plotgrid(indx::AbstractVector, r, group; 
    size = (700, 350), step = 5, color = nothing, 
    leg = true, leg_title = "Group", kwargs...)

Plot error/performance rates of a model.

  • indx : A numeric variable representing the grid of model parameters, e.g. the nb. of LVs for PLSR models.
  • r : The error/performance rate.

Keyword arguments:

  • group : Categorical variable defining groups. A separate line is plotted for each level of group.
  • size : Size (horizontal, vertical) of the figure.
  • step : Step used for defining the xticks.
  • color : Set color. If group is used, must be a vector of the same length as the number of levels in group.
  • leg : Boolean. If group is used, display a legend or not.
  • leg_title : Title of the legend.
  • kwargs : Optional arguments to pass in Axis of CairoMakie.

To use plotgrid, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

mod = plskern() 
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain, 
    Xtest, ytest; score = rmsep, nlv)
plotgrid(res.nlv, res.y1;
    xlabel = "Nb. LVs", ylabel = "RMSEP").f

mod = lwplsr() 
nlvdis = 15 ; metric = [:mah]
h = [1 ; 2.5 ; 5] ; k = [50 ; 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, 
    h = h, k = k)
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain, 
    Xtest, ytest; score = rmsep, 
    pars, nlv)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group;
    xlabel = "Nb. LVs", ylabel = "RMSECV").f
source
Jchemo.plotspFunction
plotsp(X, wl = 1:nco(X); size = (500, 300), color = nothing, 
    nsamp = nothing, kwargs...)

Plotting spectra.

  • X : X-data (n, p).
  • wl : Column names of X. Must be numeric.

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • color : Set a unique color (and eventually transparency) to the spectra.
  • nsamp : Nb. spectra (X-rows) to plot. If nothing, all spectra are plotted.
  • kwargs : Optional arguments to pass in Axis of CairoMakie.

The function plots the rows of X.

To use plotsp, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst) 

plotsp(X).f
plotsp(X; color = (:red, .2)).f
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f

f, ax = plotsp(X, wl; color = (:red, .2))
xmeans = colmean(X)
lines!(ax, wl, xmeans; color = :black, linewidth = 2)
vlines!(ax, 1200)
f
source
Jchemo.plotxyMethod
plotxy(x, y; size = (500, 300), color = nothing, ellipse::Bool = false, 
    prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
    xlabel = "", ylabel = "", title = "", kwargs...)
plotxy(x, y, group; size = (600, 350), color = nothing, ellipse::Bool = false, 
    prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
    xlabel = "", ylabel = "", title = "", leg::Bool = true, leg_title = "Group", 
    kwargs...)

Scatter plot of (x, y) data.

  • x : A x-vector (n).
  • y : A y-vector (n).
  • group : Categorical variable defining groups (n).

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • color : Set color(s). If group is used, color must be a vector of the same length as the number of levels in group.
  • ellipse : Boolean. Draw an ellipse of confidence, assuming a Chi-square distribution with df = 2. If group is used, one ellipse is drawn per group.
  • prob : Probability for the ellipse of confidence.
  • bisect : Boolean. Draw a bisector.
  • zeros : Boolean. Draw horizontal and vertical axes passing through origin (0, 0).
  • xlabel : Label for the x-axis.
  • ylabel : Label for the y-axis.
  • title : Title of the graphic.
  • leg : Boolean. If group is used, display a legend or not.
  • leg_title : Title of the legend.
  • kwargs : Optional arguments to pass in function scatter of Makie.

To use plotxy, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
lev = mlev(year)
nlev = length(lev)

mod = model(pcasvd; nlv = 5)  
fit!(mod, X) 
@head T = mod.fm.T

plotxy(T[:, 1], T[:, 2]; color = (:red, .5)).f

plotxy(T[:, 1], T[:, 2], year; ellipse = true, xlabel = "PC1", 
    ylabel = "PC2").f

i = 2
colm = cgrad(:Dark2_5, nlev; categorical = true)
plotxy(T[:, i], T[:, i + 1], year; color = colm, xlabel = string("PC", i), 
    ylabel = string("PC", i + 1), zeros = true, ellipse = true).f

plotxy(T[:, 1], T[:, 2], year).lev

plotxy(1:5, 1:5).f

y = reshape(rand(5), 5, 1)
plotxy(1:5, y).f

## Several layers can be added
## (same syntax as in Makie)
A = rand(50, 2)
f, ax = plotxy(A[:, 1], A[:, 2]; xlabel = "x1", ylabel = "x2")
ylims!(ax, -1, 2)
hlines!(ax, 0.5; color = :red, linestyle = :dot)
f
source
Jchemo.plscanMethod
plscan(X, Y; kwargs...)
plscan(X, Y, weights::Weight; kwargs...)
plscan!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical partial least squares regression (Canonical PLS).

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

Canonical PLS with the Nipals algorithm (Wold 1984, Tenenhaus 1998 chap.11), referred to as PLS-W2A (i.e. Wold PLS mode A) in Wegelin 2000. The two blocks X and Y play a symmetric role. After each step of scores computation, X and Y are deflated by the x- and y-scores, respectively.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob
mod = model(plscan; nlv, bscal)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.plskdedaMethod
plskdeda(X, y; kwargs...)
plskdeda(X, y, weights::Weight; kwargs...)

KDE-DA on PLS latent variables (PLS-KDEDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The principle is the same as functions plslda and plsqda except that class densities are estimated from dmkern instead of dmnorm.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plskdeda; nlv) 
#mod = model(plskdeda; nlv, a_kde = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
source
Jchemo.plskernMethod
plskern(X, Y; kwargs...)
plskern(X, Y, weights::Weight; kwargs...)
plskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial least squares regression (PLSR) with the "improved kernel algorithm #1" (Dayal & McGegor, 1997).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

About the row-weighting in PLS algorithms (weights): See in particular Schaal et al. 2002, Sicard & Sabatier 2006, Kim et al. 2011, and Lesnoff et al. 2020.

References

Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.

Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.

Lesnoff, M., Metz, M., Roger, J.M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR Data. Journal of Chemometrics. e3209. https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.3209

Schaal, S., Atkeson, C., Vijayamakumar, S. 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.

Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
mod = model(plskern; nlv) ;
#mod = model(plsnipals; nlv) ;
#mod = model(plswold; nlv) ;
#mod = model(plsrosa; nlv) ;
#mod = model(plssimp; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.plsldaMethod
plslda(X, y; kwargs...)
plslda(X, y, weights::Weight; kwargs...)

LDA on PLS latent variables (PLS-LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

LDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a LDA is done on {T, y}.

In these plslda functions, observation weights (argument weights) are used to compute the PLS scores and the LDA intra-class (= "within") covariance matrix. Argument prior is used to define the usual LDA prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors: the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plslda; nlv) 
#mod = model(plslda; nlv, prior = :prop) 
#mod = model(plsqda; nlv, alpha = .1) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
source
Jchemo.plsnipalsMethod
plsnipals(X, Y; kwargs...)
plsnipals(X, Y, weights::Weight; kwargs...)
plsnipals!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the Nipals algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

In this function, for PLS2 (multivariate Y), the Nipals iterations are replaced by a direct computation of the PLS weights (w) by SVD decomposition of matrix X'Y (Hoskuldsson 1988 p.213).

See function plskern for examples.

References

Hoskuldsson, A., 1988. PLS regression methods. Journal of Chemometrics 2, 211-228. https://doi.org/10.1002/cem.1180020306

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool for chemometrics. Chem. Int. Lab. Syst., 58, 109-130.
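
A minimal sketch (centered, unweighted random data, for illustration only) of the SVD-based computation of the first PLS2 weight vector mentioned above:

using LinearAlgebra, Statistics
n, p, q = 20, 5, 3
X = rand(n, p)
Y = rand(n, q)
Xc = X .- mean(X, dims = 1)
Yc = Y .- mean(Y, dims = 1)
w1 = svd(Xc' * Yc).U[:, 1]    # first X-weight vector (up to sign)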

source
Jchemo.plsqdaMethod
plsqda(X, y; kwargs...)
plsqda(X, y, weights::Weight; kwargs...)

QDA on PLS latent variables (PLS-QDA) with continuum.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

QDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a QDA (possibly with continuum) is done on {T, y}.

See functions qda and plslda for details (arguments weights, prior and alpha) and examples.
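
A minimal usage sketch (hypothetical data), following the plslda examples:

n, p = 50, 10
X = rand(n, p)
y = rand(["a", "b", "c"], n)
mod = model(plsqda; nlv = 5, alpha = .1)
fit!(mod, X, y)
predict(mod, X).pred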

source
Jchemo.plsravgMethod
plsravg(X, Y; kwargs...)
plsravg(X, Y, weights::Weight; kwargs...)
plsravg!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Averaging PLSR models with different numbers of latent variables (PLSR-AVG).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : A range of nb. of latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Ensemblist method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs.

For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ... 10 LVs, respectively.

References

Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2") 
@load db dat
pnames(dat)
X = dat.X 
Y = dat.Y
@head Y
y = Y.ndf
#y = Y.dm
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(y, s)
Xtest = X[s, :]
ytest = y[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)

nlv = 0:30
#nlv = 5:20
#nlv = 25
mod = model(plsravg; nlv) ;
fit!(mod, Xtrain, ytrain)

res = predict(mod, Xtest)
@head res.pred
res.predlv   # predictions for each nb. of LVs 
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    
source
Jchemo.plsrdaMethod
plsrda(X, y; kwargs...)
plsrda(X, y, weights::Weight; kwargs...)

Discrimination based on partial least squares regression (PLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

This is the usual "PLSDA" (prediction of the Y-dummy table by a PLS2 regression). The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by fuction predict). These predictions can be considered as unbounded estimates (i.e. eventuall outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.

In the high-level version of the function, the observation weights used in the PLS2-R are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plsrda; nlv) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
source
Jchemo.plsrosaMethod
plsrosa(X, Y; kwargs...)
plsrosa(X, Y, weights::Weight; kwargs...)
plsrosa!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the ROSA algorithm (Liland et al. 2016).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Note: The function has the following differences with the original algorithm of Liland et al. (2016):

  • Scores T (LVs) are not normed.
  • Multivariate Y is allowed.

See function plskern for examples.

References

Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA—a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824

source
Jchemo.plssimpMethod
plssimp(X, Y; kwargs...)
plssimp(X, Y, weights::Weight; kwargs...)
plssimp!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the SIMPLS algorithm (de Jong 1993).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Note: In this function, scores T (LVs) are not normed, contrary to the original algorithm of de Jong (1993).

See function plskern for examples.

References

de Jong, S., 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263. https://doi.org/10.1016/0169-7439(93)85002-X

source
Jchemo.plstuckMethod
plstuck(X, Y; kwargs...)
plstuck(X, Y, weights::Weight; kwargs...)
plstuck!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Tucker's inter-battery method of factor analysis.

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

Inter-battery method of factor analysis (Tucker 1958, Tenenhaus 1998 chap.3). The two blocks X and Y play a symmetric role. This method is referred to as PLS-SVD in Wegelin 2000. The basis of the method is to factorize the covariance matrix X'Y by SVD.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Tishler, A., Lipovetsky, S., 2000. Modelling and forecasting with robust canonical analysis: method and application. Computers & Operations Research 27, 217–232. https://doi.org/10.1016/S0305-0548(99)00014-3

Tucker, L.R., 1958. An inter-battery method of factor analysis. Psychometrika 23, 111–136. https://doi.org/10.1007/BF02289009

Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X 
Y = dat.Y

fm = plstuck(X, Y; nlv = 3)
pnames(fm)

fm.Tx
transf(fm, X, Y).Tx
fscale(fm.Tx, colnorm(fm.Tx))

res = summary(fm, X, Y)
pnames(res)
source
Jchemo.plswoldMethod
plswold(X, Y; kwargs...)
plswold(X, Y, weights::Weight; kwargs...)
plswold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the Wold algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • tol : Tolerance for the Nipals algorithm.
  • maxit : Maximum number of iterations for the Nipals algorithm.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Wold Nipals PLSR algorithm: Tenenhaus 1998 p.204.

See function plskern for examples.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
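
A minimal usage sketch (random data, for illustration only) showing the Nipals tuning arguments, in addition to the plskern examples:

X = rand(30, 8)
Y = rand(30, 2)
mod = model(plswold; nlv = 3, tol = 1e-10, maxit = 200)
fit!(mod, X, Y)
@head mod.fm.T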

source
Jchemo.predictMethod
predict(object::CalDs, X; kwargs...)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::CalPds, X; kwargs...)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Cglsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. iterations, or collection of nb. iterations, to consider.
source
Jchemo.predictMethod
predict(object::Dkplsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Dmkern, x)

Compute predictions from a fitted model.

  • object : The fitted model.
  • x : Data (vector) for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Dmnorm, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : Data (vector) for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Knnda1, X)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Knnr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Kplsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.

If nothing, it is the maximum nb. LVs.

source
Jchemo.predictMethod
predict(object::Krr, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider.
source
Jchemo.predictMethod
predict(object::Lda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwmlr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwmlrda, X)

Compute y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplslda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsqda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsr, X; nlv = nothing)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::LwplsrAvg, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsrda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Mbplslda, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mbplsrda, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mlrda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occod, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occsd, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occsdod, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occstah, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Plslda, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Plsravg, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Plsrda, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Qda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Rosaplsr, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Rr, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider.
source
Jchemo.predictMethod
predict(object::Rrda, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider. If nothing, it is the parameter stored in the fitted model.
source
Jchemo.predictMethod
predict(object::Soplsr, Xbl)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Svmda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Svmr, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::TreedaDt, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::TreerDt, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mlr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Union{Plsr, Pcr, Splsr}, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.pvalMethod
pval(d::Distribution, q)
pval(x::Array, q)
pval(e_cdf::ECDF, q)

Compute p-value(s) for a distribution, an ECDF or vector.

  • d : A distribution computed from Distribution.jl.
  • x : Univariate data.
  • e_cdf : An ECDF computed from StatsBase.jl.
  • q : Value(s) for which to compute the p-value(s).

Compute or estimate the p-value of quantile q, i.e. P(Q > q) where Q is the random variable.

Examples

using Distributions, StatsBase

d = Distributions.Normal(0, 1)
q = 1.96
#q = [1.64; 1.96]
Distributions.cdf(d, q)    # cumulative distribution function (CDF)
Distributions.ccdf(d, q)   # complementary CDF (CCDF)
pval(d, q)                 # Distributions.ccdf

x = rand(5)
e_cdf = StatsBase.ecdf(x)
e_cdf(x)                # empirical CDF computed at each point of x (ECDF)
p_val = 1 .- e_cdf(x)   # complementary ECDF at each point of x
q = .3
#q = [.3; .5; 10]
pval(e_cdf, q)          # 1 .- e_cdf(q)
pval(x, q)
source
Jchemo.qdaMethod
qda(X, y; kwargs...)
qda(X, y, weights::Weight; kwargs...)

Quadratic discriminant analysis (QDA, with continuum towards LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).

A value alpha > 0 shrinks the within-class covariances (Wi) toward a common LDA ("within") covariance W. This corresponds to the "first regularization (Eqs.16)" described in Friedman 1989 (where alpha is referred to as "lambda").

In these qda functions, observation weights (argument weights) are used to compute covariance matrices Wi and W. Argument prior is used to define the usual prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

References

Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = model(qda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

## With regularization
mod = model(qda; alpha = .5)
#mod = model(qda; alpha = 1) # = LDA
fit!(mod, Xtrain, ytrain)
mod.fm.Wi
res = predict(mod, Xtest) ;
errp(res.pred, ytest)
source
Jchemo.r2Method
r2(pred, Y)

Compute the R2 coefficient.

  • pred : Predictions.
  • Y : Observed data.

The coefficient R2 is calculated as:

  • R2 = 1 - MSEP(current model) / MSEP(null model)

where the "null model" is the overall mean. For predictions over CV or test sets, and/or for non linear models, it can be different from the square of the correlation coefficient (cor2) between the true data and the predictions.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
r2(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
r2(pred, ytest)
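
A minimal hand check of the definition above (illustration only; univariate case, with the "null model" mean taken here over the observed data):

using Statistics
msep_model = sum((ytest .- vec(pred)).^2) / length(ytest)
msep_null = sum((ytest .- mean(ytest)).^2) / length(ytest)
1 - msep_model / msep_null    # to be compared with r2(pred, ytest)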
source
Jchemo.rasvdMethod
rasvd(X, Y; kwargs...)
rasvd(X, Y, weights::Weight; kwargs...)
rasvd!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Redundancy analysis (RA), aka PCA on instrumental variables (PCAIV)

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

See e.g. Bougeard et al. 2011a,b and Legendre & Legendre 2012. Let Yhat be the fitted values of the regression of Y on X. The scores Ty are the PCA scores of Yhat. The scores Tx are the fitted values of the regression of Ty on X.

A continuum regularization is available. After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get results similar to those obtained with pseudo-inverses.
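
As an illustration only (not part of the original method), a regularized covariance of this form can be sketched as follows, assuming X is already centered/scaled and w are the observation weights:

using LinearAlgebra
n, p = 5, 3
X = rand(n, p)
w = ones(n) / n               # uniform observation weights (metric D)
D = Diagonal(w)
tau = 1e-4
Cx = (1 - tau) * X' * D * X + tau * I(p)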

References

Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011-a. Multiblock redundancy analysis from a user's perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214. https://doi.org/10.1285/i20705948v4n2p203

Bougeard, S., Qannari, E.M., Rose, N., 2011-b. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467-475. https://doi.org/10.1002/cem.1392

Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier, Amsterdam, The Netherlands.

Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Multiblock data analysis. https://cran.r-project.org/web/packages/RGCCA/index.html

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(rasvd; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.rdMethod
rd(X, Y; typ = :cor)
rd(X, Y, weights::Weight; typ = :cor)

Compute redundancy coefficients between two matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • typ : Possibles values are: :cor (correlation), :cov (uncorrected covariance).

Returns the redundancy coefficient between X and each column of Y, i.e.:

rd(X, yk) = (1 / p) * Sum(j = 1, ..., p) cor(xj, yk)^2, computed for each column yk (k = 1, ..., q) of Y.

See Tenenhaus 1998 section 2.2.1 p.10-11.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Examples

X = rand(5, 10)
Y = rand(5, 3)
rd(X, Y)
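
A minimal sketch checking this definition against base Statistics (typ = :cor, unweighted case):

using Statistics
p = size(X, 2)
[sum(cor(X[:, j], Y[:, k])^2 for j = 1:p) / p for k = 1:size(Y, 2)]    # compare with rd(X, Y)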
source
Jchemo.rdaMethod
rda(X, y; kwargs...)
rda(X, y, weights::Weight; kwargs...)

Regularized discriminant analysis (RDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • lb : Ridge regularization parameter "lambda" (>= 0).
  • simpl : Boolean. See function dmnorm.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note W the (corrected) pooled within-class covariance matrix and Wi the (corrected) within-class covariance matrix of class i. The regularization is done by the two following successive steps (for each class i):

  1. Continuum between QDA and LDA: Wi(1) = (1 - alpha) * Wi + alpha * W
  2. Ridge regularization: Wi(2) = Wi(1) + lb * I

Then the QDA algorithm is run on matrices {Wi(2)}.
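
A minimal numerical sketch of these two steps, on hypothetical small covariance matrices (illustration only):

using LinearAlgebra
Wi = [1.0 0.3; 0.3 2.0]    # within-class covariance of class i (hypothetical)
W = [1.5 0.1; 0.1 1.5]     # pooled within-class covariance (hypothetical)
alpha = .5 ; lb = 1e-8
Wi1 = (1 - alpha) * Wi + alpha * W    # step 1: continuum QDA <--> LDA
Wi2 = Wi1 + lb * I(2)                 # step 2: ridge regularization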

Function rda differs slightly from the regularization expression used by Friedman 1989 (Eq.18): it shrinks the covariance matrices toward the identity matrix (ridge regularization; e.g. Guo et al. 2007).

Particular cases:

  • alpha = 1 & lb = 0 : LDA
  • alpha = 0 & lb = 0 : QDA
  • alpha = 1 & lb > 0 : Penalized LDA (Hastie et al 1995) with diagonal regularization matrix

See functions lda and qda for other details (arguments weights and prior).

References

Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.

Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007; 8(1):86-100. doi:10.1093/biostatistics/kxj035.

Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized Discriminant Analysis. The Annals of Statistics 23, 73–102.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

alpha = .5
lb = 1e-8
mod = model(rda; alpha, lb)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.recodcat2intMethod
recodcat2int(x; start = 1)

Recode a categorical variable to an integer variable.

  • x : Variable to recode.
  • start : Integer value assigned to the first (sorted) category.

The integers returned by the function correspond to the sorted levels (categories) of x.

Examples

x = ["b", "a", "b"]   
[x recodcat2int(x)]
recodcat2int(x; start = 0)
recodcat2int([25, 1, 25])
source
Jchemo.recodnum2intMethod
recodnum2int(x, q)

Recode a continuous variable to integer classes.

  • x : Variable to recode.
  • q : Values separating the classes.

Examples

using Statistics
x = [collect(1:10); 8.1 ; 3.1] 
q = [3; 8]
zx = recodnum2int(x, q)  
[x zx]
probs = [.33; .66]
q = quantile(x, probs) 
zx = recodnum2int(x, q)  
[x zx]
source
Jchemo.replacebylevMethod
replacebylev(x, lev)

Replace the elements of a vector by levels of corresponding order.

  • x : Vector (n) of values to replace.
  • lev : Vector (nlev) containing the levels.

Warning: x and lev must contain the same number (nlev) of levels.

The ith sorted level in x is replaced by the ith sorted level of lev.

Examples

x = [10; 4; 3; 3; 4; 4]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev(x, lev)]
zx = string.(x)
[zx replacebylev(zx, lev)]

lev = [3; 0; -1]
[x replacebylev(x, lev)]
source
Jchemo.replacebylev2Method
replacebylev2(x::Union{Int, Array{Int}}, lev::Array)

Replace the elements of an index-vector by levels.

  • x : Vector (n) of values to replace.
  • lev : Vector (nlev) containing the levels.

Warning: Let us note nlev the number of levels in lev. Vector x must contain integer values between 1 and nlev.

Each element xi is replaced by sort(lev)[x[i]].

Examples

x = [2; 1; 2; 2]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev2(x, lev)]
replacebylev2([2], lev)
replacebylev2(2, lev)

x = [2; 1; 2]
lev = [3; 0; -1]
replacebylev2(x, lev)
source
Jchemo.replacedictMethod
replacedict(x, dict)

Replace the elements of a vector by levels defined in a dictionary.

  • x : Vector (n) of values to replace.
  • dict : A dictionary giving the correspondences between the old and the new values.

Examples

dict = Dict("a" => 1000, "b" => 1, "c" => 2)

x = ["c"; "c"; "a"; "a"; "a"]
replacedict(x, dict)

x = ["c"; "c"; "a"; "a"; "a"; "e"]
replacedict(x, dict)
source
Jchemo.residclaMethod
residcla(pred, y)

Compute the discrimination residual vector (0 = no error, 1 = error).

  • pred : Predictions.
  • y : Observed data (class membership).

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residcla(pred, ytest)
source
Jchemo.residregMethod
residreg(pred, Y)

Compute the regression residual vector.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, ytest)
source
Jchemo.rfda_dtMethod
rfda_dt(X, y; kwargs...)

Random forest discrimination with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • n_trees : Nb. trees built for the forest.
  • partial_sampling : Proportion of sampled observations for each tree.
  • n_subfeatures : Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
  • max_depth : Maximum depth of the decision trees (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • mth : Boolean indicating if multi-threading is used when new data are predicted with function predict.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
  • Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.

The function fits a random forest discrimination model using package DecisionTree.jl.

References

Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

n_trees = 200
n_subfeatures = p / 3 
max_depth = 10
mod = model(rfda_dt; n_trees, n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.rfr_dtMethod
rfr_dt(X, y; kwargs...)

Random forest regression with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • n_trees : Nb. trees built for the forest.
  • partial_sampling : Proportion of sampled observations for each tree.
  • n_subfeatures : Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
  • max_depth : Maximum depth of the decision trees (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • mth : Boolean indicating if multi-threading is used when new data are predicted with function predict.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
  • Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.

The function fits a random forest regression model using package DecisionTree.jl.

References

Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

n_trees = 200
n_subfeatures = p / 3
max_depth = 15
mod = model(rfr_dt; n_trees, n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
source
Jchemo.rmcolMethod
rmcol(X, s)

Remove the columns of a matrix or the components of a vector having indexes s.

  • X : Matrix or vector.
  • s : Vector of the indexes.

Examples

X = rand(5, 3) 
rmcol(X, [1, 3])
source
Jchemo.rmgapMethod
rmgap(X; kwargs...)

Remove vertical gaps in spectra (e.g. for ASD).

  • X : X-data (n, p).

Keyword arguments:

  • indexcol : Indexes (∈ [1, p]) of the X-columns where the gaps to remove are located.
  • npoint : The number of X-columns used on the left side of each gap for fitting the linear regressions.

For each spectrum (row-observation of matrix X) and each defined gap, the correction is done by extrapolation from a simple linear regression computed on the left side of the gap.

For instance, if two gaps are observed between column-indexes 651-652 and between column-indexes 1425-1426, respectively, the syntax should be indexcol = [651 ; 1425].

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/asdgap.jld2") 
@load db dat
pnames(dat)
X = dat.X
wlst = names(dat.X)
wl = parse.(Float64, wlst)

wl_target = [1000 ; 1800] 
indexcol = findall(in(wl_target).(wl))

f, ax = plotsp(X, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f

## Corrected data
mod = model(rmgap; npoint = 5, indexcol)
fit!(mod, X)
Xc = transf(mod, X)
f, ax = plotsp(Xc, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
source
Jchemo.rmrowMethod
rmrow(X, s)

Remove the rows of a matrix or the components of a vector having indexes s.

  • X : Matrix or vector.
  • s : Vector of the indexes.

Examples

X = rand(5, 2) 
rmrow(X, [1, 4])
source
Jchemo.rmsepMethod
rmsep(pred, Y)

Compute the square root of the mean of the squared prediction errors (RMSEP).

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, ytest)
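
A minimal hand computation of the definition above, for the univariate case:

sqrt(sum((ytest .- vec(pred)).^2) / length(ytest))    # to be compared with rmsep(pred, ytest)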
source
Jchemo.rmsepstandMethod
rmsepstand(pred, Y)

Compute the standardized square root of the mean of the squared prediction errors (RMSEP_stand).

  • pred : Predictions.
  • Y : Observed data.

RMSEP is standardized to Y:

  • RMSEP_stand = RMSEP ./ Y.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, ytest)
source
Jchemo.rosaplsrMethod
rosaplsr(Xbl, Y; kwargs...)
rosaplsr(Xbl, Y, weights::Weight; kwargs...)
rosaplsr!(Xbl::Vector, Y::Matrix, weights::Weight; kwargs...)

Multiblock ROSA PLSR (Liland et al. 2016).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, the output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

The function has the following differences with the original algorithm of Liland et al. (2016):

  • Scores T are not normed to 1.
  • Multivariate Y is allowed. In such a case, the squared residuals are summed over the columns for finding the winning block for each global LV (therefore the Y-columns should have the same scale).

References

Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA — a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
scal = false
#scal = true
mod = model(rosaplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)
source
Jchemo.rowmeanMethod
rowmean(X)

Compute row-wise means of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowmean(X)
source
Jchemo.rownormMethod
rownorm(X)

Compute row-wise norms of a matrix.

  • X : Data (n, p).

The norm computed for a row x of X is:

  • sqrt(x' * x)

Return a vector.

Note: Thanks to @mcabbott at https://discourse.julialang.org/t/orders-of-magnitude-runtime-difference-in-row-wise-norm/96363.

Examples

n, p = 5, 6
X = rand(n, p)

rownorm(X)
source
Jchemo.rowstdMethod
rowstd(X)

Compute row-wise standard deviations (uncorrected) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowstd(X)
source
Jchemo.rowsumMethod
rowsum(X)

Compute row-wise sums of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

X = rand(5, 2) 
rowsum(X)
source
Jchemo.rowvarMethod
rowvar(X)

Compute row-wise variances (uncorrected) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowvar(X)
source
Jchemo.rpMethod
rp(X; kwargs...)
rp(X, weights::Weight; kwargs...)
rp!(X::Matrix, weights::Weight; kwargs...)

Make a random projection of X-data.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. dimensions on which X is projected.
  • mrp : Method of random projection. Possible values are: :gauss, :li. See the respective functions rpmatgauss and rpmatli for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Examples

n, p = (5, 10)
X = rand(n, p)
nlv = 3
mrp = :li ; s_li = sqrt(p) 
#mrp = :gauss
mod = model(rp; nlv, mrp, s_li)
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T 
@head mod.fm.P 
transf(mod, X[1:2, :])
source
Jchemo.rpdMethod
rpd(pred, Y)

Compute the ratio "deviation to model performance" (RPD).

  • pred : Predictions.
  • Y : Observed data.

This is the ratio of the deviation to the model performance, defined by:

  • RPD = Std(Y) / RMSEP

where Std(Y) is the standard deviation.

Since Std(Y) = RMSEP(null model) where the null model is the simple average, this also gives:

  • RPD = RMSEP(null model) / RMSEP

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, ytest)
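
A minimal hand computation of the ratio above for a univariate ytest (illustration only; the standard-deviation convention used internally may differ):

using Statistics
rmsep_val = sqrt(sum((ytest .- vec(pred)).^2) / length(ytest))
std(ytest; corrected = false) / rmsep_val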
source
Jchemo.rpdrMethod
rpdr(pred, Y)

Compute a robustified RPD.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, ytest)
source
Jchemo.rpmatgaussFunction
rpmatgauss(p::Int, nlv::Int, Q = Float64)

Build a gaussian random projection matrix.

  • p : Nb. variables (attributes) to project.
  • nlv : Nb. of simulated projection dimensions.
  • Q : Type of components of the built projection matrix.

The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.

P is simulated from i.i.d. N(0, 1) / sqrt(nlv).

References

Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436

Examples

p = 10 ; nlv = 3
rpmatgauss(p, nlv)
source
Jchemo.rpmatliFunction
rpmatli(p::Int, nlv::Int, Q = Float64; s_li)

Build a sparse random projection matrix (Achlioptas 2001, Li et al. 2006).

  • p : Nb. variables (attributes) to project.
  • nlv : Nb. of simulated projection dimensions.
  • Q : Type of components of the built projection matrix.

Keyword arguments:

  • s_li : Coefficient defining the sparsity of the returned matrix (the higher s_li, the higher the sparsity).

The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.

Matrix P is simulated from i.i.d. discrete sampling within values:

  • 1 with prob. 1/(2 * s)
  • 0 with prob. 1 - 1 / s
  • -1 with prob. 1/(2 * s)

Usual values for s are:

  • sqrt(p) (Li et al. 2006)
  • p / log(p) (Li et al. 2006)
  • 1 (Achlioptas 2001)
  • 3 (Achlioptas 2001)
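
As an illustration of the scheme above, the expected proportion of zeros in P is 1 - 1/s for these usual choices of s:

p = 10
for s in (1.0, 3.0, sqrt(p), p / log(p))
    println("s = ", round(s; digits = 2), "   expected prob(0) = ", round(1 - 1 / s; digits = 3))
end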

References

Achlioptas, D., 2001. Database-friendly random projections, in: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01. Association for Computing Machinery, New York, NY, USA, pp. 274–281. https://doi.org/10.1145/375551.375608

Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436

Examples

p = 10 ; nlv = 3
rpmatli(p, nlv)
source
Jchemo.rrMethod
rr(X, Y; kwargs...)
rr(X, Y, weights::Weight; kwargs...)
rr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Ridge regression (RR) implemented by SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

References

Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.

Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.

Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

lb = 1e-3
mod = model(rr; lb) 
#mod = model(rrchol; lb) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

coef(mod)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Only for function 'rr' (not for 'rrchol')
coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
source
Jchemo.rrcholMethod
rrchol(X, Y; kwargs...)
rrchol(X, Y, weights::Weight; kwargs...)
rrchol!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Ridge regression (RR) using the Normal equations and a Cholesky factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

See function rr for examples.

References

Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.

Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.

Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634

source
Jchemo.rrdaMethod
rrda(X, y; kwargs...)
rrda(X, y, weights::Weight; kwargs...)

Discrimination based on ridge regression (RR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a ridge regression (RR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. possibly outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.

In the high-level version of the function, the observation weights used in the RR are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

lb = 1e-5
mod = model(rrda; lb) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; lb = [.1; .01]).pred
source
Jchemo.rrrMethod
rrr(X, Y; kwargs...)
rrr(X, Y, weights::Weight; kwargs...)
rrr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Reduced rank regression (RRR, aka RA).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Reduced rank regression, also referred to as redundancy analysis (RA) regression. In this function, the RA uses the Nipals algorithm presented in Mangamana et al 2021, section 2.1.1.

A continuum regularization is available. After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. A better alternative is generally to use an epsilon value (e.g. tau = 1e-8) to get results similar to those obtained with pseudo-inverses.

References

Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011. Multiblock redundancy analysis from a user’s perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214–214. https://doi.org/10.1285/i20705948v4n2p203

Bougeard, S., Qannari, E.M., Rose, N., 2011. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467–475. https://doi.org/10.1002/cem.1392

Tchandao Mangamana, E., Glèlè Kakaï, R., Qannari, E.M., 2021. A general strategy for setting up supervised methods of multiblock data analysis. Chemometrics and Intelligent Laboratory Systems 217, 104388. https://doi.org/10.1016/j.chemolab.2021.104388

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 1
tau = 1e-4
mod = model(rrr; nlv, tau) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   
source
Jchemo.rvMethod
rv(X, Y; centr = true)
rv(Xbl::Vector; centr = true)

Compute the RV coefficient between matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • Xbl : A list (vector) of matrices.
  • centr : Boolean indicating if the matrices will be internally centered or not.

RV is bounded in [0, 1].

A dissimilarity measure between X and Y can be computed by d = sqrt(2 * (1 - RV)).

References

Escoufier, Y., 1973. Le Traitement des Variables Vectorielles. Biometrics 29, 751–760. https://doi.org/10.2307/2529140

Josse, J., Holmes, S., 2016. Measuring multivariate association and beyond. Stat Surv 10, 132–167. https://doi.org/10.1214/16-SS116

Josse, J., Pagès, J., Husson, F., 2008. Testing the significance of the RV coefficient. Computational Statistics & Data Analysis 53, 82–91. https://doi.org/10.1016/j.csda.2008.06.012

Kazi-Aoual, F., Hitier, S., Sabatier, R., Lebreton, J.-D., 1995. Refined approximations to permutation tests for multivariate inference. Computational Statistics & Data Analysis 20, 643–656. https://doi.org/10.1016/0167-9473(94)00064-2

Mayer, C.-D., Lorent, J., Horgan, G.W., 2011. Exploratory Analysis of Multiple Omics Datasets Using the Adjusted RV Coefficient. Statistical Applications in Genetics and Molecular Biology 10. https://doi.org/10.2202/1544-6115.1540

Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., van Erk, M.J., 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25, 401–405. https://doi.org/10.1093/bioinformatics/btn634

Robert, P., Escoufier, Y., 1976. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics) 25, 257–265. https://doi.org/10.2307/2347233

Examples

X = rand(5, 10)
Y = rand(5, 3)
rv(X, Y)

X = rand(5, 15) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
rv(Xbl)
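
The dissimilarity mentioned above can be derived directly from the coefficient (minimal sketch):

X = rand(5, 10)
Y = rand(5, 3)
d = sqrt.(2 .* (1 .- rv(X, Y)))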
source
Jchemo.sampclaFunction
sampcla(x, k::Union{Int, Vector{Int}}, y = nothing)

Build training vs. test sets by stratified sampling.

  • x : Class membership (n) of the observations.
  • k : Nb. test observations to sample in each class. If k is a single value, the nb. of sampled observations is the same for each class. Alternatively, k can be a vector of length equal to the nb. of classes in x.
  • y : Quantitative variable (n) used when systematic sampling is done.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

If y = nothing, the sampling is random, else it is systematic over the sorted y (see function sampsys).

References

Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.

Examples

x = string.(repeat(1:3, 5))
n = length(x)
tab(x)
k = 2 
res = sampcla(x, k)
res.test
x[res.test]
tab(x[res.test])

y = rand(n)
res = sampcla(x, k, y)
res.test
x[res.test]
tab(x[res.test])
source
Jchemo.sampdfFunction
sampdf(Y::DataFrame, k::Union{Int, Vector{Int}}, id = 1:nro(Y); msamp = :rand)

Build training vs. test sets from each column of a dataframe.

  • Y : DataFrame (n, p) whose each column can contain missing values.
  • k : Nb. of test observations selected for each Y column. The selection is done within the non-missing observations of the considered column. If k is a single value, the same nb. of observations are selected for each column. Alternatively, k can be a vector of length p.
  • id : Vector (n) of IDs.

Keyword arguments:

  • msamp : Type of sampling for the test set. Possible values are: :rand = random sampling, :sys = systematic sampling over each sorted Y column (see function sampsys).

Typically, dataframe Y contains a set of response variables to predict.

Examples

using DataFrames

Y = hcat([rand(5); missing; rand(6)],
   [rand(2); missing; missing; rand(7); missing])
Y = DataFrame(Y, :auto)
n = nro(Y)

k = 3
res = sampdf(Y, k) 
#res = sampdf(Y, k, string.(1:n))
pnames(res)
res.nam
length(res.test)
res.train
res.test

## Replicated splitting Train/Test
rep = 10
k = 3
ids = [sampdf(Y, k) for i = 1:rep]
length(ids)
i = 1    # replication
ids[i]
ids[i].train 
ids[i].test
j = 1    # variable y  
ids[i].train[j]
ids[i].test[j]
ids[i].nam[j]
source
Jchemo.sampdpMethod
sampdp(X, k::Int; metric = :eucl)

Build training vs. test sets by DUPLEX sampling.

  • X : X-data (n, p).
  • k : Nb. pairs (training/test) of observations to sample. Must be <= n / 2.

Keyword arguments:

  • metric : Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).

Three outputs (= row indexes of the data) are returned:

  • train (k),
  • test (k),
  • remain (n - 2 * k).

Outputs train and test are built from the DUPLEX algorithm (Snee, 1977 p.421). They are expected to cover approximately the same X-space region and have similar statistical properties.

In practice, when output remain is not empty (i.e. when there are remaining observations), one common strategy is to add it to output train.

References

Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.

Snee, R.D., 1977. Validation of Regression Models: Methods and Examples. Technometrics 19, 415-428. https://doi.org/10.1080/00401706.1977.10489581

Examples

X = [0.381392  0.00175002 ; 0.1126    0.11263 ; 
    0.613296  0.152485 ; 0.726536  0.762032 ;
    0.367451  0.297398 ; 0.511332  0.320198 ; 
    0.018514  0.350678] 

k = 3
sampdp(X, k)
source
Jchemo.sampksMethod
sampks(X, k::Int; metric = :eucl)

Build training vs. test sets by Kennard-Stone sampling.

  • X : X-data (n, p).
  • k : Nb. test observations to sample.

Keyword arguments:

  • metric : Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).

Two outputs (= row indexes of the data) are returned:

  • train (n - k),
  • test (k).

Output test is built from the Kennard-Stone (KS) algorithm (Kennard & Stone, 1969).

Note: By construction, the set of observations selected by KS sampling contains higher variability than the set of the remaining observations. In the seminal article (K&S, 1969), the algorithm is used to select observations that will be used to build a calibration set. Conversely, in the present function, KS is used to select a test set with higher variability than the training set.

References

Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)

X = dat.X 
y = dat.Y.tbc

k = 80
res = sampks(X, k)
pnames(res)
res.train 
res.test

mod = model(pcasvd; nlv = 15) 
fit!(mod, X) 
@head T = mod.fm.T
res = sampks(T, k; metric = :mah)

#####################

n = 10
k = 25 
X = [repeat(1:n, inner = n) repeat(1:n, outer = n)] 
X = Float64.(X) 
X .= X + .1 * randn(nro(X), nco(X))
s = sampks(X, k).test
f, ax = plotxy(X[:, 1], X[:, 2])
scatter!(ax, X[s, 1], X[s, 2]; color = "red") 
f
source
Jchemo.samprandMethod
samprand(n::Int, k::Int; replace = false)

Build training vs. test sets by random sampling.

  • n : Total nb. of observations.
  • k : Nb. test observations to sample.

Keyword arguments:

  • replace : Boolean. If false, the sampling is without replacement.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

Output test is built by random sampling within 1:n.

Examples

n = 10
samprand(n, 4)
source
Jchemo.sampsysMethod
sampsys(y, k::Int)

Build training vs. test sets by systematic sampling over a quantitative variable.

  • y : Quantitative variable (n) to sample.
  • k : Nb. test observations to sample. Must be >= 2.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

Output test is built by systematic sampling over the rank of the y observations. For instance, if k / n ~ .3, one out of every three observations of the sorted y is selected.

Output test always contains the indexes of the minimum and maximum of y.

Examples

y = rand(7)
[y sort(y)]
res = sampsys(y, 3)
sort(y[res.test])
source
Jchemo.sampwspMethod
sampwsp(X, dmin; maxit = nro(X))

Build training vs. test sets by WSP sampling.

  • X : X-data (n, p).
  • dmin : Distance "dmin" (Santiago et al. 2012).

Keyword arguments:

  • maxit : Maximum number of iterations.

Two outputs (= row indexes of the data) are returned:

  • train (n - k),
  • test (k).

Output test is built from the "Wootton, Sergent, Phan-Tan-Luu" (WSP) algorithm, assumed to generate samples uniformely distributed in the X domain (Santiago et al. 2012).

References

Béal A. 2015. Description et sélection de données en grande dimension. Thèse de doctorat. Laboratoire d'Instrumentation et de sciences analytiques, Ecole doctorale des sciences chimiques, Université d'Aix-Marseille.

Santiago, J., Claeys-Bruno, M., Sergent, M., 2012. Construction of space-filling designs using WSP algorithm for high dimensional spaces. Chemometrics and Intelligent Laboratory Systems, Selected Papers from Chimiométrie 2010 113, 26–31. https://doi.org/10.1016/j.chemolab.2011.06.003

Examples

n = 600 ; p = 2
X = rand(n, p)
dmin = .5
s = sampwsp(X, dmin)
pnames(s)
@show length(s.test)
plotxy(X[s.test, 1], X[s.test, 2]).f
source
Jchemo.savgkMethod
savgk(nhwindow::Int, degree::Int, deriv::Int)

Compute the kernel of the Savitzky-Golay filter.

  • nhwindow : Nb. points (>= 1) of the half window.
  • degree : Degree of the smoothing polynomial, where 1 <= degree <= 2 * nhwindow.
  • deriv : Derivation order, where 0 <= deriv <= degree.

The size of the kernel is odd (npoint = 2 * nhwindow + 1):

  • x[-nhwindow], x[-nhwindow+1], ..., x[0], ...., x[nhwindow-1], x[nhwindow].

If deriv = 0, there is no derivation (only polynomial smoothing).

The case degree = 0 (i.e. simple moving average) is not allowed by the function.

References

Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002

Examples

res = savgk(21, 3, 2)
pnames(res)
res.S 
res.G 
res.kern
source
Jchemo.savgolMethod
savgol(X; kwargs...)

Savitzky-Golay derivation and smoothing of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Size of the filter (nb. points involved in the kernel). Must be odd and >= 3. The half-window size is nhwindow = (npoint - 1) / 2.
  • degree : Degree of the smoothing polynomial. Must be: 1 <= degree <= npoint - 1.
  • deriv : Derivation order. Must be: 0 <= deriv <= degree.

The smoothing is computed by convolution (with padding), using function imfilter of package ImageFiltering.jl. Each returned point is located on the center of the kernel. The kernel is computed with function savgk.

The function returns a matrix (n, p).

References

Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002

Savitzky, A., Golay, M.J.E., 2002. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. [WWW Document]. https://doi.org/10.1021/ac60214a047

Schafer, R.W., 2011. What Is a Savitzky-Golay Filter? [Lecture Notes]. IEEE Signal Processing Magazine 28, 111–117. https://doi.org/10.1109/MSP.2011.941097

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

npoint = 11 ; degree = 2 ; deriv = 2
mod = model(savgol; npoint, degree, deriv) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f

####### Gaussian signal 

u = -15:.1:15
n = length(u)
x = exp.(-.5 * u.^2) / sqrt(2 * pi) + .03 * randn(n)
M = 10  # half window
N = 3   # degree
deriv = 0
#deriv = 1
mod = model(savgol; npoint = 2M + 1, degree = N, deriv)
fit!(mod, x')
xp = transf(mod, x')
f, ax = plotsp(x', u; color = :blue)
lines!(ax, u, vec(xp); color = :red)
f
source
Jchemo.scaleMethod
scale(X)
scale(X, weights::Weight)

Column-wise scaling of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(scale) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colstd(Xptrain)
@head Xptest 
@head Xtest ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.segmkfMethod
segmkf(n::Int, K::Int; rep = 1)
segmkf(group::Vector, K::Int; rep = 1)

Build segments of observations for K-fold cross-validation.

  • n : Total nb. of observations in the dataset. The sampling is implemented with 1:n.
  • group : A vector (n) defining blocks of observations.
  • K : Nb. folds (segments) splitting the n observations.

Keyword arguments:

  • rep : Nb. replications of the sampling.

For each replication, the function splits the n observations into K segments that can be used for K-fold cross-validation.

If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.

The function returns a list (vector) of rep elements. Each element of the list contains K segments (= K vectors). Each segment contains the indexes (position within 1:n) of the sampled observations.

Examples

n = 10 ; K = 3
rep = 4 
segm = segmkf(n, K; rep)
i = 1 
segm[i]
segm[i][1]

n = 10 
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]    # blocks of the observations
tab(group) 
K = 3 ; rep = 4 
segm = segmkf(group, K; rep)
i = 1 
segm[i]
segm[i][1]
group[segm[i][1]]
group[segm[i][2]]
group[segm[i][3]]
source
Jchemo.segmtsMethod
segmts(n::Int, m::Int; rep = 1, seed = nothing)
segmts(group::Vector, m::Int; rep = 1, seed = nothing)

Build segments of observations for "test-set" validation.

  • n : Total nb. of observations in the dataset. The sampling is implemented within 1:n.
  • group : A vector (n) defining blocks of observations.
  • m : Nb. test observations, or groups if group is used, returned in each segment.

Keyword arguments:

  • rep : Nb. replications of the sampling.
  • seed : Optional seed for the Random.MersenneTwister generator. Must be of length = rep. When nothing, the seed is random at each replication.

For each replication, the function builds a test set that can be used to validate a model.

If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.

The function returns a list (vector) of rep elements. Each element of the list is a vector of the indexes (positions within 1:n) of the sampled observations.

Examples

n = 10 ; m = 3
rep = 4 
segm = segmts(n, m; rep) 
i = 1
segm[i]
segm[i][1]

n = 10 
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]    # blocks of the observations
tab(group)  
m = 2 ; rep = 4 
segm = segmts(group, m; rep)
i = 1 
segm[i]
segm[i][1]
group[segm[i][1]]
source
Jchemo.selwoldMethod
selwold(indx, r; smooth = true, npoint = 5, alpha = .05, digits = 3, graph = true, 
    step = 2, xlabel = "Index", ylabel = "Value", title = "Score")

Wold's criterion to select dimensionality in LV models (e.g. PLSR).

  • indx : A variable representing the model parameter(s), e.g. nb. LVs if PLSR models.
  • r : A vector of error rates (n), e.g. RMSECV.

Keyword arguments:

  • smooth : Boolean. If true, the selection is done after a moving-average smoothing of rate R (see function mavg).
  • npoint : Window of the moving-average used to smooth rate R.
  • alpha : Proportion alpha used as threshold for rate R.
  • digits : Number of digits in the outputs.
  • graph : Boolean. If true, outputs are plotted.
  • step : Step used for defining the xticks in the graphs.
  • xlabel : Horizontal label for the plots.
  • ylabel : Vertical label for the plots.
  • title : Title of the left plot.

The selection criterion is the "precision gain ratio":

  • R = 1 - r(a+1) / r(a)

where r is an observed error rate quantifying the model performance (e.g. RMSEP, classification error rate, etc.) and a the model dimensionality (= nb. LVs). r can also represent other indicators such as the eigenvalues of a PCA.

R is the relative gain in performance efficiency after a new LV is added to the model. The iterations continue until R becomes lower than a threshold value alpha. By default and only as an indication, alpha = .05 is set in the function, but the user should set any other value depending on the data and the parsimony objective.

In his original article, Wold (1978; see also Bro et al. 2008) used the ratio of cross-validated over training residual sums of squares, i.e. PRESS over SSR. Instead, function selwold compares values of consistent nature (the successive values in the input vector r). For instance, r was set to PRESS values in Li et al. (2002) and Andries et al. (2011), which is equivalent to the "punish factor" described in Westad & Martens (2000).

The ratio R can be erratic (particularly when r is the error rate of a discrimination model), making the dimensionality selection difficult. In such a situation, function selwold proposes to calculate a smoothing of R (argument smooth).

The function returns two outputs (in addition to eventual plots):

  • opt : The index corresponding to the minimum value of r.
  • sel : The index of the selection from the R (or smoothed R) threshold.
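
A minimal sketch of the ratio R defined above, on a hypothetical vector r of error rates:

r = [1.0, .60, .43, .38, .37, .365]     # hypothetical error rates for increasing nb. LVs
R = 1 .- r[2:end] ./ r[1:(end - 1)]     # precision gain ratio
findfirst(R .< .05)    # first dimension where the gain drops below alpha = .05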

References

Andries, J.P.M., Vander Heyden, Y., Buydens, L.M.C., 2011. Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity. Analytica Chimica Acta 705, 292-305. https://doi.org/10.1016/j.aca.2011.06.037

Bro, R., Kjeldahl, K., Smilde, A.K., Kiers, H.A.L., 2008. Cross-validation of component models: A critical look at current methods. Anal Bioanal Chem 390, 1241-1251. https://doi.org/10.1007/s00216-007-1790-1

Li, B., Morris, J., Martin, E.B., 2002. Model selection for partial least squares regression. Chemometrics and Intelligent Laboratory Systems 64, 79-89. https://doi.org/10.1016/S0169-7439(02)00051-5

Westad, F., Martens, H., 2000. Variable Selection in near Infrared Spectroscopy Based on Significance Testing in Partial Least Squares Regression. J. Near Infrared Spectrosc., JNIRS 8, 117–124.

Wold S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics. 1978;20(4):397-405

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
n = nro(Xtrain)

segm = segmts(n, 50; rep = 30)
mod = model(plskern)
nlv = 0:20
res = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv).res
res[res.y1 .== minimum(res.y1), :]
plotgrid(res.nlv, res.y1;xlabel = "Nb. LVs", ylabel = "RMSEP").f
zres = selwold(res.nlv, res.y1; smooth = true, graph = true) ;
@show zres.opt
@show zres.sel
zres.f
source
Jchemo.sepMethod
sep(pred, Y)

Compute the corrected SEP ("SEP_c"), i.e. the standard deviation of the prediction errors.

  • pred : Predictions.
  • Y : Observed data.

References

Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.-M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends in Analytical Chemistry 29, 1073–1081. https://doi.org/10.1016/j.trac.2010.05.006

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
sep(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
sep(pred, ytest)
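
A minimal hand computation for the univariate case (illustration only; the bias-correction convention used internally may differ):

using Statistics
std(ytest .- vec(pred))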
source
Jchemo.snormMethod
snorm(X)

Row-wise norming of X-data.

  • X : X-data (n, p).

Each row of X is divided by its norm.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(snorm) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
rownorm(Xptrain)
rownorm(Xptest)
source
Jchemo.snvMethod
snv(X; kwargs...)

Standard-normal-variate (SNV) transformation of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • centr : Boolean indicating if the centering is done.
  • scal : Boolean indicating if the scaling is done.
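
For illustration, with centr = true and scal = true, the transformation amounts to centering and scaling each row (a minimal sketch using Statistics; the exact standard-deviation denominator may differ from the Jchemo implementation):

using Statistics
X = rand(5, 10)
Xp = (X .- mean(X; dims = 2)) ./ std(X; dims = 2)   # row-wise centering and scaling
mean(Xp; dims = 2)    # ~0 for each row
std(Xp; dims = 2)     # ~1 for each row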

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

centr = true ; scal = true
mod = model(snv; centr, scal) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.softMethod
soft(x::Real, delta)

Soft thresholding function.

  • x : Value to transform.
  • delta : Range for the thresholding.

The returned value is:

  • sign(x) * max(0, abs(x) - delta)

where delta >= 0.
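
For illustration, the rule above can be written as a one-line Julia function (soft_manual is a hypothetical name, not part of Jchemo):

soft_manual(x, delta) = sign(x) * max(0, abs(x) - delta)
soft_manual(3, .2)      # 2.8
soft_manual(.1, .2)     # 0.0 (shrunk to zero)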

Examples

using CairoMakie 

delta = .2
soft(3, delta)

x = LinRange(-2, 2, 100)
y = soft.(x, delta)
lines(x, y)
source
Jchemo.softmaxMethod
softmax(x::AbstractVector)
softmax(X::Union{Matrix, DataFrame})

Softmax function.

  • x : A vector to transform.
  • X : A matrix whose rows are transformed.

Let v be a vector:

  • softmax(v) = exp.(v) / sum(exp.(v))
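
A hand computation of this definition (a minimal sketch; softmax_manual is a hypothetical name, not part of Jchemo):

softmax_manual(v) = exp.(v) / sum(exp.(v))
v = [1.0, 2.0, 3.0]
softmax_manual(v)
sum(softmax_manual(v))      # 1.0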

Examples

x = 1:3
softmax(x)

X = rand(5, 3)
softmax(X)
source
Jchemo.soplsrMethod
soplsr(Xbl, Y; kwargs...)
soplsr(Xbl, Y, weights::Weight; kwargs...)
soplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock sequentially orthogonalized PLSR (SO-PLSR).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation.

References

Biancolillo et al., 2015. Combining SO-PLS and linear discriminant analysis for multi-block classification. Chemometrics and Intelligent Laboratory Systems, 141, 58-67.

Biancolillo, A. 2016. Method development in the area of multi-block analysis focused on food analysis. PhD. University of Copenhagen.

Menichelli et al., 2014. SO-PLS as an exploratory tool for path modelling. Food Quality and Preference, 36, 122-134.

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 2
#nlv = [2, 1, 2]
#nlv = [2, 0, 1]
scal = false
#scal = true
mod = model(soplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)
source
Jchemo.spcaMethod
spca(X; kwargs...)
spca(X, weights::Weight; kwargs...)
spca!(X::Matrix, weights::Weight; kwargs...)

Sparse PCA (Shen & Huang 2008).

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. principal components (PCs).
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each principal component (PC). Can be a single integer (i.e. same nb. of variables for each PC), or a vector of length nlv.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of Nipals iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Sparse principal component analysis via regularized low rank matrix approximation (Shen & Huang 2008). A Nipals algorithm is used. The function provides three methods of thresholding to compute the sparse loadings:

  • msparse = :soft: Soft thresholding of standardized loadings. Let us note v a given loading vector before thresholding. Vector abs(v) is then standardized to its maximal component (= max{abs(v[i]), i = 1..p}). The soft-thresholding function (see function soft) is applied to this standardized vector, with the constant delta ∈ [0, 1]. This returns the sparse vector theta. Vector v is multiplied term-by-term by this vector theta, which finally gives the sparse loadings.

  • msparse = :mix: Method used in function spca of the R package mixOmics (Lê Cao et al.). For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected. Then a soft-thresholding is applied to the corresponding selected loadings. Range delta is automatically (internally) set equal to the maximal value of the components of abs(v) corresponding to variables removed from the selection.

  • msparse = :hard: For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected.

The case msparse = :mix returns the same results as function spca of the R package mixOmics.
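
As an illustration of the :soft rule described above, the thresholding of a single loading vector can be sketched as follows (hypothetical standalone code, not the internal Jchemo implementation):

v = [.8, -.1, .4, -.6, .05]             # loading vector before thresholding
delta = .5
vstd = abs.(v) / maximum(abs.(v))       # standardize abs(v) to its maximal component
theta = max.(0, vstd .- delta)          # soft thresholding (nonnegative input)
vsparse = v .* theta                    # sparse loadings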

Note: The resulting sparse loading vectors (the columns of P) are in general not orthogonal. Therefore, there is no unique decomposition of the variance of X such as in PCA. Function summary returns the following objects:

  • explvarx: The proportion of variance of X explained by each column t of T, computed by regressing X on t (such as what is done in PLS).
  • explvarx_adj: Adjusted explained variance proposed by Shen & Huang 2008 section 2.3.

References

Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics

https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html

Shen, H., Huang, J.Z., 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015–1034. https://doi.org/10.1016/j.jmva.2007.06.007

Examples

using JchemoData, JLD2 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
Xtest = X[s.test, :]

nlv = 3 
msparse = :mix ; nvar = 2
#msparse = :hard ; nvar = 2
scal = false
mod = model(spca; nlv, msparse, nvar, scal) ;
fit!(mod, Xtrain) 
fm = mod.fm ;
pnames(fm)
fm.niter
fm.sellv 
fm.sel
fm.P
fm.P' * fm.P
@head T = fm.T
@head transf(mod, Xtrain)

@head Ttest = transf(fm, Xtest)

res = summary(mod, Xtrain) ;
res.explvarx
res.explvarx_adj

nlv = 3 
msparse = :soft ; delta = .4 
mod = model(spca; nlv, msparse, delta) ;
fit!(mod, Xtrain) 
mod.fm.P
source
Jchemo.splskdedaMethod
splskdeda(X, y; kwargs...)
splskdeda(X, y, weights::Weight; kwargs...)

Sparse PLS-KDE-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDE-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function splslda for examples.

source
Jchemo.splskernMethod
splskern(X, Y; kwargs...)
splskern(X, Y, weights::Weight; kwargs...)
splskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Sparse partial least squares regression (Lê Cao et al. 2008).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Sparse partial least squares regression (Lê Cao et al. 2008), with the fast "improved kernel algorithm #1" of Dayal & MacGregor (1997).

In the present version of splskern, the sparse correction only concerns X. The function provides three methods of thresholding to compute the sparse X-loading weights w, see function spca for description (same principles). The case msparse = :mix returns the same results as function spls of the R package mixOmics with the regression mode (and without sparseness on Y).

The case msparse = :hard (or msparse = :mix) with nvar = 1 corresponds to the COVSEL regression described in Roger et al. 2011 (see also Höskuldsson 1992).

References

Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.

Höskuldsson, A., 1992. The H-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, Proceedings of the 2nd Scandinavian Symposium on Chemometrics 14, 139–153. https://doi.org/10.1016/0169-7439(92)80099-P

Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., Besse, P., 2008. A Sparse PLS for Variable Selection when Integrating Omics Data. Statistical Applications in Genetics and Molecular Biology 7. https://doi.org/10.2202/1544-6115.1390

Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics

https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html

Roger, J.M., Palagos, B., Bertrand, D., Fernandez-Ahumada, E., 2011. covsel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chem. Lab. Int. Syst. 106, 216-223.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
msparse = :mix ; nvar = 5
#msparse = :hard ; nvar = 5
mod = model(splskern; nlv, msparse, nvar) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head mod.fm.W

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.splsldaMethod
splslda(X, y; kwargs...)
splslda(X, y, weights::Weight; kwargs...)

Sparse PLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
msparse = :mix ; nvar = 10
mod = model(splslda; nlv, msparse, nvar) 
#mod = model(splsqda; nlv, msparse, nvar, alpha = .1) 
#mod = model(splskdeda; nlv, msparse, nvar, a_kde = .9) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ; 
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)
summary(fmpls, Xtrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.splsqdaMethod
splsqda(X, y; kwargs...)
splsqda(X, y, weights::Weight; kwargs...)

Sparse PLS-QDA (with continuum).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function splslda for examples.

source
Jchemo.splsrdaMethod
splsrda(X, y; kwargs...)
splsrda(X, y, weights::Weight; kwargs...)

Sparse PLSR-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function plsrda and splskern for details.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
msparse = :mix ; nvar = 10
mod = model(splsrda; nlv, msparse, nvar) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
source
Jchemo.ssqMethod
ssq(X)

Compute the total inertia of a matrix.

  • X : Matrix.

Sum of all the squared components of X (= norm(X)^2; Squared Frobenius norm).
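
For illustration, ssq(X) coincides (up to floating-point error) with the two hand-written forms below (a minimal sketch using LinearAlgebra):

using LinearAlgebra
X = rand(5, 2)
sum(X.^2)       # sum of the squared components
norm(X)^2       # squared Frobenius norm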

Examples

X = rand(5, 2) 
ssq(X)
source
Jchemo.ssrMethod
ssr(pred, Y)

Compute the sum of squared prediction errors (SSR).

  • pred : Predictions.
  • Y : Observed data.
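
For illustration, the SSR is the sum of the squared differences between predictions and observations (a minimal sketch; compare with ssr(pred, Y)):

pred = [1.0 2.0; 3.0 4.0]
Y = [1.1 1.9; 2.8 4.3]
sum((pred .- Y).^2)     # should match ssr(pred, Y)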

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, ytest)
source
Jchemo.stahMethod
stah(X, a; kwargs...)

Compute the Stahel-Donoho outlierness.

  • X : X-data (n, p).
  • a : Nb. dimensions simulated for the projection pursuit method.

Keyword arguments:

  • scal : Boolean. If true, matrix X is centred (by median) and scaled (by MAD) before computing the outlierness.

See Maronna and Yohai 1995 for details on the outlierness measure.

This outlierness measure is computed from a projection-pursuit approach:

  • A projection matrix P (p, a) is built randomly from binary (0/1) data,
  • and the observations (rows of X) are projected on the a directions.
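
A conceptual sketch of this projection-pursuit measure (simplified and hypothetical code, not the exact Jchemo implementation; the scaling options and the normalization of the directions are omitted):

using Statistics
n, p, a = 100, 10, 20
X = randn(n, p)
P = rand(0:1, p, a)                         # random binary (0/1) directions
T = X * P                                   # projected data (n, a)
madv(x) = median(abs.(x .- median(x)))      # raw median absolute deviation
med = [median(T[:, j]) for j = 1:a]
s = [madv(T[:, j]) for j = 1:a]
d = [maximum(abs(T[i, j] - med[j]) / s[j] for j = 1:a) for i = 1:n]   # outlierness per observation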

References

Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341. https://doi.org/10.1080/01621459.1995.10476517

Examples

n = 300 ; p = 700 ; m = 80
ntot = n + m
X1 = randn(n, p)
X2 = randn(m, p) .+ rand(1:3, p)'
X = vcat(X1, X2)

a = 10
scal = false
#scal = true
res = stah(X, a; scal) ;
pnames(res)
res.d
plotxy(1:nro(X), res.d).f
source
Jchemo.summMethod
summ(X; digits = 3)
summ(X, y; digits = 3)

Summarize a dataset (or a variable).

  • X : A dataset (n, p).
  • y : A categorical variable (n) (class membership).
  • digits : Nb. digits in the outputs.

Examples

n = 50
X = rand(n, 3) 
y = rand(1:3, n)
res = summ(X)
pnames(res)
summ(X[:, 2]).res

summ(X, y)
source
Jchemo.svmdaMethod
svmda(X, y; kwargs...)

Support vector machine for discrimination "C-SVC" (SVM-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
  • gamma : kern parameter, see below.
  • degree : kern parameter, see below.
  • coef0 : kern parameter, see below.
  • cost : Cost of constraints violation C parameter.
  • epsilon : Epsilon parameter in the loss function.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Kernel types:

  • :krbf – radial basis function: exp(-gamma * ||x - y||^2)
  • :kpol – polynomial: (gamma * x' * y + coef0)^degree
  • :klin – linear: x' * y
  • :ktanh – sigmoid: tanh(gamma * x' * y + coef0)

The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the LIBSVM library (Chang & Lin 2001).

References

Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl

Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz

Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

kern = :krbf ; gamma = 1e4
cost = 1000 ; epsilon = .5
mod = model(svmda; kern, gamma, cost, epsilon) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.svmrMethod
svmr(X, y; kwargs...)

Support vector machine for regression (Epsilon-SVR).

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
  • gamma : kern parameter, see below.
  • degree : kern parameter, see below.
  • coef0 : kern parameter, see below.
  • cost : Cost of constraints violation C parameter.
  • epsilon : Epsilon parameter in the loss function.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Kernel types:

  • :krbf – radial basis function: exp(-gamma * ||x - y||^2)
  • :kpol – polynomial: (gamma * x' * y + coef0)^degree
  • :klin – linear: x' * y
  • :ktanh – sigmoid: tanh(gamma * x' * y + coef0)

The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the LIBSVM library (Chang & Lin 2001).

References

Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl

Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz

Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

kern = :krbf ; gamma = .1
cost = 1000 ; epsilon = 1
mod = model(svmr; kern, gamma, cost, epsilon) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
kern = :krbf ; gamma = .1
mod = model(svmr; kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.tabMethod
tab(x)

Univariate tabulation.

  • x : Categorical variable.

The output contains sorted levels.

Examples

x = rand(["a";"b";"c"], 20)
res = tab(x)
res.keys
res.vals
source
Jchemo.tabdfMethod
tabdf(X; groups = nothing)

Compute the nb. of occurrences in categorical variables of a dataset.

  • X : Data.
  • groups : Vector of the names of the group variables to consider in X (by default: all the columns of X).

The output (dataframe) contains sorted levels.

Examples

using DataFrames

n = 20
X =  hcat(rand(1:2, n), rand(["a", "b", "c"], n))
tabdf(X)
tabdf(X[:, 2])

df = DataFrame(X, [:v1, :v2])
tabdf(df)
tabdf(df; groups = [:v1, :v2])
tabdf(df; groups = :v2)
source
Jchemo.tabduplMethod
tabdupl(x)

Tabulate duplicated values in a vector.

  • x : Categorical variable.

Examples

x = ["a", "b", "c", "a", "b", "b"]
tab(x)
res = tabdupl(x)
res.keys
res.vals
source
Jchemo.transfMethod
transf(object::Blockscal, Xbl)
transf!(object::Blockscal, Xbl)

Compute the preprocessed data from a model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Center, X)
transf!(object::Center, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Comdim, Xbl; nlv = nothing)
transfbl(object::Comdim, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Cscale, X)
transf!(object::Cscale, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Detrend, X)
transf!(object::Detrend, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Dkplsr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Fdif, X)
transf!(object::Fdif, X::Matrix, M::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
  • M : Pre-allocated output matrix (n, p - npoint + 1).

The in-place function stores the output in M.

source
Jchemo.transfMethod
transf(object::Interpl, X)
transf!(object::Interpl, X::Matrix, M::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
  • M : Pre-allocated output matrix (n, p).

The in-place function stores the output in M.

source
Jchemo.transfMethod
transf(object::Kpca, X; nlv = nothing)

Compute PCs (scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Kplsr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Mavg, X)
transf!(object::Mavg, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Mbconcat, Xbl)

Compute the preprocessed data from a model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Mbpca, Xbl; nlv = nothing)
transfbl(object::Mbpca, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Mbplslda, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Mbplsrda, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Pcr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model and a matrix X.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Plslda, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Plsrda, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Rmgap, X)
transf!(object::Rmgap, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Rosaplsr, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Rp, X; nlv = nothing)

Compute scores T from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which scores T are computed.
  • nlv : Nb. scores to compute.
source
Jchemo.transfMethod
transf(object::Savgol, X)
transf!(object::Savgol, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Scale, X)
transf!(object::Scale, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Snorm, X)
transf!(object::Snorm, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Snv, X)
transf!(object::Snv, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Soplsr, Xbl)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Spca, X; nlv = nothing)

Compute principal components (PCs = scores T) from a fitted model and X-data.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Union{Pca, Fda}, X; nlv = nothing)

Compute principal components (PCs = scores T) from a fitted model and X-data.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Union{Plsr, Splsr}, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfblMethod
transfbl(object::Cca, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Ccawold, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Plscan, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Plstuck, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Rasvd, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.treer_dtMethod
treer_dt(X, y; kwargs...)

Regression tree (CART) with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • n_subfeatures : Nb. variables to select at random at each split (default: 0 ==> keep all).
  • max_depth : Maximum depth of the decision tree (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The function fits a single regression tree (CART) using package DecisionTree.jl.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

n_subfeatures = p / 3 
max_depth = 15
mod = model(treer_dt; n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
source
Jchemo.vcatdfMethod
vcatdf(dat; cols = :intersect)

Vertical concatenation of a list of dataframes.

  • dat : List (vector) of dataframes.
  • cols : Determines the columns of the returned data frame. See ?DataFrames.vcat.

Examples

using DataFrames
dat1 = DataFrame(rand(5, 2), [:v3, :v1]) 
dat2 = DataFrame(100 * rand(2, 2), [:v3, :v1])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)

dat2 = DataFrame(100 * rand(2, 2), [:v1, :v3])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)

dat2 = DataFrame(100 * rand(2, 3), [:v3, :v1, :a])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
Jchemo.vcatdf(dat; cols = :union)
source
Jchemo.vcolMethod
vcol(X::AbstractMatrix, j)
vcol(X::DataFrame, j)
vcol(x::Vector, j)

View of the j-th column(s) of a matrix X, or of the j-th element(s) of vector x.

source
Jchemo.vipMethod
vip(object::Union{Pcr, Plsr}; nlv = nothing)
vip(object::Union{Pcr, Plsr}, Y; nlv = nothing)

Variable importance on Projections (VIP).

  • object : The fitted model.
  • Y : The Y-data that was used to fit the model.

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider. If nothing, the maximal model is considered.

For a PLS model (or PCR, etc.) fitted on (X, Y) with A latent variables, and for variable xj (column j of X):

  • VIP(xj) = [ Sum(a=1,...,A) R2(Yc, ta) * waj^2 ] / [ (1 / p) * Sum(a=1,...,A) R2(Yc, ta) ]

where:

  • Yc is the centered Y,
  • ta is the a-th X-score,
  • R2(Yc, ta) is the proportion of Yc-variance explained by ta, i.e. ||Yc.hat||^2 / ||Yc||^2 (where Yc.hat is the LS estimate of Yc by ta).

When Y is used, R2(Yc, ta) is replaced by the redundancy Rd(Yc, ta) (see function rd), as in Tenenhaus 1998 p. 139.

References

Chong, I.-G., Jun, C.-H., 2005. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78, 103–112. https://doi.org/10.1016/j.chemolab.2004.12.011

Mehmood, T., Sæbø, S., Liland, K.H., 2020. Comparison of variable selection methods in partial least squares regression. Journal of Chemometrics 34, e3226. https://doi.org/10.1002/cem.3226

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Examples

X = [1. 2 3 4; 4 1 6 7; 12 5 6 13; 
    27 18 7 6; 12 11 28 7] 
Y = [10. 11 13; 120 131 27; 8 12 4; 
    1 200 8; 100 10 89] 
y = Y[:, 1] 
ycla = [1; 1; 1; 2; 2]

nlv = 3
mod = model(plskern; nlv)
fit!(mod, X, y)
res = vip(mod.fm)
pnames(res)
res.imp

fit!(mod, X, Y)
vip(mod.fm).imp
vip(mod.fm, Y).imp

mod = model(plsrda; nlv) 
fit!(mod, X, ycla)
pnames(mod.fm)
fm = mod.fm.fm ;
vip(fm).imp
Ydummy = dummy(ycla).Y
vip(fm, Ydummy).imp

mod = model(plslda; nlv) 
fit!(mod, X, ycla)
pnames(mod.fm.fm)
fm = mod.fm.fm.fmpls ;
vip(fm).imp
vip(fm, Ydummy).imp
source
Jchemo.vipermMethod
viperm(mod, X, Y; rep = 50, psamp = .3, score = rmsep)

Variable importance by direct permutations.

  • mod : Model to evaluate.
  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • rep : Number of replications of the splitting training/test.
  • psamp : Proportion of data used as test set to compute the score.
  • score : Function computing the prediction score.

The principle is as follows:

  • Data (X, Y) are split randomly into a training and a test set.
  • The model is fitted on Xtrain, and the score (error rate) is computed on Xtest. This gives the reference error rate.
  • Rows of a given variable (feature) j in Xtest are randomly permuted (the rest of Xtest is unchanged). The score is then computed on this permuted matrix (i.e. Xtest after the rows of variable j have been permuted). The importance of variable j is computed as the difference between this score and the reference score.
  • This process is run for each variable j separately and replicated rep times. Average results are provided in the outputs, as well as the results per replication.

In general, this method returns results similar to those of the out-of-bag permutation method used in random forests (Breiman, 2001).
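
A minimal sketch of this permutation principle for a single replication (illustrative only; the data Xtrain, ytrain, Xtest, ytest and the unfitted model mod are assumed to be defined, e.g. as in the example below, and the real function also handles the random splitting, the replications and the aggregation of the results):

using Random
fit!(mod, Xtrain, ytrain)
score0 = rmsep(predict(mod, Xtest).pred, ytest)[1]      # reference score
p = nco(Xtest)
imp = zeros(p)
for j = 1:p
    Xperm = copy(Xtest)
    Xperm[:, j] = Xtest[shuffle(1:nro(Xtest)), j]       # permute variable j only
    imp[j] = rmsep(predict(mod, Xperm).pred, ytest)[1] - score0
end
imp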

References

Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y 
wl_str = names(X)
wl = parse.(Float64, wl_str) 
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Work on the j-th y-variable 
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]

mod = model(plskern; nlv = 9)
res = viperm(mod, Xtrain, ytrain; rep = 50, score = rmsep) ;
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1]; xlabel = "Wavelength (nm)", ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f

mod = model(rfr_dt; n_trees = 10, max_depth = 2000, min_samples_leaf = 5)
res = viperm(mod, Xtrain, ytrain; rep = 50)
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1];
    xlabel = "Wavelength (nm)", 
    ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
source
Jchemo.vrowMethod
vrow(X::AbstractMatrix, i)
vrow(X::DataFrame, i)
vrow(x::Vector, i)

View of the i-th row(s) of a matrix X, or of the i-th element(s) of vector x.

source
Jchemo.wdistMethod
wdist(d; h = 2, criw = 4, squared = false)
wdist!(d; h = 2, criw = 4, squared = false)

Compute weights from distances using a decreasing exponential function.

  • d : A vector of distances.

Keyword arguments:

  • h : A scaling positive scalar defining the shape of the weight function.
  • criw : A positive scalar defining outliers in the distances vector d.
  • squared : If true, distances are replaced by the squared distances; the weight function is then a Gaussian (RBF) kernel function.

Weights are computed by:

  • exp(-d / (h * MAD(d)))

or are set to 0 for distances > Median(d) + criw * MAD(d). This is an adaptation of the weight function presented in Kim et al. 2011.

The weights decrease with increasing distances. The lower h, the sharper the decreasing function. Weights are set to 0 for outliers (extreme distances).
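
A hand-written illustration of this weighting rule (a minimal sketch; madv below is the raw median absolute deviation, whereas Jchemo's MAD may include the usual consistency factor, and the squared option is not shown):

using Statistics
madv(x) = median(abs.(x .- median(x)))      # raw median absolute deviation
d = [.5, 1, 2, 10.]
h = 2 ; criw = 4
w = exp.(-d ./ (h * madv(d)))
w[d .> median(d) + criw * madv(d)] .= 0     # extreme distances get weight 0
w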

References

Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011; 421(2):269-274. https://doi.org/10.1016/j.ijpharm.2011.10.007

Examples

using CairoMakie, Distributions

x1 = rand(Chisq(10), 100) ;
x2 = rand(Chisq(40), 10) ;
d = [sqrt.(x1) ; sqrt.(x2)]
h = 2 ; criw = 3
w = wdist(d; h, criw) ;
f = Figure(size = (600, 300))
ax1 = Axis(f, xlabel = "Distance", ylabel = "Nb. observations")
hist!(ax1, d, bins = 30)
ax2 = Axis(f, xlabel = "Distance", ylabel = "Weight")
scatter!(ax2, d, w)
f[1, 1] = ax1 
f[1, 2] = ax2 
f

d = collect(0:.5:15) ;
h = [.5, 1, 1.5, 2.5, 5, 10, Inf] 
#h = [1, 2, 5, Inf] 
w = wdist(d; h = h[1]) 
f = Figure(size = (500, 400))
ax = Axis(f, xlabel = "Distance", ylabel = "Weight")
lines!(ax, d, w, label = string("h = ", h[1]))
for i = 2:length(h)
    w = wdist(d; h = h[i])
    lines!(ax, d, w, label = string("h = ", h[i]))
end
axislegend("Values of h"; position = :lb)
f[1, 1] = ax
f
source
Jchemo.xfitMethod
xfit(object)
xfit(object, X; nlv = nothing)
xfit!(object, X::Matrix; nlv = nothing)

Matrix fitting from a bilinear model (e.g. PCA).

  • object : The fitted model.
  • X : New X-data to be approximated from the model. Must be on the same scale as the X-data used to fit the model object, i.e. before any centering and scaling.

Keyword arguments:

  • nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.

Compute an approximation of matrix X from a bilinear model (e.g. PCA or PLS) fitted on X. The fitted X is returned in the original scale of the X-data used to fit the model object.

Examples

X = [1. 2 3 4; 4 1 6 7; 12 5 6 13; 
    27 18 7 6; 12 11 28 7] 
Y = [10. 11 13; 120 131 27; 8 12 4; 
    1 200 8; 100 10 89] 
n, p = size(X)
Xnew = X[1:3, :]
Ynew = Y[1:3, :]
y = Y[:, 1]
ynew = Ynew[:, 1]
weights = mweight(rand(n))

nlv = 2 
scal = false
#scal = true
mod = model(pcasvd; nlv, scal) ;
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew; nlv = 0)
xfit(fm, Xnew; nlv = 1)
fm.xmeans

@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)

@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)

mod = model(pcasvd; nlv = min(n, p), scal) 
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm) 
@head xfit(fm, X)
@head xresid(fm, X)

nlv = 3
scal = false
#scal = true
mod = model(plskern; nlv, scal)
fit!(mod, X, Y, weights) 
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew, nlv = 0)
xfit(fm, Xnew, nlv = 1)

@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)

@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)

mod = model(plskern; nlv = min(n, p), scal) 
fit!(mod, X, Y, weights) 
fm = mod.fm ;
@head xfit(fm) 
@head xfit(fm, Xnew)
@head xresid(fm, Xnew)
source
Jchemo.xresidMethod
xresid(object, X; nlv = nothing)
xresid!(object, X::Matrix; nlv = nothing)

Residual matrix from a bilinear model (e.g. PCA).

  • object : The fitted model.
  • X : New X-data to be approximated from the model. Must be on the same scale as the X-data used to fit the model object, i.e. before any centering and scaling.

Keyword arguments:

  • nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.

Compute the residual matrix:

  • E = X - X_fit

where X_fit is the fitted X returned by function xfit. See xfit for examples.
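
A minimal check of this identity, assuming a PCA model is fitted as in the xfit examples:

X = rand(5, 4)
mod = model(pcasvd; nlv = 2)
fit!(mod, X)
fm = mod.fm ;
E = xresid(fm, X; nlv = 1)
E ≈ X - xfit(fm, X; nlv = 1)    # true (up to floating-point error)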

source