Index of functions

Here is a list of all exported functions from Jchemo.jl.

For more details, click on the link and you'll be directed to the function help.

Base.summaryMethod
summary(object::Cca, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Ccawold, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Comdim, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Fda)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Kpca)

Summarize the fitted model.

  • object : The fitted model.
source
Base.summaryMethod
summary(object::Mbpca, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Mbplsr, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Mbplswest, Xbl)

Summarize the fitted model.

  • object : The fitted model.
  • Xbl : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Pca, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Pcr, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Plscan, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Plstuck, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Rasvd, X, Y)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
  • Y : The Y-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Spca, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
source
Base.summaryMethod
summary(object::Union{Plsr, Splsr}, X)

Summarize the fitted model.

  • object : The fitted model.
  • X : The X-data that was used to fit the model.
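
Examples

A minimal sketch of calling this summary method on a fitted PLSR model, using simulated data. It assumes the model-wrapper syntax used in the other examples of this page (e.g. summary(mod, X), as for cca or comdim), so it only illustrates the call, not a real analysis.

X = rand(30, 10)
Y = rand(30, 2)
mod = model(plskern; nlv = 3)
fit!(mod, X, Y)
res = summary(mod, X)   # summary of the Plsr model fitted above
pnames(res)             # inspect the objects returned by the summary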
source
Jchemo.aggstatMethod
aggstat(X, y; fun = mean)
aggstat(X::DataFrame; vars, groups, fun = mean)

Compute column-wise statistics by class in a dataset.

  • X : Data (n, p).
  • y : A categorical variable (n) (class membership).
  • fun : Function to compute (default = mean).

Specific for dataframes:

  • vars : Vector of the names of the variables to summarize.
  • groups : Vector of the names of the categorical variables to consider for computations by class.

Variables defined in vars and groups must be columns of X.

Return a matrix or, if only argument X::DataFrame is used, a dataframe.

Examples

using DataFrames, Statistics

n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, :auto)
y = rand(1:3, n)
res = aggstat(X, y; fun = sum)
res.X
aggstat(df, y; fun = sum).X

n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, string.("v", 1:p))
df.gr1 = rand(1:2, n)
df.gr2 = rand(["a", "b", "c"], n)
df
aggstat(df; vars = [:v1, :v2], groups = [:gr1, :gr2], fun = var)
source
Jchemo.aggsumMethod
aggsum(x::Vector, y::Vector)

Compute sub-total sums by class of a categorical variable.

  • x : A quantitative variable to sum (n).
  • y : A categorical variable (n) (class membership).

Return a vector.

Examples

x = rand(1000)
y = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
aggsum(x, y)
source
Jchemo.aicplsrMethod
aicplsr(X, y; alpha = 2, kwargs...)

Compute Akaike's (AIC) and Mallows's (Cp) criteria for univariate PLSR models.

  • X : X-data (n, p).
  • y : Univariate Y-data.

Keyword arguments:

  • Same arguments as those of function cglsr.
  • alpha : Coefficient multiplying the model complexity (df) to compute AIC.

The function uses function dfplsr_cg.

References

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 40
res = aicplsr(X, y; nlv) ;
res.crit
res.opt
res.delta

zaic = res.crit.aic
f, ax = plotgrid(0:nlv, zaic; xlabel = "Nb. LVs", ylabel = "AIC")
scatter!(ax, 0:nlv, zaic)
f
source
Jchemo.aov1Method
aov1(x, Y)

One-factor ANOVA test.

  • x : Univariate categorical (factor) data (n).
  • Y : Y-data (n, q).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
x = dat.X[:, 5]
Y = dat.X[:, 1:4]
tab(x) 

res = aov1(x, Y) ;
pnames(res)
res.SSF
res.SSR 
res.F 
res.pval
source
Jchemo.biasMethod
bias(pred, Y)

Compute the prediction bias, i.e. the opposite of the mean prediction error.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
bias(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
bias(pred, ytest)
source
Jchemo.blockscalMethod
blockscal(Xbl; kwargs...)
blockscal(Xbl, weights::Weight; kwargs...)

Scale multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • weights : Weights (n) of the observations (rows of the blocks). Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • bscal : Type of block scaling. Possible values are: :none, :frob, :mfa, :ncol, :sd. See thereafter.
  • centr : Boolean. If true, each column of blocks in Xbl is centered (before the block scaling).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

Types of block scaling:

  • :none : No block scaling.
  • :frob : Let D be the diagonal matrix of vector weights.w. Each block X is divided by its Frobenius norm = sqrt(tr(X' * D * X)). After this scaling, tr(X' * D * X) = 1.
  • :mfa : Each block X is divided by sv, where sv is the dominant singular value of X (this is the "MFA" approach).
  • :ncol : Each block X is divided by the nb. of columns of the block.
  • :sd : Each block X is divided by sqrt(sum(weighted variances of the block-columns)). After this scaling, sum(weighted variances of the block-columns) = 1.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 
Xblnew = mblock(Xnew, listbl) 
@head Xbl[3]

centr = true ; scal = true
bscal = :frob
mod = model(blockscal; centr, scal, bscal)
fit!(mod, Xbl)
zXbl = transf(mod, Xbl) ; 
@head zXbl[3]

zXblnew = transf(mod, Xblnew) ; 
zXblnew[3]
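
A quick numerical check of the :frob property described above, continuing the example. This is a sketch that assumes uniform observation weights (the default when no weights are passed to fit!).

using LinearAlgebra
D = Diagonal(mweight(ones(n)).w)   # uniform row weights (assumed default)
tr(zXbl[3]' * D * zXbl[3])         # expected ≈ 1 after the :frob block scaling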
source
Jchemo.caldsMethod
calds(X1, X2; kwargs...)

Direct standardization (DS) for calibration transfer of spectral data.

  • X1 : Spectra (n, p) to transfer to the target.
  • X2 : Target spectra (n, p).

Keyword arguments:

  • fun : Function used as transfer model.
  • Other optional arguments for function fun.

X1 and X2 must represent the same n samples ("standards").

The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method DS fits a model (defined in fun) that predicts X2 from X1.

References

Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected 
## on the same samples. 
## X2 represents the target space. 
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val

## Fitting the model
mod = model(calds; fun = plskern, nlv = 10) 
#mod = model(calds; fun = mlrpinv)   # less robust
fit!(mod, X1cal, X2cal)

## Transfer of new spectra X1val 
## expected to be close to X2val
pred = predict(mod, X1val).pred

i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
source
Jchemo.calpdsMethod
calpds(X1, X2; npoint = 5, fun = plskern, kwargs...)

Piecewise direct standardization (PDS) for calibration transfer of spectral data.

  • X1 : Spectra (n, p) to transfer to the target.
  • X2 : Target spectra (n, p).

Keyword arguments:

  • npoint : Half-window size (nb. of points left or right of the given wavelength).
  • fun : Function used as transfer model.
  • kwargs : Optional arguments for fun.

X1 and X2 must represent the same n standard samples.

The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method PDS fits models (defined in fun) that predict X2 from X1.

The window used in X1 to predict wavelength "i" in X2 is:

  • i - npoint, i - npoint + 1, ..., i, ..., i + npoint - 1, i + npoint
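
For instance, with npoint = 2, the window used to predict wavelength i = 5 of X2 covers the following columns of X1 (a direct transcription of the rule above):

npoint = 2
i = 5
collect(i - npoint:i + npoint)   # [3, 4, 5, 6, 7]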

References

Bouveresse, E., Massart, D.L., 1996. Improvement of the piecewise direct standardisation procedure for the transfer of NIR spectra for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 32, 201–213. https://doi.org/10.1016/0169-7439(95)00074-7

Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.

Wülfert, F., Kok, W.Th., Noord, O.E. de, Smilde, A.K., 2000. Correction of Temperature-Induced Spectral Variation by Continuous Piecewise Direct Standardization. Anal. Chem. 72, 1639–1644. https://doi.org/10.1021/ac9906835

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected 
## on the same samples. 
## X2 represents the target space. 
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val

## Fitting the model
mod = model(calpds; npoint = 2, fun = plskern, nlv = 2) 
fit!(mod, X1cal, X2cal)

## Transfer of new spectra X1val 
## expected to be close to X2val
pred = predict(mod, X1val).pred

i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
source
Jchemo.ccaMethod
cca(X, Y; kwargs...)
cca(X, Y, weights::Weight; kwargs...)
cca!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical correlation analysis (CCA, RCCA).

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function implements a CCA algorithm using SVD decompositions, as presented in Weenink 2003 section 2.

A continuum regularization is available (parameter tau). After block centering and scaling, the function returns block scores (Tx and Ty) that are proportional to the eigenvectors of Projx * Projy and Projy * Projx, respectively, defined as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix
  • Cy = (1 - tau) * Y'DY + tau * Iy
  • Cxy = X'DY
  • Projx = sqrt(D) * X * invCx * X' * sqrt(D)
  • Projy = sqrt(D) * Y * invCy * Y' * sqrt(D)

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use a small epsilon value (e.g. tau = 1e-8), which gives results similar to those obtained with pseudo-inverses.

The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rcc of the R packages CCA (González et al.) and mixOmics (Lê Cao et al.) with their parameters lambda1 and lambda2 set to:

  • lambda1 = lambda2 = tau / (1 - tau) * n / (n - 1)
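
For illustration, this conversion can be computed directly (a sketch that only transcribes the formula above; no Jchemo function is involved):

n = 20                                     # nb. of observations (illustration)
tau = 1e-8
lambda1 = tau / (1 - tau) * n / (n - 1)    # value to pass as lambda1 = lambda2 in rcc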

References

González, I., Déjean, S., Martin, P.G.P., Baccini, A., 2008. CCA: An R Package to Extend Canonical Correlation Analysis. Journal of Statistical Software 23, 1-14. https://doi.org/10.18637/jss.v023.i12

Hotelling, H. (1936): “Relations between two sets of variates”, Biometrika 28: pp. 321–377.

Lê Cao, K.-A., Rohart, F., Gonzalez, I., Dejean, S., Abadi, A.J., Gautier, B., Bartolo, F., Monget, P., Coquery, J., Yao, F., Liquet, B., 2022. mixOmics: Omics Data Integration Project. https://doi.org/10.18129/B9.bioc.mixOmics

Weenink, D. 2003. Canonical Correlation Analysis, Institute of Phonetic Sciences, Univ. of Amsterdam, Proceedings 25, 81-99.

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 3
bscal = :frob ; tau = 1e-8
mod = model(cca; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.ccawoldMethod
ccawold(X, Y; kwargs...)
ccawold(X, Y, weights::Weight; kwargs...)
ccawold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical correlation analysis (CCA, RCCA) - Wold Nipals algorithm.

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function implements the Nipals ccawold algorithm presented by Tenenhaus 1998 p.204 (related to Wold et al. 1984).

In this implementation, after each step of LV computation, X and Y are deflated relative to their respective scores (tx and ty).

A continuum regularization is available (parameter tau). After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix
  • Cy = (1 - tau) * Y'DY + tau * Iy

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use a small epsilon value (e.g. tau = 1e-8), which gives results similar to those obtained with pseudo-inverses.

The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rgcca of the R package RGCCA (Tenenhaus & Guillemot 2017, Tenenhaus et al. 2017).

References

Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Analysis. https://cran.r-project.org/web/packages/RGCCA/index.html

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Tenenhaus, M., Tenenhaus, A., Groenen, P.J.F., 2017. Regularized Generalized Canonical Correlation Analysis: A Framework for Sequential Multiblock Component Methods. Psychometrika 82, 737–777. https://doi.org/10.1007/s11336-017-9573-x

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(ccawold; nlv, bscal, tau, tol = 1e-10)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.centerMethod
center(X)
center(X, weights::Weight)

Column-wise centering of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(center) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
@head Xptest 
@head Xtest .- colmean(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.cglsrMethod
cglsr(X, y; kwargs...)
cglsr!(X::Matrix, y::Matrix; kwargs...)

Conjugate gradient algorithm for the normal equations (CGLS; Björck 1996).

  • X : X-data (n, p).
  • y : Univariate Y-data (n).

Keyword arguments:

  • nlv : Nb. CG iterations.
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the normal equation residual vectors is done.
  • filt : Boolean. If true, CG filter factors are computed (output F). Default = false.
  • scal : Boolean. If true, each column of X and y is scaled by its uncorrected standard deviation (default = false).

X and y are internally centered.

CGLS algorithm "7.4.1" Bjorck 1996, p.289. The part of the code computing the re-orthogonalization (Hansen 1998) and filter factors (Vogel 1987, Hansen 1998) is a transcription (with few adaptations) of the Matlab function cgls (Saunders et al. https://web.stanford.edu/group/SOL/software/cgls/; Hansen 2008).

References

Björck, A., 1996. Numerical Methods for Least Squares Problems, Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971484

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Manne R. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics Intell. Lab. Syst. 1987, 2: 187–197.

Phatak A, De Hoog F. Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. J. Chemometrics 2002; 16: 361–367.

Vogel, C. R., "Solving ill-conditioned linear systems using the conjugate gradient method", Report, Dept. of Mathematical Sciences, Montana State University, 1987.

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 5 ; scal = true
mod = model(cglsr; nlv, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod.fm) 
@head mod.fm.B
coef(mod.fm).B
coef(mod.fm).int

pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f   
source
Jchemo.coefMethod
coef(object::Cglsr)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
source
Jchemo.coefMethod
coef(object::Dkplsr; nlv = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.coefMethod
coef(object::Kplsr; nlv = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.coefMethod
coef(object::Krr; lb = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • lb : Ridge regularization parameter "lambda".
source
Jchemo.coefMethod
coef(object::Rosaplsr; nlv = nothing)

Compute the X b-coefficients of a model fitted with nlv LVs.

  • object : The fitted model.
  • nlv : Nb. LVs to consider.
source
Jchemo.coefMethod
coef(object::Rr; lb = nothing)

Compute the b-coefficients of a fitted model.

  • object : The fitted model.
  • lb : Ridge regularization parameter "lambda".
source
Jchemo.coefMethod
coef(object::Mlr)

Compute the coefficients of the fitted model.

  • object : The fitted model.
source
Jchemo.coefMethod
coef(object::Union{Plsr, Pcr, Splsr}; nlv = nothing)

Compute the b-coefficients of a LV model.

  • object : The fitted model.
  • nlv : Nb. LVs to consider.

For a model fitted from X(n, p) and Y(n, q), the returned object B is a matrix (p, q). If nlv = 0, B is a matrix of zeros. The returned object int is the intercept.
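
Examples

A minimal sketch on simulated data, assuming the model-wrapper syntax used elsewhere on this page (coef can be called on the model, as in the dkplsr example):

X = rand(20, 5)
Y = rand(20, 2)
mod = model(plskern; nlv = 3)
fit!(mod, X, Y)
res = coef(mod)         # b-coefficients of the model fitted with 3 LVs
size(res.B)             # (p, q) = (5, 2)
res.int                 # intercept
coef(mod; nlv = 0).B    # matrix of zeros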

source
Jchemo.colmadMethod
colmad(X)

Compute column-wise median absolute deviations (MAD) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)

colmad(X)
source
Jchemo.colmeanMethod
colmean(X)
colmean(X, weights::Weight)

Compute column-wise means of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colmean(X)
colmean(X, w)
source
Jchemo.colmedMethod
colmed(X)

Compute column-wise medians of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)

colmed(X)
source
Jchemo.colnormMethod
colnorm(X)
colnorm(X, weights::Weight)

Compute column-wise norms of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

The norm computed for a column x of X is:

  • sqrt(x' * x)

The weighted norm is:

  • sqrt(x' * D * x), where D is the diagonal matrix of weights.w.

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colnorm(X)
colnorm(X, w)
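
A check of the weighted norm defined above, continuing the example (a sketch that uses only the formula sqrt(x' * D * x), with D built from weights.w):

using LinearAlgebra
j = 1
x = X[:, j]
sqrt(x' * Diagonal(w.w) * x)   # expected to match colnorm(X, w)[j]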
source
Jchemo.colstdMethod
colstd(X)
colstd(X, weights::Weight)

Compute column-wise standard deviations (uncorrected) of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colstd(X)
colstd(X, w)
source
Jchemo.colsumMethod
colsum(X)
colsum(X, weights::Weight)

Compute column-wise sums of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colsum(X)
colsum(X, w)
source
Jchemo.colvarMethod
colvar(X)
colvar(X, weights::Weight)

Compute column-wise variances (uncorrected) of a matrix.

  • X : Data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))

colvar(X)
colvar(X, w)
source
Jchemo.comdimMethod
comdim(Xbl; kwargs...)
comdim(Xbl, weights::Weight; kwargs...)
comdim!(Xbl::Matrix, weights::Weight; kwargs...)

Common components and specific weights analysis (ComDim, aka CCSWA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

"SVD" algorithm of Hannafi & Qannari 2008 p.84.

The function returns several objects, in particular:

  • T : The non normed global scores.
  • U : The normed global scores.
  • W : The global loadings.
  • Tbl : The block scores (grouped by blocks, in the original scale).
  • Tb : The block scores (grouped by LV, in the metric scale).
  • Wbl : The block loadings.
  • lb : The specific weights (saliences) "lambda".
  • mu : The sum of the squared saliences.

Function summary returns:

  • explvarx : Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
  • explvarxx : Proportion of the XX' total inertia (sum of the squared norms of the products Xk * Xk') explained by each global score (= indicator "V" in Qannari et al. 2000, Hanafi et al. 2008).
  • sal2 : Proportion of the squared saliences of each block within each global score.
  • contr_block : Contribution of each block to the global scores (= proportions of the saliences "lambda" within each score).
  • explX : Proportion of the inertia of the blocks explained by each global score.
  • corx2t : Correlation between the global scores and the original variables.
  • cortb2t : Correlation between the global scores and the block scores.
  • rv : RV coefficient.
  • lg : Lg coefficient.

References

Cariou, V., Qannari, E.M., Rutledge, D.N., Vigneau, E., 2018. ComDim: From multiblock data analysis to path modeling. Food Quality and Preference, Sensometrics 2016: Sensometrics-by-the-Sea 67, 27–34. https://doi.org/10.1016/j.foodqual.2017.02.012

Cariou, V., Jouan-Rimbaud Bouveresse, D., Qannari, E.M., Rutledge, D.N., 2019. Chapter 7 - ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective, in: Cocchi, M. (Ed.), Data Handling in Science and Technology, Data Fusion Methodology and Applications. Elsevier, pp. 179–204. https://doi.org/10.1016/B978-0-444-63984-4.00007-7

Ghaziri, A.E., Cariou, V., Rutledge, D.N., Qannari, E.M., 2016. Analysis of multiblock datasets using ComDim: Overview and extension to the analysis of (K + 1) datasets. Journal of Chemometrics 30, 420–429. https://doi.org/10.1002/cem.2810

Hanafi, M., 2008. Nouvelles propriétés de l’analyse en composantes communes et poids spécifiques. Journal de la société française de statistique 149, 75–97.

Qannari, E.M., Wakeling, I., Courcoux, P., MacFie, H.J.H., 2000. Defining the underlying sensory dimensions. Food Quality and Preference 11, 151–154. https://doi.org/10.1016/S0950-3293(99)00069-5

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1]) 

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(comdim; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod) 
pnames(mod.fm)
## Global scores 
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]

res = summary(mod, Xbl) ;
pnames(res) 
res.explvarx
res.explvarxx
res.sal2 
res.contr_block
res.explX   # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t 
res.cortb2t
res.rv
source
Jchemo.confMethod
conf(pred, y; digits = 1)

Confusion matrix.

  • pred : Univariate predictions.
  • y : Univariate observed data.

Keyword arguments:

  • digits : Nb. digits used to round percentages.

Examples

using CairoMakie

y = ["d"; "c"; "b"; "c"; "a"; "d"; "b"; "d"; 
    "b"; "b"; "a"; "a"; "c"; "d"; "d"]
pred = ["a"; "d"; "b"; "d"; "b"; "d"; "b"; "d"; 
    "b"; "b"; "a"; "a"; "d"; "d"; "d"]
#y = rand(1:10, 200); pred = rand(1:10, 200)

res = conf(pred, y) ;
pnames(res)
res.cnt       # Counts (dataframe built from `A`) 
res.pct       # Row %  (dataframe built from `Apct`))
res.A         
res.Apct
res.diagpct
res.accpct    # Accuracy (% classification successes)
res.lev       # Levels

plotconf(res).f

plotconf(res; cnt = false, ptext = false).f
source
Jchemo.cor2Method
cor2(pred, Y)

Compute the squared linear correlation between data and predictions.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, ytest)
source
Jchemo.cormMethod
corm(X, weights::Weight)
corm(X, Y, weights::Weight)

Compute a weighted correlation matrix.

  • X : Data (n, p).
  • Y : Data (n, q).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

Uncorrected correlation matrix

  • of X-columns : ==> (p, p) matrix
  • or between X-columns and Y-columns : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))

corm(X, w)
corm(X, Y, w)
source
Jchemo.cosmMethod
cosm(X)
cosm(X, Y)

Compute a cosine matrix.

  • X : Data (n, p).
  • Y : Data (n, q).

The function computes the cosine matrix:

  • of the columns of X: ==> (p, p) matrix
  • or between columns of X and Y : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)

cosm(X)
cosm(X, Y)
source
Jchemo.cosvMethod
cosv(x, y)

Compute the cosine between two vectors.

  • x : vector (n).
  • y : vector (n).

Examples

n = 5
x = rand(n)
y = rand(n)

cosv(x, y)
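
For reference, the returned value is expected to match the usual cosine formula (a check, assuming cosv implements the standard definition):

using LinearAlgebra
dot(x, y) / (norm(x) * norm(y))   # expected to match cosv(x, y)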
source
Jchemo.covmMethod
covm(X, weights::Weight)
covm(X, Y, weights::Weight)

Compute a weighted covariance matrix.

  • X : Data (n, p).
  • Y : Data (n, q).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

The function computes the uncorrected weighted covariance matrix:

  • of the columns of X: ==> (p, p) matrix
  • or between columns of X and Y : ==> (p, q) matrix.

Examples

n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))

covm(X, w)
covm(X, Y, w)
source
Jchemo.cscaleMethod
cscale()
cscale(X)
cscale(X, weights::Weight)

Column-wise centering and scaling of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))

db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(cscale) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
colstd(Xptrain)
@head Xptest 
@head (Xtest .- colmean(Xtrain)') ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.detrendMethod
detrend(X; kwargs...)

De-trend transformation of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • degree : Polynom degree.

The function fits a polynomial regression to each observation and returns the residuals.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(detrend; degree = 2)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain, wl).f
plotsp(Xptest, wl).f
source
Jchemo.dfplsr_cgMethod
dfplsr_cg(X, y; kwargs...)

Compute the model complexity (df) of PLSR models with the CGLS algorithm.

  • X : X-data (n, p).
  • y : Univariate Y-data.

Keyword arguments:

  • Same as function cglsr.

The number of degrees of freedom (df) of the PLSR model is returned for 0, 1, ..., nlv LVs.

References

Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697

Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9

Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369

Examples

## The example below reproduces the numerical illustration
## given by Kramer & Sugiyama 2011 on the Ozone data 
## (Fig. 1, center).
## Function "pls.model" used for df calculations
## in the R package "plsdof" v0.2-9 (Kramer & Braun 2019)
## automatically scales the X matrix before PLS.
## The example scales X for consistency with plsdof.

using JchemoData, JLD2, DataFrames, CairoMakie 
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ozone.jld2") 
@load db dat
pnames(dat)
X = dat.X
dropmissing!(X) 
zX = rmcol(Matrix(X), 4) 
y = X[:, 4] 
## For consistency with plsdof
xstds = colstd(zX)
zXs = fscale(zX, xstds)
## End

nlv = 12 ; gs = true
res = dfplsr_cg(zXs, y; nlv, gs) ;
res.df 
df_kramer = [1.000000, 3.712373, 6.456417, 11.633565, 
    12.156760, 11.715101, 12.349716,
    12.192682, 13.000000, 13.000000, 
    13.000000, 13.000000, 13.000000]
f, ax = plotgrid(0:nlv, df_kramer; step = 2, xlabel = "Nb. LVs", ylabel = "df")
scatter!(ax, 0:nlv, res.df; color = "red")
ablines!(ax, 1, 1; color = :grey, linestyle = :dot)
f
source
Jchemo.difmeanMethod
difmean(X1, X2; normx::Bool = false)

Compute a 1-D detrimental matrix as the difference of the column means of two X-datasets.

  • X1 : Spectra (n1, p).
  • X2 : Spectra (n2, p).

Keyword arguments:

  • normx : Boolean. If true, the column-mean vectors of X1 and X2 are normed before computing their difference.

The function returns a matrix D (1, p) computed by the difference between two mean-spectra, i.e. the column-means of X1 and X2.

D is assumed to contain the detrimental information that can be removed (by orthogonalization) from X1 and X2 for calibration transfer. For instance, D can be used as input of function eposvd.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val

## The objective is to remove a detrimental 
## information (here, D) from spaces X1 and X2
D = difmean(X1cal, X2cal).D
res = eposvd(D; nlv = 1)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M

i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
source
Jchemo.dkplskdedaMethod
dkplskdeda(X, y; kwargs...)
dkplskdeda(X, y, weights::Weight; kwargs...)

DKPLS-KDEDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDEDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function dkplslda for examples.

source
Jchemo.dkplsldaMethod
dkplslda(X, y; kwargs...)
dkplslda(X, y, weights::Weight; kwargs...)

DKPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
gamma = .1
mod = model(dkplslda; nlv, gamma) 
#mod = model(dkplslda; nlv, gamma, prior = :prop) 
#mod = model(dkplsqda; nlv, gamma, alpha = .5) 
#mod = model(dkplskdeda; nlv, gamma, a_kde = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.dkplsqdaMethod
dkplsqda(X, y; kwargs...)
dkplsqda(X, y, weights::Weight; kwargs...)

DKPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function dkplslda for examples.

source
Jchemo.dkplsrMethod
dkplsr(X, Y; kwargs...)
dkplsr(X, Y, weights::Weight; kwargs...)
dkplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Direct kernel partial least squares regression (DKPLSR) (Bennett & Embrechts 2003).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

The method builds kernel Gram matrices and then runs a usual PLSR algorithm on them. This is faster than (but not equivalent to) the "true" Nipals KPLSR algorithm (function kplsr) described in Rosipal & Trejo (2001).

References

Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.

Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
kern = :krbf ; gamma = 1e-1 ; scal = false
#gamma = 1e-4 ; scal = true
mod = model(dkplsr; nlv, kern, gamma, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f  

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
nlv = 2
gamma = 1 / 3
mod = model(dkplsr; nlv, gamma) ;
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.dkplsrdaMethod
dkplsrda(X, y; kwargs...)
dkplsrda(X, y, weights::Weight; kwargs...)

Discrimination based on direct kernel partial least squares regression (DKPLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
kern = :krbf ; gamma = .001 
scal = true
mod = model(dkplsrda; nlv, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.dmkernMethod
dmkern(X; kwargs...)

Gaussian kernel density estimation (KDE).

  • X : X-data (n, p).

Keyword arguments:

  • h_kde : Define the bandwidth, see examples.
  • a_kde : Constant for Scott's rule (default bandwidth), see thereafter.

Estimation of the probability density of X (column space) by non parametric Gaussian kernels.

Data X can be univariate (p = 1) or multivariate (p > 1). In the latter case, function dmkern computes a multiplicative kernel such as in Scott & Sain 2005 Eq.19, and the internal bandwidth matrix H is diagonal (see the code).

Note: H in the dmkern code is often noted "H^(1/2)" in the literature (e.g. Wikipedia).

The default bandwidth is computed by:

  • h_kde = a_kde * n^(-1 / (p + 4)) * colstd(X)

(a_kde = 1 in Scott & Sain 2005).
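
The default bandwidth can thus be reproduced directly from the formula (a sketch on simulated data that only transcribes the expression above, with a_kde = 1):

n, p = 150, 4
X = rand(n, p)
a_kde = 1
h_kde = a_kde * n^(-1 / (p + 4)) .* colstd(X)   # one bandwidth per column of X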

References

Scott, D.W., Sain, S.R., 2005. 9 - Multidimensional Density Estimation, in: Rao, C.R., Wegman, E.J., Solka, J.L. (Eds.), Handbook of Statistics, Data Mining and Data Visualization. Elsevier, pp. 229–261. https://doi.org/10.1016/S0169-7161(04)24009-3

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
p = nco(T)

#### Probability density in the FDA 
#### score space (2D)

mod = model(dmkern)
fit!(mod, T) 
pnames(mod.fm)
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

h_kde = .3
mod = model(dmkern; h_kde)
fit!(mod, T) 
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

h_kde = [.3; .1]
mod = model(dmkern; h_kde)
fit!(mod, T) 
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred

## Bivariate distribution
npoints = 2^7
nlv = 2
lims = [(minimum(T[:, j]), maximum(T[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
m = nro(grid)
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, T) 

res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1];  title = "Density for FDA scores (Iris)", xlabel = "Score 1", 
    ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
#xlims!(ax, -15, 15) ;ylims!(ax, -15, 15)
f

## Univariate distribution
x = T[:, 1]
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, x) 
pred = predict(mod, x).pred 
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
scatter!(ax, x, vec(pred); color = :red)
f

x = T[:, 1]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmkern) 
#mod = model(dmkern; a_kde = .5) 
#mod = model(dmkern; h_kde = .3) 
fit!(mod, x) 
pred_grid = predict(mod, grid).pred 
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
source
Jchemo.dmnormFunction
dmnorm(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnorm!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)

Normal probability density estimation.

  • X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.

Keyword arguments:

  • mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
  • S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
  • simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.

Data X can be univariate (p = 1) or multivariate (p > 1). See examples.

When simpl = true, the determinant of the covariance matrix (object detS) and the constant (2 * pi)^(-p / 2) (object cst) in the density formula are set to 1. The function then returns a pseudo density that reduces to exp(-d / 2), where d is the squared Mahalanobis distance to the center. This can for instance be useful when the number of columns (p) of X becomes too large and when, consequently:

  • detS tends to 0 or, conversely, to infinity
  • cst tends to 0

which makes it impossible to compute the true density.

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
n, p = size(T)

#### Probability density in the FDA score space (2D)
#### Example of class Setosa 
s = y .== "setosa"
zT = T[s, :]

## Bivariate distribution
mod = model(dmnorm)
fit!(mod, zT)
fm = mod.fm
pnames(fm)
fm.Uinv 
fm.detS
pred = predict(mod, zT).pred
@head pred

mu = colmean(zT)
S = covm(zT, mweight(ones(nro(zT))))
## Direct syntax
dmnorm(; mu = mu, S = S).Uinv
dmnorm(; mu = mu, S = S).detS

npoints = 2^7
lims = [(minimum(zT[:, j]), maximum(zT[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
mod = model(dmnorm)
fit!(mod, zT)
res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1];  title = "Density for FDA scores (Iris - Setosa)", 
    xlabel = "Score 1", ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
scatter!(ax, zT[:, 1], zT[:, 2], color = :blue, markersize = 5)
#xlims!(ax, -12, 12) ;ylims!(ax, -12, 12)
f

## Univariate distribution
j = 1
x = zT[:, j]
mod = model(dmnorm)
fit!(mod, x)
pred = predict(mod, x).pred 
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
scatter!(ax, x, vec(pred); color = :red)
f

x = zT[:, j]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmnorm)
fit!(mod, x)
pred_grid = predict(mod, grid).pred 
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf)  # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
source
Jchemo.dmnormlogFunction
dmnormlog(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnormlog!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)

Logarithm of the normal probability density estimation.

  • X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.

Keyword arguments:

  • mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
  • S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
  • simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.

See the help of function dmnorm.

Examples

using JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2") 
@load db dat
pnames(dat)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]
n = nro(X)
tab(y) 

## Example of class Setosa 
s = y .== "setosa"
zX = X[s, :]

mod = model(dmnormlog)
fit!(mod, zX)
fm = mod.fm
pnames(fm)
fm.Uinv 
fm.logdetS
pred = predict(mod, zX).pred
@head pred 

mod0 = model(dmnorm)
fit!(mod0, zX)
pred0 = predict(mod0, zX).pred
@head log.(pred0)
source
Jchemo.dummyFunction
dummy(y, T = Float64)

Compute dummy table from a categorical variable.

  • y : A categorical variable.
  • T : Type of the output dummy table Y.

Examples

y = ["d", "a", "b", "c", "b", "c"]
#y =  rand(1:3, 7)
res = dummy(y)
pnames(res)
res.Y
source
Jchemo.duplMethod
dupl(X; digits = 3)

Find duplicated rows in a dataset.

  • X : A dataset.
  • digits : Nb. digits used to round X before checking.

Examples

X = rand(5, 3)
Z = vcat(X, X[1:3, :], X[1:1, :])
dupl(X)
dupl(Z)

M = hcat(X, fill(missing, 5))
Z = vcat(M, M[1:3, :])
dupl(M)
dupl(Z)
source
Jchemo.eposvdMethod
eposvd(D; nlv = 1)

Compute an orthogonalization matrix for calibration transfer of spectral data.

  • D : Data (m, p) containing the detrimental information on which spectra (rows of a matrix X) have to be orthogonalized.

Keyword arguments:

  • nlv : Nb. of first loadings vectors of D considered for the orthogonalization.

The objective is to remove some detrimental information (e.g. humidity patterns in signals, multiple spectrometers, etc.) from an X-dataset (n, p). The detrimental information is defined by the main row-directions computed from a matrix D (m, p).

Function eposvd returns two objects:

  • P (p, nlv) : The matrix of the nlv first loading vectors of the SVD decomposition (non centered PCA) of D.
  • M (p, p) : The orthogonalization matrix, used to orthogonalize a given matrix X to the directions contained in P.

Any matrix X can then be corrected from D by:

  • X_corrected = X * M.

Matrix D can be built from many methods. For instance, two common methods are:

  • EPO (Roger et al. 2003, 2018): D is built from a set of differences between spectra collected under different conditions.
  • TOP (Andrew & Fearn 2004): Each row of D is the mean spectrum computed for a given spectrometer instrument.

A particular situation is the following. Assume that D is built from some differences between matrices X1 and X2, and that a bilinear model (e.g. PLSR) is fitted on the data {X1corrected, Y} where X1corrected = X1 * M. To predict new data X2new with the fitted model, there is no need to correct X2new.

References

Andrew, A., Fearn, T., 2004. Transfer by orthogonal projection: making near-infrared calibrations robust to between-instrument variation. Chemometrics and Intelligent Laboratory Systems 72, 51–56. https://doi.org/10.1016/j.chemolab.2004.02.004

Roger, J.-M., Chauchard, F., Bellon-Maurel, V., 2003. EPO-PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits. Chemometrics and Intelligent Laboratory Systems 66, 191-204. https://doi.org/10.1016/S0169-7439(03)00051-0

Roger, J.-M., Boulet, J.-C., 2018. A review of orthogonal projections for calibration. Journal of Chemometrics 32, e3045. https://doi.org/10.1002/cem.3045

Zeaiter, M., Roger, J.M., Bellon-Maurel, V., 2006. Dynamic orthogonal projection. A new method to maintain the on-line robustness of multivariate calibrations. Application to NIR-based monitoring of wine fermentations. Chemometrics and Intelligent Laboratory Systems, 80, 227–235. https://doi.org/10.1016/j.chemolab.2005.06.011

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val

## The objective is to remove a detrimental 
## information (here, D) from spaces X1 and X2
D = X1cal - X2cal
nlv = 2
res = eposvd(D; nlv)
res.M # orthogonalization matrix
res.P # detrimental directions (columns of matrix P = loadings of D)

## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M

i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
source
Jchemo.errpMethod
errp(pred, y)

Compute the classification error rate (ERRP).

  • pred : Predictions.
  • y : Observed data (class membership).

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
errp(pred, ytest)
source
Jchemo.euclsqMethod
euclsq(X, Y)

Squared Euclidean distances between the rows of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).

For X (n, p) and Y (m, p), the function returns a matrix (n, m) where:

  • entry (i, j) = the squared Euclidean distance between row i of X and row j of Y.

Examples

X = rand(5, 3)
Y = rand(2, 3)

euclsq(X, Y)

euclsq(X[1:1, :], Y[1:1, :])

euclsq(X[:, 1], 4)
euclsq(1, 4)
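
## Sanity check of one entry against the definition above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
D = euclsq(X, Y)
sum(abs2, X[1, :] - Y[2, :])  # should be close to D[1, 2]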
source
Jchemo.fblockscalMethod
fblockscal(Xbl, bscales)
fblockscal!(Xbl::Vector, bscales::Vector)

Scale multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, the output of function mblock applied to (n, p) data.
  • bscales : A vector (of length equal to the nb. of blocks) of the scalars dividing the blocks.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 

bscales = 10 * ones(3)
zXbl = fblockscal(Xbl, bscales) ;
@head zXbl[3]
@head Xbl[3]

fblockscal!(Xbl, bscales) ;
@head Xbl[3]
source
Jchemo.fcenterMethod
fcenter(X, v)
fcenter!(X::AbstractMatrix, v)

Center each column of X.

  • X : Data.
  • v : Centering vector.

Examples

n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
fcenter(X, xmeans)
source
Jchemo.fcscaleMethod
fcscale(X, u, v)
fcscale!(X, u, v)

Center and scale each column of X.

  • X : Data.
  • u : Centering vector.
  • v : Scaling vector.

Examples

n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
xstds = colstd(X)
fcscale(X, xmeans, xstds)
source
Jchemo.fdaMethod
fda(X, y; kwargs...)
fda(X, y, weights; kwargs...)
fda!(X::Matrix, y, weights; kwargs...)

Factorial discriminant analysis (FDA).

  • X : X-data (n, p).
  • y : y-data (n) (class membership).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of discriminant components.
  • lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

FDA by eigen factorization of Inverse(W) * B, where W is the "Within"-covariance matrix (pooled over the classes), and B the "Between"-covariance matrix.

The function maximizes the compromise:

  • p'Bp / p'Wp

i.e. max p'Bp with constraint p'Wp = 1. Vectors p (columns of P) are the linear discriminant coefficients, often referred to as "LD".

If X is ill-conditioned, a ridge regularization can be used:

  • If lb > 0, W is replaced by W + lb * I, where I is the identity matrix.

In these fda functions, observation weights (argument weights) are used to compute matrices W and B.

In the high-level version, the observation weights are automatically defined by the given priors (argument prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level versions.

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
tab(ytrain)
tab(ytest)

nlv = 2
mod = model(fda; nlv)
#mod = model(fdasvd; nlv)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
lev = fm.lev
nlev = length(lev)
aggsum(fm.weights.w, ytrain)

@head fm.T 
@head transf(mod, Xtrain)
@head transf(mod, Xtest)

## X-loadings matrix
## = coefficients of the linear discriminant function
## = "LD" of function lda of the R package MASS
fm.P
fm.P' * fm.P

## Explained variance computed by weighted PCA 
## of the class centers in transformed scale
summary(mod).explvarx

## Projections of the class centers 
## to the score space
ct = fm.Tcenters 
f, ax = plotxy(fm.T[:, 1], fm.T[:, 2], ytrain; ellipse = true, title = "FDA",
    xlabel = "Score-1", ylabel = "Score-2")
scatter!(ax, ct[:, 1], ct[:, 2], marker = :star5, markersize = 15, color = :red)  # see available_marker_symbols()
f
source
Jchemo.fdasvdMethod
fdasvd(X, y, weights; kwargs...)
fdasvd!(X::Matrix, y, weights; kwargs...)

Factorial discriminant analysis (FDA).

  • X : X-data (n, p).
  • y : y-data (n) (class membership).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of discriminant components.
  • lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

FDA by a weighted SVD factorization of the matrix of the class centers (after spherical transformation). The function gives the same results as function fda.

See function fda for details and examples.

source
Jchemo.fdifMethod
fdif(X; kwargs...)

Finite differences (discrete derivates) for each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Nb. points involved in the window for the finite differences. The range of the window (= nb. of intervals between two successive columns) is npoint - 1.

The method reduces the column-dimension:

  • (n, p) -> (n, p - npoint + 1).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(fdif; npoint = 2) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.findindexMethod
findindex(x, lev)

Replace a vector containing levels by the indexes of those levels within a set of levels.

  • x : Vector (n) of levels to replace.
  • lev : Vector (nlev) containing the levels.

Warning: The levels in x must be contained in lev.

Examples

lev = ["EHH" ; "FFS" ; "ANF" ; "CLZ" ; "CNG" ; "FRG" ; "MPW" ; "PEE" ; "SFG" ; "TTS"]
x = ["EHH" ; "TTS" ; "FRG"]
findindex(x, lev)
source
Jchemo.findmax_claMethod
findmax_cla(x)
findmax_cla(x, weights::Weight)

Find the most frequent level in x.

  • x : A categorical variable.
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

In case of ties, the function returns the first of the tied levels.

Examples

x = rand(1:3, 10)
tab(x)
findmax_cla(x)
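
## Weighted variant (a sketch, not from the original docstring; it assumes a 
## Weight object built with function mweight, as stated in the signature above):
w = mweight(rand(10))
findmax_cla(x, w)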
source
Jchemo.frobMethod
frob(X)
frob(X, weights::Weight)

Frobenius norm of a matrix.

  • X : A matrix (n, p).
  • weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).

The Frobenius norm of X is:

  • sqrt(tr(X' * X)).

The Frobenius weighted norm is:

  • sqrt(tr(X' * D * X)), where D is the diagonal matrix of the weights (vector weights.w).
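
Examples

## A minimal usage sketch (not part of the original docstring); it assumes a numeric
## matrix and, for the weighted form, a Weight object built with function mweight.
X = rand(5, 3)
frob(X)               # sqrt(tr(X' * X))
w = mweight(ones(5))
frob(X, w)            # sqrt(tr(X' * D * X)), with D the diagonal matrix of w.w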
source
Jchemo.fscaleMethod
fscale(X, v)
fscale!(X::AbstractMatrix, v)

Scale each column of X.

  • X : Data.
  • v : Scaling vector.

Examples

X = rand(5, 2) 
fscale(X, colstd(X))
source
Jchemo.fweightMethod
fweight(d; typw = :bisquare, alpha = 0)

Computation of weights from distances.

  • d : Vector of distances.

Keyword arguments:

  • typw : Define the weight function.
  • alpha : Parameter of the weight function, see below.

The returned weight vector is:

  • w = f(d / q) where f is the weight function and q the 1-alpha quantile of d (Cleveland & Grosse 1991).

Possible values for typw are:

  • :bisquare: w = (1 - x^2)^2
  • :cauchy: w = 1 / (1 + x^2)
  • :epan: w = 1 - x^2
  • :fair: w = 1 / (1 + x)^2
  • :invexp: w = exp(-x)
  • :invexp2: w = exp(-x / 2)
  • :gauss: w = exp(-x^2)
  • :trian: w = 1 - x
  • :tricube: w = (1 - x^3)^3

References

Cleveland, W.S., Grosse, E., 1991. Computational methods for local regression. Stat Comput 1, 47–62. https://doi.org/10.1007/BF01890836

Examples

using CairoMakie, Distributions

d = sort(sqrt.(rand(Chi(1), 1000)))
cols = cgrad(:tab10, collect(1:9)) ;
alpha = 0
f = Figure(size = (600, 500))
ax = Axis(f, xlabel = "d", ylabel = "Weight")
typw = :bisquare
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[1])
typw = :cauchy
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[2])
typw = :epan
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[3])
typw = :fair
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[4])
typw = :gauss
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[5])
typw = :trian
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[6])
typw = :invexp
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[7])
typw = :invexp2
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[8])
typw = :tricube
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[9])
axislegend("Function", position = :lb)
f[1, 1] = ax
f
source
Jchemo.getknnMethod
getknn(Xtrain, X; metric = :eucl, k = 1)

Return the k nearest neighbors in Xtrain of each row of the query X.

  • Xtrain : Training X-data.
  • X : Query X-data.

Keyword arguments:

  • metric : Type of distance used for the query. Possible values are :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).
  • k : Number of neighbors to return.

The distances (not squared) are also returned.

Spectral angular and correlation distances between two vectors x and y:

  • Spectral angular distance (x, y) = acos(x'y / (norm(x) * norm(y))) / pi
  • Correlation distance (x, y) = sqrt((1 - cor(x, y)) / 2)

Both distances are bounded within 0 (y = x) and 1 (y = -x).

Examples

Xtrain = rand(5, 3)
X = rand(2, 3)
x = X[1:1, :]

k = 3
res = getknn(Xtrain, X; k)
res.ind  # indexes
res.d    # distances

res = getknn(Xtrain, x; k)
res.ind

res = getknn(Xtrain, X; metric = :mah, k)
res.ind
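
## Other metrics listed above (a sketch; it assumes :sam and :cor are accepted 
## as documented):
res = getknn(Xtrain, X; metric = :sam, k)
res.ind
res = getknn(Xtrain, X; metric = :cor, k)
res.d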
source
Jchemo.gridcvMethod
gridcv(mod, X, Y; segm, score, pars = nothing, nlv = nothing, lb = nothing, 
    verbose = false)

Cross-validation (CV) of a model over a grid of parameters.

  • mod : Model to evaluate.
  • X : Training X-data (n, p).
  • Y : Training Y-data (n, q).

Keyword arguments:

  • segm : Segments of observations used for the CV (output of functions segmts, segmkf, etc.).
  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars.

For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.

The function returns two outputs:

  • res : mean results
  • res_p : results per replication.

Examples

######## Regression

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Replicated K-fold CV 
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)

####-- Plsr
mod = model(plskern)
nlv = 0:30
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv) ;
pnames(rescv)
res = rescv.res 
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Adding pars 
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm,  score = rmsep, pars, nlv) ;
res = rescv.res 
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Rr 
lb = (10).^(-8:.1:3)
mod = model(rr) 
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, lb) ;
res = rescv.res 
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f     
    
## Adding pars 
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, lb) ;
res = rescv.res 
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Kplsr 
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
rescv = gridcv(mod, Xtrain, ytrain; segm,  score = rmsep, pars, nlv) ;
res = rescv.res 
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs",  ylabel = "RMSEP", 
    leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Knnr 
nlvdis = [15, 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1; 5; 10; 20; 50 ; 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
mod = model(knnr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res 
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Lwplsr 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
nlv = 0:20
mod = model(lwplsr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, nlv, verbose = true) ;
res = rescv.res 
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u], 
    h = res.h[u], k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- LwplsrAvg 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20]  
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1]) 
mod = model(lwplsravg)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res 
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f     

######## Discrimination
## The principle is the same as for regression

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Replicated K-fold CV 
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)

####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif; :prop]
pars = mpar(prior = prior)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = errp, pars, nlv)
res = rescv.res
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
source
Jchemo.gridcv_brMethod
gridcv_br(X, Y; segm, fun, score, pars, verbose = false)

Working function for gridcv.

See function gridcv for examples.

source
Jchemo.gridcv_lbMethod
gridcv_lb(X, Y; segm, fun, score, pars = nothing, lb, verbose = false)

Working function for gridcv.

Specific and faster than gridcv_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.

See function gridcv for examples.

source
Jchemo.gridcv_lvMethod
gridcv_lv(X, Y; segm, fun, score, pars = nothing, nlv, verbose = false)

Working function for gridcv.

Specific and faster than gridcv_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.

See function gridcv for examples.

source
Jchemo.gridscoreMethod
gridscore(mod, Xtrain, Ytrain, X, Y; score, pars = nothing, nlv = nothing, 
    lb = nothing, verbose = false)

Test-set validation of a model over a grid of parameters.

  • mod : Model to evaluate.
  • Xtrain : Training X-data (n, p).
  • Ytrain : Training Y-data (n, q).
  • X : Validation X-data (m, p).
  • Y : Validation Y-data (m, q).

Keyword arguments:

  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars. The score is computed over the validation set {X, Y}.

For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.

Examples

######## Regression 

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Plsr
mod = model(plskern)
nlv = 0:30
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv)
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Adding pars 
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Rr 
lb = (10).^(-8:.1:3)
mod = model(rr) 
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, lb)
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
    
## Adding pars 
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, lb)
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Kplsr 
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
    leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Knnr 
nlvdis = [15; 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1, 5, 10, 20, 50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
mod = model(knnr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- Lwplsr 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1]) 
nlv = 0:20
mod = model(lwplsr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv, verbose = true)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####-- LwplsrAvg 
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1]) 
mod = model(lwplsravg)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u], 
    k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   

####-- Mbplsr
listbl = [1:525, 526:1050]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 
Xbl_cal = mblock(Xcal, listbl) 
Xbl_val = mblock(Xval, listbl) 

mod = model(mbplsr)
bscal = [:none, :frob]
pars = mpar(bscal = bscal) 
nlv = 0:30
res = gridscore(mod, Xbl_cal, ycal, Xbl_val, yval; score = rmsep, pars, nlv)
group = res.bscal 
plotgrid(res.nlv, res.y1, group; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(mbplsr; bscal = res.bscal[u], nlv = res.nlv[u])
fit!(mod, Xbltrain, ytrain)
pred = predict(mod, Xbltest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
    
######## Discrimination
## The principle is the same as for regression

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif, :prop]
pars = mpar(prior = prior)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = errp, pars, nlv)
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
source
Jchemo.gridscoreMethod
gridscore(mod::Pipeline, Xtrain, Ytrain, X, Y; score, pars = nothing, 
    nlv = nothing, lb = nothing, verbose = false)

Test-set validation of a model pipeline over a grid of parameters.

  • mod : A pipeline of models to evaluate.
  • Xtrain : Training X-data (n, p).
  • Ytrain : Training Y-data (n, q).
  • X : Validation X-data (m, p).
  • Y : Validation Y-data (m, q).

Keyword arguments:

  • score : Function computing the prediction score (e.g. rmsep).
  • pars : tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
  • verbose : If true, fitting information are printed.
  • nlv : Value, or vector of values, of the nb. of latent variables (LVs).
  • lb : Value, or vector of values, of the ridge regularization parameter "lambda".

In the present version of the function, only the last model of the pipeline (= the final predictor) is validated.

For other details, see function gridscore for simple models.

Examples

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val 
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]

####-- Pipeline Snv :> Savgol :> Plsr
## Only the last model is validated
## mod1
centr = true ; scal = false
mod1 = model(snv; centr, scal)
## mod2 
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
## mod3
nlv = 0:30
mod3 = model(plskern)
##
mod = pip(mod1, mod2, mod3)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv) ;
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod3 = model(plskern; nlv = res.nlv[u])
mod = pip(mod1, mod2, mod3)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f

####-- Pipeline Pca :> Svmr
## Only the last model is validated
## mod1
nlv = 15 ; scal = true
mod1 = model(pcasvd; nlv, scal)
## mod2
kern = [:krbf]
gamma = (10).^(-5:1.:5)
cost = (10).^(1:3)
epsilon = [.1, .2, .5]
pars = mpar(kern = kern, gamma = gamma, cost = cost, epsilon = epsilon)
mod2 = model(svmr)
##
mod = pip(mod1, mod2)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars)
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
mod2 = model(svmr; kern = res.kern[u], gamma = res.gamma[u], cost = res.cost[u],
    epsilon = res.epsilon[u])
mod = pip(mod1, mod2) ;
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f
source
Jchemo.gridscore_brMethod
gridscore_br(Xtrain, Ytrain, X, Y; fun, score, pars, 
    verbose = false)

Working function for gridscore.

See function gridscore for examples.

source
Jchemo.gridscore_lbMethod
gridscore_lb(Xtrain, Ytrain, X, Y; fun, score, pars = nothing, 
    lb, verbose = false)

Working function for gridscore.

Specific and faster than gridscore_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.

See function gridscore for examples.

source
Jchemo.gridscore_lvMethod
gridscore_lv(Xtrain, Ytrain, X, Y; fun, score, pars = nothing, 
    nlv, verbose = false)

Working function for gridscore.

Specific and faster than gridscore_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.

See function gridscore for examples.

source
Jchemo.headMethod
head(X)

Display the first rows of a dataset.

Examples

X = rand(100, 5)
head(X)
@head X
source
Jchemo.interplMethod
interpl(X; kwargs...)

Sampling spectra by interpolation.

  • X : Matrix (n, p) of spectra (rows).

Keyword arguments:

  • wl : Values representing the column "names" of X. Must be a numeric vector of length p, or an AbstractRange.
  • wlfin : Final values (within the range of wl) where to interpolate the spectrum. Must be a numeric vector, or an AbstractRange.

The function implements a cubic spline interpolation using package DataInterpolations.jl.

References

Package DataInterpolations.jl https://github.com/PumasAI/DataInterpolations.jl https://htmlpreview.github.io/?https://github.com/PumasAI/DataInterpolations.jl/blob/v2.0.0/example/DataInterpolations.html

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

wlfin = range(500, 2400, length = 10)
#wlfin = collect(range(500, 2400, length = 10))
mod = model(interpl; wl, wlfin)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.isel!Function
isel!(mod, X, Y, wl = 1:nco(X); rep = 1, nint = 5, psamp = .3, score = rmsep)

Interval variable selection.

  • mod : Model to evaluate.
  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • wl : Optional numeric labels (p, 1) of the X-columns.

Keyword arguments:

  • rep : Number of replications of the splitting training/test.
  • nint : Nb. intervals.
  • psamp : Proportion of data used as test set to compute the score.
  • score : Function computing the prediction score.

The principle is as follows:

  • Data (X, Y) are split randomly into a training and a test set.
  • Range 1:p in X is segmented into nint intervals, of equal size when possible.
  • The model is fitted on the training set, and the score (error rate) is computed on the test set, first accounting for all the p variables (reference) and then for each of the nint intervals.
  • This process is replicated rep times. Average results are provided in the outputs, as well as the results per replication.

References

Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500

Examples

using DataFrames, JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y 
wl_str = names(X)
wl = parse.(Float64, wl_str) 
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f

s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Work on the j-th 
## y-variable 
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]

mod = model(plskern; nlv = 5)
nint = 10
res = isel!(mod, Xtrain, ytrain, wl; rep = 30, nint) ;
res.res_rep
res.res0_rep
zres = res.res
zres0 = res.res0
f = Figure(size = (650, 300))
ax = Axis(f[1, 1], xlabel = "Wawelength (nm)", ylabel = "RMSEP_Val",
    xticks = zres.lo)
scatter!(ax, zres.mid, zres.y1; color = (:red, .5))
vlines!(ax, zres.lo; color = :grey, linestyle = :dash, linewidth = 1)
hlines!(ax, zres0.y1, linestyle = :dash)
f
source
Jchemo.kdedaMethod
kdeda(X, y; kwargs...)

Discriminant analysis using non-parametric kernel Gaussian density estimation (KDE-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.

The principle is the same as functions lda and qda except that densities are estimated from function dmkern instead of function dmnorm.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

prior = :unif
#prior = :prop
mod = model(kdeda; prior)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

mod = model(kdeda; prior, a_kde = .5) 
#mod = model(kdeda; prior, h_kde = .1) 
fit!(mod, Xtrain, ytrain)
mod.fm.fm[1].H
source
Jchemo.knndaMethod
knnda(X, y; kwargs...)

k-Nearest-Neighbours weighted discrimination (KNN-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : For stabilization when very close neighbors.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

This function has the same principle as function knnr, except that a discrimination is done instead of a regression. A weighted vote is done over the neighborhood, and the prediction corresponds to the most frequent class.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 2 ; k = 10
mod = model(knnda; nlvdis, metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.knnrMethod
knnr(X, Y; kwargs...)

k-Nearest-Neighbours weighted regression (KNNR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : For stabilization when very close neighbors.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

The general principle of this function is as follows (many other variants of kNNR pipelines can be built):

For each new observation to predict, the prediction is the weighted mean over a selected neighborhood (in X) of size k. Within the selected neighborhood, the weights are defined from the dissimilarities between the new observation and the neighbors, and are computed with function wdist.

In general, for high dimensional X-data, using the Mahalanobis distance requires preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
#nlvdis = 0 ; metric = :eucl 
h = 1 ; k = 5 
mod = model(knnr; nlvdis, metric, h, k) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
mod = model(knnr; k = 15, h = 5) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.kpcaMethod
kpca(X; kwargs...)
kpca(X, weights::Weight; kwargs...)

Kernel PCA (Scholkopf et al. 1997, Scholkopf & Smola 2002, Tipping 2001).

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. principal components (PCs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The method is implemented by SVD factorization of the weighted Gram matrix:

  • D^(1/2) * Phi(X) * Phi(X)' * D^(1/2)

where X is the centered matrix and D is the diagonal matrix of the weights (weights.w) of the observations (rows of X).

References

Scholkopf, B., Smola, A., Müller, K.-R., 1997. Kernel principal component analysis, in: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (Eds.), Artificial Neural Networks, ICANN 97, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 583-588. https://doi.org/10.1007/BFb0020217

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Tipping, M.E., 2001. Sparse kernel principal component analysis. Advances in neural information processing systems, MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf

Examples

using JchemoData, JLD2 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
Xtest = X[s.test, :]

nlv = 3
kern = :krbf ; gamma = 1e-4
mod = model(kpca; nlv, kern, gamma) ;
fit!(mod, Xtrain)
pnames(mod.fm)
@head T = mod.fm.T
T' * T
mod.fm.P' * mod.fm.P

@head Ttest = transf(mod, Xtest)

res = summary(mod) ;
pnames(res)
res.explvarx
source
Jchemo.kplskdedaMethod
kplskdeda(X, y; kwargs...)
kplskdeda(X, y, weights::Weight; kwargs...)

KPLS-KDEDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDEDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function kplslda for examples.

source
Jchemo.kplsldaMethod
kplslda(X, y; kwargs...)
kplslda(X, y, weights::Weight; kwargs...)

KPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
gamma = .1
mod = model(kplslda; nlv, gamma) 
#mod = model(kplslda; nlv, gamma, prior = :prop) 
#mod = model(kplsqda; nlv, gamma, alpha = .5) 
#mod = model(kplskdeda; nlv, gamma, a_kde = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.kplsqdaMethod
kplsqda(X, y; kwargs...)
kplsqda(X, y, weights::Weight; kwargs...)

KPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function kplslda for examples.

source
Jchemo.kplsrMethod
kplsr(X, Y; kwargs...)
kplsr(X, Y, weights::Weight; kwargs...)
kplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Kernel partial least squares regression (KPLSR) implemented with a Nipals algorithm (Rosipal & Trejo, 2001).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

This algorithm becomes slow for n > 1000. Use function dkplsr instead.

References

Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
kern = :krbf ; gamma = 1e-1
mod = model(kplsr; nlv, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
nlv = 2
kern = :krbf ; gamma = 1 / 3
mod = model(kplsr; nlv, kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.kplsrdaMethod
kplsrda(X, y; kwargs...)
kplsrda(X, y, weights::Weight; kwargs...)

Discrimination based on kernel partial least squares regression (KPLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
kern = :krbf ; gamma = .001 
scal = true
mod = model(kplsrda; nlv, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.kpolMethod
kpol(X, Y; kwargs...)

Compute a polynomial kernel Gram matrix.

  • X : X-data (n, p).
  • Y : Y-data (m, p).

Keyword arguments:

  • degree : Degree of the polynomial.
  • gamma : Scale of the polynomial.
  • coef0 : Offset of the polynomial.

Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:

  • K(X, Y) = Phi(X) * Phi(Y)'.

The polynomial kernel between two vectors x and y is computed by (gamma * (x' * y) + coef0)^degree.

References

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

X = rand(5, 3)
Y = rand(2, 3)
kpol(X, Y; degree = 3, gamma = .1, coef0 = 10)
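
## Sanity check of one entry against the formula above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
using LinearAlgebra
gam = .1 ; c0 = 10 ; d = 3
K = kpol(X, Y; degree = d, gamma = gam, coef0 = c0)
(gam * dot(X[1, :], Y[2, :]) + c0)^d  # should be close to K[1, 2]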
source
Jchemo.krbfMethod
krbf(X, Y; kwargs...)

Compute a Radial-Basis-Function (RBF) kernel Gram matrix.

  • X : X-data (n, p).
  • Y : Y-data (m, p).

Keyword arguments:

  • gamma : Scale parameter.

Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:

  • K(X, Y) = Phi(X) * Phi(Y)'.

The RBF kernel between two vectors x and y is computed by exp(-gamma * ||x - y||^2).

References

Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

X = rand(5, 3)
Y = rand(2, 3)
krbf(X, Y; gamma = .1)
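
## Sanity check of one entry against the formula above (a sketch, not from the 
## original docstring; it assumes the result is returned as a plain matrix):
gam = .1
K = krbf(X, Y; gamma = gam)
exp(-gam * sum(abs2, X[1, :] - Y[2, :]))  # should be close to K[1, 2]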
source
Jchemo.krrMethod
krr(X, Y; kwargs...)
krr(X, Y, weights::Weight; kwargs...)
krr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Kernel ridge regression (KRR) implemented by SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

KRR is also referred to as least squares SVM regression (LS-SVMR). The method is close to the particular case of SVM regression where there is no margin excluding the observations (epsilon coefficient set to zero). The difference is that an L2-norm optimization is done, instead of L1 in SVM.

References

Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.

Cawley, G.C., Talbot, N.L.C., 2002. Reduced Rank Kernel Ridge Regression. Neural Processing Letters 16, 293-302. https://doi.org/10.1023/A:1021798002258

Krell, M.M., 2018. Generalizing, Decoding, and Optimizing Support Vector Machine Classification. arXiv:1801.04929.

Saunders, C., Gammerman, A., Vovk, V., 1998. Ridge Regression Learning Algorithm in Dual Variables, in: In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, pp. 515-521.

Suykens, J.A.K., Lukas, L., Vandewalle, J., 2000. Sparse approximation using least squares support vector machines. 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353). https://doi.org/10.1109/ISCAS.2000.856439

Welling, M., n.d. Kernel ridge regression. Department of Computer Science, University of Toronto, Toronto, Canada. https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

lb = 1e-3
kern = :krbf ; gamma = 1e-1
mod = model(krr; lb, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

coef(mod)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]

lb = 1e-3
kern = :kpol ; degree = 1
mod = model(krr; lb, kern, degree) 
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest)
rmsep(res.pred, ytest)

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
lb = 1e-1
kern = :krbf ; gamma = 1 / 3
mod = model(krr; lb, kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.krrdaMethod
krrda(X, y; kwargs...)
krrda(X, y, weights::Weight; kwargs...)

Discrimination based on kernel ridge regression (KRR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function rrda (RR-DA) except that a kernel RR (function krr), instead of a RR (function rr), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

lb = 1e-5
kern = :krbf ; gamma = .001 
scal = true
mod = model(krrda; lb, kern, gamma, scal) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; lb = [.1, .001]).pred
source
Jchemo.ldaMethod
lda(; kwargs...)
lda(X, y; kwargs...)
lda(X, y, weights::Weight; kwargs...)

Linear discriminant analysis (LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

In these lda functions, observation weights (argument weights) are used to compute the intra-class (= "within") covariance matrix. Argument prior is used to define the usual prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = lda()
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lgMethod
lg(X, Y; centr = true)
lg(Xbl; centr = true)

Compute the Lg coefficient between matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • Xbl : A list (vector) of matrices.

Keyword arguments:

  • centr : Boolean indicating if the matrices will be internally centered or not.

Lg(X, Y) = sum(j = 1..p) sum(k = 1..q) cov(xj, yk)^2

RV(X, Y) = Lg(X, Y) / sqrt(Lg(X, X) * Lg(Y, Y))
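
A manual computation following the definition above (a minimal sketch; the covariance is taken as the uncorrected one, which is an assumption on the normalization used internally):

using Statistics
n = 5
X = rand(n, 4)
Y = rand(n, 3)
lg(X, Y)
## Double sum of squared covariances
sum([cov(X[:, j], Y[:, k]; corrected = false)^2 for j = 1:4, k = 1:3])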

References

Escofier, B. & Pagès, J. 1984. L’analyse factorielle multiple. Cahiers du Bureau universitaire de recherche opérationnelle. Série Recherche, tome 42, p. 3-68

Escofier, B. & Pagès, J. (2008). Analyses Factorielles Simples et Multiples : Objectifs, Méthodes et Interprétation. Dunod, 4e édition.

Examples

X = rand(5, 10)
Y = rand(5, 3)
lg(X, Y)

X = rand(5, 15) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
lg(Xbl)
source
Jchemo.listMethod
list(Q, n::Integer)

Create a Vector{Q}(undef, n).

isassigned(object, i) can be used to check if cell i is empty.

Examples

list(Float64, 5)
list(Array{Float64}, 5)
list(Matrix{Int}, 5)
source
Jchemo.listMethod
list(n::Integer)

Create a Vector{Any}(nothing, n).

isnothing(object[i]) can be used to check if cell i is empty.

Examples

list(5)
source
Jchemo.locwMethod
locw(Xtrain, Ytrain, X; listnn, listw = nothing, fun, verbose = false, 
    kwargs...)

Compute predictions for a given kNN model.

  • Xtrain : Training X-data.
  • Ytrain : Training Y-data.
  • X : X-data (m observations) to predict.

Keyword arguments:

  • listnn : List (vector) of m vectors of indexes.
  • listw : List (vector) of m vectors of weights.
  • fun : Function computing the model on the m neighborhoods.
  • verbose : Boolean. If true, fitting information is printed.
  • kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).

Each component i of listnn and listw contains the indexes and weights, respectively, of the nearest neighbors of x_i in Xtrain. The sizes of the neighborhood for i = 1,...,m can be different.
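
A hypothetical minimal sketch (the neighborhoods are built by hand from Euclidean distances; whether fun = mlr can be passed as-is and the fields of the returned object are assumptions):

n = 50 ; p = 5 ; m = 3
Xtrain = rand(n, p)
Ytrain = rand(n, 2)
X = rand(m, p)
k = 10
## Indexes of the k nearest neighbors of each row of X within Xtrain
listnn = [sortperm(vec(sum((Xtrain .- X[i:i, :]).^2, dims = 2)))[1:k] for i = 1:m]
res = locw(Xtrain, Ytrain, X; listnn, fun = mlr)
res.pred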

source
Jchemo.locwlvMethod
locwlv(Xtrain, Ytrain, X; listnn, listw = nothing, fun, nlv, verbose = true, 
    kwargs...)

Compute predictions for a given kNN model.

  • Xtrain : Training X-data.
  • Ytrain : Training Y-data.
  • X : X-data (m observations) to predict.

Keyword arguments:

  • listnn : List (vector) of m vectors of indexes.
  • listw : List (vector) of m vectors of weights.
  • fun : Function computing the model on the m neighborhoods.
  • nlv : Nb. or collection of nb. of latent variables (LVs).
  • verbose : Boolean. If true, fitting information is printed.
  • kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).

Same as locw but specific and much faster for LV-based models (e.g. PLSR).

source
Jchemo.lwmlrMethod
lwmlr(X, Y; kwargs...)

k-Nearest-Neighbours locally weighted multiple linear regression (kNN-LWMLR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.

This is the same principle as function lwplsr except that MLR models are fitted on the neighborhoods, instead of PLSR models. The neighborhoods are computed directly on X (there is no preliminary dimension reduction).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 20
mod0 = model(pcasvd; nlv) ;
fit!(mod0, Xtrain) 
@head Ttrain = mod0.fm.T 
@head Ttest = transf(mod0, Xtest)

metric = :eucl 
h = 2 ; k = 100 
mod = model(lwmlr; metric, h, k) 
fit!(mod, Ttrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Ttest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
mod = model(lwmlr; metric = :eucl, h = 1.5, k = 20) ;
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.lwmlrdaMethod
lwmlrda(X, y; kwargs...)

k-Nearest-Neighbours locally weighted MLR-based discrimination (kNN-LWMLR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.

This is the same principle as function lwmlr except that MLR-DA models, instead of MLR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

metric = :mah
h = 2 ; k = 10
mod = model(lwmlrda; metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsldaMethod
lwplslda(X, y; kwargs...)

kNN-LWPLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLS-LDA models, instead of PLSR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 1 ; k = 100
mod = model(lwplslda; nlvdis, metric, h, k, prior = :prop) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsqdaMethod
lwplsqda(X, y; kwargs...)

kNN-LWPLS-QDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLS-QDA models, instead of PLSR models, are fitted on the neighborhoods.

  • Warning: The present version of this function suffers from frequent stops due to non-positive definite matrices when doing QDA on neighborhoods, since some classes within a neighborhood can have very few observations. It is recommended to select a sufficiently large number of neighbors and/or to use a regularized QDA (alpha > 0).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 1 ; k = 200
mod = model(lwplsqda; nlvdis, metric, h, k, prior = :prop, alpha = .5) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.lwplsrMethod
lwplsr(X, Y; kwargs...)

k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

Function lwplsr fits kNN-LWPLSR models such as in Lesnoff et al. 2020. The general principle of the pipeline is as follows (many other variants of pipelines can be built):

LWPLSR is a particular case of weighted PLSR (WPLSR) (e.g. Schaal et al. 2002). In WPLSR, a priori weights, different from the usual 1/n (standard PLSR), are given to the n training observations. These weights are used for calculating (i) the scores and loadings of the WPLS and (ii) the regression model that fits (by weighted least squares) the Y-response(s) to the WPLS scores. The specificity of LWPLSR (compared to WPLSR) is that the weights are computed from dissimilarities (e.g. distances) between the new observation to predict and the training observations ("L" in LWPLSR comes from "localized"). Note that in LWPLSR the weights and therefore the fitted WPLSR model change for each new observation to predict.

In the original LWPLSR, all the n training observations are used for each observation to predict (e.g. Sicard & Sabatier 2006, Kim et al 2011). This can be very time consuming when n is large. A faster (and often more efficient) strategy is to first select, in the training set, a number of k nearest neighbors to the observation to predict (= "weighting 1") and then to apply LWPLSR only to this pre-selected neighborhood (= "weighting 2"). This strategy corresponds to a kNN-LWPLSR and is the one implemented in function lwplsr.

In lwplsr, the dissimilarities used for weightings 1 and 2 are computed from the raw X-data, or after a dimension reduction, depending on argument nlvdis. In the last case, global PLS2 scores (LVs) are computed from {X, Y} and the dissimilarities are computed over these scores.

In general, for high dimensional X-data, using the Mahalanobis distance requires a preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.

References

Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.

Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics, e3209. https://doi.org/10.1002/cem.3209

Schaal, S., Atkeson, C., Vijayamakumar, S. 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.

Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
h = 1 ; k = 200 ; nlv = 15
mod = model(lwplsr; nlvdis, metric, h, k, nlv) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    
source
Jchemo.lwplsravgMethod
lwplsravg(X, Y; kwargs...)

Averaging kNN-LWPLSR models with different numbers of latent variables (kNN-LWPLSR-AVG).

  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : A range of nb. of latent variables (LVs) to compute for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs, such as in Lesnoff 2023. On each neighborhood, a PLSR-averaging (Lesnoff et al., see the references below) is done instead of a single PLSR.

For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ..., 10 LVs, respectively.
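
A conceptual sketch of this averaging (pure Julia, not the internal code; the per-LV predictions are stand-ins):

preds = [randn(3, 1) for nlv in 5:10]    # one prediction matrix per number of LVs
predavg = sum(preds) / length(preds)     # simple average used as the final prediction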

References

Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850

M. Lesnoff, Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data, Chemometrics and Intelligent Laboratory Systems. 244 (2023) 105031. https://doi.org/10.1016/j.chemolab.2023.105031.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlvdis = 5 ; metric = :mah 
h = 1 ; k = 200 ; nlv = 4:20
mod = model(lwplsravg; nlvdis, metric, h, k, nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f  
source
Jchemo.lwplsrdaMethod
lwplsrda(X, y; kwargs...)

kNN-LWPLSR-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
  • metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
  • h : A scalar defining the shape of the weight function computed by function wdist. The lower h, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
  • k : The number of nearest neighbors to select for each observation to predict.
  • tolw : Tolerance used for numerical stabilization when neighbors are very close.
  • nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.

This is the same principle as function lwplsr except that PLSR-DA models, instead of PLSR models, are fitted on the neighborhoods.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlvdis = 25 ; metric = :mah
h = 2 ; k = 100
mod = model(lwplsrda; nlvdis, metric, h, k) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.mahsqMethod
mahsq(X, Y)
mahsq(X, Y, Sinv)

Squared Mahalanobis distances between the rows of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).
  • Sinv : Inverse of a covariance matrix S. If not given, S is computed as the uncorrected covariance matrix of X.

When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) with:

  • cell (i, j) = the squared distance between row i of X and row j of Y.

Examples

using StatsBase 

X = rand(5, 3)
Y = rand(2, 3)

mahsq(X, Y)

S = cov(X, corrected = false)
Sinv = inv(S)
mahsq(X, Y, Sinv)
mahsq(X[1:1, :], Y[1:1, :], Sinv)
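
## Manual check of cell (1, 1), following the definition above (a minimal sketch)
d = X[1, :] - Y[1, :]
d' * Sinv * d
mahsq(X, Y)[1, 1]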

mahsq(X[:, 1], 4)
mahsq(1, 4, 2.1)
source
Jchemo.mahsqcholMethod
mahsqchol(X, Y)
mahsqchol(X, Y, Uinv)

Compute the squared Mahalanobis distances (with a Cholesky factorization) between the observations (rows) of X and Y.

  • X : Data (n, p).
  • Y : Data (m, p).
  • Uinv : Inverse of the upper matrix of a Cholesky factorization of a covariance matrix S. If not given, the factorization is done on S, the uncorrected covariance matrix of X.

When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) with:

  • cell (i, j) = the squared distance between row i of X and row j of Y.

Examples

using LinearAlgebra

X = rand(5, 3)
Y = rand(2, 3)

mahsqchol(X, Y)

S = cov(X, corrected = false)
U = cholesky(Hermitian(S)).U 
Uinv = inv(U)
mahsqchol(X, Y, Uinv)
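
## Expected to agree with mahsq (both use the uncorrected covariance of X when S is not given)
mahsqchol(X, Y) ≈ mahsq(X, Y)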

mahsqchol(X[:, 1], 4)
mahsqchol(1, 4, sqrt(2.1))
source
Jchemo.matBFunction
matB(X, y, weights::Weight)

Between-class covariance matrix.

  • X : X-data (n, p).
  • y : A vector (n) defining the class membership.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute the between-class covariance matrix (output B) of X. This is the (non-corrected) covariance matrix of the weighted class centers.

Examples

using StatsBase

n = 20 ; p = 3
X = rand(n, p)
y = rand(1:3, n)
tab(y) 
weights = mweight(ones(n)) 

res = matB(X, y, weights) ;
res.B
res.priors
res.ni
res.lev

res = matW(X, y, weights) ;
res.W
res.Wi

matW(X, y, weights).W + matB(X, y, weights).B
cov(X; corrected = false)

v = mweight(collect(1:n))
matW(X, y, v).priors 
matB(X, y, v).priors 
matW(X, y, v).W + matB(X, y, v).B
covm(X, v)
source
Jchemo.matWFunction
matW(X, y, weights::Weight)

Within-class covariance matrices.

  • X : X-data (n, p).
  • y : A vector (n) defining the class membership.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute the (non-corrected) within-class and pooled covariance matrices (outputs Wi and W, respectively) of X.

If class i contains only one observation, Wi is computed by:

  • covm(X, weights).

For examples, see function matB.

source
Jchemo.mavgMethod
mavg(X; kwargs...)

Smoothing by moving averages of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Nb. points involved in the window.

The smoothing is computed by convolution with padding, using function imfilter of package ImageFiltering.jl. The centered kernel is ones(npoint) / npoint. Each returned point is located on the center of the kernel.

The function returns a matrix (n, p).

References

Package ImageFiltering.jl https://github.com/JuliaImages/ImageFiltering.jl

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(mavg; npoint = 10) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.mbconcatMethod
mbconcat(Xbl)

Concatenate horizontally multiblock X-data.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.

Examples

n = 5 ; m = 3 ; p = 10 
X = rand(n, p) 
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl) 
Xblnew = mblock(Xnew, listbl) 
@head Xbl[3]

mod = model(mbconcat) 
fit!(mod, Xbl)
transf(mod, Xbl)
transf(mod, Xblnew)
source
Jchemo.mblockMethod
mblock(X, listbl)

Make blocks from a matrix.

  • X : X-data (n, p).
  • listbl : A vector in which each component defines the column numbers of a block in X. The length of listbl is the number of blocks.

The function returns a list (vector) of blocks.

Examples

n = 5 ; p = 10 
X = rand(n, p) 
listbl = [3:4, 1, [6; 8:10]]

Xbl = mblock(X, listbl)
Xbl[1]
Xbl[2]
Xbl[3]
source
Jchemo.mbpcaMethod
mbpca(Xbl; kwargs...)
mbpca(Xbl, weights::Weight; kwargs...)
mbpca!(Xbl::Matrix, weights::Weight; kwargs...)

Consensus principal components analysis (CPCA = MBPCA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for Nipals convergence.
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).

The MBPCA global scores are equal to the scores of the PCA of the horizontal concatenation X = [X1 X2 ... Xk].
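
This equality can be checked numerically (a minimal sketch; bscal = :none, no column scaling and the default uniform weights are assumed, and the scores may differ by column signs):

n = 6
X = rand(n, 10)
listbl = [1:4, 5:7, 8:10]
Xbl = mblock(X, listbl)
mod1 = model(mbpca; nlv = 3, bscal = :none)
fit!(mod1, Xbl)
mod2 = model(pcasvd; nlv = 3)
fit!(mod2, reduce(hcat, Xbl))
## Global scores vs. scores of the PCA of the concatenation
mod1.fm.T
mod2.fm.T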

The function returns several objects, in particular:

  • T : The non-normed global scores.
  • U : The normed global scores.
  • W : The global loadings.
  • Tbl : The block scores (grouped by blocks, in original scale).
  • Tb : The block scores (grouped by LV, in the metric scale).
  • Wbl : The block loadings.
  • lb : The specific weights "lambda".
  • mu : The sum of the specific weights (= eigenvalue of the global PCA).

Function summary returns:

  • explvarx : Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
  • contr_block : Contribution of each block to the global scores.
  • explX : Proportion of the inertia of the blocks explained by each global score.
  • corx2t : Correlation between the global scores and the original variables.
  • cortb2t : Correlation between the global scores and the block scores.
  • rv : RV coefficient.
  • lg : Lg coefficient.

References

Mangamana, E.T., Cariou, V., Vigneau, E., Glèlè Kakaï, R.L., Qannari, E.M., 2019. Unsupervised multiblock data analysis: A unified approach and extensions. Chemometrics and Intelligent Laboratory Systems 194, 103856. https://doi.org/10.1016/j.chemolab.2019.103856

Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1]) 

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbpca; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod) 
pnames(mod.fm)
## Global scores 
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]

res = summary(mod, Xbl) ;
pnames(res) 
res.explvarx
res.contr_block
res.explX   # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t 
res.cortb2t
res.rv
source
Jchemo.mbplskdedaMethod
mbplskdeda(Xbl, y; kwargs...)
mbplskdeda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-KDEDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plskdeda, for multiblock X-data.

See function mbplslda for examples.

source
Jchemo.mbplsldaMethod
mbplslda(Xbl, y; kwargs...)
mbplslda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-LDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plslda, for multiblock X-data.

Examples

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 

nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplslda; nlv, bscal, scal)
#mod = model(mbplsqda; nlv, bscal, alpha = .5, scal)
#mod = model(mbplskdeda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain) 
pnames(mod) 

@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)

res = predict(mod, Xbltest) ; 
@head res.pred 
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xbltest; nlv = 1:2).pred
source
Jchemo.mbplsqdaMethod
mbplsqda(Xbl, y; kwargs...)
mbplsqda(Xbl, y, weights::Weight; kwargs...)

Multiblock PLS-QDA.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plsqda, for multiblock X-data.

See function mbplslda for examples.

source
Jchemo.mbplsrMethod
mbplsr(Xbl, Y; kwargs...)
mbplsr(Xbl, Y, weights::Weight; kwargs...)
mbplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock PLSR (MBPLSR).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This function runs a PLSR on {X, Y} where X is the horizontal concatenation of the blocks in Xbl. The function gives the same results as function mbplswest, but is much faster.
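
A numerical illustration of this equivalence (a minimal sketch; bscal = :none and no column scaling are assumed so that the concatenated data are left unchanged):

n = 6
X = rand(n, 10)
y = rand(n)
listbl = [1:4, 5:7, 8:10]
Xbl = mblock(X, listbl)
mod1 = model(mbplsr; nlv = 2, bscal = :none)
fit!(mod1, Xbl, y)
mod2 = model(plskern; nlv = 2)
fit!(mod2, reduce(hcat, Xbl), y)
## The two sets of training predictions are expected to agree
predict(mod1, Xbl).pred
predict(mod2, reduce(hcat, Xbl)).pred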

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplsr; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)

res = summary(mod, Xbltrain) ;
pnames(res) 
res.explvarx
res.corx2t 
res.rdx
source
Jchemo.mbplsrdaMethod
mbplsrda(Xbl, y; kwargs...)
mbplsrda(Xbl, y, weights::Weight; kwargs...)

Discrimination based on multiblock partial least squares regression (MBPLSR-DA).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This is the same principle as function plsrda, for multiblock X-data.

Examples

using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl) 

nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplsrda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain) 
pnames(mod) 

@head mod.fm.fm.T 
@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)

res = predict(mod, Xbltest) ; 
@head res.pred 
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xbltest; nlv = 1:2).pred
source
Jchemo.mbplswestMethod
mbplswest(Xbl, Y; kwargs...)
mbplswest(Xbl, Y, weights::Weight; kwargs...)
mbplswest!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock PLSR (MBPLSR) - Nipals algorithm.

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. See function blockscal for possible values.
  • tol : Tolerance value for convergence (Nipals).
  • maxit : Maximum number of iterations (Nipals).
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

This functions implements the MBPLSR Nipals algorithm such as in Westerhuis et al. 1998. The function gives the same results as function mbplsr.

References

Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplswest; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)

res = summary(mod, Xbltrain) ;
pnames(res) 
res.explvarx
res.corx2t 
res.cortb2t 
res.rdx
source
Jchemo.merrpMethod
merrp(pred, y)

Compute the mean intra-class classification error rate.

  • pred : Predictions.
  • y : Observed data (class membership).

ERRP (see function errp) is computed for each class. Function merrp returns the average of these intra-class ERRPs.

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
merrp(pred, ytest)
source
Jchemo.missMethod
miss(X)

Find rows with missing data in a dataset.

  • X : A dataset.

Examples

X = rand(5, 4)
zX = hcat(rand(2, 3), fill(missing, 2))
Z = vcat(X, zX)
miss(X)
miss(Z)
source
Jchemo.mlevMethod
mlev(x)

Return the sorted levels of a vector or a dataset.

Examples

x = rand(["a";"b";"c"], 20)
lev = mlev(x)
nlev = length(lev)

X = reshape(x, 5, 4)
mlev(X)

using DataFrames
n = 20
df = DataFrame(g1 = rand(1:2, n), 
    g2 = rand(["a"; "c"], n))
mlev(df)
source
Jchemo.mlrMethod
mlr(X, Y; kwargs...)
mlr(X, Y, weights::Weight; kwargs...)
mlr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a multiple linear regression model (MLR) by using the QR algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

Safe but can be a little slower than other methods.

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 2:4]
y = dat.X[:, 1]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]

mod = model(mlr)
#mod = model(mlrchol)
#mod = model(mlrpinv)
#mod = model(mlrpinvn) 
fit!(mod, Xtrain, ytrain) 
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.B
fm.int 
coef(mod) 
res = predict(mod, Xtest)
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

mod = model(mlr; noint = true)
fit!(mod, Xtrain, ytrain) 
coef(mod) 
source
Jchemo.mlrcholMethod
mlrchol(X, Y)
mlrchol(X, Y, weights::Weight)
mlrchol!(X::Matrix, Y::Matrix, weights::Weight)

Compute a multiple linear regression model (MLR) using the Normal equations and a Choleski factorization.

  • X : X-data, with nb. columns >= 2 (required by function cholesky).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Compute a model with intercept.

Faster but can be less accurate, since the method works on the cross-product matrix X'X (which squares the condition number).

See function mlr for examples.

source
Jchemo.mlrdaMethod
mlrda(X, y; kwargs...)
mlrda(X, y, weights::Weight)

Discrimination based on multiple linear regression (MLR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a multiple linear regression (MLR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
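
A conceptual sketch of this decision rule (pure Julia; lev and posterior are stand-in values, not Jchemo objects):

lev = ["a", "b", "c"]                # class levels
posterior = [-0.1 0.9 0.2]           # unbounded dummy-variable predictions for one observation
pred = lev[argmax(vec(posterior))]   # class with the highest estimate ("b")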

In the high-level version of the function, the observation weights used in the MLR are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = model(mlrda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.mlrpinvMethod
mlrpinv(; kwargs...)
mlrpinv(X, Y; kwargs...)
mlrpinv(X, Y, weights::Weight; kwargs...)
mlrpinv!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a multiple linear regression model (MLR) by using a pseudo-inverse.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

Safe but can be slower.

See function mlr for examples.

source
Jchemo.mlrpinvnMethod
mlrpinvn() 
mlrpinvn(X, Y)
mlrpinvn(X, Y, weights::Weight)
mlrpinvn!(X::Matrix, Y::Matrix, weights::Weight)

Compute a multiple linear regression model (MLR) by using the Normal equations and a pseudo-inverse.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Safe and fast for p not too large.

Compute a model with intercept.

See function mlr for examples.

source
Jchemo.mlrvecMethod
mlrvec(; kwargs...)
mlrvec(X, Y; kwargs...)
mlrvec(X, Y, weights::Weight; kwargs...)
mlrvec!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Compute a simple linear regression model (univariate x).

  • x : Univariate X-data (n).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • noint : Boolean. Defines whether the model is computed with an intercept or not.

See function mlr for examples.

source
Jchemo.modelMethod
model(fun::Function; kwargs...)

Build a model.

  • fun : The function defining the model.
  • kwargs...: Keyword arguments of fun.

Examples

X = rand(5, 10)
y = rand(5)

mod = model(detrend)  # use the default arguments of 'detrend'
#mod = model(detrend; degree = 2)
pnames(mod)
fit!(mod, X)
Xp = transf(mod, X)

mod = model(plskern; nlv = 3) 
fit!(mod, X, y)
pred = predict(mod, X).pred
source
Jchemo.mparFunction
mpar(; kwargs...)

Return a tuple with all the combinations of the parameter values defined in kwargs.

Keyword arguments:

  • kwargs : Vector(s) of the parameter(s) values.

Examples

nlvdis = 25 ; metric = [:mah] 
h = [1 ; 2 ; Inf] ; k = [500 ; 1000] 
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k) 
length(pars[1])
reduce(hcat, pars)
source
Jchemo.mseMethod
mse(pred, Y; digits = 3)

Summary of model performance for regression.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
mse(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
mse(pred, ytest)
source
Jchemo.msepMethod
msep(pred, Y)

Compute the mean of the squared prediction errors (MSEP).

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
msep(pred, Ytest)
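## Manual check of the definition (per response column; a minimal sketch,
## the exact output layout of msep may differ)
using Statistics
mean((pred .- Ytest).^2, dims = 1)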

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
msep(pred, ytest)
source
Jchemo.mweightMethod
mweight(x::Vector)

Return an object of type Weight containing the vector w = x / sum(x) (if a Weight object is built directly, ad hoc, its w must sum to 1).

Examples

x = rand(10)
w = mweight(x)
sum(w.w)
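## Check of the definition w = x / sum(x)
w.w ≈ x / sum(x)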
source
Jchemo.mweightclaMethod
mweightcla(x::Vector; prior::Union{Symbol, Vector} = :unif)
mweightcla(Q::DataType, x::Vector; prior::Union{Symbol, Vector} = :unif)

Compute observation weights for a categorical variable, given specified sub-total weights for the classes.

  • x : A categorical variable (n) (class membership).
  • Q : A data type (e.g. Float32).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).

Return an object of type Weight (see function mweight) containing a vector w (n) that sums to 1.

Examples

x = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
tab(x)
weights = mweightcla(x)
#weights = mweightcla(x; prior = :prop)
#weights = mweightcla(x; prior = [.1, .7, .2])
aggstat(weights.w, x; fun = sum).X
source
Jchemo.nipalsMethod
nipals(X; kwargs...)
nipals(X, UUt, VVt; kwargs...)

Nipals to compute the first score and loading vectors of a matrix.

  • X : X-data (n, p).
  • UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
  • VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.

Keyword arguments:

  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.

The function finds:

  • {u, v, sv} = argmin(||X - u * sv * v'||)

with the constraints:

  • ||u|| = ||v|| = 1

using the alternating least squares algorithm to compute SVD (Gabriel & Zamir 1979).

At the end, X ~ u * sv * v', where:

  • u : left singular vector (u * sv = scores)
  • v : right singular vector (loadings)
  • sv : singular value.

When NIPALS is used on sequentially deflated matrices, vectors u and v can lose orthogonality due to the accumulation of rounding errors. Orthogonality can be rebuilt with the Gram-Schmidt method (arguments UUt and VVt).

References

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Examples

using LinearAlgebra

X = rand(5, 3)

res = nipals(X)
res.niter
res.sv
svd(X).S[1] 
res.v
svd(X).V[:, 1] 
res.u
svd(X).U[:, 1] 
source
Jchemo.nipalsmissMethod
nipalsmiss(X; kwargs...)
nipalsmiss(X, UUt, VVt; kwargs...)

Nipals to compute the first score and loading vectors of a matrix with missing data.

  • X : X-data (n, p).
  • UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
  • VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.

Keyword arguments:

  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.

See function nipals.

References

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/

Examples

X = [1. 2 missing 4 ; 4 missing 6 7 ; 
    missing 5 6 13 ; missing 18 7 6 ; 
    12 missing 28 7] 

res = nipalsmiss(X)
res.niter
res.sv
res.v
res.u
source
Jchemo.normwMethod
normw(x, weights::Weight)

Compute the weighted norm of a vector.

  • x : A vector (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

The weighted norm of vector x is computed by:

  • sqrt(x' * D * x), where D is the diagonal matrix of vector weights.w.
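
A minimal usage sketch (mweight and normw as documented above):

x = rand(10)
w = mweight(ones(10))
normw(x, w)
## Equivalent computation following the definition
sqrt(sum(w.w .* x.^2))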
source
Jchemo.occodMethod
occod(fm, X; kwargs...)

One-class classification using PCA/PLS orthogonal distance (OD).

  • fm : The preliminary model (e.g. PCA) that was fitted (object fm) on the training data, assumed to represent the training class.
  • X : Training X-data (n, p), on which was fitted the model fm.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of an observation is the orthogonal distance (OD = "X-residuals") of this observation, i.e. the Euclidean distance between the observation and its projection on the score plane defined by the fitted (e.g. PCA) model (e.g. Hubert et al. 2005, Vanden Branden & Hubert 2005 p. 66, Varmuza & Filzmoser 2009 p. 79).

See function occsd for details on outputs.

References

M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.

K. Vanden Branden, M. Hubert (2005). Robust classification in high dimension based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems, 79, 10-21.

K. Varmuza, P. Filzmoser (2009). Introduction to multivariate statistical analysis in chemometrics. CRC Press, Boca Raton.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod0 = model(pcasvd; nlv = 10) 
fit!(mod0, zXtrain) 
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 10) 
fit!(mod0, zXtrain)
fm0 = mod0.fm ;  
## Outlierness
mod = model(occod)
#mod = model(occod; mcut = :mad, cri = 4)
#mod = model(occod; mcut = :q, risk = .01) ;
#mod = model(occsdod)
fit!(mod, fm0, zXtrain) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.occsdMethod
occsd(fm; kwargs...)

One-class classification using PCA/PLS score distance (SD).

  • fm : The preliminary model (e.g. PCA) that was fitted on the training data assumed to represent the training class.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of an observation is defined by its score distance (SD), i.e. the Mahalanobis distance between the projection of the observation on the score plane defined by the fitted (e.g. PCA) model and the center of the score plane.

If a new observation has d higher than a given cutoff, the observation is assumed to not belong to the training (= reference) class. The cutoff is computed with non-parametric heuristics. Noting [d] the vector of outliernesses computed on the training class:

  • If mcut = :mad, then cutoff = median([d]) + cri * mad([d]).
  • If mcut = :q, then cutoff is estimated from the empirical cumulative distribution function computed on [d], for a given risk-I (risk).

Alternative approximate cutoffs have been proposed in the literature (e.g.: Nomikos & MacGregor 1995, Hubert et al. 2005, Pomerantsev 2008). Typically, and whatever the approximation method used to compute the cutoff, it is recommended to tune this cutoff depending on the detection objectives.

Outputs

  • pval: Estimate of the p-value (see function pval) computed from the training distribution [d].
  • dstand: standardized distance defined as d / cutoff. A value dstand > 1 may be considered as extreme compared to the distribution of the training data.
  • gh: The WinISI "GH" (usually, GH > 3 is considered as extreme).

Specific for function predict:

  • pred: class prediction
    • dstand <= 1 ==> in: the observation is expected to belong to the training class,
    • dstand > 1 ==> out: extreme value, possibly not belonging to the same class as the training.

References

M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.

Nomikos, P., MacGregor, J.F., 1995. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 37, 41-59. https://doi.org/10.1080/00401706.1995.10485888

Pomerantsev, A.L., 2008. Acceptance areas for multivariate classification derived by projection methods. Journal of Chemometrics 22, 601-609. https://doi.org/10.1002/cem.1147

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod = model(pcasvd; nlv = 10) 
fit!(mod, zXtrain) 
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 30) 
fit!(mod0, zXtrain)
fm0 = mod0.fm ;  
## Outlierness
mod = model(occsd)
#mod = model(occsd; mcut = :mad, cri = 4)
#mod = model(occsd; mcut = :q, risk = .01)
fit!(mod, fm0) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.occsdodMethod
occsdod(object, X; kwargs...)

One-class classification using a compromise between PCA/PLS score (SD) and orthogonal (OD) distances.

  • object : The preliminary model (e.g. PCA) that was fitted on the training data assumed to represent the training class.
  • X : Training X-data (n, p), on which the model was fitted.

Keyword arguments:

  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.

In this method, the outlierness d of a given observation is a compromise between the score distance (SD) and the orthogonal distance (OD). The compromise is computed from the standardized distances by:

  • dstand = sqrt(dstand_sd * dstand_od).

See functions:

  • occsd for details of the outputs,
  • and occod for examples.
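
As a purely numerical illustration of the compromise formula above (hypothetical standardized distances, not the function's API):

dstand_sd = [0.5, 1.2, 0.8]
dstand_od = [0.9, 1.5, 0.4]
dstand = sqrt.(dstand_sd .* dstand_od)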
source
Jchemo.occstahMethod
occstah(X; kwargs...)

One-class classification using the Stahel-Donoho outlierness.

  • X : Training X-data (n, p).

Keyword arguments:

  • nlv : Nb. dimensions on which X is projected.
  • mcut : Type of cutoff. Possible values are: :mad, :q. See thereafter.
  • cri : When mcut = :mad, a constant. See thereafter.
  • risk : When mcut = :q, a risk-I level. See thereafter.
  • scal : Boolean. If true, each column of X is scaled as in function stah.

In this method, the outlierness d of a given observation is the Stahel-Donoho outlierness (see ?stah).

See function occsd for details on outputs.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2") 
@load db dat
pnames(dat)
X = dat.X    
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X) 
Xp = transf(mod, X) 
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]

## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out"   # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in"   # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]    
zXtest = Xtest[s2, :] 
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)

## Group description
mod = model(pcasvd; nlv = 10) 
fit!(mod, zXtrain) 
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", 
    xlabel = string("PC", i), ylabel = string("PC", i + 1)).f

#### Occ
## Preliminary dimension 
## Not required but often more 
## efficient
nlv = 50
mod0 = model(pcasvd; nlv) ;
fit!(mod0, zXtrain)
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
## Outlierness
mod = model(occstah; nlv, scal = true)
fit!(mod, Ttrain) 
pnames(mod) 
pnames(mod.fm) 
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), xlabel = "Obs. index", 
    ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f

res = predict(mod, Ttest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", 
    xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
source
Jchemo.outMethod
out(x, y)

Return whether the elements of a vector are strictly outside the range of a second vector.

  • x : Univariate data.
  • y : Univariate data on which is computed the range (min, max).

Return a BitVector.

Examples

x = [-200.; -100; -1; 0; 1; 200]
out(x, [-1; .2; 1])
out(x, (-1, 1))
source
Jchemo.pcaeigenMethod
pcaeigen(X; kwargs...)
pcaeigen(X, weights::Weight; kwargs...)
pcaeigen!(X::Matrix, weights::Weight; kwargs...)

PCA by Eigen factorization.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of X' * D * X.

See function pcasvd for examples.
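
A minimal usage sketch (random data, for illustration only), in addition to the pcasvd examples:

X = rand(20, 5)
mod = model(pcaeigen; nlv = 3)
fit!(mod, X)
@head mod.fm.T    # scores
mod.fm.P          # loadings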

source
Jchemo.pcaeigenkMethod
pcaeigenk(X; kwargs...)
pcaeigenk(X, weights::Weight; kwargs...)
pcaeigenk!(X::Matrix, weights::Weight; kwargs...)

PCA by Eigen factorization of the kernel matrix XX'.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

This is the "kernel cross-product" version of the PCA algorithm (e.g. Wu et al. 1997). For wide matrices (n << p, where p is the nb. columns) and n not too large, this algorithm can be much faster than the others.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of D^(1/2) * X * X' * D^(1/2).

See function pcasvd for examples.

References

Wu, W., Massart, D.L., de Jong, S., 1997. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems 36, 165-172. https://doi.org/10.1016/S0169-7439(97)00010-5
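
A minimal usage sketch (random wide matrix, for illustration of the n << p case), in addition to the pcasvd examples:

X = rand(10, 200)    # n << p
mod = model(pcaeigenk; nlv = 3)
fit!(mod, X)
@head mod.fm.T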

source
Jchemo.pcanipalsMethod
pcanipals(X; kwargs...)
pcanipals(X, weights::Weight; kwargs...)
pcanipals!(X::Matrix, weights::Weight; kwargs...)

PCA by NIPALS algorithm.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D by NIPALS.

See function pcasvd for examples.

References

Andrecut, M., 2009. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology 16, 1593-1599. https://doi.org/10.1089/cmb.2008.0221

K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.

Gabriel, R. K., 2002. Le biplot - Outil d'exploration de données multidimensionnelles. Journal de la Société Française de la Statistique, 143, 5-55.

Lingen, F.J., 2000. Efficient Gram-Schmidt orthonormalisation on parallel computers. Communications in Numerical Methods in Engineering 16, 57-66. https://doi.org/10.1002/(SICI)1099-0887(200001)16:1<57::AID-CNM320>3.0.CO;2-I

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
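
A minimal usage sketch (random data, for illustration only) showing the Gram-Schmidt option, in addition to the pcasvd examples:

X = rand(20, 5)
mod = model(pcanipals; nlv = 3, gs = true)
fit!(mod, X)
@head T = mod.fm.T
T' * T    # scores stay orthogonal when gs = true (see also the pcanipalsmiss example)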

source
Jchemo.pcanipalsmissMethod
pcanipalsmiss(X; kwargs...)
pcanipalsmiss(X, weights::Weight; kwargs...)
pcanipalsmiss!(X::Matrix, weights::Weight; kwargs...)

PCA by NIPALS algorithm allowing missing data.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

References

Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/

Examples

X = [1 2. missing 4 ; 4 missing 6 7 ; 
    missing 5 6 13 ; missing 18 7 6 ; 
    12 missing 28 7] 

nlv = 3 
tol = 1e-15
scal = false
#scal = true
gs = false
#gs = true
mod = model(pcanipalsmiss; nlv, tol, gs, maxit = 500, scal)
fit!(mod, X)
pnames(mod) 
pnames(mod.fm)
fm = mod.fm ;
fm.niter
fm.sv
fm.P
fm.T
## Orthogonality 
## only if gs = true
fm.T' * fm.T
fm.P' * fm.P

## Impute missing data in X
mod = model(pcanipalsmiss; nlv = 2, gs = true) ;
fit!(mod, X)
Xfit = xfit(mod.fm)
s = ismissing.(X)
X_imput = copy(X)
X_imput[s] .= Xfit[s]
X_imput
source
Jchemo.pcasphMethod
pcasph(X; kwargs...)
pcasph(X, weights::Weight; kwargs...)
pcasph!(X::Matrix, weights::Weight; kwargs...)

Spherical PCA.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Spherical PCA (Locantore et al. 1999, Maronna 2005, Daszykowski et al. 2007). Matrix X is centered by the spatial median computed by function Jchemo.colmedspa.

References

Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016

Locantore N., Marron J.S., Simpson D.G., Tripoli N., Zhang J.T., Cohen K.L. Robust principal component analysis for functional data, Test 8 (1999) 1–7

Maronna, R., 2005. Principal components and orthogonal regression based on robust scales, Technometrics, 47:3, 264-273, DOI: 10.1198/004017005000000166

Examples

using JchemoData, JLD2, CairoMakie 
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2") 
@load db dat
pnames(dat)
X = dat.X 
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)

nlv = 6
mod = model(pcasph; nlv)  
#mod = model(pcasvd; nlv) 
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
transf(mod, X)

i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = "PC1", 
    ylabel = "PC2").f
source
Jchemo.pcasvdMethod
pcasvd(X; kwargs...)
pcasvd(X, weights::Weight; kwargs...)
pcasvd!(X::Matrix, weights::Weight; kwargs...)

PCA by SVD factorization.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. of principal components (PCs).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing a SVD factorization of sqrt(D) * X:

  • sqrt(D) * X ~ U * S * V'

Outputs are:

  • T = D^(-1/2) * U * S
  • P = V
  • The diagonal of S

Examples

using JchemoData, JLD2, CairoMakie 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
@head Xtrain = X[s.train, :]
@head Xtest = X[s.test, :]

nlv = 3
mod = model(pcasvd; nlv)
#mod = model(pcaeigen; nlv)
#mod = model(pcaeigenk; nlv)
#mod = model(pcanipals; nlv)
fit!(mod, Xtrain)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
@head transf(mod, X)
T' * T
@head P = mod.fm.P
P' * P

@head Ttest = transf(mod, Xtest)

res = summary(mod, Xtrain) ;
pnames(res)
res.explvarx
res.contr_var
res.coord_var
res.cor_circle
source
Jchemo.pcrMethod
pcr(X, Y; kwargs...)
pcr(X, Y, weights::Weight; kwargs...)
pcr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Principal component regression (PCR) with a SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
mod = model(pcr; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.pipMethod
pip(args...)

Build a pipeline of models.

  • args... : Successive models, see examples.

Examples

using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Pipeline Snv :> Savgol :> Pls :> Svmr

mod1 = model(snv; centr = true, scal = true)
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
mod3 = model(plskern; nlv = 15)
mod4 = model(svmr; gamma = 1e3, cost = 100, epsilon = .9)
mod = pip(mod1, mod2, mod3, mod4)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ; 
@head res.pred 
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
      ylabel = "Observed").f
source
Jchemo.plotconfMethod
plotconf(object; size = (500, 400), cnt = true, ptext = true, 
    fontsize = 15, coldiag = :red, )

Plot a conf matrix.

  • object : Output of function conf.

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • cnt : Boolean. If true, plot the occurrences, else plot the row %s.
  • ptext : Boolean. If true, display the value in each cell.
  • fontsize : Font size when ptext = true.
  • coldiag : Font color when ptext = true.

See examples in the help page of function conf.
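
A minimal sketch (hypothetical class vectors, assuming conf accepts plain vectors as in the examples of its help page):

using CairoMakie
ytest = ["a"; "a"; "b"; "b"; "c"]
pred = ["a"; "b"; "b"; "b"; "c"]
res = conf(pred, ytest)
plotconf(res)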

source
Jchemo.plotgridMethod
plotgrid(indx::AbstractVector, r; 
    size = (500, 300), step = 5, color = nothing, 
    kwargs...)
plotgrid(indx::AbstractVector, r, group; 
    size = (700, 350), step = 5, color = nothing, 
    leg = true, leg_title = "Group", kwargs...)

Plot error/performance rates of a model.

  • indx : A numeric variable representing the grid of model parameters, e.g. the nb. of LVs for PLSR models.
  • r : The error/performance rate.

Keyword arguments:

  • group : Categorical variable defining groups. A separate line is plotted for each level of group.
  • size : Size (horizontal, vertical) of the figure.
  • step : Step used for defining the xticks.
  • color : Set color. If group is used, must be a vector of the same length as the number of levels in group.
  • leg : Boolean. If group is used, display a legend or not.
  • leg_title : Title of the legend.
  • kwargs : Optional arguments to pass in Axis of CairoMakie.

To use plotgrid, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

mod = plskern() 
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain, 
    Xtest, ytest; score = rmsep, nlv)
plotgrid(res.nlv, res.y1;
    xlabel = "Nb. LVs", ylabel = "RMSEP").f

mod = lwplsr() 
nlvdis = 15 ; metric = [:mah]
h = [1 ; 2.5 ; 5] ; k = [50 ; 100] 
pars = mpar(nlvdis = nlvdis, metric = metric, 
    h = h, k = k)
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain, 
    Xtest, ytest; score = rmsep, 
    pars, nlv)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group;
    xlabel = "Nb. LVs", ylabel = "RMSECV").f
source
Jchemo.plotspFunction
plotsp(X, wl = 1:nco(X); size = (500, 300), color = nothing, 
    nsamp = nothing, kwargs...)

Plotting spectra.

  • X : X-data (n, p).
  • wl : Column names of X. Must be numeric.

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • color : Set a unique color (and eventually transparency) to the spectra.
  • nsamp : Nb. spectra (X-rows) to plot. If nothing, all spectra are plotted.
  • kwargs : Optional arguments to pass in Axis of CairoMakie.

The function plots the rows of X.

To use plotsp, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst) 

plotsp(X).f
plotsp(X; color = (:red, .2)).f
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f

f, ax = plotsp(X, wl; color = (:red, .2))
xmeans = colmean(X)
lines!(ax, wl, xmeans; color = :black, linewidth = 2)
vlines!(ax, 1200)
f
source
Jchemo.plotxyMethod
plotxy(x, y; size = (500, 300), color = nothing, ellipse::Bool = false, 
    prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
    xlabel = "", ylabel = "", title = "", kwargs...)
plotxy(x, y, group; size = (600, 350), color = nothing, ellipse::Bool = false, 
    prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
    xlabel = "", ylabel = "", title = "", leg::Bool = true, leg_title = "Group", 
    kwargs...)

Scatter plot of (x, y) data.

  • x : A x-vector (n).
  • y : A y-vector (n).
  • group : Categorical variable defining groups (n).

Keyword arguments:

  • size : Size (horizontal, vertical) of the figure.
  • color : Set color(s). If group is used, color must be a vector of the same length as the number of levels in group.
  • ellipse : Boolean. Draw an ellipse of confidence, assuming a Chi-square distribution with df = 2. If group is used, one ellipse is drawn per group.
  • prob : Probability for the ellipse of confidence.
  • bisect : Boolean. Draw a bisector.
  • zeros : Boolean. Draw horizontal and vertical axes passing through origin (0, 0).
  • xlabel : Label for the x-axis.
  • ylabel : Label for the y-axis.
  • title : Title of the graphic.
  • leg : Boolean. If group is used, display a legend or not.
  • leg_title : Title of the legend.
  • kwargs : Optional arguments to pass in function scatter of Makie.

To use plotxy, a backend (e.g. CairoMakie) has to be specified.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
lev = mlev(year)
nlev = length(lev)

mod = model(pcasvd; nlv = 5)  
fit!(mod, X) 
@head T = mod.fm.T

plotxy(T[:, 1], T[:, 2]; color = (:red, .5)).f

plotxy(T[:, 1], T[:, 2], year; ellipse = true, xlabel = "PC1", 
    ylabel = "PC2").f

i = 2
colm = cgrad(:Dark2_5, nlev; categorical = true)
plotxy(T[:, i], T[:, i + 1], year; color = colm, xlabel = string("PC", i), 
    ylabel = string("PC", i + 1), zeros = true, ellipse = true).f

plotxy(T[:, 1], T[:, 2], year).lev

plotxy(1:5, 1:5).f

y = reshape(rand(5), 5, 1)
plotxy(1:5, y).f

## Several layers can be added
## (same syntax as in Makie)
A = rand(50, 2)
f, ax = plotxy(A[:, 1], A[:, 2]; xlabel = "x1", ylabel = "x2")
ylims!(ax, -1, 2)
hlines!(ax, 0.5; color = :red, linestyle = :dot)
f
source
Jchemo.plscanMethod
plscan(X, Y; kwargs...)
plscan(X, Y, weights::Weight; kwargs...)
plscan!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Canonical partial least squares regression (Canonical PLS).

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

Canonical PLS with the Nipals algorithm (Wold 1984, Tenenhaus 1998 chap.11), referred to as PLS-W2A (i.e. Wold PLS mode A) in Wegelin 2000. The two blocks X and Y play a symmetric role. After each step of scores computation, X and Y are deflated by the x- and y-scores, respectively.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob
mod = model(plscan; nlv, bscal)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.plskdedaMethod
plskdeda(X, y; kwargs...)
plskdeda(X, y, weights::Weight; kwargs...)

KDE-DA on PLS latent variables (PLS-KDEDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The principle is the same as functions plslda and plsqda except that class densities are estimated from dmkern instead of dmnorm.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plskdeda; nlv) 
#mod = model(plskdeda; nlv, a_kde = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
source
Jchemo.plskernMethod
plskern(X, Y; kwargs...)
plskern(X, Y, weights::Weight; kwargs...)
plskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial least squares regression (PLSR) with the "improved kernel algorithm #1" (Dayal & McGegor, 1997).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

About the row-weighting in PLS algorithms (weights): See in particular Schaal et al. 2002, Sicard & Sabatier 2006, Kim et al. 2011, and Lesnoff et al. 2020.

References

Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.

Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.

Lesnoff, M., Metz, M., Roger, J.M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR Data. Journal of Chemometrics. e3209. https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.3209

Schaal, S., Atkeson, C., Vijayamakumar, S. 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.

Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
mod = model(plskern; nlv) ;
#mod = model(plsnipals; nlv) ;
#mod = model(plswold; nlv) ;
#mod = model(plsrosa; nlv) ;
#mod = model(plssimp; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    

res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.plsldaMethod
plslda(X, y; kwargs...)
plslda(X, y, weights::Weight; kwargs...)

LDA on PLS latent variables (PLS-LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

LDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a LDA is done on {T, y}.

In these plslda functions, observation weights (argument weights) are used to compute the PLS scores and the LDA intra-class (= "within") covariance matrix. Argument prior is used to define the usual LDA prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors: the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plslda; nlv) 
#mod = model(plslda; nlv, prior = :prop) 
#mod = model(plsqda; nlv, alpha = .1) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
source
Jchemo.plsnipalsMethod
plsnipals(X, Y; kwargs...)
plsnipals(X, Y, weights::Weight; kwargs...)
plsnipals!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the Nipals algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

In this function, for PLS2 (multivariate Y), the Nipals iterations are replaced by a direct computation of the PLS weights (w) by SVD decomposition of matrix X'Y (Hoskuldsson 1988 p.213).

See function plskern for examples.

References

Hoskuldsson, A., 1988. PLS regression methods. Journal of Chemometrics 2, 211-228. https://doi.org/10.1002/cem.1180020306

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool for chemometrics. Chem. Int. Lab. Syst., 58, 109-130.
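
A minimal sketch (centered, unweighted random data, for illustration only) of the SVD-based computation of the first PLS2 weight vector mentioned above:

using LinearAlgebra, Statistics
n, p, q = 20, 5, 3
X = rand(n, p)
Y = rand(n, q)
Xc = X .- mean(X, dims = 1)
Yc = Y .- mean(Y, dims = 1)
w1 = svd(Xc' * Yc).U[:, 1]    # first X-weight vector (up to sign)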

source
Jchemo.plsqdaMethod
plsqda(X, y; kwargs...)
plsqda(X, y, weights::Weight; kwargs...)

QDA on PLS latent variables (PLS-QDA) with continuum.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

QDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a QDA (possibly with continuum) is done on {T, y}.

See functions qda and plslda for details (arguments weights, prior and alpha) and examples.
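
A minimal usage sketch (hypothetical data), following the plslda examples:

n, p = 50, 10
X = rand(n, p)
y = rand(["a", "b", "c"], n)
mod = model(plsqda; nlv = 5, alpha = .1)
fit!(mod, X, y)
predict(mod, X).pred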

source
Jchemo.plsravgMethod
plsravg(X, Y; kwargs...)
plsravg(X, Y, weights::Weight; kwargs...)
plsravg!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Averaging PLSR models with different numbers of latent variables (PLSR-AVG).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : A range of nb. of latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Ensemblist method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs.

For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ... 10 LVs, respectively.

References

Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2") 
@load db dat
pnames(dat)
X = dat.X 
Y = dat.Y
@head Y
y = Y.ndf
#y = Y.dm
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(y, s)
Xtest = X[s, :]
ytest = y[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)

nlv = 0:30
#nlv = 5:20
#nlv = 25
mod = model(plsravg; nlv) ;
fit!(mod, Xtrain, ytrain)

res = predict(mod, Xtest)
@head res.pred
res.predlv   # predictions for each nb. of LVs 
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, 
    xlabel = "Prediction", ylabel = "Observed").f    
source
Jchemo.plsrdaMethod
plsrda(X, y; kwargs...)
plsrda(X, y, weights::Weight; kwargs...)

Discrimination based on partial least squares regression (PLSR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

This is the usual "PLSDA" (prediction of the Y-dummy table by a PLS2 regression). The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by fuction predict). These predictions can be considered as unbounded estimates (i.e. eventuall outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.

In the high-level version of the function, the observation weights used in the PLS2-R are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
mod = model(plsrda; nlv) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
source
Jchemo.plsrosaMethod
plsrosa(X, Y; kwargs...)
plsrosa(X, Y, weights::Weight; kwargs...)
plsrosa!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the ROSA algorithm (Liland et al. 2016).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Note: The function has the following differences with the original algorithm of Liland et al. (2016):

  • Scores T (LVs) are not normed.
  • Multivariate Y is allowed.

See function plskern for examples.

References

Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA—a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824

source
Jchemo.plssimpMethod
plssimp(X, Y; kwargs...)
plssimp(X, Y, weights::Weight; kwargs...)
plssimp!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the SIMPLS algorithm (de Jong 1993).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Note: In this function, scores T (LVs) are not normed, contrary to the original algorithm of de Jong (1993).

See function plskern for examples.

References

de Jong, S., 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263. https://doi.org/10.1016/0169-7439(93)85002-X

source
Jchemo.plstuckMethod
plstuck(X, Y; kwargs...)
plstuck(X, Y, weights::Weight; kwargs...)
plstuck!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Tucker's inter-battery method of factor analysis.

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

Inter-battery method of factor analysis (Tucker 1958, Tenenhaus 1998 chap.3). The two blocks X and Y play a symmetric role. This method is referred to as PLS-SVD in Wegelin 2000. The basis of the method is to factorize the covariance matrix X'Y by SVD.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Tishler, A., Lipovetsky, S., 2000. Modelling and forecasting with robust canonical analysis: method and application. Computers & Operations Research 27, 217–232. https://doi.org/10.1016/S0305-0548(99)00014-3

Tucker, L.R., 1958. An inter-battery method of factor analysis. Psychometrika 23, 111–136. https://doi.org/10.1007/BF02289009

Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X 
Y = dat.Y

fm = plstuck(X, Y; nlv = 3)
pnames(fm)

fm.Tx
transf(fm, X, Y).Tx
fscale(fm.Tx, colnorm(fm.Tx))

res = summary(fm, X, Y)
pnames(res)
source
Jchemo.plswoldMethod
plswold(X, Y; kwargs...)
plswold(X, Y, weights::Weight; kwargs...)
plswold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Partial Least Squares Regression (PLSR) with the Wold algorithm.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • tol : Tolerance for the Nipals algorithm.
  • maxit : Maximum number of iterations for the Nipals algorithm.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Wold Nipals PLSR algorithm: Tenenhaus 1998 p.204.

See function plskern for examples.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.

Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
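
A minimal usage sketch (random data, for illustration only) showing the Nipals tuning arguments, in addition to the plskern examples:

X = rand(30, 8)
Y = rand(30, 2)
mod = model(plswold; nlv = 3, tol = 1e-10, maxit = 200)
fit!(mod, X, Y)
@head mod.fm.T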

source
Jchemo.predictMethod
predict(object::CalDs, X; kwargs...)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::CalPds, X; kwargs...)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Cglsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. iterations, or collection of nb. iterations, to consider.
source
Jchemo.predictMethod
predict(object::Dkplsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Dmkern, x)

Compute predictions from a fitted model.

  • object : The fitted model.
  • x : Data (vector) for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Dmnorm, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : Data (vector) for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Knnda1, X)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Knnr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Kplsr, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.

If nothing, it is the maximum nb. LVs.

source
Jchemo.predictMethod
predict(object::Krr, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider.
source
Jchemo.predictMethod
predict(object::Lda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwmlr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwmlrda, X)

Compute y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplslda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsqda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsr, X; nlv = nothing)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::LwplsrAvg, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Lwplsrda, X; nlv = nothing)

Compute the y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Mbplslda, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mbplsrda, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mlrda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occod, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occsd, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occsdod, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Occstah, X)

Compute predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Plslda, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Plsravg, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Plsrda, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Qda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Rosaplsr, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Rr, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider.
source
Jchemo.predictMethod
predict(object::Rrda, X; lb = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • lb : Regularization parameter, or collection of regularization parameters, "lambda" to consider. If nothing, it is the parameter stored in the fitted model.
source
Jchemo.predictMethod
predict(object::Soplsr, Xbl)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Svmda, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Svmr, X)

Compute y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::TreedaDt, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::TreerDt, X)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.predictMethod
predict(object::Mlr, X)

Compute the Y-predictions from the fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
source
Jchemo.predictMethod
predict(object::Union{Plsr, Pcr, Splsr}, X; nlv = nothing)

Compute Y-predictions from a fitted model.

  • object : The fitted model.
  • X : X-data for which predictions are computed.
  • nlv : Nb. LVs, or collection of nb. LVs, to consider.
source
Jchemo.pvalMethod
pval(d::Distribution, q)
pval(x::Array, q)
pval(e_cdf::ECDF, q)

Compute p-value(s) for a distribution, an ECDF or vector.

  • d : A distribution computed from Distribution.jl.
  • x : Univariate data.
  • e_cdf : An ECDF computed from StatsBase.jl.
  • q : Value(s) for which to compute the p-value(s).

Compute or estimate the p-value of quantile q, i.e. P(Q > q) where Q is the random variable.

Examples

using Distributions, StatsBase

d = Distributions.Normal(0, 1)
q = 1.96
#q = [1.64; 1.96]
Distributions.cdf(d, q)    # cumulative distribution function (CDF)
Distributions.ccdf(d, q)   # complementary CDF (CCDF)
pval(d, q)                 # Distributions.ccdf

x = rand(5)
e_cdf = StatsBase.ecdf(x)
e_cdf(x)                # empirical CDF computed at each point of x (ECDF)
p_val = 1 .- e_cdf(x)   # complementary ECDF at each point of x
q = .3
#q = [.3; .5; 10]
pval(e_cdf, q)          # 1 .- e_cdf(q)
pval(x, q)
source
Jchemo.qdaMethod
qda(X, y; kwargs...)
qda(X, y, weights::Weight; kwargs...)

Quadratic discriminant analysis (QDA, with continuum towards LDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).

A value alpha > 0 shrinks the within-class covariances (Wi) toward a common LDA ("within") covariance W. This corresponds to the "first regularization (Eqs.16)" described in Friedman 1989 (where alpha is referred to as "lambda").

In these qda functions, observation weights (argument weights) are used to compute covariance matrices Wi and W. Argument prior is used to define the usual prior class probabilities.

In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.

References

Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

mod = model(qda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

## With regularization
mod = model(qda; alpha = .5)
#mod = model(qda; alpha = 1) # = LDA
fit!(mod, Xtrain, ytrain)
mod.fm.Wi
res = predict(mod, Xtest) ;
errp(res.pred, ytest)
source
Jchemo.r2Method
r2(pred, Y)

Compute the R2 coefficient.

  • pred : Predictions.
  • Y : Observed data.

The coefficient R2 is calculated as:

  • R2 = 1 - MSEP(current model) / MSEP(null model)

where the "null model" is the overall mean. For predictions over CV or test sets, and/or for non linear models, it can be different from the square of the correlation coefficient (cor2) between the true data and the predictions.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
r2(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
r2(pred, ytest)
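
A minimal hand check of the definition above (illustration only; univariate case, with the "null model" mean taken here over the observed data):

using Statistics
msep_model = sum((ytest .- vec(pred)).^2) / length(ytest)
msep_null = sum((ytest .- mean(ytest)).^2) / length(ytest)
1 - msep_model / msep_null    # to be compared with r2(pred, ytest)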
source
Jchemo.rasvdMethod
rasvd(X, Y; kwargs...)
rasvd(X, Y, weights::Weight; kwargs...)
rasvd!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Redundancy analysis (RA), aka PCA on instrumental variables (PCAIV)

  • X : First block of data.
  • Y : Second block of data.
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • bscal : Type of block scaling. Possible values are: :none, :frob. See functions blockscal.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).

See e.g. Bougeard et al. 2011a,b and Legendre & Legendre 2012. Let Yhat be the fitted values of the regression of Y on X. The scores Ty are the PCA scores of Yhat. The scores Tx are the fitted values of the regression of Ty on X.

A continuum regularization is available. After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get results similar to those obtained with pseudo-inverses.
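
As an illustration only (not part of the original method), a regularized covariance of this form can be sketched as follows, assuming X is already centered/scaled and w are the observation weights:

using LinearAlgebra
n, p = 5, 3
X = rand(n, p)
w = ones(n) / n               # uniform observation weights (metric D)
D = Diagonal(w)
tau = 1e-4
Cx = (1 - tau) * X' * D * X + tau * I(p)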

References

Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011-a. Multiblock redundancy analysis from a user's perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214. https://doi.org/10.1285/i20705948v4n2p203

Bougeard, S., Qannari, E.M., Rose, N., 2011-b. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467-475. https://doi.org/10.1002/cem.1392

Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier, Amsterdam, The Netherlands.

Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Multiblock data analysis. https://cran.r-project.org/web/packages/RGCCA/index.html

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)

nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(rasvd; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)

@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx

@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty

res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.cort2t 
res.rdx
res.rdy
res.corx2t 
res.cory2t 
source
Jchemo.rdMethod
rd(X, Y; typ = :cor)
rd(X, Y, weights::Weight; typ = :cor)

Compute redundancy coefficients between two matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • typ : Possibles values are: :cor (correlation), :cov (uncorrected covariance).

Returns the redundancy coefficient between X and each column of Y, i.e.:

rd(X, yk) = (1 / p) * Sum(j = 1, ..., p) cor(xj, yk)^2, computed for each column yk (k = 1, ..., q) of Y.

See Tenenhaus 1998 section 2.2.1 p.10-11.

References

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Examples

X = rand(5, 10)
Y = rand(5, 3)
rd(X, Y)
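
A minimal sketch checking this definition against base Statistics (typ = :cor, unweighted case):

using Statistics
p = size(X, 2)
[sum(cor(X[:, j], Y[:, k])^2 for j = 1:p) / p for k = 1:size(Y, 2)]    # compare with rd(X, Y)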
source
Jchemo.rdaMethod
rda(X, y; kwargs...)
rda(X, y, weights::Weight; kwargs...)

Regularized discriminant analysis (RDA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • lb : Ridge regularization parameter "lambda" (>= 0).
  • simpl : Boolean. See function dmnorm.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Let us note W the (corrected) pooled within-class covariance matrix and Wi the (corrected) within-class covariance matrix of class i. The regularization is done by the two following successive steps (for each class i):

  1. Continuum between QDA and LDA: Wi(1) = (1 - alpha) * Wi + alpha * W
  2. Ridge regularization: Wi(2) = Wi(1) + lb * I

Then the QDA algorithm is run on matrices {Wi(2)}.
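
A minimal numerical sketch of these two steps, on hypothetical small covariance matrices (illustration only):

using LinearAlgebra
Wi = [1.0 0.3; 0.3 2.0]    # within-class covariance of class i (hypothetical)
W = [1.5 0.1; 0.1 1.5]     # pooled within-class covariance (hypothetical)
alpha = .5 ; lb = 1e-8
Wi1 = (1 - alpha) * Wi + alpha * W    # step 1: continuum QDA <--> LDA
Wi2 = Wi1 + lb * I(2)                 # step 2: ridge regularization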

Function rda differs slightly from the regularization expression used by Friedman 1989 (Eq.18): it shrinks the covariance matrices toward the identity matrix (ridge regularization; e.g. Guo et al. 2007).

Particular cases:

  • alpha = 1 & lb = 0 : LDA
  • alpha = 0 & lb = 0 : QDA
  • alpha = 1 & lb > 0 : Penalized LDA (Hastie et al 1995) with diagonal regularization matrix

See functions lda and qda for other details (arguments weights and prior).

References

Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.

Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007; 8(1):86-100. doi:10.1093/biostatistics/kxj035.

Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized Discriminant Analysis. The Annals of Statistics 23, 73–102.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

alpha = .5
lb = 1e-8
mod = model(rda; alpha, lb)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.recodcat2intMethod
recodcat2int(x; start = 1)

Recode a categorical variable to an integer variable.

  • x : Variable to recode.
  • start : Integer value assigned to the first (sorted) category.

The integers returned by the function correspond to the sorted levels (categories) of x.

Examples

x = ["b", "a", "b"]   
[x recodcat2int(x)]
recodcat2int(x; start = 0)
recodcat2int([25, 1, 25])
source
Jchemo.recodnum2intMethod
recodnum2int(x, q)

Recode a continuous variable to integer classes.

  • x : Variable to recode.
  • q : Values separating the classes.

Examples

using Statistics
x = [collect(1:10); 8.1 ; 3.1] 
q = [3; 8]
zx = recodnum2int(x, q)  
[x zx]
probs = [.33; .66]
q = quantile(x, probs) 
zx = recodnum2int(x, q)  
[x zx]
source
Jchemo.replacebylevMethod
replacebylev(x, lev)

Replace the elements of a vector by levels of corresponding order.

  • x : Vector (n) of values to replace.
  • lev : Vector (nlev) containing the levels.

Warning: x and lev must contain the same number (nlev) of levels.

The ith sorted level in x is replaced by the ith sorted level of lev.

Examples

x = [10; 4; 3; 3; 4; 4]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev(x, lev)]
zx = string.(x)
[zx replacebylev(zx, lev)]

lev = [3; 0; -1]
[x replacebylev(x, lev)]
source
Jchemo.replacebylev2Method
replacebylev2(x::Union{Int, Array{Int}}, lev::Array)

Replace the elements of an index-vector by levels.

  • x : Vector (n) of values to replace.
  • lev : Vector (nlev) containing the levels.

Warning: Let us note nlev the number of levels in lev. Vector x must contain integer values between 1 and nlev.

Each element xi is replaced by sort(lev)[x[i]].

Examples

x = [2; 1; 2; 2]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev2(x, lev)]
replacebylev2([2], lev)
replacebylev2(2, lev)

x = [2; 1; 2]
lev = [3; 0; -1]
replacebylev2(x, lev)
source
Jchemo.replacedictMethod
replacedict(x, dict)

Replace the elements of a vector by levels defined in a dictionary.

  • x : Vector (n) of values to replace.
  • dict : A dictionary giving the correspondences between the old and the new values.

Examples

dict = Dict("a" => 1000, "b" => 1, "c" => 2)

x = ["c"; "c"; "a"; "a"; "a"]
replacedict(x, dict)

x = ["c"; "c"; "a"; "a"; "a"; "e"]
replacedict(x, dict)
source
Jchemo.residclaMethod
residcla(pred, y)

Compute the discrimination residual vector (0 = no error, 1 = error).

  • pred : Predictions.
  • y : Observed data (class membership).

Examples

Xtrain = rand(10, 5) 
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5) 
ytest = rand(["a" ; "b"], 4)

mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residcla(pred, ytest)
source
Jchemo.residregMethod
residreg(pred, Y)

Compute the regression residual vector.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, ytest)
source
Jchemo.rfda_dtMethod
rfda_dt(X, y; kwargs...)

Random forest discrimination with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • n_trees : Nb. trees built for the forest.
  • partial_sampling : Proportion of sampled observations for each tree.
  • n_subfeatures : Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
  • max_depth : Maximum depth of the decision trees (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • mth : Boolean indicating if multi-threading is used when new data are predicted with function predict.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
  • Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.

The function fits a random forest discrimination model using package DecisionTree.jl.

References

Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

n_trees = 200
n_subfeatures = p / 3 
max_depth = 10
mod = model(rfda_dt; n_trees, n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.rfr_dtMethod
rfr_dt(X, y; kwargs...)

Random forest regression with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • n_trees : Nb. trees built for the forest.
  • partial_sampling : Proportion of sampled observations for each tree.
  • n_subfeatures : Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
  • max_depth : Maximum depth of the decision trees (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • mth : Boolean indicating if multi-threading is used when new data are predicted with function predict.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
  • Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.

The function fits a random forest regression model using package DecisionTree.jl.

References

Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

n_trees = 200
n_subfeatures = p / 3
max_depth = 15
mod = model(rfr_dt; n_trees, n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
source
Jchemo.rmcolMethod
rmcol(X, s)

Remove the columns of a matrix or the components of a vector having indexes s.

  • X : Matrix or vector.
  • s : Vector of the indexes.

Examples

X = rand(5, 3) 
rmcol(X, [1, 3])
source
Jchemo.rmgapMethod
rmgap(X; kwargs...)

Remove vertical gaps in spectra (e.g. for ASD).

  • X : X-data (n, p).

Keyword arguments:

  • indexcol : Indexes (∈ [1, p]) of the X-columns where the gaps to remove are located.
  • npoint : The number of X-columns used on the left side of each gap for fitting the linear regressions.

For each spectrum (row-observation of matrix X) and each defined gap, the correction is done by extrapolation from a simple linear regression computed on the left side of the gap.

For instance, if two gaps are observed between column-indexes 651-652 and between column-indexes 1425-1426, respectively, the syntax should be indexcol = [651 ; 1425].

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/asdgap.jld2") 
@load db dat
pnames(dat)
X = dat.X
wlst = names(dat.X)
wl = parse.(Float64, wlst)

wl_target = [1000 ; 1800] 
indexcol = findall(in(wl_target).(wl))

f, ax = plotsp(X, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f

## Corrected data
mod = model(rmgap; npoint = 5, indexcol)
fit!(mod, X)
Xc = transf(mod, X)
f, ax = plotsp(Xc, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
source
Jchemo.rmrowMethod
rmrow(X, s)

Remove the rows of a matrix or the components of a vector having indexes s.

  • X : Matrix or vector.
  • s : Vector of the indexes.

Examples

X = rand(5, 2) 
rmrow(X, [1, 4])
source
Jchemo.rmsepMethod
rmsep(pred, Y)

Compute the square root of the mean of the squared prediction errors (RMSEP).

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, ytest)
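
A minimal hand computation of the definition above, for the univariate case:

sqrt(sum((ytest .- vec(pred)).^2) / length(ytest))    # to be compared with rmsep(pred, ytest)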
source
Jchemo.rmsepstandMethod
rmsepstand(pred, Y)

Compute the standardized square root of the mean of the squared prediction errors (RMSEP_stand).

  • pred : Predictions.
  • Y : Observed data.

RMSEP is standardized to Y:

  • RMSEP_stand = RMSEP ./ Y.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, ytest)
source
Jchemo.rosaplsrMethod
rosaplsr(Xbl, Y; kwargs...)
rosaplsr(Xbl, Y, weights::Weight; kwargs...)
rosaplsr!(Xbl::Vector, Y::Matrix, weights::Weight; kwargs...)

Multiblock ROSA PLSR (Liland et al. 2016).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, the output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).

The function has the following differences with the original algorithm of Liland et al. (2016):

  • Scores T are not normed to 1.
  • Multivariate Y is allowed. In such a case, the squared residuals are summed over the columns for finding the winning block for each global LV (therefore the Y-columns should have the same scale).

References

Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA — a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 3
scal = false
#scal = true
mod = model(rosaplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)
source
Jchemo.rowmeanMethod
rowmean(X)

Compute row-wise means of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowmean(X)
source
Jchemo.rownormMethod
rownorm(X)

Compute row-wise norms of a matrix.

  • X : Data (n, p).

The norm computed for a row x of X is:

  • sqrt(x' * x)

Return a vector.

Note: Thanks to @mcabbott at https://discourse.julialang.org/t/orders-of-magnitude-runtime-difference-in-row-wise-norm/96363.

Examples

n, p = 5, 6
X = rand(n, p)

rownorm(X)
source
Jchemo.rowstdMethod
rowstd(X)

Compute row-wise standard deviations (uncorrected) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowstd(X)
source
Jchemo.rowsumMethod
rowsum(X)

Compute row-wise sums of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

X = rand(5, 2) 
rowsum(X)
source
Jchemo.rowvarMethod
rowvar(X)

Compute row-wise variances (uncorrected) of a matrix.

  • X : Data (n, p).

Return a vector.

Examples

n, p = 5, 6
X = rand(n, p)
rowvar(X)
source
Jchemo.rpMethod
rp(X; kwargs...)
rp(X, weights::Weight; kwargs...)
rp!(X::Matrix, weights::Weight; kwargs...)

Make a random projection of X-data.

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. dimensions on which X is projected.
  • mrp : Method of random projection. Possible values are: :gauss, :li. See the respective functions rpmatgauss and rpmatli for their keyword arguments.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Examples

n, p = (5, 10)
X = rand(n, p)
nlv = 3
mrp = :li ; s_li = sqrt(p) 
#mrp = :gauss
mod = model(rp; nlv, mrp, s_li)
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T 
@head mod.fm.P 
transf(mod, X[1:2, :])
source
Jchemo.rpdMethod
rpd(pred, Y)

Compute the ratio "deviation to model performance" (RPD).

  • pred : Predictions.
  • Y : Observed data.

This is the ratio of the deviation to the model performance, defined by:

  • RPD = Std(Y) / RMSEP

where Std(Y) is the standard deviation.

Since Std(Y) = RMSEP(null model) where the null model is the simple average, this also gives:

  • RPD = RMSEP(null model) / RMSEP

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, ytest)
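
A minimal hand computation of the ratio above for a univariate ytest (illustration only; the standard-deviation convention used internally may differ):

using Statistics
rmsep_val = sqrt(sum((ytest .- vec(pred)).^2) / length(ytest))
std(ytest; corrected = false) / rmsep_val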
source
Jchemo.rpdrMethod
rpdr(pred, Y)

Compute a robustified RPD.

  • pred : Predictions.
  • Y : Observed data.

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, ytest)
source
Jchemo.rpmatgaussFunction
rpmatgauss(p::Int, nlv::Int, Q = Float64)

Build a gaussian random projection matrix.

  • p : Nb. variables (attributes) to project.
  • nlv : Nb. of simulated projection dimensions.
  • Q : Type of components of the built projection matrix.

The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.

P is simulated from i.i.d. N(0, 1) / sqrt(nlv).

References

Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436

Examples

p = 10 ; nlv = 3
rpmatgauss(p, nlv)
source
Jchemo.rpmatliFunction
rpmatli(p::Int, nlv::Int, Q = Float64; s_li)

Build a sparse random projection matrix (Achlioptas 2001, Li et al. 2006).

  • p : Nb. variables (attributes) to project.
  • nlv : Nb. of simulated projection dimensions.
  • Q : Type of components of the built projection matrix.

Keyword arguments:

  • s_li : Coefficient defining the sparsity of the returned matrix (the higher s_li, the higher the sparsity).

The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.

Matrix P is simulated from i.i.d. discrete sampling within values:

  • 1 with prob. 1/(2 * s)
  • 0 with prob. 1 - 1 / s
  • -1 with prob. 1/(2 * s)

Usual values for s are:

  • sqrt(p) (Li et al. 2006)
  • p / log(p) (Li et al. 2006)
  • 1 (Achlioptas 2001)
  • 3 (Achlioptas 2001)
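
As an illustration of the scheme above, the expected proportion of zeros in P is 1 - 1/s for these usual choices of s:

p = 10
for s in (1.0, 3.0, sqrt(p), p / log(p))
    println("s = ", round(s; digits = 2), "   expected prob(0) = ", round(1 - 1 / s; digits = 3))
end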

References

Achlioptas, D., 2001. Database-friendly random projections, in: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01. Association for Computing Machinery, New York, NY, USA, pp. 274–281. https://doi.org/10.1145/375551.375608

Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436

Examples

p = 10 ; nlv = 3
rpmatli(p, nlv)
source
Jchemo.rrMethod
rr(X, Y; kwargs...)
rr(X, Y, weights::Weight; kwargs...)
rr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Ridge regression (RR) implemented by SVD factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

References

Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.

Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.

Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

lb = 1e-3
mod = model(rr; lb) 
#mod = model(rrchol; lb) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

coef(mod)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

## Only for function 'rr' (not for 'rrchol')
coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
source
Jchemo.rrcholMethod
rrchol(X, Y; kwargs...)
rrchol(X, Y, weights::Weight; kwargs...)
rrchol!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Ridge regression (RR) using the Normal equations and a Cholesky factorization.

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

See function rr for examples.

References

Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.

Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.

Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634

source
Jchemo.rrdaMethod
rrda(X, y; kwargs...)
rrda(X, y, weights::Weight; kwargs...)

Discrimination based on ridge regression (RR-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • lb : Ridge regularization parameter "lambda".
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a ridge regression (RR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. possibly outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.

In the high-level version of the function, the observation weights used in the RR are defined with argument prior. For other choices, use the low-level version (argument weights).

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

lb = 1e-5
mod = model(rrda; lb) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; lb = [.1; .01]).pred
source
Jchemo.rrrMethod
rrr(X, Y; kwargs...)
rrr(X, Y, weights::Weight; kwargs...)
rrr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Reduced rank regression (RRR, aka RA).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • tau : Regularization parameter (∊ [0, 1]).
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Reduced rank regression, also referred to as redundancy analysis (RA) regression. In this function, the RA uses the Nipals algorithm presented in Mangamana et al 2021, section 2.1.1.

A continuum regularization is available. After block centering and scaling, the covariances matrices are computed as follows:

  • Cx = (1 - tau) * X'DX + tau * Ix

where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. A better alternative is generally to use an epsilon value (e.g. tau = 1e-8) to get results similar to those obtained with pseudo-inverses.

References

Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011. Multiblock redundancy analysis from a user’s perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214–214. https://doi.org/10.1285/i20705948v4n2p203

Bougeard, S., Qannari, E.M., Rose, N., 2011. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467–475. https://doi.org/10.1002/cem.1392

Tchandao Mangamana, E., Glèlè Kakaï, R., Qannari, E.M., 2021. A general strategy for setting up supervised methods of multiblock data analysis. Chemometrics and Intelligent Laboratory Systems 217, 104388. https://doi.org/10.1016/j.chemolab.2021.104388

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 1
tau = 1e-4
mod = model(rrr; nlv, tau) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f   
source
Jchemo.rvMethod
rv(X, Y; centr = true)
rv(Xbl::Vector; centr = true)

Compute the RV coefficient between matrices.

  • X : Matrix (n, p).
  • Y : Matrix (n, q).
  • Xbl : A list (vector) of matrices.
  • centr : Boolean indicating if the matrices will be internally centered or not.

RV is bounded in [0, 1].

A dissimilarity measure between X and Y can be computed by d = sqrt(2 * (1 - RV)).

References

Escoufier, Y., 1973. Le Traitement des Variables Vectorielles. Biometrics 29, 751–760. https://doi.org/10.2307/2529140

Josse, J., Holmes, S., 2016. Measuring multivariate association and beyond. Stat Surv 10, 132–167. https://doi.org/10.1214/16-SS116

Josse, J., Pagès, J., Husson, F., 2008. Testing the significance of the RV coefficient. Computational Statistics & Data Analysis 53, 82–91. https://doi.org/10.1016/j.csda.2008.06.012

Kazi-Aoual, F., Hitier, S., Sabatier, R., Lebreton, J.-D., 1995. Refined approximations to permutation tests for multivariate inference. Computational Statistics & Data Analysis 20, 643–656. https://doi.org/10.1016/0167-9473(94)00064-2

Mayer, C.-D., Lorent, J., Horgan, G.W., 2011. Exploratory Analysis of Multiple Omics Datasets Using the Adjusted RV Coefficient. Statistical Applications in Genetics and Molecular Biology 10. https://doi.org/10.2202/1544-6115.1540

Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., van Erk, M.J., 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25, 401–405. https://doi.org/10.1093/bioinformatics/btn634

Robert, P., Escoufier, Y., 1976. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics) 25, 257–265. https://doi.org/10.2307/2347233

Examples

X = rand(5, 10)
Y = rand(5, 3)
rv(X, Y)

X = rand(5, 15) 
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
rv(Xbl)
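
The dissimilarity mentioned above can be derived directly from the coefficient (minimal sketch):

X = rand(5, 10)
Y = rand(5, 3)
d = sqrt.(2 .* (1 .- rv(X, Y)))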
source
Jchemo.sampclaFunction
sampcla(x, k::Union{Int, Vector{Int}}, y = nothing)

Build training vs. test sets by stratified sampling.

  • x : Class membership (n) of the observations.
  • k : Nb. test observations to sample in each class. If k is a single value, the nb. of sampled observations is the same for each class. Alternatively, k can be a vector of length equal to the nb. of classes in x.
  • y : Quantitative variable (n) used when systematic sampling is done.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

If y = nothing, the sampling is random, else it is systematic over the sorted y (see function sampsys).

References

Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.

Examples

x = string.(repeat(1:3, 5))
n = length(x)
tab(x)
k = 2 
res = sampcla(x, k)
res.test
x[res.test]
tab(x[res.test])

y = rand(n)
res = sampcla(x, k, y)
res.test
x[res.test]
tab(x[res.test])
source
Jchemo.sampdfFunction
sampdf(Y::DataFrame, k::Union{Int, Vector{Int}}, id = 1:nro(Y); msamp = :rand)

Build training vs. test sets from each column of a dataframe.

  • Y : DataFrame (n, p) whose each column can contain missing values.
  • k : Nb. of test observations selected for each Y column. The selection is done within the non-missing observations of the considered column. If k is a single value, the same nb. of observations are selected for each column. Alternatively, k can be a vector of length p.
  • id : Vector (n) of IDs.

Keyword arguments:

  • msamp : Type of sampling for the test set. Possible values are: :rand = random sampling, :sys = systematic sampling over each sorted Y column (see function sampsys).

Typically, dataframe Y contains a set of response variables to predict.

Examples

using DataFrames

Y = hcat([rand(5); missing; rand(6)],
   [rand(2); missing; missing; rand(7); missing])
Y = DataFrame(Y, :auto)
n = nro(Y)

k = 3
res = sampdf(Y, k) 
#res = sampdf(Y, k, string.(1:n))
pnames(res)
res.nam
length(res.test)
res.train
res.test

## Replicated splitting Train/Test
rep = 10
k = 3
ids = [sampdf(Y, k) for i = 1:rep]
length(ids)
i = 1    # replication
ids[i]
ids[i].train 
ids[i].test
j = 1    # variable y  
ids[i].train[j]
ids[i].test[j]
ids[i].nam[j]
source
Jchemo.sampdpMethod
sampdp(X, k::Int; metric = :eucl)

Build training vs. test sets by DUPLEX sampling.

  • X : X-data (n, p).
  • k : Nb. pairs (training/test) of observations to sample. Must be <= n / 2.

Keyword arguments:

  • metric : Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).

Three outputs (= row indexes of the data) are returned:

  • train (k),
  • test (k),
  • remain (n - 2 * k).

Outputs train and test are built from the DUPLEX algorithm (Snee, 1977 p.421). They are expected to cover approximately the same X-space region and have similar statistical properties.

In practice, when output remain is not empty (i.e. when there are remaining observations), one common strategy is to add it to output train.

References

Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.

Snee, R.D., 1977. Validation of Regression Models: Methods and Examples. Technometrics 19, 415-428. https://doi.org/10.1080/00401706.1977.10489581

Examples

X = [0.381392  0.00175002 ; 0.1126    0.11263 ; 
    0.613296  0.152485 ; 0.726536  0.762032 ;
    0.367451  0.297398 ; 0.511332  0.320198 ; 
    0.018514  0.350678] 

k = 3
sampdp(X, k)
source
Jchemo.sampksMethod
sampks(X, k::Int; metric = :eucl)

Build training vs. test sets by Kennard-Stone sampling.

  • X : X-data (n, p).
  • k : Nb. test observations to sample.

Keyword arguments:

  • metric : Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).

Two outputs (= row indexes of the data) are returned:

  • train (n - k),
  • test (k).

Output test is built from the Kennard-Stone (KS) algorithm (Kennard & Stone, 1969).

Note: By construction, the set of observations selected by KS sampling contains higher variability than the set of the remaining observations. In the seminal article (K&S, 1969), the algorithm is used to select observations that will be used to build a calibration set. Conversely, in the present function, KS is used to select a test set with higher variability than the training set.

References

Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)

X = dat.X 
y = dat.Y.tbc

k = 80
res = sampks(X, k)
pnames(res)
res.train 
res.test

mod = model(pcasvd; nlv = 15) 
fit!(mod, X) 
@head T = mod.fm.T
res = sampks(T, k; metric = :mah)

#####################

n = 10
k = 25 
X = [repeat(1:n, inner = n) repeat(1:n, outer = n)] 
X = Float64.(X) 
X .= X + .1 * randn(nro(X), nco(X))
s = sampks(X, k).test
f, ax = plotxy(X[:, 1], X[:, 2])
scatter!(ax, X[s, 1], X[s, 2]; color = "red") 
f
source
Jchemo.samprandMethod
samprand(n::Int, k::Int; replace = false)

Build training vs. test sets by random sampling.

  • n : Total nb. of observations.
  • k : Nb. test observations to sample.

Keyword arguments:

  • replace : Boolean. If false, the sampling is without replacement.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

Output test is built by random sampling within 1:n.

Examples

n = 10
samprand(n, 4)
source
Jchemo.sampsysMethod
sampsys(y, k::Int)

Build training vs. test sets by systematic sampling over a quantitative variable.

  • y : Quantitative variable (n) to sample.
  • k : Nb. test observations to sample. Must be >= 2.

Two outputs are returned (= row indexes of the data):

  • train (n - k),
  • test (k).

Output test is built by systematic sampling over the rank of the y observations. For instance, if k / n ~ .3, one out of every three observations of the sorted y is selected.

Output test always contains the indexes of the minimum and maximum of y.

Examples

y = rand(7)
[y sort(y)]
res = sampsys(y, 3)
sort(y[res.test])
source
Jchemo.sampwspMethod
sampwsp(X, dmin; maxit = nro(X))

Build training vs. test sets by WSP sampling.

  • X : X-data (n, p).
  • dmin : Distance "dmin" (Santiago et al. 2012).

Keyword arguments:

  • maxit : Maximum number of iterations.

Two outputs (= row indexes of the data) are returned:

  • train (n - k),
  • test (k).

Output test is built from the "Wootton, Sergent, Phan-Tan-Luu" (WSP) algorithm, assumed to generate samples uniformely distributed in the X domain (Santiago et al. 2012).

References

Béal A. 2015. Description et sélection de données en grande dimension. Thèse de doctorat. Laboratoire d'Instrumentation et de sciences analytiques, Ecole doctorale des sciences chimiques, Université d'Aix-Marseille.

Santiago, J., Claeys-Bruno, M., Sergent, M., 2012. Construction of space-filling designs using WSP algorithm for high dimensional spaces. Chemometrics and Intelligent Laboratory Systems, Selected Papers from Chimiométrie 2010 113, 26–31. https://doi.org/10.1016/j.chemolab.2011.06.003

Examples

n = 600 ; p = 2
X = rand(n, p)
dmin = .5
s = sampwsp(X, dmin)
pnames(s)
@show length(s.test)
plotxy(X[s.test, 1], X[s.test, 2]).f
source
Jchemo.savgkMethod
savgk(nhwindow::Int, degree::Int, deriv::Int)

Compute the kernel of the Savitzky-Golay filter.

  • nhwindow : Nb. points (>= 1) of the half window.
  • degree : Degree of the smoothing polynomial, where 1 <= degree <= 2 * nhwindow.
  • deriv : Derivation order, where 0 <= deriv <= degree.

The size of the kernel is odd (npoint = 2 * nhwindow + 1):

  • x[-nhwindow], x[-nhwindow+1], ..., x[0], ...., x[nhwindow-1], x[nhwindow].

If deriv = 0, there is no derivation (only polynomial smoothing).

The case degree = 0 (i.e. simple moving average) is not allowed by the function.

References

Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002

Examples

res = savgk(21, 3, 2)
pnames(res)
res.S 
res.G 
res.kern
source
Jchemo.savgolMethod
savgol(X; kwargs...)

Savitzky-Golay derivation and smoothing of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • npoint : Size of the filter (nb. points involved in the kernel). Must be odd and >= 3. The half-window size is nhwindow = (npoint - 1) / 2.
  • degree : Degree of the smoothing polynomial. Must be: 1 <= degree <= npoint - 1.
  • deriv : Derivation order. Must be: 0 <= deriv <= degree.

The smoothing is computed by convolution (with padding), using function imfilter of package ImageFiltering.jl. Each returned point is located on the center of the kernel. The kernel is computed with function savgk.

The function returns a matrix (n, p).

References

Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002

Savitzky, A., Golay, M.J.E., 2002. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. [WWW Document]. https://doi.org/10.1021/ac60214a047

Schafer, R.W., 2011. What Is a Savitzky-Golay Filter? [Lecture Notes]. IEEE Signal Processing Magazine 28, 111–117. https://doi.org/10.1109/MSP.2011.941097

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

npoint = 11 ; degree = 2 ; deriv = 2
mod = model(savgol; npoint, degree, deriv) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f

####### Gaussian signal 

u = -15:.1:15
n = length(u)
x = exp.(-.5 * u.^2) / sqrt(2 * pi) + .03 * randn(n)
M = 10  # half window
N = 3   # degree
deriv = 0
#deriv = 1
mod = model(savgol; npoint = 2M + 1, degree = N, deriv)
fit!(mod, x')
xp = transf(mod, x')
f, ax = plotsp(x', u; color = :blue)
lines!(ax, u, vec(xp); color = :red)
f
source
Jchemo.scaleMethod
scale(X)
scale(X, weights::Weight)

Column-wise scaling of X-data.

  • X : X-data (n, p).

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(scale) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colstd(Xptrain)
@head Xptest 
@head Xtest ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.segmkfMethod
segmkf(n::Int, K::Int; rep = 1)
segmkf(group::Vector, K::Int; rep = 1)

Build segments of observations for K-fold cross-validation.

  • n : Total nb. of observations in the dataset. The sampling is implemented with 1:n.
  • group : A vector (n) defining blocks of observations.
  • K : Nb. folds (segments) splitting the n observations.

Keyword arguments:

  • rep : Nb. replications of the sampling.

For each replication, the function splits the n observations into K segments that can be used for K-fold cross-validation.

If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.

The function returns a list (vector) of rep elements. Each element of the list contains K segments (= K vectors). Each segment contains the indexes (position within 1:n) of the sampled observations.

Examples

n = 10 ; K = 3
rep = 4 
segm = segmkf(n, K; rep)
i = 1 
segm[i]
segm[i][1]

n = 10 
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]    # blocks of the observations
tab(group) 
K = 3 ; rep = 4 
segm = segmkf(group, K; rep)
i = 1 
segm[i]
segm[i][1]
group[segm[i][1]]
group[segm[i][2]]
group[segm[i][3]]
source
Jchemo.segmtsMethod
segmts(n::Int, m::Int; rep = 1, seed = nothing)
segmts(group::Vector, m::Int; rep = 1, seed = nothing)

Build segments of observations for "test-set" validation.

  • n : Total nb. of observations in the dataset. The sampling is implemented within 1:n.
  • group : A vector (n) defining blocks of observations.
  • m : Nb. test observations, or groups if group is used, returned in each segment.

Keyword arguments:

  • rep : Nb. replications of the sampling.
  • seed : Optional seed for the Random.MersenneTwister generator. Must be of length = rep. When nothing, the seed is random at each replication.

For each replication, the function builds a test set that can be used to validate a model.

If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.

The function returns a list (vector) of rep elements. Each element of the list is a vector of the indexes (positions within 1:n) of the sampled observations.

Examples

n = 10 ; m = 3
rep = 4 
segm = segmts(n, m; rep) 
i = 1
segm[i]
segm[i][1]

n = 10 
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]    # blocks of the observations
tab(group)  
m = 2 ; rep = 4 
segm = segmts(group, m; rep)
i = 1 
segm[i]
segm[i][1]
group[segm[i][1]]
source
Jchemo.selwoldMethod
selwold(indx, r; smooth = true, npoint = 5, alpha = .05, digits = 3, graph = true, 
    step = 2, xlabel = "Index", ylabel = "Value", title = "Score")

Wold's criterion to select dimensionality in LV models (e.g. PLSR).

  • indx : A variable representing the model parameter(s), e.g. nb. LVs if PLSR models.
  • r : A vector of error rates (n), e.g. RMSECV.

Keyword arguments:

  • smooth : Boolean. If true, the selection is done after a moving-average smoothing of rate R (see function mavg).
  • npoint : Window of the moving-average used to smooth rate R.
  • alpha : Proportion alpha used as threshold for rate R.
  • digits : Number of digits in the outputs.
  • graph : Boolean. If true, outputs are plotted.
  • step : Step used for defining the xticks in the graphs.
  • xlabel : Horizontal label for the plots.
  • ylabel : Vertical label for the plots.
  • title : Title of the left plot.

The selection criterion is the "precision gain ratio":

  • R = 1 - r(a+1) / r(a)

where r is an observed error rate quantifying the model performance (e.g. RMSEP, classification error rate, etc.) and a the model dimensionality (= nb. LVs). r can also represent other indicators such as the eigenvalues of a PCA.

R is the relative gain in performance efficiency after a new LV is added to the model. The iterations continue until R becomes lower than a threshold value alpha. By default and only as an indication, alpha = .05 is set in the function, but the user should set any other value depending on the data and the parsimony objective.

In his original article, Wold (1978; see also Bro et al. 2008) used the ratio of cross-validated over training residual sums of squares, i.e. PRESS over SSR. Instead, function selwold compares values of consistent nature (the successive values in the input vector r). For instance, r was set to PRESS values in Li et al. (2002) and Andries et al. (2011), which is equivalent to the "punish factor" described in Westad & Martens (2000).

The ratio R can be erratic (particularly when r is the error rate of a discrimination model), making the dimensionality selection difficult. In such a situation, function selwold proposes to calculate a smoothing of R (argument smooth).

The function returns two outputs (in addition to eventual plots):

  • opt : The index corresponding to the minimum value of r.
  • sel : The index of the selection from the R (or smoothed R) threshold.
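
A minimal sketch of the ratio R defined above, on a hypothetical vector r of error rates:

r = [1.0, .60, .43, .38, .37, .365]     # hypothetical error rates for increasing nb. LVs
R = 1 .- r[2:end] ./ r[1:(end - 1)]     # precision gain ratio
findfirst(R .< .05)    # first dimension where the gain drops below alpha = .05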

References

Andries, J.P.M., Vander Heyden, Y., Buydens, L.M.C., 2011. Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity. Analytica Chimica Acta 705, 292-305. https://doi.org/10.1016/j.aca.2011.06.037

Bro, R., Kjeldahl, K., Smilde, A.K., Kiers, H.A.L., 2008. Cross-validation of component models: A critical look at current methods. Anal Bioanal Chem 390, 1241-1251. https://doi.org/10.1007/s00216-007-1790-1

Li, B., Morris, J., Martin, E.B., 2002. Model selection for partial least squares regression. Chemometrics and Intelligent Laboratory Systems 64, 79-89. https://doi.org/10.1016/S0169-7439(02)00051-5

Westad, F., Martens, H., 2000. Variable Selection in near Infrared Spectroscopy Based on Significance Testing in Partial Least Squares Regression. J. Near Infrared Spectrosc., JNIRS 8, 117–124.

Wold S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics. 1978;20(4):397-405

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
n = nro(Xtrain)

segm = segmts(n, 50; rep = 30)
mod = model(plskern)
nlv = 0:20
res = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv).res
res[res.y1 .== minimum(res.y1), :]
plotgrid(res.nlv, res.y1;xlabel = "Nb. LVs", ylabel = "RMSEP").f
zres = selwold(res.nlv, res.y1; smooth = true, graph = true) ;
@show zres.opt
@show zres.sel
zres.f
source
Jchemo.sepMethod
sep(pred, Y)

Compute the corrected SEP ("SEP_c"), i.e. the standard deviation of the prediction errors.

  • pred : Predictions.
  • Y : Observed data.

References

Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.-M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends in Analytical Chemistry 29, 1073–1081. https://doi.org/10.1016/j.trac.2010.05.006

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
sep(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
sep(pred, ytest)
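
A minimal hand computation for the univariate case (illustration only; the bias-correction convention used internally may differ):

using Statistics
std(ytest .- vec(pred))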
source
Jchemo.snormMethod
snorm(X)

Row-wise norming of X-data.

  • X : X-data (n, p).

Each row of X is divided by its norm.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

mod = model(snorm) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
rownorm(Xptrain)
rownorm(Xptest)
source
Jchemo.snvMethod
snv(X; kwargs...)

Standard-normal-variate (SNV) transformation of each row of X-data.

  • X : X-data (n, p).

Keyword arguments:

  • centr : Boolean indicating if the centering is done.
  • scal : Boolean indicating if the scaling is done.
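
For illustration, with centr = true and scal = true, the transformation amounts to centering and scaling each row (a minimal sketch using Statistics; the exact standard-deviation denominator may differ from the Jchemo implementation):

using Statistics
X = rand(5, 10)
Xp = (X .- mean(X; dims = 2)) ./ std(X; dims = 2)   # row-wise centering and scaling
mean(Xp; dims = 2)    # ~0 for each row
std(Xp; dims = 2)     # ~1 for each row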

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f

centr = true ; scal = true
mod = model(snv; centr, scal) 
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
source
Jchemo.softMethod
soft(x::Real, delta)

Soft thresholding function.

  • x : Value to transform.
  • delta : Range for the thresholding.

The returned value is:

  • sign(x) * max(0, abs(x) - delta)

where delta >= 0.
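
For illustration, the rule above can be written as a one-line Julia function (soft_manual is a hypothetical name, not part of Jchemo):

soft_manual(x, delta) = sign(x) * max(0, abs(x) - delta)
soft_manual(3, .2)      # 2.8
soft_manual(.1, .2)     # 0.0 (shrunk to zero)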

Examples

using CairoMakie 

delta = .2
soft(3, delta)

x = LinRange(-2, 2, 100)
y = soft.(x, delta)
lines(x, y)
source
Jchemo.softmaxMethod
softmax(x::AbstractVector)
softmax(X::Union{Matrix, DataFrame})

Softmax function.

  • x : A vector to transform.
  • X : A matrix whose rows are transformed.

Let v be a vector:

  • softmax(v) = exp.(v) / sum(exp.(v))
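
A hand computation of this definition (a minimal sketch; softmax_manual is a hypothetical name, not part of Jchemo):

softmax_manual(v) = exp.(v) / sum(exp.(v))
v = [1.0, 2.0, 3.0]
softmax_manual(v)
sum(softmax_manual(v))      # 1.0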

Examples

x = 1:3
softmax(x)

X = rand(5, 3)
softmax(X)
source
Jchemo.soplsrMethod
soplsr(Xbl, Y; kwargs...)
soplsr(Xbl, Y, weights::Weight; kwargs...)
soplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)

Multiblock sequentially orthogonalized PLSR (SO-PLSR).

  • Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs = scores T) to compute.
  • scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation.

References

Biancolillo et al., 2015. Combining SO-PLS and linear discriminant analysis for multi-block classification. Chemometrics and Intelligent Laboratory Systems, 141, 58-67.

Biancolillo, A. 2016. Method development in the area of multi-block analysis focused on food analysis. PhD. University of Copenhagen.

Menichelli et al., 2014. SO-PLS as an exploratory tool for path modelling. Food Quality and Preference, 36, 122-134.

Examples

using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2") 
@load db dat
pnames(dat) 
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s) 
ntrain = nro(ytrain) 
ntest = nro(ytest) 
ntot = ntrain + ntest 
(ntot = ntot, ntrain , ntest)

nlv = 2
#nlv = [2, 1, 2]
#nlv = [2, 0, 1]
scal = false
#scal = true
mod = model(soplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod) 
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)

res = predict(mod, Xbltest)
res.pred 
rmsep(res.pred, ytest)
source
Jchemo.spcaMethod
spca(X; kwargs...)
spca(X, weights::Weight; kwargs...)
spca!(X::Matrix, weights::Weight; kwargs...)

Sparse PCA (Shen & Huang 2008).

  • X : X-data (n, p).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. principal components (PCs).
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each principal component (PC). Can be a single integer (i.e. same nb. of variables for each PC), or a vector of length nlv.
  • tol : Tolerance value for stopping the iterations.
  • maxit : Maximum nb. of Nipals iterations.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Sparse principal component analysis via regularized low rank matrix approximation (Shen & Huang 2008). A Nipals algorithm is used. The function provides three methods of thresholding to compute the sparse loadings:

  • msparse = :soft: Soft thresholding of standardized loadings. Let us note v a given loading vector before thresholding. Vector abs(v) is then standardized to its maximal component (= max{abs(v[i]), i = 1..p}). The soft-thresholding function (see function soft) is applied to this standardized vector, with the constant delta ∈ [0, 1]. This returns the sparse vector theta. Vector v is multiplied term-by-term by this vector theta, which finally gives the sparse loadings.

  • msparse = :mix: Method used in function spca of the R package mixOmics (Lê Cao et al.). For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected. Then a soft-thresholding is applied to the corresponding selected loadings. Range delta is automatically (internally) set equal to the maximal value of the components of abs(v) corresponding to variables removed from the selection.

  • msparse = :hard: For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected.

The case msparse = :mix returns the same results as function spca of the R package mixOmics.
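
As an illustration of the :soft rule described above, the thresholding of a single loading vector can be sketched as follows (hypothetical standalone code, not the internal Jchemo implementation):

v = [.8, -.1, .4, -.6, .05]             # loading vector before thresholding
delta = .5
vstd = abs.(v) / maximum(abs.(v))       # standardize abs(v) to its maximal component
theta = max.(0, vstd .- delta)          # soft thresholding (nonnegative input)
vsparse = v .* theta                    # sparse loadings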

Note: The resulting sparse loading vectors (the columns of P) are in general not orthogonal. Therefore, there is no unique decomposition of the variance of X such as in PCA. Function summary returns the following objects:

  • explvarx: The proportion of variance of X explained by each column t of T, computed by regressing X on t (such as what is done in PLS).
  • explvarx_adj: Adjusted explained variance proposed by Shen & Huang 2008 section 2.3.

References

Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics

https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html

Shen, H., Huang, J.Z., 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015–1034. https://doi.org/10.1016/j.jmva.2007.06.007

Examples

using JchemoData, JLD2 
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2") 
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest) 
Xtrain = X[s.train, :]
Xtest = X[s.test, :]

nlv = 3 
msparse = :mix ; nvar = 2
#msparse = :hard ; nvar = 2
scal = false
mod = model(spca; nlv, msparse, nvar, scal) ;
fit!(mod, Xtrain) 
fm = mod.fm ;
pnames(fm)
fm.niter
fm.sellv 
fm.sel
fm.P
fm.P' * fm.P
@head T = fm.T
@head transf(mod, Xtrain)

@head Ttest = transf(fm, Xtest)

res = summary(mod, Xtrain) ;
res.explvarx
res.explvarx_adj

nlv = 3 
msparse = :soft ; delta = .4 
mod = model(spca; nlv, msparse, delta) ;
fit!(mod, Xtrain) 
mod.fm.P
source
Jchemo.splskdedaMethod
splskdeda(X, y; kwargs...)
splskdeda(X, y, weights::Weight; kwargs...)

Sparse PLS-KDE-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plskdeda (PLS-KDE-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function splslda for examples.

source
Jchemo.splskernMethod
splskern(X, Y; kwargs...)
splskern(X, Y, weights::Weight; kwargs...)
splskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)

Sparse partial least squares regression (Lê Cao et al. 2008).

  • X : X-data (n, p).
  • Y : Y-data (n, q).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.

Sparse partial least squares regression (Lê Cao et al. 2008), with the fast "improved kernel algorithm #1" of Dayal & MacGregor (1997).

In the present version of splskern, the sparse correction only concerns X. The function provides three methods of thresholding to compute the sparse X-loading weights w, see function spca for description (same principles). The case msparse = :mix returns the same results as function spls of the R package mixOmics with the regression mode (and without sparseness on Y).

The case msparse = :hard (or msparse = :mix) with nvar = 1 corresponds to the COVSEL regression described in Roger et al. 2011 (see also Höskuldsson 1992).

References

Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.

Höskuldsson, A., 1992. The H-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, Proceedings of the 2nd Scandinavian Symposium on Chemometrics 14, 139–153. https://doi.org/10.1016/0169-7439(92)80099-P

Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., Besse, P., 2008. A Sparse PLS for Variable Selection when Integrating Omics Data. Statistical Applications in Genetics and Molecular Biology 7. https://doi.org/10.2202/1544-6115.1390

Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics

https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html

Roger, J.M., Palagos, B., Bertrand, D., Fernandez-Ahumada, E., 2011. covsel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chem. Lab. Int. Syst. 106, 216-223.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)

nlv = 15
msparse = :mix ; nvar = 5
#msparse = :hard ; nvar = 5
mod = model(splskern; nlv, msparse, nvar) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head mod.fm.W

coef(mod)
coef(mod; nlv = 3)

@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", 
    ylabel = "Prop. Explained X-Variance").f
source
Jchemo.splsldaMethod
splslda(X, y; kwargs...)
splslda(X, y, weights::Weight; kwargs...)

Sparse PLS-LDA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plslda (PLS-LDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
msparse = :mix ; nvar = 10
mod = model(splslda; nlv, msparse, nvar) 
#mod = model(splsqda; nlv, msparse, nvar, alpha = .1) 
#mod = model(splskdeda; nlv, msparse, nvar, a_kde = .9) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

fmpls = fm.fm.fmpls ; 
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fmpls)
summary(fmpls, Xtrain)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
source
Jchemo.splsqdaMethod
splsqda(X, y; kwargs...)
splsqda(X, y, weights::Weight; kwargs...)

Sparse PLS-QDA (with continuum).

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsqda (PLS-QDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function splslda for examples.

source
Jchemo.splsrdaMethod
splsrda(X, y; kwargs...)
splsrda(X, y, weights::Weight; kwargs...)

Sparse PLSR-DA.

  • X : X-data (n, p).
  • y : Univariate class membership (n).
  • weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to compute.
  • msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See thereafter.
  • delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
  • nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
  • prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Same as function plsrda (PLSR-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.

See function plsrda and splskern for details.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

nlv = 15
msparse = :mix ; nvar = 10
mod = model(splsrda; nlv, msparse, nvar) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)

coef(fm.fm)

res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt

predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
source
Jchemo.ssqMethod
ssq(X)

Compute the total inertia of a matrix.

  • X : Matrix.

Sum of all the squared components of X (= norm(X)^2; Squared Frobenius norm).
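
For illustration, ssq(X) coincides (up to floating-point error) with the two hand-written forms below (a minimal sketch using LinearAlgebra):

using LinearAlgebra
X = rand(5, 2)
sum(X.^2)       # sum of the squared components
norm(X)^2       # squared Frobenius norm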

Examples

X = rand(5, 2) 
ssq(X)
source
Jchemo.ssrMethod
ssr(pred, Y)

Compute the sum of squared prediction errors (SSR).

  • pred : Predictions.
  • Y : Observed data.
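
For illustration, the SSR is the sum of the squared differences between predictions and observations (a minimal sketch; compare with ssr(pred, Y)):

pred = [1.0 2.0; 3.0 4.0]
Y = [1.1 1.9; 2.8 4.3]
sum((pred .- Y).^2)     # should match ssr(pred, Y)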

Examples

Xtrain = rand(10, 5) 
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5) 
Ytest = rand(4, 2)
ytest = Ytest[:, 1]

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, Ytest)

mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, ytest)
source
Jchemo.stahMethod
stah(X, a; kwargs...)

Compute the Stahel-Donoho outlierness.

  • X : X-data (n, p).
  • a : Nb. dimensions simulated for the projection pursuit method.

Keyword arguments:

  • scal : Boolean. If true, matrix X is centred (by median) and scaled (by MAD) before computing the outlierness.

See Maronna and Yohai 1995 for details on the outlierness measure.

This outlierness measure is computed from a projection-pursuit approach:

  • A projection matrix P (p, a) is built randomly from binary (0/1) data,
  • and the observations (rows of X) are projected on the a directions.
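
A conceptual sketch of this projection-pursuit measure (simplified and hypothetical code, not the exact Jchemo implementation; the scaling options and the normalization of the directions are omitted):

using Statistics
n, p, a = 100, 10, 20
X = randn(n, p)
P = rand(0:1, p, a)                         # random binary (0/1) directions
T = X * P                                   # projected data (n, a)
madv(x) = median(abs.(x .- median(x)))      # raw median absolute deviation
med = [median(T[:, j]) for j = 1:a]
s = [madv(T[:, j]) for j = 1:a]
d = [maximum(abs(T[i, j] - med[j]) / s[j] for j = 1:a) for i = 1:n]   # outlierness per observation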

References

Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341. https://doi.org/10.1080/01621459.1995.10476517

Examples

n = 300 ; p = 700 ; m = 80
ntot = n + m
X1 = randn(n, p)
X2 = randn(m, p) .+ rand(1:3, p)'
X = vcat(X1, X2)

a = 10
scal = false
#scal = true
res = stah(X, a; scal) ;
pnames(res)
res.d
plotxy(1:nro(X), res.d).f
source
Jchemo.summMethod
summ(X; digits = 3)
summ(X, y; digits = 3)

Summarize a dataset (or a variable).

  • X : A dataset (n, p).
  • y : A categorical variable (n) (class membership).
  • digits : Nb. digits in the outputs.

Examples

n = 50
X = rand(n, 3) 
y = rand(1:3, n)
res = summ(X)
pnames(res)
summ(X[:, 2]).res

summ(X, y)
source
Jchemo.svmdaMethod
svmda(X, y; kwargs...)

Support vector machine for discrimination "C-SVC" (SVM-DA).

  • X : X-data (n, p).
  • y : Univariate class membership (n).

Keyword arguments:

  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
  • gamma : kern parameter, see below.
  • degree : kern parameter, see below.
  • coef0 : kern parameter, see below.
  • cost : Cost of constraints violation C parameter.
  • epsilon : Epsilon parameter in the loss function.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Kernel types:

  • :krbf – radial basis function: exp(-gamma * ||x - y||^2)
  • :kpol – polynomial: (gamma * x' * y + coef0)^degree
  • :klin – linear: x' * y
  • :ktanh – sigmoid: tanh(gamma * x' * y + coef0)

The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the LIBSVM library (Chang & Lin 2001).

References

Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl

Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz

Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X) 
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)

kern = :krbf ; gamma = 1e4
cost = 1000 ; epsilon = .5
mod = model(svmda; kern, gamma, cost, epsilon) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni

res = predict(mod, Xtest) ; 
pnames(res) 
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
source
Jchemo.svmrMethod
svmr(X, y; kwargs...)

Support vector machine for regression (Epsilon-SVR).

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
  • gamma : kern parameter, see below.
  • degree : kern parameter, see below.
  • coef0 : kern parameter, see below.
  • cost : Cost of constraints violation C parameter.
  • epsilon : Epsilon parameter in the loss function.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

Kernel types:

  • :krbf – radial basis function: exp(-gamma * ||x - y||^2)
  • :kpol – polynomial: (gamma * x' * y + coef0)^degree
  • :klin – linear: x' * y
  • :ktanh – sigmoid: tanh(gamma * x' * y + coef0)

The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the LIBSVM library (Chang & Lin 2001).

References

Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl

Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz

Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

kern = :krbf ; gamma = .1
cost = 1000 ; epsilon = 1
mod = model(svmr; kern, gamma, cost, epsilon) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    

####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106 
x = collect(-10:.2:10) 
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x) 
y = zy + .2 * randn(n) 
kern = :krbf ; gamma = .1
mod = model(svmr; kern, gamma) 
fit!(mod, x, y)
pred = predict(mod, x).pred 
f, ax = scatter(x, y) 
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
source
Jchemo.tabMethod
tab(x)

Univariate tabulation.

  • x : Categorical variable.

The output contains sorted levels.

Examples

x = rand(["a";"b";"c"], 20)
res = tab(x)
res.keys
res.vals
source
Jchemo.tabdfMethod
tabdf(X; groups = nothing)

Compute the nb. of occurrences in categorical variables of a dataset.

  • X : Data.
  • groups : Vector of the names of the group variables to consider in X (by default: all the columns of X).

The output (dataframe) contains sorted levels.

Examples

using DataFrames

n = 20
X =  hcat(rand(1:2, n), rand(["a", "b", "c"], n))
tabdf(X)
tabdf(X[:, 2])

df = DataFrame(X, [:v1, :v2])
tabdf(df)
tabdf(df; groups = [:v1, :v2])
tabdf(df; groups = :v2)
source
Jchemo.tabduplMethod
tabdupl(x)

Tabulate duplicated values in a vector.

  • x : Categorical variable.

Examples

x = ["a", "b", "c", "a", "b", "b"]
tab(x)
res = tabdupl(x)
res.keys
res.vals
source
Jchemo.transfMethod
transf(object::Blockscal, Xbl)
transf!(object::Blockscal, Xbl)

Compute the preprocessed data from a model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Center, X)
transf!(object::Center, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Comdim, Xbl; nlv = nothing)
transfbl(object::Comdim, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Cscale, X)
transf!(object::Cscale, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Detrend, X)
transf!(object::Detrend, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Dkplsr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Fdif, X)
transf!(object::Fdif, X::Matrix, M::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
  • M : Pre-allocated output matrix (n, p - npoint + 1).

The in-place function stores the output in M.

source
Jchemo.transfMethod
transf(object::Interpl, X)
transf!(object::Interpl, X::Matrix, M::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
  • M : Pre-allocated output matrix (n, p).

The in-place function stores the output in M.

source
Jchemo.transfMethod
transf(object::Kpca, X; nlv = nothing)

Compute PCs (scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Kplsr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Mavg, X)
transf!(object::Mavg, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Mbconcat, Xbl)

Compute the preprocessed data from a model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Mbpca, Xbl; nlv = nothing)
transfbl(object::Mbpca, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Mbplslda, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Mbplsrda, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Pcr, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model and a matrix X.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Plslda, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Plsrda, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfMethod
transf(object::Rmgap, X)
transf!(object::Rmgap, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Rosaplsr, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Rp, X; nlv = nothing)

Compute scores T from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which scores T are computed.
  • nlv : Nb. scores to compute.
source
Jchemo.transfMethod
transf(object::Savgol, X)
transf!(object::Savgol, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Scale, X)
transf!(object::Scale, X::Matrix)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Snorm, X)
transf!(object::Snorm, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Snv, X)
transf!(object::Snv, X)

Compute the preprocessed data from a model.

  • object : Model.
  • X : X-data to transform.
source
Jchemo.transfMethod
transf(object::Soplsr, Xbl)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
source
Jchemo.transfMethod
transf(object::Spca, X; nlv = nothing)

Compute principal components (PCs = scores T) from a fitted model and X-data.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Union{Pca, Fda}, X; nlv = nothing)

Compute principal components (PCs = scores T) from a fitted model and X-data.

  • object : The fitted model.
  • X : X-data for which PCs are computed.
  • nlv : Nb. PCs to compute.
source
Jchemo.transfMethod
transf(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfMethod
transf(object::Union{Plsr, Splsr}, X; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : Matrix (m, p) for which LVs are computed.
  • nlv : Nb. LVs to consider.
source
Jchemo.transfblMethod
transfbl(object::Cca, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Ccawold, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Plscan, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Plstuck, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.transfblMethod
transfbl(object::Rasvd, X, Y; nlv = nothing)

Compute latent variables (LVs = scores T) from a fitted model.

  • object : The fitted model.
  • X : X-data for which components (LVs) are computed.
  • Y : Y-data for which components (LVs) are computed.
  • nlv : Nb. LVs to compute.
source
Jchemo.treer_dtMethod
treer_dt(X, y; kwargs...)

Regression tree (CART) with DecisionTree.jl.

  • X : X-data (n, p).
  • y : Univariate y-data (n).

Keyword arguments:

  • n_subfeatures : Nb. variables to select at random at each split (default: 0 ==> keep all).
  • max_depth : Maximum depth of the decision tree (default: -1 ==> no maximum).
  • min_sample_leaf : Minimum number of samples each leaf needs to have.
  • min_sample_split : Minimum number of observations needed for a split.
  • scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.

The function fits a single regression tree (CART) using package DecisionTree.jl.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.

DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl

Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245

Examples

using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2") 
@load db dat
pnames(dat)
X = dat.X 
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)

n_subfeatures = p / 3 
max_depth = 15
mod = model(treer_dt; n_subfeatures, max_depth) 
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)

res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction", 
    ylabel = "Observed").f    
source
Jchemo.vcatdfMethod
vcatdf(dat; cols = :intersect)

Vertical concatenation of a list of dataframes.

  • dat : List (vector) of dataframes.
  • cols : Determines the columns of the returned data frame. See ?DataFrames.vcat.

Examples

using DataFrames
dat1 = DataFrame(rand(5, 2), [:v3, :v1]) 
dat2 = DataFrame(100 * rand(2, 2), [:v3, :v1])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)

dat2 = DataFrame(100 * rand(2, 2), [:v1, :v3])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)

dat2 = DataFrame(100 * rand(2, 3), [:v3, :v1, :a])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
Jchemo.vcatdf(dat; cols = :union)
source
Jchemo.vcolMethod
vcol(X::AbstractMatrix, j)
vcol(X::DataFrame, j)
vcol(x::Vector, j)

View of the j-th column(s) of a matrix X, or of the j-th element(s) of vector x.

source
Jchemo.vipMethod
vip(object::Union{Pcr, Plsr}; nlv = nothing)
vip(object::Union{Pcr, Plsr}, Y; nlv = nothing)

Variable importance on Projections (VIP).

  • object : The fitted model.
  • Y : The Y-data that was used to fit the model.

Keyword arguments:

  • nlv : Nb. latent variables (LVs) to consider. If nothing, the maximal model is considered.

For a PLS model (or PCR, etc.) fitted on (X, Y) with A latent variables, and for variable xj (column j of X):

  • VIP(xj) = [ Sum(a=1,...,A) R2(Yc, ta) * waj^2 ] / [ (1 / p) * Sum(a=1,...,A) R2(Yc, ta) ]

where:

  • Yc is the centered Y,
  • ta is the a-th X-score,
  • R2(Yc, ta) is the proportion of Yc-variance explained by ta, i.e. ||Yc.hat||^2 / ||Yc||^2 (where Yc.hat is the LS estimate of Yc by ta).

When Y is used, R2(Yc, ta) is replaced by the redundancy Rd(Yc, ta) (see function rd), as in Tenenhaus 1998 p. 139.

References

Chong, I.-G., Jun, C.-H., 2005. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78, 103–112. https://doi.org/10.1016/j.chemolab.2004.12.011

Mehmood, T., Sæbø, S., Liland, K.H., 2020. Comparison of variable selection methods in partial least squares regression. Journal of Chemometrics 34, e3226. https://doi.org/10.1002/cem.3226

Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

Examples

X = [1. 2 3 4; 4 1 6 7; 12 5 6 13; 
    27 18 7 6; 12 11 28 7] 
Y = [10. 11 13; 120 131 27; 8 12 4; 
    1 200 8; 100 10 89] 
y = Y[:, 1] 
ycla = [1; 1; 1; 2; 2]

nlv = 3
mod = model(plskern; nlv)
fit!(mod, X, y)
res = vip(mod.fm)
pnames(res)
res.imp

fit!(mod, X, Y)
vip(mod.fm).imp
vip(mod.fm, Y).imp

mod = model(plsrda; nlv) 
fit!(mod, X, ycla)
pnames(mod.fm)
fm = mod.fm.fm ;
vip(fm).imp
Ydummy = dummy(ycla).Y
vip(fm, Ydummy).imp

mod = model(plslda; nlv) 
fit!(mod, X, ycla)
pnames(mod.fm.fm)
fm = mod.fm.fm.fmpls ;
vip(fm).imp
vip(fm, Ydummy).imp
source
Jchemo.vipermMethod
viperm(mod, X, Y; rep = 50, psamp = .3, score = rmsep)

Variable importance by direct permutations.

  • mod : Model to evaluate.
  • X : X-data (n, p).
  • Y : Y-data (n, q).

Keyword arguments:

  • rep : Number of replications of the splitting training/test.
  • psamp : Proportion of data used as test set to compute the score.
  • score : Function computing the prediction score.

The principle is as follows:

  • Data (X, Y) are split randomly into a training and a test set.
  • The model is fitted on Xtrain, and the score (error rate) is computed on Xtest. This gives the reference error rate.
  • Rows of a given variable (feature) j in Xtest are randomly permuted (the rest of Xtest is unchanged). The score is then computed on this permuted matrix (i.e. Xtest after the rows of variable j have been permuted). The importance of variable j is computed as the difference between this score and the reference score.
  • This process is run for each variable j separately and replicated rep times. Average results are provided in the outputs, as well as the results per replication.

In general, this method returns results similar to those of the out-of-bag permutation method used in random forests (Breiman, 2001).
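
A minimal sketch of this permutation principle for a single replication (illustrative only; the data Xtrain, ytrain, Xtest, ytest and the unfitted model mod are assumed to be defined, e.g. as in the example below, and the real function also handles the random splitting, the replications and the aggregation of the results):

using Random
fit!(mod, Xtrain, ytrain)
score0 = rmsep(predict(mod, Xtest).pred, ytest)[1]      # reference score
p = nco(Xtest)
imp = zeros(p)
for j = 1:p
    Xperm = copy(Xtest)
    Xperm[:, j] = Xtest[shuffle(1:nro(Xtest)), j]       # permute variable j only
    imp[j] = rmsep(predict(mod, Xperm).pred, ytest)[1] - score0
end
imp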

References

Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500

Examples

using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2") 
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y 
wl_str = names(X)
wl = parse.(Float64, wl_str) 
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)

## Work on the j-th y-variable 
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]

mod = model(plskern; nlv = 9)
res = viperm(mod, Xtrain, ytrain; rep = 50, score = rmsep) ;
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1]; xlabel = "Wavelength (nm)", ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f

mod = model(rfr_dt; n_trees = 10, max_depth = 2000, min_samples_leaf = 5)
res = viperm(mod, Xtrain, ytrain; rep = 50)
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1];
    xlabel = "Wavelength (nm)", 
    ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
source
Jchemo.vrowMethod
vrow(X::AbstractMatrix, i)
vrow(X::DataFrame, i)
vrow(x::Vector, i)

View of the i-th row(s) of a matrix X, or of the i-th element(s) of vector x.

source
Jchemo.wdistMethod
wdist(d; h = 2, criw = 4, squared = false)
wdist!(d; h = 2, criw = 4, squared = false)

Compute weights from distances using a decreasing exponential function.

  • d : A vector of distances.

Keyword arguments:

  • h : A scaling positive scalar defining the shape of the weight function.
  • criw : A positive scalar defining outliers in the distances vector d.
  • squared : If true, distances are replaced by the squared distances; the weight function is then a Gaussian (RBF) kernel function.

Weights are computed by:

  • exp(-d / (h * MAD(d)))

or are set to 0 for distances > Median(d) + criw * MAD(d). This is an adaptation of the weight function presented in Kim et al. 2011.

The weights decrease with increasing distances. The lower h, the sharper the decreasing function. Weights are set to 0 for outliers (extreme distances).
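
A hand-written illustration of this weighting rule (a minimal sketch; madv below is the raw median absolute deviation, whereas Jchemo's MAD may include the usual consistency factor, and the squared option is not shown):

using Statistics
madv(x) = median(abs.(x .- median(x)))      # raw median absolute deviation
d = [.5, 1, 2, 10.]
h = 2 ; criw = 4
w = exp.(-d ./ (h * madv(d)))
w[d .> median(d) + criw * madv(d)] .= 0     # extreme distances get weight 0
w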

References

Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011; 421(2):269-274. https://doi.org/10.1016/j.ijpharm.2011.10.007

Examples

using CairoMakie, Distributions

x1 = rand(Chisq(10), 100) ;
x2 = rand(Chisq(40), 10) ;
d = [sqrt.(x1) ; sqrt.(x2)]
h = 2 ; criw = 3
w = wdist(d; h, criw) ;
f = Figure(size = (600, 300))
ax1 = Axis(f, xlabel = "Distance", ylabel = "Nb. observations")
hist!(ax1, d, bins = 30)
ax2 = Axis(f, xlabel = "Distance", ylabel = "Weight")
scatter!(ax2, d, w)
f[1, 1] = ax1 
f[1, 2] = ax2 
f

d = collect(0:.5:15) ;
h = [.5, 1, 1.5, 2.5, 5, 10, Inf] 
#h = [1, 2, 5, Inf] 
w = wdist(d; h = h[1]) 
f = Figure(size = (500, 400))
ax = Axis(f, xlabel = "Distance", ylabel = "Weight")
lines!(ax, d, w, label = string("h = ", h[1]))
for i = 2:length(h)
    w = wdist(d; h = h[i])
    lines!(ax, d, w, label = string("h = ", h[i]))
end
axislegend("Values of h"; position = :lb)
f[1, 1] = ax
f
source
Jchemo.xfitMethod
xfit(object)
xfit(object, X; nlv = nothing)
xfit!(object, X::Matrix; nlv = nothing)

Matrix fitting from a bilinear model (e.g. PCA).

  • object : The fitted model.
  • X : New X-data to be approximated from the model. Must be on the same scale as the X-data used to fit the model object, i.e. before any centering and scaling.

Keyword arguments:

  • nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.

Compute an approximation of matrix X from a bilinear model (e.g. PCA or PLS) fitted on X. The fitted X is returned in the original scale of the X-data used to fit the model object.

Examples

X = [1. 2 3 4; 4 1 6 7; 12 5 6 13; 
    27 18 7 6; 12 11 28 7] 
Y = [10. 11 13; 120 131 27; 8 12 4; 
    1 200 8; 100 10 89] 
n, p = size(X)
Xnew = X[1:3, :]
Ynew = Y[1:3, :]
y = Y[:, 1]
ynew = Ynew[:, 1]
weights = mweight(rand(n))

nlv = 2 
scal = false
#scal = true
mod = model(pcasvd; nlv, scal) ;
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew; nlv = 0)
xfit(fm, Xnew; nlv = 1)
fm.xmeans

@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)

@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)

mod = model(pcasvd; nlv = min(n, p), scal) 
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm) 
@head xfit(fm, X)
@head xresid(fm, X)

nlv = 3
scal = false
#scal = true
mod = model(plskern; nlv, scal)
fit!(mod, X, Y, weights) 
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew, nlv = 0)
xfit(fm, Xnew, nlv = 1)

@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)

@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)

mod = model(plskern; nlv = min(n, p), scal) 
fit!(mod, X, Y, weights) 
fm = mod.fm ;
@head xfit(fm) 
@head xfit(fm, Xnew)
@head xresid(fm, Xnew)
source
Jchemo.xresidMethod
xresid(object, X; nlv = nothing)
xresid!(object, X::Matrix; nlv = nothing)

Residual matrix from a bilinear model (e.g. PCA).

  • object : The fitted model.
  • X : New X-data to be approximated from the model. Must be on the same scale as the X-data used to fit the model object, i.e. before any centering and scaling.

Keyword arguments:

  • nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.

Compute the residual matrix:

  • E = X - X_fit

where X_fit is the fitted X returned by function xfit. See xfit for examples.
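
A minimal check of this identity, assuming a PCA model is fitted as in the xfit examples:

X = rand(5, 4)
mod = model(pcasvd; nlv = 2)
fit!(mod, X)
fm = mod.fm ;
E = xresid(fm, X; nlv = 1)
E ≈ X - xfit(fm, X; nlv = 1)    # true (up to floating-point error)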

source