Index of functions
Here is a list of all exported functions from Jchemo.jl.
For more details, click on the link and you'll be directed to the function help.
Base.summary
Jchemo.aggstat
Jchemo.aggsum
Jchemo.aicplsr
Jchemo.aov1
Jchemo.bias
Jchemo.blockscal
Jchemo.calds
Jchemo.calpds
Jchemo.cca
Jchemo.ccawold
Jchemo.center
Jchemo.cglsr
Jchemo.coef
Jchemo.colmad
Jchemo.colmean
Jchemo.colmed
Jchemo.colnorm
Jchemo.colstd
Jchemo.colsum
Jchemo.colvar
Jchemo.comdim
Jchemo.conf
Jchemo.cor2
Jchemo.corm
Jchemo.cosm
Jchemo.cosv
Jchemo.covm
Jchemo.cscale
Jchemo.detrend
Jchemo.dfplsr_cg
Jchemo.difmean
Jchemo.dkplskdeda
Jchemo.dkplslda
Jchemo.dkplsqda
Jchemo.dkplsr
Jchemo.dkplsrda
Jchemo.dmkern
Jchemo.dmnorm
Jchemo.dmnormlog
Jchemo.dummy
Jchemo.dupl
Jchemo.ensure_df
Jchemo.ensure_mat
Jchemo.eposvd
Jchemo.errp
Jchemo.euclsq
Jchemo.fblockscal
Jchemo.fcenter
Jchemo.fcscale
Jchemo.fda
Jchemo.fdasvd
Jchemo.fdif
Jchemo.findindex
Jchemo.findmax_cla
Jchemo.frob
Jchemo.fscale
Jchemo.fweight
Jchemo.getknn
Jchemo.gridcv
Jchemo.gridcv_br
Jchemo.gridcv_lb
Jchemo.gridcv_lv
Jchemo.gridscore
Jchemo.gridscore_br
Jchemo.gridscore_lb
Jchemo.gridscore_lv
Jchemo.head
Jchemo.interpl
Jchemo.isel!
Jchemo.kdeda
Jchemo.knnda
Jchemo.knnr
Jchemo.kpca
Jchemo.kplskdeda
Jchemo.kplslda
Jchemo.kplsqda
Jchemo.kplsr
Jchemo.kplsrda
Jchemo.kpol
Jchemo.krbf
Jchemo.krr
Jchemo.krrda
Jchemo.lda
Jchemo.lg
Jchemo.list
Jchemo.locw
Jchemo.locwlv
Jchemo.lwmlr
Jchemo.lwmlrda
Jchemo.lwplslda
Jchemo.lwplsqda
Jchemo.lwplsr
Jchemo.lwplsravg
Jchemo.lwplsrda
Jchemo.mahsq
Jchemo.mahsqchol
Jchemo.matB
Jchemo.matW
Jchemo.mavg
Jchemo.mbconcat
Jchemo.mblock
Jchemo.mbpca
Jchemo.mbplskdeda
Jchemo.mbplslda
Jchemo.mbplsqda
Jchemo.mbplsr
Jchemo.mbplsrda
Jchemo.mbplswest
Jchemo.merrp
Jchemo.miss
Jchemo.mlev
Jchemo.mlr
Jchemo.mlrchol
Jchemo.mlrda
Jchemo.mlrpinv
Jchemo.mlrpinvn
Jchemo.mlrvec
Jchemo.model
Jchemo.mpar
Jchemo.mse
Jchemo.msep
Jchemo.mweight
Jchemo.mweightcla
Jchemo.nco
Jchemo.nipals
Jchemo.nipalsmiss
Jchemo.normw
Jchemo.nro
Jchemo.occod
Jchemo.occsd
Jchemo.occsdod
Jchemo.occstah
Jchemo.out
Jchemo.pcaeigen
Jchemo.pcaeigenk
Jchemo.pcanipals
Jchemo.pcanipalsmiss
Jchemo.pcasph
Jchemo.pcasvd
Jchemo.pcr
Jchemo.pip
Jchemo.plist
Jchemo.plotconf
Jchemo.plotgrid
Jchemo.plotsp
Jchemo.plotxy
Jchemo.plscan
Jchemo.plskdeda
Jchemo.plskern
Jchemo.plslda
Jchemo.plsnipals
Jchemo.plsqda
Jchemo.plsravg
Jchemo.plsrda
Jchemo.plsrosa
Jchemo.plssimp
Jchemo.plstuck
Jchemo.plswold
Jchemo.pmod
Jchemo.pnames
Jchemo.predict
Jchemo.psize
Jchemo.pval
Jchemo.qda
Jchemo.r2
Jchemo.rasvd
Jchemo.rd
Jchemo.rda
Jchemo.recodcat2int
Jchemo.recodnum2int
Jchemo.recovkwargs
Jchemo.replacebylev
Jchemo.replacebylev2
Jchemo.replacedict
Jchemo.residcla
Jchemo.residreg
Jchemo.rfda_dt
Jchemo.rfr_dt
Jchemo.rmcol
Jchemo.rmgap
Jchemo.rmrow
Jchemo.rmsep
Jchemo.rmsepstand
Jchemo.rosaplsr
Jchemo.rowmean
Jchemo.rownorm
Jchemo.rowstd
Jchemo.rowsum
Jchemo.rowvar
Jchemo.rp
Jchemo.rpd
Jchemo.rpdr
Jchemo.rpmatgauss
Jchemo.rpmatli
Jchemo.rr
Jchemo.rrchol
Jchemo.rrda
Jchemo.rrr
Jchemo.rv
Jchemo.sampcla
Jchemo.sampdf
Jchemo.sampdp
Jchemo.sampks
Jchemo.samprand
Jchemo.sampsys
Jchemo.sampwsp
Jchemo.savgk
Jchemo.savgol
Jchemo.scale
Jchemo.segmkf
Jchemo.segmts
Jchemo.selwold
Jchemo.sep
Jchemo.snorm
Jchemo.snv
Jchemo.soft
Jchemo.softmax
Jchemo.soplsr
Jchemo.sourcedir
Jchemo.spca
Jchemo.splskdeda
Jchemo.splskern
Jchemo.splslda
Jchemo.splsqda
Jchemo.splsrda
Jchemo.ssq
Jchemo.ssr
Jchemo.stah
Jchemo.summ
Jchemo.svmda
Jchemo.svmr
Jchemo.tab
Jchemo.tabdf
Jchemo.tabdupl
Jchemo.transf
Jchemo.transfbl
Jchemo.treer_dt
Jchemo.vcatdf
Jchemo.vcol
Jchemo.vip
Jchemo.viperm
Jchemo.vrow
Jchemo.wdist
Jchemo.xfit
Jchemo.xresid
Base.summary — Method
summary(object::Cca, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Ccawold, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Comdim, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Fda)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Kpca)
Summarize the fitted model.
- object : The fitted model.
Base.summary — Method
summary(object::Mbpca, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Mbplsr, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Mbplswest, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Pca, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Pcr, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Plscan, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Plstuck, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Rasvd, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Spca, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Union{Plsr, Splsr}, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Jchemo.aggstat — Method
aggstat(X, y; fun = mean)
aggstat(X::DataFrame; vars, groups, fun = mean)
Compute column-wise statistics by class in a dataset.
- X : Data (n, p).
- y : A categorical variable (n) (class membership).
- fun : Function to compute (default = mean).
Specific for dataframes:
- vars : Vector of the names of the variables to summarize.
- groups : Vector of the names of the categorical variables to consider for computations by class.
Variables defined in vars and groups must be columns of X.
Return a matrix or, if only argument X::DataFrame is used, a dataframe.
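The by-class computation described above can be sketched in plain Julia, without Jchemo. This is only an illustration of the idea, not the package's implementation; the helper name stat_by_class and the toy data are hypothetical.

```julia
using Statistics

# Hypothetical helper: apply `fun` to each column of X within each class of y.
function stat_by_class(X::AbstractMatrix, y::AbstractVector; fun = mean)
    lev = sort(unique(y))                 # class levels
    res = zeros(length(lev), size(X, 2))  # one row per class, one column per variable
    for (i, l) in enumerate(lev)
        s = y .== l                       # rows belonging to class l
        res[i, :] = [fun(col) for col in eachcol(X[s, :])]
    end
    (X = res, lev = lev)
end

X = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
y = ["a", "b", "a", "b"]
res = stat_by_class(X, y; fun = sum)
res.X    # row 1 = class "a", row 2 = class "b"
```

Any function that maps a vector to a scalar (mean, sum, var, ...) can be passed as fun in this sketch.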
Examples
using DataFrames, Statistics
n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, :auto)
y = rand(1:3, n)
res = aggstat(X, y; fun = sum)
res.X
aggstat(df, y; fun = sum).X
n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, string.("v", 1:p))
df.gr1 = rand(1:2, n)
df.gr2 = rand(["a", "b", "c"], n)
df
aggstat(df; vars = [:v1, :v2], groups = [:gr1, :gr2], fun = var)
Jchemo.aggsum — Method
aggsum(x::Vector, y::Vector)
Compute sub-total sums by class of a categorical variable.
- x : A quantitative variable to sum (n).
- y : A categorical variable (n) (class membership).
Return a vector.
Examples
x = rand(1000)
y = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
aggsum(x, y)
Jchemo.aicplsr — Method
aicplsr(X, y; alpha = 2, kwargs...)
Compute Akaike's (AIC) and Mallows's (Cp) criteria for univariate PLSR models.
- X : X-data (n, p).
- y : Univariate Y-data.
Keyword arguments:
- Same arguments as those of function cglsr.
- alpha : Coefficient multiplying the model complexity (df) to compute AIC.
The function uses function dfplsr_cg.
References
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics 35, e3369. https://doi.org/10.1002/cem.3369
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 40
res = aicplsr(X, y; nlv) ;
res.crit
res.opt
res.delta
zaic = res.crit.aic
f, ax = plotgrid(0:nlv, zaic; xlabel = "Nb. LVs", ylabel = "AIC")
scatter!(ax, 0:nlv, zaic)
f
Jchemo.aov1 — Method
aov1(x, Y)
One-factor ANOVA test.
- x : Univariate categorical (factor) data (n).
- Y : Y-data (n, q).
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
x = dat.X[:, 5]
Y = dat.X[:, 1:4]
tab(x)
res = aov1(x, Y) ;
pnames(res)
res.SSF
res.SSR
res.F
res.pval
Jchemo.bias — Method
bias(pred, Y)
Compute the prediction bias, i.e. the opposite of the mean prediction error.
- pred : Predictions.
- Y : Observed data.
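As a rough illustration of this definition, the sketch below computes a column-wise bias in plain Julia. It assumes that the prediction error is defined as observed minus predicted, so that the bias is the mean of predicted minus observed; the helper name bias_sketch is hypothetical and this is not the package's code.

```julia
using Statistics

# Hypothetical helper: column-wise bias, assuming error = observed - predicted.
bias_sketch(pred, Y) = mean(pred .- Y; dims = 1)

Y = [1.0 2.0; 3.0 4.0]
pred = [1.5 2.0; 3.5 3.0]
bias_sketch(pred, Y)   # 1 x 2 matrix, one bias per response column
```

A positive value indicates that, under this convention, predictions are on average above the observed values.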
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
bias(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
bias(pred, ytest)
Jchemo.blockscal — Method
blockscal(Xbl; kwargs...)
blockscal(Xbl, weights::Weight; kwargs...)
Scale multiblock X-data.
- Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
- weights : Weights (n) of the observations (rows of the blocks). Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- bscal : Type of block scaling. Possible values are: :none, :frob, :mfa, :ncol, :sd. See below.
- centr : Boolean. If true, each column of blocks in Xbl is centered (before the block scaling).
- scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
Types of block scaling:
- :none : No block scaling.
- :frob : Let D be the diagonal matrix of vector weights.w. Each block X is divided by its Frobenius norm = sqrt(tr(X' * D * X)). After this scaling, tr(X' * D * X) = 1.
- :mfa : Each block X is divided by sv, where sv is the dominant singular value of X (this is the "MFA" approach).
- :ncol : Each block X is divided by the nb. of columns of the block.
- :sd : Each block X is divided by sqrt(sum(weighted variances of the block-columns)). After this scaling, sum(weighted variances of the block-columns) = 1.
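The :frob scaling can be checked numerically with a few lines of standard-library Julia. The sketch below is an illustration under the assumption that the weights sum to 1; the helper name frob_scale is hypothetical and the code does not call Jchemo.

```julia
using LinearAlgebra

# Illustration only: weighted Frobenius scaling of one block,
# assuming the weights w sum to 1 (hypothetical helper name).
function frob_scale(X::AbstractMatrix, w::AbstractVector)
    D = Diagonal(w)
    nrm = sqrt(tr(X' * D * X))   # weighted Frobenius norm
    X / nrm
end

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
w = fill(1 / 3, 3)               # uniform weights
Xs = frob_scale(X, w)
tr(Xs' * Diagonal(w) * Xs)       # close to 1 after scaling
```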
Examples
n = 5 ; m = 3 ; p = 10
X = rand(n, p)
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
Xblnew = mblock(Xnew, listbl)
@head Xbl[3]
centr = true ; scal = true
bscal = :frob
mod = model(blockscal; centr, scal, bscal)
fit!(mod, Xbl)
zXbl = transf(mod, Xbl) ;
@head zXbl[3]
zXblnew = transf(mod, Xblnew) ;
zXblnew[3]
Jchemo.calds — Method
calds(X1, X2; kwargs...)
Direct standardization (DS) for calibration transfer of spectral data.
- X1 : Spectra (n, p) to transfer to the target.
- X2 : Target spectra (n, p).
Keyword arguments:
- fun : Function used as transfer model.
- Other optional arguments for function fun.
X1 and X2 must represent the same n samples ("standards").
The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method DS fits a model (defined in fun) that predicts X2 from X1.
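To make the idea concrete, here is a minimal sketch of DS in which the transfer model is an ordinary least-squares fit without intercept, computed with the pseudo-inverse. The toy spectra and the distortion matrix F_true are made up for the illustration, and the code does not use Jchemo; in the package the transfer model is whatever is passed in fun.

```julia
using LinearAlgebra

# Illustration only: direct standardization with a least-squares
# transfer model (no intercept), fitted through the pseudo-inverse.
X2 = [1.0 2.0 3.0; 2.0 1.0 0.0; 0.0 1.0 1.0; 3.0 0.0 2.0]   # target spectra (n = 4, p = 3)
F_true = [1.1 0.0 0.0; 0.0 0.9 0.0; 0.0 0.0 1.2]            # hypothetical instrument distortion
X1 = X2 * inv(F_true)                                        # spectra to transfer

F = pinv(X1) * X2            # transfer matrix such that X1 * F approximates X2
X1_corrected = X1 * F
maximum(abs.(X1_corrected .- X2))   # close to 0 on this noise-free toy data
```

With real spectra p is usually much larger than n, which is why a regularized model such as PLSR is a more robust choice for fun than a plain least-squares fit.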
References
Wang, Y., Veltkamp, D.J., Kowalski, B.R., 1991. Multivariate Instrument Standardization. Anal. Chem. 63, 2750–2756. https://doi.org/10.1021/ac00023a016
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected
## on the same samples.
## X2 represents the target space.
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val
## Fitting the model
mod = model(calds; fun = plskern, nlv = 10)
#mod = model(calds; fun = mlrpinv) # less robust
fit!(mod, X1cal, X2cal)
## Transfer of new spectra X1val
## expected to be close to X2val
pred = predict(mod, X1val).pred
i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
Jchemo.calpds — Method
calpds(X1, X2; npoint = 5, fun = plskern, kwargs...)
Piecewise direct standardization (PDS) for calibration transfer of spectral data.
- X1 : Spectra (n, p) to transfer to the target.
- X2 : Target spectra (n, p).
Keyword arguments:
- npoint : Half-window size (nb. points left or right to the given wavelength).
- fun : Function used as transfer model.
- kwargs : Optional arguments for fun.
X1 and X2 must represent the same n standard samples.
The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method PDS fits models (defined in fun) that predict X2 from X1.
The window used in X1 to predict wavelength "i" in X2 is:
- i - npoint, i - npoint + 1, ..., i, ..., i + npoint - 1, i + npoint
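The window definition can be written as a one-line index range. In the sketch below, the helper name pds_window is hypothetical, and clipping the window to the valid columns 1:p at the edges of the spectrum is an assumption of this illustration (the text above does not say how edges are handled).

```julia
# Hypothetical helper: indices of the X1 window used to predict column i of X2.
# Clipping the window to 1:p at the spectrum edges is an assumption of this sketch.
pds_window(i, npoint, p) = max(1, i - npoint):min(p, i + npoint)

p = 10
pds_window(5, 2, p)   # 3:7
pds_window(1, 2, p)   # 1:3 (clipped on the left)
```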
References
Bouveresse, E., Massart, D.L., 1996. Improvement of the piecewise direct standardisation procedure for the transfer of NIR spectra for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 32, 201–213. https://doi.org/10.1016/0169-7439(95)00074-7
Wang, Y., Veltkamp, D.J., Kowalski, B.R., 1991. Multivariate Instrument Standardization. Anal. Chem. 63, 2750–2756. https://doi.org/10.1021/ac00023a016
Wülfert, F., Kok, W.Th., Noord, O.E. de, Smilde, A.K., 2000. Correction of Temperature-Induced Spectral Variation by Continuous Piecewise Direct Standardization. Anal. Chem. 72, 1639–1644. https://doi.org/10.1021/ac9906835
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
## Objects X1 and X2 are spectra collected
## on the same samples.
## X2 represents the target space.
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val
## Fitting the model
mod = model(calpds; npoint = 2, fun = plskern, nlv = 2)
fit!(mod, X1cal, X2cal)
## Transfer of new spectra X1val
## expected to be close to X2val
pred = predict(mod, X1val).pred
i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
Jchemo.cca — Method
cca(X, Y; kwargs...)
cca(X, Y, weights::Weight; kwargs...)
cca!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical correlation analysis (CCA, RCCA).
- X : First block of data.
- Y : Second block of data.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs = scores T) to compute.
- bscal : Type of block scaling. Possible values are: :none, :frob. See function blockscal.
- tau : Regularization parameter (∊ [0, 1]).
- scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function implements a CCA algorithm using SVD decompositions and presented in Weenink 2003 section 2.
A continuum regularization is available (parameter tau). After block centering and scaling, the function returns block scores (Tx and Ty) that are proportional to the eigenvectors of Projx * Projy and Projy * Projx, respectively, defined as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
- Cy = (1 - tau) * Y'DY + tau * Iy
- Cxy = X'DY
- Projx = sqrt(D) * X * invCx * X' * sqrt(D)
- Projy = sqrt(D) * Y * invCy * Y' * sqrt(D)
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
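The matrices defined above can be built directly with the LinearAlgebra standard library. The following sketch uses small made-up data and uniform weights, and computes eigenvalues with eigvals rather than the SVD-based algorithm the function is described as using; it only illustrates the definitions.

```julia
using LinearAlgebra, Statistics

n = 6
X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0; 5.0 6.0; 6.0 4.0]
Y = [1.0 0.5; 2.5 1.0; 2.0 3.0; 4.5 2.0; 5.0 4.0; 5.5 3.5]
X = X .- mean(X; dims = 1)          # block centering
Y = Y .- mean(Y; dims = 1)

tau = 1e-8
D = Diagonal(fill(1 / n, n))        # uniform observation weights
Cx = (1 - tau) * X' * D * X + tau * I
Cy = (1 - tau) * Y' * D * Y + tau * I
Projx = sqrt(D) * X * inv(Cx) * X' * sqrt(D)
Projy = sqrt(D) * Y * inv(Cy) * Y' * sqrt(D)

# Leading eigenvalues of Projx * Projy; with tau close to 0 they approximate
# the squared canonical correlations.
ev = sort(real.(eigvals(Projx * Projy)); rev = true)
ev[1:2]
```

Increasing tau shrinks Cx and Cy towards the identity, which stabilizes the inversions when the blocks are collinear or have more columns than rows.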
The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by functions rcc of the R packages CCA (González et al.) and mixOmics (Lê Cao et al.) with their parameters lambda1 and lambda2 set to:
- lambda1 = lambda2 = tau / (1 - tau) * n / (n - 1)
References
González, I., Déjean, S., Martin, P.G.P., Baccini, A., 2008. CCA: An R Package to Extend Canonical Correlation Analysis. Journal of Statistical Software 23, 1-14. https://doi.org/10.18637/jss.v023.i12
Hotelling, H., 1936. Relations between two sets of variates. Biometrika 28, 321–377.
Lê Cao, K.-A., Rohart, F., Gonzalez, I., Dejean, S., Abadi, A.J., Gautier, B., Bartolo, F., Monget, P., Coquery, J., Yao, F., Liquet, B., 2022. mixOmics: Omics Data Integration Project. https://doi.org/10.18129/B9.bioc.mixOmics
Weenink, D. 2003. Canonical Correlation Analysis, Institute of Phonetic Sciences, Univ. of Amsterdam, Proceedings 25, 81-99.
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 3
bscal = :frob ; tau = 1e-8
mod = model(cca; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)
@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx
@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty
res = summary(mod, X, Y) ;
pnames(res)
res.cort2t
res.rdx
res.rdy
res.corx2t
res.cory2t
Jchemo.ccawold — Method
ccawold(X, Y; kwargs...)
ccawold(X, Y, weights::Weight; kwargs...)
ccawold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical correlation analysis (CCA, RCCA) - Wold Nipals algorithm.
- X : First block of data.
- Y : Second block of data.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs = scores T) to compute.
- bscal : Type of block scaling. Possible values are: :none, :frob. See function blockscal.
- tau : Regularization parameter (∊ [0, 1]).
- tol : Tolerance value for convergence (Nipals).
- maxit : Maximum number of iterations (Nipals).
- scal : Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function implements the Nipals ccawold algorithm presented by Tenenhaus 1998 p.204 (related to Wold et al. 1984).
In this implementation, after each step of LVs computation, X and Y are deflated relatively to their respective scores (tx and ty).
A continuum regularization is available (parameter tau). After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
- Cy = (1 - tau) * Y'DY + tau * Iy
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rgcca of the R package RGCCA (Tenenhaus & Guillemot 2017, Tenenhaus et al. 2017).
References
Tenenhaus, A., Guillemot, V., 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data. https://cran.r-project.org/web/packages/RGCCA/index.html
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Tenenhaus, M., Tenenhaus, A., Groenen, P.J.F., 2017. Regularized Generalized Canonical Correlation Analysis: A Framework for Sequential Multiblock Component Methods. Psychometrika 82, 737–777. https://doi.org/10.1007/s11336-017-9573-x
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(ccawold; nlv, bscal, tau, tol = 1e-10)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)
@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx
@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty
res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t
res.rdx
res.rdy
res.corx2t
res.cory2t
Jchemo.center — Method
center(X)
center(X, weights::Weight)
Column-wise centering of X-data.
- X : X-data (n, p).
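A minimal sketch of what column-wise centering amounts to, using only the Statistics standard library and unweighted means (toy data, not the package's code). The point to note is that new data are centered with the column means of the training data, which is also what the check with colmean(Xtrain) in the example below relies on.

```julia
using Statistics

Xtrain = [1.0 10.0; 2.0 20.0; 3.0 30.0]
Xtest = [4.0 40.0]

xmeans = mean(Xtrain; dims = 1)   # 1 x p row of training column means
Xptrain = Xtrain .- xmeans        # centered training data
Xptest = Xtest .- xmeans          # new data centered with the training means
```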
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(center)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
@head Xptest
@head Xtest .- colmean(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.cglsr — Method
cglsr(X, y; kwargs...)
cglsr!(X::Matrix, y::Matrix; kwargs...)
Conjugate gradient algorithm for the normal equations (CGLS; Björck 1996).
- X : X-data (n, p).
- y : Univariate Y-data (n).
Keyword arguments:
- nlv : Nb. CG iterations.
- gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the normal equation residual vectors is done.
- filt : Boolean. If true, CG filter factors are computed (output F). Default = false.
- scal : Boolean. If true, each column of X and y is scaled by its uncorrected standard deviation (default = false).
X and y are internally centered.
CGLS algorithm "7.4.1" Björck 1996, p.289. The part of the code computing the re-orthogonalization (Hansen 1998) and filter factors (Vogel 1987, Hansen 1998) is a transcription (with few adaptations) of the Matlab function cgls (Saunders et al. https://web.stanford.edu/group/SOL/software/cgls/; Hansen 2008).
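For orientation, the core CGLS recursion can be sketched as follows. This is a textbook-style version written for this illustration, not the package's code: it leaves out the centering, scaling, Gram-Schmidt re-orthogonalization and filter factors mentioned above, and the helper name cgls_sketch is hypothetical.

```julia
using LinearAlgebra

# Illustration only: plain CGLS iterations for min ||y - X * b||,
# without centering, scaling, re-orthogonalization or filter factors.
function cgls_sketch(X::AbstractMatrix, y::AbstractVector, nlv::Int)
    b = zeros(size(X, 2))
    r = copy(y)              # residual y - X * b (b = 0 at start)
    s = X' * r               # residual of the normal equations
    p = copy(s)              # search direction
    g = dot(s, s)
    for _ in 1:nlv
        q = X * p
        alpha = g / dot(q, q)
        b .+= alpha .* p
        r .-= alpha .* q
        s = X' * r
        gnew = dot(s, s)
        p = s .+ (gnew / g) .* p
        g = gnew
    end
    b
end

X = [1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0]
y = [1.0, 3.0, 2.0, 5.0]
b = cgls_sketch(X, y, 2)     # 2 iterations for 2 columns
b_ls = X \ y                 # ordinary least-squares solution, for comparison
```

On this small full-rank problem the iterate after as many iterations as columns matches the least-squares solution up to rounding; stopping earlier gives a regularized solution, which is the role of nlv.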
References
Björck, A., 1996. Numerical Methods for Least Squares Problems, Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971484
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Manne, R., 1987. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 2, 187–197.
Phatak, A., De Hoog, F., 2002. Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. Journal of Chemometrics 16, 361–367.
Vogel, C.R., 1987. Solving ill-conditioned linear systems using the conjugate gradient method. Report, Dept. of Mathematical Sciences, Montana State University.
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 5 ; scal = true
mod = model(cglsr; nlv, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod.fm)
@head mod.fm.B
coef(mod.fm).B
coef(mod.fm).int
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
Jchemo.coef — Method
coef(object::Cglsr)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
Jchemo.coef — Method
coef(object::Dkplsr; nlv = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- nlv : Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.coef — Method
coef(object::Kplsr; nlv = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- nlv : Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.coef — Method
coef(object::Krr; lb = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- lb : Ridge regularization parameter "lambda".
Jchemo.coef — Method
coef(object::Rosaplsr; nlv = nothing)
Compute the X b-coefficients of a model fitted with nlv LVs.
- object : The fitted model.
- nlv : Nb. LVs to consider.
Jchemo.coef — Method
coef(object::Rr; lb = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- lb : Ridge regularization parameter "lambda".
Jchemo.coef — Method
coef(object::Mlr)
Compute the coefficients of the fitted model.
- object : The fitted model.
Jchemo.coef — Method
coef(object::Union{Plsr, Pcr, Splsr}; nlv = nothing)
Compute the b-coefficients of a LV model.
- object : The fitted model.
- nlv : Nb. LVs to consider.
For a model fitted from X(n, p) and Y(n, q), the returned object B is a matrix (p, q). If nlv = 0, B is a matrix of zeros. The returned object int is the intercept.
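The dimensions above can be illustrated with a short sketch of how such coefficients are typically combined into predictions. It assumes the usual linear form pred = int + X * B with int stored as a (1, q) row; that layout and all numeric values are assumptions made for the illustration.

```julia
# Illustration only: how B (p, q) and int combine into predictions,
# assuming the usual linear form pred = int + X * B with int of size (1, q).
B = [0.5 1.0; -1.0 2.0; 0.0 0.5]   # p = 3 variables, q = 2 responses
int = [1.0 -2.0]                   # hypothetical intercepts
X = [1.0 2.0 3.0; 0.0 1.0 0.0]     # n = 2 new observations

pred = int .+ X * B                # (n, q) predictions
```

Under this form, a B of zeros (the nlv = 0 case) makes every prediction equal to the intercept.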
Jchemo.colmad — Method
colmad(X)
Compute column-wise median absolute deviations (MAD) of a matrix.
- X : Data (n, p).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
colmad(X)
Jchemo.colmean — Method
colmean(X)
colmean(X, weights::Weight)
Compute column-wise means of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colmean(X)
colmean(X, w)
Jchemo.colmed — Method
colmed(X)
Compute column-wise medians of a matrix.
- X : Data (n, p).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
colmed(X)
Jchemo.colnorm — Method
colnorm(X)
colnorm(X, weights::Weight)
Compute column-wise norms of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
The norm computed for a column x of X is:
- sqrt(x' * x)
The weighted norm is:
- sqrt(x' * D * x), where D is the diagonal matrix of weights.w.
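Both formulas can be reproduced with the LinearAlgebra standard library. The sketch below uses a plain weight vector w that sums to 1 in place of a Weight object; it illustrates the formulas and does not call Jchemo.

```julia
using LinearAlgebra

X = [3.0 1.0; 4.0 1.0]
w = [0.5, 0.5]                       # weights summing to 1
D = Diagonal(w)

norms = [sqrt(dot(x, x)) for x in eachcol(X)]        # sqrt(x' * x)
wnorms = [sqrt(dot(x, D * x)) for x in eachcol(X)]   # sqrt(x' * D * x)
```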
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colnorm(X)
colnorm(X, w)
Jchemo.colstd — Method
colstd(X)
colstd(X, weights::Weight)
Compute column-wise standard deviations (uncorrected) of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
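"Uncorrected" is read here as dividing by n rather than n - 1; this reading is an assumption of the sketch. The difference with the corrected estimator, which is the default of Julia's Statistics.std, can be seen on a small vector:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]
s_uncorrected = std(x; corrected = false)   # divides by n
s_corrected = std(x)                        # Julia default: divides by n - 1
```

The two values differ by the factor sqrt((n - 1) / n), which becomes negligible for large n.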
Examples
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colstd(X)
colstd(X, w)
Jchemo.colsum — Method
colsum(X)
colsum(X, weights::Weight)
Compute column-wise sums of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colsum(X)
colsum(X, w)
Jchemo.colvar — Method
colvar(X)
colvar(X, weights::Weight)
Compute column-wise variances (uncorrected) of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colvar(X)
colvar(X, w)
Jchemo.comdim — Method
comdim(Xbl; kwargs...)
comdim(Xbl, weights::Weight; kwargs...)
comdim!(Xbl::Matrix, weights::Weight; kwargs...)
Common components and specific weights analysis (ComDim, aka CCSWA).
- Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs = scores T) to compute.
- bscal : Type of block scaling. See function blockscal for possible values.
- tol : Tolerance value for convergence (Nipals).
- maxit : Maximum number of iterations (Nipals).
- scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
"SVD" algorithm of Hanafi & Qannari 2008 p.84.
The function returns several objects, in particular:
- T : The non normed global scores.
- U : The normed global scores.
- W : The global loadings.
- Tbl : The block scores (grouped by blocks, in the original scale).
- Tb : The block scores (grouped by LV, in the metric scale).
- Wbl : The block loadings.
- lb : The specific weights (saliences) "lambda".
- mu : The sum of the squared saliences.
Function summary returns:
- explvarx : Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
- explvarxx : Proportion of the XX' total inertia (sum of the squared norms of the products Xk * Xk') explained by each global score (= indicator "V" in Qannari et al. 2000, Hanafi et al. 2008).
- sal2 : Proportion of the squared saliences of each block within each global score.
- contr_block : Contribution of each block to the global scores (= proportions of the saliences "lambda" within each score).
- explX : Proportion of the inertia of the blocks
- corx2t : Correlation between the global scores and the original variables.
- cortb2t : Correlation between the global scores and the block scores.
- rv : RV coefficient.
- lg : Lg coefficient.
References
Cariou, V., Qannari, E.M., Rutledge, D.N., Vigneau, E., 2018. ComDim: From multiblock data analysis to path modeling. Food Quality and Preference, Sensometrics 2016: Sensometrics-by-the-Sea 67, 27–34. https://doi.org/10.1016/j.foodqual.2017.02.012
Cariou, V., Jouan-Rimbaud Bouveresse, D., Qannari, E.M., Rutledge, D.N., 2019. Chapter 7 - ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective, in: Cocchi, M. (Ed.), Data Handling in Science and Technology, Data Fusion Methodology and Applications. Elsevier, pp. 179–204. https://doi.org/10.1016/B978-0-444-63984-4.00007-7
Ghaziri, A.E., Cariou, V., Rutledge, D.N., Qannari, E.M., 2016. Analysis of multiblock datasets using ComDim: Overview and extension to the analysis of (K + 1) datasets. Journal of Chemometrics 30, 420–429. https://doi.org/10.1002/cem.2810
Hanafi, M., Qannari, E.M., 2008. Nouvelles propriétés de l’analyse en composantes communes et poids spécifiques. Journal de la société française de statistique 149, 75–97.
Qannari, E.M., Wakeling, I., Courcoux, P., MacFie, H.J.H., 2000. Defining the underlying sensory dimensions. Food Quality and Preference 11, 151–154. https://doi.org/10.1016/S0950-3293(99)00069-5
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1])
nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(comdim; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod)
pnames(mod.fm)
## Global scores
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]
res = summary(mod, Xbl) ;
pnames(res)
res.explvarx
res.explvarxx
res.sal2
res.contr_block
res.explX # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t
res.cortb2t
res.rv
Jchemo.conf
— Method
conf(pred, y; digits = 1)
Confusion matrix.
- pred : Univariate predictions.
- y : Univariate observed data.
Keyword arguments:
- digits : Nb. digits used to round percentages.
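As an illustration of what a confusion matrix contains, here is a minimal sketch in plain Julia. It is not the Jchemo implementation: the helper name `confusion_sketch` is hypothetical, and the layout (observed classes in rows, predicted classes in columns) is an assumption chosen to match the "row %" output mentioned in the examples.

```julia
# Hypothetical helper (not Jchemo's code): counts and row percentages,
# assuming rows = observed classes and columns = predicted classes.
function confusion_sketch(pred, y; digits = 1)
    lev = sort(unique(vcat(y, pred)))                   # class levels
    idx = Dict(l => i for (i, l) in enumerate(lev))     # level => position
    A = zeros(Int, length(lev), length(lev))
    for (obs, prd) in zip(y, pred)
        A[idx[obs], idx[prd]] += 1                      # one count per (observed, predicted) pair
    end
    rowtot = sum(A, dims = 2)
    Apct = round.(100 .* A ./ max.(rowtot, 1); digits = digits)   # row percentages
    accpct = round(100 * sum(A[i, i] for i in 1:length(lev)) / length(y); digits = digits)
    (; lev, A, Apct, accpct)
end

y = ["a", "a", "b", "b", "b"]
pred = ["a", "b", "b", "b", "a"]
res = confusion_sketch(pred, y)
res.A        # counts
res.Apct     # row %
res.accpct   # overall accuracy (%)
```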
Examples
using CairoMakie
y = ["d"; "c"; "b"; "c"; "a"; "d"; "b"; "d";
"b"; "b"; "a"; "a"; "c"; "d"; "d"]
pred = ["a"; "d"; "b"; "d"; "b"; "d"; "b"; "d";
"b"; "b"; "a"; "a"; "d"; "d"; "d"]
#y = rand(1:10, 200); pred = rand(1:10, 200)
res = conf(pred, y) ;
pnames(res)
res.cnt # Counts (dataframe built from `A`)
res.pct # Row % (dataframe built from `Apct`)
res.A
res.Apct
res.diagpct
res.accpct # Accuracy (% classification successes)
res.lev # Levels
plotconf(res).f
plotconf(res; cnt = false, ptext = false).f
Jchemo.cor2
— Method
cor2(pred, Y)
Compute the squared linear correlation between data and predictions.
- pred : Predictions.
- Y : Observed data.
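The quantity can be sketched with the standard library alone. The helper `cor2_sketch` below is hypothetical (not Jchemo's code) and assumes that, for multivariate Y, the squared Pearson correlation is computed column by column.

```julia
using Statistics

# Hypothetical helpers: squared Pearson correlation between predictions
# and observations (column-wise for matrices; this is an assumption).
cor2_sketch(pred::AbstractVector, y::AbstractVector) = cor(pred, y)^2
cor2_sketch(pred::AbstractMatrix, Y::AbstractMatrix) =
    [cor(pred[:, j], Y[:, j])^2 for j in axes(Y, 2)]

y = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
r2 = cor2_sketch(pred, y)
r2_linear = cor2_sketch(2 .* y .+ 1, y)   # exact linear relation between pred and y
```

Note that a squared correlation is insensitive to bias and slope: the second call returns 1 although the predictions are far from the observations.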
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
cor2(pred, ytest)
Jchemo.corm
— Method
corm(X, weights::Weight)
corm(X, Y, weights::Weight)
Compute a weighted correlation matrix.
- X : Data (n, p).
- Y : Data (n, q).
- weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
The function computes the uncorrected weighted correlation matrix:
- of the columns of X : ==> (p, p) matrix
- or between columns of X and Y : ==> (p, q) matrix.
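A minimal sketch of a weighted, uncorrected correlation matrix is given below. The helper `corm_sketch` is hypothetical and only assumes that the weights are normalized to sum to 1 (so that the divisor is n rather than n - 1 when the weights are uniform); it does not reproduce the Jchemo code.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: weighted "uncorrected" correlation matrix of the columns of X.
function corm_sketch(X::AbstractMatrix, w::AbstractVector)
    w = w ./ sum(w)                 # normalize the weights to sum to 1
    xmeans = X' * w                 # weighted column means (p)
    Xc = X .- xmeans'               # centered data (n, p)
    C = Xc' * (w .* Xc)             # weighted covariance matrix (p, p)
    s = sqrt.(diag(C))              # weighted standard deviations (p)
    C ./ (s * s')
end

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]
R = corm_sketch(X, ones(4))         # uniform weights
```

With uniform weights the result coincides with the usual correlation matrix, since the divisor (n or n - 1) cancels out in a correlation.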
Examples
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))
corm(X, w)
corm(X, Y, w)
Jchemo.cosm
— Method
cosm(X)
cosm(X, Y)
Compute a cosine matrix.
- X : Data (n, p).
- Y : Data (n, q).
The function computes the cosine matrix:
- of the columns of X : ==> (p, p) matrix
- or between columns of X and Y : ==> (p, q) matrix.
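The cosine between two columns x and y is x'y / (||x|| ||y||). A minimal sketch of the matrix version follows; `cosm_sketch` is a hypothetical helper name, not the Jchemo function.

```julia
using LinearAlgebra

# Hypothetical helper: cosines between the columns of X and the columns of Y.
function cosm_sketch(X::AbstractMatrix, Y::AbstractMatrix = X)
    xnorms = [norm(c) for c in eachcol(X)]    # Euclidean norms of the X-columns
    ynorms = [norm(c) for c in eachcol(Y)]    # Euclidean norms of the Y-columns
    (X' * Y) ./ (xnorms * ynorms')            # (p, q) matrix
end

X = [1.0 0.0; 0.0 2.0; 0.0 0.0]
Y = reshape([1.0, 1.0, 0.0], 3, 1)
C = cosm_sketch(X)         # (2, 2): the two columns of X are orthogonal
CXY = cosm_sketch(X, Y)    # (2, 1)
```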
Examples
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
cosm(X)
cosm(X, Y)
Jchemo.cosv
— Method
cosv(x, y)
Compute the cosine between two vectors.
- x : vector (n).
- y : vector (n).
Examples
n = 5
x = rand(n)
y = rand(n)
cosv(x, y)
Jchemo.covm
— Method
covm(X, weights::Weight)
covm(X, Y, weights::Weight)
Compute a weighted covariance matrix.
- X : Data (n, p).
- Y : Data (n, q).
- weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
The function computes the uncorrected weighted covariance matrix:
- of the columns of X : ==> (p, p) matrix
- or between columns of X and Y : ==> (p, q) matrix.
Examples
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))
covm(X, w)
covm(X, Y, w)
Jchemo.cscale
— Method
cscale()
cscale(X)
cscale(X, weights::Weight)
Column-wise centering and scaling of X-data.
- X : X-data (n, p).
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(cscale)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colmean(Xptrain)
colstd(Xptrain)
@head Xptest
@head (Xtest .- colmean(Xtrain)') ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.detrend
— Method
detrend(X; kwargs...)
De-trend transformation of each row of X-data.
- X : X-data (n, p).
Keyword arguments:
- degree : Polynomial degree.
The function fits a polynomial regression to each observation and returns the residuals.
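The idea can be sketched as an ordinary least-squares fit per row. The helper `detrend_sketch` below is hypothetical; it assumes that the regressor is the column index 1, ..., p and that a plain (non-orthogonal) polynomial basis is used, which may differ from what the package does internally.

```julia
using LinearAlgebra

# Hypothetical helper: polynomial de-trending of each row of X.
function detrend_sketch(X::AbstractMatrix; degree = 1)
    p = size(X, 2)
    t = collect(1.0:p)                          # regressor = column index (assumption)
    V = hcat([t .^ d for d in 0:degree]...)     # polynomial basis (p, degree + 1)
    B = V \ Matrix(X')                          # least-squares coefficients, one column per row of X
    X .- (V * B)'                               # residuals (n, p)
end

t = 1.0:6
# Two rows that are exact polynomials of degree <= 2
X = vcat((2 .+ 0.5 .* t .+ 0.1 .* t .^ 2)', (1 .- t)')
R = detrend_sketch(X; degree = 2)
```

Since both rows are exact polynomials of degree 2 or less, the residuals are zero up to rounding errors.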
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(detrend; degree = 2)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain, wl).f
plotsp(Xptest, wl).f
Jchemo.dfplsr_cg
— Method
dfplsr_cg(X, y; kwargs...)
Compute the model complexity (df) of PLSR models with the CGLS algorithm.
- X : X-data (n, p).
- y : Univariate Y-data.
Keyword arguments:
- Same as function cglsr.
The number of degrees of freedom (df) of the PLSR model is returned for 0, 1, ..., nlv LVs.
References
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369
Examples
## The example below reproduces the numerical illustration
## given by Kramer & Sugiyama 2011 on the Ozone data
## (Fig. 1, center).
## Function "pls.model" used for df calculations
## in the R package "plsdof" v0.2-9 (Kramer & Braun 2019)
## automatically scales the X matrix before PLS.
## The example scales X for consistency with plsdof.
using JchemoData, JLD2, DataFrames, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ozone.jld2")
@load db dat
pnames(dat)
X = dat.X
dropmissing!(X)
zX = rmcol(Matrix(X), 4)
y = X[:, 4]
## For consistency with plsdof
xstds = colstd(zX)
zXs = fscale(zX, xstds)
## End
nlv = 12 ; gs = true
res = dfplsr_cg(zXs, y; nlv, gs) ;
res.df
df_kramer = [1.000000, 3.712373, 6.456417, 11.633565,
12.156760, 11.715101, 12.349716,
12.192682, 13.000000, 13.000000,
13.000000, 13.000000, 13.000000]
f, ax = plotgrid(0:nlv, df_kramer; step = 2, xlabel = "Nb. LVs", ylabel = "df")
scatter!(ax, 0:nlv, res.df; color = "red")
ablines!(ax, 1, 1; color = :grey, linestyle = :dot)
f
Jchemo.difmean
— Method
difmean(X1, X2; normx::Bool = false)
Compute a 1-D detrimental matrix by difference of the column-means of two X-datasets.
- X1 : Spectra (n1, p).
- X2 : Spectra (n2, p).
Keyword arguments:
- normx : Boolean. If true, the column-means vectors of X1 and X2 are normed before computing their difference.
The function returns a matrix D (1, p) computed by the difference between two mean-spectra, i.e. the column-means of X1 and X2.
D is assumed to contain the detrimental information that can be removed (by orthogonalization) from X1 and X2 for calibration transfer. For instance, D can be used as input of function eposvd.
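The computation described above can be sketched as follows. The helper `difmean_sketch` is hypothetical; the sign convention (mean of X1 minus mean of X2) and the use of the Euclidean norm when normx = true are assumptions.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: difference between two mean spectra, returned as a (1, p) matrix.
function difmean_sketch(X1::AbstractMatrix, X2::AbstractMatrix; normx = false)
    m1 = vec(mean(X1, dims = 1))     # column means of X1 (p)
    m2 = vec(mean(X2, dims = 1))     # column means of X2 (p)
    if normx
        m1 = m1 / norm(m1)           # norm each mean vector before the difference
        m2 = m2 / norm(m2)
    end
    D = reshape(m1 - m2, 1, :)       # (1, p)
    (; D)
end

X1 = [1.0 2.0; 3.0 4.0]
X2 = [0.0 1.0; 2.0 1.0]
D = difmean_sketch(X1, X2).D
Dn = difmean_sketch(X1, X2; normx = true).D
```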
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val
## The objective is to remove a detrimental
## information (here, D) from spaces X1 and X2
D = difmean(X1cal, X2cal).D
res = eposvd(D; nlv = 1)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M
i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
Jchemo.dkplskdeda
— Method
dkplskdeda(X, y; kwargs...)
dkplskdeda(X, y, weights::Weight; kwargs...)
DKPLS-KDEDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
- alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plskdeda (PLS-KDEDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function dkplslda for examples.
Jchemo.dkplslda
— Method
dkplslda(X, y; kwargs...)
dkplslda(X, y, weights::Weight; kwargs...)
DKPLS-LDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plslda (PLS-LDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
gamma = .1
mod = model(dkplslda; nlv, gamma)
#mod = model(dkplslda; nlv, gamma, prior = :prop)
#mod = model(dkplsqda; nlv, gamma, alpha = .5)
#mod = model(dkplskdeda; nlv, gamma, a_kde = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fmpls)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
Jchemo.dkplsqda
— Method
dkplsqda(X, y; kwargs...)
dkplsqda(X, y, weights::Weight; kwargs...)
DKPLS-QDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsqda (PLS-QDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function dkplslda for examples.
Jchemo.dkplsr
— Method
dkplsr(X, Y; kwargs...)
dkplsr(X, Y, weights::Weight; kwargs...)
dkplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Direct kernel partial least squares regression (DKPLSR) (Bennett & Embrechts 2003).
- X : X-data (n, p).
- Y : Y-data (n, q).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to consider.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
The method builds kernel Gram matrices and then runs a usual PLSR algorithm on them. This is faster than (but not equivalent to) the "true" Nipals KPLSR algorithm (function kplsr) described in Rosipal & Trejo (2001).
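The "Gram matrix" step can be illustrated in isolation. The sketch below builds Gaussian RBF Gram matrices with a hypothetical helper `krbf_sketch`; the kernel form exp(-gamma * ||x - y||^2) is an assumption about the parameterization, and the PLSR step itself is not reproduced here.

```julia
# Hypothetical helper: Gaussian RBF Gram matrix between the rows of X and the rows of Y,
# assuming K[i, j] = exp(-gamma * ||x_i - y_j||^2).
function krbf_sketch(X::AbstractMatrix, Y::AbstractMatrix; gamma = 1.0)
    sq = [sum(abs2, X[i, :] .- Y[j, :]) for i in axes(X, 1), j in axes(Y, 1)]
    exp.(-gamma .* sq)
end

Xtrain = [0.0 0.0; 1.0 0.0; 0.0 2.0]
Xnew = [1.0 1.0; 0.0 0.0]
Ktrain = krbf_sketch(Xtrain, Xtrain; gamma = 0.5)   # (n, n): takes the place of Xtrain in the PLSR fit
Knew = krbf_sketch(Xnew, Xtrain; gamma = 0.5)       # (m, n): takes the place of Xnew at prediction time
```

In a direct kernel approach, the usual PLSR is then fitted on {Ktrain, Y}, and new observations are predicted from Knew.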
References
Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.
Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
kern = :krbf ; gamma = 1e-1 ; scal = false
#gamma = 1e-4 ; scal = true
mod = model(dkplsr; nlv, kern, gamma, scal) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
nlv = 2
gamma = 1 / 3
mod = model(dkplsr; nlv, gamma) ;
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.dkplsrda
— Method
dkplsrda(X, y; kwargs...)
dkplsrda(X, y, weights::Weight; kwargs...)
Discrimination based on direct kernel partial least squares regression (DKPLSR-DA).
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsrda (PLSR-DA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
kern = :krbf ; gamma = .001
scal = true
mod = model(dkplsrda; nlv, kern, gamma, scal)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
Jchemo.dmkern
— Method
dmkern(X; kwargs...)
Gaussian kernel density estimation (KDE).
- X : X-data (n, p).
Keyword arguments:
- h_kde : Defines the bandwidth, see examples.
- a_kde : Constant for Scott's rule (default bandwidth), see below.
Estimation of the probability density of X (column space) by non-parametric Gaussian kernels.
Data X can be univariate (p = 1) or multivariate (p > 1). In the latter case, function dmkern computes a multiplicative kernel such as in Scott & Sain 2005 Eq.19, and the internal bandwidth matrix H is diagonal (see the code).
Note: H in the dmkern code is often noted "H^(1/2)" in the literature (e.g. Wikipedia).
The default bandwidth is computed by:
- h_kde = a_kde * n^(-1 / (p + 4)) * colstd(X)
(a_kde = 1 in Scott & Sain 2005).
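The default bandwidth and the multiplicative Gaussian kernel can be sketched as below. The helper `kde_sketch` is hypothetical and is not the dmkern code; in particular, the use of the uncorrected standard deviation for colstd is an assumption.

```julia
using Statistics

# Hypothetical helper: multiplicative Gaussian KDE with a diagonal bandwidth
# given by Scott's rule, h = a_kde * n^(-1 / (p + 4)) * colstd(X).
function kde_sketch(Xtrain::AbstractMatrix, Xnew::AbstractMatrix; a_kde = 1.0)
    n, p = size(Xtrain)
    s = vec(std(Xtrain, dims = 1; corrected = false))    # column standard deviations (assumed uncorrected)
    h = a_kde * n^(-1 / (p + 4)) .* s                    # one bandwidth per column
    dens = zeros(size(Xnew, 1))
    for i in axes(Xnew, 1), k in 1:n
        u = (Xnew[i, :] .- Xtrain[k, :]) ./ h
        dens[i] += prod(exp.(-0.5 .* u .^ 2) ./ (sqrt(2pi) .* h))   # product of univariate kernels
    end
    (pred = dens ./ n, h = h)
end

x = reshape([0.0, 1.0, 2.0, 4.0], :, 1)                  # univariate data (n = 4, p = 1)
grid = reshape(collect(-10.0:0.01:14.0), :, 1)
res = kde_sketch(x, grid)
area = sum(res.pred) * 0.01                              # Riemann sum of the estimated density
```

The estimated density integrates to (approximately) 1 over a grid that is wide enough, which is a quick sanity check for any KDE implementation.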
References
Scott, D.W., Sain, S.R., 2005. 9 - Multidimensional Density Estimation, in: Rao, C.R., Wegman, E.J., Solka, J.L. (Eds.), Handbook of Statistics, Data Mining and Data Visualization. Elsevier, pp. 229–261. https://doi.org/10.1016/S0169-7161(04)24009-3
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
pnames(dat)
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
p = nco(T)
#### Probability density in the FDA
#### score space (2D)
mod = model(dmkern)
fit!(mod, T)
pnames(mod.fm)
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred
h_kde = .3
mod = model(dmkern; h_kde)
fit!(mod, T)
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred
h_kde = [.3; .1]
mod = model(dmkern; h_kde)
fit!(mod, T)
mod.fm.H
u = [1; 4; 150]
predict(mod, T[u, :]).pred
## Bivariate distribution
npoints = 2^7
nlv = 2
lims = [(minimum(T[:, j]), maximum(T[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
m = nro(grid)
mod = model(dmkern)
#mod = model(dmkern; a_kde = .5)
#mod = model(dmkern; h_kde = .3)
fit!(mod, T)
res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1]; title = "Density for FDA scores (Iris)", xlabel = "Score 1",
ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
#xlims!(ax, -15, 15) ;ylims!(ax, -15, 15)
f
## Univariate distribution
x = T[:, 1]
mod = model(dmkern)
#mod = model(dmkern; a_kde = .5)
#mod = model(dmkern; h_kde = .3)
fit!(mod, x)
pred = predict(mod, x).pred
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
scatter!(ax, x, vec(pred); color = :red)
f
x = T[:, 1]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmkern)
#mod = model(dmkern; a_kde = .5)
#mod = model(dmkern; h_kde = .3)
fit!(mod, x)
pred_grid = predict(mod, grid).pred
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
Jchemo.dmnorm
— Function
dmnorm(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnorm!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
Normal probability density estimation.
- X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.
Keyword arguments:
- mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
- S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
- simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.
Data X can be univariate (p = 1) or multivariate (p > 1). See examples.
When simpl = true, the determinant of the covariance matrix (object detS) and the constant (2 * pi)^(-p / 2) (object cst) in the density formula are set to 1. The function then returns a pseudo-density that reduces to exp(-d / 2), where d is the squared Mahalanobis distance to the center. This can for instance be useful when the number of columns (p) of X becomes too large and when consequently:
- detS tends to 0 or, conversely, to infinity
- cst tends to 0
which makes it impossible to compute the true density.
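The two variants of the density can be written in a few lines. The helper `dmnorm_sketch` below is hypothetical (it inverts S directly, which the package may not do) and only illustrates the standard formula cst * detS^(-1/2) * exp(-d / 2) and its simplified form.

```julia
using LinearAlgebra

# Hypothetical helper: multivariate normal density at the rows of X,
# or the pseudo-density exp(-d / 2) when simpl = true.
function dmnorm_sketch(X::AbstractMatrix, mu::AbstractVector, S::AbstractMatrix; simpl = false)
    p = length(mu)
    Sinv = inv(S)
    # Squared Mahalanobis distances of the rows of X to the center mu
    d = [dot(X[i, :] .- mu, Sinv * (X[i, :] .- mu)) for i in axes(X, 1)]
    simpl && return exp.(-d ./ 2)
    cst = (2 * pi)^(-p / 2)
    cst / sqrt(det(S)) .* exp.(-d ./ 2)
end

mu = [0.0, 0.0]
S = [1.0 0.0; 0.0 4.0]
X = [0.0 0.0; 1.0 2.0]
dens = dmnorm_sketch(X, mu, S)                    # true density
pseudo = dmnorm_sketch(X, mu, S; simpl = true)    # constant and determinant set to 1
```

Both versions rank the observations in the same order for a given (mu, S); only the scale differs.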
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
pnames(dat)
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
mod0 = model(fda; nlv = 2)
fit!(mod0, X, y)
@head T = mod0.fm.T
n, p = size(T)
#### Probability density in the FDA score space (2D)
#### Example of class Setosa
s = y .== "setosa"
zT = T[s, :]
## Bivariate distribution
mod = model(dmnorm)
fit!(mod, zT)
fm = mod.fm
pnames(fm)
fm.Uinv
fm.detS
pred = predict(mod, zT).pred
@head pred
mu = colmean(zT)
S = covm(zT, mweight(ones(nro(zT))))
## Direct syntax
dmnorm(; mu = mu, S = S).Uinv
dmnorm(; mu = mu, S = S).detS
npoints = 2^7
lims = [(minimum(zT[:, j]), maximum(zT[:, j])) for j = 1:p]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
mod = model(dmnorm)
fit!(mod, zT)
res = predict(mod, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1]; title = "Density for FDA scores (Iris - Setosa)",
xlabel = "Score 1", ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
scatter!(ax, zT[:, 1], zT[:, 2], color = :blue, markersize = 5)
#xlims!(ax, -12, 12) ;ylims!(ax, -12, 12)
f
## Univariate distribution
j = 1
x = zT[:, j]
mod = model(dmnorm)
fit!(mod, x)
pred = predict(mod, x).pred
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
scatter!(ax, x, vec(pred); color = :red)
f
x = zT[:, j]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
mod = model(dmnorm)
fit!(mod, x)
pred_grid = predict(mod, grid).pred
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
Jchemo.dmnormlog
— Function
dmnormlog(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
dmnormlog!(X = nothing; mu = nothing, S = nothing, simpl::Bool = false)
Logarithm of the normal probability density estimation.
- X : X-data (n, p) used to estimate the mean and the covariance matrix. If nothing, mu and S must be provided.
Keyword arguments:
- mu : Mean vector of the normal distribution. If nothing, mu is computed by the column-means of X.
- S : Covariance matrix of the normal distribution. If nothing, S is computed by cov(X; corrected = true).
- simpl : Boolean. If true, the constant term and the determinant in the density formula are set to 1.
See the help of function dmnorm.
Examples
using JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
pnames(dat)
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
## Example of class Setosa
s = y .== "setosa"
zX = X[s, :]
mod = model(dmnormlog)
fit!(mod, zX)
fm = mod.fm
pnames(fm)
fm.Uinv
fm.logdetS
pred = predict(mod, zX).pred
@head pred
mod0 = model(dmnorm)
fit!(mod0, zX)
pred0 = predict(mod0, zX).pred
@head log.(pred0)
Jchemo.dummy
— Function
dummy(y, T = Float64)
Compute a dummy table from a categorical variable.
- y : A categorical variable.
- T : Type of the output dummy table Y.
Examples
y = ["d", "a", "b", "c", "b", "c"]
#y = rand(1:3, 7)
res = dummy(y)
pnames(res)
res.Y
Jchemo.dupl
— Method
dupl(X; digits = 3)
Find duplicated rows in a dataset.
- X : A dataset.
- digits : Nb. digits used to round X before checking.
Examples
X = rand(5, 3)
Z = vcat(X, X[1:3, :], X[1:1, :])
dupl(X)
dupl(Z)
M = hcat(X, fill(missing, 5))
Z = vcat(M, M[1:3, :])
dupl(M)
dupl(Z)
Jchemo.ensure_df
— Method
ensure_df(X)
Reshape X to a dataframe if necessary.
Jchemo.ensure_mat
— Method
ensure_mat(X)
Reshape X to a matrix if necessary.
Jchemo.eposvd
— Method
eposvd(D; nlv = 1)
Compute an orthogonalization matrix for calibration transfer of spectral data.
- D : Data (m, p) containing the detrimental information on which spectra (rows of a matrix X) have to be orthogonalized.
Keyword arguments:
- nlv : Nb. of first loadings vectors of D considered for the orthogonalization.
The objective is to remove some detrimental information (e.g. humidity patterns in signals, multiple spectrometers, etc.) from an X-dataset (n, p). The detrimental information is defined by the main row-directions computed from a matrix D (m, p).
Function eposvd returns two objects:
- P (p, nlv) : The matrix of the nlv first loading vectors of the SVD decomposition (non centered PCA) of D.
- M (p, p) : The orthogonalization matrix, used to orthogonalize a given matrix X to directions contained in P.
Any matrix X can then be corrected from D by:
- X_corrected = X * M.
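A minimal sketch of this orthogonalization is given below. The helper `eposvd_sketch` is hypothetical; it assumes that M is the orthogonal projector I - P * P' onto the complement of the space spanned by the first nlv right singular vectors of D, which is the usual construction for this type of correction.

```julia
using LinearAlgebra

# Hypothetical helper: orthogonalization matrix built from the SVD of D.
function eposvd_sketch(D::AbstractMatrix; nlv = 1)
    P = svd(D).V[:, 1:nlv]       # first nlv loading vectors of D (p, nlv)
    M = I - P * P'               # projector orthogonal to the columns of P (p, p)
    (; M, P)
end

D = [1.0 0.0 0.0; 2.0 0.0 0.0]   # the detrimental direction is the first axis
res = eposvd_sketch(D; nlv = 1)
X = [3.0 1.0 2.0; -1.0 4.0 0.5]
Xc = X * res.M                   # corrected X: the first coordinate is removed
```

Since M is a projector, applying the correction twice gives the same result as applying it once.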
Matrix D can be built by many methods. For instance, two common methods are:
- EPO (Roger et al. 2003, 2018): D is built from a set of differences between spectra collected under different conditions.
- TOP (Andrew & Fearn 2004): Each row of D is the mean spectrum computed for a given spectrometer instrument.
A particular situation is the following. Assume that D is built from some differences between matrices X1 and X2, and that a bilinear model (e.g. PLSR) is fitted on the data {X1corrected, Y} where X1corrected = X1 * M. To predict new data X2new with the fitted model, there is no need to correct X2new.
References
Andrew, A., Fearn, T., 2004. Transfer by orthogonal projection: making near-infrared calibrations robust to between-instrument variation. Chemometrics and Intelligent Laboratory Systems 72, 51–56. https://doi.org/10.1016/j.chemolab.2004.02.004
Roger, J.-M., Chauchard, F., Bellon-Maurel, V., 2003. EPO-PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits. Chemometrics and Intelligent Laboratory Systems 66, 191-204. https://doi.org/10.1016/S0169-7439(03)00051-0
Roger, J.-M., Boulet, J.-C., 2018. A review of orthogonal projections for calibration. Journal of Chemometrics 32, e3045. https://doi.org/10.1002/cem.3045
Zeaiter, M., Roger, J.M., Bellon-Maurel, V., 2006. Dynamic orthogonal projection. A new method to maintain the on-line robustness of multivariate calibrations. Application to NIR-based monitoring of wine fermentations. Chemometrics and Intelligent Laboratory Systems, 80, 227–235. https://doi.org/10.1016/j.chemolab.2005.06.011
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
pnames(dat)
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val
## The objective is to remove a detrimental
## information (here, D) from spaces X1 and X2
D = X1cal - X2cal
nlv = 2
res = eposvd(D; nlv)
res.M # orthogonalization matrix
res.P # detrimental directions (columns of matrix P = loadings of D)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M
i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
Jchemo.errp
— Method
errp(pred, y)
Compute the classification error rate (ERRP).
- pred : Predictions.
- y : Observed data (class membership).
Examples
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
errp(pred, ytest)
Jchemo.euclsq
— Method
euclsq(X, Y)
Squared Euclidean distances between the rows of X and Y.
- X : Data (n, p).
- Y : Data (m, p).
For X (n, p) and Y (m, p), the function returns an object (n, m) with:
- i, j = distance between row i of X and row j of Y.
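The layout of the returned (n, m) object can be illustrated with a naive sketch. The helper `euclsq_sketch` is hypothetical and favors readability over speed.

```julia
# Hypothetical helper: squared Euclidean distances between the rows of X and the rows of Y.
function euclsq_sketch(X::AbstractMatrix, Y::AbstractMatrix)
    [sum(abs2, X[i, :] .- Y[j, :]) for i in axes(X, 1), j in axes(Y, 1)]
end

X = [0.0 0.0; 3.0 4.0]              # n = 2
Y = [0.0 0.0; 1.0 1.0; 3.0 0.0]     # m = 3
D = euclsq_sketch(X, Y)             # (2, 3): D[i, j] = squared distance between X[i, :] and Y[j, :]
```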
Examples
X = rand(5, 3)
Y = rand(2, 3)
euclsq(X, Y)
euclsq(X[1:1, :], Y[1:1, :])
euclsq(X[:, 1], 4)
euclsq(1, 4)
Jchemo.fblockscal
— Method
fblockscal(Xbl, bscales)
fblockscal!(Xbl::Vector, bscales::Vector)
Scale multiblock X-data.
- Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
- bscales : A vector (of length equal to the nb. of blocks) of the scalars dividing the blocks.
Examples
n = 5 ; m = 3 ; p = 10
X = rand(n, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
bscales = 10 * ones(3)
zXbl = fblockscal(Xbl, bscales) ;
@head zXbl[3]
@head Xbl[3]
fblockscal!(Xbl, bscales) ;
@head Xbl[3]
Jchemo.fcenter
— Method
fcenter(X, v)
fcenter!(X::AbstractMatrix, v)
Center each column of X.
- X : Data.
- v : Centering vector.
Examples
n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
fcenter(X, xmeans)
Jchemo.fcscale
— Method
fcscale(X, u, v)
fcscale!(X, u, v)
Center and scale each column of X.
- X : Data.
- u : Centering vector.
- v : Scaling vector.
Examples
n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
xstds = colstd(X)
fcscale(X, xmeans, xstds)
Jchemo.fda
— Method
fda(X, y; kwargs...)
fda(X, y, weights; kwargs...)
fda!(X::Matrix, y, weights; kwargs...)
Factorial discriminant analysis (FDA).
- X : X-data (n, p).
- y : y-data (n) (class membership).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. of discriminant components.
- lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
FDA by eigen factorization of Inverse(W) * B, where W is the "Within"-covariance matrix (pooled over the classes), and B the "Between"-covariance matrix.
The function maximizes the compromise:
- p'Bp / p'Wp
i.e. max p'Bp with constraint p'Wp = 1. Vectors p (columns of P) are the linear discriminant coefficients often referred to as "LD".
If X is ill-conditioned, a ridge regularization can be used:
- If lb > 0, W is replaced by W + lb * I, where I is the identity matrix.
In these fda functions, observation weights (argument weights) are used to compute matrices W and B.
In the high-level version, the observation weights are automatically defined by the given priors (argument prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level versions.
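The eigen factorization described above can be sketched with the standard library. The helper `fda_sketch` below is hypothetical and simplified: it assumes uniform observation weights (1 / n), ignores the prior and scal arguments, and solves the generalized symmetric eigenproblem B p = λ W p instead of forming Inverse(W) * B explicitly.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: FDA loadings by a generalized eigen decomposition,
# assuming uniform observation weights (1 / n).
function fda_sketch(X::AbstractMatrix, y::AbstractVector; nlv = 1, lb = 0.0)
    n, p = size(X)
    xmean = vec(mean(X, dims = 1))
    W = zeros(p, p)
    B = zeros(p, p)
    for l in sort(unique(y))
        Xl = X[y .== l, :]
        nl = size(Xl, 1)
        ml = vec(mean(Xl, dims = 1))                        # class center
        Xlc = Xl .- ml'
        W .+= Xlc' * Xlc ./ n                               # pooled "Within" covariance
        B .+= (nl / n) .* (ml - xmean) * (ml - xmean)'      # "Between" covariance
    end
    W = W + lb * I                                          # ridge regularization if lb > 0
    e = eigen(Symmetric(B), Symmetric(W))                   # eigenvectors satisfy p'Wp = 1
    ord = sortperm(e.values; rev = true)[1:nlv]             # largest eigenvalues first
    (; P = e.vectors[:, ord], W, B, eig = e.values[ord])
end

X = [1.0 2.0; 2.0 1.0; 1.5 1.8; 6.0 5.0; 7.0 6.5; 6.5 5.2]
y = ["a", "a", "a", "b", "b", "b"]
res = fda_sketch(X, y; nlv = 1)
T = (X .- mean(X, dims = 1)) * res.P                        # discriminant scores
```

With two classes, B has rank 1, so only the first discriminant component carries information.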
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
tab(ytrain)
tab(ytest)
nlv = 2
mod = model(fda; nlv)
#mod = model(fdasvd; nlv)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
lev = fm.lev
nlev = length(lev)
aggsum(fm.weights.w, ytrain)
@head fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
## X-loadings matrix
## = coefficients of the linear discriminant function
## = "LD" of function lda of the R package MASS
fm.P
fm.P' * fm.P
## Explained variance computed by weighted PCA
## of the class centers in transformed scale
summary(mod).explvarx
## Projections of the class centers
## to the score space
ct = fm.Tcenters
f, ax = plotxy(fm.T[:, 1], fm.T[:, 2], ytrain; ellipse = true, title = "FDA",
xlabel = "Score-1", ylabel = "Score-2")
scatter!(ax, ct[:, 1], ct[:, 2], marker = :star5, markersize = 15, color = :red) # see available_marker_symbols()
f
Jchemo.fdasvd
— Method
fdasvd(X, y, weights; kwargs...)
fdasvd!(X::Matrix, y, weights; kwargs...)
Factorial discriminant analysis (FDA).
- X: X-data (n, p).
- y: y-data (n) (class membership).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. of discriminant components.
- lb: Ridge regularization parameter "lambda". Can be used when X has collinearities.
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
FDA by a weighted SVD factorization of the matrix of the class centers (after spherical transformation). The function gives the same results as function fda.
See function fda for details and examples.
Jchemo.fdif
— Method
fdif(X; kwargs...)
Finite differences (discrete derivatives) for each row of X-data.
- X: X-data (n, p).
Keyword arguments:
- npoint: Nb. points involved in the window for the finite differences. The range of the window (= nb. intervals between two successive columns) is npoint - 1.
The method reduces the column-dimension:
- (n, p) -> (n, p - npoint + 1).
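The dimension reduction stated above can be checked with a small sketch. The helper fdif_sketch is hypothetical and assumes that each difference is taken between the last and the first column of a window of npoint columns; the source does not spell out the exact formula.

```julia
# Hypothetical helper, not the Jchemo implementation.
# Assumption: difference = last column of the window minus its first column.
function fdif_sketch(X::AbstractMatrix; npoint::Int = 2)
    n, p = size(X)
    q = p - npoint + 1            # number of output columns
    D = similar(X, n, q)
    for j in 1:q
        D[:, j] .= X[:, j + npoint - 1] .- X[:, j]
    end
    D
end

X = [1.0 2.0 4.0 7.0; 0.0 1.0 1.0 3.0]
D2 = fdif_sketch(X; npoint = 2)   # size (2, 3)
D3 = fdif_sketch(X; npoint = 3)   # size (2, 2)
```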
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(fdif; npoint = 2)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.findindex
— Method
findindex(x, lev)
Replace a vector containing levels by the indexes of a set of levels.
- x: Vector (n) of levels to replace.
- lev: Vector (nlev) containing the levels.
Warning: The levels in x must be contained in lev.
Examples
lev = ["EHH" ; "FFS" ; "ANF" ; "CLZ" ; "CNG" ; "FRG" ; "MPW" ; "PEE" ; "SFG" ; "TTS"]
x = ["EHH" ; "TTS" ; "FRG"]
findindex(x, lev)
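A comparable lookup can be written with the Base function indexin, shown here as a point of comparison only. Whether findindex relies on it internally is not stated in the source. The conversion to Int fails if a level of x is missing from lev, which mirrors the warning above.

```julia
lev = ["EHH", "FFS", "ANF", "CLZ", "CNG", "FRG", "MPW", "PEE", "SFG", "TTS"]
x = ["EHH", "TTS", "FRG"]

# indexin returns `nothing` for missing levels; Int.() then throws an error
idx = Int.(indexin(x, lev))

# The indexes recover the original levels
lev[idx] == x
```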
Jchemo.findmax_cla
— Method
findmax_cla(x)
findmax_cla(x, weights::Weight)
Find the most frequent level in x.
- x: A categorical variable.
- weights: Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
In case of ties, the function returns the first level.
Examples
x = rand(1:3, 10)
tab(x)
findmax_cla(x)
Jchemo.frob
— Method
frob(X)
frob(X, weights::Weight)
Frobenius norm of a matrix.
- X: A matrix (n, p).
- weights: Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
The Frobenius norm of X is:
- sqrt(tr(X' * X)).
The Frobenius weighted norm is:
- sqrt(tr(X' * D * X)), where D is the diagonal matrix of the weight vector w.
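Both formulas can be verified numerically with the LinearAlgebra standard library. The sketch below uses an illustrative weight vector w that sums to 1 (an assumption about how weights are normalized) and compares each trace formula with an equivalent norm computation.

```julia
using LinearAlgebra

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
w = [0.2, 0.3, 0.5]                        # illustrative observation weights

frob_plain = sqrt(tr(X' * X))              # Frobenius norm
frob_w = sqrt(tr(X' * Diagonal(w) * X))    # weighted Frobenius norm

# Equivalent computations: norm of X, and norm of X with rows scaled by sqrt(w)
frob_plain ≈ norm(X)
frob_w ≈ norm(sqrt.(w) .* X)
```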
Jchemo.fscale
— Method
fscale(X, v)
fscale!(X::AbstractMatrix, v)
Scale each column of X.
- X: Data.
- v: Scaling vector.
Examples
X = rand(5, 2)
fscale(X, colstd(X))
Jchemo.fweight
— Method
fweight(d; typw = :bisquare, alpha = 0)
Computation of weights from distances.
- d: Vector of distances.
Keyword arguments:
- typw: Define the weight function.
- alpha: Parameter of the weight function, see below.
The returned weight vector is:
- w = f(d / q) where f is the weight function and q the 1-alpha quantile of d (Cleveland & Grosse 1991).
Possible values for typw are (with x = d / q):
- :bisquare: w = (1 - x^2)^2
- :cauchy: w = 1 / (1 + x^2)
- :epan: w = 1 - x^2
- :fair: w = 1 / (1 + x)^2
- :invexp: w = exp(-x)
- :invexp2: w = exp(-x / 2)
- :gauss: w = exp(-x^2)
- :trian: w = 1 - x
- :tricube: w = (1 - x^3)^3
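A few of the weight functions listed above can be sketched as follows. The helper fweight_sketch and the dictionary WFUN are hypothetical names; the sketch covers only alpha = 0, where q is the largest distance and the scaled distances stay within [0, 1]. How values beyond the quantile are handled when alpha > 0 is not described here, so that case is left out.

```julia
using Statistics

# Some weight functions from the list, written for a scalar scaled distance x
const WFUN = Dict(
    :bisquare => x -> (1 - x^2)^2,
    :cauchy   => x -> 1 / (1 + x^2),
    :epan     => x -> 1 - x^2,
    :gauss    => x -> exp(-x^2),
    :tricube  => x -> (1 - x^3)^3,
)

# Hypothetical helper, not the Jchemo implementation; alpha = 0 only
function fweight_sketch(d; typw = :bisquare)
    q = quantile(d, 1.0)     # (1 - alpha) quantile with alpha = 0, i.e. maximum(d)
    f = WFUN[typw]
    f.(d ./ q)
end

d = [0.0, 0.5, 1.0, 2.0]
wb = fweight_sketch(d; typw = :bisquare)
wc = fweight_sketch(d; typw = :cauchy)
```

The closest observation (d = 0) gets weight 1 for every function shown; the compact-support functions such as :bisquare reach 0 at the largest distance.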
References
Cleveland, W.S., Grosse, E., 1991. Computational methods for local regression. Stat Comput 1, 47–62. https://doi.org/10.1007/BF01890836
Examples
using CairoMakie, Distributions
d = sort(sqrt.(rand(Chi(1), 1000)))
cols = cgrad(:tab10, collect(1:9)) ;
alpha = 0
f = Figure(size = (600, 500))
ax = Axis(f, xlabel = "d", ylabel = "Weight")
typw = :bisquare
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[1])
typw = :cauchy
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[2])
typw = :epan
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[3])
typw = :fair
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[4])
typw = :gauss
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[5])
typw = :trian
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[6])
typw = :invexp
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[7])
typw = :invexp2
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[8])
typw = :tricube
w = fweight(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = cols[9])
axislegend("Function", position = :lb)
f[1, 1] = ax
f
Jchemo.getknn
— Method
getknn(Xtrain, X; metric = :eucl, k = 1)
Return the k nearest neighbors in Xtrain of each row of the query X.
- Xtrain: Training X-data.
- X: Query X-data.
Keyword arguments:
- metric: Type of distance used for the query. Possible values are: :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).
- k: Number of neighbors to return.
The distances (not squared) are also returned.
Spectral angular and correlation distances between two vectors x and y:
- Spectral angular distance (x, y) = acos(x'y / (norm(x) * norm(y))) / pi
- Correlation distance (x, y) = sqrt((1 - cor(x, y)) / 2)
Both distances are bounded within 0 (y = x) and 1 (y = -x).
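The two formulas and their bounds can be checked with a short sketch. The function names samdist and cordist are hypothetical; the clamp and max calls are added here only to guard against floating-point round-off at the bounds.

```julia
using LinearAlgebra, Statistics

# Spectral angular distance, scaled to [0, 1]
samdist(x, y) = acos(clamp(dot(x, y) / (norm(x) * norm(y)), -1, 1)) / pi

# Correlation distance, scaled to [0, 1]
cordist(x, y) = sqrt(max(0.0, (1 - cor(x, y)) / 2))

x = [1.0, 2.0, 4.0]
samdist(x, x)     # lower bound (y = x)
samdist(x, -x)    # upper bound (y = -x)
cordist(x, x)
cordist(x, -x)
```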
Examples
Xtrain = rand(5, 3)
X = rand(2, 3)
x = X[1:1, :]
k = 3
res = getknn(Xtrain, X; k)
res.ind # indexes
res.d # distances
res = getknn(Xtrain, x; k)
res.ind
res = getknn(Xtrain, X; metric = :mah, k)
res.ind
Jchemo.gridcv
— Method
gridcv(mod, X, Y; segm, score, pars = nothing, nlv = nothing, lb = nothing,
verbose = false)
Cross-validation (CV) of a model over a grid of parameters.
- mod: Model to evaluate.
- X: Training X-data (n, p).
- Y: Training Y-data (n, q).
Keyword arguments:
- segm: Segments of observations used for the CV (output of functions segmts, segmkf, etc.).
- score: Function computing the prediction score (e.g. rmsep).
- pars: Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
- verbose: If true, fitting information is printed.
- nlv: Value, or vector of values, of the nb. of latent variables (LVs).
- lb: Value, or vector of values, of the ridge regularization parameter "lambda".
The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars.
For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.
The function returns two outputs:
- res: mean results.
- res_p: results per replication.
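The grid-search principle can be illustrated without Jchemo. The sketch below runs a single K-fold CV of a small ridge regression over a grid of lambda values and averages the score over the folds. All helper names (rmse, ridge_fit, ridge_predict, kfold, cv_grid) are hypothetical stand-ins, and the data are simulated; the real function additionally handles replications and generic models.

```julia
using LinearAlgebra, Random, Statistics

rmse(pred, y) = sqrt(mean((pred .- y) .^ 2))

# Ridge regression on centered data (stand-in for a model to evaluate)
function ridge_fit(X, y, lb)
    xm = mean(X, dims = 1)
    ym = mean(y)
    Xc = X .- xm
    b = (Xc' * Xc + lb * I) \ (Xc' * (y .- ym))
    (b = b, xm = xm, ym = ym)
end
ridge_predict(fm, X) = (X .- fm.xm) * fm.b .+ fm.ym

# K segments of observation indexes, built from a random permutation
function kfold(n, K)
    s = randperm(n)
    [s[k:K:n] for k in 1:K]
end

function cv_grid(X, y, lbs; K = 3)
    n = size(X, 1)
    segm = kfold(n, K)
    scores = zeros(length(lbs), K)
    for (k, test) in enumerate(segm)
        train = setdiff(1:n, test)
        for (i, lb) in enumerate(lbs)
            fm = ridge_fit(X[train, :], y[train], lb)
            scores[i, k] = rmse(ridge_predict(fm, X[test, :]), y[test])
        end
    end
    vec(mean(scores, dims = 2))   # mean score per grid value
end

Random.seed!(1)
X = randn(60, 5)
y = X * [1.0, -2.0, 0.5, 0.0, 3.0] .+ 0.1 .* randn(60)
lbs = 10.0 .^ (-4:2:4)
res = cv_grid(X, y, lbs)
best_lb = lbs[argmin(res)]
```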
Examples
######## Regression
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Replicated K-fold CV
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)
####-- Plsr
mod = model(plskern)
nlv = 0:30
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv) ;
pnames(rescv)
res = rescv.res
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, nlv) ;
res = rescv.res
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Rr
lb = (10).^(-8:.1:3)
mod = model(rr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, lb) ;
res = rescv.res
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, lb) ;
res = rescv.res
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Kplsr
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, nlv) ;
res = rescv.res
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Knnr
nlvdis = [15, 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1; 5; 10; 20; 50 ; 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
mod = model(knnr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Lwplsr
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
nlv = 0:20
mod = model(lwplsr)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, nlv, verbose = true) ;
res = rescv.res
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u],
h = res.h[u], k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- LwplsrAvg
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1])
mod = model(lwplsravg)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
######## Discrimination
## The principle is the same as for regression
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Replicated K-fold CV
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)
####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif; :prop]
pars = mpar(prior = prior)
rescv = gridcv(mod, Xtrain, ytrain; segm, score = errp, pars, nlv)
res = rescv.res
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
Jchemo.gridcv_br
— Method
gridcv_br(X, Y; segm, fun, score, pars, verbose = false)
Working function for gridcv.
See function gridcv for examples.
Jchemo.gridcv_lb
— Method
gridcv_lb(X, Y; segm, fun, score, pars = nothing, lb, verbose = false)
Working function for gridcv.
Specific and faster than gridcv_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.
See function gridcv for examples.
Jchemo.gridcv_lv
— Method
gridcv_lv(X, Y; segm, fun, score, pars = nothing, nlv, verbose = false)
Working function for gridcv.
Specific and faster than gridcv_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.
See function gridcv for examples.
Jchemo.gridscore
— Method
gridscore(mod, Xtrain, Ytrain, X, Y; score, pars = nothing, nlv = nothing,
lb = nothing, verbose = false)
Test-set validation of a model over a grid of parameters.
- mod: Model to evaluate.
- Xtrain: Training X-data (n, p).
- Ytrain: Training Y-data (n, q).
- X: Validation X-data (m, p).
- Y: Validation Y-data (m, q).
Keyword arguments:
- score: Function computing the prediction score (e.g. rmsep).
- pars: Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
- verbose: If true, fitting information is printed.
- nlv: Value, or vector of values, of the nb. of latent variables (LVs).
- lb: Value, or vector of values, of the ridge regularization parameter "lambda".
The function is used for grid-search: it computes a prediction score (= error rate) for model mod over the combinations of parameters defined in pars. The score is computed over sets {X, Y}.
For models based on LV or ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.
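The structure expected for pars (a tuple of named vectors of the same length, where position i defines the i-th parameter combination) can be pictured with a small sketch. The helper expand_grid is a hypothetical name written for this illustration; it is not function mpar, although the text indicates that mpar produces an object of this kind.

```julia
# Hypothetical helper: all combinations of the given values,
# returned as a named tuple of equally long vectors
function expand_grid(; kwargs...)
    nams = keys(kwargs)
    combos = vec(collect(Iterators.product(values(kwargs)...)))
    NamedTuple{nams}(Tuple([c[i] for c in combos] for i in 1:length(nams)))
end

pars = expand_grid(h = [1, 2.5, 5], k = [50, 100])
pars.h          # 6 values, one per combination
pars.k
length(pars[1])

# The i-th combination is read across the vectors
i = 4
(h = pars.h[i], k = pars.k[i])
```

A grid search then loops over i, fits the model with the i-th combination on the training set, and stores the score computed on the validation set.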
Examples
######## Regression
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
mod = model(savgol; npoint = 21, deriv = 2, degree = 2)
fit!(mod, X)
Xp = transf(mod, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
####-- Plsr
mod = model(plskern)
nlv = 0:30
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv)
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plskern; nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plskern; nlv = res.nlv[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Rr
lb = (10).^(-8:.1:3)
mod = model(rr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, lb)
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(rr; lb = res.lb[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, lb)
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(rr; lb = res.lb[u], scal = res.scal[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Kplsr
mod = model(kplsr)
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(kplsr; nlv = res.nlv[u], gamma = res.gamma[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Knnr
nlvdis = [15; 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1, 5, 10, 20, 50, 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
mod = model(knnr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(knnr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Lwplsr
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
nlv = 0:20
mod = model(lwplsr)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv, verbose = true)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(lwplsr; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- LwplsrAvg
nlvdis = 15 ; metric = [:mah]
h = [1, 2.5, 5] ; k = [50, 100]
nlv = [0:15, 0:20, 5:20]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1])
mod = model(lwplsravg)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(lwplsravg; nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Mbplsr
listbl = [1:525, 526:1050]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
Xbl_cal = mblock(Xcal, listbl)
Xbl_val = mblock(Xval, listbl)
mod = model(mbplsr)
bscal = [:none, :frob]
pars = mpar(bscal = bscal)
nlv = 0:30
res = gridscore(mod, Xbl_cal, ycal, Xbl_val, yval; score = rmsep, pars, nlv)
group = res.bscal
plotgrid(res.nlv, res.y1, group; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(mbplsr; bscal = res.bscal[u], nlv = res.nlv[u])
fit!(mod, Xbltrain, ytrain)
pred = predict(mod, Xbltest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
######## Discrimination
## The principle is the same as for regression
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
####-- Plslda
mod = model(plslda)
nlv = 1:30
prior = [:unif, :prop]
pars = mpar(prior = prior)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = errp, pars, nlv)
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod = model(plslda; nlv = res.nlv[u], prior = res.prior[u])
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
Jchemo.gridscore
— Method
gridscore(mod::Pipeline, Xtrain, Ytrain, X, Y; score, pars = nothing,
nlv = nothing, lb = nothing, verbose = false)
Test-set validation of a model pipeline over a grid of parameters.
- mod: A pipeline of models to evaluate.
- Xtrain: Training X-data (n, p).
- Ytrain: Training Y-data (n, q).
- X: Validation X-data (m, p).
- Y: Validation Y-data (m, q).
Keyword arguments:
- score: Function computing the prediction score (e.g. rmsep).
- pars: Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
- verbose: If true, fitting information is printed.
- nlv: Value, or vector of values, of the nb. of latent variables (LVs).
- lb: Value, or vector of values, of the ridge regularization parameter "lambda".
In the present version of the function, only the last model of the pipeline (= the final predictor) is validated.
For other details, see function gridscore for simple models.
Examples
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
####-- Pipeline Snv :> Savgol :> Plsr
## Only the last model is validated
## mod1
centr = true ; scal = false
mod1 = model(snv; centr, scal)
## mod2
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
## mod3
nlv = 0:30
mod3 = model(plskern)
##
mod = pip(mod1, mod2, mod3)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, nlv) ;
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod3 = model(plskern; nlv = res.nlv[u])
mod = pip(mod1, mod2, mod3)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Pipeline Pca :> Svmr
## Only the last model is validated
## mod1
nlv = 15 ; scal = true
mod1 = model(pcasvd; nlv, scal)
## mod2
kern = [:krbf]
gamma = (10).^(-5:1.:5)
cost = (10).^(1:3)
epsilon = [.1, .2, .5]
pars = mpar(kern = kern, gamma = gamma, cost = cost, epsilon = epsilon)
mod2 = model(svmr)
##
mod = pip(mod1, mod2)
res = gridscore(mod, Xcal, ycal, Xval, yval; score = rmsep, pars)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
mod2 = model(svmr; kern = res.kern[u], gamma = res.gamma[u], cost = res.cost[u],
epsilon = res.epsilon[u])
mod = pip(mod1, mod2) ;
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.gridscore_br
— Method
gridscore_br(Xtrain, Ytrain, X, Y; fun, score, pars,
verbose = false)
Working function for gridscore.
See function gridscore for examples.
Jchemo.gridscore_lb
— Method
gridscore_lb(Xtrain, Ytrain, X, Y; fun, score, pars = nothing,
lb, verbose = false)
Working function for gridscore.
Specific and faster than gridscore_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.
See function gridscore for examples.
Jchemo.gridscore_lv
— Method
gridscore_lv(Xtrain, Ytrain, X, Y; fun, score, pars = nothing,
nlv, verbose = false)
Working function for gridscore.
Specific and faster than gridscore_br for models using latent variables (e.g. PLSR). Argument pars must not contain nlv.
See function gridscore for examples.
Jchemo.head
— Method
head(X)
Display the first rows of a dataset.
Examples
X = rand(100, 5)
head(X)
@head X
Jchemo.interpl
— Method
interpl(X; kwargs...)
Sampling spectra by interpolation.
- X: Matrix (n, p) of spectra (rows).
Keyword arguments:
- wl: Values representing the column "names" of X. Must be a numeric vector of length p, or an AbstractRange.
- wlfin: Final values (within the range of wl) where to interpolate the spectrum. Must be a numeric vector, or an AbstractRange.
The function implements a cubic spline interpolation using package DataInterpolations.jl.
References
Package DataInterpolations.jl
- https://github.com/PumasAI/DataInterpolations.jl
- https://htmlpreview.github.io/?https://github.com/PumasAI/DataInterpolations.jl/blob/v2.0.0/example/DataInterpolations.html
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
wlfin = range(500, 2400, length = 10)
#wlfin = collect(range(500, 2400, length = 10))
mod = model(interpl; wl, wlfin)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.isel!
— Function
isel!(mod, X, Y, wl = 1:nco(X); rep = 1, nint = 5, psamp = .3, score = rmsep)
Interval variable selection.
- mod: Model to evaluate.
- X: X-data (n, p).
- Y: Y-data (n, q).
- wl: Optional numeric labels (p, 1) of the X-columns.
Keyword arguments:
- rep: Number of replications of the splitting training/test.
- nint: Nb. intervals.
- psamp: Proportion of data used as test set to compute the score.
- score: Function computing the prediction score.
The principle is as follows:
- Data (X, Y) are split randomly into a training and a test set.
- Range 1:p in X is segmented into nint intervals, when possible of equal size.
- The model is fitted on the training set and the score (error rate) is computed on the test set, firstly accounting for all the p variables (reference) and secondly for each of the nint intervals.
- This process is replicated rep times. Average results are provided in the outputs, as well as the results per replication.
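The segmentation step above can be sketched as follows. The helper intervals is hypothetical; it splits the range 1:p into nint contiguous intervals of near-equal size by rounding evenly spaced boundaries, which is one plausible reading of "when possible of equal size" and not necessarily how isel! does it.

```julia
# Hypothetical helper: split 1:p into nint contiguous intervals
function intervals(p::Int, nint::Int)
    bounds = round.(Int, range(0, p, length = nint + 1))
    [(bounds[i] + 1):bounds[i + 1] for i in 1:nint]
end

ints = intervals(100, 10)    # ten intervals of 10 columns each
ints3 = intervals(10, 3)     # unequal sizes when p is not a multiple of nint

# Column subset for the second interval of a data matrix
X = rand(5, 100)
Xint = X[:, ints[2]]
```

Each interval then gives a column subset on which the model is fitted and scored, to be compared with the reference score obtained on all p columns.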
References
- Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500
Examples
using DataFrames, JLD2, CairoMakie
using JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
wl_str = names(X)
wl = parse.(Float64, wl_str)
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Work on the j-th
## y-variable
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]
mod = model(plskern; nlv = 5)
nint = 10
res = isel!(mod, Xtrain, ytrain, wl; rep = 30, nint) ;
res.res_rep
res.res0_rep
zres = res.res
zres0 = res.res0
f = Figure(size = (650, 300))
ax = Axis(f[1, 1], xlabel = "Wavelength (nm)", ylabel = "RMSEP_Val",
xticks = zres.lo)
scatter!(ax, zres.mid, zres.y1; color = (:red, .5))
vlines!(ax, zres.lo; color = :grey, linestyle = :dash, linewidth = 1)
hlines!(ax, zres0.y1, linestyle = :dash)
f
Jchemo.kdeda
— Method
kdeda(X, y; kwargs...)
Discriminant analysis using non-parametric kernel Gaussian density estimation (KDE-DA).
- X: X-data (n, p).
- y: Univariate class membership (n).
Keyword arguments:
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
The principle is the same as functions lda and qda except that densities are estimated from function dmkern instead of function dmnorm.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
prior = :unif
#prior = :prop
mod = model(kdeda; prior)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
mod = model(kdeda; prior, a_kde = .5)
#mod = model(kdeda; prior, h_kde = .1)
fit!(mod, Xtrain, ytrain)
mod.fm.fm[1].H
Jchemo.knnda
Method
knnda(X, y; kwargs...)
k-Nearest-Neighbours weighted discrimination (KNN-DA).
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when neighbors are very close.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.
This function follows the same principle as function knnr, except that a discrimination is done instead of a regression. A weighted vote is done over the neighborhood, and the prediction corresponds to the most frequent class.
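The weighted vote can be sketched as follows. The function weighted_vote is a hypothetical helper, not part of Jchemo, and it assumes that "most frequent" is understood in the weighted sense (the class with the largest sum of neighbor weights).

```julia
# Hypothetical helper: weighted vote over the classes of the neighbors
function weighted_vote(classes, w)
    lev = sort(unique(classes))
    score = [sum(w[classes .== l]) for l in lev]   # total weight per class
    lev[argmax(score)]
end

# Class "b" has more neighbors, but class "a" carries more weight
pred = weighted_vote(["a", "b", "b"], [1.0, 0.3, 0.3])
```

With equal weights the vote reduces to the usual majority rule of kNN discrimination.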
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 2 ; k = 10
mod = model(knnda; nlvdis, metric, h, k)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.knnr
Method
knnr(X, Y; kwargs...)
k-Nearest-Neighbours weighted regression (KNNR).
X : X-data (n, p).
Y : Y-data (n, q).
Keyword arguments:
nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when neighbors are very close.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.
The general principle of this function is as follows (many other variants of kNNR pipelines can be built):
For each new observation to predict, the prediction is the weighted mean over a selected neighborhood (in X) of size k. Within the selected neighborhood, the weights are defined from the dissimilarities between the new observation and the neighborhood, and are computed by function wdist.
In general, for high-dimensional X-data, using the Mahalanobis distance requires a preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.
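The weighted-mean principle can be sketched in a few lines. This is a simplified illustration, not the Jchemo code: knn_weighted_mean is a hypothetical helper, the Euclidean distance is used without dimension reduction, and the weight function exp(-d / h) is only a stand-in for what wdist computes.

```julia
# Hypothetical helper: kNN weighted-mean prediction for one new observation
function knn_weighted_mean(Xtrain, ytrain, xnew; k, h)
    n = size(Xtrain, 1)
    d = [sqrt(sum(abs2, Xtrain[i, :] .- xnew)) for i in 1:n]   # Euclidean distances
    nn = partialsortperm(d, 1:k)      # indexes of the k nearest neighbors
    w = exp.(-d[nn] ./ h)             # stand-in weight function (assumption)
    sum(w .* ytrain[nn]) / sum(w)     # weighted mean over the neighborhood
end

Xtrain = reshape([0.0, 1.0, 2.0, 3.0, 10.0], :, 1)
ytrain = [0.0, 1.0, 2.0, 3.0, 10.0]
pred = knn_weighted_mean(Xtrain, ytrain, [1.0]; k = 3, h = 1.0)
```

In this toy case the neighborhood is symmetric around the new observation, so the prediction is close to the response of the nearest training point.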
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlvdis = 5 ; metric = :mah
#nlvdis = 0 ; metric = :eucl
h = 1 ; k = 5
mod = model(knnr; nlvdis, metric, h, k) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
mod = model(knnr; k = 15, h = 5)
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.kpca
Method
kpca(X; kwargs...)
kpca(X, weights::Weight; kwargs...)
Kernel PCA (Scholkopf et al. 1997, Scholkopf & Smola 2002, Tipping 2001).
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. principal components (PCs) to consider.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The method is implemented by SVD factorization of the weighted Gram matrix:
- D^(1/2) * Phi(X) * Phi(X)' * D^(1/2)
where X is the centered matrix and D is a diagonal matrix of weights (weights.w) of the observations (rows of X).
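The factorization can be sketched as follows, under simplifying assumptions: uniform weights (D = I / n), an RBF kernel, centering done in the feature space, and one plausible scaling of the scores. The function kpca_sketch is a hypothetical helper and may differ from the Jchemo implementation in its conventions.

```julia
using LinearAlgebra

# Hypothetical helper: kernel PCA with uniform weights and an RBF kernel
function kpca_sketch(X; nlv, gamma)
    n = size(X, 1)
    K = [exp(-gamma * sum(abs2, X[i, :] .- X[j, :])) for i in 1:n, j in 1:n]
    J = I - fill(1 / n, n, n)          # centering operator (uniform weights)
    Kc = J * K * J                     # Gram matrix of the centered Phi(X)
    A = (Kc + Kc') / (2n)              # D^(1/2) * Kc * D^(1/2), symmetrized
    F = svd(A)
    eig = F.S[1:nlv]                   # leading eigenvalues
    T = sqrt(n) .* F.U[:, 1:nlv] * Diagonal(sqrt.(eig))   # scores (assumed scaling)
    (T = T, eig = eig)
end

X = [0.0 0.0; 1.0 0.5; 2.0 1.5; 3.0 1.0; 4.0 3.0; 5.0 2.5]
res = kpca_sketch(X; nlv = 2, gamma = 0.1)
n = size(X, 1)
G = res.T' * res.T ./ n                # weighted cross-product of the scores
```

With this scaling, G is diagonal and holds the leading eigenvalues, so the scores are uncorrelated.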
References
Scholkopf, B., Smola, A., Müller, K.-R., 1997. Kernel principal component analysis, in: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (Eds.), Artificial Neural Networks, ICANN 97, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 583-588. https://doi.org/10.1007/BFb0020217
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Tipping, M.E., 2001. Sparse kernel principal component analysis. Advances in neural information processing systems, MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
Xtest = X[s.test, :]
nlv = 3
kern = :krbf ; gamma = 1e-4
mod = model(kpca; nlv, kern, gamma) ;
fit!(mod, Xtrain)
pnames(mod.fm)
@head T = mod.fm.T
T' * T
mod.fm.P' * mod.fm.P
@head Ttest = transf(mod, Xtest)
res = summary(mod) ;
pnames(res)
res.explvarx
Jchemo.kplskdeda
Method
kplskdeda(X, y; kwargs...)
kplskdeda(X, y, weights::Weight; kwargs...)
KPLS-KDEDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plskdeda (PLS-KDEDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function kplslda for examples.
Jchemo.kplslda
Method
kplslda(X, y; kwargs...)
kplslda(X, y, weights::Weight; kwargs...)
KPLS-LDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plslda (PLS-LDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
gamma = .1
mod = model(kplslda; nlv, gamma)
#mod = model(kplslda; nlv, gamma, prior = :prop)
#mod = model(kplsqda; nlv, gamma, alpha = .5)
#mod = model(kplskdeda; nlv, gamma, a_kde = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fmpls)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
Jchemo.kplsqda
Method
kplsqda(X, y; kwargs...)
kplsqda(X, y, weights::Weight; kwargs...)
KPLS-QDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsqda (PLS-QDA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function kplslda for examples.
Jchemo.kplsr
Method
kplsr(X, Y; kwargs...)
kplsr(X, Y, weights::Weight; kwargs...)
kplsr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Kernel partial least squares regression (KPLSR) implemented with a NIPALS algorithm (Rosipal & Trejo, 2001).
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to consider.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
This algorithm becomes slow for n > 1000. Use function dkplsr instead.
References
Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
kern = :krbf ; gamma = 1e-1
mod = model(kplsr; nlv, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
nlv = 2
kern = :krbf ; gamma = 1 / 3
mod = model(kplsr; nlv, kern, gamma)
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.kplsrda
Method
kplsrda(X, y; kwargs...)
kplsrda(X, y, weights::Weight; kwargs...)
Discrimination based on kernel partial least squares regression (KPLSR-DA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsrda (PLSR-DA) except that a kernel PLSR (function kplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
kern = :krbf ; gamma = .001
scal = true
mod = model(kplsrda; nlv, kern, gamma, scal)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
Jchemo.kpol
Method
kpol(X, Y; kwargs...)
Compute a polynomial kernel Gram matrix.
X : X-data (n, p).
Y : Y-data (m, p).
Keyword arguments:
degree : Degree of the polynomial.
gamma : Scale of the polynomial.
coef0 : Offset of the polynomial.
Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:
- K(X, Y) = Phi(X) * Phi(Y)'.
The polynomial kernel between two vectors x and y is computed by (gamma * (x' * y) + coef0)^degree.
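The formula above can be written directly with matrix operations. The function poly_gram below is a hypothetical stand-in used for illustration, not the Jchemo function, and its default values are arbitrary.

```julia
# Hypothetical helper: polynomial Gram matrix between the rows of X (n, p) and Y (m, p)
function poly_gram(X, Y; degree = 1, gamma = 1.0, coef0 = 0.0)
    (gamma .* (X * Y') .+ coef0) .^ degree   # element-wise (gamma * x'y + coef0)^degree
end

X = [1.0 2.0; 0.0 1.0]
Y = [1.0 0.0]
K = poly_gram(X, Y; degree = 2, gamma = 0.5, coef0 = 1.0)   # (2, 1) matrix
```

Here the inner products are 1 and 0, so the two entries are (0.5 * 1 + 1)^2 and (0.5 * 0 + 1)^2.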
References
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
X = rand(5, 3)
Y = rand(2, 3)
kpol(X, Y; degree = 3, gamma = .1, coef0 = 10)
Jchemo.krbf
Method
krbf(X, Y; kwargs...)
Compute a Radial-Basis-Function (RBF) kernel Gram matrix.
X : X-data (n, p).
Y : Y-data (m, p).
Keyword arguments:
gamma : Scale parameter.
Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:
- K(X, Y) = Phi(X) * Phi(Y)'.
The RBF kernel between two vectors x and y is computed by exp(-gamma * ||x - y||^2).
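A direct, unoptimized transcription of this formula is sketched below. The function rbf_gram is a hypothetical helper written for illustration and is not the Jchemo implementation.

```julia
# Hypothetical helper: RBF Gram matrix between the rows of X (n, p) and Y (m, p)
function rbf_gram(X::AbstractMatrix, Y::AbstractMatrix; gamma = 1.0)
    n, m = size(X, 1), size(Y, 1)
    K = zeros(n, m)
    for i in 1:n, j in 1:m
        d2 = sum(abs2, X[i, :] .- Y[j, :])   # squared Euclidean distance
        K[i, j] = exp(-gamma * d2)
    end
    K
end

X = [0.0 0.0; 1.0 0.0]
Y = [0.0 0.0; 0.0 2.0; 1.0 0.0]
K = rbf_gram(X, Y; gamma = 0.5)              # (2, 3) matrix
```

Entries are equal to 1 for identical rows and decrease towards 0 as the rows move apart; a larger gamma makes the decrease faster.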
References
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
X = rand(5, 3)
Y = rand(2, 3)
krbf(X, Y; gamma = .1)
Jchemo.krr
Method
krr(X, Y; kwargs...)
krr(X, Y, weights::Weight; kwargs...)
krr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Kernel ridge regression (KRR) implemented by SVD factorization.
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
lb : Ridge regularization parameter "lambda".
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
KRR is also referred to as least squares SVM regression (LS-SVMR). The method is close to the particular case of SVM regression where there is no margin excluding the observations (epsilon coefficient set to zero). The difference is that an L2-norm optimization is done, instead of L1 in SVM.
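For intuition, the textbook dual form of KRR can be sketched in a few lines. This is a simplified illustration and not the Jchemo code: it solves the linear system (K + lb * I) * alpha = y directly instead of using an SVD, and it ignores observation weights, centering and scaling. The names rbf, krr_fit and krr_predict are hypothetical.

```julia
using LinearAlgebra

# Hypothetical helper: RBF kernel between two vectors
rbf(x, y; gamma) = exp(-gamma * sum(abs2, x .- y))

# Hypothetical helper: dual KRR coefficients alpha = (K + lb * I)^(-1) * y
function krr_fit(X, y; lb, gamma)
    n = size(X, 1)
    K = [rbf(X[i, :], X[j, :]; gamma) for i in 1:n, j in 1:n]
    alpha = (K + lb * I) \ y
    (; X, alpha, gamma)
end

# Predictions are kernel-weighted sums of the dual coefficients
function krr_predict(fm, Xnew)
    m, n = size(Xnew, 1), size(fm.X, 1)
    Knew = [rbf(Xnew[i, :], fm.X[j, :]; gamma = fm.gamma) for i in 1:m, j in 1:n]
    Knew * fm.alpha
end

X = reshape(collect(-3:0.5:3), :, 1)
y = vec(sin.(X))
lb = 1e-3
fm = krr_fit(X, y; lb, gamma = 1.0)
pred = krr_predict(fm, X)
```

On the training data the residuals equal lb * alpha, which shows how a larger lb shrinks the fit away from the observations.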
References
Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.
Cawley, G.C., Talbot, N.L.C., 2002. Reduced Rank Kernel Ridge Regression. Neural Processing Letters 16, 293-302. https://doi.org/10.1023/A:1021798002258
Krell, M.M., 2018. Generalizing, Decoding, and Optimizing Support Vector Machine Classification. arXiv:1801.04929.
Saunders, C., Gammerman, A., Vovk, V., 1998. Ridge Regression Learning Algorithm in Dual Variables, in: In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, pp. 515-521.
Suykens, J.A.K., Lukas, L., Vandewalle, J., 2000. Sparse approximation using least squares support vector machines. 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353). https://doi.org/10.1109/ISCAS.2000.856439
Welling, M., n.d. Kernel ridge regression. Department of Computer Science, University of Toronto, Toronto, Canada. https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
lb = 1e-3
kern = :krbf ; gamma = 1e-1
mod = model(krr; lb, kern, gamma) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
coef(mod)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
lb = 1e-3
kern = :kpol ; degree = 1
mod = model(krr; lb, kern, degree)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest)
rmsep(res.pred, ytest)
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
lb = 1e-1
kern = :krbf ; gamma = 1 / 3
mod = model(krr; lb, kern, gamma)
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.krrda
Method
krrda(X, y; kwargs...)
krrda(X, y, weights::Weight; kwargs...)
Discrimination based on kernel ridge regression (KRR-DA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
lb : Ridge regularization parameter "lambda".
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function rrda (RR-DA) except that a kernel RR (function krr), instead of a RR (function rr), is run on the Y-dummy table.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
lb = 1e-5
kern = :krbf ; gamma = .001
scal = true
mod = model(krrda; lb, kern, gamma, scal)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; lb = [.1, .001]).pred
Jchemo.lda
Method
lda(; kwargs...)
lda(X, y; kwargs...)
lda(X, y, weights::Weight; kwargs...)
Linear discriminant analysis (LDA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
In these lda functions, observation weights (argument weights) are used to compute the intra-class (= "within") covariance matrix. Argument prior is used to define the usual prior class probabilities.
In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.
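One way to obtain weights whose sub-totals by class equal the priors is to share each prior equally among the observations of the class. The sketch below shows this construction; weights_from_prior is a hypothetical helper and the equal sharing within a class is an assumption, not a description of the Jchemo internals.

```julia
# Hypothetical helper: observation weights whose class sub-totals equal the priors
function weights_from_prior(y, prior)
    lev = sort(unique(y))                      # class levels, sorted
    ni = [count(==(l), y) for l in lev]        # class sizes
    w = zeros(length(y))
    for (j, l) in enumerate(lev)
        w[y .== l] .= prior[j] / ni[j]         # equal share of the prior within the class
    end
    w
end

y = ["a", "a", "a", "b"]
w = weights_from_prior(y, [0.5, 0.5])
wa = sum(w[y .== "a"])                         # sub-total for class "a"
wb = sum(w[y .== "b"])                         # sub-total for class "b"
```

With uniform priors, observations of small classes receive larger individual weights than observations of large classes.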
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
mod = lda()
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lg
Method
lg(X, Y; centr = true)
lg(Xbl; centr = true)
Compute the Lg coefficient between matrices.
X : Matrix (n, p).
Y : Matrix (n, q).
Xbl : A list (vector) of matrices.
Keyword arguments:
centr : Boolean indicating if the matrices will be internally centered or not.
Lg(X, Y) = Sum(j = 1..p) Sum(k = 1..q) cov(xj, yk)^2
RV(X, Y) = Lg(X, Y) / sqrt(Lg(X, X) * Lg(Y, Y))
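Both formulas can be checked numerically with the Statistics standard library. The helpers lg_coef and rv_coef are hypothetical names, and the use of uncorrected covariances (division by n) is an assumption; it changes the scale of Lg but not the value of RV.

```julia
using Statistics

# Hypothetical helper: sum of the squared covariances between columns of X and Y
function lg_coef(X, Y)
    C = cov(X, Y; corrected = false)   # (p, q) matrix of covariances
    sum(abs2, C)
end

# Hypothetical helper: RV coefficient derived from Lg
rv_coef(X, Y) = lg_coef(X, Y) / sqrt(lg_coef(X, X) * lg_coef(Y, Y))

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]
Y = reshape([1.0, 0.0, 2.0, 1.5], :, 1)
rv_xy = rv_coef(X, Y)
rv_xx = rv_coef(X, 2 .* X)             # RV is invariant to a rescaling of a matrix
```

RV lies between 0 and 1 and equals 1 when one matrix is a rescaled copy of the other.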
References
Escofier, B. & Pagès, J. 1984. L’analyse factorielle multiple. Cahiers du Bureau universitaire de recherche opérationnelle. Série Recherche, tome 42, p. 3-68
Escofier, B. & Pagès, J. (2008). Analyses Factorielles Simples et Multiples : Objectifs, Méthodes et Interprétation. Dunod, 4e édition.
Examples
X = rand(5, 10)
Y = rand(5, 3)
lg(X, Y)
X = rand(5, 15)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
lg(Xbl)
Jchemo.list
Method
list(Q, n::Integer)
Create a Vector{Q}(undef, n).
isassigned(object, i) can be used to check if cell i is empty.
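The check can be illustrated with plain Base Julia, building the vector directly with the constructor named above. Note that isassigned is only informative for element types that are not plain bits types (e.g. matrices); for a type such as Float64 every cell counts as assigned.

```julia
# Same kind of object as described above, built with the Base constructor
v = Vector{Matrix{Float64}}(undef, 3)
v[2] = ones(2, 2)                             # fill one cell only
filled = [isassigned(v, i) for i in 1:3]      # which cells have been filled
```

Only the second entry of filled is true, since the other two cells were never assigned.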
Examples
list(Float64, 5)
list(Array{Float64}, 5)
list(Matrix{Int}, 5)
Jchemo.list
Method
list(n::Integer)
Create a Vector{Any}(nothing, n).
isnothing(object[i]) can be used to check if cell i is empty.
Examples
list(5)
Jchemo.locw
Method
locw(Xtrain, Ytrain, X; listnn, listw = nothing, fun, verbose = false, kwargs...)
Compute predictions for a given kNN model.
Xtrain : Training X-data.
Ytrain : Training Y-data.
X : X-data (m observations) to predict.
Keyword arguments:
listnn : List (vector) of m vectors of indexes.
listw : List (vector) of m vectors of weights.
fun : Function computing the model on the m neighborhoods.
verbose : Boolean. If true, fitting information is printed.
kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).
Each component i of listnn and listw contains the indexes and weights, respectively, of the nearest neighbors of x_i in Xtrain. The sizes of the neighborhoods for i = 1,...,m can be different.
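The loop over the neighborhoods can be sketched as follows. This is only an illustration of the data layout of listnn and listw: local_predictions and wmean are hypothetical helpers, and the calling convention of fun used here is an assumption that differs from the Jchemo one.

```julia
# Hypothetical helper: one local model per observation to predict
function local_predictions(Xtrain, ytrain, X; listnn, listw, fun)
    m = size(X, 1)
    pred = zeros(m)
    for i in 1:m
        s = listnn[i]                          # indexes of the neighbors of x_i
        pred[i] = fun(Xtrain[s, :], ytrain[s], listw[i], X[i, :])
    end
    pred
end

# Simplest possible local model: weighted mean of the neighbors' responses
wmean(Xnn, ynn, w, xnew) = sum(w .* ynn) / sum(w)

Xtrain = reshape([1.0, 2.0, 3.0, 4.0], :, 1)
ytrain = [1.0, 2.0, 3.0, 4.0]
X = reshape([1.5, 3.0], :, 1)
listnn = [[1, 2], [2, 3, 4]]                   # neighborhoods of different sizes
listw = [[1.0, 1.0], [2.0, 1.0, 1.0]]
pred = local_predictions(Xtrain, ytrain, X; listnn, listw, fun = wmean)
```

Replacing wmean by a function that fits a weighted regression on the neighborhood gives a locally weighted model in the same framework.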
Jchemo.locwlv
Method
locwlv(Xtrain, Ytrain, X; listnn, listw = nothing, fun, nlv, verbose = true, kwargs...)
Compute predictions for a given kNN model.
Xtrain : Training X-data.
Ytrain : Training Y-data.
X : X-data (m observations) to predict.
Keyword arguments:
listnn : List (vector) of m vectors of indexes.
listw : List (vector) of m vectors of weights.
fun : Function computing the model on the m neighborhoods.
nlv : Nb. or collection of nb. of latent variables (LVs).
verbose : Boolean. If true, fitting information is printed.
kwargs : Keyword arguments to pass to function fun. Each argument must have length = 1 (not be a collection).
Same as locw but specific to, and much faster for, LV-based models (e.g. PLSR).
Jchemo.lwmlr
Method
lwmlr(X, Y; kwargs...)
k-Nearest-Neighbours locally weighted multiple linear regression (kNN-LWMLR).
X : X-data (n, p).
Y : Y-data (n, q).
Keyword arguments:
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when neighbors are very close.
This is the same principle as function lwplsr except that MLR models are fitted on the neighborhoods, instead of PLSR models. The neighborhoods are computed directly on X (there is no preliminary dimension reduction).
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
mod0 = model(pcasvd; nlv) ;
fit!(mod0, Xtrain)
@head Ttrain = mod0.fm.T
@head Ttest = transf(mod0, Xtest)
metric = :eucl
h = 2 ; k = 100
mod = model(lwmlr; metric, h, k)
fit!(mod, Ttrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Ttest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
mod = model(lwmlr; metric = :eucl, h = 1.5, k = 20) ;
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.lwmlrda
Method
lwmlrda(X, y; kwargs...)
k-Nearest-Neighbours locally weighted MLR-based discrimination (kNN-LWMLR-DA).
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when neighbors are very close.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction.
This is the same principle as function lwmlr except that MLR-DA models, instead of MLR models, are fitted on the neighborhoods.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
metric = :mah
h = 2 ; k = 10
mod = model(lwmlrda; metric, h, k)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplslda
Method
lwplslda(X, y; kwargs...)
kNN-LWPLS-LDA.
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
nlvdis : Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when neighbors are very close.
nlv : Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.
This is the same principle as function lwplsr except that PLS-LDA models, instead of PLSR models, are fitted on the neighborhoods.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 1 ; k = 100
mod = model(lwplslda; nlvdis, metric, h, k, prior = :prop)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplsqda — Method
lwplsqda(X, y; kwargs...)
kNN-LWPLS-QDA.
X: X-data (n, p).
y: Univariate class membership (n).
Keyword arguments:
nlvdis: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h: A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k: The number of nearest neighbors to select for each observation to predict.
tolw: Used for stabilization when some neighbors are very close.
nlv: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional).
alpha: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.
This is the same principle as function lwplsr, except that PLS-QDA models, instead of PLSR models, are fitted on the neighborhoods.
Warning: The present version of this function suffers from frequent stops due to non positive definite matrices when doing QDA on neighborhoods, since some classes within a neighborhood can have very few observations. It is recommended to select a sufficiently large number of neighbors and/or to use a regularized QDA (alpha > 0).
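The continuum defined by alpha can be pictured with the following shrinkage formula. It is an assumption made for illustration (this section does not state the exact expression used by the function): each class covariance matrix W_i is shrunk toward the pooled within-class matrix W, which gives QDA for alpha = 0 and LDA for alpha = 1.

```latex
W_i(\alpha) = (1 - \alpha)\, W_i + \alpha\, W, \qquad \alpha \in [0, 1]
```

Under this form, any alpha > 0 mixes in the pooled matrix, which is what makes the class matrices less likely to be singular when a class has very few observations in a neighborhood.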
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 1 ; k = 200
mod = model(lwplsqda; nlvdis, metric, h, k, prior = :prop, alpha = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplsr — Method
lwplsr(X, Y; kwargs...)
k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).
X: X-data (n, p).
Y: Y-data (n, q).
Keyword arguments:
nlvdis: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h: A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k: The number of nearest neighbors to select for each observation to predict.
tolw: Used for stabilization when some neighbors are very close.
nlv: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
scal: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.
Function lwplsr fits kNN-LWPLSR models such as in Lesnoff et al. 2020. The general principle of the pipeline is as follows (many other variants of pipelines can be built):
LWPLSR is a particular case of weighted PLSR (WPLSR) (e.g. Schaal et al. 2002). In WPLSR, a priori weights, different from the usual 1/n (standard PLSR), are given to the n training observations. These weights are used for calculating (i) the scores and loadings of the WPLS and (ii) the regression model that fits (by weighted least squares) the Y-response(s) to the WPLS scores. The specificity of LWPLSR (compared to WPLSR) is that the weights are computed from dissimilarities (e.g. distances) between the new observation to predict and the training observations ("L" in LWPLSR comes from "localized"). Note that in LWPLSR the weights, and therefore the fitted WPLSR model, change for each new observation to predict.
In the original LWPLSR, all the n training observations are used for each observation to predict (e.g. Sicard & Sabatier 2006, Kim et al. 2011). This can be very time consuming when n is large. A faster (and often more efficient) strategy is to first select, in the training set, the k nearest neighbors of the observation to predict (= "weighting 1") and then to apply LWPLSR only to this pre-selected neighborhood (= "weighting 2"). This strategy corresponds to a kNN-LWPLSR and is the one implemented in function lwplsr.
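The two weighting steps can be sketched in plain Julia as follows. This is a minimal illustration, not the code of lwplsr: the local model is a weighted least-squares fit standing in for the weighted PLSR, the weight function exp(-d / (h * MAD)) is an assumption, and the helper names knn_weights and local_wls_predict are hypothetical.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: select the k nearest training rows to xnew
# ("weighting 1") and give them decreasing weights ("weighting 2").
function knn_weights(Xtrain, xnew; k = 5, h = 1.0)
    d = [norm(row .- xnew) for row in eachrow(Xtrain)]   # Euclidean distances
    ind = partialsortperm(d, 1:k)                        # indices of the k nearest rows
    dk = d[ind]
    s = median(abs.(dk .- median(dk)))                   # median absolute deviation
    s = s > 0 ? s : 1.0                                  # guard against a zero spread
    w = exp.(-dk ./ (h * s))                             # assumed weight function
    (ind = ind, w = w ./ sum(w))
end

# Hypothetical helper: weighted least squares on the neighborhood
# (a stand-in for the local weighted PLSR).
function local_wls_predict(Xtrain, ytrain, xnew; k = 5, h = 1.0)
    res = knn_weights(Xtrain, xnew; k, h)
    Z = hcat(ones(k), Xtrain[res.ind, :])                # intercept + neighbors
    sw = sqrt.(res.w)
    b = (sw .* Z) \ (sw .* ytrain[res.ind])              # weighted LS solution
    dot(vcat(1.0, xnew), b)
end

Xtrain = reshape(collect(0.0:0.1:2.0), :, 1)   # 21 observations, 1 variable
ytrain = 2 .* vec(Xtrain) .+ 1                 # noise-free linear response
pred = local_wls_predict(Xtrain, ytrain, [0.55]; k = 5, h = 1.0)
```

Because the weights depend on xnew, the neighborhood and the fitted local model are recomputed for every observation to predict, which is why restricting the fit to k neighbors saves time when n is large.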
In lwplsr, the dissimilarities used for weightings 1 and 2 are computed from the raw X-data, or after a dimension reduction, depending on argument nlvdis. In the latter case, global PLS2 scores (LVs) are computed from {X, Y} and the dissimilarities are computed over these scores.
In general, for high dimensional X-data, using the Mahalanobis distance requires preliminary dimensionality reduction of the data. As in function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.
References
Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.
Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics, e3209. https://doi.org/10.1002/cem.3209
Schaal, S., Atkeson, C., Vijayakumar, S., 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.
Sicard, E., Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlvdis = 5 ; metric = :mah
h = 1 ; k = 200 ; nlv = 15
mod = model(lwplsr; nlvdis, metric, h, k, nlv)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
Jchemo.lwplsravg — Method
lwplsravg(X, Y; kwargs...)
Averaging kNN-LWPLSR models with different numbers of latent variables (kNN-LWPLSR-AVG).
X: X-data (n, p).
Y: Y-data (n, q).
Keyword arguments:
nlvdis: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h: A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k: The number of nearest neighbors to select for each observation to predict.
tolw: Used for stabilization when some neighbors are very close.
nlv: A range of nb. of latent variables (LVs) to compute for the local (i.e. inside each neighborhood) models.
scal: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.
Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs, such as in Lesnoff 2023. On each neighborhood, a PLSR-averaging (Lesnoff et al. 2022) is done instead of a PLSR.
For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ... 10 LVs, respectively.
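Written as a formula for the nlv = 5:10 case, with the notation \hat{y}_a(x) (introduced here only for illustration) denoting the prediction of the local model with a LVs for a new observation x:

```latex
\hat{y}_{\text{avg}}(x) = \frac{1}{6} \sum_{a=5}^{10} \hat{y}_{a}(x)
```

The divisor 6 is simply the number of values in the range 5:10.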
References
Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850
M. Lesnoff, Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data, Chemometrics and Intelligent Laboratory Systems. 244 (2023) 105031. https://doi.org/10.1016/j.chemolab.2023.105031.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlvdis = 5 ; metric = :mah
h = 1 ; k = 200 ; nlv = 4:20
mod = model(lwplsravg; nlvdis, metric, h, k, nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
Jchemo.lwplsrda — Method
lwplsrda(X, y; kwargs...)
kNN-LWPLSR-DA.
X: X-data (n, p).
y: Univariate class membership (n).
Keyword arguments:
nlvdis: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. If nlvdis = 0, there is no dimension reduction.
metric: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are: :eucl (Euclidean distance), :mah (Mahalanobis distance).
h: A scalar defining the shape of the weight function computed by function wdist. The lower h is, the sharper the function. See function wdist for details (keyword arguments criw and squared of wdist can also be specified here).
k: The number of nearest neighbors to select for each observation to predict.
tolw: Used for stabilization when some neighbors are very close.
nlv: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.
scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation for the global dimension reduction and the local models.
This is the same principle as function lwplsr, except that PLSR-DA models, instead of PLSR models, are fitted on the neighborhoods.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 2 ; k = 100
mod = model(lwplsrda; nlvdis, metric, h, k)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.mahsq — Method
mahsq(X, Y)
mahsq(X, Y, Sinv)
Squared Mahalanobis distances between the rows of X and Y.
X: Data (n, p).
Y: Data (m, p).
Sinv: Inverse of a covariance matrix S. If not given, S is computed as the uncorrected covariance matrix of X.
When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) whose element (i, j) is the squared distance between row i of X and row j of Y.
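The computation described above can be sketched with the standard library only. The helper name mahsq_sketch is hypothetical and this is not the package's implementation; it only follows the stated definition, including the default Sinv taken from the uncorrected covariance matrix of X.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not Jchemo's code): squared Mahalanobis distances
# between the rows of X (n, p) and the rows of Y (m, p).
function mahsq_sketch(X, Y, Sinv = inv(cov(X; corrected = false)))
    n, m = size(X, 1), size(Y, 1)
    D = zeros(n, m)
    for i in 1:n, j in 1:m
        d = X[i, :] .- Y[j, :]
        D[i, j] = dot(d, Sinv, d)      # d' * Sinv * d
    end
    D
end

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]
Y = [0.0 0.0; 2.5 2.75]
D = mahsq_sketch(X, Y)                            # (4, 2) matrix
Deucl = mahsq_sketch(X, Y, Matrix(1.0I, 2, 2))    # identity Sinv: squared Euclidean distances
```

Passing the identity matrix as Sinv reduces the result to squared Euclidean distances, which is a convenient sanity check.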
Examples
using StatsBase
X = rand(5, 3)
Y = rand(2, 3)
mahsq(X, Y)
S = cov(X, corrected = false)
Sinv = inv(S)
mahsq(X, Y, Sinv)
mahsq(X[1:1, :], Y[1:1, :], Sinv)
mahsq(X[:, 1], 4)
mahsq(1, 4, 2.1)
Jchemo.mahsqchol — Method
mahsqchol(X, Y)
mahsqchol(X, Y, Uinv)
Compute the squared Mahalanobis distances (with a Cholesky factorization) between the observations (rows) of X and Y.
X: Data (n, p).
Y: Data (m, p).
Uinv: Inverse of the upper triangular matrix of a Cholesky factorization of a covariance matrix S. If not given, the factorization is done on S, the uncorrected covariance matrix of X.
When X and Y are (n, p) and (m, p), respectively, the function returns an object (n, m) whose element (i, j) is the squared distance between row i of X and row j of Y.
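The role of Uinv can be checked numerically with a short sketch (standard library only; the variable names are illustrative and this is not the package's code). With S = U'U, the rows multiplied by Uinv live in a space where the Mahalanobis distance becomes the Euclidean distance.

```julia
using LinearAlgebra, Statistics

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]
Y = [0.0 0.0; 2.5 2.75]
S = cov(X; corrected = false)
U = cholesky(Hermitian(S)).U          # S = U' * U
Uinv = inv(U)

# Squared Euclidean distances between the transformed rows.
Xt = X * Uinv
Yt = Y * Uinv
Dchol = [sum(abs2, Xt[i, :] .- Yt[j, :]) for i in axes(X, 1), j in axes(Y, 1)]

# Reference computation with the explicit inverse of S.
Sinv = inv(S)
Dref = [dot(X[i, :] .- Y[j, :], Sinv, X[i, :] .- Y[j, :])
        for i in axes(X, 1), j in axes(Y, 1)]
```

Both matrices agree up to rounding, since inv(S) = Uinv * Uinv'.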
Examples
using LinearAlgebra, Statistics
X = rand(5, 3)
Y = rand(2, 3)
mahsqchol(X, Y)
S = cov(X, corrected = false)
U = cholesky(Hermitian(S)).U
Uinv = inv(U)
mahsqchol(X, Y, Uinv)
mahsqchol(X[:, 1], 4)
mahsqchol(1, 4, sqrt(2.1))
Jchemo.matB — Function
matB(X, y, weights::Weight)
Between-class covariance matrix.
X: X-data (n, p).
y: A vector (n) defining the class membership.
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Compute the between-class covariance matrix (output B) of X. This is the (non-corrected) covariance matrix of the weighted class centers.
Examples
using StatsBase
n = 20 ; p = 3
X = rand(n, p)
y = rand(1:3, n)
tab(y)
weights = mweight(ones(n))
res = matB(X, y, weights) ;
res.B
res.priors
res.ni
res.lev
res = matW(X, y, weights) ;
res.W
res.Wi
matW(X, y, weights).W + matB(X, y, weights).B
cov(X; corrected = false)
v = mweight(collect(1:n))
matW(X, y, v).priors
matB(X, y, v).priors
matW(X, y, v).W + matB(X, y, v).B
covm(X, v)
Jchemo.matW — Function
matW(X, y, weights::Weight)
Within-class covariance matrices.
X: X-data (n, p).
y: A vector (n) defining the class membership.
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Compute the (non-corrected) within-class and pooled covariance matrices (outputs Wi and W, respectively) of X.
If class i contains only one observation, Wi is computed by: covm(X, weights).
For examples, see function matB.
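The relation between the pooled within-class matrix, the between-class matrix and the total covariance matrix can be sketched as follows. This assumes uniform observation weights and classes with at least two observations; the helper name within_between is hypothetical and the code is not the package's implementation.

```julia
using Statistics

# Hypothetical helper (not Jchemo's code): uncorrected pooled within-class (W)
# and between-class (B) covariance matrices, with uniform observation weights.
function within_between(X, y)
    n, p = size(X)
    xbar = mean(X; dims = 1)                      # global center (1, p)
    W = zeros(p, p)
    B = zeros(p, p)
    for g in sort(unique(y))
        s = y .== g
        prior = sum(s) / n                        # class proportion
        Xg = X[s, :]
        cg = mean(Xg; dims = 1)                   # class center (1, p)
        Wi = (Xg .- cg)' * (Xg .- cg) / sum(s)    # uncorrected class covariance
        W += prior * Wi
        B += prior * (cg .- xbar)' * (cg .- xbar)
    end
    (W = W, B = B)
end

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0; 5.0 4.0; 6.0 8.0]
y = [1, 1, 2, 2, 3, 3]
res = within_between(X, y)
total = cov(X; corrected = false)                 # total uncorrected covariance
```

In this setting res.W + res.B equals the total uncorrected covariance matrix.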
Jchemo.mavg — Method
mavg(X; kwargs...)
Smoothing by moving averages of each row of X-data.
X: X-data (n, p).
Keyword arguments:
npoint: Nb. points involved in the window.
The smoothing is computed by convolution with padding, using function imfilter of package ImageFiltering.jl. The centered kernel is ones(npoint) / npoint. Each returned point is located on the center of the kernel.
The function returns a matrix (n, p).
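A plain-Julia sketch of a centered moving average is given below. It does not call ImageFiltering.jl: the helper name mavg_sketch is hypothetical, the sketch is restricted to an odd npoint, and the padding (edge values replicated) is an assumption, since the padding mode is not stated here.

```julia
# Hypothetical helper (not Jchemo's code): centered moving average of a
# vector, for an odd window size, with edge values replicated as padding.
function mavg_sketch(x::AbstractVector, npoint::Integer)
    isodd(npoint) || throw(ArgumentError("this sketch assumes an odd npoint"))
    half = npoint ÷ 2
    n = length(x)
    padded = vcat(fill(x[1], half), x, fill(x[end], half))
    [sum(padded[i:(i + npoint - 1)]) / npoint for i in 1:n]
end

x = [1.0, 2.0, 6.0, 4.0, 5.0]
z = mavg_sketch(x, 3)

# Row-wise application to a matrix (n, p); the output keeps the same size.
X = [x'; 2 .* x']
Z = mapslices(v -> mavg_sketch(v, 3), X; dims = 2)
```

Interior points are the plain mean of the window centered on them; only the first and last points depend on the padding choice.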
References
Package ImageFiltering.jl https://github.com/JuliaImages/ImageFiltering.jl
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(mavg; npoint = 10)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.mbconcat — Method
mbconcat(Xbl)
Concatenate horizontally multiblock X-data.
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
Examples
n = 5 ; m = 3 ; p = 10
X = rand(n, p)
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
Xblnew = mblock(Xnew, listbl)
@head Xbl[3]
mod = model(mbconcat)
fit!(mod, Xbl)
transf(mod, Xbl)
transf(mod, Xblnew)
Jchemo.mblock — Method
mblock(X, listbl)
Make blocks from a matrix.
X: X-data (n, p).
listbl: A vector in which each component gives the column numbers defining a block in X. The length of listbl is the number of blocks.
The function returns a list (vector) of blocks.
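The block construction, and the reverse horizontal concatenation, can be sketched with two one-line helpers. The names make_blocks and concat_blocks are hypothetical and the code is only an illustration of the idea, not the package's implementation.

```julia
# Hypothetical helpers (not Jchemo's code).
# vcat(cols) turns a scalar, a range or a vector into a vector of column
# indices, so that every block is returned as a matrix.
make_blocks(X, listbl) = [X[:, vcat(cols)] for cols in listbl]
concat_blocks(Xbl) = reduce(hcat, Xbl)

X = reshape(collect(1.0:50.0), 5, 10)
listbl = [3:4, 1, [6; 8:10]]
Xbl = make_blocks(X, listbl)      # 3 blocks with 2, 1 and 4 columns
Xcat = concat_blocks(Xbl)         # (5, 7) matrix
```

Columns not listed in listbl (here 2, 5 and 7) are simply left out of the blocks.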
Examples
n = 5 ; p = 10
X = rand(n, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
Xbl[1]
Xbl[2]
Xbl[3]
Jchemo.mbpca — Method
mbpca(Xbl; kwargs...)
mbpca(Xbl, weights::Weight; kwargs...)
mbpca!(Xbl::Matrix, weights::Weight; kwargs...)
Consensus principal components analysis (CPCA = MBPCA).
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
tol: Tolerance value for Nipals convergence.
maxit: Maximum number of iterations (Nipals).
scal: Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
The MBPCA global scores are equal to the scores of the PCA of the horizontal concatenation X = [X1 X2 ... Xk].
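As an illustration of this equivalence, the scores of a PCA of the concatenated blocks can be obtained from an SVD. The sketch below uses the standard library only and makes simplifying assumptions: uniform observation weights, column centering and no block scaling. It is not the Nipals algorithm of the function.

```julia
using LinearAlgebra, Statistics

# Two toy blocks sharing the same 6 observations.
X1 = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0; 5.0 4.0; 6.0 8.0]
X2 = [0.5 1.0 0.0; 1.5 0.0 1.0; 1.0 2.0 2.0; 3.0 1.0 0.5; 2.0 2.5 1.5; 4.0 3.0 2.0]
X = hcat(X1, X2)                       # horizontal concatenation
Xc = X .- mean(X; dims = 1)            # column centering

nlv = 2
F = svd(Xc)
T = F.U[:, 1:nlv] .* F.S[1:nlv]'       # non normed scores
W = F.V[:, 1:nlv]                      # loadings
```

The scores are the projections Xc * W, and their columns are mutually orthogonal.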
The function returns several objects, in particular:
T: The non normed global scores.
U: The normed global scores.
W: The global loadings.
Tbl: The block scores (grouped by blocks, in original scale).
Tb: The block scores (grouped by LV, in the metric scale).
Wbl: The block loadings.
lb: The specific weights "lambda".
mu: The sum of the specific weights (= eigenvalue of the global PCA).
Function summary returns:
explvarx: Proportion of the total inertia of X (sum of the squared norms of the blocks) explained by each global score.
contr_block: Contribution of each block to the global scores.
explX: Proportion of the inertia of the blocks explained by each global score.
corx2t: Correlation between the global scores and the original variables.
cortb2t: Correlation between the global scores and the block scores.
rv: RV coefficient.
lg: Lg coefficient.
References
Mangamana, E.T., Cariou, V., Vigneau, E., Glèlè Kakaï, R.L., Qannari, E.M., 2019. Unsupervised multiblock data analysis: A unified approach and extensions. Chemometrics and Intelligent Laboratory Systems 194, 103856. https://doi.org/10.1016/j.chemolab.2019.103856
Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1])
nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbpca; nlv, bscal, scal)
fit!(mod, Xbl)
pnames(mod)
pnames(mod.fm)
## Global scores
@head mod.fm.T
@head transf(mod, Xbl)
transf(mod, Xblnew)
## Blocks scores
i = 1
@head mod.fm.Tbl[i]
@head transfbl(mod, Xbl)[i]
res = summary(mod, Xbl) ;
pnames(res)
res.explvarx
res.contr_block
res.explX # = mod.fm.lb if bscal = :frob
rowsum(Matrix(res.explX))
res.corx2t
res.cortb2t
res.rv
Jchemo.mbplskdeda — Method
mbplskdeda(Xbl, y; kwargs...)
mbplskdeda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-KDEDA.
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
y: Univariate class membership (n).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This is the same principle as function plskdeda, for multiblock X-data.
See function mbplslda for examples.
Jchemo.mbplslda — Method
mbplslda(Xbl, y; kwargs...)
mbplslda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-LDA.
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
y: Univariate class membership (n).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This is the same principle as function plslda, for multiblock X-data.
Examples
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplslda; nlv, bscal, scal)
#mod = model(mbplsqda; nlv, bscal, alpha = .5, scal)
#mod = model(mbplskdeda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)
res = predict(mod, Xbltest) ;
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xbltest; nlv = 1:2).pred
Jchemo.mbplsqda — Method
mbplsqda(Xbl, y; kwargs...)
mbplsqda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-QDA.
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
y: Univariate class membership (n).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
alpha: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This is the same principle as function plsqda, for multiblock X-data.
See function mbplslda for examples.
Jchemo.mbplsr — Method
mbplsr(Xbl, Y; kwargs...)
mbplsr(Xbl, Y, weights::Weight; kwargs...)
mbplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)
Multiblock PLSR (MBPLSR).
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
Y: Y-data (n, q).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function runs a PLSR on {X, Y}, where X is the horizontal concatenation of the blocks in Xbl. The function gives the same results as function mbplswest, but is much faster.
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplsr; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)
res = predict(mod, Xbltest)
res.pred
rmsep(res.pred, ytest)
res = summary(mod, Xbltrain) ;
pnames(res)
res.explvarx
res.corx2t
res.rdx
Jchemo.mbplsrda — Method
mbplsrda(Xbl, y; kwargs...)
mbplsrda(Xbl, y, weights::Weight; kwargs...)
Discrimination based on multiblock partial least squares regression (MBPLSR-DA).
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
y: Univariate class membership (n).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This is the same principle as function plsrda, for multiblock X-data.
Examples
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
mod = model(mbplsrda; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
@head mod.fm.fm.T
@head transf(mod, Xbltrain)
@head transf(mod, Xbltest)
res = predict(mod, Xbltest) ;
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xbltest; nlv = 1:2).pred
Jchemo.mbplswest — Method
mbplswest(Xbl, Y; kwargs...)
mbplswest(Xbl, Y, weights::Weight; kwargs...)
mbplswest!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)
Multiblock PLSR (MBPLSR) - Nipals algorithm.
Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
Y: Y-data (n, q).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv: Nb. latent variables (LVs = scores T) to compute.
bscal: Type of block scaling. See function blockscal for possible values.
tol: Tolerance value for convergence (Nipals).
maxit: Maximum number of iterations (Nipals).
scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function implements the MBPLSR Nipals algorithm such as in Westerhuis et al. 1998. The function gives the same results as function mbplsr.
References
Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
bscal = :frob
scal = false
#scal = true
mod = model(mbplswest; nlv, bscal, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)
res = predict(mod, Xbltest)
res.pred
rmsep(res.pred, ytest)
res = summary(mod, Xbltrain) ;
pnames(res)
res.explvarx
res.corx2t
res.cortb2t
res.rdx
Jchemo.merrp — Method
merrp(pred, y)
Compute the mean intra-class classification error rate.
pred: Predictions.
y: Observed data (class membership).
ERRP (see function errp) is computed for each class. Function merrp returns the average of these intra-class ERRPs.
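The difference with the overall error rate can be seen on a tiny example. The helper name merrp_sketch is hypothetical and works on plain vectors; it only follows the definition above and is not the package's code.

```julia
using Statistics

# Hypothetical helper (not Jchemo's code): error rate within each observed
# class, then the unweighted mean over the classes.
function merrp_sketch(pred, y)
    lev = sort(unique(y))
    err = [mean(pred[y .== g] .!= g) for g in lev]
    mean(err)
end

y    = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "a"]
m = merrp_sketch(pred, y)      # class "a": 1/3, class "b": 1, mean: 2/3
overall = mean(pred .!= y)     # overall error rate: 2/4
```

With unbalanced classes, the mean intra-class rate gives each class the same importance, whereas the overall rate is dominated by the largest class.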
Examples
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
merrp(pred, ytest)
Jchemo.miss — Method
miss(X)
Find rows with missing data in a dataset.
X: A dataset.
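The underlying idea can be sketched in one line of base Julia. The helper name rows_with_missing is hypothetical, and the format of its output (a vector of row indices) is an assumption that may differ from what miss returns.

```julia
# Hypothetical helper (not Jchemo's code): indices of the rows that
# contain at least one missing value.
rows_with_missing(X) = [i for i in axes(X, 1) if any(ismissing, view(X, i, :))]

X = [1.0 2.0; 3.0 4.0]
Z = [1.0 missing; 3.0 4.0; missing missing]
rx = rows_with_missing(X)      # no row has missing data
rz = rows_with_missing(Z)      # rows 1 and 3
```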
Examples
X = rand(5, 4)
zX = hcat(rand(2, 3), fill(missing, 2))
Z = vcat(X, zX)
miss(X)
miss(Z)
Jchemo.mlev — Method
mlev(x)
Return the sorted levels of a vector or a dataset.
Examples
x = rand(["a";"b";"c"], 20)
lev = mlev(x)
nlev = length(lev)
X = reshape(x, 5, 4)
mlev(X)
using DataFrames
n = 10  # nb. rows of the example data frame
df = DataFrame(g1 = rand(1:2, n),
g2 = rand(["a"; "c"], n))
mlev(df)
Jchemo.mlr — Method
mlr(X, Y; kwargs...)
mlr(X, Y, weights::Weight; kwargs...)
mlr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Compute a multiple linear regression model (MLR) by using the QR algorithm.
X: X-data (n, p).
Y: Y-data (n, q).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
noint: Boolean defining whether the model is computed with an intercept or not.
Safe, but can be a little slower than other methods.
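A least-squares fit through a QR factorization can be sketched with the standard library as below. This is an unweighted illustration with an explicit intercept column, not the package's implementation; the variable names are illustrative.

```julia
using LinearAlgebra

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0; 5.0 4.0]
y = 1.0 .+ X * [2.0, -1.0]            # noise-free response: intercept 1, slopes [2, -1]

Z = hcat(ones(size(X, 1)), X)         # add the intercept column
b = qr(Z) \ y                         # least-squares solution from the QR factors
b0, B = b[1], b[2:end]                # intercept and slopes
```

The QR route works directly on Z and never forms Z'Z, which is what makes it numerically safe.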
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 2:4]
y = dat.X[:, 1]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
mod = model(mlr)
#mod = model(mlrchol)
#mod = model(mlrpinv)
#mod = model(mlrpinvn)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.B
fm.int
coef(mod)
res = predict(mod, Xtest)
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
mod = model(mlr; noint = true)
fit!(mod, Xtrain, ytrain)
coef(mod)
Jchemo.mlrchol — Method
mlrchol(X, Y)
mlrchol(X, Y, weights::Weight)
mlrchol!(X::Matrix, Y::Matrix, weights::Weight)
Compute a multiple linear regression model (MLR) using the Normal equations and a Cholesky factorization.
X: X-data, with nb. columns >= 2 (required by function cholesky).
Y: Y-data (n, q).
weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Compute a model with intercept.
Faster, but can be less accurate (it is based on the cross-product matrix X'X).
See function mlr for examples.
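The normal-equations route can be sketched as follows (standard library only, unweighted, with an explicit intercept column; not the package's implementation). The last lines illustrate why accuracy can suffer: the condition number of Z'Z is the square of that of Z.

```julia
using LinearAlgebra

X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0; 5.0 4.0]
y = 1.0 .+ X * [2.0, -1.0]            # noise-free response: intercept 1, slopes [2, -1]

Z = hcat(ones(size(X, 1)), X)         # add the intercept column
A = Symmetric(Z' * Z)                 # matrix of the normal equations
b = cholesky(A) \ (Z' * y)            # solve A * b = Z'y with the Cholesky factors

cZ = cond(Z)
cA = cond(Z' * Z)                     # equals cZ^2 up to rounding
```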
Jchemo.mlrda
— Methodmlrda(X, y; kwargs...)
mlrda(X, y, weights::Weight)
Discrimination based on multple linear regression (MLR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(y)).
The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a multiple linear regression (MLR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. possibly outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the function, the observation weights used in the MLR are defined with argument prior. For other choices, use the low-level version (argument weights).
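The principle described above can be sketched in a few lines of plain Julia. The function mlrda_sketch below is hypothetical and deliberately simplified: it uses an unweighted least-squares fit (no prior-based observation weights), so it only illustrates the dummy-table and arg-max steps, not the Jchemo implementation.

```julia
# Hypothetical sketch of the MLR-DA principle (unweighted, for illustration only)
function mlrda_sketch(Xtrain, ytrain, Xtest)
    lev = sort(unique(ytrain))                        # class levels
    Ydummy = Float64.(ytrain .== permutedims(lev))    # (n, nlev) 0/1 dummy table
    Z = [ones(size(Xtrain, 1)) Xtrain]                # add an intercept column
    B = Z \ Ydummy                                    # MLR on {X, Ydummy}
    posterior = [ones(size(Xtest, 1)) Xtest] * B      # unbounded estimates
    ## Final prediction: class with the highest estimate
    pred = [lev[argmax(posterior[i, :])] for i in axes(posterior, 1)]
    (pred = pred, posterior = posterior, lev = lev)
end

Xtrain = [0.0 0.1; 0.2 0.0; 0.1 0.2; 5.0 5.1; 5.2 4.9; 4.9 5.0]
ytrain = ["a", "a", "a", "b", "b", "b"]
Xtest = [0.1 0.1; 5.0 5.0]
res = mlrda_sketch(Xtrain, ytrain, Xtest)
res.pred
res.posterior
```

Because the model includes an intercept and the dummy columns sum to 1, each row of res.posterior sums to 1, even though individual values may fall outside [0, 1].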
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
mod = model(mlrda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.mlrpinv
— Methodmlrpinv(; kwargs...)
mlrpinv(X, Y; kwargs...)
mlrpinv(X, Y, weights::Weight; kwargs...)
mlrpinv!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Compute a multiple linear regression model (MLR) by using a pseudo-inverse.
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
noint : Boolean. Defines whether the model is computed without an intercept (true) or with one (false).
Safe but can be slower.
See function mlr for examples.
Jchemo.mlrpinvn
— Methodmlrpinvn()
mlrpinvn(X, Y)
mlrpinvn(X, Y, weights::Weight)
mlrpinvn!(X::Matrix, Y::Matrix, weights::Weight)
Compute a multiple linear regression model (MLR) by using the Normal equations and a pseudo-inverse.
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Safe and fast for p not too large.
Computes a model with an intercept.
See function mlr for examples.
Jchemo.mlrvec
— Methodmlrvec(; kwargs...)
mlrvec(X, Y; kwargs...)
mlrvec(X, Y, weights::Weight; kwargs...)
mlrvec!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Compute a simple linear regression model (univariate x).
x : Univariate X-data (n).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
noint : Boolean. Defines whether the model is computed without an intercept (true) or with one (false).
See function mlr for examples.
Jchemo.model
— Methodmodel(fun::Function; kwargs...)
Build a model.
fun : The function defining the model.
kwargs... : Keyword arguments of fun.
Examples
X = rand(5, 10)
y = rand(5)
mod = model(detrend) # use the default arguments of 'detrend'
#mod = model(detrend; degree = 2)
pnames(mod)
fit!(mod, X)
Xp = transf(mod, X)
mod = model(plskern; nlv = 3)
fit!(mod, X, y)
pred = predict(mod, X).pred
Jchemo.mpar
— Functionmpar(; kwargs...)
Return a tuple with all the combinations of the parameter values defined in kwargs.
Keyword arguments:
kwargs : Vector(s) of the parameter(s) values.
Examples
nlvdis = 25 ; metric = [:mah]
h = [1 ; 2 ; Inf] ; k = [500 ; 1000]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
reduce(hcat, pars)
Jchemo.mse
— Methodmse(pred, Y; digits = 3)
Summary of model performance for regression.
pred : Predictions.
Y : Observed data.
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
mse(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
mse(pred, ytest)
Jchemo.msep
— Methodmsep(pred, Y)
Compute the mean of the squared prediction errors (MSEP).
pred : Predictions.
Y : Observed data.
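For reference, the quantity can be written directly with the Statistics standard library. The one-liner msep_sketch below is a hypothetical stand-in that assumes one MSEP value per column of Y; the exact output layout of Jchemo's msep is not confirmed here.

```julia
using Statistics

# Hypothetical stand-in: mean of the squared prediction errors, per column
msep_sketch(pred, Y) = mean((pred .- Y) .^ 2; dims = 1)

pred = [1.0 2.0; 3.0 4.0]
Y = [1.5 2.0; 2.5 5.0]
msep_sketch(pred, Y)    # 1×2 matrix: [0.25 0.5]
```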
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
msep(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
msep(pred, ytest)
Jchemo.mweight
— Methodmweight(x::Vector)
Return an object of type Weight containing the vector w = x / sum(x) (if the object is built ad hoc, w must sum to 1).
Examples
x = rand(10)
w = mweight(x)
sum(w.w)
Jchemo.mweightcla
— Methodmweightcla(x::Vector; prior::Union{Symbol, Vector} = :unif)
mweightcla(Q::DataType, x::Vector; prior::Union{Symbol, Vector} = :unif)
Compute observation weights for a categorical variable, given specified sub-total weights for the classes.
x : A categorical variable (n) (class membership).
Q : A data type (e.g. Float32).
Keyword arguments:
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
Return an object of type Weight (see function mweight) containing a vector w (n) that sums to 1.
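The idea of sub-total weights per class can be illustrated as follows. The function classweights_sketch is hypothetical (it is not mweightcla and returns a plain vector rather than a Weight object): each class receives its prior weight, shared equally among the observations of that class.

```julia
# Hypothetical sketch: each class gets a sub-total weight prior[i],
# split equally between its observations.
function classweights_sketch(x::Vector, prior::Vector{Float64})
    lev = sort(unique(x))            # class levels, sorted
    prior = prior / sum(prior)       # priors are normalized to sum to 1
    w = zeros(length(x))
    for (i, l) in enumerate(lev)
        s = x .== l                  # observations of class l
        w[s] .= prior[i] / count(s)
    end
    w
end

x = ["a", "a", "a", "b", "c", "c"]
w = classweights_sketch(x, [1/3, 1/3, 1/3])   # uniform prior over the classes
sum(w)
```

Here each observation of class "a" gets 1/9, the single "b" observation gets 1/3, and each "c" observation gets 1/6.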
Examples
x = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
tab(x)
weights = mweightcla(x)
#weights = mweightcla(x; prior = :prop)
#weights = mweightcla(x; prior = [.1, .7, .2])
aggstat(weights.w, x; fun = sum).X
Jchemo.nco
— Methodnco(X)
Return the nb. columns of X.
Jchemo.nipals
— Methodnipals(X; kwargs...)
nipals(X, UUt, VVt; kwargs...)
Nipals to compute the first score and loading vectors of a matrix.
X : X-data (n, p).
UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.
Keyword arguments:
tol : Tolerance value for stopping the iterations.
maxit : Maximum nb. of iterations.
The function finds:
- {u, v, sv} = argmin(||X - u * sv * v'||)
with the constraints:
- ||u|| = ||v|| = 1
using the alternating least squares algorithm to compute the SVD (Gabriel & Zamir 1979).
At the end, X ~ u * sv * v', where:
- u : left singular vector (u * sv = scores)
- v : right singular vector (loadings)
- sv : singular value.
When NIPALS is used on sequentially deflated matrices, vectors u and v can lose orthogonality due to the accumulation of rounding errors. Orthogonality can be restored with the Gram-Schmidt method (arguments UUt and VVt).
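The alternating least squares iteration described above can be sketched as follows. The function nipals_sketch is a hypothetical, simplified version (no Gram-Schmidt step, and the starting vector is an arbitrary choice made for this example); it is not the Jchemo implementation.

```julia
using LinearAlgebra

# Hypothetical sketch of the alternating least squares iteration
# for the first singular triplet (u, sv, v) of X.
function nipals_sketch(X::Matrix{Float64}; tol = 1e-12, maxit = 500)
    ## Starting vector: the column of X with the largest norm (arbitrary choice)
    u = X[:, argmax(vec(sum(abs2, X; dims = 1)))]
    u ./= norm(u)
    v = similar(X, size(X, 2))
    sv = 0.0
    niter = 0
    for it in 1:maxit
        niter = it
        v = X' * u            # update the loadings given u
        v ./= norm(v)         # constraint ||v|| = 1
        unew = X * v          # update the scores given v
        sv = norm(unew)       # singular value
        unew ./= sv           # constraint ||u|| = 1
        delta = norm(unew - u)
        u = unew
        delta < tol && break
    end
    (u = u, v = v, sv = sv, niter = niter)
end

X = [3.0 1.0; 1.0 3.0; 0.0 0.0]
res = nipals_sketch(X)
res.sv
svd(X).S[1]
```

The vectors u and v are defined up to a common sign, so they may differ in sign from the first columns of svd(X).U and svd(X).V.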
References
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Examples
using LinearAlgebra
X = rand(5, 3)
res = nipals(X)
res.niter
res.sv
svd(X).S[1]
res.v
svd(X).V[:, 1]
res.u
svd(X).U[:, 1]
Jchemo.nipalsmiss
— Methodnipalsmiss(X; kwargs...)
nipalsmiss(X, UUt, VVt; kwargs...)
Nipals to compute the first score and loading vectors of a matrix with missing data.
X : X-data (n, p).
UUt : Matrix (n, n) for Gram-Schmidt orthogonalization.
VVt : Matrix (p, p) for Gram-Schmidt orthogonalization.
Keyword arguments:
tol : Tolerance value for stopping the iterations.
maxit : Maximum nb. of iterations.
See function nipals.
References
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Examples
X = [1. 2 missing 4 ; 4 missing 6 7 ;
missing 5 6 13 ; missing 18 7 6 ;
12 missing 28 7]
res = nipalsmiss(X)
res.niter
res.sv
res.v
res.u
Jchemo.normw
— Methodnormw(x, weights::Weight)
Compute the weighted norm of a vector.
x : A vector (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
The weighted norm of vector x is computed by:
- sqrt(x' * D * x), where D is the diagonal matrix of vector weights.w.
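As a small numerical check of this formula, the snippet below computes the weighted norm with a plain weight vector (the values of x and w are arbitrary and chosen for this example; no Weight object is used).

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0]
w = [0.2, 0.3, 0.5]          # example weights summing to 1
D = Diagonal(w)
nw = sqrt(x' * D * x)        # weighted norm
## Equivalent element-wise form
nw ≈ sqrt(sum(w .* x .^ 2))
```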
Jchemo.nro
— Methodnro(X)
Return the nb. rows of X.
Jchemo.occod
— Methodoccod(fm, X; kwargs...)
One-class classification using PCA/PLS orthogonal distance (OD).
fm : The preliminary model (e.g. PCA) that was fitted (object fm) on the training data assumed to represent the training class.
X : Training X-data (n, p), on which was fitted the model fm.
Keyword arguments:
mcut : Type of cutoff. Possible values are: :mad, :q. See function occsd.
cri : When mcut = :mad, a constant. See function occsd.
risk : When mcut = :q, a risk-I level. See function occsd.
In this method, the outlierness d of an observation is the orthogonal distance (OD = "X-residuals") of this observation, i.e. the Euclidean distance between the observation and its projection on the score space defined by the fitted (e.g. PCA) model (e.g. Hubert et al. 2005, Vanden Branden & Hubert 2005 p. 66, Varmuza & Filzmoser 2009 p. 79).
See function occsd for details on outputs.
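To make the definition of the orthogonal distance concrete, here is a minimal sketch based on an unweighted PCA computed by SVD. The function od_sketch is hypothetical and is not the Jchemo implementation; it only shows how the X-residuals lead to the distance.

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch: orthogonal distance of new observations
# to the score space of an (unweighted) PCA fitted on Xtrain.
function od_sketch(Xtrain, Xnew; nlv)
    xmeans = mean(Xtrain; dims = 1)
    P = svd(Xtrain .- xmeans).V[:, 1:nlv]    # loadings (p, nlv)
    Xc = Xnew .- xmeans                      # centering with the training means
    E = Xc - Xc * P * P'                     # X-residuals
    [norm(E[i, :]) for i in axes(E, 1)]      # Euclidean norm of each residual row
end

## Training data lying in the plane spanned by the two first variables
Xtrain = [-2.0 1 0; -1 -1 0; 1 1 0; 2 -1 0]
Xnew = [0.5 0.3 0.0; 1.0 2.0 4.0]
od = od_sketch(Xtrain, Xnew; nlv = 2)
```

The first new observation lies in the training plane (OD close to 0), while the second is at distance 4 from it.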
References
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.
K. Vanden Branden, M. Hubert (2005). Robust classification in high dimension based on the SIMCA method. Chem. Lab. Int. Syst, 79, 10-21.
K. Varmuza, P. Filzmoser (2009). Introduction to multivariate statistical analysis in chemometrics. CRC Press, Boca Raton.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X)
Xp = transf(mod, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
mod0 = model(pcasvd; nlv = 10)
fit!(mod0, zXtrain)
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class",
xlabel = string("PC", i), ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 10)
fit!(mod0, zXtrain)
fm0 = mod0.fm ;
## Outlierness
mod = model(occod)
#mod = model(occod; mcut = :mad, cri = 4)
#mod = model(occod; mcut = :q, risk = .01) ;
#mod = model(occsdod)
fit!(mod, fm0, zXtrain)
pnames(mod)
pnames(mod.fm)
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300),
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class",
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.occsd
— Methodoccsd(fm; kwargs...)
One-class classification using PCA/PLS score distance (SD).
fm : The preliminary model (e.g. PCA) that was fitted (object fm) on the training data assumed to represent the training class.
Keyword arguments:
mcut : Type of cutoff. Possible values are: :mad, :q. See below.
cri : When mcut = :mad, a constant. See below.
risk : When mcut = :q, a risk-I level. See below.
In this method, the outlierness d of an observation is defined by its score distance (SD), i.e. the Mahalanobis distance between the projection of the observation on the score space defined by the fitted (e.g. PCA) model and the center of the score space.
If a new observation has d higher than a given cutoff, the observation is assumed to not belong to the training (= reference) class. The cutoff is computed with non-parametric heuristics. Noting [d] the vector of outliernesses computed on the training class:
- If mcut = :mad, then cutoff = median([d]) + cri * mad([d]).
- If mcut = :q, then cutoff is estimated from the empirical cumulative distribution function computed on [d], for a given risk-I (risk).
Alternative approximate cutoffs have been proposed in the literature (e.g.: Nomikos & MacGregor 1995, Hubert et al. 2005, Pomerantsev 2008). Typically, and whatever the approximation method used to compute the cutoff, it is recommended to tune this cutoff depending on the detection objectives.
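The two cutoff heuristics can be sketched with the Statistics standard library. The helper names below are hypothetical, and the MAD is scaled here by the usual consistency constant 1.4826, which is an assumption (the scaling used by Jchemo is not stated in this page). The values of cri and risk must be supplied by the caller in this sketch.

```julia
using Statistics

# Hypothetical helpers (not Jchemo's code)
mad_sketch(d) = 1.4826 * median(abs.(d .- median(d)))     # scaled MAD (assumption)
cutoff_mad(d; cri) = median(d) + cri * mad_sketch(d)      # mcut = :mad
cutoff_q(d; risk) = quantile(d, 1 - risk)                 # mcut = :q

## Outliernesses of the training class (illustrative values)
d = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 3.0]
cutoff = cutoff_mad(d; cri = 3)
dstand = d ./ cutoff                          # standardized distances
pred = ifelse.(dstand .> 1, "out", "in")      # class prediction
```

In this example only the last observation (d = 3.0) exceeds the cutoff and is flagged as "out".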
Outputs
pval : Estimate of p-value (see function pval) computed from the training distribution [d].
dstand : Standardized distance defined as d / cutoff. A value dstand > 1 may be considered as extreme compared to the distribution of the training data.
gh : The Winisi "GH" (usually, GH > 3 is considered as extreme).
Specific for function predict:
pred : Class prediction.
- dstand <= 1 ==> in : the observation is expected to belong to the training class,
- dstand > 1 ==> out : extreme value, possibly not belonging to the same class as the training.
References
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.
Nomikos, P., MacGregor, J.F., 1995. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 37, 41-59. https://doi.org/10.1080/00401706.1995.10485888
Pomerantsev, A.L., 2008. Acceptance areas for multivariate classification derived by projection methods. Journal of Chemometrics 22, 601-609. https://doi.org/10.1002/cem.1147
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X)
Xp = transf(mod, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
mod = model(pcasvd; nlv = 10)
fit!(mod, zXtrain)
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class",
xlabel = string("PC", i), ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary PCA fitted model
mod0 = model(pcasvd; nlv = 30)
fit!(mod0, zXtrain)
fm0 = mod0.fm ;
## Outlierness
mod = model(occsd)
#mod = model(occsd; mcut = :mad, cri = 4)
#mod = model(occsd; mcut = :q, risk = .01)
fit!(mod, fm0)
pnames(mod)
pnames(mod.fm)
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300),
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(mod, zXtest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class",
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.occsdod
— Methodoccsdod(object, X; kwargs...)
One-class classification using a compromise between PCA/PLS score (SD) and orthogonal (OD) distances.
fm : The preliminary model (e.g. PCA) that was fitted (object fm) on the training data assumed to represent the training class.
X : Training X-data (n, p), on which was fitted the model fm.
Keyword arguments:
mcut : Type of cutoff. Possible values are: :mad, :q. See function occsd.
cri : When mcut = :mad, a constant. See function occsd.
risk : When mcut = :q, a risk-I level. See function occsd.
In this method, the outlierness d of a given observation is a compromise between the score distance (SD) and the orthogonal distance (OD). The compromise is computed from the standardized distances by: dstand = sqrt(dstand_sd * dstand_od).
See functions:
- occsd for details of the outputs,
- and occod for examples.
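The compromise is a geometric mean of the two standardized distances. A tiny numerical illustration follows (the distance values are made up for the example):

```julia
## Illustrative standardized distances for three observations
dstand_sd = [0.5, 1.5, 2.0]
dstand_od = [0.8, 0.4, 2.0]
dstand = sqrt.(dstand_sd .* dstand_od)       # compromise (geometric mean)
pred = ifelse.(dstand .> 1, "out", "in")     # same decision rule as in occsd
```

Note that the second observation is extreme for SD alone (1.5 > 1) but not for the compromise, since its OD is small.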
Jchemo.occstah
— Methodoccstah(X; kwargs...)
One-class classification using the Stahel-Donoho outlierness.
X : Training X-data (n, p).
Keyword arguments:
nlv : Nb. dimensions on which X is projected.
mcut : Type of cutoff. Possible values are: :mad, :q. See function occsd.
cri : When mcut = :mad, a constant. See function occsd.
risk : When mcut = :q, a risk-I level. See function occsd.
scal : Boolean. If true, each column of X is scaled such as in function stah.
In this method, the outlierness d of a given observation is the Stahel-Donoho outlierness (see ?stah).
See function occsd for details on outputs.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
mod = model(savgol; npoint = 21, deriv = 2, degree = 3)
fit!(mod, X)
Xp = transf(mod, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
mod = model(pcasvd; nlv = 10)
fit!(mod, zXtrain)
Ttrain = mod.fm.T
Ttest = transf(mod, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class",
xlabel = string("PC", i), ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary dimension reduction
## (not required but often more efficient)
nlv = 50
mod0 = model(pcasvd; nlv) ;
fit!(mod0, zXtrain)
Ttrain = mod0.fm.T
Ttest = transf(mod0, zXtest)
## Outlierness
mod = model(occstah; nlv, scal = true)
fit!(mod, Ttrain)
pnames(mod)
pnames(mod.fm)
@head d = mod.fm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), xlabel = "Obs. index",
ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(mod, Ttest) ;
pnames(res)
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = mod.fm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class",
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.out
— Methodout(x, y)
Return whether the elements of a vector are strictly outside of a given range.
x : Univariate data.
y : Univariate data on which is computed the range (min, max).
Return a BitVector.
Examples
x = [-200.; -100; -1; 0; 1; 200]
out(x, [-1; .2; 1])
out(x, (-1, 1))
Jchemo.pcaeigen
— Methodpcaeigen(X; kwargs...)
pcaeigen(X, weights::Weight; kwargs...)
pcaeigen!(X::Matrix, weights::Weight; kwargs...)
PCA by Eigen factorization.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of X' * D * X.
See function pcasvd for examples.
Jchemo.pcaeigenk
— Methodpcaeigenk(X; kwargs...)
pcaeigenk(X, weights::Weight; kwargs...)
pcaeigenk!(X::Matrix, weights::Weight; kwargs...)
PCA by Eigen factorization of the kernel matrix XX'.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
This is the "kernel cross-product" version of the PCA algorithm (e.g. Wu et al. 1997). For wide matrices (n << p, where p is the nb. columns) and n not too large, this algorithm can be much faster than the others.
Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing an Eigen factorization of D^(1/2) * X * X' * D^(1/2).
See function pcasvd for examples.
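The kernel approach can be sketched as follows, using only the LinearAlgebra standard library. The function pcaeigenk_sketch is hypothetical (plain weight vector, no scaling option) and is not the Jchemo implementation; it shows how the loadings are recovered from the eigenvectors of the small (n, n) matrix.

```julia
using LinearAlgebra

# Hypothetical sketch of PCA by Eigen factorization of D^(1/2) * X * X' * D^(1/2)
function pcaeigenk_sketch(X::Matrix{Float64}, w::Vector{Float64}; nlv::Int)
    w = w / sum(w)
    Xc = X .- (X' * w)'                           # centering in metric D
    Z = sqrt.(w) .* Xc                            # D^(1/2) * X
    E = eigen(Symmetric(Z * Z'))                  # (n, n) kernel matrix
    ord = sortperm(E.values; rev = true)[1:nlv]   # largest eigenvalues first
    sv = sqrt.(E.values[ord])                     # singular values
    U = E.vectors[:, ord]
    P = (Z' * U) ./ sv'                           # loadings (p, nlv)
    T = Xc * P                                    # scores (n, nlv)
    (T = T, P = P, sv = sv)
end

## Wide matrix: n = 4 observations, p = 10 variables
X = [sin(i * j) for i in 1:4, j in 1:10]
w = fill(0.25, 4)
res = pcaeigenk_sketch(X, w; nlv = 2)
res.sv
```

The singular values in res.sv should match the first singular values of D^(1/2) * X obtained by a direct SVD, and the columns of res.P should be orthonormal.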
References
Wu, W., Massart, D.L., de Jong, S., 1997. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems 36, 165-172. https://doi.org/10.1016/S0169-7439(97)00010-5
Jchemo.pcanipals
— Methodpcanipals(X; kwargs...)
pcanipals(X, weights::Weight; kwargs...)
pcanipals!(X::Matrix, weights::Weight; kwargs...)
PCA by NIPALS algorithm.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
tol : Tolerance value for stopping the iterations.
maxit : Maximum nb. of iterations.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D by NIPALS.
See function pcasvd for examples.
References
Andrecut, M., 2009. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology 16, 1593-1599. https://doi.org/10.1089/cmb.2008.0221
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Gabriel, R. K., 2002. Le biplot - Outil d'exploration de données multidimensionnelles. Journal de la Société Française de la Statistique, 143, 5-55.
Lingen, F.J., 2000. Efficient Gram-Schmidt orthonormalisation on parallel computers. Communications in Numerical Methods in Engineering 16, 57-66. https://doi.org/10.1002/(SICI)1099-0887(200001)16:1<57::AID-CNM320>3.0.CO;2-I
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Jchemo.pcanipalsmiss
— Methodpcanipalsmiss(X; kwargs...)
pcanipalsmiss(X, weights::Weight; kwargs...)
pcanipalsmiss!(X::Matrix, weights::Weight; kwargs...)
PCA by NIPALS algorithm allowing missing data.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.
tol : Tolerance value for stopping the iterations.
maxit : Maximum nb. of iterations.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
References
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Examples
X = [1 2. missing 4 ; 4 missing 6 7 ;
missing 5 6 13 ; missing 18 7 6 ;
12 missing 28 7]
nlv = 3
tol = 1e-15
scal = false
#scal = true
gs = false
#gs = true
mod = model(pcanipalsmiss; nlv, tol, gs, maxit = 500, scal)
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.niter
fm.sv
fm.P
fm.T
## Orthogonality
## only if gs = true
fm.T' * fm.T
fm.P' * fm.P
## Impute missing data in X
mod = model(pcanipalsmiss; nlv = 2, gs = true) ;
fit!(mod, X)
Xfit = xfit(mod.fm)
s = ismissing.(X)
X_imput = copy(X)
X_imput[s] .= Xfit[s]
X_imput
Jchemo.pcasph
— Methodpcasph(X; kwargs...)
pcasph(X, weights::Weight; kwargs...)
pcasph!(X::Matrix, weights::Weight; kwargs...)
Spherical PCA.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Spherical PCA (Locantore et al. 1999, Maronna 2005, Daszykowski et al. 2007). Matrix X is centered by the spatial median computed by function Jchemo.colmedspa.
References
Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016
Locantore N., Marron J.S., Simpson D.G., Tripoli N., Zhang J.T., Cohen K.L. Robust principal component analysis for functional data, Test 8 (1999) 1–7
Maronna, R., 2005. Principal components and orthogonal regression based on robust scales, Technometrics, 47:3, 264-273, DOI: 10.1198/004017005000000166
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2")
@load db dat
pnames(dat)
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)
nlv = 6
mod = model(pcasph; nlv)
#mod = model(pcasvd; nlv)
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
transf(mod, X)
i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = "PC1",
ylabel = "PC2").f
Jchemo.pcasvd
— Methodpcasvd(X; kwargs...)
pcasvd(X, weights::Weight; kwargs...)
pcasvd!(X::Matrix, weights::Weight; kwargs...)
PCA by SVD factorization.
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of principal components (PCs).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w) and X the centered matrix in metric D. The function minimizes ||X - T * P'||^2 in metric D, by computing a SVD factorization of sqrt(D) * X:
- sqrt(D) * X ~ U * S * V'
Outputs are:
- T = D^(-1/2) * U * S
- P = V
- The diagonal of S
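The relations above can be checked numerically with a short sketch. The function pcasvd_sketch is hypothetical (plain weight vector, no scaling option) and is not the Jchemo implementation.

```julia
using LinearAlgebra

# Hypothetical sketch of a weighted PCA by SVD of sqrt(D) * X
function pcasvd_sketch(X::Matrix{Float64}, w::Vector{Float64}; nlv::Int)
    w = w / sum(w)
    Xc = X .- (X' * w)'                      # centering in metric D
    sqrtw = sqrt.(w)
    F = svd(sqrtw .* Xc)                     # sqrt(D) * X = U * S * V'
    sv = F.S[1:nlv]                          # diagonal of S
    P = F.V[:, 1:nlv]                        # P = V
    T = (F.U[:, 1:nlv] .* sv') ./ sqrtw      # T = D^(-1/2) * U * S
    (T = T, P = P, sv = sv)
end

X = [1.0 2.0 0.5; 2.0 1.0 1.5; 3.0 4.0 0.0; 4.0 3.0 2.5; 5.0 6.0 1.0]
w = fill(0.2, 5)
res = pcasvd_sketch(X, w; nlv = 2)
## Scores are orthogonal in metric D: T' * D * T = S^2
res.T' * Diagonal(w) * res.T
res.sv .^ 2
```

The scores also equal the projection of the centered data on the loadings (T = X * P, with X centered).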
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
@head Xtrain = X[s.train, :]
@head Xtest = X[s.test, :]
nlv = 3
mod = model(pcasvd; nlv)
#mod = model(pcaeigen; nlv)
#mod = model(pcaeigenk; nlv)
#mod = model(pcanipals; nlv)
fit!(mod, Xtrain)
pnames(mod)
pnames(mod.fm)
@head T = mod.fm.T
## Same as:
@head transf(mod, Xtrain)
T' * T
@head P = mod.fm.P
P' * P
@head Ttest = transf(mod, Xtest)
res = summary(mod, Xtrain) ;
pnames(res)
res.explvarx
res.contr_var
res.coord_var
res.cor_circle
Jchemo.pcr
— Methodpcr(X, Y; kwargs...)
pcr(X, Y, weights::Weight; kwargs...)
pcr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Principal component regression (PCR) with a SVD factorization.
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
mod = model(pcr; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]
res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.pip
— Methodpip(args...)
Build a pipeline of models.
args... : Successive models, see examples.
Examples
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Pipeline Snv :> Savgol :> Pls :> Svmr
mod1 = model(snv; centr = true, scal = true)
npoint = 11 ; deriv = 2 ; degree = 3
mod2 = model(savgol; npoint, deriv, degree)
mod3 = model(plskern; nlv = 15)
mod4 = model(svmr; gamma = 1e3, cost = 100, epsilon = .9)
mod = pip(mod1, mod2, mod3, mod4)
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.plist
— Methodplist(x)
Print each element of a list.
Jchemo.plotconf
— Methodplotconf(object; size = (500, 400), cnt = true, ptext = true,
fontsize = 15, coldiag = :red)
Plot a confusion matrix.
object : Output of function conf.
Keyword arguments:
size : Size (horizontal, vertical) of the figure.
cnt : Boolean. If true, plot the occurrences, else plot the row %s.
ptext : Boolean. If true, display the value in each cell.
fontsize : Font size when ptext = true.
coldiag : Font color when ptext = true.
See examples in help page of function conf.
Jchemo.plotgrid
— Methodplotgrid(indx::AbstractVector, r;
size = (500, 300), step = 5, color = nothing,
kwargs...)
plotgrid(indx::AbstractVector, r, group;
size = (700, 350), step = 5, color = nothing,
leg = true, leg_title = "Group", kwargs...)
Plot error/performance rates of a model.
indx : A numeric variable representing the grid of model parameters, e.g. the nb. LVs for PLSR models.
r : The error/performance rate.
Keyword arguments:
group : Categorical variable defining groups. A separate line is plotted for each level of group.
size : Size (horizontal, vertical) of the figure.
step : Step used for defining the xticks.
color : Set color. If group is used, must be a vector of same length as the number of levels in group.
leg : Boolean. If group is used, display a legend or not.
leg_title : Title of the legend.
kwargs : Optional arguments to pass in Axis of CairoMakie.
To use plotgrid, a backend (e.g. CairoMakie) has to be specified.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
mod = model(plskern)
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain,
Xtest, ytest; score = rmsep, nlv)
plotgrid(res.nlv, res.y1;
xlabel = "Nb. LVs", ylabel = "RMSEP").f
mod = model(lwplsr)
nlvdis = 15 ; metric = [:mah]
h = [1 ; 2.5 ; 5] ; k = [50 ; 100]
pars = mpar(nlvdis = nlvdis, metric = metric,
h = h, k = k)
nlv = 0:20
res = gridscore(mod, Xtrain, ytrain,
Xtest, ytest; score = rmsep,
pars, nlv)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group;
xlabel = "Nb. LVs", ylabel = "RMSEP").f
Jchemo.plotsp — Function
plotsp(X, wl = 1:nco(X); size = (500, 300), color = nothing,
nsamp = nothing, kwargs...)
Plotting spectra.
X
: X-data (n, p).
wl
: Column names of X. Must be numeric.
Keyword arguments:
size
: Size (horizontal, vertical) of the figure.
color
: Set a unique color (and possibly transparency) for the spectra.
nsamp
: Nb. spectra (X-rows) to plot. If nothing, all spectra are plotted.
kwargs
: Optional arguments to pass in Axis of CairoMakie.
The function plots the rows of X.
To use plotsp, a backend (e.g. CairoMakie) has to be specified.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
plotsp(X).f
plotsp(X; color = (:red, .2)).f
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
f, ax = plotsp(X, wl; color = (:red, .2))
xmeans = colmean(X)
lines!(ax, wl, xmeans; color = :black, linewidth = 2)
vlines!(ax, 1200)
f
Jchemo.plotxy — Method
plotxy(x, y; size = (500, 300), color = nothing, ellipse::Bool = false,
prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
xlabel = "", ylabel = "", title = "", kwargs...)
plotxy(x, y, group; size = (600, 350), color = nothing, ellipse::Bool = false,
prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
xlabel = "", ylabel = "", title = "", leg::Bool = true, leg_title = "Group",
kwargs...)
Scatter plot of (x, y) data.
x
: An x-vector (n).
y
: A y-vector (n).
group
: Categorical variable defining groups (n).
Keyword arguments:
size
: Size (horizontal, vertical) of the figure.
color
: Set color(s). If group is used, color must be a vector of the same length as the number of levels in group.
ellipse
: Boolean. Draw an ellipse of confidence, assuming a Chi-square distribution with df = 2. If group is used, one ellipse is drawn per group.
prob
: Probability for the ellipse of confidence.
bisect
: Boolean. Draw a bisector.
zeros
: Boolean. Draw horizontal and vertical axes passing through origin (0, 0).
xlabel
: Label for the x-axis.
ylabel
: Label for the y-axis.
title
: Title of the graphic.
leg
: Boolean. If group is used, display a legend or not.
leg_title
: Title of the legend.
kwargs
: Optional arguments to pass in function scatter of Makie.
To use plotxy, a backend (e.g. CairoMakie) has to be specified.
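The confidence ellipse mentioned for argument ellipse can be sketched in plain Julia. This is only an illustration of the Chi-square (df = 2) construction, not the code used by plotxy: the helper name ellipse_points is hypothetical, and the use of the corrected sample covariance is an assumption.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not part of Jchemo): points lying on the
# confidence ellipse of a 2-column matrix T at probability `prob`.
function ellipse_points(T::AbstractMatrix, prob = 0.95; npts = 100)
    mu = vec(mean(T; dims = 1))
    S = cov(T)                              # 2 x 2 covariance (assumed estimator)
    # The Chi-square quantile with df = 2 has a closed form
    r2 = -2 * log(1 - prob)
    E = eigen(Symmetric(S))
    theta = range(0, 2pi; length = npts)
    circle = [cos.(theta) sin.(theta)]'     # 2 x npts unit circle
    A = E.vectors * Diagonal(sqrt.(r2 .* E.values))
    (A * circle)' .+ mu'                    # npts x 2 matrix of ellipse points
end

T = randn(200, 2)
pts = ellipse_points(T)
```

Every returned point has the same squared Mahalanobis distance to the mean, equal to the Chi-square quantile, which is what makes the curve a confidence contour under the stated assumption.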
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
lev = mlev(year)
nlev = length(lev)
mod = model(pcasvd; nlv = 5)
fit!(mod, X)
@head T = mod.fm.T
plotxy(T[:, 1], T[:, 2]; color = (:red, .5)).f
plotxy(T[:, 1], T[:, 2], year; ellipse = true, xlabel = "PC1",
ylabel = "PC2").f
i = 2
colm = cgrad(:Dark2_5, nlev; categorical = true)
plotxy(T[:, i], T[:, i + 1], year; color = colm, xlabel = string("PC", i),
ylabel = string("PC", i + 1), zeros = true, ellipse = true).f
plotxy(T[:, 1], T[:, 2], year).lev
plotxy(1:5, 1:5).f
y = reshape(rand(5), 5, 1)
plotxy(1:5, y).f
## Several layers can be added
## (same syntax as in Makie)
A = rand(50, 2)
f, ax = plotxy(A[:, 1], A[:, 2]; xlabel = "x1", ylabel = "x2")
ylims!(ax, -1, 2)
hlines!(ax, 0.5; color = :red, linestyle = :dot)
f
Jchemo.plscan — Method
plscan(X, Y; kwargs...)
plscan(X, Y, weights::Weight; kwargs...)
plscan!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical partial least squares regression (Canonical PLS).
X
: First block of data.
Y
: Second block of data.
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores T) to compute.
bscal
: Type of block scaling. Possible values are: :none, :frob. See function blockscal.
scal
: Boolean. If true, each column of blocks X and Y is scaled by its uncorrected standard deviation (before the block scaling).
Canonical PLS with the Nipals algorithm (Wold 1984, Tenenhaus 1998 chap.11), referred to as PLS-W2A (i.e. Wold PLS mode A) in Wegelin 2000. The two blocks X and Y play a symmetric role. After each step of scores computation, X and Y are deflated by the x- and y-scores, respectively.
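The symmetric deflation can be illustrated with one Nipals step written in plain Julia. This is a minimal sketch under simplifying assumptions (uniform observation weights, no block scaling); the helper canonical_pls_step is hypothetical and is not the Jchemo implementation.

```julia
using LinearAlgebra, Statistics

# One component of canonical PLS (mode A) by Nipals, then deflation.
# Hypothetical helper for illustration only.
function canonical_pls_step(X, Y; tol = 1e-10, maxit = 500)
    ty = Y[:, 1]                       # starting y-scores
    tx = zeros(size(X, 1))
    wx = zeros(size(X, 2))
    wy = zeros(size(Y, 2))
    for _ in 1:maxit
        wx_new = normalize(X' * ty)    # x-weights
        tx = X * wx_new                # x-scores
        wy = normalize(Y' * tx)        # y-weights
        ty = Y * wy                    # y-scores
        done = norm(wx_new - wx) < tol
        wx = wx_new
        done && break
    end
    # Each block is deflated by its own scores
    Xd = X - tx * (tx' * X) / dot(tx, tx)
    Yd = Y - ty * (ty' * Y) / dot(ty, ty)
    (; tx, ty, wx, wy, Xd, Yd)
end

X = randn(30, 5) ; Y = randn(30, 3)
X .-= mean(X; dims = 1) ; Y .-= mean(Y; dims = 1)
res = canonical_pls_step(X, Y)
```

After the step, the deflated block Xd is orthogonal to the x-scores and Yd to the y-scores, so the next component is extracted from what the previous one left unexplained in each block.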
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob
mod = model(plscan; nlv, bscal)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)
@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx
@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty
res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.explvary
res.cort2t
res.rdx
res.rdy
res.corx2t
res.cory2t
Jchemo.plskdeda — Method
plskdeda(X, y; kwargs...)
plskdeda(X, y, weights::Weight; kwargs...)
KDE-DA on PLS latent variables (PLS-KDEDA).
X
: X-data (n, p).
y
: Univariate class membership (n).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.
prior
: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
scal
: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The principle is the same as functions plslda and plsqda except that class densities are estimated from dmkern instead of dmnorm.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
mod = model(plskdeda; nlv)
#mod = model(plskdeda; nlv, a_kde = .5)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fmpls)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
Jchemo.plskern — Method
plskern(X, Y; kwargs...)
plskern(X, Y, weights::Weight; kwargs...)
plskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial least squares regression (PLSR) with the "improved kernel algorithm #1" (Dayal & MacGregor, 1997).
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
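As a reminder of what "uncorrected standard deviation" means, the sketch below scales the columns of a small made-up matrix with the Statistics standard library. It assumes uniform observation weights and only illustrates the definition (division by n rather than n - 1); it is not the scaling code of Jchemo.

```julia
using Statistics

X = [1.0 10.0; 2.0 20.0; 4.0 60.0]    # made-up data
n = size(X, 1)

# Uncorrected standard deviation: sum of squares divided by n, not n - 1
s = vec(std(X; dims = 1, corrected = false))
Xs = X ./ s'                           # column-wise scaling
```

After this scaling, each column of Xs has an uncorrected standard deviation equal to 1.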
About the row-weighting in PLS algorithms (weights): See in particular Schaal et al. 2002, Sicard & Sabatier 2006, Kim et al. 2011, and Lesnoff et al. 2020.
References
Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.
Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.
Lesnoff, M., Metz, M., Roger, J.M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR Data. Journal of Chemometrics. e3209. https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.3209
Schaal, S., Atkeson, C., Vijayakumar, S., 2002. Scalable techniques from nonparametric statistics for real time robot learning. Applied Intell., 17, 49-60.
Sicard, E., Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall data set. Comput. Stat. Data Anal., 51, 1393-1410.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
mod = model(plskern; nlv) ;
#mod = model(plsnipals; nlv) ;
#mod = model(plswold; nlv) ;
#mod = model(plsrosa; nlv) ;
#mod = model(plssimp; nlv) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
res = predict(mod, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]
res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.plslda — Method
plslda(X, y; kwargs...)
plslda(X, y, weights::Weight; kwargs...)
LDA on PLS latent variables (PLS-LDA).
X
: X-data (n, p).
y
: Univariate class membership (n).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.
prior
: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal
: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
LDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a LDA is done on {T, y}.
In these plslda functions, observation weights (argument weights) are used to compute the PLS scores and the LDA intra-class (= "within") covariance matrix. Argument prior is used to define the usual LDA prior class probabilities.
In the high-level version, the observation weights are automatically defined by the given priors: the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.
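The rule "sub-total weights by class equal to the prior probabilities" can be sketched as follows. The helper weights_from_prior and the toy data are hypothetical; the sketch only shows one way to build weights with that property (equal weights within a class) and is not the Jchemo code.

```julia
# Observation weights such that the total weight of each class equals its prior.
# Hypothetical helper for illustration only.
function weights_from_prior(y, prior::Dict)
    counts = Dict(c => count(==(c), y) for c in unique(y))
    [prior[c] / counts[c] for c in y]
end

y = ["a", "a", "a", "b", "c", "c"]
prior = Dict("a" => 1/3, "b" => 1/3, "c" => 1/3)   # uniform prior
w = weights_from_prior(y, prior)
```

With a uniform prior, observations of small classes receive larger individual weights, so that each class contributes equally to the weighted PLS.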
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
mod = model(plslda; nlv)
#mod = model(plslda; nlv, prior = :prop)
#mod = model(plsqda; nlv, alpha = .1)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fmpls)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
summary(fmpls, Xtrain)
Jchemo.plsnipals — Method
plsnipals(X, Y; kwargs...)
plsnipals(X, Y, weights::Weight; kwargs...)
plsnipals!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the Nipals algorithm.
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
In this function, for PLS2 (multivariate Y), the Nipals iterations are replaced by a direct computation of the PLS weights (w) by SVD decomposition of matrix X'Y (Hoskuldsson 1988 p.213).
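The idea of replacing iterations by an SVD can be checked numerically. The sketch below, on made-up centered data with uniform weights, compares the dominant left singular vector of X'Y with the vector reached by power iterations on X'Y Y'X, which is what alternating weight updates converge to. The helper power_w is hypothetical and this is not the Jchemo code.

```julia
using LinearAlgebra, Statistics

X = randn(40, 6) ; Y = randn(40, 3)
X .-= mean(X; dims = 1) ; Y .-= mean(Y; dims = 1)

# Direct computation: dominant left singular vector of X'Y
w_svd = svd(X' * Y).U[:, 1]

# Iterative alternative (power iterations on X'Y Y'X), for comparison.
# Hypothetical helper for illustration only.
function power_w(X, Y; nit = 5000)
    M = X' * Y
    w = normalize(M[:, 1])
    for _ in 1:nit
        w = normalize(M * (M' * w))
    end
    w
end
w_it = power_w(X, Y)
```

The two vectors agree up to sign, so the SVD gives the first PLS weight vector without any convergence criterion.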
See function plskern for examples.
References
Hoskuldsson, A., 1988. PLS regression methods. Journal of Chemometrics 2, 211-228. https://doi.org/10.1002/cem.1180020306
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool for chemometrics. Chem. Int. Lab. Syst., 58, 109-130.
Jchemo.plsqda — Method
plsqda(X, y; kwargs...)
plsqda(X, y, weights::Weight; kwargs...)
QDA on PLS latent variables (PLS-QDA) with continuum.
X
: X-data (n, p).
y
: Univariate class membership (n).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.
prior
: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal
: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
QDA on PLS latent variables. The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning a score matrix T. Finally, a QDA (possibly with continuum) is done on {T, y}.
See functions qda and plslda for details (arguments weights, prior and alpha) and examples.
Jchemo.plsravg — Method
plsravg(X, Y; kwargs...)
plsravg(X, Y, weights::Weight; kwargs...)
plsravg!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Averaging PLSR models with different numbers of latent variables (PLSR-AVG).
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: A range of nb. of latent variables (LVs) to compute.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs.
For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ... 10 LVs, respectively.
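The averaging step itself is elementary, as the following sketch with made-up numbers shows. The vector predlv here is a hypothetical stand-in for the per-model predictions (one vector of predictions per number of LVs); no Jchemo function is called.

```julia
using Statistics

# predlv[k] holds the predictions of the model with nlv[k] LVs
# (made-up numbers for illustration)
nlv = 5:10
predlv = [fill(10.0 + 0.1 * a, 4) for a in nlv]   # 6 vectors of 4 predictions

pred = mean(predlv)    # simple average over the 6 models
```

Each element of pred is the unweighted mean of the six corresponding model predictions.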
References
Lesnoff, M., Andueza, D., Barotin, C., Barre, P., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, P., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
@head Y
y = Y.ndf
#y = Y.dm
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(y, s)
Xtest = X[s, :]
ytest = y[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
nlv = 0:30
#nlv = 5:20
#nlv = 25
mod = model(plsravg; nlv) ;
fit!(mod, Xtrain, ytrain)
res = predict(mod, Xtest)
@head res.pred
res.predlv # predictions for each nb. of LVs
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true,
xlabel = "Prediction", ylabel = "Observed").f
Jchemo.plsrda — Method
plsrda(X, y; kwargs...)
plsrda(X, y, weights::Weight; kwargs...)
Discrimination based on partial least squares regression (PLSR-DA).
X
: X-data (n, p).
y
: Univariate class membership (n).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
prior
: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal
: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
This is the usual "PLSDA" (prediction of the Y-dummy table by a PLS2 regression). The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a weighted PLSR2 (i.e. multivariate) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. possibly outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the function, the observation weights used in the PLS2-R are defined with argument prior. For other choices, use the low-level version (argument weights).
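The two ends of this procedure, building the dummy table and picking the class with the highest predicted value, can be sketched in plain Julia. The PLS2 regression in between is omitted, and the matrix posterior below contains made-up values standing in for its output.

```julia
y = ["a", "b", "a", "c", "b"]
lev = sort(unique(y))                       # class levels
Ydummy = Float64.(y .== permutedims(lev))   # (n, nlev) table of 0/1

# Made-up predictions of the dummy variables for two new observations;
# values may fall outside [0, 1]
posterior = [0.7 0.4 -0.1; 0.2 0.1 1.05]

# Final prediction: class whose dummy variable has the highest predicted value
pred = [lev[argmax(row)] for row in eachrow(posterior)]
```

Here the first observation is assigned to class "a" and the second to class "c", even though the value 1.05 is not a valid probability.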
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
mod = model(plsrda; nlv)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)
@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
Jchemo.plsrosa — Method
plsrosa(X, Y; kwargs...)
plsrosa(X, Y, weights::Weight; kwargs...)
plsrosa!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the ROSA algorithm (Liland et al. 2016).
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Note: The function has the following differences with the original algorithm of Liland et al. (2016):
- Scores T (LVs) are not normed.
- Multivariate Y is allowed.
See function plskern for examples.
References
Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA—a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824
Jchemo.plssimp — Method
plssimp(X, Y; kwargs...)
plssimp(X, Y, weights::Weight; kwargs...)
plssimp!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the SIMPLS algorithm (de Jong 1993).
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Note: In this function, scores T (LVs) are not normed, contrary to the original algorithm of de Jong (1993).
See function plskern for examples.
References
de Jong, S., 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263. https://doi.org/10.1016/0169-7439(93)85002-X
Jchemo.plstuck — Method
plstuck(X, Y; kwargs...)
plstuck(X, Y, weights::Weight; kwargs...)
plstuck!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Tucker's inter-battery method of factor analysis.
X
: First block of data.
Y
: Second block of data.
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores T) to compute.
bscal
: Type of block scaling. Possible values are: :none, :frob. See function blockscal.
scal
: Boolean. If true, each column of blocks X and Y is scaled by its uncorrected standard deviation (before the block scaling).
Inter-battery method of factor analysis (Tucker 1958, Tenenhaus 1998 chap.3). The two blocks X and Y play a symmetric role. This method is referred to as PLS-SVD in Wegelin 2000. The basis of the method is to factorize the covariance matrix X'Y by SVD.
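The factorization can be sketched in a few lines with the LinearAlgebra standard library. The sketch assumes centered blocks, uniform observation weights and no block scaling, and uses made-up data; the variable names are illustrative and this is not the Jchemo implementation.

```julia
using LinearAlgebra, Statistics

X = randn(20, 4) ; Y = randn(20, 3)
X .-= mean(X; dims = 1) ; Y .-= mean(Y; dims = 1)

nlv = 2
F = svd(X' * Y)          # SVD of the cross-product matrix X'Y
Wx = F.U[:, 1:nlv]       # X-weights (left singular vectors)
Wy = F.V[:, 1:nlv]       # Y-weights (right singular vectors)
Tx = X * Wx              # X-scores
Ty = Y * Wy              # Y-scores
```

By construction, the cross-product Tx'Ty is diagonal and contains the first nlv singular values, i.e. each pair of scores carries one singular value of X'Y and is uncorrelated with the other pairs.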
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Tishler, A., Lipovetsky, S., 2000. Modelling and forecasting with robust canonical analysis: method and application. Computers & Operations Research 27, 217–232. https://doi.org/10.1016/S0305-0548(99)00014-3
Tucker, L.R., 1958. An inter-battery method of factor analysis. Psychometrika 23, 111–136. https://doi.org/10.1007/BF02289009
Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/linnerud.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
fm = plstuck(X, Y; nlv = 3)
pnames(fm)
fm.Tx
transf(fm, X, Y).Tx
fscale(fm.Tx, colnorm(fm.Tx))
res = summary(fm, X, Y)
pnames(res)
Jchemo.plswold — Method
plswold(X, Y; kwargs...)
plswold(X, Y, weights::Weight; kwargs...)
plswold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the Wold algorithm.
X
: X-data (n, p).
Y
: Y-data (n, q).
weights
: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.
tol
: Tolerance for the Nipals algorithm.
maxit
: Maximum number of iterations for the Nipals algorithm.
scal
: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Wold Nipals PLSR algorithm: Tenenhaus 1998 p.204.
See function plskern for examples.
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
Jchemo.pmod — Method
pmod(foo)
Shortcut for function parentmodule.
Jchemo.pnames — Method
pnames(x)
Return the names of the elements of x.
Jchemo.predict — Method
predict(object::CalDs, X; kwargs...)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::CalPds, X; kwargs...)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Cglsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. iterations, or collection of nb. iterations, to consider.
Jchemo.predict — Method
predict(object::Dkplsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Dmkern, x)
Compute predictions from a fitted model.
object
: The fitted model.
x
: Data (vector) for which predictions are computed.
Jchemo.predict — Method
predict(object::Dmnorm, X)
Compute predictions from a fitted model.
object
: The fitted model.
X
: Data (vector) for which predictions are computed.
Jchemo.predict — Method
predict(object::Knnda1, X)
Compute the y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Knnr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Kplsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider. If nothing, it is the maximum nb. LVs.
Jchemo.predict — Method
predict(object::Krr, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider.
Jchemo.predict — Method
predict(object::Lda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwmlr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwmlrda, X)
Compute y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwplslda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwplsqda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwplsr, X; nlv = nothing)
Compute the Y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::LwplsrAvg, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Lwplsrda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Mbplslda, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Mbplsrda, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Mlrda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Occod, X)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Occsd, X)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Occsdod, X)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Occstah, X)
Compute predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Plslda, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Plsravg, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Plsrda, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Qda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Rosaplsr, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Rr, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider.
Jchemo.predict — Method
predict(object::Rrda, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider. If nothing, it is the parameter stored in the fitted model.
Jchemo.predict — Method
predict(object::Soplsr, Xbl)
Compute Y-predictions from a fitted model.
object
: The fitted model.
Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Svmda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Svmr, X)
Compute y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::TreedaDt, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::TreerDt, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict — Method
predict(object::Mlr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
Jchemo.predict — Method
predict(object::Union{Plsr, Pcr, Splsr}, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.
X
: X-data for which predictions are computed.
nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.psize
— Methodpsize(x)
Print the type and size of x
.
Jchemo.pval — Method
pval(d::Distribution, q)
pval(x::Array, q)
pval(e_cdf::ECDF, q)
Compute p-value(s) for a distribution, an ECDF or a vector.
- d: A distribution computed from Distributions.jl.
- x: Univariate data.
- e_cdf: An ECDF computed from StatsBase.jl.
- q: Value(s) for which to compute the p-value(s).
Compute or estimate the p-value of quantile q, i.e. P(Q > q) where Q is the random variable.
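For univariate data, the estimate of P(Q > q) is the complement of the empirical CDF at q. A minimal sketch in plain Julia follows; pval_sketch is a hypothetical helper written for illustration and is not the Jchemo implementation.

```julia
# Empirical p-value P(Q > q): share of observations strictly greater than q.
# `pval_sketch` is a hypothetical helper, not part of Jchemo.
pval_sketch(x::AbstractVector, q::Real) = count(>(q), x) / length(x)
pval_sketch(x::AbstractVector, q::AbstractVector) = [pval_sketch(x, v) for v in q]

x = [0.1, 0.2, 0.4, 0.7, 0.9]
pval_sketch(x, 0.3)               # 3 of the 5 values exceed 0.3
pval_sketch(x, [0.3, 0.5, 10])    # one p-value per quantile
```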
Examples
using Distributions, StatsBase
d = Distributions.Normal(0, 1)
q = 1.96
#q = [1.64; 1.96]
Distributions.cdf(d, q) # cumulative distribution function (CDF)
Distributions.ccdf(d, q) # complementary CDF (CCDF)
pval(d, q) # Distributions.ccdf
x = rand(5)
e_cdf = StatsBase.ecdf(x)
e_cdf(x) # empirical CDF computed at each point of x (ECDF)
p_val = 1 .- e_cdf(x) # complementary ECDF at each point of x
q = .3
#q = [.3; .5; 10]
pval(e_cdf, q) # 1 .- e_cdf(q)
pval(x, q)
Jchemo.qda — Method
qda(X, y; kwargs...)
qda(X, y, weights::Weight; kwargs...)
Quadratic discriminant analysis (QDA, with continuum towards LDA).
- X: X-data (n, p).
- y: Univariate class membership (n).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- alpha: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
A value alpha > 0 shrinks the within-class covariance matrices (Wi) toward a common LDA covariance matrix ("within" W). This corresponds to the "first regularization (Eqs.16)" described in Friedman 1989 (where alpha is referred to as "lambda").
In these qda functions, the observation weights (argument weights) are used to compute the covariance matrices Wi and W. Argument prior is used to define the usual prior class probabilities.
In the high-level version, the observation weights are automatically defined by the given priors (prior): the sub-total weights by class are set equal to the prior probabilities. For other choices, use the low-level version.
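The rule "sub-total weights by class equal to the prior probabilities" can be illustrated with a short sketch. The helper weights_from_prior is hypothetical (it is not a Jchemo function), and it assumes that each class prior is split equally among the observations of that class.

```julia
# Hypothetical helper (not a Jchemo function): build observation weights whose
# per-class sums equal the given prior probabilities.
# Assumption: the prior of a class is split equally among its observations.
function weights_from_prior(y::AbstractVector, prior::AbstractVector)
    lev = sort(unique(y))                     # sorted class levels
    length(prior) == length(lev) || error("one prior per class is required")
    w = zeros(length(y))
    for (k, l) in enumerate(lev)
        s = findall(==(l), y)                 # observations of class l
        w[s] .= prior[k] / length(s)          # equal split within the class
    end
    w
end

y = ["a", "a", "a", "b"]
w = weights_from_prior(y, [0.5, 0.5])         # uniform priors on unbalanced classes
```

With uniform priors, observations of the smaller class receive larger individual weights, so both classes contribute equally to the weighted covariance estimates.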
References
Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
mod = model(qda)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
aggsum(fm.weights.w, ytrain)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
## With regularization
mod = model(qda; alpha = .5)
#mod = model(qda; alpha = 1) # = LDA
fit!(mod, Xtrain, ytrain)
mod.fm.Wi
res = predict(mod, Xtest) ;
errp(res.pred, ytest)
Jchemo.r2 — Method
r2(pred, Y)
Compute the R2 coefficient.
- pred: Predictions.
- Y: Observed data.
The rate R2 is calculated by:
- R2 = 1 - MSEP(current model) / MSEP(null model)
where the "null model" is the overall mean. For predictions over CV or test sets, and/or for non linear models, it can be different from the square of the correlation coefficient (cor2) between the true data and the predictions.
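The difference between R2 and the squared correlation can be seen on a small example. This is only a sketch for a univariate response: r2_sketch is a hypothetical helper, and it assumes that the null model predicts the mean of the observed values passed to the function.

```julia
using Statistics

# Hypothetical helper (not the Jchemo implementation) for a univariate response.
# Assumption: the null model predicts the mean of the observed values `y`.
function r2_sketch(pred::AbstractVector, y::AbstractVector)
    msep = mean(abs2, y .- pred)
    msep_null = mean(abs2, y .- mean(y))
    1 - msep / msep_null
end

y = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]     # perfectly correlated with y, but biased by +1
r2_sketch(pred, y)              # lower than 1 because of the bias
cor(pred, y)^2                  # squared correlation ignores the bias
```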
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
r2(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
r2(pred, ytest)
Jchemo.rasvd — Method
rasvd(X, Y; kwargs...)
rasvd(X, Y, weights::Weight; kwargs...)
rasvd!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Redundancy analysis (RA), aka PCA on instrumental variables (PCAIV).
- X: First block of data.
- Y: Second block of data.
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. latent variables (LVs = scores T) to compute.
- bscal: Type of block scaling. Possible values are: :none, :frob. See function blockscal.
- tau: Regularization parameter (∊ [0, 1]).
- scal: Boolean. If true, each column of blocks in X and Y is scaled by its uncorrected standard deviation (before the block scaling).
See e.g. Bougeard et al. 2011a,b and Legendre & Legendre 2012. Let Yhat be the fitted values of the regression of Y on X. The scores Ty are the PCA scores of Yhat. The scores Tx are the fitted values of the regression of Ty on X.
A continuum regularization is available. After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
where D is the observation (row) metric. Value tau = 0 can cause instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
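The effect of a small tau on invertibility can be checked numerically. The sketch below uses only the LinearAlgebra standard library; regcov is a hypothetical helper (not a Jchemo function), and D is assumed to be the diagonal matrix of observation weights summing to 1.

```julia
using LinearAlgebra

# Hypothetical helper (not a Jchemo function).
# X: centered data (n, p); w: observation weights summing to 1 (diagonal of D).
regcov(X, w, tau) = (1 - tau) * (X' * Diagonal(w) * X) + tau * I

n, p = 5, 8                          # p > n, so X'DX is rank deficient
X = rand(n, p)
X = X .- sum(X; dims = 1) ./ n       # column centering
w = fill(1 / n, n)
rank(regcov(X, w, 0))                # at most n - 1, lower than p
Cx = regcov(X, w, 1e-8)
isposdef(Symmetric(Cx))              # a small tau > 0 makes Cx invertible
```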
References
Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011-a. Multiblock redundancy analysis from a user's perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214. https://doi.org/10.1285/i20705948v4n2p203
Bougeard, S., Qannari, E.M., Rose, N., 2011-b. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467-475. https://doi.org/10.1002/cem.1392
Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier, Amsterdam, The Netherlands.
Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Multiblock data analysis. https://cran.r-project.org/web/packages/RGCCA/index.html
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob ; tau = 1e-4
mod = model(rasvd; nlv, bscal, tau)
fit!(mod, X, Y)
pnames(mod)
pnames(mod.fm)
@head mod.fm.Tx
@head transfbl(mod, X, Y).Tx
@head mod.fm.Ty
@head transfbl(mod, X, Y).Ty
res = summary(mod, X, Y) ;
pnames(res)
res.explvarx
res.cort2t
res.rdx
res.rdy
res.corx2t
res.cory2t
Jchemo.rd — Method
rd(X, Y; typ = :cor)
rd(X, Y, weights::Weight; typ = :cor)
Compute redundancy coefficients between two matrices.
- X: Matrix (n, p).
- Y: Matrix (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- typ: Possible values are: :cor (correlation), :cov (uncorrected covariance).
Returns the redundancy coefficient between X and each column of Y, i.e.:
(1 / p) * [Sum.(j=1, .., p) cor(xj, y1)^2 ; ... ; Sum.(j=1, .., p) cor(xj, yq)^2]
See Tenenhaus 1998 section 2.2.1 p.10-11.
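For the correlation case without observation weights, the formula above reduces to a column-wise mean of squared correlations. A minimal sketch with the Statistics standard library follows; rd_sketch is a hypothetical helper and not the Jchemo implementation.

```julia
using Statistics

# Hypothetical helper (not the Jchemo implementation), for typ = :cor only.
# cor(X, Y) is the (p, q) matrix of correlations between columns of X and Y.
rd_sketch(X, Y) = vec(mean(abs2, cor(X, Y); dims = 1))

X = rand(5, 10)
Y = rand(5, 3)
rd_sketch(X, Y)            # one coefficient per column of Y
```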
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Examples
X = rand(5, 10)
Y = rand(5, 3)
rd(X, Y)
Jchemo.rda — Method
rda(X, y; kwargs...)
rda(X, y, weights::Weight; kwargs...)
Regularized discriminant analysis (RDA).
- X: X-data (n, p).
- y: Univariate class membership (n).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- alpha: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
- lb: Ridge regularization parameter "lambda" (>= 0).
- simpl: Boolean. See function dmnorm.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Let us note W the (corrected) pooled within-class covariance matrix and Wi the (corrected) within-class covariance matrix of class i. The regularization is done by the two following successive steps (for each class i):
- Continuum between QDA and LDA: Wi(1) = (1 - alpha) * Wi + alpha * W
- Ridge regularization: Wi(2) = Wi(1) + lb * I
Then the QDA algorithm is run on matrices {Wi(2)}.
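The two regularization steps can be sketched for a single class as follows. The helper regcov_rda and the toy matrices are hypothetical and only illustrate the formulas; they are not taken from the Jchemo source.

```julia
using LinearAlgebra

# Hypothetical helper (not a Jchemo function) applying the two steps to one class.
# Wi: within-class covariance of class i; W: pooled within-class covariance.
function regcov_rda(Wi, W; alpha, lb)
    Wi1 = (1 - alpha) * Wi + alpha * W    # step 1: continuum between QDA and LDA
    Wi1 + lb * I                          # step 2: ridge regularization
end

Wi = [2.0 0.5; 0.5 1.0]
W = [1.0 0.2; 0.2 1.5]
regcov_rda(Wi, W; alpha = 0, lb = 0)      # returns Wi (QDA case)
regcov_rda(Wi, W; alpha = 1, lb = 0)      # returns W (LDA case)
regcov_rda(Wi, W; alpha = 0.5, lb = 1e-8)
```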
Function rda is slightly different from the regularization expression used by Friedman 1989 (Eq.18). It shrinks the covariance matrices Wi(2) to the diagonal of the Identity matrix (ridge regularization; e.g. Guo et al. 2007).
Particular cases:
- alpha = 1 & lb = 0 : LDA
- alpha = 0 & lb = 0 : QDA
- alpha = 1 & lb > 0 : Penalized LDA (Hastie et al 1995) with diagonal regularization matrix
See functions lda and qda for other details (arguments weights and prior).
References
Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.
Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007; 8(1):86-100. doi:10.1093/biostatistics/kxj035.
Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized Discriminant Analysis. The Annals of Statistics 23, 73–102.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
alpha = .5
lb = 1e-8
mod = model(rda; alpha, lb)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.recodcat2int — Method
recodcat2int(x; start = 1)
Recode a categorical variable to an integer variable.
- x: Variable to recode.
- start: Integer that will be set to the first category.
The integers returned by the function correspond to the sorted levels (categories) of x.
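The mapping from sorted levels to integers can be sketched in plain Julia. recod_sketch is a hypothetical helper written for illustration; it is not the Jchemo implementation.

```julia
# Hypothetical helper (not the Jchemo implementation).
function recod_sketch(x; start = 1)
    lev = sort(unique(x))                          # sorted levels
    [searchsortedfirst(lev, v) for v in x] .+ (start - 1)
end

recod_sketch(["b", "a", "b"])              # [2, 1, 2]
recod_sketch(["b", "a", "b"]; start = 0)   # [1, 0, 1]
recod_sketch([25, 1, 25])                  # [2, 1, 2]
```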
Examples
x = ["b", "a", "b"]
[x recodcat2int(x)]
recodcat2int(x; start = 0)
recodcat2int([25, 1, 25])
Jchemo.recodnum2int — Method
recodnum2int(x, q)
Recode a continuous variable to integer classes.
- x: Variable to recode.
- q: Values separating the classes.
Examples
using Statistics
x = [collect(1:10); 8.1 ; 3.1]
q = [3; 8]
zx = recodnum2int(x, q)
[x zx]
probs = [.33; .66]
q = quantile(x, probs)
zx = recodnum2int(x, q)
[x zx]
Jchemo.recovkwargs
— Methodrecovkwargs(ParamStruct, kwargs)
Jchemo.replacebylev — Method
replacebylev(x, lev)
Replace the elements of a vector by levels of corresponding order.
- x: Vector (n) of values to replace.
- lev: Vector (nlev) containing the levels.
Warning: x and lev must contain the same number (nlev) of levels.
The ith sorted level in x is replaced by the ith sorted level of lev.
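The "ith sorted level" rule can be sketched as below. replace_sketch is a hypothetical helper (not the Jchemo implementation) that matches the sorted levels of x to the sorted levels of lev by position.

```julia
# Hypothetical helper (not the Jchemo implementation).
function replace_sketch(x, lev)
    xlev = sort(unique(x))                 # sorted levels found in x
    newlev = sort(unique(lev))             # sorted replacement levels
    length(xlev) == length(newlev) || error("x and lev must have the same number of levels")
    [newlev[searchsortedfirst(xlev, v)] for v in x]
end

x = [10; 4; 3; 3; 4; 4]
replace_sketch(x, ["B"; "C"; "AA"])    # ["C", "B", "AA", "AA", "B", "B"]
```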
Examples
x = [10; 4; 3; 3; 4; 4]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev(x, lev)]
zx = string.(x)
[zx replacebylev(zx, lev)]
lev = [3; 0; -1]
[x replacebylev(x, lev)]
Jchemo.replacebylev2 — Method
replacebylev2(x::Union{Int, Array{Int}}, lev::Array)
Replace the elements of an index-vector by levels.
- x: Vector (n) of values to replace.
- lev: Vector (nlev) containing the levels.
Warning: Let us note nlev the number of levels in lev. Vector x must contain integer values between 1 and nlev.
Each element x[i] is replaced by sort(lev)[x[i]].
Examples
x = [2; 1; 2; 2]
lev = ["B"; "C"; "AA"]
sort(lev)
[x replacebylev2(x, lev)]
replacebylev2([2], lev)
replacebylev2(2, lev)
x = [2; 1; 2]
lev = [3; 0; -1]
replacebylev2(x, lev)
Jchemo.replacedict — Method
replacedict(x, dict)
Replace the elements of a vector by levels defined in a dictionary.
- x: Vector (n) of values to replace.
- dict: A dictionary of the correspondences between the old and new values.
Examples
dict = Dict("a" => 1000, "b" => 1, "c" => 2)
x = ["c"; "c"; "a"; "a"; "a"]
replacedict(x, dict)
x = ["c"; "c"; "a"; "a"; "a"; "e"]
replacedict(x, dict)
Jchemo.residcla
— Methodresidcla(pred, y)
Compute the discrimination residual vector (0 = no error, 1 = error).
pred
: Predictions.y
: Observed data (class membership).
Examples
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
mod = model(plsrda; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residcla(pred, ytest)
Jchemo.residreg
— Methodresidreg(pred, Y)
Compute the regression residual vector.
pred
: Predictions.Y
: Observed data.
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
residreg(pred, ytest)
Jchemo.rfda_dt — Method
rfda_dt(X, y; kwargs...)
Random forest discrimination with DecisionTree.jl.
- X: X-data (n, p).
- y: Univariate class membership (n).
Keyword arguments:
- n_trees: Nb. trees built for the forest.
- partial_sampling: Proportion of sampled observations for each tree.
- n_subfeatures: Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
- max_depth: Maximum depth of the decision trees (default: -1 ==> no maximum).
- min_sample_leaf: Minimum number of samples each leaf needs to have.
- min_sample_split: Minimum number of observations needed for a split.
- mth: Boolean indicating if multi-threading is done when new data are predicted with function predict.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
- Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.
The function fits a random forest discrimination model using package DecisionTree.jl.
References
Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n, p = size(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
n_trees = 200
n_subfeatures = p / 3
max_depth = 10
mod = model(rfda_dt; n_trees, n_subfeatures, max_depth)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.rfr_dt — Method
rfr_dt(X, y; kwargs...)
Random forest regression with DecisionTree.jl.
- X: X-data (n, p).
- y: Univariate y-data (n).
Keyword arguments:
- n_trees: Nb. trees built for the forest.
- partial_sampling: Proportion of sampled observations for each tree.
- n_subfeatures: Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).
- max_depth: Maximum depth of the decision trees (default: -1 ==> no maximum).
- min_sample_leaf: Minimum number of samples each leaf needs to have.
- min_sample_split: Minimum number of observations needed for a split.
- mth: Boolean indicating if multi-threading is done when new data are predicted with function predict.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
- Do dump(Par(), maxdepth = 1) to print the default values of the keyword arguments.
The function fits a random forest regression model using package DecisionTree.jl.
References
Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
n_trees = 200
n_subfeatures = p / 3
max_depth = 15
mod = model(rfr_dt; n_trees, n_subfeatures, max_depth)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.rmcol
— Methodrmcol(X, s)
Remove the columns of a matrix or the components of a vector having indexes s
.
X
: Matrix or vector.s
: Vector of the indexes.
Examples
X = rand(5, 3)
rmcol(X, [1, 3])
Jchemo.rmgap — Method
rmgap(X; kwargs...)
Remove vertical gaps in spectra (e.g. for ASD).
- X: X-data (n, p).
Keyword arguments:
- indexcol: Indexes (∈ [1, p]) of the X-columns where the gaps to remove are located.
- npoint: The number of X-columns used on the left side of each gap for fitting the linear regressions.
For each spectrum (row-observation of matrix X) and each defined gap, the correction is done by extrapolation from a simple linear regression computed on the left side of the gap.
For instance, if two gaps are observed between column-indexes 651-652 and between column-indexes 1425-1426, respectively, the syntax should be indexcol = [651 ; 1425].
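The correction principle can be sketched on an artificial spectrum. rmgap_sketch is a hypothetical helper and not the Jchemo implementation. In particular, it assumes that the jump is removed by shifting all the columns to the right of the gap by the difference between the observed and the extrapolated value, which may differ in detail from what the package does.

```julia
# Hypothetical helper (not the Jchemo implementation).
# Assumption: the jump at each gap is removed by shifting all the columns located
# to the right of the gap by the difference between the observed value and the value
# extrapolated from a line fitted on the `npoint` columns to the left of the gap.
function rmgap_sketch(X::AbstractMatrix; indexcol, npoint = 5)
    Xc = float.(X)                 # work on a floating-point copy
    p = size(Xc, 2)
    for k in sort(indexcol)        # gap located between columns k and k + 1
        idx = (k - npoint + 1):k
        A = [ones(npoint) collect(idx)]
        for i in axes(Xc, 1)
            b = A \ Xc[i, idx]                  # least-squares intercept and slope
            pred = b[1] + b[2] * (k + 1)        # extrapolated value at column k + 1
            d = Xc[i, k + 1] - pred             # size of the jump
            Xc[i, (k + 1):p] .-= d
        end
    end
    Xc
end

x = collect(1.0:10.0)
x[6:10] .+= 5                      # artificial jump between columns 5 and 6
X = permutedims(x)                 # one spectrum, size (1, 10)
Xc = rmgap_sketch(X; indexcol = [5], npoint = 3)
```

On this toy spectrum the left side is exactly linear, so the corrected row is the straight line 1, 2, ..., 10 up to rounding errors.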
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/asdgap.jld2")
@load db dat
pnames(dat)
X = dat.X
wlst = names(dat.X)
wl = parse.(Float64, wlst)
wl_target = [1000 ; 1800]
indexcol = findall(in(wl_target).(wl))
f, ax = plotsp(X, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
## Corrected data
mod = model(rmgap; npoint = 5, indexcol)
fit!(mod, X)
Xc = transf(mod, X)
f, ax = plotsp(Xc, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
Jchemo.rmrow
— Methodrmrow(X, s)
Remove the rows of a matrix or the components of a vector having indexes s
.
X
: Matrix or vector.s
: Vector of the indexes.
Examples
X = rand(5, 2)
rmrow(X, [1, 4])
Jchemo.rmsep
— Methodrmsep(pred, Y)
Compute the square root of the mean of the squared prediction errors (RMSEP).
pred
: Predictions.Y
: Observed data.
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsep(pred, ytest)
Jchemo.rmsepstand — Method
rmsepstand(pred, Y)
Compute the standardized square root of the mean of the squared prediction errors (RMSEP_stand).
- pred: Predictions.
- Y: Observed data.
RMSEP is standardized to Y:
- RMSEP_stand = RMSEP ./ Y.
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rmsepstand(pred, ytest)
Jchemo.rosaplsr — Method
rosaplsr(Xbl, Y; kwargs...)
rosaplsr(Xbl, Y, weights::Weight; kwargs...)
rosaplsr!(Xbl::Vector, Y::Matrix, weights::Weight; kwargs...)
Multiblock ROSA PLSR (Liland et al. 2016).
- Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
- Y: Y-data (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. latent variables (LVs = scores T) to compute.
- scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
The function has the following differences with the original algorithm of Liland et al. (2016):
- Scores T are not normed to 1.
- Multivariate Y is allowed. In such a case, the squared residuals are summed over the columns for finding the winning block for each global LV (therefore Y-columns should have the same scale).
References
Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA — a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
scal = false
#scal = true
mod = model(rosaplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)
res = predict(mod, Xbltest)
res.pred
rmsep(res.pred, ytest)
Jchemo.rowmean
— Methodrowmean(X)
Compute row-wise means of a matrix.
X
: Data (n, p).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
rowmean(X)
Jchemo.rownorm — Method
rownorm(X)
Compute row-wise norms of a matrix.
- X: Data (n, p).
The norm computed for a row x of X is:
- sqrt(x' * x)
Return a vector.
Note: Thanks to @mcabbott at https://discourse.julialang.org/t/orders-of-magnitude-runtime-difference-in-row-wise-norm/96363.
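As an illustration of the formula, the row-wise norm can be written with a single reduction over the columns. rownorm_sketch is a hypothetical helper and not necessarily how Jchemo computes it.

```julia
using LinearAlgebra

# Hypothetical helper (not the Jchemo implementation).
rownorm_sketch(X) = sqrt.(vec(sum(abs2, X; dims = 2)))

X = rand(5, 6)
rownorm_sketch(X)                        # vector of length 5
[norm(x) for x in eachrow(X)]            # same values, computed row by row
```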
Examples
n, p = 5, 6
X = rand(n, p)
rownorm(X)
Jchemo.rowstd — Method
rowstd(X)
Compute row-wise standard deviations (uncorrected) of a matrix.
- X: Data (n, p).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
rowstd(X)
Jchemo.rowsum
— Methodrowsum(X)
Compute row-wise sums of a matrix.
X
: Data (n, p).
Return a vector.
Examples
X = rand(5, 2)
rowsum(X)
Jchemo.rowvar
— Methodrowvar(X)
Compute row-wise variances (uncorrected) of a matrix.
X
: Data (n, p).
Return a vector.
Examples
n, p = 5, 6
X = rand(n, p)
rowvar(X)
Jchemo.rp — Method
rp(X; kwargs...)
rp(X, weights::Weight; kwargs...)
rp!(X::Matrix, weights::Weight; kwargs...)
Make a random projection of X-data.
- X: X-data (n, p).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. dimensions on which X is projected.
- mrp: Method of random projection. Possible values are: :gauss, :li. See the respective functions rpmatgauss and rpmatli for their keyword arguments.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Examples
n, p = (5, 10)
X = rand(n, p)
nlv = 3
mrp = :li ; s_li = sqrt(p)
#mrp = :gauss
mod = model(rp; nlv, mrp, s_li)
fit!(mod, X)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head mod.fm.P
transf(mod, X[1:2, :])
Jchemo.rpd — Method
rpd(pred, Y)
Compute the ratio "deviation to model performance" (RPD).
- pred: Predictions.
- Y: Observed data.
This is the ratio of the deviation to the model performance, defined by:
- RPD = Std(Y) / RMSEP
where Std(Y) is the standard deviation.
Since Std(Y) = RMSEP(null model) where the null model is the simple average, this also gives:
- RPD = RMSEP(null model) / RMSEP
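The equivalence between the two expressions can be checked on a toy example. The helpers below are hypothetical and handle a univariate response only. They assume the uncorrected standard deviation, which is the convention under which Std(Y) equals the RMSEP of the null model exactly.

```julia
using Statistics

# Hypothetical helpers (not the Jchemo implementation) for a univariate response.
# Assumption: Std(Y) is the uncorrected standard deviation.
rmsep_sketch(pred, y) = sqrt(mean(abs2, y .- pred))
rpd_sketch(pred, y) = std(y; corrected = false) / rmsep_sketch(pred, y)

y = [1.0, 2.0, 3.0, 4.0]
rpd_sketch(fill(mean(y), 4), y)                # null model: RPD is 1 up to rounding
rpd_sketch(y .+ [0.1, -0.1, 0.1, -0.1], y)     # accurate predictions: RPD above 1
```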
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpd(pred, ytest)
Jchemo.rpdr
— Methodrpdr(pred, Y)
Compute a robustified RPD.
pred
: Predictions.Y
: Observed data.
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
rpdr(pred, ytest)
Jchemo.rpmatgauss — Function
rpmatgauss(p::Int, nlv::Int, Q = Float64)
Build a gaussian random projection matrix.
- p: Nb. variables (attributes) to project.
- nlv: Nb. of simulated projection dimensions.
- Q: Type of components of the built projection matrix.
The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.
P is simulated from i.i.d. N(0, 1) / sqrt(nlv).
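A minimal sketch of this construction, using only Base Julia, is given below. rpmatgauss_sketch is a hypothetical helper that follows the description above; it is not the Jchemo source.

```julia
# Hypothetical helper (not the Jchemo implementation).
# Entries are i.i.d. N(0, 1) draws divided by sqrt(nlv).
rpmatgauss_sketch(p::Int, nlv::Int, Q = Float64) = randn(Q, p, nlv) ./ sqrt(Q(nlv))

P = rpmatgauss_sketch(10, 3)
X = rand(5, 10)
T = X * P                  # projected data, size (5, 3)
```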
References
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436
Examples
p = 10 ; nlv = 3
rpmatgauss(p, nlv)
Jchemo.rpmatli — Function
rpmatli(p::Int, nlv::Int, Q = Float64; s_li)
Build a sparse random projection matrix (Achlioptas 2001, Li et al. 2006).
- p: Nb. variables (attributes) to project.
- nlv: Nb. of simulated projection dimensions.
- Q: Type of components of the built projection matrix.
Keyword arguments:
- s_li: Coefficient s defining the sparsity of the returned matrix (the higher s, the higher the sparsity).
The function returns a random projection matrix P of dimension p x nlv. The projection of a given matrix X of size n x p is given by X * P.
Matrix P is simulated from i.i.d. discrete sampling within values:
- 1 with prob. 1/(2 * s)
- 0 with prob. 1 - 1 / s
- -1 with prob. 1/(2 * s)
Usual values for s are:
- sqrt(p) (Li et al. 2006)
- p / log(p) (Li et al. 2006)
- 1 (Achlioptas 2001)
- 3 (Achlioptas 2001)
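The sampling scheme above can be sketched as follows. rpmatli_sketch is a hypothetical helper: it returns a dense matrix for simplicity and applies no scaling factor to the sampled values, which may differ from the Jchemo implementation.

```julia
# Hypothetical helper (not the Jchemo implementation); it returns a dense matrix
# for simplicity and applies no scaling factor to the sampled values.
function rpmatli_sketch(p::Int, nlv::Int, Q = Float64; s_li)
    P = zeros(Q, p, nlv)
    for i in eachindex(P)
        u = rand()
        if u < 1 / (2 * s_li)
            P[i] = 1          # prob. 1 / (2s)
        elseif u < 1 / s_li
            P[i] = -1         # prob. 1 / (2s)
        end                   # otherwise the entry stays 0, with prob. 1 - 1 / s
    end
    P
end

P = rpmatli_sketch(10, 3; s_li = 3)     # about two thirds of the entries are 0
```

With s = 1 the probability of a zero entry is 0, so the matrix contains only 1 and -1 (Achlioptas 2001).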
References
Achlioptas, D., 2001. Database-friendly random projections, in: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01. Association for Computing Machinery, New York, NY, USA, pp. 274–281. https://doi.org/10.1145/375551.375608
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436
Examples
p = 10 ; nlv = 3
rpmatli(p, nlv)
Jchemo.rr — Method
rr(X, Y; kwargs...)
rr(X, Y, weights::Weight; kwargs...)
rr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Ridge regression (RR) implemented by SVD factorization.
- X: X-data (n, p).
- Y: Y-data (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- lb: Ridge regularization parameter "lambda".
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
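To show how an SVD gives the ridge solution, here is a minimal unweighted sketch. rr_sketch is a hypothetical helper, not the Jchemo implementation: it assumes the penalty lb * ||b||^2 on centered data, whereas the package may scale lb differently (for instance through the observation weights).

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not the Jchemo implementation): unweighted ridge regression
# on centered data, assuming the penalty is lb * ||b||^2.
function rr_sketch(X, y; lb)
    xmeans = mean(X; dims = 1)
    ymean = mean(y)
    F = svd(X .- xmeans)                          # Xc = U * Diagonal(S) * V'
    d = F.S ./ (F.S .^ 2 .+ lb)                   # shrunken inverse singular values
    b = F.V * (d .* (F.U' * (y .- ymean)))        # slopes
    b0 = ymean - dot(vec(xmeans), b)              # intercept
    (b = b, b0 = b0)
end

X = rand(20, 4) ; y = rand(20)
fm = rr_sketch(X, y; lb = 1e-3)
pred = fm.b0 .+ X * fm.b
```

Once the SVD is computed, coefficients for another lb only require recomputing d, which is one reason an SVD-based fit is convenient when several values of lambda are compared.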
References
Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.
Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.
Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
lb = 1e-3
mod = model(rr; lb)
#mod = model(rrchol; lb)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
coef(mod)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Only for function 'rr' (not for 'rrchol')
coef(mod; lb = 1e-1)
res = predict(mod, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
Jchemo.rrchol — Method
rrchol(X, Y; kwargs...)
rrchol(X, Y, weights::Weight; kwargs...)
rrchol!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Ridge regression (RR) using the Normal equations and a Cholesky factorization.
- X: X-data (n, p).
- Y: Y-data (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- lb: Ridge regularization parameter "lambda".
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
See function rr for examples.
References
Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.
Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.
Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
Jchemo.rrda — Method
rrda(X, y; kwargs...)
rrda(X, y, weights::Weight; kwargs...)
Discrimination based on ridge regression (RR-DA).
- X: X-data (n, p).
- y: Univariate class membership (n).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- lb: Ridge regularization parameter "lambda".
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable. Then, a ridge regression (RR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. possibly outside of [0, 1]) of the class membership probabilities. For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the function, the observation weights used in the RR are defined with argument prior. For other choices, use the low-level version (argument weights).
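The steps described above (dummy table, ridge regression, highest estimate) can be sketched without observation weights. rrda_sketch and the toy data are hypothetical and only illustrate the principle; the Jchemo function additionally handles priors, weights and scaling.

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch (not the Jchemo implementation): unweighted RR-DA.
function rrda_sketch(Xtrain, ytrain, Xtest; lb)
    lev = sort(unique(ytrain))                        # class levels
    Ydummy = Float64.(ytrain .== permutedims(lev))    # (n, nlev) table of 0/1
    xmeans = mean(Xtrain; dims = 1)
    ymeans = mean(Ydummy; dims = 1)
    Xc = Xtrain .- xmeans
    B = (Xc' * Xc + lb * I) \ (Xc' * (Ydummy .- ymeans))   # ridge coefficients
    posterior = ymeans .+ (Xtest .- xmeans) * B       # estimates may fall outside [0, 1]
    pred = [lev[argmax(r)] for r in eachrow(posterior)]
    (pred = pred, posterior = posterior)
end

Xtrain = [0.0 0.1; 0.2 0.0; 5.0 5.1; 5.2 4.9]
ytrain = ["a", "a", "b", "b"]
res = rrda_sketch(Xtrain, ytrain, [0.1 0.1; 5.1 5.0]; lb = 1e-5)
res.pred        # ["a", "b"]
```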
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
lb = 1e-5
mod = model(rrda; lb)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; lb = [.1; .01]).pred
Jchemo.rrr
Method
rrr(X, Y; kwargs...)
rrr(X, Y, weights::Weight; kwargs...)
rrr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Reduced rank regression (RRR, aka RA).
- X: X-data (n, p).
- Y: Y-data (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. latent variables (LVs) to compute.
- tau: Regularization parameter (∊ [0, 1]).
- scal: Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Reduced rank regression, also referred to as redundancy analysis (RA) regression. In this function, the RA uses the Nipals algorithm presented in Tchandao Mangamana et al. 2021, section 2.1.1.
A continuum regularization is available. After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. A better alternative is generally to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
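To make the effect of tau concrete, the formula above can be written as a few lines of Julia. This is a sketch only: the helper regcov_sketch is hypothetical, Ix is assumed to be the (p, p) identity matrix, and the weights are assumed to be uniform and to sum to 1.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Jchemo): continuum-regularized covariance
function regcov_sketch(X, w, tau)
    D = Diagonal(w)                       # observation (row) metric
    (1 - tau) * (X' * D * X) + tau * I    # Cx = (1 - tau) * X'DX + tau * Ix
end

n, p = 5, 8                               # p > n, so X'DX is singular
X = randn(n, p)
X .-= sum(X, dims = 1) / n                # column centering
w = fill(1 / n, n)                        # uniform weights (assumption)
C1 = regcov_sketch(X, w, 1e-8)
minimum(eigvals(Symmetric(C1)))           # strictly positive: C1 can be inverted
```

With tau = 0 the same matrix has zero eigenvalues whenever p > n, which is the instability mentioned above. A tiny tau shifts every eigenvalue away from zero while leaving the dominant directions almost unchanged.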
References
Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011. Multiblock redundancy analysis from a user’s perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203–214. https://doi.org/10.1285/i20705948v4n2p203
Bougeard, S., Qannari, E.M., Rose, N., 2011. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467–475. https://doi.org/10.1002/cem.1392
Tchandao Mangamana, E., Glèlè Kakaï, R., Qannari, E.M., 2021. A general strategy for setting up supervised methods of multiblock data analysis. Chemometrics and Intelligent Laboratory Systems 217, 104388. https://doi.org/10.1016/j.chemolab.2021.104388
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 1
tau = 1e-4
mod = model(rrr; nlv, tau)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.rv
Method
rv(X, Y; centr = true)
rv(Xbl::Vector; centr = true)
Compute the RV coefficient between matrices.
- X: Matrix (n, p).
- Y: Matrix (n, q).
- Xbl: A list (vector) of matrices.
- centr: Boolean indicating if the matrices will be internally centered or not.
RV is bounded in [0, 1].
A dissimilarity measure between X and Y can be computed by d = sqrt(2 * (1 - RV)).
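For reference, the RV coefficient of Robert & Escoufier (1976) is the ratio tr(XX'YY') / sqrt(tr((XX')^2) * tr((YY')^2)), computed on column-centered matrices. The sketch below follows that definition directly. The helpers rv_sketch and rv_dist_sketch are hypothetical and not the package implementation, which may compute the same quantity differently.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Jchemo): RV coefficient between two matrices
function rv_sketch(X, Y; centr = true)
    if centr
        X = X .- sum(X, dims = 1) / size(X, 1)
        Y = Y .- sum(Y, dims = 1) / size(Y, 1)
    end
    Sx = X * X'                 # (n, n) cross-product matrices
    Sy = Y * Y'
    tr(Sx * Sy) / sqrt(tr(Sx * Sx) * tr(Sy * Sy))
end

# Dissimilarity d = sqrt(2 * (1 - RV)); max() guards against rounding above 1
rv_dist_sketch(X, Y) = sqrt(2 * max(0.0, 1 - rv_sketch(X, Y)))
```

The coefficient is symmetric in X and Y and is unchanged when either matrix is multiplied by a non-zero constant.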
References
Escoufier, Y., 1973. Le Traitement des Variables Vectorielles. Biometrics 29, 751–760. https://doi.org/10.2307/2529140
Josse, J., Holmes, S., 2016. Measuring multivariate association and beyond. Stat Surv 10, 132–167. https://doi.org/10.1214/16-SS116
Josse, J., Pagès, J., Husson, F., 2008. Testing the significance of the RV coefficient. Computational Statistics & Data Analysis 53, 82–91. https://doi.org/10.1016/j.csda.2008.06.012
Kazi-Aoual, F., Hitier, S., Sabatier, R., Lebreton, J.-D., 1995. Refined approximations to permutation tests for multivariate inference. Computational Statistics & Data Analysis 20, 643–656. https://doi.org/10.1016/0167-9473(94)00064-2
Mayer, C.-D., Lorent, J., Horgan, G.W., 2011. Exploratory Analysis of Multiple Omics Datasets Using the Adjusted RV Coefficient. Statistical Applications in Genetics and Molecular Biology 10. https://doi.org/10.2202/1544-6115.1540
Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., van Erk, M.J., 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25, 401–405. https://doi.org/10.1093/bioinformatics/btn634
Robert, P., Escoufier, Y., 1976. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics) 25, 257–265. https://doi.org/10.2307/2347233
Examples
X = rand(5, 10)
Y = rand(5, 3)
rv(X, Y)
X = rand(5, 15)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
rv(Xbl)
Jchemo.sampcla
Function
sampcla(x, k::Union{Int, Vector{Int}}, y = nothing)
Build training vs. test sets by stratified sampling.
- x: Class membership (n) of the observations.
- k: Nb. test observations to sample in each class. If k is a single value, the nb. of sampled observations is the same for each class. Alternatively, k can be a vector of length equal to the nb. of classes in x.
- y: Quantitative variable (n) used if systematic sampling.
Two outputs are returned (= row indexes of the data):
- train (n - k),
- test (k).
If y = nothing, the sampling is random, else it is systematic over the sorted y (see function sampsys).
References
Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.
Examples
x = string.(repeat(1:3, 5))
n = length(x)
tab(x)
k = 2
res = sampcla(x, k)
res.test
x[res.test]
tab(x[res.test])
y = rand(n)
res = sampcla(x, k, y)
res.test
x[res.test]
tab(x[res.test])
Jchemo.sampdf
Function
sampdf(Y::DataFrame, k::Union{Int, Vector{Int}}, id = 1:nro(Y); msamp = :rand)
Build training vs. test sets from each column of a dataframe.
- Y: DataFrame (n, p) whose columns can each contain missing values.
- k: Nb. of test observations selected for each Y column. The selection is done within the non-missing observations of the considered column. If k is a single value, the same nb. of observations is selected for each column. Alternatively, k can be a vector of length p.
- id: Vector (n) of IDs.
Keyword arguments:
- msamp: Type of sampling for the test set. Possible values are: :rand = random sampling, :sys = systematic sampling over each sorted Y column (see function sampsys).
Typically, dataframe Y contains a set of response variables to predict.
Examples
using DataFrames
Y = hcat([rand(5); missing; rand(6)],
[rand(2); missing; missing; rand(7); missing])
Y = DataFrame(Y, :auto)
n = nro(Y)
k = 3
res = sampdf(Y, k)
#res = sampdf(Y, k, string.(1:n))
pnames(res)
res.nam
length(res.test)
res.train
res.test
## Replicated splitting Train/Test
rep = 10
k = 3
ids = [sampdf(Y, k) for i = 1:rep]
length(ids)
i = 1 # replication
ids[i]
ids[i].train
ids[i].test
j = 1 # variable y
ids[i].train[j]
ids[i].test[j]
ids[i].nam[j]
Jchemo.sampdp
Method
sampdp(X, k::Int; metric = :eucl)
Build training vs. test sets by DUPLEX sampling.
- X: X-data (n, p).
- k: Nb. pairs (training/test) of observations to sample. Must be <= n / 2.
Keyword arguments:
- metric: Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).
Three outputs (= row indexes of the data) are returned:
- train (k),
- test (k),
- remain (n - 2 * k).
Outputs train and test are built from the DUPLEX algorithm (Snee, 1977 p.421). They are expected to cover approximately the same X-space region and have similar statistical properties.
In practice, when output remain is not empty (i.e. when there are remaining observations), one common strategy is to add it to output train.
References
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.
Snee, R.D., 1977. Validation of Regression Models: Methods and Examples. Technometrics 19, 415-428. https://doi.org/10.1080/00401706.1977.10489581
Examples
X = [0.381392 0.00175002 ; 0.1126 0.11263 ;
0.613296 0.152485 ; 0.726536 0.762032 ;
0.367451 0.297398 ; 0.511332 0.320198 ;
0.018514 0.350678]
k = 3
sampdp(X, k)
Jchemo.sampks
Method
sampks(X, k::Int; metric = :eucl)
Build training vs. test sets by Kennard-Stone sampling.
- X: X-data (n, p).
- k: Nb. test observations to sample.
Keyword arguments:
- metric: Metric used for the distance computation. Possible values are: :eucl (Euclidean), :mah (Mahalanobis).
Two outputs (= row indexes of the data) are returned:
- train (n - k),
- test (k).
Output test is built from the Kennard-Stone (KS) algorithm (Kennard & Stone, 1969).
Note: By construction, the set of observations selected by KS sampling contains higher variability than the set of the remaining observations. In the seminal article (K&S, 1969), the algorithm is used to select observations that will be used to build a calibration set. Conversely, in the present function, KS is used to select a test set with higher variability than the training set.
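As a rough illustration of the KS principle, the sketch below starts from the two most distant observations and then repeatedly adds the observation whose distance to its nearest already-selected observation is the largest. It is a sketch under assumptions, not the package source: the helper ks_sketch is hypothetical, only the Euclidean metric is handled, and the full distance matrix is stored, which only suits small n.

```julia
# Hypothetical helper (not part of Jchemo): Kennard-Stone selection of k rows
function ks_sketch(X, k::Int)
    n = size(X, 1)
    @assert 2 <= k <= n
    # Squared Euclidean distances between all pairs of rows
    D = [sum(abs2, X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]
    # Start with the two most distant observations
    i1, i2 = Tuple(argmax(D))
    sel = [i1, i2]
    while length(sel) < k
        cand = setdiff(1:n, sel)
        # For each candidate, distance to its closest selected observation
        dmin = [minimum(D[i, sel]) for i in cand]
        push!(sel, cand[argmax(dmin)])
    end
    (train = setdiff(1:n, sel), test = sel)
end
```

Because each step favors the observation farthest from those already chosen, the selected set spreads over the edges of the X-space first, which is why it shows higher variability than the remaining observations.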
References
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
k = 80
res = sampks(X, k)
pnames(res)
res.train
res.test
mod = model(pcasvd; nlv = 15)
fit!(mod, X)
@head T = mod.fm.T
res = sampks(T, k; metric = :mah)
#####################
n = 10
k = 25
X = [repeat(1:n, inner = n) repeat(1:n, outer = n)]
X = Float64.(X)
X .= X + .1 * randn(nro(X), nco(X))
s = sampks(X, k).test
f, ax = plotxy(X[:, 1], X[:, 2])
scatter!(ax, X[s, 1], X[s, 2]; color = "red")
f
Jchemo.samprand
Method
samprand(n::Int, k::Int; replace = false)
Build training vs. test sets by random sampling.
- n: Total nb. of observations.
- k: Nb. test observations to sample.
Keyword arguments:
- replace: Boolean. If false, the sampling is without replacement.
Two outputs are returned (= row indexes of the data):
- train (n - k),
- test (k).
Output test is built by random sampling within 1:n.
Examples
n = 10
samprand(n, 4)
Jchemo.sampsys
Method
sampsys(y, k::Int)
Build training vs. test sets by systematic sampling over a quantitative variable.
- y: Quantitative variable (n) to sample.
- k: Nb. test observations to sample. Must be >= 2.
Two outputs are returned (= row indexes of the data):
- train (n - k),
- test (k).
Output test is built by systematic sampling over the rank of the y observations. For instance, if k / n ~ .3, about one observation out of three is selected over the sorted y.
Output test always contains the indexes of the minimum and maximum of y.
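One simple way to obtain such a sample is to take k evenly spaced ranks between 1 and n over the sorted y, so that the first rank (minimum) and the last rank (maximum) are always included. The sketch below illustrates this idea. The helper sampsys_sketch is hypothetical and the exact rank spacing used by the package may differ.

```julia
# Hypothetical helper (not part of Jchemo): systematic sampling over sorted y
function sampsys_sketch(y, k::Int)
    n = length(y)
    @assert 2 <= k <= n
    ord = sortperm(y)                                     # indexes sorting y
    pos = round.(Int, range(1, stop = n, length = k))     # k evenly spaced ranks
    test = ord[unique(pos)]                               # unique() guards against rounding ties
    (train = setdiff(1:n, test), test = test)
end

y = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2, 0.8]
res = sampsys_sketch(y, 3)
y[res.test]          # minimum, a middle value and maximum of y
```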
Examples
y = rand(7)
[y sort(y)]
res = sampsys(y, 3)
sort(y[res.test])
Jchemo.sampwsp
Method
sampwsp(X, dmin; maxit = nro(X))
Build training vs. test sets by WSP sampling.
- X: X-data (n, p).
- dmin: Distance "dmin" (Santiago et al. 2012).
Keyword arguments:
- maxit: Maximum number of iterations.
Two outputs (= row indexes of the data) are returned:
- train (n - k),
- test (k).
Output test is built from the "Wootton, Sergent, Phan-Tan-Luu" (WSP) algorithm, assumed to generate samples uniformly distributed in the X domain (Santiago et al. 2012).
References
Béal A. 2015. Description et sélection de données en grande dimension. Thèse de doctorat. Laboratoire d’Instrumentation et de sciences analytiques, Ecole doctorale des sciences chimiques, Université d'Aix-Marseille.
Santiago, J., Claeys-Bruno, M., Sergent, M., 2012. Construction of space-filling designs using WSP algorithm for high dimensional spaces. Chemometrics and Intelligent Laboratory Systems, Selected Papers from Chimiométrie 2010 113, 26–31. https://doi.org/10.1016/j.chemolab.2011.06.003
Examples
n = 600 ; p = 2
X = rand(n, p)
dmin = .5
s = sampwsp(X, dmin)
pnames(s)
@show length(s.test)
plotxy(X[s.test, 1], X[s.test, 2]).f
Jchemo.savgk
Method
savgk(nhwindow::Int, degree::Int, deriv::Int)
Compute the kernel of the Savitzky-Golay filter.
- nhwindow: Nb. points (>= 1) of the half window.
- degree: Degree of the smoothing polynomial, where 1 <= degree <= 2 * nhwindow.
- deriv: Derivation order, where 0 <= deriv <= degree.
The size of the kernel is odd (npoint = 2 * nhwindow + 1):
- x[-nhwindow], x[-nhwindow+1], ..., x[0], ..., x[nhwindow-1], x[nhwindow].
If deriv = 0, there is no derivation (only polynomial smoothing).
The case degree = 0 (i.e. simple moving average) is not allowed by the function.
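The Savitzky-Golay weights can be derived from a least-squares polynomial fit over the window. The sketch below shows this standard derivation. The helper savgk_sketch is hypothetical, and the names S, G and kern are chosen here for illustration only, with no claim that they match the content of the objects returned by savgk. The weights are meant to be applied as a dot product with the window values; for a true convolution they would be reversed.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Jchemo): Savitzky-Golay weights by least squares
function savgk_sketch(nhwindow::Int, degree::Int, deriv::Int)
    @assert nhwindow >= 1 && 1 <= degree <= 2 * nhwindow && 0 <= deriv <= degree
    x = collect(-nhwindow:nhwindow)                 # window positions, npoint = 2 * nhwindow + 1
    S = [float(xi)^j for xi in x, j in 0:degree]    # design matrix (npoint, degree + 1)
    G = S * inv(S' * S)                             # column j + 1 estimates the j-th polynomial coefficient
    kern = factorial(deriv) .* G[:, deriv + 1]      # weights of the deriv-th derivative at x[0]
    (S = S, G = G, kern = kern)
end

savgk_sketch(2, 2, 0).kern    # 5-point quadratic smoothing weights
```

For deriv = 0 the weights sum to 1, so a constant signal is left unchanged by the smoothing.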
References
Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002
Examples
res = savgk(21, 3, 2)
pnames(res)
res.S
res.G
res.kern
Jchemo.savgol
Method
savgol(X; kwargs...)
Savitzky-Golay derivation and smoothing of each row of X-data.
- X: X-data (n, p).
Keyword arguments:
- npoint: Size of the filter (nb. points involved in the kernel). Must be odd and >= 3. The half-window size is nhwindow = (npoint - 1) / 2.
- degree: Degree of the smoothing polynomial. Must be: 1 <= degree <= npoint - 1.
- deriv: Derivation order. Must be: 0 <= deriv <= degree.
The smoothing is computed by convolution (with padding), using function imfilter of package ImageFiltering.jl. Each returned point is located on the center of the kernel. The kernel is computed with function savgk.
The function returns a matrix (n, p).
References
Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002
Savitzky, A., Golay, M.J.E., 1964. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry 36, 1627–1639. https://doi.org/10.1021/ac60214a047
Schafer, R.W., 2011. What Is a Savitzky-Golay Filter? [Lecture Notes]. IEEE Signal Processing Magazine 28, 111–117. https://doi.org/10.1109/MSP.2011.941097
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
npoint = 11 ; degree = 2 ; deriv = 2
mod = model(savgol; npoint, degree, deriv)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
####### Gaussian signal
u = -15:.1:15
n = length(u)
x = exp.(-.5 * u.^2) / sqrt(2 * pi) + .03 * randn(n)
M = 10 # half window
N = 3 # degree
deriv = 0
#deriv = 1
mod = model(savgol; npoint = 2M + 1, degree = N, deriv)
fit!(mod, x')
xp = transf(mod, x')
f, ax = plotsp(x', u; color = :blue)
lines!(ax, u, vec(xp); color = :red)
f
Jchemo.scale
Method
scale(X)
scale(X, weights::Weight)
Column-wise scaling of X-data.
- X: X-data (n, p).
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(scale)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
colstd(Xptrain)
@head Xptest
@head Xtest ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.segmkf
Method
segmkf(n::Int, K::Int; rep = 1)
segmkf(group::Vector, K::Int; rep = 1)
Build segments of observations for K-fold cross-validation.
- n: Total nb. of observations in the dataset. The sampling is implemented within 1:n.
- group: A vector (n) defining blocks of observations.
- K: Nb. folds (segments) splitting the n observations.
Keyword arguments:
- rep: Nb. replications of the sampling.
For each replication, the function splits the n observations into K segments that can be used for K-fold cross-validation.
If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.
The function returns a list (vector) of rep elements. Each element of the list contains K segments (= K vectors). Each segment contains the indexes (position within 1:n) of the sampled observations.
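The structure of the output (rep elements, each holding K index vectors) and the block-sampling idea can be sketched as follows. The helper segmkf_sketch is hypothetical and is not the package implementation. In particular, the way observations are dealt into folds here (every K-th element of a random permutation) is an assumption.

```julia
using Random

# Hypothetical helper (not part of Jchemo): K-fold segments, replicated rep times
function segmkf_sketch(n::Int, K::Int; rep = 1)
    map(1:rep) do _
        perm = randperm(n)                      # random order of 1:n
        [sort(perm[k:K:n]) for k in 1:K]        # K disjoint segments covering 1:n
    end
end

# Block version: whole groups are dealt into folds instead of observations
function segmkf_sketch(group::Vector, K::Int; rep = 1)
    lev = unique(group)
    map(segmkf_sketch(length(lev), K; rep)) do folds
        [findall(in(lev[f]), group) for f in folds]
    end
end

group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]
segm = segmkf_sketch(group, 3; rep = 2)
group[segm[1][1]]     # all observations of the groups drawn in fold 1
```

In the block version, two observations of the same group never end up in different folds, which is the property that avoids over-optimistic error estimates.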
Examples
n = 10 ; K = 3
rep = 4
segm = segmkf(n, K; rep)
i = 1
segm[i]
segm[i][1]
n = 10
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"] # blocks of the observations
tab(group)
K = 3 ; rep = 4
segm = segmkf(group, K; rep)
i = 1
segm[i]
segm[i][1]
group[segm[i][1]]
group[segm[i][2]]
group[segm[i][3]]
Jchemo.segmts
Method
segmts(n::Int, m::Int; rep = 1, seed = nothing)
segmts(group::Vector, m::Int; rep = 1, seed = nothing)
Build segments of observations for "test-set" validation.
- n: Total nb. of observations in the dataset. The sampling is implemented within 1:n.
- group: A vector (n) defining blocks of observations.
- m: Nb. test observations, or groups if group is used, returned in each segment.
Keyword arguments:
- rep: Nb. replications of the sampling.
- seed: Optional seed for the Random.MersenneTwister generator. Must be of length = rep. When nothing, the seed is random at each replication.
For each replication, the function builds a test set that can be used to validate a model.
If group is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.
The function returns a list (vector) of rep elements. Each element of the list is a vector of the indexes (positions within 1:n) of the sampled observations.
Examples
n = 10 ; m = 3
rep = 4
segm = segmts(n, m; rep)
i = 1
segm[i]
segm[i][1]
n = 10
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"] # blocks of the observations
tab(group)
m = 2 ; rep = 4
segm = segmts(group, m; rep)
i = 1
segm[i]
segm[i][1]
group[segm[i][1]]
Jchemo.selwold
Method
selwold(indx, r; smooth = true, npoint = 5, alpha = .05, digits = 3, graph = true,
    step = 2, xlabel = "Index", ylabel = "Value", title = "Score")
Wold's criterion to select dimensionality in LV models (e.g. PLSR).
- indx: A variable representing the model parameter(s), e.g. nb. LVs if PLSR models.
- r: A vector of error rates (n), e.g. RMSECV.
Keyword arguments:
- smooth: Boolean. If true, the selection is done after a moving-average smoothing of rate R (see function mavg).
- npoint: Window of the moving-average used to smooth rate R.
- alpha: Proportion alpha used as threshold for rate R.
- digits: Number of digits in the outputs.
- graph: Boolean. If true, outputs are plotted.
- step: Step used for defining the xticks in the graphs.
- xlabel: Horizontal label for the plots.
- ylabel: Vertical label for the plots.
- title: Title of the left plot.
The selection criterion is the "precision gain ratio":
- R = 1 - r(a+1) / r(a)
where r is an observed error rate quantifying the model performance (e.g. RMSEP, classification error rate, etc.) and a the model dimensionality (= nb. LVs). r can also represent other indicators such as the eigenvalues of a PCA.
R is the relative gain in performance efficiency after a new LV is added to the model. The iterations continue until R becomes lower than a threshold value alpha. The default alpha = .05 is only indicative: the user should set another value depending on their data and parsimony objective.
In his original article, Wold (1978; see also Bro et al. 2008) used the ratio of cross-validated over training residual sums of squares, i.e. PRESS over SSR. Instead, function selwold compares values of consistent nature (the successive values in the input vector r). For instance, r was set to PRESS values in Li et al. (2002) and Andries et al. (2011), which is equivalent to the "punish factor" described in Westad & Martens (2000).
The ratio R can be erratic (particularly when r is the error rate of a discrimination model), making the dimensionality selection difficult. In such a situation, function selwold proposes to calculate a smoothing of R (argument smooth).
The function returns two outputs (in addition to optional plots):
- opt: The index corresponding to the minimum value of r.
- sel: The index of the selection from the R (or smoothed R) threshold.
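The ratio R and the two outputs can be illustrated on a small vector of error rates. This is a sketch under assumptions: the helper wold_sketch is hypothetical, no smoothing is applied, and sel is taken as the first dimension at which adding one more LV yields R < alpha. The exact selection rule used by selwold may differ.

```julia
# Hypothetical helper (not part of Jchemo): Wold's precision gain ratio, no smoothing
function wold_sketch(indx, r; alpha = .05)
    R = 1 .- r[2:end] ./ r[1:end-1]        # R[a] = 1 - r(a + 1) / r(a)
    opt = indx[argmin(r)]                  # index minimizing r
    a = findfirst(<(alpha), R)             # first step whose gain drops below alpha
    sel = isnothing(a) ? indx[end] : indx[a]
    (R = R, opt = opt, sel = sel)
end

r = [1.0, 0.5, 0.4, 0.39, 0.395]           # e.g. RMSECV for 0 to 4 LVs
res = wold_sketch(0:4, r)
res.R
(res.opt, res.sel)
```

In this toy case the minimum of r is reached with 3 LVs, while the gain from 2 to 3 LVs is only 2.5 %, so the threshold rule retains the more parsimonious model with 2 LVs.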
References
Andries, J.P.M., Vander Heyden, Y., Buydens, L.M.C., 2011. Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity. Analytica Chimica Acta 705, 292-305. https://doi.org/10.1016/j.aca.2011.06.037
Bro, R., Kjeldahl, K., Smilde, A.K., Kiers, H.A.L., 2008. Cross-validation of component models: A critical look at current methods. Anal Bioanal Chem 390, 1241-1251. https://doi.org/10.1007/s00216-007-1790-1
Li, B., Morris, J., Martin, E.B., 2002. Model selection for partial least squares regression. Chemometrics and Intelligent Laboratory Systems 64, 79-89. https://doi.org/10.1016/S0169-7439(02)00051-5
Westad, F., Martens, H., 2000. Variable Selection in near Infrared Spectroscopy Based on Significance Testing in Partial Least Squares Regression. J. Near Infrared Spectrosc., JNIRS 8, 117–124.
Wold S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics. 1978;20(4):397-405
Examples
using JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
n = nro(Xtrain)
segm = segmts(n, 50; rep = 30)
mod = model(plskern)
nlv = 0:20
res = gridcv(mod, Xtrain, ytrain; segm, score = rmsep, nlv).res
res[res.y1 .== minimum(res.y1), :]
plotgrid(res.nlv, res.y1; xlabel = "Nb. LVs", ylabel = "RMSEP").f
zres = selwold(res.nlv, res.y1; smooth = true, graph = true) ;
@show zres.opt
@show zres.sel
zres.f
Jchemo.sep
Method
sep(pred, Y)
Compute the corrected SEP ("SEP_c"), i.e. the standard deviation of the prediction errors.
- pred: Predictions.
- Y: Observed data.
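SEP_c is linked to the bias and to RMSEP by the decomposition RMSEP^2 = bias^2 + SEP_c^2 (see Bellon-Maurel et al. 2010), which holds exactly when the standard deviation uses the divisor n. The sketch below checks this on a single response. The three helpers are hypothetical, and the divisor n is an assumption (a divisor n - 1 is also common).

```julia
using Statistics

# Hypothetical helpers (not part of Jchemo), for a single response
bias_sketch(pred, y) = mean(pred .- y)
rmsep_sketch(pred, y) = sqrt(mean(abs2, pred .- y))
sep_sketch(pred, y) = std(pred .- y; corrected = false)   # divisor n (assumption)

pred = [1.0, 2.0, 4.0, 3.5]
y = [1.5, 2.0, 3.0, 3.0]
rmsep_sketch(pred, y)^2 ≈ bias_sketch(pred, y)^2 + sep_sketch(pred, y)^2
```

SEP_c therefore measures the dispersion of the errors once the systematic shift (bias) has been removed.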
References
Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.-M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends in Analytical Chemistry 29, 1073–1081. https://doi.org/10.1016/j.trac.2010.05.006
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
sep(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
sep(pred, ytest)
Jchemo.snorm
Method
snorm(X)
Row-wise norming of X-data.
- X: X-data (n, p).
Each row of X is divided by its norm.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
mod = model(snorm)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
rownorm(Xptrain)
rownorm(Xptest)
Jchemo.snv
Method
snv(X; kwargs...)
Standard-normal-variate (SNV) transformation of each row of X-data.
- X: X-data (n, p).
Keyword arguments:
- centr: Boolean indicating if the centering is done.
- scal: Boolean indicating if the scaling is done.
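In its usual form, SNV subtracts from each row its own mean and divides it by its own standard deviation. A minimal sketch is given below. The helper snv_sketch is hypothetical, and the use of the uncorrected standard deviation (divisor p) is an assumption about the convention.

```julia
using Statistics

# Hypothetical helper (not part of Jchemo): row-wise SNV
function snv_sketch(X; centr = true, scal = true)
    mu = centr ? mean(X, dims = 2) : zeros(size(X, 1), 1)
    s = scal ? std(X; dims = 2, corrected = false) : ones(size(X, 1), 1)
    (X .- mu) ./ s      # each row is centered and scaled by its own statistics
end

X = [1.0 2.0 3.0; 2.0 4.0 9.0]
Z = snv_sketch(X)
mean(Z, dims = 2)       # row means are (numerically) zero
```

Unlike column-wise scaling, nothing is learned from a training set: each spectrum is transformed using only its own values.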
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(dat.X, wl; nsamp = 20).f
centr = true ; scal = true
mod = model(snv; centr, scal)
fit!(mod, Xtrain)
Xptrain = transf(mod, Xtrain)
Xptest = transf(mod, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.soft
Method
soft(x::Real, delta)
Soft thresholding function.
- x: Value to transform.
- delta: Range for the thresholding.
The returned value is:
- sign(x) * max(0, abs(x) - delta)
where delta >= 0.
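The formula can be tried directly in Julia. The one-liner below is a re-statement of the expression above under a hypothetical name (soft_sketch), not the package source.

```julia
# Hypothetical re-statement of the formula above (not the package source)
soft_sketch(x::Real, delta) = sign(x) * max(0, abs(x) - delta)

soft_sketch(3, .2)       # 3 is shrunk toward zero by delta
soft_sketch(-0.1, .2)    # abs(x) <= delta, so the result is zero
soft_sketch(-1.0, .25)   # the sign of x is preserved
```

Values inside [-delta, delta] are set to zero and all other values are moved toward zero by delta, which is what makes the function useful for building sparse loadings.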
Examples
using CairoMakie
delta = .2
soft(3, delta)
x = LinRange(-2, 2, 100)
y = soft.(x, delta)
lines(x, y)
Jchemo.softmax
Method
softmax(x::AbstractVector)
softmax(X::Union{Matrix, DataFrame})
Softmax function.
- x: A vector to transform.
- X: A matrix whose rows are transformed.
Let v be a vector:
- softmax(v) = exp.(v) / sum(exp.(v))
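When v contains large values, exp.(v) can overflow. A common remedy, shown in the sketch below, is to subtract the maximum of v before exponentiating, which leaves the result mathematically unchanged. The helper softmax_sketch is hypothetical, and whether the package applies this shift is not stated here.

```julia
# Hypothetical helper (not part of Jchemo): numerically stable softmax
function softmax_sketch(x::AbstractVector)
    e = exp.(x .- maximum(x))     # the shift cancels out in the ratio
    e / sum(e)
end

# Matrix version: the transformation is applied to each row
softmax_sketch(X::AbstractMatrix) = mapslices(softmax_sketch, X; dims = 2)

softmax_sketch(1:3)
softmax_sketch([1000.0, 1000.0])   # no overflow thanks to the shift
```

Each output vector (or each row of the output matrix) is positive and sums to 1.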
Examples
x = 1:3
softmax(x)
X = rand(5, 3)
softmax(X)
Jchemo.soplsr
Method
soplsr(Xbl, Y; kwargs...)
soplsr(Xbl, Y, weights::Weight; kwargs...)
soplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)
Multiblock sequentially orthogonalized PLSR (SO-PLSR).
- Xbl: List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
- Y: Y-data (n, q).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. latent variables (LVs = scores T) to compute.
- scal: Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation.
References
Biancolillo et al. , 2015. Combining SO-PLS and linear discriminant analysis for multi-block classification. Chemometrics and Intelligent Laboratory Systems, 141, 58-67.
Biancolillo, A. 2016. Method development in the area of multi-block analysis focused on food analysis. PhD. University of Copenhagen.
Menichelli et al., 2014. SO-PLS as an exploratory tool for path modelling. Food Quality and Preference, 36, 122-134.
Examples
using JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 2
#nlv = [2, 1, 2]
#nlv = [2, 0, 1]
scal = false
#scal = true
mod = model(soplsr; nlv, scal)
fit!(mod, Xbltrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head transf(mod, Xbltrain)
transf(mod, Xbltest)
res = predict(mod, Xbltest)
res.pred
rmsep(res.pred, ytest)
Jchemo.sourcedir
Method
sourcedir(path)
Include all the files contained in a directory.
Jchemo.spca
Method
spca(X; kwargs...)
spca(X, weights::Weight; kwargs...)
spca!(X::Matrix, weights::Weight; kwargs...)
Sparse PCA (Shen & Huang 2008).
- X: X-data (n, p).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. principal components (PCs).
- msparse: Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See below.
- delta: Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
- nvar: Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each principal component (PC). Can be a single integer (i.e. same nb. of variables for each PC), or a vector of length nlv.
- tol: Tolerance value for stopping the iterations.
- maxit: Maximum nb. of Nipals iterations.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Sparse principal component analysis via regularized low rank matrix approximation (Shen & Huang 2008). A Nipals algorithm is used. The function provides three methods of thresholding to compute the sparse loadings:
- msparse = :soft: Soft thresholding of standardized loadings. Let v denote a given loading vector before thresholding. Vector abs(v) is first standardized to its maximal component (= max{abs(v[i]), i = 1..p}). The soft-thresholding function (see function soft) is applied to this standardized vector, with the constant delta ∈ [0, 1]. This returns the sparse vector theta. Vector v is multiplied term-by-term by this vector theta, which finally gives the sparse loadings.
- msparse = :mix: Method used in function spca of the R package mixOmics (Lê Cao et al.). For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected. Then a soft-thresholding is applied to the corresponding selected loadings. Range delta is automatically (internally) set equal to the maximal value of the components of abs(v) corresponding to variables removed from the selection.
- msparse = :hard: For each PC, the nvar X-variables showing the largest values in vector abs(v) are selected.
The case msparse = :mix returns the same results as function spca of the R package mixOmics.
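The msparse = :soft step described above can be followed on a single loading vector. The sketch below transcribes the three operations (standardization of abs(v), soft thresholding, term-by-term product). The helper sparse_soft_sketch is hypothetical, and any renormalization of the loadings that the package may apply afterwards is not shown.

```julia
# Hypothetical helper (not part of Jchemo): the :soft step on one loading vector
function sparse_soft_sketch(v::AbstractVector, delta)
    absv = abs.(v)
    z = absv / maximum(absv)        # abs(v) standardized to its maximal component, in [0, 1]
    theta = max.(0, z .- delta)     # soft thresholding of a non-negative vector
    v .* theta                      # term-by-term product gives the sparse loadings
end

v = [0.8, -0.4, 0.1]
sparse_soft_sketch(v, .4)           # the smallest loading is set to zero
```

With delta = 0 no loading is removed, and with delta = 1 all loadings are set to zero, which is consistent with delta being restricted to [0, 1].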
Note: The resulting sparse loadings vectors (P-columns) are in general non orthogonal. Therefore, there is no unique decomposition of the variance of X such as in PCA. Function summary returns the following objects:
- explvarx: The proportion of variance of X explained by each column t of T, computed by regressing X on t (such as what is done in PLS).
- explvarx_adj: Adjusted explained variance proposed by Shen & Huang 2008 section 2.3.
References
Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
Shen, H., Huang, J.Z., 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015–1034. https://doi.org/10.1016/j.jmva.2007.06.007
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
pnames(dat)
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
Xtest = X[s.test, :]
nlv = 3
msparse = :mix ; nvar = 2
#msparse = :hard ; nvar = 2
scal = false
mod = model(spca; nlv, msparse, nvar, scal) ;
fit!(mod, Xtrain)
fm = mod.fm ;
pnames(fm)
fm.niter
fm.sellv
fm.sel
fm.P
fm.P' * fm.P
@head T = fm.T
@head transf(mod, Xtrain)
@head Ttest = transf(fm, Xtest)
res = summary(mod, Xtrain) ;
res.explvarx
res.explvarx_adj
nlv = 3
msparse = :soft ; delta = .4
mod = model(spca; nlv, msparse, delta) ;
fit!(mod, Xtrain)
mod.fm.P
Jchemo.splskdeda
Method
splskdeda(X, y; kwargs...)
splskdeda(X, y, weights::Weight; kwargs...)
Sparse PLS-KDE-DA.
- X: X-data (n, p).
- y: Univariate class membership (n).
- weights: Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv: Nb. latent variables (LVs) to compute. Must be >= 1.
- msparse: Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See below.
- delta: Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must ∈ [0, 1]. The higher delta, the stronger the thresholding.
- nvar: Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
- prior: Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
- scal: Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plskdeda (PLS-KDE-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function splslda for examples.
Jchemo.splskern — Method
splskern(X, Y; kwargs...)
splskern(X, Y, weights::Weight; kwargs...)
splskern!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Sparse partial least squares regression (Lê Cao et al. 2008).
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See below.
delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must be in [0, 1]. The higher delta, the stronger the thresholding.
nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
Sparse partial least squares regression (Lê Cao et al. 2008), with the fast "improved kernel algorithm #1" of Dayal & MacGregor (1997).
In the present version of splskern, the sparse correction only concerns X. The function provides three methods of thresholding to compute the sparse X-loading weights w; see function spca for a description (same principles). The case msparse = :mix returns the same results as function spls of the R package mixOmics with the regression mode (and without sparseness on Y).
The case msparse = :hard (or msparse = :mix) with nvar = 1 corresponds to the COVSEL regression described in Roger et al. 2011 (see also Höskuldsson 1992).
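As a rough illustration of the thresholding options, the sketch below applies a soft and a hard threshold to a single loading-weight vector w. It assumes the usual soft-thresholding operator, sign(w) * max(|w| - delta, 0), applied after dividing w by its maximal absolute value. The helper names soft_threshold and hard_threshold are hypothetical (they are not Jchemo functions), and the actual implementation in splskern may differ.

```julia
# Hypothetical helpers for illustration; not part of Jchemo.

# Soft thresholding: standardize by the maximal absolute value, then shrink
# every entry towards zero by delta (entries at or below delta become 0).
function soft_threshold(w::AbstractVector, delta::Real)
    wstd = w ./ maximum(abs, w)
    sign.(wstd) .* max.(abs.(wstd) .- delta, 0)
end

# Hard thresholding: keep the nvar entries with the largest absolute values
# and set all the others to zero.
function hard_threshold(w::AbstractVector, nvar::Integer)
    keep = partialsortperm(abs.(w), 1:nvar; rev = true)
    out = zero(w)
    out[keep] = w[keep]
    out
end

w = [0.9, -0.1, 0.45, -1.8, 0.05]
ws = soft_threshold(w, 0.4)   # only entries with |w| / max|w| > 0.4 stay non-zero
wh = hard_threshold(w, 2)     # only the two largest |w| stay non-zero
```

With delta = 0.4, two entries survive the soft threshold; with nvar = 2, the hard threshold keeps the entries 0.9 and -1.8 unchanged.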
References
Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.
Höskuldsson, A., 1992. The H-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, Proceedings of the 2nd Scandinavian Symposium on Chemometrics 14, 139–153. https://doi.org/10.1016/0169-7439(92)80099-P
Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., Besse, P., 2008. A Sparse PLS for Variable Selection when Integrating Omics Data. Statistical Applications in Genetics and Molecular Biology 7. https://doi.org/10.2202/1544-6115.1390
Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
Roger, J.M., Palagos, B., Bertrand, D., Fernandez-Ahumada, E., 2011. covsel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chemometrics and Intelligent Laboratory Systems 106, 216–223.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
msparse = :mix ; nvar = 5
#msparse = :hard ; nvar = 5
mod = model(splskern; nlv, msparse, nvar) ;
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
@head mod.fm.T
@head mod.fm.W
coef(mod)
coef(mod; nlv = 3)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = summary(mod, Xtrain) ;
pnames(res)
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.splslda — Method
splslda(X, y; kwargs...)
splslda(X, y, weights::Weight; kwargs...)
Sparse PLS-LDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See function splskern.
delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must be in [0, 1]. The higher delta, the stronger the thresholding.
nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plslda (PLS-LDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.
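The two symbolic options of prior can be pictured with a small base-Julia sketch. It assumes that :prop means weights proportional to the class frequencies observed in y; the helper class_priors is hypothetical and is not a Jchemo function.

```julia
# Hypothetical helper illustrating the two symbolic options for prior.
function class_priors(y::AbstractVector, prior::Symbol)
    lev = sort(unique(y))                    # sorted class levels
    ni = [count(==(l), y) for l in lev]      # nb. observations per class
    if prior == :unif
        fill(1 / length(lev), length(lev))   # same weight for every class
    elseif prior == :prop
        ni ./ sum(ni)                        # weight proportional to class size
    else
        throw(ArgumentError("prior must be :unif or :prop"))
    end
end

y = ["a", "a", "a", "b"]
class_priors(y, :unif)   # [0.5, 0.5]
class_priors(y, :prop)   # [0.75, 0.25]
```

A user-supplied vector of priors would simply replace the computed one, given in the order of the sorted levels.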
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
msparse = :mix ; nvar = 10
mod = model(splslda; nlv, msparse, nvar)
#mod = model(splsqda; nlv, msparse, nvar, alpha = .1)
#mod = model(splskdeda; nlv, msparse, nvar, a_kde = .9)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
fmpls = fm.fm.fmpls ;
@head fmpls.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fmpls)
summary(fmpls, Xtrain)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
Jchemo.splsqda — Method
splsqda(X, y; kwargs...)
splsqda(X, y, weights::Weight; kwargs...)
Sparse PLS-QDA (with continuum).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See function splskern.
delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must be in [0, 1]. The higher delta, the stronger the thresholding.
nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
alpha : Scalar (in [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsqda (PLS-QDA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function splslda for examples.
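The alpha continuum can be pictured as a blend of covariance matrices. The sketch below assumes that the covariance matrix Wi of a class is shrunk linearly towards the pooled within-class covariance W, i.e. (1 - alpha) * Wi + alpha * W. Both this blending rule and the name blend_cov are illustrative assumptions and are not taken from the package.

```julia
# Hypothetical helper; the blending rule is an assumption for illustration.
blend_cov(Wi::AbstractMatrix, W::AbstractMatrix, alpha::Real) =
    (1 - alpha) .* Wi .+ alpha .* W

Wi = [2.0 0.3; 0.3 1.0]   # covariance of one class
W  = [1.5 0.1; 0.1 1.2]   # pooled within-class covariance

blend_cov(Wi, W, 0)     # alpha = 0: class-specific covariance (QDA)
blend_cov(Wi, W, 1)     # alpha = 1: pooled covariance (LDA)
blend_cov(Wi, W, 0.5)   # intermediate model
```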
Jchemo.splsrda — Method
splsrda(X, y; kwargs...)
splsrda(X, y, weights::Weight; kwargs...)
Sparse PLSR-DA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
msparse : Method used for the sparse thresholding. Possible values are: :soft, :mix, :hard. See function splskern.
delta : Only used if msparse = :soft. Range for the thresholding on the loadings (after they are standardized to their maximal absolute value). Must be in [0, 1]. The higher delta, the stronger the thresholding.
nvar : Only used if msparse = :mix or msparse = :hard. Nb. variables (X-columns) selected for each latent variable (LV). Can be a single integer (i.e. same nb. of variables for each LV), or a vector of length nlv.
prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order as mlev(x)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plsrda (PLSR-DA) except that a sparse PLSR (function splskern), instead of a PLSR (function plskern), is run on the Y-dummy table.
See functions plsrda and splskern for details.
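The scal option refers to the uncorrected standard deviation, i.e. computed with divisor n rather than n - 1. A minimal base-Julia sketch of that scaling (an illustration only, not the package code):

```julia
using Statistics

X = [1.0 10.0; 2.0 20.0; 3.0 60.0]
s = std(X; dims = 1, corrected = false)   # divisor n instead of n - 1
Xs = X ./ s                               # each column now has uncorrected std = 1
```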
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
msparse = :mix ; nvar = 10
mod = model(splsrda; nlv, msparse, nvar)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
@head fm.fm.T
@head transf(mod, Xtrain)
@head transf(mod, Xtest)
@head transf(mod, Xtest; nlv = 3)
coef(fm.fm)
res = predict(mod, Xtest) ;
pnames(res)
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(mod, Xtest; nlv = 1:2).pred
summary(fm.fm, Xtrain)
Jchemo.ssq — Method
ssq(X)
Compute the total inertia of a matrix.
X : Matrix.
Sum of all the squared components of X (= norm(X)^2; squared Frobenius norm).
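The identity stated above can be checked with base Julia alone (a sketch that does not call Jchemo):

```julia
using LinearAlgebra

X = [1.0 2.0; 3.0 4.0]
total = sum(abs2, X)    # 1 + 4 + 9 + 16 = 30
total ≈ norm(X)^2       # true: norm(X) is the Frobenius norm for a matrix
```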
Examples
X = rand(5, 2)
ssq(X)
Jchemo.ssr — Method
ssr(pred, Y)
Compute the sum of squared prediction errors (SSR).
pred : Predictions.
Y : Observed data.
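A minimal base-Julia sketch of the quantity, assuming the sum is taken separately for each column of Y (one value per response variable); this is an illustration, not the package code:

```julia
pred = [1.0 2.0; 2.0 4.0; 3.0 6.0]
Y    = [1.5 2.0; 2.0 3.0; 2.0 6.0]
ssr_cols = sum(abs2, pred .- Y; dims = 1)   # 1×2 matrix: [1.25 1.0]
```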
Examples
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, Ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, Ytest)
mod = model(plskern; nlv = 2)
fit!(mod, Xtrain, ytrain)
pred = predict(mod, Xtest).pred
ssr(pred, ytest)
Jchemo.stah — Method
stah(X, a; kwargs...)
Compute the Stahel-Donoho outlierness.
X : X-data (n, p).
a : Nb. dimensions simulated for the projection pursuit method.
Keyword arguments:
scal : Boolean. If true, matrix X is centred (by median) and scaled (by MAD) before computing the outlierness.
See Maronna and Yohai 1995 for details on the outlierness measure.
This outlierness measure is computed from a projection-pursuit approach:
- A projection matrix P (p, a) is built randomly from binary (0/1) data,
- and the observations (rows of X) are projected on the a directions.
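The projection-pursuit idea can be sketched in base Julia as follows. This is a simplified illustration under stated assumptions: the outlierness of an observation is taken as the maximum, over the a random directions, of |t - median(t)| / MAD(t), with a raw (unscaled) MAD. The function name stah_sketch is hypothetical, and the details of the actual stah implementation may differ.

```julia
using Random, Statistics

# Hypothetical sketch of a Stahel-Donoho type outlierness.
function stah_sketch(X::AbstractMatrix, a::Integer; rng = Random.default_rng())
    n, p = size(X)
    P = Float64.(rand(rng, 0:1, p, a))   # random binary (0/1) projection matrix
    T = X * P                            # projections of the rows of X (n × a)
    d = zeros(n)
    for j in 1:a
        t = view(T, :, j)
        med = median(t)
        s = median(abs.(t .- med))       # raw MAD of the projected values
        s == 0 && continue               # skip degenerate (e.g. all-zero) directions
        d .= max.(d, abs.(t .- med) ./ s)
    end
    d
end

rng = MersenneTwister(1)
X = vcat(randn(rng, 50, 5), fill(100.0, 1, 5))   # last row is a gross outlier
d = stah_sketch(X, 20; rng)
argmax(d)   # index of the most outlying observation
```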
References
Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341. https://doi.org/10.1080/01621459.1995.10476517
Examples
using CairoMakie
n = 300 ; p = 700 ; m = 80
ntot = n + m
X1 = randn(n, p)
X2 = randn(m, p) .+ rand(1:3, p)'
X = vcat(X1, X2)
a = 10
scal = false
#scal = true
res = stah(X, a; scal) ;
pnames(res)
res.d
plotxy(1:nro(X), res.d).f
Jchemo.summ — Method
summ(X; digits = 3)
summ(X, y; digits = 3)
Summarize a dataset (or a variable).
X : A dataset (n, p).
y : A categorical variable (n) (class membership).
digits : Nb. digits in the outputs.
Examples
n = 50
X = rand(n, 3)
y = rand(1:3, n)
res = summ(X)
pnames(res)
summ(X[:, 2]).res
summ(X, y)
Jchemo.svmda — Method
svmda(X, y; kwargs...)
Support vector machine for discrimination "C-SVC" (SVM-DA).
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
gamma : kern parameter, see below.
degree : kern parameter, see below.
coef0 : kern parameter, see below.
cost : Cost of constraints violation C parameter.
epsilon : Epsilon parameter in the loss function.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Kernel types:
- :krbf – radial basis function: exp(-gamma * ||x - y||^2)
- :kpol – polynomial: (gamma * x' * y + coef0)^degree
- :klin – linear: x' * y
- :ktanh – sigmoid: tanh(gamma * x' * y + coef0)
The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the library LIBSVM (Chang & Lin 2001).
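The kernel formulas listed above can be written for a single pair of observations as in the sketch below. The function names (rbf_kernel, poly_kernel, lin_kernel, tanh_kernel) and the default parameter values are hypothetical and chosen for illustration only; they are neither Jchemo nor LIBSVM.jl functions.

```julia
using LinearAlgebra

# Kernel value for two observations x and y (vectors of the same length).
rbf_kernel(x, y; gamma = 1.0) = exp(-gamma * sum(abs2, x .- y))
poly_kernel(x, y; gamma = 1.0, coef0 = 0.0, degree = 3) = (gamma * dot(x, y) + coef0)^degree
lin_kernel(x, y) = dot(x, y)
tanh_kernel(x, y; gamma = 1.0, coef0 = 0.0) = tanh(gamma * dot(x, y) + coef0)

x = [1.0, 2.0]
y = [2.0, 0.0]
rbf_kernel(x, y; gamma = 0.5)                            # exp(-0.5 * 5)
poly_kernel(x, y; gamma = 1.0, coef0 = 1.0, degree = 2)  # (2 + 1)^2 = 9
lin_kernel(x, y)                                         # 2
tanh_kernel(x, y; gamma = 0.1)                           # tanh(0.2)
```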
References
Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
kern = :krbf ; gamma = 1e4
cost = 1000 ; epsilon = .5
mod = model(svmda; kern, gamma, cost, epsilon)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
fm = mod.fm ;
fm.lev
fm.ni
res = predict(mod, Xtest) ;
pnames(res)
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.svmr — Method
svmr(X, y; kwargs...)
Support vector machine for regression (Epsilon-SVR).
X : X-data (n, p).
y : Univariate y-data (n).
Keyword arguments:
kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol, :klin, :ktanh. See below.
gamma : kern parameter, see below.
degree : kern parameter, see below.
coef0 : kern parameter, see below.
cost : Cost of constraints violation C parameter.
epsilon : Epsilon parameter in the loss function.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Kernel types:
- :krbf – radial basis function: exp(-gamma * ||x - y||^2)
- :kpol – polynomial: (gamma * x' * y + coef0)^degree
- :klin – linear: x' * y
- :ktanh – sigmoid: tanh(gamma * x' * y + coef0)
The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl), which is an interface to the library LIBSVM (Chang & Lin 2001).
References
Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
kern = :krbf ; gamma = .1
cost = 1000 ; epsilon = 1
mod = model(svmr; kern, gamma, cost, epsilon)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
kern = :krbf ; gamma = .1
mod = model(svmr; kern, gamma)
fit!(mod, x, y)
pred = predict(mod, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.tab — Method
tab(x)
Univariate tabulation.
x : Categorical variable.
The output contains sorted levels.
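What such a tabulation computes can be sketched in base Julia. The function tab_sketch is hypothetical; the field names keys and vals simply mirror those used in the example of this entry, and the real return type of tab may differ.

```julia
# Hypothetical sketch: sorted levels and their counts.
function tab_sketch(x::AbstractVector)
    lev = sort(unique(x))                  # sorted levels
    cnt = [count(==(l), x) for l in lev]   # nb. occurrences of each level
    (keys = lev, vals = cnt)
end

res = tab_sketch(["b", "a", "c", "a", "b", "b"])
res.keys   # ["a", "b", "c"]
res.vals   # [2, 3, 1]
```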
Examples
x = rand(["a";"b";"c"], 20)
res = tab(x)
res.keys
res.vals
Jchemo.tabdf — Method
tabdf(X; groups = nothing)
Compute the nb. occurrences in categorical variables of a dataset.
X : Data.
groups : Vector of the names of the group variables to consider in X (by default: all the columns of X).
The output (dataframe) contains sorted levels.
Examples
using DataFrames
n = 20
X = hcat(rand(1:2, n), rand(["a", "b", "c"], n))
tabdf(X)
tabdf(X[:, 2])
df = DataFrame(X, [:v1, :v2])
tabdf(df)
tabdf(df; groups = [:v1, :v2])
tabdf(df; groups = :v2)
Jchemo.tabdupl — Method
tabdupl(x)
Tabulate duplicated values in a vector.
x : Categorical variable.
Examples
x = ["a", "b", "c", "a", "b", "b"]
tab(x)
res = tabdupl(x)
res.keys
res.vals
Jchemo.transf — Method
transf(object::Blockscal, Xbl)
transf!(object::Blockscal, Xbl)
Compute the preprocessed data from a model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data to transform.
Jchemo.transf — Method
transf(object::Center, X)
transf!(object::Center, X::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Comdim, Xbl; nlv = nothing)
transfbl(object::Comdim, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf — Method
transf(object::Cscale, X)
transf!(object::Cscale, X::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Detrend, X)
transf!(object::Detrend, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Dkplsr, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Fdif, X)
transf!(object::Fdif, X::Matrix, M::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
M : Pre-allocated output matrix (n, p - npoint + 1).
The in-place function stores the output in M.
Jchemo.transf — Method
transf(object::Interpl, X)
transf!(object::Interpl, X::Matrix, M::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
M : Pre-allocated output matrix (n, p).
The in-place function stores the output in M.
Jchemo.transf — Method
transf(object::Kpca, X; nlv = nothing)
Compute PCs (scores T) from a fitted model.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
Jchemo.transf — Method
transf(object::Kplsr, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Mavg, X)
transf!(object::Mavg, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Mbconcat, Xbl)
Compute the preprocessed data from a model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data to transform.
Jchemo.transf — Method
transf(object::Mbpca, Xbl; nlv = nothing)
transfbl(object::Mbpca, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf — Method
transf(object::Mbplslda, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Mbplsrda, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf — Method
transf(object::Pcr, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model and a matrix X.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Plslda, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Plsrda, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf — Method
transf(object::Rmgap, X)
transf!(object::Rmgap, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Rosaplsr, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf — Method
transf(object::Rp, X; nlv = nothing)
Compute scores T from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which scores T are computed.
nlv : Nb. scores to compute.
Jchemo.transf — Method
transf(object::Savgol, X)
transf!(object::Savgol, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Scale, X)
transf!(object::Scale, X::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Snorm, X)
transf!(object::Snorm, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Snv, X)
transf!(object::Snv, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf — Method
transf(object::Soplsr, Xbl)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
Jchemo.transf — Method
transf(object::Spca, X; nlv = nothing)
Compute principal components (PCs = scores T) from a fitted model and X-data.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
Jchemo.transf — Method
transf(object::Union{Pca, Fda}, X; nlv = nothing)
Compute principal components (PCs = scores T) from a fitted model and X-data.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
Jchemo.transf — Method
transf(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf — Method
transf(object::Union{Plsr, Splsr}, X; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transfbl — Method
transfbl(object::Cca, X, Y; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl — Method
transfbl(object::Ccawold, X, Y; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl — Method
transfbl(object::Plscan, X, Y; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl — Method
transfbl(object::Plstuck, X, Y; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl — Method
transfbl(object::Rasvd, X, Y; nlv = nothing)
Compute latent variables (LVs = scores T) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.treer_dt — Method
treer_dt(X, y; kwargs...)
Regression tree (CART) with DecisionTree.jl.
X : X-data (n, p).
y : Univariate y-data (n).
Keyword arguments:
n_subfeatures : Nb. variables to select at random at each split (default: 0 ==> keep all).
max_depth : Maximum depth of the decision tree (default: -1 ==> no maximum).
min_sample_leaf : Minimum number of samples each leaf needs to have.
min_sample_split : Minimum number of observations needed for a split.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The function fits a single regression tree (CART) using the package DecisionTree.jl.
References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (Thèse de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
pnames(dat)
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
n_subfeatures = p / 3
max_depth = 15
mod = model(treer_dt; n_subfeatures, max_depth)
fit!(mod, Xtrain, ytrain)
pnames(mod)
pnames(mod.fm)
res = predict(mod, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.vcatdf — Method
vcatdf(dat; cols = :intersect)
Vertical concatenation of a list of dataframes.
dat : List (vector) of dataframes.
cols : Determines the columns of the returned data frame. See ?DataFrames.vcat.
Examples
using DataFrames
dat1 = DataFrame(rand(5, 2), [:v3, :v1])
dat2 = DataFrame(100 * rand(2, 2), [:v3, :v1])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
dat2 = DataFrame(100 * rand(2, 2), [:v1, :v3])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
dat2 = DataFrame(100 * rand(2, 3), [:v3, :v1, :a])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
Jchemo.vcatdf(dat; cols = :union)
Jchemo.vcol — Method
vcol(X::AbstractMatrix, j)
vcol(X::DataFrame, j)
vcol(x::Vector, j)
View of the j-th column(s) of a matrix X, or of the j-th element(s) of vector x.
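What a column view means can be shown with the base-Julia view function. This is only an illustration of view semantics (no copy, writes go through to the parent array); it does not call vcol itself, whose internals are not described here.

```julia
X = [1 2 3; 4 5 6]
v = view(X, :, 2)     # view of the 2nd column, no copy is made
v[1] = 20             # writing to the view modifies X
X[1, 2]               # 20
V = view(X, :, 2:3)   # view of several columns
size(V)               # (2, 2)
```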
Jchemo.vip — Method
vip(object::Union{Pcr, Plsr}; nlv = nothing)
vip(object::Union{Pcr, Plsr}, Y; nlv = nothing)
Variable importance in projection (VIP).
object : The fitted model.
Y : The Y-data that was used to fit the model.
Keyword arguments:
nlv : Nb. latent variables (LVs) to consider. If nothing, the maximal model is considered.
For a PLS model (or PCR, etc.) fitted on (X, Y) with A latent variables, and for variable xj (column j of X):
- VIP(xj) = [Sum.a(1,...,A) R2(Yc, ta) waj^2] / [Sum.a(1,...,A) R2(Yc, ta) (1 / p)]
where:
- Yc is the centered Y,
- ta is the a-th X-score,
- R2(Yc, ta) is the proportion of Yc-variance explained by ta, i.e. ||Yc.hat||^2 / ||Yc||^2 (where Yc.hat is the LS estimate of Yc by ta).
When Y is used, R2(Yc, ta) is replaced by the redundancy Rd(Yc, ta) (see function rd), as in Tenenhaus 1998 p.139.
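The formula above can be sketched numerically as follows. The sketch assumes that waj is the X-loading weight of variable j for LV a, with each weight vector scaled to unit norm, and that the R2 values are already available. The function vip_sketch is hypothetical. It returns both the ratio as written above and its square root, which is the scale used in the usual definition of VIP (e.g. Chong & Jun 2005); which of the two the package returns is not stated here.

```julia
# W  : p × A matrix of X-loading weights (column a = weights of LV a), assumed unit norm
# r2 : vector (A) of proportions of Yc-variance explained by each score ta
function vip_sketch(W::AbstractMatrix, r2::AbstractVector)
    p = size(W, 1)
    ratio = (W .^ 2 * r2) ./ (sum(r2) / p)   # the ratio as written in the formula above
    (ratio = ratio, vip = sqrt.(ratio))      # square root: usual VIP scale
end

W = [0.8 0.0; 0.6 0.6; 0.0 0.8]   # both columns have norm 1
r2 = [0.6, 0.2]
res = vip_sketch(W, r2)
sum(res.ratio) / size(W, 1)       # ≈ 1 when the columns of W have unit norm
```

With unit-norm weight vectors the ratios average to 1 over the p variables, which is why a threshold of 1 is a common reference value for the VIP.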
References
Chong, I.-G., Jun, C.-H., 2005. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78, 103–112. https://doi.org/10.1016/j.chemolab.2004.12.011
Mehmood, T., Sæbø, S., Liland, K.H., 2020. Comparison of variable selection methods in partial least squares regression. Journal of Chemometrics 34, e3226. https://doi.org/10.1002/cem.3226
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Examples
X = [1. 2 3 4; 4 1 6 7; 12 5 6 13;
27 18 7 6; 12 11 28 7]
Y = [10. 11 13; 120 131 27; 8 12 4;
1 200 8; 100 10 89]
y = Y[:, 1]
ycla = [1; 1; 1; 2; 2]
nlv = 3
mod = model(plskern; nlv)
fit!(mod, X, y)
res = vip(mod.fm)
pnames(res)
res.imp
fit!(mod, X, Y)
vip(mod.fm).imp
vip(mod.fm, Y).imp
mod = model(plsrda; nlv)
fit!(mod, X, ycla)
pnames(mod.fm)
fm = mod.fm.fm ;
vip(fm).imp
Ydummy = dummy(ycla).Y
vip(fm, Ydummy).imp
mod = model(plslda; nlv)
fit!(mod, X, ycla)
pnames(mod.fm.fm)
fm = mod.fm.fm.fmpls ;
vip(fm).imp
vip(fm, Ydummy).imp
Jchemo.viperm — Method
viperm(mod, X, Y; rep = 50, psamp = .3, score = rmsep)
Variable importance by direct permutations.
mod : Model to evaluate.
X : X-data (n, p).
Y : Y-data (n, q).
Keyword arguments:
rep : Number of replications of the splitting training/test.
psamp : Proportion of data used as test set to compute the score.
score : Function computing the prediction score.
The principle is as follows:
- Data (X, Y) are split randomly into a training and a test set.
- The model is fitted on Xtrain, and the score (error rate) is computed on Xtest. This gives the reference error rate.
- Rows of a given variable (feature) j in Xtest are randomly permuted (the rest of Xtest is unchanged). The score is computed on Xtestpermj (i.e. Xtest after the rows of variable j have been permuted). The importance of variable j is computed as the difference between this score and the reference score.
- This process is run for each variable j separately and replicated rep times. Average results are provided in the outputs, as well as the results per replication.
In general, this method returns results similar to those of the out-of-bag permutation method used in random forests (Breiman, 2001).
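The permutation principle described above can be sketched in base Julia. To stay self-contained, the sketch replaces the Jchemo model by an ordinary least-squares fit with intercept and uses an RMSE score; the names perm_importance and rmse are hypothetical and the code is not the viperm implementation.

```julia
using Random, Statistics

rmse(pred, y) = sqrt(mean(abs2, pred .- y))

# Hypothetical sketch of permutation importance (stand-in model: least squares).
function perm_importance(X, y; rep = 20, psamp = 0.3, rng = Random.default_rng())
    n, p = size(X)
    ntest = round(Int, psamp * n)
    imp = zeros(rep, p)
    for r in 1:rep
        idx = randperm(rng, n)                        # random training/test split
        test, train = idx[1:ntest], idx[ntest+1:end]
        b = [ones(length(train)) X[train, :]] \ y[train]
        predfun = Z -> [ones(size(Z, 1)) Z] * b
        ref = rmse(predfun(X[test, :]), y[test])      # reference score
        for j in 1:p
            Xperm = X[test, :]                        # indexing returns a copy
            Xperm[:, j] = Xperm[shuffle(rng, 1:ntest), j]   # permute rows of variable j
            imp[r, j] = rmse(predfun(Xperm), y[test]) - ref
        end
    end
    vec(mean(imp; dims = 1))                          # average over the replications
end

rng = MersenneTwister(123)
X = randn(rng, 200, 3)
y = 5 .* X[:, 1] .+ 0.1 .* randn(rng, 200)   # only variable 1 matters
imp = perm_importance(X, y; rep = 20, rng)
argmax(imp)                                  # index of the most important variable
```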
References
Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.P., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2")
@load db dat
pnames(dat)
X = dat.X
Y = dat.Y
wl_str = names(X)
wl = parse.(Float64, wl_str)
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Work on the j-th y-variable
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]
mod = model(plskern; nlv = 9)
res = viperm(mod, Xtrain, ytrain; rep = 50, score = rmsep) ;
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1]; xlabel = "Wavelength (nm)", ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
mod = model(rfr_dt; n_trees = 10, max_depth = 2000, min_samples_leaf = 5)
res = viperm(mod, Xtrain, ytrain; rep = 50)
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1];
xlabel = "Wavelength (nm)",
ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
Jchemo.vrow
— Method
vrow(X::AbstractMatrix, i)
vrow(X::DataFrame, i)
vrow(x::Vector, i)
View of the i-th row(s) of a matrix X, or of the i-th element(s) of a vector x.
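As a reminder of what a view implies, here is a short sketch using Base.view directly. It assumes that vrow behaves like view applied to rows, which this page does not state explicitly; the variable names are illustrative.

```julia
X = [1.0 2 3; 4 5 6; 7 8 9]

r = view(X, 2, :)           # view of row 2: no data is copied
r[1] = 40.0                 # writing to the view modifies X itself
X[2, 1]

rows = view(X, [1, 3], :)   # view of rows 1 and 3
size(rows)

x = [10.0, 20, 30]
e = view(x, 2:3)            # view of elements 2 and 3 of a vector
```

Because a view shares memory with its parent array, it avoids allocations but any in-place change is visible in the original data.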
Jchemo.wdist
— Method
wdist(d; h = 2, criw = 4, squared = false)
wdist!(d; h = 2, criw = 4, squared = false)
Compute weights from distances using a decreasing exponential function.
- d : A vector of distances.
Keyword arguments:
- h : A positive scaling scalar defining the shape of the weight function.
- criw : A positive scalar defining outliers in the distance vector d.
- squared : If true, distances are replaced by the squared distances; the weight function is then a Gaussian (RBF) kernel function.
Weights are computed by:
- exp(-d / (h * MAD(d)))
or are set to 0 for distances > Median(d) + criw * MAD(d). This is an adaptation of the weight function presented in Kim et al. 2011.
The weights decrease with increasing distances. The lower h is, the sharper the decrease. Weights are set to 0 for outliers (extreme distances).
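The formula above can be written out as a minimal sketch. It is not the package's implementation: the function names (mad_raw, weights_from_dist) are hypothetical, and the MAD is taken here as the raw median absolute deviation, whereas the scaling constant or any final normalization used by wdist is not specified in this text.

```julia
using Statistics

# Raw median absolute deviation (assumption: no consistency constant applied)
mad_raw(d) = median(abs.(d .- median(d)))

function weights_from_dist(d; h = 2, criw = 4, squared = false)
    # Optionally replace the distances by the squared distances
    z = squared ? d .^ 2 : float.(d)
    s = mad_raw(z)
    # Distances above this cutoff are treated as outliers
    cutoff = median(z) + criw * s
    # Decreasing exponential weight, or 0 for outliers
    [x <= cutoff ? exp(-x / (h * s)) : 0.0 for x in z]
end

d = [0.5, 1.0, 1.5, 2.0, 2.5, 50.0]
w = weights_from_dist(d; h = 2, criw = 4)
```

Here the median is 1.75 and the raw MAD is 0.75, so the cutoff is 4.75: the distance 50.0 gets a weight of 0, and the other weights decrease as the distance grows.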
References
Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011; 421(2):269-274. https://doi.org/10.1016/j.ijpharm.2011.10.007
Examples
using Jchemo, CairoMakie, Distributions
x1 = rand(Chisq(10), 100) ;
x2 = rand(Chisq(40), 10) ;
d = [sqrt.(x1) ; sqrt.(x2)]
h = 2 ; criw = 3
w = wdist(d; h, criw) ;
f = Figure(size = (600, 300))
ax1 = Axis(f, xlabel = "Distance", ylabel = "Nb. observations")
hist!(ax1, d, bins = 30)
ax2 = Axis(f, xlabel = "Distance", ylabel = "Weight")
scatter!(ax2, d, w)
f[1, 1] = ax1
f[1, 2] = ax2
f
d = collect(0:.5:15) ;
h = [.5, 1, 1.5, 2.5, 5, 10, Inf]
#h = [1, 2, 5, Inf]
w = wdist(d; h = h[1])
f = Figure(size = (500, 400))
ax = Axis(f, xlabel = "Distance", ylabel = "Weight")
lines!(ax, d, w, label = string("h = ", h[1]))
for i = 2:length(h)
w = wdist(d; h = h[i])
lines!(ax, d, w, label = string("h = ", h[i]))
end
axislegend("Values of h"; position = :lb)
f[1, 1] = ax
f
Jchemo.xfit
— Method
xfit(object)
xfit(object, X; nlv = nothing)
xfit!(object, X::Matrix; nlv = nothing)
Matrix fitting from a bilinear model (e.g. PCA).
- object : The fitted model.
- X : New X-data to be approximated from the model. Must be in the same scale as the X-data used to fit the model object, i.e. before centering and possible scaling.
Keyword arguments:
- nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.
Compute an approximation of matrix X from a bilinear model (e.g. PCA or PLS) fitted on X. The fitted X is returned in the original scale of the X-data used to fit the model object.
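For the PCA case, the idea can be sketched with a plain SVD. This is an illustration under stated assumptions (unweighted column centering, no scaling), not the internals of xfit; all variable names are hypothetical.

```julia
using LinearAlgebra, Statistics

X = [1. 2 3 4; 4 1 6 7; 12 5 6 13;
    27 18 7 6; 12 11 28 7]

xmeans = mean(X, dims = 1)    # column means (1, p)
Xc = X .- xmeans              # centered data
F = svd(Xc)

nlv = 2
P = F.V[:, 1:nlv]             # loadings (p, nlv)
T = Xc * P                    # scores (n, nlv)
Xfit = T * P' .+ xmeans       # approximation of X, back in the original scale
E = X - Xfit                  # residual matrix
```

With all the components retained, the approximation reproduces X exactly; with fewer components, the Frobenius norm of the residual matrix equals the norm of the discarded singular values.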
Examples
X = [1. 2 3 4; 4 1 6 7; 12 5 6 13;
27 18 7 6; 12 11 28 7]
Y = [10. 11 13; 120 131 27; 8 12 4;
1 200 8; 100 10 89]
n, p = size(X)
Xnew = X[1:3, :]
Ynew = Y[1:3, :]
y = Y[:, 1]
ynew = Ynew[:, 1]
weights = mweight(rand(n))
nlv = 2
scal = false
#scal = true
mod = model(pcasvd; nlv, scal) ;
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew; nlv = 0)
xfit(fm, Xnew; nlv = 1)
fm.xmeans
@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)
@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)
mod = model(pcasvd; nlv = min(n, p), scal)
fit!(mod, X)
fm = mod.fm ;
@head xfit(fm)
@head xfit(fm, X)
@head xresid(fm, X)
nlv = 3
scal = false
#scal = true
mod = model(plskern; nlv, scal)
fit!(mod, X, Y, weights)
fm = mod.fm ;
@head xfit(fm)
xfit(fm, Xnew)
xfit(fm, Xnew, nlv = 0)
xfit(fm, Xnew, nlv = 1)
@head X
@head xfit(fm) + xresid(fm, X)
@head xfit(fm, X; nlv = 1) + xresid(fm, X; nlv = 1)
@head Xnew
@head xfit(fm, Xnew) + xresid(fm, Xnew)
mod = model(plskern; nlv = min(n, p), scal)
fit!(mod, X, Y, weights)
fm = mod.fm ;
@head xfit(fm)
@head xfit(fm, Xnew)
@head xresid(fm, Xnew)
Jchemo.xresid
— Method
xresid(object, X; nlv = nothing)
xresid!(object, X::Matrix; nlv = nothing)
Residual matrix from a bilinear model (e.g. PCA).
- object : The fitted model.
- X : New X-data to be approximated from the model. Must be in the same scale as the X-data used to fit the model object, i.e. before centering and possible scaling.
Keyword arguments:
- nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.
Compute the residual matrix:
- E = X - X_fit
where X_fit is the fitted X returned by function xfit. See xfit for examples.