Index of functions
Here is a list of all exported functions from Jchemo.jl.
For more details, click on a link and you will be directed to the function's help.
Base.summary (15 methods)
Jchemo.aggstat
Jchemo.aggsum
Jchemo.aicplsr
Jchemo.aov1
Jchemo.bias
Jchemo.blockscal
Jchemo.calds
Jchemo.calpds
Jchemo.cca
Jchemo.ccawold
Jchemo.center
Jchemo.cglsr
Jchemo.coef (9 methods)
Jchemo.colmad
Jchemo.colmean
Jchemo.colmed
Jchemo.colnorm
Jchemo.colstd
Jchemo.colsum
Jchemo.colvar
Jchemo.comdim
Jchemo.conf
Jchemo.convertdf
Jchemo.cor2
Jchemo.corm
Jchemo.corv
Jchemo.cosm
Jchemo.cosv
Jchemo.covm
Jchemo.covv
Jchemo.cscale
Jchemo.detrend_airpls
Jchemo.detrend_arpls
Jchemo.detrend_asls
Jchemo.detrend_lo
Jchemo.detrend_pol
Jchemo.dfplsr_cg
Jchemo.difmean
Jchemo.dkplskdeda
Jchemo.dkplslda
Jchemo.dkplsqda
Jchemo.dkplsr
Jchemo.dkplsrda
Jchemo.dmkern
Jchemo.dmnorm
Jchemo.dmnormlog
Jchemo.dummy
Jchemo.dupl
Jchemo.ensure_df
Jchemo.ensure_mat
Jchemo.eposvd
Jchemo.errp
Jchemo.euclsq
Jchemo.fcenter
Jchemo.fconcat
Jchemo.fcscale
Jchemo.fda
Jchemo.fdasvd
Jchemo.fdif
Jchemo.findmax_cla
Jchemo.findmiss
Jchemo.frob
Jchemo.fscale
Jchemo.fweight
Jchemo.getknn
Jchemo.gridcv
Jchemo.gridcv_br
Jchemo.gridcv_lb
Jchemo.gridcv_lv
Jchemo.gridscore (2 methods)
Jchemo.gridscore_br
Jchemo.gridscore_lb
Jchemo.gridscore_lv
Jchemo.head
Jchemo.interpl
Jchemo.iqrv
Jchemo.isel!
Jchemo.kdeda
Jchemo.knnda
Jchemo.knnr
Jchemo.kpca
Jchemo.kplskdeda
Jchemo.kplslda
Jchemo.kplsqda
Jchemo.kplsr
Jchemo.kplsrda
Jchemo.kpol
Jchemo.krbf
Jchemo.krr
Jchemo.krrda
Jchemo.lda
Jchemo.list (2 methods)
Jchemo.locw
Jchemo.locwlv
Jchemo.loessr
Jchemo.lwmlr
Jchemo.lwmlrda
Jchemo.lwplslda
Jchemo.lwplsqda
Jchemo.lwplsr
Jchemo.lwplsravg
Jchemo.lwplsrda
Jchemo.madv
Jchemo.mahsq
Jchemo.mahsqchol
Jchemo.matB
Jchemo.matW
Jchemo.mavg
Jchemo.mbconcat
Jchemo.mblock
Jchemo.mbpca
Jchemo.mbplskdeda
Jchemo.mbplslda
Jchemo.mbplsqda
Jchemo.mbplsr
Jchemo.mbplsrda
Jchemo.mbplswest
Jchemo.meanv
Jchemo.merrp
Jchemo.mlev
Jchemo.mlr
Jchemo.mlrchol
Jchemo.mlrda
Jchemo.mlrpinv
Jchemo.mlrpinvn
Jchemo.mlrvec
Jchemo.mpar
Jchemo.mse
Jchemo.msep
Jchemo.mweight
Jchemo.mweightcla
Jchemo.nco
Jchemo.nipals
Jchemo.nipalsmiss
Jchemo.normv
Jchemo.nro
Jchemo.occod
Jchemo.occsd
Jchemo.occsdod
Jchemo.occstah
Jchemo.out
Jchemo.outeucl
Jchemo.outstah
Jchemo.parsemiss
Jchemo.pcaeigen
Jchemo.pcaeigenk
Jchemo.pcanipals
Jchemo.pcanipalsmiss
Jchemo.pcaout
Jchemo.pcapp
Jchemo.pcasph
Jchemo.pcasvd
Jchemo.pcr
Jchemo.pip
Jchemo.plotconf
Jchemo.plotgrid
Jchemo.plotsp
Jchemo.plotxy
Jchemo.plscan
Jchemo.plskdeda
Jchemo.plskern
Jchemo.plslda
Jchemo.plsnipals
Jchemo.plsqda
Jchemo.plsravg
Jchemo.plsrda
Jchemo.plsrosa
Jchemo.plsrout
Jchemo.plssimp
Jchemo.plstuck
Jchemo.plswold
Jchemo.predict (43 methods)
Jchemo.pval
Jchemo.qda
Jchemo.r2
Jchemo.rasvd
Jchemo.rd
Jchemo.rda
Jchemo.recod_catbydict
Jchemo.recod_catbyind
Jchemo.recod_catbyint
Jchemo.recod_catbylev
Jchemo.recod_indbylev
Jchemo.recod_miss
Jchemo.recod_numbyint
Jchemo.recovkw
Jchemo.residcla
Jchemo.residreg
Jchemo.rfda
Jchemo.rfr
Jchemo.rmcol
Jchemo.rmgap
Jchemo.rmrow
Jchemo.rmsep
Jchemo.rmsepstand
Jchemo.rosaplsr
Jchemo.rowmean
Jchemo.rownorm
Jchemo.rowstd
Jchemo.rowsum
Jchemo.rowvar
Jchemo.rp
Jchemo.rpd
Jchemo.rpdr
Jchemo.rpmatgauss
Jchemo.rpmatli
Jchemo.rr
Jchemo.rrchol
Jchemo.rrda
Jchemo.rrr
Jchemo.rv
Jchemo.sampcla
Jchemo.sampdf
Jchemo.sampdp
Jchemo.sampks
Jchemo.samprand
Jchemo.sampsys
Jchemo.sampwsp
Jchemo.savgk
Jchemo.savgol
Jchemo.scale
Jchemo.segmkf
Jchemo.segmts
Jchemo.selwold
Jchemo.sep
Jchemo.snorm
Jchemo.snv
Jchemo.softmax
Jchemo.soplsr
Jchemo.sourcedir
Jchemo.spca
Jchemo.spcr
Jchemo.splskdeda
Jchemo.splslda
Jchemo.splsqda
Jchemo.splsr
Jchemo.splsrda
Jchemo.ssr
Jchemo.stdv
Jchemo.summ
Jchemo.sumv
Jchemo.svmda
Jchemo.svmr
Jchemo.tab
Jchemo.tabdupl
Jchemo.thresh_hard
Jchemo.thresh_soft
Jchemo.transf (35 methods)
Jchemo.transfbl (5 methods)
Jchemo.treeda
Jchemo.treer
Jchemo.umap
Jchemo.varv
Jchemo.vcatdf
Jchemo.vcol
Jchemo.vip
Jchemo.viperm
Jchemo.vrow
Jchemo.wdis
Jchemo.winvs
Jchemo.wtal
Jchemo.xfit
Jchemo.xresid
Base.summary — Method
summary(object::Cca, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Ccawold, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Comdim, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Fda)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Kpca)
Summarize the fitted model.
- object : The fitted model.
Base.summary — Method
summary(object::Mbpca, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Mbplsr, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Mbplswest, Xbl)
Summarize the fitted model.
- object : The fitted model.
- Xbl : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Pca, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Plscan, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Plstuck, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Rasvd, X, Y)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
- Y : The Y-data that was used to fit the model.
Base.summary — Method
summary(object::Spca, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Union{Pcr, Spcr}, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Base.summary — Method
summary(object::Union{Plsr, Splsr}, X)
Summarize the fitted model.
- object : The fitted model.
- X : The X-data that was used to fit the model.
Jchemo.aggstat — Method
aggstat(X, y; algo = mean)
aggstat(X::DataFrame; vary, vargroup, algo = mean)
Compute column-wise statistics by class in a dataset.
- X : Data (n, p).
- y : A categorical variable (n) (class membership).
- algo : Function to compute (default = mean).
Specific for dataframes:
- vary : Vector of the names of the variables to summarize.
- vargroup : Vector of the names of the categorical variables to consider for computations by class.
Variables defined in vary and vargroup must be columns of X.
Return a matrix or, if only argument X::DataFrame is used, a dataframe.
Examples
using Jchemo, DataFrames, Statistics
n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, :auto)
y = rand(1:3, n)
res = aggstat(X, y; algo = sum)
res.X
aggstat(df, y; algo = sum).X
n, p = 20, 5
X = rand(n, p)
df = DataFrame(X, string.("v", 1:p))
df.gr1 = rand(1:2, n)
df.gr2 = rand(["a", "b", "c"], n)
df
aggstat(df; vary = [:v1, :v2], vargroup = [:gr1, :gr2], algo = var)
Jchemo.aggsum — Method
aggsum(x::Vector, y::Union{Vector, BitVector})
Compute sub-total sums by class of a categorical variable.
- x : A quantitative variable to sum (n).
- y : A categorical variable (n) (class membership).
Return a vector.
Examples
using Jchemo
x = rand(1000)
y = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
aggsum(x, y)
Jchemo.aicplsr — Method
aicplsr(X, y; alpha = 2, kwargs...)
Compute Akaike's (AIC) and Mallows's (Cp) criteria for univariate PLSR models.
- X : X-data (n, p).
- y : Univariate Y-data.
Keyword arguments:
- Same arguments as those of function cglsr.
- alpha : Coefficient multiplying the model complexity (df) to compute AIC.
The function uses function dfplsr_cg.
References
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369
Examples
using Jchemo, JchemoData, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 40
res = aicplsr(X, y; nlv) ;
res.crit
res.opt
res.delta
zaic = res.crit.aic
f, ax = plotgrid(0:nlv, zaic; xlabel = "Nb. LVs", ylabel = "AIC")
scatter!(ax, 0:nlv, zaic)
f
Jchemo.aov1 — Method
aov1(x, Y)
One-factor ANOVA test.
- x : Univariate categorical (factor) data (n).
- Y : Y-data (n, q).
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
x = dat.X[:, 5]
Y = dat.X[:, 1:4]
tab(x)
res = aov1(x, Y) ;
@names res
res.SSF
res.SSR
res.F
res.pval
Jchemo.bias — Method
bias(pred, Y)
Compute the prediction bias, i.e. the opposite of the mean prediction error.
- pred : Predictions.
- Y : Observed data.
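To make the definition above concrete, here is a minimal sketch (not part of Jchemo) computing the bias by hand as the opposite of the mean prediction error; the helper name zbias is ours:
using Statistics
## Opposite of the mean prediction error, computed column-wise for multivariate Y
zbias(pred, Y) = mean(Y .- pred; dims = 1)
pred = [1.0 2.0; 3.0 4.0]
Y = [1.5 1.5; 3.5 3.5]
zbias(pred, Y)    # [0.5 -0.5]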
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
bias(pred, Ytest)
model = plskern(nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
bias(pred, ytest)
Jchemo.blockscal — Method
blockscal(; kwargs...)
blockscal(Xbl; kwargs...)
blockscal(Xbl, weights::Weight; kwargs...)
Scale multiblock X-data.
- Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
- weights : Weights (n) of the observations (rows of the blocks). Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- centr : Boolean. If true, each column of blocks in Xbl is centered (before the block scaling).
- scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
- bscal : Type of block scaling. Possible values are: :none, :frob, :mfa, :ncol, :sd. See thereafter.
If implemented, the data transformations follow the order: column centering, column scaling and finally block scaling.
Types of block scaling:
- :none : No block scaling.
- :frob : Let D be the diagonal matrix of vector weights.w. Each block X is divided by its Frobenius norm = sqrt(tr(X' * D * X)). After this scaling, tr(X' * D * X) = 1.
- :mfa : Each block X is divided by sv, where sv is the dominant singular value of X (this is the "MFA" approach; "AFM" in French).
- :ncol : Each block X is divided by the nb. of columns of the block.
- :sd : Each block X is divided by sqrt(sum(weighted variances of the block-columns)). After this scaling, sum(weighted variances of the block-columns) = 1.
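To illustrate the :frob option above, the sketch below (our own check, assuming uniform weights) reproduces the Frobenius-norm scaling of one block by hand:
using Jchemo, LinearAlgebra
n, p = 5, 4
X = rand(n, p)                 # one block
w = mweight(ones(n))           # uniform weights (summing to 1)
D = Diagonal(w.w)
fnorm = sqrt(tr(X' * D * X))   # weighted Frobenius norm of the block
Xs = X / fnorm                 # block after :frob scaling (done by hand)
tr(Xs' * D * Xs)               # expected to be 1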
Examples
using Jchemo
n = 5 ; m = 3 ; p = 10
X = rand(n, p)
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
Xblnew = mblock(Xnew, listbl)
@head Xbl[3]
centr = true ; scal = true
bscal = :frob
model = blockscal(; centr, scal, bscal)
fit!(model, Xbl)
## Data transformation
zXbl = transf(model, Xbl) ;
@head zXbl[3]
zXblnew = transf(model, Xblnew) ;
zXblnew[3]
Jchemo.calds — Method
calds(; algo = plskern, kwargs...)
calds(X1, X2; algo = plskern, kwargs...)
Direct standardization (DS) for calibration transfer of spectral data.
- X1 : Spectra (n, p) to transfer to the target.
- X2 : Target spectra (n, p).
Keyword arguments:
- algo : Function used as transfer model.
- kwargs : Optional arguments for algo.
X1 and X2 must represent the same n samples ("standards").
The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method DS fits a model (defined in algo) that predicts X2 from X1.
References
Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
@names dat
## Objects X1 and X2 are spectra collected
## on the same samples.
## X2 represents the target space.
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val
## Fitting the model
fitm = calds(X1cal, X2cal; algo = plskern, nlv = 10)
#fitm = calds(X1cal, X2cal; algo = mlrpinv) # less robust
## Transfer of new spectra X1val
## expected to be close to X2val
pred = predict(fitm, X1val).pred
i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
Jchemo.calpds — Method
calpds(; npoint = 5, algo = plskern, kwargs...)
calpds(X1, X2; npoint = 5, algo = plskern, kwargs...)
Piecewise direct standardization (PDS) for calibration transfer of spectral data.
- X1 : Spectra (n, p) to transfer to the target.
- X2 : Target spectra (n, p).
Keyword arguments:
- npoint : Half-window size (nb. points left or right to the given wavelength).
- algo : Function used as transfer model.
- kwargs : Optional arguments for algo.
X1 and X2 must represent the same n standard samples.
The objective is to transform spectra X1 to new spectra as close as possible to the target X2. Method PDS fits models (defined in algo) that predict X2 from X1.
The window used in X1 to predict wavelength "i" in X2 is:
- i - npoint, i - npoint + 1, ..., i, ..., i + npoint - 1, i + npoint
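For instance, the column indices of that window can be built as follows (a minimal sketch; the clamping at the spectrum edges is our own illustration, not a description of the internal code):
npoint = 2 ; p = 700                               # half-window size, nb. of wavelengths
i = 5                                              # wavelength index predicted in X2
window = max(1, i - npoint):min(p, i + npoint)     # columns of X1 used for wavelength i
collect(window)                                    # [3, 4, 5, 6, 7]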
References
Bouveresse, E., Massart, D.L., 1996. Improvement of the piecewise direct standardisation procedure for the transfer of NIR spectra for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 32, 201–213. https://doi.org/10.1016/0169-7439(95)00074-7
Y. Wang, D. J. Veltkamp, and B. R. Kowalski, “Multivariate Instrument Standardization,” Anal. Chem., vol. 63, no. 23, pp. 2750–2756, 1991, doi: 10.1021/ac00023a016.
Wülfert, F., Kok, W.Th., Noord, O.E. de, Smilde, A.K., 2000. Correction of Temperature-Induced Spectral Variation by Continuous Piecewise Direct Standardization. Anal. Chem. 72, 1639–1644. https://doi.org/10.1021/ac9906835
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
@names dat
## Objects X1 and X2 are spectra collected
## on the same samples.
## X2 represents the target space.
## We want to transfer X1 in the same space
## as X2.
## Data to transfer
X1cal = dat.X1cal
X1val = dat.X1val
n = nro(X1cal)
m = nro(X1val)
## Target space
X2cal = dat.X2cal
X2val = dat.X2val
## Fitting the model
fitm = calpds(X1cal, X2cal; npoint = 2, algo = plskern, nlv = 2)
## Transfer of new spectra X1val
## expected to be close to X2val
pred = predict(fitm, X1val).pred
i = 1
f = Figure(size = (500, 300))
ax = Axis(f[1, 1])
lines!(X2val[i, :]; label = "x2")
lines!(ax, X1val[i, :]; label = "x1")
lines!(pred[i, :]; linestyle = :dash, label = "x1_corrected")
axislegend(position = :rb, framevisible = false)
f
Jchemo.cca — Method
cca(; kwargs...)
cca(X, Y; kwargs...)
cca(X, Y, weights::Weight; kwargs...)
cca!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical correlation analysis (CCA, RCCA).
- X : First block of data.
- Y : Second block of data.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs = scores) to compute.
- bscal : Type of block scaling. Possible values are: :none, :frob. See function blockscal.
- tau : Regularization parameter (∊ [0, 1]).
- scal : Boolean. If true, each column of blocks X and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function implements a CCA algorithm using SVD decompositions, presented in Weenink 2003 section 2.
A continuum regularization is available (parameter tau). After block centering and scaling, the function returns block LVs (Tx and Ty) that are proportional to the eigenvectors of Projx * Projy and Projy * Projx, respectively, defined as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
- Cy = (1 - tau) * Y'DY + tau * Iy
- Cxy = X'DY
- Projx = sqrt(D) * X * invCx * X' * sqrt(D)
- Projy = sqrt(D) * Y * invCy * Y' * sqrt(D)
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
After normalization (and using uniform weights), the scores returned by the function are expected to be the same as those returned by functions rcc of the R packages CCA (González et al.) and mixOmics (Lê Cao et al.) with their parameters lambda1 and lambda2 set to:
- lambda1 = lambda2 = tau / (1 - tau) * n / (n - 1)
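For cross-checking against the R rcc implementations, the relation above can be coded directly (a minimal sketch; the helper name tau2lambda is ours):
## lambda1 = lambda2 value to pass to rcc (R packages CCA, mixOmics) for a given tau
tau2lambda(tau, n) = tau / (1 - tau) * n / (n - 1)
tau2lambda(1e-8, 20)    # lambda to use in rcc for tau = 1e-8 and n = 20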
See function plscan for the details on the summary outputs.
References
González, I., Déjean, S., Martin, P.G.P., Baccini, A., 2008. CCA: An R Package to Extend Canonical Correlation Analysis. Journal of Statistical Software 23, 1-14. https://doi.org/10.18637/jss.v023.i12
Hotelling, H. (1936): “Relations between two sets of variates”, Biometrika 28: pp. 321–377.
Lê Cao, K.-A., Rohart, F., Gonzalez, I., Dejean, S., Abadi, A.J., Gautier, B., Bartolo, F., Monget, P., Coquery, J., Yao, F., Liquet, B., 2022. mixOmics: Omics Data Integration Project. https://doi.org/10.18129/B9.bioc.mixOmics
Weenink, D. 2003. Canonical Correlation Analysis, Institute of Phonetic Sciences, Univ. of Amsterdam, Proceedings 25, 81-99.
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 3
bscal = :frob ; tau = 1e-8
model = cca(; nlv, bscal, tau)
fit!(model, X, Y)
@names model
@names model.fitm
@head model.fitm.Tx
@head transfbl(model, X, Y).Tx
@head model.fitm.Ty
@head transfbl(model, X, Y).Ty
res = summary(model, X, Y) ;
@names res
res.cortx2ty
res.rvx2tx
res.rvy2ty
res.rdx2tx
res.rdy2ty
res.corx2tx
res.cory2ty
Jchemo.ccawold — Method
ccawold(; kwargs...)
ccawold(X, Y; kwargs...)
ccawold(X, Y, weights::Weight; kwargs...)
ccawold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical correlation analysis (CCA, RCCA) - Wold Nipals algorithm.
- X : First block of data.
- Y : Second block of data.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs = scores) to compute.
- bscal : Type of block scaling. Possible values are: :none, :frob. See function blockscal.
- tau : Regularization parameter (∊ [0, 1]).
- tol : Tolerance value for convergence (Nipals).
- maxit : Maximum number of iterations (Nipals).
- scal : Boolean. If true, each column of blocks X and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function implements the Nipals ccawold algorithm presented by Tenenhaus 1998 p.204 (related to Wold et al. 1984).
In this implementation, after each step of LVs computation, X and Y are deflated relatively to their respective scores (tx and ty).
A continuum regularization is available (parameter tau). After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
- Cy = (1 - tau) * Y'DY + tau * Iy
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. Often, a better alternative is to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
The normed scores returned by the function are expected (using uniform weights) to be the same as those returned by function rgcca of the R package RGCCA (Tenenhaus & Guillemot 2017, Tenenhaus et al. 2017).
See function plscan for the details on the summary outputs.
References
Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Multiblock data analysis. https://cran.r-project.org/web/packages/RGCCA/index.html
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Tenenhaus, M., Tenenhaus, A., Groenen, P.J.F., 2017. Regularized Generalized Canonical Correlation Analysis: A Framework for Sequential Multiblock Component Methods. Psychometrika 82, 737–777. https://doi.org/10.1007/s11336-017-9573-x
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob ; tau = 1e-4
model = ccawold(; nlv, bscal, tau, tol = 1e-10)
fit!(model, X, Y)
@names model
@names model.fitm
@head model.fitm.Tx
@head transfbl(model, X, Y).Tx
@head model.fitm.Ty
@head transfbl(model, X, Y).Ty
res = summary(model, X, Y) ;
@names res
res.explvarx
res.explvary
res.cortx2ty
res.rvx2tx
res.rvy2ty
res.rdx2tx
res.rdy2ty
res.corx2tx
res.cory2ty
Jchemo.center — Method
center()
center(X)
center(X, weights::Weight)
Column-wise centering of X-data.
- X : X-data (n, p).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = center()
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
colmean(Xptrain)
@head Xptest
@head Xtest .- colmean(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.cglsr — Method
cglsr(; kwargs...)
cglsr(X, y; kwargs...)
cglsr!(X::Matrix, y::Matrix; kwargs...)
Conjugate gradient algorithm for the normal equations (CGLS; Björck 1996).
- X : X-data (n, p).
- y : Univariate Y-data (n).
Keyword arguments:
- nlv : Nb. CG iterations.
- gs : Boolean. If true (default), a Gram-Schmidt orthogonalization of the normal equation residual vectors is done.
- filt : Boolean. If true, CG filter factors are computed (output F). Default = false.
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
CGLS algorithm "7.4.1" Bjorck 1996, p.289. In the present function, the part of the code computing the re-orthogonalization (Hansen 1998) and filter factors (Vogel 1987, Hansen 1998) is a transcription (with few adaptations) of the Matlab function cgls
(Saunders et al. https://web.stanford.edu/group/SOL/software/cgls/; Hansen 2008).
References
Björck, A., 1996. Numerical Methods for Least Squares Problems, Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971484
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Manne R. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics Intell. Lab. Syst. 1987, 2: 187–197.
Phatak A, De Hoog F. Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. J. Chemometrics 2002; 16: 361–367.
Vogel, C. R., "Solving ill-conditioned linear systems using the conjugate gradient method", Report, Dept. of Mathematical Sciences, Montana State University, 1987.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 5 ; scal = true
model = cglsr(; nlv, scal)
fit!(model, Xtrain, ytrain)
@names model.fitm
@head model.fitm.B
coef(model.fitm).B
coef(model.fitm).int
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.coef — Method
coef(object::Cglsr)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
Jchemo.coef — Method
coef(object::Dkplsr; nlv = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- nlv : Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.coef — Method
coef(object::Kplsr; nlv = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- nlv : Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.coef — Method
coef(object::Krr; lb = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- lb : Ridge regularization parameter "lambda".
Jchemo.coef — Method
coef(object::Pcr; nlv = nothing)
Compute the b-coefficients of a LV model.
- object : The fitted model.
- nlv : Nb. LVs to consider.
For a model fitted from X(n, p) and Y(n, q), the returned object B is a matrix (p, q). If nlv = 0, B is a matrix of zeros. The returned object int is the intercept.
Jchemo.coef — Method
coef(object::Rosaplsr; nlv = nothing)
Compute the X b-coefficients of a model fitted with nlv LVs.
- object : The fitted model.
- nlv : Nb. LVs to consider.
Jchemo.coef — Method
coef(object::Rr; lb = nothing)
Compute the b-coefficients of a fitted model.
- object : The fitted model.
- lb : Ridge regularization parameter "lambda".
Jchemo.coef — Method
coef(object::Mlr)
Compute the coefficients of the fitted model.
- object : The fitted model.
Jchemo.coef — Method
coef(object::Union{Plsr, Pcr, Splsr}; nlv = nothing)
Compute the b-coefficients of a LV model.
- object : The fitted model.
- nlv : Nb. LVs to consider.
For a model fitted from X(n, p) and Y(n, q), the returned object B is a matrix (p, q). If nlv = 0, B is a matrix of zeros. The returned object int is the intercept.
Jchemo.colmad — Method
colmad(X)
Compute column-wise median absolute deviations (MAD) of a matrix.
- X : Data (n, p).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
colmad(X)
Jchemo.colmean — Method
colmean(X)
colmean(X, weights::Weight)
Compute column-wise means of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colmean(X)
colmean(X, w)
Jchemo.colmed — Method
colmed(X)
Compute column-wise medians of a matrix.
- X : Data (n, p).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
colmed(X)
Jchemo.colnorm — Method
colnorm(X)
colnorm(X, weights::Weight)
Compute column-wise norms of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
The norm computed for a column x of X is:
- sqrt(x' * x)
The weighted norm is:
- sqrt(x' * D * x), where D is the diagonal matrix of weights.w.
Warning: colnorm(X, mweight(ones(n))) = colnorm(X) / sqrt(n).
Return a vector.
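A minimal sketch checking the warning above (with uniform weights summing to 1, the weighted norm equals the unweighted norm divided by sqrt(n)):
using Jchemo
n, p = 5, 3
X = rand(n, p)
w = mweight(ones(n))                    # uniform weights
colnorm(X, w) ≈ colnorm(X) / sqrt(n)    # expected: true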
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colnorm(X)
colnorm(X, w)
Jchemo.colstd — Method
colstd(X)
colstd(X, weights::Weight)
Compute column-wise standard deviations (uncorrected) of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colstd(X)
colstd(X, w)
Jchemo.colsum — Method
colsum(X)
colsum(X, weights::Weight)
Compute column-wise sums of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colsum(X)
colsum(X, w)
Jchemo.colvar — Method
colvar(X)
colvar(X, weights::Weight)
Compute column-wise variances (uncorrected) of a matrix.
- X : Data (n, p).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
w = mweight(rand(n))
colvar(X)
colvar(X, w)
Jchemo.comdim — Method
comdim(; kwargs...)
comdim(Xbl; kwargs...)
comdim(Xbl, weights::Weight; kwargs...)
comdim!(Xbl::Matrix, weights::Weight; kwargs...)
Common components and specific weights analysis (CCSWA, a.k.a. ComDim).
- Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. global latent variables (LVs = scores) to compute.
- bscal : Type of block scaling. See function blockscal for possible values.
- tol : Tolerance value for convergence (Nipals).
- maxit : Maximum number of iterations (Nipals).
- scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
"SVD" algorithm of Hanafi & Qannari 2008 p.84.
The function returns several objects, in particular:
- T : The global LVs (not-normed).
- U : The global LVs (normed).
- W : The block weights (normed).
- Tb : The block LVs (in the metric scale), returned grouped by LV.
- Tbl : The block LVs (in the original scale), returned grouped by block.
- Vbl : The block loadings (normed).
- lb : The block specific weights (saliences) 'lambda'.
- mu : The sum of the block specific weights.
Function summary returns:
- explvarx : Proportion of the total X inertia (squared Frobenius norm) explained by the global LVs.
- explvarxx : Proportion of the total XX' inertia explained by each global score (= indicator "V" in Qannari et al. 2000, Hanafi et al. 2008).
- explxbl : Proportion of the inertia of each block (= Xbl[k]) explained by the global LVs.
- psal2 : Proportion of the squared saliences of each block within each global score.
- contrxbl2t : Contribution of each block to the global LVs.
- rvxbl2t : RV coefficients between each block and the global LVs.
- rdxbl2t : Rd coefficients between each block and the global LVs.
- cortbl2t : Correlations between the block LVs (= Tbl[k]) and the global LVs.
- corx2t : Correlation between the X-variables and the global LVs.
References
Cariou, V., Qannari, E.M., Rutledge, D.N., Vigneau, E., 2018. ComDim: From multiblock data analysis to path modeling. Food Quality and Preference, Sensometrics 2016: Sensometrics-by-the-Sea 67, 27–34. https://doi.org/10.1016/j.foodqual.2017.02.012
Cariou, V., Jouan-Rimbaud Bouveresse, D., Qannari, E.M., Rutledge, D.N., 2019. Chapter 7 - ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective, in: Cocchi, M. (Ed.), Data Handling in Science and Technology, Data Fusion Methodology and Applications. Elsevier, pp. 179–204. https://doi.org/10.1016/B978-0-444-63984-4.00007-7
Ghaziri, A.E., Cariou, V., Rutledge, D.N., Qannari, E.M., 2016. Analysis of multiblock datasets using ComDim: Overview and extension to the analysis of (K + 1) datasets. Journal of Chemometrics 30, 420–429. https://doi.org/10.1002/cem.2810
Hanafi, M., 2008. Nouvelles propriétés de l’analyse en composantes communes et poids spécifiques. Journal de la société française de statistique 149, 75–97.
Qannari, E.M., Wakeling, I., Courcoux, P., MacFie, H.J.H., 2000. Defining the underlying sensory dimensions. Food Quality and Preference 11, 151–154. https://doi.org/10.1016/S0950-3293(99)00069-5
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1])
nlv = 3
bscal = :frob
scal = false
#scal = true
model = comdim(; nlv, bscal, scal)
fit!(model, Xbl)
@names model
@names model.fitm
## Global scores
@head model.fitm.T
@head transf(model, Xbl)
transf(model, Xblnew)
## Blocks scores
i = 1
@head model.fitm.Tbl[i]
@head transfbl(model, Xbl)[i]
res = summary(model, Xbl) ;
@names res
res.explvarx
res.explvarxx
res.psal2
res.contrxbl2t
res.explxbl # = model.fitm.lb if bscal = :frob
rowsum(Matrix(res.explxbl))
res.rvxbl2t
res.rdxbl2t
res.cortbl2t
res.corx2t
Jchemo.conf — Method
conf(pred, y; digits = 1)
Confusion matrix.
- pred : Univariate predictions.
- y : Univariate observed data.
Keyword arguments:
- digits : Nb. digits used to round percentages.
Examples
using Jchemo, CairoMakie
y = ["d"; "c"; "b"; "c"; "a"; "d"; "b"; "d";
"b"; "b"; "a"; "a"; "c"; "d"; "d"]
pred = ["a"; "d"; "b"; "d"; "b"; "d"; "b"; "d";
"b"; "b"; "a"; "a"; "d"; "d"; "d"]
#y = rand(1:10, 200); pred = rand(1:10, 200)
res = conf(pred, y) ;
@names res
res.cnt # Counts (dataframe built from `A`)
res.pct # Row % (dataframe built from `Apct`))
res.A
res.Apct
res.diagpct
res.accpct # Accuracy (% classification successes)
res.lev # Levels
plotconf(res).f
plotconf(res; cnt = false, ptext = false).f
Jchemo.convertdf — Method
convertdf(df::DataFrame; miss = nothing, typ)
Convert the columns of a dataframe to given types.
- df : A dataframe.
- miss : The code used in df to identify the data to be declared as missing (of type Missing). See function recod_miss.
- typ : A vector of the targeted types for the columns of the new dataframe.
Examples
using Jchemo, DataFrames
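The example block above appears truncated. As a purely hypothetical illustration built only from the signature documented above (the dataframe, the missing-value code and the target types are ours, not an official example), a call might look like:
df = DataFrame(v1 = ["1.2", "-999", "3.4"], v2 = ["a", "b", "-999"])
## Declare the code "-999" as missing, then convert the columns to the given types
convertdf(df; miss = "-999", typ = [Float64, String])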
Jchemo.cor2 — Method
cor2(pred, Y)
Compute the squared linear correlation between data and predictions.
- pred : Predictions.
- Y : Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
cor2(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
cor2(pred, ytest)
Jchemo.corm — Method
corm(X)
corm(X, Y)
corm(X, weights::Weight)
corm(X, Y, weights::Weight)
Compute a weighted correlation matrix.
- X : Data (n, p).
- Y : Data (n, q).
- weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
Uncorrected correlation matrix
- of X-columns : ==> (p, p) matrix
- or between X-columns and Y-columns : ==> (p, q) matrix.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))
corm(X, w)
corm(X, Y, w)
Jchemo.corv — Method
corv(x, y)
Compute correlation between two vectors.
- x : vector (n).
- y : vector (n).
References
@Stevengj, https://discourse.julialang.org/t/interesting-post-about-simd-dot-product-and-cosine-similarity/123282.
Examples
using Jchemo
n = 5
x = rand(n)
y = rand(n)
corv(x, y)
Jchemo.cosm — Method
cosm(X)
cosm(X, Y)
Compute a cosine matrix.
- X : Data (n, p).
- Y : Data (n, q).
The function computes the cosine matrix:
- of the columns of X : ==> (p, p) matrix
- or between columns of X and Y : ==> (p, q) matrix.
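A minimal check of the definition, assuming the usual uncentered cosine x'y / (norm(x) * norm(y)) between columns:
using Jchemo, LinearAlgebra
X = rand(5, 2)
cosm(X)[1, 2]                                              # cosine between columns 1 and 2
dot(X[:, 1], X[:, 2]) / (norm(X[:, 1]) * norm(X[:, 2]))    # expected to match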
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
cosm(X)
cosm(X, Y)
Jchemo.cosv — Method
cosv(x, y)
Compute the cosine between two vectors.
- x : vector (n).
- y : vector (n).
References
@Stevengj, https://discourse.julialang.org/t/interesting-post-about-simd-dot-product-and-cosine-similarity/123282.
Examples
using Jchemo
n = 5
x = rand(n)
y = rand(n)
cosv(x, y)
Jchemo.covm — Method
covm(X)
covm(X, weights::Weight)
covm(X, Y)
covm(X, Y, weights::Weight)
Compute a weighted covariance matrix.
- X : Data (n, p).
- Y : Data (n, q).
- weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
The function computes the uncorrected covariance matrix:
- of the columns of X : ==> (p, p) matrix
- or between columns of X and Y : ==> (p, q) matrix.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
Y = rand(n, 3)
w = mweight(rand(n))
covm(X, w)
covm(X, Y, w)
Jchemo.covv — Method
covv(x, y)
Compute the uncorrected covariance between two vectors.
- x : vector (n).
- y : vector (n).
References
@Stevengj, https://discourse.julialang.org/t/interesting-post-about-simd-dot-product-and-cosine-similarity/123282.
Examples
using Jchemo
n = 5
x = rand(n)
y = rand(n)
covv(x, y)
Jchemo.cscale — Method
cscale()
cscale(X)
cscale(X, weights::Weight)
Column-wise centering and scaling of X-data.
- X : X-data (n, p).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = cscale()
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
colmean(Xptrain)
colstd(Xptrain)
@head Xptest
@head (Xtest .- colmean(Xtrain)') ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.detrend_airpls — Method
detrend_airpls(; kwargs...)
detrend_airpls(X; kwargs...)
Baseline correction of each row of X-data by adaptive iteratively reweighted penalized least squares algorithm (AIRPLS).
- X : X-data (n, p).
Keyword arguments:
- lb : Penalizing (smoothing) parameter "lambda".
- maxit : Maximum number of iterations.
- verbose : If true, nb. iterations are printed.
De-trend transformation: the function fits a baseline by AIRPLS (see Zhang et al. 2010, and Baek et al. 2015 section 2) for each observation and returns the residuals (= signals corrected from the baseline).
References
Baek, S.-J., Park, A., Ahn, Y.-J., Choo, J., 2015. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 140, 250–257. https://doi.org/10.1039/C4AN01061B
Zhang, Z.-M., Chen, S., Liang, Y.-Z., 2010. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135, 1138–1146. https://doi.org/10.1039/B922045C
https://github.com/zmzhang/airPLS/tree/master
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
## Example on 1 spectrum
i = 2
zX = Matrix(X)[i:i, :]
lb = 1e6
model = detrend_airpls(; lb)
fit!(model, zX)
zXc = transf(model, zX) # = corrected spectrum
B = zX - zXc # = estimated baseline
f, ax = plotsp(zX, wl)
lines!(wl, vec(B); color = :blue)
lines!(wl, vec(zXc); color = :black)
f
Jchemo.detrend_arpls — Method
detrend_arpls(; kwargs...)
detrend_arpls(X; kwargs...)
Baseline correction of each row of X-data by asymmetrically reweighted penalized least squares smoothing (ARPLS).
- X : X-data (n, p).
Keyword arguments:
- lb : Penalizing (smoothness) parameter "lambda".
- tol : Tolerance value for stopping the iterations.
- maxit : Maximum number of iterations.
- verbose : If true, nb. iterations are printed.
De-trend transformation: the function fits a baseline by ARPLS (see Baek et al. 2015 section 3) for each observation and returns the residuals (= signals corrected from the baseline).
References
Baek, S.-J., Park, A., Ahn, Y.-J., Choo, J., 2015. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 140, 250–257. https://doi.org/10.1039/C4AN01061B
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
## Example on 1 spectrum
i = 2
zX = Matrix(X)[i:i, :]
lb = 1e4
model = detrend_arpls(; lb)
fit!(model, zX)
zXc = transf(model, zX) # = corrected spectrum
B = zX - zXc # = estimated baseline
f, ax = plotsp(zX, wl)
lines!(wl, vec(B); color = :blue)
lines!(wl, vec(zXc); color = :black)
f
Jchemo.detrend_asls — Method
detrend_asls(; kwargs...)
detrend_asls(X; kwargs...)
Baseline correction of each row of X-data by asymmetric least squares algorithm (ASLS).
- X : X-data (n, p).
Keyword arguments:
- lb : Penalizing (smoothness) parameter "lambda".
- p : Asymmetry parameter (0 < p << 1).
- tol : Tolerance value for stopping the iterations.
- maxit : Maximum number of iterations.
- verbose : If true, nb. iterations are printed.
De-trend transformation: the function fits a baseline by ASLS (see Baek et al. 2015 section 2) for each observation and returns the residuals (= signals corrected from the baseline).
Generally 0.001 ≤ p ≤ 0.1 is a good choice (for a signal with positive peaks) and 1e2 ≤ lb ≤ 1e9, but exceptions may occur (Eilers & Boelens 2005).
References
Baek, S.-J., Park, A., Ahn, Y.-J., Choo, J., 2015. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 140, 250–257. https://doi.org/10.1039/C4AN01061B
Eilers, P. H., & Boelens, H. F. (2005). Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report, 1(1).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
## Example on 1 spectrum
i = 2
zX = Matrix(X)[i:i, :]
lb = 1e5 ; p = .001
model = detrend_asls(; lb, p)
fit!(model, zX)
zXc = transf(model, zX) # = corrected spectrum
B = zX - zXc # = estimated baseline
f, ax = plotsp(zX, wl)
lines!(wl, vec(B); color = :blue)
lines!(wl, vec(zXc); color = :black)
f
Jchemo.detrend_lo — Method
detrend_lo(; kwargs...)
detrend_lo(X; kwargs...)
Baseline correction of each row of X-data by LOESS regression.
- X : X-data (n, p).
Keyword arguments:
- span : Window for neighborhood selection (level of smoothing) for the local fitting, typically between 0 and 1.
- degree : Polynomial degree for the local fitting.
De-trend transformation: the function fits a baseline by LOESS regression (function loessr) for each observation and returns the residuals (= signals corrected from the baseline).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = detrend_lo(span = .8)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain, wl).f
plotsp(Xptest, wl).f
## Example on 1 spectrum
i = 2
zX = Matrix(X)[i:i, :]
model = detrend_lo(span = .75)
fit!(model, zX)
zXc = transf(model, zX) # = corrected spectrum
B = zX - zXc # = estimated baseline
f, ax = plotsp(zX, wl)
lines!(wl, vec(B); color = :blue)
lines!(wl, vec(zXc); color = :black)
f
Jchemo.detrend_pol — Method
detrend_pol(; kwargs...)
detrend_pol(X; kwargs...)
Baseline correction of each row of X-data by polynomial linear regression.
- X : X-data (n, p).
Keyword arguments:
- degree : Polynomial degree.
De-trend transformation: the function fits a baseline by polynomial regression for each observation and returns the residuals (= signals corrected from the baseline).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = detrend_pol(degree = 2)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain, wl).f
plotsp(Xptest, wl).f
## Example on 1 spectrum
i = 2
zX = Matrix(X)[i:i, :]
model = detrend_pol(degree = 1)
fit!(model, zX)
zXc = transf(model, zX) # = corrected spectrum
B = zX - zXc # = estimated baseline
f, ax = plotsp(zX, wl)
lines!(wl, vec(B); color = :blue)
lines!(wl, vec(zXc); color = :black)
f
Jchemo.dfplsr_cg — Method
dfplsr_cg(X, y; kwargs...)
Compute the model complexity (df) of PLSR models with the CGLS algorithm.
- X : X-data (n, p).
- y : Univariate Y-data.
Keyword arguments:
- Same as function cglsr.
The number of degrees of freedom (df) of the PLSR model is returned for 0, 1, ..., nlv LVs.
References
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems, Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719697
Hansen, P.C., 2008. Regularization Tools version 4.0 for Matlab 7.3. Numer Algor 46, 189–194. https://doi.org/10.1007/s11075-007-9136-9
Lesnoff, M., Roger, J.-M., Rutledge, D.N., 2021. Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics n/a, e3369. https://doi.org/10.1002/cem.3369
Examples
## The example below reproduces the numerical illustration
## given by Kramer & Sugiyama 2011 on the Ozone data
## (Fig. 1, center).
## Function "pls.model" used for df calculations
## in the R package "plsdof" v0.2-9 (Kramer & Braun 2019)
## automatically scales the X matrix before PLS.
## The example scales X for consistency with plsdof.
using Jchemo, JchemoData, JLD2, DataFrames, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ozone.jld2")
@load db dat
@names dat
X = dat.X
dropmissing!(X)
zX = rmcol(Matrix(X), 4)
y = X[:, 4]
## For consistency with plsdof
xscales = colstd(zX)
zXs = fscale(zX, xscales)
## End
nlv = 12 ; gs = true
res = dfplsr_cg(zXs, y; nlv, gs) ;
res.df
df_kramer = [1.000000, 3.712373, 6.456417, 11.633565,
12.156760, 11.715101, 12.349716,
12.192682, 13.000000, 13.000000,
13.000000, 13.000000, 13.000000]
f, ax = plotgrid(0:nlv, df_kramer; step = 2, xlabel = "Nb. LVs", ylabel = "df")
scatter!(ax, 0:nlv, res.df; color = "red")
ablines!(ax, 1, 1; color = :grey, linestyle = :dot)
f
Jchemo.difmean — Method
difmean(X1, X2; normx::Bool = false)
Compute a 1-D detrimental matrix by difference of the column-means of two X-data sets.
- X1 : Spectra (n1, p).
- X2 : Spectra (n2, p).
Keyword arguments:
- normx : Boolean. If true, the column-means vectors of X1 and X2 are normed before computing their difference.
The function returns a matrix D (1, p) computed by the difference between two mean-spectra, i.e. the column-means of X1 and X2.
D is assumed to contain the detrimental information that can be removed (by orthogonalization) from X1 and X2 for calibration transfer. For instance, D can be used as input of function eposvd.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
@names dat
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val
## The objective is to remove a detrimental
## information (here, D) from spaces X1 and X2
D = difmean(X1cal, X2cal).D
res = eposvd(D; nlv = 1)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M
i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
Jchemo.dkplskdeda — Method
dkplskdeda(; kwargs...)
dkplskdeda(X, y; kwargs...)
dkplskdeda(X, y, weights::Weight; kwargs...)
DKPLS-KDEDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
- scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plskdeda (PLS-KDEDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function dkplslda for examples.
Jchemo.dkplslda — Method
dkplslda(; kwargs...)
dkplslda(X, y; kwargs...)
dkplslda(X, y, weights::Weight; kwargs...)
DKPLS-LDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
- scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
Same as function plslda (PLS-LDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
gamma = .1
model = dkplslda(; nlv, gamma)
#model = dkplslda(; nlv, gamma, prior = :prop)
#model = dkplsqda(; nlv, gamma, alpha = .5)
#model = dkplskdeda(; nlv, gamma, a = .5)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
embfitm = fitm.fitm.embfitm ;
@head embfitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(embfitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
Jchemo.dkplsqda — Method
dkplsqda(; kwargs...)
dkplsqda(X, y; kwargs...)
dkplsqda(X, y, weights::Weight; kwargs...)
DKPLS-QDA.
- X : X-data (n, p).
- y : Univariate class membership (n).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- prior : Type of prior probabilities for class membership. Possible values are: :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
- alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
- scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plsqda (PLS-QDA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
See function dkplslda for examples.
Jchemo.dkplsr — Method
dkplsr(; kwargs...)
dkplsr(X, Y; kwargs...)
dkplsr(X, Y, weights::Weight; kwargs...)
dkplsr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Direct kernel partial least squares regression (DKPLSR) (Bennett & Embrechts 2003).
- X : X-data (n, p).
- Y : Y-data (n, q).
- weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
- nlv : Nb. latent variables (LVs) to consider.
- kern : Type of kernel used to compute the Gram matrices. Possible values are: :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
- scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
The method builds kernel Gram matrices and then runs a usual PLSR algorithm on them. This is faster than (but not equivalent to) the "true" KPLSR (Nipals) algorithm (function kplsr) described in Rosipal & Trejo (2001).
References
Bennett, K.P., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.
Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
kern = :krbf ; gamma = 1e-1 ; scal = false
#gamma = 1e-4 ; scal = true
model = dkplsr(; nlv, kern, gamma, scal) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
nlv = 2
gamma = 1 / 3
model = dkplsr(; nlv, gamma) ;
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.dkplsrda
— Methoddkplsrda(; kwargs...)
dkplsrda(X, y; kwargs...)
dkplsrda(X, y, weights::Weight; kwargs...)
Discrimination based on direct kernel partial least squares regression (KPLSR-DA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation.
Same as function plsrda (PLSR-DA) except that a direct kernel PLSR (function dkplsr), instead of a PLSR (function plskern), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
kern = :krbf ; gamma = .001
scal = true
model = dkplsrda(; nlv, kern, gamma, scal)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
@head fitm.fitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
Jchemo.dmkern
— Methoddmkern(; kwargs...)
dmkern(X; kwargs...)
Gaussian kernel density estimation (KDE).
X : X-data (n, p).
Keyword arguments:
h : Defines the bandwidth; see the examples.
a : Constant used in Scott's rule (default bandwidth); see hereafter.
Estimation of the probability density of X (column space) by non-parametric Gaussian kernels.
Data X can be univariate (p = 1) or multivariate (p > 1). In the latter case, function dmkern computes a multiplicative kernel such as in Scott & Sain 2005 Eq. 19, and the internal bandwidth matrix H is diagonal (see the code).
Note: H in the dmkern code is often noted "H^(1/2)" in the literature (e.g. Wikipedia).
The default bandwidth is computed by:
- h = a * n^(-1 / (p + 4)) * colstd(X)
(a = 1 in Scott & Sain 2005).
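As an illustration, a minimal sketch reproducing the default rule above by hand (assumption: colstd returns the column standard deviations used internally by dmkern):
using Jchemo
X = rand(150, 2)
n, p = size(X)
a = 1
h = a * n^(-1 / (p + 4)) * colstd(X)  # one bandwidth value per column of X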
References
Scott, D.W., Sain, S.R., 2005. 9 - Multidimensional Density Estimation, in: Rao, C.R., Wegman, E.J., Solka, J.L. (Eds.), Handbook of Statistics, Data Mining and Data Visualization. Elsevier, pp. 229–261. https://doi.org/10.1016/S0169-7161(04)24009-3
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
@names dat
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
nlv = 2
model0 = fda(; nlv)
fit!(model0, X, y)
@head T = transf(model0, X)
n, p = size(T)
#### Probability density in the FDA score space (2D)
model = dmkern()
fit!(model, T)
@names model.fitm
model.fitm.H
u = [1; 4; 150]
predict(model, T[u, :]).pred
h = .3
model = dmkern(; h)
fit!(model, T)
model.fitm.H
predict(model, T[u, :]).pred
h = [.3; .1]
model = dmkern(; h)
fit!(model, T)
model.fitm.H
predict(model, T[u, :]).pred
## Bivariate distribution
npoints = 2^7
nlv = 2
lims = [(minimum(T[:, j]), maximum(T[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
m = nro(grid)
model = dmkern()
#model = dmkern(a = .5)
#model = dmkern(h = .3)
fit!(model, T)
res = predict(model, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1]; title = "Density for FDA scores (Iris)", xlabel = "Score 1",
ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
#xlims!(ax, -15, 15) ;ylims!(ax, -15, 15)
f
## Univariate distribution
x = T[:, 1]
model = dmkern()
#model = dmkern(a = .5)
#model = dmkern(h = .3)
fit!(model, x)
pred = predict(model, x).pred
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
scatter!(ax, x, vec(pred); color = :red)
f
x = T[:, 1]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
model = dmkern()
#model = dmkern(a = .5)
#model = dmkern(h = .3)
fit!(model, x)
pred_grid = predict(model, grid).pred
f = Figure()
ax = Axis(f[1, 1])
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
Jchemo.dmnorm
— Methoddmnorm(; kwargs...)
dmnorm(X; kwargs...)
dmnorm!(X::Matrix; kwargs...)
dmnorm(mu, S; kwargs...)
dmnorm!(mu::Vector, S::Matrix; kwargs...)
Normal probability density estimation.
X : X-data (n, p) used to estimate the mean mu and the covariance matrix S. If X is not given, mu and S must be provided in kwargs.
mu : Mean vector of the normal distribution.
S : Covariance matrix of the Normal distribution.
Keyword arguments:
simpl : Boolean. If true, the constant term and the determinant in the Normal density formula are set to 1.
Data X can be univariate (p = 1) or multivariate (p > 1). See examples.
When simpl = true, the determinant of the covariance matrix (object detS) and the constant (2 * pi)^(-p / 2) (object cst) in the density formula are set to 1. The function then returns a pseudo density that reduces to exp(-d / 2), where d is the squared Mahalanobis distance to the center mu. This can for instance be useful when the number of columns (p) of X becomes too large, with the possible consequences that:
- detS tends to 0 or, conversely, to infinity;
- cst tends to 0,
which make it impossible to compute the true density.
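A minimal sketch of the simpl option described above (assumption: the returned values are then the pseudo densities exp(-d / 2)):
using Jchemo
X = rand(100, 3)
model = dmnorm(simpl = true)
fit!(model, X)
@head predict(model, X).pred  # pseudo densities exp(-d / 2)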
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
@names dat
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
nlv = 2
model0 = fda(; nlv)
fit!(model0, X, y)
@head T = transf(model0, X)
n, p = size(T)
#### Probability density in the FDA score space (2D)
#### Example of class Setosa
s = y .== "setosa"
zT = T[s, :]
m = nro(zT)
#### Bivariate distribution
model = dmnorm()
fit!(model, zT)
fitm = model.fitm
@names fitm
fitm.Uinv
fitm.detS
@head pred = predict(model, zT).pred
## Direct syntax
mu = colmean(zT)
S = covm(zT, mweight(ones(m))) * m / (m - 1) # corrected cov. matrix
fitm = dmnorm(mu, S) ;
@names fitm
fitm.Uinv
fitm.detS
npoints = 2^7
lims = [(minimum(zT[:, j]), maximum(zT[:, j])) for j = 1:nlv]
x1 = LinRange(lims[1][1], lims[1][2], npoints)
x2 = LinRange(lims[2][1], lims[2][2], npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
model = dmnorm()
fit!(model, zT)
res = predict(model, grid) ;
pred_grid = vec(res.pred)
f = Figure(size = (600, 400))
ax = Axis(f[1, 1]; title = "Density for FDA scores (Iris - Setosa)",
xlabel = "Score 1", ylabel = "Score 2")
co = contour!(ax, grid[:, 1], grid[:, 2], pred_grid; levels = 10, labels = true)
scatter!(ax, T[:, 1], T[:, 2], color = :red, markersize = 5)
scatter!(ax, zT[:, 1], zT[:, 2], color = :blue, markersize = 5)
#xlims!(ax, -12, 12) ;ylims!(ax, -12, 12)
f
#### Univariate distribution
j = 1
x = zT[:, j]
model = dmnorm()
fit!(model, x)
pred = predict(model, x).pred
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
scatter!(ax, x, vec(pred); color = :red)
f
x = zT[:, j]
npoints = 2^8
lims = [minimum(x), maximum(x)]
#delta = 5 ; lims = [minimum(x) - delta, maximum(x) + delta]
grid = LinRange(lims[1], lims[2], npoints)
model = dmnorm()
fit!(model, x)
pred_grid = predict(model, grid).pred
f = Figure()
ax = Axis(f[1, 1]; xlabel = string("FDA-score ", j))
hist!(ax, x; bins = 30, normalization = :pdf) # area = 1
lines!(ax, grid, vec(pred_grid); color = :red)
f
Jchemo.dmnormlog
— Methoddmnormlog(; kwargs...)
dmnormlog(X; kwargs...)
dmnormlog!(X::Matrix; kwargs...)
dmnormlog(mu, S; kwargs...)
dmnormlog!(mu::Vector, S::Matrix; kwargs...)
Logarithm of the normal probability density estimation.
X : X-data (n, p) used to estimate the mean mu and the covariance matrix S. If X is not given, mu and S must be provided in kwargs.
mu : Mean vector of the normal distribution.
S : Covariance matrix of the Normal distribution.
Keyword arguments:
simpl : Boolean. If true, the constant term and the determinant in the Normal density formula are set to 1.
See the help page of function dmnorm
.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "iris.jld2")
@load db dat
@names dat
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
tab(y)
## Example of class Setosa
s = y .== "setosa"
zX = X[s, :]
model = dmnormlog()
fit!(model, zX)
fitm = model.fitm
@names fitm
fitm.Uinv
fitm.logdetS
@head pred = predict(model, zX).pred
## Consistency with dmnorm
model0 = dmnorm()
fit!(model0, zX)
@head pred0 = predict(model0, zX).pred
@head log.(pred0)
Jchemo.dummy
— Methoddummy(y)
Compute dummy table from a categorical variable.
y : A categorical variable.
The output Y (dummy table) is a BitMatrix.
Examples
using Jchemo
y = ["d", "a", "b", "c", "b", "c"]
#y = rand(1:3, 7)
res = dummy(y)
@names res
res.Y
Jchemo.dupl
— Methoddupl(X; digits = 3)
Find duplicated rows in a dataset.
X : A dataset.
digits : Nb. digits used to round X before checking.
Examples
using Jchemo
X = rand(5, 3)
Z = vcat(X, X[1:3, :], X[1:1, :])
dupl(X)
dupl(Z)
M = hcat(X, fill(missing, 5))
Z = vcat(M, M[1:3, :])
dupl(M)
dupl(Z)
Jchemo.ensure_df
— Methodensure_df(X)
Reshape X to a dataframe if necessary.
Jchemo.ensure_mat
— Methodensure_mat(X)
Reshape X to a matrix if necessary.
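A minimal usage sketch (assumption: a vector is returned as a one-column matrix, and a matrix is returned unchanged):
using Jchemo
x = rand(5)
ensure_mat(x)           # expected: a 5 x 1 matrix
ensure_mat(rand(3, 2))  # expected: returned as is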
Jchemo.eposvd
— Methodeposvd(D; nlv = 1)
Compute an orthogonalization matrix for calibration transfer of spectral data.
D : Data (m, p) containing the detrimental information on which spectra (rows of a matrix X) have to be orthogonalized.
Keyword arguments:
nlv : Nb. of first loading vectors of D considered for the orthogonalization.
The objective is to remove some detrimental information (e.g. humidity patterns in signals, multiple spectrometers, etc.) from an X-dataset (n, p). The detrimental information is defined by the main row-directions computed from a matrix D (m, p).
Function eposvd returns two objects:
V (p, nlv) : The matrix of the nlv first loading vectors of the SVD decomposition (non-centered PCA) of D.
M (p, p) : The orthogonalization matrix, used to orthogonalize a given matrix X to the directions contained in V.
Any matrix X can then be corrected from D by:
- X_corrected = X * M.
Matrix D can be built by many methods. For instance, two common methods are:
- EPO (Roger et al. 2003, 2018): D is built from a set of differences between spectra collected under different conditions.
- TOP (Andrew & Fearn 2004): Each row of D is the mean spectrum computed for a given spectrometer instrument.
A particular situation is the following. Assume that D is built from some differences between matrices X1 and X2, and that a bilinear model (e.g. PLSR) is fitted on the data {X1corrected, Y} where X1corrected = X1 * M. To predict new data X2new with the fitted model, there is no need to correct X2new.
References
Andrew, A., Fearn, T., 2004. Transfer by orthogonal projection: making near-infrared calibrations robust to between-instrument variation. Chemometrics and Intelligent Laboratory Systems 72, 51–56. https://doi.org/10.1016/j.chemolab.2004.02.004
Roger, J.-M., Chauchard, F., Bellon-Maurel, V., 2003. EPO-PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits. Chemometrics and Intelligent Laboratory Systems 66, 191-204. https://doi.org/10.1016/S0169-7439(03)00051-0
Roger, J.-M., Boulet, J.-C., 2018. A review of orthogonal projections for calibration. Journal of Chemometrics 32, e3045. https://doi.org/10.1002/cem.3045
Zeaiter, M., Roger, J.M., Bellon-Maurel, V., 2006. Dynamic orthogonal projection. A new method to maintain the on-line robustness of multivariate calibrations. Application to NIR-based monitoring of wine fermentations. Chemometrics and Intelligent Laboratory Systems, 80, 227–235. https://doi.org/10.1016/j.chemolab.2005.06.011
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/caltransfer.jld2")
@load db dat
@names dat
X1cal = dat.X1cal
X1val = dat.X1val
X2cal = dat.X2cal
X2val = dat.X2val
## The objective is to remove a detrimental
## information (here, D) from spaces X1 and X2
D = X1cal - X2cal
nlv = 2
res = eposvd(D; nlv)
res.M # orthogonalization matrix
res.V # detrimental directions (columns of matrix V = loadings of D)
## Corrected Val matrices
X1val_c = X1val * res.M
X2val_c = X2val * res.M
i = 1
f = Figure(size = (800, 300))
ax1 = Axis(f[1, 1])
ax2 = Axis(f[1, 2])
lines!(ax1, X1val[i, :]; label = "x1")
lines!(ax1, X2val[i, :]; label = "x2")
axislegend(ax1, position = :cb, framevisible = false)
lines!(ax2, X1val_c[i, :]; label = "x1_correct")
lines!(ax2, X2val_c[i, :]; label = "x2_correct")
axislegend(ax2, position = :cb, framevisible = false)
f
Jchemo.errp
— Methoderrp(pred, y)
Compute the classification error rate (ERRP).
pred : Predictions.
y : Observed data (class membership).
Examples
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
model = plsrda(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
errp(pred, ytest)
Jchemo.euclsq
— Methodeuclsq(X, Y)
Squared Euclidean distances between the rows of X and Y.
X : Data (n, p).
Y : Data (m, p).
For X (n, p) and Y (m, p), the function returns an object (n, m) with:
- entry (i, j) = distance between row i of X and row j of Y.
Examples
X = rand(5, 3)
Y = rand(2, 3)
euclsq(X, Y)
euclsq(X[1:1, :], Y[1:1, :])
euclsq(X[:, 1], 4)
euclsq(1, 4)
Jchemo.fcenter
— Methodfcenter(X, v)
fcenter!(X::AbstractMatrix, v)
Center each column of a matrix.
X : Data (n, p).
v : Centering vector (p).
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
fcenter(X, xmeans)
Jchemo.fconcat
— Methodfconcat()
Concatenate horizontally multiblock X-data.
Xbl : List of blocks (vector of matrices) of X-data. Typically, the output of function mblock from (n, p) data.
Examples
using Jchemo
n = 5 ; m = 3 ; p = 9
X = rand(n, p)
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:9]]
Xbl = mblock(X, listbl)
Xblnew = mblock(Xnew, listbl)
@head Xbl[3]
fconcat(Xbl)
Jchemo.fcscale
— Methodfcscale(X, u, v)
fcscale!(X, u, v)
Center and scale each column of a matrix.
X : Data (n, p).
u : Centering vector (p).
v : Scaling vector (p).
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
xmeans = colmean(X)
xscales = colstd(X)
fcscale(X, xmeans, xscales)
Jchemo.fda
— Methodfda(; kwargs...)
fda(X, y; kwargs...)
fda(X, y, weights; kwargs...)
fda!(X::Matrix, y, weights; kwargs...)
Factorial discriminant analysis (FDA).
X : X-data (n, p).
y : y-data (n) (class membership).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of discriminant components.
lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
FDA by eigen factorization of Inverse(W) * B, where W is the "Within"-covariance matrix (pooled over the classes), and B the "Between"-covariance matrix.
The function maximizes the consensus:
- p'Bp / p'Wp
i.e. max p'Bp with constraint p'Wp = 1. Vectors p (columns of V) are the linear discriminant coefficients often referred to as "LD".
If X is ill-conditioned, a ridge regularization can be used:
- If lb > 0, W is replaced by W + lb * I, where I is the Identity matrix.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights) allows other choices to be implemented.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
tab(ytrain)
tab(ytest)
nlv = 2
model = fda(; nlv)
#model = fdasvd(; nlv)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
lev = fitm.lev
nlev = length(lev)
aggsum(fitm.weights.w, ytrain)
@head fitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
## X-loadings matrix
## = coefficients of the linear discriminant function
## = "LD" of function lda of the R package MASS
fitm.V
fitm.V' * fitm.V
## Explained variance computed by weighted PCA
## of the class centers in transformed scale
summary(model).explvarx
## Projections of the class centers
## to the score space
ct = fitm.Tcenters
f, ax = plotxy(fitm.T[:, 1], fitm.T[:, 2], ytrain; ellipse = true, title = "FDA",
xlabel = "Score-1", ylabel = "Score-2")
scatter!(ax, ct[:, 1], ct[:, 2], marker = :star5, markersize = 15, color = :red) # see available_marker_symbols()
f
Jchemo.fdasvd
— Methodfdasvd(; kwargs...)
fdasvd(X, y, weights; kwargs...)
fdasvd!(X::Matrix, y, weights; kwargs...)
Factorial discriminant analysis (FDA).
X : X-data (n, p).
y : y-data (n) (class membership).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. of discriminant components.
lb : Ridge regularization parameter "lambda". Can be used when X has collinearities.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
FDA by a weighted SVD factorization of the matrix of the class centers (after spherical transformation). The function gives the same results as function fda.
See function fda for details and examples.
Jchemo.fdif
— Methodfdif(; kwargs...)
fdif(X; kwargs...)
Finite differences (discrete derivates) for each row of X-data.
X : X-data (n, p).
Keyword arguments:
npoint : Nb. points involved in the window for the finite differences. The range of the window (= nb. intervals between two successive columns) is npoint - 1.
The method reduces the column dimension:
- (n, p) –> (n, p - npoint + 1).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = fdif(npoint = 2)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.findmax_cla
— Methodfindmax_cla(x)
findmax_cla(x, weights::Weight)
Find the most frequent level in x.
x : A categorical variable.
weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
In case of ties, the function returns the first level found.
Examples
using Jchemo
x = rand(1:3, 10)
tab(x)
findmax_cla(x)
Jchemo.findmiss
— Methodfindmiss(X)
Find rows with missing data in a dataset.
X : A dataset.
For dataframes, see also DataFrames.completecases and DataFrames.dropmissing.
Examples
using Jchemo
X = rand(5, 4)
zX = hcat(rand(2, 3), fill(missing, 2))
Z = vcat(X, zX)
findmiss(X)
findmiss(Z)
Jchemo.frob
— Methodfrob(X)
frob(X, weights::Weight)
frob2(X)
frob2(X, weights::Weight)
Frobenius norm of a matrix.
X : A matrix (n, p).
weights : Weights (n) of the observations. Object of type Weight (e.g. generated by function mweight).
The Frobenius norm of X is:
- sqrt(tr(X' * X)).
The weighted Frobenius norm is:
- sqrt(tr(X' * D * X)), where D is the diagonal matrix of vector w.
Functions frob2 are the squared versions of frob.
References
@Stevengj, https://discourse.julialang.org/t/interesting-post-about-simd-dot-product-and-cosine-similarity/123282.
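A minimal sketch of the definitions above (assumption: the unweighted frob matches LinearAlgebra.norm, and frob2 equals frob squared):
using Jchemo, LinearAlgebra
X = rand(5, 3)
w = mweight(ones(5))
frob(X)      # sqrt(tr(X' * X)), i.e. norm(X)
frob(X, w)   # sqrt(tr(X' * D * X)), with D the diagonal matrix of w.w
frob2(X)     # expected to equal frob(X)^2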
Jchemo.fscale
— Methodfscale(X, v)
fscale!(X::AbstractMatrix, v)
Scale each column of a matrix.
X : Data (n, p).
v : Scaling vector (p).
Examples
using Jchemo
X = rand(5, 2)
fscale(X, colstd(X))
Jchemo.fweight
— Methodfweight(X, v)
fweight!(X::AbstractMatrix, v)
Weight each row of a matrix.
X : Data (n, p).
v : A weighting vector (n).
Examples
using Jchemo, LinearAlgebra
X = rand(5, 2)
w = rand(5)
fweight(X, w)
diagm(w) * X
fweight!(X, w)
X
Jchemo.getknn
— Methodgetknn(Xtrain, X; metric = :eucl, k = 1)
Return the k nearest neighbors in Xtrain of each row of the query X.
Xtrain : Training X-data.
X : Query X-data.
Keyword arguments:
metric : Type of distance used for the query. Possible values are :eucl (Euclidean), :mah (Mahalanobis), :sam (spectral angular distance), :cor (correlation distance).
k : Number of neighbors to return.
The distances (not squared) are also returned.
Spectral angular and correlation distances between two vectors x and y:
- Spectral angular distance (x, y) = acos(x'y / normv(x)normv(y)) / pi
- Correlation distance (x, y) = sqrt((1 - cor(x, y)) / 2)
Both distances are bounded within 0 (y = x) and 1 (y = -x).
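A small sketch of the two distance definitions above for a pair of vectors (assumption: normv is the Euclidean norm and cor is the Pearson correlation from Statistics):
using Jchemo, Statistics
x = rand(10) ; y = rand(10)
d_sam = acos(clamp(x'y / (normv(x) * normv(y)), -1, 1)) / pi  # spectral angular distance
d_cor = sqrt((1 - cor(x, y)) / 2)                             # correlation distance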
Examples
using Jchemo
Xtrain = rand(5, 3)
X = rand(2, 3)
x = X[1:1, :]
k = 3
res = getknn(Xtrain, X; k)
res.ind # indexes
res.d # distances
res = getknn(Xtrain, x; k)
res.ind
res = getknn(Xtrain, X; metric = :mah, k)
res.ind
Jchemo.gridcv
— Methodgridcv(model, X, Y; segm, score, pars = nothing, nlv = nothing, lb = nothing,
verbose = false)
Cross-validation (CV) of a model over a grid of parameters.
model : Model to evaluate.
X : Training X-data (n, p).
Y : Training Y-data (n, q).
Keyword arguments:
segm : Segments of observations used for the CV (output of functions segmts, segmkf, etc.).
score : Function computing the prediction score (e.g. rmsep).
pars : Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
verbose : If true, predicting information is printed.
nlv : Value, or vector of values, of the nb. of latent variables (LVs).
lb : Value, or vector of values, of the ridge regularization parameter "lambda".
The function is used for grid-search: it computes a prediction score (= error rate) for the specified model over the combinations of parameters defined in pars.
For models based on LVs or on ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.
The function returns two outputs:
res : mean results.
res_p : results per replication.
Examples
######## Regression
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
model = savgol(npoint = 21, deriv = 2, degree = 2)
fit!(model, X)
Xp = transf(model, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Replicated K-fold CV
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)
####-- Plsr
model = plskern()
nlv = 0:30
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, nlv) ;
@names rescv
res = rescv.res
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plskern(; nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, nlv) ;
res = rescv.res
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plskern(nlv = res.nlv[u], scal = res.scal[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Rr
lb = (10).^(-8:.1:3)
model = rr()
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, lb) ;
res = rescv.res
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = rr(lb = res.lb[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, lb) ;
res = rescv.res
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = rr(lb = res.lb[u], scal = res.scal[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Kplsr
model = kplsr()
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, nlv) ;
res = rescv.res
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = kplsr(nlv = res.nlv[u], gamma = res.gamma[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Knnr
nlvdis = [15, 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1; 5; 10; 20; 50 ; 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
model = knnr()
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = knnr(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Lwplsr
nlvdis = 15 ; metric = [:mah]
h = [1, 2, 5] ; k = [200, 350, 500]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
nlv = 0:20
model = lwplsr()
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, nlv, verbose = true) ;
res = rescv.res
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = lwplsr(nlvdis = res.nlvdis[u], metric = res.metric[u],
h = res.h[u], k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- LwplsrAvg
nlvdis = 15 ; metric = [:mah]
h = [1, 2, 5] ; k = [200, 350, 500]
nlv = [0:20, 5:20]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1])
model = lwplsravg()
rescv = gridcv(model, Xtrain, ytrain; segm, score = rmsep, pars, verbose = true) ;
res = rescv.res
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = lwplsravg(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Mbplsr
listbl = [1:525, 526:1050]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
model = mbplsr()
bscal = [:none, :frob]
pars = mpar(bscal = bscal)
nlv = 0:30
rescv = gridcv(model, Xbltrain, ytrain; segm, score = rmsep, pars, nlv) ;
res = rescv.res
group = res.bscal
plotgrid(res.nlv, res.y1, group; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = mbplsr(bscal = res.bscal[u], nlv = res.nlv[u])
fit!(model, Xbltrain, ytrain)
pred = predict(model, Xbltest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
######## Discrimination
## The principle is the same as for regression
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Replicated K-fold CV
K = 3 ; rep = 10
segm = segmkf(ntrain, K; rep)
## Replicated test-set validation
#m = Int(round(ntrain / 3)) ; rep = 30
#segm = segmts(ntrain, m; rep)
####-- Plslda
model = plslda()
nlv = 1:30
prior = [:unif; :prop]
pars = mpar(prior = prior)
rescv = gridcv(model, Xtrain, ytrain; segm, score = errp, pars, nlv)
res = rescv.res
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plslda(nlv = res.nlv[u], prior = res.prior[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
Jchemo.gridcv_br
— Methodgridcv_br(X, Y; segm, algo, score, pars, verbose = false)
Working function for gridcv
.
See function gridcv
for examples.
Jchemo.gridcv_lb
— Methodgridcv_lb(X, Y; segm, algo, score, pars = nothing, lb, verbose = false)
Working function for gridcv
.
Specific and faster than gridcv_br for models using ridge regularization (e.g. RR). Argument pars must not contain lb.
See function gridcv
for examples.
Jchemo.gridcv_lv
— Methodgridcv_lv(X, Y; segm, algo, score, pars = nothing, nlv, verbose = false)
Working function for gridcv
.
Specific and faster than gridcv_br
for models using latent variables (e.g. PLSR). Argument pars
must not contain nlv
.
See function gridcv
for examples.
Jchemo.gridscore
— Methodgridscore(model, Xtrain, Ytrain, X, Y; score, pars = nothing, nlv = nothing,
lb = nothing, verbose = false)
Test-set validation of a model over a grid of parameters.
model : Model to evaluate.
Xtrain : Training X-data (n, p).
Ytrain : Training Y-data (n, q).
X : Validation X-data (m, p).
Y : Validation Y-data (m, q).
Keyword arguments:
score : Function computing the prediction score (e.g. rmsep).
pars : Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
verbose : If true, predicting information is printed.
nlv : Value, or vector of values, of the nb. of latent variables (LVs).
lb : Value, or vector of values, of the ridge regularization parameter "lambda".
The function is used for grid-search: it computes a prediction score (= error rate) for model model over the combinations of parameters defined in pars. The score is computed over the validation sets {X, Y}.
For models based on LVs or on ridge regularization, using arguments nlv and lb allows faster computations than including these parameters in argument pars. See the examples.
Examples
######## Regression
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
model = savgol(npoint = 21, deriv = 2, degree = 2)
fit!(model, X)
Xp = transf(model, X)
s = year .<= 2012
Xtrain = Xp[s, :]
ytrain = y[s]
Xtest = rmrow(Xp, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Train ==> Cal + Val
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
##---- Plsr
model = plskern()
nlv = 0:30
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, nlv)
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plskern(nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
typ = res.scal
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plskern(nlv = res.nlv[u], scal = res.scal[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Rr
lb = (10).^(-8:.1:3)
model = rr()
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, lb)
loglb = log.(10, res.lb)
plotgrid(loglb, res.y1; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = rr(lb = res.lb[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## Adding pars
pars = mpar(scal = [false; true])
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, lb)
loglb = log.(10, res.lb)
typ = string.(res.scal)
plotgrid(loglb, res.y1, typ; step = 2, xlabel = "log(lambda)", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = rr(lb = res.lb[u], scal = res.scal[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Kplsr
model = kplsr()
nlv = 0:30
gamma = (10).^(-5:1.:5)
pars = mpar(gamma = gamma)
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv)
loggamma = round.(log.(10, res.gamma), digits = 1)
plotgrid(res.nlv, res.y1, loggamma; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP",
leg_title = "Log(gamma)").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = kplsr(nlv = res.nlv[u], gamma = res.gamma[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Knnr
nlvdis = [15; 25] ; metric = [:mah]
h = [1, 2.5, 5]
k = [1, 5, 10, 20, 50, 100]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
model = knnr()
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = knnr(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Lwplsr
nlvdis = 15 ; metric = [:mah]
h = [1, 2, 5] ; k = [200, 350, 500]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
nlv = 0:20
model = lwplsr()
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, nlv, verbose = true)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = lwplsr(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- LwplsrAvg
nlvdis = 15 ; metric = [:mah]
h = [1, 2, 5] ; k = [200, 350, 500]
nlv = [0:20, 5:20]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k, nlv = nlv)
length(pars[1])
model = lwplsravg()
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = lwplsravg(nlvdis = res.nlvdis[u], metric = res.metric[u], h = res.h[u],
k = res.k[u], nlv = res.nlv[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
##---- Mbplsr
listbl = [1:525, 526:1050]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
Xblcal = mblock(Xcal, listbl)
Xblval = mblock(Xval, listbl)
model = mbplsr()
bscal = [:none, :frob]
pars = mpar(bscal = bscal)
nlv = 0:30
res = gridscore(model, Xblcal, ycal, Xblval, yval; score = rmsep, pars, nlv)
group = res.bscal
plotgrid(res.nlv, res.y1, group; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = mbplsr(bscal = res.bscal[u], nlv = res.nlv[u])
fit!(model, Xbltrain, ytrain)
pred = predict(model, Xbltest).pred
@show rmsep(pred, ytest)
plotxy(vec(pred), ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
######## Discrimination
## The principle is the same as for regression
using JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Train ==> Cal + Val
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
##---- Plslda
model = plslda()
nlv = 1:30
prior = [:unif, :prop]
pars = mpar(prior = prior)
res = gridscore(model, Xcal, ycal, Xval, yval; score = errp, pars, nlv)
typ = res.prior
plotgrid(res.nlv, res.y1, typ; step = 2, xlabel = "Nb. LVs", ylabel = "ERR").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model = plslda(nlv = res.nlv[u], prior = res.prior[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
@show errp(pred, ytest)
conf(pred, ytest).pct
Jchemo.gridscore
— Methodgridscore(model::Pipeline, Xtrain, Ytrain, X, Y; score, pars = nothing,
nlv = nothing, lb = nothing, verbose = false)
Test-set validation of a model pipeline over a grid of parameters.
model : A pipeline of models to evaluate.
Xtrain : Training X-data (n, p).
Ytrain : Training Y-data (n, q).
X : Validation X-data (m, p).
Y : Validation Y-data (m, q).
Keyword arguments:
score : Function computing the prediction score (e.g. rmsep).
pars : Tuple of named vectors of same length defining the parameter combinations (e.g. output of function mpar).
verbose : If true, predicting information is printed.
nlv : Value, or vector of values, of the nb. of latent variables (LVs).
lb : Value, or vector of values, of the ridge regularization parameter "lambda".
In the present version of the function, only the last model of the pipeline (= the final predictor) is validated.
For other details, see function gridscore
for simple models.
Examples
using JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Building Cal and Val
## within Train
nval = Int(round(.3 * ntrain))
s = samprand(ntrain, nval)
Xcal = Xtrain[s.train, :]
ycal = ytrain[s.train]
Xval = Xtrain[s.test, :]
yval = ytrain[s.test]
####-- Pipeline Snv :> Savgol :> Plsr
## Only the last model is validated
## model1
model1 = snv()
## model2
npoint = 11 ; deriv = 2 ; degree = 3
model2 = savgol(; npoint, deriv, degree)
## model3
nlv = 0:30
model3 = plskern()
##
model = pip(model1, model2, model3)
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, nlv) ;
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "RMSEP").f
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model3 = plskern(nlv = res.nlv[u])
model = pip(model1, model2, model3)
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####-- Pipeline Pca :> Svmr
## Only the last model is validated
## model1
nlv = 15 ; scal = true
model1 = pcasvd(; nlv, scal)
## model2
kern = [:krbf]
gamma = (10).^(-5:1.:5)
cost = (10).^(1:3)
epsilon = [.1, .2, .5]
pars = mpar(kern = kern, gamma = gamma, cost = cost, epsilon = epsilon)
model2 = svmr()
##
model = pip(model1, model2)
res = gridscore(model, Xcal, ycal, Xval, yval; score = rmsep, pars, verbose = true)
u = findall(res.y1 .== minimum(res.y1))[1]
res[u, :]
model2 = svmr(kern = res.kern[u], gamma = res.gamma[u], cost = res.cost[u], epsilon = res.epsilon[u])
model = pip(model1, model2)
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.gridscore_br
— Methodgridscore_br(Xtrain, Ytrain, X, Y; algo, score, pars,
verbose = false)
Working function for gridscore
.
See function gridscore
for examples.
Jchemo.gridscore_lb
— Methodgridscore_lb(Xtrain, Ytrain, X, Y; algo, score, pars = nothing,
lb, verbose = false)
Working function for gridscore
.
Specific and faster than gridscore_br
for models using ridge regularization (e.g. RR). Argument pars
must not contain lb
.
See function gridscore
for examples.
Jchemo.gridscore_lv
— Methodgridscore_lv(Xtrain, Ytrain, X, Y; algo, score, pars = nothing,
nlv, verbose = false)
Working function for gridscore
.
Specific and faster than gridscore_br
for models using latent variables (e.g. PLSR). Argument pars
must not contain nlv
.
See function gridscore
for examples.
Jchemo.head
— Method@head X
Display the first rows of a dataset.
Examples
using Jchemo
X = rand(100, 5)
@head X
Jchemo.interpl
— Methodinterpl(; kwargs...)
interpl(X; kwargs...)
Sampling spectra by interpolation.
X : Matrix (n, p) of spectra (rows).
Keyword arguments:
wl : Values representing the column "names" of X. Must be a numeric vector of length p, or an AbstractRange, with growing values.
wlfin : Final values (within the range of wl) where to interpolate each spectrum. Must be a numeric vector, or an AbstractRange, with growing values.
The function implements a cubic spline interpolation using package DataInterpolations.jl.
References
Package DataInterpolations.jl https://github.com/PumasAI/DataInterpolations.jl https://htmlpreview.github.io/?https://github.com/PumasAI/DataInterpolations.jl/blob/v2.0.0/example/DataInterpolations.html
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
wlfin = range(500, 2400, length = 10)
#wlfin = collect(range(500, 2400, length = 10))
model = interpl(; wl, wlfin)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.iqrv
— Methodiqrv(x)
Compute the interquartile range (IQR) of a vector.
x : A vector (n).
Examples
x = rand(100)
iqrv(x)
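Consistency check with quantiles (assumption: the IQR is computed as Q3 - Q1):
using Jchemo, Statistics
x = randn(1000)
iqrv(x)
quantile(x, .75) - quantile(x, .25)  # expected to match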
Jchemo.isel!
— Functionisel!(model, X, Y, wl = 1:nco(X); rep = 1, nint = 5, psamp = .3, score = rmsep)
Interval variable selection.
model : Model to evaluate.
X : X-data (n, p).
Y : Y-data (n, q).
wl : Optional numeric labels (p, 1) of the X-columns.
Keyword arguments:
rep : Number of replications of the training/test splitting.
nint : Nb. intervals.
psamp : Proportion of data used as test set to compute the score.
score : Function computing the prediction score.
The principle is as follows:
- Data (X, Y) are split randomly into a training and a test set.
- Range 1:p in X is segmented into nint intervals, when possible of equal size.
- The model is fitted on the training set and the score (error rate) is computed on the test set, firstly accounting for all the p variables (reference) and secondly for each of the nint intervals.
- This process is replicated rep times. Average results are provided in the outputs, as well as the results per replication.
References
- Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.V., Munck, L.,
Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500
Examples
using Jchemo, JchemoData, DataFrames, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
wl_str = names(X)
wl = parse.(Float64, wl_str)
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Work on the j-th y-variable
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]
model = plskern(nlv = 5)
nint = 10
res = isel!(model, Xtrain, ytrain, wl; rep = 30, nint) ;
res.res_rep
res.res0_rep
zres = res.res
zres0 = res.res0
f = Figure(size = (650, 300))
ax = Axis(f[1, 1], xlabel = "Wavelength (nm)", ylabel = "RMSEP_Val",
xticks = zres.lo)
scatter!(ax, zres.mid, zres.y1; color = (:red, .5))
vlines!(ax, zres.lo; color = :grey, linestyle = :dash, linewidth = 1)
hlines!(ax, zres0.y1, linestyle = :dash)
f
Jchemo.kdeda
— Methodkdeda(; kwargs...)
kdeda(X, y; kwargs...)
Discriminant analysis using non-parametric kernel Gaussian density estimation (KDE-DA).
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
The principle is the same as function qda, except that the densities by class are estimated from function dmkern instead of function dmnorm.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
prior = :unif
#prior = :prop
model = kdeda(; prior)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
model = kdeda(; prior, a = .5)
#model = kdeda(; prior, h = .1)
fit!(model, Xtrain, ytrain)
model.fitm.fitm[1].H
Jchemo.knnda
— Methodknnda(; kwargs...)
knnda(X, y; kwargs...)
k-Nearest-Neighbours weighted discrimination (KNN-DA).
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function winvs. The lower h, the sharper the function. See function winvs for details (keyword arguments criw and squared of winvs can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when very close neighbors.
scal : Boolean. If true, each column of the global X is scaled by its uncorrected standard deviation before the distance and weight computations.
This function has the same principle as function knnr
except that a discrimination replaces the regression. A weighted vote is done over the neighborhood, and the prediction corresponds to the most frequent class.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
metric = :eucl
h = 2 ; k = 10
model = knnda(; metric, h, k)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
## With dimension reduction
model1 = pcasvd(; nlv = 15)
metric = :mah ; h = 1 ; k = 3
model2 = knnda(; metric, h, k)
model = pip(model1, model2)
fit!(model, Xtrain, ytrain)
@head pred = predict(model, Xtest).pred
errp(pred, ytest)
Jchemo.knnr
— Methodknnr(; kwargs...)
knnr(X, Y; kwargs...)
k-Nearest-Neighbours weighted regression (KNNR).
X : X-data (n, p).
Y : Y-data (n, q).
Keyword arguments:
metric : Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are :eucl (Euclidean distance), :mah (Mahalanobis distance).
h : A scalar defining the shape of the weight function computed by function winvs. The lower h, the sharper the function. See function winvs for details (keyword arguments criw and squared of winvs can also be specified here).
k : The number of nearest neighbors to select for each observation to predict.
tolw : For stabilization when very close neighbors.
scal : Boolean. If true, each column of the global X is scaled by its uncorrected standard deviation before the distance and weight computations.
The general principle of this function is as follows (many other variants of kNNR pipelines can be built): a) For each new observation to predict, the prediction is the weighted mean of y over a selected neighborhood (in X) of size k. b) Within the selected neighborhood, the weights are defined from the dissimilarities between the new observation and its neighbors, and are computed from function winvs.
In general, for X-data with high dimensions, using the Mahalanobis distance requires a preliminary dimensionality reduction (see examples).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
h = 1 ; k = 3
model = knnr(; h, k)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
dump(model.fitm.par)
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## With dimension reduction
model1 = pcasvd(nlv = 15)
metric = :eucl ; h = 1 ; k = 3
model2 = knnr(; metric, h, k)
model = pip(model1, model2)
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest) ;
@head res.pred
@show rmsep(res.pred, ytest)
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
model = knnr(k = 15, h = 5)
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.kpca
— Methodkpca(; kwargs...)
kpca(X; kwargs...)
kpca(X, weights::Weight; kwargs...)
Kernel PCA (Scholkopf et al. 1997, Scholkopf & Smola 2002, Tipping 2001).
X : X-data (n, p).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. principal components (PCs) to consider.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The method is implemented by SVD factorization of the weighted Gram matrix:
- D^(1/2) * Phi(X) * Phi(X)' * D^(1/2)
where X is the centered matrix and D is the diagonal matrix of the weights (weights.w) of the observations (rows of X).
References
Scholkopf, B., Smola, A., Müller, K.-R., 1997. Kernel principal component analysis, in: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (Eds.), Artificial Neural Networks, ICANN 97, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 583-588. https://doi.org/10.1007/BFb0020217
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Tipping, M.E., 2001. Sparse kernel principal component analysis. Advances in neural information processing systems, MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
Xtest = X[s.test, :]
nlv = 3
kern = :krbf ; gamma = 1e-4
model = kpca(; nlv, kern, gamma) ;
fit!(model, Xtrain)
@names model.fitm
@head T = model.fitm.T
T' * T
model.fitm.V' * model.fitm.V
@head Ttest = transf(model, Xtest)
res = summary(model) ;
@names res
res.explvarx
Jchemo.kplskdeda
— Methodkplskdeda(; kwargs...)
kplskdeda(X, y; kwargs...)
kplskdeda(X, y, weights::Weight; kwargs...)
KPLS-KDEDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
- Keyword arguments of function dmkern (bandwidth definition) can also be specified here.
scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plskdeda
(PLS-KDEDA) except that a kernel PLSR (function kplsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
See function kplslda
for examples.
Jchemo.kplslda
— Methodkplslda(; kwargs...)
kplslda(X, y; kwargs...)
kplslda(X, y, weights::Weight; kwargs...)
KPLS-LDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See the respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plslda
(PLS-LDA) except that a kernel PLSR (function kplsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
gamma = .1
model = kplslda(; nlv, gamma)
#model = kplslda(; nlv, gamma, prior = :prop)
#model = kplsqda(; nlv, gamma, alpha = .5)
#model = kplskdeda(; nlv, gamma, a = .5)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
embfitm = fitm.fitm.embfitm ;
@head embfitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(embfitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
Jchemo.kplsqda
— Methodkplsqda(; kwargs...)
kplsqda(X, y; kwargs...)
kplsqda(X, y, weights::Weight; kwargs...)
KPLS-QDA.
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute. Must be >= 1.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See the respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
alpha : Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0) and LDA (alpha = 1).
scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plsqda
(PLS-QDA) except that a kernel PLSR (function kplsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
See function kplslda
for examples.
Jchemo.kplsr
— Methodkplsr(; kwargs...)
kplsr(X, Y; kwargs...)
kplsr(X, Y, weights::Weight; kwargs...)
kplsr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Kernel partial least squares regression (KPLSR) implemented with a Nipals algorithm (Rosipal & Trejo, 2001).
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to consider.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See the respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X and Y is scaled by its uncorrected standard deviation.
This algorithm becomes slow for n > 1000. Use function dkplsr
instead.
References
Rosipal, R., Trejo, L.J., 2001. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research 2, 97-123.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
kern = :krbf ; gamma = 1e-1
model = kplsr(; nlv, kern, gamma) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
nlv = 2
kern = :krbf ; gamma = 1 / 3
model = kplsr(; nlv, kern, gamma)
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.kplsrda
— Methodkplsrda(; kwargs...)
kplsrda(X, y; kwargs...)
kplsrda(X, y, weights::Weight; kwargs...)
Discrimination based on kernel partial least squares regression (KPLSR-DA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See the respective functions krbf and kpol for their keyword arguments.
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
scal : Boolean. If true, each column of X and Ydummy is scaled by its uncorrected standard deviation.
Same as function plsrda
(PLSR-DA) except that a kernel PLSR (function kplsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
kern = :krbf ; gamma = .001
scal = true
model = kplsrda(; nlv, kern, gamma, scal)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
@head fitm.fitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
Jchemo.kpol
— Methodkpol(X, Y; kwargs...)
Compute a polynomial kernel Gram matrix.
X : X-data (n, p).
Y : Y-data (m, p).
Keyword arguments:
gamma : Scale of the polynomial.
coef0 : Offset of the polynomial.
degree : Degree of the polynomial.
Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:
- K(X, Y) = Phi(X) * Phi(Y)'.
The polynomial kernel between two vectors x and y is computed by (gamma * (x' * y) + coef0)^degree.
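As a quick sanity check of this formula, the Gram matrix can also be computed directly from the matrix product; a minimal sketch (the comparison is expected to hold up to floating-point rounding):
using Jchemo
X = rand(5, 3)
Y = rand(2, 3)
gamma = .1 ; coef0 = 10 ; degree = 3
K = kpol(X, Y; gamma, coef0, degree)
## Same Gram matrix computed from the formula above
Kman = (gamma * X * Y' .+ coef0).^degree
K ≈ Kman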
References
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using Jchemo
X = rand(5, 3)
Y = rand(2, 3)
kpol(X, Y; gamma = .1, coef0 = 10, degree = 3)
Jchemo.krbf
— Methodkrbf(X, Y; kwargs...)
Compute a Radial-Basis-Function (RBF) kernel Gram matrix.
X : X-data (n, p).
Y : Y-data (m, p).
Keyword arguments:
gamma : Scale parameter.
Given matrices X and Y of sizes (n, p) and (m, p), respectively, the function returns the (n, m) Gram matrix:
- K(X, Y) = Phi(X) * Phi(Y)'.
The RBF kernel between two vectors x and y is computed by exp(-gamma * ||x - y||^2).
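The same kind of check can be done for the RBF kernel; a minimal sketch built element-wise from the formula above (expected to match krbf up to rounding):
using Jchemo
X = rand(5, 3)
Y = rand(2, 3)
gamma = .1
K = krbf(X, Y; gamma)
## Same Gram matrix computed element-wise from the formula above
Kman = [exp(-gamma * sum((X[i, :] .- Y[j, :]).^2)) for i in 1:5, j in 1:2]
K ≈ Kman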
References
Scholkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond, Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using Jchemo
X = rand(5, 3)
Y = rand(2, 3)
krbf(X, Y; gamma = .1)
Jchemo.krr
— Methodkrr(; kwargs...)
krr(X, Y; kwargs...)
krr(X, Y, weights::Weight; kwargs...)
krr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Kernel ridge regression (KRR) implemented by SVD factorization.
X : X-data (n, p).
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
lb : Ridge regularization parameter "lambda".
kern : Type of kernel used to compute the Gram matrices. Possible values are :krbf, :kpol. See the respective functions krbf and kpol for their keyword arguments.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
KRR is also referred to as least squares SVM regression (LS-SVMR). The method is close to the particular case of SVM regression where no margin excludes observations (epsilon coefficient set to zero). The difference is that an L2-norm optimization is done, instead of L1 in SVM.
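To illustrate the principle, here is a minimal sketch of plain (unweighted, uncentered) kernel ridge regression in its dual form, with an RBF kernel built by hand. The actual krr implementation additionally handles observation weights, centering and an SVD factorization, so this is only a conceptual illustration with local names:
using LinearAlgebra
n, p = 50, 5
X = rand(n, p) ; y = rand(n)
Xnew = rand(3, p)
gamma = 1.0 ; lb = 1e-2
rbf(A, B) = [exp(-gamma * sum((A[i, :] .- B[j, :]).^2)) for i in 1:size(A, 1), j in 1:size(B, 1)]
K = rbf(X, X)
alpha = (K + lb * I) \ y       ## dual coefficients
pred = rbf(Xnew, X) * alpha    ## predictions for the new observations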
References
Bennett, K.V., Embrechts, M.J., 2003. An optimization perspective on kernel partial least squares regression, in: Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer & Systems Sciences. IOS Press Amsterdam, pp. 227-250.
Cawley, G.C., Talbot, N.L.C., 2002. Reduced Rank Kernel Ridge Regression. Neural Processing Letters 16, 293-302. https://doi.org/10.1023/A:1021798002258
Krell, M.M., 2018. Generalizing, Decoding, and Optimizing Support Vector Machine Classification. arXiv:1801.04929.
Saunders, C., Gammerman, A., Vovk, V., 1998. Ridge Regression Learning Algorithm in Dual Variables, in: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, pp. 515-521.
Suykens, J.A.K., Lukas, L., Vandewalle, J., 2000. Sparse approximation using least squares support vector machines. 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353). https://doi.org/10.1109/ISCAS.2000.856439
Welling, M., n.d. Kernel ridge regression. Department of Computer Science, University of Toronto, Toronto, Canada. https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
lb = 1e-3
kern = :krbf ; gamma = 1e-1
model = krr(; lb, kern, gamma) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
coef(model)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
coef(model; lb = 1e-1)
res = predict(model, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
lb = 1e-3
kern = :kpol ; degree = 1
model = krr(; lb, kern, degree)
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest)
rmsep(res.pred, ytest)
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
lb = 1e-1
kern = :krbf ; gamma = 1 / 3
model = krr(; lb, kern, gamma)
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.krrda
— Methodkrrda(; kwargs...)
krrda(X, y; kwargs...)
krrda(X, y, weights::Weight; kwargs...)
Discrimination based on kernel ridge regression (KRR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
lb
: Ridge regularization parameter "lambda".kern
: Type of kernel used to compute the Gram matrices. Possible values are::krbf
,:kpol
. See respective functionskrbf
andkpol
for their keyword arguments.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Same as function rrda
(RR-DA) except that a kernel RR (function krr
), instead of a RR (function rr
), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
lb = 1e-5
kern = :krbf ; gamma = .001
scal = true
model = krrda(; lb, kern, gamma, scal)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; lb = [.1, .001]).pred
Jchemo.lda
— Methodlda(; kwargs...)
lda(X, y; kwargs...)
lda(X, y, weights::Weight; kwargs...)
Linear discriminant analysis (LDA).
X : X-data (n, p).
y : Univariate class membership (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
prior : Type of prior probabilities for class membership. Possible values are :unif (uniform), :prop (proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order as mlev(y)).
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights) allows other choices to be implemented.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
model = lda()
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
aggsum(fitm.weights.w, ytrain)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.list
— Methodlist(Q, n::Integer)
Create a Vector{Q}(undef, n).
isassigned(object, i) can be used to check if cell i is empty.
Examples
using Jchemo
list(Float64, 5)
list(Array{Float64}, 5)
list(Matrix{Int}, 5)
Jchemo.list
— Methodlist(n::Integer)
Create a Vector{Any}(nothing, n).
isnothing(object[i]) can be used to check if cell i is empty.
Examples
using Jchemo
list(5)
Jchemo.locw
— Methodlocw(Xtrain, Ytrain, X; listnn, listw = nothing, algo, verbose = false, kwargs...)
Compute predictions for a given kNN model.
Xtrain
: Training X-data.Ytrain
: Training Y-data.X
: X-data (m observations) to predict.
Keyword arguments:
listnn : List (vector) of m vectors of indexes.
listw : List (vector) of m vectors of weights.
algo : Function computing the model on the m neighborhoods.
verbose : Boolean. If true, information about the predictions is printed.
kwargs : Keyword arguments to pass to function algo. Each argument must have length = 1 (not be a collection).
Each component i of listnn and listw contains the indexes and weights, respectively, of the nearest neighbors of x_i in Xtrain. The sizes of the neighborhoods for i = 1,...,m can be different.
Jchemo.locwlv
— Methodlocwlv(Xtrain, Ytrain, X; listnn, listw = nothing, algo, nlv, verbose = true, kwargs...)
Compute predictions for a given kNN model.
Xtrain
: Training X-data.Ytrain
: Training Y-data.X
: X-data (m observations) to predict.
Keyword arguments:
listnn
: List (vector) of m vectors of indexes.listw
: List (vector) of m vectors of weights.algo
: Function computing the model on the m neighborhoods.nlv
: Nb. or collection of nb. of latent variables (LVs).verbose
: Boolean. Iftrue
, predicting information are printed.kwargs
: Keywords arguments to pass in functionalgo
. Each argument must have length = 1 (not be a collection).
Same as locw
but specific and much faster for LV-based models (e.g. PLSR).
Jchemo.loessr
— Methodloessr(; kwargs...)
loessr(X, y; kwargs...)
Compute a locally weighted regression model (LOESS).
X : X-data (n, p).
y : Univariate y-data (n).
Keyword arguments:
span : Window for neighborhood selection (level of smoothing) for the local fitting, typically within (0, 1].
degree : Polynomial degree for the local fitting.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The function fits a LOESS model using package Loess.jl.
Smaller values of span result in a smaller local context for the fitting (less smoothing).
References
https://github.com/JuliaStats/Loess.jl
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American statistical association, 74(368), 829-836. DOI: 10.1080/01621459.1979.10481038
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association, 83(403), 596-610. DOI: 10.1080/01621459.1988.10478639
Cleveland, W. S., & Grosse, E. (1991). Computational methods for local regression. Statistics and computing, 1(1), 47-62. DOI: 10.1007/BF01890836
Examples
using Jchemo, CairoMakie
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
model = loessr(span = 1 / 3)
fit!(model, x, y)
pred = predict(model, x).pred
f = Figure(size = (700, 300))
ax = Axis(f[1, 1], xlabel = "x", ylabel = "y")
scatter!(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred); label = "Loess")
f[1, 2] = Legend(f, ax, framevisible = false)
f
Jchemo.lwmlr
— Methodlwmlr(; kwargs...)
lwmlr(X, Y; kwargs...)
k-Nearest-Neighbours locally weighted multiple linear regression (kNN-LWMLR).
X
: X-data (n, p).Y
: Y-data (n, q).
Keyword arguments:
metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.scal
: Boolean. Iftrue
, each column of the globalX
is scaled by its uncorrected standard deviation before the distance and weight computations.verbose
: Boolean. Iftrue
, predicting information are printed.
This is the same principle as function lwplsr
except that MLR models are fitted on the neighborhoods, instead of PLSR models. The neighborhoods are computed directly on X
(there is no preliminary dimension reduction).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 20
model0 = pcasvd(; nlv) ;
fit!(model0, Xtrain)
@head Ttrain = model0.fitm.T
@head Ttest = transf(model0, Xtest)
metric = :eucl
h = 2 ; k = 100
model = lwmlr(; metric, h, k)
fit!(model, Ttrain, ytrain)
@names model
@names model.fitm
dump(model.fitm.par)
res = predict(model, Ttest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
model = lwmlr(metric = :eucl, h = 1.5, k = 20) ;
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.lwmlrda
— Methodlwmlrda(; kwargs...)
lwmlrda(X, y; kwargs...)
k-Nearest-Neighbours locally weighted MLR-based discrimination (kNN-LWMLR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.scal
: Boolean. Iftrue
, each column of the globalX
is scaled by its uncorrected standard deviation before the distance and weight computations.verbose
: Boolean. Iftrue
, predicting information are printed.
This is the same principle as function lwmlr
except that MLR-DA models, instead of MLR models, are fitted on the neighborhoods.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
metric = :mah
h = 2 ; k = 10
model = lwmlrda(; metric, h, k)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplslda
— Methodlwplslda(; kwargs...)
lwplslda(X, y; kwargs...)
kNN-LWPLS-LDA.
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
nlvdis
: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. Ifnlvdis = 0
, there is no dimension reduction.metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.nlv
: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional).scal
: Boolean. Iftrue
, (a) each column of the globalX
(and of the global Ydummy if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Ydummy scaling is also done within each neighborhood (local level) for the weighted PLS.verbose
: Boolean. Iftrue
, predicting information are printed.
This is the same principle as function lwplsr
except that a PLS-LDA model, instead of a PLSR model, is fitted on each neighborhoods.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 2 ; k = 200
nlv = 10
model = lwplslda(; nlvdis, metric, h, k, nlv, prior = :prop)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplsqda
— Methodlwplsqda(; kwargs...)
lwplsqda(X, y; kwargs...)
kNN-LWPLS-QDA.
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
nlvdis
: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. Ifnlvdis = 0
, there is no dimension reduction.metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.nlv
: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).scal
: Boolean. Iftrue
, (a) each column of the globalX
(and of the global Ydummy if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Ydummy scaling is also done within each neighborhood (local level) for the weighted PLS.verbose
: Boolean. Iftrue
, predicting information are printed.
This is the same principle as function lwplsr
except that a PLS-QDA model, instead of a PLSR model, is fitted on each neighborhoods.
- Warning: The present version of this function can stop with an error due to non positive definite matrices when doing QDA on neighborhoods. This is because some classes within a neighborhood can have very few observations. It is recommended to select a sufficiently large number of neighbors and/or to use a regularized QDA (alpha > 0).
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 2 ; k = 200
nlv = 10
model = lwplsqda(; nlvdis, metric, h, k, nlv, prior = :prop, alpha = .5)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.lwplsr
— Methodlwplsr(; kwargs...)
lwplsr(X, Y; kwargs...)
k-Nearest-Neighbours locally weighted partial least squares regression (kNN-LWPLSR).
X
: X-data (n, p).Y
: Y-data (n, q).
Keyword arguments:
nlvdis
: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. Ifnlvdis = 0
, there is no dimension reduction.metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.nlv
: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.scal
: Boolean. Iftrue
, (a) each column of the globalX
(and of the globalY
if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Y scaling is also done within each neighborhood (local level) for the weighted PLSR.verbose
: Boolean. Iftrue
, predicting information are printed.
Function lwplsr
fits kNN-LWPLSR models such as in Lesnoff et al. 2020. The general principle of the pipeline is as follows (many other variants of pipelines can be built):
LWPLSR is a particular case of weighted PLSR (WPLSR) (e.g. Schaal et al. 2002). In WPLSR, a priori weights, different from the usual 1/n (standard PLSR), are given to the n training observations. These weights are used for calculating (i) the scores and loadings of the WPLS and (ii) the regression model that fits (by weighted least squares) the Y-response(s) to the WPLS scores. The specificity of LWPLSR (compared to WPLSR) is that the weights are computed from dissimilarities (e.g. distances) between the new observation to predict and the training observations ("L" in LWPLSR comes from "localized"). Note that in LWPLSR the weights and therefore the fitted WPLSR model change for each new observation to predict.
In the original LWPLSR, all the n training observations are used for each observation to predict (e.g. Sicard & Sabatier 2006, Kim et al. 2011). This can be very time consuming when n is large. A faster (and often more efficient) strategy is to first select, in the training set, the k nearest neighbors of the observation to predict (= "weighting 1") and then to apply LWPLSR only to this pre-selected neighborhood (= "weighting 2"). This strategy corresponds to a kNN-LWPLSR and is the one implemented in function lwplsr.
In lwplsr
, the dissimilarities used for weightings 1 and 2 are computed from the raw X-data, or after a dimension reduction, depending on argument nlvdis
. In the last case, global PLS2 scores (LVs) are computed from {X
, Y
} and the dissimilarities are computed over these scores.
In general, for high dimensional X-data, using the Mahalanobis distance requires preliminary dimensionality reduction of the data. In function knnr, the preliminary reduction (argument nlvdis) is done by PLS on {X, Y}.
References
Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.
Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics, e3209. https://doi.org/10.1002/cem.3209
Schaal, S., Atkeson, C., Vijayamakumar, S. 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.
Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall dataset. Comput. Stat. Data Anal., 51, 1393-1410.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlvdis = 15 ; metric = :mah
h = 1 ; k = 500 ; nlv = 10
model = lwplsr(; nlvdis, metric, h, k, nlv)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.lwplsravg
— Methodlwplsravg(; kwargs...)
lwplsravg(X, Y; kwargs...)
Averaging kNN-LWPLSR models with different numbers of latent variables (kNN-LWPLSR-AVG).
X
: X-data (n, p).Y
: Y-data (n, q).
Keyword arguments:
nlvdis
: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. Ifnlvdis = 0
, there is no dimension reduction.metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.nlv
: A range of nb. of latent variables (LVs) to compute for the local (i.e. inside each neighborhood) models.scal
: Boolean. Iftrue
, (a) each column of the globalX
(and of the globalY
if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Y scaling is also done within each neighborhood (local level) for the weighted PLSR.verbose
: Boolean. Iftrue
, predicting information are printed.
Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs, such as in Lesnoff 2023. On each neighborhood, a PLSR-averaging (Lesnoff et al. 2022) is done instead of a PLSR.
For instance, if argument nlv is set to nlv = 5:10, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ..., 10 LVs, respectively.
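A minimal sketch of this averaging step is given below, assuming (as in the examples of this documentation) that predict called with a range of nlv returns one prediction matrix per value; plskern is used here only as a stand-in for the local models fitted inside each neighborhood:
using Jchemo, Statistics
n, p = 50, 10
X = rand(n, p) ; y = rand(n)
Xnew = rand(5, p)
model = plskern(nlv = 10)
fit!(model, X, y)
## One prediction matrix per number of LVs in the range, then a simple average
preds = predict(model, Xnew; nlv = 5:10).pred
predavg = mean(preds)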
References
Lesnoff, M., Andueza, D., Barotin, C., Barre, V., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, V., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850
M. Lesnoff, Averaging a local PLSR pipeline to predict chemical compositions and nutritive values of forages and feed from spectral near infrared data, Chemometrics and Intelligent Laboratory Systems. 244 (2023) 105031. https://doi.org/10.1016/j.chemolab.2023.105031.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlvdis = 5 ; metric = :mah
h = 1 ; k = 200 ; nlv = 4:20
model = lwplsravg(; nlvdis, metric, h, k, nlv) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.lwplsrda
— Methodlwplsrda(; kwargs...)
lwplsrda(X, y; kwargs...)
kNN-LWPLSR-DA.
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
nlvdis
: Number of latent variables (LVs) to consider in the global PLS used for the dimension reduction before computing the dissimilarities. Ifnlvdis = 0
, there is no dimension reduction.metric
: Type of dissimilarity used to select the neighbors and to compute the weights. Possible values are::eucl
(Euclidean distance),:mah
(Mahalanobis distance).h
: A scalar defining the shape of the weight function computed by functionwinvs
. Lower is h, sharper is the function. See functionwinvs
for details (keyword argumentscriw
andsquared
ofwinvs
can also be specified here).k
: The number of nearest neighbors to select for each observation to predict.tolw
: For stabilization when very close neighbors.nlv
: Nb. latent variables (LVs) for the local (i.e. inside each neighborhood) models.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional).scal
: Boolean. Iftrue
, (a) each column of the globalX
(and of the global Ydummy if there is a preliminary PLS reduction dimension) is scaled by its uncorrected standard deviation before to compute the distances and the weights, and (b) the X and Ydummy scaling is also done within each neighborhood (local level) for the weighted PLS.verbose
: Boolean. Iftrue
, predicting information are printed.
This is the same principle as function lwplsr
except that PLSR-DA models, instead of PLSR models, are fitted on the neighborhoods.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlvdis = 25 ; metric = :mah
h = 2 ; k = 200
nlv = 10
model = lwplsrda(; nlvdis, metric, h, k, nlv)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
res.listnn
res.listd
res.listw
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.madv
— Methodmadv(x)
Compute the median absolute deviation (MAD) of a vector.
x
: A vector (n).
This is the MAD adjusted by a factor (1.4826) for asymptotically normal consistency.
Examples
using Jchemo
x = rand(100)
madv(x)
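For reference, a minimal check against the direct definition (median absolute deviation around the median, multiplied by the 1.4826 consistency factor), which madv is expected to match:
using Statistics
1.4826 * median(abs.(x .- median(x)))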
Jchemo.mahsq
— Methodmahsq(X, Y)
mahsq(X, Y, Sinv)
Squared Mahalanobis distances between the rows of X
and Y
.
X : Data (n, p).
Y : Data (m, p).
Sinv : Inverse of a covariance matrix S. If not given, S is computed as the uncorrected covariance matrix of X.
When X and Y are (n, p) and (m, p), respectively, the function returns an (n, m) object whose element (i, j) is the squared distance between row i of X and row j of Y.
Examples
using StatsBase
X = rand(5, 3)
Y = rand(2, 3)
mahsq(X, Y)
S = cov(X, corrected = false)
Sinv = inv(S)
mahsq(X, Y, Sinv)
mahsq(X[1:1, :], Y[1:1, :], Sinv)
mahsq(X[:, 1], 4)
mahsq(1, 4, 2.1)
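As a complement, element (i, j) of the output can be recomputed directly from the definition of the squared Mahalanobis distance; a minimal check reusing X, Y and Sinv from above (expected to match mahsq(X, Y, Sinv)[1, 1] up to rounding):
using LinearAlgebra
d = X[1, :] - Y[1, :]
dot(d, Sinv * d)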
Jchemo.mahsqchol
— Methodmahsqchol(X, Y)
mahsqchol(X, Y, Uinv)
Compute the squared Mahalanobis distances (with a Cholesky factorization) between the observations (rows) of X
and Y
.
X : Data (n, p).
Y : Data (m, p).
Uinv : Inverse of the upper matrix of a Cholesky factorization of a covariance matrix S. If not given, the factorization is done on S, the uncorrected covariance matrix of X.
When X and Y are (n, p) and (m, p), respectively, the function returns an (n, m) object whose element (i, j) is the squared distance between row i of X and row j of Y.
Examples
using LinearAlgebra, StatsBase
X = rand(5, 3)
Y = rand(2, 3)
mahsqchol(X, Y)
S = cov(X, corrected = false)
U = cholesky(Hermitian(S)).U
Uinv = inv(U)
mahsqchol(X, Y, Uinv)
mahsqchol(X[:, 1], 4)
mahsqchol(1, 4, sqrt(2.1))
Jchemo.matB
— FunctionmatB(X, y, weights::Weight)
Between-class covariance matrix.
X : X-data (n, p).
y : A vector (n) defining the class membership.
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Compute the between-class covariance matrix (output B) of X. This is the (non-corrected) covariance matrix of the weighted class centers.
Examples
using Jchemo, StatsBase
n = 20 ; p = 3
X = rand(n, p)
y = rand(1:3, n)
tab(y)
weights = mweight(ones(n))
res = matB(X, y, weights) ;
res.B
res.priors
res.ni
res.lev
res = matW(X, y, weights) ;
res.W
res.Wi
matW(X, y, weights).W + matB(X, y, weights).B
cov(X; corrected = false)
v = mweight(collect(1:n))
matW(X, y, v).priors
matB(X, y, v).priors
matW(X, y, v).W + matB(X, y, v).B
covm(X, v)
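As a complement to the example above, the matrix B can be recomputed by hand for uniform observation weights; a minimal sketch of the definition (all names below are local to the sketch, and the result is expected to agree with matB(X, y, mweight(ones(n))).B up to rounding):
using Statistics
lev = sort(unique(y))
centers = vcat([mean(X[y .== l, :]; dims = 1) for l in lev]...)   ## class centers
props = [sum(y .== l) / n for l in lev]                           ## class proportions
mu = vec(mean(X; dims = 1))                                       ## overall mean
B = sum(props[i] * (centers[i, :] - mu) * (centers[i, :] - mu)' for i in eachindex(lev))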
Jchemo.matW
— FunctionmatW(X, y, weights::Weight)
Within-class covariance matrices.
X : X-data (n, p).
y : A vector (n) defining the class membership.
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Compute the (non-corrected) within-class and pooled covariance matrices (outputs Wi and W, respectively) of X.
If class i contains only one observation, Wi is computed by covm(X, weights).
For examples, see function matB
.
Jchemo.mavg
— Methodmavg(; kwargs...)
mavg(X; kwargs...)
Smoothing by moving averages of each row of X-data.
X
: X-data (n, p).
Keyword arguments:
npoint
: Nb. points involved in the window.
The function returns a matrix (n, p).
The smoothing is computed by convolution with padding, using function imfilter of package ImageFiltering.jl. The centered kernel is ones(npoint) / npoint. Each returned point is located at the center of the kernel. Assume a signal x of length p (row of X) corresponding to a vector wl of p wavelengths (or other indexes).
If npoint = 3, the kernel is kern = [.33, .33, .33], and:
- The output value at index i = 4 is: dot(kern, [x[3], x[4], x[5]]). The corresponding wavelength is: wl[4].
- The output value at index i = 1 is: dot(kern, [x[1], x[1], x[2]]) (padding). The corresponding wavelength is: wl[1].
If npoint = 4, the kernel is kern = [.25, .25, .25, .25], and:
- The output value at index i = 4 is: dot(kern, [x[3], x[4], x[5], x[6]]). The corresponding wavelength is: (wl[4] + wl[5]) / 2.
- The output value at index i = 1 is: dot(kern, [x[1], x[1], x[2], x[3]]) (padding). The corresponding wavelength is: (wl[1] + wl[2]) / 2.
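For an odd window size, the same centered moving average with replicate padding can be sketched in base Julia (mavg itself relies on ImageFiltering.jl and also handles even window sizes; names below are local to the sketch):
using Statistics
x = rand(20)            ## one signal (one row of X)
npoint = 3
half = npoint ÷ 2
## Replicate padding at the borders, then centered moving average
xpad = vcat(fill(x[1], half), x, fill(x[end], half))
xs = [mean(xpad[i:i + npoint - 1]) for i in 1:length(x)]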
References
Package ImageFiltering.jl https://github.com/JuliaImages/ImageFiltering.jl
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = mavg(npoint = 10)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.mbconcat
— Methodmbconcat()
mbconcat(Xbl)
Concatenate horizontally multiblock X-data.
Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
Examples
using Jchemo
n = 5 ; m = 3 ; p = 9
X = rand(n, p)
Xnew = rand(m, p)
listbl = [3:4, 1, [6; 8:9]]
Xbl = mblock(X, listbl)
Xblnew = mblock(Xnew, listbl)
@head Xbl[3]
model = mbconcat()
fit!(model, Xbl)
transf(model, Xbl)
transf(model, Xblnew)
Jchemo.mblock
— Methodmblock(X, listbl)
Make blocks from a matrix.
X : X-data (n, p).
listbl : A vector in which each component defines the column numbers of a block in X. The length of listbl is the number of blocks.
The function returns a list (vector) of blocks.
Examples
using Jchemo
n = 5 ; p = 10
X = rand(n, p)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
Xbl[1]
Xbl[2]
Xbl[3]
Jchemo.mbpca
— Methodmbpca(; kwargs...)
mbpca(Xbl; kwargs...)
mbpca(Xbl, weights::Weight; kwargs...)
mbpca!(Xbl::Matrix, weights::Weight; kwargs...)
Consensus principal components analysis (CPCA, a.k.a MBPCA).
Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock.
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. global latent variables (LVs = scores) to compute.
bscal : Type of block scaling. See function blockscal for possible values.
tol : Tolerance value for Nipals convergence.
maxit : Maximum number of iterations (Nipals).
scal : Boolean. If true, each column of blocks in Xbl is scaled by its uncorrected standard deviation (before the block scaling).
CPCA algorithm (Westerhuis et al. 1998), a.k.a. MBPCA, referred to as CPCA-W in Smilde et al. 2003.
Apart from a possible block scaling, MBPCA is equivalent to the PCA of the horizontally concatenated matrix X = [X1 X2 ... Xk].
The function returns several objects, in particular:
T : The global LVs (not normed).
U : The global LVs (normed).
W : The block weights (normed).
Tb : The block LVs (in the metric scale), returned grouped by LV.
Tbl : The block LVs (in the original scale), returned grouped by block.
Vbl : The block loadings (normed).
lb : The block-specific weights ('lambda') for the global LVs.
mu : The sum of the block-specific weights (= eigenvalues of the global PCA).
Function summary returns:
explvarx : Proportion of the total X inertia (squared Frobenius norm) explained by the global LVs.
explxbl : Proportion of the inertia of each block (= Xbl[k]) explained by the global LVs.
contrxbl2t : Contribution of each block to the global LVs.
rvxbl2t : RV coefficients between each block and the global LVs.
rdxbl2t : Rd coefficients between each block and the global LVs.
cortbl2t : Correlations between the block LVs (= Tbl[k]) and the global LVs.
corx2t : Correlation between the X-variables and the global LVs.
References
Mangamana, E.T., Cariou, V., Vigneau, E., Glèlè Kakaï, R.L., Qannari, E.M., 2019. Unsupervised multiblock data analysis: A unified approach and extensions. Chemometrics and Intelligent Laboratory Systems 194, 103856. https://doi.org/10.1016/j.chemolab.2019.103856
Smilde, A.K., Westerhuis, J.A., de Jong, S., 2003. A framework for sequential multiblock component methods. Journal of Chemometrics 17, 323–337. https://doi.org/10.1002/cem.811
Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
group = dat.group
listbl = [1:11, 12:19, 20:25]
Xbl = mblock(X[1:6, :], listbl)
Xblnew = mblock(X[7:8, :], listbl)
n = nro(Xbl[1])
nlv = 3
bscal = :frob
scal = false
#scal = true
model = mbpca(; nlv, bscal, scal)
fit!(model, Xbl)
@names model
@names model.fitm
## Global scores
@head model.fitm.T
@head transf(model, Xbl)
transf(model, Xblnew)
## Blocks scores
i = 1
@head model.fitm.Tbl[i]
@head transfbl(model, Xbl)[i]
res = summary(model, Xbl) ;
@names res
res.explvarx
res.explxbl # = model.fitm.lb if bscal = :frob
rowsum(Matrix(res.explxbl))
res.contrxbl2t
res.rvxbl2t
res.rdxbl2t
res.cortbl2t
res.corx2t
Jchemo.mbplskdeda
— Methodmbplskdeda(; kwargs...)
mbplskdeda(Xbl, y; kwargs...)
mbplskdeda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-KDEDA.
Xbl
: List of blocks (vector of matrices) of X-data Typically, output of functionmblock
from (n, p) data.y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. See functionblockscal
for possible values.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).- Keyword arguments of function
dmkern
(bandwidth definition) can also be specified here. scal
: Boolean. Iftrue
, each column of blocks inXbl
and Ydummy is scaled by its uncorrected standard deviation (before the block scaling) in the MBPLS computation.
The principle is the same as function mbplsqda
except that the densities by class are estimated from dmkern
instead of dmnorm
.
See function mbplslda
for examples.
Jchemo.mbplslda
— Methodmbplslda(; kwargs...)
mbplslda(Xbl, y; kwargs...)
mbplslda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-LDA.
Xbl
: List of blocks (vector of matrices) of X-data Typically, output of functionmblock
from (n, p) data.y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. See functionblockscal
for possible values.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column of blocks inXbl
and Ydummy is scaled by its uncorrected standard deviation (before the block scaling) in the MBPLS computation.
The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- A multivariate MBPLSR (MBPLSR2) is run on {X, Ydummy}, returning a score matrix T.
- A LDA is done on {T, y}, returning estimates of the posterior probabilities (∊ [0, 1]) of class membership.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights) allows other choices to be implemented.
Examples
using Jchemo, JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
model = mbplslda(; nlv, bscal, scal)
#model = mbplsqda(; nlv, bscal, alpha = .5, scal)
#model = mbplskdeda(; nlv, bscal, scal)
fit!(model, Xbltrain, ytrain)
@names model
@head transf(model, Xbltrain)
@head transf(model, Xbltest)
res = predict(model, Xbltest) ;
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xbltest; nlv = 1:2).pred
Jchemo.mbplsqda
— Methodmbplsqda(; kwargs...)
mbplsqda(Xbl, y; kwargs...)
mbplsqda(Xbl, y, weights::Weight; kwargs...)
Multiblock PLS-QDA.
Xbl
: List of blocks (vector of matrices) of X-data Typically, output of functionmblock
from (n, p) data.y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. See functionblockscal
for possible values.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).scal
: Boolean. Iftrue
, each column of blocks inXbl
and Ydummy is scaled by its uncorrected standard deviation (before the block scaling) in the MBPLS computation.
The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- A multivariate MBPLSR (MBPLSR2) is run on {X, Ydummy}, returning a score matrix T.
- A QDA (possibly with continuum) is done on {T, y}, returning estimates of the posterior probabilities (∊ [0, 1]) of class membership.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights) allows other choices to be implemented.
See function mbplslda
for examples.
Jchemo.mbplsr
— Methodmbplsr(; kwargs...)
mbplsr(Xbl, Y; kwargs...)
mbplsr(Xbl, Y, weights::Weight; kwargs...)
mbplsr!(Xbl::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Multiblock PLSR (MBPLSR).
Xbl : List of blocks (vector of matrices) of X-data. Typically, output of function mblock from (n, p) data.
Y : Y-data (n, q).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Keyword arguments:
nlv : Nb. global latent variables (LVs = scores) to compute.
bscal : Type of block scaling. See function blockscal for possible values.
scal : Boolean. If true, each column of blocks in Xbl and Y is scaled by its uncorrected standard deviation (before the block scaling).
This function runs a PLSR on {X, Y} where X is the horizontal concatenation of the blocks in Xbl. The function gives the same global LVs and predictions as function mbplswest, but is much faster.
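A minimal sketch of this equivalence, assuming no block scaling (bscal = :none) and default uniform weights; the comparison with a plain PLSR on the concatenated matrix is expected to hold up to numerical precision:
using Jchemo
n, p, q = 20, 12, 1
X = rand(n, p) ; Y = rand(n, q)
Xbl = mblock(X, [1:5, 6:9, 10:12])
nlv = 3
model1 = mbplsr(; nlv, bscal = :none)
fit!(model1, Xbl, Y)
model2 = plskern(; nlv)
fit!(model2, reduce(hcat, Xbl), Y)
## Both pipelines work on the same concatenated matrix, so predictions should agree
predict(model1, Xbl).pred ≈ predict(model2, reduce(hcat, Xbl)).pred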
Function summary returns:
explvarx : Proportion of the total X inertia (squared Frobenius norm) explained by the global LVs.
rvxbl2t : RV coefficients between each block and the global LVs.
rdxbl2t : Rd coefficients between each block (= Xbl[k]) and the global LVs.
corx2t : Correlation between the X-variables and the global LVs.
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
bscal = :frob
scal = false
#scal = true
model = mbplsr(; nlv, bscal, scal)
fit!(model, Xbltrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
@head transf(model, Xbltrain)
transf(model, Xbltest)
res = predict(model, Xbltest)
res.pred
rmsep(res.pred, ytest)
res = summary(model, Xbltrain) ;
@names res
res.explvarx
res.rvxbl2t
res.rdxbl2t
res.cortbl2t
res.corx2t
Jchemo.mbplsrda
— Methodmbplsrda(; kwargs...)
mbplsrda(Xbl, y; kwargs...)
mbplsrda(Xbl, y, weights::Weight; kwargs...)
Discrimination based on multiblock partial least squares regression (MBPLSR-DA).
Xbl
: List of blocks (vector of matrices) of X-data Typically, output of functionmblock
from (n, p) data.y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. See functionblockscal
for possible values.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column of blocks inXbl
and Ydummy is scaled by its uncorrected standard deviation (before the block scaling).
The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- Then, a multivariate MBPLSR (MBPLSR2) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights
) allows other choices to be implemented.
Examples
using Jchemo, JLD2, CairoMakie, JchemoData
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
tab(Y.typ)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
wlst = names(X)
wl = parse.(Float64, wlst)
#plotsp(X, wl; nsamp = 20).f
##
listbl = [1:350, 351:700]
Xbltrain = mblock(Xtrain, listbl)
Xbltest = mblock(Xtest, listbl)
nlv = 15
scal = false
#scal = true
bscal = :none
#bscal = :frob
model = mbplsrda(; nlv, bscal, scal)
fit!(model, Xbltrain, ytrain)
@names model
@head model.fitm.fitm.T
@head transf(model, Xbltrain)
@head transf(model, Xbltest)
res = predict(model, Xbltest) ;
@head res.pred
@show errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xbltest; nlv = 1:2).pred
Jchemo.mbplswest
— Methodmbplswest(; kwargs...)
mbplswest(Xbl, Y; kwargs...)
mbplswest(Xbl, Y, weights::Weight; kwargs...)
mbplswest!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)
Multiblock PLSR (MBPLSR) - Nipals algorithm.
Xbl
: List of blocks (vector of matrices) of X-data. Typically, output of function mblock
from (n, p) data.Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. global latent variables (LVs = scores) to compute.bscal
: Type of block scaling. See functionblockscal
for possible values.tol
: Tolerance value for convergence (Nipals).maxit
: Maximum number of iterations (Nipals).scal
: Boolean. Iftrue
, each column of blocks inXbl
andY
is scaled by its uncorrected standard deviation (before the block scaling).
This function implements the MBPLSR Nipals algorithm described in Westerhuis et al. 1998. The function gives the same global scores and predictions as function mbplsr
.
Function summary
returns:
explvarx
: Proportion of the total X inertia (squared Frobenius norm) explained by the global LVs.rvxbl2t
: RV coefficients between each block and the global LVs.rdxbl2t
: Rd coefficients between each block (= Xbl[k]) and the global LVs.cortbl2t
: Correlations between the block LVs (= Tbl[k]) and the global LVs.corx2t
: Correlation between the X-variables and the global LVs.
References
Westerhuis, J.A., Kourti, T., MacGregor, J.F., 1998. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321. https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:5<301::AID-CEM515>3.0.CO;2-S
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
bscal = :frob
scal = false
#scal = true
model = mbplswest(; nlv, bscal, scal)
fit!(model, Xbltrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
@head transf(model, Xbltrain)
transf(model, Xbltest)
res = predict(model, Xbltest)
res.pred
rmsep(res.pred, ytest)
res = summary(model, Xbltrain) ;
@names res
res.explvarx
res.rvxbl2t
res.rdxbl2t
res.cortbl2t
res.corx2t
Jchemo.meanv
— Methodmeanv(x)
meanv(x, weights::Weight)
Compute the mean of a vector.
x
: A vector (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Examples
using Jchemo
n = 100
x = rand(n)
w = mweight(rand(n))
meanv(x)
meanv(x, w)
Jchemo.merrp
— Methodmerrp(pred, y)
Compute the mean intra-class classification error rate.
pred
: Predictions.y
: Observed data (class membership).
ERRP (see function errp
) is computed for each class. Function merrp
returns the average of these intra-class ERRPs.
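As a rough illustration of this definition, a toy check computing the intra-class error rates by hand (illustrative data only):
using Jchemo, Statistics
y = ["a"; "a"; "a"; "b"; "b"]
pred = reshape(["a"; "b"; "a"; "b"; "a"], :, 1)
lev = mlev(y)
## Classification error rate within each class
err_cla = [mean(pred[y .== cla, 1] .!= y[y .== cla]) for cla in lev]
## Average of the intra-class error rates, expected to agree with merrp
mean(err_cla)
merrp(pred, y)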
Examples
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
model = plsrda(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
merrp(pred, ytest)
Jchemo.mlev
— Methodmlev(x)
Return the sorted levels of a vector or a dataset.
Examples
using Jchemo, DataFrames
x = rand(["a";"b";"c"], 20)
lev = mlev(x)
nlev = length(lev)
X = reshape(x, 5, 4)
mlev(X)
n = 20
df = DataFrame(g1 = rand(1:2, n), g2 = rand(["a"; "c"], n))
mlev(df)
Jchemo.mlr
— Methodmlr(; kwargs...)
mlr(X, Y; kwargs...)
mlr(X, Y, weights::Weight; kwargs...)
mlr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Compute a multiple linear regression model (MLR) by using the QR algorithm.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
noint
: Boolean. Define if the model is computed with an intercept or not.
Safe but can be a little slower than other methods.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 2:4]
y = dat.X[:, 1]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
model = mlr()
#model = mlrchol()
#model = mlrpinv()
#model = mlrpinvn()
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.B
fitm.int
coef(model)
res = predict(model, Xtest)
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
model = mlr(noint = true)
fit!(model, Xtrain, ytrain)
coef(model)
model = mlrvec()
fit!(model, Xtrain[:, 1], ytrain)
coef(model)
Jchemo.mlrchol
— Methodmlrchol()
mlrchol(X, Y)
mlrchol(X, Y, weights::Weight)
mlrchol!(X::Matrix, Y::Matrix, weights::Weight)
Compute a multiple linear regression model (MLR) using the Normal equations and a Cholesky factorization.
X
: X-data, with nb. columns >= 2 (required by function cholesky).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Compute only a model with intercept.
Faster but can be less numerically accurate (the computation works on the cross-product X'X, which squares the condition number).
See function mlr
for examples.
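A minimal sketch of the normal-equations idea on centered data (illustration only, not the exact internal code of mlrchol):
using LinearAlgebra
n, p, q = 50, 4, 2
X = randn(n, p)
Y = randn(n, q)
## Center X and Y (the intercept is handled separately)
xmeans = sum(X, dims = 1) / n
ymeans = sum(Y, dims = 1) / n
Xc = X .- xmeans
Yc = Y .- ymeans
## Normal equations X'X * B = X'Y solved through a Cholesky factorization
B = cholesky(Symmetric(Xc' * Xc)) \ (Xc' * Yc)
int = ymeans - xmeans * B    # intercept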
Jchemo.mlrda
— Methodmlrda(; kwargs...)
mlrda(X, y; kwargs...)
mlrda(X, y, weights::Weight)
Discrimination based on multiple linear regression (MLR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).
The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- Then, a multiple linear regression (MLR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights
) allows other choices to be implemented.
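A rough sketch of the steps described above (toy data; the dummy table is built by hand here, while the actual function relies on its internal machinery and on the weights discussed above):
using Jchemo
X = rand(20, 4)
y = rand(["a"; "b"; "c"], 20)
lev = mlev(y)
## Dummy (0/1) table of the class membership
Ydummy = Float64.(reduce(hcat, [y .== cla for cla in lev]))
## MLR of the dummy table on X
model = mlr()
fit!(model, X, Ydummy)
post = predict(model, X).pred            # unbounded estimates of class probabilities
## Final prediction = class with the highest estimate
pred = [lev[argmax(post[i, :])] for i in 1:size(post, 1)]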
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
model = mlrda()
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.mlrpinv
— Methodmlrpinv()
mlrpinv(X, Y; kwargs...)
mlrpinv(X, Y, weights::Weight; kwargs...)
mlrpinv!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Compute a multiple linear regression model (MLR) by using a pseudo-inverse.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
noint
: Boolean. Define if the model is computed with an intercept or not.
Safe but can be slower.
See function mlr
for examples.
Jchemo.mlrpinvn
— Methodmlrpinvn()
mlrpinvn(X, Y)
mlrpinvn(X, Y, weights::Weight)
mlrpinvn!(X::Matrix, Y::Matrix, weights::Weight)
Compute a multiple linear regression model (MLR) by using the Normal equations and a pseudo-inverse.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Safe and fast for p not too large.
Compute only a model with intercept.
See function mlr
for examples.
Jchemo.mlrvec
— Methodmlrvec(; kwargs...)
mlrvec(X, Y; kwargs...)
mlrvec(X, Y, weights::Weight; kwargs...)
mlrvec!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Compute a simple (univariate x) linear regression model.
x
: Univariate X-data (n).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
noint
: Boolean. Define if the model is computed with an intercept or not.
See function mlr
for examples.
Jchemo.mpar
— Functionmpar(; kwargs...)
Return a tuple with all the combinations of the parameter values defined in kwargs. Keyword arguments:
kwargs
: Vector(s) of the parameter(s) values.
Examples
using Jchemo
nlvdis = 25 ; metric = [:mah]
h = [1 ; 2 ; Inf] ; k = [500 ; 1000]
pars = mpar(nlvdis = nlvdis, metric = metric, h = h, k = k)
length(pars[1])
reduce(hcat, pars)
Jchemo.mse
— Methodmse(pred, Y; digits = 3)
Summary of model performance for regression.
pred
: Predictions.Y
: Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
mse(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
mse(pred, ytest)
Jchemo.msep
— Methodmsep(pred, Y)
Compute the mean of the squared prediction errors (MSEP).
pred
: Predictions.Y
: Observed data.
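For reference, MSEP is the column-wise mean of the squared residuals; a quick check on toy values:
using Jchemo
pred = rand(10, 2)
Y = rand(10, 2)
msep(pred, Y)
## Should agree with the column-wise mean of the squared errors
sum((pred .- Y).^2, dims = 1) / 10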
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
msep(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
msep(pred, ytest)
Jchemo.mweight
— Methodmweight(x::Vector)
Return an object of type Weight
containing vector w = x / sum(x)
(if a Weight object is built manually, w must sum to 1).
Examples
using Jchemo
x = rand(10)
w = mweight(x)
sum(w.w)
Jchemo.mweightcla
— Methodmweightcla(x::AbstractVector; prior::Union{Symbol, Vector} = :unif)
mweightcla(Q::DataType, x::Vector; prior::Union{Symbol, Vector} = :unif)
Compute observation weights for a categorical variable, given specified sub-total weights for the classes.
x
: A categorical variable (n) (class membership).Q
: A data type (e.g.Float32
).
Keyword arguments:
prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (the vector must be sorted in the same order asmlev(x)
).
Return an object of type Weight
(see function mweight
) containing a vector w
(n) that sums to 1.
Examples
using Jchemo
x = vcat(rand(["a" ; "c"], 900), repeat(["b"], 100))
tab(x)
weights = mweightcla(x)
#weights = mweightcla(x; prior = :prop)
#weights = mweightcla(x; prior = [.1, .7, .2])
res = aggstat(weights.w, x; algo = sum)
[res.lev res.X]
Jchemo.nco
— Methodnco(X)
Return the nb. columns of X
.
Jchemo.nipals
— Methodnipals(X; kwargs...)
nipals(X, UUt, VVt; kwargs...)
Nipals to compute the first score and loading vectors of a matrix.
X
: X-data (n, p).UUt
: Matrix (n, n) for Gram-Schmidt orthogonalization.VVt
: Matrix (p, p) for Gram-Schmidt orthogonalization.
Keyword arguments:
tol
: Tolerance value for stopping the iterations.maxit
: Maximum nb. of iterations.
The function finds:
- {u, v, sv} = argmin(||X - u * sv * v'||)
with the constraints:
- ||u|| = ||v|| = 1
using the alternating least squares algorithm to compute the SVD (Gabriel & Zamir 1979).
At the end, X ~ u * sv * v', where:
- u : left singular vector (u * sv = scores)
- v : right singular vector (loadings)
- sv : singular value.
When NIPALS is used on sequentially deflated matrices, vectors u and v can lose orthogonality due to the accumulation of rounding errors. Orthogonality can be rebuilt with the Gram-Schmidt method (arguments UUt
and VVt
).
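A bare-bones sketch of the alternating least squares iteration for the first singular triplet (principle only; the actual function adds convergence control and optional Gram-Schmidt re-orthogonalization):
using LinearAlgebra
X = rand(5, 3)
u = X[:, 1]                  # starting score vector
v = zeros(3)
for i = 1:100
    v .= X' * u
    v ./= norm(v)            # right singular vector (loadings)
    u .= X * v               # unnormed left singular vector (scores)
end
sv = norm(u)                 # singular value
u /= sv                      # left singular vector
## X ≈ u * sv * v'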
References
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Examples
using Jchemo, LinearAlgebra
X = rand(5, 3)
res = nipals(X)
res.niter
res.sv
svd(X).S[1]
res.v
svd(X).V[:, 1]
res.u
svd(X).U[:, 1]
Jchemo.nipalsmiss
— Methodnipalsmiss(X; kwargs...)
nipalsmiss(X, UUt, VVt; kwargs...)
Nipals to compute the first score and loading vectors of a matrix with missing data.
X
: X-data (n, p).UUt
: Matrix (n, n) for Gram-Schmidt orthogonalization.VVt
: Matrix (p, p) for Gram-Schmidt orthogonalization.
Keyword arguments:
tol
: Tolerance value for stopping the iterations.maxit
: Maximum nb. of iterations.
See function nipals
.
References
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Examples
using Jchemo
X = [1. 2 missing 4 ; 4 missing 6 7 ;
missing 5 6 13 ; missing 18 7 6 ;
12 missing 28 7]
res = nipalsmiss(X)
res.niter
res.sv
res.v
res.u
Jchemo.normv
— Methodnormv(x)
normv(x, weights::Weight)
Compute the norm of a vector.
x
: A vector (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
The weighted norm of vector x
is computed by:
- sqrt(x' * D * x), where D is the diagonal matrix of vector
weights.w
.
References
@gdkrmr, https://discourse.julialang.org/t/julian-way-to-write-this-code/119348/17
@Stevengj, https://discourse.julialang.org/t/interesting-post-about-simd-dot-product-and-cosine-similarity/123282.
Examples
using Jchemo
n = 1000
x = rand(n)
w = mweight(ones(n))
normv(x)
sqrt(n) * normv(x, w)
Jchemo.nro
— Methodnro(X)
Return the nb. rows of X
.
Jchemo.occod
— Methodoccod(; kwargs...)
occod(fitm, X; kwargs...)
One-class classification using PCA/PLS orthogonal distance (OD).
fitm
: The preliminary model (e.g. objectfitm
returned by functionpcasvd
) that was fitted on the training data assumed to represent the training class.X
: Training X-data (n, p), on which was fitted the modelfitm
.
Keyword arguments:
mcut
: Type of cutoff. Possible values are::mad
,:q
. See Thereafter.cri
: Whenmcut
=:mad
, a constant. See thereafter.risk
: Whenmcut
=:q
, a risk-I level. See thereafter.
In this method, the outlierness d
of an observation is the orthogonal distance (= 'X-residuals') of this observation, i.e. the Euclidean distance between the observation and its projection on the score plane defined by the fitted (e.g. PCA) model (e.g. Hubert et al. 2005, Vanden Branden & Hubert 2005 p. 66, Varmuza & Filzmoser 2009 p. 79).
See function occsd
for details on outputs.
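A rough sketch of the orthogonal distance for a fitted PCA model (toy data; it assumes that xfit reconstructs the X-approximation from the fitted model, as in the pcanipalsmiss example of this index):
using Jchemo
X = rand(30, 10)
model = pcasvd(nlv = 3)
fit!(model, X)
## X-residuals after projection on the score space
E = X - xfit(model.fitm)
## Orthogonal distance of each observation (row norm of the residuals)
od = sqrt.(vec(sum(E .^ 2, dims = 2)))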
References
M. Hubert, P.J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.
K. Vanden Branden, M. Hubert (2005). Robust classification in high dimension based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems, 79, 10-21.
K. Varmuza, P. Filzmoser (2009). Introduction to multivariate statistical analysis in chemometrics. CRC Press, Boca Raton.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
model = savgol(npoint = 21, deriv = 2, degree = 3)
fit!(model, X)
Xp = transf(model, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
model = pcasvd(nlv = 10)
fit!(model, zXtrain)
Ttrain = model.fitm.T
Ttest = transf(model, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary PCA fitted model
model0 = pcasvd(nlv = 10)
fit!(model0, zXtrain)
## Outlierness
model = occod()
#model = occod(mcut = :mad, cri = 4)
#model = occod(mcut = :q, risk = .01)
#model = occsdod()
fit!(model, model0.fitm, zXtrain)
@names model
@names model.fitm
@head d = model.fitm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300),
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(model, zXtest) ;
@names res
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = model.fitm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class",
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.occsd
— Methodoccsd(; kwargs...)
occsd(fitm; kwargs...)
One-class classification using PCA/PLS score distance (SD).
fitm
: The preliminary model (e.g. objectfitm
returned by functionpcasvd
) that was fitted on the training data assumed to represent the training class.
Keyword arguments:
mcut
: Type of cutoff. Possible values are::mad
,:q
. See Thereafter.cri
: Whenmcut
=:mad
, a constant. See thereafter.risk
: Whenmcut
=:q
, a risk-I level. See thereafter.
In this method, the outlierness d
of an observation is defined by its score distance (SD), i.e. the Mahalanobis distance between the projection of the observation on the score plane defined by the fitted (e.g. PCA) model and the center of that score plane.
If a new observation has d
higher than a given cutoff
, the observation is assumed to not belong to the training (= reference) class. The cutoff
is computed with non-parametric heuristics. Noting [d] the vector of outliernesses computed on the training class:
- If
mcut
=:mad
, thencutoff
= median([d]) +cri
* madv([d]). - If
mcut
=:q
, thencutoff
is estimated from the empirical cumulative density function computed on [d], for a given risk-I (risk
).
Alternative approximate cutoffs have been proposed in the literature (e.g.: Nomikos & MacGregor 1995, Hubert et al. 2005, Pomerantsev 2008). Typically, and whatever the approximation method used to compute the cutoff, it is recommended to tune this cutoff depending on the detection objectives.
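A small sketch of the two cutoff heuristics on a vector d of training outliernesses (toy values; madv is the MAD function listed in this index, and the quantile is one way to read the empirical distribution at a given risk-I level):
using Jchemo, Statistics
d = abs.(randn(100))                 # outliernesses [d] of the training class
cri = 3 ; risk = .025
## mcut = :mad
cutoff_mad = median(d) + cri * madv(d)
## mcut = :q
cutoff_q = quantile(d, 1 - risk)
dstand = d / cutoff_mad              # standardized distances (> 1 = extreme)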
Outputs
pval
: Estimate of p-value (see functionspval
) computed from the training distribution [d].dstand
: standardized distance defined asd
/cutoff
. A valuedstand
> 1 may be considered as extreme compared to the distribution of the training data.gh
is the WinISI "GH" (usually, GH > 3 is considered extreme).
Specific for function predict
:
pred
: class predictiondstand
<= 1 ==>in
: the observation is expected to belong to the training class,dstand
> 1 ==>out
: extreme value, possibly not belonging to the same class as the training.
References
M. Hubert, P.J. Rousseeuw, K. Vanden Branden (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64-79.
Nomikos, P., MacGregor, J.F., 1995. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 37, 41-59. https://doi.org/10.1080/00401706.1995.10485888
Pomerantsev, A.L., 2008. Acceptance areas for multivariate classification derived by projection methods. Journal of Chemometrics 22, 601-609. https://doi.org/10.1002/cem.1147
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
model = savgol(npoint = 21, deriv = 2, degree = 3)
fit!(model, X)
Xp = transf(model, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
model = pcasvd(nlv = 10)
fit!(model, zXtrain)
Ttrain = model.fitm.T
Ttest = transf(model, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary PCA fitted model
model0 = pcasvd(nlv = 30)
fit!(model0, zXtrain)
## Outlierness
model = occsd()
#model = occsd(mcut = :mad, cri = 4)
#model = occsd(mcut = :q, risk = .01)
fit!(model, model0.fitm)
@names model
@names model.fitm
@head d = model.fitm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), xlabel = "Obs. index",
ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(model, zXtest) ;
@names res
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = model.fitm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class", xlabel = "Obs. index",
ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.occsdod
— Methodoccsdod(; kwargs...)
occsdod(fitm, X; kwargs...)
One-class classification using a consensus between PCA/PLS score and orthogonal (SD and OD) distances.
fitm
: The preliminary model (e.g. objectfitm
returned by functionpcasvd
) that was fitted on the training data assumed to represent the training class.X
: Training X-data (n, p), on which was fitted the modelfitm
.
Keyword arguments:
mcut
: Type of cutoff. Possible values are::mad
,:q
. See Thereafter.cri
: Whenmcut
=:mad
, a constant. See thereafter.risk
: Whenmcut
=:q
, a risk-I level. See thereafter.
In this method, the outlierness d
of a given observation is a consensus between the score distance (SD) and the orthogonal distance (OD). The consensus is computed from the standardized distances by:
dstand = sqrt(dstand_sd * dstand_od).
See functions:
occsd
for details of the outputs,- and
occod
for examples.
Jchemo.occstah
— Methodoccstah(; kwargs...)
occstah(X; kwargs...)
One-class classification using the Stahel-Donoho outlierness.
X
: Training X-data (n, p).
Keyword arguments:
nlv
: Nb. dimensions on whichX
is projected.mcut
: Type of cutoff. Possible values are::mad
,:q
. See Thereafter.cri
: Whenmcut
=:mad
, a constant. See thereafter.risk
: Whenmcut
=:q
, a risk-I level. See thereafter.scal
: Boolean. Iftrue
, each column ofX
is scaled such as in functionoutstah
.
In this method, the outlierness d
of a given observation is the Stahel-Donoho outlierness (see ?outstah
).
See function occsd
for details on outputs.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/challenge2018.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
model = savgol(npoint = 21, deriv = 2, degree = 3)
fit!(model, X)
Xp = transf(model, X)
s = Bool.(Y.test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
## Below, the reference class is "EHH"
cla1 = "EHH" ; cla2 = "PEE" ; cod = "out" # here cla2 should be detected
#cla1 = "EHH" ; cla2 = "EHH" ; cod = "in" # here cla2 should not be detected
s1 = Ytrain.typ .== cla1
s2 = Ytest.typ .== cla2
zXtrain = Xtrain[s1, :]
zXtest = Xtest[s2, :]
ntrain = nro(zXtrain)
ntest = nro(zXtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
ytrain = repeat(["in"], ntrain)
ytest = repeat([cod], ntest)
## Group description
model = pcasvd(nlv = 10)
fit!(model, zXtrain)
Ttrain = model.fitm.T
Ttest = transf(model, zXtest)
T = vcat(Ttrain, Ttest)
group = vcat(repeat(["1"], ntrain), repeat(["2"], ntest))
i = 1
plotxy(T[:, i], T[:, i + 1], group; leg_title = "Class", xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
#### Occ
## Preliminary dimension reduction
## (Not required but often more efficient)
nlv = 50
model0 = pcasvd(; nlv)
fit!(model0, zXtrain)
Ttrain = model0.fitm.T
Ttest = transf(model0, zXtest)
## Outlierness
model = occstah(; nlv, scal = true)
fit!(model, Ttrain)
@names model
@names model.fitm
@head d = model.fitm.d
d = d.dstand
f, ax = plotxy(1:length(d), d; size = (500, 300), xlabel = "Obs. index",
ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
res = predict(model, Ttest) ;
@names res
@head res.d
@head res.pred
tab(res.pred)
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
d1 = model.fitm.d.dstand
d2 = res.d.dstand
d = vcat(d1, d2)
f, ax = plotxy(1:length(d), d, group; size = (500, 300), leg_title = "Class",
xlabel = "Obs. index", ylabel = "Standardized distance")
hlines!(ax, 1; linestyle = :dot)
f
Jchemo.out
— Methodout(x, y)
Return whether the elements of a vector are strictly outside a given range.
x
: Univariate data.y
: Univariate data from which the range (min, max) is computed.
Return a BitVector.
Examples
using Jchemo
x = [-200.; -100; -1; 0; 1; 200]
out(x, [-1; .2; 1])
out(x, (-1, 1))
Jchemo.outeucl
— Methodouteucl(X; kwargs...)
outeucl!(X::Matrix; kwargs...)
Compute an outlierness from Euclidean distances to center.
X
: X-data (n, p).
Keyword arguments:
scal
: Boolean. Iftrue
, each column ofX
is scaled by its MAD before computing the outlierness.
Outlyingness is calculated as the Euclidean distance between each observation (row of X
) and a robust estimate of the center of the data (in the present function, the spatial median). Such outlyingness was for instance used in the robust PLSR algorithm of Serneels et al. 2005 (PRM).
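A rough sketch of the idea (toy data; the coordinate-wise median is used here as a simple stand-in for the spatial median actually used by the function):
using Jchemo
X = randn(100, 10)
## Robust center of the data (stand-in for the spatial median)
center = colmed(X)
## Euclidean distance of each observation to this center
d = sqrt.(vec(sum((X .- center') .^ 2, dims = 2)))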
References
Serneels, S., Croux, C., Filzmoser, P., Van Espen, P.J., 2005. Partial robust M-regression. Chemometrics and Intelligent Laboratory Systems 79, 55-64. https://doi.org/10.1016/j.chemolab.2005.04.007
Examples
using Jchemo, CairoMakie
n = 300 ; p = 700 ; m = 80
ntot = n + m
X1 = randn(n, p)
X2 = randn(m, p) .+ rand(1:3, p)'
X = vcat(X1, X2)
nlv = 10
scal = false
#scal = true
res = outeucl(X; scal) ;
@names res
res.d # outlierness
plotxy(1:ntot, res.d).f
Jchemo.outstah
— Methodoutstah(X, V; kwargs...)
outstah!(X::Matrix, V::Matrix; kwargs...)
Compute the Stahel-Donoho outlierness.
X
: X-data (n, p).V
: A projection matrix (p, nlv) representing the directions of the projection pursuit.
Keyword arguments:
scal
: Boolean. Iftrue
, each column ofX
is scaled by its MAD before computing the outlierness.
See Maronna and Yohai 1995 for details on the outlierness measure.
A projection-pursuit approach is used: given a projection matrix V
(p, nlv) (in general built randomly), the observations (rows of X
) are projected on the nlv
directions and the Stahel-Donoho outlierness is computed for each observation from these projections.
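A bare-bones sketch of this computation (toy data; colmed and colmad are the column-wise median and MAD functions listed in this index; the built-in function may norm the directions and handle scaling differently):
using Jchemo
n = 100 ; p = 10 ; nlv = 50
X = randn(n, p)
V = randn(p, nlv)                # random projection directions
T = X * V                        # projections of the observations (n, nlv)
## Robust standardized absolute deviation, per direction
Z = abs.(T .- colmed(T)') ./ colmad(T)'
## Stahel-Donoho outlierness = max over the directions
d = vec(maximum(Z, dims = 2))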
References
Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341. https://doi.org/10.1080/01621459.1995.10476517
Examples
using Jchemo, CairoMakie
n = 300 ; p = 700 ; m = 80
ntot = n + m
X1 = randn(n, p)
X2 = randn(m, p) .+ rand(1:3, p)'
X = vcat(X1, X2)
nlv = 10
V = rand(0:1, p, nlv)
scal = false
#scal = true
res = outstah(X, V; scal) ;
@names res
res.d # outlierness
plotxy(1:ntot, res.d).f
Jchemo.parsemiss
— Methodparsemiss(Q, x::Vector{Union{String, Missing}})
Parsing a string vector allowing missing data.
Q
: Type that results from the parsing of the String values.x
: A string vector containingmissing
(of typeMissing
) observations.
See examples.
Examples
using Jchemo
x = ["1"; "3.2"; missing]
x_p = parsemiss(Float64, x)
Jchemo.pcaeigen
— Methodpcaeigen(; kwargs...)
pcaeigen(X; kwargs...)
pcaeigen(X, weights::Weight; kwargs...)
pcaeigen!(X::Matrix, weights::Weight; kwargs...)
PCA by Eigen factorization.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w
) and X the centered matrix in metric D. The function minimizes ||X - T * V'||^2 in metric D, by computing an Eigen factorization of X' * D * X.
See function pcasvd
for examples.
Jchemo.pcaeigenk
— Methodpcaeigenk(; kwargs...)
pcaeigenk(X; kwargs...)
pcaeigenk(X, weights::Weight; kwargs...)
pcaeigenk!(X::Matrix, weights::Weight; kwargs...)
PCA by Eigen factorization of the kernel matrix XX'.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
This is the "kernel cross-product" version of the PCA algorithm (e.g. Wu et al. 1997). For wide matrices (n << p, where p is the nb. columns) and n not too large, this algorithm can be much faster than the others.
Let us note D the (n, n) diagonal matrix of weights (weights.w
) and X the centered matrix in metric D. The function minimizes ||X - T * V'||^2 in metric D, by computing an Eigen factorization of D^(1/2) * X * X' * D^(1/2).
See function pcasvd
for examples.
References
Wu, W., Massart, D.L., de Jong, S., 1997. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems 36, 165-172. https://doi.org/10.1016/S0169-7439(97)00010-5
Jchemo.pcanipals
— Methodpcanipals(; kwargs...)
pcanipals(X; kwargs...)
pcanipals(X, weights::Weight; kwargs...)
pcanipals!(X::Matrix, weights::Weight; kwargs...)
PCA by NIPALS algorithm.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).gs
: Boolean. Iftrue
(default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.tol
: Tolerance value for stopping the iterations.maxit
: Maximum nb. of iterations.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w
) and X the centered matrix in metric D. The function minimizes ||X - T * V'||^2 in metric D by NIPALS.
See function pcasvd
for examples.
References
Andrecut, M., 2009. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology 16, 1593-1599. https://doi.org/10.1089/cmb.2008.0221
K.R. Gabriel, S. Zamir, Lower rank approximation of matrices by least squares with any choice of weights, Technometrics 21 (1979) 489–498.
Gabriel, R. K., 2002. Le biplot - Outil d'exploration de données multidimensionnelles. Journal de la Société Française de la Statistique, 143, 5-55.
Lingen, F.J., 2000. Efficient Gram-Schmidt orthonormalisation on parallel computers. Communications in Numerical Methods in Engineering 16, 57-66. https://doi.org/10.1002/(SICI)1099-0887(200001)16:1<57::AID-CNM320>3.0.CO;2-I
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Jchemo.pcanipalsmiss
— Methodpcanipalsmiss(; kwargs...)
pcanipalsmiss(X; kwargs...)
pcanipalsmiss(X, weights::Weight; kwargs...)
pcanipalsmiss!(X::Matrix, weights::Weight; kwargs...)
PCA by NIPALS algorithm allowing missing data.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).gs
: Boolean. Iftrue
(default), a Gram-Schmidt orthogonalization of the scores and loadings is done before each X-deflation.tol
: Tolerance value for stopping the iterations.maxit
: Maximum nb. of iterations.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
References
Wright, K., 2018. Package nipals: Principal Components Analysis using NIPALS with Gram-Schmidt Orthogonalization. https://cran.r-project.org/
Examples
using Jchemo
X = [1 2. missing 4 ; 4 missing 6 7 ;
missing 5 6 13 ; missing 18 7 6 ;
12 missing 28 7]
nlv = 3
tol = 1e-15
scal = false
#scal = true
gs = false
#gs = true
model = pcanipalsmiss(; nlv, tol, gs, maxit = 500, scal)
fit!(model, X)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.niter
fitm.sv
fitm.V
fitm.T
## Orthogonality
## only if gs = true
fitm.T' * fitm.T
fitm.V' * fitm.V
## Impute missing data in X
model = pcanipalsmiss(; nlv = 2, gs = true) ;
fit!(model, X)
Xfit = xfit(model.fitm)
s = ismissing.(X)
X_imp = copy(X)
X_imp[s] .= Xfit[s]
X_imp
Jchemo.pcaout
— Methodpcaout(; kwargs...)
pcaout(X; kwargs...)
pcaout(X, weights::Weight; kwargs...)
pcaout!(X::Matrix, weights::Weight; kwargs...)
Robust PCA using outlierness.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).prm
: Proportion of the data removed (hard rejection of outliers) for each outlierness measure.scal
: Boolean. Iftrue
, each column ofX
is scaled by its MAD when computing the outlierness and by its uncorrected standard deviation when computing weighted PCA.
Robust PCA combining outlyingness measures and weighted PCA (WPCA).
The objective is to remove the effect of multivariate X
-outliers that have potentially bad leverages. Observations (X
-rows) receive weights depending on two outlyingness indicators:
- The Stahel-Donoho outlyingness (Maronna and Yohai, 1995) is computed (function
outstah
) onX
. The proportionprm
of the observations with the highest outlyingness values receive a weight w1 = 0 (the other receive a weight w1 = 1). - An outlyingness based on the Euclidean distance to center (function
outeucl
) is computed. The proportionprm
of the observations with the highest outlyingness values receive a weight w2 = 0 (the other receive a weight w2 = 1).
The final weights of the observations are computed as weights.w * w1 * w2, and are then used in a weighted PCA.
By default, the function uses prm = .3
(as in the ROBPCA algorithm of Hubert et al. 2005, 2009).
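A rough sketch of the hard-rejection weighting for a single outlierness measure (toy data; the actual function combines two measures and multiplies by the input weights):
using Jchemo
n = 100
X = randn(n, 5)
prm = .3
d = vec(outeucl(X).d)                        # outlierness of each observation
## Weight 0 for the prm proportion with the highest outlierness, 1 otherwise
w1 = ones(n)
w1[sortperm(d, rev = true)[1:round(Int, prm * n)]] .= 0
## Weighted PCA with these weights (low-level call documented above)
fitm = pcasvd(X, mweight(w1); nlv = 2)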
References
Hubert, M., Rousseeuw, P.J., Vanden Branden, K., 2005. ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics 47, 64-79. https://doi.org/10.1198/004017004000000563
Hubert, M., Rousseeuw, P.J., Verdonck, T., 2009. Robust PCA for skewed data and its outlier map. Computational Statistics & Data Analysis 53, 2264-2274. https://doi.org/10.1016/j.csda.2008.05.027
Maronna, R.A., Yohai, V.J., 1995. The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341. https://doi.org/10.1080/01621459.1995.10476517
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2")
@load db dat
@names dat
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)
nlv = 3
model = pcaout(; nlv)
#model = pcasvd(; nlv)
fit!(model, X)
@names model
@names model.fitm
@head T = model.fitm.T
## Same as:
transf(model, X)
i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
Jchemo.pcapp
— Methodpcapp(; kwargs...)
pcapp(X; kwargs...)
pcapp!(X::Matrix; kwargs...)
Robust PCA by projection pursuit.
X
: X-data (n, p).
Keyword arguments:
nlv
: Nb. of principal components (PCs).nsim
: Nb. of additional (to X-rows) simulated directions for the projection pursuit.scal
: Boolean. Iftrue
, each column ofX
is scaled by its MAD.
For nsim = 0
, this is the Croux & Ruiz-Gazen (C-R, 2005) PCA algorithm that uses a projection pursuit (PP) method. Data X
are robustly centered by the spatial median, and the observations are projected to the "PP" directions defined by the observations (rows of X
) after they are normed. The first PCA loading vector is the direction (within the PP directions) that maximizes a given 'projection index', here the median absolute deviation (MAD). Then, X
is deflated to this loading vector, and the process is re-run to define the next loading vector. And so on.
A possible extension of this algorithm is to randomly simulate additional candidate PP directions, besides the n row observations. If nsim > 0
, the function simulates nsim
additional PP directions to the n initial ones, as proposed in Hubert et al. (2005): random couples of observations are sampled in X
and, for each couple, the direction passes through the two observations of the couple (see function simpphub
).
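A bare-bones sketch of one projection-pursuit step of the C-R algorithm (first loading only, toy data; the column medians stand in for the spatial median, and the real function also deflates X and can add simulated directions):
using Jchemo
n, p = 50, 20
X = randn(n, p)
## Robust centering (column medians as a stand-in for the spatial median)
Xc = X .- colmed(X)'
## Candidate PP directions = the normed centered observations
P = Xc ./ sqrt.(sum(Xc .^ 2, dims = 2))
## Projection index = MAD of the projections, for each candidate direction
ind = [madv(Xc * P[i, :]) for i = 1:n]
v1 = P[argmax(ind), :]          # first robust loading vector
t1 = Xc * v1                    # first robust score vector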
References
Croux, C., Ruiz-Gazen, A., 2005. High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis 95, 206–226. https://doi.org/10.1016/j.jmva.2004.08.002
Hubert, M., Rousseeuw, P.J., Vanden Branden, K., 2005. ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics 47, 64-79. https://doi.org/10.1198/004017004000000563
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2")
@load db dat
@names dat
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)
nlv = 3
model = pcapp(; nlv, nsim = 2000)
#model = pcasvd(; nlv)
fit!(model, X)
@names model
@names model.fitm
@head T = model.fitm.T
## Same as:
@head transf(model, X)
i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
Jchemo.pcasph
— Methodpcasph(; kwargs...)
pcasph(X; kwargs...)
pcasph(X, weights::Weight; kwargs...)
pcasph!(X::Matrix, weights::Weight; kwargs...)
Spherical PCA.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Spherical PCA (Locantore et al. 1999, Maronna 2005, Daszykowski et al. 2007). Matrix X
is centered by the spatial median computed by function Jchemo.colmedspa
.
References
Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B., 2007. Robust statistics in data analysis - A review. Chemometrics and Intelligent Laboratory Systems 85, 203-219. https://doi.org/10.1016/j.chemolab.2006.06.016
Locantore N., Marron J.S., Simpson D.G., Tripoli N., Zhang J.T., Cohen K.L. Robust principal component analysis for functional data, Test 8 (1999) 1–7
Maronna, R., 2005. Principal components and orthogonal regression based on robust scales, Technometrics, 47:3, 264-273, DOI: 10.1198/004017005000000166
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "octane.jld2")
@load db dat
@names dat
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
n = nro(X)
nlv = 3
model = pcasph(; nlv)
#model = pcasvd(; nlv)
fit!(model, X)
@names model
@names model.fitm
@head T = model.fitm.T
## Same as:
transf(model, X)
i = 1
plotxy(T[:, i], T[:, i + 1]; zeros = true, xlabel = string("PC", i),
ylabel = string("PC", i + 1)).f
Jchemo.pcasvd
— Methodpcasvd(; kwargs...)
pcasvd(X; kwargs...)
pcasvd(X, weights::Weight; kwargs...)
pcasvd!(X::Matrix, weights::Weight; kwargs...)
PCA by SVD factorization.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of principal components (PCs).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Let us note D the (n, n) diagonal matrix of weights (weights.w
) and X the centered matrix in metric D. The function minimizes ||X - T * V'||^2 in metric D, by computing an SVD factorization of sqrt(D) * X:
- sqrt(D) * X ~ U * S * V'
Outputs are:
- T = D^(-1/2) * U * S (the scores)
- V (the loadings)
- sv = the diagonal of S (the singular values).
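A minimal sketch of this factorization on centered data (toy data with uniform weights; illustration of the formulas only, not the internal code):
using Jchemo, LinearAlgebra
n, p = 20, 5
X = rand(n, p)
w = mweight(ones(n)).w                # weights summing to 1
## Centering in metric D
xmeans = vec(w' * X)
Xc = X .- xmeans'
## SVD of sqrt(D) * X
res = svd(sqrt.(w) .* Xc)
T = res.U .* res.S' ./ sqrt.(w)       # scores = D^(-1/2) * U * S
V = res.V                             # loadings
sv = res.S                            # singular values (diagonal of S)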
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
@head Xtrain = X[s.train, :]
@head Xtest = X[s.test, :]
nlv = 3
model = pcasvd(; nlv)
#model = pcaeigen(; nlv)
#model = pcaeigenk(; nlv)
#model = pcanipals(; nlv)
fit!(model, Xtrain)
@names model
@names model.fitm
@head T = model.fitm.T
## Same as:
@head transf(model, X)
T' * T
@head V = model.fitm.V
V' * V
@head Ttest = transf(model, Xtest)
res = summary(model, Xtrain) ;
@names res
res.explvarx
res.contr_var
res.coord_var
res.cor_circle
Jchemo.pcr
— Methodpcr(; kwargs...)
pcr(X, Y; kwargs...)
pcr(X, Y, weights::Weight; kwargs...)
pcr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Principal component regression (PCR) with a SVD factorization.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
- Same as function
pcasvd
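PCR can be viewed as a PCA of X followed by an MLR of Y on the PCA scores; a rough sketch of this two-step view (toy usage of functions documented in this index, not the internal code):
using Jchemo
X = rand(30, 10)
y = rand(30)
nlv = 3
## Step 1: PCA scores
model_pca = pcasvd(; nlv)
fit!(model_pca, X)
T = model_pca.fitm.T
## Step 2: MLR of y on the scores
model_mlr = mlr()
fit!(model_mlr, T, y)
## For comparison, the direct PCR model
model = pcr(; nlv)
fit!(model, X, y)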
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
model = pcr(; nlv) ;
fit!(model, Xtrain, ytrain)
@names model
fitm = model.fitm ;
@names fitm
@names fitm.fitm
@head fitm.fitm.T
@head transf(model, X)
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = predict(model, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]
res = summary(model, Xtrain) ;
@names res
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs", ylabel = "Prop. Explained X-Variance").f
Jchemo.pip
— Methodpip(args...)
Build a pipeline of models.
args...
: Succesive models, see examples.
Examples
using Jchemo, JLD2, CairoMakie, JchemoData
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Pipeline Snv :> Savgol :> Pls :> Svmr
model1 = snv()
model2 = savgol(npoint = 11, deriv = 2, degree = 3)
model3 = plskern(nlv = 15)
model4 = svmr(gamma = 1e3, cost = 1000, epsilon = .1)
model = pip(model1, model2, model3, model4)
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest) ;
@head res.pred
rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.plotconf
— Methodplotconf(object; size = (500, 400), cnt = true, ptext = true,
fontsize = 15, coldiag = :red)
Plot a confusion matrix.
object
: Output of functionconf
.
Keyword arguments:
size
: Size (horizontal, vertical) of the figure.cnt
: Boolean. Iftrue
, plot the occurrences, else plot the row %s.ptext
: Boolean. Iftrue
, display the value in each cell.fontsize
: Font size whenptext = true
.coldiag
: Font color whenptext = true
.
See examples in help page of function conf
.
Jchemo.plotgrid
— Methodplotgrid(indx::AbstractVector, r; size = (500, 300), step = 5,
color = nothing, kwargs...)
plotgrid(indx::AbstractVector, r, group; size = (700, 350),
step = 5, color = nothing, leg = true, leg_title = "Group", kwargs...)
Plot error/performance rates of a model.
indx
: A numeric variable representing the grid of model parameters, e.g. the nb. of LVs for PLSR models.r
: The error/performance rate.
Keyword arguments:
group
: Categorical variable defining groups. A separate line is plotted for each level ofgroup
.size
: Size (horizontal, vertical) of the figure.step
: Step used for defining the xticks.color
: Set color. Ifgroup
is used, must be a vector of same length as the number of levels ingroup
.leg
: Boolean. Ifgroup
is used, display a legend or not.leg_title
: Title of the legend.kwargs
: Optional arguments to pass inAxis
of CairoMakie.
To use plotgrid
, a backend (e.g. CairoMakie) has to be specified.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
model = plskern()
nlv = 0:20
res = gridscore(model, Xtrain, ytrain,
Xtest, ytest; score = rmsep, nlv)
plotgrid(res.nlv, res.y1; xlabel = "Nb. LVs", ylabel = "RMSEP").f
model = lwplsr()
nlvdis = 15 ; metric = [:mah]
h = [1 ; 2.5 ; 5] ; k = [50 ; 100]
pars = mpar(nlvdis = nlvdis, metric = metric,
h = h, k = k)
nlv = 0:20
res = gridscore(model, Xtrain, ytrain, Xtest, ytest; score = rmsep,
pars, nlv)
group = string.("h=", res.h, " k=", res.k)
plotgrid(res.nlv, res.y1, group; xlabel = "Nb. LVs", ylabel = "RMSECV").f
Jchemo.plotsp
— Functionplotsp(X, wl = 1:nco(X); size = (500, 300), color = nothing, nsamp = nothing,
kwargs...)
Plotting spectra.
X
: X-data (n, p).wl
: Column names ofX
. Must be numeric.
Keyword arguments:
size
: Size (horizontal, vertical) of the figure.color
: Set a unique color (and eventually transparency) to the spectra.nsamp
: Nb. spectra (X-rows) to plot. Ifnothing
, all spectra are plotted.kwargs
: Optional arguments to pass inAxis
of CairoMakie.
The function plots the rows of X
.
To use plotxy
, a backend (e.g. CairoMakie) has to be specified.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
wlst = names(X)
wl = parse.(Float64, wlst)
plotsp(X).f
plotsp(X; color = (:red, .2)).f
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
tck = collect(wl[1]:200:wl[end]) ;
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance", xticks = tck).f
f, ax = plotsp(X, wl; color = (:red, .2))
xmeans = colmean(X)
lines!(ax, wl, xmeans; color = :black, linewidth = 2)
vlines!(ax, 1200)
f
Jchemo.plotxy
— Methodplotxy(x, y; size = (500, 300), color = nothing, ellipse::Bool = false,
prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
xlabel = "", ylabel = "", title = "", kwargs...)
plotxy(x, y, group; size = (600, 350), color = nothing, ellipse::Bool = false,
prob = .95, circle::Bool = false, bisect::Bool = false, zeros::Bool = false,
xlabel = "", ylabel = "", title = "", leg::Bool = true, leg_title = "Group",
kwargs...)
Scatter plot of (x, y) data
x
: A x-vector (n).y
: A y-vector (n).group
: Categorical variable defining groups (n).
Keyword arguments:
size
: Size (horizontal, vertical) of the figure.color
: Set color(s). Ifgroup
is used,color
must be a vector of same length as the number of levels ingroup
.ellipse
: Boolean. Draw an ellipse of confidence, assuming a Chi-square distribution with df = 2. Ifgroup
is used, one ellipse is drawn per group.prob
: Probability for the ellipse of confidence.bisect
: Boolean. Draw a bisector.zeros
: Boolean. Draw horizontal and vertical axes passing through origin (0, 0).xlabel
: Label for the x-axis.ylabel
: Label for the y-axis.title
: Title of the graphic.leg
: Boolean. Ifgroup
is used, display a legend or not.leg_title
: Title of the legend.kwargs
: Optional arguments to pass in functionscatter
of Makie.
To use plotxy
, a backend (e.g. CairoMakie) has to be specified.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
lev = mlev(year)
nlev = length(lev)
model = pcasvd(nlv = 5)
fit!(model, X)
@head T = model.fitm.T
plotxy(T[:, 1], T[:, 2]; color = (:red, .5)).f
plotxy(T[:, 1], T[:, 2], year; ellipse = true, xlabel = "PC1", ylabel = "PC2").f
i = 2
colm = cgrad(:Dark2_5, nlev; categorical = true)
plotxy(T[:, i], T[:, i + 1], year; color = colm, xlabel = string("PC", i),
ylabel = string("PC", i + 1), zeros = true, ellipse = true).f
plotxy(T[:, 1], T[:, 2], year).lev
plotxy(1:5, 1:5).f
y = reshape(rand(5), 5, 1)
plotxy(1:5, y).f
## Several layers can be added
## (same syntax as in Makie)
A = rand(50, 2)
f, ax = plotxy(A[:, 1], A[:, 2]; xlabel = "x1", ylabel = "x2")
ylims!(ax, -1, 2)
hlines!(ax, 0.5; color = :red, linestyle = :dot)
f
Jchemo.plscan
— Methodplscan(; kwargs...)
plscan(X, Y; kwargs...)
plscan(X, Y, weights::Weight; kwargs...)
plscan!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Canonical partial least squares regression (Canonical PLS).
X
: First block of data.Y
: Second block of data.weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. Possible values are::none
,:frob
. See functionsblockscal
.scal
: Boolean. Iftrue
, each column of blocksX
andY
is scaled by its uncorrected standard deviation (before the block scaling).
Canonical PLS with the Nipals algorithm (Wold 1984, Tenenhaus 1998 chap.11), referred to as PLS-W2A (i.e. Wold PLS mode A) in Wegelin 2000. The two blocks X
and Y
play a symmetric role. After each step of scores computation, X and Y are deflated by the x- and y-scores, respectively.
Function summary
returns:
cortx2ty
: Correlations between the X- and Y-LVs.
and for block X
:
explvarx
: Proportion of the block inertia (squared Frobenius norm) explained by the block LVs (Tx
).rvx2tx
: RV coefficients between the block and the block LVs.rdx2tx
: Rd coefficients between the block and the block LVs.corx2tx
: Correlation between the block variables and the block LVs.
The same is returned for block Y
.
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob
model = plscan(; nlv, bscal)
fit!(model, X, Y)
@names model
@names model.fitm
fitm = model.fitm
@head fitm.Tx
@head transfbl(model, X, Y).Tx
@head fitm.Ty
@head transfbl(model, X, Y).Ty
res = summary(model, X, Y) ;
@names res
res.explvarx
res.explvary
res.cortx2ty
res.rvx2tx
res.rvy2ty
res.rdx2tx
res.rdy2ty
res.corx2tx
res.cory2ty
Jchemo.plskdeda
— Methodplskdeda(; kwargs...)
plskdeda(X, y; kwargs...)
plskdeda(X, y, weights::Weight; kwargs...)
KDE-DA on PLS latent variables (PLS-KDEDA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.- Keyword arguments of function
dmkern
(bandwidth definition) can also be specified here.
The principle is the same as function plsqda
except that the densities by class are estimated from dmkern
instead of dmnorm
.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
model = plskdeda(; nlv)
#model = plskdeda(; nlv, a = .5)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
embfitm = fitm.fitm.embfitm ;
@head embfitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(embfitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
summary(embfitm, Xtrain)
Jchemo.plskern
— Methodplskern(; kwargs...)
plskern(X, Y; kwargs...)
plskern(X, Y, weights::Weight; kwargs...)
plskern!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Partial least squares regression (PLSR) with the "improved kernel algorithm #1" (Dayal & MacGregor, 1997).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
About the row-weighting in PLS algorithms (weights
): See in particular Schaal et al. 2002, Sicard & Sabatier 2006, Kim et al. 2011, and Lesnoff et al. 2020.
References
Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.
Kim, S., Kano, M., Nakagawa, H., Hasebe, S., 2011. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int. J. Pharm., 421, 269-274.
Lesnoff, M., Metz, M., Roger, J.M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR Data. Journal of Chemometrics. e3209. https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.3209
Schaal, S., Atkeson, C., Vijayakumar, S., 2002. Scalable techniques from nonparametric statistics for the real time robot learning. Applied Intell., 17, 49-60.
Sicard, E. Sabatier, R., 2006. Theoretical framework for local PLS1 regression and application to a rainfall dataset. Comput. Stat. Data Anal., 51, 1393-1410.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
model = plskern(; nlv) ;
#model = plsnipals(; nlv) ;
#model = plswold(; nlv) ;
#model = plsrosa(; nlv) ;
#model = plssimp(; nlv) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = predict(model, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]
res = summary(model, Xtrain) ;
@names res
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.plslda
— Methodplslda(; kwargs...)
plslda(X, y; kwargs...)
plslda(X, y, weights::Weight; kwargs...)
LDA on PLS latent variables (PLS-LDA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
LDA on PLS latent variables. The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- A multivariate PLSR (PLSR2) is run on {X, Ydummy}, returning a score matrix T.
- A LDA is done on {T, y}, returning estimates of posterior probabilities (∊ [0, 1]) of class membership.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights
) allows to implement other choices.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
model = plslda(; nlv)
#model = plslda(; nlv, prior = :prop)
#model = plsqda(; nlv, alpha = .1)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
embfitm = fitm.fitm.embfitm ;
@head embfitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(embfitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
summary(embfitm, Xtrain)
Jchemo.plsnipals
— Methodplsnipals(; kwargs...)
plsnipals(X, Y; kwargs...)
plsnipals(X, Y, weights::Weight; kwargs...)
plsnipals!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the Nipals algorithm.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
In this function, for PLS2 (multivariate Y), the Nipals iterations are replaced by a direct computation of the PLS weights (w) by SVD decomposition of matrix X'Y (Hoskuldsson 1988 p.213).
See function plskern
for examples.
References
Hoskuldsson, A., 1988. PLS regression methods. Journal of Chemometrics 2, 211-228. https://doi.org/10.1002/cem.1180020306
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool for chemometrics. Chem. Int. Lab. Syst., 58, 109-130.
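A minimal sketch with simulated data (illustrative only; it assumes the same fit!/predict workflow documented for plskern):
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
Xtest = rand(4, 5)
Ytest = rand(4, 2)
model = plsnipals(; nlv = 2)   # Nipals-based PLSR with 2 LVs
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, Ytest)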
Jchemo.plsqda
— Methodplsqda(; kwargs...)
plsqda(X, y; kwargs...)
plsqda(X, y, weights::Weight; kwargs...)
QDA on PLS latent variables (PLS-QDA) with continuum.
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
QDA on PLS latent variables. The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- A multivariate PLSR (PLSR2) is run on {X, Ydummy}, returning a score matrix T.
- A QDA (possibly with continuum) is done on {T, y}, returning estimates of posterior probabilities (∊ [0, 1]) of class membership.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights
) allows to implement other choices.
See function plslda
for examples.
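A minimal sketch with simulated class data (illustrative only; it assumes the same workflow as plslda):
using Jchemo
Xtrain = rand(30, 5)
ytrain = repeat(["a" ; "b" ; "c"], 10)   # 10 observations per class
Xtest = rand(6, 5)
ytest = repeat(["a" ; "b" ; "c"], 2)
model = plsqda(; nlv = 2, alpha = .5)    # continuum between QDA and LDA
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest)
errp(res.pred, ytest)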
Jchemo.plsravg
— Methodplsravg(; kwargs...)
plsravg(X, Y; kwargs...)
plsravg(X, Y, weights::Weight; kwargs...)
plsravg!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Averaging PLSR models with different numbers of latent variables (PLSR-AVG).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: A range of nb. of latent variables (LVs) to compute.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Ensemble method where the predictions are computed by averaging the predictions of a set of models built with different numbers of LVs.
For instance, if argument nlv
is set to nlv
= 5:10
, the prediction for a new observation is the simple average of the predictions returned by the models with 5 LVs, 6 LVs, ... 10 LVs, respectively.
References
Lesnoff, M., Andueza, D., Barotin, C., Barre, V., Bonnal, L., Fernández Pierna, J.A., Picard, F., Vermeulen, V., Roger, J.-M., 2022. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Applied Sciences 12, 7850. https://doi.org/10.3390/app12157850
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
@head Y
y = Y.ndf
#y = Y.dm
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(y, s)
Xtest = X[s, :]
ytest = y[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
nlv = 0:30
#nlv = 5:20
#nlv = 25
model = plsravg(; nlv) ;
fit!(model, Xtrain, ytrain)
res = predict(model, Xtest)
@head res.pred
res.predlv # predictions for each nb. of LVs
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.plsrda
— Methodplsrda(; kwargs...)
plsrda(X, y; kwargs...)
plsrda(X, y, weights::Weight; kwargs...)
Discrimination based on partial least squares regression (PLSR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation.
This is the usual "PLSDA". The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- Then, a multivariate PLSR (PLSR2) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest.
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
):
- the sub-totals by class of the observation weights are set equal to the prior probabilities.
The low-level version (argument weights
) allows to implement other choices.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
model = plsrda(; nlv)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
@names fitm.fitm
aggsum(fitm.fitm.weights.w, ytrain)
@head fitm.fitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
summary(fitm.fitm, Xtrain)
Jchemo.plsrosa
— Methodplsrosa(; kwargs...)
plsrosa(X, Y; kwargs...)
plsrosa(X, Y, weights::Weight; kwargs...)
plsrosa!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the ROSA algorithm (Liland et al. 2016).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Note: The function has the following differences with the original algorithm of Liland et al. (2016):
- Scores T (LVs) are not normed.
- Multivariate Y is allowed.
See function plskern
for examples.
References
Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA—a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824
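A minimal sketch with simulated data (illustrative only; it assumes the same fit!/predict workflow as plskern):
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(10)
Xtest = rand(4, 5)
ytest = rand(4)
model = plsrosa(; nlv = 2)   # ROSA algorithm
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, ytest)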
Jchemo.plsrout
— Methodplsrout(; kwargs...)
plsrout(X, Y; kwargs...)
plsrout(X, Y, weights::Weight; kwargs...)
plsrout!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Robust PLSR using outlierness.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. of latent variables (LVs).prm
: Proportion of the data removed (hard rejection of outliers) for each outlierness measure.scal
: Boolean. Iftrue
, each column ofX
is scaled by its MAD when computing the outlierness and by its uncorrected standard deviation when computing the weighted PLSR.
Robust PLSR combining outlyingness measures and weighted PLSR (WPLSR). This is the same principle as function pcaout
(see the help page) but the final step is a weighted PLSR instead of a weighted PCA.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
model = plsrout(; nlv)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = predict(model, Xtest; nlv = 1:2)
@head res.pred[1]
@head res.pred[2]
Jchemo.plssimp
— Methodplssimp(; kwargs...)
plssimp(X, Y; kwargs...)
plssimp(X, Y, weights::Weight; kwargs...)
plssimp!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the SIMPLS algorithm (de Jong 1993).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Note: In this function, scores T (LVs) are not normed, contrary to the original algorithm of de Jong (1993).
See function plskern
for examples.
References
de Jong, S., 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263. https://doi.org/10.1016/0169-7439(93)85002-X
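A minimal sketch with simulated data (illustrative only; it assumes the same fit!/predict workflow as plskern):
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(10)
Xtest = rand(4, 5)
ytest = rand(4)
model = plssimp(; nlv = 2)   # SIMPLS algorithm
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, ytest)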
Jchemo.plstuck
— Methodplstuck(; kwargs...)
plstuck(X, Y; kwargs...)
plstuck(X, Y, weights::Weight; kwargs...)
plstuck!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Tucker's inter-battery method of factor analysis
X
: First block of data.Y
: Second block of data.weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. Possible values are::none
,:frob
. See functionsblockscal
.scal
: Boolean. Iftrue
, each column of blocksX
andY
is scaled by its uncorrected standard deviation (before the block scaling).
Inter-battery method of factor analysis (Tucker 1958, Tenenhaus 1998 chap.3). The two blocks X
and Y
play a symmetric role. This method is referred to as PLS-SVD in Wegelin 2000. The method factorizes the covariance matrix X'Y by SVD.
See function plscan
for the details on the summary
outputs.
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Tishler, A., Lipovetsky, S., 2000. Modelling and forecasting with robust canonical analysis: method and application. Computers & Operations Research 27, 217–232. https://doi.org/10.1016/S0305-0548(99)00014-3
Tucker, L.R., 1958. An inter-battery method of factor analysis. Psychometrika 23, 111–136. https://doi.org/10.1007/BF02289009
Wegelin, J.A., 2000. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case (No. 371). University of Washington, Seattle, Washington, USA.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/linnerud.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
model = plstuck(nlv = 3)
fit!(model, X, Y)
@names model
@names model.fitm
fitm = model.fitm
@head fitm.Tx
@head transfbl(model, X, Y).Tx
@head fitm.Ty
@head transfbl(model, X, Y).Ty
res = summary(model, X, Y) ;
@names res
res.explvarx
res.explvary
res.cortx2ty
res.rvx2tx
res.rvy2ty
res.rdx2tx
res.rdy2ty
res.corx2tx
res.cory2ty
Jchemo.plswold
— Methodplswold(; kwargs...)
plswold(X, Y; kwargs...)
plswold(X, Y, weights::Weight; kwargs...)
plswold!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Partial Least Squares Regression (PLSR) with the Wold algorithm
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.tol
: Tolerance for the Nipals algorithm.maxit
: Maximum number of iterations for the Nipals algorithm.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Wold Nipals PLSR algorithm: Tenenhaus 1998 p.204.
See function plskern
for examples.
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris, France.
Wold, S., Ruhe, A., Wold, H., Dunn, III, W.J., 1984. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS). Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5, 735–743. https://doi.org/10.1137/0905052
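A minimal sketch with simulated data (illustrative only; it assumes the same fit!/predict workflow as plskern):
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(10)
Xtest = rand(4, 5)
ytest = rand(4)
model = plswold(; nlv = 2, tol = 1e-10, maxit = 200)   # Nipals tolerance and iteration limit
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, ytest)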
Jchemo.predict
— Methodpredict(object::Calds, X; kwargs...)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Calpds, X; kwargs...)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Cglsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. iterations, or collection of nb. iterations, to consider.
Jchemo.predict
— Methodpredict(object::Dkplsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Dmkern, x)
Compute predictions from a fitted model.
object
: The fitted model.x
: Data (vector) for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Dmnorm, X)
Compute predictions from a fitted model.
object
: The fitted model.X
: Data (vector) for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Knnda1, X)
Compute the y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Knnr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Kplsr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
If nothing, it is the maximum nb. LVs.
Jchemo.predict
— Methodpredict(object::Krr, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider.
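As a hedged illustration (simulated data; it follows the krr documentation), lb can be a single value or a collection, in which case one prediction matrix is returned per value:
using Jchemo
X = rand(10, 5) ; y = rand(10)
model = krr(; lb = 1e-2)
fit!(model, X, y)
predict(model, X).pred              # predictions with the lb stored in the model
res = predict(model, X; lb = [.1 ; .01])
res.pred[1]                         # predictions for lb = .1
res.pred[2]                         # predictions for lb = .01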
Jchemo.predict
— Methodpredict(object::Loessr, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwmlr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwmlrda, X)
Compute y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwplslda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwplsqda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwplsr, X; nlv = nothing)
Compute the Y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwplsravg, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Lwplsrda, X; nlv = nothing)
Compute the y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Mbplsprobda, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Mbplsrda, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Mlrda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Occod, X)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Occsd, X)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Occsdod, X)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Occstah, X)
Compute predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Pcr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Plsprobda, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Plsravg, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Plsrda, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Qda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Rosaplsr, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Rr, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider.
Jchemo.predict
— Methodpredict(object::Rrda, X; lb = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.lb
: Regularization parameter, or collection of regularization parameters, "lambda" to consider. If nothing, it is the parameter stored in the fitted model.
Jchemo.predict
— Methodpredict(object::Soplsr, Xbl)
Compute Y-predictions from a fitted model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Spcr, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Svmda, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Svmr, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Treeda, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Treer, X)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Union{Lda, Qda}, X)
Compute y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
Jchemo.predict
— Methodpredict(object::Mlr, X)
Compute the Y-predictions from the fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.
Jchemo.predict
— Methodpredict(object::Union{Plsr, Pcr, Splsr}, X; nlv = nothing)
Compute Y-predictions from a fitted model.
object
: The fitted model.X
: X-data for which predictions are computed.nlv
: Nb. LVs, or collection of nb. LVs, to consider.
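As a hedged illustration (simulated data), nlv can be a single value or a collection; with a collection, res.pred holds one prediction matrix per requested number of LVs, as in the plskern examples:
using Jchemo
X = rand(10, 5) ; Y = rand(10, 2)
model = plskern(; nlv = 3)
fit!(model, X, Y)
predict(model, X).pred             # predictions with the maximal nb. of LVs (here 3)
predict(model, X; nlv = 2).pred    # predictions with 2 LVs
res = predict(model, X; nlv = 1:2)
res.pred[1]                        # predictions with 1 LV
res.pred[2]                        # predictions with 2 LVs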
Jchemo.pval
— Methodpval(d::Distribution, q)
pval(x::Array, q)
pval(e_cdf::ECDF, q)
Compute p-value(s) for a distribution, an ECDF or vector.
d
: A distribution computed fromDistribution.jl
.x
: Univariate data.e_cdf
: An ECDF computed fromStatsBase.jl
.q
: Value(s) for which to compute the p-value(s).
Compute or estimate the p-value of quantile q
, i.e. P(Q > q
) where Q is the random variable.
Examples
using Jchemo, Distributions, StatsBase
d = Distributions.Normal(0, 1)
q = 1.96
#q = [1.64; 1.96]
Distributions.cdf(d, q) # cumulative density function (CDF)
Distributions.ccdf(d, q) # complementary CDF (CCDF)
pval(d, q) # Distributions.ccdf
x = rand(5)
e_cdf = StatsBase.ecdf(x)
e_cdf(x) # empirical CDF computed at each point of x (ECDF)
p_val = 1 .- e_cdf(x) # complementary ECDF at each point of x
q = .3
#q = [.3; .5; 10]
pval(e_cdf, q) # 1 .- e_cdf(q)
pval(x, q)
Jchemo.qda
— Methodqda(; kwargs...)
qda(X, y; kwargs...)
qda(X, y, weights::Weight; kwargs...)
Quadratic discriminant analysis (QDA, with continuum towards LDA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior
): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights
) allows to implement other choices.
For the continuum approach, a value alpha
> 0 shrinks the class-covariances by class (Wi) toward a common LDA covariance ("within" W). This corresponds to the "first regularization (Eqs.16)" described in Friedman 1989 (where alpha
is referred to as "lambda").
References
Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
model = qda()
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
aggsum(fitm.weights.w, ytrain)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
## With regularization
model = qda(alpha = .5)
#model = qda(alpha = 1) # = LDA
fit!(model, Xtrain, ytrain)
model.fitm.Wi
res = predict(model, Xtest) ;
errp(res.pred, ytest)
Jchemo.r2
— Methodr2(pred, Y)
Compute the R2 coefficient.
pred
: Predictions.Y
: Observed data.
The rate R2 is calculated by:
- R2 = 1 - MSEP(current model) / MSEP(null model)
where the "null model" is the overall mean. For predictions over CV or test sets, and/or for non linear models, it can be different from the square of the correlation coefficient (cor2
) between the true data and the predictions.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
r2(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
r2(pred, ytest)
Jchemo.rasvd
— Methodrasvd(; kwargs...)
rasvd(X, Y; kwargs...)
rasvd(X, Y, weights::Weight; kwargs...)
rasvd!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Redundancy analysis (RA), a.k.a PCA on instrumental variables (PCAIV)
X
: First block of data.Y
: Second block of data.weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.bscal
: Type of block scaling. Possible values are::none
,:frob
. See functionsblockscal
.tau
: Regularization parameter (∊ [0, 1]).scal
: Boolean. Iftrue
, each column of blocksX
andY
is scaled by its uncorrected standard deviation (before the block scaling).
See e.g. Bougeard et al. 2011a,b and Legendre & Legendre 2012. Let Yhat be the fitted values of the regression of Y
on X
. The scores Ty
are the PCA scores of Yhat. The scores Tx
are the fitted values of the regression of Ty
on X
.
A continuum regularization is available. After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
where D is the observation (row) metric. The value tau = 0 can generate instability when inverting the covariance matrices. A better alternative is generally to use an epsilon value (e.g. tau = 1e-8) to get similar results as with pseudo-inverses.
References
Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011-a. Multiblock redundancy analysis from a user's perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214. https://doi.org/10.1285/i20705948v4n2p203
Bougeard, S., Qannari, E.M., Rose, N., 2011-b. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467-475. https://doi.org/10.1002/cem.1392
Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier, Amsterdam, The Netherlands.
Tenenhaus, A., Guillemot, V. 2017. RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data Multiblock data analysis. https://cran.r-project.org/web/packages/RGCCA/index.html
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "linnerud.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
q = nco(Y)
nlv = 2
bscal = :frob ; tau = 1e-4
model = rasvd(; nlv, bscal, tau)
fit!(model, X, Y)
@names model
@names model.fitm
@head model.fitm.Tx
@head transfbl(model, X, Y).Tx
@head model.fitm.Ty
@head transfbl(model, X, Y).Ty
res = summary(model, X, Y) ;
@names res
res.explvarx
res.explvary
res.cortx2ty
res.rvx2tx
res.rvy2ty
res.rdx2tx
res.rdy2ty
res.corx2tx
res.cory2ty
Jchemo.rd
— Methodrd(X, Y; typ = :cor)
rd(X, Y, weights::Weight; typ = :cor)
Compute redundancy coefficients (Rd).
X
: Matrix (n, p).Y
: Matrix (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
typ
: Possibles values are::cor
(correlation),:cov
(uncorrected covariance).
Returns the redundancy coefficient between X
and each column of Y
, i.e. for each k = 1,...,q:
- Mean {cor(xj, yk)^2 ; j = 1, ..., p }
Depending on argument typ
, the correlation can be replaced by the (not corrected) covariance.
See Tenenhaus 1998 section 2.2.1 p.10-11.
References
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Examples
using Jchemo
X = rand(5, 10)
Y = rand(5, 3)
rd(X, Y)
Jchemo.rda
— Methodrda(; kwargs...)
rda(X, y; kwargs...)
rda(X, y, weights::Weight; kwargs...)
Regularized discriminant analysis (RDA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).lb
: Ridge regularization parameter "lambda" (>= 0).simpl
: Boolean. See functiondmnorm
.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Let us note W the (corrected) pooled within-class covariance matrix and Wi the (corrected) within-class covariance matrix of class i. The regularization is done by the two following successive steps (for each class i):
- Continuum between QDA and LDA: Wi(1) = (1 - alpha) * Wi + alpha * W
- Ridge regularization: Wi(2) = Wi(1) + lb * I
Then the QDA algorithm is run on matrices {Wi(2)}.
Function rda
is slightly different from the regularization expression used by Friedman 1989 (Eq.18): the choice is to shrink the covariance matrices Wi(2) toward the diagonal of the Identity matrix (ridge regularization; e.g. Guo et al. 2007).
Particular cases:
- alpha = 1 & lb = 0 : LDA
- alpha = 0 & lb = 0 : QDA
- alpha = 1 & lb > 0 : Penalized LDA (Hastie et al. 1995) with diagonal regularization matrix
See functions lda
and qda
for other details (arguments weights
and prior
).
References
Friedman JH. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989; 84(405):165-175. doi:10.1080/01621459.1989.10478752.
Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007; 8(1):86-100. doi:10.1093/biostatistics/kxj035.
Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized Discriminant Analysis. The Annals of Statistics 23, 73–102.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
y = dat.X[:, 5]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = n - ntest
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
alpha = .5
lb = 1e-8
model = rda(; alpha, lb)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.recod_catbydict
— Methodrecod_catbydict(x, dict)
Recode a categorical variable to dictionary levels.
x
: Categorical variable (n) to replace.dict
: Dictionary giving the correspondences between the old and new levels.
See examples.
Examples
using Jchemo
dict = Dict("a" => 1000, "b" => 1, "c" => 2)
x = ["c" ; "c" ; "a" ; "a" ; "a"]
recod_catbydict(x, dict)
x = ["c" ; "c" ; "a" ; "a" ; "a" ; "e"]
recod_catbydict(x, dict)
Jchemo.recod_catbyind
— Methodrecod_catbyind(x, lev)
Recode a categorical variable to indexes of levels.
x
: Categorical variable (n) to replace.lev
: Vector containing categorical levels.
See examples.
Warning: The levels in x
must be contained in lev
.
Examples
using Jchemo
lev = ["EHH" ; "FFS" ; "ANF" ; "CLZ" ; "CNG" ; "FRG" ; "MPW" ; "PEE" ; "SFG" ; "SFG" ; "TTS"]
slev = mlev(lev)
[slev 1:length(slev)]
x = ["EHH" ; "TTS" ; "FRG" ; "EHH"]
recod_catbyind(x, lev)
Jchemo.recod_catbyint
— Methodrecod_catbyint(x; start = 1)
Recode a categorical variable to integers.
x
: Categorical variable (n) to replace.start
: Integer labelling the first categorical level inx
.
The integers returned by the function correspond to the sorted levels of x
, see examples.
Examples
using Jchemo
x = ["b", "a", "b"]
mlev(x)
[x recod_catbyint(x)]
recod_catbyint(x; start = 0)
recod_catbyint([25, 1, 25])
Jchemo.recod_catbylev
— Methodrecod_catbylev(x, lev)
Recode a categorical variable to levels.
x
: Variable (n) to replace.lev
: Vector containing the categorical levels.
The ith sorted level in x
is replaced by the ith sorted level in lev
, see examples.
Warning: x
and lev
must contain the same number of levels.
Examples
using Jchemo
x = [10 ; 4 ; 3 ; 3 ; 4 ; 4]
lev = ["B" ; "C" ; "AA" ; "AA"]
mlev(x)
mlev(lev)
[x recod_catbylev(x, lev)]
xstr = string.(x)
[xstr recod_catbylev(xstr, lev)]
lev = [3; 0; 0; -1]
mlev(x)
mlev(lev)
[x recod_catbylev(x, lev)]
Jchemo.recod_indbylev
— Methodrecod_indbylev(x::Union{Int, Array{Int}}, lev::Array)
Recode an index variable to levels.
x
: Index variable (n) to replace.lev
: Vector containing the categorical levels.
Assuming slev = 'sort(unique(lev))', each element x[i]
(i = 1, ..., n) is replaced by slev[x[i]]
, see examples.
Warning: Vector x
must contain integers between 1 and nlev, where nlev is the number of levels in lev
.
Examples
using Jchemo
x = [2 ; 1 ; 2 ; 2]
lev = ["B" ; "C" ; "AA" ; "AA"]
mlev(x)
mlev(lev)
[x recod_indbylev(x, lev)]
recod_indbylev([2], lev)
recod_indbylev(2, lev)
x = [2 ; 1 ; 2]
lev = [3 ; 0 ; 0 ; -1]
mlev(x)
mlev(lev)
recod_indbylev(x, lev)
Jchemo.recod_miss
— Methodrecod_miss(X; miss = nothing)
recod_miss(df; miss = nothing)
Declare data as missing in a dataset.
X
: A dataset (array).miss
: The code used in the dataset to identify the data to be declared asmissing
(of typeMissing
).
Specific for dataframes:
df
: A dataset (dataframe).
The case miss = nothing
has the only action to allow missing
in X
or df
.
See examples.
Examples
using Jchemo, DataFrames
X = hcat(1:5, [0, 0, 7., 10, 1.2])
X_miss = recod_miss(X; miss = 0)
df = DataFrame(i = 1:5, x = [0, 0, 7., 10, 1.2])
df_miss = recod_miss(df; miss = 0)
df = DataFrame(i = 1:5, x = ["0", "0", "c", "d", "e"])
df_miss = recod_miss(df; miss = "0")
Jchemo.recod_numbyint
— Methodrecod_numbyint(x, q)
Recode a continuous variable to integers.
x
: Continuous variable (n) to replace.q
: Numerical values separating classes inx
. The first class is labelled to 1.
See examples.
Examples
using Jchemo, Statistics
x = [collect(1:10); 8.1 ; 3.1]
q = [3; 8]
zx = recod_numbyint(x, q)
[x zx]
probs = [.33; .66]
q = quantile(x, probs)
zx = recod_numbyint(x, q)
[x zx]
Jchemo.recovkw
— Methodrecovkw(ParStruct, kwargs)
Jchemo.residcla
— Methodresidcla(pred, y)
Compute the discrimination residual vector (0 = no error, 1 = error).
pred
: Predictions.y
: Observed data (class membership).
Examples
using Jchemo
Xtrain = rand(10, 5)
ytrain = rand(["a" ; "b"], 10)
Xtest = rand(4, 5)
ytest = rand(["a" ; "b"], 4)
model = plsrda(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
residcla(pred, ytest)
Jchemo.residreg
— Methodresidreg(pred, Y)
Compute the regression residual vector.
pred
: Predictions.Y
: Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
residreg(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
residreg(pred, ytest)
Jchemo.rfda
— Methodrfda(; kwargs...)
rfda(X, y::Union{Array{Int}, Array{String}}; kwargs...)
Random forest discrimination with DecisionTree.jl.
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
n_trees
: Nb. trees built for the forest.partial_sampling
: Proportion of sampled observations for each tree.n_subfeatures
: Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).max_depth
: Maximum depth of the decision trees (default: -1 ==> no maximum).min_sample_leaf
: Minimum number of samples each leaf needs to have.min_sample_split
: Minimum number of observations needed for a split.mth
: Boolean indicating if multi-threading is used when new data are predicted with functionpredict
.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
The function fits a random forest discrimination model using package DecisionTree.jl.
For DA in DecisionTree.jl, the components of y must be Int or String.
References
Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
n_trees = 200
n_subfeatures = p / 3
max_depth = 10
model = rfda(; n_trees, n_subfeatures, max_depth)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.rfr
— Methodrfr(; kwargs...)
rfr(X, y; kwargs...)
Random forest regression with DecisionTree.jl.
X
: X-data (n, p).y
: Univariate y-data (n).
Keyword arguments:
n_trees
: Nb. trees built for the forest.partial_sampling
: Proportion of sampled observations for each tree.n_subfeatures
: Nb. variables to select at random at each split (default: -1 ==> sqrt(#variables)).max_depth
: Maximum depth of the decision trees (default: -1 ==> no maximum).min_sample_leaf
: Minimum number of samples each leaf needs to have.min_sample_split
: Minimum number of observations needed for a split.mth
: Boolean indicating if multi-threading is used when new data are predicted with functionpredict
.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
The function fits a random forest regression model using package DecisionTree.jl.
References
Breiman, L., 1996. Bagging predictors. Mach Learn 24, 123–140. https://doi.org/10.1007/BF00058655
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Genuer, R., 2010. Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD Thesis. Université Paris Sud - Paris XI.
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
n_trees = 200
n_subfeatures = p / 3
max_depth = 15
model = rfr(; n_trees, n_subfeatures, max_depth)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.rmcol
— Methodrmcol(X::Union{AbstractMatrix, DataFrame}, s::Union{Vector, BitVector, UnitRange, Number})
rmcol(X::Vector, s::Union{Vector, BitVector, UnitRange, Number})
Remove the columns of a matrix or the components of a vector having indexes s
.
X
: Matrix or vector.s
: Vector of the indexes.
Examples
using Jchemo
X = rand(5, 3)
rmcol(X, [1, 3])
Jchemo.rmgap
— Methodrmgap(; kwargs...)
rmgap(X; kwargs...)
Remove vertical gaps in spectra (e.g. for ASD).
X
: X-data (n, p).
Keyword arguments:
indexcol
: Indexes (∈ [1, p]) of theX
-columns where are located the gaps to remove.npoint
: The number ofX
-columns used on the left side of each gap for fitting the linear regressions.
For each spectra (row-observation of matrix X
) and each defined gap, the correction is done by extrapolation from a simple linear regression computed on the left side of the gap.
For instance, if two gaps are observed between column-indexes 651-652 and between column-indexes 1425-1426, respectively, the syntax should be indexcol
= [651 ; 1425].
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/asdgap.jld2")
@load db dat
@names dat
X = dat.X
wlst = names(dat.X)
wl = parse.(Float64, wlst)
wl_target = [1000 ; 1800]
indexcol = findall(in(wl_target).(wl))
f, ax = plotsp(X, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
## Corrected data
model = rmgap(; indexcol, npoint = 5)
fit!(model, X)
Xc = transf(model, X)
f, ax = plotsp(Xc, wl)
vlines!(ax, wl_target; linestyle = :dot, color = (:grey, .8))
f
Jchemo.rmrow
— Methodrmrow(X::Union{AbstractMatrix, DataFrame}, s::Union{Vector, BitVector, UnitRange, Number})
rmrow(X::Union{Vector, BitVector}, s::Union{Vector, BitVector, UnitRange, Number})
Remove the rows of a matrix or the components of a vector having indexes s
.
X
: Matrix or vector.s
: Vector of the indexes.
Examples
using Jchemo
X = rand(5, 2)
rmrow(X, [1, 4])
Jchemo.rmsep
— Methodrmsep(pred, Y)
Compute the square root of the mean of the squared prediction errors (RMSEP).
pred
: Predictions.Y
: Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rmsep(pred, ytest)
Jchemo.rmsepstand
— Methodrmsepstand(pred, Y)
Compute the standardized square root of the mean of the squared prediction errors (RMSEP_stand).
pred
: Predictions.Y
: Observed data.
RMSEP is standardized to Y:
- RMSEP_stand = RMSEP ./ Y.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
rmsepstand(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rmsepstand(pred, ytest)
Jchemo.rosaplsr
— Methodrosaplsr(; kwargs...)
rosaplsr(Xbl, Y; kwargs...)
rosaplsr(Xbl, Y, weights::Weight; kwargs...)
rosaplsr!(Xbl::Vector, Y::Matrix, weights::Weight; kwargs...)
Multiblock ROSA PLSR (Liland et al. 2016).
Xbl
: List of blocks (vector of matrices) of X-data. Typically, output of functionmblock
from (n, p) data.Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.scal
: Boolean. Iftrue
, each column of blocks inXbl
andY
is scaled by its uncorrected standard deviation (before the block scaling).
The function has the following differences with the original algorithm of Liland et al. (2016):
- Scores T (latent variables LVs) are not normed to 1.
- Multivariate
Y
is allowed. In such a case, the squared residuals are summed over the columns to find the winning block for each global LV (therefore, Y-columns should have the same scale).
References
Liland, K.H., Næs, T., Indahl, U.G., 2016. ROSA — a fast extension of partial least squares regression for multiblock data analysis. Journal of Chemometrics 30, 651–662. https://doi.org/10.1002/cem.2824
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 3
scal = false
#scal = true
model = rosaplsr(; nlv, scal)
fit!(model, Xbltrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
@head transf(model, Xbltrain)
transf(model, Xbltest)
res = predict(model, Xbltest)
res.pred
rmsep(res.pred, ytest)
Jchemo.rowmean
— Methodrowmean(X)
Compute row-wise means of a matrix.
X
: Data (n, p).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
rowmean(X)
Jchemo.rownorm
— Methodrownorm(X)
Compute row-wise norms of a matrix.
X
: Data (n, p).
The norm computed for a row x of X
is:
- sqrt(x' * x)
Return a vector.
Note: Thanks to @mcabbott at https://discourse.julialang.org/t/orders-of-magnitude-runtime-difference-in-row-wise-norm/96363.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
rownorm(X)
Jchemo.rowstd
— Methodrowstd(X)
Compute row-wise standard deviations (uncorrected) of a matrix.
X
: Data (n, p).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
rowstd(X)
Jchemo.rowsum
— Methodrowsum(X)
Compute row-wise sums of a matrix.
X
: Data (n, p).
Return a vector.
Examples
using Jchemo
X = rand(5, 2)
rowsum(X)
Jchemo.rowvar
— Methodrowvar(X)
Compute row-wise variances (uncorrected) of a matrix.
X
: Data (n, p).
Return a vector.
Examples
using Jchemo
n, p = 5, 6
X = rand(n, p)
rowvar(X)
Jchemo.rp
— Methodrp(; kwargs...)
rp(X; kwargs...)
rp(X, weights::Weight; kwargs...)
rp!(X::Matrix, weights::Weight; kwargs...)
Make a random projection of X-data.
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. dimensions on whichX
is projected.meth
: Method of random projection. Possible values are::gauss
,:li
. See the respective functionsrpmatgauss
andrpmatli
for their keyword arguments.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Examples
using Jchemo
n, p = (5, 10)
X = rand(n, p)
nlv = 3
meth = :li ; s = sqrt(p)
#meth = :gauss
model = rp(; nlv, meth, s)
fit!(model, X)
@names model
@names model.fitm
@head model.fitm.T
@head model.fitm.V
transf(model, X[1:2, :])
Jchemo.rpd
— Methodrpd(pred, Y)
Compute the ratio "deviation to model performance" (RPD).
pred
: Predictions.Y
: Observed data.
This is the ratio of the deviation (standard deviation of the observed data) to the model performance (RMSEP), defined by:
- RPD = Std(Y) / RMSEP
where Std(Y) is the standard deviation.
Since Std(Y) = RMSEP(null model) where the null model is the simple average, this also gives:
- RPD = RMSEP(null model) / RMSEP
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
rpd(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rpd(pred, ytest)
Jchemo.rpdr
— Methodrpdr(pred, Y)
Compute a robustified RPD.
pred
: Predictions.Y
: Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
rpdr(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
rpdr(pred, ytest)
Jchemo.rpmatgauss
— Functionrpmatgauss(p::Int, nlv::Int, Q = Float64)
Build a gaussian random projection matrix.
p
: Nb. variables (attributes) to project.nlv
: Nb. of simulated projection dimensions.Q
: Type of components of the built projection matrix.
The function returns a random projection matrix V of dimension p
x nlv
. The projection of a given matrix X of size n x p
is given by X * V.
V is simulated from i.i.d. N(0, 1) / sqrt(nlv
).
References
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436
Examples
using Jchemo
p = 10 ; nlv = 3
rpmatgauss(p, nlv)
Jchemo.rpmatli
— Functionrpmatli(p::Int, nlv::Int, Q = Float64; s)
Build a sparse random projection matrix (Achlioptas 2001, Li et al. 2006).
p
: Nb. variables (attributes) to project.nlv
: Nb. of simulated projection dimensions.Q
: Type of components of the built projection matrix.
Keyword arguments:
s
: Coefficient defining the sparsity of the returned matrix (higher iss
, higher is the sparsity).
The function returns a random projection matrix V of dimension p
x nlv
. The projection of a given matrix X of size n x p
is given by X * V.
Matrix V is simulated from i.i.d. discrete sampling within values:
- 1 with prob. 1/(2 * s)
- 0 with prob. 1 - 1/s
- -1 with prob. 1/(2 * s)
Usual values for s are:
- sqrt(p) (Li et al. 2006)
- p / log(p) (Li et al. 2006)
- 1 (Achlioptas 2001)
- 3 (Achlioptas 2001)
References
Achlioptas, D., 2001. Database-friendly random projections, in: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01. Association for Computing Machinery, New York, NY, USA, pp. 274–281. https://doi.org/10.1145/375551.375608
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. Association for Computing Machinery, New York, NY, USA, pp. 287–296. https://doi.org/10.1145/1150402.1150436
Examples
using Jchemo
p = 10 ; nlv = 3
rpmatli(p, nlv)
Jchemo.rr
— Methodrr(; kwargs...)
rr(X, Y; kwargs...)
rr(X, Y, weights::Weight; kwargs...)
rr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Ridge regression (RR) implemented by SVD factorization.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
lb
: Ridge regularization parameter "lambda".scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
References
Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.
Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.
Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
lb = 1e-3
model = rr(; lb)
#model = rrchol(; lb)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
coef(model)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
## !! Only for function 'rr' (not for 'rrchol')
coef(model; lb = 1e-1)
res = predict(model, Xtest; lb = [.1 ; .01])
@head res.pred[1]
@head res.pred[2]
Jchemo.rrchol
— Methodrrchol(; kwargs...)
rrchol(X, Y; kwargs...)
rrchol(X, Y, weights::Weight; kwargs...)
rrchol!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Ridge regression (RR) using the Normal equations and a Cholesky factorization.
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
lb
: Ridge regularization parameter "lambda".scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
See function rr
for examples.
References
Cule, E., De Iorio, M., 2012. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686.
Hastie, T., Tibshirani, R., 2004. Efficient quadratic regularization for expression arrays. Biostatistics 5, 329-340. https://doi.org/10.1093/biostatistics/kxh010
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. Springer, New York.
Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
Jchemo.rrda
— Methodrrda(; kwargs...)
rrda(X, y; kwargs...)
rrda(X, y, weights::Weight; kwargs...)
Discrimination based on ridge regression (RR-DA).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
lb
: Ridge regularization parameter "lambda".prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
The method is as follows:
- The training variable y (univariate class membership) is transformed to a dummy table (Ydummy) containing nlev columns, where nlev is the number of classes present in y. Each column of Ydummy is a dummy (0/1) variable.
- Then, a ridge regression (RR) is run on {X, Ydummy}, returning predictions of the dummy variables (= object posterior returned by function predict). These predictions can be considered as unbounded estimates (i.e. eventually outside of [0, 1]) of the class membership probabilities.
- For a given observation, the final prediction is the class corresponding to the dummy variable for which the probability estimate is the highest (see the sketch below).
In the high-level version of the present functions, the observation weights are automatically defined by the given priors (argument prior): the sub-totals by class of the observation weights are set equal to the prior probabilities. The low-level version (argument weights) allows other choices to be implemented.
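As an indication, the RR-DA principle (dummy coding, regression, then argmax) can be sketched as follows (a simplified illustration, not the internal code of rrda):
y = ["a", "a", "b", "c", "b"]
lev = sort(unique(y))
Ydummy = Float64.(y .== permutedims(lev))   ## (n, nlev) 0/1 dummy table
## After ridge-regressing Ydummy on X and predicting a row 'post' of scores:
post = [.2, .7, .1]                         ## hypothetical unbounded estimates
pred = lev[argmax(post)]                    ## predicted class ("b")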
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
lb = 1e-5
model = rrda(; lb)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
@names fitm.fitm
aggsum(fitm.fitm.weights.w, ytrain)
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; lb = [.1; .01]).pred
Jchemo.rrr
— Methodrrr(; kwargs...)
rrr(X, Y; kwargs...)
rrr(X, Y, weights::Weight; kwargs...)
rrr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Reduced rank regression (RRR, a.k.a. RA).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.tau
: Regularization parameter (∊ [0, 1]).tol
: Tolerance for the Nipals algorithm.maxit
: Maximum number of iterations for the Nipals algorithm.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Reduced rank regression, also referred to as redundancy analysis (RA) regression. In this function, the RA uses the Nipals algorithm presented in Tchandao Mangamana et al. 2021, section 2.1.1.
A continuum regularization is available. After block centering and scaling, the covariance matrices are computed as follows:
- Cx = (1 - tau) * X'DX + tau * Ix
where D is the observation (row) metric. Value tau = 0 can generate instability when inverting the covariance matrices. A better alternative is generally to use an epsilon value (e.g. tau = 1e-8) to get results similar to those obtained with pseudo-inverses.
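As an indication, the regularized covariance above can be sketched as follows (uniform weights assumed, i.e. D = I / n; only an illustration, not the internal code of rrr):
using LinearAlgebra
n = 20 ; p = 5 ; tau = 1e-4
X = randn(n, p)
Xc = X .- sum(X; dims = 1) / n            ## column centering
D = Diagonal(fill(1 / n, n))              ## observation (row) metric
Cx = (1 - tau) * Xc' * D * Xc + tau * I(p)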
References
Bougeard, S., Qannari, E.M., Lupo, C., Chauvin, C., 2011. Multiblock redundancy analysis from a user’s perspective. Application in veterinary epidemiology. Electronic Journal of Applied Statistical Analysis 4, 203-214–214. https://doi.org/10.1285/i20705948v4n2p203
Bougeard, S., Qannari, E.M., Rose, N., 2011. Multiblock redundancy analysis: interpretation tools and application in epidemiology. Journal of Chemometrics 25, 467–475. https://doi.org/10.1002/cem.1392
Tchandao Mangamana, E., Glèlè Kakaï, R., Qannari, E.M., 2021. A general strategy for setting up supervised methods of multiblock data analysis. Chemometrics and Intelligent Laboratory Systems 217, 104388. https://doi.org/10.1016/j.chemolab.2021.104388
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 1
tau = 1e-4
model = rrr(; nlv, tau)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.rv
— Methodrv(X, Y; centr = true)
rv(Xbl::Vector; centr = true)
Compute RV coefficients.
X
: Matrix (n, p).Y
: Matrix (n, q).Xbl
: A list (vector) of matrices.centr
: Boolean indicating if the matrices will be internally centered or not.
RV is bounded within [0, 1].
A dissimilarity measure between X and Y can be computed by d = sqrt(2 * (1 - RV)).
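As an indication, the usual RV definition (Robert & Escoufier 1976) can be sketched as follows on column-centered matrices (the rv function may differ in implementation details):
using LinearAlgebra, Statistics
X = rand(5, 10) ; Y = rand(5, 3)
Xc = X .- mean(X; dims = 1) ; Yc = Y .- mean(Y; dims = 1)
A = Xc * Xc' ; B = Yc * Yc'
rv_coef = tr(A * B) / sqrt(tr(A * A) * tr(B * B))
d = sqrt(2 * (1 - rv_coef))    ## associated dissimilarity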
References
Escoufier, Y., 1973. Le Traitement des Variables Vectorielles. Biometrics 29, 751–760. https://doi.org/10.2307/2529140
Josse, J., Holmes, S., 2016. Measuring multivariate association and beyond. Stat Surv 10, 132–167. https://doi.org/10.1214/16-SS116
Josse, J., Pagès, J., Husson, F., 2008. Testing the significance of the RV coefficient. Computational Statistics & Data Analysis 53, 82–91. https://doi.org/10.1016/j.csda.2008.06.012
Kazi-Aoual, F., Hitier, S., Sabatier, R., Lebreton, J.-D., 1995. Refined approximations to permutation tests for multivariate inference. Computational Statistics & Data Analysis 20, 643–656. https://doi.org/10.1016/0167-9473(94)00064-2
Mayer, C.-D., Lorent, J., Horgan, G.W., 2011. Exploratory Analysis of Multiple Omics Datasets Using the Adjusted RV Coefficient. Statistical Applications in Genetics and Molecular Biology 10. https://doi.org/10.2202/1544-6115.1540
Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., van Erk, M.J., 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25, 401–405. https://doi.org/10.1093/bioinformatics/btn634
Robert, P., Escoufier, Y., 1976. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics) 25, 257–265. https://doi.org/10.2307/2347233
Examples
using Jchemo
X = rand(5, 10)
Y = rand(5, 3)
rv(X, Y)
X = rand(5, 15)
listbl = [3:4, 1, [6; 8:10]]
Xbl = mblock(X, listbl)
rv(Xbl)
Jchemo.sampcla
— Functionsampcla(x, k::Union{Int, Vector{Int}}, y = nothing)
Build training vs. test sets by stratified sampling.
x
: Class membership (n) of the observations.k
: Nb. test observations to sample in each class. Ifk
is a single value, the nb. of sampled observations is the same for each class. Alternatively,k
can be a vector of length equal to the nb. of classes inx
.y
: Quantitative variable (n) used if systematic sampling.
Two outputs are returned (= row indexes of the data): train (n - k), test (k).
If y = nothing, the sampling is random, else it is systematic over the sorted y (see function sampsys).
References
Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.
Examples
using Jchemo
x = string.(repeat(1:3, 5))
n = length(x)
tab(x)
k = 2
res = sampcla(x, k)
res.test
x[res.test]
tab(x[res.test])
y = rand(n)
res = sampcla(x, k, y)
res.test
x[res.test]
tab(x[res.test])
Jchemo.sampdf
— Functionsampdf(Y::DataFrame, k::Union{Int, Vector{Int}}, id = 1:nro(Y); meth = :rand)
Build training vs. test sets from each column of a dataframe.
Y
: DataFrame (n, p). Can contain missing values.k
: Nb. of test observations selected for eachY
-column. The selection is done within the non-missing observations of the considered column. Ifk
is a single value, the same nb. of observations are selected for each column. Alternatively,k
can be a vector of length p.id
: Vector (n) of IDs.
Keyword arguments:
meth
: Type of sampling for the test set. Possible values are::rand
= random sampling,:sys
= systematic sampling over each sortedY
-column (see functionsampsys
).
Typically, dataframe Y
contains a set of response variables to predict.
Examples
using Jchemo, DataFrames
Y = hcat([rand(5); missing; rand(6)],
[rand(2); missing; missing; rand(7); missing])
Y = DataFrame(Y, :auto)
n = nro(Y)
k = 3
res = sampdf(Y, k)
#res = sampdf(Y, k, string.(1:n))
@names res
res.nam
length(res.test)
res.train
res.test
## Replicated splitting Train/Test
rep = 10
k = 3
ids = [sampdf(Y, k) for i = 1:rep]
length(ids)
i = 1 # replication
ids[i]
ids[i].train
ids[i].test
j = 1 # variable y
ids[i].train[j]
ids[i].test[j]
ids[i].nam[j]
Jchemo.sampdp
— Methodsampdp(X, k::Int; metric = :eucl)
Build training vs. test sets by DUPLEX sampling.
X
: X-data (n, p).k
: Nb. pairs (training/test) of observations to sample. Must be <= n / 2.
Keyword arguments:
metric
: Metric used for the distance computation. Possible values are::eucl
(Euclidean),:mah
(Mahalanobis).
Three outputs (= row indexes of the data) are returned: train (k), test (k), remain (n - 2 * k).
Outputs train
and test
are built from the DUPLEX algorithm (Snee, 1977 p.421). They are expected to cover approximately the same X-space region and have similar statistical properties.
In practice, when output remain
is not empty (i.e. when there are remaining observations), one common strategy is to add it to output train
.
References
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.
Snee, R.D., 1977. Validation of Regression Models: Methods and Examples. Technometrics 19, 415-428. https://doi.org/10.1080/00401706.1977.10489581
Examples
using Jchemo
X = [0.381392 0.00175002 ; 0.1126 0.11263 ;
0.613296 0.152485 ; 0.726536 0.762032 ;
0.367451 0.297398 ; 0.511332 0.320198 ;
0.018514 0.350678]
k = 3
sampdp(X, k)
Jchemo.sampks
— Methodsampks(X, k::Int; metric = :eucl)
Build training vs. test sets by Kennard-Stone sampling.
X
: X-data (n, p).k
: Nb. test observations to sample.
Keyword arguments:
metric
: Metric used for the distance computation. Possible values are::eucl
(Euclidean),:mah
(Mahalanobis).
Two outputs (= row indexes of the data) are returned: train (n - k), test (k).
Output test
is built from the Kennard-Stone (KS) algorithm (Kennard & Stone, 1969).
Note: By construction, the set of observations selected by KS sampling contains higher variability than the set of the remaining observations. In the seminal article (K&S, 1969), the algorithm is used to select observations that will be used to build a calibration set. Conversely, in the present function, KS is used to select a test set with higher variability than the training set.
References
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics, 11(1), 137-148.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
k = 80
res = sampks(X, k)
@names res
res.train
res.test
model = pcasvd(nlv = 15)
fit!(model, X)
@head T = model.fitm.T
res = sampks(T, k; metric = :mah)
#####################
n = 10
k = 25
X = [repeat(1:n, inner = n) repeat(1:n, outer = n)]
X = Float64.(X)
X .= X + .1 * randn(nro(X), nco(X))
s = sampks(X, k).test
f, ax = plotxy(X[:, 1], X[:, 2])
scatter!(ax, X[s, 1], X[s, 2]; color = "red")
f
Jchemo.samprand
— Methodsamprand(n::Int, k::Int; replace = false)
Build training vs. test sets by random sampling.
n
: Total nb. of observations.k
: Nb. test observations to sample.
Keyword arguments:
replace
: Boolean. Iffalse
, the sampling is without replacement.
Two outputs are returned (= row indexes of the data): train (n - k), test (k).
Output test
is built by random sampling within 1:n
.
Examples
using Jchemo
n = 10
samprand(n, 4)
Jchemo.sampsys
— Methodsampsys(y, k::Int)
Build training vs. test sets by systematic sampling over a quantitative variable.
y
: Quantitative variable (n) to sample.k
: Nb. test observations to sample. Must be >= 2.
Two outputs are returned (= row indexes of the data): train (n - k), test (k).
Output test is built by systematic sampling over the ranks of the y observations. For instance, if k / n ~ .3, one observation out of three over the sorted y is selected.
Output test always contains the indexes of the minimum and maximum of y.
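As an indication, the systematic-sampling idea can be sketched as follows (approximately equally spaced positions over the sorted y, including the extremes; not the internal code of sampsys):
y = rand(7) ; k = 3
id = sortperm(y)
pos = round.(Int, range(1, length(y); length = k))
test = sort(id[pos])                    ## selected observations
train = setdiff(1:length(y), test)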
Examples
using Jchemo
y = rand(7)
[y sort(y)]
res = sampsys(y, 3)
sort(y[res.test])
Jchemo.sampwsp
— Methodsampwsp(X, dmin; recod = false, maxit = nro(X))
Build training vs. test sets by WSP sampling.
X
: X-data (n, p).dmin
: Distance "dmin" (Santiago et al. 2012).
Keyword arguments:
recod
: Boolean indicating ifX
is recoded or not before the sampling (see below).maxit
: Maximum number of iterations.
Two outputs (= row indexes of the data) are returned: train (n - k), test (k).
Output test is built from the "Wootton, Sergent, Phan-Tan-Luu" (WSP) algorithm, assumed to generate samples uniformly distributed in the X domain (Santiago et al. 2012).
If recod = true, each column x of X is recoded within [0, 1] and the center of the domain is the vector repeat([.5], p). Column x is recoded as follows:
- vmin = minimum(x)
- vmax = maximum(x)
- vdiff = vmax - vmin
- x .= 0.5 .+ (x .- (vdiff / 2 + vmin)) / vdiff
References
Béal A. 2015. Description et sélection de données en grande dimension. Thèse de doctorat. Laboratoire d’Instrumentation et de sciences analytiques, Ecole doctorale des sciences chimiques, Université d'Aix-Marseille.
Santiago, J., Claeys-Bruno, M., Sergent, M., 2012. Construction of space-filling designs using WSP algorithm for high dimensional spaces. Chemometrics and Intelligent Laboratory Systems, Selected Papers from Chimiométrie 2010 113, 26–31. https://doi.org/10.1016/j.chemolab.2011.06.003
Examples
using Jchemo
n = 600 ; p = 2
X = rand(n, p)
dmin = .5
s = sampwsp(X, dmin)
@names s
@show length(s.test)
plotxy(X[s.test, 1], X[s.test, 2]).f
Jchemo.savgk
— Methodsavgk(nhwindow::Int, degree::Int, deriv::Int)
Compute the kernel of the Savitzky-Golay filter.
nhwindow
: Nb. points (>= 1) of the half window.degree
: Degree of the smoothing polynom, where 1 <=degree
<= 2 * nhwindow.deriv
: Derivation order, where 0 <=deriv
<= degree.
The size of the kernel is odd (npoint = 2 * nhwindow + 1):
- x[-nhwindow], x[-nhwindow+1], ..., x[0], ...., x[nhwindow-1], x[nhwindow].
If deriv = 0, there is no derivation (only polynomial smoothing).
The case degree = 0 (i.e. simple moving average) is not allowed by the function.
References
Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002
Examples
using Jchemo
res = savgk(21, 3, 2)
@names res
res.S
res.G
res.kern
Jchemo.savgol
— Methodsavgol(; kwargs...)
savgol(X; kwargs...)
Savitzky-Golay derivation and smoothing of each row of X-data.
X
: X-data (n, p).
Keyword arguments:
npoint
: Size of the filter (nb. points involved in the kernel). Must be odd and >= 3. The half-window size is nhwindow = (npoint
- 1) / 2.deriv
: Derivation order. Must be: 0 <=deriv
<=degree
.degree
: Degree of the smoothing polynom. Must be: 1 <=degree
<=npoint
- 1.
The smoothing is computed by convolution (with padding), using function imfilter of package ImageFiltering.jl. Each returned point is located on the center of the kernel. The kernel is computed with function savgk
.
The function returns a matrix (n, p).
References
Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Processing 85, 1429–1434. https://doi.org/10.1016/j.sigpro.2005.02.002
Savitzky, A., Golay, M.J.E., 2002. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. [WWW Document]. https://doi.org/10.1021/ac60214a047
Schafer, R.W., 2011. What Is a Savitzky-Golay Filter? [Lecture Notes]. IEEE Signal Processing Magazine 28, 111–117. https://doi.org/10.1109/MSP.2011.941097
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
npoint = 11 ; deriv = 2 ; degree = 2
model = savgol(; npoint, deriv, degree)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
####### Gaussian signal
u = -15:.1:15
n = length(u)
x = exp.(-.5 * u.^2) / sqrt(2 * pi) + .03 * randn(n)
M = 10 # half window
N = 3 # degree
deriv = 0
#deriv = 1
model = savgol(; npoint = 2M + 1, degree = N, deriv)
fit!(model, x')
xp = transf(model, x')
f, ax = plotsp(x', u; color = :blue)
lines!(ax, u, vec(xp); color = :red)
f
Jchemo.scale
— Methodscale()
scale(X)
scale(X, weights::Weight)
Column-wise scaling of X-data.
X
: X-data (n, p).
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = scale()
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
colstd(Xptrain)
@head Xptest
@head Xtest ./ colstd(Xtrain)'
plotsp(Xptrain).f
plotsp(Xptest).f
Jchemo.segmkf
— Methodsegmkf(n::Int, K::Int; rep = 1)
segmkf(group::Vector, K::Int; rep = 1)
Build segments of observations for K-fold cross-validation.
n
: Total nb. of observations in the dataset. The sampling is implemented with 1:n.group
: A vector (n) defining blocks of observations.K
: Nb. folds (segments) splitting then
observations.
Keyword arguments:
rep
: Nb. replications of the sampling.
For each replication, the function splits the n
observations to K
segments that can be used for K-fold cross-validation.
If group
is used (must be a vector of length n
), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.
The function returns a list (vector) of rep
elements. Each element of the list contains K
segments (= K
vectors). Each segment contains the indexes (position within 1:n
) of the sampled observations.
Examples
using Jchemo
n = 10 ; K = 3
rep = 4
segm = segmkf(n, K; rep)
i = 1
segm[i]
segm[i][1]
n = 10
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"] # blocks of the observations
tab(group)
K = 3 ; rep = 4
segm = segmkf(group, K; rep)
i = 1
segm[i]
segm[i][1]
group[segm[i][1]]
group[segm[i][2]]
group[segm[i][3]]
Jchemo.segmts
— Methodsegmts(n::Int, m::Int; rep = 1, seed = nothing)
segmts(group::Vector, m::Int; rep = 1, seed = nothing)
Build segments of observations for "test-set" validation.
n
: Total nb. of observations in the dataset. The sampling is implemented within 1:n
.group
: A vector (n) defining blocks of observations.m
: Nb. test observations, or groups ifgroup
is used, returned in each segment.
Keyword arguments:
rep
: Nb. replications of the sampling.seed
: Eventual seed for theRandom.MersenneTwister
generator. Must be of length =rep
. Whennothing
, the seed is random at each replication.
For each replication, the function builds a test set that can be used to validate a model.
If group
is used (must be a vector of length n), the function samples entire groups (= blocks) of observations instead of observations. Such a block-sampling is required when data is structured by blocks and when the response to predict is correlated within blocks. This prevents underestimation of the generalization error.
The function returns a list (vector) of rep
elements. Each element of the list is a vector of the indexes (positions within 1:n
) of the sampled observations.
Examples
using Jchemo
n = 10 ; m = 3
rep = 4
segm = segmts(n, m; rep)
i = 1
segm[i]
segm[i][1]
n = 10
group = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"] # blocks of the observations
tab(group)
m = 2 ; rep = 4
segm = segmts(group, m; rep)
i = 1
segm[i]
segm[i][1]
group[segm[i][1]]
Jchemo.selwold
— Methodselwold(indx, r; smooth = true, npoint = 5, alpha = .05, digits = 3, graph = true,
step = 2, xlabel = "Index", ylabel = "Value", title = "Score")
Wold's criterion to select dimensionality in LV models (e.g. PLSR).
indx
: A variable representing the model parameter(s), e.g. nb. LVs if PLSR models.r
: A vector of error rates (n), e.g. RMSECV.
Keyword arguments:
smooth
: Boolean. Iftrue
, the selection is done after a moving-average smoothing of rate R (see functionmavg
).npoint
: Window of the moving-average used to smooth rate R.alpha
: Proportion alpha used as threshold for rate R.digits
: Number of digits in the outputs.graph
: Boolean. Iftrue
, outputs are plotted.step
: Step used for defining the xticks in the graphs.xlabel
: Horizontal label for the plots.ylabel
: Vertical label for the plots.title
: Title of the left plot.
The selection criterion is the "precision gain ratio":
- R = 1 - r(a+1) / r(a)
where r is an observed error rate quantifying the model performance (e.g. RMSEP, classification error rate, etc.) and a is the model dimensionality (= nb. LVs). r can also represent other indicators such as the eigenvalues of a PCA.
R is the relative gain in performance efficiency after a new LV is added to the model. The iterations continue until R becomes lower than a threshold value alpha. The default alpha = .05 is only indicative, and the user should set any other value depending on the data and the parsimony objective.
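As an indication, the core of the criterion can be sketched as follows (simplified; the actual function also handles smoothing and plotting):
r = [2.0, 1.2, 1.0, 0.95, 0.94, 0.95]   ## e.g. RMSECV for increasing nb. LVs
R = 1 .- r[2:end] ./ r[1:end-1]         ## precision gain ratios
alpha = .05
i = findfirst(R .< alpha)               ## first dimension where the gain gets too small
sel = isnothing(i) ? length(r) : i      ## illustrative selection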
In his original article, Wold (1978; see also Bro et al. 2008) used the ratio of cross-validated over training residual sums of squares, i.e. PRESS over SSR. Instead, function selwold
compares values of consistent nature (the successive values in the input vector r
). For instance, r
was set to PRESS values in Li et al. (2002) and Andries et al. (2011), which is equivalent to the "punish factor" described in Westad & Martens (2000).
The ratio R can be erratic (particularly when r is the error rate of a discrimination model), making the dimensionality selection difficult. In such a situation, function selwold proposes to compute a smoothing of R (argument smooth).
The function returns two outputs (in addition to the plots, when graph = true):
- opt : The index corresponding to the minimum value of r.
- sel : The index selected from the R (or smoothed R) threshold.
References
Andries, J.V.M., Vander Heyden, Y., Buydens, L.M.C., 2011. Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity. Analytica Chimica Acta 705, 292-305. https://doi.org/10.1016/j.aca.2011.06.037
Bro, R., Kjeldahl, K., Smilde, A.K., Kiers, H.A.L., 2008. Cross-validation of component models: A critical look at current methods. Anal Bioanal Chem 390, 1241-1251. https://doi.org/10.1007/s00216-007-1790-1
Li, B., Morris, J., Martin, E.B., 2002. Model selection for partial least squares regression. Chemometrics and Intelligent Laboratory Systems 64, 79-89. https://doi.org/10.1016/S0169-7439(02)00051-5
Westad, F., Martens, H., 2000. Variable Selection in near Infrared Spectroscopy Based on Significance Testing in Partial Least Squares Regression. J. Near Infrared Spectrosc., JNIRS 8, 117–124.
Wold S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics. 1978;20(4):397-405
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
n = nro(Xtrain)
segm = segmts(n, 50; rep = 30)
model = plskern()
nlv = 0:20
res = gridcv(model, Xtrain, ytrain; segm, score = rmsep, nlv).res
res[res.y1 .== minimum(res.y1), :]
plotgrid(res.nlv, res.y1;xlabel = "Nb. LVs", ylabel = "RMSEP").f
zres = selwold(res.nlv, res.y1; smooth = true, graph = true) ;
@show zres.opt
@show zres.sel
zres.f
Jchemo.sep
— Methodsep(pred, Y)
Compute the corrected SEP ("SEP_c"), i.e. the standard deviation of the prediction errors.
pred
: Predictions.Y
: Observed data.
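As an indication, the corrected SEP can be sketched as follows (standard deviation of the prediction errors, here with an n - 1 denominator; the exact convention used in sep may differ):
using Statistics
y = rand(10)
pred = y .+ .1 * randn(10)
e = pred .- y                            ## prediction errors
sep_c = sqrt(sum((e .- mean(e)).^2) / (length(e) - 1))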
References
Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.-M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends in Analytical Chemistry 29, 1073–1081. https://doi.org/10.1016/j.trac.2010.05.006
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
sep(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
sep(pred, ytest)
Jchemo.snorm
— Methodsnorm()
snorm(X)
Row-wise norming of X-data.
X
: X-data (n, p).
Each row of X is divided by its norm.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = snorm()
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
@head rownorm(Xptrain)
@head rownorm(Xptest)
Jchemo.snv
— Methodsnv(; kwargs...)
snv(X; kwargs...)
Standard-normal-variate (SNV) transformation of each row of X-data.
X
: X-data (n, p).
Keyword arguments:
centr : Boolean indicating if the centering is done.
scal : Boolean indicating if the scaling is done.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
year = dat.Y.year
s = year .<= 2012
Xtrain = X[s, :]
Xtest = rmrow(X, s)
wlst = names(dat.X)
wl = parse.(Float64, wlst)
plotsp(X, wl; nsamp = 20).f
model = snv()
#model = snv(scal = false)
fit!(model, Xtrain)
Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)
plotsp(Xptrain).f
plotsp(Xptest).f
@head rowmean(Xptrain)
@head rowstd(Xptrain)
@head rowmean(Xptest)
@head rowstd(Xptest)
Jchemo.softmax
— Methodsoftmax(x::AbstractVector)
softmax(X::Union{Matrix, DataFrame})
Softmax function.
x
: A vector to transform.X
: A matrix whose rows are transformed.
Let v be a vector:
- softmax(v) = exp.(v) / sum(exp.(v))
Examples
using Jchemo
x = 1:3
softmax(x)
X = rand(5, 3)
softmax(X)
Jchemo.soplsr
— Methodsoplsr(; kwargs...)
soplsr(Xbl, Y; kwargs...)
soplsr(Xbl, Y, weights::Weight; kwargs...)
soplsr!(Xbl::Matrix, Y::Matrix, weights::Weight; kwargs...)
Multiblock sequentially orthogonalized PLSR (SO-PLSR).
Xbl
: List of blocks (vector of matrices) of X-data Typically, output of functionmblock
from (n, p) data.Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs = scores) to compute.scal
: Boolean. Iftrue
, each column of blocks inXbl
andY
is scaled by its uncorrected standard deviation.
References
Biancolillo et al., 2015. Combining SO-PLS and linear discriminant analysis for multi-block classification. Chemometrics and Intelligent Laboratory Systems, 141, 58-67.
Biancolillo, A. 2016. Method development in the area of multi-block analysis focused on food analysis. PhD. University of Copenhagen.
Menichelli et al., 2014. SO-PLS as an exploratory tool for path modelling. Food Quality and Preference, 36, 122-134.
Examples
using Jchemo, JchemoData, JLD2
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "ham.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
y = Y.c1
group = dat.group
listbl = [1:11, 12:19, 20:25]
s = 1:6
Xbltrain = mblock(X[s, :], listbl)
Xbltest = mblock(rmrow(X, s), listbl)
ytrain = y[s]
ytest = rmrow(y, s)
ntrain = nro(ytrain)
ntest = nro(ytest)
ntot = ntrain + ntest
(ntot = ntot, ntrain , ntest)
nlv = 2
#nlv = [2, 1, 2]
#nlv = [2, 0, 1]
scal = false
#scal = true
model = soplsr(; nlv, scal)
fit!(model, Xbltrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
@head transf(model, Xbltrain)
transf(model, Xbltest)
res = predict(model, Xbltest)
res.pred
rmsep(res.pred, ytest)
Jchemo.sourcedir
— Methodsourcedir(path)
Include all the files contained in a directory.
Jchemo.spca
— Methodspca(; kwargs...)
spca(X; kwargs...)
spca(X, weights::Weight; kwargs...)
spca!(X::Matrix, weights::Weight; kwargs...)
Sparse PCA (Shen & Huang 2008).
X
: X-data (n, p).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. principal components (PCs).meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each principal component (PC). Can be a single integer (i.e. same nb. of variables for each PC), or a vector of lengthnlv
.defl
: Type ofX
-matrix deflation, see below.tol
: Tolerance value for stopping the Nipals iterations.maxit
: Maximum nb. of Nipals iterations.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
sPCA-rSVD algorithm (regularized low rank matrix approximation) of Shen & Huang 2008.
The algorithm computes each loadings vector iteratively, by alternating least squares regressions (Nipals) including a step of thresholding. Function spca
provides thresholding methods '1' and '2' reported in Shen & Huang 2008 Lemma 2 (:soft
and :hard
):
- The tuning parameter used by Shen & Huang 2008 is the number of null elements in the loadings vector, referred to as the degree of sparsity. Conversely, the present function spca uses the number of non-zero elements (nvar), equal to p - degree of sparsity.
- See the code of function snipals_shen for details on how the cutoff 'lambda' used inside the thresholding function (Shen & Huang 2008) is computed, given a value for nvar. Differences from other software may occur when there are tied values in the loadings vector (depending on the method chosen to compute quantiles). A minimal illustration of the thresholding step is given below.
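As an indication, the thresholding step on a loadings vector can be sketched as follows (keep the nvar largest absolute values, then shrink them with :soft or keep them with :hard; this is only an illustration, see snipals_shen for the actual computation):
v = [0.8, -0.1, 0.45, 0.05, -0.6]
nvar = 2
delta = sort(abs.(v); rev = true)[nvar + 1]       ## cutoff 'lambda'
v_soft = sign.(v) .* max.(0, abs.(v) .- delta)    ## meth = :soft
v_hard = ifelse.(abs.(v) .> delta, v, 0.0)        ## meth = :hard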
Matrix X can be deflated in two ways:
- defl = :v : Matrix X is deflated by regression of the X'-columns on the loadings vector v. This is the method proposed by Shen & Huang 2008 (see Theorem A.2 p.1033).
- defl = :t : Matrix X is deflated by regression of the X-columns on the score vector t. This is the method used in function spca of the R package mixOmics (Lê Cao et al. 2016).
The method of computation of the % variance explained in X by each PC (returned by function summary
) depends on the type of deflation chosen (see the code).
References
Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
Shen, H., Huang, J.Z., 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015–1034. https://doi.org/10.1016/j.jmva.2007.06.007
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/iris.jld2")
@load db dat
@names dat
@head dat.X
X = dat.X[:, 1:4]
n = nro(X)
ntest = 30
s = samprand(n, ntest)
Xtrain = X[s.train, :]
Xtest = X[s.test, :]
nlv = 3
meth = :soft
#meth = :hard
nvar = 2
model = spca(; nlv, meth, nvar) ;
fit!(model, Xtrain)
fitm = model.fitm ;
@names fitm
fitm.niter
fitm.sellv
fitm.sel
V = fitm.V
V' * V
@head T = fitm.T
T' * T
@head transf(model, Xtrain)
@head Ttest = transf(fitm, Xtest)
res = summary(model, Xtrain) ;
res.explvarx
Jchemo.spcr
— Methodspcr(; kwargs...)
spcr(X, Y; kwargs...)
spcr(X, Y, weights::Weight; kwargs...)
spcr!(X::Matrix, Y::Matrix, weights::Weight; kwargs...)
Sparse principal component regression (sPCR).
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
- Same as function
spca
.
Regression on scores computed from a sparse PCA (sPCA-rSVD algorithm of Shen & Huang 2008 ). See function spca
for details.
References
Shen, H., Huang, J.Z., 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015–1034. https://doi.org/10.1016/j.jmva.2007.06.007
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
meth = :soft
#meth = :hard
nvar = 20
model = spcr(; nlv, meth, nvar, defl = :t) ;
fit!(model, Xtrain, ytrain)
@names model
fitm = model.fitm ;
@names fitm
@head fitm.fitm.T
@head transf(model, X)
@head fitm.fitm.V
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = summary(model, Xtrain) ;
@names res
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.splskdeda
— Methodsplskdeda(; kwargs...)
splskdeda(X, y; kwargs...)
splskdeda(X, y, weights::Weight; kwargs...)
Sparse PLS-KDE-DA.
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each LV. Can be a single integer (i.e. same nb. of variables for each LV), or a vector of lengthnlv
.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).- Keyword arguments of function
dmkern
(bandwidth definition) can also be specified here. scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plskdeda
(PLS-KDEDA) except that a sparse PLSR (function splsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
See function splslda
for examples.
Jchemo.splslda
— Methodsplslda(; kwargs...)
splslda(X, y; kwargs...)
splslda(X, y, weights::Weight; kwargs...)
Sparse PLS-LDA.
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each LV. Can be a single integer (i.e. same nb. of variables for each LV), or a vector of lengthnlv
.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plslda
(PLSR-LDA) except that a sparse PLSR (function splsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
meth = :soft
nvar = 10
model = splslda(; nlv, meth, nvar)
#model = splsqda(; nlv, meth, nvar, alpha = .1)
#model = splskdeda(; nlv, meth, nvar, a = .9)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
embfitm = fitm.fitm.embfitm ;
@head embfitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(embfitm)
summary(embfitm, Xtrain)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
Jchemo.splsqda
— Methodsplsqda(; kwargs...)
splsqda(X, y; kwargs...)
splsqda(X, y, weights::Weight; kwargs...)
Sparse PLS-QDA (with continuum).
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute. Must be >= 1.meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each LV. Can be a single integer (i.e. same nb. of variables for each LV), or a vector of lengthnlv
.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).alpha
: Scalar (∈ [0, 1]) defining the continuum between QDA (alpha = 0
) and LDA (alpha = 1
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation in the PLS computation.
Same as function plsqda
(PLSR-QDA) except that a sparse PLSR (function splsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
See function splslda
for examples.
Jchemo.splsr
— Methodsplsr(; kwargs...)
splsr(X, Y; kwargs...)
splsr(X, Y, weights::Weight; kwargs...)
splsr!(X::Matrix, Y::Union{Matrix, BitMatrix}, weights::Weight; kwargs...)
Sparse partial least squares regression (Lê Cao et al. 2008)
X
: X-data (n, p).Y
: Y-data (n, q).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each LV. Can be a single integer (i.e. same nb. of variables for each LV), or a vector of lengthnlv
.scal
: Boolean. Iftrue
, each column ofX
andY
is scaled by its uncorrected standard deviation.
Adaptation of the sparse partial least squares regression algorithm of Lê Cao et al. 2008. The fast "improved kernel algorithm #1" of Dayal & MacGregor (1997) is used instead of Nipals.
In the present version of splsr
, the sparse thresholding only concerns X
. The function provides two thresholding methods to compute the sparse X
-loading weights w (:soft
and :hard
), see function spca
for description.
The case meth = :soft
returns the same results as function spls
of the R package mixOmics (Lê Cao et al.) with the regression mode and without sparseness on Y
.
The COVSEL regression method described in Roger et al 2011 (see also Höskuldsson 1992) is implemented by setting nvar = 1
.
References
Dayal, B.S., MacGregor, J.F., 1997. Improved PLS algorithms. Journal of Chemometrics 11, 73-85.
Höskuldsson, A., 1992. The H-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, Proceedings of the 2nd Scandinavian Symposium on Chemometrics 14, 139–153. https://doi.org/10.1016/0169-7439(92)80099-P
Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., Besse, P., 2008. A Sparse PLS for Variable Selection when Integrating Omics Data. Statistical Applications in Genetics and Molecular Biology 7. https://doi.org/10.2202/1544-6115.1390
Kim-Anh Lê Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
Package mixOmics on Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
Roger, J.M., Palagos, B., Bertrand, D., Fernandez-Ahumada, E., 2011. covsel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chem. Lab. Int. Syst. 106, 216-223.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
nlv = 15
meth = :soft
#meth = :hard
nvar = 20
model = splsr(; nlv, meth, nvar) ;
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
@head model.fitm.T
@head model.fitm.W
coef(model)
coef(model; nlv = 3)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
res = summary(model, Xtrain) ;
@names res
z = res.explvarx
plotgrid(z.nlv, z.cumpvar; step = 2, xlabel = "Nb. LVs",
ylabel = "Prop. Explained X-Variance").f
Jchemo.splsrda
— Methodsplsrda(; kwargs...)
splsrda(X, y; kwargs...)
splsrda(X, y, weights::Weight; kwargs...)
Sparse PLSR-DA.
X
: X-data (n, p).y
: Univariate class membership (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Keyword arguments:
nlv
: Nb. latent variables (LVs) to compute.meth
: Method used for the sparse thresholding. Possible values are::soft
,:hard
. See thereafter.nvar
: Nb. variables (X
-columns) selected for each LV. Can be a single integer (i.e. same nb. of variables for each LV), or a vector of lengthnlv
.prior
: Type of prior probabilities for class membership. Possible values are::unif
(uniform),:prop
(proportional), or a vector (of length equal to the number of classes) giving the prior weight for each class (in case of vector, it must be sorted in the same order asmlev(y)
).scal
: Boolean. Iftrue
, each column ofX
and Ydummy is scaled by its uncorrected standard deviation.
Same as function plsrda
(PLSR-DA) except that a sparse PLSR (function splsr
), instead of a PLSR (function plskern
), is run on the Y-dummy table.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
nlv = 15
meth = :soft
nvar = 10
model = splsrda(; nlv, meth, nvar)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
@head fitm.fitm.T
@head transf(model, Xtrain)
@head transf(model, Xtest)
@head transf(model, Xtest; nlv = 3)
coef(fitm.fitm)
res = predict(model, Xtest) ;
@names res
@head res.posterior
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
predict(model, Xtest; nlv = 1:2).pred
summary(fitm.fitm, Xtrain)
Jchemo.ssr
— Methodssr(pred, Y)
Compute the sum of squared prediction errors (SSR).
pred
: Predictions.Y
: Observed data.
Examples
using Jchemo
Xtrain = rand(10, 5)
Ytrain = rand(10, 2)
ytrain = Ytrain[:, 1]
Xtest = rand(4, 5)
Ytest = rand(4, 2)
ytest = Ytest[:, 1]
model = plskern(; nlv = 2)
fit!(model, Xtrain, Ytrain)
pred = predict(model, Xtest).pred
ssr(pred, Ytest)
model = plskern(; nlv = 2)
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
ssr(pred, ytest)
Jchemo.stdv
— Methodstdv(x)
stdv(x, weights::Weight)
Compute the uncorrected standard deviation of a vector.
x
: A vector (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Examples
using Jchemo
n = 1000
x = rand(n)
w = mweight(rand(n))
stdv(x)
stdv(x, w)
Jchemo.summ
— Methodsumm(X; digits = 3)
summ(X, y; digits = 3)
Summarize a dataset (or a variable).
X
: A dataset (n, p).y
: A categorical variable (n) (class membership).digits
: Nb. digits in the outputs.
Examples
using Jchemo
n = 50
X = rand(n, 3)
y = rand(1:3, n)
res = summ(X)
@names res
summ(X[:, 2]).res
summ(X, y)
Jchemo.sumv
— Methodsumv(x)
sumv(x, weights::Weight)
Compute the sum of a vector.
x
: A vector (n).weights
: Weights (n) of the observations. Must be of typeWeight
(see e.g. functionmweight
).
Examples
using Jchemo
n = 100
x = rand(n)
w = mweight(rand(n))
sumv(x)
sumv(x, w)
Jchemo.svmda
— Methodsvmda(; kwargs...)
svmda(X, y; kwargs...)
Support vector machine for discrimination "C-SVC" (SVM-DA).
X
: X-data (n, p).y
: Univariate class membership (n).
Keyword arguments:
kern
: Type of kernel used to compute the Gram matrices. Possible values are::krbf
,:kpol
,:klin
,:ktanh
. See below.gamma
:kern
parameter, see below.degree
:kern
parameter, see below.coef0
:kern
parameter, see below.cost
: Cost of constraints violation C parameter.epsilon
: Epsilon parameter in the loss function.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Kernel types:
- :krbf – radial basis function: exp(-gamma * ||x - y||^2)
- :kpol – polynomial: (gamma * x' * y + coef0)^degree
- "klin – linear: x' * y
- :ktan – sigmoid: tanh(gamma * x' * y + coef0)
The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl) that is an interface to library LIBSVM (Chang & Li 2001).
References
Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n = nro(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
kern = :krbf ; gamma = 1e4
cost = 1000 ; epsilon = .5
model = svmda(; kern, gamma, cost, epsilon)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.svmr
— Methodsvmr(; kwargs...)
svmr(X, y; kwargs...)
Support vector machine for regression (Epsilon-SVR).
X
: X-data (n, p).y
: Univariate y-data (n).
Keyword arguments:
kern
: Type of kernel used to compute the Gram matrices. Possible values are::krbf
,:kpol
,:klin
,:ktanh
. See below.gamma
:kern
parameter, see below.coef0
:kern
parameter, see below.degree
:kern
parameter, see below.cost
: Cost of constraints violation C parameter.epsilon
: Epsilon parameter in the loss function.scal
: Boolean. Iftrue
, each column ofX
is scaled by its uncorrected standard deviation.
Kernel types:
- :krbf – radial basis function: exp(-gamma * ||x - y||^2)
- :kpol – polynomial: (gamma * x' * y + coef0)^degree
- "klin – linear: x' * y
- :ktan – sigmoid: tanh(gamma * x' * y + coef0)
The function uses LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl) that is an interface to library LIBSVM (Chang & Li 2001).
References
Julia package LIBSVM.jl: https://github.com/JuliaML/LIBSVM.jl
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Detailed documentation (algorithms, formulae, ...) can be found in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Schölkopf, B., Smola, A.J., 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass.
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
kern = :krbf ; gamma = .1
cost = 1000 ; epsilon = 1
model = svmr(; kern, gamma, cost, epsilon)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
####### Example of fitting the function sinc(x)
####### described in Rosipal & Trejo 2001 p. 105-106
x = collect(-10:.2:10)
x[x .== 0] .= 1e-5
n = length(x)
zy = sin.(abs.(x)) ./ abs.(x)
y = zy + .2 * randn(n)
kern = :krbf ; gamma = .1
model = svmr(; kern, gamma)
fit!(model, x, y)
pred = predict(model, x).pred
f, ax = scatter(x, y)
lines!(ax, x, zy, label = "True model")
lines!(ax, x, vec(pred), label = "Fitted model")
axislegend("Method")
f
Jchemo.tab
— Methodtab(X::AbstractArray)
tab(X::DataFrame; vargroup = nothing)
Tabulation of categorical variables.
x
: Categorical variable or dataset containing categorical variable(s).
Specific for a dataset:
vargroup
: Vector of the names of the group variables to consider inX
(by default: all the columns ofX
).
The output contains the sorted levels.
Examples
using Jchemo, DataFrames
x = rand(["a"; "b"; "c"], 20)
res = tab(x)
res.keys
res.vals
n = 20
X = hcat(rand(1:2, n), rand(["a", "b", "c"], n))
df = DataFrame(X, [:v1, :v2])
tab(X[:, 2])
tab(string.(X))
tab(df)
tab(df; vargroup = [:v1, :v2])
tab(df; vargroup = :v2)
Jchemo.tabdupl
— Methodtabdupl(x)
Tabulate duplicated values in a vector.
x
: Categorical variable.
Examples
using Jchemo
x = ["a", "b", "c", "a", "b", "b"]
tab(x)
res = tabdupl(x)
res.keys
res.vals
Jchemo.thresh_hard
— Methodthresh_hard(x::Real, delta)
Hard thresholding function.
x
: Value to transform.delta
: Range for the thresholding.
The returned value is:
- abs(
x
) >delta
?x
: 0
where delta >= 0.
Examples
using Jchemo, CairoMakie
delta = .7
thresh_hard(3, delta)
x = LinRange(-2, 2, 500)
y = thresh_hard.(x, delta)
lines(x, y; axis = (xlabel = "x", ylabel = "f(x)"))
Jchemo.thresh_soft
— Methodthresh_soft(x::Real, delta)
Soft thresholding function.
x
: Value to transform.delta
: Range for the thresholding.
The returned value is:
- sign(
x
) * max(0, abs(x
) -delta
)
where delta >= 0.
Examples
using Jchemo, CairoMakie
delta = .7
thresh_soft(3, delta)
x = LinRange(-2, 2, 100)
y = thresh_soft.(x, delta)
lines(x, y; axis = (xlabel = "x", ylabel = "f(x)"))
Jchemo.transf
— Methodtransf(object::Blockscal, Xbl)
transf!(object::Blockscal, Xbl)
Compute the preprocessed data from a model.
object
: The fitted model.Xbl
: A list of blocks (vector of matrices) of X-data for which LVs are computed.
Jchemo.transf
— Methodtransf(object::Center, X)
transf!(object::Center, X::Matrix)
Compute the preprocessed data from a model.
object
: Model.X
: X-data to transform.
Jchemo.transf
— Method transf(object::Comdim, Xbl; nlv = nothing)
transfbl(object::Comdim, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf
— Method transf(object::Cscale, X)
transf!(object::Cscale, X::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::DetrendAirpls, X)
transf!(object::DetrendAirpls, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::DetrendArpls, X)
transf!(object::DetrendArpls, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::DetrendAsls, X)
transf!(object::DetrendAsls, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::DetrendLo, X)
transf!(object::DetrendLo, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::DetrendPol, X)
transf!(object::DetrendPol, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Dkplsr, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Fdif, X)
transf!(object::Fdif, X::Matrix, M::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
M : Pre-allocated output matrix (n, p - npoint + 1).
The in-place function stores the output in M.
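A minimal sketch of the out-of-place and in-place forms (simulated data; it assumes the fdif constructor and its npoint keyword, and that model.fitm holds the fitted Fdif object):
using Jchemo
n, p = 5, 10
X = rand(n, p)
npoint = 2
model = fdif(; npoint)
fit!(model, X)
Xp = transf(model, X)               # size (n, p - npoint + 1)
M = similar(X, n, p - npoint + 1)   # pre-allocated output
transf!(model.fitm, X, M)           # stores the result in M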
Jchemo.transf
— Method transf(object::Interpl, X)
transf!(object::Interpl, X::Matrix, M::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
M : Pre-allocated output matrix (n, p).
The in-place function stores the output in M.
Jchemo.transf
— Method transf(object::Kpca, X; nlv = nothing)
Compute PCs (scores T) from a fitted model.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
Jchemo.transf
— Method transf(object::Kplsr, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Mavg, X)
transf!(object::Mavg, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Mbconcat, Xbl)
Compute the preprocessed data from a model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
Jchemo.transf
— Method transf(object::Mbpca, Xbl; nlv = nothing)
transfbl(object::Mbpca, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf
— Method transf(object::Mbplsprobda, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Mbplsrda, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf
— Method transf(object::Plsprobda, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Plsrda, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Rmgap, X)
transf!(object::Rmgap, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Rosaplsr, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf
— Method transf(object::Rp, X; nlv = nothing)
Compute scores T from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which scores T are computed.
nlv : Nb. scores to compute.
Jchemo.transf
— Method transf(object::Savgol, X)
transf!(object::Savgol, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Scale, X)
transf!(object::Scale, X::Matrix)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Snorm, X)
transf!(object::Snorm, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Snv, X)
transf!(object::Snv, X)
Compute the preprocessed data from a model.
object : Model.
X : X-data to transform.
Jchemo.transf
— Method transf(object::Soplsr, Xbl)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
Jchemo.transf
— Method transf(object::Spca, X; nlv = nothing)
Compute principal components (PCs = scores T) from a fitted model and X-data.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
Jchemo.transf
— Method transf(object::Umap, X)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
Jchemo.transf
— Method transf(object::Union{Pca, Fda}, X; nlv = nothing)
Compute principal components (PCs = scores T) from a fitted model and X-data.
object : The fitted model.
X : X-data for which PCs are computed.
nlv : Nb. PCs to compute.
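A minimal usage sketch for a PCA model (simulated data; the same pattern as in the xfit examples further below):
using Jchemo
Xtrain = rand(10, 4)
Xnew = rand(3, 4)
nlv = 2
model = pcasvd(; nlv)
fit!(model, Xtrain)
fitm = model.fitm ;
@head transf(fitm, Xtrain)     # scores T of the training data
transf(fitm, Xnew)             # scores of new observations
transf(fitm, Xnew; nlv = 1)    # only the first PC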
Jchemo.transf
— Method transf(object::Union{Mbplsr, Mbplswest}, Xbl; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
Xbl : A list of blocks (vector of matrices) of X-data for which LVs are computed.
nlv : Nb. LVs to compute.
Jchemo.transf
— Method transf(object::Union{Pcr, Spcr}, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model and a matrix X.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transf
— Method transf(object::Union{Plsr, Splsr}, X; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : Matrix (m, p) for which LVs are computed.
nlv : Nb. LVs to consider.
Jchemo.transfbl
— Method transfbl(object::Cca, X, Y; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl
— Method transfbl(object::Ccawold, X, Y; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl
— Method transfbl(object::Plscan, X, Y; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl
— Method transfbl(object::Plstuck, X, Y; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
Jchemo.transfbl
— Method transfbl(object::Rasvd, X, Y; nlv = nothing)
Compute latent variables (LVs = scores) from a fitted model.
object : The fitted model.
X : X-data for which components (LVs) are computed.
Y : Y-data for which components (LVs) are computed.
nlv : Nb. LVs to compute.
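A minimal sketch for a two-block model (simulated data; it assumes the cca constructor listed in this index):
using Jchemo
X = rand(20, 5)
Y = rand(20, 3)
model = cca(nlv = 2)
fit!(model, X, Y)
fitm = model.fitm ;
transfbl(fitm, X, Y)             # LVs computed for X and for Y
transfbl(fitm, X, Y; nlv = 1)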
Jchemo.treeda
— Method treeda(; kwargs...)
treeda(X, y; kwargs...)
Discrimination tree (CART) with DecisionTree.jl.
X : X-data (n, p).
y : Univariate class membership (n).
Keyword arguments:
n_subfeatures : Nb. variables to select at random at each split (default: 0 ==> keep all).
max_depth : Maximum depth of the decision tree (default: -1 ==> no maximum).
min_sample_leaf : Minimum number of samples each leaf needs to have.
min_sample_split : Minimum number of observations needed for a split.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The function fits a single discrimination tree (CART) using the package DecisionTree.jl.
References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using Jchemo, JchemoData, JLD2
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
n, p = size(X)
s = Bool.(Y.test)
Xtrain = rmrow(X, s)
ytrain = rmrow(Y.typ, s)
Xtest = X[s, :]
ytest = Y.typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = n, ntrain, ntest)
tab(ytrain)
tab(ytest)
n_subfeatures = p / 3
max_depth = 10
model = treeda(; n_subfeatures, max_depth)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
fitm = model.fitm ;
fitm.lev
fitm.ni
res = predict(model, Xtest) ;
@names res
@head res.pred
errp(res.pred, ytest)
conf(res.pred, ytest).cnt
Jchemo.treer
— Method treer(; kwargs...)
treer(X, y; kwargs...)
Regression tree (CART) with DecisionTree.jl.
X : X-data (n, p).
y : Univariate y-data (n).
Keyword arguments:
n_subfeatures : Nb. variables to select at random at each split (default: 0 ==> keep all).
max_depth : Maximum depth of the decision tree (default: -1 ==> no maximum).
min_sample_leaf : Minimum number of samples each leaf needs to have.
min_sample_split : Minimum number of observations needed for a split.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The function fits a single regression tree (CART) using the package DecisionTree.jl.
References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.
DecisionTree.jl https://github.com/JuliaAI/DecisionTree.jl
Gey, S., 2002. Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression (These de doctorat). Paris 11. http://www.theses.fr/2002PA112245
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/cassav.jld2")
@load db dat
@names dat
X = dat.X
y = dat.Y.tbc
year = dat.Y.year
tab(year)
s = year .<= 2012
Xtrain = X[s, :]
ytrain = y[s]
Xtest = rmrow(X, s)
ytest = rmrow(y, s)
p = nco(X)
n_subfeatures = p / 3
max_depth = 15
model = treer(; n_subfeatures, max_depth)
fit!(model, Xtrain, ytrain)
@names model
@names model.fitm
res = predict(model, Xtest)
@head res.pred
@show rmsep(res.pred, ytest)
plotxy(res.pred, ytest; color = (:red, .5), bisect = true, xlabel = "Prediction",
ylabel = "Observed").f
Jchemo.umap
— Method umap(; kwargs...)
umap(X; kwargs...)
UMAP: Uniform manifold approximation and projection for dimension reduction.
X : X-data (n, p).
Keyword arguments:
nlv : Nb. latent variables (LVs) to compute.
psamp : Proportion of sampling in X for training.
n_neighbors : Nb. approximate neighbors used to construct the initial high-dimensional graph.
min_dist : Minimum distance between points in low-dimensional space.
scal : Boolean. If true, each column of X is scaled by its uncorrected standard deviation.
The function fits a UMAP dimension reduction using the package UMAP.jl. The metric used is the Euclidean distance.
If psamp < 1, only a proportion psamp of the observations (rows of X) is used to build the model (systematic sampling over the first score of the PCA of X). This can be used to decrease computation time when n is large.
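A minimal sketch of sub-sampled training (hypothetical parameter values; simulated data):
using Jchemo
X = rand(1000, 10)
model = umap(; nlv = 2, psamp = .3, n_neighbors = 15, min_dist = .5)
fit!(model, X)           # the projection is built on about 30% of the rows of X
T = transf(model, X)     # scores for all rows of X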
References
https://github.com/dillondaudert/UMAP.jl
McInnes, L, Healy, J, Melville, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiV 1802.03426, 2018 https://arxiv.org/abs/1802.03426
https://umap-learn.readthedocs.io/en/latest/howumapworks.html
https://pair-code.github.io/understanding-umap/
Examples
using Jchemo, JchemoData
using JLD2, GLMakie, CairoMakie, FreqTables
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "challenge2018.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
wlst = names(X)
wl = parse.(Float64, wlst)
ntot = nro(X)
summ(Y)
typ = Y.typ
test = Y.test
y = Y.conc
model1 = snv()
model2 = savgol(npoint = 21, deriv = 2, degree = 3)
model = pip(model1, model2)
fit!(model, X)
@head Xp = transf(model, X)
plotsp(Xp, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance", nsamp = 20).f
s = Bool.(test)
Xtrain = rmrow(Xp, s)
Ytrain = rmrow(Y, s)
ytrain = rmrow(y, s)
typtrain = rmrow(typ, s)
Xtest = Xp[s, :]
Ytest = Y[s, :]
ytest = y[s]
typtest = typ[s]
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = ntot, ntrain, ntest)
freqtable(string.(typ, "-", Y.label))
freqtable(typ, test)
#################
nlv = 3
n_neighbors = 50 ; min_dist = .5
model = umap(; nlv, n_neighbors, min_dist)
fit!(model, Xtrain)
@head T = model.fitm.T
@head Ttest = transf(model, Xtest)
GLMakie.activate!()
#CairoMakie.activate!()
lev = mlev(typtrain)
nlev = length(lev)
colsh = :tab10
colm = cgrad(colsh, nlev; alpha = .7, categorical = true)
ztyp = recod_catbyint(typtrain)
f = Figure()
i = 1
ax = Axis3(f[1, 1], xlabel = string("LV", i), ylabel = string("LV", i + 1),
zlabel = string("LV", i + 2), title = "UMAP", perspectiveness = 0)
scatter!(ax, T[:, i], T[:, i + 1], T[:, i + 2]; markersize = 8,
color = ztyp, colormap = colm)
scatter!(ax, Ttest[:, i], Ttest[:, i + 1], Ttest[:, i + 2], color = :black,
markersize = 10)
elt = [MarkerElement(color = colm[i], marker = '●', markersize = 10) for i in 1:nlev]
#elt = [PolyElement(polycolor = colm[i]) for i in 1:nlev]
title = "Group"
Legend(f[1, 2], elt, lev, title; nbanks = 1, rowgap = 10, framevisible = false)
f
Jchemo.varv
— Method varv(x)
varv(x, weights::Weight)
Compute the uncorrected variance of a vector.
x : A vector (n).
weights : Weights (n) of the observations. Must be of type Weight (see e.g. function mweight).
Examples
using Jchemo
n = 1000
x = rand(n)
w = mweight(rand(n))
varv(x)
varv(x, w)
Jchemo.vcatdf
— Method vcatdf(dat; cols = :intersect)
Vertical concatenation of a list of dataframes.
dat : List (vector) of dataframes.
cols : Determines the columns of the returned dataframe. See ?DataFrames.vcat.
Examples
using Jchemo, DataFrames
dat1 = DataFrame(rand(5, 2), [:v3, :v1])
dat2 = DataFrame(100 * rand(2, 2), [:v3, :v1])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
dat2 = DataFrame(100 * rand(2, 2), [:v1, :v3])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
dat2 = DataFrame(100 * rand(2, 3), [:v3, :v1, :a])
dat = (dat1, dat2)
Jchemo.vcatdf(dat)
Jchemo.vcatdf(dat; cols = :union)
Jchemo.vcol
— Method vcol(X::AbstractMatrix, j)
vcol(X::DataFrame, j)
vcol(x::Vector, j)
View of the j-th column(s) of a matrix X, or of the j-th element(s) of vector x.
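A minimal usage sketch (simulated data):
using Jchemo
X = rand(5, 3)
vcol(X, 2)          # view of the 2nd column
vcol(X, 2:3)        # view of columns 2 and 3
x = collect(1:10)
vcol(x, [1, 3, 5])  # view of elements 1, 3 and 5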
Jchemo.vip
— Method vip(object::Union{Plsr, Pcr, Splsr, Spcr}; nlv = nothing)
vip(object::Union{Plsr, Pcr, Splsr, Spcr}, Y; nlv = nothing)
Variable importance on projections (VIP).
object : The fitted model.
Y : The Y-data that was used to fit the model.
Keyword arguments:
nlv : Nb. latent variables (LVs) to consider. If nothing, the maximal model is considered.
For a PLS model (or PCR, etc.) fitted on (X, Y) with A latent variables, and for variable xj (column j of X):
- VIP(xj) = Sum.a(1,...,A) R2(Yc, ta) waj^2 / Sum.a(1,...,A) R2(Yc, ta) (1 / p)
where:
- Yc is the centered Y,
- ta is the a-th X-score,
- R2(Yc, ta) is the proportion of Yc-variance explained by ta, i.e. ||Yc.hat||^2 / ||Yc||^2 (where Yc.hat is the LS estimate of Yc by ta).
When Y is used, R2(Yc, ta) is replaced by the redundancy Rd(Yc, ta) (see function rd), as in Tenenhaus 1998 p. 139.
References
Chong, I.-G., Jun, C.-H., 2005. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78, 103–112. https://doi.org/10.1016/j.chemolab.2004.12.011
Mehmood, T., Sæbø, S., Liland, K.H., 2020. Comparison of variable selection methods in partial least squares regression. Journal of Chemometrics 34, e3226. https://doi.org/10.1002/cem.3226
Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.
Examples
using Jchemo
X = [1. 2 3 4; 4 1 6 7; 12 5 6 13; 27 18 7 6 ; 12 11 28 7]
Y = [10. 11 13; 120 131 27; 8 12 4; 1 200 8; 100 10 89]
y = Y[:, 1]
ycla = [1; 1; 1; 2; 2]
nlv = 3
model = plskern(; nlv)
fit!(model, X, y)
res = vip(model.fitm)
@names res
res.imp
fit!(model, X, Y)
vip(model.fitm).imp
vip(model.fitm, Y).imp
## For PLSDA
model = plsrda(; nlv)
fit!(model, X, ycla)
@names model.fitm
fitm = model.fitm.fitm ; # fitted PLS model
vip(fitm).imp
Ydummy = dummy(ycla).Y
vip(fitm, Ydummy).imp
model = plslda(; nlv)
fit!(model, X, ycla)
@names model.fitm.fitm
fitm = model.fitm.fitm.embfitm ; # fitted PLS model
vip(fitm).imp
vip(fitm, Ydummy).imp
Jchemo.viperm
— Method viperm(model, X, Y; rep = 50, psamp = .3, score = rmsep)
Variable importance by direct permutations.
model : Model to evaluate.
X : X-data (n, p).
Y : Y-data (n, q).
Keyword arguments:
rep : Number of replications of the training/test splitting.
psamp : Proportion of data used as test set to compute the score.
score : Function computing the prediction score.
The principle is as follows:
- Data (X, Y) are split randomly into a training and a test set.
- The model is fitted on Xtrain, and the score (error rate) is computed on Xtest. This gives the reference error rate.
- Rows of a given variable (feature) j in Xtest are randomly permuted (the rest of Xtest is unchanged). The score is computed on this permuted Xtest (i.e. Xtest after the rows of variable j have been permuted). The importance of variable j is computed as the difference between this score and the reference score.
- This process is run for each variable j separately and replicated rep times. Average results are provided in the outputs, as well as the results per replication.
In general, this method returns results similar to those of the out-of-bag permutation method used in random forests (Breiman, 2001).
References
Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J.V., Munck, L., Engelsen, S.B., 2000. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419. https://doi.org/10.1366/0003702001949500
Examples
using Jchemo, JchemoData, JLD2, CairoMakie
mypath = dirname(dirname(pathof(JchemoData)))
db = joinpath(mypath, "data", "tecator.jld2")
@load db dat
@names dat
X = dat.X
Y = dat.Y
wl_str = names(X)
wl = parse.(Float64, wl_str)
ntot, p = size(X)
typ = Y.typ
namy = names(Y)[1:3]
plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
s = typ .== "train"
Xtrain = X[s, :]
Ytrain = Y[s, namy]
Xtest = rmrow(X, s)
Ytest = rmrow(Y[:, namy], s)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
ntot = ntrain + ntest
(ntot = ntot, ntrain, ntest)
## Work on the j-th y-variable
j = 2
nam = namy[j]
ytrain = Ytrain[:, nam]
ytest = Ytest[:, nam]
model = plskern(nlv = 9)
res = viperm(model, Xtrain, ytrain; rep = 50, score = rmsep) ;
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1]; xlabel = "Wavelength (nm)", ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
model = rfr(n_trees = 10, max_depth = 2000, min_samples_leaf = 5)
res = viperm(model, Xtrain, ytrain; rep = 50)
z = vec(res.imp)
f = Figure(size = (500, 400))
ax = Axis(f[1, 1];
xlabel = "Wavelength (nm)",
ylabel = "Importance")
scatter!(ax, wl, vec(z); color = (:red, .5))
u = [910; 950]
vlines!(ax, u; color = :grey, linewidth = 1)
f
Jchemo.vrow
— Method vrow(X::AbstractMatrix, i)
vrow(X::DataFrame, i)
vrow(x::Vector, i)
View of the i-th row(s) of a matrix X, or of the i-th element(s) of vector x.
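A minimal usage sketch (simulated data):
using Jchemo
X = rand(5, 3)
vrow(X, 1)        # view of the 1st row
vrow(X, 1:2)      # view of rows 1 and 2
x = collect(1:10)
vrow(x, 3)        # view of the 3rd element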
Jchemo.wdis
— Method wdis(d; typw = :bisquare, alpha = 0)
Different functions to compute weights from distances.
d : Vector of distances.
Keyword arguments:
typw : Define the weight function.
alpha : Parameter of the weight function, see below.
The returned weight vector is:
- w = f(d / q), where f is the weight function and q the 1 - alpha quantile of d (Cleveland & Grosse 1991).
Possible values for typw are:
- :bisquare: w = (1 - d^2)^2
- :cauchy: w = 1 / (1 + d^2)
- :epan: w = 1 - d^2
- :fair: w = 1 / (1 + d)^2
- :invexp: w = exp(-d)
- :invexp2: w = exp(-d / 2)
- :gauss: w = exp(-d^2)
- :trian: w = 1 - d
- :tricube: w = (1 - d^3)^3
References
Cleveland, W.S., Grosse, E., 1991. Computational methods for local regression. Stat Comput 1, 47–62. https://doi.org/10.1007/BF01890836
Examples
using Jchemo, CairoMakie, Distributions
d = sort(sqrt.(rand(Chi(1), 1000)))
colm = cgrad(:tab10, collect(1:9)) ;
alpha = 0
f = Figure(size = (600, 500))
ax = Axis(f, xlabel = "d", ylabel = "Weight")
typw = :bisquare
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[1])
typw = :cauchy
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[2])
typw = :epan
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[3])
typw = :fair
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[4])
typw = :gauss
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[5])
typw = :trian
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[6])
typw = :invexp
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[7])
typw = :invexp2
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[8])
typw = :tricube
w = wdis(d; typw, alpha)
lines!(ax, d, w, label = String(typw), color = colm[9])
axislegend("Function", position = :lb)
f[1, 1] = ax
f
Jchemo.winvs
— Method winvs(d; h = 2, criw = 4, squared = false)
winvs!(d; h = 2, criw = 4, squared = false)
Compute weights from distances using an inverse scaled exponential function.
d : A vector of distances.
Keyword arguments:
h : A scaling positive scalar defining the shape of the weight function.
criw : A positive scalar defining outliers in the distances vector d.
squared : If true, distances are replaced by the squared distances; the weight function is then a Gaussian (RBF) kernel function.
Weights are computed by:
- w = exp(-d / (h * MAD(d)))
or are set to 0 for extreme (potentially outlier) distances such that d > Median(d) + criw * MAD(d). This is an adaptation of the weight function presented in Kim et al. 2011.
The weights decrease as distances increase: the lower h, the sharper the decrease.
References
Kim S, Kano M, Nakagawa H, Hasebe S. Estimation of active pharmaceutical ingredients content using locally weighted partial least squares and statistical wavelength selection. Int J Pharm. 2011; 421(2):269-274. https://doi.org/10.1016/j.ijpharm.2011.10.007
Examples
using Jchemo, CairoMakie, Distributions
x1 = rand(Chisq(10), 100) ;
x2 = rand(Chisq(40), 10) ;
d = [sqrt.(x1) ; sqrt.(x2)]
h = 2 ; criw = 3
w = winvs(d; h, criw) ;
f = Figure(size = (600, 300))
ax1 = Axis(f, xlabel = "Distance", ylabel = "Nb. observations")
hist!(ax1, d, bins = 30)
ax2 = Axis(f, xlabel = "Distance", ylabel = "Weight")
scatter!(ax2, d, w)
f[1, 1] = ax1
f[1, 2] = ax2
f
d = collect(0:.5:15) ;
h = [.5, 1, 1.5, 2.5, 5, 10, Inf]
#h = [1, 2, 5, Inf]
w = winvs(d; h = h[1])
f = Figure(size = (500, 400))
ax = Axis(f, xlabel = "Distance", ylabel = "Weight")
lines!(ax, d, w, label = string("h = ", h[1]))
for i = 2:length(h)
w = winvs(d; h = h[i])
lines!(ax, d, w, label = string("h = ", h[i]))
end
axislegend("Values of h"; position = :lb)
f[1, 1] = ax
f
Jchemo.wtal
— Method wtal(d; a = 1)
Compute weights from distances using the 'talworth' distribution.
d : Vector of distances.
Keyword arguments:
a : Parameter of the weight function, see below.
The returned weight vector w has components w[i] = 1 if |d[i]| <= a, and w[i] = 0 if |d[i]| > a.
Examples
using Jchemo
d = rand(10)
wtal(d; a = .8)
Jchemo.xfit
— Method xfit(object)
xfit(object, X; nlv = nothing)
xfit!(object, X::Matrix; nlv = nothing)
Matrix fitting from a bilinear model (e.g. PCA).
object : The fitted model.
X : New X-data to be approximated from the model. Must be in the same scale as the X-data used to fit the model object, i.e. before centering and possible scaling.
Keyword arguments:
nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.
Compute an approximation of matrix X from a bilinear model (e.g. PCA or PLS) fitted on X. The fitted X is returned in the original scale of the X-data used to fit the model object.
Examples
using Jchemo
X = [1. 2 3 4; 4 1 6 7; 12 5 6 13;
27 18 7 6; 12 11 28 7]
Y = [10. 11 13; 120 131 27; 8 12 4;
1 200 8; 100 10 89]
n, p = size(X)
Xnew = X[1:3, :]
Ynew = Y[1:3, :]
y = Y[:, 1]
ynew = Ynew[:, 1]
weights = mweight(rand(n))
nlv = 2
scal = false
#scal = true
model = pcasvd(; nlv, scal) ;
fit!(model, X)
fitm = model.fitm ;
@head xfit(fitm)
xfit(fitm, Xnew)
xfit(fitm, Xnew; nlv = 0)
xfit(fitm, Xnew; nlv = 1)
fitm.xmeans
@head X
@head xfit(fitm) + xresid(fitm, X)
@head xfit(fitm, X; nlv = 1) + xresid(fitm, X; nlv = 1)
@head Xnew
@head xfit(fitm, Xnew) + xresid(fitm, Xnew)
model = pcasvd(; nlv = min(n, p), scal)
fit!(model, X)
fitm = model.fitm ;
@head xfit(fitm)
@head xfit(fitm, X)
@head xresid(fitm, X)
nlv = 3
scal = false
#scal = true
model = plskern(; nlv, scal)
fit!(model, X, Y, weights)
fitm = model.fitm ;
@head xfit(fitm)
xfit(fitm, Xnew)
xfit(fitm, Xnew, nlv = 0)
xfit(fitm, Xnew, nlv = 1)
@head X
@head xfit(fitm) + xresid(fitm, X)
@head xfit(fitm, X; nlv = 1) + xresid(fitm, X; nlv = 1)
@head Xnew
@head xfit(fitm, Xnew) + xresid(fitm, Xnew)
model = plskern(; nlv = min(n, p), scal)
fit!(model, X, Y, weights)
fitm = model.fitm ;
@head xfit(fitm)
@head xfit(fitm, Xnew)
@head xresid(fitm, Xnew)
Jchemo.xresid
— Method xresid(object, X; nlv = nothing)
xresid!(object, X::Matrix; nlv = nothing)
Residual matrix from a bilinear model (e.g. PCA).
object : The fitted model.
X : New X-data to be approximated from the model. Must be in the same scale as the X-data used to fit the model object, i.e. before centering and possible scaling.
Keyword arguments:
nlv : Nb. components (PCs or LVs) to consider. If nothing, it is the maximum nb. of components.
Compute the residual matrix:
- E = X - X_fit
where X_fit is the fitted X returned by function xfit. See xfit for examples.
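A minimal sketch (simulated data), following the same pattern as the xfit examples above:
using Jchemo
X = rand(10, 4)
nlv = 2
model = pcasvd(; nlv)
fit!(model, X)
fitm = model.fitm ;
E = xresid(fitm, X)                   # residuals for the fitted nb. of PCs
E1 = xresid(fitm, X; nlv = 1)         # residuals after 1 PC
@head xfit(fitm, X; nlv = 1) + E1     # recovers X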