This example illustrates the underlying principles of fitting three different probabilistic PLSDA models: PLS-LDA, PLS-QDA and PLS-KDEDA (see functions plslda, plsqda and plskdeda).
PLS scores are computed on the training Y-dummy table, and the training X-observations are projected onto this score space. The probability densities of these projections are then estimated for each class, either assuming Gaussian distributions (LDA and QDA) or with non-parametric kernel estimators (KDE).
The new observations to predict are projected onto the score space, and their locations are compared to the training density estimates of each class. This determines their probabilities of belonging to each class, and the class showing the highest probability is chosen.
To simplify the graphical illustrations in this example, the PLS score space is limited to 2D (two latent variables), but in practice these methods are usually run on higher-dimensional score spaces.
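As a schematic illustration of the decision rule described above (a plain-Julia sketch with made-up values, not Jchemo code), the posterior probability of each class is obtained by weighting the class density estimate at the projected location by the class prior, and the predicted class is the one with the highest posterior:

## Schematic decision rule (illustration only; density and prior values are made up)
f = [0.02, 0.35, 0.08]                    # class density estimates at the projected location
prior = [1/3, 1/3, 1/3]                   # class prior probabilities (here uniform)
post = (prior .* f) ./ sum(prior .* f)    # posterior probabilities of class membership
argmax(post)                              # predicted class = highest probability (here class 2)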
using Jchemo, JchemoData
using JLD2, CairoMakie
path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data", "iris.jld2")
@load db dat
@names dat
(:X,)
summ(dat.X)
(res = 5×7 DataFrame
 Row │ variable      mean    std     min     max        n      nmissing
     │ Symbol        Union…  Union…  Any     Any        Int64  Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ sepal_length  5.843   0.828   4.3     7.9        150    0
   2 │ sepal_width   3.057   0.436   2.0     4.4        150    0
   3 │ petal_length  3.758   1.765   1.0     6.9        150    0
   4 │ petal_width   1.199   0.762   0.1     2.5        150    0
   5 │ species                       setosa  virginica  150    0,
 ntot = 150)
X = dat.X[:, 1:4]
y = dat.X[:, 5]    # the classes (species)
ntot = nro(X)
150
tab(y)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 50
  "versicolor" => 50
  "virginica"  => 50
lev = mlev(y)
nlev = length(lev)
3
ntest = 30
s = samprand(ntot, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = ntot - ntest
(ntot = ntot, ntrain, ntest)
tab(ytrain)
tab(ytest)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 13
  "versicolor" => 10
  "virginica"  => 7
Ytrain_dummy = dummy(ytrain).Y
120×3 BitMatrix:
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 ⋮
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
nlv = 2
model = plskern(; nlv)
fit!(model, Xtrain, Ytrain_dummy)
Scores
@head Ttrain = model.fitm.T
@head Ttest = transf(model, Xtest)
3×2 Matrix{Float64}:
 2.80858  -0.309397
 2.82205   0.212753
 2.99554   0.101511
... (120, 2)

3×2 Matrix{Float64}:
 2.85231  -0.356802
 2.97728   0.504551
 2.54422  -1.36603
... (30, 2)
i = 1
plotxy(Ttrain[:, i], Ttrain[:, i + 1], ytrain; title = "PLS2 space -Train",
    xlabel = string("LV", i), ylabel = string("LV", i + 1), leg_title = "Species",
    zeros = true, ellipse = false).f
The new observation is projected onto the score space (blue star in the figure below), and its location is compared to the training observations of each class (red points in the figure below) and to the corresponding density estimates.
k = 1 # example of the projection of the first obs. of Test
1
f = Figure(size = (900, 300))
ax = list(nlev)
for i in eachindex(lev)
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "PLS-LV1", ylabel = "PLS-LV2")
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    ## New point to predict
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    ## End
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
end
f
The same graphical representation can be made with the probability densities (instead of the observations as above) estimated for each class by the three methods.
LDA estimate
The Gaussian probability densities (see function dmnorm) are estimated assuming that all classes share the same PLS-score covariance matrix (the pooled within-class matrix W).
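For reference, the multivariate normal density that dmnorm evaluates (from a mean vector and a covariance matrix estimated on the class scores) can be written out directly. The following is a minimal plain-Julia sketch of that formula, not the Jchemo implementation:

using LinearAlgebra

## Minimal sketch of a multivariate Gaussian density (illustration only)
## x: point of the score space, mu: class mean vector, S: class covariance matrix
function gauss_density(x, mu, S)
    p = length(mu)
    d = x - mu
    exp(-0.5 * dot(d, S \ d)) / sqrt((2 * pi)^p * det(S))
end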
weights = mweight(ones(ntrain))    # observation weights
res = matW(Ttrain, ytrain, weights)
W = res.W * ntrain / (ntrain - nlev)
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev)
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    #lims = [[minimum(zT[:, j]) maximum(zT[:, j])] for j = 1:nlv]
    #x1 = LinRange(lims[1][1], lims[1][2], npoints)
    #x2 = LinRange(lims[2][1], lims[2][2], npoints)
    zfitm = dmnorm(colmean(zT), W)    # Gaussian estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f
QDA estimate
In this method, the classes are assumed to have different covariance matrices.
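The next block obtains these per-class covariance matrices from matW and corrects them for bias. As a rough cross-check (an assumption added for illustration, not part of the original example), the same quantities can be computed per class with Statistics.cov:

using Statistics

## Unbiased covariance of the PLS scores of each class (illustration only);
## with the uniform weights used here, this should match Wi[i] * ni[i] / (ni[i] - 1) below.
Si = [cov(Ttrain[ytrain .== lev[i], :]) for i in eachindex(lev)]
Si[1]    # covariance matrix of the first class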
res = matW(Ttrain, ytrain, weights)
Wi = res.Wi
ni = res.ni
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev)
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zS = Wi[i] * ni[i] / (ni[i] - 1)
    zfitm = dmnorm(colmean(zT), zS)    # Gaussian estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f
Non-parametric KDE estimate
In this method, the densities are estimated by kernel density estimators (see function dmkern).
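As a reminder of the principle (a schematic 1-D sketch with a Gaussian kernel, not dmkern's actual multivariate implementation), a kernel density estimate at a point x is an average of kernel bumps centred on the training observations:

## Schematic 1-D Gaussian kernel density estimate (illustration only)
## x: evaluation point, data: observed values, h: bandwidth
kde1d(x, data, h) = sum(exp.(-0.5 * ((x .- data) ./ h).^2)) / (length(data) * h * sqrt(2 * pi))

kde1d(0.0, randn(100), 0.3)    # density estimate at 0 for a standard normal sample (roughly 0.4)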
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev)
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zfitm = dmkern(zT; a = .5)    # KDE estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f
All the above figures show that the new observation lies in the middle of class 'setosa'; the three PLSDA methods will therefore predict it as belonging to this class.
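In practice, these steps do not have to be carried out by hand: the same predictions can be obtained directly from plslda, plsqda and plskdeda. A minimal sketch, assuming the usual Jchemo fit!/predict workflow shown above and errp for the classification error rate:

## Direct use of the higher-level PLSDA functions (sketch; errp is assumed to return
## the proportion of misclassified test observations)
model = plslda(; nlv)          # or plsqda(; nlv), plskdeda(; nlv)
fit!(model, Xtrain, ytrain)    # the Y-dummy table is built internally
res = predict(model, Xtest)
res.pred                       # predicted classes
errp(res.pred, ytest)          # error rate on the test set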