Density estimation - iris

This example illustrates the principles underlying the fit of three different probabilistic PLSDA models: PLS-LDA, PLS-QDA and PLS-KDEDA (see functions plslda, plsqda, plskdeda).

  • Step-1: PLS scores are computed on the training data {X, Y-dummy} (where Y-dummy is the table of binary-coded variables for the classes contained in the categorical response variable y). The training observations X are then projected on this score space, and the multivariate probability density distribution of the projections is estimated for each class. The estimation assumes a Gaussian distribution for LDA and QDA (parametric methods), and uses non-parametric kernels for KDE.

  • Step-2: The prediction of a given new observation, xnew, is as follows. Observation xnew is projected on the training score space computed in Step-1 and, for each class, its location is compared to the density distribution estimated for that class. This determines the probability that xnew belongs to the given class. The class with the highest probability is chosen as the prediction.

In practice, the PLS score space used for discrimination can have many dimensions. In the present example, to simplify the graphical illustrations, a 2-D PLS score space (2 latent variables) is considered.

using Jchemo, JchemoData
using JLD2, CairoMakie

Data importation

path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data", "iris.jld2") 
@load db dat
@names dat
(:X,)
summ(dat.X)
(res = 5×7 DataFrame
 Row │ variable      mean    std     min     max        n      nmissing
     │ Symbol        Union…  Union…  Any     Any        Int64  Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ sepal_length  5.843   0.828   4.3     7.9          150         0
   2 │ sepal_width   3.057   0.436   2.0     4.4          150         0
   3 │ petal_length  3.758   1.765   1.0     6.9          150         0
   4 │ petal_width   1.199   0.762   0.1     2.5          150         0
   5 │ species                       setosa  virginica    150         0, ntot = 150)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]    # the classes (species)
ntot = nro(X)
150
tab(y)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 50
  "versicolor" => 50
  "virginica"  => 50
lev = mlev(y)
nlev = length(lev)
3

Split of the total data into Train/Test

ntest = 30
s = samprand(ntot, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = ntot - ntest
(ntot = ntot, ntrain, ntest)
tab(ytrain)
tab(ytest)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 8
  "versicolor" => 7
  "virginica"  => 15

Computation of the PLS scores on the Y-dummy table

Ytrain_dummy = dummy(ytrain).Y
120×3 BitMatrix:
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 ⋮     
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
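The dummy (one-hot) coding above can be sketched in plain Julia. This is a minimal illustration of the principle only, not the Jchemo `dummy` implementation; the helper name `dummycode` is hypothetical.

```julia
## Minimal sketch of dummy (one-hot) coding of a categorical vector.
## Illustration only; `dummycode` is a hypothetical helper, not Jchemo's `dummy`.
function dummycode(y)
    lev = sort(unique(y))              # class levels, sorted
    Y = falses(length(y), length(lev)) # n x nlev BitMatrix
    for (j, l) in enumerate(lev)
        Y[:, j] .= y .== l             # 1 where the observation belongs to level l
    end
    (Y = Y, lev = lev)
end

y = ["setosa", "setosa", "versicolor", "virginica"]
res = dummycode(y)
res.Y   # 4×3 BitMatrix
```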
nlv = 2
model = plskern(; nlv)
fit!(model, Xtrain, Ytrain_dummy)

Scores

@head Ttrain = model.fitm.T
@head Ttest = transf(model, Xtest)
3×2 Matrix{Float64}:
 2.6053   -0.286791
 2.61995   0.232042
 2.79383   0.0994609
... (120, 2)
 
3×2 Matrix{Float64}:
 2.64409   0.252782
 2.21507  -0.745692
 2.59655  -0.996732
... (30, 2)
i = 1
plotxy(Ttrain[:, i], Ttrain[:, i + 1], ytrain; title = "PLS2 space - Train", xlabel = string("LV", i), 
    ylabel = string("LV", i + 1), leg_title = "Species", zeros = true, ellipse = false).f

Projection of a new observation in the fitted score space

The new observation is projected on the score space (blue star in the figures below), and its location is compared to the training observations (red points) and to the estimated class densities.

k = 1  # example of the projection of the first obs. of Test
1
f = Figure(size = (900, 300))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "PLS-LV1", ylabel = "PLS-LV2")
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    ## New point to predict
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    ## End
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
end
f

The same graphical representation can be made with the probability densities estimated for each class by the three methods (instead of showing only the observations, as above).

LDA estimate

In LDA, the Gaussian probability densities are estimated assuming that the classes have the same PLS-score covariance matrix (W) (see function dmnorm).
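The pooled within-class covariance underlying LDA, and the evaluation of a Gaussian density at a point, can be sketched as follows. This is a minimal illustration under the equal-covariance assumption, not the Jchemo `matW`/`dmnorm` code; the helper names `pooledcov` and `dnorm` are hypothetical.

```julia
using LinearAlgebra, Statistics

## Pooled within-class covariance: the within-class scatters are summed over
## the classes and divided by (n - nlev), the unbiased correction used for LDA.
## Minimal sketch; not the Jchemo `matW` implementation.
function pooledcov(T, y)
    lev = unique(y)
    n, p = size(T)
    W = zeros(p, p)
    for l in lev
        Tl = T[y .== l, :]
        Tl = Tl .- mean(Tl; dims = 1)   # center within the class
        W .+= Tl' * Tl                  # within-class scatter
    end
    W / (n - length(lev))               # unbiased pooled estimate
end

## Multivariate Gaussian density at x, given mean mu and covariance S
function dnorm(x, mu, S)
    d = x .- mu
    exp(-0.5 * dot(d, S \ d)) / sqrt((2π)^length(x) * det(S))
end
```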

weights = mweight(ones(ntrain))   # observation weights
res = matW(Ttrain, ytrain, weights)
W = res.W * ntrain / (ntrain - nlev)
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    #lims = [[minimum(zT[:, j]) maximum(zT[:, j])] for j = 1:nlv]
    #x1 = LinRange(lims[1][1], lims[1][2], npoints)
    #x2 = LinRange(lims[2][1], lims[2][2], npoints)
    zfitm = dmnorm(colmean(zT), W)   # Gaussian estimate
    zres = predict(zfitm, grid) 
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

QDA estimate

In QDA, the classes are assumed to have different covariance matrices.
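The per-class covariance estimates used by QDA can be sketched as below (each class gets its own unbiased estimate, matching the ni / (ni - 1) correction applied in the code that follows). This is an illustration only, not the Jchemo `matW` implementation; the helper name `classcovs` is hypothetical.

```julia
using Statistics

## Per-class covariance matrices (QDA assumption): one unbiased estimate per
## class. `cov` already applies the (ni - 1) denominator.
function classcovs(T, y)
    lev = unique(y)
    Dict(l => cov(T[y .== l, :]) for l in lev)
end
```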

res = matW(Ttrain, ytrain, weights)
Wi = res.Wi
ni = res.ni
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zS = Wi[i] * ni[i] / (ni[i] - 1)
    zfitm = dmnorm(colmean(zT), zS)   # Gaussian estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred) 
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

Non-parametric KDE estimate

In KDE, the densities are estimated by kernel estimators (see function dmkern).
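The principle of a Gaussian kernel estimator with a fixed bandwidth `a` can be sketched as below. This is a minimal illustration of the KDE idea, not the Jchemo `dmkern` implementation (which is more general); the helper name `kde` is hypothetical.

```julia
using LinearAlgebra

## Gaussian kernel density estimate at point x from training rows T,
## with a common scalar bandwidth a: average of n Gaussian kernels,
## one centered on each training observation. Minimal sketch only.
function kde(x, T, a)
    n, p = size(T)
    c = 1 / ((2π)^(p / 2) * a^p)   # Gaussian normalizing constant
    s = 0.0
    for i in 1:n
        d = (x .- T[i, :]) ./ a
        s += exp(-0.5 * dot(d, d))
    end
    c * s / n
end
```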

npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zfitm = dmkern(zT; a = .5)   # KDE estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

In the present example, all the above figures show that the new observation lies in the middle of class 'setosa'; the three PLSDA methods will therefore predict it as belonging to this class.
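The decision rule described in Step-2 amounts to choosing the class with the highest estimated density (or probability) at the projected observation. A one-line sketch, with made-up density values for illustration:

```julia
## Per-class densities evaluated at the new observation (made-up values,
## for illustration of the decision rule only)
dens = Dict("setosa" => 0.31, "versicolor" => 1.2e-6, "virginica" => 4.0e-9)
pred = argmax(dens)   # key with the maximal value => predicted class
```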