Density estimation - iris

This example illustrates the principles underlying the fit of three different probabilistic PLSDA models: PLS-LDA, PLS-QDA and PLS-KDEDA (see functions plslda, plsqda, plskdeda).

  • Step-1: PLS scores are computed on the training data {X, Y-dummy} (where Y-dummy is the table of binary-coded variables for the classes contained in the categorical response variable y). The training observations X are then projected on this score space, and the multivariate probability density distribution of the projections is estimated for each class. The estimation assumes a Gaussian distribution for LDA and QDA (parametric methods), and uses non-parametric kernels for KDE.

  • Step-2: The prediction of a given new observation, xnew, is as follows. Observation xnew is projected on the training score space computed in Step-1 and, for each class, its location is compared to the density distribution estimated for that class. This determines the probability that xnew belongs to the given class. The class with the highest probability is chosen as the prediction.

In practice, the PLS score space used for discrimination can have many dimensions. In the present example, to simplify the graphical illustrations, a 2-D PLS score space (2 latent variables) is considered.

using Jchemo, JchemoData
using JLD2, CairoMakie

Data importation

path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data", "iris.jld2") 
@load db dat
@names dat
(:X,)
summ(dat.X)
(res = 5×7 DataFrame
 Row │ variable      mean    std     min     max        n      nmissing
     │ Symbol        Union…  Union…  Any     Any        Int64  Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │ sepal_length  5.843   0.828   4.3     7.9          150         0
   2 │ sepal_width   3.057   0.436   2.0     4.4          150         0
   3 │ petal_length  3.758   1.765   1.0     6.9          150         0
   4 │ petal_width   1.199   0.762   0.1     2.5          150         0
   5 │ species                       setosa  virginica    150         0, ntot = 150)
X = dat.X[:, 1:4] 
y = dat.X[:, 5]    # the classes (species)
ntot = nro(X)
150
tab(y)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 50
  "versicolor" => 50
  "virginica"  => 50
lev = mlev(y)
nlev = length(lev)
3

Split of the total data into Train/Test

ntest = 30
s = samprand(ntot, ntest)
Xtrain = X[s.train, :]
ytrain = y[s.train]
Xtest = X[s.test, :]
ytest = y[s.test]
ntrain = ntot - ntest
(ntot = ntot, ntrain, ntest)
tab(ytrain)
tab(ytest)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "setosa"     => 8
  "versicolor" => 7
  "virginica"  => 15

Computation of the PLS scores on the Y-dummy table

Ytrain_dummy = dummy(ytrain).Y
120×3 BitMatrix:
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 ⋮     
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
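The dummy (one-hot) coding above can be sketched in plain Julia. This is a minimal illustration of the principle only, not the Jchemo `dummy` implementation; the helper name `dummycode` is hypothetical.

```julia
## Minimal sketch of dummy (one-hot) coding of a categorical vector.
## Illustration only; `dummycode` is a hypothetical helper, not Jchemo's `dummy`.
function dummycode(y)
    lev = sort(unique(y))              # class levels, sorted
    Y = falses(length(y), length(lev)) # n x nlev BitMatrix
    for (j, l) in enumerate(lev)
        Y[:, j] .= y .== l             # 1 where the observation belongs to level l
    end
    (Y = Y, lev = lev)
end

y = ["setosa", "setosa", "versicolor", "virginica"]
res = dummycode(y)
res.Y   # 4×3 BitMatrix
```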
nlv = 2
model = plskern(; nlv)
fit!(model, Xtrain, Ytrain_dummy)

Scores

@head Ttrain = model.fitm.T
@head Ttest = transf(model, Xtest)
3×2 Matrix{Float64}:
 2.6053   -0.286791
 2.61995   0.232042
 2.79383   0.0994609
... (120, 2)
 
3×2 Matrix{Float64}:
 2.64409   0.252782
 2.21507  -0.745692
 2.59655  -0.996732
... (30, 2)
i = 1
plotxy(Ttrain[:, i], Ttrain[:, i + 1], ytrain; title = "PLS2 space - Train", xlabel = string("LV", i), 
    ylabel = string("LV", i + 1), leg_title = "Species", zeros = true, ellipse = false).f

Projection of a new observation in the fitted score space

The new observation is projected on the score space (blue star in the figures below), and its location is compared to the training observations (red points) and to the estimated class densities.

k = 1  # example of the projection of the first obs. of Test
1
f = Figure(size = (900, 300))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "PLS-LV1", ylabel = "PLS-LV2")
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    ## New point to predict
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    ## End
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
end
f

The same graphical representation can be made with the probability densities estimated for each class by the three methods (instead of showing only the observations, as above).

LDA estimate

In LDA, the Gaussian probability densities are estimated assuming that the classes have the same PLS-score covariance matrix (W) (see function dmnorm).
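The pooled within-class covariance underlying LDA, and the evaluation of a Gaussian density at a point, can be sketched as follows. This is a minimal illustration under the equal-covariance assumption, not the Jchemo `matW`/`dmnorm` code; the helper names `pooledcov` and `dnorm` are hypothetical.

```julia
using LinearAlgebra, Statistics

## Pooled within-class covariance: the within-class scatters are summed over
## the classes and divided by (n - nlev), the unbiased correction used for LDA.
## Minimal sketch; not the Jchemo `matW` implementation.
function pooledcov(T, y)
    lev = unique(y)
    n, p = size(T)
    W = zeros(p, p)
    for l in lev
        Tl = T[y .== l, :]
        Tl = Tl .- mean(Tl; dims = 1)   # center within the class
        W .+= Tl' * Tl                  # within-class scatter
    end
    W / (n - length(lev))               # unbiased pooled estimate
end

## Multivariate Gaussian density at x, given mean mu and covariance S
function dnorm(x, mu, S)
    d = x .- mu
    exp(-0.5 * dot(d, S \ d)) / sqrt((2π)^length(x) * det(S))
end
```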

weights = mweight(ones(ntrain))   # observation weights
res = matW(Ttrain, ytrain, weights)
W = res.W * ntrain / (ntrain - nlev)
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    #lims = [[minimum(zT[:, j]) maximum(zT[:, j])] for j = 1:nlv]
    #x1 = LinRange(lims[1][1], lims[1][2], npoints)
    #x2 = LinRange(lims[2][1], lims[2][2], npoints)
    zfitm = dmnorm(colmean(zT), W)   # Gaussian estimate
    zres = predict(zfitm, grid) 
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

QDA estimate

In QDA, the classes are assumed to have different covariance matrices.
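The per-class covariance estimates used by QDA can be sketched as below (each class gets its own unbiased estimate, matching the ni / (ni - 1) correction applied in the code that follows). This is an illustration only, not the Jchemo `matW` implementation; the helper name `classcovs` is hypothetical.

```julia
using Statistics

## Per-class covariance matrices (QDA assumption): one unbiased estimate per
## class. `cov` already applies the (ni - 1) denominator.
function classcovs(T, y)
    lev = unique(y)
    Dict(l => cov(T[y .== l, :]) for l in lev)
end
```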

res = matW(Ttrain, ytrain, weights)
Wi = res.Wi
ni = res.ni
npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zS = Wi[i] * ni[i] / (ni[i] - 1)
    zfitm = dmnorm(colmean(zT), zS)   # Gaussian estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred) 
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

Non-parametric KDE estimate

In KDE, the densities are estimated by kernel estimators (see function dmkern).
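The principle of a Gaussian kernel estimator with a fixed bandwidth `a` can be sketched as below. This is a minimal illustration of the KDE idea, not the Jchemo `dmkern` implementation (which is more general); the helper name `kde` is hypothetical.

```julia
using LinearAlgebra

## Gaussian kernel density estimate at point x from training rows T,
## with a common scalar bandwidth a: average of n Gaussian kernels,
## one centered on each training observation. Minimal sketch only.
function kde(x, T, a)
    n, p = size(T)
    c = 1 / ((2π)^(p / 2) * a^p)   # Gaussian normalizing constant
    s = 0.0
    for i in 1:n
        d = (x .- T[i, :]) ./ a
        s += exp(-0.5 * dot(d, d))
    end
    c * s / n
end
```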

npoints = 2^7
x1 = LinRange(-4, 4, npoints)
x2 = LinRange(-2, 2, npoints)
z = mpar(x1 = x1, x2 = x2)
grid = reduce(hcat, z)
f = Figure(size = (900, 400))
ax = list(nlev)
for i in eachindex(lev) 
    zs = ytrain .== lev[i]
    zT = Ttrain[zs, :]
    zfitm = dmkern(zT; a = .5)   # KDE estimate
    zres = predict(zfitm, grid)
    pred_grid = vec(zres.pred)
    ax[i] = Axis(f[1, i]; title = lev[i], xlabel = "LV1", ylabel = "LV2")
    co = contourf!(ax[i], grid[:, 1], grid[:, 2], pred_grid; levels = 10)
    scatter!(ax[i], zT[:, 1], zT[:, 2], color = :red, markersize = 10)
    scatter!(ax[i], Ttest[k, 1], Ttest[k, 2], color = :blue, marker = :star5, markersize = 15)
    hlines!(ax[i], 0; linestyle = :dash, color = :grey)
    vlines!(ax[i], 0; linestyle = :dash, color = :grey)
    xlims!(ax[i], -4, 4)
    ylims!(ax[i], -1.7, 1.7)
    Colorbar(f[2, i], co; label = "Density", vertical = false)
end
f

In the present example, all the above figures show that the new observation lies in the middle of class 'setosa'; the three PLSDA methods will therefore predict it as belonging to this class.
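The decision rule described in Step-2 amounts to choosing the class with the highest estimated density (or probability) at the projected observation. A one-line sketch, with made-up density values for illustration:

```julia
## Per-class densities evaluated at the new observation (made-up values,
## for illustration of the decision rule only)
dens = Dict("setosa" => 0.31, "versicolor" => 1.2e-6, "virginica" => 4.0e-9)
pred = argmax(dens)   # key with the maximal value => predicted class
```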