gridcv - forages2 - Plsrda

using Jchemo, JchemoData
using JLD2, CairoMakie
using FreqTables

Data importation

path_jdat = dirname(dirname(pathof(JchemoData)))
db = joinpath(path_jdat, "data/forages2.jld2") 
@load db dat
@names dat
(:X, :Y)
X = dat.X 
@head X
... (485, 700)
3×700 DataFrame
 Row │ 1100          1102          1104         1106         1108         ⋯
     │ Float64       Float64       Float64      Float64      Float64      ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │ -0.000231591  -0.000175945  -8.48176e-5  2.05217e-5   0.000110094  ⋯
   2 │ -9.66352e-5   -3.30928e-5    5.64966e-5  0.000154135  0.000237725  ⋯
   3 │ -0.000131769  -7.8398e-5     7.92223e-7  8.90044e-5   0.000160022  ⋯
                                                      695 columns omitted
Y = dat.Y
@head Y
... (485, 4)
3×4 DataFrame
 Row │ dm        ndf       typ             test
     │ Float64?  Float64?  String          Int64
─────┼───────────────────────────────────────────
   1 │    92.23   37.58    Legume forages      1
   2 │    93.26   49.6462  Legume forages      0
   3 │    92.9    63.2939  Forage trees        0
y = Y.typ
test = Y.test
tab(y)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "Cereal and grass forages" => 160
  "Forage trees"             => 101
  "Legume forages"           => 224
freqtable(y, test)
3×2 Named Matrix{Int64}
             Dim1 ╲ Dim2 │   0    1
─────────────────────────┼─────────
Cereal and grass forages │ 100   60
Forage trees             │  56   45
Legume forages           │ 167   57
wlst = names(X)
wl = parse.(Int, wlst)
#plotsp(X, wl; xlabel = "Wavelength (nm)", ylabel = "Absorbance").f
700-element Vector{Int64}:
 1100
 1102
 1104
 1106
 1108
 1110
 1112
 1114
 1116
 1118
    ⋮
 2482
 2484
 2486
 2488
 2490
 2492
 2494
 2496
 2498

Note: the X-data are already preprocessed (SNV + Savitzky-Golay 2nd derivative).

Split Tot to Train/Test

The model is fitted on Train, and the generalization error is estimated on Test. In this example, the Test samples are flagged by variable test of the dataset (test = 1), and Train consists of the remaining samples. Tot could also be split a posteriori, for instance by sampling (random, systematic or any other design). See for instance functions samprand, sampsys, etc.

s = Bool.(test)
Xtrain = rmrow(X, s)
ytrain = rmrow(y, s)
Xtest = X[s, :]
ytest = y[s]
ntot = nro(X)
ntrain = nro(Xtrain)
ntest = nro(Xtest)
(ntot = ntot, ntrain, ntest)
(ntot = 485, ntrain = 323, ntest = 162)
tab(ytrain)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "Cereal and grass forages" => 100
  "Forage trees"             => 56
  "Legume forages"           => 167
tab(ytest)
OrderedCollections.OrderedDict{String, Int64} with 3 entries:
  "Cereal and grass forages" => 60
  "Forage trees"             => 45
  "Legume forages"           => 57
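
As an illustration of an a posteriori split, the sketch below draws a random Test set with base Julia only. The helper random_split is hypothetical (it is not a Jchemo function and does not reproduce the signature of samprand); the sizes 485 and 162 are simply those of the present dataset.

```julia
using Random

# Hypothetical helper (not part of Jchemo): randomly draw `ntest` Test indices
# among `ntot` observations; the remaining indices form Train.
function random_split(ntot::Int, ntest::Int; seed::Int = 1)
    rng = MersenneTwister(seed)
    perm = randperm(rng, ntot)            # random permutation of 1:ntot
    test = sort(perm[1:ntest])            # first ntest indices -> Test
    train = sort(perm[(ntest + 1):end])   # all other indices -> Train
    (train = train, test = test)
end

s_demo = random_split(485, 162)
length(s_demo.train), length(s_demo.test)   # (323, 162)
```

The resulting index vectors could then be used as X[s_demo.train, :] and y[s_demo.train] (and likewise for Test). Note that a purely random split does not guarantee the class proportions shown above.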

Replicated K-fold CV

K = 3     # nb. folds (segments)
rep = 25  # nb. replications
segm = segmkf(ntrain, K; rep = rep)
25-element Vector{Vector{Vector{Int64}}}:
 [[3, 4, 5, 6, 10, 14, 16, 18, 19, 24  …  299, 301, 303, 304, 305, 306, 310, 312, 314, 321], [8, 9, 13, 15, 17, 20, 23, 27, 29, 31  …  294, 298, 300, 302, 311, 315, 318, 320, 322, 323], [1, 2, 7, 11, 12, 21, 22, 25, 26, 30  …  290, 295, 296, 307, 308, 309, 313, 316, 317, 319]]
 [[4, 7, 10, 14, 15, 18, 26, 27, 28, 30  …  303, 304, 307, 308, 311, 315, 317, 318, 320, 322], [1, 5, 11, 17, 20, 21, 25, 29, 32, 33  …  286, 291, 293, 294, 295, 296, 302, 312, 313, 314], [2, 3, 6, 8, 9, 12, 13, 16, 19, 22  …  298, 299, 305, 306, 309, 310, 316, 319, 321, 323]]
 [[3, 4, 6, 16, 17, 18, 21, 23, 27, 33  …  292, 295, 296, 299, 303, 304, 309, 310, 311, 318], [1, 2, 5, 8, 9, 10, 11, 13, 15, 20  …  293, 302, 305, 307, 314, 315, 319, 320, 321, 323], [7, 12, 14, 19, 22, 24, 28, 31, 32, 34  …  298, 300, 301, 306, 308, 312, 313, 316, 317, 322]]
 [[1, 2, 5, 14, 16, 19, 20, 24, 26, 28  …  288, 293, 294, 296, 301, 304, 307, 311, 312, 321], [3, 6, 9, 15, 17, 23, 32, 33, 37, 38  …  306, 308, 313, 314, 315, 316, 317, 319, 320, 323], [4, 7, 8, 10, 11, 12, 13, 18, 21, 22  …  297, 298, 299, 300, 302, 303, 309, 310, 318, 322]]
 [[1, 2, 5, 7, 10, 19, 24, 28, 29, 31  …  292, 296, 297, 304, 308, 309, 313, 314, 317, 322], [11, 12, 14, 16, 22, 26, 27, 30, 34, 39  …  283, 291, 294, 299, 300, 301, 302, 303, 311, 319], [3, 4, 6, 8, 9, 13, 15, 17, 18, 20  …  306, 307, 310, 312, 315, 316, 318, 320, 321, 323]]
 [[1, 7, 8, 10, 12, 18, 21, 23, 25, 29  …  287, 290, 296, 300, 305, 309, 314, 321, 322, 323], [4, 6, 13, 14, 15, 16, 19, 22, 26, 28  …  295, 299, 303, 310, 311, 312, 315, 316, 317, 320], [2, 3, 5, 9, 11, 17, 20, 24, 27, 31  …  298, 301, 302, 304, 306, 307, 308, 313, 318, 319]]
 [[4, 5, 6, 7, 12, 13, 15, 19, 20, 21  …  285, 287, 291, 297, 300, 307, 311, 316, 319, 323], [2, 9, 34, 39, 40, 42, 45, 47, 50, 51  …  301, 302, 304, 305, 306, 309, 315, 320, 321, 322], [1, 3, 8, 10, 11, 14, 16, 17, 18, 23  …  293, 298, 303, 308, 310, 312, 313, 314, 317, 318]]
 [[11, 15, 17, 19, 20, 21, 24, 30, 32, 43  …  295, 296, 303, 307, 308, 309, 312, 314, 316, 322], [5, 6, 10, 12, 14, 18, 28, 29, 31, 35  …  294, 297, 300, 302, 304, 305, 306, 318, 319, 320], [1, 2, 3, 4, 7, 8, 9, 13, 16, 22  …  298, 299, 301, 310, 311, 313, 315, 317, 321, 323]]
 [[2, 3, 5, 8, 10, 11, 17, 20, 24, 25  …  289, 293, 295, 299, 300, 305, 306, 311, 312, 313], [1, 4, 6, 7, 9, 12, 13, 15, 16, 21  …  286, 287, 288, 291, 294, 296, 302, 304, 307, 314], [14, 18, 19, 22, 23, 26, 32, 33, 34, 37  …  310, 315, 316, 317, 318, 319, 320, 321, 322, 323]]
 [[3, 4, 7, 9, 10, 27, 29, 31, 34, 37  …  293, 295, 298, 299, 302, 303, 305, 307, 309, 312], [1, 2, 5, 8, 14, 15, 17, 19, 21, 23  …  300, 306, 308, 311, 315, 318, 319, 320, 322, 323], [6, 11, 12, 13, 16, 18, 20, 22, 25, 28  …  294, 296, 301, 304, 310, 313, 314, 316, 317, 321]]
 ⋮
 [[5, 13, 15, 17, 19, 28, 30, 35, 36, 42  …  290, 291, 298, 300, 302, 316, 319, 320, 322, 323], [6, 11, 12, 14, 22, 26, 32, 33, 46, 49  …  297, 299, 301, 303, 307, 309, 313, 314, 315, 318], [1, 2, 3, 4, 7, 8, 9, 10, 16, 18  …  296, 304, 305, 306, 308, 310, 311, 312, 317, 321]]
 [[2, 6, 11, 12, 13, 14, 15, 19, 24, 26  …  296, 297, 300, 301, 302, 306, 309, 312, 322, 323], [1, 7, 8, 9, 16, 18, 20, 23, 25, 27  …  289, 294, 298, 304, 305, 313, 314, 317, 319, 320], [3, 4, 5, 10, 17, 21, 22, 28, 32, 35  …  299, 303, 307, 308, 310, 311, 315, 316, 318, 321]]
 [[1, 2, 3, 5, 6, 7, 8, 11, 15, 19  …  277, 279, 284, 306, 307, 308, 316, 317, 319, 323], [9, 10, 12, 16, 18, 25, 28, 29, 35, 39  …  299, 300, 303, 305, 309, 312, 315, 318, 320, 321], [4, 13, 14, 17, 21, 22, 23, 30, 31, 32  …  296, 298, 301, 302, 304, 310, 311, 313, 314, 322]]
 [[3, 8, 9, 13, 15, 21, 22, 25, 27, 28  …  296, 297, 304, 306, 310, 311, 312, 313, 314, 316], [1, 2, 5, 10, 11, 17, 23, 30, 32, 34  …  295, 298, 300, 302, 308, 315, 317, 319, 321, 322], [4, 6, 7, 12, 14, 16, 18, 19, 20, 24  …  294, 299, 301, 303, 305, 307, 309, 318, 320, 323]]
 [[4, 13, 15, 18, 19, 27, 28, 33, 44, 45  …  297, 298, 299, 301, 305, 309, 312, 319, 321, 323], [2, 3, 7, 11, 12, 16, 17, 20, 21, 22  …  303, 304, 307, 308, 310, 313, 314, 316, 317, 322], [1, 5, 6, 8, 9, 10, 14, 24, 26, 30  …  287, 290, 292, 295, 300, 306, 311, 315, 318, 320]]
 [[2, 13, 14, 16, 21, 25, 27, 29, 31, 33  …  295, 296, 298, 299, 300, 303, 304, 308, 310, 320], [3, 5, 6, 7, 8, 10, 11, 12, 15, 18  …  306, 307, 309, 311, 313, 314, 315, 316, 317, 323], [1, 4, 9, 17, 22, 23, 26, 32, 36, 37  …  294, 297, 301, 302, 305, 312, 318, 319, 321, 322]]
 [[1, 4, 7, 8, 9, 14, 18, 19, 20, 22  …  287, 289, 290, 292, 294, 297, 301, 302, 308, 314], [5, 6, 11, 12, 13, 17, 21, 23, 25, 32  …  303, 304, 306, 310, 312, 313, 317, 319, 320, 321], [2, 3, 10, 15, 16, 24, 26, 28, 31, 33  …  300, 305, 307, 309, 311, 315, 316, 318, 322, 323]]
 [[1, 2, 3, 10, 11, 13, 16, 19, 24, 30  …  305, 307, 308, 314, 315, 316, 318, 319, 321, 322], [4, 5, 8, 15, 18, 21, 23, 26, 28, 29  …  290, 293, 294, 295, 297, 299, 309, 311, 313, 323], [6, 7, 9, 12, 14, 17, 20, 22, 25, 27  …  288, 298, 301, 303, 304, 306, 310, 312, 317, 320]]
 [[3, 4, 6, 8, 9, 12, 15, 16, 17, 20  …  309, 312, 313, 314, 316, 317, 318, 319, 320, 323], [1, 2, 11, 13, 14, 22, 28, 36, 37, 39  …  274, 276, 279, 280, 281, 291, 297, 307, 308, 310], [5, 7, 10, 18, 19, 21, 25, 26, 27, 30  …  300, 301, 302, 304, 305, 306, 311, 315, 321, 322]]
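
To make the structure of the object above explicit (one element per replication, each holding K disjoint folds of row indices), here is a minimal base-Julia sketch. The helper kfold_segments is hypothetical and only mimics the shape of the segmkf output; it is not the Jchemo implementation, and its random draws will differ.

```julia
using Random

# Hypothetical helper: `rep` replications, each made of K disjoint folds
# that together cover 1:n exactly once.
function kfold_segments(n::Int, K::Int; rep::Int = 1, seed::Int = 1)
    rng = MersenneTwister(seed)
    map(1:rep) do _
        perm = randperm(rng, n)
        # Fold k takes every K-th element of the permutation, starting at position k
        [sort(perm[k:K:n]) for k in 1:K]
    end
end

segm_demo = kfold_segments(323, 3; rep = 25)
length(segm_demo)        # 25 replications
length.(segm_demo[1])    # fold sizes within one replication: [108, 108, 107]
```

With 323 training observations and K = 3, each model is fitted on about two thirds of Train and scored on a fold of 107 or 108 observations.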
nlv = 0:30
model = plsrda()
rescv = gridcv(model, Xtrain, ytrain; segm, score = errp, nlv)
@names rescv 
res = rescv.res
res_rep = rescv.res_rep
2325×4 DataFrame
  Row │ rep    segm   nlv    y1
      │ Int64  Int64  Int64  Float64
──────┼──────────────────────────────
    1 │     1      1      0  0.5
    2 │     1      1      1  0.259259
    3 │     1      1      2  0.25
    4 │     1      1      3  0.194444
    5 │     1      1      4  0.12037
    6 │     1      1      5  0.0925926
    7 │     1      1      6  0.0833333
    8 │     1      1      7  0.0833333
    9 │     1      1      8  0.0833333
   10 │     1      1      9  0.0740741
   11 │     1      1     10  0.0740741
   12 │     1      1     11  0.0833333
   13 │     1      1     12  0.0833333
  ⋮   │   ⋮      ⋮      ⋮        ⋮
 2314 │    25      3     19  0.102804
 2315 │    25      3     20  0.102804
 2316 │    25      3     21  0.102804
 2317 │    25      3     22  0.102804
 2318 │    25      3     23  0.0934579
 2319 │    25      3     24  0.0934579
 2320 │    25      3     25  0.0934579
 2321 │    25      3     26  0.0934579
 2322 │    25      3     27  0.0934579
 2323 │    25      3     28  0.0841121
 2324 │    25      3     29  0.0841121
 2325 │    25      3     30  0.0841121
                    2300 rows omitted
plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "ERRP-CV").f
f, ax = plotgrid(res.nlv, res.y1; step = 2, xlabel = "Nb. LVs", ylabel = "ERRP-CV")
for i = 1:rep, j = 1:K
    zres = res_rep[res_rep.rep .== i .&& res_rep.segm .== j, :]
    lines!(ax, zres.nlv, zres.y1; color = (:grey, .2))
end
lines!(ax, res.nlv, res.y1; color = :red, linewidth = 1)
f

Specifying argument prior

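Specifying uniform class priors (prior = :unif) is recommended if classes are highly unbalanced. The score used below is merrp (mean of the class-wise error rates) instead of errp.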

prior = [:unif]  
pars = mpar(prior = prior)
nlv = 0:30
model = plsrda()
res = gridcv(model, Xtrain, ytrain; segm, score = merrp, pars, nlv).res
31×3 DataFrame
 Row │ nlv    prior   y1
     │ Int64  Symbol  Float64
─────┼─────────────────────────
   1 │     0  unif    0.666667
   2 │     1  unif    0.369011
   3 │     2  unif    0.206978
   4 │     3  unif    0.182696
   5 │     4  unif    0.151481
   6 │     5  unif    0.119707
   7 │     6  unif    0.117355
   8 │     7  unif    0.117669
   9 │     8  unif    0.112861
  10 │     9  unif    0.116276
  11 │    10  unif    0.118322
  12 │    11  unif    0.121019
  13 │    12  unif    0.118498
  ⋮  │   ⋮      ⋮        ⋮
  20 │    19  unif    0.124413
  21 │    20  unif    0.123317
  22 │    21  unif    0.127453
  23 │    22  unif    0.127537
  24 │    23  unif    0.129279
  25 │    24  unif    0.131948
  26 │    25  unif    0.132012
  27 │    26  unif    0.13238
  28 │    27  unif    0.13368
  29 │    28  unif    0.129848
  30 │    29  unif    0.130858
  31 │    30  unif    0.13164
                 6 rows omitted

Selection of the best parameter combination

u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
DataFrameRow (3 columns)
 Row │ nlv    prior   y1
     │ Int64  Symbol  Float64
─────┼─────────────────────────
   9 │     8  unif    0.112861
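
The selection step can also be written with base Julia's argmin, which returns the index of the first minimum and is therefore equivalent to findall(...)[1] here. The sketch below is a stand-alone illustration: the vector y1_demo only contains the first 13 values printed in the table above (nlv = 0 to 12), and the names nlv_demo, y1_demo and u_demo are introduced for the example.

```julia
# CV scores copied from the first 13 rows of the table above (nlv = 0, 1, ..., 12)
nlv_demo = 0:12
y1_demo = [0.666667, 0.369011, 0.206978, 0.182696, 0.151481, 0.119707,
           0.117355, 0.117669, 0.112861, 0.116276, 0.118322, 0.121019, 0.118498]

u_demo = argmin(y1_demo)                         # index of the first minimum
(u_demo, nlv_demo[u_demo], y1_demo[u_demo])      # (9, 8, 0.112861)
```

The CV curve is rather flat between 5 and 12 LVs, so a more parsimonious model with a score close to the minimum could also be a reasonable choice.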

Final prediction (Test) using the optimal model

model = plsrda(nlv = res.nlv[u], prior = res.prior[u])
fit!(model, Xtrain, ytrain)
pred = predict(model, Xtest).pred
162×1 Matrix{String}:
 "Forage trees"
 "Cereal and grass forages"
 "Cereal and grass forages"
 "Legume forages"
 "Cereal and grass forages"
 "Cereal and grass forages"
 "Legume forages"
 "Forage trees"
 "Forage trees"
 "Forage trees"
 ⋮
 "Cereal and grass forages"
 "Cereal and grass forages"
 "Forage trees"
 "Forage trees"
 "Legume forages"
 "Legume forages"
 "Legume forages"
 "Legume forages"
 "Legume forages"

Generalization error

errp(pred, ytest)
1×1 Matrix{Float64}:
 0.12962962962962962
merrp(pred, ytest)
1×1 Matrix{Float64}:
 0.1304093567251462
cf = conf(pred, ytest)
@names cf
(:cnt, :pct, :A, :Apct, :diagpct, :accpct, :lev)
cf.cnt
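3×4 DataFrame
 Row │ y                         pred_Cereal and grass forages  pred_Forage trees  pred_Legume forages
     │ String                    Int64                          Int64              Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Cereal and grass forages                             54                  3                    3
   2 │ Forage trees                                          0                 39                    6
   3 │ Legume forages                                        0                  9                   48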
cf.pct
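3×4 DataFrame
 Row │ levels                    pred_Cereal and grass forages  pred_Forage trees  pred_Legume forages
     │ String                    Float64                        Float64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Cereal and grass forages                           90.0                5.0                  5.0
   2 │ Forage trees                                        0.0               86.7                 13.3
   3 │ Legume forages                                      0.0               15.8                 84.2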
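
The two error rates can be recomputed by hand from the confusion counts. The sketch below assumes that errp is the overall proportion of misclassified observations and that merrp is the mean of the error rates computed within each observed class; these definitions are an assumption of the example, but they reproduce the values printed above. The matrix cnt is copied from cf.cnt (rows: observed classes, columns: predicted classes).

```julia
using LinearAlgebra

# Counts copied from cf.cnt (rows = observed classes, columns = predicted classes)
cnt = [54  3  3;
        0 39  6;
        0  9 48]

# Overall error rate: proportion of off-diagonal (misclassified) observations
err_overall = 1 - tr(cnt) / sum(cnt)

# Class-wise error rates (one per observed class), then their unweighted mean
err_class = 1 .- diag(cnt) ./ vec(sum(cnt; dims = 2))
err_mean = sum(err_class) / length(err_class)

(err_overall, err_mean)   # approximately (0.1296, 0.1304)
```

Both rates are close here because the three Test classes have similar sizes (60, 45 and 57); they would diverge more with strongly unbalanced classes.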