k-fold Cross Validation
The !KCV k qualifier,
when used with !CYCLE 1:n and
!FILTER f !EXCLUDE $I,
causes ASReml to save cross-validation predictions for model term k from repeated analyses of the data,
each excluding the records where variable f has the value of the current cycle.
This implementation of cross-validation requires the user to define a variable,
say CVgroup, which allocates the data records to g groups.
The analysis can then be repeated g times (using !CYCLE 1:g),
dropping the records in group i from run i by means of !EXCLUDE $I.
If the records dropped pertain uniquely to levels of a
model term (grm1(Nclone) in the example below),
the !KCV grm1(Nclone)
qualifier collects the predicted values for those levels
of grm1(Nclone) that are predicted in the run but have no direct data.
These are written to a .kcv file.
Predicted values from
cross validation should normally be correlated with an independent
measure of the value. In a simulation context, we might keep the 'true'
value. A less desirable option is to correlate the predicted values
with values predicted from the whole data. If the CYCLE is extended
by 1 (that is, n=g+1), no records are dropped in the final round, and ASReml reports
the predictions from the full data in a second field in the .kcv file, together with their correlation with the CV
predictions.
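The cycling scheme described above can be sketched in a few lines of Python. This is only an illustration of the exclusion logic, not ASReml code: the function name retained_records and the six group labels are hypothetical, and g = 3 is chosen to keep the output short.

```python
def retained_records(cv_groups, cycle):
    """Return the indices of the records kept in the given cycle.

    Records whose group label equals the cycle number are dropped.
    In the extra cycle g+1 no label matches, so every record is kept.
    """
    return [i for i, group in enumerate(cv_groups) if group != cycle]


# Hypothetical group label for each of six records (g = 3 groups).
cv_groups = [1, 2, 3, 1, 2, 3]
g = max(cv_groups)

# Cycles 1..g each drop one group; cycle g+1 uses the full data.
for cycle in range(1, g + 2):
    print(cycle, retained_records(cv_groups, cycle))
```

In cycles 1 to g each record is left out exactly once, so every level of the model term receives one prediction made without its own data; the final cycle supplies the full-data prediction for comparison.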
For example, in evaluating the accuracy of prediction from a genomic model,
one might run the following model.
!WORK 1
Cross Validation test with Nassau Data
Nfam 71 !A
Nfemale 26 !A
Nmale 37 !A
Nclone 857 !A !L Clones.txt !LSKIP 1
MatOrder 914 !A
rep 8 !A
iblk 80 !A
culture 2 !A
DBH6 HT6 HT8 CWAC6 !M-9
CVgroup 10 !=Nclone !-1 !MOD 10 !+1
!CYCLE 1:11
snpData.mkr !SKIP 1 !HEAD 0 !CENTRE !MARKERS 4854 !IDS 923
nassau_cut_v3.csv !MAXIT 30 !SKIP 1 !DFF -1
!FILTER CVgroup !EXCLUDE $I !KCV grm1(Nclone) # Data
HT6 ~ mu culture culture.rep !r grm1(Nclo) 0.276 Nclone 0.152 rep.iblk 0.308
This code partitions the data into 10 groups using the variable CVgroup,
defined in this example from the variable Nclone so that every 10th clone is allocated to the same group.
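The arithmetic implied by the CVgroup definition can be mimicked as follows. This sketch assumes that the transformations !=Nclone !-1 !MOD 10 !+1 are applied left to right to the integer level code of Nclone (1 to 857) and that !MOD returns the ordinary remainder; the function name cv_group is hypothetical.

```python
def cv_group(nclone_level, n_groups=10):
    """Map a clone level code (1, 2, 3, ...) to a group label 1..n_groups.

    Assumed to mirror: copy Nclone, subtract 1, take the remainder
    modulo n_groups, then add 1.
    """
    return (nclone_level - 1) % n_groups + 1


# Group labels for the 857 clone levels declared in the example.
groups = [cv_group(level) for level in range(1, 858)]

# Number of clones allocated to each group.
sizes = {label: groups.count(label) for label in range(1, 11)}
print(sizes)
```

Under these assumptions clones 1, 11, 21, ... fall in group 1, clones 2, 12, 22, ... in group 2, and so on, so the groups differ in size by at most one clone.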
The !CYCLE 1:11 qualifier runs the analysis 11 times. The first 10 runs drop the
records pertaining to the respective groups. The last run includes all the data. The !KCV grm1(Nclone) qualifier causes ASReml to save the solutions for model term grm1(Nclone) corresponding to levels for which the data was omitted from the analysis
in one field and the values from cycle 11 in a second field.
The correlation between the fields is reported to the .asr file.
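For readers who want to reproduce such a correlation by hand, or to correlate the cross-validation predictions with an independent measure, a minimal Python sketch follows. It computes the ordinary Pearson correlation; whether ASReml uses exactly this statistic is an assumption, and the two short lists are made-up values standing in for the two fields, not output from the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values: predictions from the reduced data (first field)
# and a comparison value for the same levels (second field).
cv_pred = [0.12, -0.40, 0.33, 0.05, -0.21]
reference = [0.20, -0.55, 0.41, -0.02, -0.30]

print(round(pearson(cv_pred, reference), 3))
```

The same function applies unchanged when the second list holds 'true' values from a simulation rather than full-data predictions.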
Important: When performing cross-validation, the manner of partitioning the records can be critical.
The method used here is a simple one chosen for convenience in this example. Furthermore,
correlation of the predicted values from reduced data with predicted values from the full data
is not very helpful. Where an independent 'true' value exists (as in the case of simulated data),
that should be used.