Error when using a fold column when the number of folds < the declared number of levels in that column

Description

I am trying to do a pretty standard thing in ML and I am getting an error.

task:

  • there’s a “cv” categorical column, which has 5 values (5 folds)

  • I subset the frame by the cv column to make train (folds 1-4) and test (fold 5)

  • now I try to train an h2o.glm on train, and I want to do 4-fold CV using the 4 folds I have left, via the fold_column argument.

  • however, h2o.glm throws an error because train$cv still reports 5 levels while only 4 are represented in the dataset. I’ve confirmed this is the cause because it works if I use the original dataset with all 5 folds.

  • I can’t find a way to re-level the frame to tell it that the cv column only has 4 levels. h2o.setLevels() is just a renaming tool; you can’t change the cardinality of the domain with it.
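
To make the mismatch above concrete, here is a minimal pure-Python sketch. It is not H2O code and does not reflect how H2O stores frames internally: `CategoricalColumn` and all of its methods are hypothetical names invented for illustration. It assumes a categorical column is a declared domain plus per-row integer codes, and shows how subsetting rows can leave 5 declared levels with only 4 observed, plus what a "drop unused levels" step might look like.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CategoricalColumn:
    """Toy stand-in for a categorical column (hypothetical, not an H2O type)."""
    domain: List[str]  # declared levels, e.g. ["1", "2", "3", "4", "5"]
    codes: List[int]   # per-row index into `domain`

    def values(self) -> List[str]:
        # Decode each row back to its level label.
        return [self.domain[c] for c in self.codes]

    def subset(self, keep: List[bool]) -> "CategoricalColumn":
        # Filtering rows keeps the full declared domain, which mirrors
        # the behaviour described in this report.
        kept = [c for c, k in zip(self.codes, keep) if k]
        return CategoricalColumn(list(self.domain), kept)

    def observed_levels(self) -> List[str]:
        # Levels that actually occur in the rows, in domain order.
        used = set(self.codes)
        return [lvl for i, lvl in enumerate(self.domain) if i in used]

    def drop_unused_levels(self) -> "CategoricalColumn":
        # Shrink the domain to the observed levels and remap the codes.
        used = sorted(set(self.codes))
        remap = {old: new for new, old in enumerate(used)}
        return CategoricalColumn(
            [self.domain[i] for i in used],
            [remap[c] for c in self.codes],
        )


# Full frame: 5 folds, two rows per fold.
cv = CategoricalColumn(
    domain=["1", "2", "3", "4", "5"],
    codes=[0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
)

# Train split: everything except fold "5".
train_cv = cv.subset([v != "5" for v in cv.values()])
print(len(train_cv.domain), len(train_cv.observed_levels()))  # 5 4

# Re-levelled copy whose domain matches the folds actually present.
releveled = train_cv.drop_unused_levels()
print(releveled.domain)  # ['1', '2', '3', '4']
```

A check that compares the number of folds against `len(domain)` would reject `train_cv`, while one that uses the observed levels (or a re-levelled column) would accept it.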

Can we relax this restriction on fold_column in H2O algos?

Fixed

Assignee

Michal Kurka

Reporter

Erin LeDell

Priority

Major