Auto-detect unique ID columns and remove from predictors


We can auto-detect unique ID columns (#unique values == #observations) in H2OFrames and exclude them from the set of predictor columns, with a warning that they have been removed.
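
A minimal sketch of the detection rule, written in plain Python rather than against the H2O API. It assumes a frame can be modelled as a dict mapping column names to lists of values, with `None` standing in for NA. The helper names `find_id_columns` and `drop_id_predictors` are hypothetical and are not existing H2O functions.

```python
import warnings


def find_id_columns(columns):
    """Return names of columns where #unique values == #observations.

    `columns` is an assumed stand-in for a frame: {column name: list of values},
    with None representing NA.
    """
    id_cols = []
    for name, values in columns.items():
        n_rows = len(values)
        # Count distinct non-NA values; an NA never counts as a unique value.
        n_unique = len({v for v in values if v is not None})
        if n_rows > 0 and n_unique == n_rows:
            id_cols.append(name)
    return id_cols


def drop_id_predictors(columns, predictors):
    """Return `predictors` without unique ID columns, warning about removals."""
    id_cols = set(find_id_columns(columns))
    removed = [p for p in predictors if p in id_cols]
    if removed:
        warnings.warn(f"Removed unique ID columns from predictors: {removed}")
    return [p for p in predictors if p not in id_cols]
```

With columns `{"id": ["a", "b", "c", "d"], "age": [30, 30, 41, 52]}` and predictors `["id", "age"]`, only `age` would be kept. Note that a real-valued numeric column can also satisfy the literal rule, so an implementation might restrict the check to categorical, string, or integer columns; that restriction is an assumption and is not specified in this ticket.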

Motivation: ID columns are not useful for prediction. When they are encoded as a factor, they cause performance problems because they are extremely high-cardinality columns (one level per row). If the ID column is numeric, there is less of a performance issue, but it is still a useless predictor.

Suggestion for AutoML: at the AutoML level I think we can go further and do some gentle preprocessing, e.g.
if #unique values < #observations && #unique values == #non-NA values => replace the column with a “value.x.defined” column of yes/no values. This preserves the information about whether a value is NA or not, e.g. a missing social security number might be a good feature for an algo.
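
The suggested rule can be sketched as follows, again in plain Python and not against the H2O or AutoML API. It assumes columns are lists with `None` for NA. The function names and the `<name>.defined` naming of the replacement column are illustrative assumptions.

```python
def is_unique_with_missing(values):
    """True if #unique values < #observations and #unique values == #non-NA values."""
    non_na = [v for v in values if v is not None]
    n_unique = len(set(non_na))
    return n_unique < len(values) and n_unique == len(non_na)


def defined_indicator(values):
    """Map each value to "yes" if it is present and "no" if it is NA."""
    return ["no" if v is None else "yes" for v in values]


def preprocess(columns):
    """Replace ID-like columns that contain NAs with a yes/no 'defined' column.

    `columns` is an assumed stand-in for a frame: {column name: list of values}.
    All other columns are passed through unchanged.
    """
    out = {}
    for name, values in columns.items():
        if is_unique_with_missing(values):
            # Hypothetical naming scheme for the substituted column.
            out[f"{name}.defined"] = defined_indicator(values)
        else:
            out[name] = values
    return out
```

For example, a column `ssn` holding `["111", None, "333", None]` would be replaced by `ssn.defined` holding `["yes", "no", "yes", "no"]`, while a column with repeated values, or a fully populated unique ID column, would be left as it is by this step.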





Erin LeDell