Minimal Target Encoding support in AutoML

Description

The objective of minimal Target Encoding (TE) support in AutoML is to expose an API that lets us incrementally improve the feature integration and auto-tuning later. The first automatic TE will be very straightforward (and activated based on rules about cardinality of categorical columns).

The main prerequisite is that should be resolved so that models trained by AutoML with TE can be used in production (the downloaded model MOJO should be self-contained).

The second prerequisite is that TE works with all problem types: currently no AutoML feature works only for binomial, multinomial, or regression problems, and we want AutoML to keep fully supporting all of them: see and .

Initial minimal support in AutoML

TE will be integrated as follows:

  • TE is disabled by default (opt-in only) and is enabled with `preprocessing = ["target_encoding"]`.

  • a single TE model is trained, initially on all categorical columns of the training set (after internal splits), using the TE default parameters (blending off, noise on with default settings).

  • TE is applied to all algorithms except GLM and DNN

  • leakage strategy is KFold if CV is enabled, and None otherwise.

  • if CV is enabled with a user-provided `fold_column`, TE uses that column for the KFold strategy.

  • if CV is enabled with `nfolds`, AutoML first generates a fold column, appends it to the training frame, and uses it for both TE and model training.
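
The leakage-strategy and fold-column rules listed above can be sketched in plain Python. This is only an illustration of the decision logic, not the AutoML implementation: the function names, the `__fold__` column name, the modulo fold assignment, and the "CV is enabled when `nfolds` > 1" test are all assumptions made for the example.

```python
def choose_leakage_strategy(nfolds=0, fold_column=None):
    """Return "KFold" when cross-validation is enabled, "None" otherwise."""
    # Assumption: CV counts as enabled with a fold column or with nfolds > 1.
    cv_enabled = fold_column is not None or nfolds > 1
    return "KFold" if cv_enabled else "None"


def resolve_fold_column(rows, nfolds=0, fold_column=None):
    """Return (rows, fold column name) shared by TE and model training.

    rows is a list of dicts, one per training row.
    """
    if fold_column is not None:
        # A user-provided fold column is reused as-is.
        return rows, fold_column
    if nfolds > 1:
        # Hypothetical generated column; modulo assignment is an assumption.
        name = "__fold__"
        with_folds = [dict(row, **{name: i % nfolds}) for i, row in enumerate(rows)]
        return with_folds, name
    # No CV: no fold column, and the leakage strategy is "None".
    return rows, None


train = [{"x": i} for i in range(5)]
train_with_folds, fold_col = resolve_fold_column(train, nfolds=3)
strategy = choose_leakage_strategy(nfolds=3)
```

The point of the sketch is that the same fold column is handed to both TE and the models, so the encodings are computed on the same folds that the cross-validated models see.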

Given the extreme simplicity of this minimal integration, we can’t expect TE to improve predictive performance on most datasets; that will be the objective of further TE hyperparameter optimization. Ticket for TE integration improvements is here:

API:

Rather than exposing a TE parameter at the top level, we are abstracting to a more generic `preprocessing` argument, so that we can add other preprocessors in the future (e.g. label encoding). We also chose an array/list over a dictionary so that the user can specify the order of the preprocessing steps.
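
To illustrate why an ordered list fits here, the following is a minimal sketch of applying named preprocessing steps in the order given. It is not the AutoML code: the registry, the step functions, and the `"label_encoding"` name are hypothetical; only `"target_encoding"` is named in this ticket.

```python
# Each hypothetical step just records its name, to make the order visible.
def target_encoding_step(applied):
    return applied + ["target_encoding"]


def label_encoding_step(applied):
    return applied + ["label_encoding"]


# Hypothetical registry mapping step names to implementations.
PREPROCESSORS = {
    "target_encoding": target_encoding_step,
    "label_encoding": label_encoding_step,
}


def apply_preprocessing(steps, data):
    """Apply the named steps to data in the order the user listed them."""
    for name in steps or []:  # None or [] means preprocessing is off
        if name not in PREPROCESSORS:
            raise ValueError(f"unknown preprocessing step: {name}")
        data = PREPROCESSORS[name](data)
    return data


order_a = apply_preprocessing(["target_encoding", "label_encoding"], [])
order_b = apply_preprocessing(["label_encoding", "target_encoding"], [])
```

With a list, `order_a` and `order_b` differ, so the order is part of what the user expresses through the API rather than an implementation detail.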

Here is what it will look like in Python & R: the first line is off (default), the second line is automatic TE (we define the steps), and the third is custom TE. (The last option will be added later, maybe in 3.32.0.2; we first want to encourage users to use the “auto” feature rather than trying to tune TE themselves.) Ticket for customization is here:

Python:

R:

Activity

Sebastien Poirier
September 16, 2020, 5:04 PM

FYI

Sebastien Poirier
September 17, 2020, 11:47 PM

“activated based on rules about cardinality of categorical columns”

I thought we agreed we wouldn’t do that immediately

Erin LeDell
September 24, 2020, 1:21 AM
Edited

Current rule for TE activation for a column is (we need to update the text description above when we’re done to accurately reflect what’s implemented):


https://github.com/h2oai/h2o-3/blob/3618b1ae59c4f56595d41e01998731836bcab255/h2o-automl/src/main/java/ai/h2o/automl/preprocessing/TargetEncoding.java#L148-L160

where `_cardinalityThreshold` defaults to 25.
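
For readers without the linked source open, here is a hedged sketch of what a cardinality-based activation rule could look like. The default of 25 comes from the comment above; everything else is an assumption, in particular the direction of the comparison (a column qualifying when its cardinality is at least the threshold), the function name, and treating all-string columns as categorical. The linked Java source remains the reference.

```python
def columns_to_encode(columns, cardinality_threshold=25):
    """Return the names of columns that would be target-encoded.

    columns maps a column name to its list of values.
    """
    selected = []
    for name, values in columns.items():
        # Assumption for this sketch: all-string columns are categorical.
        is_categorical = all(isinstance(v, str) for v in values)
        cardinality = len(set(values))
        # Assumption: ">=" is the comparison; check the linked source.
        if is_categorical and cardinality >= cardinality_threshold:
            selected.append(name)
    return selected


frame = {
    "city": [f"city_{i}" for i in range(30)],  # 30 distinct levels
    "color": ["red", "green", "blue"] * 10,    # 3 distinct levels
    "age": list(range(30)),                    # numeric, never encoded
}
encoded = columns_to_encode(frame)
```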

Assignee

Sebastien Poirier

Reporter

Sebastien Poirier

CustomerVisible

No

Priority

Major