Minimal Target Encoding support in AutoML
The objective of minimal support for TargetEncoding (TE) in AutoML is to expose an API that will allow us to incrementally improve the feature integration and auto-tuning later. The first automatic TE will be very straightforward (and activated based on rules about cardinality of categorical columns).
The main prerequisite is that should be resolved so that models trained by AutoML using TE can be used in production (the downloaded model mojo should be self-contained).
The second prerequisite is that TE works with all types of problems: no feature currently in AutoML works only for binomial, multinomial, or regression problems, and we want AutoML to keep fully supporting all of them: see and .
Initial minimal support in AutoML
TE will be integrated as follows:
TE is disabled by default (opt-in only) and is triggered with `preprocessing = ["target_encoding"]`.
A single TE model is trained, initially on all categorical columns of the training set (after internal splits), using the TE default parameters (blending off, noise on with defaults).
TE is applied to all algorithms except GLM and DNN.
The leakage strategy is KFold if CV is enabled, and None otherwise.
If CV is enabled with a user-provided `fold_column`, it is used by TE for the KFold strategy.
If CV is enabled with `nfolds`, AutoML will first generate a fold column, append it to the training frame, and use it for both TE and model training.
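The integration rules above can be sketched in plain Python. This is only an illustration of the decision logic, not the AutoML implementation: the function name `plan_target_encoding`, the returned dictionary layout, the algorithm labels, and the shuffled round-robin fold assignment are all assumptions made for the example.

```python
import random

# Algorithms that do not receive TE in this minimal integration.
# The string labels are illustrative, not an official list of identifiers.
TE_EXCLUDED_ALGOS = {"GLM", "DNN"}


def plan_target_encoding(algo, n_rows, nfolds=0, fold_column=None, seed=42):
    """Describe how TE would be configured for one algorithm (hypothetical helper)."""
    if algo in TE_EXCLUDED_ALGOS:
        return {"apply_te": False, "leakage_strategy": None, "fold_column": None}

    # CV is considered enabled with a user-provided fold column or with nfolds > 1.
    cv_enabled = fold_column is not None or nfolds > 1
    if not cv_enabled:
        return {"apply_te": True, "leakage_strategy": "None", "fold_column": None}

    if fold_column is None:
        # Assumed strategy: balanced round-robin assignment, then shuffled.
        # The same column is then shared by TE and by model training.
        fold_column = [i % nfolds for i in range(n_rows)]
        random.Random(seed).shuffle(fold_column)

    return {"apply_te": True, "leakage_strategy": "KFold", "fold_column": fold_column}
```

For example, `plan_target_encoding("GBM", n_rows=10, nfolds=5)` selects the KFold strategy with a generated fold column of length 10, while the same call without `nfolds` falls back to the None strategy.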
Given the extreme simplicity of this minimal integration, we can’t expect TE to improve predictive performance on most datasets; that will be the objective of further TE hyperparameter optimization. Ticket for TE integration improvements is here:
Rather than exposing a TE parameter at the top level, we are abstracting to a more generic “preprocessing” arg, so we can add other preprocessors in the future (e.g. label encoding). We also chose to use an array/list rather than a dictionary so that the user can specify the order of the preprocessing steps.
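To illustrate why an ordered list is convenient, here is a minimal sketch of how such an argument could be validated and turned into an ordered sequence of steps. The helper `resolve_preprocessing` and the `SUPPORTED_STEPS` tuple are hypothetical names introduced for this example; only the `"target_encoding"` step name comes from this document.

```python
# Only step named in this document; future preprocessors would be appended here.
SUPPORTED_STEPS = ("target_encoding",)


def resolve_preprocessing(preprocessing=None):
    """Validate a preprocessing list and return the steps in user order (hypothetical)."""
    if preprocessing is None:
        return []  # default: preprocessing is off
    if isinstance(preprocessing, str):
        raise TypeError("preprocessing must be a list of step names, not a string")

    steps = []
    for step in preprocessing:
        if step not in SUPPORTED_STEPS:
            raise ValueError(f"unsupported preprocessing step: {step!r}")
        if step in steps:
            raise ValueError(f"duplicate preprocessing step: {step!r}")
        steps.append(step)  # list order is the execution order
    return steps
```

With this sketch, `resolve_preprocessing()` returns an empty list (TE off), and `resolve_preprocessing(["target_encoding"])` returns the single automatic TE step.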
Here is what it will look like in Python & R:
The first line is off (default), the second line is automatic TE (we define the steps), and the third is custom TE. The last form will be added later (maybe 22.214.171.124): we first want to encourage users to use the “auto” feature rather than trying to tune TE themselves. Ticket for customization is here:
Current rule for TE activation for a column is (we need to update the text description above when we’re done to accurately reflect what’s implemented):
where: _cardinalityThreshold defaults to 25
Comment on “activated based on rules about cardinality of categorical columns”: I thought we agreed we wouldn’t do that immediately.