Improve AutoML Target Encoding integration (auto mode)

Description

Ideas for improving the general performance of the basic Target Encoding (TE) integration in AutoML (currently turned off by default, but activated by setting `preprocessing = ["target_encoding"]` in the AutoML function).

  • Configure TE on a per-algorithm basis (XGBoost and non-XGBoost models to start, then tune each model separately). One suggestion was to apply TE only to categorical columns with cardinality >= 10 for XGBoost and >= 25 for H2O tree algos.

  • Consider not applying TE at all to DNN models (DNNs can learn interactions more easily than tree models. TE is usually bad for NNs, since they probably overfit to the values TE provides instead of learning them through backprop.)

  • Try a different minimum cardinality threshold (when > N categories, turn on TE, otherwise leave it off)

  • Different upper cardinality threshold (when > N categories, drop original categorical column, otherwise keep original column in the training frame)
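
The ideas in the list above could be expressed as a small per-algorithm policy table. This is only a sketch to make the proposal concrete, not the AutoML implementation: `TEPolicy`, `POLICIES`, `plan_column`, the algorithm keys, and the upper threshold of 100 used in the usage note are all hypothetical. The lower thresholds (10 for XGBoost, 25 for H2O tree algos) come from the suggestion above, and the lower bound is treated as inclusive to match it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TEPolicy:
    # Hypothetical per-algorithm settings, not an existing H2O class.
    min_cardinality: Optional[int]             # None: never apply TE
    drop_original_above: Optional[int] = None  # None: always keep the original column


# Assumed algorithm keys; thresholds taken from the suggestion in this ticket.
POLICIES = {
    "xgboost": TEPolicy(min_cardinality=10),
    "gbm": TEPolicy(min_cardinality=25),
    "drf": TEPolicy(min_cardinality=25),
    "deeplearning": TEPolicy(min_cardinality=None),  # skip TE for DNNs
}


def plan_column(policy: TEPolicy, cardinality: int) -> Tuple[bool, bool]:
    """Return (encode, drop_original) for one categorical column."""
    encode = (
        policy.min_cardinality is not None
        and cardinality >= policy.min_cardinality
    )
    drop_original = (
        encode
        and policy.drop_original_above is not None
        and cardinality > policy.drop_original_above
    )
    return encode, drop_original
```

For example, `plan_column(POLICIES["gbm"], 10)` leaves the column unencoded, while `plan_column(TEPolicy(10, 100), 101)` both encodes the column and drops the original.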

Our current approach is: a column is encoded if its cardinality is >= 10 (hard limit) and nrows/cardinality >= 10 (blending inflection point).
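
As a rough illustration of the current rule, the check can be written as a single predicate. This is a minimal sketch under the thresholds stated above, not the actual AutoML code: the function name `should_encode` and its parameter names are assumptions made for this example.

```python
def should_encode(cardinality: int, nrows: int,
                  min_cardinality: int = 10,
                  min_rows_per_level: int = 10) -> bool:
    """Return True when a categorical column passes both checks.

    min_cardinality: hard lower limit on the number of levels.
    min_rows_per_level: lower limit on nrows/cardinality
        (the blending inflection point mentioned in the ticket).
    """
    if cardinality <= 0:
        # An empty or invalid column is never encoded.
        return False
    return (cardinality >= min_cardinality
            and nrows / cardinality >= min_rows_per_level)
```

Under this rule a column with 50 levels is encoded in a 500-row frame (10 rows per level) but not in a 400-row frame (8 rows per level).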

We also want to improve the user experience by offering customizable TE encodings (for now it's just an on/off switch for an auto-TE strategy). The ticket for that is here:

Assignee

Sebastien Poirier

Fix versions

Reporter

Erin LeDell

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

ReleaseNotesHidden

None

CustomerVisible

No

Components

Priority

Major