Ideas for improving the general performance of the basic Target Encoding (TE) integration in AutoML. TE is currently turned off by default in AutoML and is activated by passing `preprocessing = ["target_encoding"]` to the AutoML function.
Configure TE on a per-algorithm basis (start with XGBoost vs. non-XGBoost models, then tune each model separately). One suggestion: apply TE only to categorical columns with cardinality >= 10 for XGBoost and >= 25 for H2O tree algos.
Consider not applying TE at all to DNN models. DNNs can discover feature interactions more easily than tree models, and TE is usually harmful for NNs: they tend to overfit to the values TE provides instead of learning them via backprop.
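The two ideas above could be sketched as a per-algorithm threshold table. Note this is a hypothetical illustration: the dictionary name, algorithm keys, and helper function are not actual AutoML internals, and the threshold values are just the suggestions from the notes.

```python
# Hypothetical per-algorithm minimum-cardinality thresholds for applying TE.
# Values follow the suggestion above; names are illustrative only.
TE_MIN_CARDINALITY = {
    "XGBoost": 10,   # suggested threshold for XGBoost
    "GBM": 25,       # suggested threshold for H2O tree algos
    "DRF": 25,
    # "DeepLearning" deliberately absent: no TE for DNNs
}

def columns_to_encode(cardinalities, algo):
    """Return the categorical columns TE would be applied to for `algo`."""
    threshold = TE_MIN_CARDINALITY.get(algo)
    if threshold is None:
        return []  # algo opted out of TE entirely (e.g. DNNs)
    return [col for col, card in cardinalities.items() if card >= threshold]

cards = {"zip_code": 900, "state": 50, "device": 12, "gender": 2}
print(columns_to_encode(cards, "XGBoost"))       # ['zip_code', 'state', 'device']
print(columns_to_encode(cards, "GBM"))           # ['zip_code', 'state']
print(columns_to_encode(cards, "DeepLearning"))  # []
```

Keeping the policy in one table would make it easy to tune each algorithm independently later.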
Different minimal cardinality threshold (when a column has > N categories, turn TE on; otherwise leave it off).
Different upper cardinality threshold (when a column has > N categories, drop the original categorical column; otherwise keep the original column in the training frame).
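The two-threshold proposal can be sketched as a single decision function. The function name and the default `lower`/`upper` values are placeholders for the N's to be tuned, not values anyone has proposed.

```python
def te_column_plan(cardinality, lower=10, upper=1000):
    """Sketch of the proposed two-threshold policy (thresholds hypothetical):
    - card <= lower: no TE, keep the original column
    - lower < card <= upper: apply TE, keep the original column
    - card > upper: apply TE, drop the original column"""
    if cardinality <= lower:
        return ("skip_te", "keep_original")
    if cardinality > upper:
        return ("apply_te", "drop_original")
    return ("apply_te", "keep_original")

print(te_column_plan(5))     # ('skip_te', 'keep_original')
print(te_column_plan(50))    # ('apply_te', 'keep_original')
print(te_column_plan(5000))  # ('apply_te', 'drop_original')
```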
Our current approach: a column is encoded if card >= 10 (hard limit) and nrows/card >= 10 (blending inflection point).
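For reference, the current heuristic amounts to the following check (a minimal sketch; the function name and parameter names are mine, the two constants are from the rule above):

```python
def is_te_applied(cardinality, nrows, min_card=10, min_rows_per_level=10):
    """Current heuristic: encode only when the column has at least
    `min_card` levels (hard limit) and at least `min_rows_per_level`
    rows per level (nrows/card, the blending inflection point)."""
    return cardinality >= min_card and nrows / cardinality >= min_rows_per_level

print(is_te_applied(cardinality=50, nrows=10_000))  # True
print(is_te_applied(cardinality=5, nrows=10_000))   # False (below hard limit)
print(is_te_applied(cardinality=50, nrows=300))     # False (only 6 rows per level)
```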
We also want to improve the user experience by offering customizable TE encodings (for now it's just an on/off switch for an auto-TE strategy). Ticket for that is here: