Bad split on categorical variable in GBM and DRF affecting model quality
Description
Dataset and config to reproduce the bug
Training CSV used
| categorical_column | target |
| --- | --- |
| A | True |
| B | True |
| C | False |
| D | False |
| E | True |
| F | True |
| G | False |
| H | False |
Configuration used
| parameter | value |
| --- | --- |
| ntrees | 1 |
| max_depth | 1 |
| min_rows | 1 |
| nbins_cats | 4 for BUG and 8 for NO BUG |

| column | type |
| --- | --- |
| categorical_column | enum |
| target | enum |
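For convenience, here is a minimal reproduction sketch using the h2o Python client. The frame contents and parameters mirror the tables above; the specific client calls (`h2o.H2OFrame`, `asfactor`, `H2OGradientBoostingEstimator`, `auc(train=True)`) are my own illustration and were not part of the original ticket:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Build the training frame from the table above.
train = h2o.H2OFrame({
    "categorical_column": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "target": ["True", "True", "False", "False", "True", "True", "False", "False"],
})
# Both columns are enums, as in the configuration above.
train["categorical_column"] = train["categorical_column"].asfactor()
train["target"] = train["target"].asfactor()

def fit(nbins_cats):
    gbm = H2OGradientBoostingEstimator(ntrees=1, max_depth=1, min_rows=1,
                                       nbins_cats=nbins_cats)
    gbm.train(x=["categorical_column"], y="target", training_frame=train)
    return gbm

gbm_bug = fit(4)  # expected to show the bad split (training AUC 0.75)
gbm_ok = fit(8)   # expected to give the optimal split (training AUC 1.0)
print(gbm_bug.auc(train=True), gbm_ok.auc(train=True))
```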
Explanations
With nbins_cats = 8 (i.e. nbins_cats greater than or equal to the number of unique values of the categorical column), there is no bug: the training AUC is 1 as expected, and the tree is the expected one shown below:
Whereas with nbins_cats = 4, the bug occurs, i.e. a bad ("numerical") split on the categorical column; the training AUC is 0.75, which is confirmed by the bad tree shown below:
Normally, in this example, even with nbins_cats = 4 we should get the same optimal split as with nbins_cats = 8, and thus the AUC should be 1.
But as you can see, the AUC is only 0.75 instead of 1.
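One way to see the two splits directly is to inspect the single tree of each model. The sketch below reuses the gbm_bug and gbm_ok models from the reproduction sketch above and assumes the h2o tree API (H2OTree) and its node attributes behave as described in the h2o docs; a proper categorical split carries level assignments at the root, while the buggy split shows up as a numeric threshold instead:

```python
from h2o.tree import H2OTree

def show_root(model, label):
    # Pull the single tree of the model and look at its root split.
    root = H2OTree(model=model, tree_number=0).root_node
    print(label,
          "split on:", root.split_feature,
          "threshold:", root.threshold,        # a numeric threshold signals the bad split
          "left levels:", root.left_levels,    # level assignments signal a categorical split
          "right levels:", root.right_levels)

show_root(gbm_bug, "nbins_cats=4:")
show_root(gbm_ok, "nbins_cats=8:")
```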
Activity
@Michal Kurka
This bug DOES actually affect model quality.
I updated the ticket with a simple example to demonstrate that.
I found what triggers this bug.
The wrong "numerical" split on a categorical column happens when the value of the nbins_cats parameter is strictly lower than the number of unique values in that column.
Thus a workaround to avoid the bug is to set this parameter to a very large value (to be sure it is higher than the cardinality of any categorical column).
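For instance, a sketch of that workaround in the Python client, reusing the train frame from the reproduction sketch above; using nlevels() to read the column cardinality is my own choice of automation, not something prescribed in the ticket:

```python
from h2o.estimators import H2OGradientBoostingEstimator

# Workaround: make nbins_cats at least as large as the categorical cardinality.
cardinality = train["categorical_column"].nlevels()[0]  # 8 levels in the toy dataset

gbm = H2OGradientBoostingEstimator(ntrees=1, max_depth=1, min_rows=1,
                                   nbins_cats=cardinality)
gbm.train(x=["categorical_column"], y="target", training_frame=train)
print(gbm.auc(train=True))  # 1.0 expected once the split is no longer truncated
```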
Downgraded to regular priority; it doesn't seem to affect the quality of the model.
Not yet, we had to put this on hold for now - will let you know soon
@Michal Kurka any news on this?