Bad split on categorical variable in GBM and DRF affecting model quality

Description

Dataset and config to reproduce the bug

Training CSV used

categorical_column,target
A,True
B,True
C,False
D,False
E,True
F,True
G,False
H,False

Configuration used

parameter | value
ntrees | 1
max_depth | 1
min_rows | 1
nbins_cats | 4 (bug) or 8 (no bug)

column | type
categorical_column | enum
target | enum
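
A possible reproduction script for the dataset and configuration above. This is a minimal sketch that assumes the ticket concerns H2O-3 and its Python client (the ticket itself does not name the client used); the helper names `training_csv` and `training_auc` are hypothetical, and GBM is used here although the title says DRF is affected too.

```python
import csv
import io

# Training data from the ticket: one categorical feature, one binary target.
ROWS = [
    ("A", "True"), ("B", "True"), ("C", "False"), ("D", "False"),
    ("E", "True"), ("F", "True"), ("G", "False"), ("H", "False"),
]

# Parameters from the ticket; nbins_cats is passed separately
# (4 is reported to trigger the bug, 8 is reported not to).
BASE_PARAMS = {"ntrees": 1, "max_depth": 1, "min_rows": 1}


def training_csv():
    """Return the training data as CSV text with the ticket's column names."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["categorical_column", "target"])
    writer.writerows(ROWS)
    return buf.getvalue()


def training_auc(nbins_cats):
    """Train a one-tree, depth-1 GBM and return its training AUC.

    Assumption: the h2o package is installed and a local cluster can start.
    """
    import h2o  # imported lazily so the data helpers work without h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()
    frame = h2o.H2OFrame(
        {"categorical_column": [r[0] for r in ROWS],
         "target": [r[1] for r in ROWS]},
        column_types={"categorical_column": "enum", "target": "enum"},
    )
    model = H2OGradientBoostingEstimator(nbins_cats=nbins_cats, **BASE_PARAMS)
    model.train(x=["categorical_column"], y="target", training_frame=frame)
    return model.auc()
```

Calling `training_auc(4)` and `training_auc(8)` on an affected version should let the two AUC values reported in this ticket be compared directly.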

Explanations

With nbins_cats = 8 (i.e. nbins_cats greater than or equal to the number of unique values of the categorical column), there is no bug: the training AUC is 1 as expected, and the tree is the expected one below:

With nbins_cats = 4, the bug appears: the categorical column gets a bad, "numerical" split, the training AUC is 0.75, and the bad tree shown below confirms it:

In this example, even with nbins_cats = 4 we should get the same optimal split as with nbins_cats = 8, so the AUC should be 1.
Instead, the AUC is only 0.75.
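
The two AUC values can be checked by hand. The sketch below is plain arithmetic, not the library's split-finding code: it assumes a depth-1 tree that predicts the mean target of each leaf, and it assumes a "numerical" split means a threshold over the levels in alphabetical order. The function name `stump_auc` is hypothetical.

```python
LEVELS = "ABCDEFGH"
TARGET = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 1, "F": 1, "G": 0, "H": 0}


def stump_auc(left):
    """Training AUC of a depth-1 tree sending the levels in `left` to one leaf."""
    left = set(left)
    right = set(LEVELS) - left

    def leaf_mean(leaf):
        return sum(TARGET[c] for c in leaf) / len(leaf)

    # Each row is scored with the mean target of the leaf it falls into.
    score = {c: leaf_mean(left if c in left else right) for c in LEVELS}
    pos = [score[c] for c in LEVELS if TARGET[c] == 1]
    neg = [score[c] for c in LEVELS if TARGET[c] == 0]
    # AUC = P(positive scored above negative), ties counting one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Set-based (categorical) split: levels grouped by their target.
set_auc = stump_auc({"A", "B", "E", "F"})

# Ordinal ("numerical") splits: left leaf = the first k levels.
threshold_aucs = {k: stump_auc(LEVELS[:k]) for k in range(1, len(LEVELS))}
best_threshold_auc = max(threshold_aucs.values())

print(set_auc)             # 1.0
print(best_threshold_auc)  # 0.75
```

Under these assumptions no threshold split can exceed 0.75 (reached by {A, B} vs. the rest, or {G, H} vs. the rest), which is consistent with the AUC observed with nbins_cats = 4, while the set-based split {A, B, E, F} vs. {C, D, G, H} gives 1.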

Activity

Former user
February 1, 2019, 10:24 AM
Edited

@Michal Kurka
This bug DOES actually affect model quality.
I updated the ticket with a simple example to demonstrate that.

Former user
January 31, 2019, 1:55 PM

I found what triggers this bug.
The wrong "numerical" split on a categorical column happens when the value of the parameter nbins_cats is strictly lower than the number of unique values in this column.
Thus a workaround to avoid the bug is to set this parameter to a very large value (to be sure it is at least as high as the cardinality of every categorical column).
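
Rather than guessing a very large value, the smallest value that satisfies this condition can be computed from the data. A minimal sketch, assuming the rows are available as dictionaries; the helper name `min_safe_nbins_cats` is hypothetical and not part of any library:

```python
def min_safe_nbins_cats(rows, categorical_columns):
    """Return the largest number of distinct values among the given columns.

    Per the observation above, nbins_cats should be at least this value.
    """
    return max(len({row[col] for row in rows}) for col in categorical_columns)


# The training data from this ticket.
rows = [
    {"categorical_column": level, "target": target}
    for level, target in zip("ABCDEFGH",
                             ["True", "True", "False", "False",
                              "True", "True", "False", "False"])
]

nbins_cats = min_safe_nbins_cats(rows, ["categorical_column"])
print(nbins_cats)  # 8
```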

Michal Kurka
July 3, 2018, 9:10 PM

Downgraded to regular priority, doesn't seem to affect the quality of the model.

Michal Kurka
May 24, 2018, 2:48 PM

Not yet, we had to put this on hold for now. Will let you know soon.

Former user
May 14, 2018, 12:28 PM

@Michal Kurka any news on this ?

Assignee: Michal Kurka
Fix versions: None
Reporter: Former user
Support ticket URL: None
Labels:
Affected Spark version: None
Customer Request Type: None
Task progress: None
ReleaseNotesHidden: None
CustomerVisible: Yes
Components:
Affects versions:
Priority: Critical