Improve the Model Explainability interface in R & Python

Description

This is just an epic to collect tickets related to improvements in the R and Python interface for H2O Explainability.

A brief list (to be converted into JIRAs):

  • For AutoML objects, don’t use the whole model_id in the labels/axes/legends of the plots produced by h2o.explain() – use shortened model_id names instead. The timestamp at the end doesn’t help with visually comparing the different models/algos (the long ids are distracting to the eye), so we could drop it from the display by default (maybe with a plot_overrides option to show the full model_id). It would be nice to see the model names as just: StackedEnsemble_AllModels, GLM_1, DRF_1, XGBoost_3, GBM_grid__1__model_3, etc. (see the id-shortening sketch after this list).

  • The model correlation plot highlights interpretable models (GLM) in red text (in Python), but we don’t explain what the red means, and we don’t do the same in the other visuals like the Varimp Heatmap. Need to check whether this is also the case in R.

  • Move the explanation descriptions into a JSON or text file so there’s a single source, and read that file into both R and Python (easier to edit one source). Then make some updates to the descriptions (see the single-source sketch after this list).

  • I think all the R plots would benefit from using the title for the plot name and the subtitle for the model_id, since the plots are pretty squished when viewed inside RStudio (see the ggplot2 sketch after this list).
    https://www.datanovia.com/en/blog/ggplot-title-subtitle-and-caption/

  • I wonder if the Leaderboard print-out (specifically when you pass in an AutoML object) should just show the top 20 models by default?

  • We can default the Leaderboard to 20 models, but we could let the user override this with plot_overrides, e.g. plot_overrides = list(leaderboard = list(nrow = -1)) to get all rows (or maybe there’s something better than -1 here, like “ALL”, if they want to show everything), or they can set a particular number, like 50 (see the override sketch after this list).

  • Add more information at the top of the explain print-out for AutoML-specific stats: how many models of each type, and the best score (using the default loss) for each algo type (see the summary sketch after this list).

  • Visual improvement tweaks/ideas for the printed Leaderboard (see the formatting sketch after this list):

    • Is there a way to control the number of decimal places shown? We could reduce to about 5 decimal places to make the table skinnier so it fits on the page better.

    • Is it easy to left-align the model names in the first column? Then it would be easier to read the type of model.

  • Since the user passes the newdata test frame explicitly to h2o.explain(), I am wondering whether we should just use the test-set leaderboard metrics instead of the default CV metrics. But then it delivers a different “view” of the leaderboard than the internal AutoML object has… so there will be some inconsistency either way; we just have to choose which type of inconsistency is better/worse.

  • Add a learning curve for the leader model (let’s decide whether we want to plot train vs. CV error, validation error, error for a single fold, etc.) – see the learning-curve sketch below.
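
Illustrative R sketches for several of the items above follow. None of these are existing APIs unless noted; treat everything as a sketch.

Id-shortening sketch: a minimal approach, assuming AutoML model ids embed a run stamp like “AutoML_20201022_143300” (the exact pattern may vary across H2O versions):

    # Drop the "AutoML_<date>_<time>" run stamp (and its leading underscore)
    shorten_model_id <- function(model_id) {
      gsub("_?AutoML_\\d{8}_\\d{6}", "", model_id)
    }

    shorten_model_id("StackedEnsemble_AllModels_AutoML_20201022_143300")
    # "StackedEnsemble_AllModels"
    shorten_model_id("GBM_grid__1_AutoML_20201022_143300_model_3")
    # "GBM_grid__1_model_3"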
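
Single-source descriptions sketch: the file name and layout are hypothetical; the point is that R (via jsonlite) and Python (via json) would read the same file:

    # explanations.json (hypothetical) might look like:
    # {"varimp_heatmap": "Variable importance across all models ...",
    #  "model_correlation": "Correlation between model predictions ..."}
    library(jsonlite)
    descriptions <- fromJSON(system.file("explanations.json", package = "h2o"))
    cat(descriptions[["varimp_heatmap"]])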
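
Title/subtitle sketch with ggplot2, using a built-in dataset as a stand-in for a real explanation plot:

    library(ggplot2)
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point() +
      labs(title = "Partial Dependence",        # plot name as the title
           subtitle = "GBM_grid__1_model_3") +  # model_id demoted to the subtitle
      theme(plot.subtitle = element_text(size = 9, colour = "grey40"))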
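
Override sketch: the leaderboard key and its nrow argument are the proposal itself, not something plot_overrides accepts today:

    h2o.explain(aml, newdata = test,
                plot_overrides = list(leaderboard = list(nrow = 50)))
    # or nrow = -1 (or "ALL") to show every row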
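
Summary sketch, deriving per-algorithm counts and best scores from the leaderboard; assumes the algorithm name is the model_id prefix, and that the sort metric column is rmse (it depends on the problem type):

    lb <- as.data.frame(aml@leaderboard)
    lb$algo <- sub("_.*$", "", lb$model_id)  # "GBM_grid__1_..." -> "GBM"
    table(lb$algo)                           # how many models of each type
    tapply(lb$rmse, lb$algo, min)            # best (lowest) default metric per algo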
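
Formatting sketch covering both Leaderboard tweaks, on a plain data.frame rendition of the leaderboard:

    lb <- as.data.frame(aml@leaderboard)
    is_num <- vapply(lb, is.numeric, logical(1))
    lb[is_num] <- lapply(lb[is_num], function(x) sprintf("%.5f", x))  # cap at 5 decimal places
    print(lb, right = FALSE, row.names = FALSE)                       # left-align the model_id column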
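
Learning-curve sketch built from the leader’s scoring history; the column names shown are illustrative and vary by algorithm:

    sh <- as.data.frame(h2o.scoreHistory(aml@leader))
    plot(sh$number_of_trees, sh$training_rmse, type = "l",
         xlab = "Number of trees", ylab = "RMSE", main = "Leader learning curve")
    lines(sh$number_of_trees, sh$validation_rmse, lty = 2)
    legend("topright", legend = c("training", "validation"), lty = c(1, 2))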

Activity

Hud Wahab
October 22, 2020, 2:33 PM

Not sure if this is related, but .explain() doesn’t seem to be available for 3.30.1.3. See

Assignee

Tomas Fryda

Reporter

Erin LeDell

CustomerVisible

No

Priority

Major

Epic Name

Improvements to Model Explainability interface