This is just an epic to collect tickets related to improvements in the R and Python interface for H2O Explainability.
A brief list (to be converted into JIRAs):
For AutoML objects, don’t use the full model id in the labels/axes/legends of the plots output by h2o.explain() – use shortened model_id names instead. The AutoML timestamp at the end of the id does not help with visually comparing the different models/algos (the long ids are distracting to the eye), so we could strip it from the display by default (maybe with an override in plot_overrides to show the full model_id). It would be nice to see the model names as just: StackedEnsemble_AllModels, GLM_1, DRF_1, XGBoost_3, GBM_grid__1__model_3, etc.
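A minimal sketch of the id shortening, assuming the default AutoML naming scheme (`<name>_AutoML_<YYYYMMDD>_<HHMMSS>`); the helper name and regex are illustrative, not existing H2O API:

```python
import re

def shorten_model_id(model_id: str) -> str:
    # Hypothetical helper: strip the "_AutoML_<date>_<time>" run suffix
    # wherever it appears in the id (it sits mid-id for grid models).
    return re.sub(r"_AutoML_\d{8}_\d{6}", "", model_id)
```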
The Model Correlation plot highlights interpretable models (GLM) in red text (in Python), but we don’t explain what the red means, and we don’t do it in other visuals like the Varimp Heatmap. Need to check whether this is also the case in R.
Move the explanation descriptions into a JSON or text file so there’s a single source, and read that file into both R and Python (a single source is easier to edit). Then make some updates to the descriptions.
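A sketch of the single-source idea on the Python side (the file contents and key names here are placeholders, not the actual description text):

```python
import json

# Stand-in for the shared JSON file; both the R and Python clients would
# parse the same file instead of keeping duplicated hard-coded strings.
DESCRIPTIONS_JSON = """
{
  "leaderboard": "Leaderboard shows models ranked by a default metric.",
  "varimp_heatmap": "Variable Importance Heatmap compares importances across models."
}
"""

DESCRIPTIONS = json.loads(DESCRIPTIONS_JSON)

def describe(section: str) -> str:
    # Fall back to an empty string for sections with no description yet.
    return DESCRIPTIONS.get(section, "")
```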
I think we would benefit from using a title for the plot name and a subtitle for the model_id in all the R plots, since they are pretty squished when viewed inside RStudio.
I wonder if the printed Leaderboard (specifically when you pass in an AutoML object) should just show the top 20 models by default?
We can default the Leaderboard to 20 models, but allow the user to override this with plot_overrides, e.g. plot_overrides = list(leaderboard = list(nrow = -1)) to get all rows (or maybe there’s something better than -1 to use here, like “ALL”, if they want to show all rows), or they can set it to a particular number, like 50.
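The proposed nrow semantics could look like this sketch (the function name and the -1/"ALL" sentinels are assumptions from the discussion above, not an existing option):

```python
def limit_leaderboard(rows, nrow=20):
    # -1 or "ALL" means show everything; a positive number truncates.
    if nrow in (-1, "ALL"):
        return rows
    return rows[:nrow]
```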
Add more information at the top of the explain print-out for AutoML-specific stats (how many models of each type, and the best score, using the default loss, for each algo type).
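A sketch of that summary computation; `leaderboard` is assumed here to be (model_id, algo, score) tuples for illustration, not the actual H2O leaderboard object:

```python
from collections import defaultdict

def automl_summary(leaderboard, lower_is_better=True):
    # Count models per algo and track the best default-metric score per algo.
    counts = defaultdict(int)
    best = {}
    for _model_id, algo, score in leaderboard:
        counts[algo] += 1
        if algo not in best:
            best[algo] = score
        elif (score < best[algo]) if lower_is_better else (score > best[algo]):
            best[algo] = score
    return dict(counts), best
```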
Visual improvement tweaks/ideas for the printed Leaderboard
is there a way to control the number of decimal places shown? we could reduce to about 5 decimal places so the table is skinnier and fits on the page better
is it easy to left-align the model names in the first column? then it would be easier to read the model type at a glance.
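Both tweaks amount to simple cell formatting; a sketch (column width and digit count are arbitrary choices for illustration):

```python
def format_leaderboard_row(model_id, metrics, id_width=40, ndigits=5):
    # Left-align the model id in a fixed-width column, then round each
    # metric to a fixed number of decimal places.
    cells = [f"{model_id:<{id_width}}"]
    cells += [f"{m:.{ndigits}f}" for m in metrics]
    return "  ".join(cells)
```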
Since the user passes the newdata test frame explicitly to h2o.explain(), I am wondering whether we shouldn’t just use the test-set leaderboard metrics instead of the default CV metrics. But then it delivers a different “view” of the leaderboard than the internal AutoML object has… so there will be some inconsistency either way; we just have to choose which type of inconsistency is better/worse.
Add a learning curve for the leader model (let’s decide whether we want to plot train vs CV error, validation error, error for a single fold, etc.).
Not sure if this is related, but .explain() doesn’t seem to be available for 18.104.22.168. See