Fault tolerant grid search

Description

Grid search:
save data (detect, per algo, which frames need to be saved)
save params
enable model checkpointing

On crash:
reload data
reload trained models
restart grid search with the same params (the grid will automatically continue where it left off)
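
The save and reload steps above can be sketched in plain Python. This is only a minimal illustration of the idea, not the H2O-3 API: `run_grid`, `resume_grid`, `load_models`, the `grid_params.json` file name and the `fail_after` crash switch are all hypothetical names, and saving the training frames is left out for brevity.

```python
import itertools
import json
from pathlib import Path


def expand_grid(hyper_params):
    """Return every hyperparameter combination in a deterministic order."""
    names = sorted(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def combo_key(combo):
    """Stable identifier for one combination, used in the checkpoint file name."""
    return "_".join(f"{name}-{combo[name]}" for name in sorted(combo))


def run_grid(recovery_dir, hyper_params, train_fn, fail_after=None):
    """Train every combination that has no checkpoint in recovery_dir yet.

    fail_after (hypothetical) simulates a crash after that many new models.
    Returns the combinations trained during this call.
    """
    recovery_dir = Path(recovery_dir)
    recovery_dir.mkdir(parents=True, exist_ok=True)
    # Save params so a restart can be launched with the same grid.
    params_file = recovery_dir / "grid_params.json"
    params_file.write_text(json.dumps(hyper_params, sort_keys=True))
    trained_now = []
    for combo in expand_grid(hyper_params):
        model_file = recovery_dir / f"model_{combo_key(combo)}.json"
        if model_file.exists():
            continue  # this model was checkpointed before the crash
        if fail_after is not None and len(trained_now) >= fail_after:
            raise RuntimeError("simulated cluster failure")
        model = train_fn(combo)
        model_file.write_text(json.dumps(model))  # checkpoint the finished model
        trained_now.append(combo)
    return trained_now


def resume_grid(recovery_dir, train_fn):
    """Reload the saved params and continue the grid where it stopped."""
    params_file = Path(recovery_dir) / "grid_params.json"
    hyper_params = json.loads(params_file.read_text())
    return run_grid(recovery_dir, hyper_params, train_fn)


def load_models(recovery_dir):
    """Reload every model checkpointed so far."""
    files = sorted(Path(recovery_dir).glob("model_*.json"))
    return [json.loads(f.read_text()) for f in files]
```

Calling `resume_grid` after a failed `run_grid` trains only the combinations that have no checkpoint file, which is the "continue where it left off" behaviour described above.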

Proposed roadmap:
Stage 1 (this jira, end of January 2021):
• Introduce a generic API for automatic checkpointing and resuming from a checkpoint in H2O-3 - this would utilize existing building blocks in h2o
• SW will need to be able to detect H2O cluster failure, dispose of the cluster, start a new one and ask H2O to resume from a checkpoint
• This solution will work for Grid Search and for algos that currently support checkpointing. For algos that do not support checkpointing (GLM), the work will be seamlessly restarted from scratch

Stage 2 (Q1 2021):
• Add support for AutoML
• Add support for checkpointing to algos that do not currently support it (GLM, CoxPH, …) - based on booking preference

Fixed

Assignee

Jan Sterba

Reporter

Jan Sterba

CustomerVisible

No

Priority

Major