XGBoost gets stuck with 50+ executors (instead of failing outright)

Description

See support ticket for details: https://support.h2o.ai/a/tickets/98339

Activity

Neema Mashayekhi
February 9, 2021, 4:51 AM

The user resolved the issue by upgrading to 3.32.0.3, which includes the XGBoost upgrade to 1.2 and a fix ( ).

Jan Sterba
January 6, 2021, 1:56 PM

I was not able to reproduce this on 3.30.1.x. Suggesting an upgrade, since the code causing the error has since been changed in XGBoost.

Jan Sterba
January 5, 2021, 6:49 PM

Investigation of logs revealed:

There are two bugs: first, training did not stop when this error occurred; second, the error occurred at all, although that could be an XGBoost bug.

The first bug is the more serious one, because it prevents us from shutting down cleanly.
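
For illustration only, here is a minimal sketch of the fail-fast behavior expected for the first bug noted above (training should stop once any worker errors, rather than hang). This is not the H2O or XGBoost implementation. The names `run_workers`, `healthy_worker`, and `failing_worker` are hypothetical, and threads stand in for executors purely as an assumption to keep the example self-contained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION


def healthy_worker(stop):
    # Hypothetical stand-in for a training loop: runs until told to stop.
    rounds = 0
    while not stop.is_set():
        stop.wait(0.01)
        rounds += 1
    return rounds


def failing_worker(stop):
    # Hypothetical stand-in for a worker that hits an error mid-training.
    raise RuntimeError("simulated worker failure")


def run_workers(tasks):
    """Run every task; if any raises, signal the rest to stop and re-raise."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task, stop) for task in tasks]
        # Returns as soon as one worker raises (or when all have finished).
        wait(futures, return_when=FIRST_EXCEPTION)
        # Release the remaining workers so the pool can shut down cleanly.
        stop.set()
    # Leaving the with-block has joined all workers; surface the first error.
    for future in futures:
        error = future.exception()
        if error is not None:
            raise error
    return [future.result() for future in futures]
```

Calling `run_workers([healthy_worker, healthy_worker, failing_worker])` raises the `RuntimeError` after the two healthy workers have been signalled and joined, instead of leaving them running indefinitely.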

Fixed

Assignee

Jan Štěrba

Fix versions

Reporter

Neema Mashayekhi

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

ReleaseNotesHidden

None

CustomerVisible

No

Components

Affects versions

Priority

Major