NPE while building DRF

Description

While running several DRF jobs concurrently an NPE occurs.

To repro:

1) Started h2o-3.jar version 3.24.0.5 (vanilla jar downloaded from the h2o.ai website, not the MLI one we use in DAI): `java -jar h2o.jar -web_ip 127.0.0.1 -ip 127.0.0.1 -port 54321 -name mateusz`
2) Uploaded the dataset I'm attaching (both the csv and the exported hex)


3) Ran ~5 times in quick succession, each with a different model_id (I was lazy; it would probably be better to write a simple script doing this in a loop):

4) Got:

Most times logs show only:

Activity

Mateusz Dymczyk
April 30, 2020, 4:57 AM

Here are logs with the debug option turned on.

 

Michal Kurka
April 29, 2020, 1:43 PM

running multiple models on the same frame concurrently is supported as long as the target model name is unique.

I think we need to disregard the Flow reproducer, because neither of us is able to reproduce the issue with it and there were fixes for bugs that we know for sure could cause concurrency issues in DRF.

Instead, we need to focus on what DAI is doing since the issue can be reproduced there.

Mateusz Dymczyk
April 29, 2020, 4:32 AM

Yes, reproducing this with h2o-3 (not via DAI) was hard for me; I think I was only able to get it to fail once every few dozen tries. But it did fail, which probably means it’s not failing because of a misused API.

One easy way to reproduce it would be to run it via DAI: just download the latest DAI distro, start it and run multiple MLIs. I can jump on a call with someone to show how to do it.

when I reproduced it via Flow I uploaded the data only once and reused it to build the same model X times. The only thing I changed every time was the model name, the rest of the parameters stayed the same. Is running multiple models on the same frame not supported?
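The Flow procedure described here (upload once, then build the same model several times with only the model name changed) could be scripted roughly as follows. This is only a sketch: the CSV path `data.csv` and the response column `target` are placeholders, the build parameters from the ticket are not reproduced because they are not available here, and it assumes the h2o Python client tolerates being called from several threads, which has not been verified.

```python
from concurrent.futures import ThreadPoolExecutor


def unique_model_ids(prefix, count):
    """Return `count` distinct model ids: prefix_0, prefix_1, ..."""
    return [f"{prefix}_{i}" for i in range(count)]


def run_concurrently(build, model_ids):
    """Call build(model_id) for every id in parallel threads.

    Returns {model_id: result, or the exception that build raised}.
    """
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max(1, len(model_ids))) as pool:
        futures = {mid: pool.submit(build, mid) for mid in model_ids}
        for mid, future in futures.items():
            try:
                outcomes[mid] = future.result()
            except Exception as exc:  # keep going so every failure is recorded
                outcomes[mid] = exc
    return outcomes


def main(csv_path="data.csv", response="target", runs=5):
    # csv_path and response are hypothetical placeholders, not from the ticket.
    # h2o is imported here so the helpers above work without it installed.
    import h2o
    from h2o.estimators import H2ORandomForestEstimator

    h2o.connect(url="http://127.0.0.1:54321")
    frame = h2o.upload_file(csv_path)  # uploaded once, shared by all builds

    def build(model_id):
        # Only model_id differs between runs; other parameters stay at defaults.
        model = H2ORandomForestEstimator(model_id=model_id)
        model.train(y=response, training_frame=frame)
        return model

    outcomes = run_concurrently(build, unique_model_ids("drf", runs))
    for model_id, outcome in outcomes.items():
        status = repr(outcome) if isinstance(outcome, Exception) else "ok"
        print(model_id, status)
```

Calling `main()` against the instance started in step 1 and repeating it a few dozen times would approximate the manual clicking; any server-side NPE should also appear in the H2O logs.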

Michal Kurka
April 28, 2020, 1:32 PM

it seems to me that the dataset is repeatedly imported with the same target key

this would be an issue e.g. when creating the cv-folds: the key for the fold is derived from the input frame + cv fold id, which can cause collisions since the updates are not done using locking
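To make the suspected collision concrete, here is a toy model in Python. It is not H2O code: the plain dict standing in for the key-value store, the key format in `fold_key`, and the job names are all made up for illustration. It only shows how two jobs that derive fold keys from the same frame key end up sharing entries when writes are unconditional.

```python
def fold_key(frame_key, fold_id):
    # Hypothetical key format: derived only from the frame key and fold id,
    # so it carries nothing that identifies the job creating it.
    return f"{frame_key}_cv_{fold_id}"


def create_folds(store, owner, frame_key, nfolds):
    """Write one entry per fold; unconditional put, no locking."""
    keys = [fold_key(frame_key, i) for i in range(nfolds)]
    for key in keys:
        store[key] = owner  # last writer wins
    return keys


def remove_folds(store, keys):
    """Clean up a job's fold entries once the job is done."""
    for key in keys:
        store.pop(key, None)


store = {}
keys_a = create_folds(store, "job_a", "data.hex", 3)
keys_b = create_folds(store, "job_b", "data.hex", 3)

collided = keys_a == keys_b          # both jobs computed identical keys
owner_seen_by_a = store[keys_a[0]]   # job_a now reads job_b's entry

remove_folds(store, keys_b)          # job_b finishes first and cleans up
left_for_a = store.get(keys_a[0])    # job_a's fold is gone
```

In this toy model `left_for_a` ends up as `None`; if the real store behaved similarly, dereferencing such a missing entry would be one plausible way to get an NPE. Keys that also include a per-job component, or locked updates, would avoid the overlap.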

Zuzana Olajcová
April 28, 2020, 12:58 PM

I tried with 3.24.0.5 from Flow by Import Data (the csv) → Parse → ~10 cells with the build (different model names). Used exactly the setup from the ticket description 1 buildModel 'drf', {"model_i... with just the model ids changed, and also tried some random setups. Ran the whole thing multiple times (import → parse → builds). I wasn’t able to reproduce.

I also tried with the latest version from Flow: Import → Parse → Build 10x, with no success.

 

Fixed

Assignee

Michal Kurka

Fix versions

Reporter

Mateusz Dymczyk

CustomerVisible

No

Priority

Major