runit_NOPASS_quantile_1_golden.R: intermittent hang

Description

This may be similar to https://0xdata.atlassian.net/browse/PUB-1152
Note that it apparently hung here:
http://mr-0xb1:8080/job/h2o_master_DEV_runit_small_commit_only/593/console

It's hanging on the last quantile call, shown here in the h2o output:
http://mr-0xb1:8080/job/h2o_master_DEV_runit_small_commit_only/593/artifact/h2o-r/tests/results/java_9_0.out.txt

03-13 01:22:57.435 172.16.2.161:41018 6485 # Session INFO: Method: POST , URI: /3/Rapids.json, route: /3/Rapids, parms: {ast=(= !rapids_13_sid_9d643de843555bea36bab074dec74f24 (is.factor %file38793c1ace4c.hex_2_sid_9d643de843555bea36bab074dec74f24))}
03-13 01:22:57.452 172.16.2.161:41018 6485 # Session INFO: Method: GET , URI: /3/Rapids.json/isEval, route: /3/Rapids/isEval, parms: {ast_key=file38793c1ace4c.hex_2_sid_9d643de843555bea36bab074dec74f24}
03-13 01:22:57.467 172.16.2.161:41018 6485 # Session INFO: Method: GET , URI: /3/ModelBuilders/quantile.json, route: /3/ModelBuilders/(?<algo>.*), parms: {algo=quantile}
03-13 01:22:57.492 172.16.2.161:41018 6485 # Session INFO: Method: POST , URI: /3/ModelBuilders/quantile.json/parameters, route: /3/ModelBuilders/quantile/parameters, parms: {training_frame=file38793c1ace4c.hex_2_sid_9d643de843555bea36bab074dec74f24, probs=[0.01,0.05,0.1,0.25,0.333,0.5,0.667,0.75,0.9,0.95,0.99]}
03-13 01:22:57.516 172.16.2.161:41018 6485 # Session INFO: Method: POST , URI: /3/ModelBuilders/quantile.json, route: /3/ModelBuilders/quantile, parms: {training_frame=file38793c1ace4c.hex_2_sid_9d643de843555bea36bab074dec74f24, probs=[0.01,0.05,0.1,0.25,0.333,0.5,0.667,0.75,0.9,0.95,0.99]}
03-13 01:22:57.518 172.16.2.161:41018 6485 FJ-0-41 INFO: Building H2O Quantile model with these parameters:
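
When comparing hung runs, it can help to pull out the last REST request the node logged before it went quiet. The following is a minimal sketch, assuming the log lines keep the "Session INFO: Method: ..., URI: ..." layout shown in the excerpt above; the function name and the regular expression are illustrative and not part of h2o.

```python
import re

# Assumed layout, based on the excerpt above:
#   <MM-DD hh:mm:ss.mmm> <ip:port> <pid> # Session INFO: Method: <verb> , URI: <uri>, route: ...
REQUEST_RE = re.compile(
    r"^(\d\d-\d\d \d\d:\d\d:\d\d\.\d{3})\s+\S+\s+\d+\s+#\s*Session\s+INFO:\s+"
    r"Method:\s+(\w+)\s*,\s*URI:\s+([^,]+),"
)

def last_request(lines):
    """Return (timestamp, method, uri) of the last logged REST request, or None."""
    found = None
    for line in lines:
        match = REQUEST_RE.match(line)
        if match:
            # Keep overwriting so the final value is the last request seen.
            found = (match.group(1), match.group(2), match.group(3).strip())
    return found
```

Applied to the excerpt above, this would point at the POST to /3/ModelBuilders/quantile.json at 01:22:57.516, which is the request that never finished.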

from the console log:

SUMMARY OF RESULTS

----------------------------------------------------------------------

Total tests: 190
Passed: 189
Did not pass: 0
Did not complete: 1
Tolerated NOPASS: 0
Tolerated NOFEATURE: 0
NOPASS tests skipped: 82
NOFEATURE tests skipped: 0

Total time: 2013.34 sec
Time/completed test: 10.65 sec

Build was aborted
Archiving artifacts
Killing Test testdir_golden/runit_quantile_1_golden.R with PID 14457
Killing JVM with PID 6446
Killing JVM with PID 6447

Activity

Kevin Normoyle
March 19, 2015, 10:05 PM

got 50 passes on the small mr-0x1 thru 0x10 machines
made it nopass
immediately failed first time (hang) on mr-0xb1

It hung on the first run on mr-0xb1 as part of h2o_master_DEV_runit_small_commit_only.
Changing back to NOPASS.
The 50 passes were on the small machines, so the hang may only happen on the
bigger machines (used only by the *commit_only jobs), which have more threads/cores (32/16 vs 8/4).

so probably need to think about testing/debugging this on mr-0xd1 thru d4
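
One way to measure how often the hang shows up on a given machine is to run the test in a loop with a timeout and count hangs separately from ordinary failures. The following is a minimal sketch, not the Jenkins job's own mechanism; the helper name run_repeatedly and the command passed to it are assumptions for illustration.

```python
import subprocess

def run_repeatedly(cmd, runs, timeout_sec):
    """Run cmd `runs` times and return (passes, failures, hangs).

    A run counts as a hang if it is still going after timeout_sec seconds;
    subprocess.run kills the child process in that case.
    """
    passes = failures = hangs = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_sec,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
            continue
        if result.returncode == 0:
            passes += 1
        else:
            failures += 1
    return passes, failures, hangs
```

A hypothetical invocation would be run_repeatedly(["Rscript", "runit_quantile_1_golden.R"], runs=50, timeout_sec=300), run once on a small machine and once on a big one so the hang counts can be compared. The exact command depends on how the runit tests are launched locally. Note that the timeout only kills the test process, not the H2O JVMs it talks to, so those would need to be restarted after a hang.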


Assignee: New H2O Bugs
Reporter: Kevin Normoyle
CustomerVisible: No