as_spark_frame() fails when the input frame is built by an inner cbind()
Description
Calling as_spark_frame(df.cbind(df2)) fails.
repro:
Activity
Fixed by monkey-patching Spark's DataFrame: we attach the h2o_frame as an inner field of the data frame, which prevents its early garbage collection by Python.
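The mechanism of the patch can be sketched with plain Python objects. MockH2OFrame, MockSparkDataFrame, and the _h2o_frame field below are illustrative stand-ins, not the real h2o-py/pyspark classes; the sketch only shows why pinning the source frame on the DataFrame keeps Python's GC from collecting it early:

```python
import gc

removed = []  # frame ids the (mocked) H2O client would "rm" on garbage collection

# Hypothetical stand-in for h2o-py's H2OFrame; the real class triggers a
# "(rm <id>)" Rapids call when Python garbage-collects the object.
class MockH2OFrame:
    def __init__(self, frame_id):
        self.frame_id = frame_id
    def __del__(self):
        removed.append(self.frame_id)

class MockSparkDataFrame:
    pass

def as_spark_frame(h2o_frame):
    sdf = MockSparkDataFrame()
    # The patch: pin the source H2OFrame onto the resulting DataFrame so
    # Python's GC cannot collect it while the DataFrame is alive.
    sdf._h2o_frame = h2o_frame
    return sdf

sdf = as_spark_frame(MockH2OFrame("py_2_sid"))
gc.collect()
assert removed == []  # the pinned frame survived the conversion

del sdf
gc.collect()
assert removed == ["py_2_sid"]  # released only together with the DataFrame
```

Without the `sdf._h2o_frame = h2o_frame` line, the first assertion would already fail: a frame passed in as a temporary would be collected as soon as the conversion returned.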
Ok, so it's actually an H2O bug. To reproduce, start one H2O node and do the following in Python (we're starting the client).
The result is that the frame gets deleted right away:
03-16 18:10:13.133 192.168.0.17:54321 31668 #90375-14 INFO: GET /3/Frames/py_2_sid_9d5f, parms: {row_count=10}
03-16 18:10:13.146 192.168.0.17:54321 31668 #90375-15 INFO: POST /99/Rapids, parms: {ast=(rm py_2_sid_9d5f), session_id=_sid_9d5f}
03-16 18:10:13.151 192.168.0.17:54321 31668 #90375-16 INFO: POST /99/Rapids, parms: {ast=(rm py_1_sid_9d5f), session_id=_sid_9d5f}
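The behavior can be simulated with plain Python (a hypothetical mock, not the real h2o-py API): the composite frame returned by cbind() is never bound to a name, so CPython collects it as soon as the conversion call returns, which is what issues the premature rm calls seen in the log above.

```python
import gc

removed = []  # frame ids "rm"-ed on the cluster (simulated)

# Hypothetical stand-in for h2o-py's H2OFrame: the real class issues a
# "(rm <id>)" Rapids call from __del__ when Python garbage-collects it.
class MockH2OFrame:
    counter = 0
    def __init__(self):
        MockH2OFrame.counter += 1
        self.frame_id = "py_%d_sid" % MockH2OFrame.counter
    def cbind(self, other):
        return MockH2OFrame()  # the composite frame is a fresh temporary
    def __del__(self):
        removed.append(self.frame_id)

def as_spark_frame(frame):
    return object()  # conversion result keeps no reference to `frame`

df, df2 = MockH2OFrame(), MockH2OFrame()  # py_1_sid, py_2_sid

sdf = as_spark_frame(df.cbind(df2))
gc.collect()
# The composite frame (py_3_sid) was referenced only by the expression
# itself, so it is removed immediately after the conversion returns,
# while the Spark side still expects it to exist.
assert removed == ["py_3_sid"]
```

This is exactly the gap the monkey-patch closes: once the conversion result holds a reference to its source frame, the temporary outlives the conversion.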
At one point I thought it was caused by whole-stage codegen, but it's not: the issue is reproducible even with codegen disabled via spark.conf.set("spark.sql.codegen.wholeStage", false).
Another update: the NPE is reproducible in local mode as well (1 thread, 1 JVM).
Another update, the NPE stack trace:
17/03/15 19:06:05 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 39, 192.168.0.17, executor 2): java.lang.NullPointerException
at org.apache.spark.h2o.backends.internal.InternalReadConverterCtx.underlyingFrame(InternalReadConverterCtx.scala:40)
at org.apache.spark.h2o.backends.internal.InternalReadConverterCtx.<init>(InternalReadConverterCtx.scala:32)
at org.apache.spark.h2o.converters.ReadConverterCtxUtils$$anonfun$create$2.apply(ReadConverterCtxUtils.scala:32)
at org.apache.spark.h2o.converters.ReadConverterCtxUtils$$anonfun$create$2.apply(ReadConverterCtxUtils.scala:32)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.h2o.converters.ReadConverterCtxUtils$.create(ReadConverterCtxUtils.scala:32)
at org.apache.spark.h2o.converters.H2ORDDLike$H2OChunkIterator$class.converterCtx(H2ORDDLike.scala:67)
at org.apache.spark.h2o.converters.H2ODataFrame$$anon$1.converterCtx$lzycompute(H2ODataFrame.scala:75)
at org.apache.spark.h2o.converters.H2ODataFrame$$anon$1.converterCtx(H2ODataFrame.scala:75)
at org.apache.spark.h2o.converters.H2ODataFrame$$anon$1.<init>(H2ODataFrame.scala:85)
at org.apache.spark.h2o.converters.H2ODataFrame.compute(H2ODataFrame.scala:75)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)