In PySparkling, ensure the external backend can run without an H2O client on the Spark driver side.
The goals of this change are:
-> No H2O on the Spark driver
-> No need for the extended H2O JAR when PySparkling runs on the external backend
-> The conversion from DataFrame to H2OFrame and back must work without running H2O on the driver
-> Implement a proxy to the Flow UI on one of the worker nodes (the leader node)
Trying to target 22.214.171.124-1, but without any promise, as this work is in an unexplored area.
Thank you for the update Kuba. In my opinion, that’s the biggest improvement to Sparkling Water since the invention of Sparkling Water!
For reference, to enable the client-less solution, set the following Spark option to true: spark.ext.h2o.rest.api.based.client
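A minimal sketch of how that option could be passed on the command line, assuming the application is launched through spark-submit with its standard --conf key=value flag. The script name my_pysparkling_app.py and the helper build_submit_command are hypothetical and only used for illustration; the option name is the one quoted above.

```python
import shlex

# Option name taken from the comment above; "true" turns on the client-less mode.
REST_CLIENT_OPTION = "spark.ext.h2o.rest.api.based.client"


def build_submit_command(app_path, extra_conf=None):
    """Return a spark-submit argument list with the REST-based client enabled.

    Hypothetical helper: it only assembles the arguments, it does not launch Spark.
    """
    conf = {REST_CLIENT_OPTION: "true"}
    conf.update(extra_conf or {})
    args = ["spark-submit"]
    # Sort the keys so the generated command is deterministic.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_path)
    return args


# "my_pysparkling_app.py" is a placeholder for the user's own script.
command = build_submit_command("my_pysparkling_app.py")
print(" ".join(shlex.quote(arg) for arg in command))
```

The same key/value pair could equally be set on the Spark configuration object inside the application; the command-line form is shown only because it needs no running cluster to illustrate.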
In the first iteration (this JIRA) we are targeting and testing this in the context of PySparkling and the external backend. Note that in the first iteration the user will be able to convert a DataFrame → H2OFrame and vice versa, and use the full power of the H2O Python API. Sparkling Water pipeline algo wrappers won’t be available; users need to use the H2O Python API estimators. The PySparkling Algo API will be implemented in the next stage.
Thanks for the clarification Kuba
“Sparkling Water pipeline algo wrappers won’t be available, the users need to use H2O Python API estimators. The PySparkling Algo API will be implemented in the next stage”
Does this mean that if we build models in SW, we would still need to use the H2O Python API, and that to use the algo wrappers we wouldn’t be able to set spark.ext.h2o.rest.api.based.client?
I guess I don’t understand the difference between “H2O Python API” and “PySparkling Algo API”.
Would the former go away, after the next stage is implemented?
Which JIRA tracks the further improvements / stage 2?
The JIRA -> https://0xdata.atlassian.net/browse/SW-1537
No API is going away after this is fully implemented.
If you have trained models using SW, you can use the REST API based approach, but only if you used the H2O Python API (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html). The PySparkling Algo API, such as H2OXGBoost in http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/tutorials/sw_xgboost.html?highlight=xgboost and the wrappers for other algorithms, won't be supported in the first iteration of the client-less approach.
It might be a bit confusing because in PySparkling people normally work with the H2O Python API. We just have an additional abstraction over our algos that works well with Spark Pipelines, and that abstraction won’t be supported right away.