[PySparkling] Client Separation from Spark Driver

Description

In PySparkling, ensure that the external backend can run without an H2O client on the Spark driver side.

The goals of this change are:
-> No H2O on the Spark driver
-> No need for the extended H2O JAR when PySparkling runs on the external backend
-> Conversion from DataFrame to H2OFrame and back must work without running H2O on the driver

-> Implement a proxy to the Flow UI on one of the worker nodes (the leader node)

Activity

Jakub Hava
November 14, 2019, 11:28 PM

Trying to target 3.28.0.1-1, but with no promises, as this work is unexplored territory.

Ruslan Dautkhanov
November 15, 2019, 4:57 PM

Thank you for the update Kuba. In my opinion, that’s the biggest improvement to Sparkling Water since the invention of Sparkling Water!

Jakub Hava
November 15, 2019, 7:20 PM

For reference, to enable the client-less solution, set the following Spark option to true: spark.ext.h2o.rest.api.based.client
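A minimal sketch of how that option might be passed to a Spark job. Only spark.ext.h2o.rest.api.based.client comes from this ticket; the spark.ext.h2o.backend.cluster.mode entry and the helper names clientless_options and to_submit_args are assumptions added for illustration.

```python
# Sketch: collect the Spark options for the client-less PySparkling setup
# and render them as spark-submit arguments. Standard library only.

# Option named in this ticket.
REST_CLIENT_OPTION = "spark.ext.h2o.rest.api.based.client"


def clientless_options(extra=None):
    """Return Spark options that enable the REST API based client."""
    options = {
        # Assumption: the external backend is selected with this option.
        "spark.ext.h2o.backend.cluster.mode": "external",
        REST_CLIENT_OPTION: "true",
    }
    # Caller-supplied options override the defaults above.
    options.update(extra or {})
    return options


def to_submit_args(options):
    """Render options as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(options):
        args += ["--conf", f"{key}={options[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(clientless_options())))
```

The same key/value pairs could instead be set on the Spark session builder before the session is created.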

In the first iteration (this JIRA) we are targeting and testing this in the context of PySparkling and the external backend. Note that in the first iteration the user will be able to convert a DataFrame to an H2OFrame and vice versa, and use the full power of the H2O Python API. The Sparkling Water pipeline algo wrappers won’t be available, so users need to use the H2O Python API estimators. The PySparkling Algo API will be implemented in the next stage.

Ruslan Dautkhanov
November 15, 2019, 7:34 PM

Thanks for the clarification Kuba

“Sparkling Water pipeline algo wrappers won’t be available, the users need to use H2O Python API estimators. The PySparkling Algo API will be implemented in the next stage”

Does this mean that if we build models in SW, we would still need to use the H2O Python API, and that to use them we wouldn’t be able to set spark.ext.h2o.rest.api.based.client?

I guess I don’t understand the difference between the “H2O Python API” and the “PySparkling Algo API”.
Would the former go away after the next stage is implemented?

Which Jira tracks further improvements / stage 2?

Jakub Hava
November 15, 2019, 7:39 PM

The JIRA: https://0xdata.atlassian.net/browse/SW-1537
No API is going away after this is fully implemented.

If you have trained models using SW, you can use the REST API based approach, but only if you used the H2O Python API (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html). The PySparkling Algo API, such as H2OXGBoost in http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/tutorials/sw_xgboost.html?highlight=xgboost and its counterparts for other algorithms, won't be supported in the first iteration of the client-less approach.

It might be a bit confusing because in PySparkling people normally work with the H2O Python API. We just have an additional abstraction over our algos that works well with Spark Pipelines, and that abstraction won’t be supported right away.
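To make the distinction concrete, here is a hedged sketch of the workflow the first iteration is described as supporting: convert a Spark DataFrame to an H2OFrame, train an H2O Python API estimator directly (not a Spark Pipeline stage), and convert the predictions back. The function name train_and_score is hypothetical. The method names asH2OFrame and asSparkFrame on the H2OContext are assumptions that depend on the PySparkling release, and both the context and the estimator are passed in so the sketch does not prescribe how they are created.

```python
# Sketch: use the H2O Python API from PySparkling without pipeline wrappers.
# `hc` is assumed to be an H2OContext and `estimator` an H2O Python API
# estimator; neither is constructed here.


def train_and_score(hc, estimator, spark_df, label):
    """Train an H2O Python API estimator on a Spark DataFrame and score it."""
    # DataFrame -> H2OFrame (assumed method name; it varies between releases).
    frame = hc.asH2OFrame(spark_df)

    # Every column except the label is used as a feature.
    features = [name for name in frame.columns if name != label]

    # Plain H2O Python API call, not a Spark Pipeline stage.
    estimator.train(x=features, y=label, training_frame=frame)
    predictions = estimator.predict(frame)

    # H2OFrame -> DataFrame (assumed method name; it varies between releases).
    return hc.asSparkFrame(predictions)
```

With a running cluster, the estimator argument would be an H2O Python API estimator such as H2OGradientBoostingEstimator from h2o.estimators, whereas a PySparkling Algo API class like H2OXGBoost is the kind of wrapper this iteration does not cover.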

Assignee

Jakub Hava

Reporter

Jakub Hava

Labels

None

CustomerVisible

No

Fix versions

Priority

Major

Epic Name

[External Backend] Client Separation