I have replaced the _jvm() calls, but later I get:
The numpy dependency is used in several places in Spark; in this case it is pulled in by the pyspark.ml.param __init__ file, so changing our code won’t help.
But the more I think about it, the more I believe we could remove the dependencies on pyspark & numpy altogether.
In the vast majority of cases Spark is already available, so pip just downloads the extra dependencies.
The only case affected would be a fresh Python installation pointing to an external Spark, where the user runs: pip install h2o_pysparkling_2.4. In that case the installation would succeed, but the user could not use Sparkling Water as there is no Spark.
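To make that failure mode friendlier, Sparkling Water could also fail fast at import time with an actionable message. A minimal sketch, assuming nothing about the existing code (the function name and message below are my own, not existing Sparkling Water API):

```python
import importlib.util


def require_module(name, hint):
    """Raise a helpful ImportError if an optional dependency is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name} not found. {hint}")


# Hypothetical usage at Sparkling Water import time:
# require_module(
#     "pyspark",
#     "Install it with 'pip install pyspark' or add the Spark "
#     "distribution's python libs to PYTHONPATH.",
# )
```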
So what I would do:
- Remove the numpy & pyspark dependencies from setup.py.
- Document that if the Python environment does not have Spark, the user needs to install it, either via pip install pyspark or by downloading a Spark distribution and adding its Python libs to the PYTHONPATH.
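Concretely, the first step would just drop those entries from install_requires. A sketch of what the trimmed setup.py might look like (package name aside, all values below are placeholders, not the actual file):

```python
from setuptools import setup, find_packages

setup(
    name="h2o_pysparkling_2.4",
    version="0.0.0",  # placeholder, not the real version
    packages=find_packages(),
    # numpy and pyspark intentionally omitted from install_requires:
    # Spark (and its transitive numpy dependency) must be provided
    # by the environment, either via 'pip install pyspark' or via an
    # external Spark distribution on the PYTHONPATH.
    install_requires=[],
)
```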
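For the documentation part, the external-Spark-distribution route could be described roughly like this (the paths and the py4j zip name are examples; the py4j version differs per Spark release):

```shell
# Assuming Spark was unpacked at /opt/spark (example path)
export SPARK_HOME=/opt/spark
# Put Spark's Python bindings and the bundled py4j on PYTHONPATH
# (check $SPARK_HOME/python/lib for the exact py4j zip name)
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
```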