A customer is trying to reproduce a modeling result. Their typical workflow is as follows.
Their source data is in MapR.
The workflow is: load the data into Spark, do the data engineering, save the data back to MapR, load it into Sparkling Water, and train the model.
They can't reproduce the result even when they meet the other reproducibility requirements in our docs, because Spark doesn't guarantee the same physical distribution of the Parquet files. We can't do anything on the H2O cluster side either if the Parquet files are not in the same order. It would be great if we could provide a function or a model option to redistribute the data, so that customers get a reproducible result whenever they have the same data.
Can they manually create an ID column, or pick some existing column, and use it to do a ‘repartition by’? They could then use that column to get a consistent distribution across different users/sessions.
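
To make the ‘repartition by’ idea concrete, here is a minimal plain-Python sketch (not Spark or H2O code). It assumes a unique key column, called `row_id` here purely for illustration, and a hypothetical helper `stable_partition`. Rows are assigned to partitions by a stable hash of the key and then sorted within each partition, so two sessions that receive the same rows in a different order end up with the same layout. In Spark the rough analogue would be `df.repartition(n, "row_id")` followed by `sortWithinPartitions("row_id")` before writing the Parquet files. Whether that alone is enough to make the H2O model reproducible has not been verified here.

```python
import random
import zlib


def stable_partition(rows, key, num_partitions):
    """Hypothetical helper: place rows into partitions by a stable hash of `key`,
    then sort each partition by `key` so the layout does not depend on input order."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 gives the same value in every process, unlike Python's built-in
        # hash() for strings, which is randomized per interpreter run.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    for part in partitions:
        part.sort(key=lambda r: r[key])
    return partitions


# Same data, as seen by two sessions that read it in different physical order.
rows = [{"row_id": i, "x": i * 0.5} for i in range(20)]
session_a = rows[:]
session_b = rows[:]
random.Random(1).shuffle(session_a)
random.Random(2).shuffle(session_b)

layout_a = stable_partition(session_a, "row_id", 4)
layout_b = stable_partition(session_b, "row_id", 4)

# Compares the two layouts partition by partition.
print(layout_a == layout_b)
```

The sort within each partition only gives a fixed order if the key is unique. With duplicate keys, rows that tie could still appear in a different order between sessions, which is why a manually created ID column is the safer choice.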