Add a function to allow re-distribution of the data

Description

The customer is trying to reproduce a modeling result. Their typical workflow is as follows.

Their source data is in MapR.
The workflow is: load the data into Spark, do data engineering, save the data back to MapR, load it into Sparkling Water, and train the model.

They can't reproduce the result even though they meet the other reproducibility requirements in our docs, because Spark doesn't guarantee the same physical distribution of the Parquet files from run to run. We can't do anything on the H2O cluster side either if the Parquet files are not in the same order. It would be great if we could provide a function, or an option in the models, to re-distribute the data so that customers with the same data get a reproducible result.

Activity

Neema Mashayekhi
September 19, 2020, 12:52 AM

Can they manually create an id column, or pick some existing column, and use it for a 'repartition by'? They could then use that column to get a consistent distribution between two different users/sessions.
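
To illustrate the idea, here is a minimal plain-Python sketch, not Spark or Sparkling Water code. It assumes each row has a unique `id` column; the helper name `stable_partition` and the sample data are hypothetical. Rows are assigned to partitions by a stable hash of the id and then sorted within each partition, so the final layout no longer depends on the order in which the rows arrived. In Spark, the rough analogue would be `repartition` on the id column followed by `sortWithinPartitions`, though whether that alone is enough for reproducible training has not been verified here.

```python
import hashlib


def stable_partition(rows, key, num_partitions):
    """Group rows into partitions by a stable hash of `key`, then sort each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Use md5 instead of the built-in hash(), which is randomized
        # per process for strings and so would differ between sessions.
        digest = hashlib.md5(str(row[key]).encode("utf-8")).hexdigest()
        partitions[int(digest, 16) % num_partitions].append(row)
    for part in partitions:
        # Sorting by the (assumed unique) key fixes the order inside a partition.
        part.sort(key=lambda r: r[key])
    return partitions


# The same data arriving in two different physical orders (hypothetical sample).
rows_a = [{"id": i, "x": i * 0.5} for i in range(10)]
rows_b = list(reversed(rows_a))

layout_a = stable_partition(rows_a, "id", 3)
layout_b = stable_partition(rows_b, "id", 3)
same_layout = layout_a == layout_b
```

Both inputs end up with an identical partition layout because the partition assignment and the in-partition order depend only on the id values, not on the input order.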

Assignee

New H2O Bugs

Fix versions

None

Reporter

Feng Bai

Support ticket URL

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

ReleaseNotesHidden

None

CustomerVisible

No

Priority

Major