jumbling rows of data during scoring



We are seeing something very disturbing that makes us very queasy about using H2O: when we compare the data before and after it passes through H2O predictions, some column values have been switched between rows. A simple example illustrates the phenomenon:

Input dataframe:

id  date   partialkey  feature1  feature2  feature3  feature4  feature5
1   Dec-1  a           0         1         0         0         0
2   Dec-1  b           2         3         1         1         1

Output dataframe:

id  date   partialkey  feature1  feature2  feature3  feature4  feature5  predict  A    R
1   Dec-1  a           0         1         1         1         1         out1     0.4  0.6
2   Dec-1  b           2         3         0         0         0         out2     0.1  0.9

In other words, two rows have SOME OF THEIR COLUMNS switched (in the example above, feature3, feature4 and feature5).

This could happen if, for example, the columns (which are stored separately in chunks, I believe) are not stitched back together properly after running through scoring on a model.
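The symptom described above can be checked mechanically by matching input and output rows on `id` and listing the columns whose values differ. The following is a minimal sketch in plain Python, not Sparkling Water code: the rows are copied from the example above, and the helper name `find_jumbled_rows` is made up for illustration.

```python
# Columns shared by the input and output frames in the example above.
COLUMNS = ["id", "date", "partialkey",
           "feature1", "feature2", "feature3", "feature4", "feature5"]

# Rows copied from the example; prediction columns are left out because
# only the pass-through columns are being compared.
input_rows = [
    dict(zip(COLUMNS, [1, "Dec-1", "a", 0, 1, 0, 0, 0])),
    dict(zip(COLUMNS, [2, "Dec-1", "b", 2, 3, 1, 1, 1])),
]
output_rows = [
    dict(zip(COLUMNS, [1, "Dec-1", "a", 0, 1, 1, 1, 1])),
    dict(zip(COLUMNS, [2, "Dec-1", "b", 2, 3, 0, 0, 0])),
]


def find_jumbled_rows(before, after, key="id"):
    """Return {key value: [names of columns whose value changed]}.

    Hypothetical helper: rows are matched on `key`, never on position,
    so a reordering of whole rows is not reported as a mismatch.
    """
    before_by_key = {row[key]: row for row in before}
    mismatches = {}
    for row in after:
        original = before_by_key[row[key]]
        changed = [c for c in original if c != key and original[c] != row[c]]
        if changed:
            mismatches[row[key]] = changed
    return mismatches


jumbled = find_jumbled_rows(input_rows, output_rows)
for row_id, cols in jumbled.items():
    print(row_id, cols)
```

On a real dataset the same comparison would run on the Spark side (for example by joining the two frames on the key), which is outside the scope of this sketch.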

We are investigating the issue and trying to come up with a reproducible example (hard to share because the dataset is proprietary), but I'm wondering whether you have seen any similar bugs and what could be causing them. It is quite urgent, so please, I beg you, respond.

Here are more details:
1. We don't see an issue if running on a single node without ignored columns
2. All training columns are categorical (strings) and model is doing binary classification. There are no missing values in the input dataset in the columns used for training.
3. The dataframe in question has a lot of ignored columns. Some of the columns are originally of Decimal type. All decimal type columns are cast to doubles before passing through H2O. All datatypes are doubles, strings and timestamps.
4. Running Sparkling Water 1.5.10 on CDH 5.4

The H2O code is very simple and is only a few lines:

cast Spark dataframe to H2OFrame
load stored H2O model from hdfs location
declare categorical columns
score on the H2OFrame (to get predict, A, R columns) (for GBM, DRF and LR models)
add predictions to original dataframe (from GBM, DRF and LR models)
dataframe update key
convert H2OFrame to Spark DataFrame
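The step "add predictions to original dataframe" can be done either by row position or by key. The sketch below, in plain Python rather than the Sparkling Water API, contrasts the two. It assumes the scoring output carries the `id` column, which the report does not confirm, and the function names are hypothetical. It is not a diagnosis of the bug above, where feature columns rather than predictions are switched; it only shows why a key-based attach does not depend on row order being preserved.

```python
# Original rows, reduced to the key and one feature for brevity.
rows = [
    {"id": 1, "feature1": 0},
    {"id": 2, "feature1": 2},
]

# Prediction values from the example above, deliberately listed in a
# different order to simulate a backend that does not preserve row order.
shuffled_predictions = [
    {"id": 2, "predict": "out2", "A": 0.1, "R": 0.9},
    {"id": 1, "predict": "out1", "A": 0.4, "R": 0.6},
]


def attach_by_position(rows, predictions):
    """Pair the i-th row with the i-th prediction (order-dependent).

    Only the prediction columns are copied, mimicking a column bind.
    """
    merged = []
    for row, pred in zip(rows, predictions):
        extra = {k: v for k, v in pred.items() if k != "id"}
        merged.append({**row, **extra})
    return merged


def attach_by_key(rows, predictions, key="id"):
    """Pair each row with the prediction that has the same key value."""
    pred_by_key = {p[key]: p for p in predictions}
    return [{**row, **pred_by_key[row[key]]} for row in rows]


by_position = attach_by_position(rows, shuffled_predictions)
by_key = attach_by_key(rows, shuffled_predictions)

print(by_position[0])  # row 1 paired with the prediction for row 2
print(by_key[0])       # row 1 paired with its own prediction
```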

Here are more observations when restricting to dataframes without the ignored columns (some 50 columns are ignored, 32 columns are retained):

  • we have two slightly different codebases which produce slightly different derived features on identical inputs

  • original input after transformation thus produces two slightly different input dataframes to predictions - I will call these Set1 and Set2

  • Set1 uses Model1 for prediction and there is no switching rows in this case - NO PROBLEM

  • Set2 uses Model2 for prediction and switching rows happens in this case - PROBLEM

  • passing Set2 through Model1: switching rows happens in this case - PROBLEM

  • passing only the subset of Set2 where the jumbling happens (166 rows out of 20,000) through Model1: switching rows does NOT happen in this case - NO PROBLEM

So identical models and identical code, run on two almost identical datasets, behave very differently.

How is it possible that the structure of the dataset causes jumbling of data??? And only in some columns???
Could casting to different datatypes (Decimal to Double or int to string) have anything to do with it???

Thank you so much for any insight you can offer!



Michal Malohlava
March 7, 2017, 5:56 PM

So I expect it will be fixed in 2.1.1

Jakub Hava
March 7, 2017, 5:57 PM

Oh, then my bad. It seemed that they just closed the PR without explaining. However the method

still behaves non-deterministically

Jakub Hava
March 7, 2017, 6:00 PM

I'll build Spark from that branch locally and try it there; I'm interested in whether it solves our issue

Michal Malohlava
March 7, 2017, 6:02 PM

It is fine, let's wait for 2.1.1

Jakub Hava
March 7, 2017, 6:13 PM

Couldn't wait - tested on today's 2.1.1 snapshot and the except method still behaves non-deterministically as above.

Won't Fix


Jakub Hava


Avkash Chauhan

Support Assessment

Data Science Issue