Hi,
we are seeing something very disturbing that makes us very uneasy about using H2O: when we compare a dataframe before and after passing it through H2O predictions, the values of some columns have been switched between rows. Here is a simple example:
Input dataframe:
id  date   partialkey  feature1  feature2  feature3  feature4  feature5
------------------------------------------------------------------------
1   Dec-1  a           0         1         0         0         0
2   Dec-1  b           2         3         1         1         1
Output dataframe:
id  date   partialkey  feature1  feature2  feature3  feature4  feature5  predict  A    R
-----------------------------------------------------------------------------------------
1   Dec-1  a           0         1         1         1         1         out1     0.4  0.6
2   Dec-1  b           2         3         0         0         0         out2     0.1  0.9
In other words, two rows have SOME OF THEIR COLUMNS switched (in the example above, feature3, feature4 and feature5).
This could happen if, for example, the columns (which are stored separately in chunks, I believe) are not stitched back together properly after scoring on a model.
We are investigating the issue and trying to come up with a reproducible example (hard to share because the dataset is proprietary), but I'm wondering if you can tell me whether you have seen any similar bugs and what could be causing this. It is quite urgent, so please respond.
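While we work on a reproducible example, here is a minimal sketch of how the affected rows can be flagged. It is plain Python over lists of dicts, not our actual Spark/H2O code; the function name `find_switched_columns` and the sample rows are made up for illustration and simply mirror the example tables above. It matches rows by `id` rather than by position and reports which retained columns changed.

```python
from typing import Dict, Iterable, List, Mapping, Sequence


def find_switched_columns(
    input_rows: Iterable[Mapping[str, object]],
    output_rows: Iterable[Mapping[str, object]],
    key: str,
    columns: Sequence[str],
) -> Dict[object, List[str]]:
    """Return {key value: [columns whose value changed]} for rows that differ."""
    # Index the input by key so the comparison does not depend on row order.
    before = {row[key]: row for row in input_rows}
    changed: Dict[object, List[str]] = {}
    for row in output_rows:
        original = before.get(row[key])
        if original is None:
            continue  # key only present in the output; not a column switch
        diff = [c for c in columns if original[c] != row[c]]
        if diff:
            changed[row[key]] = diff
    return changed


# Hypothetical data copied from the example tables (feature columns only).
features = ["feature1", "feature2", "feature3", "feature4", "feature5"]
input_rows = [
    {"id": 1, "feature1": 0, "feature2": 1, "feature3": 0, "feature4": 0, "feature5": 0},
    {"id": 2, "feature1": 2, "feature2": 3, "feature3": 1, "feature4": 1, "feature5": 1},
]
output_rows = [
    {"id": 1, "feature1": 0, "feature2": 1, "feature3": 1, "feature4": 1, "feature5": 1},
    {"id": 2, "feature1": 2, "feature2": 3, "feature3": 0, "feature4": 0, "feature5": 0},
]

switched = find_switched_columns(input_rows, output_rows, "id", features)
print(switched)
```

For the example above this flags both ids, each with feature3, feature4 and feature5 changed, while feature1 and feature2 are untouched.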
Here are more details:
1. We don't see an issue if running on a single node without ignored columns
2. All training columns are categorical (strings) and model is doing binary classification. There are no missing values in the input dataset in the columns used for training.
3. The dataframe in question has a lot of ignored columns. Some of the columns are originally of Decimal type. All decimal type columns are cast to doubles before passing through H2O. All datatypes are doubles, strings and timestamps.
4. Running Sparkling Water 1.5.10 on CDH 5.4
The H2O code is very simple and is only a few lines:
cast Spark dataframe to H2OFrame
load stored H2O model from hdfs location
declare categorical columns
score on the H2OFrame (to get predict, A, R columns) (for GBM, DRF and LR models)
add predictions to original dataframe (from GBM, DRF and LR models)
dataframe update key
convert H2OFrame to Spark DataFrame
INVESTIGATING FURTHER:
Here are more observations when restricting to dataframes without the ignored columns (some 50 columns are ignored, 32 columns are retained):
we have two slightly different codebases which produce slightly different derived features on identical inputs
original input after transformation thus produces two slightly different input dataframes to predictions - I will call these Set1 and Set2
Set1 uses Model1 for prediction and there is no switching rows in this case - NO PROBLEM
Set2 uses Model2 for prediction and switching rows happens in this case - PROBLEM
passing Set2 through Model1: switching rows happens in this case - PROBLEM
passing only the subset of Set2 where the switching happens (166 rows out of 20,000) through Model1: switching columns does NOT happen - NO PROBLEM
So, identical models and identical code, run on two almost identical datasets have very different behavior.
How is it possible that the structure of the dataset causes jumbling of data??? And only in some columns???
Could casting to different datatypes (Decimal to Double or int to string) have anything to do with it???
Thank you so much for any insight you can offer!
https://groups.google.com/forum/#!topic/h2ostream/QsgcbrpyJAs
So I expect it will be fixed in 2.1.1.
Oh, then my bad. It seemed that they just closed the PR without explaining. However, the method still behaves non-deterministically.
I'll build Spark from that branch locally and try it there; I'm interested in whether it solves our issue.
That's fine, let's wait for 2.1.1.
Couldn't wait - tested on today's 2.1.1 snapshot and the except method still behaves non-deterministically as above.
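For anyone who wants to repeat the check, this is roughly how a determinism test can be written. It is only a sketch in plain Python: `is_deterministic` and `stable_except` are hypothetical names, and the set difference below is a stand-in for the real call (in PySpark that would be something like collecting the rows of `DataFrame.subtract`, the counterpart of Scala's `except`). It runs the computation several times and compares the results as multisets, so row order does not matter.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable


def is_deterministic(compute: Callable[[], Iterable[Hashable]], runs: int = 5) -> bool:
    """Run `compute` several times; True if every run yields the same multiset of rows."""
    baseline = Counter(compute())
    # Counter equality ignores row order but is sensitive to duplicates.
    return all(Counter(compute()) == baseline for _ in range(runs - 1))


# Hypothetical rows; in the real check these would be collected from the result.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "b")]


def stable_except():
    # Stand-in for the operation under test: rows of `left` not present in `right`.
    return set(left) - set(right)


stable = is_deterministic(stable_except)
print(stable)
```

A result of False from such a check only shows that two runs disagreed; it does not say which run is the correct one.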