H2OFrame in Python adds duplicate rows when converting a Pandas DataFrame

Description

When converting a Pandas DataFrame to an H2OFrame using the h2o.H2OFrame() function, the resulting frame has more rows than the source DataFrame.

Additional rows are created in the H2OFrame. When I looked into this, the new rows appeared to be duplicates of existing rows. The number of duplicate rows varies with the data size, but is typically around 2 to 10.

Code:

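import h2o

h2o.init()  # connect to (or start) an H2O cluster before creating frames

train_h2o = h2o.H2OFrame(python_obj=train_df_complete)

print(train_df_complete.shape[0])  # row count of the Pandas DataFrame
print(train_h2o.nrow)              # row count of the resulting H2OFrame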

Output:

3871998
3872000
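
To pin down which rows were added, it may help to compare the rows as multisets before and after the conversion. The following is only a minimal sketch and not part of the original report: the helper name `surplus_rows` and the sample data are hypothetical, and it assumes both frames have already been turned into lists of plain row tuples.

```python
from collections import Counter


def surplus_rows(original_rows, converted_rows):
    """Return {row: extra_count} for rows that occur more often after conversion.

    Hypothetical helper for illustration; both arguments are iterables of
    hashable row tuples.
    """
    before = Counter(original_rows)
    after = Counter(converted_rows)
    # Counter subtraction keeps only rows whose count increased.
    return dict(after - before)


# Stand-in data (an assumption, not the reporter's data): two rows are duplicated.
original = [(1, "a"), (2, "b"), (3, "c")]
converted = [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "c")]

extra = surplus_rows(original, converted)
print(len(converted) - len(original))  # number of surplus rows
print(extra)                           # which rows are duplicated, and how often
```

With the real frames, the row tuples could presumably be obtained with `train_df_complete.itertuples(index=False, name=None)` on the Pandas side and `train_h2o.as_data_frame().itertuples(index=False, name=None)` on the H2O side. Rows containing NaN values, or columns whose types change during the conversion, will not compare equal, so they may need to be normalized first.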

Activity

Michal Raška
August 28, 2017, 7:23 AM

Can you please specify on which frame it occurs? I cannot reproduce it and I've tried many of the smalldata datasets. Tried Python 3.5 and 3.6. Thanks

Assignee: New H2O Bugs
Fix versions: None
Reporter: George Carmichael
Support ticket URL: None
Affected Spark version: None
Customer Request Type: None
Task progress: None
ReleaseNotesHidden: None
CustomerVisible: Yes
Priority: Critical