Conversion of a sparse data DataFrame to H2OFrame is slow

Description

It takes a long time to build a model on a sparse dataset (89x5000) when it is read in Parquet format on a 5-executor Sparkling Water cluster.
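A minimal sketch of the reported workflow, assuming PySparkling's H2OContext.asH2OFrame and a Parquet file with a sparse feature-vector column (the path, app name, and column layout are hypothetical, not from the ticket):

    # Minimal repro sketch; the path and app name are placeholders.
    from pyspark.sql import SparkSession
    from pysparkling import H2OContext

    spark = SparkSession.builder.appName("sparse-to-h2o").getOrCreate()
    hc = H2OContext.getOrCreate(spark)  # attach H2O to the Spark cluster

    # The dataset is assumed to hold a sparse ml.linalg vector column.
    df = spark.read.parquet("/path/to/sparse_dataset.parquet")

    # This conversion is the step reported as slow.
    h2o_frame = hc.asH2OFrame(df)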

Activity

Jakub Hava
November 21, 2017, 1:59 PM

On data where every value in the sparse vector was specified (no zero values), it took roughly the same time.

Jakub Hava
November 21, 2017, 2:05 PM

The same holds for the case where a non-zero value is in every second position.

So the tested cases were:

  • really sparse case - just one non-zero value

  • dense case - all values in the sparse vector specified

  • worst case for index computation - every second value is non-zero (lots of small gaps)

All behaved very similarly and took about 5 seconds to convert, which is fine. I exported the Parquet to the local file-system.
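The three cases above can be generated along these lines, assuming pyspark.ml sparse vectors (the sizes mirror the 89x5000 shape from the description; the helper name and output paths are illustrative):

    # Illustrative generator for the three sparsity patterns.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("gen-sparse-cases").getOrCreate()
    SIZE = 5000  # vector width

    def make_vector(pattern):
        if pattern == "really_sparse":   # just one non-zero value
            return Vectors.sparse(SIZE, [0], [1.0])
        if pattern == "dense":           # every position specified
            return Vectors.sparse(SIZE, list(range(SIZE)), [1.0] * SIZE)
        # worst case: every second value is non-zero
        return Vectors.sparse(SIZE, list(range(0, SIZE, 2)), [1.0] * (SIZE // 2))

    for pattern in ("really_sparse", "dense", "every_second"):
        rows = [(make_vector(pattern),) for _ in range(89)]
        df = spark.createDataFrame(rows, ["features"])
        df.write.mode("overwrite").parquet("/tmp/" + pattern + ".parquet")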

Are they loading this Parquet from HDFS, or from the local file-system as well?

But from my testing I can't reproduce this slowness. Was their environment special in any way?


Jakub Hava
November 22, 2017, 12:57 PM

Bad and good news! I can see that it is really slow from PySparkling.

This code takes a few seconds in Scala.

The same code so far hasn't finished in Python.

Jakub Hava
November 22, 2017, 12:58 PM

(Attached: Spark UI screenshots for the Scala job and the Python job.)

So the tasks took around the same time.

So for some reason it takes longer on the PySparkling side (not hours, but around 5 minutes), while the tasks themselves take the same time.
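The overhead can be attributed to the client side by timing the driver-side call directly, since the Spark UI already shows the per-task times (a sketch, reusing the hc and df from the repro above):

    # Time the client-side conversion; per-task times are visible in the Spark UI.
    import time

    start = time.time()
    h2o_frame = hc.asH2OFrame(df)  # the call that takes ~5 minutes from PySparkling
    print("asH2OFrame took %.1f s" % (time.time() - start))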

Jakub Hava
November 22, 2017, 1:00 PM

The problem is caused in the Python client; in particular, this bit takes a long time: fr = H2OFrame.get_frame(sid)
And the root cause is the doFetch method in FramesHandler, since it needs to load the first few rows to store into the expr on the Python side. And since this is sparse data, those 10 rows are 50000*50 doubles, which takes some time. Minor note: if we pass 0 to it, this code turns it into 100.
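The described row-count behavior, paraphrased in Python for illustration (the real logic lives in the Java-side FramesHandler; this is not the actual code):

    # Paraphrase of the described doFetch row-count handling.
    def effective_row_count(row_count):
        if row_count == 0:
            return 100  # a requested 0 is silently turned into 100
        return row_count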

0 should be a valid value, however, and we do not have any option to pass it.

But rows are not the core issue; columns are. What we can do is also specify a default number of columns to show using column_count (it's already in the API, but Python is not using it). Right now the default means all columns. This would help in the sparse data case.
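A sketch of what capping the preview would look like against the H2O-3 REST endpoint GET /3/Frames/{frame_id}, which accepts row_count and column_count query parameters (the node address and frame id below are placeholders):

    # Fetch a capped preview: 10 rows x 10 columns instead of 10 rows x all columns.
    import requests

    H2O_URL = "http://localhost:54321"  # assumed H2O node address
    frame_id = "sid"                    # frame id produced by the conversion

    resp = requests.get(
        H2O_URL + "/3/Frames/" + frame_id,
        params={"row_count": 10, "column_count": 10},
    )
    resp.raise_for_status()
    preview = resp.json()["frames"][0]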

Assignee

Jakub Hava

Reporter

Nidhi Mehta

Labels

None

CustomerVisible

No

testcase 1

None

testcase 2

None

testcase 3

None

h2ostream link

None

Affected Spark version

None

AffectedContact

None

AffectedCustomers

None

AffectedPilots

None

AffectedOpenSource

None

Support Assessment

None

Customer Request Type

None

Support ticket URL

End date

None

Baseline start date

None

Baseline end date

None

Task progress

None

Task mode

None

ReleaseNotesHidden

None

Fix versions

Priority

Major