It takes a long time to build the model on a sparse dataset (89x5000) when read in using Parquet format on a 5-executor SW cluster.
On data where every value in the sparse vector was specified (no zero values), it took roughly the same time.
The same holds for the case where a non-zero value is at every second position.
So the tested cases:
- really sparse case: just one non-zero value
- dense case: all values in the sparse vector specified
- worst case for index computations: every second value is non-zero (lots of small gaps)
All behaved very similarly and took about 5 seconds to convert, which is fine. I exported the Parquet file to the local filesystem.
Are they loading this Parquet file from HDFS or from the local filesystem as well?
But from my testing I can't reproduce this slowness; was there anything special about their environment?
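The three sparsity patterns above can be sketched like this. This is a minimal illustration using a plain (size, indices, values) representation rather than Spark's actual SparseVector class, and the column count of 5000 is taken from the dataset shape mentioned above:

```python
# Sketch of the three tested sparsity patterns over 5000 columns.
# A sparse vector is modeled as (size, indices, values); Spark's
# pyspark.ml.linalg.SparseVector follows the same idea.

N_COLS = 5000

def really_sparse(n=N_COLS):
    # really sparse case: just one non-zero value
    return (n, [0], [1.0])

def dense_as_sparse(n=N_COLS):
    # dense case: every position specified, none zero
    return (n, list(range(n)), [1.0] * n)

def every_second(n=N_COLS):
    # worst case for index computations: non-zero at every second
    # position, so the index list has lots of small gaps
    idx = list(range(0, n, 2))
    return (n, idx, [1.0] * len(idx))

for name, (size, idx, vals) in [("really sparse", really_sparse()),
                                ("dense", dense_as_sparse()),
                                ("every second", every_second())]:
    print(name, "non-zeros:", len(idx), "of", size)
```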
Bad and good news! I can see that it is really slow from PySparkling.
This code takes a few seconds in Scala.
This, so far, hasn't finished in Python.
Scala Job Spark UI
Python Job Spark UI
So the tasks took around the same time.
So for some reason it takes longer on the PySparkling side (not hours, but around 5 minutes), even though the tasks themselves last the same.
This problem is caused at
, particularly this bit takes long: fr = H2OFrame.get_frame(sid)
And the root cause is the doFetch method in FramesHandler, since it needs to load the first few rows to store into expr on the Python side. And since this is sparse data, these 10 rows are 50000*50 doubles, which takes some time. Minor note: if we pass 0 to it, this code makes it 100.
However, 0 should be a valid value, and we do not have any option.
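To see why the preview is expensive on wide sparse data, here is a rough back-of-the-envelope sketch. This is a simplified model of what doFetch does, not the actual H2O code; the row bump from 0 to 100 mirrors the behavior described above:

```python
# Rough model of the preview cost: the handler materializes the first
# few rows densely, so the cost scales with rows * columns regardless
# of how sparse the data actually is.

def preview_cells(n_rows, n_cols, row_count=10):
    # mirrors the quirk described above: passing 0 bumps it to 100
    if row_count == 0:
        row_count = 100
    return min(row_count, n_rows) * n_cols

# a wide sparse frame: even a 10-row preview touches every column
print(preview_cells(n_rows=50000, n_cols=5000))               # 10 rows * 5000 cols
print(preview_cells(n_rows=50000, n_cols=5000, row_count=0))  # bumped to 100 rows
```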
But rows are not the core issue; columns are. What we can do is also specify the default number of columns to show using column_count (it's already in the API, but Python is not using it). Right now the default means all columns. This would help in the sparse data case.
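The effect of capping the preview columns can be sketched as follows. This is an illustration only: the `column_count` parameter name comes from the API field mentioned above, and `-1` meaning "all columns" is an assumed convention, not confirmed H2O behavior:

```python
# Sketch: bounding the number of preview columns bounds the fetch cost
# for wide sparse frames, independent of how many columns the frame has.

def preview_cells(n_rows, n_cols, row_count=10, column_count=-1):
    rows = min(row_count, n_rows)
    # assumed convention: negative column_count means "all columns"
    cols = n_cols if column_count < 0 else min(column_count, n_cols)
    return rows * cols

wide = dict(n_rows=89, n_cols=5000)
print(preview_cells(**wide))                   # all 5000 columns fetched
print(preview_cells(**wide, column_count=50))  # only 50 columns fetched
```

With the cap in place the preview cost no longer grows with the frame's width, which is exactly what the sparse 5000-column case needs.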