I've been experiencing an issue with the parsing of a large data set (~500M rows, ~200 factor columns).

The issue is that a number of numerical factors are incorrectly detected as categoricals. Our NULL value is "\N" (as output by Hive), and in most cases it is detected correctly. But for a small number of factors (about 8 out of 200) the parser treats it as a category level. The problem is reproducible: if I start a new H2O instance and reload the data, the same 8 columns are flagged incorrectly.

As has been stated elsewhere, calling for example the R as.numeric() function on these columns gives an error (cannot coerce type 'S4' to vector of type 'double'), so they cannot be forced to numeric after parsing.

Interestingly, if I create a table consisting solely of the 8 columns which weren't being parsed correctly, and load JUST those columns as a new dataset into H2O, it parses perfectly. I've also confirmed independently that there is only one non-numeric value in each column, which is indeed the NULL character string.
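
For anyone wanting to repeat that kind of independent check outside H2O, here is a minimal sketch in plain Python. It is not the method used above; the function name, the tab delimiter, and the sample rows are assumptions made for illustration (a raw Hive export may use a different field separator). Note that float() also accepts strings such as "nan" and "inf", so those would not be reported.

```python
import io

def non_numeric_values(lines, delimiter="\t"):
    """Return {column index: set of distinct values that float() rejects}.

    Hypothetical helper; the delimiter is an assumption about the export.
    """
    bad = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        for i, value in enumerate(line.split(delimiter)):
            try:
                float(value)
            except ValueError:
                bad.setdefault(i, set()).add(value)
    return bad

# Made-up sample: three numeric columns whose only non-numeric value is \N.
sample = io.StringIO("1.5\t\\N\t3\n2\t4\t\\N\n\\N\t5e-1\t7\n")
result = non_numeric_values(sample)
print(result)
```

If a column's set contains anything other than the NULL token, that value alone would explain why the column is treated as categorical.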

We are using H2O, and while I realise this won't fully be supported going forward, I would appreciate any insight into the above problem, or how we can work around it for now.
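
One possible stopgap, sketched here under stated assumptions rather than as a confirmed fix, is to blank out the "\N" token before the file reaches H2O. The function name and the tab delimiter are hypothetical, and the sketch assumes the parser treats an empty field as missing, which should be verified on a small sample first.

```python
def blank_hive_nulls(lines, delimiter="\t", null_token="\\N"):
    """Yield each line with fields equal to null_token replaced by ''.

    Hypothetical pre-processing step; only whole-field matches are replaced,
    so a value that merely contains the token is left untouched.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield delimiter.join("" if f == null_token else f for f in fields) + "\n"

# Made-up sample rows for illustration.
cleaned = list(blank_hive_nulls(["1\t\\N\t3\n", "\\N\tabc\\N\t\\N\n"]))
print(cleaned)
```

Because it works line by line, the same generator can stream a large export from one file handle to another without loading it into memory.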



Cliff Click
June 1, 2015, 4:00 PM

We've got several Hive special cases already (e.g. \A field separators), so we might as well handle this one also (in h2o-3).
Over to Brandon...

