Parsing issue and inconsistencies
---------- Forwarded message ----------
Date: Mon, Jun 1, 2015 at 12:09 AM
Subject: [h2ostream] Parsing issue and inconsistencies
I've been experiencing an issue with the parsing of a large data set (~500M rows, ~200 factor columns).
The issue is that a number of numeric columns are incorrectly detected as categoricals. Our NULL value is "\N" (as output by Hive), and in most cases it is detected correctly. But for a small number of columns (about 8 out of 200) the parser treats it as a category level. It is reproducible in the sense that if I start a new H2O instance and reload the data, the same 8 columns are flagged incorrectly.
As has been stated elsewhere, calling the R as.numeric() function on such a column gives an error (cannot coerce type 'S4' to vector of type 'double'), so the column cannot be forced to numeric.
Interestingly, if I create a table consisting solely of the 8 columns which weren't being parsed correctly, and load JUST those columns as a new dataset into H2O, it parses perfectly. I've also confirmed independently that there is only one non-numeric value in each column, which is indeed the NULL character string.
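For anyone wanting to repeat the independent check described above, here is a minimal sketch in Python that is entirely separate from H2O. It collects, per column, every distinct token that does not parse as a number. The function name non_numeric_tokens is hypothetical, and the default delimiter (Ctrl-A, "\x01") is an assumption about how the Hive export was written; adjust it to match the actual file.

```python
from collections import defaultdict

def non_numeric_tokens(lines, delimiter="\x01"):
    """Return {column_index: set of tokens that float() rejects}.

    The Ctrl-A default delimiter is an assumption about the Hive export.
    Note that Python's float() accepts "nan" and "inf", so those tokens
    would not be reported here.
    """
    bad = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        for index, token in enumerate(fields):
            try:
                float(token)
            except ValueError:
                bad[index].add(token)
    return dict(bad)

# Tiny made-up sample: three rows, three columns, "\N" as the NULL marker.
sample = ["1.5\x01\\N\x013", "2\x017\x01\\N", "\\N\x018\x01abc"]
result = non_numeric_tokens(sample)
```

If a column's set contains only "\N", the column is numeric apart from NULLs. For a real file, pass an open file handle instead of the sample list so that rows are streamed rather than loaded into memory.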
We are using H2O 184.108.40.206, and while I realise this won't be fully supported going forward, I would appreciate any insight into the above problem, or how we can work around it for now.
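One possible stopgap, sketched below in Python, is to rewrite the export before loading it so that each "\N" field becomes an empty field. This assumes the parser treats empty fields as missing values, which should be verified on a small sample first. The function name blank_hive_nulls and the Ctrl-A default delimiter are assumptions for illustration, not part of H2O or Hive.

```python
def blank_hive_nulls(line, delimiter="\x01", null_token="\\N"):
    """Replace fields that are exactly the Hive NULL marker with empty fields.

    Only whole-field matches are replaced, so a value that merely
    contains the marker as a substring is left untouched.
    The trailing newline is stripped; the caller adds it back on write.
    """
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join("" if f == null_token else f for f in fields)

# Made-up row with a NULL in the second column.
cleaned = blank_hive_nulls("1\x01\\N\x013\n")
```

Applied line by line while copying the file, this keeps the column count unchanged and leaves every non-NULL value as it was.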
We've got several Hive special cases already (e.g. \A field separators), so we might as well handle this one too (in h2o-3).
Over to Brandon...