General Troubleshooting Tips

The following error message displayed when I tried to launch H2O. What should I do?
Exception in thread "main" java.lang.UnsupportedClassVersionError: water/H2OApp
: Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(Unknown Source)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$000(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: water.H2OApp. Program will exit.

This error output indicates that your Java version is not supported. Upgrade to Java 7 (JVM) or later and H2O should launch successfully.
I am not launching on Hadoop. How can I increase the amount of time that H2O allows for expected nodes to connect?

For cluster startup, if you are not launching on Hadoop, you do not need to specify a timeout. You can add nodes to the cloud as long as you haven’t submitted any jobs to the cluster. Once you do submit a job, the cluster locks to new members and prints a message similar to “Locking cloud to new members, because <reason>...”.

What’s the best approach to help diagnose a possible memory problem on a cluster?

We’ve found that the best way to understand JVM memory consumption is to turn on the following JVM flags:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
You can then use the following tool to analyze the output: http://www.tagtraum.com/gcviewer-download.html

How can I debug memory issues?

We recommend the following approach using R to debug memory issues:

for (i in seq_len(n)) {
  # 1. Perform the work for this iteration.

  # 2. Remove R objects that are no longer needed, including R handles
  #    to H2O objects.
  rm(some_r_object)
  rm(some_h2o_object)

  # 3. Trigger removal of the H2O back-end objects that were rm'd above,
  #    since rm() can be lazy.
  gc()

  # 4. Optional extra call to be paranoid; this is usually very fast.
  gc()

  # 5. Optionally sanity check that you see only what you expect to see
  #    here, and not more.
  h2o.ls()

  # 6. Tell the back-end cluster nodes to do three back-to-back JVM full GCs.
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
}
Note that the .h2o.garbageCollect() function works as follows:

# Trigger an explicit garbage collection across all nodes in the H2O cluster.
.h2o.garbageCollect <- function() {
  res <- .h2o.__remoteSend("GarbageCollect", method = "POST")
}

This tells the back end to do a forcible full GC on each node in the H2O cluster. Doing three of them back-to-back makes it stand out clearly in the GCViewer chart where the bottom of the inner loop is. You can then correlate what you expect to see with the X (time) axis of the memory utilization graph.

At this point you want to see whether the bottom trough of the usage is growing from iteration to iteration after the triple full-GC bars in the graph. If the trough is not growing, there is no leak; your memory usage is simply too high for the available heap, and you need a bigger one. If the trough is growing, there is likely some kind of leak. You can try using h2o.ls() to learn where the leak is. If h2o.ls() doesn’t help, you will have to drill much deeper using, for example, YourKit to review JVM-level heap profiles.

Algorithms
What’s the process for implementing new algorithms in H2O?

This blog post by Cliff walks you through building a new algorithm, using K-Means, Quantiles, and Grep as examples.

To learn more about performance characteristics when implementing new algorithms, refer to Cliff’s KV Store Guide.

How do I find the standard errors of the parameter estimates (p-values)?

P-values are currently supported for non-regularized GLM. The following requirements must be met:

The family cannot be multinomial
The lambda value must be equal to zero
The IRLSM solver must be used
Lambda search cannot be used
To generate p-values, do one of the following:

Check the compute_p_values checkbox in the GLM model builder in Flow
Use compute_p_values=TRUE in R or Python while creating the model

The p-values are listed in the coefficients table.

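For intuition, here is a minimal pure-Python sketch of how a p-value can be derived from a coefficient estimate and its standard error via a two-sided z-test. The coefficient values below are hypothetical, and this is an illustration of the statistics involved, not H2O's actual implementation:

```python
import math

def p_value(beta, std_err):
    """Two-sided p-value for H0: coefficient == 0, using a normal (z) test."""
    z = beta / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficients table: (name, estimate, standard error).
coefficients = [
    ("Intercept", 1.20, 0.30),
    ("x1",        0.05, 0.40),
]
for name, beta, se in coefficients:
    print(name, p_value(beta, se))
```

A large estimate relative to its standard error (the intercept above) yields a small p-value; an estimate close to zero relative to its error (x1) yields a p-value near 1.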
How do I specify regression or classification for Distributed Random Forest in the web UI?

If the response column is numeric, H2O generates a regression model. If the response column is enum, the model uses classification. To specify the column type, select it from the drop-down column name list in the Edit Column Names and Types section during parsing.

What’s the largest number of classes that H2O supports for multinomial prediction?

For tree-based algorithms, the maximum number of classes (or levels) for a response column is 1000.

How do I obtain a tree diagram of my DRF model?

Output the SVG code for the edges and nodes. A simple tree visitor is available here and the Java code generator is available here.

Is Word2Vec available? I can see the Java and R sources, but calling the API generates an error.

Word2Vec, along with other natural language processing (NLP) algorithms, is currently in development in the current version of H2O.

What are the “best practices” for preparing data for a K-Means model?

There aren’t specific “best practices,” as it depends on your data and the column types. However, removing outliers and transforming any categorical columns to have the same weight as the numeric columns will help, especially if you’re standardizing your data.
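As an illustration of this kind of preparation, here is a minimal pure-Python sketch (not an H2O API) that z-scores a numeric column and one-hot encodes a categorical one. Depending on your data, you may also want to scale the indicator columns so the categorical feature's aggregate weight matches the numeric features:

```python
import statistics

def standardize(values):
    """Z-score a numeric column so it has mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against a constant column
    return [(v - mu) / sd for v in values]

def one_hot(values):
    """Expand a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return {lvl: [1.0 if v == lvl else 0.0 for v in values] for lvl in levels}

age = standardize([25, 35, 45, 55])
color = one_hot(["red", "blue", "red", "green"])
```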

What is your implementation of Deep Learning based on?

Our Deep Learning algorithm is based on the feedforward neural net. For more information, refer to our Data Science documentation or Wikipedia.

How is deviance computed for a Deep Learning regression model?

For a Deep Learning regression model, deviance is computed as follows:

If the loss function is Mean Square, then MSE == Deviance. For Absolute/Laplace or Huber loss, MSE != Deviance.
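To make the distinction concrete, here is a small pure-Python sketch comparing MSE with mean absolute error, which plays the deviance role under Absolute/Laplace loss (the values are illustrative only, and this is not H2O code):

```python
def mse(actual, predicted):
    """Mean squared error: equals the deviance under squared (Gaussian) loss."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: the deviance-style metric under Absolute/Laplace loss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

y    = [1.0, 2.0, 3.0]
yhat = [1.5, 2.0, 2.0]
# Under squared loss, deviance and MSE coincide; under Laplace loss they differ.
print(mse(y, yhat), mae(y, yhat))
```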

For my 0-tree GBM multinomial model, I got a different score depending on whether or not validation was enabled, even though my dataset was the same - why is that?

Different results may be generated because of the way H2O computes the initial MSE.

How does your Deep Learning Autoencoder work? Is it deep or shallow?

H2O’s DL autoencoder is based on the standard deep (multi-layer) neural net architecture, where the entire network is learned together, instead of being stacked layer-by-layer. The only difference is that no response is required in the input and that the output layer has as many neurons as the input layer. If you don’t achieve convergence, then try using the Tanh activation and fewer layers. We have some example test scripts here, and even some that show how stacked auto-encoders can be implemented in R.

Are there any H2O examples using text for classification?

Currently, the following examples are available for Sparkling Water:

Use TF-IDF weighting scheme for classifying text messages
https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/hamOrSpam.script.scala

Use Word2Vec Skip-gram model + GBM for classifying job titles
https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/craigslistJobTitles.script.scala

Most machine learning tools cannot predict with a new categorical level that was not included in the training set. How does H2O make predictions in this scenario?

Here is an example of how the prediction process works in H2O:

Train a model using data that has a categorical predictor column with levels B, C, and D (no other levels); these levels form the “training set domain”: {B,C,D}
During scoring, the test set has only rows with levels A, C, and E for that column; this is the “test set domain”: {A,C,E}
For scoring, a combined “scoring domain” is created, which is the training domain appended with the extra test set domain entries: {B,C,D,A,E}
Each model can handle these extra levels {A,E} separately during scoring.
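The domain-combination step above can be sketched in a few lines of Python (an illustration of the idea, not H2O's internals):

```python
def scoring_domain(train_domain, test_domain):
    """Append test-set levels unseen in training to the training domain,
    preserving order, to form the combined scoring domain."""
    extra = [lvl for lvl in test_domain if lvl not in train_domain]
    return train_domain + extra

# Matches the example: {B,C,D} + {A,C,E} -> {B,C,D,A,E}
print(scoring_domain(["B", "C", "D"], ["A", "C", "E"]))
```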
The behavior for unseen categorical levels depends on the algorithm and how it handles missing levels (NA values):

For DRF and GBM, missing values are interpreted as containing information (i.e., missing for a reason) rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.
Deep Learning creates an extra input neuron for missing and unseen categorical levels, which can remain untrained if there were no missing or unseen categorical levels in the training data, resulting in a random contribution to the next layer during testing.
GLM skips unseen levels in the beta*x dot product.
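As a rough illustration of the GLM case, the sketch below (hypothetical coefficients, not H2O code) skips any categorical level that has no trained coefficient, so an unseen level contributes 0 to the beta*x dot product:

```python
def glm_dot(betas, row_levels):
    """Sum the coefficients of the categorical levels present in a row,
    skipping levels with no trained coefficient (unseen during training)."""
    return sum(betas[lvl] for lvl in row_levels if lvl in betas)

betas = {"B": 0.4, "C": -0.1, "D": 0.7}  # hypothetical trained coefficients
print(glm_dot(betas, ["C"]))  # a seen level contributes its coefficient
print(glm_dot(betas, ["A"]))  # an unseen level is skipped and contributes 0
```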
How are quantiles computed?

The quantile results in Flow are computed lazily on demand and cached. The computation is a fast approximation (bin width of (max - min) / 1024) that is very accurate for most use cases. If the distribution is skewed, the quantile results may not be as accurate as the results obtained using h2o.quantile in R or H2OFrame.quantile in Python.
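The binning idea can be sketched in pure Python as follows. This is a simplified single-pass histogram quantile with a (max - min) / 1024 bin width, shown only to illustrate why the approximation degrades on skewed data; it is not H2O's actual implementation:

```python
def approx_quantile(values, prob, nbins=1024):
    """Histogram-based quantile approximation: bucket values into equal-width
    bins, then walk the counts up to the target rank."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # guard against a constant column
    counts = [0] * nbins
    for v in values:
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    target = prob * len(values)
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= target:
            # Return the bin midpoint; the error is at most one bin width,
            # which can be large if the data are heavily skewed.
            return lo + (i + 0.5) * width
    return hi
```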

How do I create a classification model? The model always defaults to regression.

To create a classification model, the response column type must be enum - if the response is numeric, a regression model is created.
