Memory leak in H2O (standalone cluster)



I created a reproducible example in R and tested it on a tiny 4-node Linux cluster using H2O -

The workflow creates dummy data and then iteratively computes a new model, makes a prediction, calculates a dummy KPI, and finally removes the model plus the prediction data. It uses the "full blown GC" approach from Tom (
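A minimal sketch of that loop (the dataset shape, model type, and KPI are placeholders I made up, not the original script; the extra GC step is only indicated as a comment because the exact call is version-dependent):

```r
library(h2o)
h2o.init()

# dummy data; dimensions are placeholders
df <- as.h2o(data.frame(matrix(rnorm(10000), ncol = 10)))

for (i in 1:50) {
  fit  <- h2o.gbm(x = paste0("X", 1:9), y = "X10",
                  training_frame = df)          # new model each iteration
  pred <- h2o.predict(fit, df)                  # new prediction frame
  kpi  <- mean(abs(as.vector(pred$predict) -
                   as.vector(df$X10)))          # dummy KPI
  h2o.rm(fit)                                   # simple housekeeping:
  h2o.rm(pred)                                  # drop model + prediction
  # "full blown GC" variant: additionally ask every node to run a JVM GC
  # (internal, version-dependent REST endpoint, e.g. /3/GarbageCollect)
}
```

The two runs described below differ only in whether the commented GC step is performed after each iteration.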
You can run it with

I ran it twice: once with only simple housekeeping (h2o.rm) and a larger dataset, and once with Tom's GC approach and a smaller dataset. In both cases I used a fresh H2O cluster where each of the four nodes was started according to
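For reference, a standalone node start command typically looks like the following (the heap size, cluster name, and flatfile path are placeholders, not the original settings; -name, -flatfile, and -port are standard h2o.jar options):

```shell
java -Xmx10g -jar h2o.jar -name h2o_leak_test -flatfile flatfile.txt -port 54321
```

The -Xmx value is what bounds the heap growth discussed below, so each node runs out of memory once the leak accumulates past it.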

I attached the jvm node logs from each run.

  • only simple housekeeping

  • triggers multiple GC runs

A first analysis indicates that in both cases the heap increases from iteration to iteration, regardless of whether we use just simple housekeeping or multiple garbage collections.

  • only simple housekeeping

  • triggers multiple GC runs (in the picture I hid the GC runs so that the heap consumption is visible)



Monitoring memory consumption in H2O shows that there is a memory leak when running repetitive model-creation jobs. Typical ML use cases that require this are, for example, hyperparameter tuning, model validation using a resampling approach, feature selection, bootstrapping, and so on.
Our example is about feature selection, where we take a subset of the features, train a model, and evaluate it afterwards. After each iteration, all newly created datasets (the prediction dataset) and the models are removed with h2o.rm().

The cluster is a six-node cluster where each node is started with

After we encountered the first problem, we started a new run and in parallel created a small monitoring script to get a constant update of H2O cluster statistics (using

). This script runs on the main node:

Additionally the script also counts the number of keys using the R API function

(user objects).
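A sketch of such a monitoring loop in R (the exact columns returned by h2o.clusterStatus() vary by H2O version, and the output file and polling interval are placeholders, not the original script):

```r
library(h2o)
h2o.init(startH2O = FALSE)  # attach to the already-running cluster

repeat {
  status   <- h2o.clusterStatus()  # per-node stats, incl. free/POJO memory
  kv_count <- nrow(h2o.ls())       # number of user-visible keys
  write.table(cbind(time = format(Sys.time()), kv_count, status),
              file = "h2o_monitor.csv", append = TRUE,
              sep = ",", col.names = FALSE, row.names = FALSE)
  Sys.sleep(60)
}
```

Logging kv_count alongside the per-node memory columns is what lets the analysis below separate "user forgot to clean up" from an internal leak.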
The analysis of the monitoring data after approx. 15 h shows:

  1. We keep our workspace clean, so the number of user objects is constant (kv_count)

  2. Free memory is decreasing over time (free_mem)

  3. POJO memory is increasing over time, with clearly visible spikes (pojo_mem)

The pojo_mem spikes correspond to log warnings of the form
[for node .45.2]

[for node .45.3]

As the number of user objects is constant, the memory increase points to some kind of problematic garbage collection or internal housekeeping, and it has a serious impact on the usability of the H2O cluster: node failure. In our first run the effect manifested as the current job stopping further processing while most Flow requests became unresponsive. To resolve this we had to restart the cluster, meaning a complete loss of data and results.


Divya Mereddy
December 3, 2019, 7:59 PM

I also faced the same problem with memory leakage.

Tom Kraljevic
October 21, 2016, 5:17 PM

Tom Kraljevic
October 21, 2016, 5:08 PM

I tried this myself, and am not seeing the same problem.

Running with h2o

Here is how I started H2O:

Here is the R script I am using:

Attaching the graph from GCViewer (tom_gc_1.png)


Roberto Rösler
