Memory leak in H2O (standalone cluster)
Description
UPDATE
I created a reproducible example in R and tested it on a tiny 4-node Linux cluster using h2o 3.8.3.2.
The workflow creates dummy data and then iteratively computes a new model, makes a prediction, calculates a dummy KPI, and finally removes the model plus the prediction data. It uses the "full blown gc" approach from Tom (https://groups.google.com/d/msg/h2ostream/Dc6l4xzwkaU/n-w2p02mBwAJ).
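In outline, the loop looks roughly like the sketch below. A GBM on a small random frame is assumed here purely for illustration, and the "full blown gc" step from the linked post is only indicated by a comment; the attached script may differ in the details.

    library(h2o)
    h2o.init(ip = "localhost", port = 54321)        # connect to the running cluster

    # dummy data (shape and column names chosen arbitrarily for this sketch)
    x_mat <- matrix(rnorm(1e4 * 20), ncol = 20,
                    dimnames = list(NULL, paste0("x", 1:20)))
    raw   <- data.frame(y = rnorm(1e4), x_mat)
    df    <- as.h2o(raw, destination_frame = "dummy_data")

    for (i in 1:500) {
      fit  <- h2o.gbm(x = setdiff(names(df), "y"), y = "y",
                      training_frame = df, ntrees = 20)
      pred <- h2o.predict(fit, df)

      # dummy KPI: mean squared error of the prediction
      kpi <- mean((as.vector(pred$predict) - as.vector(df$y))^2)

      # housekeeping: remove the model and the prediction frame again
      h2o.rm(pred)
      h2o.rm(fit)
      # the "full blown gc" call from the linked h2ostream post would go here
    }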
You can run it with
I ran it twice: once with only simple housekeeping (h2o.rm) and a larger dataset, and once with Tom's GC approach and a smaller dataset. In both cases I used a fresh h2o cluster where each of the four nodes was started according to
I attached the JVM node logs from each run.
only simple housekeeping
triggers multiple GC runs
A first analysis indicates that in both cases the heap grows from iteration to iteration, regardless of whether we use just simple housekeeping or multiple garbage collections.
only simple housekeeping
triggers multiple GC runs (in the picture I hid the GC runs so that the heap consumption is visible)
END OF UPDATE
--------------------------------------------------------------------------------------------
Monitoring memory consumption in h2o shows that there is a memory leak when running repetitive model-creation jobs. Typical ML use cases where you want to do this are, for example, hyperparameter tuning, model validation using a resampling approach, feature selection, bootstrapping, ...
Our example is about feature selection: we take a subset of the features, train a model, and evaluate it afterwards. After each of these iterations all newly created data sets (the prediction data set) and the models are removed with h2o.rm().
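Schematically, each iteration follows the pattern below; the training and validation frames (train, valid), the response name, and the feature subsets are placeholders, and our actual script differs in scale.

    feature_sets <- list(c("x1", "x2"), c("x1", "x3", "x4"))   # placeholder subsets

    for (features in feature_sets) {
      fit   <- h2o.gbm(x = features, y = "y", training_frame = train)
      pred  <- h2o.predict(fit, valid)
      score <- h2o.mse(h2o.performance(fit, newdata = valid))

      # remove everything created in this iteration
      h2o.rm(pred)
      h2o.rm(fit)
    }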
The cluster is a six-node cluster where each node is started with
After we encountered the first problem, we started a new run and in parallel created a small monitoring script to get constant updates of the h2o cluster statistics (using
). This script runs on the main node:
Additionally, the script counts the number of keys using the R API function
(user objects).
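A monitoring loop of this kind can be sketched roughly as follows; h2o.clusterStatus() and h2o.ls() are the assumed statistics and key-listing calls, and the status column names, the output file, and the polling interval are placeholders.

    library(h2o)
    h2o.init(ip = "localhost", port = 54321)        # connect to the main node

    stats <- data.frame()
    repeat {
      status   <- h2o.clusterStatus()               # one row of statistics per node
      kv_count <- nrow(h2o.ls())                    # number of user objects (keys)

      stats <- rbind(stats,
                     data.frame(time     = Sys.time(),
                                node     = status$h2o,         # assumed column names
                                free_mem = status$free_mem,
                                pojo_mem = status$pojo_mem,
                                kv_count = kv_count))
      write.csv(stats, "h2o_monitor.csv", row.names = FALSE)   # placeholder file name
      Sys.sleep(60)                                 # poll once per minute
    }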
The analysis of the monitoring data after approx. 15 h shows:
We keep our workspace clean, so the number of user objects is constant (kv_count)
Free memory is decreasing over time (free_mem)
POJO memory is increasing over time, with clearly visible spikes (pojo_mem)
The pojo_mem spikes correspond with log warnings of the form
[for node .45.2]
[for node .45.3]
As the number of user objects is constant, the memory increase indicates some kind of problematic garbage collection or housekeeping and has a serious impact on the usability of the h2o cluster: node failure. In our first run we encountered this effect in the form of the current job stopping further processing while most Flow requests became unresponsive. To solve this problem we had to restart the cluster, meaning a complete loss of data and results.
Activity
I also faced the same problem with memory leakage.
I tried this myself, and am not seeing the same problem.
Running with h2o 3.10.0.8.
Here is how I started h2o:
Here is the R script I am using:
Attaching the graph from gcviewer (tom_gc_1.png)