H2O stops accessing HDFS after Kerberos ticket is renewed
Description
In Kerberized clusters, a non-expired Kerberos ticket is required to access HDFS, and this ticket has a limited lifetime.
After the ticket is renewed (it must be renewed before it expires), the H2O service has problems accessing HDFS:
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: Caught exception: HDFS IO Failure:
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: accessed URI : hdfs://dcoabhdfs01.some.domain:8020/user/svc_h2oqa/h2oflows/environment/clips
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: configuration: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, /var/run/cloudera-scm-agent/process/1696-yarn-NODEMANAGER/core-site.xml
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache; Stacktrace: [water.persist.PersistHdfs.exists(PersistHdfs.java:438), water.persist.PersistManager.exists(PersistManager.java:317), water.init.NodePersistentStorage.exists(NodePersistentStorage.java:94), water.api.NodePersistentStorageHandler.exists(NodePersistentStorageHandler.java:21), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:606), water.api.Handler.handle(Handler.java:64), water.api.RequestServer.handle(RequestServer.java:644), water.api.RequestServer.serve(RequestServer.java:585), water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:617), water.JettyHTTPD$H2oDefaultServlet.doGet(JettyHTTPD.java:559), javax.servlet.http.HttpServlet.service(HttpServlet.java:707), javax.servlet.http.HttpServlet.service(HttpServlet.java:820), org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)]
klist shows the new ticket as refreshed and not expired.
Does the H2O service have to somehow renew its HDFS delegation token after the Kerberos ticket is renewed? See the error above: "token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache".
Activity
I know this is an older ticket, but there is a known issue with renewing Kerberos tickets.
It is fixed in Spark 3.0 (not released yet):
https://issues.apache.org/jira/browse/SPARK-25689
Also, we haven't run into this problem for a while when using the H2O external cluster.
I think this ticket can be closed.
Since H2O just uses the standard MapReduce ApplicationMaster, I can't really think of what else there would be to change...
It would be interesting to pose this question to one of the Hadoop distro vendors, then...
The KDC / Active Directory limits the maximum ticket lifetime, so you can't do that.
Look at what Spark does:
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L184
They renew the ticket themselves periodically; that's the only fix.
And again: yes, we have renewable tickets, but that alone does not help.
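For reference, a minimal shell sketch of such periodic renewal, done outside H2O itself (Spark does this programmatically inside its ApplicationMaster; the keytab path and principal below are placeholder assumptions, not anyone's actual setup):

#!/bin/bash
# Hypothetical renewer loop for a long-running H2O cluster (not H2O's or Spark's actual code).
# Assumes the TGT was originally obtained as renewable, e.g. with -r7d.
KEYTAB=/path/to/keytab-file   # placeholder
PRINCIPAL=svch2o              # placeholder
while true; do
  # Renew the existing renewable TGT; once the renewable lifetime is
  # exhausted, fall back to obtaining a fresh ticket from the keytab.
  kinit -R || kinit -kt "$KEYTAB" "$PRINCIPAL" -r7d
  sleep 6h  # renew well before the (typically 10h) ticket lifetime expires
done

Note that this only keeps the TGT fresh; whether the client then picks up fresh HDFS delegation tokens is a separate question, which is what the Spark change linked above addresses.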
It would be helpful to know whether, for example, '-r365d' solves this issue in your environment.
If so, I'll add that to the docs.
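For instance, a hedged example of requesting a ticket with a long renewable lifetime (the keytab path and principal are placeholders):

kinit -kt /path/to/keytab-file svch2o -r365d
klist   # verify the ticket's expiry and renewable lifetime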
Hi, I asked about the specifics, and this was the response from the user who solved it:
At Strata you were asking how we resolved the expired HDFS_DELEGATION_TOKEN issue with kinit, since we have long-running H2O clusters on our Hadoop cluster. Here's what our guys do when they launch the H2O clusters for us.
# First step is to create a renewable ticket for Kerberos:
kinit -kt keytab-file svch2o -r7d  # here svch2o is the user account we're using to launch the H2O processes
[ ... then run startup script with hadoop jar command ... ]
So, apparently there is nothing special going on other than requesting a renewable Kerberos ticket for the user before executing the startup script.
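Putting the two steps together, a minimal sketch of such a launch sequence (the keytab path, principal, and h2odriver arguments are illustrative placeholders, not the user's actual script):

#!/bin/bash
# 1. Obtain a renewable Kerberos ticket for the service account.
kinit -kt /path/to/keytab-file svch2o -r7d
# 2. Launch the H2O cluster on Hadoop; node count, memory, and output dir are placeholders.
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output /user/svch2o/h2o_out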