H2O stops accessing HDFS after Kerberos ticket is renewed

Description

In Kerberized clusters, a non-expired ticket is required to access HDFS.
This ticket has a limited lifetime.
After a ticket is renewed (it has to be renewed before it expires), the H2O service has problems accessing HDFS:

01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: Caught exception: HDFS IO Failure:
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: accessed URI : hdfs://dcoabhdfs01.some.domain:8020/user/svc_h2oqa/h2oflows/environment/clips
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: configuration: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, /var/run/cloudera-scm-agent/process/1696-yarn-NODEMANAGER/core-site.xml
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache; Stacktrace: [water.persist.PersistHdfs.exists(PersistHdfs.java:438), water.persist.PersistManager.exists(PersistManager.java:317), water.init.NodePersistentStorage.exists(NodePersistentStorage.java:94), water.api.NodePersistentStorageHandler.exists(NodePersistentStorageHandler.java:21), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:606), water.api.Handler.handle(Handler.java:64), water.api.RequestServer.handle(RequestServer.java:644), water.api.RequestServer.serve(RequestServer.java:585), water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:617), water.JettyHTTPD$H2oDefaultServlet.doGet(JettyHTTPD.java:559), javax.servlet.http.HttpServlet.service(HttpServlet.java:707), javax.servlet.http.HttpServlet.service(HttpServlet.java:820), org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)]

klist shows the new ticket as refreshed and not expired.
Does the H2O service have to somehow renew its token for the renewed Kerberos ticket? See the error above: "token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache".

Activity

Ruslan Dautkhanov
June 10, 2019, 7:00 PM

I know this is an older ticket, but there is an issue with renewing Kerberos tickets.

It is fixed in Spark 3.0 (not released):
https://issues.apache.org/jira/browse/SPARK-25689

Also, we haven't run into this problem for a while when using the H2O external cluster.

I think this ticket can be closed.

Tom Kraljevic
April 1, 2016, 10:39 PM

Since h2o just uses the standard MapReduce ApplicationMaster, I can't really think of what else there would be to change...
Would be interesting to pose this question to one of the Hadoop distro vendors then...

Ruslan Dautkhanov
April 1, 2016, 4:21 PM

The KDC / Active Directory limits the maximum ticket lifetime. You can't do that.

Look at what Spark does:
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L184
They renew the ticket themselves periodically. That's the only fix.

And again, yes, we have renewable tickets but this does not help.

Tom Kraljevic
April 1, 2016, 4:12 PM

Would be helpful to know if, for example, '-r365d' solves this issue in your environment.
If so, I'll add that to the docs.

Tom Kraljevic
April 1, 2016, 4:10 PM

Hi, I asked about the specifics and this was the response from the user who solved this:

At Strata you were asking how we resolved the expired HDFS_DELEGATION_TOKEN issue with kinit since we have long-running h2o clusters on our hadoop cluster. Here’s what our guys do when they launch the h2o clusters for us.

# First step is to create a renewable ticket for Kerberos:
kinit -kt keytab-file svch2o -r7d # here svch2o is the user account we’re using to launch the h2o processes

[ ... then run startup script with hadoop jar command ... ]

So, apparently there is nothing special going on other than requesting a renewable kerberos ticket for the user before executing the startup script.
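
The ticket step described above could be wrapped in a small pre-launch check. The following is only a sketch: the function names are hypothetical, and the renewability check assumes MIT Kerberos `klist` output, which prints a "renew until" line for renewable tickets. It covers the ticket request only and does not touch the HDFS delegation token itself.

```shell
#!/bin/sh
# Hypothetical helpers; names and arguments are illustrative, not from the ticket.

# Request a renewable ticket from a keytab.
#   $1 = keytab file, $2 = principal, $3 = renewable lifetime (e.g. 7d)
request_renewable_ticket() {
    kinit -k -t "$1" -r "$3" "$2"
}

# Succeed only if the current ticket cache holds a renewable ticket.
# Assumption: MIT Kerberos klist prints "renew until <time>" for such tickets.
ticket_is_renewable() {
    klist 2>/dev/null | grep -q 'renew until'
}

# Obtain the ticket and report failure if it is not renewable
# (for example, because the KDC does not grant renewable tickets).
#   returns 0 on success, 1 if kinit failed, 2 if the ticket is not renewable
prepare_ticket() {
    request_renewable_ticket "$1" "$2" "$3" || return 1
    if ! ticket_is_renewable; then
        echo "ticket for $2 is not renewable" >&2
        return 2
    fi
}
```

A launch wrapper could call `prepare_ticket keytab-file svch2o 7d` and abort on a non-zero status before running the startup script.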

Assignee

New H2O Bugs

Fix versions

None

Reporter

Ruslan Dautkhanov

Priority

Critical