eG Monitoring
 

Measures reported by HdpNNUgiTest

Authentication is the first level of security for any system. It is all about validating the identity of a user or a process. In a simple sense, it means verifying a username and password.

Hadoop uses Kerberos for authentication and identity propagation. Kerberos is a network authentication protocol, which eliminates the need for transmission of passwords across the network and removes the potential threat of an attacker sniffing the network. It uses “tickets” to allow nodes and users to identify themselves.

If Kerberos authentication is not configured properly, authentication will fail every time a DataNode attempts to communicate with the NameNode in the Hadoop cluster. Likewise, clients will be unable to log in to the NameNode to submit application requests. To ensure that users and nodes can access Hadoop storage at all times, administrators should not tolerate such authentication failures, and should check the Kerberos configuration immediately if such failures occur frequently.

At the same time, repeated authentication failures may not always imply a Kerberos configuration issue. Sometimes, users with malicious intent can pose as a trusted identity and attempt to gain access to the data stored in Hadoop. Kerberos may be foiling such attempts by failing authentication. Administrators need to be wary of such attempts as well.

Also, a delay in authentication, no matter how short, can adversely impact user satisfaction with the Hadoop storage. To keep Hadoop users satisfied, administrators should promptly detect such delays, determine their cause, and eliminate it before end users complain.

The insights provided by the HdpNNUgiTest test help administrators on all of the above counts. This test closely tracks login attempts to the NameNode in a Hadoop cluster and alerts administrators to consistent authentication failures. In the process, the test sheds light on improper Kerberos configuration or suspicious login activity on the storage. Additionally, the test measures the average time taken by successful and failed logins, thus pointing administrators to authentication delays that may be spoiling user experience with Hadoop storage.
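To illustrate how per-second login rates and an average login time of the kind described above could be derived, here is a minimal Python sketch. It assumes, purely for illustration, that the raw data consists of two snapshots of cumulative login counters; the `LoginSnapshot` type, its field names, and the `login_measures` function are hypothetical and are not taken from the eG agent or from Hadoop.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LoginSnapshot:
    """Hypothetical cumulative login counters sampled at one point in time."""
    timestamp: float           # sample time, in seconds
    success_count: int         # successful logins so far
    failure_count: int         # failed logins so far
    success_time_total: float  # total seconds spent on successful logins so far


def login_measures(prev: LoginSnapshot, curr: LoginSnapshot) -> Optional[Dict[str, float]]:
    """Derive rates and average success time between two snapshots.

    Returns None when a counter went backwards, which is assumed here
    to mean the counters were reset (for example, by a restart).
    """
    interval = curr.timestamp - prev.timestamp
    if interval <= 0:
        raise ValueError("snapshots must be in increasing time order")

    successes = curr.success_count - prev.success_count
    failures = curr.failure_count - prev.failure_count
    if successes < 0 or failures < 0:
        return None

    time_spent = curr.success_time_total - prev.success_time_total
    avg_time = time_spent / successes if successes else 0.0

    return {
        "success_rate": successes / interval,  # successes per second
        "fail_rate": failures / interval,      # failures per second
        "success_avg_time": avg_time,          # seconds per successful login
    }
```

Working from the difference between two samples, rather than from the raw totals, keeps each reading specific to the most recent measurement period.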

Outputs of the test: One set of results for the Hadoop storage being monitored

The measures made by this test are as follows:

Measurement: Kerb_login_success_rate
Description: Indicates the rate of successful logins to the NameNode in the target Hadoop cluster.
Measurement Unit: Successes/Sec
Interpretation: A high value is desired for this measure.

Measurement: Kerb_loginsucces_avgtime
Description: Indicates the average time taken to successfully authenticate logins to the NameNode.
Measurement Unit: Seconds
Interpretation: A high value indicates that the cluster is taking too long to authenticate logins. This can have a negative impact on user experience. You may want to check Kerberos configuration for irregularities.

Measurement: Kerb_login_fail_rate
Description: Indicates the rate of failed logins to the NameNode in the target Hadoop cluster.
Measurement Unit: Failures/Sec
Interpretation: A high value for this measure is a cause for concern, as it indicates frequent authentication failures.

Clusters that use Kerberos for authentication have several potential sources of problems, including:

  • Failure of the Key Distribution Center (KDC)

  • Missing Kerberos or OS packages or libraries

  • Incorrect mapping of Kerberos REALMs for cross-realm authentication

These are just some examples, but they can prevent users and services from authenticating and can interfere with the cluster's ability to run and process workloads. The first step whenever an issue emerges is to try to isolate the source of the actual issue, by answering basic questions such as these:

  • Is the issue a local issue or a global issue? That is, are all users failing to authenticate, or is the issue specific to a single user?

  • Is the issue specific to a single service, or are all services affected?

If all users and multiple services are affected, and if the cluster has not worked at all since it was integrated with Kerberos for authentication, step through all the settings in the Kerberos configuration files.
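The isolation questions above can be sketched as a small classification step. This is only an illustration: it assumes that failed logins are available as (user, service) pairs and that the full list of users is known. The function name, the input shape, and the scope labels are all hypothetical. The sketch also does not model whether the cluster has ever worked since Kerberos was enabled, which still has to be judged separately.

```python
from typing import Iterable, Set, Tuple


def failure_scope(failures: Iterable[Tuple[str, str]], known_users: Set[str]) -> str:
    """Classify failed logins as a local or a global problem.

    failures    -- hypothetical (user, service) pairs, one per failed login
    known_users -- every user expected to authenticate against the cluster
    """
    pairs = list(failures)
    failed_users = {user for user, _ in pairs}
    failed_services = {service for _, service in pairs}

    if not pairs:
        return "none"
    # All users failing across more than one service: a global problem.
    if known_users and failed_users >= known_users and len(failed_services) > 1:
        return "global"
    # Only one user is failing: a local problem.
    if len(failed_users) == 1:
        return "single-user"
    # Several users, but all against the same service.
    if len(failed_services) == 1:
        return "single-service"
    return "mixed"
```

A "global" result corresponds to the case in which all users and multiple services are affected; the other labels point to a narrower investigation.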

However, a configuration issue is not always the reason for a spike in authentication failures. If this measure registers an unusually high value during certain time windows, it could indicate an attempt to break into the cluster. Take the necessary steps to protect your cluster against such attacks.
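One simple way to spot time windows with an unusually high failure rate is to compare each window against the typical (median) rate. The Python sketch below is a heuristic under stated assumptions, not the way this test raises alerts: the function name, the `factor` multiplier, and the `min_rate` floor are hypothetical and would need tuning for a real cluster.

```python
from statistics import median
from typing import List, Sequence


def unusual_windows(fail_rates: Sequence[float],
                    factor: float = 5.0,
                    min_rate: float = 0.5) -> List[int]:
    """Return the indices of windows whose failure rate looks unusually high.

    fail_rates -- failures/sec, one value per time window
    factor     -- assumed multiple of the median that counts as unusual
    min_rate   -- assumed floor, so a near-zero baseline does not flag noise
    """
    if not fail_rates:
        return []
    # The median is used as the baseline because a few spikes barely move it.
    threshold = max(min_rate, factor * median(fail_rates))
    return [i for i, rate in enumerate(fail_rates) if rate > threshold]
```

Windows flagged this way are only candidates for review. Whether they point to a misconfiguration or to suspicious login activity still has to be determined by the administrator.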

Measurement: Kerb_loginfail_avgtime
Description: Indicates the time taken for authentication to fail.
Measurement Unit: Seconds
Interpretation: A high value indicates that the cluster is waiting too long before it fails an authentication attempt. You may want to check Kerberos configuration to figure out where the bottleneck is.