Introduction to Hadoop Security – How to secure a Hadoop Cluster?
Is the data stored and processed with Hadoop actually secure?
Hadoop is a software framework for storing and processing vast amounts of data. This article covers Hadoop Security. It first explains why Hadoop needs security, then introduces the 3 A’s of security and how Hadoop achieves them. It describes Kerberos, transparent encryption in HDFS, and HDFS file and directory permissions, which together address HDFS security issues. Finally, it lists some Hadoop ecosystem components for monitoring and managing Hadoop Security.
Why Hadoop Security?
Hadoop was originally designed to manage large amounts of data in a trusted environment, so security was not a significant concern. But as Hadoop was adopted in almost every sector, including business, finance, health care, the military, education, and government, security became a major concern.
Early Hadoop releases lacked strong security features, and the built-in security and available options were inconsistent across release versions. This affected sectors that handle sensitive data, such as business, health care, national security, and the military. It became obvious that Hadoop needed a consistent security mechanism.
Security was therefore a major step that the Hadoop framework had to take.
What to do about Hadoop Security?
Securing Hadoop requires many of the same approaches used in traditional data management systems. These include the 3 A’s of security (authentication, authorization, and auditing), complemented by data protection.
Let us first see what each of these asks:
Authentication: “Who am I? Prove it.”
Authorization: “What can I do?”
Auditing: “What did I do?”
Data Protection: “How can I encrypt the data at rest and over the wire?”
Authentication: The first stage, in which users prove their identity. Credentials such as a user ID and password are verified. Authentication ensures that the user seeking to perform an operation is who they claim to be and can therefore be trusted.
Authorization: The second stage, which defines what individual users can do after they have been authenticated. Authorization controls what a particular user can do to a specific file and determines whether the user may access the data.
Auditing: The process of keeping track of what an authenticated, authorized user did after gaining access to the cluster. It records all of the user’s activity from the moment of login, including what data was accessed, added, or changed, and what analysis was performed.
Data Protection: The use of techniques such as encryption and data masking to prevent unauthorized users and applications from accessing sensitive data.
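To make the order of the stages concrete, here is a minimal sketch in Python. It is a toy in-memory model, not how Hadoop implements any of these stages: the `ToyCluster` class, its password table, and its grant table are all hypothetical stand-ins. It only shows that authentication happens first, authorization second, and that every request ends up in the audit log whatever the outcome.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster; not a Hadoop API."""
    passwords: dict                      # user -> password (toy credential store)
    grants: dict                         # user -> set of (action, path) pairs
    audit_log: list = field(default_factory=list)

    def authenticate(self, user, password):
        # "Who am I? Prove it."
        return self.passwords.get(user) == password

    def authorize(self, user, action, path):
        # "What can I do?"
        return (action, path) in self.grants.get(user, set())

    def request(self, user, password, action, path):
        if not self.authenticate(user, password):
            outcome = "authentication failed"
        elif not self.authorize(user, action, path):
            outcome = "denied"
        else:
            outcome = "allowed"
        # "What did I do?" Every attempt is recorded, successful or not.
        self.audit_log.append((user, action, path, outcome))
        return outcome
```

Data protection is deliberately left out of this sketch, since it concerns how the data itself is stored and transmitted rather than how a request is checked.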
Introduction to Hadoop Security
Hadoop’s security was designed and implemented around 2009 and has been stabilizing since then. In 2010, security features were added to Hadoop with the following two fundamental goals:
- Preventing unauthorized access to the files stored in HDFS.
- Achieving this without incurring a high cost.
Hadoop Security thus refers to the processes that provide authentication, authorization, and auditing and that secure the Hadoop data storage layer against cyber threats.
Let us now see how Hadoop achieves its security.
How Does Hadoop Achieve Security?
1. Kerberos
Kerberos is an authentication protocol that is now the standard way to implement authentication in a Hadoop cluster.
By default, Hadoop runs in a non-secure mode that performs no strong authentication: it trusts the user identity reported by the client, which can have severe consequences in corporate data centers. To overcome this limitation, Kerberos, which provides a secure way to authenticate users, was introduced in the Hadoop ecosystem.
Kerberos is a network authentication protocol developed at MIT that uses “tickets” to allow nodes to identify themselves.
Hadoop uses the Kerberos protocol to ensure that whoever makes a request is who they claim to be.
In secure mode, all Hadoop nodes use Kerberos for mutual authentication. This means that when two nodes talk to each other, each one verifies that the other is who it says it is.
Kerberos uses secret-key cryptography to provide authentication for client-server applications.
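Switching a cluster from the default mode to secure mode starts with two properties in core-site.xml: `hadoop.security.authentication` (either `simple` or `kerberos`) and `hadoop.security.authorization`. The Python sketch below only renders that fragment; the function name `core_site_security_xml` is made up for illustration, and a real secure cluster additionally needs principals and keytabs for every daemon, which are not shown here.

```python
import xml.etree.ElementTree as ET


def core_site_security_xml(kerberos=True):
    """Render the two core-site.xml properties that select the auth mode."""
    properties = {
        # "simple" trusts the client-reported user; "kerberos" enables secure mode.
        "hadoop.security.authentication": "kerberos" if kerberos else "simple",
        # Turns on service-level authorization checks.
        "hadoop.security.authorization": "true" if kerberos else "false",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(core_site_security_xml())
```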
Kerberos in Hadoop
A client takes three steps when using Hadoop with Kerberos:
- Authentication: In Kerberos, the client first authenticates itself to the authentication server. The authentication server provides the timestamped Ticket-Granting Ticket (TGT) to the client.
- Authorization: The client then uses TGT to request a service ticket from the Ticket-Granting Server.
- Service Request: On receiving the service ticket, the client directly interacts with the Hadoop cluster daemons such as NameNode and ResourceManager.
The Authentication Server and the Ticket-Granting Server together form the Key Distribution Center (KDC) of Kerberos.
The client performs the authorization and service request steps on the user’s behalf.
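The three steps can be imitated with a small toy model. This is not the Kerberos protocol: real Kerberos never sends the password to the KDC and encrypts its tickets, whereas this sketch simply tags each ticket with an HMAC. All names (`ToyKDC`, `sign`, `is_valid`, `service_accepts`) and the secrets are hypothetical. It only illustrates the chain of trust: the KDC issues a time-limited TGT, the TGT is exchanged for a service ticket, and the service checks that ticket without ever seeing the password.

```python
import hashlib
import hmac
import json

TGT_LIFETIME = 10 * 60 * 60  # seconds; a 10-hour ticket lifetime


def sign(secret, payload):
    """Return a toy ticket: the payload plus an HMAC tag over it."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {"payload": payload,
            "tag": hmac.new(secret, body, hashlib.sha256).hexdigest()}


def is_valid(secret, ticket, now):
    """A ticket is valid if its tag matches and it has not expired."""
    expected = sign(secret, ticket["payload"])["tag"]
    return (hmac.compare_digest(expected, ticket["tag"])
            and now < ticket["payload"]["expires"])


class ToyKDC:
    """Authentication Server and Ticket-Granting Server in one toy object."""

    def __init__(self, passwords, service_secrets):
        self.passwords = passwords              # user -> password
        self.service_secrets = service_secrets  # service -> shared secret
        self.tgs_secret = b"toy-tgs-secret"     # known only to the KDC

    def authenticate(self, user, password, now):
        # Step 1: the Authentication Server issues a timestamped TGT.
        if self.passwords.get(user) != password:
            raise PermissionError("authentication failed")
        return sign(self.tgs_secret, {"user": user, "expires": now + TGT_LIFETIME})

    def grant_service_ticket(self, tgt, service, now):
        # Step 2: the Ticket-Granting Server trades a valid TGT for a service ticket.
        if not is_valid(self.tgs_secret, tgt, now):
            raise PermissionError("invalid or expired TGT")
        payload = {"user": tgt["payload"]["user"], "service": service,
                   "expires": tgt["payload"]["expires"]}
        return sign(self.service_secrets[service], payload)


def service_accepts(service, secret, ticket, now):
    # Step 3: the service (for example the NameNode) checks the ticket itself.
    return is_valid(secret, ticket, now) and ticket["payload"]["service"] == service
```

A short usage example, with times given as plain seconds for readability:

```python
service_keys = {"namenode": b"toy-namenode-secret"}
kdc = ToyKDC({"alice": "pw"}, service_keys)
tgt = kdc.authenticate("alice", "pw", now=0)
ticket = kdc.grant_service_ticket(tgt, "namenode", now=60)
print(service_accepts("namenode", service_keys["namenode"], ticket, now=120))
```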
The user carries out the authentication step explicitly with the kinit command, which prompts for a password.
We do not need to enter a password every time we run a job, because the Ticket-Granting Ticket lasts for 10 hours by default and is renewable for up to a week.
To avoid being prompted for a password at all, we can create a Kerberos keytab file using the ktutil command.
A keytab file stores passwords and can be supplied to kinit with the -t option.
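For scripted jobs, the kinit call is often wrapped in a small helper. The sketch below only builds the argument list and does not run anything; the function name `kinit_command`, the principal, and the keytab path are placeholders. It assumes MIT Kerberos, where `-k` tells kinit to read the key from a keytab and `-t` names the keytab file.

```python
def kinit_command(principal, keytab=None):
    """Build the kinit argument list for interactive or keytab-based login."""
    if keytab is None:
        # Interactive login: kinit prompts for the password.
        return ["kinit", principal]
    # Non-interactive login: -k uses a keytab, -t points at the keytab file.
    return ["kinit", "-k", "-t", keytab, principal]


if __name__ == "__main__":
    # Hypothetical principal and keytab path, shown for illustration only.
    print(" ".join(kinit_command("alice@EXAMPLE.COM", "/etc/security/alice.keytab")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Kerberos is installed and configured.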
2. Transparent Encryption in HDFS
For data protection, HDFS implements transparent encryption. Once it is configured, data read from and written to special HDFS directories is encrypted and decrypted transparently, without requiring any changes to user application code.
This is end-to-end encryption, which means that only the client encrypts or decrypts the data. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys. This satisfies both at-rest encryption and in-transit encryption.
At-rest encryption refers to the encryption of data while it resides on persistent media such as a disk.
In-transit encryption refers to the encryption of data while it travels over the network.
HDFS encryption lets existing Hadoop applications run transparently on encrypted data.
HDFS-level encryption also protects against filesystem-level and OS-level attacks.
Encryption Zone (EZ): A special directory whose contents are transparently encrypted on write and transparently decrypted on read.
Encryption Zone Key (EZK): Every encryption zone has an EZK, which is specified when the zone is created.
Data Encryption Key (DEK): Every file in an EZ has its own unique DEK, which HDFS never handles directly. The DEK is used to encrypt and decrypt the file data.
Encrypted Data Encryption Key (EDEK): The DEK encrypted with the EZK. HDFS handles only the EDEK. The client asks the KMS to decrypt the EDEK and then uses the resulting DEK to read and write data.
Key Management Server (KMS): The KMS is responsible for providing access to the stored EZKs, generating new EDEKs for storage on the NameNode, and decrypting EDEKs for use by HDFS clients.
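In practice, an encryption zone is set up from the command line: a key is created in the KMS with `hadoop key create`, and an empty directory is turned into a zone with `hdfs crypto -createZone`. The Python sketch below only assembles those commands and does not execute them; the helper name `encryption_zone_commands`, the key name, and the path are placeholders, and running the commands requires the appropriate key-administration and HDFS superuser privileges.

```python
def encryption_zone_commands(key_name, zone_path):
    """Return the CLI commands that create an EZK and an encryption zone."""
    return [
        # Create the encryption zone key (EZK) in the KMS.
        ["hadoop", "key", "create", key_name],
        # The zone directory must exist and be empty before it becomes a zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Bind the directory to the key, turning it into an encryption zone.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


if __name__ == "__main__":
    for command in encryption_zone_commands("mykey", "/zone"):
        print(" ".join(command))
```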
The transparent encryption in HDFS works in the following manner:
- When a new file is created in an EZ, the NameNode asks the Key Management Server (KMS) to generate a new EDEK encrypted with the zone’s EZK.
- This EDEK is stored on the NameNode as part of the file’s metadata.
- When a file within the encryption zone is read, the NameNode gives the client the file’s EDEK along with the version of the EZK used to encrypt it.
- The client then asks the KMS to decrypt the EDEK. The KMS first checks whether the client has permission to access that encryption zone key version. If it does, the KMS returns the DEK, and the client uses it to decrypt the file’s contents.
All these steps take place automatically through interactions among the HDFS client, the NameNode, and the KMS.
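The interaction above is a form of envelope encryption, and it can be sketched with a toy model in Python. Everything here is an assumption for illustration: `ToyKMS`, `ToyNameNode`, and the client functions are invented names, key versions are omitted, and `toy_cipher` is an insecure stand-in built from SHA-256 so that the example needs only the standard library (it is not the cipher HDFS uses). The point is who sees what: the NameNode stores only the EDEK, the KMS holds the EZK, and only the client ever holds the DEK and the plaintext.

```python
import hashlib
import os


def toy_cipher(key, data):
    """XOR data with a SHA-256-derived keystream. Applying it twice restores
    the input. Insecure stand-in for a real cipher, for illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    def __init__(self):
        self._ezks = {}       # key name -> EZK; never leaves the KMS
        self.allowed = set()  # (user, key name) pairs permitted to decrypt

    def create_key(self, name):
        self._ezks[name] = os.urandom(32)

    def generate_edek(self, key_name):
        dek = os.urandom(32)                        # fresh DEK for one file
        return toy_cipher(self._ezks[key_name], dek)  # only the EDEK is returned

    def decrypt_edek(self, user, key_name, edek):
        if (user, key_name) not in self.allowed:
            raise PermissionError(f"{user} may not use key {key_name}")
        return toy_cipher(self._ezks[key_name], edek)


class ToyNameNode:
    def __init__(self, kms):
        self.kms = kms
        self.metadata = {}  # path -> (key name, EDEK); no DEK, no plaintext

    def create_file(self, path, zone_key):
        self.metadata[path] = (zone_key, self.kms.generate_edek(zone_key))
        return self.metadata[path]

    def open_file(self, path):
        return self.metadata[path]


def client_write(namenode, kms, user, path, zone_key, plaintext):
    key_name, edek = namenode.create_file(path, zone_key)
    dek = kms.decrypt_edek(user, key_name, edek)
    return toy_cipher(dek, plaintext)  # ciphertext that would go to storage


def client_read(namenode, kms, user, path, ciphertext):
    key_name, edek = namenode.open_file(path)
    dek = kms.decrypt_edek(user, key_name, edek)
    return toy_cipher(dek, ciphertext)
```

Because each file gets its own DEK while the EZK stays inside the KMS, access to one file’s key does not expose any other file in the zone.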
3. HDFS file and directory permission
To authorize a user, HDFS checks file and directory permissions after the user has been authenticated.
The HDFS permission model is very similar to the POSIX model. Every file and directory in HDFS has an owner and a group.
Each file or directory has separate permissions for the owner, for members of the group, and for all other users.
For files, r is the permission to read the file, and w is the permission to write or append to it.
For directories, r is the permission to list the contents of the directory, w is the permission to create or delete files or directories, and x is the permission to access a child of the directory.
To prevent anyone except the file owner, the directory owner, and the superuser from deleting or moving files within a directory, we can set the sticky bit on the directory.
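The effect of the sticky bit on deletion can be written as a small rule. The sketch below is a simplified model, not HDFS code: the function `can_delete` and its parameters are hypothetical, and the superuser name `hdfs` is an assumption (in a real cluster the superuser is whoever started the NameNode).

```python
def can_delete(user, file_owner, dir_owner, dir_writable_by_user, sticky,
               superuser="hdfs"):
    """Decide whether `user` may delete or move a file inside a directory."""
    if user == superuser:
        return True                # the superuser bypasses the checks
    if not dir_writable_by_user:
        return False               # w on the directory is always required
    if sticky:
        # With the sticky bit, write permission alone is not enough.
        return user in (file_owner, dir_owner)
    return True


if __name__ == "__main__":
    # In a world-writable directory with the sticky bit set,
    # bob cannot delete alice's file, but alice can.
    print(can_delete("bob", "alice", "hdfs", True, sticky=True))
    print(can_delete("alice", "alice", "hdfs", True, sticky=True))
```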
When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory.
Every client process that accesses HDFS has a two-part identity: a user name and a list of groups.
HDFS performs the permission check for a file or directory accessed by the client as follows:
- If the user name of the client process matches the owner of the file or directory, HDFS tests the owner permissions;
- Otherwise, if the group of the file or directory matches any member of the client process’s group list, HDFS tests the group permissions;
- Otherwise, HDFS tests the other permissions of the file or directory.
If the permission check fails, the client operation fails.
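The three-way check described above can be sketched as follows. This is a simplified model with invented names (`Inode`, `check_access`); it leaves out the superuser bypass and ACLs. Note that exactly one permission class is consulted: if the user is the owner, only the owner bits count, even when the other bits would be more generous.

```python
from dataclasses import dataclass

PERMISSION_BITS = {"r": 4, "w": 2, "x": 1}


@dataclass
class Inode:
    """Toy stand-in for a file or directory entry."""
    owner: str
    group: str
    mode: int  # POSIX-style mode, for example 0o640


def check_access(inode, user, groups, wanted):
    """Return True if `user` with group list `groups` has every permission
    in `wanted` (a string made of 'r', 'w', 'x')."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7   # owner class
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 0o7   # group class
    else:
        bits = inode.mode & 0o7          # other class
    needed = 0
    for letter in wanted:
        needed |= PERMISSION_BITS[letter]
    return (bits & needed) == needed


if __name__ == "__main__":
    report = Inode(owner="alice", group="analysts", mode=0o640)
    print(check_access(report, "alice", ["staff"], "rw"))
    print(check_access(report, "bob", ["analysts"], "w"))
```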
Tools for Hadoop Security
The Hadoop ecosystem contains some tools for supporting Hadoop Security. The two major Apache open-source projects that support Hadoop Security are Knox and Ranger.
Knox is a REST API based perimeter security gateway that performs authentication and supports monitoring, auditing, authorization management, and policy enforcement on Hadoop clusters. It typically authenticates user credentials against LDAP or Active Directory, and it allows only successfully authenticated users to access the Hadoop cluster.
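From a client’s point of view, going through Knox means sending REST calls to the gateway instead of to the cluster services directly. The sketch below builds, but does not send, a WebHDFS directory listing request routed through a gateway. Treat the details as assumptions: the helper name is made up, the host, topology name, and credentials are placeholders, and the URL layout (port 8443, `gateway` path, then the topology name) follows Knox’s commonly documented defaults and may differ in a given deployment.

```python
import base64
import urllib.parse
import urllib.request


def knox_webhdfs_request(host, topology, hdfs_path, user, password,
                         port=8443, gateway_path="gateway"):
    """Build a WebHDFS LISTSTATUS request addressed to a Knox gateway."""
    query = urllib.parse.urlencode({"op": "LISTSTATUS"})
    url = (f"https://{host}:{port}/{gateway_path}/{topology}"
           f"/webhdfs/v1{urllib.parse.quote(hdfs_path)}?{query}")
    # HTTP Basic credentials, which the gateway would verify,
    # for example against LDAP.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    req = knox_webhdfs_request("knox.example.com", "sandbox", "/tmp",
                               "alice", "secret")
    print(req.full_url)
```

The returned object can be passed to `urllib.request.urlopen` against a reachable gateway with a trusted certificate.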
Ranger is an authorization system that grants or denies access to Hadoop cluster resources such as HDFS files and Hive tables, based on predefined policies. Requests are assumed to be already authenticated by the time they reach Ranger. It provides different authorization functionality for different Hadoop components such as YARN, Hive, and HBase.
In this article, we studied Hadoop security. We saw how Hadoop uses Kerberos to authenticate users accessing HDFS files and directories. We also studied transparent encryption in HDFS, which protects files and directories in HDFS. The article described how HDFS checks whether a client has permission to access a file or directory. Finally, it highlighted two major Apache projects, Knox and Ranger, for monitoring and supporting Hadoop Security.