Spark Hadoop Cloudera Certifications You Must Know

DataFlair Team

7 years ago

Objective

This is a comprehensive guide about various Spark Hadoop Cloudera certifications. In this Cloudera certification tutorial we will discuss all the aspects like different certifications offered by Cloudera, the pattern of Cloudera certification exam / test, number of questions passing score, time limits, required skills and weightage of each and every topic. We will discuss about all the certifications offered by Cloudera like: “CCA Spark and Hadoop Developer Exam (CCA175)”, “Cloudera Certified Administrator for Apache Hadoop (CCAH)”, “CCP Data Scientist”, “CCP Data Engineer”.

Spark Hadoop Cloudera Certifications You Must Know

1. CCA Spark and Hadoop Developer Exam (CCA175)

In CCA Spark and Hadoop Developer certification, you need to write code in Scala and Python and run it on the cluster to prove your skills. This exam can be taken from any computer at any time globally.
CCA175 is a hands-on, practical exam using Cloudera technologies. The users are given their own CDH5 (currently 5.3.2) cluster that is pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many other software that are needed by the users.

a. CCA Spark and Hadoop Developer Certification Exam (CCA175) Details:

Number of Questions: 10–12 performance-based (hands-on) tasks on CDH5 cluster
Time Limit: 120 minutes
Passing Score: 70%
Language: English, Japanese (forthcoming)
CCA Spark and Hadoop Developer certification Cost: USD $295

b. CCA175 Exam Question Format

In each CCA question, you would be required to solve a particular scenario. In some cases, a tool such as Impala or Hive may be used. In other cases, coding is required. In Spark problem, a template (in Scala or Python) is often provided that contains a skeleton of the solution, asking the candidate to fill in the missing lines with functional code.

c. Prerequisites

There are no prerequisites required to take any Cloudera certification exam.

d. Exam selection and related topics

I. Required Skills

Data Ingest: These are the skills required to transfer data between external systems and your cluster. It includes:

Using Sqoop to import data from a MySQL database into HDFS and Change the delimiter and file format of data
Using Sqoop to Export data to a MySQL database
Ingest real-time and near-real-time (NRT) streaming data into HDFS using Flume
Using Hadoop File System (FS) commands to load data into and out of HDFS

II. Transform, Stage, Store:

It converts a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS. This includes writing Spark applications in Scala / Python for the below tasks:

Load data from HDFS and store results back to HDFS
Join disparate datasets together
Calculate aggregate statistics (e.g., average or sum)
Filter data into a smaller dataset
Write a query that produces ranked or sorted data

III. Data Analysis

Data Definition Language (DDL) to create tables in the Hive metastore, use by Hive and Impala.

Read and/or create a table in the Hive metastore in a given schema
Avro schema extraction from a set of data-files
Hive metastore table creation using the Avro file format and an external schema file
Improve query performance by creating partitioned tables in the Hive meta-store
Evolve an Avro schema by changing JSON files

2. Cloudera Certified Administrator for Apache Hadoop (CCAH)

Cloudera Certified Administrator for Apache Hadoop (CCAH) certification shows your technical knowledge, skills, and ability to configure, deploy, monitor, manage, maintain, and secure an Apache Hadoop cluster.

a. Cloudera Certified Administrator for Apache Hadoop (CCA-500) details

Number of Questions: 60 questions
Time Limit: 90 minutes
Passing Score: 70%
Language: English, Japanese
Cloudera Certified Administrator for Apache Hadoop (CCAH) certification Price: USD $295

b. Exam sections and related topics

I. HDFS (17%)

HDFS Features & design principle and function of HDFS daemons
Describe the operations of the Apache Hadoop cluster in data storage and in data processing
Features of current computing systems that motivated system like Apache Hadoop and commands to handle files in the HDFS
Given a scenario, identify appropriate use cases for HDFS Federation
Identify components and daemon of an HDFS HA-Quorum cluster
HDFS security (Kerberos) and file read-write paths
Determine the best data serialization choice for a given scenario
Internals of HDFS read operations and HDFS write operations

II. YARN (17%)

Understand how to deploy core ecosystem components along with Spark, Impala, and Hive
Understand Yarn, MapReduce v2 (MRv2 / YARN) deployments
Understand basic design strategy for YARN and how resource allocations is handled by it
Understand Resource Manager and Node Manager
Identify the workflow of job running on YARN
Determine which files you must change and how in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN

III. Hadoop Cluster Planning (16%)

Principal points to consider while choosing the hardware and operating systems to host an Apache Hadoop cluster
Understand kernel tuning and disk swapping
Identify a hardware configuration and ecosystem components your cluster needs for the given scenario
Cluster sizing: identify the specifics for the workload, including CPU, memory, storage, disk I/O for a given case
Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

IV. Hadoop Cluster Installation and Administration (25%)

Understand how to install and configure Hadoop cluster
Identify how the cluster will handle disk and machine failures for given case
Analyze a logging configuration and logging configuration file format
Understand the basics of Hadoop metrics and cluster health monitoring
Install ecosystem components in CDH 5 like Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig etc.
Identify the function and purpose of available tools for managing the Apache Hadoop file system

V. Resource Management (10%)

Understand the overall design goals of each of Hadoop schedulers and resource manager
Given a scenario, determine how the Fair/FIFO/Capacity Scheduler allocates cluster resources under YARN

VI. Monitoring and Logging (15%)

Understand the functions and features of Hadoop’s metric collection abilities
Analyze the NameNode and Yarn Web UIs
Understand how to monitor cluster daemons
Identify and monitor CPU usage on master nodes
Describe how to monitor swap and memory allocation on all nodes
Interpret a log file and Identify how to manage Hadoop’s log files

3. CCP Data Scientist

“Cloudera Certified Professional Data Scientist” is able to perform descriptive and inferential statistics, apply advanced analytical techniques and build machine learning models using standard tools. Candidates need to prove their abilities on a live cluster with large datasets in a variety of formats. It needs clearing 3 CCP Data Scientist exams (DS700, DS701, and DS702) in any order. All three exams must be passed within 365 days of each other.

a. Common Skills (all exams)

Extract relevant features from a large dataset containing bad records, partial records, errors, or other forms of “noise”
Extract features from a data in multiple formats like JSON, XML, raw text logs, industry-specific encodings, and graph link data

b. Descriptive and Inferential Statistics on Big Data (DS700)

Determining confidence for a hypothesis using statistical tests
Calculate common summary statistics, such as mean, variance, and counts
Fit a distribution to a dataset and use it to predict event likelihoods
Perform complex statistical calculations on a large dataset

c. Advanced Analytical Techniques on Big Data (DS701)

Build a model that contains relevant features from a large dataset
Define relevant data groupings and assign data records from a large dataset into a defined set of data groupings
Evaluate goodness of fit for a given set of data groupings and a dataset
Apply advanced analytical techniques, such as network graph analysis or outlier detection

d. Machine Learning at Scale (DS702)

Build a model with relevant features from a large dataset and select a classification algorithm for it
Predict labels for an unlabeled dataset using a labeled dataset for reference
Tune algorithm meta parameters to maximize algorithm performance
Determine the success of a given algorithm for the given dataset using validation techniques

e. What technologies/languages do you need to know?

You’ll be provided with a cluster with Hadoop technologies on a cluster, plus standard tools like Python and R. Among these standard technologies, it’s your choice what to use to solve the problem.

4. CCP Data Engineer

“Cloudera Certified Data Engineer” is able to perform core competencies required to ingest, transform, store, and analyze data in Cloudera’s CDH environment.

a. What do you need to know?

I. Data Ingestion

These are the skills to transfer data between external systems and your cluster. It includes:

Import and export data between an external RDBMS and your cluster, including specific subsets, changing the delimiter and file format of imported data during ingest, and altering the data access pattern or privileges.
Ingest real-time and near-real time (NRT) streaming data into HDFS, including distribution to multiple data sources and converting data on ingest from one format to another.
Load data into and out of HDFS using the Hadoop File System HDFS commands.

II. Transform, Stage, Store

It means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/HCatalog. It includes:

Convert data from one file format to another and write it with compression
Convert data from one set of values to another (e.g., Lat/Long to Postal Address using an external library)
Purge bad records from a data set, e.g., null values
De-duplication and merge data
De-normalize data from multiple disparate data sets
Evolve an Avro or Parquet schema
Partition an existing data set according to one or more partition keys
Tune data for optimal query performance

III. Data Analysis

It includes operations like Filter, sort, join, aggregate, and/or transform one or more data sets in a given format stored in HDFS to produce a specified result. The queries will include complex data types (e.g., array, map, struct), the implementation of external libraries, partitioned data, compressed data, and requires the use of metadata from Hive/HCatalog.

Write a query to aggregate multiple rows of data and to filter data
Write a query that produces ranked or sorted data
Write a query that joins multiple data sets
Read and/or create a Hive or an HCatalog table from existing data in HDFS

IV. Workflow

It includes the ability to create and execute various jobs and actions that move data towards greater value and use in a system. It includes:

Create and execute a linear workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.
Create and execute a branching workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom action, etc.
Orchestrate a workflow to execute regularly at predefined times, including workflows that have data dependencies

b. What should you expect?

You are given five to eight customer problems each with a unique, large data set, a CDH cluster, and four hours. For each problem, you must implement a technical solution that meets all the requirements using any tool or combination of tools on the cluster (see list below) — you get to pick the tool(s) that are right for the job.