Big Data with Hadoop and Spark

Big Data with Hadoop
and Spark

Overview

This foundational course provides an introductory overview of Apache Hadoop – HDFS and MapReduce and Spark. A basic understanding of Big Data from business and technology perspectives is provided, along with an overview of common benefits, challenges, and adoption issues. The course content is divided into a series of modular sections, each of which is accompanied by one or more hands-on exercises.

Course Objective

This course aims to provide participants with a foundational understanding of Big Data technologies through hands-on experience with Apache Hadoop and Apache Spark. Participants will gain insights into the concepts, architecture, and practical applications of HDFS, MapReduce, SparkSQL, Spark Streaming, and MLLib. The course is designed to build both conceptual knowledge and technical skills in managing, processing, and analyzing large-scale datasets using modern Big Data frameworks. By the end of the course, learners will be able to install and configure Hadoop, develop and execute MapReduce jobs, manipulate data using Spark DataFrames and SQL, implement machine learning algorithms using Spark MLLib, and process real-time data streams with Spark Streaming and GraphX.

Who Should Attend

This course is designed for individuals of all level in an organisation.

Prerequisites

Basic familiarity with databases and data management

Training Calendar

Intake

Duration

Program Fees

Inquire further

3 Day

Contact us to find out more

Module

Module 1 - Fundamental Terminology and Concepts (30 min)

• Veracity, variability, visualization, and value (and the 3 V’s)
• HDFS and MapReduce in Hadoop
• Unstructured, semi-structured, and structured data

Module 2 - Brief History of Big Data (30 min)

• Google and MapReduce
• Web 2.0
• Hadoop vendors

Module 3 - Business Drivers for Big Data (30 min)

• KYC (Know Your Customer)
• Sales & Marketing
• Financial forecasting

Module 4 - Characteristics of Big Data (30 min)

Module 5 - Benefits of Adopting Big Data (30 min)

Module 6 - Challenges and Limitations of Big Data (30 min)

Module 7 - HDFS and Distributed Storage (90 min)

• Installing Hadoop and types of installations
• HDFS and Data Ingestion using Sqoop or Flume

Module 8 - Hands-on Exercises (90 min)

• Installing Hadoop
• Working with Cloudera CDH
• Basic Data Ingestion with Sqoop and Flume

Module 9 - Introduction to MapReduce (90 min)

• Map and Reduce
• Partitioning mappers and reducers

Module 10 - MapReduce Architecture (90 min)

• Working with MapReduce
• Distributed Cacheing
• Input and output formatters

Module 11 - MapReduce Examples (90 min)

• Sample programs

Module 12 - Hands-on Exercises (90 min)

• Writing MapReduce Jobs
• Writing Partitioners
• Writing Input and Output Formatters

Module 13 - SparkSQL, DataFrames, and Datasets (90 min)

• SparkSQL
• Executing SQL commands on a dataframe
• Using Dataframes instead of RDD’s

Module 14 - Spark MLLib (90 min)

• Using MLLib to produce movie recommendations
• Analyzing ALS Recommendation results
• Using Dataframes with MLLib

Module 15 - Spark Streaming and GraphX (90 min)

• Streaming data and NRT processing
• VertexRDD and EdgeRDD

Module 16 - Hands-on Exercises (90 min)

• Sample streaming
• Sample GraphX script

FAQs

General Questions:

Q: What is the Big Data with Hadoop and Spark course about?
This course provides a comprehensive introduction to Big Data concepts and technologies, focusing on Apache Hadoop and Apache Spark. It combines theory with hands-on practice to help participants understand data storage, processing, and analytics using tools like HDFS, MapReduce, SparkSQL, MLLib, and Spark Streaming.

Q: Who should attend this course?
This course is suitable for individuals at all levels within an organization who are looking to build foundational knowledge and hands-on skills in Big Data technologies.

Q: What are the prerequisites for this course?
Participants should have a basic familiarity with databases and data management concepts.

Q: How long is the course?
The course lasts for 3 days.

Q: What key topics are covered in this course?
Understanding Big Data terminology and concepts
Overview of Hadoop and Spark ecosystems
Installing and working with HDFS, MapReduce, and Spark
Hands-on data ingestion using Sqoop and Flume
Writing MapReduce jobs and working with input/output formatters
Using SparkSQL, DataFrames, and Datasets
Implementing machine learning with Spark MLLib
Processing real-time data with Spark Streaming and GraphX

Q: Will I receive a certification after completing the course?
No formal certification is provided, but participants will gain practical skills and knowledge in Big Data technologies applicable to real-world data analytics and processing tasks.

Program Content & Skills:

Q: What foundational Big Data and processing concepts will I learn in this course?
You’ll learn the core principles of Big Data including the 4 V’s (volume, velocity, variety, veracity), understand structured and unstructured data, and explore how Hadoop and Spark frameworks handle large-scale data processing using HDFS, MapReduce, and Spark components.

Q: How does the course prepare me to align Big Data technologies with business goals?
The course explains how Big Data supports business initiatives like customer insights, marketing strategies, and financial forecasting. You’ll explore real-world drivers for Big Data adoption and how to apply technical solutions to meet business needs.

Q: What skills will I develop in managing and processing data?
You’ll gain practical skills in setting up Hadoop, ingesting data using Sqoop and Flume, writing and executing MapReduce jobs, working with Cloudera CDH, and leveraging Spark for data manipulation, machine learning, and stream processing.

Q: Will I learn how to work with both batch and real-time data processing?
Yes. The course covers traditional batch processing using MapReduce as well as real-time processing using Spark Streaming and GraphX, allowing you to handle various data processing scenarios effectively.

Q: How does the course address real-world data analytics and implementation needs?
You’ll work through hands-on exercises that simulate real use cases, such as writing MapReduce programs, running SparkSQL queries, building recommendation systems with MLLib, and processing streaming data — all using tools commonly used in enterprise environments.

Submit your interest today !

Corporate

Community

SOLUTIONS

SOLUTIONS