Big Data & Hadoop Development

Course Duration : 70 hrs + Case Studies

About Big Data and Hadoop

Big data
Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. Big data has the potential to help companies improve operations and make faster, more intelligent decisions.
Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.

Course Overview

This advanced level course focuses on the following key areas:

  • Understand Big Data and Hadoop
  • Master the concepts of HDFS and the MapReduce framework
  • Understand Hadoop 2.x Architecture
  • Set up a Hadoop cluster and write complex MapReduce programs
  • Learn data loading techniques using Sqoop and Flume
  • Perform data analytics using Pig, Hive and YARN
  • Implement HBase and MapReduce integration
  • Implement best practices for Hadoop development

What we offer

Training under the guidance of a Data Scientist with 20+ years of experience, a postgraduate degree from IIT, a PhD from Boston University, and 40+ research papers on Data Science.
After training, an internship with our development partners (Ideal Analytics / ArcVision) on real-time/live project work.
Case studies on real industry data
Classroom training with flexible timing
Customized/On-demand training
Unlimited access to exclusive Study Materials on Cloud

Chapter-1: Introduction to Big Data and Hadoop

1.1 Big Data Introduction
1.2 Hadoop Introduction
1.3 What is Hadoop? Why Hadoop?
1.4 Hadoop History
1.5 Different types of Components in Hadoop
HDFS, MapReduce, PIG, Hive, SQOOP, HBASE, OOZIE, Flume, Zookeeper and so on…
1.6 What is the scope of Hadoop?


Chapter-2: Deep Dive into HDFS (for Storing the Data)

2.1 Introduction of HDFS
2.2 HDFS Design
2.3 HDFS role in Hadoop
2.4 Features of HDFS
2.5 Daemons of Hadoop and their functionality
– Name Node
– Secondary Name Node
– Job Tracker
– Data Node
– Task Tracker
2.6 Anatomy of File Write
2.7 Anatomy of File Read
2.8 Network Topology
– Nodes
– Racks
– Data Center
2.9 Parallel Copying using DistCp
2.10 Basic Configuration for HDFS
2.11 Data Organization
– Blocks
– Replication
2.12 Rack Awareness
2.13 Heartbeat Signal
2.14 How to Store the Data into HDFS
2.15 How to Read the Data from HDFS
2.16 Accessing HDFS (Introduction of Basic UNIX commands)
2.17 CLI commands


Chapter-3: MapReduce using Java (Processing the Data)

3.1 Introduction to MapReduce
3.2 MapReduce Architecture
3.3 Dataflow in MapReduce
– Splits
– Mapper
– Partitioning
– Sort and Shuffle
– Combiner
– Reducer
3.4 Understand Difference Between Block and InputSplit
3.5 Role of RecordReader
3.6 Basic Configuration of MapReduce
3.7 MapReduce life cycle
– Driver Code
– Mapper
– Reducer
3.8 How MapReduce Works
3.9 Writing and Executing the Basic MapReduce Program using Java
3.10 Submission & Initialization of MapReduce Job.
3.11 File Input/output
3.12 Formats in MapReduce Jobs
– Text Input Format
– Key Value Input Format
– Sequence File Input Format
– NLine Input Format
Joins
– Map-side Joins
– Reduce-side Joins
3.13 Word Count Example
3.14 Partition MapReduce Program
3.15 Side Data Distribution
– Distributed Cache (with Program)
3.16 Counters (with Program)
– Types of Counters
– Task Counters
– Job Counters
– User Defined Counters
– Propagation of Counters
3.17 Job Scheduling


Chapter-4: PIG

4.1 Introduction to Apache PIG
4.2 Introduction to PIG Data Flow Engine
4.3 MapReduce vs PIG in detail
4.4 When should PIG be used?
4.5 Data Types in PIG
4.6 Basic PIG programming
4.7 Modes of Execution in PIG
– Local Mode
– MapReduce Mode
4.8 Execution Mechanisms
– Grunt Shell
– Script
– Embedded
4.9 Operators/Transformations in PIG
4.10 PIG UDFs with Program
4.11 Word Count Example in PIG
4.12 Differences between MapReduce and PIG


Chapter-5: SQOOP

5.1 Introduction to SQOOP
5.2 Use of SQOOP
5.3 Connecting to a MySQL database
5.4 SQOOP commands
– Import
– Export
– Eval
– Codegen, etc.
5.5 Joins in SQOOP
5.6 Export to MySQL
5.7 Export to HBase


Chapter-6: HIVE

6.1 Introduction to HIVE
6.2 HIVE Meta Store
6.3 HIVE Architecture
6.4 Tables in HIVE
6.5 Managed Tables
– External Tables
6.6 Hive Data Types
– Primitive Types
– Complex Types
6.7 Partition
6.8 Joins in HIVE
6.9 HIVE UDFs and UDAFs with Programs
6.10 Word Count Example


Chapter-7: HBASE

7.1 Introduction to HBASE
7.2 Basic Configurations of HBASE
7.3 Fundamentals of HBase
7.4 What is NoSQL?
7.5 HBase DataModel
– Table and Row
– Column Family and Column Qualifier
– Cell and its Versioning
7.6 Categories of NoSQL Databases
– Key-Value Database
– Document Database
7.7 Column Family Database
7.8 HBASE Architecture
– HMaster
– Region Servers
– Regions
– MemStore
– Store
SQL vs NoSQL
7.9 How HBASE differs from RDBMS
7.10 HDFS vs HBase; client-side buffering or bulk uploads
7.11 HBase Designing Tables
7.12 HBase Operations
– Get
– Scan
– Put
– Delete


Chapter-8: MongoDB

8.1 What is MongoDB?
8.2 Where to Use?
8.3 Configuration On Windows
8.4 Inserting data into MongoDB
8.5 Reading data from MongoDB


Chapter-9: Cluster Setup

9.1 Downloading and installing Ubuntu 12.x
9.2 Installing Java
9.3 Installing Hadoop
9.4 Creating Cluster
9.5 Increasing and Decreasing the Cluster Size
9.6 Monitoring the Cluster Health
9.7 Starting and Stopping the Nodes


Chapter-10: Zookeeper

10.1 Introduction to Zookeeper
10.2 Data Model
10.3 Operations


Chapter-11: OOZIE

11.1 Introduction to OOZIE
11.2 Use of OOZIE
11.3 Where to use?


Chapter-12: Flume

12.1 Introduction to Flume
12.2 Uses of Flume
12.3 Flume Architecture
– Flume Master
– Flume Collectors
– Flume Agents


Chapter-13: Impala

13.1 Overview
13.2 Data Load
13.3 Architecture
13.4 Hands-on
13.5 Hive vs Impala


Chapter-14: Apache Spark

14.1 Spark Architecture
14.2 Integration with Hadoop
– Text File
14.3 Introduction to Spark SQL
– CSV data
14.4 Spark Streaming Architecture
– DStreams

Project: 2 Project Explanations with Architecture

We offer multiple case studies based on different industries.

Case Study 1: Crime Data Analysis

Big data is the voluminous and complex collection of data that comes from different sources such as sensors, social media, and websites. Data at this volume is hard to process with traditional processing applications, and various tools and techniques are available in the market for big data analytics. With population, crime counts, and crime rates continually increasing, analyzing crime-related data is a major challenge for governments that need to make strategic decisions to maintain law and order. The best place to look for room for improvement is the raw data generated regularly from various sources. Applying Big Data Analytics (BDA) to it helps uncover trends, so that law and order can be maintained properly and citizens have a sense of safety and well-being.

Step – 1: Collect US crime data for the last 15 years.
Step – 2: Analyze the data to extract the following information.
1. Total number of crimes for the years 2003-2016, against the backdrop of ever-increasing population and crime rates.
2. Total number of crimes occurring in each state.
3. Total number of crimes by type. This includes violent crime, where violence is either the objective (for example, murder) or a means to an end (for example, robbery).
4. Total number of crimes against women.
5. Time series crime analysis.
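
The aggregations listed in Step 2 can be sketched in a few lines of plain Python. This is only an illustration of the grouping logic, not the course's Hadoop implementation. The sample rows and the field names `year`, `state`, and `crime_type` are assumptions, since the outline does not give the data set's actual schema.

```python
from collections import Counter

# Hypothetical sample rows; field names are assumptions, not the real schema.
records = [
    {"year": 2003, "state": "TX", "crime_type": "robbery"},
    {"year": 2003, "state": "CA", "crime_type": "murder"},
    {"year": 2004, "state": "TX", "crime_type": "robbery"},
    {"year": 2004, "state": "TX", "crime_type": "burglary"},
]


def count_by(rows, field):
    """Return a Counter mapping each value of `field` to its number of rows."""
    return Counter(row[field] for row in rows)


crimes_per_year = count_by(records, "year")
crimes_per_state = count_by(records, "state")
crimes_per_type = count_by(records, "crime_type")

# Time series view: yearly totals in chronological order.
time_series = sorted(crimes_per_year.items())

print(time_series)
print(crimes_per_state.most_common())
print(crimes_per_type.most_common())
```

On a cluster, the same group-and-count pattern would typically be expressed as a MapReduce job or a Hive `GROUP BY` query rather than an in-memory counter.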


Case Study 2: Real-Time Social Media Data Analytics

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. In this case study, we’ll learn how we can use Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data.

Step – 1: Collect real-time tweets from Twitter and save them into Hadoop/HDFS.
Step – 2: Analyze the data to extract the following information.
1. Tweets #Analytics
2. Tweet word count
3. Location-wise word count
4. Twitter sentiment analysis
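
The counting steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions. The sample tweets, the `text` and `location` field names, and the tokenization rules are all hypothetical, and sentiment analysis is not covered. In the pipeline described here, the input would come from tweets stored in HDFS and the counts would be computed with Hive.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tweets; field names are assumptions for illustration only.
tweets = [
    {"text": "Learning #Hadoop and #Analytics today", "location": "Kolkata"},
    {"text": "Hadoop makes #Analytics scale", "location": "Delhi"},
    {"text": "Spark and Hadoop in Kolkata", "location": "Kolkata"},
]

WORD = re.compile(r"[a-z0-9']+")
HASHTAG = re.compile(r"#\w+")


def words(text):
    """Split text into lowercase word tokens (the '#' of a hashtag is dropped)."""
    return WORD.findall(text.lower())


# Hashtag counts across all tweets.
hashtag_count = Counter(
    tag for tweet in tweets for tag in HASHTAG.findall(tweet["text"].lower())
)

# Overall word count.
word_count = Counter(w for tweet in tweets for w in words(tweet["text"]))

# Location-wise word count: one Counter per location.
location_word_count = defaultdict(Counter)
for tweet in tweets:
    location_word_count[tweet["location"]].update(words(tweet["text"]))

print(hashtag_count.most_common())
print(word_count.most_common(3))
print(dict(location_word_count))
```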

Abhinandan Chakraborty

Guest Faculty (BigData, Hadoop)

Abhinandan holds a B.Tech in Computer Science and has around 4 years of experience in Big Data live project development and POCs, working with Java, Hadoop, MapReduce, Apache Hive, Impala, Apache Spark, HBase, Apache Flume, Apache Kafka, Apache Cassandra, Apache Storm, and D3. He is currently employed with a well-known Big Data consulting company in Sector-5, Salt Lake. At NIVT, over the last couple of years, he has trained many high-profile MNC professionals on Big Data, Hadoop, and Apache Spark, and has created many references for NIVT in the industry.