Big Data Analytics with R & Hadoop

Course Duration: 70 hrs + Case Study

About Big Data Analytics with R and Hadoop

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.
Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop using tools such as RHIPE and RHadoop. With these, a powerful data analytics engine can be built that runs analytics algorithms over large datasets in a scalable manner, combining the data analytics operations of R with the MapReduce and HDFS components of Hadoop.

Course Overview

This advanced-level analytics course focuses on the following key areas:

  • Technologies for Handling Big Data
  • Understanding the Hadoop Ecosystem
  • Reading Datasets into R, Exporting Data from R
  • Manipulating and Processing Data in R

What we offer

Training under the guidance of a Data Scientist with 20+ years of experience, a postgraduate degree from IIT, a PhD from Boston University, and 40+ research papers on Data Science.
After training, an internship on real-time/live project work with our Development Partners (Ideal Analytics/ArcVision).
Case studies on real industry data
Classroom training with flexible timing
Customized/On-demand training
Unlimited access to exclusive Study Materials on Cloud

Chapter-1: Introduction to Big Data

1.1 Big Data
1.2 History of Data Management – Evolution of Big Data
1.3 Structuring of Big Data
1.4 Elements of Big Data
1.5 Application of Big Data in Business context
1.6 Careers in Big Data


Chapter-2: Business Application of Big Data

2.1 The Significance of Social Network Data
2.2 Financial Fraud and Big Data
2.3 Fraud detection in Insurance
2.4 Use of Big Data in the Retail Industry


Chapter-3: Technologies for Handling Big Data

3.1 Distributed and parallel computing for Big data
3.2 Introducing Hadoop
3.3 Cloud Computing and Big Data
3.4 In-Memory Technology and Big Data


Chapter-4: Understanding the Hadoop Ecosystem

4.1 The Hadoop Ecosystem
4.2 Storing Data with HDFS
4.3 Processing Data with Hadoop MapReduce
4.4 Storing Big Data with HBase
4.5 Using Hive for Querying Big Databases
4.6 Interacting with the Hadoop Ecosystem (Pig/Sqoop/Zookeeper/Flume/Oozie)


Chapter-5: MapReduce Fundamentals

5.1 Origins of MapReduce
5.2 How MapReduce works
5.3 Optimization Techniques for MapReduce Jobs
5.4 Applications of MapReduce
5.5 Role of HBase in processing Big Data
5.6 Mining Big Data with Hive


Chapter-6: Understanding Analytics

6.1 Analysis versus Reporting
6.2 Basic and Advanced Analytics
6.3 Conducting an Analysis: Things to Consider
6.4 Building an Analytic Team


Chapter-7: Analytical Approaches and Tools

7.1 The evolution of Analytic Approaches
7.2 The evolution of Analytic Tools
7.3 Categories of Analytic Tools
7.4 Some popular Analytical Tools (R/SPSS/SAS)
7.5 Comparison between Analytical Tools


Chapter-8: Getting Started

8.1 Learning objectives
8.2 Download and Install R and RStudio
8.3 Working in the R Windowing Environment
8.4 Install and Load Packages


Chapter-9: Basic Building Blocks in R

9.1 Learning Objectives
9.2 R as a Calculator
9.3 Work with variables
9.4 Understand Data Types
9.5 Store Data in Vectors
9.6 Call Functions


Chapter-10: Advanced Data Structures in R

10.1 Learning Objectives
10.2 Create and Access Information in Data Frames
10.3 Create and Access Information in Lists
10.4 Create and Access Information in Matrices
10.5 Create and Access Information in Arrays


Chapter-11: Reading Data into R

11.1 Learning Objectives
11.2 Reading CSV Files
11.3 Understanding Why Excel Data Is Not Easily Accessible in R
11.4 Read from Databases
11.5 Read Data files from other Statistical Tools
11.6 Load binary R files
11.7 Load Data included with R
11.8 Scrape Data from the web


Chapter-12: Making Statistical Graphs

12.1 Learning Objectives
12.2 Using Datasets for creating Graphs
12.3 Making Histograms, Bar Graphs, Line Graphs, Scatterplots, Boxplots, etc. with Base Graphics
12.4 Introduction to ggplot2
12.5 Histograms and density plots with ggplot2
12.6 Scatterplots with ggplot2
12.7 Box and violin plots with ggplot2
12.8 Creating Line plots
12.9 Control colour and shapes
12.10 Add themes to graphs


Chapter-13: Basics of Programming

13.1 Learning Objectives
13.2 The Classic “Hello World” Example
13.3 Basics of Function Arguments
13.4 Return a Value from a Function
13.5 Flexibility with do.call
13.6 If Statements for controlling Program Flow
13.7 If-Else Statements
13.8 Multiple checks using Switch
13.9 Checks on entire Vectors
13.10 Check Compound Statements
13.11 Iteration: for and while Loops
13.12 Control loops with Break and Next


Chapter-14: Data Munging

14.1 Learning Objectives
14.2 Repeating Matrix Operations – the apply function
14.3 Repeating List Operations
14.4 The mapply function
14.5 The aggregate function
14.6 The plyr package
14.7 Combining Datasets
14.8 Joining Datasets
14.9 Switch storage paradigms


Chapter-15: Manipulating Strings

15.1 Learning Objectives
15.2 Combine Strings Together
15.3 Extract Text


Chapter-16: Basic Statistics

16.1 Learning Objectives
16.2 Drawing numbers from Probability Distributions
16.3 Summary Statistics: Mean, Variance, SD, Correlation
16.4 Compare samples with t-tests and Analysis of Variance


Chapter-17: Linear Models

17.1 Learning Objectives
17.2 Fit simple Linear models
17.3 Exploring the Data
17.4 Fit multiple Regression Models
17.5 Fit Generalised Linear Models (GLM)
17.6 Fit Logistic Regression
17.7 Fit Poisson Regression
17.8 Analyze Survival Data
17.9 Assess Model Quality and Residuals
17.10 Compare Models


Chapter-18: Other Models

18.1 Learning Objectives
18.2 Select variables and improve predictions with elastic net
18.3 Decrease uncertainty with weakly informative priors
18.4 Fit Non-Linear Least Squares
18.5 Splines
18.6 Generalised Additive Models (GAM)
18.7 Fit Decision Trees to make a Random Forest


Chapter-19: Time Series Analysis

19.1 Learning Objectives
19.2 Understanding ACFs and PACFs
19.3 Fit and Assess ARIMA Models


Chapter-20: Text Mining

20.1 Learning Objectives
20.2 Text Extraction & manipulation
20.3 Sentiment Analysis
20.4 Social Media Analytics: Case Studies


Chapter-21: Integrating R and Hadoop and Understanding Hive

21.1 Integrating R and Hadoop and Understanding Hive Hadoop
21.2 Integrating R and Hadoop: RHadoop
21.3 Text Mining for Deriving Useful Information
21.4 Introduction to Hive

We offer various case studies based on different industries. You can choose the case study most applicable to you.

Case Study 1: Regression Analysis

How do you assess whether you are paying the correct price when buying a property?
Price is a very important factor for any business. The right price can make the difference between profit and loss. In this case study we take property pricing as an example to gain a deeper understanding of regression analysis.

Step – 1: Data Preparation
A. Checking for outliers
B. Checking for missing values and how to treat them
C. Basic univariate and bivariate analysis, i.e. checking correlations and how the variables are distributed
Step – 2: Principal Component Analysis
Step – 3: Traditional Regression Analysis with variable selection


Case Study 2: Marketing Analytics

As a key decision and strategy maker at an online retail store that specializes in apparel and clothing, how can you identify opportunities to improve P&L by establishing an analytics practice? Background of behavioural analytics: human brains follow involuntary patterns (people behave like other similar people around them), and detecting these patterns is precisely the idea behind marketing analytics.

Step – 1: EDA – Exploratory Data Analysis
A. Exploring different patterns, e.g. the distribution of customers across the number of product categories each customer purchased
B. Why are customers buying different product categories?
C. Categorization of customers based on the number of product categories they purchased
D. Which category contributes the highest sales?
Step – 2: Association Analysis
E. Support/Confidence/Lift – Apriori concept
F. Market Basket Analysis
Step – 3: Customer Segmentation
G. Classification/Clustering
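
The support, confidence, and lift measures listed under Step – 2 can be illustrated with a small sketch. It is written in Python purely for illustration (the course itself works in R), the baskets are made-up toy data, and the function names are hypothetical helpers rather than part of any library.

```python
# Hypothetical toy data: each set holds the product categories in one order.
baskets = [
    {"shirts", "jeans"},
    {"shirts", "jeans", "shoes"},
    {"shirts", "shoes"},
    {"jeans"},
    {"shirts", "jeans"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the baseline popularity of `consequent`.

    A value above 1 suggests the two itemsets are bought together more often
    than independence would predict; below 1 suggests less often.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


if __name__ == "__main__":
    rule = ({"shirts"}, {"jeans"})
    print("support   :", round(support(rule[0] | rule[1], baskets), 4))
    print("confidence:", round(confidence(*rule, baskets), 4))
    print("lift      :", round(lift(*rule, baskets), 4))
```

An Apriori-style search then keeps only the itemsets whose support exceeds a chosen threshold before rules are scored this way.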


Case Study 3: Score Card Modelling

Given the ongoing turmoil in credit markets, a critical reassessment of credit risk modelling approaches is needed more than ever. This modelling approach generates a probability-of-default score for each customer on the basis of a set of independent variables (which may differ according to business requirements). The score can then be used for predictive modelling, MIS reporting, etc.

Step – 1: EDA – Exploratory Data Analysis
A. Data import and basic data sanity check
B. Exploring different patterns, e.g. the distribution of the data
C. Variable selection approaches (categorical & numerical)
D. Training and validation data creation
Step – 2: Model Preparation
E. Creating indicator variables
F. Apply stepwise regression
Step – 3: Model Validation
G. Check for multicollinearity (using correlation matrix, VIF)
H. Generate scores using logistic regression
I. KS calculation
J. Coefficient validation, coefficient stability and score stability.
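
As a rough illustration of the KS calculation in step I, the sketch below computes the Kolmogorov-Smirnov statistic as the largest gap between the cumulative share of defaulters and of non-defaulters when customers are ranked by score. It is written in Python for illustration only (the course uses R); the function name and the sample scores are hypothetical, and it assumes a higher score means a higher predicted probability of default.

```python
def ks_statistic(scores, defaulted):
    """Largest gap between the cumulative shares of defaulters and
    non-defaulters, scanning customers from highest to lowest score.

    scores    -- predicted probability-of-default scores (assumed: higher = riskier)
    defaulted -- 1 if the customer actually defaulted, else 0
    """
    total_bad = sum(defaulted)
    total_good = len(defaulted) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one defaulter and one non-defaulter")

    ranked = sorted(zip(scores, defaulted), key=lambda pair: pair[0], reverse=True)
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(ranked):
        # Customers with tied scores are added together before comparing.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            cum_bad += ranked[j][1]
            cum_good += 1 - ranked[j][1]
            j += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
        i = j
    return ks


if __name__ == "__main__":
    # Made-up validation sample: eight customers, four of whom defaulted.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    defaulted = [1, 1, 0, 1, 0, 0, 1, 0]
    print("KS =", ks_statistic(scores, defaulted))
```

A KS of 0 means the score does not separate the two groups at all, while a KS of 1 means it separates them perfectly.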


Case Study 4: Web Scraping & Text Analysis

The rapid growth of the World Wide Web over the past two decades has tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behavior. Web scraping is the process of extracting data from websites; text analysis algorithms are then applied to analyze the data. Examples include Twitter analysis and Google data analysis.

Step – 1: Setup connection
A. Create an API key for a developer account
B. Run an API request to fetch data
Step – 2: Data Extraction
C. Save the data returned by the API to Excel/CSV
D. Data analysis and sanity check (dealing with missing data)
Step – 3: Text mining
E. Apply different algorithms, such as sentiment analysis

Tania Chakraborty

In-house Faculty (R & SAS)

Tania, with a background in engineering, has 3+ years of hands-on working experience with various analytics tools, mainly SAS & R. She played a major role in the student data analysis of two entire countries, Dominica & St. Kitts, on the popular student management software “openSIS”.
She has also worked on various other data analysis projects, such as Data Analysis on US Economic Indices, Twitter Sentiment Analysis, and GDP rates.
Alongside project work, she provides training on Big Data analytics using Hadoop and R, Base SAS & Advanced SAS. She has already trained over a hundred high-profile MNC professionals on Data Analytics. She is the most junior but most appreciated faculty member of our team.

Tanushree Bhattacharyya

Guest Faculty (Advanced Excel, R)

Tanushree holds an M.Sc. in Econometrics & Statistics and has 8 years of experience in Analytics & Market Research. She currently works at a large MNC and is proficient in tools such as SAS, Advanced Excel, VBA, Access, SQL, SPSS, and Quantum. She is highly skilled in data analysis, building statistical models, creating publication-quality reports, and automating models with VBA/SAS/SPSS, with an excellent track record of managing clients and projects and exceeding expectations. She is an expert in handling analytical projects involving statistical techniques such as demand forecasting, multivariate techniques, optimization, and segmentation, and in reporting the insights to management to fulfill business requirements. She has been involved with NIVT for over a year and has an excellent track record of training professionals on Excel, VBA & R Programming. On behalf of NIVT she has conducted training at corporate houses such as Dynamic Level.

Abhinandan Chakraborty

Guest Faculty (BigData, Hadoop)

Abhinandan holds a B.Tech in Computer Science and has around 4 years of experience in Big Data live project development and POCs, working with Java, Hadoop, MapReduce, Apache Hive, Impala, Apache Spark, HBase, Apache Flume, Apache Kafka, Apache Cassandra, Apache Storm, and D3. He is currently employed with a well-known Big Data consulting company in Sector-5, Salt Lake. At NIVT, over the last couple of years, he has trained many high-profile MNC professionals on Big Data, Hadoop & Apache Spark and created many references for NIVT in the industry.

Debajyoti Chakraborty

In-house Faculty (R & SAS)

Debajyoti is a Statistical Analyst, a Member of the Actuarial Society of India, and an Analytics Trainer on statistical software (SAS, R, MS Excel) with basic knowledge of SQL. He is a graduate in Statistics with Maths and Computer Science as other subjects and has over 3 years of work experience as a Data Analyst. In the Statistical Analyst role he has worked on multiple industry projects, including dashboarding and analytics implementation for Retail and Healthcare projects. As an Actuarial Analyst he also assisted in Claim Analytics. As an Analytics Trainer, he provides training to industry professionals and academic students on statistical software packages (SAS, R, MS Excel from beginner to advanced, and SPSS), oversees data analysis projects undertaken by students, and shares knowledge to help them complete projects on time.