Big Data Analytics with R & Hadoop
About Big Data Analytics with R and Hadoop
Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.
Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop using tools such as RHIPE and RHadoop. With these tools, a powerful data analytics engine can be built that runs analytics algorithms over large-scale datasets in a scalable manner, combining the data analytics operations of R with MapReduce and HDFS from Hadoop.
This advanced-level analytics course focuses on the following key areas:
- Technologies for Handling Big Data
- Understanding the Hadoop Ecosystem
- Reading Datasets into R, Exporting Data from R
- Manipulating and Processing Data in R
What we offer
Training under the guidance of a Data Scientist with 20+ years of experience, a postgraduate degree from IIT, a PhD from Boston University, and 40+ research papers on Data Science.
After training, an internship at our Development Partner (Ideal Analytics/ArcVision), working on live projects.
Case studies on real industry data
Classroom training with flexible timing
Unlimited access to exclusive Study Materials on Cloud
Chapter-1: Introduction to Big Data
1.1 Big Data
1.2 History of Data Management – Evolution of Big Data
1.3 Structuring of Big Data
1.4 Elements of Big Data
1.5 Application of Big Data in a Business Context
1.6 Careers in Big Data
Chapter-2: Business Application of Big Data
2.1 The Significance of Social Network Data
2.2 Financial Fraud and Big Data
2.3 Fraud detection in Insurance
2.4 Use of Big Data in the Retail Industry
Chapter-3: Technologies for Handling Big Data
3.1 Distributed and parallel computing for Big Data
3.2 Introducing Hadoop
3.3 Cloud Computing and Big Data
3.4 In-Memory Technology and Big Data
Chapter-4: Understanding the Hadoop Ecosystem
4.1 The Hadoop Ecosystem
4.2 Storing Data with HDFS
4.3 Processing Data with Hadoop MapReduce
4.4 Storing Big Data with HBase
4.5 Using Hive for Querying Big Databases
4.6 Interacting with the Hadoop Ecosystem (Pig/Sqoop/Zookeeper/Flume/Oozie)
Chapter-5: MapReduce Fundamentals
5.1 Origins of MapReduce
5.2 How MapReduce works
5.3 Optimization Techniques for MapReduce Jobs
5.4 Applications of MapReduce
5.5 Role of HBase in processing Big Data
5.6 Mining Big Data with Hive
Chapter-6: Understanding Analytics
6.1 Analysis versus Reporting
6.2 Basic and Advanced Analytics
6.3 Conducting an Analysis – Things to consider
6.4 Building an Analytic Team
Chapter-7: Analytical Approaches and Tools
7.1 The evolution of Analytic Approaches
7.2 The evolution of Analytic Tools
7.3 Categories of Analytic Tools
7.4 Some popular Analytical Tools (R/SPSS/SAS)
7.5 Comparison between Analytical Tools
Chapter-8: Getting Started
8.1 Learning objectives
8.2 Download and Install R and RStudio
8.3 Working in the R Windowing Environment
8.4 Install and Load Packages
Chapter-9: Basic Building Blocks in R
9.1 Learning Objectives
9.2 R as a Calculator
9.3 Work with variables
9.4 Understand Data Types
9.5 Store Data in Vectors
9.6 Call Functions
Chapter-10: Advanced Data Structures in R
10.2 Create and Access Information in Data Frames
10.3 Create and Access Information in Lists
10.4 Create and Access Information in Matrices
10.5 Create and Access Information in Arrays
Chapter-11: Reading Data into R
11.1 Learning Objectives
11.2 Reading CSV Files
11.3 Understanding Why Excel Files Are Not Easily Accessible in R
11.4 Read from Databases
11.5 Read Data files from other Statistical Tools
11.6 Load binary R files
11.7 Load Data included with R
11.8 Scrape Data from the web
Chapter-12: Making Statistical Graphs
12.1 Learning Objectives
12.2 Using Datasets for creating Graphs
12.3 Making Histograms, Bar graphs, Line graphs, Scatterplots, Boxplots, etc. with Base Graphics
12.4 Introduction to ggplot2
12.5 Histograms and density plots with ggplot2
12.6 Scatterplots with ggplot2
12.7 Box and violin plots with ggplot2
12.8 Creating Line plots
12.9 Control colour and shapes
12.10 Add themes to graphs
Chapter-13: Basics of Programming
13.1 Learning Objectives
13.2 The Classic “Hello World” Example
13.3 Basics of Function Arguments
13.4 Return a Value from a Function
13.5 Flexibility with do.call
13.6 If Statements for controlling Program Flow
13.7 If-Else Statements
13.8 Multiple checks using Switch
13.9 Checks on entire Vectors
13.10 Check Compound Statements
13.11 Iteration – for and while loops
13.12 Control loops with Break and Next
Chapter-14: Data Munging
14.1 Learning Objectives
14.2 Repeating Matrix Operations – the apply function
14.3 Repeating List Operations
14.4 The mapply function
14.5 The aggregate function
14.6 The plyr package
14.7 Combining Datasets
14.8 Joining Datasets
14.9 Switch storage paradigms
Chapter-15: Manipulating Strings
15.1 Learning Objectives
15.2 Combine Strings together
15.3 Extract Text
Chapter-16: Basic Statistics
16.1 Learning Objectives
16.2 Drawing numbers from Probability Distributions
16.3 Summary Statistics – Mean, Variance, SD, Correlation
16.4 Compare samples with t-tests and Analysis of Variance
Chapter-17: Linear Models
17.1 Learning Objectives
17.2 Fit simple Linear models
17.3 Exploring the Data
17.4 Fit multiple Regression Models
17.5 Fit Generalised Linear Models (GLM)
17.6 Fit Logistic Regression
17.7 Fit Poisson Regression
17.8 Analyze Survival Data
17.9 Assess Model Quality and Residuals
17.10 Compare Models
Chapter-18: Other Models
18.1 Learning Objectives
18.2 Select variables and improve predictions with elastic net
18.3 Decrease uncertainty with weakly informative priors
18.4 Fit Non-Linear Least Squares
18.6 Generalised Additive Models (GAM)
18.7 Fit Decision Trees to make a Random Forest
Chapter-19: Time Series Analysis
19.1 Learning Objectives
19.2 Understanding ACFs and PACFs
19.3 Fit and Assess ARIMA Models
Chapter-20: Text Mining
20.1 Learning Objectives
20.2 Text Extraction & manipulation
20.3 Sentiment Analysis
20.4 Social Media Analytics – Case Studies
Chapter-21: Integrating R and Hadoop and Understanding Hive
21.1 Integrating R and Hadoop and Understanding Hive Hadoop
21.2 Integrating R and Hadoop – RHadoop
21.3 Text Mining for Deriving Useful Information
21.4 Introduction to Hive
We offer case studies drawn from different industries. You can choose the ones that are applicable to you.
Case Study 1: Regression Analysis
How do you assess whether you are paying the correct price when buying a property?
Pricing is a very important function for any business. The correct price can make the difference between profit and loss. In this case study, we take the example of property pricing to gain a deeper understanding of regression analysis.
Step – 1: Data Preparation
A. Checking for outliers
B. Checking for missing values and how to treat them
C. Basic univariate and bivariate analysis, i.e. checking how the variables are distributed and how they correlate
Step – 2: Principal Component Analysis
Step – 3: Traditional Regression Analysis with variable selection
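As a rough illustration of the regression idea behind this case study, the sketch below fits a one-variable least-squares line in plain Python (the course itself works in R). The property areas and prices are made-up numbers, and the function name `fit_simple_linear` is hypothetical; a real analysis would first go through the data preparation and variable selection steps listed above.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Slope = covariance(x, y) / variance(x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: floor area (square metres) and asking price (thousands)
area = [50, 60, 80, 100]
price = [110, 130, 170, 210]

intercept, slope = fit_simple_linear(area, price)

# Predicted price for a hypothetical 90 square metre property
predicted = intercept + slope * 90
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```

Comparing a property's asking price with the fitted prediction gives a first indication of whether it is priced above or below similar properties.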
Case Study 2: Marketing Analytics
As a key decision and strategy maker at an online retail store that specializes in apparel and clothing, how can you identify opportunities to improve P&L by establishing an analytics practice? Background of behavioural analytics: people follow involuntary patterns (they behave like similar people around them), and detecting these patterns is precisely the idea behind marketing analytics.
Step – 1: EDA – Exploratory Data Analysis
A. Exploring different patterns, e.g. the distribution of customers across the number of product categories purchased by each customer.
B. Why customers buy different product categories.
C. Categorization of customers based on the number of product categories they purchased.
D. Which category contributes the highest sales?
Step – 2: Association Analysis
E. Support/Confidence/Lift – Apriori concept
F. Market Basket Analysis
Step – 3: Customer Segmentation
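The support, confidence, and lift measures named in Step 2 can be sketched in a few lines of plain Python (the course itself works in R). The transactions below are made-up apparel baskets, and the helper names are hypothetical. This only computes the three measures for one rule; it is not an implementation of the full Apriori search.

```python
# Hypothetical baskets: each set is the product categories in one order
transactions = [
    {"shirt", "jeans"},
    {"shirt", "jacket"},
    {"shirt", "jeans", "jacket"},
    {"jeans"},
    {"shirt", "jeans"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


# Rule under study: customers who buy a shirt also buy jeans
rule_support = support({"shirt", "jeans"}, transactions)
rule_confidence = confidence({"shirt"}, {"jeans"}, transactions)
rule_lift = lift({"shirt"}, {"jeans"}, transactions)
print(round(rule_support, 3), round(rule_confidence, 3), round(rule_lift, 3))
```

A lift above 1 suggests the two categories are bought together more often than chance would predict; a lift below 1 suggests the opposite.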
Case Study 3: Scorecard Modelling
Given the ongoing turmoil in credit markets, a critical reassessment of credit risk modelling approaches is needed more than ever. This modelling approach generates a probability-of-default score for each customer based on a set of independent variables (which may differ according to business requirements). The score can then be used for predictive modelling, MIS reporting, etc.
Step – 1: EDA – Exploratory Data Analysis
A. Data import and basic data sanity check.
B. Exploring different patterns, e.g. the distribution of the data.
C. Variables (categorical & numerical) selection approaches.
D. Training and validation data creation.
Step – 2: Model Preparation
E. Creating indicator variables
F. Apply stepwise regression
Step – 3: Validation of model
G. Check for multicollinearity (using correlation matrix, VIF)
H. Generate scores using logistic regression.
I. KS (Kolmogorov-Smirnov) statistic calculation
J. Coefficient validation, coefficient stability and score stability.
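The KS calculation in Step 3 can be illustrated with a short plain-Python sketch (the course itself works in R). It assumes each customer has a model score and a 0/1 label, where 1 marks a default. The scores, labels, and function name are hypothetical. The statistic is the largest gap between the cumulative score distributions of defaulters and non-defaulters.

```python
def ks_statistic(scores, labels):
    """Maximum gap between the cumulative score distributions of the two classes.

    labels: 1 = default ("bad"), 0 = no default ("good").
    """
    n_bad = sum(1 for label in labels if label == 1)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes must be present")

    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        # Process all customers sharing the same score together (ties)
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks


# Hypothetical model scores (higher = riskier) and observed outcomes
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(ks_statistic(scores, labels))
```

A KS close to 1 means the score separates defaulters from non-defaulters almost perfectly, while a KS near 0 means the score barely distinguishes the two groups.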
Case Study 4: Web Scraping & Text Analysis
The rapid growth of the World Wide Web over the past two decades has tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behavior. Web scraping is the process of extracting data from websites; text analysis algorithms are then applied to analyze the extracted data. Examples include Twitter analysis and Google data analysis.
Step – 1: Setup connection
A. Create an API key for a developer account.
B. Run an API request to fetch data.
Step – 2: Data Extraction
C. Save the data returned by the API to Excel/CSV.
D. Data analysis and sanity check (dealing with missing data)
Step – 3: Text mining
E. Apply different algorithms, such as sentiment analysis.
In-house Faculty (R & SAS)
She has also worked on various other data analysis projects, such as data analysis of US economic indices, Twitter sentiment analysis, and GDP rates.
Alongside her project work, she provides training on Big Data analytics using Hadoop and R, Base SAS, and Advanced SAS. She has already trained over a hundred high-profile MNC professionals in Data Analytics. She is the most junior but most appreciated member of our faculty team.
Guest Faculty (Advanced Excel, R)
Guest Faculty (BigData, Hadoop)