Starts February 20th, 2021

For 2021, we have made a few changes to the regular course module. This time, along with #Spark and #Kafka, we have added #NiFi and #Airflow, which are in high demand for a data engineering career. We will also learn how to run Spark workloads in AWS EMR. During the training, we will perform all the hands-on exercises using a production-type cluster.


Course Introduction
• Why Apache Spark?
      o Problems in Data-Driven Businesses
      o How Spark Solves Them, and Why Big Data Solutions
                a) Spark Fundamentals
      o What Comprises the Spark and Hadoop Ecosystem
                a) Core Spark Components
                b) Apache Subprojects
• YARN – Basics
      o Why a Computational Framework?
      o YARN Architecture
      o How YARN Executes MR and Spark Jobs
      o YARN Applications in Web UIs and the Shell
      o YARN Application Logs

• Introduction to Python
      o Introduction to Functional Programming
      o Features of Functional Paradigm
      o Variables, Control structures, Functions and Objects
      o Mutable and Immutable Data
      o First Class Functions
      o Strings, Tuples and Named Tuples
      o Lists, Dicts and Sets
      o Lambda Functions
      o Hands-on covering all of the above
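
As a taste of the module above, here is a minimal sketch of the functional-style features listed (named tuples, lambda functions, first-class functions); the data and names are invented for illustration.

```python
from collections import namedtuple

# Named tuples: immutable, field-accessible records
Employee = namedtuple("Employee", ["name", "salary"])
staff = [Employee("Asha", 50000), Employee("Ravi", 65000)]

# Lambda functions and first-class functions: pass behavior as data
raise_10pct = lambda e: Employee(e.name, int(e.salary * 1.1))
promoted = list(map(raise_10pct, staff))

# Filter with a predicate; lists and comprehensions compose naturally
high_earners = [e.name for e in promoted if e.salary > 60000]
print(high_earners)
```

Because the records are immutable, `raise_10pct` builds new `Employee` values instead of mutating the originals, which is the heart of the functional paradigm covered here.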

• Spark and Databricks – High-Level Overview and Architecture
      o Directed Acyclic Graph
      o Spark Standalone Architecture
      o Introduction to Databricks
      o Databricks Architecture and Components
      o Functional Programming in Spark
      o Introduction to Spark RDDs
      o Hands-on – Databricks Environment Tour
      o Hands-on – Running a Basic Spark Program in Databricks

• Spark RDDs
      o How RDDs Are Created from Files or Data in Memory
      o Handling File Formats
      o Additional Operations on RDDs
      o Hands-on – Processing Data Files Using Spark RDDs

• Aggregations Using Pair RDDs
      o Key-Value Pair RDDs
      o Other Pair RDD Concepts
      o Using Pair RDDs to Join Datasets
      o Hands-on – Using Pair RDDs to Join DataFrames in Databricks
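
Before the Databricks hands-on, the inner-join semantics that pair RDDs provide through `join()` can be illustrated with plain Python; the datasets below are invented.

```python
# Two key-value datasets, shaped as pair RDDs hold them: (key, value) tuples
orders = [("u1", "laptop"), ("u2", "phone"), ("u1", "mouse")]
users = [("u1", "Asha"), ("u2", "Ravi"), ("u3", "Mei")]

def inner_join(left, right):
    """Pair every left value with every right value sharing the same key,
    the way Spark's pair-RDD join() does (an inner join on the key)."""
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in lookup.get(k, [])]

print(inner_join(orders, users))
```

Note that "u3" drops out entirely: keys present on only one side never appear in an inner join, which is why the module also covers the other pair-RDD join variants.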

• Writing and Deploying Spark Applications
      o How to Write a Spark Application – Scala and PySpark
      o Running Spark Applications in Databricks
      o Running Spark Applications in a Cloudera Environment
      o Accessing the Spark Application Web UI and Controlling Applications
      o Configuring Application Properties and Logging
      o Hands-on – Writing Two Spark Applications – PySpark and Scala
      o Hands-on – Configuring a Spark Application

• Spark Dataframes
    o Introduction to Spark DataFrames
    o DataFrame API
    o Load Data to DataFrames
    o Converting DataFrames to Pair RDDs
    o Hands-on – Working with DataFrames

• Parallel Processing and optimizations in Spark
    o DF partitions
    o Partitions of File-Based DataFrame
    o HDFS and Data Locality
    o Executing parallel operations
    o Stages and Tasks
    o Hands-on – Viewing Stages and Jobs in the Spark Application UI
    o DF Lineage
    o DF Persistence
    o Distributed persistence
    o Hands-on – How to Persist a DF
    o Hands-on – Batch ETL using Spark

• Spark SQLContext
    o Spark SQL Basics
    o Creating a SQLContext
    o Querying with Spark SQL
    o Hands-on – Working with a Spark SQL Use Case

• Spark Streaming
    o Spark Streaming Overview
    o DStreams and ReadStreams
    o Developing Streaming Applications
    o Multi-Batch Operations
    o Time Slicing and State Operations
    o Sliding Window Operations
    o Integrating Kafka with Spark Streaming
    o Hands-on – All of the above; the Kafka and Spark integration is covered after the Kafka portion.
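
The sliding-window idea in the module above can be shown without Spark: keep the last N time slices and aggregate over them. A pure-Python sketch with invented per-batch counts:

```python
from collections import deque

# Counts arriving per micro-batch (one number per time slice)
batch_counts = [3, 1, 4, 1, 5, 9]

window_size = 3            # window length, in slices
window = deque(maxlen=window_size)
windowed_sums = []

for count in batch_counts:
    window.append(count)               # slide: the oldest slice falls off
    windowed_sums.append(sum(window))  # aggregate over the current window

print(windowed_sums)
```

Spark Streaming does the same bookkeeping for you across a cluster, parameterized by window length and slide interval.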

• Spark Optimizations
    o Adaptive Query Optimizations
    o Container Sizing
    o BackPressure
    o Choosing File Formats
    o Hands-on – All of the above, plus a few other scenarios

• Running Spark Jobs in AWS EMR
    o Introduction to AWS EMR
    o Introduction to S3 Bucket
    o EMR Components
    o Hands-on – Running Basic Spark Jobs in a Notebook
    o Storing Data in S3
    o Hands-on – Running the previous use cases in EMR

• Spark Data Processing patterns
    o Iterative Algorithms in Spark
    o Graph Processing and Analysis
    o Hands-on – Implementing an Iterative Algorithm with Spark

• Spark ML Libraries
    o Introduction to Machine Learning
    o Machine Learning Libraries with Spark
    o Clustering in Spark
    o The K-Means Algorithm in Spark
    o Hands-on – Implementing K-Means
    o Classification and Regression in Spark
    o Hands-on – Logistic Regression in Spark
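
The K-means iteration that Spark MLlib parallelizes can be sketched in plain Python; one-dimensional points and starting centroids are chosen here purely for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """One-dimensional K-means: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

# Two obvious clusters around 1.5 and 11.0
print(kmeans_1d([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

MLlib distributes the assignment step across partitions and aggregates the cluster means, but the loop above is the algorithm being run.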

• Kafka Introduction
    o Architecture
    o Overview of key concepts
    o Overview of ZooKeeper
    o Cluster, Nodes, Kafka Brokers
    o Consumers, Producers, Logs, Partitions, Records, Keys
    o Partitions for write throughput
    o Partitions for Consumer parallelism (multi-threaded consumers)
    o Replicas, Followers, Leaders
    o How to scale writes
    o Disaster recovery
    o Performance profile of Kafka
    o Consumer Groups, “High Water Mark”, what do consumers see
    o Consumer load balancing and fail-over
    o Working with Partitions for parallel processing and resiliency
    o Brief Overview of Kafka Streams, Kafka Connectors
    o Hands-on

  1.  Create a topic with replication and partitions
  2.  Produce and consume messages from the command line

• Low-level Kafka Architecture
    o Motivation – Focus on High Throughput
    o File structure on disk and how data is written
    o Kafka Producer load balancing details
    o Producer Record batching by size and time
    o Producer async commit and commit (flush, close)
    o Pull vs poll
    o Compressions via message batches
    o Consumer poll batching, long poll
    o Consumer Trade-offs of requesting larger batches
    o Managing consumer position (auto-commit, async commit and sync commit)
    o Messaging – At most once, At least once, Exactly once
    o Performance trade-offs message delivery semantics
    o Replication, Quorums, ISRs, committed records
    o Failover and leadership election
    o Failure scenarios
    o Hands-on

  1.  Writing Java and Python Kafka Producer
  2.  Writing Java and Python Kafka Consumer

• Writing Advanced Kafka Producers and Consumers
    o Using batching (time/size)
    o Using compression
    o Async producers and sync producers
    o Commit and async commit
    o Default partitioning (round robin no key, partition on key if key)
    o Controlling Which Partition Records Are Written to (Custom Partitioners)
    o Advanced Producer configurations list
    o Adjusting poll read size
    o Implementing at most once message semantics using Java API
    o Implementing at least once message semantics using Java API
    o Implementing as close as we can get to exactly once Java API
    o Re-consume messages that are already consumed
    o Using ConsumerRebalanceListener to start consuming from a certain offset
    o Hands-on

  1. Use message batching and compression
  2. Round Robin partition
  3. Custom Partition Program in Java
  4.  Adjusting poll read size
  5.  Implementing Semantics in consumer
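
The default partitioning rule above (round robin when there is no key, hash of the key otherwise) can be mimicked in plain Python. Note that `zlib.crc32` below merely stands in for Kafka's murmur2 hash, so the actual partition numbers a real producer picks will differ.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    """No key: spread records round robin. With a key: hash it, so all
    records with the same key land in the same partition (ordering per key)."""
    if key is None:
        return next(_round_robin)
    # crc32 stands in for Kafka's murmur2 hash (illustrative only)
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

keyless = [choose_partition(None) for _ in range(4)]
print(keyless)
```

Keeping same-key records in one partition is what makes per-key ordering and multi-threaded consumer parallelism possible at the same time.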

• Kafka Schema Registry and REST Proxy
    o AVRO File Format Introduction
    o Kafka Schema Registry
    o Kafka REST Proxy
    o Ingesting data using Kafka REST Proxy
    o Hands-on – Ingesting and Validating Data Using the Schema Registry and REST Proxy
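
For orientation, an Avro schema of the kind registered in the Schema Registry is plain JSON; the record and field names below are invented for illustration.

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"},
    {"name": "ts", "type": "long"}
  ]
}
```

The registry stores versions of schemas like this per topic, letting producers and consumers validate records against an agreed structure.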
• Kafka Connect
    o Kafka Connect Introduction
    o Components of Kafka Connect
    o File Source and File Sink
    o Hands-on – Setting Up Kafka Connect
    o Hands-on – Kafka Connect from an RDBMS Source
    o Hands-on – Kafka Connect Using a File Source

• Kafka Streaming and KSQL
    o Components of Kafka Streaming
    o Overview of Kafka Streams
    o Kafka Streams Fundamentals
    o Kafka Streams Applications
    o Components of KSQL
    o Using KSQL
    o KSQL – Data Manipulation
    o KSQL – Aggregations
    o Lab – Exercises Using KSQL
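
To give a feel for the KSQL exercises, here is an illustrative sketch; the stream and column names are invented, and the exact syntax (for example `EMIT CHANGES`) varies between KSQL and newer ksqlDB versions.

```sql
-- Declare a stream over an existing Kafka topic (names are invented)
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- A continuous aggregation: view counts per URL per one-minute window
SELECT url, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY url
  EMIT CHANGES;
```

Unlike a regular SQL query, the `SELECT` above never finishes: it keeps emitting updated counts as new records arrive on the topic.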

• Introduction to NiFi and Its Components
    o Introduction to Apache NiFi
    o Apache NiFi Architecture
    o NiFi Prerequisites
    o Install and Configure a Single-Node NiFi with Hands-on
    o NiFi UI – UI Summary and History with Hands-on
    o Introduction to the NiFi FlowFile
    o NiFi Controller Services and Reporting Tasks
    o NiFi Repositories
    o NiFi Templates
    o Introduction to NiFi Process Groups with Hands-on
    o Introduction to NiFi Remote Process Groups
    o FlowFile Topology – Content and Attributes
    o Remote Process Group Transmission
    o NiFi Flow Creation – Hands-on: PutFile to FlowFile
    o Hands-on – NiFi Flows (3 exercises)

• Data Ingestion using NiFi
    o Big Data Ingestion Using NiFi with Hands-on
    o Performing Kafka Ingestion Using NiFi with Hands-on
    o NiFi Best Practices

• Airflow Introduction
    o Airflow Overview
    o DAG and its usage
    o Basic CLI Commands
    o Web UI
    o Hands-on – DAG Use Case
    o Hands-on – Integrating a Spark Job Using Airflow
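
The core idea the module above builds on, a DAG of tasks run in dependency order, can be sketched with only the standard library (`graphlib`, Python 3.9+); the task names are invented stand-ins for Airflow operators.

```python
from graphlib import TopologicalSorter

# Each task lists the tasks it depends on, the way an Airflow DAG
# wires operators together with >> (task names are illustrative)
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "notify": ["load"],
}

# A topological sort yields an order in which every task runs only
# after all of its upstream dependencies have completed
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

Airflow adds scheduling, retries, and a web UI on top, but every DAG run ultimately executes tasks in an order like this.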