This course covers big data and its role in modern
business intelligence: turning large datasets into
actionable insight that addresses new business needs.
The course is lab-led and built on open source software.
Students will learn the fundamentals of MapReduce, the
Spark framework, NoSQL databases, PySpark, and Amazon
Athena. The class focuses on the storage, processing,
and analysis of big data. Students will use a Spark
cluster and MapReduce fundamentals to solve big data problems.
The main focus of this class is to cover the following concepts:
- Concepts of Big Data
  - Cluster Computing
  - Scale-up Architecture: Why or Why Not
  - Scale-out Architecture: Why or Why Not
  - Scale-out Architectures (using Hadoop, Spark, PySpark)
  - Fault Tolerance: How?
  - Data Replication: How?
- Distributed Computing
  - Cluster Computing (Master and Worker Nodes)
  - Distributed and Parallel Algorithms
- Distributed File Systems
  - Hadoop Distributed File System
  - Amazon S3
- MapReduce
- Spark
  - Apache Spark
  - Spark Cluster Computing
  - Use Spark, PySpark, and Python to teach MapReduce and distributed computing
  - Spark RDDs
  - Spark DataFrames
  - SQL for NoSQL Data: How?
- Amazon Athena
  - Serverless Architectures
  - Amazon Athena
  - Amazon Athena, S3, Data Partitioning
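
To give a feel for the MapReduce topic listed above, here is a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It is an illustration only, not course material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read from HDFS or S3.
    sample = ["big data big insight", "data replication"]
    print(word_count(sample))
```

On a Spark cluster the same pattern is typically expressed with the RDD operations `flatMap`, `map`, and `reduceByKey`, which run the map and reduce steps in parallel across worker nodes.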