Big Data - Technology and Applications

Full course, Centrale Supelec - ESSEC, 2019

Big Data - Distributed computing and databases

ESSEC Business School, CentraleSupelec

This course is part of the MSc Data Science and Business Analytics offered by ESSEC Business School and CentraleSupelec.

Big Data is much more than a buzzword: it is a set of distributed computing techniques that allow developers to solve practical problems at a scale comparable to that of the entire Internet. These problems would not fit on any single computer, not even the most powerful on Earth. Instead, Big Data is about bringing algorithms to the data rather than the other way around. This way of thinking has made the fortune of some of the most influential multinationals of the Web era, such as Google, Amazon and Facebook, but it is a generic tool, just as useful in open-source environments, startups, SMEs, etc.

In this course, we learn the frameworks of Big Data, what the term really represents, and how the pieces fit together under a functional programming outlook: very large databases with HDFS, and distributed computing with Hadoop and Spark. We leverage cloud computing to build simple, practical, but large-scale software suitable for computing clusters, such as PageRank.
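To give a flavour of the functional outlook mentioned above, here is a minimal sketch of the map, shuffle and reduce steps applied to a word count, written in plain Python on a single machine. It is only an illustration, not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real Hadoop or Spark job would distribute each step across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: add up the counts collected for one word."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    # On a cluster, each line could be mapped on a different machine;
    # here everything runs in a local loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves slowly"])
print(counts)
```

Because the map step only looks at one line and the reduce step only looks at one key, both can run in parallel where the data is stored, which is the sense in which the algorithm is brought to the data.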

Help

Throughout the course, support, videos, supplementary material, further examples and general communication are provided on a Slack workspace.

The invitation to the course Slack workspace is here.

Lectures

    Entry         Description
01  Introduction  General introduction to Big Data
02  Hadoop        Hadoop: A framework for Big Data
03  MapReduce     Map-Reduce: Functional Programming for Big Data
04  Spark         Spark and Resilient Distributed Datasets
05  PageRank      PageRank, a distributed graph-based algorithm
06  Ecosystem     Beyond MapReduce, the Hadoop Zoo
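As a preview of the PageRank lecture, the following is a minimal single-machine sketch of the algorithm by power iteration. It is an assumption-laden illustration rather than the version taught in the course: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all chosen here for the example, and every link target is assumed to appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links maps each page to the list of pages it links to; every target
    is assumed to also be a key of the dictionary.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page first receives the uniform "teleport" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank equally among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Each iteration only needs, for every page, its current rank and its outgoing links, which is why the computation can be split by page across a cluster.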

Tutorials

    Entry       Description
01  Tutorial 1  Tutorial 1: distributed word count
    Tut1 code   Tutorial 1: Python notebook
02  Tutorial 2  Tutorial 2: distributed matrix-vector multiplication
03  Tutorial 3  Tutorial 3: introduction to Spark
04  Tutorial 4  Tutorial 4: Machine Learning on Spark
05  Tutorial 5  Tutorial 5: PageRank in Hadoop and Spark on AWS
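For orientation before Tutorial 2, here is one possible way to phrase a matrix-vector product in map-reduce style, sketched in plain Python. This is not the tutorial's solution: the names `map_entry`, `reduce_row` and `mat_vec` are hypothetical, the matrix is assumed to be stored as sparse (row, column, value) triples, and the vector is assumed small enough to be available to every mapper.

```python
from collections import defaultdict


def map_entry(entry, vector):
    """Map: a matrix entry (i, j, a_ij) contributes a_ij * v[j] to row i."""
    i, j, a_ij = entry
    return i, a_ij * vector[j]


def reduce_row(values):
    """Reduce: sum the partial products collected for one row."""
    return sum(values)


def mat_vec(entries, vector):
    # Group the mapped products by row index (the shuffle step),
    # then reduce each group independently.
    partial = defaultdict(list)
    for entry in entries:
        row, product = map_entry(entry, vector)
        partial[row].append(product)
    return {row: reduce_row(values) for row, values in partial.items()}


# The 2 x 3 matrix [[1, 0, 2], [0, 3, 0]] as (row, column, value) triples.
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
vector = [1.0, 2.0, 3.0]
result = mat_vec(entries, vector)
print(result)
```

Keying the intermediate products by row index means each output coordinate is computed by a single reducer, whatever machine the individual matrix entries were stored on.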

Assignments and grading

Assignments and grading are provided through the Edunao platform. Follow this link. All participants for 2019-2020 should already be registered. Log in with your CentraleSupelec account.

AWS

Big Data workloads do not really fit on your own computer, so we have provided credits for you to use Amazon's Elastic Compute Cloud (EC2) on Amazon Web Services (AWS). We have configured and made available a virtual machine for you. Instructions will be given during lectures.

Register on AWS via the RosettaHub system by following this link, using the CentraleSupelec email account provided to you as identifier.

To start and connect to a Jupyter Notebook, follow these instructions.

To debug Hadoop programs, try these instructions.

Thanks

Special thanks to Prof. Céline Hudelot for her help.