Big Data - Technology and Applications

Full course, CentraleSupelec - ESSEC, 2019

Big Data - Distributed computing and databases

This course is part of the MSc in Data Science and Business Analytics offered by ESSEC Business School and CentraleSupelec.

Big Data is much more than a buzzword: it is a set of distributed computing techniques that let developers solve practical problems at a scale commensurate with that of the entire Internet. These problems would not fit on any single computer, not even the most powerful on Earth. Instead, Big Data is all about bringing algorithms to the data and not the other way around. This way of thinking underlies the fortunes of some of the most influential Web-era multinationals, such as Google, Amazon and Facebook, but it is a generic tool, just as useful in open-source environments, startups, SMEs, etc.

In this course, we learn the frameworks of Big Data, what the term really represents, and how the pieces fit together from a functional programming outlook: very large distributed storage with HDFS, distributed computing with Hadoop and Spark. We leverage cloud computing and build simple, practical, but large-scale software suitable for computing clusters, such as PageRank.
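To give a flavour of this functional outlook, here is a minimal single-machine sketch of the map, shuffle and reduce steps applied to a word count. It is only an illustration in plain Python, not the course's tutorial code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for this example, and a real Hadoop or Spark job would distribute each phase across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a very large text corpus.
    sample = ["big data is big", "data beats opinion"]
    print(word_count(sample))
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both can in principle run in parallel on different machines, which is what makes this style of programming suited to clusters.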

Help

Throughout the course, support, videos, supplementary material, further examples and general communication are provided on a Slack workspace.

Invitation to the Slack for the course here.

Lectures

	Entry	Description
01	Introduction	General introduction to Big Data
02	Hadoop	Hadoop: A framework for Big Data
03	MapReduce	Map-Reduce: Functional Programming for Big Data
04	Spark	Spark and Resilient Distributed Datasets
05	PageRank	PageRank, a distributed graph-based algorithm
06	Ecosystem	Beyond MapReduce, the Hadoop Zoo

Tutorials

	Entry	Description
01	Tutorial 1	Tutorial 1: distributed word count
	Tut1 code	Tutorial 1: python notebook
02	Tutorial 2	Tutorial 2: distributed matrix-vector multiplication
03	Tutorial 3	Tutorial 3: introduction to Spark
04	Tutorial 4	Tutorial 4: Machine Learning on Spark
05	Tutorial 5	Tutorial 5: Pagerank in Hadoop and Spark on AWS

Assignments and grading

Assignments and grading are provided through the Edunao platform. Follow this link. All participants for 2019-2020 should already be registered. Log in with your CentraleSupelec account.

AWS

Big Data workloads do not really fit on your own computer, hence we have provided credits for you on Amazon Web Services (AWS), in particular Amazon’s Elastic Compute Cloud (EC2). We have configured and made available a virtual machine for you. Instructions will be given during lectures.

Register on AWS via the RosettaHub system by following this link, using your CentraleSupelec email address as identifier.

To start and connect to a Jupyter Notebook, follow these instructions.

To debug Hadoop programs, try these instructions.

Thanks

Special thanks to Prof. Céline Hudelot for her help.

Prof. Hugues Talbot