Big Data - Distributed computing and databases
This course is part of the MSc Data Science and Business Analytics offered by Essec Business School and CentraleSupelec
Big Data is much more than a buzzword: it is a set of distributed computing techniques that allow developers to solve practical problems at a scale commensurate with that of the entire Internet. These problems would not fit on any single computer, even the most powerful on Earth. Instead, Big Data is all about bringing algorithms to the data, not the other way around. This way of thinking underpins the fortunes of some of the most influential Web-era multinationals (Google, Amazon, Facebook), but it is a generic tool, equally useful in open-source environments, startups, SMEs, etc.
In this course, we learn the main Big Data frameworks, what the term really represents, and how the pieces fit together from a functional programming perspective: very large databases with HDFS, and distributed computing with Hadoop and Spark. We leverage cloud computing to build simple, practical, but large-scale software suitable for computing clusters, such as PageRank.
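To give a flavour of the functional style behind these frameworks, here is a minimal single-machine sketch of a map-reduce word count in plain Python. It is an illustration under stated assumptions, not the code used in the tutorials: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real Hadoop or Spark job would run the map and reduce steps in parallel on the machines that hold the data.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum all the counts emitted for one word."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(chunks):
    """Run map, shuffle and reduce sequentially over a list of text chunks."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: each string stands for a block of a large file
# that would be stored on a different machine.
chunks = ["big data is big", "data about data"]
counts = word_count(chunks)
print(counts)
```

Because `map_phase` only looks at one chunk and `reduce_phase` only looks at one key, each call is independent of the others, which is what lets a cluster framework distribute them across many machines.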
Throughout the course, support, videos, supplementary material, further examples and general communication are provided on a Slack workspace.
Invitation to the Slack for the course here.
|01||Introduction||General introduction to Big Data|
|02||Hadoop||Hadoop: A framework for Big Data|
|03||MapReduce||Map-Reduce: Functional Programming for Big Data|
|04||Spark||Spark and Resilient Distributed Datasets|
|05||PageRank||PageRank, a distributed graph-based algorithm|
|06||Ecosystem||Beyond MapReduce, the Hadoop Zoo|
|01||Tutorial 1||Tutorial 1: distributed word count|
|Tut1 code||Tutorial 1: Python notebook|
|02||Tutorial 2||Tutorial 2: distributed matrix-vector multiplication|
|03||Tutorial 3||Tutorial 3: introduction to Spark|
|04||Tutorial 4||Tutorial 4: Machine Learning on Spark|
|05||Tutorial 5||Tutorial 5: PageRank in Hadoop and Spark on AWS|
Assignments and grading
Assignments and grades are provided through the Edunao platform. Follow this link. All participants for 2019-2020 should already be registered. Log in with your CentraleSupelec account.
Big Data is not really practical on your own computer, so we have provided credits for you on Amazon Web Services (AWS), using Amazon’s Elastic Compute Cloud (EC2). We have configured and made available a virtual machine for you. Instructions will be given during lectures.
Register on AWS via the RosettaHub system by following this link, using your CentraleSupelec email account as identifier.
To start and connect to a Jupyter Notebook, follow these instructions.
To debug Hadoop programs, try these instructions.
Special thanks to Pr. Céline Hudelot for her help.