sds-2-2

**Scalable Data Science from Atlantis**, *A Big Data Course in Apache Spark 2.2*

Welcome! This is a 2017 Uppsala Big Data Meetup of a fast-paced PhD course sequel in data science.

Most lectures are streamed live via Hangouts On Air on YouTube at this channel and archived in this playlist. We are not set up for live interactions online.

There are two parts to the course sequel:

- Introduction to Data Science
- Fundamentals of Data Science

The course gitbook, edited by Dan Lilja, Raazesh Sainudiin and Tilo Wiklund, is under construction at:

*Introduction to Data Science* and *Fundamentals of Data Science* form a *"big data"* sequel that will introduce researchers from various scientific and engineering disciplines to the rapidly growing field of data science and equip them with the latest industry-recommended open-source tools for extracting insights from large-scale datasets through the complete data science process of collecting, cleaning, extracting, transforming, loading, exploring, modelling, analysing, tuning, predicting, communicating and serving.

The course is supported by the databricks academic partners programme (databricks is the UC Berkeley startup founded by the creators of Apache Spark™, a fast and general engine for large-scale data processing) and Amazon Web Services Educate. It aims to train data scientists to meet the current needs of Stockholm's data industry through feedback from the AI and Data Analytics Centre of Excellence at Combient AB, an industrial joint venture between 21 large companies in Sweden and Finland.

Data Science is the study of the generalizable extraction of knowledge from data in a practical and scalable manner. A data scientist is characterized by an integrated skill set spanning mathematics, statistics, machine learning, artificial intelligence, databases and optimization along with a deep understanding of the craft of problem formulation to engineer effective solutions (DOI:10.1145/2500499, DOI: 10.1126/science.aaa8415). This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools.

In particular, in *Introduction to Data Science*, they will be introduced to the basic skills needed to collect, store, extract, transform, load, explore, model, evaluate, tune, test and predict using large structured and unstructured datasets from the real world.
The course will use Apache Spark, an open-source, fast and general engine for large-scale data processing.
Various common methods will be covered in depth, with a focus on the student’s ability to execute the data science process on their own through a course project (in *Fundamentals of Data Science*).

**Target group/s and recommended background**

Students are recommended to have basic knowledge of algorithms and some programming experience (equivalent to completing an introductory course in computer science), and some familiarity with linear algebra, calculus, probability and statistics.
These basic concepts will be introduced quickly, and one could take the course by putting extra effort into these basics as the course progresses.
Successful completion of the course on *Introduction to Data Science* or equivalent and an interest in doing a course project in a small team are prerequisites for *Fundamentals of Data Science*.

**Contents, study format and form of examination**

*Introduction to Data Science* will cover the following contents:

- key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data scientist’s toolkit: Shell/Scala/SQL/Python/R, etc.
- practical understanding of the *data science process*:
  - ingest, extract, transform, load and explore structured and unstructured datasets
  - model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
  - communicate and serve the model’s predictions to the clients
- practical applications of predictive models for classification and regression, using case-studies of real datasets
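
The model, train/fit, validate/select, tune, test and predict steps listed above can be sketched on a single machine. This is a minimal illustration in plain Python, not the course's Spark code: the synthetic data, the ridge-penalised one-variable least-squares model, the 60/20/20 split, the candidate penalties and all function names are assumptions made for the example.

```python
import random

def make_data(n, seed=0):
    """Synthetic data from an assumed toy model: y = 2x + 1 + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in xs]

def fit(train, lam):
    """Train: closed-form least squares for y = a*x + b, ridge penalty lam on a."""
    n = len(train)
    x_bar = sum(x for x, _ in train) / n
    y_bar = sum(y for _, y in train) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    sxx = sum((x - x_bar) ** 2 for x, _ in train)
    a = sxy / (sxx + lam)
    b = y_bar - a * x_bar
    return a, b

def predict(model, x):
    """Predict: apply the fitted line to a new input."""
    a, b = model
    return a * x + b

def mse(model, data):
    """Evaluate: mean squared error of the model on a dataset."""
    return sum((predict(model, x) - y) ** 2 for x, y in data) / len(data)

# Split once into train / validation / test sets (60/20/20).
data = make_data(200)
train, valid, test = data[:120], data[120:160], data[160:]

# Tune: choose the penalty with the lowest validation error.
candidates = [0.0, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: mse(fit(train, lam), valid))

# Test: report the error on data used for neither fitting nor tuning.
model = fit(train, best_lam)
test_mse = mse(model, test)
print(best_lam, model, test_mse)
```

Keeping the test set out of both fitting and tuning is the design choice that matters here: the validation set is used to select the penalty, so only the test error gives an honest estimate of predictive performance.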

There will be assignments involving computer programming and data analysis. The grade is based on attendance, course participation and successful completion of programming assignments.

**Contents, study format and form of examination**

*Fundamentals of Data Science* will cover the following contents:

- key concepts in distributed fault-tolerant filestores and in-memory computing
- understanding the *data science process* (the underlying mathematics, numerics and statistics as well as concerns around privacy and ethics at a deeper level)
- applications of current predictive models and methods in data science to make/take common decisions/actions, including classification, regression, anomaly detection and recommendation, using case-studies of real datasets
- apply the data science process to one’s own case study and work collaboratively in a group (course project).

There will be assignments involving computer programming and data analysis, and written and oral presentation of the course project. The grade will be based on attendance, course participation, successful completion of programming assignments and the final course project.

The *course project* could take one of the following forms:

- analyzing an interesting dataset using existing methods
- obtaining your own dataset and analyzing it using existing methods
- building your own data product
- focussing on the theoretical properties of an algorithm, etc.

Students are encouraged to work in teams of two or three for a project.

The project should be orally presented during the first week of January 2018 and made available as a written report with all source code and explanations.
Assignments, on the other hand, are to be completed individually.

**The exact set of topics may change slightly from the tentative outline below in order to cater better to the registered students at Uppsala University.**
{: .notice--danger}

- Introduction: What is Data Science?
  - Big Data and Data Science - beyond buzz-words
    - fault tolerance and distributed file-stores
    - distributed in-memory processing in Apache Spark
  - History and latest landscape of research and practice
  - Skill sets of a data scientist (maths, stats, computer science and software engineering)
  - The Data Science Process

- Basics of Probability, Statistics, Linear Algebra, Calculus and Programming
  - Populations and samples - basic concepts (random variables, density and distribution functions)
  - Simulate from basic discrete and continuous distributions (python/scala/R)
  - Refresher in Linear Algebra (numpy, scala-breeze, R)
  - concepts in model selection and tuning via cross-validation and testing
  - Statistical modeling and fitting a model (linear regression, least squares/maximum likelihood, gradient descent)
  - Shell command-line and crash-course/refresher in basic programming (python/scala/R)

- Introduction to Map-Reduce and Distributed Computing
  - Resilient Distributed Datasets
  - Transformations and Actions in Apache Spark (Introduction)
  - Basics of Functional programming (python lambda functions and scala closures)
  - Case Study 0: Word-count of US State of the Union Addresses

- Ingest, Extract, Transform, Load and Explore with noSQL
  - Case Study 0: US State of the Union Addresses
  - Case Study 1: Power-plant data
  - Case Study 2: 1 Million Songs
  - SQL basics - select, filter, join, group by, aggregate, etc. using SparkSQL

- Two Basic Supervised Machine Learning Algorithms
  - Linear Regression (power-plant data)
  - Decision Trees for Classification (hand-written digit recognition)

- Two Basic Unsupervised Machine Learning Algorithms
  - k-means (1 million songs dataset)
  - Gaussian Mixture Models and EM Algorithm

- Unstructured to Semi-structured text data
  - Supervised: Sentiment analysis with Support Vector Machine (twitter dataset)
  - Unsupervised: Latent Dirichlet Allocation (news groups dataset)
  - Assignment: Build your own sentiment detector

- Dimensionality Reduction
  - Distributed Linear Algebra - basic concepts
  - Singular Value Decomposition
  - Principal Component Analysis

- Collaborative Filtering for Recommendation Systems
  - Matrix completion via Alternating Least Squares
  - Assignment: build your own recommendation system

- Neural networks
  - Linear and logistic regression as neural networks
  - Back propagation for gradient descent
  - Use of pre-trained neural networks from Google/Baidu/Facebook in your machine learning pipeline

- Mining Networks and Graphs
  - Social networks as graphs (twitter data)
  - Extract, transform and load network data
  - Discovery of communities in graphs (wikipedia click streams)
  - label and belief propagation
  - querying sub-structures in graphs (US Airport network)

- Data Science and Ethical Issues
  - Discussions on ethics, privacy and security
  - Case studies from the field

- Project Presentations
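
Several items in the outline above (Map-Reduce, transformations and actions, lambda functions and the word-count case study) fit into one small sketch. The following uses only the Python standard library as a local stand-in for the idea and is not Spark's API; the two sample lines are made up and are not taken from any State of the Union address, and the helper name `add_pair` is hypothetical.

```python
from functools import reduce

# Made-up sample lines (hypothetical, not real State of the Union text).
lines = [
    "the state of the union is strong",
    "the union of the states endures",
]

# "Transformations": lazily describe the work; nothing is computed yet.
words = (word for line in lines for word in line.split())  # flatMap-like step
pairs = map(lambda w: (w, 1), words)                        # map-like step

def add_pair(counts, pair):
    """Reduce step: fold one (word, 1) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

# "Action": the reduce consumes the lazy pipeline and forces evaluation.
counts = reduce(add_pair, pairs, {})

# Three most frequent words, ties broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top)
```

The generator and `map` object play the role of lazily evaluated transformations, and the final `reduce` plays the role of an action; in a distributed setting the same per-word sums would be computed on partitions of the data and then combined.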

- Introduction
- Why Spark?
- Login to databricks
- Scala Crash Course
- RDDs
- RDDs HOMEWORK
- Word Count - SOU
- Spark SQL Basics
- ETL Diamonds Data
- ETL Power Plant
- Wiki Click streams
- Simulation Intro
- Machine Learning Intro
- K-Means 1MSongs Intro
- 1MSongs - 1 ETL
- 1MSongs - 2 Explore
- 1MSongs - 3 Model
- Decision Trees for Digits
- Linear Algebra Intro
- Linear Regression Intro
- Distrib. Linear Algebra
- Power Plant - Model Tune Evaluate
- Activity Detection - Random Forest
- Graph Frames Intro
- Ontime Flight Performance
- Spark Streaming Intro
- Extended Twitter Utils
- Tweet Transmission Trees
- REST Twitter API
- Tweet Collector
- Tweet Track, Follow
- Tweet Hashtag Counter
- Tweet Classifier
- Power Plant - Model Tune Evaluate Deploy
- Geospatial Analytics in Magellan
- NY Taxi trips in Magellan
- Old Bailey Online - ETL of XML
- 20 Newsgroups - Latent Dirichlet Allocation
- Cornell Movie Dialogs - Latent Dirichlet Allocation
- Movie Recommendation - Alternating Least Squares
- Animal Names Streaming Files
- Normal Mixture Streaming Files
- Structured Streaming Prog Guide
- Graph Mixture Streaming Files
- Structured Streaming of JSONs
- T-Digest Normal Mixture Streaming Files
- Sketching with T-Digest
- Intro to Deep Learning
- Outline for DL
- Neural Networks
- Deep feed Forward NNs with Keras
- Hello Tensorflow
- Batch Tensorflow with Matrices
- Convolutional Neural Nets
- MNIST: Multi-Layer-Perceptron
- MNIST: Convolutional Neural net
- CIFAR-10: CNNs
- Recurrent Neural Nets and LSTMs
- LSTM solution
- LSTM spoke Zarathustra
- 2017 Advise from Data Industry
- Potential Projects
- 2017 dbc ARCHIVES

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.

- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
- Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
- Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
- Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

- Linear Algebra Refresher Course (with Python)
- Intro to Descriptive Statistics
- Intro to Inferential Statistics

- Learning Spark: Lightning-Fast Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale, O'Reilly, 2015.
- Command-line Basics

- Intro to Data Analysis: Using NumPy and Pandas
- Data Analysis with R by facebook
- Data Visualization and D3.js
- Scala Programming
- Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.