# Advanced Analytics for a Big Data World

This page contains slides, references and materials for the "Advanced Analytics for a Big Data World" course.

💬 Psst: want to keep up to date on news in data science, analytics, big data, ML and AI? [Consider subscribing to our Data Science Briefings newsletter.](https://www.dataminingapps.com/dataminingapps-newsletter/)

*Last updated at 2019-05-20*

# Table of contents


# Course 1: February 12

## Slides

* [0 - Course introduction](./slides/0 - Course introduction.pdf)
* [1 - Data science process](./slides/1 - Data science process.pdf)

## Assignment Information

The evaluation of this course consists of a lab report (50% of the marks) and a closed-book written exam with both multiple-choice and open questions (50% of the marks).

* Your lab report will consist of your write-ups of four assignments, which will be made available on this page throughout the semester
* You will work in groups of four students (three or five can also be accepted)
* The four assignments consist of (1) a research paper discussion, (2) building a predictive model using R or Python in a competition-style setup, (3) a Spark Streaming text mining assignment and (4) a Neo4j graph assignment
* Per assignment, you describe your results (screenshots, numbers, approach) in 2-3 pages; more detailed information will be provided per assignment
* You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday June 2nd

**For forming groups, please see the Toledo page.**

**Note for externals (i.e. anyone who'll NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in the assignments (they'll be posted here as well) but not required to. In case you want to join, feel free to form groups or work individually. Feel free to skip assignments.
## Background Information

Some extra reading material and sources based on the examples seen in class (optional):

* [AlphaGo](https://deepmind.com/blog/alphago-zero-learning-scratch/), [AlphaZero](https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/) and [AlphaStar](https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/)
* Self-driving cars: [1](http://www.nvidia.com/object/drive-px.html), [2](http://kevinhughes.ca/blog/tensor-kart), [3](https://github.com/bethesirius/ChosunTruck), [4](https://www.youtube.com/watch?v=X4u2DCOLoIg), [5](https://selfdrivingcars.mit.edu/), [6](https://eu.udacity.com/course/self-driving-car-engineer-nanodegree--nd013)
* [Pix2pix for image translation](http://affinelayer.com/pixsrv/index.html)
* [Google's "Teachable Machine"](https://teachablemachine.withgoogle.com/)
* Neural networks for skin cancer detection: [1](https://www.nature.com/articles/nature21056.epdf), [2](http://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer/)
* [An AI just beat top lawyers at their own game](https://mashable.com/2018/02/26/ai-beats-humans-at-contracts/)
* [3D Generative-Adversarial Modeling](http://3dgan.csail.mit.edu/) and [Image2Mesh](https://arxiv.org/abs/1711.10669)
* [Generating Videos with Scene Dynamics](http://web.mit.edu/vondrick/tinyvideo/)
* [Predicting selfie success](http://karpathy.github.io/2015/10/25/selfie/)
* [Desnapify](https://github.com/ipsingh06/ml-desnapify)
* [Neural Style Transfer in "Come Swim"](https://arxiv.org/abs/1701.04928)
* [Sexual orientation from photographs](https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph)
* [AI is sending people to jail and getting it wrong](https://www.technologyreview.com/s/612775/algorithms-criminal-justice-ai/)
* Another "scary" application, DeepFakes: [1](https://boingboing.net/2018/02/13/there-ive-done-something.html), [2](https://www.theverge.com/2018/2/7/16982046/reddit-deepfakes-ai-celebrity-face-swap-porn-community-ban)
* [IQ Test Result: Advanced AI Machine Matches Four-Year-Old Child's Score](https://www.technologyreview.com/s/541936/iq-test-result-advanced-ai-machine-matches-four-year-old-childs-score/)
* [Hiring Data Scientists: What to Look for?](http://www.dataminingapps.com/2015/06/hiring-data-scientists-what-to-look-for/)
* [Sky high salaries in AI](https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons) with [discussion here](https://news.ycombinator.com/item?id=16366815)
* In 2019, still [the most promising job](https://www.techrepublic.com/article/why-data-scientist-is-the-most-promising-job-of-2019/)
* [The famous tank story and alternative tales of validity in AI](https://www.gwern.net/Tanks)
* [How To Break Anonymity of the Netflix Prize Dataset](https://arxiv.org/abs/cs/0610105)
* [Why UPS drivers don’t turn left and you probably shouldn’t either](http://www.independent.co.uk/news/science/why-ups-drivers-don-t-turn-left-and-you-probably-shouldn-t-either-a7541241.html)
* [Your Garbage Data Is A Gold Mine](https://www.fastcompany.com/3063110/the-rise-of-weird-data)
* Another "interesting" insight: [Why does Amazon use packages that are too large?](http://www.distractify.com/fyi/2017/12/28/Z1UYuIS/amazon-huge-boxes)
* [The drivetrain approach for "data products"](https://www.oreilly.com/ideas/drivetrain-approach-data-products), from the people behind [https://course.fast.ai/](https://course.fast.ai/)
* [The brutal fight to mine your data and sell it to your boss](https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss)
* [What to do before you fire a pivotal employee](https://hbr.org/2016/01/what-to-do-before-you-fire-a-pivotal-employee)

On societal impact (we'll discuss this a bit further later on as well):

* [http://www.independent.co.uk/news/world/asia/china-surveillance-big-data-score-censorship-a7375221.html](http://www.independent.co.uk/news/world/asia/china-surveillance-big-data-score-censorship-a7375221.html)
* [https://www.propublica.org/article/breaking-the-black-box-how-machines-learn-to-be-racist](https://www.propublica.org/article/breaking-the-black-box-how-machines-learn-to-be-racist)
* [https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/](https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/)
* [https://www.theguardian.com/science/2016/sep/01/how-algorithms-rule-our-working-lives](https://www.theguardian.com/science/2016/sep/01/how-algorithms-rule-our-working-lives)
* [http://www.fast.ai/2017/11/02/ethics/](http://www.fast.ai/2017/11/02/ethics/)
* [https://www.washingtonpost.com/news/theworldpost/wp/2017/10/09/pierre-omidyar-6-ways-social-media-has-become-a-direct-threat-to-democracy/](https://www.washingtonpost.com/news/theworldpost/wp/2017/10/09/pierre-omidyar-6-ways-social-media-has-become-a-direct-threat-to-democracy/)
* [https://www.vrt.be/vrtnws/nl/2017/11/20/zo-is-het-werken-voor-deliveroo---hoe-sneller-ik-fiets--hoe-verd/](https://www.vrt.be/vrtnws/nl/2017/11/20/zo-is-het-werken-voor-deliveroo---hoe-sneller-ik-fiets--hoe-verd/) (in Dutch: "This is what working for Deliveroo is like -- the faster I cycle, the more I earn")
* [https://qz.com/1133504/to-predict-crime-chinas-tracking-medical-histories-cafe-visits-supermarket-membership-human-rights-watch-warns/](https://qz.com/1133504/to-predict-crime-chinas-tracking-medical-histories-cafe-visits-supermarket-membership-human-rights-watch-warns/)
* [https://www.theverge.com/2018/2/8/16990030/china-facial-recognition-sunglasses-surveillance](https://www.theverge.com/2018/2/8/16990030/china-facial-recognition-sunglasses-surveillance)

For those looking for one of the better Western presentations on the Chinese Social Credit system: [https://media.ccc.de/v/35c3-9904-the_social_credit_system](https://media.ccc.de/v/35c3-9904-the_social_credit_system)

Some older articles which are pretty interesting:

* [KFC China is using facial recognition tech to serve customers - but are they buying it?](https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition)
* [Dubai police launch AI that can spot crimes](http://newatlas.com/dubai-police-crime-prediction-software/47092/)
* [China Tries Its Hand at Pre-Crime](https://www.bloomberg.com/news/articles/2016-03-03/china-tries-its-hand-at-pre-crime)
* [Chicago turns to big data to predict gun and gang violence](https://www.engadget.com/2016/05/23/chicago-turns-to-big-data-to-predict-gun-and-gang-violence/)
* [The Role of Data and Analytics in Insurance Fraud Detection](http://www.insurancenexus.com/fraud/role-data-and-analytics-insurance-fraud-detection)
* [This employee ID badge monitors and listens to you at work — except in the bathroom](https://www.washingtonpost.com/news/business/wp/2016/09/07/this-employee-badge-knows-not-only-where-you-are-but-whether-you-are-talking-to-your-co-workers/)

For those looking for a set of tutorials to get started with data science programming, or who want to update their R / Python skills already: [https://www.datacamp.com/](https://www.datacamp.com/).
# Course 2: February 19

## Slides

* [2 - Preprocessing](./slides/2 - Preprocessing.pdf)

## Background Information

In the news:

* [Why are Machine Learning Projects so Hard to Manage?](https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641)
* [This is why AI has yet to reshape most businesses](https://www.technologyreview.com/s/612897/this-is-why-ai-has-yet-to-reshape-most-businesses/)
* [Researchers, scared by their own work, hold back “deepfakes for text” AI](https://arstechnica.com/information-technology/2019/02/researchers-scared-by-their-own-work-hold-back-deepfakes-for-text-ai/); also see [https://blog.openai.com/better-language-models/](https://blog.openai.com/better-language-models/) and [http://approximatelycorrect.com/2019/02/17/openai-trains-language-model-mass-hysteria-ensues/](http://approximatelycorrect.com/2019/02/17/openai-trains-language-model-mass-hysteria-ensues/)
* [Audio AI: isolating vocals from stereo music using Convolutional Neural Networks](https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785)
* [Data science is different now](https://veekaybee.github.io/2019/02/13/data-science-is-different/)
* [Google and Microsoft Warn That AI May Do Dumb Things](https://www.wired.com/story/google-microsoft-warn-ai-may-do-dumb-things/)
* [Artificial Intelligence Study of Human Genome Finds Unknown Human Ancestor](https://www.smithsonianmag.com/science-nature/artificial-intelligence-study-human-genome-finds-unknown-human-ancestor-species-180971436/)
* [Pricing algorithms can learn to collude with each other to raise prices](https://www.technologyreview.com/the-download/612947/pricing-algorithms-can-learn-to-collude-with-each-other-to-raise-prices/)
* [https://thisairbnbdoesnotexist.com/](https://thisairbnbdoesnotexist.com/) and [https://thispersondoesnotexist.com/](https://thispersondoesnotexist.com/)

Extra references:

* [What’s your ML test score? A rubric for ML production systems](https://research.google.com/pubs/pub45742.html): an absolutely great paper on thinking about models in production!
* [More on data "leakage" and why you should avoid it](https://www.kaggle.com/wiki/Leakage)
* [MICE is a popular package for dealing with missing values in R](https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/)
* [More explanation on the hashing trick on Wikipedia](https://en.wikipedia.org/wiki/Feature_hashing)
* [More information on principal component analysis (PCA)](http://setosa.io/ev/principal-component-analysis/)
* Vector representation of words: Word2Vec, see e.g. [deeplearning4j](https://deeplearning4j.org/word2vec) and the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/word2vec); [GloVe](http://nlp.stanford.edu/projects/glove/) is an alternative, similar approach. We'll talk more about this later on as well
* [OpenCV](http://opencv.org/) (for feature extraction from facial images), or see [this page](https://github.com/samyak-268/facial-feature-detection)
* [smbinning, an R package for weights of evidence encoding](https://cran.r-project.org/web/packages/smbinning/index.html)
* [More on the leave-one-out mean](https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748) as discussed on Kaggle
* [Entity embeddings of categorical variables](https://arxiv.org/pdf/1604.06737.pdf)
* Interesting application of PCA to "understand" the latent features of a deep learning network: [https://www.youtube.com/watch?v=4VAkrUNLKSo](https://www.youtube.com/watch?v=4VAkrUNLKSo)

More on IV and SSI:

More information on the Information Value: [http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/](http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/).
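To make the Information Value concrete, here is a minimal sketch of how WoE and IV are typically computed per bin of a scorecard variable. The bin counts below are made up for illustration, and no smoothing of empty bins is applied (real packages such as smbinning handle such details differently):

```python
import math

# Hypothetical counts of (good, bad) outcomes per bin of a scorecard variable
bins = {"low": (100, 50), "mid": (300, 30), "high": (600, 20)}

total_good = sum(g for g, b in bins.values())
total_bad = sum(b for g, b in bins.values())

iv = 0.0
for name, (good, bad) in bins.items():
    dist_good = good / total_good          # share of all goods in this bin
    dist_bad = bad / total_bad             # share of all bads in this bin
    woe = math.log(dist_good / dist_bad)   # Weight of Evidence of the bin
    iv += (dist_good - dist_bad) * woe     # the bin's contribution to the IV
    print(f"{name}: WoE = {woe:.3f}")

print(f"Information Value = {iv:.3f}")
```

A common rule of thumb treats an IV above roughly 0.3 as a strong predictor, though cut-offs vary between sources.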
Another interesting metric to know about is the [System Stability Index](http://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/), which you can use to monitor whether your current population has shifted too much from the observations you had at training time, and which can hence indicate that it's time to retrain your model. A system stability index (SSI) is calculated by contrasting the expected (training) and observed (actual) population percentages across the various score ranges of a scorecard. [Many vendors](https://technet.microsoft.com/en-us/library/cc749032(v=ws.10).aspx) use this in their implementations. It is sometimes called the Population Stability Index or Deviation Index as well (in SAS, for example).

## Assignment 1

The first assignment consists of a paper review. Pick one of the following papers and read through it:

* [A Simple Approach to Ordinal Classification (Eibe Frank and Mark Hall, 2001)](./papers/ordinal_tech_report.pdf)
* [To tune or not to tune the number of trees in random forest? (Philipp Probst and Anne-Laure Boulesteix, 2017)](./papers/tune_or_not.pdf)
* [Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and its Variance Estimate (Indrayudh Ghosal and Giles Hooker, 2018)](./papers/one_time_boost.pdf)
* [Random Forest with Learned Representations for Semantic Segmentation (Byeongkeun Kang and Truong Q. Nguyen, 2019)](./papers/random_forest_semantic.pdf)

The first part of your lab report should contain:

* Paper summary (1-2 paragraphs, a bullet list is also okay): summarize the goals, approach and results of the paper
* Application setting (2-4 paragraphs): reflect on the feasibility of the insights in practice. In which application areas do you see this technique being useful? Can you link this to what we’ve discussed in the course?
* Review (2-4 paragraphs): what are the pros and cons of the technique/paper in your view?
* Expansion (2-4 paragraphs): do you see possible interesting ways to expand upon this technique (or its contents)? How would you do so? Do you see a way to combine this technique with others? Do you see a way to get similar results through other approaches? What are interesting next steps?

As a general tip: don't get overwhelmed by any mathematics that might be present in the paper: the goal is to show that you can get the general gist and impressions, and can assess the usefulness of a technique or approach found in research for practical applications.

**Note for externals (i.e. anyone who'll NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in the assignments but not required to. In case you want to join, feel free to form groups or work individually. Feel free to skip assignments.

# Course 3: February 26

## Slides

* [3 - Supervised learning](./slides/3 - Supervised learning.pdf)

## Background Information

In the news:

* [Keeping up with AI in 2019](https://medium.com/thelaunchpad/what-is-the-next-big-thing-in-ai-and-ml-904a3f3345ef)
* [Life and society are increasingly governed by numbers](https://www.economist.com/books-and-arts/2019/02/23/life-and-society-are-increasingly-governed-by-numbers)
* [When Algorithms Think You Want to Die](https://www.wired.com/story/when-algorithms-think-you-want-to-die/)
* [Don’t Let Robots Pull the Trigger](https://www.scientificamerican.com/article/dont-let-robots-pull-the-trigger/)
* [Forecasting in Python with Prophet](https://mode.com/example-gallery/forecasting_prophet_python_cookbook/)
* [It Takes Two Neurons To Ride a Bicycle](http://paradise.caltech.edu/~cook/papers/TwoNeurons.pdf)
* [Farmworker vs Robot](https://www.washingtonpost.com/news/national/wp/2019/02/17/feature/inside-the-race-to-replace-farmworkers-with-robots/)
* [AAAS: Machine learning 'causing science crisis'](https://www.bbc.com/news/science-environment-47267081)
* [Why I abandoned online data courses for project-based learning](https://medium.freecodecamp.org/why-i-abandoned-online-data-courses-for-project-based-learning-17e350fd9347)

Extra references:

* [Frank Harrell on stepwise regression](https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
* [aerosolve - Machine learning for humans](https://medium.com/airbnb-engineering/aerosolve-machine-learning-for-humans-55efcf602665)
* [ID3.pdf](./papers/ID3.pdf) and [C45.pdf](./papers/C45.pdf): extra material regarding decision trees
* [CloudForest](https://github.com/ryanbressler/CloudForest), an interesting decision tree ensemble implementation with support for three-way splits to deal with missing values, implemented in Go
* Two newer examples using k-NN: [1](https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea) and [2](https://medium.com/learning-machine-learning/recommending-animes-using-nearest-neighbors-61320a1a5934)
* [RIPPER](https://christophm.github.io/interpretable-ml-book/rules.html)

## Assignment 2

The second assignment consists of the construction of a predictive model. The data set describes a churn setting for customers of a Latin-American telco provider. The label ("CHURN") describes a binary target established as follows:

* Customer still at the company by the end of December 2013: CHURN = 0
* Customer left the company in December 2013: CHURN = 1
* The state of the features was extracted based on the situation at the end of October 2013. The prediction is hence set up to predict churn with a one-month "action period". Customers who churned during the course of November were filtered out from the data set.
* The data set was then randomly split into a train and test set.
* Features available: financial and usage features (see below)

Download:

* [telco_train.csv](http://seppe.net/aa/compete/data/telco_train.csv)
* [telco_test.csv](http://seppe.net/aa/compete/data/telco_test.csv)

Instructions:

* The goal is to construct a predictive model to predict whether a customer will churn or not
* Your model needs to be built using R or Python. As an environment, you can use e.g. Jupyter (Lab), RStudio, Google Colaboratory, Microsoft Azure Machine Learning Studio, ... (others are fine too)
* The second part of your lab report should contain a clear overview of your whole modeling pipeline, including approach, exploratory analysis (if any), preprocessing (if any), construction of the model, set-up of the validation, and results of the model
* Feel free to use code fragments, tables, visualisations, ...
* You can use any predictive technique you want. The focus is not on the absolute outcome, but rather on the whole process: general setup, critical thinking, and the ability to get and validate an outcome
* Google around, read documentation, find tutorials, etc.!
* Take care of: performing a thorough validation, treating categorical variables if you need to, dealing with class imbalance
* You're free to use unsupervised techniques for your data exploration part, too. If you decide to build a black-box model, you're free to experiment and include ideas regarding making the result more understandable
* Any other assumptions, insights or thoughts can be included as well: the idea is to take what we've seen in class, get your hands dirty and reflect on what we've seen

**Important: all groups should submit the results of their predictive model at least once to the leaderboard**

* See: [http://seppe.net/aa/compete/](http://seppe.net/aa/compete/)
* An e-mail with a password will be mailed to you
* The goal is not to "win", but to help you reflect on your model's results, seeing how others are doing, ...
* The leaderboard is based on the AUROC measure. The results of your latest submission are used to rank. This means it is your job to keep track of different model versions / approaches / outputs in case you'd like to go back to an earlier result
* The public leaderboard calculates your AUC score on a predetermined 50% subset of the instances in the test set
* Later on, the leaderboard will be frozen (you'll be warned in advance) and the results on the other 50% (hidden leaderboard) will be shown
* You should then reflect on both results and explain accordingly in your report. E.g. if you did well on the public leaderboard but not on the hidden one, what might have caused this?
* Also take some time to reflect on the AUROC measure being used here. Is this the measure you'd have chosen in this setting?

Features:

* ID: customer identifier
* CHURN: binary target
* START\_DATE: date when the customer joined the telco provider
* PREPAID: whether this customer has ever utilized prepaid mobile cards in the past
* FIN\_STATE: financial state parameter
* COUNT\_PAYMENT\_DELAYS\_CURRENT: number of payment arrears at the moment of extracting the features (cumulative over multiple products)
* COUNT\_PAYMENT\_DELAYS\_1YEAR: number of payment arrears over the past year
* DAYS\_PAYMENT\_DELAYS\_CURRENT: number of days the customer paid late as of the moment of extracting the features (note that this is a cumulative count over multiple products); counts go negative based on the number of days
* DAYS\_PAYMENT\_DELAYS\_1YEAR: number of days the customer paid late over the past year
* COMPLAINT\_1WEEK, COMPLAINT\_2WEEKS, COMPLAINT\_1MONTH, COMPLAINT\_3MONTHS, COMPLAINT\_6MONTHS: number of complaints received from the customer (past 1 week, 2 weeks, 1 month, 3 months, and 6 months)
* CLV: customer lifetime value as estimated by a different (simple) model at the time of extracting the features
* COUNT\_OFFNET\_CALLS\_1WEEK, COUNT\_ONNET\_CALLS\_1WEEK: number of onnet (with customers of the same telco) and offnet calls over the past week
* COUNT\_CONNECTIONS\_3MONTH, AVG\_DATA\_1MONTH, AVG\_DATA\_3MONTH: number of data connections over the past 3 months, and average bytes of data used per connection over the past 1 and 3 months; not applicable if the customer doesn't have a data subscription
* COUNT\_SMS\_INC\_ONNET\_6MONTH, COUNT\_SMS\_OUT\_OFFNET\_6MONTH, COUNT\_SMS\_INC\_OFFNET\_1MONTH, COUNT\_SMS\_INC\_OFFNET\_WKD\_1MONTH, COUNT\_SMS\_INC\_ONNET\_1MONTH, COUNT\_SMS\_INC\_ONNET\_WKD\_1MONTH, COUNT\_SMS\_OUT\_OFFNET\_1MONTH, COUNT\_SMS\_OUT\_OFFNET\_WKD\_1MONTH, COUNT\_SMS\_OUT\_ONNET\_1MONTH, COUNT\_SMS\_OUT\_ONNET\_WKD\_1MONTH: counts of SMS messages sent (OUT) and received (INC) over different time frames; again, ONNET indicates that the counterparty was also a subscriber of the same telco, OFFNET indicates they were not; WKD indicates that the aggregation was only done over the weekends (Saturday and Sunday)
* AVG\_MINUTES\_INC\_OFFNET\_1MONTH, AVG\_MINUTES\_INC\_ONNET\_1MONTH, MINUTES\_INC\_OFFNET\_WKD\_1MONTH, MINUTES\_INC\_ONNET\_WKD\_1MONTH, AVG\_MINUTES\_OUT\_OFFNET\_1MONTH, AVG\_MINUTES\_OUT\_ONNET\_1MONTH, MINUTES\_OUT\_OFFNET\_WKD\_1MONTH, MINUTES\_OUT\_ONNET\_WKD\_1MONTH: information concerning the durations of the calls; AVG indicates average per call, otherwise the value indicates a cumulative amount of minutes; INC and OUT indicate incoming and outgoing calls; ONNET indicates that the counterparty was also a subscriber of the same telco, OFFNET indicates they were not; WKD indicates that the aggregation or cumulation was only done over the weekends (Saturday and Sunday)

**Note for externals (i.e. anyone who'll NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in the assignments but not required to. In case you want to join, feel free to form groups or work individually. Feel free to skip assignments.
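To get started on the assignment's modeling pipeline, here is a minimal scikit-learn sketch: drop the identifier, one-hot encode categoricals, and validate a baseline model with the same AUROC measure the leaderboard uses. It runs on a tiny synthetic stand-in for the data set (in practice you would load telco_train.csv instead), and the preprocessing is deliberately crude — it is a starting point, not a recommended pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# In practice: train = pd.read_csv("telco_train.csv")
# Tiny synthetic stand-in with the same kind of columns (binary CHURN target):
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "ID": range(200),
    "FIN_STATE": rng.choice(["A", "B", "C"], 200),      # categorical feature
    "CLV": rng.normal(100, 25, 200),                    # numeric feature
    "COUNT_PAYMENT_DELAYS_1YEAR": rng.poisson(1, 200),  # numeric feature
})
train["CHURN"] = (train["COUNT_PAYMENT_DELAYS_1YEAR"]
                  + rng.normal(0, 1, 200) > 1.5).astype(int)

# Minimal preprocessing: drop the identifier, one-hot encode categoricals,
# and impute any missing values with 0 (a deliberately crude choice)
X = pd.get_dummies(train.drop(columns=["ID", "CHURN"])).fillna(0)
y = train["CHURN"]

# Baseline model, validated with the measure the leaderboard uses (AUROC)
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"5-fold AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

From here you would add the steps the instructions ask for: exploratory analysis, class-imbalance handling, and a comparison of several model versions before submitting predictions on telco_test.csv.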
# Course 4: March 5 ## Slides * [4 - Model evaluation](./slides/4 - Model evaluation.pdf) ## Background Information In the news: * [Why There Will Be No Data Science Job Titles By 2029](https://www.forbes.com/sites/forbestechcouncil/2019/02/04/why-there-will-be-no-data-science-job-titles-by-2029/#7a2411a83a8f) * [Microsoft nni](https://github.com/Microsoft/nni) * [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook) * [China’s Tech Firms Are Mapping Pig Faces](https://www.nytimes.com/2019/02/24/business/china-pig-technology-facial-recognition.html) * [Beyond local pattern matching](http://ai.stanford.edu/blog/beyond_local_pattern_matching/) * [Machine learning can boost the value of wind energy](https://deepmind.com/blog/machine-learning-can-boost-value-wind-energy/) * [Seven Myths in Machine Learning Research](https://crazyoscarchang.github.io/2019/02/16/seven-myths-in-machine-learning-research/) * [Various views of variability](http://www.storytellingwithdata.com/blog/2019/2/21/various-views-of-variability) Extra references: * [ROSE](https://cran.r-project.org/web/packages/ROSE/index.html) is a popular package for dealing with over/undersampling in R * More on the ROC curve: [1](https://arxiv.org/pdf/1812.01388.pdf), [2](http://www.rduin.nl/presentations/ROC%20Tutorial%20Peter%20Flach/ROCtutorialPartI.pdf), [3](https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221) * [Averaging ROC curves for multiclass](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html) * [h-index.pdf](./papers/h-index.pdf): paper regarding the h-index as an alternative for AUC * [A blog post explaining cross validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) * More on probability calibration [here](http://scikit-learn.org/stable/modules/calibration.html) and 
[here](http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression) * [More on the System Stability Index](https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/) * [Assertive R Programming with assertr](https://cran.r-project.org/web/packages/assertr/vignettes/assertr.html) * [Visibility and Monitoring for Machine Learning Models](http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/) * [What’s your ML test score? A rubric for production ML systems](https://research.google.com/pubs/pub45742.html) * [Ten ways your data science project is going to fail](http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/) (a bit older, but still fun) * [Towards a data science maturity model](https://blog.dominodatalab.com/introducing-the-data-science-maturity-model/) * [Luigi](https://github.com/spotify/luigi) is a Python module that helps you build complex pipelines of batch jobs, so is [dagobah](https://github.com/thieman/dagobah) and [Airflow](https://airflow.apache.org/) # Course 5: March 12 ## Slides * [5 - Data science tools](./slides/5 - Data science tools.pdf) * [6 - Ensemble modeling and interpretability](./slides/6 - Ensemble modeling and interpretability.pdf) ## Background Information In the news: * [Forty percent of ‘AI startups’ in Europe don’t actually use AI](https://www.theverge.com/2019/3/5/18251326/ai-startups-europe-fake-40-percent-mmc-report) * [A new study finds a potential risk with self-driving cars: failure to detect dark-skinned pedestrians](https://www.vox.com/future-perfect/2019/3/5/18251924/self-driving-car-racial-bias-study-autonomous-vehicle-dark-skin) * [Is it a Duck or a Rabbit? 
For Google Cloud Vision, it depends how the image is rotated](https://www.reddit.com/r/dataisbeautiful/comments/aydqig/is_it_a_duck_or_a_rabbit_for_google_cloud_vision/) * [ImageNet-trained CNNs are biased towards texture](https://github.com/rgeirhos/texture-vs-shape) and [paper](https://openreview.net/pdf?id=SkfMWhAqYQ) * [Here’s How We’ll Know an AI Is Conscious](http://nautil.us/blog/heres-how-well-know-an-ai-is-conscious) * [The best resources in Machine Learning & AI](http://bestofml.com/) * [Transform ML models into a native code (Java, C, Python, etc.) with zero dependencies](https://github.com/BayesWitnesses/m2cgen), also see [https://github.com/nok/sklearn-porter](https://github.com/nok/sklearn-porter), [https://github.com/konstantint/SKompiler](https://github.com/konstantint/SKompiler) and [https://github.com/jonnor/emlearn](https://github.com/jonnor/emlearn) * [Amazon thinks AI will help solve its counterfeits problem](https://edition.cnn.com/2019/02/28/tech/amazon-counterfeits-project-zero/index.html) * [The AI-Art Gold Rush Is Here](https://www.theatlantic.com/technology/archive/2019/03/ai-created-art-invades-chelsea-gallery-scene/584134/) * [Shark or Baseball? 
Inside the ‘Black Box’ of a Neural Network](https://www.wired.com/story/inside-black-box-of-neural-network/) Extra references on data science tools: * [tidyverse](https://www.tidyverse.org/) * [https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/) for cheat sheets on most tidy packages * [caret](http://caret.r-forge.r-project.org/), [mlr](https://mlr.mlr-org.com/) and [modelr](https://modelr.tidyverse.org/) * [http://r4ds.had.co.nz/](http://r4ds.had.co.nz/): R for data science book * Visualizations with R: `ggplot2` and [`ggvis`](http://ggvis.rstudio.com/), also see [`shiny`](https://shiny.rstudio.com/) * Other R packages: see slides * [Minimally sufficient pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428): great tour through Pandas' API * A fun commparison of Python visualization libraries: [https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/) * Other Python packages: see slides * Linking between R and Python: see e.g. 
[rpy2](https://rpy2.readthedocs.io/en/version_2.8.x/) (R in Python) or [reticulate](https://github.com/rstudio/reticulate) (Python in R) * d3.js galleries: [1](https://github.com/mbostock/d3/wiki/Gallery), [2](http://bl.ocks.org/), [3](http://bl.ocks.org/mbostock), [4](https://bost.ocks.org/mike/) * [Plotly Dash](https://plot.ly/products/dash/) * Working with large files: [ff](https://cran.r-project.org/web/packages/ff/index.html), [bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html), [disk.frame (great package)](https://github.com/xiaodaigh/disk.frame), [Dask](https://github.com/dask/dask), [Pandas on Ray](https://modin.readthedocs.io/en/latest/pandas_on_ray.html), [vaex](https://github.com/vaexio/vaex), [Sframe](https://github.com/apple/turicreate) * Some issues with notebooks: [1](https://www.reddit.com/r/Python/comments/9aoi35/i_dont_like_notebooks_joel_grus_jupytercon_2018/), [2](https://yihui.name/en/2018/09/notebook-war/) * Hosted notebooks: [Azure ML Studio](https://studio.azureml.net/), [Google Colab](https://colab.research.google.com/), [Kaggle Kernels](https://www.kaggle.com/kernels) are free options * [Learning the Gitflow git workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow) Extra references on ensemble modeling: * The [documentation of scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html) is very complete in terms of ensemble modeling * Kaggle post on [model stacking](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/) * A great post describing ["smoothing" using an ensemble](http://www.overkillanalytics.net/more-is-always-better-the-power-of-simple-ensembles/) * [Random forest.pdf](./papers/Random forest.pdf): the original paper on random forests * [Adaboost.pdf](./papers/Adaboost.pdf): the original paper on AdaBoost * [xgboost documentation](https://xgboost.readthedocs.io/en/latest/) with a good 
[introduction](https://xgboost.readthedocs.io/en/latest/tutorials/model.html) * [More on gradient boosting](https://explained.ai/gradient-boosting/index.html) * By the way, people are still improving on extreme gradient boosting and random forests, see e.g. [xForest](https://github.com/aksnzhy/xforest): "A super-fast and scalable Random Forest library based on fast histogram decision tree algorithm and distributed bagging framework" and [thundergbm](https://github.com/Xtra-Computing/thundergbm): "Fast GBDTs and Random Forests on GPUs" Extra references on interpretability: * [Beware of using feature importance!](http://explained.ai/rf-importance/index.html) * [Permutation importance: a corrected feature importance measure](https://academic.oup.com/bioinformatics/article/26/10/1340/193348) * [Interpreting random forests: Decision path gathering](http://blog.datadive.net/interpreting-random-forests/) * [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap) * [Local interpretable model-agnostic explanations](https://github.com/marcotcr/lime) * [Fantastic book on the topic](https://christophm.github.io/interpretable-ml-book/) * [Another great overview](https://github.com/jphall663/awesome-machine-learning-interpretability) * [rfpimp package](https://pypi.org/project/rfpimp/) * [Forest floor](http://forestfloor.dk/) for higher-dimensional partial dependence plots * [The pdp R package](https://cran.r-project.org/web/packages/pdp/pdp.pdf) * [The iml R Package](https://cran.r-project.org/web/packages/iml/index.html) * [Descriptive mAchine Learning EXplanations (DALEX) R Package](https://github.com/pbiecek/DALEX) * [eli5 for Python](https://eli5.readthedocs.io/en/latest/index.html) * [Skater for Python](https://github.com/datascienceinc/Skater) * scikit-learn has Gini-reduction based importance (not so nice), but permutation importance can be done manually: 
[sklearn.model_selection.permutation_test_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.permutation_test_score.html) * Or with [https://github.com/parrt/random-forest-importances](https://github.com/parrt/random-forest-importances) * Or with [https://github.com/ralphhaygood/sklearn-gbmi](https://github.com/ralphhaygood/sklearn-gbmi) (sklearn-gbmi) * [pdpbox for Python](https://github.com/SauceCat/PDPbox) * [vip for R](https://koalaverse.github.io/vip/index.html) * [https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc](https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc) * [More on interaction effects](http://blog.macuyiko.com/post/2019/discovering-interaction-effects-in-ensemble-models.html) * [Skope-rules](https://github.com/scikit-learn-contrib/skope-rules): Skope-rules aims at learning logical, interpretable rules for "scoping" a target class, i.e. 
detecting with high precision instances of this class # Course 6: March 19 ## Slides * [7 - Unsupervised learning and anomaly detection](./slides/7 - Unsupervised learning and anomaly detection.pdf) ## Background Information In the news: * [Facial recognition's 'dirty little secret': Millions of online photos scraped without consent](https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921) * [How to hack your face to dodge the rise of facial recognition tech](https://www.wired.co.uk/article/avoid-facial-recognition-software) * [Beware the data science pin factory](https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/) * [Why Model Explainability is The Next Data Science Superpower](https://towardsdatascience.com/why-model-explainability-is-the-next-data-science-superpower-b11b6102a5e0) * [The AI Roles Some Companies Forget to Fill](https://hbr.org/2019/03/the-ai-roles-some-companies-forget-to-fill) * [Data Science Is Now Bigger Than 'Big Data'](https://www.forbes.com/sites/kalevleetaru/2019/03/13/data-science-is-now-bigger-than-big-data/#6742dd3b3fcf) * [Mozilla releases Iodide, an open source browser tool for publishing dynamic data science](https://alpha.iodide.io/) Extra references: * [Beer and diapers](https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136) * [Pregnancy scores](http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/) * [Presentation on association rule mining with R, including some visualisation options](https://www.slideshare.net/rdatamining/association-rule-mining-with-r) * [arulesViz R package](https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf) * [SPMF](http://www.philippe-fournier-viger.com/spmf/) * [Comparing different hierarchical linkage methods on toy 
datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html#sphx-glr-auto-examples-cluster-plot-linkage-comparison-py) * [The 5 Clustering Algorithms Data Scientists Need to Know](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68) * [Cool visualisation of the DBSCAN clustering technique in the browser](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/) * [More on the Gower distance](https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9) * [Self-Organising Maps for Customer Segmentation using R](https://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/) * [t-SNE](https://lvdmaaten.github.io/tsne/) and using it for [anomaly detection](https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00) * [http://distill.pub/2016/misread-tsne/](http://distill.pub/2016/misread-tsne/) provides very interesting visualisations and more explanation on t-SNE * [Isolation forests](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py) * [Local outlier factor](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py) * [Twitter's anomaly detection package](https://github.com/twitter/AnomalyDetection) and [prophet](https://facebook.github.io/prophet/) * BigML is one of the few commercial vendors that implements isolation forests and is capable of dealing with categorical data, see [1](https://bigml.com/features/anomaly-detection) and [2](https://bigml.com/api/anomalies) * [An interesting article on detecting NBA all-stars using CADE](http://darrkj.github.io/blog/2014/may102014/) * Papers on [DBSCAN](./papers/dbscan.pdf), [isolation forests](./papers/iforest.pdf) and [CADE](./papers/CADE.pdf) # Course 7: March 26 ## Slides * [8 - Deep 
learning](./slides/8 - Deep learning.pdf) ## Background Information In the news: * [The little black dress reimagined by an AI](https://lbd-ai.com/) * [Red pepper chef](https://medium.com/@anthony_sarkis/red-pepper-chef-from-new-training-data-to-deployed-system-in-a-few-lines-of-code-8d25b77fe447) * [A landscape diagram for Python data](https://community.ibm.com/community/user/datascience/blogs/paco-nathan/2019/03/12/a-landscape-diagram-for-python-data) * [AI predicts office workers’ room temperature preferences](https://venturebeat.com/2019/03/22/ai-predicts-office-workers-room-temperature-preferences/) * [Do you see what AI sees? Study finds that humans can think like computers](https://hub.jhu.edu/2019/03/22/computer-vision-fooled-artificial-intelligence/) * [Dashcam video shows Tesla steering toward lane divider—again](https://arstechnica.com/cars/2019/03/dashcam-video-shows-tesla-steering-toward-lane-divider-again/) * [How El País used AI to make their comments section less toxic](https://www.blog.google/outreach-initiatives/google-news-initiative/how-el-pais-used-ai-make-their-comments-section-less-toxic/) * [Honoring J.S. 
Bach with our first AI-powered Doodle](https://www.google.com/doodles/celebrating-johann-sebastian-bach) * [Careful how you treat today’s AI: it might take revenge in the future](http://theconversation.com/careful-how-you-treat-todays-ai-it-might-take-revenge-in-the-future-112611) * [Algorithms have already taken over human decision making](https://theconversation.com/algorithms-have-already-taken-over-human-decision-making-111436) Extra references: * [A brief history of AI](https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html) * [Backpropagation explained](http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html) * [Introduction to neural networks](https://victorzhou.com/blog/intro-to-neural-networks/) * [Another link explaining ANNs](http://www.emergentmind.com/neural-network) * [Great short YouTube playlist explaining ANNs](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) * [The learning rate finder](https://github.com/surmenok/keras_lr_finder) and [cyclical training](https://arxiv.org/abs/1506.01186): two newer approaches towards adaptive learning rates * [Activation functions: which is better?](https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f) * [Play with different optimizers (stochastic gradient descent, ADAM, ...) in the browser](https://vis.ensmallen.org/) * [More information on different approaches of gradient descent](http://cs231n.github.io/neural-networks-3/) * [Tensorflow playground](https://playground.tensorflow.org/) * ["What's hidden in the hidden layers?"](https://www.cs.cmu.edu/~dst/pubs/byte-hiddenlayer-1989.pdf): an amazing article from '89 that shows people were already trying then to open up the black box, and even playing around with self-driving cars... highly recommended! 
* [Visualizing CNNs in the browser](http://scs.ryerson.ca/~aharley/vis/) * Data augmentation libraries: [1](https://keras.io/preprocessing/image/) and [2](https://github.com/aleju/imgaug) * [Teachable machine](https://teachablemachine.withgoogle.com/): using a pretrained network + k-NN... you can take a look at the JavaScript source to see how it works * These articles also provide more information on CNNs: [http://cs231n.github.io/convolutional-networks/](http://cs231n.github.io/convolutional-networks/), [http://cs231n.github.io/understanding-cnn/](http://cs231n.github.io/understanding-cnn/) (opening their black box), [http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) and [https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/) * [Alibaba launches ‘smile to pay’ facial recognition system at KFC in China](https://www.cnbc.com/2017/09/04/alibaba-launches-smile-to-pay-facial-recognition-system-at-kfc-china.html) * [Beijing KFC is pioneering technology to try to predict and remember people’s fast food choices](https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition) * [Facial Recognition Is Accurate, if You’re a White Guy](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) * [Example of an app using style transfer](https://mspoweruser.com/popular-ios-app-prisma-coming-windows-10-month/) * [Deep dreaming](https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html) * Recurrent neural networks: these articles provide more information on LSTM networks, a type of RNN: [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [https://deeplearning4j.org/lstm](https://deeplearning4j.org/lstm) * [Time 
series prediction with LSTMs](https://www.altumintelligence.com/articles/a/Time-Series-Prediction-Using-LSTM-Deep-Neural-Networks) * And more use cases for [text to image generation](https://arxiv.org/abs/1605.05396), [image to image translation](https://arxiv.org/abs/1611.07004), [increasing image resolution](https://arxiv.org/abs/1609.04802) (CSI Miami style...) and [predicting next video frames](https://arxiv.org/abs/1511.06380) * [Original paper on GANs](https://arxiv.org/abs/1406.2661) * Generative adversarial networks: [http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html](http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html), [https://openai.com/blog/generative-models/](https://openai.com/blog/generative-models/), [http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/](http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/), [https://arxiv.org/pdf/1701.00160.pdf](https://arxiv.org/pdf/1701.00160.pdf), [http://cs.stanford.edu/people/karpathy/gan/](http://cs.stanford.edu/people/karpathy/gan/) (simple browser example) * [Training a GAN in the browser](https://poloclub.github.io/ganlab/) * Q-learning: [https://www.nervanasys.com/demystifying-deep-reinforcement-learning/](https://www.nervanasys.com/demystifying-deep-reinforcement-learning/) and [http://mnemstudio.org/path-finding-q-learning-tutorial.htm](http://mnemstudio.org/path-finding-q-learning-tutorial.htm) * A simple browser example using reinforcement learning: [http://projects.rajivshah.com/rldemo/](http://projects.rajivshah.com/rldemo/) * Probability versus uncertainty: [https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/](https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/) * If you would like to get started with deep learning, you can check out: [Keras](https://keras.io/) + 
[TensorFlow](https://www.tensorflow.org/) (or [CNTK](https://github.com/Microsoft/CNTK) as a Windows-friendly alternative) or [PyTorch](http://pytorch.org/) * [Deep learning power scores](https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a) * [“Lessons from Optics, The Other Deep Learning”](http://www.argmin.net/2018/01/25/optics/): showing that there is still a lot of research ahead of us, definitely worth reading * [More on adversarial attacks](https://blog.openai.com/adversarial-example-research), also see this [article](https://www.wired.com/story/ai-has-a-hallucination-problem-thats-proving-tough-to-fix/) that discusses the problem and how it is proving hard to solve * [Layer activations](https://distill.pub/2019/activation-atlas/) * [Other methods to explain CNNs](http://cs231n.github.io/understanding-cnn/) Q-learning Python example: * Try playing around with the [Python example Q-learning code](./code/qlearning.py) * Things to try: adjust the learning rate (alpha) and see what happens. Adjust the discount factor, adjust the exploration versus exploitation tradeoff. Try initializing the Q-matrix with e.g. random numbers. Try playing with the rewards. Try with a larger "maze". (Advanced: try converting the setting to a different type of simple game.) # Course 8: April 2 ## Slides * [9 - Hadoop and Spark](./slides/9 - Hadoop and Spark.pdf) ## Background Information In the news: * [Artificial intelligence group DeepMind readies first commercial product](https://outline.com/EpGqWm) * [Machine Learning In The Judicial System Is Mostly Just Hype](https://palladiummag.com/2019/03/29/machine-learning-in-the-judicial-system-is-mostly-just-hype/) * [Can AI Be a Fair Judge in Court? 
Estonia Thinks So](https://www.wired.com/story/can-ai-be-fair-judge-court-estonia-thinks-so/) * [How malevolent machine learning could derail AI](https://www.technologyreview.com/s/613170/emtech-digital-dawn-song-adversarial-machine-learning/) * [Inmates in Finland are training AI as part of prison labor](https://www.theverge.com/2019/3/28/18285572/prison-labor-finland-artificial-intelligence-data-tagging-vainu) * [Researchers estimate it takes approximately 1.5 megabytes of data to store language information in the brain](https://medicalxpress.com/news/2019-03-approximately-megabytes-language-brain.html) * [Learning a SAT Solver from Single-Bit Supervision](https://arxiv.org/abs/1802.03685) and [follow-up](https://arxiv.org/abs/1903.04671v4) * [The Rise of Generative Adversarial Networks](https://blog.usejournal.com/the-rise-of-generative-adversarial-networks-be52d424e517) Extra references: * More information on the architecture of HDFS: [https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), [https://developer.yahoo.com/hadoop/tutorial/module2.html](https://developer.yahoo.com/hadoop/tutorial/module2.html) and [http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/](http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/) * More information on the Hadoop MapReduce pipeline: [https://developer.yahoo.com/hadoop/tutorial/module4.html](https://developer.yahoo.com/hadoop/tutorial/module4.html) * The whole [Yahoo! 
tutorial on Hadoop](https://developer.yahoo.com/hadoop/tutorial/) is a good reference * More information on YARN: [https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/](https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/), [https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/](https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/), [http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/](http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/) and [https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html](https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html) * Mastering Apache Spark book (a bit older now): [https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark](https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark) * [Apache Spark architecture overview](https://mapr.com/ebooks/spark/03-apache-spark-architecture-overview.html) * [A tale of three Apache Spark APIs](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html) * [Overview of Spark 2 Dataset (with Scala examples)](http://xinhstechblog.blogspot.be/2016/07/overview-of-spark-20-dataset-dataframe.html) * [H2O](http://www.h2o.ai/) * [Apache Spark documentation](https://spark.apache.org/docs/latest/index.html) (this is actually a great starting point, as it contains many code samples) MapReduce examples in Python: * Download a [ZIP file](./code/mapreduce.zip) with some simulated MapReduce examples you can run in Python without setting up a full Hadoop environment. The simulation will start reducing on partial results and apply the reduction operations as long as possible (i.e. until every element having the same key is reduced). Make sure the example files are in the same folder as "map_reduce.py" and then run them, e.g. using: `python example_avgpergenre.py`. 
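The simulated examples above follow the classic map → shuffle → reduce pattern. As a minimal, stdlib-only illustration of that pattern (a toy word count sketch, not the course's `map_reduce.py` API), consider:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair per word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key (a real framework does this across nodes)
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reduce_phase(grouped):
    # Reduce: combine the grouped values per key
    for key, values in grouped:
        yield (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(shuffle_phase(map_phase(lines))))
print(counts["the"])  # 3
```

The same three functions map directly onto the mapper and reducer you'd write for Hadoop; the framework takes care of the sorting and grouping in between.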
You can use this setup to try to develop some other MapReduce programs, too. DummyRDD examples: * Download a [ZIP file](./code/dummyrdd.zip) with some DummyRDD examples, see [https://github.com/wdm0006/DummyRDD](https://github.com/wdm0006/DummyRDD). Install DummyRDD first using pip: `pip install dummy_spark`, then run one of the Python examples from the ZIP file. More RDD examples can be found [here](http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html). Note that RDDs provide a low-level API. In most settings, you'll want to work with SparkSQL (i.e. DataFrame and Dataset), but it can be a good way to familiarize yourself with the low-level APIs. More information on the difference between DataFrames (Dataset[Row]) and typed Datasets (Dataset[T]) in Spark can be found [here](./papers/DataFrames_Datasets.pdf). # Course 9: April 23 ## Slides * (Continuing from last course slides) ## Background Information In the news: * [YouTube Flags Notre-Dame Fire as 9/11 Conspiracy](https://www.bloomberg.com/news/articles/2019-04-15/youtube-flags-notre-dame-fire-as-9-11-conspiracy-in-wrong-call) * [Open Questions about Generative Adversarial Networks](https://distill.pub/2019/gan-open-problems/) * [Visualizing memorization in RNNs](https://distill.pub/2019/memorization-in-rnns/) * [IBM halting sales of Watson AI tool for drug discovery amid sluggish growth](https://www.statnews.com/2019/04/18/ibm-halting-sales-of-watson-for-drug-discovery/) * [Why language technology can't handle Game of Thrones (yet)](https://www.sciencedaily.com/releases/2019/04/190418080816.htm) * [EU AI Expert Group: Ethical risks are ‘unimaginable’](https://www.artificialintelligence-news.com/2019/04/11/eu-ai-expert-group-ethical-risks/) * [Retailers like Walmart are embracing robots – here’s how workers can tell if they’ll be replaced](https://theconversation.com/retailers-like-walmart-are-embracing-robots-heres-how-workers-can-tell-if-theyll-be-replaced-115415) and 
[2](https://theconversation.com/what-robots-and-ai-may-mean-for-university-lecturers-and-students-114383) Extra references: * [Spark Structured Streaming docs](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) * [Apache Storm](https://storm.apache.org/) * [Apache Flink](https://flink.apache.org/) * [Apache Ignite](https://ignite.apache.org/) * [So many frameworks](https://landscape.cncf.io/category=streaming-messaging&format=cardmode&grouping=category) * [Hacker News also has a field day](https://news.ycombinator.com/item?id=19016997) * [Scalability! But at what COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) * [Gartner looks at data science platforms](https://thomaswdinsmore.com/2017/02/28/gartner-looks-at-data-science-platforms/) * [Uber Michelangelo](https://eng.uber.com/scaling-michelangelo/) * [Uber's Big Data platform](https://eng.uber.com/uber-big-data-platform/) * [Spotify](https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/) uses [Airflow](https://airflow.apache.org/), [Luigi](https://github.com/spotify/luigi) * [Architecture of giants](https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest/) ## Assignment 3 For this assignment, you will construct a predictive model based on a streaming textual data source using Spark Streaming. The data you'll work with consists of book reviews on Amazon. You'll use the review text to try to predict the number of stars given in the review (1 to 5). Make sure to read all the instructions below carefully. ### Setting up Spark Since the data set we'll work with is still relatively small, you will (luckily) not need a cluster of machines, but can simply run Spark locally on your machine. Here's how you make sure you have a local Spark set up and ready to go: 1. 
First, download the ZIP file from [https://drive.google.com/open?id=1tusT4IS2ns_YiC2oAdKWU0KWnipyufqW](https://drive.google.com/open?id=1tusT4IS2ns_YiC2oAdKWU0KWnipyufqW) and extract it somewhere, e.g. a "spark" folder on your Desktop is fine. This ZIP file contains a portable Spark installation with Java and the necessary tooling included. It includes the latest Spark version at this time (2.4) together with a bug fix to make it work on Windows. 2. Second, make sure you have a Python 3 Anaconda distribution installed on your system. (At the very least, Python 3 and pip need to be available on your system.) Windows users: to start Spark, double-click "letsgo-win.bat". If all goes well, a Jupyter notebook should open up with a connection to Spark being established and ready to go. Preferably, you should have Python in your path. If the launcher fails to find Python, it will provide you with an error message. Upon starting, Java might request access through your Windows firewall, which you can safely accept. Mac users: open up a Terminal window, navigate to where you've unzipped the file and run "letsgo-mac.sh", e.g.: `cd /Users/seppe/Desktop/spark` followed by `./letsgo-mac.sh` Again, do make sure you have installed Python 3 Anaconda first. If Mac complains about the file not being executable, you might first have to run `chmod +x ./letsgo-mac.sh` to make it executable. You might also get a popup window asking if you want to install Xcode. You can ignore this, as you don't need it. If all goes well, a Jupyter notebook should open up with a connection to Spark being established. If you encounter issues, check the FAQ below first -- otherwise, feel free to e-mail me. 
### Example notebooks Once you have the Jupyter notebook open, feel free to explore the example notebooks under "notebooks": * `spark_example.py.ipynb`: A simple Spark example (calculating pi) to check if your installation is working * `spark_streaming_example.py.ipynb`: A simple Spark Streaming example that prints out the data you'll work with * `spark_streaming_example_saving.py.ipynb`: A simple Spark Streaming example that saves the data * `spark_streaming_example_predicting.py.ipynb`: A very naive prediction approach * `spark_structured_streaming_example.py.ipynb`: An example using Spark Structured Streaming ### Assignment Using Spark, your task for this assignment is as follows: 1. Construct a historical data set using the provided data stream Important: get started with this as soon as possible. We will discuss Spark Streaming and text mining techniques in more detail in class, though you can already take a look at the "spark_streaming_example_saving.py.ipynb" notebook to start saving data. Since we're working with a streaming data source, data you don't capture will be lost - Make sure to set up Spark first using the instructions posted above if you haven't done so already - The streaming server is running at "seppe.net:7778", as shown in the example notebooks 2. Construct a predictive model to predict the rating of a review based on the review text (and potentially title, book title and user name as well) - Note that you can use extra data if you want, and extra libraries (e.g. sentiment analysis libraries might be a smart idea), as long as you can adhere to point 3 below - You're encouraged to build your model using spark.ml (MLlib), though scikit-learn can be used as a fallback 3. Use your model to predict newly incoming reviews in a streaming setup - I.e. 
show that you can connect to the data source, featurize incoming reviews as done in training, have your model predict the rating given, and show it, similar to "spark_streaming_example_predicting.py.ipynb" (but hopefully using a smarter, real predictive model) The third part of your lab report should contain: * Overview of the steps above, the source code of your programs, as well as the output after running them * Feel free to include screen shots or info on encountered challenges and how you dealt with them * Even if your solution is not fully working or not working correctly, you can still receive marks for this assignment if you show what you tried and how you'd need to improve your end result (i.e. when you can prove a conceptual understanding of the problem and solution) **Note for externals (i.e. anyone who'll NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in the assignments but not required to. In case you want to join, feel free to form groups or work individually. Feel free to skip assignments. ### Further remarks * Get started with setting up Spark and fetching reviews as quickly as possible. If you encounter troubles getting Spark to run, do let me know * The rate at which reviews come in is about 15 every hour -- you obviously don't need all of them, but make sure to have enough to train your model * The data stream is line-delimited, with every line containing a review in JSON format; it can easily be converted to a DataFrame (or RDD). The example notebooks give some ideas on how to do so * You can use either Spark Streaming or Spark Structured Streaming. The former is most likely easier to work with * The focus of this assignment is on getting the full pipeline as outlined above constructed, not on getting spectacularly high accuracies, though TF-IDF, stemming, and stop word removal might be interesting to apply. 
You can decide how you approach the rating problem (multi-class, ordinal multi-class, or a binary setup such as <3 vs. >=3, should you find that easier) * Preferably, your predictive model needs to be built using MLlib (so read documentation and tutorials). In case you really encounter trouble, you can use scikit-learn as well to still perform the "deployment" stage. Any type or approach is valid. As stated above, other libraries can be used as well, as long as you can show that your model can provide predictions in real-time * If the streaming server crashes, simply let me know with an e-mail (don't hesitate) * As discussed in class: you're free to schedule your planning and division of work. You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday June 2nd. ### FAQ **"letsgo-win.bat" doesn't start the notebook -- it says it can't find Python** Edit the file and change the following line: `call C:\Users\%username%\Anaconda3\Scripts\activate.bat C:\Users\%username%\Anaconda3` To match the correct Anaconda installation and base path on your system. **"letsgo-win.bat" doesn't start the notebook -- it says I have spaces in my path** Try again after moving the installation to a directory without spaces. **It seems I receive duplicated data when restarting my Jupyter notebook** The server keeps a limited history of reviews, so it is possible you see the same reviews coming in if you restart your client. Make sure to remove those duplicates before training your model. **I've looked up a review I've just received on Amazon and it's way older than this exact time** The "timestamp" in the reviews that are sent to you matches the time of extracting the review from Amazon, not when it was posted by the user. To avoid hammering Amazon's servers, reviews are extracted at a steady pace and sent to you. Once extracted, a review will not be sent multiple times, however. 
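Related to the duplicates question above: removing duplicate reviews from your captured files can be done with a few lines of plain Python. A minimal sketch (the field names "review" and "stars" are made up for illustration -- inspect the actual stream for the real keys):

```python
import json

def deduplicate(lines):
    # Drop exact duplicate reviews and badly formatted lines
    seen, unique = set(), []
    for line in lines:
        line = line.strip()
        if not line or line in seen:
            continue
        try:
            review = json.loads(line)  # every line should be one JSON review
        except ValueError:
            continue  # data is messy: discard unparseable lines
        seen.add(line)
        unique.append(review)
    return unique

captured = [
    '{"review": "Great book!", "stars": 5}',
    '{"review": "Great book!", "stars": 5}',  # duplicate after a restart
    'not valid json',
    '{"review": "Meh.", "stars": 2}',
]
print(len(deduplicate(captured)))  # 2
```

This also takes care of the "Spark complains for some files only" issue below, since unparseable lines are discarded in the same pass.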
**I can't save the stream... everything else seems fine** Make sure you're calling the "saveAsTextFiles" function with "file://" prepended to the path: `lines.saveAsTextFiles("file:///C:/...")` Also make sure that the folder where you want to save the files exists. **Can I prevent the "saveAsTextFiles" function from creating so many directories and files?** You can first repartition the RDD to one partition before saving it: `lines.repartition(1).saveAsTextFiles("file:///C:/...")` To prevent multiple directories, change the trigger time to e.g. `ssc = StreamingContext(sc, 60)` Though this will still create multiple directories. Setting the trigger interval higher is not really recommended, as you wouldn't want to lose data in case something goes wrong. **So if I still end up with multiple directories, how do I read them in?** It's easy to loop over subdirectories in Python, e.g. with `glob.glob`. Alternatively, the "sc.textFile" command accepts wildcard paths and can parse through multiple files in one go. **I'm trying to convert my saved files to a DataFrame, but Spark complains for some files only?** Data is always messy, especially the ones provided by this instructor. Make sure you can handle badly formatted lines and discard them. **My stream crashes after a while with an "RDD is empty" error** Make sure you're checking for empty RDDs, e.g.

```
if rdd.isEmpty():
    return
```

**I've managed to create a model. When I try to apply it on the stream, Spark crashes with a Hive / Derby error, e.g. when I try to .load() my model(s) or once the first RDD arrives** Check the example notebooks on how to load in your model in "globals()" once. **When I call "ssc_t.stop()"... Spark never seems to stop the stream** You can try changing "stopGraceFully=True" to "False". Even then, Spark might not want to stop its stream processing pipeline in case you're doing a lot with the incoming data, preventing Spark from cleaning up. 
Try decreasing the trigger time, or simply restart the Jupyter kernel to start over. **Spark complains that only one StreamingContext can be active at a time (and other general "it doesn't work anymore" questions)** In case of trouble, a good idea is always to close all running notebooks and start again fresh. **I can't access Spark when starting Jupyter manually** That's right. The script provided here links PySpark to Jupyter so the "sc" and "spark" variables will be created for you. It's hence best to use the provided "letsgo-*" scripts. **Can I use R?** There are two main Spark R packages being maintained right now: SparkR (the official one) and sparklyr (from the folks at RStudio, which fits better with the tidyverse). Both are fine to use, but you'll have to do some setting up so that R can find your Spark installation. The example below assumes you have the "SparkR" library installed: ``` # Set up environment variables: do this before loading the library # Otherwise, R will attempt to download Spark on its own # Make sure to adjust the paths below to match your system # HADOOP_HOME is only required on Windows; remove this line on Mac/Linux if (nchar(Sys.getenv("SPARK_HOME")) < 1) { Sys.setenv(SPARK_HOME = "C:\\Users\\Seppe\\Desktop\\spark\\spark-2.4.0-bin-hadoop2.7\\") Sys.setenv(HADOOP_HOME = "C:\\Users\\Seppe\\Desktop\\spark\\winutils\\") } library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) # Set up the Spark context: sparkR.session(master = "local[*]") # Read a structured streaming data frame: reviews <- read.stream("socket", host = "seppe.net", port = 7778) # Provide a sink; using "console" will not show anything in your main R console # So use "memory" with a queryName instead: query <- write.stream(reviews, "memory", queryName = "reviews") # Your operations: head(sql("SELECT * FROM reviews")) # Once you are finished, stop the query: stopQuery(query) ``` However, I'd strongly recommend using Python. 
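To illustrate the "how do I read in multiple directories" and "handle badly formatted lines" answers above, here is a minimal pure-Python sketch. It assumes each saved line holds one JSON document, which may not match how you serialized your reviews, so adjust the parsing accordingly:

```python
import glob
import json
import os


def load_saved_reviews(base_dir):
    """Collect all part-files that saveAsTextFiles wrote under base_dir,
    parse each line, and silently drop lines that fail to parse.

    Assumes each line is a JSON document; adapt to your own serialization."""
    reviews = []
    # saveAsTextFiles creates one directory per batch, each holding part-* files
    for path in sorted(glob.glob(os.path.join(base_dir, "*", "part-*"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    reviews.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # badly formatted line: discard it
    return reviews
```

After loading, remember to deduplicate before converting the result to a DataFrame for training.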
# Course 10: April 30 ## Slides * [10 - Text mining](./slides/10 - Text mining.pdf) ## Background Information In the news: * [A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/) * [Using R To Analyze The Redacted Mueller Report](https://www.jlukito.com/blog/2019/4/20/using-r-to-analyze-the-redacted-mueller-report) * [How to hide from the AI surveillance state with a color printout](https://www.technologyreview.com/f/613409/how-to-hide-from-the-ai-surveillance-state-with-a-color-printout/) * [MuseNet](https://openai.com/blog/musenet/) * [Notes on AI Bias](https://www.ben-evans.com/benedictevans/2019/4/15/notes-on-ai-bias) * [Dear AI startups: Your ML models are dying quietly](https://sanau.co/ML-models-are-dying-quietly) * [Can NSFW models be fooled? (NSFW)](https://medium.com/@marekkcichy/does-ai-have-a-dirty-mind-too-6948430e4b2b) Extra references: * [UMAP is a technique you can use for dimensionality reduction in high-dimensional spaces, like t-SNE or PCA](https://github.com/lmcinnes/umap) * [word2vec](https://code.google.com/archive/p/word2vec/) * [GloVe](https://nlp.stanford.edu/projects/glove/) * More word2vec background readings: [1](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/), [2](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), [3](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/), [4](https://iksinc.wordpress.com/tag/continuous-bag-of-words-cbow/), [5](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), [6](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e) * [fastText](https://github.com/facebookresearch/fastText) * [Shared representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) * [par2vec, doc2vec](https://medium.com/@amarbudhiraja/understanding-document-embeddings-of-doc2vec-bfe7237a26da) * [The concept of representational learning and 
embeddings has even been applied to sparse, high-level categoricals](https://arxiv.org/abs/1604.06737), [and here](https://www.r-bloggers.com/exploring-embeddings-for-categorical-variables-with-keras/), and [here](https://www.fast.ai/2018/04/29/categorical-embeddings/) * [Princeton researchers discover why AI becomes racist and sexist](https://arstechnica.com/science/2017/04/princeton-scholars-figure-out-why-your-ai-is-racist/) * [AirBnb Listing Embeddings](https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e) * [Embeddings for jobs (from a Belgian start-up)](https://www.youtube.com/watch?v=Oqe0cShGcHs) * A lot has happened in NLP: [1](http://jalammar.github.io/illustrated-bert/), [2](http://ruder.io/10-exciting-ideas-of-2018-in-nlp/), [3](https://github.com/mratsim/Arraymancer/issues/268), [4](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html), [5](http://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/), [6](https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec) * [Stanford CS224N: NLP with Deep Learning | Winter 2019](https://www.youtube.com/watch?v=kEMJRjEdNzM&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z) * [Natural Language Toolkit (NLTK)](http://www.nltk.org) * [MITIE: library and tools for information extraction](https://github.com/mit-nlp/MITIE) * R: tm, topicmodels and nlp packages, http://tidytextmining.com/ * [Gensim](https://radimrehurek.com/gensim/) * [SpaCy](https://spacy.io/) * [RASA](https://nlu.rasa.com/) * [AllenNLP](https://allennlp.org/) * [vaderSentiment: Valence Aware Dictionary and sEntiment Reasoner](https://github.com/cjhutto/vaderSentiment) # Course 11: May 7 ## Slides * [11 - Social network mining](./slides/11 - Social network mining.pdf) ## Background Information In the news: * [Word Clouds: We Can’t Make Them Go Away, 
So Let’s Improve Them](https://medium.com/multiple-views-visualization-researchexplained/improving-word-clouds-9d4a04b0722b) * [Do neural nets dream of electric sheep?](https://aiweirdness.com/post/171451900302/do-neural-nets-dream-ofelectric-sheep) * [GPT-2 Interim Update](https://openai.com/blog/better-language-models/#update) * [Python at Netflix](https://medium.com/netflix-techblog/python-at-netflix-bba45dae649e) * [Unsupervised learning: the curious pupil](https://deepmind.com/blog/unsupervised-learning/) * [Google releases AI training data set with 5 million images and 200,000 landmarks](https://ai.googleblog.com/2019/05/announcing-google-landmarks-v2-improved.html) * [DrWhy](https://github.com/ModelOriented/DrWhy/blob/master/README.md) Extra references: On visualisation and layouts of graphs: * You'll note that most layout techniques make use of a physics-inspired "force-based" approach, where the edges between nodes are regarded as "springs" and the layout algorithm goes through a number of iterations to let the graph stabilize towards a comprehensible, attractive layout * [Journal paper discussing the popular ForceAtlas2 technique](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0098679) * [Another paper detailing simple and less simple layout techniques](http://profs.etsmtl.ca/mmcguffin/research/2012-mcguffin-simpleNetVis/mcguffin-2012-simpleNetVis.pdf) * [Examples to play with in the browser](https://bl.ocks.org/steveharoz/8c3e2524079a8c440df60c1ab72b5d03) * [Webcola is a JavaScript library to layout graphs](http://marvl.infotech.monash.edu/webcola/) * [GraphViz](http://www.graphviz.org/) is a standalone tool for graph-based visualizations and layout, which are described by means of the DOT language. 
It has bindings for many programming languages and is still widely used as a behind-the-scenes layout driver in many products PageRank, personalized PageRank, and others: * [https://www.r-bloggers.com/from-random-walks-to-personalized-pagerank/](https://www.r-bloggers.com/from-random-walks-to-personalized-pagerank/) * [http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm](http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm) * [http://www.math.ucsd.edu/~fan/wp/lov.pdf](http://www.math.ucsd.edu/~fan/wp/lov.pdf) * [http://www.cs.yale.edu/homes/spielman/462/2010/lect16-10.pdf](http://www.cs.yale.edu/homes/spielman/462/2010/lect16-10.pdf) * The [docs of igraph](http://igraph.org/r/doc/) contain a good overview of available centrality, betweenness, ... metrics and community mining * Node2vec: [http://snap.stanford.edu/node2vec/](http://snap.stanford.edu/node2vec/) * GraphSAGE: [https://github.com/williamleif/GraphSAGE](https://github.com/williamleif/GraphSAGE) * Deepwalk: [https://arxiv.org/abs/1403.6652](https://arxiv.org/abs/1403.6652) * More techniques continue to be developed, e.g. [Gat2Vec](https://link.springer.com/article/10.1007/s00607-018-0622-9), which performs representation learning on graphs with attributes Tools: * Many JavaScript-based tools are available to visualise and lay out graphs in the browser: e.g. 
[Linkurious](https://linkurio.us/), [sigma.js](http://sigmajs.org/) * igraph is a graph analysis package for R and Python: [http://igraph.org/](http://igraph.org/) * Others: [http://js.cytoscape.org/](http://js.cytoscape.org/), [http://sigmajs.org/](http://sigmajs.org/) and [http://visjs.org/](http://visjs.org/) * [NetworkX](https://networkx.github.io/) is a Python package for graph analysis * Spark's [GraphX](http://spark.apache.org/graphx/) and [GraphFrames](https://graphframes.github.io/) * [Gephi](https://gephi.org/) is a tool for graph layout, analysis and visualisation * Also see the [ggraph](https://cran.r-project.org/web/packages/ggraph/index.html), [ggnet2](https://briatte.github.io/ggnet/), [sna](https://cran.r-project.org/web/packages/sna/index.html), [network](https://cran.r-project.org/web/packages/network/index.html), [tidygraph](https://github.com/thomasp85/tidygraph) packages in R NoSQL vendors and examples (to name a few): * Memcached: [https://memcached.org/](https://memcached.org/) * Cassandra: [http://cassandra.apache.org/](http://cassandra.apache.org/) * HBase: [https://hbase.apache.org/](https://hbase.apache.org/) * CouchDB: [http://couchdb.apache.org/](http://couchdb.apache.org/) * VoltDB: [https://www.voltdb.com/](https://www.voltdb.com/) * MongoDB: [https://www.mongodb.com/](https://www.mongodb.com/) * Redis: [https://redis.io/](https://redis.io/) * CockroachDB: [https://www.cockroachlabs.com/](https://www.cockroachlabs.com/) * AsterixDB: [https://asterixdb.apache.org/](https://asterixdb.apache.org/) * See [https://jepsen.io/analyses](https://jepsen.io/analyses) for detailed analyses of consistency models of various NoSQL vendors Graph databases: * Neo4j: [https://neo4j.com/](https://neo4j.com/) * OrientDB: [http://orientdb.com/orientdb/](http://orientdb.com/orientdb/) * SparkSee: [http://www.sparsity-technologies.com/](http://www.sparsity-technologies.com/) * FlockDB: 
[https://github.com/twitter-archive/flockdb](https://github.com/twitter-archive/flockdb) * Titan: [http://titan.thinkaurelius.com/](http://titan.thinkaurelius.com/) * AllegroGraph: [https://franz.com/agraph/allegrograph/](https://franz.com/agraph/allegrograph/) * InfiniteGraph: [http://www.objectivity.com/products/infinitegraph/](http://www.objectivity.com/products/infinitegraph/) * Intro to Cypher: [https://neo4j.com/developer/cypher-query-language/](https://neo4j.com/developer/cypher-query-language/) Analytics on Neo4j: * [Efficient graph algorithms on Neo4j](https://neo4j.com/blog/efficient-graph-algorithms-neo4j/): built-in solution. **Update**: it does in fact [contain personalized PageRank in the latest release](https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/#algorithms-pagerank-personalized), neat! * [GraphAware Framework](https://github.com/graphaware/neo4j-framework), * [https://github.com/maxdemarzi/graph_processing](https://github.com/maxdemarzi/graph_processing), [https://github.com/neo4j-contrib/neo4j-mazerunner](https://github.com/neo4j-contrib/neo4j-mazerunner): older projects which can be ignored nowadays * [https://github.com/neo4j-contrib/neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector): the Neo4j Spark Connector uses the binary Bolt protocol to transfer data from and to a Neo4j server * [https://github.com/versae/ipython-cypher](https://github.com/versae/ipython-cypher): queries are sent through Cypher and results can be stored in a variable and then converted to a Pandas DataFrame * [https://github.com/nicolewhite/RNeo4j](https://github.com/nicolewhite/RNeo4j): the “RNeo4j” package takes a similar approach using R data frames * [Neo4j APOC](https://neo4j-contrib.github.io/neo4j-apoc-procedures/) for import/export, and more * [Py2Neo](https://py2neo.org/v4/): Neo4j Python client library * [Neo4j shell-tools](https://github.com/jexp/neo4j-shell-tools) for import/export * Use cases: 
[https://neo4j.com/graphgists/](https://neo4j.com/graphgists/) and [https://neo4j.com/sandbox-v2/](https://neo4j.com/sandbox-v2/) ## Assignment 4 **Update: some students have reported that the graph contains more information for the year 2017 than for 2016. The issue is due to Neo4j overwriting edges for "2016" with "2017" during creation. I've prepared a new version with the edges for "mandate_at" and "member_of" split up according to the years ("mandate_at_201X", "member_of_201X"). Please use this version in case you'd like to analyze differences over the years:** * [https://drive.google.com/open?id=1mHY5bMOYPAKgqKJGp2N0xC8LW1Svt9eK](https://drive.google.com/open?id=1mHY5bMOYPAKgqKJGp2N0xC8LW1Svt9eK) In this assignment, you'll explore a Neo4j graph database using Cypher -- Neo4j's query language -- and Gephi. The data set revolves around political mandates in Belgium. Politicians in Belgium are required to list all mandates they hold at public and private organizations. The definition of a mandate in this setting does not only encompass political mandates (e.g. being a member of a political party) but also includes serving on the board of directors of a firm, working as a consultant, and so on. Data for the years 2016 and 2017 was scraped from https://multimedia.tijd.be/mandaten2018/, from which a graph was constructed with nodes for organizations, parties, and politicians, and edges representing party membership and mandates (with attributes describing year and function). - The original data contains a lot of quality issues, with organizations having e.g. blank names, spaces, "N.A.", "NA", "na" and so on. I've already performed a significant amount of cleaning (a single node, with name "NA", now represents unknown organizations, for instance) - Why not take the list as published by the government? 
Well, indeed, it's "open data", though the government releases the list as a PDF file, so normally we'd need to use something like Camelot to extract the data - Why no data for 2018? Legislation was changed, so data for the year 2018 will only become available in 2020 The goal of this assignment is to: - Explore this graph using Neo4j (Cypher) and Gephi and find some interesting patterns - Visualize your findings in an appealing manner - The main goal is to get familiar with Neo4j and Gephi, but also to hone your "storytelling" skills. In that sense, try to focus on one or a few hypotheses or findings which you explore in full (with nicely formatted visualizations), explaining what they show, instead of just applying a quick filter and saying: "here are the three nodes with the most connections" (boring). Things you can focus on include: looking at the differences between parties in Flanders vs. Wallonia, or left- vs. right-wing; taking a look at politicians that changed a lot of mandates from 2016 to 2017; or taking a look at organizations that involve politicians from all sorts of parties. (For those of you who are a bit more familiar with local politics: another fun idea is to find two politicians who are opposed to one another in public debates and see how many "degrees of separation" run between them in this network) Please take a look at [assignment4.pdf](./papers/assignment4.pdf) for instructions regarding setting up Neo4j and Gephi and working on the assignment. Definitely take a look at the "closing notes" for more tips and instructions. [fixme.py](./code/fixme.py) contains the source code of the GraphML cleaning script mentioned in the guide (you'll need this to do a proper export to GraphML). 
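To give an idea of what exploring this graph with Cypher could look like, here is a sketch of a query listing politicians with the most distinct 2017 mandates. Note that the node labels and property name used here (:Politician, :Organization, name) are assumptions on my part and depend on how the GraphML file was imported; only the relationship types ("mandate_at_2017", "member_of_2017") are taken from the updated dump described above, so inspect your own graph first:

```
// Labels and property names below are assumptions; check your own graph
// with CALL db.labels() and CALL db.relationshipTypes() first
MATCH (p:Politician)-[:mandate_at_2017]->(o:Organization)
RETURN p.name AS politician, count(DISTINCT o) AS organizations
ORDER BY organizations DESC
LIMIT 10
```

Swapping "mandate_at_2017" for "mandate_at_2016" and comparing the results is one easy way to start looking at changes between the two years.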
The fourth part of your lab report should contain: - Your Cypher queries together with their output; putting in screenshots is fine, too - Your approach for going from Neo4j to Gephi, the results of your workflow in Gephi, and the visualizations you created - The focus is on exploration and finding some interesting insights: so put in your visualizations, queries, steps applied... - Instead of Gephi, you are free to use another visualization tool such as Cytoscape, SigmaJS or VisJS if you feel this makes for a more rewarding hands-on opportunity, but note that you're on your own in terms of setting it up :) - As discussed in class: you're free to schedule your planning and division of work. You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday June 2nd. **Note for externals (i.e. anyone who'll NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in the assignments but not required to. In case you want to join, feel free to form groups or work individually. Feel free to skip assignments. 
# Course 12: May 14 ## Slides * [12 - Closing topics](./slides/12 - Closing topics.pdf) ## Background Information In the news: - [An End-to-End AutoML Solution for Tabular Data](https://ai.googleblog.com/2019/05/an-end-to-end-automl-solution-for.html) - [Who to Sue When a Robot Loses Your Fortune](https://www.bloomberg.com/news/articles/2019-05-06/who-to-sue-when-a-robot-loses-your-fortune) - [Don’t let industry write the rules for AI](https://www.nature.com/articles/d41586-019-01413-1) - [Predicting Stack Overflow Tags with Google’s Cloud AI](https://stackoverflow.blog/2019/05/06/predicting-stack-overflow-tags-with-googles-cloud-ai/) - [Can ML help us understand why?](https://go.technologyreview.com/can-machine-learning-help-us-understand-why) - [AI and the “Useless” Class](https://towardsdatascience.com/ai-and-the-useless-class-e56893aaabb) - [Artificial Intelligence May Not 'Hallucinate' After All](https://www.wired.com/story/adversarial-examples-ai-may-not-hallucinate/) - [Real numbers, data science and chaos: How to fit any dataset with a single parameter](https://github.com/Ranlot/single-parameter-fit) - [GradType](https://gradtype.darksi.de/) A list of all links mentioned in the slides: - [http://aif360.mybluemix.net/data](http://aif360.mybluemix.net/data) - [http://deon.drivendata.org/](http://deon.drivendata.org/) - [http://www.cc.gatech.edu/~alanwags/DLAI2016/(Gunning)%20IJCAI-16%20DLAI%20WS.pdf](http://www.cc.gatech.edu/~alanwags/DLAI2016/(Gunning)%20IJCAI-16%20DLAI%20WS.pdf) - [http://www.landoop.com/blog/2017/12/apache-kafka-gdpr-compliance/](http://www.landoop.com/blog/2017/12/apache-kafka-gdpr-compliance/) - [https://alexgkendall.com/computervision/bayesiandeeplearningforsafeai/](https://alexgkendall.com/computervision/bayesiandeeplearningforsafeai/) - [https://algofairness.github.io/fatconference-2018-auditing-tutorial/](https://algofairness.github.io/fatconference-2018-auditing-tutorial/) - 
[https://arxiv.org/abs/1609.02943](https://arxiv.org/abs/1609.02943) - [https://arxiv.org/ftp/arxiv/papers/1804/1804.11238.pdf](https://arxiv.org/ftp/arxiv/papers/1804/1804.11238.pdf) - [https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/](https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/) - [https://cdn.oreillystatic.com/en/assets/1/event/269/Human%20in%20the%20loop_%20Bayesian%20rules%20enabling%20explainable%20AI%20%20Presentation.pdf](https://cdn.oreillystatic.com/en/assets/1/event/269/Human%20in%20the%20loop_%20Bayesian%20rules%20enabling%20explainable%20AI%20%20Presentation.pdf) - [https://dataprivacylab.org/projects/identifiability/paper1.pdf](https://dataprivacylab.org/projects/identifiability/paper1.pdf) - [https://deepmind.com/blog/robust-and-verified-ai/](https://deepmind.com/blog/robust-and-verified-ai/) - [https://dsapp.uchicago.edu/projects/aequitas/](https://dsapp.uchicago.edu/projects/aequitas/) - [https://github.com/IBM/AIF360](https://github.com/IBM/AIF360) - [https://github.com/adebayoj/fairml](https://github.com/adebayoj/fairml) - [https://github.com/algofairness/BlackBoxAuditing](https://github.com/algofairness/BlackBoxAuditing) - [https://github.com/h2oai/mli-resources](https://github.com/h2oai/mli-resources) - [https://github.com/marcotcr/lime](https://github.com/marcotcr/lime) - [https://github.com/scikit-learn-contrib/skope-rules](https://github.com/scikit-learn-contrib/skope-rules) - [https://github.com/slundberg/shap](https://github.com/slundberg/shap) - [https://hortonworks.com/blog/voltage-data-centric-security-hadoop/](https://hortonworks.com/blog/voltage-data-centric-security-hadoop/) - [https://medium.com/@chris_bour/ai-will-predictive-models-outliers-be-the-new-socially-excluded-bbcb6a7b16b1](https://medium.com/@chris_bour/ai-will-predictive-models-outliers-be-the-new-socially-excluded-bbcb6a7b16b1) - 
[https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2](https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2) - [https://medium.com/@sderymail/on-the-wrong-side-of-algorithms-part-1-f266a4c3342b](https://medium.com/@sderymail/on-the-wrong-side-of-algorithms-part-1-f266a4c3342b) - [https://medium.com/dropoutlabs/privacy-preserving-machine-learning-2018-a-year-in-review-b6345a95ae0f](https://medium.com/dropoutlabs/privacy-preserving-machine-learning-2018-a-year-in-review-b6345a95ae0f) - [https://papers.nips.cc/paper/8003-towards-robust-interpretability-with-self-explaining-neural-networks.pdf](https://papers.nips.cc/paper/8003-towards-robust-interpretability-with-self-explaining-neural-networks.pdf) - [https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b](https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b) - [https://ppml-workshop.github.io/ppml/](https://ppml-workshop.github.io/ppml/) - [https://slate.com/technology/2016/02/how-to-hold-governments-accountable-for-their-algorithms.html](https://slate.com/technology/2016/02/how-to-hold-governments-accountable-for-their-algorithms.html) - [https://techcrunch.com/2018/11/07/china-can-apparently-now-identify-citizens-based-on-the-way-they-walk/](https://techcrunch.com/2018/11/07/china-can-apparently-now-identify-citizens-based-on-the-way-they-walk/) - [https://thenextweb.com/artificial-intelligence/2019/02/21/predictive-policing-is-a-scam-that-perpetuates-systemic-bias/](https://thenextweb.com/artificial-intelligence/2019/02/21/predictive-policing-is-a-scam-that-perpetuates-systemic-bias/) - [https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd](https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd) - 
[https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss](https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss) - [https://www.fast.ai/2017/11/02/ethics/](https://www.fast.ai/2017/11/02/ethics/) - [https://www.insurancejournal.com/news/national/2019/04/08/523153.htm](https://www.insurancejournal.com/news/national/2019/04/08/523153.htm) - [https://www.nature.com/articles/d41586-018-05469-3](https://www.nature.com/articles/d41586-018-05469-3) - [https://www.nature.com/articles/d41586-019-00505-2](https://www.nature.com/articles/d41586-019-00505-2) - [https://www.newsroom.co.nz/2019/03/26/501538/the-ai-failures-of-facebook-youtube](https://www.newsroom.co.nz/2019/03/26/501538/the-ai-failures-of-facebook-youtube) - [https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html) - [https://www.nytimes.com/2019/01/31/opinion/ai-bias-healthcare.html](https://www.nytimes.com/2019/01/31/opinion/ai-bias-healthcare.html) - [https://www.popsci.com/technology/article/2013-03/four-location-data-points-give-away-cellphone-users-identities](https://www.popsci.com/technology/article/2013-03/four-location-data-points-give-away-cellphone-users-identities) - [https://www.propublica.org/article/breaking-the-black-box-how-machines-learn-to-be-racist?word=Trump](https://www.propublica.org/article/breaking-the-black-box-how-machines-learn-to-be-racist?word=Trump) - [https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) - [https://www.protegrity.com/products/protegrity-protectors/protegrity-avatar-for-hortonworks/](https://www.protegrity.com/products/protegrity-protectors/protegrity-avatar-for-hortonworks/) - 
[https://www.researchgate.net/publication/331214506_Automated_rationale_generation_a_technique_for_explainable_AI_and_its_effects_on_human_perceptions](https://www.researchgate.net/publication/331214506_Automated_rationale_generation_a_technique_for_explainable_AI_and_its_effects_on_human_perceptions) - [https://www.scientificamerican.com/article/algorithms-designed-to-fight-poverty-can-actually-make-it-worse/](https://www.scientificamerican.com/article/algorithms-designed-to-fight-poverty-can-actually-make-it-worse/) - [https://www.scientificamerican.com/article/dont-let-robots-pull-the-trigger/](https://www.scientificamerican.com/article/dont-let-robots-pull-the-trigger/) - [https://www.slideshare.net/Hadoop_Summit/securing-data-in-hadoop-at-uber](https://www.slideshare.net/Hadoop_Summit/securing-data-in-hadoop-at-uber) - [https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/](https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/) - [https://www.statista.com](https://www.statista.com) - [https://www.technologyreview.com/s/612775/algorithms-criminal-justice-ai/](https://www.technologyreview.com/s/612775/algorithms-criminal-justice-ai/) - [https://www.theguardian.com/science/2016/sep/01/how-algorithms-rule-our-working-lives](https://www.theguardian.com/science/2016/sep/01/how-algorithms-rule-our-working-lives) - [https://www.theguardian.com/technology/2018/jul/06/artificial-intelligence-ai-humans-bots-tech-companies](https://www.theguardian.com/technology/2018/jul/06/artificial-intelligence-ai-humans-bots-tech-companies) - [https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist) - [https://www.theverge.com/2019/1/25/18197301/youtube-algorithm-conspiracy-theories-misinformation](https://www.theverge.com/2019/1/25/18197301/youtube-algorithm-conspiracy-theories-misinformation) - 
[https://www.theverge.com/2019/2/19/18229938/youtube-child-exploitation-recommendation-algorithm-predators](https://www.theverge.com/2019/2/19/18229938/youtube-child-exploitation-recommendation-algorithm-predators) - [https://www.vox.com/future-perfect/2019/3/5/18251924/self-driving-car-racial-bias-studyautonomous-vehicle-dark-skin](https://www.vox.com/future-perfect/2019/3/5/18251924/self-driving-car-racial-bias-studyautonomous-vehicle-dark-skin) - [https://www.washingtonpost.com/news/theworldpost/wp/2017/10/09/pierre-omidyar-6-ways-social-media-has-become-a-direct-threat-to-democracy/](https://www.washingtonpost.com/news/theworldpost/wp/2017/10/09/pierre-omidyar-6-ways-social-media-has-become-a-direct-threat-to-democracy/) - [https://www.wired.com/story/google-microsoft-warn-ai-may-do-dumb-things/](https://www.wired.com/story/google-microsoft-warn-ai-may-do-dumb-things/) - [https://www.wired.com/story/when-algorithms-think-you-want-to-die/](https://www.wired.com/story/when-algorithms-think-you-want-to-die/) # Books Here are some recommended books to check out if you're interested in an accompanying reference: * Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (Foster Provost, Tom Fawcett) * Data Mining: Practical Machine Learning Tools and Techniques (Ian Witten, Eibe Frank, Mark Hall) * Data Mining: The Textbook (Charu Aggarwal) * Data Mining and Analysis: Fundamental Concepts and Algorithms (Mohammed Zaki, Wagner Meira) * Data Mining and Business Analytics with R (Johannes Ledolter) Online books: * [R for Data Science (Garrett Grolemund, Hadley Wickham)](http://r4ds.had.co.nz/) * [Computer Age Statistical Inference](http://web.stanford.edu/~hastie/CASI/index.html) * [Text Mining with R: A Tidy Approach](http://tidytextmining.com/) If you want an even more exhaustive list of data science books, feel free to check out 
[https://github.com/chaconnewu/free-data-science-books](https://github.com/chaconnewu/free-data-science-books), neatly ordered by topic and level (beginner to veteran). Some "self-promotion" for those of you interested in web scraping: [http://www.webscrapingfordatascience.com/](http://www.webscrapingfordatascience.com/).