# Advanced Analytics for a Big Data World (2025)

This page contains slides, references and materials for the "Advanced Analytics for a Big Data World" course.

*Last updated on 2025-03-26*

# Course 1: February 12

## Slides

* [About the Course](./slides/0 - About Course.pdf)
* [Introduction](./slides/1 - Introduction.pdf)

## Assignment Information

The evaluation of this course consists of a lab report (50% of the marks) and a closed-book written exam with both multiple-choice and open questions (50% of the marks).

* Your lab report will consist of your write-ups of four assignments, which will be made available throughout the semester
* You will work in groups of five students
* The four assignments consist of (1) a predictive model competition using R or Python (tabular); (2) a deep learning application (imagery); (3) text mining with Spark Streaming or LLMs (text); (4) a social network/graph analytics assignment (network)
* Per assignment, you describe your results (screenshots, numbers, approach); more detailed information will be provided per assignment
* You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st

**For forming groups, please see the Toledo page.**

**Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in (any of) the assignments individually (they'll be posted on this page as well), but not required to.

## Recording

* [YouTube](https://youtu.be/2XZdVOapC3U)

## Background Information

💭 If you like these links, you can also check out our biweekly newsletter, where we gather the latest AI news from around the web. You can subscribe over at: [https://www.dataminingapps.com/dataminingapps-newsletter/](https://www.dataminingapps.com/dataminingapps-newsletter/)

Extra references:

* [The YouTube channel Veritasium has just released a video talking about AlphaFold and the GNoME materials project -- take a look](https://www.youtube.com/watch?v=P_fHJIYENdI)
* [AlphaGo](https://www.engadget.com/2016-03-14-the-final-lee-sedol-vs-alphago-match-is-about-to-start.html) and their newer ["AlphaGo Zero"](https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/)
* [AlphaStar](https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/)
* [AlphaFold](https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery) and a [recent article where it was used](https://arxiv.org/abs/2201.09647)
* [AlphaCode](https://alphacode.deepmind.com/)
* [DALL·E: Creating Images from Text](https://openai.com/blog/dall-e/)
* [Automating My Job with GPT-3](https://blog.seekwell.io/gpt3)
* [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release)
* [ChatGPT](https://openai.com/blog/chatgpt/)
* [Millions of new materials discovered with deep learning](https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/)
* [AlphaGeometry](https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/)
* [DeepSeek explained](https://heidloff.net/article/deepseek-r1/) and the [announcement](https://github.com/deepseek-ai/DeepSeek-R1)
* [Janus Pro is DeepSeek's image generator](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
* [Kimi](https://github.com/MoonshotAI/Kimi-k1.5) and [Qwen](https://github.com/QwenLM/Qwen)
* [Hunyuan3D-2](https://github.com/Tencent/Hunyuan3D-2)
* [The Economics of AI Today](https://thegradient.pub/the-economics-of-ai-today/)
* [Designing great data products: The Drivetrain Approach: A four-step process for building data products](https://www.oreilly.com/radar/drivetrain-approach-data-products/)
* [Google's Rules of ML](https://developers.google.com/machine-learning/guides/rules-of-ml)
* [150 successful machine learning models: 6 lessons learned at Booking.com](https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/) -- recommended read!
* The tank story: [how much is true?](https://www.gwern.net/Tanks)
* [Self-driven car spins in circles](https://twitter.com/mat_kelcey/status/886101319559335936)
* [Correlation is not causation](https://web.archive.org/web/20210413060837/http://robertmatthews.org/wp-content/uploads/2016/03/RM-storks-paper.pdf)
* [Once billed as a revolution in medicine, IBM’s Watson Health is sold off in parts](https://www.statnews.com/2022/01/21/ibm-watson-health-sale-equity/)
* [How To Break Anonymity of the Netflix Prize Dataset](https://arxiv.org/abs/cs/0610105)
* [Why UPS drivers don’t turn left and you probably shouldn’t either](http://www.independent.co.uk/news/science/why-ups-drivers-don-t-turn-left-and-you-probably-shouldn-t-either-a7541241.html)
* [Your Garbage Data Is A Gold Mine](https://www.fastcompany.com/3063110/the-rise-of-weird-data)
* [Weapons of Math Destruction](https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction)
* [Beware the data science pin factory](https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/) -- recommended read!
* [Sky-high salaries in AI](https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons) with [discussion here](https://news.ycombinator.com/item?id=16366815)
* [Hiring Data Scientists: What to Look for?](http://www.dataminingapps.com/2015/06/hiring-data-scientists-what-to-look-for/)
* [I suspect AI today is like big data ten years ago](https://news.ycombinator.com/item?id=16366815)
* A nice example of a "weird" outcome: [Why does Amazon use packages that are too large?](http://www.distractify.com/fyi/2017/12/28/Z1UYuIS/amazon-huge-boxes)

Some older examples, good and bad (not shown in this year's course):

* [How AI is battling the coronavirus outbreak](https://www.vox.com/recode/2020/1/28/21110902/artificial-intelligence-ai-coronavirus-wuhan)
* [How artificial intelligence provided early warnings of the Wuhan virus](https://qz.com/1791222/how-artificial-intelligence-provided-early-warning-of-wuhan-virus/)
* [Would you take a drug discovered by artificial intelligence?](https://www.vox.com/2020/1/31/21117102/artificial-intelligence-drug-discovery-exscientia)
* An example of bad science: [New AI can guess whether you're gay or straight from a photograph](https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph)
* ... but commonplace today: [Facial analysis AI is being used in job interviews – it will probably reinforce inequality](https://theconversation.com/facial-analysis-ai-is-being-used-in-job-interviews-it-will-probably-reinforce-inequality-124790)
* [No self-driving cars yet](https://www.reuters.com/business/autos-transportation/tesla-recalls-nearly-54000-us-vehicles-rolling-stop-software-feature-2022-02-01/)
* [Kickstarter shut down the campaign for AI porn group Unstable Diffusion amid changing guidelines](https://techcrunch.com/2022/12/21/kickstarter-shut-down-the-campaign-for-ai-porn-group-unstable-diffusion-amid-changing-guidelines/)
* [From Deepfake to DignifAI...](https://www.nbcnews.com/tech/internet/conservative-influencers-are-using-ai-cover-photos-sex-workers-rcna137341)
* [AI-Generated 'Seinfeld' Show Banned on Twitch After Transphobic Standup Bit](https://www.vice.com/en/article/y3pymx/ai-generated-seinfeld-show-nothing-forever-banned-on-twitch-after-transphobic-standup-bit)
* [DAN - do anything now](https://www.cnbc.com/2023/02/06/chatgpt-jailbreak-forces-it-to-break-its-own-rules.html)
* [AI-powered Bing Chat spills its secrets via prompt injection attack](https://arstechnica.com/information-technology/2023/02/ai-powered-bing-chat-spills-its-secrets-via-prompt-injection-attack/)
* [Making new Bing angry by making it do something it's both allowed and not allowed to do](https://www.reddit.com/r/ChatGPT/comments/112uczi/making_new_bing_angry_by_making_it_do_something/)
* ["Our new paper shows that diffusion models memorize images from their training data and emit them at generation time"](https://twitter.com/Eric_Wallace_/status/1620449934863642624)
* [Kaggle competition results over time](https://www.kaggle.com/kaggle/meta-kaggle)
* [The brutal fight to mine your data and sell it to your boss](https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss)
* [Google's AI can see through your eyes](https://medium.com/health-ai/googles-ai-can-see-through-your-eyes-what-doctors-can-t-c1031c0b3df4)
* [Robot tanks: On patrol but not allowed to shoot](https://www.bbc.com/news/business-50387954)
* [... and robot fighter pilots](https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots)
* [Dermatologist-level classification of skin cancer with deep neural networks](https://www.nature.com/articles/nature21056.epdf)
* [CLIPasso](https://clipasso.github.io/clipasso/)
* [Clearview AI Once Told Cops To “Run Wild” With Its Facial Recognition Tool. It's Now Facing Legal Challenges](https://www.buzzfeednews.com/article/ryanmac/clearview-ai-cops-run-wild-facial-recognition-lawsuits)
* [Is a supermarket discount coupon worth giving away your privacy?](https://www.latimes.com/business/story/2020-01-21/ralphs-privacy-disclosure)
* Self-driving cars: [1](http://www.nvidia.com/object/drive-px.html), [2](http://kevinhughes.ca/blog/tensor-kart), [3](https://github.com/bethesirius/ChosunTruck), [4](https://www.youtube.com/watch?v=X4u2DCOLoIg), [5](https://selfdrivingcars.mit.edu/), [6](https://eu.udacity.com/course/self-driving-car-engineer-nanodegree--nd013)
* [Pix2pix for image translation](http://affinelayer.com/pixsrv/index.html)
* [Google's "Teachable Machine"](https://teachablemachine.withgoogle.com/)
* [3D Generative-Adversarial Modeling](http://3dgan.csail.mit.edu/)
* [Generating Videos with Scene Dynamics](http://web.mit.edu/vondrick/tinyvideo/)
* DeepFakes: [1](https://boingboing.net/2018/02/13/there-ive-done-something.html), [2](https://www.theverge.com/2018/2/7/16982046/reddit-deepfakes-ai-celebrity-face-swap-porn-community-ban)
* [IQ Test Result: Advanced AI Machine Matches Four-Year-Old Child's Score](https://www.technologyreview.com/s/541936/iq-test-result-advanced-ai-machine-matches-four-year-old-childs-score/)
* [KFC China is using facial recognition tech to serve customers - but are they buying it?](https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition)
* [Dubai police launch AI that can spot crimes](http://newatlas.com/dubai-police-crime-prediction-software/47092/)
* [Chicago turns to big data to predict gun and gang violence](https://www.engadget.com/2016/05/23/chicago-turns-to-big-data-to-predict-gun-and-gang-violence/)
* [The Role of Data and Analytics in Insurance Fraud Detection](http://www.insurancenexus.com/fraud/role-data-and-analytics-insurance-fraud-detection)
* [This employee ID badge monitors and listens to you at work — except in the bathroom](https://www.washingtonpost.com/news/business/wp/2016/09/07/this-employee-badge-knows-not-only-where-you-are-but-whether-you-are-talking-to-your-co-workers/)

# Course 2: February 17

## Slides

* [Preprocessing and Feature Engineering](./slides/2 - Preprocessing.pdf)
* [Supervised Modeling](./slides/3 - Supervised Modeling.pdf)

## Recording

* [YouTube](https://youtu.be/64c6ErUgmlo)

## Background Information

In the news:

* [LIMO: Less Is More for Reasoning 🚀](https://github.com/GAIR-NLP/LIMO)
* [The Anthropic Economic Index](https://www.anthropic.com/news/the-anthropic-economic-index)
* [Introducing Perplexity Deep Research](https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research)
* [Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance](https://humanaigc.github.io/animate-anyone-2/)
* [Using AI to decode language from the brain and advance our understanding of human communication](https://ai.meta.com/blog/brain-ai-research-human-communication/)

Extra references on preprocessing:

* [Forecasting with Google Trends](https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4)
* [Google Street View in insurance](https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf)
* [Predicting the State of a House Using Google Street View](https://link.springer.com/chapter/10.1007/978-3-031-05760-1_46)
* Packages for missing value summarization: [missingno](https://github.com/ResidentMario/missingno) and [VIM](https://cran.r-project.org/web/packages/VIM/index.html)
* [MICE is also a popular package for dealing with missing values in R](https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/)
* [More on data "leakage" and why you should avoid it](https://www.kaggle.com/alexisbcook/data-leakage)
* [Another excellent presentation on the types of data leakage](https://www.slideshare.net/YuriyGuts/target-leakage-in-machine-learning)
* [`smbinning`, an R package for weights of evidence encoding](https://cran.r-project.org/web/packages/smbinning/index.html)
* [`category_encoders`: an interesting package containing a wide variety of categorical encoding techniques](http://contrib.scikit-learn.org/category_encoders/)
* [More on the leave-one-out mean](https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748) as discussed on Kaggle
* [More explanation on the hashing trick on Wikipedia](https://en.wikipedia.org/wiki/Feature_hashing)
* [Feature Hashing in Python](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) -- see the sketch below
* [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf)
* [`dm` R package](https://krlmlr.github.io/dm/)
* [`skrub`: Prepping tables for machine learning](https://github.com/skrub-data/skrub)
* [`featuretools`: An open source python framework for automated feature engineering](https://www.featuretools.com/)
* See also [`stumpy`](https://github.com/TDAmeritrade/stumpy) and [`tsfresh`](https://tsfresh.readthedocs.io/)
* [AutoFeat](https://github.com/cod3licious/autofeat)
* [FeatureSelector](https://github.com/WillKoehrsen/feature-selector)
* [OneBM](https://arxiv.org/abs/1706.00327)
* [More information on principal component analysis (PCA)](http://setosa.io/ev/principal-component-analysis/)
* [OpenCV](http://opencv.org/) (for feature extraction from facial images), or see [this page](https://github.com/ageitgey/face_recognition)
* Interesting application of PCA to "understand" the latent features of a deep learning network: [https://www.youtube.com/watch?v=4VAkrUNLKSo](https://www.youtube.com/watch?v=4VAkrUNLKSo)
* [Another application of PCA for understanding model outputs](https://github.com/asabuncuoglu13/sketch-embeddings)
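As a concrete illustration of the hashing trick mentioned above, here is a minimal sketch using scikit-learn's `FeatureHasher`; the column and value names are invented for illustration:

```
from sklearn.feature_extraction import FeatureHasher

# Hash "column=value" strings into a fixed-size vector: the dimensionality no
# longer grows with cardinality, at the cost of occasional hash collisions.
hasher = FeatureHasher(n_features=16, input_type="string")

# Two hypothetical rows with high-cardinality categorical features.
rows = [["city=Leuven", "type=apartment"],
        ["city=Antwerp", "type=house"]]

X = hasher.transform(rows)  # SciPy sparse matrix of shape (2, 16)
print(X.toarray())
```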
Extra references on supervised basics:

* [“Building Bridges between Regression, Clustering, and Classification”](https://arxiv.org/abs/2502.02996)
* [Frank Harrell on stepwise regression](https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
* [Stepwise feature selection in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#sequential-feature-selection), but note that it applies cross-validation
* [L1 and L2 animation](https://nitter.net/itayevron/status/1328421322821693441)
* [aerosolve - Machine learning for humans](https://medium.com/airbnb-engineering/aerosolve-machine-learning-for-humans-55efcf602665)
* [ID3.pdf](./papers/ID3.pdf) and [C45.pdf](./papers/C45.pdf): extra material regarding decision trees
* [Nice video on Entropy and Information](https://www.youtube.com/watch?v=v68zYyaEmEA)
* [CloudForest](https://github.com/ryanbressler/CloudForest), an older but interesting decision tree ensemble implementation with support for three-way splits to deal with missing values, implemented in... Go
* [RIPPER](https://christophm.github.io/interpretable-ml-book/rules.html), [RuleFit](https://github.com/christophM/rulefit) and [Skope-Rules](https://github.com/scikit-learn-contrib/skope-rules)
* [dtreeviz](https://github.com/parrt/dtreeviz) for nicer visualizations, or [pybaobabdt](https://pypi.org/project/pybaobabdt/)
* White box models can be easily deployed, even in Excel... For some fun examples, see [m2cgen](https://github.com/BayesWitnesses/m2cgen), which can convert ML models to Java, C, Python, Go, PHP, ...; [emlearn](https://github.com/emlearn/emlearn), which converts ML code to portable C99 code for microcontrollers; [sklearn-porter](https://github.com/nok/sklearn-porter), which converts scikit-learn models to C, Java and others; and [SKompiler](https://github.com/konstantint/SKompiler), which converts scikit-learn models to SQL queries and... Excel -- [this screencast](https://www.youtube.com/watch?v=7vUfa7W0NpY) shows it in action
* Two newer examples using k-NN: [1](https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea) and [2](https://medium.com/learning-machine-learning/recommending-animes-using-nearest-neighbors-61320a1a5934)

# Course 3: February 26

## Slides

* Continuing with the slides of last time

## Assignment 1

In this assignment, you will construct a predictive model for a classic use case: predicting the sales price of houses. The data set was obtained from a Belgian real estate website and can be [downloaded from here](https://seppe.net/aa/assignment1/data.zip). You will play "competitively" using [this website](https://seppe.net/aa/assignment1/).

* The test set does not contain the target, so you will need to split up the train set accordingly to create your own validation set. The test set supplied in the data is used to rank and assess your model on the competition leaderboard
* Your model will be evaluated using two metrics: the Winkler Score with alpha = 0.20 (more info below) and Mean Absolute Error as a secondary score
* Note that only about half of the test set is used for the "public" leaderboard. That means that the score you will see on the leaderboard is computed using this part of the test set only (you don't know which half). Later on through the semester, submissions are frozen and the results on the "hidden" part will be revealed
* Outliers, noisy data, missing values, a relatively limited list of features, and the peculiar evaluation metric adopted here will make for challenges to be overcome
* You will have received a password and your final group number through email; you need these to make submissions
* The results of your latest submission are used to rank you on the leaderboard. This means it is your job to keep track of different model versions / approaches / outputs in case you'd like to go back to an earlier result
* The leaderboard will be frozen and the hidden results shown a few weeks before the deadline. You should then reflect on both results and explain accordingly in your report. E.g. if you did well on the public leaderboard but not on the hidden one, what might have caused this? The idea is not that you then step in and "fix" your model, but to learn and reflect
* Also, whilst you can definitely try, the goal is not to "win", but to help you reflect on your model's results, see how others are doing, etc.
* Your model needs to be built using Python (or R, Go, Rust, Julia, or whatever you prefer, as long as it involves coding). As an environment, you can use e.g. Jupyter (Notebook or Lab), RStudio, Google Colaboratory, Microsoft Azure Machine Learning Studio... and any additional library or package you want
* Latitude and longitude of the property are included, and you are free to use external, open data sources as well

The Winkler Score was chosen as a metric here since we include an interesting twist for this case. Rather than just producing a point prediction, you will also need to supply a *lower and upper* bound around your point prediction. Obviously, enormously large intervals would capture all true prices, but would not be very useful in practice. The Winkler Score (1972) is a measure used to evaluate prediction intervals. The score is calculated as follows for each instance:

* If the true value falls within the interval, the score (a measure of error, really) is the length of the interval: lower is better
* If not, the score is the length of the interval plus an additional penalty equal to two over alpha times the distance by which the true value falls outside of the interval
* We are using alpha = 0.20 here
* Sometimes the distances are squared, but we do not do so here
* See more info [here](https://otexts.com/fpp3/distaccuracy.html); a worked sketch follows below
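To make the metric concrete, the following is a small illustrative implementation of the definition above (not the exact scoring code used by the leaderboard, and the example prices are made up):

```
import numpy as np

def winkler_score(y_true, lower, upper, alpha=0.20):
    """Mean Winkler score for prediction intervals (lower is better)."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    score = (upper - lower).astype(float)  # base score: interval length
    below = y_true < lower                 # true value under the interval
    above = y_true > upper                 # true value over the interval
    score[below] += (2 / alpha) * (lower[below] - y_true[below])
    score[above] += (2 / alpha) * (y_true[above] - upper[above])
    return score.mean()

# First interval captures the true price (score = width); the second misses it.
print(winkler_score(y_true=[200_000, 350_000],
                    lower=[180_000, 360_000],
                    upper=[240_000, 400_000]))  # (60000 + 140000) / 2 = 100000.0
```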
The first part of your lab report should contain a clear overview of your whole modeling pipeline, including exploratory analysis (if any), preprocessing, construction of the model, set-up of the validation, and results of the model:

* Feel free to include code fragments, tables, visualisations, etc.
* Some groups prefer to write their final report using Jupyter Notebook, which is fine, as long as it is readable top-to-bottom
* You can use any predictive technique/approach you want, though focus on the whole process: general setup, critical thinking, and the ability to get and validate an outcome
* You're free to use unsupervised techniques for your data exploration part, too. If you decide to build a black-box model, including some interpretability techniques to explain it is a nice idea
* Any other assumptions, insights, or thoughts can be included as well: the idea is to take what we've seen in class, get your hands dirty, and try things out
* **Important: All groups should submit the results of their predictive model at least once to the leaderboard before the hidden scores are revealed (I'll warn you in time)**

More info on how to submit can be found on the [submission website](https://seppe.net/aa/assignment1/). The data set has a number of features which are mostly self-explanatory. Important to note, however, is that `price` contains the target, and `id` contains a unique identifier which you should not use as a feature. Note that features can still contain noise, outliers, missing values, etc.

*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

## Recording

* [YouTube](https://youtu.be/gtXAnCHdmYs)

## Background Information

In the news:

* [Large Language Diffusion Models](https://ml-gsai.github.io/LLaDA-demo/)
* [Chat with BadSeek](https://blog.sshh.io/p/how-to-backdoor-large-language-models)
* [Magma: A Foundation Model for Multimodal AI Agents](https://microsoft.github.io/Magma/)
* [WonderHuman](https://wyiguanw.github.io/WonderHuman/)
* [CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image](https://sites.google.com/view/cast4)
* [OmniParser](https://microsoft.github.io/OmniParser/)

Extra references:

* [ROSE](https://cran.r-project.org/web/packages/ROSE/index.html) is a popular package for dealing with over/undersampling in R
* [imblearn](https://imbalanced-learn.org/stable/) contains many smart sampling implementations for Python
* [Tuning Imbalanced Learning Sampling Approaches](https://www.dataminingapps.com/2019/06/tuning-imbalanced-learning-sampling-approaches/)
* More on the ROC curve: [1](https://arxiv.org/pdf/1812.01388.pdf), [2](http://www.rduin.nl/presentations/ROC%20Tutorial%20Peter%20Flach/ROCtutorialPartI.pdf), [3](https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221)
* [Averaging ROC curves for multiclass](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)
* [The Relationship Between Precision-Recall and ROC Curves](./papers/rocpr.pdf): a paper by KU Leuven's Jesse Davis et al. on the topic, with some other interesting remarks
* [h-index.pdf](./papers/h-index.pdf): paper regarding the h-index as an alternative to AUC
* [BSZ tuning](./papers/bsztuning.pdf): paper on BSZ tuning, a simple cost-sensitive regression approach
* [A blog post explaining cross-validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
* [Multiclass and multilabel algorithms in scikit-learn](https://scikit-learn.org/stable/modules/multiclass.html)
* [scikit.ml](http://scikit.ml/) contains more advanced multilabel techniques
* More on probability calibration [here](http://scikit-learn.org/stable/modules/calibration.html) and [here](http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression)
* [More on the System Stability Index](https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/)
* [Visibility and Monitoring for Machine Learning Models](http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/)
* [What's your ML test score? A rubric for production ML systems](https://research.google.com/pubs/pub45742.html)
* [Hidden Technical Debt in Machine Learning Systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems)

# Course 4: March 3

## Slides

* [Ensemble Models](./slides/4 - EnsembleModels.pdf)
* [Model Interpretability](./slides/5 - Interpretability.pdf)

## Recording

* [YouTube](https://youtu.be/cBmDCUg4hj4)

## Background Information

Extra references on ensemble models:

* The [jar of jelly beans](https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca)
* The [documentation of scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html) is very complete in terms of ensemble modeling
* Kaggle post on [model stacking](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)
* [Random forest.pdf](./papers/Random forest.pdf): the original paper on random forests
* [ExtraTrees](https://scikit-learn.org/stable/modules/ensemble.html#forest): "In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule"
* Also interesting to note is that scikit-learn's implementation of decision trees (and random forests) supports [multi-output problems](https://scikit-learn.org/stable/modules/tree.html#tree-multioutput)
* Note that some implementations/papers for ExtraTrees go a step further and simply select a splitting point completely at random (i.e. the subset of thresholds has size 1 -- this is helpful when working with very noisy features)
* [To tune or not to tune the number of trees in a random forest](./papers/tune_or_not.pdf); conclusion: use a sufficiently high number of trees
* [Adaboost.pdf](./papers/Adaboost.pdf): the original paper on AdaBoost
* [alr.pdf](./papers/alr.pdf): Friedman's paper on AdaBoost and Additive Logistic Regression
* [xgboost documentation](https://xgboost.readthedocs.io/en/latest/) with a good [introduction](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
* [lightgbm documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/)
* [catboost documentation](https://catboost.ai/en/docs/)
* Note that all three of these have sklearn-API compatible classifiers and regressors, so you can combine them with other typical sklearn steps (see the sketch below)
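To illustrate that last point, here is a minimal sketch dropping (in this case) LightGBM into standard scikit-learn tooling; the synthetic data set is just a stand-in:

```
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)

# LGBMClassifier follows the scikit-learn estimator API, so it can be
# dropped into pipelines, grid searches, cross-validation, and so on.
model = make_pipeline(StandardScaler(), LGBMClassifier(n_estimators=200))
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```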
Extra references on interpretability:

* [Fantastic book on the topic of interpretability](https://christophm.github.io/interpretable-ml-book/)
* [Beware of using feature importance!](http://explained.ai/rf-importance/index.html)
* [Permutation importance: a corrected feature importance measure](https://academic.oup.com/bioinformatics/article/26/10/1340/193348)
* [Interpreting random forests: Decision path gathering](http://blog.datadive.net/interpreting-random-forests/)
* [Local interpretable model-agnostic explanations](https://github.com/marcotcr/lime)
* [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap)
* [Another great overview](https://github.com/jphall663/awesome-machine-learning-interpretability)
* [rfpimp package](https://pypi.org/project/rfpimp/)
* [Forest floor](http://forestfloor.dk/) for higher-dimensional partial dependence plots
* [The pdp R package](https://cran.r-project.org/web/packages/pdp/pdp.pdf)
* [The iml R package](https://cran.r-project.org/web/packages/iml/index.html)
* [Descriptive mAchine Learning EXplanations (DALEX) R package](https://github.com/pbiecek/DALEX)
* [eli5 for Python](https://eli5.readthedocs.io/en/latest/index.html)
* [Skater for Python](https://github.com/datascienceinc/Skater)
* scikit-learn has Gini-reduction based importance, but permutation importance has [been added in recent versions](https://scikit-learn.org/stable/modules/permutation_importance.html)
* Or with [https://github.com/parrt/random-forest-importances](https://github.com/parrt/random-forest-importances)
* Or with [https://github.com/ralphhaygood/sklearn-gbmi](https://github.com/ralphhaygood/sklearn-gbmi) (sklearn-gbmi)
* [pdpbox for Python](https://github.com/SauceCat/PDPbox)
* [vip for Python (and R)](https://koalaverse.github.io/vip/index.html)
* [A brief history of machine learning models explainability](https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc)
* Graft, Reassemble, Answer delta, Neighbour sensitivity, Training delta (GRANT) - [https://github.com/wagtaillabs](https://github.com/wagtaillabs)
* [Classification Acceleration via Merging Decision Trees](./papers/mergingtrees.pdf)

# Course 5: March 10

## Slides

* [Deep Learning Part 1: Foundations and Images](./slides/6 - DeepLearning.pdf)

## Assignment 2

In this assignment, you will work with a GeoGuessr-style data set of Street View images, collected around mountainous areas across 12 countries. GeoGuessr is a browser-based geography game in which players must deduce locations from Google Street View imagery. Some players have become extremely good at the game, in some cases even knowing where they are after looking at a single image.

* The data set consists of roughly 100 panoramic images per country
* Download the data from [this Google Drive link](https://drive.google.com/file/d/1jcm_4wQzLE3tOhQf-MJI_QmH76Cj4KnW/view?usp=sharing)

Your primary goal is to predict the country based on the image. You can use any deep learning library you want (Keras is recommended). Using pre-trained models is allowed and likely to help a lot; image augmentation might be useful as well. If necessary, feel free to resize the images (using Python or whatever tool). Explore and experiment. You should approach this as a "how far can I get in a small amount of time" style project, like you would be facing in real life. I don't expect fantastic results in terms of accuracy, but it'll be interesting to see what you can do with this. Important: you need to make sure to perform a good train/test split. The [Keras code examples on computer vision](https://keras.io/examples/vision/) are a good place to start.
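As a rough starting point, a transfer-learning skeleton in Keras could look like the sketch below; the choice of EfficientNetB0, the 224×224 input size, and the `train_ds`/`val_ds` dataset names are assumptions for illustration, not requirements:

```
import keras
from keras import layers

NUM_CLASSES = 13  # or fewer, if you reduce the task

# Pre-trained backbone without its classification head; freeze it at first.
base = keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    layers.RandomFlip("horizontal"),  # light augmentation
    layers.RandomRotation(0.05),
    base,
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds could come from keras.utils.image_dataset_from_directory(...)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Once the new head has converged, unfreezing (part of) the backbone with a low learning rate is the usual next step.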
Some additional tips and pointers:

* Given the small number of images per country, fine-tuning an existing model is probably a good idea
* For extra style points, you can look up how Google Street View constructs its "panorama" images and see whether you can slice them into normal-looking ones. This might help in terms of data augmentation
* Try using an interpretability technique to figure out what your model is focusing on
* The main goal is to classify an image into one of the 13 different countries. If that is too hard, you can try reducing the number of countries. You can also try to relabel the images yourself, e.g. into snow / no snow
* Alternatively, the file names are formatted as `[timestamp]_[lat]_[lon].jpg`, so if you really want a crazy difficult challenge, you could also look for a way to predict the coordinates. Perhaps there exist some pretrained models for this already?
* The script used to collect the images can be found [here](./assignment2/scrape_images.py), which you can use to test out your model on a new bunch of images should you wish to do so

The second part of your lab report should contain:

* An overview of your full pipeline, including architecture, trade-offs, ways used to prevent overfitting, etc.
* Results based on your chosen evaluation metric
* An illustration of your model's predictions on a couple of test images

*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

## Recording

* [YouTube](https://youtu.be/6WhmuINm_JE)

## Background Information

In the news:

* [Manus AI](https://manus.im/)
* [Mercury - a diffusion-based LLM](https://www.inceptionlabs.ai/news)
* [Meshpad](https://derkleineli.github.io/meshpad/)
* [Phi 4](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/)
* [LLM Post-Training: A Deep Dive into Reasoning Large Language Models](https://arxiv.org/pdf/2502.21321)
* [Vibe coding](https://arstechnica.com/ai/2025/03/is-vibe-coding-with-ai-gnarly-or-reckless-maybe-some-of-both/)

Extra references:

* (See slides for most references)
* [Keras Vision tutorials - use these to get started!](https://keras.io/examples/vision/)
* [A brief history of AI](https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html)
* [Who invented the reverse mode of differentiation](https://www.math.uni-bielefeld.de/documenta/vol-ismp/52_griewank-andreas-b.pdf)
* [Backpropagation explained](http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html)
* [Great short YouTube playlist explaining ANNs (3blue1brown)](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
* [And another one explaining convolutions (3blue1brown)](https://www.youtube.com/watch?v=8rrHTtUzyZA)
* [Introduction to neural networks](https://victorzhou.com/blog/intro-to-neural-networks/)
* [Another link explaining ANNs](http://www.emergentmind.com/neural-network)
* [Tensorflow playground](https://playground.tensorflow.org/)

# Course 6: March 17

## Slides

* Continuing with the slides of last time

## Recording

* [YouTube](https://youtu.be/UiAh4gkaRyU)

## Background Information

In the news:

* [Gemini 2 Flash](https://deepmind.google/technologies/gemini/flash/)
* [Block Diffusion](https://huggingface.co/papers/2503.09573)
* [Baidu Ernie](https://techcrunch.com/2025/03/16/baidu-launches-two-new-versions-of-its-ai-model-ernie/)
* [Gemma 3](https://blog.google/technology/developers/gemma-3/)
* [Blender MCP](https://github.com/ahujasid/blender-mcp)
* [More on agentic models](https://x.com/nikitabase/status/1900941231808516194)
* [OWL](https://github.com/camel-ai/owl)

# Course 7: March 24

## Slides

* [Unsupervised Modeling](./slides/7 - UnsupervisedModeling.pdf)

## Recording

* [YouTube](https://youtu.be/T73cTpS2sCw)

## Background Information

In the news:

* [Deep Learning is Not So Mysterious or Different](https://arxiv.org/abs/2503.02113)
* [Improving Recommendation Systems & Search in the Age of LLMs](https://eugeneyan.com/writing/recsys-llm/) and [beeformer](https://github.com/recombee/beeformer)
* [Data Formulator](https://github.com/microsoft/data-formulator)
* [SpatialLM](https://huggingface.co/manycore-research/SpatialLM-Llama-1B)
* [StarVector](https://huggingface.co/collections/starvector/starvector-models-6783b22c7bd4b43d13cb5289)
* [The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5188231)

Extra references:

* [Comparing different hierarchical linkage methods on toy datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html#sphx-glr-auto-examples-cluster-plot-linkage-comparison-py)
* [Visualisation of the DBSCAN clustering technique in the browser](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
* [More on the Gower distance](https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)
* [Self-Organising Maps for Customer Segmentation using R](https://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/)
* [t-SNE](https://lvdmaaten.github.io/tsne/) and using it for [anomaly detection](https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00)
* [http://distill.pub/2016/misread-tsne/](http://distill.pub/2016/misread-tsne/) provides very interesting visualisations and more explanation on t-SNE
* [Be careful when clustering the output of t-SNE](https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne/264647#264647)
* [UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)
* [PixPlot](https://dhlab.yale.edu/projects/pixplot/): another cool example of t-SNE (this was the name I was trying to recall during class)
* [Isolation forests](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py)
* [Local outlier factor](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py)
* [Twitter's anomaly detection package](https://github.com/twitter/AnomalyDetection) and [prophet](https://facebook.github.io/prophet/)
* [An interesting article on detecting NBA all-stars using CADE](http://darrkj.github.io/blog/2014/may102014/)
* Papers on [DBSCAN](./papers/dbscan.pdf), [isolation forests](./papers/iforest.pdf) and [CADE](./papers/CADE.pdf)

# Course 8: March 31

## Slides

* [Data Science Tools](./slides/8 - DataScienceTools.pdf)

## Recording

* [YouTube](https://youtu.be/1F5gMEDuKbQ)

## Background Information

In the news:

* [Introducing 4o Image Generation](https://openai.com/index/introducing-4o-image-generation/)
* [Qwen2.5-VL-32B: Smarter and Lighter](https://qwenlm.github.io/blog/qwen2.5-vl-32b/)
* [Deciphering language processing in the human brain through LLM representations](https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/)
* [AI will change the world, but not in the way you think](https://thomashunter.name/posts/2025-03-19-ai-llms-will-change-the-world)
* [DeepSeek-V3.1](https://deepseek.ai/blog/deepseek-v31)
* [A Recipe for Generating 3D Worlds From a Single Image](https://katjaschwarz.github.io/worlds/)
* [Tracing the thoughts of a large language model](https://www.anthropic.com/research/tracing-thoughts-language-model)

Extra references:

* [Is a Dataframe Just a Table?](https://plateau-workshop.org/assets/papers-2019/10.pdf)
* Modern R with the [tidyverse](https://www.tidyverse.org/)
* [https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/) for cheat sheets on most tidy packages
* [caret](http://caret.r-forge.r-project.org/), [mlr3](https://mlr3.mlr-org.com/), [modelr](https://modelr.tidyverse.org/) and [tidymodels](https://www.tidymodels.org/)
* [http://r4ds.had.co.nz/](http://r4ds.had.co.nz/): the R for Data Science book
* Visualizations with R: `ggplot2` and [`ggvis`](http://ggvis.rstudio.com/); also see [`shiny`](https://shiny.rstudio.com/)
* Other R packages: see slides
* [Learn about NumPy broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
* [Minimally sufficient pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428): a great tour through pandas' API
* [What's new in pandas](https://pandas.pydata.org/docs/dev/whatsnew/)
* [scikit-learn](https://scikit-learn.org/stable/) and [statsmodels](https://www.statsmodels.org/stable/index.html)
* An older but fun comparison of Python visualization libraries: [https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/)
* [Datashader](https://datashader.org/) is a package to render massive data sets (also showed SandDance in class)
* Other Python packages: see slides; [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/), [scikit-multilearn](http://scikit.ml/) and [semisup-learn](https://github.com/tmadl/semisup-learn) are nice to know about, though scikit-learn has some support for many of these as well
* Time series: [prophet](https://facebook.github.io/prophet/), [darts](https://github.com/unit8co/darts) and [statsforecast](https://github.com/Nixtla/statsforecast) are useful to check out as well
* Linking between R and Python: see e.g. [rpy2](https://rpy2.readthedocs.io/en/version_2.8.x/) (R in Python) or [reticulate](https://github.com/rstudio/reticulate) (Python in R)
* d3.js galleries: [1](https://github.com/mbostock/d3/wiki/Gallery), [2](http://bl.ocks.org/), [3](http://bl.ocks.org/mbostock), [4](https://bost.ocks.org/mike/)
* ["The Pudding"](https://pudding.cool/) uses d3 for some fun digital stories
* Also take a look at [Explorable Explanations](https://explorabl.es/) and [Complexity Explorables](http://www.complexity-explorables.org/) for some great visualization examples
* [Plotly Dash](https://plot.ly/products/dash/)
* Working with large files: [ff](https://cran.r-project.org/web/packages/ff/index.html), [bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html), [disk.frame](https://github.com/xiaodaigh/disk.frame), [Dask](https://github.com/dask/dask), [Pandas on Ray](https://modin.readthedocs.io/en/latest/pandas_on_ray.html), [vaex](https://github.com/vaexio/vaex), [modin](https://github.com/modin-project/modin) and [SFrame](https://github.com/apple/turicreate)
* And [DuckDB](https://duckdb.org/), [Polars](https://www.pola.rs/) and [Ibis](https://ibis-project.org/)
* [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html)
* [papermill](https://github.com/nteract/papermill) and the associated [blog post](https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6)
* [ploomber](https://github.com/ploomber/ploomber)
* Some issues with notebooks: [1](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit), [2](https://yihui.name/en/2018/09/notebook-war/)
* Hosted notebooks: [Azure ML Studio](https://studio.azureml.net/), [Google Colab](https://colab.research.google.com/) and [Kaggle Kernels](https://www.kaggle.com/kernels) are free options
* Modular code development with [nbdev](https://github.com/fastai/nbdev)
* [Getting started with conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html)
* [Labeling tools overview](https://github.com/heartexlabs/awesome-data-labeling)
* [human-learn](https://github.com/koaning/human-learn)
* [snorkel-ai](https://snorkel.ai/platform/), [snorkel](https://github.com/snorkel-team/snorkel) and [their paper](https://arxiv.org/abs/1711.10160)
* [FlyingSquid](https://github.com/HazyResearch/flyingsquid)
* [Alteryx Compose](https://github.com/alteryx/compose)
* [Mostly.ai](https://mostly.ai/) - one of the many "synthetic data" companies
* [Learning the Gitflow git workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow)
* [Hidden technical debt in machine learning systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
* ML models degrade silently! [1](https://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214), [2](https://www.elastic.co/blog/beware-steep-decline-understanding-model-degradation-machine-learning-models), [3](https://mlinproduction.com/model-retraining/)
* [DevOps and ML/AI](https://www.tecton.ai/blog/devops-ml-data/)
* Some cool in-house tools: [Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/), [Manifold](https://github.com/uber/manifold), [D3](https://www.uber.com/en-BE/blog/d3-an-automated-system-to-detect-data-drifts/), [FBLearner Flow](https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/), [Airflow and Luigi](https://medium.com/better-programming/airbnbs-airflow-versus-spotify-s-luigi-bd4c7c2c0791), [the Netflix ML platform](https://research.netflix.com/research-area/machine-learning-platform) and [Bighead](https://databricks.com/session/bighead-airbnbs-end-to-end-machine-learning-platform)
* [Poetry](https://python-poetry.org/), [Poe the Poet](https://github.com/nat-n/poethepoet) and [cookiecutter](https://github.com/cookiecutter/cookiecutter)
* [What’s your ML test score? A rubric for ML production systems](https://research.google/pubs/pub45742/)
* Some newer MLOps products: [ZenML](https://zenml.io/why-ZenML/), [MonaLabs](https://www.monalabs.io/) and [Evidently.ai](https://evidentlyai.com/)
* [https://github.com/cleanlab/cleanlab](https://github.com/cleanlab/cleanlab)

## Assignment 3

The third assignment consists of the construction of a predictive model using Spark (Structured) Streaming and textual data. You will work with data coming from [Arxiv.org](https://arxiv.org/), the free distribution service and open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

A data streamer server is already set up for you as follows:

- Published papers are monitored, and their categories, title, and abstract are extracted. We do so by polling the API every so often. Example endpoint: [https://export.arxiv.org/api/query?search_query=submittedDate:[202503251500+TO+202503261500]&max_results=2000](https://export.arxiv.org/api/query?search_query=submittedDate:[202503251500+TO+202503261500]&max_results=2000)

Next, the information is exposed to you through a streaming data source running at `seppe.net:7778`. When you connect to it, this will provide publications to you one by one:

- We fetch publications starting from 96 hours ago and drip-feed them over the connection
- We start from 96 hours ago so you can more easily test the connection by immediately receiving publications; the reason we go back so far is that no publications are accepted during the weekend
- Next, whilst the stream is kept open, we simply continue to send newly published articles as they arrive

The stream is provided as a textual data source with one article per line, formatted as a JSON object, e.g.:

```
{
  "aid": "http://arxiv.org/abs/2503.19871v1",
  "title": "A natural MSSM from a novel $\\mathsf{SO(10)}$ [...]",
  "summary": "The $\\mathsf{SO(10)}$ model [...]",
  "main_category": "hep-ph",
  "categories": "hep-ph,hep-ex",
  "published": "2025-03-25T17:36:54Z"
}
```
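For orientation, reading and archiving this stream with Spark Structured Streaming could look roughly like the sketch below (the example notebooks provided with the assignment cover this as well; the `spark` session is assumed to be set up as in those notebooks, and the output paths are placeholders):

```
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Schema matching the JSON objects shown above (all fields arrive as strings).
schema = StructType([StructField(f, StringType()) for f in
                     ["aid", "title", "summary",
                      "main_category", "categories", "published"]])

# Each line received over the socket is one JSON-encoded article.
lines = (spark.readStream.format("socket")
         .option("host", "seppe.net").option("port", 7778).load())
articles = lines.select(F.from_json("value", schema).alias("a")).select("a.*")

# Append incoming articles to disk to build up a historical training set.
query = (articles.writeStream.format("json")
         .option("path", "data/arxiv")                   # placeholder paths
         .option("checkpointLocation", "data/checkpoint")
         .start())
```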
The goal of this assignment is threefold:

* 1 - Collect a historical set of data
  * *Important: get started with this as soon as possible. We will discuss Spark and text mining in more detail later on, but you can already start gathering your data*
* 2 - Construct a predictive model that predicts the categories an article belongs to:
  * There are different ways to approach this question: you can either try to predict the `main_category` (in which case it is a multiclass problem), or try to tackle it as a multilabel problem by trying to predict all of the `categories` (comma separated)
  * The second question is which categories you want to include: categories which contain a hyphen, such as `hep-ph` above, are a subcategory of the broader `hep`, so you might wish to reduce the number of classes by only focussing on the main categories (or focussing only on articles belonging to a single main category, such as computer science, `cs`)
  * You can see all the categories over at [https://arxiv.org/](https://arxiv.org/)
  * You can use any predictive model you want, but groups that incorporate a small LLM or a more modern textual model ([https://huggingface.co/facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) is a very good start, for instance) will be rewarded for this
  * If you want to use a more traditional approach (TF-IDF plus a classifier), try to use Spark's built-in ML models (see the sketch after this list)
* 3 - Show that your model can make predictions in a "deployed" setting
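For the traditional route above, a minimal `spark.ml` pipeline might look like this sketch, assuming you have assembled a DataFrame `train_df` with `summary` and `main_category` columns from your collected data:

```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

# TF-IDF features from the abstract text, label index from the main category.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="summary", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=2**16),
    IDF(inputCol="tf", outputCol="features"),
    StringIndexer(inputCol="main_category", outputCol="label"),
    LogisticRegression(maxIter=50),
])

model = pipeline.fit(train_df)  # train_df: your saved historical articles
# model.write().overwrite().save("arxiv_model")  # so the "deployed" part can load it
```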
**Setting up Spark**

Since the data set we'll work with is still relatively small, you will (luckily) not need a cluster of machines, but can run Spark locally on your machine (and save the data locally as well).

* First, download the ZIP file from [this link](./assignment3/spark.zip) and extract it somewhere, e.g. on your Desktop. This ZIP file contains the latest stable Spark version available at this time (3.5.5)
* If you prefer to follow along with video instructions, an MP4 file is contained in the ZIP with a walkthrough for Windows and Mac users
* We will use `pixi.sh` to set up our environment. Pixi is a modern alternative to Conda, but you can use Conda as well (in which case you install the packages below using `conda install`). Download Pixi for your platform over at [https://github.com/prefix-dev/pixi/releases/](https://github.com/prefix-dev/pixi/releases/) and put the executable file (e.g. `pixi.exe` for Windows) in the `spark` folder you just extracted
* Next, we install all packages we need in the environment. In a Terminal or command line window, first navigate to the `spark` folder and run `pixi init` to initialize the environment. On Mac, use `./pixi` instead of `pixi` in these commands. Then run `pixi add python=3.11 pyspark findspark jupyter openjdk=11` to install the necessary packages
* Mac users will probably also have to make the Spark binaries executable (in case you get a PermissionError in the notebooks); you can do so by running this command in the `spark` directory: `chmod +x ./spark-3.5.5-bin-hadoop3/bin/*`
* You can then start Jupyter using `pixi run jupyter notebook`

**Example notebooks**

Once you have Jupyter open, explore the example notebooks under `notebooks`. **Important: the first cell in these notebooks uses `findspark` to initialize Spark and its contexts. You will need to add the same cell to all new notebooks you create.**

* `spark_example.ipynb`: Try this first! This is a simple Spark example to calculate pi and serves as a check to see whether Spark is working correctly
* `spark_streaming_example.ipynb`: A simple Spark Streaming example that prints out the data you'll work with. This is a test to see whether you can receive the data
* `spark_streaming_example_saving.ipynb`: A simple Spark Streaming example that saves the data. Use this to get started saving your historical set
* `spark_streaming_example_predicting.ipynb`: A very naïve prediction approach
* `spark_structured_streaming_example.ipynb`: An example using Spark Structured Streaming

**Objective**

Using Spark, your task for this assignment is as follows:

* 1 - Collect a historical set of data
  * Get started with this as soon as possible
  * Make sure to set up Spark using the instructions posted above
* 2 - Construct a predictive model
  * The stream is text-based, with each line containing one message (one instance) formatted as a JSON dictionary
  * You are strongly encouraged to build your model using `spark.ml` (MLlib), but you can use `scikit-learn` as a fallback
  * Alternatively, use a more modern model, as described above
  * Pick between the multiclass vs. multilabel and all categories vs. main categories tasks
* 3 - Show that your model can make predictions in a "deployed" setting (see the sketch after these remarks)
  * I.e. show that you can connect to the data source, preprocess/featurize incoming messages, have your model predict the label, and show it, similar to `spark_streaming_example_predicting.ipynb` (but using a smarter, real predictive model)
  * This means that you'll need to look for a way to save and load your trained model, if necessary
  * The goal is not to obtain a perfect predictive accuracy, but mainly to make sure you can set up Spark and work in a streaming environment

The third part of your lab report should contain:

* An overview of the steps above, the source code of your programs, as well as the output after running them
* Feel free to include screenshots or info on encountered challenges and how you dealt with them
* Even if your solution is not fully working or not working correctly, you can still receive marks for this assignment based on what you tried and how you'd need to improve your end result

**Further remarks**

* Get started with setting up Spark and fetching data as quickly as possible
* Make sure to have enough data to train your model. New publications arrive relatively slowly (during some days, no articles might appear at all, whereas other days will be very busy)
* The data stream is line-delimited with every line containing one instance in JSON format, but it can easily be converted to a DataFrame (and RDD). The example notebooks give some ideas on how to do so
* You can use either Spark Streaming or Spark Structured Streaming
* Don't be afraid to ask e.g. ChatGPT or Claude for help to code up your approach; this is certainly permitted, but make sure not to get stuck in "vibe coding" where you have a notebook spanning twenty pages without knowing what you're really doing anymore
* Do let me know in case the streaming server crashes
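Putting the "deployed" part together, a naïve scoring loop with the classic DStream API could look like this sketch; it assumes the `ssc` and `spark` contexts from the example notebooks and the hypothetical `arxiv_model` saved earlier:

```
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

def predict(time, rdd):
    """Score each mini-batch of incoming articles with the trained pipeline."""
    if rdd.isEmpty():  # empty mini-batches do occur, see the FAQ below
        return
    # Load the saved model once and cache it in globals(), as the FAQ suggests.
    if "model" not in globals():
        globals()["model"] = PipelineModel.load("arxiv_model")
    batch = spark.read.json(rdd)  # one JSON article per line
    preds = globals()["model"].transform(batch)
    preds.select("aid", "main_category", "prediction").show(truncate=False)

lines = ssc.socketTextStream("seppe.net", 7778)
lines.foreachRDD(predict)
ssc.start()
```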
*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

**FAQ**

* The first cell in the example notebook fails (the one using `findspark`) - This cell attempts to do a couple of things: set the `SPARK_HOME` environment variable to the right directory; on Windows: set the `HADOOP_HOME` environment variable to the right `winutils` subfolder (necessary for Spark to work); initialize `findspark`; and then construct the different Spark contexts. Inspect the output and make sure the path names are correct
* Spark cannot be initialized, the notebook or command line shows an error "getSubject is supported only if a security manager is allowed" - Your `openjdk` version is too recent; make sure you have installed version 11 in your environment
* Everything seems to work but I get a lot of warnings on the console or in the notebook - Spark is very verbose. On Mac, warnings are shown in the notebook itself, which makes it more annoying to use. If you don't like that, Google how to stop Jupyter from capturing stderr
* I can't save the stream... everything else seems fine - Make sure you're calling the `saveAsTextFiles` function with "file:///" prepended to the path: `lines.saveAsTextFiles("file:///C:/...")`. Also make sure that the folder where you want to save the files exists. Note that the `saveAsTextFiles` method expects a *directory name* as the argument. It will automatically create a folder for each mini-batch of data
* Can I prevent the `saveAsTextFiles` function from creating so many directories and files? - You can first repartition the RDD to one partition before saving it: `lines.repartition(1).saveAsTextFiles("file:///C:/...")`. To reduce the number of directories, change the trigger time to e.g. `ssc = StreamingContext(sc, 60)`, though this will still create multiple directories. Setting the trigger interval much higher is not really recommended, as you wouldn't want to lose data in case something goes wrong
* So if I still end up with multiple directories, how do I read them in? - It's pretty easy to loop over subdirectories in Python. Alternatively, the `sc.textFile` command is pretty smart and can parse through multiple files in one go
* Is it normal that all my folders only contain `_SUCCESS` files but no actual data files? - That depends. A `_SUCCESS` file indicates that the mini-batch was saved correctly. `part-*` files contain the actual data. And files ending with `.crc` contain a checksum. It's normal that not all of your folders contain `part-*` data, when no data was received in that time frame. However, if none of your folders contain data, especially not after you have restarted the notebook, something else has gone wrong. Try the `spark_streaming_example.ipynb` notebook to verify whether you're at least receiving data at all
* Is there a way to monitor Spark? - Yes, go to [http://127.0.0.1:4040/](http://127.0.0.1:4040/) in your browser while Spark is running and you'll get access to a monitoring dashboard. Under the "Environment" tab, you should be able to find a "spark.speculation" entry, for instance, w.r.t. the question above. Under "Jobs", "Stages", and "Streaming", you can get more info on how things are going
* I'm trying to convert my saved files to a DataFrame, but Spark complains for some files? - Data is always messy, especially data provided by this instructor. Make sure you can handle badly formatted lines and discard them
* My stream crashes after a while with an "RDD is empty" error... - Make sure you're checking for empty RDDs, e.g. `if rdd.isEmpty(): return`
* I've managed to create a model. When I try to apply it on the stream, Spark crashes with a Hive / Derby error, e.g. when I try to `.load()` my model(s) or once the first RDD arrives - Check the example notebooks for ideas on how to load your model into `globals()` once
* When I call `ssc_t.stop()`, Spark never seems to stop the stream - You can try changing `stopGraceFully=True` to `False`. Even then, Spark might not want to stop its stream processing pipeline in case you're doing a lot with the incoming data, preventing Spark from cleaning up. Try decreasing the trigger time, or simply restart the Jupyter kernel to start over
* Spark complains that only one StreamingContext can be active at a time (or "ValueError: Cannot run multiple SparkContexts at once") - A good idea is to (save and) close all running notebooks and start again fresh. Spark doesn't like having multiple contexts running, so it is best to only have one notebook running at a time. (Closing a tab with a notebook does not mean that the *kernel* is stopped, however; check the "Running" tab on the Jupyter main page.)
* Why do I receive the same instances (or: why do I have instances twice) when reconnecting? - To make sure you are served data right away, the stream server starts from a while back and works its way up to the current time. You can remove duplicate instances based on the `aid` identifier
* Can I use R? - There are two main Spark R packages available: `SparkR` (the official one) and `sparklyr` (from the folks at RStudio; it fits better with the tidyverse). You can try using these, but you'll have to do some setting up so that R can find your Spark installation. I'd strongly recommend using Python
* The server is just a socket server, so can't we just get the data that way? - For those who know, yes, basically: `nc seppe.net 7778`; indeed, in this case it would be easy to do this in Python directly

# Resources

## Books

If you want an exhaustive list of data science books (not required for the course), feel free to check out [https://github.com/chaconnewu/free-data-science-books](https://github.com/chaconnewu/free-data-science-books), neatly ordered by topic and level (beginner to veteran). This repository is also interesting: [https://github.com/bradleyboehmke/data-science-learning-resources](https://github.com/bradleyboehmke/data-science-learning-resources). And another two full of books: [https://github.com/Saurav6789/Books-](https://github.com/Saurav6789/Books-) and [https://github.com/yashnarkhede/Data-Scientist-Books](https://github.com/yashnarkhede/Data-Scientist-Books).

## Python Tutorials

Python itself is [quite easy](https://learnxinyminutes.com/docs/python/); you mainly need to figure out the additional libraries and their usage. Try to become familiar with NumPy, pandas, and scikit-learn first, e.g. [play along with a couple of these tutorials](https://scikit-learn.org/stable/tutorial/index.html). The bottom of [this page](https://learnxinyminutes.com/docs/python/) also lists some more resources to learn Python.
The following are quite good:

* [A Crash Course in Python for Scientists](https://nbviewer.jupyter.org/gist/anonymous/5924718)
* [Dive Into Python 3](https://diveintopython3.net/index.html)
* [https://docs.python-guide.org/](https://docs.python-guide.org/) (a bit more intermediate)
* Someone has also posted [this 100 Page Python Intro](https://learnbyexample.github.io/100_page_python_intro/introduction.html)

## DataCamp

To get access to DataCamp, use this [registration link](https://www.datacamp.com/groups/shared_links/c80035336a272e42cf6e73f687cfb18d0f4fd2a1762e784c76df2b5eecdb72a0). Note that this will require a @(student.)kuleuven.be email address. If you'd like to use a personal email instead (e.g. because you already have an account on DataCamp), send me an email.