# Advanced Analytics for a Big Data World (2025)

This page contains slides, references and materials for the "Advanced Analytics for a Big Data World" course.

*Last updated on 2025-03-26*

# Course 1: February 12

## Slides

* [About the Course](./slides/0 - About Course.pdf)
* [Introduction](./slides/1 - Introduction.pdf)

## Assignment Information

The evaluation of this course consists of a lab report (50% of the marks) and a closed-book written exam with both multiple-choice and open questions (50% of the marks).

* Your lab report will consist of your write-ups of four assignments, which will be made available throughout the semester
* You will work in groups of five students
* The four assignments consist of (1) a predictive model competition using R or Python (tabular); (2) a deep learning application (imagery); (3) text mining with Spark Streaming or LLMs (text); (4) a social network/graph analytics assignment (network)
* Per assignment, you describe your results (screenshots, numbers, approach); more detailed information will be provided per assignment
* You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st

**For forming groups, please see the Toledo page.**

**Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students)**: you are free to partake in (any of) the assignments individually (they'll be posted on this page as well), but not required to.

## Recording

* [YouTube](https://youtu.be/2XZdVOapC3U)

## Background Information

💭 If you like these links, you can also check out our biweekly newsletter, where we gather the latest AI news from around the web. You can subscribe over at: [https://www.dataminingapps.com/dataminingapps-newsletter/](https://www.dataminingapps.com/dataminingapps-newsletter/)

Extra references:

* [The YouTube channel Veritasium has just released a video talking about AlphaFold and the GNoME materials project -- take a look](https://www.youtube.com/watch?v=P_fHJIYENdI)
* [AlphaGo](https://www.engadget.com/2016-03-14-the-final-lee-sedol-vs-alphago-match-is-about-to-start.html) and their newer ["AlphaGo Zero"](https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/)
* [AlphaStar](https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/)
* [AlphaFold](https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery) and a [recent article where it was used](https://arxiv.org/abs/2201.09647)
* [AlphaCode](https://alphacode.deepmind.com/)
* [DALL·E: Creating Images from Text](https://openai.com/blog/dall-e/)
* [Automating My Job with GPT-3](https://blog.seekwell.io/gpt3)
* [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release)
* [ChatGPT](https://openai.com/blog/chatgpt/)
* [Millions of new materials discovered with deep learning](https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/)
* [AlphaGeometry](https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/)
* [DeepSeek explained](https://heidloff.net/article/deepseek-r1/) and the [announcement](https://github.com/deepseek-ai/DeepSeek-R1)
* [Janus Pro is DeepSeek's image generator](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
* [Kimi](https://github.com/MoonshotAI/Kimi-k1.5) and [Qwen](https://github.com/QwenLM/Qwen)
* [Hunyuan3D-2](https://github.com/Tencent/Hunyuan3D-2)
* [The Economics of AI Today](https://thegradient.pub/the-economics-of-ai-today/)
* [Designing great data products: The Drivetrain Approach: A four-step process for building data products](https://www.oreilly.com/radar/drivetrain-approach-data-products/)
* [Google's Rules of ML](https://developers.google.com/machine-learning/guides/rules-of-ml)
* [150 successful machine learning models: 6 lessons learned at Booking.com](https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/) -- recommended read!
* The tank story: [how much is true?](https://www.gwern.net/Tanks)
* [Self-driven car spins in circles](https://twitter.com/mat_kelcey/status/886101319559335936)
* [Correlation is not causation](https://web.archive.org/web/20210413060837/http://robertmatthews.org/wp-content/uploads/2016/03/RM-storks-paper.pdf)
* [Once billed as a revolution in medicine, IBM’s Watson Health is sold off in parts](https://www.statnews.com/2022/01/21/ibm-watson-health-sale-equity/)
* [How To Break Anonymity of the Netflix Prize Dataset](https://arxiv.org/abs/cs/0610105)
* [Why UPS drivers don’t turn left and you probably shouldn’t either](http://www.independent.co.uk/news/science/why-ups-drivers-don-t-turn-left-and-you-probably-shouldn-t-either-a7541241.html)
* [Your Garbage Data Is A Gold Mine](https://www.fastcompany.com/3063110/the-rise-of-weird-data)
* [Weapons of Math Destruction](https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction)
* [Beware the data science pin factory](https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/) -- recommended read!
* [Sky-high salaries in AI](https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons) with [discussion here](https://news.ycombinator.com/item?id=16366815)
* [Hiring Data Scientists: What to Look for?](http://www.dataminingapps.com/2015/06/hiring-data-scientists-what-to-look-for/)
* [I suspect AI today is like big data ten years ago](https://news.ycombinator.com/item?id=16366815)
* A nice example of a "weird" outcome: [Why does Amazon use packages that are too large?](http://www.distractify.com/fyi/2017/12/28/Z1UYuIS/amazon-huge-boxes)

Some older examples, good and bad (not shown in this year's course):

* [How AI is battling the coronavirus outbreak](https://www.vox.com/recode/2020/1/28/21110902/artificial-intelligence-ai-coronavirus-wuhan)
* [How artificial intelligence provided early warnings of the Wuhan virus](https://qz.com/1791222/how-artificial-intelligence-provided-early-warning-of-wuhan-virus/)
* [Would you take a drug discovered by artificial intelligence?](https://www.vox.com/2020/1/31/21117102/artificial-intelligence-drug-discovery-exscientia)
* An example of bad science: [New AI can guess whether you're gay or straight from a photograph](https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph)
* ... but commonplace today: [Facial analysis AI is being used in job interviews – it will probably reinforce inequality](https://theconversation.com/facial-analysis-ai-is-being-used-in-job-interviews-it-will-probably-reinforce-inequality-124790)
* [No self-driving cars yet](https://www.reuters.com/business/autos-transportation/tesla-recalls-nearly-54000-us-vehicles-rolling-stop-software-feature-2022-02-01/)
* [Kickstarter shut down the campaign for AI porn group Unstable Diffusion amid changing guidelines](https://techcrunch.com/2022/12/21/kickstarter-shut-down-the-campaign-for-ai-porn-group-unstable-diffusion-amid-changing-guidelines/)
* [From Deepfake to DignifAI...](https://www.nbcnews.com/tech/internet/conservative-influencers-are-using-ai-cover-photos-sex-workers-rcna137341)
* [AI-Generated 'Seinfeld' Show Banned on Twitch After Transphobic Standup Bit](https://www.vice.com/en/article/y3pymx/ai-generated-seinfeld-show-nothing-forever-banned-on-twitch-after-transphobic-standup-bit)
* [DAN - do anything now](https://www.cnbc.com/2023/02/06/chatgpt-jailbreak-forces-it-to-break-its-own-rules.html)
* [AI-powered Bing Chat spills its secrets via prompt injection attack](https://arstechnica.com/information-technology/2023/02/ai-powered-bing-chat-spills-its-secrets-via-prompt-injection-attack/)
* [Making new Bing angry by making it do something it's both allowed and not allowed to do](https://www.reddit.com/r/ChatGPT/comments/112uczi/making_new_bing_angry_by_making_it_do_something/)
* ["Our new paper shows that diffusion models memorize images from their training data and emit them at generation time"](https://twitter.com/Eric_Wallace_/status/1620449934863642624)
* [Kaggle competition results over time](https://www.kaggle.com/kaggle/meta-kaggle)
* [The brutal fight to mine your data and sell it to your boss](https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss)
* [Google's AI can see through your eyes](https://medium.com/health-ai/googles-ai-can-see-through-your-eyes-what-doctors-can-t-c1031c0b3df4)
* [Robot tanks: On patrol but not allowed to shoot](https://www.bbc.com/news/business-50387954)
* [... and robot fighter pilots](https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots)
* [Dermatologist-level classification of skin cancer with deep neural networks](https://www.nature.com/articles/nature21056.epdf)
* [CLIPasso](https://clipasso.github.io/clipasso/)
* [Clearview AI Once Told Cops To “Run Wild” With Its Facial Recognition Tool. It's Now Facing Legal Challenges](https://www.buzzfeednews.com/article/ryanmac/clearview-ai-cops-run-wild-facial-recognition-lawsuits)
* [Is a supermarket discount coupon worth giving away your privacy?](https://www.latimes.com/business/story/2020-01-21/ralphs-privacy-disclosure)
* Self-driving cars: [1](http://www.nvidia.com/object/drive-px.html), [2](http://kevinhughes.ca/blog/tensor-kart), [3](https://github.com/bethesirius/ChosunTruck), [4](https://www.youtube.com/watch?v=X4u2DCOLoIg), [5](https://selfdrivingcars.mit.edu/), [6](https://eu.udacity.com/course/self-driving-car-engineer-nanodegree--nd013)
* [Pix2pix for image translation](http://affinelayer.com/pixsrv/index.html)
* [Google's "Teachable Machine"](https://teachablemachine.withgoogle.com/)
* [3D Generative-Adversarial Modeling](http://3dgan.csail.mit.edu/)
* [Generating Videos with Scene Dynamics](http://web.mit.edu/vondrick/tinyvideo/)
* DeepFakes: [1](https://boingboing.net/2018/02/13/there-ive-done-something.html), [2](https://www.theverge.com/2018/2/7/16982046/reddit-deepfakes-ai-celebrity-face-swap-porn-community-ban)
* [IQ Test Result: Advanced AI Machine Matches Four-Year-Old Child's Score](https://www.technologyreview.com/s/541936/iq-test-result-advanced-ai-machine-matches-four-year-old-childs-score/)
* [KFC China is using facial recognition tech to serve customers - but are they buying it?](https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition)
* [Dubai police launch AI that can spot crimes](http://newatlas.com/dubai-police-crime-prediction-software/47092/)
* [Chicago turns to big data to predict gun and gang violence](https://www.engadget.com/2016/05/23/chicago-turns-to-big-data-to-predict-gun-and-gang-violence/)
* [The Role of Data and Analytics in Insurance Fraud Detection](http://www.insurancenexus.com/fraud/role-data-and-analytics-insurance-fraud-detection)
* [This employee ID badge monitors and listens to you at work — except in the bathroom](https://www.washingtonpost.com/news/business/wp/2016/09/07/this-employee-badge-knows-not-only-where-you-are-but-whether-you-are-talking-to-your-co-workers/)

# Course 2: February 17

## Slides

* [Preprocessing and Feature Engineering](./slides/2 - Preprocessing.pdf)
* [Supervised Modeling](./slides/3 - Supervised Modeling.pdf)

## Recording

* [YouTube](https://youtu.be/64c6ErUgmlo)

## Background Information

In the news:

* [LIMO: Less Is More for Reasoning 🚀](https://github.com/GAIR-NLP/LIMO)
* [The Anthropic Economic Index](https://www.anthropic.com/news/the-anthropic-economic-index)
* [Introducing Perplexity Deep Research](https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research)
* [Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance](https://humanaigc.github.io/animate-anyone-2/)
* [Using AI to decode language from the brain and advance our understanding of human communication](https://ai.meta.com/blog/brain-ai-research-human-communication/)

Extra references on preprocessing:

* [Forecasting with Google Trends](https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4)
* [Google Street View in insurance](https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf)
* [Predicting the State of a House Using Google Street View](https://link.springer.com/chapter/10.1007/978-3-031-05760-1_46)
* Packages for missing value summarization: [missingno](https://github.com/ResidentMario/missingno) and [VIM](https://cran.r-project.org/web/packages/VIM/index.html)
* [MICE is also a popular package for dealing with missing values in R](https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/)
* [More on data "leakage" and why you should avoid it](https://www.kaggle.com/alexisbcook/data-leakage)
* [Another excellent presentation on the types of data leakage](https://www.slideshare.net/YuriyGuts/target-leakage-in-machine-learning)
* [`smbinning`, an R package for weights of evidence encoding](https://cran.r-project.org/web/packages/smbinning/index.html)
* [`category_encoders`: an interesting package containing a wide variety of categorical encoding techniques](http://contrib.scikit-learn.org/category_encoders/)
* [More on the leave-one-out mean](https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748) as discussed on Kaggle
* [More explanation on the hashing trick on Wikipedia](https://en.wikipedia.org/wiki/Feature_hashing)
* [Feature Hashing in Python](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) -- see the sketch below
* [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf)
* [`dm` R package](https://krlmlr.github.io/dm/)
* [`skrub`: Prepping tables for machine learning](https://github.com/skrub-data/skrub)
* [`featuretools`: An open source python framework for automated feature engineering](https://www.featuretools.com/)
* See also [`stumpy`](https://github.com/TDAmeritrade/stumpy) and [`tsfresh`](https://tsfresh.readthedocs.io/)
* [AutoFeat](https://github.com/cod3licious/autofeat)
* [FeatureSelector](https://github.com/WillKoehrsen/feature-selector)
* [OneBM](https://arxiv.org/abs/1706.00327)
* [More information on principal component analysis (PCA)](http://setosa.io/ev/principal-component-analysis/)
* [OpenCV](http://opencv.org/) (for feature extraction from facial images), or see [this page](https://github.com/ageitgey/face_recognition)
* Interesting application of PCA to "understand" the latent features of a deep learning network: [https://www.youtube.com/watch?v=4VAkrUNLKSo](https://www.youtube.com/watch?v=4VAkrUNLKSo)
* [Another application of PCA for understanding model outputs](https://github.com/asabuncuoglu13/sketch-embeddings)
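As a concrete illustration of the hashing trick mentioned above, here is a minimal sketch using scikit-learn's `FeatureHasher`; the column and value names are invented for illustration:

```
from sklearn.feature_extraction import FeatureHasher

# Hash "column=value" strings into a fixed-size vector: the dimensionality no
# longer grows with cardinality, at the cost of occasional hash collisions.
hasher = FeatureHasher(n_features=16, input_type="string")

# Two hypothetical rows with high-cardinality categorical features.
rows = [["city=Leuven", "type=apartment"],
        ["city=Antwerp", "type=house"]]

X = hasher.transform(rows)  # SciPy sparse matrix of shape (2, 16)
print(X.toarray())
```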
Extra references on supervised basics:

* [“Building Bridges between Regression, Clustering, and Classification”](https://arxiv.org/abs/2502.02996)
* [Frank Harrell on stepwise regression](https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
* [Stepwise feature selection in scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html#sequential-feature-selection), but note that it applies cross-validation
* [L1 and L2 animation](https://nitter.net/itayevron/status/1328421322821693441)
* [aerosolve - Machine learning for humans](https://medium.com/airbnb-engineering/aerosolve-machine-learning-for-humans-55efcf602665)
* [ID3.pdf](./papers/ID3.pdf) and [C45.pdf](./papers/C45.pdf): extra material regarding decision trees
* [Nice video on Entropy and Information](https://www.youtube.com/watch?v=v68zYyaEmEA)
* [CloudForest](https://github.com/ryanbressler/CloudForest), an older but interesting decision tree ensemble implementation with support for three-way splits to deal with missing values, implemented in... Go
* [RIPPER](https://christophm.github.io/interpretable-ml-book/rules.html), [RuleFit](https://github.com/christophM/rulefit) and [Skope-Rules](https://github.com/scikit-learn-contrib/skope-rules)
* [dtreeviz](https://github.com/parrt/dtreeviz) for nicer visualizations, or [pybaobabdt](https://pypi.org/project/pybaobabdt/)
* White box models can be easily deployed, even in Excel... For some fun examples, see [m2cgen](https://github.com/BayesWitnesses/m2cgen), which can convert ML models to Java, C, Python, Go, PHP, ...; [emlearn](https://github.com/emlearn/emlearn), which converts ML code to portable C99 code for microcontrollers; [sklearn-porter](https://github.com/nok/sklearn-porter), which converts scikit-learn models to C, Java and others; and [SKompiler](https://github.com/konstantint/SKompiler), which converts scikit-learn models to SQL queries and... Excel -- [this screencast](https://www.youtube.com/watch?v=7vUfa7W0NpY) shows it in action
* Two newer examples using k-NN: [1](https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea) and [2](https://medium.com/learning-machine-learning/recommending-animes-using-nearest-neighbors-61320a1a5934)

# Course 3: February 26

## Slides

* Continuing with the slides of last time

## Assignment 1

In this assignment, you will construct a predictive model for a classic use case: predicting the sales price of houses. The data set was obtained from a Belgian real estate website and can be [downloaded from here](https://seppe.net/aa/assignment1/data.zip). You will play "competitively" using [this website](https://seppe.net/aa/assignment1/).

* The test set does not contain the target, so you will need to split up the train set accordingly to create your own validation set. The test set supplied in the data is used to rank and assess your model on the competition leaderboard
* Your model will be evaluated using two metrics: the Winkler Score with alpha = 0.20 (more info below) and Mean Absolute Error as a secondary score
* Note that only about half of the test set is used for the "public" leaderboard. That means that the score you will see on the leaderboard is computed using this part of the test set only (you don't know which half). Later on through the semester, submissions are frozen and the results on the "hidden" part will be revealed
* Outliers, noisy data, missing values, a relatively limited list of features, and the peculiar evaluation metric adopted here will make for challenges to be overcome
* You will have received a password and your final group number through email; you need these to make submissions
* The results of your latest submission are used to rank you on the leaderboard. This means it is your job to keep track of different model versions / approaches / outputs in case you'd like to go back to an earlier result
* The leaderboard will be frozen and the hidden results shown a few weeks before the deadline. You should then reflect on both results and explain accordingly in your report. E.g. if you did well on the public leaderboard but not on the hidden one, what might have caused this? The idea is not that you then step in and "fix" your model, but to learn and reflect
* Also, whilst you can definitely try, the goal is not to "win", but to help you reflect on your model's results, see how others are doing, etc.
* Your model needs to be built using Python (or R, Go, Rust, Julia, or whatever you prefer, as long as it involves coding). As an environment, you can use e.g. Jupyter (Notebook or Lab), RStudio, Google Colaboratory, Microsoft Azure Machine Learning Studio... and any additional library or package you want
* Latitude and longitude of the property are included, and you are free to use external, open data sources as well

The Winkler Score was chosen as a metric here since we include an interesting twist for this case. Rather than just producing a point prediction, you will also need to supply a *lower and upper* bound around your point prediction. Obviously, enormously large intervals would capture all true prices, but would not be very useful in practice. The Winkler Score (1972) is a measure used to evaluate prediction intervals. The score is calculated as follows for each instance:

* If the true value falls within the interval, the score (a measure of error, really) is the length of the interval: lower is better
* If not, the score is the length of the interval plus an additional penalty equal to two over alpha times the distance by which the true value falls outside of the interval
* We are using alpha = 0.20 here
* Sometimes the distances are squared, but we do not do so here
* See more info [here](https://otexts.com/fpp3/distaccuracy.html); a worked sketch follows below
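To make the metric concrete, the following is a small illustrative implementation of the definition above (not the exact scoring code used by the leaderboard, and the example prices are made up):

```
import numpy as np

def winkler_score(y_true, lower, upper, alpha=0.20):
    """Mean Winkler score for prediction intervals (lower is better)."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    score = (upper - lower).astype(float)  # base score: interval length
    below = y_true < lower                 # true value under the interval
    above = y_true > upper                 # true value over the interval
    score[below] += (2 / alpha) * (lower[below] - y_true[below])
    score[above] += (2 / alpha) * (y_true[above] - upper[above])
    return score.mean()

# First interval captures the true price (score = width); the second misses it.
print(winkler_score(y_true=[200_000, 350_000],
                    lower=[180_000, 360_000],
                    upper=[240_000, 400_000]))  # (60000 + 140000) / 2 = 100000.0
```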
The first part of your lab report should contain a clear overview of your whole modeling pipeline, including exploratory analysis (if any), preprocessing, construction of the model, set-up of the validation, and results of the model:

* Feel free to include code fragments, tables, visualisations, etc.
* Some groups prefer to write their final report using Jupyter Notebook, which is fine, as long as it is readable top-to-bottom
* You can use any predictive technique/approach you want, though focus on the whole process: general setup, critical thinking, and the ability to get and validate an outcome
* You're free to use unsupervised techniques for your data exploration part, too. If you decide to build a black-box model, including some interpretability techniques to explain it is a nice idea
* Any other assumptions, insights, or thoughts can be included as well: the idea is to take what we've seen in class, get your hands dirty, and try things out
* **Important: All groups should submit the results of their predictive model at least once to the leaderboard before the hidden scores are revealed (I'll warn you in time)**

More info on how to submit can be found on the [submission website](https://seppe.net/aa/assignment1/). The data set has a number of features which are mostly self-explanatory. Important to note, however, is that `price` contains the target, and `id` contains a unique identifier which you should not use as a feature. Note that features can still contain noise, outliers, missing values, etc.

*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

## Recording

* [YouTube](https://youtu.be/gtXAnCHdmYs)

## Background Information

In the news:

* [Large Language Diffusion Models](https://ml-gsai.github.io/LLaDA-demo/)
* [Chat with BadSeek](https://blog.sshh.io/p/how-to-backdoor-large-language-models)
* [Magma: A Foundation Model for Multimodal AI Agents](https://microsoft.github.io/Magma/)
* [WonderHuman](https://wyiguanw.github.io/WonderHuman/)
* [CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image](https://sites.google.com/view/cast4)
* [OmniParser](https://microsoft.github.io/OmniParser/)

Extra references:

* [ROSE](https://cran.r-project.org/web/packages/ROSE/index.html) is a popular package for dealing with over/undersampling in R
* [imblearn](https://imbalanced-learn.org/stable/) contains many smart sampling implementations for Python
* [Tuning Imbalanced Learning Sampling Approaches](https://www.dataminingapps.com/2019/06/tuning-imbalanced-learning-sampling-approaches/)
* More on the ROC curve: [1](https://arxiv.org/pdf/1812.01388.pdf), [2](http://www.rduin.nl/presentations/ROC%20Tutorial%20Peter%20Flach/ROCtutorialPartI.pdf), [3](https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221)
* [Averaging ROC curves for multiclass](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)
* [The Relationship Between Precision-Recall and ROC Curves](./papers/rocpr.pdf): a paper by KU Leuven's Jesse Davis et al. on the topic, with some other interesting remarks
* [h-index.pdf](./papers/h-index.pdf): paper regarding the h-index as an alternative to AUC
* [BSZ tuning](./papers/bsztuning.pdf): paper on BSZ tuning, a simple cost-sensitive regression approach
* [A blog post explaining cross-validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
* [Multiclass and multilabel algorithms in scikit-learn](https://scikit-learn.org/stable/modules/multiclass.html)
* [scikit.ml](http://scikit.ml/) contains more advanced multilabel techniques
* More on probability calibration [here](http://scikit-learn.org/stable/modules/calibration.html) and [here](http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression)
* [More on the System Stability Index](https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/)
* [Visibility and Monitoring for Machine Learning Models](http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/)
* [What's your ML test score? A rubric for production ML systems](https://research.google.com/pubs/pub45742.html)
* [Hidden Technical Debt in Machine Learning Systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems)

# Course 4: March 3

## Slides

* [Ensemble Models](./slides/4 - EnsembleModels.pdf)
* [Model Interpretability](./slides/5 - Interpretability.pdf)

## Recording

* [YouTube](https://youtu.be/cBmDCUg4hj4)

## Background Information

Extra references on ensemble models:

* The [jar of jelly beans](https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca)
* The [documentation of scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html) is very complete in terms of ensemble modeling
* Kaggle post on [model stacking](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)
* [Random forest.pdf](./papers/Random forest.pdf): the original paper on random forests
* [ExtraTrees](https://scikit-learn.org/stable/modules/ensemble.html#forest): "In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule"
* Also interesting to note is that scikit-learn's implementation of decision trees (and random forests) supports [multi-output problems](https://scikit-learn.org/stable/modules/tree.html#tree-multioutput)
* Note that some implementations/papers for ExtraTrees go a step further and simply select a splitting point completely at random (i.e. the subset of thresholds has size 1 -- this is helpful when working with very noisy features)
* [To tune or not to tune the number of trees in a random forest](./papers/tune_or_not.pdf); conclusion: use a sufficiently high number of trees
* [Adaboost.pdf](./papers/Adaboost.pdf): the original paper on AdaBoost
* [alr.pdf](./papers/alr.pdf): Friedman's paper on AdaBoost and Additive Logistic Regression
* [xgboost documentation](https://xgboost.readthedocs.io/en/latest/) with a good [introduction](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
* [lightgbm documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/)
* [catboost documentation](https://catboost.ai/en/docs/)
* Note that all three of these have sklearn-API compatible classifiers and regressors, so you can combine them with other typical sklearn steps (see the sketch below)
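To illustrate that last point, here is a minimal sketch dropping (in this case) LightGBM into standard scikit-learn tooling; the synthetic data set is just a stand-in:

```
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)

# LGBMClassifier follows the scikit-learn estimator API, so it can be
# dropped into pipelines, grid searches, cross-validation, and so on.
model = make_pipeline(StandardScaler(), LGBMClassifier(n_estimators=200))
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```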
Extra references on interpretability:

* [Fantastic book on the topic of interpretability](https://christophm.github.io/interpretable-ml-book/)
* [Beware of using feature importance!](http://explained.ai/rf-importance/index.html)
* [Permutation importance: a corrected feature importance measure](https://academic.oup.com/bioinformatics/article/26/10/1340/193348)
* [Interpreting random forests: Decision path gathering](http://blog.datadive.net/interpreting-random-forests/)
* [Local interpretable model-agnostic explanations](https://github.com/marcotcr/lime)
* [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap)
* [Another great overview](https://github.com/jphall663/awesome-machine-learning-interpretability)
* [rfpimp package](https://pypi.org/project/rfpimp/)
* [Forest floor](http://forestfloor.dk/) for higher-dimensional partial dependence plots
* [The pdp R package](https://cran.r-project.org/web/packages/pdp/pdp.pdf)
* [The iml R package](https://cran.r-project.org/web/packages/iml/index.html)
* [Descriptive mAchine Learning EXplanations (DALEX) R package](https://github.com/pbiecek/DALEX)
* [eli5 for Python](https://eli5.readthedocs.io/en/latest/index.html)
* [Skater for Python](https://github.com/datascienceinc/Skater)
* scikit-learn has Gini-reduction based importance, but permutation importance has [been added in recent versions](https://scikit-learn.org/stable/modules/permutation_importance.html)
* Or with [https://github.com/parrt/random-forest-importances](https://github.com/parrt/random-forest-importances)
* Or with [https://github.com/ralphhaygood/sklearn-gbmi](https://github.com/ralphhaygood/sklearn-gbmi) (sklearn-gbmi)
* [pdpbox for Python](https://github.com/SauceCat/PDPbox)
* [vip for Python (and R)](https://koalaverse.github.io/vip/index.html)
* [A brief history of machine learning models explainability](https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc)
* Graft, Reassemble, Answer delta, Neighbour sensitivity, Training delta (GRANT) - [https://github.com/wagtaillabs](https://github.com/wagtaillabs)
* [Classification Acceleration via Merging Decision Trees](./papers/mergingtrees.pdf)

# Course 5: March 10

## Slides

* [Deep Learning Part 1: Foundations and Images](./slides/6 - DeepLearning.pdf)

## Assignment 2

In this assignment, you will work with a GeoGuessr-style data set of Street View images, collected around mountainous areas across 12 countries. GeoGuessr is a browser-based geography game in which players must deduce locations from Google Street View imagery. Some players have become extremely good at the game, in some cases even knowing where they are after looking at a single image.

* The data set consists of roughly 100 panoramic images per country
* Download the data from [this Google Drive link](https://drive.google.com/file/d/1jcm_4wQzLE3tOhQf-MJI_QmH76Cj4KnW/view?usp=sharing)

Your primary goal is to predict the country based on the image. You can use any deep learning library you want (Keras is recommended). Using pre-trained models is allowed and likely to help a lot; image augmentation might be useful as well. If necessary, feel free to resize the images (using Python or whatever tool). Explore and experiment. You should approach this as a "how far can I get in a small amount of time" style project, like you would be facing in real life. I don't expect fantastic results in terms of accuracy, but it'll be interesting to see what you can do with this. Important: you need to make sure to perform a good train/test split. The [Keras code examples on computer vision](https://keras.io/examples/vision/) are a good place to start.
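As a rough starting point, a transfer-learning skeleton in Keras could look like the sketch below; the choice of EfficientNetB0, the 224×224 input size, and the `train_ds`/`val_ds` dataset names are assumptions for illustration, not requirements:

```
import keras
from keras import layers

NUM_CLASSES = 13  # or fewer, if you reduce the task

# Pre-trained backbone without its classification head; freeze it at first.
base = keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    layers.RandomFlip("horizontal"),  # light augmentation
    layers.RandomRotation(0.05),
    base,
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds could come from keras.utils.image_dataset_from_directory(...)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Once the new head has converged, unfreezing (part of) the backbone with a low learning rate is the usual next step.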
Some additional tips and pointers:

* Given the small number of images per country, fine-tuning an existing model is probably a good idea
* For extra style points, you can look up how Google Street View constructs its "panorama" images and see whether you can slice them into normal-looking ones. This might help in terms of data augmentation
* Try using an interpretability technique to figure out what your model is focusing on
* The main goal is to classify an image into one of the 13 different countries. If that is too hard, you can try reducing the number of countries. You can also try to relabel the images yourself, e.g. into snow / no snow
* Alternatively, the file names are formatted as `[timestamp]_[lat]_[lon].jpg`, so if you really want a crazy difficult challenge, you could also look for a way to predict the coordinates. Perhaps there exist some pretrained models for this already?
* The script used to collect the images can be found [here](./assignment2/scrape_images.py), which you can use to test out your model on a new bunch of images should you wish to do so

The second part of your lab report should contain:

* An overview of your full pipeline, including architecture, trade-offs, ways used to prevent overfitting, etc.
* Results based on your chosen evaluation metric
* An illustration of your model's predictions on a couple of test images

*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

## Recording

* [YouTube](https://youtu.be/6WhmuINm_JE)

## Background Information

In the news:

* [Manus AI](https://manus.im/)
* [Mercury - a diffusion-based LLM](https://www.inceptionlabs.ai/news)
* [Meshpad](https://derkleineli.github.io/meshpad/)
* [Phi 4](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/)
* [LLM Post-Training: A Deep Dive into Reasoning Large Language Models](https://arxiv.org/pdf/2502.21321)
* [Vibe coding](https://arstechnica.com/ai/2025/03/is-vibe-coding-with-ai-gnarly-or-reckless-maybe-some-of-both/)

Extra references:

* (See slides for most references)
* [Keras Vision tutorials - use these to get started!](https://keras.io/examples/vision/)
* [A brief history of AI](https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html)
* [Who invented the reverse mode of differentiation](https://www.math.uni-bielefeld.de/documenta/vol-ismp/52_griewank-andreas-b.pdf)
* [Backpropagation explained](http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html)
* [Great short YouTube playlist explaining ANNs (3blue1brown)](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
* [And another one explaining convolutions (3blue1brown)](https://www.youtube.com/watch?v=8rrHTtUzyZA)
* [Introduction to neural networks](https://victorzhou.com/blog/intro-to-neural-networks/)
* [Another link explaining ANNs](http://www.emergentmind.com/neural-network)
* [Tensorflow playground](https://playground.tensorflow.org/)

# Course 6: March 17

## Slides

* Continuing with the slides of last time

## Recording

* [YouTube](https://youtu.be/UiAh4gkaRyU)

## Background Information

In the news:

* [Gemini 2 Flash](https://deepmind.google/technologies/gemini/flash/)
* [Block Diffusion](https://huggingface.co/papers/2503.09573)
* [Baidu Ernie](https://techcrunch.com/2025/03/16/baidu-launches-two-new-versions-of-its-ai-model-ernie/)
* [Gemma 3](https://blog.google/technology/developers/gemma-3/)
* [Blender MCP](https://github.com/ahujasid/blender-mcp)
* [More on agentic models](https://x.com/nikitabase/status/1900941231808516194)
* [OWL](https://github.com/camel-ai/owl)

# Course 7: March 24

## Slides

* [Unsupervised Modeling](./slides/7 - UnsupervisedModeling.pdf)

## Recording

* [YouTube](https://youtu.be/T73cTpS2sCw)

## Background Information

In the news:

* [Deep Learning is Not So Mysterious or Different](https://arxiv.org/abs/2503.02113)
* [Improving Recommendation Systems & Search in the Age of LLMs](https://eugeneyan.com/writing/recsys-llm/) and [beeformer](https://github.com/recombee/beeformer)
* [Data Formulator](https://github.com/microsoft/data-formulator)
* [SpatialLM](https://huggingface.co/manycore-research/SpatialLM-Llama-1B)
* [StarVector](https://huggingface.co/collections/starvector/starvector-models-6783b22c7bd4b43d13cb5289)
* [The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5188231)

Extra references:

* [Comparing different hierarchical linkage methods on toy datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html#sphx-glr-auto-examples-cluster-plot-linkage-comparison-py)
* [Visualisation of the DBSCAN clustering technique in the browser](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
* [More on the Gower distance](https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9)
* [Self-Organising Maps for Customer Segmentation using R](https://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/)
* [t-SNE](https://lvdmaaten.github.io/tsne/) and using it for [anomaly detection](https://medium.com/@Zelros/anomaly-detection-with-t-sne-211857b1cd00)
* [http://distill.pub/2016/misread-tsne/](http://distill.pub/2016/misread-tsne/) provides very interesting visualisations and more explanation on t-SNE
* [Be careful when clustering the output of t-SNE](https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne/264647#264647)
* [UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)
* [PixPlot](https://dhlab.yale.edu/projects/pixplot/): another cool example of t-SNE (this was the name I was trying to recall during class)
* [Isolation forests](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py)
* [Local outlier factor](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html#sphx-glr-auto-examples-neighbors-plot-lof-outlier-detection-py)
* [Twitter's anomaly detection package](https://github.com/twitter/AnomalyDetection) and [prophet](https://facebook.github.io/prophet/)
* [An interesting article on detecting NBA all-stars using CADE](http://darrkj.github.io/blog/2014/may102014/)
* Papers on [DBSCAN](./papers/dbscan.pdf), [isolation forests](./papers/iforest.pdf) and [CADE](./papers/CADE.pdf)

# Course 8: March 31

## Slides

* [Data Science Tools](./slides/8 - DataScienceTools.pdf)

## Recording

* [YouTube](https://youtu.be/1F5gMEDuKbQ)

## Background Information

In the news:

* [Introducing 4o Image Generation](https://openai.com/index/introducing-4o-image-generation/)
* [Qwen2.5-VL-32B: Smarter and Lighter](https://qwenlm.github.io/blog/qwen2.5-vl-32b/)
* [Deciphering language processing in the human brain through LLM representations](https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/)
* [AI will change the world, but not in the way you think](https://thomashunter.name/posts/2025-03-19-ai-llms-will-change-the-world)
* [DeepSeek-V3.1](https://deepseek.ai/blog/deepseek-v31)
* [A Recipe for Generating 3D Worlds From a Single Image](https://katjaschwarz.github.io/worlds/)
* [Tracing the thoughts of a large language model](https://www.anthropic.com/research/tracing-thoughts-language-model)

Extra references:

* [Is a Dataframe Just a Table?](https://plateau-workshop.org/assets/papers-2019/10.pdf)
* Modern R with the [tidyverse](https://www.tidyverse.org/)
* [https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/) for cheat sheets on most tidy packages
* [caret](http://caret.r-forge.r-project.org/), [mlr3](https://mlr3.mlr-org.com/), [modelr](https://modelr.tidyverse.org/) and [tidymodels](https://www.tidymodels.org/)
* [http://r4ds.had.co.nz/](http://r4ds.had.co.nz/): the R for Data Science book
* Visualizations with R: `ggplot2` and [`ggvis`](http://ggvis.rstudio.com/); also see [`shiny`](https://shiny.rstudio.com/)
* Other R packages: see slides
* [Learn about NumPy broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
* [Minimally sufficient pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428): a great tour through pandas' API
* [What's new in pandas](https://pandas.pydata.org/docs/dev/whatsnew/)
* [scikit-learn](https://scikit-learn.org/stable/) and [statsmodels](https://www.statsmodels.org/stable/index.html)
* An older but fun comparison of Python visualization libraries: [https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/)
* [Datashader](https://datashader.org/) is a package to render massive data sets (also showed SandDance in class)
* Other Python packages: see slides; [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/), [scikit-multilearn](http://scikit.ml/) and [semisup-learn](https://github.com/tmadl/semisup-learn) are nice to know about, though scikit-learn has some support for many of these as well
* Time series: [prophet](https://facebook.github.io/prophet/), [darts](https://github.com/unit8co/darts) and [statsforecast](https://github.com/Nixtla/statsforecast) are useful to check out as well
* Linking between R and Python: see e.g. [rpy2](https://rpy2.readthedocs.io/en/version_2.8.x/) (R in Python) or [reticulate](https://github.com/rstudio/reticulate) (Python in R)
* d3.js galleries: [1](https://github.com/mbostock/d3/wiki/Gallery), [2](http://bl.ocks.org/), [3](http://bl.ocks.org/mbostock), [4](https://bost.ocks.org/mike/)
* ["The Pudding"](https://pudding.cool/) uses d3 for some fun digital stories
* Also take a look at [Explorable Explanations](https://explorabl.es/) and [Complexity Explorables](http://www.complexity-explorables.org/) for some great visualization examples
* [Plotly Dash](https://plot.ly/products/dash/)
* Working with large files: [ff](https://cran.r-project.org/web/packages/ff/index.html), [bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html), [disk.frame](https://github.com/xiaodaigh/disk.frame), [Dask](https://github.com/dask/dask), [Pandas on Ray](https://modin.readthedocs.io/en/latest/pandas_on_ray.html), [vaex](https://github.com/vaexio/vaex), [modin](https://github.com/modin-project/modin) and [SFrame](https://github.com/apple/turicreate)
* And [DuckDB](https://duckdb.org/), [Polars](https://www.pola.rs/) and [Ibis](https://ibis-project.org/)
* [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html)
* [papermill](https://github.com/nteract/papermill) and the associated [blog post](https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6)
* [ploomber](https://github.com/ploomber/ploomber)
* Some issues with notebooks: [1](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit), [2](https://yihui.name/en/2018/09/notebook-war/)
* Hosted notebooks: [Azure ML Studio](https://studio.azureml.net/), [Google Colab](https://colab.research.google.com/) and [Kaggle Kernels](https://www.kaggle.com/kernels) are free options
* Modular code development with [nbdev](https://github.com/fastai/nbdev)
* [Getting started with conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html)
* [Labeling tools overview](https://github.com/heartexlabs/awesome-data-labeling)
* [human-learn](https://github.com/koaning/human-learn)
* [snorkel-ai](https://snorkel.ai/platform/), [snorkel](https://github.com/snorkel-team/snorkel) and [their paper](https://arxiv.org/abs/1711.10160)
* [FlyingSquid](https://github.com/HazyResearch/flyingsquid)
* [Alteryx Compose](https://github.com/alteryx/compose)
* [Mostly.ai](https://mostly.ai/) - one of the many "synthetic data" companies
* [Learning the Gitflow git workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow)
* [Hidden technical debt in machine learning systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
* ML models degrade silently! [1](https://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214), [2](https://www.elastic.co/blog/beware-steep-decline-understanding-model-degradation-machine-learning-models), [3](https://mlinproduction.com/model-retraining/)
* [DevOps and ML/AI](https://www.tecton.ai/blog/devops-ml-data/)
* Some cool in-house tools: [Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/), [Manifold](https://github.com/uber/manifold), [D3](https://www.uber.com/en-BE/blog/d3-an-automated-system-to-detect-data-drifts/), [FBLearner Flow](https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/), [Airflow and Luigi](https://medium.com/better-programming/airbnbs-airflow-versus-spotify-s-luigi-bd4c7c2c0791), [the Netflix ML platform](https://research.netflix.com/research-area/machine-learning-platform) and [Bighead](https://databricks.com/session/bighead-airbnbs-end-to-end-machine-learning-platform)
* [Poetry](https://python-poetry.org/), [Poe the Poet](https://github.com/nat-n/poethepoet) and [cookiecutter](https://github.com/cookiecutter/cookiecutter)
* [What’s your ML test score? A rubric for ML production systems](https://research.google/pubs/pub45742/)
* Some newer MLOps products: [ZenML](https://zenml.io/why-ZenML/), [MonaLabs](https://www.monalabs.io/) and [Evidently.ai](https://evidentlyai.com/)
* [https://github.com/cleanlab/cleanlab](https://github.com/cleanlab/cleanlab)

## Assignment 3

The third assignment consists of the construction of a predictive model using Spark (Structured) Streaming and textual data. You will work with data coming from [Arxiv.org](https://arxiv.org/), the free distribution service and open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

A data streamer server is already set up for you as follows:

- Published papers are monitored, and their categories, title, and abstract are extracted. We do so by polling the API every so often. Example endpoint: [https://export.arxiv.org/api/query?search_query=submittedDate:[202503251500+TO+202503261500]&max_results=2000](https://export.arxiv.org/api/query?search_query=submittedDate:[202503251500+TO+202503261500]&max_results=2000)

Next, the information is exposed to you through a streaming data source running at `seppe.net:7778`. When you connect to it, this will provide publications to you one by one:

- We fetch publications starting from 96 hours ago and drip-feed them over the connection
- We start from 96 hours ago so you can more easily test the connection by immediately receiving publications; the reason we go back so far is that no publications are accepted during the weekend
- Next, whilst the stream is kept open, we simply continue to send newly published articles as they arrive

The stream is provided as a textual data source with one article per line, formatted as a JSON object, e.g.:

```
{
  "aid": "http://arxiv.org/abs/2503.19871v1",
  "title": "A natural MSSM from a novel $\\mathsf{SO(10)}$ [...]",
  "summary": "The $\\mathsf{SO(10)}$ model [...]",
  "main_category": "hep-ph",
  "categories": "hep-ph,hep-ex",
  "published": "2025-03-25T17:36:54Z"
}
```
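For orientation, reading and archiving this stream with Spark Structured Streaming could look roughly like the sketch below (the example notebooks provided with the assignment cover this as well; the `spark` session is assumed to be set up as in those notebooks, and the output paths are placeholders):

```
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Schema matching the JSON objects shown above (all fields arrive as strings).
schema = StructType([StructField(f, StringType()) for f in
                     ["aid", "title", "summary",
                      "main_category", "categories", "published"]])

# Each line received over the socket is one JSON-encoded article.
lines = (spark.readStream.format("socket")
         .option("host", "seppe.net").option("port", 7778).load())
articles = lines.select(F.from_json("value", schema).alias("a")).select("a.*")

# Append incoming articles to disk to build up a historical training set.
query = (articles.writeStream.format("json")
         .option("path", "data/arxiv")                   # placeholder paths
         .option("checkpointLocation", "data/checkpoint")
         .start())
```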
The goal of this assignment is threefold:

* 1 - Collect a historical set of data
  * *Important: get started with this as soon as possible. We will discuss Spark and text mining in more detail later on, but you can already start gathering your data*
* 2 - Construct a predictive model that predicts the categories an article belongs to:
  * There are different ways to approach this question: you can either try to predict the `main_category` (in which case it is a multiclass problem), or try to tackle it as a multilabel problem by trying to predict all of the `categories` (comma separated)
  * The second question is which categories you want to include: categories which contain a hyphen, such as `hep-ph` above, are a subcategory of the broader `hep`, so you might wish to reduce the number of classes by only focussing on the main categories (or focussing only on articles belonging to a single main category, such as computer science, `cs`)
  * You can see all the categories over at [https://arxiv.org/](https://arxiv.org/)
  * You can use any predictive model you want, but groups that incorporate a small LLM or a more modern textual model ([https://huggingface.co/facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) is a very good start, for instance) will be rewarded for this
  * If you want to use a more traditional approach (TF-IDF plus a classifier), try to use Spark's built-in ML models (see the sketch after this list)
* 3 - Show that your model can make predictions in a "deployed" setting
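For the traditional route above, a minimal `spark.ml` pipeline might look like this sketch, assuming you have assembled a DataFrame `train_df` with `summary` and `main_category` columns from your collected data:

```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

# TF-IDF features from the abstract text, label index from the main category.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="summary", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=2**16),
    IDF(inputCol="tf", outputCol="features"),
    StringIndexer(inputCol="main_category", outputCol="label"),
    LogisticRegression(maxIter=50),
])

model = pipeline.fit(train_df)  # train_df: your saved historical articles
# model.write().overwrite().save("arxiv_model")  # so the "deployed" part can load it
```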
**Setting up Spark**

Since the data set we'll work with is still relatively small, you will (luckily) not need a cluster of machines, but can run Spark locally on your machine (and save the data locally as well).

* First, download the ZIP file from [this link](./assignment3/spark.zip) and extract it somewhere, e.g. on your Desktop. This ZIP file contains the latest stable Spark version available at this time (3.5.5)
* If you prefer to follow along with video instructions, an MP4 file is contained in the ZIP with a walkthrough for Windows and Mac users
* We will use `pixi.sh` to set up our environment. Pixi is a modern alternative to Conda, but you can use Conda as well (in which case you install the packages below using `conda install`). Download Pixi for your platform over at [https://github.com/prefix-dev/pixi/releases/](https://github.com/prefix-dev/pixi/releases/) and put the executable file (e.g. `pixi.exe` for Windows) in the `spark` folder you just extracted
* Next, we install all packages we need in the environment. In a Terminal or command line window, first navigate to the `spark` folder and run `pixi init` to initialize the environment. On Mac, use `./pixi` instead of `pixi` in these commands. Then run `pixi add python=3.11 pyspark findspark jupyter openjdk=11` to install the necessary packages
* Mac users will probably also have to make the Spark binaries executable (in case you get a PermissionError in the notebooks); you can do so by running this command in the `spark` directory: `chmod +x ./spark-3.5.5-bin-hadoop3/bin/*`
* You can then start Jupyter using `pixi run jupyter notebook`

**Example notebooks**

Once you have Jupyter open, explore the example notebooks under `notebooks`. **Important: the first cell in these notebooks uses `findspark` to initialize Spark and its contexts. You will need to add the same cell to all new notebooks you create.**

* `spark_example.ipynb`: Try this first! This is a simple Spark example to calculate pi and serves as a check to see whether Spark is working correctly
* `spark_streaming_example.ipynb`: A simple Spark Streaming example that prints out the data you'll work with. This is a test to see whether you can receive the data
* `spark_streaming_example_saving.ipynb`: A simple Spark Streaming example that saves the data. Use this to get started saving your historical set
* `spark_streaming_example_predicting.ipynb`: A very naïve prediction approach
* `spark_structured_streaming_example.ipynb`: An example using Spark Structured Streaming

**Objective**

Using Spark, your task for this assignment is as follows:

* 1 - Collect a historical set of data
  * Get started with this as soon as possible
  * Make sure to set up Spark using the instructions posted above
* 2 - Construct a predictive model
  * The stream is text-based, with each line containing one message (one instance) formatted as a JSON dictionary
  * You are strongly encouraged to build your model using `spark.ml` (MLlib), but you can use `scikit-learn` as a fallback
  * Alternatively, use a more modern model, as described above
  * Pick between the multiclass vs. multilabel and all categories vs. main categories tasks
* 3 - Show that your model can make predictions in a "deployed" setting (see the sketch after these remarks)
  * I.e. show that you can connect to the data source, preprocess/featurize incoming messages, have your model predict the label, and show it, similar to `spark_streaming_example_predicting.ipynb` (but using a smarter, real predictive model)
  * This means that you'll need to look for a way to save and load your trained model, if necessary
  * The goal is not to obtain a perfect predictive accuracy, but mainly to make sure you can set up Spark and work in a streaming environment

The third part of your lab report should contain:

* An overview of the steps above, the source code of your programs, as well as the output after running them
* Feel free to include screenshots or info on encountered challenges and how you dealt with them
* Even if your solution is not fully working or not working correctly, you can still receive marks for this assignment based on what you tried and how you'd need to improve your end result

**Further remarks**

* Get started with setting up Spark and fetching data as quickly as possible
* Make sure to have enough data to train your model. New publications arrive relatively slowly (during some days, no articles might appear at all, whereas other days will be very busy)
* The data stream is line-delimited with every line containing one instance in JSON format, but it can easily be converted to a DataFrame (and RDD). The example notebooks give some ideas on how to do so
* You can use either Spark Streaming or Spark Structured Streaming
* Don't be afraid to ask e.g. ChatGPT or Claude for help to code up your approach; this is certainly permitted, but make sure not to get stuck in "vibe coding" where you have a notebook spanning twenty pages without knowing what you're really doing anymore
* Do let me know in case the streaming server crashes
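Putting the "deployed" part together, a naïve scoring loop with the classic DStream API could look like this sketch; it assumes the `ssc` and `spark` contexts from the example notebooks and the hypothetical `arxiv_model` saved earlier:

```
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

def predict(time, rdd):
    """Score each mini-batch of incoming articles with the trained pipeline."""
    if rdd.isEmpty():  # empty mini-batches do occur, see the FAQ below
        return
    # Load the saved model once and cache it in globals(), as the FAQ suggests.
    if "model" not in globals():
        globals()["model"] = PipelineModel.load("arxiv_model")
    batch = spark.read.json(rdd)  # one JSON article per line
    preds = globals()["model"].transform(batch)
    preds.select("aid", "main_category", "prediction").show(truncate=False)

lines = ssc.socketTextStream("seppe.net", 7778)
lines.foreachRDD(predict)
ssc.start()
```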
*You do not hand in each assignment separately, but hand in your completed lab report containing all four assignments on Sunday, June 1st. For an overview of the groups, see Toledo. Note for externals (i.e. anyone who will NOT partake in the exams -- this doesn't apply to normal students): you are free to partake in (any of) the assignments individually, but not required to.*

**FAQ**

* The first cell in the example notebook fails (the one using `findspark`) - This cell attempts to do a couple of things: set the `SPARK_HOME` environment variable to the right directory; on Windows: set the `HADOOP_HOME` environment variable to the right `winutils` subfolder (necessary for Spark to work); initialize `findspark`; and then construct the different Spark contexts. Inspect the output and make sure the path names are correct
* Spark cannot be initialized, the notebook or command line shows an error "getSubject is supported only if a security manager is allowed" - Your `openjdk` version is too recent; make sure you have installed version 11 in your environment
* Everything seems to work but I get a lot of warnings on the console or in the notebook - Spark is very verbose. On Mac, warnings are shown in the notebook itself, which makes it more annoying to use. If you don't like that, Google how to stop Jupyter from capturing stderr
* I can't save the stream... everything else seems fine - Make sure you're calling the `saveAsTextFiles` function with "file:///" prepended to the path: `lines.saveAsTextFiles("file:///C:/...")`. Also make sure that the folder where you want to save the files exists. Note that the `saveAsTextFiles` method expects a *directory name* as the argument. It will automatically create a folder for each mini-batch of data
* Can I prevent the `saveAsTextFiles` function from creating so many directories and files? - You can first repartition the RDD to one partition before saving it: `lines.repartition(1).saveAsTextFiles("file:///C:/...")`. To reduce the number of directories, change the trigger time to e.g. `ssc = StreamingContext(sc, 60)`, though this will still create multiple directories. Setting the trigger interval much higher is not really recommended, as you wouldn't want to lose data in case something goes wrong
* So if I still end up with multiple directories, how do I read them in? - It's pretty easy to loop over subdirectories in Python. Alternatively, the `sc.textFile` command is pretty smart and can parse through multiple files in one go
* Is it normal that all my folders only contain `_SUCCESS` files but no actual data files? - That depends. A `_SUCCESS` file indicates that the mini-batch was saved correctly. `part-*` files contain the actual data. And files ending with `.crc` contain a checksum. It's normal that not all of your folders contain `part-*` data, when no data was received in that time frame. However, if none of your folders contain data, especially not after you have restarted the notebook, something else has gone wrong. Try the `spark_streaming_example.ipynb` notebook to verify whether you're at least receiving data at all
* Is there a way to monitor Spark? - Yes, go to [http://127.0.0.1:4040/](http://127.0.0.1:4040/) in your browser while Spark is running and you'll get access to a monitoring dashboard. Under the "Environment" tab, you should be able to find a "spark.speculation" entry, for instance, w.r.t. the question above. Under "Jobs", "Stages", and "Streaming", you can get more info on how things are going
* I'm trying to convert my saved files to a DataFrame, but Spark complains for some files? - Data is always messy, especially data provided by this instructor. Make sure you can handle badly formatted lines and discard them
* My stream crashes after a while with an "RDD is empty" error... - Make sure you're checking for empty RDDs, e.g. `if rdd.isEmpty(): return`
* I've managed to create a model. When I try to apply it on the stream, Spark crashes with a Hive / Derby error, e.g. when I try to `.load()` my model(s) or once the first RDD arrives - Check the example notebooks for ideas on how to load your model into `globals()` once
* When I call `ssc_t.stop()`, Spark never seems to stop the stream - You can try changing `stopGraceFully=True` to `False`. Even then, Spark might not want to stop its stream processing pipeline in case you're doing a lot with the incoming data, preventing Spark from cleaning up. Try decreasing the trigger time, or simply restart the Jupyter kernel to start over
* Spark complains that only one StreamingContext can be active at a time (or "ValueError: Cannot run multiple SparkContexts at once") - A good idea is to (save and) close all running notebooks and start again fresh. Spark doesn't like having multiple contexts running, so it is best to only have one notebook running at a time. (Closing a tab with a notebook does not mean that the *kernel* is stopped, however; check the "Running" tab on the Jupyter main page.)
* Why do I receive the same instances (or: why do I have instances twice) when reconnecting? - To make sure you are served data right away, the stream server starts from a while back and works its way up to the current time. You can remove duplicate instances based on the `aid` identifier
* Can I use R? - There are two main Spark R packages available: `SparkR` (the official one) and `sparklyr` (from the folks at RStudio; it fits better with the tidyverse). You can try using these, but you'll have to do some setting up so that R can find your Spark installation. I'd strongly recommend using Python
* The server is just a socket server, so can't we just get the data that way? - For those who know, yes, basically: `nc seppe.net 7778`; indeed, in this case it would be easy to do this in Python directly

# Resources

## Books

If you want an exhaustive list of data science books (not required for the course), feel free to check out [https://github.com/chaconnewu/free-data-science-books](https://github.com/chaconnewu/free-data-science-books), neatly ordered by topic and level (beginner to veteran). This repository is also interesting: [https://github.com/bradleyboehmke/data-science-learning-resources](https://github.com/bradleyboehmke/data-science-learning-resources). And another two full of books: [https://github.com/Saurav6789/Books-](https://github.com/Saurav6789/Books-) and [https://github.com/yashnarkhede/Data-Scientist-Books](https://github.com/yashnarkhede/Data-Scientist-Books).

## Python Tutorials

Python itself is [quite easy](https://learnxinyminutes.com/docs/python/); you mainly need to figure out the additional libraries and their usage. Try to become familiar with NumPy, pandas, and scikit-learn first, e.g. [play along with a couple of these tutorials](https://scikit-learn.org/stable/tutorial/index.html). The bottom of [this page](https://learnxinyminutes.com/docs/python/) also lists some more resources to learn Python.
The following are quite good:

* [A Crash Course in Python for Scientists](https://nbviewer.jupyter.org/gist/anonymous/5924718)
* [Dive Into Python 3](https://diveintopython3.net/index.html)
* [https://docs.python-guide.org/](https://docs.python-guide.org/) (a bit more intermediate)
* Someone has also posted [this 100 Page Python Intro](https://learnbyexample.github.io/100_page_python_intro/introduction.html)

## DataCamp

To get access to DataCamp, use this [registration link](https://www.datacamp.com/groups/shared_links/c80035336a272e42cf6e73f687cfb18d0f4fd2a1762e784c76df2b5eecdb72a0). Note that this will require a @(student.)kuleuven.be email address. If you'd like to use a personal email instead (e.g. because you already have an account on DataCamp), send me an email.