[SoK 2026] Final Update for 'Automating Promo Data Collection' Task
Hi all! Just finished up the last bit of work for my Season of KDE task of automating data collection for the KDE promotional team.
Since the midterm blog post I've been assigned no new tasks. That means my final deliverables are: a follower/post count scraping script for specific social media websites; a Reddit Insights page scraper that totals weekly insight data for a given subreddit; and an article evaluation script that reads articles found by the Google Alerts system and evaluates their sentiment towards KDE and its software.
Follower and post counts scraper
Nothing much has changed here beyond better error handling, consistent argument help strings, and more readable log messages. The script has run well on its weekly timer and shows no signs of giving up. I think I can still improve it by making it more extensible, so it can accommodate scraping new websites and accounts, but as of now it handles the links we care about most.
Reddit Insights page scraper
In the prior blog post I mentioned worries about getting the script to run on a headless server. The script can now run headlessly thanks to a Docker image that wraps the program run with an Xvfb display server. Xvfb makes this possible by performing all display operations in virtual memory, allowing headful software to run in a headless environment.
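For illustration, a minimal Dockerfile along those lines might look like the sketch below. The base image, package choices, and the `scraper.py` name are hypothetical, not the exact image used for this task, and a real image would also need a matching geckodriver for Firefox:

```dockerfile
# Sketch of a Debian-based image that runs a Selenium script under Xvfb.
FROM python:3.11-slim

# Xvfb provides an in-memory X display; firefox-esr is the headful browser.
RUN apt-get update && apt-get install -y --no-install-recommends \
        xvfb firefox-esr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper.py .

# xvfb-run starts Xvfb, points DISPLAY at it, and tears it down afterwards,
# so the script believes it has a real display.
CMD ["xvfb-run", "-a", "python", "scraper.py"]
```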
Shoutouts to Sean Pianka's repo of Dockerfiles for running Xvfb-wrapped Selenium scripts and to Selenium's own Docker images for the Selenium Grid project. Without those resources it would have taken me a lot longer to hack together a Docker image that could run Selenium headfully.
With headless runs solved, I also implemented plenty of bug fixes and improvements to user-facing messages. Many of the bugs came from not properly quitting Selenium during handled errors, which I discovered when the server ended up with hundreds of open Firefox instances. Hopefully I've cleaned all those up.
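Those orphaned Firefox instances came from error paths that never reached `driver.quit()`. The cleanup boils down to a pattern like the following sketch (the `browser` name and factory argument are illustrative, not the script's actual API):

```python
from contextlib import contextmanager


@contextmanager
def browser(driver_factory):
    """Yield a WebDriver and guarantee quit() runs, even when a scrape
    step raises an error that we otherwise handle and log."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Without this, every handled error leaks a browser process.
        driver.quit()
```

Usage would be something like `with browser(lambda: webdriver.Firefox()) as d: ...`, so the browser is torn down no matter how the block exits.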
Google Alerts evaluator
This task was a fairly large undertaking involving plenty of research and implementation steps. There were three major requirements:
- Develop a pipeline to take in Google Alerts emails and pre-process them into articles the model can read.
- Evaluate lightweight sentiment analysis models that can run on a server for their ability to analyze articles on KDE products.
- Parse model output into a human-readable and easy to work with data format.
The final result is a pipeline that
- Reads Google Alerts emails
- Pre-processes the articles into Markdown files for model reading
- Feeds them to a local LLM configured to provide sentiment analysis output
- Parses the LLM output and writes it to a CSV file (when possible)
You can see how evaluating these articles by hand could take a lot of time out of people's days, so hopefully this pipeline significantly reduces that time spent.
Google Alerts email reading and processing
This was no issue, as Google Alerts are all sent through Gmail, and Google itself provides a very usable Gmail API for extracting emails from Gmail accounts. After generating the required credentials, fetching emails was as easy as using a Python package that provides bindings for the Gmail API. The emails were all formatted in XML, so past experience with web scraping from the last two tasks helped make fetching article links from the emails painless to implement. After the article links were extracted from the emails, the articles' contents were fetched in Markdown format for use with the chosen model.
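As an illustration of the link-extraction step, here is a minimal sketch using only the standard library. It assumes Alerts wrap each article URL in a `google.com/url?url=...` redirect link; the markup handling and function names are simplifications for illustration, not the script's actual code:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class AlertLinkParser(HTMLParser):
    """Collect article links from the markup of a Google Alerts email."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        query = parse_qs(urlparse(href).query)
        # Unwrap Google's redirect URL when present, otherwise keep href.
        self.links.append(query.get("url", [href])[0])


def extract_links(email_body: str) -> list[str]:
    parser = AlertLinkParser()
    parser.feed(email_body)
    return parser.links
```

The real emails carry more structure (digest headings, unsubscribe links, and so on), so the production script also has to filter out non-article links.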
Model evaluation
We very quickly looked towards local large-language models (LLMs) for the sentiment analysis task. Several of the sentiment analysis fields we wanted would have been difficult for more basic models, and using an LLM simplified implementation greatly. After evaluating some small-footprint models, by far the best at both conforming to the desired output format and performing sentiment analysis on the articles was Qwen3 with 4 billion parameters. It is lightweight enough to run on an older CPU in decent time, and while it doesn't agree amazingly with human judgement, it tends to err on the side of caution: for example, it decides more articles are related to KDE than actually are, which wastes some time but doesn't exclude relevant articles.
Designing model output and post-processing
It turns out that LLMs come in different flavors, and some, specifically instruct models, are much better at conforming to instructions than others. Many attempts were made to get other types of models to produce output in a strict format; if you need specific output, it's a headache you should avoid by choosing an instruct model from the start.
An instruct model, coupled with a well-constructed system prompt (the meta prompt that sets initial instructions for the model) and a grammar file written in GBNF format, makes model output very predictable. The system prompt written for this task bounds model output by asking for the sentiment analysis features as a Python-formatted array of strings. Even with these methods, instruct models still botch output occasionally, so the script contains plenty of post-processing and error-handling steps before model output is written to the output CSV file.
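A sketch of what that post-processing might look like. The field count, the bracket-slicing heuristic, and the function names here are assumptions for illustration, not the script's actual checks:

```python
import ast
import csv

# Illustrative: the real script defines its own set of output fields.
EXPECTED_FIELDS = 4


def parse_model_output(raw: str):
    """Parse the model's Python-formatted array of strings, returning
    None when the output is botched so the caller can skip or retry."""
    try:
        # Tolerate chatter around the array by slicing out the brackets.
        start, end = raw.index("["), raw.rindex("]") + 1
        values = ast.literal_eval(raw[start:end])
    except (ValueError, SyntaxError):
        return None
    if not (isinstance(values, list)
            and len(values) == EXPECTED_FIELDS
            and all(isinstance(v, str) for v in values)):
        return None
    return values


def append_row(path, values):
    """Append one validated result to the output CSV file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(values)
```

Rejecting malformed output with `None` rather than raising lets the pipeline log the bad article and carry on with the rest of the batch.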
Experience and lessons learned
I've learned a significant amount about web scraping and how to navigate data troubles. I'm definitely a lot more confident using developer web tools, HTML processing, and browser automation frameworks as a result of my SoK experience. After working on the Google Alerts sentiment analysis task, I also feel more educated on AI topics and how models are used and deployed.
My project was a little unusual in that I wasn't working on an existing piece of KDE software, but on utility scripts built from the ground up for KDE community members. The freedom I had in implementing solutions made things fun, but I feel the scripts are not as fully developed or problem-free as they could be. I'd hate to leave them as is while feeling that way, so I'll keep working on the existing scripts, as well as new ones, for as long as I can help out.
Huge thanks to Paul Brown for mentoring me through this project and being a pleasure to work with, as well as the KDE community for hosting this great event. I had a lot of fun working on these scripts and am glad I could help out by contributing something to this awesome community.