[SoK 2026] Midterm update for 'Automating promo data collection' task
Hey all! I'm CJ and I'm checking in with a midterm update on the Season of KDE task of automating data collection for the KDE promotional team.
The first of the two terms for this Season of KDE task has mostly been a learning experience: figuring out what does and doesn't work when it comes to scraping data from the web, and laying down our toolset and approach to data collection.
Three subtasks have resulted:
- Create a script that collects follower and post counts from several websites housing KDE's social media accounts
- Create a script that processes information from the Reddit Insights page for the KDE subreddit
- Create a script automating the evaluation of articles discussing KDE tools
The first two of those are mostly complete, while the last is still in its research and planning phase. Both of the nearly finished subtasks came with their own sets of challenges, techniques, and tools, which I'll detail separately.
Follower and post counts scraper
This is the script I discussed in my first blog post that scrapes follower and post counts from X (formerly known as Twitter), Mastodon, Bluesky, and Threads. The major updates since then are a more user- and server-friendly way of running it, and fixes for a few issues that came up outside of the scraping itself. On the usage side I've added command-line arguments, and the script now expects a JSON file containing the links to scrape. This makes swapping out social media links easy and leaves room to grow the script's configuration if any further development is needed.
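As a rough sketch of that input handling (the flag names and JSON layout here are illustrative placeholders rather than the script's exact interface):

```python
# Minimal sketch of the input handling described above; flag names,
# JSON keys, and file paths are placeholders, not the script's real interface.
import argparse
import json


def load_links(path):
    """Read the JSON file mapping each platform to the account URLs to scrape."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def parse_args():
    parser = argparse.ArgumentParser(description="Collect follower and post counts.")
    parser.add_argument("--links", required=True,
                        help="Path to a JSON file listing the social media account URLs")
    parser.add_argument("--output", default="counts.json",
                        help="Where to write the scraped counts")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    links = load_links(args.links)
    # e.g. links == {"mastodon": ["https://floss.social/@kde"], "bluesky": [...], ...}
    print(f"Loaded {sum(len(urls) for urls in links.values())} accounts to scrape")
```

Swapping platforms in and out then only touches the JSON file, not the code.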
At the time of writing the logic of the script has held up well, but the data format we were outputting to, Open Document Format (ODF), wasn't friendly for our specific usage, something I touched on in that first blog post. In the end we decided the tools that interface with ODF were too unwieldy to work with from an automation and programmatic standpoint, so we're looking into alternatives at the moment. One promising candidate is KDE's LabPlot, which is FOSS and has a nice-looking (but experimental) Python API. For now I've set the script up to output to a user-friendly JSON file until we settle on the tool that will ultimately be used for data analysis.
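For reference, the interim output is just a plain JSON document along these lines; the field names are a sketch rather than a final schema:

```python
# Hypothetical shape of the interim JSON output; the schema is not final.
import json
from datetime import date

results = {
    "collected_on": date.today().isoformat(),
    "accounts": [
        {"platform": "mastodon", "url": "https://floss.social/@kde",
         "followers": 0, "posts": 0},
    ],
}

with open("counts.json", "w", encoding="utf-8") as handle:
    json.dump(results, handle, indent=2, ensure_ascii=False)
```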
Another issue came from the input side of the script, in the X/Twitter scraping portion. Many public Nitter instances implement bot prevention I was unaware of, which triggered on an attempted headless server run of the script. With that making simpler scraping methods difficult, and out of respect for those instances' wish not to be botted, I've decided to spin up our own local Nitter instance on the server that runs the script. Scraping X/Twitter now goes much more smoothly and with far less risk of failure.
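To give a rough idea, a fetch against the local instance looks something like this; the instance address, handle, and CSS class are assumptions about our setup and about Nitter's current markup:

```python
# Sketch of scraping a profile page from a self-hosted Nitter instance.
# The instance URL, handle, and CSS class are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

NITTER = "http://localhost:8080"  # assumed address of the local instance


def profile_stats(handle):
    page = requests.get(f"{NITTER}/{handle}", timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    # Nitter renders the profile counters as plain text; ".profile-stat-num"
    # is the class seen on current builds, but treat it as an assumption.
    return [element.get_text(strip=True).replace(",", "")
            for element in soup.select(".profile-stat-num")]


print(profile_stats("kdecommunity"))
```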
KDE subreddit Insights scraper
Since that first week we've added another subtask: a script that tallies the weekly influx of new visitors, unique page views, and members of the KDE subreddit using the subreddit's Insights page. This script mostly challenged our ability to automate the login process for Reddit, as the usual methods are blocked by browser verification tools.
Reddit uses a form of invisible reCAPTCHA on its login page. The exact implementation depends on which version they use, but in the end a score grading the likelihood of the user being a bot or a human is returned to the website upon login. This means simple HTTP requests are likely not enough to get the job done, and a level of interaction supplied by a browser automation framework is needed to handle the login process.
To that end, we chose to leverage the long-standing Selenium browser automation framework. Selenium, like many browser automation frameworks, works by launching a full-featured web browser to run its automated tasks. This introduces problems when running these scripts on a headless server, but it greatly simplifies getting past bot prevention and loading any JavaScript-dependent page elements.
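As a rough illustration of that login flow (the element locators and the Insights URL are assumptions on my part, since Reddit's markup and paths change over time):

```python
# Sketch of a Selenium-driven Reddit login; locators and URLs are placeholders.
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

USERNAME = os.environ["REDDIT_USER"]
PASSWORD = os.environ["REDDIT_PASS"]
INSIGHTS_URL = "https://www.reddit.com/r/kde/..."  # placeholder for the Insights page

driver = webdriver.Firefox()  # a full browser instance, not a bare HTTP client
try:
    driver.get("https://www.reddit.com/login/")
    wait = WebDriverWait(driver, 30)
    # The locators below are illustrative, not Reddit's actual field names.
    wait.until(EC.presence_of_element_located((By.NAME, "username"))).send_keys(USERNAME)
    driver.find_element(By.NAME, "password").send_keys(PASSWORD)
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    wait.until(EC.url_changes("https://www.reddit.com/login/"))

    driver.get(INSIGHTS_URL)
    html = driver.page_source  # handed off to BeautifulSoup in the next step
finally:
    driver.quit()
```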
With Selenium automating the login process, the only challenge left was processing the HTML we retrieved. Reddit Insights presents its information as bar charts visualizing the daily page views, unique visitors, and subscribers of a subreddit. A little digging through the page source revealed that the daily data points populating the bar charts are stored alongside millisecond UNIX timestamps for those days. Using BeautifulSoup, it was easy to grab that daily data via the timestamps and sum up the totals our script needs.
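Here's a minimal sketch of that summing step, assuming the per-day values and their millisecond timestamps sit in data attributes; the attribute names are illustrative rather than the real ones:

```python
# Sum the last seven days of a metric embedded in the Insights page source.
# Tag/attribute names are illustrative; the real page pairs each day's value
# with a millisecond UNIX timestamp, as described above.
from datetime import datetime, timedelta, timezone

from bs4 import BeautifulSoup


def weekly_total(html, value_attr="data-value", ts_attr="data-timestamp"):
    soup = BeautifulSoup(html, "html.parser")
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    total = 0
    for bar in soup.find_all(attrs={ts_attr: True, value_attr: True}):
        # Timestamps are milliseconds, so divide by 1000 before converting.
        day = datetime.fromtimestamp(int(bar[ts_attr]) / 1000, tz=timezone.utc)
        if day >= cutoff:
            total += int(bar[value_attr])
    return total
```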
The main challenge this script presents now is how to run it on a weekly basis on a headless server. The UI component is non-negotiable, so the solution will very likely come down to server configuration.
Smaller updates
- Investigated automating data uploads to Nextcloud
- Researched how to schedule scripts to run on an interval using systemd unit files (see the sketch after this list)
- Wrote technical documentation on the purpose and usage of both scripts developed so far
- Researched various alternative packages for performing HTTP requests and browser automation tasks
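For the scheduling item above, the likely shape is a systemd service and timer pair along these lines; the unit names, paths, and schedule are placeholders rather than settled decisions:

```ini
# promo-scraper.service  (names and paths are placeholders)
[Unit]
Description=Collect KDE promo statistics

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/promo-scripts/collect_counts.py --links /opt/promo-scripts/links.json

# promo-scraper.timer  (drives the service above on a weekly schedule)
[Unit]
Description=Run the promo statistics collector weekly

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Once in place, the pair would be enabled with `systemctl enable --now promo-scraper.timer`.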
Future
Since the first two subtasks are complete logic-wise, barring any future issues we run into, a new one has been assigned as part of the data collection automation task. The KDE promo team collects various articles about KDE and related software and evaluates how those articles relate to KDE and how they portray whichever KDE tool they discuss. This evaluation is currently performed manually, which takes up time, so I've been tasked with developing a method for analyzing these articles automatically.
Along with that new subtask, solving the open questions of how to run browser automation software on a server and which data evaluation software we'll target will greatly benefit us, expanding our options for deploying the scripts made in this task and making their data immediately useful for the KDE promo team.
Lessons learned
It's been a lot of fun tackling the first two subtasks. I've had to pull from past experience with APIs, HTML, and HTTP that had been rotting in the deeper parts of my brain, as well as learn much more about how modern, full-featured websites deploy those tools. I'm a bit anxious about the problem of server deployment, since I want these scripts to be as useful and maintainable as possible for the KDE promotion team, but I'm confident we'll find a solution and I'm sure it will feel very rewarding to solve.
As for the new subtask, it's a departure from the first two, and it's very likely that a light, local AI/machine learning method will be looped into the process. That makes it exciting to tackle, since it's so different from the last couple of subtasks and incorporates an entirely separate emerging field. I'm very much looking forward to rounding out my skills with the new challenges this subtask presents.