<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>CJ Nguyen on KDE Blogs</title><link>https://blogs.kde.org/authors/cjnguyen/</link><description>Recent content in CJ Nguyen on KDE Blogs</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 29 Mar 2026 23:32:47 +0100</lastBuildDate><atom:link href="https://blogs.kde.org/authors/cjnguyen/index.xml" rel="self" type="application/rss+xml"/><item><title>[SoK 2026] Final Update for 'Automating Promo Data Collection' Task</title><link>https://blogs.kde.org/2026/03/27/sok-2026-final-update-for-automating-promo-data-collection-task/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><author>CJ Nguyen</author><guid>https://blogs.kde.org/2026/03/27/sok-2026-final-update-for-automating-promo-data-collection-task/</guid><description>&lt;p&gt;Hi all! Just finished up the last bit of work for my Season of KDE task of automating data collection for the KDE promotional team.&lt;/p&gt;
&lt;p&gt;Since the &lt;a href="https://blogs.kde.org/2026/02/18/sok-2026-midterm-update-for-automating-promo-data-collection-task/"&gt;midterm blogpost&lt;/a&gt; I've been assigned no new tasks.
That means my final deliverables are a &lt;a href="https://invent.kde.org/cjnguyen/social_medial_follower_posts_scraper"&gt;follower/post count scraping script&lt;/a&gt; for specific social media websites, a &lt;a href="https://invent.kde.org/cjnguyen/reddit_insights_scraper"&gt;Reddit Insights page scraper&lt;/a&gt; that totals weekly insight data for a given subreddit, and &lt;a href="https://invent.kde.org/cjnguyen/google-alerts-evaluator"&gt;an article evaluation script&lt;/a&gt; that reads articles found by the Google Alerts system and evaluates their sentiment on KDE and its software.&lt;/p&gt;
&lt;h2 id="follower-and-post-counts-scraper"&gt;Follower and post counts scraper&lt;/h2&gt;
&lt;p&gt;Nothing much has changed here outside of some better error handling, consistency in argument help strings, and improved readability of log messages.
The script has run well on its weekly timer and seems to show no signs of giving up.
I do think I can improve it by making it more extensible to accommodate the scraping of new websites and accounts, but as of now it functions well for the links we're most worried about.&lt;/p&gt;
&lt;h2 id="reddit-insights-page-scraper"&gt;Reddit Insights page scraper&lt;/h2&gt;
&lt;p&gt;In the prior blogpost I mentioned worries about getting the script to run on a headless server.
The script has since been made capable of running headlessly through use of a Docker image which wraps the program run with an Xvfb display server.
&lt;a href="https://x.org/releases/X11R7.7/doc/man/man1/Xvfb.1.xhtml"&gt;Xvfb&lt;/a&gt; enables this by performing all graphical operations in virtual memory instead of on a physical display, allowing for the use of headful software in a headless environment.&lt;/p&gt;
&lt;p&gt;Shoutouts to &lt;a href="https://github.com/seanpianka/docker-python-xvfb-selenium-chrome-firefox"&gt;Sean Pianka's repo&lt;/a&gt; containing Dockerfiles used to run Xvfb-wrapped Selenium scripts and &lt;a href="https://github.com/SeleniumHQ/docker-selenium"&gt;Selenium's own Docker images&lt;/a&gt; used for the Selenium Grid server project.
Without those resources it would have taken me a lot longer to hack together the requirements for a Docker image that could run Selenium headfully.&lt;/p&gt;
&lt;p&gt;Along with the headless runs being solved, I also implemented plenty of bug fixes and improvements to user-facing messages.
Many of the bugs came from not properly exiting Selenium during handled errors, which I found out about when the server ended up with hundreds of open Firefox instances.
Hopefully I've cleaned all those up.&lt;/p&gt;
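&lt;p&gt;A minimal sketch of a cleanup pattern that avoids leaking browser instances, assuming a driver object with a &lt;code&gt;quit()&lt;/code&gt; method as Selenium's WebDriver classes provide. The &lt;code&gt;managed_driver&lt;/code&gt; helper and its &lt;code&gt;driver_factory&lt;/code&gt; parameter are hypothetical names used for illustration and are not taken from the actual script.&lt;/p&gt;

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Yield a browser driver and always quit it afterwards.

    driver_factory is any zero-argument callable returning an object with a
    quit() method, for example selenium.webdriver.Firefox (assumption: the
    real script may structure this differently).
    """
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when an exception propagates out of the
        # with block, so no browser process is left behind.
        driver.quit()
```

&lt;p&gt;With Selenium installed, this could be used as &lt;code&gt;with managed_driver(webdriver.Firefox) as driver:&lt;/code&gt; around the scraping logic.&lt;/p&gt;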
&lt;h2 id="google-alerts-evaluator"&gt;Google Alerts evaluator&lt;/h2&gt;
&lt;p&gt;This task was a fairly large undertaking involving plenty of research and implementation steps.
There were three major requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Develop a pipeline to take in Google Alerts emails and pre-process them into articles the model can read.&lt;/li&gt;
&lt;li&gt;Evaluate lightweight sentiment analysis models that can run on a server for their ability to analyze articles on KDE products.&lt;/li&gt;
&lt;li&gt;Parse model output into a human-readable and easy to work with data format.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The final result is a pipeline that&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reads Google Alerts emails&lt;/li&gt;
&lt;li&gt;Pre-processes the articles into Markdown files for model reading&lt;/li&gt;
&lt;li&gt;Feeds them to a local LLM configured to provide sentiment analysis output&lt;/li&gt;
&lt;li&gt;Takes the LLM output and sends it into a CSV file (if possible)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You can see how doing this by hand could take up a lot of people's time, so hopefully this pipeline significantly cuts that down.&lt;/p&gt;
&lt;h3 id="google-alerts-email-reading-and-processing"&gt;Google Alerts email reading and processing&lt;/h3&gt;
&lt;p&gt;This was no issue as Google Alerts are all sent through Gmail and Google itself provides a &lt;a href="https://developers.google.com/workspace/gmail/api/guides"&gt;very usable Gmail API&lt;/a&gt; for extracting emails from Gmail accounts.
After generating the required credentials, fetching emails was as easy as using the purpose-built tools in a &lt;a href="https://pypi.org/project/google-api-python-client/"&gt;Python package that provides bindings&lt;/a&gt; for the Gmail API.
The emails were all formatted in XML, so past experience with webscraping from the last two tasks played a part in making fetching article links from the emails painless to implement.
After the article links were extracted from the emails, their contents were then fetched in Markdown format for use with the decided model.&lt;/p&gt;
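&lt;p&gt;A rough sketch of the link extraction step using only the Python standard library. It assumes the email body is available as an HTML string and that alert links point at a redirect URL carrying the real article address in a &lt;code&gt;url&lt;/code&gt; query parameter. The names &lt;code&gt;LinkCollector&lt;/code&gt;, &lt;code&gt;unwrap_redirect&lt;/code&gt;, and &lt;code&gt;extract_article_links&lt;/code&gt; are hypothetical, and the actual script may use a different parser.&lt;/p&gt;

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def unwrap_redirect(href):
    """Return the target of a redirect link, or href itself if not wrapped.

    Assumption: wrapped links carry the article address in a "url" query
    parameter. Any link with such a parameter is treated as wrapped.
    """
    target = parse_qs(urlparse(href).query).get("url")
    return target[0] if target else href


def extract_article_links(email_html):
    """Return de-duplicated article URLs in their order of appearance."""
    collector = LinkCollector()
    collector.feed(email_html)
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(unwrap_redirect(h) for h in collector.links))
```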
&lt;h3 id="model-evaluation"&gt;Model evaluation&lt;/h3&gt;
&lt;p&gt;We very quickly looked towards some local large language models (LLMs) to serve the sentiment analysis task.
There were more than a few sentiment analysis fields that would be difficult for more basic models, and using an LLM simplified implementation greatly.
After the evaluation of some small-footprint models, by far the best at both conforming to the desired output format and performing sentiment analysis on the articles was Qwen3 with 4 billion parameters.
It is lightweight enough to run on an older CPU in decent time, and while it doesn't agree amazingly with human judgement, it more often errs on the side of caution, such as deciding more articles are related to KDE than actually are, which wastes time but doesn't exclude relevant articles.&lt;/p&gt;
&lt;h3 id="designing-model-output-and-post-processing"&gt;Designing model output and post-processing&lt;/h3&gt;
&lt;p&gt;It turns out that LLMs come in different flavors, and some, specifically instruct models, are much better at conforming to instructions than others.
Many attempts were made to make other types of models provide output in a strict format and, if you need specific output, it's a headache you should definitely consider avoiding by choosing instruct models from the start.&lt;/p&gt;
&lt;p&gt;An instruct model coupled with a well-constructed system prompt (the meta prompt that sets initial instructions for the model) and grammar file &lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md"&gt;written in GBNF format&lt;/a&gt; can cause model output to be very predictable.
The system prompt written for this task is specifically constructed to bound model output by asking it to output sentiment analysis features in a Python formatted array of strings.
Even with the above methods, instruct models still do botch output occasionally, so the script contains plenty of post-processing and error handling steps before model output is processed into the output CSV file.&lt;/p&gt;
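&lt;p&gt;The post-processing described above could look roughly like the sketch below. It assumes the model replies with a Python-style list of strings, possibly surrounded by stray text. The column names in &lt;code&gt;FIELDS&lt;/code&gt; and the helpers &lt;code&gt;parse_model_output&lt;/code&gt; and &lt;code&gt;append_row&lt;/code&gt; are hypothetical, since the real script's fields are not listed here.&lt;/p&gt;

```python
import ast
import csv

# Hypothetical column names; the real script's sentiment fields are not
# listed in this post.
FIELDS = ["related_to_kde", "sentiment", "summary"]


def parse_model_output(raw):
    """Return a list of strings parsed from raw model text, or None on failure."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        return None
    try:
        # literal_eval only evaluates literals, so model text is never executed.
        value = ast.literal_eval(raw[start:end + 1])
    except (ValueError, SyntaxError):
        return None
    if not isinstance(value, list) or len(value) != len(FIELDS):
        return None
    if not all(isinstance(item, str) for item in value):
        return None
    return value


def append_row(csv_file, article_url, raw):
    """Write one CSV row for an article; return False if output was unusable."""
    parsed = parse_model_output(raw)
    if parsed is None:
        return False
    csv.writer(csv_file).writerow([article_url, *parsed])
    return True
```

&lt;p&gt;Returning &lt;code&gt;False&lt;/code&gt; instead of raising lets the caller decide whether to retry the model or log the article for manual review.&lt;/p&gt;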
&lt;h2 id="experience-and-lessons-learned"&gt;Experience and lessons learned&lt;/h2&gt;
&lt;p&gt;I've learned a significant amount about web scraping and how to navigate data troubles.
I'm definitely a lot more confident about using developer webtools, HTML processing, and browser automation frameworks as a result of my SoK experience.
Also, after working on the Google Alerts sentiment analysis task, I certainly feel more educated on AI topics and how they are used and deployed.&lt;/p&gt;
&lt;p&gt;My project was a little unusual in that I wasn't working on existing KDE software but on utility scripts that were built from the ground up for KDE community members.
This made things fun through the freedom I had with implementing solutions, but I feel the scripts are not fully developed or as problem-free as possible.
I'd hate to just leave them as is while feeling that way, so I'll continue working on the already made scripts as well as new ones so long as I can help out.&lt;/p&gt;
&lt;p&gt;Huge thanks to &lt;a href="https://invent.kde.org/paulb"&gt;Paul Brown&lt;/a&gt; for mentoring me through this project and being a pleasure to work with, as well as the KDE community for hosting this great event.
I had a lot of fun working on these scripts and am glad I could help out by contributing something to this awesome community.&lt;/p&gt;</description></item><item><title>[SoK 2026] Midterm update for 'Automating promo data collection' task</title><link>https://blogs.kde.org/2026/02/18/sok-2026-midterm-update-for-automating-promo-data-collection-task/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><author>CJ Nguyen</author><guid>https://blogs.kde.org/2026/02/18/sok-2026-midterm-update-for-automating-promo-data-collection-task/</guid><description>&lt;p&gt;Hey all! I'm CJ and I'm checking in with a midterm update on the Season of KDE task of automating data collection for the KDE promotional team.&lt;/p&gt;
&lt;p&gt;The first term of the two for this Season of KDE task has mostly been a learning experience of what does and doesn't work when it comes to scraping data from the web, laying down our toolset and approach to data collection.&lt;/p&gt;
&lt;p&gt;Three subtasks have resulted:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a script that collects follower and post counts from several websites housing KDE's social media accounts&lt;/li&gt;
&lt;li&gt;Create a script that processes information from the Reddit Insights page for the KDE subreddit&lt;/li&gt;
&lt;li&gt;Create a script automating the evaluation of articles discussing KDE tools&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first two of those are mostly completed while the last one is in its research and planning phase.
Both finished subtasks came with their own sets of challenges, techniques and tools that I'll detail separately.&lt;/p&gt;
&lt;h2 id="follower-and-post-counts-scraper"&gt;Follower and post counts scraper&lt;/h2&gt;
&lt;p&gt;This is a script I discussed in my &lt;a href="https://blogs.kde.org/2026/02/01/season-of-kde-2026-week-1-progress-for-automating-promo-data-collection/"&gt;first blog post&lt;/a&gt; that scrapes follower and post counts from X (formerly known as Twitter), Mastodon, Bluesky, and Threads.
The major updates to this script made since then are that it employs a more user- and server-friendly usage method and that we've tackled a few issues that came up outside of the script's scraping.
On the usage side I've added command-line arguments and an expectation for a JSON file containing the links to scrape from.
This makes swapping out social media links easy and leaves room to add options for scaling up configuration of the script if any further development is needed.&lt;/p&gt;
&lt;p&gt;At the point of writing, the logic of the script has held up well, but the data format we were outputting to, Open Document Format (ODF), wasn't friendly for our specific usage, which is something I touched on in that first blog post.
In the end we decided the tools that interface with ODF were too unwieldy to work with from an automation and programmatic standpoint so we're looking into alternatives at the moment.
One promising solution is KDE's &lt;a href="https://labplot.org/"&gt;LabPlot&lt;/a&gt; which has a good looking (but experimental) &lt;a href="https://docs.labplot.org/en/sdk.html"&gt;Python API&lt;/a&gt; and is FOSS.
For now I've set the script up to output to a user-friendly JSON file until we resolve what tool will be leveraged for data analysis in the end.&lt;/p&gt;
&lt;p&gt;Another issue came from the input-side of the script in the X/Twitter scraping portion.
Many public Nitter instances implement bot-prevention I was unaware of that triggered on an attempted headless server run of the script.
With that making simpler scraping methods difficult and also paying respect to those instances' desire not to be botted, I've decided to spin up our own local Nitter instance on the server which is running the script.
Now scraping X/Twitter comes much more easily and with a lot less risk of failure.&lt;/p&gt;
&lt;h2 id="kde-subreddit-insights-scraper"&gt;KDE subreddit Insights scraper&lt;/h2&gt;
&lt;p&gt;Since that first week we've added another task: the creation of a script that can add up the weekly influx of new visitors, unique page views, and members of the KDE subreddit using the subreddit's Insights page.
This script mostly challenged our ability to automate the login process for Reddit as the usual methods are prevented by browser verification tools.&lt;/p&gt;
&lt;p&gt;Reddit appears to use some form of &lt;a href="https://developers.google.com/recaptcha/docs/versions"&gt;invisible reCAPTCHA&lt;/a&gt; on its login page.
The method of implementation changes based on which version they use, but in the end a score grading the likelihood of a user being a bot or a human is returned to the website upon login.
This means that simple HTTP requests are likely not enough to get the job done and that a level of interaction supplied using a browser automation framework is needed to handle the login process.&lt;/p&gt;
&lt;p&gt;To that end, we chose to leverage the long-standing &lt;a href="https://github.com/seleniumhq/selenium"&gt;Selenium&lt;/a&gt; web browser automation framework.
Selenium, and many browser-automation frameworks like it, works by launching a full-featured web browser to run its automated tasks.
This introduces problems in running these scripts on a headless server but greatly simplifies bot-prevention thwarting and the loading of any JavaScript-sensitive page elements.&lt;/p&gt;
&lt;p&gt;With Selenium automating our login process, the only challenge left was to process the HTML data retrieved.
Reddit Insights presents its information in the form of bar charts that visualize the daily page views, unique visitors, and subscribers to a subreddit.
Some small analysis of the page source revealed that the daily data populating the bar charts are stored with millisecond UNIX timestamp representations of those days.
Using &lt;a href="https://www.crummy.com/software/BeautifulSoup/"&gt;BeautifulSoup&lt;/a&gt;, it was very easy for me to grab that daily data using those timestamps and sum up the totals needed for our script.&lt;/p&gt;
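&lt;p&gt;Once the daily values are pulled out of the page, the weekly summing step can be sketched as below. This assumes the scraped data has already been turned into a mapping from millisecond UNIX timestamps to daily counts and that each timestamp marks midnight UTC. The helper names &lt;code&gt;to_ms&lt;/code&gt; and &lt;code&gt;weekly_total&lt;/code&gt; are hypothetical, and the HTML parsing itself is left out.&lt;/p&gt;

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000


def to_ms(day):
    """Millisecond UNIX timestamp for midnight UTC at the start of a date."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)


def weekly_total(daily_counts, last_day):
    """Sum the counts for the seven days ending on last_day (inclusive).

    daily_counts maps millisecond timestamps to that day's count. Assumption:
    each timestamp marks midnight UTC; the real page may use another offset.
    Days missing from the mapping count as zero.
    """
    end_ms = to_ms(last_day)
    return sum(daily_counts.get(end_ms - i * DAY_MS, 0) for i in range(7))
```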
&lt;p&gt;The main challenge this script presents now is how we can get it running on a weekly basis on a headless server.
The UI component is non-negotiable so the solution will very likely come in the form of server configuration.&lt;/p&gt;
&lt;h2 id="smaller-updates"&gt;Smaller updates&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Investigated automation of NextCloud data uploads&lt;/li&gt;
&lt;li&gt;Researched how to schedule scripts to run on an interval using systemd unit files&lt;/li&gt;
&lt;li&gt;Wrote technical documentation on the purpose and usage of both scripts developed at the point of writing&lt;/li&gt;
&lt;li&gt;Researched various alternative packages for performing HTTP requests and browser automation tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="future"&gt;Future&lt;/h2&gt;
&lt;p&gt;Since the last two subtasks are complete logic-wise outside of any future issues we run into, a new one has been assigned as part of the data collection automation task.
The KDE promo team collects various articles about KDE and software related to it and evaluates the contents of those articles as they relate to KDE and how they view whichever KDE tool they discuss.
This evaluation process is performed manually which takes up time, so I've been tasked with developing some method of analyzing these articles in an automated fashion.&lt;/p&gt;
&lt;p&gt;Along with that new subtask, solving the issues of running browser automation software on a server and what data evaluation software we'll target will greatly benefit us by expanding our options for deploying scripts made in this task and making their data immediately useful for the KDE promo team.&lt;/p&gt;
&lt;h2 id="lessons-learned"&gt;Lessons learned&lt;/h2&gt;
&lt;p&gt;It's been a lot of fun to tackle the first two tasks.
I've had to pull from past experience with APIs, HTML, and HTTP that has been rotting in deeper parts of my brain as well as learn much more about how modern, full-featured websites deploy those tools.
I'm a bit anxious about the problem of server deployment since I want these scripts to be as useful and maintainable as possible for the KDE promotion team, but I'm confident we'll find a solution and I'm sure it will feel very rewarding to solve.&lt;/p&gt;
&lt;p&gt;Concerning the new subtask, this assignment is a departure from the first two and it's very likely a light and local AI/machine learning method will be looped into this process.
That makes it exciting to tackle since it's so different from the last couple of subtasks and incorporates an entirely separate emerging field.
I'm very much looking forward to rounding my skills with the new challenges this subtask presents.&lt;/p&gt;</description></item><item><title>Season of KDE 2026: Week 1 Progress for Automating Promo Data Collection</title><link>https://blogs.kde.org/2026/02/01/season-of-kde-2026-week-1-progress-for-automating-promo-data-collection/</link><pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate><author>CJ Nguyen</author><guid>https://blogs.kde.org/2026/02/01/season-of-kde-2026-week-1-progress-for-automating-promo-data-collection/</guid><description>&lt;p&gt;Hi all! I'm CJ, and I'm participating in Season of KDE 2026 by automating portions of the data collection for the KDE promo team. This post is an update on the work I've done in the first week of SoK.&lt;/p&gt;
&lt;p&gt;My mentor gave me a light task to help me get set up and familiarize myself with the tools I'll be using for the rest of the project. The task was to automate the population of a spreadsheet that tracks follower and post counts for X (formerly known as Twitter), Mastodon, Bluesky, and Threads.&lt;/p&gt;
&lt;p&gt;The spreadsheet takes the follower and post counts of some of KDE's social media platforms and makes calculations based off that data. Important things to note:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;data from the sites is entered manually&lt;/li&gt;
&lt;li&gt;there are a lot of styles and formulas in the sheet&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="fetching-account-data"&gt;Fetching Account Data&lt;/h2&gt;
&lt;p&gt;Grabbing data was mostly no trouble. Mastodon and Bluesky were especially easy to work with. Both have public, well-documented APIs that let people collect all kinds of data in human-readable formats. Each offers an endpoint that outputs account information, including follower and post counts, for a given account as neat JSON (&lt;a href="https://docs.bsky.app/docs/api/app-bsky-actor-get-profile"&gt;Bluesky&lt;/a&gt;, &lt;a href="https://docs.joinmastodon.org/methods/accounts/#get"&gt;Mastodon&lt;/a&gt;). All it took were GET requests to these endpoints and it was smooth sailing.&lt;/p&gt;
&lt;p&gt;X and Threads proved a bit more finicky. Both of their APIs limit access to much of their functionality, which usually means webscraping methods are the most accessible for grabbing public account data. Threads shows users' follower counts directly on an account's landing page, so processing a GET request to the URL of KDE's Threads account made it easy to grab. The problem is that there seems to be no direct way to grab the post count either through their API or with webscraping methods. For now, we've chosen to leave that be and circle back when I explore Threads more in the future. X presents a similar problem but there is an open-source frontend alternative named &lt;a href="https://github.com/zedeus/nitter"&gt;Nitter&lt;/a&gt;, instances of which lay all the stats information out in the open. The reliability of this method depends on public Nitter instances being available so it may be worth coming back around to this in a later part of the project, but for now it's a viable solution for getting follower and post counts.&lt;/p&gt;
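&lt;p&gt;As a rough illustration of those GET requests, the sketch below builds the two endpoint URLs and picks the counts out of the decoded JSON. The Bluesky host &lt;code&gt;public.api.bsky.app&lt;/code&gt; and all helper names are assumptions made for this example, and the Mastodon endpoint shown expects a numeric account id. The actual script may be organized quite differently.&lt;/p&gt;

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def mastodon_url(instance, account_id):
    """URL of Mastodon's account endpoint for a numeric account id."""
    return "https://{}/api/v1/accounts/{}".format(instance, quote(str(account_id)))


def bluesky_url(handle):
    """URL of Bluesky's getProfile endpoint (host is an assumption)."""
    return ("https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile"
            "?actor=" + quote(handle))


def mastodon_counts(payload):
    """Pick follower and post counts out of a decoded Mastodon account."""
    return {"followers": payload["followers_count"],
            "posts": payload["statuses_count"]}


def bluesky_counts(payload):
    """Pick follower and post counts out of a decoded Bluesky profile."""
    return {"followers": payload["followersCount"],
            "posts": payload["postsCount"]}


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

&lt;p&gt;For example, &lt;code&gt;bluesky_counts(fetch_json(bluesky_url("kde.org")))&lt;/code&gt; would return a small dictionary with both counts, assuming the handle exists and the endpoint is reachable.&lt;/p&gt;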
&lt;h2 id="inputting-the-data-into-the-spreadsheet"&gt;Inputting the Data Into the Spreadsheet&lt;/h2&gt;
&lt;p&gt;With the data all fetched, all that was left was to add that data to the ODF spreadsheet. I had this down as the easy part of the task but in the end it wasn't so simple. I found two major Python packages that can interpret and write ODF files: pandas and pyexcel. Both of these have no problem reading data from the files, but when it comes to saving they don't preserve some elements of the spreadsheet. In the end we went the simple route, which is to save the data to a separate ODF file using one of the Python-ODF interfaces and import that into the data sheet. This took a little finagling with formulas to get things working without popping errors into cells in the sheet, but in the end we have an output ODF spreadsheet file containing the required data and the original spreadsheet with all the calculations pulling that data into its formulas, removing any requirement of a human interfacing with this portion of data collection.&lt;/p&gt;
&lt;h2 id="learned-lessons"&gt;Learned Lessons&lt;/h2&gt;
&lt;p&gt;I feel like this week's task was a great first step into data collection automation. It was challenging without being too difficult to make progress on and forced me to explore different avenues for gathering data. On the confidence side, getting a (mostly) successful task out the gate helped me feel more comfortable with the tools and processes that will likely appear throughout the entirety of my SoK experience. Things will scale up from here on out though so I'm also keeping myself in check.&lt;/p&gt;
&lt;p&gt;From what I understand some of the most difficult parts of automated data collection come through having to interface with JavaScript and not getting banned, both of which I've yet to come face-to-face with in any substantial capacity so far. Along with that I've faced unexpected problems, such as the issue with modifying ODF files and that some websites don't play as well with certain browsers, which I don't have an easy way to test for yet. With these in mind I'm trying to tread lightly and be diligent with research and good practice as I continue on.&lt;/p&gt;</description></item></channel></rss>