Building a University Course System: Challenges and Solutions

Max Yi-Hsun Chou 周奕勳
8 min read · Nov 2, 2023

“The course website has never worked properly.” This lyric comes from a song that was once a candidate for the graduation song at National Taiwan University. It shows just how much a poorly designed course system can affect students.

The course website is a must-visit for every NTU student. I still remember being terrified by it when I first entered NTU. Selecting courses works like shopping on Amazon: you add courses to a shopping cart. But the flow is extremely unreasonable; it opens a new window every single time you add a course to the cart.

However, the course website is not the only system at the school with a poor user experience; there are many. But the course website is the most frequently visited one, it has been complained about for a very long time, and, as we could all imagine, nothing has changed. 😄

My senior Po-Hao (James) Chang, also known as Mr. Po, and his team built a new-generation course website called “NTU Course Neo” as part of a class project.

Once it was released, it received many positive responses from students. Later on, the Neo team formed a partnership with the Office of Academic Affairs (ACA, which is responsible for the legacy course website), with a plan to integrate with and eventually replace the old site.

To handle a course website serving the university's more than 30,000 students, we had several discussions about system design and its infrastructure.

Some Background

ACA's first request was that everything be self-hosted. This meant we could not use any managed services such as Cloud SQL, GKE, and so on, and had to run those services ourselves. All of it had to be built from scratch, which only increased implementation complexity and maintenance costs.

In this project, I was the only software architect. For someone like me, who had mainly worked with cloud services in recent years, it was indeed a painful experience. It had been a long time since I had last needed to set up a self-hosted database.

Anyway, their reason for this request was to ensure cybersecurity and reduce the possibility of leaking student data. I found this totally irrational, given that their existing system was vulnerable to SQL injection, effectively treated private data as open data, and had little in the way of data-leak prevention. (LOL)

Fortunately, ACA already had its own data center, so we just needed to get new machines and set them up.

Infrastructure

The main infrastructure is based on Proxmox Virtual Environment (PVE), with a number of virtual machines set up on the hypervisor layer. With this, we are able to recover VMs without physical access to the machines or resorting to IPMI.

Application Layer Infrastructure

The two main services are the API server and the front-end server, both running on Kubernetes. Since some of the front-end pages require server-side rendering, they need a runtime to render them, meaning we couldn't rely solely on a static file server.

Kubernetes

As for why we chose Kubernetes, the primary reason was horizontal scaling. Our services were Node.js-based, and since Node.js runs JavaScript on a single thread, a single process can't natively use the full potential of a multi-core machine. Without an orchestrator, we would need something like PM2 to handle clustering.
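
For context, this is roughly what PM2-style clustering automates. A minimal sketch using Node's built-in cluster module (the port and handler are illustrative, not our actual service):

```typescript
// cluster-server.ts — fork one worker per CPU core so a single
// Node.js service can use the whole machine (what PM2 automates).
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

if (cluster.isPrimary) {
  // The primary process only manages workers; it serves no traffic.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  cluster.on("exit", (worker) => {
    console.log(`worker ${worker.process.pid} died, restarting`);
    cluster.fork(); // keep the worker pool at full size
  });
} else {
  // Each worker runs its own event loop; they all share the same port.
  http
    .createServer((_req, res) => {
      res.end(`handled by pid ${process.pid}\n`);
    })
    .listen(3000);
}
```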

Of course, there are several existing solutions we could have used, such as PM2 clustering or VM-based horizontal scaling, which would have been less complicated than dealing with Kubernetes.

For some companies or teams, that might be a valid point. However, one of the prerequisites was that almost every team member had some level of understanding of Docker/Kubernetes. Most people had experience with NASA (an NTU network and system administration course), which I didn't (because I hadn't completed homework 0…).

Plus, I believe that PM2 clustering came with various OS environment dependency issues, and VM-based scaling could consume considerable resources and might not scale fast enough. So, in the end, we went with Kubernetes to benefit from features such as autoscaling and Ingress.

Kubernetes on Bare Metal (VM)

There are many tutorials on how to set up Kubernetes on bare metal, but, in a nutshell, it involves creating a master node (which, by default, doesn't schedule pods) and joining worker nodes (VMs) to it with kubeadm to form the cluster.

Setting up Kubernetes on bare metal was quite challenging. For instance, joining nodes to the cluster with kubeadm often led to issues.

We occasionally encountered problems with Flannel, which caused mysterious “connection refused” errors on worker nodes after they joined the cluster.

Additionally, exposing services that need an external IP requires either specifying a NodePort or using MetalLB for network-layer load balancing, for example in front of an Ingress such as Kong or Istio's Envoy.

Ingress

While Kubernetes defines a built-in Ingress resource, we needed some extra features, so we explored several Ingress controllers such as Kong, Istio's Envoy-based gateway, and Traefik. Eventually, we settled on Kong.

Configuring and deploying Kong was the most straightforward, although in this form it offered the least functionality: we used only the Kong Ingress Controller, which is quite different from a full Kong Gateway deployment.

We initially planned to use Istio with Envoy, but during stress testing we noticed that Istio injects an Envoy sidecar into every pod, and each sidecar consumes additional CPU and memory, which significantly increased overall resource usage. After weighing the test results, we abandoned Istio and switched to Kong.

I personally liked Istio because it provided many features, and with Envoy's native API it offered great flexibility for configuring the gateway. However, if you're not into too much hassle, the NGINX Ingress Controller is a suitable option: it covers most of the essentials and integrates with Grafana for monitoring.

Tips
Attaching an X-Request-ID header to every request can significantly cut down debugging time.
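
As an illustration (an Express-style sketch in TypeScript, not our exact middleware; the port and route are placeholders), attaching or propagating the ID looks like this:

```typescript
// request-id.ts — tag every request with an X-Request-ID so a single
// request can be traced through logs across services.
import { randomUUID } from "node:crypto";
import express from "express";

const app = express();

app.use((req, res, next) => {
  // Reuse an upstream ID (e.g. one set by the gateway) if present,
  // otherwise generate a fresh one.
  const requestId = req.header("x-request-id") ?? randomUUID();
  res.setHeader("x-request-id", requestId);
  next();
});

app.get("/health", (_req, res) => res.send("ok"));

app.listen(3000);
```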

Database

Since our databases were hosted on resource-rich VMs, handling a large number of connections wasn't an issue; instead, managing the connection pools became the real problem for us.

Let's assume the ideal pool size is given by a simple formula: connections per pod = maximum database connections ÷ maximum number of pods. Based on this, if the maximum number of pods is set at 5 and the database can handle 1,000 connections in total, then each pod should get a connection pool of around 200 connections.
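
A hedged sketch of how this translates into a per-pod pool configuration, assuming node-postgres (the names and numbers simply mirror the example above):

```typescript
// db.ts — size each pod's pool as (max DB connections / max pods),
// i.e. 1000 / 5 = 200 in the example above.
import { Pool } from "pg";

const MAX_DB_CONNECTIONS = 1000; // what the database can handle in total
const MAX_PODS = 5;              // the deployment's maximum replica count

export const pool = new Pool({
  host: process.env.PGHOST,
  database: "course",            // hypothetical database name
  max: Math.floor(MAX_DB_CONNECTIONS / MAX_PODS), // 200 per pod
  idleTimeoutMillis: 30_000,     // return idle connections to the pool
});
```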

After the load test, we found that some pods still experienced connection-pool shortages while others didn't fully utilize their pools. To address this, we initially considered a database connection proxy like PgBouncer.

Unfortunately, setting up high availability (HA) for PgBouncer was a complex task. After discussing it with the professor, we realized that the load-balancing algorithm could make a huge difference, so we switched from “round-robin” (the default setup) to “least connections,” which resolved the issue.

Please refer to “Round-robin vs. Least Connections” for details on the differences and a performance comparison.
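
To make the difference concrete, here is a toy sketch (not our gateway's actual code): round-robin hands out connections in a fixed cycle regardless of load, while least connections always picks the backend with the fewest in-flight connections, so a slow pod stops receiving new work.

```typescript
// balancer.ts — toy comparison of the two selection strategies.
interface Backend {
  name: string;
  activeConnections: number;
}

const backends: Backend[] = [
  { name: "pod-1", activeConnections: 180 }, // busy: slow queries piled up
  { name: "pod-2", activeConnections: 40 },
  { name: "pod-3", activeConnections: 95 },
];

let rrIndex = 0;

// Round-robin: rotate through backends in order, ignoring load.
function pickRoundRobin(): Backend {
  const backend = backends[rrIndex];
  rrIndex = (rrIndex + 1) % backends.length;
  return backend;
}

// Least connections: choose the backend with the fewest in-flight
// connections, so overloaded pods stop receiving new work.
function pickLeastConnections(): Backend {
  return backends.reduce((least, b) =>
    b.activeConnections < least.activeConnections ? b : least
  );
}

console.log(pickRoundRobin().name);       // "pod-1", even though it's busiest
console.log(pickLeastConnections().name); // "pod-2"
```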

Front-End Performance Tuning

Most of the core front-end modules came from NTU Course Neo. Since Neo had only a short development time, there were aspects that didn’t follow best practices, and there wasn’t much time for performance tuning.

Our front end was built with Next.js, so we primarily focused on optimizing React. While Webpack normally takes care of most things, it doesn't always do enough. For example, Webpack handles code splitting, but that's only effective when components are lazy-loaded. Initially, most of the components were not, so we spent some time refactoring them to load lazily.
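
For example, Next.js supports this through next/dynamic, which makes Webpack split a component into its own chunk that is fetched only when needed. A sketch with a hypothetical heavy component:

```tsx
// CoursePage.tsx — lazy-load a heavy component with next/dynamic so
// Webpack emits it as a separate chunk, downloaded only when rendered.
import dynamic from "next/dynamic";

// Hypothetical heavy component, e.g. a timetable renderer.
const CourseTimetable = dynamic(
  () => import("../components/CourseTimetable"),
  {
    loading: () => <p>Loading timetable…</p>, // shown while the chunk loads
    ssr: false, // skip server-side rendering for this client-heavy widget
  }
);

export default function CoursePage() {
  return (
    <main>
      <h1>My Courses</h1>
      <CourseTimetable />
    </main>
  );
}
```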

LRU Cache

By default, Next.js doesn't offer an LRU cache for rendered pages. This means every request triggers a fresh server-side render before the page appears.

After implementing an LRU cache (with a custom server), we found that performance didn't improve significantly.
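
For reference, this is the general shape of the pattern, assuming Express and the lru-cache package (it follows the well-known Next.js custom-server caching recipe, not our exact code; the route and sizes are illustrative):

```typescript
// server.ts — Next.js custom server that caches rendered HTML in an
// LRU cache so repeated requests can skip server-side rendering.
import express from "express";
import next from "next";
import { LRUCache } from "lru-cache";

const app = next({ dev: false });
const handle = app.getRequestHandler();

// Keep up to 500 rendered pages, each for 60 seconds.
const ssrCache = new LRUCache<string, string>({ max: 500, ttl: 60_000 });

app.prepare().then(() => {
  const server = express();

  server.get("/courses/:id", async (req, res) => {
    const cached = ssrCache.get(req.url);
    if (cached) {
      res.send(cached); // cache hit: no re-render needed
      return;
    }
    // Cache miss: render the page once, then remember the HTML.
    const html = await app.renderToHTML(req, res, "/courses/[id]", {
      id: req.params.id,
    });
    if (html && res.statusCode === 200) {
      ssrCache.set(req.url, html);
    }
    res.send(html ?? "");
  });

  server.all("*", (req, res) => handle(req, res));
  server.listen(3000);
});
```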

During stress testing, we found that, just as with the API server, a round-robin setup could overload a particular pod, leading to many blocked connections or even timeouts. We addressed this by switching to the “least connections” load-balancing algorithm.

Disabled Gzip Compression

We disabled gzip compression on the front-end server because one of our gateways already handles compression, so we needed to serve uncompressed responses to the gateway.
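
In Next.js this is a single documented setting:

```typescript
// next.config.js — let the gateway handle compression instead.
module.exports = {
  compress: false, // disable Next.js's built-in gzip compression
};
```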

Challenges and Frustrations

School-based SSO

It is the path every NTU student must go through, and for a proper school-affiliated system, supporting Single Sign-On (SSO) is a hard requirement.

However, we faced various difficulties during the SSO integration. First, NTU's Computer and Network Center (which is responsible for SSO) did not provide specific SSO integration documentation, and they supported multiple authentication methods, both encrypted and unencrypted. The uncertainty and lack of detailed documentation put significant pressure on our team, as we were close to the project's launch deadline.

Adding to the uncertainty, the data required by the Identity Provider (IdP), which in this case is the Computer and Network Center, differed for each method.

Besides that, when we encountered errors, we did not receive any error messages, which meant we had to contact the engineers at the Computer and Network Center. Fortunately, they gave us strong support, and we are deeply grateful for their assistance.

Later, we worked through almost every SAML and WS-Fed authentication method to determine which protocol NTU's Computer and Network Center actually used.

Additionally, SSO 2.0's callback worked by having a web page submit data via a form to a specified callback URL: they generated a page full of hidden fields and used JavaScript to auto-submit the form. Consequently, we could only return the token to the front end in the URL, which increased the risk of token leakage.

I cannot confirm whether the Computer and Network Center's implementation was incorrect, but we were unable to change the process. Under the same-origin policy, we could have simply set cookies in the browser; however, considering that we might need to offer an OpenAPI or support cross-platform clients in the future, we decided not to implement it that way.

Instead, I recommended a mechanism called the “transition token”: the callback carries only a short-lived transition token, which the client then exchanges for the real access token.
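
A hedged sketch of the idea (the endpoint paths, TTL, storage, and helper are hypothetical, not the real implementation):

```typescript
// transition-token.ts — exchange a short-lived, single-use transition
// token (safe to expose in a URL) for the real access token.
import { randomUUID } from "node:crypto";
import express from "express";

const app = express();

// Hypothetical store: transition token -> access token, 30-second TTL.
const pending = new Map<string, { accessToken: string; expiresAt: number }>();

// Called after SSO validates the user: mint a transition token and
// redirect to the front end with only that token in the URL.
app.get("/auth/callback", (_req, res) => {
  const accessToken = issueAccessToken(); // hypothetical helper
  const transitionToken = randomUUID();
  pending.set(transitionToken, {
    accessToken,
    expiresAt: Date.now() + 30_000,
  });
  res.redirect(`https://example.edu/login#tt=${transitionToken}`);
});

// The front end immediately exchanges the transition token over HTTPS.
app.post("/auth/exchange", express.json(), (req, res) => {
  const entry = pending.get(req.body.transitionToken);
  pending.delete(req.body.transitionToken); // single use
  if (!entry || entry.expiresAt < Date.now()) {
    res.status(401).json({ error: "invalid or expired transition token" });
    return;
  }
  res.json({ accessToken: entry.accessToken });
});

function issueAccessToken(): string {
  return randomUUID(); // placeholder for real token issuance (e.g. a JWT)
}

app.listen(3000);
```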

Epilogue

The entire architecture work for the project, from migration to completion, took approximately a month, with most of the time spent on configuration tuning and environment setup.

The most challenging aspect was integrating with the existing course-selection flow. The legacy course website was an entirely session-based web architecture with no RESTful API at all.

After multiple discussions, the legacy site's engineers asked us to write to their database directly. However, I personally felt this approach was entirely unacceptable, especially considering that we hadn't yet established robust role-based access control (RBAC); it could easily have led to serious issues.

Therefore, to address the integration challenge, I wrote a significant amount of PHP code to create APIs on top of the existing structure. This let us follow best practices and improved maintainability.

It may be hard to believe, but the core functions of the new course website were largely developed by current students at National Taiwan University.

I must say that the entire process was genuinely exhausting. Fortunately, NTU students learn quickly and are quite capable.

This article was completed in early February 2023, and I left the course website team at the end of February of the same year. While I had earnestly hoped to achieve certain goals, the final results did not meet my expectations. This is regrettable but entirely expected.

