Phoenix - the recommendation system
Every day, hundreds of thousands of users read RTL Nieuws or watch Videoland. To enhance the experience, we like to recommend content to our customers. These recommendations are tailored according to many factors, such as previously watched videos or read articles, other users' tastes, similar content, etc. Our data science team works incredibly hard to create the best recommendations for each of our users.
Once the recommendations are available, we need to make sure that the end user is able to get them. Hence, for fast uploading and delivery, we created Phoenix.
The name
The name is not totally random. We used to have an external party that would deliver the recommendations. However, we were facing a multitude of issues, and often the system wasn't capable of delivering the quality we needed.
Towards the end of 2019 we launched our own solution in production and, like the phoenix, we rose from the ashes and gave stability to the platform. We decreased the latency from an average of 110ms to 35ms and lowered the data-upload time from 8 hours to 3 minutes.
The internals
Phoenix is developed in Go with Redis as backend. In our experience, a recommendation is nothing more than a key/value pair. In fact, the way that we store each recommendation in Redis is as follows:
user_id_1 -> [item_1, item_2, ..., item_n]
user_id_2 -> [item_4, item_5, ..., item_m]
The user_id_x is the unique ID that identifies a customer. Disclaimer - we do NOT use emails or any sensitive data for recommendations.
Since the recommendation has been reduced to a simple key/value unit, designing the API was quite straightforward. Redis is a natural choice for the job. We could have used other key/value storage databases, but the simplicity and robustness of Redis are second to none.
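To make the key/value idea concrete, here is a minimal sketch of what uploading and serving a recommendation could look like in Go with the go-redis client. The key names, the list layout and the fetchRecommendations helper are assumptions for illustration, not the actual Phoenix code.

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// fetchRecommendations returns the ordered item IDs stored for a user.
// The key layout mirrors the example above: one Redis list per user_id.
func fetchRecommendations(ctx context.Context, rdb *redis.Client, userID string) ([]string, error) {
	// LRANGE 0 -1 reads the whole list of recommended items.
	return rdb.LRange(ctx, userID, 0, -1).Result()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Upload step: the data pipeline overwrites the ordered recommendations.
	rdb.Del(ctx, "user_id_1")
	rdb.RPush(ctx, "user_id_1", "item_1", "item_2", "item_3")

	// Serving step: the API simply reads the list back.
	items, err := fetchRecommendations(ctx, rdb, "user_id_1")
	if err != nil {
		panic(err)
	}
	fmt.Println(items) // [item_1 item_2 item_3]
}
```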
Scalability
Writing the API and choosing the database was "easy". But how do we make sure that we can scale properly and cost-efficiently? Well, the answer was Kubernetes.
In our production cluster we are using 3 c5.xlarge nodes with tainting to avoid having other Pods take resources. The Deployment object has an anti-affinity definition that does not allow two pods of the same kind to run on the same node. We do this for redundancy purposes and to make sure that we are able to load-balance across multiple nodes.
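As an illustration, the relevant part of such a Deployment spec could look roughly like the snippet below. The app: phoenix label is made up for the example; the real manifests may differ.

```yaml
# Hypothetical excerpt of the Deployment spec: at most one phoenix pod per node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: phoenix
        topologyKey: kubernetes.io/hostname
```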
When traffic increases, in particular during holidays or pandemic times, a HorizontalPodAutoscaler (HPA) that we have defined instructs Kubernetes to schedule a new pod. Due to the anti-affinity, a new node is created and attached to the cluster. In the opposite scenario, when the cluster "cools off", it shrinks itself to reduce costs.
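A sketch of such an HPA, scaling on CPU utilization for simplicity (names and thresholds are illustrative, and the API version depends on your cluster), could look like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: phoenix
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: phoenix
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```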
The metrics that we use are based on the Prometheus metrics that we added to the codebase. Based on the RPS and CPU/memory usage, the HPA decides if it needs to trigger a scale-up/down event. This allows us to use custom metrics to decide what is best to do based on the load.
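As a sketch, exposing such a request-rate metric with the Prometheus Go client could look like the snippet below. The metric name and handler are assumptions for illustration; scaling on it additionally requires a metrics adapter that feeds Prometheus data to the HPA.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts served recommendation requests; its rate (RPS) is one
// of the signals the HPA can scale on. The metric name is illustrative.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "phoenix_http_requests_total",
	Help: "Total number of recommendation requests served.",
})

func recommendationsHandler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Inc()
	// ... look up the recommendations in Redis and write the response ...
	w.Write([]byte(`["item_1","item_2"]`))
}

func main() {
	http.HandleFunc("/recommendations", recommendationsHandler)
	// Prometheus scrapes this endpoint; an adapter exposes the values to the HPA.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```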
Give it a try 🚀
If you would like to try Phoenix, you can start by reading the wiki and forking the project. If you find any bugs or you'd like a new feature, don't hesitate to open an Issue or a PR ❤️
We hope that this project can make your life easier, as it does for us!
Peace ✌️