We need to find an efficient and effective manner to allocate the resources around. This is what the resource management layer does.

Introduction to the problem

What is Cluster Resource Management?

Most of the time, the user specifies an amount of resources and the cluster decides how to allocate them (though approaches like Quasar (Delimitrou & Kozyrakis 2014) [1] work differently). There are mainly two parts in cluster resource management:

  • Allocation: deciding how many resources an application gets.
  • Assignment: deciding on which physical machine the application is effectively placed.

Types of management architectures

We mainly divide the management architectures in three ways:

Centralized

The resources of the cluster are managed in a single place.

Kubernetes

Kubernetes is a container orchestration system: a platform that lets you run and manage containers at scale.

  • Monitors and reschedules containers.
  • Provides persistent storage API, and communication APIs between containers.
  • Users interface with a master node, which is responsible for managing the cluster.
  • Each worker node runs an agent called the kubelet.
  • The design is inspired by Borg, Google's closed-source cluster manager.

Pods 🟨++

These are a group of containers with the same:

  • Lifecycle: they live and die together.
  • Network.
  • Storage volumes.
  • Task: they should run a common task.

Controlling pod placement

You can label certain nodes and then specify, via node selectors, that a pod should run on a node with a certain label. Kubernetes also has a simple control loop to keep a certain number of pods running.

If you don't have hard constraints, placement is guided by affinity, taints, and toleration rules.
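As a minimal sketch of label-based placement (the node name, label, and pod name here are illustrative), a pod can be pinned to labeled nodes with a `nodeSelector`:

```yaml
# First label a node:  kubectl label nodes worker-1 disktype=ssd
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd        # only schedule onto nodes carrying this label
  containers:
  - name: app
    image: nginx
```

If no node carries the label, the pod stays pending; affinity rules express the same idea as soft preferences instead of hard constraints.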

The service abstraction

A service in Kube is a group of pods that work together to offer a certain service. They are usually behind a load balancer. Every container specifies:

  • Request: resources requested by the container.
  • Limit: maximum resources the container can access.
  • If usage > request: the extra resources are not guaranteed to be available.
  • If usage > limit: the container is throttled or killed.

Quality of Service 🟩–

We can set quality-of-service classes on Kubernetes pods, derived from their requests and limits. There are three possible classes:

  • Guaranteed: every container has requests and limits set, and the requests are equal to the limits. These pods cannot get more resources, but they are the least likely to get killed; usually used when reliable resource availability matters.
  • Burstable: at least one container has a request or limit set, but the pod does not meet the Guaranteed criteria. These pods can use additional resources when available, but may be killed if the node runs out of resources.
  • Best effort: no container has requests or limits set. These pods have the lowest protection and are the first to be killed under resource pressure.

Example:

apiVersion: v1
kind: Pod
metadata:
  name: qos-example
spec:
  containers:
  - name: app-container
    image: nginx
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

This pod falls into the Burstable class, since the limits are higher than the requests. If they were equal, it would be Guaranteed; if neither were set, it would be Best effort.

Keep Alive

Sandboxes can be kept alive to process more invocations: usually a sandbox is kept alive for some seconds after a request is processed. The cost is that idle sandboxes use more memory. Knative decides the desired sandbox count as the number of in-flight requests divided by the sandbox parallelism (the number of requests that a sandbox can execute in parallel). This computation runs periodically, every 2 seconds. Policies can vary in:

  • Parallelism
  • Keep alive period.
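The scaling rule above can be sketched as follows (the function name is illustrative, and the real Knative autoscaler is more elaborate than this):

```python
import math

def desired_sandboxes(inflight_requests: int, sandbox_parallelism: int) -> int:
    """Knative-style desired count: ceil(in-flight requests / parallelism).

    sandbox_parallelism is the number of requests one sandbox can
    execute in parallel; this rule is re-evaluated every ~2 seconds.
    """
    if inflight_requests <= 0:
        return 0
    return math.ceil(inflight_requests / sandbox_parallelism)

# 10 in-flight requests, each sandbox handles 4 concurrently -> 3 sandboxes
print(desired_sandboxes(10, 4))
```

Raising the parallelism or shortening the keep-alive period trades request latency (more queueing, more cold starts) against memory held by idle sandboxes.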

Service Mesh 🟥

TODO

Horizontal Pod Autoscaler 🟥

TODO:

Dirigent

Classical FaaS systems (see Cloud Computing Services) do not schedule functions with low latency; cold starts are especially slow.

Classical FaaS systems

Previously, people focused on single-worker-node optimizations (startup time, namespace management, container image formats), but the control plane (similar to the Control Plane for networks), the part that actually issues the commands, has been overlooked.

Knative

Knative classically runs on Kubernetes, which was originally built for long-running workloads and therefore has high latency: about 1000 ms to start one function, and about 3000 ms for 100 functions. Most of this latency comes from the control plane, which sustains on average about 300 sandbox creations per second during cold starts.

Analysis of Bottlenecks

Each function has multiple hierarchical abstractions to decide the scheduling:

  • Deployment
  • ReplicaSet
  • Pods

You have a controller for each abstraction that attempts to drive it to the desired state; this creates a lot of overhead, because there are too many controllers. Each controller needs to write to the etcd database, which is a bottleneck. The API server also consumes a lot of CPU time, as everything passes through it (a serialization bottleneck).

Another bottleneck is the high number of RPC communications due to the microservice architecture.

So in brief, three main bottlenecks:

  • Hierarchical abstractions with independent controllers.
  • Synchronous persistence for cluster updates (serialization bottlenecks).
  • Microservice architecture with per-sandbox sidecars (communication bottlenecks).

Dirigent Innovations

Dirigent is a simple, clean-slate architecture independent of Kubernetes (it would be difficult to make it compatible).

Attacking bottlenecks

  • Single controller: one controller for all the abstractions.
    • They abolished the complex hierarchical abstractions.
    • Only a few core abstractions remain (function, sandbox, data plane, worker node).
    • State records are highly efficient: a couple of bytes (~16 B) instead of kilobytes.
  • Persistence-free latency-critical operations:
    • Only function, data plane and worker-node state is persisted; the rest can be reconstructed.
    • After recovery, a sandbox can get a different placement or IP address (this makes the cluster manager quicker); the user doesn't need to know this.
  • A monolithic control and data plane simplifies state management and deployment, instead of requiring many RPCs.

Evaluations

The paper compares to Knative and OpenWhisk, on hello-world and busy-loop functions.

(Evaluation figure from the paper.)

We see improvements of some orders of magnitude.

Two-level

Mesos

Mesos is an example of a two-level cluster manager: one global, central controller and multiple framework schedulers. Mesos hands out so-called resource offers to applications, which then decide whether to take the offer or not. Mesos adopts a classical master-slave architecture, akin to what we have seen for Apache Spark, Massive Parallel Processing, Distributed file systems and similar.

Mesos Architecture 🟩–

We have three parts:

  • Workers: they continuously send resource usage statistics to the controller. They have some executor API to accept jobs from the application frameworks (Spark, Kube, etc).
  • Controller: decides about resource offerings
  • Application frameworks: decide whether to accept the offer or not, and provide tasks to the workers.
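The offer flow above can be sketched as a toy round of offers (class and method names are illustrative, not the Mesos API): the controller proposes each worker's free resources, and each framework, not the controller, decides whether to accept.

```python
class Framework:
    """An application framework (Spark, Kube, ...) with fixed needs."""
    def __init__(self, name, cpu_needed, mem_needed):
        self.name = name
        self.cpu_needed = cpu_needed
        self.mem_needed = mem_needed

    def consider(self, offer):
        # The framework decides: accept only if the offer covers its needs.
        return offer["cpu"] >= self.cpu_needed and offer["mem"] >= self.mem_needed

class Controller:
    """Central controller: hands out one worker's free resources per offer."""
    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer_round(self, offers):
        placements = []
        for offer in offers:
            for fw in self.frameworks:      # propose the offer to frameworks in turn
                if fw.consider(offer):
                    placements.append((fw.name, offer["worker"]))
                    break                   # offer taken; move to the next one
        return placements

spark = Framework("spark", cpu_needed=2, mem_needed=4)
web = Framework("web", cpu_needed=1, mem_needed=1)
controller = Controller([spark, web])
offers = [{"worker": "w1", "cpu": 1, "mem": 2},
          {"worker": "w2", "cpu": 4, "mem": 8}]
print(controller.offer_round(offers))   # small offer goes to web, large to spark
```

The key design point is visible in `consider`: the controller never inspects the frameworks' internal scheduling logic, it only brokers resources.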

Distributed Cluster managers

One would start from a centralized solution, and move to this mode or the two-level one only when scalability issues appear. Examples are Omega (just a proof of concept, never actually developed) and Sparrow.

Sparrow

Sparrow is an example of a distributed scheduler. The main advantage is being more robust: it doesn’t have a single point of failure anymore.

Sparrow’s main architecture

You have a bunch of schedulers that assign tasks to workers; they do so independently.

Power of two choices 🟨–

This is a way to choose the worker for the task:

  • Probing: query two workers and choose the one with the least load.
  • Assigning: actually start the task after gathering this information.

This creates a lot of network overhead; to reduce it, requests are batched (we do not probe for every single request).
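The probing step can be sketched as follows (a toy simulation under the assumption of a uniform random probe, not Sparrow's actual implementation):

```python
import random

def pick_less_loaded(loads, i, j):
    """Deterministic core of the power of two choices:
    of the two probed workers, take the one with less load."""
    return i if loads[i] <= loads[j] else j

def two_choice_pick(loads):
    """Probe two workers uniformly at random, assign to the less loaded one."""
    i, j = random.sample(range(len(loads)), 2)
    return pick_less_loaded(loads, i, j)

# Tiny simulation: place 1000 tasks on 100 workers with two-choice probing.
random.seed(0)
loads = [0] * 100
for _ in range(1000):
    loads[two_choice_pick(loads)] += 1
print("max load with two choices:", max(loads))
```

Compared to assigning each task to a single uniformly random worker, probing just two candidates dramatically reduces the maximum queue length, which is why the extra probe traffic is usually worth it.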

Late Binding 🟨

This is an optimization over early binding: it solves the problem of choosing a worker that turns out to have a very long job queue.

After probing, instead of issuing the job, we make a reservation on some workers. When the worker is ready, it notifies the scheduler, and the scheduler assigns the job to the worker and cancels other reservations.
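The reservation mechanism can be sketched like this (class and method names are illustrative, not Sparrow's API): reservations are queued on the probed workers, the first worker to become idle claims the task, and the scheduler cancels the remaining reservations.

```python
from collections import deque

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.reservations = deque()   # queued task reservations, FIFO

class LateBindingScheduler:
    """Place reservations on the probed workers; first idle worker wins."""
    def __init__(self, workers):
        self.workers = workers
        self.pending = {}             # task -> list of workers holding a reservation

    def submit(self, task, probe_ids):
        chosen = [self.workers[i] for i in probe_ids]
        for w in chosen:
            w.reservations.append(task)
        self.pending[task] = chosen

    def worker_ready(self, worker):
        """Called when a worker becomes idle: bind its oldest live reservation."""
        while worker.reservations:
            task = worker.reservations.popleft()
            if task in self.pending:
                # Cancel the other reservations for this task.
                for w in self.pending.pop(task):
                    if task in w.reservations:
                        w.reservations.remove(task)
                return task
        return None

workers = [Worker(0), Worker(1)]
sched = LateBindingScheduler(workers)
sched.submit("t1", probe_ids=[0, 1])   # reserve on both probed workers
print(sched.worker_ready(workers[1]))  # the first idle worker gets the task
```

After worker 1 claims `t1`, worker 0's reservation is cancelled, so a stale reservation can never start the task twice.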

References

[1] C. Delimitrou and C. Kozyrakis, "Quasar: Resource-Efficient and QoS-Aware Cluster Management," Association for Computing Machinery, 2014.