Queueing theory is the theory behind what happens when you have lots of jobs, scarce resources, and consequently long queues and delays. It is literally the “theory of queues”: what makes queues appear and how to make them go away.

This is basically what happens in clusters, where you have a limited number of workers that need to execute a number of jobs.

We need a little maths to model the stochastic process of request arrivals.

Introduction to Queueing Theory

Characterization of Queueing Systems 🟩

We define here some basic terminology:

  • Number of servers: the number of workers that process in parallel
  • System capacity: the finite number of requests that can fit in the system at any time.
    • This includes the buffer capacity: how many requests can be waiting to be assigned.
  • Population size: the total number of potential requests, i.e. the size of the source that generates the jobs (often assumed infinite).
  • Arrival rate: the stochastic process that models the arrival of the requests.
  • Queue discipline: how you choose the next request to serve (first come first served, etc.)
  • Service time: the time required to service a job on this CPU, usually measured in seconds/job. Its reciprocal, the service rate in jobs/second, is usually indicated with $\mu$.

Inter-Arrival Time 🟩

Arrivals are usually modelled as a Poisson process (see Poisson processes), which makes the inter-arrival times exponentially distributed. Using a uniform distribution is not very realistic.
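As a tiny sketch (the rate below is invented): sampling exponential inter-arrival times in Python, where the sample mean should approach $1/\lambda$.

```python
import random

rate = 10.0  # invented arrival rate lambda, in requests/second

# Exponential inter-arrival times are how a Poisson process looks "up close".
gaps = [random.expovariate(rate) for _ in range(100_000)]
print(sum(gaps) / len(gaps))  # sample mean, close to 1/lambda = 0.1 s
```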

Key assumptions 🟩

  • Stationary: it does not depend on the time.
  • State independent: it does not depend on the previous requests.

In real systems these assumptions are not always true: bursts of traffic (or just day vs. night patterns) make the process non-stationary. But for this analysis it is ok to make them.

Types of Queues

Queueing systems in computer and network performance analysis are often classified into Open Systems, Interactive Closed Systems, and Batch Closed Systems based on how jobs arrive, circulate, and leave the system.

Open System 🟩

  • Definition: Jobs (requests, tasks, or customers) arrive from an external source, are processed, and eventually leave the system. There is no feedback: arrivals do not slow down when the system responds slowly, and they can come at any time. The number of requests in the system is unbounded (useful when evaluating real systems under maximum load).
  • Characteristics:
    • Arrival rate is typically modeled as a Poisson process.
    • Service times follow an exponential or other probabilistic distribution.
    • There is no limit on the number of jobs in the system at a given time.
  • Example: A web server handling requests from users across the internet.

If you assume $\lambda < \mu$, then the throughput of this system is the mean arrival rate $\lambda$. Making $\mu$ higher lowers the response time and the waiting time in the queue, but does not influence the throughput. If $\lambda > \mu$ the system is unstable, meaning the response time can grow without bound as time goes on: the system does not keep up with the incoming requests. Defining the utilization as $\rho = \frac{\lambda}{\mu}$, the fraction of time the server is busy, the system is stable if $\rho < 1$, i.e. if the server is busy less than one hundred percent of the time. Even at $\rho = 1$ the system is not stable: because of randomness the queue still grows indefinitely (the server can never catch up once a queue has formed). $\rho = 1$ is the asymptote of the response time.
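A minimal simulation sketch of this instability (all parameters are invented), using the Lindley recursion for the waiting time of successive jobs in a single-server queue:

```python
import random

def mean_wait(lam, mu, n=200_000, seed=1):
    """Mean waiting time over n jobs via the Lindley recursion
    w_next = max(0, w + service - interarrival)."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(n):
        s = rng.expovariate(mu)   # service time of the current job
        a = rng.expovariate(lam)  # gap until the next arrival
        w = max(0.0, w + s - a)   # waiting time of the next job
        total += w
    return total / n

print(mean_wait(lam=9.0, mu=10.0))   # rho = 0.9: settles near a finite value
print(mean_wait(lam=11.0, mu=10.0))  # rho = 1.1: keeps growing as n grows
```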

Interactive Closed System 🟩

Closed system means no new request is issued until I hear back from the previous one. In these systems the throughput is limited by the service rate: $\mu$ is thus an upper bound for the throughput.

  • Definition: A fixed number of jobs (users, processes) continuously circulate within the system. After a job is served, it re-enters the queue after some think time (a delay before rejoining).
  • Characteristics:
    • The total number of jobs in the system is constant.
    • Think time between jobs prevents overwhelming the system.
    • Response time affects user behavior (e.g., a slower system results in longer think times).
  • Example: A system with N interactive users running on a time-sharing computer. Each user submits a request, waits for a response, and then submits another request after a short delay.

A time-sharing OS is another example of this type of system.

Batch Closed System 🟩

  • Definition: A fixed number of jobs circulate continuously in the system without think time. Once a job completes, it immediately rejoins the queue for processing.
  • Characteristics:
    • The number of jobs remains constant, but they move as fast as the system allows.
    • No idle periods—jobs are processed as quickly as resources permit.
  • Example: A batch processing system, such as a data analytics pipeline that processes a fixed set of tasks repeatedly.

Some laws

This section covers some laws that are useful for understanding the performance of the system.

Interactive Response Time Law: Closed Systems 🟩

For closed systems, each of the $N$ users completes $\frac{T}{R + Z}$ jobs in an observation interval of length $T$, so:

$$ X = \frac{\text{Jobs}}{\text{Time}} = \frac{\left( N \frac{T}{R + Z} \right)}{T} = \frac{N}{R + Z} $$

$$ R = \frac{N}{X} - Z $$

This relation holds quite generally and lets you recover the response time from quantities that are easy to measure ($N$, $X$, $Z$).
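A one-liner to make the law concrete (the numbers below are invented):

```python
def response_time(n_users, throughput, think_time):
    """Interactive response time law: R = N / X - Z."""
    return n_users / throughput - think_time

# Invented numbers: 100 users, measured throughput 40 jobs/s, 2 s think time.
print(response_time(100, 40.0, 2.0))  # -> 0.5 seconds
```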

Little’s Law 🟨–

$$ L = \lambda (R + Z) $$

Where $L$ is the number of jobs in the system, $\lambda$ is the arrival rate of the jobs, $R$ is the response time, and $Z$ is the think time (this form is valid in closed systems). If we assume an open system, then we substitute $W$, the total time spent in the system, for $R + Z$: $L = \lambda W$.
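A quick worked example with invented numbers, using the open-system form:

$$ \lambda = 100 \ \text{jobs/s},\quad W = 0.05 \ \text{s} \implies L = \lambda W = 5 \ \text{jobs in the system on average.} $$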

Models of Queues

Birth-Death Process 🟥

$$ \lambda_{n}P_{n} = \mu_{n + 1}P_{n + 1} $$

One nice thing is that you can compute the probability of being in any state in terms of $P_{0}$: the chain is a line, so there is only one path between any two states!

$$ P_{n} = \left( \frac{\lambda}{\mu} \right)^{n}P_{0} = \rho^{n}P_{0} $$

Since the probabilities must sum to one,

$$ \sum_{n=0}^{\infty} P_{n} = P_{0}\sum_{n=0}^{\infty} \rho^{n} = \frac{P_{0}}{1 - \rho} = 1 \implies P_{0} = 1 - \rho $$

which gives $P_{n} = \rho^{n}(1 - \rho)$.

$$ E[n] = \sum_{n=0}^{\infty} nP_{n} = \frac{\rho}{1 - \rho} $$

This is a known result from geometric series.
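A small numeric check of these steady-state results ($\rho$ is arbitrary; the series is truncated, which is fine since $\rho^{n}$ decays geometrically):

```python
rho = 0.8  # arbitrary utilization, must be < 1

# Steady-state probabilities P_n = rho^n * (1 - rho), truncated at n = 1000.
p = [(rho ** n) * (1 - rho) for n in range(1000)]

print(sum(p))                                # ~1.0: the distribution normalizes
print(sum(n * pn for n, pn in enumerate(p))) # ~4.0, i.e. rho / (1 - rho)
```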

M/M/1 🟩

This is a queueing system with a single server. The arrival rate is $\lambda$ and the service rate is $\mu$. In the notation, M stands for Markovian (memoryless): the first M is the arrival process, the second M is the service process, and the number is the number of servers. So we have Poisson arrivals and exponentially distributed service times. We can still use the above birth-death structure, but now every birth rate is $\lambda$ and every death rate is $\mu$.

$$ E[w] = \frac{1}{\mu - \lambda} = \frac{\rho}{(1 - \rho)\lambda} $$

Here $E[w]$ is the mean response time, i.e. the total time a job spends in the system (waiting plus service).
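A quick numeric check with invented rates that also ties this back to Little's law, since $E[n] = \lambda E[w]$ should hold:

```python
lam, mu = 8.0, 10.0         # invented arrival and service rates (jobs/s)
rho = lam / mu              # utilization = 0.8

e_w = 1.0 / (mu - lam)      # mean response time: 0.5 s
e_n = rho / (1.0 - rho)     # mean number of jobs in the system: 4.0

print(e_w, e_n, lam * e_w)  # Little's law check: E[n] = lambda * E[w] = 4.0
```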

M/M/m 🟩

This is the same as above but with $m$ servers. The transitions towards more states still happen at rate $\lambda$, while the bottom (service) transitions are a little more interesting: with $n$ jobs in the system the death rate is $\min(n, m)\mu$. In general this model has a lower response time than $m$ separate M/M/1 queues, each receiving $\frac{\lambda}{m}$ of the traffic.
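To make the comparison concrete, here is a sketch using the standard Erlang C formula for the M/M/m mean response time (the rates and $m$ below are invented). The shared M/M/2 queue beats two separate M/M/1 queues at half the load each:

```python
from math import factorial

def mm_m_response_time(lam, mu, m):
    """Mean response time of an M/M/m queue, via the Erlang C formula."""
    a = lam / mu  # offered load
    rho = a / m   # per-server utilization, must be < 1 for stability
    erlang_c = (a ** m / factorial(m)) / (
        (1 - rho) * sum(a ** k / factorial(k) for k in range(m))
        + a ** m / factorial(m)
    )
    return 1 / mu + erlang_c / (m * mu - lam)

lam, mu = 16.0, 10.0
print(mm_m_response_time(lam, mu, m=2))  # one shared M/M/2: ~0.28 s
print(1 / (mu - lam / 2))                # two separate M/M/1 at lam/2: 0.5 s
```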

M/M/1/B 🟥++

This is the same as above but with a finite capacity of $B$ jobs (we don't have infinite buffer size in reality). When the system is full, newly arriving requests are rejected, so the queue never grows without bound even if $\lambda > \mu$.
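A short sketch of the closed-form blocking probability, assuming $B$ counts the total number of jobs that fit in the system (the rates below are invented). The finite queue is always stable; it just loses a fraction of arrivals:

```python
def mm1b_blocking(lam, mu, b):
    """Blocking probability of an M/M/1/B queue, where b is the maximum
    number of jobs in the system: P_B = rho^b (1 - rho) / (1 - rho^(b + 1))."""
    rho = lam / mu
    if rho == 1.0:
        return 1.0 / (b + 1)  # all b + 1 states are equally likely
    return (rho ** b) * (1 - rho) / (1 - rho ** (b + 1))

# Even an overloaded system (rho = 1.2) stays stable here: it just
# rejects a fraction of the arrivals instead of growing forever.
print(mm1b_blocking(lam=8.0, mu=10.0, b=10))   # rho = 0.8: ~2% blocked
print(mm1b_blocking(lam=12.0, mu=10.0, b=10))  # rho = 1.2: ~19% blocked
```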

Optimizing for latency and throughput

Throughput optimizations 🟨

One simple way to increase the throughput of the system is to add more servers. But this comes at a cost:

  1. Effective load balancing (with more servers, balancing becomes more complex)
  2. Cost of the servers
  3. Bottlenecks in accessing shared resources (e.g. the database).

Latency optimizations 🟨

  1. Parallelism: Split the job into smaller tasks and run them in parallel (but you need to account for the overheads of communicating and of joining the results); see the sketch after this list.
  2. Critical path length shortening: reduce the length of the critical path (the longest chain of dependent steps in the system)
    1. Reduce network calls or other communication (this might reduce throughput, though, as we are adding more load to a single machine).
    2. Reduce the number of steps
    3. Buy faster hardware
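A minimal sketch of the split/run/join pattern from point 1 (the function names and workload are invented); the comments point at where the overhead hides:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for one piece of the original job."""
    return sum(x * x for x in chunk)

def run_parallel(data, workers=4):
    # Split the job, run the pieces in parallel, then join the results.
    # The split and the final join are pure overhead: if the chunks are
    # too small, this overhead dominates and latency gets worse, not better.
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

print(run_parallel(list(range(1_000_000))))
```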