Wednesday, 4 April 2012

Capacity Planning

The Lack of Toilets
One thing that annoys me at almost every single race is the lack of toilets. What the reasons for organisers not providing enough toilets I don't know. It can be a result of wanting to save money from renting port-a-loos, the organisers might just be too unorganised but it can very well also be because the organisers have no idea of how many toilets that are required. Assuming the latter I've I spent quite a few hours of my valuable time writing this post that hopefully can enlighten organisers. My goal is that they'll become better at estimating how many toilets that are needed for a race. This can of course also be used for calculating the toilet need for other events, like concerts, garden parties, weddings and what have you got. However the input variables "service time" and "arrival rate" might have to be tuned since the group of people at a concert have a different needs for "hitting the loo" compared to runners.

Observations and Calculations
Finding the exact number of toilets needed for an event is not exactly rocket science, it is some toilet science mixed with some queuing theory. It's logic and it's about being organised so you can gather sufficient amount of data through observations. Then you can use key numbers, like the average time a toilet visit takes and the number of people in the need of visiting a toilet in what period of time, in established formulas.

Proper observation is the only thing that can give you a good workload model. If you have a group with equally as many male and female there might be a higher probability for a female customer than a male. Why? Because many male runners find other alternatives, like peeing in a park. And for the same reason, when male customers needs "service time" they might be more likely to take a dump. Hence the average service time for a male customer might be higher than for a female. For longer races, like a marathon, there might be more likely that people do number two, compared to a 5 km race. There might as well be more likely that people visits the toilet more than once in longer races compared to shorter once. Only thought thorough observations and interviewing "toilet customers" the correct results can be found. I'll leave the observation and registration part of the job to someone else. I've just looked at the mathematics that do apply.

Motivating Examples
EXAMPLE: "Number of toilets"
Let's say you're organising a marathon. In your race there are 5000 runners. You assume that 4000 of these will need to visit a toilet in the last hour before the race start. You also assume that an average toilet visit takes 2 minutes. How many toilets are needed to handle this load?

The arrival rate is λ = 4000 runners / 60 minutes = 66.67 runners / minute.
If each toilet visit takes 2 minutes it's theoretically possible for one toilet to process 30 runners, that would be if the toilet is utilised at it's maximum. Let's say 28 people a toilet is a more sane number. Hence we get
4000 / m = 28 ⇒ m = 4000 / 28 = 142.86.
The organiser do need 143 toilets for this event.

EXAMPLE: "Avoid long waiting times"
The next year exactly as many runners are participating in the race, 5000 runners. And as the year before 4000 of these are likely to visit the loo in the last hour before the race starts. However the organisers did get a lot of complaints from runners that had to wait too long in line last year. As any great organiser the feedback is taken into consideration. The organiser has promised that no runner will stand in line for longer than 8 minutes this year. How many toilets is then needed?

Arrival rate is: λ = 66.67 runners / minute
Service time (toilet time) is: Rs = 2 minutes
The service rate (toilet visitors per minute) is: μ = 1 / 2 = 0.5 customers per minute
Wait time (max) is: Rw = 8 minutes
Response time: R = Rw + Rs = 8 + 2 = 10 minutes

R = 1 / (μ − λ) ⇒ λ = (Rμ - 1) / R = (10*0.5 - 1) / 10 = 4 / 10 = 0.4
So to achieve an average waiting time of 10 minuted the arrival rate can be no more than 0.4 customers a minute. How many toilet queues do we need to distribute the 4000 runners on then?
66.67 / m = 0.4 ⇒ m = 66.67 / 0.4 = 166.68
Assuming all 4000 are distributed equally among all toilets the organisers need 167 toilets for the event.

EXAMPLE: "The waiting time"
You've been observing a toilet and have registered that it's occupied 50% of of the time. You assume that one toilet visit in average takes 2 minutes. How long will you have to wait in line before you'll enter the toilet?

R1 = S1 / (1 - U1) = 2 / (1-0.5) = 4 minutes.
Hence you'll on average have to queue for 4 minutes - 2 minutes = 2 minutes before you'll enter the toilet.

Want to know more? Keep on reading. Below follows the theory behind the above examples. It should also give organisers ideas on how to set up their toilet system.

Capacity Planning
When I did my Masters in Computer Science and Engineering (CSE) at UNSW I did one subject with the short name "Capacity Planning of Computer Systems and Networks", or "COMP9334" to actually use the short name (check out the course home page here and a complete overview over lecture notes here). Little did I know back in 2003 that what Prof. Sanjay Jha did teach me would come in very handy for estimating the number of toilets needed for optimising poo-queuing times. It is the theory from that course I base this article on. A group of runners in the need of taking a dump can very well be compared to processes and toilets can be compared to CPUs. Capacity is defined as "the maximum amount of work a system can handle in a given period of time", in this case the system is a set of one or more toilets and the work is the runner doing his or hers private thing.

A queuing system
Runners, from now on defined as customers, arrive at random times. They line up in a queue where they wait for taking a dump or peeing, from now on defined as service, if all toilets are occupied. When a toilet gets vacant they enter, spends a random service time in the toilet before they depart upon completed service. An event, like a race, is defined as a system. A system can have one more more toilets.

Queuing disciplines
There are several rules that can be followed for queuing. You have "priority queuing" where some prioritised customers are served first. For example, if Haile Gebrselassie lined up he might be let in front of other runners. Then you have "Round-robin" which most likely is a bad idea for a toilet queues. Very simplified it could mean that every customer got 30 seconds toilet time. In that limited period of time the customer would have to do whatever he or she managed to do. After 30 seconds the customer would have to leave the toilet and queue again and wait for another 30 seconds slot becoming available. This process would compete until the customer was done completely. You also have "last come first served" (LCFS / LIFO) which would be quite a unpopular method as it would mean that the last customer queuing up would be the first to enter the toilet. The last queuing discipline worth mentioning is the one that creates the least amount of fuzz and therefore also is the most commonly used for toilet queues. It is the "first come first served" (FCFS) which means that the first customer in line is the next to be given access to the toilet.

Probability theory
The methods deduced will be based on probability theory. The reason for this is that toilet customer do not arrive evenly every x seconds, as they would do in a deterministic world, they arrive randomly.

In a deterministic world it would look like this

In the real world the customers arrive randomly

Performance Metrics
Performance metrics, or how performance is measured, is also known as Quality of Service (QoS) metrics, and common metrics are response time, throughput, availability, reliability and scalability.

Response time
The time measured in seconds it takes from lining up in the toilet queue till the toilet is left.

Response time (R) = Time in waiting (Rw) + Time in service (Rs)
Note that T is also often used as notation for response time, so it can be written as
Response time (T) = Time in waiting (Tw) + Time in service (Ts)

The response time for peeing is 10 seconds waiting in line + 50 seconds time in service = 60 seconds.

The rate at which toilet visits are completed. Throughput is a function of the load:
throughput = min(offered load, capacity).

Assume a toilet visit takes 60 seconds on average, and that the maximum number of visits one hour is 50 (last 10 minutes are used for change of toilet paper, cleaning, delay for a person to move from the line and into the toilet etc) the throughput is 50 visits an hour. If 60 people line up for a toilet visit the throughput is still 50, hence 10 people will have to hold on for more than one hour.

Trashing equals congestion collapse
src : lecture notes
When the load is high the toilet can get problems. E.g. the toilet might go full of shit or run out of toilet paper. Such a situation is called "trashing". Trashing means runners leaving the queue, or the toilet, finding alternative places for taking dumps. A situation with a lot of trashing is not positive for an organiser of an event, especially not if the event takes place in a city. Consequences of thrashing for races in the forest are not equally bad.

The fraction of time a toilet is open and available to its customers.

The Service Level Agreement (SLA) in between the organisers of the event and the contractors of the porta-a-loos quarantees the toilet to be available 23 out of 24 hours. Hence the toilet availability is 1 - 1/24 =  95.83%.

The probability that a system functions properly and continuously over a fixed period of time. Related terms MTTF (mean time to failure) - replace faulty part, MTBF (mean time between failures) - repair fault.

How fast does performance degrade with increasing load?
System B is more scalable (src : lecture notes)

To evaluate performance of toilets in detail, using either simulation or analysis, you need to specify workload.    With workload defined as how many customers there will be in a given period of time.

Real workload
Real workload is found by doing observation of a toilet. Then you can count how many customers that come and queue up.

One toilet got 10 requests for service during 30 minutes. In other words 10 customers came and lined up in the toilet queue in a 30 minutes period.

Synthetic workload
Synthetic workload is also called a workload model. It is a model of the real workload, or a model that represent behaviour of a real toilet queue.

Workload characterisation
To develop a workload model you need to observe key characteristics in a real environment. The process of doing this is called workload characterisation. There are different techniques for workload characterization. Like Averaging, Specifying dispersion, Single-parameter histograms, Multi-parameter histograms, Principal-component analysis, Markov models and Clustering. The first two techniques are mentioned in more detail below.
Averaging is an easy technique to use for toilet queues. You can simply just use observed data and find an average number of customers and average time spent "in service".

E.g.: We observe 5 customers. Customer one (k1) spends 60 second in the toilet (service time), customer two (k2) spends 180 seconds, k3 120 seconds, k4 70 seconds and service time for k5 is 360 seconds. The average is then 1/5 * (60+180+120+70+360) sec = 158 sec.

Specifying dispersion might be a better technique to use to get a more accurate workload model. Specifying dispersion means to group data into different classes. It can be used to divide female customers from male customers and customers in the need for doing number one from those who needs to do number two. Since a race pretty much knows the exact number of male and female participants have grouped data into sex can be handy if there are observed high values for coefficient of variation (standard deviation divided by mean).

E.g: We observed 6 customers. 3 male and 3 female. Male customer number one (mx1) spends 180 seconds in the toilet, mx2 150 seconds, mx3 210 seconds. Female customers number one (fx1) spends 140 seconds, fx2 110 seconds, fx3, 95 seconds. The average for the males is then 180 seconds while it is 115 seconds for the females.

Operational analysis
Collected performance data during day-to-day operation that can be used for analysis, below is an example on how to do this.

Notations for measured quantities
Γ Observation period
Ai Total number of arrivals to toilet i in Γ
Bi Total busy time of  toilet i in Γ
Ci Total number of completions from toilet i in Γ
A0 Total number of arrivals to the system in Γ
C0 Total number of completions from the system in Γ
Vi Average number of times a runner visits a toilet in  Γ ( Vi = Ci / C0 ).
K Total number of toilets in the system
k Total number of customers in the system

Notations for derived quantities
λi Arrival rate at toilet i
Si Mean service time per customer at toilet i
μ Service rate at toilet i
Ui (ρi) Utilisation of toilet i
Xi Throughput of toilet i
X0 Throughput of the system

EXAMPLE: "Finding the Quantities"
We have an event (a system) with one single toilet.
Observation period is 1 hour.
The toilet is busy for 36 minutes.
18 customers arrived and visited the toilet.
We want to find 
- Arrival rate
- Mean service time per customer
- Utilisation of the toilet
- System throughput

The measure quantities hence are
Γ = 60 min
A1 = A0 = 18 arrivals
B1 = 36 min
C1 = C0 = 18 completions
V1 = C1/C0 = 18/18 = 1 visit
K = 1 toilet

The derived quantities are:
λ1 = A1/Γ = 18/60 = 0.3 arrivals per minute
S1 = B1/C1 = 36/18 = 2 minutes per customer
μ1 = 1/S1 = 1/2 = 0.5 services per minute
U1 = B1/Γ = 36/60 = 60%
X0 = C0/Γ = 18/60 = 0.3 customers per minute

Several laws are valid and can be applied.

Utilisation law
Utilisation is the fraction of time a toilet is busy on average. U = B/Γ where U <= 1. The utilisation laws is defined as U = C/Γ * B/C ⇒ U = X * S. With flow equilibrium (arrivals of customers = completions) X = λ hence U = λ * S. And for a m-toilet system U = (X * S)/m.

Forced Flow Law
Throughput of toilet i is equal to the average number of visits made by a customer to toilet i multiplied by the system throughput. So Xi = Vi * X0.

Service Demand Law
Service demand is the total average service time required by a customer at toilet i, denoted Di. On average, a customer visits toilet i Vi times. On average, service time per visit of a customer at toilet i is Si.  So, Di = Vi * S. Service Demand Law is Di = Vi * Si = Xi / X0 * Si ⇒ Di = Ui / X0.

Little’s Law
For a resource with throughput X and response time R, average number of customers in the system (those waiting in line plus those being in service) is N = X * R ⇒ N = λ * R.

Little's Law can intuitively be explained by considering "Rune The Runner" who just left the system (in other words left the toilet) after completed service. On his way out Rune sees that average N customers are in line waiting for toilet service time. These has arrived while Rune was in service. The average number is λR.

Example: "Customers in the System"
A race has got 1000 runners in total. 500 of these needs to visit the toilet in the last 30 minutes before the race starts. Average response time is 2.5 minutes. How many runners will be "in the system" (in toilets or in line) on average during this time? λ = 500/30 = 16.67 customers pr minute. Hence the average number of customers in the system is N = 16.67 * 2.5 = 41.67 ⇒ 42.

Open Queuing Network
An event, like a race, can be compared to a open queuing network. The characteristics of such a system is external arrivals and departures. Customers come and customers go. The workload intensity is specified by arrival rate. When the system is in equilibrium the throughput = arrival rate. An unbounded number of customers are in the system.

Inputs variables for an open queuing network are
X = system throughput (X = λ with flow equilibrium assumption)
Si = average service time per visit of a customer to the i-th toilet
Vi = average number of visits of a customer to the i-th toilet
K = number of toilets in the system

Outputs variables for an open queuing network are
Ni = mean number of jobs at the i-th toilet
N = mean number of jobs in the system
Ri = response time of the i-th toilet
R = response time of the system

Applying the Forced Flow Law, the throughput of the i-th toilet is
Xi = ViX

Applying the Utilisation Law, the utilisation of the i-th toilet is
Ui = XiSi = ViSiX

Applying the Little's Law, the number of customers at the i-th device is
Ni = RiXi = RiViX

How to find Ri
Ri is the total time on average each visit of a customer spends at the i-th toilet. This is the response time of the toilet visit itself, plus the response times of the customers already there when he or she arrives. Given the assumption of exponentially distributed inter-arrival times and service times it means that on arrival at the i-th toilet, the customer sees an average number of Ni customers already there (including the one in service). All customers will have the same average service time Si (including the one in service) Therefore, we have Ri = (Ni + 1)Si. We can further substitute Ni with RiViX, hence we get Ri = (RiViX +1)Si ⇒ Ri = Si / (1 - Ui).

EXAMPLE: "What is the waiting time?"
If the average service time is 2 minutes and the utilisation of the toilet is 60% for how long do you have to wait in line before it's your turn?
R1 = S1 / (1 - U1) = 2 / (1 - 0.6) = 5 minutes.
Rwaiting = R1 - Rservice = 5 minutes - 2 minutes = 3 minutes.
The average waiting time in the line is 3 minutes.

Mean Value Analysis (MVA)
Response time of the open queuing network system 
Mean number of customers in the system 

Configurations of Toilet Systems

Kendall Notation
Queuing systems can be represented using the Kendall Notation A/B/C/D/E.

The A defines the arrival process. If M then exponential distribution (Poisson arrival process) if G then general distribution. For a toilet system we have exponential distribution since it can be compared with "a process in which events occur continuously and independently at a constant average rate" (Wikipedia).

The B defines service time distribution (M or G). For service time of a toilet system we have exponential distribution as well.

The C defines the number of toilets in the system.

The D defines the maximum number of customers in the system (size of queue / waiting area + number of toilets). This value can be omitted if ∞.

The E defines the queuing discipline. If FCFS it can be omitted as this is default.

Below are three examples of different configurations of a toilet system.

Configuration 1: M/M/1
A single toilet queue with exponentially distributed inter-arrival times, exponentially distributed
service times, and infinite queue size. Read more here.

Configuration 2: m M/M/1
m M/M/1
This configuration consists of m number of M/M/1 queues.

Configuration 3: M/M/m:
An m-toilet queue with exponentially distributed inter-arrival times, exponentially distributed
service times, and infinite queue size. Read more here, note that it is called the M/M/c model as well.

Estimating response times for M/M/1
Given configuration M/M/1 it is possible to estimate response times if thinking of the toilet system as a Markov-chain. Pk = probability that there are k customers in the system. The system is in state k ⇒There are k customers in the system. If k = 0 there are no customers in the system. In other words no customers in the toilet nor in the queue. If k = 1 there is one customer in the system (in the toilet or in the queue on his or hers way to the toilet). If k > 1 there is one customer in the toilet and the other queued up.
Markov-chain state diagram for M/M/1
Mean response times, along with other interesting values, can be derived following this method.

λ = arrival rate
μ = service rate
ρ = utilisation
T = response time

We have that utilisation is:
ρ = λ / μ, ρ < 1 ⇒ λ < μ

Mean queue size for a M/M/1 system is
N = Σ (from k=0 till k=∞) k * Pk ⇒N = ρ / (1 - ρ)

Response time is then, using Little's Law as well
N = λR ⇒ R = N / λ = ρ / ( λ(1-ρ) ) ⇒R = 1 / ( μ (1-ρ) ) or R = 1 / (μ − λ)

The average time spent waiting is
Rw = 1/(μ − λ) − 1/μ ⇒ Rw = ρ/(μ − λ)

EXAMPLE: "Calculating the utilisation"
The arrival rate to a system is 0.3 customers per minute. The service time is 2 minutes hence the service rate is 1/2 = 0.5. What is the utilisation?
The utilisation is ρ = λ / μ = 0.3 / 0.5 = 0.6 = 60%

EXAMPLE: "What is the response time"
A toilet serves 0.5 people a minute and it is utilised by 60%. What is the response time?
R = 1 / ( μ (1-ρ) ) = 1 / ( 0.5 (1-0.6) ) = 1 / 0.2 = 5 minutes.

EXAMPLE: "What is the waiting time"
Given a utilisation of 60%, a service rate of 0.5 customers per minute and an arrival rate of 0.3 customers per minute what is the average waiting time?
Rw = ρ/(μ − λ) = 0.6 / (0.5 - 0.3) = 0.6 / 0.2 = 3 minutes.

How response times do vary with ρ
Given 1 / μ = 1 / 0.5 = 2 minutes we get the graph below. Note that as utilisation is moving towards 1 the response time is moving towards ∞.
When ρ → 1 ⇒ R → ∞
Up to 96% utilisation

Estimating response times for M/M/m
For a M/M/m system all customers line up in one single toilet queue. There they wait in line until one of the m toilets becomes available.
Markov-chain state diagram for M/M/m
The formula for response time for M/M/m systems get's slightly bigger. Remember that both the notation R and T can be used for response time.
Response time for M/M/m systems

EXAMPLE: "What is the response time"
Given a M/M/m system with 2 toilets a utilisation of 60% and an arrival rate of 0.3 customers per minute. What is the response time?
We have : m = 2, μ = 0.5, ρ = 0.6, λ = 0.3

mμ(1 - ρ) = 2*0.5 (1 - 0.6) = 0.4
1/μ = 1 / 0.5 = 2
(mρ)^m / m! = (2*0.6)^2 / 1*2 = 1.44 / 2 = 0.72
(1 - ρ) = 1 - 0.6 = 0.4

Note that k defines the number of customers in the system, or the state the system is in. We are calculating the sum for k=0 till m-1 hence we do not need to know the number of customers in the system.
Σ (mρ)^k/k! ⇒ for k=0  then 0, for k=1 then (2*0.5)^1 / 1 = 1⇒0 + 1 = 1

Using the above calculations we get that the response time is
R = 1 / 0.4 + 2 = 2.5 + 2 = 4.5 minutes.

Other references
The following lecture notes from CS9334 are very relevant:

And this is handy as well:

Mathematical characters in HTML:

No comments:

Post a Comment

Allowed HTML tags:
<a href="">hyperlink</a>

Please, show the courtesy of identifying yourself when adding a comment. Anonymous comments will, most likely, be removed.