In probability theory, the birthday problem asks for the probability that, in a set of n randomly chosen people, at least two will share the same birthday. The birthday paradox is the counterintuitive fact that only 23 people are needed for that probability to exceed 50%.
The birthday paradox is a veridical paradox: it seems wrong at first glance but is, in fact, true. While it may seem surprising that only 23 individuals are required to reach a 50% probability of a shared birthday, this result is made more intuitive by considering that the birthday comparisons will be made between every possible pair of individuals. With 23 individuals, there are 23 × 22/2 = 253 pairs to consider.
Real-world applications for the birthday problem include a cryptographic attack called the birthday attack, which uses this probabilistic model to reduce the complexity of finding a collision for a hash function, as well as calculating the approximate risk of a hash collision existing within the hashes of a given size of population.
The problem is generally attributed to Harold Davenport in about 1927, though he did not publish it at the time. Davenport did not claim to be its discoverer "because he could not believe that it had not been stated earlier".[1][2] The first publication of a version of the birthday problem was by Richard von Mises in 1939.[3]
From a permutations perspective, let A be the event that a group of 23 people has no repeated birthdays, and let B be the event that at least two of the 23 people share a birthday, so that P(B) = 1 − P(A). P(A) is the ratio of two counts in which order matters. The numerator, V_nr, is the number of ways to assign birthdays without repetition (e.g. for a group of 2 people, in mm/dd birthday format, one possible outcome is (01/02, 05/20)). The denominator, V_t, is the number of ways to assign birthdays with repetition allowed, which is the total space of outcomes of the experiment (e.g. for 2 people, one possible outcome is (01/02, 01/02)). Both V_nr and V_t are therefore counts of permutations:
$$V_{nr} = \frac{365!}{(365-23)!}, \qquad V_t = 365^{23}, \qquad P(A) = \frac{V_{nr}}{V_t}.$$
Another way the birthday problem can be solved is by asking for an approximate probability that in a group of n people at least two have the same birthday. For simplicity, leap years, twins, selection bias, and seasonal and weekly variations in birth rates[4] are generally disregarded, and instead it is assumed that there are 365 possible birthdays, and that each person's birthday is equally likely to be any of these days, independent of the other people in the group.
For independent birthdays, a uniform distribution of birthdays minimizes the probability of two people in a group having the same birthday. Any unevenness increases the likelihood of two people sharing a birthday.[5][6] However, real-world birthdays are not sufficiently uneven to make much difference: the real-world group size necessary to have a greater than 50% chance of a shared birthday is 23, as in the theoretical uniform distribution.[7]
The goal is to compute P(B), the probability that at least two people in the room have the same birthday. However, it is simpler to calculate P(A′), the probability that no two people in the room have the same birthday. Then, because B and A′ are the only two possibilities and are also mutually exclusive, P(B) = 1 − P(A′).
Here is the calculation of P(B) for 23 people. Let the 23 people be numbered 1 to 23. The event that all 23 people have different birthdays is the same as the event that person 2 does not have the same birthday as person 1, and that person 3 does not have the same birthday as either person 1 or person 2, and so on, and finally that person 23 does not have the same birthday as any of persons 1 through 22. Let these events be called Event 2, Event 3, and so on. Event 1 is the event of person 1 having a birthday, which occurs with probability 1. This conjunction of events may be computed using conditional probability: the probability of Event 2 is 364/365, as person 2 may have any birthday other than the birthday of person 1. Similarly, the probability of Event 3 given that Event 2 occurred is 363/365, as person 3 may have any of the birthdays not already taken by persons 1 and 2. This continues until finally the probability of Event 23 given that all preceding events occurred is 343/365. Finally, the principle of conditional probability implies that P(A′) is equal to the product of these individual probabilities:
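$$P(A') = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \frac{362}{365} \times \cdots \times \frac{343}{365} \tag{1}$$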
The terms of equation (1) can be collected to arrive at:
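$$P(A') = \left(\frac{1}{365}\right)^{23} \times (365 \times 364 \times 363 \times \cdots \times 343) \tag{2}$$
Evaluating equation (2) gives P(A′) ≈ 0.492703.
Therefore, P(B) ≈ 1 − 0.492703 = 0.507297 (50.7297%).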
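The product of conditional probabilities can be checked numerically. The following is a minimal Python sketch using exact rational arithmetic, assuming 365 equally likely birthdays; the function name `prob_all_distinct` is illustrative and does not come from the derivation itself.

```python
from fractions import Fraction

def prob_all_distinct(people=23, days=365):
    """Multiply the factors 365/365, 364/365, ..., one per person."""
    prob = Fraction(1)
    for already_taken in range(people):
        # The next person must avoid the birthdays already taken.
        prob *= Fraction(days - already_taken, days)
    return prob

p_distinct = prob_all_distinct()
p_shared = 1 - p_distinct
print(f"P(A') = {float(p_distinct):.6f}")
print(f"P(B)  = {float(p_shared):.6f}")
```

Exact fractions avoid rounding error while the 23 factors are multiplied; the result is converted to a float only for display.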
This process can be generalized to a group of n people, where p(n) is the probability of at least two of the n people sharing a birthday. It is easier to first calculate the probability p̄(n) that all n birthdays are different. According to the pigeonhole principle, p̄(n) is zero when n > 365. When n ≤ 365:
$$\bar p(n) = 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n-1}{365}\right) = \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n} = \frac{365!}{365^n (365 - n)!} = \frac{n! \binom{365}{n}}{365^n} = \frac{{}_{365}P_n}{365^n}$$
where ! is the factorial operator, $\binom{365}{n}$ is the binomial coefficient and ${}_kP_r$ denotes permutation.
The equation expresses the fact that the first person has no one to share a birthday with, the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as either of the first two (363/365), and in general the nth birthday cannot be the same as any of the n − 1 preceding birthdays.
The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability p(n) is
$$p(n) = 1 - \bar p(n).$$
The following table shows the probability for some other values of n (for this table, the existence of leap years is ignored, and each birthday is assumed to be equally likely):
n | p(n) |
---|---|
1 | 0.0% |
5 | 2.7% |
10 | 11.7% |
20 | 41.1% |
23 | 50.7% |
30 | 70.6% |
40 | 89.1% |
50 | 97.0% |
60 | 99.4% |
70 | 99.9% |
75 | 99.97% |
100 | 99.99997% |
200 | 99.9999999999999999999999999998% |
300 | (100 − 6×10⁻⁸⁰)% |
350 | (100 − 3×10⁻¹²⁹)% |
365 | (100 − 1.45×10⁻¹⁵⁵)% |
≥ 366 | 100% |
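The smaller rows of the table can be recomputed with a short loop. This is a sketch under the same assumptions as the table (365 equally likely birthdays, no leap years); `p_shared` is an illustrative name.

```python
def p_shared(n, days=365):
    """Probability that at least two of n people share a birthday."""
    if n > days:
        return 1.0  # pigeonhole principle
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct

for n in (1, 5, 10, 20, 23, 30, 40, 50, 60, 70):
    print(f"{n:3d} | {100 * p_shared(n):.1f}%")
```

Double-precision floats cannot resolve the rows for large n (for example n = 200), where 1 − p(n) is far below machine precision; exact rational or arbitrary-precision arithmetic would be needed there.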
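The Taylor series expansion of the exponential function (the constant e ≈ 2.718281828)
$$e^x = 1 + x + \frac{x^2}{2!} + \cdots$$
provides a first-order approximation for e^x for |x| ≪ 1:
$$e^x \approx 1 + x.$$
To apply this approximation to the first expression derived for p̄(n), the probability that all n birthdays are different, set x = −a/365. Thus,
$$e^{-a/365} \approx 1 - \frac{a}{365}.$$
Then, replace a with non-negative integers for each term in the formula of p̄(n) until a = n − 1; for example, when a = 1,
$$e^{-1/365} \approx 1 - \frac{1}{365}.$$
The first expression derived for p̄(n) can be approximated as
$$\bar p(n) \approx 1 \cdot e^{-1/365} \cdot e^{-2/365} \cdots e^{-(n-1)/365} = e^{-(1 + 2 + \cdots + (n-1))/365} = e^{-n(n-1)/730}.$$
Therefore,
$$p(n) = 1 - \bar p(n) \approx 1 - e^{-n(n-1)/730}.$$
An even coarser approximation is given by
$$p(n) \approx 1 - e^{-n^2/730},$$
which, as the graph illustrates, is still fairly accurate.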
According to the approximation, the same approach can be applied to any number of "people" and "days". If rather than 365 days there are d, if there are n persons, and if n ≪ d, then using the same approach as above we achieve the result that if p(n, d) is the probability that at least two out of n people share the same birthday from a set of d available days, then:
$$p(n, d) \approx 1 - e^{-n(n-1)/(2d)}.$$
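To see how close the exponential approximations are, the sketch below compares the exact product with 1 − e^(−n(n−1)/(2d)) and the coarser 1 − e^(−n²/(2d)). The function names are illustrative.

```python
import math

def p_exact(n, d=365):
    """Exact probability of a shared birthday among n people."""
    p_distinct = 1.0
    for k in range(min(n, d + 1)):
        p_distinct *= (d - k) / d
    return 1.0 - p_distinct

def p_taylor(n, d=365):
    """Approximation from e^x ≈ 1 + x applied to each factor."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))

def p_coarse(n, d=365):
    """Coarser variant that replaces n(n - 1) with n^2."""
    return 1.0 - math.exp(-n * n / (2 * d))

for n in (10, 23, 40, 60):
    print(n, round(p_exact(n), 4), round(p_taylor(n), 4), round(p_coarse(n), 4))
```

Because 1 − x < e^(−x) for x > 0, the first approximation slightly underestimates the exact probability.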
The probability of any two people not having the same birthday is 364/365. In a room containing n people, there are $\binom{n}{2}$ = n(n − 1)/2 pairs of people, i.e. $\binom{n}{2}$ events. The probability of no two people sharing the same birthday can be approximated by assuming that these events are independent and hence by multiplying their probabilities together. Being independent would be equivalent to picking, with replacement, any pair of people in the world, not just in a room. In short, 364/365 can be multiplied by itself $\binom{n}{2}$ times, which gives us
$$\bar p(n) \approx \left(\frac{364}{365}\right)^{\binom{n}{2}}.$$
Since this is the probability of no one having the same birthday, the probability of someone sharing a birthday is
$$p(n) \approx 1 - \left(\frac{364}{365}\right)^{\binom{n}{2}}.$$
And for the group of 23 people, the probability of sharing is
$$p(23) \approx 1 - \left(\frac{364}{365}\right)^{253} \approx 0.500477.$$
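Applying the Poisson approximation for the binomial on the group of 23 people,
$$\operatorname{Poi}\left(\frac{\binom{23}{2}}{365}\right) = \operatorname{Poi}\left(\frac{253}{365}\right) \approx \operatorname{Poi}(0.6932)$$
so
$$\Pr(X > 0) = 1 - \Pr(X = 0) \approx 1 - e^{-0.6932} \approx 1 - 0.499998 = 0.500002.$$
The result is over 50%, in agreement with the previous calculations. This approximation is the same as the one above based on the Taylor expansion that uses e^x ≈ 1 + x.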
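Both the pairwise-independence estimate and the Poisson estimate reduce to one-line computations. A minimal sketch, with illustrative function names and assuming Python 3.8+ for `math.comb`:

```python
import math

def p_pairwise(n, d=365):
    """Treat the C(n, 2) pair comparisons as independent events."""
    return 1.0 - ((d - 1) / d) ** math.comb(n, 2)

def p_poisson(n, d=365):
    """Poisson approximation with rate C(n, 2) / d."""
    rate = math.comb(n, 2) / d
    return 1.0 - math.exp(-rate)

print(f"pairwise: {p_pairwise(23):.6f}")
print(f"poisson:  {p_poisson(23):.6f}")
```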
A good rule of thumb which can be used for mental calculation is the relation
$$p(n) \approx \frac{n^2}{2d}$$
which can also be written as
$$n \approx \sqrt{2d \times p(n)}$$
which works well for probabilities less than or equal to 1/2. In these equations, d is the number of days in a year.
For instance, to estimate the number of people required for a 1/2 chance of a shared birthday, we get
$$n \approx \sqrt{2 \times 365 \times \tfrac{1}{2}} = \sqrt{365} \approx 19$$
which is not too far from the correct answer of 23.
This can also be approximated using the following formula for the number of people necessary to have at least a 1/2 chance of matching:
$$n \geq \frac{1}{2} + \sqrt{\frac{1}{4} + 2 \times \ln(2) \times 365} = 22.999943.$$
This is a result of the good approximation that an event with 1/k probability will have a 1/2 chance of occurring at least once if it is repeated k ln 2 times.[8]
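The two estimates for the 1/2 threshold are easy to evaluate. In this sketch the function names are illustrative; the second function evaluates 1/2 + √(1/4 + 2d ln 2), which for d = 365 gives the value 22.999943 quoted above.

```python
import math

def n_rule_of_thumb(p, d=365):
    """Invert p ≈ n^2 / (2d) to estimate the group size."""
    return math.sqrt(2 * d * p)

def n_half_chance(d=365):
    """Evaluate 1/2 + sqrt(1/4 + 2 d ln 2)."""
    return 0.5 + math.sqrt(0.25 + 2 * d * math.log(2))

print(round(n_rule_of_thumb(0.5), 3))
print(round(n_half_chance(), 6))
```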
Number of hashed elements such that the probability of at least one hash collision is ≥ p:
length of hex string | no. of bits (b) | hash space size (2^b) | p = 10⁻¹⁸ | p = 10⁻¹⁵ | p = 10⁻¹² | p = 10⁻⁹ | p = 10⁻⁶ | p = 0.001 | p = 0.01 | p = 0.25 | p = 0.50 | p = 0.75 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 32 | 4.3×10⁹ | 2 | 2 | 2 | 2.9 | 93 | 2.9×10³ | 9.3×10³ | 5.0×10⁴ | 7.7×10⁴ | 1.1×10⁵ |
(10) | (40) | (1.1×10¹²) | 2 | 2 | 2 | 47 | 1.5×10³ | 4.7×10⁴ | 1.5×10⁵ | 8.0×10⁵ | 1.2×10⁶ | 1.7×10⁶ |
(12) | (48) | (2.8×10¹⁴) | 2 | 2 | 24 | 7.5×10² | 2.4×10⁴ | 7.5×10⁵ | 2.4×10⁶ | 1.3×10⁷ | 2.0×10⁷ | 2.8×10⁷ |
16 | 64 | 1.8×10¹⁹ | 6.1 | 1.9×10² | 6.1×10³ | 1.9×10⁵ | 6.1×10⁶ | 1.9×10⁸ | 6.1×10⁸ | 3.3×10⁹ | 5.1×10⁹ | 7.2×10⁹ |
(24) | (96) | (7.9×10²⁸) | 4.0×10⁵ | 1.3×10⁷ | 4.0×10⁸ | 1.3×10¹⁰ | 4.0×10¹¹ | 1.3×10¹³ | 4.0×10¹³ | 2.1×10¹⁴ | 3.3×10¹⁴ | 4.7×10¹⁴ |
32 | 128 | 3.4×10³⁸ | 2.6×10¹⁰ | 8.2×10¹¹ | 2.6×10¹³ | 8.2×10¹⁴ | 2.6×10¹⁶ | 8.3×10¹⁷ | 2.6×10¹⁸ | 1.4×10¹⁹ | 2.2×10¹⁹ | 3.1×10¹⁹ |
(48) | (192) | (6.3×10⁵⁷) | 1.1×10²⁰ | 3.5×10²¹ | 1.1×10²³ | 3.5×10²⁴ | 1.1×10²⁶ | 3.5×10²⁷ | 1.1×10²⁸ | 6.0×10²⁸ | 9.3×10²⁸ | 1.3×10²⁹ |
64 | 256 | 1.2×10⁷⁷ | 4.8×10²⁹ | 1.5×10³¹ | 4.8×10³² | 1.5×10³⁴ | 4.8×10³⁵ | 1.5×10³⁷ | 4.8×10³⁷ | 2.6×10³⁸ | 4.0×10³⁸ | 5.7×10³⁸ |
(96) | (384) | (3.9×10¹¹⁵) | 8.9×10⁴⁸ | 2.8×10⁵⁰ | 8.9×10⁵¹ | 2.8×10⁵³ | 8.9×10⁵⁴ | 2.8×10⁵⁶ | 8.9×10⁵⁶ | 4.8×10⁵⁷ | 7.4×10⁵⁷ | 1.0×10⁵⁸ |
128 | 512 | 1.3×10¹⁵⁴ | 1.6×10⁶⁸ | 5.2×10⁶⁹ | 1.6×10⁷¹ | 5.2×10⁷² | 1.6×10⁷⁴ | 5.2×10⁷⁵ | 1.6×10⁷⁶ | 8.8×10⁷⁶ | 1.4×10⁷⁷ | 1.9×10⁷⁷ |
The lighter fields in this table show the number of hashes needed to achieve the given probability of collision (column) given a hash space of a certain size in bits (row). Using the birthday analogy: the "hash space size" resembles the "available days", the "probability of collision" resembles the "probability of shared birthday", and the "required number of hashed elements" resembles the "required number of people in a group". One could also use this chart to determine the minimum hash size required (given upper bounds on the hashes and probability of error), or the probability of collision (for fixed number of hashes and probability of error).
For comparison, 10⁻¹⁸ to 10⁻¹⁵ is the uncorrectable bit error rate of a typical hard disk.[9] In theory, 128-bit hash functions, such as MD5, should stay within that range until about 8.2×10¹¹ documents, even though their possible outputs are many more.
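The larger entries in the table appear consistent with the approximation n ≈ √(2 · 2^b · ln(1/(1 − p))), the exponential approximation of the birthday problem solved for n with d = 2^b. The sketch below evaluates it; the function name is illustrative, and the small entries of the table (which are floored at 2) are not reproduced by this formula.

```python
import math

def hashes_for_collision_probability(bits, p):
    """Approximate number of hashes giving collision probability p
    in a hash space of size 2**bits (assumes uniform, independent hashes)."""
    space = 2.0 ** bits
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for tiny p.
    return math.sqrt(2.0 * space * -math.log1p(-p))

for bits in (32, 64, 128, 256):
    print(bits, f"{hashes_for_collision_probability(bits, 0.5):.1e}")

print(f"{hashes_for_collision_probability(128, 1e-15):.1e}")
```

Using `math.log1p` rather than `math.log(1 - p)` avoids losing the probability entirely when p is as small as 10⁻¹⁸.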
The argument below is adapted from an argument of Paul Halmos.[nb 1]
As stated above, the probability p̄(n) that no two birthdays coincide is
$$\bar p(n) = 1 - p(n) = \prod_{k=1}^{n-1} \left(1 - \frac{k}{365}\right).$$
As in earlier paragraphs, interest lies in the smallest n such that p(n) > 1/2; or equivalently, the smallest n such that p̄(n) < 1/2.
Using the inequality 1 − x < e^(−x) in the above expression we replace 1 − k/365 with e^(−k/365). This yields
$$\bar p(n) = \prod_{k=1}^{n-1} \left(1 - \frac{k}{365}\right) < \prod_{k=1}^{n-1} e^{-k/365} = e^{-n(n-1)/730}.$$
Therefore, the expression above is not only an approximation, but also an upper bound of p̄(n). The inequality
$$e^{-n(n-1)/730} < \frac{1}{2}$$
implies p̄(n) < 1/2. Solving for n gives
$$n^2 - n > 730 \ln 2.$$
Now, 730 ln 2 is approximately 505.997, which is barely below 506, the value of n² − n attained when n = 23. Therefore, 23 people suffice. Incidentally, solving n² − n = 730 ln 2 for n gives the approximate formula of Frank H. Mathis cited above.
This derivation only shows that at most 23 people are needed to ensure the chances of a birthday match are at least even; it leaves open the possibility that n = 22 or fewer could also work.
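The inequality n² − n > 730 ln 2 and the remaining question about n = 22 can both be checked directly. A small sketch, with illustrative names:

```python
import math

threshold = 730 * math.log(2)  # right-hand side of n^2 - n > 730 ln 2

# Smallest n satisfying the sufficient condition from the upper bound.
n_bound = 1
while n_bound * n_bound - n_bound <= threshold:
    n_bound += 1

def p_shared(n, d=365):
    """Exact probability of a shared birthday among n people."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (d - k) / d
    return 1.0 - p_distinct

print(round(threshold, 3), n_bound)
print(round(p_shared(22), 5), round(p_shared(23), 5))
```

The exact values confirm that 22 people are not enough, so the bound is tight in this case.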
Given a year with d days, the generalized birthday problem asks for the minimal number n(d) such that, in a set of n randomly chosen people, the probability of a birthday coincidence is at least 50%. In other words, n(d) is the minimal integer n such that
$$1 - \left(1 - \frac{1}{d}\right)\left(1 - \frac{2}{d}\right) \cdots \left(1 - \frac{n-1}{d}\right) \geq \frac{1}{2}.$$
The classical birthday problem thus corresponds to determining n(365). The first 99 values of n(d) are given here (sequence A033810 in the OEIS):
d | 1–2 | 3–5 | 6–9 | 10–16 | 17–23 | 24–32 | 33–42 | 43–54 | 55–68 | 69–82 | 83–99 |
---|---|---|---|---|---|---|---|---|---|---|---|
n(d) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
A similar calculation shows that n(d) = 23 when d is in the range 341–372.
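The definition of n(d) translates into a short search. The sketch below uses exact fractions so that borderline cases such as d = 2 (where the probability is exactly 1/2) are decided correctly; `minimal_group_size` is an illustrative name.

```python
from fractions import Fraction

def minimal_group_size(d):
    """Smallest n with P(shared birthday among n people) >= 1/2 for d days."""
    p_distinct = Fraction(1)  # probability that the first n people all differ
    n = 1
    while 1 - p_distinct < Fraction(1, 2):
        # Person n + 1 must avoid the n days already taken.
        p_distinct *= Fraction(d - n, d)
        n += 1
    return n

print([minimal_group_size(d) for d in range(1, 18)])
print(minimal_group_size(365))
```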
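A number of bounds and formulas for n(d) have been published.[10] For any d ≥ 1, the number n(d) satisfies[11]
$$\frac{3 - 2\ln 2}{6} < n(d) - \sqrt{2d \ln 2} \leq 9 - \sqrt{86 \ln 2}.$$
These bounds are optimal in the sense that the sequence n(d) − √(2d ln 2) gets arbitrarily close to
$$\frac{3 - 2\ln 2}{6} \approx 0.27,$$
while it has
$$9 - \sqrt{86 \ln 2} \approx 1.28$$
as its maximum, taken for d = 43.
The bounds are sufficiently tight to give the exact value of n(d) in most of the cases. For example, for d = 365 these bounds imply that 22.7633 < n(365) < 23.7736 and 23 is the only integer in that range. In general, it follows from these bounds that n(d) always equals either
$$\left\lceil \sqrt{2d \ln 2} \right\rceil \quad \text{or} \quad \left\lceil \sqrt{2d \ln 2} \right\rceil + 1,$$
where ⌈ · ⌉ denotes the ceiling function. The formula
$$n(d) = \left\lceil \sqrt{2d \ln 2} \right\rceil$$
holds for 73% of all integers d.[12] The formula
$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} \right\rceil$$
holds for almost all d, i.e., for a set of integers d with asymptotic density 1.[12]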
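The formula
$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} + \frac{9 - 4(\ln 2)^2}{72\sqrt{2d \ln 2}} \right\rceil$$
holds for all d ≤ 10¹⁸, but it is conjectured that there are infinitely many counterexamples to this formula.[13]
The formula
$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} + \frac{9 - 4(\ln 2)^2}{72\sqrt{2d \ln 2}} - \frac{2(\ln 2)^2}{135d} \right\rceil$$
holds for all d ≤ 10¹⁸, and it is conjectured that this formula holds for all d.[13]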
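The bounds (3 − 2 ln 2)/6 < n(d) − √(2d ln 2) ≤ 9 − √(86 ln 2) can be spot-checked against brute-force values of n(d). This sketch covers only d ≤ 400, so it illustrates the bounds rather than proving them; all names are illustrative.

```python
import math
from fractions import Fraction

LOWER = (3 - 2 * math.log(2)) / 6
UPPER = 9 - math.sqrt(86 * math.log(2))

def minimal_group_size(d):
    """Brute-force n(d) using exact fractions."""
    p_distinct, n = Fraction(1), 1
    while 1 - p_distinct < Fraction(1, 2):
        p_distinct *= Fraction(d - n, d)
        n += 1
    return n

# Gap between n(d) and sqrt(2 d ln 2) for each d in the tested range.
gaps = {d: minimal_group_size(d) - math.sqrt(2 * d * math.log(2))
        for d in range(1, 401)}

print(round(LOWER, 4), round(UPPER, 4))
print(max(gaps, key=gaps.get))  # value of d where the gap is largest
```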
It is possible to extend the problem to ask how many people in a group are necessary for there to be a greater than 50% probability that at least 3, 4, 5, etc. of the group share the same birthday.
The first few values are as follows: >50% probability of 3 people sharing a birthday - 88 people; >50% probability of 4 people sharing a birthday - 187 people (sequence A014088 in the OEIS).[14]
The strong birthday problem asks for the number of people that need to be gathered together before there is a 50% chance that everyone in the gathering shares their birthday with at least one other person. For d = 365 days the answer is 3,064 people.[15][16]
The number of people needed for an arbitrary number of days is given by (sequence A380129 in the OEIS)
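The birthday problem can be generalized as follows: given n random integers drawn from a discrete uniform distribution with range [1, d], what is the probability p(n; d) that at least two numbers are the same? (d = 365 gives the usual birthday problem.)
The generic results can be derived using the same arguments given above:
$$p(n; d) = \begin{cases} 1 - \displaystyle\prod_{k=1}^{n-1} \left(1 - \frac{k}{d}\right) & n \leq d \\ 1 & n > d \end{cases}$$
$$p(n; d) \approx 1 - e^{-n(n-1)/(2d)} \approx 1 - \left(\frac{d-1}{d}\right)^{n(n-1)/2}.$$
Conversely, if n(p; d) denotes the number of random integers drawn from [1, d] to obtain a probability p that at least two numbers are the same, then
$$n(p; d) \approx \sqrt{2d \cdot \ln\left(\frac{1}{1-p}\right)}.$$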
The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2). This is exploited by birthday attacks on cryptographic hash functions and is the reason why a small number of collisions in a hash table are, for all practical purposes, inevitable.
The theory behind the birthday problem was used by Zoe Schnabel[18] under the name of capture-recapture statistics to estimate the size of fish population in lakes. The birthday problem and its generalizations are also useful tools for modelling coincidences.[19]
The classic birthday problem allows for more than two people to share a particular birthday or for there to be matches on multiple days. The probability that among n people there is exactly one pair of individuals with a matching birthday given d possible days is[19]
$$\binom{n}{2} \frac{d!}{d^n (d - n + 1)!}.$$
Unlike the standard birthday problem, as n increases the probability reaches a maximum value before decreasing. For example, for d = 365, the probability of a unique match has a maximum value of 0.3864 occurring when n = 28.
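The location of the maximum can be confirmed numerically. The sketch below counts outcomes with exactly one matching pair as C(n, 2) ways to choose the pair times the number of ways to give n − 1 distinct birthdays to the pair and the remaining people, divided by dⁿ. It assumes Python 3.8+ for `math.comb` and `math.perm`; the function name is illustrative.

```python
from math import comb, perm

def p_exactly_one_pair(n, d=365):
    """Probability that exactly one pair among n people shares a birthday
    and all other birthdays are distinct."""
    if n < 2 or n > d + 1:
        return 0.0
    # perm(d, n - 1) = d! / (d - n + 1)!
    return comb(n, 2) * perm(d, n - 1) / d ** n

best_n = max(range(2, 101), key=p_exactly_one_pair)
print(best_n, round(p_exactly_one_pair(best_n), 4))
```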
The basic problem considers all trials to be of one "type". The birthday problem has been generalized to consider an arbitrary number of types.[20] In the simplest extension there are two types of people, say m men and n women, and the problem becomes characterizing the probability of a shared birthday between at least one man and one woman. (Shared birthdays between two men or two women do not count.) The probability of no shared birthdays here is
$$p_0 = \frac{1}{d^{m+n}} \sum_{i=1}^{m} \sum_{j=1}^{n} S_2(m, i)\, S_2(n, j) \prod_{k=0}^{i+j-1} (d - k)$$
where d = 365 and S₂ are Stirling numbers of the second kind. Consequently, the desired probability is 1 − p₀.
This variation of the birthday problem is interesting because there is not a unique solution for the total number of people m + n. For example, the usual 50% probability value is realized for both a 32-member group of 16 men and 16 women and a 49-member group of 43 women and 6 men.
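The two-type probability can be evaluated exactly by counting outcomes in which the men occupy exactly i distinct days and the women exactly j further distinct days, using Stirling numbers of the second kind. The following is a sketch with illustrative names, assuming Python 3.8+ for `math.perm`; it is one possible way to organize the computation.

```python
from fractions import Fraction
from math import perm

def stirling2_row(m):
    """Return [S2(m, 0), ..., S2(m, m)] via S2(a, b) = b*S2(a-1, b) + S2(a-1, b-1)."""
    row = [1]  # S2(0, 0) = 1
    for a in range(1, m + 1):
        new = [0] * (a + 1)
        for b in range(1, a + 1):
            new[b] = (b * row[b] if b < a else 0) + row[b - 1]
        row = new
    return row

def p_no_cross_match(m, n, d=365):
    """Probability that no man shares a birthday with any woman."""
    men, women = stirling2_row(m), stirling2_row(n)
    ways = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # perm(d, i + j) assigns i + j different days to the blocks.
            ways += men[i] * women[j] * perm(d, i + j)
    return Fraction(ways, d ** (m + n))

for m, n in ((16, 16), (6, 43)):
    print(m, n, round(float(1 - p_no_cross_match(m, n)), 4))
```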
A related question is, as people enter a room one at a time, which one is most likely to be the first to have the same birthday as someone already in the room? That is, for what n is p(n) − p(n − 1) maximum? The answer is 20: if there is a prize for the first match, the best position in line is 20th.[citation needed]
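The difference p(n) − p(n − 1) is the probability that the first n − 1 people all differ and the nth person matches one of them, which makes the maximum easy to locate. A minimal sketch with an illustrative function name:

```python
def p_first_match_at(n, d=365):
    """Probability that the nth person to enter produces the first match."""
    p_distinct = 1.0
    for k in range(n - 1):
        p_distinct *= (d - k) / d  # first n - 1 people all differ
    return p_distinct * (n - 1) / d  # nth person hits one of n - 1 taken days

best_position = max(range(1, 367), key=p_first_match_at)
print(best_position, round(p_first_match_at(best_position), 5))
```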
In the birthday problem, neither of the two people is chosen in advance. By contrast, the probability q(n) that at least one other person in a room of n other people has the same birthday as a particular person (for example, you) is given by
$$q(n) = 1 - \left(\frac{364}{365}\right)^n$$
and for general d by
$$q(n; d) = 1 - \left(\frac{d-1}{d}\right)^n.$$
In the standard case of d = 365, substituting n = 23 gives about 6.1%, which is less than 1 chance in 16. For a greater than 50% chance that at least one other person in a roomful of n people has the same birthday as you, n would need to be at least 253. This number is significantly higher than 365/2 = 182.5: the reason is that it is likely that there are some birthday matches among the other people in the room.
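Both figures follow from q(n; d) = 1 − ((d − 1)/d)ⁿ. A short sketch, with an illustrative function name:

```python
def q_match_you(n, d=365):
    """Probability that at least one of n other people shares your birthday."""
    return 1.0 - ((d - 1) / d) ** n

# Smallest n for which the chance exceeds one half.
n_needed = 1
while q_match_you(n_needed) <= 0.5:
    n_needed += 1

print(round(q_match_you(23), 4), n_needed)
```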
For any one person in a group of n people, the probability that this person shares a birthday with someone else is q(n − 1; d), as explained above. The expected number of people with a shared (non-unique) birthday can now be calculated easily by multiplying that probability by the number of people (n), so it is:
$$n \left(1 - \left(\frac{d-1}{d}\right)^{n-1}\right)$$
(This multiplication can be done this way because of the linearity of the expected value of indicator variables). This implies that the expected number of people with a non-shared (unique) birthday is:
$$n \left(\frac{d-1}{d}\right)^{n-1}$$
Similar formulas can be derived for the expected number of people who share with three, four, etc. other people.
The expected number of people needed until every birthday is achieved is called the coupon collector's problem. It can be calculated by nH_n, where n here is the number of possible dates and H_n is the nth harmonic number. For 365 possible dates (the birthday problem), the answer is 2365.
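The value 2365 is the rounded product of 365 and the 365th harmonic number. A brief sketch, with an illustrative function name:

```python
import math

def expected_people_for_all_dates(dates=365):
    """Coupon collector expectation: dates * H_dates."""
    harmonic = math.fsum(1.0 / k for k in range(1, dates + 1))
    return dates * harmonic

print(round(expected_people_for_all_dates(), 2))
```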
Another generalization is to ask for the probability of finding at least one pair in a group of n people with birthdays within k calendar days of each other, if there are d equally likely birthdays.[21]
The number of people required so that the probability that some pair will have a birthday separated by k days or fewer will be higher than 50% is given in the following table:
k | n for d = 365 |
---|---|
0 | 23 |
1 | 14 |
2 | 11 |
3 | 9 |
4 | 8 |
5 | 8 |
6 | 7 |
7 | 7 |
Thus in a group of just seven random people, it is more likely than not that two of them will have a birthday within a week of each other.[21]
The expected number of different birthdays, i.e. the number of days that are at least one person's birthday, is:
$$d - d\left(\frac{d-1}{d}\right)^n$$
This follows from the expected number of days that are no one's birthday:
$$d\left(\frac{d-1}{d}\right)^n$$
which follows from the probability that a particular day is no one's birthday, ((d − 1)/d)ⁿ, easily summed because of the linearity of the expected value.
For instance, with d = 365, you should expect about 21.4 different birthdays when there are 22 people, or about 46.8 different birthdays when there are 50 people. When there are 1000 people, there will be around 341.5 different birthdays (about 23.5 unclaimed birthdays).
The above can be generalized from the distribution of the number of people with their birthday on any particular day, which is a binomial distribution with probability 1/d. Multiplying the relevant probability by d will then give the expected number of days. For example, the expected number of days which are shared, i.e. which are at least two (i.e. not zero and not one) people's birthday, is:
$$d - d\left(\frac{d-1}{d}\right)^n - d \binom{n}{1} \left(\frac{1}{d}\right)^1 \left(\frac{d-1}{d}\right)^{n-1} = d - d\left(\frac{d-1}{d}\right)^n - n\left(\frac{d-1}{d}\right)^{n-1}$$
The probability that the kth integer randomly chosen from [1, d] will repeat at least one previous choice equals q(k − 1; d) above. The expected total number of times a selection will repeat a previous selection as n such integers are chosen equals[22]
$$\sum_{k=1}^{n} q(k - 1; d) = n - d + d\left(\frac{d-1}{d}\right)^n.$$
This can be seen to equal the number of people minus the expected number of different birthdays.
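The expected number of different birthdays, d − d((d − 1)/d)ⁿ, and the expected number of repeats can be evaluated and cross-checked against each other. A minimal sketch with illustrative function names, where q(k − 1; d) is taken to be 1 − ((d − 1)/d)^(k − 1):

```python
def expected_distinct(n, d=365):
    """Expected number of days that are at least one person's birthday."""
    return d - d * ((d - 1) / d) ** n

def expected_repeats(n, d=365):
    """Sum over k of the chance that the kth pick repeats an earlier one."""
    return sum(1.0 - ((d - 1) / d) ** (k - 1) for k in range(1, n + 1))

for n in (22, 50, 1000):
    print(n, round(expected_distinct(n), 2), round(expected_repeats(n), 2))
```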
In an alternative formulation of the birthday problem, one asks the average number of people required to find a pair with the same birthday. If we consider the probability function Pr[n people have at least one shared birthday], this average is determining the mean of the distribution, as opposed to the customary formulation, which asks for the median. The problem is relevant to several hashing algorithms analyzed by Donald Knuth in his book The Art of Computer Programming. It may be shown[23][24] that if one samples uniformly, with replacement, from a population of size M, the number of trials required for the first repeated sampling of some individual has expected value n = 1 + Q(M), where
$$Q(M) = \sum_{k=1}^{M} \frac{M!}{(M-k)!\, M^k}.$$
The function
$$Q(M) = 1 + \frac{M-1}{M} + \frac{(M-1)(M-2)}{M^2} + \cdots + \frac{(M-1)(M-2) \cdots 1}{M^{M-1}}$$
has been studied by Srinivasa Ramanujan and has asymptotic expansion:
$$Q(M) \sim \sqrt{\frac{\pi M}{2}} - \frac{1}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2M}} - \frac{4}{135M} + \cdots.$$
With M = 365 days in a year, the average number of people required to find a pair with the same birthday is n = 1 + Q(M) ≈ 24.61659, somewhat more than 23, the number required for a 50% chance. In the best case, two people will suffice; at worst, the maximum possible number of M + 1 = 366 people is needed; but on average, only 25 people are required.
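The sum Q(M) = Σ M!/((M − k)! Mᵏ) can be accumulated term by term, since each term is the previous one multiplied by (M − k + 1)/M. The sketch below also evaluates the first four terms of the asymptotic expansion for comparison; function names are illustrative.

```python
import math

def q_function(m):
    """Q(M) = sum over k = 1..M of M! / ((M - k)! * M**k)."""
    total, term = 0.0, 1.0
    for k in range(1, m + 1):
        term *= (m - k + 1) / m  # ratio between consecutive terms
        total += term
    return total

def q_asymptotic(m):
    """First four terms of the asymptotic expansion."""
    return (math.sqrt(math.pi * m / 2) - 1 / 3
            + math.sqrt(math.pi / (2 * m)) / 12 - 4 / (135 * m))

print(round(1 + q_function(365), 5))
print(round(1 + q_asymptotic(365), 5))
```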
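An analysis using indicator random variables can provide a simpler but approximate analysis of this problem.[25] For each pair (i, j) among k people in a room, where n is the number of days in the year, we define the indicator random variable X_ij, for 1 ≤ i < j ≤ k, by
$$X_{ij} = \begin{cases} 1 & \text{if person } i \text{ and person } j \text{ have the same birthday} \\ 0 & \text{otherwise} \end{cases} \qquad E[X_{ij}] = \frac{1}{n}.$$
Let X be a random variable counting the pairs of individuals with the same birthday.
$$X = \sum_{i=1}^{k} \sum_{j=i+1}^{k} X_{ij}, \qquad E[X] = \sum_{i=1}^{k} \sum_{j=i+1}^{k} E[X_{ij}] = \binom{k}{2} \frac{1}{n} = \frac{k(k-1)}{2n}.$$
For n = 365, if k = 28, the expected number of pairs of individuals with the same birthday is (28 × 27)/(2 × 365) ≈ 1.0356. Therefore, we can expect at least one matching pair with at least 28 people.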
In the 2014 FIFA World Cup, each of the 32 squads had 23 players. An analysis of the official squad lists suggested that 16 squads had pairs of players sharing birthdays, and of these 5 squads had two pairs: Argentina, France, Iran, South Korea and Switzerland each had two pairs, and Australia, Bosnia and Herzegovina, Brazil, Cameroon, Colombia, Honduras, Netherlands, Nigeria, Russia, Spain and USA each had one pair.[26]
Voracek, Tran and Formann showed that the majority of people markedly overestimate the number of people that is necessary to achieve a given probability of people having the same birthday, and markedly underestimate the probability of people having the same birthday when a specific sample size is given.[27] Further results showed that psychology students and women did better on the task than casino visitors/personnel or men, but were less confident about their estimates.
The reverse problem is to find, for a fixed probability p, the greatest n for which the probability p(n) is smaller than the given p, or the smallest n for which the probability p(n) is greater than the given p.[citation needed]
Taking the above formula for d = 365, one has
$$n(p; 365) \approx \sqrt{730 \ln\left(\frac{1}{1-p}\right)}.$$
The following table gives some sample calculations.
p | n | n↓ | p(n↓) | n↑ | p(n↑) |
---|---|---|---|---|---|
0.01 | 0.14178√365 = 2.70864 | 2 | 0.00274 | 3 | 0.00820 |
0.05 | 0.32029√365 = 6.11916 | 6 | 0.04046 | 7 | 0.05624 |
0.1 | 0.45904√365 = 8.77002 | 8 | 0.07434 | 9 | 0.09462 |
0.2 | 0.66805√365 = 12.76302 | 12 | 0.16702 | 13 | 0.19441 |
0.3 | 0.84460√365 = 16.13607 | 16 | 0.28360 | 17 | 0.31501 |
0.5 | 1.17741√365 = 22.49439 | 22 | 0.47570 | 23 | 0.50730 |
0.7 | 1.55176√365 = 29.64625 | 29 | 0.68097 | 30 | 0.70632 |
0.8 | 1.79412√365 = 34.27666 | 34 | 0.79532 | 35 | 0.81438 |
0.9 | 2.14597√365 = 40.99862 | 40 | 0.89123 | 41 | 0.90315 |
0.95 | 2.44775√365 = 46.76414 | 46 | 0.94825 | 47 | 0.95477 |
0.99 | 3.03485√365 = 57.98081 | 57 | 0.99012 | 58 | 0.99166 |
Some values fall outside the bounds, showing that the approximation is not always exact: p(n↑) is still below the target p for p = 0.01, 0.1 and 0.2, and p(n↓) is already above it for p = 0.99.
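The rows of the table can be regenerated from the approximation n ≈ √(2 · 365 · ln(1/(1 − p))) together with the exact probability p(n). A sketch with illustrative function names:

```python
import math

def n_approx(p, d=365):
    """Approximate group size for a target probability p."""
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

def p_exact(n, d=365):
    """Exact probability of a shared birthday among n people."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (d - k) / d
    return 1.0 - p_distinct

for p in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    n = n_approx(p)
    lo, hi = math.floor(n), math.ceil(n)
    print(f"{p} | {n:.5f} | {lo} | {p_exact(lo):.5f} | {hi} | {p_exact(hi):.5f}")
```

Comparing each p(n↓) and p(n↑) with the target p shows where the approximation brackets the true threshold and where it does not.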
A related problem is the partition problem, a variant of the knapsack problem from operations research. Some weights are put on a balance scale; each weight is an integer number of grams randomly chosen between one gram and one million grams (one tonne). The question is whether one can usually (that is, with probability close to 1) transfer the weights between the left and right arms to balance the scale. (In case the sum of all the weights is an odd number of grams, a discrepancy of one gram is allowed.) If there are only two or three weights, the answer is very clearly no; although there are some combinations which work, the majority of randomly selected combinations of three weights do not. If there are very many weights, the answer is clearly yes. The question is, how many are just sufficient? That is, what is the number of weights such that it is equally likely for it to be possible to balance them as it is to be impossible?
Often, people's intuition is that the answer is above 100000. Most people's intuition is that it is in the thousands or tens of thousands, while others feel it should at least be in the hundreds. The correct answer is 23.[citation needed]
The reason is that the correct comparison is to the number of partitions of the weights into left and right. There are 2^(N−1) different partitions for N weights, and the left sum minus the right sum can be thought of as a new random quantity for each partition. The distribution of the sum of weights is approximately Gaussian, with a peak at 500000N and width 1000000√N, so that when 2^(N−1) is approximately equal to 1000000√N the transition occurs. 2^(23−1) is about 4 million, while the width of the distribution is only 5 million.[28]
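The heuristic compares the number of partitions, 2^(N−1), with the width of the distribution, 1000000√N. The sketch below tabulates both quantities near the transition; it only evaluates the two expressions from the argument above and does not simulate the balancing problem itself. Names are illustrative.

```python
import math

MAX_WEIGHT = 1_000_000  # grams, as in the problem statement

def partition_count(n):
    """Number of ways to split n weights between the two arms."""
    return 2 ** (n - 1)

def distribution_width(n):
    """Approximate width of the distribution of sums for n weights."""
    return MAX_WEIGHT * math.sqrt(n)

for n in range(20, 27):
    print(n, partition_count(n), round(distribution_width(n)))
```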
Arthur C. Clarke's 1961 novel A Fall of Moondust contains a section where the main characters, trapped underground for an indefinite amount of time, are celebrating a birthday and find themselves discussing the validity of the birthday problem. As stated by a physicist passenger: "If you have a group of more than twenty-four people, the odds are better than even that two of them have the same birthday." Eventually, out of 22 present, it is revealed that two characters share the same birthday, May 23.
The reasoning is based on important tools that all students of mathematics should have ready access to. The birthday problem used to be a splendid illustration of the advantages of pure thought over mechanical manipulation; the inequalities can be obtained in a minute or two, whereas the multiplications would take much longer, and be much more subject to error, whether the instrument is a pencil or an old-fashioned desk computer. What calculators do not yield is understanding, or mathematical facility, or a solid basis for more advanced, generalized theories.