Introduction
It all started with this tweet from Karl Broman.
That the 1st 11 socks in the laundry are each distinct suggests there are a lot more socks. pic.twitter.com/EGSo9P6rw7
— Karl Broman (@kwbroman) October 17, 2014
Rasmus Bååth presented a Bayesian analysis. Christian Robert presented exact ptobability calculations pointing out that Feller had posed a similar problem.
Working independently I came up with the same exact probabilities (see below) using a direct counting approach. I also found an approximation when the number of paired socks is large.
Based on these calculations, we can say that the likelihood is increasing as a function of $n$ the number of socks. So, it seems we need prior information or some other data (like capture-recapture) to get sensible inference.
Paired socks, exact probability
Suppose we have $n$ pairs of socks and we pick $k$ of them. The probability of getting no two socks from any pair is the ratio of the total number of ways that can happen to the total number of ways we can pick $k$ socks from $n$ pairs.
The numerator, is the total number of ways we can pick $k$ distinct socks from $n$ pairs. This is ${n \choose k} 2^{k}$. We can choose $k$ distinct kinds from the $n$ kinds in ${n \choose k}$ ways. Given each pair, we can choose a sock in two different ways.
The denominator is the total number of ways we can pick $k$ socks from $2n$. So, that would be: ${2n \choose k}$.
Thus, the probability is:
$$ \frac{ {n \choose k} 2^{k} } { 2n \choose k }.$$
Here's the R function:
norepeat1 <- function(n,k)
{
if( n<1000 )
choose(n,k) * (2^k) / choose(2*n,k)
else
exp(lchoose(n,k) + k*log(2) - lchoose(2*n,k))
}
It has been modified to consider situations when the number of socks is very large.
Paired and singletons, exact probability
Since Karl could not rule out the possibility that some socks were unpaired, let us consider the case where there are $n_1$ singleton socks and $n_2$ paired socks. We draw $k$ socks at random. The number of ways to get no matching socks where we have $i$ singletons is:
$$ {n_1 \choose i} {n_2 \choose {k-i}} 2^{k-i}.$$
The total number of ways to draw $k$ socks and get no matches is the sum of this over $i$. The total number of ways to get $k$ socks from $n_1$ singletons and $n_2$ pairs is just
$$ {{n_1 + 2 n_2} \choose k}.$$
Thus our desired probability is:
$$ \left[ \sum_{i=0}^k {n_1 \choose i} {n_2 \choose {k-i}} 2^{k-i} \right] \Big / {{n_1 + 2 n_2} \choose k}.$$
The R function is:
norepeat2 <- function(n1,n2,k)
{
a <- 0
for( i in 0:k )
a <- a + choose(n1,i) * choose(n2,k-i) * (2^(k-i))
a/choose(2*n2+n1,k)
}
Given that there were 3 singletons, and 21 pairs, the probability of no matching pairs in 11 socks is:
> norepeat2(n1=3,n2=21,k=11)
[1] 0.2275212
This has been calculated by simulation. Here's Karl's function to calculate the number of socks to get the first matching pair.
# if laundry contains n_pair pairs of socks and n_sing singletons,
# how many socks to draw to get first pair?
sim_socks <-
function(n_pair=21, n_sing=3, n_sim=100000)
{
socks <- c(rep(1:n_pair, 2), n_pair+(1:n_sing))
# replicate(n_sim, min(which(duplicated(sample(socks)))))
library(magrittr)
replicate(n_sim, sample(socks) %>% duplicated %>% which %>% min)
}
The probability of no match in the first 11 socks is:
> x <- sim_socks(n_pair=21,n_sing=3,n_sim=100000)
> mean(x>11)
[1] 0.22723
Simulation and theory match.
Paired socks, approximate probability
As I was playing around with the function to calculate the probabilities for paired socks, I noticed that the probability seemed to depend only on $k/\sqrt{n}$ when $n$ was large. It took me a while to come up with a proof, but here is it.
Recall that the probability of interest is
$$ \frac{ {n \choose k} 2^{k} } { 2n \choose k }.$$
We can write this as:
$$ \frac{n! 2^k}{k! (n-k)!} \Big/ \frac{ 2n!}{ (2n-k)! k!} = \frac{ (2n-k)! 2^k } {k! n!} \Big/ \frac{ 2n! }{ n! n!} = \frac{ {{2n-k}\choose{n}} (\frac{1}{2})^{2n-k}} { {{2n}\choose{n}} (\frac{1}{2})^{2n}} $$
Both the numerator and denominator are binomial probabilities that can be approximated by the normal distribution for large $n$. After some algebra we come to the approximation:
$$ \exp(-\frac{k^2}{4n}).$$
This approximations works pretty well when $k$ is small relative to $n$.
# n=30 k=3
> norepeat1(n=30,k=3)
[1] 0.9491525
> exp(-0.25*3^2/30)
[1] 0.9277435
# n=100 k=11
> norepeat1(n=100,k=11)
[1] 0.7479764
> exp(-0.25*11^2/100)
[1] 0.7389685
# n=10000 k=110
> norepeat1(n=10000,k=110)
[1] 0.7397806
> exp(-0.25*110^2/10000)
[1] 0.7389685