### BIRTHDAY PROBLEM ###
# We assume 365 days in a year, equally probable as birthdays. We assume some
# number k of people in a room, with no 'dependencies' (e.g. no twins). We're
# interested in the probability that all k people have distinct birthdays.
# This function computes the probability exactly (up to machine imprecision).
exact <- function(k) {
  product <- 1
  for (i in 0:(k - 1))
    product <- product * (1 - i / 365)
  product
}
# Here are the crucial values of k to consider. In English, what do these
# results mean?
exact(22)
exact(23)
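As a cross-check (added here, not part of the original script), the same product can be computed in one line with R's vectorized `prod`:

```r
# Vectorized version of exact(): prod() multiplies the whole sequence
# of factors (1 - i/365) for i = 0, ..., k-1 at once.
exact_vec <- function(k) prod(1 - (0:(k - 1)) / 365)
exact_vec(22)  # about 0.524
exact_vec(23)  # about 0.493
```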
# This function computes the probability approximately. The approximation works
# best when k is much smaller than 365.
# Note: base R already provides a function named approx() (in the stats
# package), so we give ours a different name to avoid masking it.
approx_birthday <- function(k) {
  exp(-k * (k - 1) / (2 * 365))
}
# How well does the approximation work?
approx_birthday(22)
approx_birthday(23)
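A quick side-by-side sketch (added here, not part of the original script) comparing the exact answer with the approximation for several values of k; the gap widens as k grows toward 365.

```r
# Self-contained comparison: exact product vs. exponential approximation.
exact_p  <- function(k) prod(1 - (0:(k - 1)) / 365)
approx_p <- function(k) exp(-k * (k - 1) / (2 * 365))
for (k in c(10, 23, 50, 100))
  cat(sprintf("k = %3d  exact = %.4f  approx = %.4f\n",
              k, exact_p(k), approx_p(k)))
```

Since log(1 - x) < -x for x > 0, the approximation always slightly overstates the exact probability.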
### ARIZONA DNA DATABASE EXAMPLE ###
# This example comes from Dobrow, "Probability with Applications and R",
# Example 2.10. It's a true story from 2001 in Arizona. Suppose that police
# collect DNA for evidence in criminal investigations. Specifically, they
# build a profile from nine loci on the chromosomes. Without going into the
# details, there are about 754,000,000 possible profiles. The database has
# 65,493 profiles in it. A query of the database reveals that two of the
# profiles match exactly. Does this match call into question the reliability of
# the database?
# Assume that all of the 754,000,000 possible profiles are equally probable,
# and that all 65,493 current profiles are 'independent' (e.g. no twins). Then
# this is the birthday problem with different numbers. The probability that all
# 65,493 profiles will be distinct is approximately...
k <- 65493
m <- 754000000
exp(-k * (k - 1) / (2 * m))
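With these numbers the exact product is still cheap to compute (65,493 factors), so we can check the approximation directly. This cross-check is a sketch added here; it is not part of Dobrow's example.

```r
# Exact probability that all k profiles are distinct, compared with the
# exponential approximation used above.
k <- 65493
m <- 754000000
exact_p  <- prod(1 - (0:(k - 1)) / m)
approx_p <- exp(-k * (k - 1) / (2 * m))
c(exact = exact_p, approx = approx_p)  # both about 0.058
```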
# So does the match call into question the reliability of the database?
### HASH TABLE EXAMPLE ###
# In computer science, a hash table is a particular way of organizing
# information inside the computer's memory. Essentially, the hash table
# consists of m boxes, into which we want to place k pieces of information. The
# hash table works fastest when there are no collisions --- that is, when no
# two pieces of information land in the same box. So what is the probability of
# no collisions?
# Assume that each piece of information is equally likely to land in each of
# the boxes. Then we can work out the approximate probability...
k <- 1000000
m <- k * k
exp(-k * (k - 1) / (2 * m))
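To make the "large or small" question concrete, here is a sketch (added, not in the original) that varies the table size m while holding k fixed:

```r
# No-collision probability as the table size m grows, for k = 1,000,000
# items. Larger tables make collisions rarer.
k <- 1000000
for (m in c(k, 100 * k, k^2, 10 * k^2))
  cat(sprintf("m = %.0e  P(no collision) = %.4g\n",
              m, exp(-k * (k - 1) / (2 * m))))
```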
# Suppose that we have no choice about k --- it's determined by how much data
# our customer wants to store --- but we can choose m. For the sake of speed,
# do we want to choose m to be large or small?
# A computer scientist might also ask: For the sake of memory usage, do we want
# to choose m to be large or small? And what effect does using a lot of memory
# have on speed? This is an example of how computer scientists often consider
# subtle tradeoffs between time efficiency and space efficiency.
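One way to quantify that tradeoff (a sketch added here, under the same equally-likely assumption): each of the choose(k, 2) pairs of items collides with probability 1/m, so by linearity of expectation the expected number of colliding pairs is choose(k, 2) / m. Doubling the memory m halves the expected number of collisions.

```r
# Expected number of pairs that land in the same box, by linearity of
# expectation: choose(k, 2) pairs, each colliding with probability 1/m.
expected_collisions <- function(k, m) choose(k, 2) / m
expected_collisions(1000000, 1000000^2)      # about 0.5
expected_collisions(1000000, 2 * 1000000^2)  # about 0.25
```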