### BIRTHDAY PROBLEM ###

# We assume 365 days in a year, equally probable as birthdays. We assume some
# number k of people in a room, with no 'dependencies' (e.g. no twins). We're
# interested in the probability that all k people have distinct birthdays.

# This function computes the probability exactly (up to machine imprecision).
exact <- function(k) {
  product <- 1
  for (i in 0:(k - 1))
    product <- product * (1 - i / 365)
  product
}

# Here are the crucial k to consider. In English, what do these results mean?
exact(22)
exact(23)

# This function computes the probability approximately. The approximation works
# best when k is much smaller than 365. (Note: defining it under this name
# masks R's built-in stats::approx for the rest of the session.)
approx <- function(k) {
  exp(-k * (k - 1) / (2 * 365))
}

# How well does the approximation work?
approx(22)
approx(23)

### ARIZONA DNA DATABASE EXAMPLE ###

# This example comes from Dobrow, "Probability with Applications and R",
# Example 2.10. It's a true story from 2001 in Arizona. Suppose that police
# collect DNA for evidence in criminal investigations. Specifically, they
# build a profile from nine loci on the chromosomes. Without going into the
# details, there are about 754,000,000 possible profiles. The database has
# 65,493 profiles in it. A query of the database reveals that two of the
# profiles match exactly. Does this match call into question the reliability of
# the database?

# Assume that all of the 754,000,000 possible profiles are equally probable,
# and that all 65,493 current profiles are 'independent' (e.g. no twins). Then
# this is the birthday problem with different numbers. The probability that all
# 65,493 profiles will be distinct is approximately...
k <- 65493
m <- 754000000
exp(-k * (k - 1) / (2 * m))

# So does the match call into question the reliability of the database?

### HASH TABLE EXAMPLE ###

# In computer science, a hash table is a particular way of organizing
# information inside the computer's memory.
# Essentially, the hash table consists of m boxes, into which we want to place
# k pieces of information. The hash table works fastest when there are no
# collisions --- that is, when no two pieces of information land in the same
# box. So what is the probability of no collisions?

# Assume that all pieces of information have the same probability of landing in
# all boxes. Then we can work out the approximate probability...
k <- 1000000
m <- k * k
exp(-k * (k - 1) / (2 * m))

# Suppose that we have no choice about k --- it's determined by how much data
# our customer wants to store --- but we can choose m. For the sake of speed,
# do we want to choose m to be large or small?

# A computer scientist might also ask: For the sake of memory usage, do we want
# to choose m to be large or small? And what effect does using a lot of memory
# have on speed? This is an example of how computer scientists often consider
# subtle tradeoffs between time efficiency and space efficiency.
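To make the speed question concrete, here is a quick sketch that evaluates the same approximation at a few table sizes m for fixed k. (The particular m values are illustrative choices of mine, not part of the example above.)

```r
# Sketch: for fixed k, the approximate probability of no collisions,
# exp(-k(k-1)/(2m)), at several illustrative table sizes m.
k <- 1000000
for (m in c(k, 100 * k, k^2, 100 * k^2)) {
  p <- exp(-k * (k - 1) / (2 * m))
  cat("m =", format(m, big.mark = ","), " P(no collisions) =", signif(p, 4), "\n")
}
```

Watch how the probability changes as m grows, and weigh that against the memory those extra boxes would cost.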