# Both of the following commands sample eight numbers from the list {0, 1, 2, ..., 8, 9}. The first command samples without replacement, producing eight distinct numbers always. The second command samples with replacement, and often repeats a number in its output.
sample(0:9, 8, replace=FALSE)
sample(0:9, 8, replace=TRUE)
### DEALING CARDS ###
# Let's number the cards in a standard deck from 1 to 52, with the four aces being 1 (spades), 2 (hearts), 3 (diamonds), 4 (clubs), the four twos being 5, 6, 7, 8, etc. Here are two hands of four cards.
handOfFour <- function() sample(1:52, 4, replace=FALSE)
handOfFour()
handOfFour()
# Here is a function for testing whether a hand is all aces.
allAces <- function(hand) {
hand[[1]] <= 4 && hand[[2]] <= 4 && hand[[3]] <= 4 && hand[[4]] <= 4
}
allAces(handOfFour())
# Here's how often we expect to get all aces, from theory.
(4 * 3 * 2 * 1) / (52 * 51 * 50 * 49)
# Here are 1,000,000 trials. Is the answer close to our theoretical prediction? In my experience, not very. Try increasing to 10,000,000 trials.
n <- 1000000
sum(replicate(n, allAces(handOfFour()))) / n
# Now what if we have a magical deck (or four decks used in succession), so that we can instead sample with replacement?
handOfFourReplacing <- function() sample(1:52, 4, replace=TRUE)
handOfFourReplacing()
# Here's how often we expect to get all aces, in theory.
(4 / 52)^4
# Here are 1,000,000 trials. Is the answer close to our prediction?
n <- 1000000
sum(replicate(n, allAces(handOfFourReplacing()))) / n
### BOOTSTRAPPING ###
# Here is part of a data set of Titus et al. (2005) from the Sierra Nevada mountains of California. The data arise from measuring a certain magnetic property in rocks. The details aren't important right now.
data <- c(-0.0062670029, 0.0432107223, 0.0178873715, 0.0336088550, 0.0469899499, 0.0455324360, 0.0252212117, 0.0442877903, 0.0378982737, 0.0503467673, 0.0738210561, 0.0565447207, 0.0649448289, 0.0851928042, 0.0571385153, 0.0629379207, 0.0568939621, 0.0215691305, 0.0482308898, 0.0266693757, 0.0225608892, 0.0391732096, 0.0207526939, -0.0294822425, -0.0232134280, -0.0301225660, -0.0334584210, -0.0040964386, -0.0147630458, -0.0153584326, -0.0073727513, -0.0173664327, -0.0255012127, -0.0105506318, -0.0083496883, -0.0120874593, -0.0310197194, -0.0042256328, -0.0250462667, -0.0065719032, -0.0021940247, -0.0159014744, -0.0034075815, -0.0122259588, 0.0029743760, -0.0035224231, 0.0193698737, -0.0065998307, 0.0021295731, -0.0187745488, 0.0018460427, 0.0253745283, 0.0285806303, 0.0305216771, -0.0008630528, 0.0036894285, -0.0187226604, 0.0087372390, 0.0135470324, 0.0182375470, 0.0267434442, -0.0109369868, 0.0159658187, 0.0170504426, 0.0002614576, 0.0024366281, 0.0037072184, 0.0011790952, 0.0003227600, 0.0023298055, 0.0007245120, 0.0014262387, 0.0092494984, 0.0098207051, 0.0058571214, 0.0046549970, 0.0107196758, 0.0079501759, 0.0036337620, 0.0428603566, -0.0067049752, -0.0034421831, 0.0062257619, -0.0061507440, -0.0069640491, -0.0329645330, -0.0143722362, -0.0210156549, -0.0123328455, 0.0486306496, 0.0391999809, 0.0170942197, 0.0274335451, 0.0195739435, 0.0252534154, 0.0397281516, 0.0194409506, 0.0076855575, 0.0087840515, 0.0419647353)
# Here's the sample size (number of data), a histogram, and the sample mean (average). The sample mean is our best guess for the mean of the population from which the data were sampled.
length(data)
hist(data)
mean(data)
# Bootstrapping is a technique for understanding the uncertainty in that estimate of the population mean. It is based on resampling the data set, with replacement, to make a 'mutated' data set of the same size. In this mutated data set, some of the original data are omitted and others are repeated.
mutant <- sample(data, length(data), replace=TRUE)
mutant
# The mean of the mutated data set is usually close to the original mean, but not exactly equal.
mean(data)
mean(mutant)
# In bootstrapping, we repeat this process (mutate the data set and compute the mean) many times, to get a large set of mutated means.
mutantMeans <- replicate(10000, mean(sample(data, length(data), replace=TRUE)))
hist(mutantMeans)
# Statistical theory (which we will not discuss here) shows that these mutated means characterize the uncertainty in our original estimate of the population mean. Roughly speaking, the middle 95% of the mutated means give us a 95% 'confidence interval' for the population mean.
quantile(mutantMeans, probs=c(0.025, 0.975))
# Technically, we should have centered the confidence interval at the sample mean. I skipped this step to keep the code simpler. It would change our answer very little. Also, I am not explaining the meaning of confidence intervals here. That is a subtle topic best treated in a statistics course. Anyway, you can see that sampling with replacement has practical applications.