### INSTALLING THE REQUIRED SOFTWARE ###
# If you're working on one of our lab computers, then all of the required
# software is already installed, so skip this section.
# If you're working on your own computer, then you need to install two pieces
# of software. Installation is easy and free of cost. First download and
# install R from www.r-project.org. Then download and install RStudio from
# www.rstudio.com.
### GETTING STARTED WITH RSTUDIO ###
# Launch the RStudio application. RStudio presents a single large window
# divided into 'panes'. One pane shows plots, another pane displays online
# help, another pane helps you manage your files, etc.
# In RStudio's menus, find the 'Open File...' command, and use it to open this
# file. This file should appear in the editor pane (typically the upper left
# corner of the RStudio window).
# For the really basic R that we'll be doing, the most important pane is the
# Console (typically the lower left corner of the RStudio window). It's where
# you enter commands into R, and where R usually returns its answers.
# If your command makes a plot, then it might show up in the Plots pane or a
# separate window. If your command asks for help, then it might show up in the
# Help pane.
### GLORIFIED CALCULATOR ###
# R is essentially a glorified calculator. Type (or copy and paste) the
# following line of code into RStudio's Console pane, and press the Return
# key. R evaluates your input and reports an answer.
3 + 2
# Similarly, run each of the following lines of code in R. You can probably
# guess what they mean. Feel free to try other numbers and operations.
# Experiment!
(-5)^2
sqrt(16 - 7)
# In RStudio, there is another way to run a line of code. Click anywhere on
# the line of code that you want to run, and then click the Run button in the
# top right corner of the editor pane. RStudio sends the line of code to R.
# RStudio also advances the cursor to the next line of code, so that you can
# immediately run that if you want. You can also run lines of code from an
# RStudio menu, which has a keyboard shortcut (e.g., Command-Return on macOS).
log(17)
log(1 + 2^2 + 4 * 2^3)
# As in almost all programming languages and computer systems, angles are in
# radians by default. It's because math is easier in radians than in degrees.
sin(pi / 2)
atan(4 / 3)
atan(-4 / -3)
atan2(-4, -3)
# There are many functions relevant to probability.
factorial(4)
choose(5, 2)
# To view documentation on any R function, enter that function's name into the
# search bar of RStudio's Help pane. Or use a question mark as in the
# following line of code. Either way, help appears in the Help pane. You can
# then click on the Help pane's zoom button to get a larger view of that help.
?atan2
### STORING VALUES IN MEMORY ###
# Modern calculators do much more than isolated simple calculations. You can
# store the result of a calculation in memory, for later use. This may sound
# like a small feature, but it's an important stepping stone to programming.
# For example, maybe you prefer to work with angles in degrees, not radians.
# Here's the cleanest way to proceed.
degree <- 2 * pi / 360
sin(90 * degree)
sin(30 * degree)
# When you use the '<-' operator, R evaluates the stuff on the right side,
# puts its value into the memory location named on the left side, and
# DOESN'T show you the answer. But if you want to see what got stored in
# memory, just ask R.
degree
# For another example, suppose that you run a small business. You want to
# compute its quarterly profit. (And you've been studying R for only a few
# minutes.) You might enter three months of revenues and costs like this.
revenue <- 13000 + 15000 + 14000
cost <- 16000 + 12000 + 5000
# To compute your quarterly profit, you might do this.
revenue - cost
# Warning: Memory is temporary. When you quit RStudio, all of the values
# stored in its memory are lost. (There is an option to save them, but I
# recommend that you don't grow dependent on it.)
### BERNOULLI DISTRIBUTION ###
# Here is a function for sampling from a Bernoulli-distributed random variable.
# Don't worry about how the code works.
rbern <- function(n, p) {
sample.int(2, n, prob=c(1 - p, p), replace=TRUE) - 1
}
# Here are some examples. Study them, to figure out what they mean.
rbern(100, 1 / 6)
sum(rbern(1000000, 1 / 6))
### GEOMETRIC DISTRIBUTION ###
# We know that if X ~ Geom(n, p), then P(X = k) = (1 - p)^k p. Here are some
# examples computed by hand.
p <- 1 / 3
(1 - p)^4 * p
p <- 0.001
(1 - p)^16 * p
# Here are the same probabilities computed using R's built-in knowledge of the
# geometric distribution.
dgeom(4, 1 / 3)
dgeom(16, 0.001)
# Let's focus on X ~ Geom(0.001). What is P(X <= 3)? Here are two ways to
# compute it.
dgeom(0, 0.001) + dgeom(1, 0.001) + dgeom(2, 0.001) + dgeom(3, 0.001)
pgeom(3, 0.001)
# How high does k need to be, for P(X <= k) to reach 95%? Change the first
# input to pgeom, trying various values, until you find an answer.
pgeom(450, 0.001)
# That task is so common in probability and statistics that R offers a shortcut.
qgeom(0.95, 0.001)
# This code produces a sample from the geometric distribution.
data <- rgeom(1000, 0.09)
data
# Here is a function for drawing a histogram, which shows how many of each
# possible value we get. Don't worry about how this code works.
discreteHistogram <- function(a, b, data) {
hist(
data,
breaks=seq(from=(a - 0.5), to=(b + 0.5), length.out=(b - a + 1)),
xaxp=c(a, b, b - a))
}
# Here is a picture of the sample from the geometric distribution.
discreteHistogram(min(data), max(data), data)
### BINOMIAL DISTRIBUTION ###
# Let X be the number of sixes you encounter when rolling a (fair, six-sided)
# die 12 times. You expect X to take values around 2. But what is the
# probability that you get exactly 2 sixes? Only about 30%, according to the
# binomial distribution.
choose(12, 0) * (5 / 6)^12 * (1 / 6)^0
choose(12, 1) * (5 / 6)^11 * (1 / 6)^1
choose(12, 2) * (5 / 6)^10 * (1 / 6)^2
# Here's how to compute the same numbers using R's built-in knowledge of the
# binomial distribution.
dbinom(0, 12, 1 / 6)
dbinom(1, 12, 1 / 6)
dbinom(2, 12, 1 / 6)
# Here are those three numbers again, made succinctly using sapply.
sapply(0:2, dbinom, 12, 1 / 6)
# How do we compute P(X = 0) + P(X = 1) + P(X = 2)? Here are three ways.
dbinom(0, 12, 1 / 6) + dbinom(1, 12, 1 / 6) + dbinom(2, 12, 1 / 6)
sum(sapply(0:2, dbinom, 12, 1 / 6))
pbinom(2, 12, 1 / 6)
# Explain the result of these commands.
sum(sapply(0:12, dbinom, 12, 1 / 6))
pbinom(12, 12, 1 / 6)
# Here's the PMF and a histogram of 10,000 random values of X. They match well.
plot(x=0:12, y=sapply(0:12, dbinom, 12, 1 / 6))
discreteHistogram(0, 12, rbinom(10000, 12, 1 / 6))
# Let's switch to X ~ Binom(100, 0.2). For which k is P(X <= k) = 0.95? Use
# trial and error.
pbinom(24, 100, 0.2)
pbinom(25, 100, 0.2)
pbinom(26, 100, 0.2)
pbinom(27, 100, 0.2)
pbinom(28, 100, 0.2)
# Here's the efficient way.
qbinom(0.95, 100, 0.2)
### OVERBOOKING YOUR #^(&!4& AIRPLANE FLIGHT ###
# In a policy called overbooking, airlines sell more seats than they have,
# counting on some of the passengers not to show up for their flights. Below,
# specify the probability of a passenger showing up for her flight, the
# capacity of the airplane, and the overbooking limit of the airline.
prob <- 0.88
capacity <- 100
limit <- 110
# Here's the probability that exactly one too many passengers show up.
k <- capacity + 1
choose(limit, k) * prob^k * (1 - prob)^(limit - k)
dbinom(k, limit, prob)
# Here are the probabilities that two and three too many show up.
k <- capacity + 2
dbinom(k, limit, prob)
k <- capacity + 3
dbinom(k, limit, prob)
# Here's the total probability of overbooking the flight.
1 - pbinom(capacity, limit, prob)
# Change the limit below, to try to get the total probability below 5%.
# Experiment. How much should the airline overbook?
limit <- 110
1 - pbinom(capacity, limit, prob)
### HYPERGEOMETRIC DISTRIBUTION ###
# Draw five cards from a standard deck, and let X be the number of aces. Then X
# ~ HGeom(4, 48, 5). Do the following six numbers seem reasonable?
dhyper(0, 4, 48, 5)
dhyper(1, 4, 48, 5)
dhyper(2, 4, 48, 5)
dhyper(3, 4, 48, 5)
dhyper(4, 4, 48, 5)
dhyper(5, 4, 48, 5)
### LEGALIZE IT? ###
# Suppose that, at a college of 2,000 students, 613 of them support
# legalization of marijuana. Let X be the number of legalization supporters in
# a survey of 200 students. Then X ~ HGeom(613, 2000 - 613, 200). Here are the
# probabilities of a few values of X.
dhyper(60, 613, 2000 - 613, 200)
dhyper(61, 613, 2000 - 613, 200)
dhyper(62, 613, 2000 - 613, 200)
# In a more realistic problem, the number of supporters is unknown, and we
# conduct the survey in the hopes of ascertaining that number. Suppose that
# 89 of the 200 students surveyed support legalization. By trial and error,
# find the value of the variable _supporters_ that maximizes the probability of
# observing 89 supporters. (This approach to estimating a parameter of a
# population is called maximum likelihood estimation. It is common in
# statistics.)
supporters <- 882
dhyper(89, supporters, 2000 - supporters, 200)
### NEGATIVE BINOMIAL ###
# Remember that X ~ NBin(r, p) models the number of failures before the rth
# success in repeated Bernoulli trials, each with probability p. For the sake
# of examples, let's focus on r = 5 and p = 1 / 3.
# Here is P(X = 20) computed in two ways.
choose(20 + 5 - 1, 5 - 1) * (1 / 3)^5 * (1 - 1 / 3)^20
dnbinom(20, 5, 1 / 3)
# Here is P(X > 20) computed in two ways.
1 - sum(sapply(0:20, dnbinom, 5, 1 / 3))
1 - pnbinom(20, 5, 1 / 3)
# Here's a sample.
data <- rnbinom(100, 5, 1 / 3)
data
discreteHistogram(min(data), max(data), data)
### HOMEWORK ###
# A. In the airline overbooking problem, show your solution. That is, you
# should show a few lines of code, with their resulting output, that
# demonstrate how the limit should be set. Explain if necessary.
# B. In the hypergeometric problem, where the number of marijuana legalization
# supporters is unknown, show your solution. That is, you should show a few
# lines of code, with their resulting output, that demonstrate the maximum
# likelihood estimate. Explain if necessary.
# When I borrow a book from another Math/Stats professor, I sometimes return
# the book by placing it in their mailbox in the Math/Stats department. This is
# a little risky, because the book could be stolen. I do it 25 times with no
# problems. On the 26th time, the book is stolen. Under certain assumptions,
# this situation can be modeled using the geometric distribution.
# C. Use trial and error in R to find the parameter value p that maximizes the
# probability of the result observed above. Show a succinct version of your
# code and the answer(s) that it gives you. If necessary, explain.
# D. Solve the same problem using calculus.
# E. For X ~ NBin(5, 1 / 3), find k such that P(X <= k) = 95%. Do it in two
# ways: trial and error with pnbinom, and more simply with qnbinom.