2010 April 12 / j d a v i s @ c a r l e t o n . e d u
Here are data on plane flights from Atlanta to various destinations. The data are paired; for example, distance #8 and fare #8 go together.
distances <- c(568, 933, 720, 1190, 602, 683, 1719, 589, 327, 894, 419, 749, 749, 392, 657, 461, 1565, 2150) fares <- c(219, 222, 249, 308, 249, 141, 252, 229, 183, 209, 199, 248, 301, 238, 205, 232, 371, 343) plot(distances, fares) cor(distances, fares) cor(fares, distances)
Here are the paired economic and social views from our introductory class survey. Why does the scatterplot look so bad?
x <- c(3, 4, 4, 3, 3, 2, 2, 4, 2, 2, 3, 4, 4, 3, 3, 2, 2, 2, 4, 3, 2, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4) y <- c(4, 5, 5, 3, 4, 3, 4, 4, 3, 5, 5, 4, 5, 4, 4, 2, 3, 4, 4, 5, 2, 3, 3, 5, 4, 5, 4, 4, 3, 4, 4, 4) boxplot(x, y) plot(x, y) cor(x, y)
Here are paired data about the temperature and gas usage at Carleton, over many months since 1993.
tempData <- c(11.2, 13.95, 26.05, 41.4, 55.75, 63.85, 69.35, 69.1, 54.15, 45.75, 30.15, 20.75, 3.5, 8.85, 32, 44.05, 58.3, 68.4, 67.85, 66, 62.6, 51.45, 36.6, 24.75, 15.9, 17.9, 33.2, 39.9, 54.15, 69.9, 71.35, 73.5, 57.95, 46.85, 25.5, 17.9, 8.1, 15.5, 24.1, 39.45, 53.05, 67.1, 68.35, 67.9, 59.7, 48.5, 24.65, 13.35, 8.45, 19.45, 28.8, 41.3, 51.2, 69.05, 70.25, 66.6, 61.8, 49.05, 27.7, 26.65, 17.8, 30.6, 30.05, 48.55, 63.65, 64.5, 71.4, 69.9, 64.85, 50.4, 35.85, 25.45, 10.65, 26.65, 32.1, 46.85, 59.35, 66.15, 74.25, 68.5, 58.7, 46.5, 40.9, 23.05, 12.85, 25.7, 39.35, 44.95, 58.95, 65.4, 70.25, 69.95, 57.9, 50.4, 30.1, 6.5, 17.55, 9.3, 24.6, 46.55, 58.6, 67, 73.05, 71.2, 58.4, 46.2, 44.95, 27.4, 22.55, 25.3, 23.85, 44.8, 52.45, 69.7, 74.95, 64.6, 64, 42.3, 31.65, 23.25, 14.7, 16.1, 30.4, 43.95, 49.15, 65.85, 71.5, 72.1, 61, 49, 31.85, 23.75, 12.1, 19.2, 35.05, 47, 55.8, 64.05, 69.45, 63.5, 65.05, 49.5, 22.45, 14.5, 25.1, 28.55, 49.1, 71.15, 73.55) gasData <- c(18354, 16253, 12568, 11507, 7619, 4599, 2976, 2811, 6432, 10383, 14432, 18660, 21779, 19140, 14469, 11473, 7201, 4459, 4304, 3951, 5578, 9681, 12562, 16439, 19608, 17690, 14915, 12063, 8164, 4522, 4055, 2120, 7385, 11857, 17299, 18508, 23073.35, 14236.12, 19486.33, 14908.78, 7652.74, 6710.57, 6881.6, 5977.88, 9241.58, 13276.98, 15871.2, 20190.4, 21390.61, 16459.17, 15216.43, 12414.2, 9174.79, 4921.36, 4688.6, 3659.39, 6142.84, 10860.78, 15501.82, 15037.31, 18327.57, 14396.71, 15342.93, 10421.58, 6325, 5107.56, 4307.07, 3322.4, 6198.5, 9963.14, 13283.51, 20794.34, 21616.32, 15467.41, 14503.98, 11467.98, 7886.52, 5260.38, 4542.87, 3599.68, 7532.32, 11175.52, 12355.51, 15544.4, 21365.88, 16694.79, 14333.77, 12832.35, 7753, 5636.68, 5983.95, 4141.05, 7920.99, 10572.89, 17285.17, 29294.89, 22652.81, 23042.4, 17577.11, 12192.21, 8445.18, 5961.45, 5091.5, 4314.03, 8383.15, 12165.54, 11622.31, 16712.33, 20275.5, 16346.61, 18077.44, 13152.01, 10469.45, 6125.39, 4590.26, 4406, 6742.67, 13411.37, 15307.62, 16122.17, 22643.72, 20550.03, 16432.25, 12829.56, 9670.17, 6083.08, 4958.8, 4109.72, 7895.8, 11716.04, 16207.71, 17630.74, 26448.97, 31438.86, 30034.25, 11513.01, 9310.7, 5987.74, 5088.64, 4528.7, 6422.42, 11997.3, 13791.82, 17742.98, 19169.25, 14978, 11119, 5455.6, 4699.16) plot(tempData, gasData) cor(tempData, gasData)
To fit a linear model, we put the data into a data frame, feed the data frame to the lm
function to find the linear relationship between gas
and temp
, and then inspect the results. We get gas = -285.1 temp + 24538.5
.
tempGasData <- data.frame(temp=tempData, gas=gasData) tempGasModel <- lm(gas ~ temp, data=tempGasData) tempGasModel plot(tempGasModel)
The two data with highest gas values seem to be outliers. We can edit them quickly by calling up an editor, finding those two data, and deleting their rows. Now we get gas = -277.6 temp + 24015.0
.
newTempGasData <- edit(tempGasData) newTempGasModel <- lm(gas ~ temp, data=newTempGasData) newTempGasModel plot(newTempGasModel)
You may have noticed that crickets chirp loudly on hot summer evenings. Actually, I don't know whether they chirp loudly when it's hot, but it does appear that they chirp frequently when its hot. Here are data on temperature (in Fahrenheit) and chirps per minute of crickets.
tempData <- c(50, 55, 60, 65, 70, 75, 80, 85, 90) chirpData <- c(20, 46, 79, 91, 113, 140, 173, 198, 211) plot(tempData, chirpData)
The data look linear, so let's fit a linear model.
tempChirpData <- data.frame(temp=tempData, chirps=chirpData) tempChirpModel <- lm(chirps ~ temp, data=tempChirpData) tempChirpModel plot(tempChirpModel)
What happens to the cricket chirping rate when the temperature is below 40? When it's above 90?
Here are the estimates of Brazil's population (in millions) from our introductory class survey:
brazil <- c(0.225, 1, 10, 18, 25, 75, 80, 90, 100, 100, 100, 140, 150, 175, 180, 200, 200, 200, 200, 200, 200, 200, 209, 210, 265, 300, 300, 300, 310, 350, 450, 545) hist(brazil) boxplot(brazil)
As we saw on Day 04, reshaping the data by taking the logarithm helps the symmetry, and helps us understand the outliers.
logBrazil <- log10(brazil) hist(logBrazil) boxplot(logBrazil)
Here are paired population data for the USA over a 200-year period. Describe it. How good of a linear relationship is there?
years <- c(1800, 1850, 1900, 1950, 2000) pops <- c(5, 23, 76, 151, 285) plot(years, pops) cor(years, pops)
Let's reshape those data.
Let's try again.logPops = log10(pops) plot(years, logPops) cor(years, logPops)
sqrtPops = sqrt(pops) plot(years, sqrtPops) cor(years, sqrtPops)
Here are paired average SAT scores for the 50 states.
verbalData <- c(565, 521, 525, 566, 495, 536, 507, 508, 489, 498, 484, 485, 543, 564, 494, 590, 579, 549, 559, 504, 507, 507, 557, 582, 569, 570, 546, 567, 508, 520, 498, 554, 497, 490, 596, 536, 566, 523, 498, 501, 480, 574, 563, 495, 583, 506, 507, 519, 526, 577, 544) mathData <- c(558, 513, 521, 550, 511, 538, 504, 495, 473, 496, 477, 510, 536, 575, 494, 600, 571, 544, 550, 498, 504, 504, 565, 593, 557, 569, 547, 568, 507, 514, 505, 548, 499, 486, 599, 535, 557, 521, 492, 491, 474, 566, 552, 500, 575, 500, 496, 519, 506, 586, 544)