2010 April 12 / j d a v i s @ c a r l e t o n . e d u

# Day 06 Notes

## Scatterplots and Correlation

Here are data on plane flights from Atlanta to various destinations. The data are paired; for example, distance #8 and fare #8 go together.

```distances <- c(568, 933, 720, 1190, 602, 683, 1719, 589, 327, 894, 419, 749, 749, 392, 657, 461, 1565, 2150)
fares <- c(219, 222, 249, 308, 249, 141, 252, 229, 183, 209, 199, 248, 301, 238, 205, 232, 371, 343)
plot(distances, fares)
cor(distances, fares)
cor(fares, distances)
```

Here are the paired economic and social views from our introductory class survey. Why does the scatterplot look so bad?

```x <- c(3, 4, 4, 3, 3, 2, 2, 4, 2, 2, 3, 4, 4, 3, 3, 2, 2, 2, 4, 3, 2, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4)
y <- c(4, 5, 5, 3, 4, 3, 4, 4, 3, 5, 5, 4, 5, 4, 4, 2, 3, 4, 4, 5, 2, 3, 3, 5, 4, 5, 4, 4, 3, 4, 4, 4)
boxplot(x, y)
plot(x, y)
cor(x, y)
```

## Linear Regression

Here are paired data about the temperature and gas usage at Carleton, over many months since 1993.

```tempData <- c(11.2, 13.95, 26.05, 41.4, 55.75, 63.85, 69.35, 69.1, 54.15, 45.75, 30.15, 20.75, 3.5, 8.85, 32, 44.05, 58.3, 68.4, 67.85, 66, 62.6, 51.45, 36.6, 24.75, 15.9, 17.9, 33.2, 39.9, 54.15, 69.9, 71.35, 73.5, 57.95, 46.85, 25.5, 17.9, 8.1, 15.5, 24.1, 39.45, 53.05, 67.1, 68.35, 67.9, 59.7, 48.5, 24.65, 13.35, 8.45, 19.45, 28.8, 41.3, 51.2, 69.05, 70.25, 66.6, 61.8, 49.05, 27.7, 26.65, 17.8, 30.6, 30.05, 48.55, 63.65, 64.5, 71.4, 69.9, 64.85, 50.4, 35.85, 25.45, 10.65, 26.65, 32.1, 46.85, 59.35, 66.15, 74.25, 68.5, 58.7, 46.5, 40.9, 23.05, 12.85, 25.7, 39.35, 44.95, 58.95, 65.4, 70.25, 69.95, 57.9, 50.4, 30.1, 6.5, 17.55, 9.3, 24.6, 46.55, 58.6, 67, 73.05, 71.2, 58.4, 46.2, 44.95, 27.4, 22.55, 25.3, 23.85, 44.8, 52.45, 69.7, 74.95, 64.6, 64, 42.3, 31.65, 23.25, 14.7, 16.1, 30.4, 43.95, 49.15, 65.85, 71.5, 72.1, 61, 49, 31.85, 23.75, 12.1, 19.2, 35.05, 47, 55.8, 64.05, 69.45, 63.5, 65.05, 49.5, 22.45, 14.5, 25.1, 28.55, 49.1, 71.15, 73.55)
gasData <- c(18354, 16253, 12568, 11507, 7619, 4599, 2976, 2811, 6432, 10383, 14432, 18660, 21779, 19140, 14469, 11473, 7201, 4459, 4304, 3951, 5578, 9681, 12562, 16439, 19608, 17690, 14915, 12063, 8164, 4522, 4055, 2120, 7385, 11857, 17299, 18508, 23073.35, 14236.12, 19486.33, 14908.78, 7652.74, 6710.57, 6881.6, 5977.88, 9241.58, 13276.98, 15871.2, 20190.4, 21390.61, 16459.17, 15216.43, 12414.2, 9174.79, 4921.36, 4688.6, 3659.39, 6142.84, 10860.78, 15501.82, 15037.31, 18327.57, 14396.71, 15342.93, 10421.58, 6325, 5107.56, 4307.07, 3322.4, 6198.5, 9963.14, 13283.51, 20794.34, 21616.32, 15467.41, 14503.98, 11467.98, 7886.52, 5260.38, 4542.87, 3599.68, 7532.32, 11175.52, 12355.51, 15544.4, 21365.88, 16694.79, 14333.77, 12832.35, 7753, 5636.68, 5983.95, 4141.05, 7920.99, 10572.89, 17285.17, 29294.89, 22652.81, 23042.4, 17577.11, 12192.21, 8445.18, 5961.45, 5091.5, 4314.03, 8383.15, 12165.54, 11622.31, 16712.33, 20275.5, 16346.61, 18077.44, 13152.01, 10469.45, 6125.39, 4590.26, 4406, 6742.67, 13411.37, 15307.62, 16122.17, 22643.72, 20550.03, 16432.25, 12829.56, 9670.17, 6083.08, 4958.8, 4109.72, 7895.8, 11716.04, 16207.71, 17630.74, 26448.97, 31438.86, 30034.25, 11513.01, 9310.7, 5987.74, 5088.64, 4528.7, 6422.42, 11997.3, 13791.82, 17742.98, 19169.25, 14978, 11119, 5455.6, 4699.16)
plot(tempData, gasData)
cor(tempData, gasData)
```

To fit a linear model, we put the data into a data frame, feed the data frame to the `lm` function to find the linear relationship between `gas` and `temp`, and then inspect the results. We get `gas = -285.1 temp + 24538.5`.

```tempGasData <- data.frame(temp=tempData, gas=gasData)
tempGasModel <- lm(gas ~ temp, data=tempGasData)
tempGasModel
plot(tempGasModel)
```

The two data with highest gas values seem to be outliers. We can edit them quickly by calling up an editor, finding those two data, and deleting their rows. Now we get `gas = -277.6 temp + 24015.0`.

```newTempGasData <- edit(tempGasData)
newTempGasModel <- lm(gas ~ temp, data=newTempGasData)
newTempGasModel
plot(newTempGasModel)
```

## Extrapolation

You may have noticed that crickets chirp loudly on hot summer evenings. Actually, I don't know whether they chirp loudly when it's hot, but it does appear that they chirp frequently when its hot. Here are data on temperature (in Fahrenheit) and chirps per minute of crickets.

```tempData <- c(50, 55, 60, 65, 70, 75, 80, 85, 90)
chirpData <- c(20, 46, 79, 91, 113, 140, 173, 198, 211)
plot(tempData, chirpData)
```

The data look linear, so let's fit a linear model.

```tempChirpData <- data.frame(temp=tempData, chirps=chirpData)
tempChirpModel <- lm(chirps ~ temp, data=tempChirpData)
tempChirpModel
plot(tempChirpModel)
```

What happens to the cricket chirping rate when the temperature is below 40? When it's above 90?

## Reshaping Data

Here are the estimates of Brazil's population (in millions) from our introductory class survey:

```brazil <- c(0.225, 1, 10, 18, 25, 75, 80, 90, 100, 100, 100, 140, 150, 175, 180, 200, 200, 200, 200, 200, 200, 200, 209, 210, 265, 300, 300, 300, 310, 350, 450, 545)
hist(brazil)
boxplot(brazil)
```

As we saw on Day 04, reshaping the data by taking the logarithm helps the symmetry, and helps us understand the outliers.

```logBrazil <- log10(brazil)
hist(logBrazil)
boxplot(logBrazil)
```

Here are paired population data for the USA over a 200-year period. Describe it. How good of a linear relationship is there?

```years <- c(1800, 1850, 1900, 1950, 2000)
pops <- c(5, 23, 76, 151, 285)
plot(years, pops)
cor(years, pops)
```

Let's reshape those data.

```logPops = log10(pops)
plot(years, logPops)
cor(years, logPops)
```
Let's try again.

```sqrtPops = sqrt(pops)
plot(years, sqrtPops)
cor(years, sqrtPops)
```

## More Data: SAT Scores

Here are paired average SAT scores for the 50 states.

```verbalData <- c(565, 521, 525, 566, 495, 536, 507, 508, 489, 498, 484, 485, 543, 564, 494, 590, 579, 549, 559, 504, 507, 507, 557, 582, 569, 570, 546, 567, 508, 520, 498, 554, 497, 490, 596, 536, 566, 523, 498, 501, 480, 574, 563, 495, 583, 506, 507, 519, 526, 577, 544)
mathData <- c(558, 513, 521, 550, 511, 538, 504, 495, 473, 496, 477, 510, 536, 575, 494, 600, 571, 544, 550, 498, 504, 504, 565, 593, 557, 569, 547, 568, 507, 514, 505, 548, 499, 486, 599, 535, 557, 521, 492, 491, 474, 566, 552, 500, 575, 500, 496, 519, 506, 586, 544)
```