2010 April 5 / j d a v i s @ c a r l e t o n . e d u

# Day 04 Notes

## Basic Plots and Summary Statistics

Here's the Titanic categorical data (number of passengers in each class) from our textbook. Copy and paste (or, even better, type) each line into the S+ or R command window.

```
data <- c(325, 285, 706, 885)
data[1]
data[2]
summary(data)
barplot(data)
```
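Because these are counts of a categorical variable, proportions and bar labels are often more informative than `summary()`. Here is a minimal sketch. The category labels are an assumption (the notes do not say which count belongs to which class), so check them against the textbook before relying on them.

```r
# Counts from the notes above.
data <- c(325, 285, 706, 885)

# ASSUMPTION: these labels and their order are illustrative, not from the notes.
labels <- c("First", "Second", "Third", "Crew")

# Proportion of the total in each category.
props <- data / sum(data)
round(props, 3)

# Bar plot with one labeled bar per category.
barplot(data, names.arg = labels, ylab = "Number of people")
```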

Here are the heights, in inches, from our introductory class survey. Copy and paste each line into the command window.

```
heights <- c(59, 61, 64, 64, 65, 65, 66, 67, 67, 67, 67, 67, 68, 69, 69, 70, 70, 70, 70, 71, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 76)
hist(heights)
boxplot(heights)
```

Here they are, broken down by gender. I also give the sample standard deviation for the female heights.

```
femaleHeights <- c(59, 61, 64, 65, 65, 66, 67, 67, 67, 68, 70)
maleHeights <- c(64, 67, 67, 69, 69, 70, 70, 70, 71, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 76)
boxplot(femaleHeights, maleHeights)
sd(femaleHeights)
```
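To see what `sd()` is doing, here is a sketch that computes the sample standard deviation by hand and compares it with the built-in function. The names `n`, `deviations`, and `sdByHand` are introduced here for illustration only.

```r
femaleHeights <- c(59, 61, 64, 65, 65, 66, 67, 67, 67, 68, 70)

n <- length(femaleHeights)

# Distance of each height from the sample mean.
deviations <- femaleHeights - mean(femaleHeights)

# The sample variance divides by n - 1, not n; take its square root.
sdByHand <- sqrt(sum(deviations^2) / (n - 1))

sdByHand
sd(femaleHeights)

# all.equal() compares the two results up to a small numerical tolerance.
all.equal(sdByHand, sd(femaleHeights))
```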

## Reshaping Data

Here are the estimates of Brazil's population (in millions) from our introductory class survey.

```
brazil <- c(0.225, 1, 10, 18, 25, 75, 80, 90, 100, 100, 100, 140, 150, 175, 180, 200, 200, 200, 200, 200, 200, 200, 209, 210, 265, 300, 300, 300, 310, 350, 450, 545)
hist(brazil)
boxplot(brazil)
```

The datum 545 shows up as an outlier, whereas the datum 0.225 does not. What do you think of this?
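One way to explore this question is to compute the cutoffs that the boxplot uses. The sketch below assumes R's default rule: a datum is drawn as an outlier if it lies more than 1.5 box lengths beyond either end of the box, where the box ends are the hinges from `fivenum()`. S+ or the textbook may compute quartiles slightly differently. The names `five`, `boxLength`, `lowerFence`, and `upperFence` are introduced here for illustration.

```r
brazil <- c(0.225, 1, 10, 18, 25, 75, 80, 90, 100, 100, 100, 140, 150, 175,
            180, 200, 200, 200, 200, 200, 200, 200, 209, 210, 265, 300, 300,
            300, 310, 350, 450, 545)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(brazil)
boxLength <- five[4] - five[2]

# Cutoffs beyond which a datum is drawn as an outlier.
lowerFence <- five[2] - 1.5 * boxLength
upperFence <- five[4] + 1.5 * boxLength
c(lowerFence, upperFence)

# Data beyond the fences; boxplot.stats() applies the same rule.
brazil[brazil < lowerFence | brazil > upperFence]
boxplot.stats(brazil)$out
```

The lower fence comes out negative, so no population estimate, however small, can be flagged on the low side.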

The data are quite skewed to the right. We can roughly fix this by replacing each datum with its logarithm (to the base 10). Then where are the outliers?

```
logBrazil <- log10(brazil)
summary(logBrazil)
sd(logBrazil)
hist(logBrazil)
```
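To see where the outliers fall after the transformation, the boxplot rule can be applied to the logged data. This sketch assumes R's default rule as implemented in `boxplot.stats()`; `logOutliers` is a name introduced here for illustration.

```r
brazil <- c(0.225, 1, 10, 18, 25, 75, 80, 90, 100, 100, 100, 140, 150, 175,
            180, 200, 200, 200, 200, 200, 200, 200, 209, 210, 265, 300, 300,
            300, 310, 350, 450, 545)
logBrazil <- log10(brazil)

# Boxplot of the logged data, for comparison with boxplot(brazil).
boxplot(logBrazil)

# Values that the default boxplot rule flags on the log scale.
logOutliers <- boxplot.stats(logBrazil)$out
logOutliers

# Undo the logarithm to see which original estimates these are.
10^logOutliers
```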

## Normal Distribution

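`pnorm()` converts a z-score to a probability (the proportion of values below it), and `qnorm()` converts a probability back to a z-score. For example, here we learn that about 16% of values have z-score less than -1. This agrees with the 68% rule: 68% of values lie within one standard deviation of the mean, so the remaining 32% is split evenly between the two tails.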

```
pnorm(-1)
pnorm(-1, 0, 1)
qnorm(0.16)
qnorm(0.16, 0, 1)
```
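The following sketch checks that the two functions undo each other and recovers the 68% rule directly. The names `p` and `within1` are introduced here for illustration.

```r
# pnorm() maps a z-score to the proportion of values below it.
p <- pnorm(-1)
p

# qnorm() is the inverse: it recovers the z-score from that proportion.
qnorm(p)

# The 68% rule: proportion within one standard deviation of the mean.
within1 <- pnorm(1) - pnorm(-1)
within1

# The remainder splits evenly between the two tails.
(1 - within1) / 2
```

Note that `qnorm(0.16)` is close to, but not exactly, -1, because 0.16 is a rounded version of `pnorm(-1)`.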

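You can also supply your own mean and standard deviation. For example, here we learn that about 98% of values in N(24, 6) are less than 36, which has z-score (36 - 24) / 6 = 2. This agrees with the 95% rule: about 95% of values lie within two standard deviations of the mean, leaving about 2.5% in each tail.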

```
pnorm(36, 24, 6)
qnorm(0.9772499, 24, 6)
```
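Supplying a mean and standard deviation is equivalent to standardizing by hand and using the standard normal. A short sketch, with `mu`, `sigma`, `x`, and `z` introduced here for illustration:

```r
mu <- 24
sigma <- 6
x <- 36

# Standardize: how many standard deviations is x above the mean?
z <- (x - mu) / sigma
z

# These two calls give the same proportion.
pnorm(x, mu, sigma)
pnorm(z)

# The 95% rule: proportion within two standard deviations of the mean.
pnorm(2) - pnorm(-2)
```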

Here we do a quantile-quantile plot (probability plot) of the male height data from our introductory class survey. They don't look very close to normal; why?

```
maleHeights <- c(64, 67, 67, 69, 69, 70, 70, 70, 71, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 76)
qqnorm(maleHeights)
```
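A reference line makes departures from normality easier to judge. The sketch below adds one with `qqline()` and tabulates the data. Ties from recording heights in whole inches are one thing to look for, though they may not be the whole answer. The name `qq` is introduced here for illustration.

```r
maleHeights <- c(64, 67, 67, 69, 69, 70, 70, 70, 71, 72, 72, 72, 72, 72, 72,
                 72, 73, 73, 73, 74, 76)

# qqnorm() draws the plot and invisibly returns the plotted coordinates.
qq <- qqnorm(maleHeights)

# Reference line through the first and third quartiles.
qqline(maleHeights)

# How many students reported each whole-inch height?
table(maleHeights)
```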

## Scatterplots and Correlation

Here are the economic and social views from our introductory class survey. The data are paired; for example, economic datum 17 comes from the same student as social datum 17. Why does the scatterplot look so bad?

```
x <- c(3, 4, 4, 3, 3, 2, 2, 4, 2, 2, 3, 4, 4, 3, 3, 2, 2, 2, 4, 3, 2, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4)
y <- c(4, 5, 5, 3, 4, 3, 4, 4, 3, 5, 5, 4, 5, 4, 4, 2, 3, 4, 4, 5, 2, 3, 3, 5, 4, 5, 4, 4, 3, 4, 4, 4)
plot(x, y)
```
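With only a few possible values on each axis, many students land on the same point. Here is a sketch of two ways to reveal that: count the pairs with `table()`, or spread coincident points apart with `jitter()`. The axis labels assume that `x` holds the economic views and `y` the social views, which the notes imply but do not state.

```r
x <- c(3, 4, 4, 3, 3, 2, 2, 4, 2, 2, 3, 4, 4, 3, 3, 2, 2, 2, 4, 3, 2, 4, 4,
       4, 3, 4, 4, 4, 3, 4, 4, 4)
y <- c(4, 5, 5, 3, 4, 3, 4, 4, 3, 5, 5, 4, 5, 4, 4, 2, 3, 4, 4, 5, 2, 3, 3,
       5, 4, 5, 4, 4, 3, 4, 4, 4)

# Count how many students share each (x, y) pair.
counts <- table(x, y)
counts

# jitter() adds a little random noise so that coincident points separate.
# ASSUMPTION: x is the economic view and y is the social view.
plot(jitter(x), jitter(y), xlab = "Economic view", ylab = "Social view")

# The correlation is computed from the original, unjittered data.
cor(x, y)
```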

Here are population data (in millions) for the USA over a 200-year period. Describe them. How could you reshape them profitably?

```
years <- c(1800, 1850, 1900, 1950, 2000)
pops <- c(5, 23, 76, 151, 285)
plot(years, pops)
```
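One candidate reshaping, by analogy with the Brazil example, is to take the logarithm of the populations. The sketch below tries it and compares how linear the two plots are, using the correlation as a rough guide. Whether this is the most profitable reshaping is left open. The name `logPops` is introduced here for illustration.

```r
years <- c(1800, 1850, 1900, 1950, 2000)
pops <- c(5, 23, 76, 151, 285)

# Replace each population with its base-10 logarithm.
logPops <- log10(pops)

plot(years, logPops, ylab = "log10(population)")

# Correlation with year, before and after the transformation.
cor(years, pops)
cor(years, logPops)
```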

Here are data on plane flights from Atlanta to various destinations.

```
distances <- c(568, 933, 720, 1190, 602, 683, 1719, 589, 327, 894, 419, 749, 749, 392, 657, 461, 1565, 2150)
fares <- c(219, 222, 249, 308, 249, 141, 252, 229, 183, 209, 199, 248, 301, 238, 205, 232, 371, 343)
plot(distances, fares)
```
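To put a number on the pattern in this scatterplot, here is a sketch that computes the correlation and checks two of its basic properties. The axis labels omit units because the notes do not state them, and `r` is a name introduced here for illustration.

```r
distances <- c(568, 933, 720, 1190, 602, 683, 1719, 589, 327, 894, 419, 749,
               749, 392, 657, 461, 1565, 2150)
fares <- c(219, 222, 249, 308, 249, 141, 252, 229, 183, 209, 199, 248, 301,
           238, 205, 232, 371, 343)

plot(distances, fares, xlab = "Distance", ylab = "Fare")

# Correlation between distance and fare.
r <- cor(distances, fares)
r

# Swapping the variables does not change the correlation.
cor(fares, distances)

# Neither does rescaling one variable by a positive constant.
cor(distances / 100, fares)
```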