2010 May 19 / j d a v i s @ c a r l e t o n . e d u

Carleton College Math 215, Spring 2010, Prof. Joshua R. Davis

In this course, much of your grade is determined by in-class exams, where speed is valued over thoroughness and quality of presentation. Another big part of your grade is determined by homework; there, you should have time to stretch out and produce a polished product, but many students cannot or do not take the time, and anyway the questions are small.

The goal of the course project is for you to do one high-quality, polished, thorough, insightful piece of statistical work. Your project demonstrates not just basic competence with the course material, but also an understanding of what kind of questions a statistician asks about data, how statistical methods answer those questions, and how the answers are used in support of an argument.

There are two big stages to the project. In the first stage, you select a problem, collect your data, and meet with me to get them approved. The meetings will be scheduled soon. In the second stage, you analyze your data, obtain inferences and conclusions, and write them up in a well-written, polished paper with graphics, tables, etc. The paper is due at 5:00 PM on the last day of classes.

If your combined score from our first two exams is greater than 80/120, then you are strongly encouraged to complete the project with a partner. You are not required to have a partner, but I recommend it, because you can then undertake a more ambitious and interesting project than you could if you were working alone. You cannot have more than one partner. You and your partner are expected to contribute equally to the project.

If your combined score from our first two exams is less than 80/120, then you are required to complete the project on your own. The main reason is that the project gives you a chance to demonstrate that you have learned the material, despite your early exam grades. If you need someone to bounce ideas off of, then talk to me or another student. I understand that your project cannot be as ambitious as one completed by two students, but it must still be a polished, thorough, significant piece of work.

The choice of problem is largely up to you. You choose the problem based on your interests, your ability to find relevant data, and your ability to apply the course material. I am impressed by creativity in posing an interesting problem and resourcefulness in obtaining data to address that problem. Here are some examples of problems from past years' projects.

- How does the Miles of Smiles program affect the dental health of low-income Northfield residents?
- What relationships exist among the manufacturing choices (alcohol content, etc.) and the marketing choices (container size, etc.) of beer?
- What personal characteristics affect performance in the National Hockey League?
- Who sells cheaper textbooks: the Carleton bookstore or on-line sellers?
- Chocolate chip cookies: Is there a relationship between price and taste?
- How have changes in Carleton's admission policies affected student demographics?

Of course, the crux of your problem is the *why* and the *who* — that is, the population, the sample, and what you want to infer about the population from the sample. I have two explicit requirements.

- Your subjects cannot be the countries of the world or subnational divisions thereof, such as the states of the USA. Why? Because then there is no larger population about which to draw inferences.
- Your subjects cannot be time periods, such as months, years, or decades. Why? Because in this course we have not learned the techniques that are most useful for such data, such as time series.

Once you've formulated a problem, then you need to obtain a data set that is sufficiently rich to address the problem. These requirements give you an idea of the minimum amount of work that is acceptable.

- Your data must include at least 30 observations, at least one categorical variable, and at least two quantitative variables.
- Your paper will have to include descriptive summary statistics (e.g. five-number summaries, means) and descriptive graphics (e.g. bar charts, histograms).
- Your paper will have to include at least one description of a relationship between variables (e.g. scatterplots, side-by-side boxplots).
- Your paper will have to employ a confidence interval somewhere.
- Your paper will have to employ a hypothesis test somewhere.
- Your paper will have to employ a least-squares fit somewhere.
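
To give a feel for the smallest of these computations, here is a minimal Python sketch of a five-number summary and a least-squares fit with residuals. It is only an illustration under stated assumptions: the function names and the tiny data set are invented for this example (a real project needs at least 30 observations), and the quartile convention used here may differ from the one in our textbook or in your statistics software.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Quartile conventions vary; this uses Python's "inclusive" method,
    which is an assumption and may not match your textbook exactly.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def least_squares_fit(xs, ys):
    """Return (slope, intercept, r_squared) for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def residuals(xs, ys, slope, intercept):
    """Observed y minus predicted y, one value per observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Invented toy data, far too small for a real project.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

summary = five_number_summary(xs)
slope, intercept, r_squared = least_squares_fit(xs, ys)
resid = residuals(xs, ys, slope, intercept)

print("five-number summary of x:", summary)
print("fitted line: y =", round(intercept, 2), "+", round(slope, 2), "* x")
print("R^2:", round(r_squared, 3))
print("residuals:", [round(r, 2) for r in resid])
```

The residuals are returned as a list so that they can be plotted against the explanatory variable, which is how you would judge the quality of the fit in your paper.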

Where do you find such data? You could conduct your own experiment or survey. Some of the previous years' projects did, but I don't recommend it, because it takes so much work. Instead, consider the various databases that are scattered around the Internet and in libraries. The Carleton library has a list of resources. Here is a small list, to give you an idea.

- US Census data
- Minnesota Population Center demographics
- Northfield and Rice County data
- US election data (on-campus access only?)
- Minnesota's 2008 Election data
- Edina Realty real estate data
- USGS water data
- NOAA earth/environment data
- USDA agricultural data (try here too)
- US college data

The data sets found in textbooks and at the Data and Story Library are cleaned up and simplified. You may not use such data sets. You're supposed to be working with real data, which are messy, inconvenient, and educational.

What if you can't find data relevant to your problem? Keep trying; it takes some work. But if you really can't find data, then switch to a different problem and try to find data for it.

Once you have found your problem and data, type up a short (one page, say) description that demonstrates that your problem and data fulfill the requirements of the project. Do not list your data, but do tell me exactly what kind of data you have (e.g. age, gender, income, and blood pressure for a sample of 1023 Navy Seals).

Bring your description to our meeting. In our meeting, you and your partner present your proposal, probably at a chalkboard. The meeting is at most 15 minutes long, so prepare what you are going to say carefully. Writing the description should help you prepare. It also helps me follow along as you talk, and serves as my record of what we agreed to in the meeting.

Here is a suggested structure for your paper. I don't recommend deviating far from this format.

- Title: Essentially, your title should be an extremely short description of your project. That is, a title must be informative. If you like, you may also have an entertaining or catchy title/subtitle. However, puns on *A Tale of Two Cities* are strictly forbidden.
- Introduction: What is your paper about? What's the population? What's the problem? Why is it interesting and useful? What background information do you have (e.g. from older studies)? Define any terms that a typical reader (say, a classmate) wouldn't know, such as *recidivism* in a paper about the prison system.
- Data: What are your data? Describe how, where, and when you obtained your data. List all sources; in theory, I should be able to locate the same data and replicate your data set. Describe *who* the subjects are and *what* variables you have for them. Note units of measurement. Do not list the data in this section. In fact, you do not need to list your data at all. If you do, then put them in an appendix.
- Results: What have you computed? Report your descriptive statistics, graphics, best-fit lines, correlation coefficients, *R*², confidence intervals, *P*-values, etc. Do not show the details of arithmetic calculations. Your reader trusts that you can perform arithmetic correctly; he just wants the answers.
- Discussion: What do your results mean? Interpret them in plain English. What inferences can you make? Do you need to make additional assumptions, to make those inferences? Do the results confirm or contradict previous studies? Can you determine causal relationships? If you wish to speculate on reasons or causes, make it clear that you are speculating, rather than reporting objective facts. What can you *not* tell about your data? How could your study be improved?
- Conclusion: What was your problem/question, and what was its solution/answer? What have we learned from the study? This should not be long: one paragraph, say. It serves the same purpose as the abstract or executive summary in other genres of writing.

Make sure that your paper satisfies the basic requirements laid out above (at least one hypothesis test, at least one least-squares fit, etc.). Beyond the basic requirements, here are some points to keep in mind.

- I am interested in whether you know how to apply statistical methods appropriately, in service to your argument/solution. You have to exercise judgment in what you say and what you don't say. For example, should you state the standard deviation of every variable in your data set? No, because it is inappropriate for categorical variables. So should you state it for every quantitative variable? If there are outliers or asymmetry, then not without discussion. And maybe not at all, if you never use that information to advance your case anyway. Your goal is always to write a strong, polished paper that solves a problem convincingly, not to throw every possible statistic at your reader.
- Use technical language correctly. For example, do not talk about the "correlation" between two categorical variables. If you're describing an experiment, then call it an experiment rather than a "careful scientific study". If you're computing a standard error, then do not call it a "standard deviation".
- Don't forget to discuss whether conditions are met for employing various techniques (e.g. the normal approximation to the binomial model). Don't forget to comment on outliers, skewness, etc. Don't forget to follow techniques to their end. For example, any linear regression should include a residuals plot and commentary on the quality of the fit. Any hypothesis test should include commentary on the effect size.
- Remember not to throw away digits in the middle of calculations; this causes later calculations to be less precise. However, do round judiciously, following our textbook's advice, in the final answers that you report to your reader.
- Figures and tables should be embedded in the text, rather than in appendices. Every figure and table must be numbered. Your text must refer to every figure and table. If a figure or table is not referred to in the text, then it is serving no purpose and should be omitted. If you need that figure or table to satisfy the basic requirements of the project, then you can't omit it; you'd better find some use for it in the text.
- On graphs, don't forget to label each axis (including units) and mark the scale on each axis. If you're comparing two plots, then make them have the same scale.
- Whenever you introduce notation, clearly define what it means; for example, "Let *p* be the proportion of American adults who think that Barack Obama is Muslim." Then you do not need to define what *p̂* (*p*-hat) is, because the hat part is standard notation for the value of *p* arising from a sample. Wherever possible, your notation should agree with the standard statistical notation that we've used in the course.
- Variables such as *p* and *X* should be italicized. Never put quotation marks around them, as in "*p*" or "*X*". Things that are not variables should not be italicized; this includes constants such as 3 and punctuation such as (.
- Write in an active voice rather than a passive one. You may use the first person (e.g. "I", "we").
- Proofread your paper. It is expected to be organized, coherent, and logical. Even if it fails on those counts, it can still be clear of spelling and grammar mistakes. Make sure that your statements say exactly what you mean to say.
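
Several of the points above (checking conditions, calling a standard error a standard error, commenting on effect size, and rounding only at the end) can be seen together in one small calculation. The following Python sketch is an illustration only, not a required method. The function name, the counts, and the null value are all invented for the example, and the threshold of 10 in the condition check is a common rule of thumb that I am assuming here; use whatever condition our textbook states.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_inference(successes, n, p0, confidence=0.95):
    """Normal-approximation confidence interval and two-sided test for a proportion."""
    p_hat = successes / n  # sample proportion; keep full precision here

    # Condition check (assumed threshold of 10 expected successes and failures).
    conditions_met = (
        min(n * p0, n * (1 - p0)) >= 10
        and min(successes, n - successes) >= 10
    )

    # Confidence interval uses the standard error (estimated from p_hat).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    interval = (p_hat - z_star * standard_error, p_hat + z_star * standard_error)

    # Hypothesis test uses the standard deviation under the null value p0.
    sd_under_null = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd_under_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {
        "p_hat": p_hat,
        "conditions_met": conditions_met,
        "interval": interval,
        "z": z,
        "p_value": p_value,
        "effect_size": p_hat - p0,  # how far the sample is from p0, in plain units
    }


# Invented counts: 120 "yes" answers out of 400, tested against p0 = 0.25.
result = one_proportion_inference(successes=120, n=400, p0=0.25)

# Round only when reporting, never in the middle of the calculation.
low, high = result["interval"]
print("conditions met:", result["conditions_met"])
print("p-hat:", round(result["p_hat"], 3))
print("95% CI:", (round(low, 3), round(high, 3)))
print("P-value:", round(result["p_value"], 3))
print("effect size:", round(result["effect_size"], 3))
```

Note that the function returns unrounded values and leaves all rounding to the final print statements, so later calculations that reuse these numbers lose no precision.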

Your paper is due (in my hands or in my mailbox in the Mathematics Department) at 5:00 PM on the last day of classes.

Three examples of student papers from earlier terms have been made available to you on the Courses file server. Directions for connecting to Courses can be found here. Once you are connected to Courses, navigate to the folder for this course (Spring 2010, Math 215-03, etc.). Then look in the Course Materials folder.