Introduction

In this project you will process and analyze actual public data to answer questions of your choice. The goal of the project is to answer meaningful questions using data and the techniques you've learned in CS 88 and Data 8.

Logistics

This project is worth 20 points and is due on Tuesday, 12/6/16. You may work with one partner. You should not share your code with students who are not your partner or copy from anyone else's solutions. In the end, you will submit one project for both partners. Make sure to add your partner on OK. You should write all code and text for the project in a Jupyter notebook and then download the notebook in HTML format for the final submission.

You will turn in the following file:

  • data_analysis.html

You can submit by uploading this file to OK.

You will be able to view your submissions on the OK dashboard.

Task 1: Obtain the Data

Go to http://catalog.data.gov/dataset and select one or more datasets in the format(s) of your choice. We suggest working with data that is available in CSV format, as it can be easily processed. You are allowed to work with other data types as well, as long as you know how to read and process the data.

You may also work with multiple datasets in order to explore relationships between them, or just to have a greater amount of data to work with. We suggest that you use no more than a few datasets. Rather than adding complexity through the volume of data, we suggest adding complexity through your analysis of the data. Save a link to each of your datasets and include all links in the final Jupyter notebook.

Task 2: Pose Three Questions

Ask three questions of varying complexity (easy, medium, difficult) that you want to answer with the dataset. An easy question is something that can be answered by looking at one column of the data (e.g. what is the mean value of some column?). The medium and difficult questions should require more complex processing of the data in order to arrive at an answer. Many of the more difficult questions may not have concrete answers. You will not be penalized for not providing concrete answers, but you should still give sophisticated insight into the question and explain what makes it difficult to answer.

These guidelines leave a lot of room to exercise creativity and interest. We encourage you to come up with ideas by the final lab meeting (11/28/16) and discuss your ideas with a TA. As you are processing the data, you may find that you want to change your questions, which is fine. Clearly state all three questions in the Jupyter notebook.

Task 3: Read in the Data

Process the dataset through methods of your choice. You should start by reading the data into some sort of data structure. A great choice is the Table class from the datascience library used in Data 8. You can take a look at readings and assignments for Data 8 to see examples of Tables. You are also welcome to use external Python libraries to work with the data. Feel free to look up methods for reading and processing data files. Some helpful links are listed below.
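
As a rough sketch of what reading in the data might look like, the example below uses only Python's built-in csv module. The column names, the values, and the helper functions read_rows and column are made up for illustration; they are not part of any course library. In your notebook you would open your downloaded file instead of the in-memory sample, and you could just as well load it into a Data 8 Table.

```python
import csv
import io

# Made-up stand-in for a downloaded CSV file (illustration only).
# In your notebook, use open("your_dataset.csv", newline="") instead.
SAMPLE_CSV = """region,year,count
A,2010,100
B,2010,200
A,2015,300
"""

def read_rows(csv_file):
    """Read an open CSV file into a list of dicts, one dict per row."""
    return list(csv.DictReader(csv_file))

def column(rows, name, convert=float):
    """Pull one column out of the rows, converting each value from text."""
    return [convert(row[name]) for row in rows]

rows = read_rows(io.StringIO(SAMPLE_CSV))

# An "easy" question: what is the mean value of one column?
counts = column(rows, "count", int)
mean_count = sum(counts) / len(counts)
print(len(rows), "rows; mean count:", mean_count)
```

Keeping each row as a dict keyed by column name makes later steps, such as dropping or adding columns, straightforward. Note that csv.DictReader returns every value as text, so numeric columns need an explicit conversion as shown.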

Task 4: Process and Analyze the Data

Once you have read the data into some sort of data structure, the next task is to process and analyze this data. You may want to remove unnecessary columns, add additional columns, etc. We expect the analysis to be based on techniques you have learned in Data 8 and CS 88, listed below. However, you are not limited to these techniques. You may also use other techniques that were not taught in these courses. Describe all techniques used near the relevant code cells in the Jupyter notebook. You should have both code and appropriate captions/explanations in your notebook.

  • Map/reduce/filter
  • Iterative/recursive algorithms
  • Higher order functions
  • Table joins
  • Random sampling
  • Linear regression
  • Estimating confidence intervals
  • Hypothesis testing
  • Bootstrapping
  • Classification
  • Plots/histograms
  • Any other techniques learned in Data 8/CS 88
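
To give a feel for how a couple of the techniques above might appear in a notebook, here is a minimal sketch that combines filter/map/reduce with a percentile bootstrap confidence interval for a mean. The values are made up for illustration, and bootstrap_ci is a hypothetical helper written for this example, not a function from Data 8 or CS 88. It uses only the standard library; your own analysis may look quite different.

```python
import random
from functools import reduce

# Made-up measurements, for illustration only.
values = [12, 15, 9, 22, 17, 14, 30, 11, 19, 16]

# Filter / map / reduce: sum of the squares of the values greater than 12.
large = filter(lambda v: v > 12, values)
squares = map(lambda v: v * v, large)
sum_of_squares = reduce(lambda a, b: a + b, squares, 0)

def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)

def bootstrap_ci(sample, statistic, repetitions=2000, level=0.95, seed=0):
    """Approximate confidence interval via the percentile bootstrap.

    Resamples the sample with replacement, recomputes the statistic each
    time, and returns the middle `level` fraction of those estimates.
    """
    rng = random.Random(seed)  # fixed seed so results are repeatable
    estimates = sorted(
        statistic(rng.choices(sample, k=len(sample)))
        for _ in range(repetitions)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * repetitions)]
    upper = estimates[int((1 - tail) * repetitions) - 1]
    return lower, upper

low, high = bootstrap_ci(values, mean)
print("sum of squares of values > 12:", sum_of_squares)
print("sample mean:", mean(values))
print("approximate 95% interval for the mean:", (low, high))
```

In the notebook, a text cell next to code like this would explain what is being resampled, why the interval is only approximate, and how it bears on the question being asked.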

Task 5: Answer the Questions

Answer the questions you posed in Task 2. As mentioned above, you are not expected to provide concrete answers to all questions. However, you are expected to provide a greater level of understanding and insight into the questions through data analysis. You should also explain what makes some of these questions difficult to answer. Clearly write out your answers in a text cell of the Jupyter notebook. It should be clear which answer corresponds to which question.

Grading

  • Asking three questions about the data (3 pts)
  • Complexity of the questions (2 pts)
  • Data analysis (code and text) (9 pts)
  • Answering the questions (text) (6 pts)

To get a good idea of what we expect the final Jupyter notebook to look like, please take a look at the Data 8 lab assignments. The labs have both code and text explanations that accompany the code. You do not need to describe every single line of code, but be sure to explain the techniques you are using to analyze the data.

Submission

Download your Jupyter notebook in HTML format and upload the file data_analysis.html to OK. This HTML file should contain the following information:

  • Links to all datasets used
  • 3 questions
  • All code used
  • Text explanations for the code, as necessary
  • Answers to the questions