Lab 12: SQL
Due at 11:59:59 pm on 12/5/2022.
Starter Files
Download lab12.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the OK autograder.
Submission
By the end of this lab, you should have submitted the lab with python3 ok --submit
. You may submit more than once before the deadline; only the final submission will be graded. Check that you have successfully submitted your code on okpy.org.
- To receive credit for this lab, you must complete Questions 2, 3, 4, 5, 6 and 7 in lab12.sql and submit through OK.
SQL Basics
Creating Tables
You can create SQL tables either from scratch or from existing tables.
The following statement creates a table by specifying column names and values
without referencing another table. Each SELECT
clause specifies the values
for one row, and UNION
is used to join rows together. The AS
clauses give a
name to each column; it need not be repeated in subsequent rows after the
first.
CREATE TABLE [table_name] AS
SELECT [val1] AS [column1], [val2] AS [column2], ... UNION
SELECT [val3] , [val4] , ... UNION
SELECT [val5] , [val6] , ...;
Let's say we want to make the following table called big_game
which records
the scores for the Big Game each year. This table has three columns:
berkeley
, stanford
, and year
.
We could do so with the following CREATE TABLE
statement:
CREATE TABLE big_game AS
SELECT 30 AS berkeley, 7 AS stanford, 2002 AS year UNION
SELECT 28, 16, 2003 UNION
SELECT 17, 38, 2014;
Selecting From Tables
More commonly, we will create new tables by selecting specific columns that we
want from existing tables by using a SELECT
statement as follows:
SELECT [columns] FROM [tables] WHERE [condition] ORDER BY [columns] LIMIT [limit];
Let's break down this statement:
SELECT [columns]
tells SQL that we want to include the given columns in our output table;[columns]
is a comma-separated list of column names, and*
can be used to select all columnsFROM [table]
tells SQL that the columns we want to select are from the given tableWHERE [condition]
filters the output table by only including rows whose values satisfy the given[condition]
, a boolean expressionORDER BY [columns]
orders the rows in the output table by the given comma-separated list of columnsLIMIT [limit]
limits the number of rows in the output table by the integer[limit]
Note: We capitalize SQL keywords purely because of style convention. It makes queries much easier to read, though they will still work if you don't capitalize keywords.
Here are some examples:
Select all of Berkeley's scores from the big_game
table, but only include
scores from years past 2002:
sqlite> SELECT berkeley FROM big_game WHERE year > 2002;
28
17
Select the scores for both schools in years that Berkeley won:
sqlite> SELECT berkeley, stanford FROM big_game WHERE berkeley > stanford;
30|7
28|16
Select the years that Stanford scored more than 15 points:
sqlite> SELECT year FROM big_game WHERE stanford > 15;
2003
2014
SQL operators
Expressions in the SELECT
, WHERE
, and ORDER BY
clauses can contain
one or more of the following operators:
- comparison operators:
=
,>
,<
,<=
,>=
,<>
or!=
("not equal") - boolean operators:
AND
,OR
- arithmetic operators:
+
,-
,*
,/
- concatenation operator:
||
Here are some examples:
Output the ratio of Berkeley's score to Stanford's score each year:
sqlite> select berkeley * 1.0 / stanford from big_game;
0.447368421052632
1.75
4.28571428571429
Output the sum of scores in years where both teams scored over 10 points:
sqlite> select berkeley + stanford from big_game where berkeley > 10 and stanford > 10;
55
44
Output a table with a single column and single row containing the value "hello world":
sqlite> SELECT "hello" || " " || "world";
hello world
Joins
We can use joins to include rows from another table that satisfy the where
predicate. Joins can either be on different tables, or the same table if we
include an alias. Here we are referencing the football table twice, once as
the alias a
and once as the alias b
.
sqlite> SELECT a.Berkeley - b.Berkeley, a.Stanford - b.Stanford, a.Year, b.Year
...> FROM Football as a, Football as b WHERE a.Year > b.Year;
-11|22|2014|2003
-13|21|2014|2002
-2|9|2003|2002
What is this query asking for?
You may notice that it does not seem like we actually performed any operations
to do the join. However, the join is implicit in the fact that we listed more
than one table after the FROM. In this example, we joined the table Football
with itself and gave each instance of the table an alias, a
and b
so that
we could distinctly refer to each table's attributes and perform selections and
comparisons on them, such as a.Year > b.Year
.
One way to think of a join is that it produces a cross-product between the two tables by matching each row from the first table with every other row in the second table, which creates a new, larger joined table.
Here's an illustration of what happened in the joining process during the above query.
From here, the select statement examines the joined table and selects the values
it desires: a.Berkeley - b.Berkeley
and a.Stanford - b.Stanford
but only
from the rows WHERE a.Year > b.Year
. This prevents duplicate results from
appearing in our output!
Survey
Getting to Know Your Classmates
In a past semester of CS88, we asked the students to take a survey. In this lab, we will interact with the results of the survey by using SQL queries to see if we can find interesting trends in the data. Data Scientists often use languages like SQL to do Exploratory Data Analysis.
First, take a look at data.sql
and examine the tables defined in it. Note
their structures. There are two tables you will be working with:
students
: The main results of the survey. Each column represents a different question from the survey, except for the first column, which is the time of when the result was submitted. This time is a unique identifier for each of the rows in the table.Column Name Question time
The unique timestamp that identifies the submission number
What's your favorite number between 1 and 100? color
What is your favorite color? seven
Choose the number 7 below.
Options:- 7
- You are not the boss of me!
- I do what I want.
- I'm a rebel
- Choose this option instead.
- YOLO!
song
If you could listen to only one of these songs for the rest of your life, which would it be?
Options:- "Hotline Bling" by Drake
- "I Want It That Way" by Backstreet Boys
- "Shake It Off" by Taylor Swift
- "Baby" by Justin Bieber
- "Sandstorm" by Darude
- "Hello" by Adele
- "Thinking Out Loud" by Ed Sheeran
date
Pick a day of the year! pet
If you could have any animal in the world as a pet, what would it be? gerald
Choose your favorite photo of Gerald Friedland! (Options shown under Question 1) smallest
Try to guess the smallest unique positive INTEGER that anyone will put! checkboxes
: Numbers randomly selected from 1 to 10 and included 2015, 9000, and 9001, matched to each student. Students may be matched to multiple numbers. Each row has a time (which is again a unique identifier) and has the value'True'
if the student is matched with that number or'False'
if the student was not. The column names in this table are the following strings, referring to each possible number:'0'
,'1'
,'2'
,'4'
,'5'
,'6'
,'7'
,'8'
,'9'
,'10'
,'2015'
,'9000'
,'9001'
.
A time in students
matches up with a time in checkboxes
. For
example, the row with time "11/11/2015 9:54:03" in students
matches up with
the row with time "11/11/2015 9:54:03" in checkboxes
. We randomly choose from 1 to 10
and included 2015, 9000, and 9001 and matched it to each students. We used
time to uniquely identify each student, rather than using their name or email.
Note: If you are looking for your personal response within the data, you may have noticed that some of your answers are slightly different from what you inputted. In order to make SQLite accept our data, and to optimize for as many matches as possible during our joins, we did the following things to clean up the data:
number
andsmallest
: If you did not input a number, we put the number-1
in as a placeholder.color
andpet
: We converted all the strings to be completely lowercase.
You will write all of your solutions in the starter file provided. As with other labs, you can test your solutions with OK. In addition, you can use either of the following commands. You may need to refer to the Usage section to find the appropriate command for your OS:
python3 sqlite_shell.py < lab12.sql
python3 sqlite_shell.py --init lab12.sql
Question 1: What Would SQL print?
First, load the tables into sqlite3. If you're on Windows or Mac OS X (Mavericks or older), use the following command:
$ python3 sqlite_shell.py --init lab12.sql
If that doesn't work, try:
$ python3 sqlite_shell.py < lab12.sql
Before we start, inspect the schema of the tables that we've created for you:
sqlite> .schema
A schema tells you the name of each of our tables and their attributes. In general, you can think of a schema as a map that describes the logical entities and relationships of a database. Just as the outline of a book tells a reader the order and category in which content is organized, a schema details the organizational hierarchy of information within a database.
Let's also take a look at some of the entries in our table. There are a lot of entries though, so let's just output the first 20:
sqlite> SELECT * FROM students LIMIT 20;
If you're curious about some of the answers students put into the Google form,
open up data.sql
in your favorite text editor and take a look!
For each of the SQL queries below, decide to yourself and/or your partner what the query is looking for, then try running the query yourself and see!
sqlite> SELECT * FROM students; -- This is a comment. * is shorthand for all columns!
______selects all records from students;
sqlite> SELECT color FROM students WHERE number = 16;
______selects the color from students who said their favorite number was 16;
sqlite> SELECT song, pet FROM students WHERE color = "blue" AND date = "12/25";
______selects the song and pet from students who said their favorite color was blue and picked December 25th;
Question 2: Obedience
To warm-up, let's ask a simple question related to our data: Is there a correlation between whether students do as they're told and their favorite image of Gerald Friedland?
Write a SQL query to create a table that contains the columns seven
(this
column representing "obedience") and gerald
(the image students selected)
from the students
table.
You should get the following output:
sqlite> SELECT * FROM obedience LIMIT 10;
7|Option 2
Choose this option instead.|Option 3
YOLO!|Option 3
7|Option 4
7|Option 5
YOLO!|Option 3
Choose this option instead.|Option 3
Choose this option instead.|Option 3
7|Option 1
7|Option 3
CREATE TABLE obedience as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT seven, gerald FROM students;
Use OK to test your code:
python3 ok -q obedience
Question 3: Go Bears! (And Dogs?)
Now that we have learned how to select columns from a SQL table, let's filter the results to see some more interesting results!
It turns out that our students have a lot of school spirit: the most popular favorite
color was 'blue'
. You would think that this school spirit would carry over to the
pet answer, and everyone would want a pet bear! Unfortunately, this was not the case,
and the majority of students opted to have a pet 'dog'
instead. That is the more
sensible choice, I suppose...
Write a SQL query to create a table that contains both the column color
and the
column pet
, using the keyword WHERE
to restrict the answers to the most popular
results of color being 'blue'
and pet being 'dog'
. Note that some people put different
names for dogs—i.e. putting specific dogs or putting 'puppy'
instead of dog. Unfortunately,
we would not be able to account for those cases. Something messy as data scientist you will have
to deal with in collecting data.
You should get the following output:
sqlite> SELECT * FROM blue_dog;
blue|dog
blue|dog
blue|dog
CREATE TABLE blue_dog as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT color, pet FROM students WHERE color = 'blue' AND pet = 'dog';
Use OK to test your code:
python3 ok -q bluedog
Question 4: The Smallest Unique Integer
Who successfully managed to guess the smallest unique integer value? Let's find out!
To keep it simply, we limited the allowed inputs to be a positive number (greater than zero).
Unfortunately we have not learned how to do aggregations, which can help us count the number of times a specific value was selected, in SQL just yet. As such, we can only hand inspect our data to determine it. However, an anonymous elf has informed us that the smallest unique value is greater than 3!
Write a SQL query with the columns time
and smallest
to try to determine what
the smallest integer value is. In order to make it easier for us to inspect these
values, use ORDER BY
to sort the numerical values, and LIMIT
your result to
the first 20 values that are greater than the number 3.
The first 5 lines of your output should look like this:
sqlite> SELECT * FROM smallest_int LIMIT 5;
4/17/2019 10:19:17|4
4/19/2019 17:46:44|4
4/20/2019 20:29:22|5
4/16/2019 18:44:34|6
4/17/2019 9:44:12|6
CREATE TABLE smallest_int as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT time, smallest FROM students WHERE smallest > 3 ORDER BY smallest LIMIT 20;
Use OK to test your code:
python3 ok -q smallest-int
After you've successfully passed the OK test, actually take a look at the table
smallest_int
that you just created and find the smallest unique integer value!
To do this, try the following:
$ python3 sqlite_shell.py --init lab12.sql
sqlite> SELECT * FROM smallest_int; -- No LIMIT this time!
Question 5: Sevens
Let's take a look at data from both of our tables, students
and checkboxes
,
to find out if students that got the number 7 assigned to them also chose '7'
for the
obedience question. Specifically, we want to look at the students that fulfill
the below conditions and see if they also chose '7'
in the question that asked
students to choose the number 7 (column seven
in students
).
- reported that their favorite number (column
number
instudents
) was 7 - have
'True'
in column'7'
incheckboxes
, meaning they got assigned the number7
.
In order to examine rows from both the students
and the checkboxes
table, we will
need to perform a join.
How would you specify the WHERE
clause to make the SELECT
statement only consider
rows in the joined table whose values all correspond to the same student? If
you find that your output is massive and overwhelming, then you are probably missing
the necessary condition in your WHERE
clause to ensure this.
Note: The columns in the
checkboxes
table are strings with the associated number, so you must put quotes around the column name to refer to it. For example if you alias the table asa
, to get the column to see if a student checked 9001, you must writea.'9001'
.
Write a SQL query to create a table with just the column seven
from students
, filtering
first for students who said their favorite number (column number
) was 7 in the students
table and who checked the box for seven (column '7'
) in the checkboxes
table.
You should get the following output:
sqlite> SELECT * FROM sevens;
7
CREATE TABLE sevens as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT s.seven FROM students as s, checkboxes as c WHERE s.number = 7 AND c.'7' = 'True' AND s.time = c.time;
Use OK to test your code:
python3 ok -q sevens
Question 6: Matchmaker, Matchmaker
Did you study Data Science with the hope of finding your soul mate? Well you're in luck! With all this data in hand, it's easy for us to find your perfect match. If two students want the same pet and have the same taste in music, they are clearly meant to be together! In order to provide some more information for the potential lovebirds to converse about, let's include the favorite colors of the two individuals as well!
In order to match up students, you will have to do a join on the students
table
with itself. When you do a join, SQLite will match every single row with every single
other row, so make sure you do not match anyone with themselves, or match any given pair
twice!
Hint: You may want to enforce a sort of "ordering" on the column time
(which
is a unique identifier) from your joined tables, in order to do the above correctly.
Write a SQL query to create a table that has 4 columns:
- The shared preferred pet of the couple
- The shared favorite song of the couple
- The favorite color of the first person
- The favorite color of the second person
You should get the following output:
sqlite> SELECT * FROM matchmaker LIMIT 20;
dog|"I Want it That Way"" by Backstreet Boys|none|blue
dog|"I Want it That Way"" by Backstreet Boys|none|blue
dog|"Hello" by Adele|green|pink
dog|"Hello" by Adele|green|purple
dog|"I Want it That Way"" by Backstreet Boys|blue|blue
dog|"Thinking Out Loud" by Ed Sheeran|don't have one|black
dog|"Thinking Out Loud" by Ed Sheeran|don't have one|black
owl|"Hello" by Adele|blue|blue
dog|"Hello" by Adele|pink|purple
dolphin|"I Want it That Way"" by Backstreet Boys|green|blue
cat|"Hello" by Adele|pink|blue
dog|"Hotline Bling" by Drake|white|blue
dog|"Hotline Bling" by Drake|white|black
dog|"Thinking Out Loud" by Ed Sheeran|black|black
dog|"Hotline Bling" by Drake|blue|black
CREATE TABLE matchmaker as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT a.pet, a.song, a.color, b.color FROM students AS a, students AS b WHERE a.time < b.time AND a.pet = b.pet AND a.song = b.song;
Use OK to test your code:
python3 ok -q matchmaker
The COUNT aggregate
Recall how finding the smallest integer anyone chose was rather painful, because we could not simply count up how many times each integer was chosen by anyone.
Bring in SQL aggregation, which is commonly used to aggregate values in order to answer these types of questions!
In order to perform SQL aggregation, we need to group rows in our table by one or more attributes. Once
we have groups, we can aggregate over the groups in our table and find things like the maximum value (MAX
),
the minimum value (MIN
), the number of rows in the group (COUNT
), the average over all of the values
(AVG
), and more! SELECT
statements that use aggregation are marked by two things: an aggregate function
(MAX
, MIN
, COUNT
, AVG
, etc.) and a GROUP BY
clause. For example:
sqlite> SELECT song, MAX(number) FROM students GROUP BY song;
|69
Baby|76
Hello|100
Hotline Bling|100
Sandstorm|100
Shake It Off|98
That Way|100
Thinking Out Loud|99
This SELECT
statement groups all of the rows in our table students
by song
. Then, within each
group, we perform aggregation by MAXing over the attribute number
. By selecting song
and MAX(number)
,
we then can see the highest number
any student chose for any given song
.
Question 7: The Smallest Unique Integer (Part 2)
Now, let's revisit the previous problem of finding the smallest integer that anyone chose, and take
a closer look at the COUNT
aggregate.
Just like MAX
above, we can select a COUNT
of some attribute in our query after grouping the rows
in our table by some attribute.
Write a SQL query that uses the COUNT
aggregate to create a table that pairs the attribute smallest
with
the number of times it was chosen by a student (this is the aggregation part). In order to cut out the
people who chose not to respond, and the sneaky cheaters that tried to put small non-integer values,
limit your results to the number 1 and greater!
Hint: You may find that there isn't a particular attribute you should have to perform the COUNT
aggregation over. If you are only interested in counting the number of rows in a group, you can just
say COUNT(*)
.
Hint: Think about what attribute you need to GROUP BY
.
After you've defined your table, you should get something like:
sqlite> SELECT * FROM smallest_int_count LIMIT 10;
1|34
2|4
3|7
4|2
5|1
6|2
7|5
8|1
10|1
11|1
CREATE TABLE smallest_int_count as
-- REPLACE THIS LINE
SELECT 'YOUR CODE HERE';
SELECT smallest, COUNT(*) FROM students WHERE smallest >= 1 GROUP BY smallest;
Use OK to test your code:
python3 ok -q smallest-int-count
It looks like the number 5
only had one person choose it! Were you the lucky student that put
it down?
Submit
Make sure to submit this assignment by running:
python3 ok --submit