Homework 11
Due at 11:59:59 pm on Thursday, 5/2/2024.
Instructions
Download hw11.zip. Inside the archive, you will find starter files for the questions in this homework, along with a copy of the OK autograder.
Readings: This homework relies on following references:
Usage
First, check that a file named sqlite_shell.py
exists alongside the assignment files.
You can start an interactive SQLite session in your Terminal or Git Bash with the following command:
python3 sqlite_shell.py
While the interpreter is running, you can type .help
to see some of the commands you can run.
To exit out of the SQLite interpreter, type .exit
or .quit
or press Ctrl-C. Remember that if you see ...>
after pressing enter, you probably forgot a ;
.
You can also run all the statements in a .sql
file by doing the following:
This command runs your code and then exits SQLite immediately afterwards.
python3 sqlite_shell.py < hw11.sql
This command runs your code and then opens an interactive SQLite session, which is similar to running Python code with the interactive -i
flag.
python3 sqlite_shell.py --init hw11.sql
To complete this homework assignment, you will need to use SQLite version 3.8.3 or greater.
To check your progress, you can run sqlite3 directly by running:
python3 sqlite_shell.py --init hw11.sql
You should also check your work using ok:
python3 ok
Working with the Chinook Dataset
The Chinook dataset is a sample dataset that you'll be working with in this homework. It contains data about many different musical albums, as well as the tracks and musical artists on those albums. To simplify things, we've only given you the artists
, albums
, and tracks
tables (note that this means that some columns in the tracks
table, such as genreId
and mediaTypeId
won't reference anything). We've also cleaned up the data for you and truncated the tracks
table (it used to be several thousand records long!). Let's take a look at the different tables we've given you:
artists
artistId
: the unique ID number assigned to this artistname
: the name of this artist
albums
albumId
: the unique ID number assigned to this albumtitle
: the title of this albumartistId
: the unique ID of the artist on this album
tracks
trackId
: the unique ID number assigned to this trackname
: the name of this trackalbumId
: the unique ID of the album that this track is onmediaTypeId
: don't worry about this!genreId
: don't worry about this!composer
: the composer of this trackmilliseconds
: the length of this track in millisecondsbytes
: the size of this track file in bytesunitPrice
: the price of this track
You will write all of your solutions in the starter file provided. As with other labs, you can test your solutions with OK. In addition, you can use either of the following commands. You may need to refer to the Usage section to find the appropriate command for your OS:
python3 sqlite_shell.py < hw11.sql
python3 sqlite_shell.py --init hw11.sql
Question 1: What Would SQL print?
Please note that this question does not count as part of your Gradescope submission score.
First, load the tables into sqlite3:
python3 sqlite_shell.py --init hw11.sql
Before we start, inspect the schema of the tables that we've created for you:
sqlite> .schema
This tells you the name of each of our tables and their attributes.
Let's also take a look at some of the entries in our table. There are a lot of entries though, so let's just output the first 20:
sqlite> SELECT * FROM tracks LIMIT 20;
If you're curious about which artists/albums/tracks we've included,
open up data.sql
in your favorite text editor and take a look!
For each of the SQL queries below, decide to yourself what the query is looking for, then try running the query yourself and see! There is nothing to submit for this question.
IMPORTANT NOTE: Comments in SQL begin with
--
. Technically once you write a comment, you do not need to end the statement with a semicolon. However, due to limitations of our SQLite shell, you will need to end all comments with a semicolon.
sqlite> -- Hint: * is shorthand for all columns
sqlite> SELECT * FROM artists;
sqlite> -- Hint: There are 1000000 bytes in a megabyte
sqlite> SELECT name FROM tracks WHERE bytes > 3000000;
sqlite> -- Hint: How many milliseconds are in a minute?
sqlite> SELECT name, composer FROM tracks WHERE albumId = 24 AND milliseconds > 180000;
Click for Query 1 solution
Selects all records from artists
Click for Query 2 solution
Selects the name of all tracks that are larger than 3 Mb
Click for Query 3 solution
Selects the name and composer of all tracks longer than 3 minutes on the album with albumID = 24
Question 2: Get rid of the fluff!
Write a SQL query to create a table that contains just the name
and unitPrice
of each track in the tracks
table.
You should get the following output:
sqlite> SELECT * FROM pricing LIMIT 10;
For Those About To Rock (We Salute You)|0.99
Fast As a Shark|0.99
Restless and Wild|0.99
Princess of the Dawn|0.99
Put The Finger On You|0.99
Let's Get It Up|0.99
Inject The Venom|0.99
Snowballed|0.99
Evil Walks|0.99
C.O.D.|0.99
CREATE TABLE pricing as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q pricing
Question 3: Long songs...
Now that we have learned how to select columns from a SQL table, let's filter the results to see some more interesting results!
Write a SQL query to create a table that contains the names of all tracks longer than 8 minutes (that's a long song!)
You should get the following output:
sqlite> SELECT * FROM long LIMIT 10;
You Oughta Know (Alternate)
Stratus
Wheels Of Confusion / The Straightener
Book Of Thel
The Alchemist
Talkin' 'Bout Women Obviously
Terra
You Shook Me(2)
How Many More Times
Advance Romance
CREATE TABLE long as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q long
Question 4: Smallest Song
Which song takes up the least space (as a file)? Let's find out!
Write a SQL query with the columns name
and milliseconds
to try to determine what the smallest song file is. In order to make it
easier for us to inspect these values, use ORDER BY
to sort the numerical values
and LIMIT
your result to just the smallest track.
You should get the following output:
sqlite> SELECT * FROM smallest;
Mateus Enter|33149
CREATE TABLE smallest as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q smallest
Joins
We can use joins to include rows from another table that satisfy the WHERE
predicate. Joins can either be on different tables, or the same table if we
include an alias. Here we are referencing the football table twice, once as
the alias a
and once as the alias b
.
sqlite> SELECT a.Berkeley - b.Berkeley, a.Stanford - b.Stanford, a.Year, b.Year
...> FROM Football as a, Football as b WHERE a.Year > b.Year;
-11|22|2014|2003
-13|21|2014|2002
-2|9|2003|2002
What is this query asking for? We're creating a new table where each row is the difference in scores from 2 years. For example, in the first row, we are comparing the big game scores in 2014 and 2003. In 2014, Berkeley scored 17 and in 2003, Berkeley scored 28. 17 - 28 = -11. Similarly, in 2014, Stanford scored 38 and in 2003 they scored 16. 38 - 16 = 22.
You may notice that it does not seem like we actually performed any operations
to do the join. However, the join is implicit in the fact that we listed more
than one table after the FROM. In this example, we joined the table Football
with itself and gave each instance of the table an alias, a
and b
so that
we could distinctly refer to each table's attributes and perform selections and
comparisons on them, such as a.Year > b.Year
.
One way to think of a join is that it produces a cross-product between the two tables by matching each row from the first table with every other row in the second table, which creates a new, larger joined table.
Here's an illustration of what happened in the joining process during the above query.
From here, the select statement examines the joined table and selects the values
it desires: a.Berkeley - b.Berkeley
and a.Stanford - b.Stanford
but only
from the rows WHERE a.Year > b.Year
. This prevents duplicate results from
appearing in our output!
Question 5: Long songs on albums
We know which songs are long, but now we want to know which albums have long songs on them!
Write a SQL query to create a table that contains the titles of all albums with tracks longer than 8 minutes on them.
You should get the following output:
sqlite> SELECT * FROM long_album LIMIT 10;
Jagged Little Pill
The Best Of Billy Cobham
Black Sabbath Vol. 4 (Remaster)
Chemical Wedding
Chemical Wedding
The Best Of Buddy Guy - The Millenium Collection
Prenda Minha
BBC Sessions [Disc 1] [Live]
BBC Sessions [Disc 1] [Live]
Note that albums can appear more than once in this table. Why is this? What does the number of times an album appears tell you?
CREATE TABLE long_album as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q long_album
Aggregate Functions
So far, we haven't given you a good way of counting the number of rows that fit certain criteria.
Bring in SQL aggregation, which is commonly used to aggregate values in order to answer these types of questions!
In order to perform SQL aggregation, we need to group rows in our table by one or more attributes. Once
we have groups, we can aggregate over the groups in our table and find things like the maximum value (MAX
),
the minimum value (MIN
), the number of rows in the group (COUNT
), the average over all of the values
(AVG
), and more! SELECT
statements that use aggregation are marked by two things: an aggregate function
(MAX
, MIN
, COUNT
, AVG
, etc.) and a GROUP BY
clause. For example, if we wanted to see how many
tracks are on each album (identified by albumId):
sqlite> SELECT albumId, max(milliseconds) FROM tracks GROUP BY albumId LIMIT 10;
1|343719
3|375418
4|369319
5|381231
6|491885
7|387134
9|436453
10|343457
11|309786
12|163265
So the longest song on the album with albumId 57 is 343719 ms long!
This SELECT
statement groups all of the rows in our table tracks
by albumId
. Then, within each
group, we perform aggregation by MAXing over the attribute milliseconds
. By selecting albumId
and MAX(milliseconds)
,
we then can see the greatest milliseconds
for any track for a given albumId
.
When using an aggregate function in your
SELECT
statement, any other columns in theSELECT
statement must be in theGROUP BY
statement!
Question 6: Number of songs on each album
Write a SQL query to create a table that contains the albumId
and number of songs
on the album with that albumId
. HINT: Remember that count(*)
will count the
number of rows in each group of a GROUP BY
statement.
You should get the following output:
sqlite> SELECT * FROM track_count LIMIT 10;
1|10
3|3
4|8
5|15
6|13
7|12
9|8
10|14
11|12
12|12
CREATE TABLE track_count as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q track_count
Question 7: Counting albums
So far, we haven't yet considered the artists
table. What if we want to know
how many albums each artist has released?
Write a SQL query to create a table that contains the name of each artist along
with how many albums that artist has released (that are in the albums
table).
You should get the following output:
sqlite> SELECT * FROM album_count LIMIT 10;
Aaron Copland & London Symphony Orchestra|1
Aaron Goldberg|1
Academy of St. Martin in the Fields & Sir Neville Marriner|1
Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner|1
Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair|1
Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart|1
Accept|2
Adrian Leaper & Doreen de Feis|1
Aerosmith|1
CREATE TABLE album_count as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q album_count
Question 8: Busiest artists
How many songs have each artist produced? Write a SQL query to create a table containing artists' names and the number of tracks that they're produced (we'll assume an artist produced a track if it appears on one of their albums).
HINT: You may need all three tables!
You should get the following output:
sqlite> SELECT * FROM busiest_artists LIMIT 10;
AC/DC|18
Accept|3
Aerosmith|15
Alanis Morissette|13
Alice In Chains|12
Antônio Carlos Jobim|17
Apocalyptica|8
Audioslave|26
BackBeat|12
Billy Cobham|8
CREATE TABLE busiest_artists as
select "REPLACE THIS LINE WITH YOUR SOLUTION";
Use OK to test your code:
python3 ok -q busiest_artists
Note that computing a join over all three tables is very expensive! To see how many rows are considered, run the following query:
SELECT count(*) FROM artists, albums, tracks;
There may be a slight delay (because you're computing so many different candidate rows!).
Submission
When you are done, submit your file to Gradescope. You only need to upload the following files:
hw11.sql