{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS88 Introduction to SQL\n",
"\n",
"You have now learned in some depth how to think computation, how to express your thoughts in a modern programming language, Python, clearly and succintly, and how to apply this knowledge to various tasks in Data Science. In doing so, you have come to understand what it means to *interpret* a program. The python interpreter is a piece of software that executes - i.e., interprets - python programs.\n",
"\n",
"You have also gained an understanding of object oriented programming, and in particular a dataframe object called `Table`s.\n",
"\n",
"In this lecture, we are going to utilize your understanding of `Table`s to rapidly learn a little language, SQL, which is the way essentially all queries to data bases are expressed. Formally, it performs operations defined in a relational algebra. But, you already understand essentially all the operations - cause you learned them on `Table`s. The difference is that a database holds a collection of tables - yes the same word - on disk, so they are permanent.\n",
"\n",
"We will use a particular relational database management system, sqlite3, which actual stores the database in a conventional file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from datascience import *\n",
"import numpy as np\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plots\n",
"plots.style.use('fivethirtyeight')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQL Query Syntax \n",
"\n",
"Nearly the whole language consists of one very complicated statement.\n",
"\n",
"```\n",
"SELECT [ALL | DISTINCT] {* | expr [[AS] c_alias] {,expr [[AS] c_alias] ...}}\n",
" FROM tableref {, tableref}\n",
" [WHERE search_condition]\n",
" [GROUP BY colname {,colname...}]\n",
" [HAVING search condition];\n",
"```\n",
"\n",
"In SQL, queries and subqueries produce a new table, as do most `Table` methods.\n",
"\n",
"Here's a pretty [good reference](https://www.dofactory.com/sql/syntax). There are lots of them out there. [Here's another](http://www.zentut.com/sql-tutorial/).\n",
"\n",
"SQL keywords are case insensitive, but convention is to use uppercase for clarity. Column names and text values are case sensitive.\n",
"\n",
"For students who have come through Data8, the experience with datascience `Table`s provides a unique way to learn about SQL and it's underlying relational concepts.\n",
"\n",
"Let's start with a `Table` fresh out of Data 8 Lecture 10."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" Flavor | Color | Price | \n",
"
\n",
" \n",
" \n",
" \n",
" strawberry | pink | 3.55 | \n",
"
\n",
" \n",
" chocolate | light brown | 4.75 | \n",
"
\n",
" \n",
" chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
" strawberry | pink | 5.25 | \n",
"
\n",
" \n",
" bubblegum | pink | 4.75 | \n",
"
\n",
" \n",
" chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"Flavor | Color | Price\n",
"strawberry | pink | 3.55\n",
"chocolate | light brown | 4.75\n",
"chocolate | dark brown | 5.25\n",
"strawberry | pink | 5.25\n",
"bubblegum | pink | 4.75\n",
"chocolate | dark brown | 5.25"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Table constructor method\n",
"\n",
"# An example of creating a Table from a list of rows.\n",
"Table([\"Flavor\",\"Color\",\"Price\"]).with_rows([\n",
" ('strawberry','pink', 3.55),\n",
" ('chocolate','light brown', 4.75),\n",
" ('chocolate','dark brown', 5.25),\n",
" ('strawberry','pink',5.25),\n",
" ('bubblegum','pink',4.75),\n",
" ('chocolate','dark brown', 5.25)\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Select\n",
"\n",
"The basic aspect of SQL is the select statement which returns a table.\n",
"\n",
"The following create a one-row Table\n",
"```\n",
"SELECT \"strawberry\" AS Flavor, \"pink\" AS Color, 3.55 AS Price; \n",
"```\n",
"\n",
"Note - column names are always text and have no `\"` around them.\n",
"\n",
"To try this out, run `sqlite3` creating a database file, say `testsql`\n",
"\n",
"`sqlite3 testsql`\n",
"\n",
"This will run interactively, so you can see the results of the queries you perform.\n",
"\n",
"```\n",
"ideas/sql> sqlite3 testsql.db\n",
"SQLite version 3.13.0 2016-05-18 10:57:30\n",
"Enter \".help\" for usage hints.\n",
"sqlite> select \"strawberry\" as Flavor, \"pink\" as Color, 3.55 as Price;\n",
"strawberry|pink|3.55\n",
"sqlite>\n",
"```\n",
"Don't forget the \";\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table as Union of tables\n",
"\n",
"Creating a multi-row table (object) with a unior of `select` statements\n",
"```\n",
"select \"strawberry\" as Flavor, \"pink\" as Color, 3.55 as Price union\n",
"select \"chocolate\",\"light brown\", 4.75 union\n",
"select \"chocolate\",\"dark brown\", 5.25 union\n",
"select \"strawberry\",\"pink\",5.25 union\n",
"select \"bubblegum\",\"pink\",4.75 union\n",
"select \"strawberry\",\"pink\",5.25 union\n",
"select \"chocolate\",\"dark brown\", 5.25;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how a SQL table is a relation, it is not a list. \n",
"\n",
"```\n",
"sqlite> select \"strawberry\" as Flavor, \"pink\" as Color, 3.55 as Price union\n",
" ...> select \"chocolate\",\"light brown\", 4.75 union\n",
" ...> select \"chocolate\",\"dark brown\", 5.25 union\n",
" ...> select \"strawberry\",\"pink\",5.25 union\n",
" ...> select \"bubblegum\",\"pink\",4.75 union\n",
" ...> select \"strawberry\",\"pink\",5.25 union\n",
" ...> select \"chocolate\",\"dark brown\", 5.25;\n",
"bubblegum|pink|4.75\n",
"chocolate|dark brown|5.25\n",
"chocolate|light brown|4.75\n",
"strawberry|pink|3.55\n",
"strawberry|pink|5.25\n",
"```\n",
"Notice, no duplicates. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create table\n",
"\n",
"Let's create a version of the table as might occur in a retail setting. We have added a column for the transaction ID.\n",
"```\n",
"create table cones as\n",
" select 1 as Id, \"strawberry\" as Flavor, \"pink\" as Color, 3.55 as Price union\n",
" select 2, \"chocolate\",\"light brown\", 4.75 union\n",
" select 3, \"chocolate\",\"dark brown\", 5.25 union\n",
" select 4, \"strawberry\",\"pink\",5.25 union\n",
" select 5, \"bubblegum\",\"pink\",4.75 union\n",
" select 6, \"chocolate\", \"dark brown\", 5.25;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### sqlite3\n",
"\n",
"Paste that into your `sqlite3` session. Then you can access the data.\n",
"\n",
"```\n",
"sqlite> .headers on\n",
"sqlite> select * from cones;\n",
"ID|Flavor|Color|Price\n",
"1|strawberry|pink|3.55\n",
"2|chocolate|light brown|4.75\n",
"3|chocolate|dark brown|5.25\n",
"4|strawberry|pink|5.25\n",
"5|bubblegum|pink|4.75\n",
"6|chocolate|dark brown|5.25\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" Id | Flavor | Color | Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | strawberry | pink | 3.55 | \n",
"
\n",
" \n",
" 2 | chocolate | light brown | 4.75 | \n",
"
\n",
" \n",
" 3 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
" 4 | strawberry | pink | 5.25 | \n",
"
\n",
" \n",
" 5 | bubblegum | pink | 4.75 | \n",
"
\n",
" \n",
" 6 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"Id | Flavor | Color | Price\n",
"1 | strawberry | pink | 3.55\n",
"2 | chocolate | light brown | 4.75\n",
"3 | chocolate | dark brown | 5.25\n",
"4 | strawberry | pink | 5.25\n",
"5 | bubblegum | pink | 4.75\n",
"6 | chocolate | dark brown | 5.25"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a Table and assign it to a variable\n",
"\n",
"cones = Table([\"Id\", \"Flavor\",\"Color\",\"Price\"]).with_rows([\n",
" (1, 'strawberry','pink', 3.55),\n",
" (2, 'chocolate','light brown', 4.75),\n",
" (3, 'chocolate','dark brown', 5.25),\n",
" (4, 'strawberry','pink',5.25),\n",
" (5, 'bubblegum','pink',4.75),\n",
" (6, 'chocolate','dark brown', 5.25)\n",
"])\n",
"cones"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tables in a database\n",
"\n",
"The sqlite3 command line provides pseudo-ops to explore the tables in the database\n",
"\n",
"```\n",
"sqlite> .tables\n",
"cones\n",
"sqlite> .schema cones\n",
"CREATE TABLE cones(\n",
" Id,\n",
" Flavor,\n",
" Color,\n",
" Price\n",
");\n",
"```\n",
"SQL also provides table management operations, besides `CREATE`.\n",
"\n",
"```\n",
"sqlite> drop table cones;\n",
"sqlite> .tables\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a table and its schema\n",
"\n",
"While it is possible to create a table and fill it in, as in the example above, in most database settings tables are permanent objects that are created and then repeatedly accessed and modified.\n",
"\n",
"To create an empty table and specify its schema:\n",
"\n",
"```\n",
"CREATE TABLE cones (Id INT, Flavor TEXT, Color TEXT, Price REAL);\n",
"```\n",
"Then it can be filled in with `INSERT` statements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Insert\n",
"\n",
"To insert a row into a table, you need to specify three things:\n",
"\n",
"* The table, which you want to insert a new row, in the INSERT INTO clause.\n",
"* a comma-separated list of columns in the table surrounded by parentheses.\n",
"* a comma-separated list of values surrounded by parentheses in the VALUES clause.\n",
"\n",
"```\n",
"INSERT INTO table(column1, column2,...)\n",
"VALUES (value1, value2,...);\n",
"```\n",
"For example, to fill out our cones table creation:\n",
"\n",
"```\n",
"INSERT INTO cones VALUES (1, \"strawberry\", \"pink\", 3.55);\n",
"INSERT INTO cones VALUES (2, \"chocolate\",\"light brown\", 4.75);\n",
"INSERT INTO cones VALUES (3, \"chocolate\",\"dark brown\", 5.25);\n",
"INSERT INTO cones VALUES (4, \"strawberry\",\"pink\",5.25);\n",
"INSERT INTO cones VALUES (5, \"bubblegum\",\"pink\",4.75);\n",
"INSERT INTO cones VALUES (6, \"chocolate\", \"dark brown\", 5.25);\n",
"```\n",
"\n",
"Individual columns-value relationships can be specified too.\n",
"\n",
"```\n",
"insert into cones(Id, Flavor, Color, Price) values (7, \"Vanila\", \"White\", 3.95);\n",
"```\n",
"\n",
"Columns where data is not inserted are filled with `null`.\n",
"\n",
"```\n",
"sqlite> insert into cones(Flavor, Price) values (\"Vanila\", 2.25);\n",
"sqlite> select * from cones;\n",
"Id|Flavor|Color|Price\n",
"1|strawberry|pink|3.55\n",
"2|chocolate|light brown|4.75\n",
"3|chocolate|dark brown|5.25\n",
"4|strawberry|pink|5.25\n",
"5|bubblegum|pink|4.75\n",
"6|chocolate|dark brown|5.25\n",
"7|Vanila|White|3.95\n",
"|Vanila||2.25\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" Id | Flavor | Color | Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | strawberry | pink | 3.55 | \n",
"
\n",
" \n",
" 2 | chocolate | light brown | 4.75 | \n",
"
\n",
" \n",
" 3 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
" 4 | strawberry | pink | 5.25 | \n",
"
\n",
" \n",
" 5 | bubblegum | pink | 4.75 | \n",
"
\n",
" \n",
" 6 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
" 7 | Vanila | White | 3.95 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"Id | Flavor | Color | Price\n",
"1 | strawberry | pink | 3.55\n",
"2 | chocolate | light brown | 4.75\n",
"3 | chocolate | dark brown | 5.25\n",
"4 | strawberry | pink | 5.25\n",
"5 | bubblegum | pink | 4.75\n",
"6 | chocolate | dark brown | 5.25\n",
"7 | Vanila | White | 3.95"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cones.append((7, \"Vanila\", \"White\", 3.95))\n",
"cones"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ordering\n",
"\n",
"In addition to the set of records in a table, often one wants to order them in a particular away. You've seen this with `Table.sort`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" Id | Flavor | Color | Price | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | strawberry | pink | 3.55 | \n",
"
\n",
" \n",
" 7 | Vanila | White | 3.95 | \n",
"
\n",
" \n",
" 2 | chocolate | light brown | 4.75 | \n",
"
\n",
" \n",
" 5 | bubblegum | pink | 4.75 | \n",
"
\n",
" \n",
" 3 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
" 4 | strawberry | pink | 5.25 | \n",
"
\n",
" \n",
" 6 | chocolate | dark brown | 5.25 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"Id | Flavor | Color | Price\n",
"1 | strawberry | pink | 3.55\n",
"7 | Vanila | White | 3.95\n",
"2 | chocolate | light brown | 4.75\n",
"5 | bubblegum | pink | 4.75\n",
"3 | chocolate | dark brown | 5.25\n",
"4 | strawberry | pink | 5.25\n",
"6 | chocolate | dark brown | 5.25"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Tables have an explicit order, which can be controlled\n",
"cones.sort('Price')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `select` statement has an `ORDER BY` clause.\n",
"\n",
"\n",
"```\n",
"SELECT * FROM cones ORDER BY Price;\n",
"```\n",
"
\n",
"\n",
"```\n",
"sqlite> select * from cones order by Price;\n",
"1|strawberry|pink|3.55\n",
"2|chocolate|light brown|4.75\n",
"5|bubblegum|pink|4.75\n",
"3|chocolate|dark brown|5.25\n",
"4|strawberry|pink|5.25\n",
"6|chocolate|dark brown|5.25\n",
"sqlite> 6|chocolate|dark brown|5.25\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting columns - Projection\n",
"\n",
"by default, operations on tables refer to *all* the columns.\n",
"One can *select* certain columns\n",
"\n",
"```\n",
"SELECT FROM ;\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" Flavor | Price | \n",
"
\n",
" \n",
" \n",
" \n",
" strawberry | 3.55 | \n",
"
\n",
" \n",
" chocolate | 4.75 | \n",
"
\n",
" \n",
" chocolate | 5.25 | \n",
"
\n",
" \n",
" strawberry | 5.25 | \n",
"
\n",
" \n",
" bubblegum | 4.75 | \n",
"
\n",
" \n",
" chocolate | 5.25 | \n",
"
\n",
" \n",
" Vanila | 3.95 | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"Flavor | Price\n",
"strawberry | 3.55\n",
"chocolate | 4.75\n",
"chocolate | 5.25\n",
"strawberry | 5.25\n",
"bubblegum | 4.75\n",
"chocolate | 5.25\n",
"Vanila | 3.95"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cones.select(['Flavor', 'Price'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Indexing and .column return an numpy.array, rather than a Table\n",
"\n",
"cones['Flavor']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SELECT from table\n",
"\n",
"The `SELECT` statement specifies the table explicitly. (It's not a method on an object. The DB keeps track of all the tables.)\n",
"\n",
"Specify the columns to select\n",
"\n",
"```\n",
"SELECT FROM ;\n",
"```\n",
"
\n",
"```\n",
"sqlite> .headers on\n",
"sqlite> select Flavor, Price from cones;\n",
"Flavor|Price\n",
"strawberry|3.55\n",
"chocolate|4.75\n",
"chocolate|5.25\n",
"strawberry|5.25\n",
"bubblegum|4.75\n",
"chocolate|5.25\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Permanent data\n",
"\n",
"Quit sqlite3 with the `.quit` pseuodo command. Then start it up again on the same database. Notice how everything is still there."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering rows - where"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here's the simple equality filter in Tables\n",
"\n",
"cones.where('Flavor', 'chocolate')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL where clause\n",
"\n",
"By default, `SELECT` operations on tables refer to *all* the rows.\n",
"One can operate on a subset of the rows by specifying a predicate expression in the `WHERE` clause.\n",
"\n",
"```\n",
"SELECT FROM WHERE ;\n",
"```\n",
"
\n",
"\n",
"```\n",
"sqlite> select * from cones where Flavor = \"chocolate\";\n",
"ID|Flavor|Color|Price\n",
"2|chocolate|light brown|4.75\n",
"3|chocolate|dark brown|5.25\n",
"6|chocolate|dark brown|5.25\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or an expression that produces a boolean array\n",
"\n",
"cones.where(cones[\"Price\"] > 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL:\n",
"\n",
"```\n",
"sqlite> select * from cones where Price > 5;\n",
"ID|Flavor|Color|Price\n",
"3|chocolate|dark brown|5.25\n",
"4|strawberry|pink|5.25\n",
"6|chocolate|dark brown|5.25\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approximate match\n",
"\n",
"Regular expression matches are so common that they are built in in SQL.\n",
"\n",
"```\n",
"sqlite> select * from cones where Flavor like \"%berry%\";\n",
"Flavor|Color|Price\n",
"strawberry|pink|3.55\n",
"strawberry|pink|5.25\n",
"sqlite> \n",
"```\n",
"\n",
"On the other hand, you have the full power of Python to express what you mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cones.where(cones.apply(lambda x:'berry' in x, 'Flavor'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQL Boolean operators\n",
"\n",
"SQLite understands the following binary operators, in order from highest to lowest precedence:\n",
"```\n",
"||\n",
"* / %\n",
"+ -\n",
"<< >> & |\n",
"< <= > >=\n",
"= == != <> IS IS NOT IN LIKE GLOB MATCH REGEXP\n",
"AND \n",
"OR\n",
"```\n",
"Supported unary prefix operators are these:\n",
"```\n",
"- + ~ NOT\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Grouping and Aggregation\n",
"\n",
"The GROUP BY clause is used to group rows of SELECT statement into a set of summary rows or groups based on values of columns or expressions. \n",
"\n",
"Apply an aggregate function, such as SUM, AVG, MIN, MAX or COUNT, to each group to output the summary information.\n",
"\n",
"```\n",
"SELECT FROM WHERE GROUP BY ;\n",
"```\n",
"
\n",
"\n",
"You have been applying aggregation functions to \"groups\" of rows. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# By default Tables aggregate with count\n",
"cones.group('Flavor')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analogous SQL has a `group by` clause in the select statement. The value expression specifies what aggregation operation to perform on the groups.\n",
"\n",
"```\n",
"sqlite> select count(Price), Flavor from cones group by Flavor;\n",
"count(Price)|Flavor\n",
"1|bubblegum\n",
"2|chocolate\n",
"2|strawberry\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can specify how to aggregate\n",
"\n",
"cones.group('Flavor', min)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# And select the meaningful columns\n",
"\n",
"cones.select(['Flavor', 'Price']).group('Flavor', np.mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL GROUP BY built in operators in the column expression\n",
"\n",
"```\n",
"sqlite> select min(Price), Flavor from cones group by Flavor;\n",
"min(Price)|Flavor\n",
"4.75|bubblegum\n",
"4.75|chocolate\n",
"3.55|strawberry\n",
"\n",
"sqlite> select avg(Price), Flavor from cones group by Flavor;\n",
"avg(Price)|Flavor\n",
"4.75|bubblegum\n",
"5.0|chocolate\n",
"4.4|strawberry\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GROUP operators\n",
"\n",
"The following are the most commonly used SQL aggregate functions:\n",
"\n",
"* AVG – calculates the average of a set of values.\n",
"* COUNT – counts rows in a specified table or view.\n",
"* MIN – gets the minimum value in a set of values.\n",
"* MAX – gets the maximum value in a set of values.\n",
"* SUM – calculates the sum of values.\n",
"\n",
"Notice that all aggregate functions above ignore NULL values except for the COUNT function."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SQL expression can be quite general.\n",
"\n",
"```\n",
"sqlite> .header on\n",
"sqlite> select Color, max(Price) - min(Price) from cones group by Color;\n",
"Color|max(Price) - min(Price)\n",
"dark brown|0.0\n",
"light brown|0.0\n",
"pink|1.7\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unique values\n",
"\n",
"Being able to determine the unique values in a column or collection of columns in common enough working with data that SQl builds it in. This is a different sort of aggregation that does not involve groups.\n",
"\n",
"```\n",
"select distinct Flavor from cones;\n",
"select distinct Flavor, Color from cones;\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.unique(cones['Flavor'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cones.groups(['Flavor', 'Color']).drop('count')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Joining multiple tables\n",
"\n",
"Information in different tables can be brought together using *join* operations. You've had a lot of experience with this already.\n",
"\n",
"Let's imagine that there was another table associated with the cashier's transations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sales = Table(['Cashier', 'Tid']).with_rows([\n",
" ('Baskin', 1),\n",
" ('Robin', 2),\n",
" ('Baskin', 3),\n",
" ('Baskin', 4),\n",
" ('Robin', 5),\n",
" ('Robin', 6)\n",
"])\n",
"sales"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL - create a similar table\n",
"\n",
"```\n",
"create table sales as\n",
" select \"Baskin\" as Cashier, 1 as Tid union\n",
" select \"Baskin\", 3 union\n",
" select \"Baskin\", 4 union\n",
" select \"Robin\", 2 union\n",
" select \"Robin\", 5 union\n",
" select \"Robin\", 6;\n",
"```\n",
"Paste this into sqlite3\n",
"```\n",
"sqlite> .tables\n",
"cones sales\n",
"sqlite> select * from sales;\n",
"Cashier|Tid\n",
"Baskin|1\n",
"Baskin|3\n",
"Baskin|4\n",
"Robin|2\n",
"Robin|5\n",
"Robin|6\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The JOIN of these two tables around their related columns\n",
"sales.join('Tid', cones, 'Id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQL join \n",
"\n",
"In SQL the join operation is easily expressed by multiple `FROM` tables, but it is not the same as the \"inner join\" you are familiar with.\n",
"\n",
"```\n",
"SELECT * FROM sales, cones;\n",
"```\n",
"produces the cross-product (outer join) of all the rows of `sales` and the rows of `cones`. We want the ones with matching `Id`'s. This is expressed through the `WHERE` clause.\n",
"\n",
"```\n",
"SELECT * FROM sales, cones WHERE Tid=Id;\n",
"```\n",
"\n",
"Try it out.\n",
"
\n",
"```\n",
"sqlite> .tables\n",
"cones sales\n",
"sqlite> SELECT * FROM sales, cones WHERE Tid=Id;\n",
"Cashier|Tid|Id|Flavor|Color|Price\n",
"Baskin|1|1|strawberry|pink|3.55\n",
"Baskin|3|3|chocolate|dark brown|5.25\n",
"Baskin|4|4|strawberry|pink|5.25\n",
"Robin|2|2|chocolate|light brown|4.75\n",
"Robin|5|5|bubblegum|pink|4.75\n",
"Robin|6|6|chocolate|dark brown|5.25\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Refining the join description\n",
"\n",
"The names of tables can be used to disambiguate, say if the columns have the same names in different tables.\n",
"\n",
"```\n",
"select * from sales, cones where sales.Tid=cones.Id;\n",
"```\n",
"\n",
"And, we can tailor the result with the expression.\n",
"\n",
"```\n",
"select Flavor, Color, Price, Cashier, Id from sales, cones where sales.Tid=cones.Id;\n",
"```\n",
"\n",
"And we could create this as a new, permanent table.\n",
"\n",
"```\n",
"create table full as \n",
" select Flavor, Color, Price, Cashier, ID from sales, cones where sales.Tid=cones.Id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating and droping tables\n",
"\n",
"```\n",
"DROP TABLE \n",
"```\n",
"For example, create a new table of the join and drop it.\n",
"\n",
"```\n",
"sqlite> create table full as \n",
" ...> select Flavor, Color, Price, Cashier, ID from sales, cones where sales.TID=cones.ID;\n",
"sqlite> .tables\n",
"cones full sales\n",
"sqlite> select * from full;\n",
"Flavor|Color|Price|Cashier|ID\n",
"strawberry|pink|3.55|Baskin|1\n",
"chocolate|dark brown|5.25|Baskin|3\n",
"strawberry|pink|5.25|Baskin|4\n",
"chocolate|light brown|4.75|Robin|2\n",
"bubblegum|pink|4.75|Robin|5\n",
"chocolate|dark brown|5.25|Robin|6\n",
"```\n",
"\n",
"```\n",
"sqlite> drop table full;\n",
"sqlite> .tables\n",
"cones sales\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naming result columns\n",
"\n",
"```\n",
"SELECT as FROM WHERE GROUP BY \n",
"\n",
"```\n",
"Create a table with a `Taste` column with two values: \"delicious\" and \"other\".\n",
"\n",
"```\n",
"select \"delicious\" as Taste, Flavor, Color from cones where Flavor is \"chocolate\" union\n",
"select \"other\", Flavor, Color from cones where Flavor is not \"chocolate\";\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subqueries\n",
"\n",
"The statement that results in a table can be placed where a table name can appear. \n",
"```\n",
"SELECT FROM ;\n",
"```\n",
"Here's a silly example:\n",
"```\n",
"sqlite> select Flavor, Color from (select * from cones where Flavor is \"chocolate\");\n",
"chocolate|light brown\n",
"chocolate|dark brown\n",
"chocolate|dark brown\n",
"```\n",
"More usefully \n",
"```\n",
"select TID from sales where Cashier is \"Baskin\";\n",
"```\n",
"Could be used in\n",
"```\n",
"sqlite> select * from cones\n",
" ...> where ID in (select TID from sales where Cashier is \"Baskin\");\n",
"ID|Flavor|Color|Price\n",
"1|strawberry|pink|3.55\n",
"3|chocolate|dark brown|5.25\n",
"4|strawberry|pink|5.25\n",
"```\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Programming language APIs\n",
"\n",
"Database tables area seldom accessed from a command line as we've done in our interactive session. They are used by passing queries to the database interpreter - typically underneath some application. So, they are accessed from within the programming language via an *Application Programming Iterface* or *API*.\n",
"\n",
"Here's an example of the python API to sqlite3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sqlite3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"icecream = sqlite3.connect('icecream.db')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"icecream.execute('SELECT * FROM cones;')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"icecream.execute('SELECT DISTINCT Flavor FROM cones;').fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"icecream.execute('SELECT * FROM cones WHERE Flavor is \"chocolate\";').fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Be careful about filling in parameters of the query casually\n",
"# as it may be open SQL injection attacks\n",
"\n",
"col = \"Flavor\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This looks like it is providing just the list of flavors\n",
"\n",
"icecream.execute('SELECT DISTINCT %s FROM cones;' % col).fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# What if an attacker managed this? Go back and evaluate the query above\n",
"col = \"*\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query parameter substitution allows the parameters to be filled in\n",
"# at the database query processor.\n",
"\n",
"icecream.execute('SELECT * FROM cones WHERE Flavor is ?;', (\"chocolate\",)).fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Building abstractions that hide the database\n",
"def flavor(db, flv):\n",
" return db.execute('SELECT * FROM cones WHERE Flavor is ?;', (flv,)).fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flavor(icecream, \"chocolate\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQL Tables - a new class\n",
"\n",
"Let's derive a class from Tables that allows Tables to be read or written to a database.\n",
"\n",
"This illustrates:\n",
"* Class inheritance to extend functionality\n",
"* Use of helper functions to make coding clear and abstract messy details of a particular implementation.\n",
"* Exception handling to deal with behavior of underlying module\n",
"* classmethods and instance methods"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# sqlite3 Helper functions for the SQL_Table class\n",
"\n",
"import sqlite3\n",
"def build_list(s):\n",
" \"\"\"Return string representing column name tuple.\"\"\"\n",
" res = \"(\"\n",
" for v in s[:-1]:\n",
" res = res + v + \", \"\n",
" res = res + s[-1] + \")\"\n",
" return res\n",
"\n",
"def build_q(n):\n",
" \"\"\"Return string representing parameter substitution tuple.\"\"\"\n",
" return \"(\" + \"?, \"*(n-1) + \"?)\"\n",
"\n",
"def sqltable_exists(dbconn, table):\n",
" \"\"\"Return schema as SQL create statement if table exists.\"\"\"\n",
" create = dbconn.execute('select sql from sqlite_master where name=?;', \n",
" (table,)).fetchall()\n",
" return create[0][0] if create else False\n",
"\n",
"def sqlcol_names(dbconn, table):\n",
" \"\"\"Return list of column names of a SQL table.\"\"\"\n",
" cursor = dbconn.execute('SELECT * FROM %s LIMIT 1;' % table)\n",
" col_names = [col[0] for col in cursor.description]\n",
" return col_names\n",
"\n",
"def sqlexec(dbconn, query, verbose=False):\n",
" \"\"\"Wrap verbose processing around SQL query.\"\"\"\n",
" if verbose:\n",
" print(query)\n",
" return dbconn.execute(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL_Table Class"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SQL_Table(Table):\n",
" \"\"\" Extend Table class with methods to read/write a Table\n",
" from/to a table in a SQLite3 database.\n",
" \"\"\"\n",
" @classmethod\n",
" def read(cls, filepath, table, verbose=False):\n",
" \"\"\"Create a SQL_Table by reading a table from a SQL database.\"\"\"\n",
" \n",
" dbconn = sqlite3.connect(filepath, \n",
" detect_types=sqlite3.PARSE_COLNAMES)\n",
" \n",
" col_names = sqlcol_names(dbconn, table)\n",
" rows = sqlexec(dbconn,'SELECT * from %s;' % table, verbose).fetchall()\n",
" dbconn.close()\n",
" return cls(col_names).with_rows(rows)\n",
" \n",
" def write(self, filepath, table, verbose=False, overwrite=True):\n",
" \"\"\"Write a Table into a SQL database as a SQL table.\"\"\"\n",
" \n",
" dbconn = sqlite3.connect(filepath)\n",
" # If a table of this name exists, set it aside\n",
" old_table = sqltable_exists(dbconn, table)\n",
" if overwrite and old_table:\n",
" backup = table+\"_bkp\"\n",
" if sqltable_exists(dbconn, backup):\n",
" sqlexec(dbconn, 'DROP TABLE %s;' % backup, verbose)\n",
" sqlexec(dbconn, 'ALTER TABLE %s RENAME TO %s;' % (table, backup), verbose)\n",
"\n",
" # Create table and insert each row\n",
" cols = build_list(self.labels)\n",
" sqlexec(dbconn, \"CREATE TABLE %s %s;\" % (table, cols), verbose)\n",
" for row in self.rows: \n",
" sqlexec(dbconn, 'INSERT INTO %s VALUES %s;' % (table, tuple(row)), verbose)\n",
" dbconn.commit()\n",
" \n",
" # clean up table, if was set aside\n",
" if overwrite and old_table:\n",
" try:\n",
" sqlexec(dbconn, 'DROP TABLE %s;' % backup, verbose)\n",
" except:\n",
" print('Unable to drop %s', backup)\n",
" dbconn.close()\n",
" \n",
" @classmethod\n",
" def cast(cls, table):\n",
" \"\"\"Return a SQL_Table version of a Table.\"\"\"\n",
" return cls().with_columns(zip(table.labels, table.columns))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mycones = SQL_Table.read(\"icecream.db\", \"cones\", verbose=True)\n",
"mycones"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(mycones)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = mycones.write(\"test.db\", \"cones\", verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SQL_Table.read(\"test.db\", \"cones\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"scones = SQL_Table.cast(cones)\n",
"scones.write(\"cones.db\", \"cones\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SQL_Table.read(\"cones.db\", \"cones\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A larger example - NBA Players"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Table constructor method\n",
"\n",
"# NBA player data from 15-16 season\n",
"nba = SQL_Table.read_table('nba_salaries.csv')\n",
"nba"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SQL equivalent\n",
"\n",
"sqlite3 non-standard way to create a table from a csv file\n",
"\n",
"```\n",
".mode csv\n",
".import nba_salaries.csv nba_raw\n",
"```\n",
"\n",
"Now we have a real table to work with\n",
"```\n",
"sqlite>.tables\n",
"nba\n",
"sqlite> .headers on\n",
"sqlite> select * from nba_raw limit 10;\n",
"```\n",
"\n",
"Caution: Here we have specified `limit` to reduce the size of the output. Unlike the `__repr__` of `Table`, this is actually limiting the size of the returned table value. In non-interactive use, you would seldom want to do this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nba.write(\"nba_salaries.db\", \"nba\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nba_db = sqlite3.connect(\"nba_salaries.db\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba.select(['PLAYER', 'POSITION', 'TEAM'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SQL equivalent\n",
"\n",
"Specify columns by sequence of names separated by \",\", instead of \"*\".\n",
"\n",
"```\n",
"select PLAYER, POSITION, TEAM from nba_raw limit 10;\n",
"\n",
"sqlite> select PLAYER, POSITION, TEAM from nba limit 10;\n",
"PLAYER,POSITION,TEAM\n",
"\"Paul Millsap\",PF,\"Atlanta Hawks\"\n",
"\"Al Horford\",C,\"Atlanta Hawks\"\n",
"\"Tiago Splitter\",C,\"Atlanta Hawks\"\n",
"\"Jeff Teague\",PG,\"Atlanta Hawks\"\n",
"\"Kyle Korver\",SG,\"Atlanta Hawks\"\n",
"\"Thabo Sefolosha\",SF,\"Atlanta Hawks\"\n",
"\"Mike Scott\",PF,\"Atlanta Hawks\"\n",
"\"Kent Bazemore\",SF,\"Atlanta Hawks\"\n",
"\"Dennis Schroder\",PG,\"Atlanta Hawks\"\n",
"\"Tim Hardaway Jr.\",SG,\"Atlanta Hawks\"\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba_db.execute(\"select PLAYER, POSITION, TEAM from nba;\").fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba.where('TEAM', 'Atlanta Hawks')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SQL equivalent\n",
"\n",
"```\n",
"select * from nba_raw where TEAM = \"Atlanta Hawks\";\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba_db.execute('select * from nba where TEAM = \"Atlanta Hawks\";').fetchall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nba.where(nba['SALARY'] > 15)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba_db.execute('select PLAYER, SALARY from nba where SALARY > 15').fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring this in the sqlite3 session\n",
"\n",
"Try this using the nba_raw table in the nba_salaries database imported as above. You will see some of the common subtleties of data types.\n",
"\n",
"```\n",
"select PLAYER, SALARY from nba_raw where SALARY > 10 limit 10;\n",
"\n",
"select PLAYER, SALARY from nba_raw where cast(SALARY as int) > 10 limit 10;\n",
"```\n",
"Why isn't this working?\n",
"\n",
"At the command line we could do\n",
"```\n",
".schema raw_nba\n",
"```\n",
"But within SQL itself we can get to this information because sqlite3\n",
"holds its own information in (hidden) tables.\n",
"\n",
"```\n",
"select * from sqlite_master;\n",
"select sql from sqlite_master where name=\"nba_raw\";\n",
"```\n",
"We can create a new table from an existing one and get the type specified in the process.\n",
"\n",
"```\n",
"create table nba as\n",
" select PLAYER, POSITION, TEAM, cast(SALARY as REAL) as SALARY from nba_raw;\n",
" \n",
"select * from nba where SALARY > 15 order by SALARY;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LIKE operator\n",
"\n",
"```\n",
"select * from nba where TEAM like \"Atlanta%\";\n",
"```\n",
"Try this out\n",
"```\n",
"sqlite> select * from nba where TEAM like \"Atlanta%\";\n",
"PLAYER,POSITION,TEAM,SALARY\n",
"\"Paul Millsap\",PF,\"Atlanta Hawks\",18.671659\n",
"\"Al Horford\",C,\"Atlanta Hawks\",12.0\n",
"\"Tiago Splitter\",C,\"Atlanta Hawks\",9.75625\n",
"\"Jeff Teague\",PG,\"Atlanta Hawks\",8.0\n",
"\"Kyle Korver\",SG,\"Atlanta Hawks\",5.746479\n",
"\"Thabo Sefolosha\",SF,\"Atlanta Hawks\",4.0\n",
"\"Mike Scott\",PF,\"Atlanta Hawks\",3.333333\n",
"\"Kent Bazemore\",SF,\"Atlanta Hawks\",2.0\n",
"\"Dennis Schroder\",PG,\"Atlanta Hawks\",1.7634\n",
"\"Tim Hardaway Jr.\",SG,\"Atlanta Hawks\",1.3045200000000001\n",
"\"Walter Tavares\",C,\"Atlanta Hawks\",1.0\n",
"\"Jason Richardson\",SG,\"Atlanta Hawks\",0.947276\n",
"\"Lamar Patterson\",SG,\"Atlanta Hawks\",0.525093\n",
"\"Terran Petteway\",SG,\"Atlanta Hawks\",0.525093\n",
"sqlite> \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nba_db.execute('select * from nba where TEAM like \"Atlanta%\";').fetchall()"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}