{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Lecture 4: SQL\n", "======================" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Set Operators\n", "=======\n", "We generate three tables below:\n", "* R is {1,2,3,4,5}\n", "* S is {}\n", "* T is {1,4,7,10}\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "Done.\n", "Done.\n", "Done.\n", "Done.\n", "Done.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n" ] } ], "source": [ "# Create tables & insert some random numbers\n", "# Note: in Postgresql, try the generate_series function...\n", "%sql DROP TABLE IF EXISTS R; DROP TABLE IF EXISTS S; DROP TABLE IF EXISTS T;\n", "%sql CREATE TABLE R (A int); CREATE TABLE S (A int); CREATE TABLE T (A int);\n", "for i in range(1,6):\n", " %sql INSERT INTO R VALUES (:i)\n", "for i in range(1,11,3):\n", " %sql INSERT INTO T VALUES (:i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's try to get $R \\cap (S \\cup T) = \\{1,4\\}$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", "
A
" ], "text/plain": [ "[]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql SELECT DISTINCT R.A FROM R, S, T\n", "WHERE R.A=S.A OR R.A=T.A" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Why is the return set empty?**\n", "\n", "*Recall the order of operations for this query:*\n", "1. We take the cross-product of `R,S,T`\n", "2. We filter this intermediate table by the conditions of the `WHERE` clause. \n", "\n", "Let's see (1):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", "
A
" ], "text/plain": [ "[]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT DISTINCT R.A FROM R, S, T;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The cross-product query is empty because `S` is empty!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Union\n", "-----\n", "\n", "Let's try a different way using a new SQL command, `UNION`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
1
4
" ], "text/plain": [ "[(1,), (4,)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT R.A FROM R, S WHERE R.A=S.A\n", "UNION -- this is an explicit keyword!\n", "SELECT R.A FROM R, T WHERE R.A=T.A" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* There are no duplicates (union is a set)\n", "* If we wanted duplicates, we could use UNION ALL\n", " * R = {1,2,3,4,5}\n", " * S = {1,2,3,4,5}\n", " * T = {1,4,7,10}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "Done.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n", "1 rows affected.\n" ] } ], "source": [ "%sql DROP TABLE IF EXISTS S; CREATE TABLE S (A int);\n", "for i in range(1,6):\n", " %sql INSERT INTO S VALUES (:i)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
1
2
3
4
5
1
4
" ], "text/plain": [ "[(1,), (2,), (3,), (4,), (5,), (1,), (4,)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "-- UNION ALL example- notice that 1 and 4 occur twice!\n", "SELECT R.A FROM R, S WHERE R.A=S.A\n", "UNION ALL\n", "SELECT R.A FROM R, T WHERE R.A=T.A" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Other Set Operations: INTERSECT, EXCEPT\n", "-------------------------------------\n", "\n", "Two examples below... see more in the next section" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
1
4
" ], "text/plain": [ "[(1,), (4,)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT R.A FROM R, S, T WHERE R.A = S.A\n", "INTERSECT\n", "SELECT R.A FROM R, S, T WHERE R.A = T.A" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2
3
5
" ], "text/plain": [ "[(2,), (3,), (5,)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT R.A FROM R, S, T WHERE R.A = S.A\n", "EXCEPT\n", "SELECT R.A FROM R, S, T WHERE R.A = T.A" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "Running example for rest of this notebook: BAGELS\n", "===========================\n", "\n", "The selection of which had nothing to do with how hungry the person writing this section was.\n", "\n", "Outline of topics:\n", "* Nested queries: motivation & examples\n", "* Aggregation: Overview, with group-by\n", "* Null Values, Outer vs. Inner Joins" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Story:\n", "------\n", "* *eBagel* is a ground-breaking new tech startup in the red-hot NoSQL-based bagel enterprise space\n", "* eBagel has just received $100M in venture capital funding. However, their sales numbers are dropping off, and you have been brought in to analyze the data and figure out what's going wrong" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's begin by populating the following tables:\n", "> Franchise(name TEXT, db_type TEXT)\n", "\n", "> Store(franchise TEXT, location TEXT)\n", "\n", "> Bagel(name TEXT, price MONEY, made_by TEXT)\n", "\n", "> Purchase(bagel_name TEXT, franchise TEXT, date INT, quantity INT, purchaser_age INT)\n", "\n", "Let's quickly load the data first" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Union\n", "-----\n", "\n", "First, let's find franchises that have locations in PA **OR** NYC to see who the potential competition is:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
franchise
BAGEL CORP
Bobs Bagels
eBagel
" ], "text/plain": [ "[(u'BAGEL CORP',), (u'Bobs Bagels',), (u'eBagel',)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql \n", "SELECT franchise FROM store WHERE location = 'NYC'\n", "UNION\n", "SELECT franchise FROM store WHERE location = 'PA';" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Intersect: Subtle problem...\n", "--------------------------\n", "\n", "eBagel's CEO is curious to know more about the back-end technologies of Bagel companies that have successfully spread across multiple locations. Let's try to use an `INTERSECT` operator to find the database types of franchises which have stores in PA **AND** NYC:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
MySQL
NoSQL
" ], "text/plain": [ "[(u'MySQL',), (u'NoSQL',)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT f.db_type\n", "FROM franchise f, store s \n", "WHERE f.name = s.franchise AND s.location = 'NYC'\n", "INTERSECT\n", "SELECT f.db_type\n", "FROM franchise f, store s \n", "WHERE f.name = s.franchise AND s.location = 'PA'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "*What happened here?*\n", "\n", "If we look at the data, we should only have gotten \"MySQL\" as a result:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelocationdb_type
Bobs BagelsNYCNoSQL
eBagelPANoSQL
BAGEL CORPChicagoMySQL
BAGEL CORPNYCMySQL
BAGEL CORPPAMySQL
" ], "text/plain": [ "[(u'Bobs Bagels', u'NYC', u'NoSQL'),\n", " (u'eBagel', u'PA', u'NoSQL'),\n", " (u'BAGEL CORP', u'Chicago', u'MySQL'),\n", " (u'BAGEL CORP', u'NYC', u'MySQL'),\n", " (u'BAGEL CORP', u'PA', u'MySQL')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql \n", "SELECT f.name, s.location, f.db_type\n", "FROM franchise f, store s \n", "WHERE f.name = s.franchise;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So what happened? We can see why this occurs by breaking the query operation down:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
NoSQL
MySQL
" ], "text/plain": [ "[(u'NoSQL',), (u'MySQL',)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql SELECT f.db_type \n", "FROM franchise f, store s \n", "WHERE f.name = s.franchise AND s.location = 'NYC'" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
NoSQL
MySQL
" ], "text/plain": [ "[(u'NoSQL',), (u'MySQL',)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql SELECT f.db_type\n", "FROM franchise f, store s\n", "WHERE f.name = s.franchise AND s.location = 'PA'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Essentially, the problem is that we did the `INTERSECT` operation over the return attributes, not the attributes we actually wanted to use!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Nested queries\n", "========\n", "\n", "Instead, let's see one solution to the above problem that uses *nested queries*: i.e. subqueries that return relations which our main query then operates over:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
MySQL
" ], "text/plain": [ "[(u'MySQL',)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT f.db_type\n", "FROM franchise f\n", "WHERE f.name IN (\n", " SELECT s.franchise FROM store s WHERE s.location = 'NYC')\n", " AND f.name IN (\n", " SELECT s.franchise FROM store s WHERE s.location = 'PA');" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Another nested queries example: eBagel's CEO wants to know what databases support bagel companies that appeal to to the crucial twenty-something demographic:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
NoSQL
MySQL
" ], "text/plain": [ "[(u'NoSQL',), (u'MySQL',)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT f.db_type\n", "FROM franchise f\n", "WHERE f.name IN (\n", " SELECT b.made_by\n", " FROM bagel b, purchase p\n", " WHERE b.name = p.bagel_name \n", " AND p.purchaser_age >= 20 AND p.purchaser_age < 30);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Nested queries are powerful and convenient! Could we accomplish the above without them however? Would this return the same result?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_type
MySQL
NoSQL
MySQL
NoSQL
" ], "text/plain": [ "[(u'MySQL',), (u'NoSQL',), (u'MySQL',), (u'NoSQL',)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT f.db_type\n", "FROM franchise f, bagel b, purchase p\n", "WHERE f.name = b.made_by \n", " AND b.name = p.bagel_name \n", " AND p.purchaser_age >= 20 AND p.purchaser_age < 30;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Beware of duplicates!** We can add a `DISTINCT` however and they will be equivalent- try this for yourself." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also use the following over the results of nested subqueries:\n", "* `ALL`\n", "* `ANY`\n", "* `EXISTS`\n", "\n", "**Unfortunately, `ALL` and `ANY` are not supported in SQLite (the DBMS we're using here)- see the lecture notes for examples.** However we can show an example of `EXISTS`. Suppose we want to answer the following question: does eBagel have *any* products which are cheaper than *any* of its competitors'?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameprice
eBagel Expansion Pack1.99
" ], "text/plain": [ "[(u'eBagel Expansion Pack', 1.99)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT b.name, b.price\n", "FROM bagel b\n", "WHERE b.made_by = 'eBagel'\n", " AND EXISTS (SELECT name FROM bagel WHERE made_by <> 'eBagel' AND price > b.price);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that this is a *correlated query* because values from the main outer query are involved in the inner nested query. Especially take notice of the variable scoping here. Note also that this can be expressed as a single SFW query- can you figure out how?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Summary so far:\n", "--------------\n", "\n", "SQL:\n", "* Provides a high-level declarative language for manipulating data (DML)\n", "* The workhorse is the SFW block\n", "* Set operators are powerful- but have some subtleties to be aware of!\n", "* Powerful, nested queries are also allowed\n", " * **Note that this becomes especially helpful when using a more full-featured DBMS than SQLite!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregation\n", "=======\n", "\n", "SQL supports several aggregation operations:\n", "* `SUM`\n", "* `COUNT`\n", "* `AVG`\n", "* `MIN`\n", "* `MAX`\n", "\n", "*Except `COUNT`*, all aggregation operators apply to a single attribute only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some simple examples:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
AVG(price)
14.99
" ], "text/plain": [ "[(14.989999999999998,)]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT AVG(price) FROM bagel WHERE made_by = 'eBagel';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can define the attribute labels of the output schema, which is convenient for aggregation operations:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
Number of Stores in PA
2
" ], "text/plain": [ "[(2,)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT COUNT(*) AS \"Number of Stores in PA\" FROM store WHERE location = 'PA';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we want to know how many total locations there are of relevance in the Bagel 2.0 space?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
COUNT(location)
5
" ], "text/plain": [ "[(5,)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT COUNT(location) FROM store;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we actually wanted:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
COUNT(DISTINCT location)
3
" ], "text/plain": [ "[(3,)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT COUNT(DISTINCT location) FROM store;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, you've purchased some industry sales data, and want to get a sense of how big the bagel space is exactly. How can we do this easily and efficiently, *all in SQL*?" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
net_sales
432.99
" ], "text/plain": [ "[(432.99,)]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT SUM(b.price * p.quantity) AS net_sales\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a huge market opportunity! Now, eBagel's CEO is quite excited, and wants to see more detail, such as how the distribution across companies breaks down. Luckily, **grouping and aggregation can be combined!**" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
made_byrevenue
BAGEL CORP12.87
Bobs Bagels85.98
eBagel272.63
" ], "text/plain": [ "[(u'BAGEL CORP', 12.87), (u'Bobs Bagels', 85.98), (u'eBagel', 272.63)]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT b.made_by, SUM(b.price * p.quantity) AS revenue\n", "FROM bagel b, purchase p\n", "WHERE b.made_by = p.franchise AND b.name = p.bagel_name\n", "GROUP BY b.made_by;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are exciting results!\n", "\n", "Finally, to cap off the analysis, you want to see the distribution of revenue by product. *However*, in order to paint a more effective high-level picture, you only want to see best-sellers, which you define as products with more that a dozen total sales.\n", "\n", "To do this we introduce the `HAVING` clause, which conditions *on aggregates*" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesales
Plain with shmear84.51
eBagel Expansion Pack272.63
" ], "text/plain": [ "[(u'Plain with shmear', 84.50999999999999), (u'eBagel Expansion Pack', 272.63)]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT b.name, SUM(b.price * p.quantity) AS sales\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n", "GROUP BY b.name\n", "HAVING SUM(p.quantity) > 12;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break down this last query into its actual evaluation steps:\n", "\n", "First, the FROM-WHERE segment of the clause is evaluated:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namepricemade_bybagel_namefranchisedatequantitypurchaser_age
Plain with shmear1.99Bobs BagelsPlain with shmearBobs Bagels11228
Plain with shmear1.99Bobs BagelsPlain with shmearBobs Bagels424None
Egg with shmear2.39Bobs BagelsEgg with shmearBobs Bagels2647
eBagel Expansion Pack1.99eBageleBagel Expansion PackeBagel11375
Plain with shmear0.99BAGEL CORPPlain with shmearBAGEL CORP21224
Plain with shmear0.99BAGEL CORPPlain with shmearBAGEL CORP3117
" ], "text/plain": [ "[(u'Plain with shmear', 1.99, u'Bobs Bagels', u'Plain with shmear', u'Bobs Bagels', 1, 12, 28),\n", " (u'Plain with shmear', 1.99, u'Bobs Bagels', u'Plain with shmear', u'Bobs Bagels', 4, 24, None),\n", " (u'Egg with shmear', 2.39, u'Bobs Bagels', u'Egg with shmear', u'Bobs Bagels', 2, 6, 47),\n", " (u'eBagel Expansion Pack', 1.99, u'eBagel', u'eBagel Expansion Pack', u'eBagel', 1, 137, 5),\n", " (u'Plain with shmear', 0.99, u'BAGEL CORP', u'Plain with shmear', u'BAGEL CORP', 2, 12, 24),\n", " (u'Plain with shmear', 0.99, u'BAGEL CORP', u'Plain with shmear', u'BAGEL CORP', 3, 1, 17)]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT *\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(*Aside: what would we have gotten if we had left out `AND b.made_by = p.franchise` in the WHERE clause? Breaking down a complex query into its component steps is the best way to debug things like this!*)\n", "\n", "Next, the GROUP-BY is applied:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namepricesmade_bysbagel_namefranchisesdatesquantitiespurchaser_ages
Egg with shmear2.39Bobs BagelsEgg with shmearBobs Bagels2647
Plain with shmear1.99,1.99,0.99,0.99Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORPPlain with shmearBobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP1,4,2,312,24,12,128,24,17
eBagel Expansion Pack1.99eBageleBagel Expansion PackeBagel11375
" ], "text/plain": [ "[(u'Egg with shmear', u'2.39', u'Bobs Bagels', u'Egg with shmear', u'Bobs Bagels', u'2', u'6', u'47'),\n", " (u'Plain with shmear', u'1.99,1.99,0.99,0.99', u'Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP', u'Plain with shmear', u'Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP', u'1,4,2,3', u'12,24,12,1', u'28,24,17'),\n", " (u'eBagel Expansion Pack', u'1.99', u'eBagel', u'eBagel Expansion Pack', u'eBagel', u'1', u'137', u'5')]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT \n", " b.name,\n", " GROUP_CONCAT(b.price, ',') AS prices,\n", " GROUP_CONCAT(b.made_by, ',') AS made_bys,\n", " bagel_name, \n", " GROUP_CONCAT(p.franchise, ',') AS franchises,\n", " GROUP_CONCAT(p.date, ',') AS dates,\n", " GROUP_CONCAT(p.quantity, ',') AS quantities,\n", " GROUP_CONCAT(p.purchaser_age, ',') AS purchaser_ages\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n", "GROUP BY b.name;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note that the `GROUP_CONCAT` functions above are just for nicer display, as otherwise SQLite will choose a single arbitrary relation's values to display for the column in such a query*\n", "\n", "Next, the `HAVING` condition is applied:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namepricesmade_bysbagel_namefranchisesdatestotal_quantitypurchaser_ages
Plain with shmear1.99,1.99,0.99,0.99Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORPPlain with shmearBobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP1,4,2,34928,24,17
eBagel Expansion Pack1.99eBageleBagel Expansion PackeBagel11375
" ], "text/plain": [ "[(u'Plain with shmear', u'1.99,1.99,0.99,0.99', u'Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP', u'Plain with shmear', u'Bobs Bagels,Bobs Bagels,BAGEL CORP,BAGEL CORP', u'1,4,2,3', 49, u'28,24,17'),\n", " (u'eBagel Expansion Pack', u'1.99', u'eBagel', u'eBagel Expansion Pack', u'eBagel', u'1', 137, u'5')]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT \n", " b.name,\n", " GROUP_CONCAT(b.price, ',') AS prices,\n", " GROUP_CONCAT(b.made_by, ',') AS made_bys,\n", " bagel_name, \n", " GROUP_CONCAT(p.franchise, ',') AS franchises,\n", " GROUP_CONCAT(p.date, ',') AS dates,\n", " SUM(p.quantity) AS total_quantity,\n", " GROUP_CONCAT(p.purchaser_age, ',') AS purchaser_ages\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n", "GROUP BY b.name\n", "HAVING SUM(p.quantity) > 12;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, the aggregates and other operations in the select clause (projection) are applied:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesales
Plain with shmear84.51
eBagel Expansion Pack272.63
" ], "text/plain": [ "[(u'Plain with shmear', 84.50999999999999), (u'eBagel Expansion Pack', 272.63)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT b.name, SUM(b.price * p.quantity) AS sales\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n", "GROUP BY b.name\n", "HAVING SUM(p.quantity) > 12;" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 1 }