{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "NOTE:\n", "-----\n", "\n", "Please run the below cells first before proceeding- you'll need them soon!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%load_ext sql\n", "%sql sqlite:///dataset_1.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Activity 4-2\n", "------------\n", "Aggregation operators, GROUP BY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise #1\n", "-----------\n", "\n", "Consider a set of tables that describe the up-and-coming bagel startup industry; for now let's just look at two tables here, `bagel`, which describes types of bagels made by the different bagel companies:\n", "> * name STRING\n", "> * price FLOAT\n", "> * made_by STRING\n", "\n", "And `purchase`:\n", "> * bagel_name STRING\n", "> * franchise STRING\n", "> * date INT\n", "> * quantity INT\n", "> * purchaser_age INT\n", "\n", "Where `purchase.bagel_name` references `bagel.name` and `purchase.franchise` references `bagel.made_by`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%sql SELECT * FROM bagel LIMIT 3;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%sql SELECT * FROM purchase LIMIT 3;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you write a query to get the _total revenue_ for each bagel type **which had an average purchaser age over 18**? Type your query below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///dataset_1.db\n", "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameRevenue
Egg with shmear14.34
Plain with shmear84.51
" ], "text/plain": [ "[(u'Egg with shmear', 14.34), (u'Plain with shmear', 84.50999999999999)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Expected output below\n", "\n", "Don't re-execute this cell!\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise #2\n", "-----------\n", "\n", "Here we'll use a simplified version of the `precipitation_full` table, which just has _daily_ rainfall _in CA only_, and has the following schema:\n", "\n", "> * station_id\n", "> * day\n", "> * precipitation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%sql SELECT * FROM precipitation LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to get station_ids which have average precipitations > 75. Try doing this first as a nested query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///dataset_1.db\n", "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Expected output below\n", "\n", "Don't re-execute this cell!\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, try re-writing as a GROUP BY:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///dataset_1.db\n", "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Expected output below\n", "\n", "Don't re-execute this cell!\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now time it by using `%time` followed by single-line versions of your queries above (clunky, but will work) to see how they compare!\n", "\n", "**Note:** Yes, currently the answers are filled in below for convenience... but you should still try getting them on your own above!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time %sql SELECT DISTINCT p.station_id FROM precipitation p WHERE (SELECT AVG(precipitation) FROM precipitation WHERE station_id = p.station_id) > 75;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time %sql SELECT p.station_id FROM precipitation p GROUP BY p.station_id HAVING AVG(p.precipitation) > 75;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**An ~ 10-20x difference in execution time!!**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 1 }