class: center, middle, inverse, title-slide # Module 2 - Session 2 - Data exploration ## Working effectively with data ### CivicDataLab ### 2021/08/11 (updated: 2021-08-12) ---
# Exercise - Web Scraping + Data Exploration - [Link](https://nalsa.gov.in/dashboard/) to NALSA dashboard - Create a CSV file with variables available under the Victim Compensation Schemes table for these states: - Delhi - Maharashtra - Karnataka - West Bengal - Uttar Pradesh - Create a chart to compare the yearly compensation numbers between these states - Create a folder [here](https://drive.google.com/drive/u/0/folders/1W1t0j1NETZDQ5F7j7Ts6tDNiMQUTKGXD) and upload the dataset (including the chart) [Worksheet Link](https://docs.google.com/spreadsheets/d/1IEit4GN7w_d-a855qpa3_eq_J7PnVV5wL0OGu1_OXjo/edit#gid=199180986) --- class: center, middle # Working with databases --- # Why to use a database ? - Dealing with large datasets - Platform agnostic - Programming language agnostic - Easy to share and maintain as compared to storing data as multiple data files --- # A relational database - Data stored as tables - Each row in the table is a record with a unique ID called the key (Primary Key). - The columns of the table hold attributes of the data, and each record usually has a value for each attribute. - Uses [SQL (Structured Query Language)](https://www.w3schools.com/sql/) to query ( _storing_, _manipulating_, _retrieving_) data --- # Database terminologies 1. **Schema** - _A database schema is the design of tables, columns, relations, and constraints that make up a logically distinct section of a database_. 2. **Key** - _A key is a database field whose purpose is to uniquely identify a record_. Type off keys: - **Candidate Key** - _The set of columns that can each uniquely identify a record and from which the primary key is chosen._ - **Primary Key** - _This key uniquely identifies a record in a table. It cannot be null.There can be only one Primary key in a table._ - **Foreign Key** - _The key linking a record to a record in another table. A table's foreign key must exist as the primary key of another table._ 4. **SQL** - _Structured Query Language, or SQL, is the most commonly used language to access data from a database_ --- # Understanding keys .panelset[ .panel[ .panel-name[Table 1] ![](assets/keys_master.webp) ] .panel[.panel-name[Candidate] ![](assets/candidate_keys.webp) ] .panel[.panel-name[Primary] ![](assets/primary_alternate.webp) ] .panel[.panel-name[Composite] ![](assets/composite_key.webp) ] ] --- # Foregin Keys **Id** in the **Department table** [Primary Key] .center[ ![](assets/foreign_1.webp) ] **Dep_Id** in the **Employee table** [Foreign Key] .pull-left[ ![](assets/keys_master.webp) ] .pull-right[ ![](assets/foreign_2.webp) ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[SELECT]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[ **SELECT** {stuff you want to select} **FROM** {the table that it is in} ] ]] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[SELECT ALL]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[The `*` is called a “splat” and is a handy, frequently used shortcut to get all columns. ] ]] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[ORDER BY]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[SELECT {stuff you want to select} FROM {the table that it is in} ORDER BY {column you want to order by} ] **ASC**ending (_default_) or **DESC**ending SELECT * FROM tracks ORDER BY name DESC ]] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[LIMIT]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[ **LIMIT** the number of rows - SELECT * FROM artists LIMIT {Number to Limit By} ] SELECT * FROM artists LIMIT 5 ] ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[OFFSET]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[ **OFFSET** - Where to start returning data - SELECT * FROM artists LIMIT 5 OFFSET {Number of rows to skip} ] SELECT * FROM artists LIMIT 5 OFFSET 2 ] ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[WHERE]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[ The **WHERE** command is followed by the conditions you’d like to filter by. SELECT * FROM artists WHERE {Filter Conditions}; ] **Single clause** - SELECT * FROM artists WHERE id = 85 **Multiple clauses** - SELECT * FROM tracks WHERE album_id = 89 AND composer = 'Green Day' **Combinations** - SELECT * FROM tracks WHERE composer = 'Green Day' OR (composer = 'AC/DC' AND milliseconds > 240000) ] ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[AGGREGATE]] ![](assets/sql-aggregations.png) ] ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[AGGREGATE]] **MAX/MIN/AVG of one column** - SELECT MAX(unit_price), MIN(unit_price), AVG(unit_price) FROM tracks **Total Rows** - SELECT COUNT(*) FROM tracks **Unique Values in a column** - SELECT COUNT(DISTINCT composer) FROM tracks ] ] --- # SQL Basics .left-column[.card[ ![](assets/sql-queries.png) ]] .right-column[ .center[ .bg-yellow[.black[GROUP BY]] .bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[ For splitting aggregations into groups ] **Counting** - SELECT genre_id, COUNT(*) FROM tracks GROUP BY genre_id **Count but rename column** - SELECT composer, COUNT(*) as "count" FROM tracks GROUP BY composer ORDER BY "count" DESC **Aggregate by multiple columns** - SELECT media_type_id, genre_id, COUNT(*) FROM tracks GROUP BY media_type_id, genre_id ORDER BY media_type_id, genre_id ] ] --- # Database tools .pull-left[ 1. Database - [SQLite](https://www.sqlite.org/index.html) 2. Database Explorer - [SQLite Browser](https://sqlitebrowser.org/) - _For working with database_ > DB Browser for SQLite (DB4S) is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. ] .pull-right[ [![](assets/db_browser.png)](https://sqlitebrowser.org/) ] --- class: center, middle # Database exercises --- # Exploring mortality data - Import the [CSV file](https://docs.google.com/spreadsheets/d/1fbdwMCnR3vLZBEJDG86iGmz733chVP4E4P1Z7tIUEC4/edit#gid=0) in the database - Explore variables - Find total number of rows - Find unique states - Find the maximum and minimum number of death across all states and years - Find total deaths in April 2020 across all states - Calculate total deaths across years - Extract the top 5 entries in terms of number of deaths across year and state --- # Exploring data from eCourts **Dataset** - [Link](https://www.dropbox.com/sh/hkcde3z2l1h9mq1/AADRe-BuBQ92ozAJiG7YERdCa?dl=0) - _The database contains 81.2 million cases_ **Source**: [Devdatalab](http://www.devdatalab.org/judicial-data) **Objective**: - Understand how the data is structured - Import the data in a database - Explore the sample datasets - Find out the total cases present for each district for the year 2018 **Tags** .bg-yellow[.black[_database_]] .bg-yellow[.black[_large-datasets_]] .bg-yellow[.black[_sqlite_]] .bg-yellow[.black[_eCourts_]] --- # Exercise - Using Databases - Install SQLite DB Browser - Create a new database - Load the judges_clean dataset in the DB - Find the distribution of male/female judges in **Bengaluru** district court where judge position is _chief metropolitan magistrate_ - Save the file, as CSV, in the drive