SQL Fundamentals
SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.
In this blog post will provide a brief introduction to working with relational databases and you’ll learn about their structure. Also, you’ll learn how to talk about them using database lingo, and how to begin an analysis using simple SQL commands to select and summarize columns from database tables.
Data Scientist
The role of a data scientist is to turn raw data into actionable insights. Much of the world’s raw data lives in organized collections of tables called relational databases. To be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL. In this blog post will talk about syntax in SQL (shared by many types of databases, such as PostgreSQL, MySQL, SQL Server, and Oracle) and everything you need to know to begin working with databases today.
Beginning your SQL journey
SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.
You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.
Each row, or record, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or field, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.
The table of employees might look something like this:
id | name | age | nationality |
1 | Jessica | 32 | USA |
2 | Anna | 35 | UK |
3 | Laura | 30 | Germany |
SELECTing single columns
While SQL can be used to create and modify databases, the focus of this blog post will be querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.
In SQL, you can select data from a table using a SELECT
statement. For example, the following query selects the name
column from the people
table:
SELECT name
FROM people;
That said, it’s good practice to make SQL keywords uppercase to distinguish them from other parts of your query, like column and table names.
It’s also good practice (but not necessary) to include a semicolon at the end of your query. This tells SQL where the end of your query is.
SELECT title
FROM movies;
SELECTing multiple columns
In the real world, you will often want to select multiple columns. Luckily, SQL makes this really easy. To select multiple columns from a table, simply separate the column names with commas.
For example, this query selects two columns, name
and birthdate
, from the people
table:
SELECT name, birthdate
FROM people;
Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there’s a handy shortcut:
SELECT *
FROM people;
If you only want to return a certain number of results, you can use the LIMIT
keyword to limit the number of rows returned:
SELECT *
FROM people
LIMIT 10;
SELECT DISTINCT
Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT
keyword. This might be useful if, for example, you’re interested in knowing which languages are represented in the movies
table:
SELECT DISTINCT language
FROM movies;
Learning to COUNT
What if you want to count the number of employees in your employees table? The COUNT
statement lets you do this by returning the number of rows in one or more columns.
For example, this code gives the number of rows in the employees
table:
SELECT COUNT(*)
FROM employees;
You can combine keywords to select data. Next example will count all of the unique names from the employees
table:
SELECT COUNT(DISTINCT(name))
FROM employees;
What is JOIN in SQL?
SQL JOIN is essentially a method of combining data from two or more tables that have a common column. You can JOIN tables in two ways: INNER JOIN and OUTER JOIN.
INNER JOIN
The SQL Inner Join is a form of join that joins several tables by getting data with matching values in both tables (via the common field).
It compares each row of the first table to each row of the second table to identify all pairs of rows that fulfill the join-predicate. When the join-predicate is met, the column values from both tables are joined to form a new table.
id | name | age | nationality |
1 | Jessica | 32 | USA |
2 | Anna | 35 | UK |
3 | Laura | 30 | Germany |
course_id | user_id |
1 | 1 |
2 | 2 |
3 | 1 |
4 | 1 |
5 | 2 |
Basic syntax:
SELECT COLUMN1, COLUMN2 FROM [TABLE 1] INNER JOIN [TABLE 2] ON Condition;
Now lets INNER JOIN both tables:
SELECT StudentCourse.user_id,Student.id
FROM Student INNER JOIN StudentCourse
ON StudentCourse.user_id = Student.id
ORDER BY StudentCourse.user_id;
OUTER JOIN
OUTER JOIN will return values form one or both tables even thought there is no match. The syntax is the same as INNER JOIN, only the results are different.