In this blog post will provide a brief introduction to working with relational databases and you’ll learn about their structure. Also, you’ll learn how to talk about them using database lingo, and how to begin an analysis using simple SQL commands to select and summarize columns from database tables.
The role of a data scientist is to turn raw data into actionable insights. Much of the world’s raw data lives in organized collections of tables called relational databases. To be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL. In this blog post will talk about syntax in SQL (shared by many types of databases, such as PostgreSQL, MySQL, SQL Server, and Oracle) and everything you need to know to begin working with databases today.
Beginning your SQL journey
SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.
You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.
Each row, or record, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or field, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.
The table of employees might look something like this:
SELECTing single columns
While SQL can be used to create and modify databases, the focus of this blog post will be querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.
In SQL, you can select data from a table using a
SELECT statement. For example, the following query selects the
name column from the
SELECT name FROM people;
That said, it’s good practice to make SQL keywords uppercase to distinguish them from other parts of your query, like column and table names.
It’s also good practice (but not necessary) to include a semicolon at the end of your query. This tells SQL where the end of your query is.
SELECT title FROM movies;
SELECTing multiple columns
In the real world, you will often want to select multiple columns. Luckily, SQL makes this really easy. To select multiple columns from a table, simply separate the column names with commas.
For example, this query selects two columns,
birthdate, from the
SELECT name, birthdate FROM people;
Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there’s a handy shortcut:
SELECT * FROM people;
If you only want to return a certain number of results, you can use the
LIMIT keyword to limit the number of rows returned:
SELECT * FROM people LIMIT 10;
Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the
DISTINCT keyword. This might be useful if, for example, you’re interested in knowing which languages are represented in the
SELECT DISTINCT language FROM movies;
Learning to COUNT
What if you want to count the number of employees in your employees table? The
COUNT statement lets you do this by returning the number of rows in one or more columns.
For example, this code gives the number of rows in the
SELECT COUNT(*) FROM employees;
You can combine keywords to select data. Next example will count all of the unique names from the
SELECT COUNT(DISTINCT(name)) FROM employees;