What is data science, why is it so popular, and why did the Harvard Business Review hail it as the “sexiest job of the 21st century”? In this non-technical post, you’ll be introduced to everything you were ever too afraid to ask about this fast-growing and exciting field, without needing to write a single line of code.
We’ll start the post by defining what data science is. Then we’ll cover the data science workflow and how data science is applied to real-world problems. You will also learn about the different roles within the data science field.
What is data science?
Data science is a set of methodologies for taking in the thousands of forms of data that are available to us today and using them to draw meaningful conclusions. Data is being collected all around us. Every like, click, email, credit card swipe, or tweet is a new piece of data that can be used to better describe the present or better predict the future.
What can data do?
Data can describe our current state, such as our energy consumption. This can be accomplished with dashboards or alerts, which simplify time-intensive reporting processes. Data can also help detect anomalous events, such as fraudulent purchases: if we have data on what has happened previously, we can increase efficiency by automatically detecting a new event that is unexpected or abnormal. Data can diagnose the causes of observed events and behaviors, such as your activity on Spotify or Netflix. Rather than determining correlations between small numbers of events, data science techniques help us understand complex systems with many possible causes. Finally, data can predict future outcomes, such as population size. These techniques let us take many possible causes into account when predicting what might happen, and we can evaluate the probability of a prediction mathematically to make our level of uncertainty clear.
The data science workflow
So, how do we start to use data? In data science, we generally have four steps to any project.
- First, we collect data from many sources, such as surveys, web traffic results, geo-tagged social media posts, and financial transactions. Once collected, we store that data in a safe and accessible way.
- At this point, data is in its raw form, so the next step is to prepare data. This includes “cleaning data”, for instance finding missing or duplicate values, and converting data into a more organized format.
- Then, we explore and visualize the cleaned data. This could involve building dashboards to track how the data changes over time or performing comparisons between two sets of data.
- Finally, we run experiments and predictions on the data. For example, this could involve building a system that forecasts temperature changes or performing a test to find which web page acquires more customers.
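You do not need to write any code to follow this post, but for curious readers, the "prepare data" step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `records` list, its field names, and the helper functions `find_missing` and `drop_duplicates` are hypothetical and do not come from any real dataset or library.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"customer_id": 1, "city": "Boston", "age": 34},
    {"customer_id": 2, "city": "Austin", "age": None},
    {"customer_id": 1, "city": "Boston", "age": 34},  # exact duplicate of the first row
    {"customer_id": 3, "city": None, "age": 51},
]


def find_missing(rows):
    """Return (row index, field name) pairs where a value is missing."""
    return [
        (index, field)
        for index, row in enumerate(rows)
        for field, value in row.items()
        if value is None
    ]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by field name so that field order does not affect the comparison.
        fingerprint = tuple(sorted(row.items(), key=lambda pair: pair[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


missing = find_missing(records)
cleaned = drop_duplicates(records)
print("Missing values at:", missing)
print("Rows before:", len(records), "rows after removing duplicates:", len(cleaned))
```

In practice, a data team would more likely use a dedicated data analysis library for this step. The plain-Python version is used here only to keep the idea visible.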
Now you know why data science is important and what the four steps in the data science workflow are.
Customer segmentation workflow
Laura manages a data science team at a subscription-based dog food delivery company. Her team has been asked to investigate loss of customers, also known as customer churn. Here are the steps that her team will need to take:
- Download the data
- Reformat the delivery date on all entries to be in the same time zone
- Create a line chart that shows decay in subscriptions by cohort
- Cluster the users into different personas and perform a regression to predict churn for each cluster
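As an optional illustration of the second step above, here is a minimal sketch of putting delivery dates into a single time zone. It assumes the dates are stored as ISO 8601 text with a UTC offset and that UTC is the chosen common time zone. The sample timestamps and the `to_utc` helper are hypothetical, not taken from Laura's data.

```python
from datetime import datetime, timezone

# Hypothetical delivery timestamps, each recorded with its own UTC offset.
raw_deliveries = [
    "2023-03-01T09:30:00-05:00",
    "2023-03-01T15:30:00+01:00",
    "2023-03-01T23:45:00+09:00",
]


def to_utc(timestamp_text):
    """Parse an offset-aware ISO 8601 timestamp and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp_text)
    if parsed.tzinfo is None:
        # Without an offset we cannot know which time zone was meant.
        raise ValueError(f"timestamp has no UTC offset: {timestamp_text}")
    return parsed.astimezone(timezone.utc)


utc_deliveries = [to_utc(text) for text in raw_deliveries]
for original, converted in zip(raw_deliveries, utc_deliveries):
    print(original, "->", converted.isoformat())
```

Once every entry is in the same time zone, deliveries that happened at the same moment in different regions line up, which makes the later cohort chart and churn analysis easier to trust.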
Building a customer service chatbot
Laura and her data team are working on a customer service chatbot. The chatbot is a computer program that uses data science to answer basic customer questions via a messenger. The team will use transcripts from over 300,000 customer service interactions to train the chatbot to answer customer questions.
Data Collection and Storage
- Load the transcripts into the data team’s database
- Collect the timestamps for each transcript
- Gather customer information for each conversation
Exploration and Visualization
- Create a bar chart of the number of conversations of each type
- Plot the number of conversations vs. the time of day
Experimentation and Prediction
- Use a machine learning model to predict possible responses for each question
- Create an algorithm that classifies the initial customer question
Assigning data science projects
Laura manages an analytics team and has a few tasks that she’s hoping to achieve this quarter. The tasks are centered around the following domains:
- traditional machine learning
- deep learning
- Internet of Things (IoT)
Her team already has the knowledge to build traditional machine learning and deep learning applications. Another team in the company specializes in working with IoT data. Laura wants to know which tasks she’ll need their help with.
Traditional Machine Learning
- Cluster patients by symptoms to help doctors select a treatment
- Predict ride-sharing prices at a certain time and location based on previous prices
Deep Learning
- Automatically summarize text from news articles
- Flag images that contain a safety violation
Internet of Things
- Detect machinery failure with vibration detectors
- Automate building cooling using temperature sensors
Laura is an investment analyst. She’s been asked to review a new start-up that uses easy-to-install vibration sensors to measure the amount of traffic on a bridge or highway. Her boss has asked her to do some background research and decide which category this start-up belongs in. Based on the start-up’s dashboard shown below, which category best fits this start-up?
- Traditional machine learning
- Deep learning
- Internet of Things
- Natural language processing