BREAKING INTO DATA SCIENCE

BY SUNRISE LONG

WHAT IS DATA?


The data industry revolves around using the information businesses collect to make informed decisions. Companies collect all sorts of information. When you make a purchase at Costco, like a post on Instagram or sign up for a credit card, companies record information about your activity. This data can guide businesses in a variety of ways. Knowing what percentage of your customer base is over 25 can be useful when determining what products to stock next season. Understanding peak demand hours at an airport can be used to intelligently schedule shifts at check-in. But data goes further than that. When companies collect enough historical information, they can use these data sets to identify patterns and make predictions about future outcomes, a practice known as machine learning. Using data on your transaction history, banks can predict when you are likely to close your account. Client retention teams can use this information to try to keep you with the bank through promotions or other incentives.
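To make the two descriptive examples above concrete, here is a minimal sketch in plain Python. The customer ages and check-in hours are made-up sample data, and the function names `share_over` and `peak_hour` are hypothetical helpers for illustration only. A real analysis would read from a company database and likely use a library such as pandas.

```python
from collections import Counter

# Made-up sample data: (customer_id, age)
customers = [(1, 19), (2, 34), (3, 52), (4, 23), (5, 41)]

# Made-up sample data: hour of day (0-23) at which each passenger checked in
checkin_hours = [6, 7, 7, 8, 7, 17, 18, 17, 7, 6]


def share_over(records, min_age):
    """Return the percentage of customers strictly older than min_age."""
    if not records:
        return 0.0
    older = sum(1 for _, age in records if age > min_age)
    return 100.0 * older / len(records)


def peak_hour(hours):
    """Return the hour with the most check-ins, together with its count."""
    hour, count = Counter(hours).most_common(1)[0]
    return hour, count


pct_over_25 = share_over(customers, 25)
busiest_hour, busiest_count = peak_hour(checkin_hours)
print(f"{pct_over_25:.0f}% of customers are over 25")
print(f"Peak check-in hour: {busiest_hour}:00 ({busiest_count} check-ins)")
```

Both answers are simple counts over existing records, which is what makes this kind of work descriptive rather than predictive.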

DIFFERENT JOBS IN DATA

There are four primary jobs in data: data analyst, data scientist, data engineer and machine learning engineer. Analysts primarily do descriptive work: they describe the current state of the business. Data analysts are responsible for analyzing data sets and generating reports that answer key business questions. They report year-to-date gross revenue, measure customer retention rates, and build dashboards and visualizations for management. The output of an analyst’s work helps business decision makers understand the current status of the company.

Data scientists sit more on the predictive side of data. They have a strong understanding of statistical methods and models. What the title “data scientist” means can vary. Some jobs are very machine learning heavy, while others focus on supporting product managers and wider business units. This work can include deciding on metrics to measure the success of a marketing campaign or designing A/B tests to evaluate the performance of a new feature.
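As an illustration of the A/B testing work mentioned above, here is a minimal sketch of one common way to compare two conversion rates: a two-sided, two-proportion z-test. The visitor and conversion counts are invented, and the function name `two_proportion_z_test` is a hypothetical helper. This is not the only way to analyze an A/B test, and it assumes large samples and independent visitors.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of conversions in variants A and B
    n_a, n_b: number of visitors shown variants A and B
    Assumes at least one conversion and one non-conversion overall,
    otherwise the standard error is zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Invented example: old feature (A) vs. new feature (B)
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be chance alone. Deciding on the metric, the sample size and the significance threshold before the test runs is part of the design work described above.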

A data scientist tries to predict the outcome of events or think of novel ways to use data to support decision making. The output of a data scientist’s work is primarily aimed at directly supporting the business. Typical work involves predicting attrition, classifying different types of customers and answering more difficult questions. The key distinction between an analyst and a scientist is the use of machine learning and strategic thinking.

Data engineers are software engineers who write code that moves data from one place to another. They build automated data pipelines to ingest, clean, summarize and write data to their target destinations. These operations form the backbone of an organization’s data infrastructure and enable data analysts and scientists to do their work. The volume of data processed by these pipelines can vary from hundreds of thousands to billions of records a day.  The core considerations of a data engineer are scale, reliability and maintainability.
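The four pipeline stages named above (ingest, clean, summarize, write) can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the `store` and `amount` columns, the sample rows, and the function names. A production pipeline would typically run on tools such as Spark and a scheduler like Airflow rather than in-memory strings.

```python
import csv
import io
from collections import defaultdict

# Made-up raw input with the kinds of problems real data has:
# inconsistent casing, stray whitespace, bad numbers, missing fields.
RAW_CSV = """store,amount
Seattle,20.50
seattle ,5.00
Portland,abc
Portland,12.25
,7.00
"""


def ingest(text):
    """Read raw rows from CSV text (a stand-in for a file, queue or API)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Normalize store names; drop rows with a missing store or bad amount."""
    cleaned = []
    for row in rows:
        store = (row.get("store") or "").strip().title()
        try:
            amount = float(row.get("amount") or "")
        except ValueError:
            continue  # skip rows whose amount is not a number
        if store:
            cleaned.append({"store": store, "amount": amount})
    return cleaned


def summarize(rows):
    """Total the amounts per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


def write(totals, out):
    """Write the summary as CSV to any file-like target."""
    writer = csv.writer(out)
    writer.writerow(["store", "total_amount"])
    for store in sorted(totals):
        writer.writerow([store, f"{totals[store]:.2f}"])


def run_pipeline(text):
    """Run all four stages and return the output as a string."""
    out = io.StringIO()
    write(summarize(clean(ingest(text))), out)
    return out.getvalue()


print(run_pipeline(RAW_CSV))
```

Keeping each stage a separate function is a small-scale version of the maintainability concern: a stage can be tested, replaced or rerun on its own when something upstream changes.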

Machine learning engineers build advanced machine learning models that are used in production. While a data scientist will build an ad hoc model used to inform the business once on a particular matter, machine learning engineers develop models that automate some form of decision making. For example, a data scientist may build an attrition model to tell the marketing team which customers are likely to cancel their subscription, while a machine learning engineer will build an automated framework that emails newly at-risk customers every day.
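The difference between a one-off model and an automated one can be sketched as follows. The scoring function is a toy rule-based stand-in for a trained attrition model, and every field name, weight and threshold is an assumption made up for this example. The part to notice is the daily job around it, which remembers who has already been contacted so that only newly at-risk customers get an email.

```python
def attrition_score(customer):
    """Toy stand-in for a trained model: returns a risk score from 0 to 1.

    The fields and weights are invented for illustration.
    """
    score = 0.0
    if customer["days_since_last_login"] > 30:
        score += 0.5
    if customer["support_tickets"] >= 3:
        score += 0.3
    if not customer["auto_renew"]:
        score += 0.2
    return score


def daily_job(customers, already_notified, send_email, threshold=0.6):
    """Score every customer and email those who are newly at risk.

    already_notified is a set of customer ids that is updated in place,
    so running the job again does not email the same customer twice.
    """
    newly_flagged = []
    for customer in customers:
        cid = customer["id"]
        if cid in already_notified:
            continue
        if attrition_score(customer) >= threshold:
            send_email(cid)
            already_notified.add(cid)
            newly_flagged.append(cid)
    return newly_flagged


# Made-up customers and a fake email sender that just records ids
customers = [
    {"id": 1, "days_since_last_login": 45, "support_tickets": 4, "auto_renew": True},
    {"id": 2, "days_since_last_login": 2, "support_tickets": 0, "auto_renew": True},
    {"id": 3, "days_since_last_login": 60, "support_tickets": 0, "auto_renew": False},
]
sent = []
notified = set()

day_one = daily_job(customers, notified, sent.append)
day_two = daily_job(customers, notified, sent.append)
print("Day 1 emails:", day_one)
print("Day 2 emails:", day_two)
```

A data scientist might stop at producing the scores once. The scheduling, the memory of past runs and the email step are the extra engineering that turns a model into an automated decision.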

The kinds of problems machine learning engineers tackle vary too. They often focus on more complex data sets that involve images or natural language. It is important to note that at some companies, people with the title of data scientist are really doing the work of machine learning engineers.

CORE DA, DS, DE AND MLE SKILLS

DA: Tableau, Looker, Excel, PowerPoint, SQL

DS: scikit-learn, D3.js, pandas, matplotlib

DE: Scala, Spark, (Oozie, Azkaban, Airflow), Kafka, AWS

MLE: TensorFlow, NLTK, OpenCV, Spark, TensorFlow Serving