
Chapter Introduction: Data Ingestion

This is a turning point in the book. You might view it as where the fun ends or where it begins. We now start operating within the data science pipeline. This chapter focuses on data ingestion, which is the process of collating data and storing it safely. Data science cannot occur without data! Faulty data ingestion will doom any data science project. None of the future steps of the data science pipeline matter unless we have the data we need. We can theorize all we want about technologies and algorithms. At the end of the day, none of that means anything unless we can successfully carry out the data ingestion step. When starting a data science project, our primary concern should be the data!

Please note that I am working in a local directory called applied_data_science connected to a remote git repo with the same name. You can call your repo whatever you would like, perhaps something like churn_predictions. Descriptive repo names are a plus. I have called my repo something more generic for the purposes of this book (read into it: search engine optimization).

For a data scientist, data ingestion often, but certainly not always, involves writing SQL queries. In many cases, the data scientist will not be responsible for setting up the database and performing ETL processes to update that database. However, those are useful skills to have and will be covered in this section. We will primarily spend our time on relational databases, as they are the most popular. We will, though, touch on non-relational databases. We will also cover other handy items surrounding data ingestion, such as webscraping.

Understanding our Data

As mentioned in Chapter 1, our aim is to predict if someone will cancel their subscription to a private movie review website. For $4.99 a month, a user can access a website with expert, in-depth movie reviews and can author their own reviews as well. If a user is at risk of canceling their subscription, we want to know so our marketing department can take appropriate action.

Please note this is a fictional dataset. That said, it has many real-world attributes: missing data, not a ton of signal, messy fields, bad data, outliers, decently-high dimensionality. This isn't the iris dataset - you'll have to work to get a decent model. This data will mimic what you actually might encounter in the wild.

Having a data dictionary is highly useful, especially when collaborating with other data scientists. At a high level, a data dictionary should define what each field means. Additional details can also be welcome.

Data Dictionary

Feature Description
client_id: unique ID for each user
activity_score: 0-100 score for site activity; 100 is the max
propensity_score: output of another team's ML model that predicted if the customer would make it past the free trial period
profile_score_new: score of how in-depth a user profile is based on new methodology; points are awarded for adding certain sections to a profile
completeness_score: score to reflect how many aspects of the site a user has utilized
xp_points: how many experience points the user has gained from taking quizzes
profile_score: score of how in-depth a user profile is based on old methodology; points are awarded for adding certain sections to a profile
portfolio_score: additional version of the activity score for users who pay an additional $1 to promote their top reviews
mouse_movement: measure of mouse movements captured by a JavaScript script
average_stars: average stars given to other reviews in a month
ad_target_group: ID of targeting group for on-site ads
device_type: ID of the type of device primarily used on the site
all_star_group: all users can enter one all-star group based on their favorite genre; this ID shows which group they belong to
mouse_x: group to measure mouse click activity
mouse_y: alternative group to measure mouse click activity
coupon_code: coupon code ID offered during trial period
ad_engagement_group: primary ad group the user has opted into
user_group: movie reviewer segment ID based on self-selection by the user
secondary_user_group: movie reviewer segment ID based on self-selection by the user
browser_type: ID for main type of browser used on the site
email_code: ID for mailing list opted into
marketing_creative: ID for creative used to convert the user into a subscriber
marketing_creative_sub: secondary ID for creative used to convert the user into a subscriber
promotion_category: ID for promotion used to convert the user into a subscriber
marketing_campaign: ID for marketing campaign used to convert the user into a subscriber
marketing_channel: ID for marketing channel used to convert the user into a subscriber
site_level: access level on the site
acquired_date: date the user subscribed
churn: whether or not the user has terminated their subscription

For the purposes of this book, assume these are all the features we realistically have access to. An important part of data science is brainstorming new features to ingest. For this problem, we could devise a slew of additional features: number of log-ins, average time on site, number of customer service calls. For our purposes, assume we do not have access to such features. This situation also reflects a reality of data science: we can't always get all the data we want, and we have to work with what is available to us. That said, we should still communicate features we want in the hopes those can be captured and ingested at a later time.

One drawback of our data is that it does not record when a user churned. Rather, it only provides a binary classification. Therefore, our model will only be able to provide a binary picture and cannot indicate when a customer might churn. Likewise, our data are right-censored: some current subscribers will turn into churners over time. Our data will evolve. Likewise, some of the features could very well change. Our data is simply a snapshot, which is inherently limiting. Hallmarks of effective data science are 1) understanding the limitations of the data and the downstream implications for our model and 2) working within those limits as best we can. Certainly, we should try to broaden the limits, but that may not always be possible. For this project, we cannot change these limits. However, if we could, we might aim to do the following in an ideal world.

Get access to the churn date and work to predict when a customer will churn. This could take the form of predicting the probability they churn, say, this week. Perhaps it could be a "survival" modeling problem, that is, predicting the time until some event occurs (in this case, churn). For the former, this might involve constructing historical weekly snapshots of customers' profiles and appending the "churn" result for each week. For example, if a customer churned on the 10th week, they would have 10 observations. The churn category would be 0 for the first nine and 1 for the 10th. The predictor features would represent weekly data (e.g. in the first week, activity_score was 10; in the second week it was 5...). This would also allow us to create rolling stats and generally overcome the "point-in-time" data problem. That said, we can certainly still operate on the data we have. We are essentially answering the following question: "assuming features stay relatively constant over the course of a user's lifetime on the site, our snapshot data allows us to build a profile of attributes that are likely to enhance the probability of a user canceling their subscription at some point, allowing us to more intelligently and proactively target users to curb churn".

Working with csv Files

Ah, the tried-and-true csv, the character separated value file format. Did you think csv was an acronym for comma separated values? So did I for many years. However, you can delimit your data with virtually any character your heart desires. (I worked with a vendor a while back that wanted a ^ separated file, of all things).

Though we often prefer to work with databases, using csv files is still common in data science, and it can be a handy way to consume data. In fact, the initial version of our data will come in csv format.

Using the pandas library, reading in a csv file is a breeze.

$ python3
>>> import pandas as pd
>>> df = pd.read_csv('data/site_churn_data.csv')

Easy as pie. If you need to read in something more exotic, like a ^ separated file, you would simply need to adjust the value in the “sep” argument. By default, the “sep” argument expects a comma.

>>> df = pd.read_csv('data/site_churn_data.csv', sep='^')

Reading in a csv this way produces a dataframe, the most popular data structure in pandas. It’s essentially an Excel spreadsheet, a structure with rows and columns filled with data (or potentially null values). We’ll cover pandas more in later chapters.

Writing a csv is just as easy. This takes a dataframe and saves it as a csv (default separator is the comma) to your local working directory.

>>> df.to_csv('sample.csv')

If you wanted to be wild and delimit your file with a | (called a pipe), you would do the following.

>>> df.to_csv('sample.csv', sep='|')

If you executed the to_csv commands, be sure to clean up those unnecessary files.

Working with Excel Files

Microsoft Excel can produce character separated files, but Excel files (with the extension .xlsx) are fundamentally different from csv files. Fortunately, we can still use pandas to read in such files. Let’s create a copy of site_churn_data.csv as an Excel file.

The first thing we need to do is create an Excel Writer and write our dataframe to an Excel file. After importing pandas, we can accomplish this by executing the following function. You can put this in a scratch.py file or run it from your terminal.

def create_excel_file():
    # split the churn data into churners and non-churners
    df = pd.read_csv('data/site_churn_data.csv')
    df_churn = df.loc[df['churn'] == 'yes']
    df_no_churn = df.loc[df['churn'] == 'no']

    # write each dataframe to its own tab; requires the xlsxwriter package
    writer = pd.ExcelWriter('data/churn_annotated.xlsx', engine='xlsxwriter')
    df_churn.to_excel(writer, sheet_name='churn_yes')
    df_no_churn.to_excel(writer, sheet_name='churn_no')
    writer.close()  # in older pandas versions, this was writer.save()
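
With the file created, reading it back in takes a single pd.read_excel call (pandas needs an Excel engine such as openpyxl installed).

df_default = pd.read_excel('data/churn_annotated.xlsx')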

By default, pandas reads in the first tab, which in this case is churn_yes. What if we wanted to read in both tabs? We can do the following.

df_yes_churn = pd.read_excel('data/churn_annotated.xlsx', sheet_name='churn_yes')
df_no_churn = pd.read_excel('data/churn_annotated.xlsx', sheet_name='churn_no')

We now have two nice, new dataframes.

Reading in Multiple Files using Glob

Bill from marketing sent you a dozen files that you need for your model. They all have the same columns, just different timeframes. Thanks a lot, Bill. Instead of reading in all those files one by one, we can use the glob library, which comes standard with Python.

import glob
import pandas as pd

# read every csv that starts with 'marketing_data' and stack the results
marketing_dfs = []
for file in glob.glob('marketing_data*.csv'):
    temp_df = pd.read_csv(file)
    marketing_dfs.append(temp_df)
marketing_df = pd.concat(marketing_dfs, ignore_index=True)

We run a "for loop" that reads in every csv file beginning with 'marketing_data' (we assume every file has this prefix), collect the resulting dataframes in a list, and then concatenate them into a single marketing_df.

Serializing Dataframes with joblib

Another common task is serializing Python objects, which involves converting them into binary formats that can be stored on disk. Serializing a pandas dataframe is an alternative to writing it to a csv. This method has a couple of advantages. First, it preserves data types. For example, a serialized dataframe will remember that you cast a column of integers as a string; a csv can't do that. Second, serialized objects can be more space efficient than csv files. However, there is a problem in serialization paradise: pandas and numpy may not, in some cases, be able to handle dataframes that were serialized under different library versions. Such a situation can be mitigated by using virtual environments, which let us make sure we load the file with the same library versions used to write it.

The two major libraries for serializing Python objects are pickle and joblib. I recommend using joblib, as it is more efficient and easier to use. To serialize a Python object using joblib, we can use the following code.

import joblib
joblib.dump(df, 'my_serialized_dataframe.pkl')

Serialized objects typically end with the '.pkl' extension and are referred to as pickle files.

To load a serialized object, we can do the following.

df = joblib.load('my_serialized_dataframe.pkl')

Working with Compressed Data

During your time as a data scientist, you might come across files that are compressed, which saves disk space. Gzip is a common type of compression, with files typically having the extension .gz. Such files are actually pretty easy to handle.

Assuming a csv file has been compressed, we can use our best friend pandas to read in and decompress the file.

df = pd.read_csv('sample.csv', compression='gzip')

The following code compresses a csv file.

df.to_csv('sample.csv', compression='gzip', index=False)

We can also use our pal joblib to save a compressed pickle file by using the compress argument. This argument takes an integer from 0-9, with 9 being the highest level of compression. More compression means smaller file sizes. It also means longer read and write times.

import joblib
joblib.dump(df, 'my_serialized_dataframe.pkl', compress=3)

Working with Parquet Files

As a csv, our churn dataset is 58 MB, which really isn't that big. Ideally, we want to store our data in a remote database. However, that might not always be possible. If we have a large dataset we must store on disk, we might consider a Parquet file. Such files organize data in columns, whereas something like a csv file organizes data in rows. Columnar storage compresses well, so the same data takes up noticeably less space on disk (pandas needs the pyarrow or fastparquet package installed for this). Let's see the impact.

$ python3
>>> import pandas as pd
>>> df = pd.read_csv('data/site_churn_data.csv')
>>> df.to_parquet('data/site_churn_data.parquet')
>>> df = pd.read_parquet('data/site_churn_data.parquet')

On disk, the parquet file is only 16 MB.

Working with Feather Files

Feather is another option (the standalone feather-format package is used below). This will compress our file (to 35 MB in this case). It will also be faster in terms of reading and writing compared to a csv.

$ python3
>>> import pandas as pd
>>> import feather
>>> df = pd.read_csv('data/site_churn_data.csv')
>>> feather.write_dataframe(df, 'data/site_churn_data.feather')
>>> df = feather.read_dataframe('data/site_churn_data.feather')

Working with JSON

JSON is an acronym for JavaScript Object Notation. When communicating over HTTP (more on that later), JSON is a common data format. We will review the basics of handling such data in Python.

Here’s the JSON file we’ll work with.
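
Something along these lines will do; the employee records and fields are just illustrative, so feel free to craft your own data.json.

{
    "employees": [
        {"name": "Ada", "department": "analytics"},
        {"name": "Grace", "department": "engineering"}
    ]
}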

The following code will read the JSON into a dataframe and dump an amended version to disk. The comments walk you through the code - I use comments here for educational purposes :-)
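
Here is a minimal sketch that assumes the structure above lives at data/data.json and uses years_of_service as the new key (both are illustrative choices).

import json

import pandas as pd

# read the raw JSON from disk
with open('data/data.json') as f:
    raw_data = json.load(f)

# flatten the list of employee records into a dataframe
df = pd.json_normalize(raw_data['employees'])
print(df)

# add a new key to each employee record
for employee in raw_data['employees']:
    employee['years_of_service'] = 1

# dump an amended version of the data to disk
with open('data/new_data.json', 'w') as f:
    json.dump(raw_data, f, indent=2)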

With new_data.json, we recreated data.json and added a new key to each employee.

Interacting with APIs

API is an acronym for application programming interface. The name helps us understand a lot, right? Eh, not really. Think of it this way: an API is like a contract. If you send the API data in a specified format, it will return data in a specified format. To use APIs, we first need to understand two basic HTTP requests, GET and POST, and the difference between client and server.

In the simplest terms, the client is the entity that wants data or resources. The server is the entity that has data and resources to share. When you want data from an API, you are the client and the API is the server.

A GET retrieves data. Whenever you visit a webpage, you are performing a GET request. A POST sends data. Whenever you fill out an online form and hit submit, you are performing a POST request.

Most APIs are built using the REST framework, which stands for representational state transfer. The API is called, and it returns data. That’s it. Typically, REST APIs are fairly easy to interact with, though the specifics are based on the parameters and data the API expects.

Let's say we want to pull some zip-code data from the Census API. Here’s how we can accomplish such an aim.

First, request an API key for the Census API here. An API key is a common authentication scheme, though others exist. We will pull data from the American Community Survey, which provides rich demographic data. A full list of features can be found here. For this example, we will pull only one feature, median income. To pull the correct feature, you have to specify the proper “Name” based on the foregoing link.

Running the following code will print a dataframe that shows the median income for zip code 64108.
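
The script itself is short. The sketch below hits the ACS 5-year endpoint; the vintage year (2019), the median household income variable (B19013_001E), and the CENSUS_API_KEY environment variable are assumptions you may need to adjust.

import os

import pandas as pd
import requests

BASE_URL = 'https://api.census.gov/data/2019/acs/acs5'


def get_zip_code_data(zip_code, api_key):
    # the API expects comma-separated ZCTAs without spaces
    zip_code = zip_code.replace(' ', '')
    url = (f'{BASE_URL}?get=NAME,B19013_001E'
           f'&for=zip%20code%20tabulation%20area:{zip_code}&key={api_key}')
    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    # the first row of the payload is the header; the rest are data rows
    return pd.DataFrame(payload[1:], columns=payload[0])


def main():
    zip_code = '64108'
    api_key = os.environ['CENSUS_API_KEY']
    df = get_zip_code_data(zip_code, api_key)
    print(df)


if __name__ == '__main__':
    main()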

The above code pulls statistics for one zip code. Fortunately, we can pull multiple zip codes in a single API call by slightly amending our zip_code variable in the main function. Passing the following value will extract data for two zip codes.

zip_code = '64108, 64735'

To get a list of all zip codes (and some potentially useful other data), download this file from the IRS.

Building a Webscraper using Requests and Beautiful Soup; Downloading a File Programmatically with Selenium

A webscraper simply extracts the raw HTML off a webpage, allowing us to transform it into a more friendly format. Therefore, having a working understanding of HTML tags is useful. You can study HTML tags here. For this exercise, we will scrape unemployment rates. Please keep in mind this is a bit of a silly example (for reasons you shall soon observe).

We need to be cautious when we scrape websites. Some prohibit it! In fact, I was not able to find a website with a table of unemployment rates that allows webscraping. Therefore, we’ll use this as an opportunity to programmatically download a file with unemployment rates, render it in a local web service, and then scrape it. This is not a production-grade workflow. It is only a pedagogical tool to illustrate concepts, particularly webscraping.

First, get the Chrome webdriver and put it in your working directory: https://chromedriver.chromium.org/downloads. You can use another browser if you’d like. You'll just need to download the appropriate driver.

Next, create a Python file called web_scraping_exercise.py. Our application is pretty straightforward. The function download_file_via_selenium() configures the webdriver and downloads our file. The function create_unemployment_endpoint() reads in the downloaded file and renders it as HTML in a specified URL. The run_webscraper() function does a GET on our local server, spun up by the app.run() call in the main method. It then parses the HTML table and formats it as a dataframe. More or less, the webscraper simply looks for a set of HTML tags, and knowing the meaning of these tags, we can write code to transform that data into our desired format.
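
The full web_scraping_exercise.py is not reproduced here, but a condensed sketch of how the three functions could fit together follows. It assumes Flask for the local server, uses placeholder values for the download URL, file name, and download directory, and targets Selenium 4 (older Selenium versions take executable_path directly in webdriver.Chrome).

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from flask import Flask
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

app = Flask(__name__)

DOWNLOAD_DIR = '/tmp/unemployment_download'  # placeholder download location
LOCAL_URL = 'http://127.0.0.1:5000/unemployment'


def download_file_via_selenium():
    # configure Chrome to download files into DOWNLOAD_DIR without prompting
    options = webdriver.ChromeOptions()
    options.add_experimental_option('prefs', {'download.default_directory': DOWNLOAD_DIR})
    driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
    driver.get('https://example.com/unemployment_rates.csv')  # placeholder URL
    time.sleep(10)  # crude wait for the download to finish
    driver.quit()


@app.route('/unemployment')
def create_unemployment_endpoint():
    # read the downloaded file and render it as an HTML table
    df = pd.read_csv(f'{DOWNLOAD_DIR}/unemployment_rates.csv')
    return df.to_html(index=False)


def run_webscraper():
    # GET the local page, parse the HTML table tags, and rebuild a dataframe
    response = requests.get(LOCAL_URL)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in table.find_all('tr')[1:]]
    df = pd.DataFrame(rows, columns=headers)
    print(df)
    return df


if __name__ == '__main__':
    download_file_via_selenium()
    app.run()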

Now, run web_scraping_exercise.py in PyCharm. This will spin up your local server.

Then, fire up Python on the command line and run the function run_webscraper(). This will produce a dataframe of unemployment rates!

$ python3
>>> from web_scraping_exercise import run_webscraper
>>> run_webscraper()

Interacting with Google Drive

Our journey of data ingestion continues with the ubiquitous Google Drive. Fortunately, we can easily upload and download files from Google Drive using Python.

You’ll first have to set up Google Drive API access. You can follow the instructions in this tutorial.

We can then interact with Google Drive with the pydrive library.
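
Assuming you have completed the tutorial's OAuth setup and have a client_secrets.json in your working directory, a minimal upload-and-download sketch looks like this (file names are placeholders).

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# authenticate via the browser-based OAuth flow configured in the tutorial
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# upload a local file to Google Drive
upload_file = drive.CreateFile({'title': 'site_churn_data.csv'})
upload_file.SetContentFile('data/site_churn_data.csv')
upload_file.Upload()

# find the file by title and download it back to disk
file_list = drive.ListFile({'q': "title = 'site_churn_data.csv' and trashed = false"}).GetList()
file_list[0].GetContentFile('site_churn_data_from_drive.csv')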

A Gentle Introduction to SQL

Data scientists often work with SQL databases. SQL is an acronym for structured query language. In the world of data science, the pronunciation of SQL is a hot topic. I’m a strong proponent of the pronunciation being see-qwell. Other people, who don’t want to sound as cool, pronounce the letters: ess-que-el.

Authoring a SQL query is a common way to extract data from a database. These queries have three components that we almost always use (though others certainly exist): select, from, and where.

  • The select clause is where we state the fields we want to return.
  • The from clause specifies the tables where those fields live.
  • The where clause determines conditions that must be met for a row to appear in the results.

If you’re unfamiliar with SQL, here is how a query looks:

select age, zip_code from clients where purchase_date >= '2020-01-01';

SQL statements are also used to create database tables and manage database users. We will go over such statements later in this chapter.

Extracting Data From Gmail

In some cases, we might have data sent via email. If this is done repeatedly, we might want to automatically pull down the data. This excellent tutorial will show you how to do just that.

Interacting with SQLite

SQLite is a lightweight database that is quite easy to use. It’s not something a company would want to use to store customer records, but it can be useful when prototyping a solution or if you want an alternative to a csv. Writing data to SQLite is generally safer than storing it in a csv file. If you have Excel on your computer, opening up a csv and accidentally modifying data is a possibility (PyCharm does have a useful "local history" function, and we can, of course, track files in git, though we usually add csv files to .gitignore). With SQLite, you would have to write code to alter your data.

Below is a file for interacting with SQLite. The functionality is pretty clear. To note, if we connect to a database that doesn’t exist, SQLite will create it for us. Pretty nifty.
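
If you want a starting point, a minimal version built on the standard library's sqlite3 module could look like the following (the database and table names are placeholders).

import sqlite3

import pandas as pd

DB_PATH = 'local_churn.db'


def write_dataframe_to_sqlite(df, table_name, db_path=DB_PATH):
    # connecting to a database file that doesn't exist creates it
    conn = sqlite3.connect(db_path)
    df.to_sql(table_name, conn, if_exists='replace', index=False)
    conn.close()


def read_table_from_sqlite(table_name, db_path=DB_PATH):
    conn = sqlite3.connect(db_path)
    df = pd.read_sql(f'select * from {table_name}', conn)
    conn.close()
    return df


if __name__ == '__main__':
    churn_df = pd.read_csv('data/site_churn_data.csv')
    write_dataframe_to_sqlite(churn_df, 'churn_data')
    print(read_table_from_sqlite('churn_data').head())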

Setting up MySQL on AWS

A local SQLite database is all well and good. However, if we wanted another data scientist to access our database, our only recourse would be to share the database file. And what would happen if we needed to update our database file? We would have to share our file again. A better setup would be to create our database on AWS, which will provide a single endpoint that anyone with the proper credentials can hit. We'll learn how to accomplish this goal in the following videos.



Additionally, you could use Pulumi to create an RDS instance. (We set up Pulumi in chapter 6).
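
As a rough sketch (not our actual Pulumi program), a MySQL RDS instance can be declared like the following; the resource name, instance class, and storage are placeholders, and the master password comes from Pulumi's config secrets rather than being hardcoded.

import pulumi
import pulumi_aws as aws

config = pulumi.Config()

# a small, publicly accessible MySQL instance for this book's workflow;
# lock down access and size it appropriately for real workloads
rds_instance = aws.rds.Instance(
    'churn-model-mysql',
    engine='mysql',
    instance_class='db.t3.micro',
    allocated_storage=20,
    username='admin',
    password=config.require_secret('db_password'),
    publicly_accessible=True,
    skip_final_snapshot=True,
)

pulumi.export('rds_endpoint', rds_instance.endpoint)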

Another digression on security: protecting customer information is important. This project has fake data, but we still want to be aware of best practices. Customers entrust organizations with their information, and it is incumbent on us to uphold this trust. Likewise, there are oftentimes legal requirements for how we need to handle data.

Getting Set Up in MySQL Workbench

In the next few sections, we will use MySQL Workbench to interact with our MySQL database. Watch the following video to learn how to complete the setup.

You can download the RDS pemfile here.

Creating Schemas and Managing Database Users

When we created our MySQL instance, we received root credentials for our database. This account is very powerful. Caveat emptor! When running code in a production application, we do not want to use the root account. Ideally, we want to use an account tailored to only our application's needs. This is much more secure - yeah, pesky security concerns, I know. Using MySQL Workbench, let's create a schema for our project and a new user.

First, create a churn_model schema with the following commands in MySQL Workbench.

create database churn_model;
show databases;

Let's now create a user for this schema. This is a pretty powerful user, even though it only has access to a single schema. When we deploy our machine learning application, we will create and use an account with narrower permissions.

create user 'churn_model'@'%' identified by 'a_good_password';
select * from mysql.user;
grant insert, select, update, delete, create, drop, index, create_view on churn_model.* to 'churn_model'@'%';
flush privileges;

Storing our Credentials in Secrets Manager

Where shall we store our passwords? AWS Secrets Manager is an outstanding choice. It's a secure, encrypted password manager that allows us to programmatically pull secrets in Python. Always retrieving credentials from a password manager has an added bonus: when we need to rotate our credentials, we only need to perform this action in a single location. Take the following steps to safely store our MySQL credentials for the churn_model user.

  • Use your administrator account to give the churn_model AWS account access to Secrets Manager.
  • Log into the AWS Management Console using the churn_model account and navigate to Secrets Manager.
  • Click on "store a new secret".
  • For the secret type, select "other type of secrets".
  • The keys in your secret will be host, user, password, and database.
  • Fill in the appropriate values for the churn_model user.
  • Name the secret "churn-model-mysql".
  • Don't enable automatic rotation.
  • Store the secret.

I am going to cd into my home directory and make sure I have a copy of the RDS SSL certificate in my working directory. I then will use our data_science_helpers library to pull the MySQL credentials and connect to our database.

$ cd
$ cp ~/ssl/rds-ca-2019-root.pem .
$ python3
>>> from ds_helpers import aws, db
>>> mysql_creds = aws.get_secrets_manager_secret('churn-model-mysql')
>>> db.connect_to_mysql(mysql_creds)

Hooray! We've connected to our database via Python, securely pulling credentials from Secrets Manager and using our private package.
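
For the curious, a helper like get_secrets_manager_secret needs little more than the following boto3 call (the region is an assumption; use the one where you created the secret).

import json

import boto3


def get_secret(secret_name, region_name='us-east-1'):
    client = boto3.client('secretsmanager', region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    # the secret body is a JSON string of the key-value pairs we entered in the console
    return json.loads(response['SecretString'])


mysql_creds = get_secret('churn-model-mysql')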

Even though we won't pull it programmatically, store your root MySQL credentials as another secret. This ensures they are in a safe location that could potentially be accessed by others in case you get hit by a bus.

Loading Our Data into MySQL

Our database is empty, which doesn't do us much good. Let's write and execute a little Python script to load our churn data into MySQL. We will call the file load_data.py and will put it in the utilities directory.
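
Here is a minimal sketch of such a script (not necessarily the exact file). It assumes get_secrets_manager_secret returns a dict with the host, user, password, and database keys we stored earlier, and it uses SQLAlchemy with the pymysql driver to write a churn_data table.

import pandas as pd
from ds_helpers import aws
from sqlalchemy import create_engine


def main():
    # credentials were stored as a JSON secret with host, user, password, and database keys
    creds = aws.get_secrets_manager_secret('churn-model-mysql')
    connection_string = (
        f"mysql+pymysql://{creds['user']}:{creds['password']}"
        f"@{creds['host']}/{creds['database']}"
    )
    # to enforce SSL, pass connect_args={'ssl': {'ca': 'rds-ca-2019-root.pem'}} to create_engine
    engine = create_engine(connection_string)

    df = pd.read_csv('data/site_churn_data.csv')
    # write the dataframe into the churn_data table in the churn_model schema
    df.to_sql('churn_data', engine, if_exists='replace', index=False)


if __name__ == '__main__':
    main()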

Optimizing Relational Databases with Indices

To enable our queries to run faster, we can add indices to the tables they hit. Indices help queries locate data more efficiently. Think of an index like a cheat sheet for your query, telling the query how to more quickly find the data you want. For instance, if we need to query a specific client_id, the index creates pointers that allow the query to locate and pull back specific values more quickly. Let's create an index on our MySQL table using MySQL Workbench.

This is the type of query we want to optimize (it could be for any client_id).

select * from churn_model.churn_data where client_id = 196389;

By using the explain statement, we can get execution information about our query. We'll see we are not using an index.

explain select * from churn_model.churn_data where client_id = 196389;

We shall now create our index.

create index idx_client_id on churn_model.churn_data (client_id);

If we re-run explain on our query, we'll see our index is now being used to locate the desired rows.

explain select * from churn_model.churn_data where client_id = 196389;

Lastly, we can run our query again, which will leverage the index.

select * from churn_model.churn_data where client_id = 196389;

Adding Views to a Relational Database

Views create virtual tables. A view is powered by an underlying SQL query, not stored data. One of the main goals of a view is to prevent us from writing the same long query repeatedly. Rather, we could do something like select * from view. An example shall make this more concrete (given my example is any good).

For this example, let's say (per the request of our boss) we repeatedly need to monitor the results of the following query. (There are better ways to accomplish this task, but we won't cover them here).

select client_id, acquired_date
from churn_model.churn_data
where churn = 'no'
and xp_points >= 21
and average_stars >= 22
and ad_target_group = 'level_3';

We can easily turn this into a view with the following statement.

create view churn_model.no_churn_active as
select client_id, acquired_date
from churn_model.churn_data
where churn = 'no'
and xp_points >= 21
and average_stars >= 22
and ad_target_group = 'level_3';

We can access our view with a simple select command. We'll get the exact same results we received by issuing the underlying query.

select * from churn_model.no_churn_active;

Under the hood, the view is running our query. The only difference is that we now can write a simpler SQL statement to retrieve our data.

Writing Stored Procedures

Python functions take arguments and execute one or more operations. MySQL stored procedures are like Python functions: they can take arguments, and with a single call, execute one or more operations.

For data science work, we will train machine learning models. We will want to log our training results over time to compare models. If we log results to a database, these results are permanent and trackable. Let's create a stored procedure so that we can generate such a structure for any ML project we undertake. To note, since this will be a procedure applicable across projects, I will create the stored procedure using the root account.

DELIMITER //
create procedure GenerateModelPerformanceTables()
begin
    create table if not exists model_score (
    id int not null primary key auto_increment,
    meta__inserted_at timestamp default current_timestamp,
    model_uid text,
    holdout_score float,
    scoring_metric text,
    holdout_type enum('test', 'cross_validation')
    );

    create table if not exists feature_score (
    id int not null primary key auto_increment,
    meta__inserted_at timestamp default current_timestamp,
    model_uid text,
    feature text,
    importance_score float,
    importance_metric text
    );

    create table if not exists model_metadata (
    id int not null primary key auto_increment,
    training_timestamp timestamp,
    model_uid text
    );

    create table if not exists model_logs (
    id int not null auto_increment primary key,
    meta__inserted_at timestamp not null default current_timestamp,
    uid text,
    logging_timestamp timestamp,
    input_output_payloads json
    );
end //
DELIMITER ;

We can now execute our stored procedure to create the tables.

use churn_model;
call GenerateModelPerformanceTables();
show tables;

Let's say we had another data science project. We could switch to its schema (by running "use schema_name;") and execute this stored procedure (using an account that has access to the stored procedure). We would then have our tables that we could populate.

To note, we have only created a handful of tables. Overall, the stored procedure is pretty simple and could easily be expanded.

Working with a NoSQL Database

An alternative to a SQL database is a NoSQL database. As you noticed when creating our MySQL tables, we had to define a structure for our tables. Likewise, when we inserted data into our tables, it had to match the format we defined. If our MySQL column is expecting floats and it gets strings, the insert will fail. This model is called "schema on write".

A NoSQL database is fundamentally different. It follows "schema on read". In such a scheme (get it?), we can write our raw data into a database that has no defined schema. We can just throw the data in there! When we query our database, we are able to pull back only the data we need for our work. This shall be more clear with an example.

The NoSQL database we will use is called DynamoDB, which is AWS's managed NoSQL offering, comparable to the popular MongoDB. We'll use the Python boto3 library to interact with DynamoDB. For the account represented by AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, you'll need to ensure it is permissioned to use DynamoDB.

To populate a DynamoDB table with our churn data, we can execute the following script.
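
A sketch along these lines does the trick; the table name (churn_data), region, and the choice to store every attribute as a string (DynamoDB rejects Python floats) are all adjustable.

import boto3
import pandas as pd

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')


def create_churn_table():
    # client_id is the partition key; no other schema needs to be declared up front
    table = dynamodb.create_table(
        TableName='churn_data',
        KeySchema=[{'AttributeName': 'client_id', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'client_id', 'AttributeType': 'S'}],
        BillingMode='PAY_PER_REQUEST',
    )
    table.wait_until_exists()
    return table


def load_churn_data():
    table = dynamodb.Table('churn_data')
    df = pd.read_csv('data/site_churn_data.csv')
    # batch_writer handles batching and retries; cast values to strings since
    # DynamoDB does not accept Python floats
    with table.batch_writer() as batch:
        for record in df.to_dict(orient='records'):
            batch.put_item(Item={key: str(value) for key, value in record.items()})


if __name__ == '__main__':
    create_churn_table()
    load_churn_data()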

You can then, say, pull back all the cases where churn = yes with the following script.
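
Again as a sketch: since churn is not part of the key, we fall back to a scan with a filter expression, paging through results until DynamoDB stops returning a LastEvaluatedKey.

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('churn_data')


def get_churned_users():
    # a scan returns at most 1 MB of data per call, so follow the pagination key
    response = table.scan(FilterExpression=Attr('churn').eq('yes'))
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            FilterExpression=Attr('churn').eq('yes'),
            ExclusiveStartKey=response['LastEvaluatedKey'],
        )
        items.extend(response['Items'])
    return items


if __name__ == '__main__':
    print(len(get_churned_users()))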
