Chapter Introduction: Code Structure and Cleanliness

I remember one of my first production scripts. It was an R program meant to automate a lengthy report. Looking back, I’m embarrassed I wrote such messy code. One of the main flaws of the script was its length; it was thousands of lines long. The script included the code for everything: ingestion, wrangling, and outputting the final product. It was a mess to debug and understand what was going on.

At the time, I didn’t understand how to properly structure code. This is a topic that I rarely see discussed in mainstream data science tutorials, but it is highly important for authoring effective data science code. In this section, we will cover both Python code and directory structures.

Code Structure and Cleanliness

If we jam everything into the root directory, we will potentially have trouble locating files. Likewise, other developers might be overwhelmed inheriting a huge repo with a laundry list of files! We need to organize our processes into logical subdirectories. The below is how we will structure the directories for our current project. Not every Python project needs to follow the exact same structure, though being repeatable is a positive. Standardized structures encourage collaboration and allow for easier onboarding. They also enable more opportunities to automate processes.

Directory Name Type of File data database connections and ingestion code helpers functions to support production code modeling machine learning model training scratch scratch files we don't want to delete; directory is added to .gitignore tests unit and integration tests utilities ancillary processes to support the project

Code Structure (and Shell Scripts)

If you’re doing a data science project, do not put all your code in one file. Don’t do it. Cramming all code into a single file makes debugging a challenge. I tend to have two files that actually execute production code within a repository (i.e. one that trains models and the other that deploys the chosen model). I often put ancillary, one-off scripts in the utilities directory. The other files become modules that are imported by other scripts. How we have set up our directory structure encourages us write modular code broken down into distinct files.

For example, here is how I recommend structuring code for the modeling directory.

config.py - contains all constant variables, like column lists, parameter grids, etc.
pipeline.py - stores scikit-learn pipeline structure with all preprocessing and modeling steps.
model.py - houses functions for training machine learning models.
evaluate.py - contains functions for evaluating machine learning models.
explain.py - stores functions for explaining machine learning models.
train.py - imports all foregoing files to train and serialize models.

These are not all the files you will need for your project, obviously. In other directories, you’ll also need files for unit tests, your production application, and potentially other tasks.

Since we have a pretty good idea of how we want to repeatedly structure machine learning project, we do not want to spend time creating such files individually. Instead, we can write a shell script to automatically create our directory structure and starter Python files. A shell script allows us to execute a series of bash commands, which, as we’ve seen, can create files and directories. Let’s author this shell script.

$ vim create_ml_project_structure.sh

Insert the following into our shell script.

	#!/bin/bash

	# create directories
	mkdir data
	mkdir helpers
	mkdir modeling
	mkdir scratch
	mkdir tests
	mkdir utilities

	# create files
	touch data/data.py
	touch data/__init__.py
	touch helpers/constants.py
	touch helpers/helpers.py
	touch helpers/__init__.py
	touch modeling/config.py
	touch modeling/pipeline.py
	touch modeling/model.py
	touch modeling/evaluate.py
	touch modeling/explain.py
	touch modeling/train.py
	touch modeling/__init__.py
	touch utilities/explore.py
	touch requirements.txt
	touch readme.md
	touch buildspec.yml
	touch main.py

	# create Dockerfile
	echo "FROM python:3.6
	MAINTAINER Micah Melling, micahmelling@gmail.com
	RUN groupadd docker
	RUN useradd --create-home appuser
	RUN usermod -aG docker appuser
	RUN newgrp docker
	RUN mkdir /app
	WORKDIR /app
	COPY . /app
	RUN chown -R appuser:docker /app
	USER appuser
	RUN pip install --no-cache-dir -r requirements.txt
	ENTRYPOINT ["python3"]
	" > Dockerfile

	# create stock .gitignore
	echo "*.txt
	!requirements.txt
	*.pyc
	*.csv
	*.xlsx
	*.png
	*.jpeg
	*.db
	*.pkl
	*.dot
	*.DS_Store
	.idea/
	__pycache__/
	venv/
	scratch.py
	scratch/
	create_ml_project_structure.sh
	" > .gitignore

	# create a virtual environment
	python3 -m venv venv

view raw create_ml_project_structure.sh hosted with ❤ by GitHub

Let’s run this puppy and confirm the output! To note, we first have to issue a chmod command to make the script executable.

$ chmod +x create_ml_project_structure.sh $ mkdir sample_repo && cd "$_" $ ./create_ml_project_structure.sh $ ls

As you can see, the script created a skeleton for our repo. That said, the script is not too opinionated. Check out the Dockerfile - we don't have an entrypoint or command at the end of the file. Based on our project, we'll need to fill in this gap. Likewise, we have a tests directory without any Python files. Based on our project requirements, we'll also have to fill in that gap. Keep in mind that each project is unique and necessitates customization. We realistically can't expect to create a template that will perfectly fit evert data science project. Being keenly aware of each effort's needs and tailoring our project accordingly is worthwhile.

Copier Templates

An alternative, or partial alternative, to the above shell script is a copier template. More or less, you create a directory or a git repo that you want to be your template (a git repo is preferable since changes can be tracked and distributed). The template would be a series of files and subdirectories like we saw in the shell script. We could even include starter code - perhaps a working requirements.txt file, standard model configurations in modeling/config.py, and a readme format. We should be careful to not include too much code in these templates, however - that's a job best suited for a custom library (which we shall cover later in this chapter).

After pip installing copier, you can issue the following command.

$ copier path/to/project/template path/to/destination

As mentioned, the template path could be a GitHub URL. Pretty nifty. Plenty of other cool things can be done with copier, which you can read about in the docs.

In an ideal world, we would create unique templates for each type of project (e.g. batch process, API, etc). These templates could make more assumptions about what is needed for a given project. We could also ship the template with a .pre-commit.yaml file and a shell script that would set up pre-commit and a virtual environment.

Main Methods

If you’ve reviewed much Python code, you’ve probably seen the following:

if __name__ == “__main__”:
some_function()

This is called a main method. When you run a Python script, its name becomes __main__. Those seemingly mysterious lines of code above are essentially saying: When I run this script (which is now called main), execute the following code. In the above example, the code in the function called some_function() will be executed. In general, all code should be executed in a main method. This is a useful convention to signal where code is actually being executed; it makes for a standardized entry point into your work.

As discussed above, your core execution scripts will import code from other scripts. Most of your files are bespoke modules for your core production scripts, and none of those scripts will be called __main__ as they will not be executed directly.

Modular and Loosely-Coupled Code

The code structure we have outlined is contingent on code being modular, that is, wrapped in single-purpose functions. Modular code rests on the idea that functionality of a script should be broken down into independent, interchangeable components. In general, a function should accomplish one goal. Modular code is also easier to read and debug. Beyond that, code wrapped in a function runs faster in Python (this has to do with how Python handles local vs. global variables).

Let’s take a look at the following function.

def clean_data(df):
   df[‘income’] = pd.cut(df[‘income’], bins=[0, 10_000, 25_000, 50_000, 100_000,                                              1_000_000])
   df.fillna({‘gender’: ‘not_provided’}, inplace=True)
   df[‘zip_code’] = df[‘zip_code’].astype(‘str’)
   return df

Is this modular code? Sort of. The code is wrapped in a function, and it’s all related to data cleaning. However, the following implementation is preferable as it separates functionality into clear component parts. When reading an entire script, the below would make the sequence of data wrangling steps easier to follow and would likely make finding bugs more palatable.

def bin_income(df):
df[‘income’] = pd.cut(df[‘income’], bins=[0, 10_000, 25_000, 50_000, 00_000, 1_000_000])
return df

def fill_gender_nulls(df):
df.fillna({‘gender’: ‘not_provided’}, inplace=True)
return df

def convert_zip_to_string():
df[‘zip_code’] = df[‘zip_code’].astype(‘str’)
return df

That’s better. The code is more modular, but it’s not what we would call loosely coupled. Loosely coupled code can be reused; it facilitates efficiencies over time.

The foregoing function convert_zip_to_string is only good for converting one variable to a string. What if we need to convert another variable to a string? We would either need to write a new function or add another line to the existing function and rename it. Alternatively, we could rewrite convert_zip_to_string, along with the other functions, to be loosely coupled.

def bin_feature(df, feature, bins):
df[feature] = pd.cut(df[feature], bins=bins])
return df

def fill_feature_nulls(df, feature, fill_value):
df.fillna({feature:fill_value}, inplace=True)
return df

def convert_feature_to_string(df, feature):
df[feature] = df[feature].astype(‘str’)
return df

Ah, much better. These functions can be used to preprocess multiple features. We accomplished this by removing references to specific features and replacing them with new function arguments. Let’s say we wanted to bin both income and expenditures, we could do the following:

income_bins=[0, 10_000, 25_000, 50_000, 100_000, 1_000_000]
expenditure_bins=[0, 25_000, 50_000, 75_000, 150_000 1_000_000]
# The following returns a new copy of the dataframe with income binned
household_data_df = bin_feature(df=household_data_df, feature=’income’, bins=income_bins)
# The following returns a new copy of the dataframe with expenditures binned
household_data_df = bin_feature(df=household_data_df, feature=’expenditures’, bins=expenditure_bins)

Later, we’ll go over how to embed such preprocessing in a scikit-learn pipeline. Attempt to contain your excitement.

Loosely coupled code has the benefit of being transferable. We could likely use the above functions across a variety of projects; it just becomes a copy-and-paste job or, better yet, something we could add to a private Python package. That said, we could improve the existing implementation even further by expanding them to natively handle more than one feature. Feel free to work through that exercise on your own.

There is an important point to make at this juncture: We do, at some level, have to make assumptions about our own specific workflows and encourage certain standards for ourselves. For instance, some of my code may not be fully transferable to you because, say, I want to log my model results differently than you. That's fine. We want to author loosely-coupled code yet still leave enough flexibility for certain customizations that encourage specific workflows.

Naming Conventions

When I started as a data scientist, I would often have pandas dataframes named df1, df2, etc. A better way exists. This section describes what I recommend doing instead. To note, we've not discussed pandas dataframes yet. For those unfamiliar, pandas is a popular library for wrangling data. It is often abbreviated as pd. A dataframe is the dominant data structure in pandas. Like an Excel spreadsheet, a dataframe is comprised of rows and columns of data. That's it.

Dataframes within functions, called local variables, should typically just be called df, in my opinion. This convention is predictable and consistent. Global dataframes should be named something more descriptive, such as defaults_df. For example, a helper function might look something like the following.

def select_columns(df, col_list):
df = df[col_list]
return df

Our global dataframe, which we could get by querying a database, should look something like the following.

defaults_df = pd.read_sql(defaults_query, get_mysql_conn())

Using select_columns() would look something like the following.

defaults_df = select_column(defaults_df, columns_for_modeling)

A brief word on columns: keep the formatting consistent. I prefer columns that are all lowercase and where spaces are replaced by underscores (i.e. snake case). Do your best to keep column names consistent. This helps prevent errors and can increase development efficiency.

That covers dataframes. What about other common structures, like lists and dictionaries? Again, in the global scope, we want something descriptive, not my_list or dict2. Try names like columns_to_strings_list or csv_replacement_names_dict. I am a big fan of descriptive, albeit longer, names. This allows others to read our code from start to finish and clearly understand what is happening even without digging deep into the code. This can be powerful. Likewise, I often prefer to append the data structure to the end of the object name for added clarity.

What about function names? I tend to prefer function names that begin with verbs. For instance, I would recommend a name like make_predictions() over predictions(). The verb communicates action and encourages better descriptions in my view. Likewise, a common convention is to have one function that actually runs the execution (see the GitHub repo for examples). This provides a clear, consistent entrypoint for readers of our code. I highly recommend following this convention. This type of function allows us to write a single, full docstring of what is being executed (though sometimes a docstring can be put at the top of a file). I also tend to pass in all global variables into the main execution function as a singular and predictable funnel. Oftentimes, this primary entrypoint function is simply called main, which is another solid convention but may not always be appropriate.

And what about file names? Consistency and conciseness are the big qualifications in my mind. For example, our modeling directory has the following files: config.py, evaluate.py, explain.py, model.py, pipeline.py, and train.py. We don't have any crazy names. They are short yet descriptive.

Lastly, I want to comment on inertia in naming conventions. All programmers have been there. We name something as df2 and get the code working in production. We don’t go back and refactor the names because, well, the scripts work like we need them. Rather, focus on writing good code with sensible naming conventions up front. This will help fight the inertia with poorly named features and objects that can be confusing once we haven't looked at our code in 6 months.

Comments and Docstrings

Properly commenting code is somewhat of a controversial topic. Some data scientists comment religiously. Others do not comment much at all. Contrary to what might be common advice, I side with the latter camp. We should strive to write clean, clear code. If someone else reads your code, even if they don’t know how to program, your naming conventions and general structure should be clear enough for them to follow. Good code largely eliminates the need for comments.

Now, this is not to say you should never leave comments in your code. Sometimes funky things happen that necessitate explanation. For example, the API you’re getting data from sometimes returns nulls and other times returns blank strings. Such a situation warrants a comment. Likewise, if your script is implementing business logic and making certain assumptions, adding comments is also beneficial to expound these decision. However, something like the following does not need a comment:

defaults_df = pd.read_sql(defaults_query, get_mysql_conn())

It’s clear that we are ingesting sales data from a database. I've seen programmers leave a comment like "# querying database". In general, avoid such pithy comments as they just muddle your program and add no real value.

In sum: rather than commenting copiously, write better code.

Also, a quick but relevant tangent. Pep8 standards, which declare how Python code should be written, state there should be a space between the # and the text of your comment. Something like "#this is a bad comment" is not proper.

Docstrings are another ballgame. A docstring is, well, a string that explains the core components of a function. Docstrings represent an important tool for communicating with other developers, including your future self. They also become the core of your project documentation, which will be covered in a later chapter. Let's take a look at a doc string.

def convert_camel_case_to_snake_case(df):
    """
    Converts dataframe column headers from snakeCase to camel_case.
    :param df: any valid pandas dataframe
    :return: pandas dataframe
    """
    new_columns = []
    for column in list(df):
        new_column = re.sub(r'(?         new_columns.append(new_column)
    df.columns = new_columns
    return df

Docstrings can be accessed via a function's __doc__ attribute.
print(convert_camel_case_to_snake_case.__doc__)

Generally, every function you write should have a docstring.

Print Statements

As a young, inexperienced programmer, my code was littered with print statements. It was my main form of debugging. Print statements are incredibly useful while prototyping, experimenting, and debugging, but they mostly do not have a place in production-grade code. I find they are primarily appropriate for allowing us to track the main execution steps of a script. Again, we want our code to be clean and as free as possible from extraneous lines.

Likewise, I favor writing output to a file rather than printing it. I don't know how many times I have been burned by printing a value and not being able to track it down when I needed it. If I had written the output to a file, I could have easily referenced it later.

Creating a Custom Linter

If you've used PyCharm, you'll notice that it gives you some nice hints about your code style. For example, if you don't have proper indentation, PyCharm will underline that portion of your code. Pretty handy. By default, PyCharm "lints" your code based on Pep8 standards. However, if you have more specific development standards, that won't help you too much. Fortunately, you can create your own custom code linter to alert you to areas where your code doesn't meet standards. Linting only encourages conventions and does not enforce them.

First, let's install the bellybutton library and set up a directory in which we can experiment.

$ pip3 install bellybutton

$ mkdir lint_example && cd "$_"

Create a simple Python file called sample.py with the following content.

def add_numbers(a, b):
return a + b

Running the following command will create a .bellybutton.yml file where we can house our linting rules

$ bellybutton init

The yaml file will already have some useful content. Let's edit it to create a rule that will identify variables that have a one-character name. These are not descriptive and should be assigned a better name.

settings:
   all_files: &all_files !settings
     included:
       - ~+/*
     excluded:
       - ~+/.tox/*
       - ~+/.git/*
       - ~+/venv/*
     allow_ignore: yes

default_settings: *all_files

rules:
   ShortName:
     description: "name is only one character"
     expr: //Name[string-length(@id) <= 1]
     example: "a"
     instead: "my_value"

If we run our linter, we will see that it catches the single-character variables in sample.py

$ bellybutton lint

Again, a custom linter will not force us to follow any conventions but can help catch cases where we do not conform to standards we wish to follow.

Using Python Black to Auto-Format Files

Python Black provides auto-formatting of Python files. After pip installing black, you can simply run something like the following from terminal:

$ black main.py

This command will automatically format your Python file according to black's standards. These standards are quite opinionated, which is by design. You can learn more about black here.

Creating Bespoke Bash Commands

Throughout this book, we have used many built-in bash commands, such as cd and mv. We can also create custom bash commands for our own convenience.

As data scientists, we sometimes will need to check the last time a table was updated. We can accomplish this task by writing a simple SQL query, but wouldn’t it be nice to find out via a single line in terminal? We shall now create such a capability.

Set up files and directories.

$ cd $ cd PycharmProjects $ mkdir cli_projects> $ cd cli_projects $ touch generate_sample_tables.py $ touch table_update_lookup.py

Populate generate_sample_tables.py with the following code and run it. This script will populate a SQLite table for use in this tutorial.

	import sqlite3
	import pandas as pd
	import numpy as np

	from time import sleep
	from datetime import datetime


	def connect_to_sqlite(db_name):
	"""
	creates a connect to a local sqlite database
	:param db_name: the database to connect to
	:return: sqlite connection object
	"""
	return sqlite3.connect(db_name)


	def generate_data(n_rows=100, n_cols=10, p_int=0.5, p_str=0.5):
	"""
	generates a dataframe of faux data for testing
	:param n_rows: number of rows in the dataframe
	:param n_cols: number of columns in the dataframe
	:param p_int: percentage of columns that are integers
	:param p_str: percentage of columns that are strings
	"""
	column_names = ['col_' + str(n) for n in range(n_cols)]

	float_data = []
	for i in range(int(p_int * n_cols)):
	temp_float_data = np.random.choice(100, n_rows).tolist()
	float_data.append(temp_float_data)

	str_data = []
	for i in range(int(p_str * n_cols)):
	temp_str_data = np.random.choice(['a', 'b', 'c', 'd'], n_rows).tolist()
	str_data.append(temp_str_data)

	df = pd.DataFrame()
	for column, data in zip(column_names, float_data + str_data):
	temp_df = pd.DataFrame({column: data})
	df = pd.concat([df, temp_df], axis=1)

	df['insert_timestamp'] = datetime.now().strftime("%m-%d-%Y %H:%M:%S")
	return df


	def dynamically_create_sqlite_table_from_dataframe(df, date_fields, table_name, sqlite_conn):
	"""
	generates and executes a create table SQL statement based on a dataframe, using duck-typing to infer column datatypes
	:param df: pandas dataframe to write to sqlite
	:param date_fields: fields in the dataframe we need to ensure are read as dates and not strings
	:param table_name: sqlite table now
	:param sqlite_conn: connection to sqlite database
	"""
	create_table_statement = f'create table if not exists {table_name} (id integer not null primary key autoincrement,'

	dtype_mapping = {
	'float': 'float',
	'int': 'integer',
	'date': 'timestamp',
	'object': 'text'
	}

	for field in date_fields:
	df[field] = pd.to_datetime(df[field])

	df_dtypes = df.dtypes
	for index, dtype in enumerate(df_dtypes):
	column_name = df_dtypes.index[index]
	mapped_dtype = [val for key, val in dtype_mapping.items() if key in str(dtype)][0]
	create_statement_addendum = f''' {column_name} {mapped_dtype},'''
	create_table_statement += create_statement_addendum

	create_table_statement = create_table_statement[:-1]
	create_table_statement += ')'

	cursor = sqlite_conn.cursor()
	cursor.execute(create_table_statement)


	def write_df_to_sqlite_table(df, table_name, sqlite_conn):
	"""
	write a pandas dataframe to a sqlite table
	:param df: pandas dataframe to write to sqlite table
	:param table_name: name of the sqlite table
	:param sqlite_conn: connection to sqlite database
	"""
	df.to_sql(table_name, sqlite_conn, if_exists='append', index=False)


	if __name__ == "__main__":
	sample_df = generate_data()
	sqlite_connection = connect_to_sqlite('sample_db')
	dynamically_create_sqlite_table_from_dataframe(sample_df, ['insert_timestamp'], 'sample_table', sqlite_connection)
	write_df_to_sqlite_table(sample_df, 'sample_table', sqlite_connection)
	sleep(10)
	new_sample_df = generate_data()
	write_df_to_sqlite_table(new_sample_df, 'sample_table', sqlite_connection)

view raw populate_sqlite.py hosted with ❤ by GitHub

Populate table_update_lookup.py with the following code. This is the script we will turn into a bash command.

	import sqlite3
	import pandas as pd
	import os
	import sys


	# update db connection as needed
	def connect_to_sqlite(db_name):
	return sqlite3.connect(db_name)


	def main():
	query = '''select max({}) as 'timestamp' from {} '''.format(sys.argv[3], sys.argv[2])
	df = pd.read_sql(query, connect_to_sqlite(sys.argv[1]))
	max_insert = df['timestamp'][0]
	print(max_insert)


	if __name__ == "__main__":
	# if working with sqlite, update the follwoing path as needed to ensure we can find the sqlite database,
	# which is just a local file
	# if connecting to a remote server, this line would not be necessary
	os.chdir('/Users/micahmelling/PycharmProjects/cli_projects')
	main()

view raw table_update_lookup.py hosted with ❤ by GitHub

We can run this script from the command line like a normal Python script. When we run it, we will need to pass three command line arguments in order: the database name, the table name, and the timestamp column name to check.

$ python3 table_update_lookup.py sample_db sample_table insert_timestamp

The above will return the most recent insert_timestamp from sample_db.sample_table.

To make our script executable on the command line, we simply need to take the following steps.

1) Put the following line at the top of table_update_lookup.py:

#!/usr/bin/env python3

That is called a shebang, and it lets our shell know which interpreter should execute this script. In this case, the interpreter is Python3.

2) Set the executable flag on our script, which lets our shell know this script can be run directly from the command line.

$ chmod +x table_update_lookup.py

3) Move the script into a home directory called bin, a common convention for such work. We do not need our file extension any more, so we drop it when moving it into bin.

$ mkdir -p ~/bin
$ cp table_update_lookup.py ~/bin/table_update_lookup

4) Add ~/bin to our PATH in our bash_profile.

$ vim ~/.bash_profile
Insert the following using our cool vim skills: export PATH=$PATH":$HOME/bin"
$ source ~/.bash_profile

5) Verify everything works!

$ cd
$ table_update_lookup sample_db sample_table insert_timestamp

If you see a timestamp printed in your terminal, all is good!

This example is a bit of a toy, but with some slight modifications (i.e. querying tables on a remote MySQL server), it could be quite powerful and time-saving.

Quick Operations with B-Python

Most of the time, we want to write full scripts in a code editor. However, there are times when we might want to perform a quick, one-off calculation in Python. (I use Python as my calculator!). If you run $ python3 from terminal, you will get a Python interpreter and can run code line-by-line. This is fine and useful. However, we don't have some of the nice bells and whistles of PyCharm, such as auto-completion or color coding. However, we could get a more interactive environment with bpython.

$ pip3 install bpython $ bypython

Using the Pandas Extensions API

We can also clean up come by writing certain abstractions. For example, pandas has a nifty little extensions API. Let's say we repeatedly needed to find the second highest value in a column. (This sounds like a weird task, but I've had to perform niche wrangling tasks many times in my career). We could write the following extension.

	import pandas as pd


	@pd.api.extensions.register_dataframe_accessor("ext")
	class PandasAccessor:
	def __init__(self, pandas_object):
	self.pandas_object = pandas_object

	def get_second_highest(self, column):
	sorted_col = self.pandas_object.sort_values(by=column, ascending=False)
	value = sorted_col[column].iloc[[1]].values[0][0]
	return value


	if __name__ == "__main__":
	df = pd.DataFrame({'a': [1, 5, 10, 14, 24, 16, 18]})
	print(df.ext.get_second_highest(['a']))

view raw pandas_extensions_api.py hosted with ❤ by GitHub

Building a Library to Abstract Common Processes

Python has many useful packages that we can install with pip, chief among them are pandas, numpy, and scikit-learn. Could we create our own package? Absolutely. Doing so can be a useful way to abstract common processes we plan to use across projects. If we find we are using a function across projects, we would be best served in many cases by adding it to an installable library.

Creating a Python package is surprisingly quite simple. Let’s create the necessary structure.

$ mkdir data_science_helpers
$ cd data_science_helpers
$ touch setup.py
$ touch README.md
$ mkdir ds_helpers
$ cd ds_helpers
$ touch __init__.py
$ touch aws.py
$ touch db.py

In the root directory of our project, we need a file called setup.py. This is the build script needed by the setuptools library, which allows us to build Python distributions. The README.md is a place to store the documentation for our package. We also need a subdirectory where the code will go. In this case, we call it ds_helpers. Within this subdirectory, we need a file called __init__.py, which is required to import the directory as a package. We will add aws.py and db.py, which will house our code. In these files, we will add some functionality to interact with AWS and remote databases. For this code to work as expected, you will need to have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as environment variables.

You’ll also want to create a GitHub repository. Luckily, we can host our package on GitHub and install it locally with pip. If you make your repo public, anyone in the world will be able to install your package. Depending on your code, this may not be an issue. If you make the repo private, only you will be able to install your package. We'll address this topic more in the next section. For now, feel free to either make your package repo public or private.

Insert the following text into setup.py. Switch out my information for yours. Of note, going forward I will mostly use PyCharm as my text editor as opposed to VIM.

	import setuptools


	with open("README.md", "r") as fh:
	long_description = fh.read()

	setuptools.setup(
	name="data_science_helpers",
	version="0.0.1",
	author="Name",
	author_email="email@email.com",
	description="Common data science helper functions",
	long_description=long_description,
	long_description_content_type="text/markdown",
	url="url",
	packages=['ds_helpers'],
	install_requires=['boto3>=1.11.2', 'SQLAlchemy>=1.3.20', 'pymysql>=0.9.3', 'tqdm>=4.46.1'],
	classifiers=[
	"Programming Language :: Python :: 3",
	"License :: OSI Approved :: MIT License",
	"Operating System :: OS Independent",
	],
	python_requires='>=3.6',
	)

view raw setup.py hosted with ❤ by GitHub

Put the following code in a file called aws.py in our ds_helpers directory. Most of this code was inspired by the boto3 documentation and AWS documentation. Boto3 is the AWS SDK (software development kit) for Python.

	import boto3
	import json
	import os
	import botocore

	from tqdm import tqdm
	from botocore.exceptions import ClientError


	def get_secrets_manager_secret(secret_name, region='us-west-2'):
	"""
	Retrieves a secret from AWS Secrets Manager, a password manager.

	:param secret_name: name of the secret
	:type secret_name: str
	:param region: name of the AWS region; default is us-west-2
	:type region: str
	:return: dictionary with the secret keys and values
	"""
	session = boto3.session.Session()
	client = session.client(
	service_name='secretsmanager',
	region_name=region
	)
	secret_value_response = client.get_secret_value(SecretId=secret_name)
	for key, value in secret_value_response.items():
	if key == 'SecretString':
	return json.loads(value)


	def set_aws_environment_variables(secret_name):
	"""
	Sets AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables based on the access_key and secret_key in
	the secret_name.

	:param secret_name: Secrets Manager secret name
	"""
	keys_secret = get_secrets_manager_secret(secret_name)
	os.environ['AWS_ACCESS_KEY_ID'] = keys_secret.get('access_key')
	os.environ['AWS_SECRET_ACCESS_KEY'] = keys_secret.get('secret_key')


	def upload_file_to_s3(file_name, bucket, directory=''):
	"""
	Uploads a local file to S3, in the specified bucket and directory.

	:param file_name: the name of the local file; this will also be the file name in S3
	:type file_name: str
	:param bucket: name of the bucket to upload the file to; the bucket will be created if it does not exist
	:type bucket: str
	:param directory: the name of the directory in which to put the file; default is root
	:type directory: str
	"""
	s3_client = boto3.client('s3')
	s3_client.upload_file(file_name, bucket, os.path.join(directory, file_name))


	def upload_directory_to_s3(local_directory, bucket):
	"""
	Uploads an entire directory to S3.

	:param local_directory: name of the local directory
	:type local_directory: str
	:param bucket: name of the S3 bucket
	:type bucket: str
	"""
	directory_walk = os.walk(local_directory)
	for directory_path, directory_name, file_names in directory_walk:
	if directory_path != os.path.join(local_directory):
	sub_dir = os.path.basename(directory_path)
	for file in tqdm(file_names):
	upload_file_to_s3(file_name=os.path.join(local_directory, sub_dir, file), bucket=bucket)
	else:
	for file in tqdm(file_names):
	upload_file_to_s3(file_name=os.path.join(local_directory, file), bucket=bucket)


	def download_file_from_s3(file_name, bucket, directory=''):
	"""
	Downloads a single specified file from S3. If the file is in a directory in S3, the file will be placed in a
	directory of the same name locally.

	:param file_name: the name of the file to download; this will also be the name of the file locally
	:type file_name: str
	:param bucket: the name of the bucket where the file exists
	:type bucket: str
	:param directory: the name of the directory where the file exists; the directory will be created locally if it does
	not exist
	:type directory: str
	:return: None
	"""
	s3 = boto3.resource('s3')
	try:
	if not os.path.exists(directory) and directory != '':
	os.makedirs(directory)
	s3.Bucket(bucket).download_file(os.path.join(directory, file_name), os.path.join(directory, file_name))
	except botocore.exceptions.ClientError as e:
	if e.response['Error']['Code'] == "404":
	print("The object does not exist.")
	else:
	raise


	def download_directory_from_s3(bucket, directory):
	"""
	Downloads a folder in S3 and all of its contents into a local directory.

	:param bucket: name of the bucket in S3 where the directory lives
	:type bucket: str
	:param directory: directory in S3 we want to download
	:type directory: str
	"""
	s3_resource = boto3.resource('s3')
	bucket = s3_resource.Bucket(bucket)
	for object in bucket.objects.filter(Prefix=directory):
	if not os.path.exists(os.path.dirname(object.key)):
	os.makedirs(os.path.dirname(object.key))
	bucket.download_file(object.key, object.key)


	def log_payload_to_s3(input_payload, output_payload, uid, bucket_name):
	"""
	Logs input and output payloads to S3 in a single object.

	:param input_payload: the input payload
	:param output_payload: the output payload
	:param uid: session UID
	:param bucket_name: the name of the S3 bucket to upload the payloads to
	"""
	new_input_payload = dict()
	new_input_payload['input'] = input_payload
	new_output_payload = dict()
	new_output_payload['output'] = output_payload
	final_payload = str({new_input_payload, new_output_payload})
	with open(f'{uid}.json', 'w') as outfile:
	outfile.write(final_payload)
	upload_file_to_s3(f'{uid}.json', bucket_name)
	os.remove(f'{uid}.json')

view raw aws.py hosted with ❤ by GitHub

Insert the following code into db.py.

	import sqlalchemy


	def connect_to_mysql(connection_dictionary, ssl_path='rds-ca-2019-root.pem'):
	"""
	Connects to a MySQL database. Require the RDS pem file is in your working directory.

	:param connection_dictionary: dictionary containing keys for host, user, password, and database
	:type connection_dictionary: dictionary
	:param ssl_path: path to the RDS ssl file
	:type ssl_path: path or path represented as string
	:return: sqlalchemy connection
	"""
	ssl_args = {'ssl': {'ca': ssl_path}}
	host = connection_dictionary['host']
	user = connection_dictionary['user']
	password = connection_dictionary['password']
	database = connection_dictionary['database']
	engine = sqlalchemy.create_engine(f'mysql+pymysql://{user}:{password}@{host}:3306/{database}',
	connect_args=ssl_args)
	return engine


	def write_dataframe_to_database(df, schema, table, db_conn):
	"""
	Writes a dataframe to a database table

	:param df: dataframe we want to record
	:type df: pandas dataframe
	:param schema: name of the schema
	:type schema: str
	:param table: name of the table
	:type table: str
	:param db_conn: sqlalchemy connection
	"""
	df.to_sql(name=table, schema=schema, con=db_conn, if_exists='append', index=False)


	def dynamically_create_ddl_and_execute(df, schema, table, db_conn):
	"""
	Creates a DDL statement based on the inputted dataframe, which includes an id column and a meta__inserted_at
	column. Executes the DDL to create the table.

	:param df: dataframe we want to use to construct the ddl statement
	:type df: pandas dataframe
	:param schema: name of the schema in which the table will go
	:type schema: str
	:param table: name of the table to create
	:type table: str
	:param db_conn: sqlalchemy connection
	"""
	ddl_statement = f'''
	create table if not exists {schema}.{table}(
	id int not null auto_increment primary key,
	meta__inserted_at timestamp default current_timestamp,
	'''

	dtype_mapping = {
	'int': 'int',
	'float': 'float',
	'object': 'text'
	}

	cols = list(df)
	for col in cols:
	col_dtype = str(df[col].dtype)
	sql_value = [val for key, val in dtype_mapping.items() if key in col_dtype][0]
	temp_str = f'{col} {sql_value},'
	ddl_statement += temp_str

	ddl_statement = ddl_statement[:-1]
	ddl_statement += ');'
	db_conn.execute(ddl_statement)

view raw db.py hosted with ❤ by GitHub

Lastly, insert the following into the __init__.py file.

	__version__ = '0.0.1'

	from .aws import *
	from .db import *

view raw __init__.py hosted with ❤ by GitHub

Commit the code and push it to your remote repository. Let’s now test our library!

Set up environment for testing.

$ mkdir package_check && cd "$_"
$ python3 -m venv venv
$ source venv/bin/activate

Install the package. To note, your URL will be different. We're using the https endpoint, so you may have to authenticate if your repo is private. Using the https endpoint covers both cases where the repo is public or private.

$ pip3 install git+https://github.com/micahmelling/data_science_helpers#egg=ds_helpers

Verify the installation. You should see the package has been installed along with the dependencies.

$ pip3 freeze

Start Python, import functions from library, and create a sample file.

$ python3
>>> from ds_helpers.aws import upload_file_to_s3, download_folder_from_s3
>>>file = open("sample.txt", "w")
>>>file.write("testing...testing...testing")
>>>file.close()

Verify our functions work. We’ll see that our file has been uploaded and downloaded as expected. Please note that you'll need to use a different bucket name.

>>> upload_file_to_s3('sample.txt', 'micahm-sample-s3-bucket', 'ds-files')
>>> download_file_from_s3('sample.txt', 'micahm-sample-s3-bucket', 'ds-files')
>>> download_folder_from_s3('micahm-sample-s3-bucket', 'ds-files')

Congrats! We now have a working Python package with some useful helper code.

Creating our own PyPi Server

In the previous section, we hosted our package on GitHub. This is convenient but comes with limitations. If we made our repo public, then anyone in the world can install it. That may or may not be a concern. If we made our repo private, we have no issue installing it locally. However, what if we want to use our package on an AWS EC2 instance? Or what if we want a contractor to use the package, but we don't want them to have access to our GitHub? Fortunately, we can easily create our own PyPi server on AWS S3 and only allow certain IP addresses to install the package.

PyPi is the main distribution source for Python packages. Per it's website, "PyPi helps you find and install software developed and shared by the Python community". When you pip install a package, pip is (often) reaching out to the PyPi servers to get the package you want to install. We can emulate PyPi using the instructions in the following video.

Per the video, you can find the sample bucket policy here.

Using Pulumi to Set Up our PyPi Server

The set up presented in the video from the last section has a couple of drawbacks. One, our site does not have an SSL certificate. Two, the process requires many manual steps. We can remedy that by using Pulumi, which will allow us to write out AWS infrastructure as code.

You can follow this tutorial to get started with setting up Pulumi on your machine.

Using your AWS admin account, perform the following tasks.

Using IAM, create a user called pulumi_static_site. Only give it programmatic access. I recommend putting the API keys in Secrets Manager.
Using IAM, create a group called pulumi_static_site.
Attach the following AWS Managed Policies to the group: AWSWAFFullAccess, CloudFrontFullAccess, AmazonRoute53FullAccess, and AmazonS3FullAccess.
Add the user pulumi_static_site to the group pulumi_static_site.
Register a domain in Route 53.
Get a corresponding SSL certificate. SSL certificate for your domain. Do so in the us-east-1 region (N. Virginia).

Now, execute the following from the command line. The access keys you export as temporary environment variables are those for the pulumi_static_site user.

$ mkdir pulumi scripts && cd "$_" $ mkdir private-python-package && cd "$_" $ pulumi new aws-python $ export AWS_ACCESS_KEY_ID= $ export AWS_SECRET_ACCESS_KEY=

You're now ready to to place the following code in __main__.py.

	import mimetypes
	import pulumi_aws as aws
	import pulumi

	from pulumi import FileAsset
	from pulumi_aws import s3


	def main(bucket_name, index_html_path, aliases, certificate_arn, domain_name, hosted_zone_id, ip_address):
	"""
	Creates a static, single-page website accessible via a Route 53 DNS and protected with SSL.

	Example:
	main(
	bucket_name='my-s3-bucket',
	index_html_path='index.html',
	aliases=['python-package.mydomain.com'],
	certificate_arn='arn:aws:acm:us-east-1:000000000000:certificate/oooooooooo', # found in certificate manager
	domain_name='python-package.mydomain.com',
	hosted_zone_id='ZZZZZZZZZZZZZZZZZZZZZZZ', # found in route 53
	ip_address='0.0.0.0/32' # desired IP address to have access
	)

	:param bucket_name: name of the S3 bucket hosting the content.
	:param index_html_path: path to index.html
	:param aliases: CloudFront aliases, which must include domain_name
	:param certificate_arn: ARN of the SSL cert, which must be in us-east-1
	:param domain_name: domain name
	:param hosted_zone_id: Route53 hosted zone ID
	:param ip_address: IP address we want to have access
	"""
	web_bucket = s3.Bucket(bucket_name,
	bucket=bucket_name,
	website=s3.BucketWebsiteArgs(
	index_document="index.html"
	))

	bucket_public_access_block = aws.s3.BucketPublicAccessBlock(f'{bucket_name}_public_access',
	bucket=web_bucket.id,
	block_public_acls=False,
	block_public_policy=False)

	mime_type, _ = mimetypes.guess_type(index_html_path)
	obj = s3.BucketObject(
	index_html_path,
	bucket=web_bucket.id,
	source=FileAsset(index_html_path),
	content_type=mime_type
	)

	# can easily modify the ipset to accommodate multiple IP addresses
	ipset = aws.waf.IpSet("ipset",
	ip_set_descriptors=[aws.waf.IpSetIpSetDescriptorArgs(
	type="IPV4",
	value=ip_address,
	)])

	wafrule = aws.waf.Rule("wafrule",
	metric_name="WAFRule",
	predicates=[aws.waf.RulePredicateArgs(
	data_id=ipset.id,
	negated=False,
	type="IPMatch",
	)],

	opts=pulumi.ResourceOptions(depends_on=[ipset]))

	waf_acl = aws.waf.WebAcl("wafAcl",
	metric_name="WebACL",
	default_action=aws.waf.WebAclDefaultActionArgs(
	type="BLOCK",
	),
	rules=[aws.waf.WebAclRuleArgs(
	action=aws.waf.WebAclRuleActionArgs(
	type="ALLOW",
	),
	priority=1,
	rule_id=wafrule.id,
	type="REGULAR",
	)],
	opts=pulumi.ResourceOptions(depends_on=[
	ipset,
	wafrule,
	]))

	oai = aws.cloudfront.OriginAccessIdentity(f"{bucket_name}_oai")

	s3_distribution = aws.cloudfront.Distribution(
	f'{bucket_name}_distribution',
	origins=[aws.cloudfront.DistributionOriginArgs(
	domain_name=web_bucket.bucket_regional_domain_name,
	origin_id=f's3{bucket_name}_origin',
	s3_origin_config=aws.cloudfront.DistributionOriginS3OriginConfigArgs(
	origin_access_identity=oai.cloudfront_access_identity_path,
	),
	)],
	enabled=True,
	is_ipv6_enabled=False,
	default_root_object="index.html",
	aliases=aliases,
	web_acl_id=waf_acl.id,
	default_cache_behavior=aws.cloudfront.DistributionDefaultCacheBehaviorArgs(
	allowed_methods=[
	"DELETE",
	"GET",
	"HEAD",
	"OPTIONS",
	"PATCH",
	"POST",
	"PUT",
	],
	cached_methods=[
	"GET",
	"HEAD",
	],
	target_origin_id=f's3{bucket_name}_origin',
	forwarded_values=aws.cloudfront.DistributionOrderedCacheBehaviorForwardedValuesArgs(
	query_string=False,
	headers=["Origin"],
	cookies=aws.cloudfront.DistributionOrderedCacheBehaviorForwardedValuesCookiesArgs(
	forward="none",
	),
	),
	viewer_protocol_policy="redirect-to-https",
	),
	restrictions=aws.cloudfront.DistributionRestrictionsArgs(
	geo_restriction=aws.cloudfront.DistributionRestrictionsGeoRestrictionArgs(
	restriction_type="none",
	),
	),
	viewer_certificate=aws.cloudfront.DistributionViewerCertificateArgs(
	acm_certificate_arn=certificate_arn,
	ssl_support_method='sni-only'
	))

	# you might have to wrap oai.iam_arn in an f-string on the first run and then re-run with the original way.
	# pulumi is a bit finicky
	source = aws.iam.get_policy_document(statements=[
	aws.iam.GetPolicyDocumentStatementArgs(
	actions=["s3:GetObject"],
	resources=[f"arn:aws:s3:::{bucket_name}/*"],
	principals=[
	aws.iam.GetPolicyDocumentStatementPrincipalArgs(
	type='AWS',
	identifiers=[oai.iam_arn]
	),
	]
	),
	aws.iam.GetPolicyDocumentStatementArgs(
	actions=["s3:GetObject"],
	resources=[f"arn:aws:s3:::{bucket_name}/*"],
	principals=[
	aws.iam.GetPolicyDocumentStatementPrincipalArgs(
	type='*',
	identifiers=['*']
	),
	],
	conditions=[
	aws.iam.GetPolicyDocumentStatementConditionArgs(
	test='IpAddress',
	variable="aws:SourceIp",
	values=[f'{ip_address}']
	)
	]
	)
	]
	)

	web_bucket_name = web_bucket.id
	bucket_policy = s3.BucketPolicy(f"{bucket_name}_bucket-policy",
	bucket=web_bucket_name,
	policy=source.json)

	route_53_record = aws.route53.Record(domain_name,
	zone_id=hosted_zone_id,
	name=domain_name,
	type="A",
	aliases=[aws.route53.RecordAliasArgs(
	name=s3_distribution.domain_name,
	zone_id=s3_distribution.hosted_zone_id,
	evaluate_target_health=False,
	)]
	)

view raw pulumi_pypi_server.py hosted with ❤ by GitHub

You'll also need the following index.html in your working directory.

	<html>
	<body>
	<a href="my_package_name-0.0.1.tar.gz">
	my_package_name-0.0.1.tar.gz
	</a>
	<br>
	<a href="my_package_name-0.0.2.tar.gz">
	my_package_name-0.0.2.tar.gz
	</a>
	</body>
	</html>

view raw index.html hosted with ❤ by GitHub

Likewise, you'll need to add the package distributions (see the video in the previous section) to the root of your S3 bucket. In my example, I included two package versions.

You're now ready to deploy your PyPi server!

$ pulumi up

A couple of notes. 1) You can easy adapt the code to accommodate more IP addresses. See comments. 2) Pulumi can be a little finicky with web application firewalls (WAFs). See the comment on lines 136 - 137 for an adjustment you might have to make. 3) You can easily build logical checks in your script. For instance, you could halt the script if the user put an IP address of 0.0.0.0/32, which would make the site open to the entire world.

Now, you should be able to go to your domain and see your index.html that links to your packages! To note, this script also produces a non-SSL site, like in the previous section, though access is still locked down by IP address. Simply pretend this version does not exist.

Now, how do you install your package? As an example, let's say your package is called data_science_helpers, and your domain is private-package.mydatasciencesite.com. You can simply issue a pip install with some additional flags so that pip knows where to look.

$ pip3 install --find-links https://private-package.mydatasciencesite.com/ --trusted-host private-package.mydatasciencesite.com data_science_helpers==0.0.1

There you have it. A private Python package!

What happens if you want to release a new package version? Fow now, you'll simply need to update your index.html to include a reference to the new package and to also upload the new package distribution to S3. You might have to run an invalidation on your current Cloudfront Distribution after uploading the new files.

aws cloudfront create-invalidation --distribution-id=YOUR_DISTRIBUTION_ID --paths "/*"

Don't worry - we'll discuss better ways to go about releasing new versions in chapter 14 :-)

Should you want to delete your site and associated resources, you can simply issue the following command.

$ pulumi destroy

Citations

Don't Comment Your Code
How Do I Make My Own Command-Line Commands Using Python?
Packaging & Hosting python repo to S3
An Authenticated S3 Python Package Repository