Documentation: no one wants to create it, but everyone wants to have it. If we have done our due diligence by creating docstrings, we are most of the way there. We can turn our docstrings into formal documentation via a framework called Sphinx. Better yet, we can automatically generate this documentation each time we push to our remote git repository. In addition, we'll cover some topics and tools to round out your project documentation.
So far, we have written fairly straightforward docstrings in the reStructuredText (reST) format. However, as you might have guessed, we could employ other styles, as shown in the code below.
# reStructuredText (reST) format
import os

import boto3


def download_folder_from_s3(bucket_name, directory='/'):
    """
    Downloads the contents of an entire folder from an S3 bucket into a local directory. If the remote directory
    is root, then the local directory name is constructed with the f-string f's3_{bucket_name}'. Otherwise, the
    name of the remote directory is also the name of the local directory.

    :param bucket_name: name of the S3 bucket
    :type bucket_name: str
    :param directory: name of the directory in the S3 bucket, default '/'
    :type directory: str
    :returns: this function does not return anything
    """
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    if directory == '/':
        directory_name = f's3_{bucket_name}'
        if not os.path.exists(directory_name):
            os.makedirs(directory_name)
        for s3_object in bucket.objects.all():
            bucket.download_file(s3_object.key, os.path.join(directory_name, s3_object.key))
    else:
        for s3_object in bucket.objects.filter(Prefix=directory):
            if not os.path.exists(os.path.dirname(s3_object.key)):
                os.makedirs(os.path.dirname(s3_object.key))
            bucket.download_file(s3_object.key, s3_object.key)

# Google format
def download_folder_from_s3(bucket_name, directory='/'):
    """
    Downloads the contents of an entire folder from an S3 bucket into a local directory. If the remote directory
    is root, then the local directory name is constructed with the f-string f's3_{bucket_name}'. Otherwise, the
    name of the remote directory is also the name of the local directory.

    Args:
        bucket_name (str): name of the S3 bucket
        directory (str): name of the directory in the S3 bucket, default '/'

    Returns:
        this function does not return anything
    """
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    if directory == '/':
        directory_name = f's3_{bucket_name}'
        if not os.path.exists(directory_name):
            os.makedirs(directory_name)
        for s3_object in bucket.objects.all():
            bucket.download_file(s3_object.key, os.path.join(directory_name, s3_object.key))
    else:
        for s3_object in bucket.objects.filter(Prefix=directory):
            if not os.path.exists(os.path.dirname(s3_object.key)):
                os.makedirs(os.path.dirname(s3_object.key))
            bucket.download_file(s3_object.key, s3_object.key)

# Numpy format
def download_folder_from_s3(bucket_name, directory='/'):
    """
    Downloads the contents of an entire folder from an S3 bucket into a local directory. If the remote directory
    is root, then the local directory name is constructed with the f-string f's3_{bucket_name}'. Otherwise, the
    name of the remote directory is also the name of the local directory.

    Parameters
    ----------
    bucket_name: str
        name of the S3 bucket
    directory: str
        name of the directory in the S3 bucket, default '/'

    Returns
    -------
    this function does not return anything
    """
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    if directory == '/':
        directory_name = f's3_{bucket_name}'
        if not os.path.exists(directory_name):
            os.makedirs(directory_name)
        for s3_object in bucket.objects.all():
            bucket.download_file(s3_object.key, os.path.join(directory_name, s3_object.key))
    else:
        for s3_object in bucket.objects.filter(Prefix=directory):
            if not os.path.exists(os.path.dirname(s3_object.key)):
                os.makedirs(os.path.dirname(s3_object.key))
            bucket.download_file(s3_object.key, s3_object.key)
Type hinting is a nifty little way to supply more context about our functions. In the code below, we communicate that the argument db_secret_name is expected to be a string, and the function get_most_recent_db_insert returns a string.
import pandas as pd

# db and aws below refer to this project's helper modules for MySQL connections and Secrets Manager access


def get_most_recent_db_insert(db_secret_name: str) -> str:
    """
    Gets the most recent logging_timestamp inserted into the churn_model.model_logs table

    :param db_secret_name: name of the Secrets Manager secret for DB credentials
    :returns: most recent logging_timestamp
    """
    query = '''
    select max(logging_timestamp) as max_insert
    from churn_model.model_logs;
    '''
    df = pd.read_sql(query, db.connect_to_mysql(aws.get_secrets_manager_secret(db_secret_name),
                                                ssl_path='data/rds-ca-2019-root.pem'))
    max_insert = df['max_insert'][0]
    return max_insert
Sphinx is probably the most popular framework for documenting code. It keys off our docstrings, meaning we can pretty easily create sleek-looking documentation. Better yet, we can automatically update our documentation each time we make a code change. We can accomplish this aim by building it into our buildspec.yml. However, we first need to do some setup.
We first need to create a docs directory.
$ mkdir docs && cd "$_"
We can then run the Sphinx quickstart to create our project, filling in the appropriate details.
$ sphinx-quickstart
Next, we shall configure our conf.py file.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))


# -- Project information -----------------------------------------------------

project = 'Churn Model'
copyright = '2021, Micah Melling'
author = 'Micah Melling'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc'
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
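One note: sphinx.ext.autodoc understands the reST-style docstrings we have been writing. If we instead adopted the Google or NumPy styles shown earlier, we would also want to enable Sphinx's napoleon extension. A minimal adjustment to conf.py might look like the following (the extra extension is an addition on my part, not something the quickstart generates).

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon'  # lets autodoc parse Google- and NumPy-style docstrings
]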
We also must configure our index.rst file, which declares which files we actually want to document.
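A minimal index.rst might look something like the following, assuming we want to document modeling/train.py and app.py (the surrounding layout generated by sphinx-quickstart can stay in place).

.. toctree::
   :maxdepth: 2
   :caption: Contents:

.. automodule:: modeling.train
   :members:

.. automodule:: app
   :members: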
We shall now create the actual documentation files!
$ make html
Let's use the following AWS CLI command to copy our documentation files onto the S3 website.
$ aws s3 cp docs/ s3://churn-model-ds-docs/docs --recursive
We can use the following Pulumi script to create a static website on S3 to host our documentation.
import mimetypes

import pulumi_aws as aws
import pulumi
from pulumi import FileAsset
from pulumi_aws import s3


def main(bucket_name, index_html_path, aliases, certificate_arn, domain_name, hosted_zone_id, ip_address):
    """
    Creates a static, single-page website accessible via a Route 53 DNS and protected with SSL.

    Example:
    main(
        bucket_name='my-s3-bucket',
        index_html_path='index.html',
        aliases=['python-package.mydomain.com'],
        certificate_arn='arn:aws:acm:us-east-1:000000000000:certificate/oooooooooo',  # found in certificate manager
        domain_name='python-package.mydomain.com',
        hosted_zone_id='ZZZZZZZZZZZZZZZZZZZZZZZ',  # found in route 53
        ip_address='0.0.0.0/32'  # desired IP address to have access
    )

    :param bucket_name: name of the S3 bucket hosting the content.
    :param index_html_path: path to index.html
    :param aliases: CloudFront aliases, which must include domain_name
    :param certificate_arn: ARN of the SSL cert, which must be in us-east-1
    :param domain_name: domain name
    :param hosted_zone_id: Route53 hosted zone ID
    :param ip_address: IP address we want to have access
    """
    web_bucket = s3.Bucket(bucket_name,
                           bucket=bucket_name,
                           website=s3.BucketWebsiteArgs(
                               index_document="index.html"
                           ))

    bucket_public_access_block = aws.s3.BucketPublicAccessBlock(f'{bucket_name}_public_access',
                                                                bucket=web_bucket.id,
                                                                block_public_acls=False,
                                                                block_public_policy=False)

    mime_type, _ = mimetypes.guess_type(index_html_path)
    obj = s3.BucketObject(
        index_html_path,
        bucket=web_bucket.id,
        source=FileAsset(index_html_path),
        content_type=mime_type
    )

    # can easily modify the ipset to accommodate multiple IP addresses
    ipset = aws.waf.IpSet("ipset",
                          ip_set_descriptors=[aws.waf.IpSetIpSetDescriptorArgs(
                              type="IPV4",
                              value=ip_address,
                          )])

    wafrule = aws.waf.Rule("wafrule",
                           metric_name="WAFRule",
                           predicates=[aws.waf.RulePredicateArgs(
                               data_id=ipset.id,
                               negated=False,
                               type="IPMatch",
                           )],
                           opts=pulumi.ResourceOptions(depends_on=[ipset]))

    waf_acl = aws.waf.WebAcl("wafAcl",
                             metric_name="WebACL",
                             default_action=aws.waf.WebAclDefaultActionArgs(
                                 type="BLOCK",
                             ),
                             rules=[aws.waf.WebAclRuleArgs(
                                 action=aws.waf.WebAclRuleActionArgs(
                                     type="ALLOW",
                                 ),
                                 priority=1,
                                 rule_id=wafrule.id,
                                 type="REGULAR",
                             )],
                             opts=pulumi.ResourceOptions(depends_on=[
                                 ipset,
                                 wafrule,
                             ]))

    oai = aws.cloudfront.OriginAccessIdentity(f"{bucket_name}_oai")

    s3_distribution = aws.cloudfront.Distribution(
        f'{bucket_name}_distribution',
        origins=[aws.cloudfront.DistributionOriginArgs(
            domain_name=web_bucket.bucket_regional_domain_name,
            origin_id=f's3{bucket_name}_origin',
            s3_origin_config=aws.cloudfront.DistributionOriginS3OriginConfigArgs(
                origin_access_identity=oai.cloudfront_access_identity_path,
            ),
        )],
        enabled=True,
        is_ipv6_enabled=False,
        default_root_object=index_html_path,
        aliases=aliases,
        web_acl_id=waf_acl.id,
        default_cache_behavior=aws.cloudfront.DistributionDefaultCacheBehaviorArgs(
            allowed_methods=[
                "DELETE",
                "GET",
                "HEAD",
                "OPTIONS",
                "PATCH",
                "POST",
                "PUT",
            ],
            cached_methods=[
                "GET",
                "HEAD",
            ],
            target_origin_id=f's3{bucket_name}_origin',
            forwarded_values=aws.cloudfront.DistributionDefaultCacheBehaviorForwardedValuesArgs(
                query_string=False,
                headers=["Origin"],
                cookies=aws.cloudfront.DistributionDefaultCacheBehaviorForwardedValuesCookiesArgs(
                    forward="none",
                ),
            ),
            viewer_protocol_policy="redirect-to-https",
        ),
        restrictions=aws.cloudfront.DistributionRestrictionsArgs(
            geo_restriction=aws.cloudfront.DistributionRestrictionsGeoRestrictionArgs(
                restriction_type="none",
            ),
        ),
        viewer_certificate=aws.cloudfront.DistributionViewerCertificateArgs(
            acm_certificate_arn=certificate_arn,
            ssl_support_method='sni-only'
        ))
    # you might have to wrap oai.iam_arn in an f-string on the first run and then re-run with the original way.
    # pulumi is a bit finicky
    source = aws.iam.get_policy_document(statements=[
        aws.iam.GetPolicyDocumentStatementArgs(
            actions=["s3:GetObject"],
            resources=[f"arn:aws:s3:::{bucket_name}/*"],
            principals=[
                aws.iam.GetPolicyDocumentStatementPrincipalArgs(
                    type='AWS',
                    identifiers=[oai.iam_arn]
                ),
            ]
        ),
        aws.iam.GetPolicyDocumentStatementArgs(
            actions=["s3:GetObject"],
            resources=[f"arn:aws:s3:::{bucket_name}/*"],
            principals=[
                aws.iam.GetPolicyDocumentStatementPrincipalArgs(
                    type='*',
                    identifiers=['*']
                ),
            ],
            conditions=[
                aws.iam.GetPolicyDocumentStatementConditionArgs(
                    test='IpAddress',
                    variable="aws:SourceIp",
                    values=[f'{ip_address}']
                )
            ]
        )
    ])

    web_bucket_name = web_bucket.id
    bucket_policy = s3.BucketPolicy(f"{bucket_name}_bucket-policy",
                                    bucket=web_bucket_name,
                                    policy=source.json)

    route_53_record = aws.route53.Record(domain_name,
                                         zone_id=hosted_zone_id,
                                         name=domain_name,
                                         type="A",
                                         aliases=[aws.route53.RecordAliasArgs(
                                             name=s3_distribution.domain_name,
                                             zone_id=s3_distribution.hosted_zone_id,
                                             evaluate_target_health=False,
                                         )]
                                         )
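With a Pulumi project and stack already initialized, and main() invoked from the project's __main__.py with our own values (the standard Pulumi workflow, not anything specific to this script), deploying the resources is a single command.

$ pulumi up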
Using the above as inspiration, we can amend our buildspec.yml. Our CI/CD pipeline will now include automatically updating our documentation! (We need to make sure our full docs directory is part of our .gitignore). We simply need to run the following sequence of commands.
$ cd docs
$ make html
$ aws s3 cp docs/ s3://churn-model-ds-docs/docs --recursive
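As a hedged sketch only (the real buildspec.yml contains the rest of our build, and the phase layout here is an assumption), the documentation commands could be appended to the post_build phase like so.

version: 0.2
phases:
  post_build:
    commands:
      # rebuild the Sphinx docs and push them to the documentation bucket
      - cd docs
      - make html
      - aws s3 cp docs/ s3://churn-model-ds-docs/docs --recursive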
We can now visit our Route 53 domain to view our documentation! If we configured /docs/build/html/index.html as the root object, our main documentation page should appear at the home path.
We can use flasgger to automatically generate a documentation endpoint for our API. We need a particular style of docstring, which can be found in the example below and in the library's documentation. We simply need to pip install the library, import it, and then call swagger = Swagger(app) in our application. We can then visit http://localhost:5000/apidocs/ to see our documentation. If you're using Flask Talisman, you might have to make adjustments to the content security policy.
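As a minimal sketch (the route body and docstring fields here are illustrative assumptions, not the repo's exact code), wiring up flasgger and documenting an endpoint might look like this.

from flask import Flask, jsonify
from flasgger import Swagger

app = Flask(__name__)
swagger = Swagger(app)


@app.route('/predict', methods=['POST'])
def predict():
    """
    Returns a churn probability for the posted customer payload.
    ---
    parameters:
      - name: body
        in: body
        required: true
        schema:
          type: object
    responses:
      200:
        description: prediction payload
    """
    # illustrative response only; the real endpoint runs the trained model
    return jsonify({'prediction': 0.14, 'high_risk': 'no'})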
The README is a core component of every git repo. It's documentation for all developers who will work on the repo, including yourself. Knowing how to develop a strong README can admittedly be a little tricky. And, well, no one really likes to write documentation, right? We can make the process a bit less painful by creating a template we can follow repeatedly. Below is the README for our churn project. The general structure can be cribbed for other data science projects.
This repository creates a REST API, written in Flask, that predicts the probability a customer will churn, that is, cancel their subscription. This probability is generated from a machine learning model, also trained in this repo. In sum, the repository contains all the code necessary to build a churn prediction system.
Start by cloning the repository.
$ git clone git@github.com:micahmelling/applied_data_science.git
Spin up a local server to test the Flask application by adding the following boilerplate.
if __name__ == "__main__":
    app.run(debug=True)
You can then run a payload through the following endpoint: http://127.0.0.1:5000/predict.
Sample input payload:
{
"activity_score": 0.730119,
"propensity_score": 0.180766,
"profile_score_new": 22,
"completeness_score": 41.327,
"xp_points": 18.2108,
"profile_score": 20.0989,
"portfolio_score": 25.1467,
"mouse_movement": 0.5,
"average_stars": 1.5,
"ad_target_group": "level_4",
"marketing_message": "level_3",
"device_type": "level_3",
"all_star_group": "level_2",
"mouse_x": "level_5",
"coupon_code": "level_10",
"ad_engagement_group": "level_2",
"user_group": "level_3",
"browser_type": "level_2",
"email_code": "level_1",
"marketing_creative": "level_4",
"secondary_user_group": "level_11",
"promotion_category": "level_9",
"marketing_campaign": "level_8",
"mouse_y": "level_8",
"marketing_channel": "level_16",
"marketing_creative_sub": "level_1",
"site_level": "level_12",
"acquired_date": "2015-06-09",
"client_id": "1963820"
}
The model will return a payload like the following.
{
'prediction': 0.14,
'high_risk': 'no',
'response_time': 0.471,
'ltv': 0
}
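For example, assuming the sample input above has been saved to a local file named sample_payload.json (a hypothetical name) and the requests library is installed, we could exercise the endpoint like so.

import json

import requests

# load the sample input payload shown above and post it to the local predict endpoint
with open('sample_payload.json') as f:
    sample_payload = json.load(f)

response = requests.post('http://127.0.0.1:5000/predict', json=sample_payload)
print(response.json())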
The repo also comes with a Dockerfile that can be used to create a Docker image.
$ docker build -t churn .
$ docker run --rm -it -p 8000:8000 churn
The docker run command will spin up http://127.0.0.1:8000, and you can now hit the predict endpoint once again. Please note the port is different compared to using the default Flask server.
The app_settings.py file in root gives us the ability to update the models our API uses along with the option to update key global variables. These are values we don't intend to change frequently or that we want to run through, and be tested by, our CI/CD pipeline.
Our application also responds to a configuration. The configuration allows us to update straightforward values we might want to change somewhat frequently. For example, we might want to almost effortlessly update the percentage of requests that receives a holdout treatment. Our config values are housed in MySQL in churn_model.prod_config or churn_model.stage_config, depending on the environment. A config UI is available that allows us to easily add and update config values via the config-refresh endpoint in our API.
The modeling directory houses code for training models that predict churn. Kicking off train.py will train a new set of models, defined in config.py. Within the modeling directory, a subdirectory is created for each model run, all named with a model UID. This allows us to version models and keep records of every run. Each model subdirectory is uploaded to S3, so we can clear local versions once our model directory becomes cluttered.
Below is a rundown of all the files we might wish to adjust in the modeling directory.
A number of other files exist in the repo. Below are the most important ones.
A production release of new code or models can be accomplished by pushing to the main branch of the remote repository. This will kick off a CI/CD pipeline build that will test the code changes, release them to staging, and then release them to production upon manual approval.
Lucidchart is a powerful piece of software for creating diagrams of software systems. There is a robust free option, and it's remarkably intuitive. Below is a Lucidchart diagram for our production application. Lucidchart is easy enough that you could readily replicate what I have constructed.