My first job out of college was as a junior analyst for Edelman, a large marketing and communications agency in Washington, DC. I remember getting home late in the evenings and working through data science tutorials until I couldn’t keep my eyes open. I was striving to grasp the scope of data science and simultaneously understand how to be a competent practitioner. The materials I found were useful, but individually they did not help me understand and apply end-to-end data science.
I wrote this book thinking about my 22-year-old self. I was searching for a single resource that could help me apply all facets of data science. This represents my best effort at creating the tool I was hoping to find at that juncture in my life.
This is a technical book. You will see a lot of code. You will come across many technical terms. However, I submit this book is applicable to multiple audiences.
Is this a handbook or the next great American novel? It’s certainly closer to the former than the latter. That said, this book has a narrative structure. The chapters are ordered sequentially, starting with a discussion about key concepts in data science and proceeding through all the steps of building a production application.
My recommendation is to read this book from start to finish. My goal is to connect the dots among the various components of data science. For instance, how we tackle problems in the model development section is related to known challenges we will face in the model deployment phase. After you have read the book end-to-end, you can use it as a handy reference tool. I include a large amount of Python code and step-by-step instructions. If you’re foggy on, say, constructing a pipeline in scikit-learn, you can jog your memory using this book. That said, you can read chapters individually if desired. Each chapter tackles a distinct topic, so this book could also be used solely as a reference handbook.
I also want to note that this book is written from a given viewpoint: my own. I do not want this to be a dry manual. Rather, I want to share my experiences and the lessons I’ve learned. I’ve been fortunate to have been exposed to multiple corners of analytics, and I hope my viewpoints can help you on your own data science journey. Throughout this book, I’ll attempt to tackle some data science clichés, too. I’m a bit of a data science hipster; I want to stray away from the traditional and predictable. (And yes, I do wear glasses and grow out my beard on occasion.)
The writing tends to be geared to people with data science or development experience, as I often gloss over building blocks (e.g., that [1, 2, 3] is a Python list, which is different from a dictionary or set). That said, I do provide some high-level explanations where they might be useful or necessary for the less-technically-inclined reader. I hope these explanations fill in any gaps in grasping major concepts.
Lastly, this book includes a large number of code snippets. The official GitHub repo for the book can be found at the following link: https://github.com/micahmelling/applied_data_science. However, the book includes many tutorials and examples not found in the repo. The repo simply presents the main components.
Many books have an overarching example, such as working on a consulting project for company XYZ to predict ABC. Though predictable, this approach is pedagogically beneficial. In this book, we’ll build a model to predict if someone will cancel their subscription to a movie review website. Don't worry, be happy: more details about the data will be presented in the following chapters. That said, building a model to predict customer churn is a canonical data science project. I'd venture a guess that a large percentage of data science shops across the world perform customer churn modeling as one of their core services. Likewise, if you search "churn" on the data science website kaggle.com, you'll see scores of data sets, discussions, and code samples surrounding the topic. As a nice bonus, if you Google something like "machine learning churn model", you'll get more than 3.8 million results. This is a ubiquitous yet impactful application of data science.
Nothing in this book is confidential or from a proprietary source. A citations section is included at the end of each chapter. The content and examples in this book are broadly applicable and not coupled to any particular use case or environment. What I hope separates this book from other resources is that it brings together a large number of data science topics. If you search any individual topic in this book, you will find relevant theories, resources, examples, and discussions. However, I believe you will be hard-pressed to find a resource that brings together such a large collection of data science ideas and examples. My work is inspired by the open-source work of others. In today's world, data is proprietary. Due to the open-source movement, algorithms and methodologies are less so. With a tool like Python and access to the Internet, pretty much anybody can implement cutting-edge models and follow best practices.