
Chapter Introduction: Data Science Theories

This chapter will focus on applied theory. What does that term mean? It’s about using theoretical and conceptual tools to frame and solve problems. Theory can help us think through problems and guide our practical work.

The theory we will cover in this section may differ from what you will see in other theoretical discussions about data science. We will not discuss derivatives or integrals, nor will we cover Big O notation or P vs. NP. Rather, we will focus on concepts that will help you think about data science projects in a logical way and formulate feasible solutions. By the end of this chapter, you will be a practical Plato of data science ("The measure of a man is what he does with data" - Plato, probably).

Prediction vs. Inference

In general, we build models for one of two purposes: to predict (i.e. data science) or to understand (i.e. decision science). We discussed this idea in Chapter 2 and will continue it here. The two aims are not mutually exclusive, though some tradeoffs certainly exist. Before building a model, you should identify how much you care about each aim, which will also help you pinpoint whether you’re in the realm of data science or decision science.

If you care only about inference, a model like linear regression might be appropriate. However, if you want predictive performance, linear regression would likely be inferior to ridge or lasso regression; in practice, the latter two methodologies tend to outperform ordinary least squares (OLS) linear regression. If you are mostly concerned with predictive performance, you should determine how much intuition you are willing to surrender. Fortunately, there are wonderful methodologies for explaining even complex models. Nowadays, our sacrifice mostly comes in the form of the features we consider and some of the models we might use. For the feature side of the equation, see the Data Science vs. Decision Science section in Chapter 2. In short, the features we want for general understanding might very well be different from what we want for sheer predictive power. On the model side, if we only care about predictive performance, we shouldn't even consider plain OLS linear regression or single decision trees. However, if we desire only understanding, those models might be on the table.
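
To make the tradeoff concrete, here is a minimal sketch comparing plain OLS to ridge and lasso on out-of-sample performance. The data is synthetic and the alpha values are placeholders, not a recommendation; the point is simply the comparison itself.

```python
# A minimal sketch: compare OLS, ridge, and lasso on a synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data with more features than are truly informative
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=42)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # placeholder alpha, tune in practice
    "Lasso": Lasso(alpha=0.1),   # placeholder alpha, tune in practice
}

# Compare out-of-sample predictive performance via cross-validated R^2
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```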

A relevant situation is building machine learning models to award loans. If the model says to reject John Doe for a loan, we need to supply a reasonable and understandable explanation. In such a scenario, the goal is to build a model that is as predictive as possible, constrained by the fact that results still need to be explainable. Some problems don't demand much model intuition, though. If a deep neural net can tell the difference between a picture of a cat and a picture of a dog, do we need to know the reason? Most likely not, though that may not always be the case. Given modern advances in model interpretability, we can often illuminate the workings of even very complex models. When people talk about "black box" models these days, it's likely they have accepted old-school beliefs about machine learning models and have not researched modern interpretability methods. A model that might have been a "black box" 5 years ago likely no longer falls under that category, given proper application of new interpretability methods. Even if we aren't under pressure by an outside source to explain our model, doing so is often a good idea. For one, it helps us check whether our model is cheating - that is, using a feature in an unrealistic way.
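
As one example of those interpretability methods, here is a minimal sketch of permutation importance on a synthetic dataset. The data and model are stand-ins, and plenty of other techniques (SHAP values, partial dependence plots, and so on) would serve the same purpose.

```python
# A minimal sketch of one interpretability technique: permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in performance;
# a suspiciously dominant feature can be a sign the model is "cheating".
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: importance = {result.importances_mean[idx]:.4f}")
```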

Many problems fall between needing little transparency (e.g. predicting photos of cats vs. dogs) and requiring plain-as-day, simple explanations for predictions. When human beings will use the model, some level of explanation is often useful to engender trust. For example, if a machine learning model flags patients likely to return to the hospital within 30 days, the doctors using the model will rightly want to have some level of understanding about how the model makes decisions.

Correlation vs. Causation (Model-Free vs. Model-Based Approaches)

Time to take on a cliché. Everyone has heard the adage “correlation is not causation.” This truism is, clearly, important to remember. Traditional statistics tends to avoid causation and advocates a model-free approach; that is, let the data guide you. In his popular and somewhat controversial book, “The Book of Why”, Judea Pearl confronts this approach head-on. Using do-calculus and causal diagrams, he advocates for a model-based approach that can tease out causation. (Mr. Pearl would certainly be disappointed, but such topics will not be covered in this book.) He argues that without knowledge of causation, AI agents can never become humanlike or even superhuman. Intuitively, that makes sense to me.

Where does data science tend to fall on the model-free vs. model-based spectrum? I think it’s somewhere in the middle (sure, sure...typical non-committal answer). In data science, we mostly care about predictive power, though constraints can exist (see the previous section). This often gives us the freedom to include a large number of features, use automatic feature selection tactics, and let the model find what’s predictive. For better or for worse, data science libraries allow us to build highly predictive models with hardly any knowledge of the underlying problem we are attempting to solve. For better, such a situation allows us to get workable solutions up and running quickly. For worse, building models with limited knowledge about the overarching situation can lead to significant issues, such as inducing feature leakage, a topic we will cover later in the book.

A full-on model-based approach would involve carefully selecting the features that go into the model. No features would be included unless we have a strong theoretical reason for their inclusion. Conversely, a full-on model-free approach would involve throwing all features into a model without even knowing what they are. The former could be prohibitively time-consuming if we have hundreds or thousands of features. The latter could be dangerous, as we might throw in a ton of junk features that confuse the model or allow it to cheat. Logically, we want to be somewhere in the middle. Demonstrating we understand our data by including or engineering specific features AND removing specific variables can enhance predictive power. That said, we can also let the algorithm and feature selection mechanisms do some heavy lifting. The "machine" may find high-order interactions or non-linearities we would never have considered on our own. We should allow the "data to speak" through the algorithm because that is expressly the goal of machine learning and statistical modeling. At the same time, we should apply our common sense and subject-matter expertise to add and remove features so the algorithm can focus on the most impactful variables. You (model-based) and your algorithm (model-free) should form a team. You'll be like Jordan and Pippen.
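
Here is a minimal sketch of that teamwork, using synthetic data and made-up column names: we hand-engineer one feature we have a theoretical reason to believe in, then let an automatic selector prune what carries little signal.

```python
# A minimal sketch of combining hand-engineered features (model-based instincts)
# with automatic feature selection (letting the data speak). All data and
# column names are synthetic stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "total_spend": rng.gamma(2.0, 100.0, n),
    "num_visits": rng.integers(1, 30, n),
    "junk_feature": rng.normal(size=n),  # noise the selector should drop
})

# Model-based: engineer a feature we have a theoretical reason to include
df["spend_per_visit"] = df["total_spend"] / df["num_visits"]

# Synthetic target loosely tied to spend per visit
y = (df["spend_per_visit"] + rng.normal(0, 20, n) > df["spend_per_visit"].median()).astype(int)

# Model-free: let an algorithm prune features that carry little signal
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
selector.fit(df, y)
print("Features the algorithm kept:", list(df.columns[selector.get_support()]))
```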

The model-based portion of our work can serve as a valuable gut check. For example, if we have strong reason to believe the number of times emailing customer service is negatively correlated with the probability of remaining a customer, we can use some fairly simple techniques to help us determine if the model places value on this feature. If we find the model essentially ignores this feature, that indicates there might be a bug in the modeling code (e.g. some customers that churned were accidentally labeled as something else).
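
A minimal sketch of such a gut check might look like the following: plot the partial dependence of the prediction on the feature and confirm it moves in the expected direction. The data and the customer_service_emails column here are synthetic stand-ins for a real churn dataset.

```python
# A minimal sketch of a gut check: does a fitted model use the
# customer_service_emails feature in the direction we expect?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "customer_service_emails": rng.poisson(2, 1000),
    "tenure_months": rng.integers(1, 60, 1000),
})
# Synthetic target: more support emails -> more likely to churn
y = (X["customer_service_emails"] + rng.normal(0, 1, 1000) > 3).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The partial dependence curve should slope upward with email count; a flat
# line would suggest the model ignores the feature (possible data or label bug).
PartialDependenceDisplay.from_estimator(model, X, ["customer_service_emails"])
plt.show()
```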

The Master Algorithm and the No Free Lunch Theorem

In economics, the term “no free lunch” often relates to opportunity cost. Nothing is free. By performing one action, we forgo pursuing another. Though not the overarching concept in this section, this point is applicable to data science and deserves a small tangent. Per the data science pipeline, once the barebones functionality is in place, we have multiple areas in which we can improve our end product. If we choose to focus on trying new models, we forgo the opportunity to create new, potentially-predictive features or to optimize code for better performance in production. What’s the best route? That is up to the data scientist. The choice will never come without a cost.

OK, so what’s the No Free Lunch Theorem in machine learning? It essentially states that, if we make no assumptions about the data, we have no reason to prefer one model over another. Said a slightly different way, no (current) machine learning model works optimally for all problems. Practically, this means we should try a variety of different models. In general, experimentation is positive and required for maximizing predictive power.
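
In practice, that experimentation can be as simple as the following minimal sketch: benchmark a handful of model families with cross-validation. The models and synthetic dataset here are placeholders, not a prescribed shortlist.

```python
# A minimal sketch of the practical takeaway from the No Free Lunch theorem:
# benchmark several model families rather than assuming one will win.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=25, random_state=1)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=1),
    "k-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(),
}

# No single model is guaranteed to be best; measure and compare.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```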

But what if there was one single algorithm that worked best for all problems? At the time of writing this paragraph, no such algorithm exists (or is public knowledge). Thinking about the possibility of such an algorithm is a worthwhile thought experiment. In his book “The Master Algorithm”, Pedro Domingos discusses how the major areas of machine learning could be combined to form a super algorithm, capable of handling any and every problem. Domingos labels the major areas of machine learning as the “Five Tribes of Machine Learning.”

  • Symbolists: decision trees
  • Connectionists: neural networks
  • Evolutionaries: genetic algorithms
  • Bayesians: Bayes' theorem
  • Analogizers: support vector machines

We won’t cover these “tribes” here, but I encourage you to research them on your own. However, the broader point is this: effective data science involves trying different modeling techniques.

Direct vs. Indirect Effects

One of the beauties of many machine learning models is that they can capture complex interactions among features. Many models are designed to find such interactions and indirect effects, but we can also perform feature engineering to help guide our models. In this vein, thinking in terms of direct and indirect effects can be beneficial.

As the name indicates, a direct effect occurs when Variable A is directly correlated with Variable B. Most everyone understands this concept. For example, eating lots of sugary foods and not exercising will likely lead to health problems. Don't conduct this experiment to verify this direct effect ;)

As expected, indirect effects are more complex. Since indirect effects are prevalent and important, helping our machine learning models uncover them is often worthwhile, though it can be challenging. We might specifically create interaction features by combining multiple features into one, based on our human intuition of what will be predictive. We should also include proxy variables where needed. For example, we might want to predict how many runs will be scored in a given baseball game. In addition to using rolling player stats, the starting pitchers, and the ballpark, we might also want to factor in some important intangibles, such as player health or energy levels. However, unless we work for a major league team, we likely won’t have access to such features. Luckily, we can access useful proxy variables, such as the month, the rolling game number in the season, the rolling aggregate number of games played by the starters, and the weather. All of these, among others, would be proxies for the health and energy of players.
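
Here is a minimal sketch of that kind of feature engineering for the baseball example. The columns and values are made-up stand-ins for a real game-level dataset.

```python
# A minimal sketch of interaction and proxy features; data is a made-up stand-in.
import pandas as pd

games = pd.DataFrame({
    "game_date": pd.to_datetime(["2023-04-01", "2023-06-15", "2023-09-20"]),
    "season": [2023, 2023, 2023],
    "starter_era": [3.10, 4.55, 2.80],
    "park_run_factor": [1.05, 0.97, 1.12],
})

# Interaction feature: combine two signals we believe matter together
games["pitcher_vs_park"] = games["starter_era"] * games["park_run_factor"]

# Proxy features for player health/energy, which we cannot observe directly
games["month"] = games["game_date"].dt.month
games = games.sort_values("game_date")
games["games_into_season"] = games.groupby("season").cumcount() + 1

print(games)
```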

Why do we discuss this topic? Being aware of indirect effects is helpful in guiding our feature engineering and understanding how our models might behave.

Complexity

Is your problem difficult or complex? What’s the difference? The terms might appear to be synonymous, but they’re certainly not.

A good example of a difficult problem is image recognition. At the most basic level, you’ll need to collect training data that represents the images you’ll encounter in a production setting, and you’ll also need to find the best type of neural network architecture. You might even need to employ cloud computing resources or a technology like Spark so training finishes in a reasonable time. All of that combines to make for a clear challenge. However, the problem is not necessarily complex. The steps are fairly well-defined, and with a large enough set of acceptable images, we can expect solid performance.

By contrast, the steps to solving a complex problem are less defined. In this space, we will deal with substantial noise and observations that might seem contradictory. Fundamentally, the environment is less controlled and more unpredictable. Predicting human behavior often falls under this category. In many cases, we will struggle to collect adequate data to explain human decision making. For example, Jane and Joy look nearly identical from the data we have on them. Jane bought our product while Joy didn’t. Why? It could be because Jane ate breakfast, and Joy didn’t. It might be because Joy lost her job yesterday. It could be because Jane found $50 on the sidewalk and wanted to spend some money. A large number of factors contribute to human behavior. Even when we have large amounts of data (notice I didn’t say a certain infamous term), many of our features will often be abstractions to some degree. Beyond that, humans have an unpredictable streak that can never be captured by data. In a complex environment, given the data we are able to collect, there might be no true function that separates the classes well.

Knowing if you’re in a complex or difficult environment will help set your expectations and tactical execution. If you’re in a complex environment, a cap on model performance likely exists. After a certain point, the data you would need for more predictive power might be impossible to collect...or too invasive or expensive to obtain. Granted, caps on model performance also exist in difficult environments, but the threshold is typically higher, all else held constant. Likewise, we tend to run into fewer issues finding predictive data in a difficult environment. If we want to predict cats vs. dogs, the pixels in the image will be predictive.

Behavioral Economics and Cognitive Bias

During data science interviews, I ask a variety of technical and non-technical questions; one that falls into the latter category is “why are you pursuing data science as a career?” Probably my favorite answer was something to the effect of “because it ties into my fascination with behavioral economics.” The candidate proceeded to discuss “Thinking, Fast and Slow” by Daniel Kahneman, an excellent book that I highly recommend.

I think the connection between behavioral economics and data science is profound. At its core, behavioral economics teaches us that humans are fallible. We’re subject to biases, rely too heavily on heuristics, and often do not act rationally. Machine learning and statistical modeling can play a useful role in helping to correct some of these faults. Now, humans can and do inject bias into models, so data science isn’t a perfect solution, though it is a step in the right direction in many respects.

A lot of people claim machine learning is a black box. I mostly don’t think that’s true. Models serve the function of clarifying assumptions and providing transparency into our approach. When we feed data into a model, we’re essentially saying “these are the features I think have a chance at predicting the outcome of interest”. Likewise, when we select a model, we’re saying “these are the mathematics and statistics I am going to use to make predictions.” Many models use approaches that can be intuitively explained. Likewise, even with complex deep learning models, we can take steps to identify which features the model considers most important. Regardless, our data and general approach are clarified in our code.

Even if a model is not fully transparent, it is consistent. Barring a bug in the code, it will make the same decision given the same data (and same random seed). The same cannot be said for humans. The classic example is judges handing out harsher sentences based on time of day. Again, humans are fallible, and statistical models help us to clarify our assumptions and even put our heuristics in code for all to see.
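
A minimal sketch of that consistency, using a synthetic dataset: fit the same model twice on the same data with the same seed and confirm the predictions match.

```python
# A minimal sketch: identical data, code, and seed yield identical decisions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)

preds_a = RandomForestClassifier(random_state=7).fit(X, y).predict(X)
preds_b = RandomForestClassifier(random_state=7).fit(X, y).predict(X)

# Unlike human judgment, the model does not change its mind between runs
print(np.array_equal(preds_a, preds_b))  # True
```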
