Thoughtful application of machine learning can impact the world in amazingly positive ways. The breadth and depth of data in our world today, combined with skilled data scientists and advanced analytics tools, can serve to improve business, enhance organizations, and change lives around the globe in ways we never could have imagined 20 years ago. Moreover, it will become increasingly important that students, workers, and managers become familiar with machine learning topics: not only their function in the world of data science, but also how they can be applied to all kinds of complicated environments in which data is present and collectible.
Industry 4.0 is on the horizon, if it hasn’t already begun. Business students like myself learned in strategy courses that operational effectiveness is necessary but not sufficient for developing a competitive advantage. That means that firms must develop great internal productivity to even have a chance at competing, let alone leading the market. In order to develop operational effectiveness in a time of increasing digital transformation, businesses and individuals alike must learn to adapt to new ways of thinking about the integration of machine learning in their organizations.
In that line of thinking, today I’m going to focus on a use-case for custom-built machine learning that even the smallest of businesses, equipped with enough data capacity, can leverage to enable their own digital transformation.
As usual, we start by setting up our workspace. I’ve explained the importance of Python and its open-source modules previously, but this time we are introducing new libraries and packages to aid our analysis, synthesis, and model-building:
There are two main areas in which we will need to import packages: (1) basic libraries for data manipulation, plotting, and correlation analysis, and (2) libraries for building, assessing, and regularizing our eventual models.
The first of these are common amongst Python users: Pandas, Matplotlib, Seaborn, and SciPy. MissingNo is a lesser-known package that allows users to quickly check for missing values in data sets.
The second half of these are more complex and require a bit of explanation, though I will keep things brief and link to documentation for those of you who are interested in the fine tuning of each library. First, most everything in this half comes from scikit-learn, a generalized machine learning library that features classification, regression, and clustering algorithms that we pull from here. We’re using the following:
- model_selection to split our training and testing sizes for validation
- DummyClassifier to establish accuracy baselines
- metrics to assess the recall, precision, and accuracy scores of our classification models, as well as the ability to plot our eventual confusion matrix
- ensemble to import the needed methods for building a variety of classification models to compare to one another, including Bagging, Random Forest, AdaBoost, Gradient Boosted Trees, and a soft voting machine ensemble
The only exception here is XGBoost, a separate library that we are using as one more model-building method to compare against the others.
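As a minimal sketch of the setup described above, the snippet below first checks that each library is installed and only then imports it. The exact import list is my assumption, pieced together from the libraries named in this section, and `missing_packages` is a hypothetical helper rather than part of any library.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above (scikit-learn imports as "sklearn").
REQUIRED = ["pandas", "matplotlib", "seaborn", "scipy", "missingno", "sklearn", "xgboost"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    # (1) Data manipulation, plotting, and correlation analysis
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import missingno as msno
    from scipy.stats import pearsonr

    # (2) Model building and assessment
    from sklearn.model_selection import train_test_split
    from sklearn.dummy import DummyClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
    from sklearn.ensemble import (
        AdaBoostClassifier,
        BaggingClassifier,
        GradientBoostingClassifier,
        RandomForestClassifier,
        VotingClassifier,
    )
    from xgboost import XGBClassifier

    print("All libraries imported.")
```

In a notebook you would normally just write the imports directly; the check is only there so the sketch fails gracefully when a package is absent.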
Now that we have our libraries imported, let’s jump into the data set.
As mentioned earlier, we are seeking to understand how machine learning can be applied to business cases. To gather an applicable data set, we can look no further than the UCI Machine Learning Repository, which features hundreds of high-quality data sets across a number of types, areas, and tasks. In particular, today I’m going to be unpacking the Bank Marketing Data Set, which features 21 variables across 41,188 instances of direct marketing campaigns at a Portuguese banking institution. The time frame for this data is from May 2008 to November 2010, though we don’t have any time-series data. This is a classification task in which we will seek to predict whether or not a client we engage with will subscribe to a term deposit. The data looks a little bit like this:
We can see all our variables and some samples across the first five rows. Of course, there is a lot more variation in the actual set. If you’re interested in what each variable means, I’ll link the data set here for the sake of keeping this article brief.
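For readers following along, loading the file might look like the sketch below. I am assuming the semicolon-delimited `bank-additional-full.csv` file from the UCI download; the file name, the delimiter, and the `load_bank_data` helper are assumptions on my part, and the in-memory sample only mimics the layout so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read a semicolon-delimited bank marketing file into a DataFrame."""
    return pd.read_csv(source, sep=";")


# Tiny stand-in with the same layout, so the sketch runs without the download.
sample = io.StringIO(
    "age;job;marital;y\n"
    "56;housemaid;married;no\n"
    "37;services;married;yes\n"
)
df = load_bank_data(sample)
print(df.head())

# With the real file on disk (assumed name):
# df = load_bank_data("bank-additional-full.csv")
```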
With our data in hand, I think we’re ready to jump into our exploratory data analysis!
Validation and Exploration
Data Quality Assurance
The first step of high quality exploration is to determine whether our data set is complete, what kinds of variables we have, and how we can manipulate the answers to those questions to best suit our application of machine learning.
First, let’s make sure we have all our data in order. We’ll use MissingNo for this task:
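A rough, non-visual equivalent of that check is sketched below using plain Pandas on a toy frame that deliberately contains one gap. One caveat that I believe applies here, though please treat it as an assumption: the data set's documentation describes some missing categorical values as being coded with the label "unknown", which a null check will not flag.

```python
import pandas as pd

# Toy stand-in frame; in the article this would be the full bank marketing DataFrame.
df = pd.DataFrame(
    {
        "age": [56, 37, None, 40],
        "job": ["housemaid", "unknown", "services", "admin."],
        "y": ["no", "no", "yes", "no"],
    }
)

# Count true nulls per column (the same information a MissingNo chart visualizes).
null_counts = df.isna().sum()
print(null_counts)

# Assumption: placeholder labels such as "unknown" stand in for missing values,
# so count those separately.
unknown_counts = df.eq("unknown").sum()
print(unknown_counts)

# With MissingNo installed, the visual version is msno.matrix(df) or msno.bar(df).
```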
It seems like all our data is here: 41,188 values confirmed with no variables having blank or null records. This is lucky for our sake, as missing data always presents a problem in building machine learning models. 100% accurate data collection may not always be possible in a business environment, but for the sake of our application today this is enough. Let’s proceed to look closer at the type of each of the 21 variables present:
It seems that some of our categorical variables (e.g. marital status, education, job type) are getting classified as individual objects. We won’t be able to use object classes in classification with our planned methods for a number of reasons. We need to switch them to categories using Pandas:
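One way to do that conversion is sketched here on a small stand-in frame. Selecting every non-numeric column is my assumption about which columns need converting; the original code may have listed them by name.

```python
import pandas as pd

# Small stand-in frame with two text columns and one numeric column.
df = pd.DataFrame(
    {
        "age": [56, 37, 40],
        "job": ["housemaid", "services", "admin."],
        "marital": ["married", "married", "single"],
    }
)

print(df.dtypes)  # the text columns are not yet categorical

# Convert every non-numeric column to the pandas category dtype.
text_columns = df.select_dtypes(exclude="number").columns
df = df.astype({column: "category" for column in text_columns})

print(df.dtypes)
```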
Now that we’ve got everything squared away, we can begin further exploration of the data set.
Exploratory Data Analysis
Let’s begin by looking at the categorical variables that describe some of our potential clients and the engagements we’ve had with them so far using some standard countplot bar graphs:
This is a lot of great information about our previous activities and who we talk with. Right out of the gate, it seems like the primary components of our client base are administrative and blue-collar workers (with technicians not far behind) that have university or high-school degrees. There is more variation in education than job type, but this is interesting as it may aid other marketing efforts in terms of how we talk with them, present ourselves, and expectations when doing business.
The other thing we notice is the surprising variation in the timing of our calls, not in terms of the days of the week (which are relatively uniform), but in the months in which we are calling people. A shocking amount of our marketing activity takes place in May and the summer, with very little to none in the winter. This may be company policy, or it may be because our salespeople have more free time in this season for some reason. We cannot be sure because the data set does not tell us, though a business would normally have the upper hand in analyzing this type of data because it can attribute certain historical results to company decisions.
One consideration to keep in mind is that we will have to be wary of bias from our categorical variables, ensuring that they do not exert undue influence on our model’s feature importances. For instance, just because the largest number of marketing activities take place in May doesn’t mean that May is the most effective month to be selling.
Moving forward to our numerical variables (such as the age of the client and various macroeconomic indicators), it is important we try to detect not only the interactions in the data but their correlations as well. As mentioned in previous articles, the goal of correlation analysis is to detect features that affect the output too much, or features that carry the same information. In general, this usually happens when the explored features have a linear relationship.
To create this visualization, we use Seaborn’s pairplot function, which plots the relationships between all quantitative features, coupled with a Pearson correlation coefficient function. Here is how that looks:
We can note several features that have a linear relationship. It is apparent that multiple features will have to be dropped in order to get better results with our solutions. The following are of concern:
- Macroeconomic indicators: it would seem our data is a great validator for the theory that macroeconomic forces are interrelated, as all those present in our case have a very strong positive correlation (r > 0.9). These include the employment variation rate, consumer price and confidence indices, the Euro Interbank Offered Rate, and market employment count.
- Previous campaigns: the only other correlation we have in our pairplot is the number of previous campaigns tied with the macroeconomic indicators above, resulting in a moderate negative correlation (r < -0.45). It seems that the bank may have made prior decisions of number of campaigns based on those indicators.
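To make the correlation check concrete, here is a minimal sketch that lists every pair of numeric features whose Pearson coefficient exceeds a threshold in absolute value. The `correlated_pairs` helper and the synthetic data are hypothetical; the 0.45 default simply mirrors the moderate correlation mentioned above.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.45):
    """List (feature_a, feature_b, r) for numeric pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Synthetic stand-in: "b" is nearly a linear function of "a"; "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=500),
        "c": rng.normal(size=500),
    }
)
print(correlated_pairs(demo))
```

On the real data, the same call on the full DataFrame would surface the macroeconomic pairs discussed above without reading them off the plot by eye.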
Deciding which features to drop comes down to the subjective views of the data scientist at hand. In my view, since previous campaigns is moderately correlated with those indicators and is something that is somewhat within the long-term decision-making power of the business, I will choose to keep it.
However, this decision is also offset by my removal of some economic indicators, which I will choose to narrow down. Considering some of these indicators have different frequencies (daily, monthly, quarterly), I will drop those with the longer quarterly frequencies as they will not feed into day-to-day decision-making of the firm that much. Namely, I will drop the features for the market employment count and the employment variation rate. We’re now ready to jump into building our initial classification models.
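Dropping those two features is a one-liner in Pandas. The column names below (`emp.var.rate`, `nr.employed`, and so on) are what I understand the UCI file to use, so treat them as an assumption, and the values are only illustrative.

```python
import pandas as pd

# Stand-in frame holding just the macroeconomic columns (illustrative values).
df = pd.DataFrame(
    {
        "emp.var.rate": [1.1, 1.4],
        "cons.price.idx": [93.994, 93.918],
        "cons.conf.idx": [-36.4, -42.7],
        "euribor3m": [4.857, 4.962],
        "nr.employed": [5191.0, 5228.1],
    }
)

# Drop the two quarterly indicators discussed above.
df = df.drop(columns=["emp.var.rate", "nr.employed"])
print(list(df.columns))
```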
Model Building and Assessment
Considering that model-building is relatively standardized across the programming-side of things, I won’t bore you with 10 different screenshots of my programming. Instead, I’ll go into the models I chose, their individual results, the means by which I compared them, and the ultimate model that prevailed from them.
How to Evaluate Machine Learning Models
Firstly, it is important to distinguish between the metrics we use to assess models because of data balance, specifically when using metrics such as accuracy or predicted positive condition. These metrics can be misleading for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives an accuracy score of 0.95, which, while technically accurate, describes a model that never identifies a single positive case.
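The arithmetic behind that example is easy to verify by hand. This short sketch, written in plain Python so nothing is hidden, scores an always-negative "model" on the 95/5 sample.

```python
# 95 negatives (0) and 5 positives (1), as in the example above.
y_true = [0] * 95 + [1] * 5
# A "model" that predicts negative for everyone.
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that the model found.
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = positives_found / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```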
There are many metrics that don’t suffer from this problem, specifically precision and recall. In machine learning, precision is the fraction of positive classifications that are actually correct. Recall is the proportion of actual positives identified correctly. The two sound similar, but they answer different questions.
To use an example, let’s say we are trying to distinguish between A and B. Precision asks the question, out of those predicted to be A, how many were actually A and not B? Recall asks, out of those that were actually A, how many were predicted to be A and not B?
To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa, though it is important to be concerned with both in evaluating the models at hand.
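The A-versus-B framing above translates directly into code. The following is a minimal sketch with made-up labels; `precision_recall` is a hypothetical helper written for illustration (scikit-learn's `precision_score` and `recall_score` do the same job on real data).

```python
def precision_recall(y_true, y_pred, positive="A"):
    """Compute precision and recall for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # predicted A, truly A
    fp = sum(t != positive and p == positive for t, p in pairs)  # predicted A, truly B
    fn = sum(t == positive and p != positive for t, p in pairs)  # predicted B, truly A
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "A", "B", "B", "B", "B", "B"]

precision, recall = precision_recall(y_true, y_pred)
print(precision)  # 2 of the 3 items predicted A were truly A
print(recall)     # 2 of the 4 true A items were found
```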
Comparison and Evaluation
With these metrics in mind, and looking at accuracy as well, we can begin evaluating the models I have built using the following classification methods:
- DummyClassifier (DC)
- Decision Tree (DT)
- Bootstrapped Aggregation, AKA bagging (BA)
- Random Forest (RFC)
- AdaBoost (AB)
- Gradient Boosted Trees (GBT)
- XGBoost (XGB)
- Voting Classifier Ensemble (VCE)
The outcome of building and fine-tuning each of these models is below.
The reason we run a panel of models like above is so we can compare them in quick fashion. Those newer to machine learning will point out that not all the parameters are the same from model to model, but this is because the effect that parameters have on each method will help or hinder its optimization. In this case, there was no one-size-fits-all array of parameters, so I fine-tuned each one to a certain degree to optimize its performance.
As we can see, the key metrics of the models generally increase as we move from simpler methods (such as the DummyClassifier baseline) to more complicated ensemble methods (AB, VCE, XGB). There are some outliers, like Random Forest, which has high precision but correspondingly low recall and lower overall accuracy.
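For those who do want to see the shape of the code, a panel like this can be sketched in a few dozen lines. Everything specific here is an assumption: the data is synthetic, the hyperparameters are near-defaults rather than my tuned values, and XGBoost is left out so the snippet depends only on scikit-learn (an `XGBClassifier` from the xgboost package could be added to the dictionary in the same way).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (roughly 10% positives).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "DC": DummyClassifier(strategy="most_frequent"),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "BA": BaggingClassifier(n_estimators=50, random_state=0),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=0),
    "AB": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GBT": GradientBoostingClassifier(n_estimators=50, random_state=0),
}
# Soft voting averages the predicted probabilities of its member models.
models["VCE"] = VotingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ab", AdaBoostClassifier(n_estimators=50, random_state=0)),
        ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Fit each model and record the three metrics on the held-out test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Note how the DummyClassifier baseline can post a high accuracy while its recall is zero, which is exactly the imbalance trap described earlier.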
XGBoost vs. Voting
Though most of these models are relatively close in accuracy, the top two methods are clearly the Voting Classifier Ensemble and XGBoost judging by their combined recall, precision, and accuracy. However, we should consider each model in terms of its overall strengths and weaknesses, not just the quantitative metrics.
In this case, XGB is still the favorite for a few reasons. First, it is incredibly lightweight and flexible, which is good if we are running many predictive classifications, as well as if we have to change the parameters later on. It also means that we can optimize it very quickly by using nested for loops. It is also much easier to regularize than the VCE, which is important in order to avoid overfitting the model to the training set: since the VCE has multiple types of ensemble methods within it, their combination is not easily regularized. Finally, we can run cross-validation after each iteration, and ultimately, if we are missing data like in real-world environments, we can still use XGB because it is designed to handle null values.
This case still stands when looking at the metrics themselves, as XGB has higher precision but lower recall, something that we want in this case as we are trying to limit the scope (recall) of our marketing activities while emphasizing our turnover (precision) of engaging clients who will subscribe. This becomes clear when we look at the confusion matrix below. There are no downsides for false negatives (predicting people will not subscribe, then having them subscribe), but we can’t afford to waste our callers’ time on people who we expect will subscribe and then don’t (false positives). We have been able to prune out those false positives and focus on true positives and true negatives, which will reduce our time spent calling, improve productivity, and cut extraneous marketing costs.
For this reason, I chose to focus on optimizing the XGB model and aimed to maximize the important metrics mentioned prior. While I was only able to increase accuracy by about 0.14%, it still shows that it is possible to optimize even further when a singular model is chosen.
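The nested-loop optimization mentioned above can be sketched as follows. To keep the snippet dependent on scikit-learn alone, it uses `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (both accept `max_depth` and `learning_rate`); the grid values, the synthetic data, and the `nested_loop_search` helper are all assumptions for illustration, not the settings I actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def nested_loop_search(make_model, depths, learning_rates, X, y, scoring="precision"):
    """Try every (max_depth, learning_rate) pair and keep the best mean CV score."""
    best_params, best_score = None, float("-inf")
    for depth in depths:
        for rate in learning_rates:
            model = make_model(max_depth=depth, learning_rate=rate)
            # 3-fold cross-validation guards against tuning to a single split.
            score = cross_val_score(model, X, y, cv=3, scoring=scoring).mean()
            if score > best_score:
                best_params = {"max_depth": depth, "learning_rate": rate}
                best_score = score
    return best_params, best_score


def make_model(**params):
    # Stand-in for xgboost.XGBClassifier, which takes the same two parameter names.
    return GradientBoostingClassifier(n_estimators=30, random_state=0, **params)


# Synthetic, mildly imbalanced data in place of the bank data set.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8], random_state=0)

best_params, best_score = nested_loop_search(make_model, [2, 3], [0.05, 0.1], X, y)
print(best_params, round(best_score, 3))
```

Scoring on precision reflects the priority argued for above; swapping in `scoring="recall"` or `"accuracy"` changes what the search optimizes.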
To make matters applicable to humans on a day-to-day basis, we can also develop a visual decision tree that can seek to guide salespeople and managers on which aspects of a potential client are most important to know to predict probability of subscription:
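As a text-only illustration of how such a tree can be produced, the sketch below fits a shallow scikit-learn decision tree and prints its rules. This is not necessarily how the figure in this article was generated: the data is synthetic, the labels follow a made-up rule, and the feature names are borrowed from the discussion purely for readability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; names are illustrative, values are random.
feature_names = ["campaign", "euribor3m", "cons.conf.idx"]
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
# Made-up labeling rule so the tree has something to learn (not a real finding).
y = (X[:, 0] < 3).astype(int)

# A shallow tree stays small enough for a person to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

For a graphical version, `sklearn.tree.plot_tree` draws the same structure with Matplotlib.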
It becomes clear that in predicting whether or not someone will sign up for our services, the key indicators to consider are the following:
- Length of the campaign: a campaign can simply run too long. If someone does not sign up within a certain time period, it is unlikely that they will ever sign up, even if times are good. This is a classic case of over-marketing and sunk costs.
- Three-month Euro Interbank Offered Rate: interest rates affect the health and welfare of many businesses, and can put economic pressure to cut costs and reduce new subscriptions on administrative and technical personnel.
- Consumer confidence: if consumers are not confident in their expected financial situation, it creates additional economic pressure on businesses and reasons to cut subscription costs. Conversely, if times are good, businesses may create long-term investments.
These are just a few examples of how the variables of this data set can be interpreted to give guidance to all those across the organization involved in the selling and service of subscriptions at our focal firm.
Machine learning is applicable to all business cases in which data is present, and is increasingly necessary for building competitive advantage. Finance, marketing, sales, HR, and managerial teams are primed to experience an influx of digital transformation, fueled by open-source software, user-friendly programming languages, and the integration of data science into every organization. In order to compete, firms must invest in resources that identify and capitalize on their incoming and outgoing data. This includes hiring professional data scientists, financing digital transformation, and, of course, building machine learning algorithms like we have just built today. It’s sink or swim for businesses, and those who don’t get ahead of the curve now will face consequences in just the next five years.