Exploring Data in Python

Truman Fritz
14 min read · Mar 8, 2021

Our world is covered in data. We’ve known this for years, and students across the country like myself have looked to our schools to teach us how to make sense of the terabytes of data available at our fingertips. We’ve embraced professors and faculty who want to help us explore the quickly growing world of data, and whether it’s Excel or direct programming, these skills are becoming increasingly in demand.

Exploratory Data Analysis (EDA) is one of the first steps in successfully venturing into any dataset. EDA allows a user or data scientist to become more familiar with and more connected to whatever information they have in their hands. Understanding the best practices of EDA is crucial to setting up any project with data, especially those in data science, artificial intelligence, or machine learning.

Today, we’re going to dive deeper into what EDA can do for any data scientist, and walk through a step-by-step example of looking into a new dataset.

Setting Up Your Workspace

At the moment, Python is one of the most popular languages for data scientists. This is because of the vast variety of modules that you can use for different data science tasks. In this article, we will use Python 3.7. Apart from that, we will use the following modules:

  • Pandas — this is an open source library providing easy-to-use and high-performance data structures and analysis tools for Python.
  • NumPy — this is a key Python library for scientific computing.
  • Matplotlib — this is a Python 2D plotting library. Using it we can create plots, histograms, bar charts, scatterplots, etc.
  • Seaborn — this is a data visualization library based on matplotlib. It provides a high-level API for drawing statistical graphics.
  • MissingNo — this is another data visualization library for Python, which we will use for missing data detection.

Any Python IDE will work. I’m personally using Google Colab, but Jupyter, PyCharm, or almost any other programming environment will work. I also set up one personal configuration for Pandas that allows me to see all the columns in a given DataFrame. You can leave out this line if you prefer a more condensed frame.
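As a rough sketch, the setup might look something like this (the display option is the personal configuration mentioned above, so feel free to drop it):

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import missingno as msno

    # Personal preference: show every column when printing a DataFrame
    pd.set_option('display.max_columns', None)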

After loading in the required packages, we also need to load in our raw data CSV via Pandas.

You can accomplish this by using pd.read_csv which will load in your data as a DataFrame, an easy to use yet powerful data structure provided by Pandas.
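A minimal sketch of that load step, assuming the hourly file from the UCI repository has been saved locally as hour.csv (adjust the path to wherever your copy lives):

    # Load the raw CSV into a Pandas DataFrame
    data = pd.read_csv('hour.csv')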

I’m using the following dataset below on bike sharing systems. Bike sharing systems work somewhat like rent-a-car systems. Basically, there are several locations in a city where one can obtain a membership, rent a bicycle and return it. Users can rent a bicycle in one location and return it to a different location. At the moment, there are more than 500 cities around the world with these systems.

The cool thing about these systems, apart from being good for the environment, is that they record information about the number of rented bikes. When this data is combined with weather data, we can use it to investigate many topics, like pollution and traffic mobility. That is exactly what we have on our hands in the Bike Sharing Demands Dataset. This dataset is distributed by the UCI ML Repository and was initially provided by the Capital Bikeshare program in Washington, D.C.

Example rows from data set.

Goals for the Analysis

Before we jump into analysis, even though we have our workspace set up, it’s important to establish a few goals that can guide you or another data scientist along the way. The power of data is incredible, to the point that it’s easy to get lost in tangential ideas and frustration with programming. For this dataset, we have three questions we’d like to look into:

  • Are there any correlations between user count and other environmental factors, e.g., time of day, day of week, or weather patterns?
  • Are there differences between casual users and registered users dependent on other factors?
  • Are there significant outliers or unexpected patterns that would point towards potential new market segments or marketing strategies for these bike services?

Univariate Analysis

In this first step of data analysis, we are trying to determine the nature of each feature. Sometimes categorical variables have the wrong type. Alternatively, some quantitative data could be out of scale. These things are investigated during univariate analysis, the first phase of EDA. We observe each feature individually, but we also try to grasp a picture of the dataset as a whole. A good approach is to get the shape of the dataset and observe the number of samples. We can also print several samples and try to get some information from them. Again, we are using Pandas for this:
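Something along these lines should do it:

    # Peek at the first few records and check the overall size of the dataset
    print(data.head())
    print(data.shape)  # (number of samples, number of features)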

The Pandas function head gives us the first five records from the loaded data. We can show more rows by passing a number as a parameter to that function. The shape attribute gives us the dimensions of the DataFrame as a tuple. Here is the output of the code snippet from above:

From this output, we can see that we have 17,379 samples, or records. Apart from that, we can notice that features have values on different scales. For example, in record 0 the temp feature has the value 0.24, while the registered feature has the value 13. If we are working on a machine learning or deep learning solution, this is a situation we need to address by putting these features onto the same scale. If we don’t do that before we start the training process, the machine learning model will “think” that the registered feature is more important than the temp feature. We will not do that in this article, because it is out of scope, but I can suggest tools to help, such as StandardScaler from the Scikit-Learn library.

Another thing we can notice is that some features carry no information and are useless, like instant (just an index of the sample) and dteday (whose information is contained in other features). So we can remove them:
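One way to drop them, assuming the columns keep their raw UCI names:

    # instant is just a row index and dteday is already encoded by other features
    data = data.drop(columns=['instant', 'dteday'])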

The next thing we can do in this analysis is checking the datatypes of each feature. Again, we use Pandas for this:
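For example:

    # Check the data type of every feature
    print(data.dtypes)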

The output of this call looks like this:

The important thing to notice here is that we have categorical variables like season, year (yr), month (mnth), etc. However, these features in the loaded data have the type int64, meaning machine learning and deep learning models will treat them as quantitative features. We have to change their type. We can do that like this:
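One way to do the conversion, assuming the categorical columns keep their raw UCI names, is a simple loop with astype:

    # Convert the categorical features from int64 to the category data type
    categorical_features = ['season', 'yr', 'mnth', 'hr', 'holiday',
                            'weekday', 'workingday', 'weathersit']
    for feature in categorical_features:
        data[feature] = data[feature].astype('category')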

Now, when we call data.dtypes we get this output:

Great, now our categorical features are actually of the category data type!

Missing Data

Data can come in an unstructured manner, meaning some records may be missing data for some features. We need to detect those places and replace them with some values. Depending on the rest of the dataset, we may apply different strategies for replacing those missing values. For example, we may fill these empty slots with the average feature value, or the maximal feature value. For detecting missing data, we use Pandas or Missingno:
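A minimal sketch of both checks might look like this:

    # Tabular check with Pandas: count of missing values per feature
    print(data.isnull().sum())

    # Graphical check with Missingno
    msno.matrix(data)
    plt.show()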

The output of these two lines of code looks like this:

The first, tabular section comes from Pandas. Here we can see that there are no missing values in this dataset (all zeros).

In the second section, which is more graphical and comes from Missingno, we get confirmation of this. If there were missing values, the Missingno output would have horizontal white lines indicating where they are.

Distribution

An important characteristic of features we need to explore is their distribution. This is especially important for this example, since our output (the dependent variable Count) is quantitative in nature. Mathematically speaking, the distribution of a feature is a listing or function showing all the possible values (or intervals) of the data and their frequency of occurrence. When we are talking about the distribution of categorical data, we can see the number of samples in each category. On the other hand, when we are observing the distribution of numerical data, values are ordered from smallest to largest and sliced into reasonably sized groups.

The distribution of the data is usually represented with a histogram. Basically, we split the complete range of possible values into intervals and count how many samples fall into each interval. To do this in Python we use the Seaborn module:
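A sketch of that call, assuming the output column keeps its raw name cnt and using the 30 bins described below (on older Seaborn versions, distplot plays the same role as histplot):

    # Histogram of the output variable (total rentals per hour), split into 30 bins
    sns.histplot(data['cnt'], bins=30)
    plt.xlabel('Count')
    plt.show()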

In this particular case, we are using the Count feature, i.e. the output, with 30 bins. Here is what that histogram looks like:

Apart from this, we can do this for every feature in the dataset:
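For the numeric features, one quick way is the built-in Pandas hist method, which draws one histogram per column:

    # Histograms of all numeric features at once
    data.hist(figsize=(12, 10), bins=30)
    plt.tight_layout()
    plt.show()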

When we are observing the distribution of the data, we want to describe certain characteristics like its center, shape, spread, amount of variability, etc. To do so we use several measures, primarily those that describe the center and spread of the distribution.

For describing the center of the distribution we use measures such as the mode and median, but the go-to measurement is the mean of the sample. To get these values we can use the Pandas functions mean, mode and median.
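For the output column, that might look like:

    # Central tendency of the output variable
    print(data['cnt'].mean())
    print(data['cnt'].median())
    print(data['cnt'].mode())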

To describe the spread of the distribution we most commonly use these measures:

  • Range — This measure represents the distance between the smallest data point and the largest one.
  • Inter-quartile range (IQR) — While the range covers the whole data, the IQR indicates where the middle 50 percent of the data is located. To find it, we first look for the median M, since it splits the data in half. Then we locate the median of the lower half of the data (denoted Q1) and the median of the upper half of the data (denoted Q3). The data between Q1 and Q3 is the IQR.
  • Standard deviation — This measure gives the average distance between data points and the mean. Essentially, it quantifies the spread of a distribution.

To get all these measures we use the describe function of Pandas. This function also gives back the measures that we used for the center. Here is what that looks like:
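For example:

    # Count, mean, std, min, quartiles (Q1, median, Q3) and max for each numeric feature
    print(data.describe())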

Outliers

Outliers are values that deviate from the overall distribution of the data. They can be natural, produced by the same process as the rest of the data, but sometimes they are just plain mistakes. Thus, sometimes we want to keep these values in the dataset, since they may carry important information, while other times we want to remove those samples because of the wrong information they may carry. In a nutshell, we can use the inter-quartile range (IQR) to detect outliers: we can define outliers as samples that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. However, the easiest way to detect those values is by using a boxplot.
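As a quick illustration of that rule (not something we will actually apply here), flagging outliers in the output column could look like this:

    # 1.5 * IQR rule for the Count column
    q1 = data['cnt'].quantile(0.25)
    q3 = data['cnt'].quantile(0.75)
    iqr = q3 - q1

    outliers = data[(data['cnt'] < q1 - 1.5 * iqr) | (data['cnt'] > q3 + 1.5 * iqr)]
    print(len(outliers))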

The purpose of the boxplot is to visualize the distribution, but in a different way than the histogram does. In essence, it includes the important points we explained in the previous section: max value, min value, median, and the two IQR points (Q1, Q3). Note that the max value of a boxplot or distribution can itself be an outlier. This is how we can see all the important points using a boxplot and detect outliers. From the number of these outliers we can infer their nature, i.e. whether they are mistakes or a natural part of the distribution.

In Python, we can use the Seaborn library to get a boxplot for a single feature, or for a combination of two features. This technique is useful for exploring the relationship between quantitative and categorical variables. In this concrete example, we display not only the distribution of the Count feature on its own, but also its relationship with several other features. Here is the code snippet:
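A sketch of those calls, assuming the raw column names cnt, hr and season (any categorical feature can be swapped in):

    # Boxplot of the output on its own
    sns.boxplot(y=data['cnt'])
    plt.show()

    # Boxplots of the output against categorical features, e.g. hour of day and season
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    sns.boxplot(x='hr', y='cnt', data=data, ax=axes[0])
    sns.boxplot(x='season', y='cnt', data=data, ax=axes[1])
    plt.show()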

The corresponding output:

In this particular case, without going into detailed analysis, we may assume that these outliers are part of the natural process, and we will not remove them.

Correlation Analysis

So far we have observed features individually and the relationships between quantitative and categorical features. However, in order to prepare data for our fancy algorithms, it is also important to analyze the relationship of one quantitative feature to another. The goal of this section of the analysis is to detect features that affect the output too much, or features that carry the same information. In general, this usually happens when the explored features have a linear relationship, meaning we can model the relationship in the form y = kx + n, where y and x are the explored features (variables), while k and n are scalar values.

As you can see in the image above, there are two types of linear relationship, positive and negative. A positive linear relationship means that an increase in one feature results in an increase in the other feature. On the other hand, a negative linear relationship means that an increase in one feature results in a decrease in the other feature. Another characteristic is the strength of the relationship. Basically, if data points are far away from the modeled function, the relationship is weaker. From the image above we can determine that the relationship in the first graph is stronger than in the second one (though it is not that obvious).

In order to determine what kind of relationship we have, we use visualization tools like the scatterplot and the correlation matrix. A scatterplot is a useful tool for displaying the relationship between features. Basically, we put one feature on the X-axis and the other on the Y-axis. Then, for every sample, we plot a point in the coordinate system at the values of the respective features.

To create this visualization for two features we use the scatterplot function from the Seaborn module. However, there is also a function called pairplot in the same library that we can use to plot the relationships of all quantitative features. Here is what that looks like:
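A sketch of that call, restricted to the quantitative columns (raw UCI names assumed):

    # Pairwise scatterplots of all quantitative features
    quantitative = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
    sns.pairplot(data[quantitative])
    plt.show()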

We can note several features that have a linear relationship. It is apparent that the features Count and Registered have a linear relationship, as do the features Temperature (temp) and Normalized Temperature (atemp). However, from the scatterplot we cannot really determine the strength of that relationship. For this purpose, we use the correlation matrix.

The correlation matrix consists of correlation coefficients for each pair of features. The correlation coefficient is a measure that gives us information about the strength and direction of a linear relationship between two quantitative features. This coefficient can take values in the range -1 to 1. If the coefficient is negative, the examined linear relationship is negative; otherwise, it is positive. The closer the value is to -1 or 1, the stronger the relationship. To get this information we use a combination of the Pandas and Seaborn modules. Here is how:
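A sketch of that combination, again assuming the raw column names:

    # Correlation matrix of the quantitative features, visualized as a heatmap
    quantitative = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
    corr = data[quantitative].corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.show()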

As you can see, this confirms our initial analysis with the scatterplot: Count and Registered, as well as Temperature and Normalized Temperature, have a strong positive linear relationship. In order to get better results with our artificial intelligence solutions, we may choose to remove some of those features.

Analysis

In analyzing the data set above, we initially entered this project with three questions:

  • Are there any correlations between user count and other environmental factors, e.g., time of day, day of week, or weather patterns?
  • Are there differences between casual users and registered users dependent on other factors?
  • Are there significant outliers or unexpected patterns that would point towards potential new market segments or marketing strategies for these bike services?

With our data analysis complete, we can now try to answer each in turn:

  • Based on the histogram plots of each environmental factor, there seem to be a few types of environments that stimulate bike usage (e.g., slightly colder or slightly warmer temperatures, low to mid windspeed, medium to high humidity). There also appears to be a strong positive correlation between user count and temperature, but the pairplot above does not appear to support theories that posit correlation between user count and other environmental factors.
  • It seems that the more registered users there are, the more active users we have on the platform. This makes sense, as registered users are counted within the total active user count. However, it suggests a possible theory that registered users drive a snowball effect for active users, and maybe even stimulate casual users to join the platform or convert via registration. Casual users also show a split correlation, either highly or moderately positive. This is a very interesting outcome and requires further investigation.
  • Looking at our boxplots, there seem to be a significant number of outliers, or days that had high use, especially around the hours of 10am to 4pm. However, these are not the highest times of use on average; there are typically more active users at 8am and 5-6pm. This may signal that there are more potential users in the midday window, but that we do not do a good enough job of activating them.

It is worth noting that while we cannot control environmental factors, marketing and advertising for any service is critical to its use. Capital Bikeshare is a program run by Lyft, and while it may be more speculative than an actual revenue-generating business, it is important to understand the factors that go into consumer choices.

All in all, there are still many things to investigate with this data set and these environmental impacts on consumer choices. However, this gives us a great starting position on which to create further data collection and analysis.

Conclusion

In this article, we tried to cover a lot of ground. We went through several statistical methods for analyzing data and detecting potential downfalls for your AI applications. What do you think? What are your favorite Exploratory Data Analysis techniques?
