To be honest with you: when I was in college learning about Data Science, I completely underestimated the value of good plotting techniques. It’s true, everything was a dumpster fire back then: in between learning Python from scratch, getting my head around all the possible algorithms and understanding the math behind everything, plotting was the very last of my problems. I was just focused on trying to complete the assignments and move on to the next one.
And why shouldn’t plotting be the last of my problems? We were always making plots of the same types of things. You know: pairplots, distplots, qqplots…those charts you use when visualizing the data is the only way of understanding it. Very useful charts. But very generic and default charts too. So copying and pasting a bunch of code became my best friend in those days.
For my projects, the deliverable was always a model. Hopefully with some reasonable score thanks to hours and hours of cleaning and feature engineering. I was the only person involved in my projects and my professors already knew everything about that data since they gave it to me. So who would I be plotting for? Myself? C’mon…unnecessary! Right? I knew better than anyone what I was trying to achieve in each step. I didn’t need to explain anything to anybody.
But apart from all this, let’s be honest, plotting is not fancy at all. Anyone can plot. My 60-year-old father can pull some charts out of the hat just by using Excel. And if everybody can do it then, of course, that’s the definition of not being fancy. ’Cause here, my friends, we’re doing Data Science and Machine Learning. And most people can’t even understand what that is. That’s why we’re all so cool and sexy, as Harvard Business Review magazine said.
The problem, peeps (if you haven’t already guessed from my excessive irony), is that this is not how real life works. And I believe that was probably the biggest failure in my Data Science immersive: not giving enough weight to explainability and interpretability. You might be a genius, but if you are not able to explain to a third party how and why you’re getting those wonderful predictions, then you might have nothing. For example, at Ravelin Technology we offer machine learning-based solutions for fraud prevention. Imagine telling a client that we’re blocking X% of the transactions because the machine learning model says so, without having any idea at all about why it is doing it. Surely not very appealing for any e-commerce business out there trying to maximize conversion and sales, right? Now imagine this same kind of situation in other sensitive domains such as healthcare…disaster would be just around the corner.
Now, apart from business-related problems, and even aside from the legal point of view or the fact that maybe your business only cares about predictions (no matter how you get them), understanding how an algorithm actually works can help, not only to better explain outputs to customers but also to better align the activities of data scientists and analysts.
So it turns out that in the real world, the picture is completely different from the one I had when I was working on my academic data science projects: I’m never the only person involved in a project and my workmates and/or clients usually don’t know much about the data I’m using. So who would I be plotting for now? Does it still sound unnecessary? Probably not. Being able to explain your thinking process to people is a key part of any data-related job. That’s why copying and pasting charts is not enough and chart customization becomes very important.
In what’s left of this post I’d like to share with you 10 basic, intermediate and advanced tools I’ve found very useful in real life when it comes to plotting for explaining things about your data.
The libraries I’ll be referencing in the following lines will be:
- Seaborn | From: import seaborn as sns
- Matplotlib | From: import matplotlib.pyplot as plt
Additionally, if you want, you can set up a style and your favourite format like:
plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
Having said that, let’s jump straight to the tools:
1. Drawing multiple plots
There will be occasions when you’ll want to plot several things within one chart, and we’ll cover that shortly. But at other times, you’ll want to put different charts in the same row or column, complementing each other and/or showing different pieces of information.
For this, we’re going to see a very basic but essential tool: subplots. How to use it? Very simple. A chart in matplotlib is a construction using:
- A figure: the background or canvas for drawing our charts
- Axes: our chart or charts
Usually, these things are set up automatically in the background of our code, but if we want to draw several plots, we only need to create our figure and axes objects in the following way:
fig, ax = plt.subplots(ncols=number_of_cols, nrows=number_of_rows, figsize=(x,y))
So for example, if we set up ncols = 1 and nrows = 2, we’ll be creating a figure of size x,y with only 2 charts, distributed in two different rows. The only thing left would be to specify the order for the different plots using the ‘ax’ parameter and starting from 0. For example:
sns.scatterplot(x=horizontal_data_1, y=vertical_data_1, ax=ax[0]);
sns.scatterplot(x=horizontal_data_2, y=vertical_data_2, ax=ax[1]);
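To make this concrete, here is a minimal, self-contained sketch of the two-row layout described above. The data values and the figure size are made up purely for illustration; only the `plt.subplots` and `ax=` pattern comes from the text.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data, purely for illustration.
horizontal_data_1 = [1, 2, 3, 4, 5]
vertical_data_1 = [2, 4, 5, 4, 6]
horizontal_data_2 = [1, 2, 3, 4, 5]
vertical_data_2 = [9, 7, 6, 4, 1]

# One column, two rows: `ax` is an array holding two Axes objects.
fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(8, 6))

# Index 0 is the top chart, index 1 the bottom one.
sns.scatterplot(x=horizontal_data_1, y=vertical_data_1, ax=ax[0])
sns.scatterplot(x=horizontal_data_2, y=vertical_data_2, ax=ax[1])

fig.tight_layout()  # keeps the two charts from overlapping
# In a script, call plt.show() here; a notebook renders the figure on its own.
```

Keep in mind that the shape of `ax` depends on the grid: with a single row or column it is a one-dimensional array, with several rows and several columns you index it as `ax[row, col]`, and with a 1x1 grid it is just a single Axes object.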
2. Axis labelling
This might appear unnecessary, or not very helpful, but you cannot imagine how many times you can be asked what’s on the X/Y axis if your chart is somehow confusing, or if the person looking at it is not very familiar with the data. Following our previous example with two plots, if we want to set up specific names for our axes, we’ll have to use the following code lines:
ax[0].set(xlabel='My X label', ylabel='My Y label')
ax[1].set(xlabel='My second X label', ylabel='My second and very creative Y label')
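Here is a small runnable sketch of axis labelling on a two-row figure. The data and the label texts are invented for the example; it also shows `set_xlabel` and `set_ylabel`, which do the same job as `set(xlabel=..., ylabel=...)` one label at a time.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(8, 6))

# Made-up data, purely for illustration.
sns.scatterplot(x=[1, 2, 3], y=[3, 1, 2], ax=ax[0])
sns.scatterplot(x=[1, 2, 3], y=[10, 30, 20], ax=ax[1])

# Both labels in a single call...
ax[0].set(xlabel="Month", ylabel="Weight (kg)")

# ...or one label per call, which is handy when only one of them changes.
ax[1].set_xlabel("Month")
ax[1].set_ylabel("Height (cm)")

fig.tight_layout()
```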
3. Setting titles
Another basic but key tool if we’re presenting our data to a third party is using titles, and it works in a very similar way to our previous axis labelling point:
ax[0].title.set_text("This title has to be very clear and explicative")
ax[1].title.set_text("And this title has to explain what's different in this chart")
4. Annotate things within the chart
More often than not, just having the y-scale to the right or left of our chart is not going to be very clear by itself, whether because all the values are very close to each other or because precision is very important for the thing being analyzed. Whatever the case, annotating the values on the plot can be very useful to make it clearer and more self-explanatory.
Suppose now we’re using subplots, so we have several charts, and one of them is a Seaborn barplot drawn on the axes ax. In this case, the code for getting the annotation on each bar is a bit more complex but very easy to implement:
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=12, color='white',
                xytext=(0, -10), textcoords='offset points')
For each ‘patch’ or bar in our chart, the code up to the ‘ha’ parameter builds the annotation text from the bar’s height and uses the bar’s position, width and height to place it at the top centre of the bar. In a similar way, we can also specify alignment, fontsize and colour of the annotation, while the ‘xytext’ parameter, together with ‘textcoords’, indicates how far we want to shift the annotation in the x or y direction. In our example above, we would be moving the text 10 points down on the y axis.
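To see the annotation loop working end to end, here is a minimal sketch with a single barplot. The categories and values are invented for the example, and there is only one value per category so that each bar’s height is exactly that value.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data, purely for illustration (one value per category).
categories = ["a", "b", "c"]
values = [3.25, 4.5, 2.75]

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x=categories, y=values, ax=ax)

# Write each bar's height just below its top edge, centred horizontally.
for p in ax.patches:
    ax.annotate(
        "%.2f" % p.get_height(),                           # text: height, 2 decimals
        (p.get_x() + p.get_width() / 2.0, p.get_height()),  # anchor: top centre of bar
        ha="center", va="center", fontsize=12, color="white",
        xytext=(0, -10), textcoords="offset points",        # shift 10 points down
    )
```

White text works here because it sits inside the coloured bar; if your bars are very short, a positive y offset and a dark colour would put the label above the bar instead.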
5. Differentiating labels using different colours
In some cases, over a period of time or a range of values, we might have measured different kinds of objects. For example, suppose that over 6 months we measure the weight of dogs and cats. At the end of the period, we want to plot the weight of each animal, but differentiating between dogs and cats using blue and red respectively. For that, in most of the traditional plots, we can use the ‘hue’ parameter to provide a list with one label per element; Seaborn then draws each distinct label in its own colour. Note that these labels are treated as group names, so the exact colours come from the active palette unless you pass one explicitly.
Take the following example:
weight = [5,4,8,2,6,2]
month = ['February','January','April','June','March','May']
animal_type = ['dog','cat','cat','dog','dog','dog']
hue = ['blue','red','red','blue','blue','blue']
sns.scatterplot(x=month, y=weight, hue=hue);
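If you want to be sure that dogs really come out blue and cats red, whatever style is active, one option is to pass the animal type as ‘hue’ and map each type to a colour with the ‘palette’ parameter. This is a sketch of that variation rather than the exact code above; the `palette` dictionary is my own addition for the example.

```python
import matplotlib.pyplot as plt
import seaborn as sns

weight = [5, 4, 8, 2, 6, 2]
month = ["February", "January", "April", "June", "March", "May"]
animal_type = ["dog", "cat", "cat", "dog", "dog", "dog"]

# Assumption for this example: an explicit label -> colour mapping.
palette = {"dog": "blue", "cat": "red"}

fig, ax = plt.subplots(figsize=(6, 4))

# `hue` says which group each point belongs to; `palette` fixes the colours.
sns.scatterplot(x=month, y=weight, hue=animal_type, palette=palette, ax=ax)
```

A nice side effect is that the legend now reads "dog" and "cat" instead of colour names, which is usually what the person looking at the chart wants to know.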
- ax will be the chart in which we want to insert the line
- 32 is going to be the value in which the line is going to be drawn
- And c='r' draws the line in red
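The three points above describe a call to Matplotlib’s `axvline` on one of the axes. As a rough sketch of what such a call can look like, here is a two-row figure with a red vertical line at x = 32 on the top chart. The data, the variable names and the choice of the top chart are all made up for the example.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data, chosen so that 32 falls inside the x range.
ages = [18, 25, 32, 40, 51]
scores_1 = [3, 5, 9, 6, 4]
scores_2 = [7, 6, 5, 3, 2]

fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(8, 6))
sns.scatterplot(x=ages, y=scores_1, ax=ax[0])
sns.scatterplot(x=ages, y=scores_2, ax=ax[1])

# Vertical line at x = 32, in red, on the top chart only.
ax[0].axvline(32, c="r")

fig.tight_layout()
```

The horizontal counterpart is `axhline`, which takes a y value instead.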
If we are working with subplots, it is as simple as adding an axvline to the corresponding axes as in the example above. However, if we are not using subplots, we should do the following:
g = sns.scatterplot(x=month, y=weight, hue=hue, legend=False)
g.axvline(2, c='r')
plt.show()
Mind that for this to work, you should always use the same x-axis data in both charts. Otherwise, they’re not going to match.
9. Overlapping plots and changing labels and colours
Overlapping charts on the same axis is easy: we just need to write the code for all the plots we want, and afterwards, we can simply call ‘plt.show()’ for all of them to be drawn together:
a = [1,2,3,4,5]
b = [4,5,6,2,2]
c = [2,5,6,2,1]
sns.lineplot(x=a, y=b, c='r')
sns.lineplot(x=a, y=c, c='b')
plt.show()