Practical Business Python

Taking care of business, one python script at a time

Tue 25 April 2017

Effectively Using Matplotlib

Posted by Chris Moffitt in articles

Introduction

The python visualization world can be a frustrating place for a new user. There are many different options and choosing the right one is a challenge. For example, even after 2 years, this article is one of the top posts that lead people to this site. In that article, I threw some shade at matplotlib and dismissed it during the analysis. However, after using tools such as pandas, scikit-learn, seaborn and the rest of the data science stack in python, I think I was a little premature in dismissing matplotlib. To be honest, I did not quite understand it and how to use it effectively in my workflow.

Now that I have taken the time to learn some of these tools and how to use them with matplotlib, I have started to see matplotlib as an indispensable tool. This post will show how I use matplotlib and provide some recommendations for users getting started or users who have not taken the time to learn matplotlib. I do firmly believe matplotlib is an essential part of the python data science stack and hope this article will help people understand how to use it for their own visualizations.

Why all the negativity towards matplotlib?

In my opinion, there are a couple of reasons why matplotlib is challenging for the new user to learn.

First, matplotlib has two interfaces. The first is based on MATLAB and uses a state-based interface. The second option is an object-oriented interface. The whys of this dual approach are outside the scope of this post, but knowing that there are two approaches is vitally important when plotting with matplotlib.

The reason two interfaces cause confusion is that in the world of Stack Overflow and tons of information available via Google searches, new users will stumble across multiple solutions to problems that look somewhat similar but are not the same. I can speak from experience. Looking back on some of my old code, I can tell that there is a mishmash of matplotlib code, which is confusing to me (even though I wrote it).

Key Point
New matplotlib users should learn and use the object-oriented interface.
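
To make the difference concrete, here is a minimal sketch that draws the same line with each interface. The data values are made up for illustration, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt

x = [1, 2, 3]      # made-up data
y = [10, 20, 15]

# State-based (MATLAB-style) interface: pyplot tracks the "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("State-based")
state_ax = plt.gca()  # grab the implicit current axes so we can inspect it

# Object-oriented interface: hold the Figure and Axes explicitly
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-oriented")

print(state_ax.get_title(), "|", ax.get_title())
```

Both halves produce the same kind of plot. The second half is the style used in the rest of this post, because every later customization has an explicit object to act on.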

Another historic challenge with matplotlib is that some of the default style choices were rather unattractive. In a world where R could generate some really cool plots with ggplot, the matplotlib options tended to look a bit ugly in comparison. The good news is that matplotlib 2.0 has much nicer styling capabilities and the ability to theme your visualizations with minimal effort.

The third challenge I see with matplotlib is that there is confusion as to when you should use pure matplotlib to plot something vs. a tool like pandas or seaborn that is built on top of matplotlib. Anytime there can be more than one way to do something, it is challenging for the new or infrequent user to follow the right path. Couple this confusion with the two different APIs and it is a recipe for frustration.

Why stick with matplotlib?

Despite some of these issues, I have come to appreciate matplotlib because it is extremely powerful. The library allows you to create almost any visualization you could imagine. Additionally, there is a rich ecosystem of python tools built around it and many of the more advanced visualization tools use matplotlib as the base library. If you do any work in the python data science stack, you will need to develop some basic familiarity with how to use matplotlib. That is the focus of the rest of this post: developing a basic approach for effectively using matplotlib.

Basic Premises

If you take nothing else away from this post, I recommend the following steps for learning how to use matplotlib:

  1. Learn the basic matplotlib terminology, specifically what is a Figure and an Axes.
  2. Always use the object-oriented interface. Get in the habit of using it from the start of your analysis.
  3. Start your visualizations with basic pandas plotting.
  4. Use seaborn for the more complex statistical visualizations.
  5. Use matplotlib to customize the pandas or seaborn visualization.

This graphic from the matplotlib faq is gold. Keep it handy to understand the different terminology of a plot.

Matplotlib parts

Most of the terms are straightforward but the main thing to remember is that the Figure is the final image that may contain 1 or more axes. The Axes represent an individual plot. Once you understand what these are and how to access them through the object-oriented API, the rest of the process starts to fall into place.
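
A small sketch of that relationship, assuming a figure with two side-by-side plots (the layout is arbitrary and only for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt

# One Figure that holds two Axes
fig, axes = plt.subplots(nrows=1, ncols=2)

# The Figure keeps a list of the Axes it contains...
print(len(fig.axes))

# ...and each Axes points back to the Figure it lives on
print(all(ax.figure is fig for ax in axes))
```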

The other benefit of this knowledge is that you have a starting point when you see things on the web. If you take the time to understand this point, the rest of the matplotlib API will start to make sense. Also, many of the advanced python packages like seaborn and ggplot rely on matplotlib so understanding the basics will make those more powerful frameworks much easier to learn.

Finally, I am not saying that you should avoid the other good options like ggplot (aka ggpy), bokeh, plotly or altair. I just think you’ll need a basic understanding of matplotlib + pandas + seaborn to start. Once you understand the basic visualization stack, you can explore the other options and make informed choices based on your needs.

Getting Started

The rest of this post will be a primer on how to do the basic visualization creation in pandas and customize the most common items using matplotlib. Once you understand the basic process, further customizations are relatively straightforward.

I have focused on the most common plotting tasks I encounter such as labeling axes, adjusting limits, updating plot titles, saving figures and adjusting legends. If you would like to follow along, the notebook includes additional detail that should be helpful.

To get started, I am going to set up my imports and read in some data:

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true")
df.head()

   account number  name                         sku       quantity  unit price  ext price  date
0          740150  Barton LLC                   B1-20000        39       86.69    3380.91  2014-01-01 07:21:51
1          714466  Trantow-Barrows              S2-77896        -1       63.16     -63.16  2014-01-01 10:00:47
2          218895  Kulas Inc                    B1-69924        23       90.70    2086.10  2014-01-01 13:24:58
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05  2014-01-01 15:05:22
4          412290  Jerde-Hilpert                S2-34077         6       83.21     499.26  2014-01-01 23:26:55

The data consists of sales transactions for 2014. In order to make this post a little shorter, I’m going to summarize the data so we can see the total number of purchases and total sales for the top 10 customers. I am also going to rename columns for clarity during plots.

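# Select the two columns with a list (double brackets); recent pandas
# versions reject the older df.groupby(...)['a', 'b'] tuple form.
top_10 = (df.groupby('name')[['ext price', 'quantity']]
          .agg({'ext price': 'sum', 'quantity': 'count'})
          .sort_values(by='ext price', ascending=False))[:10].reset_index()
top_10.rename(columns={'name': 'Name', 'ext price': 'Sales', 'quantity': 'Purchases'}, inplace=True)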

Here is what the data looks like.

   Name                          Purchases      Sales
0  Kulas Inc                            94  137351.96
1  White-Trantow                        86  135841.99
2  Trantow-Barrows                      94  123381.38
3  Jerde-Hilpert                        89  112591.43
4  Fritsch, Russel and Anderson         81  112214.71
5  Barton LLC                           82  109438.50
6  Will LLC                             74  104437.60
7  Koepp Ltd                            82  103660.54
8  Frami, Hills and Schmidt             72  103569.59
9  Keeling LLC                          74  100934.30
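
As an aside, the same kind of summary can be written with pandas named aggregation, which does the aggregation and the renaming in one step. This is only a sketch of an alternative, not the approach used in this post, and the tiny DataFrame below is made up so the snippet is self-contained.

```python
import pandas as pd

# Made-up transactions, only to illustrate the pattern
df_demo = pd.DataFrame({
    "name": ["A Inc", "A Inc", "B LLC", "B LLC", "B LLC", "C Ltd"],
    "ext price": [100.0, 50.0, 30.0, 20.0, 10.0, 500.0],
    "quantity": [2, 1, 3, 2, 1, 5],
})

top_2 = (
    df_demo.groupby("name")
    # Named aggregation: new_column=(source_column, function)
    .agg(Sales=("ext price", "sum"), Purchases=("quantity", "count"))
    .sort_values(by="Sales", ascending=False)
    .head(2)                      # keep the top 2 customers by sales
    .reset_index()
    .rename(columns={"name": "Name"})
)
print(top_2)
```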

Now that the data is formatted in a simple table, let’s talk about plotting these results as a bar chart.

As I mentioned earlier, matplotlib has many different styles available for rendering plots. You can see which ones are available on your system using plt.style.available.

plt.style.available
['seaborn-dark', 'seaborn-dark-palette', 'fivethirtyeight', 'seaborn-whitegrid', 'seaborn-darkgrid', 'seaborn', 'bmh', 'classic', 'seaborn-colorblind', 'seaborn-muted', 'seaborn-white', 'seaborn-talk', 'grayscale', 'dark_background', 'seaborn-deep', 'seaborn-bright', 'ggplot', 'seaborn-paper', 'seaborn-notebook', 'seaborn-poster', 'seaborn-ticks', 'seaborn-pastel']

Using a style is as simple as:

plt.style.use('ggplot')

I encourage you to play around with different styles and see which ones you like.
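
If you want to try a style without changing it for the whole session, one option is the `plt.style.context` context manager. The sketch below is an illustration with made-up data; it compares one rcParams entry before, inside and after the block to show that the style is only active inside the `with` statement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]

# The style is applied only inside the with-block
with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])  # made-up data

after = plt.rcParams["axes.facecolor"]

print(before, inside, after)
```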

Now that we have a nicer style in place, the first step is to plot the data using the standard pandas plotting function:

top_10.plot(kind='barh', y="Sales", x="Name")

Pandas plot 1

The reason I recommend using pandas plotting first is that it is a quick and easy way to prototype your visualization. Since most people are probably already doing some level of data manipulation/analysis in pandas as a first step, go ahead and use the basic plots to get started.

Customizing the Plot

Assuming you are comfortable with the gist of this plot, the next step is to customize it. Some of the customizations (like adding titles and labels) are very simple to use with the pandas plot function. However, you will probably find yourself needing to move outside of that functionality at some point. That’s why I recommend getting in the habit of doing this:

fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)

The resulting plot looks exactly the same as the original but we added an additional call to plt.subplots() and passed the ax to the plotting function. Why should you do this? Remember when I said it is critical to get access to the axes and figures in matplotlib? That’s what we have accomplished here. Any future customization will be done via the ax or fig objects.
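
One way to convince yourself that pandas really draws onto the axes you hand it is to check what the plot call returns. The sketch below uses a made-up three-row DataFrame in place of top_10.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for top_10
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Sales": [3, 1, 2]})

fig, ax = plt.subplots()
returned = demo.plot(kind="barh", y="Sales", x="Name", ax=ax)

print(returned is ax)     # pandas hands back the same Axes object it was given
print(len(ax.patches))    # one rectangle per bar
```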

We have the benefit of a quick plot from pandas but access to all the power from matplotlib now. An example should show what we can do now. Also, by using this naming convention, it is fairly straightforward to adapt others’ solutions to your unique needs.

Suppose we want to tweak the x limits and change some axis labels? Now that we have the axes in the ax variable, we have a lot of control:

fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set_xlabel('Total Revenue')
ax.set_ylabel('Customer');

Pandas plot 2

Here’s another shortcut we can use to change the title and both labels:

fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')

Pandas plot 3
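
The `ax.set()` shortcut accepts other properties as well, so the x limits can be folded into the same call. A minimal sketch, using a made-up DataFrame in place of top_10:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for top_10
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Sales": [120000, 90000, 60000]})

fig, ax = plt.subplots()
demo.plot(kind="barh", y="Sales", x="Name", ax=ax)

# One call sets the limits, the title and both labels
ax.set(xlim=(-10000, 140000), title="2014 Revenue",
       xlabel="Total Revenue", ylabel="Customer")

print(ax.get_xlim(), ax.get_title())
```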

To further demonstrate this approach, we can also adjust the size of this image. By using the plt.subplots() function, we can define the figsize in inches. We can also remove the legend using ax.legend().set_visible(False).

fig, ax = plt.subplots(figsize=(5, 6))
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue')
ax.legend().set_visible(False)

Pandas plot 4
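
Two related details, sketched with made-up data: the figure size in inches can be read back from the figure (and multiplied by the dpi to estimate pixels), and pandas can skip the legend up front with legend=False instead of hiding it afterwards. This is an alternative shown for illustration, not the approach used in the rest of this post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for top_10
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Sales": [3, 1, 2]})

fig, ax = plt.subplots(figsize=(5, 6))
demo.plot(kind="barh", y="Sales", x="Name", ax=ax, legend=False)

# Size in inches, and the approximate size in pixels at the figure's dpi
width_in, height_in = fig.get_size_inches()
width_px, height_px = fig.get_size_inches() * fig.dpi

print(width_in, height_in)
print(ax.get_legend())  # no legend was created
```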

There are plenty of things you probably want to do to clean up this plot. One of the biggest eyesores is the formatting of the Total Revenue numbers. Matplotlib can help us with this through the use of the FuncFormatter. This versatile function can apply a user-defined function to a value and return a nicely formatted string to place on the axis.

Here is a currency formatting function to gracefully handle US dollars in the several hundred thousand dollar range:

def currency(x, pos):
    'The two args are the value and tick position'
    if x >= 1000000:
        return '${:1.1f}M'.format(x * 1e-6)
    return '${:1.0f}K'.format(x * 1e-3)

Now that we have a formatter function, we need to wrap it in a FuncFormatter and apply it to the x axis. Here is the full code:

fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
Pandas plot 5

That’s much nicer and shows a good example of the flexibility to define your own solution to the problem.
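
Because the formatter is just a plain function, it can be checked on its own before it goes anywhere near a plot. The sketch below repeats the currency function so it is self-contained; the sample values are arbitrary.

```python
from matplotlib.ticker import FuncFormatter


def currency(x, pos):
    """Format a tick value as dollars in thousands (K) or millions (M)."""
    if x >= 1000000:
        return '${:1.1f}M'.format(x * 1e-6)
    return '${:1.0f}K'.format(x * 1e-3)


formatter = FuncFormatter(currency)

# Calling the formatter directly passes the value (and a tick position) to currency()
for value in (0, 60000, 140000, 2500000):
    print(value, "->", formatter(value))
```

Values below one million come back in thousands (for example 60000 becomes $60K) and larger values in millions (2500000 becomes $2.5M).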

The final customization feature I will go through is the ability to add annotations to the plot. In order to draw a vertical line, you can use ax.axvline() and to add custom text, you can use ax.text().

For this example, we’ll draw a line showing an average and include labels showing three new customers. Here is the full code with comments to pull it all together.

# Create the figure and the axes
fig, ax = plt.subplots()

# Plot the data and get the average
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
avg = top_10['Sales'].mean()

# Set limits and labels
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')

# Add a line for the average
ax.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Annotate the new customers
for cust in [3, 5, 8]:
    ax.text(115000, cust, "New Customer")

# Format the currency
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)

# Hide the legend
ax.legend().set_visible(False)
Pandas plot 6

While this may not be the most exciting plot, it does show how much power you have when following this approach.
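
In the example above, the y values passed to ax.text() (3, 5 and 8) are row positions: the bars of a pandas barh plot sit at 0, 1, 2 and so on, in row order. The sketch below relies on that to label rows chosen from the data rather than hard-coded; the DataFrame and the "above average" rule are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for top_10, with the default 0..n-1 index
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Sales": [300, 100, 250]})

fig, ax = plt.subplots()
demo.plot(kind="barh", y="Sales", x="Name", ax=ax, legend=False)

# Vertical line at the average
avg = demo["Sales"].mean()
line = ax.axvline(x=avg, color="b", linestyle="--", linewidth=1)

# Label every bar whose value is above the average; the row number is the y position
for row, sales in enumerate(demo["Sales"]):
    if sales > avg:
        ax.text(sales, row, " above average", va="center")

print(len(ax.texts))
```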

Figures and Plots

Up until now, all the changes we have made have been with the individual plot. Fortunately, we also have the ability to add multiple plots on a figure as well as save the entire figure using various options.

If we decided that we wanted to put two plots on the same figure, we should have a basic understanding of how to do it. First, create the figure, then the axes, then plot it all together. We can accomplish this using plt.subplots():

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(7, 4))

In this example, I’m using nrows and ncols to specify the size because this is very clear to the new user. In sample code you will frequently just see positional values like 1, 2. I think using the named parameters is a little easier to interpret later on when you’re looking at your code.

I am also using sharey=True so that the y axis will share the same labels.

This example is also kind of nifty because the various axes get unpacked to ax0 and ax1. Now that we have these axes, you can plot them like the examples above but put one plot on ax0 and the other on ax1.

# Get the figure and the axes
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(7, 4))
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax0)
ax0.set_xlim([-10000, 140000])
ax0.set(title='Revenue', xlabel='Total Revenue', ylabel='Customers')

# Plot the average as a vertical line
avg = top_10['Sales'].mean()
ax0.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Repeat for the unit plot
top_10.plot(kind='barh', y="Purchases", x="Name", ax=ax1)
avg = top_10['Purchases'].mean()
ax1.set(title='Units', xlabel='Total Units', ylabel='')
ax1.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Title the figure
fig.suptitle('2014 Sales Analysis', fontsize=14, fontweight='bold');

# Hide the legends
ax1.legend().set_visible(False)
ax0.legend().set_visible(False)

Pandas plot 7
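
The same call scales to larger grids. With more than one row and column, plt.subplots() returns the axes as a 2-D array instead of a flat pair, so a common pattern is to loop over them. A minimal sketch with made-up lines in each panel:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt

# A 2 x 2 grid: axes is a 2-D array indexed as axes[row, col]
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(6, 4))
print(axes.shape)

# axes.flat walks the grid row by row
for i, ax in enumerate(axes.flat):
    ax.plot([0, 1], [0, i])       # made-up data
    ax.set_title(f"Panel {i}")

fig.suptitle("Four panels on one figure")
```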

Up until now, I have been relying on the Jupyter notebook to display the figures by virtue of the %matplotlib inline directive. However, there are going to be plenty of times where you have the need to save a figure in a specific format and integrate it with some other presentation.

Matplotlib supports many different formats for saving files. You can use fig.canvas.get_supported_filetypes() to see what your system supports:

fig.canvas.get_supported_filetypes()
{'eps': 'Encapsulated Postscript', 'jpeg': 'Joint Photographic Experts Group', 'jpg': 'Joint Photographic Experts Group', 'pdf': 'Portable Document Format', 'pgf': 'PGF code for LaTeX', 'png': 'Portable Network Graphics', 'ps': 'Postscript', 'raw': 'Raw RGBA bitmap', 'rgba': 'Raw RGBA bitmap', 'svg': 'Scalable Vector Graphics', 'svgz': 'Scalable Vector Graphics', 'tif': 'Tagged Image File Format', 'tiff': 'Tagged Image File Format'}

Since we have the fig object, we can save the figure using multiple options:

fig.savefig('sales.png', transparent=False, dpi=80, bbox_inches="tight")

This version saves the plot as a png with opaque background. I have also specified the dpi and bbox_inches="tight" in order to minimize excess white space.
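
If you need the same figure in several formats, one possible approach is to loop over the extensions and let savefig pick the format from the file name. The sketch below uses made-up data and a temporary directory so it does not touch your working folder; the file name and the list of formats are arbitrary choices.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for this sketch)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 3])  # made-up data

supported = fig.canvas.get_supported_filetypes()
out_dir = Path(tempfile.mkdtemp())  # throwaway output folder

saved = []
for ext in ("png", "svg", "pdf"):
    if ext in supported:  # only write formats this backend can produce
        target = out_dir / f"sales.{ext}"
        fig.savefig(target, dpi=80, bbox_inches="tight")
        saved.append(target)

print([p.name for p in saved])
```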

Conclusion

Hopefully this process has helped you understand how to more effectively use matplotlib in your daily data analysis. If you get in the habit of using this approach when doing your analysis, you should be able to quickly find out how to do whatever you need to do to customize your plot.

As a final bonus, I am including a quick guide to unify all the concepts. I hope this helps bring this post together and proves a handy reference for future use.

Matplotlib example