MEP28: Remove Complexity from Axes.boxplot
Status
Discussion
Branches and Pull requests
The following lists any open PRs or branches related to this MEP:
1. Deprecate redundant statistical kwargs in Axes.boxplot: phobson/matplotlib
2. Deprecate redundant style options in Axes.boxplot: phobson/matplotlib
3. Deprecate passing 2D NumPy arrays as input: None
4. Add pre- & post-processing options to cbook.boxplot_stats: phobson/matplotlib
5. Exposing cbook.boxplot_stats through Axes.boxplot kwargs: None
6. Remove redundant statistical kwargs in Axes.boxplot: None
7. Remove redundant style options in Axes.boxplot: None
8. Remaining items that arise through discussion: None
Abstract
Over the past few releases, the Axes.boxplot method has grown in complexity to support fully customizable artist styling and statistical computation. This led to Axes.boxplot being split off into multiple parts. The statistics needed to draw a boxplot are computed in cbook.boxplot_stats, while the actual artists are drawn by Axes.bxp. The original method, Axes.boxplot, remains as the most public API that handles passing the user-supplied data to cbook.boxplot_stats, feeding the results to Axes.bxp, and pre-processing style information for each facet of the boxplot plots.
This MEP will outline a path forward to roll back the added complexity and simplify the API while maintaining reasonable backwards compatibility.
Detailed description
Currently, the Axes.boxplot method accepts parameters that allow the users to specify medians and confidence intervals for each box that will be drawn in the plot. These were provided so that advanced users could provide statistics computed in a different fashion than the simple method provided by matplotlib. However, handling this input requires complex logic to make sure that the forms of the data structure match what needs to be drawn. At the moment, that logic contains 9 separate if/else statements nested up to 5 levels deep with a for loop, and may raise up to 2 errors. These parameters were added prior to the creation of the Axes.bxp method, which draws boxplots from a list of dictionaries containing the relevant statistics. Matplotlib also provides a function that computes these statistics via cbook.boxplot_stats. Note that advanced users can now either a) write their own function to compute the stats required by Axes.bxp, or b) modify the output returned by cbook.boxplot_stats to fully customize the position of the artists of the plots. With this flexibility, the parameters to manually specify only the medians and their confidence intervals remain for backwards compatibility.
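As a rough illustration of route (b), the sketch below overrides the median and notch limits in the dictionaries returned by cbook.boxplot_stats and hands them to Axes.bxp. The custom values are made up for the example and stand in for whatever a user would otherwise pass as usermedians and conf_intervals; the dictionary keys (med, cilo, cihi) are the ones Axes.bxp reads.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = [rng.normal(loc, 1.0, size=50) for loc in (0.0, 1.0)]

# Default statistics: one dictionary per box.
stats = cbook.boxplot_stats(data, labels=["a", "b"])

# Made-up values standing in for externally computed statistics.
custom_medians = [0.1, 1.2]
custom_cis = [(-0.2, 0.4), (0.9, 1.5)]

# Overwrite the median and the notch limits in place.
for box, med, (lo, hi) in zip(stats, custom_medians, custom_cis):
    box["med"] = med
    box["cilo"] = lo
    box["cihi"] = hi

fig, ax = plt.subplots()
artists = ax.bxp(stats, shownotches=True)
```

Nothing in this route touches Axes.boxplot, which is why the dedicated parameters are candidates for removal.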
Around the same time that the two roles of Axes.boxplot were split into cbook.boxplot_stats for computation and Axes.bxp for drawing, both Axes.boxplot and Axes.bxp were written to accept parameters that individually toggle the drawing of all components of the boxplots, and parameters that individually configure the style of those artists. However, to maintain backwards compatibility, the sym parameter (previously used to specify the symbol of the fliers) was retained. This parameter itself requires fairly complex logic to reconcile the sym parameter with the newer flierprops parameter and the default style specified by matplotlibrc.
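To make the overlap concrete, here is a small sketch that styles the fliers once through sym and once through flierprops. It assumes a matplotlib version that still accepts sym; the data and the chosen marker are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Append two obvious outliers so that fliers are drawn.
data = np.append(rng.normal(size=100), [8.0, -9.0])

fig, (ax_sym, ax_props) = plt.subplots(ncols=2)

# Legacy spelling: a format string packed into sym.
old = ax_sym.boxplot(data, sym="r+")

# Newer spelling: an explicit dictionary of line properties.
new = ax_props.boxplot(
    data,
    flierprops=dict(marker="+", markeredgecolor="r", linestyle="none"),
)
```

Both calls end up with the same flier marker, which is the redundancy the reconciliation logic has to manage.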
This MEP seeks to dramatically simplify the creation of boxplots for novice and advanced users alike. Importantly, the changes proposed here will also be available to downstream packages like seaborn, as seaborn smartly allows users to pass arbitrary dictionaries of parameters through the seaborn API to the underlying matplotlib functions.
This will be achieved in the following way:
1. cbook.boxplot_stats will be modified to allow pre- and post-computation transformation functions to be passed in (e.g., np.log and np.exp for lognormally distributed data)
2. Axes.boxplot will be modified to also accept and naïvely pass them to cbook.boxplot_stats (Alt: pass the stat function and a dict of its optional parameters).
3. Outdated parameters from Axes.boxplot will be deprecated and later removed.
Importance
Since the limits of the whiskers are computed arithmetically, there is an implicit assumption of normality in box and whisker plots. This primarily affects which data points are classified as outliers.
Allowing transformations to the data and the results used to draw boxplots will allow users to opt out of that assumption if the data are known to not fit a normal distribution.
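The arithmetic in question can be sketched in a few lines. With the default whis=1.5, points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as fliers. The helper tukey_fliers below is a hypothetical name written for this illustration, not a matplotlib function; it applies that rule to raw and to log-transformed data so the two classifications can be compared.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.lognormal(mean=-1.75, sigma=2.75, size=37)


def tukey_fliers(values, whis=1.5):
    """Return the points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lo) | (values > hi)]


# Classification in arithmetic space.
fliers_raw = tukey_fliers(data)

# Classification in log space, mapped back to the original units.
fliers_log = np.exp(tukey_fliers(np.log(data)))

# Reference: the fliers reported by matplotlib for the raw data.
ref = cbook.boxplot_stats(data)[0]["fliers"]

print(len(fliers_raw), "fliers in arithmetic space")
print(len(fliers_log), "fliers in log space")
```

Comparing the two counts for skewed data shows how much the outlier classification depends on the space in which the fences are computed.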
Below is an example of how Axes.boxplot classifies outliers of lognormal data differently depending on these types of transforms.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cbook
np.random.seed(0)

fig, ax = plt.subplots(figsize=(4, 6))
ax.set_yscale('log')
data = np.random.lognormal(-1.75, 2.75, size=37)

# Statistics computed on the raw data and on the log-transformed data.
stats = cbook.boxplot_stats(data, labels=['arithmetic'])
logstats = cbook.boxplot_stats(np.log(data), labels=['log-transformed'])

# Map the log-space statistics back to the original units.
for lsdict in logstats:
    for key, value in lsdict.items():
        if key != 'label':
            lsdict[key] = np.exp(value)

# Draw both boxes side by side from the precomputed statistics.
stats.extend(logstats)
ax.bxp(stats)
fig.show()
```

Implementation
Passing transform functions to cbook.boxplot_stats
This MEP proposes that two parameters (e.g., transform_in and transform_out) be added to the cookbook function that computes the statistics for the boxplot function. These will be optional keyword-only arguments and can easily be set to lambda x: x as a no-op when omitted by the user. The transform_in function will be applied to the data as the boxplot_stats function loops through each subset of the data passed to it. After the list of statistics dictionaries is computed, the transform_out function is applied to each value in the dictionaries.
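One way to picture this behavior without modifying matplotlib is a thin wrapper around the existing function. The name boxplot_stats_transformed is hypothetical, and the sketch only forwards keyword arguments that apply per dataset (such as whis); it is meant to show the order of operations, not a final signature.

```python
import numpy as np
from matplotlib import cbook


def _identity(x):
    return x


def boxplot_stats_transformed(data, transform_in=None, transform_out=None,
                              **kwargs):
    """Hypothetical wrapper illustrating the proposal; not matplotlib API."""
    transform_in = _identity if transform_in is None else transform_in
    transform_out = _identity if transform_out is None else transform_out

    output = []
    for subset in data:
        # Transform the raw values, then compute the usual statistics.
        transformed = transform_in(np.asarray(subset))
        stat_dict = cbook.boxplot_stats(transformed, **kwargs)[0]
        # Map every statistic except the label back to the original units.
        for key, value in stat_dict.items():
            if key != "label":
                stat_dict[key] = transform_out(value)
        output.append(stat_dict)
    return output


rng = np.random.default_rng(0)
groups = [rng.lognormal(0.0, 1.0, size=41), rng.lognormal(1.0, 1.0, size=41)]
log_stats = boxplot_stats_transformed(groups, np.log, np.exp)
plain_stats = boxplot_stats_transformed(groups)
```

The resulting list can be passed straight to Axes.bxp. Note that, as in the proposal, transform_out is applied to every entry, including derived quantities such as iqr.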
These transformations can then be added to the call signature of Axes.boxplot with little impact to that method's complexity. This is because they can be directly passed to cbook.boxplot_stats. Alternatively, Axes.boxplot could be modified to accept an optional statistical function kwarg and a dictionary of parameters to be directly passed to it.
At this point in the implementation users and external libraries like seaborn would have complete control via the Axes.boxplot method. More importantly, at the very least, seaborn would require no changes to its API to allow users to take advantage of these new options.
Simplifications to the Axes.boxplot API and other functions
Simplifying the boxplot method consists primarily of deprecating and then removing the redundant parameters. Optionally, a next step would include rectifying minor terminological inconsistencies between Axes.boxplot and Axes.bxp.
The parameters to be deprecated and removed include:
1. usermedians - processed by 10 SLOC, 3 if blocks, a for loop
2. conf_intervals - handled by 15 SLOC, 6 if blocks, a for loop
3. sym - processed by 12 SLOC, 4 if blocks
Removing the sym option allows all code handling the remaining styling parameters to be moved to Axes.bxp. This doesn't remove any complexity, but does reinforce the single responsibility principle among Axes.bxp, cbook.boxplot_stats, and Axes.boxplot.
Additionally, the notch parameter could be renamed shownotches to be consistent with Axes.bxp. This kind of cleanup could be taken a step further, and the whis, bootstrap, and autorange parameters could be rolled into the kwargs passed to the new statfxn parameter.
Backward compatibility
Implementation of this MEP would eventually result in the backwards incompatible deprecation and then removal of the keyword parameters usermedians, conf_intervals, and sym. Cursory searches on GitHub indicated that usermedians and conf_intervals are used by few users, who all seem to have a very strong knowledge of matplotlib. A robust deprecation cycle should provide sufficient time for these users to migrate to a new API.
Deprecation of sym, however, may have a much broader reach into the matplotlib userbase.
Schedule
An accelerated timeline could look like the following:
1. v2.0.1: add transforms to cbook.boxplot_stats, expose in Axes.boxplot
2. v2.1.0: initial deprecations
   a. Using 2D NumPy arrays as input. The semantics around 2D arrays are generally confusing.
   b. usermedians, conf_intervals, sym parameters
3. v2.2.0
   a. remove usermedians, conf_intervals, sym parameters
   b. deprecate notch in favor of shownotches to be consistent with other parameters and Axes.bxp
4. v2.3.0
   a. remove notch parameter
   b. move all style and artist toggling logic to Axes.bxp such that Axes.boxplot is little more than a broker between Axes.bxp and cbook.boxplot_stats
Anticipated Impacts to Users
As described above, deprecating usermedians and conf_intervals will likely impact few users. Those who will be impacted are almost certainly advanced users who will be able to adapt to the change.
Deprecating the sym option may impact more users, and effort should be taken to collect community feedback on this.
Anticipated Impacts to Downstream Libraries
The source code (GitHub master as of 2016-10-17) was inspected for seaborn and python-ggplot to see if these changes would impact their use. None of the parameters nominated for removal in this MEP are used by seaborn. The seaborn APIs that use matplotlib's boxplot function allow users to pass arbitrary **kwargs through to matplotlib's API. Thus seaborn users with modern matplotlib installations will be able to take full advantage of any new features added as a result of this MEP.
Python-ggplot has implemented its own function to draw boxplots. Therefore, no impact can come to it as a result of implementing this MEP.
Alternatives
Variations on the theme
This MEP can be divided into a few loosely coupled components:
1. Allowing pre- and post-computation transformation functions in cbook.boxplot_stats
2. Exposing that transformation in the Axes.boxplot API
3. Removing redundant statistical options in Axes.boxplot
4. Shifting all styling parameter processing from Axes.boxplot to Axes.bxp.
With this approach, #2 depends on #1, and #4 depends on #3.
There are two possible approaches to #2. The first and most direct would be to mirror the new transform_in and transform_out parameters of cbook.boxplot_stats in Axes.boxplot and pass them directly.
The second approach would be to add statfxn and statfxn_args parameters to Axes.boxplot. Under this implementation, the default value of statfxn would be cbook.boxplot_stats, but users could pass their own function. transform_in and transform_out would then be passed as elements of the statfxn_args parameter.

```python
# Pseudocode: "..." marks elided arguments, and do_stats stands for the
# existing statistics computation.
def boxplot_stats(data, ..., transform_in=None, transform_out=None):
    if transform_in is None:
        transform_in = lambda x: x

    if transform_out is None:
        transform_out = lambda x: x

    output = []
    for _d in data:
        d = transform_in(_d)
        stat_dict = do_stats(d)
        for key, value in stat_dict.items():
            if key != 'label':
                stat_dict[key] = transform_out(value)
        # Collect the statistics dictionary, not the transformed data.
        output.append(stat_dict)
    return output


class Axes(...):
    def boxplot_option1(self, data, ..., transform_in=None, transform_out=None):
        stats = cbook.boxplot_stats(data, ...,
                                    transform_in=transform_in,
                                    transform_out=transform_out)
        return self.bxp(stats, ...)

    def boxplot_option2(self, data, ..., statfxn=None, **statopts):
        if statfxn is None:
            statfxn = cbook.boxplot_stats
        stats = statfxn(data, **statopts)
        return self.bxp(stats, ...)
```
Both cases would allow users to do the following:

```python
fig, ax1 = plt.subplots()
artists1 = ax1.boxplot_optionX(data, transform_in=np.log,
                               transform_out=np.exp)
```
But Option Two lets a user write a completely custom stat function (e.g., my_box_stats) with fancy BCA confidence intervals and the whiskers set differently depending on some attribute of the data.
This is available under the current API:

```python
fig, ax1 = plt.subplots()
my_stats = my_box_stats(data, bootstrap_method='BCA',
                        whisker_method='dynamic')
ax1.bxp(my_stats)
```

And would be more concise with Option Two:

```python
fig, ax = plt.subplots()
statopts = dict(transform_in=np.log, transform_out=np.exp)
ax.boxplot(data, ..., **statopts)
```

Users could also pass their own function to compute the stats:

```python
fig, ax1 = plt.subplots()
ax1.boxplot(data, statfxn=my_box_stats, bootstrap_method='BCA',
            whisker_method='dynamic')
```
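For readers who want to try the Option Two pattern today, the following is a minimal sketch written as a free function rather than an Axes method. Both boxplot_option2 (here taking the axes as its first argument, plus an assumed bxp_kwargs dictionary) and the custom stat function sigma_box_stats are hypothetical names for this illustration; only cbook.boxplot_stats and Axes.bxp are existing matplotlib API.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook


def sigma_box_stats(data, k=2.0):
    """Hypothetical stat function: whiskers at mean +/- k standard deviations."""
    stats = cbook.boxplot_stats(data)
    for box, values in zip(stats, data):
        values = np.asarray(values)
        box["whislo"] = values.mean() - k * values.std()
        box["whishi"] = values.mean() + k * values.std()
        outside = (values < box["whislo"]) | (values > box["whishi"])
        box["fliers"] = values[outside]
    return stats


def boxplot_option2(ax, data, statfxn=None, bxp_kwargs=None, **statopts):
    """Sketch of Option Two as a broker between a stat function and Axes.bxp."""
    if statfxn is None:
        statfxn = cbook.boxplot_stats
    stats = statfxn(data, **statopts)
    return ax.bxp(stats, **(bxp_kwargs or {}))


rng = np.random.default_rng(0)
data = [rng.normal(size=200), rng.normal(2.0, 0.5, size=200)]

fig, ax = plt.subplots()
artists = boxplot_option2(ax, data, statfxn=sigma_box_stats, k=2.0)
```

The broker itself stays a few lines long regardless of how elaborate the stat function becomes, which is the main appeal of this option.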
From the examples above, Option Two seems to have only marginal benefit, but in the context of downstream libraries like seaborn, its advantage is more apparent as the following would be possible without any patches to seaborn:

```python
import seaborn
tips = seaborn.load_dataset('tips')
g = seaborn.factorplot(x="day", y="total_bill", hue="sex", data=tips,
                       kind='box', palette="PRGn",
                       shownotches=True,
                       statfxn=my_box_stats,
                       bootstrap_method='BCA',
                       whisker_method='dynamic')
```

This type of flexibility was the intention behind splitting the overall boxplot API into the current three functions. In practice, however, downstream libraries like seaborn support versions of matplotlib dating back well before the split. Thus, adding just a bit more flexibility to Axes.boxplot could expose all the functionality to users of the downstream libraries with modern matplotlib installations without intervention from the downstream library maintainers.
Doing less
Another obvious alternative would be to omit the added pre- and post-computation transform functionality in cbook.boxplot_stats and Axes.boxplot, and simply remove the redundant statistical and style parameters as described above.
Doing nothing
As with many things in life, doing nothing is an option here. This means we simply advocate for users and downstream libraries to take advantage of the split between cbook.boxplot_stats and Axes.bxp and let them decide how to provide an interface to that.