
What’s New

These are new features and improvements of note in each release.

v0.19.1 (November 3, 2016)

This is a minor bug-fix release from 0.19.0 and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

What’s new in v0.19.1

Performance Improvements

  • Fixed performance regression in factorization of Period data (GH14338)
  • Fixed performance regression in Series.asof(where) when where is a scalar (GH14461)
  • Improved performance in DataFrame.asof(where) when where is a scalar (GH14461)
  • Improved performance in .to_json() when lines=True (GH14408)
  • Improved performance in certain types of loc indexing with a MultiIndex (GH14551).

Bug Fixes

  • Source installs from PyPI will now again work without cython installed, as in previous versions (GH14204)
  • Compat with Cython 0.25 for building (GH14496)
  • Fixed regression where user-provided file handles were closed in read_csv (c engine) (GH14418).
  • Fixed regression in DataFrame.quantile when missing values were present in some columns (GH14357).
  • Fixed regression in Index.difference where the freq of a DatetimeIndex was incorrectly set (GH14323)
  • Added back pandas.core.common.array_equivalent with a deprecation warning (GH14555).
  • Bug in pd.read_csv for the C engine in which quotation marks were improperly parsed in skipped rows (GH14459)
  • Bug in pd.read_csv for Python 2.x in which Unicode quote characters were no longer being respected (GH14477)
  • Fixed regression in Index.append when categorical indices were appended (GH14545).
  • Fixed regression in pd.DataFrame where the constructor failed when given a dict with a None value (GH14381)
  • Fixed regression in DatetimeIndex._maybe_cast_slice_bound when the index is empty (GH14354).
  • Bug in localizing an ambiguous timezone when a boolean is passed (GH14402)
  • Bug in TimedeltaIndex addition with a Datetime-like object where addition overflow in the negative direction was not being caught (GH14068, GH14453)
  • Bug in string indexing against data with an object Index that could raise AttributeError (GH14424)
  • Correctly raise ValueError on empty input to pd.eval() and df.query() (GH13139)
  • Bug in RangeIndex.intersection when the result is an empty set (GH14364).
  • Bug in groupby-transform broadcasting that could cause incorrect dtype coercion (GH14457)
  • Bug in Series.__setitem__ which allowed mutating read-only arrays (GH14359).
  • Bug in DataFrame.insert where multiple calls with duplicate columns could fail (GH14291)
  • pd.merge() will raise ValueError when non-boolean parameters are passed for boolean-type arguments (GH14434)
  • Bug in Timestamp where dates very near the minimum (1677-09) could underflow on creation (GH14415)
  • Bug in pd.concat where the names of the keys were not propagated to the resulting MultiIndex (GH14252)
  • Bug in pd.concat where axis could not take the string parameters 'rows' or 'columns' (GH14369)
  • Bug in pd.concat with dataframes heterogeneous in length and tuple keys (GH14438)
  • Bug in MultiIndex.set_levels where illegal level values were still set after raising an error (GH13754)
  • Bug in DataFrame.to_json where lines=True and a value contained a } character (GH14391)
  • Bug in df.groupby causing an AttributeError when grouping a single-index frame by a column and the index level (GH14327)
  • Bug in df.groupby where a TypeError was raised when pd.Grouper(key=...) is passed in a list (GH14334)
  • Bug in pd.pivot_table which could raise TypeError or ValueError when index or columns is not scalar and values is not specified (GH14380)
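The empty-input fix for pd.eval() and df.query() (GH13139) is easy to exercise directly. A minimal sketch of the now-correct behavior (current pandas releases behave the same way):

```python
import pandas as pd

def rejects_empty(expr):
    """Return True if pd.eval refuses the expression with ValueError."""
    try:
        pd.eval(expr)
    except ValueError:
        return True
    return False

empty_rejected = rejects_empty('')   # empty input now raises ValueError
valid_result = pd.eval('1 + 2')      # non-empty expressions still evaluate
```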

v0.19.0 (October 2, 2016)

This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • merge_asof() for asof-style time-series joining, see here
  • .rolling() is now time-series aware, see here
  • read_csv() now supports parsing Categorical data, see here
  • A function union_categoricals() has been added for combining categoricals, see here
  • PeriodIndex now has its own period dtype, and has changed to be more consistent with other Index classes. See here
  • Sparse data structures gained enhanced support for int and bool dtypes, see here
  • Comparison operations with Series no longer ignore the index, see here for an overview of the API changes.
  • Introduction of a pandas development API for utility functions, see here.
  • Deprecation of Panel4D and PanelND. We recommend representing these types of n-dimensional data with the xarray package.
  • Removal of the previously deprecated modules pandas.io.data, pandas.io.wb, pandas.tools.rplot.

Warning

pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.

What’s new in v0.19.0

New features

merge_asof for asof-style time-series joining

A long-time requested feature has been added through the merge_asof() function, to support asof-style joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.

The merge_asof() function performs an asof merge, which is similar to a left join, except that we match on the nearest key rather than equal keys.

In [1]: left = pd.DataFrame({'a': [1, 5, 10],
   ...:                      'left_val': ['a', 'b', 'c']})
   ...:

In [2]: right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
   ...:                       'right_val': [1, 2, 3, 6, 7]})
   ...:

In [3]: left
Out[3]:
    a left_val
0   1        a
1   5        b
2  10        c

In [4]: right
Out[4]:
   a  right_val
0  1          1
1  2          2
2  3          3
3  6          6
4  7          7

We typically want to match exactly when possible, and use the most recent value otherwise.

In [5]: pd.merge_asof(left, right, on='a')
Out[5]:
    a left_val  right_val
0   1        a          1
1   5        b          3
2  10        c          7

We can also match rows ONLY with prior data, and not an exact match.

In [6]: pd.merge_asof(left, right, on='a', allow_exact_matches=False)
Out[6]:
    a left_val  right_val
0   1        a        NaN
1   5        b        3.0
2  10        c        7.0

In a typical time-series example, we have trades and quotes and we want to asof-join them. This also illustrates using the by parameter to group data before merging.

In [7]: trades = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.038',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048']),
   ...:     'ticker': ['MSFT', 'MSFT',
   ...:                'GOOG', 'GOOG', 'AAPL'],
   ...:     'price': [51.95, 51.95,
   ...:               720.77, 720.92, 98.00],
   ...:     'quantity': [75, 155,
   ...:                  100, 100, 100]},
   ...:     columns=['time', 'ticker', 'price', 'quantity'])
   ...:

In [8]: quotes = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.030',
   ...:                             '20160525 13:30:00.041',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.049',
   ...:                             '20160525 13:30:00.072',
   ...:                             '20160525 13:30:00.075']),
   ...:     'ticker': ['GOOG', 'MSFT', 'MSFT',
   ...:                'MSFT', 'GOOG', 'AAPL', 'GOOG',
   ...:                'MSFT'],
   ...:     'bid': [720.50, 51.95, 51.97, 51.99,
   ...:             720.50, 97.99, 720.50, 52.01],
   ...:     'ask': [720.93, 51.96, 51.98, 52.00,
   ...:             720.93, 98.01, 720.88, 52.03]},
   ...:     columns=['time', 'ticker', 'bid', 'ask'])
   ...:

In [9]: trades
Out[9]:
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [10]: quotes
Out[10]:
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

An asof merge joins on the on field, typically a datetimelike field which is ordered, and in this case we are using a grouper in the by field. This is like a left-outer join, except that forward filling happens automatically, taking the most recent non-NaN value.

In [11]: pd.merge_asof(trades, quotes,
   ....:               on='time',
   ....:               by='ticker')
   ....:
Out[11]:
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

This returns a merged DataFrame with the entries in the same order as the original left passed DataFrame (trades in this case), with the fields of the quotes merged.

.rolling() is now time-series aware

.rolling() objects are now time-series aware and can accept a time-series offset (or convertible) for the window argument (GH13327, GH12995). See the full documentation here.

In [12]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
   ....:                    index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))
   ....:

In [13]: dft
Out[13]:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  2.0
2013-01-01 09:00:03  NaN
2013-01-01 09:00:04  4.0

This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.

In [14]: dft.rolling(2).sum()
Out[14]:
                       B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  NaN
2013-01-01 09:00:04  NaN

In [15]: dft.rolling(2, min_periods=1).sum()
Out[15]:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:04  4.0

Specifying an offset allows a more intuitive specification of the rolling frequency.

In [16]: dft.rolling('2s').sum()
Out[16]:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:04  4.0

Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.

In [17]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
   ....:                    index=pd.Index([pd.Timestamp('20130101 09:00:00'),
   ....:                                    pd.Timestamp('20130101 09:00:02'),
   ....:                                    pd.Timestamp('20130101 09:00:03'),
   ....:                                    pd.Timestamp('20130101 09:00:05'),
   ....:                                    pd.Timestamp('20130101 09:00:06')],
   ....:                                   name='foo'))
   ....:

In [18]: dft
Out[18]:
                       B
foo
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

In [19]: dft.rolling(2).sum()
Out[19]:
                       B
foo
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  NaN

Using the time-specification generates variable windows for this sparse data.

In [20]: dft.rolling('2s').sum()
Out[20]:
                       B
foo
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the index) in a DataFrame.

In [21]: dft = dft.reset_index()

In [22]: dft
Out[22]:
                  foo    B
0 2013-01-01 09:00:00  0.0
1 2013-01-01 09:00:02  1.0
2 2013-01-01 09:00:03  2.0
3 2013-01-01 09:00:05  NaN
4 2013-01-01 09:00:06  4.0

In [23]: dft.rolling('2s', on='foo').sum()
Out[23]:
                  foo    B
0 2013-01-01 09:00:00  0.0
1 2013-01-01 09:00:02  1.0
2 2013-01-01 09:00:03  3.0
3 2013-01-01 09:00:05  NaN
4 2013-01-01 09:00:06  4.0

read_csv has improved support for duplicate column names

Duplicate column names are now supported in read_csv() whether they are in the file or passed in as the names parameter (GH7160, GH9424)

In [24]: data = '0,1,2\n3,4,5'

In [25]: names = ['a', 'b', 'a']

Previous behavior:

In [2]: pd.read_csv(StringIO(data), names=names)
Out[2]:
   a  b  a
0  2  1  2
1  5  4  5

The first a column contained the same data as the second a column, when it should have contained the values [0, 3].

New behavior:

In [26]: pd.read_csv(StringIO(data), names=names)
Out[26]:
   a  b  a.1
0  0  1    2
1  3  4    5

read_csv supports parsing Categorical directly

The read_csv() function now supports parsing a Categorical column when specified as a dtype (GH10153). Depending on the structure of the data, this can result in a faster parse time and lower memory usage compared to converting to Categorical after parsing. See the io docs here.

In [27]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

In [28]: pd.read_csv(StringIO(data))
Out[28]:
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [29]: pd.read_csv(StringIO(data)).dtypes
Out[29]:
col1    object
col2    object
col3     int64
dtype: object

In [30]: pd.read_csv(StringIO(data), dtype='category').dtypes
Out[30]:
col1    category
col2    category
col3    category
dtype: object

Individual columns can be parsed as a Categorical using a dict specification:

In [31]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
Out[31]:
col1    category
col2      object
col3       int64
dtype: object

Note

The resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().

In [32]: df = pd.read_csv(StringIO(data), dtype='category')

In [33]: df.dtypes
Out[33]:
col1    category
col2    category
col3    category
dtype: object

In [34]: df['col3']
Out[34]:
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]

In [35]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)

In [36]: df['col3']
Out[36]:
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]

Categorical Concatenation

  • A function union_categoricals() has been added for combining categoricals, see Unioning Categoricals (GH13361, GH13763, GH13846, GH14173)

    In [37]: from pandas.types.concat import union_categoricals

    In [38]: a = pd.Categorical(["b", "c"])

    In [39]: b = pd.Categorical(["a", "b"])

    In [40]: union_categoricals([a, b])
    Out[40]:
    [b, c, a, b]
    Categories (3, object): [b, c, a]
  • concat and append can now concat category dtypes with different categories as object dtype (GH13524)

    In [41]: s1 = pd.Series(['a', 'b'], dtype='category')

    In [42]: s2 = pd.Series(['b', 'c'], dtype='category')

    Previous behavior:

    In [1]: pd.concat([s1, s2])
    ValueError: incompatible categories in categorical concat

    New behavior:

    In [43]: pd.concat([s1, s2])
    Out[43]:
    0    a
    1    b
    0    b
    1    c
    dtype: object

Semi-Month Offsets

Pandas has gained new frequency offsets, SemiMonthEnd ('SM') and SemiMonthBegin ('SMS'). These provide date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively. (GH1543)

In [44]: from pandas.tseries.offsets import SemiMonthEnd, SemiMonthBegin

SemiMonthEnd:

In [45]: pd.Timestamp('2016-01-01') + SemiMonthEnd()
Out[45]: Timestamp('2016-01-15 00:00:00')

In [46]: pd.date_range('2015-01-01', freq='SM', periods=4)
Out[46]: DatetimeIndex(['2015-01-15', '2015-01-31', '2015-02-15', '2015-02-28'], dtype='datetime64[ns]', freq='SM-15')

SemiMonthBegin:

In [47]: pd.Timestamp('2016-01-01') + SemiMonthBegin()
Out[47]: Timestamp('2016-01-15 00:00:00')

In [48]: pd.date_range('2015-01-01', freq='SMS', periods=4)
Out[48]: DatetimeIndex(['2015-01-01', '2015-01-15', '2015-02-01', '2015-02-15'], dtype='datetime64[ns]', freq='SMS-15')

Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.

In [49]: pd.date_range('2015-01-01', freq='SMS-16', periods=4)
Out[49]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'], dtype='datetime64[ns]', freq='SMS-16')

In [50]: pd.date_range('2015-01-01', freq='SM-14', periods=4)
Out[50]: DatetimeIndex(['2015-01-14', '2015-01-31', '2015-02-14', '2015-02-28'], dtype='datetime64[ns]', freq='SM-14')

New Index methods

The following methods and options are added to Index, to be more consistent with the Series and DataFrame API.

Index now supports the .where() function for same shape indexing (GH13170)

In [51]: idx = pd.Index(['a', 'b', 'c'])

In [52]: idx.where([True, False, True])
Out[52]: Index([u'a', nan, u'c'], dtype='object')

Index now supports .dropna() to exclude missing values (GH6194)

In [53]: idx = pd.Index([1, 2, np.nan, 4])

In [54]: idx.dropna()
Out[54]: Float64Index([1.0, 2.0, 4.0], dtype='float64')

For MultiIndex, values are dropped if any level is missing by default. Specifying how='all' only drops values where all levels are missing.

In [55]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
   ....:                                   [1, 2, np.nan, np.nan]])
   ....:

In [56]: midx
Out[56]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1, -1, 2], [0, 1, -1, -1]])

In [57]: midx.dropna()
Out[57]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1], [0, 1]])

In [58]: midx.dropna(how='all')
Out[58]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1, 2], [0, 1, -1]])

Index now supports .str.extractall() which returns a DataFrame, see the docs here (GH10008, GH13156)

In [59]: idx = pd.Index(["a1a2", "b1", "c1"])

In [60]: idx.str.extractall("[ab](?P<digit>\d)")
Out[60]:
        digit
  match
0 0         1
  1         2
1 0         1

Index.astype() now accepts an optional boolean argument copy, which allows optional copying if the requirements on dtype are satisfied (GH13209)

Google BigQuery Enhancements

  • The read_gbq() method has gained the dialect argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the docs for more details (GH13615).
  • The to_gbq() method now allows the DataFrame column order to differ from the destination table schema (GH11359).

Fine-grained numpy errstate

Previous versions of pandas would permanently silence numpy's ufunc error handling when pandas was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as NaNs. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas codebase. (GH13109, GH13145)

After upgrading pandas, you may see new RuntimeWarnings being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning to control how these conditions are handled.
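Controlling such warnings yourself with numpy.errstate looks like this (plain numpy, not pandas-specific; a minimal sketch):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])

# Inside the context manager, divide-by-zero and invalid-value
# ufunc warnings are suppressed; outside of it they fire again.
with np.errstate(divide='ignore', invalid='ignore'):
    result = a / b  # 1.0/0.0 -> inf, 0.0/0.0 -> nan, silently
```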

get_dummies now returns integer dtypes

The pd.get_dummies function now returns dummy-encoded columns as small integers, rather than floats (GH8725). This should provide an improved memory footprint.

Previous behavior:

In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
Out[1]:
a    float64
b    float64
c    float64
dtype: object

New behavior:

In [61]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
Out[61]:
a    uint8
b    uint8
c    uint8
dtype: object

Downcast values to smallest possible dtype into_numeric

pd.to_numeric() now accepts a downcast parameter, which will downcast the data if possible to the smallest specified numerical dtype (GH13352)

In [62]: s = ['1', 2, 3]

In [63]: pd.to_numeric(s, downcast='unsigned')
Out[63]: array([1, 2, 3], dtype=uint8)

In [64]: pd.to_numeric(s, downcast='integer')
Out[64]: array([1, 2, 3], dtype=int8)

pandas development API

As part of making the pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in future versions of pandas (GH13147, GH13634)

The following are now part of this API:

In [65]: import pprint

In [66]: from pandas.api import types

In [67]: funcs = [f for f in dir(types) if not f.startswith('_')]

In [68]: pprint.pprint(funcs)
['is_any_int_dtype',
 'is_bool',
 'is_bool_dtype',
 'is_categorical',
 'is_categorical_dtype',
 'is_complex',
 'is_complex_dtype',
 'is_datetime64_any_dtype',
 'is_datetime64_dtype',
 'is_datetime64_ns_dtype',
 'is_datetime64tz_dtype',
 'is_datetimetz',
 'is_dict_like',
 'is_dtype_equal',
 'is_extension_type',
 'is_float',
 'is_float_dtype',
 'is_floating_dtype',
 'is_hashable',
 'is_int64_dtype',
 'is_integer',
 'is_integer_dtype',
 'is_iterator',
 'is_list_like',
 'is_named_tuple',
 'is_number',
 'is_numeric_dtype',
 'is_object_dtype',
 'is_period',
 'is_period_dtype',
 'is_re',
 'is_re_compilable',
 'is_scalar',
 'is_sequence',
 'is_sparse',
 'is_string_dtype',
 'is_timedelta64_dtype',
 'is_timedelta64_ns_dtype',
 'pandas_dtype']

Note

Calling these functions from the internal module pandas.core.common will now show a DeprecationWarning (GH13990)

Other enhancements

  • Timestamp can now accept positional and keyword parameters similar to datetime.datetime() (GH10758, GH11630)

    In [69]: pd.Timestamp(2012, 1, 1)
    Out[69]: Timestamp('2012-01-01 00:00:00')

    In [70]: pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30)
    Out[70]: Timestamp('2012-01-01 08:30:00')
  • The .resample() function now accepts an on= or level= parameter for resampling on a datetimelike column or MultiIndex level (GH13500)

    In [71]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
       ....:                    'a': np.arange(5)},
       ....:                   index=pd.MultiIndex.from_arrays([
       ....:                       [1, 2, 3, 4, 5],
       ....:                       pd.date_range('2015-01-01', freq='W', periods=5)],
       ....:                       names=['v', 'd']))
       ....:

    In [72]: df
    Out[72]:
                  a       date
    v d
    1 2015-01-04  0 2015-01-04
    2 2015-01-11  1 2015-01-11
    3 2015-01-18  2 2015-01-18
    4 2015-01-25  3 2015-01-25
    5 2015-02-01  4 2015-02-01

    In [73]: df.resample('M', on='date').sum()
    Out[73]:
                a
    date
    2015-01-31  6
    2015-02-28  4

    In [74]: df.resample('M', level='d').sum()
    Out[74]:
                a
    d
    2015-01-31  6
    2015-02-28  4
  • The .get_credentials() method of GbqConnector can now first try to fetch the application default credentials. See the docs for more details (GH13577).

  • The .tz_localize() method of DatetimeIndex and Timestamp has gained the errors keyword, so you can potentially coerce nonexistent timestamps to NaT. The default behavior remains raising a NonExistentTimeError (GH13057)

  • .to_hdf/read_hdf() now accept path objects (e.g. pathlib.Path, py.path.local) for the file path (GH11773)

  • pd.read_csv() with engine='python' has gained support for the decimal (GH12933), na_filter (GH13321) and memory_map options (GH13381).

  • Consistent with the Python API, pd.read_csv() will now interpret +inf as positive infinity (GH13274)

  • pd.read_html() has gained support for the na_values, converters, keep_default_na options (GH13461)

  • Categorical.astype() now accepts an optional boolean argument copy, effective when dtype is categorical (GH13209)

  • DataFrame has gained the .asof() method to return the last non-NaN values according to the selected subset (GH13358)

  • The DataFrame constructor will now respect key ordering if a list of OrderedDict objects is passed in (GH13304)

  • pd.read_html() has gained support for the decimal option (GH12907)

  • Series has gained the properties .is_monotonic, .is_monotonic_increasing, .is_monotonic_decreasing, similar to Index (GH13336)

  • DataFrame.to_sql() now allows a single value as the SQL type for all columns (GH11886).

  • Series.append now supports the ignore_index option (GH13677)

  • .to_stata() and StataWriter can now write variable labels to Stata dta files using a dictionary to map column names to labels (GH13535, GH13536)

  • .to_stata() and StataWriter will automatically convert datetime64[ns] columns to Stata format %tc, rather than raising a ValueError (GH12259)

  • read_stata() and StataReader raise with a more explicit error message when reading Stata files with repeated value labels when convert_categoricals=True (GH13923)

  • DataFrame.style will now render sparsified MultiIndexes (GH11655)

  • DataFrame.style will now show column level names (e.g. DataFrame.columns.names) (GH13775)

  • DataFrame has gained support to re-order the columns based on the values in a row using df.sort_values(by='...', axis=1) (GH10806)

    In [75]: df = pd.DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]},
       ....:                   index=['row1', 'row2'])
       ....:

    In [76]: df
    Out[76]:
          A  B  C
    row1  2  3  4
    row2  7  5  8

    In [77]: df.sort_values(by='row2', axis=1)
    Out[77]:
          B  A  C
    row1  3  2  4
    row2  5  7  8
  • Added documentation to I/O regarding the perils of reading in columns with mixed dtypes and how to handle it (GH13746)

  • to_html() now has a border argument to control the value in the opening <table> tag. The default is the value of the html.border option, which defaults to 1. This also affects the notebook HTML repr, but since Jupyter's CSS includes a border-width attribute, the visual effect is the same. (GH11563).

  • Raise ImportError in the sql functions when sqlalchemy is not installed and a connection string is used (GH11920).

  • Compatibility with matplotlib 2.0. Older versions of pandas should also work with matplotlib 2.0 (GH13333)

  • Timestamp, Period, DatetimeIndex, PeriodIndex and the .dt accessor have gained a .is_leap_year property to check whether the date belongs to a leap year. (GH13727)

  • astype() will now accept a dict of column name to data type mappings as the dtype argument. (GH12086)

  • pd.read_json and DataFrame.to_json have gained support for reading and writing json lines with the lines option, see Line delimited json (GH9180)

  • read_excel() now supports the true_values and false_values keyword arguments (GH13347)

  • groupby() will now accept a scalar and a single-element list for specifying level on a non-MultiIndex grouper. (GH13907)

  • Non-convertible dates in an excel date column will be returned without conversion and the column will be object dtype, rather than raising an exception (GH10001).

  • pd.Timedelta(None) is now accepted and will return NaT, mirroring pd.Timestamp (GH13687)

  • pd.read_stata() can now handle some format 111 files, which are produced by SAS when generating Stata dta files (GH11526)

  • Series and Index now support divmod, which will return a tuple of series or indices. This behaves like a standard binary operator with regards to broadcasting rules (GH14208).
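Two of the enhancements above can be sketched together: divmod on a Series returns a quotient/remainder pair of Series, and astype() accepts a dict mapping column names to dtypes so only some columns are converted (a minimal sketch; later pandas versions behave the same way):

```python
import pandas as pd

s = pd.Series([10, 25, 33])
q, r = divmod(s, 7)  # element-wise quotient and remainder, each a Series

df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})
converted = df.astype({'a': 'float64'})  # only column 'a' is converted
```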

API changes

Series.tolist() will now return Python types

Series.tolist() will now return Python types in the output, mimicking NumPy .tolist() behavior (GH10904)

In [78]: s = pd.Series([1, 2, 3])

Previous behavior:

In [7]: type(s.tolist()[0])
Out[7]: <class 'numpy.int64'>

New behavior:

In [79]: type(s.tolist()[0])
Out[79]: int

Series operators for different indexes

The following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

  • Series comparison operators now raise ValueError when the indexes are different.
  • Series logical operators align the index of both the left and right hand side.

Warning

Until 0.18.1, comparing Series with the same length would succeed even if the .index are different (the result ignores .index). As of 0.19.0, this will raise a ValueError to be more strict. This section also describes how to keep the previous behavior or align different indexes, using flexible comparison methods like .eq.

As a result, Series and DataFrame operators behave as below:

Arithmetic operators

Arithmetic operators align both index (no changes).

In [80]: s1 = pd.Series([1, 2, 3], index=list('ABC'))

In [81]: s2 = pd.Series([2, 2, 2], index=list('ABD'))

In [82]: s1 + s2
Out[82]:
A    3.0
B    4.0
C    NaN
D    NaN
dtype: float64

In [83]: df1 = pd.DataFrame([1, 2, 3], index=list('ABC'))

In [84]: df2 = pd.DataFrame([2, 2, 2], index=list('ABD'))

In [85]: df1 + df2
Out[85]:
     0
A  3.0
B  4.0
C  NaN
D  NaN
Comparison operators

Comparison operators raise ValueError when .index are different.

Previous Behavior (Series):

Series compared values ignoring the .index as long as both had the same length:

In [1]: s1 == s2
Out[1]:
A    False
B     True
C    False
dtype: bool

New behavior (Series):

In [2]: s1 == s2
ValueError: Can only compare identically-labeled Series objects

Note

To achieve the same result as previous versions (compare values based on locations ignoring .index), compare both .values.

In [86]: s1.values == s2.values
Out[86]: array([False,  True, False], dtype=bool)

If you want to compare Series aligning its .index, see the flexible comparison methods section below:

In [87]: s1.eq(s2)
Out[87]:
A    False
B     True
C    False
D    False
dtype: bool

Current Behavior (DataFrame, no change):

In [3]: df1 == df2
ValueError: Can only compare identically-labeled DataFrame objects
Logical operators

Logical operators align both .index of the left and right hand side.

Previous behavior (Series), only the left hand side index was kept:

In [4]: s1 = pd.Series([True, False, True], index=list('ABC'))

In [5]: s2 = pd.Series([True, True, True], index=list('ABD'))

In [6]: s1 & s2
Out[6]:
A     True
B    False
C    False
dtype: bool

New behavior (Series):

In [88]: s1 = pd.Series([True, False, True], index=list('ABC'))

In [89]: s2 = pd.Series([True, True, True], index=list('ABD'))

In [90]: s1 & s2
Out[90]:
A     True
B    False
C    False
D    False
dtype: bool

Note

Series logical operators fill a NaN result with False.

Note

To achieve the same result as previous versions (compare values based on only the left hand side index), you can use reindex_like:

In [91]: s1 & s2.reindex_like(s1)
Out[91]:
A     True
B    False
C    False
dtype: bool

Current Behavior (DataFrame, no change):

In [92]: df1 = pd.DataFrame([True, False, True], index=list('ABC'))

In [93]: df2 = pd.DataFrame([True, True, True], index=list('ABD'))

In [94]: df1 & df2
Out[94]:
       0
A   True
B  False
C    NaN
D    NaN
Flexible comparison methods

Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both index. Use these operators if you want to compare two Series which have different index.

In [95]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [96]: s2 = pd.Series([2, 2, 2], index=['b', 'c', 'd'])

In [97]: s1.eq(s2)
Out[97]:
a    False
b     True
c    False
d    False
dtype: bool

In [98]: s1.ge(s2)
Out[98]:
a    False
b     True
c     True
d    False
dtype: bool

Previously, this worked the same as comparison operators (see above).

Series type promotion on assignment

A Series will now correctly promote its dtype for assignment with values incompatible with the current dtype (GH13234)

In [99]: s = pd.Series()

Previous behavior:

In [2]: s["a"] = pd.Timestamp("2016-01-01")

In [3]: s["b"] = 3.0
TypeError: invalid type promotion

New behavior:

In [100]: s["a"] = pd.Timestamp("2016-01-01")

In [101]: s["b"] = 3.0

In [102]: s
Out[102]:
a    2016-01-01 00:00:00
b                      3
dtype: object

In [103]: s.dtype
Out[103]: dtype('O')

.to_datetime() changes

Previously if .to_datetime() encountered mixed integers/floats and strings, but no datetimes, with errors='coerce' it would convert all to NaT.

Previous behavior:

In [2]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[2]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

Current behavior:

This will now convert integers/floats with the default unit of ns.

In [104]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[104]: DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)

Bug fixes related to.to_datetime():

  • Bug in pd.to_datetime() when passing integers or floats, and no unit and errors='coerce' (GH13180).
  • Bug in pd.to_datetime() when passing invalid datatypes (e.g. bool); will now respect the errors keyword (GH13176)
  • Bug in pd.to_datetime() which overflowed on int8 and int16 dtypes (GH13451)
  • Bug in pd.to_datetime() raising AttributeError with NaN and the other string not valid when errors='ignore' (GH12424)
  • Bug in pd.to_datetime() which did not cast floats correctly when unit was specified, resulting in truncated datetimes (GH13834)

Merging changes

Merging will now preserve the dtype of the join keys (GH8596)

In [105]: df1 = pd.DataFrame({'key': [1], 'v1': [10]})

In [106]: df1
Out[106]:
   key  v1
0    1  10

In [107]: df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})

In [108]: df2
Out[108]:
   key  v1
0    1  20
1    2  30

Previous behavior:

In [5]: pd.merge(df1, df2, how='outer')
Out[5]:
   key    v1
0  1.0  10.0
1  1.0  20.0
2  2.0  30.0

In [6]: pd.merge(df1, df2, how='outer').dtypes
Out[6]:
key    float64
v1     float64
dtype: object

New behavior:

We are able to preserve the join keys

In [109]: pd.merge(df1, df2, how='outer')
Out[109]:
   key  v1
0    1  10
1    1  20
2    2  30

In [110]: pd.merge(df1, df2, how='outer').dtypes
Out[110]:
key    int64
v1     int64
dtype: object

Of course, if missing values are introduced, then the resulting dtype will be upcast, which is unchanged from previous versions.

In [111]: pd.merge(df1, df2, how='outer', on='key')
Out[111]:
   key  v1_x  v1_y
0    1  10.0    20
1    2   NaN    30

In [112]: pd.merge(df1, df2, how='outer', on='key').dtypes
Out[112]:
key       int64
v1_x    float64
v1_y      int64
dtype: object

.describe() changes

Percentile identifiers in the index of a .describe() output will now be rounded to the least precision that keeps them distinct (GH13104)

In [113]: s = pd.Series([0, 1, 2, 3, 4])

In [114]: df = pd.DataFrame([0, 1, 2, 3, 4])

Previous behavior:

The percentiles were rounded to at most one decimal place, which could raise ValueError for a data frame if the percentiles were duplicated.

In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[3]:
count     5.000000
mean      2.000000
std       1.581139
min       0.000000
0.0%      0.000400
0.1%      0.002000
0.1%      0.004000
50%       2.000000
99.9%     3.996000
100.0%    3.998000
100.0%    3.999600
max       4.000000
dtype: float64

In [4]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[4]:
...
ValueError: cannot reindex from a duplicate axis

New behavior:

In [115]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[115]:
count     5.000000
mean      2.000000
std       1.581139
min       0.000000
0.01%     0.000400
0.05%     0.002000
0.1%      0.004000
50%       2.000000
99.9%     3.996000
99.95%    3.998000
99.99%    3.999600
max       4.000000
dtype: float64

In [116]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[116]:
               0
count   5.000000
mean    2.000000
std     1.581139
min     0.000000
0.01%   0.000400
0.05%   0.002000
0.1%    0.004000
50%     2.000000
99.9%   3.996000
99.95%  3.998000
99.99%  3.999600
max     4.000000

Furthermore:

  • Passing duplicated percentiles will now raise a ValueError.
  • Bug in .describe() on a DataFrame with a mixed-dtype column index, which would previously raise a TypeError (GH13288)
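A minimal illustration of the duplicate check (a sketch, not from the original docs):

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])

# Duplicated percentiles are now rejected outright rather than
# producing duplicate index labels in the output.
try:
    s.describe(percentiles=[0.1, 0.1, 0.9])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```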

Period changes

PeriodIndex now has period dtype

PeriodIndex now has its own period dtype. The period dtype is a pandas extension dtype, like category or the timezone-aware dtype (datetime64[ns, tz]) (GH13941). As a consequence of this change, PeriodIndex no longer has an integer dtype:

Previous behavior:

In [1]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')

In [2]: pi
Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D')

In [3]: pd.api.types.is_integer_dtype(pi)
Out[3]: True

In [4]: pi.dtype
Out[4]: dtype('int64')

New behavior:

In [117]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')

In [118]: pi
Out[118]: PeriodIndex(['2016-08-01'], dtype='period[D]', freq='D')

In [119]: pd.api.types.is_integer_dtype(pi)
Out[119]: False

In [120]: pd.api.types.is_period_dtype(pi)
Out[120]: True

In [121]: pi.dtype
Out[121]: period[D]

In [122]: type(pi.dtype)
Out[122]: pandas.types.dtypes.PeriodDtype

Period('NaT') now returns pd.NaT

Previously, Period had its own Period('NaT') representation, distinct from pd.NaT. Period('NaT') has now been changed to return pd.NaT. (GH12759, GH13582)

Previous behavior:

In [5]: pd.Period('NaT', freq='D')
Out[5]: Period('NaT', 'D')

New behavior:

These now result in pd.NaT without needing to provide the freq option.

In [123]: pd.Period('NaT')
Out[123]: NaT

In [124]: pd.Period(None)
Out[124]: NaT

To be compatible with Period addition and subtraction, pd.NaT now supports addition and subtraction with int. Previously this raised a ValueError.

Previous behavior:

In [5]: pd.NaT + 1
...
ValueError: Cannot add integral value to Timestamp without freq.

New behavior:

In [125]: pd.NaT + 1
Out[125]: NaT

In [126]: pd.NaT - 1
Out[126]: NaT

PeriodIndex.values now returns an array of Period objects

.values is changed to return an array of Period objects, rather than an array of integers (GH13988).

Previous behavior:

In [6]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')

In [7]: pi.values
Out[7]: array([492, 493])

New behavior:

In [127]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')

In [128]: pi.values
Out[128]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)

Index + / - no longer used for set operations

Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) previously performed set operations (set union and difference). This behavior had been deprecated since 0.15.0 (in favor of the specific .union() and .difference() methods), and is now disabled. Where possible, + and - are now used for element-wise operations, for example for concatenating strings or subtracting datetimes (GH8227, GH14127).

Previous behavior:

In [1]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union()
Out[1]: Index(['a', 'b', 'c'], dtype='object')

New behavior: the same operation will now perform element-wise addition:

In [129]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
Out[129]: Index([u'aa', u'bc'], dtype='object')

Note that numeric Index objects already performed element-wise operations. For example, the behavior of adding two integer Indexes is unchanged. The base Index is now made consistent with this behavior.

In [130]: pd.Index([1, 2, 3]) + pd.Index([2, 3, 4])
Out[130]: Int64Index([3, 5, 7], dtype='int64')
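To retain the old set-operation semantics, the explicit methods mentioned above can be used. A small sketch:

```python
import pandas as pd

left = pd.Index(['a', 'b'])
right = pd.Index(['a', 'c'])

# Explicit set methods replace the removed +/- set behavior.
print(list(left.union(right)))        # ['a', 'b', 'c']
print(list(left.difference(right)))   # ['b']
```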

Further, because of this change, it is now possible to subtract two DatetimeIndex objects, resulting in a TimedeltaIndex:

Previous behavior:

In [1]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()
Out[1]: DatetimeIndex(['2016-01-01'], dtype='datetime64[ns]', freq=None)

New behavior:

In [131]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
Out[131]: TimedeltaIndex(['-1 days', '-1 days'], dtype='timedelta64[ns]', freq=None)

Index.difference and .symmetric_difference changes

Index.difference and Index.symmetric_difference will now, more consistently, treat NaN values as any other values (GH13514).

In [132]: idx1 = pd.Index([1, 2, 3, np.nan])

In [133]: idx2 = pd.Index([0, 1, np.nan])

Previous behavior:

In [3]: idx1.difference(idx2)
Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64')

In [4]: idx1.symmetric_difference(idx2)
Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')

New behavior:

In [134]: idx1.difference(idx2)
Out[134]: Float64Index([2.0, 3.0], dtype='float64')

In [135]: idx1.symmetric_difference(idx2)
Out[135]: Float64Index([0.0, 2.0, 3.0], dtype='float64')

Index.unique consistently returns Index

Index.unique() now returns unique values as an Index of the appropriate dtype (GH13395). Previously, most Index classes returned np.ndarray, while DatetimeIndex, TimedeltaIndex and PeriodIndex returned an Index to keep metadata like the timezone.

Previous behavior:

In [1]: pd.Index([1, 2, 3]).unique()
Out[1]: array([1, 2, 3])

In [2]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[2]:
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
               '2011-01-03 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq=None)

New behavior:

In [136]: pd.Index([1, 2, 3]).unique()
Out[136]: Int64Index([1, 2, 3], dtype='int64')

In [137]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[137]:
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
               '2011-01-03 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq=None)
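Because the returned object is itself an Index, Index methods remain available on the result. A small sketch:

```python
import pandas as pd

# .unique() now returns an Index of the appropriate dtype
# rather than a bare np.ndarray.
uniques = pd.Index([1, 1, 2, 3, 3]).unique()
print(isinstance(uniques, pd.Index))   # True
print(list(uniques))                   # [1, 2, 3]
```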

MultiIndex constructors, groupby and set_index preserve categorical dtypes

MultiIndex.from_arrays and MultiIndex.from_product will now preserve categorical dtype in MultiIndex levels (GH13743, GH13854).

In [138]: cat = pd.Categorical(['a', 'b'], categories=list("bac"))

In [139]: lvl1 = ['foo', 'bar']

In [140]: midx = pd.MultiIndex.from_arrays([cat, lvl1])

In [141]: midx
Out[141]:
MultiIndex(levels=[[u'b', u'a', u'c'], [u'bar', u'foo']],
           labels=[[1, 0], [1, 0]])

Previous behavior:

In [4]: midx.levels[0]
Out[4]: Index(['b', 'a', 'c'], dtype='object')

In [5]: midx.get_level_values(0)
Out[5]: Index(['a', 'b'], dtype='object')

New behavior: the single level is now a CategoricalIndex:

In [142]: midx.levels[0]
Out[142]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, dtype='category')

In [143]: midx.get_level_values(0)
Out[143]: CategoricalIndex([u'a', u'b'], categories=[u'b', u'a', u'c'], ordered=False, dtype='category')

An analogous change has been made to MultiIndex.from_product. As a consequence, groupby and set_index also preserve categorical dtypes in indexes:

In [144]: df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})

In [145]: df_grouped = df.groupby(by=['A', 'C']).first()

In [146]: df_set_idx = df.set_index(['A', 'C'])

Previous behavior:

In [11]: df_grouped.index.levels[1]
Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C')

In [12]: df_grouped.reset_index().dtypes
Out[12]:
A      int64
C     object
B    float64
dtype: object

In [13]: df_set_idx.index.levels[1]
Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C')

In [14]: df_set_idx.reset_index().dtypes
Out[14]:
A      int64
C     object
B      int64
dtype: object

New behavior:

In [147]: df_grouped.index.levels[1]
Out[147]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, name=u'C', dtype='category')

In [148]: df_grouped.reset_index().dtypes
Out[148]:
A       int64
C    category
B     float64
dtype: object

In [149]: df_set_idx.index.levels[1]
Out[149]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, name=u'C', dtype='category')

In [150]: df_set_idx.reset_index().dtypes
Out[150]:
A       int64
C    category
B       int64
dtype: object

read_csv will progressively enumerate chunks

When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. They are now given a progressive index instead, starting from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv() without the chunksize= argument (GH12185).

In [151]: data = 'A,B\n0,1\n2,3\n4,5\n6,7'

Previous behavior:

In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[2]:
   A  B
0  0  1
1  2  3
0  4  5
1  6  7

New behavior:

In [152]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[152]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
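A quick self-contained check of the progressive index (a sketch):

```python
from io import StringIO

import pandas as pd

data = 'A,B\n0,1\n2,3\n4,5\n6,7'

# Each chunk now continues the index where the previous one stopped,
# so the concatenated result carries the index 0..3.
chunked = pd.concat(pd.read_csv(StringIO(data), chunksize=2))
print(list(chunked.index))   # [0, 1, 2, 3]
```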

Sparse Changes

These changes allow pandas to handle sparse data with more dtypes, and make for a smoother experience with data handling.

int64 and bool support enhancements

Sparse data structures have gained enhanced support for int64 and bool dtypes (GH667, GH13849).

Previously, sparse data were float64 dtype by default, even if all inputs were of int or bool dtype. You had to specify dtype explicitly to create sparse data with int64 dtype. Also, fill_value had to be specified explicitly because the default was np.nan, which doesn't appear in int64 or bool data.

In [1]: pd.SparseArray([1, 2, 0, 0])
Out[1]:
[1.0, 2.0, 0.0, 0.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

# specifying int64 dtype, but all values are stored in sp_values because
# fill_value default is np.nan
In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[2]:
[1, 2, 0, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0)
Out[3]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value defaults (0 for int64 dtype, False for bool dtype).

In [153]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[153]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

In [154]: pd.SparseArray([True, False, False, False])
Out[154]:
[True, False, False, False]
Fill: False
IntIndex
Indices: array([0], dtype=int32)

See the docs for more details.

Operators now preserve dtypes
  • Sparse data structures can now preserve dtype after arithmetic ops (GH13848)

    In [155]: s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)

    In [156]: s.dtype
    Out[156]: dtype('int64')

    In [157]: s + 1
    Out[157]:
    0    1
    1    3
    2    1
    3    2
    dtype: int64
    BlockIndex
    Block locations: array([1, 3], dtype=int32)
    Block lengths: array([1, 1], dtype=int32)
  • Sparse data structures now support astype to convert the internal dtype (GH13900)

    In [158]: s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)

    In [159]: s
    Out[159]:
    0    1.0
    1    0.0
    2    2.0
    3    0.0
    dtype: float64
    BlockIndex
    Block locations: array([0, 2], dtype=int32)
    Block lengths: array([1, 1], dtype=int32)

    In [160]: s.astype(np.int64)
    Out[160]:
    0    1
    1    0
    2    2
    3    0
    dtype: int64
    BlockIndex
    Block locations: array([0, 2], dtype=int32)
    Block lengths: array([1, 1], dtype=int32)

    astype fails if the data contains values which cannot be converted to the specified dtype. Note that this limitation also applies to fill_value, whose default is np.nan.

    In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64)
    Out[7]:
    ValueError: unable to coerce current fill_value nan to int64 dtype
Other sparse fixes
  • SubclassedSparseDataFrame andSparseSeries now preserve class types when slicing or transposing. (GH13787)
  • SparseArray withbool dtype now supports logical (bool) operators (GH14000)
  • Bug inSparseSeries withMultiIndex[] indexing may raiseIndexError (GH13144)
  • Bug inSparseSeries withMultiIndex[] indexing result may have normalIndex (GH13144)
  • Bug inSparseDataFrame in whichaxis=None did not default toaxis=0 (GH13048)
  • Bug inSparseSeries andSparseDataFrame creation withobject dtype may raiseTypeError (GH11633)
  • Bug in SparseDataFrame not respecting the dtype and fill_value of a passed SparseArray or SparseSeries (GH13866)
  • Bug in SparseArray and SparseSeries not applying ufuncs to fill_value (GH13853)
  • Bug inSparseSeries.abs incorrectly keeps negativefill_value (GH13853)
  • Bug in single row slicing on multi-typeSparseDataFrame s, types were previously forced to float (GH13917)
  • Bug inSparseSeries slicing changes integer dtype to float (GH8292)
  • Bug in SparseDataFrame comparison ops may raise TypeError (GH13001)
  • Bug in SparseDataFrame.isnull raises ValueError (GH8276)
  • Bug inSparseSeries representation withbool dtype may raiseIndexError (GH13110)
  • Bug inSparseSeries andSparseDataFrame ofbool orint64 dtype may display its values likefloat64 dtype (GH13110)
  • Bug in sparse indexing usingSparseArray withbool dtype may return incorrect result (GH13985)
  • Bug inSparseArray created fromSparseSeries may losedtype (GH13999)
  • Bug inSparseSeries comparison with dense returns normalSeries rather thanSparseSeries (GH13999)

Indexer dtype changes

Note

This change only affects 64-bit Python running on Windows, and only affects relatively advanced indexing operations.

Methods such as Index.get_indexer that return an indexer array coerce that array to a "platform int", so that it can be directly used in 3rd-party library operations like numpy.take. Previously, a platform int was defined as np.int_, which corresponds to a C integer, but the correct type, and what is being used now, is np.intp, which corresponds to the C integer size that can hold a pointer (GH3033, GH13972).

These types are the same on many platforms, but for 64-bit Python on Windows, np.int_ is 32 bits while np.intp is 64 bits. Changing this behavior improves performance for many operations on that platform.

Previous behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int32')

New behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int64')
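A quick way to check the indexer dtype on your own platform (a sketch using the get_indexer method and np.intp described above):

```python
import numpy as np
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
indexer = idx.get_indexer(['b', 'b', 'c'])

# get_indexer returns np.intp, the pointer-sized integer,
# which is what numpy.take expects.
print(indexer.dtype == np.dtype(np.intp))   # True
print(np.take(idx.values, indexer))
```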

Other API Changes

  • Timestamp.to_pydatetime will issue aUserWarning whenwarn=True, and the instance has a non-zero number of nanoseconds, previously this would print a message to stdout (GH14101).
  • Series.unique() with datetime and timezone now returns an array of Timestamp with timezone (GH13565).
  • Panel.to_sparse() will raise aNotImplementedError exception when called (GH13778).
  • Index.reshape() will raise aNotImplementedError exception when called (GH12882).
  • .filter() enforces mutual exclusion of the keyword arguments (GH12399).
  • eval‘s upcasting rules forfloat32 types have been updated to be more consistent with NumPy’s rules. New behavior will not upcast tofloat64 if you multiply a pandasfloat32 object by a scalar float64 (GH12388).
  • AnUnsupportedFunctionCall error is now raised if NumPy ufuncs likenp.mean are called on groupby or resample objects (GH12811).
  • __setitem__ will no longer apply a callable rhs as a function instead of storing it. Callwhere directly to get the previous behavior (GH13299).
  • Calls to.sample() will respect the random seed set vianumpy.random.seed(n) (GH13161)
  • Styler.apply is now more strict about the outputs your function must return. Foraxis=0 oraxis=1, the output shape must be identical. Foraxis=None, the output must be a DataFrame with identical columns and index labels (GH13222).
  • Float64Index.astype(int) will now raiseValueError ifFloat64Index containsNaN values (GH13149)
  • TimedeltaIndex.astype(int) andDatetimeIndex.astype(int) will now returnInt64Index instead ofnp.array (GH13209)
  • PassingPeriod with multiple frequencies to normalIndex now returnsIndex withobject dtype (GH13664)
  • PeriodIndex.fillna withPeriod has different freq now coerces toobject dtype (GH13664)
  • Faceted boxplots fromDataFrame.boxplot(by=col) now return aSeries whenreturn_type is not None. Previously these returned anOrderedDict. Note that whenreturn_type=None, the default, these still return a 2-D NumPy array (GH12216,GH7096).
  • pd.read_hdf will now raise aValueError instead ofKeyError, if a mode other thanr,r+ anda is supplied. (GH13623)
  • pd.read_csv(),pd.read_table(), andpd.read_hdf() raise the builtinFileNotFoundError exception for Python 3.x when called on a nonexistent file; this is back-ported asIOError in Python 2.x (GH14086)
  • More informative exceptions are passed through the csv parser. The exception type would now be the original exception type instead ofCParserError (GH13652).
  • pd.read_csv() in the C engine will now issue aParserWarning or raise aValueError whensep encoded is more than one character long (GH14065)
  • DataFrame.values will now returnfloat64 with aDataFrame of mixedint64 anduint64 dtypes, conforming tonp.find_common_type (GH10364,GH13917)
  • .groupby.groups will now return a dictionary ofIndex objects, rather than a dictionary ofnp.ndarray orlists (GH14293)

Deprecations

  • Series.reshape andCategorical.reshape have been deprecated and will be removed in a subsequent release (GH12882,GH12882)
  • PeriodIndex.to_datetime has been deprecated in favor ofPeriodIndex.to_timestamp (GH8254)
  • Timestamp.to_datetime has been deprecated in favor ofTimestamp.to_pydatetime (GH8254)
  • Index.to_datetime andDatetimeIndex.to_datetime have been deprecated in favor ofpd.to_datetime (GH8254)
  • pandas.core.datetools module has been deprecated and will be removed in a subsequent release (GH14094)
  • SparseList has been deprecated and will be removed in a future version (GH13784)
  • DataFrame.to_html() andDataFrame.to_latex() have dropped thecolSpace parameter in favor ofcol_space (GH13857)
  • DataFrame.to_sql() has deprecated theflavor parameter, as it is superfluous when SQLAlchemy is not installed (GH13611)
  • Deprecatedread_csv keywords:
    • compact_ints anduse_unsigned have been deprecated and will be removed in a future version (GH13320)
    • buffer_lines has been deprecated and will be removed in a future version (GH13360)
    • as_recarray has been deprecated and will be removed in a future version (GH13373)
    • skip_footer has been deprecated in favor ofskipfooter and will be removed in a future version (GH13349)
  • top-levelpd.ordered_merge() has been renamed topd.merge_ordered() and the original name will be removed in a future version (GH13358)
  • Timestamp.offset property (and named arg in the constructor), has been deprecated in favor offreq (GH12160)
  • pd.tseries.util.pivot_annual is deprecated. Use pivot_table as an alternative; an example is here (GH736)
  • pd.tseries.util.isleapyear has been deprecated and will be removed in a subsequent release. Datetime-likes now have a.is_leap_year property (GH13727)
  • Panel4D andPanelND constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data are with thexarray package. Pandas provides ato_xarray() method to automate this conversion (GH13564).
  • pandas.tseries.frequencies.get_standard_freq is deprecated. Usepandas.tseries.frequencies.to_offset(freq).rule_code instead (GH13874)
  • pandas.tseries.frequencies.to_offset‘sfreqstr keyword is deprecated in favor offreq (GH13874)
  • Categorical.from_array has been deprecated and will be removed in a future version (GH13854)

Removal of prior version deprecations/changes

  • TheSparsePanel class has been removed (GH13778)
  • Thepd.sandbox module has been removed in favor of the external librarypandas-qt (GH13670)
  • Thepandas.io.data andpandas.io.wb modules are removed in favor ofthepandas-datareader package (GH13724).
  • Thepandas.tools.rplot module has been removed in favor oftheseaborn package (GH13855)
  • DataFrame.to_csv() has dropped theengine parameter, as was deprecated in 0.17.1 (GH11274,GH13419)
  • DataFrame.to_dict() has dropped theouttype parameter in favor oforient (GH13627,GH8486)
  • pd.Categorical has dropped setting of theordered attribute directly in favor of theset_ordered method (GH13671)
  • pd.Categorical has dropped thelevels attribute in favor ofcategories (GH8376)
  • DataFrame.to_sql() has dropped themysql option for theflavor parameter (GH13611)
  • Panel.shift() has dropped thelags parameter in favor ofperiods (GH14041)
  • pd.Index has dropped thediff method in favor ofdifference (GH13669)
  • pd.DataFrame has dropped theto_wide method in favor ofto_panel (GH14039)
  • Series.to_csv has dropped thenanRep parameter in favor ofna_rep (GH13804)
  • Series.xs,DataFrame.xs,Panel.xs,Panel.major_xs, andPanel.minor_xs have dropped thecopy parameter (GH13781)
  • str.split has dropped thereturn_type parameter in favor ofexpand (GH13701)
  • Removal of the legacy time rules (offset aliases), deprecated since 0.17.0 (these had been aliases since 0.8.0) (GH13590, GH13868). Legacy time rules now raise ValueError. For the list of currently supported offsets, see here.
  • The default value for thereturn_type parameter forDataFrame.plot.box andDataFrame.boxplot changed fromNone to"axes". These methods will now return a matplotlib axes by default instead of a dictionary of artists. Seehere (GH6581).
  • Thetquery anduquery functions in thepandas.io.sql module are removed (GH5950).

Performance Improvements

  • Improved performance of sparseIntIndex.intersect (GH13082)
  • Improved performance of sparse arithmetic withBlockIndex when the number of blocks are large, though recommended to useIntIndex in such cases (GH13082)
  • Improved performance ofDataFrame.quantile() as it now operates per-block (GH11623)
  • Improved performance of float64 hash table operations, fixing some very slow indexing and groupby operations in python 3 (GH13166,GH13334)
  • Improved performance ofDataFrameGroupBy.transform (GH12737)
  • Improved performance ofIndex andSeries.duplicated (GH10235)
  • Improved performance ofIndex.difference (GH12044)
  • Improved performance ofRangeIndex.is_monotonic_increasing andis_monotonic_decreasing (GH13749)
  • Improved performance of datetime string parsing inDatetimeIndex (GH13692)
  • Improved performance of hashingPeriod (GH12817)
  • Improved performance offactorize of datetime with timezone (GH13750)
  • Improved performance by lazily creating indexing hashtables on larger Indexes (GH14266)
  • Improved performance ofgroupby.groups (GH14293)
  • Removed unnecessary materializing of a MultiIndex when introspecting for memory usage (GH14308)

Bug Fixes

  • Bug ingroupby().shift(), which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (GH13813)
  • Bug ingroupby().cumsum() calculatingcumprod whenaxis=1. (GH13994)
  • Bug inpd.to_timedelta() in which theerrors parameter was not being respected (GH13613)
  • Bug inio.json.json_normalize(), where non-ascii keys raised an exception (GH13213)
  • Bug when passing a not-default-indexedSeries asxerr oryerr in.plot() (GH11858)
  • Bug in area plot draws legend incorrectly if subplot is enabled or legend is moved after plot (matplotlib 1.5.0 is required to draw area plot legend properly) (GH9161,GH13544)
  • Bug inDataFrame assignment with an object-dtypedIndex where the resultant column is mutable to the original object. (GH13522)
  • Bug in matplotlibAutoDataFormatter; this restores the second scaled formatting and re-adds micro-second scaled formatting (GH13131)
  • Bug in selection from aHDFStore with a fixed format andstart and/orstop specified will now return the selected range (GH8287)
  • Bug inCategorical.from_codes() where an unhelpful error was raised when an invalidordered parameter was passed in (GH14058)
  • Bug inSeries construction from a tuple of integers on windows not returning default dtype (int64) (GH13646)
  • Bug inTimedeltaIndex addition with a Datetime-like object where addition overflow was not being caught (GH14068)
  • Bug in.groupby(..).resample(..) when the same object is called multiple times (GH13174)
  • Bug in.to_records() when index name is a unicode string (GH13172)
  • Bug in calling.memory_usage() on object which doesn’t implement (GH12924)
  • Regression inSeries.quantile with nans (also shows up in.median() and.describe() ); furthermore now names theSeries with the quantile (GH13098,GH13146)
  • Bug inSeriesGroupBy.transform with datetime values and missing groups (GH13191)
  • Bug where emptySeries were incorrectly coerced in datetime-like numeric operations (GH13844)
  • Bug inCategorical constructor when passed aCategorical containing datetimes with timezones (GH14190)
  • Bug inSeries.str.extractall() withstr index raisesValueError (GH13156)
  • Bug inSeries.str.extractall() with single group and quantifier (GH13382)
  • Bug inDatetimeIndex andPeriod subtraction raisesValueError orAttributeError rather thanTypeError (GH13078)
  • Bug inIndex andSeries created withNaN andNaT mixed data may not havedatetime64 dtype (GH13324)
  • Bug in Index and Series may ignore np.datetime64('nat') and np.timedelta64('nat') when inferring dtype (GH13324)
  • Bug inPeriodIndex andPeriod subtraction raisesAttributeError (GH13071)
  • Bug inPeriodIndex construction returning afloat64 index in some circumstances (GH13067)
  • Bug in.resample(..) with aPeriodIndex not changing itsfreq appropriately when empty (GH13067)
  • Bug in.resample(..) with aPeriodIndex not retaining its type or name with an emptyDataFrame appropriately when empty (GH13212)
  • Bug ingroupby(..).apply(..) when the passed function returns scalar values per group (GH13468).
  • Bug ingroupby(..).resample(..) where passing some keywords would raise an exception (GH13235)
  • Bug in.tz_convert on a tz-awareDateTimeIndex that relied on index being sorted for correct results (GH13306)
  • Bug in.tz_localize withdateutil.tz.tzlocal may return incorrect result (GH13583)
  • Bug inDatetimeTZDtype dtype withdateutil.tz.tzlocal cannot be regarded as valid dtype (GH13583)
  • Bug inpd.read_hdf() where attempting to load an HDF file with a single dataset, that had one or more categorical columns, failed unless the key argument was set to the name of the dataset. (GH13231)
  • Bug in.rolling() that allowed a negative integer window in contruction of theRolling() object, but would later fail on aggregation (GH13383)
  • Bug inSeries indexing with tuple-valued data and a numeric index (GH13509)
  • Bug in printingpd.DataFrame where unusual elements with theobject dtype were causing segfaults (GH13717)
  • Bug in rankingSeries which could result in segfaults (GH13445)
  • Bug in various index types, which did not propagate the name of passed index (GH12309)
  • Bug inDatetimeIndex, which did not honour thecopy=True (GH13205)
  • Bug inDatetimeIndex.is_normalized returns incorrectly for normalized date_range in case of local timezones (GH13459)
  • Bug in pd.concat and .append may coerce datetime64 and timedelta to object dtype containing python built-in datetime or timedelta rather than Timestamp or Timedelta (GH13626)
  • Bug in PeriodIndex.append may raise AttributeError when the result is object dtype (GH13221)
  • Bug inCategoricalIndex.append may accept normallist (GH13626)
  • Bug inpd.concat and.append with the same timezone get reset to UTC (GH7795)
  • Bug inSeries andDataFrame.append raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13626)
  • Bug inDataFrame.to_csv() in which float values were being quoted even though quotations were specified for non-numeric values only (GH12922,GH13259)
  • Bug inDataFrame.describe() raisingValueError with only boolean columns (GH13898)
  • Bug inMultiIndex slicing where extra elements were returned when level is non-unique (GH12896)
  • Bug in.str.replace does not raiseTypeError for invalid replacement (GH13438)
  • Bug inMultiIndex.from_arrays which didn’t check for input array lengths matching (GH13599)
  • Bug incartesian_product andMultiIndex.from_product which may raise with empty input arrays (GH12258)
  • Bug inpd.read_csv() which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (GH13703)
  • Bug inpd.read_csv() which caused errors to be raised when a dictionary containing scalars is passed in forna_values (GH12224)
  • Bug inpd.read_csv() which caused BOM files to be incorrectly parsed by not ignoring the BOM (GH4793)
  • Bug inpd.read_csv() withengine='python' which raised errors when a numpy array was passed in forusecols (GH12546)
  • Bug inpd.read_csv() where the index columns were being incorrectly parsed when parsed as dates with athousands parameter (GH14066)
  • Bug inpd.read_csv() withengine='python' in whichNaN values weren’t being detected after data was converted to numeric values (GH13314)
  • Bug inpd.read_csv() in which thenrows argument was not properly validated for both engines (GH10476)
  • Bug inpd.read_csv() withengine='python' in which infinities of mixed-case forms were not being interpreted properly (GH13274)
  • Bug inpd.read_csv() withengine='python' in which trailingNaN values were not being parsed (GH13320)
  • Bug inpd.read_csv() withengine='python' when reading from atempfile.TemporaryFile on Windows with Python 3 (GH13398)
  • Bug inpd.read_csv() that preventsusecols kwarg from accepting single-byte unicode strings (GH13219)
  • Bug inpd.read_csv() that preventsusecols from being an empty set (GH13402)
  • Bug inpd.read_csv() in the C engine where the NULL character was not being parsed as NULL (GH14012)
  • Bug inpd.read_csv() withengine='c' in which NULLquotechar was not accepted even thoughquoting was specified asNone (GH13411)
  • Bug inpd.read_csv() withengine='c' in which fields were not properly cast to float when quoting was specified as non-numeric (GH13411)
  • Bug inpd.read_csv() in Python 2.x with non-UTF8 encoded, multi-character separated data (GH3404)
  • Bug inpd.read_csv(), where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (GH13549)
  • Bug inpd.read_csv,pd.read_table,pd.read_fwf,pd.read_stata andpd.read_sas where files were opened by parsers but not closed if bothchunksize anditerator wereNone. (GH13940)
  • Bug inStataReader,StataWriter,XportReader andSAS7BDATReader where a file was not properly closed when an error was raised. (GH13940)
  • Bug inpd.pivot_table() wheremargins_name is ignored whenaggfunc is a list (GH13354)
  • Bug in pd.Series.str.zfill, center, ljust, rjust, and pad when passing non-integers, did not raise TypeError (GH13598)
  • Bug in checking for any null objects in a TimedeltaIndex, which always returned True (GH13603)
  • Bug in Series arithmetic raises TypeError if it contains datetime-like as object dtype (GH13043)
  • Bug in Series.isnull() and Series.notnull() ignore Period('NaT') (GH13737)
  • Bug in Series.fillna() and Series.dropna() don't affect Period('NaT') (GH13737)
  • Bug in .fillna(value=np.nan) incorrectly raises KeyError on a category dtyped Series (GH14021)
  • Bug in extension dtype creation where the created types were not is/identical (GH13285)
  • Bug in .resample(..) where incorrect warnings were triggered by IPython introspection (GH13618)
  • Bug in NaT - Period raises AttributeError (GH13071)
  • Bug in Series comparison may output incorrect result if rhs contains NaT (GH9005)
  • Bug in Series and Index comparison may output incorrect result if it contains NaT with object dtype (GH13592)
  • Bug in Period addition raises TypeError if Period is on right hand side (GH13069)
  • Bug in Period and Series or Index comparison raises TypeError (GH13200)
  • Bug inpd.set_eng_float_format() that would prevent NaN and Inf from formatting (GH11981)
  • Bug in.unstack withCategorical dtype resets.ordered toTrue (GH13249)
  • Clean some compile time warnings in datetime parsing (GH13607)
  • Bug infactorize raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13750)
  • Bug in.set_index raisesAmbiguousTimeError if new index contains DST boundary and multi levels (GH12920)
  • Bug in.shift raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13926)
  • Bug inpd.read_hdf() returns incorrect result when aDataFrame with acategorical column and a query which doesn’t match any values (GH13792)
  • Bug in.iloc when indexing with a non lex-sorted MultiIndex (GH13797)
  • Bug in.loc when indexing with date strings in a reverse sortedDatetimeIndex (GH14316)
  • Bug inSeries comparison operators when dealing with zero dim NumPy arrays (GH13006)
  • Bug in.combine_first may return incorrectdtype (GH7630,GH10567)
  • Bug ingroupby whereapply returns different result depending on whether first result isNone or not (GH12824)
  • Bug ingroupby(..).nth() where the group key is included inconsistently if called after.head()/.tail() (GH12839)
  • Bug in.to_html,.to_latex and.to_string silently ignore custom datetime formatter passed through theformatters key word (GH10690)
  • Bug in DataFrame.iterrows() not yielding a Series subclass if defined (GH13977)
  • Bug inpd.to_numeric whenerrors='coerce' and input contains non-hashable objects (GH13324)
  • Bug in invalidTimedelta arithmetic and comparison may raiseValueError rather thanTypeError (GH13624)
  • Bug in invalid datetime parsing into_datetime andDatetimeIndex may raiseTypeError rather thanValueError (GH11169,GH11287)
  • Bug inIndex created with tz-awareTimestamp and mismatchedtz option incorrectly coerces timezone (GH13692)
  • Bug inDatetimeIndex with nanosecond frequency does not include timestamp specified withend (GH13672)
  • Bug in Series when setting a slice with a np.timedelta64 (GH14155)
  • Bug inIndex raisesOutOfBoundsDatetime ifdatetime exceedsdatetime64[ns] bounds, rather than coercing toobject dtype (GH13663)
  • Bug inIndex may ignore specifieddatetime64 ortimedelta64 passed asdtype (GH13981)
  • Bug in RangeIndex which could be created without arguments rather than raising TypeError (GH13793)
  • Bug in.value_counts() raisesOutOfBoundsDatetime if data exceedsdatetime64[ns] bounds (GH13663)
  • Bug inDatetimeIndex may raiseOutOfBoundsDatetime if inputnp.datetime64 has other unit thanns (GH9114)
  • Bug inSeries creation withnp.datetime64 which has other unit thanns asobject dtype results in incorrect values (GH13876)
  • Bug inresample with timedelta data where data was casted to float (GH13119).
  • Bug in pd.isnull() and pd.notnull() raise TypeError if input datetime-like has other unit than ns (GH13389)
  • Bug inpd.merge() may raiseTypeError if input datetime-like has other unit thanns (GH13389)
  • Bug inHDFStore/read_hdf() discardedDatetimeIndex.name iftz was set (GH13884)
  • Bug inCategorical.remove_unused_categories() changes.codes dtype to platform int (GH13261)
  • Bug ingroupby withas_index=False returns all NaN’s when grouping on multiple columns including a categorical one (GH13204)
  • Bug indf.groupby(...)[...] where getitem withInt64Index raised an error (GH13731)
  • Bug in the CSS classes assigned to DataFrame.style for index names. Previously they were assigned "col_heading level<n> col<c>" where n was the number of levels + 1. Now they are assigned "index_name level<n>", where n is the correct level for that MultiIndex.
  • Bug where pd.read_gbq() could throw ImportError: No module named discovery as a result of a naming conflict with another python package called apiclient (GH13454)
  • Bug inIndex.union returns an incorrect result with a named empty index (GH13432)
  • Bugs inIndex.difference andDataFrame.join raise in Python3 when using mixed-integer indexes (GH13432,GH12814)
  • Bug in subtract tz-awaredatetime.datetime from tz-awaredatetime64 series (GH14088)
  • Bug in.to_excel() when DataFrame contains a MultiIndex which contains a label with a NaN value (GH13511)
  • Bug in invalid frequency offset string like “D1”, “-2-3H” may not raiseValueError (GH13930)
  • Bug inconcat andgroupby for hierarchical frames withRangeIndex levels (GH13542).
  • Bug inSeries.str.contains() for Series containing onlyNaN values ofobject dtype (GH14171)
  • Bug inagg() function on groupby dataframe changes dtype ofdatetime64[ns] column tofloat64 (GH12821)
  • Bug in using NumPy ufunc withPeriodIndex to add or subtract integer raiseIncompatibleFrequency. Note that using standard operator like+ or- is recommended, because standard operators use more efficient path (GH13980)
  • Bug in operations onNaT returningfloat instead ofdatetime64[ns] (GH12941)
  • Bug inSeries flexible arithmetic methods (like.add()) raisesValueError whenaxis=None (GH13894)
  • Bug inDataFrame.to_csv() withMultiIndex columns in which a stray empty line was added (GH6618)
  • Bug inDatetimeIndex,TimedeltaIndex andPeriodIndex.equals() may returnTrue when input isn’tIndex but contains the same values (GH13107)
  • Bug in assignment against datetime with timezone may not work if it contains datetime near DST boundary (GH14146)
  • Bug inpd.eval() andHDFStore query truncating long float literals with python 2 (GH14241)
  • Bug inIndex raisesKeyError displaying incorrect column when column is not in the df and columns contains duplicate values (GH13822)
  • Bug inPeriod andPeriodIndex creating wrong dates when frequency has combined offset aliases (GH13874)
  • Bug in .to_string() when called with an integer line_width and index=False raises an UnboundLocalError exception because idx was referenced before assignment.
  • Bug ineval() where theresolvers argument would not accept a list (GH14095)
  • Bugs instack,get_dummies,make_axis_dummies which don’t preserve categorical dtypes in (multi)indexes (GH13854)
  • PeriodIndex can now acceptlist andarray which containspd.NaT (GH13430)
  • Bug indf.groupby where.median() returns arbitrary values if grouped dataframe contains empty bins (GH13629)
  • Bug inIndex.copy() wherename parameter was ignored (GH14302)

v0.18.1 (May 3, 2016)

This is a minor bug-fix release from 0.18.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

  • .groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group, see here
  • pd.to_datetime() has gained the ability to assemble dates from a DataFrame, see here
  • Method chaining improvements, see here.
  • Custom business hour offset, see here.
  • Many bug fixes in the handling of sparse, see here
  • Expanded the Tutorials section with a feature on modern pandas, courtesy of @TomAugspurger. (GH13045).

New features

Custom Business Hour

The CustomBusinessHour is a mixture of BusinessHour and CustomBusinessDay which allows you to specify arbitrary holidays. For details, see Custom Business Hour (GH11514)

In [1]: from pandas.tseries.offsets import CustomBusinessHour

In [2]: from pandas.tseries.holiday import USFederalHolidayCalendar

In [3]: bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())

Friday before MLK Day

In [4]: dt = datetime(2014, 1, 17, 15)

In [5]: dt + bhour_us
Out[5]: Timestamp('2014-01-17 16:00:00')

Tuesday after MLK Day (Monday is skipped because it’s a holiday)

In [6]: dt + bhour_us * 2
Out[6]: Timestamp('2014-01-21 09:00:00')
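A similar offset can be built without the federal calendar by passing an explicit holidays list; a minimal sketch, assuming MLK Day 2014 (Monday, January 20) as the only holiday:

```python
from datetime import datetime

from pandas.tseries.offsets import CustomBusinessHour

# Explicit holiday list instead of a holiday calendar (assumed date for illustration)
bhour = CustomBusinessHour(holidays=['2014-01-20'])

dt = datetime(2014, 1, 17, 15)  # Friday 15:00

print(dt + bhour)      # one business hour later: Friday 16:00
print(dt + bhour * 2)  # rolls over the weekend *and* the listed holiday
```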

.groupby(..) syntax with window and resample operations

.groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group, see (GH12486, GH12738).

You can now use .rolling(..) and .expanding(..) as methods on groupbys. These return another deferred object (similar to what .rolling() and .expanding() do on ungrouped pandas objects). You can then operate on these RollingGroupby objects in a similar manner.

Previously you would have to do this to get a rolling window mean per-group:

In [7]: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.arange(40)})
   ...:

In [8]: df
Out[8]:
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
.. ..  ..
33  3  33
34  3  34
35  3  35
36  3  36
37  3  37
38  3  38
39  3  39

[40 rows x 2 columns]
In [9]: df.groupby('A').apply(lambda x: x.rolling(4).B.mean())
Out[9]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ...
3  33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now you can do:

In [10]: df.groupby('A').rolling(4).B.mean()
Out[10]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ...
3  33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

For.resample(..) type of operations, previously you would have to:

In [11]: df = pd.DataFrame({'date': pd.date_range(start='2016-01-01',
   ....:                                          periods=4,
   ....:                                          freq='W'),
   ....:                    'group': [1, 1, 2, 2],
   ....:                    'val': [5, 6, 7, 8]}).set_index('date')
   ....:

In [12]: df
Out[12]:
            group  val
date
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8
In [13]: df.groupby('group').apply(lambda x: x.resample('1D').ffill())
Out[13]:
                  group  val
group date
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]

Now you can do:

In [14]: df.groupby('group').resample('1D').ffill()
Out[14]:
                  group  val
group date
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]

Method chaining improvements

The following methods / indexers now accept a callable. It is intended to make these more useful in method chains, see the documentation. (GH11485, GH12533)

  • .where() and .mask()
  • .loc[], .iloc[] and .ix[]
  • [] indexing
.where() and .mask()

These can accept a callable for the condition and other arguments.

In [15]: df = pd.DataFrame({'A': [1, 2, 3],
   ....:                    'B': [4, 5, 6],
   ....:                    'C': [7, 8, 9]})
   ....:

In [16]: df.where(lambda x: x > 4, lambda x: x + 10)
Out[16]:
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
.loc[], .iloc[], .ix[]

These can accept a callable, and a tuple of callables, as a slicer. The callable can return a valid boolean indexer or anything which is valid for these indexers' input.

# callable returns bool indexer
In [17]: df.loc[lambda x: x.A >= 2, lambda x: x.sum() > 10]
Out[17]:
   B  C
1  5  8
2  6  9

# callable returns list of labels
In [18]: df.loc[lambda x: [1, 2], lambda x: ['A', 'B']]
Out[18]:
   A  B
1  2  5
2  3  6
[] indexing

Finally, you can use a callable in [] indexing of Series, DataFrame and Panel. The callable must return a valid input for [] indexing depending on its class and index type.

In [19]: df[lambda x: 'A']
Out[19]:
0    1
1    2
2    3
Name: A, dtype: int64

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [20]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [21]: (bb.groupby(['year', 'team'])
   ....:    .sum()
   ....:    .loc[lambda df: df.r > 100]
   ....: )
   ....:
Out[21]:
           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb  \
year team
2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105
     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97
     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60
     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114
     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174
     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235
     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73
     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190

              so   ibb   hbp    sh    sf  gidp
year team
2007 CIN   127.0  14.0   1.0   1.0  15.0  18.0
     DET   176.0   3.0  10.0   4.0   8.0  28.0
     HOU   212.0   3.0   9.0  16.0   6.0  17.0
     LAN   141.0   8.0   9.0   3.0   8.0  29.0
     NYN   310.0  24.0  23.0  18.0  15.0  48.0
     SFN   188.0  51.0   8.0  16.0   6.0  41.0
     TEX   140.0   4.0   5.0   2.0   8.0  16.0
     TOR   265.0  16.0  12.0   4.0  16.0  38.0
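Since data/baseball.csv is not bundled here, the same chaining pattern can be sketched on an inline frame; the team names and numbers below are assumed toy data:

```python
import pandas as pd

# Toy stand-in for the baseball data (assumed values for illustration)
df = pd.DataFrame({'team': ['CIN', 'CIN', 'DET', 'DET'],
                   'r': [60, 55, 90, 80]})

# Chain groupby, aggregation, and a callable .loc filter without temporaries
out = (df.groupby('team')
         .sum()
         .loc[lambda d: d.r > 150])  # keep teams with more than 150 runs
print(out)
```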

Partial string indexing on DatetimeIndex when part of a MultiIndex

Partial string indexing now matches on DatetimeIndex when part of a MultiIndex (GH10331)

In [22]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   ....:                     columns=['A'],
   ....:                     index=pd.MultiIndex.from_product([pd.date_range('20130101',
   ....:                                                                     periods=10,
   ....:                                                                     freq='12H'),
   ....:                                                       ['a', 'b']]))
   ....:

In [23]: dft2
Out[23]:
                              A
2013-01-01 00:00:00 a  1.474071
                    b -0.064034
2013-01-01 12:00:00 a -1.282782
                    b  0.781836
2013-01-02 00:00:00 a -1.071357
                    b  0.441153
2013-01-02 12:00:00 a  2.353925
...                         ...
2013-01-04 00:00:00 b -0.845696
2013-01-04 12:00:00 a -1.340896
                    b  1.846883
2013-01-05 00:00:00 a -1.328865
                    b  1.682706
2013-01-05 12:00:00 a -1.717693
                    b  0.888782

[20 rows x 1 columns]

In [24]: dft2.loc['2013-01-05']
Out[24]:
                              A
2013-01-05 00:00:00 a -1.328865
                    b  1.682706
2013-01-05 12:00:00 a -1.717693
                    b  0.888782

On other levels

In [25]: idx = pd.IndexSlice

In [26]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [27]: dft2
Out[27]:
                              A
a 2013-01-01 00:00:00  1.474071
  2013-01-01 12:00:00 -1.282782
  2013-01-02 00:00:00 -1.071357
  2013-01-02 12:00:00  2.353925
  2013-01-03 00:00:00  0.221471
  2013-01-03 12:00:00  0.758527
  2013-01-04 00:00:00 -0.964980
...                         ...
b 2013-01-02 12:00:00  0.583787
  2013-01-03 00:00:00 -0.744471
  2013-01-03 12:00:00  1.729689
  2013-01-04 00:00:00 -0.845696
  2013-01-04 12:00:00  1.846883
  2013-01-05 00:00:00  1.682706
  2013-01-05 12:00:00  0.888782

[20 rows x 1 columns]

In [28]: dft2.loc[idx[:, '2013-01-05'], :]
Out[28]:
                              A
a 2013-01-05 00:00:00 -1.328865
  2013-01-05 12:00:00 -1.717693
b 2013-01-05 00:00:00  1.682706
  2013-01-05 12:00:00  0.888782
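With deterministic data the same selection can be checked end to end; a minimal sketch, assuming a daily frequency and toy values:

```python
import pandas as pd

# 4 daily timestamps crossed with labels ['a', 'b'] (assumed toy data)
dft = pd.DataFrame({'A': range(8)},
                   index=pd.MultiIndex.from_product(
                       [pd.date_range('2013-01-01', periods=4, freq='D'),
                        ['a', 'b']]))

# A date string on the datetime level selects every row under that date
sel = dft.loc['2013-01-02']
print(sel)
```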

Assembling Datetimes

pd.to_datetime() has gained the ability to assemble datetimes from a passed-in DataFrame or a dict. (GH8158).

In [29]: df = pd.DataFrame({'year': [2015, 2016],
   ....:                    'month': [2, 3],
   ....:                    'day': [4, 5],
   ....:                    'hour': [2, 3]})
   ....:

In [30]: df
Out[30]:
   day  hour  month  year
0    4     2      2  2015
1    5     3      3  2016

Assembling using the passed frame.

In [31]: pd.to_datetime(df)
Out[31]:
0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]

You can pass only the columns that you need to assemble.

In [32]: pd.to_datetime(df[['year', 'month', 'day']])
Out[32]:
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]
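The dict form mentioned above works the same way; a small sketch with assumed toy values:

```python
import pandas as pd

# A dict of equal-length 'year'/'month'/'day' columns assembles like a DataFrame
parts = {'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]}
out = pd.to_datetime(parts)
print(out)
```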

Other Enhancements

  • pd.read_csv() now supports delim_whitespace=True for the Python engine (GH12958)

  • pd.read_csv() now supports opening ZIP files that contain a single CSV, via extension inference or explicit compression='zip' (GH12175)

  • pd.read_csv() now supports opening files using xz compression, via extension inference or explicit compression='xz'; xz compression is also supported by DataFrame.to_csv in the same way (GH11852)

  • pd.read_msgpack() now always gives writeable ndarrays even when compression is used (GH12359).

  • pd.read_msgpack() now supports serializing and de-serializing categoricals with msgpack (GH12573)

  • .to_json() now supports NDFrames that contain categorical and sparse data (GH10778)

  • interpolate() now supports method='akima' (GH7588).

  • pd.read_excel() now accepts path objects (e.g. pathlib.Path, py.path.local) for the file path, in line with other read_* functions (GH12655)

  • Added .weekday_name property as a component to DatetimeIndex and the .dt accessor. (GH11128)

  • Index.take now handles allow_fill and fill_value consistently (GH12631)

    In [33]: idx = pd.Index([1., 2., 3., 4.], dtype='float')

    # default, allow_fill=True, fill_value=None
    In [34]: idx.take([2, -1])
    Out[34]: Float64Index([3.0, 4.0], dtype='float64')

    In [35]: idx.take([2, -1], fill_value=True)
    Out[35]: Float64Index([3.0, nan], dtype='float64')
  • Index now supports .str.get_dummies() which returns MultiIndex, see Creating Indicator Variables (GH10008, GH10103)

    In [36]: idx = pd.Index(['a|b', 'a|c', 'b|c'])

    In [37]: idx.str.get_dummies('|')
    Out[37]:
    MultiIndex(levels=[[0, 1], [0, 1], [0, 1]],
               labels=[[1, 1, 0], [1, 0, 1], [0, 1, 1]],
               names=[u'a', u'b', u'c'])
  • pd.crosstab() has gained a normalize argument for normalizing frequency tables (GH12569). Examples in the updated docs here.

  • .resample(..).interpolate() is now supported (GH12925)

  • .isin() now accepts passed sets (GH12988)
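As an illustration of one of the enhancements above, the new normalize argument of pd.crosstab() divides the frequency table by the grand total; a minimal sketch with assumed toy data:

```python
import pandas as pd

a = pd.Series(['x', 'x', 'y', 'y'])
b = pd.Series([1, 2, 1, 1])

# normalize=True converts raw counts into fractions of the grand total
ct = pd.crosstab(a, b, normalize=True)
print(ct)
```

normalize also accepts 'index' or 'columns' to normalize per row or per column instead.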

Sparse changes

These changes conform sparse handling to return the correct types and work to make a smoother experience with indexing.

SparseArray.take now returns a scalar for scalar input, SparseArray for others. Furthermore, it handles a negative indexer with the same rules as Index (GH10560, GH12796)

In [38]: s = pd.SparseArray([np.nan, np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, 6])

In [39]: s.take(0)
Out[39]: nan

In [40]: s.take([1, 2, 3])
Out[40]:
[nan, 1.0, 2.0]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)
  • Bug in SparseSeries [] indexing with Ellipsis raises KeyError (GH9467)
  • Bug in SparseArray [] indexing with tuples not handled properly (GH12966)
  • Bug in SparseSeries.loc[] with list-like input raises TypeError (GH10560)
  • Bug in SparseSeries.iloc[] with scalar input may raise IndexError (GH10560)
  • Bug in SparseSeries.loc[], .iloc[] with slice returns SparseArray, rather than SparseSeries (GH10560)
  • Bug in SparseDataFrame.loc[], .iloc[] may result in dense Series, rather than SparseSeries (GH12787)
  • Bug in SparseArray addition ignores fill_value of right hand side (GH12910)
  • Bug in SparseArray mod raises AttributeError (GH12910)
  • Bug in SparseArray pow calculates 1 ** np.nan as np.nan, which must be 1 (GH12910)
  • Bug in SparseArray comparison output may give incorrect result or raise ValueError (GH12971)
  • Bug in SparseSeries.__repr__ raises TypeError when it is longer than max_rows (GH10560)
  • Bug in SparseSeries.shape ignores fill_value (GH10452)
  • Bug in SparseSeries and SparseArray may have different dtype from its dense values (GH12908)
  • Bug in SparseSeries.reindex incorrectly handles fill_value (GH12797)
  • Bug in SparseArray.to_frame() results in DataFrame, rather than SparseDataFrame (GH9850)
  • Bug in SparseSeries.value_counts() does not count fill_value (GH6749)
  • Bug in SparseArray.to_dense() does not preserve dtype (GH10648)
  • Bug in SparseArray.to_dense() incorrectly handles fill_value (GH12797)
  • Bug in pd.concat() of SparseSeries results in dense (GH10536)
  • Bug in pd.concat() of SparseDataFrame incorrectly handles fill_value (GH9765)
  • Bug in pd.concat() of SparseDataFrame may raise AttributeError (GH12174)
  • Bug in SparseArray.shift() may raise NameError or TypeError (GH12908)

API changes

.groupby(..).nth() changes

The index in .groupby(..).nth() output is now more consistent when the as_index argument is passed (GH11039):

In [41]: df = DataFrame({'A': ['a', 'b', 'a'],
   ....:                 'B': [1, 2, 3]})
   ....:

In [42]: df
Out[42]:
   A  B
0  a  1
1  b  2
2  a  3

Previous Behavior:

In [3]: df.groupby('A', as_index=True)['B'].nth(0)
Out[3]:
0    1
1    2
Name: B, dtype: int64

In [4]: df.groupby('A', as_index=False)['B'].nth(0)
Out[4]:
0    1
1    2
Name: B, dtype: int64

New Behavior:

In [43]: df.groupby('A', as_index=True)['B'].nth(0)
Out[43]:
A
a    1
b    2
Name: B, dtype: int64

In [44]: df.groupby('A', as_index=False)['B'].nth(0)
Out[44]:
0    1
1    2
Name: B, dtype: int64

Furthermore, previously, a .groupby would always sort, regardless of whether sort=False was passed with .nth().

In [45]: np.random.seed(1234)

In [46]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

In [47]: df['c'] = np.random.randint(0, 4, 100)

Previous Behavior:

In [4]: df.groupby('c', sort=True).nth(1)
Out[4]:
          a         b
c
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

In [5]: df.groupby('c', sort=False).nth(1)
Out[5]:
          a         b
c
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

New Behavior:

In [48]: df.groupby('c', sort=True).nth(1)
Out[48]:
          a         b
c
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

In [49]: df.groupby('c', sort=False).nth(1)
Out[49]:
          a         b
c
2 -0.720589  0.887163
3  0.859588 -0.636524
0 -0.334077  0.002118
1  0.036142 -2.074978

numpy function compatibility

Compatibility between pandas array-like methods (e.g. sum and take) and their numpy counterparts has been greatly increased by augmenting the signatures of the pandas methods so as to accept arguments that can be passed in from numpy, even if they are not necessarily used in the pandas implementation (GH12644, GH12638, GH12687)

  • .searchsorted() for Index and TimedeltaIndex now accept a sorter argument to maintain compatibility with numpy's searchsorted function (GH12238)
  • Bug in numpy compatibility of np.round() on a Series (GH12600)

An example of this signature augmentation is illustrated below:

In [50]: sp = pd.SparseDataFrame([1, 2, 3])

In [51]: sp
Out[51]:
   0
0  1
1  2
2  3

Previous behaviour:

In [2]: np.cumsum(sp, axis=0)
...
TypeError: cumsum() takes at most 2 arguments (4 given)

New behaviour:

In [52]: np.cumsum(sp, axis=0)
Out[52]:
   0
0  1
1  3
2  6
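The same dispatch applies to regular pandas objects; since SparseDataFrame was removed in later pandas versions, a plain Series makes the check portable (a sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# numpy forwards its extra kwargs (here axis=0) to the pandas method
out = np.cumsum(s, axis=0)
print(out)
```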

Using.apply on groupby resampling

Using apply on resampling groupby operations (using a pd.TimeGrouper) now has the same output types as similar apply calls on other groupby operations. (GH11742).

In [53]: df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
   ....:                    'value': [10, 13]})
   ....:

In [54]: df
Out[54]:
        date  value
0 2000-10-10     10
1 2000-11-10     13

Previous behavior:

In [1]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[1]:
...
TypeError: cannot concatenate a non-NDFrame object

# Output is a Series
In [2]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[2]:
date
2000-10-31  value    10
2000-11-30  value    13
dtype: int64

New Behavior:

# Output is a Series
In [55]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[55]:
date
2000-10-31    10
2000-11-30    13
Freq: M, dtype: int64

# Output is a DataFrame
In [56]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[56]:
            value
date
2000-10-31     10
2000-11-30     13

Changes inread_csv exceptions

In order to standardize the read_csv API for both the c and python engines, both will now raise an EmptyDataError, a subclass of ValueError, in response to empty columns or header (GH12493, GH12506)

Previous behaviour:

In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
ValueError: No columns to parse from file

In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
StopIteration

New behaviour:

In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
pandas.io.common.EmptyDataError: No columns to parse from file

In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
pandas.io.common.EmptyDataError: No columns to parse from file
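Because EmptyDataError subclasses ValueError, calling code can catch either one; a minimal sketch (note that later pandas releases expose the class as pandas.errors.EmptyDataError):

```python
from io import StringIO

import pandas as pd

# Reading an empty buffer raises EmptyDataError, a ValueError subclass
try:
    pd.read_csv(StringIO(''))
    message = None
except ValueError as err:
    message = str(err)

print(message)
```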

In addition to this error change, several others have been made as well:

  • CParserError now sub-classes ValueError instead of just an Exception (GH12551)
  • A CParserError is now raised instead of a generic Exception in read_csv when the c engine cannot parse a column (GH12506)
  • A ValueError is now raised instead of a generic Exception in read_csv when the c engine encounters a NaN value in an integer column (GH12506)
  • A ValueError is now raised instead of a generic Exception in read_csv when true_values is specified, and the c engine encounters an element in a column containing unencodable bytes (GH12506)
  • pandas.parser.OverflowError exception has been removed and has been replaced with Python's built-in OverflowError exception (GH12506)
  • pd.read_csv() no longer allows a combination of strings and integers for the usecols parameter (GH12678)

to_datetime error changes

Bugs in pd.to_datetime() when passing a unit with convertible entries and errors='coerce', or with non-convertible entries and errors='ignore', have been fixed. Furthermore, an OutOfBoundsDatetime exception will be raised when an out-of-range value is encountered for that unit when errors='raise'. (GH11758, GH13052, GH13059)

Previous behaviour:

In [27]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[27]: NaT

In [28]: pd.to_datetime(11111111, unit='D', errors='ignore')
OverflowError: Python int too large to convert to C long

In [29]: pd.to_datetime(11111111, unit='D', errors='raise')
OverflowError: Python int too large to convert to C long

New behaviour:

In [2]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[2]: Timestamp('2014-12-31 16:31:00')

In [3]: pd.to_datetime(11111111, unit='D', errors='ignore')
Out[3]: 11111111

In [4]: pd.to_datetime(11111111, unit='D', errors='raise')
OutOfBoundsDatetime: cannot convert input with unit 'D'
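The raise/coerce branches can be exercised directly; a sketch that sticks to errors='raise' and errors='coerce' (errors='ignore' was deprecated in later pandas releases):

```python
import pandas as pd
from pandas.errors import OutOfBoundsDatetime

# A convertible epoch value with unit='s'
ts = pd.to_datetime(1420043460, unit='s')

# An out-of-range value for the unit raises OutOfBoundsDatetime with errors='raise'...
caught = False
try:
    pd.to_datetime(11111111, unit='D', errors='raise')
except OutOfBoundsDatetime:
    caught = True

# ...and becomes NaT with errors='coerce'
coerced = pd.to_datetime(11111111, unit='D', errors='coerce')
print(ts, caught, coerced)
```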

Other API changes

  • .swaplevel() for Series, DataFrame, Panel, and MultiIndex now features defaults for its first two parameters i and j that swap the two innermost levels of the index. (GH12934)
  • .searchsorted() for Index and TimedeltaIndex now accept a sorter argument to maintain compatibility with numpy's searchsorted function (GH12238)
  • Period and PeriodIndex now raise an IncompatibleFrequency error, which inherits from ValueError, rather than a raw ValueError (GH12615)
  • Series.apply for category dtype now applies the passed function to each of the .categories (and not the .codes), and returns a category dtype if possible (GH12473)
  • read_csv will now raise a TypeError if parse_dates is neither a boolean, list, or dictionary (matches the doc-string) (GH5636)
  • The default for .query()/.eval() is now engine=None, which will use numexpr if it's installed; otherwise it will fall back to the python engine. This mimics the pre-0.18.1 behavior if numexpr is installed (and which, previously, if numexpr was not installed, .query()/.eval() would raise). (GH12749)
  • pd.show_versions() now includes pandas_datareader version (GH12740)
  • Provide proper __name__ and __qualname__ attributes for generic functions (GH12021)
  • pd.concat(ignore_index=True) now uses RangeIndex as default (GH12695)
  • pd.merge() and DataFrame.join() will show a UserWarning when merging/joining a single- with a multi-leveled dataframe (GH9455, GH12219)
  • Compat with scipy > 0.17 for deprecated piecewise_polynomial interpolation method; support for the replacement from_derivatives method (GH12887)
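The new .swaplevel() defaults (i=-2, j=-1) mentioned above can be checked directly; a minimal sketch with an assumed three-level index:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([('a', 'b', 'c')], names=['x', 'y', 'z'])
s = pd.Series([1], index=mi)

# With no arguments, the two innermost levels ('y' and 'z') are swapped
swapped = s.swaplevel()
print(swapped.index.names)
```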

Deprecations

  • The method name Index.sym_diff() is deprecated and can be replaced by Index.symmetric_difference() (GH12591)
  • The method name Categorical.sort() is deprecated in favor of Categorical.sort_values() (GH12882)
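The replacement method named in the first deprecation above works as follows (a sketch with toy indexes):

```python
import pandas as pd

left = pd.Index([1, 2, 3])
right = pd.Index([2, 3, 4])

# Elements appearing in exactly one of the two indexes
sym = left.symmetric_difference(right)
print(sym)
```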

Performance Improvements

  • Improved speed of SAS reader (GH12656, GH12961)
  • Performance improvements in .groupby(..).cumcount() (GH11039)
  • Improved memory usage in pd.read_csv() when using skiprows=an_integer (GH13005)
  • Improved performance of DataFrame.to_sql when checking case sensitivity for tables. Now only checks if table has been created correctly when table name is not lower case. (GH12876)
  • Improved performance of Period construction and time series plotting (GH12903, GH11831).
  • Improved performance of .str.encode() and .str.decode() methods (GH13008)
  • Improved performance of to_numeric if input is numeric dtype (GH12777)
  • Improved performance of sparse arithmetic with IntIndex (GH13036)

Bug Fixes

  • usecols parameter inpd.read_csv is now respected even when the lines of a CSV file are not even (GH12203)
  • Bug ingroupby.transform(..) whenaxis=1 is specified with a non-monotonic ordered index (GH12713)
  • Bug in Period and PeriodIndex creation raises KeyError if freq="Minute" is specified. Note that "Minute" freq was deprecated in v0.17.0; use freq="T" instead (GH11854)
  • Bug in.resample(...).count() with aPeriodIndex always raising aTypeError (GH12774)
  • Bug in.resample(...) with aPeriodIndex casting to aDatetimeIndex when empty (GH12868)
  • Bug in.resample(...) with aPeriodIndex when resampling to an existing frequency (GH12770)
  • Bug in printing data which containsPeriod with differentfreq raisesValueError (GH12615)
  • Bug inSeries construction withCategorical anddtype='category' is specified (GH12574)
  • Bugs in concatenation with a coercible dtype were too aggressive, resulting in different dtypes in output formatting when an object was longer than display.max_rows (GH12411, GH12045, GH11594, GH10571, GH12211)
  • Bug infloat_format option with option not being validated as a callable. (GH12706)
  • Bug inGroupBy.filter whendropna=False and no groups fulfilled the criteria (GH12768)
  • Bug in__name__ of.cum* functions (GH12021)
  • Bug in .astype() of a Float64Index/Int64Index to an Int64Index (GH12881)
  • Bug in roundtripping an integer based index in.to_json()/.read_json() whenorient='index' (the default) (GH12866)
  • Bug in plottingCategorical dtypes cause error when attempting stacked bar plot (GH13019)
  • Compat with numpy >= 1.11 for NaT comparisons (GH12969)
  • Bug in.drop() with a non-uniqueMultiIndex. (GH12701)
  • Bug in.concat of datetime tz-aware and naive DataFrames (GH12467)
  • Bug in correctly raising aValueError in.resample(..).fillna(..) when passing a non-string (GH12952)
  • Bug fixes in various encoding and header processing issues inpd.read_sas() (GH12659,GH12654,GH12647,GH12809)
  • Bug in pd.crosstab() which would silently ignore aggfunc if values=None (GH12569).
  • Potential segfault inDataFrame.to_json when serialisingdatetime.time (GH11473).
  • Potential segfault inDataFrame.to_json when attempting to serialise 0d array (GH11299).
  • Segfault into_json when attempting to serialise aDataFrame orSeries with non-ndarray values; now supports serialization ofcategory,sparse, anddatetime64[ns,tz] dtypes (GH10778).
  • Bug inDataFrame.to_json with unsupported dtype not passed to default handler (GH12554).
  • Bug in.align not returning the sub-class (GH12983)
  • Bug in aligning aSeries with aDataFrame (GH13037)
  • Bug inABCPanel in whichPanel4D was not being considered as a valid instance of this generic type (GH12810)
  • Bug in consistency of.name on.groupby(..).apply(..) cases (GH12363)
  • Bug inTimestamp.__repr__ that causedpprint to fail in nested structures (GH12622)
  • Bug inTimedelta.min andTimedelta.max, the properties now report the true minimum/maximumtimedeltas as recognized by pandas. See thedocumentation. (GH12727)
  • Bug in.quantile() with interpolation may coerce tofloat unexpectedly (GH12772)
  • Bug in.quantile() with emptySeries may return scalar rather than emptySeries (GH12772)
  • Bug in.loc with out-of-bounds in a large indexer would raiseIndexError rather thanKeyError (GH12527)
  • Bug in resampling when using aTimedeltaIndex and.asfreq(), would previously not include the final fencepost (GH12926)
  • Bug in equality testing with aCategorical in aDataFrame (GH12564)
  • Bug inGroupBy.first(),.last() returns incorrect row whenTimeGrouper is used (GH7453)
  • Bug inpd.read_csv() with thec engine when specifyingskiprows with newlines in quoted items (GH10911,GH12775)
  • Bug inDataFrame timezone lost when assigning tz-aware datetimeSeries with alignment (GH12981)
  • Bug in.value_counts() whennormalize=True anddropna=True where nulls still contributed to the normalized count (GH12558)
  • Bug inSeries.value_counts() loses name if its dtype iscategory (GH12835)
  • Bug inSeries.value_counts() loses timezone info (GH12835)
  • Bug inSeries.value_counts(normalize=True) withCategorical raisesUnboundLocalError (GH12835)
  • Bug inPanel.fillna() ignoringinplace=True (GH12633)
  • Bug inpd.read_csv() when specifyingnames,usecols, andparse_dates simultaneously with thec engine (GH9755)
  • Bug inpd.read_csv() when specifyingdelim_whitespace=True andlineterminator simultaneously with thec engine (GH12912)
  • Bug inSeries.rename,DataFrame.rename andDataFrame.rename_axis not treatingSeries as mappings to relabel (GH12623).
  • Clean in.rolling.min and.rolling.max to enhance dtype handling (GH12373)
  • Bug ingroupby where complex types are coerced to float (GH12902)
  • Bug inSeries.map raisesTypeError if its dtype iscategory or tz-awaredatetime (GH12473)
  • Bugs on 32bit platforms for some test comparisons (GH12972)
  • Bug in index coercion when falling back fromRangeIndex construction (GH12893)
  • Better error message in window functions when invalid argument (e.g. a float window) is passed (GH12669)
  • Bug in slicing subclassedDataFrame defined to return subclassedSeries may return normalSeries (GH11559)
  • Bug in.str accessor methods may raiseValueError if input hasname and the result isDataFrame orMultiIndex (GH12617)
  • Bug inDataFrame.last_valid_index() andDataFrame.first_valid_index() on empty frames (GH12800)
  • Bug inCategoricalIndex.get_loc returns different result from regularIndex (GH12531)
  • Bug inPeriodIndex.resample where name not propagated (GH12769)
  • Bug indate_rangeclosed keyword and timezones (GH12684).
  • Bug inpd.concat raisesAttributeError when input data contains tz-aware datetime and timedelta (GH12620)
  • Bug inpd.concat did not handle emptySeries properly (GH11082)
  • Bug in.plot.bar alginment whenwidth is specified withint (GH12979)
  • Bug infill_value is ignored if the argument to a binary operator is a constant (GH12723)
  • Bug inpd.read_html() when using bs4 flavor and parsing table with a header and only one column (GH9178)
  • Bug in.pivot_table whenmargins=True anddropna=True where nulls still contributed to margin count (GH12577)
  • Bug in.pivot_table whendropna=False where table index/column names disappear (GH12133)
  • Bug inpd.crosstab() whenmargins=True anddropna=False which raised (GH12642)
  • Bug inSeries.name whenname attribute can be a hashable type (GH12610)
  • Bug in.describe() resets categorical columns information (GH11558)
  • Bug whereloffset argument was not applied when callingresample().count() on a timeseries (GH12725)
  • pd.read_excel() now accepts column names associated with keyword argumentnames (GH12870)
  • Bug inpd.to_numeric() withIndex returnsnp.ndarray, rather thanIndex (GH12777)
  • Bug inpd.to_numeric() with datetime-like may raiseTypeError (GH12777)
  • Bug inpd.to_numeric() with scalar raisesValueError (GH12777)

v0.18.0 (March 13, 2016)

This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.18.0 no longer supports compatibility with Python versions 2.6 and 3.3 (GH7718, GH11273)

Warning

numexpr version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH12489)

Highlights include:

  • Moving and expanding window functions are now methods on Series and DataFrame, similar to .groupby, see here.
  • Adding support for a RangeIndex as a specialized form of the Int64Index for memory savings, see here.
  • API breaking change to the .resample method to make it more .groupby-like, see here.
  • Removal of support for positional indexing with floats, which was deprecated since 0.14.0. This will now raise a TypeError, see here.
  • The .to_xarray() function has been added for compatibility with the xarray package, see here.
  • The read_sas function has been enhanced to read sas7bdat files, see here.
  • Addition of the .str.extractall() method, and API changes to the .str.extract() method and the .str.cat() method.
  • pd.test() top-level nose test runner is available (GH4327).

Check the API Changes and deprecations before updating.

New features

Window functions are now methods

Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This allows these window-type functions to have a similar API to that of .groupby. See the full documentation here (GH11603, GH12373)

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame({'A': range(10), 'B': np.random.randn(10)})

In [3]: df
Out[3]:
   A         B
0  0  0.471435
1  1 -1.190976
2  2  1.432707
3  3 -0.312652
4  4 -0.720589
5  5  0.887163
6  6  0.859588
7  7 -0.636524
8  8  0.015696
9  9 -2.242685

Previous Behavior:

In [8]: pd.rolling_mean(df, window=3)
        FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
                       DataFrame.rolling(window=3,center=False).mean()
Out[8]:
    A         B
0 NaN       NaN
1 NaN       NaN
2   1  0.237722
3   2 -0.023640
4   3  0.133155
5   4 -0.048693
6   5  0.342054
7   6  0.370076
8   7  0.079587
9   8 -0.954504

New Behavior:

In [4]: r = df.rolling(window=3)

These show a descriptive repr

In [5]: r
Out[5]: Rolling [window=3,center=False,axis=0]

with tab-completion of available methods and properties.

In [9]: r.<TAB>
r.A           r.agg         r.apply       r.count       r.exclusions  r.max         r.median      r.name        r.skew        r.sum
r.B           r.aggregate   r.corr        r.cov         r.kurt        r.mean        r.min         r.quantile    r.std         r.var

The methods operate on theRolling object itself

In [6]: r.mean()
Out[6]:
     A         B
0  NaN       NaN
1  NaN       NaN
2  1.0  0.237722
3  2.0 -0.023640
4  3.0  0.133155
5  4.0 -0.048693
6  5.0  0.342054
7  6.0  0.370076
8  7.0  0.079587
9  8.0 -0.954504

They provide getitem accessors

In [7]: r['A'].mean()
Out[7]:
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    8.0
Name: A, dtype: float64

And multiple aggregations

In [8]: r.agg({'A': ['mean', 'std'],
   ...:        'B': ['mean', 'std']})
   ...:
Out[8]:
     A              B
  mean  std      mean       std
0  NaN  NaN       NaN       NaN
1  NaN  NaN       NaN       NaN
2  1.0  1.0  0.237722  1.327364
3  2.0  1.0 -0.023640  1.335505
4  3.0  1.0  0.133155  1.143778
5  4.0  1.0 -0.048693  0.835747
6  5.0  1.0  0.342054  0.920379
7  6.0  1.0  0.370076  0.871850
8  7.0  1.0  0.079587  0.750099
9  8.0  1.0 -0.954504  1.162285

Changes to rename

Series.rename and NDFrame.rename_axis can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH9494, GH11965)

In [9]: s = pd.Series(np.random.randn(5))

In [10]: s.rename('newname')
Out[10]:
0    1.150036
1    0.991946
2    0.953324
3   -2.021255
4   -0.334077
Name: newname, dtype: float64

In [11]: df = pd.DataFrame(np.random.randn(5, 2))

In [12]: (df.rename_axis("indexname")
   ....:    .rename_axis("columns_name", axis="columns"))
   ....:
Out[12]:
columns_name         0         1
indexname
0             0.002118  0.405453
1             0.289092  1.321158
2            -1.546906 -0.202646
3            -0.655969  0.193421
4             0.553439  1.318152

The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.

Range Index

A RangeIndex has been added to the Int64Index sub-classes to support a memory saving alternative for common use cases. This has a similar implementation to the python range object (xrange in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index if needed.

This will now be the default constructed index for NDFrame objects, rather than an Int64Index as previously. (GH939, GH12070, GH12071, GH12109, GH12888)

Previous Behavior:

In [3]: s = pd.Series(range(1000))

In [4]: s.index
Out[4]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)

In [6]: s.index.nbytes
Out[6]: 8000

New Behavior:

In [13]: s = pd.Series(range(1000))

In [14]: s.index
Out[14]: RangeIndex(start=0, stop=1000, step=1)

In [15]: s.index.nbytes
Out[15]: 72
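The transparent conversion mentioned above can be sketched with a couple of assertions (a minimal illustration; recent pandas versions rename the materialized integer index class, so the check below only verifies that the result is no longer a RangeIndex):

```python
import pandas as pd

# A default index is now a RangeIndex: only start, stop and step are stored.
s = pd.Series(range(1000))
assert isinstance(s.index, pd.RangeIndex)

# An operation whose result can no longer be described by start/stop/step
# falls back to a materialized integer index.
patched = s.index.insert(1, 500)
assert not isinstance(patched, pd.RangeIndex)
```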

Changes to str.extract

The .str.extract method takes a regular expression with capture groups, finds the first match in each subject string, and returns the contents of the capture groups (GH11386).

In v0.18.0, the expand argument was added to extract.

  • expand=False: it returns a Series, Index, or DataFrame, depending on the subject and regular expression pattern (same behavior as pre-0.18.0).
  • expand=True: it always returns a DataFrame, which is more consistent and less confusing from the perspective of a user.

Currently the default is expand=None which gives a FutureWarning and uses expand=False. To avoid this warning, please explicitly specify expand.

In [1]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=None)
FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame)
but in a future version of pandas this will be changed to expand=True (return DataFrame)
Out[1]:
0      1
1      2
2    NaN
dtype: object

Extracting a regular expression with one group returns a Series if expand=False.

In [16]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
Out[16]:
0      1
1      2
2    NaN
dtype: object

It returns a DataFrame with one column if expand=True.

In [17]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
Out[17]:
     0
0    1
1    2
2  NaN

Calling on an Index with a regex with exactly one capture group returns an Index if expand=False.

In [18]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])

In [19]: s.index
Out[19]: Index([u'A11', u'B22', u'C33'], dtype='object')

In [20]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[20]: Index([u'A', u'B', u'C'], dtype='object', name=u'letter')

It returns a DataFrame with one column if expand=True.

In [21]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[21]:
  letter
0      A
1      B
2      C

Calling on an Index with a regex with more than one capture group raises ValueError if expand=False.

>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index

It returns a DataFrame if expand=True.

In [22]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[22]:
  letter   1
0      A  11
1      B  22
2      C  33

In summary, extract(expand=True) always returns a DataFrame with a row for every subject string, and a column for every capture group.

Addition of str.extractall

The .str.extractall method was added (GH11386). Unlike extract, which returns only the first match, extractall returns every match.

In [23]: s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [24]: s
Out[24]:
A    a1a2
B      b1
C      c1
dtype: object

In [25]: s.str.extract("(?P<letter>[ab])(?P<digit>\d)", expand=False)
Out[25]:
  letter digit
A      a     1
B      b     1
C    NaN   NaN

The extractall method returns all matches.

In [26]: s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
Out[26]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
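The shape of the extractall result can also be checked programmatically (a small sketch; the data and regex mirror the example above, and the inner index level is always named match):

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# One row per match: "a1a2" matches twice, "b1" once, "c1" not at all.
result = s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
assert len(result) == 3

# The original index is augmented with a 'match' level numbering
# the matches within each subject string.
assert list(result.index.names) == [None, "match"]
```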

Changes to str.cat

The method .str.cat() concatenates the members of a Series. Before, if NaN values were present in the Series, calling .str.cat() on it would return NaN, unlike the rest of the Series.str.* API. This behavior has been amended to ignore NaN values by default. (GH11435).

A new, friendlier ValueError is added to protect against the mistake of supplying the sep as an arg, rather than as a kwarg. (GH11334).

In [27]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ')
Out[27]: 'a b c'

In [28]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ', na_rep='?')
Out[28]: 'a b ? c'

In [2]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(' ')
ValueError: Did you mean to supply a `sep` keyword?

Datetimelike rounding

DatetimeIndex, Timestamp, TimedeltaIndex, Timedelta have gained the .round(), .floor() and .ceil() methods for datetimelike rounding, flooring and ceiling. (GH4314, GH11963)

Naive datetimes

In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3)

In [30]: dr
Out[30]:
DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400',
               '2013-01-03 09:12:56.123400'],
              dtype='datetime64[ns]', freq='D')

In [31]: dr.round('s')
Out[31]:
DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56',
               '2013-01-03 09:12:56'],
              dtype='datetime64[ns]', freq=None)

# Timestamp scalar
In [32]: dr[0]
Out[32]: Timestamp('2013-01-01 09:12:56.123400', freq='D')

In [33]: dr[0].round('10s')
Out[33]: Timestamp('2013-01-01 09:13:00')

Tz-aware datetimes are rounded, floored and ceiled in local times

In [34]: dr = dr.tz_localize('US/Eastern')

In [35]: dr
Out[35]:
DatetimeIndex(['2013-01-01 09:12:56.123400-05:00',
               '2013-01-02 09:12:56.123400-05:00',
               '2013-01-03 09:12:56.123400-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [36]: dr.round('s')
Out[36]:
DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00',
               '2013-01-03 09:12:56-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

Timedeltas

In [37]: t = timedelta_range('1 days 2 hr 13 min 45 us', periods=3, freq='d')

In [38]: t
Out[38]:
TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045',
                '3 days 02:13:00.000045'],
               dtype='timedelta64[ns]', freq='D')

In [39]: t.round('10min')
Out[39]: TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00', '3 days 02:10:00'], dtype='timedelta64[ns]', freq=None)

# Timedelta scalar
In [40]: t[0]
Out[40]: Timedelta('1 days 02:13:00.000045')

In [41]: t[0].round('2h')
Out[41]: Timedelta('1 days 02:00:00')

In addition, .round(), .floor() and .ceil() will be available through the .dt accessor of Series.

In [42]: s = pd.Series(dr)

In [43]: s
Out[43]:
0   2013-01-01 09:12:56.123400-05:00
1   2013-01-02 09:12:56.123400-05:00
2   2013-01-03 09:12:56.123400-05:00
dtype: datetime64[ns, US/Eastern]

In [44]: s.dt.round('D')
Out[44]:
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

Formatting of Integers in FloatIndex

Integers in FloatIndex, e.g. 1., are now formatted with a decimal point and a 0 digit, e.g. 1.0 (GH11713). This change not only affects the display in the console, but also the output of IO methods like .to_csv or .to_html.

Previous Behavior:

In [2]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [3]: s
Out[3]:
0    1
1    2
2    3
dtype: int64

In [4]: s.index
Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [5]: print(s.to_csv(path=None))
0,1
1,2
2,3

New Behavior:

In [45]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [46]: s
Out[46]:
0.0    1
1.0    2
2.0    3
dtype: int64

In [47]: s.index
Out[47]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [48]: print(s.to_csv(path=None))
0.0,1
1.0,2
2.0,3

Changes to dtype assignment behaviors

When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH10503)

Previous Behavior:

In [5]: df = pd.DataFrame({'a': [0, 1, 1],
                           'b': pd.Series([100, 200, 300], dtype='uint32')})

In [7]: df.dtypes
Out[7]:
a     int64
b    uint32
dtype: object

In [8]: ix = df['a'] == 1

In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [11]: df.dtypes
Out[11]:
a    int64
b    int64
dtype: object

New Behavior:

In [49]: df = pd.DataFrame({'a': [0, 1, 1],
   ....:                    'b': pd.Series([100, 200, 300], dtype='uint32')})
   ....:

In [50]: df.dtypes
Out[50]:
a     int64
b    uint32
dtype: object

In [51]: ix = df['a'] == 1

In [52]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [53]: df.dtypes
Out[53]:
a     int64
b    uint32
dtype: object

When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be downcasted to integer without losing precision, the dtype of the slice will be set to float instead of integer.

Previous Behavior:

In [4]: df = pd.DataFrame(np.array(range(1, 10)).reshape(3, 3),
                          columns=list('abc'),
                          index=[[4, 4, 8], [8, 10, 12]])

In [5]: df
Out[5]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [7]: df.ix[4, 'c'] = np.array([0., 1.])

In [8]: df
Out[8]:
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

New Behavior:

In [54]: df = pd.DataFrame(np.array(range(1, 10)).reshape(3, 3),
   ....:                   columns=list('abc'),
   ....:                   index=[[4, 4, 8], [8, 10, 12]])
   ....:

In [55]: df
Out[55]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [56]: df.ix[4, 'c'] = np.array([0., 1.])

In [57]: df
Out[57]:
      a  b    c
4 8   1  2  0.0
  10  4  5  1.0
8 12  7  8  9.0

to_xarray

In a future version of pandas, we will be deprecating Panel and other > 2 ndim objects. In order to provide for continuity, all NDFrame objects have gained the .to_xarray() method in order to convert to xarray objects, which have a pandas-like interface for > 2 ndim. (GH11972)

See the full xarray documentation here.

In [1]: p = Panel(np.arange(2*3*4).reshape(2, 3, 4))

In [2]: p.to_xarray()
Out[2]:
<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * items       (items) int64 0 1
  * major_axis  (major_axis) int64 0 1 2
  * minor_axis  (minor_axis) int64 0 1 2 3

Latex Representation

DataFrame has gained a ._repr_latex_() method in order to allow for conversion to LaTeX in an IPython/Jupyter notebook using nbconvert. (GH11778)

Note that this must be activated by setting the option pd.display.latex.repr=True (GH12182)

For example, if you have a Jupyter notebook you plan to convert to LaTeX using nbconvert, place the statement pd.display.latex.repr=True in the first cell to have the contained DataFrame output also stored as LaTeX.

The options display.latex.escape and display.latex.longtable have also been added to the configuration and are used automatically by the to_latex method. See the available options docs for more info.

pd.read_sas() changes

read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in entirety, or incrementally. For full details see here. (GH4052)

Other enhancements

  • Handle truncated floats in SAS xport files (GH11713)
  • Added option to hide index in Series.to_string (GH11729)
  • read_excel now supports s3 urls of the format s3://bucketname/filename (GH11447)
  • Added support for the AWS_S3_HOST environment variable when reading from s3 (GH12198)
  • A simple version of Panel.round() is now implemented (GH11763)
  • For Python 3.x, round(DataFrame), round(Series), round(Panel) will work (GH11763)
  • sys.getsizeof(obj) returns the memory usage of a pandas object, including the values it contains (GH11597)
  • Series gained an is_unique attribute (GH11946)
  • DataFrame.quantile and Series.quantile now accept the interpolation keyword (GH10174).
  • Added DataFrame.style.format for more flexible formatting of cell values (GH11692)
  • DataFrame.select_dtypes now allows the np.float16 typecode (GH11990)
  • pivot_table() now accepts most iterables for the values parameter (GH12017)
  • Added Google BigQuery service account authentication support, which enables authentication on remote servers. (GH11881, GH12572). For further details see here
  • HDFStore is now iterable: for k in store is equivalent to for k in store.keys() (GH12221).
  • Added missing methods/fields to .dt for Period (GH8848)
  • The entire codebase has been PEP-ified (GH12096)
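The sys.getsizeof enhancement above can be illustrated with a short, runnable check (a minimal sketch; exact byte counts vary by platform and pandas version):

```python
import sys

import numpy as np
import pandas as pd

# sys.getsizeof now accounts for the memory held by the underlying
# values, not just the Python object wrapper.
s = pd.Series(np.arange(1000, dtype='int64'))

# The values alone occupy 1000 * 8 = 8000 bytes, so the reported
# size must be at least that large.
assert sys.getsizeof(s) >= s.values.nbytes
```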

Backwards incompatible API changes

  • The leading whitespace has been removed from the output of the .to_string(index=False) method (GH11833)
  • The out parameter has been removed from the Series.round() method. (GH11763)
  • DataFrame.round() leaves non-numeric columns unchanged in its return, rather than raising. (GH11885)
  • DataFrame.head(0) and DataFrame.tail(0) return empty frames, rather than self. (GH11937)
  • Series.head(0) and Series.tail(0) return empty series, rather than self. (GH11937)
  • to_msgpack and read_msgpack encoding now defaults to 'utf-8'. (GH12170)
  • The order of keyword arguments to text file parsing functions (.read_csv(), .read_table(), .read_fwf()) changed to group related arguments. (GH11555)
  • NaTType.isoformat now returns the string 'NaT' to allow the result to be passed to the constructor of Timestamp. (GH12300)

NaT and Timedelta operations

NaT and Timedelta have expanded arithmetic operations, which are extended to Series arithmetic where applicable. Operations defined for datetime64[ns] or timedelta64[ns] are now also defined for NaT (GH11564).

NaT now supports arithmetic operations with integers and floats.

In [58]: pd.NaT * 1
Out[58]: NaT

In [59]: pd.NaT * 1.5
Out[59]: NaT

In [60]: pd.NaT / 2
Out[60]: NaT

In [61]: pd.NaT * np.nan
Out[61]: NaT

NaT defines more arithmetic operations withdatetime64[ns] andtimedelta64[ns].

In [62]: pd.NaT / pd.NaT
Out[62]: nan

In [63]: pd.Timedelta('1s') / pd.NaT
Out[63]: nan

NaT may represent either a datetime64[ns] null or a timedelta64[ns] null. Given the ambiguity, it is treated as a timedelta64[ns], which allows more operations to succeed.

In [64]: pd.NaT + pd.NaT
Out[64]: NaT

# same as
In [65]: pd.Timedelta('1s') + pd.Timedelta('1s')
Out[65]: Timedelta('0 days 00:00:02')

as opposed to

In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315')
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

However, when wrapped in a Series whose dtype is datetime64[ns] or timedelta64[ns], the dtype information is respected.

In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]')
TypeError: can only operate on a datetimes for subtraction,
           but the operator [__add__] was passed

In [66]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]')
Out[66]:
0   NaT
dtype: timedelta64[ns]

Timedelta division by floats now works.

In [67]: pd.Timedelta('1s') / 2.0
Out[67]: Timedelta('0 days 00:00:00.500000')

Subtracting a Series of Timedelta from a Timestamp works (GH11925)

In [68]: ser = pd.Series(pd.timedelta_range('1 day', periods=3))

In [69]: ser
Out[69]:
0   1 days
1   2 days
2   3 days
dtype: timedelta64[ns]

In [70]: pd.Timestamp('2012-01-01') - ser
Out[70]:
0   2011-12-31
1   2011-12-30
2   2011-12-29
dtype: datetime64[ns]

NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH12300).
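A short round-trip sketch of this behavior (NaT is a singleton, so an identity check suffices):

```python
import pandas as pd

# isoformat() on the missing-value sentinel yields the string 'NaT' ...
iso = pd.NaT.isoformat()

# ... which the Timestamp constructor accepts, rehydrating the
# NaT singleton.
rehydrated = pd.Timestamp(iso)
assert rehydrated is pd.NaT
```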

Changes to msgpack

Forward incompatible changes in msgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH12129, GH10527)

Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0 caused files packed in Python 2 to be unreadable by Python 3 (GH12142). The following table describes the backward and forward compatibility of msgpacks.

Warning

Packed with          Can be unpacked with
pre-0.17 / Python 2  any
pre-0.17 / Python 3  any
0.17 / Python 2      ==0.17 / Python 2, or >=0.18 / any Python
0.17 / Python 3      >=0.18 / any Python
0.18                 >=0.18

0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, which can only be unpacked in Python 2.

Signature change for .rank

Series.rank and DataFrame.rank now have the same signature (GH11759)

Previous signature

In [3]: pd.Series([0, 1]).rank(method='average', na_option='keep',
                              ascending=True, pct=False)
Out[3]:
0    1
1    2
dtype: float64

In [4]: pd.DataFrame([0, 1]).rank(axis=0, numeric_only=None,
                                 method='average', na_option='keep',
                                 ascending=True, pct=False)
Out[4]:
   0
0  1
1  2

New signature

In [71]: pd.Series([0, 1]).rank(axis=0, method='average', numeric_only=None,
   ....:                        na_option='keep', ascending=True, pct=False)
   ....:
Out[71]:
0    1.0
1    2.0
dtype: float64

In [72]: pd.DataFrame([0, 1]).rank(axis=0, method='average', numeric_only=None,
   ....:                           na_option='keep', ascending=True, pct=False)
   ....:
Out[72]:
     0
0  1.0
1  2.0

Bug in QuarterBegin with n=0

In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n parameter was 0. (GH11406)

The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.

In [73]: d = pd.Timestamp('2014-02-01')

In [74]: d
Out[74]: Timestamp('2014-02-01 00:00:00')

In [75]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[75]: Timestamp('2014-02-01 00:00:00')

In [76]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1)
Out[76]: Timestamp('2014-04-01 00:00:00')

For the QuarterBegin offset in previous versions, the date would instead be rolled backwards if the date was in the same month as the quarter start date.

In [3]: d = pd.Timestamp('2014-02-15')

In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[4]: Timestamp('2014-02-01 00:00:00')

This behavior has been corrected in version 0.18.0, which is consistent with other anchored offsets like MonthBegin and YearBegin.

In [77]: d = pd.Timestamp('2014-02-15')

In [78]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[78]: Timestamp('2014-05-01 00:00:00')

Resample API

Like the change in the window functions API above, .resample(...) is changing to have a more groupby-like API. (GH11732, GH12702, GH12202, GH12332, GH12334, GH12348, GH12448).

In [79]: np.random.seed(1234)

In [80]: df = pd.DataFrame(np.random.rand(10, 4),
   ....:                   columns=list('ABCD'),
   ....:                   index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
   ....:

In [81]: df
Out[81]:
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.622109  0.437728  0.785359
2010-01-01 09:00:01  0.779976  0.272593  0.276464  0.801872
2010-01-01 09:00:02  0.958139  0.875933  0.357817  0.500995
2010-01-01 09:00:03  0.683463  0.712702  0.370251  0.561196
2010-01-01 09:00:04  0.503083  0.013768  0.772827  0.882641
2010-01-01 09:00:05  0.364886  0.615396  0.075381  0.368824
2010-01-01 09:00:06  0.933140  0.651378  0.397203  0.788730
2010-01-01 09:00:07  0.316836  0.568099  0.869127  0.436173
2010-01-01 09:00:08  0.802148  0.143767  0.704261  0.704581
2010-01-01 09:00:09  0.218792  0.924868  0.442141  0.909316

Previous API:

You would write a resampling operation that immediately evaluates. If a how parameter was not provided, it would default to how='mean'.

In [6]: df.resample('2s')
Out[6]:
                         A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

You could also specify a how directly

In [7]: df.resample('2s', how='sum')
Out[7]:
                         A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

New API:

Now, you can write .resample(..) as a 2-stage operation like .groupby(...), which yields a Resampler.

In [82]: r = df.resample('2s')

In [83]: r
Out[83]: DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left, label=left, convention=start, base=0]
Downsampling

You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).

In [84]: r.mean()
Out[84]:
                            A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

In [85]: r.sum()
Out[85]:
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

Furthermore, resample now supports getitem operations to perform the resample on specific columns.

In [86]: r[['A', 'C']].mean()
Out[86]:
                            A         C
2010-01-01 09:00:00  0.485748  0.357096
2010-01-01 09:00:02  0.820801  0.364034
2010-01-01 09:00:04  0.433985  0.424104
2010-01-01 09:00:06  0.624988  0.633165
2010-01-01 09:00:08  0.510470  0.573201

and .aggregate type operations.

In [87]: r.agg({'A': 'mean', 'B': 'sum'})
Out[87]:
                            A         B
2010-01-01 09:00:00  0.485748  0.894701
2010-01-01 09:00:02  0.820801  1.588635
2010-01-01 09:00:04  0.433985  0.629165
2010-01-01 09:00:06  0.624988  1.219477
2010-01-01 09:00:08  0.510470  1.068634

These accessors can, of course, be combined

In [88]: r[['A', 'B']].agg(['mean', 'sum'])
Out[88]:
                            A                   B
                         mean       sum      mean       sum
2010-01-01 09:00:00  0.485748  0.971495  0.447351  0.894701
2010-01-01 09:00:02  0.820801  1.641602  0.794317  1.588635
2010-01-01 09:00:04  0.433985  0.867969  0.314582  0.629165
2010-01-01 09:00:06  0.624988  1.249976  0.609738  1.219477
2010-01-01 09:00:08  0.510470  1.020940  0.534317  1.068634
Upsampling

Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler objects' backfill(), ffill(), fillna() and asfreq() methods.

In [89]: s = pd.Series(np.arange(5, dtype='int64'),
   ....:               index=date_range('2010-01-01', periods=5, freq='Q'))
   ....:

In [90]: s
Out[90]:
2010-03-31    0
2010-06-30    1
2010-09-30    2
2010-12-31    3
2011-03-31    4
Freq: Q-DEC, dtype: int64

Previously

In [6]: s.resample('M', fill_method='ffill')
Out[6]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

New API

In [91]: s.resample('M').ffill()
Out[91]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

Note

In the new API, you can either downsample OR upsample. The prior implementation would allow you to pass an aggregator function (like mean) even though you were upsampling, causing some confusion.

The previous API will work, but with deprecations.

Warning

This new API for resample includes some internal changes to make the prior-to-0.18.0 API work with a deprecation warning in most cases, as the resample operation now returns a deferred object. We can intercept operations and just do what the (pre-0.18.0) API did (with a warning). Here is a typical use case:

In [4]: r = df.resample('2s')

In [6]: r * 10
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

Out[6]:
                            A         B         C         D
2010-01-01 09:00:00  4.857476  4.473507  3.570960  7.936154
2010-01-01 09:00:02  8.208011  7.943173  3.640340  5.310957
2010-01-01 09:00:04  4.339846  3.145823  4.241039  6.257326
2010-01-01 09:00:06  6.249881  6.097384  6.331650  6.124518
2010-01-01 09:00:08  5.104699  5.343172  5.732009  8.069486

However, getting and assignment operations directly on a Resampler will raise a ValueError:

In [7]: r.iloc[0] = 5
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

There is a situation where the new API cannot perform all the operations when using original code. This code intended to resample every 2s, take the mean AND then take the min of those results.

In [4]: df.resample('2s').min()
Out[4]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64

The new API will:

In [92]: df.resample('2s').min()
Out[92]:
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.272593  0.276464  0.785359
2010-01-01 09:00:02  0.683463  0.712702  0.357817  0.500995
2010-01-01 09:00:04  0.364886  0.013768  0.075381  0.368824
2010-01-01 09:00:06  0.316836  0.568099  0.397203  0.436173
2010-01-01 09:00:08  0.218792  0.143767  0.442141  0.704581

The good news is that the return dimensions will differ between the new API and the old API, so this should loudly raise an exception.

To replicate the original operation:

In [93]: df.resample('2s').mean().min()
Out[93]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64

Changes to eval

In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH9297, GH8664, GH10486)

In [94]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)})

In [95]: df
Out[95]:
      a  b
0   0.0  0
1   2.5  1
2   5.0  2
3   7.5  3
4  10.0  4
In [12]: df.eval('c = a + b')
FutureWarning: eval expressions containing an assignment currently default to operating inplace.
This will change in a future version of pandas, use inplace=True to avoid this warning.

In [13]: df
Out[13]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.

In [96]: df
Out[96]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In [97]: df.eval('d = c - b', inplace=False)
Out[97]:
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [98]: df
Out[98]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In [99]: df.eval('d = c - b', inplace=True)

In [100]: df
Out[100]:
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

Warning

For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment you should update to explicitly set inplace=True.

The inplace keyword parameter was also added to the query method.

In [101]: df.query('a > 5')
Out[101]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [102]: df.query('a > 5', inplace=True)

In [103]: df
Out[103]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

Warning

Note that the default value for inplace in a query is False, which is consistent with prior versions.

eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.

In [104]: df
Out[104]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [105]: df.eval("""
   .....: e = d + a
   .....: f = e - 22
   .....: g = f / 2.0""", inplace=True)
   .....:

In [106]: df
Out[106]:
      a  b     c     d     e    f    g
3   7.5  3  10.5   7.5  15.0 -7.0 -3.5
4  10.0  4  14.0  10.0  20.0 -2.0 -1.0

Other API Changes

  • DataFrame.between_time and Series.between_time now only parse a fixed set of time strings. Parsing of date strings is no longer supported and raises a ValueError. (GH11818)

    In [107]: s = pd.Series(range(10), pd.date_range('2015-01-01', freq='H', periods=10))

    In [108]: s.between_time("7:00am", "9:00am")
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-108-1f395af72989> in <module>()
    ----> 1 s.between_time("7:00am", "9:00am")

    /home/joris/scipy/pandas/pandas/core/generic.pyc in between_time(self, start_time, end_time, include_start, include_end)
       4054             indexer = self.index.indexer_between_time(
       4055                 start_time, end_time, include_start=include_start,
    -> 4056                 include_end=include_end)
       4057             return self.take(indexer, convert=False)
       4058         except AttributeError:

    /home/joris/scipy/pandas/pandas/tseries/index.pyc in indexer_between_time(self, start_time, end_time, include_start, include_end)
       1879         values_between_time : TimeSeries
       1880         """
    -> 1881         start_time = to_time(start_time)
       1882         end_time = to_time(end_time)
       1883         time_micros = self._get_time_micros()

    /home/joris/scipy/pandas/pandas/tseries/tools.pyc in to_time(arg, format, infer_time_format, errors)
        766         return _convert_listlike(arg, format)
        767
    --> 768     return _convert_listlike(np.array([arg]), format)[0]
        769
        770

    /home/joris/scipy/pandas/pandas/tseries/tools.pyc in _convert_listlike(arg, format)
        746                 elif errors == 'raise':
        747                     raise ValueError("Cannot convert arg {arg} to "
    --> 748                                      "a time".format(arg=arg))
        749                 elif errors == 'ignore':
        750                     return arg

    ValueError: Cannot convert arg ['7:00am'] to a time

    This will now raise.

    In [2]: s.between_time('20150101 07:00:00', '20150101 09:00:00')
    ValueError: Cannot convert arg ['20150101 07:00:00'] to a time.
  • .memory_usage() now includes values in the index, as does memory_usage in .info() (GH11597)

  • DataFrame.to_latex() now supports non-ascii encodings (e.g. utf-8) in Python 2 with the parameter encoding (GH7061)

  • pandas.merge() and DataFrame.merge() will show a specific error message when trying to merge with an object that is not of type DataFrame or a subclass (GH12081)

  • DataFrame.unstack and Series.unstack now take a fill_value keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting DataFrame. As an added benefit, specifying fill_value will preserve the data type of the original stacked data. (GH9746)
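As a minimal sketch of this keyword (the MultiIndex data here is hypothetical, not from the release notes): unstacking a Series with a missing level combination normally introduces NaN and upcasts integer data to float, while fill_value keeps the original dtype.

```python
import pandas as pd

# Hypothetical MultiIndex Series with one missing ('b', 'y') combination
idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')])
s = pd.Series([1, 2, 3], index=idx)

# With fill_value=0 the hole is filled and the int64 dtype is preserved,
# instead of becoming NaN with a float64 upcast
filled = s.unstack(fill_value=0)
```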

  • As part of the new API for window functions and resampling, aggregation functions have been clarified, raising more informative error messages on invalid aggregations. (GH9052). A full set of examples are presented in groupby.

  • Statistical functions for NDFrame objects (like sum(), mean(), min()) will now raise if non-numpy-compatible arguments are passed in for **kwargs (GH12301)

  • .to_latex and .to_html gain a decimal parameter like .to_csv; the default is '.' (GH12031)

  • More helpful error message when constructing aDataFrame with empty data but with indices (GH8020)

  • .describe() will now properly handle bool dtype as a categorical (GH6625)

  • More helpful error message with an invalid.transform with user defined input (GH10165)

  • Exponentially weighted functions now allow specifying alpha directly (GH10789) and raise ValueError if parameters violate 0 < alpha <= 1 (GH12492)
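A short sketch of this behaviour (the series is hypothetical): the smoothing factor can be given directly, and an out-of-range alpha is rejected up front.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Specify the smoothing factor alpha directly instead of com/span/halflife
smoothed = s.ewm(alpha=0.5).mean()

# Values outside 0 < alpha <= 1 raise ValueError at construction time
try:
    s.ewm(alpha=1.5)
    raised = False
except ValueError:
    raised = True
```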

Deprecations

  • The functions pd.rolling_*, pd.expanding_*, and pd.ewm* are deprecated and replaced by the corresponding method call. Note that the new suggested syntax includes all of the arguments (even if default) (GH11603)

    In [1]: s = pd.Series(range(3))

    In [2]: pd.rolling_mean(s, window=2, min_periods=1)
    FutureWarning: pd.rolling_mean is deprecated for Series and
         will be removed in a future version, replace with
         Series.rolling(min_periods=1, window=2, center=False).mean()
    Out[2]:
    0    0.0
    1    0.5
    2    1.5
    dtype: float64

    In [3]: pd.rolling_cov(s, s, window=2)
    FutureWarning: pd.rolling_cov is deprecated for Series and
         will be removed in a future version, replace with
         Series.rolling(window=2).cov(other=<Series>)
    Out[3]:
    0    NaN
    1    0.5
    2    0.5
    dtype: float64
  • The freq and how arguments to the .rolling, .expanding, and .ewm (new) functions are deprecated, and will be removed in a future version. You can simply resample the input prior to creating a window function. (GH11603).

    For example, instead of s.rolling(window=5, freq='D').max() to get the max value on a rolling 5-day window, one could use s.resample('D').mean().rolling(window=5).max(), which first resamples the data to daily data, then provides a rolling 5-day window.
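The suggested replacement can be sketched as follows (the 6-hourly series here is hypothetical, and a 2-day window is used to keep the example small):

```python
import numpy as np
import pandas as pd

# Hypothetical 6-hourly data spanning three calendar days
idx = pd.date_range('2016-01-01', periods=10, freq='6h')
s = pd.Series(np.arange(10.0), index=idx)

# First resample to daily means, then apply the rolling window
daily_roll_max = s.resample('D').mean().rolling(window=2).max()
```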

  • pd.tseries.frequencies.get_offset_name function is deprecated. Use offset’s.freqstr property as alternative (GH11192)

  • pandas.stats.fama_macbeth routines are deprecated and will be removed in a future version (GH6077)

  • pandas.stats.ols,pandas.stats.plm andpandas.stats.var routines are deprecated and will be removed in a future version (GH6077)

  • Show a FutureWarning rather than a DeprecationWarning on using long-time deprecated syntax in HDFStore.select, where the where clause is not string-like (GH12027)

  • The pandas.options.display.mpl_style configuration has been deprecated and will be removed in a future version of pandas. This functionality is better handled by matplotlib’s style sheets (GH11783).

Removal of deprecated float indexers

In GH4892 indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError. (GH12165, GH12333)

In [109]: s = pd.Series([1, 2, 3], index=[4, 5, 6])

In [110]: s
Out[110]:
4    1
5    2
6    3
dtype: int64

In [111]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [112]: s2
Out[112]:
a    1
b    2
c    3
dtype: int64

Previous Behavior:

# this is label indexing
In [2]: s[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 2

# this is positional indexing
In [3]: s.iloc[1.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[3]: 2

# this is label indexing
In [4]: s.loc[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[4]: 2

# .ix would coerce 1.0 to the positional 1, and index
In [5]: s2.ix[1.0] = 10
FutureWarning: scalar indexers for index type Index should be integers and not floating point

In [6]: s2
Out[6]:
a     1
b    10
c     3
dtype: int64

New Behavior:

For iloc, getting & setting via a float scalar will always raise.

In [3]: s.iloc[2.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>

Other indexers will coerce to a like integer for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].

In [113]: s[5.0]
Out[113]: 2

In [114]: s.loc[5.0]
Out[114]: 2

In [115]: s.ix[5.0]
Out[115]: 2

and setting

In [116]: s_copy = s.copy()

In [117]: s_copy[5.0] = 10

In [118]: s_copy
Out[118]:
4     1
5    10
6     3
dtype: int64

In [119]: s_copy = s.copy()

In [120]: s_copy.loc[5.0] = 10

In [121]: s_copy
Out[121]:
4     1
5    10
6     3
dtype: int64

In [122]: s_copy = s.copy()

In [123]: s_copy.ix[5.0] = 10

In [124]: s_copy
Out[124]:
4     1
5    10
6     3
dtype: int64

Positional setting with .ix and a float indexer will ADD this value to the index, rather than previously setting the value by position.

In [125]: s2.ix[1.0] = 10

In [126]: s2
Out[126]:
a       1
b       2
c       3
1.0    10
dtype: int64

Slicing will also coerce integer-like floats to integers for a non-Float64Index.

In [127]: s.loc[5.0:6]
Out[127]:
5    2
6    3
dtype: int64

In [128]: s.ix[5.0:6]
Out[128]:
5    2
6    3
dtype: int64

Note that for floats that are NOT coercible to ints, the label-based bounds will be excluded.

In [129]: s.loc[5.1:6]
Out[129]:
6    3
dtype: int64

In [130]: s.ix[5.1:6]
Out[130]:
6    3
dtype: int64

Float indexing on a Float64Index is unchanged.

In [131]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [132]: s[1.0]
Out[132]: 2

In [133]: s[1.0:2.5]
Out[133]:
1.0    2
2.0    3
dtype: int64

Removal of prior version deprecations/changes

  • Removal ofrolling_corr_pairwise in favor of.rolling().corr(pairwise=True) (GH4950)
  • Removal ofexpanding_corr_pairwise in favor of.expanding().corr(pairwise=True) (GH4950)
  • Removal ofDataMatrix module. This was not imported into the pandas namespace in any event (GH12111)
  • Removal ofcols keyword in favor ofsubset inDataFrame.duplicated() andDataFrame.drop_duplicates() (GH6680)
  • Removal of theread_frame andframe_query (both aliases forpd.read_sql)andwrite_frame (alias ofto_sql) functions in thepd.io.sql namespace,deprecated since 0.14.0 (GH6292).
  • Removal of theorder keyword from.factorize() (GH6930)

Performance Improvements

  • Improved performance ofandrews_curves (GH11534)
  • Improved hugeDatetimeIndex,PeriodIndex andTimedeltaIndex‘s ops performance includingNaT (GH10277)
  • Improved performance ofpandas.concat (GH11958)
  • Improved performance ofStataReader (GH11591)
  • Improved performance in construction ofCategoricals withSeries of datetimes containingNaT (GH12077)
  • Improved performance of ISO 8601 date parsing for dates without separators (GH11899), leading zeros (GH11871) and with whitespace preceding the time zone (GH9714)

Bug Fixes

  • Bug inGroupBy.size when data-frame is empty. (GH11699)
  • Bug inPeriod.end_time when a multiple of time period is requested (GH11738)
  • Regression in.clip with tz-aware datetimes (GH11838)
  • Bug indate_range when the boundaries fell on the frequency (GH11804,GH12409)
  • Bug in consistency of passing nested dicts to.groupby(...).agg(...) (GH9052)
  • Accept unicode inTimedelta constructor (GH11995)
  • Bug in value label reading forStataReader when reading incrementally (GH12014)
  • Bug in vectorizedDateOffset whenn parameter is0 (GH11370)
  • Compat for numpy 1.11 w.r.t.NaT comparison changes (GH12049)
  • Bug inread_csv when reading from aStringIO in threads (GH11790)
  • Bug in not treatingNaT as a missing value in datetimelikes when factorizing & withCategoricals (GH12077)
  • Bug in getitem when the values of aSeries were tz-aware (GH12089)
  • Bug inSeries.str.get_dummies when one of the variables was ‘name’ (GH12180)
  • Bug inpd.concat while concatenating tz-aware NaT series. (GH11693,GH11755,GH12217)
  • Bug inpd.read_stata with version <= 108 files (GH12232)
  • Bug inSeries.resample using a frequency ofNano when the index is aDatetimeIndex and contains non-zero nanosecond parts (GH12037)
  • Bug in resampling with.nunique and a sparse index (GH12352)
  • Removed some compiler warnings (GH12471)
  • Work around compat issues withboto in python 3.5 (GH11915)
  • Bug inNaT subtraction fromTimestamp orDatetimeIndex with timezones (GH11718)
  • Bug in subtraction ofSeries of a single tz-awareTimestamp (GH12290)
  • Use compat iterators in PY2 to support.next() (GH12299)
  • Bug inTimedelta.round with negative values (GH11690)
  • Bug in.loc againstCategoricalIndex may result in normalIndex (GH11586)
  • Bug inDataFrame.info when duplicated column names exist (GH11761)
  • Bug in.copy of datetime tz-aware objects (GH11794)
  • Bug inSeries.apply andSeries.map wheretimedelta64 was not boxed (GH11349)
  • Bug inDataFrame.set_index() with tz-awareSeries (GH12358)
  • Bug in subclasses ofDataFrame whereAttributeError did not propagate (GH11808)
  • Bug groupby on tz-aware data where selection not returningTimestamp (GH11616)
  • Bug inpd.read_clipboard andpd.to_clipboard functions not supporting Unicode; upgrade includedpyperclip to v1.5.15 (GH9263)
  • Bug inDataFrame.query containing an assignment (GH8664)
  • Bug infrom_msgpack where__contains__() fails for columns of the unpackedDataFrame, if theDataFrame has object columns. (GH11880)
  • Bug in.resample on categorical data withTimedeltaIndex (GH12169)
  • Bug in timezone info lost when broadcasting scalar datetime toDataFrame (GH11682)
  • Bug inIndex creation fromTimestamp with mixed tz coerces to UTC (GH11488)
  • Bug into_numeric where it does not raise if input is more than one dimension (GH11776)
  • Bug in parsing timezone offset strings with non-zero minutes (GH11708)
  • Bug indf.plot using incorrect colors for bar plots under matplotlib 1.5+ (GH11614)
  • Bug in thegroupbyplot method when using keyword arguments (GH11805).
  • Bug inDataFrame.duplicated anddrop_duplicates causing spurious matches when settingkeep=False (GH11864)
  • Bug in.loc result with duplicated key may haveIndex with incorrect dtype (GH11497)
  • Bug inpd.rolling_median where memory allocation failed even with sufficient memory (GH11696)
  • Bug inDataFrame.style with spurious zeros (GH12134)
  • Bug inDataFrame.style with integer columns not starting at 0 (GH12125)
  • Bug in.style.bar may not rendered properly using specific browser (GH11678)
  • Bug in rich comparison ofTimedelta with anumpy.array ofTimedelta that caused an infinite recursion (GH11835)
  • Bug inDataFrame.round dropping column index name (GH11986)
  • Bug indf.replace while replacing value in mixed dtypeDataframe (GH11698)
  • Bug inIndex prevents copying name of passedIndex, when a new name is not provided (GH11193)
  • Bug inread_excel failing to read any non-empty sheets when empty sheets exist andsheetname=None (GH11711)
  • Bug inread_excel failing to raiseNotImplemented error when keywordsparse_dates anddate_parser are provided (GH11544)
  • Bug inread_sql withpymysql connections failing to return chunked data (GH11522)
  • Bug in.to_csv ignoring formatting parametersdecimal,na_rep,float_format for float indexes (GH11553)
  • Bug inInt64Index andFloat64Index preventing the use of the modulo operator (GH9244)
  • Bug inMultiIndex.drop for not lexsorted multi-indexes (GH12078)
  • Bug inDataFrame when masking an emptyDataFrame (GH11859)
  • Bug in.plot potentially modifying thecolors input when the number of columns didn’t match the number of series provided (GH12039).
  • Bug inSeries.plot failing when index has aCustomBusinessDay frequency (GH7222).
  • Bug in.to_sql fordatetime.time values with sqlite fallback (GH8341)
  • Bug inread_excel failing to read data with one column whensqueeze=True (GH12157)
  • Bug inread_excel failing to read one empty column (GH12292,GH9002)
  • Bug in.groupby where aKeyError was not raised for a wrong column if there was only one row in the dataframe (GH11741)
  • Bug in.read_csv with dtype specified on empty data producing an error (GH12048)
  • Bug in.read_csv where strings like'2E' are treated as valid floats (GH12237)
  • Bug in buildingpandas with debugging symbols (GH12123)
  • Removedmillisecond property ofDatetimeIndex. This would always raise aValueError (GH12019).
  • Bug inSeries constructor with read-only data (GH11502)
  • Removedpandas.util.testing.choice(). Should usenp.random.choice(), instead. (GH12386)
  • Bug in.loc setitem indexer preventing the use of a TZ-aware DatetimeIndex (GH12050)
  • Bug in.style indexes and multi-indexes not appearing (GH11655)
  • Bug into_msgpack andfrom_msgpack which did not correctly serialize or deserializeNaT (GH12307).
  • Bug in.skew and.kurt due to roundoff error for highly similar values (GH11974)
  • Bug inTimestamp constructor where microsecond resolution was lost if HHMMSS were not separated with ‘:’ (GH10041)
  • Bug inbuffer_rd_bytes src->buffer could be freed more than once if reading failed, causing a segfault (GH12098)
  • Bug incrosstab where arguments with non-overlapping indexes would return aKeyError (GH10291)
  • Bug inDataFrame.apply in which reduction was not being prevented for cases in whichdtype was not a numpy dtype (GH12244)
  • Bug when initializing categorical series with a scalar value. (GH12336)
  • Bug when specifying a UTCDatetimeIndex by settingutc=True in.to_datetime (GH11934)
  • Bug when increasing the buffer size of CSV reader inread_csv (GH12494)
  • Bug when setting columns of aDataFrame with duplicate column names (GH12344)

v0.17.1 (November 21, 2015)

Note

We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.

This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

  • Support for Conditional HTML Formatting, seehere
  • Releasing the GIL on the csv reader & other ops, seehere
  • Fixed regression inDataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)

New features

Conditional HTML Formatting

Warning

This is a new feature and is under active development. We’ll be adding features and possibly making breaking changes in future releases. Feedback is welcome.

We’ve added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the styler class with the pandas.DataFrame.style attribute, an instance of Styler with your data attached.

Here’s a quick example:

In [1]: np.random.seed(123)

In [2]: df = DataFrame(np.random.randn(10, 5), columns=list('abcde'))

In [3]: html = df.style.background_gradient(cmap='viridis', low=.5)

We can render the HTML to get the following table.

abcde
0 -1.085631 0.997345 0.282978 -1.506295 -0.5786
1 1.651437 -2.426679 -0.428913 1.265936 -0.86674
2 -0.678886 -0.094709 1.49139 -0.638902 -0.443982
3 -0.434351 2.20593 2.186786 1.004054 0.386186
4 0.737369 1.490732 -0.935834 1.175829 -1.253881
5 -0.637752 0.907105 -1.428681 -0.140069 -0.861755
6 -0.255619 -2.798589 -1.771533 -0.699877 0.927462
7 -0.173636 0.002846 0.688223 -0.879536 0.283627
8 -0.805367 -1.727669 -0.3909 0.573806 0.338589
9 -0.01183 2.392365 0.412912 0.978736 2.238143

Styler interacts nicely with the Jupyter Notebook.See thedocumentation for more.

Enhancements

  • DatetimeIndex now supports conversion to strings with astype(str) (GH10442)
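A quick sketch of this conversion (the index is hypothetical): each element becomes its string representation in an object-dtype Index.

```python
import pandas as pd

# Hypothetical two-day DatetimeIndex
idx = pd.date_range('2015-01-01', periods=2)

# Element-wise conversion to strings
as_str = idx.astype(str)
```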

  • Support for compression (gzip/bz2) in pandas.DataFrame.to_csv() (GH7615)
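A minimal round-trip sketch (the frame and file name are hypothetical): write a gzip-compressed CSV and read it back with the matching compression keyword of read_csv.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
path = os.path.join(tempfile.mkdtemp(), 'frame.csv.gz')  # hypothetical path

# Write gzip-compressed CSV, then read it back
df.to_csv(path, compression='gzip', index=False)
back = pd.read_csv(path, compression='gzip')
```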

  • pd.read_* functions can now also accept pathlib.Path, or py._path.local.LocalPath objects for the filepath_or_buffer argument. (GH11033)
  • The DataFrame and Series functions .to_csv(), .to_html() and .to_latex() can now handle paths beginning with tildes (e.g. ~/Documents/) (GH11438)

  • DataFrame now uses the fields of anamedtuple as columns, if columns are not supplied (GH11181)

  • DataFrame.itertuples() now returnsnamedtuple objects, when possible. (GH11269,GH11625)
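Both behaviours can be sketched together (Point is a hypothetical namedtuple, not from the release notes):

```python
from collections import namedtuple

import pandas as pd

Point = namedtuple('Point', ['x', 'y'])

# The constructor picks up the namedtuple fields as column names
df = pd.DataFrame([Point(1, 2), Point(3, 4)])

# itertuples yields namedtuple rows with the columns as attributes
rows = list(df.itertuples(index=False))
```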

  • Addedaxvlines_kwds to parallel coordinates plot (GH10709)

  • Option to .info() and .memory_usage() to provide for deep introspection of memory consumption. Note that this can be expensive to compute and therefore is an optional parameter. (GH11595)

    In [4]: df = DataFrame({'A': ['foo'] * 1000})

    In [5]: df['B'] = df['A'].astype('category')

    # shows the '+' as we have object dtypes
    In [6]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 2 columns):
    A    1000 non-null object
    B    1000 non-null category
    dtypes: category(1), object(1)
    memory usage: 8.9+ KB

    # we have an accurate memory assessment (but can be expensive to compute this)
    In [7]: df.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 2 columns):
    A    1000 non-null object
    B    1000 non-null category
    dtypes: category(1), object(1)
    memory usage: 48.0 KB
  • Index now has afillna method (GH10089)

    In [8]: pd.Index([1, np.nan, 3]).fillna(2)
    Out[8]: Float64Index([1.0, 2.0, 3.0], dtype='float64')
  • Series of type category now make .str.<...> and .dt.<...> accessor methods / properties available, if the categories are of that type. (GH10661)

    In [9]: s = pd.Series(list('aabb')).astype('category')

    In [10]: s
    Out[10]:
    0    a
    1    a
    2    b
    3    b
    dtype: category
    Categories (2, object): [a, b]

    In [11]: s.str.contains("a")
    Out[11]:
    0     True
    1     True
    2    False
    3    False
    dtype: bool

    In [12]: date = pd.Series(pd.date_range('1/1/2015', periods=5)).astype('category')

    In [13]: date
    Out[13]:
    0   2015-01-01
    1   2015-01-02
    2   2015-01-03
    3   2015-01-04
    4   2015-01-05
    dtype: category
    Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

    In [14]: date.dt.day
    Out[14]:
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
  • pivot_table now has a margins_name argument so you can use something other than the default of ‘All’ (GH3335)
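A minimal sketch of the keyword (the data is hypothetical): label the margins row 'Total' instead of the default 'All'.

```python
import pandas as pd

# Hypothetical grouped data
df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 3]})

# margins_name renames the margins row/column
table = pd.pivot_table(df, index='k', values='v',
                       aggfunc='sum', margins=True, margins_name='Total')
```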

  • Implement export ofdatetime64[ns,tz] dtypes with a fixed HDF5 store (GH11411)

  • Pretty printing sets (e.g. in DataFrame cells) now uses set literal syntax ({x, y}) instead of Legacy Python syntax (set([x, y])) (GH11215)

  • Improve the error message inpandas.io.gbq.to_gbq() when a streaming insert fails (GH11285)and when the DataFrame does not match the schema of the destination table (GH11359)

API changes

  • raiseNotImplementedError inIndex.shift for non-supported index types (GH8038)
  • min and max reductions on datetime64 and timedelta64 dtyped series now result in NaT and not nan (GH11245).
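A short sketch of this change (the series is hypothetical): reducing an all-NaT datetime series now returns NaT rather than nan.

```python
import pandas as pd

# Hypothetical all-NaT datetime series
s = pd.Series(pd.to_datetime(['NaT', 'NaT']))

# The reduction result is NaT (a missing datetime), not nan
result = s.min()
```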
  • Indexing with a null key will raise aTypeError, instead of aValueError (GH11356)
  • Series.ptp will now ignore missing values by default (GH11163)

Deprecations

  • The pandas.io.ga module which implements google-analytics support is deprecated and will be removed in a future version (GH11308)
  • Deprecate theengine keyword in.to_csv(), which will be removed in a future version (GH11274)

Performance Improvements

  • Checking monotonic-ness before sorting on an index (GH11080)
  • Series.dropna performance improvement when its dtype can’t containNaN (GH11159)
  • Release the GIL on most datetime field operations (e.g.DatetimeIndex.year,Series.dt.year), normalization, and conversion to and fromPeriod,DatetimeIndex.to_period andPeriodIndex.to_timestamp (GH11263)
  • Release the GIL on some rolling algos:rolling_median,rolling_mean,rolling_max,rolling_min,rolling_var,rolling_kurt,rolling_skew (GH11450)
  • Release the GIL when reading and parsing text files inread_csv,read_table (GH11272)
  • Improved performance ofrolling_median (GH11450)
  • Improved performance ofto_excel (GH11352)
  • Performance bug in repr ofCategorical categories, which was rendering the strings before chopping them for display (GH11305)
  • Performance improvement inCategorical.remove_unused_categories, (GH11643).
  • Improved performance ofSeries constructor with no data andDatetimeIndex (GH11433)
  • Improved performance ofshift,cumprod, andcumsum with groupby (GH4095)

Bug Fixes

  • SparseArray.__iter__() now does not causePendingDeprecationWarning in Python 3.5 (GH11622)
  • Regression from 0.16.2 for output formatting of long floats/nan, restored in (GH11302)
  • Series.sort_index() now correctly handles theinplace option (GH11402)
  • Incorrectly distributed .c file in the build on PyPI; when reading a csv of floats and passing na_values=<a scalar> an exception would be shown (GH11374)
  • Bug in.to_latex() output broken when the index has a name (GH10660)
  • Bug in HDFStore.append with strings whose encoded length exceeded the max unencoded length (GH11234)
  • Bug in mergingdatetime64[ns,tz] dtypes (GH11405)
  • Bug inHDFStore.select when comparing with a numpy scalar in a where clause (GH11283)
  • Bug in usingDataFrame.ix with a multi-index indexer (GH11372)
  • Bug in date_range with ambiguous endpoints (GH11626)
  • Prevent adding new attributes to the accessors .str, .dt and .cat. Retrieving such a value was not possible, so error out on setting it. (GH10673)
  • Bug in tz-conversions with an ambiguous time and.dt accessors (GH11295)
  • Bug in output formatting when using an index of ambiguous times (GH11619)
  • Bug in comparisons of Series vs list-likes (GH11339)
  • Bug inDataFrame.replace with adatetime64[ns,tz] and a non-compat to_replace (GH11326,GH11153)
  • Bug inisnull wherenumpy.datetime64('NaT') in anumpy.array was not determined to be null(GH11206)
  • Bug in list-like indexing with a mixed-integer Index (GH11320)
  • Bug inpivot_table withmargins=True when indexes are ofCategorical dtype (GH10993)
  • Bug inDataFrame.plot cannot use hex strings colors (GH10299)
  • Regression inDataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)
  • Bug inpd.eval where unary ops in a list error (GH11235)
  • Bug insqueeze() with zero length arrays (GH11230,GH8999)
  • Bug indescribe() dropping column names for hierarchical indexes (GH11517)
  • Bug inDataFrame.pct_change() not propagatingaxis keyword on.fillna method (GH11150)
  • Bug in.to_csv() when a mix of integer and string column names are passed as thecolumns parameter (GH11637)
  • Bug in indexing with arange, (GH11652)
  • Bug in inference of numpy scalars and preserving dtype when setting columns (GH11638)
  • Bug into_sql using unicode column names giving UnicodeEncodeError with (GH11431).
  • Fix regression in setting ofxticks inplot (GH11529).
  • Bug inholiday.dates where observance rules could not be applied to holiday and doc enhancement (GH11477,GH11533)
  • Fix plotting issues when having plainAxes instances instead ofSubplotAxes (GH11520,GH11556).
  • Bug inDataFrame.to_latex() produces an extra rule whenheader=False (GH7124)
  • Bug indf.groupby(...).apply(func) when a func returns aSeries containing a new datetimelike column (GH11324)
  • Bug inpandas.json when file to load is big (GH11344)
  • Bugs into_excel with duplicate columns (GH11007,GH10982,GH10970)
  • Fixed a bug that prevented the construction of an empty series of dtypedatetime64[ns,tz] (GH11245).
  • Bug inread_excel with multi-index containing integers (GH11317)
  • Bug into_excel with openpyxl 2.2+ and merging (GH11408)
  • Bug inDataFrame.to_dict() produces anp.datetime64 object instead ofTimestamp when only datetime is present in data (GH11327)
  • Bug inDataFrame.corr() raises exception when computes Kendall correlation for DataFrames with boolean and not boolean columns (GH11560)
  • Bug in the link-time error caused by Cinline functions on FreeBSD 10+ (withclang) (GH10510)
  • Bug inDataFrame.to_csv in passing through arguments for formattingMultiIndexes, includingdate_format (GH7791)
  • Bug inDataFrame.join() withhow='right' producing aTypeError (GH11519)
  • Bug inSeries.quantile with empty list results hasIndex withobject dtype (GH11588)
  • Bug inpd.merge results in emptyInt64Index rather thanIndex(dtype=object) when the merge result is empty (GH11588)
  • Bug inCategorical.remove_unused_categories when havingNaN values (GH11599)
  • Bug inDataFrame.to_sparse() loses column names for MultiIndexes (GH11600)
  • Bug inDataFrame.round() with non-unique column index producing a Fatal Python error (GH11611)
  • Bug inDataFrame.round() withdecimals being a non-unique indexed Series producing extra columns (GH11618)

v0.17.0 (October 9, 2015)

This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)

Warning

The pandas.io.data package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be updated independently of your pandas installation. The API for pandas-datareader v0.1.1 is exactly the same as in pandas v0.17.0 (GH8961, GH10861).

After installing pandas-datareader, you can easily change your imports:

from pandas.io import data, wb

becomes

from pandas_datareader import data, wb

Highlights include:

  • Release the Global Interpreter Lock (GIL) on some cython operations, see here
  • Plotting methods are now available as attributes of the .plot accessor, see here
  • The sorting API has been revamped to remove some long-time inconsistencies, see here
  • Support for a datetime64[ns] with timezones as a first-class dtype, see here
  • The default for to_datetime will now be to raise when presented with unparseable formats; previously this would return the original input. Also, date parse functions now return consistent results. See here
  • The default for dropna in HDFStore has changed to False, to store by default all rows even if they are all NaN, see here
  • Datetime accessor (dt) now supports Series.dt.strftime to generate formatted strings for datetime-likes, and Series.dt.total_seconds to return the duration of each timedelta in seconds. See here
  • Period and PeriodIndex can handle a multiplied freq like 3D, which corresponds to a span of 3 days. See here
  • Development installed versions of pandas will now have PEP440 compliant version strings (GH9518)
  • Development support for benchmarking with the Air Speed Velocity library (GH8361)
  • Support for reading SAS xport files, see here
  • Documentation comparing SAS to pandas, see here
  • Removal of the automatic TimeSeries broadcasting, deprecated since 0.8.0, see here
  • Display format with plain text can optionally align with Unicode East Asian Width, see here
  • Compatibility with Python 3.5 (GH11097)
  • Compatibility with matplotlib 1.5.0 (GH11111)

Check the API Changes and deprecations before updating.

New features

Datetime with TZ

We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).

The new implementation allows a single timezone across all rows, with operations performed in a performant manner.

In [1]:df=DataFrame({'A':date_range('20130101',periods=3),   ...:'B':date_range('20130101',periods=3,tz='US/Eastern'),   ...:'C':date_range('20130101',periods=3,tz='CET')})   ...:In [2]:dfOut[2]:           A                         B                         C0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:001 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:002 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00In [3]:df.dtypesOut[3]:A                datetime64[ns]B    datetime64[ns, US/Eastern]C           datetime64[ns, CET]dtype: object
In [4]:df.BOut[4]:0   2013-01-01 00:00:00-05:001   2013-01-02 00:00:00-05:002   2013-01-03 00:00:00-05:00Name: B, dtype: datetime64[ns, US/Eastern]In [5]:df.B.dt.tz_localize(None)Out[5]:0   2013-01-011   2013-01-022   2013-01-03Name: B, dtype: datetime64[ns]

This uses a new dtype representation as well, which is very similar in look-and-feel to its numpy cousin datetime64[ns]

In [6]: df['B'].dtype
Out[6]: datetime64[ns, US/Eastern]
In [7]: type(df['B'].dtype)
Out[7]: pandas.types.dtypes.DatetimeTZDtype

Note

There is a slightly different string repr for the underlying DatetimeIndex as a result of the dtype changes, but functionally these are the same.

Previous Behavior:

In [1]: pd.date_range('20130101', periods=3, tz='US/Eastern')
Out[1]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns]', freq='D', tz='US/Eastern')
In [2]: pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')

New Behavior:

In [8]: pd.date_range('20130101', periods=3, tz='US/Eastern')
Out[8]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')
In [9]: pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
Out[9]: datetime64[ns, US/Eastern]

Releasing the GIL

We are releasing the global interpreter lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH8882)

For example, the groupby expression in the following code will have the GIL released during the factorization step, i.e. df.groupby('key'), as well as during the .sum() operation.

N = 1000000
ngroups = 10
df = DataFrame({'key': np.random.randint(0, ngroups, size=N),
                'data': np.random.randn(N)})
df.groupby('key')['data'].sum()

Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT), or one performing multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.
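As a minimal sketch of multi-threaded use (the helper name group_sum, the thread count, and the frame sizes are illustrative, not from the release notes), two groupby aggregations can be submitted to a thread pool; with the GIL released during factorization and summation the calls may overlap on separate cores:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def group_sum(frame):
    # factorization and summation can run with the GIL released
    return frame.groupby('key')['data'].sum()

np.random.seed(0)
frames = [
    pd.DataFrame({'key': np.random.randint(0, 10, size=100000),
                  'data': np.random.randn(100000)})
    for _ in range(2)
]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(group_sum, frames))
```

Each result is a Series of per-key sums; whether wall-clock time actually improves depends on the pandas build and the size of the data.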

Plot submethods

The Series and DataFrame .plot() methods allow for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.

To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):

In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
In [11]: df.plot.bar()
_images/whatsnew_plot_submethods.png

As a result of this change, these methods are now all discoverable via tab-completion:

In [12]:df.plot.<TAB>df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatterdf.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie

Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the newPlotting API documentation.

Additional methods fordt accessor

strftime

We are now supporting aSeries.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:

# DatetimeIndexIn [13]:s=pd.Series(pd.date_range('20130101',periods=4))In [14]:sOut[14]:0   2013-01-011   2013-01-022   2013-01-033   2013-01-04dtype: datetime64[ns]In [15]:s.dt.strftime('%Y/%m/%d')Out[15]:0    2013/01/011    2013/01/022    2013/01/033    2013/01/04dtype: object
# PeriodIndexIn [16]:s=pd.Series(pd.period_range('20130101',periods=4))In [17]:sOut[17]:0   2013-01-011   2013-01-022   2013-01-033   2013-01-04dtype: objectIn [18]:s.dt.strftime('%Y/%m/%d')Out[18]:0    2013/01/011    2013/01/022    2013/01/033    2013/01/04dtype: object

The string format follows the Python standard library; details can be found here

total_seconds

pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)

# TimedeltaIndexIn [19]:s=pd.Series(pd.timedelta_range('1 minutes',periods=4))In [20]:sOut[20]:0   0 days 00:01:001   1 days 00:01:002   2 days 00:01:003   3 days 00:01:00dtype: timedelta64[ns]In [21]:s.dt.total_seconds()Out[21]:0        60.01     86460.02    172860.03    259260.0dtype: float64

Period Frequency Enhancement

Period, PeriodIndex and period_range can now accept a multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as a DateOffset instance like DatetimeIndex, and not as str (GH7811)

A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.

In [22]: p = pd.Period('2015-08-01', freq='3D')
In [23]: p
Out[23]: Period('2015-08-01', '3D')
In [24]: p + 1
Out[24]: Period('2015-08-04', '3D')
In [25]: p - 2
Out[25]: Period('2015-07-26', '3D')
In [26]: p.to_timestamp()
Out[26]: Timestamp('2015-08-01 00:00:00')
In [27]: p.to_timestamp(how='E')
Out[27]: Timestamp('2015-08-03 00:00:00')

You can use the multiplied freq in PeriodIndex and period_range.

In [28]: idx = pd.period_range('2015-08-01', periods=4, freq='2D')
In [29]: idx
Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'], dtype='period[2D]', freq='2D')
In [30]: idx + 1
Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'], dtype='period[2D]', freq='2D')

Support for SAS XPORT files

read_sas() provides support for readingSAS XPORT format files. (GH4052).

df = pd.read_sas('sas_xport.xpt')

It is also possible to obtain an iterator and read an XPORT fileincrementally.

for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)

See thedocs for more details.

Support for Math Functions in .eval()

eval() now supports calling math functions (GH4893)

df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")

The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

These functions map to the intrinsics for theNumExpr engine. For the Pythonengine, they are mapped toNumPy calls.
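A minimal sketch of the above (column names are illustrative, and this assumes a pandas version where eval with an assignment returns a new DataFrame rather than modifying in place); with NumExpr installed the default engine uses its intrinsics, and engine='python' falls back to the NumPy calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.linspace(0.0, 1.0, 5)})
# assign a new column computed with a math function inside eval()
out = df.eval('b = sin(a)')
```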

Changes to Excel withMultiIndex

In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)

See thedocumentation for more details.

In [31]:df=pd.DataFrame([[1,2,3,4],[5,6,7,8]],   ....:columns=pd.MultiIndex.from_product([['foo','bar'],['a','b']],   ....:names=['col1','col2']),   ....:index=pd.MultiIndex.from_product([['j'],['l','k']],   ....:names=['i1','i2']))   ....:In [32]:dfOut[32]:col1  foo    barcol2    a  b   a  bi1 i2j  l    1  2   3  4   k    5  6   7  8In [33]:df.to_excel('test.xlsx')In [34]:df=pd.read_excel('test.xlsx',header=[0,1],index_col=[0,1])In [35]:dfOut[35]:col1  foo    barcol2    a  b   a  bi1 i2j  l    1  2   3  4   k    5  6   7  8

Previously, it was necessary to specify the has_index_names argument in read_excel if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary; the change is shown below.

Old

_images/old-excel-index.png

New

_images/new-excel-index.png

Warning

Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.

Google BigQuery Enhancements

  • Added ability to automatically create a table/dataset using thepandas.io.gbq.to_gbq() function if the destination table/dataset does not exist. (GH8325,GH11121).
  • Added ability to replace an existing table and schema when calling thepandas.io.gbq.to_gbq() function via theif_exists argument. See thedocs for more details (GH8325).
  • InvalidColumnOrder andInvalidPageToken in the gbq module will raiseValueError instead ofIOError.
  • Thegenerate_bq_schema() function is now deprecated and will be removed in a future version (GH11121)
  • The gbq module will now support Python 3 (GH11094).

Display Alignment with Unicode East Asian Width

Warning

Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower). Use only when it is actually required.

Some East Asian languages use Unicode characters whose display width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options have been added to enable precise handling of these characters.

  • display.unicode.east_asian_width: Whether to use the Unicode East Asian Width to calculate the display text width. (GH2612)
  • display.unicode.ambiguous_as_wide: Whether to handle Unicode characters belonging to the Ambiguous width category as Wide. (GH11102)
In [36]: df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})
In [37]: df;
_images/option_unicode01.png
In [38]: pd.set_option('display.unicode.east_asian_width', True)
In [39]: df;
_images/option_unicode02.png

For further details, seehere

Other enhancements

  • Support foropenpyxl >= 2.2. The API for style support is now stable (GH10125)

  • merge now accepts the argument indicator which adds a Categorical-type column (by default called _merge) to the output object that takes on the following values (GH8790)

    Observation Origin               _merge value
    Merge key only in 'left' frame   left_only
    Merge key only in 'right' frame  right_only
    Merge key in both frames         both
    In [40]:df1=pd.DataFrame({'col1':[0,1],'col_left':['a','b']})In [41]:df2=pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})In [42]:pd.merge(df1,df2,on='col1',how='outer',indicator=True)Out[42]:   col1 col_left  col_right      _merge0     0        a        NaN   left_only1     1        b        2.0        both2     2      NaN        2.0  right_only3     2      NaN        2.0  right_only

    For more, see theupdated docs

  • pd.to_numeric is a new function to convert strings to numbers (with optional coercion of parse errors) (GH11133)

  • pd.merge will now allow duplicate column names if they are not merged upon (GH10639).

  • pd.pivot will now allow passing index asNone (GH3962).

  • pd.concat will now use existing Series names if provided (GH10698).

    In [43]:foo=pd.Series([1,2],name='foo')In [44]:bar=pd.Series([1,2])In [45]:baz=pd.Series([4,5])

    Previous Behavior:

    In [1] pd.concat([foo, bar, baz], 1)Out[1]:      0  1  2   0  1  1  4   1  2  2  5

    New Behavior:

    In [46]:pd.concat([foo,bar,baz],1)Out[46]:   foo  0  10    1  1  41    2  2  5
  • DataFrame has gained thenlargest andnsmallest methods (GH10393)

  • Add alimit_direction keyword argument that works withlimit to enableinterpolate to fillNaN values forward, backward, or both (GH9218,GH10420,GH11115)

    In [47]:ser=pd.Series([np.nan,np.nan,5,np.nan,np.nan,np.nan,13])In [48]:ser.interpolate(limit=1,limit_direction='both')Out[48]:0     NaN1     5.02     5.03     7.04     NaN5    11.06    13.0dtype: float64
  • Added aDataFrame.round method to round the values to a variable number of decimal places (GH10568).

    In [49]:df=pd.DataFrame(np.random.random([3,3]),columns=['A','B','C'],   ....:index=['first','second','third'])   ....:In [50]:dfOut[50]:               A         B         Cfirst   0.342764  0.304121  0.417022second  0.681301  0.875457  0.510422third   0.669314  0.585937  0.624904In [51]:df.round(2)Out[51]:           A     B     Cfirst   0.34  0.30  0.42second  0.68  0.88  0.51third   0.67  0.59  0.62In [52]:df.round({'A':0,'C':2})Out[52]:          A         B     Cfirst   0.0  0.304121  0.42second  1.0  0.875457  0.51third   1.0  0.585937  0.62
  • drop_duplicates andduplicated now accept akeep keyword to target first, last, and all duplicates. Thetake_last keyword is deprecated, seehere (GH6511,GH8505)

    In [53]:s=pd.Series(['A','B','C','A','B','D'])In [54]:s.drop_duplicates()Out[54]:0    A1    B2    C5    Ddtype: objectIn [55]:s.drop_duplicates(keep='last')Out[55]:2    C3    A4    B5    Ddtype: objectIn [56]:s.drop_duplicates(keep=False)Out[56]:2    C5    Ddtype: object
  • Reindex now has atolerance argument that allows for finer control ofLimits on filling while reindexing (GH10411):

    In [57]:df=pd.DataFrame({'x':range(5),   ....:'t':pd.date_range('2000-01-01',periods=5)})   ....:In [58]:df.reindex([0.1,1.9,3.5],   ....:method='nearest',   ....:tolerance=0.2)   ....:Out[58]:             t    x0.1 2000-01-01  0.01.9 2000-01-03  2.03.5        NaT  NaN

    When used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with a string:

    In [59]:df=df.set_index('t')In [60]:df.reindex(pd.to_datetime(['1999-12-31']),   ....:method='nearest',   ....:tolerance='1 day')   ....:Out[60]:            x1999-12-31  0

    tolerance is also exposed by the lower levelIndex.get_indexer andIndex.get_loc methods.

  • Added functionality to use thebase argument when resampling aTimeDeltaIndex (GH10530)

  • DatetimeIndex can be instantiated using strings containing NaT (GH7599)

  • to_datetime can now accept theyearfirst keyword (GH7599)

  • pandas.tseries.offsets larger than theDay offset can now be used with aSeries for addition/subtraction (GH10699). See thedocs for more details.

  • pd.Timedelta.total_seconds() now returns Timedelta duration to ns precision (previously microsecond precision) (GH10939)

  • PeriodIndex now supports arithmetic withnp.ndarray (GH10638)

  • Support pickling ofPeriod objects (GH10439)

  • .as_blocks will now take acopy optional argument to return a copy of the data, default is to copy (no change in behavior from prior versions), (GH9607)

  • regex argument toDataFrame.filter now handles numeric column names instead of raisingValueError (GH10384).

  • Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (GH8685)

  • Enable writing Excel files inmemory using StringIO/BytesIO (GH7074)

  • Enable serialization of lists and dicts to strings inExcelWriter (GH8188)

  • SQL io functions now accept a SQLAlchemy connectable. (GH7877)

  • pd.read_sql andto_sql can accept database URI ascon parameter (GH10214)

  • read_sql_table will now allow reading from views (GH10750).

  • Enable writing complex values toHDFStores when using thetable format (GH10447)

  • Enablepd.read_hdf to be used without specifying a key when the HDF file contains a single dataset (GH10443)

  • pd.read_stata will now read Stata 118 type files. (GH9882)

  • msgpack submodule has been updated to 0.4.6 with backward compatibility (GH10581)

  • DataFrame.to_dict now acceptsorient='index' keyword argument (GH10844).

  • DataFrame.apply will return a Series of dicts if the passed function returns a dict andreduce=True (GH8735).

  • Allow passingkwargs to the interpolation methods (GH10378).

  • Improved error message when concatenating an empty iterable ofDataframe objects (GH9157)

  • pd.read_csv can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (GH11070,GH11072).

  • In pd.read_csv, recognize s3n:// and s3a:// URLs as designating S3 file storage (GH11070, GH11071).

  • Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (GH11070,GH11073)

  • pd.read_csv is now able to infer compression type for files read from AWS S3 storage (GH11070,GH11074).
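Among the enhancements above, pd.to_numeric can be sketched as follows (the sample values are illustrative); with errors='coerce', unparseable strings become NaN instead of raising:

```python
import pandas as pd

s = pd.Series(['1.0', '2', 'apple'])
coerced = pd.to_numeric(s, errors='coerce')  # 'apple' -> NaN
```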

Backwards incompatible API changes

Changes to sorting API

The sorting API has had some longtime inconsistencies. (GH9816,GH8239).

Here is a summary of the APIPRIOR to 0.17.0:

  • Series.sort isINPLACE whileDataFrame.sort returns a new object.
  • Series.order returns a new object
  • It was possible to useSeries/DataFrame.sort_index to sort byvalues by passing theby keyword.
  • Series/DataFrame.sortlevel worked only on aMultiIndex for sorting by index.

To address these issues, we have revamped the API:

  • We have introduced a new method,DataFrame.sort_values(), which is the merger ofDataFrame.sort(),Series.sort(),andSeries.order(), to handle sorting ofvalues.
  • The existing methodsSeries.sort(),Series.order(), andDataFrame.sort() have been deprecated and will be removed in afuture version.
  • Theby argument ofDataFrame.sort_index() has been deprecated and will be removed in a future version.
  • The existing method.sort_index() will gain thelevel keyword to enable level sorting.

We now have two distinct and non-overlapping methods of sorting. A * marks items that will show a FutureWarning.

To sort by thevalues:

Previous                       Replacement
* Series.order()               Series.sort_values()
* Series.sort()                Series.sort_values(inplace=True)
* DataFrame.sort(columns=...)  DataFrame.sort_values(by=...)

To sort by theindex:

Previous                        Replacement
Series.sort_index()             Series.sort_index()
Series.sortlevel(level=...)     Series.sort_index(level=...)
DataFrame.sort_index()          DataFrame.sort_index()
DataFrame.sortlevel(level=...)  DataFrame.sort_index(level=...)
* DataFrame.sort()              DataFrame.sort_index()

We have also deprecated and changed similar methods in two Series-like classes,Index andCategorical.

Previous               Replacement
* Index.order()        Index.sort_values()
* Categorical.order()  Categorical.sort_values()
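The replacement methods in the tables above can be sketched with a toy Series and DataFrame (the values are illustrative):

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['c', 'a', 'b'])
by_value = s.sort_values()   # replaces s.order() / s.sort()
by_index = s.sort_index()    # unchanged

df = pd.DataFrame({'A': [2, 1], 'B': [1, 2]})
by_col = df.sort_values(by='A')  # replaces df.sort(columns='A')
```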

Changes to to_datetime and to_timedelta

Error handling

The default for pd.to_datetime error handling has changed to errors='raise'. In prior versions it was errors='ignore'. Furthermore, the coerce argument has been deprecated in favor of errors='coerce'. This means that invalid parsing will raise rather than return the original input as in previous versions. (GH10636)

Previous Behavior:

In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)

New Behavior:

In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format

Of course you can coerce this as well.

In [61]: to_datetime(['2009-07-31', 'asd'], errors='coerce')
Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)

To keep the previous behavior, you can useerrors='ignore':

In [62]: to_datetime(['2009-07-31', 'asd'], errors='ignore')
Out[62]: array(['2009-07-31', 'asd'], dtype=object)

Furthermore, pd.to_timedelta has gained a similar API, of errors='raise'|'ignore'|'coerce', and the coerce keyword has been deprecated in favor of errors='coerce'.
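A minimal sketch of the equivalent to_timedelta behavior (the input strings are illustrative):

```python
import pandas as pd

# errors='coerce' turns unparseable inputs into NaT instead of raising
result = pd.to_timedelta(['1 days', 'foo'], errors='coerce')
```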

Consistent Parsing

The string parsing of to_datetime, Timestamp and DatetimeIndex has been made consistent. (GH7599)

Prior to v0.17.0, Timestamp and to_datetime could incorrectly parse a year-only datetime string using today's date, whereas DatetimeIndex used the beginning of the year. Timestamp and to_datetime could also raise ValueError for some types of datetime strings that DatetimeIndex can parse, such as quarterly strings.

Previous Behavior:

In [1]: Timestamp('2012Q2')
Traceback
   ...
ValueError: Unable to parse 2012Q2

# Results in today's date.
In [2]: Timestamp('2014')
Out[2]: 2014-08-12 00:00:00

v0.17.0 can parse them as below. This works on DatetimeIndex also.

New Behavior:

In [63]: Timestamp('2012Q2')
Out[63]: Timestamp('2012-04-01 00:00:00')
In [64]: Timestamp('2014')
Out[64]: Timestamp('2014-01-01 00:00:00')
In [65]: DatetimeIndex(['2012Q2', '2014'])
Out[65]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)

Note

If you want to perform calculations based on today's date, use Timestamp.now() and pandas.tseries.offsets.

In [66]: import pandas.tseries.offsets as offsets
In [67]: Timestamp.now()
Out[67]: Timestamp('2016-11-03 16:51:06.549337')
In [68]: Timestamp.now() + offsets.DateOffset(years=1)
Out[68]: Timestamp('2017-11-03 16:51:06.550998')

Changes to Index Comparisons

The equality operator on Index should now behave similarly to Series (GH9947, GH10637)

Starting in v0.17.0, comparingIndex objects of different lengths will raiseaValueError. This is to be consistent with the behavior ofSeries.

Previous Behavior:

In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)
In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False,  True, False], dtype=bool)
In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False

New Behavior:

In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)
In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare
In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare

Note that this is different from thenumpy behavior where a comparison canbe broadcast:

In [69]: np.array([1, 2, 3]) == np.array([1])
Out[69]: array([ True, False, False], dtype=bool)

or it can return False if broadcasting cannot be done:

In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False

Changes to Boolean Comparisons vs. None

Boolean comparisons of a Series vs None will now be equivalent to comparing with np.nan, rather than raising TypeError. (GH1079).

In [71]: s = Series(range(3))
In [72]: s.iloc[1] = None
In [73]: s
Out[73]:
0    0.0
1    NaN
2    2.0
dtype: float64

Previous Behavior:

In [5]: s == None
TypeError: Could not compare <type 'NoneType'> type with Series

New Behavior:

In [74]: s == None
Out[74]:
0    False
1    False
2    False
dtype: bool

Usually you simply want to know which values are null.

In [75]: s.isnull()
Out[75]:
0    False
1     True
2    False
dtype: bool

Warning

You generally will want to use isnull/notnull for these types of comparisons, as isnull/notnull tells you which elements are null. One has to be mindful that nan's don't compare equal, but None's do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [76]: None == None
Out[76]: True
In [77]: np.nan == np.nan
Out[77]: False

HDFStore dropna behavior

The default behavior for HDFStore write functions with format='table' is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the dropna=True option. (GH9382)

Previous Behavior:

In [78]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
   ....:                                 'col2': [1, np.nan, np.nan]})
In [79]: df_with_missing
Out[79]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
In [27]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
                                format='table', mode='w')
In [28]: pd.read_hdf('file.h5', 'df_with_missing')
Out[28]:
   col1  col2
0     0     1
2     2   NaN

New Behavior:

In [80]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
   ....:                        format='table', mode='w')
In [81]: pd.read_hdf('file.h5', 'df_with_missing')
Out[81]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

See thedocs for more details.

Changes todisplay.precision option

Thedisplay.precision option has been clarified to refer to decimal places (GH10451).

Earlier versions of pandas would format floating point numbers to have one less decimal place than the value indisplay.precision.

In [1]: pd.set_option('display.precision', 2)
In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
       x
0  123.5

If interpreting precision as "significant figures", this did work for scientific notation, but the same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.

Going forward the value ofdisplay.precision will directly control the number of places after the decimal, forregular formatting as well as scientific notation, similar to how numpy’sprecision print option works.

In [82]: pd.set_option('display.precision', 2)
In [83]: pd.DataFrame({'x': [123.456789]})
Out[83]:
        x
0  123.46

To preserve output behavior with prior versions, the default value of display.precision has been reduced to 6 from 7.

Changes toCategorical.unique

Categorical.unique now returns a new Categorical with categories and codes that are unique, rather than returning np.array (GH10508)

  • unordered category: values and categories are sorted by appearance order.
  • ordered category: values are sorted by appearance order, categories keep existing order.
In [84]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'],
   ....:                      ordered=True)
In [85]: cat
Out[85]:
[C, A, B, C]
Categories (3, object): [A < B < C]
In [86]: cat.unique()
Out[86]:
[C, A, B]
Categories (3, object): [A < B < C]
In [87]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'])
In [88]: cat
Out[88]:
[C, A, B, C]
Categories (3, object): [A, B, C]
In [89]: cat.unique()
Out[89]:
[C, A, B]
Categories (3, object): [C, A, B]

Changes tobool passed asheader in Parsers

In earlier versions of pandas, if a bool was passed to the header argument of read_csv, read_excel, or read_html it was implicitly converted to an integer, resulting in header=0 for False and header=1 for True (GH6113)

A bool input to header will now raise a TypeError

In [29]: df = pd.read_csv('data.csv', header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no header or
header=int or list-like of ints to specify the row(s) making up the column names

Other API Changes

  • Line and kde plot withsubplots=True now uses default colors, not all black. Specifycolor='k' to draw all lines in black (GH9894)

  • Calling the.value_counts() method on a Series with acategorical dtype now returns a Series with aCategoricalIndex (GH10704)

  • The metadata properties of subclasses of pandas objects will now be serialized (GH10553).

  • groupby usingCategorical follows the same rule asCategorical.unique described above (GH10508)

  • Constructing a DataFrame with an array of complex64 dtype previously meant the corresponding column was automatically promoted to the complex128 dtype. Pandas will now preserve the itemsize of the input for complex data (GH10952)

  • Some numeric reduction operators would raise ValueError, rather than TypeError, on object types that include strings and numbers (GH11131)

  • Passing currently unsupportedchunksize argument toread_excel orExcelFile.parse will now raiseNotImplementedError (GH8011)

  • Allow anExcelFile object to be passed intoread_excel (GH11198)

  • DatetimeIndex.union does not inferfreq ifself and the input haveNone asfreq (GH11086)

  • NaT's methods now either raise ValueError, or return np.nan or NaT (GH9513)

    Behavior                     Methods
    return np.nan                weekday, isoweekday
    return NaT                   date, now, replace, to_datetime, today
    return np.datetime64('NaT')  to_datetime64 (unchanged)
    raise ValueError             All other public methods (names not beginning with underscores)
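The .value_counts() change listed above can be sketched as follows (the sample data is illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')
counts = s.value_counts()  # index is a CategoricalIndex
```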

Deprecations

  • ForSeries the following indexing functions are deprecated (GH10177).

    Deprecated Function  Replacement
    .irow(i)             .iloc[i] or .iat[i]
    .iget(i)             .iloc[i] or .iat[i]
    .iget_value(i)       .iloc[i] or .iat[i]
  • ForDataFrame the following indexing functions are deprecated (GH10177).

    Deprecated Function  Replacement
    .irow(i)             .iloc[i]
    .iget_value(i, j)    .iloc[i, j] or .iat[i, j]
    .icol(j)             .iloc[:, j]

Note

These indexing functions have been deprecated in the documentation since 0.11.0.
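The replacements above can be sketched as follows (the data is illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

a = s.iloc[1]      # was s.irow(1) / s.iget(1) / s.iget_value(1)
b = df.iloc[0, 1]  # was df.iget_value(0, 1)
c = df.iloc[:, 1]  # was df.icol(1)
```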

  • Categorical.name was deprecated to make Categorical more numpy.ndarray-like. Use Series(cat, name="whatever") instead (GH10482).
  • Setting missing values (NaN) in a Categorical's categories will issue a warning (GH10748). You can still have missing values in the values.
  • drop_duplicates and duplicated's take_last keyword was deprecated in favor of keep. (GH6511, GH8505)
  • Series.nsmallest and nlargest's take_last keyword was deprecated in favor of keep. (GH10792)
  • DataFrame.combineAdd and DataFrame.combineMult are deprecated. They can easily be replaced by using the add and mul methods: DataFrame.add(other, fill_value=0) and DataFrame.mul(other, fill_value=1.) (GH10735).
  • TimeSeries deprecated in favor of Series (note that this has been an alias since 0.13.0), (GH10890)
  • SparsePanel deprecated and will be removed in a future version (GH11157).
  • Series.is_time_series deprecated in favor of Series.index.is_all_dates (GH11135)
  • Legacy offsets (like 'A@JAN') are deprecated (note that this has been an alias since 0.8.0) (GH10878)
  • WidePanel deprecated in favor of Panel, LongPanel in favor of DataFrame (note these have been aliases since < 0.11.0), (GH10892)
  • DataFrame.convert_objects has been deprecated in favor of the type-specific functions pd.to_datetime, pd.to_timedelta and pd.to_numeric (new in 0.17.0) (GH11133).
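The add/mul replacements for combineAdd/combineMult noted above can be sketched as follows (data illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({'A': [10.0, 20.0]}, index=[1, 2])

# combineAdd equivalent: absent entries contribute 0
added = df1.add(df2, fill_value=0)

# combineMult equivalent: absent entries contribute 1
multiplied = df1.mul(df2, fill_value=1.)
```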

Removal of prior version deprecations/changes

  • Removal of na_last parameters from Series.order() and Series.sort(), in favor of na_position. (GH5231)

  • Removal of percentile_width from .describe(), in favor of percentiles. (GH7088)

  • Removal of colSpace parameter from DataFrame.to_string(), in favor of col_space, circa version 0.8.0.

  • Removal of automatic time-series broadcasting (GH2304)

    In [90]: np.random.seed(1234)

    In [91]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'),
       ....:                index=date_range('20130101', periods=5))

    In [92]: df
    Out[92]: 
                       A         B
    2013-01-01  0.471435 -1.190976
    2013-01-02  1.432707 -0.312652
    2013-01-03 -0.720589  0.887163
    2013-01-04  0.859588 -0.636524
    2013-01-05  0.015696 -2.242685

    Previously

    In [3]: df + df.A
    FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
    Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index

    Out[3]: 
                       A         B
    2013-01-01  0.942870 -0.719541
    2013-01-02  2.865414  1.120055
    2013-01-03 -1.441177  0.166574
    2013-01-04  1.719177  0.223065
    2013-01-05  0.031393 -2.226989

    Current

    In [93]: df.add(df.A, axis='index')
    Out[93]: 
                       A         B
    2013-01-01  0.942870 -0.719541
    2013-01-02  2.865414  1.120055
    2013-01-03 -1.441177  0.166574
    2013-01-04  1.719177  0.223065
    2013-01-05  0.031393 -2.226989
  • Remove table keyword in HDFStore.put/append, in favor of using format= (GH4645)

  • Remove kind in read_excel/ExcelFile as it is unused (GH4712)

  • Remove infer_type keyword from pd.read_html as it is unused (GH4770, GH7032)

  • Remove offset and timeRule keywords from Series.tshift/shift, in favor of freq (GH4853, GH4864)

  • Remove pd.load/pd.save aliases in favor of pd.to_pickle/pd.read_pickle (GH3787)

Performance Improvements

  • Development support for benchmarking with the Air Speed Velocity library (GH8361)
  • Added vbench benchmarks for alternative ExcelWriter engines and reading Excel files (GH7171)
  • Performance improvements in Categorical.value_counts (GH10804)
  • Performance improvements in SeriesGroupBy.nunique, SeriesGroupBy.value_counts and SeriesGroupBy.transform (GH10820, GH11077)
  • Performance improvements in DataFrame.drop_duplicates with integer dtypes (GH10917)
  • Performance improvements in DataFrame.duplicated with wide frames. (GH10161, GH11180)
  • 4x improvement in timedelta string parsing (GH6755, GH10426)
  • 8x improvement in timedelta64 and datetime64 ops (GH6755)
  • Significantly improved performance of indexing MultiIndex with slicers (GH10287)
  • 8x improvement in iloc using list-like input (GH10791)
  • Improved performance of Series.isin for datetimelike/integer Series (GH10287)
  • 20x improvement in concat of Categoricals when categories are identical (GH10587)
  • Improved performance of to_datetime when the specified format string is ISO8601 (GH10178)
  • 2x improvement of Series.value_counts for float dtype (GH10821)
  • Enable infer_datetime_format in to_datetime when date components do not have 0 padding (GH11142)
  • Fixed a regression from 0.16.1 in constructing DataFrame from a nested dictionary (GH11084)
  • Performance improvements in addition/subtraction operations for DateOffset with Series or DatetimeIndex (GH10744, GH11205)

Bug Fixes

  • Bug in incorrect computation of .mean() on timedelta64[ns] because of overflow (GH9442)
  • Bug in .isin on older numpies (GH11232)
  • Bug inDataFrame.to_html(index=False) renders unnecessaryname row (GH10344)
  • Bug inDataFrame.to_latex() thecolumn_format argument could not be passed (GH9402)
  • Bug inDatetimeIndex when localizing withNaT (GH10477)
  • Bug inSeries.dt ops in preserving meta-data (GH10477)
  • Bug in preservingNaT when passed in an otherwise invalidto_datetime construction (GH10477)
  • Bug inDataFrame.apply when function returns categorical series. (GH9573)
  • Bug into_datetime with invalid dates and formats supplied (GH10154)
  • Bug inIndex.drop_duplicates dropping name(s) (GH10115)
  • Bug inSeries.quantile dropping name (GH10881)
  • Bug inpd.Series when setting a value on an emptySeries whose index has a frequency. (GH10193)
  • Bug inpd.Series.interpolate with invalidorder keyword values. (GH10633)
  • Bug inDataFrame.plot raisesValueError when color name is specified by multiple characters (GH10387)
  • Bug inIndex construction with a mixed list of tuples (GH10697)
  • Bug inDataFrame.reset_index when index containsNaT. (GH10388)
  • Bug inExcelReader when worksheet is empty (GH6403)
  • Bug inBinGrouper.group_info where returned values are not compatible with base class (GH10914)
  • Bug in clearing the cache onDataFrame.pop and a subsequent inplace op (GH10912)
  • Bug in indexing with a mixed-integerIndex causing anImportError (GH10610)
  • Bug inSeries.count when index has nulls (GH10946)
  • Bug in pickling of a non-regular freqDatetimeIndex (GH11002)
  • Bug causingDataFrame.where to not respect theaxis parameter when the frame has a symmetric shape. (GH9736)
  • Bug inTable.select_column where name is not preserved (GH10392)
  • Bug inoffsets.generate_range wherestart andend have finer precision thanoffset (GH9907)
  • Bug inpd.rolling_* whereSeries.name would be lost in the output (GH10565)
  • Bug instack when index or columns are not unique. (GH10417)
  • Bug in setting aPanel when an axis has a multi-index (GH10360)
  • Bug inUSFederalHolidayCalendar whereUSMemorialDay andUSMartinLutherKingJr were incorrect (GH10278 andGH9760 )
  • Bug in.sample() where returned object, if set, gives unnecessarySettingWithCopyWarning (GH10738)
  • Bug in.sample() where weights passed asSeries were not aligned along axis before being treated positionally, potentially causing problems if weight indices were not aligned with sampled object. (GH10738)
  • Fixed regression (GH9311, GH6620, GH9345) where groupby with a datetime-like was converting to float with certain aggregators (GH10979)
  • Bug inDataFrame.interpolate withaxis=1 andinplace=True (GH10395)
  • Bug inio.sql.get_schema when specifying multiple columns as primarykey (GH10385).
  • Bug ingroupby(sort=False) with datetime-likeCategorical raisesValueError (GH10505)
  • Bug ingroupby(axis=1) withfilter() throwsIndexError (GH11041)
  • Bug intest_categorical on big-endian builds (GH10425)
  • Bug inSeries.shift andDataFrame.shift not supporting categorical data (GH9416)
  • Bug inSeries.map using categoricalSeries raisesAttributeError (GH10324)
  • Bug inMultiIndex.get_level_values includingCategorical raisesAttributeError (GH10460)
  • Bug inpd.get_dummies withsparse=True not returningSparseDataFrame (GH10531)
  • Bug inIndex subtypes (such asPeriodIndex) not returning their own type for.drop and.insert methods (GH10620)
  • Bug inalgos.outer_join_indexer whenright array is empty (GH10618)
  • Bug infilter (regression from 0.16.0) andtransform when grouping on multiple keys, one of which is datetime-like (GH10114)
  • Bug into_datetime andto_timedelta causingIndex name to be lost (GH10875)
  • Bug in len(DataFrame.groupby) causing IndexError when there’s a column containing only NaNs (GH11016)
  • Bug that caused segfault when resampling an empty Series (GH10228)
  • Bug inDatetimeIndex andPeriodIndex.value_counts resets name from its result, but retains in result’sIndex. (GH10150)
  • Bug inpd.eval usingnumexpr engine coerces 1 element numpy array to scalar (GH10546)
  • Bug inpd.concat withaxis=0 when column is of dtypecategory (GH10177)
  • Bug inread_msgpack where input type is not always checked (GH10369,GH10630)
  • Bug inpd.read_csv with kwargsindex_col=False,index_col=['a','b'] ordtype(GH10413,GH10467,GH10577)
  • Bug inSeries.from_csv withheader kwarg not setting theSeries.name or theSeries.index.name (GH10483)
  • Bug ingroupby.var which caused variance to be inaccurate for small float values (GH10448)
  • Bug inSeries.plot(kind='hist') Y Label not informative (GH10485)
  • Bug inread_csv when using a converter which generates auint8 type (GH9266)
  • Bug causes memory leak in time-series line and area plot (GH9003)
  • Bug when setting aPanel sliced along the major or minor axes when the right-hand side is aDataFrame (GH11014)
  • Bug that returnsNone and does not raiseNotImplementedError when operator functions (e.g..add) ofPanel are not implemented (GH7692)
  • Bug in line and kde plot cannot accept multiple colors whensubplots=True (GH9894)
  • Bug inDataFrame.plot raisesValueError when color name is specified by multiple characters (GH10387)
  • Bug in left and rightalign ofSeries withMultiIndex may be inverted (GH10665)
  • Bug in left and rightjoin of withMultiIndex may be inverted (GH10741)
  • Bug inread_stata when reading a file with a different order set incolumns (GH10757)
  • Bug in Categorical not being represented properly when categories contain tz or Period (GH10713)
  • Bug in Categorical.__iter__ not returning correct datetime and Period (GH10713)
  • Bug in indexing with aPeriodIndex on an object with aPeriodIndex (GH4125)
  • Bug inread_csv withengine='c': EOF preceded by a comment, blank line, etc. was not handled correctly (GH10728,GH10548)
  • Reading “famafrench” data via DataReader results in an HTTP 404 error because the website url has changed (GH10591).
  • Bug inread_msgpack where DataFrame to decode has duplicate column names (GH9618)
  • Bug inio.common.get_filepath_or_buffer which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (GH10604)
  • Bug in vectorised setting of timestamp columns with pythondatetime.date and numpydatetime64 (GH10408,GH10412)
  • Bug inIndex.take may add unnecessaryfreq attribute (GH10791)
  • Bug inmerge with emptyDataFrame may raiseIndexError (GH10824)
  • Bug in to_latex where an unexpected keyword argument error was raised for some documented arguments (GH10888)
  • Bug in indexing of largeDataFrame whereIndexError is uncaught (GH10645 andGH10692)
  • Bug inread_csv when using thenrows orchunksize parameters if file contains only a header line (GH9535)
  • Bug in serialization ofcategory types in HDF5 in presence of alternate encodings. (GH10366)
  • Bug inpd.DataFrame when constructing an empty DataFrame with a string dtype (GH9428)
  • Bug inpd.DataFrame.diff when DataFrame is not consolidated (GH10907)
  • Bug in pd.unique for arrays with the datetime64 or timedelta64 dtype that meant an array with object dtype was returned instead of the original dtype (GH9431)
  • Bug inTimedelta raising error when slicing from 0s (GH10583)
  • Bug inDatetimeIndex.take andTimedeltaIndex.take may not raiseIndexError against invalid index (GH10295)
  • Bug inSeries([np.nan]).astype('M8[ms]'), which now returnsSeries([pd.NaT]) (GH10747)
  • Bug inPeriodIndex.order reset freq (GH10295)
  • Bug indate_range whenfreq dividesend as nanos (GH10885)
  • Bug iniloc allowing memory outside bounds of a Series to be accessed with negative integers (GH10779)
  • Bug inread_msgpack where encoding is not respected (GH10581)
  • Bug preventing access to the first index when usingiloc with a list containing the appropriate negative integer (GH10547,GH10779)
  • Bug inTimedeltaIndex formatter causing error while trying to saveDataFrame withTimedeltaIndex usingto_csv (GH10833)
  • Bug inDataFrame.where when handling Series slicing (GH10218,GH9558)
  • Bug wherepd.read_gbq throwsValueError when Bigquery returns zero rows (GH10273)
  • Bug into_json which was causing segmentation fault when serializing 0-rank ndarray (GH9576)
  • Bug in plotting functions may raiseIndexError when plotted onGridSpec (GH10819)
  • Bug in plot result may show unnecessary minor ticklabels (GH10657)
  • Bug in groupby causing incorrect computation for aggregation on DataFrame with NaT (e.g. first, last, min). (GH10590, GH11010)
  • Bug when constructingDataFrame where passing a dictionary with only scalar values and specifying columns did not raise an error (GH10856)
  • Bug in.var() causing roundoff errors for highly similar values (GH10242)
  • Bug inDataFrame.plot(subplots=True) with duplicated columns outputs incorrect result (GH10962)
  • Bug inIndex arithmetic may result in incorrect class (GH10638)
  • Bug in date_range results in empty if freq is negative annually, quarterly and monthly (GH11018)
  • Bug inDatetimeIndex cannot infer negative freq (GH11018)
  • Remove use of some deprecated numpy comparison operations, mainly in tests. (GH10569)
  • Bug inIndex dtype may not applied properly (GH11017)
  • Bug inio.gbq when testing for minimum google api client version (GH10652)
  • Bug inDataFrame construction from nesteddict withtimedelta keys (GH11129)
  • Bug in .fillna which may raise TypeError when data contains datetime dtype (GH7095, GH11153)
  • Bug in.groupby when number of keys to group by is same as length of index (GH11185)
  • Bug inconvert_objects where converted values might not be returned if all null andcoerce (GH9589)
  • Bug inconvert_objects wherecopy keyword was not respected (GH9589)

v0.16.2 (June 12, 2015)

This is a minor bug-fix release from 0.16.1 and includes a large number of bug fixes along with some new features (the pipe() method), enhancements, and performance improvements.

We recommend that all users upgrade to this version.

Highlights include:

  • A new pipe method, see here
  • Documentation on how to use numba with pandas, see here

New features

Pipe

We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like

# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)

The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as

(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3))

Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.

In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:

In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id')

# sm.poisson takes (formula, data)
In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h=lambda df: np.log(df.h))
   ...:    .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...: )
   ...:
Optimization terminated successfully.
         Current function value: 2.116284
         Iterations 24
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                          Poisson Regression Results
==============================================================================
Dep. Variable:                     hr   No. Observations:                   68
Model:                        Poisson   Df Residuals:                       63
Method:                           MLE   Df Model:                            4
Date:                Don, 03 Nov 2016   Pseudo R-squ.:                  0.6878
Time:                        16:51:07   Log-Likelihood:                -143.91
converged:                       True   LL-Null:                       -460.91
                                        LLR p-value:                6.774e-136
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept   -1267.3636    457.867     -2.768      0.006     -2164.767  -369.960
C(lg)[T.NL]    -0.2057      0.101     -2.044      0.041        -0.403    -0.008
ln_h            0.9280      0.191      4.866      0.000         0.554     1.302
year            0.6301      0.228      2.762      0.006         0.183     1.077
g               0.0099      0.004      2.754      0.006         0.003     0.017
===============================================================================
"""

The pipe method is inspired by unix pipes, which stream text through processes. More recently dplyr and magrittr have introduced the popular (%>%) pipe operator for R.

See the documentation for more. (GH10129)
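A self-contained sketch of both pipe forms (the functions double and scale below are illustrative, not part of pandas):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

def double(frame):
    # expects the DataFrame as its first positional argument
    return frame * 2

def scale(factor, data):
    # expects its data in the 'data' keyword instead
    return data * factor

# The plain form feeds df as the first argument; the tuple form
# routes it into the named keyword.
result = df.pipe(double).pipe((scale, 'data'), 10)
```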

Other Enhancements

  • Added rsplit to Index/Series StringMethods (GH10303)

  • Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).

    Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here.

  • The axis parameter of DataFrame.quantile now also accepts index and columns. (GH9543)
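A short sketch of the new axis aliases (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# median of each column (equivalent to axis=0)
by_column = df.quantile(0.5, axis='index')

# median of each row (equivalent to axis=1)
by_row = df.quantile(0.5, axis='columns')
```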

API Changes

  • Holiday now raises NotImplementedError if both offset and observance are used in the constructor instead of returning an incorrect result (GH10217).

Performance Improvements

  • Improved Series.resample performance with dtype=datetime64[ns] (GH7754)
  • Increased performance of str.split when expand=True (GH10081)

Bug Fixes

  • Bug inSeries.hist raises an error when a one rowSeries was given (GH10214)
  • Bug whereHDFStore.select modifies the passed columns list (GH7212)
  • Bug inCategorical repr withdisplay.width ofNone in Python 3 (GH10087)
  • Bug into_json with certain orients and aCategoricalIndex would segfault (GH10317)
  • Bug where some of the nan funcs do not have consistent return dtypes (GH10251)
  • Bug inDataFrame.quantile on checking that a valid axis was passed (GH9543)
  • Bug ingroupby.apply aggregation forCategorical not preserving categories (GH10138)
  • Bug into_csv wheredate_format is ignored if thedatetime is fractional (GH10209)
  • Bug inDataFrame.to_json with mixed data types (GH10289)
  • Bug in cache updating when consolidating (GH10264)
  • Bug inmean() where integer dtypes can overflow (GH10172)
  • Bug wherePanel.from_dict does not set dtype when specified (GH10058)
  • Bug inIndex.union raisesAttributeError when passing array-likes. (GH10149)
  • Bug in Timestamp's microsecond, quarter, dayofyear, week and daysinmonth properties returning np.int type, not built-in int. (GH10050)
  • Bug in NaT raising AttributeError when accessing the daysinmonth and dayofweek properties. (GH10096)
  • Bug in Index repr when using themax_seq_items=None setting (GH10182).
  • Bug in getting timezone data withdateutil on various platforms (GH9059,GH8639,GH9663,GH10121)
  • Bug in displaying datetimes with mixed frequencies; display ‘ms’ datetimes to the proper precision. (GH10170)
  • Bug insetitem where type promotion is applied to the entire block (GH10280)
  • Bug inSeries arithmetic methods may incorrectly hold names (GH10068)
  • Bug inGroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH10132)
  • Bug inDatetimeIndex andTimedeltaIndex names are lost after timedelta arithmetics (GH9926)
  • Bug inDataFrame construction from nesteddict withdatetime64 (GH10160)
  • Bug inSeries construction fromdict withdatetime64 keys (GH9456)
  • Bug inSeries.plot(label="LABEL") not correctly setting the label (GH10119)
  • Bug inplot not defaulting to matplotlibaxes.grid setting (GH9792)
  • Bug causing strings containing an exponent, but no decimal to be parsed asint instead offloat inengine='python' for theread_csv parser (GH9565)
  • Bug inSeries.align resetsname whenfill_value is specified (GH10067)
  • Bug inread_csv causing index name not to be set on an empty DataFrame (GH10184)
  • Bug inSparseSeries.abs resetsname (GH10241)
  • Bug inTimedeltaIndex slicing may reset freq (GH10292)
  • Bug inGroupBy.get_group raisesValueError when group key containsNaT (GH6992)
  • Bug inSparseSeries constructor ignores input data name (GH10258)
  • Bug inCategorical.remove_categories causing aValueError when removing theNaN category if underlying dtype is floating-point (GH10156)
  • Bug where infer_freq infers timerule (WOM-5XXX) unsupported by to_offset (GH9425)
  • Bug inDataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)
  • Bug in masking an empty DataFrame (GH10126).
  • Bug where MySQL interface could not handle numeric table/column names (GH10255)
  • Bug inread_csv with adate_parser that returned adatetime64 array of other time resolution than[ns] (GH10245)
  • Bug inPanel.apply when the result has ndim=0 (GH10332)
  • Bug inread_hdf whereauto_close could not be passed (GH9327).
  • Bug inread_hdf where open stores could not be used (GH10330).
  • Bug in adding empty DataFrames; this now results in a DataFrame that .equals an empty DataFrame (GH10181).
  • Bug into_hdf andHDFStore which did not check that complib choices were valid (GH4582,GH8874).

v0.16.1 (May 11, 2015)

This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

  • Support for a CategoricalIndex, a category-based index, see here
  • New section on how to contribute to pandas, see here
  • Revised “Merge, join, and concatenate” documentation, including graphical examples to make it easier to understand each operation, see here
  • New method sample for drawing random samples from Series, DataFrames and Panels. See here
  • The default Index printing has changed to a more uniform format, see here
  • BusinessHour datetime-offset is now supported, see here
  • Further enhancement to the .str accessor to make string operations easier, see here

Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package. See here for details (GH8961)

Enhancements

CategoricalIndex

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.

In [1]: df = DataFrame({'A': np.arange(6),
   ...:                 'B': Series(list('aabbca')).astype('category',
   ...:                                                    categories=list('cab'))
   ...:                 })
   ...:

In [2]: df
Out[2]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]: 
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index([u'c', u'a', u'b'], dtype='object')

Setting the index will create a CategoricalIndex

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.

In [7]: df2.loc['a']
Out[7]: 
   A
B   
a  0
a  1
a  5

and preserves the CategoricalIndex

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

sorting will order by the order of the categories

In [9]: df2.sort_index()
Out[9]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

groupby operations on the index will preserve the index nature as well

In [10]: df2.groupby(level=0).sum()
Out[10]: 
   A
B   
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer, meaning that passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [12]: df2.reindex(['a', 'e'])
Out[12]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')

See the documentation for more. (GH7629, GH10038, GH10039)

Sample

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)

In [16]: example_series = Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1
In [17]: example_series.sample()
Out[17]: 
3    3
dtype: int64

# One may specify either a number of rows:
In [18]: example_series.sample(n=3)
Out[18]: 
5    5
1    1
4    4
dtype: int64

# Or a fraction of the rows:
In [19]: example_series.sample(frac=0.5)
Out[19]: 
4    4
1    1
0    0
dtype: int64

# weights are accepted.
In [20]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [21]: example_series.sample(n=3, weights=example_weights)
Out[21]: 
2    2
3    3
5    5
dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [22]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [23]: example_series.sample(n=1, weights=example_weights2)
Out[23]: 
0    0
dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [24]: df = DataFrame({'col1': [9, 8, 7, 6], 'weight_column': [0.5, 0.4, 0.1, 0]})

In [25]: df.sample(n=3, weights='weight_column')
Out[25]: 
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

String Methods Enhancements

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.

  • Added StringMethods (.str accessor) to Index (GH9068)

    The .str accessor is now available for both Series and Index.

    In [26]: idx = Index([' jack', 'jill ', ' jesse ', 'frank'])

    In [27]: idx.str.strip()
    Out[27]: Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')

    One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH8875). This enables the following expression to work naturally:

    In [28]: idx = Index(['a1', 'a2', 'b1', 'b2'])

    In [29]: s = Series(range(4), index=idx)

    In [30]: s
    Out[30]: 
    a1    0
    a2    1
    b1    2
    b2    3
    dtype: int64

    In [31]: idx.str.startswith('a')
    Out[31]: array([ True,  True, False, False], dtype=bool)

    In [32]: s[s.index.str.startswith('a')]
    Out[32]: 
    a1    0
    a2    1
    dtype: int64
  • The following new methods are accessible via the .str accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)

    Methods
    capitalize()  swapcase()  normalize()  partition()  rpartition()
    index()       rindex()    translate()

  • split now takes an expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH9847)

    In [33]: s = Series(['a,b', 'a,c', 'b,c'])

    # return Series
    In [34]: s.str.split(',')
    Out[34]: 
    0    [a, b]
    1    [a, c]
    2    [b, c]
    dtype: object

    # return DataFrame
    In [35]: s.str.split(',', expand=True)
    Out[35]: 
       0  1
    0  a  b
    1  a  c
    2  b  c

    In [36]: idx = Index(['a,b', 'a,c', 'b,c'])

    # return Index
    In [37]: idx.str.split(',')
    Out[37]: Index([[u'a', u'b'], [u'a', u'c'], [u'b', u'c']], dtype='object')

    # return MultiIndex
    In [38]: idx.str.split(',', expand=True)
    Out[38]: 
    MultiIndex(levels=[[u'a', u'b'], [u'b', u'c']],
               labels=[[0, 0, 1], [0, 1, 1]])
  • Improved extract and get_dummies methods for Index.str (GH9980)

Other Enhancements

  • BusinessHour offset is now supported, which represents business hours starting from 09:00 - 17:00 on BusinessDay by default. See here for details. (GH7905)

    In [39]: from pandas.tseries.offsets import BusinessHour

    In [40]: Timestamp('2014-08-01 09:00') + BusinessHour()
    Out[40]: Timestamp('2014-08-01 10:00:00')

    In [41]: Timestamp('2014-08-01 07:00') + BusinessHour()
    Out[41]: Timestamp('2014-08-01 10:00:00')

    In [42]: Timestamp('2014-08-01 16:30') + BusinessHour()
    Out[42]: Timestamp('2014-08-04 09:30:00')
  • DataFrame.diff now takes an axis parameter that determines the direction of differencing (GH9727)

  • Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (this is a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)
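The array-like thresholds can be sketched as follows (values illustrative; clip_lower/clip_upper were later removed, so only clip is shown):

```python
import pandas as pd

s = pd.Series([0, 5, 10])

# per-element bounds instead of a single scalar threshold
clipped = s.clip(lower=[1, 2, 3], upper=[8, 8, 8])
```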

  • DataFrame.mask() and Series.mask() now support the same keywords as where (GH8801)

  • The drop function can now accept an errors keyword to suppress the ValueError raised when any of the labels do not exist in the target data. (GH6736)

    In [43]: df = DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])

    In [44]: df.drop(['A', 'X'], axis=1, errors='ignore')
    Out[44]: 
              B         C
    0  1.058969 -0.397840
    1  1.047579  1.045938
    2 -0.122092  0.124713
  • Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)

  • Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str) (GH9757)
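A minimal sketch of the datetime64-to-string conversion (the exact string format can vary between pandas versions, so only the date prefix is checked here):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-01 10:30:00', '2015-06-15 08:00:00']))

# each datetime64 element becomes its string representation
as_text = s.astype(str)
```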

  • The get_dummies function now accepts a sparse keyword. If set to True, the returned DataFrame is sparse, e.g. SparseDataFrame. (GH8823)

  • Period now accepts datetime64 as value input. (GH9054)

  • Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)
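The leading-zero tolerance can be sketched as:

```python
import pandas as pd

# both spellings now parse to the same 30-second timedelta
short = pd.to_timedelta('0:00:30')
full = pd.to_timedelta('00:00:30')
```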

  • Allow Panel.shift with axis='items' (GH9890)

  • Trying to write an Excel file now raises NotImplementedError if the DataFrame has a MultiIndex, instead of writing a broken Excel file. (GH9794)

  • Allow Categorical.add_categories to accept Series or np.array. (GH9927)

  • Add/delete str/dt/cat accessors dynamically from __dir__. (GH9910)

  • Add normalize as a dt accessor method. (GH10047)
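A small sketch of the normalize accessor (timestamps illustrative):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-01 10:30:00', '2015-01-02 23:59:59']))

# drop the time-of-day component, keeping midnight of the same day
midnights = s.dt.normalize()
```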

  • DataFrame and Series now have a _constructor_expanddim property as an overridable constructor for one-higher-dimensionality data. This should be used only when it is really needed, see here

  • pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH10032)

API changes

  • When passing an ax to df.plot(..., ax=ax), the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that yourself for the right axes in your figure, or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no ax kwarg was passed in), then the default is still sharex=True and the visibility changes are applied.
  • assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)
  • By default, read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH9770)
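The inference can be sketched by round-tripping a gzipped CSV (file name and contents illustrative):

```python
import gzip
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv.gz')
    with gzip.open(path, 'wt') as f:
        f.write('a,b\n1,2\n3,4\n')

    # compression inferred from the '.gz' extension
    inferred = pd.read_csv(path)
    # equivalent to spelling it out:
    explicit = pd.read_csv(path, compression='gzip')
```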

Deprecations

  • Series.str.split's return_type keyword was removed in favor of expand (GH9847)

Index Representation

The string representation of Index and its sub-classes has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for many values (but fewer than display.max_seq_items); and a truncated display (the head and tail of the data) if there are lots of items (> display.max_seq_items). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)

Previous Behavior

In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
            19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
            36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
            53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
            70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
            87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...],
           dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New Behavior

In [45]: pd.set_option('display.width', 80)

In [46]: pd.Index(range(4), name='foo')
Out[46]: Int64Index([0, 1, 2, 3], dtype='int64', name=u'foo')

In [47]: pd.Index(range(30), name='foo')
Out[47]:
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
           dtype='int64', name=u'foo')

In [48]: pd.Index(range(104), name='foo')
Out[48]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
             94,  95,  96,  97,  98,  99, 100, 101, 102, 103],
           dtype='int64', name=u'foo', length=104)

In [49]: pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar')
Out[49]: CategoricalIndex([u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [50]: pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'] * 10, ordered=True, name='foobar')
Out[50]:
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [51]: pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'] * 100, ordered=True, name='foobar')
Out[51]:
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb',
                  ...
                  u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb',
                  u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category', length=400)

In [52]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[52]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', name=u'foo', freq='D')

In [53]: pd.date_range('20130101', periods=25, freq='D')
Out[53]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
               '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
               '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
               '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
               '2013-01-25'],
              dtype='datetime64[ns]', freq='D')

In [54]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[54]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               ...
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', name=u'foo', length=104, freq='D')

Performance Improvements

  • Improved csv write performance with mixed dtypes, including datetimes, by up to 5x (GH9940)
  • Improved csv write performance generally by 2x (GH9940)
  • Improved the performance of pd.lib.max_len_string_array by 5-7x (GH10024)

Bug Fixes

  • Bug where labels did not appear properly in the legend of DataFrame.plot(); passing label= arguments now works, and Series indices are no longer mutated. (GH9542)
  • Bug in json serialization causing a segfault when a frame had zero length. (GH9805)
  • Bug in read_csv where missing trailing delimiters would cause segfault. (GH5664)
  • Bug in retaining index name on appending (GH9862)
  • Bug in scatter_matrix draws unexpected axis ticklabels (GH5662)
  • Fixed bug in StataWriter resulting in changes to input DataFrame upon save (GH9795).
  • Bug in transform causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)
  • Bug in equals causing false negatives when block order differed (GH9330)
  • Bug in grouping with multiple pd.Grouper where one is non-time based (GH10063)
  • Bug in read_sql_table error when reading postgres table with timezone (GH7139)
  • Bug in DataFrame slicing may not retain metadata (GH9776)
  • Bug where TimedeltaIndex were not properly serialized in fixed HDFStore (GH9635)
  • Bug with TimedeltaIndex constructor ignoring name when given another TimedeltaIndex as data (GH10025).
  • Bug in DataFrameFormatter._get_formatted_index with not applying max_colwidth to the DataFrame index (GH7856)
  • Bug in .loc with a read-only ndarray data source (GH10043)
  • Bug in groupby.apply() that would raise if a passed user-defined function returned only None (for all input). (GH9685)
  • Always use temporary files in pytables tests (GH9992)
  • Bug in plotting continuously using secondary_y may not show legend properly. (GH9610, GH9779)
  • Bug in DataFrame.plot(kind="hist") results in TypeError when DataFrame contains non-numeric columns (GH9853)
  • Bug where repeated plotting of DataFrame with a DatetimeIndex may raise TypeError (GH9852)
  • Bug in setup.py that would allow an incompat cython version to build (GH9827)
  • Bug in plotting secondary_y incorrectly attaches right_ax property to secondary axes specifying itself recursively. (GH9861)
  • Bug in Series.quantile on empty Series of type Datetime or Timedelta (GH9675)
  • Bug in where causing incorrect results when upcasting was required (GH9731)
  • Bug in FloatArrayFormatter where decision boundary for displaying "small" floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)
  • Fixed bug where DataFrame.plot() raised an error when both color and style keywords were passed and there was no color symbol in the style strings (GH9671)
  • Not showing a DeprecationWarning on combining list-likes with an Index (GH10083)
  • Bug in read_csv and read_table when using skip_rows parameter if blank lines are present. (GH9832)
  • Bug in read_csv() interprets index_col=True as 1 (GH9798)
  • Bug in index equality comparisons using == failing on Index/MultiIndex type incompatibility (GH9785)
  • Bug in which SparseDataFrame could not take nan as a column name (GH8822)
  • Bug in to_msgpack and read_msgpack zlib and blosc compression support (GH9783)
  • Bug where GroupBy.size doesn't attach index name properly if grouped by TimeGrouper (GH9925)
  • Bug causing an exception in slice assignments because length_of_indexer returns wrong results (GH9995)
  • Bug in csv parser causing lines with initial whitespace plus one non-space character to be skipped. (GH9710)
  • Bug in C csv parser causing spurious NaNs when data started with newline followed by whitespace. (GH10022)
  • Bug causing elements with a null group to spill into the final group when grouping by a Categorical (GH9603)
  • Bug where .iloc and .loc behavior is not consistent on empty dataframes (GH9964)
  • Bug in invalid attribute access on a TimedeltaIndex incorrectly raised ValueError instead of AttributeError (GH9680)
  • Bug in unequal comparisons between categorical data and a scalar, which was not in the categories (e.g. Series(Categorical(list("abc"), ordered=True)) > "d"). This returned False for all elements, but now raises a TypeError. Equality comparisons also now return False for == and True for !=. (GH9848)
  • Bug in DataFrame __setitem__ when right hand side is a dictionary (GH9874)
  • Bug in where when dtype is datetime64/timedelta64, but dtype of other is not (GH9804)
  • Bug in MultiIndex.sortlevel() results in unicode level name breaks (GH9856)
  • Bug in which groupby.transform incorrectly enforced output dtypes to match input dtypes. (GH9807)
  • Bug in DataFrame constructor when columns parameter is set, and data is an empty list (GH9939)
  • Bug in bar plot with log=True raises TypeError if all values are less than 1 (GH9905)
  • Bug in horizontal bar plot ignores log=True (GH9905)
  • Bug in PyTables queries that did not return proper results using the index (GH8265, GH9676)
  • Bug where dividing a dataframe containing values of type Decimal by another Decimal would raise. (GH9787)
  • Bug where using DataFrame.asfreq would remove the name of the index. (GH9885)
  • Bug causing extra index point when resample BM/BQ (GH9756)
  • Changed caching in AbstractHolidayCalendar to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)
  • Fixed latex output for multi-indexed dataframes (GH9778)
  • Bug causing an exception when setting an empty range using DataFrame.loc (GH9596)
  • Bug in hiding ticklabels with subplots and shared axes when adding a new plot to an existing grid of axes (GH9158)
  • Bug in transform and filter when grouping on a categorical variable (GH9921)
  • Bug in transform when groups are equal in number and dtype to the input index (GH9700)
  • Google BigQuery connector now imports dependencies on a per-method basis. (GH9713)
  • Updated BigQuery connector to no longer use deprecated oauth2client.tools.run() (GH8327)
  • Bug in subclassed DataFrame. It may not return the correct class, when slicing or subsetting it. (GH9632)
  • Bug in .median() where non-float null values are not handled correctly (GH10040)
  • Bug in Series.fillna() where it raises if a numerically convertible string is given (GH10092)

v0.16.0 (March 22, 2015)

This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • DataFrame.assign method, see here
  • Series.to_coo/from_coo methods to interact with scipy.sparse, see here
  • Backwards incompatible change to Timedelta to conform the .seconds attribute with datetime.timedelta, see here
  • Changes to the .loc slicing API to conform with the behavior of .ix, see here
  • Changes to the default for ordering in the Categorical constructor, see here
  • Enhancement to the .str accessor to make string operations easier, see here
  • The pandas.tools.rplot, pandas.sandbox.qtpandas and pandas.rpy modules are deprecated. We refer users to external packages like seaborn, pandas-qt and rpy2 for similar or equivalent functionality, see here

Check theAPI Changes anddeprecations before updating.

New features

DataFrame Assign

Inspired by dplyr's mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.

In [1]: iris = read_csv('data/iris.data')

In [2]: iris.head()
Out[2]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.

In [4]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] /
   ...:                                    x['SepalLength'])).head()
   ...:
Out[4]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

The power of assign comes when used in chains of operations. For example, we can limit the DataFrame to just those rows with a Sepal Length greater than 5, calculate the ratio, and plot:

In [5]: (iris.query('SepalLength > 5')
   ...:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ...:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
   ...:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ...:
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd23589bb10>
_images/whatsnew_assign.png

See thedocumentation for more. (GH9229)

Interaction with scipy.sparse

Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with a MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:

In [6]: from numpy import nan

In [7]: s = Series([3.0, nan, 1.0, 3.0, nan, nan])

In [8]: s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
   ...:                                   (1, 2, 'a', 1),
   ...:                                   (1, 1, 'b', 0),
   ...:                                   (1, 1, 'b', 1),
   ...:                                   (2, 1, 'b', 0),
   ...:                                   (2, 1, 'b', 1)],
   ...:                                  names=['A', 'B', 'C', 'D'])

In [9]: s
Out[9]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [10]: ss = s.to_sparse()

In [11]: ss
Out[11]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

In [12]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=False)

In [13]: A
Out[13]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>

In [14]: A.todense()
Out[14]:
matrix([[ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  1.,  3.],
        [ 0.,  0.,  0.,  0.]])

In [15]: rows
Out[15]: [(1, 2), (1, 1), (2, 1)]

In [16]: columns
Out[16]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:

In [17]: from scipy import sparse

In [18]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                       shape=(3, 4))

In [19]: A
Out[19]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>

In [20]: A.todense()
Out[20]:
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [21]: ss = SparseSeries.from_coo(A)

In [22]: ss
Out[22]:
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

String Methods Enhancements

  • The following new methods are accessible via the .str accessor, applying the function to each value. This is intended to make it more consistent with standard methods on strings. (GH9282, GH9352, GH9386, GH9387, GH9439)

    Methods
    isalnum()    isalpha()    isdigit()    isspace()
    islower()    isupper()    istitle()    isnumeric()    isdecimal()
    find()       rfind()      ljust()      rjust()        zfill()
    In [23]: s = Series(['abcd', '3456', 'EFGH'])

    In [24]: s.str.isalpha()
    Out[24]:
    0     True
    1    False
    2     True
    dtype: bool

    In [25]: s.str.find('ab')
    Out[25]:
    0    0
    1   -1
    2   -1
    dtype: int64
  • Series.str.pad() and Series.str.center() now accept fillchar option to specify filling character (GH9352)

    In [26]: s = Series(['12', '300', '25'])

    In [27]: s.str.pad(5, fillchar='_')
    Out[27]:
    0    ___12
    1    __300
    2    ___25
    dtype: object
  • Added Series.str.slice_replace(), which previously raised NotImplementedError (GH8888)

    In [28]: s = Series(['ABCD', 'EFGH', 'IJK'])

    In [29]: s.str.slice_replace(1, 3, 'X')
    Out[29]:
    0    AXD
    1    EXH
    2     IX
    dtype: object

    # replaced with empty char
    In [30]: s.str.slice_replace(0, 1)
    Out[30]:
    0    BCD
    1    FGH
    2     JK
    dtype: object

Other enhancements

  • Reindex now supports method='nearest' for frames or series with a monotonic increasing or decreasing index (GH9258):

    In [31]: df = pd.DataFrame({'x': range(5)})

    In [32]: df.reindex([0.2, 1.8, 3.5], method='nearest')
    Out[32]:
         x
    0.2  0
    1.8  2
    3.5  4

    This method is also exposed by the lower level Index.get_indexer and Index.get_loc methods.

  • The read_excel() function's sheetname argument now accepts a list and None, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (GH9450)

    # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
    pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])
  • Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs here (GH9493).

  • Paths beginning with ~ will now be expanded to begin with the user’s home directory (GH9066)

  • Added time interval selection inget_data_yahoo (GH9071)

  • Added Timestamp.to_datetime64() to complement Timedelta.to_timedelta64() (GH9255)

  • tseries.frequencies.to_offset() now accepts Timedelta as input (GH9064)

  • Lag parameter was added to the autocorrelation method of Series, defaults to lag-1 autocorrelation (GH9192)

  • Timedelta will now accept nanoseconds keyword in constructor (GH9273)

  • SQL code now safely escapes table and column names (GH8986)

  • Added auto-complete for Series.str.<tab>, Series.dt.<tab> and Series.cat.<tab> (GH9322)

  • Index.get_indexer now supports method='pad' and method='backfill' even for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (GH9258).
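A minimal sketch of method='pad' against a non-monotonic target array (the values here are illustrative):

```python
import pandas as pd

idx = pd.Index([0, 5, 10])

# 'pad' returns, for each target, the position of the last index value <= it;
# the target array itself no longer needs to be monotonic.
print(idx.get_indexer([6, 1, 11], method='pad'))  # [1 0 2]
```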

  • Index.asof now works on all index types (GH9258).

  • A verbose argument has been added to io.read_excel(); it defaults to False. Set to True to print sheet names as they are parsed. (GH9450)

  • Added days_in_month (compatibility alias daysinmonth) property to Timestamp, DatetimeIndex, Period, PeriodIndex, and Series.dt (GH9572)
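For example (a quick sketch on a Timestamp):

```python
import pandas as pd

ts = pd.Timestamp("2015-02-14")

# Both the property and its compatibility alias report the month length.
print(ts.days_in_month)  # 28
print(ts.daysinmonth)    # 28
```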

  • Added decimal option in to_csv to provide formatting for non-'.' decimal separators (GH781)
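A small sketch of the option (a ';' field separator is chosen here so the ',' decimal mark stays unambiguous; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.5, 2.25]})

# Render floats with a ',' decimal mark, as used in many European locales.
out = df.to_csv(sep=';', decimal=',', index=False)
print(out)
```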

  • Added normalize option for Timestamp to normalize to midnight (GH8794)
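In current pandas this is exposed as the Timestamp.normalize() method; a minimal sketch:

```python
import pandas as pd

ts = pd.Timestamp("2015-03-22 10:11:12")

# Normalizing keeps the date but resets the time component to midnight.
print(ts.normalize())  # Timestamp('2015-03-22 00:00:00')
```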

  • Added example for DataFrame import to R using an HDF5 file and the rhdf5 library. See the documentation for more (GH9636).

Backwards incompatible API changes

Changes in Timedelta

In v0.15.0 a new scalar type Timedelta was introduced, that is a sub-class of datetime.timedelta. Mentioned here was a notice of an API change w.r.t. the .seconds accessor. The intent was to provide a user-friendly set of accessors that give the 'natural' value for that unit, e.g. if you had a Timedelta('1 day, 10:11:12'), then .seconds would return 12. However, this is at odds with the definition of datetime.timedelta, which defines .seconds as 10 * 3600 + 11 * 60 + 12 == 36672.

So in v0.16.0, we are restoring the API to match that ofdatetime.timedelta. Further, the component values are still available through the.components accessor. This affects the.seconds and.microseconds accessors, and removes the.hours,.minutes,.milliseconds accessors. These changes affectTimedeltaIndex and the Series.dt accessor as well. (GH9185,GH9139)

Previous Behavior

In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123

New Behavior

In [33]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [34]: t.days
Out[34]: 1

In [35]: t.seconds
Out[35]: 36672

In [36]: t.microseconds
Out[36]: 100123

Using .components allows the full component access

In [37]: t.components
Out[37]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [38]: t.components.seconds
Out[38]: 12

Indexing Changes

The behavior of a small sub-set of edge cases for using.loc have changed (GH8613). Furthermore we have improved the content of the error messages that are raised:

  • Slicing with .loc where the start and/or stop bound is not found in the index is now allowed; this previously would raise a KeyError. This makes the behavior the same as .ix in this case. This change is only for slicing, not when indexing with a single label.

    In [39]: df = DataFrame(np.random.randn(5, 4),
       ....:                columns=list('ABCD'),
       ....:                index=date_range('20130101', periods=5))

    In [40]: df
    Out[40]:
                       A         B         C         D
    2013-01-01 -0.322795  0.841675  2.390961  0.076200
    2013-01-02 -0.566446  0.036142 -2.074978  0.247792
    2013-01-03 -0.897157 -0.136795  0.018289  0.755414
    2013-01-04  0.215269  0.841009 -1.445810 -1.401973
    2013-01-05 -0.100918 -0.548242 -0.144620  0.354020

    In [41]: s = Series(range(5), [-2, -1, 1, 2, 3])

    In [42]: s
    Out[42]:
    -2    0
    -1    1
     1    2
     2    3
     3    4
    dtype: int64

    Previous Behavior

    In [4]: df.loc['2013-01-02':'2013-01-10']
    KeyError: 'stop bound [2013-01-10] is not in the [index]'

    In [6]: s.loc[-10:3]
    KeyError: 'start bound [-10] is not the [index]'

    New Behavior

    In [43]: df.loc['2013-01-02':'2013-01-10']
    Out[43]:
                       A         B         C         D
    2013-01-02 -0.566446  0.036142 -2.074978  0.247792
    2013-01-03 -0.897157 -0.136795  0.018289  0.755414
    2013-01-04  0.215269  0.841009 -1.445810 -1.401973
    2013-01-05 -0.100918 -0.548242 -0.144620  0.354020

    In [44]: s.loc[-10:3]
    Out[44]:
    -2    0
    -1    1
     1    2
     2    3
     3    4
    dtype: int64
  • Allow slicing with float-like values on an integer index for .ix. Previously this was only enabled for .loc:

    Previous Behavior

    In [8]: s.ix[-1.0:2]
    TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)

    New Behavior

    In [45]: s.ix[-1.0:2]
    Out[45]:
    -1    1
     1    2
     2    3
    dtype: int64
  • Provide a useful exception for indexing with an invalid type for that index when using .loc. For example trying to use .loc on an index of type DatetimeIndex or PeriodIndex or TimedeltaIndex, with an integer (or a float).

    Previous Behavior

    In [4]: df.loc[2:3]
    KeyError: 'start bound [2] is not the [index]'

    New Behavior

    In [4]: df.loc[2:3]
    TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys

Categorical Changes

In prior versions, Categoricals that had an unspecified ordering (meaning no ordered keyword was passed) were defaulted as ordered Categoricals. Going forward, the ordered keyword in the Categorical constructor will default to False. Ordering must now be explicit.

Furthermore, previously you could change the ordered attribute of a Categorical by just setting the attribute, e.g. cat.ordered = True; this is now deprecated and you should use cat.as_ordered() or cat.as_unordered(). These will by default return a new object and not modify the existing object. (GH9347, GH9190)

Previous Behavior

In [3]: s = Series([0, 1, 2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

New Behavior

In [46]: s = Series([0, 1, 2], dtype='category')

In [47]: s
Out[47]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

In [48]: s.cat.ordered
Out[48]: False

In [49]: s = s.cat.as_ordered()

In [50]: s
Out[50]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [51]: s.cat.ordered
Out[51]: True

# you can set in the constructor of the Categorical
In [52]: s = Series(Categorical([0, 1, 2], ordered=True))

In [53]: s
Out[53]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [54]: s.cat.ordered
Out[54]: True

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling .astype(). These are passed directly to the constructor.

In [55]: s = Series(["a", "b", "c", "a"]).astype('category', ordered=True)

In [56]: s
Out[56]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [57]: s = Series(["a", "b", "c", "a"]).astype('category',
   ....:                                         categories=list('abcdef'),
   ....:                                         ordered=False)

In [58]: s
Out[58]:
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f]

Other API Changes

  • Index.duplicated now returns np.array(dtype=bool) rather than Index(dtype=object) containing bool values. (GH8875)
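A minimal sketch of the new return type (the index values are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.Index(['a', 'b', 'a'])
dup = idx.duplicated()

# A plain boolean ndarray is returned, not an object-dtype Index.
print(type(dup).__name__, dup.tolist())
```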

  • DataFrame.to_json now returns accurate type serialisation for each column for frames of mixed dtype (GH9037)

    Previously data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised to floats:

    In [2]: pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'

    Now each column is serialised using its correct dtype:

    In [2]: pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'
  • DatetimeIndex, PeriodIndex and TimedeltaIndex.summary now output the same format. (GH9116)

  • TimedeltaIndex.freqstr now outputs the same string format as DatetimeIndex. (GH9116)

  • Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's axhline or axvline methods (GH9088).

  • Series accessors .dt, .cat and .str now raise AttributeError instead of TypeError if the series does not contain the appropriate type of data (GH9617). This follows Python's built-in exception hierarchy more closely and ensures that tests like hasattr(s, 'cat') are consistent on both Python 2 and 3.
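A quick sketch of the effect on hasattr (the Series is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# An integer Series has no .dt accessor; because AttributeError is now
# raised (rather than TypeError), hasattr reports this cleanly.
print(hasattr(s, 'dt'))  # False
```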

  • Series now supports bitwise operation for integral types (GH9016). Previously even if the input dtypes were integral, the output dtype was coerced to bool.

    Previous Behavior

    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    True
    b    True
    c    True
    d    True
    dtype: bool

    New Behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.

    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    4
    b    5
    c    6
    d    7
    dtype: int64
  • During division involving a Series or DataFrame, 0/0 and 0//0 now give np.nan instead of np.inf. (GH9144, GH8445)

    Previous Behavior

    In [2]: p = pd.Series([0, 1])

    In [3]: p / 0
    Out[3]:
    0    inf
    1    inf
    dtype: float64

    In [4]: p // 0
    Out[4]:
    0    inf
    1    inf
    dtype: float64

    New Behavior

    In [59]: p = pd.Series([0, 1])

    In [60]: p / 0
    Out[60]:
    0    NaN
    1    inf
    dtype: float64

    In [61]: p // 0
    Out[61]:
    0    NaN
    1    inf
    dtype: float64
  • Series.value_counts and Series.describe for categorical data will now put NaN entries at the end. (GH9443)

  • Series.describe for categorical data will now give counts and frequencies of 0, notNaN, for unused categories (GH9443)
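A minimal sketch of the unused-category counts (the category labels here are illustrative):

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']))

# The unused category 'c' is reported with a count of 0, not dropped or NaN.
counts = s.value_counts()
print(counts)
```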

  • Due to a bug fix, looking up a partial string label with DatetimeIndex.asof now includes values that match the string, even if they are after the start of the partial string label (GH9258).

    Old behavior:

    In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[4]: Timestamp('2000-01-31 00:00:00')

    Fixed behavior:

    In [62]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[62]: Timestamp('2000-02-28 00:00:00')

    To reproduce the old behavior, simply add more precision to the label (e.g., use 2000-02-01 instead of 2000-02).

Deprecations

  • The rplot trellis plotting interface is deprecated and will be removed in a future version. We refer to external packages like seaborn for similar but more refined functionality (GH3445). The documentation includes some examples how to convert your existing code using rplot to seaborn: rplot docs.
  • The pandas.sandbox.qtpandas interface is deprecated and will be removed in a future version. We refer users to the external package pandas-qt. (GH9615)
  • The pandas.rpy interface is deprecated and will be removed in a future version. Similar functionality can be accessed through the rpy2 project (GH9602)
  • Adding DatetimeIndex/PeriodIndex to another DatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to a TypeError in a future version. .union() should be used for the union set operation. (GH9094)
  • Subtracting DatetimeIndex/PeriodIndex from another DatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to an actual numeric subtraction yielding a TimeDeltaIndex in a future version. .difference() should be used for the differencing set operation. (GH9094)
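A sketch of the recommended replacements (the dates are illustrative):

```python
import pandas as pd

a = pd.date_range('2015-01-01', periods=3)
b = pd.date_range('2015-01-03', periods=3)

# Explicit set operations instead of the deprecated '+' and '-':
print(a.union(b))       # 2015-01-01 .. 2015-01-05
print(a.difference(b))  # 2015-01-01, 2015-01-02
```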

Removal of prior version deprecations/changes

  • DataFrame.pivot_table and crosstab's rows and cols keyword arguments were removed in favor of index and columns (GH6581)
  • DataFrame.to_excel and DataFrame.to_csv's cols keyword argument was removed in favor of columns (GH6581)
  • Removed convert_dummies in favor of get_dummies (GH6581)
  • Removed value_range in favor of describe (GH6581)

Performance Improvements

  • Fixed a performance regression for .loc indexing with an array or list-like (GH9126).
  • DataFrame.to_json 30x performance improvement for mixed dtype frames. (GH9037)
  • Performance improvements in MultiIndex.duplicated by working with labels instead of values (GH9125)
  • Improved the speed of nunique by calling unique instead of value_counts (GH9129, GH7771)
  • Performance improvement of up to 10x in DataFrame.count and DataFrame.dropna by taking advantage of homogeneous/heterogeneous dtypes appropriately (GH9136)
  • Performance improvement of up to 20x in DataFrame.count when using a MultiIndex and the level keyword argument (GH9163)
  • Performance and memory usage improvements in merge when key space exceeds int64 bounds (GH9151)
  • Performance improvements in multi-key groupby (GH9429)
  • Performance improvements in MultiIndex.sortlevel (GH9445)
  • Performance and memory usage improvements in DataFrame.duplicated (GH9398)
  • Cythonized Period (GH9440)
  • Decreased memory usage on to_hdf (GH9648)

Bug Fixes

  • Changed .to_html to remove leading/trailing spaces in table body (GH4987)
  • Fixed issue using read_csv on s3 with Python 3 (GH9452)
  • Fixed compatibility issue in DatetimeIndex affecting architectures where numpy.int_ defaults to numpy.int32 (GH8943)
  • Bug in Panel indexing with an object-like (GH9140)
  • Bug in the returned Series.dt.components index being reset to the default index (GH9247)
  • Bug in Categorical.__getitem__/__setitem__ with listlike input getting incorrect results from indexer coercion (GH9469)
  • Bug in partial setting with a DatetimeIndex (GH9478)
  • Bug in groupby for integer and datetime64 columns when applying an aggregator that caused the value to be changed when the number was sufficiently large (GH9311, GH6620)
  • Fixed bug in to_sql when mapping a Timestamp object column (datetime column with timezone info) to the appropriate sqlalchemy type (GH9085).
  • Fixed bug in to_sql dtype argument not accepting an instantiated SQLAlchemy type (GH9083).
  • Bug in .loc partial setting with a np.datetime64 (GH9516)
  • Incorrect dtypes inferred on datetimelike looking Series & on .xs slices (GH9477)
  • Items in Categorical.unique() (and s.unique() if s is of dtype category) now appear in the order in which they are originally found, not in sorted order (GH9331). This is now consistent with the behavior for other dtypes in pandas.
  • Fixed bug on big endian platforms which produced incorrect results in StataReader (GH8688).
  • Bug in MultiIndex.has_duplicates when having many levels causing an indexer overflow (GH9075, GH5873)
  • Bug in pivot and unstack where nan values would break index alignment (GH4862, GH7401, GH7403, GH7405, GH7466, GH9497)
  • Bug in left join on a multi-index with sort=True or null values (GH9210).
  • Bug in MultiIndex where inserting new keys would fail (GH9250).
  • Bug in groupby when key space exceeds int64 bounds (GH9096).
  • Bug in unstack with TimedeltaIndex or DatetimeIndex and nulls (GH9491).
  • Bug in rank where comparing floats with tolerance would cause inconsistent behaviour (GH8365).
  • Fixed character encoding bug in read_stata and StataReader when loading data from a URL (GH9231).
  • Bug where adding offsets.Nano to other offsets raised TypeError (GH9284)
  • Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)
  • Bugs in resample around DST transitions. This required fixing offset classes so they behave correctly on DST transitions. (GH5172, GH8744, GH8653, GH9173, GH9468).
  • Bug in binary operator method (e.g. .mul()) alignment with integer levels (GH9463).
  • Bug where boxplot, scatter and hexbin plots may show an unnecessary warning (GH8877)
  • Bug where subplot with the layout kw may show an unnecessary warning (GH9464)
  • Bug in using grouper functions that need pass-through arguments (e.g. axis) when using a wrapped function (e.g. fillna) (GH9221)
  • DataFrame now properly supports simultaneous copy and dtype arguments in the constructor (GH9099)
  • Bug in read_csv when using skiprows on a file with CR line endings with the c engine. (GH9079)
  • isnull now detects NaT in PeriodIndex (GH9129)
  • Bug in groupby.nth() with a multiple column groupby (GH8979)
  • Bug in DataFrame.where and Series.where incorrectly coercing numerics to string (GH9280)
  • Bug in DataFrame.where and Series.where raising ValueError when a string list-like is passed. (GH9280)
  • Accessing Series.str methods with non-string values now raises TypeError instead of producing incorrect results (GH9184)
  • Bug in DatetimeIndex.__contains__ when the index has duplicates and is not monotonic increasing (GH9512)
  • Fixed division by zero error for Series.kurt() when all values are equal (GH9197)
  • Fixed issue in the xlsxwriter engine where it added a default ‘General’ format to cells if no other format was applied. This prevented other row or column formatting being applied. (GH9167)
  • Fixed issue with index_col=False when usecols is also specified in read_csv. (GH9082)
  • Bug where wide_to_long would modify the input stubnames list (GH9204)
  • Bug in to_sql not storing float64 values using double precision. (GH9009)
  • SparseSeries and SparsePanel now accept zero argument constructors (same as their non-sparse counterparts) (GH9272).
  • Regression in merging Categorical and object dtypes (GH9426)
  • Bug in read_csv with buffer overflows with certain malformed input files (GH9205)
  • Bug in groupby MultiIndex with missing pair (GH9049, GH9344)
  • Fixed bug in Series.groupby where grouping on MultiIndex levels would ignore the sort argument (GH9444)
  • Fixed bug in DataFrame.groupby where sort=False is ignored in the case of Categorical columns. (GH8868)
  • Fixed bug with reading CSV files from Amazon S3 on python 3 raising a TypeError (GH9452)
  • Bug in the Google BigQuery reader where the ‘jobComplete’ key may be present but False in the query results (GH8728)
  • Bug in Series.value_counts excluding NaN for categorical dtype Series when dropna=True (GH9443)
  • Fixed missing numeric_only option for DataFrame.std/var/sem (GH9201)
  • Support constructing Panel or Panel4D with scalar data (GH8285)
  • Series text representation disconnected from max_rows/max_columns (GH7508).

  • Series number formatting inconsistent when truncated (GH8532).

    Previous Behavior

    In [2]: pd.options.display.max_rows = 10

    In [3]: s = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9999, 1, 1] * 10)

    In [4]: s
    Out[4]:
    0      1
    1      1
    2      1
    ...
    127    0.9999
    128    1.0000
    129    1.0000
    Length: 130, dtype: float64

    New Behavior

    0      1.0000
    1      1.0000
    2      1.0000
    3      1.0000
    4      1.0000
    ...
    125    1.0000
    126    1.0000
    127    0.9999
    128    1.0000
    129    1.0000
    dtype: float64
  • A spurious SettingWithCopy warning was generated when setting a new item in a frame in some cases (GH8730)

    The following would previously report a SettingWithCopy warning.

    In [1]: df1 = DataFrame({'x': Series(['a', 'b', 'c']), 'y': Series(['d', 'e', 'f'])})

    In [2]: df2 = df1[['x']]

    In [3]: df2['y'] = ['g', 'h', 'i']
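One of the fixes above made Categorical.unique() (and s.unique() for category dtype) preserve order of first appearance rather than sorted order. A minimal sketch of the new behavior (values are illustrative):

```python
import pandas as pd

# unique() on category dtype now preserves order of first appearance,
# consistent with the behavior for other dtypes
s = pd.Series(['b', 'a', 'b', 'c'], dtype='category')
print(list(s.unique()))  # ['b', 'a', 'c']
```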

v0.15.2 (December 12, 2014)

This is a minor release from 0.15.1 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. A small number of API changes were necessary to fix existing bugs. We recommend that all users upgrade to this version.

API changes

  • Indexing in MultiIndex beyond lex-sort depth is now supported, though a lexically sorted index will have better performance. (GH2646)

    In [1]: df = pd.DataFrame({'jim': [0, 0, 1, 1],
       ...:                    'joe': ['x', 'x', 'z', 'y'],
       ...:                    'jolie': np.random.rand(4)}).set_index(['jim', 'joe'])
       ...:

    In [2]: df
    Out[2]:
                jolie
    jim joe
    0   x    0.123943
        x    0.119381
    1   z    0.738523
        y    0.587304

    In [3]: df.index.lexsort_depth
    Out[3]: 1

    # in prior versions this would raise a KeyError
    # will now show a PerformanceWarning
    In [4]: df.loc[(1, 'z')]
    Out[4]:
                jolie
    jim joe
    1   z    0.738523

    # lexically sorting
    In [5]: df2 = df.sortlevel()

    In [6]: df2
    Out[6]:
                jolie
    jim joe
    0   x    0.123943
        x    0.119381
    1   y    0.587304
        z    0.738523

    In [7]: df2.index.lexsort_depth
    Out[7]: 2

    In [8]: df2.loc[(1, 'z')]
    Out[8]:
                jolie
    jim joe
    1   z    0.738523
  • Bug in unique of Series with category dtype, which returned all categories regardless of whether they were “used” or not (see GH8559 for the discussion). Previous behaviour was to return all categories:

    In [3]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])

    In [4]: cat
    Out[4]:
    [a, b, a]
    Categories (3, object): [a < b < c]

    In [5]: cat.unique()
    Out[5]: array(['a', 'b', 'c'], dtype=object)

    Now, only the categories that effectively occur in the array are returned:

    In [9]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])

    In [10]: cat.unique()
    Out[10]:
    [a, b]
    Categories (2, object): [a, b]
  • Series.all and Series.any now support the level and skipna parameters. Series.all, Series.any, Index.all, and Index.any no longer support the out and keepdims parameters, which existed for compatibility with ndarray. Various index types no longer support the all and any aggregation functions and will now raise TypeError. (GH8302).

  • Allow equality comparisons of Series with a categorical dtype and object dtype; previously these would raiseTypeError (GH8938)

  • Bug in NDFrame: conflicting attribute/column names now behave consistently between getting and setting. Previously, when both a column and attribute named y existed, data.y would return the attribute, while data.y = z would update the column (GH8994)

    In [11]: data = pd.DataFrame({'x': [1, 2, 3]})

    In [12]: data.y = 2

    In [13]: data['y'] = [2, 4, 6]

    In [14]: data
    Out[14]:
       x  y
    0  1  2
    1  2  4
    2  3  6

    # this assignment was inconsistent
    In [15]: data.y = 5

    Old behavior:

    In [6]: data.y
    Out[6]: 2

    In [7]: data['y'].values
    Out[7]: array([5, 5, 5])

    New behavior:

    In [16]: data.y
    Out[16]: 5

    In [17]: data['y'].values
    Out[17]: array([2, 4, 6])
  • Timestamp('now') is now equivalent to Timestamp.now() in that it returns the local time rather than UTC. Also, Timestamp('today') is now equivalent to Timestamp.today() and both have tz as a possible argument. (GH9000)

  • Fix negative step support for label-based slices (GH8753)

    Old behavior:

    In [1]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
    Out[1]:
    a    0
    b    1
    c    2
    dtype: int64

    In [2]: s.loc['c':'a':-1]
    Out[2]:
    c    2
    dtype: int64

    New behavior:

    In [18]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])

    In [19]: s.loc['c':'a':-1]
    Out[19]:
    c    2
    b    1
    a    0
    dtype: int64

Enhancements

Categorical enhancements:

  • Added ability to export Categorical data to Stata (GH8633). See here for limitations of categorical variables exported to Stata data files.
  • Added flag order_categoricals to StataReader and read_stata to select whether to order imported categorical data (GH8836). See here for more information on importing categorical variables from Stata data files.
  • Added ability to export Categorical data to/from HDF5 (GH7621). Queries work the same as if it was an object array. However, the category dtyped data is stored in a more efficient manner. See here for an example and caveats w.r.t. prior versions of pandas.
  • Added support for searchsorted() on the Categorical class (GH8420).
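A minimal sketch of searchsorted() on an ordered Categorical (the values are illustrative, and are already in sorted order here):

```python
import pandas as pd

# searchsorted on an ordered Categorical, analogous to numpy.searchsorted:
# returns the insertion point that keeps the values sorted
cat = pd.Categorical(['a', 'b', 'c', 'd'], ordered=True)
idx = cat.searchsorted('c')
print(int(idx))  # 2
```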

Other enhancements:

  • Added the ability to specify the SQL type of columns when writing a DataFrame to a database (GH8778). For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:

    from sqlalchemy.types import String
    data.to_sql('data_dtype', engine, dtype={'Col_1': String})
  • Series.all and Series.any now support the level and skipna parameters (GH8302):

    In [20]: s = pd.Series([False, True, False], index=[0, 0, 1])

    In [21]: s.any(level=0)
    Out[21]:
    0     True
    1    False
    dtype: bool
  • Panel now supports the all and any aggregation functions. (GH8302):

    In [22]: p = pd.Panel(np.random.rand(2, 5, 4) > 0.1)

    In [23]: p.all()
    Out[23]:
           0      1
    0   True   True
    1   True   True
    2  False  False
    3   True   True
  • Added support for utcfromtimestamp(), fromtimestamp(), and combine() on the Timestamp class (GH5351).

  • Added Google Analytics (pandas.io.ga) basic documentation (GH8835). See here.

  • Timedelta arithmetic returns NotImplemented in unknown cases, allowing extensions by custom classes (GH8813).

  • Timedelta now supports arithmetic with numpy.ndarray objects of the appropriate dtype (numpy 1.8 or newer only) (GH8884).

  • Added Timedelta.to_timedelta64() method to the public API (GH8884).
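A minimal sketch of the new public method (the timedelta value is illustrative):

```python
import numpy as np
import pandas as pd

# Convert a pandas Timedelta scalar to its numpy counterpart
td = pd.Timedelta('1 days 2 hours')
td64 = td.to_timedelta64()
print(type(td64))  # <class 'numpy.timedelta64'>
```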

  • Added gbq.generate_bq_schema() function to the gbq module (GH8325).

  • Series now works with map objects the same way as generators (GH8909).
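A minimal sketch, passing a map object (an iterator) directly to the Series constructor:

```python
import pandas as pd

# A map object can be consumed by the Series constructor,
# the same way a generator can
s = pd.Series(map(str.upper, ['a', 'b', 'c']))
print(list(s))  # ['A', 'B', 'C']
```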

  • Added context manager to HDFStore for automatic closing (GH8791).

  • to_datetime gains an exact keyword to allow a provided format string to not require an exact match (when it is False). exact defaults to True (meaning that exact matching is still the default) (GH8904)
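A minimal sketch of the exact keyword (the dates are illustrative). With exact=False the format may match anywhere in the string, so the unmatched trailing portion is ignored rather than raising:

```python
import pandas as pd

# exact=False: '%Y-%m-%d' only needs to match part of the input,
# so the trailing ' 10:30' does not cause a parse error
ts = pd.to_datetime('2014-12-12 10:30', format='%Y-%m-%d', exact=False)
print(ts)  # 2014-12-12 00:00:00
```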

  • Added axvlines boolean option to the parallel_coordinates plot function, determining whether vertical lines will be drawn; default is True

  • Added ability to read table footers to read_html (GH8552)

  • to_sql now infers datatypes of non-NA values for columns that contain NA values and have dtype object (GH8778).

Performance

  • Reduce memory usage when skiprows is an integer in read_csv (GH8681)
  • Performance boost for to_datetime conversions with a passed format= and exact=False (GH8904)

Bug Fixes

  • Bug in concat of Series with category dtype which was coercing to object. (GH8641)
  • Bug in Timestamp-Timestamp not returning a Timedelta type and datelike-datelike ops with timezones (GH8865)
  • Made timezone mismatch exceptions consistent (either a tz operated with None or an incompatible timezone); these will now raise TypeError rather than ValueError (a couple of edge cases only), (GH8865)
  • Bug in using a pd.Grouper(key=...) with no level/axis or level only (GH8795, GH8866)
  • Report a TypeError when invalid/no parameters are passed in a groupby (GH8015)
  • Bug in packaging pandas with py2app/cx_Freeze (GH8602, GH8831)
  • Bug in groupby signatures that didn’t include *args or **kwargs (GH8733).
  • io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH8761), (GH8783).
  • Unclear error message in csv parsing when passing dtype and names and the parsed data is a different data type (GH8833)
  • Bug in slicing a multi-index with an empty list and at least one boolean indexer (GH8781)
  • Timedelta kwargs may now be numpy ints and floats (GH8757).
  • Fixed several outstanding bugs for Timedelta arithmetic and comparisons (GH8813, GH5963, GH5436).
  • sql_schema now generates dialect appropriate CREATE TABLE statements (GH8697)
  • slice string method now takes step into account (GH8754)
  • Bug in BlockManager where setting values with a different type would break block integrity (GH8850)
  • Bug in DatetimeIndex when using a time object as key (GH8667)
  • Bug in merge where how='left' and sort=False would not preserve left frame order (GH7331)
  • Bug in MultiIndex.reindex where reindexing at level would not reorder labels (GH4088)
  • Bug in certain operations with dateutil timezones, manifesting with dateutil 2.3 (GH8639)
  • Regression in DatetimeIndex iteration with a Fixed/Local offset timezone (GH8890)
  • Bug in to_datetime when parsing nanoseconds using the %f format (GH8989)
  • Fix: the font size was only set on the x axis if vertical, or on the y axis if horizontal. (GH8765)
  • Fixed division by 0 when reading big csv files in python 3 (GH8621)
  • Bug in outputting a MultiIndex with to_html, index=False, which would add an extra column (GH8452)
  • Imported categorical variables from Stata files retain the ordinal information in the underlying data (GH8836).
  • Defined .size attribute across NDFrame objects to provide compat with numpy >= 1.9.1; buggy with np.array_split (GH8846)
  • Skip testing of histogram plots for matplotlib <= 1.2 (GH8648).
  • Bug where get_data_google returned object dtypes (GH3995)
  • Bug in DataFrame.stack(..., dropna=False) when the DataFrame’s columns is a MultiIndex whose labels do not reference all its levels. (GH8844)
  • Bug in that Option context applied on __enter__ (GH8514)
  • Bug in resample that caused a ValueError when resampling across multiple days and the last offset was not calculated from the start of the range (GH8683)
  • Bug where DataFrame.plot(kind='scatter') failed when checking if an np.array is in the DataFrame (GH8852)
  • Bug in pd.infer_freq/DataFrame.inferred_freq that prevented proper sub-daily frequency inference when the index contained DST days (GH8772).
  • Bug where the index name was still used when plotting a series with use_index=False (GH8558).
  • Bugs when trying to stack multiple columns, when some (or all) of the level names are numbers (GH8584).
  • Bug in MultiIndex where __contains__ returns the wrong result if the index is not lexically sorted or unique (GH7724)
  • BUG CSV: fix problem with trailing whitespace in skipped rows, (GH8679), (GH8661), (GH8983)
  • Regression in Timestamp not parsing the ‘Z’ zone designator for UTC (GH8771)
  • Bug in StataWriter that produced writes of strings with 244 characters irrespective of actual size (GH8969)
  • Fixed ValueError raised by cummin/cummax when a datetime64 Series contains NaT. (GH8965)
  • Bug in Datareader returning object dtype if there are missing values (GH8980)
  • Bug in plotting if sharex was enabled and the index was a timeseries; would show labels on multiple axes (GH3964).
  • Bug where passing a unit to the TimedeltaIndex constructor applied the nanosecond conversion twice. (GH9011).
  • Bug in plotting of a period-like array (GH9012)
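The step-aware str.slice fix noted above can be sketched as follows (the strings are illustrative):

```python
import pandas as pd

# str.slice now honors the step argument, matching built-in slicing semantics:
# 'abcdefg'[0:6:2] == 'ace'
s = pd.Series(['abcdefg', 'hijklmn'])
out = s.str.slice(0, 6, 2)
print(out.tolist())  # ['ace', 'hjl']
```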

v0.15.1 (November 9, 2014)

This is a minor bug-fix release from 0.15.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

API changes

  • s.dt.hour and other .dt accessors will now return np.nan for missing values (rather than previously -1), (GH8689)

    In [1]: s = Series(date_range('20130101', periods=5, freq='D'))

    In [2]: s.iloc[2] = np.nan

    In [3]: s
    Out[3]:
    0   2013-01-01
    1   2013-01-02
    2          NaT
    3   2013-01-04
    4   2013-01-05
    dtype: datetime64[ns]

    previous behavior:

    In [6]: s.dt.hour
    Out[6]:
    0    0
    1    0
    2   -1
    3    0
    4    0
    dtype: int64

    current behavior:

    In [4]: s.dt.hour
    Out[4]:
    0    0.0
    1    0.0
    2    NaN
    3    0.0
    4    0.0
    dtype: float64
  • groupby with as_index=False will not add erroneous extra columns to the result (GH8582):

    In [5]: np.random.seed(2718281)

    In [6]: df = pd.DataFrame(np.random.randint(0, 100, (10, 2)),
       ...:                   columns=['jim', 'joe'])
       ...:

    In [7]: df.head()
    Out[7]:
       jim  joe
    0   61   81
    1   96   49
    2   55   65
    3   72   51
    4   77   12

    In [8]: ts = pd.Series(5 * np.random.randint(0, 3, 10))

    previous behavior:

    In [4]: df.groupby(ts, as_index=False).max()
    Out[4]:
       NaN  jim  joe
    0    0   72   83
    1    5   77   84
    2   10   96   65

    current behavior:

    In [9]: df.groupby(ts, as_index=False).max()
    Out[9]:
       jim  joe
    0   72   83
    1   77   84
    2   96   65
  • groupby will not erroneously exclude columns if the column name conflicts with the grouper name (GH8112):

    In [10]: df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)})

    In [11]: df
    Out[11]:
       jim  joe
    0    0    5
    1    1    6
    2    2    7
    3    3    8
    4    4    9

    In [12]: gr = df.groupby(df['jim'] < 2)

    previous behavior (excludes 1st column from output):

    In [4]: gr.apply(sum)
    Out[4]:
           joe
    jim
    False   24
    True    11

    current behavior:

    In [13]: gr.apply(sum)
    Out[13]:
           jim  joe
    jim
    False    9   24
    True     1   11
  • Support for slicing with monotonic decreasing indexes, even if start or stop is not found in the index (GH7860):

    In [14]: s = pd.Series(['a', 'b', 'c', 'd'], [4, 3, 2, 1])

    In [15]: s
    Out[15]:
    4    a
    3    b
    2    c
    1    d
    dtype: object

    previous behavior:

    In [8]: s.loc[3.5:1.5]
    KeyError: 3.5

    current behavior:

    In [16]: s.loc[3.5:1.5]
    Out[16]:
    3    b
    2    c
    dtype: object
  • io.data.Options has been fixed for a change in the format of the Yahoo Options page (GH8612), (GH8741)

    Note

    As a result of a change in Yahoo’s option page layout, when an expiry date is given, Options methods now return data for a single expiry date. Previously, methods returned all data for the selected month.

    The month and year parameters have been undeprecated and can be used to get all options data for a given month.

    If an expiry date that is not valid is given, data for the next expiry after the given date is returned.

    Option data frames are now saved on the instance as callsYYMMDD or putsYYMMDD. Previously they were saved as callsMMYY and putsMMYY. The next expiry is saved as calls and puts.

    New features:

    • The expiry parameter can now be a single date or a list-like object containing dates.
    • A new property expiry_dates was added, which returns all available expiry dates.

    Current behavior:

    In [17]: from pandas.io.data import Options

    In [18]: aapl = Options('aapl', 'yahoo')

    In [19]: aapl.get_call_data().iloc[0:5, 0:1]
    Out[19]:
                                                 Last
    Strike Expiry     Type Symbol
    80     2014-11-14 call AAPL141114C00080000  29.05
    84     2014-11-14 call AAPL141114C00084000  24.80
    85     2014-11-14 call AAPL141114C00085000  24.05
    86     2014-11-14 call AAPL141114C00086000  22.76
    87     2014-11-14 call AAPL141114C00087000  21.74

    In [20]: aapl.expiry_dates
    Out[20]:
    [datetime.date(2014, 11, 14),
     datetime.date(2014, 11, 22),
     datetime.date(2014, 11, 28),
     datetime.date(2014, 12, 5),
     datetime.date(2014, 12, 12),
     datetime.date(2014, 12, 20),
     datetime.date(2015, 1, 17),
     datetime.date(2015, 2, 20),
     datetime.date(2015, 4, 17),
     datetime.date(2015, 7, 17),
     datetime.date(2016, 1, 15),
     datetime.date(2017, 1, 20)]

    In [21]: aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3]).iloc[0:5, 0:1]
    Out[21]:
                                                Last
    Strike Expiry     Type Symbol
    109    2014-11-22 call AAPL141122C00109000  1.48
           2014-11-28 call AAPL141128C00109000  1.79
    110    2014-11-14 call AAPL141114C00110000  0.55
           2014-11-22 call AAPL141122C00110000  1.02
           2014-11-28 call AAPL141128C00110000  1.32
  • pandas now also registers the datetime64 dtype in matplotlib’s units registry to plot such values as datetimes. This is activated once pandas is imported. In previous versions, plotting an array of datetime64 values would have resulted in plotted integer values. To keep the previous behaviour, you can do del matplotlib.units.registry[np.datetime64] (GH8614).

Enhancements

  • concat permits a wider variety of iterables of pandas objects to be passed as the first parameter (GH8645):

    In [17]: from collections import deque

    In [18]: df1 = pd.DataFrame([1, 2, 3])

    In [19]: df2 = pd.DataFrame([4, 5, 6])

    previous behavior:

    In [7]: pd.concat(deque((df1, df2)))
    TypeError: first argument must be a list-like of pandas objects, you passed an object of type "deque"

    current behavior:

    In [20]: pd.concat(deque((df1, df2)))
    Out[20]:
       0
    0  1
    1  2
    2  3
    0  4
    1  5
    2  6
  • Represent MultiIndex labels with a dtype that utilizes memory based on the level size. In prior versions, the memory usage was a constant 8 bytes per element in each level. In addition, in prior versions, the reported memory usage was incorrect as it didn’t show the usage for the memory occupied by the underlying data array. (GH8456)

    In [21]: dfi = DataFrame(1, index=pd.MultiIndex.from_product([['a'], range(1000)]), columns=['A'])

    previous behavior:

    # this was underreported in prior versions
    In [1]: dfi.memory_usage(index=True)
    Out[1]:
    Index    8000  # took about 24008 bytes in < 0.15.1
    A        8000
    dtype: int64

    current behavior:

    In [22]: dfi.memory_usage(index=True)
    Out[22]:
    Index    11040
    A         8000
    dtype: int64
  • Added Index properties is_monotonic_increasing and is_monotonic_decreasing (GH8680).
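A minimal sketch of the new properties (values are illustrative); note that repeated values still count as monotonic:

```python
import pandas as pd

# New Index introspection properties
idx = pd.Index([1, 2, 2, 5])
print(idx.is_monotonic_increasing)  # True
print(idx.is_monotonic_decreasing)  # False
```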

  • Added option to select columns when importing Stata files (GH7935)

  • Qualify memory usage in DataFrame.info() by adding + if it is a lower bound (GH8578)

  • Raise errors in certain aggregation cases where an argument such as numeric_only is not handled (GH8592).

  • Added support for 3-character ISO and non-standard country codes in io.wb.download() (GH8482)

  • World Bank data requests now will warn/raise based on an errors argument, as well as a list of hard-coded country codes and the World Bank’s JSON response. In prior versions, the error messages didn’t look at the World Bank’s JSON response. Problem-inducing input was simply dropped prior to the request. The issue was that many good countries were cropped in the hard-coded approach. All countries will work now, but some bad countries will raise exceptions because some edge cases break the entire response. (GH8482)

  • Added option to Series.str.split() to return a DataFrame rather than a Series (GH8428)

  • Added option to df.info(null_counts=None|True|False) to override the default display options and force showing of the null-counts (GH8701)

Bug Fixes

  • Bug in unpickling of a CustomBusinessDay object (GH8591)
  • Bug in coercing Categorical to a records array, e.g. df.to_records() (GH8626)
  • Bug in Categorical not created properly with Series.to_frame() (GH8626)
  • Bug in coercing in astype of a Categorical of a passed pd.Categorical (this now raises TypeError correctly), (GH8626)
  • Bug in cut/qcut when using Series and retbins=True (GH8589)
  • Bug in writing Categorical columns to an SQL database with to_sql (GH8624).
  • Bug in comparing a Categorical of datetime raising when being compared to a scalar datetime (GH8687)
  • Bug in selecting from a Categorical with .iloc (GH8623)
  • Bug in groupby-transform with a Categorical (GH8623)
  • Bug in duplicated/drop_duplicates with a Categorical (GH8623)
  • Bug in Categorical reflected comparison operator raising if the first argument was a numpy array scalar (e.g. np.int64) (GH8658)
  • Bug in Panel indexing with a list-like (GH8710)
  • Compat issue in DataFrame.dtypes when options.mode.use_inf_as_null is True (GH8722)
  • Bug in read_csv, dialect parameter would not take a string (GH8703)
  • Bug in slicing a multi-index level with an empty list (GH8737)
  • Bug in numeric index operations of add/sub with Float/Index Index with numpy arrays (GH8608)
  • Bug in setitem with empty indexer and unwanted coercion of dtypes (GH8669)
  • Bug in ix/loc block splitting on setitem (manifests with integer-like dtypes, e.g. datetime64) (GH8607)
  • Bug when doing label based indexing with integers not found in the index for non-unique but monotonic indexes (GH8680).
  • Bug when indexing a Float64Index with np.nan on numpy 1.7 (GH8980).
  • Fix shape attribute for MultiIndex (GH8609)
  • Bug in GroupBy where a name conflict between the grouper and columns would break groupby operations (GH7115, GH8112)
  • Fixed a bug where plotting a column y and specifying a label would mutate the index name of the original DataFrame (GH8494)
  • Fix regression in plotting of a DatetimeIndex directly with matplotlib (GH8614).
  • Bug in date_range where partially-specified dates would incorporate the current date (GH6961)
  • Bug where setting by indexer to a scalar value with a mixed-dtype Panel4d was failing (GH8702)
  • Bug where DataReader would fail if one of the symbols passed was invalid. Now returns data for valid symbols and np.nan for invalid (GH8494)
  • Bug in get_quote_yahoo that wouldn’t allow non-float return values (GH5229).

v0.15.0 (October 18, 2014)

This is a major release from 0.14.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.15.0 will no longer support compatibility with NumPy versions < 1.7.0. If you want to use the latest versions of pandas, please upgrade to NumPy >= 1.7.0 (GH7711)

Warning

In 0.15.0 Index has internally been refactored to no longer sub-class ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (See the Internal Refactoring)

Warning

The refactorings in Categorical changed the two argument constructor from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use Categorical directly, please audit your code before updating to this pandas version and change it to use the from_codes() constructor. See more on Categorical here

New features

Categoricals in Series/DataFrame

Categorical can now be included in Series and DataFrames and gained new methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (GH3943, GH5313, GH5314, GH7444, GH7839, GH7848, GH7864, GH7914, GH7768, GH8006, GH3678, GH8075, GH8076, GH8143, GH8453, GH8518).

For full docs, see thecategorical introduction and theAPI documentation.

In [1]: df = DataFrame({"id": [1, 2, 3, 4, 5, 6],
   ...:                 "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

In [2]: df["grade"] = df["raw_grade"].astype("category")

In [3]: df["grade"]
Out[3]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

# Rename the categories
In [4]: df["grade"].cat.categories = ["very good", "good", "very bad"]

# Reorder the categories and simultaneously add the missing categories
In [5]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [6]: df["grade"]
Out[6]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [7]: df.sort("grade")
Out[7]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

In [8]: df.groupby("grade").size()
Out[8]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
  • pandas.core.group_agg and pandas.core.factor_agg were removed. As an alternative, construct a dataframe and use df.groupby(<group>).agg(<func>).
  • Supplying “codes/labels and levels” to the Categorical constructor is not supported anymore. Supplying two arguments to the constructor is now interpreted as “values and levels (now called ‘categories’)”. Please change your code to use the from_codes() constructor.
  • The Categorical.labels attribute was renamed to Categorical.codes and is read only. If you want to manipulate codes, please use one of the API methods on Categoricals.
  • The Categorical.levels attribute is renamed to Categorical.categories.

TimedeltaIndex/Scalar

We introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes. This type is very similar to how Timestamp works for datetimes. It is a nice-API box for the type. See the docs. (GH3009, GH4533, GH8209, GH8187, GH8190, GH7869, GH7661, GH8345, GH8471)

Warning

Timedelta scalars (and TimedeltaIndex) component fields are not the same as the component fields on a datetime.timedelta object. For example, .seconds on a datetime.timedelta object returns the total number of seconds combined between hours, minutes and seconds. In contrast, the pandas Timedelta breaks out hours, minutes, microseconds and nanoseconds separately.

# Timedelta accessor
In [9]: tds = Timedelta('31 days 5 min 3 sec')

In [10]: tds.minutes
Out[10]: 5L

In [11]: tds.seconds
Out[11]: 3L

# datetime.timedelta accessor
# this is 5 minutes * 60 + 3 seconds
In [12]: tds.to_pytimedelta().seconds
Out[12]: 303

Note: this is no longer true starting from v0.16.0, where full compatibility with datetime.timedelta is introduced. See the 0.16.0 whatsnew entry

Warning

Prior to 0.15.0 pd.to_timedelta would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.

The arguments to pd.to_timedelta are now (arg, unit='ns', box=True, coerce=False); previously they were (arg, box=True, unit='ns') as these are more logical.

Construct a scalar

In [9]: Timedelta('1 days 06:05:01.00003')
Out[9]: Timedelta('1 days 06:05:01.000030')

In [10]: Timedelta('15.5us')
Out[10]: Timedelta('0 days 00:00:00.000015')

In [11]: Timedelta('1 hour 15.5us')
Out[11]: Timedelta('0 days 01:00:00.000015')

# negative Timedeltas have this string repr
# to be more consistent with datetime.timedelta conventions
In [12]: Timedelta('-1us')
Out[12]: Timedelta('-1 days +23:59:59.999999')

# a NaT
In [13]: Timedelta('nan')
Out[13]: NaT

Access fields for aTimedelta

In [14]: td = Timedelta('1 hour 3m 15.5us')

In [15]: td.seconds
Out[15]: 3780

In [16]: td.microseconds
Out[16]: 15

In [17]: td.nanoseconds
Out[17]: 500

Construct aTimedeltaIndex

In [18]: TimedeltaIndex(['1 days', '1 days, 00:00:05',
   ....:                 np.timedelta64(2, 'D'), timedelta(days=2, seconds=2)])
   ....:
Out[18]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
                '2 days 00:00:02'],
               dtype='timedelta64[ns]', freq=None)

Constructing aTimedeltaIndex with a regular range

In [19]: timedelta_range('1 days', periods=5, freq='D')
Out[19]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')

In [20]: timedelta_range(start='1 days', end='2 days', freq='30T')
Out[20]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
                '1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
                '1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
                '1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
                '1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
                '1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
                '1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
                '1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
                '1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
                '1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
                '1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
                '1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
                '1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
                '1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
                '1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
                '1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
                '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='30T')

You can now use a TimedeltaIndex as the index of a pandas object

In [21]:s=Series(np.arange(5),   ....:index=timedelta_range('1 days',periods=5,freq='s'))   ....:In [22]:sOut[22]:1 days 00:00:00    01 days 00:00:01    11 days 00:00:02    21 days 00:00:03    31 days 00:00:04    4Freq: S, dtype: int64

You can select with partial string selections

In [23]: s['1 day 00:00:02']
Out[23]: 2

In [24]: s['1 day':'1 day 00:00:02']
Out[24]:
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
Freq: S, dtype: int64

Finally, combining a TimedeltaIndex with a DatetimeIndex allows certain combination operations that are NaT-preserving:

In [25]:tdi=TimedeltaIndex(['1 days',pd.NaT,'2 days'])In [26]:tdi.tolist()Out[26]:[Timedelta('1 days 00:00:00'),NaT,Timedelta('2 days 00:00:00')]In [27]:dti=date_range('20130101',periods=3)In [28]:dti.tolist()Out[28]:[Timestamp('2013-01-01 00:00:00', freq='D'), Timestamp('2013-01-02 00:00:00', freq='D'), Timestamp('2013-01-03 00:00:00', freq='D')]In [29]:(dti+tdi).tolist()Out[29]:[Timestamp('2013-01-02 00:00:00'),NaT,Timestamp('2013-01-05 00:00:00')]In [30]:(dti-tdi).tolist()Out[30]:[Timestamp('2012-12-31 00:00:00'),NaT,Timestamp('2013-01-01 00:00:00')]
  • Iterating over a Series of dtype timedelta64[ns], e.g. list(Series(...)), previously (prior to v0.15.0) returned np.timedelta64 for each element; these elements are now wrapped in Timedelta.

Memory Usage

Implemented methods to find the memory usage of a DataFrame. See the FAQ for more (GH6852).

A new display option display.memory_usage (see Options and Settings) sets the default behavior of the memory_usage argument in the df.info() method. By default display.memory_usage is True.

In [31]:dtypes=['int64','float64','datetime64[ns]','timedelta64[ns]',   ....:'complex128','object','bool']   ....:In [32]:n=5000In [33]:data=dict([(t,np.random.randint(100,size=n).astype(t))   ....:fortindtypes])   ....:In [34]:df=DataFrame(data)In [35]:df['categorical']=df['object'].astype('category')In [36]:df.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 5000 entries, 0 to 4999Data columns (total 8 columns):bool               5000 non-null boolcomplex128         5000 non-null complex128datetime64[ns]     5000 non-null datetime64[ns]float64            5000 non-null float64int64              5000 non-null int64object             5000 non-null objecttimedelta64[ns]    5000 non-null timedelta64[ns]categorical        5000 non-null categorydtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)memory usage: 284.1+ KB

Additionally, memory_usage() is available as a DataFrame method, returning the memory usage of each column.

In [37]:df.memory_usage(index=True)Out[37]:Index                 72bool                5000complex128         80000datetime64[ns]     40000float64            40000int64              40000object             40000timedelta64[ns]    40000categorical         5800dtype: int64

.dt accessor

Series has gained an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series (GH7207). This returns a Series, indexed like the existing Series. See the docs.

# datetimeIn [38]:s=Series(date_range('20130101 09:10:12',periods=4))In [39]:sOut[39]:0   2013-01-01 09:10:121   2013-01-02 09:10:122   2013-01-03 09:10:123   2013-01-04 09:10:12dtype: datetime64[ns]In [40]:s.dt.hourOut[40]:0    91    92    93    9dtype: int64In [41]:s.dt.secondOut[41]:0    121    122    123    12dtype: int64In [42]:s.dt.dayOut[42]:0    11    22    33    4dtype: int64In [43]:s.dt.freqOut[43]:<Day>

This enables nice expressions like this:

In [44]: s[s.dt.day == 2]
Out[44]:
1   2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz-aware transformations:

In [45]:stz=s.dt.tz_localize('US/Eastern')In [46]:stzOut[46]:0   2013-01-01 09:10:12-05:001   2013-01-02 09:10:12-05:002   2013-01-03 09:10:12-05:003   2013-01-04 09:10:12-05:00dtype: datetime64[ns, US/Eastern]In [47]:stz.dt.tzOut[47]:<DstTzInfo'US/Eastern'LMT-1day,19:04:00STD>

You can also chain these types of operations:

In [48]:s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')Out[48]:0   2013-01-01 04:10:12-05:001   2013-01-02 04:10:12-05:002   2013-01-03 04:10:12-05:003   2013-01-04 04:10:12-05:00dtype: datetime64[ns, US/Eastern]

The .dt accessor works for period and timedelta dtypes.

# periodIn [49]:s=Series(period_range('20130101',periods=4,freq='D'))In [50]:sOut[50]:0   2013-01-011   2013-01-022   2013-01-033   2013-01-04dtype: objectIn [51]:s.dt.yearOut[51]:0    20131    20132    20133    2013dtype: int64In [52]:s.dt.dayOut[52]:0    11    22    33    4dtype: int64
# timedeltaIn [53]:s=Series(timedelta_range('1 day 00:00:05',periods=4,freq='s'))In [54]:sOut[54]:0   1 days 00:00:051   1 days 00:00:062   1 days 00:00:073   1 days 00:00:08dtype: timedelta64[ns]In [55]:s.dt.daysOut[55]:0    11    12    13    1dtype: int64In [56]:s.dt.secondsOut[56]:0    51    62    73    8dtype: int64In [57]:s.dt.componentsOut[57]:   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds0     1      0        0        5             0             0            01     1      0        0        6             0             0            02     1      0        0        7             0             0            03     1      0        0        8             0             0            0

Timezone handling improvements

  • tz_localize(None) for a tz-aware Timestamp or DatetimeIndex now removes the timezone while keeping the local time; previously this resulted in an Exception or TypeError (GH7812)

    In [58]:ts=Timestamp('2014-08-01 09:00',tz='US/Eastern')In [59]:tsOut[59]:Timestamp('2014-08-01 09:00:00-0400',tz='US/Eastern')In [60]:ts.tz_localize(None)Out[60]:Timestamp('2014-08-01 09:00:00')In [61]:didx=DatetimeIndex(start='2014-08-01 09:00',freq='H',periods=10,tz='US/Eastern')In [62]:didxOut[62]:DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00',               '2014-08-01 11:00:00-04:00', '2014-08-01 12:00:00-04:00',               '2014-08-01 13:00:00-04:00', '2014-08-01 14:00:00-04:00',               '2014-08-01 15:00:00-04:00', '2014-08-01 16:00:00-04:00',               '2014-08-01 17:00:00-04:00', '2014-08-01 18:00:00-04:00'],              dtype='datetime64[ns, US/Eastern]', freq='H')In [63]:didx.tz_localize(None)Out[63]:DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',               '2014-08-01 11:00:00', '2014-08-01 12:00:00',               '2014-08-01 13:00:00', '2014-08-01 14:00:00',               '2014-08-01 15:00:00', '2014-08-01 16:00:00',               '2014-08-01 17:00:00', '2014-08-01 18:00:00'],              dtype='datetime64[ns]', freq='H')
  • tz_localize now accepts the ambiguous keyword, which allows passing an array of bools indicating whether the date belongs in DST or not, ‘NaT’ for setting transition times to NaT, ‘infer’ for inferring DST/non-DST, and ‘raise’ (default) for an AmbiguousTimeError to be raised. See the docs for more details (GH7943)
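    A minimal sketch of the ambiguous keyword, using the US/Eastern fall-back transition of 2015-11-01 (clocks fall back at 02:00, so 01:30 wall-clock time occurs twice); the keyword is unchanged in later pandas versions:

    ```python
    import numpy as np
    import pandas as pd

    # 2015-11-01 01:30 in US/Eastern is ambiguous: it occurs once in
    # EDT (UTC-4) and once, after the fall-back, in EST (UTC-5).
    idx = pd.DatetimeIndex(['2015-11-01 01:30'])

    # 'NaT' turns ambiguous times into NaT
    localized_nat = idx.tz_localize('US/Eastern', ambiguous='NaT')

    # a bool array chooses the interpretation per element:
    # True -> DST (EDT, UTC-4), False -> standard time (EST, UTC-5)
    localized_dst = idx.tz_localize('US/Eastern', ambiguous=np.array([True]))
    localized_std = idx.tz_localize('US/Eastern', ambiguous=np.array([False]))
    ```
    
    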

  • DataFrame.tz_localize and DataFrame.tz_convert now accept an optional level argument for localizing a specific level of a MultiIndex (GH7846)

  • Timestamp.tz_localize and Timestamp.tz_convert now raise TypeError in error cases, rather than Exception (GH8025)

  • A timeseries/index localized to UTC, when inserted into a Series/DataFrame, will preserve the UTC timezone as object dtype (rather than being converted to a naive datetime64[ns]) (GH8411)

  • Timestamp.__repr__ displays dateutil.tz.tzoffset info (GH7907)

Rolling/Expanding Moments improvements

  • rolling_min(), rolling_max(), rolling_cov(), and rolling_corr() now return objects with all NaN when len(arg) < min_periods <= window rather than raising. (This makes all rolling functions consistent in this behavior.) (GH7766)

    Prior to 0.15.0

    In [64]: s = Series([10, 11, 12, 13])

    In [15]: rolling_min(s, window=10, min_periods=5)
    ValueError: min_periods (5) must be <= window (4)

    New behavior

    In [4]: pd.rolling_min(s, window=10, min_periods=5)
    Out[4]:
    0   NaN
    1   NaN
    2   NaN
    3   NaN
    dtype: float64
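    The rolling_* functions were replaced by the .rolling accessor in later pandas versions; as a sketch, the same all-NaN behavior under that later API (an assumption about your installed pandas, not part of 0.15.0 itself):

    ```python
    import pandas as pd

    s = pd.Series([10, 11, 12, 13])

    # len(s) < min_periods <= window: every entry is NaN instead of raising
    result = s.rolling(window=10, min_periods=5).min()
    ```
    
    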
  • rolling_max(), rolling_min(), rolling_sum(), rolling_mean(), rolling_median(), rolling_std(), rolling_var(), rolling_skew(), rolling_kurt(), rolling_quantile(), rolling_cov(), rolling_corr(), rolling_corr_pairwise(), rolling_window(), and rolling_apply() with center=True previously would return a result of the same structure as the input arg with NaN in the final (window-1)/2 entries.

    Now the final (window-1)/2 entries of the result are calculated as if the input arg were followed by (window-1)/2 NaN values (or with shrinking windows, in the case of rolling_apply()). (GH7925, GH8269)

    Prior behavior (note the final value is NaN):

    In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
    Out[7]:
    0     1
    1     3
    2     6
    3   NaN
    dtype: float64

    New behavior (note the final value is 5 = sum([2, 3, NaN])):

    In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
    Out[7]:
    0    1
    1    3
    2    6
    3    5
    dtype: float64
  • rolling_window() now normalizes the weights properly in rolling mean mode (mean=True) so that the calculated weighted means (e.g. ‘triang’, ‘gaussian’) are distributed about the same mean as those calculated without weighting (i.e. ‘boxcar’). See the note on normalization for further details. (GH7618)

    In [65]: s = Series([10.5, 8.8, 11.4, 9.7, 9.3])

    Behavior prior to 0.15.0:

    In [39]: rolling_window(s, window=3, win_type='triang', center=True)
    Out[39]:
    0         NaN
    1    6.583333
    2    6.883333
    3    6.683333
    4         NaN
    dtype: float64

    New behavior

    In [10]: pd.rolling_window(s, window=3, win_type='triang', center=True)
    Out[10]:
    0       NaN
    1     9.875
    2    10.325
    3    10.025
    4       NaN
    dtype: float64
  • Removed the center argument from all expanding_ functions (see list), as the results produced when center=True did not make much sense. (GH7925)

  • Added an optional ddof argument to expanding_cov() and rolling_cov(). The default value of 1 is backwards-compatible. (GH8279)

  • Documented the ddof argument to expanding_var(), expanding_std(), rolling_var(), and rolling_std(). These functions’ support of a ddof argument (with a default value of 1) was previously undocumented. (GH8064)

  • ewma(), ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now interpret min_periods in the same manner that the rolling_*() and expanding_*() functions do: a given result entry will be NaN if the (expanding, in this case) window does not contain at least min_periods values. The previous behavior was to set to NaN the min_periods entries starting with the first non-NaN value. (GH7977)

    Prior behavior (note values start at index 2, which is min_periods after index 0, the index of the first non-empty value):

    In [66]: s = Series([1, None, None, None, 2, 3])

    In [51]: ewma(s, com=3., min_periods=2)
    Out[51]:
    0         NaN
    1         NaN
    2    1.000000
    3    1.000000
    4    1.571429
    5    2.189189
    dtype: float64

    New behavior (note values start at index 4, the location of the 2nd (since min_periods=2) non-empty value):

    In [2]: pd.ewma(s, com=3., min_periods=2)
    Out[2]:
    0         NaN
    1         NaN
    2         NaN
    3         NaN
    4    1.759644
    5    2.383784
    dtype: float64
  • ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now have an optional adjust argument, just like ewma() does, affecting how the weights are calculated. The default value of adjust is True, which is backwards-compatible. See Exponentially weighted moment functions for details. (GH7911)

  • ewma(), ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now have an optional ignore_na argument. When ignore_na=False (the default), missing values are taken into account in the weights calculation. When ignore_na=True (which reproduces the pre-0.15.0 behavior), missing values are ignored in the weights calculation. (GH7543)

    In [7]: pd.ewma(Series([None, 1., 8.]), com=2.)
    Out[7]:
    0    NaN
    1    1.0
    2    5.2
    dtype: float64

    In [8]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=True)  # pre-0.15.0 behavior
    Out[8]:
    0    1.0
    1    1.0
    2    5.2
    dtype: float64

    In [9]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=False)  # new default
    Out[9]:
    0    1.000000
    1    1.000000
    2    5.846154
    dtype: float64

    Warning

    By default (ignore_na=False) the ewm*() functions’ weights calculation in the presence of missing values is different than in pre-0.15.0 versions. To reproduce the pre-0.15.0 calculation of weights in the presence of missing values, one must specify ignore_na=True explicitly.
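    The ewm*() functions were folded into the .ewm accessor in later pandas versions, which kept the ignore_na switch; a sketch of the comparison above under that later API (an assumption about your installed pandas):

    ```python
    import pandas as pd

    s = pd.Series([1.0, None, 8.0])

    # ignore_na=True reproduces the pre-0.15.0 weighting (the gap is ignored);
    # ignore_na=False (the default) counts the missing position in the weights
    pre_0150_style = s.ewm(com=2.0, ignore_na=True).mean()
    new_default = s.ewm(com=2.0, ignore_na=False).mean()
    ```
    
    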

  • Bug in expanding_cov(), expanding_corr(), rolling_cov(), rolling_corr(), ewmcov(), and ewmcorr() returning results with columns sorted by name and producing an error for non-unique columns; now handles non-unique columns and returns columns in original order (except for the case of two DataFrames with pairwise=False, where behavior is unchanged) (GH7542)

  • Bug in rolling_count() and expanding_*() functions unnecessarily producing an error message for zero-length data (GH8056)

  • Bug in rolling_apply() and expanding_apply() interpreting min_periods=0 as min_periods=1 (GH8080)

  • Bug in expanding_std() and expanding_var() for a single value producing a confusing error message (GH7900)

  • Bug in rolling_std() and rolling_var() for a single value producing 0 rather than NaN (GH7900)

  • Bug in ewmstd(), ewmvol(), ewmvar(), and ewmcov() calculation of de-biasing factors when bias=False (the default). Previously an incorrect constant factor was used, based on adjust=True, ignore_na=True, and an infinite number of observations. Now a different factor is used for each entry, based on the actual weights (analogous to the usual N/(N-1) factor). In particular, for a single point a value of NaN is returned when bias=False, whereas previously a value of (approximately) 0 was returned.

    For example, consider the following pre-0.15.0 results for ewmvar(..., bias=False), and the corresponding debiasing factors:

    In [67]: s = Series([1., 2., 0., 4.])

    In [89]: ewmvar(s, com=2., bias=False)
    Out[89]:
    0   -2.775558e-16
    1    3.000000e-01
    2    9.556787e-01
    3    3.585799e+00
    dtype: float64

    In [90]: ewmvar(s, com=2., bias=False) / ewmvar(s, com=2., bias=True)
    Out[90]:
    0    1.25
    1    1.25
    2    1.25
    3    1.25
    dtype: float64

    Note that entry 0 is approximately 0, and the debiasing factors are a constant 1.25. By comparison, the following 0.15.0 results have a NaN for entry 0, and the debiasing factors are decreasing (towards 1.25):

    In [14]: pd.ewmvar(s, com=2., bias=False)
    Out[14]:
    0         NaN
    1    0.500000
    2    1.210526
    3    4.089069
    dtype: float64

    In [15]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True)
    Out[15]:
    0         NaN
    1    2.083333
    2    1.583333
    3    1.425439
    dtype: float64

    See Exponentially weighted moment functions for details. (GH7912)

Improvements in the sql io module

  • Added support for a chunksize parameter to the to_sql function. This allows a DataFrame to be written in chunks, avoiding packet-size overflow errors (GH8062).

  • Added support for a chunksize parameter to the read_sql function. Specifying this argument will return an iterator over chunks of the query result (GH2908).
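    A minimal sketch of chunked writing and reading, assuming an in-memory SQLite connection (a SQLAlchemy engine works the same way):

    ```python
    import sqlite3

    import pandas as pd

    conn = sqlite3.connect(':memory:')
    df = pd.DataFrame({'a': range(5)})

    # write two rows per INSERT batch
    df.to_sql('demo', conn, index=False, chunksize=2)

    # read back as an iterator of 2-row DataFrames
    chunks = list(pd.read_sql('SELECT * FROM demo', conn, chunksize=2))
    total_rows = sum(len(chunk) for chunk in chunks)
    ```
    
    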

  • Added support for writing datetime.date and datetime.time object columns with to_sql (GH6932).

  • Added support for specifying a schema to read from/write to with read_sql_table and to_sql (GH7441, GH7952). For example:

    df.to_sql('table', engine, schema='other_schema')
    pd.read_sql_table('table', engine, schema='other_schema')

  • Added support for writing NaN values with to_sql (GH2754).

  • Added support for writing datetime64 columns with to_sql for all database flavors (GH7103).

Backwards incompatible API changes

Breaking changes

API changes related to Categorical (see here for more details):

  • The Categorical constructor with two arguments changed from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use Categorical directly, please audit your code by changing it to use the from_codes() constructor.

    An old function call like (prior to 0.15.0):

    pd.Categorical([0, 1, 0, 2, 1], levels=['a', 'b', 'c'])

    will have to be adapted to the following to keep the same behaviour:

    In [2]: pd.Categorical.from_codes([0, 1, 0, 2, 1], categories=['a', 'b', 'c'])
    Out[2]:
    [a, b, a, c, b]
    Categories (3, object): [a, b, c]

API changes related to the introduction of the Timedelta scalar (see above for more details):

  • Prior to 0.15.0, to_timedelta() would return a Series for list-like/Series input, and an np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, a Series for Series input, and a Timedelta for scalar input.
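    A short sketch of the input-dependent return types:

    ```python
    import pandas as pd

    scalar_result = pd.to_timedelta('1 days')                 # Timedelta
    listlike_result = pd.to_timedelta(['1 days', '2 days'])   # TimedeltaIndex
    series_result = pd.to_timedelta(pd.Series(['1 days']))    # Series
    ```
    
    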

For API changes related to the rolling and expanding functions, see the detailed overview above.

Other notable API changes:

  • Consistency when indexing with .loc and a list-like indexer when no values are found.

    In [68]: df = DataFrame([['a'], ['b']], index=[1, 2])

    In [69]: df
    Out[69]:
       0
    1  a
    2  b

    In prior versions there was a difference in these two constructs:

    • df.loc[[3]] would return a frame reindexed by 3 (with all np.nan values)
    • df.loc[[3],:] would raise KeyError.

    Both will now raise a KeyError. The rule is that at least one label must be found when using a list-like indexer with .loc (GH7999)
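    A sketch of the rule (redefining the small frame so the snippet is self-contained):

    ```python
    import pandas as pd

    df = pd.DataFrame([['a'], ['b']], index=[1, 2])

    # none of the requested labels exist, so list-like .loc raises KeyError
    try:
        df.loc[[3]]
        missing_raised = False
    except KeyError:
        missing_raised = True
    ```
    
    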

    Furthermore in prior versions these were also different:

    • df.loc[[1,3]] would return a frame reindexed by [1,3]
    • df.loc[[1,3],:] would raise KeyError.

    Both will now return a frame reindexed by [1,3]. E.g.

    In [70]: df.loc[[1, 3]]
    Out[70]:
         0
    1    a
    3  NaN

    In [71]: df.loc[[1, 3], :]
    Out[71]:
         0
    1    a
    3  NaN

    This can also be seen in multi-axis indexing with a Panel.

    In [72]:p=Panel(np.arange(2*3*4).reshape(2,3,4),   ....:items=['ItemA','ItemB'],   ....:major_axis=[1,2,3],   ....:minor_axis=['A','B','C','D'])   ....:In [73]:pOut[73]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)Items axis: ItemA to ItemBMajor_axis axis: 1 to 3Minor_axis axis: A to D

    The following would raise KeyError prior to 0.15.0:

    In [74]:p.loc[['ItemA','ItemD'],:,'D']Out[74]:   ItemA  ItemD1      3    NaN2      7    NaN3     11    NaN

    Furthermore, .loc will raise if no values are found in a MultiIndex with a list-like indexer:

    In [75]:s=Series(np.arange(3,dtype='int64'),   ....:index=MultiIndex.from_product([['A'],['foo','bar','baz']],   ....:names=['one','two'])   ....:).sortlevel()   ....:In [76]:sOut[76]:one  twoA    bar    1     baz    2     foo    0dtype: int64In [77]:try:   ....:s.loc[['D']]   ....:exceptKeyErrorase:   ....:print("KeyError: "+str(e))   ....:KeyError: 'cannot index a multi-index axis with these keys'
  • Assigning values to None now considers the dtype when choosing an ‘empty’ value (GH7941).

    Previously, assigning None in numeric containers changed the dtype to object (or errored, depending on the call). It now uses NaN:

    In [78]: s = Series([1, 2, 3])

    In [79]: s.loc[0] = None

    In [80]: s
    Out[80]:
    0    NaN
    1    2.0
    2    3.0
    dtype: float64

    NaT is now used similarly for datetime containers.

    For object containers, we now preserve None values (previously these were converted to NaN values).

    In [81]: s = Series(["a", "b", "c"])

    In [82]: s.loc[0] = None

    In [83]: s
    Out[83]:
    0    None
    1       b
    2       c
    dtype: object

    To insert a NaN, you must explicitly use np.nan. See the docs.

  • In prior versions, updating a pandas object inplace would not be reflected in other Python references to this object (GH8511, GH5104)

    In [84]: s = Series([1, 2, 3])

    In [85]: s2 = s

    In [86]: s += 1.5

    Behavior prior to v0.15.0

    # the original object
    In [5]: s
    Out[5]:
    0    2.5
    1    3.5
    2    4.5
    dtype: float64

    # a reference to the original object
    In [7]: s2
    Out[7]:
    0    1
    1    2
    2    3
    dtype: int64

    This is now the correct behavior

    # the original object
    In [87]: s
    Out[87]:
    0    2.5
    1    3.5
    2    4.5
    dtype: float64

    # a reference to the original object
    In [88]: s2
    Out[88]:
    0    2.5
    1    3.5
    2    4.5
    dtype: float64
  • Made both the C-based and Python engines for read_csv and read_table ignore empty lines in input as well as whitespace-filled lines, as long as sep is not whitespace. This is an API change that can be controlled by the keyword parameter skip_blank_lines. See the docs (GH4466)

  • A timeseries/index localized to UTC, when inserted into a Series/DataFrame, will preserve the UTC timezone and be inserted as object dtype rather than being converted to a naive datetime64[ns] (GH8411).

  • Bug in passing a DatetimeIndex with a timezone that was not being retained in DataFrame construction from a dict (GH7822)

    In prior versions this would drop the timezone; now it retains the timezone but gives a column of object dtype:

    In [89]:i=date_range('1/1/2011',periods=3,freq='10s',tz='US/Eastern')In [90]:iOut[90]:DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-01 00:00:10-05:00',               '2011-01-01 00:00:20-05:00'],              dtype='datetime64[ns, US/Eastern]', freq='10S')In [91]:df=DataFrame({'a':i})In [92]:dfOut[92]:                          a0 2011-01-01 00:00:00-05:001 2011-01-01 00:00:10-05:002 2011-01-01 00:00:20-05:00In [93]:df.dtypesOut[93]:a    datetime64[ns, US/Eastern]dtype: object

    Previously this would have yielded a column of datetime64 dtype, but without timezone info.

    The behaviour of assigning a column to an existing DataFrame as df['a'] = i remains unchanged (this already returned an object column with a timezone).

  • When passing multiple levels to stack(), it will now raise a ValueError when the levels aren’t all level names or all level numbers (GH7660). See Reshaping by stacking and unstacking.

  • Raise a ValueError in df.to_hdf with ‘fixed’ format if df has non-unique columns, as the resulting file will be broken (GH7761)

  • SettingWithCopy raise/warnings (according to the option mode.chained_assignment) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment (GH7845, GH7950)

    In [1]: df = DataFrame(np.arange(0, 9), columns=['count'])

    In [2]: df['group'] = 'b'

    In [3]: df.iloc[0:5]['group'] = 'a'
    /usr/local/bin/ipython:1: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  • merge, DataFrame.merge, and ordered_merge now return the same type as the left argument (GH7737).

  • Previously, an enlargement with a mixed-dtype frame would act unlike .append, which preserves dtypes (related GH2578, GH8176):

    In [94]:df=DataFrame([[True,1],[False,2]],   ....:columns=["female","fitness"])   ....:In [95]:dfOut[95]:  female  fitness0   True        11  False        2In [96]:df.dtypesOut[96]:female      boolfitness    int64dtype: object# dtypes are now preservedIn [97]:df.loc[2]=df.loc[1]In [98]:dfOut[98]:  female  fitness0   True        11  False        22  False        2In [99]:df.dtypesOut[99]:female      boolfitness    int64dtype: object
  • Series.to_csv() now returns a string when path=None, matching the behaviour of DataFrame.to_csv() (GH8215).
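    A minimal sketch (the default path has been None since this release):

    ```python
    import pandas as pd

    s = pd.Series([1, 2, 3])

    # with no path given, the CSV text is returned as a string
    csv_text = s.to_csv()
    ```
    
    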

  • read_hdf now raises IOError when a file that doesn’t exist is passed in. Previously, a new, empty file was created and a KeyError raised (GH7715).

  • DataFrame.info() now ends its output with a newline character (GH8114)

  • Concatenating no objects will now raise a ValueError rather than a bare Exception.

  • Merge errors will now be sub-classes of ValueError rather than raw Exception (GH8501)

  • DataFrame.plot and Series.plot keywords now have consistent orders (GH8037)

Internal Refactoring

In 0.15.0, Index has internally been refactored to no longer subclass ndarray but instead subclass PandasObject, similar to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (GH5080, GH7439, GH7796, GH8024, GH8367, GH7997, GH8522):

  • you may need to unpickle pandas version < 0.15.0 pickles using pd.read_pickle rather than pickle.load. See the pickle docs
  • when plotting with a PeriodIndex, the matplotlib internal axes will now be arrays of Period rather than a PeriodIndex (this is similar to how a DatetimeIndex now passes arrays of datetimes)
  • MultiIndexes will now raise similarly to other pandas objects w.r.t. truth testing, see here (GH7897).
  • When plotting a DatetimeIndex directly with matplotlib’s plot function, the axis labels will no longer be formatted as dates but as integers (the internal representation of a datetime64). UPDATE: this is fixed in 0.15.1, see here.

Deprecations

  • The Categorical attributes labels and levels are deprecated and renamed to codes and categories.
  • The outtype argument to pd.DataFrame.to_dict has been deprecated in favor of orient. (GH7840)
  • The convert_dummies method has been deprecated in favor of get_dummies (GH8140)
  • The infer_dst argument in tz_localize will be deprecated in favor of ambiguous, to allow more flexibility in dealing with DST transitions. Replace infer_dst=True with ambiguous='infer' for the same behavior (GH7943). See the docs for more details.
  • The top-level pd.value_range has been deprecated and can be replaced by .describe() (GH8481)
  • The Index set operations + and - were deprecated in order to provide these for numeric type operations on certain index types. + can be replaced by .union() or |, and - by .difference(). Further, the method name Index.diff() is deprecated and can be replaced by Index.difference() (GH8226)

    # +
    Index(['a', 'b', 'c']) + Index(['b', 'c', 'd'])

    # should be replaced by
    Index(['a', 'b', 'c']).union(Index(['b', 'c', 'd']))
    # -
    Index(['a', 'b', 'c']) - Index(['b', 'c', 'd'])

    # should be replaced by
    Index(['a', 'b', 'c']).difference(Index(['b', 'c', 'd']))
  • The infer_types argument to read_html() now has no effect and is deprecated (GH7762, GH7032).

Removal of prior version deprecations/changes

  • Removed the DataFrame.delevel method in favor of DataFrame.reset_index

Enhancements

Enhancements in the importing/exporting of Stata files:

  • Added support for bool, uint8, uint16 and uint32 datatypes in to_stata (GH7097, GH7365)
  • Added conversion option when importing Stata files (GH8527)
  • DataFrame.to_stata and StataWriter check string length for compatibility with limitations imposed in dta files, where fixed-width strings must contain 244 or fewer characters. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError. (GH7858)
  • read_stata and StataReader can import missing data information into a DataFrame by setting the argument convert_missing to True. When using this option, missing values are returned as StataMissingValue objects, and columns containing missing values have object data type. (GH8045)

Enhancements in the plotting functions:

  • Added a layout keyword to DataFrame.plot. You can pass a tuple of (rows, columns), one of which can be -1 to automatically infer (GH6667, GH8071).
  • Allow passing multiple axes to DataFrame.plot, hist and boxplot (GH5353, GH6970, GH7069)
  • Added support for c, colormap and colorbar arguments for DataFrame.plot with kind='scatter' (GH7780)
  • Histogram from DataFrame.plot with kind='hist' (GH7809). See the docs.
  • Boxplot from DataFrame.plot with kind='box' (GH7998). See the docs.

Other:

  • read_csv now has a keyword parameter float_precision which specifies which floating-point converter the C engine should use during parsing; see here (GH8002, GH8044)
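    A minimal sketch, using values that are exactly representable so the choice of converter does not change the parsed result:

    ```python
    from io import StringIO

    import pandas as pd

    data = "x\n1.5\n2.25\n"

    # float_precision selects the C engine's float converter:
    # None (default), 'high', or 'round_trip'
    df = pd.read_csv(StringIO(data), float_precision='round_trip')
    ```
    
    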

  • Added a searchsorted method to Series objects (GH7447)
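    A minimal sketch:

    ```python
    import pandas as pd

    s = pd.Series([1, 2, 3, 5])

    # position at which 4 would be inserted to keep the values sorted
    pos = s.searchsorted(4)
    ```
    
    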

  • describe() on mixed-type DataFrames is more flexible. Type-based column filtering is now possible via the include/exclude arguments. See the docs (GH8164).

    In [100]:df=DataFrame({'catA':['foo','foo','bar']*8,   .....:'catB':['a','b','c','d']*6,   .....:'numC':np.arange(24),   .....:'numD':np.arange(24.)+.5})   .....:In [101]:df.describe(include=["object"])Out[101]:       catA catBcount    24   24unique    2    4top     foo    dfreq     16    6In [102]:df.describe(include=["number","object"],exclude=["float"])Out[102]:       catA catB       numCcount    24   24  24.000000unique    2    4        NaNtop     foo    d        NaNfreq     16    6        NaNmean    NaN  NaN  11.500000std     NaN  NaN   7.071068min     NaN  NaN   0.00000025%     NaN  NaN   5.75000050%     NaN  NaN  11.50000075%     NaN  NaN  17.250000max     NaN  NaN  23.000000

    Requesting all columns is possible with the shorthand ‘all’:

    In [103]:df.describe(include='all')Out[103]:       catA catB       numC       numDcount    24   24  24.000000  24.000000unique    2    4        NaN        NaNtop     foo    d        NaN        NaNfreq     16    6        NaN        NaNmean    NaN  NaN  11.500000  12.000000std     NaN  NaN   7.071068   7.071068min     NaN  NaN   0.000000   0.50000025%     NaN  NaN   5.750000   6.25000050%     NaN  NaN  11.500000  12.00000075%     NaN  NaN  17.250000  17.750000max     NaN  NaN  23.000000  23.500000

    Without those arguments, describe will behave as before, including only numerical columns or, if there are none, only categorical columns. See also the docs.

  • Added split as an option to the orient argument in pd.DataFrame.to_dict. (GH7840)

  • The get_dummies method can now be used on DataFrames. By default only categorical columns are encoded as 0’s and 1’s, while other columns are left untouched.

    In [104]:df=DataFrame({'A':['a','b','a'],'B':['c','c','b'],   .....:'C':[1,2,3]})   .....:In [105]:pd.get_dummies(df)Out[105]:   C  A_a  A_b  B_b  B_c0  1    1    0    0    11  2    0    1    0    12  3    1    0    1    0
  • PeriodIndex supports resolution the same as DatetimeIndex (GH7708)

  • pandas.tseries.holiday has added support for additional holidays and ways to observe holidays (GH7070)

  • pandas.tseries.holiday.Holiday now supports a list of offsets in Python 3 (GH7070)

  • pandas.tseries.holiday.Holiday now supports a days_of_week parameter (GH7070)

  • GroupBy.nth() now supports selecting multiple nth values (GH7910)

    In [106]:business_dates=date_range(start='4/1/2014',end='6/30/2014',freq='B')In [107]:df=DataFrame(1,index=business_dates,columns=['a','b'])# get the first, 4th, and last date index for each monthIn [108]:df.groupby((df.index.year,df.index.month)).nth([0,3,-1])Out[108]:        a  b2014 4  1  1     4  1  1     4  1  1     5  1  1     5  1  1     5  1  1     6  1  1     6  1  1     6  1  1
  • Period and PeriodIndex support addition/subtraction with timedelta-likes (GH7966)

    If the Period freq is D, H, T, S, L, U or N, a Timedelta-like can be added if the result can have the same freq. Otherwise, only the same offsets can be added.

    In [109]:idx=pd.period_range('2014-07-01 09:00',periods=5,freq='H')In [110]:idxOut[110]:PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',             '2014-07-01 12:00', '2014-07-01 13:00'],            dtype='period[H]', freq='H')In [111]:idx+pd.offsets.Hour(2)Out[111]:PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',             '2014-07-01 14:00', '2014-07-01 15:00'],            dtype='period[H]', freq='H')In [112]:idx+Timedelta('120m')Out[112]:PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',             '2014-07-01 14:00', '2014-07-01 15:00'],            dtype='period[H]', freq='H')In [113]:idx=pd.period_range('2014-07',periods=5,freq='M')In [114]:idxOut[114]:PeriodIndex(['2014-07','2014-08','2014-09','2014-10','2014-11'],dtype='period[M]',freq='M')In [115]:idx+pd.offsets.MonthEnd(3)Out[115]:PeriodIndex(['2014-10','2014-11','2014-12','2015-01','2015-02'],dtype='period[M]',freq='M')
  • Added experimental compatibility with openpyxl for versions >= 2.0. The DataFrame.to_excel method’s engine keyword now recognizes openpyxl1 and openpyxl2, which will explicitly require openpyxl v1 and v2 respectively, failing if the requested version is not available. The openpyxl engine is now a meta-engine that automatically uses whichever version of openpyxl is installed. (GH7177)

  • DataFrame.fillna can now accept a DataFrame as a fill value (GH8377)
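    A minimal sketch; NaNs are filled from the aligned cells of the fill frame:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
    fill = pd.DataFrame({'a': [10.0, 20.0], 'b': [30.0, 40.0]})

    # only the missing cells are replaced, by label alignment
    filled = df.fillna(fill)
    ```
    
    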

  • Passing multiple levels to stack() will now work when multiple level numbers are passed (GH7660). See Reshaping by stacking and unstacking.

  • The set_names(), set_labels(), and set_levels() methods now take an optional level keyword argument to allow modification of specific level(s) of a MultiIndex. Additionally set_names() now accepts a scalar string value when operating on an Index or on a specific level of a MultiIndex (GH7792)

    In [116]: idx = MultiIndex.from_product([['a'], range(3), list("pqr")], names=['foo', 'bar', 'baz'])

    In [117]: idx.set_names('qux', level=0)
    Out[117]:
    MultiIndex(levels=[[u'a'], [0, 1, 2], [u'p', u'q', u'r']],
               labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
               names=[u'qux', u'bar', u'baz'])

    In [118]: idx.set_names(['qux', 'baz'], level=[0, 1])
    Out[118]:
    MultiIndex(levels=[[u'a'], [0, 1, 2], [u'p', u'q', u'r']],
               labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
               names=[u'qux', u'baz', u'baz'])

    In [119]: idx.set_levels(['a', 'b', 'c'], level='bar')
    Out[119]:
    MultiIndex(levels=[[u'a'], [u'a', u'b', u'c'], [u'p', u'q', u'r']],
               labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
               names=[u'foo', u'bar', u'baz'])

    In [120]: idx.set_levels([['a', 'b', 'c'], [1, 2, 3]], level=[1, 2])
    Out[120]:
    MultiIndex(levels=[[u'a'], [u'a', u'b', u'c'], [1, 2, 3]],
               labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
               names=[u'foo', u'bar', u'baz'])
  • Index.isin now supports a level argument to specify which index level to use for membership tests (GH7892, GH7890)

    In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])

    In [2]: idx.values
    Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object)

    In [3]: idx.isin(['a', 'c', 'e'], level=1)
    Out[3]: array([ True, False,  True,  True, False,  True], dtype=bool)
  • Index now supports duplicated and drop_duplicates. (GH4060)

    In [121]: idx = Index([1, 2, 3, 4, 1, 2])

    In [122]: idx
    Out[122]: Int64Index([1, 2, 3, 4, 1, 2], dtype='int64')

    In [123]: idx.duplicated()
    Out[123]: array([False, False, False, False,  True,  True], dtype=bool)

    In [124]: idx.drop_duplicates()
    Out[124]: Int64Index([1, 2, 3, 4], dtype='int64')
  • Added a copy=True argument to pd.concat to enable pass-through of complete blocks (GH8252)

  • Added support for numpy 1.8+ data types (bool_, int_, float_, string_) for conversion to R dataframe (GH8400)

Performance

  • Performance improvements in DatetimeIndex.__iter__ to allow faster iteration (GH7683)
  • Performance improvements in Period creation (and PeriodIndex setitem) (GH5155)
  • Improvements in Series.transform for significant performance gains (revised) (GH6496)
  • Performance improvements in StataReader when reading large files (GH8040, GH8073)
  • Performance improvements in StataWriter when writing large files (GH8079)
  • Performance and memory usage improvements in multi-key groupby (GH8128)
  • Performance improvements in groupby .agg and .apply where builtins max/min were not mapped to numpy/cythonized versions (GH7722)
  • Performance improvement in writing to sql (to_sql) of up to 50% (GH8208).
  • Performance benchmarking of groupby for large values of ngroups (GH6787)
  • Performance improvement in CustomBusinessDay, CustomBusinessMonth (GH8236)
  • Performance improvement for MultiIndex.values for multi-level indexes containing datetimes (GH8543)

Bug Fixes

  • Bug in pivot_table, when using margins and a dict aggfunc (GH8349)
  • Bug inread_csv wheresqueeze=True would return a view (GH8217)
  • Bug in checking of table name inread_sql in certain cases (GH7826).
  • Bug inDataFrame.groupby whereGrouper does not recognize level when frequency is specified (GH7885)
  • Bug in multiindexes dtypes getting mixed up when DataFrame is saved to SQL table (GH8021)
  • Bug inSeries 0-division with a float and integer operand dtypes (GH7785)
  • Bug inSeries.astype("unicode") not callingunicode on the values correctly (GH7758)
  • Bug inDataFrame.as_matrix() with mixeddatetime64[ns] andtimedelta64[ns] dtypes (GH7778)
  • Bug inHDFStore.select_column() not preserving UTC timezone info when selecting aDatetimeIndex (GH7777)
  • Bug in to_datetime when format='%Y%m%d' and coerce=True are specified, where previously an object array was returned (rather than a coerced time-series with NaT) (GH7930)
  • Bug inDatetimeIndex andPeriodIndex in-place addition and subtraction cause different result from normal one (GH6527)
  • Bug in adding and subtractingPeriodIndex withPeriodIndex raiseTypeError (GH7741)
  • Bug incombine_first withPeriodIndex data raisesTypeError (GH3367)
  • Bug in multi-index slicing with missing indexers (GH7866)
  • Bug in multi-index slicing with various edge cases (GH8132)
  • Regression in multi-index indexing with a non-scalar type object (GH7914)
  • Bug inTimestamp comparisons with== andint64 dtype (GH8058)
  • Bug in pickles containing DateOffset may raise AttributeError when the normalize attribute is referred to internally (GH7748)
  • Bug inPanel when usingmajor_xs andcopy=False is passed (deprecation warning fails because of missingwarnings) (GH8152).
  • Bug in pickle deserialization that failed for pre-0.14.1 containers with dup items trying to avoid ambiguity when matching block and manager items; when there's only one block there's no ambiguity (GH7794)
  • Bug in putting aPeriodIndex into aSeries would convert toint64 dtype, rather thanobject ofPeriods (GH7932)
  • Bug inHDFStore iteration when passing a where (GH8014)
  • Bug inDataFrameGroupby.transform when transforming with a passed non-sorted key (GH8046,GH8430)
  • Bug in repeated timeseries line and area plot may result inValueError or incorrect kind (GH7733)
  • Bug in inference in aMultiIndex withdatetime.date inputs (GH7888)
  • Bug inget where anIndexError would not cause the default value to be returned (GH7725)
  • Bug inoffsets.apply,rollforward androllback may reset nanosecond (GH7697)
  • Bug inoffsets.apply,rollforward androllback may raiseAttributeError ifTimestamp hasdateutil tzinfo (GH7697)
  • Bug in sorting a multi-index frame with aFloat64Index (GH8017)
  • Bug in inconsistent panel setitem with a rhs of aDataFrame for alignment (GH7763)
  • Bug inis_superperiod andis_subperiod cannot handle higher frequencies thanS (GH7760,GH7772,GH7803)
  • Bug in 32-bit platforms withSeries.shift (GH8129)
  • Bug in PeriodIndex.unique returns an int64 np.ndarray (GH7540)
  • Bug ingroupby.apply with a non-affecting mutation in the function (GH8467)
  • Bug inDataFrame.reset_index which hasMultiIndex containsPeriodIndex orDatetimeIndex with tz raisesValueError (GH7746,GH7793)
  • Bug inDataFrame.plot withsubplots=True may draw unnecessary minor xticks and yticks (GH7801)
  • Bug inStataReader which did not read variable labels in 117 files due to difference between Stata documentation and implementation (GH7816)
  • Bug inStataReader where strings were always converted to 244 characters-fixed width irrespective of underlying string size (GH7858)
  • Bug inDataFrame.plot andSeries.plot may ignorerot andfontsize keywords (GH7844)
  • Bug inDatetimeIndex.value_counts doesn’t preserve tz (GH7735)
  • Bug inPeriodIndex.value_counts results inInt64Index (GH7735)
  • Bug inDataFrame.join when doing left join on index and there are multiple matches (GH5391)
  • Bug inGroupBy.transform() where int groups with a transform thatdidn’t preserve the index were incorrectly truncated (GH7972).
  • Bug ingroupby where callable objects without name attributes would take the wrong path,and produce aDataFrame instead of aSeries (GH7929)
  • Bug ingroupby error message when a DataFrame grouping column is duplicated (GH7511)
  • Bug inread_html where theinfer_types argument forced coercion ofdate-likes incorrectly (GH7762,GH7032).
  • Bug inSeries.str.cat with an index which was filtered as to not include the first item (GH7857)
  • Bug inTimestamp cannot parsenanosecond from string (GH7878)
  • Bug inTimestamp with string offset andtz results incorrect (GH7833)
  • Bug intslib.tz_convert andtslib.tz_convert_single may return different results (GH7798)
  • Bug inDatetimeIndex.intersection of non-overlapping timestamps with tz raisesIndexError (GH7880)
  • Bug in alignment with TimeOps and non-unique indexes (GH8363)
  • Bug inGroupBy.filter() where fast path vs. slow path made the filterreturn a non scalar value that appeared valid but wasn’t (GH7870).
  • Bug indate_range()/DatetimeIndex() when the timezone was inferred from input dates yet incorrecttimes were returned when crossing DST boundaries (GH7835,GH7901).
  • Bug into_excel() where a negative sign was being prepended to positive infinity and was absent for negative infinity (GH7949)
  • Bug in area plot draws legend with incorrectalpha whenstacked=True (GH8027)
  • Period andPeriodIndex addition/subtraction withnp.timedelta64 results in incorrect internal representations (GH7740)
  • Bug inHoliday with no offset or observance (GH7987)
  • Bug inDataFrame.to_latex formatting when columns or index is aMultiIndex (GH7982).
  • Bug inDateOffset around Daylight Savings Time produces unexpected results (GH5175).
  • Bug inDataFrame.shift where empty columns would throwZeroDivisionError on numpy 1.7 (GH8019)
  • Bug in installation wherehtml_encoding/*.html wasn’t installed andtherefore some tests were not running correctly (GH7927).
  • Bug inread_html wherebytes objects were not tested for in_read (GH7927).
  • Bug inDataFrame.stack() when one of the column levels was a datelike (GH8039)
  • Bug in broadcasting numpy scalars withDataFrame (GH8116)
  • Bug inpivot_table performed with namelessindex andcolumns raisesKeyError (GH8103)
  • Bug inDataFrame.plot(kind='scatter') draws points and errorbars with different colors when the color is specified byc keyword (GH8081)
  • Bug inFloat64Index whereiat andat were not testing and werefailing (GH8092).
  • Bug inDataFrame.boxplot() where y-limits were not set correctly whenproducing multiple axes (GH7528,GH5517).
  • Bug inread_csv where line comments were not handled correctly givena custom line terminator ordelim_whitespace=True (GH8122).
  • Bug inread_html where empty tables caused aStopIteration (GH7575)
  • Bug in casting when setting a column in a same-dtype block (GH7704)
  • Bug in accessing groups from aGroupBy when the original grouperwas a tuple (GH8121).
  • Bug in.at that would accept integer indexers on a non-integer index and do fallback (GH7814)
  • Bug with kde plot and NaNs (GH8182)
  • Bug in GroupBy.count with float32 data type where NaN values were not excluded (GH8169).
  • Bug with stacked barplots and NaNs (GH8175).
  • Bug in resample with non evenly divisible offsets (e.g. ‘7s’) (GH8371)
  • Bug in interpolation methods with thelimit keyword when no values needed interpolating (GH7173).
  • Bug wherecol_space was ignored inDataFrame.to_string() whenheader=False (GH8230).
  • Bug withDatetimeIndex.asof incorrectly matching partial strings and returning the wrong date (GH8245).
  • Bug in plotting methods modifying the global matplotlib rcParams (GH8242).
  • Bug inDataFrame.__setitem__ that caused errors when setting a dataframe column to a sparse array (GH8131)
  • Bug whereDataframe.boxplot() failed when entire column was empty (GH8181).
  • Bug with messed variables inradviz visualization (GH8199).
  • Bug into_clipboard that would clip long column data (GH8305)
  • Bug inDataFrame terminal display: Setting max_column/max_rows to zero did not trigger auto-resizing of dfs to fit terminal width/height (GH7180).
  • Bug in OLS where running with “cluster” and “nw_lags” parameters did not work correctly, but also did not throw an error (GH5884).
  • Bug inDataFrame.dropna that interpreted non-existent columns in the subset argument as the ‘last column’ (GH8303)
  • Bug inIndex.intersection on non-monotonic non-unique indexes (GH8362).
  • Bug in masked series assignment where mismatching types would break alignment (GH8387)
  • Bug inNDFrame.equals gives false negatives with dtype=object (GH8437)
  • Bug in assignment with indexer where type diversity would break alignment (GH8258)
  • Bug inNDFrame.loc indexing when row/column names were lost when target was a list/ndarray (GH6552)
  • Regression inNDFrame.loc indexing when rows/columns were converted to Float64Index if target was an empty list/ndarray (GH7774)
  • Bug inSeries that allows it to be indexed by aDataFrame which has unexpected results. Such indexing is no longer permitted (GH8444)
  • Bug in item assignment of aDataFrame with multi-index columns where right-hand-side columns were not aligned (GH7655)
  • Suppress FutureWarning generated by NumPy when comparing object arrays containing NaN for equality (GH7065)
  • Bug in DataFrame.eval() where the dtype of the not operator (~) was not correctly inferred as bool.

v0.14.1 (July 11, 2014)

This is a minor release from 0.14.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

API changes

  • Openpyxl now raises a ValueError on construction of the openpyxl writer instead of warning on pandas import (GH7284).

  • For StringMethods.extract, when no match is found, the result - only containing NaN values - now also has dtype=object instead of float (GH7242)

  • Period objects no longer raise a TypeError when compared using == with another object that isn't a Period. Instead, when comparing a Period with another object using ==, False is returned if the other object isn't a Period. (GH7376)

  • Previously, the behaviour on resetting the time or not in offsets.apply, rollforward and rollback operations differed between offsets. With the support of the normalize keyword for all offsets (see below) with a default value of False (preserve time), the behaviour changed for certain offsets (BusinessMonthBegin, MonthEnd, BusinessMonthEnd, CustomBusinessMonthEnd, BusinessYearBegin, LastWeekOfMonth, FY5253Quarter, LastWeekOfMonth, Easter):

    In [6]: from pandas.tseries import offsets

    In [7]: d = pd.Timestamp('2014-01-01 09:00')

    # old behaviour < 0.14.1
    In [8]: d + offsets.MonthEnd()
    Out[8]: Timestamp('2014-01-31 00:00:00')

    Starting from 0.14.1 all offsets preserve time by default. The old behaviour can be obtained with normalize=True

    # new behaviour
    In [1]: d + offsets.MonthEnd()
    Out[1]: Timestamp('2014-01-31 09:00:00')

    In [2]: d + offsets.MonthEnd(normalize=True)
    Out[2]: Timestamp('2014-01-31 00:00:00')

    Note that for the other offsets the default behaviour did not change.

  • Add back #N/A N/A as a default NA value in text parsing (regression from 0.12) (GH5521)

  • Raise a TypeError on inplace-setting with a .where and a non-np.nan value, as this is inconsistent with a set-item expression like df[mask] = None (GH7656)

Enhancements

  • Add dropna argument to value_counts and nunique (GH5569).
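
    A minimal sketch of the new argument (illustrative data): with dropna=False, NaN gets its own row in the counts.

    ```python
    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 2, np.nan])

    with_na = s.value_counts(dropna=False)  # NaN is counted as its own value
    without_na = s.value_counts()           # default dropna=True skips NaN
    ```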

  • Add select_dtypes() method to allow selection of columns based on dtype (GH7316). See the docs.
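
    A minimal sketch (illustrative frame) of selecting columns by dtype with the include/exclude keywords:

    ```python
    import pandas as pd

    df = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5], 's': ['x', 'y']})

    nums = df.select_dtypes(include=['number'])    # keeps the int and float columns
    no_obj = df.select_dtypes(exclude=['object'])  # drops the string column
    ```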

  • All offsets support the normalize keyword to specify whether offsets.apply, rollforward and rollback reset the time (hour, minute, etc) or not (default False, preserves time) (GH7156):

    In [3]: import pandas.tseries.offsets as offsets

    In [4]: day = offsets.Day()

    In [5]: day.apply(Timestamp('2014-01-01 09:00'))
    Out[5]: Timestamp('2014-01-02 09:00:00')

    In [6]: day = offsets.Day(normalize=True)

    In [7]: day.apply(Timestamp('2014-01-01 09:00'))
    Out[7]: Timestamp('2014-01-02 00:00:00')
  • PeriodIndex is represented in the same format as DatetimeIndex (GH7601)

  • StringMethods now work on empty Series (GH7242)

  • The file parsers read_csv and read_table now ignore line comments provided by the parameter comment, which accepts only a single character for the C reader. In particular, they allow for comments before file data begins (GH2685)

  • Add NotImplementedError for simultaneous use of chunksize and nrows for read_csv() (GH6774).

  • Tests for basic reading of public S3 buckets now exist (GH7281).

  • read_html now sports an encoding argument that is passed to the underlying parser library. You can use this to read non-ascii encoded web pages (GH7323).

  • read_excel now supports reading from URLs in the same way that read_csv does. (GH6809)

  • Support for dateutil timezones, which can now be used in the same way as pytz timezones across pandas. (GH4688)

    In [8]: rng = date_range('3/6/2012 00:00', periods=10, freq='D',
       ...:                  tz='dateutil/Europe/London')

    In [9]: rng.tz
    Out[9]: tzfile('/usr/share/zoneinfo/Europe/London')

    Seethe docs.

  • Implemented sem (standard error of the mean) operation for Series, DataFrame, Panel, and Groupby (GH6897)

  • Add nlargest and nsmallest to the Series groupby whitelist, which means you can now use these methods on a SeriesGroupBy object (GH7053).

  • All offsets' apply, rollforward and rollback can now handle np.datetime64; previously this resulted in ApplyTypeError (GH7452)

  • Period and PeriodIndex can contain NaT in their values (GH7485)

  • Support pickling Series, DataFrame and Panel objects with non-unique labels along item axis (index, columns and items respectively) (GH7370).

  • Improved inference of datetime/timedelta with mixed null objects. Regression from 0.13.1 in interpretation of an object Index with all null elements (GH7431)

Performance

  • Improvements in dtype inference for numeric operations, yielding performance gains for dtypes: int64, timedelta64, datetime64 (GH7223)
  • Improvements in Series.transform for significant performance gains (GH6496)
  • Improvements in DataFrame.transform with ufuncs and built-in grouper functions for significant performance gains (GH7383)
  • Regression in groupby aggregation of datetime64 dtypes (GH7555)
  • Improvements in MultiIndex.from_product for large iterables (GH7627)

Experimental

  • pandas.io.data.Options has a new method, get_all_data, and now consistently returns a multi-indexed DataFrame (GH5602)
  • io.gbq.read_gbq and io.gbq.to_gbq were refactored to remove the dependency on the Google bq.py command line client. This submodule now uses httplib2 and the Google apiclient and oauth2client API client libraries, which should be more stable and, therefore, reliable than bq.py. See the docs. (GH6937).

Bug Fixes

  • Bug inDataFrame.where with a symmetric shaped frame and a passed other of a DataFrame (GH7506)
  • Bug in Panel indexing with a multi-index axis (GH7516)
  • Regression in datetimelike slice indexing with a duplicated index and non-exact end-points (GH7523)
  • Bug in setitem with list-of-lists and single vs mixed types (GH7551)
  • Bug in timeops with non-aligned Series (GH7500)
  • Bug in timedelta inference when assigning an incomplete Series (GH7592)
  • Bug in groupby.nth with a Series and integer-like column name (GH7559)
  • Bug inSeries.get with a boolean accessor (GH7407)
  • Bug invalue_counts whereNaT did not qualify as missing (NaN) (GH7423)
  • Bug into_timedelta that accepted invalid units and misinterpreted ‘m/h’ (GH7611,GH6423)
  • Bug in line plot doesn’t set correctxlim ifsecondary_y=True (GH7459)
  • Bug in groupedhist andscatter plots use oldfigsize default (GH7394)
  • Bug in plotting subplots withDataFrame.plot,hist clears passedax even if the number of subplots is one (GH7391).
  • Bug in plotting subplots withDataFrame.boxplot withby kw raisesValueError if the number of subplots exceeds 1 (GH7391).
  • Bug in subplots displaysticklabels andlabels in different rule (GH5897)
  • Bug inPanel.apply with a multi-index as an axis (GH7469)
  • Bug inDatetimeIndex.insert doesn’t preservename andtz (GH7299)
  • Bug inDatetimeIndex.asobject doesn’t preservename (GH7299)
  • Bug in multi-index slicing with datetimelike ranges (strings and Timestamps), (GH7429)
  • Bug inIndex.min andmax doesn’t handlenan andNaT properly (GH7261)
  • Bug inPeriodIndex.min/max results inint (GH7609)
  • Bug inresample wherefill_method was ignored if you passedhow (GH2073)
  • Bug inTimeGrouper doesn’t exclude column specified bykey (GH7227)
  • Bug inDataFrame andSeries bar and barh plot raisesTypeError whenbottomandleft keyword is specified (GH7226)
  • Bug inDataFrame.hist raisesTypeError when it contains non numeric column (GH7277)
  • Bug inIndex.delete does not preservename andfreq attributes (GH7302)
  • Bug inDataFrame.query()/eval where local string variables with the @sign were being treated as temporaries attempting to be deleted(GH7300).
  • Bug inFloat64Index which didn’t allow duplicates (GH7149).
  • Bug inDataFrame.replace() where truthy values were being replaced(GH7140).
  • Bug inStringMethods.extract() where a single match group Serieswould use the matcher’s name instead of the group name (GH7313).
  • Bug inisnull() whenmode.use_inf_as_null==True where isnullwouldn’t testTrue when it encountered aninf/-inf(GH7315).
  • Bug in inferred_freq results in None for eastern hemisphere timezones (GH7310)
  • Bug inEaster returns incorrect date when offset is negative (GH7195)
  • Bug in broadcasting with.div, integer dtypes and divide-by-zero (GH7325)
  • Bug in CustomBusinessDay.apply raises NameError when an np.datetime64 object is passed (GH7196)
  • Bug inMultiIndex.append,concat andpivot_table don’t preserve timezone (GH6606)
  • Bug in.loc with a list of indexers on a single-multi index level (that is not nested) (GH7349)
  • Bug inSeries.map when mapping a dict with tuple keys of different lengths (GH7333)
  • Bug fixed so that all StringMethods now work on empty Series (GH7242)
  • Fix delegation ofread_sql toread_sql_query when query does not contain ‘select’ (GH7324).
  • Bug where a string column name assignment to aDataFrame with aFloat64Index raised aTypeError during a call tonp.isnan(GH7366).
  • Bug whereNDFrame.replace() didn’t correctly replace objects withPeriod values (GH7379).
  • Bug in.ix getitem should always return a Series (GH7150)
  • Bug in multi-index slicing with incomplete indexers (GH7399)
  • Bug in multi-index slicing with a step in a sliced level (GH7400)
  • Bug where negative indexers inDatetimeIndex were not correctly sliced(GH7408)
  • Bug whereNaT wasn’t repr’d correctly in aMultiIndex (GH7406,GH7409).
  • Bug where bool objects were converted tonan inconvert_objects(GH7416).
  • Bug in quantile ignoring the axis keyword argument (GH7306)
  • Bug wherenanops._maybe_null_out doesn’t work with complex numbers(GH7353)
  • Bug in severalnanops functions whenaxis==0 for1-dimensionalnan arrays (GH7354)
  • Bug wherenanops.nanmedian doesn’t work whenaxis==None(GH7352)
  • Bug wherenanops._has_infs doesn’t work with many dtypes(GH7357)
  • Bug inStataReader.data where reading a 0-observation dta failed (GH7369)
  • Bug inStataReader when reading Stata 13 (117) files containing fixed width strings (GH7360)
  • Bug inStataWriter where encoding was ignored (GH7286)
  • Bug inDatetimeIndex comparison doesn’t handleNaT properly (GH7529)
  • Bug in passing input withtzinfo to some offsetsapply,rollforward orrollback resetstzinfo or raisesValueError (GH7465)
  • Bug inDatetimeIndex.to_period,PeriodIndex.asobject,PeriodIndex.to_timestamp doesn’t preservename (GH7485)
  • Bug in DatetimeIndex.to_period and PeriodIndex.to_timestamp handle NaT incorrectly (GH7228)
  • Bug inoffsets.apply,rollforward androllback may return normaldatetime (GH7502)
  • Bug inresample raisesValueError when target containsNaT (GH7227)
  • Bug inTimestamp.tz_localize resetsnanosecond info (GH7534)
  • Bug inDatetimeIndex.asobject raisesValueError when it containsNaT (GH7539)
  • Bug inTimestamp.__new__ doesn’t preserve nanosecond properly (GH7610)
  • Bug inIndex.astype(float) where it would return anobject dtypeIndex (GH7464).
  • Bug inDataFrame.reset_index losestz (GH3950)
  • Bug inDatetimeIndex.freqstr raisesAttributeError whenfreq isNone (GH7606)
  • Bug inGroupBy.size created byTimeGrouper raisesAttributeError (GH7453)
  • Bug in single column bar plot is misaligned (GH7498).
  • Bug in area plot with tz-aware time series raisesValueError (GH7471)
  • Bug in non-monotonicIndex.union may preservename incorrectly (GH7458)
  • Bug inDatetimeIndex.intersection doesn’t preserve timezone (GH4690)
  • Bug inrolling_var where a window larger than the array would raise an error(GH7297)
  • Bug with last plotted timeseries dictatingxlim (GH2960)
  • Bug withsecondary_y axis not being considered for timeseriesxlim (GH3490)
  • Bug inFloat64Index assignment with a non scalar indexer (GH7586)
  • Bug inpandas.core.strings.str_contains does not properly match in a case insensitive fashion whenregex=False andcase=False (GH7505)
  • Bug inexpanding_cov,expanding_corr,rolling_cov, androlling_corr for two arguments with mismatched index (GH7512)
  • Bug into_sql taking the boolean column as text column (GH7678)
  • Bug in groupedhist doesn’t handlerot kw andsharex kw properly (GH7234)
  • Bug in.loc performing fallback integer indexing withobject dtype indices (GH7496)
  • Bug (regression) inPeriodIndex constructor when passedSeries objects (GH7701).

v0.14.0 (May 31, 2014)

This is a major release from 0.13.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

In 0.14.0 all NDFrame based containers have undergone significant internal refactoring. Before that, each block of homogeneous data had its own labels and extra care was necessary to keep those in sync with the parent container's labels. This should not have any visible user/API behavior changes (GH6745)

API changes

  • read_excel uses 0 as the default sheet (GH6573)

  • iloc will now accept out-of-bounds indexers for slices, e.g. a value that exceeds the length of the object being indexed. These will be excluded. This will make pandas conform more with python/numpy indexing of out-of-bounds values. A single indexer that is out-of-bounds and drops the dimensions of the object will still raise IndexError (GH6296, GH6299). This could result in an empty axis (e.g. an empty DataFrame being returned)

    In [1]: dfl = DataFrame(np.random.randn(5, 2), columns=list('AB'))

    In [2]: dfl
    Out[2]:
              A         B
    0  1.583584 -0.438313
    1 -0.402537 -0.780572
    2 -0.141685  0.542241
    3  0.370966 -0.251642
    4  0.787484  1.666563

    In [3]: dfl.iloc[:, 2:3]
    Out[3]:
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4]

    In [4]: dfl.iloc[:, 1:3]
    Out[4]:
              B
    0 -0.438313
    1 -0.780572
    2  0.542241
    3 -0.251642
    4  1.666563

    In [5]: dfl.iloc[4:6]
    Out[5]:
              A         B
    4  0.787484  1.666563

    These are out-of-bounds selections

    dfl.iloc[[4, 5, 6]]
    IndexError: positional indexers are out-of-bounds

    dfl.iloc[:, 4]
    IndexError: single positional indexer is out-of-bounds
  • Slicing with negative start, stop & step values handles corner cases better (GH6531):

    • df.iloc[:-len(df)] is now empty
    • df.iloc[len(df)::-1] now enumerates all elements in reverse
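    The two corner cases above can be sketched as follows (illustrative frame):

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [0, 1, 2]})

    empty = df.iloc[:-len(df)]          # stop is -3: nothing lies before the first row
    reversed_df = df.iloc[len(df)::-1]  # start clipped to the last row, step -1
    ```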
  • The DataFrame.interpolate() keyword downcast default has been changed from infer to None. This is to preserve the original dtype unless explicitly requested otherwise (GH6290).

  • When converting a dataframe to HTML it used to return Empty DataFrame. This special case has been removed; instead a header with the column names is returned (GH6062).

  • Series and Index now internally share more common operations, e.g. factorize(), nunique(), value_counts() are now supported on Index types as well. The Series.weekday property is removed from Series for API consistency. Using a DatetimeIndex/PeriodIndex method on a Series will now raise a TypeError. (GH4551, GH4056, GH5519, GH6380, GH7206).

  • Add is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end accessors for DateTimeIndex / Timestamp which return a boolean array of whether the timestamp(s) are at the start/end of the month/quarter/year defined by the frequency of the DateTimeIndex / Timestamp (GH4565, GH6998)

  • Local variable usage has changed in pandas.eval()/DataFrame.eval()/DataFrame.query() (GH5987). For the DataFrame methods, two things have changed

    • Column names are now given precedence over locals
    • Local variables must be referred to explicitly. This means that even if you have a local variable that is not a column you must still refer to it with the '@' prefix.
    • You can have an expression like df.query('@a < a') with no complaints from pandas about ambiguity of the name a.
    • The top-level pandas.eval() function does not allow you to use the '@' prefix and provides you with an error message telling you so.
    • NameResolutionError was removed because it isn't necessary anymore.
  • Define and document the order of column vs index names in query/eval (GH6676)

  • concat will now concatenate mixed Series and DataFrames using the Series name or numbering columns as needed (GH2385). See the docs
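
    A minimal sketch of this behaviour (illustrative data): a named Series contributes its name as the column label when combined with a DataFrame.

    ```python
    import pandas as pd

    s = pd.Series([1, 2], name='s')
    df = pd.DataFrame({'x': [3, 4]})

    # the Series name 's' becomes a column name in the result
    combined = pd.concat([df, s], axis=1)
    ```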

  • Slicing and advanced/boolean indexing operations on Index classes as well as Index.delete() and Index.drop() methods will no longer change the type of the resulting index (GH6440, GH7040)

    In [6]: i = pd.Index([1, 2, 3, 'a', 'b', 'c'])

    In [7]: i[[0, 1, 2]]
    Out[7]: Index([1, 2, 3], dtype='object')

    In [8]: i.drop(['a', 'b', 'c'])
    Out[8]: Index([1, 2, 3], dtype='object')

    Previously, the above operation would return Int64Index. If you'd like to do this manually, use Index.astype()

    In [9]: i[[0, 1, 2]].astype(np.int_)
    Out[9]: Int64Index([1, 2, 3], dtype='int64')
  • set_index no longer converts MultiIndexes to an Index of tuples. For example, the old behavior returned an Index in this case (GH6459):

    # Old behavior, casted MultiIndex to an Index
    In [10]: tuple_ind
    Out[10]: Index([(u'a', u'c'), (u'a', u'd'), (u'b', u'c'), (u'b', u'd')], dtype='object')

    In [11]: df_multi.set_index(tuple_ind)
    Out[11]:
                   0         1
    (a, c)  0.471435 -1.190976
    (a, d)  1.432707 -0.312652
    (b, c) -0.720589  0.887163
    (b, d)  0.859588 -0.636524

    # New behavior
    In [12]: mi
    Out[12]:
    MultiIndex(levels=[[u'a', u'b'], [u'c', u'd']],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

    In [13]: df_multi.set_index(mi)
    Out[13]:
                0         1
    a c  0.471435 -1.190976
      d  1.432707 -0.312652
    b c -0.720589  0.887163
      d  0.859588 -0.636524

    This also applies when passing multiple indices toset_index:

    # Old output, 2-level MultiIndex of tuples
    In [14]: df_multi.set_index([df_multi.index, df_multi.index])
    Out[14]:
                          0         1
    (a, c) (a, c)  0.471435 -1.190976
    (a, d) (a, d)  1.432707 -0.312652
    (b, c) (b, c) -0.720589  0.887163
    (b, d) (b, d)  0.859588 -0.636524

    # New output, 4-level MultiIndex
    In [15]: df_multi.set_index([df_multi.index, df_multi.index])
    Out[15]:
                    0         1
    a c a c  0.471435 -1.190976
      d a d  1.432707 -0.312652
    b c b c -0.720589  0.887163
      d b d  0.859588 -0.636524
  • pairwise keyword was added to the statistical moment functions rolling_cov, rolling_corr, ewmcov, ewmcorr, expanding_cov, expanding_corr to allow the calculation of moving window covariance and correlation matrices (GH4950). See Computing rolling pairwise covariances and correlations in the docs.

    In [1]: df = DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

    In [4]: covs = pd.rolling_cov(df[['A', 'B', 'C']],
       ...:                       df[['B', 'C', 'D']],
       ...:                       5, pairwise=True)

    In [5]: covs[df.index[-1]]
    Out[5]:
              B         C         D
    A  0.035310  0.326593 -0.505430
    B  0.137748 -0.006888 -0.005383
    C -0.006888  0.861040  0.020762
  • Series.iteritems() is now lazy (returns an iterator rather than a list). This was the documented behavior prior to 0.14. (GH6760)

  • Addednunique andvalue_counts functions toIndex for counting unique elements. (GH6734)

  • stack andunstack now raise aValueError when thelevel keyword refersto a non-unique item in theIndex (previously raised aKeyError). (GH6738)

  • drop unused order argument from Series.sort; args now are in the same order as Series.order; add na_position arg to conform to Series.order (GH6847)

  • default sorting algorithm for Series.order is now quicksort, to conform with Series.sort (and numpy defaults)

  • add inplace keyword to Series.order/sort to make them inverses (GH6859)

  • DataFrame.sort now places NaNs at the beginning or end of the sort according to the na_position parameter. (GH3917)
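
    A sketch of the na_position behaviour. Note that Series.sort/order and DataFrame.sort were later renamed to sort_values, which is used here so the example runs on current pandas; the keyword works the same way.

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [2.0, np.nan, 1.0]})

    na_first = df.sort_values('a', na_position='first')  # NaNs placed at the top
    na_last = df.sort_values('a', na_position='last')    # NaNs at the bottom (default)
    ```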

  • acceptTextFileReader inconcat, which was affecting a common user idiom (GH6583), this was a regressionfrom 0.13.1

  • Added factorize functions to Index and Series to get indexer and unique values (GH7090)

  • describe on a DataFrame with a mix of Timestamp and string-like objects returns a different Index (GH7088). Previously the index was unintentionally sorted.

  • Arithmetic operations with only bool dtypes now give a warning indicating that they are evaluated in Python space for +, -, and * operations and raise for all others (GH7011, GH6762, GH7015, GH7210)

    x = pd.Series(np.random.rand(10) > 0.5)
    y = True
    x + y  # warning generated: should do x | y instead
    x / y  # this raises because it doesn't make sense

    NotImplementedError: operator '/' not implemented for bool dtypes
  • In HDFStore, select_as_multiple will always raise a KeyError when a key or the selector is not found (GH6177)

  • df['col'] = value and df.loc[:, 'col'] = value are now completely equivalent; previously the .loc variant would not necessarily coerce the dtype of the resultant series (GH6149)

  • dtypes and ftypes now return a series with dtype=object on empty containers (GH5740)

  • df.to_csv will now return a string of the CSV data if neither a target path nor a buffer is provided (GH6061)
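    The buffer-less behaviour can be sketched as follows (the frame and column names here are made up for illustration):

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

    # With no path or buffer, to_csv hands the CSV text back as a string
    csv_text = df.to_csv(index=False)
    print(csv_text)
    ```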

  • pd.infer_freq() will now raise a TypeError if given an invalid Series/Index type (GH6407, GH6463)

  • A tuple passed to DataFrame.sort_index will be interpreted as the levels of the index, rather than requiring a list of tuples (GH4370)

  • all offset operations now return Timestamp types (rather than datetime), Business/Week frequencies were incorrect (GH4069)

  • to_excel now converts np.inf into a string representation, customizable by the inf_rep keyword argument (Excel has no native inf representation) (GH6782)

  • Replace pandas.compat.scipy.scoreatpercentile with numpy.percentile (GH6810)

  • .quantile on a datetime[ns] series now returns Timestamp instead of np.datetime64 objects (GH6810)

  • change AssertionError to TypeError for invalid types passed to concat (GH6583)

  • Raise a TypeError when DataFrame is passed an iterator as the data argument (GH5357)

Display Changes

  • The default way of printing large DataFrames has changed. DataFrames exceeding max_rows and/or max_columns are now displayed in a centrally truncated view, consistent with the printing of a pandas.Series (GH5603).

    In previous versions, a DataFrame was truncated once the dimension constraints were reached and an ellipsis (...) signaled that part of the data was cut off.

    The previous look of truncation.

    In the current version, large DataFrames are centrally truncated, showing a preview of head and tail in both dimensions.

    The new look.
  • allow option 'truncate' for display.show_dimensions to only show the dimensions if the frame is truncated (GH6547).

    The default for display.show_dimensions will now be truncate. This is consistent with how Series display their length.

    In [16]: dfd = pd.DataFrame(np.arange(25).reshape(-1, 5),
       ....:                    index=[0, 1, 2, 3, 4],
       ....:                    columns=[0, 1, 2, 3, 4])

    # show dimensions since this is truncated
    In [17]: with pd.option_context('display.max_rows', 2, 'display.max_columns', 2,
       ....:                        'display.show_dimensions', 'truncate'):
       ....:     print(dfd)
       ....:
         0 ...   4
    0    0 ...   4
    ..  .. ...  ..
    4   20 ...  24

    [5 rows x 5 columns]

    # will not show dimensions since it is not truncated
    In [18]: with pd.option_context('display.max_rows', 10, 'display.max_columns', 40,
       ....:                        'display.show_dimensions', 'truncate'):
       ....:     print(dfd)
       ....:
        0   1   2   3   4
    0   0   1   2   3   4
    1   5   6   7   8   9
    2  10  11  12  13  14
    3  15  16  17  18  19
    4  20  21  22  23  24
  • Regression in the display of a MultiIndexed Series when display.max_rows is less than the length of the series (GH7101)

  • Fixed a bug in the HTML repr of a truncated Series or DataFrame not showing the class name with large_repr set to 'info' (GH7105)

  • The verbose keyword in DataFrame.info(), which controls whether to shorten the info representation, is now None by default. This will follow the global setting in display.max_info_columns. The global setting can be overridden with verbose=True or verbose=False.

  • Fixed a bug with the info repr not honoring the display.max_info_columns setting (GH6939)

  • Offset/freq info is now in the Timestamp __repr__ (GH4553)

Text Parsing API Changes

read_csv()/read_table() will now be noisier w.r.t. invalid options rather than falling back to the PythonParser.

  • Raise ValueError when sep is specified with delim_whitespace=True in read_csv()/read_table() (GH6607)
  • Raise ValueError when engine='c' is specified with unsupported options in read_csv()/read_table() (GH6607)
  • Raise ValueError when fallback to the python parser causes options to be ignored (GH6607)
  • Produce ParserWarning on fallback to the python parser when no options are ignored (GH6607)
  • Translate sep='\s+' to delim_whitespace=True in read_csv()/read_table() if no other C-unsupported options are specified (GH6607)
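A minimal sketch of the stricter behaviour, assuming a reasonably recent pandas (the exact exception type and message have varied between releases):

```python
import io
import pandas as pd

# sep=r'\s+' is treated as whitespace delimiting; no silent python-engine fallback
df = pd.read_csv(io.StringIO("a b\n1 2\n3 4"), sep=r"\s+")

# Contradictory options (an explicit sep plus delim_whitespace=True) now raise
try:
    pd.read_csv(io.StringIO("a b\n1 2"), sep=",", delim_whitespace=True)
    conflict_raised = False
except (ValueError, TypeError):
    conflict_raised = True
```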

Groupby API Changes

More consistent behaviour for some groupby methods:

  • groupby head and tail now act more like filter rather than an aggregation:

    In [19]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

    In [20]: g = df.groupby('A')

    In [21]: g.head(1)  # filters DataFrame
    Out[21]:
       A  B
    0  1  2
    2  5  6

    In [22]: g.apply(lambda x: x.head(1))  # used to simply fall-through
    Out[22]:
         A  B
    A
    1 0  1  2
    5 2  5  6
  • groupby head and tail respect column selection:

    In [23]: g[['B']].head(1)
    Out[23]:
       B
    0  2
    2  6
  • groupby nth now reduces by default; filtering can be achieved by passing as_index=False, with an optional dropna argument to ignore NaN. See the docs.

    Reducing

    In [24]: df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])

    In [25]: g = df.groupby('A')

    In [26]: g.nth(0)
    Out[26]:
         B
    A
    1  NaN
    5  6.0

    # this is equivalent to g.first()
    In [27]: g.nth(0, dropna='any')
    Out[27]:
         B
    A
    1  4.0
    5  6.0

    # this is equivalent to g.last()
    In [28]: g.nth(-1, dropna='any')
    Out[28]:
         B
    A
    1  4.0
    5  6.0

    Filtering

    In [29]: gf = df.groupby('A', as_index=False)

    In [30]: gf.nth(0)
    Out[30]:
       A    B
    0  1  NaN
    2  5  6.0

    In [31]: gf.nth(0, dropna='any')
    Out[31]:
       A    B
    A
    1  1  4.0
    5  5  6.0
  • groupby will now not return the grouped column for non-cython functions (GH5610, GH5614, GH6732), as it's already the index

    In [32]: df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])

    In [33]: g = df.groupby('A')

    In [34]: g.count()
    Out[34]:
       B
    A
    1  1
    5  2

    In [35]: g.describe()
    Out[35]:
                    B
    A
    1 count  1.000000
      mean   4.000000
      std         NaN
      min    4.000000
      25%    4.000000
      50%    4.000000
      75%    4.000000
    ...           ...
    5 mean   7.000000
      std    1.414214
      min    6.000000
      25%    6.500000
      50%    7.000000
      75%    7.500000
      max    8.000000

    [16 rows x 1 columns]
  • passing as_index will leave the grouped column in-place (this is not a change in 0.14.0)

    In [36]: df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])

    In [37]: g = df.groupby('A', as_index=False)

    In [38]: g.count()
    Out[38]:
       A  B
    0  1  1
    1  5  2

    In [39]: g.describe()
    Out[39]:
               A         B
    0 count  2.0  1.000000
      mean   1.0  4.000000
      std    0.0       NaN
      min    1.0  4.000000
      25%    1.0  4.000000
      50%    1.0  4.000000
      75%    1.0  4.000000
    ...      ...       ...
    1 mean   5.0  7.000000
      std    0.0  1.414214
      min    5.0  6.000000
      25%    5.0  6.500000
      50%    5.0  7.000000
      75%    5.0  7.500000
      max    5.0  8.000000

    [16 rows x 2 columns]
  • Allow specification of a more complex groupby via pd.Grouper, such as grouping by a Time and a string field simultaneously. See the docs. (GH3794)

  • Better propagation/preservation of Series names when performing groupby operations:

    • SeriesGroupBy.agg will ensure that the name attribute of the original series is propagated to the result (GH6265).
    • If the function provided to GroupBy.apply returns a named series, the name of the series will be kept as the name of the column index of the DataFrame returned by GroupBy.apply (GH6124). This facilitates DataFrame.stack operations where the name of the column index is used as the name of the inserted column containing the pivoted data.

SQL

The SQL reading and writing functions now support more database flavors through SQLAlchemy (GH2717, GH4163, GH5950, GH6292). All databases supported by SQLAlchemy can be used, such as PostgreSQL, MySQL, Oracle, Microsoft SQL Server (see the documentation of SQLAlchemy on included dialects).

The functionality of providing DBAPI connection objects will only be supported for sqlite3 in the future. The 'mysql' flavor is deprecated.

The new functions read_sql_query() and read_sql_table() are introduced. The function read_sql() is kept as a convenience wrapper around the other two and will delegate to the specific function depending on the provided input (database table name or sql query).

In practice, you have to provide a SQLAlchemy engine to the sql functions. To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For an in-memory sqlite database:

In [40]: from sqlalchemy import create_engine

# Create your connection.
In [41]: engine = create_engine('sqlite:///:memory:')

Thisengine can then be used to write or read data to/from this database:

In [42]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

In [43]: df.to_sql('db_table', engine, index=False)

You can read data from a database by specifying the table name:

In [44]: pd.read_sql_table('db_table', engine)
Out[44]:
   A  B
0  1  a
1  2  b
2  3  c

or by specifying a sql query:

In [45]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[45]:
   A  B
0  1  a
1  2  b
2  3  c

Some other enhancements to the sql functions include:

  • support for writing the index. This can be controlled with the index keyword (default is True).
  • specify the column label to use when writing the index with index_label.
  • specify string columns to parse as datetimes with the parse_dates keyword in read_sql_query() and read_sql_table().
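A small sketch of these keywords, using the sqlite3 DBAPI connection that remains supported (the table and column names below are made up):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'ts': ['2014-05-31', '2014-06-30'], 'value': [1, 2]})

# index=False skips writing the index; index_label would name it instead
df.to_sql('events', conn, index=False)

# parse_dates converts the stored strings back into datetime64 values
out = pd.read_sql_query('SELECT * FROM events', conn, parse_dates=['ts'])
```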

Warning

Some of the existing functions or function aliases have been deprecated and will be removed in future versions. This includes: tquery, uquery, read_frame, frame_query, write_frame.

Warning

The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).

MultiIndexing Using Slicers

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

See the docs. See also issues (GH6134, GH4036, GH3057, GH2598, GH5641, GH7106)

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1', 'A3'), ...), :]

rather than this:

df.loc[(slice('A1', 'A3'), ...)]

Warning

You will need to make sure that the selection axes are fully lexsorted!

In [46]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [47]: index = MultiIndex.from_product([mklbl('A', 4),
   ....:                                  mklbl('B', 2),
   ....:                                  mklbl('C', 4),
   ....:                                  mklbl('D', 2)])
   ....:

In [48]: columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                   ('b', 'foo'), ('b', 'bah')],
   ....:                                  names=['lvl0', 'lvl1'])
   ....:

In [49]: df = DataFrame(np.arange(len(index) * len(columns)).reshape((len(index), len(columns))),
   ....:                index=index,
   ....:                columns=columns).sortlevel().sortlevel(axis=1)
   ....:

In [50]: df
Out[50]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [51]: df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[51]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pd.IndexSlice to shortcut the creation of these slices

In [52]: idx = pd.IndexSlice

In [53]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[53]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [54]: df.loc['A1', (slice(None), 'foo')]
Out[54]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
...       ...  ...
B1 C0 D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [55]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[55]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [56]: mask = df[('a', 'foo')] > 200

In [57]: df.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [58]: df.loc(axis=0)[:, :, ['C1', 'C3']]
Out[58]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore you can set the values using these methods

In [59]: df2 = df.copy()

In [60]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [61]: df2
Out[61]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can also use an alignable object as the right-hand side.

In [62]: df2 = df.copy()

In [63]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [64]: df2
Out[64]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
         D1      21      20      23      22
      C3 D0   25000   24000   27000   26000
...             ...     ...     ...     ...
A3 B1 C0 D1     229     228     231     230
      C1 D0  233000  232000  235000  234000
         D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
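The whole pattern can be condensed into a small self-contained sketch (the frame below is hypothetical, but the IndexSlice and axis mechanics are the same as above):

```python
import numpy as np
import pandas as pd

# A small two-level frame to slice
midx = pd.MultiIndex.from_product([['A0', 'A1'], ['B0', 'B1', 'B2']])
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=midx, columns=['x', 'y'])

idx = pd.IndexSlice

# All first-level labels, only B0/B2 on the second level
sel = df.loc[idx[:, ['B0', 'B2']], :]

# axis=0 tells .loc that the slicers apply to the rows only
same = df.loc(axis=0)[:, ['B0', 'B2']]
```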

Plotting

  • Hexagonal bin plots from DataFrame.plot with kind='hexbin' (GH5478), See the docs.

  • DataFrame.plot and Series.plot now support area plots by specifying kind='area' (GH6656), See the docs

  • Pie plots from Series.plot and DataFrame.plot with kind='pie' (GH6976), See the docs.

  • Plotting with Error Bars is now supported in the .plot method of DataFrame and Series objects (GH3796, GH6834), See the docs.

  • DataFrame.plot and Series.plot now support a table keyword for plotting matplotlib.Table, See the docs. The table keyword can receive the following values.

    • False: Do nothing (default).
    • True: Draw a table using the DataFrame or Series that called the plot method. Data will be transposed to meet matplotlib's default layout.
    • DataFrame or Series: Draw a matplotlib.table using the passed data. The data will be drawn as displayed in the print method (not transposed automatically). Also, a helper function pandas.tools.plotting.table is added to create a table from a DataFrame or Series, and add it to a matplotlib.Axes.
  • plot(legend='reverse') will now reverse the order of legend labels for most plot kinds. (GH6014)

  • Line plots and area plots can be stacked with stacked=True (GH6656)

  • The following keywords are now acceptable for DataFrame.plot() with kind='bar' and kind='barh':

    • width: Specify the bar width. In previous versions, the static value 0.5 was passed to matplotlib and could not be overwritten. (GH6604)
    • align: Specify the bar alignment. Default is center (different from matplotlib). In previous versions, pandas passed align='edge' to matplotlib and adjusted the location to center by itself, with the result that the align keyword was not applied as expected. (GH4525)
    • position: Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center). (GH6604)

    Because of the default align value change, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0, ...). This is intended to make bar plots land on the same coordinates as line plots. However, bar plots may differ unexpectedly when you manually adjust the bar location or drawing area, such as using set_xlim, set_ylim, etc. In these cases, please modify your script to meet the new coordinates.

  • The parallel_coordinates() function now takes argument color instead of colors. A FutureWarning is raised to alert that the old colors argument will not be supported in a future release. (GH6956)

  • The parallel_coordinates() and andrews_curves() functions now take positional argument frame instead of data. A FutureWarning is raised if the old data argument is used by name. (GH6956)

  • DataFrame.boxplot() now supports a layout keyword (GH6769)

  • DataFrame.boxplot() has a new keyword argument, return_type. It accepts 'dict', 'axes', or 'both', in which case a namedtuple with the matplotlib axes and a dict of matplotlib Lines is returned.

Prior Version Deprecations/Changes

There are prior version deprecations that are taking effect as of 0.14.0.

  • Remove DateRange in favor of DatetimeIndex (GH6816)
  • Remove column keyword from DataFrame.sort (GH4370)
  • Remove precision keyword from set_eng_float_format() (GH395)
  • Remove force_unicode keyword from DataFrame.to_string(), DataFrame.to_latex(), and DataFrame.to_html(); these functions encode in unicode by default (GH2224, GH2225)
  • Remove nanRep keyword from DataFrame.to_csv() and DataFrame.to_string() (GH275)
  • Remove unique keyword from HDFStore.select_column() (GH3256)
  • Remove inferTimeRule keyword from Timestamp.offset() (GH391)
  • Remove name keyword from get_data_yahoo() and get_data_google() (commit b921d1a)
  • Remove offset keyword from DatetimeIndex constructor (commit 3136390)
  • Remove time_rule from several rolling-moment statistical functions, such as rolling_sum() (GH1042)
  • Removed neg (-) boolean operations on numpy arrays in favor of inv (~), as this is going to be deprecated in numpy 1.9 (GH6960)

Deprecations

  • The pivot_table()/DataFrame.pivot_table() and crosstab() functions now take arguments index and columns instead of rows and cols. A FutureWarning is raised to alert that the old rows and cols arguments will not be supported in a future release (GH5505)

  • The DataFrame.drop_duplicates() and DataFrame.duplicated() methods now take argument subset instead of cols to better align with DataFrame.dropna(). A FutureWarning is raised to alert that the old cols argument will not be supported in a future release (GH6680)

  • The DataFrame.to_csv() and DataFrame.to_excel() functions now take argument columns instead of cols. A FutureWarning is raised to alert that the old cols argument will not be supported in a future release (GH6645)

  • Indexers will warn FutureWarning when used with a scalar indexer and a non-floating point Index (GH4892, GH6960)

    # non-floating point indexes can only be indexed by integers / labels
    In [1]: Series(1, np.arange(5))[3.0]
    pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
    Out[1]: 1

    In [2]: Series(1, np.arange(5)).iloc[3.0]
    pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
    Out[2]: 1

    In [3]: Series(1, np.arange(5)).iloc[3.0:4]
    pandas/core/index.py:527: FutureWarning: slice indexers when using iloc should be integers and not floating point
    Out[3]:
    3    1
    dtype: int64

    # these are Float64Indexes, so integer or floating point is acceptable
    In [4]: Series(1, np.arange(5.))[3]
    Out[4]: 1

    In [5]: Series(1, np.arange(5.))[3.0]
    Out[6]: 1
  • Numpy 1.9 compat w.r.t. deprecation warnings (GH6960)

  • Panel.shift() now has a function signature that matches DataFrame.shift(). The old positional argument lags has been changed to a keyword argument periods with a default value of 1. A FutureWarning is raised if the old argument lags is used by name. (GH6910)

  • The order keyword argument of factorize() will be removed. (GH6926)

  • Remove the copy keyword from DataFrame.xs(), Panel.major_xs(), Panel.minor_xs(). A view will be returned if possible, otherwise a copy will be made. Previously the user could think that copy=False would ALWAYS return a view. (GH6894)

  • The parallel_coordinates() function now takes argument color instead of colors. A FutureWarning is raised to alert that the old colors argument will not be supported in a future release. (GH6956)

  • The parallel_coordinates() and andrews_curves() functions now take positional argument frame instead of data. A FutureWarning is raised if the old data argument is used by name. (GH6956)

  • The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).

  • The following io.sql functions have been deprecated: tquery, uquery, read_frame, frame_query, write_frame.

  • The percentile_width keyword argument in describe() has been deprecated. Use the percentiles keyword instead, which takes a list of percentiles to display. The default output is unchanged.

  • The default return type of boxplot() will change from a dict to a matplotlib Axes in a future release. You can use the future behavior now by passing return_type='axes' to boxplot.

Known Issues

  • OpenPyXL 2.0.0 breaks backwards compatibility (GH7169)

Enhancements

  • DataFrame and Series will create a MultiIndex object if passed a dict with tuple keys, See the docs (GH3323)

    In [65]: Series({('a', 'b'): 1, ('a', 'a'): 0,
       ....:         ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
       ....:
    Out[65]:
    a  a    0
       b    1
       c    2
    b  a    3
       b    4
    dtype: int64

    In [66]: DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
       ....:            ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
       ....:            ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
       ....:            ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
       ....:            ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       ....:
    Out[66]:
           a              b
           a    b    c    a     b
    A B  4.0  1.0  5.0  8.0  10.0
      C  3.0  2.0  6.0  7.0   NaN
      D  NaN  NaN  NaN  NaN   9.0
  • Added thesym_diff method toIndex (GH5543)

  • DataFrame.to_latex now takes a longtable keyword, which if True will return a table in a longtable environment. (GH6617)

  • Add option to turn off escaping in DataFrame.to_latex (GH6472)

  • pd.read_clipboard will, if the keyword sep is unspecified, try to detect data copied from a spreadsheet and parse accordingly. (GH6223)

  • Joining a singly-indexed DataFrame with a multi-indexed DataFrame (GH3662)

    See the docs. Joining multi-index DataFrames on both the left and right is not yet supported.

    In [67]: household = DataFrame(dict(household_id=[1, 2, 3],
       ....:                            male=[0, 1, 0],
       ....:                            wealth=[196087.3, 316478.7, 294750]),
       ....:                       columns=['household_id', 'male', 'wealth']
       ....:                       ).set_index('household_id')
       ....:

    In [68]: household
    Out[68]:
                  male    wealth
    household_id
    1                0  196087.3
    2                1  316478.7
    3                0  294750.0

    In [69]: portfolio = DataFrame(dict(household_id=[1, 2, 2, 3, 3, 3, 4],
       ....:                            asset_id=["nl0000301109", "nl0000289783", "gb00b03mlx29",
       ....:                                      "gb00b03mlx29", "lu0197800237", "nl0000289965", np.nan],
       ....:                            name=["ABN Amro", "Robeco", "Royal Dutch Shell", "Royal Dutch Shell",
       ....:                                  "AAB Eastern Europe Equity Fund", "Postbank BioTech Fonds", np.nan],
       ....:                            share=[1.0, 0.4, 0.6, 0.15, 0.6, 0.25, 1.0]),
       ....:                       columns=['household_id', 'asset_id', 'name', 'share']
       ....:                       ).set_index(['household_id', 'asset_id'])
       ....:

    In [70]: portfolio
    Out[70]:
                                                         name  share
    household_id asset_id
    1            nl0000301109                        ABN Amro   1.00
    2            nl0000289783                          Robeco   0.40
                 gb00b03mlx29               Royal Dutch Shell   0.60
    3            gb00b03mlx29               Royal Dutch Shell   0.15
                 lu0197800237  AAB Eastern Europe Equity Fund   0.60
                 nl0000289965          Postbank BioTech Fonds   0.25
    4            NaN                                      NaN   1.00

    In [71]: household.join(portfolio, how='inner')
    Out[71]:
                               male    wealth                            name  \
    household_id asset_id
    1            nl0000301109     0  196087.3                        ABN Amro
    2            nl0000289783     1  316478.7                          Robeco
                 gb00b03mlx29     1  316478.7               Royal Dutch Shell
    3            gb00b03mlx29     0  294750.0               Royal Dutch Shell
                 lu0197800237     0  294750.0  AAB Eastern Europe Equity Fund
                 nl0000289965     0  294750.0          Postbank BioTech Fonds

                               share
    household_id asset_id
    1            nl0000301109   1.00
    2            nl0000289783   0.40
                 gb00b03mlx29   0.60
    3            gb00b03mlx29   0.15
                 lu0197800237   0.60
                 nl0000289965   0.25
  • quotechar, doublequote, and escapechar can now be specified when using DataFrame.to_csv (GH5414, GH4528)

  • Partially sort by only the specified levels of a MultiIndex with the sort_remaining boolean kwarg. (GH3984)

  • Added to_julian_date to Timestamp and DatetimeIndex. The Julian Date is used primarily in astronomy and represents the number of days from noon, January 1, 4713 BC. Because nanoseconds are used to define the time in pandas the actual range of dates that you can use is 1678 AD to 2262 AD. (GH4041)
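    A quick sanity check, using the J2000 epoch (noon on 2000-01-01 is Julian Date 2451545.0 by definition):

    ```python
    import pandas as pd

    ts = pd.Timestamp('2000-01-01 12:00')

    # Days since noon, January 1, 4713 BC
    jd = ts.to_julian_date()
    ```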

  • DataFrame.to_stata will now check data for compatibility with Stata data types and will upcast when needed. When it is not possible to losslessly upcast, a warning is issued (GH6327)

  • DataFrame.to_stata and StataWriter will accept keyword arguments time_stamp and data_label which allow the time stamp and dataset label to be set when creating a file. (GH6545)

  • pandas.io.gbq now handles reading unicode strings properly. (GH5940)

  • Holidays Calendars are now available and can be used with the CustomBusinessDay offset (GH6719)

  • Float64Index is now backed by a float64 dtype ndarray instead of an object dtype array (GH6471).

  • Implemented Panel.pct_change (GH6904)

  • Added a how option to rolling-moment functions to dictate how to handle resampling; rolling_max() defaults to max, rolling_min() defaults to min, and all others default to mean (GH6297)

  • CustomBusinessMonthBegin and CustomBusinessMonthEnd are now available (GH6866)

  • Series.quantile() and DataFrame.quantile() now accept an array of quantiles.

  • describe() now accepts an array of percentiles to include in the summary statistics (GH4196)
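  A short sketch of both enhancements:

    ```python
    import pandas as pd

    s = pd.Series([1, 2, 3, 4])

    # A list of quantiles returns a Series indexed by those quantiles
    q = s.quantile([0.25, 0.5, 0.75])

    # describe() takes the percentiles to include in its summary
    d = s.describe(percentiles=[0.1, 0.9])
    ```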

  • pivot_table can now accept Grouper by the index and columns keywords (GH6913)

    In [72]: import datetime

    In [73]: df = DataFrame({
       ....:     'Branch': 'A A A A A B'.split(),
       ....:     'Buyer': 'Carl Mark Carl Carl Joe Joe'.split(),
       ....:     'Quantity': [1, 3, 5, 1, 8, 1],
       ....:     'Date': [datetime.datetime(2013, 11, 1, 13, 0), datetime.datetime(2013, 9, 1, 13, 5),
       ....:              datetime.datetime(2013, 10, 1, 20, 0), datetime.datetime(2013, 10, 2, 10, 0),
       ....:              datetime.datetime(2013, 11, 1, 20, 0), datetime.datetime(2013, 10, 2, 10, 0)],
       ....:     'PayDay': [datetime.datetime(2013, 10, 4, 0, 0), datetime.datetime(2013, 10, 15, 13, 5),
       ....:                datetime.datetime(2013, 9, 5, 20, 0), datetime.datetime(2013, 11, 2, 10, 0),
       ....:                datetime.datetime(2013, 10, 7, 20, 0), datetime.datetime(2013, 9, 5, 10, 0)]})
       ....:

    In [74]: df
    Out[74]:
      Branch Buyer                Date              PayDay  Quantity
    0      A  Carl 2013-11-01 13:00:00 2013-10-04 00:00:00         1
    1      A  Mark 2013-09-01 13:05:00 2013-10-15 13:05:00         3
    2      A  Carl 2013-10-01 20:00:00 2013-09-05 20:00:00         5
    3      A  Carl 2013-10-02 10:00:00 2013-11-02 10:00:00         1
    4      A   Joe 2013-11-01 20:00:00 2013-10-07 20:00:00         8
    5      B   Joe 2013-10-02 10:00:00 2013-09-05 10:00:00         1

    In [75]: pivot_table(df, index=Grouper(freq='M', key='Date'),
       ....:             columns=Grouper(freq='M', key='PayDay'),
       ....:             values='Quantity', aggfunc=np.sum)
       ....:
    Out[75]:
    PayDay      2013-09-30  2013-10-31  2013-11-30
    Date
    2013-09-30         NaN         3.0         NaN
    2013-10-31         6.0         NaN         1.0
    2013-11-30         NaN         9.0         NaN
  • Arrays of strings can be wrapped to a specified width (str.wrap) (GH6999)
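    For example (wrapping uses the same greedy algorithm as the standard-library textwrap module):

    ```python
    import pandas as pd

    s = pd.Series(['a bee see dee', 'short'])

    # Wrap each string at a width of 7 characters
    wrapped = s.str.wrap(7)
    ```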

  • Add nsmallest() and nlargest() methods to Series, See the docs (GH3960)
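    For example:

    ```python
    import pandas as pd

    s = pd.Series([10, 3, 7, 1, 9])

    # The largest / smallest n values, kept in ranked order
    top2 = s.nlargest(2)
    bottom2 = s.nsmallest(2)
    ```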

  • PeriodIndex fully supports partial string indexing like DatetimeIndex (GH7043)

    In [76]: prng = period_range('2013-01-01 09:00', periods=100, freq='H')

    In [77]: ps = Series(np.random.randn(len(prng)), index=prng)

    In [78]: ps
    Out[78]:
    2013-01-01 09:00    0.015696
    2013-01-01 10:00   -2.242685
    2013-01-01 11:00    1.150036
    2013-01-01 12:00    0.991946
    2013-01-01 13:00    0.953324
    2013-01-01 14:00   -2.021255
    2013-01-01 15:00   -0.334077
                          ...
    2013-01-05 06:00    0.566534
    2013-01-05 07:00    0.503592
    2013-01-05 08:00    0.285296
    2013-01-05 09:00    0.484288
    2013-01-05 10:00    1.363482
    2013-01-05 11:00   -0.781105
    2013-01-05 12:00   -0.468018
    Freq: H, dtype: float64

    In [79]: ps['2013-01-02']
    Out[79]:
    2013-01-02 00:00    0.553439
    2013-01-02 01:00    1.318152
    2013-01-02 02:00   -0.469305
    2013-01-02 03:00    0.675554
    2013-01-02 04:00   -1.817027
    2013-01-02 05:00   -0.183109
    2013-01-02 06:00    1.058969
                          ...
    2013-01-02 17:00    0.076200
    2013-01-02 18:00   -0.566446
    2013-01-02 19:00    0.036142
    2013-01-02 20:00   -2.074978
    2013-01-02 21:00    0.247792
    2013-01-02 22:00   -0.897157
    2013-01-02 23:00   -0.136795
    Freq: H, dtype: float64
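    A compact sketch of the same mechanics with a daily frequency (daily keeps the output short; .loc is used here for version-robust partial-string selection):

    ```python
    import pandas as pd

    # Daily PeriodIndex spanning January and all of February 2013
    prng = pd.period_range('2013-01-01', periods=60, freq='D')
    ps = pd.Series(range(60), index=prng)

    # A partial string selects every period that falls in February
    feb = ps.loc['2013-02']
    ```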
  • read_excel can now read milliseconds in Excel dates and times with xlrd >= 0.9.3. (GH5945)

  • pd.stats.moments.rolling_var now uses Welford’s method for increased numerical stability (GH6817)

  • pd.expanding_apply and pd.rolling_apply now take args and kwargs that are passed on to the func (GH6289)

  • DataFrame.rank() now has a percentage rank option (GH5971)

  • Series.rank() now has a percentage rank option (GH5971)

  • Series.rank() and DataFrame.rank() now accept method='dense' for ranks without gaps (GH6514)
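    A minimal sketch of the two new rank options described above (example data is invented for illustration):

    ```python
    import pandas as pd

    s = pd.Series([100, 200, 200, 300])

    # method='dense' ranks ties like 'min', but leaves no gaps afterwards
    dense = s.rank(method='dense')   # 1, 2, 2, 3
    # pct=True rescales ranks into the (0, 1] interval
    pct = s.rank(pct=True)           # 0.25, 0.625, 0.625, 1.0
    ```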

  • Support passing encoding with xlwt (GH3710)

  • Refactor Block classes removing Block.items attributes to avoid duplication in item handling (GH6745, GH6988).

  • Testing statements updated to use specialized asserts (GH6175)

Performance

  • Performance improvement when converting DatetimeIndex to floating ordinals using DatetimeConverter (GH6636)
  • Performance improvement for DataFrame.shift (GH5609)
  • Performance improvement in indexing into a multi-indexed Series (GH5567)
  • Performance improvements in single-dtyped indexing (GH6484)
  • Improve performance of DataFrame construction with certain offsets, by removing faulty caching (e.g. MonthEnd, BusinessMonthEnd) (GH6479)
  • Improve performance of CustomBusinessDay (GH6584)
  • Improve performance of slice indexing on Series with string keys (GH6341, GH6372)
  • Performance improvement for DataFrame.from_records when reading a specified number of rows from an iterable (GH6700)
  • Performance improvements in timedelta conversions for integer dtypes (GH6754)
  • Improved performance of compatible pickles (GH6899)
  • Improve performance in certain reindexing operations by optimizing take_2d (GH6749)
  • GroupBy.count() is now implemented in Cython and is much faster for large numbers of groups (GH7016).
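The Cython-accelerated GroupBy.count() in the last item tallies non-null values per group; a small sketch (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 3.0, 4.0]})

# count() returns the number of non-null entries in each group
counts = df.groupby('key').count()
# group 'a' has one non-null 'val'; group 'b' has two
```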

Experimental

There are no experimental changes in 0.14.0

Bug Fixes

  • Bug in Series ValueError when index doesn’t match data (GH6532)
  • Prevent segfault due to MultiIndex not being supported in HDFStore table format (GH1848)
  • Bug in pd.DataFrame.sort_index where mergesort wasn't stable when ascending=False (GH6399)
  • Bug inpd.tseries.frequencies.to_offset when argument has leading zeroes (GH6391)
  • Bug in version string generation for dev versions with shallow clones / install from tarball (GH6127)
  • Inconsistent tz parsingTimestamp /to_datetime for current year (GH5958)
  • Indexing bugs with reordered indexes (GH6252,GH6254)
  • Bug in.xs with a Series multiindex (GH6258,GH5684)
  • Bug in conversion of string types to a DatetimeIndex with a specified frequency (GH6273, GH6274)
  • Bug ineval where type-promotion failed for large expressions (GH6205)
  • Bug in interpolate withinplace=True (GH6281)
  • HDFStore.remove now handles start and stop (GH6177)
  • HDFStore.select_as_multiple handles start and stop the same way asselect (GH6177)
  • HDFStore.select_as_coordinates andselect_column works with awhere clause that results in filters (GH6177)
  • Regression in join of non_unique_indexes (GH6329)
  • Issue with groupby agg with a single function and a mixed-type frame (GH6337)
  • Bug inDataFrame.replace() when passing a non-boolto_replace argument (GH6332)
  • Raise when trying to align on different levels of a multi-index assignment (GH3738)
  • Bug in setting complex dtypes via boolean indexing (GH6345)
  • Bug in TimeGrouper/resample when presented with a non-monotonic DatetimeIndex that would return invalid results. (GH4161)
  • Bug in index name propagation in TimeGrouper/resample (GH4161)
  • TimeGrouper has a more compatible API to the rest of the groupers (e.g.groups was missing) (GH3881)
  • Bug in multiple grouping with a TimeGrouper depending on target column order (GH6764)
  • Bug inpd.eval when parsing strings with possible tokens like'&'(GH6351)
  • Bug in correctly handling placements of -inf in Panels when dividing by integer 0 (GH6178)
  • DataFrame.shift withaxis=1 was raising (GH6371)
  • Disabled clipboard tests until release time (run locally with nosetests -A disabled) (GH6048).
  • Bug in DataFrame.replace() when passing a nested dict that contained keys not in the values to be replaced (GH6342)
  • str.match ignored the na flag (GH6609).
  • Bug in take with duplicate columns that were not consolidated (GH6240)
  • Bug in interpolate changing dtypes (GH6290)
  • Bug inSeries.get, was using a buggy access method (GH6383)
  • Bug in hdfstore queries of the formwhere=[('date','>=',datetime(2013,1,1)),('date','<=',datetime(2014,1,1))] (GH6313)
  • Bug inDataFrame.dropna with duplicate indices (GH6355)
  • Regression in chained getitem indexing with embedded list-like from 0.12 (GH6394)
  • Float64Index with nans not comparing correctly (GH6401)
  • eval/query expressions with strings containing the @ character will now work (GH6366).
  • Bug in Series.reindex when specifying a method with some NaN values was inconsistent (noted on a resample) (GH6418)
  • Bug in DataFrame.replace() where nested dicts were erroneously depending on the order of dictionary keys and values (GH5338).
  • Performance issue when concatenating with empty objects (GH3259)
  • Clarify sorting ofsym_diff onIndex objects withNaN values (GH6444)
  • Regression inMultiIndex.from_product with aDatetimeIndex as input (GH6439)
  • Bug instr.extract when passed a non-default index (GH6348)
  • Bug instr.split when passedpat=None andn=1 (GH6466)
  • Bug inio.data.DataReader when passed"F-F_Momentum_Factor" anddata_source="famafrench" (GH6460)
  • Bug insum of atimedelta64[ns] series (GH6462)
  • Bug inresample with a timezone and certain offsets (GH6397)
  • Bug iniat/iloc with duplicate indices on a Series (GH6493)
  • Bug in read_html where NaNs were incorrectly being used to indicate missing values in text. Should use the empty string for consistency with the rest of pandas (GH5129).
  • Bug in read_html tests where redirected invalid URLs would make one test fail (GH6445).
  • Bug in multi-axis indexing using.loc on non-unique indices (GH6504)
  • Bug that caused _ref_locs corruption when slice indexing across columns axis of a DataFrame (GH6525)
  • Regression from 0.13 in the treatment of numpydatetime64 non-ns dtypes in Series creation (GH6529)
  • .names attribute of MultiIndexes passed toset_index are now preserved (GH6459).
  • Bug in setitem with a duplicate index and an alignable rhs (GH6541)
  • Bug in setitem with.loc on mixed integer Indexes (GH6546)
  • Bug inpd.read_stata which would use the wrong data types and missing values (GH6327)
  • Bug in DataFrame.to_stata that led to data loss in certain cases, and could be exported using the wrong data types and missing values (GH6335)
  • StataWriter replaces missing values in string columns by empty string (GH6802)
  • Inconsistent types inTimestamp addition/subtraction (GH6543)
  • Bug in preserving frequency across Timestamp addition/subtraction (GH4547)
  • Bug in empty list lookup causedIndexError exceptions (GH6536,GH6551)
  • Series.quantile raising on anobject dtype (GH6555)
  • Bug in.xs with anan in level when dropped (GH6574)
  • Bug in fillna withmethod='bfill/ffill' anddatetime64[ns] dtype (GH6587)
  • Bug in sql writing with mixed dtypes possibly leading to data loss (GH6509)
  • Bug inSeries.pop (GH6600)
  • Bug iniloc indexing when positional indexer matchedInt64Index of the corresponding axis and no reordering happened (GH6612)
  • Bug infillna withlimit andvalue specified
  • Bug inDataFrame.to_stata when columns have non-string names (GH4558)
  • Bug in compat withnp.compress, surfaced in (GH6658)
  • Bug in binary operations with a rhs of a Series not aligning (GH6681)
  • Bug inDataFrame.to_stata which incorrectly handles nan values and ignoreswith_index keyword argument (GH6685)
  • Bug in resample with extra bins when using an evenly divisible frequency (GH4076)
  • Bug in consistency of groupby aggregation when passing a custom function (GH6715)
  • Bug in resample whenhow=None resample freq is the same as the axis frequency (GH5955)
  • Bug in downcasting inference with empty arrays (GH6733)
  • Bug inobj.blocks on sparse containers dropping all but the last items of same for dtype (GH6748)
  • Bug in unpicklingNaT(NaTType) (GH4606)
  • Bug in DataFrame.replace() where regex metacharacters were being treated as regexes even when regex=False (GH6777).
  • Bug in timedelta ops on 32-bit platforms (GH6808)
  • Bug in setting a tz-aware index directly via.index (GH6785)
  • Bug in expressions.py where numexpr would try to evaluate arithmetic ops (GH6762).
  • Bug in Makefile where it didn't remove Cython generated C files with make clean (GH6768)
  • Bug with numpy < 1.7.2 when reading long strings from HDFStore (GH6166)
  • Bug in DataFrame._reduce where non bool-like (0/1) integers were being converted into bools. (GH6806)
  • Regression from 0.13 withfillna and a Series on datetime-like (GH6344)
  • Bug in addingnp.timedelta64 toDatetimeIndex with timezone outputs incorrect results (GH6818)
  • Bug in DataFrame.replace() where changing a dtype through replacement would only replace the first occurrence of a value (GH6689)
  • Better error message when passing a frequency of ‘MS’ inPeriod construction (GH5332)
  • Bug inSeries.__unicode__ whenmax_rows=None and the Series has more than 1000 rows. (GH6863)
  • Bug in groupby.get_group where a datetime-like wasn't always accepted (GH5267)
  • Bug in groupby.get_group created by TimeGrouper raises AttributeError (GH6914)
  • Bug inDatetimeIndex.tz_localize andDatetimeIndex.tz_convert convertingNaT incorrectly (GH5546)
  • Bug in arithmetic operations affectingNaT (GH6873)
  • Bug in Series.str.extract where the resulting Series from a single group match wasn't renamed to the group name
  • Bug in DataFrame.to_csv where setting index=False ignored the header kwarg (GH6186)
  • Bug inDataFrame.plot andSeries.plot, where the legend behave inconsistently when plotting to the same axes repeatedly (GH6678)
  • Internal tests for patching__finalize__ / bug in merge not finalizing (GH6923,GH6927)
  • acceptTextFileReader inconcat, which was affecting a common user idiom (GH6583)
  • Bug in C parser with leading whitespace (GH3374)
  • Bug in C parser withdelim_whitespace=True and\r-delimited lines
  • Bug in python parser with explicit multi-index in row following column header (GH6893)
  • Bug inSeries.rank andDataFrame.rank that caused small floats (<1e-13) to all receive the same rank (GH6886)
  • Bug in DataFrame.apply with functions that used *args or **kwargs and returned an empty result (GH6952)
  • Bug in sum/mean on 32-bit platforms on overflows (GH6915)
  • MovedPanel.shift toNDFrame.slice_shift and fixed to respect multiple dtypes. (GH6959)
  • Bug where enabling subplots=True in DataFrame.plot with only a single column raised TypeError, and Series.plot raised AttributeError (GH6951)
  • Bug in DataFrame.plot drawing unnecessary axes when enabling subplots and kind=scatter (GH6951)
  • Bug inread_csv from a filesystem with non-utf-8 encoding (GH6807)
  • Bug iniloc when setting / aligning (GH6766)
  • Bug causing UnicodeEncodeError when get_dummies called with unicode values and a prefix (GH6885)
  • Bug in timeseries-with-frequency plot cursor display (GH5453)
  • Bug surfaced ingroupby.plot when using aFloat64Index (GH7025)
  • Stopped tests from failing if options data isn’t able to be downloaded from Yahoo (GH7034)
  • Bug inparallel_coordinates andradviz where reordering of class columncaused possible color/class mismatch (GH6956)
  • Bug inradviz andandrews_curves where multiple values of ‘color’were being passed to plotting method (GH6956)
  • Bug in Float64Index.isin() where containing NaNs would make indices claim that they contained every value (GH7066).
  • Bug in DataFrame.boxplot where it failed to use the axis passed as the ax argument (GH3578)
  • Bug in the XlsxWriter and XlwtWriter implementations that resulted in datetime columns being formatted without the time (GH7075)
  • read_fwf() treats None in colspec like regular python slices. It now reads from the beginning or until the end of the line when colspec contains a None (previously raised a TypeError)
  • Bug in cache coherence with chained indexing and slicing; add _is_view property to NDFrame to correctly predict views; mark is_copy on xs only if it's an actual copy (and not a view) (GH7084)
  • Bug in DatetimeIndex creation from string ndarray withdayfirst=True (GH5917)
  • Bug inMultiIndex.from_arrays created fromDatetimeIndex doesn’t preservefreq andtz (GH7090)
  • Bug inunstack raisesValueError whenMultiIndex containsPeriodIndex (GH4342)
  • Bug inboxplot andhist draws unnecessary axes (GH6769)
  • Regression ingroupby.nth() for out-of-bounds indexers (GH6621)
  • Bug inquantile with datetime values (GH6965)
  • Bug in DataFrame.set_index, reindex and pivot not preserving DatetimeIndex and PeriodIndex attributes (GH3950, GH5878, GH6631)
  • Bug in MultiIndex.get_level_values not preserving DatetimeIndex and PeriodIndex attributes (GH7092)
  • Bug in Groupby not preserving tz (GH3950)
  • Bug inPeriodIndex partial string slicing (GH6716)
  • Bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the large_repr set to 'info' (GH7105)
  • Bug inDatetimeIndex specifyingfreq raisesValueError when passed value is too short (GH7098)
  • Fixed a bug with theinfo repr not honoring thedisplay.max_info_columns setting (GH6939)
  • BugPeriodIndex string slicing with out of bounds values (GH5407)
  • Fixed a memory error in the hashtable implementation/factorizer on resizing of large tables (GH7157)
  • Bug inisnull when applied to 0-dimensional object arrays (GH7176)
  • Bug in query/eval where global constants were not looked up correctly (GH7178)
  • Bug in recognizing out-of-bounds positional list indexers withiloc and a multi-axis tuple indexer (GH7189)
  • Bug in setitem with a single value, multi-index and integer indices (GH7190,GH7218)
  • Bug in expressions evaluation with reversed ops, showing in series-dataframe ops (GH7198,GH7192)
  • Bug in multi-axis indexing with > 2 ndim and a multi-index (GH7199)
  • Fix a bug where invalid eval/query operations would blow the stack (GH5198)

v0.13.1 (February 3, 2014)

This is a minor release from 0.13.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • Addedinfer_datetime_format keyword toread_csv/to_datetime to allow speedups for homogeneously formatted datetimes.
  • Will intelligently limit display precision for datetime/timedelta formats.
  • Enhanced Panelapply() method.
  • Suggested tutorials in newTutorials section.
  • Our pandas ecosystem is growing. We now feature related projects in a new Pandas Ecosystem section.
  • Much work has been taking place on improving the docs, and a newContributing section has been added.
  • Even though it may only be of interest to devs, we <3 our new CI status page:ScatterCI.

Warning

0.13.1 fixes a bug that was caused by a combination of having numpy < 1.8, and doing chained assignment on a string-like array. Please review the docs; chained indexing can have unexpected results and should generally be avoided.

This would previously segfault:

In [1]: df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar'])))

In [2]: df['A'].iloc[0] = np.nan

In [3]: df
Out[3]:
     A
0  NaN
1  bar
2  bah
3  foo
4  bar

The recommended way to do this type of assignment is:

In [4]: df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar'])))

In [5]: df.ix[0, 'A'] = np.nan

In [6]: df
Out[6]:
     A
0  NaN
1  bar
2  bah
3  foo
4  bar

Output Formatting Enhancements

  • df.info() view now displays dtype info per column (GH5682)

  • df.info() now honors the option max_info_rows, to disable null counts for large frames (GH5974)

    In [7]: max_info_rows = pd.get_option('max_info_rows')

    In [8]: df = DataFrame(dict(A=np.random.randn(10),
       ...:                     B=np.random.randn(10),
       ...:                     C=date_range('20130101', periods=10)))
       ...:

    In [9]: df.iloc[3:6, [0, 2]] = np.nan
    # set to not display the null counts
    In [10]: pd.set_option('max_info_rows', 0)

    In [11]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 3 columns):
    A    float64
    B    float64
    C    datetime64[ns]
    dtypes: datetime64[ns](1), float64(2)
    memory usage: 312.0 bytes
    # this is the default (same as in 0.13.0)
    In [12]: pd.set_option('max_info_rows', max_info_rows)

    In [13]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 3 columns):
    A    7 non-null float64
    B    10 non-null float64
    C    7 non-null datetime64[ns]
    dtypes: datetime64[ns](1), float64(2)
    memory usage: 312.0 bytes
  • Add show_dimensions display option for the new DataFrame repr to control whether the dimensions print.

    In [14]: df = DataFrame([[1, 2], [3, 4]])

    In [15]: pd.set_option('show_dimensions', False)

    In [16]: df
    Out[16]:
       0  1
    0  1  2
    1  3  4

    In [17]: pd.set_option('show_dimensions', True)

    In [18]: df
    Out[18]:
       0  1
    0  1  2
    1  3  4

    [2 rows x 2 columns]
  • The ArrayFormatter for datetime and timedelta64 now intelligently limits precision based on the values in the array (GH3401)

    Previously output might look like:

                      age               today                 diff
    0 2001-01-01 00:00:00 2013-04-19 00:00:00  4491 days, 00:00:00
    1 2004-06-01 00:00:00 2013-04-19 00:00:00  3244 days, 00:00:00

    Now the output looks like:

    In [19]: df = DataFrame([Timestamp('20010101'),
       ....:                 Timestamp('20040601')], columns=['age'])
       ....:

    In [20]: df['today'] = Timestamp('20130419')

    In [21]: df['diff'] = df['today'] - df['age']

    In [22]: df
    Out[22]:
             age      today      diff
    0 2001-01-01 2013-04-19 4491 days
    1 2004-06-01 2013-04-19 3244 days

    [2 rows x 3 columns]

API changes

  • Add -NaN and -nan to the default set of NA values (GH5952). See NA Values.

  • Added Series.str.get_dummies vectorized string method (GH6021), to extract dummy/indicator variables for separated string columns:

    In [23]: s = Series(['a', 'a|b', np.nan, 'a|c'])

    In [24]: s.str.get_dummies(sep='|')
    Out[24]:
       a  b  c
    0  1  0  0
    1  1  1  0
    2  0  0  0
    3  1  0  1

    [4 rows x 3 columns]
  • Added the NDFrame.equals() method to compare whether two NDFrames are equal: they must have equal axes, dtypes, and values. Added the array_equivalent function to compare whether two ndarrays are equal. NaNs in identical locations are treated as equal. (GH5283) See also the docs for a motivating example.

    In [25]: df = DataFrame({'col': ['foo', 0, np.nan]})

    In [26]: df2 = DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

    In [27]: df.equals(df2)
    Out[27]: False

    In [28]: df.equals(df2.sort())
    Out[28]: True

    In [29]: import pandas.core.common as com

    In [30]: com.array_equivalent(np.array([0, np.nan]), np.array([0, np.nan]))
    Out[30]: True

    In [31]: np.array_equal(np.array([0, np.nan]), np.array([0, np.nan]))
    Out[31]: False
  • DataFrame.apply will use the reduce argument to determine whether a Series or a DataFrame should be returned when the DataFrame is empty (GH6007).

    Previously, calling DataFrame.apply on an empty DataFrame would return either a DataFrame if there were no columns, or the function being applied would be called with an empty Series to guess whether a Series or DataFrame should be returned:

    In [32]: def applied_func(col):
       ....:     print("Apply function being called with: ", col)
       ....:     return col.sum()
       ....:

    In [33]: empty = DataFrame(columns=['a', 'b'])

    In [34]: empty.apply(applied_func)
    ('Apply function being called with: ', Series([], dtype: float64))
    Out[34]:
    a   NaN
    b   NaN
    dtype: float64

    Now, when apply is called on an empty DataFrame: if the reduce argument is True a Series will be returned, if it is False a DataFrame will be returned, and if it is None (the default) the function being applied will be called with an empty Series to try to guess the return type.

    In [35]: empty.apply(applied_func, reduce=True)
    Out[35]:
    a   NaN
    b   NaN
    dtype: float64

    In [36]: empty.apply(applied_func, reduce=False)
    Out[36]:
    Empty DataFrame
    Columns: [a, b]
    Index: []

    [0 rows x 2 columns]

Prior Version Deprecations/Changes

There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1

Deprecations

There are no deprecations of prior behavior in 0.13.1

Enhancements

  • pd.read_csv and pd.to_datetime learned a new infer_datetime_format keyword which greatly improves parsing performance in many cases. Thanks to @lexual for suggesting and @danbirken for rapidly implementing. (GH5490, GH6021)

    If parse_dates is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

    # Try to infer the format for the index column
    df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
                     infer_datetime_format=True)
  • date_format and datetime_format keywords can now be specified when writing to excel files (GH4133)

  • MultiIndex.from_product convenience function for creating a MultiIndex from the cartesian product of a set of iterables (GH6055):

    In [37]: shades = ['light', 'dark']

    In [38]: colors = ['red', 'green', 'blue']

    In [39]: MultiIndex.from_product([shades, colors], names=['shade', 'color'])
    Out[39]:
    MultiIndex(levels=[[u'dark', u'light'], [u'blue', u'green', u'red']],
               labels=[[1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]],
               names=[u'shade', u'color'])
  • Panel apply() will work on non-ufuncs. See the docs.

    In [40]: import pandas.util.testing as tm

    In [41]: panel = tm.makePanel(5)

    In [42]: panel
    Out[42]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
    Items axis: ItemA to ItemC
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: A to D

    In [43]: panel['ItemA']
    Out[43]:
                       A         B         C         D
    2000-01-03  0.694103  1.893534 -1.735349 -0.850346
    2000-01-04  0.678630  0.639633  1.210384  1.176812
    2000-01-05  0.239556 -0.962029  0.797435 -0.524336
    2000-01-06  0.151227 -2.085266 -0.379811  0.700908
    2000-01-07  0.816127  1.930247  0.702562  0.984188

    [5 rows x 4 columns]

    Specifying anapply that operates on a Series (to return a single element)

    In [44]: panel.apply(lambda x: x.dtype, axis='items')
    Out[44]:
                      A        B        C        D
    2000-01-03  float64  float64  float64  float64
    2000-01-04  float64  float64  float64  float64
    2000-01-05  float64  float64  float64  float64
    2000-01-06  float64  float64  float64  float64
    2000-01-07  float64  float64  float64  float64

    [5 rows x 4 columns]

    A similar reduction type operation

    In [45]: panel.apply(lambda x: x.sum(), axis='major_axis')
    Out[45]:
          ItemA     ItemB     ItemC
    A  2.579643  3.062757  0.379252
    B  1.416120 -1.960855  0.923558
    C  0.595222 -1.079772 -3.118269
    D  1.487226 -0.734611 -1.979310

    [4 rows x 3 columns]

    This is equivalent to

    In [46]: panel.sum('major_axis')
    Out[46]:
          ItemA     ItemB     ItemC
    A  2.579643  3.062757  0.379252
    B  1.416120 -1.960855  0.923558
    C  0.595222 -1.079772 -3.118269
    D  1.487226 -0.734611 -1.979310

    [4 rows x 3 columns]

    A transformation operation that returns a Panel, but is computing the z-score across the major_axis

    In [47]: result = panel.apply(
       ....:     lambda x: (x - x.mean()) / x.std(),
       ....:     axis='major_axis')
       ....:

    In [48]: result
    Out[48]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
    Items axis: ItemA to ItemC
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: A to D

    In [49]: result['ItemA']
    Out[49]:
                       A         B         C         D
    2000-01-03  0.595800  0.907552 -1.556260 -1.244875
    2000-01-04  0.544058  0.200868  0.915883  0.953747
    2000-01-05 -0.924165 -0.701810  0.569325 -0.891290
    2000-01-06 -1.219530 -1.334852 -0.418654  0.437589
    2000-01-07  1.003837  0.928242  0.489705  0.744830

    [5 rows x 4 columns]
  • Panelapply() operating on cross-sectional slabs. (GH1148)

    In [50]: f = lambda x: ((x.T - x.mean(1)) / x.std(1)).T

    In [51]: result = panel.apply(f, axis=['items', 'major_axis'])

    In [52]: result
    Out[52]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
    Items axis: A to D
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: ItemA to ItemC

    In [53]: result.loc[:, :, 'ItemA']
    Out[53]:
                       A         B         C         D
    2000-01-03  0.331409  1.071034 -0.914540 -0.510587
    2000-01-04 -0.741017 -0.118794  0.383277  0.537212
    2000-01-05  0.065042 -0.767353  0.655436  0.069467
    2000-01-06  0.027932 -0.569477  0.908202  0.610585
    2000-01-07  1.116434  1.133591  0.871287  1.004064

    [5 rows x 4 columns]

    This is equivalent to the following

    In [54]: result = Panel(dict([(ax, f(panel.loc[:, :, ax]))
       ....:                      for ax in panel.minor_axis]))
       ....:

    In [55]: result
    Out[55]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
    Items axis: A to D
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: ItemA to ItemC

    In [56]: result.loc[:, :, 'ItemA']
    Out[56]:
                       A         B         C         D
    2000-01-03  0.331409  1.071034 -0.914540 -0.510587
    2000-01-04 -0.741017 -0.118794  0.383277  0.537212
    2000-01-05  0.065042 -0.767353  0.655436  0.069467
    2000-01-06  0.027932 -0.569477  0.908202  0.610585
    2000-01-07  1.116434  1.133591  0.871287  1.004064

    [5 rows x 4 columns]

Performance

Performance improvements for 0.13.1

  • Series datetime/timedelta binary operations (GH5801)
  • DataFrame count/dropna for axis=1
  • Series.str.contains now has aregex=False keyword which can be faster for plain (non-regex) string patterns. (GH5879)
  • Series.str.extract (GH5944)
  • dtypes/ftypes methods (GH5968)
  • indexing with object dtypes (GH5968)
  • DataFrame.apply (GH6013)
  • Regression in JSON IO (GH5765)
  • Index construction from Series (GH6150)
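For the str.contains item above, a minimal sketch of the regex=False fast path (example strings are invented for illustration):

```python
import pandas as pd

s = pd.Series(['apple.pie', 'cherry', 'apple tart'])

# regex=False treats the pattern as a literal substring,
# so '.' is not a wildcard here -- and matching can be faster
literal = s.str.contains('apple.', regex=False)  # True, False, False
# with the default regex=True, '.' matches any character
as_regex = s.str.contains('apple.')              # True, False, True
```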

Experimental

There are no experimental changes in 0.13.1

Bug Fixes

See V0.13.1 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.1.

See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.

v0.13.0 (January 3, 2014)

This is a major release from 0.12.0 and includes a number of API changes, several new features andenhancements along with a large number of bug fixes.

Highlights include:

  • support for a new index typeFloat64Index, and other Indexing enhancements
  • HDFStore has a new string based syntax for query specification
  • support for new methods of interpolation
  • updatedtimedelta operations
  • a new string manipulation methodextract
  • Nanosecond support for Offsets
  • isin for DataFrames
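The DataFrame isin highlight above returns an element-wise boolean mask; a minimal sketch (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# isin checks each element against the passed values and
# returns a boolean DataFrame with the same shape
mask = df.isin([1, 3, 'b'])
```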

Several experimental features are added, including:

  • neweval/query methods for expression evaluation
  • support formsgpack serialization
  • an i/o interface to Google’sBigQuery

There are several new or updated docs sections including:

Warning

In 0.13.0 Series has internally been refactored to no longer sub-class ndarray but instead subclass NDFrame, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications. See Internal Refactoring

API changes

  • read_excel now supports an integer in its sheetname argument giving the index of the sheet to read in (GH4301).

  • Text parser now treats anything that reads like inf ("inf", "Inf", "-Inf", "iNf", etc.) as infinity. (GH4220, GH4219), affecting read_table, read_csv, etc.
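    A minimal, self-contained sketch of the behavior described above (the inline CSV is invented for illustration):

    ```python
    import io

    import numpy as np
    import pandas as pd

    # any capitalization of "inf" parses as floating-point infinity
    csv = io.StringIO("x\ninf\n-Inf\niNf\n")
    df = pd.read_csv(csv)

    # the column comes back as float64 with +/- infinity values,
    # not as strings
    ```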

  • pandas now is Python 2/3 compatible without the need for 2to3 thanks to @jtratner. As a result, pandas now uses iterators more extensively. This also led to the introduction of substantive parts of Benjamin Peterson's six library into compat. (GH4384, GH4375, GH4372)

  • pandas.util.compat and pandas.util.py3compat have been merged into pandas.compat. pandas.compat now includes many functions allowing 2/3 compatibility. It contains both list and iterator versions of range, filter, map and zip, plus other necessary elements for Python 3 compatibility. lmap, lzip, lrange and lfilter all produce lists instead of iterators, for compatibility with numpy, subscripting and pandas constructors. (GH4384, GH4375, GH4372)

  • Series.get with negative indexers now returns the same as [] (GH4390)

  • Changes to how Index and MultiIndex handle metadata (levels, labels, and names) (GH4039):

    # previously, you would have set levels or labels directly
    index.levels = [[1, 2, 3, 4], [1, 2, 4, 4]]

    # now, you use the set_levels or set_labels methods
    index = index.set_levels([[1, 2, 3, 4], [1, 2, 4, 4]])

    # similarly, for names, you can rename the object
    # but setting names is not deprecated
    index = index.set_names(["bob", "cranberry"])

    # and all methods take an inplace kwarg - but return None
    index.set_names(["bob", "cranberry"], inplace=True)
  • All division with NDFrame objects is now true division, regardless of the future import. This means that operating on pandas objects will by default use floating point division, and return a floating point dtype. You can use // and floordiv to do integer division.

    Integer division

    In [3]: arr = np.array([1, 2, 3, 4])

    In [4]: arr2 = np.array([5, 3, 2, 1])

    In [5]: arr / arr2
    Out[5]: array([0, 0, 1, 4])

    In [6]: Series(arr) // Series(arr2)
    Out[6]:
    0    0
    1    0
    2    1
    3    4
    dtype: int64

    True Division

    In [7]: pd.Series(arr) / pd.Series(arr2)  # no future import required
    Out[7]:
    0    0.200000
    1    0.666667
    2    1.500000
    3    4.000000
    dtype: float64
  • Infer and downcast dtype if downcast='infer' is passed to fillna/ffill/bfill (GH4604)

  • __nonzero__ for all NDFrame objects will now raise a ValueError; this reverts back to (GH1073, GH4633) behavior. See gotchas for a more detailed discussion.

    This prevents doing boolean comparison on entire pandas objects, which is inherently ambiguous. These all will raise a ValueError.

    if df:
        ....

    df1 and df2
    s1 and s2

    Added the .bool() method to NDFrame objects to facilitate evaluating single-element boolean Series:

    In [1]: Series([True]).bool()
    Out[1]: True

    In [2]: Series([False]).bool()
    Out[2]: False

    In [3]: DataFrame([[True]]).bool()
    Out[3]: True

    In [4]: DataFrame([[False]]).bool()
    Out[4]: False
  • All non-Index NDFrames (Series, DataFrame, Panel, Panel4D, SparsePanel, etc.), now support the entire set of arithmetic operators and arithmetic flex methods (add, sub, mul, etc.). SparsePanel does not support pow or mod with non-scalars. (GH3765)

  • Series and DataFrame now have a mode() method to calculate the statistical mode(s) by axis/Series. (GH5367)
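    A minimal sketch of mode() (data invented for illustration); note that it returns all modal values, since there may be ties:

    ```python
    import pandas as pd

    s = pd.Series([1, 2, 2, 3, 3])

    # two values tie for most frequent, so both are returned
    m = s.mode()    # 2 and 3

    df = pd.DataFrame({'x': [1, 1, 2], 'y': [5, 6, 6]})
    dm = df.mode()  # per-column modes: x -> 1, y -> 6
    ```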

  • Chained assignment will now by default warn if the user is assigning to a copy. This can be changed with the option mode.chained_assignment, allowed options are raise/warn/None. See the docs.

    In [5]: dfc = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})

    In [6]: pd.set_option('chained_assignment', 'warn')

    The following warning / exception will show if this is attempted.

    In [7]: dfc.loc[0]['A'] = 1111
    Traceback (most recent call last)
    ...
    SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_index,col_indexer] = value instead

    Here is the correct method of assignment.

    In [8]: dfc.loc[0, 'A'] = 11

    In [9]: dfc
    Out[9]:
         A  B
    0   11  1
    1  bbb  2
    2  ccc  3

    [3 rows x 2 columns]
  • Panel.reindex has the following call signature: Panel.reindex(items=None, major_axis=None, minor_axis=None, **kwargs),

    to conform with other NDFrame objects. See Internal Refactoring for more information.

  • Series.argmin and Series.argmax are now aliased to Series.idxmin and Series.idxmax. These return the index of the

    min or max element respectively. Prior to 0.13.0 these would return the position of the min / max element. (GH6214)

Prior Version Deprecations/Changes

These were announced changes in 0.12 or prior that are taking effect as of 0.13.0

  • Remove deprecatedFactor (GH3650)
  • Remove deprecatedset_printoptions/reset_printoptions (GH3046)
  • Remove deprecated_verbose_info (GH3215)
  • Remove deprecated read_clipboard/to_clipboard/ExcelFile/ExcelWriter from pandas.io.parsers (GH3717). These are available as functions in the main pandas namespace (e.g. pd.read_clipboard)
  • default for tupleize_cols is now False for both to_csv and read_csv. Fair warning in 0.12 (GH3604)
  • default for display.max_seq_len is now 100 rather than None. This activates truncated display ("...") of long sequences in various places. (GH3391)

Deprecations

Deprecated in 0.13.0

  • deprecated iterkv, which will be removed in a future release (this was an alias of iteritems used to bypass 2to3's changes). (GH4384, GH4375, GH4372)
  • deprecated the string method match, whose role is now performed more idiomatically by extract. In a future release, the default behavior of match will change to become analogous to contains, which returns a boolean indexer. (Their distinction is strictness: match relies on re.match while contains relies on re.search.) In this release, the deprecated behavior is the default, but the new behavior is available through the keyword argument as_indexer=True.
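A small sketch of the contains-vs-extract behavior referred to above (illustrative data; note that the as_indexer keyword has since been removed in later pandas versions, and extract's expand keyword postdates this release):

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])

# contains returns a boolean indexer (re.search semantics),
# the behavior that match is slated to adopt
mask = s.str.contains(r'[ab]')

# extract instead pulls out the matched group
letters = s.str.extract(r'([ab])', expand=False)

print(mask.tolist())  # [True, True, False]
print(letters.tolist())
```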

Indexing API Changes

Prior to 0.13, it was impossible to use a label indexer (.loc/.ix) to set a value that was not contained in the index of a particular axis. (GH2578). See the docs

In theSeries case this is effectively an appending operation

In [10]: s = Series([1, 2, 3])

In [11]: s
Out[11]:
0    1
1    2
2    3
dtype: int64

In [12]: s[5] = 5.

In [13]: s
Out[13]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
In [14]: dfi = DataFrame(np.arange(6).reshape(3, 2),
   ....:                 columns=['A', 'B'])

In [15]: dfi
Out[15]:
   A  B
0  0  1
1  2  3
2  4  5

[3 rows x 2 columns]

This would previously raise a KeyError

In [16]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [17]: dfi
Out[17]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

[3 rows x 3 columns]

This is like anappend operation.

In [18]: dfi.loc[3] = 5

In [19]: dfi
Out[19]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

[4 rows x 3 columns]

A Panel setting operation on an arbitrary axis aligns the input to the Panel

In [20]: p = pd.Panel(np.arange(16).reshape(2, 4, 2),
   ....:              items=['Item1', 'Item2'],
   ....:              major_axis=pd.date_range('2001/1/12', periods=4),
   ....:              minor_axis=['A', 'B'], dtype='float64')

In [21]: p
Out[21]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to B

In [22]: p.loc[:, :, 'C'] = Series([30, 32], index=p.items)

In [23]: p
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to C

In [24]: p.loc[:, :, 'C']
Out[24]:
            Item1  Item2
2001-01-12   30.0   32.0
2001-01-13   30.0   32.0
2001-01-14   30.0   32.0
2001-01-15   30.0   32.0

[4 rows x 2 columns]

Float64Index API Change

  • Added a new index type, Float64Index. This will be automatically created when passing floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, loc for scalar indexing and slicing work exactly the same. See the docs, (GH263)

    Construction is by default for floating type values.

    In [25]: index = Index([1.5, 2, 3, 4.5, 5])

    In [26]: index
    Out[26]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

    In [27]: s = Series(range(5), index=index)

    In [28]: s
    Out[28]:
    1.5    0
    2.0    1
    3.0    2
    4.5    3
    5.0    4
    dtype: int64

    Scalar selection for[],.ix,.loc will always be label based. An integer will match an equal float index (e.g.3 is equivalent to3.0)

    In [29]: s[3]
    Out[29]: 2

    In [30]: s.ix[3]
    Out[30]: 2

    In [31]: s.loc[3]
    Out[31]: 2

    The only positional indexing is viailoc

    In [32]: s.iloc[3]
    Out[32]: 3

    A scalar index that is not found will raiseKeyError

    Slicing is ALWAYS on the values of the index, for[],ix,loc and ALWAYS positional withiloc

    In [33]: s[2:4]
    Out[33]:
    2.0    1
    3.0    2
    dtype: int64

    In [34]: s.ix[2:4]
    Out[34]:
    2.0    1
    3.0    2
    dtype: int64

    In [35]: s.loc[2:4]
    Out[35]:
    2.0    1
    3.0    2
    dtype: int64

    In [36]: s.iloc[2:4]
    Out[36]:
    3.0    2
    4.5    3
    dtype: int64

    In float indexes, slicing using floats is allowed

    In [37]: s[2.1:4.6]
    Out[37]:
    3.0    2
    4.5    3
    dtype: int64

    In [38]: s.loc[2.1:4.6]
    Out[38]:
    3.0    2
    4.5    3
    dtype: int64
  • Indexing on other index types is preserved (and positional fallback for [], ix), with the exception that floating point slicing on indexes on non-Float64Index will now raise a TypeError.

    In [1]: Series(range(5))[3.5]
    TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

    In [1]: Series(range(5))[3.5:4.5]
    TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

    Using a scalar float indexer will be deprecated in a future version, but is allowed for now.

    In [3]: Series(range(5))[3.0]
    Out[3]: 3

HDFStore API Changes

  • Query Format Changes. A much more string-like query format is now supported. Seethe docs.

    In [39]: path = 'test.h5'

    In [40]: dfq = DataFrame(randn(10, 4),
       ....:                 columns=list('ABCD'),
       ....:                 index=date_range('20130101', periods=10))

    In [41]: dfq.to_hdf(path, 'dfq', format='table', data_columns=True)

    Use boolean expressions, with in-line function evaluation.

    In [42]: read_hdf(path, 'dfq',
       ....:          where="index>Timestamp('20130104') & columns=['A', 'B']")
    Out[42]:
                       A         B
    2013-01-05  1.057633 -0.791489
    2013-01-06  1.910759  0.787965
    2013-01-07  1.043945  2.107785
    2013-01-08  0.749185 -0.675521
    2013-01-09 -0.276646  1.924533
    2013-01-10  0.226363 -2.078618

    [6 rows x 2 columns]

    Use an inline column reference

    In [43]: read_hdf(path, 'dfq',
       ....:          where="A>0 or C>0")
    Out[43]:
                       A         B         C         D
    2013-01-01 -0.414505 -1.425795  0.209395 -0.592886
    2013-01-02 -1.473116 -0.896581  1.104352 -0.431550
    2013-01-03 -0.161137  0.889157  0.288377 -1.051539
    2013-01-04 -0.319561 -0.619993  0.156998 -0.571455
    2013-01-05  1.057633 -0.791489 -0.524627  0.071878
    2013-01-06  1.910759  0.787965  0.513082 -0.546416
    2013-01-07  1.043945  2.107785  1.459927  1.015405
    2013-01-08  0.749185 -0.675521  0.440266  0.688972
    2013-01-09 -0.276646  1.924533  0.411204  0.890765
    2013-01-10  0.226363 -2.078618 -0.387886 -0.087107

    [10 rows x 4 columns]
  • the format keyword now replaces the table keyword; allowed values are fixed(f) or table(t). The same defaults as prior to 0.13.0 remain, e.g. put implies fixed format and append implies table format. This default format can be set as an option by setting io.hdf.default_format.

    In [44]: path = 'test.h5'

    In [45]: df = DataFrame(randn(10, 2))

    In [46]: df.to_hdf(path, 'df_table', format='table')

    In [47]: df.to_hdf(path, 'df_table2', append=True)

    In [48]: df.to_hdf(path, 'df_fixed')

    In [49]: with get_store(path) as store:
       ....:     print(store)
       ....:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df_fixed             frame        (shape->[10,2])
    /df_table             frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])
    /df_table2            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])
  • Significant table writing performance improvements

  • handle a passedSeries in table format (GH4330)

  • can now serialize atimedelta64[ns] dtype in a table (GH3577), Seethe docs.

  • added an is_open property to indicate if the underlying file handle is_open; a closed store will now report 'CLOSED' when viewing the store (rather than raising an error) (GH4409)

  • a close of a HDFStore now will close that instance of the HDFStore but will only close the actual file if the ref count (by PyTables) w.r.t. all of the open handles is 0. Essentially you have a local instance of HDFStore referenced by a variable. Once you close it, it will report closed. Other references (to the same file) will continue to operate until they themselves are closed. Performing an action on a closed file will raise ClosedFileError

    In [50]: path = 'test.h5'

    In [51]: df = DataFrame(randn(10, 2))

    In [52]: store1 = HDFStore(path)

    In [53]: store2 = HDFStore(path)

    In [54]: store1.append('df', df)

    In [55]: store2.append('df2', df)

    In [56]: store1
    Out[56]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])

    In [57]: store2
    Out[57]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df             frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])
    /df2            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])

    In [58]: store1.close()

    In [59]: store2
    Out[59]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df             frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])
    /df2            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index])

    In [60]: store2.close()

    In [61]: store2
    Out[61]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    File is CLOSED
  • removed the _quiet attribute, replaced by a DuplicateWarning if retrieving duplicate rows from a table (GH4367)

  • removed the warn argument from open. Instead a PossibleDataLossError exception will be raised if you try to use mode='w' with an OPEN file handle (GH4367)

  • allow a passed locations array or mask as a where condition (GH4467). See the docs for an example.

  • add the keyword dropna=True to append to change whether ALL nan rows are not written to the store (default is True, ALL nan rows are NOT written), also settable via the option io.hdf.dropna_table (GH4625)

  • pass thru store creation arguments; can be used to support in-memory stores

DataFrame repr Changes

The HTML and plain text representations of DataFrame now show a truncated view of the table once it exceeds a certain size, rather than switching to the short info view (GH4886, GH5550). This makes the representation more consistent as small DataFrames get larger.

Truncated HTML representation of a DataFrame

To get the info view, call DataFrame.info(). If you prefer the info view as the repr for large DataFrames, you can set this by running set_option('display.large_repr', 'info').

Enhancements

  • df.to_clipboard() learned a new excel keyword that lets you paste df data directly into excel (enabled by default). (GH5070).

  • read_html now raises aURLError instead of catching and raising aValueError (GH4303,GH4305)

  • Added a test forread_clipboard() andto_clipboard() (GH4282)

  • Clipboard functionality now works with PySide (GH4282)

  • Added a more informative error message when plot arguments containoverlapping color and style arguments (GH4402)

  • to_dict now takes records as a possible outtype. Returns an array of column-keyed dictionaries. (GH4936)
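A quick sketch of the records output shape (illustrative data; in modern pandas the keyword is spelled orient rather than outtype):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# One dict per row, keyed by column name
recs = df.to_dict('records')
print(recs)  # [{'A': 1, 'B': 'x'}, {'A': 2, 'B': 'y'}]
```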

  • NaN handling in get_dummies (GH4446) with dummy_na

    # previously, nan was erroneously counted as 2 here
    # now it is not counted at all
    In [62]: get_dummies([1, 2, np.nan])
    Out[62]:
       1.0  2.0
    0    1    0
    1    0    1
    2    0    0

    [3 rows x 2 columns]

    # unless requested
    In [63]: get_dummies([1, 2, np.nan], dummy_na=True)
    Out[63]:
        1.0   2.0  NaN
    0     1     0     0
    1     0     1     0
    2     0     0     1

    [3 rows x 3 columns]
  • timedelta64[ns] operations. See the docs.

    Warning

    Most of these operations require numpy >= 1.7

    Using the new top-level to_timedelta, you can convert a scalar or array from the standard timedelta format (produced by to_csv) into a timedelta type (np.timedelta64 in nanoseconds).

    In [64]: to_timedelta('1 days 06:05:01.00003')
    Out[64]: Timedelta('1 days 06:05:01.000030')

    In [65]: to_timedelta('15.5us')
    Out[65]: Timedelta('0 days 00:00:00.000015')

    In [66]: to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
    Out[66]: TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015', NaT], dtype='timedelta64[ns]', freq=None)

    In [67]: to_timedelta(np.arange(5), unit='s')
    Out[67]: TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'], dtype='timedelta64[ns]', freq=None)

    In [68]: to_timedelta(np.arange(5), unit='d')
    Out[68]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)

    A Series of dtype timedelta64[ns] can now be divided by another timedelta64[ns] object, or astype'd to yield a float64 dtyped Series. This is frequency conversion. See the docs.

    In [69]: from datetime import timedelta

    In [70]: td = Series(date_range('20130101', periods=4)) - Series(date_range('20121201', periods=4))

    In [71]: td[2] += np.timedelta64(timedelta(minutes=5, seconds=3))

    In [72]: td[3] = np.nan

    In [73]: td
    Out[73]:
    0   31 days 00:00:00
    1   31 days 00:00:00
    2   31 days 00:05:03
    3                NaT
    dtype: timedelta64[ns]

    # to days
    In [74]: td / np.timedelta64(1, 'D')
    Out[74]:
    0    31.000000
    1    31.000000
    2    31.003507
    3          NaN
    dtype: float64

    In [75]: td.astype('timedelta64[D]')
    Out[75]:
    0    31.0
    1    31.0
    2    31.0
    3     NaN
    dtype: float64

    # to seconds
    In [76]: td / np.timedelta64(1, 's')
    Out[76]:
    0    2678400.0
    1    2678400.0
    2    2678703.0
    3          NaN
    dtype: float64

    In [77]: td.astype('timedelta64[s]')
    Out[77]:
    0    2678400.0
    1    2678400.0
    2    2678703.0
    3          NaN
    dtype: float64

    Dividing or multiplying atimedelta64[ns] Series by an integer or integer Series

    In [78]: td * -1
    Out[78]:
    0   -31 days +00:00:00
    1   -31 days +00:00:00
    2   -32 days +23:54:57
    3                  NaT
    dtype: timedelta64[ns]

    In [79]: td * Series([1, 2, 3, 4])
    Out[79]:
    0   31 days 00:00:00
    1   62 days 00:00:00
    2   93 days 00:15:09
    3                NaT
    dtype: timedelta64[ns]

    Absolute DateOffset objects can act equivalently to timedeltas

    In [80]: from pandas import offsets

    In [81]: td + offsets.Minute(5) + offsets.Milli(5)
    Out[81]:
    0   31 days 00:05:00.005000
    1   31 days 00:05:00.005000
    2   31 days 00:10:03.005000
    3                       NaT
    dtype: timedelta64[ns]

    Fillna is now supported for timedeltas

    In [82]: td.fillna(0)
    Out[82]:
    0   31 days 00:00:00
    1   31 days 00:00:00
    2   31 days 00:05:03
    3    0 days 00:00:00
    dtype: timedelta64[ns]

    In [83]: td.fillna(timedelta(days=1, seconds=5))
    Out[83]:
    0   31 days 00:00:00
    1   31 days 00:00:00
    2   31 days 00:05:03
    3    1 days 00:00:05
    dtype: timedelta64[ns]

    You can do numeric reduction operations on timedeltas.

    In [84]: td.mean()
    Out[84]: Timedelta('31 days 00:01:41')

    In [85]: td.quantile(.1)
    Out[85]: Timedelta('31 days 00:00:00')
  • plot(kind='kde') now accepts the optional parametersbw_method andind, passed to scipy.stats.gaussian_kde() (for scipy >= 0.11.0) to setthe bandwidth, and to gkde.evaluate() to specify the indices at which itis evaluated, respectively. See scipy docs. (GH4298)

  • DataFrame constructor now accepts a numpy masked record array (GH3478)

  • The new vectorized string method extract returns regular expression matches more conveniently.

    In [86]: Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
    Out[86]:
    0      1
    1      2
    2    NaN
    dtype: object

    Elements that do not match returnNaN. Extracting a regular expressionwith more than one group returns a DataFrame with one column per group.

    In [87]: Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
    Out[87]:
         0    1
    0    a    1
    1    b    2
    2  NaN  NaN

    [3 rows x 2 columns]

    Elements that do not match return a row of NaN. Thus, a Series of messy strings can be converted into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects.

    Named groups like

    In [88]: Series(['a1', 'b2', 'c3']).str.extract(
       ....:     '(?P<letter>[ab])(?P<digit>\d)')
    Out[88]:
      letter digit
    0      a     1
    1      b     2
    2    NaN   NaN

    [3 rows x 2 columns]

    and optional groups can also be used.

    In [89]: Series(['a1', 'b2', '3']).str.extract(
       ....:     '(?P<letter>[ab])?(?P<digit>\d)')
    Out[89]:
      letter digit
    0      a     1
    1      b     2
    2    NaN     3

    [3 rows x 2 columns]
  • read_stata now accepts Stata 13 format (GH4291)

  • read_fwf now infers the column specifications from the first 100 rows ofthe file if the data has correctly separated and properly aligned columnsusing the delimiter provided to the function (GH4488).

  • support for nanosecond times as an offset

    Warning

    These operations require numpy >= 1.7

    Period conversions in the range of seconds and below were reworked and extendedup to nanoseconds. Periods in the nanosecond range are now available.

    In [90]:date_range('2013-01-01',periods=5,freq='5N')Out[90]:DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01',               '2013-01-01'],              dtype='datetime64[ns]', freq='5N')

    or with frequency as offset

    In [91]:date_range('2013-01-01',periods=5,freq=pd.offsets.Nano(5))Out[91]:DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01',               '2013-01-01'],              dtype='datetime64[ns]', freq='5N')

    Timestamps can be modified in the nanosecond range

    In [92]: t = Timestamp('20130101 09:01:02')

    In [93]: t + pd.tseries.offsets.Nano(123)
    Out[93]: Timestamp('2013-01-01 09:01:02.000000123')
  • A new method, isin for DataFrames, which plays nicely with boolean indexing. The argument to isin, what we're comparing the DataFrame to, can be a DataFrame, Series, dict, or array of values. See the docs for more.

    To get the rows where any of the conditions are met:

    In [94]: dfi = DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'f', 'n']})

    In [95]: dfi
    Out[95]:
       A  B
    0  1  a
    1  2  b
    2  3  f
    3  4  n

    [4 rows x 2 columns]

    In [96]: other = DataFrame({'A': [1, 3, 3, 7], 'B': ['e', 'f', 'f', 'e']})

    In [97]: mask = dfi.isin(other)

    In [98]: mask
    Out[98]:
           A      B
    0   True  False
    1  False  False
    2   True   True
    3  False  False

    [4 rows x 2 columns]

    In [99]: dfi[mask.any(1)]
    Out[99]:
       A  B
    0  1  a
    2  3  f

    [2 rows x 2 columns]
  • Series now supports a to_frame method to convert it to a single-column DataFrame (GH5164)
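A minimal sketch of to_frame (data invented for illustration); the Series name becomes the column name:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='vals')

# to_frame promotes the Series to a single-column DataFrame
df = s.to_frame()
print(df.columns.tolist())  # ['vals']
print(df.shape)             # (3, 1)
```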

  • All R datasets listed here http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html can now be loaded into Pandas objects

    # note that pandas.rpy was deprecated in v0.16.0
    import pandas.rpy.common as com
    com.load_data('Titanic')
  • tz_localize can infer a fall daylight savings transition based on the structure of the unlocalized data (GH4230), see the docs

  • DatetimeIndex is now in the API documentation, seethe docs

  • json_normalize() is a new method to allow you to create a flat table from semi-structured JSON data. See the docs (GH1067)
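A small sketch of the flattening behavior (illustrative data; written with the modern top-level spelling pd.json_normalize, which postdates this release):

```python
import pandas as pd

data = [{'id': 1, 'info': {'name': 'a', 'score': 10}},
        {'id': 2, 'info': {'name': 'b', 'score': 20}}]

# Nested dicts are flattened into dotted column names
flat = pd.json_normalize(data)
print(flat.columns.tolist())  # ['id', 'info.name', 'info.score']
```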

  • Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.

  • Python csv parser now supports usecols (GH4335)
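A minimal sketch of usecols with the Python engine (CSV data invented for illustration):

```python
import io
import pandas as pd

csv = "a,b,c\n1,2,3\n4,5,6\n"

# Restrict parsing to a subset of columns
df = pd.read_csv(io.StringIO(csv), usecols=['a', 'c'], engine='python')
print(df.columns.tolist())  # ['a', 'c']
```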

  • Frequencies gained several new offsets:

  • DataFrame has a newinterpolate method, similar to Series (GH4434,GH1892)

    In [100]: df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
       .....:                 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

    In [101]: df.interpolate()
    Out[101]:
         A      B
    0  1.0   0.25
    1  2.1   1.50
    2  3.4   2.75
    3  4.7   4.00
    4  5.6  12.20
    5  6.8  14.40

    [6 rows x 2 columns]

    Additionally, the method argument to interpolate has been expanded to include 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial', 'spline'. The new methods require scipy. Consult the Scipy reference guide and documentation for more information about when the various methods are appropriate. See the docs.

    Interpolate now also accepts a limit keyword argument. This works similar to fillna's limit:

    In [102]: ser = Series([1, 3, np.nan, np.nan, np.nan, 11])

    In [103]: ser.interpolate(limit=2)
    Out[103]:
    0     1.0
    1     3.0
    2     5.0
    3     7.0
    4     NaN
    5    11.0
    dtype: float64
  • Added wide_to_long panel data convenience function. See the docs.

    In [104]: np.random.seed(123)

    In [105]: df = pd.DataFrame({"A1970": {0: "a", 1: "b", 2: "c"},
       .....:                    "A1980": {0: "d", 1: "e", 2: "f"},
       .....:                    "B1970": {0: 2.5, 1: 1.2, 2: .7},
       .....:                    "B1980": {0: 3.2, 1: 1.3, 2: .1},
       .....:                    "X": dict(zip(range(3), np.random.randn(3)))
       .....:                    })

    In [106]: df["id"] = df.index

    In [107]: df
    Out[107]:
      A1970 A1980  B1970  B1980         X  id
    0     a     d    2.5    3.2 -1.085631   0
    1     b     e    1.2    1.3  0.997345   1
    2     c     f    0.7    0.1  0.282978   2

    [3 rows x 6 columns]

    In [108]: wide_to_long(df, ["A", "B"], i="id", j="year")
    Out[108]:
                    X  A    B
    id year
    0  1970 -1.085631  a  2.5
    1  1970  0.997345  b  1.2
    2  1970  0.282978  c  0.7
    0  1980 -1.085631  d  3.2
    1  1980  0.997345  e  1.3
    2  1980  0.282978  f  0.1

    [6 rows x 3 columns]
  • to_csv now takes a date_format keyword argument that specifies how output datetime objects should be formatted. Datetimes encountered in the index, columns, and values will all have this formatting applied. (GH4313)
  • DataFrame.plot will scatter plot x versus y by passingkind='scatter' (GH2215)
  • Added support for Google Analytics v3 API segment IDs that also supports v2 IDs. (GH5271)
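The to_csv date_format keyword above can be sketched as follows (data invented for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'when': pd.to_datetime(['2013-01-01', '2013-06-15']),
                   'val': [1, 2]})

# date_format controls how datetime values are rendered in the CSV output
buf = io.StringIO()
df.to_csv(buf, date_format='%Y%m%d', index=False)
print(buf.getvalue())
```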

Experimental

  • The new eval() function implements expression evaluation using numexpr behind the scenes. This results in large speedups for complicated expressions involving large DataFrames/Series. For example,

    In [109]: nrows, ncols = 20000, 100

    In [110]: df1, df2, df3, df4 = [DataFrame(randn(nrows, ncols))
       .....:                       for _ in range(4)]

    # eval with NumExpr backend
    In [111]: %timeit pd.eval('df1 + df2 + df3 + df4')
    100 loops, best of 3: 9.21 ms per loop

    # pure Python evaluation
    In [112]: %timeit df1 + df2 + df3 + df4
    10 loops, best of 3: 27.2 ms per loop

    For more details, see the docs

  • Similar to pandas.eval, DataFrame has a new DataFrame.eval method that evaluates an expression in the context of the DataFrame. For example,

    In [113]: df = DataFrame(randn(10, 2), columns=['a', 'b'])

    In [114]: df.eval('a + b')
    Out[114]:
    0   -0.685204
    1    1.589745
    2    0.325441
    3   -1.784153
    4   -0.432893
    5    0.171850
    6    1.895919
    7    3.065587
    8   -0.092759
    9    1.391365
    dtype: float64
  • query() method has been added that allows you to select elements of a DataFrame using a natural query syntax nearly identical to Python syntax. For example,

    In [115]: n = 20

    In [116]: df = DataFrame(np.random.randint(n, size=(n, 3)), columns=['a', 'b', 'c'])

    In [117]: df.query('a < b < c')
    Out[117]:
        a   b   c
    11  1   5   8
    15  8  16  19

    [2 rows x 3 columns]

    selects all the rows of df where a < b < c evaluates to True. For more details see the docs.

  • pd.read_msgpack() and pd.to_msgpack() are now a supported method of serialization of arbitrary pandas (and python objects) in a lightweight portable binary format. See the docs

    Warning

    Since this is an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.

    In [118]: df = DataFrame(np.random.rand(5, 2), columns=list('AB'))

    In [119]: df.to_msgpack('foo.msg')

    In [120]: pd.read_msgpack('foo.msg')
    Out[120]:
              A         B
    0  0.251082  0.017357
    1  0.347915  0.929879
    2  0.546233  0.203368
    3  0.064942  0.031722
    4  0.355309  0.524575

    [5 rows x 2 columns]

    In [121]: s = Series(np.random.rand(5), index=date_range('20130101', periods=5))

    In [122]: pd.to_msgpack('foo.msg', df, s)

    In [123]: pd.read_msgpack('foo.msg')
    Out[123]:
    [          A         B
     0  0.251082  0.017357
     1  0.347915  0.929879
     2  0.546233  0.203368
     3  0.064942  0.031722
     4  0.355309  0.524575

     [5 rows x 2 columns], 2013-01-01    0.022321
     2013-01-02    0.227025
     2013-01-03    0.383282
     2013-01-04    0.193225
     2013-01-05    0.110977
     Freq: D, dtype: float64]

    You can pass iterator=True to iterate over the unpacked results

    In [124]: for o in pd.read_msgpack('foo.msg', iterator=True):
       .....:     print o
       .....:
              A         B
    0  0.251082  0.017357
    1  0.347915  0.929879
    2  0.546233  0.203368
    3  0.064942  0.031722
    4  0.355309  0.524575

    [5 rows x 2 columns]
    2013-01-01    0.022321
    2013-01-02    0.227025
    2013-01-03    0.383282
    2013-01-04    0.193225
    2013-01-05    0.110977
    Freq: D, dtype: float64
  • pandas.io.gbq provides a simple way to extract from, and load data into, Google's BigQuery Data Sets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets. See the docs

    from pandas.io import gbq

    # A query to select the average monthly temperatures
    # in the year 2000 across the USA. The dataset,
    # publicdata:samples.gsod, is available on all BigQuery accounts,
    # and is based on NOAA gsod data.

    query = """SELECT station_number as STATION,
    month as MONTH, AVG(mean_temp) as MEAN_TEMP
    FROM publicdata:samples.gsod
    WHERE YEAR = 2000
    GROUP BY STATION, MONTH
    ORDER BY STATION, MONTH ASC"""

    # Fetch the result set for this query

    # Your Google BigQuery Project ID
    # To find this, see your dashboard:
    # https://console.developers.google.com/iam-admin/projects?authuser=0
    projectid = xxxxxxxxx

    df = gbq.read_gbq(query, project_id=projectid)

    # Use pandas to process and reshape the dataset
    df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
    df3 = pandas.concat([df2.min(), df2.mean(), df2.max()],
                        axis=1, keys=["Min Tem", "Mean Temp", "Max Temp"])

    The resulting DataFrame is:

    > df3
               Min Tem  Mean Temp    Max Temp
    MONTH
    1       -53.336667  39.827892   89.770968
    2       -49.837500  43.685219   93.437932
    3       -77.926087  48.708355   96.099998
    4       -82.892858  55.070087   97.317240
    5       -92.378261  61.428117  102.042856
    6       -77.703334  65.858888  102.900000
    7       -87.821428  68.169663  106.510714
    8       -89.431999  68.614215  105.500000
    9       -86.611112  63.436935  107.142856
    10      -78.209677  56.880838   92.103333
    11      -50.125000  48.861228   94.996428
    12      -50.332258  42.286879   94.396774

    Warning

    To use this module, you will need a BigQuery account. See <https://cloud.google.com/products/big-query> for details.

    As of 10/10/13, there is a bug in Google's API preventing result sets from being larger than 100,000 rows. A patch is scheduled for the week of 10/14/13.

Internal Refactoring

In 0.13.0 there is a major refactor primarily to subclass Series from NDFrame, which is the base class currently for DataFrame and Panel, to unify methods and behaviors. Series formerly subclassed directly from ndarray. (GH4080, GH3862, GH816)

Warning

There are two potential incompatibilities from < 0.13.0

  • Using certain numpy functions would previously return a Series if passed a Series as an argument. This seems only to affect np.ones_like, np.empty_like, np.diff and np.where. These now return ndarrays.

    In [125]: s = Series([1, 2, 3, 4])

    Numpy Usage

    In [126]: np.ones_like(s)
    Out[126]: array([1, 1, 1, 1])

    In [127]: np.diff(s)
    Out[127]: array([1, 1, 1])

    In [128]: np.where(s > 1, s, np.nan)
    Out[128]: array([ nan,   2.,   3.,   4.])

    Pandonic Usage

    In [129]: Series(1, index=s.index)
    Out[129]:
    0    1
    1    1
    2    1
    3    1
    dtype: int64

    In [130]: s.diff()
    Out[130]:
    0    NaN
    1    1.0
    2    1.0
    3    1.0
    dtype: float64

    In [131]: s.where(s > 1)
    Out[131]:
    0    NaN
    1    2.0
    2    3.0
    3    4.0
    dtype: float64
  • Passing a Series directly to a cython function expecting an ndarray type will no longer work directly; you must pass Series.values. See Enhancing Performance

  • Series(0.5) would previously return the scalar 0.5; instead this will return a 1-element Series

  • This change breaks rpy2 <= 2.3.8. An issue has been opened against rpy2 and a workaround is detailed in GH5698. Thanks @JanSchulz.

  • Pickle compatibility is preserved for pickles created prior to 0.13. These must be unpickled with pd.read_pickle, see Pickling.

  • Refactor of series.py/frame.py/panel.py to move common code to generic.py

    • added _setup_axes to create generic NDFrame structures
    • moved methods
      • from_axes,_wrap_array,axes,ix,loc,iloc,shape,empty,swapaxes,transpose,pop
      • __iter__,keys,__contains__,__len__,__neg__,__invert__
      • convert_objects,as_blocks,as_matrix,values
      • __getstate__,__setstate__ (compat remains in frame/panel)
      • __getattr__,__setattr__
      • _indexed_same,reindex_like,align,where,mask
      • fillna,replace (Series replace is now consistent withDataFrame)
      • filter (also added axis argument to selectively filter on a different axis)
      • reindex,reindex_axis,take
      • truncate (moved to become part ofNDFrame)
  • These are API changes which makePanel more consistent withDataFrame

    • swapaxes on aPanel with the same axes specified now return a copy
    • support attribute access for setting
    • filter supports the same API as the originalDataFrame filter
  • Reindex called with no arguments will now return a copy of the input object

  • TimeSeries is now an alias for Series. The property is_time_series can be used to distinguish (if desired)

  • Refactor of Sparse objects to use BlockManager

    • Created a new block type in internals, SparseBlock, which can hold multi-dtypes and is non-consolidatable. SparseSeries and SparseDataFrame now inherit more methods from their hierarchy (Series/DataFrame), and no longer inherit from SparseArray (which instead is the object of the SparseBlock)
    • Sparse suite now supports integration with non-sparse data. Non-float sparse data is supportable (partially implemented)
    • Operations on sparse structures within DataFrames should preserve sparseness,merging type operations will convert to dense (and back to sparse), so mightbe somewhat inefficient
    • enable setitem on SparseSeries for boolean/integer/slices
    • SparsePanels implementation is unchanged (e.g. not using BlockManager, needs work)
  • added ftypes method to Series/DataFrame, similar to dtypes, but indicates if the underlying is sparse/dense (as well as the dtype)

  • All NDFrame objects can now use __finalize__() to specify various values to propagate to new objects from an existing one (e.g. name in Series will follow more automatically now)

  • Internal type checking is now done via a suite of generated classes, allowing isinstance(value, klass) without having to directly import the klass, courtesy of @jtratner

  • Bug in Series update where the parent frame is not updating its cache based on changes (GH4080) or types (GH3217), fillna (GH3386)

  • Indexing with dtype conversions fixed (GH4463,GH4204)

  • Refactor Series.reindex to core/generic.py (GH4604, GH4618), allow method= in reindexing on a Series to work

  • Series.copy no longer accepts the order parameter and is now consistent with NDFrame copy

  • Refactor rename methods to core/generic.py; fixes Series.rename for (GH4605), and adds rename with the same signature for Panel

  • Refactorclip methods to core/generic.py (GH4798)

  • Refactor of _get_numeric_data/_get_bool_data to core/generic.py, allowing Series/Panel functionality

  • Series (for index) / Panel (for items) now allow attribute access to its elements (GH1903)

    In [132]: s = Series([1, 2, 3], index=list('abc'))

    In [133]: s.b
    Out[133]: 2

    In [134]: s.a = 5

    In [135]: s
    Out[135]:
    a    5
    b    2
    c    3
    dtype: int64

Bug Fixes

See V0.13.0 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.0.

See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.

v0.12.0 (July 24, 2013)

This is a major release from 0.11.0 and includes several new features andenhancements along with a large number of bug fixes.

Highlights include a consistent I/O API naming scheme, routines to read html, write multi-indexes to csv files, read & write STATA data files, read & write JSON format files, Python 3 support for HDFStore, filtering of groupby expressions via filter, and a revamped replace routine that accepts regular expressions.

API changes

  • The I/O API is now much more consistent with a set of top level reader functions accessed like pd.read_csv() that generally return a pandas object.

    • read_csv
    • read_excel
    • read_hdf
    • read_sql
    • read_json
    • read_html
    • read_stata
    • read_clipboard

    The corresponding writer functions are object methods that are accessed like df.to_csv()

    • to_csv
    • to_excel
    • to_hdf
    • to_sql
    • to_json
    • to_html
    • to_stata
    • to_clipboard
  • Fix modulo and integer division on Series, DataFrames to act similarly to float dtypes to return np.nan or np.inf as appropriate (GH3590). This corrects a numpy bug that treats integer and float dtypes differently.

    In [1]: p = DataFrame({'first': [4, 5, 8], 'second': [0, 0, 3]})

    In [2]: p % 0
    Out[2]:
       first  second
    0    NaN     NaN
    1    NaN     NaN
    2    NaN     NaN

    [3 rows x 2 columns]

    In [3]: p % p
    Out[3]:
       first  second
    0    0.0     NaN
    1    0.0     NaN
    2    0.0     0.0

    [3 rows x 2 columns]

    In [4]: p / p
    Out[4]:
       first  second
    0    1.0     NaN
    1    1.0     NaN
    2    1.0     1.0

    [3 rows x 2 columns]

    In [5]: p / 0
    Out[5]:
       first  second
    0    inf     NaN
    1    inf     NaN
    2    inf     inf

    [3 rows x 2 columns]
  • Add squeeze keyword to groupby to allow reduction from DataFrame -> Series if groups are unique. This is a regression from 0.10.1. We are reverting back to the prior behavior. This means groupby will return the same shaped objects whether the groups are unique or not. Revert this issue (GH2893) with (GH3596).

    In [6]: df2 = DataFrame([{"val1": 1, "val2": 20}, {"val1": 1, "val2": 19},
       ...:                  {"val1": 1, "val2": 27}, {"val1": 1, "val2": 12}])

    In [7]: def func(dataf):
       ...:     return dataf["val2"] - dataf["val2"].mean()
       ...:

    # squeezing the result frame to a series (because we have unique groups)
    In [8]: df2.groupby("val1", squeeze=True).apply(func)
    Out[8]:
    0    0.5
    1   -0.5
    2    7.5
    3   -7.5
    Name: 1, dtype: float64

    # no squeezing (the default, and behavior in 0.10.1)
    In [9]: df2.groupby("val1").apply(func)
    Out[9]:
    val2    0    1    2    3
    val1
    1     0.5 -0.5  7.5 -7.5

    [1 rows x 4 columns]
  • Raise on iloc when boolean indexing with a label-based indexer mask; e.g. a boolean Series, even with integer labels, will raise. Since iloc is purely positional based, the labels on the Series are not alignable (GH3631)

    This case is rarely used, and there are plenty of alternatives. This preserves the iloc API to be purely positional based.

    In [10]: df = DataFrame(lrange(5), list('ABCDE'), columns=['a'])

    In [11]: mask = (df.a % 2 == 0)

    In [12]: mask
    Out[12]:
    A     True
    B    False
    C     True
    D    False
    E     True
    Name: a, dtype: bool

    # this is what you should use
    In [13]: df.loc[mask]
    Out[13]:
       a
    A  0
    C  2
    E  4

    [3 rows x 1 columns]

    # this will work as well
    In [14]: df.iloc[mask.values]
    Out[14]:
       a
    A  0
    C  2
    E  4

    [3 rows x 1 columns]

    df.iloc[mask] will raise aValueError

  • The raise_on_error argument to plotting functions is removed. Instead, plotting functions raise a TypeError when the dtype of the object is object, to remind you to avoid object arrays whenever possible; cast to an appropriate numeric dtype if you need to plot something.

  • Add colormap keyword to DataFrame plotting methods. Accepts either a matplotlib colormap object (i.e., matplotlib.cm.jet) or a string name of such an object (i.e., 'jet'). The colormap is sampled to select the color for each column. Please see Colormaps for more information. (GH3860)

  • DataFrame.interpolate() is now deprecated. Please useDataFrame.fillna() andDataFrame.replace() instead. (GH3582,GH3675,GH3676)

  • The method and axis arguments of DataFrame.replace() are deprecated

  • DataFrame.replace's infer_types parameter is removed, and conversion is now performed by default. (GH3907)

  • Add the keyword allow_duplicates to DataFrame.insert to allow a duplicate column to be inserted if True; the default is False (same as prior to 0.12) (GH3679)
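
    A minimal sketch of the new keyword (the frame and its values are hypothetical):

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2]})

    # By default inserting a second column named 'a' raises;
    # allow_duplicates=True permits the duplicate at position 1.
    df.insert(1, 'a', [3, 4], allow_duplicates=True)
    ```

    After the insert, df carries two columns both named 'a'.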

  • Implement__nonzero__ forNDFrame objects (GH3691,GH3696)

  • IO api

    • added top-level function read_excel to replace the following. The original API is deprecated and will be removed in a future version

      from pandas.io.parsers import ExcelFile
      xls = ExcelFile('path_to_file.xls')
      xls.parse('Sheet1', index_col=None, na_values=['NA'])

      With

      import pandas as pd
      pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
    • added top-level functionread_sql that is equivalent to the following

      from pandas.io.sql import read_frame
      read_frame(....)
  • DataFrame.to_html andDataFrame.to_latex now accept a path fortheir first argument (GH3702)

  • Do not allow astypes ondatetime64[ns] except toobject, andtimedelta64[ns] toobject/int (GH3425)

  • The behavior of datetime64 dtypes has changed with respect to certain so-called reduction operations (GH3726). The following operations now raise a TypeError when performed on a Series and return an empty Series when performed on a DataFrame, similar to performing these operations on, for example, a DataFrame of slice objects:

    • sum, prod, mean, std, var, skew, kurt, corr, and cov
  • read_html now defaults to None when reading, and falls back on bs4 + html5lib when lxml fails to parse. A list of parsers to try until success is also valid

  • The internal pandas class hierarchy has changed (slightly). The previous PandasObject is now called PandasContainer, and a new PandasObject has become the baseclass for PandasContainer as well as Index, Categorical, GroupBy, SparseList, and SparseArray (+ their base classes). Currently, PandasObject provides string methods (from StringMixin). (GH4090, GH4092)

  • New StringMixin that, given a __unicode__ method, gets python 2 and python 3 compatible string methods (__str__, __bytes__, and __repr__). Plus string safety throughout. Now employed in many places throughout the pandas library. (GH4090, GH4092)

I/O Enhancements

  • pd.read_html() can now parse HTML strings, files or urls and return DataFrames, courtesy of @cpcloud. (GH3477, GH3605, GH3606, GH3616). It works with a single parser backend: BeautifulSoup4 + html5lib. See the docs

    You can usepd.read_html() to read the output fromDataFrame.to_html() like so

    In [15]: df = DataFrame({'a': range(3), 'b': list('abc')})

    In [16]: print(df)
       a  b
    0  0  a
    1  1  b
    2  2  c

    [3 rows x 2 columns]

    In [17]: html = df.to_html()

    In [18]: alist = pd.read_html(html, index_col=0)

    In [19]: print(df == alist[0])
          a     b
    0  True  True
    1  True  True
    2  True  True

    [3 rows x 2 columns]

    Note that alist here is a Python list, so pd.read_html() and DataFrame.to_html() are not inverses.

    • pd.read_html() no longer performs hard conversion of date strings(GH3656).

    Warning

    You may have to install an older version of BeautifulSoup4. See the installation docs

  • Added module for reading and writing Stata files: pandas.io.stata (GH1512), accessible via the read_stata top-level function for reading and the to_stata DataFrame method for writing. See the docs

  • Added module for reading and writing json format files: pandas.io.json, accessible via the read_json top-level function for reading and the to_json DataFrame method for writing. See the docs. Various issues (GH1226, GH3804, GH3876, GH3867, GH1305)

  • MultiIndex column support for reading and writing csv format files

    • Theheader option inread_csv now accepts alist of the rows from which to read the index.

    • The option tupleize_cols can now be specified in both to_csv and read_csv, to provide compatibility for the pre-0.12 behavior of writing and reading MultiIndex columns via a list of tuples. The default in 0.12 is to write lists of tuples and not interpret lists of tuples as a MultiIndex column.

      Note: The default behavior in 0.12 remains unchanged from prior versions, but starting with 0.13, the default to write and read MultiIndex columns will be in the new format. (GH3571, GH1651, GH3141)

    • If an index_col is not specified (e.g. you don't have an index, or wrote it with df.to_csv(..., index=False)), then any names on the columns index will be lost.

      In [20]:frompandas.util.testingimportmakeCustomDataframeasmkdfIn [21]:df=mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)In [22]:df.to_csv('mi.csv',tupleize_cols=False)In [23]:print(open('mi.csv').read())C0,,C_l0_g0,C_l0_g1,C_l0_g2C1,,C_l1_g0,C_l1_g1,C_l1_g2C2,,C_l2_g0,C_l2_g1,C_l2_g2C3,,C_l3_g0,C_l3_g1,C_l3_g2R0,R1,,,R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2In [24]:pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)Out[24]:C0              C_l0_g0 C_l0_g1 C_l0_g2C1              C_l1_g0 C_l1_g1 C_l1_g2C2              C_l2_g0 C_l2_g1 C_l2_g2C3              C_l3_g0 C_l3_g1 C_l3_g2R0      R1R_l0_g0 R_l1_g0    R0C0    R0C1    R0C2R_l0_g1 R_l1_g1    R1C0    R1C1    R1C2R_l0_g2 R_l1_g2    R2C0    R2C1    R2C2R_l0_g3 R_l1_g3    R3C0    R3C1    R3C2R_l0_g4 R_l1_g4    R4C0    R4C1    R4C2[5 rows x 3 columns]
  • Support forHDFStore (viaPyTables3.0.0) on Python3

  • Iterator support via read_hdf that automatically opens and closes the store when iteration is finished. This is only for tables

    In [25]: path = 'store_iterator.h5'

    In [26]: DataFrame(randn(10, 2)).to_hdf(path, 'df', table=True)

    In [27]: for df in read_hdf(path, 'df', chunksize=3):
       ....:     print df
       ....:
              0         1
    0  0.713216 -0.778461
    1 -0.661062  0.862877
    2  0.344342  0.149565
              0         1
    3 -0.626968 -0.875772
    4 -0.930687 -0.218983
    5  0.949965 -0.442354
              0         1
    6 -0.402985  1.111358
    7 -0.241527 -0.670477
    8  0.049355  0.632633
              0         1
    9 -1.502767 -1.225492
  • read_csv will now throw a more informative error message when a filecontains no columns, e.g., all newline characters

Other Enhancements

  • DataFrame.replace() now allows regular expressions on contained Series with object dtype. See the examples section in the regular docs, Replacing via String Expression

    For example you can do

    In [25]: df = DataFrame({'a': list('ab..'), 'b': [1, 2, 3, 4]})

    In [26]: df.replace(regex=r'\s*\.\s*', value=np.nan)
    Out[26]:
         a  b
    0    a  1
    1    b  2
    2  NaN  3
    3  NaN  4

    [4 rows x 2 columns]

    to replace all occurrences of the string'.' with zero or moreinstances of surrounding whitespace withNaN.

    Regular string replacement still works as expected. For example, you can do

    In [27]: df.replace('.', np.nan)
    Out[27]:
         a  b
    0    a  1
    1    b  2
    2  NaN  3
    3  NaN  4

    [4 rows x 2 columns]

    to replace all occurrences of the string'.' withNaN.

  • pd.melt() now accepts the optional parameters var_name and value_name to specify custom column names of the returned DataFrame.
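
    A short sketch of the new parameters (the frame and the chosen names are hypothetical):

    ```python
    import pandas as pd

    df = pd.DataFrame({'id': [1, 2], 'height': [1.7, 1.8], 'weight': [60.0, 70.0]})

    # var_name/value_name replace the default 'variable'/'value' column names
    long_df = pd.melt(df, id_vars=['id'], var_name='quantity', value_name='reading')
    ```

    The result has columns id, quantity, and reading rather than the defaults.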

  • pd.set_option() now allows N option, value pairs (GH3667).

    Let's say that we had an option 'a.b' and another option 'b.c'. We can set them at the same time:

    In [28]: pd.get_option('a.b')
    Out[28]: 2

    In [29]: pd.get_option('b.c')
    Out[29]: 3

    In [30]: pd.set_option('a.b', 1, 'b.c', 4)

    In [31]: pd.get_option('a.b')
    Out[31]: 1

    In [32]: pd.get_option('b.c')
    Out[32]: 4
  • Thefilter method for group objects returns a subset of the originalobject. Suppose we want to take only elements that belong to groups with agroup sum greater than 2.

    In [33]: sf = Series([1, 1, 2, 3, 3, 3])

    In [34]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
    Out[34]:
    3    3
    4    3
    5    3
    dtype: int64

    The argument of filter must be a function that, applied to the group as a whole, returns True or False.

    Another useful operation is filtering out elements that belong to groupswith only a couple members.

    In [35]: dff = DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

    In [36]: dff.groupby('B').filter(lambda x: len(x) > 2)
    Out[36]:
       A  B
    2  2  b
    3  3  b
    4  4  b
    5  5  b

    [4 rows x 2 columns]

    Alternatively, instead of dropping the offending groups, we can return a like-indexed object where the groups that do not pass the filter are filled with NaNs.

    In [37]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
    Out[37]:
         A    B
    0  NaN  NaN
    1  NaN  NaN
    2  2.0    b
    3  3.0    b
    4  4.0    b
    5  5.0    b
    6  NaN  NaN
    7  NaN  NaN

    [8 rows x 2 columns]
  • Series and DataFrame hist methods now take afigsize argument (GH3834)

  • DatetimeIndexes no longer try to convert mixed-integer indexes during joinoperations (GH3877)

  • Timestamp.min and Timestamp.max now represent valid Timestamp instances insteadof the default datetime.min and datetime.max (respectively), thanks @SleepingPills

  • read_html now raises when no tables are found and BeautifulSoup==4.2.0is detected (GH4214)

Experimental Features

  • Added experimentalCustomBusinessDay class to supportDateOffsetswith custom holiday calendars and custom weekmasks. (GH2301)

    Note

    This uses thenumpy.busdaycalendar API introduced in Numpy 1.7 andtherefore requires Numpy 1.7.0 or newer.

    In [38]:frompandas.tseries.offsetsimportCustomBusinessDayIn [39]:fromdatetimeimportdatetime# As an interesting example, let's look at Egypt where# a Friday-Saturday weekend is observed.In [40]:weekmask_egypt='Sun Mon Tue Wed Thu'# They also observe International Workers' Day so let's# add that for a couple of yearsIn [41]:holidays=['2012-05-01',datetime(2013,5,1),np.datetime64('2014-05-01')]In [42]:bday_egypt=CustomBusinessDay(holidays=holidays,weekmask=weekmask_egypt)In [43]:dt=datetime(2013,4,30)In [44]:print(dt+2*bday_egypt)2013-05-05 00:00:00In [45]:dts=date_range(dt,periods=5,freq=bday_egypt)In [46]:print(Series(dts.weekday,dts).map(Series('Mon Tue Wed Thu Fri Sat Sun'.split())))2013-04-30    Tue2013-05-02    Thu2013-05-05    Sun2013-05-06    Mon2013-05-07    TueFreq: C, dtype: object

Bug Fixes

  • Plotting functions now raise a TypeError before trying to plot anything if the associated objects have a dtype of object (GH1818, GH3572, GH3911, GH3912), but they will try to convert object arrays to numeric arrays if possible so that you can still plot, for example, an object array with floats. This happens before any drawing takes place, which eliminates any spurious plots from showing up.

  • fillna methods now raise a TypeError if the value parameter is a list or tuple.
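
    Under this rule a list value is rejected while a scalar still works; a minimal sketch with hypothetical data:

    ```python
    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0])

    try:
        s.fillna([0.0, 0.0, 0.0])  # list values now raise
        raised = False
    except TypeError:
        raised = True

    filled = s.fillna(0.0)  # scalar values are still fine
    ```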

  • Series.str now supports iteration (GH3638). You can iterate over the individual elements of each string in the Series. Each iteration yields a Series with either a single character at each index of the original Series or NaN. For example,

    In [47]:strs='go','bow','joe','slow'In [48]:ds=Series(strs)In [49]:forsinds.str:   ....:print(s)   ....:0    g1    b2    j3    sdtype: object0    o1    o2    o3    ldtype: object0    NaN1      w2      e3      odtype: object0    NaN1    NaN2    NaN3      wdtype: objectIn [50]:sOut[50]:0    NaN1    NaN2    NaN3      wdtype: objectIn [51]:s.dropna().values.item()=='w'Out[51]:True

    The last element yielded by the iterator will be a Series containing the last element of the longest string in the Series with all other elements being NaN. Here, since 'slow' is the longest string and there are no other strings with the same length, 'w' is the only non-null string in the yielded Series.

  • HDFStore

    • will retain index attributes (freq,tz,name) on recreation (GH3499)
    • will warn with an AttributeConflictWarning if you are attempting to append an index with a different frequency than the existing one, or attempting to append an index with a different name than the existing one
    • support datelike columns with a timezone as data_columns (GH2852)
  • Non-unique index support clarified (GH3468).

    • Fix bug where assigning a new index to a duplicate index in a DataFrame would fail (GH3468)
    • Fix construction of a DataFrame with a duplicate index
    • ref_locs support to allow duplicative indices across dtypes,allows iget support to always find the index (even across dtypes) (GH2194)
    • applymap on a DataFrame with a non-unique index now works(removed warning) (GH2786), and fix (GH3230)
    • Fix to_csv to handle non-unique columns (GH3495)
    • Duplicate indexes with getitem will return items in the correct order (GH3455,GH3457)and handle missing elements like unique indices (GH3561)
    • Duplicate indexes with an empty DataFrame.from_records will return a correct frame (GH3562)
    • Concat to produce a non-unique columns when duplicates are across dtypes is fixed (GH3602)
    • Allow insert/delete to non-unique columns (GH3679)
    • Non-unique indexing with a slice vialoc and friends fixed (GH3659)
    • Extendreindex to correctly deal with non-unique indices (GH3679)
    • DataFrame.itertuples() now works with frames with duplicate columnnames (GH3873)
    • Bug in non-unique indexing viailoc (GH4017); addedtakeable argument toreindex for location-based taking
    • Allow non-unique indexing in series via.ix/.loc and__getitem__ (GH4246)
    • Fixed non-unique indexing memory allocation issue with.ix/.loc (GH4280)
  • DataFrame.from_records did not accept empty recarrays (GH3682)

  • read_html now correctly skips tests (GH3741)

  • Fixed a bug whereDataFrame.replace with a compiled regular expressionin theto_replace argument wasn’t working (GH3907)

  • Improvednetwork test decorator to catchIOError (and thereforeURLError as well). Addedwith_connectivity_check decorator to allowexplicitly checking a website as a proxy for seeing if there is networkconnectivity. Plus, newoptional_args decorator factory for decorators.(GH3910,GH3914)

  • Fixed testing issue where too many sockets were open, thus leading to a connection reset issue (GH3982, GH3985, GH4028, GH4054)

  • Fixed failing tests in test_yahoo, test_google where symbols were notretrieved but were being accessed (GH3982,GH3985,GH4028,GH4054)

  • Series.hist will now take the figure from the current environment ifone is not passed

  • Fixed bug where a 1xN DataFrame would barf on a 1xN mask (GH4071)

  • Fixed running oftox under python3 where the pickle import was gettingrewritten in an incompatible way (GH4062,GH4063)

  • Fixed bug where sharex and sharey were not being passed to grouped_hist(GH4089)

  • Fixed bug inDataFrame.replace where a nested dict wasn’t beingiterated over when regex=False (GH4115)

  • Fixed bug in the parsing of microseconds when using theformatargument into_datetime (GH4152)

  • Fixed bug inPandasAutoDateLocator whereinvert_xaxis triggeredincorrectlyMilliSecondLocator (GH3990)

  • Fixed bug in plotting that wasn’t raising on invalid colormap formatplotlib 1.1.1 (GH4215)

  • Fixed the legend displaying inDataFrame.plot(kind='kde') (GH4216)

  • Fixed bug where Index slices weren’t carrying the name attribute(GH4226)

  • Fixed bug in initializingDatetimeIndex with an array of stringsin a certain time zone (GH4229)

  • Fixed bug where html5lib wasn’t being properly skipped (GH4265)

  • Fixed bug where get_data_famafrench wasn’t using the correct file edges(GH4281)

See thefull release notes or issue trackeron GitHub for a complete list.

v0.11.0 (April 22, 2013)

This is a major release from 0.10.1 and includes many new features and enhancements along with a large number of bug fixes. The methods of Selecting Data have had quite a number of additions, and Dtype support is now full-fledged. There are also a number of important API changes that long-time pandas users should pay close attention to.

There is a new section in the documentation,10 Minutes to Pandas,primarily geared to new users.

There is a new section in the documentation,Cookbook, a collectionof useful recipes in pandas (and that we want contributions!).

There are several libraries that are nowRecommended Dependencies

Selection Choices

Starting in 0.11.0, object selection has had a number of user-requested additions inorder to support more explicit location based indexing. Pandas now supportsthree types of multi-axis indexing.

  • .loc is strictly label based, will raiseKeyError when the items are not found, allowed inputs are:

    • A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index)
    • A list or array of labels ['a', 'b', 'c']
    • A slice object with labels 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!)
    • A boolean array

    See more atSelection by Label

  • .iloc is strictly integer position based (from 0 to length-1 of the axis), and will raise IndexError when the requested indices are out of bounds. Allowed inputs are:

    • An integer e.g.5
    • A list or array of integers[4,3,0]
    • A slice object with ints1:7
    • A boolean array

    See more atSelection by Position

  • .ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access. .ix is the most general and will support any of the inputs to .loc and .iloc, as well as support for floating point label schemes. .ix is especially useful when dealing with mixed positional and label based hierarchical indexes.

    As using integer slices with .ix has different behavior depending on whether the slice is interpreted as position based or label based, it's usually better to be explicit and use .iloc or .loc.

    See more atAdvanced Indexing andAdvanced Hierarchical.
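
    The label/position distinction above can be sketched on a small Series (the values here are hypothetical):

    ```python
    import pandas as pd

    s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

    by_label = s.loc['a':'b']   # label slice: both endpoints are included
    by_position = s.iloc[0:2]   # positional slice: usual half-open semantics
    ```

    Both select the first two elements here, but .loc includes the stop label 'b' while .iloc stops before position 2.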

Selection Deprecations

Starting in version 0.11.0, these methods may be deprecated in future versions.

  • irow
  • icol
  • iget_value

See the sectionSelection by Position for substitutes.

Dtypes

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.

In [1]:df1=DataFrame(randn(8,1),columns=['A'],dtype='float32')In [2]:df1Out[2]:          A0  1.3926651 -0.1234972 -0.4027613 -0.2466044 -0.2884335 -0.7634346  2.0695267 -1.203569[8 rows x 1 columns]In [3]:df1.dtypesOut[3]:A    float32dtype: objectIn [4]:df2=DataFrame(dict(A=Series(randn(8),dtype='float16'),   ...:B=Series(randn(8)),   ...:C=Series(randn(8),dtype='uint8')))   ...:In [5]:df2Out[5]:          A         B    C0  0.591797 -0.038605    01  0.841309 -0.460478    12 -0.500977 -0.310458    03 -0.816406  0.866493  2544 -0.207031  0.245972    05 -0.664062  0.319442    16  0.580566  1.378512    17 -0.965820  0.292502  255[8 rows x 3 columns]In [6]:df2.dtypesOut[6]:A    float16B    float64C      uint8dtype: object# here you get some upcastingIn [7]:df3=df1.reindex_like(df2).fillna(value=0.0)+df2In [8]:df3Out[8]:          A         B      C0  1.984462 -0.038605    0.01  0.717812 -0.460478    1.02 -0.903737 -0.310458    0.03 -1.063011  0.866493  254.04 -0.495465  0.245972    0.05 -1.427497  0.319442    1.06  2.650092  1.378512    1.07 -2.169390  0.292502  255.0[8 rows x 3 columns]In [9]:df3.dtypesOut[9]:A    float32B    float64C    float64dtype: object

Dtype Conversion

This is lowest-common-denominator upcasting, meaning you get the dtype which can accommodate all of the types

In [10]: df3.values.dtype
Out[10]: dtype('float64')

Conversion

In [11]: df3.astype('float32').dtypes
Out[11]:
A    float32
B    float32
C    float32
dtype: object

Mixed Conversion

In [12]:df3['D']='1.'In [13]:df3['E']='1'In [14]:df3.convert_objects(convert_numeric=True).dtypesOut[14]:A    float32B    float64C    float64D    float64E      int64dtype: object# same, but specific dtype conversionIn [15]:df3['D']=df3['D'].astype('float16')In [16]:df3['E']=df3['E'].astype('int32')In [17]:df3.dtypesOut[17]:A    float32B    float64C    float64D    float16E      int32dtype: object

Forcing Date coercion (and settingNaT when not datelike)

In [18]: from datetime import datetime

In [19]: s = Series([datetime(2001, 1, 1, 0, 0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'], dtype='O')
   ....:

In [20]: s.convert_objects(convert_dates='coerce')
Out[20]:
0   2001-01-01
1          NaT
2          NaT
3          NaT
4   2001-01-04
5   2001-01-05
dtype: datetime64[ns]

Dtype Gotchas

Platform Gotchas

Starting in 0.11.0, construction of DataFrame/Series will use default dtypes ofint64 andfloat64,regardless of platform. This is not an apparent change from earlier versions of pandas. If you specifydtypes, theyWILL be respected, however (GH2837)

The following will all result inint64 dtypes

In [21]: DataFrame([1, 2], columns=['a']).dtypes
Out[21]:
a    int64
dtype: object

In [22]: DataFrame({'a': [1, 2]}).dtypes
Out[22]:
a    int64
dtype: object

In [23]: DataFrame({'a': 1}, index=range(2)).dtypes
Out[23]:
a    int64
dtype: object

Keep in mind thatDataFrame(np.array([1,2]))WILL result inint32 on 32-bit platforms!

Upcasting Gotchas

Performing indexing operations on integer type data can easily upcast the data.The dtype of the input data will be preserved in cases wherenans are not introduced.

In [24]:dfi=df3.astype('int32')In [25]:dfi['D']=dfi['D'].astype('int64')In [26]:dfiOut[26]:   A  B    C  D  E0  1  0    0  1  11  0  0    1  1  12  0  0    0  1  13 -1  0  254  1  14  0  0    0  1  15 -1  0    1  1  16  2  1    1  1  17 -2  0  255  1  1[8 rows x 5 columns]In [27]:dfi.dtypesOut[27]:A    int32B    int32C    int32D    int64E    int32dtype: objectIn [28]:casted=dfi[dfi>0]In [29]:castedOut[29]:     A    B      C  D  E0  1.0  NaN    NaN  1  11  NaN  NaN    1.0  1  12  NaN  NaN    NaN  1  13  NaN  NaN  254.0  1  14  NaN  NaN    NaN  1  15  NaN  NaN    1.0  1  16  2.0  1.0    1.0  1  17  NaN  NaN  255.0  1  1[8 rows x 5 columns]In [30]:casted.dtypesOut[30]:A    float64B    float64C    float64D      int64E      int32dtype: object

While float dtypes are unchanged.

In [31]:df4=df3.copy()In [32]:df4['A']=df4['A'].astype('float32')In [33]:df4.dtypesOut[33]:A    float32B    float64C    float64D    float16E      int32dtype: objectIn [34]:casted=df4[df4>0]In [35]:castedOut[35]:          A         B      C    D  E0  1.984462       NaN    NaN  1.0  11  0.717812       NaN    1.0  1.0  12       NaN       NaN    NaN  1.0  13       NaN  0.866493  254.0  1.0  14       NaN  0.245972    NaN  1.0  15       NaN  0.319442    1.0  1.0  16  2.650092  1.378512    1.0  1.0  17       NaN  0.292502  255.0  1.0  1[8 rows x 5 columns]In [36]:casted.dtypesOut[36]:A    float32B    float64C    float64D    float16E      int32dtype: object

Datetimes Conversion

Datetime64[ns] columns in a DataFrame (or a Series) allow the use of np.nan to indicate a nan value, in addition to the traditional NaT, or not-a-time. This allows convenient nan setting in a generic way. Furthermore, datetime64[ns] columns are created by default when passed datetimelike objects (this change was introduced in 0.10.1) (GH2809, GH2810)

In [37]:df=DataFrame(randn(6,2),date_range('20010102',periods=6),columns=['A','B'])In [38]:df['timestamp']=Timestamp('20010103')In [39]:dfOut[39]:                   A         B  timestamp2001-01-02  1.023958  0.660103 2001-01-032001-01-03  1.236475 -2.170629 2001-01-032001-01-04 -0.270630 -1.685677 2001-01-032001-01-05 -0.440747 -0.115070 2001-01-032001-01-06 -0.632102 -0.585977 2001-01-032001-01-07 -1.444787 -0.201135 2001-01-03[6 rows x 3 columns]# datetime64[ns] out of the boxIn [40]:df.get_dtype_counts()Out[40]:datetime64[ns]    1float64           2dtype: int64# use the traditional nan, which is mapped to NaT internallyIn [41]:df.ix[2:4,['A','timestamp']]=np.nanIn [42]:dfOut[42]:                   A         B  timestamp2001-01-02  1.023958  0.660103 2001-01-032001-01-03  1.236475 -2.170629 2001-01-032001-01-04       NaN -1.685677        NaT2001-01-05       NaN -0.115070        NaT2001-01-06 -0.632102 -0.585977 2001-01-032001-01-07 -1.444787 -0.201135 2001-01-03[6 rows x 3 columns]

Astype conversion on datetime64[ns] to object implicitly converts NaT to np.nan

In [43]:importdatetimeIn [44]:s=Series([datetime.datetime(2001,1,2,0,0)foriinrange(3)])In [45]:s.dtypeOut[45]:dtype('<M8[ns]')In [46]:s[1]=np.nanIn [47]:sOut[47]:0   2001-01-021          NaT2   2001-01-02dtype: datetime64[ns]In [48]:s.dtypeOut[48]:dtype('<M8[ns]')In [49]:s=s.astype('O')In [50]:sOut[50]:0    2001-01-02 00:00:001                    NaT2    2001-01-02 00:00:00dtype: objectIn [51]:s.dtypeOut[51]:dtype('O')

API changes

  • Added to_series() method to indices, to facilitate the creation of indexers (GH3275)
  • HDFStore
    • added the methodselect_column to select a single column from a table as a Series.
    • deprecated the unique method; it can be replicated by select_column(key, column).unique()
    • min_itemsize parameter toappend will now automatically create data_columns for passed keys

Enhancements

  • Improved performance of df.to_csv() by up to 10x in some cases. (GH3059)

  • Numexpr is now a Recommended Dependency, to accelerate certain types of numerical and boolean operations

  • Bottleneck is now a Recommended Dependency, to accelerate certain types of nan operations

  • HDFStore

    • supportread_hdf/to_hdf API similar toread_csv/to_csv

      In [52]: df = DataFrame(dict(A=lrange(5), B=lrange(5)))

      In [53]: df.to_hdf('store.h5', 'table', append=True)

      In [54]: read_hdf('store.h5', 'table', where=['index>2'])
      Out[54]:
         A  B
      3  3  3
      4  4  4

      [2 rows x 2 columns]
    • provide dotted attribute access toget from stores, e.g.store.df==store['df']

    • new keywordsiterator=boolean, andchunksize=number_in_a_chunk areprovided to support iteration onselect andselect_as_multiple (GH3076)

  • You can now select timestamps from anunordered timeseries similarly to anordered timeseries (GH2437)

  • You can now select with a string from a DataFrame with a datelike index, in a similar way to a Series (GH3070)

    In [55]:idx=date_range("2001-10-1",periods=5,freq='M')In [56]:ts=Series(np.random.rand(len(idx)),index=idx)In [57]:ts['2001']Out[57]:2001-10-31    0.6632562001-11-30    0.0791262001-12-31    0.587699Freq: M, dtype: float64In [58]:df=DataFrame(dict(A=ts))In [59]:df['2001']Out[59]:                   A2001-10-31  0.6632562001-11-30  0.0791262001-12-31  0.587699[3 rows x 1 columns]
  • Squeeze to possibly remove length 1 dimensions from an object.

    In [60]:p=Panel(randn(3,4,4),items=['ItemA','ItemB','ItemC'],   ....:major_axis=date_range('20010102',periods=4),   ....:minor_axis=['A','B','C','D'])   ....:In [61]:pOut[61]:<class 'pandas.core.panel.Panel'>Dimensions: 3 (items) x 4 (major_axis) x 4 (minor_axis)Items axis: ItemA to ItemCMajor_axis axis: 2001-01-02 00:00:00 to 2001-01-05 00:00:00Minor_axis axis: A to DIn [62]:p.reindex(items=['ItemA']).squeeze()Out[62]:                   A         B         C         D2001-01-02 -1.203403  0.425882 -0.436045 -0.9824622001-01-03  0.348090 -0.969649  0.121731  0.2027982001-01-04  1.215695 -0.218549 -0.631381 -0.3371162001-01-05  0.404238  0.907213 -0.865657  0.483186[4 rows x 4 columns]In [63]:p.reindex(items=['ItemA'],minor=['B']).squeeze()Out[63]:2001-01-02    0.4258822001-01-03   -0.9696492001-01-04   -0.2185492001-01-05    0.907213Freq: D, Name: B, dtype: float64
  • Inpd.io.data.Options,

    • Fix bug when trying to fetch data for the current month when alreadypast expiry.
    • Now using lxml to scrape html instead of BeautifulSoup (lxml was faster).
    • New instance variables for calls and puts are automatically created when a method that creates them is called. This works for the current month, where the instance variables are simply calls and puts. It also works for future expiry months, saving the instance variable as callsMMYY or putsMMYY, where MM and YY are, respectively, the month and year of the option's expiry.
    • Options.get_near_stock_price now allows the user to specify themonth for which to get relevant options data.
    • Options.get_forward_data now has optional kwargsnear andabove_below. This allows the user to specify if they would like toonly return forward looking data for options near the current stockprice. This just obtains the data from Options.get_near_stock_priceinstead of Options.get_xxx_data() (GH2758).
  • Cursor coordinate information is now displayed in time-series plots.

  • added option display.max_seq_items to control the number of elements printed per sequence when pretty-printing it. (GH2979)

  • added optiondisplay.chop_threshold to control display of small numericalvalues. (GH2739)

  • added optiondisplay.max_info_rows to prevent verbose_info from beingcalculated for frames above 1M rows (configurable). (GH2807,GH2918)

  • value_counts() now accepts a "normalize" argument, for normalized histograms. (GH2710).
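
    A minimal sketch of the normalize argument (the data is hypothetical):

    ```python
    import pandas as pd

    s = pd.Series(['a', 'a', 'b', 'b', 'b', 'c'])

    counts = s.value_counts()                   # raw counts per value
    fractions = s.value_counts(normalize=True)  # counts divided by the total
    ```

    The normalized result sums to 1, giving relative frequencies rather than counts.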

  • DataFrame.from_records now accepts not only dicts but any instance ofthe collections.Mapping ABC.

  • added optiondisplay.mpl_style providing a sleeker visual stylefor plots. Based onhttps://gist.github.com/huyng/816622 (GH3075).

  • Treat boolean values as integers (values 1 and 0) for numericoperations. (GH2641)

  • to_html() now accepts an optional "escape" argument to control reserved HTML character escaping (enabled by default) and escapes &, in addition to < and >. (GH2919)
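
    A brief sketch of the escape argument (the cell contents are hypothetical):

    ```python
    import pandas as pd

    df = pd.DataFrame({'col': ['<b>bold</b> & more']})

    escaped = df.to_html()                # escape=True is the default
    unescaped = df.to_html(escape=False)  # raw HTML passes through unmodified
    ```

    In the default output, <, >, and & become &lt;, &gt;, and &amp;; with escape=False the markup survives intact.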

See thefull release notes or issue trackeron GitHub for a complete list.

v0.10.1 (January 22, 2013)

This is a minor release from 0.10.0 and includes new features, enhancements,and bug fixes. In particular, there is substantial new HDFStore functionalitycontributed by Jeff Reback.

An undesired API breakage with functions taking theinplace option has beenreverted and deprecation warnings added.

API changes

  • Functions taking aninplace option return the calling object as before. Adeprecation message has been added
  • Groupby aggregations Max/Min no longer exclude non-numeric data (GH2700)
  • Resampling an empty DataFrame now returns an empty DataFrame instead ofraising an exception (GH2640)
  • The file reader will now raise an exception when NA values are found in anexplicitly specified integer column instead of converting the column to float(GH2631)
  • DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH2563)

New features

  • MySQL support for database (contribution from Dan Allan)

HDFStore

You may need to upgrade your existing data files. Please visit thecompatibility section in the main docs.

You can designate (and index) certain columns that you want to be able to perform queries on a table, by passing a list to data_columns

In [1]:store=HDFStore('store.h5')In [2]:df=DataFrame(randn(8,3),index=date_range('1/1/2000',periods=8),   ...:columns=['A','B','C'])   ...:In [3]:df['string']='foo'In [4]:df.ix[4:6,'string']=np.nanIn [5]:df.ix[7:9,'string']='bar'In [6]:df['string2']='cool'In [7]:dfOut[7]:                   A         B         C string string22000-01-01  1.885136 -0.183873  2.550850    foo    cool2000-01-02  0.180759 -1.117089  0.061462    foo    cool2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool2000-01-04  3.127110  1.451130  0.045152    foo    cool2000-01-05 -0.242846  1.195819  1.533294    NaN    cool2000-01-06  0.820521 -0.281201  1.651561    NaN    cool2000-01-07 -0.034086  0.252394 -0.498772    foo    cool2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool[8 rows x 5 columns]# on-disk operationsIn [8]:store.append('df',df,data_columns=['B','C','string','string2'])In [9]:store.select('df',['B > 0','string == foo'])Out[9]:Empty DataFrameColumns: [A, B, C, string, string2]Index: [][0 rows x 5 columns]# this is in-memory version of this type of selectionIn [10]:df[(df.B>0)&(df.string=='foo')]Out[10]:                   A         B         C string string22000-01-04  3.127110  1.451130  0.045152    foo    cool2000-01-07 -0.034086  0.252394 -0.498772    foo    cool[2 rows x 5 columns]

Retrieving unique values in an indexable or data column.

# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique('df', 'index')
store.unique('df', 'string')

You can now store datetime64 in data columns

In [11]: df_mixed = df.copy()

In [12]: df_mixed['datetime64'] = Timestamp('20010102')

In [13]: df_mixed.ix[3:4, ['A', 'B']] = np.nan

In [14]: store.append('df_mixed', df_mixed)

In [15]: df_mixed1 = store.select('df_mixed')

In [16]: df_mixed1
Out[16]:
                   A         B         C string string2 datetime64
2000-01-01  1.885136 -0.183873  2.550850    foo    cool 2001-01-02
2000-01-02  0.180759 -1.117089  0.061462    foo    cool 2001-01-02
2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool 2001-01-02
2000-01-04       NaN       NaN  0.045152    foo    cool 2001-01-02
2000-01-05 -0.242846  1.195819  1.533294    NaN    cool 2001-01-02
2000-01-06  0.820521 -0.281201  1.651561    NaN    cool 2001-01-02
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool 2001-01-02
2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool 2001-01-02

[8 rows x 6 columns]

In [17]: df_mixed1.get_dtype_counts()
Out[17]:
datetime64[ns]    1
float64           3
object            2
dtype: int64

You can pass a columns keyword to select to filter the list of returned columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter)

In [18]: store.select('df', columns=['A', 'B'])
Out[18]:
                   A         B
2000-01-01  1.885136 -0.183873
2000-01-02  0.180759 -1.117089
2000-01-03 -0.294467 -0.591411
2000-01-04  3.127110  1.451130
2000-01-05 -0.242846  1.195819
2000-01-06  0.820521 -0.281201
2000-01-07 -0.034086  0.252394
2000-01-08 -2.290958 -1.601262

[8 rows x 2 columns]

HDFStore now serializes multi-index dataframes when appending tables.

In [19]: index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                            ['one', 'two', 'three']],
   ....:                    labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                            [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                    names=['foo', 'bar'])
   ....:

In [20]: df = DataFrame(np.random.randn(10, 3), index=index,
   ....:                columns=['A', 'B', 'C'])
   ....:

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

# the levels are automatically included as data columns
In [24]: store.select('mi', Term('foo=bar'))
Out[24]:
Empty DataFrame
Columns: [A, B, C]
Index: []

[0 rows x 3 columns]

Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.

In [25]: df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
   ....:                   columns=['A', 'B', 'C', 'D', 'E', 'F'])
   ....:

In [26]: df_mt['foo'] = 'bar'

# you can also create the tables individually
In [27]: store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
   ....:                          df_mt, selector='df1_mt')

In [28]: store
Out[28]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df                  frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df1_mt              frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])
/df2_mt              frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index])
/df_mixed            frame_table  (typ->appendable,nrows->8,ncols->6,indexers->[index])
/mi                  frame_table  (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])

# individual tables were created
In [29]: store.select('df1_mt')
Out[29]:
                   A         B
2000-01-01  1.586924 -0.447974
2000-01-02 -0.102206  0.870302
2000-01-03  1.249874  1.458210
2000-01-04 -0.616293  0.150468
2000-01-05 -0.431163  0.016640
2000-01-06  0.800353 -0.451572
2000-01-07  1.239198  0.185437
2000-01-08 -0.040863  0.290110

[8 rows x 2 columns]

In [30]: store.select('df2_mt')
Out[30]:
                   C         D         E         F  foo
2000-01-01 -1.573998  0.630925 -0.071659 -1.277640  bar
2000-01-02  1.275280 -1.199212  1.060780  1.673018  bar
2000-01-03 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-04  0.132104  0.580923 -0.128750  1.445964  bar
2000-01-05  0.904578 -1.645852 -0.688741  0.228006  bar
2000-01-06  0.831767  0.228760  0.932498 -2.200069  bar
2000-01-07 -0.540770 -0.370038  1.298390  1.662964  bar
2000-01-08 -0.096145  1.717830 -0.462446 -0.112019  bar

[8 rows x 5 columns]

# as a multiple
In [31]: store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
   ....:                          selector='df1_mt')
Out[31]:
                   A         B         C         D         E         F  foo
2000-01-03  1.249874  1.458210 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-07  1.239198  0.185437 -0.540770 -0.370038  1.298390  1.662964  bar

[2 rows x 7 columns]

Enhancements

  • HDFStore can now read native PyTables table format tables
  • You can pass nan_rep='my_nan_rep' to append, to change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan.
  • You can pass index to append. This defaults to True. This will automatically create indices on the indexables and data columns of the table
  • You can pass chunksize=<integer> to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
  • You can pass expectedrows=<integer> to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance.
  • select now supports passing start and stop to limit the selection space.
  • Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH2698)
  • Allow DataFrame.merge to handle combinatorial sizes too large for a 64-bit integer (GH2690)
  • Series now has unary negation (-series) and inversion (~series) operators (GH2686)
  • DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH2327)
  • Series arithmetic operators can now handle constant and ndarray input (GH2574)
  • ExcelFile now takes a kind argument to specify the file type (GH2613)
  • A faster implementation for Series.str methods (GH2602)
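The new unary Series operators mentioned above can be illustrated with a minimal sketch (written against the current pandas API):

```python
import pandas as pd

# elementwise negation of a numeric Series
s = pd.Series([1, -2, 3])
neg = -s
print(neg.tolist())  # [-1, 2, -3]

# elementwise inversion of a boolean Series
b = pd.Series([True, False, True])
inv = ~b
print(inv.tolist())  # [False, True, False]
```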

Bug Fixes

  • HDFStore tables can now store float32 types correctly (they cannot be mixed with float64, however)
  • Fixed Google Analytics prefix when specifying request segment (GH2713).
  • Function to reset Google Analytics token store so users can recover from improperly setup client secrets (GH2687).
  • Fixed groupby bug resulting in segfault when passing in MultiIndex (GH2706)
  • Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH2699)
  • Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH2694)
  • Fixed performance issues while aggregating boolean data (GH2692)
  • When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH2686)
  • Fixed MemoryError caused by performing counting sort on sorting MultiIndex levels with a very large number of combinatorial values (GH2684)
  • Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH2683)
  • Corrected businessday subtraction logic when the offset is more than 5 bdays and the starting date is on a weekend (GH2680)
  • Fixed C file parser behavior when the file has more columns than data (GH2668)
  • Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
  • DataFrames with numerical or datetime indices are now sorted prior to plotting (GH2609)
  • Fixed DataFrame.from_records error when passed columns, index, but empty records (GH2633)
  • Several bugs fixed for Series operations when dtype is datetime64 (GH2689, GH2629, GH2626)

See the full release notes or issue tracker on GitHub for a complete list.

v0.10.0 (December 17, 2012)

This is a major release from 0.9.1 and includes many new features and enhancements along with a large number of bug fixes. There are also a number of important API changes that long-time pandas users should pay close attention to.

File parsing new features

The delimited file parsing engine (the guts of read_csv and read_table) has been rewritten from the ground up and now uses a fraction of the memory while parsing, while being 40% or more faster in most use cases (in some cases much faster).

There are also many new features:

  • Much-improved Unicode handling via the encoding option.
  • Column filtering (usecols)
  • Dtype specification (dtype argument)
  • Ability to specify strings to be recognized as True/False
  • Ability to yield NumPy record arrays (as_recarray)
  • High performance delim_whitespace option
  • Decimal format (e.g. European format) specification
  • Easier CSV dialect options: escapechar, lineterminator, quotechar, etc.
  • More robust handling of many exceptional kinds of files observed in the wild
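Several of these parser options can be combined in a single call. The sketch below (using the current pandas API, with made-up data) shows column filtering, European decimal format, and custom True/False recognition together:

```python
from io import StringIO

import pandas as pd

# illustrative data: semicolon-delimited, comma as decimal separator
data = "a;b;c\n1,5;Yes;x\n2,5;No;y\n"

df = pd.read_csv(StringIO(data), sep=';', decimal=',',
                 usecols=['a', 'b'],           # column filtering
                 true_values=['Yes'],          # recognize 'Yes' as True
                 false_values=['No'])          # recognize 'No' as False

print(df['a'].tolist())  # [1.5, 2.5]
print(df['b'].tolist())  # [True, False]
```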

API changes

Deprecated DataFrame BINOP TimeSeries special case behavior

The default behavior of binary operations between a DataFrame and a Series has always been to align on the DataFrame’s columns and broadcast down the rows, except in the special case that the DataFrame contains time series. Since there are now methods for each binary operator enabling you to specify how you want to broadcast, we are phasing out this special case (Zen of Python: Special cases aren’t special enough to break the rules). Here’s what I’m talking about:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(6, 4),
   ...:                   index=pd.date_range('1/1/2000', periods=6))
   ...:

In [3]: df
Out[3]:
                   0         1         2         3
2000-01-01 -0.134024 -0.205969  1.348944 -1.198246
2000-01-02 -1.626124  0.982041  0.059493 -0.460111
2000-01-03 -1.565401 -0.025706  0.942864  2.502156
2000-01-04 -0.302741  0.261551 -0.066342  0.897097
2000-01-05  0.268766 -1.225092  0.582752 -1.490764
2000-01-06 -0.639757 -0.952750 -0.892402  0.505987

[6 rows x 4 columns]

# deprecated now
In [4]: df - df[0]
Out[4]:
            2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  \
2000-01-01                  NaN                  NaN                  NaN
2000-01-02                  NaN                  NaN                  NaN
2000-01-03                  NaN                  NaN                  NaN
2000-01-04                  NaN                  NaN                  NaN
2000-01-05                  NaN                  NaN                  NaN
2000-01-06                  NaN                  NaN                  NaN

            2000-01-04 00:00:00  2000-01-05 00:00:00  2000-01-06 00:00:00   0  \
2000-01-01                  NaN                  NaN                  NaN NaN
2000-01-02                  NaN                  NaN                  NaN NaN
2000-01-03                  NaN                  NaN                  NaN NaN
2000-01-04                  NaN                  NaN                  NaN NaN
2000-01-05                  NaN                  NaN                  NaN NaN
2000-01-06                  NaN                  NaN                  NaN NaN

             1   2   3
2000-01-01 NaN NaN NaN
2000-01-02 NaN NaN NaN
2000-01-03 NaN NaN NaN
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN

[6 rows x 10 columns]

# Change your code to
In [5]: df.sub(df[0], axis=0)  # align on axis 0 (rows)
Out[5]:
              0         1         2         3
2000-01-01  0.0 -0.071946  1.482967 -1.064223
2000-01-02  0.0  2.608165  1.685618  1.166013
2000-01-03  0.0  1.539695  2.508265  4.067556
2000-01-04  0.0  0.564293  0.236399  1.199839
2000-01-05  0.0 -1.493857  0.313986 -1.759530
2000-01-06  0.0 -0.312993 -0.252645  1.145744

[6 rows x 4 columns]

You will get a deprecation warning in the 0.10.x series, and the deprecated functionality will be removed in 0.11 or later.

Altered resample default behavior

The default time series resample binning behavior of daily D and higher frequencies has been changed to closed='left', label='left'. Lower frequencies are unaffected. The prior defaults were causing a great deal of confusion for users, especially when resampling data to daily frequency (which labeled the aggregated group with the end of the interval: the next day).

In [1]: dates = pd.date_range('1/1/2000', '1/5/2000', freq='4h')

In [2]: series = Series(np.arange(len(dates)), index=dates)

In [3]: series
Out[3]:
2000-01-01 00:00:00     0
2000-01-01 04:00:00     1
2000-01-01 08:00:00     2
2000-01-01 12:00:00     3
2000-01-01 16:00:00     4
2000-01-01 20:00:00     5
2000-01-02 00:00:00     6
2000-01-02 04:00:00     7
2000-01-02 08:00:00     8
2000-01-02 12:00:00     9
2000-01-02 16:00:00    10
2000-01-02 20:00:00    11
2000-01-03 00:00:00    12
2000-01-03 04:00:00    13
2000-01-03 08:00:00    14
2000-01-03 12:00:00    15
2000-01-03 16:00:00    16
2000-01-03 20:00:00    17
2000-01-04 00:00:00    18
2000-01-04 04:00:00    19
2000-01-04 08:00:00    20
2000-01-04 12:00:00    21
2000-01-04 16:00:00    22
2000-01-04 20:00:00    23
2000-01-05 00:00:00    24
Freq: 4H, dtype: int64

In [4]: series.resample('D', how='sum')
Out[4]:
2000-01-01     15
2000-01-02     51
2000-01-03     87
2000-01-04    123
2000-01-05     24
Freq: D, dtype: int64

In [5]: # old behavior

In [6]: series.resample('D', how='sum', closed='right', label='right')
Out[6]:
2000-01-01      0
2000-01-02     21
2000-01-03     57
2000-01-04     93
2000-01-05    129
Freq: D, dtype: int64
  • Infinity and negative infinity are no longer treated as NA by isnull and notnull. That they ever were was a relic of early pandas. This behavior can be re-enabled globally by the mode.use_inf_as_null option:
In [6]: s = pd.Series([1.5, np.inf, 3.4, -np.inf])

In [7]: pd.isnull(s)
Out[7]:
0    False
1    False
2    False
3    False
dtype: bool

In [8]: s.fillna(0)
Out[8]:
0    1.500000
1         inf
2    3.400000
3        -inf
dtype: float64

In [9]: pd.set_option('use_inf_as_null', True)

In [10]: pd.isnull(s)
Out[10]:
0    False
1     True
2    False
3     True
dtype: bool

In [11]: s.fillna(0)
Out[11]:
0    1.5
1    0.0
2    3.4
3    0.0
dtype: float64

In [12]: pd.reset_option('use_inf_as_null')
  • Methods with the inplace option now all return None instead of the calling object. E.g. code written like df = df.fillna(0, inplace=True) may stop working. To fix, simply delete the unnecessary variable assignment.
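A minimal sketch of the inplace change, using fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# inplace=True mutates df and returns None,
# so do not assign the result back
result = df.fillna(0, inplace=True)
print(result)            # None
print(df['A'].tolist())  # [1.0, 0.0, 3.0]
```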
  • pandas.merge no longer sorts the group keys (sort=False) by default. This was done for performance reasons: the group-key sorting is often one of the more expensive parts of the computation and is often unnecessary.
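A short sketch of the new default; passing sort=True recovers the old sorted behavior:

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'a'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})

# sort=False (the new default) keeps the left frame's key order
unsorted = pd.merge(left, right, on='key')

# sort=True sorts the join keys lexicographically, as before
sorted_keys = pd.merge(left, right, on='key', sort=True)

print(unsorted['key'].tolist())     # ['b', 'a']
print(sorted_keys['key'].tolist())  # ['a', 'b']
```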
  • The default column names for a file with no header have been changed to the integers 0 through N-1. This is to create consistency with the DataFrame constructor with no columns specified. The v0.9.0 behavior (names X0, X1, ...) can be reproduced by specifying prefix='X':
In [13]: data = 'a,b,c\n1,Yes,2\n3,No,4'

In [14]: print(data)
a,b,c
1,Yes,2
3,No,4

In [15]: pd.read_csv(StringIO(data), header=None)
Out[15]:
   0    1  2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]

In [16]: pd.read_csv(StringIO(data), header=None, prefix='X')
Out[16]:
  X0   X1 X2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]
  • Values like 'Yes' and 'No' are not interpreted as boolean by default, though this can be controlled by new true_values and false_values arguments:
In [17]: print(data)
a,b,c
1,Yes,2
3,No,4

In [18]: pd.read_csv(StringIO(data))
Out[18]:
   a    b  c
0  1  Yes  2
1  3   No  4

[2 rows x 3 columns]

In [19]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])
Out[19]:
   a      b  c
0  1   True  2
1  3  False  4

[2 rows x 3 columns]
  • The file parsers will not recognize non-string values arising from a converter function as NA if passed in the na_values argument. It’s better to do post-processing using the replace function instead.
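A minimal sketch of the suggested replace-based post-processing (the 'missing' sentinel here is an illustrative value, not something pandas defines):

```python
import numpy as np
import pandas as pd

# suppose a converter produced the sentinel string 'missing'
s = pd.Series(['1', '2', 'missing', '4'])

# post-process with replace rather than relying on na_values
cleaned = s.replace('missing', np.nan)
print(cleaned.isnull().tolist())  # [False, False, True, False]
```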
  • Calling fillna on Series or DataFrame with no arguments is no longer valid code. You must either specify a fill value or an interpolation method:
In [20]: s = Series([np.nan, 1., 2., np.nan, 4])

In [21]: s
Out[21]:
0    NaN
1    1.0
2    2.0
3    NaN
4    4.0
dtype: float64

In [22]: s.fillna(0)
Out[22]:
0    0.0
1    1.0
2    2.0
3    0.0
4    4.0
dtype: float64

In [23]: s.fillna(method='pad')
Out[23]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
dtype: float64

Convenience methods ffill and bfill have been added:

In [24]: s.ffill()
Out[24]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
dtype: float64
  • Series.apply will now operate on a returned value from the applied function that is itself a Series, and possibly upcast the result to a DataFrame

    In [25]: def f(x):
       ....:     return Series([x, x**2], index=['x', 'x^2'])
       ....:

    In [26]: s = Series(np.random.rand(5))

    In [27]: s
    Out[27]:
    0    0.717478
    1    0.815199
    2    0.452478
    3    0.848385
    4    0.235477
    dtype: float64

    In [28]: s.apply(f)
    Out[28]:
              x       x^2
    0  0.717478  0.514775
    1  0.815199  0.664550
    2  0.452478  0.204737
    3  0.848385  0.719757
    4  0.235477  0.055449

    [5 rows x 2 columns]
  • New API functions for working with pandas options (GH2097):

    • get_option / set_option - get/set the value of an option. Partial names are accepted.
    • reset_option - reset one or more options to their default value. Partial names are accepted.
    • describe_option - print a description of one or more options. When called with no arguments, print all registered options.

    Note: set_printoptions/reset_printoptions are now deprecated (but functioning); the print options now live under “display.XYZ”. For example:

    In [29]: get_option("display.max_rows")
    Out[29]: 15
  • to_string() methods now always return unicode strings (GH2224).
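The options API round-trip can be sketched as follows (display.max_rows is used here purely as an example option):

```python
import pandas as pd

default = pd.get_option('display.max_rows')

# change an option; unambiguous partial names are also accepted
pd.set_option('display.max_rows', 10)
print(pd.get_option('display.max_rows'))  # 10

# reset_option restores the registered default
pd.reset_option('display.max_rows')
print(pd.get_option('display.max_rows') == default)  # True
```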

New features

Wide DataFrame Printing

Instead of printing the summary information, pandas now splits the string representation across multiple rows by default:

In [30]: wide_frame = DataFrame(randn(5, 16))

In [31]: wide_frame
Out[31]:
         0         1         2         3         4         5         6   \
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725

         7         8         9         10        11        12        13  \
0 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232
1  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108
2  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414
3  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225
4  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986

         14        15
0 -0.976847  0.033862
1 -0.358944  1.262942
2 -0.276854  0.158399
3  0.740501  1.510263
4 -2.069319 -1.115104

[5 rows x 16 columns]

The old behavior of printing out summary information can be achieved via the ‘expand_frame_repr’ print option:

In [32]: pd.set_option('expand_frame_repr', False)

In [33]: wide_frame
Out[33]:
         0         1         2         3         4         5         6         7         8         9         10        11        12        13        14        15
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232 -0.976847  0.033862
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108 -0.358944  1.262942
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414 -0.276854  0.158399
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225  0.740501  1.510263
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986 -2.069319 -1.115104

[5 rows x 16 columns]

The width of each line can be changed via ‘line_width’ (80 by default):

In [34]: pd.set_option('line_width', 40)
line_width has been deprecated, use display.width instead (currently both are identical)

In [35]: wide_frame
Out[35]:
         0         1         2   \
0 -0.681624  0.191356  1.180274
1  0.441522 -0.316864 -0.017062
2 -0.412451 -0.462580  0.422194
3 -0.277255  1.331263  0.585174
4 -1.642511  0.432560  1.218080

         3         4         5   \
0 -0.834179  0.703043  0.166568
1  1.570114 -0.360875 -0.880096
2  0.288403 -0.487393 -0.777639
3 -0.568825 -0.719412  1.191340
4 -0.564705 -0.581790  0.286071

         6         7         8   \
0 -0.583599 -1.201796 -1.422811
1  0.235532  0.207232 -1.983857
2  0.055865  1.383381  0.085638
3 -0.456362  0.089931  0.776079
4  0.048725  1.002440  1.276582

         9         10        11  \
0 -0.882554  1.209871 -0.941235
1 -1.702547 -1.621234 -0.906840
2  0.246392  0.965887  0.246354
3  0.752889 -1.195795 -1.425911
4  0.054399  0.241963 -0.471786

         12        13        14  \
0  0.863067 -0.336232 -0.976847
1  1.014601 -0.475108 -0.358944
2 -0.727728 -0.094414 -0.276854
3 -0.548829  0.774225  0.740501
4  0.314510 -0.059986 -2.069319

         15
0  0.033862
1  1.262942
2  0.158399
3  1.510263
4 -1.115104

[5 rows x 16 columns]

Updated PyTables Support

Docs for the PyTables Table format & several enhancements to the API. Here is a taste of what to expect.

In [36]: store = HDFStore('store.h5')

In [37]: df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
   ....:                columns=['A', 'B', 'C'])
   ....:

In [38]: df
Out[38]:
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]

# appending data frames
In [39]: df1 = df[0:4]

In [40]: df2 = df[4:]

In [41]: store.append('df', df1)

In [42]: store.append('df', df2)

In [43]: store
Out[43]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])

# selecting the entire store
In [44]: store.select('df')
Out[44]:
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]
In [45]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ....:            major_axis=date_range('1/1/2000', periods=5),
   ....:            minor_axis=['A', 'B', 'C', 'D'])
   ....:

In [46]: wp
Out[46]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

# storing a panel
In [47]: store.append('wp', wp)

# selecting via A QUERY
In [48]: store.select('wp',
   ....:              [Term('major_axis>20000102'), Term('minor_axis', '=', ['A', 'B'])])
   ....:
Out[48]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B

# removing data from tables
In [49]: store.remove('wp', Term('major_axis>20000103'))
Out[49]: 8

In [50]: store.select('wp')
Out[50]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-03 00:00:00
Minor_axis axis: A to D

# deleting a store
In [51]: del store['df']

In [52]: store
Out[52]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

Enhancements

  • added the ability to use hierarchical keys

    In [53]: store.put('foo/bar/bah', df)

    In [54]: store.append('food/orange', df)

    In [55]: store.append('food/apple', df)

    In [56]: store
    Out[56]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
    /foo/bar/bah            frame        (shape->[8,3])
    /food/apple             frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])
    /food/orange            frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])
    /wp                     wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

    # remove all nodes under this level
    In [57]: store.remove('food')

    In [58]: store
    Out[58]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
    /foo/bar/bah            frame        (shape->[8,3])
    /wp                     wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])
  • added mixed-dtype support!

    In [59]: df['string'] = 'string'

    In [60]: df['int'] = 1

    In [61]: store.append('df', df)

    In [62]: df1 = store.select('df')

    In [63]: df1
    Out[63]:
                       A         B         C  string  int
    2000-01-01 -0.369325 -1.502617 -0.376280  string    1
    2000-01-02  0.511936 -0.116412 -0.625256  string    1
    2000-01-03 -0.550627  1.261433 -0.552429  string    1
    2000-01-04  1.695803 -1.025917 -0.910942  string    1
    2000-01-05  0.426805 -0.131749  0.432600  string    1
    2000-01-06  0.044671 -0.341265  1.844536  string    1
    2000-01-07 -2.036047  0.000830 -0.955697  string    1
    2000-01-08 -0.898872 -0.725411  0.059904  string    1

    [8 rows x 5 columns]

    In [64]: df1.get_dtype_counts()
    Out[64]:
    float64    3
    int64      1
    object     1
    dtype: int64
  • performance improvements on table writing

  • support for arbitrarily indexed dimensions

  • SparseSeries now has a density property (GH2384)

  • enable Series.str.strip/lstrip/rstrip methods to take an input argument to strip arbitrary characters (GH2411)
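A quick sketch of the new argument (the data here is illustrative):

```python
import pandas as pd

s = pd.Series(['xxABCxx', 'xBx'])

# strip arbitrary characters instead of whitespace
print(s.str.strip('x').tolist())   # ['ABC', 'B']
print(s.str.lstrip('x').tolist())  # ['ABCxx', 'Bx']
print(s.str.rstrip('x').tolist())  # ['xxABC', 'xB']
```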

  • implement value_vars in melt to limit values to certain columns and add melt to pandas namespace (GH2412)
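A minimal melt sketch (column names here are illustrative); value_vars restricts which columns are unpivoted:

```python
import pandas as pd

df = pd.DataFrame({'key': ['k1', 'k2'],
                   'A': [1, 2],
                   'B': [3, 4],
                   'C': [5, 6]})

# only A and B are melted into the long format; C is dropped
long_df = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
print(sorted(long_df['variable'].unique()))  # ['A', 'B']
print(len(long_df))                          # 4
```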

Bug Fixes

  • added Term method of specifying where conditions (GH1996).
  • del store['df'] now calls store.remove('df') for store deletion
  • deleting of consecutive rows is much faster than before
  • min_itemsize parameter can be specified in table creation to force a minimum size for indexing columns (the previous implementation would set the column size based on the first append)
  • indexing support via create_table_index (requires PyTables >= 2.3) (GH698).
  • appending on a store would fail if the table was not first created via put
  • fixed issue with missing attributes after loading a pickled dataframe (GH2431)
  • minor change to select and remove: require a table ONLY if where is also provided (and not None)

Compatibility

0.10 of HDFStore is backwards compatible for reading tables created in a prior version of pandas; however, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire file and write it out using the new format to take advantage of the updates.

N Dimensional Panels (Experimental)

Adding experimental support for Panel4D and factory functions to create n-dimensional named panels. Docs for NDim. Here is a taste of what to expect.

In [65]: p4d = Panel4D(randn(2, 2, 5, 4),
   ....:               labels=['Label1', 'Label2'],
   ....:               items=['Item1', 'Item2'],
   ....:               major_axis=date_range('1/1/2000', periods=5),
   ....:               minor_axis=['A', 'B', 'C', 'D'])
   ....:

In [66]: p4d
Out[66]:
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 2 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)
Labels axis: Label1 to Label2
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

See the full release notes or issue tracker on GitHub for a complete list.

v0.9.1 (November 14, 2012)

This is a bugfix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.

New features

  • Series.sort, DataFrame.sort, and DataFrame.sort_index can now be specified in a per-column manner to support multiple sort orders (GH928)

    In [1]: df = DataFrame(np.random.randint(0, 2, (6, 3)), columns=['A', 'B', 'C'])

    In [2]: df.sort(['A', 'B'], ascending=[1, 0])
    Out[2]:
       A  B  C
    0  0  1  0
    2  0  0  1
    1  1  1  1
    5  1  1  0
    3  1  0  0
    4  1  0  1

    [6 rows x 3 columns]
  • DataFrame.rank now supports additional argument values for the na_option parameter so missing values can be assigned either the largest or the smallest rank (GH1508, GH2159)

    In [3]: df = DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])

    In [4]: df.ix[2:4] = np.nan

    In [5]: df.rank()
    Out[5]:
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  3.0
    2  NaN  NaN  NaN
    3  NaN  NaN  NaN
    4  NaN  NaN  NaN
    5  2.0  1.0  2.0

    [6 rows x 3 columns]

    In [6]: df.rank(na_option='top')
    Out[6]:
         A    B    C
    0  6.0  5.0  4.0
    1  4.0  6.0  6.0
    2  2.0  2.0  2.0
    3  2.0  2.0  2.0
    4  2.0  2.0  2.0
    5  5.0  4.0  5.0

    [6 rows x 3 columns]

    In [7]: df.rank(na_option='bottom')
    Out[7]:
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  3.0
    2  5.0  5.0  5.0
    3  5.0  5.0  5.0
    4  5.0  5.0  5.0
    5  2.0  1.0  2.0

    [6 rows x 3 columns]
  • DataFrame has new where and mask methods to select values according to a given boolean mask (GH2109, GH2151)

    DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the []). The returned DataFrame has the same number of columns as the original, but is sliced on its index.

    In [8]: df = DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

    In [9]: df
    Out[9]:
              A         B         C
    0 -0.187239 -1.703664  0.613136
    1 -0.948528  0.505346  0.017228
    2 -2.391256  1.207381  0.853174
    3  0.124213 -0.625597 -1.211224
    4 -0.476548  0.649425  0.004610

    [5 rows x 3 columns]

    In [10]: df[df['A'] > 0]
    Out[10]:
              A         B         C
    3  0.124213 -0.625597 -1.211224

    [1 rows x 3 columns]

    If a DataFrame is sliced with a DataFrame based boolean condition (with the same size as the original DataFrame), then a DataFrame the same size (index and columns) as the original is returned, with elements that do not meet the boolean condition as NaN. This is accomplished via the new method DataFrame.where. In addition, where takes an optional other argument for replacement.

    In [11]: df[df > 0]
    Out[11]:
              A         B         C
    0       NaN       NaN  0.613136
    1       NaN  0.505346  0.017228
    2       NaN  1.207381  0.853174
    3  0.124213       NaN       NaN
    4       NaN  0.649425  0.004610

    [5 rows x 3 columns]

    In [12]: df.where(df > 0)
    Out[12]:
              A         B         C
    0       NaN       NaN  0.613136
    1       NaN  0.505346  0.017228
    2       NaN  1.207381  0.853174
    3  0.124213       NaN       NaN
    4       NaN  0.649425  0.004610

    [5 rows x 3 columns]

    In [13]: df.where(df > 0, -df)
    Out[13]:
              A         B         C
    0  0.187239  1.703664  0.613136
    1  0.948528  0.505346  0.017228
    2  2.391256  1.207381  0.853174
    3  0.124213  0.625597  1.211224
    4  0.476548  0.649425  0.004610

    [5 rows x 3 columns]

    Furthermore, where now aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels)

    In [14]: df2 = df.copy()

    In [15]: df2[df2[1:4] > 0] = 3

    In [16]: df2
    Out[16]:
              A         B         C
    0 -0.187239 -1.703664  0.613136
    1 -0.948528  3.000000  3.000000
    2 -2.391256  3.000000  3.000000
    3  3.000000 -0.625597 -1.211224
    4 -0.476548  0.649425  0.004610

    [5 rows x 3 columns]

    DataFrame.mask is the inverse boolean operation of where.

    In [17]: df.mask(df <= 0)
    Out[17]:
              A         B         C
    0       NaN       NaN  0.613136
    1       NaN  0.505346  0.017228
    2       NaN  1.207381  0.853174
    3  0.124213       NaN       NaN
    4       NaN  0.649425  0.004610

    [5 rows x 3 columns]
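The where / mask relationship can also be sketched as a standalone script (a minimal sketch; the frame here is freshly generated, not the one from the session above):

```python
import numpy as np
import pandas as pd

# A small random frame standing in for the df in the session above
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

# where() keeps values satisfying the condition and puts NaN elsewhere;
# boolean-frame indexing df[df > 0] is the equivalent spelling
kept = df.where(df > 0)

# the optional `other` argument supplies replacement values instead of NaN
flipped = df.where(df > 0, -df)

# mask() is the inverse operation: it hides values where the condition holds
masked = df.mask(df <= 0)
```

Since mask(cond) behaves like where(~cond), `masked` and `kept` hold identical values here.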
  • Enable referencing of Excel columns by their column names (GH1936)

    In [18]: xl = ExcelFile('data/test.xls')

    In [19]: xl.parse('Sheet1', index_col=0, parse_dates=True,
       ....:          parse_cols='A:D')
       ....:
    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    <ipython-input-19-7ac41df80d31> in <module>()
          1 xl.parse('Sheet1', index_col=0, parse_dates=True,
    ----> 2          parse_cols='A:D')

    /home/joris/scipy/pandas/pandas/io/excel.pyc in parse(self, sheetname, header, skiprows, skip_footer, names, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, squeeze, **kwds)
        279                                  false_values=false_values,
        280                                  squeeze=squeeze,
    --> 281                                  **kwds)
        282
        283     def _should_parse(self, i, parse_cols):

    /home/joris/scipy/pandas/pandas/io/excel.pyc in _parse_excel(self, sheetname, header, skiprows, names, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, true_values, false_values, verbose, squeeze, **kwds)
        337                                       "is not implemented")
        338         if parse_dates:
    --> 339             raise NotImplementedError("parse_dates keyword of read_excel "
        340                                       "is not implemented")
        341

    NotImplementedError: parse_dates keyword of read_excel is not implemented
  • Added option to disable pandas-style tick locators and formatters using series.plot(x_compat=True) or pandas.plot_params['x_compat'] = True (GH2205)

  • Existing TimeSeries methods at_time and between_time were added to DataFrame (GH2149)

  • DataFrame.dot can now accept ndarrays (GH2042)

  • DataFrame.drop now supports non-unique indexes (GH2101)

  • Panel.shift now supports negative periods (GH2164)

  • DataFrame now supports the unary ~ operator (GH2110)
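A minimal sketch of the at_time / between_time methods now available on DataFrame (the minute-frequency frame and the times are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical minute-frequency frame covering a single day
idx = pd.date_range('2012-01-01', periods=24 * 60, freq='min')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# every row stamped exactly 09:30 (here, a single row)
at_open = df.at_time('09:30')

# every row between 09:30 and 16:00, inclusive at both ends by default
session = df.between_time('09:30', '16:00')
```

Both methods filter on the time-of-day component of the index, regardless of date.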

API changes

  • Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window

    In [1]: prng = period_range('2012Q1', periods=2, freq='Q')

    In [2]: s = Series(np.random.randn(len(prng)), prng)

    In [4]: s.resample('M')
    Out[4]:
    2012-01   -1.471992
    2012-02         NaN
    2012-03         NaN
    2012-04   -0.493593
    2012-05         NaN
    2012-06         NaN
    Freq: M, dtype: float64
  • Period.end_time now returns the last nanosecond in the time interval (GH2124, GH2125, GH1764)

    In [20]: p = Period('2012')

    In [21]: p.end_time
    Out[21]: Timestamp('2012-12-31 23:59:59.999999999')
  • File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)

    In [22]: data = 'A,B,C\n00001,001,5\n00002,002,6'

    In [23]: read_csv(StringIO(data), converters={'A': lambda x: x.strip()})
    Out[23]:
           A  B  C
    0  00001  1  5
    1  00002  2  6

    [2 rows x 3 columns]
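The converter behavior can also be reproduced as a standalone script (assuming only pandas and the standard library):

```python
from io import StringIO

import pandas as pd

data = 'A,B,C\n00001,001,5\n00002,002,6'

# With a custom converter, column A keeps its leading zeros as strings,
# while B and C (no converters) are still parsed numerically
df = pd.read_csv(StringIO(data), converters={'A': lambda x: x.strip()})
```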

See the full release notes or issue tracker on GitHub for a complete list.

v0.9.0 (October 7, 2012)

This is a major release from 0.8.1 and includes several new features and enhancements along with a large number of bug fixes. New features include vectorized unicode encoding/decoding for Series.str, a to_latex method for DataFrame, more flexible parsing of boolean values, and enabling the download of options data from Yahoo! Finance.

New features

  • Add encode and decode for unicode handling to vectorized string processing methods in Series.str (GH1706)
  • Add DataFrame.to_latex method (GH1735)
  • Add convenient expanding window equivalents of all rolling_* ops (GH1785)
  • Add Options class to pandas.io.data for fetching options data from Yahoo! Finance (GH1748, GH1739)
  • More flexible parsing of boolean values (Yes, No, TRUE, FALSE, etc.) (GH1691, GH1295)
  • Add level parameter to Series.reset_index
  • TimeSeries.between_time can now select times across midnight (GH1871)
  • Series constructor can now handle generator as input (GH1679)
  • DataFrame.dropna can now take multiple axes (tuple/list) as input (GH924)
  • Enable skip_footer parameter in ExcelFile.parse (GH1843)
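One way to exercise the more flexible boolean parsing is through the true_values / false_values options of read_csv; the CSV content here is illustrative:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV mixing several boolean spellings
data = 'name,active\nalice,Yes\nbob,No\ncarol,TRUE\ndave,FALSE\n'

# Map each spelling explicitly; when every value in the column is
# covered, the column comes back as a proper boolean dtype
df = pd.read_csv(StringIO(data),
                 true_values=['Yes', 'TRUE'],
                 false_values=['No', 'FALSE'])
```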

API changes

  • The default column names when header=None and no column names are passed to functions like read_csv have changed to be more Pythonic and amenable to attribute access:
In [1]: data = '0,0,1\n1,1,0\n0,1,0'

In [2]: df = read_csv(StringIO(data), header=None)

In [3]: df
Out[3]:
   0  1  2
0  0  0  1
1  1  1  0
2  0  1  0

[3 rows x 3 columns]
  • Creating a Series from another Series, passing an index, will cause reindexing to happen inside rather than treating the Series like an ndarray. Technically improper usages like Series(df[col1], index=df[col2]) that worked before “by accident” (this was never intended) will lead to all-NA Series in some cases. To be perfectly clear:
In [4]: s1 = Series([1, 2, 3])

In [5]: s1
Out[5]:
0    1
1    2
2    3
dtype: int64

In [6]: s2 = Series(s1, index=['foo', 'bar', 'baz'])

In [7]: s2
Out[7]:
foo   NaN
bar   NaN
baz   NaN
dtype: float64
  • Deprecated day_of_year API removed from PeriodIndex, use dayofyear (GH1723)
  • Don’t modify NumPy suppress printoption to True at import time
  • The internal HDF5 data arrangement for DataFrames has been transposed. Legacy files will still be readable by HDFStore (GH1834, GH1824)
  • Legacy cruft removed: pandas.stats.misc.quantileTS
  • Use ISO8601 format for Period repr: monthly, daily, and on down (GH1776)
  • Empty DataFrame columns are now created as object dtype. This will prevent a class of TypeErrors that was occurring in code where the dtype of a column would depend on the presence of data or not (e.g. a SQL query having results) (GH1783)
  • Setting parts of DataFrame/Panel using ix now aligns input Series/DataFrame (GH1630)
  • first and last methods in GroupBy no longer drop non-numeric columns (GH1809)
  • Resolved inconsistencies in specifying custom NA values in text parser. na_values of type dict no longer override default NAs unless keep_default_na is set to false explicitly (GH1657)
  • DataFrame.dot will not do data alignment, and will also work with Series (GH1915)

See the full release notes or issue tracker on GitHub for a complete list.

v0.8.1 (July 22, 2012)

This release includes a few new features, performance enhancements, and over 30 bug fixes from 0.8.0. New features notably include NA-friendly string processing functionality and a series of new plot types and options.

New features

Performance improvements

  • Improved implementation of rolling min and max (thanks to Bottleneck!)
  • Add accelerated 'median' GroupBy option (GH1358)
  • Significantly improve the performance of parsing ISO8601-format date strings with DatetimeIndex or to_datetime (GH1571)
  • Improve the performance of GroupBy on single-key aggregations and use with Categorical types
  • Significant datetime parsing performance improvements

v0.8.0 (June 29, 2012)

This is a major release from 0.7.3 and includes extensive work on the time series handling and processing infrastructure as well as a great deal of new functionality throughout the library. It includes over 700 commits from more than 20 distinct authors. Most pandas 0.7.3 and earlier users should not experience any issues upgrading, but due to the migration to the NumPy datetime64 dtype, there may be a number of bugs and incompatibilities lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if necessary. See the full release notes or issue tracker on GitHub for a complete list.

Support for non-unique indexes

All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if applicable, index duplication in many-to-many joins).

NumPy datetime64 dtype and 1.6 dependency

Time series data are now represented using NumPy’s datetime64 dtype; thus, pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified to work with the development version (1.7+) of NumPy as well, which includes some significant user-facing API changes. NumPy 1.6 also has a number of bugs having to do with nanosecond resolution data, so I recommend that you steer clear of NumPy 1.6’s datetime64 API functions (though limited as they are) and only interact with this data using the interface that pandas provides.

See the end of the 0.8.0 section for a “porting” guide listing potential issues for users migrating legacy codebases from pandas 0.7 or earlier to 0.8.0.

Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as they arise. There will be no further development in 0.7.x beyond bug fixes.

Time series changes and improvements

Note

With this release, legacy scikits.timeseries users should be able to port their code to use pandas.

Note

See the documentation for an overview of the pandas time series API.

  • New datetime64 representation speeds up join operations and data alignment, reduces memory usage, and improves serialization / deserialization performance significantly over datetime.datetime
  • High performance and flexible resample method for converting from high-to-low and low-to-high frequency. Supports interpolation, user-defined aggregation functions, and control over how the intervals and result labeling are defined. A suite of high performance Cython/C-based resampling functions (including Open-High-Low-Close) have also been implemented.
  • Revamp of frequency aliases and support for frequency shortcuts like '15min' or '1h30min'
  • New DatetimeIndex class supports both fixed frequency and irregular time series. Replaces the now-deprecated DateRange class
  • New PeriodIndex and Period classes for representing time spans and performing calendar logic, including the 12 fiscal quarterly frequencies. This is a partial port of, and a substantial enhancement to, elements of the scikits.timeseries codebase. Support for conversion between PeriodIndex and DatetimeIndex
  • New Timestamp data type subclasses datetime.datetime, providing the same interface while enabling work with nanosecond-resolution data. Also provides easy time zone conversions.
  • Enhanced support for time zones. Add tz_convert and tz_localize methods to TimeSeries and DataFrame. All timestamps are stored as UTC; Timestamps from DatetimeIndex objects with time zone set will be localized to local time. Time zone conversions are therefore essentially free. The user needs to know very little about the pytz library now; only time zone names as strings are required. Time zone-aware timestamps are equal if and only if their UTC timestamps match. Operations between time zone-aware time series with different time zones will result in a UTC-indexed time series.
  • Time series string indexing conveniences / shortcuts: slice years, year and month, and index values with strings
  • Enhanced time series plotting; adaptation of scikits.timeseries matplotlib-based plotting code
  • New date_range, bdate_range, and period_range factory functions
  • Robust frequency inference function infer_freq and inferred_freq property of DatetimeIndex, with option to infer frequency on construction of DatetimeIndex
  • to_datetime function efficiently parses array of strings to DatetimeIndex. DatetimeIndex will parse array or list of strings to datetime64
  • Optimized support for datetime64-dtype data in Series and DataFrame columns
  • New NaT (Not-a-Time) type to represent NA in timestamp arrays
  • Optimize Series.asof for looking up “as of” values for arrays of timestamps
  • Milli, Micro, Nano date offset objects
  • Can index time series with datetime.time objects to select all data at a particular time of day (TimeSeries.at_time) or between two times (TimeSeries.between_time)
  • Add tshift method for leading/lagging using the frequency (if any) of the index, as opposed to a naive lead/lag using shift
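A brief sketch of the tz_localize / tz_convert workflow described above (zone names and times are illustrative):

```python
import pandas as pd

# Localize a naive hourly range to New York time, then convert to UTC
rng = pd.date_range('2012-03-06 09:00', periods=3, freq='h')
ts = pd.Series([1.0, 2.0, 3.0], index=rng).tz_localize('America/New_York')
ts_utc = ts.tz_convert('UTC')

# tz-aware timestamps are equal iff their UTC instants match, so
# conversion does not change which moment each label refers to
same_instant = ts.index[0] == ts_utc.index[0]
```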

Other new features

  • New cut and qcut functions (like R’s cut function) for computing a categorical variable from a continuous variable by binning values either into value-based (cut) or quantile-based (qcut) bins
  • Rename Factor to Categorical and add a number of usability features
  • Add limit argument to fillna/reindex
  • More flexible multiple function application in GroupBy, and can pass list of (name, function) tuples to get result in particular order with given names
  • Add flexible replace method for efficiently substituting values
  • Enhanced read_csv/read_table for reading time series data and converting multiple columns to dates
  • Add comments option to parser functions: read_csv, etc.
  • Add dayfirst option to parser functions for parsing international DD/MM/YYYY dates
  • Allow the user to specify the CSV reader dialect to control quoting etc.
  • Handling of thousands separators in read_csv to improve integer parsing
  • Enable unstacking of multiple levels in one shot. Alleviate pivot_table bugs (empty columns being introduced)
  • Move to klib-based hash tables for indexing; better performance and less memory usage than Python’s dict
  • Add first, last, min, max, and prod optimized GroupBy functions
  • New ordered_merge function
  • Add flexible comparison instance methods eq, ne, lt, gt, etc. to DataFrame, Series
  • Improve scatter_matrix plotting function and add histogram or kernel density estimates to diagonal
  • Add 'kde' plot option for density plots
  • Support for converting DataFrame to R data.frame through rpy2
  • Improved support for complex numbers in Series and DataFrame
  • Add pct_change method to all data structures
  • Add max_colwidth configuration option for DataFrame console output
  • Interpolate Series values using index values
  • Can select multiple columns from GroupBy
  • Add update methods to Series/DataFrame for updating values in place
  • Add any and all methods to DataFrame
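A small sketch of cut vs. qcut (the bin edges and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 35, 48, 62, 71])

# cut: fixed, value-based bin edges (right-inclusive by default)
by_value = pd.cut(ages, bins=[0, 18, 65, 100],
                  labels=['minor', 'adult', 'senior'])

# qcut: quantile-based bins, so each bin gets roughly equal counts
by_quartile = pd.qcut(ages, 4)
```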

New plotting methods

Series.plot now supports a secondary_y option:

In [1]: plt.figure()
Out[1]: <matplotlib.figure.Figure at 0x7fd237c84f10>

In [2]: fx['FR'].plot(style='g')
Out[2]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd23e90f5d0>

In [3]: fx['IT'].plot(style='k--', secondary_y=True)
Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd23eb04910>
../_static/whatsnew_secondary_y.png

Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:

In [4]: s = Series(np.concatenate((np.random.randn(1000),
   ...:                            np.random.randn(1000) * 0.5 + 3)))
   ...:

In [5]: plt.figure()
Out[5]: <matplotlib.figure.Figure at 0x7fd237c84190>

In [6]: s.hist(normed=True, alpha=0.2)
Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd23e79dbd0>

In [7]: s.plot(kind='kde')
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd23e79dbd0>
../_static/whatsnew_kde.png

See the plotting page for much more.

Other API changes

  • Deprecation of offset, time_rule, and timeRule argument names in time series functions. Warnings will be printed until pandas 0.9 or 1.0.

Potential porting issues for pandas <= 0.7.3 users

The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy’s datetime64 data type instead of dtype=object arrays of Python’s built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex but otherwise behaves identically. But, if you have code that converts DateRange or Index objects that used to contain datetime.datetime values to plain NumPy arrays, you may have bugs lurking with code using scalar values, because you are handing control over to NumPy:

In [8]: import datetime

In [9]: rng = date_range('1/1/2000', periods=10)

In [10]: rng[5]
Out[10]: Timestamp('2000-01-06 00:00:00', freq='D')

In [11]: isinstance(rng[5], datetime.datetime)
Out[11]: True

In [12]: rng_asarray = np.asarray(rng)

In [13]: scalar_val = rng_asarray[5]

In [14]: type(scalar_val)
Out[14]: numpy.datetime64

pandas’s Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field stores the nanosecond value between 0 and 999). It should substitute directly into any code that used datetime.datetime values before. Thus, I recommend not casting DatetimeIndex to regular NumPy arrays.

If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the asobject property of DatetimeIndex produces an array of Timestamp objects:

In [15]: stamp_array = rng.asobject

In [16]: stamp_array
Out[16]:
Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
       2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00,
       2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00,
       2000-01-10 00:00:00],
      dtype='object')

In [17]: stamp_array[5]
Out[17]: Timestamp('2000-01-06 00:00:00', freq='D')

To get an array of proper datetime.datetime objects, use the to_pydatetime method:

In [18]: dt_array = rng.to_pydatetime()

In [19]: dt_array
Out[19]:
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 2, 0, 0),
       datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 7, 0, 0),
       datetime.datetime(2000, 1, 8, 0, 0),
       datetime.datetime(2000, 1, 9, 0, 0),
       datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)

In [20]: dt_array[5]
Out[20]: datetime.datetime(2000, 1, 6, 0, 0)

matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See the matplotlib documentation for more on this.

Warning

There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.

In [21]: rng = date_range('1/1/2000', periods=10)

In [22]: rng
Out[22]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')

In [23]: np.asarray(rng)
Out[23]:
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000',
       '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000',
       '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000',
       '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [24]: converted = np.asarray(rng, dtype=object)

In [25]: converted[5]
Out[25]: 947116800000000000L

Trust me: don’t panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas’s API you will be just fine. There is nothing wrong with the data type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.

Support for non-unique indexes: you may have code inside a try: ... except: block that failed because the index was not unique. In many cases it will no longer fail (some methods like append still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and raise an exception explicitly if it is False, or go to a different code branch.
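A minimal sketch of the explicit-check pattern suggested above:

```python
import pandas as pd

# A frame with a duplicated index label, as non-unique indexes now allow
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'b'])

# Rather than waiting for an operation to fail, branch (or raise) explicitly
if not df.index.is_unique:
    dup_labels = df.index[df.index.duplicated()].tolist()
else:
    dup_labels = []
```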

v.0.7.3 (April 12, 2012)

This is a minor release from 0.7.2 and fixes many minor bugs and adds a number of nice new features. There are also a couple of API changes to note; these should not affect very many users, and we are inclined to call them “bug fixes” even though they do constitute a change in behavior. See the full release notes or issue tracker on GitHub for a complete list.

New features

from pandas.tools.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2)
_images/scatter_matrix_kde.png
df.plot(kind='bar', stacked=True)
_images/bar_plot_stacked_ex.png
df.plot(kind='barh', stacked=True)
_images/barh_plot_stacked_ex.png
  • Add log x and y scaling options to DataFrame.plot and Series.plot
  • Add kurt methods to Series and DataFrame for computing kurtosis

NA Boolean Comparison API Change

Reverted some changes to how NA values (typically represented as NaN or None) are handled in non-numeric Series:

In [1]: series = Series(['Steve', np.nan, 'Joe'])

In [2]: series == 'Steve'
Out[2]:
0     True
1    False
2    False
dtype: bool

In [3]: series != 'Steve'
Out[3]:
0    False
1     True
2     True
dtype: bool

In comparisons, NA / NaN will always come through as False, except with !=, which is True. Be very careful with boolean arithmetic, especially negation, in the presence of NA data. You may wish to add an explicit NA filter into boolean array operations if you are worried about this:

In [4]: mask = series == 'Steve'

In [5]: series[mask & series.notnull()]
Out[5]:
0    Steve
dtype: object

While propagating NA in comparisons may seem like the right behavior to someusers (and you could argue on purely technical grounds that this is the rightthing to do), the evaluation was made that propagating NA everywhere, includingin numerical arrays, would cause a large amount of problems for users. Thus, a“practicality beats purity” approach was taken. This issue may be revisited atsome point in the future.

Other API Changes

When calling apply on a grouped Series, the return value will also be a Series, to be more consistent with the groupby behavior with DataFrame:

In [6]: df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                       'foo', 'bar', 'foo', 'foo'],
   ...:                 'B': ['one', 'one', 'two', 'three',
   ...:                       'two', 'two', 'one', 'three'],
   ...:                 'C': np.random.randn(8), 'D': np.random.randn(8)})
   ...:

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.219405 -1.079181
1  bar    one -0.342863 -1.631882
2  foo    two -0.032419  0.237288
3  bar  three -1.581534  0.514679
4  foo    two -0.912061 -1.488101
5  bar    two  0.209500  1.018514
6  foo    one -0.675890 -1.488840
7  foo  three  0.055228 -1.355434

[8 rows x 4 columns]

In [8]: grouped = df.groupby('A')['C']

In [9]: grouped.describe()
Out[9]:
A
bar  count    3.000000
     mean    -0.571633
     std      0.917171
     min     -1.581534
     25%     -0.962199
     50%     -0.342863
     75%     -0.066682
                ...
foo  mean    -0.269148
     std      0.494652
     min     -0.912061
     25%     -0.675890
     50%     -0.032419
     75%      0.055228
     max      0.219405
Name: C, dtype: float64

In [10]: grouped.apply(lambda x: x.order()[-2:])  # top 2 values
Out[10]:
A
bar  1   -0.342863
     5    0.209500
foo  7    0.055228
     0    0.219405
Name: C, dtype: float64

v.0.7.2 (March 16, 2012)

This release targets bugs in 0.7.1, and adds a few minor features.

New features

  • Add additional tie-breaking methods in DataFrame.rank (GH874)
  • Add ascending parameter to rank in Series, DataFrame (GH875)
  • Add coerce_float option to DataFrame.from_records (GH893)
  • Add sort_columns parameter to allow unsorted plots (GH918)
  • Enable column access via attributes on GroupBy (GH882)
  • Can pass dict of values to DataFrame.fillna (GH661)
  • Can select multiple hierarchical groups by passing list of values in .ix(GH134)
  • Add axis option to DataFrame.fillna (GH174)
  • Add level keyword to drop for dropping values from a level (GH159)

Performance improvements

  • Use khash for Series.value_counts, add raw function to algorithms.py (GH861)
  • Intercept __builtin__.sum in groupby (GH885)

v.0.7.1 (February 29, 2012)

This release includes a few new features and addresses over a dozen bugs in0.7.0.

New features

  • Add to_clipboard function to pandas namespace for writing objects to the system clipboard (GH774)
  • Add itertuples method to DataFrame for iterating through the rows of a dataframe as tuples (GH818)
  • Add ability to pass fill_value and method to DataFrame and Series align method (GH806, GH807)
  • Add fill_value option to reindex, align methods (GH784)
  • Enable concat to produce DataFrame from Series (GH787)
  • Add between method to Series (GH802)
  • Add HTML representation hook to DataFrame for the IPython HTML notebook (GH773)
  • Support for reading Excel 2007 XML documents using openpyxl

Performance improvements

  • Improve performance and memory usage of fillna on DataFrame
  • Can concatenate a list of Series along axis=1 to obtain a DataFrame (GH787)

v.0.7.0 (February 9, 2012)

New features

  • New unified merge function for efficiently performing the full gamut of database / relational-algebra operations. Refactored existing join methods to use the new infrastructure, resulting in substantial performance gains (GH220, GH249, GH267)
  • New unified concatenation function for concatenating Series, DataFrame or Panel objects along an axis. Can form union or intersection of the other axes. Improves performance of Series.append and DataFrame.append (GH468, GH479, GH273)
  • Can pass multiple DataFrames to DataFrame.append to concatenate (stack), and multiple Series to Series.append too
  • Can pass list of dicts (e.g., a list of JSON objects) to DataFrame constructor (GH526)
  • You can now set multiple columns in a DataFrame via __getitem__, useful for transformation (GH342)
  • Handle differently-indexed output values in DataFrame.apply (GH498)
In [1]: df = DataFrame(randn(10, 4))

In [2]: df.apply(lambda x: x.describe())
Out[2]:
               0          1          2          3
count  10.000000  10.000000  10.000000  10.000000
mean    0.448104   0.052501   0.058434   0.008207
std     0.784159   0.676134   0.959629   1.126010
min    -1.275249  -1.200953  -1.819334  -1.607906
25%     0.100811  -0.095948  -0.365166  -0.973095
50%     0.709636   0.071581   0.116057   0.179112
75%     0.851809   0.478706   0.616168   0.807868
max     1.437656   1.051356   1.387310   1.521442

[8 rows x 4 columns]
  • Add reorder_levels method to Series and DataFrame (GH534)
  • Add dict-like get function to DataFrame and Panel (GH521)
  • Add DataFrame.iterrows method for efficiently iterating through the rows of a DataFrame
  • Add DataFrame.to_panel with code adapted from LongPanel.to_long
  • Add reindex_axis method to DataFrame
  • Add level option to binary arithmetic functions on DataFrame and Series
  • Add level option to the reindex and align methods on Series and DataFrame for broadcasting values across a level (GH542, GH552, others)
  • Add attribute-based item access to Panel and add IPython completion (GH563)
  • Add logy option to Series.plot for log-scaling on the Y axis
  • Add index and header options to DataFrame.to_string
  • Can pass multiple DataFrames to DataFrame.join to join on index (GH115)
  • Can pass multiple Panels to Panel.join (GH115)
  • Added justify argument to DataFrame.to_string to allow different alignment of column headers
  • Add sort option to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595)
  • Can pass MaskedArray to Series constructor (GH563)
  • Add Panel item access via attributes and IPython completion (GH554)
  • Implement DataFrame.lookup, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338)
  • Can pass a list of functions to aggregate with groupby on a DataFrame, yielding an aggregated result with hierarchical columns (GH166)
  • Can call cummin and cummax on Series and DataFrame to get cumulative minimum and maximum, respectively (GH647)
  • value_range added as utility function to get min and max of a dataframe (GH288)
  • Added encoding argument to read_csv, read_table, to_csv and from_csv for non-ascii text (GH717)
  • Added abs method to pandas objects
  • Added crosstab function for easily computing frequency tables
  • Added isin method to index objects
  • Added level argument to xs method of DataFrame

API Changes to integer indexing

One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:

In [3]: s = Series(randn(10), index=range(0, 20, 2))

In [4]: s
Out[4]:
0     0.679919
2    -0.457147
4     0.041867
6     1.503116
8    -0.841265
10   -1.578003
12   -0.273728
14    1.755240
16   -0.705788
18   -0.351950
dtype: float64

In [5]: s[0]
Out[5]: 0.67991862351992061

In [6]: s[2]
Out[6]: -0.45714692729799072

In [7]: s[4]
Out[7]: 0.041867372914288915

This is all exactly identical to the behavior before. However, if you ask for a key not contained in the Series, in versions 0.6.1 and prior, Series would fall back on a location-based lookup. This now raises a KeyError:

In [2]: s[1]
KeyError: 1
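The new behavior can be sketched as follows (values are illustrative):

```python
import pandas as pd

# A Series with a non-consecutive integer index, as in the example above
s = pd.Series([10, 20, 30], index=[0, 2, 4])

# s[2] is a *label* lookup: it returns 20, not the element at position 2
label_hit = s[2]

# s[1] has no matching label; instead of falling back to position 1,
# a KeyError is raised
try:
    s[1]
    fell_back = True
except KeyError:
    fell_back = False
```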

This change also has the same impact on DataFrame:

In [3]: df = DataFrame(randn(8, 4), index=range(0, 16, 2))

In [4]: df
    0        1       2       3
0   0.88427  0.3363 -0.1787  0.03162
2   0.14451 -0.1415  0.2504  0.58374
4  -1.44779 -0.9186 -1.4996  0.27163
6  -0.26598 -2.4184 -0.2658  0.11503
8  -0.58776  0.3144 -0.8566  0.61941
10  0.10940 -0.7175 -1.0108  0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337  0.3410  0.0424 -0.16037

In [5]: df.ix[3]
KeyError: 3

In order to support purely integer-based indexing, the following methods have been added:

Method                         Description
Series.iget_value(i)           Retrieve value stored at location i
Series.iget(i)                 Alias for iget_value
DataFrame.irow(i)              Retrieve the i-th row
DataFrame.icol(j)              Retrieve the j-th column
DataFrame.iget_value(i, j)     Retrieve the value at row i and column j

API tweaks regarding label-based slicing

Label-based slicing using ix now requires that the index be sorted (monotonic) unless both the start and endpoint are contained in the index:

In [8]: s = Series(randn(6), index=list('gmkaec'))

In [9]: s
Out[9]:
g    1.507974
m    0.419219
k    0.647633
a   -0.147670
e   -0.759803
c   -0.757308
dtype: float64

Then this is OK:

In [10]: s.ix['k':'e']
Out[10]:
k    0.647633
a   -0.147670
e   -0.759803
dtype: float64

But this is not:

In [12]: s.ix['b':'h']
KeyError 'b'

If the index had been sorted, the “range selection” would have been possible:

In [11]: s2 = s.sort_index()

In [12]: s2
Out[12]:
a   -0.147670
c   -0.757308
e   -0.759803
g    1.507974
k    0.647633
m    0.419219
dtype: float64

In [13]: s2.ix['b':'h']
Out[13]:
c   -0.757308
e   -0.759803
g    1.507974
dtype: float64

Changes to Series[] operator

As a notational convenience, you can pass a sequence of labels or a label slice to a Series when getting and setting values via [] (i.e. the __getitem__ and __setitem__ methods). The behavior will be the same as passing similar input to ix, except in the case of integer indexing:

In [14]: s = Series(randn(6), index=list('acegkm'))

In [15]: s
Out[15]:
a   -1.921164
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
m   -0.335468
dtype: float64

In [16]: s[['m', 'a', 'c', 'e']]
Out[16]:
m   -0.335468
a   -1.921164
c   -1.093529
e   -0.592157
dtype: float64

In [17]: s['b':'l']
Out[17]:
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
dtype: float64

In [18]: s['c':'k']
Out[18]:
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
dtype: float64

In the case of integer indexes, the behavior will be exactly as before (shadowing ndarray):

In [19]: s = Series(randn(6), index=range(0, 12, 2))

In [20]: s[[4, 0, 2]]
Out[20]:
4    0.886170
0   -0.392051
2   -0.189537
dtype: float64

In [21]: s[1:5]
Out[21]:
2   -0.189537
4    0.886170
6   -1.125894
8    0.319635
dtype: float64

If you wish to do indexing with sequences and slicing on an integer index with label semantics, use ix.

Other API Changes

  • The deprecated LongPanel class has been completely removed
  • If Series.sort is called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame’s column by doing df[col].sort() instead of the side-effect free method df[col].order() (GH316)
  • Miscellaneous renames and deprecations which will (harmlessly) raise FutureWarning
  • drop added as an optional parameter to DataFrame.reset_index (GH699)

Performance improvements

  • Cythonized GroupBy aggregations no longer presort the data, thus achieving a significant speedup (GH93). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (GH496).
  • Better error message in DataFrame constructor when passed column labelsdon’t match data (GH497)
  • Substantially improve performance of multi-GroupBy aggregation when aPython function is passed, reuse ndarray object in Cython (GH496)
  • Can store objects indexed by tuples and floats in HDFStore (GH492)
  • Don’t print length by default in Series.to_string, addlength option (GH489)
  • Improve Cython code for multi-groupby to aggregate without having to sortthe data (GH93)
  • Improve MultiIndex reindexing speed by storing tuples in the MultiIndex,test for backwards unpickling compatibility
  • Improve column reindexing performance by using specialized Cython takefunction
  • Further performance tweaking of Series.__getitem__ for standard use cases
  • Avoid Index dict creation in some cases (i.e. when getting slices, etc.),regression from prior versions
  • Friendlier error message in setup.py if NumPy not installed
  • Use common set of NA-handling operations (sum, mean, etc.) in Panel classalso (GH536)
  • Default name assignment when callingreset_index on DataFrame with aregular (non-hierarchical) index (GH476)
  • Use Cythonized groupers when possible in Series/DataFrame stat ops withlevel parameter passed (GH545)
  • Ported skiplist data structure to C to speed uprolling_median by about5-10x in most typical use cases (GH374)

v.0.6.1 (December 13, 2011)

New features

Performance improvements

  • Improve memory usage of DataFrame.describe (do not copy data unnecessarily) (PR #425)
  • Optimize scalar value lookups in the general case by 25% or more in Series and DataFrame
  • Fix performance regression in cross-sectional count in DataFrame, affecting DataFrame.dropna speed
  • Column deletion in DataFrame copies no data (computes views on blocks) (GH158)

v.0.6.0 (November 25, 2011)

New Features

  • Added melt function to pandas.core.reshape
  • Added level parameter to group by level in Series and DataFrame descriptive statistics (GH313)
  • Added head and tail methods to Series, analogous to DataFrame (GH296)
  • Added Series.isin function which checks if each value is contained in a passed sequence (GH289)
  • Added float_format option to Series.to_string
  • Added skip_footer (GH291) and converters (GH343) options to read_csv and read_table
  • Added drop_duplicates and duplicated functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319)
  • Implemented operators '&', '|', '^', '-' on DataFrame (GH347)
  • Added Series.mad, mean absolute deviation
  • Added QuarterEnd DateOffset (GH321)
  • Added dot to DataFrame (GH65)
  • Added orient option to Panel.from_dict (GH359, GH301)
  • Added orient option to DataFrame.from_dict
  • Added passing list of tuples or list of lists to DataFrame.from_records (GH357)
  • Added multiple levels to groupby (GH103)
  • Allow multiple columns in by argument of DataFrame.sort_index (GH92, GH362)
  • Added fast get_value and put_value methods to DataFrame (GH360)
  • Added cov instance methods to Series and DataFrame (GH194, GH362)
  • Added kind='bar' option to DataFrame.plot (GH348)
  • Added idxmin and idxmax to Series and DataFrame (GH286)
  • Added read_clipboard function to parse DataFrame from clipboard (GH300)
  • Added nunique function to Series for counting unique elements (GH297)
  • Made DataFrame constructor use Series name if no columns passed (GH373)
  • Support regular expressions in read_table/read_csv (GH364)
  • Added DataFrame.to_html for writing DataFrame to HTML (GH387)
  • Added support for MaskedArray data in DataFrame, masked values converted to NaN (GH396)
  • Added DataFrame.boxplot function (GH368)
  • Can pass extra args, kwds to DataFrame.apply (GH376)
  • Implement DataFrame.join with vector on argument (GH312)
  • Added legend boolean flag to DataFrame.plot (GH324)
  • Can pass multiple levels to stack and unstack (GH370)
  • Can pass multiple values columns to pivot_table (GH381)
  • Use Series name in GroupBy for result index (GH363)
  • Added raw option to DataFrame.apply for performance if only need ndarray (GH309)
  • Added proper, tested weighted least squares to standard and panel OLS (GH303)
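Several of the v0.6.0 additions above kept their names in current pandas; this sketch exercises Series.isin, idxmin/idxmax, and the DataFrame duplicated/drop_duplicates pair against the modern API:

```python
import pandas as pd

# Series.isin flags membership; idxmin/idxmax return index labels,
# not positions.
s = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

mask = s.isin([1, 5])                # True where the value is 1 or 5
low, high = s.idxmin(), s.idxmax()   # labels of the min and max values

# duplicated flags repeat rows; drop_duplicates keeps the first of each.
df = pd.DataFrame({'x': [1, 1, 2], 'y': ['a', 'a', 'b']})
dupes = df.duplicated()
deduped = df.drop_duplicates()
```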

Performance Enhancements

  • VBENCH Cythonized cache_readonly, resulting in substantial micro-performance enhancements throughout the codebase (GH361)
  • VBENCH Special Cython matrix iterator for applying arbitrary reduction operations with 3-5x better performance than np.apply_along_axis (GH309)
  • VBENCH Improved performance of MultiIndex.from_tuples
  • VBENCH Special Cython matrix iterator for applying arbitrary reduction operations
  • VBENCH + DOCUMENT Add raw option to DataFrame.apply for getting better performance when only the ndarray is needed (GH309)
  • VBENCH Faster cythonized count by level in Series and DataFrame (GH341)
  • VBENCH? Significant GroupBy performance enhancement with multiple keys with many "empty" combinations
  • VBENCH New Cython vectorized function map_infer speeds up Series.apply and Series.map significantly when passed an elementwise Python function, motivated by (GH355)
  • VBENCH Significantly improved performance of Series.order, which also makes np.unique called on a Series faster (GH327)
  • VBENCH Vastly improved performance of GroupBy on axes with a MultiIndex (GH299)

v.0.5.0 (October 24, 2011)

New Features

  • Added DataFrame.align method with standard join options
  • Added parse_dates option to read_csv and read_table methods to optionally try to parse dates in the index columns
  • Added nrows, chunksize, and iterator arguments to read_csv and read_table. The last two return a new TextParser class capable of lazily iterating through chunks of a flat file (GH242)
  • Added ability to join on multiple columns in DataFrame.join (GH214)
  • Added private _get_duplicates function to Index for identifying duplicate values more easily (ENH5c)
  • Added column attribute access to DataFrame.
  • Added Python tab completion hook for DataFrame columns. (GH233, GH230)
  • Implemented Series.describe for Series containing objects (GH241)
  • Added inner join option to DataFrame.join when joining on key(s) (GH248)
  • Implemented selecting DataFrame columns by passing a list to __getitem__ (GH253)
  • Implemented & and | to intersect / union Index objects, respectively (GH261)
  • Added pivot_table convenience function to pandas namespace (GH234)
  • Implemented Panel.rename_axis function (GH243)
  • DataFrame will show index level names in console output (GH334)
  • Implemented Panel.take
  • Added set_eng_float_format for alternate DataFrame floating point string formatting (ENH61)
  • Added convenience set_index function for creating a DataFrame index from its existing columns
  • Implemented groupby hierarchical index level name (GH223)
  • Added support for different delimiters in DataFrame.to_csv (GH244)
  • TODO: DOCS ABOUT TAKE METHODS
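Two of the v0.5.0 conveniences above, set_index and pivot_table, remain core API in current pandas; a minimal sketch of both on a toy frame:

```python
import pandas as pd

# set_index builds the row index from an existing column; pivot_table
# cross-tabulates and aggregates values.
df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'col': ['x', 'y', 'x', 'y'],
    'val': [1, 2, 3, 4],
})

indexed = df.set_index('key')   # 'key' becomes the row index
table = pd.pivot_table(df, values='val', index='key',
                       columns='col', aggfunc='sum')
```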

Performance Enhancements

  • VBENCH Major performance improvements in file parsing functions read_csv and read_table
  • VBENCH Added Cython function for converting tuples to ndarray very fast. Speeds up many MultiIndex-related operations
  • VBENCH Refactored merging / joining code into a tidy class and disabled unnecessary computations in the float/object case, thus getting about 10% better performance (GH211)
  • VBENCH Improved speed of DataFrame.xs on mixed-type DataFrame objects by about 5x, regression from 0.3.0 (GH215)
  • VBENCH With new DataFrame.align method, speeding up binary operations between differently-indexed DataFrame objects by 10-25%.
  • VBENCH Significantly sped up conversion of nested dict into DataFrame (GH212)
  • VBENCH Significantly speed up DataFrame __repr__ and count on large mixed-type DataFrame objects

v.0.4.3 through v0.4.1 (September 25 - October 9, 2011)

New Features

  • Added Python 3 support using 2to3 (GH200)
  • Added name attribute to Series, now prints as part of Series.__repr__
  • Added instance methods isnull and notnull to Series (GH209, GH203)
  • Added Series.align method for aligning two series with choice of join method (ENH56)
  • Added method get_level_values to MultiIndex (GH188)
  • Set values in mixed-type DataFrame objects via .ix indexing attribute (GH135)
  • Added new DataFrame methods get_dtype_counts and property dtypes (ENHdc)
  • Added ignore_index option to DataFrame.append to stack DataFrames (ENH1b)
  • read_csv tries to sniff delimiters using csv.Sniffer (GH146)
  • read_csv can read multiple columns into a MultiIndex; DataFrame's to_csv method writes out a corresponding MultiIndex (GH151)
  • DataFrame.rename has a new copy parameter to rename a DataFrame in place (ENHed)
  • Enable unstacking by name (GH142)
  • Enable sortlevel to work by level (GH141)
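Series.isnull/notnull and Series.align from the list above kept their names in later pandas; a minimal sketch of each, assuming a modern install:

```python
import numpy as np
import pandas as pd

# isnull flags missing values; align reindexes both series to a common
# (here: outer/union) index, filling gaps with NaN.
s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

missing = s1.isnull()                     # True only at label 'b'
left, right = s1.align(s2, join='outer')  # both now indexed a, b, c, d
```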

Performance Enhancements

  • Altered binary operations on differently-indexed SparseSeries objects to use the integer-based (dense) alignment logic which is faster with a larger number of blocks (GH205)
  • Wrote faster Cython data alignment / merging routines resulting in substantial speed increases
  • Improved performance of isnull and notnull, a regression from v0.3.0 (GH187)
  • Refactored code related to DataFrame.join so that intermediate aligned copies of the data in each DataFrame argument do not need to be created. Substantial performance increases result (GH176)
  • Substantially improved performance of generic Index.intersection and Index.union
  • Implemented BlockManager.take resulting in significantly faster take performance on mixed-type DataFrame objects (GH104)
  • Improved performance of Series.sort_index
  • Significant groupby performance enhancement: removed unnecessary integrity checks in DataFrame internals that were slowing down slicing operations to retrieve groups
  • Optimized _ensure_index function resulting in performance savings in type-checking Index objects
  • Wrote fast time series merging / joining methods in Cython. Will be integrated later into DataFrame.join and related functions
