These are new features and improvements of note in each release.
This is a minor bug-fix release from 0.19.0 and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.
What’s new in v0.19.1
- Period data (GH14338)
- Series.asof(where) when where is a scalar (GH14461)
- DataFrame.asof(where) when where is a scalar (GH14461)
- .to_json() when lines=True (GH14408)
- cython installed, as in previous versions (GH14204)
- read_csv (c engine) (GH14418).
- DataFrame.quantile when missing values were present in some columns (GH14357).
- Index.difference where the freq of a DatetimeIndex was incorrectly set (GH14323)
- pandas.core.common.array_equivalent with a deprecation warning (GH14555).
- pd.read_csv for the C engine in which quotation marks were improperly parsed in skipped rows (GH14459)
- pd.read_csv for Python 2.x in which Unicode quote characters were no longer being respected (GH14477)
- Index.append when categorical indices were appended (GH14545).
- pd.DataFrame where constructor fails when given dict with None value (GH14381)
- DatetimeIndex._maybe_cast_slice_bound when index is empty (GH14354).
- TimedeltaIndex addition with a Datetime-like object where addition overflow in the negative direction was not being caught (GH14068, GH14453)
- object Index may raise AttributeError (GH14424)
- ValueError on empty input to pd.eval() and df.query() (GH13139)
- RangeIndex.intersection when result is an empty set (GH14364).
- Series.__setitem__ which allowed mutating read-only arrays (GH14359).
- DataFrame.insert where multiple calls with duplicate columns can fail (GH14291)
- pd.merge() will raise ValueError with non-boolean parameters in passed boolean type arguments (GH14434)
- Timestamp where dates very near the minimum (1677-09) could underflow on creation (GH14415)
- pd.concat where names of the keys were not propagated to the resulting MultiIndex (GH14252)
- pd.concat where axis cannot take string parameters 'rows' or 'columns' (GH14369)
- pd.concat with dataframes heterogeneous in length and tuple keys (GH14438)
- MultiIndex.set_levels where illegal level values were still set after raising an error (GH13754)
- DataFrame.to_json where lines=True and a value contained a } character (GH14391)
- df.groupby causing an AttributeError when grouping a single index frame by a column and the index level (GH14327)
- df.groupby where TypeError raised when pd.Grouper(key=...) is passed in a list (GH14334)
- pd.pivot_table may raise TypeError or ValueError when index or columns is not scalar and values is not specified (GH14380)

This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- merge_asof() for asof-style time-series joining, see here.
- .rolling() is now time-series aware, see here
- read_csv() now supports parsing Categorical data, see here
- union_categoricals() has been added for combining categoricals, see here
- PeriodIndex now has its own period dtype, and changed to be more consistent with other Index classes. See here
- Sparse data structures gained enhanced support of int and bool dtypes, see here
- Comparison operations with Series no longer ignore the index, see here for an overview of the API changes.
- Deprecation of Panel4D and PanelND. We recommend to represent these types of n-dimensional data with the xarray package.
- Removal of the previously deprecated modules pandas.io.data, pandas.io.wb, pandas.tools.rplot.

Warning
pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.
What’s new in v0.19.0
- merge_asof for asof-style time-series joining
- .rolling() is now time-series aware
- read_csv has improved support for duplicate column names
- read_csv supports parsing Categorical directly
- get_dummies now returns integer dtypes
- to_numeric
- Series.tolist() will now return Python types
- Series operators for different indexes
- Series type promotion on assignment
- .to_datetime() changes
- .describe() changes
- Period changes
- + / - no longer used for set operations
- Index.difference and .symmetric_difference changes
- Index.unique consistently returns Index
- MultiIndex constructors, groupby and set_index preserve categorical dtypes
- read_csv will progressively enumerate chunks

merge_asof for asof-style time-series joining

A long-time requested feature has been added through the merge_asof() function, to support asof style joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.
merge_asof() performs an asof merge, which is similar to a left-join except that we match on the nearest key rather than equal keys.
In [1]:left=pd.DataFrame({'a':[1,5,10], ...:'left_val':['a','b','c']}) ...:In [2]:right=pd.DataFrame({'a':[1,2,3,6,7], ...:'right_val':[1,2,3,6,7]}) ...:In [3]:leftOut[3]: a left_val0 1 a1 5 b2 10 cIn [4]:rightOut[4]: a right_val0 1 11 2 22 3 33 6 64 7 7
We typically want to match exactly when possible, and use the most recent value otherwise.
In [5]:pd.merge_asof(left,right,on='a')Out[5]: a left_val right_val0 1 a 11 5 b 32 10 c 7
We can also match rows only with prior data, and not require an exact match.
In [6]:pd.merge_asof(left,right,on='a',allow_exact_matches=False)Out[6]: a left_val right_val0 1 a NaN1 5 b 3.02 10 c 7.0
In a typical time-series example, we have trades and quotes, and we want to asof-join them. This also illustrates using the by parameter to group data before merging.
In [7]:trades=pd.DataFrame({ ...:'time':pd.to_datetime(['20160525 13:30:00.023', ...:'20160525 13:30:00.038', ...:'20160525 13:30:00.048', ...:'20160525 13:30:00.048', ...:'20160525 13:30:00.048']), ...:'ticker':['MSFT','MSFT', ...:'GOOG','GOOG','AAPL'], ...:'price':[51.95,51.95, ...:720.77,720.92,98.00], ...:'quantity':[75,155, ...:100,100,100]}, ...:columns=['time','ticker','price','quantity']) ...:In [8]:quotes=pd.DataFrame({ ...:'time':pd.to_datetime(['20160525 13:30:00.023', ...:'20160525 13:30:00.023', ...:'20160525 13:30:00.030', ...:'20160525 13:30:00.041', ...:'20160525 13:30:00.048', ...:'20160525 13:30:00.049', ...:'20160525 13:30:00.072', ...:'20160525 13:30:00.075']), ...:'ticker':['GOOG','MSFT','MSFT', ...:'MSFT','GOOG','AAPL','GOOG', ...:'MSFT'], ...:'bid':[720.50,51.95,51.97,51.99, ...:720.50,97.99,720.50,52.01], ...:'ask':[720.93,51.96,51.98,52.00, ...:720.93,98.01,720.88,52.03]}, ...:columns=['time','ticker','bid','ask']) ...:
In [9]:tradesOut[9]: time ticker price quantity0 2016-05-25 13:30:00.023 MSFT 51.95 751 2016-05-25 13:30:00.038 MSFT 51.95 1552 2016-05-25 13:30:00.048 GOOG 720.77 1003 2016-05-25 13:30:00.048 GOOG 720.92 1004 2016-05-25 13:30:00.048 AAPL 98.00 100In [10]:quotesOut[10]: time ticker bid ask0 2016-05-25 13:30:00.023 GOOG 720.50 720.931 2016-05-25 13:30:00.023 MSFT 51.95 51.962 2016-05-25 13:30:00.030 MSFT 51.97 51.983 2016-05-25 13:30:00.041 MSFT 51.99 52.004 2016-05-25 13:30:00.048 GOOG 720.50 720.935 2016-05-25 13:30:00.049 AAPL 97.99 98.016 2016-05-25 13:30:00.072 GOOG 720.50 720.887 2016-05-25 13:30:00.075 MSFT 52.01 52.03
An asof merge joins on the on field, typically a datetime-like field which is ordered, and in this case we are using a grouper in the by field. This is like a left-outer join, except that forward filling happens automatically, taking the most recent non-NaN value.
In [11]:pd.merge_asof(trades,quotes, ....:on='time', ....:by='ticker') ....:Out[11]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.961 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.982 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.933 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.934 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
This returns a merged DataFrame with the entries in the same order as the original left passed DataFrame (trades in this case), with the fields of the quotes merged.
.rolling() is now time-series aware

.rolling() objects are now time-series aware and can accept a time-series offset (or convertible) for the window argument (GH13327, GH12995). See the full documentation here.
In [12]:dft=pd.DataFrame({'B':[0,1,2,np.nan,4]}, ....:index=pd.date_range('20130101 09:00:00',periods=5,freq='s')) ....:In [13]:dftOut[13]: B2013-01-01 09:00:00 0.02013-01-01 09:00:01 1.02013-01-01 09:00:02 2.02013-01-01 09:00:03 NaN2013-01-01 09:00:04 4.0
This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.
In [14]:dft.rolling(2).sum()Out[14]: B2013-01-01 09:00:00 NaN2013-01-01 09:00:01 1.02013-01-01 09:00:02 3.02013-01-01 09:00:03 NaN2013-01-01 09:00:04 NaNIn [15]:dft.rolling(2,min_periods=1).sum()Out[15]: B2013-01-01 09:00:00 0.02013-01-01 09:00:01 1.02013-01-01 09:00:02 3.02013-01-01 09:00:03 2.02013-01-01 09:00:04 4.0
Specifying an offset allows a more intuitive specification of the rolling frequency.
In [16]:dft.rolling('2s').sum()Out[16]: B2013-01-01 09:00:00 0.02013-01-01 09:00:01 1.02013-01-01 09:00:02 3.02013-01-01 09:00:03 2.02013-01-01 09:00:04 4.0
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.
In [17]:dft=DataFrame({'B':[0,1,2,np.nan,4]}, ....:index=pd.Index([pd.Timestamp('20130101 09:00:00'), ....:pd.Timestamp('20130101 09:00:02'), ....:pd.Timestamp('20130101 09:00:03'), ....:pd.Timestamp('20130101 09:00:05'), ....:pd.Timestamp('20130101 09:00:06')], ....:name='foo')) ....:In [18]:dftOut[18]: Bfoo2013-01-01 09:00:00 0.02013-01-01 09:00:02 1.02013-01-01 09:00:03 2.02013-01-01 09:00:05 NaN2013-01-01 09:00:06 4.0In [19]:dft.rolling(2).sum()Out[19]: Bfoo2013-01-01 09:00:00 NaN2013-01-01 09:00:02 1.02013-01-01 09:00:03 3.02013-01-01 09:00:05 NaN2013-01-01 09:00:06 NaN
Using the time-specification generates variable windows for this sparse data.
In [20]:dft.rolling('2s').sum()Out[20]: Bfoo2013-01-01 09:00:00 0.02013-01-01 09:00:02 1.02013-01-01 09:00:03 3.02013-01-01 09:00:05 NaN2013-01-01 09:00:06 4.0
Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the index) in a DataFrame.
In [21]:dft=dft.reset_index()In [22]:dftOut[22]: foo B0 2013-01-01 09:00:00 0.01 2013-01-01 09:00:02 1.02 2013-01-01 09:00:03 2.03 2013-01-01 09:00:05 NaN4 2013-01-01 09:00:06 4.0In [23]:dft.rolling('2s',on='foo').sum()Out[23]: foo B0 2013-01-01 09:00:00 0.01 2013-01-01 09:00:02 1.02 2013-01-01 09:00:03 3.03 2013-01-01 09:00:05 NaN4 2013-01-01 09:00:06 4.0
read_csv has improved support for duplicate column names

Duplicate column names are now supported in read_csv() whether they are in the file or passed in as the names parameter (GH7160, GH9424)
In [24]:data='0,1,2\n3,4,5'In [25]:names=['a','b','a']
Previous behavior:
In [2]:pd.read_csv(StringIO(data),names=names)Out[2]: a b a0 2 1 21 5 4 5
The first a column contained the same data as the second a column, when it should have contained the values [0, 3].
New behavior:
In [26]:pd.read_csv(StringIO(data),names=names)Out[26]: a b a.10 0 1 21 3 4 5
read_csv supports parsing Categorical directly

The read_csv() function now supports parsing a Categorical column when specified as a dtype (GH10153). Depending on the structure of the data, this can result in a faster parse time and lower memory usage compared to converting to Categorical after parsing. See the io docs here.
In [27]:data='col1,col2,col3\na,b,1\na,b,2\nc,d,3'In [28]:pd.read_csv(StringIO(data))Out[28]: col1 col2 col30 a b 11 a b 22 c d 3In [29]:pd.read_csv(StringIO(data)).dtypesOut[29]:col1 objectcol2 objectcol3 int64dtype: objectIn [30]:pd.read_csv(StringIO(data),dtype='category').dtypesOut[30]:col1 categorycol2 categorycol3 categorydtype: object
Individual columns can be parsed as a Categorical using a dict specification
In [31]:pd.read_csv(StringIO(data),dtype={'col1':'category'}).dtypesOut[31]:col1 categorycol2 objectcol3 int64dtype: object
Note
The resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().
In [32]:df=pd.read_csv(StringIO(data),dtype='category')In [33]:df.dtypesOut[33]:col1 categorycol2 categorycol3 categorydtype: objectIn [34]:df['col3']Out[34]:0 11 22 3Name: col3, dtype: categoryCategories (3, object): [1, 2, 3]In [35]:df['col3'].cat.categories=pd.to_numeric(df['col3'].cat.categories)In [36]:df['col3']Out[36]:0 11 22 3Name: col3, dtype: categoryCategories (3, int64): [1, 2, 3]
A function union_categoricals() has been added for combining categoricals, see Unioning Categoricals (GH13361, GH13763, GH13846, GH14173)
In [37]:frompandas.types.concatimportunion_categoricalsIn [38]:a=pd.Categorical(["b","c"])In [39]:b=pd.Categorical(["a","b"])In [40]:union_categoricals([a,b])Out[40]:[b, c, a, b]Categories (3, object): [b, c, a]
concat and append can now concatenate category dtypes with different categories as object dtype (GH13524)
In [41]:s1=pd.Series(['a','b'],dtype='category')In [42]:s2=pd.Series(['b','c'],dtype='category')
Previous behavior:
In [1]:pd.concat([s1,s2])ValueError: incompatible categories in categorical concat
New behavior:
In [43]:pd.concat([s1,s2])Out[43]:0 a1 b0 b1 cdtype: object
Pandas has gained new frequency offsets, SemiMonthEnd ('SM') and SemiMonthBegin ('SMS'). These provide date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively. (GH1543)
In [44]:frompandas.tseries.offsetsimportSemiMonthEnd,SemiMonthBegin
SemiMonthEnd:
In [45]:Timestamp('2016-01-01')+SemiMonthEnd()Out[45]:Timestamp('2016-01-15 00:00:00')In [46]:pd.date_range('2015-01-01',freq='SM',periods=4)Out[46]:DatetimeIndex(['2015-01-15','2015-01-31','2015-02-15','2015-02-28'],dtype='datetime64[ns]',freq='SM-15')
SemiMonthBegin:
In [47]:Timestamp('2016-01-01')+SemiMonthBegin()Out[47]:Timestamp('2016-01-15 00:00:00')In [48]:pd.date_range('2015-01-01',freq='SMS',periods=4)Out[48]:DatetimeIndex(['2015-01-01','2015-01-15','2015-02-01','2015-02-15'],dtype='datetime64[ns]',freq='SMS-15')
Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.
In [49]:pd.date_range('2015-01-01',freq='SMS-16',periods=4)Out[49]:DatetimeIndex(['2015-01-01','2015-01-16','2015-02-01','2015-02-16'],dtype='datetime64[ns]',freq='SMS-16')In [50]:pd.date_range('2015-01-01',freq='SM-14',periods=4)Out[50]:DatetimeIndex(['2015-01-14','2015-01-31','2015-02-14','2015-02-28'],dtype='datetime64[ns]',freq='SM-14')
The following methods and options are added to Index, to be more consistent with the Series and DataFrame API.
Index now supports the .where() function for same shape indexing (GH13170)
In [51]:idx=pd.Index(['a','b','c'])In [52]:idx.where([True,False,True])Out[52]:Index([u'a',nan,u'c'],dtype='object')
Index now supports .dropna() to exclude missing values (GH6194)
In [53]:idx=pd.Index([1,2,np.nan,4])In [54]:idx.dropna()Out[54]:Float64Index([1.0,2.0,4.0],dtype='float64')
For MultiIndex, values are dropped if any level is missing by default. Specifying how='all' only drops values where all levels are missing.
In [55]:midx=pd.MultiIndex.from_arrays([[1,2,np.nan,4], ....:[1,2,np.nan,np.nan]]) ....:In [56]:midxOut[56]:MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1, -1, 2], [0, 1, -1, -1]])In [57]:midx.dropna()Out[57]:MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1], [0, 1]])In [58]:midx.dropna(how='all')Out[58]:MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1, 2], [0, 1, -1]])
Index now supports .str.extractall() which returns a DataFrame, see the docs here (GH10008, GH13156)
In [59]:idx=pd.Index(["a1a2","b1","c1"])In [60]:idx.str.extractall("[ab](?P<digit>\d)")Out[60]: digit match0 0 1 1 21 0 1
Index.astype() now accepts an optional boolean argument copy, which allows optional copying if the requirements on dtype are satisfied (GH13209)
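A minimal sketch of how the new copy argument might be used (the index values here are illustrative):

import pandas as pd

idx = pd.Index([1, 2, 3])
# with copy=False, casting to the same dtype may reuse the existing data
same_dtype = idx.astype('int64', copy=False)
# with copy=True (the default), a copy is always made
copied = idx.astype('int64', copy=True)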
Previous versions of pandas would permanently silence numpy's ufunc error handling when pandas was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as NaNs. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas codebase. (GH13109, GH13145)
After upgrading pandas, you may see new RuntimeWarnings being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning to control how these conditions are handled.
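A sketch of silencing such a warning only around the operation that produces it (the array below is illustrative):

import numpy as np

arr = np.array([0.0, 1.0, np.nan])
# suppress the 'invalid value encountered' RuntimeWarning for just this operation
with np.errstate(invalid='ignore'):
    result = arr / arr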
get_dummies now returns integer dtypes

The pd.get_dummies function now returns dummy-encoded columns as small integers, rather than floats (GH8725). This should provide an improved memory footprint.
Previous behavior:
In [1]:pd.get_dummies(['a','b','a','c']).dtypesOut[1]:a float64b float64c float64dtype: object
New behavior:
In [61]:pd.get_dummies(['a','b','a','c']).dtypesOut[61]:a uint8b uint8c uint8dtype: object
to_numeric

pd.to_numeric() now accepts a downcast parameter, which will downcast the data, if possible, to the smallest specified numerical dtype (GH13352)
In [62]:s=['1',2,3]In [63]:pd.to_numeric(s,downcast='unsigned')Out[63]:array([1,2,3],dtype=uint8)In [64]:pd.to_numeric(s,downcast='integer')Out[64]:array([1,2,3],dtype=int8)
As part of making the pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in future versions of pandas (GH13147, GH13634)
The following are now part of this API:
In [65]:importpprintIn [66]:frompandas.apiimporttypesIn [67]:funcs=[fforfindir(types)ifnotf.startswith('_')]In [68]:pprint.pprint(funcs)['is_any_int_dtype', 'is_bool', 'is_bool_dtype', 'is_categorical', 'is_categorical_dtype', 'is_complex', 'is_complex_dtype', 'is_datetime64_any_dtype', 'is_datetime64_dtype', 'is_datetime64_ns_dtype', 'is_datetime64tz_dtype', 'is_datetimetz', 'is_dict_like', 'is_dtype_equal', 'is_extension_type', 'is_float', 'is_float_dtype', 'is_floating_dtype', 'is_hashable', 'is_int64_dtype', 'is_integer', 'is_integer_dtype', 'is_iterator', 'is_list_like', 'is_named_tuple', 'is_number', 'is_numeric_dtype', 'is_object_dtype', 'is_period', 'is_period_dtype', 'is_re', 'is_re_compilable', 'is_scalar', 'is_sequence', 'is_sparse', 'is_string_dtype', 'is_timedelta64_dtype', 'is_timedelta64_ns_dtype', 'pandas_dtype']
Note
Calling these functions from the internal module pandas.core.common will now show a DeprecationWarning (GH13990)
Timestamp can now accept positional and keyword parameters similar to datetime.datetime() (GH10758, GH11630)
In [69]:pd.Timestamp(2012,1,1)Out[69]:Timestamp('2012-01-01 00:00:00')In [70]:pd.Timestamp(year=2012,month=1,day=1,hour=8,minute=30)Out[70]:Timestamp('2012-01-01 08:30:00')
The .resample() function now accepts an on= or level= parameter for resampling on a datetime-like column or MultiIndex level (GH13500)
In [71]:df=pd.DataFrame({'date':pd.date_range('2015-01-01',freq='W',periods=5), ....:'a':np.arange(5)}, ....:index=pd.MultiIndex.from_arrays([ ....:[1,2,3,4,5], ....:pd.date_range('2015-01-01',freq='W',periods=5)], ....:names=['v','d'])) ....:In [72]:dfOut[72]: a datev d1 2015-01-04 0 2015-01-042 2015-01-11 1 2015-01-113 2015-01-18 2 2015-01-184 2015-01-25 3 2015-01-255 2015-02-01 4 2015-02-01In [73]:df.resample('M',on='date').sum()Out[73]: adate2015-01-31 62015-02-28 4In [74]:df.resample('M',level='d').sum()Out[74]: ad2015-01-31 62015-02-28 4
The .get_credentials() method of GbqConnector can now first try to fetch the application default credentials. See the docs for more details (GH13577).
The .tz_localize() method of DatetimeIndex and Timestamp has gained the errors keyword, so you can potentially coerce nonexistent timestamps to NaT. The default behavior remains raising a NonExistentTimeError (GH13057)
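A sketch of coercing a nonexistent local time to NaT with the new keyword; the timestamp and timezone below are illustrative (2:30 AM does not exist on the US spring-forward date):

import pandas as pd

ts = pd.Timestamp('2015-03-08 02:30:00')
# errors='raise' (the default) would raise NonExistentTimeError here
ts.tz_localize('US/Eastern', errors='coerce')  # returns NaT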
.to_hdf/read_hdf() now accept path objects (e.g. pathlib.Path, py.path.local) for the file path (GH11773)
pd.read_csv() with engine='python' has gained support for the decimal (GH12933), na_filter (GH13321) and memory_map (GH13381) options.
Consistent with the Python API, pd.read_csv() will now interpret +inf as positive infinity (GH13274)
pd.read_html() has gained support for the na_values, converters, keep_default_na options (GH13461)
Categorical.astype() now accepts an optional boolean argument copy, effective when dtype is categorical (GH13209)
DataFrame has gained the .asof() method to return the last non-NaN values according to the selected subset (GH13358)
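A small sketch of .asof() with a subset; the frame and labels below are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10.0, 20.0, 30.0, 40.0],
                   'b': [np.nan, np.nan, 300.0, 400.0]},
                  index=[10, 20, 30, 40])
# last row whose label is <= 25 and whose 'a' value is not NaN
df.asof(25, subset=['a'])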
The DataFrame constructor will now respect key ordering if a list of OrderedDict objects is passed in (GH13304)
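A minimal sketch of the ordering guarantee (the column names are illustrative):

from collections import OrderedDict

import pandas as pd

data = [OrderedDict([('c', 1), ('a', 2), ('b', 3)]),
        OrderedDict([('c', 4), ('a', 5), ('b', 6)])]
# columns come out in insertion order: ['c', 'a', 'b']
pd.DataFrame(data).columns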
pd.read_html() has gained support for the decimal option (GH12907)
Series has gained the properties .is_monotonic, .is_monotonic_increasing, .is_monotonic_decreasing, similar to Index (GH13336)
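A quick sketch of the new properties (the data is illustrative):

import pandas as pd

s = pd.Series([1, 2, 2, 5])
s.is_monotonic              # True (non-decreasing)
s.is_monotonic_increasing   # True
s.is_monotonic_decreasing   # False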
DataFrame.to_sql() now allows a single value as the SQL type for all columns (GH11886).
Series.append now supports the ignore_index option (GH13677)
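A small sketch of the new option (the series are illustrative):

import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])
# without ignore_index the result keeps the duplicated labels 0, 1, 0, 1
s1.append(s2, ignore_index=True)   # index becomes 0, 1, 2, 3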
.to_stata() and StataWriter can now write variable labels to Stata dta files using a dictionary to map column names to labels (GH13535, GH13536)
.to_stata() and StataWriter will automatically convert datetime64[ns] columns to Stata format %tc, rather than raising a ValueError (GH12259)
read_stata() and StataReader raise with a more explicit error message when reading Stata files with repeated value labels when convert_categoricals=True (GH13923)
DataFrame.style will now render sparsified MultiIndexes (GH11655)
DataFrame.style will now show column level names (e.g. DataFrame.columns.names) (GH13775)
DataFrame has gained support to re-order the columns based on the values in a row using df.sort_values(by='...', axis=1) (GH10806)
In [75]:df=pd.DataFrame({'A':[2,7],'B':[3,5],'C':[4,8]}, ....:index=['row1','row2']) ....:In [76]:dfOut[76]: A B Crow1 2 3 4row2 7 5 8In [77]:df.sort_values(by='row2',axis=1)Out[77]: B A Crow1 3 2 4row2 5 7 8
Added documentation to I/O regarding the perils of reading in columns with mixed dtypes and how to handle it (GH13746)
to_html() now has a border argument to control the value in the opening <table> tag. The default is the value of the html.border option, which defaults to 1. This also affects the notebook HTML repr, but since Jupyter's CSS includes a border-width attribute, the visual effect is the same. (GH11563).
Raise ImportError in the sql functions when sqlalchemy is not installed and a connection string is used (GH11920).
Compatibility with matplotlib 2.0. Older versions of pandas should also work with matplotlib 2.0 (GH13333)
Timestamp, Period, DatetimeIndex, PeriodIndex and the .dt accessor have gained a .is_leap_year property to check whether the date belongs to a leap year. (GH13727)
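A minimal sketch (the dates are illustrative):

import pandas as pd

pd.Timestamp('2016-02-29').is_leap_year        # True
pd.Period('2015', freq='A').is_leap_year       # False
pd.date_range('2015-12-31', periods=3, freq='A').is_leap_year  # array([False, True, False])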
astype() will now accept a dict of column name to data type mappings as the dtype argument. (GH12086)
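A small sketch of the dict form (the column names and dtypes are illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10.5, 20.5]})
# cast only the named columns; unlisted columns keep their dtype
df.astype({'a': 'float64'}).dtypes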
pd.read_json and DataFrame.to_json have gained support for reading and writing json lines with the lines option, see Line delimited json (GH9180)
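A small round-trip sketch of the lines option, assuming the string returned by to_json is passed back to read_json (the frame is illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
# one JSON object per line
jsonl = df.to_json(orient='records', lines=True)
pd.read_json(jsonl, orient='records', lines=True)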
read_excel() now supports the true_values and false_values keyword arguments (GH13347)
groupby() will now accept a scalar and a single-element list for specifying level on a non-MultiIndex grouper. (GH13907)
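A small sketch on a non-MultiIndex grouper (the index name is illustrative):

import pandas as pd

s = pd.Series([1, 2, 3], index=pd.Index(['a', 'a', 'b'], name='k'))
s.groupby(level='k').sum()     # scalar level
s.groupby(level=['k']).sum()   # a single-element list now works too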
Non-convertible dates in an Excel date column will be returned without conversion and the column will be object dtype, rather than raising an exception (GH10001).
pd.Timedelta(None) is now accepted and will return NaT, mirroring pd.Timestamp (GH13687)
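A minimal illustration of the mirrored behavior:

import pandas as pd

pd.Timedelta(None)    # NaT
pd.Timestamp(None)    # NaT, the behavior being mirrored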
pd.read_stata() can now handle some format 111 files, which are produced by SAS when generating Stata dta files (GH11526)
Series and Index now support divmod, which will return a tuple of series or indices. This behaves like a standard binary operator with regard to broadcasting rules (GH14208).
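A small sketch of divmod on a Series (the values are illustrative):

import pandas as pd

s = pd.Series([7, 8, 9])
# returns a tuple of Series: element-wise quotient and remainder
quotient, remainder = divmod(s, 3)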
Series.tolist() will now return Python types

Series.tolist() will now return Python types in the output, mimicking NumPy .tolist() behavior (GH10904)
In [78]:s=pd.Series([1,2,3])
Previous behavior:
In [7]:type(s.tolist()[0])Out[7]: <class 'numpy.int64'>
New behavior:
In [79]:type(s.tolist()[0])Out[79]:int
Series operators for different indexes

The following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)
- Series comparison operators now raise ValueError when index are different.
- Series logical operators align both index of left and right hand side.

Warning
Until 0.18.1, comparing Series with the same length would succeed even if the .index are different (the result ignores .index). As of 0.19.0, this will raise a ValueError to be more strict. This section also describes how to keep the previous behavior or align different indexes, using the flexible comparison methods like .eq.
As a result,Series andDataFrame operators behave as below:
Arithmetic operators align both index (no changes).
In [80]:s1=pd.Series([1,2,3],index=list('ABC'))In [81]:s2=pd.Series([2,2,2],index=list('ABD'))In [82]:s1+s2Out[82]:A 3.0B 4.0C NaND NaNdtype: float64In [83]:df1=pd.DataFrame([1,2,3],index=list('ABC'))In [84]:df2=pd.DataFrame([2,2,2],index=list('ABD'))In [85]:df1+df2Out[85]: 0A 3.0B 4.0C NaND NaN
Comparison operators raise ValueError when .index are different.
Previous Behavior (Series):
Series compared values ignoring the .index as long as both had the same length:
In [1]:s1==s2Out[1]:A FalseB TrueC Falsedtype: bool
New behavior (Series):
In [2]:s1==s2Out[2]:ValueError: Can only compare identically-labeled Series objects
Note
To achieve the same result as previous versions (compare values based on locations ignoring .index), compare both .values.
In [86]:s1.values==s2.valuesOut[86]:array([False,True,False],dtype=bool)
If you want to compare Series aligning its .index, see the flexible comparison methods section below:
In [87]:s1.eq(s2)Out[87]:A FalseB TrueC FalseD Falsedtype: bool
Current Behavior (DataFrame, no change):
In [3]:df1==df2Out[3]:ValueError: Can only compare identically-labeled DataFrame objects
Logical operators align both .index of left and right hand side.
Previous behavior (Series), only the left hand side index was kept:
In [4]:s1=pd.Series([True,False,True],index=list('ABC'))In [5]:s2=pd.Series([True,True,True],index=list('ABD'))In [6]:s1&s2Out[6]:A TrueB FalseC Falsedtype: bool
New behavior (Series):
In [88]:s1=pd.Series([True,False,True],index=list('ABC'))In [89]:s2=pd.Series([True,True,True],index=list('ABD'))In [90]:s1&s2Out[90]:A TrueB FalseC FalseD Falsedtype: bool
Note
Series logical operators fill a NaN result with False.
Note
To achieve the same result as previous versions (compare values based on only the left hand side index), you can use reindex_like:
In [91]:s1&s2.reindex_like(s1)Out[91]:A TrueB FalseC Falsedtype: bool
Current Behavior (DataFrame, no change):
In [92]:df1=pd.DataFrame([True,False,True],index=list('ABC'))In [93]:df2=pd.DataFrame([True,True,True],index=list('ABD'))In [94]:df1&df2Out[94]: 0A TrueB FalseC NaND NaN
Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both index. Use these operators if you want to compare two Series which have different index.
In [95]:s1=pd.Series([1,2,3],index=['a','b','c'])In [96]:s2=pd.Series([2,2,2],index=['b','c','d'])In [97]:s1.eq(s2)Out[97]:a Falseb Truec Falsed Falsedtype: boolIn [98]:s1.ge(s2)Out[98]:a Falseb Truec Trued Falsedtype: bool
Previously, this worked the same as comparison operators (see above).
Series type promotion on assignment

A Series will now correctly promote its dtype for assignment with values incompatible with the current dtype (GH13234)
In [99]:s=pd.Series()
Previous behavior:
In [2]:s["a"]=pd.Timestamp("2016-01-01")In [3]:s["b"]=3.0TypeError: invalid type promotion
New behavior:
In [100]:s["a"]=pd.Timestamp("2016-01-01")In [101]:s["b"]=3.0In [102]:sOut[102]:a 2016-01-01 00:00:00b 3dtype: objectIn [103]:s.dtypeOut[103]:dtype('O')
.to_datetime() changes

Previously if .to_datetime() encountered mixed integers/floats and strings, but no datetimes, with errors='coerce' it would convert all to NaT.
Previous behavior:
In [2]:pd.to_datetime([1,'foo'],errors='coerce')Out[2]:DatetimeIndex(['NaT','NaT'],dtype='datetime64[ns]',freq=None)
Current behavior:
This will now convert integers/floats with the default unit of ns.
In [104]:pd.to_datetime([1,'foo'],errors='coerce')Out[104]:DatetimeIndex(['1970-01-01 00:00:00.000000001','NaT'],dtype='datetime64[ns]',freq=None)
Bug fixes related to .to_datetime():
- pd.to_datetime() when passing integers or floats, and no unit and errors='coerce' (GH13180).
- pd.to_datetime() when passing invalid datatypes (e.g. bool); will now respect the errors keyword (GH13176)
- pd.to_datetime() which overflowed on int8, and int16 dtypes (GH13451)
- pd.to_datetime() raise AttributeError with NaN and the other string is not valid when errors='ignore' (GH12424)
- pd.to_datetime() did not cast floats correctly when unit was specified, resulting in truncated datetime (GH13834)

Merging will now preserve the dtype of the join keys (GH8596)
In [105]:df1=pd.DataFrame({'key':[1],'v1':[10]})In [106]:df1Out[106]: key v10 1 10In [107]:df2=pd.DataFrame({'key':[1,2],'v1':[20,30]})In [108]:df2Out[108]: key v10 1 201 2 30
Previous behavior:
In [5]:pd.merge(df1,df2,how='outer')Out[5]: key v10 1.0 10.01 1.0 20.02 2.0 30.0In [6]:pd.merge(df1,df2,how='outer').dtypesOut[6]:key float64v1 float64dtype: object
New behavior:
We are able to preserve the join keys
In [109]:pd.merge(df1,df2,how='outer')Out[109]: key v10 1 101 1 202 2 30In [110]:pd.merge(df1,df2,how='outer').dtypesOut[110]:key int64v1 int64dtype: object
Of course if you have missing values that are introduced, then the resulting dtype will be upcast, which is unchanged from previous.
In [111]:pd.merge(df1,df2,how='outer',on='key')Out[111]: key v1_x v1_y0 1 10.0 201 2 NaN 30In [112]:pd.merge(df1,df2,how='outer',on='key').dtypesOut[112]:key int64v1_x float64v1_y int64dtype: object
.describe() changes

Percentile identifiers in the index of a .describe() output will now be rounded to the least precision that keeps them distinct (GH13104)
In [113]:s=pd.Series([0,1,2,3,4])In [114]:df=pd.DataFrame([0,1,2,3,4])
Previous behavior:
The percentiles were rounded to at most one decimal place, which could raise ValueError for a data frame if the percentiles were duplicated.
In [3]:s.describe(percentiles=[0.0001,0.0005,0.001,0.999,0.9995,0.9999])Out[3]:count 5.000000mean 2.000000std 1.581139min 0.0000000.0% 0.0004000.1% 0.0020000.1% 0.00400050% 2.00000099.9% 3.996000100.0% 3.998000100.0% 3.999600max 4.000000dtype: float64In [4]:df.describe(percentiles=[0.0001,0.0005,0.001,0.999,0.9995,0.9999])Out[4]:...ValueError: cannot reindex from a duplicate axis
New behavior:
In [115]:s.describe(percentiles=[0.0001,0.0005,0.001,0.999,0.9995,0.9999])Out[115]:count 5.000000mean 2.000000std 1.581139min 0.0000000.01% 0.0004000.05% 0.0020000.1% 0.00400050% 2.00000099.9% 3.99600099.95% 3.99800099.99% 3.999600max 4.000000dtype: float64In [116]:df.describe(percentiles=[0.0001,0.0005,0.001,0.999,0.9995,0.9999])Out[116]: 0count 5.000000mean 2.000000std 1.581139min 0.0000000.01% 0.0004000.05% 0.0020000.1% 0.00400050% 2.00000099.9% 3.99600099.95% 3.99800099.99% 3.999600max 4.000000
Furthermore:
- Passing duplicated percentiles will now raise a ValueError.
- .describe() on a DataFrame with a mixed-dtype column index, which would previously raise a TypeError (GH13288)

Period changes

PeriodIndex now has period dtype

PeriodIndex now has its own period dtype. The period dtype is a pandas extension dtype like category or the timezone aware dtype (datetime64[ns, tz]) (GH13941). As a consequence of this change, PeriodIndex no longer has an integer dtype:
Previous behavior:
In [1]:pi=pd.PeriodIndex(['2016-08-01'],freq='D')In [2]:piOut[2]:PeriodIndex(['2016-08-01'],dtype='int64',freq='D')In [3]:pd.api.types.is_integer_dtype(pi)Out[3]:TrueIn [4]:pi.dtypeOut[4]:dtype('int64')
New behavior:
In [117]:pi=pd.PeriodIndex(['2016-08-01'],freq='D')In [118]:piOut[118]:PeriodIndex(['2016-08-01'],dtype='period[D]',freq='D')In [119]:pd.api.types.is_integer_dtype(pi)Out[119]:FalseIn [120]:pd.api.types.is_period_dtype(pi)Out[120]:TrueIn [121]:pi.dtypeOut[121]:period[D]In [122]:type(pi.dtype)Out[122]:pandas.types.dtypes.PeriodDtype
Period('NaT') now returns pd.NaT

Previously, Period had its own Period('NaT') representation, different from pd.NaT. Now Period('NaT') has been changed to return pd.NaT. (GH12759, GH13582)
Previous behavior:
In [5]:pd.Period('NaT',freq='D')Out[5]:Period('NaT','D')
New behavior:
These result in pd.NaT without providing the freq option.
In [123]:pd.Period('NaT')Out[123]:NaTIn [124]:pd.Period(None)Out[124]:NaT
To be compatible with Period addition and subtraction, pd.NaT now supports addition and subtraction with int. Previously it raised ValueError.
Previous behavior:
In [5]:pd.NaT+1...ValueError: Cannot add integral value to Timestamp without freq.
New behavior:
In [125]:pd.NaT+1Out[125]:NaTIn [126]:pd.NaT-1Out[126]:NaT
PeriodIndex.values now returns array of Period objects

.values is changed to return an array of Period objects, rather than an array of integers (GH13988).
Previous behavior:
In [6]:pi=pd.PeriodIndex(['2011-01','2011-02'],freq='M')In [7]:pi.valuesarray([492, 493])
New behavior:
In [127]:pi=pd.PeriodIndex(['2011-01','2011-02'],freq='M')In [128]:pi.valuesOut[128]:array([Period('2011-01','M'),Period('2011-02','M')],dtype=object)
+ / - no longer used for set operations

Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) previously performed set operations (set union and difference). This behavior was already deprecated since 0.15.0 (in favor of using the specific .union() and .difference() methods), and is now disabled. When possible, + and - are now used for element-wise operations, for example for concatenating strings or subtracting datetimes (GH8227, GH14127).
Previous behavior:
In [1]:pd.Index(['a','b'])+pd.Index(['a','c'])FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union()Out[1]:Index(['a','b','c'],dtype='object')
New behavior: the same operation will now perform element-wise addition:
In [129]:pd.Index(['a','b'])+pd.Index(['a','c'])Out[129]:Index([u'aa',u'bc'],dtype='object')
Note that numeric Index objects already performed element-wise operations. For example, the behavior of adding two integer Indexes is unchanged. The base Index is now made consistent with this behavior.
In [130]:pd.Index([1,2,3])+pd.Index([2,3,4])Out[130]:Int64Index([3,5,7],dtype='int64')
Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a TimedeltaIndex:
Previous behavior:
In [1]:pd.DatetimeIndex(['2016-01-01','2016-01-02'])-pd.DatetimeIndex(['2016-01-02','2016-01-03'])FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()Out[1]:DatetimeIndex(['2016-01-01'],dtype='datetime64[ns]',freq=None)
New behavior:
In [131]:pd.DatetimeIndex(['2016-01-01','2016-01-02'])-pd.DatetimeIndex(['2016-01-02','2016-01-03'])Out[131]:TimedeltaIndex(['-1 days','-1 days'],dtype='timedelta64[ns]',freq=None)
Index.difference and .symmetric_difference changes

Index.difference and Index.symmetric_difference will now, more consistently, treat NaN values as any other values. (GH13514)
In [132]:idx1=pd.Index([1,2,3,np.nan])In [133]:idx2=pd.Index([0,1,np.nan])
Previous behavior:
In [3]:idx1.difference(idx2)Out[3]:Float64Index([nan,2.0,3.0],dtype='float64')In [4]:idx1.symmetric_difference(idx2)Out[4]:Float64Index([0.0,nan,2.0,3.0],dtype='float64')
New behavior:
In [134]:idx1.difference(idx2)Out[134]:Float64Index([2.0,3.0],dtype='float64')In [135]:idx1.symmetric_difference(idx2)Out[135]:Float64Index([0.0,2.0,3.0],dtype='float64')
Index.unique consistently returns Index

Index.unique() now returns unique values as an Index of the appropriate dtype (GH13395). Previously, most Index classes returned np.ndarray, and DatetimeIndex, TimedeltaIndex and PeriodIndex returned Index to keep metadata like timezone.
Previous behavior:
In [1]:pd.Index([1,2,3]).unique()Out[1]:array([1,2,3])In [2]:pd.DatetimeIndex(['2011-01-01','2011-01-02','2011-01-03'],tz='Asia/Tokyo').unique()Out[2]:DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)
New behavior:
In [136]:pd.Index([1,2,3]).unique()Out[136]:Int64Index([1,2,3],dtype='int64')In [137]:pd.DatetimeIndex(['2011-01-01','2011-01-02','2011-01-03'],tz='Asia/Tokyo').unique()Out[137]:DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)
MultiIndex constructors, groupby and set_index preserve categorical dtypes

MultiIndex.from_arrays and MultiIndex.from_product will now preserve categorical dtype in MultiIndex levels (GH13743, GH13854).
In [138]:cat=pd.Categorical(['a','b'],categories=list("bac"))In [139]:lvl1=['foo','bar']In [140]:midx=pd.MultiIndex.from_arrays([cat,lvl1])In [141]:midxOut[141]:MultiIndex(levels=[[u'b', u'a', u'c'], [u'bar', u'foo']], labels=[[1, 0], [1, 0]])
Previous behavior:
In [4]:midx.levels[0]Out[4]:Index(['b','a','c'],dtype='object')In [5]:midx.get_level_values[0]Out[5]:Index(['a','b'],dtype='object')
New behavior: the single level is now a CategoricalIndex:
In [142]:midx.levels[0]Out[142]:CategoricalIndex([u'b',u'a',u'c'],categories=[u'b',u'a',u'c'],ordered=False,dtype='category')In [143]:midx.get_level_values(0)Out[143]:CategoricalIndex([u'a',u'b'],categories=[u'b',u'a',u'c'],ordered=False,dtype='category')
An analogous change has been made to MultiIndex.from_product. As a consequence, groupby and set_index also preserve categorical dtypes in indexes.
In [144]:df=pd.DataFrame({'A':[0,1],'B':[10,11],'C':cat})In [145]:df_grouped=df.groupby(by=['A','C']).first()In [146]:df_set_idx=df.set_index(['A','C'])
Previous behavior:
In [11]:df_grouped.index.levels[1]Out[11]:Index(['b','a','c'],dtype='object',name='C')In [12]:df_grouped.reset_index().dtypesOut[12]:A int64C objectB float64dtype: objectIn [13]:df_set_idx.index.levels[1]Out[13]:Index(['b','a','c'],dtype='object',name='C')In [14]:df_set_idx.reset_index().dtypesOut[14]:A int64C objectB int64dtype: object
New behavior:
In [147]:df_grouped.index.levels[1]Out[147]:CategoricalIndex([u'b',u'a',u'c'],categories=[u'b',u'a',u'c'],ordered=False,name=u'C',dtype='category')In [148]:df_grouped.reset_index().dtypesOut[148]:A int64C categoryB float64dtype: objectIn [149]:df_set_idx.index.levels[1]Out[149]:CategoricalIndex([u'b',u'a',u'c'],categories=[u'b',u'a',u'c'],ordered=False,name=u'C',dtype='category')In [150]:df_set_idx.reset_index().dtypesOut[150]:A int64C categoryB int64dtype: object
read_csv will progressively enumerate chunks

When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. They are now given instead a progressive index, starting from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv() without the chunksize= argument (GH12185).
In [151]:data='A,B\n0,1\n2,3\n4,5\n6,7'
Previous behavior:
In [2]:pd.concat(pd.read_csv(StringIO(data),chunksize=2))Out[2]: A B0 0 11 2 30 4 51 6 7
New behavior:
In [152]:pd.concat(pd.read_csv(StringIO(data),chunksize=2))Out[152]: A B0 0 11 2 32 4 53 6 7
Sparse changes

These changes allow pandas to handle sparse data with more dtypes, and aim for a smoother experience with data handling.
int64 and bool support enhancements

Sparse data structures have gained enhanced support of int64 and bool dtype (GH667, GH13849).
Previously, sparse data were float64 dtype by default, even if all inputs were of int or bool dtype. You had to specify dtype explicitly to create sparse data with int64 dtype. Also, fill_value had to be specified explicitly because the default was np.nan, which doesn't appear in int64 or bool data.
In [1]:pd.SparseArray([1,2,0,0])Out[1]:[1.0, 2.0, 0.0, 0.0]Fill: nanIntIndexIndices: array([0, 1, 2, 3], dtype=int32)# specifying int64 dtype, but all values are stored in sp_values because# fill_value default is np.nanIn [2]:pd.SparseArray([1,2,0,0],dtype=np.int64)Out[2]:[1, 2, 0, 0]Fill: nanIntIndexIndices: array([0, 1, 2, 3], dtype=int32)In [3]:pd.SparseArray([1,2,0,0],dtype=np.int64,fill_value=0)Out[3]:[1, 2, 0, 0]Fill: 0IntIndexIndices: array([0, 1], dtype=int32)
As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value defaults (0 for int64 dtype, False for bool dtype).
In [153]:pd.SparseArray([1,2,0,0],dtype=np.int64)Out[153]:[1, 2, 0, 0]Fill: 0IntIndexIndices: array([0, 1], dtype=int32)In [154]:pd.SparseArray([True,False,False,False])Out[154]:[True, False, False, False]Fill: FalseIntIndexIndices: array([0], dtype=int32)
See the docs for more details.
Sparse data structures can now preserve dtype after arithmetic ops (GH13848)
In [155]:s=pd.SparseSeries([0,2,0,1],fill_value=0,dtype=np.int64)In [156]:s.dtypeOut[156]:dtype('int64')In [157]:s+1Out[157]:0 11 32 13 2dtype: int64BlockIndexBlock locations: array([1, 3], dtype=int32)Block lengths: array([1, 1], dtype=int32)
Sparse data structure now supportastype to convert internaldtype (GH13900)
In [158]:s=pd.SparseSeries([1.,0.,2.,0.],fill_value=0)In [159]:sOut[159]:0 1.01 0.02 2.03 0.0dtype: float64BlockIndexBlock locations: array([0, 2], dtype=int32)Block lengths: array([1, 1], dtype=int32)In [160]:s.astype(np.int64)Out[160]:0 11 02 23 0dtype: int64BlockIndexBlock locations: array([0, 2], dtype=int32)Block lengths: array([1, 1], dtype=int32)
astype fails if data contains values which cannot be converted to the specified dtype. Note that the limitation is applied to fill_value, whose default is np.nan.
In [7]:pd.SparseSeries([1.,np.nan,2.,np.nan],fill_value=np.nan).astype(np.int64)Out[7]:ValueError: unable to coerce current fill_value nan to int64 dtype
- SparseDataFrame and SparseSeries now preserve class types when slicing or transposing. (GH13787)
- SparseArray with bool dtype now supports logical (bool) operators (GH14000)
- SparseSeries with MultiIndex [] indexing may raise IndexError (GH13144)
- SparseSeries with MultiIndex [] indexing result may have normal Index (GH13144)
- SparseDataFrame in which axis=None did not default to axis=0 (GH13048)
- SparseSeries and SparseDataFrame creation with object dtype may raise TypeError (GH11633)
- SparseDataFrame doesn't respect passed SparseArray or SparseSeries 's dtype and fill_value (GH13866)
- SparseArray and SparseSeries don't apply ufunc to fill_value (GH13853)
- SparseSeries.abs incorrectly keeps negative fill_value (GH13853)
- SparseDataFrames, types were previously forced to float (GH13917)
- SparseSeries slicing changes integer dtype to float (GH8292)
- SparseDataFrame comparison ops may raise TypeError (GH13001)
- SparseDataFrame.isnull raises ValueError (GH8276)
- SparseSeries representation with bool dtype may raise IndexError (GH13110)
- SparseSeries and SparseDataFrame of bool or int64 dtype may display its values like float64 dtype (GH13110)
- SparseArray with bool dtype may return incorrect result (GH13985)
- SparseArray created from SparseSeries may lose dtype (GH13999)
- SparseSeries comparison with dense returns normal Series rather than SparseSeries (GH13999)

Note
This change only affects 64-bit Python running on Windows, and only affects relatively advanced indexing operations.
Methods such as Index.get_indexer that return an indexer array coerce that array to a "platform int", so that it can be directly used in 3rd party library operations like numpy.take. Previously, a platform int was defined as np.int_, which corresponds to a C integer, but the correct type, and what is being used now, is np.intp, which corresponds to the C integer size that can hold a pointer (GH3033, GH13972).
These types are the same on many platforms, but for 64-bit Python on Windows, np.int_ is 32 bits, and np.intp is 64 bits. Changing this behavior improves performance for many operations on that platform.
Previous behavior:
In [1]:i=pd.Index(['a','b','c'])In [2]:i.get_indexer(['b','b','c']).dtypeOut[2]:dtype('int32')
New behavior:
In [1]:i=pd.Index(['a','b','c'])In [2]:i.get_indexer(['b','b','c']).dtypeOut[2]:dtype('int64')
Timestamp.to_pydatetime will issue aUserWarning whenwarn=True, and the instance has a non-zero number of nanoseconds, previously this would print a message to stdout (GH14101).Series.unique() with datetime and timezone now returns return array ofTimestamp with timezone (GH13565).Panel.to_sparse() will raise aNotImplementedError exception when called (GH13778).Index.reshape() will raise aNotImplementedError exception when called (GH12882)..filter() enforces mutual exclusion of the keyword arguments (GH12399).eval‘s upcasting rules forfloat32 types have been updated to be more consistent with NumPy’s rules. New behavior will not upcast tofloat64 if you multiply a pandasfloat32 object by a scalar float64 (GH12388).UnsupportedFunctionCall error is now raised if NumPy ufuncs likenp.mean are called on groupby or resample objects (GH12811).__setitem__ will no longer apply a callable rhs as a function instead of storing it. Callwhere directly to get the previous behavior (GH13299)..sample() will respect the random seed set vianumpy.random.seed(n) (GH13161)Styler.apply is now more strict about the outputs your function must return. Foraxis=0 oraxis=1, the output shape must be identical. Foraxis=None, the output must be a DataFrame with identical columns and index labels (GH13222).Float64Index.astype(int) will now raiseValueError ifFloat64Index containsNaN values (GH13149)TimedeltaIndex.astype(int) andDatetimeIndex.astype(int) will now returnInt64Index instead ofnp.array (GH13209)Period with multiple frequencies to normalIndex now returnsIndex withobject dtype (GH13664)PeriodIndex.fillna withPeriod has different freq now coerces toobject dtype (GH13664)DataFrame.boxplot(by=col) now return aSeries whenreturn_type is not None. Previously these returned anOrderedDict. Note that whenreturn_type=None, the default, these still return a 2-D NumPy array (GH12216,GH7096).pd.read_hdf will now raise aValueError instead ofKeyError, if a mode other thanr,r+ anda is supplied. 
(GH13623)pd.read_csv(),pd.read_table(), andpd.read_hdf() raise the builtinFileNotFoundError exception for Python 3.x when called on a nonexistent file; this is back-ported asIOError in Python 2.x (GH14086)CParserError (GH13652).pd.read_csv() in the C engine will now issue aParserWarning or raise aValueError whensep encoded is more than one character long (GH14065)DataFrame.values will now returnfloat64 with aDataFrame of mixedint64 anduint64 dtypes, conforming tonp.find_common_type (GH10364,GH13917).groupby.groups will now return a dictionary ofIndex objects, rather than a dictionary ofnp.ndarray orlists (GH14293)Series.reshape andCategorical.reshape have been deprecated and will be removed in a subsequent release (GH12882,GH12882)PeriodIndex.to_datetime has been deprecated in favor ofPeriodIndex.to_timestamp (GH8254)Timestamp.to_datetime has been deprecated in favor ofTimestamp.to_pydatetime (GH8254)Index.to_datetime andDatetimeIndex.to_datetime have been deprecated in favor ofpd.to_datetime (GH8254)pandas.core.datetools module has been deprecated and will be removed in a subsequent release (GH14094)SparseList has been deprecated and will be removed in a future version (GH13784)DataFrame.to_html() andDataFrame.to_latex() have dropped thecolSpace parameter in favor ofcol_space (GH13857)DataFrame.to_sql() has deprecated theflavor parameter, as it is superfluous when SQLAlchemy is not installed (GH13611)read_csv keywords:compact_ints anduse_unsigned have been deprecated and will be removed in a future version (GH13320)buffer_lines has been deprecated and will be removed in a future version (GH13360)as_recarray has been deprecated and will be removed in a future version (GH13373)skip_footer has been deprecated in favor ofskipfooter and will be removed in a future version (GH13349)pd.ordered_merge() has been renamed topd.merge_ordered() and the original name will be removed in a future version (GH13358)Timestamp.offset property (and named arg in the constructor), has been deprecated in favor offreq (GH12160)pd.tseries.util.pivot_annual is deprecated. Usepivot_table as alternative, an example ishere (GH736)pd.tseries.util.isleapyear has been deprecated and will be removed in a subsequent release. Datetime-likes now have a.is_leap_year property (GH13727)Panel4D andPanelND constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data are with thexarray package. Pandas provides ato_xarray() method to automate this conversion (GH13564).pandas.tseries.frequencies.get_standard_freq is deprecated. 
Usepandas.tseries.frequencies.to_offset(freq).rule_code instead (GH13874)pandas.tseries.frequencies.to_offset‘sfreqstr keyword is deprecated in favor offreq (GH13874)Categorical.from_array has been deprecated and will be removed in a future version (GH13854)SparsePanel class has been removed (GH13778)pd.sandbox module has been removed in favor of the external librarypandas-qt (GH13670)pandas.io.data andpandas.io.wb modules are removed in favor ofthepandas-datareader package (GH13724).pandas.tools.rplot module has been removed in favor oftheseaborn package (GH13855)DataFrame.to_csv() has dropped theengine parameter, as was deprecated in 0.17.1 (GH11274,GH13419)DataFrame.to_dict() has dropped theouttype parameter in favor oforient (GH13627,GH8486)pd.Categorical has dropped setting of theordered attribute directly in favor of theset_ordered method (GH13671)pd.Categorical has dropped thelevels attribute in favor ofcategories (GH8376)DataFrame.to_sql() has dropped themysql option for theflavor parameter (GH13611)Panel.shift() has dropped thelags parameter in favor ofperiods (GH14041)pd.Index has dropped thediff method in favor ofdifference (GH13669)pd.DataFrame has dropped theto_wide method in favor ofto_panel (GH14039)Series.to_csv has dropped thenanRep parameter in favor ofna_rep (GH13804)Series.xs,DataFrame.xs,Panel.xs,Panel.major_xs, andPanel.minor_xs have dropped thecopy parameter (GH13781)str.split has dropped thereturn_type parameter in favor ofexpand (GH13701)ValueError. For the list of currently supported offsets, seehere.return_type parameter forDataFrame.plot.box andDataFrame.boxplot changed fromNone to"axes". These methods will now return a matplotlib axes by default instead of a dictionary of artists. Seehere (GH6581).tquery anduquery functions in thepandas.io.sql module are removed (GH5950).IntIndex.intersect (GH13082)BlockIndex when the number of blocks are large, though recommended to useIntIndex in such cases (GH13082)DataFrame.quantile() as it now operates per-block (GH11623)DataFrameGroupBy.transform (GH12737)Index andSeries.duplicated (GH10235)Index.difference (GH12044)RangeIndex.is_monotonic_increasing andis_monotonic_decreasing (GH13749)DatetimeIndex (GH13692)Period (GH12817)factorize of datetime with timezone (GH13750)groupby.groups (GH14293)groupby().shift(), which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (GH13813)groupby().cumsum() calculatingcumprod whenaxis=1. (GH13994)pd.to_timedelta() in which theerrors parameter was not being respected (GH13613)io.json.json_normalize(), where non-ascii keys raised an exception (GH13213)Series asxerr oryerr in.plot() (GH11858)DataFrame assignment with an object-dtypedIndex where the resultant column is mutable to the original object. (GH13522)AutoDataFormatter; this restores the second scaled formatting and re-adds micro-second scaled formatting (GH13131)HDFStore with a fixed format andstart and/orstop specified will now return the selected range (GH8287)Categorical.from_codes() where an unhelpful error was raised when an invalidordered parameter was passed in (GH14058)Series construction from a tuple of integers on windows not returning default dtype (int64) (GH13646)TimedeltaIndex addition with a Datetime-like object where addition overflow was not being caught (GH14068).groupby(..).resample(..) 
when the same object is called multiple times (GH13174).to_records() when index name is a unicode string (GH13172).memory_usage() on object which doesn’t implement (GH12924)Series.quantile with nans (also shows up in.median() and.describe() ); furthermore now names theSeries with the quantile (GH13098,GH13146)SeriesGroupBy.transform with datetime values and missing groups (GH13191)Series were incorrectly coerced in datetime-like numeric operations (GH13844)Categorical constructor when passed aCategorical containing datetimes with timezones (GH14190)Series.str.extractall() withstr index raisesValueError (GH13156)Series.str.extractall() with single group and quantifier (GH13382)DatetimeIndex andPeriod subtraction raisesValueError orAttributeError rather thanTypeError (GH13078)Index andSeries created withNaN andNaT mixed data may not havedatetime64 dtype (GH13324)Index andSeries may ignorenp.datetime64('nat') andnp.timdelta64('nat') to infer dtype (GH13324)PeriodIndex andPeriod subtraction raisesAttributeError (GH13071)PeriodIndex construction returning afloat64 index in some circumstances (GH13067).resample(..) with aPeriodIndex not changing itsfreq appropriately when empty (GH13067).resample(..) with aPeriodIndex not retaining its type or name with an emptyDataFrame appropriately when empty (GH13212)groupby(..).apply(..) when the passed function returns scalar values per group (GH13468).groupby(..).resample(..) where passing some keywords would raise an exception (GH13235).tz_convert on a tz-awareDateTimeIndex that relied on index being sorted for correct results (GH13306).tz_localize withdateutil.tz.tzlocal may return incorrect result (GH13583)DatetimeTZDtype dtype withdateutil.tz.tzlocal cannot be regarded as valid dtype (GH13583)pd.read_hdf() where attempting to load an HDF file with a single dataset, that had one or more categorical columns, failed unless the key argument was set to the name of the dataset. 
(GH13231).rolling() that allowed a negative integer window in contruction of theRolling() object, but would later fail on aggregation (GH13383)Series indexing with tuple-valued data and a numeric index (GH13509)pd.DataFrame where unusual elements with theobject dtype were causing segfaults (GH13717)Series which could result in segfaults (GH13445)DatetimeIndex, which did not honour thecopy=True (GH13205)DatetimeIndex.is_normalized returns incorrectly for normalized date_range in case of local timezones (GH13459)pd.concat and.append may coercesdatetime64 andtimedelta toobject dtype containing python built-indatetime ortimedelta rather thanTimestamp orTimedelta (GH13626)PeriodIndex.append may raisesAttributeError when the result isobject dtype (GH13221)CategoricalIndex.append may accept normallist (GH13626)pd.concat and.append with the same timezone get reset to UTC (GH7795)Series andDataFrame.append raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13626)DataFrame.to_csv() in which float values were being quoted even though quotations were specified for non-numeric values only (GH12922,GH13259)DataFrame.describe() raisingValueError with only boolean columns (GH13898)MultiIndex slicing where extra elements were returned when level is non-unique (GH12896).str.replace does not raiseTypeError for invalid replacement (GH13438)MultiIndex.from_arrays which didn’t check for input array lengths matching (GH13599)cartesian_product andMultiIndex.from_product which may raise with empty input arrays (GH12258)pd.read_csv() which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (GH13703)pd.read_csv() which caused errors to be raised when a dictionary containing scalars is passed in forna_values (GH12224)pd.read_csv() which caused BOM files to be incorrectly parsed by not ignoring the BOM (GH4793)pd.read_csv() withengine='python' which raised errors when a numpy array was passed in forusecols (GH12546)pd.read_csv() where the index columns were being incorrectly parsed when parsed as dates with athousands parameter (GH14066)pd.read_csv() withengine='python' in whichNaN values weren’t being detected after data was converted to numeric values (GH13314)pd.read_csv() in which thenrows argument was not properly validated for both engines (GH10476)pd.read_csv() withengine='python' in which infinities of mixed-case forms were not being interpreted properly (GH13274)pd.read_csv() withengine='python' in which trailingNaN values were not being parsed (GH13320)pd.read_csv() withengine='python' when reading from atempfile.TemporaryFile on Windows with Python 3 (GH13398)pd.read_csv() that preventsusecols kwarg from accepting single-byte unicode strings (GH13219)pd.read_csv() that preventsusecols from being an empty set (GH13402)pd.read_csv() in the C engine where the NULL character was not being parsed as NULL (GH14012)pd.read_csv() withengine='c' in which NULLquotechar was not accepted even thoughquoting was specified asNone (GH13411)pd.read_csv() withengine='c' in which fields were not properly cast to float when quoting was specified as non-numeric (GH13411)pd.read_csv() in Python 2.x with non-UTF8 encoded, multi-character separated data (GH3404)pd.read_csv(), where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (GH13549)pd.read_csv,pd.read_table,pd.read_fwf,pd.read_stata andpd.read_sas where files were opened by parsers but not closed if bothchunksize anditerator wereNone. 
(GH13940)StataReader,StataWriter,XportReader andSAS7BDATReader where a file was not properly closed when an error was raised. (GH13940)pd.pivot_table() wheremargins_name is ignored whenaggfunc is a list (GH13354)pd.Series.str.zfill,center,ljust,rjust, andpad when passing non-integers, did not raiseTypeError (GH13598)TimedeltaIndex, which always returnedTrue (GH13603)Series arithmetic raisesTypeError if it contains datetime-like asobject dtype (GH13043)Series.isnull() andSeries.notnull() ignorePeriod('NaT') (GH13737)Series.fillna() andSeries.dropna() don’t affect toPeriod('NaT') (GH13737.fillna(value=np.nan) incorrectly raisesKeyError on acategory dtypedSeries (GH14021).resample(..) where incorrect warnings were triggered by IPython introspection (GH13618)NaT -Period raisesAttributeError (GH13071)Series comparison may output incorrect result if rhs containsNaT (GH9005)Series andIndex comparison may output incorrect result if it containsNaT withobject dtype (GH13592)Period addition raisesTypeError ifPeriod is on right hand side (GH13069)Peirod andSeries orIndex comparison raisesTypeError (GH13200)pd.set_eng_float_format() that would prevent NaN and Inf from formatting (GH11981).unstack withCategorical dtype resets.ordered toTrue (GH13249)factorize raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13750).set_index raisesAmbiguousTimeError if new index contains DST boundary and multi levels (GH12920).shift raisesAmbiguousTimeError if data contains datetime near DST boundary (GH13926)pd.read_hdf() returns incorrect result when aDataFrame with acategorical column and a query which doesn’t match any values (GH13792).iloc when indexing with a non lex-sorted MultiIndex (GH13797).loc when indexing with date strings in a reverse sortedDatetimeIndex (GH14316)Series comparison operators when dealing with zero dim NumPy arrays (GH13006).combine_first may return incorrectdtype (GH7630,GH10567)groupby whereapply returns different result depending on whether first result isNone or not (GH12824)groupby(..).nth() where the group key is included inconsistently if called after.head()/.tail() (GH12839).to_html,.to_latex and.to_string silently ignore custom datetime formatter passed through theformatters key word (GH10690)DataFrame.iterrows(), not yielding aSeries subclasse if defined (GH13977)pd.to_numeric whenerrors='coerce' and input contains non-hashable objects (GH13324)Timedelta arithmetic and comparison may raiseValueError rather thanTypeError (GH13624)to_datetime andDatetimeIndex may raiseTypeError rather thanValueError (GH11169,GH11287)Index created with tz-awareTimestamp and mismatchedtz option incorrectly coerces timezone (GH13692)DatetimeIndex with nanosecond frequency does not include timestamp specified withend (GH13672)`Series` when setting a slice with a`np.timedelta64` (GH14155)Index raisesOutOfBoundsDatetime ifdatetime exceedsdatetime64[ns] bounds, rather than coercing toobject dtype (GH13663)Index may ignore specifieddatetime64 ortimedelta64 passed asdtype (GH13981)RangeIndex can be created without no arguments rather than raisesTypeError (GH13793).value_counts() raisesOutOfBoundsDatetime if data exceedsdatetime64[ns] bounds (GH13663)DatetimeIndex may raiseOutOfBoundsDatetime if inputnp.datetime64 has other unit thanns (GH9114)Series creation withnp.datetime64 which has other unit thanns asobject dtype results in incorrect values (GH13876)resample with timedelta data where data was casted to float (GH13119).pd.isnull()pd.notnull() raiseTypeError if input datetime-like has 
other unit thanns (GH13389)pd.merge() may raiseTypeError if input datetime-like has other unit thanns (GH13389)HDFStore/read_hdf() discardedDatetimeIndex.name iftz was set (GH13884)Categorical.remove_unused_categories() changes.codes dtype to platform int (GH13261)groupby withas_index=False returns all NaN’s when grouping on multiple columns including a categorical one (GH13204)df.groupby(...)[...] where getitem withInt64Index raised an error (GH13731)DataFrame.style for index names. Previously they were assigned"col_headinglevel<n>col<c>" wheren was the number of levels + 1. Now they are assigned"index_namelevel<n>", wheren is the correct level for that MultiIndex.pd.read_gbq() could throwImportError:Nomodulenameddiscovery as a result of a naming conflict with another python package called apiclient (GH13454)Index.union returns an incorrect result with a named empty index (GH13432)Index.difference andDataFrame.join raise in Python3 when using mixed-integer indexes (GH13432,GH12814)datetime.datetime from tz-awaredatetime64 series (GH14088).to_excel() when DataFrame contains a MultiIndex which contains a label with a NaN value (GH13511)ValueError (GH13930)concat andgroupby for hierarchical frames withRangeIndex levels (GH13542).Series.str.contains() for Series containing onlyNaN values ofobject dtype (GH14171)agg() function on groupby dataframe changes dtype ofdatetime64[ns] column tofloat64 (GH12821)PeriodIndex to add or subtract integer raiseIncompatibleFrequency. Note that using standard operator like+ or- is recommended, because standard operators use more efficient path (GH13980)NaT returningfloat instead ofdatetime64[ns] (GH12941)Series flexible arithmetic methods (like.add()) raisesValueError whenaxis=None (GH13894)DataFrame.to_csv() withMultiIndex columns in which a stray empty line was added (GH6618)DatetimeIndex,TimedeltaIndex andPeriodIndex.equals() may returnTrue when input isn’tIndex but contains the same values (GH13107)pd.eval() andHDFStore query truncating long float literals with python 2 (GH14241)Index raisesKeyError displaying incorrect column when column is not in the df and columns contains duplicate values (GH13822)Period andPeriodIndex creating wrong dates when frequency has combined offset aliases (GH13874).to_string() when called with an integerline_width andindex=False raises an UnboundLocalError exception becauseidx referenced before assignment.eval() where theresolvers argument would not accept a list (GH14095)stack,get_dummies,make_axis_dummies which don’t preserve categorical dtypes in (multi)indexes (GH13854)PeriodIndex can now acceptlist andarray which containspd.NaT (GH13430)df.groupby where.median() returns arbitrary values if grouped dataframe contains empty bins (GH13629)Index.copy() wherename parameter was ignored (GH14302)This is a minor bug-fix release from 0.18.0 and includes a large number ofbug fixes along with several new features, enhancements, and performance improvements.We recommend that all users upgrade to this version.
Highlights include:
.groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group, see here
pd.to_datetime() has gained the ability to assemble dates from a DataFrame, see here
sparse, see here

What’s new in v0.18.1
TheCustomBusinessHour is a mixture ofBusinessHour andCustomBusinessDay whichallows you to specify arbitrary holidays. For details,seeCustom Business Hour (GH11514)
In [1]:frompandas.tseries.offsetsimportCustomBusinessHourIn [2]:frompandas.tseries.holidayimportUSFederalHolidayCalendarIn [3]:bhour_us=CustomBusinessHour(calendar=USFederalHolidayCalendar())
Friday before MLK Day
In [4]:dt=datetime(2014,1,17,15)In [5]:dt+bhour_usOut[5]:Timestamp('2014-01-17 16:00:00')
Tuesday after MLK Day (Monday is skipped because it’s a holiday)
In [6]:dt+bhour_us*2Out[6]:Timestamp('2014-01-20 09:00:00')
.groupby(..) syntax with window and resample operations
.groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group (GH12486, GH12738).
You can now use.rolling(..) and.expanding(..) as methods on groupbys. These return another deferred object (similar to what.rolling() and.expanding() do on ungrouped pandas objects). You can then operate on theseRollingGroupby objects in a similar manner.
Previously you would have to do this to get a rolling window mean per-group:
In [7]:df=pd.DataFrame({'A':[1]*20+[2]*12+[3]*8, ...:'B':np.arange(40)}) ...:In [8]:dfOut[8]: A B0 1 01 1 12 1 23 1 34 1 45 1 56 1 6.. .. ..33 3 3334 3 3435 3 3536 3 3637 3 3738 3 3839 3 39[40 rows x 2 columns]
In [9]:df.groupby('A').apply(lambdax:x.rolling(4).B.mean())Out[9]:A1 0 NaN 1 NaN 2 NaN 3 1.5 4 2.5 5 3.5 6 4.5 ...3 33 NaN 34 NaN 35 33.5 36 34.5 37 35.5 38 36.5 39 37.5Name: B, dtype: float64
Now you can do:
In [10]:df.groupby('A').rolling(4).B.mean()Out[10]:A1 0 NaN 1 NaN 2 NaN 3 1.5 4 2.5 5 3.5 6 4.5 ...3 33 NaN 34 NaN 35 33.5 36 34.5 37 35.5 38 36.5 39 37.5Name: B, dtype: float64
For.resample(..) type of operations, previously you would have to:
In [11]:df=pd.DataFrame({'date':pd.date_range(start='2016-01-01', ....:periods=4, ....:freq='W'), ....:'group':[1,1,2,2], ....:'val':[5,6,7,8]}).set_index('date') ....:In [12]:dfOut[12]: group valdate2016-01-03 1 52016-01-10 1 62016-01-17 2 72016-01-24 2 8
In [13]:df.groupby('group').apply(lambdax:x.resample('1D').ffill())Out[13]: group valgroup date1 2016-01-03 1 5 2016-01-04 1 5 2016-01-05 1 5 2016-01-06 1 5 2016-01-07 1 5 2016-01-08 1 5 2016-01-09 1 5... ... ...2 2016-01-18 2 7 2016-01-19 2 7 2016-01-20 2 7 2016-01-21 2 7 2016-01-22 2 7 2016-01-23 2 7 2016-01-24 2 8[16 rows x 2 columns]
Now you can do:
In [14]:df.groupby('group').resample('1D').ffill()Out[14]: group valgroup date1 2016-01-03 1 5 2016-01-04 1 5 2016-01-05 1 5 2016-01-06 1 5 2016-01-07 1 5 2016-01-08 1 5 2016-01-09 1 5... ... ...2 2016-01-18 2 7 2016-01-19 2 7 2016-01-20 2 7 2016-01-21 2 7 2016-01-22 2 7 2016-01-23 2 7 2016-01-24 2 8[16 rows x 2 columns]
The following methods / indexers now accept a callable. It is intended to make these more useful in method chains, see the documentation. (GH11485, GH12533)
.where() and .mask()
.loc[], .iloc[] and .ix[]
[] indexing

.where() and .mask()
These can accept a callable for the condition and other arguments.
In [15]:df=pd.DataFrame({'A':[1,2,3], ....:'B':[4,5,6], ....:'C':[7,8,9]}) ....:In [16]:df.where(lambdax:x>4,lambdax:x+10)Out[16]: A B C0 11 14 71 12 5 82 13 6 9
.loc[], .iloc[], .ix[]
These can accept a callable, and a tuple of callables as a slicer. The callable can return a valid boolean indexer or anything which is valid for these indexers' input.
# callable returns bool indexerIn [17]:df.loc[lambdax:x.A>=2,lambdax:x.sum()>10]Out[17]: B C1 5 82 6 9# callable returns list of labelsIn [18]:df.loc[lambdax:[1,2],lambdax:['A','B']]Out[18]: A B1 2 52 3 6
[] indexing
Finally, you can use a callable in [] indexing of Series, DataFrame and Panel. The callable must return a valid input for [] indexing depending on its class and index type.
In [19]:df[lambdax:'A']Out[19]:0 11 22 3Name: A, dtype: int64
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
In [20]:bb=pd.read_csv('data/baseball.csv',index_col='id')In [21]:(bb.groupby(['year','team']) ....:.sum() ....:.loc[lambdadf:df.r>100] ....:) ....:Out[21]: stint g ab r h X2b X3b hr rbi sb cs bb \year team2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105 DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97 HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60 LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114 NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174 SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235 TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73 TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190 so ibb hbp sh sf gidpyear team2007 CIN 127.0 14.0 1.0 1.0 15.0 18.0 DET 176.0 3.0 10.0 4.0 8.0 28.0 HOU 212.0 3.0 9.0 16.0 6.0 17.0 LAN 141.0 8.0 9.0 3.0 8.0 29.0 NYN 310.0 24.0 23.0 18.0 15.0 48.0 SFN 188.0 51.0 8.0 16.0 6.0 41.0 TEX 140.0 4.0 5.0 2.0 8.0 16.0 TOR 265.0 16.0 12.0 4.0 16.0 38.0
DateTimeIndex when part of a MultiIndex
Partial string indexing now matches on DateTimeIndex when part of a MultiIndex (GH10331)
In [22]:dft2=pd.DataFrame(np.random.randn(20,1), ....:columns=['A'], ....:index=pd.MultiIndex.from_product([pd.date_range('20130101', ....:periods=10, ....:freq='12H'), ....:['a','b']])) ....:In [23]:dft2Out[23]: A2013-01-01 00:00:00 a 1.474071 b -0.0640342013-01-01 12:00:00 a -1.282782 b 0.7818362013-01-02 00:00:00 a -1.071357 b 0.4411532013-01-02 12:00:00 a 2.353925... ...2013-01-04 00:00:00 b -0.8456962013-01-04 12:00:00 a -1.340896 b 1.8468832013-01-05 00:00:00 a -1.328865 b 1.6827062013-01-05 12:00:00 a -1.717693 b 0.888782[20 rows x 1 columns]In [24]:dft2.loc['2013-01-05']Out[24]: A2013-01-05 00:00:00 a -1.328865 b 1.6827062013-01-05 12:00:00 a -1.717693 b 0.888782
On other levels
In [25]:idx=pd.IndexSliceIn [26]:dft2=dft2.swaplevel(0,1).sort_index()In [27]:dft2Out[27]: Aa 2013-01-01 00:00:00 1.474071 2013-01-01 12:00:00 -1.282782 2013-01-02 00:00:00 -1.071357 2013-01-02 12:00:00 2.353925 2013-01-03 00:00:00 0.221471 2013-01-03 12:00:00 0.758527 2013-01-04 00:00:00 -0.964980... ...b 2013-01-02 12:00:00 0.583787 2013-01-03 00:00:00 -0.744471 2013-01-03 12:00:00 1.729689 2013-01-04 00:00:00 -0.845696 2013-01-04 12:00:00 1.846883 2013-01-05 00:00:00 1.682706 2013-01-05 12:00:00 0.888782[20 rows x 1 columns]In [28]:dft2.loc[idx[:,'2013-01-05'],:]Out[28]: Aa 2013-01-05 00:00:00 -1.328865 2013-01-05 12:00:00 -1.717693b 2013-01-05 00:00:00 1.682706 2013-01-05 12:00:00 0.888782
pd.to_datetime() has gained the ability to assemble datetimes from a passed-in DataFrame or a dict (GH8158).
In [29]:df=pd.DataFrame({'year':[2015,2016], ....:'month':[2,3], ....:'day':[4,5], ....:'hour':[2,3]}) ....:In [30]:dfOut[30]: day hour month year0 4 2 2 20151 5 3 3 2016
Assembling using the passed frame.
In [31]:pd.to_datetime(df)Out[31]:0 2015-02-04 02:00:001 2016-03-05 03:00:00dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
In [32]:pd.to_datetime(df[['year','month','day']])Out[32]:0 2015-02-041 2016-03-05dtype: datetime64[ns]
pd.read_csv() now supportsdelim_whitespace=True for the Python engine (GH12958)
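For illustration, a minimal sketch (the inline data below is made up, not from the release notes):

from io import StringIO  # Python 3; on Python 2, StringIO.StringIO works the same way
import pandas as pd

data = "a b c\n1 2 3\n4 5 6"
# whitespace-delimited parsing now also works with the Python engine
df = pd.read_csv(StringIO(data), delim_whitespace=True, engine='python')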
pd.read_csv() now supports opening ZIP files that contain a single CSV, via extension inference or explicit compression='zip' (GH12175)
pd.read_csv() now supports opening files using xz compression, via extension inference or by explicitly specifying compression='xz'; xz compression is also supported by DataFrame.to_csv in the same way (GH11852)
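A hedged sketch of round-tripping a compressed CSV; the file name is hypothetical, and ZIP archives containing a single CSV work the same way (via extension inference or compression='zip'):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_csv('data.csv.xz', index=False, compression='xz')  # write an xz-compressed CSV
pd.read_csv('data.csv.xz')                                # compression inferred from the extension
pd.read_csv('data.csv.xz', compression='xz')              # or specified explicitly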
pd.read_msgpack() now always gives writeable ndarrays even when compression is used (GH12359).
pd.read_msgpack() now supports serializing and de-serializing categoricals with msgpack (GH12573)
.to_json() now supportsNDFrames that contain categorical and sparse data (GH10778)
interpolate() now supportsmethod='akima' (GH7588).
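A minimal sketch, assuming scipy is installed (the Akima interpolator is provided by scipy):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
# fill the missing values with a piecewise-cubic Akima interpolation
s.interpolate(method='akima')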
pd.read_excel() now accepts path objects (e.g.pathlib.Path,py.path.local) for the file path, in line with otherread_* functions (GH12655)
Added.weekday_name property as a component toDatetimeIndex and the.dt accessor. (GH11128)
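For example (a minimal sketch):

import pandas as pd

idx = pd.date_range('2016-03-04', periods=3, freq='D')
idx.weekday_name                 # day names, e.g. 'Friday', 'Saturday', 'Sunday'
pd.Series(idx).dt.weekday_name   # the same information through the .dt accessor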
Index.take now handlesallow_fill andfill_value consistently (GH12631)
In [33]:idx=pd.Index([1.,2.,3.,4.],dtype='float')# default, allow_fill=True, fill_value=NoneIn [34]:idx.take([2,-1])Out[34]:Float64Index([3.0,4.0],dtype='float64')In [35]:idx.take([2,-1],fill_value=True)Out[35]:Float64Index([3.0,nan],dtype='float64')
Index now supports.str.get_dummies() which returnsMultiIndex, seeCreating Indicator Variables (GH10008,GH10103)
In [36]:idx=pd.Index(['a|b','a|c','b|c'])In [37]:idx.str.get_dummies('|')Out[37]:MultiIndex(levels=[[0, 1], [0, 1], [0, 1]], labels=[[1, 1, 0], [1, 0, 1], [0, 1, 1]], names=[u'a', u'b', u'c'])
pd.crosstab() has gained anormalize argument for normalizing frequency tables (GH12569). Examples in the updated docshere.
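A brief sketch of the new keyword (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y', 'y'],
                   'b': ['p', 'q', 'p', 'p']})
pd.crosstab(df.a, df.b)                     # raw frequency table
pd.crosstab(df.a, df.b, normalize=True)     # normalize over all values
pd.crosstab(df.a, df.b, normalize='index')  # normalize within each row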
.resample(..).interpolate() is now supported (GH12925)
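A minimal sketch, upsampling a 2-day series to daily frequency and interpolating the newly introduced points:

import pandas as pd

s = pd.Series([0.0, 2.0, 4.0],
              index=pd.date_range('2016-01-01', periods=3, freq='2D'))
# upsample to daily frequency and interpolate the gaps
s.resample('D').interpolate()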
.isin() now accepts passedsets (GH12988)
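For example:

import pandas as pd

s = pd.Series([1, 2, 3, 4])
s.isin({2, 4})   # a set is now accepted, not just a list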
These changes make sparse handling return the correct types and provide a smoother experience with indexing.
SparseArray.take now returns a scalar for scalar input,SparseArray for others. Furthermore, it handles a negative indexer with the same rule asIndex (GH10560,GH12796)
In [38]:s=pd.SparseArray([np.nan,np.nan,1,2,3,np.nan,4,5,np.nan,6])In [39]:s.take(0)Out[39]:nanIn [40]:s.take([1,2,3])Out[40]:[nan, 1.0, 2.0]Fill: nanIntIndexIndices: array([1, 2], dtype=int32)
SparseSeries [] indexing with Ellipsis raises KeyError (GH9467)
SparseArray [] indexing with tuples is not handled properly (GH12966)
SparseSeries.loc[] with list-like input raises TypeError (GH10560)
SparseSeries.iloc[] with scalar input may raise IndexError (GH10560)
SparseSeries.loc[], .iloc[] with slice returns SparseArray, rather than SparseSeries (GH10560)
SparseDataFrame.loc[], .iloc[] may result in a dense Series, rather than SparseSeries (GH12787)
SparseArray addition ignores the fill_value of the right hand side (GH12910)
SparseArray mod raises AttributeError (GH12910)
SparseArray pow calculates 1 ** np.nan as np.nan, which must be 1 (GH12910)
SparseArray comparison may output an incorrect result or raise ValueError (GH12971)
SparseSeries.__repr__ raises TypeError when it is longer than max_rows (GH10560)
SparseSeries.shape ignores fill_value (GH10452)
SparseSeries and SparseArray may have a different dtype from their dense values (GH12908)
SparseSeries.reindex incorrectly handles fill_value (GH12797)
SparseArray.to_frame() results in a DataFrame, rather than SparseDataFrame (GH9850)
SparseSeries.value_counts() does not count fill_value (GH6749)
SparseArray.to_dense() does not preserve dtype (GH10648)
SparseArray.to_dense() incorrectly handles fill_value (GH12797)
pd.concat() of SparseSeries results in dense (GH10536)
pd.concat() of SparseDataFrame incorrectly handles fill_value (GH9765)
pd.concat() of SparseDataFrame may raise AttributeError (GH12174)
SparseArray.shift() may raise NameError or TypeError (GH12908)

.groupby(..).nth() changes
The index in .groupby(..).nth() output is now more consistent when the as_index argument is passed (GH11039):
In [41]:df=DataFrame({'A':['a','b','a'], ....:'B':[1,2,3]}) ....:In [42]:dfOut[42]: A B0 a 11 b 22 a 3
Previous Behavior:
In [3]:df.groupby('A',as_index=True)['B'].nth(0)Out[3]:0 11 2Name: B, dtype: int64In [4]:df.groupby('A',as_index=False)['B'].nth(0)Out[4]:0 11 2Name: B, dtype: int64
New Behavior:
In [43]:df.groupby('A',as_index=True)['B'].nth(0)Out[43]:Aa 1b 2Name: B, dtype: int64In [44]:df.groupby('A',as_index=False)['B'].nth(0)Out[44]:0 11 2Name: B, dtype: int64
Furthermore, previously, a .groupby would always sort, regardless of whether sort=False was passed with .nth().
In [45]:np.random.seed(1234)In [46]:df=pd.DataFrame(np.random.randn(100,2),columns=['a','b'])In [47]:df['c']=np.random.randint(0,4,100)
Previous Behavior:
In [4]:df.groupby('c',sort=True).nth(1)Out[4]: a bc0 -0.334077 0.0021181 0.036142 -2.0749782 -0.720589 0.8871633 0.859588 -0.636524In [5]:df.groupby('c',sort=False).nth(1)Out[5]: a bc0 -0.334077 0.0021181 0.036142 -2.0749782 -0.720589 0.8871633 0.859588 -0.636524
New Behavior:
In [48]:df.groupby('c',sort=True).nth(1)Out[48]: a bc0 -0.334077 0.0021181 0.036142 -2.0749782 -0.720589 0.8871633 0.859588 -0.636524In [49]:df.groupby('c',sort=False).nth(1)Out[49]: a bc2 -0.720589 0.8871633 0.859588 -0.6365240 -0.334077 0.0021181 0.036142 -2.074978
Compatibility between pandas array-like methods (e.g.sum andtake) and theirnumpycounterparts has been greatly increased by augmenting the signatures of thepandas methods soas to accept arguments that can be passed in fromnumpy, even if they are not necessarilyused in thepandas implementation (GH12644,GH12638,GH12687)
.searchsorted() for Index and TimedeltaIndex now accept a sorter argument to maintain compatibility with numpy’s searchsorted function (GH12238)
np.round() on a Series (GH12600)
An example of this signature augmentation is illustrated below:
In [50]:sp=pd.SparseDataFrame([1,2,3])In [51]:spOut[51]: 00 11 22 3
Previous behaviour:
In [2]:np.cumsum(sp,axis=0)...TypeError: cumsum() takes at most 2 arguments (4 given)
New behaviour:
In [52]:np.cumsum(sp,axis=0)Out[52]: 00 11 32 6
.apply on groupby resampling
Using apply on resampling groupby operations (using a pd.TimeGrouper) now has the same output types as similar apply calls on other groupby operations. (GH11742)
In [53]:df=pd.DataFrame({'date':pd.to_datetime(['10/10/2000','11/10/2000']), ....:'value':[10,13]}) ....:In [54]:dfOut[54]: date value0 2000-10-10 101 2000-11-10 13
Previous behavior:
In [1]:df.groupby(pd.TimeGrouper(key='date',freq='M')).apply(lambdax:x.value.sum())Out[1]:...TypeError: cannot concatenate a non-NDFrame object# Output is a SeriesIn [2]:df.groupby(pd.TimeGrouper(key='date',freq='M')).apply(lambdax:x[['value']].sum())Out[2]:date2000-10-31 value 102000-11-30 value 13dtype: int64
New Behavior:
# Output is a SeriesIn [55]:df.groupby(pd.TimeGrouper(key='date',freq='M')).apply(lambdax:x.value.sum())Out[55]:date2000-10-31 102000-11-30 13Freq: M, dtype: int64# Output is a DataFrameIn [56]:df.groupby(pd.TimeGrouper(key='date',freq='M')).apply(lambdax:x[['value']].sum())Out[56]: valuedate2000-10-31 102000-11-30 13
read_csv exceptions
In order to standardize the read_csv API for both the c and python engines, both will now raise an EmptyDataError, a subclass of ValueError, in response to empty columns or header (GH12493, GH12506)
Previous behaviour:
In [1]:df=pd.read_csv(StringIO(''),engine='c')...ValueError: No columns to parse from fileIn [2]:df=pd.read_csv(StringIO(''),engine='python')...StopIteration
New behaviour:
In [1]:df=pd.read_csv(StringIO(''),engine='c')...pandas.io.common.EmptyDataError: No columns to parse from fileIn [2]:df=pd.read_csv(StringIO(''),engine='python')...pandas.io.common.EmptyDataError: No columns to parse from file
In addition to this error change, several others have been made as well:
CParserError now sub-classes ValueError instead of just an Exception (GH12551)
CParserError is now raised instead of a generic Exception in read_csv when the c engine cannot parse a column (GH12506)
ValueError is now raised instead of a generic Exception in read_csv when the c engine encounters a NaN value in an integer column (GH12506)
ValueError is now raised instead of a generic Exception in read_csv when true_values is specified, and the c engine encounters an element in a column containing unencodable bytes (GH12506)
pandas.parser.OverflowError exception has been removed and has been replaced with Python’s built-in OverflowError exception (GH12506)
pd.read_csv() no longer allows a combination of strings and integers for the usecols parameter (GH12678)

to_datetime error changes
Bugs in pd.to_datetime() when passing a unit with convertible entries and errors='coerce', or non-convertible entries with errors='ignore', have been fixed. Furthermore, an OutOfBoundsDatetime exception will be raised when an out-of-range value is encountered for that unit when errors='raise'. (GH11758, GH13052, GH13059)
Previous behaviour:
In [27]:pd.to_datetime(1420043460,unit='s',errors='coerce')Out[27]:NaTIn [28]:pd.to_datetime(11111111,unit='D',errors='ignore')OverflowError: Python int too large to convert to C longIn [29]:pd.to_datetime(11111111,unit='D',errors='raise')OverflowError: Python int too large to convert to C long
New behaviour:
In [2]:pd.to_datetime(1420043460,unit='s',errors='coerce')Out[2]:Timestamp('2014-12-31 16:31:00')In [3]:pd.to_datetime(11111111,unit='D',errors='ignore')Out[3]:11111111In [4]:pd.to_datetime(11111111,unit='D',errors='raise')OutOfBoundsDatetime: cannot convert input with unit 'D'
.swaplevel() forSeries,DataFrame,Panel, andMultiIndex now features defaults for its first two parametersi andj that swap the two innermost levels of the index. (GH12934).searchsorted() forIndex andTimedeltaIndex now accept asorter argument to maintain compatibility with numpy’ssearchsorted function (GH12238)Period andPeriodIndex now raisesIncompatibleFrequency error which inheritsValueError rather than rawValueError (GH12615)Series.apply for category dtype now applies the passed function to each of the.categories (and not the.codes), and returns acategory dtype if possible (GH12473)read_csv will now raise aTypeError ifparse_dates is neither a boolean, list, or dictionary (matches the doc-string) (GH5636).query()/.eval() is nowengine=None, which will usenumexpr if it’s installed; otherwise it will fallback to thepython engine. This mimics the pre-0.18.1 behavior ifnumexpr is installed (and which, previously, if numexpr was not installed,.query()/.eval() would raise). (GH12749)pd.show_versions() now includespandas_datareader version (GH12740)__name__ and__qualname__ attributes for generic functions (GH12021)pd.concat(ignore_index=True) now usesRangeIndex as default (GH12695)pd.merge() andDataFrame.join() will show aUserWarning when merging/joining a single- with a multi-leveled dataframe (GH9455,GH12219)scipy > 0.17 for deprecatedpiecewise_polynomial interpolation method; support for the replacementfrom_derivatives method (GH12887).groupby(..).cumcount() (GH11039)pd.read_csv() when usingskiprows=an_integer (GH13005)DataFrame.to_sql when checking case sensitivity for tables. Now only checks if table has been created correctly when table name is not lower case. (GH12876)Period construction and time series plotting (GH12903,GH11831)..str.encode() and.str.decode() methods (GH13008)to_numeric if input is numeric dtype (GH12777)IntIndex (GH13036)usecols parameter inpd.read_csv is now respected even when the lines of a CSV file are not even (GH12203)groupby.transform(..) whenaxis=1 is specified with a non-monotonic ordered index (GH12713)Period andPeriodIndex creation raisesKeyError iffreq="Minute" is specified. Note that “Minute” freq is deprecated in v0.17.0, and recommended to usefreq="T" instead (GH11854).resample(...).count() with aPeriodIndex always raising aTypeError (GH12774).resample(...) with aPeriodIndex casting to aDatetimeIndex when empty (GH12868).resample(...) with aPeriodIndex when resampling to an existing frequency (GH12770)Period with differentfreq raisesValueError (GH12615)Series construction withCategorical anddtype='category' is specified (GH12574)display.max_rows (GH12411,GH12045,GH11594,GH10571,GH12211)float_format option with option not being validated as a callable. (GH12706)GroupBy.filter whendropna=False and no groups fulfilled the criteria (GH12768)__name__ of.cum* functions (GH12021).astype() of aFloat64Inde/Int64Index to anInt64Index (GH12881).to_json()/.read_json() whenorient='index' (the default) (GH12866)Categorical dtypes cause error when attempting stacked bar plot (GH13019)numpy 1.11 forNaT comparions (GH12969).drop() with a non-uniqueMultiIndex. (GH12701).concat of datetime tz-aware and naive DataFrames (GH12467)ValueError in.resample(..).fillna(..) 
when passing a non-string (GH12952)pd.read_sas() (GH12659,GH12654,GH12647,GH12809)pd.crosstab() where would silently ignoreaggfunc ifvalues=None (GH12569).DataFrame.to_json when serialisingdatetime.time (GH11473).DataFrame.to_json when attempting to serialise 0d array (GH11299).to_json when attempting to serialise aDataFrame orSeries with non-ndarray values; now supports serialization ofcategory,sparse, anddatetime64[ns,tz] dtypes (GH10778).DataFrame.to_json with unsupported dtype not passed to default handler (GH12554)..align not returning the sub-class (GH12983)Series with aDataFrame (GH13037)ABCPanel in whichPanel4D was not being considered as a valid instance of this generic type (GH12810).name on.groupby(..).apply(..) cases (GH12363)Timestamp.__repr__ that causedpprint to fail in nested structures (GH12622)Timedelta.min andTimedelta.max, the properties now report the true minimum/maximumtimedeltas as recognized by pandas. See thedocumentation. (GH12727).quantile() with interpolation may coerce tofloat unexpectedly (GH12772).quantile() with emptySeries may return scalar rather than emptySeries (GH12772).loc with out-of-bounds in a large indexer would raiseIndexError rather thanKeyError (GH12527)TimedeltaIndex and.asfreq(), would previously not include the final fencepost (GH12926)Categorical in aDataFrame (GH12564)GroupBy.first(),.last() returns incorrect row whenTimeGrouper is used (GH7453)pd.read_csv() with thec engine when specifyingskiprows with newlines in quoted items (GH10911,GH12775)DataFrame timezone lost when assigning tz-aware datetimeSeries with alignment (GH12981).value_counts() whennormalize=True anddropna=True where nulls still contributed to the normalized count (GH12558)Series.value_counts() loses name if its dtype iscategory (GH12835)Series.value_counts() loses timezone info (GH12835)Series.value_counts(normalize=True) withCategorical raisesUnboundLocalError (GH12835)Panel.fillna() ignoringinplace=True (GH12633)pd.read_csv() when specifyingnames,usecols, andparse_dates simultaneously with thec engine (GH9755)pd.read_csv() when specifyingdelim_whitespace=True andlineterminator simultaneously with thec engine (GH12912)Series.rename,DataFrame.rename andDataFrame.rename_axis not treatingSeries as mappings to relabel (GH12623)..rolling.min and.rolling.max to enhance dtype handling (GH12373)groupby where complex types are coerced to float (GH12902)Series.map raisesTypeError if its dtype iscategory or tz-awaredatetime (GH12473)RangeIndex construction (GH12893)DataFrame defined to return subclassedSeries may return normalSeries (GH11559).str accessor methods may raiseValueError if input hasname and the result isDataFrame orMultiIndex (GH12617)DataFrame.last_valid_index() andDataFrame.first_valid_index() on empty frames (GH12800)CategoricalIndex.get_loc returns different result from regularIndex (GH12531)PeriodIndex.resample where name not propagated (GH12769)date_rangeclosed keyword and timezones (GH12684).pd.concat raisesAttributeError when input data contains tz-aware datetime and timedelta (GH12620)pd.concat did not handle emptySeries properly (GH11082).plot.bar alginment whenwidth is specified withint (GH12979)fill_value is ignored if the argument to a binary operator is a constant (GH12723)pd.read_html() when using bs4 flavor and parsing table with a header and only one column (GH9178).pivot_table whenmargins=True anddropna=True where nulls still contributed to margin count (GH12577).pivot_table whendropna=False where table index/column names disappear 
(GH12133)pd.crosstab() whenmargins=True anddropna=False which raised (GH12642)Series.name whenname attribute can be a hashable type (GH12610).describe() resets categorical columns information (GH11558)loffset argument was not applied when callingresample().count() on a timeseries (GH12725)pd.read_excel() now accepts column names associated with keyword argumentnames (GH12870)pd.to_numeric() withIndex returnsnp.ndarray, rather thanIndex (GH12777)pd.to_numeric() with datetime-like may raiseTypeError (GH12777)pd.to_numeric() with scalar raisesValueError (GH12777)This is a major release from 0.17.1 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
Warning
pandas >= 0.18.0 no longer supports compatibility with Python version 2.6and 3.3 (GH7718,GH11273)
Warning
numexpr version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH12489)
Highlights include:
.groupby, see here
RangeIndex as a specialized form of the Int64Index for memory savings, see here
.resample method to make it more .groupby-like, see here
TypeError, see here
.to_xarray() function has been added for compatibility with the xarray package, see here
read_sas function has been enhanced to read sas7bdat files, see here
pd.test() top-level nose test runner is available (GH4327)
Check the API Changes and deprecations before updating.
What’s new in v0.18.0
Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This allows these window-type functions to have a similar API to that of .groupby. See the full documentation here (GH11603, GH12373)
In [1]:np.random.seed(1234)In [2]:df=pd.DataFrame({'A':range(10),'B':np.random.randn(10)})In [3]:dfOut[3]: A B0 0 0.4714351 1 -1.1909762 2 1.4327073 3 -0.3126524 4 -0.7205895 5 0.8871636 6 0.8595887 7 -0.6365248 8 0.0156969 9 -2.242685
Previous Behavior:
In [8]:pd.rolling_mean(df,window=3) FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with DataFrame.rolling(window=3,center=False).mean()Out[8]: A B0 NaN NaN1 NaN NaN2 1 0.2377223 2 -0.0236404 3 0.1331555 4 -0.0486936 5 0.3420547 6 0.3700768 7 0.0795879 8 -0.954504
New Behavior:
In [4]:r=df.rolling(window=3)
These show a descriptive repr
In [5]:rOut[5]:Rolling[window=3,center=False,axis=0]
with tab-completion of available methods and properties.
In [9]:r.r.A r.agg r.apply r.count r.exclusions r.max r.median r.name r.skew r.sumr.B r.aggregate r.corr r.cov r.kurt r.mean r.min r.quantile r.std r.var
The methods operate on theRolling object itself
In [6]:r.mean()Out[6]: A B0 NaN NaN1 NaN NaN2 1.0 0.2377223 2.0 -0.0236404 3.0 0.1331555 4.0 -0.0486936 5.0 0.3420547 6.0 0.3700768 7.0 0.0795879 8.0 -0.954504
They provide getitem accessors
In [7]:r['A'].mean()Out[7]:0 NaN1 NaN2 1.03 2.04 3.05 4.06 5.07 6.08 7.09 8.0Name: A, dtype: float64
And multiple aggregations
In [8]:r.agg({'A':['mean','std'], ...:'B':['mean','std']}) ...:Out[8]: A B mean std mean std0 NaN NaN NaN NaN1 NaN NaN NaN NaN2 1.0 1.0 0.237722 1.3273643 2.0 1.0 -0.023640 1.3355054 3.0 1.0 0.133155 1.1437785 4.0 1.0 -0.048693 0.8357476 5.0 1.0 0.342054 0.9203797 6.0 1.0 0.370076 0.8718508 7.0 1.0 0.079587 0.7500999 8.0 1.0 -0.954504 1.162285
Series.rename and NDFrame.rename_axis can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH9494, GH11965)
In [9]:s=pd.Series(np.random.randn(5))In [10]:s.rename('newname')Out[10]:0 1.1500361 0.9919462 0.9533243 -2.0212554 -0.334077Name: newname, dtype: float64
In [11]:df=pd.DataFrame(np.random.randn(5,2))In [12]:(df.rename_axis("indexname") ....:.rename_axis("columns_name",axis="columns")) ....:Out[12]:columns_name 0 1indexname0 0.002118 0.4054531 0.289092 1.3211582 -1.546906 -0.2026463 -0.655969 0.1934214 0.553439 1.318152
The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.
A RangeIndex has been added to the Int64Index sub-classes to support a memory saving alternative for common use cases. This has a similar implementation to the python range object (xrange in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index if needed.
This will now be the default constructed index for NDFrame objects, rather than an Int64Index as previously. (GH939, GH12070, GH12071, GH12109, GH12888)
Previous Behavior:
In [3]:s=pd.Series(range(1000))In [4]:s.indexOut[4]:Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)In [6]:s.index.nbytesOut[6]:8000
New Behavior:
In [13]:s=pd.Series(range(1000))In [14]:s.indexOut[14]:RangeIndex(start=0,stop=1000,step=1)In [15]:s.index.nbytesOut[15]:72
The .str.extract method takes a regular expression with capture groups, finds the first match in each subject string, and returns the contents of the capture groups (GH11386).
In v0.18.0, theexpand argument was added toextract.
expand=False: it returns a Series, Index, or DataFrame, depending on the subject and regular expression pattern (same behavior as pre-0.18.0).
expand=True: it always returns a DataFrame, which is more consistent and less confusing from the perspective of a user.
Currently the default is expand=None which gives a FutureWarning and uses expand=False. To avoid this warning, please explicitly specify expand.
In [1]:pd.Series(['a1','b2','c3']).str.extract('[ab](\d)',expand=None)FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame)but in a future version of pandas this will be changed to expand=True (return DataFrame)Out[1]:0 11 22 NaNdtype: object
Extracting a regular expression with one group returns a Series ifexpand=False.
In [16]:pd.Series(['a1','b2','c3']).str.extract('[ab](\d)',expand=False)Out[16]:0 11 22 NaNdtype: object
It returns aDataFrame with one column ifexpand=True.
In [17]:pd.Series(['a1','b2','c3']).str.extract('[ab](\d)',expand=True)Out[17]: 00 11 22 NaN
Calling on anIndex with a regex with exactly one capture groupreturns anIndex ifexpand=False.
In [18]:s=pd.Series(["a1","b2","c3"],["A11","B22","C33"])In [19]:s.indexOut[19]:Index([u'A11',u'B22',u'C33'],dtype='object')In [20]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=False)Out[20]:Index([u'A',u'B',u'C'],dtype='object',name=u'letter')
It returns aDataFrame with one column ifexpand=True.
In [21]:s.index.str.extract("(?P<letter>[a-zA-Z])",expand=True)Out[21]: letter0 A1 B2 C
Calling on anIndex with a regex with more than one capture groupraisesValueError ifexpand=False.
>>>s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=False)ValueError: only one regex group is supported with Index
It returns aDataFrame ifexpand=True.
In [22]:s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)",expand=True)Out[22]: letter 10 A 111 B 222 C 33
In summary,extract(expand=True) always returns aDataFramewith a row for every subject string, and a column for every capturegroup.
The .str.extractall method was added (GH11386). Unlike extractall, extract returns only the first match:
In [23]:s=pd.Series(["a1a2","b1","c1"],["A","B","C"])In [24]:sOut[24]:A a1a2B b1C c1dtype: objectIn [25]:s.str.extract("(?P<letter>[ab])(?P<digit>\d)",expand=False)Out[25]: letter digitA a 1B b 1C NaN NaN
Theextractall method returns all matches.
In [26]:s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")Out[26]: letter digit matchA 0 a 1 1 a 2B 0 b 1
The method.str.cat() concatenates the members of aSeries. Before, ifNaN values were present in the Series, calling.str.cat() on it would returnNaN, unlike the rest of theSeries.str.* API. This behavior has been amended to ignoreNaN values by default. (GH11435).
A new, friendlierValueError is added to protect against the mistake of supplying thesep as an arg, rather than as a kwarg. (GH11334).
In [27]:pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ')Out[27]:'a b c'In [28]:pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ',na_rep='?')Out[28]:'a b ? c'
In [2]:pd.Series(['a','b',np.nan,'c']).str.cat(' ')ValueError: Did you mean to supply a `sep` keyword?
DatetimeIndex, Timestamp, TimedeltaIndex, Timedelta have gained the .round(), .floor() and .ceil() methods for datetimelike rounding, flooring and ceiling. (GH4314, GH11963)
Naive datetimes
In [29]:dr=pd.date_range('20130101 09:12:56.1234',periods=3)In [30]:drOut[30]:DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400', '2013-01-03 09:12:56.123400'], dtype='datetime64[ns]', freq='D')In [31]:dr.round('s')Out[31]:DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56', '2013-01-03 09:12:56'], dtype='datetime64[ns]', freq=None)# Timestamp scalarIn [32]:dr[0]Out[32]:Timestamp('2013-01-01 09:12:56.123400',freq='D')In [33]:dr[0].round('10s')Out[33]:Timestamp('2013-01-01 09:13:00')
Tz-aware datetimes are rounded, floored and ceiled in local times
In [34]:dr=dr.tz_localize('US/Eastern')In [35]:drOut[35]:DatetimeIndex(['2013-01-01 09:12:56.123400-05:00', '2013-01-02 09:12:56.123400-05:00', '2013-01-03 09:12:56.123400-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')In [36]:dr.round('s')Out[36]:DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00', '2013-01-03 09:12:56-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
Timedeltas
In [37]:t=timedelta_range('1 days 2 hr 13 min 45 us',periods=3,freq='d')In [38]:tOut[38]:TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045', '3 days 02:13:00.000045'], dtype='timedelta64[ns]', freq='D')In [39]:t.round('10min')Out[39]:TimedeltaIndex(['1 days 02:10:00','2 days 02:10:00','3 days 02:10:00'],dtype='timedelta64[ns]',freq=None)# Timedelta scalarIn [40]:t[0]Out[40]:Timedelta('1 days 02:13:00.000045')In [41]:t[0].round('2h')Out[41]:Timedelta('1 days 02:00:00')
In addition, .round(), .floor() and .ceil() will be available through the .dt accessor of Series.
In [42]:s=pd.Series(dr)In [43]:sOut[43]:0 2013-01-01 09:12:56.123400-05:001 2013-01-02 09:12:56.123400-05:002 2013-01-03 09:12:56.123400-05:00dtype: datetime64[ns, US/Eastern]In [44]:s.dt.round('D')Out[44]:0 2013-01-01 00:00:00-05:001 2013-01-02 00:00:00-05:002 2013-01-03 00:00:00-05:00dtype: datetime64[ns, US/Eastern]
Integers in FloatIndex, e.g. 1., are now formatted with a decimal point and a 0 digit, e.g. 1.0 (GH11713). This change not only affects the display to the console, but also the output of IO methods like .to_csv or .to_html.
Previous Behavior:
In [2]:s=pd.Series([1,2,3],index=np.arange(3.))In [3]:sOut[3]:0 11 22 3dtype: int64In [4]:s.indexOut[4]:Float64Index([0.0,1.0,2.0],dtype='float64')In [5]:print(s.to_csv(path=None))0,11,22,3
New Behavior:
In [45]:s=pd.Series([1,2,3],index=np.arange(3.))In [46]:sOut[46]:0.0 11.0 22.0 3dtype: int64In [47]:s.indexOut[47]:Float64Index([0.0,1.0,2.0],dtype='float64')In [48]:print(s.to_csv(path=None))0.0,11.0,22.0,3
When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH10503)
Previous Behavior:
In [5]:df=pd.DataFrame({'a':[0,1,1], 'b': pd.Series([100, 200, 300], dtype='uint32')})In [7]:df.dtypesOut[7]:a int64b uint32dtype: objectIn [8]:ix=df['a']==1In [9]:df.loc[ix,'b']=df.loc[ix,'b']In [11]:df.dtypesOut[11]:a int64b int64dtype: object
New Behavior:
In [49]:df=pd.DataFrame({'a':[0,1,1], ....:'b':pd.Series([100,200,300],dtype='uint32')}) ....:In [50]:df.dtypesOut[50]:a int64b uint32dtype: objectIn [51]:ix=df['a']==1In [52]:df.loc[ix,'b']=df.loc[ix,'b']In [53]:df.dtypesOut[53]:a int64b uint32dtype: object
When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be downcasted to integer without losing precision, the dtype of the slice will be set to float instead of integer.
Previous Behavior:
In [4]:df=pd.DataFrame(np.array(range(1,10)).reshape(3,3), columns=list('abc'), index=[[4,4,8], [8,10,12]])In [5]:dfOut[5]: a b c4 8 1 2 3 10 4 5 68 12 7 8 9In [7]:df.ix[4,'c']=np.array([0.,1.])In [8]:dfOut[8]: a b c4 8 1 2 0 10 4 5 18 12 7 8 9
New Behavior:
In [54]:df=pd.DataFrame(np.array(range(1,10)).reshape(3,3), ....:columns=list('abc'), ....:index=[[4,4,8],[8,10,12]]) ....:In [55]:dfOut[55]: a b c4 8 1 2 3 10 4 5 68 12 7 8 9In [56]:df.ix[4,'c']=np.array([0.,1.])In [57]:dfOut[57]: a b c4 8 1 2 0.0 10 4 5 1.08 12 7 8 9.0
In a future version of pandas, we will be deprecating Panel and other > 2 ndim objects. In order to provide for continuity, all NDFrame objects have gained the .to_xarray() method in order to convert to xarray objects, which has a pandas-like interface for > 2 ndim. (GH11972)
See thexarray full-documentation here.
In [1]:p=Panel(np.arange(2*3*4).reshape(2,3,4))In [2]:p.to_xarray()Out[2]:<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])Coordinates: * items (items) int64 0 1 * major_axis (major_axis) int64 0 1 2 * minor_axis (minor_axis) int64 0 1 2 3
DataFrame has gained a ._repr_latex_() method in order to allow for conversion to latex in an ipython/jupyter notebook using nbconvert. (GH11778)
Note that this must be activated by setting the optionpd.display.latex.repr=True (GH12182)
For example, if you have a jupyter notebook you plan to convert to latex using nbconvert, place the statementpd.display.latex.repr=True in the first cell to have the contained DataFrame output also stored as latex.
The options display.latex.escape and display.latex.longtable have also been added to the configuration and are used automatically by the to_latex method. See the available options docs for more info.
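A hedged sketch of enabling the option and inspecting the generated LaTeX (option and method names as described above; the frame is made up for illustration):

import pandas as pd

pd.set_option('display.latex.repr', True)    # opt in to the LaTeX repr
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
latex = df._repr_latex_()                    # the LaTeX string that nbconvert will pick up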
pd.read_sas() changes
read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in entirety, or incrementally. For full details see here. (GH4052)
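A hedged sketch (the file name is hypothetical):

import pandas as pd

# read a SAS7BDAT file in its entirety
df = pd.read_sas('example.sas7bdat')

# or incrementally, in chunks of 10,000 rows
for chunk in pd.read_sas('example.sas7bdat', chunksize=10000):
    print(chunk.shape)   # placeholder for per-chunk processing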
Series.to_string (GH11729)read_excel now supports s3 urls of the formats3://bucketname/filename (GH11447)AWS_S3_HOST env variable when reading from s3 (GH12198)Panel.round() is now implemented (GH11763)round(DataFrame),round(Series),round(Panel) will work (GH11763)sys.getsizeof(obj) returns the memory usage of a pandas object, including thevalues it contains (GH11597)Series gained anis_unique attribute (GH11946)DataFrame.quantile andSeries.quantile now acceptinterpolation keyword (GH10174).DataFrame.style.format for more flexible formatting of cell values (GH11692)DataFrame.select_dtypes now allows thenp.float16 typecode (GH11990)pivot_table() now accepts most iterables for thevalues parameter (GH12017)BigQuery service account authentication support, which enables authentication on remote servers. (GH11881,GH12572). For further details seehereHDFStore is now iterable:forkinstore is equivalent toforkinstore.keys() (GH12221)..dt forPeriod (GH8848)PEP-ified (GH12096).to_string(index=False) method (GH11833)out parameter has been removed from theSeries.round() method. (GH11763)DataFrame.round() leaves non-numeric columns unchanged in its return, rather than raises. (GH11885)DataFrame.head(0) andDataFrame.tail(0) return empty frames, rather thanself. (GH11937)Series.head(0) andSeries.tail(0) return empty series, rather thanself. (GH11937)to_msgpack andread_msgpack encoding now defaults to'utf-8'. (GH12170).read_csv(),.read_table(),.read_fwf()) changed to group related arguments. (GH11555)NaTType.isoformat now returns the string'NaT to allow the result tobe passed to the constructor ofTimestamp. (GH12300)NaT andTimedelta have expanded arithmetic operations, which are extended toSeriesarithmetic where applicable. Operations defined fordatetime64[ns] ortimedelta64[ns]are now also defined forNaT (GH11564).
NaT now supports arithmetic operations with integers and floats.
In [58]:pd.NaT*1Out[58]:NaTIn [59]:pd.NaT*1.5Out[59]:NaTIn [60]:pd.NaT/2Out[60]:NaTIn [61]:pd.NaT*np.nanOut[61]:NaT
NaT defines more arithmetic operations withdatetime64[ns] andtimedelta64[ns].
In [62]:pd.NaT/pd.NaTOut[62]:nanIn [63]:pd.Timedelta('1s')/pd.NaTOut[63]:nan
NaT may represent either a datetime64[ns] null or a timedelta64[ns] null. Given the ambiguity, it is treated as a timedelta64[ns], which allows more operations to succeed.
In [64]:pd.NaT+pd.NaTOut[64]:NaT# same asIn [65]:pd.Timedelta('1s')+pd.Timedelta('1s')Out[65]:Timedelta('0 days 00:00:02')
as opposed to
In [3]:pd.Timestamp('19900315')+pd.Timestamp('19900315')TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
However, when wrapped in aSeries whosedtype isdatetime64[ns] ortimedelta64[ns],thedtype information is respected.
In [1]:pd.Series([pd.NaT],dtype='<M8[ns]')+pd.Series([pd.NaT],dtype='<M8[ns]')TypeError: can only operate on a datetimes for subtraction, but the operator [__add__] was passed
In [66]:pd.Series([pd.NaT],dtype='<m8[ns]')+pd.Series([pd.NaT],dtype='<m8[ns]')Out[66]:0 NaTdtype: timedelta64[ns]
Timedelta division byfloats now works.
In [67]:pd.Timedelta('1s')/2.0Out[67]:Timedelta('0 days 00:00:00.500000')
Subtracting a Series of Timedelta from a Timestamp works (GH11925)
In [68]:ser=pd.Series(pd.timedelta_range('1 day',periods=3))In [69]:serOut[69]:0 1 days1 2 days2 3 daysdtype: timedelta64[ns]In [70]:pd.Timestamp('2012-01-01')-serOut[70]:0 2011-12-311 2011-12-302 2011-12-29dtype: datetime64[ns]
NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH12300).
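For example, a minimal sketch:

import pandas as pd

pd.NaT.isoformat()    # 'NaT'
pd.Timestamp('NaT')   # the isoformat string round-trips back to NaT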
Forward incompatible changes inmsgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH12129,GH10527)
Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0 caused files packed in Python 2 to be unreadable by Python 3 (GH12142). The following table describes the backward and forward compatibility of msgpacks.
Warning
| Packed with | Can be unpacked with |
|---|---|
| pre-0.17 / Python 2 | any |
| pre-0.17 / Python 3 | any |
| 0.17 / Python 2 | ==0.17 / Python 2, >=0.18 / Python 2 |
| 0.17 / Python 3 | >=0.18 / any Python |
| 0.18 | >= 0.18 |
0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, in which case they can only be unpacked in Python 2.
Series.rank andDataFrame.rank now have the same signature (GH11759)
Previous signature
In [3]:pd.Series([0,1]).rank(method='average',na_option='keep', ascending=True, pct=False)Out[3]:0 11 2dtype: float64In [4]:pd.DataFrame([0,1]).rank(axis=0,numeric_only=None, method='average', na_option='keep', ascending=True, pct=False)Out[4]: 00 11 2
New signature
In [71]:pd.Series([0,1]).rank(axis=0,method='average',numeric_only=None, ....:na_option='keep',ascending=True,pct=False) ....:Out[71]:0 1.01 2.0dtype: float64In [72]:pd.DataFrame([0,1]).rank(axis=0,method='average',numeric_only=None, ....:na_option='keep',ascending=True,pct=False) ....:Out[72]: 00 1.01 2.0
In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n parameter was 0. (GH11406)
The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.
In [73]:d=pd.Timestamp('2014-02-01')In [74]:dOut[74]:Timestamp('2014-02-01 00:00:00')In [75]:d+pd.offsets.QuarterBegin(n=0,startingMonth=2)Out[75]:Timestamp('2014-02-01 00:00:00')In [76]:d+pd.offsets.QuarterBegin(n=0,startingMonth=1)Out[76]:Timestamp('2014-04-01 00:00:00')
For the QuarterBegin offset in previous versions, the date would be rolled backwards if the date was in the same month as the quarter start date.
In [3]:d=pd.Timestamp('2014-02-15')In [4]:d+pd.offsets.QuarterBegin(n=0,startingMonth=2)Out[4]:Timestamp('2014-02-01 00:00:00')
This behavior has been corrected in version 0.18.0, which is consistent withother anchored offsets likeMonthBegin andYearBegin.
In [77]:d=pd.Timestamp('2014-02-15')In [78]:d+pd.offsets.QuarterBegin(n=0,startingMonth=2)Out[78]:Timestamp('2014-05-01 00:00:00')
Like the change in the window functions APIabove,.resample(...) is changing to have a more groupby-like API. (GH11732,GH12702,GH12202,GH12332,GH12334,GH12348,GH12448).
In [79]:np.random.seed(1234)In [80]:df=pd.DataFrame(np.random.rand(10,4), ....:columns=list('ABCD'), ....:index=pd.date_range('2010-01-01 09:00:00',periods=10,freq='s')) ....:In [81]:dfOut[81]: A B C D2010-01-01 09:00:00 0.191519 0.622109 0.437728 0.7853592010-01-01 09:00:01 0.779976 0.272593 0.276464 0.8018722010-01-01 09:00:02 0.958139 0.875933 0.357817 0.5009952010-01-01 09:00:03 0.683463 0.712702 0.370251 0.5611962010-01-01 09:00:04 0.503083 0.013768 0.772827 0.8826412010-01-01 09:00:05 0.364886 0.615396 0.075381 0.3688242010-01-01 09:00:06 0.933140 0.651378 0.397203 0.7887302010-01-01 09:00:07 0.316836 0.568099 0.869127 0.4361732010-01-01 09:00:08 0.802148 0.143767 0.704261 0.7045812010-01-01 09:00:09 0.218792 0.924868 0.442141 0.909316
Previous API:
You would write a resampling operation that immediately evaluates. If a how parameter was not provided, it would default to how='mean'.
In [6]:df.resample('2s')Out[6]: A B C D2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.7936152010-01-01 09:00:02 0.820801 0.794317 0.364034 0.5310962010-01-01 09:00:04 0.433985 0.314582 0.424104 0.6257332010-01-01 09:00:06 0.624988 0.609738 0.633165 0.6124522010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
You could also specify ahow directly
In [7]:df.resample('2s',how='sum')Out[7]: A B C D2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.5872312010-01-01 09:00:02 1.641602 1.588635 0.728068 1.0621912010-01-01 09:00:04 0.867969 0.629165 0.848208 1.2514652010-01-01 09:00:06 1.249976 1.219477 1.266330 1.2249042010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
New API:
Now, you can write .resample(..) as a 2-stage operation like .groupby(...), which yields a Resampler.
In [82]:r=df.resample('2s')In [83]:rOut[83]:DatetimeIndexResampler[freq=<2*Seconds>,axis=0,closed=left,label=left,convention=start,base=0]
You can then use this object to perform operations.These are downsampling operations (going from a higher frequency to a lower one).
In [84]:r.mean()Out[84]: A B C D2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.7936152010-01-01 09:00:02 0.820801 0.794317 0.364034 0.5310962010-01-01 09:00:04 0.433985 0.314582 0.424104 0.6257332010-01-01 09:00:06 0.624988 0.609738 0.633165 0.6124522010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
In [85]:r.sum()Out[85]: A B C D2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.5872312010-01-01 09:00:02 1.641602 1.588635 0.728068 1.0621912010-01-01 09:00:04 0.867969 0.629165 0.848208 1.2514652010-01-01 09:00:06 1.249976 1.219477 1.266330 1.2249042010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
Furthermore, resample now supportsgetitem operations to perform the resample on specific columns.
In [86]:r[['A','C']].mean()Out[86]: A C2010-01-01 09:00:00 0.485748 0.3570962010-01-01 09:00:02 0.820801 0.3640342010-01-01 09:00:04 0.433985 0.4241042010-01-01 09:00:06 0.624988 0.6331652010-01-01 09:00:08 0.510470 0.573201
and.aggregate type operations.
In [87]:r.agg({'A':'mean','B':'sum'})Out[87]: A B2010-01-01 09:00:00 0.485748 0.8947012010-01-01 09:00:02 0.820801 1.5886352010-01-01 09:00:04 0.433985 0.6291652010-01-01 09:00:06 0.624988 1.2194772010-01-01 09:00:08 0.510470 1.068634
These accessors can, of course, be combined
In [88]:r[['A','B']].agg(['mean','sum'])Out[88]: A B mean sum mean sum2010-01-01 09:00:00 0.485748 0.971495 0.447351 0.8947012010-01-01 09:00:02 0.820801 1.641602 0.794317 1.5886352010-01-01 09:00:04 0.433985 0.867969 0.314582 0.6291652010-01-01 09:00:06 0.624988 1.249976 0.609738 1.2194772010-01-01 09:00:08 0.510470 1.020940 0.534317 1.068634
Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler objects with backfill(), ffill(), fillna() and asfreq() methods.
In [89]:s=pd.Series(np.arange(5,dtype='int64'), ....:index=date_range('2010-01-01',periods=5,freq='Q')) ....:In [90]:sOut[90]:2010-03-31 02010-06-30 12010-09-30 22010-12-31 32011-03-31 4Freq: Q-DEC, dtype: int64
Previously
In [6]:s.resample('M',fill_method='ffill')Out[6]:2010-03-31 02010-04-30 02010-05-31 02010-06-30 12010-07-31 12010-08-31 12010-09-30 22010-10-31 22010-11-30 22010-12-31 32011-01-31 32011-02-28 32011-03-31 4Freq: M, dtype: int64
New API
In [91]:s.resample('M').ffill()Out[91]:2010-03-31 02010-04-30 02010-05-31 02010-06-30 12010-07-31 12010-08-31 12010-09-30 22010-10-31 22010-11-30 22010-12-31 32011-01-31 32011-02-28 32011-03-31 4Freq: M, dtype: int64
Note
In the new API, you can either downsample OR upsample. The prior implementation would allow you to pass an aggregator function (likemean) even though you were upsampling, providing a bit of confusion.
Warning
This new API for resample includes some internal changes for the prior-to-0.18.0 API, to work with a deprecation warning in most cases, as the resample operation returns a deferred object. We can intercept operations and just do what the (pre 0.18.0) API did (with a warning). Here is a typical use case:
In [4]:r=df.resample('2s')In [6]:r*10pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operationuse .resample(...).mean() instead of .resample(...)Out[6]: A B C D2010-01-01 09:00:00 4.857476 4.473507 3.570960 7.9361542010-01-01 09:00:02 8.208011 7.943173 3.640340 5.3109572010-01-01 09:00:04 4.339846 3.145823 4.241039 6.2573262010-01-01 09:00:06 6.249881 6.097384 6.331650 6.1245182010-01-01 09:00:08 5.104699 5.343172 5.732009 8.069486
However, getting and assignment operations directly on aResampler will raise aValueError:
In [7]:r.iloc[0]=5ValueError: .resample() is now a deferred operationuse .resample(...).mean() instead of .resample(...)
There is a situation where the new API cannot perform all the operations when using the original code. This code is intended to resample every 2s, take the mean AND then take the min of those results.
In [4]:df.resample('2s').min()Out[4]:A 0.433985B 0.314582C 0.357096D 0.531096dtype: float64
The new API will:
In [92]:df.resample('2s').min()Out[92]: A B C D2010-01-01 09:00:00 0.191519 0.272593 0.276464 0.7853592010-01-01 09:00:02 0.683463 0.712702 0.357817 0.5009952010-01-01 09:00:04 0.364886 0.013768 0.075381 0.3688242010-01-01 09:00:06 0.316836 0.568099 0.397203 0.4361732010-01-01 09:00:08 0.218792 0.143767 0.442141 0.704581
The good news is that the return dimensions will differ between the new API and the old API, so this should loudly raise an exception.
To replicate the original operation
In [93]:df.resample('2s').mean().min()Out[93]:A 0.433985B 0.314582C 0.357096D 0.531096dtype: float64
In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH9297, GH8664, GH10486)
In [94]:df=pd.DataFrame({'a':np.linspace(0,10,5),'b':range(5)})In [95]:dfOut[95]: a b0 0.0 01 2.5 12 5.0 23 7.5 34 10.0 4
In [12]:df.eval('c = a + b')FutureWarning: eval expressions containing an assignment currentlydefault to operating inplace.This will change in a future version of pandas, use inplace=True to avoid this warning.In [13]:dfOut[13]: a b c0 0.0 0 0.01 2.5 1 3.52 5.0 2 7.03 7.5 3 10.54 10.0 4 14.0
In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.
In [96]:dfOut[96]: a b c0 0.0 0 0.01 2.5 1 3.52 5.0 2 7.03 7.5 3 10.54 10.0 4 14.0In [97]:df.eval('d = c - b',inplace=False)Out[97]: a b c d0 0.0 0 0.0 0.01 2.5 1 3.5 2.52 5.0 2 7.0 5.03 7.5 3 10.5 7.54 10.0 4 14.0 10.0In [98]:dfOut[98]: a b c0 0.0 0 0.01 2.5 1 3.52 5.0 2 7.03 7.5 3 10.54 10.0 4 14.0In [99]:df.eval('d = c - b',inplace=True)In [100]:dfOut[100]: a b c d0 0.0 0 0.0 0.01 2.5 1 3.5 2.52 5.0 2 7.0 5.03 7.5 3 10.5 7.54 10.0 4 14.0 10.0
Warning
For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment you should update to explicitly set inplace=True.
The inplace keyword parameter was also added to the query method.
In [101]:df.query('a > 5')Out[101]: a b c d3 7.5 3 10.5 7.54 10.0 4 14.0 10.0In [102]:df.query('a > 5',inplace=True)In [103]:dfOut[103]: a b c d3 7.5 3 10.5 7.54 10.0 4 14.0 10.0
Warning
Note that the default value for inplace in query is False, which is consistent with prior versions.
eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.
In [104]:dfOut[104]: a b c d3 7.5 3 10.5 7.54 10.0 4 14.0 10.0In [105]:df.eval(""" .....: e = d + a .....: f = e - 22 .....: g = f / 2.0""",inplace=True) .....:In [106]:dfOut[106]: a b c d e f g3 7.5 3 10.5 7.5 15.0 -7.0 -3.54 10.0 4 14.0 10.0 20.0 -2.0 -1.0
DataFrame.between_time and Series.between_time now only parse a fixed set of time strings. Parsing of date strings is no longer supported and raises a ValueError. (GH11818)
In [107]:s=pd.Series(range(10),pd.date_range('2015-01-01',freq='H',periods=10))In [108]:s.between_time("7:00am","9:00am")---------------------------------------------------------------------------ValueError Traceback (most recent call last)<ipython-input-108-1f395af72989> in <module>()----> 1 s.between_time("7:00am", "9:00am")/home/joris/scipy/pandas/pandas/core/generic.pyc in between_time(self, start_time, end_time, include_start, include_end) 4054 indexer = self.index.indexer_between_time( 4055 start_time, end_time, include_start=include_start,-> 4056 include_end=include_end) 4057 return self.take(indexer, convert=False) 4058 except AttributeError:/home/joris/scipy/pandas/pandas/tseries/index.pyc in indexer_between_time(self, start_time, end_time, include_start, include_end) 1879 values_between_time : TimeSeries 1880 """-> 1881 start_time = to_time(start_time) 1882 end_time = to_time(end_time) 1883 time_micros = self._get_time_micros()/home/joris/scipy/pandas/pandas/tseries/tools.pyc in to_time(arg, format, infer_time_format, errors) 766 return _convert_listlike(arg, format) 767--> 768 return _convert_listlike(np.array([arg]), format)[0] 769 770/home/joris/scipy/pandas/pandas/tseries/tools.pyc in _convert_listlike(arg, format) 746 elif errors == 'raise': 747 raise ValueError("Cannot convert arg {arg} to "--> 748 "a time".format(arg=arg)) 749 elif errors == 'ignore': 750 return argValueError: Cannot convert arg ['7:00am'] to a time
This will now raise.
In [2]:s.between_time('20150101 07:00:00','20150101 09:00:00')ValueError: Cannot convert arg ['20150101 07:00:00'] to a time.
.memory_usage() now includes values in the index, as does memory_usage in .info() (GH11597)
DataFrame.to_latex() now supports non-ascii encodings (e.g. utf-8) in Python 2 with the parameter encoding (GH7061)
pandas.merge() and DataFrame.merge() will show a specific error message when trying to merge with an object that is not of type DataFrame or a subclass (GH12081)
DataFrame.unstack and Series.unstack now take a fill_value keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting DataFrame. As an added benefit, specifying fill_value will preserve the data type of the original stacked data. (GH9746)
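A minimal sketch of the new fill_value keyword, with made-up data (the column and index names below are illustrative, not from the docs):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'col': ['x', 'y', 'x'],
                   'val': [1, 2, 3]}).set_index(['key', 'col'])
df['val'].unstack()              # the missing ('b', 'y') cell becomes NaN and the dtype is upcast to float64
df['val'].unstack(fill_value=0)  # the missing cell becomes 0 and the original int64 dtype is preserved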
As part of the new API for window functions and resampling, aggregation functions have been clarified, raising more informative error messages on invalid aggregations. (GH9052). A full set of examples is presented in groupby.
Statistical functions for NDFrame objects (like sum(), mean(), min()) will now raise if non-numpy-compatible arguments are passed in for **kwargs (GH12301)
.to_latex and .to_html gain a decimal parameter like .to_csv; the default is '.' (GH12031)
More helpful error message when constructing a DataFrame with empty data but with indices (GH8020)
.describe() will now properly handle bool dtype as a categorical (GH6625)
More helpful error message with an invalid .transform with user defined input (GH10165)
Exponentially weighted functions now allow specifying alpha directly (GH10789) and raise ValueError if parameters violate 0 < alpha <= 1 (GH12492)
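A small sketch of specifying the smoothing factor directly; the data is illustrative:
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
s.ewm(alpha=0.5).mean()     # alpha passed directly instead of span/com/halflife
# s.ewm(alpha=1.5).mean()   # would raise ValueError, since alpha must satisfy 0 < alpha <= 1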
The functions pd.rolling_*, pd.expanding_*, and pd.ewm* are deprecated and replaced by the corresponding method call. Note that the new suggested syntax includes all of the arguments (even if default) (GH11603)
In [1]:s=pd.Series(range(3))In [2]:pd.rolling_mean(s,window=2,min_periods=1) FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with Series.rolling(min_periods=1,window=2,center=False).mean()Out[2]: 0 0.0 1 0.5 2 1.5 dtype: float64In [3]:pd.rolling_cov(s,s,window=2) FutureWarning: pd.rolling_cov is deprecated for Series and will be removed in a future version, replace with Series.rolling(window=2).cov(other=<Series>)Out[3]: 0 NaN 1 0.5 2 0.5 dtype: float64
The freq and how arguments to the .rolling, .expanding, and .ewm (new) functions are deprecated, and will be removed in a future version. You can simply resample the input prior to creating a window function. (GH11603).
For example, instead of s.rolling(window=5, freq='D').max() to get the max value on a rolling 5 Day window, one could use s.resample('D').mean().rolling(window=5).max(), which first resamples the data to daily data, then provides a rolling 5 day window.
The pd.tseries.frequencies.get_offset_name function is deprecated. Use the offset's .freqstr property as an alternative (GH11192)
pandas.stats.fama_macbeth routines are deprecated and will be removed in a future version (GH6077)
pandas.stats.ols,pandas.stats.plm andpandas.stats.var routines are deprecated and will be removed in a future version (GH6077)
Show a FutureWarning rather than a DeprecationWarning on using long-time deprecated syntax in HDFStore.select, where the where clause is not a string-like (GH12027)
The pandas.options.display.mpl_style configuration has been deprecated and will be removed in a future version of pandas. This functionality is better handled by matplotlib's style sheets (GH11783).
In GH4892 indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError. (GH12165, GH12333)
In [109]:s=pd.Series([1,2,3],index=[4,5,6])In [110]:sOut[110]:4 15 26 3dtype: int64In [111]:s2=pd.Series([1,2,3],index=list('abc'))In [112]:s2Out[112]:a 1b 2c 3dtype: int64
Previous Behavior:
# this is label indexingIn [2]:s[5.0]FutureWarning: scalar indexers for index type Int64Index should be integers and not floating pointOut[2]:2# this is positional indexingIn [3]:s.iloc[1.0]FutureWarning: scalar indexers for index type Int64Index should be integers and not floating pointOut[3]:2# this is label indexingIn [4]:s.loc[5.0]FutureWarning: scalar indexers for index type Int64Index should be integers and not floating pointOut[4]:2# .ix would coerce 1.0 to the positional 1, and indexIn [5]:s2.ix[1.0]=10FutureWarning: scalar indexers for index type Index should be integers and not floating pointIn [6]:s2Out[6]:a 1b 10c 3dtype: int64
New Behavior:
For iloc, getting & setting via a float scalar will always raise.
In [3]:s.iloc[2.0]TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>
Other indexers will coerce to a like integer for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].
In [113]:s[5.0]Out[113]:2In [114]:s.loc[5.0]Out[114]:2In [115]:s.ix[5.0]Out[115]:2
and setting
In [116]:s_copy=s.copy()In [117]:s_copy[5.0]=10In [118]:s_copyOut[118]:4 15 106 3dtype: int64In [119]:s_copy=s.copy()In [120]:s_copy.loc[5.0]=10In [121]:s_copyOut[121]:4 15 106 3dtype: int64In [122]:s_copy=s.copy()In [123]:s_copy.ix[5.0]=10In [124]:s_copyOut[124]:4 15 106 3dtype: int64
Positional setting with .ix and a float indexer will ADD this value to the index, rather than previously setting the value by position.
In [125]:s2.ix[1.0]=10In [126]:s2Out[126]:a 1b 2c 31.0 10dtype: int64
Slicing will also coerce integer-like floats to integers for a non-Float64Index.
In [127]:s.loc[5.0:6]Out[127]:5 26 3dtype: int64In [128]:s.ix[5.0:6]Out[128]:5 26 3dtype: int64
Note that for floats that are NOT coercible to ints, the label-based bounds will be excluded.
In [129]:s.loc[5.1:6]Out[129]:6 3dtype: int64In [130]:s.ix[5.1:6]Out[130]:6 3dtype: int64
Float indexing on a Float64Index is unchanged.
In [131]:s=pd.Series([1,2,3],index=np.arange(3.))In [132]:s[1.0]Out[132]:2In [133]:s[1.0:2.5]Out[133]:1.0 22.0 3dtype: int64
rolling_corr_pairwise in favor of.rolling().corr(pairwise=True) (GH4950)expanding_corr_pairwise in favor of.expanding().corr(pairwise=True) (GH4950)DataMatrix module. This was not imported into the pandas namespace in any event (GH12111)cols keyword in favor ofsubset inDataFrame.duplicated() andDataFrame.drop_duplicates() (GH6680)read_frame andframe_query (both aliases forpd.read_sql)andwrite_frame (alias ofto_sql) functions in thepd.io.sql namespace,deprecated since 0.14.0 (GH6292).order keyword from.factorize() (GH6930)andrews_curves (GH11534)DatetimeIndex,PeriodIndex andTimedeltaIndex‘s ops performance includingNaT (GH10277)pandas.concat (GH11958)StataReader (GH11591)Categoricals withSeries of datetimes containingNaT (GH12077)GroupBy.size when data-frame is empty. (GH11699)Period.end_time when a multiple of time period is requested (GH11738).clip with tz-aware datetimes (GH11838)date_range when the boundaries fell on the frequency (GH11804,GH12409).groupby(...).agg(...) (GH9052)Timedelta constructor (GH11995)StataReader when reading incrementally (GH12014)DateOffset whenn parameter is0 (GH11370)NaT comparison changes (GH12049)read_csv when reading from aStringIO in threads (GH11790)NaT as a missing value in datetimelikes when factorizing & withCategoricals (GH12077)Series were tz-aware (GH12089)Series.str.get_dummies when one of the variables was ‘name’ (GH12180)pd.concat while concatenating tz-aware NaT series. (GH11693,GH11755,GH12217)pd.read_stata with version <= 108 files (GH12232)Series.resample using a frequency ofNano when the index is aDatetimeIndex and contains non-zero nanosecond parts (GH12037).nunique and a sparse index (GH12352)boto in python 3.5 (GH11915)NaT subtraction fromTimestamp orDatetimeIndex with timezones (GH11718)Series of a single tz-awareTimestamp (GH12290).next() (GH12299)Timedelta.round with negative values (GH11690).loc againstCategoricalIndex may result in normalIndex (GH11586)DataFrame.info when duplicated column names exist (GH11761).copy of datetime tz-aware objects (GH11794)Series.apply andSeries.map wheretimedelta64 was not boxed (GH11349)DataFrame.set_index() with tz-awareSeries (GH12358)DataFrame whereAttributeError did not propagate (GH11808)Timestamp (GH11616)pd.read_clipboard andpd.to_clipboard functions not supporting Unicode; upgrade includedpyperclip to v1.5.15 (GH9263)DataFrame.query containing an assignment (GH8664)from_msgpack where__contains__() fails for columns of the unpackedDataFrame, if theDataFrame has object columns. 
(GH11880).resample on categorical data withTimedeltaIndex (GH12169)DataFrame (GH11682)Index creation fromTimestamp with mixed tz coerces to UTC (GH11488)to_numeric where it does not raise if input is more than one dimension (GH11776)df.plot using incorrect colors for bar plots under matplotlib 1.5+ (GH11614)groupbyplot method when using keyword arguments (GH11805).DataFrame.duplicated anddrop_duplicates causing spurious matches when settingkeep=False (GH11864).loc result with duplicated key may haveIndex with incorrect dtype (GH11497)pd.rolling_median where memory allocation failed even with sufficient memory (GH11696)DataFrame.style with spurious zeros (GH12134)DataFrame.style with integer columns not starting at 0 (GH12125).style.bar may not rendered properly using specific browser (GH11678)Timedelta with anumpy.array ofTimedelta that caused an infinite recursion (GH11835)DataFrame.round dropping column index name (GH11986)df.replace while replacing value in mixed dtypeDataframe (GH11698)Index prevents copying name of passedIndex, when a new name is not provided (GH11193)read_excel failing to read any non-empty sheets when empty sheets exist andsheetname=None (GH11711)read_excel failing to raiseNotImplemented error when keywordsparse_dates anddate_parser are provided (GH11544)read_sql withpymysql connections failing to return chunked data (GH11522).to_csv ignoring formatting parametersdecimal,na_rep,float_format for float indexes (GH11553)Int64Index andFloat64Index preventing the use of the modulo operator (GH9244)MultiIndex.drop for not lexsorted multi-indexes (GH12078)DataFrame when masking an emptyDataFrame (GH11859).plot potentially modifying thecolors input when the number of columns didn’t match the number of series provided (GH12039).Series.plot failing when index has aCustomBusinessDay frequency (GH7222)..to_sql fordatetime.time values with sqlite fallback (GH8341)read_excel failing to read data with one column whensqueeze=True (GH12157)read_excel failing to read one empty column (GH12292,GH9002).groupby where aKeyError was not raised for a wrong column if there was only one row in the dataframe (GH11741).read_csv with dtype specified on empty data producing an error (GH12048).read_csv where strings like'2E' are treated as valid floats (GH12237)millisecond property ofDatetimeIndex. This would always raise aValueError (GH12019).Series constructor with read-only data (GH11502)pandas.util.testing.choice(). Should usenp.random.choice(), instead. (GH12386).loc setitem indexer preventing the use of a TZ-aware DatetimeIndex (GH12050).style indexes and multi-indexes not appearing (GH11655)to_msgpack andfrom_msgpack which did not correctly serialize or deserializeNaT (GH12307)..skew and.kurt due to roundoff error for highly similar values (GH11974)Timestamp constructor where microsecond resolution was lost if HHMMSS were not separated with ‘:’ (GH10041)buffer_rd_bytes src->buffer could be freed more than once if reading failed, causing a segfault (GH12098)crosstab where arguments with non-overlapping indexes would return aKeyError (GH10291)DataFrame.apply in which reduction was not being prevented for cases in whichdtype was not a numpy dtype (GH12244)DatetimeIndex by settingutc=True in.to_datetime (GH11934)read_csv (GH12494)DataFrame with duplicate column names (GH12344)Note
We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.
This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
Fixed regression in DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)
What's new in v0.17.1
Warning
This is a new feature and is under active development. We'll be adding features and possibly making breaking changes in future releases. Feedback is welcome.
We've added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the styler class with the pandas.DataFrame.style attribute, an instance of Styler with your data attached.
Here’s a quick example:
In [1]:np.random.seed(123)In [2]:df=DataFrame(np.random.randn(10,5),columns=list('abcde'))In [3]:html=df.style.background_gradient(cmap='viridis',low=.5)
We can render the HTML to get the following table.
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | -1.085631 | 0.997345 | 0.282978 | -1.506295 | -0.5786 |
| 1 | 1.651437 | -2.426679 | -0.428913 | 1.265936 | -0.86674 |
| 2 | -0.678886 | -0.094709 | 1.49139 | -0.638902 | -0.443982 |
| 3 | -0.434351 | 2.20593 | 2.186786 | 1.004054 | 0.386186 |
| 4 | 0.737369 | 1.490732 | -0.935834 | 1.175829 | -1.253881 |
| 5 | -0.637752 | 0.907105 | -1.428681 | -0.140069 | -0.861755 |
| 6 | -0.255619 | -2.798589 | -1.771533 | -0.699877 | 0.927462 |
| 7 | -0.173636 | 0.002846 | 0.688223 | -0.879536 | 0.283627 |
| 8 | -0.805367 | -1.727669 | -0.3909 | 0.573806 | 0.338589 |
| 9 | -0.01183 | 2.392365 | 0.412912 | 0.978736 | 2.238143 |
Styler interacts nicely with the Jupyter Notebook. See the documentation for more.
DatetimeIndex now supports conversion to strings with astype(str) (GH10442)
Support for compression (gzip/bz2) in pandas.DataFrame.to_csv() (GH7615)
pd.read_* functions can now also accept pathlib.Path or py._path.local.LocalPath objects for the filepath_or_buffer argument. (GH11033)
The DataFrame and Series functions .to_csv(), .to_html() and .to_latex() can now handle paths beginning with tildes (e.g. ~/Documents/) (GH11438)
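A minimal sketch of passing path objects; it assumes a local file named data.csv exists and uses an illustrative tilde path:
from pathlib import Path
import pandas as pd

df = pd.read_csv(Path('data.csv'))   # pathlib.Path accepted for filepath_or_buffer
df.to_csv('~/Documents/out.csv')     # a leading tilde is expanded to the home directory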
DataFrame now uses the fields of a namedtuple as columns, if columns are not supplied (GH11181)
DataFrame.itertuples() now returns namedtuple objects, when possible. (GH11269, GH11625)
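A short sketch of the namedtuple round-trip; the Point name and values are illustrative:
from collections import namedtuple
import pandas as pd

Point = namedtuple('Point', ['x', 'y'])
df = pd.DataFrame([Point(1, 2), Point(3, 4)])  # columns are taken from the namedtuple fields: x, y
rows = list(df.itertuples())                   # each row comes back as a namedtuple, e.g. Pandas(Index=0, x=1, y=2)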
Added axvlines_kwds to parallel coordinates plot (GH10709)
Option to .info() and .memory_usage() to provide for deep introspection of memory consumption. Note that this can be expensive to compute and therefore is an optional parameter. (GH11595)
In [4]:df=DataFrame({'A':['foo']*1000})In [5]:df['B']=df['A'].astype('category')# shows the '+' as we have object dtypesIn [6]:df.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 1000 entries, 0 to 999Data columns (total 2 columns):A 1000 non-null objectB 1000 non-null categorydtypes: category(1), object(1)memory usage: 8.9+ KB# we have an accurate memory assessment (but can be expensive to compute this)In [7]:df.info(memory_usage='deep')<class 'pandas.core.frame.DataFrame'>RangeIndex: 1000 entries, 0 to 999Data columns (total 2 columns):A 1000 non-null objectB 1000 non-null categorydtypes: category(1), object(1)memory usage: 48.0 KB
Index now has a fillna method (GH10089)
In [8]:pd.Index([1,np.nan,3]).fillna(2)Out[8]:Float64Index([1.0,2.0,3.0],dtype='float64')
Series of type category now make .str.<...> and .dt.<...> accessor methods / properties available, if the categories are of that type. (GH10661)
In [9]:s=pd.Series(list('aabb')).astype('category')In [10]:sOut[10]:0 a1 a2 b3 bdtype: categoryCategories (2, object): [a, b]In [11]:s.str.contains("a")Out[11]:0 True1 True2 False3 Falsedtype: boolIn [12]:date=pd.Series(pd.date_range('1/1/2015',periods=5)).astype('category')In [13]:dateOut[13]:0 2015-01-011 2015-01-022 2015-01-033 2015-01-044 2015-01-05dtype: categoryCategories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]In [14]:date.dt.dayOut[14]:0 11 22 33 44 5dtype: int64
pivot_table now has a margins_name argument so you can use something other than the default of 'All' (GH3335)
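A small sketch of margins_name with illustrative data:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3]})
pd.pivot_table(df, index='A', values='B',
               margins=True, margins_name='Total')  # the margins row is labeled 'Total' instead of 'All'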
Implement export ofdatetime64[ns,tz] dtypes with a fixed HDF5 store (GH11411)
Pretty printing sets (e.g. in DataFrame cells) now uses set literal syntax ({x, y}) instead of legacy Python syntax (set([x, y])) (GH11215)
Improve the error message in pandas.io.gbq.to_gbq() when a streaming insert fails (GH11285) and when the DataFrame does not match the schema of the destination table (GH11359)
NotImplementedError inIndex.shift for non-supported index types (GH8038)min andmax reductions ondatetime64 andtimedelta64 dtyped series nowresult inNaT and notnan (GH11245).TypeError, instead of aValueError (GH11356)Series.ptp will now ignore missing values by default (GH11163)Series.dropna performance improvement when its dtype can’t containNaN (GH11159)DatetimeIndex.year,Series.dt.year), normalization, and conversion to and fromPeriod,DatetimeIndex.to_period andPeriodIndex.to_timestamp (GH11263)rolling_median,rolling_mean,rolling_max,rolling_min,rolling_var,rolling_kurt,rolling_skew (GH11450)read_csv,read_table (GH11272)rolling_median (GH11450)to_excel (GH11352)Categorical categories, which was rendering the strings before chopping them for display (GH11305)Categorical.remove_unused_categories, (GH11643).Series constructor with no data andDatetimeIndex (GH11433)shift,cumprod, andcumsum with groupby (GH4095)SparseArray.__iter__() now does not causePendingDeprecationWarning in Python 3.5 (GH11622)Series.sort_index() now correctly handles theinplace option (GH11402)PyPi when reading a csv of floats and passingna_values=<ascalar> would show an exception (GH11374).to_latex() output broken when the index has a name (GH10660)HDFStore.append with strings whose encoded length exceded the max unencoded length (GH11234)datetime64[ns,tz] dtypes (GH11405)HDFStore.select when comparing with a numpy scalar in a where clause (GH11283)DataFrame.ix with a multi-index indexer (GH11372)date_range with ambigous endpoints (GH11626).str,.dt and.cat. Retrieving sucha value was not possible, so error out on setting it. (GH10673).dt accessors (GH11295)DataFrame.replace with adatetime64[ns,tz] and a non-compat to_replace (GH11326,GH11153)isnull wherenumpy.datetime64('NaT') in anumpy.array was not determined to be null(GH11206)pivot_table withmargins=True when indexes are ofCategorical dtype (GH10993)DataFrame.plot cannot use hex strings colors (GH10299)DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)pd.eval where unary ops in a list error (GH11235)squeeze() with zero length arrays (GH11230,GH8999)describe() dropping column names for hierarchical indexes (GH11517)DataFrame.pct_change() not propagatingaxis keyword on.fillna method (GH11150).to_csv() when a mix of integer and string column names are passed as thecolumns parameter (GH11637)range, (GH11652)to_sql using unicode column names giving UnicodeEncodeError with (GH11431).xticks inplot (GH11529).holiday.dates where observance rules could not be applied to holiday and doc enhancement (GH11477,GH11533)Axes instances instead ofSubplotAxes (GH11520,GH11556).DataFrame.to_latex() produces an extra rule whenheader=False (GH7124)df.groupby(...).apply(func) when a func returns aSeries containing a new datetimelike column (GH11324)pandas.json when file to load is big (GH11344)to_excel with duplicate columns (GH11007,GH10982,GH10970)datetime64[ns,tz] (GH11245).read_excel with multi-index containing integers (GH11317)to_excel with openpyxl 2.2+ and merging (GH11408)DataFrame.to_dict() produces anp.datetime64 object instead ofTimestamp when only datetime is present in data (GH11327)DataFrame.corr() raises exception when computes Kendall correlation for DataFrames with boolean and not boolean columns (GH11560)inline functions on FreeBSD 10+ (withclang) (GH10510)DataFrame.to_csv in passing through arguments for formattingMultiIndexes, includingdate_format (GH7791)DataFrame.join() withhow='right' producing aTypeError 
(GH11519)Series.quantile with empty list results hasIndex withobject dtype (GH11588)pd.merge results in emptyInt64Index rather thanIndex(dtype=object) when the merge result is empty (GH11588)Categorical.remove_unused_categories when havingNaN values (GH11599)DataFrame.to_sparse() loses column names for MultiIndexes (GH11600)DataFrame.round() with non-unique column index producing a Fatal Python error (GH11611)DataFrame.round() withdecimals being a non-unique indexed Series producing extra columns (GH11618)This is a major release from 0.16.2 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
Warning
pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)
Warning
The pandas.io.data package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1 is exactly the same as in pandas v0.17.0 (GH8961, GH10861).
After installing pandas-datareader, you can easily change your imports:
from pandas.io import data, wb
becomes
from pandas_datareader import data, wb
Highlights include:
- Plotting methods are now available as attributes of the .plot accessor, see here
- Support for a datetime64[ns] with timezones as a first-class dtype, see here
- The default for to_datetime will now be to raise when presented with unparseable formats; previously this would return the original input. Also, date parse functions now return consistent results. See here
- The default for dropna in HDFStore has changed to False, to store by default all rows even if they are all NaN, see here
- The datetime accessor (dt) now supports Series.dt.strftime to generate formatted strings for datetime-likes, and Series.dt.total_seconds to generate each duration of the timedelta in seconds. See here
- Period and PeriodIndex can handle a multiplied freq like 3D, which corresponds to a span of 3 days. See here
- PEP440 compliant version strings (GH9518)
Check the API Changes and deprecations before updating.
What’s new in v0.17.0
We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).
The new implementation allows for having a single-timezone across all rows, with operations in a performant manner.
In [1]:df=DataFrame({'A':date_range('20130101',periods=3), ...:'B':date_range('20130101',periods=3,tz='US/Eastern'), ...:'C':date_range('20130101',periods=3,tz='CET')}) ...:In [2]:dfOut[2]: A B C0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:001 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:002 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00In [3]:df.dtypesOut[3]:A datetime64[ns]B datetime64[ns, US/Eastern]C datetime64[ns, CET]dtype: object
In [4]:df.BOut[4]:0 2013-01-01 00:00:00-05:001 2013-01-02 00:00:00-05:002 2013-01-03 00:00:00-05:00Name: B, dtype: datetime64[ns, US/Eastern]In [5]:df.B.dt.tz_localize(None)Out[5]:0 2013-01-011 2013-01-022 2013-01-03Name: B, dtype: datetime64[ns]
This uses a new-dtype representation as well, that is very similar in look-and-feel to its numpy cousin datetime64[ns]
In [6]:df['B'].dtypeOut[6]:datetime64[ns,US/Eastern]In [7]:type(df['B'].dtype)Out[7]:pandas.types.dtypes.DatetimeTZDtype
Note
There is a slightly different string repr for the underlyingDatetimeIndex as a result of the dtype changes, butfunctionally these are the same.
Previous Behavior:
In [1]:pd.date_range('20130101',periods=3,tz='US/Eastern')Out[1]:DatetimeIndex(['2013-01-01 00:00:00-05:00','2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns]', freq='D', tz='US/Eastern')In [2]:pd.date_range('20130101',periods=3,tz='US/Eastern').dtypeOut[2]:dtype('<M8[ns]')
New Behavior:
In [8]:pd.date_range('20130101',periods=3,tz='US/Eastern')Out[8]:DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')In [9]:pd.date_range('20130101',periods=3,tz='US/Eastern').dtypeOut[9]:datetime64[ns,US/Eastern]
We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH8882)
For example, the groupby expression in the following code will have the GIL released during the factorization step, e.g. df.groupby('key'), as well as during the .sum() operation.
N = 1000000
ngroups = 10
df = DataFrame({'key': np.random.randint(0, ngroups, size=N),
                'data': np.random.randn(N)})
df.groupby('key')['data'].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. Qt), or that performs multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.
The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.
To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):
In [10]:df=pd.DataFrame(np.random.rand(10,2),columns=['a','b'])In [11]:df.plot.bar()

As a result of this change, these methods are now all discoverable via tab-completion:
In [12]:df.plot.<TAB>df.plot.area df.plot.barh df.plot.density df.plot.hist df.plot.line df.plot.scatterdf.plot.bar df.plot.box df.plot.hexbin df.plot.kde df.plot.pie
Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the newPlotting API documentation.
dt accessor
We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:
# DatetimeIndexIn [13]:s=pd.Series(pd.date_range('20130101',periods=4))In [14]:sOut[14]:0 2013-01-011 2013-01-022 2013-01-033 2013-01-04dtype: datetime64[ns]In [15]:s.dt.strftime('%Y/%m/%d')Out[15]:0 2013/01/011 2013/01/022 2013/01/033 2013/01/04dtype: object
# PeriodIndexIn [16]:s=pd.Series(pd.period_range('20130101',periods=4))In [17]:sOut[17]:0 2013-01-011 2013-01-022 2013-01-033 2013-01-04dtype: objectIn [18]:s.dt.strftime('%Y/%m/%d')Out[18]:0 2013/01/011 2013/01/022 2013/01/033 2013/01/04dtype: object
The string format follows the Python standard library, and details can be found here.
pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)
# TimedeltaIndexIn [19]:s=pd.Series(pd.timedelta_range('1 minutes',periods=4))In [20]:sOut[20]:0 0 days 00:01:001 1 days 00:01:002 2 days 00:01:003 3 days 00:01:00dtype: timedelta64[ns]In [21]:s.dt.total_seconds()Out[21]:0 60.01 86460.02 172860.03 259260.0dtype: float64
Period, PeriodIndex and period_range can now accept multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as a DateOffset instance like DatetimeIndex, and not as str (GH7811)
A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.
In [22]:p=pd.Period('2015-08-01',freq='3D')In [23]:pOut[23]:Period('2015-08-01','3D')In [24]:p+1Out[24]:Period('2015-08-04','3D')In [25]:p-2Out[25]:Period('2015-07-26','3D')In [26]:p.to_timestamp()Out[26]:Timestamp('2015-08-01 00:00:00')In [27]:p.to_timestamp(how='E')Out[27]:Timestamp('2015-08-03 00:00:00')
You can use the multiplied freq in PeriodIndex and period_range.
In [28]:idx=pd.period_range('2015-08-01',periods=4,freq='2D')In [29]:idxOut[29]:PeriodIndex(['2015-08-01','2015-08-03','2015-08-05','2015-08-07'],dtype='period[2D]',freq='2D')In [30]:idx+1Out[30]:PeriodIndex(['2015-08-03','2015-08-05','2015-08-07','2015-08-09'],dtype='period[2D]',freq='2D')
read_sas() provides support for reading SAS XPORT format files. (GH4052).
df = pd.read_sas('sas_xport.xpt')
It is also possible to obtain an iterator and read an XPORT fileincrementally.
for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)
See thedocs for more details.
eval() now supports calling math functions (GH4893)
df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")
The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
These functions map to the intrinsics for the NumExpr engine. For the Python engine, they are mapped to NumPy calls.
MultiIndex
In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information, by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)
See thedocumentation for more details.
In [31]:df=pd.DataFrame([[1,2,3,4],[5,6,7,8]], ....:columns=pd.MultiIndex.from_product([['foo','bar'],['a','b']], ....:names=['col1','col2']), ....:index=pd.MultiIndex.from_product([['j'],['l','k']], ....:names=['i1','i2'])) ....:In [32]:dfOut[32]:col1 foo barcol2 a b a bi1 i2j l 1 2 3 4 k 5 6 7 8In [33]:df.to_excel('test.xlsx')In [34]:df=pd.read_excel('test.xlsx',header=[0,1],index_col=[0,1])In [35]:dfOut[35]:col1 foo barcol2 a b a bi1 i2j l 1 2 3 4 k 5 6 7 8
Previously, it was necessary to specify the has_index_names argument in read_excel, if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary - the change is shown below.
Old

New

Warning
Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.
- pandas.io.gbq.to_gbq() function if the destination table/dataset does not exist. (GH8325, GH11121).
- pandas.io.gbq.to_gbq() function via the if_exists argument. See the docs for more details (GH8325).
- InvalidColumnOrder and InvalidPageToken in the gbq module will raise ValueError instead of IOError.
- The generate_bq_schema() function is now deprecated and will be removed in a future version (GH11121)
Warning
Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling of these characters.
- display.unicode.east_asian_width: Whether to use the Unicode East Asian Width to calculate the display text width. (GH2612)
- display.unicode.ambiguous_as_wide: Whether to handle Unicode characters belonging to Ambiguous as Wide. (GH11102)
In [36]:df=pd.DataFrame({u'国籍':['UK',u'日本'],u'名前':['Alice',u'しのぶ']})In [37]:df;

In [38]:pd.set_option('display.unicode.east_asian_width',True)In [39]:df;

For further details, see here.
Support for openpyxl >= 2.2. The API for style support is now stable (GH10125)
merge now accepts the argument indicator which adds a Categorical-type column (by default called _merge) to the output object that takes on the values:
| Observation Origin | _merge value |
|---|---|
| Merge key only in 'left' frame | left_only |
| Merge key only in 'right' frame | right_only |
| Merge key in both frames | both |
In [40]:df1=pd.DataFrame({'col1':[0,1],'col_left':['a','b']})In [41]:df2=pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})In [42]:pd.merge(df1,df2,on='col1',how='outer',indicator=True)Out[42]: col1 col_left col_right _merge0 0 a NaN left_only1 1 b 2.0 both2 2 NaN 2.0 right_only3 2 NaN 2.0 right_only
For more, see the updated docs.
pd.to_numeric is a new function to coerce strings to numbers (possibly with coercion) (GH11133)
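A quick sketch of pd.to_numeric; the values are illustrative:
import pandas as pd

pd.to_numeric(pd.Series(['1.0', '2', '-3']))                 # parsed to a float64 Series
pd.to_numeric(pd.Series(['1.0', 'apple']), errors='coerce')  # unparseable values become NaN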
pd.merge will now allow duplicate column names if they are not merged upon (GH10639).
pd.pivot will now allow passing index as None (GH3962).
pd.concat will now use existing Series names if provided (GH10698).
In [43]:foo=pd.Series([1,2],name='foo')In [44]:bar=pd.Series([1,2])In [45]:baz=pd.Series([4,5])
Previous Behavior:
In [1] pd.concat([foo, bar, baz], 1)Out[1]: 0 1 2 0 1 1 4 1 2 2 5
New Behavior:
In [46]:pd.concat([foo,bar,baz],1)Out[46]: foo 0 10 1 1 41 2 2 5
DataFrame has gained the nlargest and nsmallest methods (GH10393)
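A brief sketch of the new methods, using illustrative data:
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 4, 1, 5], 'b': list('vwxyz')})
df.nlargest(3, 'a')    # the three rows with the largest values in column 'a'
df.nsmallest(2, 'a')   # the two rows with the smallest values in column 'a'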
Add a limit_direction keyword argument that works with limit to enable interpolate to fill NaN values forward, backward, or both (GH9218, GH10420, GH11115)
In [47]:ser=pd.Series([np.nan,np.nan,5,np.nan,np.nan,np.nan,13])In [48]:ser.interpolate(limit=1,limit_direction='both')Out[48]:0 NaN1 5.02 5.03 7.04 NaN5 11.06 13.0dtype: float64
Added a DataFrame.round method to round the values to a variable number of decimal places (GH10568).
In [49]:df=pd.DataFrame(np.random.random([3,3]),columns=['A','B','C'], ....:index=['first','second','third']) ....:In [50]:dfOut[50]: A B Cfirst 0.342764 0.304121 0.417022second 0.681301 0.875457 0.510422third 0.669314 0.585937 0.624904In [51]:df.round(2)Out[51]: A B Cfirst 0.34 0.30 0.42second 0.68 0.88 0.51third 0.67 0.59 0.62In [52]:df.round({'A':0,'C':2})Out[52]: A B Cfirst 0.0 0.304121 0.42second 1.0 0.875457 0.51third 1.0 0.585937 0.62
drop_duplicates and duplicated now accept a keep keyword to target first, last, and all duplicates. The take_last keyword is deprecated, see here (GH6511, GH8505)
In [53]:s=pd.Series(['A','B','C','A','B','D'])In [54]:s.drop_duplicates()Out[54]:0 A1 B2 C5 Ddtype: objectIn [55]:s.drop_duplicates(keep='last')Out[55]:2 C3 A4 B5 Ddtype: objectIn [56]:s.drop_duplicates(keep=False)Out[56]:2 C5 Ddtype: object
Reindex now has a tolerance argument that allows for finer control of limits on filling while reindexing (GH10411):
In [57]:df=pd.DataFrame({'x':range(5), ....:'t':pd.date_range('2000-01-01',periods=5)}) ....:In [58]:df.reindex([0.1,1.9,3.5], ....:method='nearest', ....:tolerance=0.2) ....:Out[58]: t x0.1 2000-01-01 0.01.9 2000-01-03 2.03.5 NaT NaN
When used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with a string:
In [59]:df=df.set_index('t')In [60]:df.reindex(pd.to_datetime(['1999-12-31']), ....:method='nearest', ....:tolerance='1 day') ....:Out[60]: x1999-12-31 0
tolerance is also exposed by the lower-level Index.get_indexer and Index.get_loc methods.
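A minimal sketch of tolerance on the lower-level methods; the index and lookup values are illustrative:
import pandas as pd

idx = pd.Index([1, 5, 10])
idx.get_loc(6, method='nearest', tolerance=2)            # -> 1, since 5 lies within 2 of 6
idx.get_indexer([6, 20], method='nearest', tolerance=2)  # -> array([ 1, -1]); 20 has no match within 2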
Added functionality to use the base argument when resampling a TimedeltaIndex (GH10530)
DatetimeIndex can be instantiated using strings containing NaT (GH7599)
to_datetime can now accept the yearfirst keyword (GH7599)
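A short sketch of yearfirst with an illustrative, ambiguous date string:
import pandas as pd

pd.to_datetime('10-11-12')                  # parsed month-first: Timestamp('2012-10-11 00:00:00')
pd.to_datetime('10-11-12', yearfirst=True)  # parsed year-first:  Timestamp('2010-11-12 00:00:00')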
pandas.tseries.offsets larger than the Day offset can now be used with a Series for addition/subtraction (GH10699). See the docs for more details.
pd.Timedelta.total_seconds() now returns Timedelta duration to ns precision (previously microsecond precision) (GH10939)
PeriodIndex now supports arithmetic with np.ndarray (GH10638)
Support pickling of Period objects (GH10439)
.as_blocks will now take a copy optional argument to return a copy of the data, default is to copy (no change in behavior from prior versions), (GH9607)
regex argument to DataFrame.filter now handles numeric column names instead of raising ValueError (GH10384).
Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (GH8685)
Enable writing Excel files in memory using StringIO/BytesIO (GH7074)
Enable serialization of lists and dicts to strings in ExcelWriter (GH8188)
SQL io functions now accept a SQLAlchemy connectable. (GH7877)
pd.read_sql and to_sql can accept database URI as con parameter (GH10214)
read_sql_table will now allow reading from views (GH10750).
Enable writing complex values to HDFStores when using the table format (GH10447)
Enable pd.read_hdf to be used without specifying a key when the HDF file contains a single dataset (GH10443)
pd.read_stata will now read Stata 118 type files. (GH9882)
msgpack submodule has been updated to 0.4.6 with backward compatibility (GH10581)
DataFrame.to_dict now accepts the orient='index' keyword argument (GH10844).
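A small sketch of orient='index' with illustrative data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['row1', 'row2'])
df.to_dict(orient='index')
# -> {'row1': {'a': 1, 'b': 3}, 'row2': {'a': 2, 'b': 4}}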
DataFrame.apply will return a Series of dicts if the passed function returns a dict and reduce=True (GH8735).
Allow passing kwargs to the interpolation methods (GH10378).
Improved error message when concatenating an empty iterable of DataFrame objects (GH9157)
pd.read_csv can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (GH11070, GH11072).
In pd.read_csv, recognize s3n:// and s3a:// URLs as designating S3 file storage (GH11070, GH11071).
Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (GH11070,GH11073)
pd.read_csv is now able to infer compression type for files read from AWS S3 storage (GH11070,GH11074).
The sorting API has had some long-time inconsistencies. (GH9816, GH8239).
Here is a summary of the API PRIOR to 0.17.0:
- Series.sort is INPLACE while DataFrame.sort returns a new object.
- Series.order returns a new object.
- It was possible to use Series/DataFrame.sort_index to sort by values by passing the by keyword.
- Series/DataFrame.sortlevel worked only on a MultiIndex for sorting by index.
To address these issues, we have revamped the API:
- We have introduced a new method, DataFrame.sort_values(), which is the merger of DataFrame.sort(), Series.sort(), and Series.order(), to handle sorting of values.
- The existing methods Series.sort(), Series.order(), and DataFrame.sort() have been deprecated and will be removed in a future version.
- The by argument of DataFrame.sort_index() has been deprecated and will be removed in a future version.
- The existing method .sort_index() will gain the level keyword to enable level sorting.
We now have two distinct and non-overlapping methods of sorting (a usage sketch follows the tables below). A * marks items that will show a FutureWarning.
To sort by thevalues:
| Previous | Replacement |
|---|---|
| *Series.order() | Series.sort_values() |
| *Series.sort() | Series.sort_values(inplace=True) |
| *DataFrame.sort(columns=...) | DataFrame.sort_values(by=...) |
To sort by theindex:
| Previous | Replacement |
|---|---|
| Series.sort_index() | Series.sort_index() |
| Series.sortlevel(level=...) | Series.sort_index(level=...) |
| DataFrame.sort_index() | DataFrame.sort_index() |
| DataFrame.sortlevel(level=...) | DataFrame.sort_index(level=...) |
| *DataFrame.sort() | DataFrame.sort_index() |
We have also deprecated and changed similar methods in two Series-like classes, Index and Categorical.
| Previous | Replacement |
|---|---|
| *Index.order() | Index.sort_values() |
| *Categorical.order() | Categorical.sort_values() |
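As referenced above, a brief usage sketch of the new value-sorting spelling next to the deprecated one; the data is illustrative:
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 3], 'B': list('xyz')})
df.sort_values(by='A')   # new API for sorting by values
# df.sort(columns='A')   # prior spelling; now deprecated and shows a FutureWarning
s = pd.Series([2, 1, 3])
s.sort_values()          # replaces the deprecated s.order()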
The default for pd.to_datetime error handling has changed to errors='raise'. In prior versions it was errors='ignore'. Furthermore, the coerce argument has been deprecated in favor of errors='coerce'. This means that invalid parsing will raise rather than return the original input as in previous versions. (GH10636)
Previous Behavior:
In [2]:pd.to_datetime(['2009-07-31','asd'])Out[2]:array(['2009-07-31','asd'],dtype=object)
New Behavior:
In [3]:pd.to_datetime(['2009-07-31','asd'])ValueError: Unknown string format
Of course you can coerce this as well.
In [61]:to_datetime(['2009-07-31','asd'],errors='coerce')Out[61]:DatetimeIndex(['2009-07-31','NaT'],dtype='datetime64[ns]',freq=None)
To keep the previous behavior, you can use errors='ignore':
In [62]:to_datetime(['2009-07-31','asd'],errors='ignore')Out[62]:array(['2009-07-31','asd'],dtype=object)
Furthermore, pd.to_timedelta has gained a similar API, of errors='raise'|'ignore'|'coerce', and the coerce keyword has been deprecated in favor of errors='coerce'.
The string parsing of to_datetime, Timestamp and DatetimeIndex has been made consistent. (GH7599)
Prior to v0.17.0, Timestamp and to_datetime may parse a year-only datetime string incorrectly using today's date; otherwise DatetimeIndex uses the beginning of the year. Timestamp and to_datetime may raise ValueError on some types of datetime strings which DatetimeIndex can parse, such as a quarterly string.
Previous Behavior:
In [1]:Timestamp('2012Q2')Traceback ...ValueError: Unable to parse 2012Q2# Results in today's date.In [2]:Timestamp('2014')Out [2]: 2014-08-12 00:00:00
v0.17.0 can parse them as below. It works on DatetimeIndex also.
New Behavior:
In [63]:Timestamp('2012Q2')Out[63]:Timestamp('2012-04-01 00:00:00')In [64]:Timestamp('2014')Out[64]:Timestamp('2014-01-01 00:00:00')In [65]:DatetimeIndex(['2012Q2','2014'])Out[65]:DatetimeIndex(['2012-04-01','2014-01-01'],dtype='datetime64[ns]',freq=None)
Note
If you want to perform calculations based on today's date, use Timestamp.now() and pandas.tseries.offsets.
In [66]:importpandas.tseries.offsetsasoffsetsIn [67]:Timestamp.now()Out[67]:Timestamp('2016-11-03 16:51:06.549337')In [68]:Timestamp.now()+offsets.DateOffset(years=1)Out[68]:Timestamp('2017-11-03 16:51:06.550998')
Operator equal on Index should behave similarly to Series (GH9947, GH10637)
Starting in v0.17.0, comparing Index objects of different lengths will raise a ValueError. This is to be consistent with the behavior of Series.
Previous Behavior:
In [2]:pd.Index([1,2,3])==pd.Index([1,4,5])Out[2]:array([True,False,False],dtype=bool)In [3]:pd.Index([1,2,3])==pd.Index([2])Out[3]:array([False,True,False],dtype=bool)In [4]:pd.Index([1,2,3])==pd.Index([1,2])Out[4]:False
New Behavior:
In [8]:pd.Index([1,2,3])==pd.Index([1,4,5])Out[8]:array([True,False,False],dtype=bool)In [9]:pd.Index([1,2,3])==pd.Index([2])ValueError: Lengths must match to compareIn [10]:pd.Index([1,2,3])==pd.Index([1,2])ValueError: Lengths must match to compare
Note that this is different from the numpy behavior where a comparison can be broadcast:
In [69]:np.array([1,2,3])==np.array([1])Out[69]:array([True,False,False],dtype=bool)
or it can return False if broadcasting can not be done:
In [70]:np.array([1,2,3])==np.array([1,2])Out[70]:False
Boolean comparisons of a Series vs None will now be equivalent to comparing with np.nan, rather than raising a TypeError. (GH1079).
In [71]:s=Series(range(3))In [72]:s.iloc[1]=NoneIn [73]:sOut[73]:0 0.01 NaN2 2.0dtype: float64
Previous Behavior:
In [5]:s==NoneTypeError: Could not compare <type 'NoneType'> type with Series
New Behavior:
In [74]:s==NoneOut[74]:0 False1 False2 Falsedtype: bool
Usually you simply want to know which values are null.
In [75]:s.isnull()Out[75]:0 False1 True2 Falsedtype: bool
Warning
You generally will want to use isnull/notnull for these types of comparisons, as isnull/notnull tells you which elements are null. One has to be mindful that nan's don't compare equal, but None's do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [76]:None==NoneOut[76]:TrueIn [77]:np.nan==np.nanOut[77]:False
The default behavior for HDFStore write functions with format='table' is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the dropna=True option. (GH9382)
Previous Behavior:
In [78]:df_with_missing=pd.DataFrame({'col1':[0,np.nan,2], ....:'col2':[1,np.nan,np.nan]}) ....:In [79]:df_with_missingOut[79]: col1 col20 0.0 1.01 NaN NaN2 2.0 NaN
In [27]:df_with_missing.to_hdf('file.h5', 'df_with_missing', format='table', mode='w')In [28]:pd.read_hdf('file.h5','df_with_missing')Out [28]: col1 col2 0 0 1 2 2 NaN
New Behavior:
In [80]:df_with_missing.to_hdf('file.h5', ....:'df_with_missing', ....:format='table', ....:mode='w') ....:In [81]:pd.read_hdf('file.h5','df_with_missing')Out[81]: col1 col20 0.0 1.01 NaN NaN2 2.0 NaN
See thedocs for more details.
display.precision option
The display.precision option has been clarified to refer to decimal places (GH10451).
Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in display.precision.
In [1]:pd.set_option('display.precision',2)In [2]:pd.DataFrame({'x':[123.456789]})Out[2]: x0 123.5
If interpreting precision as “significant figures” this did work for scientific notation but that same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.
Going forward the value of display.precision will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy's precision print option works.
In [82]:pd.set_option('display.precision',2)In [83]:pd.DataFrame({'x':[123.456789]})Out[83]: x0 123.46
To preserve output behavior with prior versions, the default value of display.precision has been reduced to 6 from 7.
Categorical.unique
Categorical.unique now returns new Categoricals with categories and codes that are unique, rather than returning np.array (GH10508)
In [84]:cat=pd.Categorical(['C','A','B','C'], ....:categories=['A','B','C'], ....:ordered=True) ....:In [85]:catOut[85]:[C, A, B, C]Categories (3, object): [A < B < C]In [86]:cat.unique()Out[86]:[C, A, B]Categories (3, object): [A < B < C]In [87]:cat=pd.Categorical(['C','A','B','C'], ....:categories=['A','B','C']) ....:In [88]:catOut[88]:[C, A, B, C]Categories (3, object): [A, B, C]In [89]:cat.unique()Out[89]:[C, A, B]Categories (3, object): [C, A, B]
bool passed as header in Parsers
In earlier versions of pandas, if a bool was passed as the header argument of read_csv, read_excel, or read_html it was implicitly converted to an integer, resulting in header=0 for False and header=1 for True (GH6113)
A bool input to header will now raise a TypeError:
In [29]: df = pd.read_csv('data.csv', header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no header or header=int or list-like of ints to specify the row(s) making up the column names
Line and kde plot with subplots=True now uses default colors, not all black. Specify color='k' to draw all lines in black (GH9894)
Calling the .value_counts() method on a Series with a categorical dtype now returns a Series with a CategoricalIndex (GH10704)
The metadata properties of subclasses of pandas objects will now be serialized (GH10553).
groupby using Categorical follows the same rule as Categorical.unique described above (GH10508)
Previously, constructing a DataFrame from an array of complex64 dtype meant the corresponding column was automatically promoted to the complex128 dtype. Pandas will now preserve the itemsize of the input for complex data (GH10952)
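A minimal sketch of the itemsize preservation; the array is illustrative:
import numpy as np
import pandas as pd

arr = np.array([1 + 1j, 2 + 2j], dtype='complex64')
pd.DataFrame(arr).dtypes   # the column stays complex64 instead of being upcast to complex128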
Some numeric reduction operators would return ValueError, rather than TypeError, on object types that include strings and numbers (GH11131)
Passing a currently unsupported chunksize argument to read_excel or ExcelFile.parse will now raise NotImplementedError (GH8011)
Allow an ExcelFile object to be passed into read_excel (GH11198)
DatetimeIndex.union does not infer freq if self and the input have None as freq (GH11086)
NaT's methods now either raise ValueError, or return np.nan or NaT (GH9513)
| Behavior | Methods |
|---|---|
| return np.nan | weekday, isoweekday |
| return NaT | date, now, replace, to_datetime, today |
| return np.datetime64('NaT') | to_datetime64 (unchanged) |
| raise ValueError | All other public methods (names not beginning with underscores) |
For Series the following indexing functions are deprecated (GH10177).
| Deprecated Function | Replacement |
|---|---|
| .irow(i) | .iloc[i] or .iat[i] |
| .iget(i) | .iloc[i] or .iat[i] |
| .iget_value(i) | .iloc[i] or .iat[i] |
For DataFrame the following indexing functions are deprecated (GH10177).
| Deprecated Function | Replacement |
|---|---|
| .irow(i) | .iloc[i] |
| .iget_value(i, j) | .iloc[i, j] or .iat[i, j] |
| .icol(j) | .iloc[:, j] |
Note
These indexing functions have been deprecated in the documentation since 0.11.0.
- Categorical.name was deprecated to make Categorical more numpy.ndarray like. Use Series(cat, name="whatever") instead (GH10482).
- Setting missing values (NaN) in a Categorical's categories will issue a warning (GH10748). You can still have missing values in the values.
- drop_duplicates and duplicated's take_last keyword was deprecated in favor of keep. (GH6511, GH8505)
- Series.nsmallest and nlargest's take_last keyword was deprecated in favor of keep. (GH10792)
- DataFrame.combineAdd and DataFrame.combineMult are deprecated. They can easily be replaced by using the add and mul methods: DataFrame.add(other, fill_value=0) and DataFrame.mul(other, fill_value=1.) (GH10735).
- TimeSeries deprecated in favor of Series (note that this has been an alias since 0.13.0), (GH10890)
- SparsePanel deprecated and will be removed in a future version (GH11157).
- Series.is_time_series deprecated in favor of Series.index.is_all_dates (GH11135)
- Legacy offsets (like 'A@JAN') are deprecated (note that this has been an alias since 0.8.0) (GH10878)
- WidePanel deprecated in favor of Panel, LongPanel in favor of DataFrame (note these have been aliases since < 0.11.0), (GH10892)
- DataFrame.convert_objects has been deprecated in favor of type-specific functions pd.to_datetime, pd.to_timestamp and pd.to_numeric (new in 0.17.0) (GH11133).
Removal of na_last parameters from Series.order() and Series.sort(), in favor of na_position. (GH5231)
Removal of percentile_width from .describe(), in favor of percentiles. (GH7088)
Removal of colSpace parameter from DataFrame.to_string(), in favor of col_space, circa 0.8.0 version.
Removal of automatic time-series broadcasting (GH2304)
In [90]:np.random.seed(1234)In [91]:df=DataFrame(np.random.randn(5,2),columns=list('AB'),index=date_range('20130101',periods=5))In [92]:dfOut[92]: A B2013-01-01 0.471435 -1.1909762013-01-02 1.432707 -0.3126522013-01-03 -0.720589 0.8871632013-01-04 0.859588 -0.6365242013-01-05 0.015696 -2.242685
Previously
In [3]:df+df.AFutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the indexOut[3]: A B2013-01-01 0.942870 -0.7195412013-01-02 2.865414 1.1200552013-01-03 -1.441177 0.1665742013-01-04 1.719177 0.2230652013-01-05 0.031393 -2.226989
Current
In [93]:df.add(df.A,axis='index')Out[93]: A B2013-01-01 0.942870 -0.7195412013-01-02 2.865414 1.1200552013-01-03 -1.441177 0.1665742013-01-04 1.719177 0.2230652013-01-05 0.031393 -2.226989
Remove table keyword in HDFStore.put/append, in favor of using format= (GH4645)
Remove kind in read_excel/ExcelFile as it is unused (GH4712)
Remove infer_type keyword from pd.read_html as it is unused (GH4770, GH7032)
Remove offset and timeRule keywords from Series.tshift/shift, in favor of freq (GH4853, GH4864)
Remove pd.load/pd.save aliases in favor of pd.to_pickle/pd.read_pickle (GH3787)
Categorical.value_counts (GH10804)SeriesGroupBy.nunique andSeriesGroupBy.value_counts andSeriesGroupby.transform (GH10820,GH11077)DataFrame.drop_duplicates with integer dtypes (GH10917)DataFrame.duplicated with wide frames. (GH10161,GH11180)timedelta string parsing (GH6755,GH10426)timedelta64 anddatetime64 ops (GH6755)MultiIndex with slicers (GH10287)iloc using list-like input (GH10791)Series.isin for datetimelike/integer Series (GH10287)concat of Categoricals when categories are identical (GH10587)to_datetime when specified format string is ISO8601 (GH10178)Series.value_counts for float dtype (GH10821)infer_datetime_format into_datetime when date components do not have 0 padding (GH11142)DataFrame from nested dictionary (GH11084)DateOffset withSeries orDatetimeIndex (GH10744,GH11205).mean() ontimedelta64[ns] because of overflow (GH9442).isin on older numpies (:issue:11232)DataFrame.to_html(index=False) renders unnecessaryname row (GH10344)DataFrame.to_latex() thecolumn_format argument could not be passed (GH9402)DatetimeIndex when localizing withNaT (GH10477)Series.dt ops in preserving meta-data (GH10477)NaT when passed in an otherwise invalidto_datetime construction (GH10477)DataFrame.apply when function returns categorical series. (GH9573)to_datetime with invalid dates and formats supplied (GH10154)Index.drop_duplicates dropping name(s) (GH10115)Series.quantile dropping name (GH10881)pd.Series when setting a value on an emptySeries whose index has a frequency. (GH10193)pd.Series.interpolate with invalidorder keyword values. (GH10633)DataFrame.plot raisesValueError when color name is specified by multiple characters (GH10387)Index construction with a mixed list of tuples (GH10697)DataFrame.reset_index when index containsNaT. (GH10388)ExcelReader when worksheet is empty (GH6403)BinGrouper.group_info where returned values are not compatible with base class (GH10914)DataFrame.pop and a subsequent inplace op (GH10912)Index causing anImportError (GH10610)Series.count when index has nulls (GH10946)DatetimeIndex (GH11002)DataFrame.where to not respect theaxis parameter when the frame has a symmetric shape. (GH9736)Table.select_column where name is not preserved (GH10392)offsets.generate_range wherestart andend have finer precision thanoffset (GH9907)pd.rolling_* whereSeries.name would be lost in the output (GH10565)stack when index or columns are not unique. (GH10417)Panel when an axis has a multi-index (GH10360)USFederalHolidayCalendar whereUSMemorialDay andUSMartinLutherKingJr were incorrect (GH10278 andGH9760 ).sample() where returned object, if set, gives unnecessarySettingWithCopyWarning (GH10738).sample() where weights passed asSeries were not aligned along axis before being treated positionally, potentially causing problems if weight indices were not aligned with sampled object. 
(GH10738)DataFrame.interpolate withaxis=1 andinplace=True (GH10395)io.sql.get_schema when specifying multiple columns as primarykey (GH10385).groupby(sort=False) with datetime-likeCategorical raisesValueError (GH10505)groupby(axis=1) withfilter() throwsIndexError (GH11041)test_categorical on big-endian builds (GH10425)Series.shift andDataFrame.shift not supporting categorical data (GH9416)Series.map using categoricalSeries raisesAttributeError (GH10324)MultiIndex.get_level_values includingCategorical raisesAttributeError (GH10460)pd.get_dummies withsparse=True not returningSparseDataFrame (GH10531)Index subtypes (such asPeriodIndex) not returning their own type for.drop and.insert methods (GH10620)algos.outer_join_indexer whenright array is empty (GH10618)filter (regression from 0.16.0) andtransform when grouping on multiple keys, one of which is datetime-like (GH10114)to_datetime andto_timedelta causingIndex name to be lost (GH10875)len(DataFrame.groupby) causingIndexError when there’s a column containing only NaNs (:issue:11016)DatetimeIndex andPeriodIndex.value_counts resets name from its result, but retains in result’sIndex. (GH10150)pd.eval usingnumexpr engine coerces 1 element numpy array to scalar (GH10546)pd.concat withaxis=0 when column is of dtypecategory (GH10177)read_msgpack where input type is not always checked (GH10369,GH10630)pd.read_csv with kwargsindex_col=False,index_col=['a','b'] ordtype(GH10413,GH10467,GH10577)Series.from_csv withheader kwarg not setting theSeries.name or theSeries.index.name (GH10483)groupby.var which caused variance to be inaccurate for small float values (GH10448)Series.plot(kind='hist') Y Label not informative (GH10485)read_csv when using a converter which generates auint8 type (GH9266)Panel sliced along the major or minor axes when the right-hand side is aDataFrame (GH11014)None and does not raiseNotImplementedError when operator functions (e.g..add) ofPanel are not implemented (GH7692)subplots=True (GH9894)DataFrame.plot raisesValueError when color name is specified by multiple characters (GH10387)align ofSeries withMultiIndex may be inverted (GH10665)join of withMultiIndex may be inverted (GH10741)read_stata when reading a file with a different order set incolumns (GH10757)Categorical may not representing properly when category containstz orPeriod (GH10713)Categorical.__iter__ may not returning correctdatetime andPeriod (GH10713)PeriodIndex on an object with aPeriodIndex (GH4125)read_csv withengine='c': EOF preceded by a comment, blank line, etc. was not handled correctly (GH10728,GH10548)DataReader results in HTTP 404 error because of the website url is changed (GH10591).read_msgpack where DataFrame to decode has duplicate column names (GH9618)io.common.get_filepath_or_buffer which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (GH10604)datetime.date and numpydatetime64 (GH10408,GH10412)Index.take may add unnecessaryfreq attribute (GH10791)merge with emptyDataFrame may raiseIndexError (GH10824)to_latex where unexpected keyword argument for some documented arguments (GH10888)DataFrame whereIndexError is uncaught (GH10645 andGH10692)read_csv when using thenrows orchunksize parameters if file contains only a header line (GH9535)category types in HDF5 in presence of alternate encodings. 
(GH10366)pd.DataFrame when constructing an empty DataFrame with a string dtype (GH9428)pd.DataFrame.diff when DataFrame is not consolidated (GH10907)pd.unique for arrays with thedatetime64 ortimedelta64 dtype that meant an array with object dtype was returned instead the original dtype (GH9431)Timedelta raising error when slicing from 0s (GH10583)DatetimeIndex.take andTimedeltaIndex.take may not raiseIndexError against invalid index (GH10295)Series([np.nan]).astype('M8[ms]'), which now returnsSeries([pd.NaT]) (GH10747)PeriodIndex.order reset freq (GH10295)date_range whenfreq dividesend as nanos (GH10885)iloc allowing memory outside bounds of a Series to be accessed with negative integers (GH10779)read_msgpack where encoding is not respected (GH10581)iloc with a list containing the appropriate negative integer (GH10547,GH10779)TimedeltaIndex formatter causing error while trying to saveDataFrame withTimedeltaIndex usingto_csv (GH10833)DataFrame.where when handling Series slicing (GH10218,GH9558)pd.read_gbq throwsValueError when Bigquery returns zero rows (GH10273)to_json which was causing segmentation fault when serializing 0-rank ndarray (GH9576)IndexError when plotted onGridSpec (GH10819)groupby incorrect computation for aggregation onDataFrame withNaT (E.gfirst,last,min). (GH10590,GH11010)DataFrame where passing a dictionary with only scalar values and specifying columns did not raise an error (GH10856).var() causing roundoff errors for highly similar values (GH10242)DataFrame.plot(subplots=True) with duplicated columns outputs incorrect result (GH10962)Index arithmetic may result in incorrect class (GH10638)date_range results in empty if freq is negative annualy, quarterly and monthly (GH11018)DatetimeIndex cannot infer negative freq (GH11018)Index dtype may not applied properly (GH11017)io.gbq when testing for minimum google api client version (GH10652)DataFrame construction from nesteddict withtimedelta keys (GH11129).fillna against may raiseTypeError when data contains datetime dtype (GH7095,GH11153).groupby when number of keys to group by is same as length of index (GH11185)convert_objects where converted values might not be returned if all null andcoerce (GH9589)convert_objects wherecopy keyword was not respected (GH9589)This is a minor bug-fix release from 0.16.1 and includes a a large number ofbug fixes along some new features (pipe() method), enhancements, and performance improvements.
We recommend that all users upgrade to this version.
Highlights include:
What’s new in v0.16.2
We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like
# df is a DataFrame# f, g, and h are functions that take and return DataFramesf(g(h(df),arg1=1),arg2=2,arg3=3)
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(df.pipe(h).pipe(g,arg1=1).pipe(f,arg2=2,arg3=3))
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:
In [1]:importstatsmodels.formula.apiassmIn [2]:bb=pd.read_csv('data/baseball.csv',index_col='id')# sm.poisson takes (formula, data)In [3]:(bb.query('h > 0') ...:.assign(ln_h=lambdadf:np.log(df.h)) ...:.pipe((sm.poisson,'data'),'hr ~ ln_h + year + g + C(lg)') ...:.fit() ...:.summary() ...:) ...:Optimization terminated successfully. Current function value: 2.116284 Iterations 24Out[3]:<class 'statsmodels.iolib.summary.Summary'>""" Poisson Regression Results==============================================================================Dep. Variable: hr No. Observations: 68Model: Poisson Df Residuals: 63Method: MLE Df Model: 4Date: Don, 03 Nov 2016 Pseudo R-squ.: 0.6878Time: 16:51:07 Log-Likelihood: -143.91converged: True LL-Null: -460.91 LLR p-value: 6.774e-136=============================================================================== coef std err z P>|z| [95.0% Conf. Int.]-------------------------------------------------------------------------------Intercept -1267.3636 457.867 -2.768 0.006 -2164.767 -369.960C(lg)[T.NL] -0.2057 0.101 -2.044 0.041 -0.403 -0.008ln_h 0.9280 0.191 4.866 0.000 0.554 1.302year 0.6301 0.228 2.762 0.006 0.183 1.077g 0.0099 0.004 2.754 0.006 0.003 0.017==============================================================================="""
The pipe method is inspired by unix pipes, which stream text through processes. More recently dplyr and magrittr have introduced the popular (%>%) pipe operator for R.
See the documentation for more. (GH10129)
Added rsplit to Index/Series StringMethods (GH10303)
Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).
Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here.
The axis parameter of DataFrame.quantile now also accepts index and columns. (GH9543)
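As a rough sketch of what the string-aliased axis looks like in practice (the frame and column names below are invented for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

df.quantile(0.5, axis='index')    # same as axis=0, the default: one quantile per column
df.quantile(0.5, axis='columns')  # same as axis=1: one quantile per row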
Holiday now raisesNotImplementedError if bothoffset andobservance are used in the constructor instead of returning an incorrect result (GH10217).Series.hist raises an error when a one rowSeries was given (GH10214)HDFStore.select modifies the passed columns list (GH7212)Categorical repr withdisplay.width ofNone in Python 3 (GH10087)to_json with certain orients and aCategoricalIndex would segfault (GH10317)DataFrame.quantile on checking that a valid axis was passed (GH9543)groupby.apply aggregation forCategorical not preserving categories (GH10138)to_csv wheredate_format is ignored if thedatetime is fractional (GH10209)DataFrame.to_json with mixed data types (GH10289)mean() where integer dtypes can overflow (GH10172)Panel.from_dict does not set dtype when specified (GH10058)Index.union raisesAttributeError when passing array-likes. (GH10149)Timestamp‘s’microsecond,quarter,dayofyear,week anddaysinmonth properties returnnp.int type, not built-inint. (GH10050)NaT raisesAttributeError when accessing todaysinmonth,dayofweek properties. (GH10096)max_seq_items=None setting (GH10182).dateutil on various platforms (GH9059,GH8639,GH9663,GH10121)setitem where type promotion is applied to the entire block (GH10280)Series arithmetic methods may incorrectly hold names (GH10068)GroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH10132)DatetimeIndex andTimedeltaIndex names are lost after timedelta arithmetics (GH9926)DataFrame construction from nesteddict withdatetime64 (GH10160)Series construction fromdict withdatetime64 keys (GH9456)Series.plot(label="LABEL") not correctly setting the label (GH10119)plot not defaulting to matplotlibaxes.grid setting (GH9792)int instead offloat inengine='python' for theread_csv parser (GH9565)Series.align resetsname whenfill_value is specified (GH10067)read_csv causing index name not to be set on an empty DataFrame (GH10184)SparseSeries.abs resetsname (GH10241)TimedeltaIndex slicing may reset freq (GH10292)GroupBy.get_group raisesValueError when group key containsNaT (GH6992)SparseSeries constructor ignores input data name (GH10258)Categorical.remove_categories causing aValueError when removing theNaN category if underlying dtype is floating-point (GH10156)DataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)DataFrame (GH10126).read_csv with adate_parser that returned adatetime64 array of other time resolution than[ns] (GH10245)Panel.apply when the result has ndim=0 (GH10332)read_hdf whereauto_close could not be passed (GH9327).read_hdf where open stores could not be used (GH10330).DataFrame``s,nowresultsina``DataFrame that.equals an emptyDataFrame (GH10181).to_hdf andHDFStore which did not check that complib choices were valid (GH4582,GH8874).This is a minor bug-fix release from 0.16.0 and includes a a large number ofbug fixes along several new features, enhancements, and performance improvements.We recommend that all users upgrade to this version.
Highlights include:
CategoricalIndex, a category based index, see here
sample for drawing random samples from Series, DataFrames and Panels, see here
Index printing has changed to a more uniform format, see here
BusinessHour datetime-offset is now supported, see here
.str accessor to make string operations easier, see here

What’s new in v0.16.1
Warning
In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package. See here for details (GH8961)
We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [1]:df=DataFrame({'A':np.arange(6), ...:'B':Series(list('aabbca')).astype('category', ...:categories=list('cab')) ...:}) ...:In [2]:dfOut[2]: A B0 0 a1 1 a2 2 b3 3 b4 4 c5 5 aIn [3]:df.dtypesOut[3]:A int64B categorydtype: objectIn [4]:df.B.cat.categoriesOut[4]:Index([u'c',u'a',u'b'],dtype='object')
setting the index will create a CategoricalIndex
In [5]:df2=df.set_index('B')In [6]:df2.indexOut[6]:CategoricalIndex([u'a',u'a',u'b',u'b',u'c',u'a'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.
In [7]:df2.loc['a']Out[7]: ABa 0a 1a 5
and preserves theCategoricalIndex
In [8]:df2.loc['a'].indexOut[8]:CategoricalIndex([u'a',u'a',u'a'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
sorting will order by the order of the categories
In [9]:df2.sort_index()Out[9]: ABc 4a 0a 1a 5b 2b 3
groupby operations on the index will preserve the index nature as well
In [10]:df2.groupby(level=0).sum()Out[10]: ABc 4a 6b 5In [11]:df2.groupby(level=0).sum().indexOut[11]:CategoricalIndex([u'c',u'a',u'b'],categories=[u'c',u'a',u'b'],ordered=False,name=u'B',dtype='category')
reindexing operations will return a resulting index based on the type of the passed indexer, meaning that passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [12]:df2.reindex(['a','e'])Out[12]: ABa 0.0a 1.0a 5.0e NaNIn [13]:df2.reindex(['a','e']).indexOut[13]:Index([u'a',u'a',u'a',u'e'],dtype='object',name=u'B')In [14]:df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))Out[14]: ABa 0.0a 1.0a 5.0e NaNIn [15]:df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).indexOut[15]:CategoricalIndex([u'a',u'a',u'a',u'e'],categories=[u'a',u'b',u'c',u'd',u'e'],ordered=False,name=u'B',dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)
In [16]:example_series=Series([0,1,2,3,4,5])# When no arguments are passed, returns 1In [17]:example_series.sample()Out[17]:3 3dtype: int64# One may specify either a number of rows:In [18]:example_series.sample(n=3)Out[18]:5 51 14 4dtype: int64# Or a fraction of the rows:In [19]:example_series.sample(frac=0.5)Out[19]:4 41 10 0dtype: int64# weights are accepted.In [20]:example_weights=[0,0,0.2,0.2,0.2,0.4]In [21]:example_series.sample(n=3,weights=example_weights)Out[21]:2 23 35 5dtype: int64# weights will also be normalized if they do not sum to one,# and missing values will be treated as zeros.In [22]:example_weights2=[0.5,0,0,0,None,np.nan]In [23]:example_series.sample(n=1,weights=example_weights2)Out[23]:0 0dtype: int64
When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.
In [24]:df=DataFrame({'col1':[9,8,7,6],'weight_column':[0.5,0.4,0.1,0]})In [25]:df.sample(n=3,weights='weight_column')Out[25]: col1 weight_column0 9 0.51 8 0.42 7 0.1
Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.
Added StringMethods (.str accessor) to Index (GH9068)
The .str accessor is now available for both Series and Index.
In [26]:idx=Index([' jack','jill ',' jesse ','frank'])In [27]:idx.str.strip()Out[27]:Index([u'jack',u'jill',u'jesse',u'frank'],dtype='object')
One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH8875). This enables the following expression to work naturally:
In [28]:idx=Index(['a1','a2','b1','b2'])In [29]:s=Series(range(4),index=idx)In [30]:sOut[30]:a1 0a2 1b1 2b2 3dtype: int64In [31]:idx.str.startswith('a')Out[31]:array([True,True,False,False],dtype=bool)In [32]:s[s.index.str.startswith('a')]Out[32]:a1 0a2 1dtype: int64
The following new methods are accessible via the .str accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)

| Methods | | | | |
|---|---|---|---|---|
| capitalize() | swapcase() | normalize() | partition() | rpartition() |
| index() | rindex() | translate() | | |
split now takes an expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH9847)
In [33]:s=Series(['a,b','a,c','b,c'])# return SeriesIn [34]:s.str.split(',')Out[34]:0 [a, b]1 [a, c]2 [b, c]dtype: object# return DataFrameIn [35]:s.str.split(',',expand=True)Out[35]: 0 10 a b1 a c2 b cIn [36]:idx=Index(['a,b','a,c','b,c'])# return IndexIn [37]:idx.str.split(',')Out[37]:Index([[u'a',u'b'],[u'a',u'c'],[u'b',u'c']],dtype='object')# return MultiIndexIn [38]:idx.str.split(',',expand=True)Out[38]:MultiIndex(levels=[[u'a', u'b'], [u'b', u'c']], labels=[[0, 0, 1], [0, 1, 1]])
Improved extract and get_dummies methods for Index.str (GH9980)
BusinessHour offset is now supported, which represents business hours starting from 09:00 - 17:00 on BusinessDay by default. See here for details. (GH7905)
In [39]:frompandas.tseries.offsetsimportBusinessHourIn [40]:Timestamp('2014-08-01 09:00')+BusinessHour()Out[40]:Timestamp('2014-08-01 10:00:00')In [41]:Timestamp('2014-08-01 07:00')+BusinessHour()Out[41]:Timestamp('2014-08-01 10:00:00')In [42]:Timestamp('2014-08-01 16:30')+BusinessHour()Out[42]:Timestamp('2014-08-04 09:30:00')
DataFrame.diff now takes an axis parameter that determines the direction of differencing (GH9727)
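A minimal sketch of the two directions (the frame below is made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 3, 6], 'b': [2, 5, 9]})

df.diff()        # default axis=0: each row minus the previous row
df.diff(axis=1)  # new: each column minus the previous column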
Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (this was a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)
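A small sketch of per-row thresholds aligned along the index (the data and bounds are invented):

import pandas as pd

df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 10]})
lower = pd.Series([3, 4, 5])   # one lower bound per row

df.clip(lower=lower, axis=0)   # align the thresholds with the row index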
DataFrame.mask() and Series.mask() now support the same keywords as where (GH8801)
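For instance (a hedged sketch with arbitrary values), the other keyword from where is now available on mask as well:

import pandas as pd

s = pd.Series([1, -2, 3, -4])

s.where(s > 0, other=0)  # keep positives, replace the rest with 0
s.mask(s < 0, other=0)   # same result, expressed with mask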
The drop function can now accept an errors keyword to suppress the ValueError raised when any of the labels does not exist in the target data. (GH6736)
In [43]:df=DataFrame(np.random.randn(3,3),columns=['A','B','C'])In [44]:df.drop(['A','X'],axis=1,errors='ignore')Out[44]: B C0 1.058969 -0.3978401 1.047579 1.0459382 -0.122092 0.124713
Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)
Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str) (GH9757)
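A short sketch of the astype(str) conversion (the dates and durations are made up):

import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-01', '2015-06-15']))
s.astype(str)   # object dtype of strings such as '2015-01-01'

t = pd.Series(pd.to_timedelta(['1 days', '2 days']))
t.astype(str)   # e.g. '1 days 00:00:00'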
The get_dummies function now accepts a sparse keyword. If set to True, the returned DataFrame is sparse, e.g. SparseDataFrame. (GH8823)
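A hedged sketch (in this era the sparse result was a SparseDataFrame; later pandas versions represent sparse data differently):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

pd.get_dummies(s)               # dense DataFrame of indicator columns
pd.get_dummies(s, sparse=True)  # sparse variant of the same result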
Period now accepts datetime64 as value input. (GH9054)
Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)
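For example (a minimal sketch), both spellings now parse to the same value:

import pandas as pd

pd.to_timedelta('0:00:30')   # the shorter form without the leading zero now parses
pd.to_timedelta('00:00:30')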
Allow Panel.shift with axis='items' (GH9890)
Trying to write an Excel file now raises NotImplementedError if the DataFrame has a MultiIndex instead of writing a broken Excel file. (GH9794)
Allow Categorical.add_categories to accept Series or np.array. (GH9927)
Add/delete str/dt/cat accessors dynamically from __dir__. (GH9910)
Add normalize as a dt accessor method. (GH10047)
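A quick sketch of the new .dt.normalize() accessor (the timestamps are invented for illustration):

import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-01 09:30', '2015-01-02 18:45']))
s.dt.normalize()   # times are set to midnight, the dates are preserved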
DataFrame and Series now have a _constructor_expanddim property as an overridable constructor for data of one higher dimensionality. This should be used only when it is really needed, see here
pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH10032)
When passing an ax to df.plot(..., ax=ax), the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that by yourself for the right axes in your figure or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed in ax kwarg), then the default is still sharex=True and the visibility changes are applied.
assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)
read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH9770)
The string representation of Index and its sub-classes has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but less than display.max_seq_items); if there are lots of items (> display.max_seq_items) it will show a truncated display (the head and tail of the data). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)
Previous Behavior
In [2]:pd.Index(range(4),name='foo')Out[2]:Int64Index([0,1,2,3],dtype='int64')In [3]:pd.Index(range(104),name='foo')Out[3]:Int64Index([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,...],dtype='int64')In [4]:pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')Out[4]:<class 'pandas.tseries.index.DatetimeIndex'>[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]Length: 4, Freq: D, Timezone: US/EasternIn [5]:pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')Out[5]:<class 'pandas.tseries.index.DatetimeIndex'>[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]Length: 104, Freq: D, Timezone: US/Eastern
New Behavior
In [45]:pd.set_option('display.width',80)In [46]:pd.Index(range(4),name='foo')Out[46]:Int64Index([0,1,2,3],dtype='int64',name=u'foo')In [47]:pd.Index(range(30),name='foo')Out[47]:Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype='int64', name=u'foo')In [48]:pd.Index(range(104),name='foo')Out[48]:Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 94, 95, 96, 97, 98, 99, 100, 101, 102, 103], dtype='int64', name=u'foo', length=104)In [49]:pd.CategoricalIndex(['a','bb','ccc','dddd'],ordered=True,name='foobar')Out[49]:CategoricalIndex([u'a',u'bb',u'ccc',u'dddd'],categories=[u'a',u'bb',u'ccc',u'dddd'],ordered=True,name=u'foobar',dtype='category')In [50]:pd.CategoricalIndex(['a','bb','ccc','dddd']*10,ordered=True,name='foobar')Out[50]:CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')In [51]:pd.CategoricalIndex(['a','bb','ccc','dddd']*100,ordered=True,name='foobar')Out[51]:CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', ... u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category', length=400)In [52]:pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')Out[52]:DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name=u'foo', freq='D')In [53]:pd.date_range('20130101',periods=25,freq='D')Out[53]:DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12', '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16', '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20', '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24', '2013-01-25'], dtype='datetime64[ns]', freq='D')In [54]:pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')Out[54]:DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00', '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00', '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00', '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00', ... '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00', '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00', '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00', '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00', '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', name=u'foo', length=104, freq='D')
DataFrame.plot(), passinglabel= arguments works, and Series indices are no longer mutated. (GH9542)read_csv where missing trailing delimiters would cause segfault. (GH5664)scatter_matrix draws unexpected axis ticklabels (GH5662)StataWriter resulting in changes to inputDataFrame upon save (GH9795).transform causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)equals causing false negatives when block order differed (GH9330)pd.Grouper where one is non-time based (GH10063)read_sql_table error when reading postgres table with timezone (GH7139)DataFrame slicing may not retain metadata (GH9776)TimdeltaIndex were not properly serialized in fixedHDFStore (GH9635)TimedeltaIndex constructor ignoringname when given anotherTimedeltaIndex as data (GH10025).DataFrameFormatter._get_formatted_index with not applyingmax_colwidth to theDataFrame index (GH7856).loc with a read-only ndarray data source (GH10043)groupby.apply() that would raise if a passed user defined function either returned onlyNone (for all input). (GH9685)secondary_y may not show legend properly. (GH9610,GH9779)DataFrame.plot(kind="hist") results inTypeError whenDataFrame contains non-numeric columns (GH9853)DataFrame with aDatetimeIndex may raiseTypeError (GH9852)setup.py that would allow an incompat cython version to build (GH9827)secondary_y incorrectly attachesright_ax property to secondary axes specifying itself recursively. (GH9861)Series.quantile on empty Series of typeDatetime orTimedelta (GH9675)where causing incorrect results when upcasting was required (GH9731)FloatArrayFormatter where decision boundary for displaying “small” floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)DataFrame.plot() raised an error when bothcolor andstyle keywords were passed and there was no color symbol in the style strings (GH9671)DeprecationWarning on combining list-likes with anIndex (GH10083)read_csv andread_table when usingskip_rows parameter if blank lines are present. (GH9832)read_csv() interpretsindex_col=True as1 (GH9798)== failing on Index/MultiIndex type incompatibility (GH9785)SparseDataFrame could not takenan as a column name (GH8822)to_msgpack andread_msgpack zlib and blosc compression support (GH9783)GroupBy.size doesn’t attach index name properly if grouped byTimeGrouper (GH9925)length_of_indexer returns wrong results (GH9995)Categorical (GH9603)TimedeltaIndex incorrectly raisedValueError instead ofAttributeError (GH9680)Series(Categorical(list("abc"),ordered=True))>"d". This returnedFalse for all elements, but now raises aTypeError. Equality comparisons also now returnFalse for== andTrue for!=. (GH9848)__setitem__ when right hand side is a dictionary (GH9874)where when dtype isdatetime64/timedelta64, but dtype of other is not (GH9804)MultiIndex.sortlevel() results in unicode level name breaks (GH9856)groupby.transform incorrectly enforced output dtypes to match input dtypes. (GH9807)DataFrame constructor whencolumns parameter is set, anddata is an empty list (GH9939)log=True raisesTypeError if all values are less than 1 (GH9905)log=True (GH9905)Decimal by anotherDecimal would raise. (GH9787)AbstractHolidayCalendar to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)DataFrame.loc (GH9596)transform andfilter when grouping on a categorical variable (GH9921)transform when groups are equal in number and dtype to the input index (GH9700)oauth2client.tools.run() (GH8327)DataFrame. 
It may not return the correct class when slicing or subsetting it. (GH9632)
.median() where non-float null values are not handled correctly (GH10040)

This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
DataFrame.assign method, see here
Series.to_coo/from_coo methods to interact with scipy.sparse, see here
Timedelta to conform the .seconds attribute with datetime.timedelta, see here
.loc slicing API to conform with the behavior of .ix, see here
Categorical constructor, see here
.str accessor to make string operations easier, see here
pandas.tools.rplot, pandas.sandbox.qtpandas and pandas.rpy modules are deprecated. We refer users to external packages like seaborn, pandas-qt and rpy2 for similar or equivalent functionality, see here

Check the API Changes and deprecations before updating.
What’s new in v0.16.0
Inspired by dplyr’s mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.
In [1]:iris=read_csv('data/iris.data')In [2]:iris.head()Out[2]: SepalLength SepalWidth PetalLength PetalWidth Name0 5.1 3.5 1.4 0.2 Iris-setosa1 4.9 3.0 1.4 0.2 Iris-setosa2 4.7 3.2 1.3 0.2 Iris-setosa3 4.6 3.1 1.5 0.2 Iris-setosa4 5.0 3.6 1.4 0.2 Iris-setosaIn [3]:iris.assign(sepal_ratio=iris['SepalWidth']/iris['SepalLength']).head()Out[3]: SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio0 5.1 3.5 1.4 0.2 Iris-setosa 0.6862751 4.9 3.0 1.4 0.2 Iris-setosa 0.6122452 4.7 3.2 1.3 0.2 Iris-setosa 0.6808513 4.6 3.1 1.5 0.2 Iris-setosa 0.6739134 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.
In [4]:iris.assign(sepal_ratio=lambdax:(x['SepalWidth']/ ...:x['SepalLength'])).head() ...:Out[4]: SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio0 5.1 3.5 1.4 0.2 Iris-setosa 0.6862751 4.9 3.0 1.4 0.2 Iris-setosa 0.6122452 4.7 3.2 1.3 0.2 Iris-setosa 0.6808513 4.6 3.1 1.5 0.2 Iris-setosa 0.6739134 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
The power of assign comes when used in chains of operations. For example, we can limit the DataFrame to just those with a Sepal Length greater than 5, calculate the ratio, and plot
In [5]:(iris.query('SepalLength > 5') ...:.assign(SepalRatio=lambdax:x.SepalWidth/x.SepalLength, ...:PetalRatio=lambdax:x.PetalWidth/x.PetalLength) ...:.plot(kind='scatter',x='SepalRatio',y='PetalRatio')) ...:Out[5]:<matplotlib.axes._subplots.AxesSubplotat0x7fd23589bb10>

See the documentation for more. (GH9229)
Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:
In [6]:fromnumpyimportnanIn [7]:s=Series([3.0,nan,1.0,3.0,nan,nan])In [8]:s.index=MultiIndex.from_tuples([(1,2,'a',0), ...:(1,2,'a',1), ...:(1,1,'b',0), ...:(1,1,'b',1), ...:(2,1,'b',0), ...:(2,1,'b',1)], ...:names=['A','B','C','D']) ...:In [9]:sOut[9]:A B C D1 2 a 0 3.0 1 NaN 1 b 0 1.0 1 3.02 1 b 0 NaN 1 NaNdtype: float64# SparseSeriesIn [10]:ss=s.to_sparse()In [11]:ssOut[11]:A B C D1 2 a 0 3.0 1 NaN 1 b 0 1.0 1 3.02 1 b 0 NaN 1 NaNdtype: float64BlockIndexBlock locations: array([0, 2], dtype=int32)Block lengths: array([1, 2], dtype=int32)In [12]:A,rows,columns=ss.to_coo(row_levels=['A','B'], ....:column_levels=['C','D'], ....:sort_labels=False) ....:In [13]:AOut[13]:<3x4 sparse matrix of type '<type 'numpy.float64'>'with 3 stored elements in COOrdinate format>In [14]:A.todense()Out[14]:matrix([[ 3., 0., 0., 0.], [ 0., 0., 1., 3.], [ 0., 0., 0., 0.]])In [15]:rowsOut[15]:[(1,2),(1,1),(2,1)]In [16]:columnsOut[16]:[('a',0),('a',1),('b',0),('b',1)]
The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:
In [17]:fromscipyimportsparseIn [18]:A=sparse.coo_matrix(([3.0,1.0,2.0],([1,0,0],[0,2,3])), ....:shape=(3,4)) ....:In [19]:AOut[19]:<3x4 sparse matrix of type '<type 'numpy.float64'>'with 3 stored elements in COOrdinate format>In [20]:A.todense()Out[20]:matrix([[ 0., 0., 1., 2.], [ 3., 0., 0., 0.], [ 0., 0., 0., 0.]])In [21]:ss=SparseSeries.from_coo(A)In [22]:ssOut[22]:0 2 1.0 3 2.01 0 3.0dtype: float64BlockIndexBlock locations: array([0], dtype=int32)Block lengths: array([3], dtype=int32)
The following new methods are accessible via the .str accessor to apply the function to each value. This is intended to make it more consistent with standard methods on strings. (GH9282, GH9352, GH9386, GH9387, GH9439)

| Methods | | | | |
|---|---|---|---|---|
| isalnum() | isalpha() | isdigit() | isdigit() | isspace() |
| islower() | isupper() | istitle() | isnumeric() | isdecimal() |
| find() | rfind() | ljust() | rjust() | zfill() |
In [23]:s=Series(['abcd','3456','EFGH'])In [24]:s.str.isalpha()Out[24]:0 True1 False2 Truedtype: boolIn [25]:s.str.find('ab')Out[25]:0 01 -12 -1dtype: int64
Series.str.pad() and Series.str.center() now accept a fillchar option to specify the filling character (GH9352)
In [26]:s=Series(['12','300','25'])In [27]:s.str.pad(5,fillchar='_')Out[27]:0 ___121 __3002 ___25dtype: object
Added Series.str.slice_replace(), which previously raised NotImplementedError (GH8888)
In [28]:s=Series(['ABCD','EFGH','IJK'])In [29]:s.str.slice_replace(1,3,'X')Out[29]:0 AXD1 EXH2 IXdtype: object# replaced with empty charIn [30]:s.str.slice_replace(0,1)Out[30]:0 BCD1 FGH2 JKdtype: object
Reindex now supports method='nearest' for frames or series with a monotonic increasing or decreasing index (GH9258):
In [31]:df=pd.DataFrame({'x':range(5)})In [32]:df.reindex([0.2,1.8,3.5],method='nearest')Out[32]: x0.2 01.8 23.5 4
This method is also exposed by the lower level Index.get_indexer and Index.get_loc methods.
The read_excel() function’s sheetname argument now accepts a list and None, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (GH9450)
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.pd.read_excel('path_to_file.xls',sheetname=['Sheet1',3])
Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs here (GH9493).
Paths beginning with ~ will now be expanded to begin with the user’s home directory (GH9066)
Added time interval selection in get_data_yahoo (GH9071)
Added Timestamp.to_datetime64() to complement Timedelta.to_timedelta64() (GH9255)
tseries.frequencies.to_offset() now accepts Timedelta as input (GH9064)
A lag parameter was added to the autocorrelation method of Series, defaulting to lag-1 autocorrelation (GH9192)
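As a small illustration of the new lag parameter (the data is invented):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0])

s.autocorr()       # lag-1 autocorrelation, the default
s.autocorr(lag=2)  # lag-2 autocorrelation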
Timedelta will now accept a nanoseconds keyword in the constructor (GH9273)
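For example (a minimal sketch):

import pandas as pd

pd.Timedelta(seconds=1, milliseconds=5, nanoseconds=42)
# Timedelta('0 days 00:00:01.005000042')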
SQL code now safely escapes table and column names (GH8986)
Added auto-complete for Series.str.<tab>, Series.dt.<tab> and Series.cat.<tab> (GH9322)
Index.get_indexer now supports method='pad' and method='backfill' even for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (GH9258).
Index.asof now works on all index types (GH9258).
A verbose argument has been augmented in io.read_excel(), defaults to False. Set to True to print sheet names as they are parsed. (GH9450)
Added days_in_month (compatibility alias daysinmonth) property to Timestamp, DatetimeIndex, Period, PeriodIndex, and Series.dt (GH9572)
Added decimal option in to_csv to provide formatting for non-‘.’ decimal separators (GH781)
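A short sketch of a comma decimal separator (the file name and field separator are chosen arbitrarily):

import pandas as pd

df = pd.DataFrame({'a': [1.5, 2.25]})
# write 1,5 and 2,25 to the file; use ';' as the field separator so the commas don't clash
df.to_csv('out.csv', sep=';', decimal=',')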
Added normalize option for Timestamp to normalize to midnight (GH8794)
Added example for DataFrame import to R using an HDF5 file and the rhdf5 library. See the documentation for more (GH9636).
In v0.15.0 a new scalar type Timedelta was introduced, which is a sub-class of datetime.timedelta. Mentioned here was a notice of an API change w.r.t. the .seconds accessor. The intent was to provide a user-friendly set of accessors that give the ‘natural’ value for that unit, e.g. if you had a Timedelta('1 day, 10:11:12'), then .seconds would return 12. However, this is at odds with the definition of datetime.timedelta, which defines .seconds as 10*3600 + 11*60 + 12 == 36672.
So in v0.16.0, we are restoring the API to match that of datetime.timedelta. Further, the component values are still available through the .components accessor. This affects the .seconds and .microseconds accessors, and removes the .hours, .minutes, .milliseconds accessors. These changes affect TimedeltaIndex and the Series .dt accessor as well. (GH9185, GH9139)
Previous Behavior
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')
In [3]: t.days
Out[3]: 1
In [4]: t.seconds
Out[4]: 12
In [5]: t.microseconds
Out[5]: 123
New Behavior
In [33]: t = pd.Timedelta('1 day, 10:11:12.100123')
In [34]: t.days
Out[34]: 1
In [35]: t.seconds
Out[35]: 36672
In [36]: t.microseconds
Out[36]: 100123
Using.components allows the full component access
In [37]: t.components
Out[37]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)
In [38]: t.components.seconds
Out[38]: 12
The behavior of a small sub-set of edge cases for using .loc has changed (GH8613). Furthermore we have improved the content of the error messages that are raised:
Slicing with .loc where the start and/or stop bound is not found in the index is now allowed; this previously would raise a KeyError. This makes the behavior the same as .ix in this case. This change is only for slicing, not when indexing with a single label.
In [39]:df=DataFrame(np.random.randn(5,4), ....:columns=list('ABCD'), ....:index=date_range('20130101',periods=5)) ....:In [40]:dfOut[40]: A B C D2013-01-01 -0.322795 0.841675 2.390961 0.0762002013-01-02 -0.566446 0.036142 -2.074978 0.2477922013-01-03 -0.897157 -0.136795 0.018289 0.7554142013-01-04 0.215269 0.841009 -1.445810 -1.4019732013-01-05 -0.100918 -0.548242 -0.144620 0.354020In [41]:s=Series(range(5),[-2,-1,1,2,3])In [42]:sOut[42]:-2 0-1 1 1 2 2 3 3 4dtype: int64
Previous Behavior
In [4]: df.loc['2013-01-02':'2013-01-10']
KeyError: 'stop bound [2013-01-10] is not in the [index]'

In [6]: s.loc[-10:3]
KeyError: 'start bound [-10] is not the [index]'
New Behavior
In [43]:df.loc['2013-01-02':'2013-01-10']Out[43]: A B C D2013-01-02 -0.566446 0.036142 -2.074978 0.2477922013-01-03 -0.897157 -0.136795 0.018289 0.7554142013-01-04 0.215269 0.841009 -1.445810 -1.4019732013-01-05 -0.100918 -0.548242 -0.144620 0.354020In [44]:s.loc[-10:3]Out[44]:-2 0-1 1 1 2 2 3 3 4dtype: int64
Allow slicing with float-like values on an integer index for .ix. Previously this was only enabled for .loc:
Previous Behavior
In [8]: s.ix[-1.0:2]
TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)
New Behavior
In [45]: s.ix[-1.0:2]
Out[45]:
-1    1
 1    2
 2    3
dtype: int64
Provide a useful exception for indexing with an invalid type for that index when using .loc. For example trying to use .loc on an index of type DatetimeIndex or PeriodIndex or TimedeltaIndex, with an integer (or a float).
Previous Behavior
In [4]: df.loc[2:3]
KeyError: 'start bound [2] is not the [index]'
New Behavior
In [4]: df.loc[2:3]
TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys
In prior versions, Categoricals that had an unspecified ordering (meaning no ordered keyword was passed) were defaulted as ordered Categoricals. Going forward, the ordered keyword in the Categorical constructor will default to False. Ordering must now be explicit.
Furthermore, previously you could change the ordered attribute of a Categorical by just setting the attribute, e.g. cat.ordered = True; this is now deprecated and you should use cat.as_ordered() or cat.as_unordered(). These will by default return a new object and not modify the existing object. (GH9347, GH9190)
Previous Behavior
In [3]:s=Series([0,1,2],dtype='category')In [4]:sOut[4]:0 01 12 2dtype: categoryCategories (3, int64): [0 < 1 < 2]In [5]:s.cat.orderedOut[5]:TrueIn [6]:s.cat.ordered=FalseIn [7]:sOut[7]:0 01 12 2dtype: categoryCategories (3, int64): [0, 1, 2]
New Behavior
In [46]:s=Series([0,1,2],dtype='category')In [47]:sOut[47]:0 01 12 2dtype: categoryCategories (3, int64): [0, 1, 2]In [48]:s.cat.orderedOut[48]:FalseIn [49]:s=s.cat.as_ordered()In [50]:sOut[50]:0 01 12 2dtype: categoryCategories (3, int64): [0 < 1 < 2]In [51]:s.cat.orderedOut[51]:True# you can set in the constructor of the CategoricalIn [52]:s=Series(Categorical([0,1,2],ordered=True))In [53]:sOut[53]:0 01 12 2dtype: categoryCategories (3, int64): [0 < 1 < 2]In [54]:s.cat.orderedOut[54]:True
For ease of creation of series of categorical data, we have added the ability to pass keywords when calling.astype(). These are passed directly to the constructor.
In [55]:s=Series(["a","b","c","a"]).astype('category',ordered=True)In [56]:sOut[56]:0 a1 b2 c3 adtype: categoryCategories (3, object): [a < b < c]In [57]:s=Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False)In [58]:sOut[58]:0 a1 b2 c3 adtype: categoryCategories (6, object): [a, b, c, d, e, f]
Index.duplicated now returns np.array(dtype=bool) rather than Index(dtype=object) containing bool values. (GH8875)
DataFrame.to_json now returns accurate type serialisation for each column for frames of mixed dtype (GH9037)
Previously data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised to floats:
In [2]: pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'
Now each column is serialised using its correct dtype:
In [2]: pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'
DatetimeIndex, PeriodIndex and TimedeltaIndex.summary now output the same format. (GH9116)
TimedeltaIndex.freqstr now outputs the same string format as DatetimeIndex. (GH9116)
Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib’s axhline or axvline methods (GH9088).
Series accessors .dt, .cat and .str now raise AttributeError instead of TypeError if the series does not contain the appropriate type of data (GH9617). This follows Python’s built-in exception hierarchy more closely and ensures that tests like hasattr(s, 'cat') are consistent on both Python 2 and 3.
Series now supports bitwise operations for integral types (GH9016). Previously even if the input dtypes were integral, the output dtype was coerced to bool.
Previous Behavior
In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
Out[2]:
a    True
b    True
c    True
d    True
dtype: bool
New Behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.
In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
Out[2]:
a    4
b    5
c    6
d    7
dtype: int64
During division involving a Series or DataFrame, 0/0 and 0//0 now give np.nan instead of np.inf. (GH9144, GH8445)
Previous Behavior
In [2]: p = pd.Series([0, 1])
In [3]: p / 0
Out[3]:
0    inf
1    inf
dtype: float64
In [4]: p // 0
Out[4]:
0    inf
1    inf
dtype: float64
New Behavior
In [59]: p = pd.Series([0, 1])
In [60]: p / 0
Out[60]:
0    NaN
1    inf
dtype: float64
In [61]: p // 0
Out[61]:
0    NaN
1    inf
dtype: float64
Series.value_counts and Series.describe for categorical data will now put NaN entries at the end. (GH9443)
Series.describe for categorical data will now give counts and frequencies of 0, not NaN, for unused categories (GH9443)
Due to a bug fix, looking up a partial string label with DatetimeIndex.asof now includes values that match the string, even if they are after the start of the partial string label (GH9258).
Old behavior:
In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
Out[4]: Timestamp('2000-01-31 00:00:00')
Fixed behavior:
In [62]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
Out[62]: Timestamp('2000-02-28 00:00:00')
To reproduce the old behavior, simply add more precision to the label (e.g., use 2000-02-01 instead of 2000-02).
rplot trellis plotting interface is deprecated and will be removedin a future version. We refer to external packages likeseaborn for similarbut more refined functionality (GH3445).The documentation includes some examples how to convert your existing codeusingrplot to seaborn:rplot docs.pandas.sandbox.qtpandas interface is deprecated and will be removed in a future version.We refer users to the external packagepandas-qt. (GH9615)pandas.rpy interface is deprecated and will be removed in a future version.Similar functionaility can be accessed thru therpy2 project (GH9602)DatetimeIndex/PeriodIndex to anotherDatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to aTypeError in a future version..union() should be used for the union set operation. (GH9094)DatetimeIndex/PeriodIndex from anotherDatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to an actual numeric subtraction yielding aTimeDeltaIndex in a future version..difference() should be used for the differencing set operation. (GH9094)DataFrame.pivot_table andcrosstab‘srows andcols keyword arguments were removed in favorofindex andcolumns (GH6581)DataFrame.to_excel andDataFrame.to_csvcols keyword argument was removed in favor ofcolumns (GH6581)convert_dummies in favor ofget_dummies (GH6581)value_range in favor ofdescribe (GH6581).loc indexing with an array or list-like (GH9126:).DataFrame.to_json 30x performance improvement for mixed dtype frames. (GH9037)MultiIndex.duplicated by working with labels instead of values (GH9125)nunique by callingunique instead ofvalue_counts (GH9129,GH7771)DataFrame.count andDataFrame.dropna by taking advantage of homogeneous/heterogeneous dtypes appropriately (GH9136)DataFrame.count when using aMultiIndex and thelevel keyword argument (GH9163)merge when key space exceedsint64 bounds (GH9151)groupby (GH9429)MultiIndex.sortlevel (GH9445)DataFrame.duplicated (GH9398)Period (GH9440)to_hdf (GH9648).to_html to remove leading/trailing spaces in table body (GH4987)read_csv on s3 with Python 3 (GH9452)DatetimeIndex affecting architectures wherenumpy.int_ defaults tonumpy.int32 (GH8943)Series.dt.components index was reset to the default index (GH9247)Categorical.__getitem__/__setitem__ with listlike input getting incorrect results from indexer coercion (GH9469)to_sql when mapping aTimestamp object column (datetimecolumn with timezone info) to the appropriate sqlalchemy type (GH9085).to_sqldtype argument not accepting an instantiatedSQLAlchemy type (GH9083)..loc partial setting with anp.datetime64 (GH9516)Series & on.xs slices (GH9477)Categorical.unique() (ands.unique() ifs is of dtypecategory) now appear in the order in which they are originally found, not in sorted order (GH9331). 
This is now consistent with the behavior for other dtypes in pandas.StataReader (GH8688).MultiIndex.has_duplicates when having many levels causes an indexer overflow (GH9075,GH5873)pivot andunstack wherenan values would break index alignment (GH4862,GH7401,GH7403,GH7405,GH7466,GH9497)join on multi-index withsort=True or null values (GH9210).MultiIndex where inserting new keys would fail (GH9250).groupby when key space exceedsint64 bounds (GH9096).unstack withTimedeltaIndex orDatetimeIndex and nulls (GH9491).rank where comparing floats with tolerance will cause inconsistent behaviour (GH8365).read_stata andStataReader when loading data from a URL (GH9231).offsets.Nano to other offets raisesTypeError (GH9284)DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)resample around DST transitions. This required fixing offset classes so they behave correctly on DST transitions. (GH5172,GH8744,GH8653,GH9173,GH9468)..mul()) alignment with integer levels (GH9463).layout kw may show unnecessary warning (GH9464)fillna), (GH9221)DataFrame now properly supports simultaneouscopy anddtype arguments in constructor (GH9099)read_csv when using skiprows on a file with CR line endings with the c engine. (GH9079)isnull now detectsNaT inPeriodIndex (GH9129).nth() with a multiple column groupby (GH8979)DataFrame.where andSeries.where coerce numerics to string incorrectly (GH9280)DataFrame.where andSeries.where raiseValueError when string list-like is passed. (GH9280)Series.str methods on with non-string values now raisesTypeError instead of producing incorrect results (GH9184)DatetimeIndex.__contains__ when index has duplicates and is not monotonic increasing (GH9512)Series.kurt() when all values are equal (GH9197)xlsxwriter engine where it added a default ‘General’ format to cells if no other format wass applied. This prevented other row or column formatting being applied. (GH9167)index_col=False whenusecols is also specified inread_csv. (GH9082)wide_to_long would modify the input stubnames list (GH9204)to_sql not storing float64 values using double precision. (GH9009)SparseSeries andSparsePanel now accept zero argument constructors (same as their non-sparse counterparts) (GH9272).Categorical andobject dtypes (GH9426)read_csv with buffer overflows with certain malformed input files (GH9205)Series.groupby where grouping onMultiIndex levels would ignore the sort argument (GH9444)DataFrame.Groupby wheresort=False is ignored in the case of Categorical columns. (GH8868)Series.values_counts with excludingNaN for categorical typeSeries withdropna=True (GH9443)DataFrame.std/var/sem (GH9201)Panel orPanel4D with scalar data (GH8285)Series text representation disconnected frommax_rows/max_columns (GH7508).Series number formatting inconsistent when truncated (GH8532).
Previous Behavior
In [2]: pd.options.display.max_rows = 10
In [3]: s = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9999, 1, 1] * 10)
In [4]: s
Out[4]:
0      1
1      1
2      1
...
127    0.9999
128    1.0000
129    1.0000
Length: 130, dtype: float64
New Behavior
0      1.0000
1      1.0000
2      1.0000
3      1.0000
4      1.0000
        ...
125    1.0000
126    1.0000
127    0.9999
128    1.0000
129    1.0000
dtype: float64
A spurious SettingWithCopy warning was generated when setting a new item in a frame in some cases (GH8730)
The following would previously report a SettingWithCopy warning.
In [1]: df1 = DataFrame({'x': Series(['a', 'b', 'c']), 'y': Series(['d', 'e', 'f'])})
In [2]: df2 = df1[['x']]
In [3]: df2['y'] = ['g', 'h', 'i']
This is a minor release from 0.15.1 and includes a large number of bug fixesalong with several new features, enhancements, and performance improvements.A small number of API changes were necessary to fix existing bugs.We recommend that all users upgrade to this version.
Indexing in MultiIndex beyond lex-sort depth is now supported, though a lexically sorted index will have better performance. (GH2646)
In [1]:df=pd.DataFrame({'jim':[0,0,1,1], ...:'joe':['x','x','z','y'], ...:'jolie':np.random.rand(4)}).set_index(['jim','joe']) ...:In [2]:dfOut[2]: joliejim joe0 x 0.123943 x 0.1193811 z 0.738523 y 0.587304In [3]:df.index.lexsort_depthOut[3]:1# in prior versions this would raise a KeyError# will now show a PerformanceWarningIn [4]:df.loc[(1,'z')]Out[4]: joliejim joe1 z 0.738523# lexically sortingIn [5]:df2=df.sortlevel()In [6]:df2Out[6]: joliejim joe0 x 0.123943 x 0.1193811 y 0.587304 z 0.738523In [7]:df2.index.lexsort_depthOut[7]:2In [8]:df2.loc[(1,'z')]Out[8]: joliejim joe1 z 0.738523
Bug in unique of Series with category dtype, which returned all categories regardless of whether they were "used" or not (see GH8559 for the discussion). Previous behaviour was to return all categories:
In [3]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
In [4]: cat
Out[4]:
[a, b, a]
Categories (3, object): [a < b < c]
In [5]: cat.unique()
Out[5]: array(['a', 'b', 'c'], dtype=object)
Now, only the categories that do effectively occur in the array are returned:
In [9]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
In [10]: cat.unique()
Out[10]:
[a, b]
Categories (2, object): [a, b]
Series.all and Series.any now support the level and skipna parameters. Series.all, Series.any, Index.all, and Index.any no longer support the out and keepdims parameters, which existed for compatibility with ndarray. Various index types no longer support the all and any aggregation functions and will now raise TypeError. (GH8302).
Allow equality comparisons of Series with a categorical dtype and object dtype; previously these would raise TypeError (GH8938)
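A minimal sketch of such a comparison (the values are invented):

import pandas as pd

cat = pd.Series(['a', 'b', 'c'], dtype='category')
obj = pd.Series(['a', 'x', 'c'])

cat == obj   # element-wise boolean Series; this previously raised TypeError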
Bug in NDFrame: conflicting attribute/column names now behave consistently between getting and setting. Previously, when both a column and attribute named y existed, data.y would return the attribute, while data.y = z would update the column (GH8994)
In [11]: data = pd.DataFrame({'x': [1, 2, 3]})
In [12]: data.y = 2
In [13]: data['y'] = [2, 4, 6]
In [14]: data
Out[14]:
   x  y
0  1  2
1  2  4
2  3  6

# this assignment was inconsistent
In [15]: data.y = 5
Old behavior:
In [6]: data.y
Out[6]: 2
In [7]: data['y'].values
Out[7]: array([5, 5, 5])
New behavior:
In [16]: data.y
Out[16]: 5
In [17]: data['y'].values
Out[17]: array([2, 4, 6])
Timestamp('now') is now equivalent to Timestamp.now() in that it returns the local time rather than UTC. Also, Timestamp('today') is now equivalent to Timestamp.today() and both have tz as a possible argument. (GH9000)
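For instance (a hedged sketch):

import pandas as pd

pd.Timestamp('now')            # local time, same as pd.Timestamp.now()
pd.Timestamp('today')          # same as pd.Timestamp.today()
pd.Timestamp('now', tz='UTC')  # tz may be passed as well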
Fix negative step support for label-based slices (GH8753)
Old behavior:
In [1]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
Out[1]:
a    0
b    1
c    2
dtype: int64
In [2]: s.loc['c':'a':-1]
Out[2]:
c    2
dtype: int64
New behavior:
In [18]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
In [19]: s.loc['c':'a':-1]
Out[19]:
c    2
b    1
a    0
dtype: int64
Categorical enhancements:
order_categoricals to StataReader and read_stata to select whether to order imported categorical data (GH8836). See here for more information on importing categorical variables from Stata data files.
category dtyped data is stored in a more efficient manner. See here for an example and caveats w.r.t. prior versions of pandas.
searchsorted() on Categorical class (GH8420).

Other enhancements:
Added the ability to specify the SQL type of columns when writing a DataFrame to a database (GH8778). For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:
fromsqlalchemy.typesimportStringdata.to_sql('data_dtype',engine,dtype={'Col_1':String})
Series.all andSeries.any now support thelevel andskipna parameters (GH8302):
In [20]:s=pd.Series([False,True,False],index=[0,0,1])In [21]:s.any(level=0)Out[21]:0 True1 Falsedtype: bool
Panel now supports theall andany aggregation functions. (GH8302):
In [22]:p=pd.Panel(np.random.rand(2,5,4)>0.1)In [23]:p.all()Out[23]: 0 10 True True1 True True2 False False3 True True
Added support for utcfromtimestamp(), fromtimestamp(), and combine() on the Timestamp class (GH5351).
Added Google Analytics (pandas.io.ga) basic documentation (GH8835). See here.
Timedelta arithmetic returns NotImplemented in unknown cases, allowing extensions by custom classes (GH8813).
Timedelta now supports arithmetic with numpy.ndarray objects of the appropriate dtype (numpy 1.8 or newer only) (GH8884).
Added Timedelta.to_timedelta64() method to the public API (GH8884).
Added gbq.generate_bq_schema() function to the gbq module (GH8325).
Series now works with map objects the same way as generators (GH8909).
Added context manager to HDFStore for automatic closing (GH8791).
to_datetime gains an exact keyword to allow for a format to not require an exact match for a provided format string (if it is False). exact defaults to True (meaning that exact matching is still the default) (GH8904)
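A small sketch of exact=False (the surrounding text in the example string is arbitrary):

import pandas as pd

# with exact=True (the default) this would fail to parse;
# with exact=False the format may match anywhere inside the string
pd.to_datetime('report: 2014-12-31', format='%Y-%m-%d', exact=False)
# Timestamp('2014-12-31 00:00:00')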
Added an axvlines boolean option to the parallel_coordinates plot function; it determines whether vertical lines are drawn, and defaults to True
Added ability to read table footers to read_html (GH8552)
to_sql now infers datatypes of non-NA values for columns that contain NA values and have dtypeobject (GH8778).
category dtype which were coercing toobject. (GH8641)TypeError rather thanValueError (a couple of edge cases only), (GH8865)pd.Grouper(key=...) with no level/axis or level only (GH8795,GH8866)TypeError when invalid/no parameters are passed in a groupby (GH8015)py2app/cx_Freeze (GH8602,GH8831)groupby signatures that didn’t include *args or **kwargs (GH8733).io.data.Options now raisesRemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH8761), (GH8783).io.data.Options now raisesRemoteDataError when no expiry dates are available from Yahoo (GH8761).Timedelta kwargs may now be numpy ints and floats (GH8757).Timedelta arithmetic and comparisons (GH8813,GH5963,GH5436).sql_schema now generates dialect appropriateCREATETABLE statements (GH8697)slice string method now takes step into account (GH8754)BlockManager where setting values with different type would break block integrity (GH8850)DatetimeIndex when usingtime object as key (GH8667)merge wherehow='left' andsort=False would not preserve left frame order (GH7331)MultiIndex.reindex where reindexing at level would not reorder labels (GH4088)to_datetime when parsing a nanoseconds using the%f format (GH8989)io.data.Options now raisesRemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH8761), (GH8783).to_html,index=False which would add an extra column (GH8452).size attribute acrossNDFrame objects to provide compat with numpy >= 1.9.1; buggy withnp.array_split (GH8846)get_data_google returned object dtypes (GH3995)DataFrame.stack(...,dropna=False) when the DataFrame’scolumns is aMultiIndexwhoselabels do not reference all itslevels. (GH8844)__enter__ (GH8514)DataFrame.plot(kind='scatter') fails when checking if an np.array is in the DataFrame (GH8852)pd.infer_freq/DataFrame.inferred_freq that prevented proper sub-daily frequency inference when the index contained DST days (GH8772).use_index=False (GH8558).MultiIndex where__contains__ returns wrong result if index is not lexically sorted or unique (GH7724)Timestamp does not parse ‘Z’ zone designator for UTC (GH8771)This is a minor bug-fix release from 0.15.0 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
s.dt.hour and other.dt accessors will now returnnp.nan for missing values (rather than previously -1), (GH8689)
In [1]: s = Series(date_range('20130101', periods=5, freq='D'))
In [2]: s.iloc[2] = np.nan
In [3]: s
Out[3]:
0   2013-01-01
1   2013-01-02
2          NaT
3   2013-01-04
4   2013-01-05
dtype: datetime64[ns]
previous behavior:
In [6]: s.dt.hour
Out[6]:
0    0
1    0
2   -1
3    0
4    0
dtype: int64
current behavior:
In [4]: s.dt.hour
Out[4]:
0    0.0
1    0.0
2    NaN
3    0.0
4    0.0
dtype: float64
groupby withas_index=False will not add erroneous extra columns toresult (GH8582):
In [5]: np.random.seed(2718281)
In [6]: df = pd.DataFrame(np.random.randint(0, 100, (10, 2)),
   ...:                   columns=['jim', 'joe'])
   ...:
In [7]: df.head()
Out[7]:
   jim  joe
0   61   81
1   96   49
2   55   65
3   72   51
4   77   12

In [8]: ts = pd.Series(5 * np.random.randint(0, 3, 10))
previous behavior:
In [4]: df.groupby(ts, as_index=False).max()
Out[4]:
   NaN  jim  joe
0    0   72   83
1    5   77   84
2   10   96   65
current behavior:
In [9]: df.groupby(ts, as_index=False).max()
Out[9]:
   jim  joe
0   72   83
1   77   84
2   96   65
groupby will not erroneously exclude columns if the column name conflicts with the grouper name (GH8112):
In [10]: df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)})
In [11]: df
Out[11]:
   jim  joe
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

In [12]: gr = df.groupby(df['jim'] < 2)
previous behavior (excludes 1st column from output):
In [4]: gr.apply(sum)
Out[4]:
       joe
jim
False   24
True    11
current behavior:
In [13]: gr.apply(sum)
Out[13]:
       jim  joe
jim
False    9   24
True     1   11
Support for slicing with monotonic decreasing indexes, even ifstart orstop isnot found in the index (GH7860):
In [14]: s = pd.Series(['a', 'b', 'c', 'd'], [4, 3, 2, 1])
In [15]: s
Out[15]:
4    a
3    b
2    c
1    d
dtype: object
previous behavior:
In [8]: s.loc[3.5:1.5]
KeyError: 3.5
current behavior:
In [16]: s.loc[3.5:1.5]
Out[16]:
3    b
2    c
dtype: object
io.data.Options has been fixed for a change in the format of the Yahoo Options page (GH8612), (GH8741)
Note
As a result of a change in Yahoo's option page layout, when an expiry date is given, Options methods now return data for a single expiry date. Previously, methods returned all data for the selected month.
The month and year parameters have been undeprecated and can be used to get all options data for a given month.
If an expiry date that is not valid is given, data for the next expiry after the given date is returned.
Option data frames are now saved on the instance as callsYYMMDD or putsYYMMDD. Previously they were saved as callsMMYY and putsMMYY. The next expiry is saved as calls and puts.
New features:
expiry_dates was added, which returns all available expiry dates.
Current behavior:
In [17]:frompandas.io.dataimportOptionsIn [18]:aapl=Options('aapl','yahoo')In [19]:aapl.get_call_data().iloc[0:5,0:1]Out[19]: LastStrike Expiry Type Symbol80 2014-11-14 call AAPL141114C00080000 29.0584 2014-11-14 call AAPL141114C00084000 24.8085 2014-11-14 call AAPL141114C00085000 24.0586 2014-11-14 call AAPL141114C00086000 22.7687 2014-11-14 call AAPL141114C00087000 21.74In [20]:aapl.expiry_datesOut[20]:[datetime.date(2014, 11, 14), datetime.date(2014, 11, 22), datetime.date(2014, 11, 28), datetime.date(2014, 12, 5), datetime.date(2014, 12, 12), datetime.date(2014, 12, 20), datetime.date(2015, 1, 17), datetime.date(2015, 2, 20), datetime.date(2015, 4, 17), datetime.date(2015, 7, 17), datetime.date(2016, 1, 15), datetime.date(2017, 1, 20)]In [21]:aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3]).iloc[0:5,0:1]Out[21]: LastStrike Expiry Type Symbol109 2014-11-22 call AAPL141122C00109000 1.48 2014-11-28 call AAPL141128C00109000 1.79110 2014-11-14 call AAPL141114C00110000 0.55 2014-11-22 call AAPL141122C00110000 1.02 2014-11-28 call AAPL141128C00110000 1.32
datetime64 dtype in matplotlib's units registry to plot such values as datetimes. This is activated once pandas is imported. In previous versions, plotting an array of datetime64 values will have resulted in plotted integer values. To keep the previous behaviour, you can do del matplotlib.units.registry[np.datetime64] (GH8614).
concat permits a wider variety of iterables of pandas objects to be passed as the first parameter (GH8645):
In [17]: from collections import deque
In [18]: df1 = pd.DataFrame([1, 2, 3])
In [19]: df2 = pd.DataFrame([4, 5, 6])
previous behavior:
In [7]: pd.concat(deque((df1, df2)))
TypeError: first argument must be a list-like of pandas objects, you passed an object of type "deque"
current behavior:
In [20]: pd.concat(deque((df1, df2)))
Out[20]:
   0
0  1
1  2
2  3
0  4
1  5
2  6
Represent MultiIndex labels with a dtype that utilizes memory based on the level size. In prior versions, the memory usage was a constant 8 bytes per element in each level. In addition, in prior versions, the reported memory usage was incorrect as it didn't show the usage for the memory occupied by the underlying data array. (GH8456)
In [21]: dfi = DataFrame(1, index=pd.MultiIndex.from_product([['a'], range(1000)]), columns=['A'])
previous behavior:
# this was underreported in prior versions
In [1]: dfi.memory_usage(index=True)
Out[1]:
Index    8000  # took about 24008 bytes in < 0.15.1
A        8000
dtype: int64
current behavior:
In [22]: dfi.memory_usage(index=True)
Out[22]:
Index    11040
A         8000
dtype: int64
Added Index propertiesis_monotonic_increasing andis_monotonic_decreasing (GH8680).
Added option to select columns when importing Stata files (GH7935)
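A minimal sketch, assuming the new keyword is named columns ('data.dta' and the column names are hypothetical):

import pandas as pd

# read only the listed columns from a Stata file
df = pd.read_stata('data.dta', columns=['age', 'income'])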
Qualify memory usage inDataFrame.info() by adding+ if it is a lower bound (GH8578)
Raise errors in certain aggregation cases where an argument such asnumeric_only is not handled (GH8592).
Added support for 3-character ISO and non-standard country codes inio.wb.download() (GH8482)
World Bank data requests now will warn/raise based on an errors argument, as well as a list of hard-coded country codes and the World Bank's JSON response. In prior versions, the error messages didn't look at the World Bank's JSON response; problem-inducing input was simply dropped prior to the request. The issue was that many good countries were cropped in the hard-coded approach. All countries will work now, but some bad countries will raise exceptions because some edge cases break the entire response. (GH8482)
Added option toSeries.str.split() to return aDataFrame rather than aSeries (GH8428)
Added option todf.info(null_counts=None|True|False) to override the default display options and force showing of the null-counts (GH8701)
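A short sketch of forcing the null counts to be displayed:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3]})

# show per-column non-null counts regardless of the display defaults
df.info(null_counts=True)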
CustomBusinessDay object (GH8591)Categorical to a records array, e.g.df.to_records() (GH8626)Categorical not created properly withSeries.to_frame() (GH8626)Categorical of a passedpd.Categorical (this now raisesTypeError correctly), (GH8626)cut/qcut when usingSeries andretbins=True (GH8589)to_sql (GH8624).Categorical of datetime raising when being compared to a scalar datetime (GH8687)Categorical with.iloc (GH8623)Categorical reflected comparison operator raising if the first argument was a numpy array scalar (e.g. np.int64) (GH8658)DataFrame.dtypes whenoptions.mode.use_inf_as_null is True (GH8722)read_csv,dialect parameter would not take a string (:issue:8703)np.nan on numpy 1.7 (GH8980).shape attribute forMultiIndex (GH8609)GroupBy where a name conflict between the grouper and columnswould breakgroupby operations (GH7115,GH8112)y and specifying a label would mutate the index name of the original DataFrame (GH8494)date_range where partially-specified dates would incorporate current date (GH6961)DataReader‘s would fail if one of the symbols passed was invalid. Now returns data for valid symbols and np.nan for invalid (GH8494)get_quote_yahoo that wouldn’t allow non-float return values (GH5229).This is a major release from 0.14.1 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
Warning
pandas >= 0.15.0 will no longer support compatibility with NumPy versions <1.7.0. If you want to use the latest versions of pandas, please upgrade toNumPy >= 1.7.0 (GH7711)
Categorical type was integrated as a first-class pandas type, seehereTimedelta, and a new index typeTimedeltaIndex, seehere.dt for Series, seeDatetimelike Propertiesdf.info() to include memory usage, seeMemory Usageread_csv will now by default ignore blank lines when parsing, seehereIndex class to no longer sub-classndarray, seeInternal RefactoringPyTables less than version 3.0.0, andnumexpr less than version 2.1 (GH7990)Warning
In 0.15.0Index has internally been refactored to no longer sub-classndarraybut instead subclassPandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should bea transparent change with only very limited API implications (See theInternal Refactoring)
Warning
The refactorings inCategorical changed the two argument constructor from“codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you useCategorical directly, please audit your code before updating to this pandasversion and change it to use thefrom_codes() constructor. See more onCategoricalhere
Categorical can now be included inSeries andDataFrames and gained newmethods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (GH3943,GH5313,GH5314,GH7444,GH7839,GH7848,GH7864,GH7914,GH7768,GH8006,GH3678,GH8075,GH8076,GH8143,GH8453,GH8518).
For full docs, see thecategorical introduction and theAPI documentation.
In [1]:df=DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})In [2]:df["grade"]=df["raw_grade"].astype("category")In [3]:df["grade"]Out[3]:0 a1 b2 b3 a4 a5 eName: grade, dtype: categoryCategories (3, object): [a, b, e]# Rename the categoriesIn [4]:df["grade"].cat.categories=["very good","good","very bad"]# Reorder the categories and simultaneously add the missing categoriesIn [5]:df["grade"]=df["grade"].cat.set_categories(["very bad","bad","medium","good","very good"])In [6]:df["grade"]Out[6]:0 very good1 good2 good3 very good4 very good5 very badName: grade, dtype: categoryCategories (5, object): [very bad, bad, medium, good, very good]In [7]:df.sort("grade")Out[7]: id raw_grade grade5 6 e very bad1 2 b good2 3 b good0 1 a very good3 4 a very good4 5 a very goodIn [8]:df.groupby("grade").size()Out[8]:gradevery bad 1bad 0medium 0good 2very good 3dtype: int64
pandas.core.group_agg and pandas.core.factor_agg were removed. As an alternative, construct a dataframe and use df.groupby(<group>).agg(<func>).
Supplying "codes/labels and levels" to the Categorical constructor is not supported anymore. Supplying two arguments to the constructor is now interpreted as "values and levels (now called 'categories')". Please change your code to use the from_codes() constructor.
The Categorical.labels attribute was renamed to Categorical.codes and is read-only. If you want to manipulate codes, please use one of the API methods on Categoricals.
The Categorical.levels attribute is renamed to Categorical.categories.
We introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes. This type is very similar to how Timestamp works for datetimes. It is a nice-API box for the type. See the docs. (GH3009, GH4533, GH8209, GH8187, GH8190, GH7869, GH7661, GH8345, GH8471)
Warning
Timedelta scalars (andTimedeltaIndex) component fields arenot the same as the component fields on adatetime.timedelta object. For example,.seconds on adatetime.timedelta object returns the total number of seconds combined betweenhours,minutes andseconds. In contrast, the pandasTimedelta breaks out hours, minutes, microseconds and nanoseconds separately.
# Timedelta accessor
In [9]: tds = Timedelta('31 days 5 min 3 sec')
In [10]: tds.minutes
Out[10]: 5L
In [11]: tds.seconds
Out[11]: 3L

# datetime.timedelta accessor
# this is 5 minutes * 60 + 3 seconds
In [12]: tds.to_pytimedelta().seconds
Out[12]: 303
Note: this is no longer true starting from v0.16.0, where fullcompatibility withdatetime.timedelta is introduced. See the0.16.0 whatsnew entry
Warning
Prior to 0.15.0, pd.to_timedelta would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.
The arguments to pd.to_timedelta are now (arg, unit='ns', box=True, coerce=False); previously they were (arg, box=True, unit='ns'), as the new order is more logical.
Construct a scalar
In [9]: Timedelta('1 days 06:05:01.00003')
Out[9]: Timedelta('1 days 06:05:01.000030')

In [10]: Timedelta('15.5us')
Out[10]: Timedelta('0 days 00:00:00.000015')

In [11]: Timedelta('1 hour 15.5us')
Out[11]: Timedelta('0 days 01:00:00.000015')

# negative Timedeltas have this string repr
# to be more consistent with datetime.timedelta conventions
In [12]: Timedelta('-1us')
Out[12]: Timedelta('-1 days +23:59:59.999999')

# a NaT
In [13]: Timedelta('nan')
Out[13]: NaT
Access fields for aTimedelta
In [14]: td = Timedelta('1 hour 3m 15.5us')
In [15]: td.seconds
Out[15]: 3780
In [16]: td.microseconds
Out[16]: 15
In [17]: td.nanoseconds
Out[17]: 500
Construct aTimedeltaIndex
In [18]:TimedeltaIndex(['1 days','1 days, 00:00:05', ....:np.timedelta64(2,'D'),timedelta(days=2,seconds=2)]) ....:Out[18]:TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00', '2 days 00:00:02'], dtype='timedelta64[ns]', freq=None)
Constructing aTimedeltaIndex with a regular range
In [19]:timedelta_range('1 days',periods=5,freq='D')Out[19]:TimedeltaIndex(['1 days','2 days','3 days','4 days','5 days'],dtype='timedelta64[ns]',freq='D')In [20]:timedelta_range(start='1 days',end='2 days',freq='30T')Out[20]:TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00', '1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00', '1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00', '1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00', '1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00', '1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00', '1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00', '1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00', '1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00', '1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00', '1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00', '1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00', '1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00', '1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00', '1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00', '1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00', '2 days 00:00:00'], dtype='timedelta64[ns]', freq='30T')
You can now use aTimedeltaIndex as the index of a pandas object
In [21]: s = Series(np.arange(5),
   ....:            index=timedelta_range('1 days', periods=5, freq='s'))
   ....:
In [22]: s
Out[22]:
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
1 days 00:00:03    3
1 days 00:00:04    4
Freq: S, dtype: int64
You can select with partial string selections
In [23]: s['1 day 00:00:02']
Out[23]: 2

In [24]: s['1 day':'1 day 00:00:02']
Out[24]:
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
Freq: S, dtype: int64
Finally, the combination ofTimedeltaIndex withDatetimeIndex allow certain combination operations that areNaT preserving:
In [25]:tdi=TimedeltaIndex(['1 days',pd.NaT,'2 days'])In [26]:tdi.tolist()Out[26]:[Timedelta('1 days 00:00:00'),NaT,Timedelta('2 days 00:00:00')]In [27]:dti=date_range('20130101',periods=3)In [28]:dti.tolist()Out[28]:[Timestamp('2013-01-01 00:00:00', freq='D'), Timestamp('2013-01-02 00:00:00', freq='D'), Timestamp('2013-01-03 00:00:00', freq='D')]In [29]:(dti+tdi).tolist()Out[29]:[Timestamp('2013-01-02 00:00:00'),NaT,Timestamp('2013-01-05 00:00:00')]In [30]:(dti-tdi).tolist()Out[30]:[Timestamp('2012-12-31 00:00:00'),NaT,Timestamp('2013-01-01 00:00:00')]
Series e.g. list(Series(...)) of timedelta64[ns] would prior to v0.15.0 return np.timedelta64 for each element. These will now be wrapped in Timedelta.
Implemented methods to find memory usage of a DataFrame. See the FAQ for more. (GH6852)
A new display optiondisplay.memory_usage (seeOptions and Settings) sets the default behavior of thememory_usage argument in thedf.info() method. By defaultdisplay.memory_usage isTrue.
In [31]:dtypes=['int64','float64','datetime64[ns]','timedelta64[ns]', ....:'complex128','object','bool'] ....:In [32]:n=5000In [33]:data=dict([(t,np.random.randint(100,size=n).astype(t)) ....:fortindtypes]) ....:In [34]:df=DataFrame(data)In [35]:df['categorical']=df['object'].astype('category')In [36]:df.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 5000 entries, 0 to 4999Data columns (total 8 columns):bool 5000 non-null boolcomplex128 5000 non-null complex128datetime64[ns] 5000 non-null datetime64[ns]float64 5000 non-null float64int64 5000 non-null int64object 5000 non-null objecttimedelta64[ns] 5000 non-null timedelta64[ns]categorical 5000 non-null categorydtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)memory usage: 284.1+ KB
Additionallymemory_usage() is an available method for a dataframe object which returns the memory usage of each column.
In [37]: df.memory_usage(index=True)
Out[37]:
Index                72
bool               5000
complex128        80000
datetime64[ns]    40000
float64           40000
int64             40000
object            40000
timedelta64[ns]   40000
categorical        5800
dtype: int64
Series has gained an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series. (GH7207) This will return a Series, indexed like the existing Series. See the docs
# datetimeIn [38]:s=Series(date_range('20130101 09:10:12',periods=4))In [39]:sOut[39]:0 2013-01-01 09:10:121 2013-01-02 09:10:122 2013-01-03 09:10:123 2013-01-04 09:10:12dtype: datetime64[ns]In [40]:s.dt.hourOut[40]:0 91 92 93 9dtype: int64In [41]:s.dt.secondOut[41]:0 121 122 123 12dtype: int64In [42]:s.dt.dayOut[42]:0 11 22 33 4dtype: int64In [43]:s.dt.freqOut[43]:<Day>
This enables nice expressions like this:
In [44]: s[s.dt.day == 2]
Out[44]:
1   2013-01-02 09:10:12
dtype: datetime64[ns]
You can easily produce tz aware transformations:
In [45]: stz = s.dt.tz_localize('US/Eastern')
In [46]: stz
Out[46]:
0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [47]: stz.dt.tz
Out[47]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
You can also chain these types of operations:
In [48]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[48]:
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]
The.dt accessor works for period and timedelta dtypes.
# periodIn [49]:s=Series(period_range('20130101',periods=4,freq='D'))In [50]:sOut[50]:0 2013-01-011 2013-01-022 2013-01-033 2013-01-04dtype: objectIn [51]:s.dt.yearOut[51]:0 20131 20132 20133 2013dtype: int64In [52]:s.dt.dayOut[52]:0 11 22 33 4dtype: int64
# timedeltaIn [53]:s=Series(timedelta_range('1 day 00:00:05',periods=4,freq='s'))In [54]:sOut[54]:0 1 days 00:00:051 1 days 00:00:062 1 days 00:00:073 1 days 00:00:08dtype: timedelta64[ns]In [55]:s.dt.daysOut[55]:0 11 12 13 1dtype: int64In [56]:s.dt.secondsOut[56]:0 51 62 73 8dtype: int64In [57]:s.dt.componentsOut[57]: days hours minutes seconds milliseconds microseconds nanoseconds0 1 0 0 5 0 0 01 1 0 0 6 0 0 02 1 0 0 7 0 0 03 1 0 0 8 0 0 0
tz_localize(None) for tz-awareTimestamp andDatetimeIndex now removes timezone holding local time,previously this resulted inException orTypeError (GH7812)
In [58]:ts=Timestamp('2014-08-01 09:00',tz='US/Eastern')In [59]:tsOut[59]:Timestamp('2014-08-01 09:00:00-0400',tz='US/Eastern')In [60]:ts.tz_localize(None)Out[60]:Timestamp('2014-08-01 09:00:00')In [61]:didx=DatetimeIndex(start='2014-08-01 09:00',freq='H',periods=10,tz='US/Eastern')In [62]:didxOut[62]:DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00', '2014-08-01 11:00:00-04:00', '2014-08-01 12:00:00-04:00', '2014-08-01 13:00:00-04:00', '2014-08-01 14:00:00-04:00', '2014-08-01 15:00:00-04:00', '2014-08-01 16:00:00-04:00', '2014-08-01 17:00:00-04:00', '2014-08-01 18:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', freq='H')In [63]:didx.tz_localize(None)Out[63]:DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00', '2014-08-01 11:00:00', '2014-08-01 12:00:00', '2014-08-01 13:00:00', '2014-08-01 14:00:00', '2014-08-01 15:00:00', '2014-08-01 16:00:00', '2014-08-01 17:00:00', '2014-08-01 18:00:00'], dtype='datetime64[ns]', freq='H')
tz_localize now accepts the ambiguous keyword, which allows passing an array of bools indicating whether the date belongs in DST or not, 'NaT' for setting transition times to NaT, 'infer' for inferring DST/non-DST, and 'raise' (default) for an AmbiguousTimeError to be raised. See the docs for more details (GH7943)
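A short sketch of the new keyword; the repeated fall-back wall-clock time below is illustrative:

import pandas as pd

# 01:30 occurs twice when US DST ends; say which occurrence is DST
idx = pd.DatetimeIndex(['2014-11-02 01:30', '2014-11-02 01:30'])
idx.tz_localize('US/Eastern', ambiguous=[True, False])

# or set ambiguous times to NaT
idx.tz_localize('US/Eastern', ambiguous='NaT')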
DataFrame.tz_localize andDataFrame.tz_convert now accepts an optionallevel argumentfor localizing a specific level of a MultiIndex (GH7846)
Timestamp.tz_localize andTimestamp.tz_convert now raiseTypeError in error cases, rather thanException (GH8025)
a timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone (rather than being a naivedatetime64[ns]) asobject dtype (GH8411)
Timestamp.__repr__ displaysdateutil.tz.tzoffset info (GH7907)
rolling_min(),rolling_max(),rolling_cov(), androlling_corr()now return objects with allNaN whenlen(arg)<min_periods<=window ratherthan raising. (This makes all rolling functions consistent in this behavior). (GH7766)
Prior to 0.15.0
In [64]: s = Series([10, 11, 12, 13])
In [15]: rolling_min(s, window=10, min_periods=5)
ValueError: min_periods (5) must be <= window (4)
New behavior
In [4]: pd.rolling_min(s, window=10, min_periods=5)
Out[4]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64
rolling_max(),rolling_min(),rolling_sum(),rolling_mean(),rolling_median(),rolling_std(),rolling_var(),rolling_skew(),rolling_kurt(),rolling_quantile(),rolling_cov(),rolling_corr(),rolling_corr_pairwise(),rolling_window(), androlling_apply() withcenter=True previously would return a result of the samestructure as the inputarg withNaN in the final(window-1)/2 entries.
Now the final(window-1)/2 entries of the result are calculated as if the inputarg were followedby(window-1)/2NaN values (or with shrinking windows, in the case ofrolling_apply()).(GH7925,GH8269)
Prior behavior (note final value isNaN):
In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
Out[7]:
0     1
1     3
2     6
3   NaN
dtype: float64
New behavior (note final value is 5 = sum([2, 3, NaN])):
In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True)
Out[7]:
0    1
1    3
2    6
3    5
dtype: float64
rolling_window() now normalizes the weights properly in rolling mean mode (mean=True) so thatthe calculated weighted means (e.g. ‘triang’, ‘gaussian’) are distributed about the same means as thosecalculated without weighting (i.e. ‘boxcar’). Seethe note on normalization for further details. (GH7618)
In [65]: s = Series([10.5, 8.8, 11.4, 9.7, 9.3])
Behavior prior to 0.15.0:
In [39]: rolling_window(s, window=3, win_type='triang', center=True)
Out[39]:
0         NaN
1    6.583333
2    6.883333
3    6.683333
4         NaN
dtype: float64
New behavior
In [10]: pd.rolling_window(s, window=3, win_type='triang', center=True)
Out[10]:
0       NaN
1     9.875
2    10.325
3    10.025
4       NaN
dtype: float64
Removedcenter argument from allexpanding_ functions (seelist),as the results produced whencenter=True did not make much sense. (GH7925)
Added optionalddof argument toexpanding_cov() androlling_cov().The default value of1 is backwards-compatible. (GH8279)
Documented theddof argument toexpanding_var(),expanding_std(),rolling_var(), androlling_std(). These functions’ support of addof argument (with a default value of1) was previously undocumented. (GH8064)
ewma(),ewmstd(),ewmvol(),ewmvar(),ewmcov(), andewmcorr()now interpretmin_periods in the same manner that therolling_*() andexpanding_*() functions do:a given result entry will beNaN if the (expanding, in this case) window does not containat leastmin_periods values. The previous behavior was to set toNaN themin_periods entriesstarting with the first non-NaN value. (GH7977)
Prior behavior (note values start at index2, which ismin_periods after index0(the index of the first non-empty value)):
In [66]: s = Series([1, None, None, None, 2, 3])
In [51]: ewma(s, com=3., min_periods=2)
Out[51]:
0         NaN
1         NaN
2    1.000000
3    1.000000
4    1.571429
5    2.189189
dtype: float64
New behavior (note values start at index4, the location of the 2nd (sincemin_periods=2) non-empty value):
In [2]: pd.ewma(s, com=3., min_periods=2)
Out[2]:
0         NaN
1         NaN
2         NaN
3         NaN
4    1.759644
5    2.383784
dtype: float64
ewmstd(),ewmvol(),ewmvar(),ewmcov(), andewmcorr()now have an optionaladjust argument, just likeewma() does,affecting how the weights are calculated.The default value ofadjust isTrue, which is backwards-compatible.SeeExponentially weighted moment functions for details. (GH7911)
ewma(),ewmstd(),ewmvol(),ewmvar(),ewmcov(), andewmcorr()now have an optionalignore_na argument.Whenignore_na=False (the default), missing values are taken into account in the weights calculation.Whenignore_na=True (which reproduces the pre-0.15.0 behavior), missing values are ignored in the weights calculation.(GH7543)
In [7]: pd.ewma(Series([None, 1., 8.]), com=2.)
Out[7]:
0    NaN
1    1.0
2    5.2
dtype: float64

In [8]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=True)  # pre-0.15.0 behavior
Out[8]:
0    1.0
1    1.0
2    5.2
dtype: float64

In [9]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=False)  # new default
Out[9]:
0    1.000000
1    1.000000
2    5.846154
dtype: float64
Warning
By default (ignore_na=False) theewm*() functions’ weights calculationin the presence of missing values is different than in pre-0.15.0 versions.To reproduce the pre-0.15.0 calculation of weights in the presence of missing valuesone must specify explicitlyignore_na=True.
Bug inexpanding_cov(),expanding_corr(),rolling_cov(),rolling_cor(),ewmcov(), andewmcorr()returning results with columns sorted by name and producing an error for non-unique columns;now handles non-unique columns and returns columns in original order(except for the case of two DataFrames withpairwise=False, where behavior is unchanged) (GH7542)
Bug inrolling_count() andexpanding_*() functions unnecessarily producing error message for zero-length data (GH8056)
Bug inrolling_apply() andexpanding_apply() interpretingmin_periods=0 asmin_periods=1 (GH8080)
Bug inexpanding_std() andexpanding_var() for a single value producing a confusing error message (GH7900)
Bug inrolling_std() androlling_var() for a single value producing0 rather thanNaN (GH7900)
Bug inewmstd(),ewmvol(),ewmvar(), andewmcov()calculation of de-biasing factors whenbias=False (the default).Previously an incorrect constant factor was used, based onadjust=True,ignore_na=True,and an infinite number of observations.Now a different factor is used for each entry, based on the actual weights(analogous to the usualN/(N-1) factor).In particular, for a single point a value ofNaN is returned whenbias=False,whereas previously a value of (approximately)0 was returned.
For example, consider the following pre-0.15.0 results forewmvar(...,bias=False),and the corresponding debiasing factors:
In [67]: s = Series([1., 2., 0., 4.])
In [89]: ewmvar(s, com=2., bias=False)
Out[89]:
0   -2.775558e-16
1    3.000000e-01
2    9.556787e-01
3    3.585799e+00
dtype: float64

In [90]: ewmvar(s, com=2., bias=False) / ewmvar(s, com=2., bias=True)
Out[90]:
0    1.25
1    1.25
2    1.25
3    1.25
dtype: float64
Note that entry0 is approximately 0, and the debiasing factors are a constant 1.25.By comparison, the following 0.15.0 results have aNaN for entry0,and the debiasing factors are decreasing (towards 1.25):
In [14]: pd.ewmvar(s, com=2., bias=False)
Out[14]:
0         NaN
1    0.500000
2    1.210526
3    4.089069
dtype: float64

In [15]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True)
Out[15]:
0         NaN
1    2.083333
2    1.583333
3    1.425439
dtype: float64
SeeExponentially weighted moment functions for details. (GH7912)
Added support for achunksize parameter toto_sql function. This allows DataFrame to be written in chunks and avoid packet-size overflow errors (GH8062).
Added support for achunksize parameter toread_sql function. Specifying this argument will return an iterator through chunks of the query result (GH2908).
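A combined sketch of both chunksize additions, using an illustrative in-memory SQLite database via SQLAlchemy:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'a': range(100000)})

# write in chunks of 10,000 rows to avoid packet-size limits
df.to_sql('data', engine, chunksize=10000)

# read back lazily, one chunk at a time
for chunk in pd.read_sql('SELECT * FROM data', engine, chunksize=10000):
    print(len(chunk))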
Added support for writingdatetime.date anddatetime.time object columns withto_sql (GH6932).
Added support for specifying aschema to read from/write to withread_sql_table andto_sql (GH7441,GH7952).For example:
df.to_sql('table', engine, schema='other_schema')
pd.read_sql_table('table', engine, schema='other_schema')
Added support for writingNaN values withto_sql (GH2754).
Added support for writing datetime64 columns withto_sql for all database flavors (GH7103).
API changes related toCategorical (seeherefor more details):
TheCategorical constructor with two arguments changed from“codes/labels and levels” to “values and levels (now called ‘categories’)”.This can lead to subtle bugs. If you useCategorical directly,please audit your code by changing it to use thefrom_codes()constructor.
An old function call like (prior to 0.15.0):
pd.Categorical([0, 1, 0, 2, 1], levels=['a', 'b', 'c'])
will have to be adapted to the following to keep the same behaviour:
In [2]: pd.Categorical.from_codes([0, 1, 0, 2, 1], categories=['a', 'b', 'c'])
Out[2]:
[a, b, a, c, b]
Categories (3, object): [a, b, c]
API changes related to the introduction of theTimedelta scalar (seeabove for more details):
to_timedelta() would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.
For API changes related to the rolling and expanding functions, see the detailed overview above.
Other notable API changes:
Consistency when indexing with.loc and a list-like indexer when no values are found.
In [68]: df = DataFrame([['a'], ['b']], index=[1, 2])
In [69]: df
Out[69]:
   0
1  a
2  b
In prior versions there was a difference in these two constructs:
df.loc[[3]] would return a frame reindexed by 3 (with all np.nan values)
df.loc[[3],:] would raise KeyError.
Both will now raise a KeyError. The rule is that at least 1 indexer must be found when using a list-like and .loc (GH7999)
Furthermore in prior versions these were also different:
df.loc[[1,3]] would return a frame reindexed by [1,3]
df.loc[[1,3],:] would raise KeyError.
Both will now return a frame reindexed by [1,3]. E.g.
In [70]: df.loc[[1, 3]]
Out[70]:
     0
1    a
3  NaN

In [71]: df.loc[[1, 3], :]
Out[71]:
     0
1    a
3  NaN
This can also be seen in multi-axis indexing with aPanel.
In [72]: p = Panel(np.arange(2 * 3 * 4).reshape(2, 3, 4),
   ....:           items=['ItemA', 'ItemB'],
   ....:           major_axis=[1, 2, 3],
   ....:           minor_axis=['A', 'B', 'C', 'D'])
   ....:
In [73]: p
Out[73]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemB
Major_axis axis: 1 to 3
Minor_axis axis: A to D
The following would raiseKeyError prior to 0.15.0:
In [74]: p.loc[['ItemA', 'ItemD'], :, 'D']
Out[74]:
   ItemA  ItemD
1      3    NaN
2      7    NaN
3     11    NaN
Furthermore, .loc will raise if no values are found in a multi-index with a list-like indexer:
In [75]:s=Series(np.arange(3,dtype='int64'), ....:index=MultiIndex.from_product([['A'],['foo','bar','baz']], ....:names=['one','two']) ....:).sortlevel() ....:In [76]:sOut[76]:one twoA bar 1 baz 2 foo 0dtype: int64In [77]:try: ....:s.loc[['D']] ....:exceptKeyErrorase: ....:print("KeyError: "+str(e)) ....:KeyError: 'cannot index a multi-index axis with these keys'
Assigning values toNone now considers the dtype when choosing an ‘empty’ value (GH7941).
Previously, assigning toNone in numeric containers changed thedtype to object (or errored, depending on the call). It now usesNaN:
In [78]: s = Series([1, 2, 3])
In [79]: s.loc[0] = None
In [80]: s
Out[80]:
0    NaN
1    2.0
2    3.0
dtype: float64
NaT is now used similarly for datetime containers.
For object containers, we now preserveNone values (previously thesewere converted toNaN values).
In [81]:s=Series(["a","b","c"])In [82]:s.loc[0]=NoneIn [83]:sOut[83]:0 None1 b2 cdtype: object
To insert aNaN, you must explicitly usenp.nan. See thedocs.
In prior versions, updating a pandas object inplace would not reflect in other python references to this object. (GH8511,GH5104)
In [84]: s = Series([1, 2, 3])
In [85]: s2 = s
In [86]: s += 1.5
Behavior prior to v0.15.0
# the original object
In [5]: s
Out[5]:
0    2.5
1    3.5
2    4.5
dtype: float64

# a reference to the original object
In [7]: s2
Out[7]:
0    1
1    2
2    3
dtype: int64
This is now the correct behavior
# the original object
In [87]: s
Out[87]:
0    2.5
1    3.5
2    4.5
dtype: float64

# a reference to the original object
In [88]: s2
Out[88]:
0    2.5
1    3.5
2    4.5
dtype: float64
Made both the C-based and Python engines forread_csv andread_table ignore empty lines in input as well aswhitespace-filled lines, as long assep is not whitespace. This is an API changethat can be controlled by the keyword parameterskip_blank_lines. Seethe docs (GH4466)
A timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezoneand inserted asobject dtype rather than being converted to a naivedatetime64[ns] (GH8411).
Bug in passing aDatetimeIndex with a timezone that was not being retained in DataFrame construction from a dict (GH7822)
In prior versions this would drop the timezone, now it retains the timezone,but gives a column ofobject dtype:
In [89]:i=date_range('1/1/2011',periods=3,freq='10s',tz='US/Eastern')In [90]:iOut[90]:DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-01 00:00:10-05:00', '2011-01-01 00:00:20-05:00'], dtype='datetime64[ns, US/Eastern]', freq='10S')In [91]:df=DataFrame({'a':i})In [92]:dfOut[92]: a0 2011-01-01 00:00:00-05:001 2011-01-01 00:00:10-05:002 2011-01-01 00:00:20-05:00In [93]:df.dtypesOut[93]:a datetime64[ns, US/Eastern]dtype: object
Previously this would have yielded a column ofdatetime64 dtype, but without timezone info.
The behaviour of assigning a column to an existing dataframe as df['a'] = i remains unchanged (this already returned an object column with a timezone).
When passing multiple levels tostack(), it will now raise aValueError when thelevels aren’t all level names or all level numbers (GH7660). SeeReshaping by stacking and unstacking.
Raise aValueError indf.to_hdf with ‘fixed’ format, ifdf has non-unique columns as the resulting file will be broken (GH7761)
SettingWithCopy raise/warnings (according to the optionmode.chained_assignment) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment. (GH7845,GH7950)
In [1]: df = DataFrame(np.arange(0, 9), columns=['count'])
In [2]: df['group'] = 'b'
In [3]: df.iloc[0:5]['group'] = 'a'
/usr/local/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer, col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
merge,DataFrame.merge, andordered_merge now return the same typeas theleft argument (GH7737).
Previously an enlargement with a mixed-dtype frame would act unlike.append which will preserve dtypes (relatedGH2578,GH8176):
In [94]: df = DataFrame([[True, 1], [False, 2]],
   ....:                columns=["female", "fitness"])
   ....:
In [95]: df
Out[95]:
  female  fitness
0   True        1
1  False        2

In [96]: df.dtypes
Out[96]:
female      bool
fitness    int64
dtype: object

# dtypes are now preserved
In [97]: df.loc[2] = df.loc[1]
In [98]: df
Out[98]:
  female  fitness
0   True        1
1  False        2
2  False        2

In [99]: df.dtypes
Out[99]:
female      bool
fitness    int64
dtype: object
Series.to_csv() now returns a string whenpath=None, matching the behaviour ofDataFrame.to_csv() (GH8215).
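A tiny sketch of the new return value:

import pandas as pd

s = pd.Series([1, 2, 3])

# with no path given, the CSV text is returned as a string
csv_text = s.to_csv(path=None)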
read_hdf now raisesIOError when a file that doesn’t exist is passed in. Previously, a new, empty file was created, and aKeyError raised (GH7715).
DataFrame.info() now ends its output with a newline character (GH8114)
Concatenating no objects will now raise aValueError rather than a bareException.
Merge errors will now be sub-classes ofValueError rather than rawException (GH8501)
DataFrame.plot and Series.plot keywords now have consistent orderings (GH8037)
In 0.15.0Index has internally been refactored to no longer sub-classndarraybut instead subclassPandasObject, similarly to the rest of the pandas objects. Thischange allows very easy sub-classing and creation of new index types. This should bea transparent change with only very limited API implications (GH5080,GH7439,GH7796,GH8024,GH8367,GH7997,GH8522):
pd.read_pickle rather than pickle.load. See pickle docs
PeriodIndex, the matplotlib internal axes will now be arrays of Period rather than a PeriodIndex (this is similar to how a DatetimeIndex passes arrays of datetimes now)
datetime64). UPDATE This is fixed in 0.15.1, see here.
Categorical labels and levels attributes are deprecated and renamed to codes and categories.
outtype argument to pd.DataFrame.to_dict has been deprecated in favor of orient. (GH7840)
convert_dummies method has been deprecated in favor of get_dummies (GH8140)
infer_dst argument in tz_localize will be deprecated in favor of ambiguous to allow for more flexibility in dealing with DST transitions. Replace infer_dst=True with ambiguous='infer' for the same behavior (GH7943). See the docs for more details.
pd.value_range has been deprecated and can be replaced by .describe() (GH8481)
The Index set operations + and - were deprecated in order to provide these for numeric type operations on certain index types. + can be replaced by .union() or |, and - by .difference(). Further the method name Index.diff() is deprecated and can be replaced by Index.difference() (GH8226)
# +
Index(['a', 'b', 'c']) + Index(['b', 'c', 'd'])

# should be replaced by
Index(['a', 'b', 'c']).union(Index(['b', 'c', 'd']))
# -
Index(['a', 'b', 'c']) - Index(['b', 'c', 'd'])

# should be replaced by
Index(['a', 'b', 'c']).difference(Index(['b', 'c', 'd']))
Theinfer_types argument toread_html() now has noeffect and is deprecated (GH7762,GH7032).
DataFrame.delevel method in favor of DataFrame.reset_index
Enhancements in the importing/exporting of Stata files:
to_stata (GH7097, GH7365)
DataFrame.to_stata and StataWriter check string length for compatibility with limitations imposed in dta files where fixed-width strings must contain 244 or fewer characters. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError. (GH7858)
read_stata and StataReader can import missing data information into a DataFrame by setting the argument convert_missing to True. When using this option, missing values are returned as StataMissingValue objects and columns containing missing values have object data type. (GH8045)
Enhancements in the plotting functions:
layout keyword to DataFrame.plot. You can pass a tuple of (rows, columns), one of which can be -1 to automatically infer (GH6667, GH8071).
DataFrame.plot, hist and boxplot (GH5353, GH6970, GH7069)
c, colormap and colorbar arguments for DataFrame.plot with kind='scatter' (GH7780)
DataFrame.plot with kind='hist' (GH7809), see the docs.
DataFrame.plot with kind='box' (GH7998), see the docs.
Other:
read_csv now has a keyword parameterfloat_precision which specifies which floating-point converter the C engine should use during parsing, seehere (GH8002,GH8044)
Addedsearchsorted method toSeries objects (GH7447)
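A small sketch of the new method:

import pandas as pd

s = pd.Series([1, 3, 5, 7])

# index positions at which 4 and 6 would be inserted to keep s sorted
s.searchsorted([4, 6])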
describe() on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via theinclude/exclude arguments.See thedocs (GH8164).
In [100]:df=DataFrame({'catA':['foo','foo','bar']*8, .....:'catB':['a','b','c','d']*6, .....:'numC':np.arange(24), .....:'numD':np.arange(24.)+.5}) .....:In [101]:df.describe(include=["object"])Out[101]: catA catBcount 24 24unique 2 4top foo dfreq 16 6In [102]:df.describe(include=["number","object"],exclude=["float"])Out[102]: catA catB numCcount 24 24 24.000000unique 2 4 NaNtop foo d NaNfreq 16 6 NaNmean NaN NaN 11.500000std NaN NaN 7.071068min NaN NaN 0.00000025% NaN NaN 5.75000050% NaN NaN 11.50000075% NaN NaN 17.250000max NaN NaN 23.000000
Requesting all columns is possible with the shorthand ‘all’
In [103]:df.describe(include='all')Out[103]: catA catB numC numDcount 24 24 24.000000 24.000000unique 2 4 NaN NaNtop foo d NaN NaNfreq 16 6 NaN NaNmean NaN NaN 11.500000 12.000000std NaN NaN 7.071068 7.071068min NaN NaN 0.000000 0.50000025% NaN NaN 5.750000 6.25000050% NaN NaN 11.500000 12.00000075% NaN NaN 17.250000 17.750000max NaN NaN 23.000000 23.500000
Without those arguments, describe will behave as before, including only numerical columns or, if none are, only categorical columns. See also the docs
Addedsplit as an option to theorient argument inpd.DataFrame.to_dict. (GH7840)
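A short sketch of the new orient value:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 'split' separates the index, columns and data
df.to_dict(orient='split')
# {'index': [0, 1], 'columns': ['a', 'b'], 'data': [[1, 3], [2, 4]]}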
The get_dummies method can now be used on DataFrames. By default only categorical columns are encoded as 0's and 1's, while other columns are left untouched.
In [104]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   .....:                 'C': [1, 2, 3]})
   .....:
In [105]: pd.get_dummies(df)
Out[105]:
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0
PeriodIndex supportsresolution as the same asDatetimeIndex (GH7708)
pandas.tseries.holiday has added support for additional holidays and ways to observe holidays (GH7070)
pandas.tseries.holiday.Holiday now supports a list of offsets in Python3 (GH7070)
pandas.tseries.holiday.Holiday now supports a days_of_week parameter (GH7070)
GroupBy.nth() now supports selecting multiple nth values (GH7910)
In [106]: business_dates = date_range(start='4/1/2014', end='6/30/2014', freq='B')
In [107]: df = DataFrame(1, index=business_dates, columns=['a', 'b'])

# get the first, 4th, and last date index for each month
In [108]: df.groupby((df.index.year, df.index.month)).nth([0, 3, -1])
Out[108]:
        a  b
2014 4  1  1
     4  1  1
     4  1  1
     5  1  1
     5  1  1
     5  1  1
     6  1  1
     6  1  1
     6  1  1
Period andPeriodIndex supports addition/subtraction withtimedelta-likes (GH7966)
If Period freq is D, H, T, S, L, U, N, a Timedelta-like can be added if the result can have the same freq. Otherwise, only the same offsets can be added.
In [109]:idx=pd.period_range('2014-07-01 09:00',periods=5,freq='H')In [110]:idxOut[110]:PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00'], dtype='period[H]', freq='H')In [111]:idx+pd.offsets.Hour(2)Out[111]:PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00', '2014-07-01 14:00', '2014-07-01 15:00'], dtype='period[H]', freq='H')In [112]:idx+Timedelta('120m')Out[112]:PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00', '2014-07-01 14:00', '2014-07-01 15:00'], dtype='period[H]', freq='H')In [113]:idx=pd.period_range('2014-07',periods=5,freq='M')In [114]:idxOut[114]:PeriodIndex(['2014-07','2014-08','2014-09','2014-10','2014-11'],dtype='period[M]',freq='M')In [115]:idx+pd.offsets.MonthEnd(3)Out[115]:PeriodIndex(['2014-10','2014-11','2014-12','2015-01','2015-02'],dtype='period[M]',freq='M')
Added experimental compatibility with openpyxl for versions >= 2.0. The DataFrame.to_excel method engine keyword now recognizes openpyxl1 and openpyxl2, which will explicitly require openpyxl v1 and v2 respectively, failing if the requested version is not available. The openpyxl engine is now a meta-engine that automatically uses whichever version of openpyxl is installed. (GH7177)
DataFrame.fillna can now accept aDataFrame as a fill value (GH8377)
Passing multiple levels tostack() will now work when multiple levelnumbers are passed (GH7660). SeeReshaping by stacking and unstacking.
set_names(), set_labels(), and set_levels() methods now take an optional level keyword argument to allow modification of specific level(s) of a MultiIndex. Additionally set_names() now accepts a scalar string value when operating on an Index or on a specific level of a MultiIndex (GH7792)
In [116]:idx=MultiIndex.from_product([['a'],range(3),list("pqr")],names=['foo','bar','baz'])In [117]:idx.set_names('qux',level=0)Out[117]:MultiIndex(levels=[[u'a'], [0, 1, 2], [u'p', u'q', u'r']], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]], names=[u'qux', u'bar', u'baz'])In [118]:idx.set_names(['qux','baz'],level=[0,1])Out[118]:MultiIndex(levels=[[u'a'], [0, 1, 2], [u'p', u'q', u'r']], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]], names=[u'qux', u'baz', u'baz'])In [119]:idx.set_levels(['a','b','c'],level='bar')Out[119]:MultiIndex(levels=[[u'a'], [u'a', u'b', u'c'], [u'p', u'q', u'r']], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]], names=[u'foo', u'bar', u'baz'])In [120]:idx.set_levels([['a','b','c'],[1,2,3]],level=[1,2])Out[120]:MultiIndex(levels=[[u'a'], [u'a', u'b', u'c'], [1, 2, 3]], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]], names=[u'foo', u'bar', u'baz'])
Index.isin now supports alevel argument to specify which index levelto use for membership tests (GH7892,GH7890)
In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])
In [2]: idx.values
Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object)

In [3]: idx.isin(['a', 'c', 'e'], level=1)
Out[3]: array([ True, False,  True,  True, False,  True], dtype=bool)
Index now supportsduplicated anddrop_duplicates. (GH4060)
In [121]: idx = Index([1, 2, 3, 4, 1, 2])
In [122]: idx
Out[122]: Int64Index([1, 2, 3, 4, 1, 2], dtype='int64')

In [123]: idx.duplicated()
Out[123]: array([False, False, False, False,  True,  True], dtype=bool)

In [124]: idx.drop_duplicates()
Out[124]: Int64Index([1, 2, 3, 4], dtype='int64')
add copy=True argument to pd.concat to enable pass-through of complete blocks (GH8252)
Added support for numpy 1.8+ data types (bool_,int_,float_,string_) for conversion to R dataframe (GH8400)
DatetimeIndex.__iter__ to allow faster iteration (GH7683)Period creation (andPeriodIndex setitem) (GH5155)StataReader when reading large files (GH8040,GH8073)StataWriter when writing large files (GH8079)groupby (GH8128).agg and.apply where builtins max/min were not mapped to numpy/cythonized versions (GH7722)to_sql) of up to 50% (GH8208).CustomBusinessDay,CustomBusinessMonth (GH8236)MultiIndex.values for multi-level indexes containing datetimes (GH8543)read_csv wheresqueeze=True would return a view (GH8217)read_sql in certain cases (GH7826).DataFrame.groupby whereGrouper does not recognize level when frequency is specified (GH7885)Series 0-division with a float and integer operand dtypes (GH7785)Series.astype("unicode") not callingunicode on the values correctly (GH7758)DataFrame.as_matrix() with mixeddatetime64[ns] andtimedelta64[ns] dtypes (GH7778)HDFStore.select_column() not preserving UTC timezone info when selecting aDatetimeIndex (GH7777)to_datetime whenformat='%Y%m%d' andcoerce=True are specified, where previously an object array was returned (rather thana coerced time-series withNaT), (GH7930)DatetimeIndex andPeriodIndex in-place addition and subtraction cause different result from normal one (GH6527)PeriodIndex withPeriodIndex raiseTypeError (GH7741)combine_first withPeriodIndex data raisesTypeError (GH3367)Timestamp comparisons with== andint64 dtype (GH8058)DateOffset may raiseAttributeError whennormalize attribute is reffered internally (GH7748)Panel when usingmajor_xs andcopy=False is passed (deprecation warning fails because of missingwarnings) (GH8152).PeriodIndex into aSeries would convert toint64 dtype, rather thanobject ofPeriods (GH7932)HDFStore iteration when passing a where (GH8014)DataFrameGroupby.transform when transforming with a passed non-sorted key (GH8046,GH8430)ValueError or incorrect kind (GH7733)MultiIndex withdatetime.date inputs (GH7888)get where anIndexError would not cause the default value to be returned (GH7725)offsets.apply,rollforward androllback may reset nanosecond (GH7697)offsets.apply,rollforward androllback may raiseAttributeError ifTimestamp hasdateutil tzinfo (GH7697)Float64Index (GH8017)DataFrame for alignment (GH7763)is_superperiod andis_subperiod cannot handle higher frequencies thanS (GH7760,GH7772,GH7803)Series.shift (GH8129)PeriodIndex.unique returns int64np.ndarray (GH7540)groupby.apply with a non-affecting mutation in the function (GH8467)DataFrame.reset_index which hasMultiIndex containsPeriodIndex orDatetimeIndex with tz raisesValueError (GH7746,GH7793)DataFrame.plot withsubplots=True may draw unnecessary minor xticks and yticks (GH7801)StataReader which did not read variable labels in 117 files due to difference between Stata documentation and implementation (GH7816)StataReader where strings were always converted to 244 characters-fixed width irrespective of underlying string size (GH7858)DataFrame.plot andSeries.plot may ignorerot andfontsize keywords (GH7844)DatetimeIndex.value_counts doesn’t preserve tz (GH7735)PeriodIndex.value_counts results inInt64Index (GH7735)DataFrame.join when doing left join on index and there are multiple matches (GH5391)GroupBy.transform() where int groups with a transform thatdidn’t preserve the index were incorrectly truncated (GH7972).groupby where callable objects without name attributes would take the wrong path,and produce aDataFrame instead of aSeries (GH7929)groupby error message when a DataFrame grouping column is duplicated (GH7511)read_html where theinfer_types argument forced coercion 
ofdate-likes incorrectly (GH7762,GH7032).Series.str.cat with an index which was filtered as to not include the first item (GH7857)Timestamp cannot parsenanosecond from string (GH7878)Timestamp with string offset andtz results incorrect (GH7833)tslib.tz_convert andtslib.tz_convert_single may return different results (GH7798)DatetimeIndex.intersection of non-overlapping timestamps with tz raisesIndexError (GH7880)GroupBy.filter() where fast path vs. slow path made the filterreturn a non scalar value that appeared valid but wasn’t (GH7870).date_range()/DatetimeIndex() when the timezone was inferred from input dates yet incorrecttimes were returned when crossing DST boundaries (GH7835,GH7901).to_excel() where a negative sign was being prepended to positive infinity and was absent for negative infinity (GH7949)alpha whenstacked=True (GH8027)Period andPeriodIndex addition/subtraction withnp.timedelta64 results in incorrect internal representations (GH7740)Holiday with no offset or observance (GH7987)DataFrame.to_latex formatting when columns or index is aMultiIndex (GH7982).DateOffset around Daylight Savings Time produces unexpected results (GH5175).DataFrame.shift where empty columns would throwZeroDivisionError on numpy 1.7 (GH8019)html_encoding/*.html wasn’t installed andtherefore some tests were not running correctly (GH7927).read_html wherebytes objects were not tested for in_read (GH7927).DataFrame.stack() when one of the column levels was a datelike (GH8039)DataFrame (GH8116)pivot_table performed with namelessindex andcolumns raisesKeyError (GH8103)DataFrame.plot(kind='scatter') draws points and errorbars with different colors when the color is specified byc keyword (GH8081)Float64Index whereiat andat were not testing and werefailing (GH8092).DataFrame.boxplot() where y-limits were not set correctly whenproducing multiple axes (GH7528,GH5517).read_csv where line comments were not handled correctly givena custom line terminator ordelim_whitespace=True (GH8122).read_html where empty tables caused aStopIteration (GH7575)GroupBy when the original grouperwas a tuple (GH8121)..at that would accept integer indexers on a non-integer index and do fallback (GH7814)GroupBy.count with float32 data type were nan values were not excluded (GH8169).limit keyword when no values needed interpolating (GH7173).col_space was ignored inDataFrame.to_string() whenheader=False (GH8230).DatetimeIndex.asof incorrectly matching partial strings and returning the wrong date (GH8245).DataFrame.__setitem__ that caused errors when setting a dataframe column to a sparse array (GH8131)Dataframe.boxplot() failed when entire column was empty (GH8181).radviz visualization (GH8199).limit keyword when no values needed interpolating (GH7173).col_space was ignored inDataFrame.to_string() whenheader=False (GH8230).to_clipboard that would clip long column data (GH8305)DataFrame terminal display: Setting max_column/max_rows to zero did not trigger auto-resizing of dfs to fit terminal width/height (GH7180).DataFrame.dropna that interpreted non-existent columns in the subset argument as the ‘last column’ (GH8303)Index.intersection on non-monotonic non-unique indexes (GH8362).NDFrame.equals gives false negatives with dtype=object (GH8437)NDFrame.loc indexing when row/column names were lost when target was a list/ndarray (GH6552)NDFrame.loc indexing when rows/columns were converted to Float64Index if target was an empty list/ndarray (GH7774)Series that allows it to be indexed by aDataFrame which has unexpected results. 
Such indexing is no longer permitted (GH8444)DataFrame with multi-index columns where right-hand-side columns were not aligned (GH7655)DataFrame.eval() where the dtype of thenot operator (~)was not correctly inferred asbool.This is a minor release from 0.14.0 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
select_dtypes() to select columns based on the dtype and sem() to calculate the standard error of the mean.
read_csv() text parser.
Openpyxl now raises a ValueError on construction of the openpyxl writer instead of warning on pandas import (GH7284).
ForStringMethods.extract, when no match is found, the result - onlycontainingNaN values - now also hasdtype=object instead offloat (GH7242)
Period objects no longer raise a TypeError when compared using == with another object that isn't a Period. Instead, when comparing a Period with another object using ==, False is returned if the other object isn't a Period. (GH7376)
Previously, the behaviour on resetting the time or not inoffsets.apply,rollforward androllback operations differedbetween offsets. With the support of thenormalize keyword for all offsets(seebelow) with a default value of False (preserve time), the behaviour changed for certainoffsets (BusinessMonthBegin, MonthEnd, BusinessMonthEnd, CustomBusinessMonthEnd,BusinessYearBegin, LastWeekOfMonth, FY5253Quarter, LastWeekOfMonth, Easter):
In [6]: from pandas.tseries import offsets
In [7]: d = pd.Timestamp('2014-01-01 09:00')

# old behaviour < 0.14.1
In [8]: d + offsets.MonthEnd()
Out[8]: Timestamp('2014-01-31 00:00:00')
Starting from 0.14.1 all offsets preserve time by default. The oldbehaviour can be obtained withnormalize=True
# new behaviour
In [1]: d + offsets.MonthEnd()
Out[1]: Timestamp('2014-01-31 09:00:00')

In [2]: d + offsets.MonthEnd(normalize=True)
Out[2]: Timestamp('2014-01-31 00:00:00')
Note that for the other offsets the default behaviour did not change.
Add back #N/A N/A as a default NA value in text parsing (regression from 0.12) (GH5521)
Raise a TypeError on inplace-setting with a .where and a non-np.nan value, as this is inconsistent with a set-item expression like df[mask] = None (GH7656)
Add dropna argument to value_counts and nunique (GH5569).
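As a quick illustration (a minimal sketch with made-up data, not from the release notes):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, np.nan])
s.value_counts()              # NaN is excluded by default
s.value_counts(dropna=False)  # include the count of NaN as well
s.nunique(dropna=False)       # counts NaN as a distinct value -> 3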
Add select_dtypes() method to allow selection of columns based on dtype (GH7316). See the docs.
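A minimal sketch of select_dtypes (the frame and dtypes below are invented for illustration; the exact accepted dtype strings may vary slightly between versions):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0], 'c': ['x', 'y']})
df.select_dtypes(include=['number'])   # numeric columns 'a' and 'b'
df.select_dtypes(include=['object'])   # column 'c'
df.select_dtypes(exclude=['float64'])  # everything except 'b'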
All offsets support the normalize keyword to specify whether offsets.apply, rollforward and rollback reset the time (hour, minute, etc.) or not (default False, preserves time) (GH7156):

In [3]: import pandas.tseries.offsets as offsets
In [4]: day = offsets.Day()
In [5]: day.apply(Timestamp('2014-01-01 09:00'))
Out[5]: Timestamp('2014-01-02 09:00:00')

In [6]: day = offsets.Day(normalize=True)
In [7]: day.apply(Timestamp('2014-01-01 09:00'))
Out[7]: Timestamp('2014-01-02 00:00:00')
PeriodIndex is represented as the same format asDatetimeIndex (GH7601)
StringMethods now work on empty Series (GH7242)
The file parsers read_csv and read_table now ignore line comments provided by the parameter comment, which accepts only a single character for the C reader. In particular, they allow for comments before file data begins (GH2685).
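For example, a minimal sketch using an inline CSV string (the data is made up):

import pandas as pd
from io import StringIO

data = "# a comment before the data begins\na,b\n1,2\n# a full-line comment inside the data\n3,4\n"
pd.read_csv(StringIO(data), comment='#')  # the commented lines are ignored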
AddNotImplementedError for simultaneous use ofchunksize andnrowsfor read_csv() (GH6774).
Tests for basic reading of public S3 buckets now exist (GH7281).
read_html now sports anencoding argument that is passed to theunderlying parser library. You can use this to read non-ascii encoded webpages (GH7323).
read_excel now supports reading from URLs in the same waythatread_csv does. (GH6809)
Support for dateutil timezones, which can now be used in the same way aspytz timezones across pandas. (GH4688)
In [8]:rng=date_range('3/6/2012 00:00',periods=10,freq='D', ...:tz='dateutil/Europe/London') ...:In [9]:rng.tzOut[9]:tzfile('/usr/share/zoneinfo/Europe/London')
Seethe docs.
Implementedsem (standard error of the mean) operation forSeries,DataFrame,Panel, andGroupby (GH6897)
Add nlargest and nsmallest to the Series groupby whitelist, which means you can now use these methods on a SeriesGroupBy object (GH7053).
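A minimal sketch (made-up data):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                   'val': [1, 5, 3, 2, 4]})
df.groupby('key')['val'].nlargest(2)   # two largest values per group
df.groupby('key')['val'].nsmallest(1)  # smallest value per group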
All offsets' apply, rollforward and rollback can now handle np.datetime64; previously this resulted in an ApplyTypeError (GH7452)
Period and PeriodIndex can contain NaT in their values (GH7485)
Support pickling Series, DataFrame and Panel objects with non-unique labels along the item axis (index, columns and items respectively) (GH7370).
Improved inference of datetime/timedelta with mixed null objects. Regression from 0.13.1 in interpretation of an object Index with all null elements (GH7431)
int64,timedelta64,datetime64 (GH7223)pandas.io.data.Options has a new method,get_all_data method, and now consistently returns amulti-indexedDataFrame (GH5602)io.gbq.read_gbq andio.gbq.to_gbq were refactored to remove thedependency on the Googlebq.py command line client. This submodulenow useshttplib2 and the Googleapiclient andoauth2client API clientlibraries which should be more stable and, therefore, reliable thanbq.py. Seethe docs. (GH6937).DataFrame.where with a symmetric shaped frame and a passed other of a DataFrame (GH7506).nth with a Series and integer-like column name (GH7559)Series.get with a boolean accessor (GH7407)value_counts whereNaT did not qualify as missing (NaN) (GH7423)to_timedelta that accepted invalid units and misinterpreted ‘m/h’ (GH7611,GH6423)xlim ifsecondary_y=True (GH7459)hist andscatter plots use oldfigsize default (GH7394)DataFrame.plot,hist clears passedax even if the number of subplots is one (GH7391).DataFrame.boxplot withby kw raisesValueError if the number of subplots exceeds 1 (GH7391).ticklabels andlabels in different rule (GH5897)Panel.apply with a multi-index as an axis (GH7469)DatetimeIndex.insert doesn’t preservename andtz (GH7299)DatetimeIndex.asobject doesn’t preservename (GH7299)Index.min andmax doesn’t handlenan andNaT properly (GH7261)PeriodIndex.min/max results inint (GH7609)resample wherefill_method was ignored if you passedhow (GH2073)TimeGrouper doesn’t exclude column specified bykey (GH7227)DataFrame andSeries bar and barh plot raisesTypeError whenbottomandleft keyword is specified (GH7226)DataFrame.hist raisesTypeError when it contains non numeric column (GH7277)Index.delete does not preservename andfreq attributes (GH7302)DataFrame.query()/eval where local string variables with the @sign were being treated as temporaries attempting to be deleted(GH7300).Float64Index which didn’t allow duplicates (GH7149).DataFrame.replace() where truthy values were being replaced(GH7140).StringMethods.extract() where a single match group Serieswould use the matcher’s name instead of the group name (GH7313).isnull() whenmode.use_inf_as_null==True where isnullwouldn’t testTrue when it encountered aninf/-inf(GH7315).Easter returns incorrect date when offset is negative (GH7195).div, integer dtypes and divide-by-zero (GH7325)CustomBusinessDay.apply raiasesNameError whennp.datetime64 object is passed (GH7196)MultiIndex.append,concat andpivot_table don’t preserve timezone (GH6606).loc with a list of indexers on a single-multi index level (that is not nested) (GH7349)Series.map when mapping a dict with tuple keys of different lengths (GH7333)StringMethods now work on empty Series (GH7242)DataFrame with aFloat64Index raised aTypeError during a call tonp.isnan(GH7366).NDFrame.replace() didn’t correctly replace objects withPeriod values (GH7379)..ix getitem should always return a Series (GH7150)DatetimeIndex were not correctly sliced(GH7408)NaT wasn’t repr’d correctly in aMultiIndex (GH7406,GH7409).nan inconvert_objects(GH7416).quantile ignoring the axis keyword argument (:issue`7306`)nanops._maybe_null_out doesn’t work with complex numbers(GH7353)nanops functions whenaxis==0 for1-dimensionalnan arrays (GH7354)nanops.nanmedian doesn’t work whenaxis==None(GH7352)nanops._has_infs doesn’t work with many dtypes(GH7357)StataReader.data where reading a 0-observation dta failed (GH7369)StataReader when reading Stata 13 (117) files containing fixed width strings (GH7360)StataWriter where encoding was ignored (GH7286)DatetimeIndex comparison doesn’t handleNaT properly 
(GH7529)tzinfo to some offsetsapply,rollforward orrollback resetstzinfo or raisesValueError (GH7465)DatetimeIndex.to_period,PeriodIndex.asobject,PeriodIndex.to_timestamp doesn’t preservename (GH7485)DatetimeIndex.to_period andPeriodIndex.to_timestanp handleNaT incorrectly (GH7228)offsets.apply,rollforward androllback may return normaldatetime (GH7502)resample raisesValueError when target containsNaT (GH7227)Timestamp.tz_localize resetsnanosecond info (GH7534)DatetimeIndex.asobject raisesValueError when it containsNaT (GH7539)Timestamp.__new__ doesn’t preserve nanosecond properly (GH7610)Index.astype(float) where it would return anobject dtypeIndex (GH7464).DataFrame.reset_index losestz (GH3950)DatetimeIndex.freqstr raisesAttributeError whenfreq isNone (GH7606)GroupBy.size created byTimeGrouper raisesAttributeError (GH7453)ValueError (GH7471)Index.union may preservename incorrectly (GH7458)DatetimeIndex.intersection doesn’t preserve timezone (GH4690)rolling_var where a window larger than the array would raise an error(GH7297)xlim (GH2960)secondary_y axis not being considered for timeseriesxlim (GH3490)Float64Index assignment with a non scalar indexer (GH7586)pandas.core.strings.str_contains does not properly match in a case insensitive fashion whenregex=False andcase=False (GH7505)expanding_cov,expanding_corr,rolling_cov, androlling_corr for two arguments with mismatched index (GH7512)to_sql taking the boolean column as text column (GH7678).loc performing fallback integer indexing withobject dtype indices (GH7496)PeriodIndex constructor when passedSeries objects (GH7701).This is a major release from 0.13.1 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
SQL reading and writing via sqlalchemy, see Here. Holiday calendars with CustomBusinessDay, see Here.

Warning
In 0.14.0 allNDFrame based containers have undergone significant internal refactoring. Before that each block ofhomogeneous data had its own labels and extra care was necessary to keep those in sync with the parent container’s labels.This should not have any visible user/API behavior changes (GH6745)
read_excel uses 0 as the default sheet (GH6573)
iloc will now accept out-of-bounds indexers for slices, e.g. a value that exceeds the length of the object beingindexed. These will be excluded. This will make pandas conform more with python/numpy indexing of out-of-boundsvalues. A single indexer that is out-of-bounds and drops the dimensions of the object will still raiseIndexError (GH6296,GH6299). This could result in an empty axis (e.g. an empty DataFrame being returned)
In [1]:dfl=DataFrame(np.random.randn(5,2),columns=list('AB'))In [2]:dflOut[2]: A B0 1.583584 -0.4383131 -0.402537 -0.7805722 -0.141685 0.5422413 0.370966 -0.2516424 0.787484 1.666563In [3]:dfl.iloc[:,2:3]Out[3]:Empty DataFrameColumns: []Index: [0, 1, 2, 3, 4]In [4]:dfl.iloc[:,1:3]Out[4]: B0 -0.4383131 -0.7805722 0.5422413 -0.2516424 1.666563In [5]:dfl.iloc[4:6]Out[5]: A B4 0.787484 1.666563
These are out-of-bounds selections
dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds

dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
Slicing with negative start, stop & step values handles corner cases better (GH6531):
df.iloc[:-len(df)] is now empty; df.iloc[len(df)::-1] now enumerates all elements in reverse.
The DataFrame.interpolate() keyword downcast default has been changed from infer to None. This is to preserve the original dtype unless explicitly requested otherwise (GH6290).
When converting a dataframe to HTML it used to return Empty DataFrame. This special case has been removed; instead a header with the column names is returned (GH6062).
Series and Index now internally share more common operations, e.g. factorize(), nunique(), value_counts() are now supported on Index types as well. The Series.weekday property is removed from Series for API consistency. Using a DatetimeIndex/PeriodIndex method on a Series will now raise a TypeError. (GH4551, GH4056, GH5519, GH6380, GH7206).
Add is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end accessors for DatetimeIndex / Timestamp which return a boolean array of whether the timestamp(s) are at the start/end of the month/quarter/year defined by the frequency of the DatetimeIndex / Timestamp (GH4565, GH6998)
Local variable usage has changed in pandas.eval()/DataFrame.eval()/DataFrame.query() (GH5987). For the DataFrame methods, two things have changed:
Local variables must be explicitly referenced using the '@' prefix. This allows you to write df.query('@a < a') with no complaints from pandas about ambiguity of the name a.
The top-level pandas.eval() function does not allow you to use the '@' prefix and provides you with an error message telling you so. NameResolutionError was removed because it isn't necessary anymore.
Define and document the order of column vs index names in query/eval (GH6676)
concat will now concatenate mixed Series and DataFrames using the Series name or numbering columns as needed (GH2385). See the docs.
Slicing and advanced/boolean indexing operations on Index classes as well as the Index.delete() and Index.drop() methods will no longer change the type of the resulting index (GH6440, GH7040)

In [6]: i = pd.Index([1, 2, 3, 'a', 'b', 'c'])
In [7]: i[[0, 1, 2]]
Out[7]: Index([1, 2, 3], dtype='object')
In [8]: i.drop(['a', 'b', 'c'])
Out[8]: Index([1, 2, 3], dtype='object')

Previously, the above operation would return Int64Index. If you'd like to do this manually, use Index.astype()

In [9]: i[[0, 1, 2]].astype(np.int_)
Out[9]: Int64Index([1, 2, 3], dtype='int64')
set_index no longer converts MultiIndexes to an Index of tuples. For example,the old behavior returned an Index in this case (GH6459):
# Old behavior, casted MultiIndex to an IndexIn [10]:tuple_indOut[10]:Index([(u'a',u'c'),(u'a',u'd'),(u'b',u'c'),(u'b',u'd')],dtype='object')In [11]:df_multi.set_index(tuple_ind)Out[11]: 0 1(a, c) 0.471435 -1.190976(a, d) 1.432707 -0.312652(b, c) -0.720589 0.887163(b, d) 0.859588 -0.636524# New behaviorIn [12]:miOut[12]:MultiIndex(levels=[[u'a', u'b'], [u'c', u'd']], labels=[[0, 0, 1, 1], [0, 1, 0, 1]])In [13]:df_multi.set_index(mi)Out[13]: 0 1a c 0.471435 -1.190976 d 1.432707 -0.312652b c -0.720589 0.887163 d 0.859588 -0.636524
This also applies when passing multiple indices toset_index:
# Old output, 2-level MultiIndex of tuplesIn [14]:df_multi.set_index([df_multi.index,df_multi.index])Out[14]: 0 1(a, c) (a, c) 0.471435 -1.190976(a, d) (a, d) 1.432707 -0.312652(b, c) (b, c) -0.720589 0.887163(b, d) (b, d) 0.859588 -0.636524# New output, 4-level MultiIndexIn [15]:df_multi.set_index([df_multi.index,df_multi.index])Out[15]: 0 1a c a c 0.471435 -1.190976 d a d 1.432707 -0.312652b c b c -0.720589 0.887163 d b d 0.859588 -0.636524
pairwise keyword was added to the statistical moment functionsrolling_cov,rolling_corr,ewmcov,ewmcorr,expanding_cov,expanding_corr to allow the calculation of movingwindow covariance and correlation matrices (GH4950). SeeComputing rolling pairwise covariances and correlations in the docs.
In [1]:df=DataFrame(np.random.randn(10,4),columns=list('ABCD'))In [4]:covs=pd.rolling_cov(df[['A','B','C']],df[['B','C','D']],5,pairwise=True)In [5]:covs[df.index[-1]]Out[5]: B C DA 0.035310 0.326593 -0.505430B 0.137748 -0.006888 -0.005383C -0.006888 0.861040 0.020762
Series.iteritems() is now lazy (returns an iterator rather than a list). This was the documented behavior prior to 0.14. (GH6760)
Added nunique and value_counts functions to Index for counting unique elements. (GH6734)
stack and unstack now raise a ValueError when the level keyword refers to a non-unique item in the Index (previously raised a KeyError). (GH6738)
drop unused order argument from Series.sort; args are now in the same order as Series.order; add na_position arg to conform to Series.order (GH6847)
default sorting algorithm for Series.order is now quicksort, to conform with Series.sort (and numpy defaults)
add inplace keyword to Series.order/sort to make them inverses (GH6859)
DataFrame.sort now places NaNs at the beginning or end of the sort according to the na_position parameter. (GH3917)
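A minimal sketch of na_position (note: in the 0.14 API the relevant methods were Series.order / DataFrame.sort; the sketch below uses sort_values, the equivalent in later pandas):

import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0])
s.sort_values(na_position='first')  # NaN placed at the beginning
s.sort_values(na_position='last')   # NaN placed at the end (the default)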
accept TextFileReader in concat, which was affecting a common user idiom (GH6583); this was a regression from 0.13.1
Added factorize functions to Index and Series to get indexer and unique values (GH7090)
describe on a DataFrame with a mix of Timestamp and string-like objects returns a different Index (GH7088). Previously the index was unintentionally sorted.
Arithmetic operations with only bool dtypes now give a warning indicating that they are evaluated in Python space for +, -, and * operations and raise for all others (GH7011, GH6762, GH7015, GH7210)

x = pd.Series(np.random.rand(10) > 0.5)
y = True
x + y  # warning generated: should do x | y instead
x / y  # this raises because it doesn't make sense
NotImplementedError: operator '/' not implemented for bool dtypes
In HDFStore, select_as_multiple will always raise a KeyError when a key or the selector is not found (GH6177)
df['col'] = value and df.loc[:, 'col'] = value are now completely equivalent; previously the .loc would not necessarily coerce the dtype of the resultant series (GH6149)
dtypes and ftypes now return a series with dtype=object on empty containers (GH5740)
df.to_csv will now return a string of the CSV data if neither a target path nor a buffer is provided (GH6061)
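A minimal sketch of the new to_csv behaviour (made-up frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
csv_text = df.to_csv()  # no path or buffer given, so the CSV is returned as a string
print(csv_text)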
pd.infer_freq() will now raise a TypeError if given an invalid Series/Index type (GH6407, GH6463)
A tuple passed to DataFrame.sort_index will be interpreted as the levels of the index, rather than requiring a list of tuples (GH4370)
All offset operations now return Timestamp types (rather than datetime); Business/Week frequencies were incorrect (GH4069)
to_excel now converts np.inf into a string representation, customizable by the inf_rep keyword argument (Excel has no native inf representation) (GH6782)
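A minimal sketch of inf_rep (the output filename is arbitrary and an Excel writer engine such as openpyxl or xlsxwriter must be installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.inf, -np.inf]})
df.to_excel('inf_demo.xlsx', inf_rep='INF')  # infinities are written as the string 'INF'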
Replacepandas.compat.scipy.scoreatpercentile withnumpy.percentile (GH6810)
.quantile on adatetime[ns] series now returnsTimestamp insteadofnp.datetime64 objects (GH6810)
changeAssertionError toTypeError for invalid types passed toconcat (GH6583)
Raise aTypeError whenDataFrame is passed an iterator as thedata argument (GH5357)
The default way of printing large DataFrames has changed. DataFrames exceeding max_rows and/or max_columns are now displayed in a centrally truncated view, consistent with the printing of a pandas.Series (GH5603).
In previous versions, a DataFrame was truncated once the dimension constraints were reached and an ellipsis (...) signaled that part of the data was cut off.

In the current version, large DataFrames are centrally truncated,showing a preview of head and tail in both dimensions.

allow option'truncate' fordisplay.show_dimensions to only show the dimensions if theframe is truncated (GH6547).
The default fordisplay.show_dimensions will now betruncate. This is consistent withhow Series display length.
In [16]:dfd=pd.DataFrame(np.arange(25).reshape(-1,5),index=[0,1,2,3,4],columns=[0,1,2,3,4])# show dimensions since this is truncatedIn [17]:withpd.option_context('display.max_rows',2,'display.max_columns',2, ....:'display.show_dimensions','truncate'): ....:print(dfd) ....: 0 ... 40 0 ... 4.. .. ... ..4 20 ... 24[5 rows x 5 columns]# will not show dimensions since it is not truncatedIn [18]:withpd.option_context('display.max_rows',10,'display.max_columns',40, ....:'display.show_dimensions','truncate'): ....:print(dfd) ....: 0 1 2 3 40 0 1 2 3 41 5 6 7 8 92 10 11 12 13 143 15 16 17 18 194 20 21 22 23 24
Regression in the display of a MultiIndexed Series when display.max_rows is less than the length of the series (GH7101)
Fixed a bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the large_repr set to 'info' (GH7105)
The verbose keyword in DataFrame.info(), which controls whether to shorten the info representation, is now None by default. This will follow the global setting in display.max_info_columns. The global setting can be overridden with verbose=True or verbose=False.
Fixed a bug with the info repr not honoring the display.max_info_columns setting (GH6939)
Offset/freq info now in Timestamp __repr__ (GH4553)
read_csv()/read_table() will now be noisier w.r.t. invalid options rather than falling back to the PythonParser.
Raise ValueError when sep is specified with delim_whitespace=True in read_csv()/read_table() (GH6607)
Raise ValueError when engine='c' is specified with unsupported options in read_csv()/read_table() (GH6607)
Raise ValueError when fallback to the python parser causes options to be ignored (GH6607)
Produce a ParserWarning on fallback to the python parser when no options are ignored (GH6607)
Translate sep='\s+' to delim_whitespace=True in read_csv()/read_table() if no other C-unsupported options are specified (GH6607)
More consistent behaviour for some groupby methods:
groupbyhead andtail now act more likefilter rather than an aggregation:
In [19]:df=pd.DataFrame([[1,2],[1,4],[5,6]],columns=['A','B'])In [20]:g=df.groupby('A')In [21]:g.head(1)# filters DataFrameOut[21]: A B0 1 22 5 6In [22]:g.apply(lambdax:x.head(1))# used to simply fall-throughOut[22]: A BA1 0 1 25 2 5 6
groupby head and tail respect column selection:
In [23]:g[['B']].head(1)Out[23]: B0 22 6
groupby nth now reduces by default; filtering can be achieved by passing as_index=False, with an optional dropna argument to ignore NaN. See the docs.
Reducing
In [24]:df=DataFrame([[1,np.nan],[1,4],[5,6]],columns=['A','B'])In [25]:g=df.groupby('A')In [26]:g.nth(0)Out[26]: BA1 NaN5 6.0# this is equivalent to g.first()In [27]:g.nth(0,dropna='any')Out[27]: BA1 4.05 6.0# this is equivalent to g.last()In [28]:g.nth(-1,dropna='any')Out[28]: BA1 4.05 6.0
Filtering
In [29]:gf=df.groupby('A',as_index=False)In [30]:gf.nth(0)Out[30]: A B0 1 NaN2 5 6.0In [31]:gf.nth(0,dropna='any')Out[31]: A BA1 1 4.05 5 6.0
groupby will now not return the grouped column for non-cython functions (GH5610, GH5614, GH6732), as it's already the index
In [32]:df=DataFrame([[1,np.nan],[1,4],[5,6],[5,8]],columns=['A','B'])In [33]:g=df.groupby('A')In [34]:g.count()Out[34]: BA1 15 2In [35]:g.describe()Out[35]: BA1 count 1.000000 mean 4.000000 std NaN min 4.000000 25% 4.000000 50% 4.000000 75% 4.000000... ...5 mean 7.000000 std 1.414214 min 6.000000 25% 6.500000 50% 7.000000 75% 7.500000 max 8.000000[16 rows x 1 columns]
passing as_index will leave the grouped column in-place (this is not a change in 0.14.0)
In [36]:df=DataFrame([[1,np.nan],[1,4],[5,6],[5,8]],columns=['A','B'])In [37]:g=df.groupby('A',as_index=False)In [38]:g.count()Out[38]: A B0 1 11 5 2In [39]:g.describe()Out[39]: A B0 count 2.0 1.000000 mean 1.0 4.000000 std 0.0 NaN min 1.0 4.000000 25% 1.0 4.000000 50% 1.0 4.000000 75% 1.0 4.000000... ... ...1 mean 5.0 7.000000 std 0.0 1.414214 min 5.0 6.000000 25% 5.0 6.500000 50% 5.0 7.000000 75% 5.0 7.500000 max 5.0 8.000000[16 rows x 2 columns]
Allow specification of a more complex groupby via pd.Grouper, such as grouping by a Time and a string field simultaneously. See the docs. (GH3794)
Better propagation/preservation of Series names when performing groupby operations:
SeriesGroupBy.agg will ensure that the name attribute of the original series is propagated to the result (GH6265).
If GroupBy.apply returns a named series, the name of the series will be kept as the name of the column index of the DataFrame returned by GroupBy.apply (GH6124). This facilitates DataFrame.stack operations where the name of the column index is used as the name of the inserted column containing the pivoted data.
The SQL reading and writing functions now support more database flavors through SQLAlchemy (GH2717, GH4163, GH5950, GH6292). All databases supported by SQLAlchemy can be used, such as PostgreSQL, MySQL, Oracle, Microsoft SQL Server (see documentation of SQLAlchemy on included dialects).
The functionality of providing DBAPI connection objects will only be supportedfor sqlite3 in the future. The'mysql' flavor is deprecated.
The new functionsread_sql_query() andread_sql_table()are introduced. The functionread_sql() is kept as a conveniencewrapper around the other two and will delegate to specific function depending onthe provided input (database table name or sql query).
In practice, you have to provide a SQLAlchemyengine to the sql functions.To connect with SQLAlchemy you use thecreate_engine() function to create an engineobject from database URI. You only need to create the engine once per database you areconnecting to. For an in-memory sqlite database:
In [40]: from sqlalchemy import create_engine

# Create your connection.
In [41]: engine = create_engine('sqlite:///:memory:')

This engine can then be used to write or read data to/from this database:

In [42]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
In [43]: df.to_sql('db_table', engine, index=False)

You can read data from a database by specifying the table name:

In [44]: pd.read_sql_table('db_table', engine)
Out[44]:
   A  B
0  1  a
1  2  b
2  3  c

or by specifying a sql query:

In [45]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[45]:
   A  B
0  1  a
1  2  b
2  3  c
Some other enhancements to the sql functions include:
an index keyword (default is True) and index_label for controlling how the DataFrame index is written;
a parse_dates keyword in read_sql_query() and read_sql_table().

Warning
Some of the existing functions or function aliases have been deprecatedand will be removed in future versions. This includes:tquery,uquery,read_frame,frame_query,write_frame.
Warning
The support for the ‘mysql’ flavor when using DBAPI connection objects has been deprecated.MySQL will be further supported with SQLAlchemy engines (GH6900).
In 0.14.0 we added a new way to slice multi-indexed objects.You can slice a multi-index by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, seeSelection by Label,including slices, lists of labels, labels, and boolean indexers.
You can useslice(None) to select all the contents ofthat level. You do not need to specify all thedeeper levels, they will be implied asslice(None).
As usual,both sides of the slicers are included as this is label indexing.
Seethe docsSee also issues (GH6134,GH4036,GH3057,GH2598,GH5641,GH7106)
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MultiIndex for the rows.
You should do this:
df.loc[(slice('A1','A3'),.....),:]
rather than this:
df.loc[(slice('A1','A3'),.....)]
Warning
You will need to make sure that the selection axes are fully lexsorted!
In [46]:defmklbl(prefix,n): ....:return["%s%s"%(prefix,i)foriinrange(n)] ....:In [47]:index=MultiIndex.from_product([mklbl('A',4), ....:mklbl('B',2), ....:mklbl('C',4), ....:mklbl('D',2)]) ....:In [48]:columns=MultiIndex.from_tuples([('a','foo'),('a','bar'), ....:('b','foo'),('b','bah')], ....:names=['lvl0','lvl1']) ....:In [49]:df=DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))), ....:index=index, ....:columns=columns).sortlevel().sortlevel(axis=1) ....:In [50]:dfOut[50]:lvl0 a blvl1 bar foo bah fooA0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9 8 11 10 D1 13 12 15 14 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 25 24 27 26... ... ... ... ...A3 B1 C0 D1 229 228 231 230 C1 D0 233 232 235 234 D1 237 236 239 238 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249 248 251 250 D1 253 252 255 254[64 rows x 4 columns]
Basic multi-index slicing using slices, lists, and labels.
In [51]:df.loc[(slice('A1','A3'),slice(None),['C1','C3']),:]Out[51]:lvl0 a blvl1 bar foo bah fooA1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 D1 109 108 111 110 C3 D0 121 120 123 122... ... ... ... ...A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254[24 rows x 4 columns]
You can use apd.IndexSlice to shortcut the creation of these slices
In [52]:idx=pd.IndexSliceIn [53]:df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]Out[53]:lvl0 a blvl1 foo fooA0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58... ... ...A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254[32 rows x 2 columns]
It is possible to perform quite complicated selections using this method on multipleaxes at the same time.
In [54]:df.loc['A1',(slice(None),'foo')]Out[54]:lvl0 a blvl1 foo fooB0 C0 D0 64 66 D1 68 70 C1 D0 72 74 D1 76 78 C2 D0 80 82 D1 84 86 C3 D0 88 90... ... ...B1 C0 D1 100 102 C1 D0 104 106 D1 108 110 C2 D0 112 114 D1 116 118 C3 D0 120 122 D1 124 126[16 rows x 2 columns]In [55]:df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]Out[55]:lvl0 a blvl1 foo fooA0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58... ... ...A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to thevalues.
In [56]:mask=df[('a','foo')]>200In [57]:df.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]Out[57]:lvl0 a blvl1 foo fooA3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254
You can also specify theaxis argument to.loc to interpret the passedslicers on a single axis.
In [58]:df.loc(axis=0)[:,:,['C1','C3']]Out[58]:lvl0 a blvl1 bar foo bah fooA0 B0 C1 D0 9 8 11 10 D1 13 12 15 14 C3 D0 25 24 27 26 D1 29 28 31 30 B1 C1 D0 41 40 43 42 D1 45 44 47 46 C3 D0 57 56 59 58... ... ... ... ...A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254[32 rows x 4 columns]
Furthermore you canset the values using these methods
In [59]:df2=df.copy()In [60]:df2.loc(axis=0)[:,:,['C1','C3']]=-10In [61]:df2Out[61]:lvl0 a blvl1 bar foo bah fooA0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 -10 -10 -10 -10... ... ... ... ...A3 B1 C0 D1 229 228 231 230 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [62]:df2=df.copy()In [63]:df2.loc[idx[:,:,['C1','C3']],:]=df2*1000In [64]:df2Out[64]:lvl0 a blvl1 bar foo bah fooA0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9000 8000 11000 10000 D1 13000 12000 15000 14000 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 25000 24000 27000 26000... ... ... ... ...A3 B1 C0 D1 229 228 231 230 C1 D0 233000 232000 235000 234000 D1 237000 236000 239000 238000 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249000 248000 251000 250000 D1 253000 252000 255000 254000[64 rows x 4 columns]
Hexagonal bin plots fromDataFrame.plot withkind='hexbin' (GH5478), Seethe docs.
DataFrame.plot andSeries.plot now supports area plot with specifyingkind='area' (GH6656), Seethe docs
Pie plots fromSeries.plot andDataFrame.plot withkind='pie' (GH6976), Seethe docs.
Plotting with Error Bars is now supported in the.plot method ofDataFrame andSeries objects (GH3796,GH6834), Seethe docs.
DataFrame.plot andSeries.plot now support atable keyword for plottingmatplotlib.Table, Seethe docs. Thetable keyword can receive the following values.
False: Do nothing (default).
True: Draw a table using the DataFrame or Series that called the plot method. Data will be transposed to meet matplotlib's default layout.
DataFrame or Series: Draw a matplotlib.table using the passed data. The data will be drawn as displayed in the print method (not transposed automatically).
Also, a helper function pandas.tools.plotting.table is added to create a table from DataFrame and Series, and add it to a matplotlib.Axes.
plot(legend='reverse') will now reverse the order of legend labels for most plot kinds. (GH6014)
Line plot and area plot can be stacked bystacked=True (GH6656)
Following keywords are now acceptable forDataFrame.plot() withkind='bar' andkind='barh':
Because of the default align value change, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0, ...). This is intended to make bar plots be located on the same coordinates as line plots. However, bar plots may differ unexpectedly when you manually adjust the bar location or drawing area, such as using set_xlim, set_ylim, etc. In these cases, please modify your script to match the new coordinates.
Theparallel_coordinates() function now takes argumentcolorinstead ofcolors. AFutureWarning is raised to alert thatthe oldcolors argument will not be supported in a future release. (GH6956)
Theparallel_coordinates() andandrews_curves() functions now takepositional argumentframe instead ofdata. AFutureWarning israised if the olddata argument is used by name. (GH6956)
DataFrame.boxplot() now supportslayout keyword (GH6769)
DataFrame.boxplot() has a new keyword argument,return_type. It accepts'dict','axes', or'both', in which case a namedtuple with the matplotlibaxes and a dict of matplotlib Lines is returned.
There are prior version deprecations that are taking effect as of 0.14.0.
DateRange in favor ofDatetimeIndex (GH6816)column keyword fromDataFrame.sort (GH4370)precision keyword fromset_eng_float_format() (GH395)force_unicode keyword fromDataFrame.to_string(),DataFrame.to_latex(), andDataFrame.to_html(); these functionencode in unicode by default (GH2224,GH2225)nanRep keyword fromDataFrame.to_csv() andDataFrame.to_string() (GH275)unique keyword fromHDFStore.select_column() (GH3256)inferTimeRule keyword fromTimestamp.offset() (GH391)name keyword fromget_data_yahoo() andget_data_google() (commit b921d1a )offset keyword fromDatetimeIndex constructor(commit 3136390 )time_rule from several rolling-moment statistical functions, suchasrolling_sum() (GH1042)- boolean operations on numpy arrays in favor of inv~, as this is going tobe deprecated in numpy 1.9 (GH6960)Thepivot_table()/DataFrame.pivot_table() andcrosstab() functionsnow take argumentsindex andcolumns instead ofrows andcols. AFutureWarning is raised to alert that the oldrows andcols argumentswill not be supported in a future release (GH5505)
TheDataFrame.drop_duplicates() andDataFrame.duplicated() methodsnow take argumentsubset instead ofcols to better align withDataFrame.dropna(). AFutureWarning is raised to alert that the oldcols arguments will not be supported in a future release (GH6680)
TheDataFrame.to_csv() andDataFrame.to_excel() functionsnow takes argumentcolumns instead ofcols. AFutureWarning is raised to alert that the oldcols argumentswill not be supported in a future release (GH6645)
Indexers will warnFutureWarning when used with a scalar indexer anda non-floating point Index (GH4892,GH6960)
# non-floating point indexes can only be indexed by integers / labelsIn [1]:Series(1,np.arange(5))[3.0] pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating pointOut[1]:1In [2]:Series(1,np.arange(5)).iloc[3.0] pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating pointOut[2]:1In [3]:Series(1,np.arange(5)).iloc[3.0:4] pandas/core/index.py:527: FutureWarning: slice indexers when using iloc should be integers and not floating pointOut[3]: 3 1 dtype: int64# these are Float64Indexes, so integer or floating point is acceptableIn [4]:Series(1,np.arange(5.))[3]Out[4]:1In [5]:Series(1,np.arange(5.))[3.0]Out[6]:1
Numpy 1.9 compat w.r.t. deprecation warnings (GH6960)
Panel.shift() now has a function signature that matches DataFrame.shift(). The old positional argument lags has been changed to a keyword argument periods with a default value of 1. A FutureWarning is raised if the old argument lags is used by name. (GH6910)
The order keyword argument of factorize() will be removed. (GH6926).
Remove the copy keyword from DataFrame.xs(), Panel.major_xs(), Panel.minor_xs(). A view will be returned if possible, otherwise a copy will be made. Previously the user could think that copy=False would ALWAYS return a view. (GH6894)
The parallel_coordinates() function now takes argument color instead of colors. A FutureWarning is raised to alert that the old colors argument will not be supported in a future release. (GH6956)
The parallel_coordinates() and andrews_curves() functions now take positional argument frame instead of data. A FutureWarning is raised if the old data argument is used by name. (GH6956)
The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).
The following io.sql functions have been deprecated: tquery, uquery, read_frame, frame_query, write_frame.
The percentile_width keyword argument in describe() has been deprecated. Use the percentiles keyword instead, which takes a list of percentiles to display. The default output is unchanged.
The default return type of boxplot() will change from a dict to a matplotlib Axes in a future release. You can use the future behavior now by passing return_type='axes' to boxplot.
DataFrame and Series will create a MultiIndex object if passed a tuples dict, Seethe docs (GH3323)
In [65]:Series({('a','b'):1,('a','a'):0, ....:('a','c'):2,('b','a'):3,('b','b'):4}) ....:Out[65]:a a 0 b 1 c 2b a 3 b 4dtype: int64In [66]:DataFrame({('a','b'):{('A','B'):1,('A','C'):2}, ....:('a','a'):{('A','C'):3,('A','B'):4}, ....:('a','c'):{('A','B'):5,('A','C'):6}, ....:('b','a'):{('A','C'):7,('A','B'):8}, ....:('b','b'):{('A','D'):9,('A','B'):10}}) ....:Out[66]: a b a b c a bA B 4.0 1.0 5.0 8.0 10.0 C 3.0 2.0 6.0 7.0 NaN D NaN NaN NaN NaN 9.0
Added thesym_diff method toIndex (GH5543)
DataFrame.to_latex now takes a longtable keyword, which if True will return a table in a longtable environment. (GH6617)
Add option to turn off escaping inDataFrame.to_latex (GH6472)
pd.read_clipboard will, if the keywordsep is unspecified, try to detect data copied from a spreadsheetand parse accordingly. (GH6223)
Joining a singly-indexed DataFrame with a multi-indexed DataFrame (GH3662)
See the docs. Joining multi-index DataFrames on both the left and right is not yet supported.
In [67]:household=DataFrame(dict(household_id=[1,2,3], ....:male=[0,1,0], ....:wealth=[196087.3,316478.7,294750]), ....:columns=['household_id','male','wealth'] ....:).set_index('household_id') ....:In [68]:householdOut[68]: male wealthhousehold_id1 0 196087.32 1 316478.73 0 294750.0In [69]:portfolio=DataFrame(dict(household_id=[1,2,2,3,3,3,4], ....:asset_id=["nl0000301109","nl0000289783","gb00b03mlx29", ....:"gb00b03mlx29","lu0197800237","nl0000289965",np.nan], ....:name=["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell", ....:"AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan], ....:share=[1.0,0.4,0.6,0.15,0.6,0.25,1.0]), ....:columns=['household_id','asset_id','name','share'] ....:).set_index(['household_id','asset_id']) ....:In [70]:portfolioOut[70]: name sharehousehold_id asset_id1 nl0000301109 ABN Amro 1.002 nl0000289783 Robeco 0.40 gb00b03mlx29 Royal Dutch Shell 0.603 gb00b03mlx29 Royal Dutch Shell 0.15 lu0197800237 AAB Eastern Europe Equity Fund 0.60 nl0000289965 Postbank BioTech Fonds 0.254 NaN NaN 1.00In [71]:household.join(portfolio,how='inner')Out[71]: male wealth name \household_id asset_id1 nl0000301109 0 196087.3 ABN Amro2 nl0000289783 1 316478.7 Robeco gb00b03mlx29 1 316478.7 Royal Dutch Shell3 gb00b03mlx29 0 294750.0 Royal Dutch Shell lu0197800237 0 294750.0 AAB Eastern Europe Equity Fund nl0000289965 0 294750.0 Postbank BioTech Fonds sharehousehold_id asset_id1 nl0000301109 1.002 nl0000289783 0.40 gb00b03mlx29 0.603 gb00b03mlx29 0.15 lu0197800237 0.60 nl0000289965 0.25
quotechar, doublequote, and escapechar can now be specified when using DataFrame.to_csv (GH5414, GH4528)
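A minimal sketch of these quoting controls (made-up data and filenames):

import pandas as pd

df = pd.DataFrame({'text': ['plain', 'has "quotes" inside']})
# double the embedded quote character (the default behaviour)
df.to_csv('quoted.csv', quotechar='"', doublequote=True)
# or escape it instead of doubling it
df.to_csv('escaped.csv', quotechar='"', doublequote=False, escapechar='\\')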
Partially sort by only the specified levels of a MultiIndex with thesort_remaining boolean kwarg. (GH3984)
Added to_julian_date to Timestamp and DatetimeIndex. The Julian Date is used primarily in astronomy and represents the number of days from noon, January 1, 4713 BC. Because nanoseconds are used to define the time in pandas the actual range of dates that you can use is 1678 AD to 2262 AD. (GH4041)
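For example (2000-01-01 12:00 is the J2000 epoch, Julian Date 2451545.0):

import pandas as pd

pd.Timestamp('2000-01-01 12:00').to_julian_date()     # 2451545.0
pd.date_range('2014-01-01', periods=3).to_julian_date()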
DataFrame.to_stata will now check data for compatibility with Stata data types and will upcast when needed. When it is not possible to losslessly upcast, a warning is issued (GH6327)
DataFrame.to_stata and StataWriter will accept keyword arguments time_stamp and data_label which allow the time stamp and dataset label to be set when creating a file. (GH6545)
pandas.io.gbq now handles reading unicode strings properly. (GH5940)
Holiday calendars are now available and can be used with the CustomBusinessDay offset (GH6719)
Float64Index is now backed by a float64 dtype ndarray instead of an object dtype array (GH6471).
Implemented Panel.pct_change (GH6904)
Added how option to rolling-moment functions to dictate how to handle resampling; rolling_max() defaults to max, rolling_min() defaults to min, and all others default to mean (GH6297)
CustomBusinessMonthBegin and CustomBusinessMonthEnd are now available (GH6866)
Series.quantile() and DataFrame.quantile() now accept an array of quantiles.
describe() now accepts an array of percentiles to include in the summary statistics (GH4196)
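A minimal sketch of both enhancements (made-up data):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10.0))
s.quantile([0.25, 0.5, 0.75])                              # a Series indexed by the requested quantiles
pd.DataFrame({'a': s}).describe(percentiles=[0.05, 0.95])  # extra percentiles in the summary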
pivot_table can now acceptGrouper byindex andcolumns keywords (GH6913)
In [72]:importdatetimeIn [73]:df=DataFrame({ ....:'Branch':'A A A A A B'.split(), ....:'Buyer':'Carl Mark Carl Carl Joe Joe'.split(), ....:'Quantity':[1,3,5,1,8,1], ....:'Date':[datetime.datetime(2013,11,1,13,0),datetime.datetime(2013,9,1,13,5), ....:datetime.datetime(2013,10,1,20,0),datetime.datetime(2013,10,2,10,0), ....:datetime.datetime(2013,11,1,20,0),datetime.datetime(2013,10,2,10,0)], ....:'PayDay':[datetime.datetime(2013,10,4,0,0),datetime.datetime(2013,10,15,13,5), ....:datetime.datetime(2013,9,5,20,0),datetime.datetime(2013,11,2,10,0), ....:datetime.datetime(2013,10,7,20,0),datetime.datetime(2013,9,5,10,0)]}) ....:In [74]:dfOut[74]: Branch Buyer Date PayDay Quantity0 A Carl 2013-11-01 13:00:00 2013-10-04 00:00:00 11 A Mark 2013-09-01 13:05:00 2013-10-15 13:05:00 32 A Carl 2013-10-01 20:00:00 2013-09-05 20:00:00 53 A Carl 2013-10-02 10:00:00 2013-11-02 10:00:00 14 A Joe 2013-11-01 20:00:00 2013-10-07 20:00:00 85 B Joe 2013-10-02 10:00:00 2013-09-05 10:00:00 1In [75]:pivot_table(df,index=Grouper(freq='M',key='Date'), ....:columns=Grouper(freq='M',key='PayDay'), ....:values='Quantity',aggfunc=np.sum) ....:Out[75]:PayDay 2013-09-30 2013-10-31 2013-11-30Date2013-09-30 NaN 3.0 NaN2013-10-31 6.0 NaN 1.02013-11-30 NaN 9.0 NaN
Arrays of strings can be wrapped to a specified width (str.wrap) (GH6999)
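A minimal sketch of str.wrap (made-up strings):

import pandas as pd

s = pd.Series(['a short line', 'a rather longer line that will need wrapping'])
s.str.wrap(12)  # newlines are inserted so that no line exceeds 12 characters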
Addnsmallest() andSeries.nlargest() methods to Series, Seethe docs (GH3960)
PeriodIndex fully supports partial string indexing likeDatetimeIndex (GH7043)
In [76]:prng=period_range('2013-01-01 09:00',periods=100,freq='H')In [77]:ps=Series(np.random.randn(len(prng)),index=prng)In [78]:psOut[78]:2013-01-01 09:00 0.0156962013-01-01 10:00 -2.2426852013-01-01 11:00 1.1500362013-01-01 12:00 0.9919462013-01-01 13:00 0.9533242013-01-01 14:00 -2.0212552013-01-01 15:00 -0.334077 ...2013-01-05 06:00 0.5665342013-01-05 07:00 0.5035922013-01-05 08:00 0.2852962013-01-05 09:00 0.4842882013-01-05 10:00 1.3634822013-01-05 11:00 -0.7811052013-01-05 12:00 -0.468018Freq: H, dtype: float64In [79]:ps['2013-01-02']Out[79]:2013-01-02 00:00 0.5534392013-01-02 01:00 1.3181522013-01-02 02:00 -0.4693052013-01-02 03:00 0.6755542013-01-02 04:00 -1.8170272013-01-02 05:00 -0.1831092013-01-02 06:00 1.058969 ...2013-01-02 17:00 0.0762002013-01-02 18:00 -0.5664462013-01-02 19:00 0.0361422013-01-02 20:00 -2.0749782013-01-02 21:00 0.2477922013-01-02 22:00 -0.8971572013-01-02 23:00 -0.136795Freq: H, dtype: float64
read_excel can now read milliseconds in Excel dates and times with xlrd >= 0.9.3. (GH5945)
pd.stats.moments.rolling_var now uses Welford's method for increased numerical stability (GH6817)
pd.expanding_apply and pd.rolling_apply now take args and kwargs that are passed on to the func (GH6289)
DataFrame.rank() now has a percentage rank option (GH5971)
Series.rank() now has a percentage rank option (GH5971)
Series.rank() and DataFrame.rank() now accept method='dense' for ranks without gaps (GH6514)
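A minimal sketch of the new ranking options (made-up data):

import pandas as pd

s = pd.Series([1, 2, 2, 3])
s.rank(pct=True)        # ranks rescaled to (0, 1] as a fraction of valid observations
s.rank(method='dense')  # like 'min', but ranks increase by 1 between groups (no gaps)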
Support passingencoding with xlwt (GH3710)
Refactor Block classes removingBlock.items attributes to avoid duplicationin item handling (GH6745,GH6988).
Testing statements updated to use specialized asserts (GH6175)
Performance improvements in 0.14.0 include:
converting DatetimeIndex to floating ordinals using DatetimeConverter (GH6636)
DataFrame.shift (GH5609)
CustomBusinessDay (GH6584)
DataFrame.from_records when reading a specified number of rows from an iterable (GH6700)
take_2d (GH6749)
GroupBy.count() is now implemented in Cython and is much faster for large numbers of groups (GH7016).
There are no experimental changes in 0.14.0
pd.DataFrame.sort_index where mergesort wasn’t stable whenascending=False (GH6399)pd.tseries.frequencies.to_offset when argument has leading zeroes (GH6391)Timestamp /to_datetime for current year (GH5958).xs with a Series multiindex (GH6258,GH5684)eval where type-promotion failed for large expressions (GH6205)inplace=True (GH6281)HDFStore.remove now handles start and stop (GH6177)HDFStore.select_as_multiple handles start and stop the same way asselect (GH6177)HDFStore.select_as_coordinates andselect_column works with awhere clause that results in filters (GH6177)agg with a single function and a a mixed-type frame (GH6337)DataFrame.replace() when passing a non-boolto_replace argument (GH6332)groups was missing) (GH3881)pd.eval when parsing strings with possible tokens like'&'(GH6351)-inf in Panels when dividing by integer 0 (GH6178)DataFrame.shift withaxis=1 was raising (GH6371)nosetests-Adisabled) (GH6048).DataFrame.replace() when passing a nesteddict that containedkeys not in the values to be replaced (GH6342)str.match ignored the na flag (GH6609).Series.get, was using a buggy access method (GH6383)where=[('date','>=',datetime(2013,1,1)),('date','<=',datetime(2014,1,1))] (GH6313)DataFrame.dropna with duplicate indices (GH6355)Float64Index with nans not comparing correctly (GH6401)eval/query expressions with strings containing the@ characterwill now work (GH6366).Series.reindex when specifying amethod with some nan values was inconsistent (noted on a resample) (GH6418)DataFrame.replace() where nested dicts were erroneouslydepending on the order of dictionary keys and values (GH5338).sym_diff onIndex objects withNaN values (GH6444)MultiIndex.from_product with aDatetimeIndex as input (GH6439)str.extract when passed a non-default index (GH6348)str.split when passedpat=None andn=1 (GH6466)io.data.DataReader when passed"F-F_Momentum_Factor" anddata_source="famafrench" (GH6460)sum of atimedelta64[ns] series (GH6462)resample with a timezone and certain offsets (GH6397)iat/iloc with duplicate indices on a Series (GH6493)read_html where nan’s were incorrectly being used to indicatemissing values in text. 
Should use the empty string for consistency with therest of pandas (GH5129).read_html tests where redirected invalid URLs would make one testfail (GH6445)..loc on non-unique indices (GH6504)datetime64 non-ns dtypes in Series creation (GH6529).names attribute of MultiIndexes passed toset_index are now preserved (GH6459)..loc on mixed integer Indexes (GH6546)pd.read_stata which would use the wrong data types and missing values (GH6327)DataFrame.to_stata that lead to data loss in certain cases, and could be exported using thewrong data types and missing values (GH6335)StataWriter replaces missing values in string columns by empty string (GH6802)Timestamp addition/subtraction (GH6543)IndexError exceptions (GH6536,GH6551)Series.quantile raising on anobject dtype (GH6555).xs with anan in level when dropped (GH6574)method='bfill/ffill' anddatetime64[ns] dtype (GH6587)Series.pop (GH6600)iloc indexing when positional indexer matchedInt64Index of the corresponding axis and no reordering happened (GH6612)fillna withlimit andvalue specifiedDataFrame.to_stata when columns have non-string names (GH4558)np.compress, surfaced in (GH6658)DataFrame.to_stata which incorrectly handles nan values and ignoreswith_index keyword argument (GH6685)how=None resample freq is the same as the axis frequency (GH5955)obj.blocks on sparse containers dropping all but the last items of same for dtype (GH6748)NaT(NaTType) (GH4606)DataFrame.replace() where regex metacharacters were being treatedas regexs even whenregex=False (GH6777)..index (GH6785)makeclean (GH6768)HDFStore (GH6166)DataFrame._reduce where non bool-like (0/1) integers were beingcoverted into bools. (GH6806)fillna and a Series on datetime-like (GH6344)np.timedelta64 toDatetimeIndex with timezone outputs incorrect results (GH6818)DataFrame.replace() where changing a dtype through replacementwould only replace the first occurrence of a value (GH6689)Period construction (GH5332)Series.__unicode__ whenmax_rows=None and the Series has more than 1000 rows. (GH6863)groupby.get_group where a datetlike wasn’t always accepted (GH5267)groupBy.get_group created byTimeGrouper raisesAttributeError (GH6914)DatetimeIndex.tz_localize andDatetimeIndex.tz_convert convertingNaT incorrectly (GH5546)NaT (GH6873)Series.str.extract where the resultingSeries from a singlegroup match wasn’t renamed to the group nameDataFrame.to_csv where settingindex=False ignored theheader kwarg (GH6186)DataFrame.plot andSeries.plot, where the legend behave inconsistently when plotting to the same axes repeatedly (GH6678)__finalize__ / bug in merge not finalizing (GH6923,GH6927)TextFileReader inconcat, which was affecting a common user idiom (GH6583)delim_whitespace=True and\r-delimited linesSeries.rank andDataFrame.rank that caused small floats (<1e-13) to all receive the same rank (GH6886)DataFrame.apply with functions that used *args`` or **kwargs and returnedan empty result (GH6952)Panel.shift toNDFrame.slice_shift and fixed to respect multiple dtypes. 
(GH6959)subplots=True inDataFrame.plot only has single column raisesTypeError, andSeries.plot raisesAttributeError (GH6951)DataFrame.plot draws unnecessary axes when enablingsubplots andkind=scatter (GH6951)read_csv from a filesystem with non-utf-8 encoding (GH6807)iloc when setting / aligning (GH6766)groupby.plot when using aFloat64Index (GH7025)parallel_coordinates andradviz where reordering of class columncaused possible color/class mismatch (GH6956)radviz andandrews_curves where multiple values of ‘color’were being passed to plotting method (GH6956)Float64Index.isin() where containingnan s would make indicesclaim that they contained all the things (GH7066).DataFrame.boxplot where it failed to use the axis passed as theax argument (GH3578)XlsxWriter andXlwtWriter implementations that resulted in datetime columns being formatted without the time (GH7075)were being passed to plotting methodread_fwf() treatsNone incolspec like regular python slices. It now reads from the beginningor until the end of the line whencolspec contains aNone (previously raised aTypeError)_is_view property toNDFrame to correctly predictviews; markis_copy onxs only if its an actual copy (and not a view) (GH7084)dayfirst=True (GH5917)MultiIndex.from_arrays created fromDatetimeIndex doesn’t preservefreq andtz (GH7090)unstack raisesValueError whenMultiIndex containsPeriodIndex (GH4342)boxplot andhist draws unnecessary axes (GH6769)groupby.nth() for out-of-bounds indexers (GH6621)quantile with datetime values (GH6965)Dataframe.set_index,reindex andpivot don’t preserveDatetimeIndex andPeriodIndex attributes (GH3950,GH5878,GH6631)MultiIndex.get_level_values doesn’t preserveDatetimeIndex andPeriodIndex attributes (GH7092)Groupby doesn’t preservetz (GH3950)PeriodIndex partial string slicing (GH6716)DatetimeIndex specifyingfreq raisesValueError when passed value is too short (GH7098)PeriodIndex string slicing with out of bounds values (GH5407)isnull when applied to 0-dimensional object arrays (GH7176)query/eval where global constants were not looked up correctly(GH7178)iloc and a multi-axis tuple indexer (GH7189)This is a minor release from 0.13.0 and includes a small number of API changes, several new features,enhancements, and performance improvements along with a large number of bug fixes. We recommend that allusers upgrade to this version.
Highlights include:
Added infer_datetime_format keyword to read_csv/to_datetime to allow speedups for homogeneously formatted datetimes; an enhanced Panel apply() method.

Warning
0.13.1 fixes a bug that was caused by a combination of having numpy < 1.8, and doingchained assignment on a string-like array. Please reviewthe docs,chained indexing can have unexpected results and should generally be avoided.
This would previously segfault:
In [1]:df=DataFrame(dict(A=np.array(['foo','bar','bah','foo','bar'])))In [2]:df['A'].iloc[0]=np.nanIn [3]:dfOut[3]: A0 NaN1 bar2 bah3 foo4 bar
The recommended way to do this type of assignment is:
In [4]:df=DataFrame(dict(A=np.array(['foo','bar','bah','foo','bar'])))In [5]:df.ix[0,'A']=np.nanIn [6]:dfOut[6]: A0 NaN1 bar2 bah3 foo4 bar
df.info() view now displays dtype info per column (GH5682)
df.info() now honors the option max_info_rows, to disable null counts for large frames (GH5974)
In [7]:max_info_rows=pd.get_option('max_info_rows')In [8]:df=DataFrame(dict(A=np.random.randn(10), ...:B=np.random.randn(10), ...:C=date_range('20130101',periods=10))) ...:In [9]:df.iloc[3:6,[0,2]]=np.nan
# set to not display the null countsIn [10]:pd.set_option('max_info_rows',0)In [11]:df.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 10 entries, 0 to 9Data columns (total 3 columns):A float64B float64C datetime64[ns]dtypes: datetime64[ns](1), float64(2)memory usage: 312.0 bytes
# this is the default (same as in 0.13.0)In [12]:pd.set_option('max_info_rows',max_info_rows)In [13]:df.info()<class 'pandas.core.frame.DataFrame'>RangeIndex: 10 entries, 0 to 9Data columns (total 3 columns):A 7 non-null float64B 10 non-null float64C 7 non-null datetime64[ns]dtypes: datetime64[ns](1), float64(2)memory usage: 312.0 bytes
Addshow_dimensions display option for the new DataFrame repr to control whether the dimensions print.
In [14]:df=DataFrame([[1,2],[3,4]])In [15]:pd.set_option('show_dimensions',False)In [16]:dfOut[16]: 0 10 1 21 3 4In [17]:pd.set_option('show_dimensions',True)In [18]:dfOut[18]: 0 10 1 21 3 4[2 rows x 2 columns]
The ArrayFormatter for datetime and timedelta64 now intelligently limits precision based on the values in the array (GH3401)
Previously output might look like:
                   age                today                 diff
0  2001-01-01 00:00:00  2013-04-19 00:00:00  4491 days, 00:00:00
1  2004-06-01 00:00:00  2013-04-19 00:00:00  3244 days, 00:00:00
Now the output looks like:
In [19]:df=DataFrame([Timestamp('20010101'), ....:Timestamp('20040601')],columns=['age']) ....:In [20]:df['today']=Timestamp('20130419')In [21]:df['diff']=df['today']-df['age']In [22]:dfOut[22]: age today diff0 2001-01-01 2013-04-19 4491 days1 2004-06-01 2013-04-19 3244 days[2 rows x 3 columns]
Add-NaN and-nan to the default set of NA values (GH5952).SeeNA Values.
AddedSeries.str.get_dummies vectorized string method (GH6021), to extractdummy/indicator variables for separated string columns:
In [23]:s=Series(['a','a|b',np.nan,'a|c'])In [24]:s.str.get_dummies(sep='|')Out[24]: a b c0 1 0 01 1 1 02 0 0 03 1 0 1[4 rows x 3 columns]
Added the NDFrame.equals() method to compare whether two NDFrames are equal, i.e. have equal axes, dtypes, and values. Added the array_equivalent function to compare whether two ndarrays are equal. NaNs in identical locations are treated as equal. (GH5283) See also the docs for a motivating example.
In [25]:df=DataFrame({'col':['foo',0,np.nan]})In [26]:df2=DataFrame({'col':[np.nan,0,'foo']},index=[2,1,0])In [27]:df.equals(df2)Out[27]:FalseIn [28]:df.equals(df2.sort())Out[28]:TrueIn [29]:importpandas.core.commonascomIn [30]:com.array_equivalent(np.array([0,np.nan]),np.array([0,np.nan]))Out[30]:TrueIn [31]:np.array_equal(np.array([0,np.nan]),np.array([0,np.nan]))Out[31]:False
DataFrame.apply will use the reduce argument to determine whether a Series or a DataFrame should be returned when the DataFrame is empty (GH6007).
Previously, calling DataFrame.apply on an empty DataFrame would return either a DataFrame if there were no columns, or the function being applied would be called with an empty Series to guess whether a Series or DataFrame should be returned:

In [32]: def applied_func(col):
   ....:     print("Apply function being called with: ", col)
   ....:     return col.sum()
   ....:
In [33]: empty = DataFrame(columns=['a', 'b'])
In [34]: empty.apply(applied_func)
('Apply function being called with: ', Series([], dtype: float64))
Out[34]:
a   NaN
b   NaN
dtype: float64

Now, when apply is called on an empty DataFrame: if the reduce argument is True a Series will be returned, if it is False a DataFrame will be returned, and if it is None (the default) the function being applied will be called with an empty series to try and guess the return type.
In [35]:empty.apply(applied_func,reduce=True)Out[35]:a NaNb NaNdtype: float64In [36]:empty.apply(applied_func,reduce=False)Out[36]:Empty DataFrameColumns: [a, b]Index: [][0 rows x 2 columns]
There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1
There are no deprecations of prior behavior in 0.13.1
pd.read_csv and pd.to_datetime learned a new infer_datetime_format keyword which greatly improves parsing perf in many cases. Thanks to @lexual for suggesting and @danbirken for rapidly implementing. (GH5490, GH6021)
If parse_dates is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

# Try to infer the format for the index column
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
                 infer_datetime_format=True)

date_format and datetime_format keywords can now be specified when writing to Excel files (GH4133)
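A minimal sketch of the Excel date formatting keywords (the filename is arbitrary and an Excel writer engine such as xlsxwriter or openpyxl must be installed):

import pandas as pd

df = pd.DataFrame({'when': pd.date_range('2014-01-01', periods=3)})
with pd.ExcelWriter('dates.xlsx',
                    date_format='YYYY-MM-DD',
                    datetime_format='YYYY-MM-DD HH:MM:SS') as writer:
    df.to_excel(writer, index=False)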
MultiIndex.from_product convenience function for creating a MultiIndex fromthe cartesian product of a set of iterables (GH6055):
In [37]:shades=['light','dark']In [38]:colors=['red','green','blue']In [39]:MultiIndex.from_product([shades,colors],names=['shade','color'])Out[39]:MultiIndex(levels=[[u'dark', u'light'], [u'blue', u'green', u'red']], labels=[[1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]], names=[u'shade', u'color'])
Panelapply() will work on non-ufuncs. Seethe docs.
In [40]:importpandas.util.testingastmIn [41]:panel=tm.makePanel(5)In [42]:panelOut[42]:<class 'pandas.core.panel.Panel'>Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)Items axis: ItemA to ItemCMajor_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00Minor_axis axis: A to DIn [43]:panel['ItemA']Out[43]: A B C D2000-01-03 0.694103 1.893534 -1.735349 -0.8503462000-01-04 0.678630 0.639633 1.210384 1.1768122000-01-05 0.239556 -0.962029 0.797435 -0.5243362000-01-06 0.151227 -2.085266 -0.379811 0.7009082000-01-07 0.816127 1.930247 0.702562 0.984188[5 rows x 4 columns]
Specifying anapply that operates on a Series (to return a single element)
In [44]:panel.apply(lambdax:x.dtype,axis='items')Out[44]: A B C D2000-01-03 float64 float64 float64 float642000-01-04 float64 float64 float64 float642000-01-05 float64 float64 float64 float642000-01-06 float64 float64 float64 float642000-01-07 float64 float64 float64 float64[5 rows x 4 columns]
A similar reduction type operation
In [45]:panel.apply(lambdax:x.sum(),axis='major_axis')Out[45]: ItemA ItemB ItemCA 2.579643 3.062757 0.379252B 1.416120 -1.960855 0.923558C 0.595222 -1.079772 -3.118269D 1.487226 -0.734611 -1.979310[4 rows x 3 columns]
This is equivalent to
In [46]:panel.sum('major_axis')Out[46]: ItemA ItemB ItemCA 2.579643 3.062757 0.379252B 1.416120 -1.960855 0.923558C 0.595222 -1.079772 -3.118269D 1.487226 -0.734611 -1.979310[4 rows x 3 columns]
A transformation operation that returns a Panel, but is computing the z-score across the major_axis
In [47]:result=panel.apply( ....:lambdax:(x-x.mean())/x.std(), ....:axis='major_axis') ....:In [48]:resultOut[48]:<class 'pandas.core.panel.Panel'>Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)Items axis: ItemA to ItemCMajor_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00Minor_axis axis: A to DIn [49]:result['ItemA']Out[49]: A B C D2000-01-03 0.595800 0.907552 -1.556260 -1.2448752000-01-04 0.544058 0.200868 0.915883 0.9537472000-01-05 -0.924165 -0.701810 0.569325 -0.8912902000-01-06 -1.219530 -1.334852 -0.418654 0.4375892000-01-07 1.003837 0.928242 0.489705 0.744830[5 rows x 4 columns]
Panel apply() operating on cross-sectional slabs. (GH1148)
In [50]:f=lambdax:((x.T-x.mean(1))/x.std(1)).TIn [51]:result=panel.apply(f,axis=['items','major_axis'])In [52]:resultOut[52]:<class 'pandas.core.panel.Panel'>Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)Items axis: A to DMajor_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00Minor_axis axis: ItemA to ItemCIn [53]:result.loc[:,:,'ItemA']Out[53]: A B C D2000-01-03 0.331409 1.071034 -0.914540 -0.5105872000-01-04 -0.741017 -0.118794 0.383277 0.5372122000-01-05 0.065042 -0.767353 0.655436 0.0694672000-01-06 0.027932 -0.569477 0.908202 0.6105852000-01-07 1.116434 1.133591 0.871287 1.004064[5 rows x 4 columns]
This is equivalent to the following
In [54]:result=Panel(dict([(ax,f(panel.loc[:,:,ax])) ....:foraxinpanel.minor_axis])) ....:In [55]:resultOut[55]:<class 'pandas.core.panel.Panel'>Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)Items axis: A to DMajor_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00Minor_axis axis: ItemA to ItemCIn [56]:result.loc[:,:,'ItemA']Out[56]: A B C D2000-01-03 0.331409 1.071034 -0.914540 -0.5105872000-01-04 -0.741017 -0.118794 0.383277 0.5372122000-01-05 0.065042 -0.767353 0.655436 0.0694672000-01-06 0.027932 -0.569477 0.908202 0.6105852000-01-07 1.116434 1.133591 0.871287 1.004064[5 rows x 4 columns]
Performance improvements for 0.13.1
- count/dropna for axis=1
- dtypes/ftypes methods (GH5968)
- DataFrame.apply (GH6013)

There are no experimental changes in 0.13.1
See V0.13.1 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.1.
See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.
This is a major release from 0.12.0 and includes a number of API changes, several new features and enhancements along with a large number of bug fixes.
Highlights include:
- Float64Index, and other indexing enhancements
- HDFStore has a new string-based syntax for query specification
- timedelta operations
- extract
- isin for DataFrames

Several experimental features are added, including:

- eval/query methods for expression evaluation
- msgpack serialization
- BigQuery

There are several new or updated docs sections including:

- eval/query

Warning
In 0.13.0 Series has internally been refactored to no longer sub-class ndarray but instead subclass NDFrame, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications. See Internal Refactoring
read_excel now supports an integer in its sheetname argument giving the index of the sheet to read in (GH4301).
Text parser now treats anything that reads like inf (“inf”, “Inf”, “-Inf”, “iNf”, etc.) as infinity. (GH4220, GH4219), affecting read_table, read_csv, etc.
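For illustration only (a small in-memory CSV, not taken from the original docs), any spelling of inf parses to floating-point infinity:

from io import StringIO
import pandas as pd

data = "a\ninf\n-Inf\niNf\n"
pd.read_csv(StringIO(data))['a'].tolist()
# [inf, -inf, inf]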
pandas now is Python 2/3 compatible without the need for 2to3 thanks to @jtratner. As a result, pandas now uses iterators more extensively. This also led to the introduction of substantive parts of Benjamin Peterson's six library into compat. (GH4384, GH4375, GH4372)
pandas.util.compat and pandas.util.py3compat have been merged into pandas.compat. pandas.compat now includes many functions allowing 2/3 compatibility. It contains both list and iterator versions of range, filter, map and zip, plus other necessary elements for Python 3 compatibility. lmap, lzip, lrange and lfilter all produce lists instead of iterators, for compatibility with numpy, subscripting and pandas constructors. (GH4384, GH4375, GH4372)
Series.get with negative indexers now returns the same as [] (GH4390)
Changes to how Index and MultiIndex handle metadata (levels, labels, and names) (GH4039):

# previously, you would have set levels or labels directly
index.levels = [[1, 2, 3, 4], [1, 2, 4, 4]]

# now, you use the set_levels or set_labels methods
index = index.set_levels([[1, 2, 3, 4], [1, 2, 4, 4]])

# similarly, for names, you can rename the object
# but setting names is not deprecated
index = index.set_names(["bob", "cranberry"])

# and all methods take an inplace kwarg - but return None
index.set_names(["bob", "cranberry"], inplace=True)

All division with NDFrame objects is now true division, regardless of the future import. This means that operating on pandas objects will by default use floating point division, and return a floating point dtype. You can use // and floordiv to do integer division.
Integer division
In [3]:arr=np.array([1,2,3,4])In [4]:arr2=np.array([5,3,2,1])In [5]:arr/arr2Out[5]:array([0,0,1,4])In [6]:Series(arr)//Series(arr2)Out[6]:0 01 02 13 4dtype: int64
True Division
In [7]:pd.Series(arr)/pd.Series(arr2)# no future import requiredOut[7]:0 0.2000001 0.6666672 1.5000003 4.000000dtype: float64
Infer and downcast dtype if downcast='infer' is passed to fillna/ffill/bfill (GH4604)
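A small hypothetical example of downcast='infer': when the filled result contains only whole numbers, the float column is downcast to an integer dtype.

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
s.fillna(2, downcast='infer')
# 0    1
# 1    2
# 2    3
# dtype: int64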
Calling __nonzero__ on any NDFrame object will now raise a ValueError; this reverts back to the (GH1073, GH4633) behavior. See gotchas for a more detailed discussion.
This prevents doing boolean comparison on entire pandas objects, which is inherently ambiguous. These all will raise a ValueError.

if df:
    ...
df1 and df2
s1 and s2

Added the .bool() method to NDFrame objects to facilitate evaluation of single-element boolean Series:
In [1]:Series([True]).bool()Out[1]:TrueIn [2]:Series([False]).bool()Out[2]:FalseIn [3]:DataFrame([[True]]).bool()Out[3]:TrueIn [4]:DataFrame([[False]]).bool()Out[4]:False
All non-Index NDFrames (Series, DataFrame, Panel, Panel4D, SparsePanel, etc.) now support the entire set of arithmetic operators and arithmetic flex methods (add, sub, mul, etc.). SparsePanel does not support pow or mod with non-scalars. (GH3765)
Series and DataFrame now have a mode() method to calculate the statistical mode(s) by axis/Series. (GH5367)
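A minimal illustration (values made up): mode() returns every value that ties for the highest count.

import pandas as pd

pd.Series([1, 1, 2, 3, 3]).mode()                      # 1 and 3 both appear twice
pd.DataFrame({'a': [1, 1, 2], 'b': [5, 5, 5]}).mode()  # per-column modes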
Chained assignment will now by default warn if the user is assigning to a copy. This can be changed with the option mode.chained_assignment; allowed options are raise/warn/None. See the docs.
In [5]:dfc=DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})In [6]:pd.set_option('chained_assignment','warn')
The following warning / exception will show if this is attempted.
In [7]:dfc.loc[0]['A']=1111
Traceback (most recent call last)
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Here is the correct method of assignment.
In [8]:dfc.loc[0,'A']=11In [9]:dfcOut[9]: A B0 11 11 bbb 22 ccc 3[3 rows x 2 columns]
Panel.reindex has the following call signature Panel.reindex(items=None, major_axis=None, minor_axis=None, **kwargs) to conform with other NDFrame objects. See Internal Refactoring for more information.
Series.argmin and Series.argmax are now aliased to Series.idxmin and Series.idxmax. These return the index of the min or max element respectively. Prior to 0.13.0 these would return the position of the min/max element. (GH6214)
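A short sketch of the difference (example data made up): idxmax returns the index label of the maximum, while the positional answer is still available through numpy.

import pandas as pd

s = pd.Series([10, 30, 20], index=['a', 'b', 'c'])
s.idxmax()           # 'b'  -- the label of the maximum
s.values.argmax()    # 1    -- the position, via numpy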
These were announced changes in 0.12 or prior that are taking effect as of 0.13.0
- Removed Factor (GH3650)
- Removed set_printoptions/reset_printoptions (GH3046)
- Removed _verbose_info (GH3215)
- Removed read_clipboard/to_clipboard/ExcelFile/ExcelWriter from pandas.io.parsers (GH3717); these are available as functions in the main pandas namespace (e.g. pd.read_clipboard)
- The default for tupleize_cols is now False for both to_csv and read_csv. Fair warning in 0.12 (GH3604)

Deprecated in 0.13.0

- iterkv, which will be removed in a future release (this was an alias of iteritems used to bypass 2to3's changes). (GH4384, GH4375, GH4372)
- match, whose role is now performed more idiomatically by extract. In a future release, the default behavior of match will change to become analogous to contains, which returns a boolean indexer. (Their distinction is strictness: match relies on re.match while contains relies on re.search.) In this release, the deprecated behavior is the default, but the new behavior is available through the keyword argument as_indexer=True.

Prior to 0.13, it was impossible to use a label indexer (.loc/.ix) to set a value that was not contained in the index of a particular axis. (GH2578). See the docs
In theSeries case this is effectively an appending operation
In [10]:s=Series([1,2,3])In [11]:sOut[11]:0 11 22 3dtype: int64In [12]:s[5]=5.In [13]:sOut[13]:0 1.01 2.02 3.05 5.0dtype: float64
In [14]:dfi=DataFrame(np.arange(6).reshape(3,2), ....:columns=['A','B']) ....:In [15]:dfiOut[15]: A B0 0 11 2 32 4 5[3 rows x 2 columns]
This would previously raise a KeyError
In [16]:dfi.loc[:,'C']=dfi.loc[:,'A']In [17]:dfiOut[17]: A B C0 0 1 01 2 3 22 4 5 4[3 rows x 3 columns]
This is like an append operation.
In [18]:dfi.loc[3]=5In [19]:dfiOut[19]: A B C0 0 1 01 2 3 22 4 5 43 5 5 5[4 rows x 3 columns]
A Panel setting operation on an arbitrary axis aligns the input to the Panel
In [20]:p=pd.Panel(np.arange(16).reshape(2,4,2), ....:items=['Item1','Item2'], ....:major_axis=pd.date_range('2001/1/12',periods=4), ....:minor_axis=['A','B'],dtype='float64') ....:In [21]:pOut[21]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis)Items axis: Item1 to Item2Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00Minor_axis axis: A to BIn [22]:p.loc[:,:,'C']=Series([30,32],index=p.items)In [23]:pOut[23]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)Items axis: Item1 to Item2Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00Minor_axis axis: A to CIn [24]:p.loc[:,:,'C']Out[24]: Item1 Item22001-01-12 30.0 32.02001-01-13 30.0 32.02001-01-14 30.0 32.02001-01-15 30.0 32.0[4 rows x 2 columns]
Added a new index type, Float64Index. This will be automatically created when passing floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, loc for scalar indexing and slicing work exactly the same. See the docs, (GH263)
Construction is by default for floating type values.
In [25]:index=Index([1.5,2,3,4.5,5])In [26]:indexOut[26]:Float64Index([1.5,2.0,3.0,4.5,5.0],dtype='float64')In [27]:s=Series(range(5),index=index)In [28]:sOut[28]:1.5 02.0 13.0 24.5 35.0 4dtype: int64
Scalar selection for [], .ix, .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0)
In [29]:s[3]Out[29]:2In [30]:s.ix[3]Out[30]:2In [31]:s.loc[3]Out[31]:2
The only positional indexing is via iloc
In [32]:s.iloc[3]Out[32]:3
A scalar index that is not found will raise KeyError
Slicing is ALWAYS on the values of the index, for [], ix, loc and ALWAYS positional with iloc
In [33]:s[2:4]Out[33]:2.0 13.0 2dtype: int64In [34]:s.ix[2:4]Out[34]:2.0 13.0 2dtype: int64In [35]:s.loc[2:4]Out[35]:2.0 13.0 2dtype: int64In [36]:s.iloc[2:4]Out[36]:3.0 24.5 3dtype: int64
In float indexes, slicing using floats is allowed
In [37]:s[2.1:4.6]Out[37]:3.0 24.5 3dtype: int64In [38]:s.loc[2.1:4.6]Out[38]:3.0 24.5 3dtype: int64
Indexing on other index types is preserved (with positional fallback for [], ix), with the exception that floating point slicing on non-Float64Index indexes will now raise a TypeError.
In [1]:Series(range(5))[3.5]TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)In [1]:Series(range(5))[3.5:4.5]TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Using a scalar float indexer will be deprecated in a future version, but is allowed for now.
In [3]:Series(range(5))[3.0]Out[3]:3
Query Format Changes. A much more string-like query format is now supported. See the docs.
In [39]:path='test.h5'In [40]:dfq=DataFrame(randn(10,4), ....:columns=list('ABCD'), ....:index=date_range('20130101',periods=10)) ....:In [41]:dfq.to_hdf(path,'dfq',format='table',data_columns=True)
Use boolean expressions, with in-line function evaluation.
In [42]:read_hdf(path,'dfq', ....:where="index>Timestamp('20130104') & columns=['A', 'B']") ....:Out[42]: A B2013-01-05 1.057633 -0.7914892013-01-06 1.910759 0.7879652013-01-07 1.043945 2.1077852013-01-08 0.749185 -0.6755212013-01-09 -0.276646 1.9245332013-01-10 0.226363 -2.078618[6 rows x 2 columns]
Use an inline column reference
In [43]:read_hdf(path,'dfq', ....:where="A>0 or C>0") ....:Out[43]: A B C D2013-01-01 -0.414505 -1.425795 0.209395 -0.5928862013-01-02 -1.473116 -0.896581 1.104352 -0.4315502013-01-03 -0.161137 0.889157 0.288377 -1.0515392013-01-04 -0.319561 -0.619993 0.156998 -0.5714552013-01-05 1.057633 -0.791489 -0.524627 0.0718782013-01-06 1.910759 0.787965 0.513082 -0.5464162013-01-07 1.043945 2.107785 1.459927 1.0154052013-01-08 0.749185 -0.675521 0.440266 0.6889722013-01-09 -0.276646 1.924533 0.411204 0.8907652013-01-10 0.226363 -2.078618 -0.387886 -0.087107[10 rows x 4 columns]
The format keyword now replaces the table keyword; allowed values are fixed(f) or table(t). The same defaults as prior to 0.13.0 remain, e.g. put implies fixed format and append implies table format. This default format can be set as an option by setting io.hdf.default_format.
In [44]:path='test.h5'In [45]:df=DataFrame(randn(10,2))In [46]:df.to_hdf(path,'df_table',format='table')In [47]:df.to_hdf(path,'df_table2',append=True)In [48]:df.to_hdf(path,'df_fixed')In [49]:withget_store(path)asstore: ....:print(store) ....:<class 'pandas.io.pytables.HDFStore'>File path: test.h5/df_fixed frame (shape->[10,2])/df_table frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])/df_table2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])
Significant table writing performance improvements
handle a passed Series in table format (GH4330)
can now serialize a timedelta64[ns] dtype in a table (GH3577), see the docs.
added an is_open property to indicate if the underlying file handle is open; a closed store will now report ‘CLOSED’ when viewing the store (rather than raising an error) (GH4409)
a close of an HDFStore now will close that instance of the HDFStore but will only close the actual file if the reference count (by PyTables) w.r.t. all of the open handles is 0. Essentially you have a local instance of HDFStore referenced by a variable. Once you close it, it will report closed. Other references (to the same file) will continue to operate until they themselves are closed. Performing an action on a closed file will raise ClosedFileError
In [50]:path='test.h5'In [51]:df=DataFrame(randn(10,2))In [52]:store1=HDFStore(path)In [53]:store2=HDFStore(path)In [54]:store1.append('df',df)In [55]:store2.append('df2',df)In [56]:store1Out[56]:<class 'pandas.io.pytables.HDFStore'>File path: test.h5/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])In [57]:store2Out[57]:<class 'pandas.io.pytables.HDFStore'>File path: test.h5/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])/df2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])In [58]:store1.close()In [59]:store2Out[59]:<class 'pandas.io.pytables.HDFStore'>File path: test.h5/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])/df2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])In [60]:store2.close()In [61]:store2Out[61]:<class 'pandas.io.pytables.HDFStore'>File path: test.h5File is CLOSED
removed the _quiet attribute, replaced by a DuplicateWarning if retrieving duplicate rows from a table (GH4367)
removed the warn argument from open. Instead a PossibleDataLossError exception will be raised if you try to use mode='w' with an OPEN file handle (GH4367)
allow a passed locations array or mask as a where condition (GH4467). See the docs for an example.
add the keyword dropna=True to append to control whether rows that are all NaN are written to the store (the default is True, meaning all-NaN rows are NOT written); also settable via the option io.hdf.dropna_table (GH4625)
pass through store creation arguments; can be used to support in-memory stores
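For example, the PyTables H5FD_CORE driver keeps the store in memory. This sketch assumes PyTables is installed; the driver keywords below are PyTables open_file options, not pandas ones.

import pandas as pd

store = pd.HDFStore('memory.h5',
                    driver='H5FD_CORE',
                    driver_core_backing_store=0)   # nothing is written to disk on close
store['df'] = pd.DataFrame({'a': [1, 2, 3]})
store.close()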
The HTML and plain text representations of DataFrame now show a truncated view of the table once it exceeds a certain size, rather than switching to the short info view (GH4886, GH5550). This makes the representation more consistent as small DataFrames get larger.

To get the info view, call DataFrame.info(). If you prefer the info view as the repr for large DataFrames, you can set this by running set_option('display.large_repr', 'info').
df.to_clipboard() learned a new excel keyword that lets you paste df data directly into Excel (enabled by default). (GH5070).
read_html now raises a URLError instead of catching and raising a ValueError (GH4303, GH4305)
Added a test for read_clipboard() and to_clipboard() (GH4282)
Clipboard functionality now works with PySide (GH4282)
Added a more informative error message when plot arguments contain overlapping color and style arguments (GH4402)
to_dict now takes records as a possible outtype. Returns an array of column-keyed dictionaries. (GH4936)
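A small made-up example (note that in later pandas versions this keyword was renamed to orient):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_dict(outtype='records')
# [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]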
NaN handling in get_dummies (GH4446) with dummy_na
# previously, nan was erroneously counted as 2 here# now it is not counted at allIn [62]:get_dummies([1,2,np.nan])Out[62]: 1.0 2.00 1 01 0 12 0 0[3 rows x 2 columns]# unless requestedIn [63]:get_dummies([1,2,np.nan],dummy_na=True)Out[63]: 1.0 2.0 NaN0 1 0 01 0 1 02 0 0 1[3 rows x 3 columns]
timedelta64[ns] operations. See the docs.
Warning
Most of these operations require numpy >= 1.7
Using the new top-level to_timedelta, you can convert a scalar or array from the standard timedelta format (produced by to_csv) into a timedelta type (np.timedelta64 in nanoseconds).
In [64]:to_timedelta('1 days 06:05:01.00003')Out[64]:Timedelta('1 days 06:05:01.000030')In [65]:to_timedelta('15.5us')Out[65]:Timedelta('0 days 00:00:00.000015')In [66]:to_timedelta(['1 days 06:05:01.00003','15.5us','nan'])Out[66]:TimedeltaIndex(['1 days 06:05:01.000030','0 days 00:00:00.000015',NaT],dtype='timedelta64[ns]',freq=None)In [67]:to_timedelta(np.arange(5),unit='s')Out[67]:TimedeltaIndex(['00:00:00','00:00:01','00:00:02','00:00:03','00:00:04'],dtype='timedelta64[ns]',freq=None)In [68]:to_timedelta(np.arange(5),unit='d')Out[68]:TimedeltaIndex(['0 days','1 days','2 days','3 days','4 days'],dtype='timedelta64[ns]',freq=None)
A Series of dtype timedelta64[ns] can now be divided by another timedelta64[ns] object, or astyped, to yield a float64 dtyped Series. This is frequency conversion. See the docs.
In [69]:fromdatetimeimporttimedeltaIn [70]:td=Series(date_range('20130101',periods=4))-Series(date_range('20121201',periods=4))In [71]:td[2]+=np.timedelta64(timedelta(minutes=5,seconds=3))In [72]:td[3]=np.nanIn [73]:tdOut[73]:0 31 days 00:00:001 31 days 00:00:002 31 days 00:05:033 NaTdtype: timedelta64[ns]# to daysIn [74]:td/np.timedelta64(1,'D')Out[74]:0 31.0000001 31.0000002 31.0035073 NaNdtype: float64In [75]:td.astype('timedelta64[D]')Out[75]:0 31.01 31.02 31.03 NaNdtype: float64# to secondsIn [76]:td/np.timedelta64(1,'s')Out[76]:0 2678400.01 2678400.02 2678703.03 NaNdtype: float64In [77]:td.astype('timedelta64[s]')Out[77]:0 2678400.01 2678400.02 2678703.03 NaNdtype: float64
Dividing or multiplying a timedelta64[ns] Series by an integer or integer Series
In [78]:td*-1Out[78]:0 -31 days +00:00:001 -31 days +00:00:002 -32 days +23:54:573 NaTdtype: timedelta64[ns]In [79]:td*Series([1,2,3,4])Out[79]:0 31 days 00:00:001 62 days 00:00:002 93 days 00:15:093 NaTdtype: timedelta64[ns]
Absolute DateOffset objects can act equivalently to timedeltas
In [80]:frompandasimportoffsetsIn [81]:td+offsets.Minute(5)+offsets.Milli(5)Out[81]:0 31 days 00:05:00.0050001 31 days 00:05:00.0050002 31 days 00:10:03.0050003 NaTdtype: timedelta64[ns]
Fillna is now supported for timedeltas
In [82]:td.fillna(0)Out[82]:0 31 days 00:00:001 31 days 00:00:002 31 days 00:05:033 0 days 00:00:00dtype: timedelta64[ns]In [83]:td.fillna(timedelta(days=1,seconds=5))Out[83]:0 31 days 00:00:001 31 days 00:00:002 31 days 00:05:033 1 days 00:00:05dtype: timedelta64[ns]
You can do numeric reduction operations on timedeltas.
In [84]:td.mean()Out[84]:Timedelta('31 days 00:01:41')In [85]:td.quantile(.1)Out[85]:Timedelta('31 days 00:00:00')
plot(kind='kde') now accepts the optional parameters bw_method and ind, passed to scipy.stats.gaussian_kde() (for scipy >= 0.11.0) to set the bandwidth, and to gkde.evaluate() to specify the indices at which it is evaluated, respectively. See scipy docs. (GH4298)
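A rough sketch of the new keywords (random data; requires scipy and matplotlib):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1000))
s.plot(kind='kde', bw_method=0.3)                 # narrower bandwidth
s.plot(kind='kde', ind=np.linspace(-4, 4, 200))   # evaluate the density at these points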
DataFrame constructor now accepts a numpy masked record array (GH3478)
The new vectorized string method extract returns regular expression matches more conveniently.
In [86]:Series(['a1','b2','c3']).str.extract('[ab](\d)')Out[86]:0 11 22 NaNdtype: object
Elements that do not match return NaN. Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [87]:Series(['a1','b2','c3']).str.extract('([ab])(\d)')Out[87]: 0 10 a 11 b 22 NaN NaN[3 rows x 2 columns]
Elements that do not match return a row of NaN. Thus, a Series of messy strings can be converted into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects.
Named groups like
In [88]:Series(['a1','b2','c3']).str.extract( ....:'(?P<letter>[ab])(?P<digit>\d)') ....:Out[88]: letter digit0 a 11 b 22 NaN NaN[3 rows x 2 columns]
and optional groups can also be used.
In [89]:Series(['a1','b2','3']).str.extract( ....:'(?P<letter>[ab])?(?P<digit>\d)') ....:Out[89]: letter digit0 a 11 b 22 NaN 3[3 rows x 2 columns]
read_stata now accepts Stata 13 format (GH4291)
read_fwf now infers the column specifications from the first 100 rows of the file if the data has correctly separated and properly aligned columns using the delimiter provided to the function (GH4488).
support for nanosecond times as an offset
Warning
These operations require numpy >= 1.7
Period conversions in the range of seconds and below were reworked and extended up to nanoseconds. Periods in the nanosecond range are now available.
In [90]:date_range('2013-01-01',periods=5,freq='5N')Out[90]:DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq='5N')
or with frequency as offset
In [91]:date_range('2013-01-01',periods=5,freq=pd.offsets.Nano(5))Out[91]:DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq='5N')
Timestamps can be modified in the nanosecond range
In [92]:t=Timestamp('20130101 09:01:02')In [93]:t+pd.tseries.offsets.Nano(123)Out[93]:Timestamp('2013-01-01 09:01:02.000000123')
A new method, isin for DataFrames, which plays nicely with boolean indexing. The argument to isin (what we're comparing the DataFrame to) can be a DataFrame, Series, dict, or array of values. See the docs for more.
To get the rows where any of the conditions are met:
In [94]:dfi=DataFrame({'A':[1,2,3,4],'B':['a','b','f','n']})In [95]:dfiOut[95]: A B0 1 a1 2 b2 3 f3 4 n[4 rows x 2 columns]In [96]:other=DataFrame({'A':[1,3,3,7],'B':['e','f','f','e']})In [97]:mask=dfi.isin(other)In [98]:maskOut[98]: A B0 True False1 False False2 True True3 False False[4 rows x 2 columns]In [99]:dfi[mask.any(1)]Out[99]: A B0 1 a2 3 f[2 rows x 2 columns]
Series now supports a to_frame method to convert it to a single-column DataFrame (GH5164)
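A minimal example; the optional name argument shown here (an assumption about the signature, used for illustration) overrides the resulting column name.

import pandas as pd

s = pd.Series([1, 2, 3], name='x')
s.to_frame()              # single-column DataFrame with column 'x'
s.to_frame(name='value')  # override the column name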
All R datasets listed here http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html can now be loaded into pandas objects
# note that pandas.rpy was deprecated in v0.16.0
import pandas.rpy.common as com
com.load_data('Titanic')
tz_localize can infer a fall daylight savings transition based on the structure of the unlocalized data (GH4230), see the docs
DatetimeIndex is now in the API documentation, see the docs
json_normalize() is a new method to allow you to create a flat table from semi-structured JSON data. See the docs (GH1067)
Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
Python csv parser now supports usecols (GH4335)
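Illustrative only ('data.csv' and the column names are placeholders): usecols now works with the Python parsing engine as well as the C engine.

import pandas as pd

df = pd.read_csv('data.csv', engine='python', usecols=['a', 'c'])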
Frequencies gained several new offsets:
DataFrame has a new interpolate method, similar to Series (GH4434, GH1892)
In [100]:df=DataFrame({'A':[1,2.1,np.nan,4.7,5.6,6.8], .....:'B':[.25,np.nan,np.nan,4,12.2,14.4]}) .....:In [101]:df.interpolate()Out[101]: A B0 1.0 0.251 2.1 1.502 3.4 2.753 4.7 4.004 5.6 12.205 6.8 14.40[6 rows x 2 columns]
Additionally, the method argument to interpolate has been expanded to include 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial', 'spline'. The new methods require scipy. Consult the Scipy reference guide and documentation for more information about when the various methods are appropriate. See the docs.
Interpolate now also accepts a limit keyword argument. This works similarly to fillna's limit:
In [102]:ser=Series([1,3,np.nan,np.nan,np.nan,11])In [103]:ser.interpolate(limit=2)Out[103]:0 1.01 3.02 5.03 7.04 NaN5 11.0dtype: float64
Added wide_to_long panel data convenience function. See the docs.
In [104]:np.random.seed(123)In [105]:df=pd.DataFrame({"A1970":{0:"a",1:"b",2:"c"}, .....:"A1980":{0:"d",1:"e",2:"f"}, .....:"B1970":{0:2.5,1:1.2,2:.7}, .....:"B1980":{0:3.2,1:1.3,2:.1}, .....:"X":dict(zip(range(3),np.random.randn(3))) .....:}) .....:In [106]:df["id"]=df.indexIn [107]:dfOut[107]: A1970 A1980 B1970 B1980 X id0 a d 2.5 3.2 -1.085631 01 b e 1.2 1.3 0.997345 12 c f 0.7 0.1 0.282978 2[3 rows x 6 columns]In [108]:wide_to_long(df,["A","B"],i="id",j="year")Out[108]: X A Bid year0 1970 -1.085631 a 2.51 1970 0.997345 b 1.22 1970 0.282978 c 0.70 1980 -1.085631 d 3.21 1980 0.997345 e 1.32 1980 0.282978 f 0.1[6 rows x 3 columns]
- to_csv now takes a date_format keyword argument that specifies how output datetime objects should be formatted. Datetimes encountered in the index, columns, and values will all have this formatting applied. (GH4313)
- DataFrame.plot will scatter plot x versus y by passing kind='scatter' (GH2215)
- The new eval() function implements expression evaluation using numexpr behind the scenes. This results in large speedups for complicated expressions involving large DataFrames/Series. For example,
In [109]:nrows,ncols=20000,100In [110]:df1,df2,df3,df4=[DataFrame(randn(nrows,ncols)) .....:for_inrange(4)] .....:
# eval with NumExpr backendIn [111]:%timeitpd.eval('df1 + df2 + df3 + df4')100 loops, best of 3: 9.21 ms per loop
# pure Python evaluationIn [112]:%timeitdf1+df2+df3+df410 loops, best of 3: 27.2 ms per loop
For more details, see the docs
Similar to pandas.eval, DataFrame has a new DataFrame.eval method that evaluates an expression in the context of the DataFrame. For example,
In [113]:df=DataFrame(randn(10,2),columns=['a','b'])In [114]:df.eval('a + b')Out[114]:0 -0.6852041 1.5897452 0.3254413 -1.7841534 -0.4328935 0.1718506 1.8959197 3.0655878 -0.0927599 1.391365dtype: float64
query() method has been added that allows you to select elements of a DataFrame using a natural query syntax nearly identical to Python syntax. For example,
In [115]:n=20In [116]:df=DataFrame(np.random.randint(n,size=(n,3)),columns=['a','b','c'])In [117]:df.query('a < b < c')Out[117]: a b c11 1 5 815 8 16 19[2 rows x 3 columns]
selects all the rows of df where a < b < c evaluates to True. For more details see the docs.
pd.read_msgpack() and pd.to_msgpack() are now a supported method of serialization of arbitrary pandas (and python objects) in a lightweight portable binary format. See the docs
Warning
Since this is an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.
In [118]:df=DataFrame(np.random.rand(5,2),columns=list('AB'))In [119]:df.to_msgpack('foo.msg')In [120]:pd.read_msgpack('foo.msg')Out[120]: A B0 0.251082 0.0173571 0.347915 0.9298792 0.546233 0.2033683 0.064942 0.0317224 0.355309 0.524575[5 rows x 2 columns]In [121]:s=Series(np.random.rand(5),index=date_range('20130101',periods=5))In [122]:pd.to_msgpack('foo.msg',df,s)In [123]:pd.read_msgpack('foo.msg')Out[123]:[ A B 0 0.251082 0.017357 1 0.347915 0.929879 2 0.546233 0.203368 3 0.064942 0.031722 4 0.355309 0.524575 [5 rows x 2 columns], 2013-01-01 0.022321 2013-01-02 0.227025 2013-01-03 0.383282 2013-01-04 0.193225 2013-01-05 0.110977 Freq: D, dtype: float64]
You can pass iterator=True to iterate over the unpacked results
In [124]:foroinpd.read_msgpack('foo.msg',iterator=True): .....:printo .....: A B0 0.251082 0.0173571 0.347915 0.9298792 0.546233 0.2033683 0.064942 0.0317224 0.355309 0.524575[5 rows x 2 columns]2013-01-01 0.0223212013-01-02 0.2270252013-01-03 0.3832822013-01-04 0.1932252013-01-05 0.110977Freq: D, dtype: float64
pandas.io.gbq provides a simple way to extract from, and load data into, Google’s BigQuery Data Sets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets. See the docs
frompandas.ioimportgbq# A query to select the average monthly temperatures in the# in the year 2000 across the USA. The dataset,# publicata:samples.gsod, is available on all BigQuery accounts,# and is based on NOAA gsod data.query="""SELECT station_number as STATION,month as MONTH, AVG(mean_temp) as MEAN_TEMPFROM publicdata:samples.gsodWHERE YEAR = 2000GROUP BY STATION, MONTHORDER BY STATION, MONTH ASC"""# Fetch the result set for this query# Your Google BigQuery Project ID# To find this, see your dashboard:# https://console.developers.google.com/iam-admin/projects?authuser=0projectid=xxxxxxxxx;df=gbq.read_gbq(query,project_id=projectid)# Use pandas to process and reshape the datasetdf2=df.pivot(index='STATION',columns='MONTH',values='MEAN_TEMP')df3=pandas.concat([df2.min(),df2.mean(),df2.max()],axis=1,keys=["Min Tem","Mean Temp","Max Temp"])
The resulting DataFrame is:
> df3
          Min Tem  Mean Temp    Max Temp
MONTH
1      -53.336667  39.827892   89.770968
2      -49.837500  43.685219   93.437932
3      -77.926087  48.708355   96.099998
4      -82.892858  55.070087   97.317240
5      -92.378261  61.428117  102.042856
6      -77.703334  65.858888  102.900000
7      -87.821428  68.169663  106.510714
8      -89.431999  68.614215  105.500000
9      -86.611112  63.436935  107.142856
10     -78.209677  56.880838   92.103333
11     -50.125000  48.861228   94.996428
12     -50.332258  42.286879   94.396774
Warning
To use this module, you will need a BigQuery account. See <https://cloud.google.com/products/big-query> for details.
As of 10/10/13, there is a bug in Google’s API preventing result sets from being larger than 100,000 rows. A patch is scheduled for the week of 10/14/13.
In 0.13.0 there is a major refactor primarily to subclass Series from NDFrame, which is the base class currently for DataFrame and Panel, to unify methods and behaviors. Series formerly subclassed directly from ndarray. (GH4080, GH3862, GH816)
Warning
There are two potential incompatibilities from < 0.13.0
Using certain numpy functions would previously return a Series if passed a Series as an argument. This seems only to affect np.ones_like, np.empty_like, np.diff and np.where. These now return ndarrays.
In [125]:s=Series([1,2,3,4])
Numpy Usage
In [126]:np.ones_like(s)Out[126]:array([1,1,1,1])In [127]:np.diff(s)Out[127]:array([1,1,1])In [128]:np.where(s>1,s,np.nan)Out[128]:array([nan,2.,3.,4.])
Pandonic Usage
In [129]:Series(1,index=s.index)Out[129]:0 11 12 13 1dtype: int64In [130]:s.diff()Out[130]:0 NaN1 1.02 1.03 1.0dtype: float64In [131]:s.where(s>1)Out[131]:0 NaN1 2.02 3.03 4.0dtype: float64
Passing a Series directly to a cython function expecting an ndarray type will no longer work directly; you must pass Series.values. See Enhancing Performance
Series(0.5) would previously return the scalar 0.5, instead this will return a 1-element Series
This change breaks rpy2 <= 2.3.8. An issue has been opened against rpy2 and a workaround is detailed in GH5698. Thanks @JanSchulz.
Pickle compatibility is preserved for pickles created prior to 0.13. These must be unpickled with pd.read_pickle, see Pickling.
Refactor of series.py/frame.py/panel.py to move common code to generic.py
- _setup_axes to create generic NDFrame structures
- from_axes, _wrap_array, axes, ix, loc, iloc, shape, empty, swapaxes, transpose, pop
- __iter__, keys, __contains__, __len__, __neg__, __invert__
- convert_objects, as_blocks, as_matrix, values
- __getstate__, __setstate__ (compat remains in frame/panel)
- __getattr__, __setattr__
- _indexed_same, reindex_like, align, where, mask
- fillna, replace (Series replace is now consistent with DataFrame)
- filter (also added axis argument to selectively filter on a different axis)
- reindex, reindex_axis, take
- truncate (moved to become part of NDFrame)

These are API changes which make Panel more consistent with DataFrame
- swapaxes on a Panel with the same axes specified now returns a copy
- filter supports the same API as the DataFrame filter
- reindex called with no arguments will now return a copy of the input object
TimeSeries is now an alias for Series. The property is_time_series can be used to distinguish (if desired)
Refactor of Sparse objects to use BlockManager
- SparseBlock, which can hold multi-dtypes and is non-consolidatable
- SparseSeries and SparseDataFrame now inherit more methods from their hierarchy (Series/DataFrame), and no longer inherit from SparseArray (which instead is the object of the SparseBlock)
- SparseSeries for boolean/integer/slices
- SparsePanel implementation is unchanged (e.g. not using BlockManager, needs work)
- added ftypes method to Series/DataFrame, similar to dtypes, but indicates if the underlying is sparse/dense (as well as the dtype)
All NDFrame objects can now use __finalize__() to specify various values to propagate to new objects from an existing one (e.g. name in Series will follow more automatically now)
Internal type checking is now done via a suite of generated classes, allowing isinstance(value, klass) without having to directly import the klass, courtesy of @jtratner
Bug in Series update where the parent frame is not updating its cache based on changes (GH4080) or types (GH3217), fillna (GH3386)
Refactor Series.reindex to core/generic.py (GH4604, GH4618), allow method= in reindexing on a Series to work
Series.copy no longer accepts the order parameter and is now consistent with NDFrame copy
Refactor rename methods to core/generic.py; fixes Series.rename for (GH4605), and adds rename with the same signature for Panel
Refactor clip methods to core/generic.py (GH4798)
Refactor of _get_numeric_data/_get_bool_data to core/generic.py, allowing Series/Panel functionality
Series (for index) / Panel (for items) now allow attribute access to their elements (GH1903)
In [132]:s=Series([1,2,3],index=list('abc'))In [133]:s.bOut[133]:2In [134]:s.a=5In [135]:sOut[135]:a 5b 2c 3dtype: int64
See V0.13.0 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.0.
See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.
This is a major release from 0.11.0 and includes several new features and enhancements along with a large number of bug fixes.
Highlights include a consistent I/O API naming scheme, routines to read html, write multi-indexes to csv files, read & write STATA data files, read & write JSON format files, Python 3 support for HDFStore, filtering of groupby expressions via filter, and a revamped replace routine that accepts regular expressions.
The I/O API is now much more consistent with a set of top-level reader functions accessed like pd.read_csv() that generally return a pandas object:

read_csv, read_excel, read_hdf, read_sql, read_json, read_html, read_stata, read_clipboard

The corresponding writer functions are object methods that are accessed like df.to_csv():

to_csv, to_excel, to_hdf, to_sql, to_json, to_html, to_stata, to_clipboard

Fix modulo and integer division on Series, DataFrames to act similarly to
floatdtypes to returnnp.nanornp.infas appropriate (GH3590). This correct a numpy bug that treatsintegerandfloatdtypes differently.In [1]:p=DataFrame({'first':[4,5,8],'second':[0,0,3]})In [2]:p%0Out[2]: first second0 NaN NaN1 NaN NaN2 NaN NaN[3 rows x 2 columns]In [3]:p%pOut[3]: first second0 0.0 NaN1 0.0 NaN2 0.0 0.0[3 rows x 2 columns]In [4]:p/pOut[4]: first second0 1.0 NaN1 1.0 NaN2 1.0 1.0[3 rows x 2 columns]In [5]:p/0Out[5]: first second0 inf NaN1 inf NaN2 inf inf[3 rows x 2 columns]Add
squeezekeyword togroupbyto allow reduction fromDataFrame -> Series if groups are unique. This is a Regression from 0.10.1.We are reverting back to the prior behavior. This means groupby will return thesame shaped objects whether the groups are unique or not. Revert this issue (GH2893)with (GH3596).In [6]:df2=DataFrame([{"val1":1,"val2":20},{"val1":1,"val2":19}, ...:{"val1":1,"val2":27},{"val1":1,"val2":12}]) ...:In [7]:deffunc(dataf): ...:returndataf["val2"]-dataf["val2"].mean() ...:# squeezing the result frame to a series (because we have unique groups)In [8]:df2.groupby("val1",squeeze=True).apply(func)Out[8]:0 0.51 -0.52 7.53 -7.5Name: 1, dtype: float64# no squeezing (the default, and behavior in 0.10.1)In [9]:df2.groupby("val1").apply(func)Out[9]:val2 0 1 2 3val11 0.5 -0.5 7.5 -7.5[1 rows x 4 columns]Raise on
iloc when boolean indexing with a label based indexer mask, e.g. a boolean Series, even with integer labels, will raise. Since iloc is purely positional based, the labels on the Series are not alignable (GH3631). This case is rarely used, and there are plenty of alternatives. This preserves the
ilocAPI to bepurely positional based.In [10]:df=DataFrame(lrange(5),list('ABCDE'),columns=['a'])In [11]:mask=(df.a%2==0)In [12]:maskOut[12]:A TrueB FalseC TrueD FalseE TrueName: a, dtype: bool# this is what you should useIn [13]:df.loc[mask]Out[13]: aA 0C 2E 4[3 rows x 1 columns]# this will work as wellIn [14]:df.iloc[mask.values]Out[14]: aA 0C 2E 4[3 rows x 1 columns]
df.iloc[mask] will raise a ValueError.

The raise_on_error argument to plotting functions is removed. Instead, plotting functions raise a TypeError when the dtype of the object is object to remind you to avoid object arrays whenever possible and thus you should cast to an appropriate numeric dtype if you need to plot something.

Add colormap keyword to DataFrame plotting methods. Accepts either a matplotlib colormap object (ie, matplotlib.cm.jet) or a string name of such an object (ie, 'jet'). The colormap is sampled to select the color for each column. Please see Colormaps for more information. (GH3860)
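A quick sketch (random data): each column's color is sampled from the chosen colormap.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4).cumsum(0), columns=list('ABCD'))
df.plot(colormap='jet')                # by name
# df.plot(colormap=matplotlib.cm.jet)  # or pass the colormap object itself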
DataFrame.interpolate() is now deprecated. Please use DataFrame.fillna() and DataFrame.replace() instead. (GH3582, GH3675, GH3676)

The method and axis arguments of DataFrame.replace() are deprecated.

DataFrame.replace's infer_types parameter is removed and now performs conversion by default. (GH3907)

Add the keyword allow_duplicates to DataFrame.insert to allow a duplicate column to be inserted if True; default is False (same as prior to 0.12) (GH3679)

IO api
added top-level function read_excel to replace the following. The original API is deprecated and will be removed in a future version

from pandas.io.parsers import ExcelFile
xls = ExcelFile('path_to_file.xls')
xls.parse('Sheet1', index_col=None, na_values=['NA'])

With

import pandas as pd
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

added top-level function read_sql that is equivalent to the following

from pandas.io.sql import read_frame
read_frame(....)
DataFrame.to_html and DataFrame.to_latex now accept a path for their first argument (GH3702)

Do not allow astypes on datetime64[ns] except to object, and timedelta64[ns] to object/int (GH3425)

The behavior of datetime64 dtypes has changed with respect to certain so-called reduction operations (GH3726). The following operations now raise a TypeError when performed on a Series and return an empty Series when performed on a DataFrame similar to performing these operations on, for example, a DataFrame of slice objects:
- sum, prod, mean, std, var, skew, kurt, corr, and cov
read_html now defaults to None when reading, and falls back on bs4 + html5lib when lxml fails to parse. A list of parsers to try until success is also valid.

The internal pandas class hierarchy has changed (slightly). The previous PandasObject now is called PandasContainer and a new PandasObject has become the base class for PandasContainer as well as Index, Categorical, GroupBy, SparseList, and SparseArray (+ their base classes). Currently, PandasObject provides string methods (from StringMixin). (GH4090, GH4092)

New StringMixin that, given a __unicode__ method, gets python 2 and python 3 compatible string methods (__str__, __bytes__, and __repr__). Plus string safety throughout. Now employed in many places throughout the pandas library. (GH4090, GH4092)
pd.read_html() can now parse HTML strings, files or urls and return DataFrames, courtesy of @cpcloud. (GH3477, GH3605, GH3606, GH3616). It works with a single parser backend: BeautifulSoup4 + html5lib. See the docs.

You can use
pd.read_html()to read the output fromDataFrame.to_html()like soIn [15]:df=DataFrame({'a':range(3),'b':list('abc')})In [16]:print(df) a b0 0 a1 1 b2 2 c[3 rows x 2 columns]In [17]:html=df.to_html()In [18]:alist=pd.read_html(html,index_col=0)In [19]:print(df==alist[0]) a b0 True True1 True True2 True True[3 rows x 2 columns]Note that
alist here is a Python list, so pd.read_html() and DataFrame.to_html() are not inverses.

pd.read_html() no longer performs hard conversion of date strings (GH3656).

Warning

You may have to install an older version of BeautifulSoup4. See the installation docs
Added module for reading and writing Stata files:
pandas.io.stata (GH1512), accessible via the read_stata top-level function for reading, and the to_stata DataFrame method for writing. See the docs.

Added module for reading and writing json format files: pandas.io.json, accessible via the read_json top-level function for reading, and the to_json DataFrame method for writing. See the docs; various issues (GH1226, GH3804, GH3876, GH3867, GH1305)

MultiIndex column support for reading and writing csv format files
The header option in read_csv now accepts a list of the rows from which to read the index.

The option tupleize_cols can now be specified in both to_csv and read_csv, to provide compatibility for the pre-0.12 behavior of writing and reading MultiIndex columns via a list of tuples. The default in 0.12 is to write lists of tuples and not interpret lists of tuples as a MultiIndex column. Note: The default behavior in 0.12 remains unchanged from prior versions, but starting with 0.13, the default to write and read MultiIndex columns will be in the new format. (GH3571, GH1651, GH3141)

If an
index_colis not specified (e.g. you don’t have an index, or wrote itwithdf.to_csv(...,index=False), then anynameson the columns index willbelost.In [20]:frompandas.util.testingimportmakeCustomDataframeasmkdfIn [21]:df=mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)In [22]:df.to_csv('mi.csv',tupleize_cols=False)In [23]:print(open('mi.csv').read())C0,,C_l0_g0,C_l0_g1,C_l0_g2C1,,C_l1_g0,C_l1_g1,C_l1_g2C2,,C_l2_g0,C_l2_g1,C_l2_g2C3,,C_l3_g0,C_l3_g1,C_l3_g2R0,R1,,,R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2In [24]:pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)Out[24]:C0 C_l0_g0 C_l0_g1 C_l0_g2C1 C_l1_g0 C_l1_g1 C_l1_g2C2 C_l2_g0 C_l2_g1 C_l2_g2C3 C_l3_g0 C_l3_g1 C_l3_g2R0 R1R_l0_g0 R_l1_g0 R0C0 R0C1 R0C2R_l0_g1 R_l1_g1 R1C0 R1C1 R1C2R_l0_g2 R_l1_g2 R2C0 R2C1 R2C2R_l0_g3 R_l1_g3 R3C0 R3C1 R3C2R_l0_g4 R_l1_g4 R4C0 R4C1 R4C2[5 rows x 3 columns]Support for
HDFStore (via PyTables 3.0.0) on Python 3

Iterator support via
read_hdfthat automatically opens and closes thestore when iteration is finished. This is only fortablesIn [25]:path='store_iterator.h5'In [26]:DataFrame(randn(10,2)).to_hdf(path,'df',table=True)In [27]:fordfinread_hdf(path,'df',chunksize=3): ....:printdf ....: 0 10 0.713216 -0.7784611 -0.661062 0.8628772 0.344342 0.149565 0 13 -0.626968 -0.8757724 -0.930687 -0.2189835 0.949965 -0.442354 0 16 -0.402985 1.1113587 -0.241527 -0.6704778 0.049355 0.632633 0 19 -1.502767 -1.225492
read_csv will now throw a more informative error message when a file contains no columns, e.g., all newline characters

DataFrame.replace() now allows regular expressions on contained Series with object dtype. See the examples section in the regular docs, Replacing via String Expression. For example you can do
In [25]:df=DataFrame({'a':list('ab..'),'b':[1,2,3,4]})In [26]:df.replace(regex=r'\s*\.\s*',value=np.nan)Out[26]: a b0 a 11 b 22 NaN 33 NaN 4[4 rows x 2 columns]to replace all occurrences of the string
'.' with zero or more instances of surrounding whitespace with NaN.

Regular string replacement still works as expected. For example, you can do
In [27]:df.replace('.',np.nan)Out[27]: a b0 a 11 b 22 NaN 33 NaN 4[4 rows x 2 columns]to replace all occurrences of the string
'.' with NaN.

pd.melt() now accepts the optional parameters var_name and value_name to specify custom column names of the returned DataFrame.
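A small made-up example of the new keywords:

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'height': [1.7, 1.8], 'weight': [60, 70]})
pd.melt(df, id_vars=['id'], var_name='measurement', value_name='reading')
#    id measurement  reading
# 0   1      height      1.7
# 1   2      height      1.8
# 2   1      weight     60.0
# 3   2      weight     70.0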
pd.set_option() now allows N option, value pairs (GH3667). Let's say that we had an option
'a.b'and another option'b.c'.We can set them at the same time:In [28]:pd.get_option('a.b')Out[28]:2In [29]:pd.get_option('b.c')Out[29]:3In [30]:pd.set_option('a.b',1,'b.c',4)In [31]:pd.get_option('a.b')Out[31]:1In [32]:pd.get_option('b.c')Out[32]:4The
filtermethod for group objects returns a subset of the originalobject. Suppose we want to take only elements that belong to groups with agroup sum greater than 2.In [33]:sf=Series([1,1,2,3,3,3])In [34]:sf.groupby(sf).filter(lambdax:x.sum()>2)Out[34]:3 34 35 3dtype: int64The argument of
filter must be a function that, applied to the group as a whole, returns True or False.

Another useful operation is filtering out elements that belong to groups with only a couple members.
In [35]:dff=DataFrame({'A':np.arange(8),'B':list('aabbbbcc')})In [36]:dff.groupby('B').filter(lambdax:len(x)>2)Out[36]: A B2 2 b3 3 b4 4 b5 5 b[4 rows x 2 columns]Alternatively, instead of dropping the offending groups, we can return alike-indexed objects where the groups that do not pass the filter arefilled with NaNs.
In [37]:dff.groupby('B').filter(lambdax:len(x)>2,dropna=False)Out[37]: A B0 NaN NaN1 NaN NaN2 2.0 b3 3.0 b4 4.0 b5 5.0 b6 NaN NaN7 NaN NaN[8 rows x 2 columns]Series and DataFrame hist methods now take a
figsize argument (GH3834)

DatetimeIndexes no longer try to convert mixed-integer indexes during join operations (GH3877)

Timestamp.min and Timestamp.max now represent valid Timestamp instances instead of the default datetime.min and datetime.max (respectively), thanks @SleepingPills
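For instance (exact printed values omitted, since they follow from the nanosecond-resolution bounds):

import pandas as pd

pd.Timestamp.min   # earliest representable Timestamp (in the year 1677)
pd.Timestamp.max   # latest representable Timestamp (in the year 2262)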
read_html now raises when no tables are found and BeautifulSoup==4.2.0 is detected (GH4214)
Added experimental
CustomBusinessDay class to support DateOffsets with custom holiday calendars and custom weekmasks. (GH2301)

Note

This uses the
numpy.busdaycalendarAPI introduced in Numpy 1.7 andtherefore requires Numpy 1.7.0 or newer.In [38]:frompandas.tseries.offsetsimportCustomBusinessDayIn [39]:fromdatetimeimportdatetime# As an interesting example, let's look at Egypt where# a Friday-Saturday weekend is observed.In [40]:weekmask_egypt='Sun Mon Tue Wed Thu'# They also observe International Workers' Day so let's# add that for a couple of yearsIn [41]:holidays=['2012-05-01',datetime(2013,5,1),np.datetime64('2014-05-01')]In [42]:bday_egypt=CustomBusinessDay(holidays=holidays,weekmask=weekmask_egypt)In [43]:dt=datetime(2013,4,30)In [44]:print(dt+2*bday_egypt)2013-05-05 00:00:00In [45]:dts=date_range(dt,periods=5,freq=bday_egypt)In [46]:print(Series(dts.weekday,dts).map(Series('Mon Tue Wed Thu Fri Sat Sun'.split())))2013-04-30 Tue2013-05-02 Thu2013-05-05 Sun2013-05-06 Mon2013-05-07 TueFreq: C, dtype: object
Plotting functions now raise a
TypeError before trying to plot anything if the associated objects have a dtype of object (GH1818, GH3572, GH3911, GH3912), but they will try to convert object arrays to numeric arrays if possible so that you can still plot, for example, an object array with floats. This happens before any drawing takes place which eliminates any spurious plots from showing up.

fillna methods now raise a TypeError if the value parameter is a list or tuple.
Series.strnow supports iteration (GH3638). You can iterate over theindividual elements of each string in theSeries. Each iteration yieldsyields aSerieswith either a single character at each index of theoriginalSeriesorNaN. For example,In [47]:strs='go','bow','joe','slow'In [48]:ds=Series(strs)In [49]:forsinds.str: ....:print(s) ....:0 g1 b2 j3 sdtype: object0 o1 o2 o3 ldtype: object0 NaN1 w2 e3 odtype: object0 NaN1 NaN2 NaN3 wdtype: objectIn [50]:sOut[50]:0 NaN1 NaN2 NaN3 wdtype: objectIn [51]:s.dropna().values.item()=='w'Out[51]:TrueThe last element yielded by the iterator will be a
Series containing the last element of the longest string in the Series with all other elements being NaN. Here, since 'slow' is the longest string and there are no other strings with the same length, 'w' is the only non-null string in the yielded Series.
HDFStore
- will retain index attributes (freq,tz,name) on recreation (GH3499)
- will warn with an AttributeConflictWarning if you are attempting to append an index with a different frequency than the existing, or attempting to append an index with a different name than the existing
- support datelike columns with a timezone as data_columns (GH2852)
Non-unique index support clarified (GH3468).
- Fix assigning a new index to a duplicate index in a DataFrame would fail (GH3468)
- Fix construction of a DataFrame with a duplicate index
- ref_locs support to allow duplicative indices across dtypes,allows iget support to always find the index (even across dtypes) (GH2194)
- applymap on a DataFrame with a non-unique index now works(removed warning) (GH2786), and fix (GH3230)
- Fix to_csv to handle non-unique columns (GH3495)
- Duplicate indexes with getitem will return items in the correct order (GH3455,GH3457)and handle missing elements like unique indices (GH3561)
- Duplicate indexes with an empty DataFrame.from_records will return a correct frame (GH3562)
- Concat to produce a non-unique columns when duplicates are across dtypes is fixed (GH3602)
- Allow insert/delete to non-unique columns (GH3679)
- Non-unique indexing with a slice via loc and friends fixed (GH3659)
- Allow insert/delete to non-unique columns (GH3679)
- Extend reindex to correctly deal with non-unique indices (GH3679)
- DataFrame.itertuples() now works with frames with duplicate column names (GH3873)
- Bug in non-unique indexing via iloc (GH4017); added takeable argument to reindex for location-based taking
- Allow non-unique indexing in series via .ix/.loc and __getitem__ (GH4246)
- Fixed non-unique indexing memory allocation issue with .ix/.loc (GH4280)
DataFrame.from_records did not accept empty recarrays (GH3682)

read_html now correctly skips tests (GH3741)

Fixed a bug where DataFrame.replace with a compiled regular expression in the to_replace argument wasn't working (GH3907)

Improved network test decorator to catch IOError (and therefore URLError as well). Added with_connectivity_check decorator to allow explicitly checking a website as a proxy for seeing if there is network connectivity. Plus, new optional_args decorator factory for decorators. (GH3910, GH3914)

Fixed testing issue where too many sockets were open thus leading to a connection reset issue (GH3982, GH3985, GH4028, GH4054)
Fixed failing tests in test_yahoo, test_google where symbols were not retrieved but were being accessed (GH3982, GH3985, GH4028, GH4054)

Series.hist will now take the figure from the current environment if one is not passed

Fixed bug where a 1xN DataFrame would barf on a 1xN mask (GH4071)

Fixed running of tox under python3 where the pickle import was getting rewritten in an incompatible way (GH4062, GH4063)

Fixed bug where sharex and sharey were not being passed to grouped_hist (GH4089)

Fixed bug in DataFrame.replace where a nested dict wasn't being iterated over when regex=False (GH4115)

Fixed bug in the parsing of microseconds when using the format argument in to_datetime (GH4152)

Fixed bug in PandasAutoDateLocator where invert_xaxis triggered incorrectly MilliSecondLocator (GH3990)

Fixed bug in plotting that wasn't raising on invalid colormap for matplotlib 1.1.1 (GH4215)

Fixed the legend displaying in DataFrame.plot(kind='kde') (GH4216)

Fixed bug where Index slices weren't carrying the name attribute (GH4226)

Fixed bug in initializing DatetimeIndex with an array of strings in a certain time zone (GH4229)

Fixed bug where html5lib wasn't being properly skipped (GH4265)

Fixed bug where get_data_famafrench wasn't using the correct file edges (GH4281)
See the full release notes or issue tracker on GitHub for a complete list.
This is a major release from 0.10.1 and includes many new features and enhancements along with a large number of bug fixes. The methods of Selecting Data have had quite a number of additions, and Dtype support is now full-fledged. There are also a number of important API changes that long-time pandas users should pay close attention to.
There is a new section in the documentation, 10 Minutes to Pandas, primarily geared to new users.
There is a new section in the documentation, Cookbook, a collection of useful recipes in pandas (and that we want contributions!).
There are several libraries that are now Recommended Dependencies
Starting in 0.11.0, object selection has had a number of user-requested additions inorder to support more explicit location based indexing. Pandas now supportsthree types of multi-axis indexing.
.loc is strictly label based, will raise KeyError when the items are not found. Allowed inputs are:

- 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index)
- ['a', 'b', 'c']
- 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!)

See more at Selection by Label

.iloc is strictly integer position based (from 0 to length-1 of the axis), and will raise IndexError when the requested indices are out of bounds. Allowed inputs are:

- 5
- [4, 3, 0]
- 1:7

See more at Selection by Position

.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access. .ix is the most general and will support any of the inputs to .loc and .iloc, as well as support for floating point label schemes. .ix is especially useful when dealing with mixed positional and label-based hierarchical indexes.

As using integer slices with .ix has different behavior depending on whether the slice is interpreted as position based or label based, it's usually better to be explicit and use .iloc or .loc.
See more at Advanced Indexing and Advanced Hierarchical.
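A small sketch contrasting the three indexers (example Series made up; .ix was deprecated in later pandas versions in favor of .loc/.iloc):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.loc['a':'b']   # label based; both endpoints included  -> a, b
s.iloc[0:2]      # position based; end point excluded    -> a, b
s.ix['b']        # label first, falling back to position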
Starting in version 0.11.0, these methods may be deprecated in future versions.

- irow
- icol
- iget_value

See the section Selection by Position for substitutes.

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [1]:df1=DataFrame(randn(8,1),columns=['A'],dtype='float32')In [2]:df1Out[2]: A0 1.3926651 -0.1234972 -0.4027613 -0.2466044 -0.2884335 -0.7634346 2.0695267 -1.203569[8 rows x 1 columns]In [3]:df1.dtypesOut[3]:A float32dtype: objectIn [4]:df2=DataFrame(dict(A=Series(randn(8),dtype='float16'), ...:B=Series(randn(8)), ...:C=Series(randn(8),dtype='uint8'))) ...:In [5]:df2Out[5]: A B C0 0.591797 -0.038605 01 0.841309 -0.460478 12 -0.500977 -0.310458 03 -0.816406 0.866493 2544 -0.207031 0.245972 05 -0.664062 0.319442 16 0.580566 1.378512 17 -0.965820 0.292502 255[8 rows x 3 columns]In [6]:df2.dtypesOut[6]:A float16B float64C uint8dtype: object# here you get some upcastingIn [7]:df3=df1.reindex_like(df2).fillna(value=0.0)+df2In [8]:df3Out[8]: A B C0 1.984462 -0.038605 0.01 0.717812 -0.460478 1.02 -0.903737 -0.310458 0.03 -1.063011 0.866493 254.04 -0.495465 0.245972 0.05 -1.427497 0.319442 1.06 2.650092 1.378512 1.07 -2.169390 0.292502 255.0[8 rows x 3 columns]In [9]:df3.dtypesOut[9]:A float32B float64C float64dtype: object
This is lowest-common-denominator upcasting, meaning you get the dtype which can accommodate all of the types
In [10]:df3.values.dtypeOut[10]:dtype('float64')
Conversion
In [11]:df3.astype('float32').dtypesOut[11]:A float32B float32C float32dtype: object
Mixed Conversion
In [12]:df3['D']='1.'In [13]:df3['E']='1'In [14]:df3.convert_objects(convert_numeric=True).dtypesOut[14]:A float32B float64C float64D float64E int64dtype: object# same, but specific dtype conversionIn [15]:df3['D']=df3['D'].astype('float16')In [16]:df3['E']=df3['E'].astype('int32')In [17]:df3.dtypesOut[17]:A float32B float64C float64D float16E int32dtype: object
Forcing date coercion (and setting NaT when not datelike)
In [18]:fromdatetimeimportdatetimeIn [19]:s=Series([datetime(2001,1,1,0,0),'foo',1.0,1, ....:Timestamp('20010104'),'20010105'],dtype='O') ....:In [20]:s.convert_objects(convert_dates='coerce')Out[20]:0 2001-01-011 NaT2 NaT3 NaT4 2001-01-045 2001-01-05dtype: datetime64[ns]
Platform Gotchas
Starting in 0.11.0, construction of DataFrame/Series will use default dtypes of int64 and float64, regardless of platform. This is not an apparent change from earlier versions of pandas. If you specify dtypes, they WILL be respected, however (GH2837).
The following will all result in int64 dtypes
In [21]:DataFrame([1,2],columns=['a']).dtypesOut[21]:a int64dtype: objectIn [22]:DataFrame({'a':[1,2]}).dtypesOut[22]:a int64dtype: objectIn [23]:DataFrame({'a':1},index=range(2)).dtypesOut[23]:a int64dtype: object
Keep in mind that DataFrame(np.array([1,2])) WILL result in int32 on 32-bit platforms!
Upcasting Gotchas
Performing indexing operations on integer type data can easily upcast the data. The dtype of the input data will be preserved in cases where nans are not introduced.
In [24]:dfi=df3.astype('int32')In [25]:dfi['D']=dfi['D'].astype('int64')In [26]:dfiOut[26]: A B C D E0 1 0 0 1 11 0 0 1 1 12 0 0 0 1 13 -1 0 254 1 14 0 0 0 1 15 -1 0 1 1 16 2 1 1 1 17 -2 0 255 1 1[8 rows x 5 columns]In [27]:dfi.dtypesOut[27]:A int32B int32C int32D int64E int32dtype: objectIn [28]:casted=dfi[dfi>0]In [29]:castedOut[29]: A B C D E0 1.0 NaN NaN 1 11 NaN NaN 1.0 1 12 NaN NaN NaN 1 13 NaN NaN 254.0 1 14 NaN NaN NaN 1 15 NaN NaN 1.0 1 16 2.0 1.0 1.0 1 17 NaN NaN 255.0 1 1[8 rows x 5 columns]In [30]:casted.dtypesOut[30]:A float64B float64C float64D int64E int32dtype: object
Float dtypes, by contrast, are unchanged.
In [31]:df4=df3.copy()In [32]:df4['A']=df4['A'].astype('float32')In [33]:df4.dtypesOut[33]:A float32B float64C float64D float16E int32dtype: objectIn [34]:casted=df4[df4>0]In [35]:castedOut[35]: A B C D E0 1.984462 NaN NaN 1.0 11 0.717812 NaN 1.0 1.0 12 NaN NaN NaN 1.0 13 NaN 0.866493 254.0 1.0 14 NaN 0.245972 NaN 1.0 15 NaN 0.319442 1.0 1.0 16 2.650092 1.378512 1.0 1.0 17 NaN 0.292502 255.0 1.0 1[8 rows x 5 columns]In [36]:casted.dtypesOut[36]:A float32B float64C float64D float16E int32dtype: object
Datetime64[ns] columns in a DataFrame (or a Series) allow the use of np.nan to indicate a nan value, in addition to the traditional NaT, or not-a-time. This allows convenient nan setting in a generic way. Furthermore, datetime64[ns] columns are created by default when passed datetimelike objects (this change was introduced in 0.10.1) (GH2809, GH2810).
In [37]:df=DataFrame(randn(6,2),date_range('20010102',periods=6),columns=['A','B'])In [38]:df['timestamp']=Timestamp('20010103')In [39]:dfOut[39]: A B timestamp2001-01-02 1.023958 0.660103 2001-01-032001-01-03 1.236475 -2.170629 2001-01-032001-01-04 -0.270630 -1.685677 2001-01-032001-01-05 -0.440747 -0.115070 2001-01-032001-01-06 -0.632102 -0.585977 2001-01-032001-01-07 -1.444787 -0.201135 2001-01-03[6 rows x 3 columns]# datetime64[ns] out of the boxIn [40]:df.get_dtype_counts()Out[40]:datetime64[ns] 1float64 2dtype: int64# use the traditional nan, which is mapped to NaT internallyIn [41]:df.ix[2:4,['A','timestamp']]=np.nanIn [42]:dfOut[42]: A B timestamp2001-01-02 1.023958 0.660103 2001-01-032001-01-03 1.236475 -2.170629 2001-01-032001-01-04 NaN -1.685677 NaT2001-01-05 NaN -0.115070 NaT2001-01-06 -0.632102 -0.585977 2001-01-032001-01-07 -1.444787 -0.201135 2001-01-03[6 rows x 3 columns]
Astype conversion on datetime64[ns] to object implicitly converts NaT to np.nan.
In [43]:importdatetimeIn [44]:s=Series([datetime.datetime(2001,1,2,0,0)foriinrange(3)])In [45]:s.dtypeOut[45]:dtype('<M8[ns]')In [46]:s[1]=np.nanIn [47]:sOut[47]:0 2001-01-021 NaT2 2001-01-02dtype: datetime64[ns]In [48]:s.dtypeOut[48]:dtype('<M8[ns]')In [49]:s=s.astype('O')In [50]:sOut[50]:0 2001-01-02 00:00:001 NaT2 2001-01-02 00:00:00dtype: objectIn [51]:s.dtypeOut[51]:dtype('O')
- Added to_series() method to indices, to facilitate the creation of indexers (GH3275)
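A minimal sketch of to_series(), using a hypothetical index:

import pandas as pd
idx = pd.date_range('2000-01-01', periods=3)
s = idx.to_series()   # a Series whose values and index are both the index labels,
                      # convenient for building indexers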
HDFStore
- added the method select_column to select a single column from a table as a Series.
- deprecated the unique method; it can be replicated by select_column(key, column).unique()
- the min_itemsize parameter to append will now automatically create data_columns for passed keys
Improved performance of df.to_csv() by up to 10x in some cases. (GH3059)
Numexpr is now a Recommended Dependency, to accelerate certain types of numerical and boolean operations.
Bottleneck is now a Recommended Dependency, to accelerate certain types of nan operations.
HDFStore
- support read_hdf/to_hdf API similar to read_csv/to_csv
In [52]: df = DataFrame(dict(A=lrange(5), B=lrange(5)))
In [53]: df.to_hdf('store.h5', 'table', append=True)
In [54]: read_hdf('store.h5', 'table', where=['index>2'])
Out[54]:
   A  B
3  3  3
4  4  4
[2 rows x 2 columns]
- provide dotted (attribute) access to get from stores, e.g. store.df == store['df']
- new keywords iterator=boolean and chunksize=number_in_a_chunk are provided to support iteration on select and select_as_multiple (GH3076)
You can now select timestamps from an unordered timeseries similarly to an ordered timeseries (GH2437)
You can now select with a string from a DataFrame with a datelike index, in a similar way to a Series (GH3070)
In [55]:idx=date_range("2001-10-1",periods=5,freq='M')In [56]:ts=Series(np.random.rand(len(idx)),index=idx)In [57]:ts['2001']Out[57]:2001-10-31 0.6632562001-11-30 0.0791262001-12-31 0.587699Freq: M, dtype: float64In [58]:df=DataFrame(dict(A=ts))In [59]:df['2001']Out[59]: A2001-10-31 0.6632562001-11-30 0.0791262001-12-31 0.587699[3 rows x 1 columns]
Squeezeto possibly remove length 1 dimensions from an object.In [60]:p=Panel(randn(3,4,4),items=['ItemA','ItemB','ItemC'], ....:major_axis=date_range('20010102',periods=4), ....:minor_axis=['A','B','C','D']) ....:In [61]:pOut[61]:<class 'pandas.core.panel.Panel'>Dimensions: 3 (items) x 4 (major_axis) x 4 (minor_axis)Items axis: ItemA to ItemCMajor_axis axis: 2001-01-02 00:00:00 to 2001-01-05 00:00:00Minor_axis axis: A to DIn [62]:p.reindex(items=['ItemA']).squeeze()Out[62]: A B C D2001-01-02 -1.203403 0.425882 -0.436045 -0.9824622001-01-03 0.348090 -0.969649 0.121731 0.2027982001-01-04 1.215695 -0.218549 -0.631381 -0.3371162001-01-05 0.404238 0.907213 -0.865657 0.483186[4 rows x 4 columns]In [63]:p.reindex(items=['ItemA'],minor=['B']).squeeze()Out[63]:2001-01-02 0.4258822001-01-03 -0.9696492001-01-04 -0.2185492001-01-05 0.907213Freq: D, Name: B, dtype: float64In
pd.io.data.Options:
- Fix bug when trying to fetch data for the current month when already past expiry.
- Now using lxml to scrape html instead of BeautifulSoup (lxml was faster).
- New instance variables for calls and puts are automatically created when a method that creates them is called. This works for the current month, where the instance variables are simply calls and puts. It also works for future expiry months and saves the instance variable as callsMMYY or putsMMYY, where MMYY are, respectively, the month and year of the option’s expiry.
- Options.get_near_stock_price now allows the user to specify the month for which to get relevant options data.
- Options.get_forward_data now has optional kwargs near and above_below. This allows the user to specify if they would like to only return forward looking data for options near the current stock price. This just obtains the data from Options.get_near_stock_price instead of Options.get_xxx_data() (GH2758).
Cursor coordinate information is now displayed in time-series plots.
added option display.max_seq_items to control the number of elements printed per sequence when pretty-printing it. (GH2979)
added option display.chop_threshold to control display of small numerical values. (GH2739)
added option display.max_info_rows to prevent verbose_info from being calculated for frames above 1M rows (configurable). (GH2807, GH2918)
value_counts() now accepts a “normalize” argument, for normalized histograms (GH2710); see the sketch after this list.
DataFrame.from_records now accepts not only dicts but any instance of the collections.Mapping ABC.
added option display.mpl_style providing a sleeker visual style for plots. Based on https://gist.github.com/huyng/816622 (GH3075).
Treat boolean values as integers (values 1 and 0) for numeric operations (GH2641); see the sketch after this list.
to_html() now accepts an optional “escape” argument to control reserved HTML character escaping (enabled by default) and escapes &, in addition to < and >. (GH2919) A brief sketch follows below.
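A small sketch of the normalize argument mentioned above (hypothetical data):

import pandas as pd
s = pd.Series(['a', 'b', 'a', 'a'])
s.value_counts()                # raw counts: a -> 3, b -> 1
s.value_counts(normalize=True)  # relative frequencies: a -> 0.75, b -> 0.25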
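For example (a hypothetical Series), booleans now behave like 0/1 in numeric operations:

import pandas as pd
s = pd.Series([True, False, True])
s.sum()    # 2, since True counts as 1 and False as 0
s.mean()   # the fraction of True values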
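And a minimal sketch of the escape argument (hypothetical frame):

import pandas as pd
df = pd.DataFrame({'A': ['<b>bold</b>', 'x & y']})
html_escaped = df.to_html()              # <, > and & are escaped by default
html_raw = df.to_html(escape=False)      # leave the characters as-is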
See the full release notes or issue tracker on GitHub for a complete list.
This is a minor release from 0.10.0 and includes new features, enhancements,and bug fixes. In particular, there is substantial new HDFStore functionalitycontributed by Jeff Reback.
An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.
Functions taking an inplace option return the calling object as before; a deprecation message has been added. You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.
You can designate (and index) certain columns that you want to be able to perform queries on a table, by passing a list to data_columns.
In [1]:store=HDFStore('store.h5')In [2]:df=DataFrame(randn(8,3),index=date_range('1/1/2000',periods=8), ...:columns=['A','B','C']) ...:In [3]:df['string']='foo'In [4]:df.ix[4:6,'string']=np.nanIn [5]:df.ix[7:9,'string']='bar'In [6]:df['string2']='cool'In [7]:dfOut[7]: A B C string string22000-01-01 1.885136 -0.183873 2.550850 foo cool2000-01-02 0.180759 -1.117089 0.061462 foo cool2000-01-03 -0.294467 -0.591411 -0.876691 foo cool2000-01-04 3.127110 1.451130 0.045152 foo cool2000-01-05 -0.242846 1.195819 1.533294 NaN cool2000-01-06 0.820521 -0.281201 1.651561 NaN cool2000-01-07 -0.034086 0.252394 -0.498772 foo cool2000-01-08 -2.290958 -1.601262 -0.256718 bar cool[8 rows x 5 columns]# on-disk operationsIn [8]:store.append('df',df,data_columns=['B','C','string','string2'])In [9]:store.select('df',['B > 0','string == foo'])Out[9]:Empty DataFrameColumns: [A, B, C, string, string2]Index: [][0 rows x 5 columns]# this is in-memory version of this type of selectionIn [10]:df[(df.B>0)&(df.string=='foo')]Out[10]: A B C string string22000-01-04 3.127110 1.451130 0.045152 foo cool2000-01-07 -0.034086 0.252394 -0.498772 foo cool[2 rows x 5 columns]
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique('df','index')
store.unique('df','string')
You can now store datetime64 in data columns.
In [11]:df_mixed=df.copy()In [12]:df_mixed['datetime64']=Timestamp('20010102')In [13]:df_mixed.ix[3:4,['A','B']]=np.nanIn [14]:store.append('df_mixed',df_mixed)In [15]:df_mixed1=store.select('df_mixed')In [16]:df_mixed1Out[16]: A B C string string2 datetime642000-01-01 1.885136 -0.183873 2.550850 foo cool 2001-01-022000-01-02 0.180759 -1.117089 0.061462 foo cool 2001-01-022000-01-03 -0.294467 -0.591411 -0.876691 foo cool 2001-01-022000-01-04 NaN NaN 0.045152 foo cool 2001-01-022000-01-05 -0.242846 1.195819 1.533294 NaN cool 2001-01-022000-01-06 0.820521 -0.281201 1.651561 NaN cool 2001-01-022000-01-07 -0.034086 0.252394 -0.498772 foo cool 2001-01-022000-01-08 -2.290958 -1.601262 -0.256718 bar cool 2001-01-02[8 rows x 6 columns]In [17]:df_mixed1.get_dtype_counts()Out[17]:datetime64[ns] 1float64 3object 2dtype: int64
You can pass the columns keyword to select to filter a list of the return columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter).
In [18]:store.select('df',columns=['A','B'])Out[18]: A B2000-01-01 1.885136 -0.1838732000-01-02 0.180759 -1.1170892000-01-03 -0.294467 -0.5914112000-01-04 3.127110 1.4511302000-01-05 -0.242846 1.1958192000-01-06 0.820521 -0.2812012000-01-07 -0.034086 0.2523942000-01-08 -2.290958 -1.601262[8 rows x 2 columns]
HDFStore now serializes multi-index dataframes when appending tables.
In [19]:index=MultiIndex(levels=[['foo','bar','baz','qux'], ....:['one','two','three']], ....:labels=[[0,0,0,1,1,2,2,3,3,3], ....:[0,1,2,0,1,1,2,0,1,2]], ....:names=['foo','bar']) ....:In [20]:df=DataFrame(np.random.randn(10,3),index=index, ....:columns=['A','B','C']) ....:In [21]:dfOut[21]: A B Cfoo barfoo one 0.239369 0.174122 -1.131794 two -1.948006 0.980347 -0.674429 three -0.361633 -0.761218 1.768215bar one 0.152288 -0.862613 -0.210968 two -0.859278 1.498195 0.462413baz two -0.647604 1.511487 -0.727189 three -0.342928 -0.007364 1.427674qux one 0.104020 2.052171 -1.230963 two -0.019240 -1.713238 0.838912 three -0.637855 0.215109 -1.515362[10 rows x 3 columns]In [22]:store.append('mi',df)In [23]:store.select('mi')Out[23]: A B Cfoo barfoo one 0.239369 0.174122 -1.131794 two -1.948006 0.980347 -0.674429 three -0.361633 -0.761218 1.768215bar one 0.152288 -0.862613 -0.210968 two -0.859278 1.498195 0.462413baz two -0.647604 1.511487 -0.727189 three -0.342928 -0.007364 1.427674qux one 0.104020 2.052171 -1.230963 two -0.019240 -1.713238 0.838912 three -0.637855 0.215109 -1.515362[10 rows x 3 columns]# the levels are automatically included as data columnsIn [24]:store.select('mi',Term('foo=bar'))Out[24]:Empty DataFrameColumns: [A, B, C]Index: [][0 rows x 3 columns]
Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.
In [25]:df_mt=DataFrame(randn(8,6),index=date_range('1/1/2000',periods=8), ....:columns=['A','B','C','D','E','F']) ....:In [26]:df_mt['foo']='bar'# you can also create the tables individuallyIn [27]:store.append_to_multiple({'df1_mt':['A','B'],'df2_mt':None},df_mt,selector='df1_mt')In [28]:storeOut[28]:<class 'pandas.io.pytables.HDFStore'>File path: store.h5/df frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])/df1_mt frame_table (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])/df2_mt frame_table (typ->appendable,nrows->8,ncols->5,indexers->[index])/df_mixed frame_table (typ->appendable,nrows->8,ncols->6,indexers->[index])/mi frame_table (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])# indiviual tables were createdIn [29]:store.select('df1_mt')Out[29]: A B2000-01-01 1.586924 -0.4479742000-01-02 -0.102206 0.8703022000-01-03 1.249874 1.4582102000-01-04 -0.616293 0.1504682000-01-05 -0.431163 0.0166402000-01-06 0.800353 -0.4515722000-01-07 1.239198 0.1854372000-01-08 -0.040863 0.290110[8 rows x 2 columns]In [30]:store.select('df2_mt')Out[30]: C D E F foo2000-01-01 -1.573998 0.630925 -0.071659 -1.277640 bar2000-01-02 1.275280 -1.199212 1.060780 1.673018 bar2000-01-03 -0.710542 0.825392 1.557329 1.993441 bar2000-01-04 0.132104 0.580923 -0.128750 1.445964 bar2000-01-05 0.904578 -1.645852 -0.688741 0.228006 bar2000-01-06 0.831767 0.228760 0.932498 -2.200069 bar2000-01-07 -0.540770 -0.370038 1.298390 1.662964 bar2000-01-08 -0.096145 1.717830 -0.462446 -0.112019 bar[8 rows x 5 columns]# as a multipleIn [31]:store.select_as_multiple(['df1_mt','df2_mt'],where=['A>0','B>0'],selector='df1_mt')Out[31]: A B C D E F foo2000-01-03 1.249874 1.458210 -0.710542 0.825392 1.557329 1.993441 bar2000-01-07 1.239198 0.185437 -0.540770 -0.370038 1.298390 1.662964 bar[2 rows x 7 columns]
Enhancements
- HDFStore now can read native PyTables table format tables
- You can pass nan_rep='my_nan_rep' to append, to change the default nan representation on disk (which converts to/from np.nan); this defaults to nan.
- You can pass index to append; this defaults to True. This will automagically create indices on the indexables and data columns of the table.
- You can pass chunksize=an_integer to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass expectedrows=an_integer to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance.
- Select now supports passing start and stop to provide selection space limiting in selection.
- DataFrame.merge to handle combinatorial sizes too large for a 64-bit integer (GH2690)
- logx parameter to change the x-axis to log scale (GH2327)
- kind argument to specify the file type (GH2613)
Bug Fixes
- HDFStore tables can now store float32 types correctly (they cannot be mixed with float64, however)
- Fixed handling of pattern in HDFStore expressions when the pattern is not a valid regex (GH2694)
See the full release notes or issue tracker on GitHub for a complete list.
This is a major release from 0.9.1 and includes many new features andenhancements along with a large number of bug fixes. There are also a number ofimportant API changes that long-time pandas users should pay close attentionto.
The delimited file parsing engine (the guts of read_csv and read_table) has been rewritten from the ground up and now uses a fraction of the memory while parsing, while being 40% or more faster in most use cases (in some cases much faster).
There are also many new features:
- encoding option
- usecols
- dtype argument
- as_recarray
- delim_whitespace option
- escapechar, lineterminator, quotechar, etc.
Deprecated DataFrame BINOP TimeSeries special case behavior
The default behavior of binary operations between a DataFrame and a Series has always been to align on the DataFrame’s columns and broadcast down the rows, except in the special case that the DataFrame contains time series. Since there are now methods for each binary operator enabling you to specify how you want to broadcast, we are phasing out this special case (Zen of Python: Special cases aren’t special enough to break the rules). Here’s what I’m talking about:
In [1]:importpandasaspdIn [2]:df=pd.DataFrame(np.random.randn(6,4), ...:index=pd.date_range('1/1/2000',periods=6)) ...:In [3]:dfOut[3]: 0 1 2 32000-01-01 -0.134024 -0.205969 1.348944 -1.1982462000-01-02 -1.626124 0.982041 0.059493 -0.4601112000-01-03 -1.565401 -0.025706 0.942864 2.5021562000-01-04 -0.302741 0.261551 -0.066342 0.8970972000-01-05 0.268766 -1.225092 0.582752 -1.4907642000-01-06 -0.639757 -0.952750 -0.892402 0.505987[6 rows x 4 columns]# deprecated nowIn [4]:df-df[0]Out[4]: 2000-01-01 00:00:00 2000-01-02 00:00:00 2000-01-03 00:00:00 \2000-01-01 NaN NaN NaN2000-01-02 NaN NaN NaN2000-01-03 NaN NaN NaN2000-01-04 NaN NaN NaN2000-01-05 NaN NaN NaN2000-01-06 NaN NaN NaN 2000-01-04 00:00:00 2000-01-05 00:00:00 2000-01-06 00:00:00 0 \2000-01-01 NaN NaN NaN NaN2000-01-02 NaN NaN NaN NaN2000-01-03 NaN NaN NaN NaN2000-01-04 NaN NaN NaN NaN2000-01-05 NaN NaN NaN NaN2000-01-06 NaN NaN NaN NaN 1 2 32000-01-01 NaN NaN NaN2000-01-02 NaN NaN NaN2000-01-03 NaN NaN NaN2000-01-04 NaN NaN NaN2000-01-05 NaN NaN NaN2000-01-06 NaN NaN NaN[6 rows x 10 columns]# Change your code toIn [5]:df.sub(df[0],axis=0)# align on axis 0 (rows)Out[5]: 0 1 2 32000-01-01 0.0 -0.071946 1.482967 -1.0642232000-01-02 0.0 2.608165 1.685618 1.1660132000-01-03 0.0 1.539695 2.508265 4.0675562000-01-04 0.0 0.564293 0.236399 1.1998392000-01-05 0.0 -1.493857 0.313986 -1.7595302000-01-06 0.0 -0.312993 -0.252645 1.145744[6 rows x 4 columns]
You will get a deprecation warning in the 0.10.x series, and the deprecated functionality will be removed in 0.11 or later.
Altered resample default behavior
The default time series resample binning behavior of daily D and higher frequencies has been changed to closed='left', label='left'. Lower frequencies are unaffected. The prior defaults were causing a great deal of confusion for users, especially when resampling data to daily frequency (which labeled the aggregated group with the end of the interval: the next day).
In [1]:dates=pd.date_range('1/1/2000','1/5/2000',freq='4h')In [2]:series=Series(np.arange(len(dates)),index=dates)In [3]:seriesOut[3]:2000-01-01 00:00:00 02000-01-01 04:00:00 12000-01-01 08:00:00 22000-01-01 12:00:00 32000-01-01 16:00:00 42000-01-01 20:00:00 52000-01-02 00:00:00 62000-01-02 04:00:00 72000-01-02 08:00:00 82000-01-02 12:00:00 92000-01-02 16:00:00 102000-01-02 20:00:00 112000-01-03 00:00:00 122000-01-03 04:00:00 132000-01-03 08:00:00 142000-01-03 12:00:00 152000-01-03 16:00:00 162000-01-03 20:00:00 172000-01-04 00:00:00 182000-01-04 04:00:00 192000-01-04 08:00:00 202000-01-04 12:00:00 212000-01-04 16:00:00 222000-01-04 20:00:00 232000-01-05 00:00:00 24Freq: 4H, dtype: int64In [4]:series.resample('D',how='sum')Out[4]:2000-01-01 152000-01-02 512000-01-03 872000-01-04 1232000-01-05 24Freq: D, dtype: int64In [5]:# old behaviorIn [6]:series.resample('D',how='sum',closed='right',label='right')Out[6]:2000-01-01 02000-01-02 212000-01-03 572000-01-04 932000-01-05 129Freq: D, dtype: int64
Infinity and negative infinity are no longer treated as NA by isnull and notnull. That they ever were was a relic of early pandas. This behavior can be re-enabled globally by the mode.use_inf_as_null option:
In [6]: s = pd.Series([1.5, np.inf, 3.4, -np.inf])
In [7]: pd.isnull(s)
Out[7]:
0    False
1    False
2    False
3    False
dtype: bool
In [8]: s.fillna(0)
Out[8]:
0    1.500000
1         inf
2    3.400000
3        -inf
dtype: float64
In [9]: pd.set_option('use_inf_as_null', True)
In [10]: pd.isnull(s)
Out[10]:
0    False
1     True
2    False
3     True
dtype: bool
In [11]: s.fillna(0)
Out[11]:
0    1.5
1    0.0
2    3.4
3    0.0
dtype: float64
In [12]: pd.reset_option('use_inf_as_null')
Methods taking an inplace option now all return None instead of the calling object. E.g. code written like df = df.fillna(0, inplace=True) may stop working. To fix, simply delete the unnecessary variable assignment.
pandas.merge no longer sorts the group keys (sort=False) by default. This was done for performance reasons: the group-key sorting is often one of the more expensive parts of the computation and is often unnecessary.
The default column names for a file with no header have been changed to the integers 0 through N-1. This is to create consistency with the DataFrame constructor with no columns specified. The v0.9.0 behavior (names X0, X1, ...) can be reproduced by specifying prefix='X':
In [13]: data = 'a,b,c\n1,Yes,2\n3,No,4'
In [14]: print(data)
a,b,c
1,Yes,2
3,No,4
In [15]: pd.read_csv(StringIO(data), header=None)
Out[15]:
   0    1  2
0  a    b  c
1  1  Yes  2
2  3   No  4
[3 rows x 3 columns]
In [16]: pd.read_csv(StringIO(data), header=None, prefix='X')
Out[16]:
  X0   X1 X2
0  a    b  c
1  1  Yes  2
2  3   No  4
[3 rows x 3 columns]
'Yes' and'No' are not interpreted as boolean by default,though this can be controlled by newtrue_values andfalse_valuesarguments:In [17]:print(data)a,b,c1,Yes,23,No,4In [18]:pd.read_csv(StringIO(data))Out[18]: a b c0 1 Yes 21 3 No 4[2 rows x 3 columns]In [19]:pd.read_csv(StringIO(data),true_values=['Yes'],false_values=['No'])Out[19]: a b c0 1 True 21 3 False 4[2 rows x 3 columns]
na_values argument. It’s betterto do post-processing using thereplace function instead.fillna on Series or DataFrame with no arguments is no longervalid code. You must either specify a fill value or an interpolation method:In [20]:s=Series([np.nan,1.,2.,np.nan,4])In [21]:sOut[21]:0 NaN1 1.02 2.03 NaN4 4.0dtype: float64In [22]:s.fillna(0)Out[22]:0 0.01 1.02 2.03 0.04 4.0dtype: float64In [23]:s.fillna(method='pad')Out[23]:0 NaN1 1.02 2.03 2.04 4.0dtype: float64
Convenience methods ffill and bfill have been added:
In [24]:s.ffill()Out[24]:0 NaN1 1.02 2.03 2.04 4.0dtype: float64
Series.apply will now operate on a returned value from the applied function that is itself a Series, and possibly upcast the result to a DataFrame.
In [25]:deff(x): ....:returnSeries([x,x**2],index=['x','x^2']) ....:In [26]:s=Series(np.random.rand(5))In [27]:sOut[27]:0 0.7174781 0.8151992 0.4524783 0.8483854 0.235477dtype: float64In [28]:s.apply(f)Out[28]: x x^20 0.717478 0.5147751 0.815199 0.6645502 0.452478 0.2047373 0.848385 0.7197574 0.235477 0.055449[5 rows x 2 columns]
New API functions for working with pandas options (GH2097):
- get_option / set_option - get/set the value of an option. Partial names are accepted.
- reset_option - reset one or more options to their default value. Partial names are accepted.
- describe_option - print a description of one or more options. When called with no arguments, print all registered options.
Note: set_printoptions / reset_printoptions are now deprecated (but functioning); the print options now live under “display.XYZ”. For example:
In [29]: get_option("display.max_rows")
Out[29]: 15
to_string() methods now always return unicode strings (GH2224).
Instead of printing the summary information, pandas now splits the string representation across multiple rows by default:
In [30]:wide_frame=DataFrame(randn(5,16))In [31]:wide_frameOut[31]: 0 1 2 3 4 5 6 \0 -0.681624 0.191356 1.180274 -0.834179 0.703043 0.166568 -0.5835991 0.441522 -0.316864 -0.017062 1.570114 -0.360875 -0.880096 0.2355322 -0.412451 -0.462580 0.422194 0.288403 -0.487393 -0.777639 0.0558653 -0.277255 1.331263 0.585174 -0.568825 -0.719412 1.191340 -0.4563624 -1.642511 0.432560 1.218080 -0.564705 -0.581790 0.286071 0.048725 7 8 9 10 11 12 13 \0 -1.201796 -1.422811 -0.882554 1.209871 -0.941235 0.863067 -0.3362321 0.207232 -1.983857 -1.702547 -1.621234 -0.906840 1.014601 -0.4751082 1.383381 0.085638 0.246392 0.965887 0.246354 -0.727728 -0.0944143 0.089931 0.776079 0.752889 -1.195795 -1.425911 -0.548829 0.7742254 1.002440 1.276582 0.054399 0.241963 -0.471786 0.314510 -0.059986 14 150 -0.976847 0.0338621 -0.358944 1.2629422 -0.276854 0.1583993 0.740501 1.5102634 -2.069319 -1.115104[5 rows x 16 columns]
The old behavior of printing out summary information can be achieved via the ‘expand_frame_repr’ print option:
In [32]:pd.set_option('expand_frame_repr',False)In [33]:wide_frameOut[33]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150 -0.681624 0.191356 1.180274 -0.834179 0.703043 0.166568 -0.583599 -1.201796 -1.422811 -0.882554 1.209871 -0.941235 0.863067 -0.336232 -0.976847 0.0338621 0.441522 -0.316864 -0.017062 1.570114 -0.360875 -0.880096 0.235532 0.207232 -1.983857 -1.702547 -1.621234 -0.906840 1.014601 -0.475108 -0.358944 1.2629422 -0.412451 -0.462580 0.422194 0.288403 -0.487393 -0.777639 0.055865 1.383381 0.085638 0.246392 0.965887 0.246354 -0.727728 -0.094414 -0.276854 0.1583993 -0.277255 1.331263 0.585174 -0.568825 -0.719412 1.191340 -0.456362 0.089931 0.776079 0.752889 -1.195795 -1.425911 -0.548829 0.774225 0.740501 1.5102634 -1.642511 0.432560 1.218080 -0.564705 -0.581790 0.286071 0.048725 1.002440 1.276582 0.054399 0.241963 -0.471786 0.314510 -0.059986 -2.069319 -1.115104[5 rows x 16 columns]
The width of each line can be changed via ‘line_width’ (80 by default):
In [34]:pd.set_option('line_width',40)line_width has been deprecated, use display.width instead (currently both areidentical)In [35]:wide_frameOut[35]: 0 1 2 \0 -0.681624 0.191356 1.1802741 0.441522 -0.316864 -0.0170622 -0.412451 -0.462580 0.4221943 -0.277255 1.331263 0.5851744 -1.642511 0.432560 1.218080 3 4 5 \0 -0.834179 0.703043 0.1665681 1.570114 -0.360875 -0.8800962 0.288403 -0.487393 -0.7776393 -0.568825 -0.719412 1.1913404 -0.564705 -0.581790 0.286071 6 7 8 \0 -0.583599 -1.201796 -1.4228111 0.235532 0.207232 -1.9838572 0.055865 1.383381 0.0856383 -0.456362 0.089931 0.7760794 0.048725 1.002440 1.276582 9 10 11 \0 -0.882554 1.209871 -0.9412351 -1.702547 -1.621234 -0.9068402 0.246392 0.965887 0.2463543 0.752889 -1.195795 -1.4259114 0.054399 0.241963 -0.471786 12 13 14 \0 0.863067 -0.336232 -0.9768471 1.014601 -0.475108 -0.3589442 -0.727728 -0.094414 -0.2768543 -0.548829 0.774225 0.7405014 0.314510 -0.059986 -2.069319 150 0.0338621 1.2629422 0.1583993 1.5102634 -1.115104[5 rows x 16 columns]
Docs for PyTables Table format & several enhancements to the API. Here is a taste of what to expect.
In [36]:store=HDFStore('store.h5')In [37]:df=DataFrame(randn(8,3),index=date_range('1/1/2000',periods=8), ....:columns=['A','B','C']) ....:In [38]:dfOut[38]: A B C2000-01-01 -0.369325 -1.502617 -0.3762802000-01-02 0.511936 -0.116412 -0.6252562000-01-03 -0.550627 1.261433 -0.5524292000-01-04 1.695803 -1.025917 -0.9109422000-01-05 0.426805 -0.131749 0.4326002000-01-06 0.044671 -0.341265 1.8445362000-01-07 -2.036047 0.000830 -0.9556972000-01-08 -0.898872 -0.725411 0.059904[8 rows x 3 columns]# appending data framesIn [39]:df1=df[0:4]In [40]:df2=df[4:]In [41]:store.append('df',df1)In [42]:store.append('df',df2)In [43]:storeOut[43]:<class 'pandas.io.pytables.HDFStore'>File path: store.h5/df frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])# selecting the entire storeIn [44]:store.select('df')Out[44]: A B C2000-01-01 -0.369325 -1.502617 -0.3762802000-01-02 0.511936 -0.116412 -0.6252562000-01-03 -0.550627 1.261433 -0.5524292000-01-04 1.695803 -1.025917 -0.9109422000-01-05 0.426805 -0.131749 0.4326002000-01-06 0.044671 -0.341265 1.8445362000-01-07 -2.036047 0.000830 -0.9556972000-01-08 -0.898872 -0.725411 0.059904[8 rows x 3 columns]
In [45]:wp=Panel(randn(2,5,4),items=['Item1','Item2'], ....:major_axis=date_range('1/1/2000',periods=5), ....:minor_axis=['A','B','C','D']) ....:In [46]:wpOut[46]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)Items axis: Item1 to Item2Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00Minor_axis axis: A to D# storing a panelIn [47]:store.append('wp',wp)# selecting via A QUERYIn [48]:store.select('wp', ....:[Term('major_axis>20000102'),Term('minor_axis','=',['A','B'])]) ....:Out[48]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)Items axis: Item1 to Item2Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00Minor_axis axis: A to B# removing data from tablesIn [49]:store.remove('wp',Term('major_axis>20000103'))Out[49]:8In [50]:store.select('wp')Out[50]:<class 'pandas.core.panel.Panel'>Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)Items axis: Item1 to Item2Major_axis axis: 2000-01-01 00:00:00 to 2000-01-03 00:00:00Minor_axis axis: A to D# deleting a storeIn [51]:delstore['df']In [52]:storeOut[52]:<class 'pandas.io.pytables.HDFStore'>File path: store.h5/wp wide_table (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])
Enhancements
added ability to use hierarchical keys
In [53]:store.put('foo/bar/bah',df)In [54]:store.append('food/orange',df)In [55]:store.append('food/apple',df)In [56]:storeOut[56]:<class 'pandas.io.pytables.HDFStore'>File path: store.h5/foo/bar/bah frame (shape->[8,3])/food/apple frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])/food/orange frame_table (typ->appendable,nrows->8,ncols->3,indexers->[index])/wp wide_table (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])# remove all nodes under this levelIn [57]:store.remove('food')In [58]:storeOut[58]:<class 'pandas.io.pytables.HDFStore'>File path: store.h5/foo/bar/bah frame (shape->[8,3])/wp wide_table (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])
added mixed-dtype support!
In [59]:df['string']='string'In [60]:df['int']=1In [61]:store.append('df',df)In [62]:df1=store.select('df')In [63]:df1Out[63]: A B C string int2000-01-01 -0.369325 -1.502617 -0.376280 string 12000-01-02 0.511936 -0.116412 -0.625256 string 12000-01-03 -0.550627 1.261433 -0.552429 string 12000-01-04 1.695803 -1.025917 -0.910942 string 12000-01-05 0.426805 -0.131749 0.432600 string 12000-01-06 0.044671 -0.341265 1.844536 string 12000-01-07 -2.036047 0.000830 -0.955697 string 12000-01-08 -0.898872 -0.725411 0.059904 string 1[8 rows x 5 columns]In [64]:df1.get_dtype_counts()Out[64]:float64 3int64 1object 1dtype: int64
performance improvements on table writing
support for arbitrarily indexed dimensions
SparseSeries now has a density property (GH2384)
enable Series.str.strip/lstrip/rstrip methods to take an input argument to strip arbitrary characters (GH2411); see the sketch after this list
implement value_vars in melt to limit values to certain columns and add melt to pandas namespace (GH2412); a sketch also follows below
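A small sketch of stripping an arbitrary character (hypothetical data):

import pandas as pd
s = pd.Series(['xxABCxx', 'xxDEF'])
s.str.strip('x')    # 'ABC', 'DEF'   (both ends)
s.str.lstrip('x')   # 'ABCxx', 'DEF' (left side only)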
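And a minimal sketch of value_vars in melt (hypothetical frame):

import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'A': [3, 4], 'B': [5, 6]})
pd.melt(df, id_vars=['id'], value_vars=['A'])  # only column 'A' is unpivoted; 'B' is dropped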
Bug Fixes
- Term method of specifying where conditions (GH1996).
- del store['df'] now calls store.remove('df') for store deletion
- min_itemsize parameter can be specified in table creation to force a minimum size for indexing columns (the previous implementation would set the column size based on the first append)
- create_table_index (requires PyTables >= 2.3) (GH698)
- .put
Compatibility
0.10 of HDFStore is backwards compatible for reading tables created in a prior version of pandas; however, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire file and write it out using the new format to take advantage of the updates.
Adding experimental support for Panel4D and factory functions to create n-dimensional named panels. Docs for NDim. Here is a taste of what to expect.
In [65]:p4d=Panel4D(randn(2,2,5,4), ....:labels=['Label1','Label2'], ....:items=['Item1','Item2'], ....:major_axis=date_range('1/1/2000',periods=5), ....:minor_axis=['A','B','C','D']) ....:In [66]:p4dOut[66]:<class 'pandas.core.panelnd.Panel4D'>Dimensions: 2 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)Labels axis: Label1 to Label2Items axis: Item1 to Item2Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00Minor_axis axis: A to D
See the full release notes or issue tracker on GitHub for a complete list.
This is a bugfix release from 0.9.0 and includes several new features andenhancements along with a large number of bug fixes. The new features includeby-column sort order for DataFrame and Series, improved NA handling for the rankmethod, masking functions for DataFrame, and intraday time-series filtering forDataFrame.
Series.sort, DataFrame.sort, and DataFrame.sort_index can now be specified in a per-column manner to support multiple sort orders (GH928)
In [1]:df=DataFrame(np.random.randint(0,2,(6,3)),columns=['A','B','C'])In [2]:df.sort(['A','B'],ascending=[1,0])Out[2]: A B C0 0 1 02 0 0 11 1 1 15 1 1 03 1 0 04 1 0 1[6 rows x 3 columns]DataFrame.rank now supports additional argument values for thena_option parameter so missing values can be assigned either the largestor the smallest rank (GH1508,GH2159)
In [3]:df=DataFrame(np.random.randn(6,3),columns=['A','B','C'])In [4]:df.ix[2:4]=np.nanIn [5]:df.rank()Out[5]: A B C0 3.0 2.0 1.01 1.0 3.0 3.02 NaN NaN NaN3 NaN NaN NaN4 NaN NaN NaN5 2.0 1.0 2.0[6 rows x 3 columns]In [6]:df.rank(na_option='top')Out[6]: A B C0 6.0 5.0 4.01 4.0 6.0 6.02 2.0 2.0 2.03 2.0 2.0 2.04 2.0 2.0 2.05 5.0 4.0 5.0[6 rows x 3 columns]In [7]:df.rank(na_option='bottom')Out[7]: A B C0 3.0 2.0 1.01 1.0 3.0 3.02 5.0 5.0 5.03 5.0 5.0 5.04 5.0 5.0 5.05 2.0 1.0 2.0[6 rows x 3 columns]DataFrame has newwhere andmask methods to select values according to agiven boolean mask (GH2109,GH2151)
DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the[]).The returned DataFrame has the same number of columns as the original, but is sliced on its index.
In [8]:df=DataFrame(np.random.randn(5,3),columns=['A','B','C'])In [9]:dfOut[9]: A B C0 -0.187239 -1.703664 0.6131361 -0.948528 0.505346 0.0172282 -2.391256 1.207381 0.8531743 0.124213 -0.625597 -1.2112244 -0.476548 0.649425 0.004610[5 rows x 3 columns]In [10]:df[df['A']>0]Out[10]: A B C3 0.124213 -0.625597 -1.211224[1 rows x 3 columns]If a DataFrame is sliced with a DataFrame based boolean condition (with the same size as the original DataFrame),then a DataFrame the same size (index and columns) as the original is returned, withelements that do not meet the boolean condition asNaN. This is accomplished viathe new methodDataFrame.where. In addition,where takes an optionalother argument for replacement.
In [11]:df[df>0]Out[11]: A B C0 NaN NaN 0.6131361 NaN 0.505346 0.0172282 NaN 1.207381 0.8531743 0.124213 NaN NaN4 NaN 0.649425 0.004610[5 rows x 3 columns]In [12]:df.where(df>0)Out[12]: A B C0 NaN NaN 0.6131361 NaN 0.505346 0.0172282 NaN 1.207381 0.8531743 0.124213 NaN NaN4 NaN 0.649425 0.004610[5 rows x 3 columns]In [13]:df.where(df>0,-df)Out[13]: A B C0 0.187239 1.703664 0.6131361 0.948528 0.505346 0.0172282 2.391256 1.207381 0.8531743 0.124213 0.625597 1.2112244 0.476548 0.649425 0.004610[5 rows x 3 columns]Furthermore,where now aligns the input boolean condition (ndarray or DataFrame), such that partial selectionwith setting is possible. This is analagous to partial setting via.ix (but on the contents rather than the axis labels)
In [14]:df2=df.copy()In [15]:df2[df2[1:4]>0]=3In [16]:df2Out[16]: A B C0 -0.187239 -1.703664 0.6131361 -0.948528 3.000000 3.0000002 -2.391256 3.000000 3.0000003 3.000000 -0.625597 -1.2112244 -0.476548 0.649425 0.004610[5 rows x 3 columns]DataFrame.mask is the inverse boolean operation ofwhere.
In [17]:df.mask(df<=0)Out[17]: A B C0 NaN NaN 0.6131361 NaN 0.505346 0.0172282 NaN 1.207381 0.8531743 0.124213 NaN NaN4 NaN 0.649425 0.004610[5 rows x 3 columns]Enable referencing of Excel columns by their column names (GH1936)
In [18]:xl=ExcelFile('data/test.xls')In [19]:xl.parse('Sheet1',index_col=0,parse_dates=True, ....:parse_cols='A:D') ....:---------------------------------------------------------------------------NotImplementedError Traceback (most recent call last)<ipython-input-19-7ac41df80d31> in <module>() 1 xl.parse('Sheet1', index_col=0, parse_dates=True,----> 2 parse_cols='A:D')/home/joris/scipy/pandas/pandas/io/excel.pyc in parse(self, sheetname, header, skiprows, skip_footer, names, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, squeeze, **kwds) 279 false_values=false_values, 280 squeeze=squeeze,--> 281 **kwds) 282 283 def _should_parse(self, i, parse_cols):/home/joris/scipy/pandas/pandas/io/excel.pyc in _parse_excel(self, sheetname, header, skiprows, names, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, true_values, false_values, verbose, squeeze, **kwds) 337 "is not implemented") 338 if parse_dates:--> 339 raise NotImplementedError("parse_dates keyword of read_excel " 340 "is not implemented") 341NotImplementedError: parse_dates keyword of read_excel is not implementedAdded option to disable pandas-style tick locators and formattersusingseries.plot(x_compat=True) orpandas.plot_params[‘x_compat’] =True (GH2205)
Existing TimeSeries methods at_time and between_time were added to DataFrame (GH2149)
DataFrame.dot can now accept ndarrays (GH2042)
DataFrame.drop now supports non-unique indexes (GH2101)
Panel.shift now supports negative periods (GH2164)
DataFrame now supports the unary ~ operator (GH2110); a brief sketch follows below
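For example (a hypothetical frame), the unary ~ operator negates a boolean column elementwise:

import pandas as pd
df = pd.DataFrame({'A': [True, False, True]})
~df['A']        # elementwise negation: False, True, False
df[~df['A']]    # rows where A is False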
Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window
In [1]:prng=period_range('2012Q1',periods=2,freq='Q')In [2]:s=Series(np.random.randn(len(prng)),prng)In [4]:s.resample('M')Out[4]:2012-01 -1.4719922012-02 NaN2012-03 NaN2012-04 -0.4935932012-05 NaN2012-06 NaNFreq: M, dtype: float64Period.end_time now returns the last nanosecond in the time interval(GH2124,GH2125,GH1764)
In [20]: p = Period('2012')
In [21]: p.end_time
Out[21]: Timestamp('2012-12-31 23:59:59.999999999')
File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)
In [22]:data='A,B,C\n00001,001,5\n00002,002,6'In [23]:read_csv(StringIO(data),converters={'A':lambdax:x.strip()})Out[23]: A B C0 00001 1 51 00002 2 6[2 rows x 3 columns]
See the full release notes or issue tracker on GitHub for a complete list.
This is a major release from 0.8.1 and includes several new features andenhancements along with a large number of bug fixes. New features includevectorized unicode encoding/decoding forSeries.str,to_latex method toDataFrame, more flexible parsing of boolean values, and enabling the download ofoptions data from Yahoo! Finance.
- Add encode and decode for unicode handling to vectorized string processing methods in Series.str (GH1706)
- Add DataFrame.to_latex method (GH1735)
- Add convenient expanding window equivalents of all rolling_* ops (GH1785)
- Add Options class to pandas.io.data for fetching options data from Yahoo! Finance (GH1748, GH1739)
- More flexible parsing of boolean values (Yes, No, TRUE, FALSE, etc) (GH1691, GH1295)
- Add level parameter to Series.reset_index
- TimeSeries.between_time can now select times across midnight (GH1871)
- Series constructor can now handle generator as input (GH1679)
- DataFrame.dropna can now take multiple axes (tuple/list) as input (GH924)
- Enable skip_footer parameter in ExcelFile.parse (GH1843)
- The default column names when header=None and no column names are passed to functions like read_csv have changed to be more Pythonic and amenable to attribute access:
In [1]:data='0,0,1\n1,1,0\n0,1,0'In [2]:df=read_csv(StringIO(data),header=None)In [3]:dfOut[3]: 0 1 20 0 0 11 1 1 02 0 1 0[3 rows x 3 columns]
Creating a Series from another Series while passing an index will now reindex rather than treat the Series like an ndarray. Technically improper usages like Series(df[col1], index=df[col2]) that worked before “by accident” (this was never intended) will lead to all-NA Series in some cases. To be perfectly clear:
In [4]: s1 = Series([1, 2, 3])
In [5]: s1
Out[5]:
0    1
1    2
2    3
dtype: int64
In [6]: s2 = Series(s1, index=['foo', 'bar', 'baz'])
In [7]: s2
Out[7]:
foo   NaN
bar   NaN
baz   NaN
dtype: float64
- day_of_year API removed from PeriodIndex; use dayofyear (GH1723)
- first and last methods in GroupBy no longer drop non-numeric columns (GH1809)
- na_values of type dict no longer override default NAs unless keep_default_na is set to false explicitly (GH1657)
- DataFrame.dot will not do data alignment, and also works with Series (GH1915)
See the full release notes or issue tracker on GitHub for a complete list.
This release includes a few new features, performance enhancements, and over 30bug fixes from 0.8.0. New features include notably NA friendly stringprocessing functionality and a series of new plot types and options.
- Add vectorized string processing methods accessible via Series.str (GH620)
- Add option to disable adjustment in EWMA (GH1584)
- Radviz plot (GH1566)
- Parallel coordinates plot
- Bootstrap plot
- Per column styles and secondary y-axis plotting (GH1559)
- New datetime converters millisecond plotting (GH1599)
- Add option to disable “sparse” display of hierarchical indexes (GH1538)
- Series/DataFrame’s set_index method can append levels to an existing Index/MultiIndex (GH1569, GH1577)
- Improved implementation of rolling min and max (thanks to Bottleneck!)
- Add accelerated 'median' GroupBy option (GH1358)
- Significantly improve the performance of parsing ISO8601-format date strings with DatetimeIndex or to_datetime (GH1571)
- Improve the performance of GroupBy on single-key aggregations and use with Categorical types
- Significant datetime parsing performance improvements
This is a major release from 0.7.3 and includes extensive work on the timeseries handling and processing infrastructure as well as a great deal of newfunctionality throughout the library. It includes over 700 commits from morethan 20 distinct authors. Most pandas 0.7.3 and earlier users should notexperience any issues upgrading, but due to the migration to the NumPydatetime64 dtype, there may be a number of bugs and incompatibilitieslurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release ifnecessary. See thefull release notes or issue trackeron GitHub for a complete list.
All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if applicable, index duplication in many-to-many joins).
Time series data are now represented using NumPy’s datetime64 dtype; thus,pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verifiedto work with the development version (1.7+) of NumPy as well which includessome significant user-facing API changes. NumPy 1.6 also has a number of bugshaving to do with nanosecond resolution data, so I recommend that you steerclear of NumPy 1.6’s datetime64 API functions (though limited as they are) andonly interact with this data using the interface that pandas provides.
See the end of the 0.8.0 section for a “porting” guide listing potential issuesfor users migrating legacy codebases from pandas 0.7 or earlier to 0.8.0.
Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided asthey arise. There will be no more further development in 0.7.x beyond bugfixes.
Note
With this release, legacy scikits.timeseries users should be able to port their code to use pandas.
Note
See the documentation for an overview of the pandas time series API.
- PeriodIndex and Period classes for representing time spans and performing calendar logic, including the 12 fiscal quarterly frequencies <timeseries.quarterly>. This is a partial port of, and a substantial enhancement to, elements of the scikits.timeseries codebase. Support for conversion between PeriodIndex and DatetimeIndex.
- tz_localize methods to TimeSeries and DataFrame. All timestamps are stored as UTC; Timestamps from DatetimeIndex objects with a time zone set will be localized to local time. Time zone conversions are therefore essentially free. The user needs to know very little about the pytz library now; only time zone names as strings are required. Time zone-aware timestamps are equal if and only if their UTC timestamps match. Operations between time zone-aware time series with different time zones will result in a UTC-indexed time series.
- date_range, bdate_range, and period_range factory functions.
- inferred_freq property of DatetimeIndex, with an option to infer frequency on construction of DatetimeIndex.
- Selection of values at a particular time of day (TimeSeries.at_time) or between two times (TimeSeries.between_time).
- cut and qcut functions (like R’s cut function) for computing a categorical variable from a continuous variable by binning values either into value-based (cut) or quantile-based (qcut) bins.
- Rename Factor to Categorical and add a number of usability features.
- Fix pivot_table bugs (empty columns being introduced).
- any and all methods added to DataFrame.
- Series.plot now supports a secondary_y option:
In [1]:plt.figure()Out[1]:<matplotlib.figure.Figureat0x7fd237c84f10>In [2]:fx['FR'].plot(style='g')Out[2]:<matplotlib.axes._subplots.AxesSubplotat0x7fd23e90f5d0>In [3]:fx['IT'].plot(style='k--',secondary_y=True)Out[3]:<matplotlib.axes._subplots.AxesSubplotat0x7fd23eb04910>

Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:
In [4]:s=Series(np.concatenate((np.random.randn(1000), ...:np.random.randn(1000)*0.5+3))) ...:In [5]:plt.figure()Out[5]:<matplotlib.figure.Figureat0x7fd237c84190>In [6]:s.hist(normed=True,alpha=0.2)Out[6]:<matplotlib.axes._subplots.AxesSubplotat0x7fd23e79dbd0>In [7]:s.plot(kind='kde')Out[7]:<matplotlib.axes._subplots.AxesSubplotat0x7fd23e79dbd0>

Seethe plotting page for much more.
Deprecation of offset, time_rule, and timeRule argument names in time series functions. Warnings will be printed until pandas 0.9 or 1.0. The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy’s datetime64 data type instead of dtype=object arrays of Python’s built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex but otherwise behaved identically. But, if you have code that converts DateRange or Index objects that used to contain datetime.datetime values to plain NumPy arrays, you may have bugs lurking with code using scalar values because you are handing control over to NumPy:
In [8]:importdatetimeIn [9]:rng=date_range('1/1/2000',periods=10)In [10]:rng[5]Out[10]:Timestamp('2000-01-06 00:00:00',freq='D')In [11]:isinstance(rng[5],datetime.datetime)Out[11]:TrueIn [12]:rng_asarray=np.asarray(rng)In [13]:scalar_val=rng_asarray[5]In [14]:type(scalar_val)Out[14]:numpy.datetime64
pandas’s Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field stores the nanosecond value between 0 and 999). It should substitute directly into any code that used datetime.datetime values before. Thus, I recommend not casting DatetimeIndex to regular NumPy arrays.
If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the asobject property of DatetimeIndex produces an array of Timestamp objects:
In [15]:stamp_array=rng.asobjectIn [16]:stamp_arrayOut[16]:Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00, 2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00, 2000-01-10 00:00:00], dtype='object')In [17]:stamp_array[5]Out[17]:Timestamp('2000-01-06 00:00:00',freq='D')
To get an array of proper datetime.datetime objects, use the to_pydatetime method:
In [18]:dt_array=rng.to_pydatetime()In [19]:dt_arrayOut[19]:array([datetime.datetime(2000, 1, 1, 0, 0), datetime.datetime(2000, 1, 2, 0, 0), datetime.datetime(2000, 1, 3, 0, 0), datetime.datetime(2000, 1, 4, 0, 0), datetime.datetime(2000, 1, 5, 0, 0), datetime.datetime(2000, 1, 6, 0, 0), datetime.datetime(2000, 1, 7, 0, 0), datetime.datetime(2000, 1, 8, 0, 0), datetime.datetime(2000, 1, 9, 0, 0), datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)In [20]:dt_array[5]Out[20]:datetime.datetime(2000,1,6,0,0)
matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See the matplotlib documentation for more on this.
Warning
There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.
In [21]:rng=date_range('1/1/2000',periods=10)In [22]:rngOut[22]:DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08', '2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D')In [23]:np.asarray(rng)Out[23]:array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000', '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000', '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000', '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000', '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'], dtype='datetime64[ns]')In [24]:converted=np.asarray(rng,dtype=object)In [25]:converted[5]Out[25]:947116800000000000L
Trust me: don’t panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas’s API you will be just fine. There is nothing wrong with the data type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.
Support for non-unique indexes: in the latter case, you may have code inside a try:... except: block that failed due to the index not being unique. In many cases it will no longer fail (some methods like append still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and raise an exception explicitly if it is False, or go to a different code branch.
This is a minor release from 0.7.2 and fixes many minor bugs and adds a number of nice new features. There are also a couple of API changes to note; these should not affect very many users, and we are inclined to call them “bug fixes” even though they do constitute a change in behavior. See the full release notes or issue tracker on GitHub for a complete list.
- New fixed width file reader, read_fwf
- New scatter_matrix function for making a scatter plot matrix:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2)

stacked argument to Series and DataFrame’s plot method for stacked bar plots.
df.plot(kind='bar', stacked=True)

df.plot(kind='barh', stacked=True)

- DataFrame.plot and Series.plot
- kurt methods to Series and DataFrame for computing kurtosis
Reverted some changes to how NA values (represented typically as NaN or None) are handled in non-numeric Series:
In [1]:series=Series(['Steve',np.nan,'Joe'])In [2]:series=='Steve'Out[2]:0 True1 False2 Falsedtype: boolIn [3]:series!='Steve'Out[3]:0 False1 True2 Truedtype: bool
In comparisons, NA / NaN will always come through as False, except with != which is True. Be very careful with boolean arithmetic, especially negation, in the presence of NA data. You may wish to add an explicit NA filter into boolean array operations if you are worried about this:
In [4]:mask=series=='Steve'In [5]:series[mask&series.notnull()]Out[5]:0 Stevedtype: object
While propagating NA in comparisons may seem like the right behavior to someusers (and you could argue on purely technical grounds that this is the rightthing to do), the evaluation was made that propagating NA everywhere, includingin numerical arrays, would cause a large amount of problems for users. Thus, a“practicality beats purity” approach was taken. This issue may be revisited atsome point in the future.
When callingapply on a grouped Series, the return value will also be aSeries, to be more consistent with thegroupby behavior with DataFrame:
In [6]:df=DataFrame({'A':['foo','bar','foo','bar', ...:'foo','bar','foo','foo'], ...:'B':['one','one','two','three', ...:'two','two','one','three'], ...:'C':np.random.randn(8),'D':np.random.randn(8)}) ...:In [7]:dfOut[7]: A B C D0 foo one 0.219405 -1.0791811 bar one -0.342863 -1.6318822 foo two -0.032419 0.2372883 bar three -1.581534 0.5146794 foo two -0.912061 -1.4881015 bar two 0.209500 1.0185146 foo one -0.675890 -1.4888407 foo three 0.055228 -1.355434[8 rows x 4 columns]In [8]:grouped=df.groupby('A')['C']In [9]:grouped.describe()Out[9]:Abar count 3.000000 mean -0.571633 std 0.917171 min -1.581534 25% -0.962199 50% -0.342863 75% -0.066682 ...foo mean -0.269148 std 0.494652 min -0.912061 25% -0.675890 50% -0.032419 75% 0.055228 max 0.219405Name: C, dtype: float64In [10]:grouped.apply(lambdax:x.order()[-2:])# top 2 valuesOut[10]:Abar 1 -0.342863 5 0.209500foo 7 0.055228 0 0.219405Name: C, dtype: float64
This release targets bugs in 0.7.1, and adds a few minor features.
- Add additional tie-breaking methods in DataFrame.rank (GH874)
- Add ascending parameter to rank in Series, DataFrame (GH875)
- Add coerce_float option to DataFrame.from_records (GH893)
- Add sort_columns parameter to allow unsorted plots (GH918)
- Enable column access via attributes on GroupBy (GH882)
- Can pass dict of values to DataFrame.fillna (GH661)
- Can select multiple hierarchical groups by passing list of values in .ix(GH134)
- Add axis option to DataFrame.fillna (GH174)
- Add level keyword to drop for dropping values from a level (GH159)
This release includes a few new features and addresses over a dozen bugs in0.7.0.
- Add to_clipboard function to pandas namespace for writing objects to the system clipboard (GH774)
- Add itertuples method to DataFrame for iterating through the rows of a dataframe as tuples (GH818)
- Add ability to pass fill_value and method to DataFrame and Series align method (GH806, GH807)
- Add fill_value option to reindex, align methods (GH784)
- Enable concat to produce DataFrame from Series (GH787)
- Add between method to Series (GH802)
- Add HTML representation hook to DataFrame for the IPython HTML notebook (GH773)
- Support for reading Excel 2007 XML documents using openpyxl
Series.append andDataFrame.append (GH468,GH479,GH273)Series.append too__getitem__, useful for transformation (GH342)DataFrame.apply (GH498)In [1]:df=DataFrame(randn(10,4))In [2]:df.apply(lambdax:x.describe())Out[2]: 0 1 2 3count 10.000000 10.000000 10.000000 10.000000mean 0.448104 0.052501 0.058434 0.008207std 0.784159 0.676134 0.959629 1.126010min -1.275249 -1.200953 -1.819334 -1.60790625% 0.100811 -0.095948 -0.365166 -0.97309550% 0.709636 0.071581 0.116057 0.17911275% 0.851809 0.478706 0.616168 0.807868max 1.437656 1.051356 1.387310 1.521442[8 rows x 4 columns]
reorder_levels method to Series andDataFrame (GH534)get function to DataFrameand Panel (GH521)DataFrame.iterrows method for efficientlyiterating through the rows of a DataFrameDataFrame.to_panel with code adapted fromLongPanel.to_longreindex_axis method added to DataFramelevel option to binary arithmetic functions onDataFrame andSerieslevel option to thereindexandalign methods on Series and DataFrame for broadcasting values acrossa level (GH542,GH552, others)Panel and add IPython completion (GH563)logy option toSeries.plot forlog-scaling on the Y axisindex andheader options toDataFrame.to_stringDataFrame.join to join on index (GH115)Panel.join(GH115)justify argument toDataFrame.to_stringto allow different alignment of column headerssort option to GroupBy to allow disablingsorting of the group keys for potential speedups (GH595)DataFrame.lookup, fancy-indexing analogue for retrieving valuesgiven a sequence of row and column labels (GH338)cummin andcummax on Series and DataFrame to get cumulativeminimum and maximum, respectively (GH647)value_range added as utility function to get min and max of a dataframe(GH288)encoding argument toread_csv,read_table,to_csv andfrom_csv for non-ascii text (GH717)abs method to pandas objectscrosstab function for easily computing frequency tablesisin method to index objectslevel argument toxs method of DataFrame.One of the potentially riskiest API changes in 0.7.0, but also one of the mostimportant, was a complete review of howinteger indexes are handled withregard to label-based indexing. Here is an example:
In [3]:s=Series(randn(10),index=range(0,20,2))In [4]:sOut[4]:0 0.6799192 -0.4571474 0.0418676 1.5031168 -0.84126510 -1.57800312 -0.27372814 1.75524016 -0.70578818 -0.351950dtype: float64In [5]:s[0]Out[5]:0.67991862351992061In [6]:s[2]Out[6]:-0.45714692729799072In [7]:s[4]Out[7]:0.041867372914288915
This is all exactly identical to the behavior before. However, if you ask for a key not contained in the Series, in versions 0.6.1 and prior, Series would fall back on a location-based lookup. This now raises a KeyError:
In [2]: s[1]
KeyError: 1
This change also has the same impact on DataFrame:
In [3]: df = DataFrame(randn(8, 4), index=range(0, 16, 2))
In [4]: df
          0       1       2        3
0   0.88427  0.3363 -0.1787  0.03162
2   0.14451 -0.1415  0.2504  0.58374
4  -1.44779 -0.9186 -1.4996  0.27163
6  -0.26598 -2.4184 -0.2658  0.11503
8  -0.58776  0.3144 -0.8566  0.61941
10  0.10940 -0.7175 -1.0108  0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337  0.3410  0.0424 -0.16037
In [5]: df.ix[3]
KeyError: 3
In order to support purely integer-based indexing, the following methods have been added:
| Method | Description |
|---|---|
| Series.iget_value(i) | Retrieve value stored at location i |
| Series.iget(i) | Alias for iget_value |
| DataFrame.irow(i) | Retrieve the i-th row |
| DataFrame.icol(j) | Retrieve the j-th column |
| DataFrame.iget_value(i, j) | Retrieve the value at row i and column j |
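The methods in the table are the 0.7-era positional accessors; later pandas releases replaced them with .iloc and .iat. A minimal sketch of the equivalent positional lookups with the newer API (the data is hypothetical):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), index=list("abcde"))
df = pd.DataFrame(np.random.randn(4, 3), columns=["x", "y", "z"])

s.iloc[2]      # value stored at location 2 (cf. Series.iget_value(i))
df.iloc[1]     # the row at position 1 (cf. DataFrame.irow(i))
df.iloc[:, 2]  # the column at position 2 (cf. DataFrame.icol(j))
df.iat[1, 2]   # scalar at row 1, column 2 (cf. DataFrame.iget_value(i, j))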
Label-based slicing using ix now requires that the index be sorted (monotonic) unless both the start and endpoint are contained in the index:
In [8]: s = Series(randn(6), index=list('gmkaec'))
In [9]: s
Out[9]:
g    1.507974
m    0.419219
k    0.647633
a   -0.147670
e   -0.759803
c   -0.757308
dtype: float64
Then this is OK:
In [10]: s.ix['k':'e']
Out[10]:
k    0.647633
a   -0.147670
e   -0.759803
dtype: float64
But this is not:
In [12]: s.ix['b':'h']
KeyError 'b'
If the index had been sorted, the “range selection” would have been possible:
In [11]: s2 = s.sort_index()
In [12]: s2
Out[12]:
a   -0.147670
c   -0.757308
e   -0.759803
g    1.507974
k    0.647633
m    0.419219
dtype: float64
In [13]: s2.ix['b':'h']
Out[13]:
c   -0.757308
e   -0.759803
g    1.507974
dtype: float64
[] operator

As a notational convenience, you can pass a sequence of labels or a label slice to a Series when getting and setting values via [] (i.e. the __getitem__ and __setitem__ methods). The behavior will be the same as passing similar input to ix except in the case of integer indexing:
In [14]: s = Series(randn(6), index=list('acegkm'))
In [15]: s
Out[15]:
a   -1.921164
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
m   -0.335468
dtype: float64
In [16]: s[['m', 'a', 'c', 'e']]
Out[16]:
m   -0.335468
a   -1.921164
c   -1.093529
e   -0.592157
dtype: float64
In [17]: s['b':'l']
Out[17]:
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
dtype: float64
In [18]: s['c':'k']
Out[18]:
c   -1.093529
e   -0.592157
g   -0.715074
k   -0.616193
dtype: float64
In the case of integer indexes, the behavior will be exactly as before (shadowing ndarray):
In [19]: s = Series(randn(6), index=range(0, 12, 2))
In [20]: s[[4, 0, 2]]
Out[20]:
4    0.886170
0   -0.392051
2   -0.189537
dtype: float64
In [21]: s[1:5]
Out[21]:
2   -0.189537
4    0.886170
6   -1.125894
8    0.319635
dtype: float64
If you wish to do indexing with sequences and slicing on an integer index with label semantics, use ix.
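A minimal sketch of that distinction in current pandas, where .loc and .iloc have since replaced ix for label-based and position-based access respectively; the data mirrors the integer-index example above.

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(6), index=range(0, 12, 2))

s.iloc[1:5]       # positional: rows 1..4, i.e. labels 2, 4, 6, 8
s.loc[[4, 0, 2]]  # label-based fancy indexing on the integer index
s.loc[2:8]        # label-based slice, endpoints included: labels 2, 4, 6, 8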
- LongPanel class has been completely removed
- Series.sort is called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame's column by doing df[col].sort() instead of the side-effect free method df[col].order() (GH316)
- FutureWarning
- drop added as an optional parameter to DataFrame.reset_index (GH699)
- reset_index on DataFrame with a regular (non-hierarchical) index (GH476)
- level parameter passed (GH545)
- rolling_median by about 5-10x in most typical use cases (GH374)
- get_value and set_value methods to Series, DataFrame, and Panel for very low-overhead access (>2x faster in many cases) to scalar elements (GH437, GH438). set_value is capable of producing an enlarged object.
- Series.from_csv function (GH482)
- melt function to pandas.core.reshape
- level parameter to group by level in Series and DataFrame descriptive statistics (GH313)
- head and tail methods to Series, analogous to DataFrame (GH296)
- Series.isin function which checks if each value is contained in a passed sequence (GH289)
- float_format option to Series.to_string
- skip_footer (GH291) and converters (GH343) options to read_csv and read_table
- drop_duplicates and duplicated functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319); see the sketch after this list
- Series.mad, mean absolute deviation
- QuarterEnd DateOffset (GH321)
- dot to DataFrame (GH65)
- orient option to Panel.from_dict (GH359, GH301)
- orient option to DataFrame.from_dict
- DataFrame.from_records (GH357)
- by argument of DataFrame.sort_index (GH92, GH362)
- get_value and put_value methods to DataFrame (GH360)
- cov instance methods to Series and DataFrame (GH194, GH362)
- kind='bar' option to DataFrame.plot (GH348)
- idxmin and idxmax to Series and DataFrame (GH286)
- read_clipboard function to parse DataFrame from clipboard (GH300)
- nunique function to Series for counting unique elements (GH297)
- DataFrame.to_html for writing DataFrame to HTML (GH387)
- DataFrame.boxplot function (GH368)
- DataFrame.join with vector on argument (GH312)
- legend boolean flag to DataFrame.plot (GH324)
- stack and unstack (GH370)
- pivot_table (GH381)
- raw option to DataFrame.apply for performance if only need ndarray (GH309)
- cache_readonly, resulting in substantial micro-performance enhancements throughout the codebase (GH361)
- MultiIndex.from_tuples
- raw option to DataFrame.apply for getting better performance when
- map_infer speeds up Series.apply and Series.map significantly when passed elementwise Python function, motivated by (GH355)
- Series.order, which also makes np.unique called on a Series faster (GH327)
- DataFrame.align method with standard join options
- parse_dates option to read_csv and read_table methods to optionally try to parse dates in the index columns
- nrows, chunksize, and iterator arguments to read_csv and read_table. The last two return a new TextParser class capable of lazily iterating through chunks of a flat file (GH242)
- DataFrame.join (GH214)
- _get_duplicates function to Index for identifying duplicate values more easily (ENH5c)
- Series.describe for Series containing objects (GH241)
- DataFrame.join when joining on key(s) (GH248)
- __getitem__ (GH253)
- pivot_table convenience function to pandas namespace (GH234)
- Panel.rename_axis function (GH243)
- Panel.take
- set_eng_float_format for alternate DataFrame floating point string formatting (ENH61)
- set_index function for creating a DataFrame index from its existing columns
- groupby hierarchical index level name (GH223)
- DataFrame.to_csv (GH244)
- read_csv and read_table
- DataFrame.xs on mixed-type DataFrame objects by about 5x, regression from 0.3.0 (GH215)
- DataFrame.align method, speeding up binary operations between differently-indexed DataFrame objects by 10-25%
- __repr__ and count on large mixed-type DataFrame objects
- name attribute to Series, now prints as part of Series.__repr__
- isnull and notnull to Series (GH209, GH203)
- Series.align method for aligning two series with choice of join method (ENH56)
- get_level_values to MultiIndex (GH188)
- DataFrame objects via .ix indexing attribute (GH135)
- DataFrame methods get_dtype_counts and property dtypes (ENHdc)
- DataFrame.append to stack DataFrames (ENH1b)
- read_csv tries to sniff delimiters using csv.Sniffer (GH146)
- read_csv can read multiple columns into a MultiIndex; DataFrame's to_csv method writes out a corresponding MultiIndex (GH151)
- DataFrame.rename has a new copy parameter to rename a DataFrame in place (ENHed)
- sortlevel to work by level (GH141)
- isnull and notnull, a regression from v0.3.0 (GH187)
- DataFrame.join so that intermediate aligned copies of the data in each DataFrame argument do not need to be created. Substantial performance increases result (GH176)
- Index.intersection and Index.union
- BlockManager.take resulting in significantly faster take performance on mixed-type DataFrame objects (GH104)
- Series.sort_index
- _ensure_index function resulting in performance savings in type-checking Index objects
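A minimal sketch of the drop_duplicates / duplicated pair mentioned in the list above, using current pandas; the frame and the subset column are hypothetical.

import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b", "b"], "v": [1, 1, 2, 3]})

dup_mask = df.duplicated()                  # boolean Series flagging repeated rows (GH319)
deduped = df.drop_duplicates()              # keep the first occurrence of each row
by_key = df.drop_duplicates(subset=["k"])   # judge uniqueness on column 'k' only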