pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, or not. It is not clear what the result of

```
>>> if pd.Series([False, True, False]):
...
```

should be. Should it be True because it's not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:

```
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
```
If you see that, you need to explicitly choose what you want to do with the object (e.g., use any(), all() or empty). Or, you might want to check whether the pandas object is None:

```
>>> if pd.Series([False, True, False]) is not None:
...     print("I was not None")
I was not None
```

or check whether any value is True:

```
>>> if pd.Series([False, True, False]).any():
...     print("I am any")
I am any
```
To evaluate single-element pandas objects in a boolean context, use the method .bool():

```
In [1]: pd.Series([True]).bool()
Out[1]: True

In [2]: pd.Series([False]).bool()
Out[2]: False

In [3]: pd.DataFrame([[True]]).bool()
Out[3]: True

In [4]: pd.DataFrame([[False]]).bool()
Out[4]: False
```
Comparison operators like == and != return a boolean Series, which is almost always what you want anyway.

```
>>> s = pd.Series(range(5))
>>> s == 4
0    False
1    False
2    False
3    False
4     True
dtype: bool
```
See boolean comparisons for more examples.
in operator¶

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin().

For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
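A short sketch of the difference (the Series and DataFrame contents here are illustrative):

```python
import pandas as pd

# 'in' on a Series tests the index (like dict keys), not the values
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print('a' in s)   # membership in the index
print(1 in s)     # 1 is a value, not an index label

# isin() tests the values instead, elementwise
mask = s.isin([1, 3])

# for DataFrames, 'in' checks the column names
df = pd.DataFrame({'x': [1], 'y': [2]})
print('x' in df)
```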
NaN, Integer NA values and NA type promotions¶

NA representation¶

For lack of NA (missing) support from the ground up in NumPy and Python in general, we were given the difficult choice between either:

- A masked-array solution: an array of data and an array of boolean values indicating whether a value is present or missing
- Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes

For many reasons we chose the latter. After years of production use it has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
However, it comes with a couple of trade-offs which I most certainly have not ignored.
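As a quick illustration, isnull and notnull work the same way regardless of dtype (the values below are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(pd.isnull(s))    # True only where the value is NaN
print(pd.notnull(s))   # elementwise complement of isnull
```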
Support for integer NA¶

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example:

```
In [5]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

In [6]: s
Out[6]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [7]: s.dtype
Out[7]: dtype('int64')

In [8]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])

In [9]: s2
Out[9]:
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [10]: s2.dtype
Out[10]: dtype('float64')
```
This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be "numeric". One possibility is to use dtype=object arrays instead.
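A minimal sketch of the object-dtype alternative (the index labels here are illustrative): the integers are kept as Python objects alongside NaN, at the cost of memory and performance.

```python
import pandas as pd

# an object-dtype Series can hold Python ints and NaN side by side
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype=object)
s2 = s.reindex(['a', 'b', 'f'])
print(s2.dtype)   # stays object; 'a' and 'b' keep their integer values
print(s2['a'])
```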
NA type promotions¶

When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:

| Typeclass | Promotion dtype for storing NAs |
|---|---|
| floating | no change |
| object | no change |
| integer | cast to float64 |
| boolean | cast to object |
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice. Some explanation for the motivation is in the next section.
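The promotions in the table can be observed directly; a sketch (the values are illustrative):

```python
import pandas as pd

b = pd.Series([True, False])
i = pd.Series([1, 2])
f = pd.Series([1.0, 2.0])

# reindexing past the end introduces an NA, triggering promotion
print(b.reindex([0, 1, 2]).dtype)  # boolean is cast to object
print(i.reindex([0, 1, 2]).dtype)  # integer is cast to float64
print(f.reindex([0, 1, 2]).dtype)  # floating is unchanged
```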
Why not make NumPy like R?¶

Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason is the NumPy type hierarchy:

| Typeclass | Dtypes |
|---|---|
| numpy.floating | float16, float32, float64, float128 |
| numpy.integer | int8, int16, int32, int64 |
| numpy.unsignedinteger | uint8, uint16, uint32, uint64 |
| numpy.object_ | object_ |
| numpy.bool_ | bool_ |
| numpy.character | string_, unicode_ |
The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.
An alternate approach is that of using masked arrays. A masked array is an array of data with an associated boolean mask denoting whether each value should be considered NA or not. I am personally not in love with this approach as I feel that overall it places a fairly heavy burden on the user and the library implementer. Additionally, it exacts a fairly high performance cost when working with numerical data compared with the simple approach of using NaN. Thus, I have chosen the Pythonic "practicality beats purity" approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Integer indexing¶

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix. The following code will generate exceptions:

```
s = pd.Series(range(5))
s[-1]
df = pd.DataFrame(np.random.randn(5, 4))
df
df.ix[-2:]
```
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).
Label-based slicing conventions¶

Non-monotonic indexes require exact matches¶

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.
```
In [11]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))

In [12]: df.index.is_monotonic_increasing
Out[12]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [13]: df.loc[0:4, :]
Out[13]:
   data
2     0
3     1
3     2
4     3

# the slice is entirely outside the index, so an empty DataFrame is returned
In [14]: df.loc[13:15, :]
Out[14]:
Empty DataFrame
Columns: [data]
Index: []
```
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
```
In [15]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=['data'], data=list(range(6)))

In [16]: df.index.is_monotonic_increasing
Out[16]: False

# OK because 2 and 4 are in the index
In [17]: df.loc[2:4, :]
Out[17]:
   data
2     0
3     1
1     2
4     3
```

```
# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'
```
Endpoints are inclusive¶

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:

```
In [18]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

In [19]: s
Out[19]:
a    1.544821
b   -1.708552
c    1.545458
d   -0.735738
e   -0.649091
f   -0.403878
dtype: float64
```
Suppose we wished to slice from c to e. Using integers, this would be:

```
In [20]: s[2:5]
Out[20]:
c    1.545458
d   -0.735738
e   -0.649091
dtype: float64
```
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

```
s.ix['c':'e' + 1]
```
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design decision to make label-based slicing include both endpoints:

```
In [21]: s.ix['c':'e']
Out[21]:
c    1.545458
d   -0.735738
e   -0.649091
dtype: float64
```
This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
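The contrast can be sketched with the modern .loc/.iloc accessors (which superseded .ix); the values here are illustrative:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list('abcd'))

print(s.loc['b':'c'])   # label-based: BOTH endpoints included (two rows)
print(s.iloc[1:3])      # position-based: endpoint excluded, Python-style
print(list(s)[1:3])     # plain Python slicing, also endpoint-exclusive
```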
Miscellaneous indexing gotchas¶

Reindex versus ix gotchas¶

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a pandas object:

```
In [22]: df = pd.DataFrame(np.random.randn(6, 4),
   ....:                   columns=['one', 'two', 'three', 'four'],
   ....:                   index=list('abcdef'))
   ....:

In [23]: df
Out[23]:
        one       two     three      four
a -2.474932  0.975891 -0.204206  0.452707
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
d -0.418002 -0.690649  0.128522  0.429260
e  1.207515 -1.308877 -0.548792 -1.520879
f  1.153696  0.609378 -0.825763  0.218223

In [24]: df.ix[['b', 'c', 'e']]
Out[24]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879
```
This is, of course, completely equivalent in this case to using the reindex method:

```
In [25]: df.reindex(['b', 'c', 'e'])
Out[25]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879
```
Some might conclude that ix and reindex are 100% equivalent based on this. This is indeed true except in the case of integer indexing. For example, the above operation could alternately have been expressed as:

```
In [26]: df.ix[[1, 2, 4]]
Out[26]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879
```
If you pass [1, 2, 4] to reindex you will get another thing entirely:

```
In [27]: df.reindex([1, 2, 4])
Out[27]:
   one  two  three  four
1  NaN  NaN    NaN   NaN
2  NaN  NaN    NaN   NaN
4  NaN  NaN    NaN   NaN
```
So it's important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings:

```
In [28]: s = pd.Series([1, 2, 3], index=['a', 0, 1])

In [29]: s
Out[29]:
a    1
0    2
1    3
dtype: int64

In [30]: s.ix[[0, 1]]
Out[30]:
0    2
1    3
dtype: int64

In [31]: s.reindex([0, 1])
Out[31]:
0    2
1    3
dtype: int64
```
Because the index in this case does not contain solely integers, ix falls back on integer indexing. By contrast, reindex only looks for the values passed in the index, thus finding the integers 0 and 1. While it would be possible to insert some logic to check whether a passed sequence is all contained in the index, that logic would exact a very high cost in large data sets.
Reindex potentially changes underlying Series dtype¶

The use of reindex_like can potentially change the dtype of a Series.

```
In [32]: series = pd.Series([1, 2, 3])

In [33]: x = pd.Series([True])

In [34]: x.dtype
Out[34]: dtype('bool')

In [35]: x = pd.Series([True]).reindex_like(series)

In [36]: x.dtype
Out[36]: dtype('O')
```
This is because reindex_like silently inserts NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See this old issue for a more detailed discussion.
When parsing multiple text file columns into a single date column, the new date column is prepended to the data and the index_col specification is indexed off of the new set of columns rather than the original ones:

```
In [37]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900

In [38]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [39]: df = pd.read_csv('tmp.csv', header=None,
   ....:                  parse_dates=date_spec,
   ....:                  keep_date_col=True,
   ....:                  index_col=0)
   ....:

# index_col=0 refers to the combined column "nominal" and not the original
# first column of 'KORD' strings
In [40]: df
Out[40]:
                                 actual     0         1          2          3  \
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  19990127   19:00:00   18:56:00
1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  19990127   20:00:00   19:56:00
1999-01-27 21:00:00 1999-01-27 20:56:00  KORD  19990127   21:00:00   20:56:00
1999-01-27 21:00:00 1999-01-27 21:18:00  KORD  19990127   21:00:00   21:18:00
1999-01-27 22:00:00 1999-01-27 21:56:00  KORD  19990127   22:00:00   21:56:00
1999-01-27 23:00:00 1999-01-27 22:56:00  KORD  19990127   23:00:00   22:56:00

                        4
nominal
1999-01-27 19:00:00  0.81
1999-01-27 20:00:00  0.01
1999-01-27 21:00:00 -0.59
1999-01-27 21:00:00 -0.99
1999-01-27 22:00:00 -0.59
1999-01-27 23:00:00 -0.59
```
Differences with NumPy¶

For Series and DataFrame objects, var normalizes by N-1 to produce unbiased estimates of the sample variance, while NumPy's var normalizes by N, which measures the variance of the sample. Note that cov normalizes by N-1 in both pandas and NumPy.
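A small sketch of the difference (the data are illustrative); NumPy's ddof=1 argument reproduces the pandas behavior:

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]

# sum of squared deviations is 5.0; pandas divides by N-1 = 3
print(pd.Series(data).var())            # 1.666...
# NumPy divides by N = 4 by default
print(np.var(np.array(data)))           # 1.25
# ddof=1 makes NumPy match pandas
print(np.var(np.array(data), ddof=1))   # 1.666...
```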
Thread-safety¶

As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the DataFrame.copy method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.

See this link for more information.
HTML Table Parsing¶

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function read_html.
Issues with lxml

- Benefits
  - lxml is very fast.
  - lxml requires Cython to install correctly.
- Drawbacks
  - lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.
  - In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend will use html5lib if lxml fails to parse.
  - It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend

- The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend.
Issues with BeautifulSoup4 using html5lib as a backend

- Benefits
  - html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than just, e.g., dropping an element without notifying you.
  - html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is "correct", since the process of fixing markup does not have a single definition.
  - html5lib is pure Python and requires no additional build steps beyond its own installation.
- Drawbacks
  - The biggest drawback to using html5lib is that it is slow as molasses. However, consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output). For very large tables, this might not be true.
Issues with using Anaconda

- Anaconda ships with lxml version 3.2.0; the following workaround for Anaconda was successfully used to deal with the versioning issues surrounding lxml and BeautifulSoup4.
Note

Unless you have both:

- A strong restriction on the upper bound of the runtime of some code that incorporates read_html()
- Complete knowledge that the HTML you will be parsing will be 100% valid at all times

then you should install html5lib and things will work swimmingly without you having to muck around with conda. If you want the best of both worlds then install both html5lib and lxml. If you do install lxml then you need to perform the following commands to ensure that lxml will work correctly:

```
# remove the included version
conda remove lxml

# install the latest version of lxml
pip install 'git+git://github.com/lxml/lxml.git'

# install the latest version of beautifulsoup4
pip install 'bzr+lp:beautifulsoup'
```

Note that you need bzr and git installed to perform the last two operations.
Byte-Ordering Issues¶

Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like

```
Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler
```
To deal with this issue you should convert the underlying NumPy array to the native system byte order before passing it to Series/DataFrame/Panel constructors using something similar to the following:

```
In [41]: x = np.array(list(range(10)), '>i4')  # big endian

In [42]: newx = x.byteswap().newbyteorder()  # force native byteorder

In [43]: s = pd.Series(newx)
```
See the NumPy documentation on byte order for more details.
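An equivalent sketch using astype, which returns a native-order copy rather than swapping the buffer in place (the array contents here are illustrative):

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3], dtype='>i4')           # explicitly big-endian
native = x.astype(x.dtype.newbyteorder('='))   # copy in native byte order
s = pd.Series(native)                          # safe to hand to pandas
```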