
Caveats and Gotchas

Using If/Truth Statements with pandas

pandas follows the numpy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations: and, or, or not. It is not clear what the result of

>>> if pd.Series([False, True, False]):
...     ...

should be. Should it be True because it is not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

If you see that, you need to explicitly choose what you want to do with it (e.g., use any(), all() or empty). Or, you might want to compare whether the pandas object is None:

>>> if pd.Series([False, True, False]) is not None:
...     print("I was not None")
I was not None

or check whether any value is True:

>>> if pd.Series([False, True, False]).any():
...     print("I am any")
I am any

To evaluate single-element pandas objects in a boolean context, use the method .bool():

In [1]: pd.Series([True]).bool()
Out[1]: True

In [2]: pd.Series([False]).bool()
Out[2]: False

In [3]: pd.DataFrame([[True]]).bool()
Out[3]: True

In [4]: pd.DataFrame([[False]]).bool()
Out[4]: False

Bitwise boolean

Bitwise boolean operators like == and != will return a boolean Series, which is almost always what you want anyway.

>>> s = pd.Series(range(5))
>>> s == 4
0    False
1    False
2    False
3    False
4     True
dtype: bool

See boolean comparisons for more examples.

Using the in operator

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin().

For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
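To make the contrast concrete, a minimal sketch (the Series and DataFrame here are invented for illustration):

```python
import pandas as pd

s = pd.Series(range(4), index=['a', 'b', 'c', 'd'])

# `in` tests the index (like the keys of a dict), not the values
print('b' in s)           # True  -- 'b' is an index label
print(2 in s)             # False -- 2 is a value, not a label

# to test membership among the values, use isin()
print(s.isin([2]).any())  # True

# for a DataFrame, `in` tests the column labels
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
print('x' in df)          # True
print(1 in df)            # False -- 1 is a value, not a column name
```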

NaN, Integer NA values and NA type promotions

Choice of NA representation

For lack of NA (missing) support from the ground up in NumPy and Python in general, we were given the difficult choice between either:

  • A masked array solution: an array of data and an array of boolean values indicating whether a value is present or missing
  • Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes

For many reasons we chose the latter. After years of production use it has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
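A short sketch of these detection functions (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# isnull / notnull detect NA values across dtypes
print(pd.isnull(s).tolist())   # [False, True, False]
print(pd.notnull(s).tolist())  # [True, False, True]

# they also recognize None in object arrays
o = pd.Series(['x', None, 'z'])
print(pd.isnull(o).tolist())   # [False, True, False]
```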

However, this comes with a couple of trade-offs which I most certainly have not ignored.

Support for integer NA

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example:

In [5]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

In [6]: s
Out[6]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [7]: s.dtype
Out[7]: dtype('int64')

In [8]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])

In [9]: s2
Out[9]:
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [10]: s2.dtype
Out[10]: dtype('float64')

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be "numeric". One possibility is to use dtype=object arrays instead.
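A minimal sketch of that dtype=object workaround (invented data): the integers survive reindexing, at the cost of memory and vectorized performance:

```python
import pandas as pd

# store the integers in an object-dtype Series
s = pd.Series([1, 2, 3], index=list('abc'), dtype=object)

# reindexing introduces an NA but does not cast to float
s2 = s.reindex(['a', 'b', 'c', 'f'])
print(s2.dtype)            # object
print(s2['a'])             # 1 -- still a Python int, not 1.0
print(pd.isnull(s2['f']))  # True
```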

NA type promotions

When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object

While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice. Some explanation for the motivation is in the next section.
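The promotions in the table can be observed directly; a sketch with invented data:

```python
import pandas as pd

# integer is cast to float64 when an NA is introduced
i = pd.Series([1, 2, 3])
print(i.reindex([0, 1, 2, 3]).dtype)   # float64

# boolean is cast to object
b = pd.Series([True, False, True])
print(b.reindex([0, 1, 2, 3]).dtype)   # object

# floating dtype is unchanged
f = pd.Series([1.0, 2.0])
print(f.reindex([0, 1, 2]).dtype)      # float64
```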

Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason is the NumPy type hierarchy:

Typeclass              Dtypes
numpy.floating         float16, float32, float64, float128
numpy.integer          int8, int16, int32, int64
numpy.unsignedinteger  uint8, uint16, uint32, uint64
numpy.object_          object_
numpy.bool_            bool_
numpy.character        string_, unicode_

The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.

An alternate approach is that of using masked arrays. A masked array is an array of data with an associated boolean mask denoting whether each value should be considered NA or not. I am personally not in love with this approach as I feel that overall it places a fairly heavy burden on the user and the library implementer. Additionally, it exacts a fairly high performance cost when working with numerical data compared with the simple approach of using NaN. Thus, I have chosen the Pythonic "practicality beats purity" approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.

Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix. The following code will generate exceptions:

s = pd.Series(range(5))
s[-1]
df = pd.DataFrame(np.random.randn(5, 4))
df
df.ix[-2:]

This deliberate decision was made to prevent ambiguities and subtle bugs (manyusers reported finding bugs when the API change was made to stop “falling back”on position-based indexing).

Label-based slicing conventions

Non-monotonic indexes require exact matches

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [11]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=range(5))

In [12]: df.index.is_monotonic_increasing
Out[12]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [13]: df.loc[0:4, :]
Out[13]:
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [14]: df.loc[13:15, :]
Out[14]:
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must beunique members of the index.

In [15]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=['data'], data=range(6))

In [16]: df.index.is_monotonic_increasing
Out[16]: False

# OK because 2 and 4 are in the index
In [17]: df.loc[2:4, :]
Out[17]:
   data
2     0
3     1
1     2
4     3

# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'

Endpoints are inclusive

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:

In [18]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

In [19]: s
Out[19]:
a    1.544821
b   -1.708552
c    1.545458
d   -0.735738
e   -0.649091
f   -0.403878
dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be:

In [20]: s[2:5]
Out[20]:
c    1.545458
d   -0.735738
e   -0.649091
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

s.ix['c':'e'+1]

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design decision to make label-based slicing include both endpoints:

In [21]: s.ix['c':'e']
Out[21]:
c    1.545458
d   -0.735738
e   -0.649091
dtype: float64

This is most definitely a “practicality beats purity” sort of thing, but it issomething to watch out for if you expect label-based slicing to behave exactlyin the way that standard Python integer slicing works.

Miscellaneous indexing gotchas

Reindex versus ix gotchas

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a pandas object:

In [22]: df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'],
   ....:                   index=list('abcdef'))
   ....:

In [23]: df
Out[23]:
        one       two     three      four
a -2.474932  0.975891 -0.204206  0.452707
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
d -0.418002 -0.690649  0.128522  0.429260
e  1.207515 -1.308877 -0.548792 -1.520879
f  1.153696  0.609378 -0.825763  0.218223

In [24]: df.ix[['b', 'c', 'e']]
Out[24]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879

This is, of course, completely equivalent in this case to using the reindex method:

In [25]: df.reindex(['b', 'c', 'e'])
Out[25]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879

Some might conclude that ix and reindex are 100% equivalent based on this. This is indeed true except in the case of integer indexing. For example, the above operation could alternately have been expressed as:

In [26]: df.ix[[1, 2, 4]]
Out[26]:
        one       two     three      four
b  3.478418 -0.591538 -0.508560  0.047946
c -0.170009 -1.615606 -0.894382  1.334681
e  1.207515 -1.308877 -0.548792 -1.520879

If you pass [1, 2, 4] to reindex you will get another thing entirely:

In [27]: df.reindex([1, 2, 4])
Out[27]:
   one  two  three  four
1  NaN  NaN    NaN   NaN
2  NaN  NaN    NaN   NaN
4  NaN  NaN    NaN   NaN

So it's important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings:

In [28]: s = pd.Series([1, 2, 3], index=['a', 0, 1])

In [29]: s
Out[29]:
a    1
0    2
1    3
dtype: int64

In [30]: s.ix[[0, 1]]
Out[30]:
0    2
1    3
dtype: int64

In [31]: s.reindex([0, 1])
Out[31]:
0    2
1    3
dtype: int64

Because the index in this case does not contain solely integers, ix falls back on integer indexing. By contrast, reindex only looks for the values passed in the index, thus finding the integers 0 and 1. While it would be possible to insert some logic to check whether a passed sequence is all contained in the index, that logic would exact a very high cost in large data sets.

Reindex potentially changes underlying Series dtype

The use of reindex_like can potentially change the dtype of a Series.

In [32]: series = pd.Series([1, 2, 3])

In [33]: x = pd.Series([True])

In [34]: x.dtype
Out[34]: dtype('bool')

In [35]: x = pd.Series([True]).reindex_like(series)

In [36]: x.dtype
Out[36]: dtype('O')

This is because reindex_like silently inserts NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See this old issue for a more detailed discussion.

Parsing Dates from Text Files

When parsing multiple text file columns into a single date column, the new date column is prepended to the data, and the index_col specification is indexed off of the new set of columns rather than the original ones:

In [37]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900

In [38]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [39]: df = pd.read_csv('tmp.csv', header=None,
   ....:                  parse_dates=date_spec,
   ....:                  keep_date_col=True,
   ....:                  index_col=0)
   ....:

# index_col=0 refers to the combined column "nominal" and not the original
# first column of 'KORD' strings
In [40]: df
Out[40]:
                                 actual     0         1          2          3  \
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  19990127   19:00:00   18:56:00
1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  19990127   20:00:00   19:56:00
1999-01-27 21:00:00 1999-01-27 20:56:00  KORD  19990127   21:00:00   20:56:00
1999-01-27 21:00:00 1999-01-27 21:18:00  KORD  19990127   21:00:00   21:18:00
1999-01-27 22:00:00 1999-01-27 21:56:00  KORD  19990127   22:00:00   21:56:00
1999-01-27 23:00:00 1999-01-27 22:56:00  KORD  19990127   23:00:00   22:56:00

                        4
nominal
1999-01-27 19:00:00  0.81
1999-01-27 20:00:00  0.01
1999-01-27 21:00:00 -0.59
1999-01-27 21:00:00 -0.99
1999-01-27 22:00:00 -0.59
1999-01-27 23:00:00 -0.59

Differences with NumPy

For Series and DataFrame objects, var normalizes by N-1 to produce unbiased estimates of the sample variance, while NumPy's var normalizes by N, which measures the variance of the sample. Note that cov normalizes by N-1 in both pandas and NumPy.
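The difference is just the delta degrees of freedom (ddof) used in the denominator; a sketch with invented data:

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]
s = pd.Series(data)
a = np.array(data)

print(s.var())         # pandas: sum((x - mean)**2) / (N - 1)
print(a.var())         # NumPy:  sum((x - mean)**2) / N
print(a.var(ddof=1))   # NumPy with ddof=1 matches pandas
```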

Thread-safety

As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the DataFrame.copy method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
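A sketch of that locking recommendation; the lock name and worker function are invented for illustration:

```python
import threading
import pandas as pd

df = pd.DataFrame({'a': range(100)})   # shared among threads
copy_lock = threading.Lock()           # hypothetical lock guarding the copy
results = []

def worker():
    # DataFrame.copy is the known trouble spot, so hold the lock
    # for the duration of the copy
    with copy_lock:
        results.append(df.copy())

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 4 independent copies
```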

See this link for more information.

HTML Table Parsing

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function read_html.

Issues with lxml

  • Benefits
    • lxml is very fast.
    • lxml requires Cython to install correctly.
  • Drawbacks
    • lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.
    • In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend will use html5lib if lxml fails to parse.
    • It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result (provided everything else is valid) even if lxml fails.

Issues with BeautifulSoup4 using lxml as a backend

  • The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend.

Issues with BeautifulSoup4 using html5lib as a backend

  • Benefits
    • html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than just, e.g., dropping an element without notifying you.
    • html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is "correct", since the process of fixing markup does not have a single definition.
    • html5lib is pure Python and requires no additional build steps beyond its own installation.
  • Drawbacks
    • The biggest drawback to using html5lib is that it is slow as molasses. However, consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output). For very large tables, this might not be true.

Issues with using Anaconda

Note

Unless you have both:

  • A strong restriction on the upper bound of the runtime of some code that incorporates read_html()
  • Complete knowledge that the HTML you will be parsing will be 100% valid at all times

then you should install html5lib and things will work swimmingly without you having to muck around with conda. If you want the best of both worlds then install both html5lib and lxml. If you do install lxml then you need to perform the following commands to ensure that lxml will work correctly:

# remove the included version
conda remove lxml

# install the latest version of lxml
pip install 'git+git://github.com/lxml/lxml.git'

# install the latest version of beautifulsoup4
pip install 'bzr+lp:beautifulsoup'

Note that you need bzr and git installed to perform the last two operations.

Byte-Ordering Issues

Occasionally you may have to deal with data that were created on a machine witha different byte order than the one on which you are running Python. A common symptom of this issue is an error like

Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler

To deal with this issue you should convert the underlying NumPy array to the native system byte order before passing it to the Series/DataFrame/Panel constructors, using something similar to the following:

In [41]: x = np.array(list(range(10)), '>i4')  # big endian

In [42]: newx = x.byteswap().newbyteorder()  # force native byteorder

In [43]: s = pd.Series(newx)

See the NumPy documentation on byte order for more details.
