Intro to data structures#
We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior regarding data types, indexing, axis labeling, and alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:
In [1]: import numpy as np

In [2]: import pandas as pd
Fundamentally, data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.
Series#
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
s = pd.Series(data, index=index)
Here, data can be many different things:
a Python dict
an ndarray
a scalar value (like 5)
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
From ndarray#
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
In [3]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [4]: s
Out[4]:
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64
Note
pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
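For instance, nothing stops you from building a Series with repeated labels; label-based lookups then return every matching entry (a minimal sketch, not part of the original examples):

s_dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

s_dup.index.is_unique   # False

s_dup["a"]              # returns a Series with both values labeled "a"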
From dict#
Series can be instantiated from dicts:
In [7]: d = {"b": 1, "a": 0, "c": 2}

In [8]: pd.Series(d)
Out[8]:
b    1
a    0
c    2
dtype: int64
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
In [9]: d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [10]: pd.Series(d)
Out[10]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [11]: pd.Series(d, index=["b", "c", "d", "a"])
Out[11]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
Note
NaN (not a number) is the standard missing data marker used in pandas.
From scalar value#
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
In [12]: pd.Series(5.0, index=["a", "b", "c", "d", "e"])
Out[12]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
Series is ndarray-like#
Series acts very similarly to a ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.
In [13]: s.iloc[0]
Out[13]: 0.4691122999071863

In [14]: s.iloc[:3]
Out[14]:
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [15]: s[s > s.median()]
Out[15]:
a    0.469112
e    1.212112
dtype: float64

In [16]: s.iloc[[4, 3, 1]]
Out[16]:
e    1.212112
d   -1.135632
b   -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]:
a    1.598575
b    0.753623
c    0.221118
d    0.321219
e    3.360575
dtype: float64
Note
We will address array-based indexing like s.iloc[[4, 3, 1]] in the section on indexing.
Like a NumPy array, a pandas Series has a single dtype.
In [18]: s.dtype
Out[18]: dtype('float64')
This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable integer data type. See dtypes for more.
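As a quick illustration (a sketch, not part of the original examples), both of the Series below use extension dtypes rather than plain NumPy dtypes:

pd.Series(["a", "b", "a"], dtype="category").dtype   # CategoricalDtype(categories=['a', 'b'], ordered=False)

pd.Series([1, 2, None], dtype="Int64").dtype         # Int64Dtype(), a nullable integer dtype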
If you need the actual array backing a Series, use Series.array.
In [19]: s.array
Out[19]:
<NumpyExtensionArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64
Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().
In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])
Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
Series is dict-like#
A Series is also like a fixed-size dict in that you can get and set values by index label:
In [21]:s["a"]Out[21]:0.4691122999071863In [22]:s["e"]=12.0In [23]:sOut[23]:a 0.469112b -0.282863c -1.509059d -1.135632e 12.000000dtype: float64In [24]:"e"insOut[24]:TrueIn [25]:"f"insOut[25]:False
If a label is not contained in the index, an exception is raised:
In [26]:s["f"]---------------------------------------------------------------------------KeyErrorTraceback (most recent call last)File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, inIndex.get_loc(self, key)3804try:->3805returnself._engine.get_loc(casted_key)3806exceptKeyErroraserr:File index.pyx:167, inpandas._libs.index.IndexEngine.get_loc()File index.pyx:196, inpandas._libs.index.IndexEngine.get_loc()File pandas/_libs/hashtable_class_helper.pxi:7081, inpandas._libs.hashtable.PyObjectHashTable.get_item()File pandas/_libs/hashtable_class_helper.pxi:7089, inpandas._libs.hashtable.PyObjectHashTable.get_item()KeyError: 'f'Theaboveexceptionwasthedirectcauseofthefollowingexception:KeyErrorTraceback (most recent call last)CellIn[26],line1---->1s["f"]File ~/work/pandas/pandas/pandas/core/series.py:1121, inSeries.__getitem__(self, key)1118returnself._values[key]1120elifkey_is_scalar:->1121returnself._get_value(key)1123# Convert generator to list before going through hashable part1124# (We will iterate through the generator there to check for slices)1125ifis_iterator(key):File ~/work/pandas/pandas/pandas/core/series.py:1237, inSeries._get_value(self, label, takeable)1234returnself._values[label]1236# Similar to Index.get_value, but we do not fall back to positional->1237loc=self.index.get_loc(label)1239ifis_integer(loc):1240returnself._values[loc]File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, inIndex.get_loc(self, key)3807ifisinstance(casted_key,slice)or(3808isinstance(casted_key,abc.Iterable)3809andany(isinstance(x,slice)forxincasted_key)3810):3811raiseInvalidIndexError(key)->3812raiseKeyError(key)fromerr3813exceptTypeError:3814# If we have a listlike key, _check_indexing_error will raise3815# InvalidIndexError. Otherwise we fall through and re-raise3816# the TypeError.3817self._check_indexing_error(key)KeyError: 'f'
Using the Series.get() method, a missing label will return None or specified default:
In [27]:s.get("f")In [28]:s.get("f",np.nan)Out[28]:nan
These labels can also be accessed by attribute.
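For example (a small sketch using the s defined above; attribute access only works for labels that are valid Python identifiers):

s.b   # same value as s["b"]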
Vectorized operations and label alignment with Series#
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.
In [29]: s + s
Out[29]:
a     0.938225
b    -0.565727
c    -3.018117
d    -2.271265
e    24.000000
dtype: float64

In [30]: s * 2
Out[30]:
a     0.938225
b    -0.565727
c    -3.018117
d    -2.271265
e    24.000000
dtype: float64

In [31]: np.exp(s)
Out[31]:
a         1.598575
b         0.753623
c         0.221118
d         0.321219
e    162754.791419
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
In [32]: s.iloc[1:] + s.iloc[:-1]
Out[32]:
a         NaN
b   -0.565727
c   -3.018117
d   -2.271265
e         NaN
dtype: float64
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
Note
In general, we chose to make the default result of operations between differently indexed objects yield the union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna function.
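For instance, continuing the example above (a sketch; dropna() is the documented way to discard the unmatched labels):

(s.iloc[1:] + s.iloc[:-1]).dropna()   # drops the "a" and "e" rows, which are NaN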
Name attribute#
Series also has a name attribute:
In [33]: s = pd.Series(np.random.randn(5), name="something")

In [34]: s
Out[34]:
0   -0.494929
1    1.071804
2    0.721555
3   -0.706771
4   -1.039575
Name: something, dtype: float64

In [35]: s.name
Out[35]: 'something'
The Series name can be assigned automatically in many cases, in particular, when selecting a single column from a DataFrame, the name will be assigned the column label.
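A quick sketch of that behavior (the small DataFrame here is hypothetical and just for illustration):

tmp = pd.DataFrame({"col": [1, 2, 3]})

tmp["col"].name   # 'col'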
You can rename a Series with the pandas.Series.rename() method.
In [36]: s2 = s.rename("different")

In [37]: s2.name
Out[37]: 'different'
Note that s and s2 refer to different objects.
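You can check this directly (a sketch):

s is s2   # False: rename() returned a new Series object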
DataFrame#
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input databased on common sense rules.
From dict of Series or dicts#
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.
In [38]: d = {
   ....:     "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
   ....:     "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
   ....: }
   ....:

In [39]: df = pd.DataFrame(d)

In [40]: df
Out[40]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [41]: pd.DataFrame(d, index=["d", "b", "a"])
Out[41]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [42]: pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
Out[42]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
The row and column labels can be accessed respectively by accessing the index and columns attributes:
Note
When a particular set of columns is passed along with a dict of data, thepassed columns override the keys in the dict.
In [43]: df.index
Out[43]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [44]: df.columns
Out[44]: Index(['one', 'two'], dtype='object')
From dict of ndarrays / lists#
All ndarrays must share the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.
In [45]: d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

In [46]: pd.DataFrame(d)
Out[46]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [47]: pd.DataFrame(d, index=["a", "b", "c", "d"])
Out[47]:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
From structured or record array#
This case is handled identically to a dict of arrays.
In [48]: data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "a10")])

In [49]: data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

In [50]: pd.DataFrame(data)
Out[50]:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [51]: pd.DataFrame(data, index=["first", "second"])
Out[51]:
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [52]: pd.DataFrame(data, columns=["C", "A", "B"])
Out[52]:
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
Note
DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
From a list of dicts#
In [53]: data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

In [54]: pd.DataFrame(data2)
Out[54]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [55]: pd.DataFrame(data2, index=["first", "second"])
Out[55]:
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [56]: pd.DataFrame(data2, columns=["a", "b"])
Out[56]:
   a   b
0  1   2
1  5  10
From a dict of tuples#
You can automatically create a MultiIndexed frame by passing a tuples dictionary.
In [57]: pd.DataFrame(
   ....:     {
   ....:         ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
   ....:         ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
   ....:         ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
   ....:         ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
   ....:         ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
   ....:     }
   ....: )
   ....:
Out[57]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
From a Series#
The result will be a DataFrame with the same index as the input Series, andwith one column whose name is the original name of the Series (only if no othercolumn name provided).
In [58]: ser = pd.Series(range(3), index=list("abc"), name="ser")

In [59]: pd.DataFrame(ser)
Out[59]:
   ser
a    0
b    1
c    2
From a list of namedtuples#
The field names of the first namedtuple in the list determine the columns of the DataFrame. The remaining namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the DataFrame. If any of those tuples is shorter than the first namedtuple then the later columns in the corresponding row are marked as missing values. If any are longer than the first namedtuple, a ValueError is raised.
In [60]: from collections import namedtuple

In [61]: Point = namedtuple("Point", "x y")

In [62]: pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])
Out[62]:
   x  y
0  0  0
1  0  3
2  2  3

In [63]: Point3D = namedtuple("Point3D", "x y z")

In [64]: pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])
Out[64]:
   x  y    z
0  0  0  0.0
1  0  3  5.0
2  2  3  NaN
From a list of dataclasses#
Data Classes as introduced in PEP 557 can be passed into the DataFrame constructor. Passing a list of dataclasses is equivalent to passing a list of dictionaries.
Please be aware that all values in the list should be dataclasses; mixing types in the list will result in a TypeError.
In [65]: from dataclasses import make_dataclass

In [66]: Point = make_dataclass("Point", [("x", int), ("y", int)])

In [67]: pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
Out[67]:
   x  y
0  0  0
1  0  3
2  2  3
Missing data
To construct a DataFrame with missing data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing. See Missing data for more.
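A small sketch of the masked-array path (not from the original page; the column names are illustrative):

masked = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]], mask=[[False, True], [False, False]])

pd.DataFrame(masked, columns=["A", "B"])   # the masked entry shows up as NaN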
Alternate constructors#
DataFrame.from_dict

DataFrame.from_dict() takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.
In [68]: pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))
Out[68]:
   A  B
0  1  4
1  2  5
2  3  6
If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column names:
In [69]: pd.DataFrame.from_dict(
   ....:     dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
   ....:     orient="index",
   ....:     columns=["one", "two", "three"],
   ....: )
   ....:
Out[69]:
   one  two  three
A    1    2      3
B    4    5      6
DataFrame.from_records

DataFrame.from_records() takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype.
In [70]: data
Out[70]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [71]: pd.DataFrame.from_records(data, index="C")
Out[71]:
          A    B
C
b'Hello'  1  2.0
b'World'  2  3.0
Column selection, addition, deletion#
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:
In [72]:df["one"]Out[72]:a 1.0b 2.0c 3.0d NaNName: one, dtype: float64In [73]:df["three"]=df["one"]*df["two"]In [74]:df["flag"]=df["one"]>2In [75]:dfOut[75]: one two three flaga 1.0 1.0 1.0 Falseb 2.0 2.0 4.0 Falsec 3.0 3.0 9.0 Trued NaN 4.0 NaN False
Columns can be deleted or popped like with a dict:
In [76]:deldf["two"]In [77]:three=df.pop("three")In [78]:dfOut[78]: one flaga 1.0 Falseb 2.0 Falsec 3.0 Trued NaN False
When inserting a scalar value, it will naturally be propagated to fill the column:
In [79]:df["foo"]="bar"In [80]:dfOut[80]: one flag fooa 1.0 False barb 2.0 False barc 3.0 True bard NaN False bar
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:
In [81]:df["one_trunc"]=df["one"][:2]In [82]:dfOut[82]: one flag foo one_trunca 1.0 False bar 1.0b 2.0 False bar 2.0c 3.0 True bar NaNd NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
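For example (a sketch on a copy of the four-row df built above; the column names are just illustrative):

tmp = df.copy()

tmp["rank"] = np.arange(len(tmp))   # length matches the index, so this works
# tmp["bad"] = np.arange(3)         # would raise ValueError: length does not match index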
By default, columns get inserted at the end. DataFrame.insert() inserts at a particular location in the columns:
In [83]: df.insert(1, "bar", df["one"])

In [84]: df
Out[84]:
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN
Assigning new columns in method chains#
Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.
In [85]: iris = pd.read_csv("data/iris.data")

In [86]: iris.head()
Out[86]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [87]: iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"]).head()
Out[87]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000
In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.
In [88]: iris.assign(sepal_ratio=lambda x: (x["SepalWidth"] / x["SepalLength"])).head()
Out[88]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000
assign() always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the DataFrame at hand. This is common when using assign() in a chain of operations. For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:
In [89]: (
   ....:     iris.query("SepalLength > 5")
   ....:     .assign(
   ....:         SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ....:         PetalRatio=lambda x: x.PetalWidth / x.PetalLength,
   ....:     )
   ....:     .plot(kind="scatter", x="SepalRatio", y="PetalRatio")
   ....: )
   ....:
Out[89]: <Axes: xlabel='SepalRatio', ylabel='PetalRatio'>

Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.
The function signature for assign() is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.
The order of **kwargs is preserved. This allows for dependent assignment, where an expression later in **kwargs can refer to a column created earlier in the same assign().
In [90]: dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [91]: dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])
Out[91]:
   A  B  C   D
0  1  4  5   6
1  2  5  7   9
2  3  6  9  12
In the second expression, x['C'] will refer to the newly created column, that’s equal to dfa['A'] + dfa['B'].
Indexing / selection#
The basics of indexing are as follows:
Operation | Syntax | Result
---|---|---
Select column | df[col] | Series
Select row by label | df.loc[label] | Series
Select row by integer location | df.iloc[loc] | Series
Slice rows | df[5:10] | DataFrame
Select rows by boolean vector | df[bool_vec] | DataFrame
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [92]:df.loc["b"]Out[92]:one 2.0bar 2.0flag Falsefoo barone_trunc 2.0Name: b, dtype: objectIn [93]:df.iloc[2]Out[93]:one 3.0bar 3.0flag Truefoo barone_trunc NaNName: c, dtype: object
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing. We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.
Data alignment and arithmetic#
Data alignment between DataFrame objects automatically aligns on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.
In [94]: df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

In [95]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

In [96]: df + df2
Out[96]:
          A         B         C   D
0  0.045691 -0.014138  1.380871 NaN
1 -0.955398 -1.501007  0.037181 NaN
2 -0.662690  1.534833 -0.859691 NaN
3 -2.452949  1.237274 -0.133712 NaN
4  1.414490  1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:
In [97]: df - df.iloc[0]
Out[97]:
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2  0.253128  0.829678  0.010026 -1.991234
3 -1.311128  0.054325 -1.724913 -1.620544
4  0.573025  1.500742 -0.676070  1.367331
5 -1.741248  0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282  0.000430
7 -0.743894  0.411013 -0.929563 -0.282386
8 -1.194921  1.320690  0.238224 -1.482644
9  2.293786  1.856228  0.773289 -1.446531
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
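As a small preview of what those methods allow (a sketch, not from the original page), the flexible binary methods take an axis argument so you can broadcast column-wise instead:

df.sub(df["A"], axis=0)   # subtract column "A" from every column, matching on the row index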
Arithmetic operations with scalars operate element-wise:
In [98]: df * 5 + 2
Out[98]:
           A         B         C          D
0   3.359299 -0.124862  4.835102   3.381160
1  -3.437003 -1.368449  2.568242  -5.392133
2   4.624938  4.023526  4.885230  -6.575010
3  -3.196342  0.146766 -3.789461  -4.721559
4   6.224426  7.378849  1.454750  10.217815
5  -5.346940  3.785103 -1.373001  -6.884519
6  -2.844569 -4.472618  4.068691   3.383309
7  -0.360173  1.930201  0.187285   1.969232
8  -2.615303  6.478587  6.026220  -4.032059
9  14.828230  9.156280  8.701544  -3.851494

In [99]: 1 / df
Out[99]:
          A          B         C           D
0  3.678365  -2.353094  1.763605    3.620145
1 -0.919624  -1.484363  8.799067   -0.676395
2  1.904807   2.470934  1.732964   -0.583090
3 -0.962215  -2.697986 -0.863638   -0.743875
4  1.183593   0.929567 -9.170108    0.608434
5 -0.680555   2.800959 -1.482360   -0.562777
6 -1.032084  -0.772485  2.416988    3.614523
7 -2.118489 -71.634509 -2.758294 -162.507295
8 -1.083352   1.116424  1.241860   -0.828904
9  0.389765   0.698687  0.746097   -0.854483

In [100]: df ** 4
Out[100]:
           A             B         C             D
0   0.005462  3.261689e-02  0.103370  5.822320e-03
1   1.398165  2.059869e-01  0.000167  4.777482e+00
2   0.075962  2.682596e-02  0.110877  8.650845e+00
3   1.166571  1.887302e-02  1.797515  3.265879e+00
4   0.509555  1.339298e+00  0.000141  7.297019e+00
5   4.661717  1.624699e-02  0.207103  9.969092e+00
6   0.881334  2.808277e+00  0.029302  5.858632e-03
7   0.049647  3.797614e-08  0.017276  1.433866e-09
8   0.725974  6.437005e-01  0.420446  2.118275e+00
9  43.329821  4.196326e+00  3.227153  1.875802e+00
Boolean operators operate element-wise as well:
In [101]: df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)

In [102]: df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

In [103]: df1 & df2
Out[103]:
       a      b
0  False  False
1  False   True
2   True  False

In [104]: df1 | df2
Out[104]:
      a     b
0  True  True
1  True  True
2  True  True

In [105]: df1 ^ df2
Out[105]:
       a      b
0   True   True
1   True  False
2  False   True

In [106]: -df1
Out[106]:
       a      b
0  False   True
1   True  False
2  False  False
Transposing#
To transpose, access the T attribute or DataFrame.transpose(), similar to an ndarray:
# only show the first 5 rows
In [107]: df[:5].T
Out[107]:
          0         1         2         3         4
A  0.271860 -1.087401  0.524988 -1.039268  0.844885
B -0.424972 -0.673690  0.404705 -0.370647  1.075770
C  0.567020  0.113648  0.577046 -1.157892 -0.109050
D  0.276232 -1.478427 -1.715002 -1.344312  1.643563
DataFrame interoperability with NumPy functions#
Most NumPy functions can be called directly on Series and DataFrame.
In [108]: np.exp(df)
Out[108]:
           A         B         C         D
0   1.312403  0.653788  1.763006  1.318154
1   0.337092  0.509824  1.120358  0.227996
2   1.690438  1.498861  1.780770  0.179963
3   0.353713  0.690288  0.314148  0.260719
4   2.327710  2.932249  0.896686  5.173571
5   0.230066  1.429065  0.509360  0.169161
6   0.379495  0.274028  1.512461  1.318720
7   0.623732  0.986137  0.695904  0.993865
8   0.397301  2.449092  2.237242  0.299269
9  13.009059  4.183951  3.820223  0.310274

In [109]: np.asarray(df)
Out[109]:
array([[ 0.2719, -0.425 ,  0.567 ,  0.2762],
       [-1.0874, -0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 , -1.715 ],
       [-1.0393, -0.3706, -1.1579, -1.3443],
       [ 0.8449,  1.0758, -0.109 ,  1.6436],
       [-1.4694,  0.357 , -0.6746, -1.7769],
       [-0.9689, -1.2945,  0.4137,  0.2767],
       [-0.472 , -0.014 , -0.3625, -0.0062],
       [-0.9231,  0.8957,  0.8052, -1.2064],
       [ 2.5656,  1.4313,  1.3403, -1.1703]])
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model are quite different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.

The ufunc is applied to the underlying array in a Series.
In [110]: ser = pd.Series([1, 2, 3, 4])

In [111]: np.exp(ser)
Out[111]:
0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64
When multiple Series are passed to a ufunc, they are aligned before performing the operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with multiple inputs. For example, using numpy.remainder() on two Series with differently ordered labels will align before the operation.
In [112]: ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [113]: ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

In [114]: ser1
Out[114]:
a    1
b    2
c    3
dtype: int64

In [115]: ser2
Out[115]:
b    1
a    3
c    5
dtype: int64

In [116]: np.remainder(ser1, ser2)
Out[116]:
a    1
b    0
c    3
dtype: int64
As usual, the union of the two indices is taken, and non-overlapping values are filledwith missing values.
In [117]: ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])

In [118]: ser3
Out[118]:
b    2
c    4
d    6
dtype: int64

In [119]: np.remainder(ser1, ser3)
Out[119]:
a    NaN
b    0.0
c    3.0
d    NaN
dtype: float64
When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence and a Series is returned.
In [120]: ser = pd.Series([1, 2, 3])

In [121]: idx = pd.Index([4, 5, 6])

In [122]: np.maximum(ser, idx)
Out[122]:
0    4
1    5
2    6
dtype: int64
NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example arrays.SparseArray (see Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
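For example (a sketch, not from the original page), applying a ufunc to a sparse-backed Series keeps the sparse dtype:

sparse = pd.Series(pd.arrays.SparseArray([1.0, 0.0, -2.0]))

np.abs(sparse)   # still a Sparse[float64] Series; the data is not densified to a plain ndarray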
Console display#
A very large DataFrame will be truncated when displayed in the console. You can also get a summary using info(). (The baseball dataset is from the plyr R package):
In [123]: baseball = pd.read_csv("data/baseball.csv")

In [124]: print(baseball)
       id     player  year  stint team  lg  ...    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL  ...   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL  ...   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ...   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL  ...  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL  ...   3.0  0.0  0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [125]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      100 non-null    int64
 1   player  100 non-null    object
 2   year    100 non-null    int64
 3   stint   100 non-null    int64
 4   team    100 non-null    object
 5   lg      100 non-null    object
 6   g       100 non-null    int64
 7   ab      100 non-null    int64
 8   r       100 non-null    int64
 9   h       100 non-null    int64
 10  X2b     100 non-null    int64
 11  X3b     100 non-null    int64
 12  hr      100 non-null    int64
 13  rbi     100 non-null    float64
 14  sb      100 non-null    float64
 15  cs      100 non-null    float64
 16  bb      100 non-null    int64
 17  so      100 non-null    float64
 18  ibb     100 non-null    float64
 19  hbp     100 non-null    float64
 20  sh      100 non-null    float64
 21  sf      100 non-null    float64
 22  gidp    100 non-null    float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB
However, using DataFrame.to_string() will return a string representation of the DataFrame in tabular form, though it won’t always fit the console width:
In [126]: print(baseball.iloc[-20:, :12].to_string())
       id     player  year  stint team  lg    g   ab   r    h  X2b  X3b
80  89474  finlest01  2007      1  COL  NL   43   94   9   17    3    0
81  89480  embreal01  2007      1  OAK  AL    4    0   0    0    0    0
82  89481  edmonji01  2007      1  SLN  NL  117  365  39   92   15    2
83  89482  easleda01  2007      1  NYN  NL   76  193  24   54    6    0
84  89489  delgaca01  2007      1  NYN  NL  139  538  71  139   30    0
85  89493  cormirh01  2007      1  CIN  NL    6    0   0    0    0    0
86  89494  coninje01  2007      2  NYN  NL   21   41   2    8    2    0
87  89495  coninje01  2007      1  CIN  NL   80  215  23   57   11    1
88  89497  clemero02  2007      1  NYA  AL    2    2   0    1    0    0
89  89498  claytro01  2007      2  BOS  AL    8    6   1    0    0    0
90  89499  claytro01  2007      1  TOR  AL   69  189  23   48   14    0
91  89501  cirilje01  2007      2  ARI  NL   28   40   6    8    4    0
92  89502  cirilje01  2007      1  MIN  AL   50  153  18   40    9    2
93  89521  bondsba01  2007      1  SFN  NL  126  340  75   94   14    0
94  89523  biggicr01  2007      1  HOU  NL  141  517  68  130   31    3
95  89525  benitar01  2007      2  FLO  NL   34    0   0    0    0    0
96  89526  benitar01  2007      1  SFN  NL   19    0   0    0    0    0
97  89530  ausmubr01  2007      1  HOU  NL  117  349  38   82   16    3
98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1
99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0
Wide DataFrames will be printed across multiple rows by default:
In [127]: pd.DataFrame(np.random.randn(3, 12))
Out[127]:
         0         1         2   ...        9         10        11
0 -1.226825  0.769804 -1.281247  ... -1.110336 -0.619976  0.149748
1 -0.732339  0.687738  0.176444  ...  1.462696 -1.743161 -0.826591
2 -0.345352  1.314232  0.690579  ...  0.896171 -0.487602 -0.082240

[3 rows x 12 columns]
You can change how much to print on a single row by setting the display.width option:
In [128]: pd.set_option("display.width", 40)  # default is 80

In [129]: pd.DataFrame(np.random.randn(3, 12))
Out[129]:
         0         1         2   ...        9         10        11
0 -2.182937  0.380396  0.084844  ... -0.023688  2.410179  1.450520
1  0.206053 -0.251905 -2.213588  ... -0.025747 -0.988387  0.094055
2  1.262731  1.289997  0.082423  ... -0.281461  0.030711  0.109121

[3 rows x 12 columns]
You can adjust the max width of the individual columns by setting display.max_colwidth:
In [130]: datafile = {
   .....:     "filename": ["filename_01", "filename_02"],
   .....:     "path": [
   .....:         "media/user_name/storage/folder_01/filename_01",
   .....:         "media/user_name/storage/folder_02/filename_02",
   .....:     ],
   .....: }
   .....:

In [131]: pd.set_option("display.max_colwidth", 30)

In [132]: pd.DataFrame(datafile)
Out[132]:
      filename                           path
0  filename_01  media/user_name/storage/fo...
1  filename_02  media/user_name/storage/fo...

In [133]: pd.set_option("display.max_colwidth", 100)

In [134]: pd.DataFrame(datafile)
Out[134]:
      filename                                            path
0  filename_01  media/user_name/storage/folder_01/filename_01
1  filename_02  media/user_name/storage/folder_02/filename_02
You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
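For example (a sketch; display.expand_frame_repr is a real display option, and pandas accepts the shortened name shown here):

pd.set_option("expand_frame_repr", False)   # print wide frames in one block instead of wrapping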
DataFrame column attribute access and IPython completion#
If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
In [135]: df = pd.DataFrame({"foo1": np.random.randn(5), "foo2": np.random.randn(5)})

In [136]: df
Out[136]:
       foo1      foo2
0  1.126203  0.781836
1 -0.977349 -1.071357
2  1.474071  0.441153
3 -0.064034  2.353925
4 -1.282782  0.583787

In [137]: df.foo1
Out[137]:
0    1.126203
1   -0.977349
2    1.474071
3   -0.064034
4   -1.282782
Name: foo1, dtype: float64
The columns are also connected to the IPython completion mechanism so they can be tab-completed:
In [5]: df.foo<TAB>  # noqa: E225, E999
df.foo1  df.foo2