Duplicate Labels#
Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.
This section describes how duplicate labels change the behavior of certain operations, and how to prevent duplicates from arising during operations, or to detect them if they do.
In [1]: import pandas as pd

In [2]: import numpy as np
Consequences of Duplicate Labels#
Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises.
In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

In [4]: s1.reindex(["a", "b", "c"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.reindex(["a", "b", "c"])
...
ValueError: cannot reindex on an axis with duplicate labels
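One possible workaround, sketched here under the assumption that keeping the first occurrence of each label is acceptable for your data, is to drop the repeated labels before reindexing. The names `s1_unique` and `result` are illustrative only.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Keep only the first occurrence of each label (an assumption; another
# policy, such as aggregating the repeats, may suit your data better).
s1_unique = s1[~s1.index.duplicated(keep="first")]

# With a unique index the reindex is well defined. The label "c" is not
# present in the original, so its value is filled with NaN.
result = s1_unique.reindex(["a", "b", "c"])
```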
Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.
In [5]: df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

In [6]: df1
Out[6]: 
   A  A  B
0  0  1  2
1  3  4  5
We have duplicates in the columns. If we slice 'B', we get back a Series
In [7]: df1["B"]  # a series
Out[7]: 
0    2
1    5
Name: B, dtype: int64
But slicing 'A' returns a DataFrame
In [8]: df1["A"]  # a DataFrame
Out[8]: 
   A  A
0  0  1
1  3  4
This applies to row labels as well
In [9]: df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

In [10]: df2
Out[10]: 
   A
a  0
a  1
b  2

In [11]: df2.loc["b", "A"]  # a scalar
Out[11]: 2

In [12]: df2.loc["a", "A"]  # a Series
Out[12]: 
a    0
a    1
Name: A, dtype: int64
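If downstream code needs a predictable return type regardless of whether a label happens to be repeated, one option is to index with a list-like instead of a scalar. The following is a minimal sketch of that idea; the variable names are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# A list-like row indexer does not reduce dimensionality, so both
# lookups return a Series, whether or not the label is duplicated.
unique_label = df2.loc[["b"], "A"]    # Series with one element
repeated_label = df2.loc[["a"], "A"]  # Series with two elements
```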
Duplicate Label Detection#
You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:
In [13]: df2
Out[13]: 
   A
a  0
a  1
b  2

In [14]: df2.index.is_unique
Out[14]: False

In [15]: df2.columns.is_unique
Out[15]: True
Note
Checking whether an index is unique is somewhat expensive for large datasets. pandas does cache this result, so re-checking on the same index is very fast.
Index.duplicated() will return a boolean ndarray indicating whether a label is repeated.
In [16]: df2.index.duplicated()
Out[16]: array([False,  True, False])
Which can be used as a boolean filter to drop duplicate rows.
In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]: 
   A
a  0
b  2
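Which occurrence survives the filter depends on the `keep` argument of Index.duplicated(), which defaults to "first". As a small sketch of the three options (variable names are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# keep="first" marks every repeat after the first occurrence as a duplicate.
first_kept = df2.loc[~df2.index.duplicated(keep="first")]

# keep="last" marks every occurrence except the last one as a duplicate.
last_kept = df2.loc[~df2.index.duplicated(keep="last")]

# keep=False marks all occurrences of a repeated label, so the filter
# retains only labels that appear exactly once.
no_repeats = df2.loc[~df2.index.duplicated(keep=False)]
```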
If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.
In [18]: df2.groupby(level=0).mean()
Out[18]: 
     A
a  0.5
b  2.0
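The same grouping can also be used to inspect the duplicates before deciding how to resolve them, or to apply a different aggregation. A brief sketch, where the choice of `max()` is only an example of an alternative policy:

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Count how many rows share each label, then keep only the repeated ones.
counts = df2.groupby(level=0).size()
repeated = counts[counts > 1]

# Resolve the duplicates with a different aggregation, here the maximum.
resolved = df2.groupby(level=0).max()
```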
Disallowing Duplicate Labels#
Added in version 1.2.0.
As noted above, handling duplicates is an important feature when reading in raw data. That said, you may want to avoid introducing duplicates as part of a data processing pipeline (from methods like pandas.concat(), rename(), etc.). Both Series and DataFrame can disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False) (the default is to allow them). If there are duplicate labels, an exception will be raised.
In [19]: pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
...
DuplicateLabelError: Index has duplicates.
      positions
label          
b        [1, 2]
This applies to both row and column labels for a DataFrame
In [20]: pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....: 
Out[20]: 
   A  B  C
0  0  1  2
1  3  4  5
This attribute can be checked or set with allows_duplicate_labels, which indicates whether that object can have duplicate labels.
In [21]: df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....: 

In [22]: df
Out[22]: 
   A
x  0
y  1
X  2
Y  3

In [23]: df.flags.allows_duplicate_labels
Out[23]: False
DataFrame.set_flags() can be used to return a new DataFrame with attributes like allows_duplicate_labels set to some value
In [24]: df2 = df.set_flags(allows_duplicate_labels=True)

In [25]: df2.flags.allows_duplicate_labels
Out[25]: True
The new DataFrame returned is a view on the same data as the old DataFrame. Or the property can just be set directly on the same object
In [26]: df2.flags.allows_duplicate_labels = False

In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
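The practical difference between the two approaches is which object ends up carrying the new flag. The sketch below illustrates this with a small frame; `relaxed`, `original_flag` and `relaxed_flag` are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1]}, index=["x", "y"]).set_flags(
    allows_duplicate_labels=False
)

# set_flags returns a new object; the flag on df itself is left unchanged.
relaxed = df.set_flags(allows_duplicate_labels=True)
original_flag = df.flags.allows_duplicate_labels      # flag on df
relaxed_flag = relaxed.flags.allows_duplicate_labels  # flag on the new object

# Assigning to the attribute changes the flag on that same object.
relaxed.flags.allows_duplicate_labels = False
```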
When processing raw, messy data you might initially read in the messy data (which potentially has duplicate labels), deduplicate, and then disallow duplicates going forward, to ensure that your data pipeline doesn’t introduce duplicates.
>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
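To make the pattern concrete, here is a self-contained sketch that substitutes an in-memory CSV for the file. The column names `id` and `value` and the data itself are made up for illustration, as is the choice of `id` as the index column.

```python
import io

import pandas as pd

# Hypothetical raw data in which the "id" label "a" appears twice.
csv_text = "id,value\na,1\na,2\nb,3\n"

raw = pd.read_csv(io.StringIO(csv_text), index_col="id")
had_duplicates = not raw.index.is_unique

# Keep the first row for each label, then disallow duplicates going forward.
deduplicated = raw.groupby(level=0).first()
deduplicated.flags.allows_duplicate_labels = False
```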
Setting allows_duplicate_labels=False on a Series or DataFrame with duplicate labels, or performing an operation that introduces duplicate labels on a Series or DataFrame that disallows duplicates, will raise an errors.DuplicateLabelError.
In [28]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[28], line 1
----> 1 df.rename(str.upper)
...
DuplicateLabelError: Index has duplicates.
      positions
label          
X        [0, 2]
Y        [1, 3]
This error message contains the labels that are duplicated, and the numeric positions of all the duplicates (including the “original”) in the Series or DataFrame.
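In a pipeline you may prefer to catch the exception and report it rather than let it propagate. A minimal sketch, assuming the error is handled by recording its message (how you handle it in practice is up to you):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
    allows_duplicate_labels=False
)

try:
    # Upper-casing maps "x" and "X" (and "y" and "Y") onto the same label.
    renamed = df.rename(str.upper)
    message = ""
except pd.errors.DuplicateLabelError as err:
    renamed = None
    message = str(err)  # lists the duplicated labels and their positions
```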
Duplicate Label Propagation#
In general, disallowing duplicates is “sticky”. It’s preserved through operations.
In [29]: s1 = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

In [30]: s1
Out[30]: 
a    0
b    0
dtype: int64

In [31]: s1.head().rename({"a": "b"})
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[31], line 1
----> 1 s1.head().rename({"a": "b"})
...
DuplicateLabelError: Index has duplicates.
      positions
label          
b        [0, 1]
Warning
This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.
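Because propagation is not guaranteed for every method, a pipeline that depends on unique labels may want an explicit check at key points. The helper below, `require_unique_labels`, is hypothetical and not part of pandas; it is a sketch built only on Index.is_unique and Index.duplicated().

```python
import pandas as pd


def require_unique_labels(obj):
    """Hypothetical helper: raise if any axis of obj has duplicate labels."""
    # obj.axes is [index] for a Series and [index, columns] for a DataFrame.
    for axis in obj.axes:
        if not axis.is_unique:
            dupes = axis[axis.duplicated()].unique().tolist()
            raise ValueError(f"duplicate labels found: {dupes}")
    return obj


# Passes through unchanged when every axis is unique.
checked = require_unique_labels(pd.Series([1, 2], index=["a", "b"]))
```

Calling the helper after a step that is not known to propagate the flag gives a check that does not depend on the flag's propagation.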