
Duplicate Labels

Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

This section describes how duplicate labels change the behavior of certain operations, and how to prevent duplicates from arising during operations, or to detect them if they do.

In [1]: import pandas as pd

In [2]: import numpy as np

Consequences of Duplicate Labels

Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises an error.

In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

In [4]: s1.reindex(["a", "b", "c"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.reindex(["a", "b", "c"])

File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
   5136 @doc(
   5137     NDFrame.reindex,  # type: ignore[has-type]
   5138     klass=_shared_doc_kwargs["klass"],
   (...)
   5151     tolerance=None,
   5152 ) -> Series:
-> 5153     return super().reindex(
   5154         index=index,
   5155         method=method,
   5156         copy=copy,
   5157         level=level,
   5158         fill_value=fill_value,
   5159         limit=limit,
   5160         tolerance=tolerance,
   5161     )

File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5607     return self._reindex_multi(axes, copy, fill_value)
   5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
   5611     axes, level, limit, tolerance, method, fill_value, copy
   5612 ).__finalize__(self, method="reindex")

File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   5630     continue
   5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
   5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
   5635 )
   5637 axis = self._get_axis_number(a)
   5638 obj = obj._reindex_with_indexers(
   5639     {axis: [new_index, indexer]},
   5640     fill_value=fill_value,
   5641     copy=copy,
   5642     allow_dups=False,
   5643 )

File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
   4426     raise ValueError("cannot handle a non-unique multi-index!")
   4427 elif not self.is_unique:
   4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
   4430 else:
   4431     indexer, _ = self.get_indexer_non_unique(target)

ValueError: cannot reindex on an axis with duplicate labels
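One possible way around this failure is to remove the repeated labels before reindexing. The sketch below keeps the first occurrence of each label, which is an assumption; which row to keep depends on your data. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Reindexing directly raises ValueError because "b" appears twice.
try:
    s.reindex(["a", "b", "c"])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Keep only the first occurrence of each label, then reindex.
unique_s = s[~s.index.duplicated(keep="first")]
result = unique_s.reindex(["a", "b", "c"])  # "c" is new, so it is filled with NaN
```

After deduplication the reindex succeeds, and the label that was absent from the original Series is filled with a missing value.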

Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.

In [5]: df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

In [6]: df1
Out[6]: 
   A  A  B
0  0  1  2
1  3  4  5

We have duplicates in the columns. If we slice 'B', we get back a Series:

In [7]: df1["B"]  # a Series
Out[7]: 
0    2
1    5
Name: B, dtype: int64

But slicing 'A' returns a DataFrame:

In [8]: df1["A"]  # a DataFrame
Out[8]: 
   A  A
0  0  1
1  3  4

This applies to row labels as well:

In [9]: df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

In [10]: df2
Out[10]: 
   A
a  0
a  1
b  2

In [11]: df2.loc["b", "A"]  # a scalar
Out[11]: 2

In [12]: df2.loc["a", "A"]  # a Series
Out[12]: 
a    0
a    1
Name: A, dtype: int64
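If downstream code needs a predictable return type regardless of whether a label is duplicated, one option is to index with a list instead of a scalar, since a list-like key does not reduce dimensionality. The following is a minimal sketch using the same frames as above.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# A list of column labels selects a DataFrame, duplicated label or not.
only_b = df1[["B"]]  # DataFrame with one column
both_a = df1[["A"]]  # DataFrame with two columns, both named "A"

# A list of row labels selects a Series for a single column.
row_b = df2.loc[["b"], "A"]  # Series of length 1
row_a = df2.loc[["a"], "A"]  # Series of length 2
```

The cost is that unique labels no longer collapse to a lower-dimensional result, so callers must unpack single values themselves.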

Duplicate Label Detection

You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:

In [13]: df2
Out[13]: 
   A
a  0
a  1
b  2

In [14]: df2.index.is_unique
Out[14]: False

In [15]: df2.columns.is_unique
Out[15]: True

Note

Checking whether an index is unique is somewhat expensive for large datasets. pandas does cache this result, so re-checking on the same index is very fast.

Index.duplicated() will return a boolean ndarray indicating whether a label is repeated.

In [16]: df2.index.duplicated()
Out[16]: array([False,  True, False])

This array can be used as a boolean filter to drop duplicate rows.

In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]: 
   A
a  0
b  2
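The filter above keeps the first occurrence of each label, which is the default for Index.duplicated(). Its keep parameter changes which occurrences are marked; the sketch below compares the three settings on the same frame.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

first_mask = df2.index.duplicated(keep="first")  # default: marks later repeats
last_mask = df2.index.duplicated(keep="last")    # marks earlier repeats
all_mask = df2.index.duplicated(keep=False)      # marks every row of a repeated label

keep_last = df2.loc[~last_mask, :]   # keeps the second "a" row and "b"
no_repeats = df2.loc[~all_mask, :]   # keeps only labels that occur exactly once
```

Using keep=False is useful when a repeated label means the row cannot be trusted at all and should be dropped rather than arbitrarily resolved.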

If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.

In [18]: df2.groupby(level=0).mean()
Out[18]: 
     A
a  0.5
b  2.0
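The mean is only one choice of resolution rule. As a sketch of some alternatives, the same grouping can take the minimum or maximum per label, or count how many rows share each label before deciding how to resolve them.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
grouped = df2.groupby(level=0)

smallest = grouped.min()  # one row per label, holding the smallest value
largest = grouped.max()   # one row per label, holding the largest value
sizes = grouped.size()    # number of rows that shared each label
```

Whatever aggregation is used, the result has one row per distinct label, so its index is unique.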

Disallowing Duplicate Labels

Added in version 1.2.0.

As noted above, handling duplicates is an important feature when reading in raw data. That said, you may want to avoid introducing duplicates as part of a data processing pipeline (from methods like pandas.concat(), rename(), etc.). Both Series and DataFrame can disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False) (the default is to allow them). If there are duplicate labels, an exception will be raised.

In [19]: pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)

File ~/work/pandas/pandas/pandas/core/generic.py:508, in NDFrame.set_flags(self, copy, allows_duplicate_labels)
    506 df = self.copy(deep=copy and not using_copy_on_write())
    507 if allows_duplicate_labels is not None:
--> 508     df.flags["allows_duplicate_labels"] = allows_duplicate_labels
    509 return df

File ~/work/pandas/pandas/pandas/core/flags.py:109, in Flags.__setitem__(self, key, value)
    107 if key not in self._keys:
    108     raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 109 setattr(self, key, value)

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712     duplicates = self._format_duplicate_message()
    713     msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
b        [1, 2]

This applies to both row and column labels for a DataFrame:

In [20]: pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....: 
Out[20]: 
   A  B  C
0  0  1  2
1  3  4  5

This attribute can be checked or set with allows_duplicate_labels, which indicates whether that object can have duplicate labels.

In [21]: df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....: 

In [22]: df
Out[22]: 
   A
x  0
y  1
X  2
Y  3

In [23]: df.flags.allows_duplicate_labels
Out[23]: False

DataFrame.set_flags() can be used to return a new DataFrame with attributes like allows_duplicate_labels set to some value:

In [24]: df2 = df.set_flags(allows_duplicate_labels=True)

In [25]: df2.flags.allows_duplicate_labels
Out[25]: True

The new DataFrame returned is a view on the same data as the old DataFrame. Alternatively, the property can be set directly on the same object:

In [26]: df2.flags.allows_duplicate_labels = False

In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
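The two approaches differ in which object ends up carrying the new flag value. The sketch below, with illustrative variable names, checks that set_flags() leaves the flag on the original object untouched, while assigning through .flags changes the object it is called on.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1]}, index=["x", "y"]).set_flags(
    allows_duplicate_labels=False
)

# set_flags returns a new object; the flag on `df` itself is unchanged.
relaxed = df.set_flags(allows_duplicate_labels=True)
original_flag = df.flags.allows_duplicate_labels
relaxed_flag_before = relaxed.flags.allows_duplicate_labels

# Assigning through .flags modifies `relaxed` in place.
relaxed.flags.allows_duplicate_labels = False
relaxed_flag_after = relaxed.flags.allows_duplicate_labels
```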

When processing raw, messy data you might initially read in the messy data (which potentially has duplicate labels), deduplicate, and then disallow duplicates going forward, to ensure that your data pipeline doesn’t introduce duplicates.

>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
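To try this workflow end to end without a file on disk, the sketch below reads from an in-memory string. The CSV text, the id and value column names, and the use of index_col are assumptions made for illustration; they are not part of the original snippet.

```python
import io

import pandas as pd

# Hypothetical raw input: "id" is meant to be unique, but "a" appears twice.
csv_text = "id,value\na,1\na,2\nb,3\n"

raw = pd.read_csv(io.StringIO(csv_text), index_col="id")
deduplicated = raw.groupby(level=0).first()         # remove duplicates
deduplicated.flags.allows_duplicate_labels = False  # disallow going forward

# A later step that would reintroduce a duplicate label now fails loudly.
try:
    deduplicated.rename({"a": "b"})
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True
```

Here first() keeps the earliest row for each label; any other aggregation could be substituted if a different resolution rule fits the data better.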

Setting allows_duplicate_labels=False on a Series or DataFrame with duplicate labels, or performing an operation that introduces duplicate labels on a Series or DataFrame that disallows duplicates, will raise an errors.DuplicateLabelError.

In [28]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[28], line 1
----> 1 df.rename(str.upper)

File ~/work/pandas/pandas/pandas/core/frame.py:5767, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   5636 def rename(
   5637     self,
   5638     mapper: Renamer | None = None,
   (...)
   5646     errors: IgnoreRaise = "ignore",
   5647 ) -> DataFrame | None:
   5648     """
   5649     Rename columns or index labels.
   5650 
   (...)
   5765     4  3  6
   5766     """
-> 5767     return super()._rename(
   5768         mapper=mapper,
   5769         index=index,
   5770         columns=columns,
   5771         axis=axis,
   5772         copy=copy,
   5773         inplace=inplace,
   5774         level=level,
   5775         errors=errors,
   5776     )

File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1138     return None
   1139 else:
-> 1140     return result.__finalize__(self, method="rename")

File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
   6255 if other.attrs:
   6256     # We want attrs propagation to have minimal performance
   6257     # impact if attrs are not used; i.e. attrs is an empty dict.
   6258     # One could make the deepcopy unconditionally, but a deepcopy
   6259     # of an empty dict is 50x more expensive than the empty check.
   6260     self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   6263 # For subclasses using _metadata.
   6264 for name in set(self._metadata) & set(other._metadata):

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712     duplicates = self._format_duplicate_message()
    713     msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
X        [0, 2]
Y        [1, 3]

This error message contains the labels that are duplicated, and the numeric positions of all the duplicates (including the “original”) in the Series or DataFrame.
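In a pipeline you may prefer to catch the exception and log or inspect its message rather than let it propagate. A minimal sketch follows; it assumes the exception is reachable as pd.errors.DuplicateLabelError and relies only on the message text shown in the traceback above.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
    allows_duplicate_labels=False
)

try:
    # Upper-casing maps "x"/"X" and "y"/"Y" onto the same labels.
    df.rename(str.upper)
except pd.errors.DuplicateLabelError as exc:
    message = str(exc)  # names the duplicated labels and their positions
else:
    message = ""
```

Because the original df is not modified by the failed rename, it can still be used after the error is handled.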

Duplicate Label Propagation

In general, disallowing duplicates is “sticky”. It’s preserved through operations.

In [29]: s1 = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

In [30]: s1
Out[30]: 
a    0
b    0
dtype: int64

In [31]: s1.head().rename({"a": "b"})
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[31], line 1
----> 1 s1.head().rename({"a": "b"})

File ~/work/pandas/pandas/pandas/core/series.py:5090, in Series.rename(self, index, axis, copy, inplace, level, errors)
   5083     axis = self._get_axis_number(axis)
   5085 if callable(index) or is_dict_like(index):
   5086     # error: Argument 1 to "_rename" of "NDFrame" has incompatible
   5087     # type "Union[Union[Mapping[Any, Hashable], Callable[[Any],
   5088     # Hashable]], Hashable, None]"; expected "Union[Mapping[Any,
   5089     # Hashable], Callable[[Any], Hashable], None]"
-> 5090     return super()._rename(
   5091         index,  # type: ignore[arg-type]
   5092         copy=copy,
   5093         inplace=inplace,
   5094         level=level,
   5095         errors=errors,
   5096     )
   5097 else:
   5098     return self._set_name(index, inplace=inplace, deep=copy)

File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1138     return None
   1139 else:
-> 1140     return result.__finalize__(self, method="rename")

File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
   6255 if other.attrs:
   6256     # We want attrs propagation to have minimal performance
   6257     # impact if attrs are not used; i.e. attrs is an empty dict.
   6258     # One could make the deepcopy unconditionally, but a deepcopy
   6259     # of an empty dict is 50x more expensive than the empty check.
   6260     self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   6263 # For subclasses using _metadata.
   6264 for name in set(self._metadata) & set(other._metadata):

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712     duplicates = self._format_duplicate_message()
    713     msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
b        [0, 1]

Warning

This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.
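Because propagation is not guaranteed for every method, one defensive option is to re-apply the flag after steps whose behavior you have not verified. The sketch below is an assumption about how you might structure this; require_unique_labels is a hypothetical helper name, not a pandas API.

```python
import pandas as pd


def require_unique_labels(obj):
    """Hypothetical helper: re-apply the flag, raising if duplicates exist."""
    return obj.set_flags(allows_duplicate_labels=False)


s = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

# Re-assert the flag after an intermediate step instead of trusting propagation.
step = require_unique_labels(s.head())
flag_kept = step.flags.allows_duplicate_labels

# The same helper rejects an object that already contains duplicate labels.
try:
    require_unique_labels(pd.Series([1, 2], index=["a", "a"]))
    rejected = False
except pd.errors.DuplicateLabelError:
    rejected = True
```

Calling the helper at stage boundaries turns a silently dropped flag into either a restored flag or an immediate error.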

