Finding unique rows means removing duplicate rows from a 2D array so that each row appears only once.For example, given [[1, 2], [3, 4], [1, 2], [5, 6]], the unique rows would be [[1, 2], [3, 4], [5, 6]]. Let’s explore efficient ways to achieve this usingNumPy.
Using np.unique(axis=0)
np.unique(axis=0)is a simple and beginner-friendly method. It compares entire rows and returns only distinct ones, much like Excel’s "Remove Duplicates." However, it sorts the output, so the original row order is not preserved.
Pythonimportnumpyasnpa=np.array([[1,2],[3,4],[1,2],[5,6]])res=np.unique(a,axis=0)print(res)
Output[[1 2] [3 4] [5 6]]
Explanation:Arraya has repeated rows. By applying np.unique(a, axis=0), we ask NumPy to treat each row as a unit and remove duplicates. The result is a sorted array with only unique rows.
Using np.lexsort()
For more control over deduplication, np.lexsort() sorts rows to group duplicates, thennp.diff()filters them out. Like sorting papers by name and removing repeats, it’s more flexible than np.unique but requires extra steps.
Pythonimportnumpyasnpa=np.array([[7,8],[7,9],[6,8],[7,8]])s=np.lexsort(a.T[::-1])a=a[s]mask=np.ones(len(a),bool)mask[1:]=np.any(np.diff(a,axis=0),axis=1)res=a[mask]print(res)
Output[[6 8] [7 8] [7 9]]
Explanation: We first sort rows using np.lexsort() on reversed columns to ensure correct sort order. Then,np.diff()finds differences between consecutive rows. A boolean mask identifies the first unique row in each group.
Using set(tuple(row))
set(tuple(...)) converts each row to a tuple and uses a set to remove duplicates. It’s quick and easy but doesn’t preserve row order and may have issues with floating-point precision. Think of it as dropping rows into labeled baskets, duplicates simply won’t fit twice.
Pythonimportnumpyasnpa=np.array([[0,0],[1,1],[0,0],[2,2]])res=np.array(list({tuple(r)forrina}))print(res)
Output[[1 1] [2 2] [0 0]]
Explanation: Each row in arraya is turned into a tuple using a set comprehension. Sets automatically discard duplicates. Finally, we convert the set of tuples back to a NumPy array.
Using view(np.void)
view(np.void) treats each row as a single byte block, enabling fast row-wise comparison. It’s less readable but highly efficient for large datasets. Row order can be preserved by sorting indices.
Pythonimportnumpyasnpa=np.array([[1,2,3],[4,5,6],[1,2,3],[7,8,9]])b=np.ascontiguousarray(a).view(np.dtype((np.void,a.dtype.itemsize*a.shape[1])))_,idx=np.unique(b,return_index=True)res=a[np.sort(idx)]print(res)
Output[[1 2 3] [4 5 6] [7 8 9]]
Explanation: We convert a to a contiguous byte block usinga.view(np.void), letting NumPy treat each row as a single unit. Thennp.unique() finds unique byte rows and we use the indices to retrieve the original unique rows.