Sorting Techniques

Author:

Andrew Dalke and Raymond Hettinger

Python lists have a built-in list.sort() method that modifies the list in-place. There is also a sorted() built-in function that builds a new sorted list from an iterable.

In this document, we explore the various techniques for sorting data using Python.

Sorting Basics

A simple ascending sort is very easy: just call the sorted() function. It returns a new sorted list:

>>> sorted([5, 2, 3, 1, 4])
[1, 2, 3, 4, 5]

You can also use the list.sort() method. It modifies the list in-place (and returns None to avoid confusion). Usually it's less convenient than sorted() - but if you don't need the original list, it's slightly more efficient.

>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]

Another difference is that the list.sort() method is only defined for lists. In contrast, the sorted() function accepts any iterable.

>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]
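The two differences described above can be checked with a short sketch. The variable names are illustrative only and do not come from the original text:

```python
original = [5, 2, 3, 1, 4]

# sorted() builds a new list and leaves its input unchanged.
new_list = sorted(original)

# sorted() accepts any iterable: a tuple, a string, a generator, ...
from_tuple = sorted((3, 1, 2))
from_string = sorted("cab")
from_generator = sorted(n * n for n in (3, -1, 2))

# list.sort() sorts the list in place and returns None.
in_place = list(original)
result = in_place.sort()
```

Because list.sort() returns None, writing something like `a = a.sort()` discards the list, which is the confusion the return value is meant to make obvious.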

Key Functions

Both list.sort() and sorted() have a key parameter to specify a function (or other callable) to be called on each list element prior to making comparisons.

For example, here’s a case-insensitive string comparison:

>>> sorted("This is a test string from Andrew".split(), key=str.casefold)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']

The value of the key parameter should be a function (or other callable) that takes a single argument and returns a key to use for sorting purposes. This technique is fast because the key function is called exactly once for each input record.
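One way to see the "exactly once" behaviour is to wrap the key in a function that records each call. This is only a sketch; counting_key and call_log are hypothetical names introduced for the illustration:

```python
call_log = []

def counting_key(word):
    """Case-insensitive key that also records every call it receives."""
    call_log.append(word)
    return word.casefold()

words = "This is a test string from Andrew".split()
result = sorted(words, key=counting_key)

# One key call per input element, regardless of how many
# comparisons the sort itself performs.
calls_per_element = len(call_log) / len(words)
```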

A common pattern is to sort complex objects using some of the object's indices as keys. For example:

>>> student_tuples = [
...     ('john', 'A', 15),
...     ('jane', 'B', 12),
...     ('dave', 'B', 10),
... ]
>>> sorted(student_tuples, key=lambda student: student[2])   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The same technique works for objects with named attributes. For example:

>>> class Student:
...     def __init__(self, name, grade, age):
...         self.name = name
...         self.grade = grade
...         self.age = age
...     def __repr__(self):
...         return repr((self.name, self.grade, self.age))

>>> student_objects = [
...     Student('john', 'A', 15),
...     Student('jane', 'B', 12),
...     Student('dave', 'B', 10),
... ]
>>> sorted(student_objects, key=lambda student: student.age)   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

Objects with named attributes can be made by a regular class as shown above, or they can be instances of dataclass or a named tuple.
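As a sketch of those two alternatives, the same student data can be held in a dataclass or a named tuple and sorted with the same attribute-based key. StudentRecord and StudentTuple are hypothetical names chosen for this example:

```python
from collections import namedtuple
from dataclasses import dataclass

@dataclass
class StudentRecord:
    name: str
    grade: str
    age: int

StudentTuple = namedtuple('StudentTuple', ['name', 'grade', 'age'])

rows = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
records = [StudentRecord(*row) for row in rows]
tuples = [StudentTuple(*row) for row in rows]

# The key function only needs the .age attribute, so it works
# unchanged for both kinds of object.
by_age_records = sorted(records, key=lambda student: student.age)
by_age_tuples = sorted(tuples, key=lambda student: student.age)
```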

Operator Module Functions and Partial Function Evaluation

The key function patterns shown above are very common, so Python provides convenience functions to make accessor functions easier and faster. The operator module has itemgetter(), attrgetter(), and a methodcaller() function.

Using those functions, the above examples become simpler and faster:

>>> from operator import itemgetter, attrgetter
>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(student_objects, key=attrgetter('age'))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The operator module functions allow multiple levels of sorting. For example, to sort by grade then by age:

>>> sorted(student_tuples, key=itemgetter(1, 2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
>>> sorted(student_objects, key=attrgetter('grade', 'age'))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
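The third operator helper, methodcaller(), is not used in the examples above. A minimal sketch of how it might serve as a key function follows; the messages list is made-up sample data:

```python
from operator import methodcaller

messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']

# methodcaller('count', '!') builds a callable equivalent to
# lambda s: s.count('!'), so each message is keyed by how many
# exclamation marks it contains.
by_urgency = sorted(messages, key=methodcaller('count', '!'))
```

Here the least urgent message ('standby', with no exclamation marks) sorts first and 'critical!!!' sorts last.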

The functools module provides another helpful tool for making key-functions. The partial() function can reduce the arity of a multi-argument function making it suitable for use as a key-function.

>>> from functools import partial
>>> from unicodedata import normalize
>>> names = 'Zoë Åbjørn Núñez Élana Zeke Abe Nubia Eloise'.split()
>>> sorted(names, key=partial(normalize, 'NFD'))
['Abe', 'Åbjørn', 'Eloise', 'Élana', 'Nubia', 'Núñez', 'Zeke', 'Zoë']
>>> sorted(names, key=partial(normalize, 'NFC'))
['Abe', 'Eloise', 'Nubia', 'Núñez', 'Zeke', 'Zoë', 'Åbjørn', 'Élana']

Ascending and Descending

Both list.sort() and sorted() accept a reverse parameter with a boolean value. This is used to flag descending sorts. For example, to get the student data in reverse age order:

>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

Sort Stability and Complex Sorts

Sorts are guaranteed to be stable. That means that when multiple records have the same key, their original order is preserved.

>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> sorted(data, key=itemgetter(0))
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]

Notice how the two records for blue retain their original order so that ('blue', 1) is guaranteed to precede ('blue', 2).

This wonderful property lets you build complex sorts in a series of sorting steps. For example, to sort the student data by descending grade and then ascending age, do the age sort first and then sort again using grade:

>>> s = sorted(student_objects, key=attrgetter('age'))     # sort on secondary key
>>> sorted(s, key=attrgetter('grade'), reverse=True)       # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

This can be abstracted out into a wrapper function that can take a list and tuples of field and order to sort them on multiple passes.

>>> def multisort(xs, specs):
...     for key, reverse in reversed(specs):
...         xs.sort(key=attrgetter(key), reverse=reverse)
...     return xs

>>> multisort(list(student_objects), (('grade', True), ('age', False)))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset.
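A rough way to observe this effect is to count how many < comparisons a sort performs on already-ordered input versus shuffled input. The sketch below is an informal experiment, not a benchmark; the Counted wrapper and count_comparisons helper are hypothetical names, and exact counts may differ between Python versions:

```python
import random

comparisons = [0]

class Counted:
    """Wrap a value and count every < comparison the sort makes."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        comparisons[0] += 1
        return self.value < other.value

def count_comparisons(values):
    """Sort wrapped copies of values and return the comparison count."""
    comparisons[0] = 0
    sorted(Counted(v) for v in values)
    return comparisons[0]

random.seed(0)  # fixed seed so the shuffled input is reproducible
shuffled = random.sample(range(1000), 1000)

presorted_count = count_comparisons(range(1000))
shuffled_count = count_comparisons(shuffled)
```

On ordered input the count stays close to the number of elements, while shuffled input needs many more comparisons. This is why the second pass of a multi-pass sort, which runs over partially ordered data, tends to be cheap.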

Decorate-Sort-Undecorate

This idiom is called Decorate-Sort-Undecorate after its three steps:

  • First, the initial list is decorated with new values that control the sort order.

  • Second, the decorated list is sorted.

  • Finally, the decorations are removed, creating a list that contains only the initial values in the new order.

For example, to sort the student data by grade using the DSU approach:

>>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
>>> decorated.sort()
>>> [student for grade, i, student in decorated]               # undecorate
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

This idiom works because tuples are compared lexicographically; the first items are compared; if they are the same then the second items are compared, and so on.

It is not strictly necessary in all cases to include the index i in the decorated list, but including it gives two benefits:

  • The sort is stable: if two items have the same key, their order will be preserved in the sorted list.

  • The original items do not have to be comparable because the ordering of the decorated tuples will be determined by at most the first two items. So for example the original list could contain complex numbers which cannot be sorted directly.
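Both benefits can be seen in a small sketch that sorts complex numbers by magnitude. The sample values are made up for illustration; two of them (3+4j and 5j) share the magnitude 5.0, which forces a tie-break:

```python
values = [3+4j, 1+1j, 5j, 2+0j]

# Decorate with (magnitude, original index, value), sort, then undecorate.
decorated = [(abs(z), i, z) for i, z in enumerate(values)]
decorated.sort()
by_magnitude = [z for _, _, z in decorated]

# Without the index, the tie between 3+4j and 5j falls through to
# comparing the complex numbers themselves, which raises TypeError.
try:
    sorted((abs(z), z) for z in values)
    tie_break_failed = False
except TypeError:
    tie_break_failed = True
```

With the index present, the tied values keep their original relative order (3+4j before 5j) and the complex numbers are never compared with each other.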

Another name for this idiom is Schwartzian transform, after Randal L. Schwartz, who popularized it among Perl programmers.

Now that Python sorting provides key-functions, this technique is not often needed.

Comparison Functions

Unlike key functions that return an absolute value for sorting, a comparison function computes the relative ordering for two inputs.

For example, a balance scale compares two samples giving a relative ordering: lighter, equal, or heavier. Likewise, a comparison function such as cmp(a, b) will return a negative value for less-than, zero if the inputs are equal, or a positive value for greater-than.

It is common to encounter comparison functions when translating algorithms from other languages. Also, some libraries provide comparison functions as part of their API. For example, locale.strcoll() is a comparison function.

To accommodate those situations, Python provides functools.cmp_to_key to wrap the comparison function to make it usable as a key function:

sorted(words, key=cmp_to_key(strcoll))  # locale-aware sort order
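For a self-contained illustration that does not depend on locale settings, the sketch below wraps a hand-written comparison function. compare_length_then_alpha is a hypothetical function invented for this example; it orders strings by length and breaks ties alphabetically:

```python
from functools import cmp_to_key

def compare_length_then_alpha(a, b):
    """Return a negative, zero, or positive number, like an old-style cmp()."""
    if len(a) != len(b):
        return len(a) - len(b)
    # (a > b) - (a < b) yields -1, 0, or 1 for the alphabetical tie-break.
    return (a > b) - (a < b)

words = ['pear', 'fig', 'banana', 'kiwi', 'apple']
ordered = sorted(words, key=cmp_to_key(compare_length_then_alpha))
```

In new code the same ordering is usually simpler to express as a key function, for example `key=lambda w: (len(w), w)`; cmp_to_key is mainly useful when a comparison function already exists.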

Odds and Ends

  • For locale aware sorting, use locale.strxfrm() for a key function or locale.strcoll() for a comparison function. This is necessary because "alphabetical" sort orderings can vary across cultures even if the underlying alphabet is the same.

  • The reverse parameter still maintains sort stability (so that records with equal keys retain the original order). Interestingly, that effect can be simulated without the parameter by using the builtin reversed() function twice:

    >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
    >>> standard_way = sorted(data, key=itemgetter(0), reverse=True)
    >>> double_reversed = list(reversed(sorted(reversed(data), key=itemgetter(0))))
    >>> assert standard_way == double_reversed
    >>> standard_way
    [('red', 1), ('red', 2), ('blue', 1), ('blue', 2)]

  • The sort routines use < when making comparisons between two objects. So, it is easy to add a standard sort order to a class by defining an __lt__() method:

    >>> Student.__lt__ = lambda self, other: self.age < other.age
    >>> sorted(student_objects)
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

    However, note that < can fall back to using __gt__() if __lt__() is not implemented (see object.__lt__() for details on the mechanics). To avoid surprises, PEP 8 recommends that all six comparison methods be implemented. The total_ordering() decorator is provided to make that task easier.

  • Key functions need not depend directly on the objects being sorted. A key function can also access external resources. For instance, if the student grades are stored in a dictionary, they can be used to sort a separate list of student names:

    >>> students = ['dave', 'john', 'jane']
    >>> newgrades = {'john': 'F', 'jane': 'A', 'dave': 'C'}
    >>> sorted(students, key=newgrades.__getitem__)
    ['jane', 'dave', 'john']
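The total_ordering() decorator mentioned in the list above lives in the functools module. A minimal sketch of its use follows, assuming a hypothetical Version class that is ordered by a (major, minor) pair; only __eq__() and __lt__() are written by hand:

```python
from functools import total_ordering

@total_ordering
class Version:
    """Hypothetical example class ordered by (major, minor)."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __repr__(self):
        return f'Version({self.major}, {self.minor})'

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) == (other.major, other.minor)

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) < (other.major, other.minor)

versions = [Version(1, 10), Version(1, 2), Version(0, 9)]
ordered = sorted(versions)

# <= and >= were not written by hand; total_ordering derived them.
newer_or_equal = Version(1, 10) >= Version(1, 2)
older_or_equal = Version(0, 9) <= Version(0, 9)
```

The decorator fills in the remaining rich comparison methods from the two supplied, so the class behaves consistently under all six operators as well as under sorted().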

Partial Sorts

Some applications require only some of the data to be ordered. The standard library provides several tools that do less work than a full sort:

  • min() and max() return the smallest and largest values, respectively. These functions make a single pass over the input data and require almost no auxiliary memory.

  • heapq.nsmallest() and heapq.nlargest() return the n smallest and largest values, respectively. These functions make a single pass over the data keeping only n elements in memory at a time. For values of n that are small relative to the number of inputs, these functions make far fewer comparisons than a full sort.

  • heapq.heappush() and heapq.heappop() create and maintain a partially sorted arrangement of data that keeps the smallest element at position 0. These functions are suitable for implementing priority queues which are commonly used for task scheduling.
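The tools listed above can be sketched together in a short example. The scores and task names are made-up sample data, and the (priority, description) tuple layout is just one possible convention for a priority queue:

```python
import heapq

scores = [72, 95, 48, 88, 61, 99, 53]

# Single smallest and largest values.
lowest = min(scores)
highest = max(scores)

# The n smallest and n largest values, each returned in sorted order.
three_smallest = heapq.nsmallest(3, scores)
three_largest = heapq.nlargest(3, scores)

# A simple priority queue: lower numbers mean higher priority.
tasks = []
heapq.heappush(tasks, (2, 'write report'))
heapq.heappush(tasks, (1, 'fix outage'))
heapq.heappush(tasks, (3, 'refill coffee'))

# The smallest entry is always at position 0.
next_task = tasks[0]

# Popping repeatedly yields the tasks in priority order.
processing_order = [heapq.heappop(tasks)[1] for _ in range(3)]
```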