A gentle guide to the Python features that I didn't know existed or was too afraid to use. This will be updated as I learn more and become less lazy.
This uses `python >= 3.6`.

GitHub has problems rendering Jupyter notebooks, so I copied the content here. I still keep the notebook in case you want to clone it and run it on your machine, but you can also click the Binder badge below and run it in your browser.

The `lambda` keyword is used to create inline functions. The functions `square_fn` and `square_ld` below are identical.
```python
def square_fn(x):
    return x * x

square_ld = lambda x: x * x

for i in range(10):
    assert square_fn(i) == square_ld(i)
```
Its quick declaration makes `lambda` functions ideal for use in callbacks, and when functions are to be passed as arguments to other functions. They are especially useful when used in conjunction with functions like `map`, `filter`, and `reduce`.

`map(fn, iterable)` applies the `fn` to all elements of the `iterable` (e.g. list, set, dictionary, tuple, string) and returns a map object.
```python
nums = [1/3, 333/7, 2323/2230, 40/34, 2/3]
nums_squared = [num * num for num in nums]
print(nums_squared)
==> [0.1111111, 2263.04081632, 1.085147, 1.384083, 0.44444444]
```
This is the same as using `map` with a callback function.
```python
nums_squared_1 = map(square_fn, nums)
nums_squared_2 = map(lambda x: x * x, nums)
print(list(nums_squared_1))
==> [0.1111111, 2263.04081632, 1.085147, 1.384083, 0.44444444]
```
You can also use `map` with more than one iterable. For example, if you want to calculate the mean squared error of a simple linear function `f(x) = ax + b` with the true label `labels`, these two methods are equivalent:
```python
a, b = 3, -0.5
xs = [2, 3, 4, 5]
labels = [6.4, 8.9, 10.9, 15.3]

# Method 1: using a loop
errors = []
for i, x in enumerate(xs):
    errors.append((a * x + b - labels[i]) ** 2)
result1 = sum(errors) ** 0.5 / len(xs)

# Method 2: using map
diffs = map(lambda x, y: (a * x + b - y) ** 2, xs, labels)
result2 = sum(diffs) ** 0.5 / len(xs)

print(result1, result2)
==> 0.35089172119045514 0.35089172119045514
```
Note that objects returned by `map` and `filter` are iterators, which means that their values aren't stored but generated as needed. After you've called `sum(diffs)`, `diffs` becomes empty. If you want to keep all elements in `diffs`, convert it to a list using `list(diffs)`.
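For example, a quick sketch of this exhaustion behavior, reusing `a`, `b`, `xs`, and `labels` from above:

```python
diffs = map(lambda x, y: (a * x + b - y) ** 2, xs, labels)
sum(diffs)          # consumes the iterator
print(list(diffs))  # nothing left to yield
==> []
```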
`filter(fn, iterable)` works the same way as `map`, except that `fn` returns a boolean value and `filter` returns all the elements of the `iterable` for which the `fn` returns `True`.
```python
bad_preds = filter(lambda x: x > 0.5, errors)
print(list(bad_preds))
==> [0.8100000000000006, 0.6400000000000011]
```
`reduce(fn, iterable, initializer)` is used when we want to iteratively apply an operator to all elements in a list. For example, if we want to calculate the product of all elements in a list:
```python
product = 1
for num in nums:
    product *= num
print(product)
==> 12.95564683272412
```
This is equivalent to:
```python
from functools import reduce
product = reduce(lambda x, y: x * y, nums)
print(product)
==> 12.95564683272412
```
Lambda functions are meant for one-time use. Each time `lambda x: dosomething(x)` is called, the function has to be created, which hurts performance if you call `lambda x: dosomething(x)` multiple times (e.g. when you pass it inside `reduce`).

When you assign a name to the lambda function, as in `fn = lambda x: dosomething(x)`, its performance is slightly slower than the same function defined using `def`, but the difference is negligible. See here.
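If you want to verify this yourself, here is a rough sketch using the standard library's `timeit` module; the exact numbers will vary from machine to machine:

```python
import timeit

def square_fn(x):
    return x * x

square_ld = lambda x: x * x

# Both should take roughly the same time; the def/lambda gap is negligible.
print(timeit.timeit(lambda: square_fn(3), number=1_000_000))
print(timeit.timeit(lambda: square_ld(3), number=1_000_000))
```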
Even though I find lambdas cool, I personally recommend using named functions when you can for the sake of clarity.
Python lists are super cool.
We can unpack a list by each element like this:
```python
elems = [1, 2, 3, 4]
a, b, c, d = elems
print(a, b, c, d)
==> 1 2 3 4
```
We can also unpack a list like this:
```python
a, *new_elems, d = elems
print(a)
print(new_elems)
print(d)
==> 1
==> [2, 3]
==> 4
```
We know that we can reverse a list using `[::-1]`.
```python
elems = list(range(10))
print(elems)
==> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(elems[::-1])
==> [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```
The syntax `[x:y:z]` means "take every `z`-th element of a list from index `x` to index `y`". When `z` is negative, it indicates going backwards. When `x` isn't specified, it defaults to the first element of the list in the direction you are traversing the list. When `y` isn't specified, it defaults to the last element of the list. So if we want to take every second element of a list, we use `[::2]`.
```python
evens = elems[::2]
print(evens)
reversed_evens = elems[-2::-2]
print(reversed_evens)
==> [0, 2, 4, 6, 8]
==> [8, 6, 4, 2, 0]
```
We can also use slicing to delete all the even numbers in the list.
```python
del elems[::2]
print(elems)
==> [1, 3, 5, 7, 9]
```
We can change the value of an element in a list to another value.
```python
elems = list(range(10))
elems[1] = 10
print(elems)
==> [0, 10, 2, 3, 4, 5, 6, 7, 8, 9]
```
If we want to replace the element at an index with multiple elements, e.g. replace the value `1` with 3 values `20, 30, 40`:
```python
elems = list(range(10))
elems[1:2] = [20, 30, 40]
print(elems)
==> [0, 20, 30, 40, 2, 3, 4, 5, 6, 7, 8, 9]
```
If we want to insert 3 values `0.2, 0.3, 0.5` between the element at index 0 and the element at index 1:
```python
elems = list(range(10))
elems[1:1] = [0.2, 0.3, 0.5]
print(elems)
==> [0, 0.2, 0.3, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
We can flatten a list of lists using `sum`.
```python
list_of_lists = [[1], [2, 3], [4, 5, 6]]
sum(list_of_lists, [])
==> [1, 2, 3, 4, 5, 6]
```
If we have nested lists, we can flatten them recursively. That's another beauty of lambda functions -- we can use one in the same line as its creation.
```python
nested_lists = [[1, 2], [[3, 4], [5, 6], [[7, 8], [9, 10], [[11, [12, 13]]]]]]
flatten = lambda x: [y for l in x for y in flatten(l)] if type(x) is list else [x]
flatten(nested_lists)
# This line of code is from
# https://github.com/sahands/python-by-example/blob/master/python-by-example.rst#flattening-lists
```
To illustrate the difference between a list and a generator, let's look at an example of creating n-grams out of a list of tokens.
One way to create n-grams is to use a sliding window.
```python
tokens = ['i', 'want', 'to', 'go', 'to', 'school']

def ngrams(tokens, n):
    length = len(tokens)
    grams = []
    for i in range(length - n + 1):
        grams.append(tokens[i:i + n])
    return grams

print(ngrams(tokens, 3))
==> [['i', 'want', 'to'], ['want', 'to', 'go'], ['to', 'go', 'to'], ['go', 'to', 'school']]
```
In the above example, we have to store all the n-grams at the same time. If the text has m tokens, then the memory requirement isO(nm), which can be problematic when m is large.
Instead of using a list to store all n-grams, we can use a generator that generates the next n-gram when it's asked for. This is known as lazy evaluation. We can make the function `ngrams` return a generator using the keyword `yield`. Then the memory requirement is O(m + n).
```python
def ngrams(tokens, n):
    length = len(tokens)
    for i in range(length - n + 1):
        yield tokens[i:i + n]

ngrams_generator = ngrams(tokens, 3)
print(ngrams_generator)
==> <generator object ngrams at 0x1069b26d0>

for ngram in ngrams_generator:
    print(ngram)
==> ['i', 'want', 'to']
    ['want', 'to', 'go']
    ['to', 'go', 'to']
    ['go', 'to', 'school']
```
Another way to generate n-grams is to use slices to create lists: `[0, 1, ..., -n]`, `[1, 2, ..., -n+1]`, ..., `[n-1, n, ..., -1]`, and then `zip` them together.
```python
def ngrams(tokens, n):
    length = len(tokens)
    slices = (tokens[i:length - n + i + 1] for i in range(n))
    return zip(*slices)

ngrams_generator = ngrams(tokens, 3)
print(ngrams_generator)
==> <zip object at 0x1069a7dc8>  # zip objects are generators

for ngram in ngrams_generator:
    print(ngram)
==> ('i', 'want', 'to')
    ('want', 'to', 'go')
    ('to', 'go', 'to')
    ('go', 'to', 'school')
```
Note that to create the slices, we use `(tokens[...] for i in range(n))` instead of `[tokens[...] for i in range(n)]`. `[]` is the normal list comprehension that returns a list. `()` returns a generator.
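A quick sketch of the difference, using `sys.getsizeof` to make the memory point concrete (exact sizes vary across Python versions):

```python
import sys

list_comp = [x * x for x in range(1000)]  # a list: all 1000 values stored at once
gen_exp = (x * x for x in range(1000))    # a generator: values produced on demand
print(sys.getsizeof(list_comp))  # several kilobytes
print(sys.getsizeof(gen_exp))    # a small constant size, no matter the length
```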
In Python, magic methods are prefixed and suffixed with the double underscore `__`, also known as dunder. The most well-known magic method is probably `__init__`.
```python
class Node:
    """ A struct to denote the node of a binary tree.
    It contains a value and pointers to left and right children.
    """
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
```
When we try to print out a Node object, however, it's not very interpretable.
```python
root = Node(5)
print(root)  # <__main__.Node object at 0x1069c4518>
```
Ideally, when a user prints out a node, we want to print out the node's value and the values of its children, if it has any. To do so, we use the magic method `__repr__`, which must return a printable object, like a string.
```python
class Node:
    """ A struct to denote the node of a binary tree.
    It contains a value and pointers to left and right children.
    """
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def __repr__(self):
        strings = [f'value: {self.value}']
        strings.append(f'left: {self.left.value}' if self.left else 'left: None')
        strings.append(f'right: {self.right.value}' if self.right else 'right: None')
        return ', '.join(strings)

left = Node(4)
root = Node(5, left)
print(root)  # value: 5, left: 4, right: None
```
We'd also like to compare two nodes by comparing their values. To do so, we overload the operator `==` with `__eq__`, `<` with `__lt__`, and `>=` with `__ge__`.
```python
class Node:
    """ A struct to denote the node of a binary tree.
    It contains a value and pointers to left and right children.
    """
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    def __ge__(self, other):
        return self.value >= other.value

left = Node(4)
root = Node(5, left)
print(left == root)  # False
print(left < root)   # True
print(left >= root)  # False
```
For a comprehensive list of supported magic methods, see here, or see the official Python documentation here (slightly harder to read).
Some of the methods that I highly recommend:
- `__len__`: to overload the `len()` function.
- `__str__`: to overload the `str()` function.
- `__iter__`: if you want your objects to be iterable. Once `__iter__` is defined, you can loop over your object; to call `next()` directly on it, also define `__next__`. A minimal sketch of all three is below.
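A minimal sketch of these three methods on a toy class (the `Deck` class and its attributes are made up for illustration):

```python
class Deck:
    def __init__(self, cards):
        self.cards = cards

    def __len__(self):
        return len(self.cards)  # enables len(deck)

    def __str__(self):
        return f'Deck of {len(self)} cards'  # enables str(deck) and print(deck)

    def __iter__(self):
        return iter(self.cards)  # enables for-loops over deck

deck = Deck(['A', 'K', 'Q'])
print(len(deck))  # 3
print(deck)       # Deck of 3 cards
for card in deck:
    print(card)   # A, K, Q
```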
For classes like `Node`, where we know for sure all the attributes they can support (in the case of `Node`, they are `value`, `left`, and `right`), we might want to use `__slots__` to denote those values, for both a performance boost and memory savings. For a comprehensive understanding of the pros and cons of `__slots__`, see this absolutely amazing answer by Aaron Hall on StackOverflow.
```python
class Node:
    """ A struct to denote the node of a binary tree.
    It contains a value and pointers to left and right children.
    """
    __slots__ = ('value', 'left', 'right')
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
```
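A quick sketch of the trade-off: with `__slots__`, instances no longer carry a per-object `__dict__`, and assigning an attribute that isn't listed raises an error.

```python
node = Node(5)
node.value = 6      # fine: 'value' is declared in __slots__
node.color = 'red'  # AttributeError: 'Node' object has no attribute 'color'
```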
The `locals()` function returns a dictionary containing the variables defined in the local namespace.
```python
class Model1:
    def __init__(self, hidden_size=100, num_layers=3, learning_rate=3e-4):
        print(locals())
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate

model1 = Model1()
==> {'learning_rate': 0.0003, 'num_layers': 3, 'hidden_size': 100, 'self': <__main__.Model1 object at 0x1069b1470>}
```
All attributes of an object are stored in its `__dict__`.
```python
print(model1.__dict__)
==> {'hidden_size': 100, 'num_layers': 3, 'learning_rate': 0.0003}
```
Note that manually assigning each argument to an attribute can be quite tedious when the list of arguments is large. To avoid this, we can directly assign the list of arguments to the object's `__dict__`.
```python
class Model2:
    def __init__(self, hidden_size=100, num_layers=3, learning_rate=3e-4):
        params = locals()
        del params['self']
        self.__dict__ = params

model2 = Model2()
print(model2.__dict__)
==> {'learning_rate': 0.0003, 'num_layers': 3, 'hidden_size': 100}
```
This can be especially convenient when the object is initialized using the catch-all `**kwargs`, though the use of `**kwargs` should be kept to a minimum.
```python
class Model3:
    def __init__(self, **kwargs):
        self.__dict__ = kwargs

model3 = Model3(hidden_size=100, num_layers=3, learning_rate=3e-4)
print(model3.__dict__)
==> {'hidden_size': 100, 'num_layers': 3, 'learning_rate': 0.0003}
```
Often, you run into a wild import `*` that looks something like this:

`file.py`

```python
from parts import *
```
This is irresponsible because it will import everything in the module, even that module's own imports. For example, if `parts.py` looks like this:

`parts.py`

```python
import numpy
import tensorflow

class Encoder:
    ...

class Decoder:
    ...

class Loss:
    ...

def helper(*args, **kwargs):
    ...

def utils(*args, **kwargs):
    ...
```
Since `parts.py` doesn't have `__all__` specified, `file.py` will import `Encoder`, `Decoder`, `Loss`, `utils`, and `helper`, together with `numpy` and `tensorflow`.

If we intend that only `Encoder`, `Decoder`, and `Loss` are ever to be imported and used in another module, we should specify that in `parts.py` using the `__all__` variable.
`parts.py`

```python
__all__ = ['Encoder', 'Decoder', 'Loss']
import numpy
import tensorflow

class Encoder:
    ...
```
Now, if some user irresponsibly does a wild import with `parts`, they can only import `Encoder`, `Decoder`, and `Loss`. Personally, I also find `__all__` helpful as it gives me an overview of the module.
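To see the effect, a quick sketch of what the wild import now does in `file.py`, assuming `parts.py` defines `__all__` as above:

```python
from parts import *

enc = Encoder()  # works: Encoder is listed in __all__
helper()         # NameError: name 'helper' is not defined

# __all__ only governs wild imports; an explicit import still works.
from parts import helper
```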
It's often useful to know how long it takes a function to run, e.g. when you need to compare the performance of two algorithms that do the same thing. One naive way is to call `time.time()` at the beginning and end of each function and print out the difference.

For example, let's compare two algorithms to calculate the n-th Fibonacci number, one that uses memoization and one that doesn't.
```python
def fib_helper(n):
    if n < 2:
        return n
    return fib_helper(n - 1) + fib_helper(n - 2)

def fib(n):
    """ fib is a wrapper function so that later we can change its behavior
    at the top level without affecting the behavior at every recursion step.
    """
    return fib_helper(n)

def fib_m_helper(n, computed):
    if n in computed:
        return computed[n]
    computed[n] = fib_m_helper(n - 1, computed) + fib_m_helper(n - 2, computed)
    return computed[n]

def fib_m(n):
    return fib_m_helper(n, {0: 0, 1: 1})
```
Let's make sure that `fib` and `fib_m` are functionally equivalent.
```python
for n in range(20):
    assert fib(n) == fib_m(n)
```
```python
import time

start = time.time()
fib(30)
print(f'Without memoization, it takes {time.time() - start:.7f} seconds.')
==> Without memoization, it takes 0.267569 seconds.

start = time.time()
fib_m(30)
print(f'With memoization, it takes {time.time() - start:.7f} seconds.')
==> With memoization, it takes 0.0000713 seconds.
```
If you want to time multiple functions, it can be a drag having to write the same code over and over again. It'd be nice to have a way to specify how to change any function in the same way, which in this case would be to call `time.time()` at the beginning and end of each function and print out the time difference.

This is exactly what decorators do. They allow programmers to change the behavior of a function or class. Here's an example of creating a decorator `timeit`.
```python
def timeit(fn):
    # *args and **kwargs are to support positional and named arguments of fn
    def get_time(*args, **kwargs):
        start = time.time()
        output = fn(*args, **kwargs)
        print(f"Time taken in {fn.__name__}: {time.time() - start:.7f}")
        return output  # make sure that the decorator returns the output of fn
    return get_time
```
Add the decorator `@timeit` to your functions.
```python
@timeit
def fib(n):
    return fib_helper(n)

@timeit
def fib_m(n):
    return fib_m_helper(n, {0: 0, 1: 1})

fib(30)
fib_m(30)
==> Time taken in fib: 0.2787242
==> Time taken in fib_m: 0.0000138
```
Memoization is a form of caching: we cache the previously calculated Fibonacci numbers so that we don't have to calculate them again.

Caching is such an important technique that Python provides a built-in decorator to give your function caching capability. If you want `fib_helper` to reuse previously calculated Fibonacci numbers, you can just add the decorator `lru_cache` from `functools`. `lru` stands for "least recently used". For more information on caching, see here.
```python
import functools

@functools.lru_cache()
def fib_helper(n):
    if n < 2:
        return n
    return fib_helper(n - 1) + fib_helper(n - 2)

@timeit
def fib(n):
    """ fib is a wrapper function so that later we can change its behavior
    at the top level without affecting the behavior at every recursion step.
    """
    return fib_helper(n)

fib(50)
fib_m(50)
==> Time taken in fib: 0.0000412
==> Time taken in fib_m: 0.0000281
```
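As a side note, functions decorated with `functools.lru_cache` also expose `cache_info()` and `cache_clear()`, which are handy for checking that the cache is actually being hit:

```python
print(fib_helper.cache_info())  # e.g. CacheInfo(hits=..., misses=..., maxsize=128, currsize=...)
fib_helper.cache_clear()        # empty the cache if you want a cold start
```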