Generating Random Data in Python (Guide)

by Brad Solomon · Reading time estimate 28m · intermediate data-science python

How random is random? This is a weird question to ask, but it is one of paramount importance in cases where information security is concerned. Whenever you’re generating random data, strings, or numbers in Python, it’s a good idea to have at least a rough idea of how that data was generated.

Here, you’ll cover a handful of different options for generating random data in Python, and then build up to a comparison of each in terms of its level of security, versatility, purpose, and speed.

I promise that this tutorial will not be a lesson in mathematics or cryptography, which I wouldn’t be well equipped to lecture on in the first place. You’ll get into just as much math as needed, and no more.

How Random Is Random?

First, a prominent disclaimer is necessary. Most random data generated with Python is not fully random in the scientific sense of the word. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data.

“True” random numbers can be generated by, you guessed it, a true random number generator (TRNG). One example is to repeatedly pick up a die off the floor, toss it in the air, and let it land how it may.

Assuming that your toss is unbiased, you have truly no idea what number the die will land on. Rolling a die is a crude form of using hardware to generate a number that is not deterministic whatsoever. (Or, you can have the dice-o-matic do this for you.) TRNGs are out of the scope of this article but worth a mention nonetheless for comparison’s sake.

PRNGs, usually done with software rather than hardware, work slightly differently. Here’s a concise description:

They start with a random number, known as the seed, and then use an algorithm to generate a pseudo-random sequence of bits based on it. (Source)

You’ve likely been told to “read the docs!” at some point. Well, those people are not wrong. Here’s a particularly notable snippet from the random module’s documentation that you don’t want to miss:

Warning: The pseudo-random generators of this module should not be used for security purposes. (Source)

You’ve probably seen random.seed(999), random.seed(1234), or the like, in Python. This function call is seeding the underlying random number generator used by Python’s random module. It is what makes subsequent calls to generate random numbers deterministic: input A always produces output B. This blessing can also be a curse if it is used maliciously.

Perhaps the terms “random” and “deterministic” seem like they cannot exist next to each other. To make that clearer, here’s an extremely trimmed down version of random() that iteratively creates a “random” number by using x = (x * 3) % 19. x is originally defined as a seed value and then morphs into a deterministic sequence of numbers based on that seed:

Python

class NotSoRandom(object):
    def seed(self, a=3):
        """Seed the world's most mysterious random number generator."""
        self.seedval = a

    def random(self):
        """Look, random numbers!"""
        self.seedval = (self.seedval * 3) % 19
        return self.seedval

_inst = NotSoRandom()
seed = _inst.seed
random = _inst.random

Don’t take this example too literally, as it’s meant mainly to illustrate the concept. If you use the seed value 1234, the subsequent sequence of calls to random() should always be identical:

Python

>>> seed(1234)
>>> [random() for _ in range(10)]
[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]
>>> seed(1234)
>>> [random() for _ in range(10)]
[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]

You’ll see a more serious illustration of this shortly.
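One consequence of the toy generator is worth spelling out: each output depends only on the previous one, and there are at most 19 possible values, so the sequence has to repeat sooner or later. Here’s a minimal sketch that measures how long that takes. The helper name `cycle_length()` is hypothetical and not part of the example above:

```python
def cycle_length(seed: int, mult: int = 3, mod: int = 19) -> int:
    """Count how many outputs x = (x * mult) % mod yields before repeating."""
    first = (seed * mult) % mod  # First value returned after seeding
    x = first
    # A value modulo `mod` can only take `mod` distinct states
    for count in range(1, mod + 1):
        x = (x * mult) % mod
        if x == first:
            return count
    raise ValueError("sequence never returns to its first value")


print(cycle_length(1234))  # Number of calls before the outputs start over
```

With a seed of 1234, this toy generator cycles after 18 values, which is one more reason not to take it too literally.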


What Is “Cryptographically Secure?”

If you haven’t had enough with the “RNG” acronyms, let’s throw one more into the mix: a CSPRNG, or cryptographically secure PRNG. CSPRNGs are suitable for generating sensitive data such as passwords, authenticators, and tokens. Given a random string, there is realistically no way for Malicious Joe to determine what string came before or after that string in a sequence of random strings.

One other term that you may see is entropy. In a nutshell, this refers to the amount of randomness introduced or desired. For example, one Python module that you’ll cover here defines DEFAULT_ENTROPY = 32, the number of bytes to return by default. The developers deem this to be “enough” bytes to be a sufficient amount of noise.

Note: Through this tutorial, I assume that a byte refers to 8 bits, as it has since the 1960s, rather than some other unit of data storage. You are free to call this an octet if you so prefer.

A key point about CSPRNGs is that they are still pseudorandom. They are engineered in some way that is internally deterministic, but they add some other variable or have some property that makes them “random enough” to prohibit backing into whatever function enforces determinism.

What You’ll Cover Here

In practical terms, this means that you should use plain PRNGs for statistical modeling, simulation, and to make random data reproducible. They’re also significantly faster than CSPRNGs, as you’ll see later on. Use CSPRNGs for security and cryptographic applications where data sensitivity is imperative.

In addition to expanding on the use cases above, in this tutorial, you’ll delve into Python tools for using both PRNGs and CSPRNGs:

PRNG options include the random module from Python’s standard library and its array-based NumPy counterpart, numpy.random.
Python’s os, secrets, and uuid modules contain functions for generating cryptographically secure objects.

You’ll touch on all of the above and wrap up with a high-level comparison.

PRNGs in Python

The `random` Module

Probably the most widely known tool for generating random data in Python is its random module, which uses the Mersenne Twister PRNG algorithm as its core generator.

Earlier, you touched briefly on random.seed(), and now is a good time to see how it works. First, let’s build some random data without seeding. The random.random() function returns a random float in the interval [0.0, 1.0). The result will always be less than the right-hand endpoint (1.0). This is also known as a semi-open range:

Python

>>> # Don't call `random.seed()` yet
>>> import random
>>> random.random()
0.35553263284394376
>>> random.random()
0.6101992345575074

If you run this code yourself, I’ll bet my life savings that the numbers returned on your machine will be different. The default when you don’t seed the generator is to use your current system time or a “randomness source” from your OS if one is available.

With random.seed(), you can make results reproducible, and the chain of calls after random.seed() will produce the same trail of data:

Python

>>> random.seed(444)
>>> random.random()
0.3088946587429545
>>> random.random()
0.01323751590501987
>>> random.seed(444)  # Re-seed
>>> random.random()
0.3088946587429545
>>> random.random()
0.01323751590501987

Notice the repetition of “random” numbers. The sequence of random numbers becomes deterministic, or completely determined by the seed value, 444.
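If you’d rather not touch the module-level generator, one option is to create your own `random.Random` instances, each with its own seed. This is a small sketch of that idea, assuming you just want to confirm that two identically seeded generators stay in lockstep. The variable names are illustrative:

```python
import random

# Two independent generators seeded identically (444 matches the seed above)
rng_a = random.Random(444)
rng_b = random.Random(444)

draws_a = [rng_a.random() for _ in range(5)]
draws_b = [rng_b.random() for _ in range(5)]

print(draws_a == draws_b)  # True: same seed, same sequence

# A differently seeded generator goes its own way
rng_c = random.Random(445)
draws_c = [rng_c.random() for _ in range(5)]
print(draws_a == draws_c)
```

Separate instances also keep one part of a program from disturbing the random sequence that another part relies on.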

Let’s take a look at some more basic functionality of random. Above, you generated a random float. You can generate a random integer between two endpoints in Python with the random.randint() function. This spans the full [x, y] interval and may include both endpoints:

Python

>>> random.randint(0, 10)
7
>>> random.randint(500, 50000)
18601

With random.randrange(), you can exclude the right-hand side of the interval, meaning the generated number always lies within [x, y) and will always be smaller than the right endpoint:

Python

>>> random.randrange(1, 10)
5
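To convince yourself of the endpoint behavior, you can draw a large number of values and look at what shows up. The sketch below assumes 10,000 draws is plenty for every value in such a small range to appear. The fixed seed is only there to make the check repeatable:

```python
import random

rng = random.Random(0)  # Seeded only so the check is repeatable

randint_values = {rng.randint(1, 10) for _ in range(10_000)}
randrange_values = {rng.randrange(1, 10) for _ in range(10_000)}

print(max(randint_values))    # 10: the right endpoint is included
print(max(randrange_values))  # 9: the right endpoint is excluded
```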

If you need to generate random floats that lie within a specific [x, y] interval, you can use random.uniform(), which plucks from the continuous uniform distribution:

Python

>>> random.uniform(20, 30)
27.42639687016509
>>> random.uniform(30, 40)
36.33865802745107

To pick a random element from a non-empty sequence (like a list or a tuple), you can use random.choice(). There is also random.choices() for choosing multiple elements from a sequence with replacement (duplicates are possible):

Python

>>> items = ['one', 'two', 'three', 'four', 'five']
>>> random.choice(items)
'four'
>>> random.choices(items, k=2)
['three', 'three']
>>> random.choices(items, k=3)
['three', 'five', 'four']

To mimic sampling without replacement, use random.sample():

Python

>>> random.sample(items, 4)
['one', 'five', 'four', 'three']

You can randomize a sequence in-place using random.shuffle(). This will modify the sequence object and randomize the order of elements:

Python

>>> random.shuffle(items)
>>> items
['four', 'three', 'two', 'one', 'five']

If you’d rather not mutate the original list, you’ll need to make a copy first and then shuffle the copy. You can create copies of Python lists with the copy module, or just x[:] or x.copy(), where x is the list.
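Here’s a short sketch of two ways to get a shuffled list while leaving the original alone. The second option, sampling every element with random.sample(), isn’t mentioned above but returns a new list in random order:

```python
import random

items = ['one', 'two', 'three', 'four', 'five']

# Option 1: copy first, then shuffle the copy in place
shuffled_copy = items.copy()
random.shuffle(shuffled_copy)

# Option 2: sample every element, which returns a new list
shuffled_sample = random.sample(items, k=len(items))

print(items)  # Still in its original order
```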

Before moving on to generating random data with NumPy, let’s look at one more slightly involved application: generating a sequence of unique random strings of uniform length.

It can help to think about the design of the function first. You need to choose from a “pool” of characters such as letters, numbers, and/or punctuation, combine these into a single string, and then check that this string has not already been generated. A Python set works well for this type of membership testing:

Python

import string

def unique_strings(k: int, ntokens: int,
                   pool: str=string.ascii_letters) -> set:
    """Generate a set of unique string tokens.

    k: Length of each token
    ntokens: Number of tokens
    pool: Iterable of characters to choose from

    For a highly optimized version:
    https://stackoverflow.com/a/48421303/7954504
    """

    seen = set()

    # An optimization for tightly-bound loops:
    # Bind these methods outside of a loop
    join = ''.join
    add = seen.add

    while len(seen) < ntokens:
        token = join(random.choices(pool, k=k))
        add(token)
    return seen

''.join() joins the letters from random.choices() into a single Python str of length k. This token is added to the set, which can’t contain duplicates, and the while loop executes until the set has the number of elements that you specify.

Resource: Python’s string module contains a number of useful constants: string.ascii_lowercase, string.ascii_uppercase, string.punctuation, string.whitespace, and a handful of others.

Let’s try this function out:

Python

>>> unique_strings(k=4, ntokens=5)
{'AsMk', 'Cvmi', 'GIxv', 'HGsZ', 'eurU'}
>>> unique_strings(5, 4, string.printable)
{"'O*1!", '9Ien%', 'W=m7<', 'mUD|z'}

For a fine-tuned version of this function, this Stack Overflow answer uses generator functions, name binding, and some other advanced tricks to make a faster, cryptographically secure version of unique_strings() above.
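To give a flavor of what a cryptographically secure variant might look like, here’s a minimal sketch that swaps random.choices() for secrets.choice(). This is not the code from the Stack Overflow answer. The name `secure_unique_strings()` and the up-front size check are assumptions added for illustration:

```python
import secrets
import string


def secure_unique_strings(k: int, ntokens: int,
                          pool: str = string.ascii_letters) -> set:
    """Like unique_strings(), but each character comes from the OS CSPRNG."""
    # Guard against requests that could never finish
    if ntokens > len(set(pool)) ** k:
        raise ValueError('not enough distinct tokens of that length')

    seen = set()
    while len(seen) < ntokens:
        seen.add(''.join(secrets.choice(pool) for _ in range(k)))
    return seen


tokens = secure_unique_strings(k=8, ntokens=5)
print(len(tokens), all(len(t) == 8 for t in tokens))
```

The trade-off is speed: picking one character at a time from a CSPRNG is slower than a single random.choices() call.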


PRNGs for Arrays: `numpy.random`

One thing you might have noticed is that a majority of the functions from random return a scalar value (a single int, float, or other object). If you wanted to generate a sequence of random numbers, one way to achieve that would be with a Python list comprehension:

Python

>>> [random.random() for _ in range(5)]
[0.021655420657909374, 0.4031628347066195, 0.6609991871223335, 0.5854998250783767, 0.42886606317322706]

But there is another option that is specifically designed for this. You can think of NumPy’s own numpy.random package as being like the standard library’s random, but for NumPy arrays. (It also comes loaded with the ability to draw from a lot more statistical distributions.)

Take note that numpy.random uses its own PRNG that is separate from plain old random. You won’t produce deterministically random NumPy arrays with a call to Python’s own random.seed():

Python

>>> import numpy as np
>>> np.random.seed(444)
>>> np.set_printoptions(precision=2)  # Output decimal fmt.

Without further ado, here are a few examples to whet your appetite:

Python

>>> # Return samples from the standard normal distribution
>>> np.random.randn(5)
array([ 0.36,  0.38,  1.38,  1.18, -0.94])
>>> np.random.randn(3, 4)
array([[-1.14, -0.54, -0.55,  0.21],
       [ 0.21,  1.27, -0.81, -3.3 ],
       [-0.81, -0.36, -0.88,  0.15]])
>>> # `p` is the probability of choosing each element
>>> np.random.choice([0, 1], p=[0.6, 0.4], size=(5, 4))
array([[0, 0, 1, 0],
       [0, 1, 1, 1],
       [1, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 1]])

In the syntax for randn(d0, d1, ..., dn), the parameters d0, d1, ..., dn are optional and indicate the shape of the final object. Here, np.random.randn(3, 4) creates a 2d array with 3 rows and 4 columns. The data will be i.i.d., meaning that each data point is drawn independent of the others.

Note: If you’re looking to create normally distributed random numbers, then you’re in luck! How to Get Normally Distributed Random Numbers With NumPy can guide your way.

Another common operation is to create a sequence of random Boolean values, True or False. One way to do this would be with np.random.choice([True, False]). However, it’s actually about 4x faster to choose from (0, 1) and then view-cast these integers to their corresponding Boolean values:

Python

>>> # NumPy's `randint` is [inclusive, exclusive), unlike `random.randint()`
>>> np.random.randint(0, 2, size=25, dtype=np.uint8).view(bool)
array([ True, False,  True,  True, False,  True, False, False, False,
       False, False,  True,  True, False, False, False,  True, False,
        True, False,  True,  True,  True, False,  True])

What about generating correlated data? Let’s say you want to simulate two correlated time series. One way of going about this is with NumPy’s multivariate_normal() function, which takes a covariance matrix into account. For comparison, to draw from a single normally distributed random variable, you need to specify its mean and variance (or standard deviation).

To sample from the multivariate normal distribution, you specify the means and covariance matrix, and you end up with multiple, correlated series of data that are each approximately normally distributed.

However, rather than covariance, correlation is a measure that is more familiar and intuitive to most. It’s the covariance normalized by the product of standard deviations, and so you can also define covariance in terms of correlation and standard deviation:

$$\operatorname{cov}(X, Y) = \operatorname{corr}(X, Y) \cdot \sigma_X \cdot \sigma_Y$$

So, could you draw random samples from a multivariate normal distribution by specifying a correlation matrix and standard deviations? Yes, but you’ll need to get the above into matrix form first. Here, S is a vector of the standard deviations, P is their correlation matrix, and C is the resulting (square) covariance matrix:

$$C = \operatorname{diag}(S) \, P \, \operatorname{diag}(S)$$

This can be expressed in NumPy as follows:

Python

def corr2cov(p: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Covariance matrix from correlation & standard deviations"""
    d = np.diag(s)
    return d @ p @ d
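As a quick arithmetic check on what corr2cov() computes, here’s the same calculation done by hand in plain Python for two variables. The inputs (a correlation of -0.40 and standard deviations of 6 and 1) are illustrative, and the sketch relies on the fact that multiplying by a diagonal matrix on each side scales entry (i, j) by S[i] and S[j]:

```python
# Correlation matrix P and standard deviations S for two variables A and B
P = [[1.0, -0.40],
     [-0.40, 1.0]]
S = [6.0, 1.0]

# Entry (i, j) of diag(S) @ P @ diag(S) is S[i] * P[i][j] * S[j]
C = [[S[i] * P[i][j] * S[j] for j in range(2)] for i in range(2)]

for row in C:
    print([round(value, 2) for value in row])
```

The diagonal holds the variances (36 and 1), and the off-diagonal entries come out to about -2.4, which is the covariance between A and B.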

Now, you can generate two time series that are correlated but still random:

Python

>>> # Start with a correlation matrix and standard deviations.
>>> # -0.40 is the correlation between A and B, and the correlation
>>> # of a variable with itself is 1.0.
>>> corr = np.array([[1., -0.40],
...                  [-0.40, 1.]])
>>> # Standard deviations/means of A and B, respectively
>>> stdev = np.array([6., 1.])
>>> mean = np.array([2., 0.5])
>>> cov = corr2cov(corr, stdev)
>>> # `size` is the length of time series for 2d data
>>> # (500 months, days, and so on).
>>> data = np.random.multivariate_normal(mean=mean, cov=cov, size=500)
>>> data[:10]
array([[ 0.58,  1.87],
       [-7.31,  0.74],
       [-6.24,  0.33],
       [-0.77,  1.19],
       [ 1.71,  0.7 ],
       [-3.33,  1.57],
       [-1.13,  1.23],
       [-6.58,  1.81],
       [-0.82, -0.34],
       [-2.32,  1.1 ]])
>>> data.shape
(500, 2)

You can think of data as 500 pairs of inversely correlated data points. Here’s a sanity check that you can back into the original inputs, which approximate corr, stdev, and mean from above:

Python

>>> np.corrcoef(data, rowvar=False)
array([[ 1.  , -0.39],
       [-0.39,  1.  ]])
>>> data.std(axis=0)
array([5.96, 1.01])
>>> data.mean(axis=0)
array([2.13, 0.49])

Before we move on to CSPRNGs, it might be helpful to summarize some random functions and their numpy.random counterparts:

Python `random` Module	NumPy Counterpart	Use
`random()`	`rand()`	Random float in [0.0, 1.0)
`randint(a, b)`	`random_integers()` (deprecated in favor of `randint()`)	Random integer in [a, b]
`randrange(a, b[, step])`	`randint()`	Random integer in [a, b)
`uniform(a, b)`	`uniform()`	Random float in [a, b]
`choice(seq)`	`choice()`	Random element from `seq`
`choices(seq, k=1)`	`choice()`	Random `k` elements from `seq` with replacement
`sample(population, k)`	`choice()` with `replace=False`	Random `k` elements from `population` without replacement
`shuffle(x)`	`shuffle()`	Shuffle the sequence `x` in place
`normalvariate(mu, sigma)` or `gauss(mu, sigma)`	`normal()`	Sample from a normal distribution with mean `mu` and standard deviation `sigma`

Note: NumPy is specialized for building and manipulating large, multidimensional arrays. If you just need a single value, random will suffice and will probably be faster as well. For small sequences, random may even be faster too, because NumPy does come with some overhead.

Now that you’ve covered two fundamental options for PRNGs, let’s move onto a few more secure adaptations.


CSPRNGs in Python

`os.urandom()`: About as Random as It Gets

Python’s os.urandom() function is used by both secrets and uuid (both of which you’ll see here in a moment). Without getting into too much detail, os.urandom() generates operating-system-dependent random bytes that can safely be called cryptographically secure:

On Unix operating systems, it reads random bytes from the special file /dev/urandom, which in turn “allow access to environmental noise collected from device drivers and other sources.” (Thank you, Wikipedia.) This is garbled information that is particular to your hardware and system state at an instant in time but at the same time sufficiently random.
On Windows, the Windows API function CryptGenRandom() is used (BCryptGenRandom() as of Python 3.11). This function is still technically pseudorandom, but it works by generating a seed value from variables such as the process ID, memory status, and so on.

With os.urandom(), there is no concept of manually seeding. While still technically pseudorandom, this function better aligns with how we think of randomness. The only argument is the number of bytes to return:

Python

>>> import os
>>> os.urandom(3)
b'\xa2\xe8\x02'
>>> x = os.urandom(6)
>>> x
b'\xce\x11\xe7"!\x84'
>>> type(x), len(x)
(bytes, 6)

Before we go any further, this might be a good time to delve into a mini-lesson on character encoding. Many people, including myself, have some type of allergic reaction when they see bytes objects and a long line of \x characters. However, it’s useful to know how sequences such as x above eventually get turned into strings or numbers.

os.urandom() returns a sequence of single bytes:

Python

>>> x
b'\xce\x11\xe7"!\x84'

But how does this eventually get turned into a Python str or sequence of numbers?

First, recall one of the fundamental concepts of computing, which is that a byte is made up of 8 bits. You can think of a bit as a single digit that is either 0 or 1. A byte effectively chooses between 0 and 1 eight times, so both 01101100 and 11110000 could represent bytes. Try this, which makes use of Python f-strings introduced in Python 3.6, in your interpreter:

Python

>>> binary = [f'{i:0>8b}' for i in range(256)]
>>> binary[:16]
['00000000', '00000001', '00000010', '00000011', '00000100', '00000101', '00000110', '00000111', '00001000', '00001001', '00001010', '00001011', '00001100', '00001101', '00001110', '00001111']

This is equivalent to [bin(i) for i in range(256)], with some special formatting. bin() converts an integer to its binary representation as a string.

Where does that leave us? Using range(256) above is not a random choice. (No pun intended.) Given that we are allowed 8 bits, each with 2 choices, there are 2 ** 8 == 256 possible bytes “combinations.”

This means that each byte maps to an integer between 0 and 255. In other words, we would need more than 8 bits to express the integer 256. You can verify this by checking that len(f'{256:0>8b}') is now 9, not 8.
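Another way to check the 8-bit limit is with int.bit_length(), which isn’t used above but reports how many bits an integer needs. The sketch also shows that bytes() refuses any element outside 0 through 255:

```python
# int.bit_length() reports how many bits an integer needs
print((255).bit_length())  # 8
print((256).bit_length())  # 9

# bytes() enforces the same limit on each element
try:
    bytes([256])
except ValueError as err:
    print(f'ValueError: {err}')

largest_byte = bytes([255])
```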

Okay, now let’s get back to the bytes data type that you saw above, by constructing a sequence of the bytes that correspond to integers 0 through 255:

Python

>>> bites = bytes(range(256))

If you call list(bites), you’ll get back to a Python list that runs from 0 to 255. But if you just print bites, you get an ugly looking sequence littered with backslashes:

Python

>>> bites
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15'
'\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJK'
'LMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86'
'\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b'
# ...

These backslashes are escape sequences, and \xhh represents the character with hex value hh. Some of the elements of bites are displayed literally (printable characters such as letters, numbers, and punctuation). Most are expressed with escapes. \x08 represents a keyboard’s backspace, while \r (hex 0d, decimal 13) is a carriage return (part of a new line, on Windows systems).

If you need a refresher on hexadecimal, Charles Petzold’s Code: The Hidden Language is a great place for that. Hex is a base-16 numbering system that, instead of using 0 through 9, uses 0 through 9 and a through f as its basic digits.

Finally, let’s get back to where you started, with the sequence of random bytes x. Hopefully this makes a little more sense now. Calling .hex() on a bytes object gives a str of hexadecimal numbers, with each corresponding to a decimal number from 0 through 255:

Python

>>> x
b'\xce\x11\xe7"!\x84'
>>> list(x)
[206, 17, 231, 34, 33, 132]
>>> x.hex()
'ce11e7222184'
>>> len(x.hex())
12

One last question: how is x.hex() 12 characters long above, even though x is only 6 bytes? This is because two hexadecimal digits correspond precisely to a single byte. The str version of bytes will always be twice as long as far as our eyes are concerned.

Even if the byte (such as \x01) does not need a full 8 bits to be represented, x.hex() will always use two hex digits per byte, so the number 1 will be represented as 01 rather than just 1. Mathematically, though, both of these are the same size.
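You can confirm the two-digits-per-byte relationship by splitting the hex string into pairs and converting each pair back to an integer. This sketch reuses the six bytes from the example above. bytes.fromhex() is the standard inverse of .hex():

```python
x = b'\xce\x11\xe7"!\x84'  # The same six bytes as above

hex_str = x.hex()
print(len(x), len(hex_str))  # 6 12

# Each pair of hex digits decodes back to one byte
pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
print([int(pair, 16) for pair in pairs] == list(x))  # True

# bytes.fromhex() reverses .hex() exactly
print(bytes.fromhex(hex_str) == x)  # True
```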

Technical Detail: What you’ve mainly dissected here is how a bytes object becomes a Python str. One other technicality is how bytes produced by os.urandom() get converted to a float in the interval [0.0, 1.0), as in the cryptographically secure version of random.random(). If you’re interested in exploring this further, this code snippet demonstrates how int.from_bytes() makes the initial conversion to an integer, using a base-256 numbering system.
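To make that conversion concrete, here’s a minimal sketch of one way to turn random bytes into a float in [0.0, 1.0). As far as I know it mirrors the general approach CPython takes, but treat the details as an assumption. The helper name `bytes_to_unit_float()` is made up for illustration:

```python
import os


def bytes_to_unit_float(raw: bytes) -> float:
    """Map 7 bytes to a float in [0.0, 1.0) using their top 53 bits."""
    if len(raw) != 7:
        raise ValueError('expected exactly 7 bytes')
    # 7 bytes = 56 bits; a float holds 53 bits of precision, so drop 3
    as_int = int.from_bytes(raw, 'big') >> 3
    return as_int * 2 ** -53


value = bytes_to_unit_float(os.urandom(7))
print(0.0 <= value < 1.0)  # True
```

Even the largest possible input, seven 0xff bytes, lands just below 1.0, which is what keeps the interval half-open.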

With that under your belt, let’s touch on a recently introduced module, secrets, which makes generating secure tokens much more user-friendly.


Python’s Best Kept `secrets`

Introduced in Python 3.6 by one of the more colorful PEPs out there, the secrets module is intended to be the de facto Python module for generating cryptographically secure random bytes and strings.

You can check out the source code for the module, which is short and sweet at about 25 lines of code. secrets is basically a wrapper around os.urandom(). It exports just a handful of functions for generating random numbers, bytes, and strings. Most of these examples should be fairly self-explanatory:

Python

>>> import secrets
>>> n = 16
>>> # Generate secure tokens
>>> secrets.token_bytes(n)
b'A\x8cz\xe1o\xf9!;\x8b\xf2\x80pJ\x8b\xd4\xd3'
>>> secrets.token_hex(n)
'9cb190491e01230ec4239cae643f286f'
>>> secrets.token_urlsafe(n)
'MJoi7CknFu3YN41m88SEgQ'
>>> # Secure version of `random.choice()`
>>> secrets.choice('rain')
'a'
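As one more illustration of these building blocks, here’s a small sketch that assembles a password with secrets.choice() and draws a secure integer with secrets.randbelow(). The helper `make_password()` and its default length of 12 are assumptions for the example, not recommendations from the module:

```python
import secrets
import string


def make_password(length: int = 12) -> str:
    """Build a password from letters and digits using secrets.choice()."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))


password = make_password()
print(len(password))  # 12

# secrets.randbelow(n) returns a secure integer in [0, n)
roll = secrets.randbelow(6) + 1  # A die roll from 1 through 6
```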

Now, how about a concrete example? You’ve probably used URL shortener services like tinyurl.com or bit.ly that turn an unwieldy URL into something like https://bit.ly/2IcCp9u. Most shorteners don’t do any complicated hashing from input to output; they just generate a random string, make sure that string has not already been generated previously, and then tie that back to the input URL.

Let’s say that after taking a look at the Root Zone Database, you’ve registered the site short.ly. Here’s a function to get you started with your service:

Python

# shortly.py

from secrets import token_urlsafe

DATABASE = {}

def shorten(url: str, nbytes: int=5) -> str:
    ext = token_urlsafe(nbytes=nbytes)
    if ext in DATABASE:
        return shorten(url, nbytes=nbytes)
    else:
        DATABASE.update({ext: url})
        return f'short.ly/{ext}'

Is this a full-fledged real illustration? No. I would wager that bit.ly does things in a slightly more advanced way than storing its gold mine in a global Python dictionary that is not persistent between sessions.

Note: If you’d like to build a full-fledged URL shortener of your own, then check out Build a URL Shortener With FastAPI and Python.

However, it’s roughly accurate conceptually:

Python

>>> urls = (
...     'https://realpython.com/',
...     'https://docs.python.org/3/howto/regex.html'
... )
>>> for u in urls:
...     print(shorten(u))
short.ly/p_Z4fLI
short.ly/fuxSyNY
>>> DATABASE
{'p_Z4fLI': 'https://realpython.com/', 'fuxSyNY': 'https://docs.python.org/3/howto/regex.html'}

Hold On: One thing you may notice is that both of these results are of length 7 when you requested 5 bytes. Wait, I thought that you said the result would be twice as long? Well, not exactly, in this case. There is one more thing going on here: token_urlsafe() uses base64 encoding, where each character is 6 bits of data. (It’s 0 through 63, and corresponding characters. Because this is the URL-safe variant, the characters are A-Z, a-z, 0-9, plus - and _ in place of the standard + and /.)

If you originally specify a certain number of bytes nbytes, the resulting length from secrets.token_urlsafe(nbytes) will be math.ceil(nbytes * 8 / 6), which you can prove and investigate further if you’re curious.
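If you’d rather check that length formula empirically than prove it, a short loop does the job. This sketch assumes sizes from 1 through 64 bytes are representative enough:

```python
import math
from secrets import token_urlsafe

for nbytes in range(1, 65):
    expected = math.ceil(nbytes * 8 / 6)
    assert len(token_urlsafe(nbytes)) == expected

print(len(token_urlsafe(5)))  # 7, matching the shortened URLs above
```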

The bottom line here is that, while secrets is really just a wrapper around existing Python functions, it can be your go-to when security is your foremost concern.

One Last Candidate: `uuid`

One last option for generating a random token is the uuid4() function from Python’s uuid module. A UUID is a Universally Unique IDentifier, a 128-bit sequence (a hex str of length 32) designed to “guarantee uniqueness across space and time.” uuid4() is one of the module’s most useful functions, and this function also uses os.urandom():

Python

>>> import uuid
>>> uuid.uuid4()
UUID('3e3ef28d-3ff0-4933-9bba-e5ee91ce0e7b')
>>> uuid.uuid4()
UUID('2e115fcb-5761-4fa1-8287-19f4ee2877ac')

The nice thing is that all of uuid’s functions produce an instance of the UUID class, which encapsulates the ID and has properties like .int, .bytes, and .hex:

Python

>>> tok = uuid.uuid4()
>>> tok.bytes
b'.\xb7\x80\xfd\xbfIG\xb3\xae\x1d\xe3\x97\xee\xc5\xd5\x81'
>>> len(tok.bytes)
16
>>> len(tok.bytes) * 8  # In bits
128
>>> tok.hex
'2eb780fdbf4947b3ae1de397eec5d581'
>>> tok.int
62097294383572614195530565389543396737

You may also have seen some other variations: uuid1(), uuid3(), and uuid5(). The key difference between these and uuid4() is that those three functions all take some form of input and therefore don’t meet the definition of “random” to the extent that a Version 4 UUID does:

uuid1() uses your machine’s host ID and current time by default. Because of the reliance on current time down to nanosecond resolution, this version is where UUID derives the claim “guaranteed uniqueness across time.”
uuid3() and uuid5() both take a namespace identifier and a name. The former uses an MD5 hash and the latter uses SHA-1.
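Because uuid3() and uuid5() hash their inputs, the same namespace and name always give the same identifier. Here’s a brief sketch of that behavior. The name 'realpython.com' is just an illustrative input:

```python
import uuid

# Same namespace and name always yield the same identifier
first = uuid.uuid5(uuid.NAMESPACE_DNS, 'realpython.com')
second = uuid.uuid5(uuid.NAMESPACE_DNS, 'realpython.com')
print(first == second)  # True

# uuid3() works the same way but hashes with MD5, so its result differs
md5_based = uuid.uuid3(uuid.NAMESPACE_DNS, 'realpython.com')
print(first.version, md5_based.version)  # 5 3
```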

uuid4(), conversely, is entirely pseudorandom (or random). It consists of getting 16 bytes via os.urandom(), converting this to a big-endian integer, and doing a number of bitwise operations to comply with the formal specification.
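To see roughly what those bitwise operations involve, here’s a sketch that builds a Version 4 UUID by hand from 16 random bytes. The bit positions follow the RFC 4122 layout as I understand it. The variable names are illustrative, and this is not a copy of the standard library’s code:

```python
import os
import uuid

raw = os.urandom(16)
as_int = int.from_bytes(raw, 'big')

# Overwrite the 4 version bits with 0100 (version 4)
as_int &= ~(0xF << 76)
as_int |= 4 << 76

# Overwrite the 2 variant bits with 10 (the RFC 4122 variant)
as_int &= ~(0x3 << 62)
as_int |= 0x2 << 62

manual = uuid.UUID(int=as_int)
print(manual.version, manual.variant == uuid.RFC_4122)  # 4 True

# The UUID constructor can apply the same masking for you
shortcut = uuid.UUID(bytes=raw, version=4)
print(shortcut == manual)  # True
```

Since 6 of the 128 bits are fixed this way, only 122 bits of each Version 4 UUID are actually random.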

Hopefully, by now you have a good idea of the distinction between different “types” of random data and how to create them. However, one other issue that might come to mind is that of collisions.

In this case, a collision would simply refer to generating two matching UUIDs. What is the chance of that? Well, it is technically not zero, but perhaps it is close enough: there are 2 ** 128 or 340 undecillion possible 128-bit values, and even after the version and variant bits are fixed, 2 ** 122 (about 5.3 undecillion) possible uuid4 values remain. So, I’ll leave it up to you to judge whether this is enough of a guarantee to sleep well.
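If you’d like a number to go with that judgment, the birthday approximation gives a rough estimate of the collision odds. This sketch assumes 122 random bits per Version 4 UUID (6 of the 128 bits are fixed for the version and variant). The function name `collision_probability()` is hypothetical:

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance that n random IDs contain at least one duplicate.

    Uses the birthday approximation 1 - exp(-n * (n - 1) / (2 * N)),
    where N = 2 ** random_bits is the number of possible values.
    """
    space = 2 ** random_bits
    return -math.expm1(-n * (n - 1) / (2 * space))


print(collision_probability(10 ** 9))   # One billion UUIDs
print(collision_probability(10 ** 15))  # One quadrillion UUIDs
```

Even at a billion identifiers, the estimated probability stays far below one in a quintillion.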

One common use of uuid is in Django, which has a UUIDField that is often used as a primary key in a model’s underlying relational database.


Why Not Just “Default to” `SystemRandom`?

In addition to the secure modules discussed here such as secrets, Python’s random module actually has a little-used class called SystemRandom that uses os.urandom(). (SystemRandom, in turn, is also used by secrets. It’s all a bit of a web that traces back to urandom().)
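Here’s a short sketch of how SystemRandom differs from the default generator in practice: it accepts a seed but ignores it, and it has no internal state to save. The variable names are illustrative:

```python
import random

sysrand = random.SystemRandom()

# Seeding is accepted but ignored, so results are not reproducible
sysrand.seed(444)
first = sysrand.random()
sysrand.seed(444)
second = sysrand.random()
print(first == second)  # Almost certainly False

# There is no internal state to save or restore
has_state = True
try:
    sysrand.getstate()
except NotImplementedError:
    has_state = False
print(has_state)  # False
```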

At this point, you might be asking yourself why you wouldn’t just “default to” this version? Why not “always be safe” rather than defaulting to the deterministic random functions that aren’t cryptographically secure?

I’ve already mentioned one reason: sometimes you want your data to be deterministic and reproducible for others to follow along with.

But the second reason is that CSPRNGs, at least in Python, tend to be meaningfully slower than PRNGs. Let’s test that with a script, timed.py, that compares the PRNG and CSPRNG versions of randint() using Python’s timeit.repeat():

Python

# timed.py

import random
import timeit

# The "default" random is actually an instance of `random.Random()`.
# The CSPRNG version uses `SystemRandom()` and `os.urandom()` in turn.
_sysrand = random.SystemRandom()

def prng() -> None:
    random.randint(0, 95)

def csprng() -> None:
    _sysrand.randint(0, 95)

setup = 'import random; from __main__ import prng, csprng'

if __name__ == '__main__':
    print('Best of 3 trials with 1,000,000 loops per trial:')

    for f in ('prng()', 'csprng()'):
        # Pass `repeat=3` explicitly so the run matches the message above
        best = min(timeit.repeat(f, setup=setup, repeat=3))
        print('\t{:8s} {:0.2f} seconds total time.'.format(f, best))

Now to execute this from the shell:

Shell

$ python3 ./timed.py
Best of 3 trials with 1,000,000 loops per trial:
        prng()   1.07 seconds total time.
        csprng() 6.20 seconds total time.

A timing difference of nearly 6x is certainly a valid consideration in addition to cryptographic security when choosing between the two.

Odds and Ends: Hashing

One concept that hasn’t received much attention in this tutorial is that of hashing, which can be done with Python’s hashlib module.

A hash is designed to be a one-way mapping from an input value to a fixed-size string that is virtually impossible to reverse engineer. As such, while the result of a hash function may “look like” random data, it doesn’t really qualify under the definition here.
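A brief sketch with hashlib shows why a hash only looks random: the same input always produces the same fixed-size digest. SHA-256 and the sample inputs are arbitrary choices for illustration:

```python
import hashlib

digest_a = hashlib.sha256(b'random data').hexdigest()
digest_b = hashlib.sha256(b'random data').hexdigest()
digest_c = hashlib.sha256(b'random data!').hexdigest()

print(digest_a == digest_b)  # True: hashing is deterministic
print(digest_a == digest_c)  # False: a small change alters the digest
print(len(digest_a))         # 64 hex characters, whatever the input size
```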

Recap

You’ve covered a lot of ground in this tutorial. To recap, here is a high-level comparison of the options available to you for engineering randomness in Python:

Package/Module	Description	Cryptographically Secure
`random`	Fast & easy random data using Mersenne Twister	No
`numpy.random`	Like `random` but for (possibly multidimensional) arrays	No
`os`	Contains `urandom()`, the base of other functions covered here	Yes
`secrets`	Designed to be Python’s de facto module for generating secure random numbers, bytes, and strings	Yes
`uuid`	Home to a handful of functions for building 128-bit identifiers	Yes, `uuid4()`

Feel free to leave some totally random comments below, and thanks for reading.

Additional Links

Random.org offers “true random numbers to anyone on the Internet” derived from atmospheric noise.
The Recipes section from the random module has some additional tricks.
The seminal paper on the Mersenne Twister appeared in 1997, if you’re into that kind of thing.
The Itertools Recipes define functions for choosing randomly from a combinatoric set, such as from combinations or permutations.
Scikit-Learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity.
Eli Bendersky digs into random.randint() in his article Slow and Fast Methods for Generating Random Integers in Python.
Peter Norvig’s A Concrete Introduction to Probability (using Python) is a comprehensive resource as well.
The Pandas library includes a context manager that can be used to set a temporary random state.

About Brad Solomon

Brad is a software engineer and a member of the Real Python Tutorial Team.


Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Adriana

Geir Arne

Joanna
