Python Scripts as a Replacement for Bash Utility Scripts

on January 16, 2013

For Linux users, the command line is a celebrated part of our entire experience. Unlike other popular operating systems, where the command line is a scary proposition for all but the most experienced veterans, in the Linux community, command-line use is encouraged. Often the command line can provide a more elegant and efficient solution when compared to doing a similar task with a graphical user interface.

As the Linux community has grown up with a dependence on the command line, UNIX shells, such as bash and zsh, have grown into extremely formidable tools that complement the UNIX shell experience. With bash and other similar shells, a number of powerful features are available, such as piping, filename wild-carding and the ability to read commands from a file called a script.

Let's look at a real-world example to demonstrate the power of the command line. Every time users log in to a service, their user names are logged to a text file. For this example, let's find out how many unique users use the service.

The series of commands in the following example shows how a more complex utility can be built by chaining together smaller building blocks:

$ cat names.log | sort | uniq | wc -l

The pipe symbol (|) is used to pass the standard output of one command into the standard input of the next command. In the example here, the output of cat names.log is passed into the sort command. The output of the sort command is each line of the file rearranged in alphabetical order. This subsequently is piped into the uniq command, which removes any duplicate names (uniq collapses only adjacent duplicates, which is why the input is sorted first). Finally, the output of uniq is passed to the wc command. wc is a counting command, and with the -l flag set, it returns the number of lines. This allows you to chain a number of commands together.
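For comparison, the same count can be sketched in a few lines of Python. This is only an illustration of what the sort, uniq and wc -l steps compute together; the function name count_unique and the sample data are my own assumptions, not part of the original example:

```python
def count_unique(lines):
    """Return the number of distinct lines, like `sort | uniq | wc -l`."""
    # A set keeps one copy of each line, so no sorting is needed.
    # Only the trailing newline is removed, mirroring what uniq compares.
    return len({line.rstrip("\n") for line in lines})


# Hypothetical log contents standing in for names.log.
sample = ["alice\n", "bob\n", "alice\n", "carol\n"]
print(count_unique(sample))  # 3
```

Passing sys.stdin (or an open file) as the lines argument would apply the same count to real input.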

However, sometimes what is needed can become quite complex, and chaining commands together can become unwieldy. In that case, shell scripts are the answer. A shell script is a list of commands that are read by the shell and executed in order. Shell scripts also support some programming language fundamentals, such as variables, flow control and data structures. Shell scripts can be very useful for batch jobs that will be run often and repeatedly. Unfortunately, shell scripts come with some disadvantages:

  • Shell scripts easily can become overly complicated and unreadable to a developer wanting to improve or maintain them.

  • Often the syntax and interpreter for these shell scripts can be awkward and unintuitive. The more awkward the syntax, the less readable it is for the developer who must work with these scripts.

  • The code is generally unusable in other scripts. Code reuse among scripts tends to be difficult, and scripts tend to be very specific to a certain problem.

  • Libraries for advanced features, such as HTML parsing or HTTP requests, are not as easily available as they are with modern programming and scripting languages.

These problems can make shell scripting an awkward undertaking and often can lead to a lot of wasted developer time. Instead, the Python programming language can be used as a very able replacement. There are many benefits to using Python as a replacement for shell scripts:

  • Python is installed by default on all the major Linux distributions. Opening a command line and typing python immediately will drop you into a Python interpreter. This ubiquity makes it a sensible choice for most scripting tasks.

  • Python has a very easy to read and understand syntax. Its style emphasizes minimalism and clean code while allowing the developer to write in a bare-bones style that suits shell scripting.

  • Python is an interpreted language, meaning there is no compile stage. This makes Python an ideal language for scripting. Python also comes with a Read Eval Print Loop, which allows you to try out new code quickly in an interpreted way. This lets the developer tinker with ideas without having to write the full program out into a file.

  • Python is a fully featured programming language. Code reuse is simple, because Python modules easily can be imported and used in any Python script. Scripts easily can be extended or built upon.

  • Python has access to an excellent standard library and thousands of third-party libraries for all sorts of advanced utilities, such as parsers and request libraries. For instance, Python's standard library includes datetime libraries that allow you to parse dates into any format that you specify and compare them to other dates easily.

  • Python can be a simple link in the chain. Python should not replace all the bash commands. It is as powerful to write Python programs that behave in a UNIX fashion (that is, read in standard input and write to standard output) as it is to write Python replacements for existing shell commands, such as cat and sort.
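As a small illustration of the datetime point above, here is a minimal sketch of parsing and comparing two timestamps. The timestamp format and the helper name parse_timestamp are assumptions made for the example; a real log may use a different layout:

```python
from datetime import datetime


def parse_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string using an assumed log format."""
    return datetime.strptime(text, fmt)


first = parse_timestamp("2013-01-16 09:30:00")
second = parse_timestamp("2013-01-17 08:00:00")

# datetime objects compare chronologically and subtract to a timedelta.
print(first < second)   # True
print(second - first)   # 22:30:00
```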

Let's build on the problem that was solved earlier in this article. Besides the work already done, let's find out how many times a certain user has logged in to the system. The uniq command simply removes duplicates but gives no information on how many duplicates there are. Instead of uniq, a Python script can be used as another command in the chain. Here's a Python program to do this (in my examples, I refer to this file as namescount.py):

#!/usr/bin/env python
import sys

if __name__ == "__main__":
    # Initialize a names dictionary as empty to start with.
    # Each key in this dictionary will be a name and the value
    # will be the number of times that name appears.
    names = {}
    # sys.stdin is a file object. All the same functions that
    # can be applied to a file object can be applied to sys.stdin.
    # Iterating over it yields one line at a time.
    for name in sys.stdin:
        # Each line will have a newline on the end
        # that should be removed.
        name = name.strip()
        if name in names:
            names[name] += 1
        else:
            names[name] = 1
    # Iterating over the dictionary, print the number of times
    # each name appeared, followed by a tab, followed by the name.
    for name, count in names.items():
        sys.stdout.write("%d\t%s\n" % (count, name))

Let's look at how this Python script fits into the chain of commands. First, it reads in input from standard input exposed through the sys.stdin object. Any output is written to the sys.stdout object, which is how standard output is implemented in Python. A Python dictionary (often called a hash map in other languages) is used to get a mapping from the user name to the duplicate count. To get a count of all the users, execute the following:

$ cat names.log | python namescount.py

This displays a count of how many times a user appears along with the user's name using a tab as a separator. The next thing to do is display, in order, the users who used the system most often. This can be done at the Python level, but let's implement it using the utilities that are already provided by the core UNIX utilities. Previously, I used the sort command to sort alphabetically. If the command is provided with a -rn flag, it sorts the lines numerically, in descending order. As the Python script prints to standard out, you simply can pipe the command into sort and retrieve the output you want:

$ cat names.log | python namescount.py | sort -rn

This is an example of the power of using Python as part of a chain of commands. The advantages of using Python in this scenario are as follows:

  • The ability to chain with tools like cat and sort. Simple utilities (reading a file line by line and sorting a file numerically) are handled by tried-and-trusted UNIX commands. These commands also are reading line by line, which means these functions can scale to files that are large in size, and they are very quick.

  • When some heavy-lifting is needed in the chain, a very clear, concise Python script can be written, which does what it needs to do and then offloads the responsibility to the next link in the chain.

  • It is a reusable module. Although this example is specifically about names, if you feed this any input that contains duplicate lines, it will print out each line and the number of duplicates. Making the Python code modular allows you to apply it in a range of scenarios.

To demonstrate the power of combining Python scripts in a modular and piped fashion, let's expand further on the problem space. Let's find the top five users of the service. head is a command that allows you to specify a certain number of lines to display of the standard input it is given. Adding this to the command chain gives the following:

$ cat names.log | python namescount.py | sort -rn | head -n 5

This prints only the top five users and ignores the rest. Similarly, to get the five users who use the service least, you can use the tail command, which takes the same arguments. The result of the Python command being printed to standard output allows you to build and extend upon its functionality.
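As mentioned earlier, the sorting and truncation handled here by sort and head also could be done at the Python level. The following is a minimal sketch of that alternative using collections.Counter from the standard library; the function name top_users and the sample names are assumptions for illustration, and ties may come out in a different order than sort -rn would give:

```python
from collections import Counter


def top_users(lines, limit=None):
    """Return (name, count) pairs, most frequent first.

    With limit=None every name is returned; otherwise only the
    first `limit` entries, much like piping into `head -n limit`.
    """
    counts = Counter(line.strip() for line in lines)
    return counts.most_common(limit)


# Hypothetical log contents.
sample = ["alice\n", "bob\n", "alice\n", "carol\n", "bob\n", "alice\n"]
for name, count in top_users(sample, 2):
    print("%d\t%s" % (count, name))
```

Keeping the counting in Python and the sorting in sort, as the article does, has the advantage that each piece stays reusable on its own.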

To demonstrate the modularity of this script, let's once again change the problem space. The service also generates a comma-separated value (CSV) log file that contains a list of e-mail addresses and the comments that each e-mail address made about the service. Here's an example entry:

"email@example.com", "This service is great."

The task is to provide a way for the service to send a thank-you message to the top ten users in terms of comment frequency. First, you need a script that can read and print a certain column of CSV data. The standard library of Python provides a CSV reader. The Python script below completes this goal:

#!/usr/bin/env python
# CSV module that comes with the Python standard library
import csv
import sys

if __name__ == "__main__":
    # The CSV module exposes a reader object that takes
    # a file object to read. In this example, sys.stdin.
    csvfile = csv.reader(sys.stdin)
    # The script should take one argument that is a column number.
    # Command-line arguments are accessed via sys.argv list.
    column_number = 0
    if len(sys.argv) > 1:
        column_number = int(sys.argv[1])
    # Each row in the CSV file is a list with each
    # comma-separated value for that line.
    for row in csvfile:
        print(row[column_number])

This script can parse the CSV data and return in plain text the column that is supplied as a command-line argument. It uses print instead of sys.stdout.write, as print, by default, uses standard out as its output file.
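The same column extraction can be tried without a log file by feeding the reader an in-memory string. The sketch below is a hedged variation rather than the script above: the function name extract_column is my own, rows that are too short are skipped, and skipinitialspace=True is added on the assumption that entries have a space after the comma, as in the example entry:

```python
import csv
import io


def extract_column(lines, column_number=0):
    """Yield one column from CSV input, skipping rows that are too short."""
    # skipinitialspace=True ignores the space that follows each comma,
    # so a quoted second field is still recognized as quoted.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) > column_number:
            yield row[column_number]


# Hypothetical log contents in the format shown earlier.
sample = io.StringIO(
    '"email@example.com", "This service is great."\n'
    '"other@example.com", "Works well, mostly."\n'
)
for value in extract_column(sample, 0):
    print(value)
```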

Let's add this script to the chain. The new script is chained with the others to print out a list of e-mail addresses and their comment frequencies using the command listed below (the .csv log file is assumed to be called emailcomments.csv and the new Python script, csvcolumn.py):

$ cat emailcomments.csv | python csvcolumn.py | python namescount.py | sort -rn | head -n 5

Next, you need a way to send an e-mail. In the Python standard library of functions, you can import smtplib, which is a module that allows you to connect to an SMTP server to send mail. Let's write a simple Python script that uses this library to send a message to each of the top ten e-mail addresses found already:

#!/usr/bin/env python
import smtplib
import sys

GMAIL_SMTP_SERVER = "smtp.gmail.com"
GMAIL_SMTP_PORT = 587
GMAIL_EMAIL = "Your Gmail Email Goes Here"
GMAIL_PASSWORD = "Your Gmail Password Goes Here"

def initialize_smtp_server():
    '''
    This function initializes and greets the smtp server.
    It logs in using the provided credentials and returns
    the smtp server object as a result.
    '''
    smtpserver = smtplib.SMTP(GMAIL_SMTP_SERVER, GMAIL_SMTP_PORT)
    smtpserver.ehlo()
    smtpserver.starttls()
    smtpserver.ehlo()
    smtpserver.login(GMAIL_EMAIL, GMAIL_PASSWORD)
    return smtpserver

def send_thank_you_mail(email):
    to_email = email
    from_email = GMAIL_EMAIL
    subj = "Thanks for being an active commenter"
    # The header consists of the To and From and Subject lines
    # separated using a newline character
    header = "To:%s\nFrom:%s\nSubject:%s \n" % (to_email,
            from_email, subj)
    # Hard-coded templates are not best practice.
    msg_body = """
    Hi %s,

    Thank you very much for your repeated comments on our service.
    The interaction is much appreciated.

    Thank You.""" % email
    content = header + "\n" + msg_body
    smtpserver = initialize_smtp_server()
    smtpserver.sendmail(from_email, to_email, content)
    smtpserver.close()

if __name__ == "__main__":
    # for every line of input.
    for email in sys.stdin:
        # Remove the trailing newline and skip blank lines.
        email = email.strip()
        if email:
            send_thank_you_mail(email)

This Python script supports contacting any SMTP server, whether local or remote. For ease of use, I have included Gmail's SMTP server, and it should work, provided you give the scripts the correct Gmail credentials. The script uses the functions provided to send mail in smtplib. This again demonstrates the power of using Python at this level. Something like SMTP interaction is easy and readable in Python. Equivalent shell scripts are messy, and such libraries are not as easily accessible, if they exist at all.
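The script above assembles the message headers by hand. As an alternative, the standard library's email.message.EmailMessage class (Python 3.6 and later) can build the same message, and the result can be handed to an SMTP connection with its send_message method. The sketch below only constructs the message and sends nothing; the function name build_thank_you and the addresses are placeholders of my own:

```python
from email.message import EmailMessage


def build_thank_you(to_email, from_email):
    """Build the thank-you message without hand-formatting the headers."""
    # Strip the newline left over from reading the address off stdin.
    to_email = to_email.strip()
    msg = EmailMessage()
    msg["To"] = to_email
    msg["From"] = from_email
    msg["Subject"] = "Thanks for being an active commenter"
    msg.set_content(
        "Hi %s,\n\n"
        "Thank you very much for your repeated comments on our service.\n"
        "The interaction is much appreciated.\n\n"
        "Thank You.\n" % to_email
    )
    return msg


# Placeholder addresses; nothing is sent in this sketch.
message = build_thank_you("email@example.com\n", "service@example.com")
print(message["To"])
print(message["Subject"])
```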

In order to send the e-mails to the top ten users sorted by comment frequency, first you must isolate only the e-mail column of the output, which pairs each count with an address. To isolate a certain column in Linux, you use the cut command (here, -f2 selects the second tab-separated field). In the example below, the commands are given in two separate chains. For ease of use, I wrote the output into a temporary file, which can be loaded into the second chain. This simply makes the process more readable (the Python script for sending mail is referred to as sendemail.py):

$ cat emailcomments.csv | python csvcolumn.py | python namescount.py | sort -rn > /tmp/comment_freq
$ cat /tmp/comment_freq | head -n 10 | cut -f2 | python sendemail.py

This shows the real power of Python as a utility in a chain of bash commands such as this. Writing scripts that accept input from standard input and write any data out to standard out allows the developer to chain commands such as these together quickly and easily with a link in the chain often being a Python program. This philosophy of designing a small application that services one purpose fits nicely with the flow of commands being used here.

Often in Python scripts that are used on the command line, arguments are used to give users options when they run a certain command. For instance, the head command takes a -n argument that takes the number following it and prints only that number of lines. Each argument that is provided to a Python script is exposed through the sys.argv list, which can be accessed by first importing sys. The code below shows how to take single words as arguments. This program is a simple adder, which takes two number arguments and adds them, and prints that out to the user. However, this format of taking in command-line arguments is rather basic. It is easy to make mistakes; for instance, pass two strings, such as hello and world, to this command, and you will start to get errors:

#!/usr/bin/env python
import sys

if __name__ == "__main__":
    # The first argument of sys.argv is always the filename,
    # meaning that the length of system arguments will be
    # more than one, when command-line arguments exist.
    if len(sys.argv) > 2:
        num1 = int(sys.argv[1])
        num2 = int(sys.argv[2])
    else:
        print("This command takes two arguments and adds them")
        print("Fewer than two arguments given.")
        sys.exit(1)
    print("%s" % str(num1 + num2))
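The errors mentioned above come from int() raising a ValueError when it is handed text such as hello. One way to guard against that by hand is sketched below; the function name add_arguments and the wording of the messages are my own assumptions, not part of the adder above:

```python
def add_arguments(argv):
    """Add the two numbers in argv[1] and argv[2], exiting with a
    readable message instead of a traceback on bad input."""
    if len(argv) != 3:
        raise SystemExit("This command takes exactly two arguments.")
    try:
        return int(argv[1]) + int(argv[2])
    except ValueError:
        raise SystemExit("Both arguments must be whole numbers.")


# argv is passed in explicitly so the function can be tried
# without a real command line; a script would pass sys.argv.
print(add_arguments(["adder.py", "2", "3"]))  # 5
```

Hand-written checks like this grow quickly as options are added.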

Thankfully, Python has a number of modules to deal with command-line arguments. My personal favorite is OptionParser. OptionParser is part of the optparse module that is provided by the standard library. OptionParser allows you to do a range of very useful things with command-line arguments:

  • Specify a default if a certain argument is not provided.

  • It supports both argument flags (either present or not) and arguments with values (-n 10000).

  • It supports different formats of passing arguments; for example, a long option accepts both --smtp-port=587 and --smtp-port 587.

Let's use the OptionParser to enhance the sending-mail script. The original script had a lot of variables hard-coded into place, such as the SMTP details and the users' login credentials. In the code provided below, command-line arguments are used to pass in these variables:

#!/usr/bin/env python
import smtplib
import sys
from optparse import OptionParser

def initialize_smtp_server(smtpserver, smtpport, email, pwd):
    '''
    This function initializes and greets the SMTP server.
    It logs in using the provided credentials and returns the
    SMTP server object as a result.
    '''
    smtpserver = smtplib.SMTP(smtpserver, smtpport)
    smtpserver.ehlo()
    smtpserver.starttls()
    smtpserver.ehlo()
    smtpserver.login(email, pwd)
    return smtpserver

def send_thank_you_mail(email, from_email, smtpserver):
    to_email = email
    subj = "Thanks for being an active commenter"
    # The header consists of the To and From and Subject lines
    # separated using a newline character.
    header = "To:%s\nFrom:%s\nSubject:%s \n" % (to_email,
            from_email, subj)
    # Hard-coded templates are not best practice.
    msg_body = """
    Hi %s,

    Thank you very much for your repeated comments on our service.
    The interaction is much appreciated.

    Thank You.""" % email
    content = header + "\n" + msg_body
    smtpserver.sendmail(from_email, to_email, content)

if __name__ == "__main__":
    usage = "usage: %prog [options]"
    parser = OptionParser(usage=usage)
    parser.add_option("--email", dest="email",
            help="email to login to smtp server")
    parser.add_option("--pwd", dest="pwd",
            help="password to login to smtp server")
    parser.add_option("--smtp-server", dest="smtpserver",
            help="smtp server url", default="smtp.gmail.com")
    parser.add_option("--smtp-port", dest="smtpserverport", type="int",
            help="smtp server port", default=587)
    options, args = parser.parse_args()
    # Both the e-mail and the password are required.
    if not (options.email and options.pwd):
        parser.error("Must provide both an email and a password")
    smtpserver = initialize_smtp_server(options.smtpserver,
            options.smtpserverport, options.email, options.pwd)
    # for every line of input.
    for email in sys.stdin:
        # Remove the trailing newline and skip blank lines.
        email = email.strip()
        if email:
            # The login e-mail doubles as the From address.
            send_thank_you_mail(email, options.email, smtpserver)
    smtpserver.close()

This script shows the usefulness of OptionParser. It provides a simple, easy-to-use interface for command-line arguments, allowing you to define certain properties for each command-line option. It also allows you to specify default values. If certain arguments are not provided, it allows you to throw specific errors.
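OptionParser can be tried out without touching a real command line, because parse_args accepts an explicit list of arguments. The sketch below reuses two options from the script and adds a --dry-run flag, which is purely hypothetical and included only to show an option that is either present or not:

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with a default, a typed value and a boolean flag."""
    parser = OptionParser(usage="usage: %prog [options]")
    parser.add_option("--smtp-server", dest="smtpserver",
                      help="smtp server url", default="smtp.gmail.com")
    # type="int" converts the value and rejects non-numeric input.
    parser.add_option("--smtp-port", dest="smtpserverport", type="int",
                      help="smtp server port", default=587)
    # Hypothetical flag, not part of the original script.
    parser.add_option("--dry-run", dest="dry_run", action="store_true",
                      default=False, help="parse input but send nothing")
    return parser


# Passing a list stands in for sys.argv[1:].
options, args = build_parser().parse_args(["--smtp-port=2525", "--dry-run"])
print(options.smtpserver, options.smtpserverport, options.dry_run)
```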

So what have you learned? Instead of replacing a series of bash commands with one Python script, it often is better to have Python do only the heavy lifting in the middle. This allows for more modular and reusable scripts, while also tapping into the power of all that Python offers. Using stdin as a file object allows Python to read input, which is piped to it from other commands, and writing to stdout allows it to continue passing the information through the piping system. Combining information like this can make for some very powerful programs. The examples I have given here are all for a fictional service that logs to a file.

As a real-world example, recently I have been working with gigabytes of CSV files that I have been converting using a Python script to a file that contains SQL commands to insert the information. To understand the sort of data I'm concerned with here, I ran the data for a single table, and the script took 23 hours to execute and generated an SQL file that was 20GB in size. The advantage of using a Python script in the fashion described in this article is that the whole file does not need to be read into memory. This means that an entire 20GB+ file can be processed one line at a time. Also it is easier to think about a problem when each step (reading, sorting, manipulation and writing) is separated into these logical steps. The guarantee that each of these commands, which are part of the core utilities of a UNIX-like environment, is efficient and stable helps the entire experience to be more stable and secure.
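The conversion script itself is not shown here, so the following is only a rough sketch of the line-at-a-time idea. The table name comments, the function name rows_to_inserts and the simple quote-doubling are all assumptions for illustration; a real import would more likely use a database driver with parameterized queries:

```python
import csv


def rows_to_inserts(lines, table="comments"):
    """Lazily turn CSV lines into SQL INSERT statements.

    This is a generator: each statement is produced as its row is
    read, so the whole input never has to sit in memory.
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Double any single quotes so they stay inside the SQL string.
        values = ", ".join("'%s'" % field.replace("'", "''") for field in row)
        yield "INSERT INTO %s VALUES (%s);" % (table, values)


# Hypothetical input; a script would pass sys.stdin instead of a list.
for statement in rows_to_inserts(["email@example.com,It's great\n"]):
    print(statement)
```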

The other benefit is that there is no hard-coded file that is read in. Often having the flexibility to pass it strings rather than the concept of files is very powerful. For instance, if the script breaks 20,000 lines into a certain file, instead of re-running the script from the start, tail can be used to read only from the line on which the script failed.
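The same resume-from-a-line idea can be expressed inside Python with itertools.islice, which skips the leading lines without loading them all at once. A minimal sketch, with the function name resume_from being my own choice:

```python
from itertools import islice


def resume_from(lines, start_line):
    """Yield lines starting at the 1-based start_line, skipping
    everything before it, similar to `tail -n +start_line`."""
    return islice(lines, start_line - 1, None)


# Hypothetical input; sys.stdin or an open file works the same way.
sample = ["first\n", "second\n", "third\n", "fourth\n"]
for line in resume_from(sample, 3):
    print(line.strip())
```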

There are a lot of aspects to Python in the shell that go beyond the scope of this article, such as the os module and the subprocess module. The os module is a standard library module that holds a lot of key operating system-level operations, such as listing directories and stating files, along with an excellent submodule os.path that deals with normalizing directory paths. The subprocess module allows Python programs to run system commands and other advanced operations, such as handling piping as described above within Python code between spawned processes. Both of these libraries are worth checking out if you intend to do any Python shell scripting.
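As a brief taste of both modules, here is a minimal sketch that normalizes a path with os.path and runs a child process with subprocess, capturing its standard output. The helper names describe_path and run_child are assumptions, the child process is simply another Python interpreter so the example does not depend on any particular system command, and subprocess.run with capture_output requires Python 3.7 or later:

```python
import os
import subprocess
import sys


def describe_path(path):
    """Normalize a path and split it into (file name, directory)."""
    normalized = os.path.normpath(path)
    return os.path.basename(normalized), os.path.dirname(normalized)


def run_child(code):
    """Run a snippet in a child Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise if the child exits non-zero
    )
    return result.stdout


print(describe_path(os.path.join("logs", "..", "data", "names.log")))
print(run_child("print('hello from a child process')").strip())
```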
