Python has support for working with databases via a simple API. Modules included with Python include modules forSQLite andBerkeley DB. Modules forMySQL ,PostgreSQL ,FirebirdSQL and others are available as third-party modules. The latter have to be downloaded and installed before use.The package MySQLdb can be installed, for example, using the Debian package "python-mysqldb".
An Example with MySQL would look like this:
importMySQLdbdb=MySQLdb.connect("host machine","dbuser","password","dbname")cursor=db.cursor()query="""SELECT * FROM sampletable"""lines=cursor.execute(query)data=cursor.fetchall()db.close()
On the first line, theModuleMySQLdb is imported. Then a connection to the database is set up and on line 4, we save the actual SQL statement to be executed in the variablequery. On line 5 we execute the query and on line 6 we fetch all the data.After the execution of this piece of code,lines contains the number of lines fetched (e.g. the number of rows in the tablesampletable). The variabledata contains all the actual data, e.g. the content ofsampletable. In the end, the connection to the database would be closed again. If the number of lines are large, it is better to userow = cursor.fetchone() and process the rows individually:
#first 5 lines are the same as abovewhileTrue:row=cursor.fetchone()ifrow==None:break#do something with this row of datadb.close()
Obviously, some kind of data processing has to be used on a row, otherwise the data will not be stored. The result of thefetchone() command is aTuple.
In order to make the initialization of the connection easier, a configuration file can be used:
importMySQLdbdb=MySQLdb.connect(read_default_file="~/.my.cnf")...
Here, the file .my.cnf in the home directory contains the necessary configuration information for MySQL.
An example with SQLite is very similar to the one above and the cursor providesmany of the same functionalities.
importsqlite3db=sqlite3.connect("/path/to/file")cursor=db.cursor()query="""SELECT * FROM sampletable"""lines=cursor.execute(query)data=cursor.fetchall()db.close()
When writing to the db, one has to remember to call db.commit(),otherwise the changes are not saved:
importsqlite3db=sqlite3.connect("/path/to/file")cursor=db.cursor()query="""INSERT INTO sampletable (value1, value2) VALUES (1,'test')"""cursor.execute(query)db.commit()db.close()
importpsycopg2conn=psycopg2.connect("dbname=test")cursor=conn.cursor()cursor.execute("select * from test");foriincursor.next():print(i)conn.close()
importfirebirdsqlconn=firebirdsql.connect(dsn='localhost/3050:/var/lib/firebird/2.5/test.fdb',user='alice',password='wonderland')cur=conn.cursor()cur.execute("select * from baz")forcincur.fetchall():print(c)conn.close()
You will frequently need to substitute dynamic data into a query string.It is important to ensure this is done correctly.
# Do not do this!result=db.execute("SELECT name FROM employees WHERE location = '"+location+"'")
This example iswrong, because it doesn’t correctly deal with special characters, like apostrophes, in the string being substituted. If your code has to deal with potentially hostile users (like on a public-facing Web server), this could leave you open to anSQL injection attack.
For simple cases, use the automatic parameter substitution provided by theexecute
method, e.g.
result=db.execute("SELECT name FROM employees WHERE location = ?",[location])
The DBMS interface itself will automatically convert the values you pass into the correct SQL syntax.
For more complex cases, the DBMS module should provide a quoting function that you can explicitly call. For example, MySQLdb provides theescape_string
method, while APSW (for SQLite3) providesformat_sql_value
. This is necessary where the query structure takes a more dynamic form:
criteria=[("company",company)]# list of tuples (fieldname, value)ifdepartment!=None:criteria.append(("department",department))# ... append other optional criteria as appropriate ...result=db.execute("SELECT name FROM employees WHERE "+" and ".join("%s =%s"%(criterion[0],MySQLdb.escape_string(criterion[1]))forcriterionincriteria))
This will dynamically construct queries like “select name from employees where company = 'some company'” or “select name from employees where company = 'some company' and department = 'some department'”, depending on which fields have been filled in by the user.
Python iterators are a natural fit for the problem of iterating over lots of database records. Here is an example of a function that performs a database query and returns an iterator for the results, instead of returning them all at once. It relies on the fact that, in APSW (the Python 3 interface library for SQLite), thecursor.execute
method itself returns an iterator for the result records. The result is that you can write very concise code for doing complex database queries in Python.
defdb_iter(db,cmd,mapfn=lambdax:x):"executes cmd on a new cursor from connection db and yields the results in turn."cu=db.cursor()result=cu.execute(cmd)whileTrue:yieldmapfn(next(result))
Example uses of this function:
forartist,publisherindb_iter(db=db,cmd="SELECT artist, publisher FROM artists WHERE location =%s"%apsw.format_sql_value(location)):print(artist,publisher)
and
forlocationindb_iter(db=db,cmd="SELECT DISTINCT location FROM artists",mapfn=lambdax:x[0]):print(location)
In the first example, sincedb_iter
returns a tuple for each record, this can be directly assigned to individual variables for the record fields. In the second example, the tuple has only one element, so a custommapfn
is used to extract this element and return it instead of the tuple.
Database table definitions are frequently subject to change. As application requirements evolve, fields and even entire tables are often added, or sometimes removed. Consider a statement like
result=db.execute("select * from employees")
You may happen to know that theemployees table currently contains, say, 4 fields. But tomorrow someone may add a fifth field. Did you remember to update your code to deal with this? If not, it’s liable to crash. Or even worse, produce an incorrect result!
Better to always list thespecific fields you’re interested in, no matter how many there are:
result=db.execute("select name, address, department, location from employees")
That way, any extra fields added will simply be ignored. And if any of the named fields are removed, the code will at least fail with a runtime error, which is a good reminder that you forgot to update it!
Consider the following scenario: your sales company database has a table of employees, and also a table of sales made by each employee. You want to loop over these sale entries, and produce some per-employee statistics. A naïve approach might be:
If you have a lot of employees, then the first query may produce a large list, and the second step will involve a correspondingly large number of database queries.
In fact, the entire processing loop can run off asingle database query, using the standard SQL construct called ajoin
.
Note: SQL programming is a specialty skill in its own right. To learn more about this, start with theWikipedia article.
Here is what an example of such a loop could look like:
rows=db_iter \(db=db,cmd="select employees.name, sales.amount, sales.date from"" employees left join sales on employees.id = sales.employee_id"" order by employees.name, sales.date")prev_employee_name=NonewhileTrue:row=next(rows,None)ifrow!=None:employee_name,amount,date=rowifrow==Noneoremployee_name!=prev_employee_name:ifprev_employee_name!=None:# done stats for this employeereport(prev_employee_name,employee_stats)ifrow==None:break# start stats for a new employeeprev_employee_name=employee_nameemployee_stats={"total_sales":0,"number_of_sales":0}ifdate!=None:employee_stats["earliest_sale"]=date# another row of stats for this employeeifamount!=None:employee_stats["total_sales"]+=amountemployee_stats["number_of_sales"]+=1ifdate!=None:employee_stats["latest_sale"]=date
Here the statistics are quite simple: earliest and latest sale, and number and total amount of sales, and could be computed directly within the SQL query. But the same loop could compute more complex statistics (like standard deviation) that cannot be represented directly within a simple SQL query.
Note how the statistics for each employee are written out under either of two conditions:
Both conditions are tested withrow == None or employee_name != prev_employee_name
; after writing out the employee statistics, a separate check for the second conditionrow == None
is used to terminate the loop. If the loop doesn’t terminate, then processing is initialized for the new employee.
Note also the use of aleft join
in this case: if an employee has had no sales, then the join will return a single row for that employee, with SQLnull
values (represented byNone
in Python) for the fields from thesales table. This is why we need checks for suchNone
values before processing those fields.
Alternatively, we could have used aninner join
, which would have returnedno results for an employee with no sales. Whether you want to omit such an employee from your report, or include them with totals of zero, is really up to your application.