Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to just read the HTTP response header without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?

    import urllib2

    myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
    page = urllib2.urlopen(myurl)  # open connection, get headers
    html = page.readlines()  # stream page

asked May 9, 2009 at 14:11 by tsn

6 Answers

Use the response.info() method to get the headers.

From the urllib2 docs:

urllib2.urlopen(url[, data][, timeout])

...

This function returns a file-like object with two additional methods:

  • geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
  • info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

So, for your example, try stepping through the result of response.info().headers for what you're looking for.
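
For readers on Python 3, here is a minimal sketch of the same idea, assuming urllib.request in place of urllib2. The function name fetch_headers and the timeout default are made up for illustration. Note that this still sends a GET request; it only avoids calling read() on the body.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Return (status, headers) for url without calling read() on the body.

    fetch_headers is a hypothetical helper name. The request is still a GET;
    closing the response early only skips consuming the body on the client.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers is an http.client.HTTPMessage; dict() turns it
        # into a plain mapping of header name to value.
        return response.status, dict(response.headers)
```

Calling fetch_headers(myurl) with the URL from the question should give back the status code and a dictionary of headers, which you can then step through.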

Note the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.

answered Oct 29, 2009 at 0:17 by tolmeda

5 Comments

Python 3 note: first, there is no response.info().headers; use dict(response.info()) instead. Second, for the HTTP status code use response.status.
Does this only get the header, or only print the header?
Where is headers documented? Also consider using response.info().items(), which returns the headers as key-value pairs.
Python 2 note: this is what you want: response.info().getheader('Content-Type'). Source: stackoverflow.com/questions/1653591/…
Actually, for Python 3, response.headers will do; for more info see http.client.HTTPResponse.

What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.

    >>> import httplib
    >>> conn = httplib.HTTPConnection("www.google.com")
    >>> conn.request("HEAD", "/index.html")
    >>> res = conn.getresponse()
    >>> print res.status, res.reason
    200 OK
    >>> print res.getheaders()
    [('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]

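
On Python 3, httplib was renamed to http.client. A rough equivalent of the session above might look like the sketch below; the helper name head_request and its defaults are hypothetical, not part of any library.

```python
import http.client


def head_request(host, path="/", port=80, timeout=10):
    """Send a single HEAD request and return (status, reason, headers).

    head_request is a hypothetical helper name used for illustration.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # getheaders() returns a list of (name, value) tuples.
        return res.status, res.reason, res.getheaders()
    finally:
        conn.close()
```

For example, head_request("www.google.com", "/index.html") would mirror the interactive session above. No redirects are followed at this level, so a 3xx status comes back as-is.
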
answered May 9, 2009 at 14:17 by reto


Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to above shows how to get urllib2 to do a HEAD request.

Here's my take on it:

    import urllib2

    # Derive from Request class and override get_method to allow a HEAD request.
    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"

    myurl = 'http://bit.ly/doFeT'
    request = HeadRequest(myurl)

    try:
        response = urllib2.urlopen(request)
        response_headers = response.info()

        # This will just display all the dictionary key-value pairs.  Replace this
        # line with something useful.
        print (response_headers.dict)

    except urllib2.HTTPError as e:
        # Prints the HTTP status code of the response but only if there was a
        # problem.
        print ("Error code: %s" % e.code)

If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark:

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly
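
As a side note for Python 3 readers: there the subclass should not be needed, because urllib.request.Request accepts a method argument (Python 3.3 and later). A minimal sketch, with head_headers being a name I made up:

```python
import urllib.request


def head_headers(url, timeout=10):
    """Send a HEAD request via urllib.request; return (status, headers).

    head_headers is a hypothetical helper name. Passing method="HEAD"
    replaces the get_method() override used with urllib2.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, dict(response.headers)
```

A capture of head_headers(myurl) should show a HEAD line like the one above for the first request.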

However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming, if you really wanted to only make HEAD requests.

The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7
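
If the follow-up GET is a problem, one possible workaround is to stop the library from following redirects at all and read the Location header yourself. The sketch below is written for Python 3's urllib.request (an assumption; the answer's code is Python 2), and the names NoRedirect and head_without_redirect are hypothetical.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so that no follow-up request is sent."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no new request is built for the redirect;
        # the 3xx response then surfaces as an HTTPError instead.
        return None


def head_without_redirect(url, timeout=10):
    """Send exactly one HEAD request; return (status, headers), even for 3xx."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, dict(response.headers)
    except urllib.error.HTTPError as err:
        # Redirects and other error statuses end up here; the headers,
        # including Location for a redirect, are still available.
        status, headers = err.code, dict(err.headers)
        err.close()
        return status, headers
```

With this, a 301 like the one captured earlier would come back as the status together with its Location header, and it is up to the caller to send another HEAD to that address if wanted.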

An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

    import httplib2

    url = "http://bit.ly/doFeT"
    http_interface = httplib2.Http()

    try:
        response, content = http_interface.request(url, method="HEAD")
        print ("Response status: %d - %s" % (response.status, response.reason))

        # This will just display all the dictionary key-value pairs.  Replace this
        # line with something useful.
        print (response.__dict__)

    except httplib2.ServerNotFoundError as e:
        print (e.message)

This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here's the first request:

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Here's the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

answered Feb 19, 2012 at 14:27 by Simon Elms

1 Comment

I missed it the first time I read the answer, but response.info().dict is exactly what I was looking for. This is not explained in the docs.

urllib2.urlopen does an HTTP GET (or POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you couldn't do readlines or other accesses to the page body, of course).
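
You can check which method would be used without touching the network by asking the Request object. A small sketch using Python 3's urllib.request (an assumption; in Python 2 the same get_method() exists on urllib2.Request, minus the method argument). The helper name method_for and the example.com URL are placeholders.

```python
import urllib.request


def method_for(url, data=None, method=None):
    """Return the HTTP method urlopen would use for this request.

    method_for is a hypothetical helper; nothing is sent over the network.
    """
    return urllib.request.Request(url, data=data, method=method).get_method()


url = "http://example.com/"  # placeholder URL, never contacted

print(method_for(url))                 # GET
print(method_for(url, data=b"a=1"))    # POST
print(method_for(url, method="HEAD"))  # HEAD
```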

answered May 9, 2009 at 14:18 by Alex Martelli


One-liner:

$ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"
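
The same trick seems to carry over to Python 3 with urllib.request. A minimal sketch, where open_with_debug is a made-up name; for https URLs you would presumably pass an HTTPSHandler(debuglevel=1) as well.

```python
import urllib.request


def open_with_debug(url, timeout=10):
    """Open url while the request and response headers are printed to stdout.

    open_with_debug is a hypothetical helper name. debuglevel=1 makes the
    underlying http.client connection print what it sends and receives.
    """
    handler = urllib.request.HTTPHandler(debuglevel=1)
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status
```

The debug lines go to standard output, so this is meant for poking around rather than for parsing headers in a program.
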
answered Mar 30, 2012 at 8:11 by quanta

    import urllib2

    def _GetHtmlPage(self, addr):
        # Send the stored user agent and cookies along with the request.
        headers = {'User-Agent': self.userAgent,
                   'Cookie': self.cookies}
        req = urllib2.Request(addr, headers=headers)
        response = urllib2.urlopen(req)
        # info() exposes the response headers.
        print "ResponseInfo="
        print response.info()
        resultsHtml = unicode(response.read(), self.encoding)
        return resultsHtml

answered Jul 28, 2014 at 9:25 by vitperov
