urllib2 — extensible library for opening URLs

Note

The urllib2 module has been split across several modules in Python 3.0 named urllib.request and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.

The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.

The urllib2 module defines the following functions:

urllib2.urlopen(url[, data][, timeout])

Open the URL url, which can be either a string or a Request object.

data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
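
As a rough sketch of how data is usually prepared, the following builds a form-encoded body and attaches it to a Request without opening any connection. The URL and field names are made up, and the try/except import is an addition so that the sketch also runs on Python 3, where these names live in urllib.request and urllib.parse.

```python
try:                                     # Python 2, as documented here
    import urllib2
    from urllib import urlencode
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2
    from urllib.parse import urlencode

# A sequence of 2-tuples keeps the field order deterministic.
fields = [('q', 'urllib2 docs'), ('page', 1)]
body = urlencode(fields)

# Nothing is sent here; the request object is only constructed.
req = urllib2.Request('http://www.example.com/search', body)
method = req.get_method()                # data is present, so this is a POST
```

Passing req to urlopen() would then send the body as a POST. On Python 3 the body would additionally have to be encoded to bytes before the request is opened.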

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). The timeout applies only to HTTP, HTTPS, FTP and FTPS connections.

This function returns a file-like object with two additional methods:

  • geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
  • info(): return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

Raises URLError on errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

Changed in version 2.6: timeout was added.

urllib2.install_opener(opener)
Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.
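
A minimal offline sketch of the difference between installing an opener and using it directly. DemoHandler and the 'demo' URL scheme are hypothetical and exist only so that no network access is needed; the try/except import is an addition for running on Python 3.

```python
import io
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

class DemoHandler(urllib2.BaseHandler):
    """Hypothetical handler that answers 'demo://' URLs from memory."""
    def demo_open(self, req):
        return io.BytesIO(b'hello from ' + req.get_full_url().encode('ascii'))

opener = urllib2.build_opener(DemoHandler())

# After installation, the module-level urlopen() goes through this opener.
urllib2.install_opener(opener)
via_global = urllib2.urlopen('demo://example/resource').read()

# Without installation, the same result is available from the opener itself.
via_opener = opener.open('demo://example/resource').read()
```

Note that install_opener() changes process-wide state, so calling opener.open() directly is usually the less intrusive choice in library code.
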
urllib2.build_opener([handler, ...])

Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.

Beginning in Python 2.3, a BaseHandler subclass may also change its handler_order member variable to modify its position in the handlers list.
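
The replacement rule can be illustrated with a small sketch. QuietRedirectHandler is a hypothetical subclass, and the handlers attribute inspected below is the list that OpenerDirector keeps internally; it is not described on this page, so treat it as an implementation detail. The try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

class QuietRedirectHandler(urllib2.HTTPRedirectHandler):
    """Hypothetical subclass; it only exists to show the replacement rule."""

# A class is accepted as well as an instance; build_opener() instantiates it.
opener = urllib2.build_opener(QuietRedirectHandler)

# Because a subclass of HTTPRedirectHandler was supplied, the default
# HTTPRedirectHandler is left out and only the subclass is registered.
redirectors = [h for h in opener.handlers
               if isinstance(h, urllib2.HTTPRedirectHandler)]
```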

The following exceptions are raised as appropriate:

exception urllib2.URLError

The handlers raise this exception (or derived exceptions) when they run into a problem. It is a subclass of IOError.

reason
The reason for this error. It can be a message string or another exception instance (socket.error for remote URLs, OSError for local URLs).
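
A small sketch of catching URLError without touching the network: the scheme below is deliberately unknown, so the default UnknownHandler raises the exception. The URL is made up and the try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

opener = urllib2.build_opener()
try:
    opener.open('nosuchscheme://example/')   # no handler knows this scheme
    reason = None
except urllib2.URLError as err:
    reason = err.reason                  # here a message string
```
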
exception urllib2.HTTPError

Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.

code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.
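
The dual nature of HTTPError can be sketched as follows. The exception is constructed by hand purely for illustration (handlers normally create it), the URL and body are made up, and the try/except import is an addition for running on Python 3.

```python
import io
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

# Arguments: url, code, msg, headers, file-like object with the error body.
err = urllib2.HTTPError('http://www.example.com/missing', 404, 'Not Found',
                        {}, io.BytesIO(b'<h1>gone</h1>'))

try:
    raise err
except urllib2.URLError as caught:       # HTTPError is a URLError subclass
    status = caught.code                 # numeric HTTP status code
    body = caught.read()                 # the error body is readable
```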

The following classes are provided:

class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself; some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).
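
A brief sketch of supplying headers, both through the constructor and through add_header(). The agent string and URLs are placeholders, and the try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

req = urllib2.Request('http://www.example.com/',
                      headers={'User-Agent': 'ExampleClient/1.0'})
req.add_header('Referer', 'http://www.python.org/')

# Header names are stored via str.capitalize(), so 'User-Agent' is kept
# as 'User-agent' and has_header() expects that spelling.
has_agent = req.has_header('User-agent')
has_referer = req.has_header('Referer')
```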

The final two arguments are only of interest for correct handling of third-partyHTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

class urllib2.OpenerDirector
The OpenerDirector class opens URLs via BaseHandlers chained together. It manages the chaining of handlers, and recovery from errors.
class urllib2.BaseHandler
This is the base class for all registered handlers; it handles only the simple mechanics of registration.
class urllib2.HTTPDefaultErrorHandler
A class which defines a default handler for HTTP error responses; all responses are turned into HTTPError exceptions.
class urllib2.HTTPRedirectHandler
A class to handle redirections.
class urllib2.HTTPCookieProcessor([cookiejar])
A class to handle HTTP Cookies.
class urllib2.ProxyHandler([proxies])
Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables. To disable autodetected proxies, pass an empty dictionary.
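
A short sketch of the two explicit ways to construct a ProxyHandler. The proxy URL is made up, nothing is fetched, and the try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

# Explicit mapping: only 'http' requests would be routed via this proxy.
explicit = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})

# Empty mapping: ignore any *_proxy environment variables.
disabled = urllib2.ProxyHandler({})

# Passing either instance to build_opener() replaces the default ProxyHandler.
opener = urllib2.build_opener(disabled)
```
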
class urllib2.HTTPPasswordMgr
Keep a database of (realm, uri) -> (user, password) mappings.
class urllib2.HTTPPasswordMgrWithDefaultRealm
Keep a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.
class urllib2.AbstractBasicAuthHandler([password_mgr])
This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.HTTPBasicAuthHandler([password_mgr])
Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.ProxyBasicAuthHandler([password_mgr])
Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.AbstractDigestAuthHandler([password_mgr])
This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.HTTPDigestAuthHandler([password_mgr])
Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.ProxyDigestAuthHandler([password_mgr])
Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.
class urllib2.HTTPHandler
A class to handle opening of HTTP URLs.
class urllib2.HTTPSHandler
A class to handle opening of HTTPS URLs.
class urllib2.FileHandler
Open local files.
class urllib2.FTPHandler
Open FTP URLs.
class urllib2.CacheFTPHandler
Open FTP URLs, keeping a cache of open FTP connections to minimize delays.
class urllib2.UnknownHandler
A catch-all class to handle unknown URLs.

Request Objects

The following methods describe all of Request's public interface, and so all must be overridden in subclasses.

Request.add_data(data)
Set the Request data to data. This is ignored by all handlers except HTTP handlers, where it should be a byte string and will change the request to be POST rather than GET.
Request.get_method()
Return a string indicating the HTTP request method. This is only meaningful for HTTP requests, and currently always returns 'GET' or 'POST'.
Request.has_data()
Return whether the instance has a non-None data.
Request.get_data()
Return the instance’s data.
Request.add_header(key, val)
Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header.
Request.add_unredirected_header(key, header)

Add a header that will not be added to a redirected request.

New in version 2.4.

Request.has_header(header)

Return whether the instance has the named header (checks both regular and unredirected).

New in version 2.4.

Request.get_full_url()
Return the URL given in the constructor.
Request.get_type()
Return the type of the URL — also known as the scheme.
Request.get_host()
Return the host to which a connection will be made.
Request.get_selector()
Return the selector — the part of the URL that is sent to the server.
Request.set_proxy(host, type)
Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.
Request.get_origin_req_host()
Return the request-host of the origin transaction, as defined by RFC 2965. See the documentation for the Request constructor.
Request.is_unverifiable()
Return whether the request is unverifiable, as defined by RFC 2965. See the documentation for the Request constructor.

OpenerDirector Objects

OpenerDirector instances have the following methods:

OpenerDirector.add_handler(handler)

handler should be an instance of BaseHandler. The following methods are searched, and added to the possible chains (note that HTTP errors are a special case).

  • protocol_open(): signal that the handler knows how to open protocol URLs.
  • http_error_type(): signal that the handler knows how to handle HTTP errors with HTTP error code type.
  • protocol_error(): signal that the handler knows how to handle errors from (non-http) protocol.
  • protocol_request(): signal that the handler knows how to pre-process protocol requests.
  • protocol_response(): signal that the handler knows how to post-process protocol responses.
OpenerDirector.open(url[, data][, timeout])

Open the given url (which can be a request object or a string), optionally passing the given data. Arguments, return values and exceptions raised are the same as those of urlopen() (which simply calls the open() method on the currently installed global OpenerDirector). The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). The timeout feature actually works only for HTTP, HTTPS, FTP and FTPS connections.

Changed in version 2.6: timeout was added.

OpenerDirector.error(proto[, arg[, ...]])

Handle an error of the given protocol. This will call the registered error handlers for the given protocol with the given arguments (which are protocol specific). The HTTP protocol is a special case which uses the HTTP response code to determine the specific error handler; refer to the http_error_*() methods of the handler classes.

Return values and exceptions raised are the same as those of urlopen().

OpenerDirector objects open URLs in three stages:

The order in which these methods are called within each stage is determined by sorting the handler instances.

  1. Every handler with a method named like protocol_request() has that method called to pre-process the request.

  2. Handlers with a method named like protocol_open() are called to handle the request. This stage ends when a handler either returns a non-None value (i.e. a response), or raises an exception (usually URLError). Exceptions are allowed to propagate.

    In fact, the above algorithm is first tried for methods named default_open(). If all such methods return None, the algorithm is repeated for methods named like protocol_open(). If all such methods return None, the algorithm is repeated for methods named unknown_open().

    Note that the implementation of these methods may involve calls of the parent OpenerDirector instance’s open() and error() methods.

  3. Every handler with a method named like protocol_response() has that method called to post-process the response.
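
The three stages can be observed with a small offline sketch. TracingHandler and the 'demo' scheme are hypothetical; a real handler would use a protocol name such as http. The try/except import is an addition for running on Python 3.

```python
import io
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

calls = []

class TracingHandler(urllib2.BaseHandler):
    """Hypothetical handler for a made-up 'demo' scheme; logs each stage."""
    def demo_request(self, req):         # stage 1: pre-process the request
        calls.append('request')
        return req
    def demo_open(self, req):            # stage 2: produce a response
        calls.append('open')
        return io.BytesIO(b'payload')
    def demo_response(self, req, response):   # stage 3: post-process
        calls.append('response')
        return response

opener = urllib2.OpenerDirector()
opener.add_handler(TracingHandler())
payload = opener.open('demo://example/').read()
```

After the call, calls holds the stage names in the order in which the methods ran.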

BaseHandler Objects

BaseHandler objects provide a couple of methods that are directly useful, and others that are meant to be used by derived classes. These are intended for direct use:

BaseHandler.add_parent(director)
Add a director as parent.
BaseHandler.close()
Remove any parents.

The following members and methods should only be used by classes derived from BaseHandler.

Note

The convention has been adopted that subclasses defining protocol_request() or protocol_response() methods are named *Processor; all others are named *Handler.

BaseHandler.parent
A valid OpenerDirector, which can be used to open using a different protocol, or handle errors.
BaseHandler.default_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to catch all URLs.

This method, if implemented, will be called by the parent OpenerDirector. It should return a file-like object as described in the return value of the open() of OpenerDirector, or None. It should raise URLError, unless a truly exceptional thing happens (for example, MemoryError should not be mapped to URLError).

This method will be called before any protocol-specific open method.

BaseHandler.protocol_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to handle URLs with the given protocol.

This method, if defined, will be called by the parent OpenerDirector. Return values should be the same as for default_open().

BaseHandler.unknown_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to catch all URLs with no specific registered handler to open it.

This method, if implemented, will be called by the parent OpenerDirector. Return values should be the same as for default_open().

BaseHandler.http_error_default(req, fp, code, msg, hdrs)

This method is not defined in BaseHandler, but subclasses should override it if they intend to provide a catch-all for otherwise unhandled HTTP errors. It will be called automatically by the OpenerDirector getting the error, and should not normally be called in other circumstances.

req will be a Request object, fp will be a file-like object with the HTTP error body, code will be the three-digit code of the error, msg will be the user-visible explanation of the code and hdrs will be a mapping object with the headers of the error.

Return values and exceptions raised should be the same as those of urlopen().

BaseHandler.http_error_nnn(req, fp, code, msg, hdrs)

nnn should be a three-digit HTTP error code. This method is also not defined in BaseHandler, but will be called, if it exists, on an instance of a subclass, when an HTTP error with code nnn occurs.

Subclasses should override this method to handle specific HTTP errors.

Arguments, return values and exceptions raised should be the same as for http_error_default().
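
A minimal sketch of a code-specific error handler, driven through OpenerDirector.error() so that no network access is needed. NotFoundHandler is hypothetical, and it returns a plain string only to keep the sketch short; a real handler would return a urlopen()-style response object or raise an exception. The try/except import is an addition for running on Python 3.

```python
import io
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

class NotFoundHandler(urllib2.BaseHandler):
    """Hypothetical handler that turns a 404 into a placeholder value."""
    def http_error_404(self, req, fp, code, msg, hdrs):
        return 'no such page: %s (%d %s)' % (req.get_full_url(), code, msg)

opener = urllib2.OpenerDirector()
opener.add_handler(NotFoundHandler())

req = urllib2.Request('http://www.example.com/missing')
# Normally this call is made for you once an HTTP error response arrives.
result = opener.error('http', req, io.BytesIO(b''), 404, 'Not Found', {})
```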

BaseHandler.protocol_request(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to pre-process requests of the given protocol.

This method, if defined, will be called by the parent OpenerDirector. req will be a Request object. The return value should be a Request object.

BaseHandler.protocol_response(req, response)

This method is not defined in BaseHandler, but subclasses should define it if they want to post-process responses of the given protocol.

This method, if defined, will be called by the parent OpenerDirector. req will be a Request object. response will be an object implementing the same interface as the return value of urlopen(). The return value should implement the same interface as the return value of urlopen().

HTTPRedirectHandler Objects

Note

Some HTTP redirections require action from this module’s client code. If this is the case, HTTPError is raised. See RFC 2616 for details of the precise meanings of the various redirection codes.

HTTPRedirectHandler.redirect_request(req, fp, code, msg, hdrs, newurl)

Return a Request or None in response to a redirect. This is called by the default implementations of the http_error_30*() methods when a redirection is received from the server. newurl is the URL to redirect to. If a redirection should take place, return a new Request to allow http_error_30*() to perform the redirect. Otherwise, raise HTTPError if no other handler should try to handle this URL, or return None if you can’t but another handler might.

Note

The default implementation of this method does not strictly follow RFC 2616, which says that 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and the default implementation reproduces this behavior.
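
The POST-to-GET behavior can be checked offline by calling the default redirect_request() directly, which is done here only for illustration. In the implementation the method takes the redirect target as a final argument; the URLs and form data are made up, and the try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

handler = urllib2.HTTPRedirectHandler()
post = urllib2.Request('http://www.example.com/form', 'field=value')
original_method = post.get_method()      # data is present, so 'POST'

# Arguments: request, error body, status code, message, headers, target URL.
followed = handler.redirect_request(post, None, 302, 'Found', {},
                                    'http://www.example.com/done')
new_method = followed.get_method()       # the redirected request is a GET
```

A subclass that overrides redirect_request() to return None or raise HTTPError could be passed to build_opener() to change this policy.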

HTTPRedirectHandler.http_error_301(req, fp, code, msg, hdrs)
Redirect to the Location: URL. This method is called by the parent OpenerDirector when getting an HTTP ‘moved permanently’ response.
HTTPRedirectHandler.http_error_302(req, fp, code, msg, hdrs)
The same as http_error_301(), but called for the ‘found’ response.
HTTPRedirectHandler.http_error_303(req, fp, code, msg, hdrs)
The same as http_error_301(), but called for the ‘see other’ response.
HTTPRedirectHandler.http_error_307(req, fp, code, msg, hdrs)
The same as http_error_301(), but called for the ‘temporary redirect’ response.

HTTPCookieProcessor Objects

New in version 2.4.

HTTPCookieProcessor instances have one attribute:

HTTPCookieProcessor.cookiejar
The cookielib.CookieJar in which cookies are stored.
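
A short sketch of wiring a cookie jar into an opener. No request is made, so the jar stays empty; the try/except import is an addition for running on Python 3, where cookielib is named http.cookiejar.

```python
try:
    import urllib2
    import cookielib
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2
    import http.cookiejar as cookielib

jar = cookielib.CookieJar()
processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(processor)

# Cookies set by responses fetched through `opener` would be stored in `jar`
# and sent back on later requests made through the same opener.
same_jar = processor.cookiejar is jar
stored = len(jar)
```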

ProxyHandler Objects

ProxyHandler.protocol_open(request)
The ProxyHandler will have a method protocol_open() for every protocol which has a proxy in the proxies dictionary given in the constructor. The method will modify requests to go through the proxy, by calling request.set_proxy(), and call the next handler in the chain to actually execute the protocol.

HTTPPasswordMgr Objects

These methods are available on HTTPPasswordMgr and HTTPPasswordMgrWithDefaultRealm objects.

HTTPPasswordMgr.add_password(realm, uri, user, passwd)
uri can be either a single URI, or a sequence of URIs. realm, user and passwd must be strings. This causes (user, passwd) to be used as authentication tokens when authentication for realm and a super-URI of any of the given URIs is given.
HTTPPasswordMgr.find_user_password(realm, authuri)

Get user/password for given realm and URI, if any. This method will return (None, None) if there is no matching user/password.

For HTTPPasswordMgrWithDefaultRealm objects, the realm None will be searched if the given realm has no matching user/password.
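
The lookup rules can be sketched without any network access. The realm, URLs and credentials below are made up, and the try/except import is an addition for running on Python 3.

```python
try:
    import urllib2
except ImportError:                      # Python 3 fallback (an assumption)
    import urllib.request as urllib2

mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Realm None acts as the catch-all realm for this URI prefix.
mgr.add_password(None, 'http://www.example.com/private/', 'alice', 's3cret')

# A URI below the registered one matches, whatever realm the server names.
found = mgr.find_user_password('Some Realm',
                               'http://www.example.com/private/report.html')

# A different host has no entry, so (None, None) comes back.
missing = mgr.find_user_password('Some Realm', 'http://other.example.org/')
```

The manager would then typically be passed to one of the authentication handlers, for example HTTPBasicAuthHandler(mgr).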

AbstractBasicAuthHandler Objects

AbstractBasicAuthHandler.http_error_auth_reqed(authreq, host, req, headers)

Handle an authentication request by getting a user/password pair, and re-trying the request. authreq should be the name of the header where the information about the realm is included in the request, host specifies the URL and path to authenticate for, req should be the (failed) Request object, and headers should be the error headers.

host is either an authority (e.g. "python.org") or a URL containing an authority component (e.g. "http://python.org/"). In either case, the authority must not contain a userinfo component (so, "python.org" and "python.org:80" are fine, "joe:password@python.org" is not).

HTTPBasicAuthHandler Objects

HTTPBasicAuthHandler.http_error_401(req, fp, code, msg, hdrs)
Retry the request with authentication information, if available.

ProxyBasicAuthHandler Objects

ProxyBasicAuthHandler.http_error_407(req, fp, code, msg, hdrs)
Retry the request with authentication information, if available.

AbstractDigestAuthHandler Objects

AbstractDigestAuthHandler.http_error_auth_reqed(authreq, host, req, headers)
authreq should be the name of the header where the information about the realm is included in the request, host should be the host to authenticate to, req should be the (failed) Request object, and headers should be the error headers.

HTTPDigestAuthHandler Objects

HTTPDigestAuthHandler.http_error_401(req, fp, code, msg, hdrs)
Retry the request with authentication information, if available.

ProxyDigestAuthHandler Objects

ProxyDigestAuthHandler.http_error_407(req, fp, code, msg, hdrs)
Retry the request with authentication information, if available.

HTTPHandler Objects

HTTPHandler.http_open(req)
Send an HTTP request, which can be either GET or POST, depending on req.has_data().

HTTPSHandler Objects

HTTPSHandler.https_open(req)
Send an HTTPS request, which can be either GET or POST, depending on req.has_data().

FileHandler Objects

FileHandler.file_open(req)
Open the file locally, if there is no host name, or the host name is 'localhost'. Change the protocol to ftp otherwise, and retry opening it using parent.

FTPHandler Objects

FTPHandler.ftp_open(req)
Open the FTP file indicated by req. The login is always done with empty username and password.

CacheFTPHandler Objects

CacheFTPHandler objects are FTPHandler objects with the following additional methods:

CacheFTPHandler.setTimeout(t)
Set timeout of connections to t seconds.
CacheFTPHandler.setMaxConns(m)
Set maximum number of cached connections to m.

UnknownHandler Objects

UnknownHandler.unknown_open()
Raise a URLError exception.

HTTPErrorProcessor Objects

New in version 2.4.

HTTPErrorProcessor.http_response(request, response)

Process HTTP error responses.

For 200 error codes, the response object is returned immediately.

For non-200 error codes, this simply passes the job on to the protocol_error_code() handler methods, via OpenerDirector.error(). Eventually, urllib2.HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.

Examples

This example gets the python.org main page and displays the first 100 bytes of it:

>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

Here we are sending a data-stream to the stdin of a CGI and reading the data it returns to us. Note that this example will only work when the Python installation supports SSL.

>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> print f.read()
Got Data: "This data is passed to stdin of the CGI"

The code for the sample CGI used in the above example is:

    #!/usr/bin/env python
    import sys
    # Read the request body that the server passes on stdin.
    data = sys.stdin.read()
    print 'Content-type: text/plain\n\nGot Data: "%s"' % data

Use of Basic HTTP Authentication:

    import urllib2
    # Create an OpenerDirector with support for Basic HTTP Authentication...
    auth_handler = urllib2.HTTPBasicAuthHandler()
    auth_handler.add_password(realm='PDQ Application',
                              uri='https://mahler:8092/site-updates.py',
                              user='klem',
                              passwd='kadidd!ehopper')
    opener = urllib2.build_opener(auth_handler)
    # ...and install it globally so it can be used with urlopen.
    urllib2.install_opener(opener)
    urllib2.urlopen('http://www.example.com/login.html')

build_opener() provides many handlers by default, including a ProxyHandler. By default, ProxyHandler uses the environment variables named <scheme>_proxy, where <scheme> is the URL scheme involved. For example, the http_proxy environment variable is read to obtain the HTTP proxy’s URL.

This example replaces the default ProxyHandler with one that uses programmatically-supplied proxy URLs, and adds proxy authorization support with ProxyBasicAuthHandler.

    import urllib2
    proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
    proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
    proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

    opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
    # This time, rather than install the OpenerDirector, we use it directly:
    opener.open('http://www.example.com/login.html')

Adding HTTP headers:

Use the headers argument to the Request constructor, or:

    import urllib2
    req = urllib2.Request('http://www.example.com/')
    req.add_header('Referer', 'http://www.python.org/')
    r = urllib2.urlopen(req)

OpenerDirector automatically adds a User-Agent header to every Request. To change this:

    import urllib2
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    opener.open('http://www.example.com/')

Also, remember that a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen() (or OpenerDirector.open()).

© Copyright 1990-2008, Python Software Foundation. Last updated on Oct 02, 2008. Created using Sphinx 0.5.