I am trying to use Python to log in to a website and gather information from several web pages, and I get the following error:
Traceback (most recent call last):
  File "extract_test.py", line 43, in <module>
    response=br.open(v)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re

first = "example.com/page1"
second = "example.com/page2"
third = "example.com/page3"
fourth = "example.com/page4"
## I have seven URLs I want to open
urls_list = [first, second, third, fourth]

br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    response = br.open(url)
    print re.findall("Some String", response.read())

There's no way around it; this is server-side enforcement that keeps track of how many requests per time unit you make. If you exceed this limit you'll be temporarily blocked. Some servers send this information in the header, but those occasions are rare. Check the headers received from the server and use the information available. If not, check how fast you can hammer without getting caught and use a sleep. – Torxed, Apr 1, 2014 at 12:45

stackoverflow.com/questions/15648272/… – Torxed, Apr 1, 2014 at 20:27
8 Answers
Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by spoofing your IP; you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
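A minimal sketch of that approach with the requests library (the URL handling, the 30-second fallback, and the attempt cap below are placeholder choices, and it assumes Retry-After is given in seconds rather than as an HTTP date):

import time
import requests

def get_respecting_retry_after(url, max_attempts=5):
    # Hypothetical helper: retry a GET until the server stops answering 429
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Sleep for however long the server asked us to wait;
        # fall back to 30 seconds if the header is missing
        time.sleep(int(response.headers.get("Retry-After", 30)))
    return response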
As a commenter notes, if you catch the HTTPError as my_exception, the header is available in my_exception.headers, at least for urllib2.

Writing this piece of code into my requests fixed my problem:
requests.get(link, headers={'User-agent': 'your bot 0.1'})

This works because sites sometimes return a 429 (Too Many Requests) error when no user agent is provided. For example, Reddit's API only works when a user agent is supplied.
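If you are sticking with mechanize, as in the question, the usual way to do the same thing is through the browser's addheaders attribute before opening any URLs (a rough sketch; the user-agent string and contact address are just placeholders):

import mechanize

br = mechanize.Browser()
# Identify your client explicitly; some sites reject or throttle the default user agent
br.addheaders = [('User-agent', 'my-scraper 0.1 (contact@example.com)')]
response = br.open("http://example.com")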
As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use case:
1) Sleep your process. The server usually includes a Retry-after header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API), but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests; a minimal client-side sketch appears after the Celery example below. Celery's rate_limit feature uses a token bucket algorithm.
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
import requests
from requests.exceptions import ConnectTimeout
# `task` here stands for your Celery app's task decorator (e.g. app.task)

class TooManyRequests(Exception):
    """Too many requests"""

@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True,
)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')
    if r.status_code == 429:
        raise TooManyRequests()
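If you are not using Celery, a token bucket can also be implemented directly on the client side. Here is a minimal sketch (not from the answer above; the rate and capacity values are arbitrary assumptions):

import time

class TokenBucket(object):
    """Client-side rate limiter: allows `rate` requests per second, bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.time()

    def consume(self):
        now = time.time()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Bucket is empty: wait until the next token becomes available
            time.sleep((1 - self.tokens) / self.rate)
            self.last = time.time()
            self.tokens = 0
        else:
            self.tokens -= 1

You would then create something like bucket = TokenBucket(rate=2, capacity=5) and call bucket.consume() before each br.open(url) in the loop from the question.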
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the server applies its rate limiting at the IP level.

There is a brief blog post demonstrating a way to use Tor along with urllib2:
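The post uses urllib2; as a rough equivalent with the requests library (this assumes a local Tor daemon listening on its default SOCKS port 9050 and the requests[socks] extra installed, neither of which comes from the original answer):

import requests

# Route both DNS resolution and traffic through the local Tor SOCKS proxy
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
response = requests.get("http://example.com", proxies=proxies)
print(response.status_code)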
I've found a nice workaround to IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article.
In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.
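A rough sketch of that idea with the requests library (the proxy addresses below are documentation placeholders, not real proxies; public proxies are often slow or dead, so failures need handling):

import requests

# Placeholder proxies (TEST-NET addresses); substitute entries from a real public list
proxy_pool = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
]

def fetch_via_proxies(url):
    for proxy in proxy_pool:
        try:
            # Each attempt goes out through a different IP address
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        except requests.exceptions.RequestException:
            continue  # dead or blocked proxy; try the next one
    return None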
HttpURLConnection connection = (new Connection(urlString)).connection;
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");

It is crucial to change the user agent, because the old (default) user agent is on Yahoo's stop list.