asyncdefrun_main(bootstrap_retries="indefinite"):delete_webhook_and_drop_pending_updates(retries=bootstrap_retries)poll_for_updates(retries="indefinite")

I now changed this to

asyncdefrun_main(bootstrap_retries=None):delete_webhook_and_drop_pending_updates(retries=bootstrap_retries)poll_for_updates(retries="indefinite")

I think this is saner, and I'm not even sure if indefinite retries during the bootstrapping phase were ever actually intended …

ToDo

Update existing tests to new logs
Add new tests for bootstrapping inApplication

Add Bootstrapping Logic toApplication.run_*

315a97f

Also changes the default value for `bootstrap_retries` on polling

Bibo-Joshi added 🛠 refactor

change type: refactor

🔌 enhancementpr description: enhancement labels

Feb 7, 2025

Poolitzer reviewed

Feb 8, 2025

View reviewed changes

Copy link

Member

Poolitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Im just asking questions at this point

telegram/ext/_updater.py

		@@ -409,12 +414,14 @@ def default_error_callback(exc: TelegramError) -> None:
		# updates from Telegram and inserts them in the update queue of the
		# Application.
		self.__polling_task = asyncio.create_task(
		self._network_loop_retry(
		network_retry_loop(
		is_running=lambda: self.running,

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why is that a lambda?

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

python-telegram-bot/telegram/ext/_utils/networkloop.py

Line 113 in315a97f

whileeffective_is_running():

this line expects a callable. If we simply passis_runnig=self.running and edit above line to

whileis_running:

then we have an infinite loop, b/c settingself.running=False withinUpdater won't change the value in the network retry loop :) There are probably a billion other ways do to this (pass an object with arunning attribute to check, pass a dict with a corresponding entry, pass an event …). This one seemed rather straight forward to me, but I'm open to changing it. Accepting an instance of

classHasRunning(Protocol):running:bool

would sound like the next best thing to me at first glance

Copy link

Member

tsnoam commentedFeb 9, 2025

@Eldinnie
here's my vague memory for the original changes:

infinite retries were really intended.
the idea was to find some common ground (i.e. defaults) for most of the user base (we understood that not all can be pleased)
there is a bootstrapping phase in which certain actions had to be made (are they still needed? I don't know... I haven't been following the bot API in the past years).
what happens if these operations fail? do we resume with normal execution of the bot? do we abort (and let the user run the bot again)? or do we just retry until a success?
we chose the option of indefinite retries (and afair, with appropriate logging). that way we make sure that the bot is properly started and will be started eventually. aborting seemed like too much, after all, the user will just need to restart the bot and come to the same "problematic" part of the initialization. so what's the point in that?

I hope that helps...

Copy link

Member

tsnoam commentedFeb 9, 2025

and why did i tag Eldinnie instead of@Bibo-Joshi ?

it's my memory playing tricks on me. lol.

Copy link

MemberAuthor

Bibo-Joshi commentedFeb 9, 2025

Thanks very much for your insights@tsnoam !

The bootstrapping phase is indeed still needed. It deletes the previous webhook if you run polling, sets a new webhook if your run in webhook mode and drops pending updates if requested.

Do you recall why you chose indefinite retries over doing a limited number of retries (say, 3)?. Even though we have logging, indefinite retries can get you in a situation where the process is running in a healthy way (eagerly retrying to set the webhook) but the bot doesn't react at all. That sounds undesirable to me. A limitd number of retries OTOH would be a sane effort of getting things going whil still preventing that the process get's stuck in an unusable state without explicit indication to the user.

Moreover I noticed that for webhook mode, the retries were kept at "none". I suspect that this was an oversight: The docstring was updated, but not the signature:https://github.com/python-telegram-bot/python-telegram-bot/pull/1018/files#r1948035215

Copy link

Member

tsnoam commentedFeb 9, 2025•
edited
Loading

@Bibo-Joshi

Do you recall why you chose indefinite retries over doing a limited number of retries (say, 3)?.
as stated on bullet 5:

we chose the option of indefinite retries (and afair, with appropriate logging). that way we make sure that the bot is properly started and will be started eventually. aborting seemed like too much, after all, the user will just need to restart the bot and come to the same "problematic" part of the initialization. so what's the point in that?

A limitd number of retries OTOH would be a sane effort of getting things going whil still preventing that the process get's stuck in an unusable state without explicit indication to the user.

I'm not sure about it. The initialization phase is the for a reason. Skipping it might lead to a limbo state in which you don't know exactly what happens.

Moreover I noticed that for webhook mode, the retries were kept at "none". I suspect that this was an oversight

Could be an oversight. That much I don't remember.

Copy link

MemberAuthor

Bibo-Joshi commentedFeb 9, 2025

Skipping it might lead to a limbo state in which you don't know exactly what happens.

I would not skip it, but rather abort if the bootstrapping phase fails. Meaning that fater n < ∞ retries you either have a bot that's able to handle updates or the process has shut down.

But okay, I get your reasoning and now have to decide what to make of it 😅 Thanks for the input!

Copy link

Member

tsnoam commentedFeb 9, 2025

as long as you can get to an expected state, any solution is good.

Copy link

Contributor

septatrix commentedFeb 9, 2025

I think a finite number of retries is more than enough as long as the delay between them is large enough to eliminate intermittent connectivity problems as the source of potential problems. Eg. in my issue report the problem was that the service was brought up before network/DNS was online. This state is usually resolved within the first 10-20 seconds after boot.

Having an infinite amount of retries by default sound dangerous and may instead hide problems which the user wants the be notified about (by the service exiting)

Copy link

MemberAuthor

Bibo-Joshi commentedFeb 9, 2025

as long as the delay between them is large enough to eliminate intermittent connectivity problems as the source of potential problems

So far I have implemented the initial retry-interval to 0 seconds forApplication.initialize

python-telegram-bot/telegram/ext/_application.py

Line 1056 in315a97f

interval=0,

but I see that the bootstrapping withinUpdater uses an initial interval of 1 second - I can use that forApplication as well:

python-telegram-bot/telegram/ext/_updater.py

Line 715 in315a97f

bootstrap_interval:float=1,

Note that the retry-loop retriesimmediately on timeout errors. On other errors, the interval is increased step by step up to 30 seconds:

python-telegram-bot/telegram/ext/_utils/networkloop.py

Lines 114 to 145 in315a97f

	try:
	ifnotawaitdo_action():
	break
	exceptRetryAfterasexc:
	slack_time=0.5
	_LOGGER.info(
	"%s %s. Adding %s seconds to the specified time.",log_prefix,exc,slack_time
	)
	cur_interval=slack_time+exc.retry_after
	exceptTimedOutastoe:
	_LOGGER.debug("%s Timed out: %s. Retrying immediately.",log_prefix,toe)
	# If failure is due to timeout, we should retry asap.
	cur_interval=0
	exceptInvalidToken:
	_LOGGER.exception("%s Invalid token. Aborting retry loop.",log_prefix)
	raise
	exceptTelegramErrorastelegram_exc:
	ifon_err_cb:
	on_err_cb(telegram_exc)

	ifmax_retries<0orretries<max_retries:
	_LOGGER.debug(
	"%s Failed run number %s of %s. Retrying.",log_prefix,retries,max_retries
	)
	else:
	_LOGGER.exception(
	"%s Failed run number %s of %s. Aborting.",log_prefix,retries,max_retries
	)
	raise

	# increase waiting times on subsequent errors up to 30secs
	cur_interval=1ifcur_interval==0elsemin(30,1.5*cur_interval)

Specifying the retry-interval is currently not exposed to the user in theUpdater.start_* methods and for starters that would seem like a bit of overkill, TBH.

@septatrix would this + (customizable) finite number of retries be enough for you?

Copy link

Contributor

septatrix commentedFeb 9, 2025

as long as the delay between them is large enough to eliminate intermittent connectivity problems as the source of potential problems
So far I have implemented the initial retry-interval to 0 seconds forApplication.initialize [...] but I see that the bootstrapping withinUpdater uses an initial interval of 1 second - I can use that forApplication as well:

I think having a default 1s interval is better than 0s - or at least something non-zero. Depending on how far along the network stack is the connections would try instantly instead of after a timeout so adding a small delay (such as the 1s) on our side is better.

Note that the retry-loop retriesimmediately on timeout errors. On other errors, the interval is increased step by step up to 30 seconds:

On timeout errors this should not be a problem because the timeout itself will result in some delay (though it also wouldn't hurt to wait, it's just not necessary).

Specifying the retry-interval is currently not exposed to the user in theUpdater.start_* methods and for starters that would seem like a bit of overkill, TBH.

I agree. As long as the default interval is non-zero it should be fine for most people.

@septatrix would this + (customizable) finite number of retries be enough for you?

Yes, as far as I understand how this is hooked up now this would be enough for me.

Small aside while peeking at the code I found a small curiosity, nothing urgent: I noticed the slack added toRetryAfter exceptions (introduced ina68cf8d) is different from the one inAIORateLimiter. If you already extract it into a variable maybe put it in some shared location? However, this is completely independent from the rest of this PR so feel free to disregard it

Bibo-Joshi added3 commits

February 17, 2025 17:17

change retry interval to 1 second on application bootstrap

6e64043

Merge branch 'master' into feature/app-bootstrapping

f933cfd

Add tests forApplication

4d60f00

Copy link

codecovbot commentedFeb 17, 2025•
edited
Loading

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
6531	2	6529	746

View the top 2 failed test(s) by shortest run time

tests.test_bot.TestBotWithRequest::test_send_close_date_default_tz[Europe/Berlin-ZoneInfo]

Stack Traces | 7.35s run time

self=<tests.test_bot.TestBotWithRequestobjectat0x000001CE6EFE70D0>tz_bot=PytestExtBot[token=690091347:AAFLmR5pAB5Ycpe_mOh7zM4JFBOh0z3T0To]super_group_id='-1001279600026'asyncdeftest_send_close_date_default_tz(self,tz_bot,super_group_id):question="Is this a test?"answers= ["Yes","No","Maybe"]reply_markup=InlineKeyboardMarkup.from_button(InlineKeyboardButton(text="text",callback_data="data")        )aware_close_date=dtm.datetime.now(tz=tz_bot.defaults.tzinfo)+dtm.timedelta(seconds=5)close_date=aware_close_date.replace(tzinfo=None)msg=awaittz_bot.send_poll(# The timezone returned from this is always converted to UTCchat_id=super_group_id,question=question,options=answers,close_date=close_date,read_timeout=60,        )msg.poll._unfreeze()# Sometimes there can be a few seconds delay, so don't let the test fail due to that->msg.poll.close_date=msg.poll.close_date.astimezone(aware_close_date.tzinfo)EAttributeError:'NoneType'objecthasnoattribute'astimezone'tests\test_bot.py:2843:AttributeError

tests.test_bot.TestBotWithRequest::test_send_close_date_default_tz[Asia/Singapore-timezone]

Stack Traces | 7.37s run time

self=<tests.test_bot.TestBotWithRequestobjectat0x00000223F0A048D0>tz_bot=PytestExtBot[token=690091347:AAFLmR5pAB5Ycpe_mOh7zM4JFBOh0z3T0To]super_group_id='-1001279600026'asyncdeftest_send_close_date_default_tz(self,tz_bot,super_group_id):question="Is this a test?"answers= ["Yes","No","Maybe"]reply_markup=InlineKeyboardMarkup.from_button(InlineKeyboardButton(text="text",callback_data="data")        )aware_close_date=dtm.datetime.now(tz=tz_bot.defaults.tzinfo)+dtm.timedelta(seconds=5)close_date=aware_close_date.replace(tzinfo=None)msg=awaittz_bot.send_poll(# The timezone returned from this is always converted to UTCchat_id=super_group_id,question=question,options=answers,close_date=close_date,read_timeout=60,        )msg.poll._unfreeze()# Sometimes there can be a few seconds delay, so don't let the test fail due to that->msg.poll.close_date=msg.poll.close_date.astimezone(aware_close_date.tzinfo)EAttributeError:'NoneType'objecthasnoattribute'astimezone'tests\test_bot.py:2843:AttributeError