Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork8.3k
ports/esp32: Fix crash and improve timer interrupt allocation#17265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
2e0fb59
to2015bda
Compare
projectgus left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
These changes look good to me,@DvdGiessen, and even without the use-after-free bug I agree on principle that it's better to be disabling/re-enabling the interrupt rather than freeing and re-allocating it where we can.
My only concern is that it'd be nice to have either a regression test or test coverage for this. I'm guessing that's non-trivial, yeah? i.e. It doesn't sound like the crash would be easy to reproduce in a unit test, and there's no existing unit test for machine.Timer on esp32 so "extending" test coverage would really mean "implementing" test coverage which is a much bigger task. What do you think?
Thanks for reviewing! Note I don't think it's a use-after-free, it's due to the use of I agree that the ESP32 could use some machine.Timer tests, but I don't think it'll be easy to test this specific bug. Due to it being dependant on scheduling (and having a non-pinned task in the first place) it won't be easily reproduced with a unit test; best we could do is either engineer the circumstances with a modified build (which seems a bit pointless) or run a reproduction in a loop with the standard build and assert it doesn't crash within some timeout. |
Oh, sorry, yes you explained the issue very clearly in your description! Which I read, then waited a week, then reviewed the code so I'd forgotten what you said about it. 🤦 I think it's probably OK to merge this without worrying about automated testing, if Damien is OK with that. |
esp_intr_free is not safe to call from the timer ISR because it requiresthe current task (the one the ISR interrupted) to be pinned to the samecore as the interrupt was allocated on. Merely disabling the ISR however issafe since that only requires that we're currently running on the same core(which the ISR always is), regardless of the current task.This was causing deadlocks in machine_uart when the ISR happened tointerrupt a task that was not pinned to a specific core.Signed-off-by: Daniël van de Giessen <daniel@dvdgiessen.nl>
If the interrupt is not freed but merely disabled, instead of reallocatingit every time the timer is enabled again we can instead just re-enable it.That means we're no longer setting the handler every time, and we need toensure it does not change. Doing so by adding an additional wrapperfunction does not only solve that problem, it also allows us to removesome code duplication and simplify how machine_uart uses the timer.Signed-off-by: Daniël van de Giessen <daniel@dvdgiessen.nl>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks, this looks good.
Regarding tests: I'm OK not adding any new tests. We already have UART IRQ tests which should test part of the changes here. And eventually we'll have bettermachine.Timer
tests that cover all ports.
2c2f0b2
intomicropython:masterUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Summary
esp_intr_free
is not safe to call from the timer ISR because it requires thecurrent task (the one the ISR interrupted) to be pinned to the same core as the interrupt was allocated on. The ISR is of course guaranteed to interrupt a task on the same core, but it's not guaranteed that that task is also pinned to that core.This was causing a lockup followed by ISR watchdog timeout in the
machine_uart
RXIDLE timer (which disables the timer from its own callback) when the ISR happened to interrupt a task that was not pinned to a specific core (for example for me if often hit the lwIP TCP/IP thread task).The first commit in this PR fixes that by merely disabling the interrupt, whichis safe to do from the ISR since that only requires that we're currently running on the same core (which the ISR always is), regardless of the current task.
Additionally, the second commit makes repeated disabling and enabling of the interrupt (such as the UART RXIDLE does) a bit more efficient by re-enabling instead of reallocating it. That also allowed to removing some code duplication and simplify how
machine_uart
uses the timer.Testing
I've been using these fixes in combination with#17138 (which changed the PPP implementation to use that RXIDLE IRQ). Like that PR, I've been running this on three custom boards (1x original ESP32, 2x ESP32S3) for a few weeks at the moment of writing.