Description
Bug report
Bug description:
    import traceback
    import gc


    class Obj:
        def __init__(self, name: str):
            self.name = name

        def __repr__(self):
            return f"Obj({self.name!r})"

        def __del__(self):
            print("del", self)


    def deep(i: int):
        a = Obj(f"a, i={i}")
        if i == 2:
            raise Exception(f"exception at i={i}")
        print(a)


    def func():
        for i in range(5):
            gc.collect()
            print("** i:", i)
            try:
                deep(i)
            except Exception as exc:
                print("caught", exc)
                print_tb(exc.__traceback__)
                # traceback.clear_frames(prev_exc.__traceback__)
                clear_tb(exc.__traceback__)
                continue  # continue with next i
            print("deep", i, "done")


    def print_tb(tb):
        print("Call stack:")
        while tb:
            frame_i = tb.tb_frame.f_locals.get("i")
            print(f"{tb.tb_frame.f_code.co_name}: i={frame_i}")
            tb = tb.tb_next


    def clear_tb(tb):
        print("Clearing stack:")
        while tb:
            print(tb.tb_frame)
            try:
                tb.tb_frame.clear()
            except RuntimeError:
                print(" cannot clear?")
            else:
                print(" cleared")
                # Using this code triggers that the ref actually goes out of scope, otherwise it does not!
                # print(" now:", tb.tb_frame.f_locals)
            tb = tb.tb_next


    if __name__ == '__main__':
        func()
        print("exit")
Running this code gives the following output:
    ** i: 0
    Obj('a, i=0')
    del Obj('a, i=0')
    deep 0 done
    ** i: 1
    Obj('a, i=1')
    del Obj('a, i=1')
    deep 1 done
    ** i: 2
    caught exception at i=2
    Call stack:
     func: i=2
     deep: i=2
    Clearing stack:
    <frame at 0x7f9ee1cc72a0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 34, code func>
     cannot clear?
    <frame at 0x7f9ee1c168c0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 20, code deep>
     cleared
    ** i: 3
    Obj('a, i=3')
    del Obj('a, i=3')
    deep 3 done
    ** i: 4
    Obj('a, i=4')
    del Obj('a, i=4')
    deep 4 done
    exit
    del Obj('a, i=2')

You see that `Obj('a, i=2')` is only deleted at exit.
This only happens when `print_tb` is used before, which accesses `f_locals` of each frame.
`traceback.clear_frames` should have cleared the locals. But as you see from the output, it does not.
`clear_tb` is basically a copy of `traceback.clear_frames`.
The problem goes away if you access `tb.tb_frame.f_locals` after it was cleared (i.e. after `tb.tb_frame.clear()` was called).
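The workaround from the last sentence can be sketched as a small helper. This is only an illustration, not a proposed fix: `Payload`, `clear_tb_frames`, `failing` and `run` are hypothetical names, and the sketch assumes that reading `f_locals` once more after `clear()` is enough to drop the stale snapshot, as observed above.

```python
import weakref


class Payload:
    """Hypothetical stand-in for a large object held in a local variable."""


def clear_tb_frames(tb):
    """Clear every frame of a traceback, then re-read f_locals on each cleared frame."""
    while tb is not None:
        frame = tb.tb_frame
        try:
            frame.clear()
        except RuntimeError:
            # The frame is still executing (e.g. the one handling the exception).
            pass
        else:
            # Workaround: reading f_locals again after clear() lets affected
            # versions refresh the locals dict that an earlier access created.
            frame.f_locals
        tb = tb.tb_next


def failing(refs):
    payload = Payload()
    refs.append(weakref.ref(payload))
    raise ValueError("boom")


def run():
    """Return True if the payload was released while still inside the except block."""
    refs = []
    try:
        failing(refs)
    except ValueError as exc:
        # Simulate extended exception reporting that reads f_locals of each frame.
        tb = exc.__traceback__
        while tb is not None:
            tb.tb_frame.f_locals
            tb = tb.tb_next
        clear_tb_frames(exc.__traceback__)
        return refs[0]() is None
```

Calling `run()` reports whether the local of the failed call was already released inside the `except` block, without relying on `__del__` output.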
Looking at the C code, this is what `tb_frame.clear()` will do:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L933-L946
    static int
    frame_tp_clear(PyFrameObject *f)
    {
        Py_CLEAR(f->f_trace);

        /* locals and stack */
        PyObject **locals = _PyFrame_GetLocalsArray(f->f_frame);
        assert(f->f_frame->stacktop >= 0);
        for (int i = 0; i < f->f_frame->stacktop; i++) {
            Py_CLEAR(locals[i]);
        }
        f->f_frame->stacktop = 0;
        return 0;
    }

However, if you accessed `tb_frame.f_locals` before, it will have created a dictionary in `frame->f_locals` here:
https://github.com/python/cpython/blob/5c238225f60c33cf1931b1a8c9a3310192c716ae/Objects/frameobject.c#L1218C18-L1218C33
That `frame->f_locals` dict will also have references to all the local vars. And that `f_locals` dict is not cleared in `tb_frame.clear()`.
However, when you then access `tb_frame.f_locals` again, it will update the existing `frame->f_locals` dict and delete all the local vars in it, because they are not available anymore. Here:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L1256C13-L1256C55
I think it's a bug (or at least very unexpected) that `tb_frame.clear()` does not clear `frame->f_locals`.
So my suggestion would be to add `Py_CLEAR(f->f_frame->f_locals)` in `frame_tp_clear`.
There is another, related issue: when the `except` block is left, the exception goes out of scope, so all the locals should be freed (even when `frame.clear()` was not called). However, this is also not the case.
After inspecting this further: once `frame.f_locals` has been accessed for the current frame where the exception is handled, that `frame.f_locals` dict still holds a reference to the exception, and thus to the frames, even though the `DELETE_FAST` for the exception deleted it from the fast locals. See the comments below for more on this.
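One way to sidestep this second problem on affected versions might be to never read `f_locals` of the frame that handles the exception. The following is a minimal sketch under that assumption, not a verified fix; `Payload`, `report_locals`, `failing` and `handle` are hypothetical names used only for illustration.

```python
import sys
import weakref


class Payload:
    """Hypothetical stand-in for a large object held in a local variable."""


def report_locals(tb, skip_frame):
    """Collect the local variable names of each frame, never reading skip_frame.f_locals."""
    names = []
    while tb is not None:
        frame = tb.tb_frame
        if frame is not skip_frame:
            names.append((frame.f_code.co_name, sorted(frame.f_locals)))
        tb = tb.tb_next
    return names


def failing(refs):
    payload = Payload()
    refs.append(weakref.ref(payload))
    raise ValueError("boom")


def handle():
    """Return the report and whether the payload was released after the except block."""
    refs = []
    report = None
    try:
        failing(refs)
    except ValueError as exc:
        # sys._getframe() is the frame running this handler; its f_locals is
        # skipped so that no locals dict of this frame ends up holding exc.
        report = report_locals(exc.__traceback__, skip_frame=sys._getframe())
    # exc has been deleted here, so the traceback and its frames can be freed.
    return report, refs[0]() is None
```

The report only stores names, not values, so it does not itself keep any of the frames' locals alive.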
Note: for PyTorch and similar frameworks, when you first do extended exception reporting that accesses `f_locals` in any way, fixing this would resolve both problems described above. Related:
- Inconsistent recovery from CUDA OOMs pytorch/pytorch#18853
- Free Memory after CUDA out of memory error pytorch/pytorch#27600
E.g., this came up for us because we have this extended exception reporting, which accesses `f_locals`:
    # Extend exception message by module call stack.
    module_names_by_id = {}  # id -> name
    for name, mod in model.named_modules():
        if id(mod) not in module_names_by_id:
            module_names_by_id[id(mod)] = name or "(root)"
    exc_ext = []
    for frame in iter_traceback(exc.__traceback__):
        if frame.f_code.co_nlocals == 0:
            continue
        frame_self = frame.f_locals.get("self")
        if isinstance(frame_self, (torch.nn.Module, rf.Module)):
            func = get_func_from_code_object(frame.f_code, frame=frame)
            if func and func.__name__ and func.__name__.startswith("_") and not func.__name__.startswith("__"):
                continue
            func_name = (func and func.__qualname__) or type(frame_self).__name__
            exc_ext.append(f"({func_name}){module_names_by_id.get(id(frame_self), '(unknown)')}")
    if not exc_ext:
        exc_ext.append("(No module call frames.)")
    if len(exc.args) == 1 and isinstance(exc.args[0], str) and not always_direct_print:
        exc.args = ("\n".join([exc.args[0], "", "Module call stack:"] + exc_ext),)
    else:
        print("Module call stack:", file=log.v3)
        for msg in exc_ext:
            print(msg, file=log.v3)
The normal `traceback.clear_frames` does not help here.
CPython versions tested on:
3.11, 3.12, 3.13
Operating systems tested on:
Linux