Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Fix access violation exception on shutdown#2386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

Frawak
Copy link
Contributor

@FrawakFrawak commentedMay 13, 2024
edited
Loading

What does this implement/fix? Explain your changes.

When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in theCLRObject.reflectedObjects are zero. This caused an exception in some scenarios becauseRuntime.PyObject_TYPE(reflectedClrObject) is called while the reference counter is at zero.

AfterTypeManager.RemoveTypes(); is called in theRuntime.Shutdown() method, reference count decrements to zero do not invokeClassBase.tp_clear for managed objects anymore which normally is responsible for freeing GC handles and removing references fromCLRObject.reflectedObjects. Collecting objects referenced inCLRObject.reflectedObjects only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.

Does this close any currently open issues?

#1977

Any other comments?

...

Checklist

Check all those that are applicable and complete.

  • Make sure to include one or more tests for your change
  • If an enhancement PR, please create docs and at best an example
  • Ensure you have signed the.NET Foundation CLA
  • Add yourself toAUTHORS
  • Updated theCHANGELOG

When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `CLRObject.reflectedObjects` are zero.This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero.After `TypeManager.RemoveTypes();` is called in the `Runtime.Shutdown()` method, reference count decrements to zero do not invoke `ClassBase.tp_clear` for managed objects anymore which normally is responsible for removing references from `CLRObject.reflectedObjects`. Collecting objects referenced in `CLRObject.reflectedObjects` only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.
@FrawakFrawakforce-pushed thefix/1977-access-violation-gc branch fromd496959 to4e5afdfCompareMay 13, 2024 14:43
@Frawak
Copy link
ContributorAuthor

Frawak commentedMay 13, 2024
edited
Loading

Alright. At least the embedding tests went through. I am going to take another look.

Otherwise, collecting all at this earlier point results in corrupt memory for derived types.
@Frawak
Copy link
ContributorAuthor

The original change broke the shutdown flow for derived objects. The (hopefully) final proposition from me remains the same, only restricting the early collection run to the managed objects which would, as described above, run into the problem of the clearedtp_clear slot at a later point.

@Frawak
Copy link
ContributorAuthor

Frawak commentedMay 21, 2024
edited
Loading

@filmor@lostmsu Could you please trigger the workflows/pipelines again?

@lostmsu
Copy link
Member

lostmsu commentedMay 21, 2024
edited
Loading

This feels like a band-aid rather than proper fix. Theoretically invoking GC should not leave anything in a broken state, but types derived across .NET/Python boundaries are tricky. We should come up with a correct way to finalize them rather than adding such a band-aid, IMHO.

It is possible your scenario keeps invalid references. E.g. Python or .NET hold live references to each other's objects at the point when you callShutdown. You need to narrow down the scenario to a short test case that would reliably crash, then we can see if Python.NET needs to fix that and how or provide a better error message.

@Frawak
Copy link
ContributorAuthor

Frawak commentedMay 21, 2024
edited
Loading

@lostmsu To make sure we are on the same page and in risk of repeating myself, I will copy some code snippets of the current master.

The shutdown method:

NullGCHandles(ExtensionType.loadedExtensions);ClassManager.RemoveClasses();TypeManager.RemoveTypes();// this removes tp_clear slot_typesInitialized=false;            ...PyGC_Collect();booleverythingSeemsCollected=TryCollectingGarbage(MaxCollectRetriesOnShutdown,forceBreakLoops:true);// this decreases the reference count

The tp_clear method:

publicstaticinttp_clear(BorrowedReferenceob){varweakrefs=Runtime.PyObject_GetWeakRefList(ob);if(weakrefs!=null){Runtime.PyObject_ClearWeakRefs(ob);}if(TryFreeGCHandle(ob))// what NullGCHandles tries to do after the fact{IntPtraddr=ob.DangerousGetAddress();booldeleted=CLRObject.reflectedObjects.Remove(addr);// the normal removal from hashsetDebug.Assert(deleted);}intbaseClearResult=BaseUnmanagedClear(ob);if(baseClearResult!=0){returnbaseClearResult;}ClearObjectDict(ob);return0;}

tp_clear should be responsible for freeing the GC handle and removing the address fromCLRObject.reflectedObjects what theforceBreakLoops tries to do separatedly.

for(inti=0;i<2;i++){GC.Collect();GC.WaitForPendingFinalizers();pyCollected+=PyGC_Collect();pyCollected+=Finalizer.Instance.DisposeAll();}if(Volatile.Read(ref_collected)==0&&pyCollected==0){if(attempt+1==runs)returntrue;}elseif(forceBreakLoops){NullGCHandles(CLRObject.reflectedObjects);CLRObject.reflectedObjects.Clear();}

There are two crucial points which I hope you see are wrong in the current shutdown flow:

  1. tp_clear is not invoked after decreasing the reference count of a respective object (referenced inCLRObject.reflectedObjects) to zero.
  2. The reference count to all addresses inCLRObject.reflectedObjects is zero when calling NullGCHandles(CLRObject.reflectedObjects);, i.e. handling the addresses in that state.

This can be seen by just debugging and stepping into the code, regardless if an exception is thrown or not. Hence, the state is already broken after the linepyCollected += Finalizer.Instance.DisposeAll();.

Is this not enough that a fix is necessary?
An error message does not help because nothing can be done really to handle it. If you extend Python and unload pythonnet, you cannot use C# exception types anymore to catch them in the Python domain. This is an unhandled exception that crashes the Python process.

@Frawak
Copy link
ContributorAuthor

This feels like a band-aid rather than proper fix.

Is this a feeling because of function name semantics which could be dissolved by refactoring the solution?

Effectively, my changes only run this:

while(!_objQueue.IsEmpty){if(!_objQueue.TryDequeue(outvarobj))continue;if(obj.RuntimeRun!=run){HandleFinalizationException(obj.PyObj,newRuntimeShutdownException(obj.PyObj));continue;}IntPtrcopyForException=obj.PyObj;Runtime.XDecref(StolenReference.Take(refobj.PyObj));collected++;try{Runtime.CheckExceptionOccurred();}catch(Exceptione){HandleFinalizationException(obj.PyObj,e);}}

Before this:

NullGCHandles(ExtensionType.loadedExtensions);ClassManager.RemoveClasses();TypeManager.RemoveTypes();_typesInitialized=false;MetaType.Release();PyCLRMetaType.Dispose();PyCLRMetaType=null!;Exceptions.Shutdown();PythonEngine.InteropConfiguration.Dispose();DisposeLazyObject(clrInterop);DisposeLazyObject(inspect);DisposeLazyObject(hexCallable);PyObjectConversions.Reset();

@lostmsu
Copy link
Member

lostmsu commentedMay 21, 2024
edited
Loading

An explanation is not enough, we need a minimal test (not necessarily written as a test function, could be a short script or .NET console app) that fails before the fix, and passes after, preferably with just a handful of allocated objects.

I am being so picky because without certainty about exactly what's going on I can not be sure that your fix does not just randomly happens to help your particular use case while breaking someone else's use case not yet covered by tests. And it introduces a little extra complexity too.

@Frawak
Copy link
ContributorAuthor

Frawak commentedMay 22, 2024
edited
Loading

This is the most down grinded version of my case that I could accomplish:https://github.com/Frawak/pythonnet/tree/fix/1977-access-violation-gc-testcase/exception_scenario
Please read the README.

I hope that the exception is as reliable on other machines.

preferably with just a handful of allocated objects

The problem with that: The exception only occurs with some internal memory management from Python (my interpretation) which is hard to reproduce in general and I do not see how it can be done with a few simple objects. The process has to be triggered by its inner logic to access the memory while pythonnet is shutting down.

I also had to add the other open-source library because while using that, the exception occurred in specific cases. But there is nothing wrong with the library and its types. They are pretty simple and no IDisposables or the like.
Trying to magically trigger this exception with default .NET library types is like gambling in the lottery.

And as I said: It is not about the surrounding conditions when the exception is thrown. They are not the cause. The cause can be seen with any script using .NET objects from Python and debugging into the shutdown method. It is just lucky that the exception is rare.

@Frawak
Copy link
ContributorAuthor

@lostmsu Are you going to look at the code example? Does the exception get thrown when you run it?

It's a process. Work with me.

@lostmsu
Copy link
Member

lostmsu commentedMay 31, 2024
edited
Loading

@Frawak I took a look at the example, just want to see if I understood it correctly: it does not appear to have either Python classes derived from .NET classes, nor .NET classes derived from Python classes? E.g. all it does is imports .NET lib, instantiates a few classes from there as temporary variables, and that's it?

P.S. sorry about late response, we are unpaid maintainers here, doing it at our leisure.

@Frawak
Copy link
ContributorAuthor

Frawak commentedMay 31, 2024
edited
Loading

I took a look at the example, just want to see if I understood it correctly: it does not appear to have either Python classes derived from .NET classes, nor .NET classes derived from Python classes? E.g. all it does is imports .NET lib, instantiates a few classes from there as temporary variables, and that's it?

That is correct.

P.S. sorry about late response, we are unpaid maintainers here, doing it at our leisure.

No need to apologize. Of course I know that. But any information (like "maybe in x weeks") is better than no information. Otherwise, I don't know whether it has genuinely slipped your mind and you need a poke or you come around to it at your own time. In my experience, the former is more common.

@rohanjain101
Copy link

@lostmsu could you please help take another look?

@lambda
Copy link

Just wanted to report that we were consistently running into this issue in our Windows CI tests, for a case where we were using pythonnet to load a proprietary .net file format parser (since it's proprietary, can't necessarily submit it as a test case).

We switched to point to this branch of pythonnet, and the issue has gone away.

So, hopefully that might provide another data point to address "I can not be sure that your fix does not just randomly happens to help your particular use case while breaking someone else's use case not yet covered by tests," this does fix a problem for us and doesn't seem to have introduced any new ones across our test matrix (Python 3.10, 3.11, 3.12 on Debian, and 3.10 on Windows).

Please do review and ensure this branch is properly tested, but we are eagerly looking forward to the release of this fix, or some other fix to this problem.

lostmsu reacted with thumbs up emojies-mitty and rohanjain101 reacted with heart emoji

@filmor
Copy link
Member

@lostmsu The current investigation here would be enough for me to accept the PR as is. That derived classes require special handling is not surprising, given how they are implemented.

@Frawak It would be great if you could put your test-case into a separate repository and keep it alive for a while. If you have the time to integrate it into our test-suite, even better :)

@lostmsu
Copy link
Member

I still plan to take a look.

filmor and Frawak reacted with thumbs up emoji

@lostmsu
Copy link
Member

OK, I did a review, and IMHO, this fix will just leak .NET memory because the GC handles will never be released after .NET types are disabled in Python. I am not sure, but it also might mean that some critical finalizers won't run.

IMHO, what should be done is aftertp_clear is disabled, we should replace it with a Python function that will grab the GC handle and put it somewhere the rest of the shutdown procedure can look (e.g. some global Python object, essentiallylist[int] whereint is the value ofGCHandle). That way we can have a variant ofTryCollectGarbage that does not refer toreflectedObjects and instead releasesGCHandle from that list.

Ideally, aftertp_clear is disabled we should also stop putting new objects intoreflectedObjects. Then theTryCollectGarbage won't need specialbool telling it if the objects in this list are still valid.

@Frawak
Copy link
ContributorAuthor

Frawak commentedJun 10, 2024
edited
Loading

I have a guess what you could mean but I am not following completely.

because the GC handles will never be released after .NET types are disabled in Python.

tp_clear manages that on its own:

if(TryFreeGCHandle(ob))

That is the same thing what theforceBreakLoop part tries to do after the fact:

ManagedType.TryFreeGCHandle(@ref);

Ideally, after tp_clear is disabled we should also stop putting new objects into reflectedObjects. Then the TryCollectGarbage won't need special bool telling it if the objects in this list are still valid.

I would even suggest that theforceBreakLoop part could be removed completely in my change proposal because by the time my change gets to it, thereflectedObjects list is already empty. And I cannot really imagine that while running these lines anything is added to thereflectedObjects:

NullGCHandles(ExtensionType.loadedExtensions);
ClassManager.RemoveClasses();
TypeManager.RemoveTypes();
_typesInitialized=false;
MetaType.Release();
PyCLRMetaType.Dispose();
PyCLRMetaType=null!;
Exceptions.Shutdown();
PythonEngine.InteropConfiguration.Dispose();
DisposeLazyObject(clrInterop);
DisposeLazyObject(inspect);
DisposeLazyObject(hexCallable);
PyObjectConversions.Reset();

But I was not entirely sure whether or not there is a scenario in which objects still remain inreflectedObjects after running the first part of the finalizer:

while(!_objQueue.IsEmpty)
{
if(!_objQueue.TryDequeue(outvarobj))
continue;
if(obj.RuntimeRun!=run)
{
HandleFinalizationException(obj.PyObj,newRuntimeShutdownException(obj.PyObj));
continue;
}
IntPtrcopyForException=obj.PyObj;
Runtime.XDecref(StolenReference.Take(refobj.PyObj));
collected++;
try
{
Runtime.CheckExceptionOccurred();
}
catch(Exceptione)
{
HandleFinalizationException(obj.PyObj,e);
}
}

Because in my scenario I do not use any derived types or anything exotic, only:

all it does is imports .NET lib, instantiates a few classes from there as temporary variables, and that's it

And so my changes made only the most minimalistic touch on the shutdown behavior so that the previous code parts still run if I am overlooking something. And if I am not, then that code just runs but has no effect anymore (which then is just a redundancy at its worst).

To break it down:
My gut feeling is that theforceBreakLoop part could be removed completely. The first loop of theFinalizer.DisposeAll runs before the removal of thetp_clear slot and the other loops after as my changes suggest.
But could you give a scenario in which .NET types are still in play and risk of not being finalized properly?

@lostmsu
Copy link
Member

TheforceBreakLoop exists because you can have a .NET objectO that has aPyObject field pointing to Python objectP which somehow points back toO. If there are only these two objects, and nothing else references them, in order to get them collected one needs to destroy one of the references.

It is possible that theO-P loop is only referenced by one of the reflected types. So whenRemoveTypes is called,O andP become garbage and might need to be collected. That constitutes the necessity to break loops even afterRemoveTypes.

Frawak reacted with thumbs up emoji

@Frawak
Copy link
ContributorAuthor

I created the scenario of your first paragraph with the code snippets shown below and debugged into the code (with my changes).
I could observe that theforceBreakLoop (beforeRemoveTypes) causes theLoopReference instance to be finalized at the nextGC.Collect(). Otherwise, the finalizer gets called when the program exits. So, I get the point behind it now.
But I do not comprehend the necessity of breaking loops afterRemoveTypes. I tried some additional code variations and debugged through them but could not figure out how to produce a respective example.

usingSystem;namespaceTemp;publicclassLoopReference{privatereadonlyobject_P;publicLoopReference(objectp){_P=p;}~LoopReference(){if(_PisIDisposablep)p.Dispose();}}
PythonEngine.Initialize();PyDictlocals=newPyDict();PythonEngine.Exec(@"import clrclr.AddReference(r'[the path to the DLL containing LoopReference]')from Temp import LoopReferenceclass P(object):    def set_o(self, o):        self.o = op = P()o = LoopReference(p)",null,locals);PythonEngine.Shutdown();

Ideally, after tp_clear is disabled we should also stop putting new objects into reflectedObjects.

Essentially, this is the biggest puzzle piece for me. How are new objects added? I do not see any follow-up code in the shutdown procedure that does that.

@filmor
Copy link
Member

Maybe you could introduce a flag that makes adding an object toreflectedObjects throw after Shutdown? That would at least give us a stacktrace.

@Frawak
Copy link
ContributorAuthor

@filmor Like so (see last commit)?

Could you start the workflows again? I ran the tests locally.

filmor reacted with thumbs up emoji

@Frawak
Copy link
ContributorAuthor

Frawak commentedJul 8, 2024
edited
Loading

In one of the workflow configurations, an obsolete test failed:

[Test]
[Obsolete("GC tests are not guaranteed")]
publicvoidCollectOnShutdown()

The failed assert is also before running the shutdown.

@GBBBAS
Copy link

GBBBAS commentedJul 11, 2024
edited
Loading

Can confirm this branch fixes my Access Violation Exception issue I've been getting. Any idea on when this can be released?

@filmor
Copy link
Member

@lostmsu Anything else you would like to see?

Copy link
Member

@lostmsulostmsu left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Still uneasy about this, I think it might break stuff, but I was not able to come up with a test case.

Can be merged aftercreationBlocking issue is resolved.

Otherwise if you have a Python object that needs to temporarily create a .NET object in its destructor (for instance write log summary to a file), its destructor will fail if it happens to be freed on the second iteration of loop breaking.
@lostmsulostmsu merged commitea059ca intopythonnet:masterAug 6, 2024
27 checks passed
@filmor
Copy link
Member

Nice, thank you very much for your contribution@Frawak. I'll see this weekend if there is more to do before cutting a new release.

@lostmsu
Copy link
Member

Not sure if it's related yet, but we have two crashes in CI.

@Frawak
Copy link
ContributorAuthor

The difference in the log of the two failed jobs compared to others is the following:
TheCollectOnShutdown test fails which I mentionedabove. The error messageGarbage should contains the collected object indicates that the test fails before the shutdown is called, i.e. before any change.

Assert.IsTrue(garbage.Contains(op),
"Garbage should contains the collected object");
PythonEngine.Shutdown();

Links to the log lines:
https://github.com/pythonnet/pythonnet/actions/runs/10274663048/job/28431871059#step:10:195
https://github.com/pythonnet/pythonnet/actions/runs/10274663048/job/28431871476#step:10:199

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@filmorfilmorfilmor approved these changes

@lostmsulostmsulostmsu approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

6 participants
@Frawak@lostmsu@rohanjain101@lambda@filmor@GBBBAS

[8]ページ先頭

©2009-2025 Movatter.jp