fix `close` error due to race in `d_closeall` #248
Conversation
The code below, which adds some GC manipulation to control when collection happens, can recreate the situation:

```julia
julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays
       end

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> DistributedArrays.d_closeall()
ERROR: MethodError: no method matching close(::Nothing)

Closest candidates are:
  close(::Union{Base.AsyncCondition, Timer})
   @ Base asyncevent.jl:162
  close(::Union{FileWatching.FileMonitor, FileWatching.FolderMonitor, FileWatching.PollingFileWatcher})
   @ FileWatching /data/Work/julia/binaries/julia-1.9.3/share/julia/stdlib/v1.9/FileWatching/src/FileWatching.jl:328
  close(::DArray)
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:34
  ...

Stacktrace:
 [1] d_closeall()
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:47
 [2] top-level scope
   @ REPL[9]:1
```
It seems possible that `DistributedArrays.d_closeall()` may encounter a condition where it finds a darray id in the `registry`, but the corresponding weakref value is `nothing` because the referenced darray got garbage collected. This has been encountered many times in CI and elsewhere, but is hard to replicate normally. The fix adds a check for the weakref value before actually invoking `close` on it.
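A minimal sketch of the guarded loop, assuming the same `registry` and `d_from_weakref_or_d` helpers shown in the reproduction above (this illustrates the check, not the exact patch):

```julia
# Sketch only: skip registry entries whose WeakRef has already been cleared
# by the GC instead of calling close(nothing); the finalizer of the collected
# darray takes care of its own cleanup.
function d_closeall_sketch()
    for id in collect(keys(DistributedArrays.registry))
        d = DistributedArrays.d_from_weakref_or_d(id)
        d === nothing && continue   # darray was garbage collected between lookup and close
        close(d)
    end
end
```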
Can we be sure that the remote memory is properly freed if the reference has been garbage collected?
Yes, I think so, because the finalizer invokes `close`. The session below shows the registry entries being cleaned up on all workers:

```julia
julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays
           function print_registry_entries()
               for k in keys(DistributedArrays.registry)
                   hasval = !(DistributedArrays.d_from_weakref_or_d(k) === nothing)
                   println("key $k hasval $hasval")
               end
           end
       end

julia> @everywhere print_registry_entries()

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> @everywhere print_registry_entries()
key (1,1) hasval true
      From worker 5:    key (1,1) hasval true
      From worker 3:    key (1,1) hasval true
      From worker 2:    key (1,1) hasval true
      From worker 4:    key (1,1) hasval true

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> @everywhere print_registry_entries()
key (1,1) hasval false
      From worker 2:    key (1,1) hasval true
      From worker 3:    key (1,1) hasval true
      From worker 5:    key (1,1) hasval true
      From worker 4:    key (1,1) hasval true

julia> DistributedArrays.d_closeall()

julia> @everywhere print_registry_entries()
```

Would it be good to add this as a test?
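If it were added, a regression test based on the reproduction above could look roughly like this sketch (the testset name, worker count, and no-throw check are assumptions, not part of this PR):

```julia
# Hypothetical regression test sketch for the close(::Nothing) race.
# It reuses the trick from the reproduction: GC is re-enabled inside the
# registry lookup so the WeakRef is cleared only after d_closeall has found
# the key. Note the override changes the helper for the rest of the test run.
using Test, Distributed, DistributedArrays

nprocs() == 1 && addprocs(2)
@everywhere using DistributedArrays

function DistributedArrays.d_from_weakref_or_d(id)
    d = get(DistributedArrays.registry, id, nothing)
    GC.enable(true)
    GC.gc()
    isa(d, WeakRef) && return d.value
    return d
end

@testset "d_closeall tolerates a garbage-collected darray" begin
    DA = distribute(rand(1:100, (100, 100)))
    GC.enable(false)   # prevent the finalizer from running before d_closeall
    DA = nothing       # drop the last strong reference
    # With the fix this must not throw MethodError: no method matching close(::Nothing)
    @test (DistributedArrays.d_closeall(); true)
end
```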
I think it is fine as it is
Fixes #246