NotificationsYou must be signed in to change notification settings
Fork32k
Star67.2k

gh-129069: make list ass_slice and memory_repeat safe in free-threading#131882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

tom-pytel wants to merge13 commits intopython:main

base:main

Choose a base branch

fromtom-pytel:fix-issue-129069

Open

gh-129069: make list ass_slice and memory_repeat safe in free-threading#131882

tom-pytel wants to merge13 commits intopython:mainfromtom-pytel:fix-issue-129069

+84 −23

Conversation

Copy link

Contributor

tom-pytel commentedMar 29, 2025•
edited
Loading

This PR exists because of#129069 andTools/tsan/suppressions_free_threading.txt:

# List resizing happens through different paths ending in memcpy or memmove# (for efficiency), which will probably need to rewritten as explicit loops# of ptr-sized copies to be thread-safe.

It seems to hit the requirements: performance, atomicity assured (per pointer), TSAN shuts up. Pretty sure can remove thelist_ass_slice_lock_held andlist_inplace_repeat_lock_held suppressions, and if not yet then should be able to add the atomic memcpy tolist_resize where needed to be able to do so.

The "atomic memcpy" (atomic per ptr, not whole memcpy) functions are in the atomic wrappers header because they can be reused, if want otherwise though can move intolistobject.c, or somewhere else. Can also make non-inline callable real functions. Or the inline funcs could go intopyatomic.h?

@colesbury, you may recognize this, I did something similar for the lock-free array module PR, but this is simpler. PyObject pointers are a bit more fragile than pure values though so I assume you want this here?

Here is an example of what the inner loop aFT_ATOMIC_MEMMOVE_PTR_RELAXED compiles to, its norep movsq or other specialized move instructions, but its fast:

.L2971:# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR[rdi]# _160,* s# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void**d = (void**)dest+ n,**s = (void**)src+ n,**e = (void**)dest-1; d != e; d--, s--) {subrsi,64# d,# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void**d = (void**)dest+ n,**s = (void**)src+ n,**e = (void**)dest-1; d != e; d--, s--) {subrdi,64# s,# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void**)obj, value, __ATOMIC_RELAXED); }movQWORD PTR64[rsi],rax#,, _160movrax, QWORD PTR56[rdi]# _160,movQWORD PTR56[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR48[rdi]# _160,movQWORD PTR48[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR40[rdi]# _160,movQWORD PTR40[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR32[rdi]# _160,movQWORD PTR32[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR24[rdi]# _160,movQWORD PTR24[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR16[rdi]# _160,movQWORD PTR16[rsi],rax#,, _160# ./Include/cpython/pyatomic_gcc.h:387: { return (void*)__atomic_load_n((void* const*)obj, __ATOMIC_RELAXED); }movrax, QWORD PTR8[rdi]# _160,movQWORD PTR8[rsi],rax#,, _160cmpQWORD PTR32[rsp],rsi# %sfp, djne.L2971#,

Simple benchmark. Note the decrementing address copy case inFT_ATOMIC_MEMMOVE_PTR_RELAXED seems to hurt a bit cache-wise in the 'tins' simple bench, but it doesn't seem to make a difference in the large move or overall inpyperformance. I've had this one jump around from parity with current to this (which is the worst case so I put it here). The difference disappears if the realmemmove is used in this case but the difference seems to come from the realmemmove using special instructions (which are not atomic), a la:https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. The other two simple (tdel andtins big) are solid where they are. Test script:

fromtimeitimporttimeitfromtimeimporttimetdel=timeit('del dst[:1024]','dst = [None] * 4096',number=10000000)print(f't del:{tdel}')tins=timeit('dst[:0] = src','dst = [None] * 4096; dst.pop(); src = [None]',number=100000)# pop() to make sure no realloc in timingprint(f't ins:{tins}')dst= [None]*0x10000000src= [None]dst.pop()t0=time()dst[:0]=srct1=time()print(f't ins big:{t1-t0}')

Times, average of 10 runs each:

current main:-------------t del: 0.2191115931245804t ins: 0.3050231862498549t ins big: 0.0799584686756134fix-129069:-----------t del: 0.21582267370067712t ins: 1.072859044800134t ins big: 0.08154952526092529

pyperformance benchmark. Difference in average performance is essentially nonexistent though individual tests can vary a bit (AMD 7950x running in VirtualBox):

                           current main   fix-129069    norm diffasync_generators                    250          254      -0.0157asyncio_tcp                         248          247       0.0040asyncio_tcp_ssl                    1.08         1.06       0.0189asyncio_websockets                  344          342       0.0058chaos                              42.3         42.4      -0.0024comprehensions                     11.1           11       0.0091bench_mp_pool                      17.5         17.4       0.0057bench_thread_pool                  1.16         1.09       0.0642coroutines                         12.8         12.8       0.0000coverage                             57         54.5       0.0459crypto_pyaes                       51.9         51.6       0.0058deepcopy                            182          180       0.0111deepcopy_reduce                    1.91         1.94      -0.0155deepcopy_memo                      20.2         21.7      -0.0691deltablue                           2.3          2.3       0.0000django_template                    25.2           25       0.0080docutils                           1.61         1.58       0.0190dulwich_log                        26.3         25.8       0.0194fannkuch                            260          257       0.0117float                              42.4         43.5      -0.0253create_gc_cycles                    494          512      -0.0352gc_traversal                       1.17         1.27      -0.0787generators                         20.9         21.4      -0.0234genshi_text                        16.6         16.7      -0.0060genshi_xml                         34.3         34.9      -0.0172go                                 80.3         80.8      -0.0062hexiom                             4.17         4.22      -0.0118html5lib                           33.2           34      -0.0235json_dumps                         7.43         7.31       0.0164json_loads                         14.2         13.9       0.0216logging_format                     4.58         4.61      -0.0065logging_silent                     64.8         63.7       0.0173logging_simple                      4.2         4.22      -0.0047mako                               7.95         7.97      -0.0025mdp                                1.78         1.98      -0.1010meteor_contest                     68.2         69.7      -0.0215nbody                              82.6         78.8       0.0482nqueens                            58.6         61.1      -0.0409pathlib                            13.8         13.9      -0.0072pickle                             6.12         6.13      -0.0016pickle_dict                        15.3         15.2       0.0066pickle_list                         2.3         2.18       0.0550pickle_pure_python                  204          203       0.0049pidigits                            116          117      -0.0085pprint_safe_repr                    499          513      -0.0273pprint_pformat                     1.02         1.05      -0.0286pyflate                             296          304      -0.0263python_startup                     9.29         9.29       0.0000python_startup_no_site             7.78         7.62       0.0210raytrace                            192          193      -0.0052regex_compile                      74.4         74.1       0.0040regex_dna                           111          110       0.0091regex_effbot                       1.69          1.7      -0.0059regex_v8                           13.8         13.7       0.0073richards                           32.1         32.4      -0.0093richards_super                     36.8         37.3      -0.0134scimark_fft                         232          231       0.0043scimark_lu                           81         80.7       0.0037scimark_monte_carlo                45.9         47.9      -0.0418scimark_sor                        77.9         76.3       0.0210scimark_sparse_mat_mult            3.83         3.78       0.0132spectral_norm                      68.5           67       0.0224sqlglot_normalize                   197          196       0.0051sqlglot_optimize                   36.8         36.9      -0.0027sqlglot_parse                       887          888      -0.0011sqlglot_transpile                   1.1          1.1       0.0000sqlite_synth                       1.43         1.44      -0.0069sympy_expand                        277          276       0.0036sympy_integrate                    12.8         12.7       0.0079sympy_sum                          88.1         85.3       0.0328sympy_str                           164          162       0.0123telco                              5.47         5.39       0.0148tomli_loads                        1.37         1.37       0.0000typing_runtime_protocols            112          112       0.0000unpack_sequence                    28.7         28.6       0.0035unpickle                              8         8.18      -0.0220unpickle_list                      2.87         2.75       0.0436unpickle_pure_python                144          140       0.0286xml_etree_parse                      74         72.6       0.0193xml_etree_iterparse                49.6         47.8       0.0377xml_etree_generate                 56.2         56.9      -0.0123xml_etree_process                  40.4         40.3       0.0025AVERAGE                                                   -0.0001

Issue:Race in concurrent list mutation and item retrieval #129069

pythongh-129069: make list ass_slice and memory_repeat safe

0c8dcfc

bedevere-appbot added the awaiting review label

Mar 29, 2025

bedevere-appbot mentioned this pull request

Mar 29, 2025

Race in concurrent list mutation and item retrieval#129069

Open

blurb-itbotand others added2 commits

March 29, 2025 20:12

📜🤖 Added by blurb_it.

67c9459

include atomic wrappers in pycore_list.h

e460b5a

Copy link

ContributorAuthor

tom-pytel commentedMar 29, 2025

Ping@colesbury ,@Yhg1s , Windows doesn't likesize_t as a parameter, will fix, but would like a thumbs up or down on this PR in general?

provisional fix to compile on windows

fcdb634

tom-pytel changed the title~~gh-129069: make list ass_slice and memory_repeat safe~~gh-129069: make list ass_slice and memory_repeat safe in free-threading

Mar 29, 2025

Yhg1s reviewed

Apr 1, 2025

View reviewed changes

Include/internal/pycore_pyatomic_ft_wrappers.h OutdatedShow resolvedHide resolved

Include/internal/pycore_pyatomic_ft_wrappers.h Outdated

Comment on lines 124 to 125

		void *v = _Py_atomic_load_ptr_relaxed(s);
		_Py_atomic_store_ptr_relaxed(d, v);

Copy link

Member

Yhg1sApr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think we need a relaxed load here (and in memmove), at least not for the list/tuple cases (writes always happen with the lock held for lists, and tuples are supposed to be immutable after they're shared). I'm not sure if we're going to end up using this in situations where we do need relaxed loads though, but I guess we could document it appropriately and revisit if we come across a need.

I'm not sure if the stores should be relaxed, though. We use release order when inserting new items in list_ass_slice_lock_held, but I'm not sure I understand why. I can't think of reorderings that would affect loads from those items.@colesbury what do you think? Do we need a single release fence, or do we need a strict ordering among the stores?

(The function name will need changing if we don't use relaxed loads/stores, and it should probably be something like_Py_atomic_memcpy_ptr_unsafe_release to indicate it reads unsafely but release-stores.)

Copy link

ContributorAuthor

tom-pytelApr 1, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Some context - like I said in the header I did something similar in array module and there I had to make reads atomic in resize because there were lock-free writes. I guess that doesn't apply here, but making reads non-atomic still compiles to essentially the same thing (on Intel), which makes sense since the data still needs to be written in ptr size chunks for the atomic write. So basically nothing is really gained by making reads non-atomic (relaxed) in this case (at least on Intel).

Non-atomic read with atomic write inner loop:

.L2929:movrdi, QWORD PTR[rax]# MEM[(void**)s_289], MEM[(void**)s_289]# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void**d = (void**)dest,**s = (void**)src,**e = d+ n / sizeof(void*); d != e; d++, s++) {addrsi,64# d,# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void**d = (void**)dest,**s = (void**)src,**e = d+ n / sizeof(void*); d != e; d++, s++) {addrax,64# s,# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void**)obj, value, __ATOMIC_RELAXED); }movQWORD PTR-64[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-56[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-56[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-48[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-48[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-40[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-40[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-32[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-32[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-24[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-24[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-16[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-16[rsi],rdi#,, MEM[(void**)s_289]movrdi, QWORD PTR-8[rax]# MEM[(void**)s_289], MEM[(void**)s_289]movQWORD PTR-8[rsi],rdi#,, MEM[(void**)s_289]cmprax, QWORD PTR16[rsp]# s, %sfpjne.L2929#,

Waiting on RELEASE decision.

Copy link

Contributor

colesburyApr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Let's continue to userelaxed atomics for these functions for now.

The release ordering for storing new items is important so that all the non-atomic initializations of the item (like reference count,ob_type, etc.) are visiblebefore the store of the object into the list.

Moving items around within the list is less clear, but "relaxed" is the same behavior that we use for moving items inins1() so let's stick with it for now and revisit it later.

Copy link

ContributorAuthor

tom-pytelApr 8, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Removed atomic load and renamed to stuff like_Py_atomic_memcpy_ptr_store_relaxed. Probably better to do only what you explicitly need. If you want a full relaxed version in future just copy and tweak.