Functions using blas cause a segfault (SIGSEGV) #617

Closed
Labels: bug (Something isn't working), ml

Description

@eeeebbbbrrrr

After working with @levkk and @montanalow to install PostgresML (as of master:63ebce3) on my Linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.

As an example:

[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!>

Postgres logs leading up to a crash against pgml.cosine_similarity() are:

/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
  warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG:  server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL:  Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG:  terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG:  all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL:  Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG:  database system is shut down

The backtrace from a --debug build of pgml is:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0  0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1  0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2  0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3  0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4  0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5  0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (data=0x7ffe798f2828)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6  0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7  0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8  0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9  0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()
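
Frames #1 and #2 show the crash happening inside sdot_, the single-precision BLAS dot product, called from cosine_similarity_s with n=1024 and unit strides. For anyone who wants to cross-check results while BLAS is misbehaving, here is a minimal BLAS-free sketch of the same math. It is not pgml's implementation, and the function names are my own, chosen only for illustration.

```rust
/// L1 norm: the sum of absolute values of the elements.
pub fn norm_l1(v: &[f32]) -> f32 {
    v.iter().map(|x| x.abs()).sum()
}

/// Plain dot product over two equal-length slices, i.e. what an
/// sdot call with incx = incy = 1 computes.
pub fn dot(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "vectors must have the same length");
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Cosine similarity: dot(x, y) / (|x| * |y|).
/// A zero vector makes the denominator 0 and the result NaN.
pub fn cosine_similarity(x: &[f32], y: &[f32]) -> f32 {
    let denom = dot(x, x).sqrt() * dot(y, y).sqrt();
    dot(x, y) / denom
}
```

Calling norm_l1(&[1.0, 2.0, 3.0]) gives 6.0, which matches what pgml.norm_l1(ARRAY[1,2,3]::real[]) should return.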

My box is a (humblebrag):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3970X 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7386.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
                         rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2
                         movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
                         tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2
                         smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
                         cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
                         pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With an nvidia RTX 4080:

  nvidia-debugdump -l
Found 1 NVIDIA devices
   Device ID:              0
   Device name:            NVIDIA GeForce RTX 4080   (*PrimaryCard)
   GPU internal ID:        GPU-b772ddf7-d413-e1bb-d1e1-8e7022c59343

Lev helped me discover that by commenting out this line:

println!("cargo:rustc-link-lib=static=openblas");

everything works:

[v15.1][8595] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
 norm_l1
---------
       6
(1 row)

Time: 0.620 ms
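
To make that experiment repeatable without hand-editing the build script, one option might be to gate the directive behind an environment variable. The sketch below is not PostgresML's build script: the variable name PGML_STATIC_OPENBLAS and both helper functions are hypothetical, and only the directive string itself comes from the line quoted above.

```rust
/// Hypothetical helper: decide whether to emit the static OpenBLAS directive.
/// Static linking stays the default; setting the variable to "0" turns it off.
pub fn static_openblas_requested(value: Option<&str>) -> bool {
    !matches!(value, Some("0"))
}

/// Hypothetical helper: the cargo directives a build script would print.
pub fn openblas_link_directives(link_static: bool) -> Vec<&'static str> {
    if link_static {
        vec!["cargo:rustc-link-lib=static=openblas"]
    } else {
        // Emit nothing, which is equivalent to commenting the line out.
        Vec::new()
    }
}

// In build.rs, main() could use the helpers roughly like this:
//
//     println!("cargo:rerun-if-env-changed=PGML_STATIC_OPENBLAS");
//     let value = std::env::var("PGML_STATIC_OPENBLAS").ok();
//     for directive in openblas_link_directives(static_openblas_requested(value.as_deref())) {
//         println!("{directive}");
//     }
```

With something like this in place, building with PGML_STATIC_OPENBLAS=0 would reproduce the working configuration, and an ordinary build would reproduce the crash, which should make it easier to bisect.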

This crash seems to be isolated to blas, as I created 100k embeddings with pgml.embed() in a mere 7m 50s, using 4 parallel workers, even. So that part is good.

I had a thought that rebooting the computer might help since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.

A theory is that since pgml links to so many libraries (probably directly and indirectly) that maybe there's some kind of symbol resolution problem and the wrong symbols are being called? Just a theory.
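
A rough way to start testing that theory is to look at which BLAS-like shared objects are mapped into the backend process. The sketch below assumes Linux and the usual /proc/<pid>/maps layout; the function name and the list of name fragments it searches for are my own guesses, not anything pgml or pgrx provides.

```rust
/// Return the distinct file-backed mappings in a /proc/<pid>/maps dump whose
/// file name contains a BLAS-related fragment (a heuristic, not a proof).
pub fn blas_like_objects(maps: &str) -> Vec<String> {
    // Name fragments to look for; this list is an assumption.
    const NEEDLES: [&str; 3] = ["blas", "mkl", "lapack"];
    let mut found: Vec<String> = Vec::new();
    for line in maps.lines() {
        // Fields: address perms offset dev inode [pathname].
        // Paths containing spaces are cut at the first space.
        if let Some(path) = line.split_whitespace().nth(5) {
            if !path.starts_with('/') {
                continue; // skip pseudo-entries such as [heap] or [stack]
            }
            let file_name = path.rsplit('/').next().unwrap_or(path).to_lowercase();
            let matches = NEEDLES.iter().any(|n| file_name.contains(n));
            if matches && !found.iter().any(|p| p == path) {
                found.push(path.to_string());
            }
        }
    }
    found
}
```

To try it against a live backend, pass it the contents of /proc/<pid>/maps (for example via std::fs::read_to_string), with the PID taken from pg_backend_pid(). An empty result would not rule the theory out: the backtrace shows sdot_ coming from pgml.so itself, so the statically linked copy never appears as a separate object, and BLAS builds bundled with Python packages may go by other names. More than one hit, on the other hand, would suggest several BLAS implementations are competing for the same symbol names.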

@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!
