The glibc s390 ABI break


By Jonathan Corbet
July 16, 2014
The GNU C library (glibc) project has long lived up to a reputation for conservatism; glibc developers know that an ill-chosen change can create a great deal of pain downstream, so they proceed with caution. Even so, mistakes can happen. A recent slip-up involving the s390 architecture makes it clear how one of those mistakes can cascade into a significant mess that is hard to clean up afterward.

The setjmp() and longjmp() functions have been part of the standard C library since something close to the beginning. They can be used to perform stack unwinding: a sort of "long return" from a function that skips over any number of intervening function calls. Both of these functions take an opaque jmp_buf data structure as an argument. The caller provides the buffer to setjmp(), which fills it with the information needed to make another return to the location of that call. A later call to longjmp() with that buffer will then cause setjmp() to appear to have returned a second time.

Back in April, developers from IBM committed a patch that changed the size of the jmp_buf structure on the s390 architecture; this change, which subsequently became part of the 2.19 release, was apparently needed to enable better hardware support for setjmp() and longjmp(). Since jmp_buf is a type that is visible to applications, this was a clear ABI change, with all of the possible problems that can go with it. For example, newer glibc releases expect the larger jmp_buf size, but they may be linked (at run time) against applications that have not been rebuilt and, thus, are still working with the older version of jmp_buf.

This possibility was taken into account, though. Symbol versioning was used to provide compatible versions of setjmp() and longjmp() for these older applications. So, in theory, things should Just Work without additional problems. This particular theory did not last long after its encounter with the real world, though.

The problem is that jmp_buf structures are often embedded into other structures, so a change in the size of that structure will change the containing structures too. To find victims, one need not even look outside of glibc; it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications. So, as a result, a number of pthreads functions need to become versioned as well.

Versioning does not work, though, for problems that pop up outside of glibc. Consider, for example, the Perl interpreter, which embeds a jmp_buf in its main "this is a running Perl instance" structure. That has caused various Perl modules to fail (example) and can only really be fixed by rebuilding the entire Perl environment. The PNG image format library (libpng) also has an embedded jmp_buf, in a structure that is used by all PNG-using applications.

Debian's developers, who were trying to clean up this mess, considered rebuilding all of Perl and then, perhaps, all (500 or so) packages depending on the PNG library. But, by this point, it became clear that the ripples from this change spread widely indeed and that playing whack-a-mole may never get all of them fixed. So the Debian developers have figured that the course they may have to consider is to "do like Red Hat, ie just rebuild everything and warn the users their system might break during upgrade." Needless to say, this approach lacks appeal, especially in the Debian world, where mass rebuilds are a rare event.

Even then, of course, there is the problem of end-user applications. Distributors cannot rebuild those; even worse, the user may not be able to either. So some things might just be broken.

One might be thinking that there is a mechanism in place for this kind of incompatible ABI change. Shared libraries have a shared-object name ("soname") built into them; applications linked against those libraries also contain that name. For glibc on your editor's system, for example, the soname is "libc.so.6". The runtime linker will not link an application against a shared object if the sonames do not match. In this way, the system can disallow running against a library that will not work. It also enables, in theory, the parallel installation of multiple versions of the library; older applications would continue to use the older library, while newly built binaries would use the current version.

So the glibc project could consider making a point release with a different soname (libc.so.6.1, say); distributors could then install the result alongside an older version of the library and, in theory, things should work. Except that glibc developer Carlos O'Donell tried it and concluded that:

It's unsupportable as a solution for glibc.

The SO name bump in a mixed-ABI environment like debian results in two libc's being loaded and competing for effectively the same namespace of symbols with resolution (and therefore selection of the ABI) being determined by ELF interposition and scope rules. It's a nightmare. It's possible a worse solution than just telling everyone to rebuild and get on with their lives.

It also turns out to be painful to bootstrap a system with a new, ABI-incompatible version of the C library. So it seems that the soname change will not happen and that, on s390, a lot of rebuilding is going to have to go on. It will also become impossible to move affected applications between systems with pre- and post-change libraries. Not fun, but, as David Miller put it:

Therefore, on the negative side, we might be stuck with this. But, on the positive side, we can refer to this incident next time a similar incident arises. We now know exactly what the ramifications are for not handling this properly.

That leads to the obvious question: what can be done to avoid this kind of problem in the future? Carlos plans to put together a policy on how to manage ABI changes, with "don't break ABI ever" as the first item. There has been talk of improving the testing tools in an attempt to catch this kind of ABI break in the future.

In the end, though, nothing can replace a high level of care on the part of the developers involved. Glibc developers have always shown that care, which is why stories like this one are rare. In the aftermath of this mistake, one can assume that they will be doubly careful in the future. That, along with some testing support, should help to ensure that upcoming glibc releases are free of this kind of issue.



The glibc s390 ABI break

Posted Jul 17, 2014 7:07 UTC (Thu) by airlied (subscriber, #9104)

never would have happened on Uli's watch!

though whether that would be because he'd catch it or just have never applied the patch, who knows!

The glibc s390 ABI break

Posted Jul 17, 2014 9:30 UTC (Thu) by jhhaller (guest, #56103)

I know of one such patch he rejected. The semaphores in shared memory using sem_init are a different size for 32-bit binaries and 64-bit binaries, meaning that semaphore can't be shared by 32-bit and 64-bit binaries. A change was proposed to change the 32-bit version to be compatible with the 64-bit version, and it was rapidly shot down by Uli for breaking API compatibility.

The glibc s390 ABI break

Posted Jul 22, 2014 18:38 UTC (Tue) by fw (subscriber, #26023)

What about the "extern int errno" business? (Yes, I know, that's pretty lame, but it still hurt when you were affected by it.)

The glibc s390 ABI break

Posted Jul 17, 2014 10:48 UTC (Thu) by danpb (subscriber, #4831)


Seems like some kind of automated testing of the public ABI could have caught this problem, i.e. something that validates that the size of each and every public struct does not change. Of course changing the jmp_buf size was a deliberate decision, but the ripple effects it caused on other structs could have been identified sooner, perhaps causing a rethink on the change to jmp_buf.

The glibc s390 ABI break

Posted Jul 17, 2014 14:28 UTC (Thu) by jtaylor (subscriber, #91739)

This is done, but only for x86. This ABI break affects only S390.
https://sourceware.org/glibc/wiki/Testing/ABI_checker#gli...

The glibc s390 ABI break

Posted Jul 17, 2014 15:07 UTC (Thu) by mathstuf (subscriber, #69389)



So it's translucent? This is not the definition of "opaque" structure I'm used to in C.


Is the reason we don't support looking up symbols only in global and directly linked libraries due to performance and too much extra bookkeeping? I'd really like this to be possible as well:

— libA.so links libC.so
— libB.so links libC.so
— myapp does *not* link libC.so
— myapp: dlopen("libA.so", RTLD_LOCAL | RTLD_NOW); // opens libC.so implicitly
— myapp: dlopen("libB.so", RTLD_LOCAL | RTLD_NOW); // fails with missing symbols from libC.so

Since libB.so directly and explicitly links libC.so; why is it denied access to libC.so based on libA.so's transitive linking? If libC.so were opened directly with RTLD_LOCAL, I could see some logic behind it, but this makes much less sense to me and basically means when loading a plugin, I have to use RTLD_GLOBAL or risk this exact problem.

The glibc s390 ABI break

Posted Jul 17, 2014 17:44 UTC (Thu) by RobSeace (subscriber, #4435)


Yeah, jmp_buf is definitely not opaque... It's fully defined in <setjmp.h> (and some other files like <bits/setjmp.h> for the types of some of its members)... As you point out, if it were truly opaque, no one would be able to embed it anywhere, because they wouldn't have a full definition for it! They could basically only work with pointers to it... (I'm not sure if there are any true opaque structs in glibc... In theory, stdio FILE could probably be opaque, but in practice it's not... Maybe DIR is?)

I suppose it's "opaque" in a way, since the majority of it is just defined as a bunch of nondescript ints whose meaning is left as a complete mystery to the caller... So, one is obviously not meant to go poking in it...

The glibc s390 ABI break

Posted Jul 22, 2014 18:39 UTC (Tue) by fw (subscriber, #26023)

DIR is opaque. Historically, DIR * was sometimes implemented as an integer file descriptor cast to a pointer, which is why readdir used a static, global buffer and was not thread-safe.

The glibc s390 ABI break

Posted Jul 17, 2014 18:05 UTC (Thu) by Karellen (subscriber, #67644)

I'm wondering if Debian could solve this better with Multiarch, to create two entirely distinct "architectures" for the same hardware, rather than attempting a libc soname bump within the current s390 arch.

https://wiki.debian.org/Multiarch

The glibc s390 ABI break

Posted Jul 18, 2014 1:18 UTC (Fri) by mathstuf (subscriber, #69389)

Well, any new builds will be the "new" architecture, so it isn't like the "old" architecture has any kind of future. Are there really enough idle hands on the s390 porters team for Debian that this is viable anyways?

ABIs are *hard*

Posted Jul 31, 2014 0:32 UTC (Thu) by vomlehn (guest, #45588)

I spent a lot of time with the MIPS ABI Group and learned a lot about how hard it is to deal with ABIs. You really cannot change the size of anything without breaking compatibility. Ever. To avoid this, we scrutinized proposed data structures before adopting them to ensure that they would never need to grow. In one case, one vendor had a data structure several times larger than the size everyone else used. That became the size for everyone because we needed to support all the implementations. And then we lived with it. I still wonder why they really needed all that room.

The glibc s390 ABI break

Posted Jul 31, 2014 18:34 UTC (Thu) by sharkcz (guest, #52232)

Yes, it was a nice exercise :-) I ended up doing an ad-hoc rebuild of roughly 85 Perl modules and 2 other libraries for Fedora; the rest was (and is being) processed during the continuous Rawhide build process.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds

