The setjmp() and longjmp() functions have been part of the standard C library since something close to the beginning. They can be used to perform stack unwinding: a sort of "long return" from a function that skips over any number of intervening function calls. Both of these functions take an opaque jmp_buf data structure as an argument. The caller provides the buffer to setjmp(), which fills it with the information needed to make another return to the location of that call. A later call to longjmp() with that buffer will then cause setjmp() to appear to have returned a second time.
Back in April, developers from IBM committed a patch that changed the size of the jmp_buf structure on the s390 architecture; this change, which subsequently became part of the 2.19 release, was apparently needed to enable better hardware support for setjmp() and longjmp(). Since jmp_buf is a type that is visible to applications, this was a clear ABI change, with all of the possible problems that can go with it. For example, newer glibc releases expect the larger jmp_buf size, but they may be linked (at run time) against applications that have not been rebuilt and, thus, are still working with the older version of jmp_buf.
This possibility was taken into account, though. Symbol versioning was used to provide compatible versions of setjmp() and longjmp() for these older applications. So, in theory, things should Just Work without additional problems. This particular theory did not last long after its encounter with the real world, though.
The problem is that jmp_buf structures are often embedded into other structures, so a change in the size of that structure will change the containing structures too. To find victims, one need not even look outside of glibc; it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications. So, as a result, a number of pthreads functions need to become versioned as well.
Versioning does not work, though, for problems that pop up outside of glibc. Consider, for example, the Perl interpreter, which embeds a jmp_buf in its main "this is a running Perl instance" structure. That has caused various Perl modules to fail (example) and can only really be fixed by rebuilding the entire Perl environment. The PNG image format library (libpng) also has an embedded jmp_buf, in a structure that is used by all PNG-using applications.
Debian's developers, who were trying to clean up this mess, considered rebuilding all of Perl and then, perhaps, all (500 or so) packages depending on the PNG library. But, by this point, it became clear that the ripples from this change spread widely indeed and that playing whack-a-mole may never get all of them fixed. So the Debian developers have figured that the course they may have to consider is to "do like Red Hat, ie just rebuild everything and warn the users their system might break during upgrade." Needless to say, this approach lacks appeal, especially in the Debian world, where mass rebuilds are a rare event.
Even then, of course, there is the problem of end-user applications. Distributors cannot rebuild those; even worse, the user may not be able to either. So some things might just be broken.
One might be thinking that there is a mechanism in place for this kind of incompatible ABI change. Shared libraries have a shared-object name ("soname") built into them; applications linked against those libraries also contain that name. For glibc on your editor's system, for example, the soname is "libc.so.6". The runtime linker will not link an application against a shared object if the sonames do not match. In this way, the system can disallow running against a library that will not work. It also enables, in theory, the parallel installation of multiple versions of the library; older applications would continue to use the older library, while newly built binaries would use the current version.
So the glibc project could consider making a point release with a different soname (libc.so.6.1, say); distributors could then install the result alongside an older version of the library and, in theory, things should work. Except that glibc developer Carlos O'Donell tried it and concluded that:
The SO name bump in a mixed-ABI environment like debian results in two libc's being loaded and competing for effectively the same namespace of symbols with resolution (and therefore selection of the ABI) being determined by ELF interposition and scope rules. It's a nightmare. It's possible a worse solution than just telling everyone to rebuild and get on with their lives.
It also turns out to be painful to bootstrap a system with a new, ABI-incompatible version of the C library. So it seems that the soname change will not happen and that, on s390, a lot of rebuilding is going to have to go on. It will also become impossible to move affected applications between systems with pre- and post-change libraries. Not fun, but, as David Miller put it:
That leads to the obvious question: what can be done to avoid this kind of problem in the future? Carlos plans to put together a policy on how to manage ABI changes, with "don't break ABI ever" as the first item. There has been talk of improving the testing tools in an attempt to catch this kind of ABI break in the future.
In the end, though, nothing can replace a high level of care on the part of the developers involved. Glibc developers have always shown that care, which is why stories like this one are rare. In the aftermath of this mistake, one can assume that they will be doubly careful in the future. That, along with some testing support, should help to ensure that upcoming glibc releases are free of this kind of issue.
The glibc s390 ABI break

Posted Jul 17, 2014 7:07 UTC (Thu) by airlied (subscriber, #9104)

though whether that would be because he'd catch it or just have never applied the patch, who knows!

Posted Jul 17, 2014 10:48 UTC (Thu) by danpb (subscriber, #4831)

Seems like some kind of automated testing of the public ABI could have caught this problem. ie something that validates that the size of any & every public struct does not change. Of course changing the jmpbuf size was a deliberate decision, but the ripple effects it caused on other structs could have been identified sooner perhaps causing a rethink on the change to jmpbuf.

Posted Jul 17, 2014 14:28 UTC (Thu) by jtaylor (subscriber, #91739)

https://sourceware.org/glibc/wiki/Testing/ABI_checker#gli...

Posted Jul 17, 2014 15:07 UTC (Thu) by mathstuf (subscriber, #69389)

So it's translucent? This is not the definition of "opaque" structure I'm used to in C.

Is the reason we don't support looking up symbols only in global and directly linked libraries due to performance and too much extra bookkeeping? I'd really like this to be possible as well:

- libA.so links libC.so
- libB.so links libC.so
- myapp does *not* link libC.so
- myapp: dlopen("libA.so", RTLD_LOCAL | RTLD_NOW); // opens libC.so implicitly
- myapp: dlopen("libB.so", RTLD_LOCAL | RTLD_NOW); // fails with missing symbols from libC.so

Since libB.so directly and explicitly links libC.so; why is it denied access to libC.so based on libA.so's transitive linking? If libC.so were opened directly with RTLD_LOCAL, I could see some logic behind it, but this makes much less sense to me and basically means when loading a plugin, I have to use RTLD_GLOBAL or risk this exact problem.

Posted Jul 17, 2014 17:44 UTC (Thu) by RobSeace (subscriber, #4435)

Yeah, jmp_buf is definitely not opaque... It's fully defined in <setjmp.h> (and some other files like <bits/setjmp.h> for the types of some of its members)... As you point out, if it were truly opaque, no one would be able to embed it anywhere, because they wouldn't have a full definition for it! They could basically only work with pointers to it... (I'm not sure if there are any true opaque structs in glibc... In theory, stdio FILE could probably be opaque, but in practice it's not... Maybe DIR is?) I suppose it's "opaque" in a way, since the majority of it is just defined as a bunch of nondescript ints whose meaning is left as a complete mystery to the caller... So, one is obviously not meant to go poking in it...

Posted Jul 31, 2014 0:32 UTC (Thu) by vomlehn (guest, #45588)

ABIs are *hard*
Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds