Jul 24, 2018 •Internship,Programming,Rust •Edits •Permalink

Pointers Are Complicated, or: What's in a Byte?

This summer, I am againworking on Rust full-time, and again I will work (amongst other things) on a “memory model” for Rust/MIR.However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that “pointers are simple: they are just integers”.Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers.

I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just whatis the data that is stored in memory?It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte?Again, it turns out “it’s just an 8-bit integer” does not actually work as the answer.

I hope that by the end of this post, you will agree with me on both of these statements. :)

Pointers Are Complicated

What is the problem with “pointers are just integers”? Let us consider the following example:
(I am using C++ code here mostly because writing unsafe code is easier in C++ than in Rust, and unsafe code is where these problems really show up. C has all the same issues, as does unsafe Rust.)

inttest(){autox=newint[8];autoy=newint[8];y[0]=42;inti=/* some side-effect-free computation */;autox_ptr=&x[i];*x_ptr=23;returny[0];}

It would be beneficial to be able to optimize the final read ofy[0] to just return42.C++ compilers regularly perform such optimizations as they are crucial for generating high-quality assembly.¹The justification for this optimization is that writing tox_ptr, which points intox, cannot changey.

However, given how low-level a language C++ is, we can actually break this assumption by settingi toy-x.Since&x[i] is the same asx+i, this means we are actually writing23 to&y[0].

Of course, that does not stop C++ compilers from doing these optimizations.To allow this, the standard declares our code to haveundefined behavior.²

First of all, it is not allowed to perform pointer arithmetic (like&x[i] does) that goesbeyond either end of the array it started in.Our program violates this rule:x[i] is outside ofx, so this is undefined behavior.To be clear: Just thecomputation ofx_ptr is already UB, we don’t even get to the part where we want touse this pointer!³

But we are not done yet: This rule has a special exception that we can exploit to our advantage.If the arithmetic ends up computing a pointerjust past the end of an allocation, that computation is fine.(This exception is necessary to permit computingvec.end() for the usual kind of C++98 iterator loop.)

So let us change this example a bit:

inttest(){autox=newint[8];autoy=newint[8];y[0]=42;autox_ptr=x+8;// one past the endif(x_ptr==&y[0])*x_ptr=23;returny[0];}

Now, imagine thatx andy have been allocatedright next to each other, withy having the higher address.Thenx_ptr actually pointsright at the beginning ofy!The conditional is true and the write happens.Still, there is no UB due to out-of-bounds pointer arithmetic.

This seems to break the optimization.However, the C++ standard has another trick up its sleeve to help compiler writers: It doesn’t actually allow us touse ourx_ptr above.According to what the standard says aboutaddition on pointers,x_ptr points “one past the last element of the array object”.It doesnot point at an actual element of another objecteven if they have the same address. (At least, that is the common interpretation of the standard based on whichLLVM optimizes this code.)

The key point here is that just becausex_ptr and&y[0] point to the sameaddress, that does not make themthe same pointer, i.e., they cannot be used interchangeably:&y[0] points to the first element ofy;x_ptr points past the end ofx.If we replace*x_ptr = 23 by*&y[0] = 23, we change the meaning of the program, even though the two pointers have been tested for equality.

This is worth repeating:

Just because two pointers point to the same address, does not mean they are equal and can be used interchangeably.

If this sounds subtle, it is.In fact, this still causes miscompilations in bothLLVM andGCC.

Also notice that this one-past-the-end rule is not the only part of C/C++ where this effect can be witnessed.Another example is therestrict keyword in C, which can be used to express that pointers do not alias:

intfoo(int*restrictx,int*restricty){*x=42;if(x==y){*y=23;}return*x;}inttest(){intx;returnfoo(&x,&x);}

Callingtest() triggers UB because the two accesses infoo must not alias.Replacing*y by*x infoo changes the meaning of the program such that it no longer has UB.So, again, even thoughx andy have the same address, they cannot be used interchangeably.

Pointers are definitely not integers.

A Simple Pointer Model

So, whatis a pointer?I don’t know the full answer to this.In fact, this is an open area of research.

One important point to stress here is that we are just looking for anabstract model of the pointer.Of course, on the actual machine, pointers are integers.But the actual machine also does not do the kind of optimizations that modern C++ compilers do, so it can get away with that.If we wrote the above programs in assembly, there would be no UB, and no optimizations.C++ and Rust employ a more “high-level” view of memory and pointers, restricting the programmer for the benefit of optimizations.When formally describing what the programmer may and may not do in these languages, as we have seen, the model of pointers as integers falls apart, so we have to look for something else.This is another example of using a “virtual machine” that’s different from the real machine for specification purposes, which is an ideaI have blogged about before.

Here’s a simple proposal (in fact, this is the model of pointers used inCompCert and myRustBelt work, and it is also howmiri implementspointers):A pointer is a pair of some kind of ID uniquely identifying theallocation, and anoffset into the allocation.If we defined this in Rust, we might write

structPointer{alloc_id:usize,offset:isize,}

Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation.Subtracting a pointer from another is only allowed when both point to the same allocation (matchingC++).⁴

It turns out (and miri shows) that this model can get us very far.We always remember which allocation a pointer points to, so we can differentiate a pointer “one past the end” of one allocation from a pointer to the beginning of another allocation.That’s how miri can detect that our second example (with&x[8]) is UB.

The Model Falls Apart

In this model, pointers are not integers, but they are at least simple.However, this simple model starts to fall apart once you consider pointer-integer casts.In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable (i.e., itstype says it is an integer) whosevalue is a pointer (i.e., an allocation-offset pair).However, multiplying that “integer” by 2 leads to an error, because it is entirely unclear what it means to multiply such an abstract pointer by 2.

I should clarify that this isnot a good solution when defining language semantics.It works fine for an interpreter though.It is the most lazy thing to do, and we do it because it is not clear what else to do (other than not supporting these casts at all – but this way, miri can run more programs):In our abstract machine, there just is no single coherent “address space” that all allocations live in, that we could use to map every pointer to a distinct integer.Every allocation is just identified by an (unobservable) ID.We could now start to enrich this model with extra data like a base address for each allocation, and somehow use that when casting an integer back to a pointer… but that’s where it gets really complicated, and anyway discussing such a model is not the point of this post.The point it to discuss theneed for such a model.If you are interested, I suggest you readthis paper that explores the above idea of adding a base address.

Long story short, pointer-integer casts are messy and hard to define formally when also considering optimizations like we discussed above.There is a conflict between the high-level view that is required to enable optimizations, and the low-level view that is required to explain casting a pointer to an integer and back.We mostly just ignore the problem in miri and opportunistically do as much as we can, given the simple model we are working with.A full definition of a language like C++ or Rust of course cannot take this shortcut, it has to explain what really happens here.To my knowledge, no satisfying solution exists, but academic research isgetting closer.

Update: This was by no means meant to be an exhaustive list of academic research on C in general. I do not know of other work that focuses directly on the interplay of integer-pointer casts and optimizations, but other noteworthy work on formalizing C includesKCC,Robbert Krebber’s PhD thesis andCerberus./Update

This is why pointers are not simple, either.

From Pointers to Bytes

I hope I made a convincing argument that integers are not the only data one has to consider when formally specifying low-level languages such as C++ or (the unsafe parts of) Rust.However, this means that a simple operation like loading a byte from memory cannot just return au8.Imagine weimplementmemcpy by loading (in turn) every byte of the source into some local variablev, and then storing it to the target.What if that byte is part of a pointer? When a pointer is a pair of allocation and offset, what is its first byte?We have to say what the value ofv is, so we have to find some way to answer this question.(And this is an entirely separate issue from the problem with multiplication that came up in the last section. We just assume some abstract typePointer.)

We cannot represent a byte of a pointer as an element of0..256.Essentially, if we use a naive model of memory, the extra “hidden” part of a pointer (the one that makes it more than just an integer) would be lost when a pointer is stored to memory and loaded again.We have to fix this, so we have to extend our notion of a “byte” to accomodate that extra state.So, a byte is noweither an element of0..256 (“raw bits”),or the n-th byte of some abstract pointer.If we were to implement our memory model in Rust, this might look as follows:

enumByteV1{Bits(u8),PtrFragment(Pointer,u8),}

For example, aPtrFragment(ptr, 0) represents the first byte ofptr.This way,memcpy can “take apart” a pointer into the individual bytes that represent this pointer in memory, and copy them separately.On a 32bit architecture, the full value representingptr consists of the following 4 bytes:

[PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)]

Such a representation supports performing all byte-level “data moving” operations on pointers, which is sufficient formemcpy.Arithmetic or bit-level operations are not fully supported; as already mentioned above, that requires a more sophisticated pointer representation.

Uninitialized Memory

However, we are not done yet with our definition of a “byte”.To fully describe program behavior, we need to take one more possibility into account: A byte in memory could beuninitialized.The final definition for a byte (assuming we have a typePointer for pointers) thus looks as follows:

enumByte{Bits(u8),PtrFragment(Pointer,u8),Uninit,}

Uninit is the value we use for all bytes that have been allocated, but not yet written to.Reading uninitialized memory is fine, but actuallydoing anything with those bytes (like, using them in integer arithmetic) is undefined behavior.

This is very similar to the rules LLVM has for its special value calledpoison.Note that LLVMalso has a value calledundef, which it uses for uninitialized memory and which works somewhat differently – however, compiling ourUninit toundef is actually correct (undef is in some sense “weaker”), and there are proposals toremoveundef from LLVM and usepoison instead.

You may wonder why we have a specialUninit value at all.Couldn’t we just pick some arbitraryb: u8 for each newly allocated byte, and then useBits(b) as the initial value?That would indeed also be an option.However, first of all, pretty much all compilers have converged to having a sentinel value for uninitialized memory.Not doing that would not only pose trouble when compiling through LLVM, it would also require reevaluating many optimizations to make sure they work correctly with this changed model.The key point is that it is always safe, during compilation, to replaceUninit by any value: Any operation that actually observes this value is UB anyway.

For example, the following C code becomes easier to optimize withUninit:

inttest(){intx;if(condA())x=1;// Lots of hard to analyze code that will definitely return when condA()// does NOT hold, but will not change x.use(x);// want to optimize x to 1.}

WithUninit, we can easily argue thatx is eitherUninit or1, and since replacingUninit by1 is okay, the optimization is easily justified.WithoutUninit, however,x is either “some arbitrary bit pattern” or1, and doing the same optimization becomes much harder to justify.⁵

Finally,Uninit is also a better choice for interpreters like miri.Such interpreters have a hard time dealing with operations of the form “just choose any of these values” (i.e., non-deterministic operations), because if they want to fully explore all possible program executions, that means they have to try every possible value.UsingUninit instead of an arbitrary bit pattern means miri can, in a single execution, reliably tell you if your programs uses uninitialized values incorrectly.

Update: Since writing this section, I have written an entirepost dedicated to uninitialized memory and “real hardware” with more details, examples and references./Update

Conclusion

We have seen that in languages like C++ and Rust (unlike on real hardware), pointers can be different even when they point to the same address, and that a byte is more than just a number in0..256.This is also why calling C “portable assembly” may have been appropriate in 1978, but is a dangerously misleading statement nowadays.With this, I think we are ready to look at a first draft of my “2018 memory model” (working title ;) – in the next post. :)

Thanks to @rkruppe and @nagisa for help in finding arguments for whyUninit is needed.If you have any questions, feel free toask in the forums!

Footnotes

To be fair, the areclaimed to be crucial for generating high-quality assembly. The claim sounds plausible to me, but unfortunately, I do not know of a systematic study exploring the performance benefits of such optimizations. ↩
An argument could be made that compilers should just not do such optimizations to make the programming model simpler. This is a discussion worth having, but the point of this post is not to explore this trade-off, it is to explore the consequences of the choices made in C++. ↩
It turns out thati = y-x isalso undefined behavior becauseone may only subtract pointers into the same allocation. However, we could usei = ((size_t)y - (size_t)x)/sizeof(int) to work around that. ↩
As we have seen, the C++ standard actually applies these rules on the level of arrays, not allocations. However, LLVM applies the rule on aper-allocation level. ↩
We could argue that we can reorder when the non-deterministic choice is made, but then we have to prove that the hard to analyze code does not observex.Uninit avoids that unnecessary extra proof burden. ↩

Posted onRalf's Ramblings on Jul 24, 2018.
Comments?Drop me a mail orleave a note in the forum!

Movatterモバイル変換