I noticed that the difference between a lowercase letter and its uppercase counterpart is 32. This seems like a perfect opportunity to use some clever bit manipulation. The problem is that it's been a long time since my Computer Architecture classes, and I'm a bit rusty with the concepts required. From what I do recall, depending on the CPU architecture and the language's representation of signed/unsigned values, there is only a small number of solutions that would apply to almost all programming languages with these operators. I'm interested in comparing these. I'm not interested in simply converting the case; I know there are "simpler" ways (for humans, at least). I'm interested in learning how this problem interacts with the low-level representation of the data.
Please provide workable, minimal solutions for both lowercase-to-uppercase and uppercase-to-lowercase conversion, for each common representation, as well as a reasonably detailed explanation of how they work.
- If there's a difference of 32 (which there is), then addition and subtraction are what you need. Nothing fancy. You may be thinking of shifting to perform multiplication or division by powers of 2. – Commented Oct 16, 2014 at 23:01
- For ASCII, assuming C int c semantics: lower2upper(c) = c - 'a' + 'A' and upper2lower(c) = c - 'A' + 'a', and let the compiler do its work (a Python sketch of this follows the comments). For Unicode text it won't work. – jfs Commented Oct 17, 2014 at 1:08
- @J.F.Sebastian Yes, I'd like to see an explanation of how that works in relation to the low-level representation of the data, and how it would change for other common representations. – Commented Oct 17, 2014 at 16:04
- Do you want to know how various languages represent bytes as numbers and the corresponding bit patterns, if applicable? For example, How to combine two 32-bit integers into one 64-bit integer? Do you want to know what assembly/machine code is generated by various compilers for various target architectures? – jfs Commented Oct 20, 2014 at 4:47
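For reference, the arithmetic approach from the comment above looks roughly like this in Python (a minimal sketch; the names lower2upper and upper2lower come from the comment, everything else is illustrative):

def lower2upper(c):
    # ASCII only: shift a letter from the 'a'..'z' range into the 'A'..'Z' range
    return chr(ord(c) - ord('a') + ord('A'))

def upper2lower(c):
    # ASCII only: shift a letter from the 'A'..'Z' range into the 'a'..'z' range
    return chr(ord(c) - ord('A') + ord('a'))

print(lower2upper('q'), upper2lower('Q'))  # Q q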
1 Answer
(Note: I'm using Python here, but this is of course language agnostic. I'm also talking specifically about ASCII, so I'll use a 7-bit representation of things.)
If you look at the binary representation of ASCII characters in the [a-z][A-Z] range, you'll notice two things:
>>> bin(ord('a'))
'0b1100001'
>>> bin(ord('A'))
'0b1000001'
>>> bin(ord('y'))
'0b1111001'
>>> bin(ord('Y'))
'0b1011001'
First: they all have the seventh bit (from the right) set. Second: the lowercase characters have the sixth bit (from the right) set, while the uppercase ones have it unset, and that is the only difference between a given uppercase character and its lowercase version (and vice versa).
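Both observations are easy to confirm for every letter; here is a quick check (my own addition, assuming string has been imported, as in the snippets further down):

>>> import string
>>> all((ord(lo) ^ ord(up)) == 0x20 for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase))
True
>>> all(ord(c) & 0x40 for c in string.ascii_letters)
True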
So all you have to do is flip that bit to toggle the case: that would be xor 0b0100000, which is xor 0x20.
To lower(), you have to set that bit, so you can or 0b0100000, which is or 0x20. The already mentioned or 0x60 also works, since 0x60 is 0b1100000 and the extra bit (0x40, the seventh) is set in every letter anyway.
And to upper(), you have to unset that bit; that would be "and with the inverse of the mask 0b0100000", which is the same as and 0x5f.
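Wrapped up as functions, a minimal sketch of the three operations could look like this (the names CASE_BIT, toggle_case, to_lower and to_upper are mine, purely for illustration):

CASE_BIT = 0x20  # the sixth bit from the right, 0b0100000

def toggle_case(c):
    return chr(ord(c) ^ CASE_BIT)  # flip the bit: 'a' <-> 'A'

def to_lower(c):
    return chr(ord(c) | CASE_BIT)  # set the bit: 'A' -> 'a', 'a' stays 'a'

def to_upper(c):
    return chr(ord(c) & 0x5f)      # clear the bit (0x5f is the 7-bit inverse of 0x20): 'a' -> 'A'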
To see it all in action, I wrote some Python snippets which check that what we just saw is true for every character in the English alphabet:
>>> import string

#toggle():
>>> ''.join(chr(ord(c)^0x20) for c in string.ascii_lowercase) == string.ascii_uppercase
True
>>> ''.join(chr(ord(c)^0x20) for c in string.ascii_uppercase) == string.ascii_lowercase
True

#lower():
>>> ''.join(chr(ord(c)|0x20) for c in string.ascii_lowercase) == string.ascii_lowercase
True
>>> ''.join(chr(ord(c)|0x20) for c in string.ascii_uppercase) == string.ascii_lowercase
True

#upper():
>>> ''.join(chr(ord(c)&0x5f) for c in string.ascii_lowercase) == string.ascii_uppercase
True
>>> ''.join(chr(ord(c)&0x5f) for c in string.ascii_uppercase) == string.ascii_uppercase
True
It does not do anything useful to ' ', '\n', and such, though!
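If you need something that is safe for arbitrary input, a minimal sketch would be to guard on the ASCII letter ranges before touching the bit (the helper name is my own):

def toggle_ascii_case(c):
    # only flip the case bit for ASCII letters; leave ' ', '\n', digits, punctuation alone
    if 'a' <= c <= 'z' or 'A' <= c <= 'Z':
        return chr(ord(c) ^ 0x20)
    return c

print(''.join(toggle_ascii_case(c) for c in 'Hello, World!'))  # hELLO, wORLD!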