
Before starting, the reader considers this my personal opinion, how I see the things and not the only way resolve the problem.
I am working on a validator library and one of the various implemented filters do HTML escaping (replacing special characters with the correct HTML entities). For escape special char I need to know name or number of the html entity. Not all entities have a name then use the number is oblugatory.
For get the number of char, PHP offer a function called ord(). Unfortunately ord() does not work with UTF-8 multibyte characters and if multi-byte functions are not availables need to find another solution.
To be honest I don't like use multi-byte function on a filter because I cannot presume that all strings or character passed to the filter are multi-byte. In this case, defensive programming help us and we consider that all passed can be dangerous.
In comments of ord() function documentations I found a little snippet of code that seem the solution for my problem. It return the char number for all multi-byte chars.
Original code
This is the original version of the function (it's five years old code), prensent in PHP docs comments:
<?phpfunction ordutf8($string, &$offset){ $code = ord(substr($string, $offset, 1)); if ($code >= 128) {//otherwise 0xxxxxxx if ($code < 224) $bytesnumber = 2;//110xxxxx else if ($code < 240) $bytesnumber = 3;//1110xxxx else if ($code < 248) $bytesnumber = 4;//11110xxx $codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0); for ($i = 2; $i <= $bytesnumber; $i++) { $offset ++; $code2 = ord(substr($string, $offset, 1)) - 128;//10xxxxxx $codetemp = $codetemp * 64 + $code2; } $code = $codetemp; } $offset += 1; if ($offset >= strlen($string)) $offset = -1; return $code;}
Code is thinked to manage a phrase, not only one character at time. Inside this function, there are two IF, five IF-ELSE and one FOR statements. Too complicated.
Step 1
We are in 2018 and PHP7 (7.2 current major version) provide a lot of new features, the one I love most is type checking!
Next, I need to manage one char at time, offset argument is no longer required.
<?phpfunction ordutf8(string $char) : int{ $code = ord(substr($char, 0, 1)); if ($code >= 128) { if ($code < 224) $bytesnumber = 2; else if ($code < 240) $bytesnumber = 3; else if ($code < 248) $bytesnumber = 4; $codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0); $offset = 0; for ($i = 2; $i <= $bytesnumber; $i++) { $offset ++; $code2 = ord(substr($char, $offset, 1)) - 128; $codetemp = $codetemp * 64 + $code2; } $code = $codetemp; } return $code;}
One of two IF statement was removed.
Step 2
This code piece contains two IF-ELSE written using ternary operators:
<?php$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
We replace it with static values, provide directly the correct number without check it every time.
<?phpfunction ordutf8(string $char) : int{ $code = ord(substr($char, 0, 1)); if ($code >= 128) { $count = 0; if ($code < 224) { $bytesnumber = 2; } else if ($code < 240) { $bytesnumber = 3; $count = 32; } else if ($code < 248) { $bytesnumber = 4; $count = 48; } $codetemp = $code - 192 - $count; $offset = 0; for ($i = 2; $i <= $bytesnumber; $i++) { $offset ++; $code2 = ord(substr($char, $offset, 1)) - 128; $codetemp = $codetemp * 64 + $code2; } $code = $codetemp; } return $code;}
Two of five IF-ELSE statements was removed.
Step 3
We can refactor FOR statements for remove variables and reduce lines of code.
<?phpfunction ordutf8(string $char) : int{ $code = ord(substr($char, 0, 1)); if ($code >= 128) { $count = 0; if ($code < 224) { $bytes = 2; } else if ($code < 240) { $bytes = 3; $count = 32; } else if ($code < 248) { $bytes = 4; $count = 48; } $temp = $code - 192 - $count; for ($i = 1; $i < $bytes; $i++) { $code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128; } } return $code;}
In details from this:
<?php$codetemp = $code - 192 - $count;$offset = 0;for ($i = 2; $i <= $bytesnumber; $i++) { $offset ++; $code2 = ord(substr($char, $offset, 1)) - 128; $codetemp = $codetemp * 64 + $code2;}$code = $codetemp;
To this:
<?php$temp = $code - 192 - $count;for ($i = 1; $i < $bytes; $i++) { $code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128;}
Step 4
Now we need to try to remove a little bit of IF-ELSE statements. This step seem more complicated than previous, but let us reflect on the conditions used in the IF statements.
we now have four conditions:
- >=128 2 bytes
- <224 2 bytes
- <240 3 bytes
- <248 4 bytes
we can rewrite them using only three:
- >127 2 bytes
- >223 3 bytes
- >239 4 bytes
without using IF-ELSE statements and overwriting the variables when to the next condition is true.
<?phpfunction ordutf8(string $char) : int{ $code = ord(substr($char, 0, 1)); if ($code > 127) { $bytes = 2; $count = 0; if ($code > 223){ $bytes = 3; $count = 32; } if ($code > 239){ $bytes = 4; $count = 48; } $temp = $code - 192 - $count; for ($i = 1; $i < $bytes; $i++) { $code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128; } } return $code;}
Five of five IF-ELSE statements was removed. Remain 3 IF and one FOR statements.
Step 5
We now try to remove FOR statement and rewrite all as mathematics formula.
<?phpfunction ordutf8(string $char) : int{ $code = ord(substr($char, 0, 1)); if ($code > 239){ return ((/*($code - 240) * 64 + */ord(substr($char, 1, 1)) - 128) * 64 + ord(substr($char, 2, 1)) - 128) * 64 + ord(substr($char, 3, 1)) - 128; } if ($code > 223){ return (($code - 224) * 64 + ord(substr($char, 1, 1)) - 128) * 64 + ord(substr($char, 2, 1)) - 128; } if ($code > 127) { return ($code - 192) * 64 + ord(substr($char, 1, 1)) - 128; } return $code;}
Now only three IF statement remain.
UPDATE: Commented code in first IF statement return always zero. Useless code.
Conclusion
I hope this article can help you understand that you do not have to stop at the moment the code works. The code can always be improved, simplified and made more efficient.
Code improvement can be appreciated running little test onhttps://3v4l.org/TWFLX.
Code is also available on myGist
Top comments(0)
For further actions, you may consider blocking this person and/orreporting abuse