This question does not appear to be a code review request within the scope defined in the help center. Please see the FAQ about off-topic questions.
Closed 8 months ago.
I have a dictionary with a lot of symbols, each of which is encoded as a Huffman binary string.
Example:
| Symbol | Huffman Code |
|---|---|
| you | 010 |
| shall | 0111 |
| not | 00111 |
| pass | 00001 |
| ... | ... |
Therefore I encode the sentence "you shall not pass" like this: "01001110011100001".
Now I would like to divide the binary string into chunks of six characters and interpret each sextet as a character according to the base64 alphabet. If the length of the string is not divisible by 6, I simply append zeros at the end until it is.
So "you shall not pass" would be represented like this: 010011 100111 000010, which in base64 should be just `TnC`.
To do this I managed to write the following PHP code, but I suspect that it could be done in a more elegant way, using some more efficient built-in function. I was looking for something like `base64_encode()`, but if I try `base64_encode("01001110011100001")`, the function treats each single character as 1 byte and outputs "MDEwMDExMTAwMTExMDAwMDE=". Could you give me some suggestions, please?
```php
<?php
function binaryToChars($binarystring){
    static $chars = array(
        "000000"=>"A","010000"=>"Q","100000"=>"g","110000"=>"w",
        "000001"=>"B","010001"=>"R","100001"=>"h","110001"=>"x",
        "000010"=>"C","010010"=>"S","100010"=>"i","110010"=>"y",
        "000011"=>"D","010011"=>"T","100011"=>"j","110011"=>"z",
        "000100"=>"E","010100"=>"U","100100"=>"k","110100"=>"0",
        "000101"=>"F","010101"=>"V","100101"=>"l","110101"=>"1",
        "000110"=>"G","010110"=>"W","100110"=>"m","110110"=>"2",
        "000111"=>"H","010111"=>"X","100111"=>"n","110111"=>"3",
        "001000"=>"I","011000"=>"Y","101000"=>"o","111000"=>"4",
        "001001"=>"J","011001"=>"Z","101001"=>"p","111001"=>"5",
        "001010"=>"K","011010"=>"a","101010"=>"q","111010"=>"6",
        "001011"=>"L","011011"=>"b","101011"=>"r","111011"=>"7",
        "001100"=>"M","011100"=>"c","101100"=>"s","111100"=>"8",
        "001101"=>"N","011101"=>"d","101101"=>"t","111101"=>"9",
        "001110"=>"O","011110"=>"e","101110"=>"u","111110"=>"+",
        "001111"=>"P","011111"=>"f","101111"=>"v","111111"=>"/");
    $result = "";
    while (strlen($binarystring)%6!=0){$binarystring .= "0";}
    while (strlen($binarystring)>0){
        $substr = substr($binarystring,0,6);
        if (array_key_exists($substr, $chars)){$result .= $chars[$substr]; $binarystring = substr($binarystring,6);}
        else{throw new Exception("The variable is not a binary string.");}
    }
    return $result;
}
?>
```

- Hmmm... Homebrew data compression... A fixed 'palette' of words encoded with a few bits... With 18 bits (3x6), one could number 256,000 individual messages, some as complex as "It's the weekend, so I'm not answering calls" or more specific. English alphabet has 'Q' and 'X', but I don't think there's a word using Q and X (ie: Some token combinations have no value.) OP has encoded "shall" with 4 bits and "not" with 5... It's unlikely this represents optimal accounting for "frequency of use"... – user272752, Apr 11, 2025 at 3:09
- The example using words of sentences was just to simplify the question! I'm not really trying to compress plaintext. I'm actually doing it for chess positions. My "words" are blocks of pieces on the board. – Benzio, Apr 15, 2025 at 21:09
- I’m voting to close this question because OP admits to deliberate misrepresentation of their true intent and purpose. – user272752, Apr 15, 2025 at 21:51
- I suggest not to use strings of characters to represent binary values: indexing instead of key matching. – greybeard, Apr 18, 2025 at 17:31
2 Answers
`array_key_exists` vs. `isset`
if (array_key_exists($substr, $chars)){
You almost never want to use array_key_exists in PHP.
`if (isset($chars[$substr])) {` does the same thing faster unless `null` is a valid value for which you want to return true.
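To make the `null` caveat concrete, here is a small sketch. The array and its keys are made up for illustration and are not taken from the question's `$chars` table.

```php
<?php
// Illustrative array (hypothetical keys): one ordinary entry, one null entry.
$lookup = ['000000' => 'A', 'nullable' => null];

$plainIsset  = isset($lookup['000000']);                // key exists, value is not null
$nullIsset   = isset($lookup['nullable']);              // key exists, but value is null
$nullExists  = array_key_exists('nullable', $lookup);   // only checks for the key
$missingKey  = isset($lookup['222222']);                // key does not exist

var_dump($plainIsset, $nullIsset, $nullExists, $missingKey);
// bool(true), bool(false), bool(true), bool(false)
```

Since every value in the question's table is a non-empty string, the two checks should agree there.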
String padding
while (strlen($binarystring)%6!=0){$binarystring .= "0";}
That's about as efficient a way as you'll find for less than six characters. I'd write it with more whitespace though.
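```php
while (strlen($binarystring) % 6 != 0) {
    $binarystring .= '0';
}
```

An alternative would be

```php
if (strlen($binarystring) % 6 != 0) {
    $binarystring = str_pad($binarystring, strlen($binarystring) + 6 - strlen($binarystring) % 6, '0');
}
```

Note that `str_pad` returns the padded string rather than modifying its argument, so the result has to be assigned back. I'm not sure that's more efficient though. It would be if you might add more characters.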
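As a possible third variant, the padding can be computed once and appended with `str_repeat`. This is only a sketch; `padToMultiple` is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: right-pad a bit string with '0' up to a multiple of $size.
function padToMultiple(string $bits, int $size): string
{
    $remainder = strlen($bits) % $size;

    // Nothing to do when the length is already a multiple of $size.
    if ($remainder === 0) {
        return $bits;
    }

    return $bits . str_repeat('0', $size - $remainder);
}

echo padToMultiple('01001110011100001', 6), PHP_EOL; // 010011100111000010
```

Like `str_pad`, it returns a new string, so the caller has to use the return value.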
base64_encode
The rest of your code is where you could make it simpler. Consider
```php
$excess_count = strlen($binarystring) % 8;
if ($excess_count != 0) {
    $binarystring = str_pad($binarystring, strlen($binarystring) + 8 - $excess_count, '0');
}

$binary_data = '';
for ($i = 0; $i < strlen($binarystring); $i += 8) {
    $binary_data .= chr(bindec(substr($binarystring, $i, 8)));
}

$encoded = base64_encode($binary_data);
```

Then you don't have to generate the string manually. You can check the string to make sure that it's composed only of 0s and 1s if you want. You'd do that before this code.
I'm not sure that you need to zero pad it in this case, but that's the way that you were doing it.
`bindec` converts from a string to a number (`decbin` reverses it). `chr` converts from a number to a character in a string (the character might be unprintable; `ord` reverses it). Concatenating gives you a string of these characters (`str_split` reverses it so that you can use `foreach` on the resulting array). `base64_encode` converts from binary data that may include unprintable characters to only printable characters (`base64_decode` reverses it).
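Wrapped into functions, the approach might look like the sketch below. The function names are hypothetical, and the validation step is an assumption about what the caller wants. One thing to be aware of: because the bits are padded to whole bytes and `base64_encode` then pads to whole 3-byte groups, the tail of the output can differ from the question's function (extra `A` characters or `=` padding). The second function trims the result back to one character per sextet, which should reproduce the question's output.

```php
<?php
// Hypothetical wrapper around the bindec/chr/base64_encode pipeline.
function binaryToBase64(string $binarystring): string
{
    if ($binarystring === '') {
        return '';
    }
    if (preg_match('/[^01]/', $binarystring)) {
        throw new InvalidArgumentException('The variable is not a binary string.');
    }

    // Right-pad with zeros to a whole number of bytes.
    $excess = strlen($binarystring) % 8;
    if ($excess !== 0) {
        $binarystring = str_pad($binarystring, strlen($binarystring) + 8 - $excess, '0');
    }

    // Turn each group of 8 bits into one raw byte.
    $binary_data = '';
    foreach (str_split($binarystring, 8) as $byte) {
        $binary_data .= chr(bindec($byte));
    }

    return base64_encode($binary_data);
}

// Hypothetical variant: keep only one output character per (zero-padded) sextet.
function binaryToSextetChars(string $binarystring): string
{
    $sextets = intdiv(strlen($binarystring) + 5, 6); // ceil(length / 6)

    return substr(binaryToBase64($binarystring), 0, $sextets);
}

echo binaryToBase64('01001110011100001'), PHP_EOL;      // TnCA
echo binaryToSextetChars('01001110011100001'), PHP_EOL; // TnC
```

Padding on the right matters here: `bindec('1')` is 1, so a short final group left unpadded would be read as `00000001` rather than `10000000`.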
> I suspect that it could be done in some more elegant way, using some more efficient built-in function
One could:
- convert binary numbers to decimal using `bindec()`
- convert the decimal number to an ASCII character using `chr()`
  - because the ASCII characters are arranged a bit differently, one would need to add to or subtract from the value returned from `bindec()` to get to the correct spot, and likely just have a special case for the two non-alphanumeric characters, i.e. `+` and `/`
That would allow for the elimination of `static $chars`. One could use a function like `preg_match()` to determine if non-binary characters are present in the string (i.e. to throw the exception).
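The steps above could be sketched roughly as follows. The function names `sextetToChar` and `binaryToCharsViaChr` are hypothetical, and the offsets are simply the distances between a base64 index and the ASCII code of the corresponding character.

```php
<?php
// Hypothetical helper: map a 6-bit value (0..63) to its base64 character
// using chr() and ASCII offsets instead of a lookup table.
function sextetToChar(int $value): string
{
    if ($value < 26) {
        return chr($value + 65); // 0..25  -> 'A'..'Z' (ASCII 65..90)
    }
    if ($value < 52) {
        return chr($value + 71); // 26..51 -> 'a'..'z' (ASCII 97..122)
    }
    if ($value < 62) {
        return chr($value - 4);  // 52..61 -> '0'..'9' (ASCII 48..57)
    }

    return $value === 62 ? '+' : '/'; // the two non-alphanumeric characters
}

// Hypothetical replacement for the question's function.
function binaryToCharsViaChr(string $binarystring): string
{
    if ($binarystring === '') {
        return '';
    }
    if (preg_match('/[^01]/', $binarystring)) {
        throw new InvalidArgumentException('The variable is not a binary string.');
    }

    // Right-pad with zeros to a multiple of six bits.
    $remainder = strlen($binarystring) % 6;
    if ($remainder !== 0) {
        $binarystring .= str_repeat('0', 6 - $remainder);
    }

    $result = '';
    foreach (str_split($binarystring, 6) as $sextet) {
        $result .= sextetToChar(bindec($sextet));
    }

    return $result;
}

echo binaryToCharsViaChr('01001110011100001'), PHP_EOL; // TnC
```

Validating once up front with `preg_match()` also means the loop no longer needs a per-chunk existence check.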
Review
Consider making whitespace more consistent
A space is added after some keywords like `if` but not `else`. For the sake of readability, it would be good to have a space after each keyword, as well as around each binary operator. For idiomatic PHP consider following recommendations in PSR-12.
Closing PHP tag can be eliminated
At the end of the PHP file there is a closing PHP tag:
?>
If a file contains only PHP code, it is preferable to omit the PHP closing tag at the end of the file. This prevents accidental whitespace or new lines being added after the PHP closing tag, which may cause unwanted effects because PHP will start output buffering when there is no intention from the programmer to send any output at that point in the script.
$chars could be a constant
If you continue using `$chars` it may be worth making it a constant instead of a static variable. That way it can't be modified at run-time.
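One possible shape for this, as a sketch: the 64-entry array could be declared with `const` exactly as it stands. The variation below goes a step further and combines the constant with the indexing idea from the comments, storing the alphabet as a string and using the numeric value of each sextet as an offset. `BASE64_ALPHABET` and `sextetsToChars` are hypothetical names.

```php
<?php
// Hypothetical constant: the standard base64 alphabet, indexed by sextet value.
const BASE64_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

function sextetsToChars(string $binarystring): string
{
    if ($binarystring === '') {
        return '';
    }
    if (preg_match('/[^01]/', $binarystring)) {
        throw new InvalidArgumentException('The variable is not a binary string.');
    }

    $result = '';
    foreach (str_split($binarystring, 6) as $sextet) {
        // Only the last chunk can be short; right-pad it with zeros.
        $index = bindec(str_pad($sextet, 6, '0'));
        $result .= BASE64_ALPHABET[$index];
    }

    return $result;
}

echo sextetsToChars('01001110011100001'), PHP_EOL; // TnC
```

Because the alphabet is a constant, it cannot be reassigned at run-time, and the lookup is a plain string offset rather than a key match.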

