2
\$\begingroup\$
Closed. This question is off-topic. It is not currently accepting answers.

This question does not appear to be a code review request within the scope defined in the help center. Please see the FAQ about off-topic questions.

Closed 8 months ago.

I have a dictionary with a lot of symbols, each of which is encoded as a Huffman binary string.

Example:

Symbol    Huffman Code
you       010
shall     0111
not       00111
pass      00001
...       ...

Therefore I encode the sentence "you shall not pass" like this: "01001110011100001".
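For illustration, the encoding step could be sketched as below. The function name `encodeSymbols` and the four-entry table are assumptions taken from the example above, not part of the actual dictionary.

```php
<?php
// Hypothetical code table copied from the example; a real table would be larger.
$codes = [
    'you'   => '010',
    'shall' => '0111',
    'not'   => '00111',
    'pass'  => '00001',
];

// Concatenate the Huffman code of each space-separated symbol.
function encodeSymbols(string $sentence, array $codes): string
{
    $bits = '';
    foreach (explode(' ', $sentence) as $symbol) {
        if (!isset($codes[$symbol])) {
            throw new InvalidArgumentException("No code for symbol: $symbol");
        }
        $bits .= $codes[$symbol];
    }
    return $bits;
}

echo encodeSymbols('you shall not pass', $codes), PHP_EOL; // 01001110011100001
```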

Now I would like to divide the binary string into chunks of six characters and interpret each sextuplet as a character according to the base64 encoding. If the length of the string is not divisible by 6, I simply append zeros at the end until it is.

So "you shall not pass" would be represented like this: 010011 100111 000010, which in base64 should be just TnC.

To do this I managed to write the following PHP code, but I suspect that it could be done in a more elegant way, using a more efficient built-in function. I was looking for something like base64_encode(), but if I try base64_encode("01001110011100001"), the function treats each character as one byte and outputs "MDEwMDExMTAwMTExMDAwMDE=". Could you give me some suggestions, please?

<?php
function binaryToChars($binarystring){
    static $chars = array(
    "000000"=>"A","010000"=>"Q","100000"=>"g","110000"=>"w",
    "000001"=>"B","010001"=>"R","100001"=>"h","110001"=>"x",
    "000010"=>"C","010010"=>"S","100010"=>"i","110010"=>"y",
    "000011"=>"D","010011"=>"T","100011"=>"j","110011"=>"z",
    "000100"=>"E","010100"=>"U","100100"=>"k","110100"=>"0",
    "000101"=>"F","010101"=>"V","100101"=>"l","110101"=>"1",
    "000110"=>"G","010110"=>"W","100110"=>"m","110110"=>"2",
    "000111"=>"H","010111"=>"X","100111"=>"n","110111"=>"3",
    "001000"=>"I","011000"=>"Y","101000"=>"o","111000"=>"4",
    "001001"=>"J","011001"=>"Z","101001"=>"p","111001"=>"5",
    "001010"=>"K","011010"=>"a","101010"=>"q","111010"=>"6",
    "001011"=>"L","011011"=>"b","101011"=>"r","111011"=>"7",
    "001100"=>"M","011100"=>"c","101100"=>"s","111100"=>"8",
    "001101"=>"N","011101"=>"d","101101"=>"t","111101"=>"9",
    "001110"=>"O","011110"=>"e","101110"=>"u","111110"=>"+",
    "001111"=>"P","011111"=>"f","101111"=>"v","111111"=>"/");

    $result = "";
    while (strlen($binarystring)%6!=0){$binarystring .= "0";}
    while (strlen($binarystring)>0){
        $substr = substr($binarystring,0,6);
        if (array_key_exists($substr, $chars)){$result .= $chars[$substr]; $binarystring = substr($binarystring,6);}
        else{throw new Exception("The variable is not a binary string.");}
    }
    return $result;
}
?>
Sᴀᴍ Onᴇᴌᴀ
asked Apr 8 at 18:50
Benzio
\$\endgroup\$
4
  • Hmmm... Homebrew data compression... A fixed 'palette' of words encoded with a few bits... With 18 bits (3x6), one could number 262,144 individual messages, some as complex as "It's the weekend, so I'm not answering calls" or more specific. The English alphabet has 'Q' and 'X', but I don't think there's a word using Q and X (i.e., some token combinations have no value). OP has encoded "shall" with 4 bits and "not" with 5... It's unlikely this represents optimal accounting for "frequency of use"...
    Commented Apr 11 at 3:09
  • The example using words of sentences was just to simplify the question! I'm not really trying to compress plaintext. I'm actually doing it for chess positions. My "words" are blocks of pieces on the board.
    Commented Apr 15 at 21:09
  • 1
    I’m voting to close this question because OP admits to deliberate misrepresentation of their true intent and purpose.
    Commented Apr 15 at 21:51
  • I suggest not using strings of characters to represent binary values: use indexing instead of key matching.
    Commented Apr 18 at 17:31

2 Answers

3
\$\begingroup\$

array_key_exists vs. isset

if (array_key_exists($substr, $chars)){

You almost never want to use array_key_exists in PHP.

if (isset($chars[$substr])) {

does the same thing faster unless null is a valid value for which you want to return true.
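A small sketch of the behavioural difference, using a hypothetical `$values` array: the two checks disagree only when a key exists but holds null.

```php
<?php
// Hypothetical sample data: one ordinary value and one key that holds null.
$values = ['present' => 'A', 'empty' => null];

var_dump(isset($values['present']));            // bool(true)
var_dump(array_key_exists('present', $values)); // bool(true)

// The two differ only when the key exists but its value is null.
var_dump(isset($values['empty']));              // bool(false)
var_dump(array_key_exists('empty', $values));   // bool(true)

// Both report a missing key as absent.
var_dump(isset($values['missing']));            // bool(false)
var_dump(array_key_exists('missing', $values)); // bool(false)
```

Since every value in `$chars` is a non-null string, either check would behave the same in the question's code.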

String padding

while (strlen($binarystring)%6!=0){$binarystring .= "0";}

That's about as efficient a way as you'll find for fewer than six characters. I'd write it with more whitespace though.

while (strlen($binarystring) % 6 != 0) {
    $binarystring .= '0';
}

An alternative would be

if (strlen($binarystring) % 6 != 0) {
    // str_pad() returns the padded string, so the result must be assigned.
    $binarystring = str_pad($binarystring, strlen($binarystring) + 6 - strlen($binarystring) % 6, '0');
}

I'm not sure that's more efficient though. It would be if you had to add many more characters.
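One way to avoid the surrounding `if` is to compute the target length directly. This is only a sketch; the helper name `padToMultiple` and its `$width` parameter are assumptions, not part of the original code.

```php
<?php
// Right-pad a bit string with '0' up to the next multiple of $width.
// Strings whose length is already a multiple are returned unchanged.
function padToMultiple(string $bits, int $width = 6): string
{
    // Integer ceiling division avoids any floating-point rounding.
    $target = intdiv(strlen($bits) + $width - 1, $width) * $width;
    return str_pad($bits, $target, '0'); // STR_PAD_RIGHT is the default
}

echo padToMultiple('01001110011100001'), PHP_EOL; // 010011100111000010
echo padToMultiple('010011'), PHP_EOL;            // 010011
```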

base64_encode

The rest of your code is where you could make it simpler. Consider

$excess_count = strlen($binarystring) % 8;
if ($excess_count != 0) {
    // str_pad() returns the padded string, so the result must be assigned.
    $binarystring = str_pad($binarystring, strlen($binarystring) + 8 - $excess_count, '0');
}

$binary_data = '';
for ($i = 0; $i < strlen($binarystring); $i += 8) {
    $binary_data .= chr(bindec(substr($binarystring, $i, 8)));
}

$encoded = base64_encode($binary_data);

Then you don't have to generate the string manually. You can check the string to make sure that it's composed only of 0s and 1s if you want. You'd do that before this code.

You do need to zero-pad in this case too: without it, bindec would treat a short final chunk such as '1' as the low-order bits of a byte (00000001 rather than 10000000).

The bindec converts from a string to a number (decbin reverses it). The chr converts from a number to a character in a string (the character might be unprintable; ord reverses it). Concatenating gives you a string of these characters (str_split reverses it so that you can use foreach on the resulting array). The base64_encode converts from binary data that may include unprintable characters to only printable characters (base64_decode reverses it).
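Putting those pieces together might look like the sketch below. The function name `bitsToBase64` is hypothetical. Note one difference from the question's expected output: because the bits are padded to whole bytes rather than to sextets, the example input yields "TnCA" instead of "TnC". Keeping only the first ceil(n / 6) characters recovers the sextet-padded form.

```php
<?php
// Sketch: pack a string of '0'/'1' characters into bytes, then base64-encode.
function bitsToBase64(string $bits): string
{
    // Reject anything that is not purely 0s and 1s.
    if (preg_match('/^[01]*\z/', $bits) !== 1) {
        throw new InvalidArgumentException('The variable is not a binary string.');
    }
    if ($bits === '') {
        return '';
    }
    // Right-pad with zeros to a whole number of bytes.
    $bits = str_pad($bits, intdiv(strlen($bits) + 7, 8) * 8, '0');

    $bytes = '';
    foreach (str_split($bits, 8) as $octet) {
        $bytes .= chr((int) bindec($octet));
    }
    return base64_encode($bytes);
}

$bits = '01001110011100001';
echo bitsToBase64($bits), PHP_EOL; // TnCA

// Trim to one character per (zero-padded) sextet of the original input.
echo substr(bitsToBase64($bits), 0, intdiv(strlen($bits) + 5, 6)), PHP_EOL; // TnC
```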

answered Apr 10 at 17:54
mdfst13
\$\endgroup\$
4
\$\begingroup\$

I suspect that it could be done in some more elegant way, using some more efficient built-in function

One could:

  • convert binary numbers to decimal using bindec()
  • convert the decimal number to an ASCII character using chr()
    • because the ASCII characters are arranged a bit differently, one would need to add to or subtract from the value returned from bindec() to get to the correct spot, and likely just have a special case for the two non-alphanumeric characters, i.e. + and /

That would allow for the elimination of static $chars. One could use a function like preg_match() to determine if non-binary characters are present in the string (i.e. to throw the exception).
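A minimal sketch of that approach follows. The helper name `sextetToChar` is an assumption; the offsets come from the ASCII codes of 'A' (65), 'a' (97) and '0' (48) combined with the base64 index ranges 0 to 25, 26 to 51 and 52 to 61.

```php
<?php
// Map a 6-bit value (0..63) to its base64 character using ASCII offsets.
function sextetToChar(int $value): string
{
    if ($value < 26) {
        return chr($value + 65);      // 'A'..'Z'
    }
    if ($value < 52) {
        return chr($value + 71);      // 'a'..'z' (97 - 26)
    }
    if ($value < 62) {
        return chr($value - 4);       // '0'..'9' (48 - 52)
    }
    return $value === 62 ? '+' : '/'; // the two non-alphanumeric characters
}

function binaryToChars(string $binarystring): string
{
    // Any character other than 0 or 1 is an error.
    if (preg_match('/[^01]/', $binarystring) === 1) {
        throw new Exception('The variable is not a binary string.');
    }
    if ($binarystring === '') {
        return '';
    }
    // Right-pad with zeros to a multiple of six bits.
    $padded = str_pad($binarystring, intdiv(strlen($binarystring) + 5, 6) * 6, '0');

    $result = '';
    foreach (str_split($padded, 6) as $sextet) {
        $result .= sextetToChar((int) bindec($sextet));
    }
    return $result;
}

echo binaryToChars('01001110011100001'), PHP_EOL; // TnC
```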

Review

Consider making whitespace more consistent

A space is added after some keywords like if but not else. For the sake of readability, it would be good to have a space after each keyword, as well as around each binary operator. For idiomatic PHP consider following recommendations in PSR-12.

Closing PHP tag can be eliminated

At the end of the PHP file there is a closing PHP tag:

?>

Per the PHP documentation:

If a file contains only PHP code, it is preferable to omit the PHP closing tag at the end of the file. This prevents accidental whitespace or new lines being added after the PHP closing tag, which may cause unwanted effects because PHP will start output buffering when there is no intention from the programmer to send any output at that point in the script.

$chars could be a constant

If you continue using $chars it may be worth making it a constant instead of a static variable. That way it can't be modified at run-time.
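As one possible variation (not the array from the question), the lookup table could be a string constant holding the standard base64 alphabet, indexed by the numeric value of each sextet. The name `BASE64_ALPHABET` is an assumption.

```php
<?php
// Hypothetical constant name; the characters are in standard base64 order,
// so the character for a sextet sits at the index given by its numeric value.
const BASE64_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

echo BASE64_ALPHABET[bindec('010011')], PHP_EOL; // T
echo BASE64_ALPHABET[bindec('111111')], PHP_EOL; // /
```

Because it is declared with `const`, the table cannot be reassigned at run-time, and indexing by number replaces the 64 string keys.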

answered Apr 8 at 23:15
Sᴀᴍ Onᴇᴌᴀ
\$\endgroup\$
