Simple LZW compression algorithm

Question 1

It's a simple version of LZW compression algorithm with 12 bit codes. I want to know what's good and what's bad about this code. Be as picky as you like.

//  Compile with gcc 4.7.2 or later, using the following command line:////    g++ -std=c++0x lzw.c -o lzw////LZW algorithm implemented using fixed 12 bit codes.#include <iostream>#include <sstream>#include <fstream>#include <bitset>#include <string>#include <unordered_map>#define MAX_DEF 4096using namespace std;string convert_int_to_bin(int number){    string result = bitset<12>(number).to_string();    return result;}void compress(string input, int size, string filename) {    unordered_map<string, int> compress_dictionary(MAX_DEF);    //Dictionary initializing with ASCII    for ( int unsigned i = 0 ; i < 256 ; i++ ){    compress_dictionary[string(1,i)] = i;    }    string current_string;    unsigned int code;    unsigned int next_code = 256;    //Output file for compressed data    ofstream outputFile;    outputFile.open(filename + ".lzw");    for(char& c: input){    current_string = current_string + c;    if ( compress_dictionary.find(current_string) ==compress_dictionary.end() ){            if (next_code <= MAX_DEF)                compress_dictionary.insert(make_pair(current_string, next_code++));            current_string.erase(current_string.size()-1);            outputFile << convert_int_to_bin(compress_dictionary[current_string]);            current_string = c;        }       }       if (current_string.size())            outputFile << convert_int_to_bin(compress_dictionary[current_string]);    outputFile.close();}void decompress(string input, int size, string filename) {    unordered_map<unsigned int, string> dictionary(MAX_DEF);    //Dictionary initializing with ASCII    for ( int unsigned i = 0 ; i < 256 ; i++ ){    dictionary[i] = string(1,i);    }    string previous_string;    unsigned int code;    unsigned int next_code = 256;    //Output file for decompressed data    ofstream outputFile;    outputFile.open(filename + "_uncompressed.txt");    int i =0;    while (i<size){        //Extracting 12 bits and converting binary to decimal        string subinput = input.substr(i,12);        bitset<12> binary(subinput);        code = binary.to_ullong();        i+=12;        if ( dictionary.find(code) ==dictionary.end() )             dictionary.insert(make_pair(code,(previous_string + previous_string.substr(0,1))));        outputFile<<dictionary[code];        if ( previous_string.size())            dictionary.insert(make_pair(next_code++,previous_string + dictionary[code][0]));         previous_string = dictionary[code];        }    outputFile.close();}string convert_char_to_string(const char *pCh, int arraySize){    string str;    if (pCh[arraySize-1] == '\0') str.append(pCh);    else for(int i=0; i<arraySize; i++) str.append(1,pCh[i]);    return str;}static void show_usage(){        cerr << "Usage: \n"              << "Specify the file that needs to be compressed or decompressed\n"              <<"lzw -c input    #compress file input\n"              <<"lzw -d input    #decompress file input\n"              <<"Compressed data will be found in a file with the same name but with a .lzw extension\n"              <<"Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"              << endl;}int main (int argc, char* argv[]) {    streampos size;    char * memblock;    if (argc <2)    {        show_usage();           return(1);    }    ifstream file (argv[2], ios::in|ios::binary|ios::ate);    if (file.is_open())    {        size = file.tellg();        memblock = new char[size];        file.seekg (0, ios::beg);        file.read (memblock, size);        file.close();        string input = convert_char_to_string(memblock,size);        if (string( "-c" ) == argv[1] )            compress(input,size, argv[2]);        else if (string( "-d" ) == argv[1] )            decompress(input,size, argv[2]);        else            show_usage();    }    else {    cout << "Unable to open file."<<endl;    show_usage();    }    return 0;}

Question 2

Here are some things I see that may help you improve your code.

Shouldn't a compressed file be smaller?

Imagine my surprise when I discovered that a 4057-byte file (the source code itself) became 18720 bytes when compressed! Then I looked at the "compressed" file and saw that it was composed of ASCII'1' and'0'characters rather than binary. If that had been converted to binary output, that would be 2340 bytes which actually is smaller than the input. I'd suggest emitting binary rather than ASCII representation of binary which is 8x the size.

Pass by const reference where practical

The first argument tocompress is astd::string but that causes the entire input buffer to be duplicated. Better would be to make itconst std::string & because it is not modified and it doesn't need to be duplicated. Thefor loop then becomes

for(const char c: input) {    // ...}

Note that this is aconst char notconst char & because on most machines it's quicker and smaller to simply use achar rather than a reference to one.

Thefilename argument should also beconst string &.

Remember to free allocated memory

The program shoulddelete[] memblock at the end ofmain. Better yet, see the following suggestion

Eliminate buggy`convert_char_to_string()`

The code for converting from a char pointer to a string as currently in the code is this:

string convert_char_to_string(const char *pCh, int arraySize){    string str;    if (pCh[arraySize-1] == '\0')         str.append(pCh);    else         for(int i=0; i<arraySize; i++)             str.append(1,pCh[i]);    return str;}

However this has a few problems. First is the bug that occurs if the passed buffer has two'\0' characters in it, one of which is at the end. The bug is that this code will only copy up to the first'\0' character, silently truncating the input buffer. The second problem is the loop. As written, it will probably reallocate and copy the contents ofstr many times as it grows. That pointlessly slows the program because you already know how big the string should be. Better still would be to eliminate it entirely, and eliminatememblock by copying the file directly into the string:

string input = string(std::istreambuf_iterator<char>(file),    std::istreambuf_iterator<char>());

Note that there arefaster ways to do this, but none as succinct.

Use C++11 compile switches

The program uses C++ 11 constructs, so I'd recommend changing the command-line compilation suggestion in the commment to this instead:

g++ -std=c++11 lzw.c -o lzw

Prefer`constexpr` to old-style`#define`

Rather than using a#define forMAX_DEF the code could use aconstexpr:

constexpr std::size_t MAX_DEF = 4096;

It doesn't make a lot of difference here, but generally the advantage is that the value has an associated type. Where it does make a difference is in the next suggestion:

Make relationships between constants explicit

The numbers 12 and 4096 in the program are associated, but that association is implicit rather than explicit. It would be better to make the relationship between the two constants explicit:

constexpr std::size_t BITS=12;constexpr std::size_t MAX_DEF=1<<BITS;

Now all places that currently use the "magic number" 12 should use the constantBITS instead. Better still, would be to have aCompressor class for which these would be member variables, andcompress anddecompress would be member functions.

Avoid using the same unnamed constant multiple times

The code currently initializescompress_dictionary and then setsnext_code to 256. This would be a little more straightforward and avoids having the same constant multiple places in the same routine:

unsigned next_code;for ( next_code = 0 ; next_code < 256 ; next_code++ ){    compress_dictionary[string(1,next_code)] = next_code;}

This is advantageous because there is no way to make a mistake in typing the constant two different ways. A similar change should be made todecompress which also eliminates another ugly peculiarity, which is to have declaredi as two different types within the same function.

Use a`for` loop instead of`while` where it makes sense

In thedecompress routine, thewhile loop would be better expressed as afor loop. Instead of this:

int i = 0;while (i<size){    //    i+=12;    //}

Use this:

for (int i=0; i < size; i += BITS) {    // }

Bail out early on error

The code currently checks to see if the file is open and then has anelse clause way at the bottom ofmain to handle an error. Better would be to bail out early on error, and not enclose the entire program in anif clause:

if (!file){    cout << "Unable to open file." << endl;    show_usage();}

Also note that I've used!file for the condition. This is the usual C++ idiom for file error checking.

Use all appropriate`#include`s

The program usesmake_pair which is actually defined in<utility>. For that reason, there should be a line

    #include <utility>

Eliminate unused variables

Thecode variable incompress is not used. Also, thesize parameter is not used, but it should be. I'd recommend eliminatingcode andsize. If you need the size, as for thedecompress routine, you can simply useinput.size().

Omit`return 0`

When a C++ program reaches the end ofmain the compiler will automatically generate code to return 0, so there is no reason to putreturn 0; explicitly at the end ofmain.

Question 3

Excellent, but I'd also add a suggestion of emitting only used code instead of all 256. This can make low color images/data compress even better. This is roughly how GIF works.

Question 4

@phyrfox: I'm not sure I understand your suggestion. The dictionary is not emitted; only used code are.

Question 5

I'd simply advise creating a dictionary based on used symbols, and emit this dictionary as part of the file (and load on decompression). Starting with eight bit codes might not always be efficient. Text files would arguably benefit, as well as images with fewer colors. For example, you could probably save two bits per code with a typical text file from the start. Emitting the dictionary could be done in just eight bytes, which could pay for itself in a file as small as 12 bytes.

Edward 67.2k4 gold badges120 silver badges284 bronze badges · Answer 1 · 2017-05-23 12:40:49Z

Here are some things I see that may help you improve your code.

Shouldn't a compressed file be smaller?

Imagine my surprise when I discovered that a 4057-byte file (the source code itself) became 18720 bytes when compressed! Then I looked at the "compressed" file and saw that it was composed of ASCII'1' and'0'characters rather than binary. If that had been converted to binary output, that would be 2340 bytes which actually is smaller than the input. I'd suggest emitting binary rather than ASCII representation of binary which is 8x the size.

Pass by const reference where practical

The first argument tocompress is astd::string but that causes the entire input buffer to be duplicated. Better would be to make itconst std::string & because it is not modified and it doesn't need to be duplicated. Thefor loop then becomes

for(const char c: input) {    // ...}

Note that this is aconst char notconst char & because on most machines it's quicker and smaller to simply use achar rather than a reference to one.

Thefilename argument should also beconst string &.

Remember to free allocated memory

The program shoulddelete[] memblock at the end ofmain. Better yet, see the following suggestion

Eliminate buggy`convert_char_to_string()`

The code for converting from a char pointer to a string as currently in the code is this:

string convert_char_to_string(const char *pCh, int arraySize){    string str;    if (pCh[arraySize-1] == '\0')         str.append(pCh);    else         for(int i=0; i<arraySize; i++)             str.append(1,pCh[i]);    return str;}

However this has a few problems. First is the bug that occurs if the passed buffer has two'\0' characters in it, one of which is at the end. The bug is that this code will only copy up to the first'\0' character, silently truncating the input buffer. The second problem is the loop. As written, it will probably reallocate and copy the contents ofstr many times as it grows. That pointlessly slows the program because you already know how big the string should be. Better still would be to eliminate it entirely, and eliminatememblock by copying the file directly into the string:

string input = string(std::istreambuf_iterator<char>(file),    std::istreambuf_iterator<char>());

Note that there arefaster ways to do this, but none as succinct.

Use C++11 compile switches

The program uses C++ 11 constructs, so I'd recommend changing the command-line compilation suggestion in the commment to this instead:

g++ -std=c++11 lzw.c -o lzw

Prefer`constexpr` to old-style`#define`

Rather than using a#define forMAX_DEF the code could use aconstexpr:

constexpr std::size_t MAX_DEF = 4096;

It doesn't make a lot of difference here, but generally the advantage is that the value has an associated type. Where it does make a difference is in the next suggestion:

Make relationships between constants explicit

The numbers 12 and 4096 in the program are associated, but that association is implicit rather than explicit. It would be better to make the relationship between the two constants explicit:

constexpr std::size_t BITS=12;constexpr std::size_t MAX_DEF=1<<BITS;

Now all places that currently use the "magic number" 12 should use the constantBITS instead. Better still, would be to have aCompressor class for which these would be member variables, andcompress anddecompress would be member functions.

Avoid using the same unnamed constant multiple times

The code currently initializescompress_dictionary and then setsnext_code to 256. This would be a little more straightforward and avoids having the same constant multiple places in the same routine:

unsigned next_code;for ( next_code = 0 ; next_code < 256 ; next_code++ ){    compress_dictionary[string(1,next_code)] = next_code;}

This is advantageous because there is no way to make a mistake in typing the constant two different ways. A similar change should be made todecompress which also eliminates another ugly peculiarity, which is to have declaredi as two different types within the same function.

Use a`for` loop instead of`while` where it makes sense

In thedecompress routine, thewhile loop would be better expressed as afor loop. Instead of this:

int i = 0;while (i<size){    //    i+=12;    //}

Use this:

for (int i=0; i < size; i += BITS) {    // }

Bail out early on error

The code currently checks to see if the file is open and then has anelse clause way at the bottom ofmain to handle an error. Better would be to bail out early on error, and not enclose the entire program in anif clause:

if (!file){    cout << "Unable to open file." << endl;    show_usage();}

Also note that I've used!file for the condition. This is the usual C++ idiom for file error checking.

Use all appropriate`#include`s

The program usesmake_pair which is actually defined in<utility>. For that reason, there should be a line

    #include <utility>

Eliminate unused variables

Thecode variable incompress is not used. Also, thesize parameter is not used, but it should be. I'd recommend eliminatingcode andsize. If you need the size, as for thedecompress routine, you can simply useinput.size().

Omit`return 0`

When a C++ program reaches the end ofmain the compiler will automatically generate code to return 0, so there is no reason to putreturn 0; explicitly at the end ofmain.

Excellent, but I'd also add a suggestion of emitting only used code instead of all 256. This can make low color images/data compress even better. This is roughly how GIF works.
@phyrfox: I'm not sure I understand your suggestion. The dictionary is not emitted; only used code are.
I'd simply advise creating a dictionary based on used symbols, and emit this dictionary as part of the file (and load on decompression). Starting with eight bit codes might not always be efficient. Text files would arguably benefit, as well as images with fewer colors. For example, you could probably save two bits per code with a typical text file from the start. Emitting the dictionary could be done in just eight bytes, which could pay for itself in a file as small as 12 bytes.

Movatterモバイル変換

Stack Exchange Network

Simple LZW compression algorithm

1 Answer1

Shouldn't a compressed file be smaller?

Pass by const reference where practical

Remember to free allocated memory

Eliminate buggy`convert_char_to_string()`

Use C++11 compile switches

Prefer`constexpr` to old-style`#define`

Make relationships between constants explicit

Avoid using the same unnamed constant multiple times

Use a`for` loop instead of`while` where it makes sense

Bail out early on error

Use all appropriate`#include`s

Eliminate unused variables

Omit`return 0`

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS

Movatterモバイル変換

Simple LZW compression algorithm

1 Answer1

Shouldn't a compressed file be smaller?

Pass by const reference where practical

Remember to free allocated memory

Eliminate buggyconvert_char_to_string()

Use C++11 compile switches

Preferconstexpr to old-style#define

Make relationships between constants explicit

Avoid using the same unnamed constant multiple times

Use afor loop instead ofwhile where it makes sense

Bail out early on error

Use all appropriate#includes

Eliminate unused variables

Omitreturn 0

You mustlog in to answer this question.

Related

Hot Network Questions

Eliminate buggy`convert_char_to_string()`

Prefer`constexpr` to old-style`#define`

Use a`for` loop instead of`while` where it makes sense

Use all appropriate`#include`s

Omit`return 0`