So, I'm implementing the LZ77 compression algorithm. To compress any file type, I take its binary representation and read it as `char`s (since one `char` equals one byte, AFAIK) into a `std::string`. The current version of the program compresses and decompresses files (.txt, .bmp, etc.) just fine: the size of the raw file in bytes matches the size of the uncompressed file. I've started to wonder, though, whether working with bytes instead of bits is optimal at all:
- Is it optimal to use `char`s (bytes) instead of single bits? Is there any possible loss of bits?
- Also, is there a way to compare file sizes in bits instead of bytes? (Forgive the basic questions.)
And now for the actual code part. Here's a short summary of how LZ77 handles compression:

Below are the two main functions, `compress` and `findLongestMatch`:
- `compress` moves char data between the two buffers and saves the encoded tuple ⟨offset, length, next char⟩
- `findLongestMatch` finds the longest match of `lookheadBuffer` in `historyBuffer`
- Is there a more elegant and efficient way of searching for the longest match?
(Also, in theory the algorithm should search right to left, but is there any complexity difference? Even if the offset ends up larger, it's still an `int` and still 4 bytes – I convert every `int` into 4 `char`s (bytes) to save into the output binary file.)
```cpp
LZ77::Triplet LZ77::slidingWindow::findLongestPrefix()
{
    // Minimal tuple (if no match > 1 is found)
    Triplet n(0, 0, lookheadBuffer[0]);

    size_t lookCurrLen = lookheadBuffer.length() - 1;
    size_t histCurrLen = historyBuffer.length();

    // Increase the substring (match) length on every iteration
    for (size_t i = 1; i <= std::min(lookCurrLen, histCurrLen); i++)
    {
        // Get the substring
        std::string s = lookheadBuffer.substr(0, i);

        size_t pos = historyBuffer.find(s);
        if (pos == std::string::npos)
            break;

        if ((historyBuffer.compare(histCurrLen - i, i, s) == 0)
            && (lookheadBuffer[0] == lookheadBuffer[i]))
            pos = histCurrLen - i;

        // If the longest match is found, check if there are any repeats
        // following the current longest substring in lookheadBuffer
        int extra = 0;
        if (histCurrLen == pos + i)
        {
            // Check for full following repeats
            while ((lookCurrLen >= i + extra + i)
                && (lookheadBuffer.compare(i + extra, i, s) == 0))
                extra += i;

            // Check for partial following repeats
            int extraextra = i - 1;
            while (extraextra > 0)
            {
                if ((lookCurrLen >= i + extra + extraextra)
                    && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
                    break;
                extraextra--;
            }
            extra += extraextra;
        }

        // Compare the lengths of matches
        if (n.length <= i + extra)
            n = Triplet(histCurrLen - pos, i + extra, lookheadBuffer[i + extra]);
    }
    return n;
}

void LZ77::compress()
{
    do
    {
        if ((window.lookheadBuffer.length() < window.lookBufferMax)
            && (byteDataString.length() != 0))
        {
            int len = window.lookBufferMax - window.lookheadBuffer.length();
            window.lookheadBuffer.append(byteDataString, 0, len);
            byteDataString.erase(0, len);
        }

        LZ77::Triplet triplet = window.findLongestPrefix();

        // Move the used part of lookheadBuffer to historyBuffer
        window.historyBuffer.append(window.lookheadBuffer, 0, triplet.length + 1);
        window.lookheadBuffer.erase(0, triplet.length + 1);

        // If historyBuffer's size exceeds max, delete the oldest part
        if (window.historyBuffer.length() > window.histBufferMax)
            window.historyBuffer.erase(0, window.historyBuffer.length() - window.histBufferMax);

        encoded.push_back(triplet);
    } while (window.lookheadBuffer.length());
}
```

Accessory functions:
```cpp
int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);

    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}

struct Triplet
{
    int offset;
    int length;
    char next;
};
```

1 Answer
In answer to your first question: it depends on whether you wish to optimise for speed or for compression ratio. If optimising for speed, using bytes seems best, as that is what the LZ4 algorithm does; LZ4 is a variant of LZ77, highly optimised for speed. If optimising for compression ratio, I am unsure which would be better, as I have never run a bitwise LZ77 compressor.
In answer to your second question: instead of having your `historyBuffer.find()` method return the first position of a match, how about returning a list of all matching `Triplet` positions? If you find a match, you know you will perform another iteration of the loop (provided `i` is not at its maximum value, which is unlikely). On the next iteration, instead of scanning your entire sliding window for a match, simply check whether any of the phrases in your list of `Triplet`s still match once the string has the additional character appended. This works because a match longer than the current match must build upon either the current match or some other equally long match. This way, you don't redo work that has already been done. This approach means you can get rid of the lines
```cpp
if ((historyBuffer.compare(histCurrLen - i, i, s) == 0)
    && (lookheadBuffer[0] == lookheadBuffer[i]))
    pos = histCurrLen - i;

// If the longest match is found, check if there are any repeats
// following the current longest substring in lookheadBuffer
int extra = 0;
if (histCurrLen == pos + i)
{
    // Check for full following repeats
    while ((lookCurrLen >= i + extra + i)
        && (lookheadBuffer.compare(i + extra, i, s) == 0))
        extra += i;

    // Check for partial following repeats
    int extraextra = i - 1;
    while (extraextra > 0)
    {
        if ((lookCurrLen >= i + extra + extraextra)
            && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
            break;
        extraextra--;
    }
    extra += extraextra;
}
```

without losing any performance.
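The candidate-list idea can be sketched as follows. This is not the author's code: the function name and signature are invented, and for brevity it omits the overlap/repeat extension into the lookahead buffer that the original performs.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch: collect every position in `history` where the first lookahead
// character occurs, then grow the match length one character at a time,
// keeping only the candidates that still match. Returns (offset measured
// back from the end of history, match length); (history.size(), 0) if no
// candidate exists.
std::pair<std::size_t, std::size_t>
longestMatch(const std::string& history, const std::string& look)
{
    std::vector<std::size_t> cand;
    for (std::size_t p = 0; p < history.size(); ++p)
        if (!look.empty() && history[p] == look[0])
            cand.push_back(p);

    std::size_t len  = cand.empty() ? 0 : 1;
    std::size_t best = cand.empty() ? 0 : cand.front();
    while (!cand.empty() && len < look.size())
    {
        // Keep only candidates that also match the next character.
        std::vector<std::size_t> next;
        for (std::size_t p : cand)
            if (p + len < history.size() && history[p + len] == look[len])
                next.push_back(p);
        if (next.empty())
            break;
        cand = std::move(next);
        best = cand.front();
        ++len;
    }
    return {history.size() - best, len};
}
```

Each iteration shrinks the candidate set instead of rescanning the whole window, which is exactly the "don't redo finished work" point above.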
With regards to your question about the complexity of searching left-to-right or vice versa, the time complexity of the two will be identical.
One final note, however: incorporating my suggestion, where you look for multiple matches rather than one, will influence which string-searching algorithm is optimal to implement in your `historyBuffer.find()` method. The Rabin-Karp substring matching algorithm is generally best for finding multiple matches. It uses hashing to discard parts of the `historyBuffer` which will definitely not match the substring, leaving you to easily check the parts of the `historyBuffer` which are likely to match. However, if you are simply finding one match, as your current implementation does, then the Boyer-Moore algorithm is your best choice.
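A compact Rabin-Karp sketch for finding all match positions (the function name is invented; the hash base is an arbitrary choice, and arithmetic wraps modulo 2^64 via unsigned overflow):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Rabin-Karp: find every position where `pat` occurs in `text`, using a
// rolling polynomial hash so most non-matching windows are skipped with
// a single integer comparison.
std::vector<std::size_t> findAll(const std::string& text, const std::string& pat)
{
    std::vector<std::size_t> hits;
    const std::size_t m = pat.size(), n = text.size();
    if (m == 0 || m > n)
        return hits;

    const std::uint64_t B = 257;           // hash base (arbitrary)
    std::uint64_t hp = 0, ht = 0, pow = 1; // pattern/window hashes, B^(m-1)
    for (std::size_t i = 0; i < m; ++i)
    {
        hp = hp * B + static_cast<unsigned char>(pat[i]);
        ht = ht * B + static_cast<unsigned char>(text[i]);
        if (i) pow *= B;
    }
    for (std::size_t i = 0;; ++i)
    {
        // Verify on a hash hit: hashes can collide, the compare cannot.
        if (hp == ht && text.compare(i, m, pat) == 0)
            hits.push_back(i);
        if (i + m == n)
            break;
        // Roll the window one character to the right.
        ht = (ht - static_cast<unsigned char>(text[i]) * pow) * B
           + static_cast<unsigned char>(text[i + m]);
    }
    return hits;
}
```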
- 1 \$\begingroup\$ Thank you for your thorough answer. I'm kinda new to compression algorithms, so I hope you don't mind a couple more questions: 1. Is there a chance for data (bit) loss when reading blobs (bytes)? 2. Any way to compare raw and uncompressed files bitwise? \$\endgroup\$ – asymmetriq, Dec 2, 2019 at 13:37
- 1 \$\begingroup\$ 1. No chance to lose any bits of information when reading bytes as opposed to bits. Just make sure you don't perform signed arithmetic with unsigned variables. \$\endgroup\$ – waitaria, Dec 2, 2019 at 23:19
- 2 \$\begingroup\$ 2. You could read the two files each into an array of bytes and compare them bytewise. Note that bitwise AND alone cannot prove equality: `(c_1[i] & c_2[i]) == c_1[i]` only shows that every set bit of `c_1[i]` is also set in `c_2[i]`. Use XOR instead: if `(c_1[i] ^ c_2[i]) == 0`, the two arrays are equal at byte `i`; otherwise, the set bits of the XOR mark exactly the bit positions at which `c_1` and `c_2` differ. Does this answer your question? \$\endgroup\$ – waitaria, Dec 2, 2019 at 23:26
