Simple compression algorithm

Question 1

An implementation of a simple compression algorithm that's featured in a programming practice book.

My goals:

Robust: All error conditions must be handled properly. The specification indicated in the self-documenting comment must be met exactly.
Secure: No opportunities for buffer overflow in the program.
Efficient: The implementation must be at most O(n) in time, be as efficient as possible in time, and result in using as little memory as possible.
Compatible: The implementation should be able to compile and run on any desktop, workstation, or server with a C99 compiler.
Well documented: The implementation should be clearly documented and re-usable.

Yes, this algorithm would probably never be used production. It was a test of my ability to write production quality C code. Is this efficient and secure C? Please provide any and all criticisms, including the way in which I documented the code.

a3compress.h:

#ifndef _A3COMPRESS_H_#define _A3COMPRESS_H_/* Compress an input string in the following manner:   For each contiguous range of repeated chars, print the char followed by a   decimal number indicating the number of repetitions.   NOTE:        This function calls malloc, it returns NULL if malloc failed.       In any other case the return value must be freed.   Example:       input: aaabbbbbccccaaa       output: a3b5c4a3   If the string contains a digit (0 - 9), the string will not be compressed,   because the resulting compressed string would be ambiguous. A copy of the   string is returned that must be freed.   If the compressed string would not be of shorter length than the original   string, a copy of the string is returned that must be freed. */char * const a3compress (const char * const str);#endif /* _A3COMPRESS_H_ */

a3compress.c:

#include <stdio.h>#include <stdlib.h>#include <string.h>#include "a3compress.h"inline unsigned int count_digits (unsigned int n) {    if (n == 0) {return 1;}    unsigned int test = 1;    unsigned int digits = 0;    while (n >= test) {        test *= 10;        ++digits;    }     return digits;}/* Compress an input string in the following manner:   For each contiguous range of repeated chars, print the char followed by a   decimal number indicating the number of repetitions.   NOTE:        This function calls malloc, it returns NULL if malloc failed.       In any other case the return value must be freed.   Example:       input: aaabbbbbccccaaa       output: a3b5c4a3   If the string contains a digit (0 - 9), the string will not be compressed,   because the resulting compressed string would be ambiguous. A copy of the   string is returned that must be freed.   If the compressed string would not be of shorter length than the original   string, a copy of the string is returned that must be freed. */char * const a3compress (const char * const str) {    size_t len = strlen(str);    char * compressed;    /* Determine the size of the compressed string */    char curr = str[0];    size_t cnt = 1;    size_t idx = 0;    for (size_t i = 1; i < len; ++i) {        /* If the current char is a decimal digit, abort. The compressed string           would be ambiguous */        if (str[i] >= '0' && str[i] <= '9') {goto abort;}        /* If the current char is the same as the previous char, and this           isn't the last char in the input string, increase the count */        if ((str[i] == curr) && (i != len - 1)) {            ++cnt;        } else {            /* If this is the last char in the input string, increase the               count */            if (i == len - 1) {++cnt;}            ++idx;            if (idx > len - 1) {goto abort;}             /* Add the number of digits in the current count */            size_t digits = count_digits(cnt);            idx += digits;            if (idx > len - 1) {goto abort;}             /* Begin counting repeats of the newly seen char */            curr = str[i];            cnt = 1;        }    }    compressed = (char *)malloc((idx + 1) + 1);    if (compressed == NULL) {return compressed;}    curr = str[0];    cnt = 1;    idx = 0;    for (size_t i = 1; i < len; ++i) {        /* If the current char is the same as the previous char, and this           isn't the last char in the input string, increase the count */        if ((str[i] == curr) && (i != len - 1)) {            ++cnt;        } else {            /* If this is the last char in the input string, increase the               count */            if (i == len - 1) {++cnt;}            /* Add the current char to the output string */            compressed[idx++] = curr;            /* Add the current count to the output string */            size_t digits = count_digits(cnt);            char digit_s[digits + 1];            size_t digit_len = snprintf(digit_s, digits + 1, "%d",                                         (unsigned int)cnt);            idx += strlcpy(&compressed[idx], digit_s, digit_len + 1);            /* Begin counting repeats of the newly seen char */            curr = str[i];            cnt = 1;        }    }    return compressed;abort:    compressed = (char *)malloc(len + 1);    if (compressed == NULL) {return compressed;}    strlcpy(compressed, str, len + 1);    return compressed;}

test.c

#include <stdio.h>#include <stdlib.h>#include "a3compress.h"int main (const int argc, char * const argv[]) {    char * line = NULL;    size_t linecap = 0;    ssize_t linelen;    linelen = getline(&line, &linecap, stdin);    if (linelen <= 0) {perror(NULL), exit(1);}    if (line[linelen - 1] == '\n') {line[linelen - 1] = '\0';}    char * const compressed = a3compress(line);    if (compressed == NULL) {perror(NULL), exit(1);}    free(line);    printf("%s\n", compressed);    free(compressed);}

Question 2

This seems to be a variant ofRLE. Is there a reason to re-invent the wheel? Generally, trying to roll your own compression, hash or encryption algorithm is a bad idea since chances are very slim that you can improve on the existing stuff unless you're an expert in that field. But chances are high you're making something inefficient/insecure.

Question 3

Oh, never mind. I've overlooked your"It was a test of my ability to write production quality C code." statement.

Question 4

Better algorithm<char Sequence => '<char><count>'+ Where<count> is an actual number (not the text version of a number), remember that a char is just a very small integer (8 bits). Because you are using the text representation of a number you are using 8bits to represent 4 1/2 bits so you are wasting a lot of bits. There is a limit since a char can hold 0-256 this encoding breaks into sequence that are a max of 256 characters long: But 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' => 'AA' (A is the 65 character so that). This also allows number to be encoded.

Question 5

Note: Compressed text is usually not human readable so where<count> is a non printable character it does not really matter.

Question 6

Comments

Thecount_digits function could need an explanation (what is its purpose?).

A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describewhy something is done orwhy something is done in this specific way.

For example, the comment"If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?

malloc

Don't cast themalloc return type..

Also, whymalloc((idx + 1) + 1) ? Why not+2, and what's the reason for adding 2 in the first place? It needs a comment.

Logic

You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.

Return value

In case something is wrong (like invalid input string123) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to haveNULL returned on error anderrno set to a reasonable value likeEINVAL.

Standard conformance

strlcpy is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to usestrncpy instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).

Question 7

Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion onstrlcpy

Question 8

strlcpy andstrncpy only have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other.

Question 9

A few notes:

You usesize_t everywhere else in your program besides incount_digits() where you useunsigned int. I would switch over and usesize_t for consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.
You do not modify the parameter passed tocount_digits, so it should be declared asconst.
If you want to, you could "simplify" yourwhile loops into afor loop.
```
while (n >= test) {    test *= 10;    ++digits;}
```
How I might do it (up to your discretion on whether or not you want to do this or not):
```
for (; n >= test; digits++){    test *= 10;}
```

Castcnt into asize_t for consistency:

size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);

You are inconsistent on where you put yourconst
```
int main (const int argc, char * const argv[])
```
Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variablecompressed.

TL;DR:

enter image description here

Question 10

Thanks for your suggestions. By the way,const char * const argv[] is not the same aschar * const argv[]. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrotechar * const argv[], it means that the contents of arguments may be modified, but the pointers must not, this was my intention.

Question 11

@OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment.

Question 12

I respectfully disagree that appropriate use ofconst to indicate whether a pointer isconst or its content isconst is a matter of consistency. It has self-documenting semantics.

Question 13

@OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference betweenconst indicating if the pointer or the content isconst. It's up to you to make the changes, but that's my opinion on it.

Question 14

This would mean adding an explanation above every signature formain that I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the commentx = ++i; /* increments i, then stores the result in x */

Question 15

A few comments.

Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then callrealloc at the end to recover wasted memory, if necessary.
You have duplicated the description of the function. Either put it in theheader or in the C file. If you put it in both the descriptions willdiverge over time.
I don't see why the function must return aconst string.
Yourcount_digits should bestatic notinline. Whether toinlinecan be decided better by the compiler.
Useisdigit to detect numeric chars
Your comments are rather noisy - conveying little of use. If you usefunctions the code becomes self documenting - eg if your code that does the"Add the current count to the output string" was extracted to a functionnamedadd_count with the string and number as parameters it would be obvious what it was doing without a comment.

Here is how I might approach the function (not tested)

char* a3compress (const char* const s){    const size_t len = strlen(s);    char *res = calloc(len + 1, 1);    if (res) {        char *comp = res;        for (int span = 1, n = 0; n < len; n += span) {            if (isdigit(s[n])) { /* digits confuse the string encoding */                memcpy(res, s, len + 1);                break;            }            *comp++ = s[n];            span = char_span(s, s[n]);            if (span > 1) {                comp += write_char_count(comp, span);            }        }        assert(comp <= res + len);    }    return res;}

The two functions do the obvious things. Theassert is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.

Question 16

count_digit() has trouble withn nearUINT_MAX astest *= 10 can overflow. Further, a specialif (n == 0) is not needed.

#define MAXUPOWER10 1000000000u#define MAXUPOWER10LOG 9inline unsigned count_digits(unsigned n) {    if (n >= MAXUPOWER10)       return MAXUPOWER10LOG + 1;    unsigned test = 1;    for (unsigned digits = 1; ; digits++) {  // or do loop        test *= 10;        if  (n < test) break;     }    return digits;}

Certain other algorithmic simplifications possible.

DarkDust 4053 silver badges8 bronze badges · Answer 1 · 2017-05-23 12:40:49Z

Comments

Thecount_digits function could need an explanation (what is its purpose?).

A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describewhy something is done orwhy something is done in this specific way.

For example, the comment"If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?

malloc

Don't cast themalloc return type..

Also, whymalloc((idx + 1) + 1) ? Why not+2, and what's the reason for adding 2 in the first place? It needs a comment.

Logic

You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.

Return value

In case something is wrong (like invalid input string123) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to haveNULL returned on error anderrno set to a reasonable value likeEINVAL.

Standard conformance

strlcpy is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to usestrncpy instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).

Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion onstrlcpy
strlcpy andstrncpy only have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other.

syb0rg 21.9k10 gold badges113 silver badges193 bronze badges · Answer 2 · 2014-09-04 19:13:15Z

A few notes:

You usesize_t everywhere else in your program besides incount_digits() where you useunsigned int. I would switch over and usesize_t for consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.
You do not modify the parameter passed tocount_digits, so it should be declared asconst.
If you want to, you could "simplify" yourwhile loops into afor loop.
```
while (n >= test) {    test *= 10;    ++digits;}
```
How I might do it (up to your discretion on whether or not you want to do this or not):
```
for (; n >= test; digits++){    test *= 10;}
```

Castcnt into asize_t for consistency:

size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);

You are inconsistent on where you put yourconst
```
int main (const int argc, char * const argv[])
```
Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variablecompressed.

TL;DR:

enter image description here

Thanks for your suggestions. By the way,const char * const argv[] is not the same aschar * const argv[]. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrotechar * const argv[], it means that the contents of arguments may be modified, but the pointers must not, this was my intention.
@OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment.
I respectfully disagree that appropriate use ofconst to indicate whether a pointer isconst or its content isconst is a matter of consistency. It has self-documenting semantics.
@OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference betweenconst indicating if the pointer or the content isconst. It's up to you to make the changes, but that's my opinion on it.
This would mean adding an explanation above every signature formain that I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the commentx = ++i; /* increments i, then stores the result in x */

William Morris 9,41419 silver badges43 bronze badges · Answer 3 · 2014-09-07 22:45:40Z

A few comments.

Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then callrealloc at the end to recover wasted memory, if necessary.
You have duplicated the description of the function. Either put it in theheader or in the C file. If you put it in both the descriptions willdiverge over time.
I don't see why the function must return aconst string.
Yourcount_digits should bestatic notinline. Whether toinlinecan be decided better by the compiler.
Useisdigit to detect numeric chars
Your comments are rather noisy - conveying little of use. If you usefunctions the code becomes self documenting - eg if your code that does the"Add the current count to the output string" was extracted to a functionnamedadd_count with the string and number as parameters it would be obvious what it was doing without a comment.

Here is how I might approach the function (not tested)

char* a3compress (const char* const s){    const size_t len = strlen(s);    char *res = calloc(len + 1, 1);    if (res) {        char *comp = res;        for (int span = 1, n = 0; n < len; n += span) {            if (isdigit(s[n])) { /* digits confuse the string encoding */                memcpy(res, s, len + 1);                break;            }            *comp++ = s[n];            span = char_span(s, s[n]);            if (span > 1) {                comp += write_char_count(comp, span);            }        }        assert(comp <= res + len);    }    return res;}

The two functions do the obvious things. Theassert is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.

chux 36.5k2 gold badges43 silver badges97 bronze badges · Answer 4 · 2014-09-09 01:34:13Z

count_digit() has trouble withn nearUINT_MAX astest *= 10 can overflow. Further, a specialif (n == 0) is not needed.

#define MAXUPOWER10 1000000000u#define MAXUPOWER10LOG 9inline unsigned count_digits(unsigned n) {    if (n >= MAXUPOWER10)       return MAXUPOWER10LOG + 1;    unsigned test = 1;    for (unsigned digits = 1; ; digits++) {  // or do loop        test *= 10;        if  (n < test) break;     }    return digits;}

Certain other algorithmic simplifications possible.

Movatterモバイル変換

Stack Exchange Network

Simple compression algorithm

4 Answers4

Comments

malloc

Logic

Return value

Standard conformance

TL;DR:

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS