An implementation of a simple compression algorithm that's featured in a programming practice book.
My goals:
- Robust: All error conditions must be handled properly. The specification indicated in the self-documenting comment must be met exactly.
- Secure: No opportunities for buffer overflow in the program.
- Efficient: The implementation must be at most O(n) in time, be as efficient as possible in time, and result in using as little memory as possible.
- Compatible: The implementation should be able to compile and run on any desktop, workstation, or server with a C99 compiler.
- Well documented: The implementation should be clearly documented and re-usable.
Yes, this algorithm would probably never be used production. It was a test of my ability to write production quality C code. Is this efficient and secure C? Please provide any and all criticisms, including the way in which I documented the code.
a3compress.h:
#ifndef _A3COMPRESS_H_#define _A3COMPRESS_H_/* Compress an input string in the following manner: For each contiguous range of repeated chars, print the char followed by a decimal number indicating the number of repetitions. NOTE: This function calls malloc, it returns NULL if malloc failed. In any other case the return value must be freed. Example: input: aaabbbbbccccaaa output: a3b5c4a3 If the string contains a digit (0 - 9), the string will not be compressed, because the resulting compressed string would be ambiguous. A copy of the string is returned that must be freed. If the compressed string would not be of shorter length than the original string, a copy of the string is returned that must be freed. */char * const a3compress (const char * const str);#endif /* _A3COMPRESS_H_ */a3compress.c:
#include <stdio.h>#include <stdlib.h>#include <string.h>#include "a3compress.h"inline unsigned int count_digits (unsigned int n) { if (n == 0) {return 1;} unsigned int test = 1; unsigned int digits = 0; while (n >= test) { test *= 10; ++digits; } return digits;}/* Compress an input string in the following manner: For each contiguous range of repeated chars, print the char followed by a decimal number indicating the number of repetitions. NOTE: This function calls malloc, it returns NULL if malloc failed. In any other case the return value must be freed. Example: input: aaabbbbbccccaaa output: a3b5c4a3 If the string contains a digit (0 - 9), the string will not be compressed, because the resulting compressed string would be ambiguous. A copy of the string is returned that must be freed. If the compressed string would not be of shorter length than the original string, a copy of the string is returned that must be freed. */char * const a3compress (const char * const str) { size_t len = strlen(str); char * compressed; /* Determine the size of the compressed string */ char curr = str[0]; size_t cnt = 1; size_t idx = 0; for (size_t i = 1; i < len; ++i) { /* If the current char is a decimal digit, abort. The compressed string would be ambiguous */ if (str[i] >= '0' && str[i] <= '9') {goto abort;} /* If the current char is the same as the previous char, and this isn't the last char in the input string, increase the count */ if ((str[i] == curr) && (i != len - 1)) { ++cnt; } else { /* If this is the last char in the input string, increase the count */ if (i == len - 1) {++cnt;} ++idx; if (idx > len - 1) {goto abort;} /* Add the number of digits in the current count */ size_t digits = count_digits(cnt); idx += digits; if (idx > len - 1) {goto abort;} /* Begin counting repeats of the newly seen char */ curr = str[i]; cnt = 1; } } compressed = (char *)malloc((idx + 1) + 1); if (compressed == NULL) {return compressed;} curr = str[0]; cnt = 1; idx = 0; for (size_t i = 1; i < len; ++i) { /* If the current char is the same as the previous char, and this isn't the last char in the input string, increase the count */ if ((str[i] == curr) && (i != len - 1)) { ++cnt; } else { /* If this is the last char in the input string, increase the count */ if (i == len - 1) {++cnt;} /* Add the current char to the output string */ compressed[idx++] = curr; /* Add the current count to the output string */ size_t digits = count_digits(cnt); char digit_s[digits + 1]; size_t digit_len = snprintf(digit_s, digits + 1, "%d", (unsigned int)cnt); idx += strlcpy(&compressed[idx], digit_s, digit_len + 1); /* Begin counting repeats of the newly seen char */ curr = str[i]; cnt = 1; } } return compressed;abort: compressed = (char *)malloc(len + 1); if (compressed == NULL) {return compressed;} strlcpy(compressed, str, len + 1); return compressed;}test.c
#include <stdio.h>#include <stdlib.h>#include "a3compress.h"int main (const int argc, char * const argv[]) { char * line = NULL; size_t linecap = 0; ssize_t linelen; linelen = getline(&line, &linecap, stdin); if (linelen <= 0) {perror(NULL), exit(1);} if (line[linelen - 1] == '\n') {line[linelen - 1] = '\0';} char * const compressed = a3compress(line); if (compressed == NULL) {perror(NULL), exit(1);} free(line); printf("%s\n", compressed); free(compressed);}- 2\$\begingroup\$This seems to be a variant ofRLE. Is there a reason to re-invent the wheel? Generally, trying to roll your own compression, hash or encryption algorithm is a bad idea since chances are very slim that you can improve on the existing stuff unless you're an expert in that field. But chances are high you're making something inefficient/insecure.\$\endgroup\$DarkDust– DarkDust2014-09-04 08:58:07 +00:00CommentedSep 4, 2014 at 8:58
- \$\begingroup\$Oh, never mind. I've overlooked your"It was a test of my ability to write production quality C code." statement.\$\endgroup\$DarkDust– DarkDust2014-09-04 09:03:43 +00:00CommentedSep 4, 2014 at 9:03
- 1\$\begingroup\$Better algorithm
<char Sequence => '<char><count>'+Where<count>is an actual number (not the text version of a number), remember that a char is just a very small integer (8 bits). Because you are using the text representation of a number you are using 8bits to represent 4 1/2 bits so you are wasting a lot of bits. There is a limit since a char can hold 0-256 this encoding breaks into sequence that are a max of 256 characters long: But 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' => 'AA' (A is the 65 character so that). This also allows number to be encoded.\$\endgroup\$Loki Astari– Loki Astari2014-09-05 16:39:42 +00:00CommentedSep 5, 2014 at 16:39 - 1\$\begingroup\$Note: Compressed text is usually not human readable so where
<count>is a non printable character it does not really matter.\$\endgroup\$Loki Astari– Loki Astari2014-09-05 16:42:07 +00:00CommentedSep 5, 2014 at 16:42
4 Answers4
Comments
Thecount_digits function could need an explanation (what is its purpose?).
A lot of your comments describe what the code does. This is redundant, I can read the code to see what your code is doing. Instead, a comment should describewhy something is done orwhy something is done in this specific way.
For example, the comment"If this is the last char in the input string, increase the count" is not very helpful. I can see that the code is doing that, but the question is: why is count increased only for the last character?
malloc
Don't cast themalloc return type..
Also, whymalloc((idx + 1) + 1) ? Why not+2, and what's the reason for adding 2 in the first place? It needs a comment.
Logic
You're not handling an empty string correctly. You'll allocate 3 bytes instead of 1 byte and won't assign the 0 termination, which might lead to crashes or at least funny garbage when printing.
Return value
In case something is wrong (like invalid input string123) you return a copy of the original string. But this is likely to be invalid or nonsense input for your decompression. So you can't distinguish between a valid and invalid return value. I would expect to haveNULL returned on error anderrno set to a reasonable value likeEINVAL.
Standard conformance
strlcpy is a BSD function and probably not available everywhere. Since you've explicitly listed "compability" and only mentioned C99, you might want to usestrncpy instead. But be careful to correctly append the 0 termination (see also the problem with the empty string).
- \$\begingroup\$Thank you for your suggestions. It was interesting to read Ulrich Drepper's opinion on
strlcpy\$\endgroup\$OregonTrail– OregonTrail2014-09-04 19:22:50 +00:00CommentedSep 4, 2014 at 19:22 - \$\begingroup\$
strlcpyandstrncpyonly have comparable behavior by happenstance when the buffer is exactly one element longer than the source-string. So no, don't use them as replacements for each other.\$\endgroup\$Deduplicator– Deduplicator2017-07-21 13:40:58 +00:00CommentedJul 21, 2017 at 13:40
A few notes:
You use
size_teverywhere else in your program besides incount_digits()where you useunsigned int. I would switch over and usesize_tfor consistency and to get rid of the compiler warnings. If you have a reason not to, you should leave some sort of comment stating why.You do not modify the parameter passed to
count_digits, so it should be declared asconst.If you want to, you could "simplify" your
whileloops into aforloop.while (n >= test) { test *= 10; ++digits;}How I might do it (up to your discretion on whether or not you want to do this or not):
for (; n >= test; digits++){ test *= 10;}Cast
cntinto asize_tfor consistency:size_t digit_len = snprintf(digit_s, digits + 1, "%zu", (size_t) cnt);You are inconsistent on where you put your
constint main (const int argc, char * const argv[])Either always put it before you declare the type, or after.
You should always initialize your variables when you declare them. I noticed specifically the variable
compressed.
TL;DR:

- \$\begingroup\$Thanks for your suggestions. By the way,
const char * const argv[]is not the same aschar * const argv[]. The former means that both the pointers and their contents must not be modified, the latter means that only the pointers must not be modified. When I wrotechar * const argv[], it means that the contents of arguments may be modified, but the pointers must not, this was my intention.\$\endgroup\$OregonTrail– OregonTrail2014-09-04 19:28:24 +00:00CommentedSep 4, 2014 at 19:28 - \$\begingroup\$@OregonTrail Agreed, but like I stated: consistency is key. If you deviate from your consistency than you need to state why in a comment.\$\endgroup\$syb0rg– syb0rg2014-09-04 19:29:31 +00:00CommentedSep 4, 2014 at 19:29
- \$\begingroup\$I respectfully disagree that appropriate use of
constto indicate whether a pointer isconstor its content isconstis a matter of consistency. It has self-documenting semantics.\$\endgroup\$OregonTrail– OregonTrail2014-09-04 19:32:11 +00:00CommentedSep 4, 2014 at 19:32 - \$\begingroup\$@OregonTrail To a beginner, I don't think they would be able to read that self-documentation. You should always leave code in a highly understandable and maintainable state as possible. If you were to leave this code in the hands of a programmer still getting their grips on the language, I would highly doubt they would know the difference between
constindicating if the pointer or the content isconst. It's up to you to make the changes, but that's my opinion on it.\$\endgroup\$syb0rg– syb0rg2014-09-04 19:35:22 +00:00CommentedSep 4, 2014 at 19:35 - \$\begingroup\$This would mean adding an explanation above every signature for
mainthat I write, which I don't intend to adopt. A comment explaining the syntax of the language is unnecessary, like writing the commentx = ++i; /* increments i, then stores the result in x */\$\endgroup\$OregonTrail– OregonTrail2014-09-04 19:38:05 +00:00CommentedSep 4, 2014 at 19:38
A few comments.
Your function does more or less the same thing twice. If you want to do that, then I would seperate the two halves into functions. But actually I would instead just allocate memory of the same length as the input string and encode into it. You can then call
reallocat the end to recover wasted memory, if necessary.You have duplicated the description of the function. Either put it in theheader or in the C file. If you put it in both the descriptions willdiverge over time.
I don't see why the function must return a
conststring.Your
count_digitsshould bestaticnotinline. Whether toinlinecan be decided better by the compiler.Use
isdigitto detect numeric charsYour comments are rather noisy - conveying little of use. If you usefunctions the code becomes self documenting - eg if your code that does the"Add the current count to the output string" was extracted to a functionnamed
add_countwith the string and number as parameters it would be obvious what it was doing without a comment.
Here is how I might approach the function (not tested)
char* a3compress (const char* const s){ const size_t len = strlen(s); char *res = calloc(len + 1, 1); if (res) { char *comp = res; for (int span = 1, n = 0; n < len; n += span) { if (isdigit(s[n])) { /* digits confuse the string encoding */ memcpy(res, s, len + 1); break; } *comp++ = s[n]; span = char_span(s, s[n]); if (span > 1) { comp += write_char_count(comp, span); } } assert(comp <= res + len); } return res;}The two functions do the obvious things. Theassert is a sanity check -the logic of the function doesn't allow the assertion to fail, but the coding might be wrong.
count_digit() has trouble withn nearUINT_MAX astest *= 10 can overflow. Further, a specialif (n == 0) is not needed.
#define MAXUPOWER10 1000000000u#define MAXUPOWER10LOG 9inline unsigned count_digits(unsigned n) { if (n >= MAXUPOWER10) return MAXUPOWER10LOG + 1; unsigned test = 1; for (unsigned digits = 1; ; digits++) { // or do loop test *= 10; if (n < test) break; } return digits;}Certain other algorithmic simplifications possible.
You mustlog in to answer this question.
Explore related questions
See similar questions with these tags.

