Chunking strings to binary output

Question 1

A follow-up to this question isChunking strings to binary block-based output

I have code which takes a text file as input and creates a special binary output form of the input. Specifically, the test input I'm using is the plain text of Samuel Coleridge's poem"Kubla Khan". Here are the first few lines, shown here with line numbers which are only for reference and not actually part of the text:

 1 Kubla Khan 2 By Samuel Taylor Coleridge 3  4 Or, a vision in a dream. A Fragment. 5 6 In Xanadu did Kubla Khan 7 A stately pleasure-dome decree: 8 Where Alph, the sacred river, ran 9 Through caverns measureless to man10    Down to a sunless sea.11 So twice five miles of fertile ground12 With walls and towers were girdled round;13 And there were gardens bright with sinuous rills,14 Where blossomed many an incense-bearing tree;15 And here were forests ancient as the hills,16 Enfolding sunny spots of greenery.

Here is a sample hex dump of the output with some annotations. See the "Processing" section for an explanation of this data structure.

         Block 000000000: be ad ca fe 0a 4b 75 62 6c 61 20 4b 68 61 6e 1a  .....Kubla Khan.         | signature | n| first line ...              | n|00000010: 42 79 20 53 61 6d 75 65 6c 20 54 61 79 6c 6f 72  By Samuel Taylor         | second line...00000020: 20 43 6f 6c 65 72 69 64 67 65 00 24 4f 72 2c 20   Coleridge.$Or,          |                             | n| n| fourth...  |00000030: 61 20 76 69 73 69 6f 6e 20 69 6e 20 61 20 64 72  a vision in a dr         | line ...                                       |00000040: 65 61 6d 2e 20 41 20 46 72 61 67 6d 65 6e 74 2e  eam. A Fragment.         | still the fourth line.                         |00000050: 00 18 49 6e 20 58 61 6e 61 64 75 20 64 69 64 20  ..In Xanadu did          | n| n| sixth line...00000060: 4b 75 62 6c 61 20 4b 68 61 6e 1f 41 20 73 74 61  Kubla Khan.A sta00000070: 74 65 6c 79 20 70 6c 65 61 73 75 72 65 2d 64 6f  tely pleasure-do00000080: 6d 65 20 64 65 63 72 65 65 3a 21 57 68 65 72 65  me decree:!Where00000090: 20 41 6c 70 68 2c 20 74 68 65 20 73 61 63 72 65   Alph, the sacre000000a0: 64 20 72 69 76 65 72 2c 20 72 61 6e 22 54 68 72  d river, ran"Thr000000b0: 6f 75 67 68 20 63 61 76 65 72 6e 73 20 6d 65 61  ough caverns mea000000c0: 73 75 72 65 6c 65 73 73 20 74 6f 20 6d 61 6e 19  sureless to man.000000d0: 20 20 20 44 6f 77 6e 20 74 6f 20 61 20 73 75 6e     Down to a sun000000e0: 6c 65 73 73 20 73 65 61 2e 25 53 6f 20 74 77 69  less sea.%So twi         | end of tenth line        | n| eleventh line    |000000f0: 63 65 20 66 69 76 65 20 6d 69 6c 65 3e f2 d5 86  ce five mile>...         |  middle of eleventh line          | checksum   |         Block 100000100: be ad ca fe 73 20 6f 66 20 66 65 72 74 69 6c 65  ....s of fertile         | signature | middle of eleventh line            |00000110: 20 67 72 6f 75 6e 64 29 57 69 74 68 20 77 61 6c   ground)With wal         | end eleventh line  | n| twelfth line ...       |

Processing

Each line of the text is turned into acounted string (also sometimes called a "Pascal string" after the way that language stores strings). A counted string is a singleuint8_t count\$n\$, followed by\$n\$ bytes of the string. No line is more than 255 characters long and a count of zero indicates a blank line.

Counted string format

$$\begin{array}{l|c|l}\text{name} & \text{length in bytes} & \text{description} \\\hline\text{count} & 1 & \text{count of bytes that follow, range 0-255} \\\text{string} & 0..255 & \text{string may or may not have NUL terminator} \\\end{array}$$

Then those counted strings are output as a series ofBlocks. ABlock is a 256-byte chunk which starts with a fixed 4-byte block identifier and ends with auint32_t checksum which is the simple checksum of all of the other data as though it were a series ofuint32_t numbers, ignoring overflow.

Block format

$$\begin{array}{l|c|l}\text{name} & \text{length in bytes} & \text{description} \\\hline\text{signature} & 4 & \text{fixed 0xfecaadbe} \\\text{data} & 248 & \text{the data} \\\text{checksum} & 4 & \text{checksum of block as 32-bit unsigned value} \\& & \text{with same endian-ness as signature} \\\hline\text{Block} & 256 & \text{total block size} \\\end{array}$$

Questions

The code I have works as intended, but I'm left with the nagging feeling that it is fundamentally the wrong approach. For instance, in this code, the entire data is read and created as astd::strstream but I can anticipate that at some point I am going to want to process things on the fly, as from a named pipe or TCP stream where rewinding won't be possible. I thought about chaining two independent streams, one which feeds the other but I'm not sure how to approach that. Should I derive my ownostream? Twoostreams? Maybestreambuf?

encode.cpp

#include <iostream>#include <fstream>#include <string>#include <sstream>#include <algorithm>#include <array>/* * The stream format consists of blocks, each 256 bytes long. * Each block begins with a fixed 4-byte block identifier and * ends with a fixed 4-byte checksum.  Everything between  * them is data. * * The data is in the form of counted strings.  A counted  * string is a one byte unsigned integer `n` followed by * that many bytes of data.  A counted string may or may not * be NUL character terminated. */class Block {public:    static constexpr std::size_t mysize{0x100};    friend std::istream& operator>>(std::istream& in, Block& blk) {        blk.clear();        in.read(reinterpret_cast<char *>(&blk.data), blk.datasize);        blk.checksum = blk.sumcalc();        return in;    }    friend std::ostream& operator<<(std::ostream& out, const Block& blk) {        out.write(reinterpret_cast<const char *>(&blk.id), sizeof(blk.id));        out.write(reinterpret_cast<const char *>(&blk.data), blk.datasize);        out.write(reinterpret_cast<const char *>(&blk.checksum), sizeof(blk.checksum));        return out;    }private:    void clear() {        std::fill(data.begin(), data.end(), 0);     }    uint32_t sumcalc() {        uint32_t sum{id};        auto n{datasize/sizeof(uint32_t)};        for (uint32_t *ptr = reinterpret_cast<uint32_t *>(&data); n; ++ptr) {            sum += *ptr;            --n;        }        return sum;    }    uint32_t id = 0xfecaadbe;    uint32_t checksum = 0;    static constexpr std::size_t datasize{mysize - sizeof(Block::id) - sizeof(checksum)};    std::array<uint8_t, datasize> data;};int main(int argc, char *argv[]) {    std::string line;    if (argc != 3) {        std::cerr << "Usage: encode infile outfile\n";        return 1;    }    std::ifstream in(argv[1]);    std::stringstream buff;    while (std::getline(in, line)) {        // skip long lines        if (line.length() < 256) {            uint8_t n = line.length() & 0xff;            buff.put(n);            buff << line;        }    }    in.close();    // second pass    std::ofstream out(argv[2]);    buff.seekg(0);  // rewind    Block b;    while (buff >> b) {        out << b;    }    // always emit at least one block even if empty    out << b;}

Question 2

Just to make sure that I understand the problem correctly: can't you simply read a line, emit one Block, and then move on to the next line?

Question 3

No because each block typically contains multiple counted strings and often the last counted string in a block is a partial one that continues in the next block. I'll amend the question to try to make this more clear.

Question 4

I see I was misunderstanding the format. Thanks for the clear explanation!

Question 5

The code is nice and readable.

The system-dependent handling of\n characters may cause problems — we may introduce a\012 byte in the signature, count, or checksum part that gets transformed into CR-LF (on Windows) or CR (on old MacOS). I think we can simply open the output stream in binary mode, since our data won't contain\n characters that need to be handled specially.

Here are some small improvements:

void clear() {    std::fill(data.begin(), data.end(), 0); }

We can use thefill member ofstd::array to simplify the code:

data.fill(0);

or even

data = {};

reinterpret_cast<char *>(&blk.data)

This cast comes up very often. Consider making a function:

template <typename T>char* as_chars(const T& value){    return reinterpret_cast<char*>(value);}

So you can write

in.read(as_chars(blk.data), blk.datasize);// ...

You can even make a function for reading/writing if you do it often.

uint32_t id = 0xfecaadbe;

static constexpr, maybe?

std::string line;

This variable is used several lines after. It may be more readable to put it inside the loop:

for (std::string line; std::getline(in, line);) {    // ...}

if (line.length() < 256) {    uint8_t n = line.length() & 0xff;    buff.put(n);    buff << line;}

Ifline.length() < 256, thenline.length() & 0xff == line.length() right?

in.close();

You can omit this close by puttingin inside a scope. Not sure if that's better.

Block b;while (buff >> b) {    out << b;}// always emit at least one block even if emptyout << b;

It took me a while to see thatb is empty after the last failed read. Help the poor code reader by using something likeout << Block{} please :)

Question 6

I've used just about all of these suggestions, but hadn't even considered the first issue. Good review, thanks!

Question 7

I agree with you that this would be more intuitive to use by chaining streams, rather than acting as a queue that must be pushed into and pulled out of. I've never written a filtering stream like that myself, but Ithink you want to construct anostream with a customstreambuf for each filter.

I definitely think that separating the line encoding and the block packing would be a good thing, and would allow your unit tests to be much more selective, and therefore more diagnostic.

We seem to be assuming these typedefs:

using std::uint32_t;using std::uint8_t;

Reviewing themain() - it's quite restrictive to insist on two filenames (and that the output file be seekable). It would be more natural if it was willing to use standard i/o streams if no arguments are given.

    uint32_t sum{id};    auto n{datasize/sizeof(uint32_t)};    for (uint32_t *ptr = reinterpret_cast<uint32_t *>(&data); n; ++ptr) {        sum += *ptr;        --n;    }    return sum;

This looks like a candidate forstd::span:

    std::span as_u32{reinterpret_cast<std::uint32_t*>(data.begin()),                     reinterpret_cast<std::uint32_t*>(data.end())};    return std::accumulate(as_u32.begin(), as_u32.end(), std::uint32_t{});

Or, using a simple pair of iterators, for C++17 and earlier:

    auto first = reinterpret_cast<const std::uint32_t*>(data.begin());    auto last = reinterpret_cast<const std::uint32_t*>(data.end());    return std::accumulate(first, last, std::uint32_t{});

This method should probably be declaredconst.

We do have a problem here in that the data are interpreted asstd::uint32_tin the endianness of the host. That means that different platforms can generate different checksums, something generally considered undesirable.

Question 8

I particularly like the suggestion about usingstd::span - my compiler doesn't yet support that but it's coming soon.

Question 9

For this purpose, you can probably side-step the span - we're just using it as an iterator pair here. I've edited with a C++17 alternative; that's simpler, so may be better thanstd::span even where that's available.

Question 10

In fact, that’s exactly what I did in the revised code, which I will soon post.

L. F. 9,7152 gold badges27 silver badges70 bronze badges · Accepted Answer · 2020-03-04 01:28:48Z