Parsing HTTP Headers in C++

Question 1

I am building a web server from scratch and trying to make certain tasks faster by embedding C code into the mix for performance. Specifically I'm worried about how thestd::string class with the.find() and other functions compare to straight pointer arithmetic.

#include <iostream>#include <map>#include <string>std::map<std::string, std::string> http_request;void parse_header( void * );int main(){    char * msg= "GET / HTTP/1.1\r\n"                "Host: 192.241.213.46:6880\r\n"                "Upgrade-Insecure-Requests: 1\r\n"                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"                "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"                "Accept-Language: en-us\r\n"                "Accept-Encoding: gzip, deflate\r\n"                "Connection: keep-alive\r\n\r\n";    parse_header( msg );}void parse_header( void *msg ){    char *head = (char *) msg;    char *mid;    char *tail = head;    if( sizeof( msg ) == 0 )    {        return;    }    // Find request type    while( *head++ != ' ');    http_request[ "Type" ] = std::string( ( char * ) msg ).substr( 0 , ( head - 1) - tail );    // Find path    tail = head;    while( *head++ != ' ');    http_request[ "Path" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );    // Find HTTP version    tail = head;    while( *head++ != '\r');    http_request[ "Version" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );    // Map all headers from a key to a value    while( true )    {        tail = head + 1;        while( *head++ != '\r' );        mid = strstr( tail, ":" );           // Look for the failed strstr           if( tail > mid )            break;        http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail  ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );    }    // Determine if successful    std::cout << http_request[ "Host" ] << std::endl;     std::cout << http_request[ "Upgrade-Insecure-Requests" ] << std::endl;     std::cout << http_request[ "Accept" ] << std::endl;     std::cout << http_request[ "User-Agent" ] << std::endl;     std::cout << http_request[ "Accept-Language" ] << std::endl;     std::cout << http_request[ "Accept-Encoding" ] << std::endl;     std::cout << http_request[ "Connection" ] << std::endl; }

Data comes in via the linuxsocket.send() function, so it is avoid * type. I directly send that data to aparse_header function to create astd::map for easy access. Is it necessary to be this low-level, or can this speed be obtained by the STL?

Note: I killed OO design for a minimal example. The end-goal has an HTTP class.

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics, I'd like it to be as conforming as possible while also being fast and lightweight. This seems to be the purest form of the cluge of C/C++.

Question 2

There is something wrong:sizeof(msg) == 0, I don't think this is what you meant.

Question 3

You are right. sizeof the pointer was not what I was meaning! Thanks!

Question 4

This is a false worry. The string implementation is efficient.

Question 5

IF you are worried abut speed usenginx

Question 6

@Loki I use nginx for the webserver, this code is for handling API requests, sorry for the confusion. I benchmarked the code against Jorge Omar Medra's example and my code comes out roughly ~20us or so faster( vs ~50us avg), which is most likely due to less function calls. I guess the question now is: the c++ code provided is more maintainable because everyone hates pointers, but is the difference in speed worth it if I'm the only programmer? I was thinking if you were getting thousands of requests a second 30us is more beneficial than 50us.

Question 7

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics

I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 isalways to "make it work"; you don't get to "make it fast, make it cheap" untilafter you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)

So:sizeof(msg)==0 is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter toget the message size.(void *msg, size_t msg_size) would do fine. In C++17 the new party line is probably to use astd::string_view.

while( *head++ != ' '); is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:

while (head != msg_end && *head == ' ') ++head;

The repeated subexpressionstd::string( ( char * ) msg ) is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++)and because you're casting awayconst./me checks the function signature again — No, you'renot casting awayconst. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature(const void *msg, size_t msg_size). Or even better:(const char *msg, const char *msg_end).

Iostreams are slow; don't usestd::cout << ... if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing functionreturn the map of headers.

Also, your variable names seem confused. I'd try to make each extracted string run fromhead totail, since that's how we usually talk about things in English. So:

void parse_header(const char *msg, const char *msg_end){    const char *head = msg;    const char *tail = msg;    // Find request type    while (tail != msg_end && *tail != ' ') ++tail;    http_request["Type"] = std::string(head, tail);    // Find path    while (tail != msg_end && *tail == ' ') ++tail;    head = tail;    while (tail != msg_end && *tail != ' ') ++tail;    http_request["Path"] = std::string(head, tail);    // Find HTTP version    while (tail != msg_end && *tail == ' ') ++tail;    head = tail;    while (tail != msg_end && *tail != '\r') ++tail;    http_request["Version"] = std::string(head, tail);    if (tail != msg_end) ++tail;  // skip '\r'    // TODO: what about the trailing '\n'?    // Map all headers from a key to a value    head = tail;    while (head != msg_end && *head != '\r') {        while (tail != msg_end && *tail != '\r') ++tail;        const char *colon = memchr(head, tail, ':');        if (colon == NULL) {            // TODO: malformed headers, what should happen?            break;        }        const char *value = colon+1;        while (value != tail && *value == ' ') ++value;        http_request[ std::string(head, colon) ] = std::string(value, tail);        head = tail+1;        // TODO: what about the trailing '\n'?    }    return http_request;}

Completely untested, mind you; but the most important things about the above code are that- it uses the appropriatestd::string constructor instead of repeatedly allocating a gigantic string and thensubstr'ing parts of it;- it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.

Another thing to try if you havelots of headers, or if you're planning to dolots of lookups in the resulting map, would be to try usingstd::unordered_map instead ofstd::map. Less pointer-chasing often means faster code.

Question 8

Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array?

Question 9

If you're interested, I also benchmarked the two which can be found herepastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is.

Question 10

@ShawnicHedgehog: I think you're still misunderstanding whatsizeof actually does. It isnot a synonym forstrlen.

Question 11

I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it.

Question 12

Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for.

Question 13

First of all, The Http server send all the headers Line by Line. In your sample i see you created a dummy sample with the next buffer:

char * msg= "GET / HTTP/1.1\r\n"            "Host: 192.241.213.46:6880\r\n"            "Upgrade-Insecure-Requests: 1\r\n"            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"            "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"            "Accept-Language: en-us\r\n"            "Accept-Encoding: gzip, deflate\r\n"            "Connection: keep-alive\r\n\r\n";

I suppose that you first recive all the header before analyzing it with your methodvoid parse_header( void *msg )... and that is the reason that the task to get each Key-Value is so hard because you are using parsers like this:

http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail  ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );

What will I do to improve your code:

I will keep themap<string,string> to store all headers.
Read the header one line by time and, before to continue with the other line, analyze it.
I don't see any issue with the code to parse the first but I will modify it to analyze is a single line and simplify the parsers.

If you read and analyze one line by line, the code to map all the keys with values could be like this:

void parseFirstLine(string line){    string key = "";    string value = "";    int  position, lpost;    // Find request type    position = line.find(' ');    http_request[ "Type" ] = line.substr(0, position);    position++; //Skip character ' '    // Find path    position = line.find(' ', lpost);    http_request[ "Path" ] = line.substr(lpost, (position-lpost));    position++; //Skip character ' '    // Find HTTP version    http_request[ "Version" ] = line.substr(position);}void parseHeader(string line){    string key = "";    string value = "";    if(data.size() == 0) return;    int posFirst = line.find(":",0); //Look for separator ':'    key = line.substr(0, posFirst);    value = line.substr(posFirst + 1);    http_request[key] = value;}

(*) This code could have some bugs because I'm writing this code without any developer environment to review this.

Is your decision where you get each line to analyze, at the moment you get it from socket reading or after finishing of read all the header.

Question 14

I think you misunderstood.. I am the Http server; this is server-side code. Data is received using the socket recv functionman7.org/linux/man-pages/man2/recv.2.html. This function requires a data buffer and a known data buffer length. To my knowledge it would be incredibly more difficult to read 1 line at a time since you have to assign a length, and you don't know the length because the first line is variable, and the specific headers required change if authorization is necessary.

Question 15

Yes, @Shawnic-Hedgehog, i understood you and it is the same case for both sides. The server always recives a Http request which the header are segment by lines followed by the contest, the response has the same structure. I have some sample to read the Request line by line. I have a sample of how to read it, but use the lib curl:github.com/jorgemedra/HTTPSClient/blob/master/HTTPSClient/…. My post is in spanish.

Quuxplusone 19.7k2 gold badges44 silver badges91 bronze badges · Accepted Answer · 2017-03-06 20:03:48Z

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics

I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 isalways to "make it work"; you don't get to "make it fast, make it cheap" untilafter you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)

So:sizeof(msg)==0 is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter toget the message size.(void *msg, size_t msg_size) would do fine. In C++17 the new party line is probably to use astd::string_view.

while( *head++ != ' '); is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:

while (head != msg_end && *head == ' ') ++head;

The repeated subexpressionstd::string( ( char * ) msg ) is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++)and because you're casting awayconst./me checks the function signature again — No, you'renot casting awayconst. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature(const void *msg, size_t msg_size). Or even better:(const char *msg, const char *msg_end).

Iostreams are slow; don't usestd::cout << ... if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing functionreturn the map of headers.

Also, your variable names seem confused. I'd try to make each extracted string run fromhead totail, since that's how we usually talk about things in English. So:

void parse_header(const char *msg, const char *msg_end){    const char *head = msg;    const char *tail = msg;    // Find request type    while (tail != msg_end && *tail != ' ') ++tail;    http_request["Type"] = std::string(head, tail);    // Find path    while (tail != msg_end && *tail == ' ') ++tail;    head = tail;    while (tail != msg_end && *tail != ' ') ++tail;    http_request["Path"] = std::string(head, tail);    // Find HTTP version    while (tail != msg_end && *tail == ' ') ++tail;    head = tail;    while (tail != msg_end && *tail != '\r') ++tail;    http_request["Version"] = std::string(head, tail);    if (tail != msg_end) ++tail;  // skip '\r'    // TODO: what about the trailing '\n'?    // Map all headers from a key to a value    head = tail;    while (head != msg_end && *head != '\r') {        while (tail != msg_end && *tail != '\r') ++tail;        const char *colon = memchr(head, tail, ':');        if (colon == NULL) {            // TODO: malformed headers, what should happen?            break;        }        const char *value = colon+1;        while (value != tail && *value == ' ') ++value;        http_request[ std::string(head, colon) ] = std::string(value, tail);        head = tail+1;        // TODO: what about the trailing '\n'?    }    return http_request;}

Completely untested, mind you; but the most important things about the above code are that- it uses the appropriatestd::string constructor instead of repeatedly allocating a gigantic string and thensubstr'ing parts of it;- it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.

Another thing to try if you havelots of headers, or if you're planning to dolots of lookups in the resulting map, would be to try usingstd::unordered_map instead ofstd::map. Less pointer-chasing often means faster code.

Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array?
If you're interested, I also benchmarked the two which can be found herepastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is.
@ShawnicHedgehog: I think you're still misunderstanding whatsizeof actually does. It isnot a synonym forstrlen.
I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it.
Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for.

Movatterモバイル変換

Stack Exchange Network

Parsing HTTP Headers in C++

2 Answers2

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS