C++ Strings are stuck with overloading existing operators. Theobvious choice for concatenation is += and +.But someone just looking at the code will see + and think "addition".He'll have to look up the types (and types are frequently buriedbehind multiple typedef's) to see that it's a string type, andit's not adding strings but concatenating them.
Additionally, if one has an array of floats, is '+' overloaded tobe the same as a vector addition, or an array concatenation?
In D, these problems are avoided by introducing a new binaryoperator ~ as the concatenation operator. It works witharrays (of which strings are a subset). ~= is the correspondingappend operator. ~ on arrays of floats would concatenate them,+ would imply a vector add. Adding a new operator makes it possiblefor orthogonality and consistency in the treatment of arrays.(In D, strings are simply arrays of characters, not a specialtype.)
Overloading of operators only really works if one of the operandsis overloadable. So the C++ string class cannot consistentlyhandle arbitrary expressions containing strings. Consider:
const char abc[5] = "world";string str = "hello" + abc;That isn't going to work. But it does work when the core languageknows about strings:
const char[5] abc = "world";char[] str = "hello" ~ abc;There are three ways to find the length of a string in C++:
const char abc[] = "world";:sizeof(abc)/sizeof(abc[0])-1:strlen(abc)string str;:str.length()That kind of inconsistency makes it hard to write generic templates.Consider D:
char[5] abc ="world";:abc.lengthchar[] str:str.lengthC++ strings use a function to determine if a string is empty:
string str;if (str.empty())// string is emptyIn D, an empty string has zero length:
char[] str;if (!str.length)// string is emptyC++ handles this with the resize() member function:
string str;str.resize(newsize);D takes advantage of knowing that str is an array, andso resizing it is just changing the length property:
char[] str;str.length = newsize;C++ slices an existing string using a special constructor:
string s1 = "hello world";string s2(s1, 6, 5);// s2 is "world"D has the array slice syntax, not possible with C++:
char[] s1 ="hello world";char[] s2 = s1[6 .. 11];// s2 is "world"Slicing, of course, works with any array in D, not just strings.
C++ copies strings with the replace function:
string s1 = "hello world";string s2 = "goodbye ";s2.replace(8, 5, s1, 6, 5);// s2 is "goodbye world"D uses the slice syntax as an lvalue:
char[] s1 ="hello world";char[] s2 ="goodbye ".dup;s2[8..13] = s1[6..11];// s2 is "goodbye world"The.dup is needed because string literals areread-only in D, the.dup will create a copythat is writable.
This is needed for compatibility with C API's. In C++, thisuses the c_str() member function:
void foo(const char *);string s1;foo(s1.c_str());In D, strings can be converted to char* using the .ptr property:
void foo(char*);char[] s1;foo(s1.ptr);although for this to work wherefoo expects a 0 terminatedstring,s1 must have a terminating 0. Alternatively, thefunctionstd.string.toStringz will ensure it:
void foo(char*);char[] s1;foo(std.string.toStringz(s1));In C++, string array bounds checking for [] is not done.In D, array bounds checking is on by default and it can be turned offwith a compiler switch after the program is debugged.
Are not possible in C++, nor is there any way to add themby adding more to the library. In D, they take the obvioussyntactical forms:
switch (str){case"hello":case"world":...}where str can be any of literal "string"s, fixed string arrayslike char[10], or dynamic strings like char[]. A quality implementationcan, of course, explore many strategies of efficiently implementingthis based on the contents of the case strings.
In C++, this is done with the replace() member function:
string str = "hello";str.replace(1,2,2,'?');// str is "h??lo"In D, use the array slicing syntax in the natural manner:
char[5] str ="hello";str[1..3] = '?';// str is "h??lo"C++ strings, as implemented by STLport, are by value and are0-terminated. [The latter is an implementation choice, butSTLport seems to be the most popular implementation.]This, coupled with no garbage collection, hassome consequences. First of all, any string created must makeits own copy of the string data. The 'owner' of the stringdata must be kept track of, because when the owner is deletedall references become invalid. If one tries to avoid thedangling reference problem by treating strings as value types,there will be a lot of overhead of memory allocation,data copying, and memory deallocation. Next, the 0-terminationimplies that strings cannot refer to other strings. Stringdata in the data segment, stack, etc., cannotbe referred to.
D strings are reference types, and the memory is garbage collected.This means that only references need to be copied, not thestring data. D strings can refer to data in the static datasegment, data on the stack, data inside other strings, objects,file buffers, etc. There's no need to keep track of the 'owner'of the string data.
The obvious question is if multiple D strings refer to the samestring data, what happens if the data is modified? All thereferences will now point to the modified data. This can haveits own consequences, which can be avoided if the copy-on-writeconvention is followed. All copy-on-write is is that ifa string is written to, an actual copy of the string data is madefirst.
The result of D strings being reference only and garbage collectedis that code that does a lot of string manipulating, such asan lzw compressor, can be a lot more efficient in terms of bothmemory consumption and speed.
Let's take a look at a small utility, wordcount, that counts upthe frequency of each word in a text file. In D, it looks like this:
import std.file;import std.stdio;int main (char[][] args){int w_total;int l_total;int c_total;int[char[]] dictionary; writefln(" lines words bytes file");for (int i = 1; i < args.length; ++i) {char[] input;int w_cnt, l_cnt, c_cnt;int inword;int wstart;input =cast(char[])std.file.read(args[i]);for (int j = 0; j < input.length; j++){char c; c = input[j];if (c == '\n')++l_cnt;if (c >= '0' && c <= '9') { }elseif (c >= 'a' && c <= 'z' ||c >= 'A' && c <= 'Z') {if (!inword){ wstart = j; inword = 1; ++w_cnt;} }elseif (inword) {char[] word = input[wstart .. j];dictionary[word]++;inword = 0; } ++c_cnt;}if (inword){char[] w = input[wstart .. input.length]; dictionary[w]++;}writefln("%8s%8s%8s %s", l_cnt, w_cnt, c_cnt, args[i]);l_total += l_cnt;w_total += w_cnt;c_total += c_cnt; }if (args.length > 2) {writefln("--------------------------------------%8s%8s%8s total", l_total, w_total, c_total); } writefln("--------------------------------------");foreach (char[] word1; dictionary.keys.sort) {writefln("%3d %s", dictionary[word1], word1); }return 0;}(Analternate implementation that uses buffered file I/O to handle larger files.)
Two people have written C++ implementations using the C++ standardtemplate library,wccpp1andwccpp2.The input filealice30.txtis the text of "Alice in Wonderland."The D compiler,dmd,and the C++ compiler,dmc,share the sameoptimizer and code generator, which provides a more apples toapples comparison of the efficiency of the semantics of the languagesrather than the optimization and code generator sophistication.Tests were run on a Win XP machine. dmc uses STLport for the templateimplementation.
| Program | Compile | Compile Time | Run | Run Time |
|---|---|---|---|---|
| D wc | dmd wc -O -release | 0.0719 | wc alice30.txt >log | 0.0326 |
| C++ wccpp1 | dmc wccpp1 -o -I\dm\stlport\stlport | 2.1917 | wccpp1 alice30.txt >log | 0.0944 |
| C++ wccpp2 | dmc wccpp2 -o -I\dm\stlport\stlport | 2.0463 | wccpp2 alice30.txt >log | 0.1012 |
The following tests were run on linux, again comparing a D compiler(gdc)and a C++ compiler (g++) that share a common optimizer andcode generator. The system is Pentium III 800MHz running RedHat Linux 8.0and gcc 3.4.2.The Digital Mars D compiler for linux (dmd)is included for comparison.
| Program | Compile | Compile Time | Run | Run Time |
|---|---|---|---|---|
| D wc | gdc -O2 -frelease -o wc wc.d | 0.326 | wc alice30.txt > /dev/null | 0.041 |
| D wc | dmd wc -O -release | 0.235 | wc alice30.txt > /dev/null | 0.041 |
| C++ wccpp1 | g++ -O2 -o wccpp1 wccpp1.cc | 2.874 | wccpp1 alice30.txt > /dev/null | 0.086 |
| C++ wccpp2 | g++ -O2 -o wccpp2 wccpp2.cc | 2.886 | wccpp2 alice30.txt > /dev/null | 0.095 |
These tests compare gdc with g++ on a PowerMac G5 2x2.0GHzrunning MacOS X 10.3.5 and gcc 3.4.2. (Timings are a littleless accurate.)
| Program | Compile | Compile Time | Run | Run Time |
|---|---|---|---|---|
| D wc | gdc -O2 -frelease -o wc wc.d | 0.28 | wc alice30.txt > /dev/null | 0.03 |
| C++ wccpp1 | g++ -O2 -o wccpp1 wccpp1.cc | 1.90 | wccpp1 alice30.txt > /dev/null | 0.07 |
| C++ wccpp2 | g++ -O2 -o wccpp2 wccpp2.cc | 1.88 | wccpp2 alice30.txt > /dev/null | 0.08 |
#include <algorithm>#include <cstdio>#include <fstream>#include <iterator>#include <map>#include <vector>bool isWordStartChar (char c){ return isalpha(c); }bool isWordEndChar (char c){ return !isalnum(c); }int main (int argc, char const* argv[]){ using namespace std; printf("Lines Words Bytes File:\n"); map<string, int> dict; int tLines = 0, tWords = 0, tBytes = 0; for(int i = 1; i < argc; i++) {ifstream file(argv[i]);istreambuf_iterator<char> from(file.rdbuf()), to;vector<char> v(from, to);vector<char>::iterator first = v.begin(), last = v.end(), bow, eow;int numLines = count(first, last, '\n');int numWords = 0;int numBytes = last - first;for(eow = first; eow != last; ){ bow = find_if(eow, last, isWordStartChar); eow = find_if(bow, last, isWordEndChar); if(bow != eow)++dict[string(bow, eow)], ++numWords;}printf("%5d %5d %5d %s\n", numLines, numWords, numBytes, argv[i]);tLines += numLines;tWords += numWords;tBytes += numBytes; } if(argc > 2) printf("-----------------------\n%5d %5d %5d\n", tLines, tWords, tBytes); printf("-----------------------\n\n"); for(map<string, int>::const_iterator it = dict.begin(); it != dict.end(); ++it) printf("%5d %s\n", it->second, it->first.c_str()); return 0;}