Skip to content
Jan Olsson edited this page Jun 4, 2018 · 35 revisions

String handling and value types

Explanation of the contents of a topic page @ Week 1 Topic 1

Back to Week 1

Objective: Basic string manipulation (min: QString + QChar + QCodec, creating, concatenation + light-weight use with QLatin1String (or QStringView this one may be omitted)

Comment: There is no String type in Qt, so should be written on lower-case (E: I'll leave this here for reference. Actually I'll just leave all the comments here for reference, comment on them myself so the dialogue stays somewhere. I didn't do that for 1.00 but it will probably be a better idea to just leave all the comments on the topic content pages for future references.)

Comment: This is a huge topic and I'd try to avoid losing attendees' interest at the beginning. The minimum: QString + QChar + QCodec, creating, concatenation + light-weight use with QLatin1String or QStringView (the latter can be omitted). String literals could be already another topic. And QByteArray is relevant in cases, where coding is not needed. I'm afraid this is too much for one section, isn't it.

E: Ill bump String Literals and QStringView into the expert section and include them if it seems feasible after everything else has been seen to.

Comment: The important message is that we do have literals and it does make sense to use mutable strings in general, if literals are ok. Like file names.

E: This page including the next ones will mostly be copy and paste from the Qt documentation and/or wiki, I'll start reading and editing them in the next sprint (20.3 ->)

Comment: This is very detailed now and explains a lot of details about string handling. If I was a student, I'd be exhausted in reading this all and possibly I'd would be at least little confused, what I need in there exercise and what not. This is related to the questions, I have asked a couple of times that how exercise creation and material creation are synched. This is definitely one way to create the material, but I see a risk that at least our customers start thinking about quitting, if the amount of reading is this large. You may see in my slides that a lot of details are just not mentioned. This section can be as it is, but I'd really consider omitting some details in further sections not to mention too many details.

E: This is not by any means supposed to stay this detailed. I acknowledge it's way too long and too detailed but I'm worried I'll work too slowly if I start doing edits at the same time as I'm collecting and including material. I'm also trying to figure out a good approach for writing the material aswell. The idea would be building up a material (that at some parts will end up being too large) and starting to just cut out the unimportant things once the big picture is there. This will also, let me stew on the material for a bit while im doing some collecting on other topics, its a workflow kind of a thing but I agree that it might look horrible at a first glance :D. Point im trying to make is that I really need to include some basis on the materials and find it easier to just delete stuff that isn't needed later on.

Beginner

Intermediate

  • What is QString?
  • QString vs QByteArray?
  • How does std::string/c string compare to QString?
  • How do you compare and manipulate strings efficiently (QLatin1String, QStringRef, ((QStringView)))?
  • What are value types in Qt?

Expert

  • What is QStringLiteral?
  • What is QStringView?

Course material content

As our next topic we're going to take a look at some basic string manipulation, interesting string facts and some value types. We'll also take a brief look at how to compare and manipulate strings efficiently.
To start, Qt has it's own type QString replacing standard string, and it's implicitly shared. Implicit sharing in Qt types is discussed at length later in the material. If you should walk away from this topic with something in mind, it is that Qt does have literals, and that it does make sense to use mutable strings in situations where literals are ok, such as file names.

We will start with string manipulation, after which we will take a comparative look at QString and QByteArray. After that, we will take a look at how std::string and C string compares to QString, followed by a brief discussion about how to compare and manipulate strings efficiently. Lastly, we will discuss value types in Qt.

Please don't be alarmed about the length and detail in this chapter. Many of the more specific details are presented as good-to-know, and you aren't exptected to remember e.g. every detail of QChar functions.

QChar

The QChar class provides a 16-bit Unicode character. (Unicode is an international standard that supports most of the writing systems in use today. It is a superset of US-ASCII (ANSI X3.4-1986) and Latin-1 (ISO 8859-1), and all the US-ASCII/Latin-1 characters are available at the same code positions.)

In Qt, Unicode characters are 16-bit entities without any markup or structure. This class represents such an entity. It is lightweight, so it can be used everywhere. Most compilers treat it like an unsigned short.

QChar provides a full complement of testing/classification functions, converting to and from other formats, converting from composed to decomposed Unicode, and trying to compare and case-convert if you ask it to.

The classification functions include functions like those in the standard C++ header <cctype> (formerly <ctype.h>), but operating on the full range of Unicode characters, not just for the ASCII range. They all return true if the character is a certain type of character; otherwise they return false. These classification functions are isNull() (returns true if the character is '\0'), isPrint() (true if the character is any sort of printable character, including whitespace), isPunct() (any sort of punctation), isMark() (Unicode Mark), isLetter() (a letter), isNumber() (any sort of numeric character, not just 0-9), isLetterOrNumber(), and isDigit() (decimal digits). All of these are wrappers around category() which return the Unicode-defined category of each character. Some of these also calculate the derived properties (for example isSpace() returns true if the character is of category Separator_* or an exceptional code point from Other_Control category).

QChar also provides direction(), which indicates the "natural" writing direction of this character. The joiningType() function indicates how the character joins with its neighbors (needed mostly for Arabic or Syriac) and finally hasMirrored(), which indicates whether the character needs to be mirrored when it is printed in it's "unnatural" writing direction.

Composed Unicode characters (like ring) can be converted to decomposed Unicode ("a" followed by "ring above") by using decomposition().

In Unicode, comparison is not necessarily possible and case conversion is very difficult at best. Unicode, covering the "entire" world, also includes most of the world's case and sorting problems. operator==() and friends will do comparison based purely on the numeric Unicode value (code point) of the characters, and toUpper() and toLower() will do case changes when the character has a well-defined uppercase/lowercase equivalent. For locale-dependent comparisons, use QString::localeAwareCompare().

The conversion functions include unicode() (to a scalar), toLatin1() (to scalar, but converts all non-Latin-1 characters to 0), row() (gives the Unicode row), cell() (gives the Unicode cell), digitValue() (gives the integer value of any of the numerous digit characters), and a host of constructors.

QChar provides constructors and cast operators that make it easy to convert to and from traditional 8-bit chars. If you defined QT_NO_CAST_FROM_ASCII and `QT_NO_CAST_TO_ASCII´, as explained in the ´QString´ documentation, you will need to explicitly call ´fromLatin1()´, or use ´QLatin1Char´, to construct a ´QChar´ from an 8-bit char, and you will need to call ´toLatin1()´ to get the 8-bit value back.

QString

The ´QString´ class provides a Unicode character string. ´QString´ stores a string of 16-bit ´QChars´, where each ´QChar´ corresponds one Unicode 4.0 character. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)

Behind the scenes, QString uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data. This also helps reduce the inherent overhead of storing 16-bit characters instead of 8-bit characters.

In addition to QString, Qt also provides the QByteArray class to store raw bytes and traditional 8-bit '\0'-terminated strings. For most purposes, QString is the class you want to use. It is used throughout the Qt API, and the Unicode support ensures that your applications will be easy to translate if you want to expand your application's market at some point. The two main cases where QByteArray is appropriate are when you need to store raw binary data, and when memory conservation is critical (like in embedded systems).

Initializing a String

One way to initialize a QString is simply to pass a const char * to its constructor. For example, the following code creates a QString of size 5 containing the data "Hello":

QString str = "Hello";

QString converts the const char * data into Unicode using the fromUtf8() function. In all of the QString functions that take const char * parameters, it is interpreted as a classic C-style '\0'-terminated string encoded in UTF-8. It is legal for the const char * parameter to be 0.

You can also provide string data as an array of QChars:

static const QChar data[4] = { 0x0055, 0x006e, 0x10e3, 0x03a3 };
QString str(data, 4);

QString makes a deep copy of the QChar data, so you can modify it later without experiencing side effects. (If for performance reasons you don't want to take a deep copy of the character data, use QString::fromRawData() instead.)

Another approach is to set the size of the string using resize() and to initialize the data character per character. QString uses 0-based indexes, just like C++ arrays. To access the character at a particular index position, you can use operator[](). On non-const strings, operator[]() returns a reference to a character that can be used on the left side of an assignment. For example:

QString str;
str.resize(4);

str[0] = QChar('U');
str[1] = QChar('n');
str[2] = QChar(0x10e3);
str[3] = QChar(0x03a3);

For read-only access, an alternative syntax is to use the at() function:

QString str;
for (int i = 0; i < str.size(); ++i) {
    if (str.at(i) >= QChar('a') && str.at(i) <= QChar('f'))
        qDebug() << "Found character in range [a-f]";
}

The at() function can be faster than operator[](), because it never causes a deep copy to occur. Alternatively, use the left(), right(), or mid() functions to extract several characters at a time.

A QString can embed '\0' characters (QChar::Null). The size() function always returns the size of the whole string, including embedded '\0' characters.

After a call to the resize() function, newly allocated characters have undefined values. To set all the characters in the string to a particular value, use the fill() function.

QString provides dozens of overloads designed to simplify string usage. For example, if you want to compare a QString with a string literal, you can write code like this and it will work as expected:

QString str;
if (str == "auto" || str == "extern" || str == "static" || str == "register") {
    // ...
}

You can also pass string literals to functions that take QStrings as arguments, invoking the QString(const char *) constructor. Similarly, you can pass a QString to a function that takes a const char * argument using the qPrintable() macro which returns the given QString as a const char *. This is equivalent to calling <QString>.toLocal8Bit().constData().

Manipulating String Data

QString provides the following basic functions for modifying the character data: append(), prepend(), insert(), replace(), and remove(). For example:

QString str = "and";
str.prepend("rock "); // str == "rock and"
str.append(" roll"); // str == "rock and roll"
str.replace(5, 3, "&"); // str == "rock & roll"

If you are building a QString gradually and know in advance approximately how many characters the QString will contain, you can call reserve(), asking QString to preallocate a certain amount of memory. You can also call capacity() to find out how much memory QString actually allocated.

The replace() and remove() functions' first two arguments are the position from which to start erasing and the number of characters that should be erased. If you want to replace all occurrences of a particular substring with another, use one of the two-parameter replace() overloads.

A frequent requirement is to remove whitespace characters from a string ('\n', '\t', ' ', etc.). If you want to remove whitespace from both ends of a QString, use the trimmed() function. If you want to remove whitespace from both ends and replace multiple consecutive whitespaces with a single space character within the string, use simplified().

If you want to find all occurrences of a particular character or substring in a QString, use the indexOf() or lastIndexOf() functions. The former searches forward starting from a given index position, the latter searches backward. Both return the index position of the character or substring if they find it; otherwise they return -1. For example, here's a typical loop that finds all occurrences of a particular substring:

QString str = "We must be <b>bold</b>, very <b>bold</b>";
int j = 0;

while ((j = str.indexOf("<b>", j)) != -1) {
    qDebug() << "Found <b> tag at index position" << j;
    ++j;
}

QString provides many functions for converting numbers into strings, and vice versa. See the arg() functions, the setNum(), number(), toInt(), etc. for further information. To get an upper- or lowercase version of a string use toUpper() or toLower().

Lists of strings are handled by the QStringList class. You can split a string into a list of strings using the split() function, and join a list of strings into a single string with an optional separator using QStringList::join(). You can obtain a list of strings from a string list that contain a particular substring or that match a particular QRegExp using the QStringList::filter() function.

Querying String Data

QStrings can be compared using overloaded operators such as operator<(), operator<=(), operator==(), operator>=(), and so on. Note that the comparison is based exclusively on the numeric Unicode values of the characters. It is very fast, but is not what a human would expect; the QString::localeAwareCompare() function is a better choice for sorting user-interface strings.

To obtain a pointer to the actual character data, call data() or constData(). These functions return a pointer to the beginning of the QChar data. The pointer is guaranteed to remain valid until a non-const function is called on the QString.

Conversions

QString provides the following three functions that return a const char * version of the string as QByteArray: toUtf8(), toLatin1(), and toLocal8Bit(). Corresponding functions to convert from these encodings are fromLatin1(), fromUtf8(), and fromLocal8Bit(). Other encodings are supported through the QTextCodec class.

As mentioned previously, QString provides a lot of functions and operators that make it easy to interoperate with const char * strings. But this functionality is a double-edged sword: It makes QString more convenient to use if all strings are US-ASCII or Latin-1, but there is always the risk that an implicit conversion from or to const char * is done using the wrong 8-bit encoding. To minimize these risks, you can turn off these implicit conversions by defining the following two preprocessor symbols:

QT_NO_CAST_FROM_ASCII disables automatic conversions from C string literals and pointers to Unicode. QT_RESTRICTED_CAST_FROM_ASCII allows automatic conversions from C characters and character arrays, but disables automatic conversions from character pointers to Unicode. QT_NO_CAST_TO_ASCII disables automatic conversion from QString to C strings.

One way to define these preprocessor symbols globally for your application is to add the following entry to your qmake project file:

DEFINES += QT_NO_CAST_FROM_ASCII \
           QT_NO_CAST_TO_ASCII

You then need to explicitly call e.g. fromUtf8() to construct a QString from an 8-bit string, or use the lightweight QLatin1String class, for example: QString url = QLatin1String("http://www.unicode.org/");

Distinction Between Null and Empty Strings

For historical reasons, QString distinguishes between a null string and an empty string. A null string is a string that is initialized using QString's default constructor or by passing const char * 0 to the constructor. An empty string is any string with size 0. Thus, null string is always empty, but an empty string isn't necessarily null (e.g. QString() is both null and empty, QString("") is empty, but not null).

All functions except isNull() treat null strings the same as empty strings. For example, toUtf8().constData() returns a pointer to a '\0' character for a null string (not a null pointer), and QString() compares equal to QString(""). We recommend that you always use the isEmpty() function and avoid isNull().

More Efficient String Construction

Many strings are known at compile time. But the trivial constructor QString("Hello"), will copy the contents of the string, treating the contents as Latin-1. To avoid this one can use the QStringLiteral macro to directly create the required data at compile time. Constructing a QString out of the literal does then not cause any overhead at runtime.

A slightly less efficient way is to use QLatin1String. This class wraps a C string literal, precalculates it length at compile time and can then be used for faster comparison with QStrings and conversion to QStrings than a regular C string literal.

Using the QString '+' operator, it is easy to construct a complex string from multiple substrings. You will often write code like this:

QString foo;
QString type = "long";

foo->setText(QLatin1String("vector<") + type + QLatin1String(">::iterator"));

if (foo.startsWith("(" + type + ") 0x"))
    ...

There is nothing wrong with either of these string constructions, but there are a few hidden inefficiencies. Beginning with Qt 4.6, you can eliminate them.

First, multiple uses of the '+' operator usually means multiple memory allocations. When concatenating n substrings, where n > 2, there can be as many as n - 1 calls to the memory allocator.

In 4.6, an internal template class QStringBuilder has been added along with a few helper functions. This class is marked internal and does not appear in the documentation, because you aren't meant to instantiate it in your code. Its use will be automatic, as described below. The class is found in src/corelib/tools/qstringbuilder.cpp if you want to have a look at it.

QStringBuilder uses expression templates and reimplements the '%' operator so that when you use '%' for string concatenation instead of '+', multiple substring concatenations will be postponed until the final result is about to be assigned to a QString. At this point, the amount of memory required for the final result is known. The memory allocator is then called once to get the required space, and the substrings are copied into it one by one.

Additional efficiency is gained by inlining and reduced reference counting (the QString created from a QStringBuilder typically has a ref count of 1, whereas QString::append() needs an extra test).

There are two ways you can access this improved method of string construction. The straightforward way is to include QStringBuilder wherever you want to use it, and use the '%' operator instead of '+' when concatenating strings:

#include <QStringBuilder>

QString hello("hello");
QStringRef el(&hello, 2, 3);
QLatin1String world("world");
QString message =  hello % el % world % QChar('!');

A more global approach which is the most convenient, but not entirely source compatible, is to this define in your .pro file:

DEFINES *= QT_USE_QSTRINGBUILDER

and the '+' will automatically be performed as the QStringBuilder '%' everywhere.

QCodec

The QTextCodec class provides conversions between text encodings.

Qt uses Unicode to store, draw and manipulate strings. In many situations you may wish to deal with data that uses a different encoding. For example, most Japanese documents are still stored in Shift-JIS or ISO 2022-JP, while Russian users often have their documents in KOI8-R or Windows-1251.

Qt provides a set of QTextCodec classes to help with converting non-Unicode formats to and from Unicode. You can also create your own codec classes. You can find the full list of supported encodings in QTextCodec documentation. If Qt is compiled with ICU support enabled, most codecs supported by ICU will also be available to the application.

QTextCodecs can be used as follows to convert some locally encoded string to Unicode. Suppose you have some string encoded in Russian KOI8-R encoding, and want to convert it to Unicode. The simple way to do it is like this:

QByteArray encodedString = "...";
QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
QString string = codec->toUnicode(encodedString);

After this, string holds the text converted to Unicode.

To read or write files in various encodings, use QTextStream and its setCodec() function. See the Codecs example for an application of QTextCodec to file I/O.

Some care must be taken when trying to convert the data in chunks, for example, when receiving it over a network. In such cases it is possible that a multi-byte character will be split over two chunks. At best this might result in the loss of a character and at worst cause the entire conversion to fail.

The approach to use in these situations is to create a QTextDecoder object for the codec and use this QTextDecoder for the whole decoding process, as shown below:

QTextCodec *codec = QTextCodec::codecForName("Shift-JIS");
QTextDecoder *decoder = codec->makeDecoder();

QString string;
while (new_data_available()) {
    QByteArray chunk = get_new_data();
    string += decoder->toUnicode(chunk);
}
delete decoder;

The QTextDecoder object maintains state between chunks and therefore works correctly even if a multi-byte character is split between chunks.

If you need to create your own codec class, you can check out the instructions for doing so here.

QByteArray

The QByteArray class provides an array of bytes.

QByteArray can be used to store both raw bytes (including '\0's) and traditional 8-bit '\0'-terminated strings. Using QByteArray is much more convenient than using const char *. Behind the scenes, it always ensures that the data is followed by a '\0' terminator, and uses implicit sharing (copy-on-write) to reduce memory usage and avoid needless copying of data.

One way to initialize a QByteArray is simply to pass a const char * to its constructor. For example, the following code creates a byte array of size 5 containing the data "Hello":

QByteArray ba("Hello");

Although the size() is 5, the byte array also maintains an extra '\0' character at the end so that if a function is used that asks for a pointer to the underlying data (e.g. a call to data()), the data pointed to is guaranteed to be '\0'-terminated.

QByteArray makes a deep copy of the const char * data, so you can modify it later without experiencing side effects. (If for performance reasons you don't want to take a deep copy of the character data, use QByteArray::fromRawData() instead.)

Another approach is to set the size of the array using resize() and to initialize the data byte per byte. QByteArray uses 0-based indexes, just like C++ arrays. To access the byte at a particular index position, you can use operator[](). On non-const byte arrays, operator[]() returns a reference to a byte that can be used on the left side of an assignment. For example:

QByteArray ba;
ba.resize(5);
ba[0] = 0x3c;
ba[1] = 0xb8;
ba[2] = 0x64;
ba[3] = 0x18;
ba[4] = 0xca;

For read-only access, an alternative syntax is to use at():

for (int i = 0; i < ba.size(); ++i) {
    if (ba.at(i) >= 'a' && ba.at(i) <= 'f')
        cout << "Found character in range [a-f]" << endl;
    }

at() can be faster than operator[](), because it never causes a deep copy to occur.

To extract many bytes at a time, use left(), right(), or mid().

A QByteArray can embed '\0' bytes. The size() function always returns the size of the whole array, including embedded '\0' bytes, but excluding the terminating '\0' added by QByteArray. If you want to obtain the length of the data up to and excluding the first '\0' character, call qstrlen() on the byte array.

To obtain a pointer to the actual character data, call data() or constData(). These functions return a pointer to the beginning of the data. The pointer is guaranteed to remain valid until a non-const function is called on the QByteArray. It is also guaranteed that the data ends with a '\0' byte unless the QByteArray was created from a raw data. This '\0' byte is automatically provided by QByteArray and is not counted in size().

QByteArray provides the same functions for modifying and searching the byte data as QString, such as append() and indexOf().

Functions that perform conversions between numeric data types and strings are performed in the C locale, irrespective of the user's locale settings. Use QString to perform locale-aware conversions between numbers and strings.

In QByteArray, the notion of uppercase and lowercase and of which character is greater than or less than another character is locale dependent. This affects functions that support a case insensitive option, or that compare, or lowercase or uppercase their arguments. Case insensitive operations and comparisons will be accurate if both strings contain only ASCII characters. (If $LC_CTYPE is set, most Unix systems do "the right thing".) This issue does not apply to QStrings since they represent characters using Unicode.

std::string/c string vs QString

J: Should this be added? Chapter is already quite long

Comparing and manipulating strings efficiently

The QLatin1String class provides a thin wrapper around an US-ASCII/Latin-1 encoded string literal.

Many of QString's member functions are overloaded to accept const char * instead of QString. This includes the copy constructor, the assignment operator, the comparison operators, and various other functions such as insert(), replace(), and indexOf(). These functions are usually optimized to avoid constructing a QString object for the const char * data. For example, assuming str is a QString,

if (str == "auto" || str == "extern" || str == "static" || str == "register") {
    ...
}

is much faster than

if (str == QString("auto") || str == QString("extern") || str == QString("static") || str == QString("register")) {
    ...
}

because it doesn't construct four temporary QString objects and make a deep copy of the character data.

Applications that define QT_NO_CAST_FROM_ASCII don't have access to QString's const char * API. To provide an efficient way of specifying constant Latin-1 strings, Qt provides the QLatin1String, which is just a very thin wrapper around a const char *. Using QLatin1String, the example code above becomes

if (str == QLatin1String("auto") 
           || str == QLatin1String("extern")
           || str == QLatin1String("static")
           || str == QLatin1String("register") {
    ...
}

This is a bit longer to type, but it provides exactly the same benefits as the first version of the code, and is faster than converting the Latin-1 strings using QString::fromLatin1().

Thanks to the QString(QLatin1String) constructor, QLatin1String can be used everywhere a QString is expected.

Note: If the function you're calling with a QLatin1String argument isn't actually overloaded to take QLatin1String, the implicit conversion to QString will trigger a memory allocation, which is usually what you want to avoid by using QLatin1String in the first place. In those cases, using QStringLiteral may be the better option.


Instructions and description for the exercise of topic 1.01

  1. Create a QString with the text "Qt rules " and then add the number "42" to it using QStrings appending function. Then print it the QDebug output.

  2. Use the same QString. Use insert to add the word "always" after the first word, and then print the string with QDebug.

  3. Make a copy of the string and use a loop to append it to the original (seperated by a space) 10 times. Again print the string with QDebug.

  4. Count the amount of times "Qt" appears in the string print out "Qt appeared x times" with QDebug.

Exhaustive reference material mentioned in this topic

Further reading topics/links:

Clone this wiki locally