-
Notifications
You must be signed in to change notification settings - Fork 0
1.01
Explanation of the contents of a topic page @ Week 1 Topic 1
Objective: Basic string manipulation (min: QString + QChar + QCodec, creating, concatenation + light-weight use with QLatin1String (or QStringView this one may be omitted)
Comment: There is no String type in Qt, so should be written on lower-case (E: I'll leave this here for reference. Actually I'll just leave all the comments here for reference, comment on them myself so the dialogue stays somewhere. I didn't do that for 1.00 but it will probably be a better idea to just leave all the comments on the topic content pages for future references.)
Comment: This is a huge topic and I'd try to avoid losing attendees' interest at the beginning. The minimum: QString + QChar + QCodec, creating, concatenation + light-weight use with QLatin1String or QStringView (the latter can be omitted). String literals could be already another topic. And QByteArray is relevant in cases, where coding is not needed. I'm afraid this is too much for one section, isn't it.
E: Ill bump String Literals and QStringView into the expert section and include them if it seems feasible after everything else has been seen to.
Comment: The important message is that we do have literals and it does make sense to use mutable strings in general, if literals are ok. Like file names.
E: This page including the next ones will mostly be copy and paste from the Qt documentation and/or wiki, I'll start reading and editing them in the next sprint (20.3 ->)
Comment: This is very detailed now and explains a lot of details about string handling. If I was a student, I'd be exhausted in reading this all and possibly I'd would be at least little confused, what I need in there exercise and what not. This is related to the questions, I have asked a couple of times that how exercise creation and material creation are synched. This is definitely one way to create the material, but I see a risk that at least our customers start thinking about quitting, if the amount of reading is this large. You may see in my slides that a lot of details are just not mentioned. This section can be as it is, but I'd really consider omitting some details in further sections not to mention too many details.
E: This is not by any means supposed to stay this detailed. I acknowledge it's way too long and too detailed but I'm worried I'll work too slowly if I start doing edits at the same time as I'm collecting and including material. I'm also trying to figure out a good approach for writing the material aswell. The idea would be building up a material (that at some parts will end up being too large) and starting to just cut out the unimportant things once the big picture is there. This will also, let me stew on the material for a bit while im doing some collecting on other topics, its a workflow kind of a thing but I agree that it might look horrible at a first glance :D. Point im trying to make is that I really need to include some basis on the materials and find it easier to just delete stuff that isn't needed later on.
- What is QString?
- QString vs QByteArray?
- How does std::string/c string compare to QString?
- How do you compare and manipulate strings efficiently (QLatin1String, QStringRef, ((QStringView)))?
- What are value types in Qt?
- What is QStringLiteral?
- What is QStringView?
As our next topic we're going to take a look at some basic string manipulation, interesting string facts and some value types. We'll also take a brief look at how to compare and manipulate strings efficiently.
To start, Qt has it's own type QString
replacing standard string, and it's implicitly shared. Implicit sharing in Qt types is discussed at length later in the material. If you should walk away from this topic with something in mind, it is that Qt does have literals, and that it does make sense to use mutable strings in situations where literals are ok, such as file names.
We will start with string manipulation, after which we will take a comparative look at QString and QByteArray. After that, we will take a look at how std::string and C string compares to QString, followed by a brief discussion about how to compare and manipulate strings efficiently. Lastly, we will discuss value types in Qt.
Please don't be alarmed about the length and detail in this chapter. Many of the more specific details are presented as good-to-know, and you aren't exptected to remember e.g. every detail of QChar
functions.
The QChar class provides a 16-bit Unicode character. (Unicode is an international standard that supports most of the writing systems in use today. It is a superset of US-ASCII (ANSI X3.4-1986) and Latin-1 (ISO 8859-1), and all the US-ASCII/Latin-1 characters are available at the same code positions.)
In Qt, Unicode characters are 16-bit entities without any markup or structure. This class represents such an entity. It is lightweight, so it can be used everywhere. Most compilers treat it like an unsigned short.
QChar provides a full complement of testing/classification functions, converting to and from other formats, converting from composed to decomposed Unicode, and trying to compare and case-convert if you ask it to.
The classification functions include functions like those in the standard C++ header <cctype>
(formerly <ctype.h>
), but operating on the full range of Unicode characters, not just for the ASCII range. They all return true if the character is a certain type of character; otherwise they return false. These classification functions are isNull()
(returns true if the character is '\0'), isPrint()
(true if the character is any sort of printable character, including whitespace), isPunct()
(any sort of punctation), isMark()
(Unicode Mark), isLetter()
(a letter), isNumber()
(any sort of numeric character, not just 0-9), isLetterOrNumber()
, and isDigit()
(decimal digits). All of these are wrappers around category()
which return the Unicode-defined category of each character. Some of these also calculate the derived properties (for example isSpace()
returns true if the character is of category Separator_* or an exceptional code point from Other_Control category).
QChar also provides direction()
, which indicates the "natural" writing direction of this character. The joiningType()
function indicates how the character joins with its neighbors (needed mostly for Arabic or Syriac) and finally hasMirrored()
, which indicates whether the character needs to be mirrored when it is printed in it's "unnatural" writing direction.
Composed Unicode characters (like ring) can be converted to decomposed Unicode ("a" followed by "ring above") by using decomposition()
.
In Unicode, comparison is not necessarily possible and case conversion is very difficult at best. Unicode, covering the "entire" world, also includes most of the world's case and sorting problems. operator==()
and friends will do comparison based purely on the numeric Unicode value (code point) of the characters, and toUpper()
and toLower()
will do case changes when the character has a well-defined uppercase/lowercase equivalent. For locale-dependent comparisons, use QString::localeAwareCompare()
.
The conversion functions include unicode()
(to a scalar), toLatin1()
(to scalar, but converts all non-Latin-1 characters to 0), row()
(gives the Unicode row), cell()
(gives the Unicode cell), digitValue()
(gives the integer value of any of the numerous digit characters), and a host of constructors.
QChar
provides constructors and cast operators that make it easy to convert to and from traditional 8-bit chars. If you defined QT_NO_CAST_FROM_ASCII
and `QT_NO_CAST_TO_ASCII´, as explained in the ´QString´ documentation, you will need to explicitly call ´fromLatin1()´, or use ´QLatin1Char´, to construct a ´QChar´ from an 8-bit char, and you will need to call ´toLatin1()´ to get the 8-bit value back.
The ´QString´ class provides a Unicode character string. ´QString´ stores a string of 16-bit ´QChars´, where each ´QChar´ corresponds one Unicode 4.0 character. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
Behind the scenes, QString
uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data. This also helps reduce the inherent overhead of storing 16-bit characters instead of 8-bit characters.
In addition to QString
, Qt also provides the QByteArray
class to store raw bytes and traditional 8-bit '\0'-terminated strings. For most purposes, QString
is the class you want to use. It is used throughout the Qt API, and the Unicode support ensures that your applications will be easy to translate if you want to expand your application's market at some point. The two main cases where QByteArray
is appropriate are when you need to store raw binary data, and when memory conservation is critical (like in embedded systems).
One way to initialize a QString
is simply to pass a const char *
to its constructor. For example, the following code creates a QString
of size 5 containing the data "Hello":
QString str = "Hello";
QString
converts the const char *
data into Unicode using the fromUtf8()
function. In all of the QString
functions that take const char *
parameters, it is interpreted as a classic C-style '\0'-terminated string encoded in UTF-8. It is legal for the const char *
parameter to be 0.
You can also provide string data as an array of QChars:
static const QChar data[4] = { 0x0055, 0x006e, 0x10e3, 0x03a3 };
QString str(data, 4);
QString
makes a deep copy of the QChar
data, so you can modify it later without experiencing side effects. (If for performance reasons you don't want to take a deep copy of the character data, use QString::fromRawData()
instead.)
Another approach is to set the size of the string using resize()
and to initialize the data character per character. QString uses 0-based indexes, just like C++ arrays. To access the character at a particular index position, you can use operator[]()
. On non-const strings, operator[]()
returns a reference to a character that can be used on the left side of an assignment. For example:
QString str;
str.resize(4);
str[0] = QChar('U');
str[1] = QChar('n');
str[2] = QChar(0x10e3);
str[3] = QChar(0x03a3);
For read-only access, an alternative syntax is to use the at()
function:
QString str;
for (int i = 0; i < str.size(); ++i) {
if (str.at(i) >= QChar('a') && str.at(i) <= QChar('f'))
qDebug() << "Found character in range [a-f]";
}
The at()
function can be faster than operator[]()
, because it never causes a deep copy to occur. Alternatively, use the left()
, right()
, or mid()
functions to extract several characters at a time.
A QString can embed '\0' characters (QChar::Null
). The size()
function always returns the size of the whole string, including embedded '\0' characters.
After a call to the resize()
function, newly allocated characters have undefined values. To set all the characters in the string to a particular value, use the fill()
function.
QString provides dozens of overloads designed to simplify string usage. For example, if you want to compare a QString with a string literal, you can write code like this and it will work as expected:
QString str;
if (str == "auto" || str == "extern" || str == "static" || str == "register") {
// ...
}
You can also pass string literals to functions that take QString
s as arguments, invoking the QString(const char *)
constructor. Similarly, you can pass a QString
to a function that takes a const char *
argument using the qPrintable()
macro which returns the given QString
as a const char *
. This is equivalent to calling <QString>.toLocal8Bit().constData()
.
QString
provides the following basic functions for modifying the character data: append()
, prepend()
, insert()
, replace()
, and remove()
. For example:
QString str = "and";
str.prepend("rock "); // str == "rock and"
str.append(" roll"); // str == "rock and roll"
str.replace(5, 3, "&"); // str == "rock & roll"
If you are building a QString
gradually and know in advance approximately how many characters the QString
will contain, you can call reserve()
, asking QString
to preallocate a certain amount of memory. You can also call capacity()
to find out how much memory QString actually allocated.
The replace()
and remove()
functions' first two arguments are the position from which to start erasing and the number of characters that should be erased. If you want to replace all occurrences of a particular substring with another, use one of the two-parameter replace()
overloads.
A frequent requirement is to remove whitespace characters from a string ('\n', '\t', ' ', etc.). If you want to remove whitespace from both ends of a QString
, use the trimmed()
function. If you want to remove whitespace from both ends and replace multiple consecutive whitespaces with a single space character within the string, use simplified()
.
If you want to find all occurrences of a particular character or substring in a QString, use the indexOf()
or lastIndexOf()
functions. The former searches forward starting from a given index position, the latter searches backward. Both return the index position of the character or substring if they find it; otherwise they return -1. For example, here's a typical loop that finds all occurrences of a particular substring:
QString str = "We must be <b>bold</b>, very <b>bold</b>";
int j = 0;
while ((j = str.indexOf("<b>", j)) != -1) {
qDebug() << "Found <b> tag at index position" << j;
++j;
}
QString
provides many functions for converting numbers into strings, and vice versa. See the arg()
functions, the setNum()
, number()
, toInt()
, etc. for further information. To get an upper- or lowercase version of a string use toUpper()
or toLower()
.
Lists of strings are handled by the QStringList
class. You can split a string into a list of strings using the split()
function, and join a list of strings into a single string with an optional separator using QStringList::join()
. You can obtain a list of strings from a string list that contain a particular substring or that match a particular QRegExp
using the QStringList::filter()
function.
QString
s can be compared using overloaded operators such as operator<()
, operator<=()
, operator==()
, operator>=()
, and so on. Note that the comparison is based exclusively on the numeric Unicode values of the characters. It is very fast, but is not what a human would expect; the QString::localeAwareCompare()
function is a better choice for sorting user-interface strings.
To obtain a pointer to the actual character data, call data()
or constData()
. These functions return a pointer to the beginning of the QChar
data. The pointer is guaranteed to remain valid until a non-const function is called on the QString
.
QString provides the following three functions that return a const char *
version of the string as QByteArray
: toUtf8()
, toLatin1()
, and toLocal8Bit()
. Corresponding functions to convert from these encodings are fromLatin1()
, fromUtf8()
, and fromLocal8Bit()
. Other encodings are supported through the QTextCodec
class.
As mentioned previously, QString
provides a lot of functions and operators that make it easy to interoperate with const char *
strings. But this functionality is a double-edged sword: It makes QString
more convenient to use if all strings are US-ASCII or Latin-1, but there is always the risk that an implicit conversion from or to const char *
is done using the wrong 8-bit encoding. To minimize these risks, you can turn off these implicit conversions by defining the following two preprocessor symbols:
QT_NO_CAST_FROM_ASCII
disables automatic conversions from C string literals and pointers to Unicode.
QT_RESTRICTED_CAST_FROM_ASCII
allows automatic conversions from C characters and character arrays, but disables automatic conversions from character pointers to Unicode.
QT_NO_CAST_TO_ASCII
disables automatic conversion from QString to C strings.
One way to define these preprocessor symbols globally for your application is to add the following entry to your qmake
project file:
DEFINES += QT_NO_CAST_FROM_ASCII \
QT_NO_CAST_TO_ASCII
You then need to explicitly call e.g. fromUtf8()
to construct a QString from an 8-bit string, or use the lightweight QLatin1String
class, for example: QString url = QLatin1String("http://www.unicode.org/");
For historical reasons, QString
distinguishes between a null string and an empty string. A null string is a string that is initialized using QString
's default constructor or by passing const char *
0 to the constructor. An empty string is any string with size 0. Thus, null string is always empty, but an empty string isn't necessarily null (e.g. QString()
is both null and empty, QString("")
is empty, but not null).
All functions except isNull()
treat null strings the same as empty strings. For example, toUtf8().constData()
returns a pointer to a '\0' character for a null string (not a null pointer), and QString()
compares equal to QString("")
. We recommend that you always use the isEmpty()
function and avoid isNull()
.
Many strings are known at compile time. But the trivial constructor QString("Hello")
, will copy the contents of the string, treating the contents as Latin-1. To avoid this one can use the QStringLiteral
macro to directly create the required data at compile time. Constructing a QString
out of the literal does then not cause any overhead at runtime.
A slightly less efficient way is to use QLatin1String
. This class wraps a C string literal, precalculates it length at compile time and can then be used for faster comparison with QString
s and conversion to QString
s than a regular C string literal.
Using the QString
'+' operator, it is easy to construct a complex string from multiple substrings. You will often write code like this:
QString foo;
QString type = "long";
foo->setText(QLatin1String("vector<") + type + QLatin1String(">::iterator"));
if (foo.startsWith("(" + type + ") 0x"))
...
There is nothing wrong with either of these string constructions, but there are a few hidden inefficiencies. Beginning with Qt 4.6, you can eliminate them.
First, multiple uses of the '+' operator usually means multiple memory allocations. When concatenating n substrings, where n > 2, there can be as many as n - 1 calls to the memory allocator.
In 4.6, an internal template class QStringBuilder
has been added along with a few helper functions. This class is marked internal and does not appear in the documentation, because you aren't meant to instantiate it in your code. Its use will be automatic, as described below. The class is found in src/corelib/tools/qstringbuilder.cpp
if you want to have a look at it.
QStringBuilder
uses expression templates and reimplements the '%' operator so that when you use '%' for string concatenation instead of '+', multiple substring concatenations will be postponed until the final result is about to be assigned to a QString
. At this point, the amount of memory required for the final result is known. The memory allocator is then called once to get the required space, and the substrings are copied into it one by one.
Additional efficiency is gained by inlining and reduced reference counting (the QString
created from a QStringBuilder
typically has a ref count of 1, whereas QString::append()
needs an extra test).
There are two ways you can access this improved method of string construction. The straightforward way is to include QStringBuilder
wherever you want to use it, and use the '%' operator instead of '+' when concatenating strings:
#include <QStringBuilder>
QString hello("hello");
QStringRef el(&hello, 2, 3);
QLatin1String world("world");
QString message = hello % el % world % QChar('!');
A more global approach which is the most convenient, but not entirely source compatible, is to this define in your .pro file:
DEFINES *= QT_USE_QSTRINGBUILDER
and the '+' will automatically be performed as the QStringBuilder '%' everywhere.
The QTextCodec
class provides conversions between text encodings.
Qt uses Unicode to store, draw and manipulate strings. In many situations you may wish to deal with data that uses a different encoding. For example, most Japanese documents are still stored in Shift-JIS or ISO 2022-JP, while Russian users often have their documents in KOI8-R or Windows-1251.
Qt provides a set of QTextCodec
classes to help with converting non-Unicode formats to and from Unicode. You can also create your own codec classes. You can find the full list of supported encodings in QTextCodec documentation. If Qt is compiled with ICU support enabled, most codecs supported by ICU will also be available to the application.
QTextCodecs can be used as follows to convert some locally encoded string to Unicode. Suppose you have some string encoded in Russian KOI8-R encoding, and want to convert it to Unicode. The simple way to do it is like this:
QByteArray encodedString = "...";
QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
QString string = codec->toUnicode(encodedString);
After this, string holds the text converted to Unicode.
To read or write files in various encodings, use QTextStream
and its setCodec()
function. See the Codecs example for an application of QTextCodec
to file I/O.
Some care must be taken when trying to convert the data in chunks, for example, when receiving it over a network. In such cases it is possible that a multi-byte character will be split over two chunks. At best this might result in the loss of a character and at worst cause the entire conversion to fail.
The approach to use in these situations is to create a QTextDecoder
object for the codec and use this QTextDecoder
for the whole decoding process, as shown below:
QTextCodec *codec = QTextCodec::codecForName("Shift-JIS");
QTextDecoder *decoder = codec->makeDecoder();
QString string;
while (new_data_available()) {
QByteArray chunk = get_new_data();
string += decoder->toUnicode(chunk);
}
delete decoder;
The QTextDecoder
object maintains state between chunks and therefore works correctly even if a multi-byte character is split between chunks.
If you need to create your own codec class, you can check out the instructions for doing so here.
The QByteArray
class provides an array of bytes.
QByteArray
can be used to store both raw bytes (including '\0's) and traditional 8-bit '\0'-terminated strings. Using QByteArray
is much more convenient than using const char *
. Behind the scenes, it always ensures that the data is followed by a '\0' terminator, and uses implicit sharing (copy-on-write) to reduce memory usage and avoid needless copying of data.
One way to initialize a QByteArray
is simply to pass a const char *
to its constructor. For example, the following code creates a byte array of size 5 containing the data "Hello":
QByteArray ba("Hello");
Although the size()
is 5, the byte array also maintains an extra '\0' character at the end so that if a function is used that asks for a pointer to the underlying data (e.g. a call to data()
), the data pointed to is guaranteed to be '\0'-terminated.
QByteArray
makes a deep copy of the const char *
data, so you can modify it later without experiencing side effects. (If for performance reasons you don't want to take a deep copy of the character data, use QByteArray::fromRawData()
instead.)
Another approach is to set the size of the array using resize()
and to initialize the data byte per byte. QByteArray
uses 0-based indexes, just like C++ arrays. To access the byte at a particular index position, you can use operator[]()
. On non-const byte arrays, operator[]()
returns a reference to a byte that can be used on the left side of an assignment. For example:
QByteArray ba;
ba.resize(5);
ba[0] = 0x3c;
ba[1] = 0xb8;
ba[2] = 0x64;
ba[3] = 0x18;
ba[4] = 0xca;
For read-only access, an alternative syntax is to use at()
:
for (int i = 0; i < ba.size(); ++i) {
if (ba.at(i) >= 'a' && ba.at(i) <= 'f')
cout << "Found character in range [a-f]" << endl;
}
at()
can be faster than operator[]()
, because it never causes a deep copy to occur.
To extract many bytes at a time, use left()
, right()
, or mid()
.
A QByteArray can embed '\0' bytes. The size(
) function always returns the size of the whole array, including embedded '\0' bytes, but excluding the terminating '\0' added by QByteArray
. If you want to obtain the length of the data up to and excluding the first '\0' character, call qstrlen()
on the byte array.
To obtain a pointer to the actual character data, call data() or constData(). These functions return a pointer to the beginning of the data. The pointer is guaranteed to remain valid until a non-const function is called on the QByteArray. It is also guaranteed that the data ends with a '\0' byte unless the QByteArray was created from a raw data. This '\0' byte is automatically provided by QByteArray and is not counted in size().
QByteArray
provides the same functions for modifying and searching the byte data as QString
, such as append()
and indexOf()
.
Functions that perform conversions between numeric data types and strings are performed in the C locale, irrespective of the user's locale settings. Use QString
to perform locale-aware conversions between numbers and strings.
In QByteArray, the notion of uppercase and lowercase and of which character is greater than or less than another character is locale dependent. This affects functions that support a case insensitive option, or that compare, or lowercase or uppercase their arguments. Case insensitive operations and comparisons will be accurate if both strings contain only ASCII characters. (If $LC_CTYPE is set, most Unix systems do "the right thing".) This issue does not apply to QString
s since they represent characters using Unicode.
J: Should this be added? Chapter is already quite long
The QLatin1String
class provides a thin wrapper around an US-ASCII/Latin-1 encoded string literal.
Many of QString
's member functions are overloaded to accept const char *
instead of QString
. This includes the copy constructor, the assignment operator, the comparison operators, and various other functions such as insert()
, replace()
, and indexOf()
. These functions are usually optimized to avoid constructing a QString
object for the const char *
data. For example, assuming str is a QString
,
if (str == "auto" || str == "extern" || str == "static" || str == "register") {
...
}
is much faster than
if (str == QString("auto") || str == QString("extern") || str == QString("static") || str == QString("register")) {
...
}
because it doesn't construct four temporary QString
objects and make a deep copy of the character data.
Applications that define QT_NO_CAST_FROM_ASCII
don't have access to QString
's const char *
API. To provide an efficient way of specifying constant Latin-1 strings, Qt provides the QLatin1String
, which is just a very thin wrapper around a const char *
. Using QLatin1String
, the example code above becomes
if (str == QLatin1String("auto")
|| str == QLatin1String("extern")
|| str == QLatin1String("static")
|| str == QLatin1String("register") {
...
}
This is a bit longer to type, but it provides exactly the same benefits as the first version of the code, and is faster than converting the Latin-1 strings using QString::fromLatin1()
.
Thanks to the QString(QLatin1String)
constructor, QLatin1String
can be used everywhere a QString
is expected.
Note: If the function you're calling with a QLatin1String
argument isn't actually overloaded to take QLatin1String
, the implicit conversion to QString
will trigger a memory allocation, which is usually what you want to avoid by using QLatin1String
in the first place. In those cases, using QStringLiteral
may be the better option.
-
Create a QString with the text "Qt rules " and then add the number "42" to it using QStrings appending function. Then print it the QDebug output.
-
Use the same QString. Use insert to add the word "always" after the first word, and then print the string with QDebug.
-
Make a copy of the string and use a loop to append it to the original (seperated by a space) 10 times. Again print the string with QDebug.
-
Count the amount of times "Qt" appears in the string print out "Qt appeared x times" with QDebug.