nucularmoo edited this page Mar 17, 2018 · 35 revisions

String handling and value types

Explanation of the contents of a topic page @ Week 1 Topic 1


Objective: Basic string manipulation (minimum: QString + QChar + QTextCodec, creating, concatenation, and light-weight use with QLatin1String; QStringView may be omitted)

Comment: There is no String type in Qt, so it should be written in lower case (E: I'll leave this here for reference. Actually I'll just leave all the comments here for reference and comment on them myself, so the dialogue stays somewhere. I didn't do that for 1.00, but it will probably be a better idea to leave all the comments on the topic content pages for future reference.)

Comment: This is a huge topic and I'd try to avoid losing attendees' interest at the beginning. The minimum: QString + QChar + QCodec, creating, concatenation + light-weight use with QLatin1String or QStringView (the latter can be omitted). String literals could be already another topic. And QByteArray is relevant in cases, where coding is not needed. I'm afraid this is too much for one section, isn't it.

E: I'll bump String Literals and QStringView into the expert section and include them if it seems feasible after everything else has been seen to.

Comment: The important message is that we do have literals and it does make sense to use mutable strings in general, if literals are ok. Like file names.

Beginner

Intermediate

  • What is QString?
  • QString vs QByteArray?
  • How does std::string/c string compare to QString?
  • How do you compare and manipulate strings efficiently (QLatin1String, QStringRef, ((QStringView)))?
  • What are value types in Qt?

Expert

  • What is QStringLiteral?
  • What is QStringView?

Course material content

As our next topic we're going to take a look at some basic string manipulation, interesting string facts and some value types. We'll also take a brief look at how to compare and manipulate strings efficiently.
To start, there is no String type in Qt, so when writing about strings in general, the word should be in lower case. If you walk away from this topic with one thing in mind, it should be that Qt does have literals and that it does make sense to use mutable strings in situations where literals are ok, such as file names.

We will start with string manipulation, after which we will take a comparative look at QString and QByteArray. After that, we will look at how std::string and C strings compare to QString, followed by a brief discussion of how to compare and manipulate strings efficiently. Lastly, we will discuss value types in Qt.

String manipulation (QString, QChar, QTextCodec)

http://doc.qt.io/qt-5/qstring.html
http://doc.qt.io/qt-5/qchar.html
http://doc.qt.io/qt-5/qtextcodec.html

Basic string manipulation (minimum: QString + QChar + QTextCodec, creating, concatenation, and light-weight use with QLatin1String)

QChar

The QChar class provides a 16-bit Unicode character.

In Qt, Unicode characters are 16-bit entities without any markup or structure. This class represents such an entity. It is lightweight, so it can be used everywhere. Most compilers treat it like an unsigned short.

QChar provides a full complement of testing/classification functions, converting to and from other formats, converting from composed to decomposed Unicode, and trying to compare and case-convert if you ask it to.

The classification functions include functions like those in the standard C++ header <cctype> (formerly <ctype.h>), but operating on the full range of Unicode characters, not just the ASCII range. They all return true if the character is a certain type of character; otherwise they return false. These classification functions are isNull() (returns true if the character is '\0'), isPrint() (true if the character is any sort of printable character, including whitespace), isPunct() (any sort of punctuation), isMark() (Unicode Mark), isLetter() (a letter), isNumber() (any sort of numeric character, not just 0-9), isLetterOrNumber(), and isDigit() (decimal digits). All of these are wrappers around category(), which returns the Unicode-defined category of each character. Some of these also calculate the derived properties (for example isSpace() returns true if the character is of category Separator_* or an exceptional code point from Other_Control category).

QChar also provides direction(), which indicates the "natural" writing direction of this character. The joiningType() function indicates how the character joins with its neighbors (needed mostly for Arabic or Syriac), and finally hasMirrored() indicates whether the character needs to be mirrored when it is printed in its "unnatural" writing direction.

Composed Unicode characters (like "å", an "a" with a ring above) can be converted to decomposed Unicode ("a" followed by "ring above") by using decomposition().

In Unicode, comparison is not necessarily possible and case conversion is very difficult at best. Unicode, covering the "entire" world, also includes most of the world's case and sorting problems. operator==() and friends will do comparison based purely on the numeric Unicode value (code point) of the characters, and toUpper() and toLower() will do case changes when the character has a well-defined uppercase/lowercase equivalent. For locale-dependent comparisons, use QString::localeAwareCompare().

The conversion functions include unicode() (to a scalar), toLatin1() (to scalar, but converts all non-Latin-1 characters to 0), row() (gives the Unicode row), cell() (gives the Unicode cell), digitValue() (gives the integer value of any of the numerous digit characters), and a host of constructors.

QChar provides constructors and cast operators that make it easy to convert to and from traditional 8-bit chars. If you defined QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII, as explained in the QString documentation, you will need to explicitly call fromLatin1(), or use QLatin1Char, to construct a QChar from an 8-bit char, and you will need to call toLatin1() to get the 8-bit value back.

QString

The QString class provides a Unicode character string.

QString stores a string of 16-bit QChars, where each QChar corresponds to one UTF-16 code unit. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
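To make the surrogate-pair rule concrete, here is a minimal sketch in standard C++ (deliberately not using any Qt API) of how one code point maps to one or two 16-bit units. The function name encodeUtf16 is hypothetical and exists only for this illustration; it does not reject code points in the surrogate range itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points up to 0xFFFF fit in one unit; larger ones need a surrogate pair.
std::u16string encodeUtf16(char32_t codePoint)
{
    assert(codePoint <= 0x10FFFF); // outside this range there is no valid encoding
    std::u16string units;
    if (codePoint < 0x10000) {
        units.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Subtract 0x10000, then split the remaining 20 bits into two halves.
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch, U+0041 ("A") yields a single unit, while U+1F600 yields the two units 0xD83D and 0xDE00. In QString terms, such a character occupies two QChars, so the string length counts it twice.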

Unicode is an international standard that supports most of the writing systems in use today. It is a superset of US-ASCII (ANSI X3.4-1986) and Latin-1 (ISO 8859-1), and all the US-ASCII/Latin-1 characters are available at the same code positions.

Behind the scenes, QString uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data. This also helps reduce the inherent overhead of storing 16-bit characters instead of 8-bit characters.
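The idea behind implicit sharing can be sketched in a few lines of standard C++. This is only an illustration of copy-on-write, not how QString is implemented: the class name SharedText and its members are made up for this example, and the use_count() check is not thread-safe, whereas Qt uses atomic reference counting.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write wrapper: copies share one buffer until a write occurs.
class SharedText {
public:
    explicit SharedText(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // Reading never copies the buffer.
    const std::string &read() const { return *data_; }

    // Writing "detaches" first: if the buffer is shared, make a private copy.
    void setChar(std::size_t index, char c)
    {
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::string>(*data_);
        data_->at(index) = c; // throws std::out_of_range on a bad index
    }

    // True while both objects still point at the same underlying buffer.
    bool sharesBufferWith(const SharedText &other) const { return data_ == other.data_; }

private:
    std::shared_ptr<std::string> data_;
};
```

Copying a SharedText only copies a pointer, so passing it by value is cheap. The deep copy happens on the first write to a shared instance, which is the behavior the paragraph above describes for QString.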

In addition to QString, Qt also provides the QByteArray class to store raw bytes and traditional 8-bit '\0'-terminated strings. For most purposes, QString is the class you want to use. It is used throughout the Qt API, and the Unicode support ensures that your applications will be easy to translate if you want to expand your application's market at some point. The two main cases where QByteArray is appropriate are when you need to store raw binary data, and when memory conservation is critical (like in embedded systems).

QTextCodec

The QTextCodec class provides conversions between text encodings.

Qt uses Unicode to store, draw and manipulate strings. In many situations you may wish to deal with data that uses a different encoding. For example, most Japanese documents are still stored in Shift-JIS or ISO 2022-JP, while Russian users often have their documents in KOI8-R or Windows-1251.

Qt provides a set of QTextCodec classes to help with converting non-Unicode formats to and from Unicode. You can also create your own codec classes.

The supported encodings are:

Big5
Big5-HKSCS
CP949
EUC-JP
EUC-KR
GB18030
HP-ROMAN8
IBM 850
IBM 866
IBM 874
ISO 2022-JP
ISO 8859-1 to 10
ISO 8859-13 to 16
Iscii-Bng, Dev, Gjr, Knd, Mlm, Ori, Pnj, Tlg, and Tml
KOI8-R
KOI8-U
Macintosh
Shift-JIS
TIS-620
TSCII
UTF-8
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
Windows-1250 to 1258

If Qt is compiled with ICU support enabled, most codecs supported by ICU will also be available to the application.

QTextCodecs can be used as follows to convert some locally encoded string to Unicode. Suppose you have some string encoded in Russian KOI8-R encoding, and want to convert it to Unicode. The simple way to do it is like this:

 QByteArray encodedString = "...";
 QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
 QString string = codec->toUnicode(encodedString);

After this, string holds the text converted to Unicode. Converting a string from Unicode to the local encoding is just as easy:

 QString string = "...";
 QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
 QByteArray encodedString = codec->fromUnicode(string);

To read or write files in various encodings, use QTextStream and its setCodec() function. See the Codecs example for an application of QTextCodec to file I/O.

Some care must be taken when trying to convert the data in chunks, for example, when receiving it over a network. In such cases it is possible that a multi-byte character will be split over two chunks. At best this might result in the loss of a character and at worst cause the entire conversion to fail.

The approach to use in these situations is to create a QTextDecoder object for the codec and use this QTextDecoder for the whole decoding process, as shown below:

 QTextCodec *codec = QTextCodec::codecForName("Shift-JIS");
 QTextDecoder *decoder = codec->makeDecoder();

 QString string;
 while (new_data_available()) {
      QByteArray chunk = get_new_data();
      string += decoder->toUnicode(chunk);
 }
 delete decoder;

The QTextDecoder object maintains state between chunks and therefore works correctly even if a multi-byte character is split between chunks.
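Why a decoder has to keep state between chunks can be shown with a small UTF-8 decoder in standard C++. This is a simplified sketch, not QTextDecoder: the class name Utf8ChunkDecoder is hypothetical, and validation is minimal (no checks for overlong forms or surrogates, and a byte that breaks a sequence is dropped after emitting U+FFFD).

```cpp
#include <string>

// Hypothetical stateful decoder: remembers a partially received
// multi-byte UTF-8 sequence across calls to feed().
class Utf8ChunkDecoder {
public:
    std::u32string feed(const std::string &chunk)
    {
        std::u32string out;
        for (unsigned char byte : chunk) {
            if (remaining_ == 0) {
                if (byte < 0x80) {
                    out.push_back(byte);                           // single-byte (ASCII)
                } else if ((byte & 0xE0) == 0xC0) {
                    codePoint_ = byte & 0x1F; remaining_ = 1;      // 2-byte lead
                } else if ((byte & 0xF0) == 0xE0) {
                    codePoint_ = byte & 0x0F; remaining_ = 2;      // 3-byte lead
                } else if ((byte & 0xF8) == 0xF0) {
                    codePoint_ = byte & 0x07; remaining_ = 3;      // 4-byte lead
                } else {
                    out.push_back(0xFFFD);                         // invalid lead byte
                }
            } else if ((byte & 0xC0) == 0x80) {
                codePoint_ = (codePoint_ << 6) | (byte & 0x3F);    // continuation byte
                if (--remaining_ == 0)
                    out.push_back(codePoint_);
            } else {
                out.push_back(0xFFFD);                             // broken sequence
                remaining_ = 0;
            }
        }
        return out;
    }

private:
    char32_t codePoint_ = 0; // bits collected so far for the current character
    int remaining_ = 0;      // continuation bytes still expected
};
```

If the two bytes of "é" (0xC3 0xA9) arrive in separate chunks, the first call returns nothing for that character and the second call completes it. A stateless conversion of each chunk on its own would instead see two invalid fragments.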

If you need to create your own codec class, you can check out the instructions for doing so at http://doc.qt.io/qt-5/qtextcodec.html#creating-your-own-codec-class.

QString vs QByteArray

QByteArray

http://doc.qt.io/qt-5/qbytearray.html

The QByteArray class provides an array of bytes.

QByteArray can be used to store both raw bytes (including '\0's) and traditional 8-bit '\0'-terminated strings. Using QByteArray is much more convenient than using const char *. Behind the scenes, it always ensures that the data is followed by a '\0' terminator, and uses implicit sharing (copy-on-write) to reduce memory usage and avoid needless copying of data.

In addition to QByteArray, Qt also provides the QString class to store string data. For most purposes, QString is the class you want to use. It stores 16-bit Unicode characters, making it easy to store non-ASCII/non-Latin-1 characters in your application. Furthermore, QString is used throughout in the Qt API. The two main cases where QByteArray is appropriate are when you need to store raw binary data, and when memory conservation is critical (e.g., with Qt for Embedded Linux).

One way to initialize a QByteArray is simply to pass a const char * to its constructor. For example, the following code creates a byte array of size 5 containing the data "Hello":

 QByteArray ba("Hello");

Although the size() is 5, the byte array also maintains an extra '\0' character at the end so that if a function is used that asks for a pointer to the underlying data (e.g. a call to data()), the data pointed to is guaranteed to be '\0'-terminated.

QByteArray makes a deep copy of the const char * data, so you can modify it later without experiencing side effects. (If for performance reasons you don't want to take a deep copy of the character data, use QByteArray::fromRawData() instead.)

Another approach is to set the size of the array using resize() and to initialize the data byte per byte. QByteArray uses 0-based indexes, just like C++ arrays. To access the byte at a particular index position, you can use operator[](). On non-const byte arrays, operator[]() returns a reference to a byte that can be used on the left side of an assignment. For example:

 QByteArray ba;
 ba.resize(5);
 ba[0] = 0x3c;
 ba[1] = 0xb8;
 ba[2] = 0x64;
 ba[3] = 0x18;
 ba[4] = 0xca;

For read-only access, an alternative syntax is to use at():

 for (int i = 0; i < ba.size(); ++i) {
      if (ba.at(i) >= 'a' && ba.at(i) <= 'f')
           cout << "Found character in range [a-f]" << endl;
 }

at() can be faster than operator[](), because it never causes a deep copy to occur.

To extract many bytes at a time, use left(), right(), or mid().

A QByteArray can embed '\0' bytes. The size() function always returns the size of the whole array, including embedded '\0' bytes, but excluding the terminating '\0' added by QByteArray. For example:

 QByteArray ba1("ca\0r\0t");
 ba1.size();                     // Returns 2.
 ba1.constData();                // Returns "ca" with terminating \0.

 QByteArray ba2("ca\0r\0t", 3);
 ba2.size();                     // Returns 3.
 ba2.constData();                // Returns "ca\0" with terminating \0.

 QByteArray ba3("ca\0r\0t", 4);
 ba3.size();                     // Returns 4.
 ba3.constData();                // Returns "ca\0r" with terminating \0.

 const char cart[] = {'c', 'a', '\0', 'r', '\0', 't'};
 QByteArray ba4(QByteArray::fromRawData(cart, 6));
 ba4.size();                     // Returns 6.
 ba4.constData();                // Returns "ca\0r\0t" without terminating \0.

If you want to obtain the length of the data up to and excluding the first '\0' character, call qstrlen() on the byte array.
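The same distinction between "size of the whole array" and "length up to the first '\0'" exists in standard C++, which may help readers coming from std::string. The sketch below uses only the standard library; lengthToFirstNul is a hypothetical name chosen for this example and plays the role that qstrlen() plays in the text above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: length of the data up to, and excluding, the first '\0'.
// std::strlen stops at the first NUL byte, whereas std::string::size() counts
// every stored byte, embedded NULs included.
std::size_t lengthToFirstNul(const std::string &bytes)
{
    return std::strlen(bytes.c_str());
}
```

For a std::string built from the six bytes 'c', 'a', '\0', 'r', '\0', 't' with an explicit length of 6, size() reports 6 while this helper reports 2, mirroring the ba4 case above.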

After a call to resize(), newly allocated bytes have undefined values. To set all the bytes to a particular value, call fill().

To obtain a pointer to the actual character data, call data() or constData(). These functions return a pointer to the beginning of the data. The pointer is guaranteed to remain valid until a non-const function is called on the QByteArray. It is also guaranteed that the data ends with a '\0' byte unless the QByteArray was created from raw data. This '\0' byte is automatically provided by QByteArray and is not counted in size().

QByteArray provides the following basic functions for modifying the byte data: append(), prepend(), insert(), replace(), and remove(). For example:

 QByteArray x("and");
 x.prepend("rock ");         // x == "rock and"
 x.append(" roll");          // x == "rock and roll"
 x.replace(5, 3, "&");       // x == "rock & roll"

The replace() and remove() functions' first two arguments are the position from which to start erasing and the number of bytes that should be erased.

When you append() data to a non-empty array, the array will be reallocated and the new data copied to it. You can avoid this behavior by calling reserve(), which preallocates a certain amount of memory. You can also call capacity() to find out how much memory QByteArray actually allocated. Data appended to an empty array is not copied.

A frequent requirement is to remove whitespace characters from a byte array ('\n', '\t', ' ', etc.). If you want to remove whitespace from both ends of a QByteArray, use trimmed(). If you want to remove whitespace from both ends and replace multiple consecutive whitespaces with a single space character within the byte array, use simplified().
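What simplified() does, as described above, can be approximated in standard C++ as follows. This is a sketch of the described behavior, not Qt's implementation; the function name simplifyWhitespace is hypothetical, and "whitespace" here means whatever std::isspace reports in the current C locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: strip whitespace at both ends and collapse each
// internal run of whitespace into a single space character.
std::string simplifyWhitespace(const std::string &input)
{
    std::string result;
    bool pendingSpace = false; // a whitespace run was seen after some output
    for (unsigned char ch : input) {
        if (std::isspace(ch)) {
            // Leading whitespace is ignored because result is still empty.
            pendingSpace = !result.empty();
        } else {
            if (pendingSpace)
                result.push_back(' ');
            pendingSpace = false;
            result.push_back(static_cast<char>(ch));
        }
    }
    // A trailing whitespace run is never flushed, so the end is trimmed too.
    return result;
}
```

Given "  lots\t of\nwhitespace\r\n ", the sketch returns "lots of whitespace". A trimmed()-style function would instead leave the internal tab and newline untouched and only remove the whitespace at the two ends.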

If you want to find all occurrences of a particular character or substring in a QByteArray, use indexOf() or lastIndexOf(). The former searches forward starting from a given index position, the latter searches backward. Both return the index position of the character or substring if they find it; otherwise, they return -1. For example, here's a typical loop that finds all occurrences of a particular substring:

 QByteArray ba("We must be <b>bold</b>, very <b>bold</b>");
 int j = 0;
 while ((j = ba.indexOf("<b>", j)) != -1) {
      cout << "Found <b> tag at index position " << j << endl;
      ++j;
 }

If you simply want to check whether a QByteArray contains a particular character or substring, use contains(). If you want to find out how many times a particular character or substring occurs in the byte array, use count(). If you want to replace all occurrences of a particular value with another, use one of the two-parameter replace() overloads.

QByteArrays can be compared using overloaded operators such as operator<(), operator<=(), operator==(), operator>=(), and so on. The comparison is based exclusively on the numeric values of the characters and is very fast, but is not what a human would expect. QString::localeAwareCompare() is a better choice for sorting user-interface strings.
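The phrase "not what a human would expect" is easiest to see with mixed-case words. The following standard C++ sketch sorts by byte value, which is the same kind of comparison described above; sortByByteValue is a hypothetical name used only here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sort words using plain byte-value comparison.
// Every ASCII uppercase letter has a smaller value than any lowercase
// letter, so "Zebra" sorts before "apple".
std::vector<std::string> sortByByteValue(std::vector<std::string> words)
{
    std::sort(words.begin(), words.end());
    return words;
}
```

Sorting "apple", "Zebra", "banana" this way gives "Zebra", "apple", "banana". That ordering is fast to compute but wrong for a user-facing list, which is why the text recommends a locale-aware comparison there.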

For historical reasons, QByteArray distinguishes between a null byte array and an empty byte array. A null byte array is a byte array that is initialized using QByteArray's default constructor or by passing (const char *)0 to the constructor. An empty byte array is any byte array with size 0. A null byte array is always empty, but an empty byte array isn't necessarily null:

 QByteArray().isNull();          // returns true
 QByteArray().isEmpty();         // returns true

 QByteArray("").isNull();        // returns false
 QByteArray("").isEmpty();       // returns true

 QByteArray("abc").isNull();     // returns false
 QByteArray("abc").isEmpty();    // returns false

All functions except isNull() treat null byte arrays the same as empty byte arrays. For example, data() returns a pointer to a '\0' character for a null byte array (not a null pointer), and QByteArray() compares equal to QByteArray(""). We recommend that you always use isEmpty() and avoid isNull().

Functions that perform conversions between numeric data types and strings are performed in the C locale, irrespective of the user's locale settings. Use QString to perform locale-aware conversions between numbers and strings.

In QByteArray, the notion of uppercase and lowercase and of which character is greater than or less than another character is locale dependent. This affects functions that support a case insensitive option or that compare or lowercase or uppercase their arguments. Case insensitive operations and comparisons will be accurate if both strings contain only ASCII characters. (If $LC_CTYPE is set, most Unix systems do "the right thing".) Functions that this affects include contains(), indexOf(), lastIndexOf(), operator<(), operator<=(), operator>(), operator>=(), toLower() and toUpper().

This issue does not apply to QStrings since they represent characters using Unicode.

std::string/c string vs QString

http://2016-aalto-c.mooc.fi/fi/Module_2/index.html#04_strings

Comparing and manipulating strings efficiently

http://doc.qt.io/qt-5/qlatin1string.html
http://doc.qt.io/qt-5/qstringref.html
https://doc.qt.io/qt-5/qstringview.html

Value types in Qt

http://doc.qt.io/qt-5/custom-types.html


Exhaustive reference material mentioned in this topic

Further reading topics/links:
