Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid string for charset utf8mb4 #8893

Open
aschnappauf opened this issue Feb 26, 2025 · 4 comments
Open

Invalid string for charset utf8mb4 #8893

aschnappauf opened this issue Feb 26, 2025 · 4 comments

Comments

@aschnappauf
Copy link

Feel free to adjust the title to something more meaningful

I've got a table in the following form:

CREATE TABLE Products(
  id INTEGER PRIMARY KEY,
  name TEXT
);

Into which we forcefully inserted strings with a wrong encoding (latin1) so that we get the following output:

SELECT id, HEX(name) FROM Products;

/*
+----+------------------+
| id |     HEX(name)    |
+----+------------------+
| 1  | 446F6C744C6162AE |  => 'DoltLab®' encoded in latin1
+----+------------------+
*/

Querying

Regular selection does not work (unsurprisingly)

SELECT name FROM Products;

/* invalid string for charset utf8mb4: '[44 111 108 116 76 97 98 174]' */

Conversion does kinda work but not as intended

SELECT CONVERT(name USING latin1) FROM Products;

/*
+----------------------------+
| CONVERT(name USING latin1) |
+----------------------------+
| DoltLab?                   |
+----------------------------+
*/

Binary encoding does roughly as expected:

SELECT CONVERT(name USING binary) FROM Products;

/*
+----------------------------+
| CONVERT(name USING binary) |
+----------------------------+
| DoltLab�                  |
+----------------------------+
*/

Forcing binary interpretation before converting to latin1 does not improve results:

SELECT CONVERT(CONVERT(name USING binary) USING latin1) FROM Products;

/*
+--------------------------------------------------+
| CONVERT(CONVERT(name USING binary) USING latin1) |
+--------------------------------------------------+
| DoltLab?                                         |
+--------------------------------------------------+
*/

Casting plain refuses to do anything:

SELECT CAST(name AS CHAR) FROM Products;

/*
+--------------------+
| CAST(name AS CHAR) |
+--------------------+
| NULL               |
+--------------------+
*/

Casting to BINARY yields the same result as CONVERT(name USING binary).
Converting a binary-casted string also yields the same result as the CONVERT-CONVERT strategy.

Replacing the faulty bytes also does not work.

SELECT REPLACE(CAST(name AS BINARY), x'AE', x'2B') FROM Products;

/* invalid string for charset utf8mb4: '[44 111 108 116 76 97 98 174]' */

Version

Tested using Dolt 1.49.3

@Hydrocharged
Copy link
Contributor

I was finally able to replicate the issue here, and I think I've found a way to fix it.

First, we should convert the column to a BLOB, as the conversion does not verify the integrity of each string.

ALTER TABLE Products MODIFY name BLOB;

From here, we are free to modify the strings, so we can replace the bad string with the appropriate string for utf8mb4 encoding.

UPDATE Products SET name = UNHEX('446F6C744C6162C2AE') WHERE name = UNHEX('446F6C744C6162AE');

Once the strings in question have been replaced, we'll convert back to our TEXT type.

ALTER TABLE Products MODIFY name TEXT;

Our standard SELECT statement now works as expected:

SELECT name FROM Products;
/*
+----------+
|   name   |
+----------+
| DoltLab® |
+----------+
*/

Let me know if this resolves the issue for you!

@Hydrocharged
Copy link
Contributor

It's worth mentioning that we don't have to use UNHEX for the UPDATE, but it makes it much clearer the values that are being modified (especially since it's binary data in the BLOB format).

Also, if returning the column back to its standard TEXT type is throwing additional errors (since it verifies that all strings are valid during the conversion), you can easily visualize the different strings using a query such as:

SELECT CONVERT(name USING utf8mb4), HEX(name) FROM Products;

Any strings with the placeholder can be modified through an UPDATE.

@aschnappauf
Copy link
Author

For me, the ALTER TABLE throws an issue for invalid strings; Apparently it does check the integrity

@Hydrocharged
Copy link
Contributor

I wonder if there's some other invalid state, since I made sure it would work on my repro beforehand. You're specifically referring to this ALTER correct?

ALTER TABLE Products MODIFY name BLOB;

I doubt this will work, but we should still try it for the sake of thoroughness. What if we use VARBINARY instead, so it's:

ALTER TABLE Products MODIFY name VARBINARY(16000);

If this still does not work, would you mind pushing your repository to DoltHub? Or some sanitized subset that may remove any private information, but still exhibits the problem? It would help with debugging immensely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants