How to find a character which can’t be stored in a MySQL “utf8” column in Java
Posted by: 浩平 野崎
Date: June 28, 2019 04:46PM

Hello, I use MySQL 5.7 and I have a table which has a column which uses the “utf8” character set. Unfortunately it is not utf8mb4, therefore I always get an error when my app tries to insert a character which is out of the range of “utf8” (E.g. emojis).

Unfortunately I can’t change the character set to “utf8mb4” soon, so I wonder if it’s possible to detect those characters which make the error happen before it gets inserted into the table and let our customers know that they can’t use them.

Based on the following references, it seems that searching for non "BMP" characters is what I need to do,

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html
https://www.oracle.com/technetwork/java/javase/downloads/supplementary-142654.html?printOnly=1

So I ended up writing the following code to find those problematic characters:

Set<String> findProblematicStrings(String input) {
return input
.codePoints() // get Unicode code points
.filter(codePoint -> Character.charCount(codePoint) > 1) // filter BMP characters
.mapToObj(codePoint -> new String(Character.toChars(codePoint))) // convert code points into Strings
.collect(Collectors.toSet());
}

Does the code above do the thing I want? Or is there anything missing?

The code also can be found on my GitHub repo (with a unit test): https://github.com/nuzayats/mysqlutf8mb3validation/blob/master/src/main/java/utf8/Utf8Mb3Validator.java

Regards,
Kohei

Options: ReplyQuote


Subject
Written By
Posted
How to find a character which can’t be stored in a MySQL “utf8” column in Java
June 28, 2019 04:46PM


Sorry, only registered users may post in this forum.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.