Hello, I use MySQL 5.7 and I have a table which has a column which uses the “utf8” character set. Unfortunately it is not utf8mb4, therefore I always get an error when my app tries to insert a character which is out of the range of “utf8” (E.g. emojis).
Unfortunately I can’t change the character set to “utf8mb4” soon, so I wonder if it’s possible to detect those characters which make the error happen before it gets inserted into the table and let our customers know that they can’t use them.
Based on the following references, it seems that searching for non "BMP" characters is what I need to do,
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html
https://www.oracle.com/technetwork/java/javase/downloads/supplementary-142654.html?printOnly=1
So I ended up writing the following code to find those problematic characters:
Set<String> findProblematicStrings(String input) {
return input
.codePoints() // get Unicode code points
.filter(codePoint -> Character.charCount(codePoint) > 1) // filter BMP characters
.mapToObj(codePoint -> new String(Character.toChars(codePoint))) // convert code points into Strings
.collect(Collectors.toSet());
}
Does the code above do the thing I want? Or is there anything missing?
The code also can be found on my GitHub repo (with a unit test):
https://github.com/nuzayats/mysqlutf8mb3validation/blob/master/src/main/java/utf8/Utf8Mb3Validator.java
Regards,
Kohei