Database Collation

2021-Aug-5


🚴 From: stackoverflow.com: What's the difference between utf8_general_ci and utf8_unicode_ci?

edited Nov 26 '20 at 23:43 thomasrutter

Which is better: utf8_general_ci or utf8_unicode_ci?

Use utf8mb4 instead.

📝 Note: MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 instead of utf8 for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci.
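As a minimal sketch of what switching looks like in practice, the Python helper below builds the MySQL statement that converts an existing table to utf8mb4 with a chosen collation. The helper name convert_table_sql and the table name customers are hypothetical, and the default collation assumes a MySQL version that ships utf8mb4_0900_ai_ci (8.0 or later). Try it on a copy of your data first, since a conversion rewrites the table.

```python
import re

# Only plain identifiers are accepted, so the generated SQL cannot be
# altered by unexpected characters in the inputs.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def convert_table_sql(table, collation="utf8mb4_0900_ai_ci"):
    """Return an ALTER TABLE statement that converts `table` to utf8mb4.

    `table` is a hypothetical table name supplied by the caller.
    The default collation is an assumption: it needs MySQL 8.0 or later.
    """
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _IDENTIFIER.fullmatch(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    if not collation.startswith("utf8mb4_"):
        raise ValueError("collation must belong to the utf8mb4 character set")
    return (
        f"ALTER TABLE `{table}` "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
    )


if __name__ == "__main__":
    print(convert_table_sql("customers"))
    print(convert_table_sql("customers", "utf8mb4_unicode_ci"))
```

On an older server without the 0900 collations, passing utf8mb4_unicode_ci or utf8mb4_unicode_520_ci as the second argument would be the closest substitute.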

All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared.

_unicode_ci and _general_ci are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as _0900_ai_ci, an equivalent set of rules based on Unicode 9.0 that has no _general_ci counterpart. People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of those older collations below is provided for interest only.

Key differences

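utf8mb4_unicode_ci follows the Unicode rules for sorting and comparison, while utf8mb4_general_ci uses a simplified set of rules that gives up some accuracy in exchange for speed.

On modern servers, the performance boost from utf8mb4_general_ci will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.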

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call alphabetical order.

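As far as Latin-script (i.e. European) languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences. For example, utf8mb4_unicode_ci treats "ß" as equal to "ss", whereas utf8mb4_general_ci treats it as equal to a single "s".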

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.
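To make the trade-off concrete, here is a rough Python analogy. It is not MySQL's algorithm: the function names general_like_key and unicode_like_key are invented for this sketch, and Python's str.casefold stands in for the Unicode rules' ability to expand one character into several. The "general-like" key is restricted to one character in, one character out, which is the kind of shortcut a simplified collation takes.

```python
import unicodedata


def strip_marks(text):
    """Drop combining accents so that, for example, 'Ä' compares like 'A'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def general_like_key(text):
    # Simplified rules (illustrative only): every character maps to exactly
    # one character, so 'ß' can only be treated as a single 's'.
    return strip_marks(text).lower().replace("ß", "s")


def unicode_like_key(text):
    # Fuller rules (illustrative only): casefold() may expand a character,
    # so 'ß' becomes 'ss'.
    return strip_marks(text).casefold()


if __name__ == "__main__":
    pairs = [("Straße", "Strasse"), ("Straße", "Strase"), ("Ärger", "arger")]
    for a, b in pairs:
        print(
            a, b,
            "general-like equal:", general_like_key(a) == general_like_key(b),
            "unicode-like equal:", unicode_like_key(a) == unicode_like_key(b),
        )
```

Under the simplified key, "Straße" matches the misspelling "Strase" but not "Strasse"; under the expanding key it is the other way round. Both keys agree on "Ärger" and "arger", which reflects how rarely the two approaches diverge for most European text.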

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by bottlenecks other than this.

In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, utf8mb4_general_ci is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

What the parts mean

Firstly, ci is for case-insensitive sorting and comparison, which makes it suitable for textual data where case is not important. The other types of collation are cs (case-sensitive), for textual data where case is important, and bin, for where the encoding needs to match bit for bit, which suits fields that are really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results, and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data. If case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.

Next, unicode or general refers to the specific sorting and comparison rules, in particular the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than in one specific language. The differences between these two sets of rules are the subject of this answer.

📝 Note: unicode uses rules from Unicode 4.0.

📝 Note: Recent versions of MySQL add the rulesets unicode_520, using rules from Unicode 5.2, and 0900 (dropping the unicode_ part), using rules from Unicode 9.0.

And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode-based encodings.
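The naming scheme described above can be sketched as a small Python parser. This is an illustration only: parse_collation is a hypothetical helper, not part of MySQL or of any client library, and it assumes the name follows the charset_rules_suffix pattern. The ai and as suffixes (accent-insensitive and accent-sensitive) are not covered in the text above; they are included here because they appear in names such as utf8mb4_0900_ai_ci.

```python
# Trailing parts of a collation name and what they stand for.
# ci, cs and bin are described in the text; ai and as are added here.
SUFFIXES = {
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "bin": "binary",
}


def parse_collation(name):
    """Split a collation name into character set, ruleset and suffixes."""
    charset, _, rest = name.partition("_")
    parts = rest.split("_") if rest else []
    flags = []
    # Peel known suffixes off the end; whatever remains names the ruleset.
    while parts and parts[-1] in SUFFIXES:
        flags.insert(0, parts.pop())
    return {"charset": charset, "rules": "_".join(parts), "flags": flags}


if __name__ == "__main__":
    for name in (
        "utf8mb4_general_ci",
        "utf8mb4_unicode_520_ci",
        "utf8mb4_0900_ai_ci",
        "utf8mb4_bin",
    ):
        info = parse_collation(name)
        meanings = ", ".join(SUFFIXES[flag] for flag in info["flags"])
        print(f"{name}: charset={info['charset']}, "
              f"rules={info['rules'] or '(none)'}, {meanings}")
```

A binary collation such as utf8mb4_bin has no ruleset part at all, which the sketch reports as an empty string.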