There has been a lot of discussion on the community forum regarding character encoding in Vanilla and MySQL. This page is intended to help resolve issues related to this subject.
First, the short story:
If:
Then you just need (after install/upgrade) to copy the following line from appg/settings.php :
$Configuration['DATABASE_CHARACTER_ENCODING'] = '';
to the file conf/settings.php, and edit it like this :
$Configuration['DATABASE_CHARACTER_ENCODING'] = 'utf8';
This is the default set-up for new installation for Vanilla >= 1.1.1
That’s it. Read further only if you need it or care to know.
At first glance, an English focused Vanilla forum doesn’t care about encoding since even ASCII is enough for English. But it can turn out, rather quickly, that’s not true, because there are many place where accented or non-Latin characters show up: names, books titles, music lyrics.
For a non-English forum the problem is the same: being stuck to a limited encoding like latin1 deprives you of any input in eastern European and non-Latin languages.
Furthermore, utf-8 is the upcoming standard on the web and the default charset for xhtml.
For all these reasons utf-8 is the right choice nowadays, and that’s why Vanilla switched to it before 1.0.
| case # | Vanilla charset | connection charset | database charset | database state | comment |
|---|---|---|---|---|---|
| 1 | utf8 | utf8 | utf8 | OK | Vanilla correctly setup for any character entry |
| 2 | latin1 | latin1 | utf8 | OK | Limited to western languages, upgrade on Vanilla side only |
| 3 | utf8 | utf8 | latin1 | OK | Limited to western languages, upgrade on MySQL side |
| 4 | latin1 | latin1 | latin1 | OK | Limited to western languages, upgrade on both sides |
| 5 | utf8 | latin1 | latin1 | WEIRD! | You should fix it ! |
| 6 | utf8 | latin1 | utf8 | WORST! | As above, more tricky !! |
$Configuration[’CHARSET’] setting. All files used by Vanilla should be edited and saved with this encoding. Changing this setting implies transcoding relevant files. English only files are safe in every encoding defined in this memo, so you can switch easily between encodings as long as you have no translation files.DATABASE_CHARACTER_ENCODING setting, as described at the start of this page. Also, 1.0.x admins who applied the right patch are in this clean state.DATABASE_CHARACTER_ENCODING setting, this is it. All strings are already falsely transcoded from utf-8 to nothing. Vanilla is still able to read them, pretending to be a latin1 client, MySQL do the reverse translation, and reading it as utf-8. Fix-it ASAP!If you read the preceding chapter you should be convinced already.
Depending of the case of your installation, you may have to work on Vanilla side, MySQL side or both.
Vanilla side is easy, MySQL may require more work and trials.
Usual consideration applies: backup before attempting any modification on your installation.
For the following step you need a text editor able to open text files in a given charset and to save in a specific charset as well. There are many out there.
You also need a tool to access and manage your MySQL server. PhpMyAdmin is the a common tool used to administer a MySQL server directly. You should have minimal knowledge of this tool to be able to export/import a SQL file.
Notice : If you MySQL version in < 4.1, hold on until you upgrade it. No matter what your current state is, it is likely not worth the effort to do anything to the data until your MySQL has been upgraded.
Fixing MySQL state can turn out badly if you make mistakes. I’d advise to setup a separate base from the currently running one. Most hosting providers allow for several databases, even low end shared hosting. This way you can use trial and error on this second database until it’s ok and then switch Vanilla by editing your conf/database.php settings.
Note There are more effective ways to achieve the same goal, using scripts or SQL commands. This process intend to be followed with a minimal MySQL knowledge.
CREATE TABLE command in the file. If the file was exported from MySQL > 4.1 you’d see a DEFAULT CHARSET=latin1 on the last line of this command. Delete all occurrences of this text bit from the file (or replace it by DEFAULT CHARSET=utf8 if in the next step you cannot create a database or change its default collation) and save it.utf8_general_ci default collation. Check that phpMyAdmin is using utf-8 for its connection as well.Done. Before trying to connect Vanilla to this new base, don’t forget to make it utf-8 aware, see below.
$Configuration[’DATABASE_CHARACTER_ENCODING’] = ‘utf8’; line in you conf/settings.php files, as described on the top of this page. If you have no language add-on installed, this is the only step on Vanilla side.conf/language.php file.These are the character sets available in phpmyadmin, when you first setup your database. Use utf8_general_ci or utf8_unicode_ci for universal, multilingual support.
| Collation | Description |
|---|---|
| armscii8 (ARMSCII-8 Armenian) | |
| armscii8_bin | Armenian, Binary |
| armscii8_general_ci | Armenian, case-insensitive |
| ascii (US ASCII) | |
| ascii_bin | West European (multilingual), Binary |
| ascii_general_ci | West European (multilingual), case-insensitive |
| big5 (Big5 Traditional Chinese) | |
| big5_bin | Traditional Chinese, Binary |
| big5_chinese_ci | Traditional Chinese, case-insensitive |
| binary (Binary pseudo charset) | |
| binary | Binary |
| cp1250 (Windows Central European) | |
| cp1250_bin | Central European (multilingual), Binary |
| cp1250_croatian_ci | Croatian, case-insensitive |
| cp1250_czech_cs | Czech, case-sensitive |
| cp1250_general_ci | Central European (multilingual), case-insensitive |
| cp1251 (Windows Cyrillic) | |
| cp1251_bin | Cyrillic (multilingual), Binary |
| cp1251_bulgarian_ci | Bulgarian, case-insensitive |
| cp1251_general_ci | Cyrillic (multilingual), case-insensitive |
| cp1251_general_cs | Cyrillic (multilingual), case-sensitive |
| cp1251_ukrainian_ci | Ukrainian, case-insensitive |
| cp1256 (Windows Arabic) | |
| cp1256_bin | Arabic, Binary |
| cp1256_general_ci | Arabic, case-insensitive |
| cp1257 (Windows Baltic) | |
| cp1257_bin | Baltic (multilingual), Binary |
| cp1257_general_ci | Baltic (multilingual), case-insensitive |
| cp1257_lithuanian_ci | Lithuanian, case-insensitive |
| cp850 (DOS West European) | |
| cp850_bin | West European (multilingual), Binary |
| cp850_general_ci | West European (multilingual), case-insensitive |
| cp852 (DOS Central European) | |
| cp852_bin | Central European (multilingual), Binary |
| cp852_general_ci | Central European (multilingual), case-insensitive |
| cp866 (DOS Russian) | |
| cp866_bin | Russian, Binary |
| cp866_general_ci | Russian, case-insensitive |
| cp932 (SJIS for Windows Japanese) | |
| cp932_bin | Japanese, Binary |
| cp932_japanese_ci | Japanese, case-insensitive |
| dec8 (DEC West European) | |
| dec8_bin | West European (multilingual), Binary |
| dec8_swedish_ci | Swedish, case-insensitive |
| eucjpms (UJIS for Windows Japanese) | |
| eucjpms_bin | Japanese, Binary |
| eucjpms_japanese_ci | Japanese, case-insensitive |
| euckr (EUC-KR Korean) | |
| euckr_bin | Korean, Binary |
| euckr_korean_ci | Korean, case-insensitive |
| gb2312 (GB2312 Simplified Chinese) | |
| gb2312_bin | Simplified Chinese, Binary |
| gb2312_chinese_ci | Simplified Chinese, case-insensitive |
| gbk (GBK Simplified Chinese) | |
| gbk_bin | Simplified Chinese, Binary |
| gbk_chinese_ci | Simplified Chinese, case-insensitive |
| geostd8 (GEOSTD8 Georgian) | |
| geostd8_bin | Georgian, Binary |
| geostd8_general_ci | Georgian, case-insensitive |
| greek (ISO 8859-7 Greek) | |
| greek_bin | Greek, Binary |
| greek_general_ci | Greek, case-insensitive |
| hebrew (ISO 8859-8 Hebrew) | |
| hebrew_bin | Hebrew, Binary |
| hebrew_general_ci | Hebrew, case-insensitive |
| hp8 (HP West European) | |
| hp8_bin | West European (multilingual), Binary |
| hp8_english_ci | English, case-insensitive |
| keybcs2 (DOS Kamenicky Czech-Slovak) | |
| keybcs2_bin | Czech-Slovak, Binary |
| keybcs2_general_ci | Czech-Slovak, case-insensitive |
| koi8r (KOI8-R Relcom Russian) | |
| koi8r_bin | Russian, Binary |
| koi8r_general_ci | Russian, case-insensitive |
| koi8u (KOI8-U Ukrainian) | |
| koi8u_bin | Ukrainian, Binary |
| koi8u_general_ci | Ukrainian, case-insensitive |
| latin1 (cp1252 West European) | |
| latin1_bin | West European (multilingual), Binary |
| latin1_danish_ci | Danish, case-insensitive |
| latin1_general_ci | West European (multilingual), case-insensitive |
| latin1_general_cs | West European (multilingual), case-sensitive |
| latin1_german1_ci | German (dictionary), case-insensitive |
| latin1_german2_ci | German (phone book), case-insensitive |
| latin1_spanish_ci | Spanish, case-insensitive |
| latin1_swedish_ci | Swedish, case-insensitive |
| latin2 (ISO 8859-2 Central European) | |
| latin2_bin | Central European (multilingual), Binary |
| latin2_croatian_ci | Croatian, case-insensitive |
| latin2_czech_cs | Czech, case-sensitive |
| latin2_general_ci | Central European (multilingual), case-insensitive |
| latin2_hungarian_ci | Hungarian, case-insensitive |
| latin5 (ISO 8859-9 Turkish) | |
| latin5_bin | Turkish, Binary |
| latin5_turkish_ci | Turkish, case-insensitive |
| latin7 (ISO 8859-13 Baltic) | |
| latin7_bin | Baltic (multilingual), Binary |
| latin7_estonian_cs | Estonian, case-sensitive |
| latin7_general_ci | Baltic (multilingual), case-insensitive |
| latin7_general_cs | Baltic (multilingual), case-sensitive |
| macce (Mac Central European) | |
| macce_bin | Central European (multilingual), Binary |
| macce_general_ci | Central European (multilingual), case-insensitive |
| macroman (Mac West European) | |
| macroman_bin | West European (multilingual), Binary |
| macroman_general_ci | West European (multilingual), case-insensitive |
| sjis (Shift-JIS Japanese) | |
| sjis_bin | Japanese, Binary |
| sjis_japanese_ci | Japanese, case-insensitive |
| swe7 (7bit Swedish) | |
| swe7_bin | Swedish, Binary |
| swe7_swedish_ci | Swedish, case-insensitive |
| tis620 (TIS620 Thai) | |
| tis620_bin | Thai, Binary |
| tis620_thai_ci | Thai, case-insensitive |
| ucs2 (UCS-2 Unicode) | |
| ucs2_bin | Unicode (multilingual), Binary |
| ucs2_czech_ci | Czech, case-insensitive |
| ucs2_danish_ci | Danish, case-insensitive |
| ucs2_esperanto_ci | Esperanto, case-insensitive |
| ucs2_estonian_ci | Estonian, case-insensitive |
| ucs2_general_ci | Unicode (multilingual), case-insensitive |
| ucs2_hungarian_ci | Hungarian, case-insensitive |
| ucs2_icelandic_ci | Icelandic, case-insensitive |
| ucs2_latvian_ci | Latvian, case-insensitive |
| ucs2_lithuanian_ci | Lithuanian, case-insensitive |
| ucs2_persian_ci | Persian, case-insensitive |
| ucs2_polish_ci | Polish, case-insensitive |
| ucs2_roman_ci | West European, case-insensitive |
| ucs2_romanian_ci | Romanian, case-insensitive |
| ucs2_slovak_ci | Slovak, case-insensitive |
| ucs2_slovenian_ci | Slovenian, case-insensitive |
| ucs2_spanish2_ci | Traditional Spanish, case-insensitive |
| ucs2_spanish_ci | Spanish, case-insensitive |
| ucs2_swedish_ci | Swedish, case-insensitive |
| ucs2_turkish_ci | Turkish, case-insensitive |
| ucs2_unicode_ci | Unicode (multilingual), case-insensitive |
| ujis (EUC-JP Japanese) | |
| ujis_bin | Japanese, Binary |
| ujis_japanese_ci | Japanese, case-insensitive |
| utf8 (UTF-8 Unicode) | |
| utf8_bin | Unicode (multilingual), Binary |
| utf8_czech_ci | Czech, case-insensitive |
| utf8_danish_ci | Danish, case-insensitive |
| utf8_esperanto_ci | Esperanto, case-insensitive |
| utf8_estonian_ci | Estonian, case-insensitive |
| utf8_general_ci | Unicode (multilingual), case-insensitive |
| utf8_hungarian_ci | Hungarian, case-insensitive |
| utf8_icelandic_ci | Icelandic, case-insensitive |
| utf8_latvian_ci | Latvian, case-insensitive |
| utf8_lithuanian_ci | Lithuanian, case-insensitive |
| utf8_persian_ci | Persian, case-insensitive |
| utf8_polish_ci | Polish, case-insensitive |
| utf8_roman_ci | West European, case-insensitive |
| utf8_romanian_ci | Romanian, case-insensitive |
| utf8_slovak_ci | Slovak, case-insensitive |
| utf8_slovenian_ci | Slovenian, case-insensitive |
| utf8_spanish2_ci | Traditional Spanish, case-insensitive |
| utf8_spanish_ci | Spanish, case-insensitive |
| utf8_swedish_ci | Swedish, case-insensitive |
| utf8_turkish_ci | Turkish, case-insensitive |
| utf8_unicode_ci | Unicode (multilingual), case-insensitive |
This note is about a specific point in using unicode/utf-8 and is not related to the original conundrum
Sometime you may run in situation where everything seem correctly set to utf-8 and nevertheless there is a bug. A text is displayed correctly, but you can’t search for it, or a file is not found while its name looks correct.
Note: if something is not displayed correctly this paragraph is of no use.
For historic reason somme (many) characters can be encoded under several forms. A simple common example is an accented character: é . It can be encoded as a single character OR as two : e´ . There, the accent is a separate character, with its own unicode number, while the letter e is the legacy ASCII character. In non latin languages their are many cases, sometimes more complicated.
Both forms are correct. They are named normalization and defined in this unicode document. The first is defined as composed and named Normalization Form C (NFC). The second is Normalization Form D (NFD) (decomposed). There are also two other forms, but we’ll ignore them to keep it simple. Both normalization forms look the same when displayed. But if you try compare the strings, even with multi-bytes PHP functions, the result will be false. These two strings lengths are different. These functions do not normalize, they are not even aware of this normalization form stuff.
The bad trip begins when different applications are using different normalizations. A common case in on filenames. Mac OS X filesystem HFS+ uses NFD (decomposed). Most linux NFC (if utf-8 at all), forget about Windows!
W3C defines the web character encoding to be UTF-8 NFC, so servers and browsers are all likely to use NFC when utf-8. But, if you copy and paste from some document to a html source, you may be using NFD. Also, if you transfer files through FTP from an OS X (Darwin) system, filenames are NFD. And so on. You can guess some problem with database stored strings like passwords.
In everyday use, the most common problem, IMO, is on filenames. But be aware of the point for passwords and search strings.
Remember, the normalization is made when encoding. Displaying is always OK.
On major projects, like Wikipedia, this is a very relevant concern and that’s why mediawiki authors developed the PHP tool linked below.
For a Vanilla webmaster there is not much place to work on this, beside being aware of the potential problem. In some case we have to normalize strings/filenames to ensure we can compare them or use it in URLs.
This is not straightforward, because we have no standard tools in PHP and not many overall. I’m aware of both for the webmaster:
This last one can be used everywhere. When I have to use accented (or other non ascii) filenames, I run a small script on the server side, based on UtfNormal.php, to ensure all are in NFC.
@Mark (Vanilla’s author): if you ever read this page again an see this chapter, you may consider normalizing passwords and user names/login in V2.
@Extension authors: if your extension deals with filenames, attachments and such, beware!
This section to add informations on tools and options related to the normalization stuff. Feel free to contribute.
defaults write ch.sudo.cyberduck path.normalize.unicode true in a Terminal.app window and relaunch Cyberduck. File names are NFC normalized when transfered. Note: When downloading to OS X a NFC file name is translated back to NFD at system level.Enjoy your utf-8 capable forum!
Ask or comment in this discussion