Vanilla 1 Documentation Archive

 
 
vanilla:administrators:encodings
Table of Contents

Characters encodings, the UTF-8 conundrum

There has been a lot of discussion on the community forum regarding character encoding in Vanilla and MySQL. This page is intended to help resolve issues related to this subject.

First, the short story:

If:

  1. Your MySQL version is >= 4.1
  2. And you are installing a brand new Vanilla or you are upgrading from 1.0.x and you had a previously patched install where you know all database text is clean utf-8.

Then you just need (after install/upgrade) to copy the following line from appg/settings.php :

$Configuration['DATABASE_CHARACTER_ENCODING'] = '';

to the file conf/settings.php, and edit it like this :

$Configuration['DATABASE_CHARACTER_ENCODING'] = 'utf8';

This is the default set-up for new installation for Vanilla >= 1.1.1

That’s it. Read further only if you need it or care to know.

Why UTF-8

At first glance, an English focused Vanilla forum doesn’t care about encoding since even ASCII is enough for English. But it can turn out, rather quickly, that’s not true, because there are many place where accented or non-Latin characters show up: names, books titles, music lyrics.

For a non-English forum the problem is the same: being stuck to a limited encoding like latin1 deprives you of any input in eastern European and non-Latin languages.

Furthermore, utf-8 is the upcoming standard on the web and the default charset for xhtml.

For all these reasons utf-8 is the right choice nowadays, and that’s why Vanilla switched to it before 1.0.

What is my situation ?

case # Vanilla charset connection charset database charset database state comment
1 utf8 utf8 utf8 OK Vanilla correctly setup for any character entry
2 latin1 latin1 utf8 OK Limited to western languages, upgrade on Vanilla side only
3 utf8 utf8 latin1 OK Limited to western languages, upgrade on MySQL side
4 latin1 latin1 latin1 OK Limited to western languages, upgrade on both sides
5 utf8 latin1 latin1 WEIRD! You should fix it !
6 utf8 latin1 utf8 WORST! As above, more tricky !!

Case #

  1. Well and right. You are in this situation if you just installed Vanilla 1.1.x and added the DATABASE_CHARACTER_ENCODING setting, as described at the start of this page. Also, 1.0.x admins who applied the right patch are in this clean state.
  2. Few, if any, Vanilla install should be in this configuration. Upgrade to #1 following the step Vanilla side in the following chapter.
  3. I guess no Vanilla install is in this configuration. Upgrade to #1 following the step MySQL side in the following chapter.
  4. There are folks in this state. Either because they run a Vanilla forum from time before Mark switched to utf-8, and when upgrading to recent release changed every “utf8” to latin1 (ISO-8859-1). Or else because they got stuck by the conundrum and switched to latin1 to get it running. Upgrading to #1 needs both sides steps.
  5. Here comes, presumably, the vast majority of Vanilla installations. I call it weird because utf-8 characters are stored without modification, but stamped as being latin1. A client trying to read something the right way will get garbled strings. If it’s a latin1 client declared as latin1, MySQL will serve the characters as they are: utf-8 = garbled in latin1 display. For an utf-8 client, MySQL will kindly transcode the string, which does not need it. The client get something really unusable being a true non-existent encoding.
    All of you running MySQL < 4.1 are in this case. The weirdness is there harmless, because MySQL doesn’t care about encoding and will not attempt to modify anything. But, nevertheless, to use the database content, a client must know this is utf-8 stuff. If you run MySQL < 4.1, better stay in this case until you upgrade. Why? Because you keep utf-8 rich content on the forum side. When you upgrade MySQL, do it the right way (see below).
    MySQL >= 4.1 users might be in this case also and they are the complaining crowd on the community forum. They should fix it before getting in a situation where different clients (i.e. Vanilla/phpMyAdmin) trash each other content.
  6. Lastly the worst case. If Vanilla was setup on MySQL >= 4.1, with default utf-8 encoding, and with a standard (not patched) 1.0.x or a 1.1.x and no DATABASE_CHARACTER_ENCODING setting, this is it. All strings are already falsely transcoded from utf-8 to nothing. Vanilla is still able to read them, pretending to be a latin1 client, MySQL do the reverse translation, and reading it as utf-8. Fix-it ASAP!

Why should I fix ?

If you read the preceding chapter you should be convinced already.

Cleaning

Depending of the case of your installation, you may have to work on Vanilla side, MySQL side or both.
Vanilla side is easy, MySQL may require more work and trials.
Usual consideration applies: backup before attempting any modification on your installation.

For the following step you need a text editor able to open text files in a given charset and to save in a specific charset as well. There are many out there.
You also need a tool to access and manage your MySQL server. PhpMyAdmin is the a common tool used to administer a MySQL server directly. You should have minimal knowledge of this tool to be able to export/import a SQL file.

Notice : If you MySQL version in < 4.1, hold on until you upgrade it. No matter what your current state is, it is likely not worth the effort to do anything to the data until your MySQL has been upgraded.

MySQL Side

Fixing MySQL state can turn out badly if you make mistakes. I’d advise to setup a separate base from the currently running one. Most hosting providers allow for several databases, even low end shared hosting. This way you can use trial and error on this second database until it’s ok and then switch Vanilla by editing your conf/database.php settings.

Note There are more effective ways to achieve the same goal, using scripts or SQL commands. This process intend to be followed with a minimal MySQL knowledge.

  1. Connect to you MySQL with (say) phpMyAdmin, and select you current database.
  2. Export the whole base, structure and data, to a SQL text file. Keep a backup of this file for reference.
  3. Open this file in your text editor as an utf-8 file. Search through the file for a non ascii string, and:
    1. It looks ok: you were in the weird case, go to the #4 below.
    2. It’s garbled: you were in the worst case. Strings in this file are using a non-encoding.
      1. Transcode the file to latin1. Depending of your editor it may be necessary to Save As… and choose encoding in the save dialog. At this stage, the strings are still garbled.
      2. Open this file again, as utf-8. It should now looks good. If not, check the process to be sure that the editor did transcode. There is no point to go further while your strings are not clean.
  4. Now, look at the first CREATE TABLE command in the file. If the file was exported from MySQL > 4.1 you’d see a DEFAULT CHARSET=latin1 on the last line of this command. Delete all occurrences of this text bit from the file (or replace it by DEFAULT CHARSET=utf8 if in the next step you cannot create a database or change its default collation) and save it.
  5. Go to phpMyAdmin and create a new database. Make it with utf8_general_ci default collation. Check that phpMyAdmin is using utf-8 for its connection as well.
  6. Import the saved SQL file into this new base.
  7. Look at the base data to check that non ASCII strings are fine.

Done. Before trying to connect Vanilla to this new base, don’t forget to make it utf-8 aware, see below.

Vanilla Side

  1. Add the $Configuration[’DATABASE_CHARACTER_ENCODING’] = ‘utf8’; line in you conf/settings.php files, as described on the top of this page. If you have no language add-on installed, this is the only step on Vanilla side.
  2. If you have language(s) add-on: For each one, check that all files in it are using utf-8 encoding. If not, save them in utf-8. Most of current languages add-on are already utf-8 if you didn’t change it. Don’t forget to check out the conf/language.php file.
  3. Special issue : If you did patch your install to be latin1 rather than utf-8, some admins did it, revert those change back, by looking through forum source for latin1/ISO-8859-1. Be aware that some places need “utf-8” (i.e. for html meta) other “utf8” (MySQL setting above).

Character Sets and Collations

These are the character sets available in phpmyadmin, when you first setup your database. Use utf8_general_ci or utf8_unicode_ci for universal, multilingual support.

Collation Description
armscii8 (ARMSCII-8 Armenian)
armscii8_bin Armenian, Binary
armscii8_general_ci Armenian, case-insensitive
ascii (US ASCII)
ascii_bin West European (multilingual), Binary
ascii_general_ci West European (multilingual), case-insensitive
big5 (Big5 Traditional Chinese)
big5_bin Traditional Chinese, Binary
big5_chinese_ci Traditional Chinese, case-insensitive
binary (Binary pseudo charset)
binary Binary
cp1250 (Windows Central European)
cp1250_bin Central European (multilingual), Binary
cp1250_croatian_ci Croatian, case-insensitive
cp1250_czech_cs Czech, case-sensitive
cp1250_general_ci Central European (multilingual), case-insensitive
cp1251 (Windows Cyrillic)
cp1251_bin Cyrillic (multilingual), Binary
cp1251_bulgarian_ci Bulgarian, case-insensitive
cp1251_general_ci Cyrillic (multilingual), case-insensitive
cp1251_general_cs Cyrillic (multilingual), case-sensitive
cp1251_ukrainian_ci Ukrainian, case-insensitive
cp1256 (Windows Arabic)
cp1256_bin Arabic, Binary
cp1256_general_ci Arabic, case-insensitive
cp1257 (Windows Baltic)
cp1257_bin Baltic (multilingual), Binary
cp1257_general_ci Baltic (multilingual), case-insensitive
cp1257_lithuanian_ci Lithuanian, case-insensitive
cp850 (DOS West European)
cp850_bin West European (multilingual), Binary
cp850_general_ci West European (multilingual), case-insensitive
cp852 (DOS Central European)
cp852_bin Central European (multilingual), Binary
cp852_general_ci Central European (multilingual), case-insensitive
cp866 (DOS Russian)
cp866_bin Russian, Binary
cp866_general_ci Russian, case-insensitive
cp932 (SJIS for Windows Japanese)
cp932_bin Japanese, Binary
cp932_japanese_ci Japanese, case-insensitive
dec8 (DEC West European)
dec8_bin West European (multilingual), Binary
dec8_swedish_ci Swedish, case-insensitive
eucjpms (UJIS for Windows Japanese)
eucjpms_bin Japanese, Binary
eucjpms_japanese_ci Japanese, case-insensitive
euckr (EUC-KR Korean)
euckr_bin Korean, Binary
euckr_korean_ci Korean, case-insensitive
gb2312 (GB2312 Simplified Chinese)
gb2312_bin Simplified Chinese, Binary
gb2312_chinese_ci Simplified Chinese, case-insensitive
gbk (GBK Simplified Chinese)
gbk_bin Simplified Chinese, Binary
gbk_chinese_ci Simplified Chinese, case-insensitive
geostd8 (GEOSTD8 Georgian)
geostd8_bin Georgian, Binary
geostd8_general_ci Georgian, case-insensitive
greek (ISO 8859-7 Greek)
greek_bin Greek, Binary
greek_general_ci Greek, case-insensitive
hebrew (ISO 8859-8 Hebrew)
hebrew_bin Hebrew, Binary
hebrew_general_ci Hebrew, case-insensitive
hp8 (HP West European)
hp8_bin West European (multilingual), Binary
hp8_english_ci English, case-insensitive
keybcs2 (DOS Kamenicky Czech-Slovak)
keybcs2_bin Czech-Slovak, Binary
keybcs2_general_ci Czech-Slovak, case-insensitive
koi8r (KOI8-R Relcom Russian)
koi8r_bin Russian, Binary
koi8r_general_ci Russian, case-insensitive
koi8u (KOI8-U Ukrainian)
koi8u_bin Ukrainian, Binary
koi8u_general_ci Ukrainian, case-insensitive
latin1 (cp1252 West European)
latin1_bin West European (multilingual), Binary
latin1_danish_ci Danish, case-insensitive
latin1_general_ci West European (multilingual), case-insensitive
latin1_general_cs West European (multilingual), case-sensitive
latin1_german1_ci German (dictionary), case-insensitive
latin1_german2_ci German (phone book), case-insensitive
latin1_spanish_ci Spanish, case-insensitive
latin1_swedish_ci Swedish, case-insensitive
latin2 (ISO 8859-2 Central European)
latin2_bin Central European (multilingual), Binary
latin2_croatian_ci Croatian, case-insensitive
latin2_czech_cs Czech, case-sensitive
latin2_general_ci Central European (multilingual), case-insensitive
latin2_hungarian_ci Hungarian, case-insensitive
latin5 (ISO 8859-9 Turkish)
latin5_bin Turkish, Binary
latin5_turkish_ci Turkish, case-insensitive
latin7 (ISO 8859-13 Baltic)
latin7_bin Baltic (multilingual), Binary
latin7_estonian_cs Estonian, case-sensitive
latin7_general_ci Baltic (multilingual), case-insensitive
latin7_general_cs Baltic (multilingual), case-sensitive
macce (Mac Central European)
macce_bin Central European (multilingual), Binary
macce_general_ci Central European (multilingual), case-insensitive
macroman (Mac West European)
macroman_bin West European (multilingual), Binary
macroman_general_ci West European (multilingual), case-insensitive
sjis (Shift-JIS Japanese)
sjis_bin Japanese, Binary
sjis_japanese_ci Japanese, case-insensitive
swe7 (7bit Swedish)
swe7_bin Swedish, Binary
swe7_swedish_ci Swedish, case-insensitive
tis620 (TIS620 Thai)
tis620_bin Thai, Binary
tis620_thai_ci Thai, case-insensitive
ucs2 (UCS-2 Unicode)
ucs2_bin Unicode (multilingual), Binary
ucs2_czech_ci Czech, case-insensitive
ucs2_danish_ci Danish, case-insensitive
ucs2_esperanto_ci Esperanto, case-insensitive
ucs2_estonian_ci Estonian, case-insensitive
ucs2_general_ci Unicode (multilingual), case-insensitive
ucs2_hungarian_ci Hungarian, case-insensitive
ucs2_icelandic_ci Icelandic, case-insensitive
ucs2_latvian_ci Latvian, case-insensitive
ucs2_lithuanian_ci Lithuanian, case-insensitive
ucs2_persian_ci Persian, case-insensitive
ucs2_polish_ci Polish, case-insensitive
ucs2_roman_ci West European, case-insensitive
ucs2_romanian_ci Romanian, case-insensitive
ucs2_slovak_ci Slovak, case-insensitive
ucs2_slovenian_ci Slovenian, case-insensitive
ucs2_spanish2_ci Traditional Spanish, case-insensitive
ucs2_spanish_ci Spanish, case-insensitive
ucs2_swedish_ci Swedish, case-insensitive
ucs2_turkish_ci Turkish, case-insensitive
ucs2_unicode_ci Unicode (multilingual), case-insensitive
ujis (EUC-JP Japanese)
ujis_bin Japanese, Binary
ujis_japanese_ci Japanese, case-insensitive
utf8 (UTF-8 Unicode)
utf8_bin Unicode (multilingual), Binary
utf8_czech_ci Czech, case-insensitive
utf8_danish_ci Danish, case-insensitive
utf8_esperanto_ci Esperanto, case-insensitive
utf8_estonian_ci Estonian, case-insensitive
utf8_general_ci Unicode (multilingual), case-insensitive
utf8_hungarian_ci Hungarian, case-insensitive
utf8_icelandic_ci Icelandic, case-insensitive
utf8_latvian_ci Latvian, case-insensitive
utf8_lithuanian_ci Lithuanian, case-insensitive
utf8_persian_ci Persian, case-insensitive
utf8_polish_ci Polish, case-insensitive
utf8_roman_ci West European, case-insensitive
utf8_romanian_ci Romanian, case-insensitive
utf8_slovak_ci Slovak, case-insensitive
utf8_slovenian_ci Slovenian, case-insensitive
utf8_spanish2_ci Traditional Spanish, case-insensitive
utf8_spanish_ci Spanish, case-insensitive
utf8_swedish_ci Swedish, case-insensitive
utf8_turkish_ci Turkish, case-insensitive
utf8_unicode_ci Unicode (multilingual), case-insensitive

A note on Unicode variation

This note is about a specific point in using unicode/utf-8 and is not related to the original conundrum

Summary

Sometime you may run in situation where everything seem correctly set to utf-8 and nevertheless there is a bug. A text is displayed correctly, but you can’t search for it, or a file is not found while its name looks correct.

Note: if something is not displayed correctly this paragraph is of no use.

Explanation (short)

For historic reason somme (many) characters can be encoded under several forms. A simple common example is an accented character: é . It can be encoded as a single character OR as two : . There, the accent is a separate character, with its own unicode number, while the letter e is the legacy ASCII character. In non latin languages their are many cases, sometimes more complicated.

Both forms are correct. They are named normalization and defined in this unicode document. The first is defined as composed and named Normalization Form C (NFC). The second is Normalization Form D (NFD) (decomposed). There are also two other forms, but we’ll ignore them to keep it simple. Both normalization forms look the same when displayed. But if you try compare the strings, even with multi-bytes PHP functions, the result will be false. These two strings lengths are different. These functions do not normalize, they are not even aware of this normalization form stuff.

The bad trip begins when different applications are using different normalizations. A common case in on filenames. Mac OS X filesystem HFS+ uses NFD (decomposed). Most linux NFC (if utf-8 at all), forget about Windows!

W3C defines the web character encoding to be UTF-8 NFC, so servers and browsers are all likely to use NFC when utf-8. But, if you copy and paste from some document to a html source, you may be using NFD. Also, if you transfer files through FTP from an OS X (Darwin) system, filenames are NFD. And so on. You can guess some problem with database stored strings like passwords.

In everyday use, the most common problem, IMO, is on filenames. But be aware of the point for passwords and search strings.

Remember, the normalization is made when encoding. Displaying is always OK.

On major projects, like Wikipedia, this is a very relevant concern and that’s why mediawiki authors developed the PHP tool linked below.

Workaround HOWTO

For a Vanilla webmaster there is not much place to work on this, beside being aware of the potential problem. In some case we have to normalize strings/filenames to ensure we can compare them or use it in URLs.

This is not straightforward, because we have no standard tools in PHP and not many overall. I’m aware of both for the webmaster:

This last one can be used everywhere. When I have to use accented (or other non ascii) filenames, I run a small script on the server side, based on UtfNormal.php, to ensure all are in NFC.

@Mark (Vanilla’s author): if you ever read this page again an see this chapter, you may consider normalizing passwords and user names/login in V2.

@Extension authors: if your extension deals with filenames, attachments and such, beware!

Useful tools

This section to add informations on tools and options related to the normalization stuff. Feel free to contribute.

Enjoy your utf-8 capable forum!

Ask or comment in this discussion

 
 
 
 
vanilla/administrators/encodings.txt · Last modified: 2009/11/26 19:50 by dinoboff
Vanilla 1 Documentation archive (Wiki by DokuWiki)