Drupal mysql utf8 and latin1 character set issues
Mike Dixon
Upgrading from Drupal 4.6 has always been complicated by issues of character sets, but the Drupal upgrade scripts normally solve most of these problems for us. But when your dealing with a complex upgrade you need a good understanding of how Drupal is dealing with character sets in Mysql.
It all comes down to this - pre 4.7 Drupal stores utf8 encoded data in latin1 based tables (although this sounds a bit silly there were good reasons for it). From 4.7 onwards Drupal stores ut8 encoded data in utf8 encoded tables. The process of converting from one method to the other should be simple, but can cause problems.
Potential problem - default encoding on the new DB
Typically upgrades start with fresh databases, you very rarely upgrade your actual site in place. The process is normally this
- Content freeze on "live" 4.6 site
- Database dump of old site
- Create new database
- Import old database into the new empty DB
- Point a new version of Drupal at the new DB and run the update scripts
This should work just fine, with the update scripts managing the utf8 conversion for us. BUT you may find that when viewing your site some characters are not displaying correctly. The problem will be in step 3 above - when you create your new database it will more than likely have a default charset of utf8 - which means when you do your update the utf8 conversion doesn't work correctly. The solution is simple. Stop. Start again, but this time explicitly set the charset on your new DB to be latin1.
Untangling a Drupal utf8 and latin1 mess
There will off course be situations where you cannot just stop and start again, or you have done a custom migration - where you have written your own scripts to move data from one DB to another. In which case you need to do some manual character set changing. The process to convert from utf8 text stored in a latin1 table is quite simple - you first convert the column to binary data and then convert it back to utf8 text. But this process relies on the column having a charset of latin1 - if you have imported into a utf8 table this won't be the case - so you need to convert the column to latin1 first.
There is a simple script attached which does just this - it's based on code from the Drupal upgrade scripts (notably update_169) - and although we can't guarantee it will solve your problem it has certainly go us out of difficulties a number of times.
[EDIT] Sorry, but we lost the attachment in a server migration.