samedi 28 février 2015

Why is converting my database to UTF-8 truncating entries?


I'm running WordPress 4.1.1. My site is hosted at NearlyFreeSpeech.


I've been using my WordPress blog since the days when it encoded information in the latin1 charset. This week, I realized that certain posts on my blog (like these: 1-2-3) weren't displaying Japanese characters--the characters either show up as question marks, or as strings of characters like 日本語.


Clearly, this is an encoding error. Looking at my database in phpMyAdmin, I have many tables and columns in my database with their collation set to latin1_swedish_ci. I sought to fix this by changing the database to UTF-8 in a variety of different ways. They all had the exact same result.


The ways I tried to change the database encoding to UTF-8:



  1. Use the UTF-8 Database Converter plugin

  2. Follow this guide to export the database, replace all instances of "latin1" with "UTF8"

  3. Use a SQL script to convert tables and columns to blob, and then to UTF-8 text (detailed here)

  4. Use a SQL script to recast tables and columns to the datatype they contain, then to blob, and then to UTF-8 text (detailed here)


Expected results:


My site appears the same as it was above, with all themes and settings intact, but now displaying Japanese characters properly.


Actual results for all of the above methods:


Japanese still doesn't display correctly. What's more, database entries end abruptly; for instance, some entries in post_content are missing some or most of their original content. Custom shortcodes, defined by the Shortcoder plugin and stored in the shortcoder_data row in wp_options, are broken because the shortcoder_data entry got abruptly truncated. My theme options, including custom CSS and fonts, appear to have been reset or damaged, most likely due to similar abrupt truncation of database entries.


Luckily, I had the foresight to make all these changes on a duplicate database, so I have a backup with all my data intact.


When I compare the altered data in post_content to the original, I notice something: nearly all of the truncated strings begin with a special character. For instance, a post that once read:



Today it was a pleasant 72° and sunny.



will, in the altered database, read:



Today it was a pleasant 72



I haven't gone through and found all of my truncated posts--I know nothing about mySQL, so I'd have to do that by hand, and that would be an exercise in patience. However, out of a sample of 8 posts that got truncated, 6 of them were clearly truncated at a special character.


What do I need to do to properly convert my database so that Japanese characters display correctly without causing this data loss--or, barring a complete solution, what can I do to properly diagnose what's going on?


Thank you.


Update: Extra information


A few more things.


I used to have a lot of posts on my blog that displayed strings like I’m instead of I’m, naïve instead of naïve. Like I mentioned above, Japanese was displaying as long strings like 日本語 instead of 日本語. I went through and replaced these strings where I saw them, replacing ï with a proper ï (for instance).


I didn't catch all of them, however, and there are a few posts in my database that still have naïve instead of naïve.


When I look at those posts on the databases I've altered... they display correctly. They don't truncate. All of the garbled characters have seamlessly translated into their "proper" equivalents. Even the Japanese converted.


In the posts where I went back and "corrected" the garbled characters, however, where I have naïve and not naïve, the content in the database gets truncated on import as described above.





Aucun commentaire:

Enregistrer un commentaire