unidecode_php-0.3.tar.gz

Dec 18, 2008 23:39

I recently wrote a conversion script and PHP wrapper so that the data from the Perl "last-chance transliterator" Text::Unidecode by Sean M. Burke can be used from PHP: unidecode_php-0.3.tar.gz. To use it, you'll need to install the Perl Text::Unidecode module and then run the udec2bin.pl script included in the unidecode_php package.

unidecode, unicode, php


Comments 2

A little bug anonymous February 24 2010, 12:37:20 UTC
A nice wrapper around Text::Unidecode!

There's a little bug, though: the code initially converts the input into UCS-4BE (which is outdated and can be replaced by UTF-32), so every character becomes 4 bytes long. The function _unidecode_codepoint(), however, treats characters as 2 bytes long.

Looking at the original Perl code, it becomes pretty clear that the function handles two-byte Unicode characters, that is, UTF-16 (or UCS-2BE).

It would therefore be appropriate to convert the input to UTF-16 instead of UCS-4BE.

Best regards,

Gerd


Re: A little bug bsittler February 25 2010, 01:13:09 UTC
The reason for that is to properly turn non-BMP characters into "[?]", rather than the incorrect "[?][?]", without lots of extra complexity. At least at the time I wrote that code (maybe it's different now?) Text::Unidecode didn't have any data for non-BMP characters anyhow.
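
To illustrate the point about "[?]" versus "[?][?]", here is a minimal sketch in Python rather than PHP. It is not the code from unidecode_php: the helper names code_units and placeholder_transliterate are hypothetical, and the "transliteration" is a stand-in that passes ASCII through and maps everything else to "[?]". It only shows how the number of code units differs between UTF-16 and UTF-32 for a non-BMP character.

```python
import struct


def code_units(text, encoding, width):
    """Encode text and split it into fixed-width big-endian code units."""
    data = text.encode(encoding)
    count = len(data) // width
    # "H" is an unsigned 16-bit unit, "I" an unsigned 32-bit unit.
    fmt = ">%d%s" % (count, "H" if width == 2 else "I")
    return list(struct.unpack(fmt, data))


def placeholder_transliterate(units):
    """Hypothetical stand-in for a table lookup.

    ASCII units pass through unchanged; every other unit becomes "[?]".
    """
    return "".join(chr(u) if u < 0x80 else "[?]" for u in units)


# "a" followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
sample = "a\U0001D11E"

utf16_units = code_units(sample, "utf-16-be", 2)  # surrogate pair: two units
utf32_units = code_units(sample, "utf-32-be", 4)  # one unit per code point

print(placeholder_transliterate(utf16_units))  # a[?][?]
print(placeholder_transliterate(utf32_units))  # a[?]
```

With 16-bit units the non-BMP character arrives as a surrogate pair, so a per-unit lookup emits two placeholders. With 32-bit units each code point is a single unit, so only one placeholder is emitted and no surrogate handling is needed.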
