Ere Maijala writes
> On 7.10.2010 15:17, Thomas Krichel wrote:
...
> >use Encode::Guess qw/latin-1/;
> >$decoded=decode("Guess", $dodgy_input);
> >
> > $decoded then should be a utf-8 string with utf8 flag on.
>
> Would that work for a predominantly proper utf-8 input with some
> "mistakes" thrown in?
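It will try to guess between UTF-8 and ISO-8859-1. This can be done
because many byte sequences are invalid in UTF-8, so ISO-8859-1 text
with accented characters rarely passes as UTF-8. But if you wanted to
guess between, say, ISO-8859-1 and ISO-8859-2, you'd be out of luck:
both are single-byte encodings, so essentially any byte sequence is
valid in either. The module has done a good job for me.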
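
For mostly proper UTF-8 with a few "mistakes", here is a minimal sketch
of one possible approach. As far as I understand Encode::Guess, the guess
applies to the whole string you hand it, so a single stray ISO-8859-1
byte would make the entire string decode as ISO-8859-1. Guessing line by
line confines the damage to the offending lines. The helper name
decode_lines_guessing and the ISO-8859-1 fallback are my own assumptions
for illustration, not part of the module.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding() by default

# Hypothetical helper, not part of Encode: decode a byte string line by
# line, guessing UTF-8 versus ISO-8859-1 separately for each line.
sub decode_lines_guessing {
    my ($bytes) = @_;
    my @decoded;
    for my $line (split /\n/, $bytes, -1) {
        # guess_encoding() returns an encoding object on success and a
        # plain error string otherwise (for example on an empty line).
        my $enc = guess_encoding($line, 'iso-8859-1');
        push @decoded, ref $enc
            ? $enc->decode($line)
            : decode('iso-8859-1', $line);    # assumed fallback
    }
    return join "\n", @decoded;
}

# One line of valid UTF-8, one line with a stray ISO-8859-1 byte.
my $dodgy_input = "caf\xC3\xA9\ncaf\xE9 au lait";
my $decoded     = decode_lines_guessing($dodgy_input);
```

Both lines should come out with a proper e-acute. A line that mixes
valid UTF-8 and stray ISO-8859-1 bytes would still be decoded wholly as
ISO-8859-1, so this only helps when the mistakes are line-sized.
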
I use it for a robot on CrossRef's sigg API. The engine is reliable,
but the data there is poorly character-encoded and marked up. I'd be
happy to share the robot with anyone who wants to go out there and get
the character creeps. After all, we have Halloween coming up. ;-)
Cheers,
Thomas Krichel http://openlib.org/home/krichel
http://authorclaim.org/profile/pkr1
skype: thomaskrichel