LISTSERV 16.5 - CODE4LIB Archives

  Ere Maijala writes

> On 7.10.2010 15:17, Thomas Krichel wrote:

  ...

> >use Encode;
> >use Encode::Guess qw/latin-1/;
> >$decoded=decode("Guess", $dodgy_input);
> >
> >   $decoded then should be a utf-8 string with utf8 flag on.
>
> Would that work for a predominantly proper utf-8 input with some
> "mistakes" thrown in?

  It will try to guess between UTF-8 and ISO-8859-1. This can be done
  because many byte sequences are invalid in UTF-8, so Latin-1 text
  with accented characters rarely passes as valid UTF-8. But if you
  wanted to guess between, say, ISO-8859-1 and ISO-8859-2, you'd be
  out of luck, since every byte string is valid in both. The module
  seems to do a good job for me.
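
  To illustrate the principle only, here is a minimal sketch that does
  not use Encode::Guess at all: it tries a strict UTF-8 decode first
  and falls back to ISO-8859-1 when that fails. The helper name
  decode_utf8_or_latin1 and the sample strings are made up for this
  example; this is not the code from my robot.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode::Guess): strict UTF-8 first,
# ISO-8859-1 as the fallback.
sub decode_utf8_or_latin1 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; LEAVE_SRC keeps
    # $octets intact so the fallback sees the original bytes.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    # ISO-8859-1 assigns a character to every byte, so this cannot fail.
    return decode('ISO-8859-1', $octets);
}

my $from_latin1 = decode_utf8_or_latin1("caf\xE9 au lait");      # Latin-1 bytes
my $from_utf8   = decode_utf8_or_latin1("caf\xC3\xA9 au lait");  # UTF-8 bytes
# Both should now hold the same character string, with U+00E9 for the e-acute.

# The same trick cannot separate ISO-8859-1 from ISO-8859-2: byte 0xF5
# decodes without error in both, just to different characters.
my $as_latin1 = decode('ISO-8859-1', "\xF5");   # U+00F5
my $as_latin2 = decode('ISO-8859-2', "\xF5");   # U+0151
```

  Note that this sketch decides once per string: input that is mostly
  UTF-8 with a single stray Latin-1 byte would be decoded as Latin-1
  as a whole.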

  I use it for a robot on CrossRef's sigg API. The engine is reliable,
  but the data there is poorly character coded and marked up. I'd be
  happy to share the robot with anyone who wants to go out there and
  get the character creeps. After all, we have Halloween coming up. ;-)


  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                http://authorclaim.org/profile/pkr1
                                               skype: thomaskrichel