LISTSERV 16.5 - CODE4LIB Archives

Thomas Krichel wrote:
>   Ere Maijala writes
> 
>> On 7.10.2010 15:17, Thomas Krichel wrote:
> 
>   ...
> 
>>> use Encode::Guess qw/latin-1/;
>>> $decoded=decode("Guess", $dodgy_input);
>>>
>>>   $decoded then should be a utf-8 string with utf8 flag on.
>> Would that work for a predominantly proper utf-8 input with some
>> "mistakes" thrown in?
> 
>   It will try to guess between UTF-8 and ISO-8859-1. This can be done
>   because UTF-8 has many invalid byte sequences.  But say if you
>   wanted to guess between ISO-8859-1 and ISO-8859-2, you'd be out of
>   luck. 

Not necessarily.

Tools such as TextCat (http://www.let.rug.nl/~vannoord/TextCat/) guess
the language of a text very reliably. With a minor adaptation you could
make it guess twice, decoding the input once as ISO-8859-1 and once as
ISO-8859-2, and then take the higher-ranked result.
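
A rough sketch of the guess-twice idea in Perl, using only the core
Encode module. It does not call TextCat. The %typical table and the
score() and guess_encoding() helpers are hypothetical stand-ins for a
real n-gram language model, so treat it as an illustration of the
approach rather than a working detector:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a language model: accented letters that are
# common in languages usually written in each encoding. A TextCat-style
# tool would compare n-gram profiles instead.
my %typical = (
    # French/German/Spanish accented letters
    'iso-8859-1' => "\x{E0}\x{E2}\x{E7}\x{E8}\x{E9}\x{EA}\x{EE}\x{F4}"
                  . "\x{F9}\x{FB}\x{E4}\x{F6}\x{FC}\x{DF}\x{F1}",
    # Polish/Czech accented letters
    'iso-8859-2' => "\x{105}\x{107}\x{119}\x{142}\x{144}\x{F3}\x{15B}"
                  . "\x{17A}\x{17C}\x{10D}\x{11B}\x{159}\x{161}\x{17E}\x{16F}",
);

# Decode the bytes under one candidate encoding and count how many of
# the resulting non-ASCII characters look plausible for that encoding.
sub score {
    my ($bytes, $enc) = @_;
    my $text = decode($enc, $bytes);
    my $hits = 0;
    for my $ch ($text =~ /([^\x00-\x7F])/g) {
        $hits++ if index($typical{$enc}, $ch) >= 0;
    }
    return $hits;
}

# Guess once per candidate encoding and keep the highest-scoring one
# (ties are broken alphabetically so the result is deterministic).
sub guess_encoding {
    my ($bytes) = @_;
    my ($best) = sort { score($bytes, $b) <=> score($bytes, $a) || $a cmp $b }
                 keys %typical;
    return $best;
}

# A Polish pangram encoded as ISO-8859-2 bytes.
my $polish = "za\x{BF}\x{F3}\x{B3}\x{E6} g\x{EA}\x{B6}l\x{B1} ja\x{BC}\x{F1}";
# A French phrase encoded as ISO-8859-1 bytes.
my $french = "d\x{E9}j\x{E0} re\x{E7}u, tr\x{E8}s f\x{E2}ch\x{E9}";

print guess_encoding($polish), "\n";
print guess_encoding($french), "\n";
```

With only a handful of characters per encoding this will misfire on
short or mixed input; swapping score() for a proper language ranking
is where a tool like TextCat would come in.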

cheers
stuart
-- 
Stuart Yeates
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository