Thomas Krichel wrote:
> Ere Maijala writes
>
>> On 7.10.2010 15:17, Thomas Krichel wrote:
>
> ...
>
>>> use Encode::Guess qw/latin-1/;
>>> $decoded=decode("Guess", $dodgy_input);
>>>
>>> $decoded then should be a utf-8 string with utf8 flag on.
>> Would that work for a predominantly proper utf-8 input with some
>> "mistakes" thrown in?
>
> It will try to guess between UTF-8 and ISO-8859-1. This can be done
> because UTF-8 has many invalid byte sequences. But say if you
> wanted to guess between ISO-8859-1 and ISO-8859-2, you'd be out of
> luck.

Not necessarily. There are tools such as

http://www.let.rug.nl/~vannoord/TextCat/

which provide very reliable guessing of languages. A minor adaptation
might be needed to make it guess twice, once for each of ISO-8859-1
and ISO-8859-2 and then you take the highest ranked.

cheers
stuart
-- 
Stuart Yeates
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository
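P.S. A rough sketch of the "guess twice and take the highest ranked"
idea, in case it helps. This is not TextCat itself and does not use its
models or interface: the two sample strings, the trigram scoring, and
the names %sample_for, trigrams and guess_charset are all made up for
illustration. A real setup would rank against proper language models
built from large corpora.

```perl
use strict;
use warnings;
use utf8;                       # the sample strings below are UTF-8 literals
use Encode qw(decode encode);

# Toy reference samples, one per candidate charset (an assumption for this
# sketch: French stands in for ISO-8859-1, Polish for ISO-8859-2).
my %sample_for = (
    'iso-8859-1' => 'le garçon a mangé une pêche près de la forêt à noël',
    'iso-8859-2' => 'zażółć gęślą jaźń, mały żółw śpi pod łóżkiem',
);

# Return a hash set of the character trigrams found in a string.
sub trigrams {
    my ($text) = @_;
    my %seen;
    $seen{ substr($text, $_, 3) } = 1 for 0 .. length($text) - 3;
    return \%seen;
}

# Decode the bytes once per candidate charset, score each decoding by the
# share of its trigrams that also occur in that charset's sample, and
# return the charset with the highest score.
sub guess_charset {
    my ($bytes) = @_;
    my ($best, $best_score) = (undef, -1);
    for my $charset (sort keys %sample_for) {
        my $text  = lc decode($charset, $bytes);
        my $known = trigrams($sample_for{$charset});
        my $found = trigrams($text);
        my $total = keys %$found;
        my $hits  = grep { $known->{$_} } keys %$found;
        my $score = $total ? $hits / $total : 0;
        ($best, $best_score) = ($charset, $score) if $score > $best_score;
    }
    return $best;
}

# Usage: build test bytes in a known charset, then guess it back.
print guess_charset(encode('iso-8859-2', 'mały żółw')), "\n";   # iso-8859-2
print guess_charset(encode('iso-8859-1', 'une pêche')), "\n";   # iso-8859-1
```

With samples this small the score is only meaningful for text that
overlaps them; the point is the shape of the loop (decode under each
candidate, rank, keep the winner), not the quality of the model.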