use strict;
my $Canon = 'Guess';
bless {
Suspects => { %DEF_SUSPECTS },
} => __PACKAGE__;
our @EXPORT = qw(guess_encoding);
our $NoUTFAutoGuess = 0;
sub import { # Exporter not used so we do it on our own
my $callpkg = caller;
no strict 'refs';
*{"$callpkg\::$item"} = \&{"$item"};
}
set_suspects(@_);
}
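sub set_suspects{
my $class = shift;
# accept class or instance calls; fall back to the registered Guess object
my $self = ref($class) ? $class : $Encode::Encoding{$Canon};
$self->{Suspects} = {}; # flush the current suspects list
$self->add_suspects(@_);
}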
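sub add_suspects{
my $class = shift;
my $self = ref($class) ? $class : $Encode::Encoding{$Canon};
for my $c (@_){
my $e = find_encoding($c) or die "Unknown encoding: $c";
$self->{Suspects}{$e->name} = $e; # register under the canonical name
}
}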
sub decode($$;$){
unless (ref($guessed)){
require Carp;
}
return $utf8;
}
sub guess_encoding{
}
sub guess {
my $class = shift;
my $octet = shift;
# sanity check
# cheat 0: utf8 flag;
}
# cheat 1: BOM
unless ($NoUTFAutoGuess) {
return find_encoding('UTF-16')
return find_encoding('UTF-32')
my $utf;
$utf = "UTF-32";
}
}else{ # UTF-16(BE|LE) assumed
$utf = "UTF-16";
}
}
DEBUG and warn "$utf, be == $be, le == $le";
and return
"Encodings ambiguous between $utf BE and LE ($be, $le)";
return find_encoding($utf);
}
}
for my $c (@_){
my $e = find_encoding($c) or die "Unknown encoding: $c";
}
my $nline = 1;
# cheat 2 -- \e in the string
if ($line =~ /\e/o){
for my $k (@keys){
}
}
# warn join(",", keys %try);
for my $k (keys %try){
if ($scratch eq ''){
}else{
use bytes ();
DEBUG and
warn sprintf("%4d:%-24s not ok; %d bytes left\n",
delete $ok{$k};
}
}
%ok or return "No appropriate encodings found!";
if (scalar(keys(%ok)) == 1){
return $retval;
}
}
}
1;
=head1 NAME
Encode::Guess -- Guesses encoding from data
=head1 SYNOPSIS
# if you are sure $data won't contain anything bogus
use Encode;
my $utf8 = decode("Guess", $data);
my $data = encode("Guess", $utf8); # this doesn't work!
# more elaborate way
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
ref($enc) or die "Can't guess: $enc"; # trap error this way
$utf8 = $enc->decode($data);
# or
$utf8 = decode($enc->name, $data);
=head1 ABSTRACT
Encode::Guess guesses the encoding of the given data, or at least
tries to.
=head1 DESCRIPTION
By default, it checks only ascii, utf8 and UTF-16/32 with BOM.
To make it useful in practice, you have to supply the names of the
encodings to check (I<suspects>), as follows. Suspect names can be
either canonical names or aliases.
# tries all major Japanese Encodings as well
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
value, no heuristics will be applied to UTF-8/16/32, and the result
will be limited to the suspects and C<ascii>.
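
The effect of the flag can be observed with input that starts with a
UTF-16 byte order mark. The following is only a minimal sketch: the
sample bytes and variable names are illustrative assumptions, and
C<local> is used so that the change stays inside one block.

  use strict;
  use warnings;
  use Encode::Guess;

  # "ab" in UTF-16BE, preceded by a byte order mark (sample data)
  my $data = "\xFE\xFF\x00a\x00b";

  # default: BOM detection is enabled
  my $with_bom_check = guess_encoding($data);

  # BOM detection disabled for this block only
  my $without_bom_check = do {
      local $Encode::Guess::NoUTFAutoGuess = 1;
      guess_encoding($data);
  };

  for my $result ($with_bom_check, $without_bom_check) {
      my $msg = ref($result) ? $result->name : "failed: $result";
      print "$msg\n";
  }
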
=over 4
=item Encode::Guess->set_suspects
You can also change the internal suspects list via the
C<set_suspects> method.
use Encode::Guess;
Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);
=item Encode::Guess->add_suspects
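Or you can use the C<add_suspects> method. The difference is that
C<set_suspects> flushes the current suspects list first, while
C<add_suspects> appends to it.
use Encode::Guess;
Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);
Encode::Guess->add_suspects(qw/euc-kr euc-cn big5-eten/);
# now the suspects are euc-jp, shiftjis, 7bit-jis, AND
# euc-kr, euc-cn, and big5-eten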
=item Encode::decode("Guess" ...)
Once you are content with the suspects list, you can simply do this:
my $utf8 = Encode::decode("Guess", $data);
=item Encode::Guess->guess($data)
But C<Encode::decode("Guess", ...)> will croak if:
=over
=item *
Two or more suspects remain
=item *
No suspects left
=back
So you should try this instead:
my $decoder = Encode::Guess->guess($data);
On success, C<$decoder> is an encoding object as documented in
L<Encode::Encoding>, so you can now do this:
my $utf8 = $decoder->decode($data);
On failure, C<$decoder> contains an error message instead of an
object, so the whole thing looks like this:
my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
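
When dying is not acceptable, the same check can select a fallback
encoding instead. This is a sketch under stated assumptions:
C<decode_with_fallback> is a hypothetical helper that is not part of
this module, and the choice of fallback encoding is up to the caller.

  use strict;
  use warnings;
  use Encode qw(decode);
  use Encode::Guess;

  # hypothetical helper: guess first, decode with $fallback otherwise
  sub decode_with_fallback {
      my ($octets, $fallback) = @_;
      my $decoder = Encode::Guess->guess($octets);
      # a plain string (not a reference) means the guess failed
      return ref($decoder)
          ? $decoder->decode($octets)
          : decode($fallback, $octets);
  }

  # \xE9 is neither ascii nor valid utf8, so the fallback is used
  my $text = decode_with_fallback("caf\xE9", 'iso-8859-1');
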
=item guess_encoding($data [, I<list of suspects>])
You can also use the C<guess_encoding> function, which is exported by
default. It takes the C<$data> to check and, optionally, a list of
suspects. The optional suspect list applies to that call only; it is
I<not> added to the internal suspects list.
my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
# check only ascii, utf8 and UTF-16/32 with BOM
$decoder = guess_encoding($data);
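
That the per-call suspects leave the internal list untouched can be
checked with two consecutive calls. A minimal sketch, assuming the
internal list still holds only the defaults; the sample bytes are an
illustrative EUC-JP string.

  use strict;
  use warnings;
  use Encode::Guess;

  # a Japanese word encoded in EUC-JP (sample data)
  my $data = "\xC6\xFC\xCB\xDC\xB8\xEC";

  # euc-jp is a suspect for this call only
  my $first = guess_encoding($data, 'euc-jp');

  # only the default suspects are tried here
  my $second = guess_encoding($data);

  for my $result ($first, $second) {
      my $msg = ref($result) ? $result->name : "failed: $result";
      print "$msg\n";
  }
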
=back
=head1 CAVEATS
=over 4
=item *
Because of the algorithm used, the ISO-8859 series and other
single-byte encodings do not work well unless exactly one of them is
the only suspect (besides ascii and utf8).
use Encode::Guess;
# perhaps ok
my $decoder = guess_encoding($data, 'latin1');
# definitely NOT ok
my $decoder = guess_encoding($data, qw/latin1 greek/);
The reason is that Encode::Guess guesses the encoding by trial and
error. It first splits C<$data> into lines and tries to decode each
line with every suspect. It keeps going until all but one encoding has
been eliminated from the suspects list. The ISO-8859 series succeeds
too often to be eliminated (because it assigns a character to almost
every code point in \x00-\xff).
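
The elimination loop can be sketched as follows. This is a simplified
illustration, not the module's actual code: C<surviving_suspects> is a
hypothetical name, and the BOM and escape-sequence shortcuts are left
out.

  use strict;
  use warnings;
  use Encode qw(find_encoding);

  # hypothetical helper: returns the suspects that decode every line
  sub surviving_suspects {
      my ($octets, @names) = @_;
      my %ok = map { $_ => find_encoding($_) } @names;
      for my $line (split /\r\n?|\n/, $octets) {
          for my $name (keys %ok) {
              my $scratch = $line;    # FB_QUIET consumes the input in place
              $ok{$name}->decode($scratch, Encode::FB_QUIET());
              # leftover bytes mean this suspect could not decode the line
              delete $ok{$name} if length $scratch;
          }
          last if keys(%ok) <= 1;
      }
      return sort keys %ok;
  }

  # 0xE9 is a valid character in both ISO-8859-1 and ISO-8859-7
  my @left = surviving_suspects("caf\xE9\n", qw(ascii iso-8859-1 iso-8859-7));
  print "still possible: @left\n";

Only ascii is eliminated here; both ISO-8859 suspects decode the line
without leftovers, which is exactly the ambiguity described above.
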
=item *
Do not mix national standard encodings and the corresponding vendor
encodings.
# a very bad idea
my $decoder
= guess_encoding($data, qw/shiftjis MacJapanese cp932/);
The reason is that a vendor encoding is usually a superset of the
national standard, so the result is too ambiguous in most cases.
=item *
On the other hand, mixing various national standard encodings
automagically works unless $data is too short to allow for guessing.
# This is ok if $data is long enough
my $decoder =
guess_encoding($data, qw/euc-cn
euc-jp shiftjis 7bit-jis
euc-kr
big5-eten/);
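
What "too short" means can be illustrated with a two-byte input. A
minimal sketch; the sample bytes are an assumption chosen because they
form one double-byte character in euc-jp and two single-byte katakana
in shiftjis, so neither suspect can be ruled out.

  use strict;
  use warnings;
  use Encode::Guess;

  # valid in both euc-jp and shiftjis (sample data)
  my $short = "\xB1\xB1";
  my $result = guess_encoding($short, qw/euc-jp shiftjis/);

  my $msg = ref($result) ? $result->name : "failed: $result";
  print "$msg\n";
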
=item *
DO NOT PUT TOO MANY SUSPECTS! Don't you try something like this!
my $decoder = guess_encoding($data,
Encode->encodings(":all"));
=back
It is, after all, just a guess. You should always be explicit when it
comes to encodings. But there are some environments, especially
Japanese ones, where guessing the encoding is a must. Use this module
with care.
=head1 TO DO
Encode::Guess does not work on EBCDIC platforms.
=head1 SEE ALSO
L<Encode>, L<Encode::Encoding>
=cut