Data structures for encoding transformations.

Perl works internally in either a native 'byte' encoding or
in UTF-8 encoded Unicode. We have no immediate need for a "wchar_t"
representation. When we do, we can use utf8_to_uv().

Most character encodings are either simple byte mappings or
variable-length multi-byte encodings. UTF-8 can be viewed as a
rather extreme case of the latter.

So to solve an important part of Perl's encode needs we need to solve the
"multi-byte -> multi-byte" case. The simple byte forms are then just a
degenerate case. (One of the multi-byte encodings will usually be UTF-8.)

The other type of encoding is a shift encoding, where a prefix sequence
determines what subsequent bytes mean. Such encodings have state.

We also need to handle the case where a character in one encoding has to be
represented as multiple characters in the other, e.g. letter+diacritic.

The process can be considered as pseudo-Perl:

 my $dst = '';
 while (length($src)) {
   my $size    = src_count($src);
   my $in_seq  = substr($src,0,$size,'');
   my $out_seq = $s2d_hash{$in_seq};
   if (defined $out_seq) { $dst .= $out_seq }
   else {
     # an error condition
   }
 }
 return $dst;

That has the following components:
 &src_count - a "rule" for how many bytes make up the next character in the
              source.
 %s2d_hash  - a mapping from input sequences to output sequences

The problem with that scheme is that it does not allow the output
character repertoire to affect the characters considered from the
input.

So we use a "trie" representation, which can also be considered
a state machine:

 my $dst  = '';
 my $seq  = \@s2d_seq;
 my $next = \@s2d_next;
 while (length($src)) {
   my $byte    = ord(substr($src,0,1,''));
   my $out_seq = $seq->[$byte];
   if (defined $out_seq) { $dst .= $out_seq }
   else {
     # an error condition
   }
   ($next,$seq) = @{ $next->[$byte] } if $next;
 }
 return $dst;

There is now a pair of data structures to represent everything.
It is valid for the output sequence at a particular point to
be defined but zero length; that just means "don't know yet".
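
As an illustration only, the trie walk above might be sketched in C as
follows. Every name here (entry_t, page_t, trie_encode, DEMO_BASE,
DEMO_PAGE_C3) is hypothetical, and the convention that a NULL next pointer
means "go back to the base page" is an assumption made to keep the demo
tables small; none of it is taken from the real engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not the real engine's structures. */
typedef struct page page_t;

typedef struct {
    const char   *out;   /* NULL: no mapping; "": nothing to emit yet */
    const page_t *next;  /* page for the following byte; NULL: base page
                            (assumed convention) */
} entry_t;

struct page {
    entry_t e[256];      /* one entry per possible input byte */
};

/* Demo tables: 'a' -> "A", 'b' -> "B", two-byte input C3 A9 -> "e". */
static const page_t DEMO_PAGE_C3 = { .e = { [0xA9] = { "e", NULL } } };
static const page_t DEMO_BASE = { .e = {
    ['a']  = { "A", NULL },
    ['b']  = { "B", NULL },
    [0xC3] = { "",  &DEMO_PAGE_C3 },  /* prefix byte: flip page, emit nothing */
} };

/* Walks src one byte at a time through the pages.
   Returns the number of bytes written (dst is NUL-terminated), or -1 if a
   byte has no mapping, dst is too small, or src ends inside a character. */
static long trie_encode(const page_t *base,
                        const unsigned char *src, size_t slen,
                        char *dst, size_t dcap)
{
    const page_t *page = base;
    size_t dlen = 0;

    assert(base != NULL && dst != NULL);
    if (dcap == 0)
        return -1;

    for (size_t i = 0; i < slen; i++) {
        const entry_t *en = &page->e[src[i]];
        if (en->out == NULL)
            return -1;                      /* cannot represent */
        size_t n = strlen(en->out);
        if (dlen + n + 1 > dcap)
            return -1;                      /* no room for output plus NUL */
        memcpy(dst + dlen, en->out, n);
        dlen += n;
        page = en->next ? en->next : base;  /* flip page or return to base */
    }
    if (page != base)
        return -1;                          /* partial source character */

    dst[dlen] = '\0';
    return (long)dlen;
}
```

With these demo tables the four input bytes 'a', 0xC3, 0xA9, 'b' come out as
"AeB"; the prefix byte 0xC3 contributes nothing by itself.
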
For the single-byte case there is no 'next', so the new tables will be the same as
the original tables. For a multi-byte case a prefix byte will flip to the tables
for the next page (adding nothing to the output), then the tables for that page
will provide the actual output and set the tables back to the original base page.

This scheme can also handle shift encodings.

A slight enhancement to the scheme also allows for look-ahead: if
we add a flag to re-add the removed byte to the source we could handle

  ab -> a (and take b back please)

/* partial source character */
/* Cannot represent */
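
The look-ahead enhancement described above could be sketched like this,
again purely as an assumption-laden illustration. The names la_entry_t,
la_page_t, la_build_demo and la_encode are hypothetical, as is the unget
flag and the demo mapping (an 'a' followed by a double quote becomes UTF-8
"ä", while any other byte after 'a' emits a plain "a" and is taken back).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; the unget flag is the "take the byte back" marker. */
typedef struct la_page la_page_t;

typedef struct {
    const char      *out;    /* NULL: no mapping; "": nothing to emit yet */
    const la_page_t *next;   /* NULL: return to base page (assumed convention) */
    int              unget;  /* non-zero: re-read this byte on the next page */
} la_entry_t;

struct la_page {
    la_entry_t e[256];
};

/* Fills two caller-supplied pages with the demo mapping. */
static void la_build_demo(la_page_t *base, la_page_t *after_a)
{
    for (int c = 0; c < 256; c++) {
        base->e[c]    = (la_entry_t){ NULL, NULL, 0 };
        after_a->e[c] = (la_entry_t){ "a", NULL, 1 };   /* emit "a", take byte back */
    }
    after_a->e['"'] = (la_entry_t){ "\xC3\xA4", NULL, 0 }; /* a" -> UTF-8 a-umlaut */
    base->e['a']    = (la_entry_t){ "", after_a, 0 };      /* wait for next byte */
    base->e['b']    = (la_entry_t){ "b", NULL, 0 };
}

/* Returns bytes written (dst is NUL-terminated), or -1 on error, including
   input that ends while a look-ahead decision is still pending. */
static long la_encode(const la_page_t *base,
                      const unsigned char *src, size_t slen,
                      char *dst, size_t dcap)
{
    const la_page_t *page = base;
    size_t dlen = 0, i = 0;
    int replayed = 0;   /* set while the current byte has been taken back */

    assert(base != NULL && dst != NULL);
    if (dcap == 0)
        return -1;

    while (i < slen) {
        const la_entry_t *en = &page->e[src[i]];
        if (en->out == NULL)
            return -1;                      /* cannot represent */
        size_t n = strlen(en->out);
        if (dlen + n + 1 > dcap)
            return -1;
        memcpy(dst + dlen, en->out, n);
        dlen += n;
        if (en->unget) {
            if (replayed)
                return -1;                  /* table error: byte taken back twice */
            replayed = 1;                   /* do not advance; re-read the byte */
        } else {
            replayed = 0;
            i++;
        }
        page = en->next ? en->next : base;
    }
    if (page != base)
        return -1;                          /* partial source character */

    dst[dlen] = '\0';
    return (long)dlen;
}
```

After la_build_demo(), the input "ab" passes through unchanged: the 'b' is
first seen on the page after 'a', which emits "a" and hands the 'b' back to
the base page. The guard against taking the same byte back twice is an added
safety check so that a malformed table cannot loop forever.
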