William D Clinger scripsit:
> * If t1 and t2 are transcoders, then their composition
> is defined by describing their composition in both
> the input and output directions. In the input
> direction, their composition is t1input followed by
> t2output followed by t2input. For output, their
> composition is t2output followed by t2input followed
> by t1output.
This is very confusing. Let t1 be the binary transcoder, and let t2
be the {utf-8-codec, lf, raise} transcoder. Differences other than
encoding drop out. On input, t1input maps bytes to characters in the
Latin-1 repertoire. t2output maps these characters to one-byte or
two-byte sequences; t2input then maps those sequences back to the same
characters. Where does full UTF-8 decoding get done? I would expect
rather that the sequence is t1input followed by t1output (which together
are the identity on bytes) followed by t2input.
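
To make the objection concrete, here is a rough Python sketch. It
assumes the binary transcoder can be modeled as Latin-1 and t2 as plain
UTF-8, and it ignores line endings and error handling. The function
names are mine, not anything from the draft.

```python
# Assumption: the binary transcoder behaves like Latin-1 (one byte <-> one
# character), and t2 is plain UTF-8. These helper names are illustrative only.
def t1_input(data: bytes) -> str:
    return data.decode("latin-1")

def t1_output(text: str) -> bytes:
    return text.encode("latin-1")

def t2_input(data: bytes) -> str:
    return data.decode("utf-8")

def t2_output(text: str) -> bytes:
    return text.encode("utf-8")

# The UTF-8 encoding of U+20AC EURO SIGN: three bytes, E2 82 AC.
raw = "\u20ac".encode("utf-8")

# Composition as quoted: t1input, then t2output, then t2input.
# The UTF-8 round trip cancels out, leaving three Latin-1 characters.
as_quoted = t2_input(t2_output(t1_input(raw)))

# Composition as I would expect: t1input, then t1output, then t2input.
# The Latin-1 round trip cancels out, so the bytes are decoded as UTF-8.
as_expected = t2_input(t1_output(t1_input(raw)))
```

Under these assumptions the quoted definition never decodes the UTF-8
sequence at all; only the second ordering yields the single character
U+20AC.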
> The rationale for this definition of composition is that
> it adds a new layer of transcoding onto the existing layer,
> instead of replacing the existing layer. That allows
> some of the weirder, file-at-a-time transcodings.
Other than composing the binary transcoder with an ordinary one, what are
the actual use cases for this? I don't know what you mean by
"file-at-a-time"; transcoders require varying amounts of state from none (UTF-8)
to one bit (UTF-16) to a few bytes (full ISO 2022), but I know of no
encodings in which a byte can retroactively change the interpretation
of bytes that have already been transcoded.
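
As a small illustration of the point about state, the following Python
sketch feeds an incremental decoder one byte at a time and only ever
appends to its output. It uses the standard codecs module; the helper
name decode_in_chunks and the sample string are my own choices.

```python
import codecs

def decode_in_chunks(data: bytes, encoding: str, chunk_size: int = 1) -> str:
    """Decode data piecewise; text already produced is never revised."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for i in range(0, len(data), chunk_size):
        # The decoder keeps whatever state it needs between calls
        # (a partial sequence, or the byte order learned from a BOM).
        pieces.append(decoder.decode(data[i:i + chunk_size]))
    # Flush: raises if the input ended in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

text = "na\u00efve \u20ac"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")  # Python prepends a byte-order mark

utf8_result = decode_in_chunks(utf8_bytes, "utf-8")
utf16_result = decode_in_chunks(utf16_bytes, "utf-16")
```

Both results equal the original string, which is what one would expect
if no later byte can change the interpretation of earlier ones.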
> In practice, t1 will usually be the binary transcoder, and t2output
> followed by t2input will be the identity restricted to some subset of
> the Unicode characters.
Note, however, that there are transcoders which accept certain characters
for output that they never generate on input; they conflate multiple
Unicode characters into the same external encoding.
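
A toy example of such conflation, in Python. This is a made-up
transcoder, not any real codec: on output it accepts curly quotation
marks and writes them as the ASCII apostrophe, and on input it decodes
plain ASCII.

```python
# Hypothetical transcoder for illustration only: several Unicode
# characters share one external byte on output.
CONFLATED = {
    "\u2018": 0x27,  # LEFT SINGLE QUOTATION MARK  -> apostrophe
    "\u2019": 0x27,  # RIGHT SINGLE QUOTATION MARK -> apostrophe
}

def toy_output(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        if ch in CONFLATED:
            out.append(CONFLATED[ch])
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError(f"cannot encode {ch!r}")
    return bytes(out)

def toy_input(data: bytes) -> str:
    # Input never produces U+2018 or U+2019, only U+0027.
    return data.decode("ascii")

original = "\u2018quoted\u2019"
round_trip = toy_input(toy_output(original))
```

Here the output direction followed by the input direction is not the
identity on the curly quotes, even though both are accepted for output.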
--
Business before pleasure, if not too bloomering long before.
--Nicholas van Rijn
John Cowan <cowan_at_ccil.org>
http://www.ccil.org/~cowan
Received on Sat Nov 04 2006 - 13:26:22 UTC