[I18n-sig] UTF-8 decoder in CVS still buggy
Fredrik Lundh <effbot@telia.com>
Sat, 2 Sep 2000 18:30:56 +0200
François Pinard wrote:
> Hi, people. I just recently subscribed to i18n-sig, and started to
> read the archives. Let me hope you will tolerate that I jump in some
> conversations without having matured all the background.
>
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area. In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input. Roughly speaking, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.
for security reasons, the UTF-8 codec gives you an "illegal encoding"
error in this case.
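
a quick way to see this rejection for yourself is sketched below. it assumes a current Python 3 interpreter, where the strict UTF-8 codec raises UnicodeDecodeError; the exact error text in the CVS version discussed here may differ. the helper name is_rejected is made up for illustration.

```python
# Non-minimal ("overlong") sequences taken from the RFC 2279 examples:
# C0 80 is an overlong NUL, and C0 AE is an overlong "." inside "/../".
overlong_samples = [b"\xc0\x80", b"/\xc0\xae./"]


def is_rejected(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec refuses to decode `data`.

    Hypothetical helper for illustration; it only wraps bytes.decode().
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Each overlong sample should be refused, while the minimal form decodes.
results = [is_rejected(sample) for sample in overlong_samples]
plain_ok = not is_rejected(b"/../")
```
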
mal wrote:
> Could you give some examples ? I'm not sure I understand what you
> mean by "could have been expressed with fewer bytes" -- perhaps
> a multi-byte encoding where the top-most bytes are 0 ?
quoting RFC 2279:
Implementors of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.
A particularly subtle form of this attack could be carried out
against a parser which performs security-critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters. For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
00, but allow the illegal two-octet sequence C0 80 and interpret it
as a NUL character. Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
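
to make the second RFC example concrete, here is a minimal sketch of an "incautious" two-octet decoder next to the minimality test that a strict one needs. both function names are hypothetical, and this is not the codec's actual implementation, just the bit arithmetic from the UTF-8 definition.

```python
def naive_decode_2byte(lead: int, trail: int) -> int:
    """Decode 110xxxxx 10xxxxxx into a code point, with no minimality check."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_overlong_2byte(lead: int, trail: int) -> bool:
    """A two-octet sequence is overlong if its value would fit in one octet."""
    return naive_decode_2byte(lead, trail) < 0x80


# The octets from the RFC example: 2F C0 AE 2E 2F.
raw = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])

# A filter that inspects the raw octets does not see "/../" ...
passes_byte_filter = b"/../" not in raw

# ... but the naive decoder turns C0 AE into "." anyway.
decoded_dot = chr(naive_decode_2byte(0xC0, 0xAE))

# The minimality test flags both illegal sequences quoted above.
flags = [is_overlong_2byte(0xC0, 0x80), is_overlong_2byte(0xC0, 0xAE)]
```

a legitimate two-octet sequence such as C3 A9 (U+00E9) is not flagged by the same test.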
</F>