[Python-Dev] Bytes path support

Marko Rauhamaa marko at pacujo.net
Sat Aug 23 10:21:57 CEST 2014


"Stephen J. Turnbull" <stephen at xemacs.org>:

> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

  <?xml version="1.0"?>
                      ^
                      +--- Now I know it is UTF-8

  <?xml version="1.0" encoding="UTF-16"?>
                                      ^
                                      +--- Now I know it was UTF-16
                                           all along!
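
A rough sketch of that sniffing step in Python, working on raw bytes
only. The helper name sniff_xml_encoding is made up for illustration,
and this covers just a few cases (UTF-8/UTF-16 byte order marks, UTF-16
without a BOM, and an ASCII-compatible declaration); it is not a full
implementation of the autodetection described in the XML specification:

```python
import codecs
import re

# Hypothetical helper for illustration; handles only a few common cases.
# Matches an encoding="..." pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # A byte order mark settles the question before any parsing.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # UTF-16 without a BOM: "<?" appears as 3C 00 3F 00 or 00 3C 00 3F.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # ASCII-compatible encodings: the declaration itself is readable.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: fall back to UTF-8.
    return "utf-8"
```

Note that in the UTF-16 case the declaration cannot even be read as
ASCII; the guess has to come from the byte pattern of "<?" itself.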

Then we have HTTP, where the headers and the document can each declare a
charset:


  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
document declares its character encoding as UTF-16.
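
To make the two layers concrete, here is a small sketch that pulls both
declarations out of a raw response while staying in bytes. The name
declared_charsets and the regular expressions are assumptions made for
this example; real HTML sniffing is considerably more involved, and the
sketch only reports what each layer claims without deciding which one
wins:

```python
import re

# Hypothetical patterns for illustration; not a real HTTP or HTML parser.
_HEADER_CHARSET = re.compile(
    rb"^content-type:[^\r\n]*?charset=([\w.-]+)", re.I | re.M
)
_META_CHARSET = re.compile(rb"<meta[^>]*?charset=[\"']?([\w.-]+)", re.I)

def declared_charsets(raw: bytes):
    """Return (header_charset, meta_charset); either may be None."""
    # Headers end at the first empty line; everything after is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    header = _HEADER_CHARSET.search(head)
    meta = _META_CHARSET.search(body)
    return (
        header.group(1).decode("ascii").lower() if header else None,
        meta.group(1).decode("ascii").lower() if meta else None,
    )

raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"\r\n"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\r\n'
    b"<html>\r\n<head>\r\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16">\r\n'
)
print(declared_charsets(raw))
```

The search over the body only works here because the sample bytes are
ASCII-compatible, which is exactly the chicken-and-egg problem.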


Marko

