[Python-Dev] Bytes path support
Marko Rauhamaa
marko at pacujo.net
Sat Aug 23 10:21:57 CEST 2014
"Stephen J. Turnbull" <stephen at xemacs.org>:
> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points
HTML and XML are interesting examples since their encoding is initially
unknown:

   <?xml version="1.0"?>
                       ^
                       +--- Now I know it is UTF-8

   <?xml version="1.0" encoding="UTF-16"?>
                                         ^
                                         +--- Now I know it was UTF-16
                                              all along!

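
A minimal sketch of that "read as bytes, then decide" step, in Python. The
helper name sniff_xml_encoding is hypothetical (it is not a standard library
function), and the sketch deliberately ignores UTF-32 and EBCDIC. It looks
only at the first bytes of the document: a byte order mark, the NUL-interleaved
pattern of BOM-less UTF-16, or an ASCII-readable encoding declaration.

```python
import re

# Hypothetical helper for illustration; simplified (no UTF-32, no EBCDIC).
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML document from its leading bytes."""
    # A byte order mark settles the question before any parsing.
    if head.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if head.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'utf-16'
    # UTF-16 without a BOM: "<?" shows up interleaved with NUL bytes.
    if head.startswith(b'<\x00?\x00'):
        return 'utf-16-le'
    if head.startswith(b'\x00<\x00?'):
        return 'utf-16-be'
    # Otherwise the declaration is readable as ASCII; use its encoding
    # attribute if there is one.
    m = _DECL.match(head)
    if m:
        return m.group(1).decode('ascii').lower()
    # No BOM and no encoding attribute: XML defaults to UTF-8.
    return 'utf-8'
```

Only after this returns can the rest of the byte stream be decoded, which is
why the declaration has to be read as bytes first.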
Then we have:

   HTTP/1.1 200 OK
   Content-Type: text/html; charset=ISO-8859-1

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   <html>
   <head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
content encoding is UTF-16.
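
To make the depth of that parse concrete, here is a rough Python sketch that
pulls both charset labels out of the raw bytes of such a response. The function
name declared_charsets, the regular expressions, and the 1024-byte scan limit
are assumptions of this sketch, not a real HTTP or HTML parser. It does not
decide which label wins; it only shows that two layers have to be read as
bytes before any decoding can start.

```python
import re

# Illustrative patterns only; a real parser handles many more cases.
_HDR = re.compile(
    rb'^content-type:[^\r\n]*charset\s*=\s*"?([A-Za-z0-9._:-]+)',
    re.IGNORECASE | re.MULTILINE,
)
_META = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)

def declared_charsets(raw: bytes, scan_limit: int = 1024):
    """Return (header_charset, meta_charset); either may be None."""
    # The HTTP header block ends at the first empty line.
    head, _, body = raw.partition(b'\r\n\r\n')
    h = _HDR.search(head)
    # Scan only a bounded prefix of the body for a <meta> declaration.
    m = _META.search(body[:scan_limit])
    header = h.group(1).decode('ascii').lower() if h else None
    meta = m.group(1).decode('ascii').lower() if m else None
    return header, meta
```

Fed the response above as bytes (with CRLF line endings in the header block),
this reports one label from the header and a different one from the <meta>
element.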
Marko