[ python-Bugs-946130 ] xmlrpclib omits charset in Content-Type HTTP header

SourceForge.net noreply at sourceforge.net
Sun May 2 17:25:41 EDT 2004


Bugs item #946130, was opened at 2004-05-02 00:30
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=946130&group_id=5470

Category: None
Group: Not a Bug
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Christian Schmidt (c960657)
Assigned to: Nobody/Anonymous (nobody)
Summary: xmlrpclib omits charset in Content-Type HTTP header

Initial Comment:
When xmlrpclib makes an HTTP request, it always sends
the HTTP header line "Content-Type: text/xml". The
encoding of the XML document is specified in the <?xml
...?> tag, e.g. <?xml version='1.0' encoding='utf-8'?>.

However, when XML is transferred over HTTP, the charset
specified in the HTTP Content-Type header takes
precedence over that in the document itself, i.e. the
encoding specified in the <?xml?> tag should be ignored
(RFC 3023 section 3.1). If the charset is not specified
in the Content-Type header, it defaults to us-ascii.

xmlrpclib currently specifies the charset in the
encoding attribute of the <?xml?> tag and not in the
HTTP header. The XML-RPC server thus treats the XML
document as us-ascii instead of the specified encoding.

xmlrpclib should specify the encoding in the
Content-Type header.

Disclaimer: I am no expert in XML and MIME-types, so I
might be wrong about this.
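
A minimal sketch of the rule described in this comment, for illustration
only: the charset parameter of the Content-Type header wins, and text/xml
without one is read as us-ascii. The function name text_xml_charset is
hypothetical and is not part of xmlrpclib; the sketch is written for
Python 3 and uses only the standard library's header parser.

```python
from email.message import Message


def text_xml_charset(content_type: str) -> str:
    """Return the charset a receiver would assume for a text/xml body.

    Hypothetical helper: it only models the header-based rule from the
    report and never looks at the XML declaration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset
    # Default for text/xml without a charset parameter, as stated in
    # the report (RFC 3023 section 3.1).
    return "us-ascii"
```

With the header that xmlrpclib sends, "Content-Type: text/xml", this
returns "us-ascii" regardless of what the <?xml?> declaration says.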

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2004-05-02 23:25

Message:
Logged In: YES 
user_id=21627

The bug, as reported, is not a bug. Adding charset= to the
Content-type would be a violation of the XML-RPC protocol.
Not doing so might be a violation of the HTTP protocol, but
xmlrpclib does not claim to implement HTTP; it claims to
implement XML-RPC.

Using character references if unencodable characters are
found might be a good idea, but this is not the subject of
the bug report. Asking for a change to xmlrpclib to use
character references where necessary would not be a bug
report, it would be a feature request (because xmlrpclib is
currently not misbehaving in this respect).

If you want to look into implementing a change in this
direction, you should try to use the 'xmlcharrefreplace'
error handler for the Unicode .encode method. The tricky
part is to get the quoting of ASCII characters (lt, gt, ...)
right.
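
One way the suggestion above could look, as a sketch rather than a patch
for xmlrpclib: escape the markup characters first, then let the
'xmlcharrefreplace' error handler turn everything outside ASCII into
character references. The function name encode_ascii_xml_text is
hypothetical, and the code assumes Python 3, where str is already
Unicode.

```python
from xml.sax.saxutils import escape


def encode_ascii_xml_text(text: str) -> bytes:
    """Encode character data as pure ASCII that is safe inside XML.

    Hypothetical helper, not an xmlrpclib API.
    """
    # Step 1: quote the ASCII markup characters (&, <, >). This has to
    # happen before step 2, otherwise the '&' of each generated
    # character reference would itself be escaped to '&amp;'.
    escaped = escape(text)
    # Step 2: anything that does not fit into ASCII becomes a decimal
    # character reference such as &#246;.
    return escaped.encode("ascii", "xmlcharrefreplace")
```

For example, encode_ascii_xml_text("Löwis <a & b>") gives
b"L&#246;wis &lt;a &amp; b&gt;". The ordering of the two steps is the
"tricky part" mentioned above.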

----------------------------------------------------------------------

Comment By: Christian Schmidt (c960657)
Date: 2004-05-02 23:04

Message:
Logged In: YES 
user_id=32013

It was not my intention to argue with you whether XML-RPC
should allow adding a charset= or not.

But I am arguing that RFC 3023 applies to XML-RPC, and that
implies that as long as charset= is not specified, the
server should treat the encoding as us-ascii, even if
another encoding is specified in the <?xml> tag (the XML-RPC
spec doesn't mention specifying an encoding in the <?xml>
tag, so as such there is no contradiction between the spec
and RFC 3023). And this implies that XML-RPC must encode
non-us-ascii characters with &#xx;.

So my point is that in order to be compliant with both the
XML-RPC spec and RFC 3023, xmlrpclib should encode
non-us-ascii characters as &#xx;.

This may not be that important, but I believe it _is_ a bug,
i.e. I disagree that this is marked "Not a bug".

I only started coding Python two weeks ago, so I may not be
the best to provide a patch :-) But I am willing to give it
a try.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-05-02 18:28

Message:
Logged In: YES 
user_id=21627

When it comes to XML-RPC, the only person to argue with is
Dave Winer (arguing with us is futile). If you can make him
say, in public, that adding charset= is ok for XML-RPC
implementations, we can change Python.

As for your current problem: It would be best to use
US-ASCII for encoding your XML document, representing
non-ASCII characters as character references. Of course,
that is currently not supported in xmlrpclib; patches welcome.

----------------------------------------------------------------------

Comment By: Christian Schmidt (c960657)
Date: 2004-05-02 16:39

Message:
Logged In: YES 
user_id=32013

Hmm, interesting.

I agree that according to the letter of the spec, the
encoding cannot be specified in the Content-Type header.

But -- XML-RPC uses HTTP so I would argue that RFC 3023
still applies. If it does, a server should ignore any
encoding specified in the <?xml>, and the default encoding
for text/xml is us-ascii. So in order to represent
non-us-ascii characters in an XML-RPC message, they should
be encoded using the &#xx; notation.

xmlrpclib doesn't do this, so I suggest reopening this bug.

FYI: The actual problem I am having is making xmlrpclib work
with the XML_RPC_Server that is part of PEAR (PEAR is the
"official" PHP Extension and Application Repository). This
server does not inspect the <?xml> tag for an encoding but
always assumes that the input is UTF-8. According to RFC
3023 it should assume it to be us-ascii, but since us-ascii
is a subset of UTF-8, the current behaviour of the server
should be safe, as long as the client either sends us-ascii
or UTF-8 (I have submitted a patch to the XML_RPC_Server
maintainer that extends the encoding detection to look in
both the Content-Type header and the <?xml> tag).
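
The extended detection described in the last sentence might be sketched
as follows. This is an illustration in Python, not the PHP patch that
was submitted to PEAR: the name detect_charset, the regular expression,
and the lookup order (header first, then the <?xml> declaration, then
UTF-8 as the server's existing assumption) are all assumptions made for
this example.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the body. Only works for ASCII-compatible encodings;
# UTF-16 input would need byte-order-mark handling that is left out.
_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_charset(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an XML-RPC request body (hypothetical helper)."""
    # 1. A charset parameter in the Content-Type header takes precedence.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    # 2. Otherwise fall back to the encoding named in the declaration.
    match = _DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Otherwise keep the server's current behaviour.
    return default
```

A request sent by xmlrpclib with "Content-Type: text/xml" and a body
starting with <?xml version='1.0' encoding='ISO-8859-1'?> would be
detected as "iso-8859-1" under this scheme.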

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-05-02 14:25

Message:
Logged In: YES 
user_id=21627

The XML-RPC spec is very clear that the value of the
Content-Type header is "text/xml". Following the traditional
interpretation of the XML-RPC spec (where examples are
considered normative), it would be a protocol violation to
add a charset= parameter to Content-Type.

Until the XML-RPC spec is changed, or the status of using
charset= in XML-RPC is officially clarified, we can't change
our implementation.

Closing this as not-a-bug.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-05-02 12:59

Message:
Logged In: YES 
user_id=38388

I don't see anything wrong with the way xmlrpclib deals
with the encoding.

You are right on one point: HTTP defaults to Latin-1 as charset,
but since the content may well be non-Latin-1, xmlrpclib
should probably also place the encoding information into the
HTTP header (for requests it sends out).

However, this is rarely a problem, since clients usually don't
follow the HTTP way of interpreting the charset when seeing
text/xml as content type... xmlrpclib itself certainly
doesn't :-)


----------------------------------------------------------------------
