[Python-Dev] cgi.FieldStorage with multipart/form-data tries to decode binary file as UTF-8 if "filename=" not specified

Wed Feb 15 12:35:39 EST 2017

On Wed, 15 Feb 2017 at 08:14 Ben Hoyt <benhoyt at gmail.com> wrote:

> I posted this on StackOverflow [1], but I'm posting it here as well, as I
> believe this is a bug (or at least quirk) in cgi.FieldStorage where you
> can't access a file upload properly if "filename=" is not present in the
> MIME part's Content-Disposition header. There are a couple of related bugs
> open (and closed) on bugs.python.org, but not quite this issue.
>
> Is it legitimate for cgi.FieldStorage to use the presence of "filename="
> to determine "this is a binary file" (in which case this is not a bug and
> my client is just buggy), or is this a bug? I lean towards the latter as
> the spec indicates that the filename is optional [2].
>

Assuming this isn't a recent change in semantics, I would say this is now a
quirk, considering how old the module is and that people probably rely on
its current semantics.

-Brett

>
> Copying from my StackOverflow question, including a test/repro case:
>
> When I use `cgi.FieldStorage` to parse a `multipart/form-data` request (or
> any web framework like Pyramid which uses `cgi.FieldStorage`) I have
> trouble processing file uploads from certain clients which don't provide a
> `filename=file.ext` in the part's `Content-Disposition` header.
>
> If the `filename=` option is missing, `FieldStorage()` tries to decode the
> contents of the file as UTF-8 and returns a string. Many files are binary
> rather than valid UTF-8, so the returned string is garbage.
>
> For example:
>
>     >>> import cgi
>     >>> import io
>     >>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' +
>     ...         b'Content-Disposition: form-data; name=payload\r\n\r\n' +
>     ...         b'\xff\xd8\xff\xe0\x00\x10JFIF')
>     >>> env = {
>     ...     'REQUEST_METHOD': 'POST',
>     ...     'CONTENT_TYPE': 'multipart/form-data; '
>     ...                     'boundary=KQNTvuH-itP09uVKjjZiegh7',
>     ...     'CONTENT_LENGTH': len(body),
>     ... }
>     >>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env)
>     >>> (fs['payload'].filename, fs['payload'].file.read())
>     (None, '����\x00\x10JFIF')
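
The four replacement characters in that output are what a UTF-8 decode with `errors='replace'` produces for the leading JPEG bytes. A minimal sketch in plain Python, independent of `cgi`, suggesting why the original bytes cannot be recovered from the returned string:

```python
raw = b'\xff\xd8\xff\xe0\x00\x10JFIF'  # payload bytes from the example above

# Each byte that is not valid UTF-8 is replaced by U+FFFD.
decoded = raw.decode('utf-8', errors='replace')
print(ascii(decoded))

# Re-encoding yields the UTF-8 form of U+FFFD, not the original bytes.
roundtrip = decoded.encode('utf-8')
print(roundtrip == raw)
```

Since the replacement is lossy, re-encoding `fs['payload'].file.read()` after the fact is not a workable fix; the bytes have to be kept before any decoding happens.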
>
> Browsers and *most* HTTP libraries do include the `filename=` option for
> file uploads, but I'm currently dealing with a client that doesn't (and
> omitting the `filename` does seem to be valid according to the spec).
>
> Currently I'm using a pretty hacky workaround by subclassing
> `FieldStorage` and replacing the relevant `Content-Disposition` header with
> one that does have the filename:
>
>     import cgi
>     import os
>
>     class FileFieldStorage(cgi.FieldStorage):
>         """To use, subclass FileFieldStorage and override _file_fields
>         with a tuple of the names of the file field(s). You can also
>         override _file_name with the filename to add.
>         """
>
>         _file_fields = ()
>         _file_name = 'file_name'
>
>         def __init__(self, fp=None, headers=None, outerboundary=b'',
>                      environ=os.environ, keep_blank_values=0,
>                      strict_parsing=0, limit=None, encoding='utf-8',
>                      errors='replace'):
>
>             if (self._file_fields and headers and
>                     headers.get('content-disposition')):
>                 content_disposition = headers['content-disposition']
>                 key, pdict = cgi.parse_header(content_disposition)
>                 if (key == 'form-data' and
>                         pdict.get('name') in self._file_fields and
>                         'filename' not in pdict):
>                     del headers['content-disposition']
>                     quoted_file_name = self._file_name.replace('"', '\\"')
>                     headers['content-disposition'] = (
>                         '{}; filename="{}"'.format(
>                             content_disposition, quoted_file_name))
>
>             super().__init__(fp=fp, headers=headers,
>                              outerboundary=outerboundary,
>                              environ=environ,
>                              keep_blank_values=keep_blank_values,
>                              strict_parsing=strict_parsing, limit=limit,
>                              encoding=encoding, errors=errors)
>
> Using the `body` and `env` in my first test, this works now:
>
>     >>> class TestFieldStorage(FileFieldStorage):
>     ...     _file_fields = ('payload',)
>     >>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env)
>     >>> (fs['payload'].filename, fs['payload'].file.read())
>     ('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')
>
> Is there some way to avoid this hack and tell `FieldStorage` not to decode
> as UTF-8? It would be nice if you could provide `encoding=None` or
> something, but it doesn't look like it supports that.
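
One possible way to sidestep the decoding without patching headers is to parse the body with the standard library's `email` package instead of `cgi`. This is a sketch under assumptions, not something `FieldStorage` offers: `parse_multipart_bytes` is a hypothetical helper, it holds the whole body in memory, and it has only been reasoned through for a small single-part body like the one in the test case above.

```python
from email import policy
from email.parser import BytesParser


def parse_multipart_bytes(body, content_type):
    """Hypothetical helper: map each form field name to its raw bytes."""
    # The email parser expects headers first, so prepend the Content-Type
    # header (which carries the boundary) to the raw request body.
    prefix = b'Content-Type: ' + content_type.encode('ascii') + b'\r\n\r\n'
    msg = BytesParser(policy=policy.HTTP).parsebytes(prefix + body)
    fields = {}
    for part in msg.iter_parts():
        name = part.get_param('name', header='content-disposition')
        # decode=True returns bytes; no charset decoding is applied.
        fields[name] = part.get_payload(decode=True)
    return fields


# Same body and boundary as the test case above.
body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n'
        b'Content-Disposition: form-data; name=payload\r\n\r\n'
        b'\xff\xd8\xff\xe0\x00\x10JFIF')
content_type = 'multipart/form-data; boundary=KQNTvuH-itP09uVKjjZiegh7'
fields = parse_multipart_bytes(body, content_type)
print(fields['payload'])
```

The trade-off is that this gives up `FieldStorage` features such as spooling large uploads to temporary files and the `limit` argument, so it probably only suits small request bodies.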
>
> Thanks,
> Ben.
>
>
> [1]
> https://stackoverflow.com/questions/42213318/cgi-fieldstorage-with-multipart-form-data-tries-to-decode-binary-file-as-utf-8-e
> [2] https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/brett%40python.org
>