Can utf-8 encoded character contain a byte of TAB?

Random832 random832 at fastmail.com
Mon Jan 15 10:09:41 EST 2018


On Mon, Jan 15, 2018, at 09:35, Peter Otten wrote:
> Peng Yu wrote:
> 
> > Can utf-8 encoded character contain a byte of TAB?
> 
> Yes; ascii is a subset of utf8.
> 
> If you want to allow fields containing TABs in a file where TAB is also the 
> field separator you need a convention to escape the TABs occurring in the 
> values. Nothing I see in your post can cope with that, but the csv module 
> can, by quoting fields containing the delimiter:

Just to be clear, the TAB byte (0x09) *only* appears in UTF-8 as the encoding of the actual TAB character, never as part of any other character's encoding. The encoding of a non-ASCII character consists solely of a lead byte in the range 0xC2 through 0xF4 followed by one to three continuation bytes in the range 0x80 through 0xBF.
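
If you'd rather check this than take my word for it, here is a rough brute-force sketch that encodes every code point and looks at the resulting bytes. The function name check_utf8_byte_ranges is just something made up for illustration; it relies only on str.encode and sys.maxunicode, and it skips the surrogate range because those code points can't be encoded as UTF-8 at all.

```python
import sys

TAB = 0x09


def check_utf8_byte_ranges():
    """Encode every code point and return those whose UTF-8 bytes include TAB.

    Also asserts that each non-ASCII character encodes to a lead byte in
    0xC2..0xF4 followed by one to three bytes in 0x80..0xBF.
    """
    tab_hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        encoded = chr(cp).encode("utf-8")
        if TAB in encoded:
            tab_hits.append(cp)
        if cp >= 0x80:
            lead, rest = encoded[0], encoded[1:]
            assert 0xC2 <= lead <= 0xF4
            assert 1 <= len(rest) <= 3
            assert all(0x80 <= b <= 0xBF for b in rest)
    return tab_hits


if __name__ == "__main__":
    # Only U+0009 itself should show up.
    print(check_utf8_byte_ranges())  # prints [9]
```

It takes a second or so to run. The practical upshot is that splitting UTF-8 bytes on b"\t" can never cut a multi-byte character in half, though it still can't tell a separator TAB from a TAB that is part of a field's value.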


