Can utf-8 encoded character contain a byte of TAB?

Mon Jan 15 16:41:55 EST 2018

On Tue, Jan 16, 2018 at 8:29 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual TAB character, not as a part of any other character's encoding. The only bytes that can appear in the utf-8 encoding of non-ascii characters are starting with 0xC2 through 0xF4, followed by one or more of 0x80 through 0xBF.
>
> So for utf-8 encoded input, I only need to use this code to split each
> line into fields?
>
> import sys
> for line in sys.stdin:
>     fields=line.rstrip('\n').split('\t')
>     print fields
>
> Is there a need to use this code to split each line into fields?
>
> import sys
> for line in sys.stdin:
>     fields=line.rstrip('\n').decode('utf-8').split('\t')
>     print [x.encode('utf-8') for x in fields]
>

One of the deliberate design features of UTF-8 is that the ASCII byte
values (those below 128) are *exclusively* used for ASCII characters.
Characters at code point 128 and above are encoded using multiple
bytes, every one of which is in the 128-255 range. So a 0x09 byte in
valid UTF-8 is always an actual TAB.
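
If you'd rather check that than take my word for it, here is a rough
Python 3 sketch that encodes every non-ASCII code point and collects
the byte values that turn up. The function name is made up for
illustration, and the surrogate range is skipped because those code
points can't be encoded as UTF-8 at all.

```python
def bytes_used_by_non_ascii():
    """Collect every byte value seen in the UTF-8 encoding of a non-ASCII code point."""
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        # Iterating over a bytes object yields ints in Python 3.
        seen.update(chr(cp).encode("utf-8"))
    return seen


used = bytes_used_by_non_ascii()
print(min(used) >= 0x80)  # True: no byte below 128 ever appears
print(0x09 in used)       # False: TAB is never part of a multi-byte sequence
```

It walks a bit over a million code points, so expect it to take a
moment.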

But what you should ideally do is decode everything as UTF-8, then
manipulate it as text. That's the default way to do things in Python 3
anyway. The reason for this is that it's entirely possible for an
arbitrary byte stream to NOT follow the rules of UTF-8, and splitting
the raw bytes won't notice that, so the bad data gets passed along to
break something later. The way to be confident is to do the decode,
and if it fails, reject the input.
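
Something along these lines, as a minimal Python 3 sketch rather than
a finished tool. split_fields is a name I've made up, and the sample
data is invented purely to show one good line and one bad one.

```python
import io


def split_fields(raw_line):
    """Decode one line of bytes as UTF-8 and split it on TAB.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = raw_line.decode("utf-8")  # errors="strict" is the default
    return text.rstrip("\n").split("\t")


# Invented sample: a valid line, then one containing a stray 0xFF byte.
sample = io.BytesIO("a\tb\u00e9\n".encode("utf-8") + b"bad\t\xff\n")

for lineno, raw in enumerate(sample, 1):
    try:
        print(split_fields(raw))
    except UnicodeDecodeError as exc:
        # Reject the input instead of passing bad bytes along.
        print("line %d rejected: %s" % (lineno, exc))
```

For real input you would iterate over sys.stdin.buffer (the binary
view of stdin) instead of the sample, and probably exit on the first
bad line rather than just printing a message.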

ChrisA