Can a UTF-8 encoded character contain a TAB byte?

Peng Yu pengyu.ut at gmail.com
Mon Jan 15 16:29:36 EST 2018


> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual TAB character, not as a part of any other character's encoding. The only bytes that can appear in the utf-8 encoding of non-ascii characters are starting with 0xC2 through 0xF4, followed by one or more of 0x80 through 0xBF.
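
The quoted claim can be checked by brute force. The following is a minimal sketch, assuming Python 3 (where `chr` covers the full Unicode range); the helper name `multibyte_bytes_seen` is made up for illustration. It encodes every non-ASCII code point and collects the byte values that occur.

```python
def multibyte_bytes_seen():
    """Collect every byte value used when encoding non-ASCII code points as UTF-8."""
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        # Iterating over a bytes object in Python 3 yields integers.
        seen.update(chr(cp).encode('utf-8'))
    return seen

seen = multibyte_bytes_seen()
print(0x09 in seen)        # does TAB ever occur inside a multi-byte sequence?
print(min(seen) >= 0x80)   # are all such bytes outside the ASCII range?
```

If the claim holds, TAB (0x09) is absent from the collected set and every collected byte is 0x80 or above.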

So for UTF-8 encoded input, is it enough to split each line into fields
directly on the raw bytes, like this?

import sys
for line in sys.stdin:
    # Python 2: line is a byte string, so this splits on the raw TAB byte.
    fields = line.rstrip('\n').split('\t')
    print fields

Or do I need to decode each line first, split the unicode string, and then
re-encode the fields, like this?

import sys
for line in sys.stdin:
    # Python 2: decode to unicode, split on TAB, then re-encode each field.
    fields = line.rstrip('\n').decode('utf-8').split('\t')
    print [x.encode('utf-8') for x in fields]
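
The two approaches can be compared side by side. This is a sketch only, assuming Python 3 with explicit bytes literals standing in for Python 2 byte strings; the function names and the sample line are invented for illustration.

```python
def split_bytes(line):
    """Split a UTF-8 encoded line on the raw TAB byte, without decoding."""
    return line.rstrip(b'\n').split(b'\t')

def split_decoded(line):
    """Decode the line, split on the TAB character, and re-encode each field."""
    fields = line.rstrip(b'\n').decode('utf-8').split('\t')
    return [f.encode('utf-8') for f in fields]

# Hypothetical sample line: three fields containing non-ASCII characters.
sample = 'naïve\t日本語\tπ≈3.14\n'.encode('utf-8')

print(split_bytes(sample) == split_decoded(sample))
```

For valid UTF-8 input the two functions should return the same fields, since a TAB byte can only come from an actual TAB character. The decoding variant additionally raises `UnicodeDecodeError` on invalid input, which the byte-level split silently accepts.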

-- 
Regards,
Peng


