Can a UTF-8 encoded character contain a TAB byte?

Peng Yu pengyu.ut at gmail.com
Mon Jan 15 09:11:02 EST 2018


Hi,

I use the following code to process TSV input.

$ printf '%s\t%s\n' {1..10} | ./main.py
['1', '2']
['3', '4']
['5', '6']
['7', '8']
['9', '10']
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
for line in sys.stdin:
    # Strip the trailing newline, then split the line on TAB characters.
    fields = line.rstrip('\n').split('\t')
    print fields

But I am not sure it will process UTF-8 input correctly, so I came
up with the code below. However, I am not sure this is really necessary,
as my impression is that a multi-byte UTF-8 character never contains the
ASCII code for TAB (0x09). Is that so? Thanks.

$ cat main1.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
for line in sys.stdin:
    #fields=line.rstrip('\n').split('\t')
    # Decode to unicode before splitting, then re-encode each field.
    fields = line.rstrip('\n').decode('utf-8').split('\t')
    print [x.encode('utf-8') for x in fields]

$ printf '%s\t%s\n' {1..10} | ./main1.py
['1', '2']
['3', '4']
['5', '6']
['7', '8']
['9', '10']
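
The impression about TAB can be checked by brute force. In UTF-8, every
byte of a multi-byte sequence has its high bit set (0x80-0xFF), so the
byte 0x09 should only ever appear as the TAB character itself. The sketch
below tests that claim; note that it is written in Python 3 syntax, unlike
the Python 2 scripts above, and that the function names are made up for
illustration.

```python
# Python 3 sketch: does byte 0x09 occur in UTF-8 only as TAB itself?

def code_points_containing_tab_byte():
    """Return every code point whose UTF-8 encoding contains byte 0x09."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if b'\t' in chr(cp).encode('utf-8'):
            hits.append(cp)
    return hits

def split_bytes_then_decode(raw_line):
    """Split the raw bytes on TAB first, then decode each field."""
    return [f.decode('utf-8') for f in raw_line.rstrip(b'\n').split(b'\t')]

def decode_then_split(raw_line):
    """Decode the whole line first, then split the text on TAB."""
    return raw_line.decode('utf-8').rstrip('\n').split('\t')

hits = code_points_containing_tab_byte()
print(hits)  # [9]: only U+0009 (TAB) itself

# A sample line with non-ASCII fields, chosen arbitrarily.
sample = 'caf\u00e9\t\u65e5\u672c\u8a9e\n'.encode('utf-8')
print(split_bytes_then_decode(sample) == decode_then_split(sample))  # True
```

If that holds, main.py and main1.py should split valid UTF-8 input on the
same boundaries, and the decode/encode round trip would only matter for
operations that depend on characters rather than bytes.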


-- 
Regards,
Peng


