A Unicode problem -HELP
manstey
manstey at csu.edu.au
Wed May 17 01:29:49 EDT 2006
OK, I apologise for not being clearer.
1. Here is my input data file, line 2:
gn1:1,1.2 R")$I73YT R")$IYT@ncfsa
2. Here is my output data file, line 2:
u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
'', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'
3. Here is my main program:
# -*- coding: UTF-8 -*-
import codecs
import splitFunctions
import surfaceIPA
# Constants for file location
# Working directory constants
dir_root = 'E:\\'
dir_relative = '2 Core\\2b Data\\Data Working\\'
# Input file constants
input_file_name = 'in.grab.txt'
input_file_loc = dir_root + dir_relative + input_file_name
# Initialise input file
input_file = codecs.open(input_file_loc, 'r', 'utf-8')
# Output file constants
output_file_name = 'out.grab.txt'
output_file_loc = dir_root + dir_relative + output_file_name
# Initialise output file
output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode
i = 0
for line in input_file:
    if line[0] != '>':  # Ignore headers
        i += 1
        if i != 1:
            word_info = splitFunctions.splitGrab(line, i)
            parse = splitFunctions.splitParse(word_info[10])
            gloss = surfaceIPA.surfaceIPA(word_info[6], word_info[8], word_info[9], parse)
            a = str(word_info + parse + gloss).encode('utf-8')
            a = a[1:len(a)-1]
            output_file.write(a)
            output_file.write('\n')
input_file.close()
output_file.close()
print 'done'
4. Here is my problem:
At the end of my output file, where my Unicode character \u0254 (OPEN
O) should appear, the file instead has the literal text '\xc9\x94'.
What I want is an output file like:
'gn', '1', '1', '1', '2', '-', ..... 'ɔ'
where ɔ is an open O, and would display correctly in the appropriate
font.
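
The target line format can be sketched as follows. This is only a sketch, assuming every element is already a unicode string; format_row and the sample row are hypothetical and are not part of the program above:

```python
# -*- coding: utf-8 -*-

def format_row(items):
    """Hypothetical helper: quote each item and join with ', '.

    Building the line by hand avoids str(list), which in Python 2
    renders every element through repr() and escapes non-ASCII text.
    """
    return u", ".join(u"'" + item + u"'" for item in items)

# Invented sample data ending in U+0254 (open O).
row = [u'gn', u'1', u'1', u'1', u'2', u'-', u'\u0254']

line = format_row(row)

# The bytes a UTF-8 output file would hold for this line.
encoded = line.encode('utf-8')
```

With this approach the open O reaches the file as the two bytes C9 94, which a UTF-8 aware editor shows as a single character.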
Once I can get it to display properly, I will rewrite gloss so that it
returns a proper translation of 'R")$I73YT', which will be a string of
unicode characters.
Is this clearer? The other two functions are basic. splitGrab turns
'gn1:1,1.2 R")$I73YT R")$IYT@ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT
@ ncfsa' and splitParse turns the final piece of this, 'ncfsa', into 'n c
f s a'. They have to be done separately as splitParse involves some
translation and program logic. surfaceIPA reads in 'R")$I73YT' and
other data to produce the unicode string. At the moment it just returns
two dummy strings and u'\u0254'.encode('utf-8').
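
The symptom can be reproduced without the data files. A minimal sketch, assuming the combined list ends with the UTF-8 encoded open O as described above (the row contents are invented stand-ins, not real output of splitGrab or surfaceIPA):

```python
# -*- coding: utf-8 -*-

# Stand-in for word_info + parse + gloss; the values are invented.
row = [u'gn', u'1', u'\u0254'.encode('utf-8')]

# str() on a list calls repr() on every element, so the two UTF-8 bytes
# of the open O are rendered as the escape text \xc9\x94.
text = str(row)
print(text)

# Trimming the brackets, as the main program does, leaves the escapes in place.
trimmed = text[1:len(text) - 1]
print(trimmed)
```

If this reading is right, the escapes come from converting the list to a string rather than from the file write itself.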
All help is appreciated!
Thanks