A Unicode problem -HELP

manstey manstey at csu.edu.au
Wed May 17 01:29:49 EDT 2006


OK, I apologise for not being clearer.

1. Here is my input data file, line 2:
gn1:1,1.2 R")$I73YT R")$IYT at ncfsa

2. Here is my output data file, line 2:
u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
'', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

3. Here is my main program:
# -*- coding: UTF-8 -*-
import codecs

import splitFunctions
import surfaceIPA

# Constants for file location

# Working directory constants
dir_root = 'E:\\'
dir_relative = '2 Core\\2b Data\\Data Working\\'

# Input file constants
input_file_name = 'in.grab.txt'
input_file_loc = dir_root + dir_relative + input_file_name
# Initialise input file
input_file = codecs.open(input_file_loc, 'r', 'utf-8')

# Output file constants
output_file_name = 'out.grab.txt'
output_file_loc = dir_root + dir_relative + output_file_name
# Initialise output file
output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode

i = 0
for line in input_file:
	if line[0] != '>': # Ignore headers
		i += 1
		if i != 1:
			word_info = splitFunctions.splitGrab(line, i)
			parse=splitFunctions.splitParse(word_info[10])
			gloss=surfaceIPA.surfaceIPA(word_info[6],word_info[8],word_info[9],parse)
			a=str(word_info + parse + gloss).encode('utf-8')
			a=a[1:len(a)-1]
			output_file.write(a)
			output_file.write('\n')

input_file.close()
output_file.close()

print 'done'


4. Here is my problem:
At the end of my output file, where my unicode character \u0254 (OPEN
O) appears, the file has '\xc9\x94'

What I want is an output file like:

'gn', '1', '1', '1', '2', '-', ..... 'ɔ'

where ɔ is an open O, and would display correctly in the appropriate
font.

Once I can get it to display properly, I will rewrite gloss so that it
returns a proper translation of 'R")$I73YT', which will be a string of
unicode characters.

Is this clearer? The other two functions are basic. splitGrab turns
'gn1:1,1.2 R")$I73YT R")$IYT at ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT
@ ncfsa' and splitParse turns the final piece of this 'ncfsa' into 'n c
f s a'. They have to be done separately as splitParse involves some
translation and program logic. SurfaceIPA reads in 'R")$I73YT' and
other data to produce the unicode string. At the moment it just returns
two dummy strings and u'\u0254'.encode('utf-8').

All help is appreciated!

Thanks




More information about the Python-list mailing list