reading hebrew text file

jepler at unpythonic.net
Mon Oct 17 10:26:46 EDT 2005


I looked for "VAV" (the name of a Hebrew letter) in the files in the
"encodings" directory (/usr/lib/python2.4/encodings/*.py on my machine).
I found that the following character encodings seem to include Hebrew
characters:
	cp1255
	cp424
	cp856
	cp862
	iso8859-8
A file containing Hebrew text might be in any one of these encodings, or
in any Unicode-based encoding such as UTF-8 or UTF-16.
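
As a rough cross-check of that list, here is a minimal sketch that asks
each codec whether it can encode the letter VAV (U+05D5).  The helper name
encodings_with and the CANDIDATES list are made up for this example; they
are not part of the standard library.

```python
# Sketch only: encodings_with and CANDIDATES are hypothetical names.
CANDIDATES = ["cp1255", "cp424", "cp856", "cp862", "iso8859-8"]
VAV = u"\u05d5"  # HEBREW LETTER VAV


def encodings_with(char, names):
    """Return the codec names from `names` that can encode `char`."""
    found = []
    for name in names:
        try:
            char.encode(name)
        except (UnicodeError, LookupError):
            # UnicodeError: the codec has no mapping for this character.
            # LookupError: this Python does not ship a codec by that name.
            continue
        found.append(name)
    return found


print(encodings_with(VAV, CANDIDATES))
```

Being able to encode one letter does not tell you which encoding a given
file actually uses; it only narrows down the candidates.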

To open an encoded file for reading, use
	import codecs
	f = codecs.open(filename, 'r', encoding='...')
Now, calls like 'f.readline()' will return unicode strings instead of
byte strings.
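
A small round-trip sketch, assuming the file is in cp1255 (your file may
well use a different encoding): it writes one Hebrew word to a temporary
file and reads it back through codecs.open.  The sample text and the
temporary file are invented for the example.

```python
import codecs
import os
import tempfile

# Sample data for illustration: the word "shalom" plus a newline.
text = u"\u05e9\u05dc\u05d5\u05dd\n"

# Create a scratch file; cp1255 is an assumed encoding, not a given.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    out = codecs.open(path, "w", encoding="cp1255")
    out.write(text)          # encoded to cp1255 bytes on the way out
    out.close()

    f = codecs.open(path, "r", encoding="cp1255")
    line = f.readline()      # decoded back to a unicode string
    f.close()
finally:
    os.remove(path)

print(repr(line))
```

If you pick the wrong encoding, the read may either raise a
UnicodeDecodeError or silently give you the wrong characters, so it is
worth checking a few lines by eye.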

Here's an example, using a file in UTF-8 I have lying around:
>>> import codecs
>>> f = codecs.open("/users/jepler/txt/UTF-8-demo.txt", "r", "utf-8")
>>> for i in range(5): print repr(f.readline())
... 
u'UTF-8 encoded sample plain-text file\n'
u'\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\n'
u'\n'
u'Markus Kuhn [\u02c8ma\u02b3k\u028as ku\u02d0n] <mkuhn at acm.org> \u2014 1999-08-20\n'
u'\n'

Jeff