overriding character escapes during file input

Sun Sep 3 00:32:18 EDT 2006

David J Birnbaum wrote:
> Dear Python-list,
>
> I need to read a Unicode (utf-8) file that contains text like:
> > blah \fR40\fC blah
> I get my input and then process it with something like:
> > inputFile = codecs.open(sys.argv[1],'r', 'utf-8')
> >
> > for line in inputFile:
> When Python encounters the "\f" substring in an input line, it wants to
> treat it as an escape sequence representing a form-feed control
> character,

Even if it were as sentient as "wanting" to muck about with the input,
it doesn't. Those escape sequences are interpreted by the compiler when
it processes string literals in source code, and by some functions
(e.g. re.compile), but *not* when reading a text file.

Example:
|>>> guff = r"blah \fR40\fC blah"
|>>> print repr(guff)
|'blah \\fR40\\fC blah'
|>>> # above is ASCII so it is automatically also UTF8

Comment: It contains backslash followed by 'f' ...

|... fname = "guff.utf8"
|>>> f = open(fname, "w")
|>>> f.write(guff)
|>>> f.close()
|>>> import codecs
|>>> f = codecs.open(fname,'r', 'utf-8')
|>>> guff2 = f.read()
|>>> print guff2 == guff
|True
No interpretation of the r"\f" has been done.
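
The same point as a standalone script, in case the transcript above is
hard to follow. This is only a sketch: it assumes a current Python 3 and
uses io.open where the session above uses codecs.open, and the scratch
file name is made up for the illustration.

```python
import io
import os
import tempfile

cooked = "\f"   # in source code, the compiler turns this into one form feed
raw = r"\f"     # raw literal: a backslash followed by 'f'

# Same characters as r"blah \fR40\fC blah"
text = u"blah \\fR40\\fC blah"

# Hypothetical scratch file, written and read back as UTF-8
path = os.path.join(tempfile.mkdtemp(), "guff.utf8")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(path, "r", encoding="utf-8") as f:
    line = f.read()

print(len(cooked), len(raw))   # 1 2
print(line == text)            # True: the data comes back unchanged
print("\x0c" in line)          # False: no form feed was created
```

Only the literal in the source code is affected by escape processing;
the text read from the file still has the backslash and the 'f' as two
separate characters.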

> which means that it gets interpreted as (or, from my
> perspective, translated to) "\x0c". Were I entering this string myself
> within my program code, I could use a raw string (r"\f") to avoid this
> translation, but I don't know how to do this when I am reading a line
> from a file.

What I suggest you do is:
   print repr(open('yourfile', 'r').read())
[or at least one of the offending lines]
and inspect it closely. You may find (1) that the file really has
formfeeds in it or (2) that it has r"\f" in it and you were mistaken
about the interpretation or (3) something else.
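
If eyeballing the repr() output is awkward, the check for cases (1) and
(2) could be sketched roughly like this. It assumes Python 3, reads the
file as raw bytes so that no decoding is involved, and the function
names are made up for the illustration.

```python
def describe_formfeeds(data):
    """Count real form feeds and literal backslash-f pairs in raw bytes."""
    return {
        "formfeed_bytes": data.count(b"\x0c"),    # case (1): actual \x0c bytes
        "backslash_f_pairs": data.count(b"\\f"),  # case (2): backslash then 'f'
    }


def dump_file(path):
    """Print the repr of a file's raw contents and return the counts."""
    with open(path, "rb") as f:
        data = f.read()
    print(repr(data))
    return describe_formfeeds(data)
```

Calling dump_file('yourfile') shows the raw bytes and tells you which
case you are in: a non-zero "formfeed_bytes" means the form feeds are
already in the file, while a non-zero "backslash_f_pairs" means the
file holds a backslash followed by 'f'. If both counts are zero, you
are in case (3).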

If you maintain (3) is the case, then make up a small example file,
show a dump of it using print repr(.....) as above, plus the (short)
code where you decode it and dump the result.

HTH,
John