newbie raw text question

John Machin sjmachin at lexicon.net
Tue Feb 4 16:36:47 EST 2003


"Ian Sparks" <Ian.Sparks at etrials.com> wrote in message news:<mailman.1044369911.27254.python-list at python.org>...
> Thanks for the reply Dennis. Your breakdown of the meaning of the RTF 
> codes is pretty-much spot on. However, I'm still not "getting it". You 
> say :
> 
> >>
> What escaped characters? The \ is a tag introducer (for lack of a 
> better word) and is part of the actual data. "\rtf1" is NOT <cr>tf1. 
> <<
> 
> So here's a simple command-line test :
> 
> >>> print "\rtf1"
> 
> tf1
> >>> print r"\rtf1"
>  \rtf1
> >>>
> 
> Looks to me like \rtf1 *is* <cr>tf1 unless you define the string as a 
> raw string and then it can contain the "\" character.
> 
> This is all very well for strings you define at the command line but 
> what if a variable "x" contains "\rtf1" (NOT a raw string). Now how can 
> you deal with it?
> 
> >>> print x
> 
> tf1
> >>> print rx   #attempt to turn x into a raw string for printing.
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> NameError: name 'rx' is not defined
> >>> 
> 
> How can I print x as though it were a raw string? Like I said, it's 
> probably pretty obvious, I just don't "get it".
> 

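The raw/non-raw distinction is relevant only to string literals that you
type in a script or at the command line. The two forms are just different
ways of telling Python which characters to put in the string; once the
string object exists, nothing about it is "raw" or "non-raw". Data read
from a file never goes through escape processing at all. What you have
in your RTF file is: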
0: backslash
1: r
2: t
3: f
etc etc
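
If you want to convince yourself of that for a file on disk, here is a minimal sketch in Python 3 syntax. The scratch file and its five-byte contents are made up for the illustration; a real RTF file will of course hold more than this.

```python
import os
import tempfile

# Write the five characters backslash, r, t, f, 1 to a scratch file.
# The file and its contents are invented for this illustration.
fd, path = tempfile.mkstemp(suffix=".rtf")
with os.fdopen(fd, "wb") as f:
    f.write(b"\\rtf1")

# Reading the file back does no escape processing at all.
with open(path, "rb") as f:
    data = f.read()
os.remove(path)

print(len(data))           # 5
print(data[:1] == b"\\")   # True: the first byte is a backslash
```

The backslash escapes appear only in the source code that builds the test data. What comes back from read() is simply the bytes that are in the file.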

>>> def sdump(s):
...    print "repr ->", repr(s)
...    print "len  ->", len(s)
...    print "hex  ->", " ".join([hex(ord(c)) for c in s])
...
>>> x = "\rtf1" # Python reads this as carriage return, t, f, 1
>>> sdump(x)
repr -> '\rtf1'
len  -> 4
hex  -> 0xd 0x74 0x66 0x31
>>> sdump("\rtf1")
repr -> '\rtf1'
len  -> 4
hex  -> 0xd 0x74 0x66 0x31

# You *don't* have the above in your file.

>>> sdump("\\rtf1") # Python reads this as backslash, r, t, f, 1
repr -> '\\rtf1'
len  -> 5
hex  -> 0x5c 0x72 0x74 0x66 0x31
>>> sdump(r"\rtf1") # Python reads this as backslash, r, t, f, 1
repr -> '\\rtf1'
len  -> 5
hex  -> 0x5c 0x72 0x74 0x66 0x31

# You *do* have the above in your file.
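
The same point can be checked by comparing the strings directly. This is a sketch in Python 3 syntax (the session above is Python 2), and the variable names are mine:

```python
raw = r"\rtf1"       # raw literal: backslash, r, t, f, 1
escaped = "\\rtf1"   # ordinary literal with an escaped backslash
cr = "\rtf1"         # ordinary literal: \r becomes a carriage return

print(raw == escaped)          # True
print(type(raw) is type(cr))   # True
print(len(cr), len(raw))       # 4 5
print(repr(cr))                # '\rtf1'
```

The two spellings of the five-character string compare equal, and all three objects are plain strings; there is no separate "raw string" type to convert to.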

How you deal with it is fairly simple. Here's an example, untested,
off the top of my head and with no knowledge of the RTF grammar:

import re

BSLASH = "\\" # single backslash character
keywd_matcher = re.compile("[a-z]+")
buff = open("blah.rtf").read() # may need binary mode, I dunno
blen = len(buff)
pos = 0
while pos < blen:
   if buff[pos] != BSLASH:
      # not a keyword; I dunno, read the docs
      pos += 1 # step past it so the loop terminates
   else:
      pos += 1
      m = keywd_matcher.match(buff, pos) # match starting at pos
      if not m:
         # backslash not followed by a keyword; I dunno, read the docs
         pass
      else:
         keywd = m.group()
         pos = m.end()
         if keywd == "rtf":
            pass # etc
         elif keywd == "foo":
            pass # etc
         # faster ways of doing this e.g. look up a dictionary
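
As for the dictionary lookup, here is one way it might look, in Python 3 syntax. The scan() function, the handlers and the HANDLERS table are hypothetical names invented for this sketch, and like the loop above it knows nothing about the real RTF grammar; it only records which known keywords it saw.

```python
import re

BSLASH = "\\"  # single backslash character
keywd_matcher = re.compile("[a-z]+")

# Hypothetical handlers: each one just records that its keyword was seen.
def handle_rtf(seen):
    seen.append("rtf")

def handle_foo(seen):
    seen.append("foo")

# Keyword -> handler table; replaces the if/elif chain.
HANDLERS = {"rtf": handle_rtf, "foo": handle_foo}

def scan(buff):
    """Return the list of known keywords found after backslashes."""
    seen = []
    pos = 0
    while pos < len(buff):
        if buff[pos] != BSLASH:
            pos += 1          # ordinary character, skip it
            continue
        pos += 1              # step past the backslash
        m = keywd_matcher.match(buff, pos)
        if not m:
            continue          # backslash not followed by a keyword
        pos = m.end()
        handler = HANDLERS.get(m.group())
        if handler is not None:
            handler(seen)     # unknown keywords are silently ignored
    return seen

print(scan(r"{\rtf1\ansi\foo bar}"))  # ['rtf', 'foo']
```

Adding a keyword then means adding one entry to the table rather than another elif branch.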

HTH,
John



