Reading \n unescaped from a file

Friedrich Rentsch anthra.norell at bluewin.ch
Thu Sep 3 09:37:35 EDT 2015



On 09/03/2015 11:24 AM, Peter Otten wrote:
> Friedrich Rentsch wrote:
>
>>
>> On 09/02/2015 04:03 AM, Rob Hills wrote:
>>> Hi,
>>>
>>> I am developing code (Python 3.4) that transforms text data from one
>>> format to another.
>>>
>>> As part of the process, I had a set of hard-coded str.replace(...)
>>> functions that I used to clean up the incoming text into the desired
>>> output format, something like this:
>>>
>>>       dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>>       dataIn = dataIn.replace('<','<') # Tidy up < character
>>>       dataIn = dataIn.replace('>','>') # Tidy up > character
>>>       dataIn = dataIn.replace('o','o') # No idea why but lots of these: convert to 'o' character
>>>       dataIn = dataIn.replace('f','f') # .. and these: convert to 'f' character
>>>       dataIn = dataIn.replace('e','e') # ..  'e'
>>>       dataIn = dataIn.replace('O','O') # ..  'O'
>>>
>>> These statements transform my data correctly, but the list of statements
>>> grows as I test the data so I thought it made sense to store the
>>> replacement mappings in a file, read them into a dict and loop through
>>> that to do the cleaning up, like this:
>>>
>>>           with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>>               for line in mapFile:
>>>                   line = line.strip()
>>>                   try:
>>>                       if (line) and not line.startswith('#'):
>>>                           line = line.split('#')[:1][0].strip() # trim any trailing comments
>>>                           name, value = line.split('=')
>>>                           name = name.strip()
>>>                           self.filterMap[name]=value.strip()
>>>                   except:
>>>                       self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>>                       raise
>>>
>>> Elsewhere, I use the following code to do the actual cleaning up:
>>>
>>>       def filter(self, dataIn):
>>>           if dataIn:
>>>               for token, replacement in self.filterMap.items():
>>>                   dataIn = dataIn.replace(token, replacement)
>>>           return dataIn
>>>
>>>
>>> My mapping file contents look like this:
>>>
>>> \r = \\n
>>> “ = "
>>> < = <
>>> > = >
>>> ' = '
>>> F = F
>>> o = o
>>> f = f
>>> e = e
>>> O = O
>>>
>>> This all works "as advertised" */except/* for the '\r' => '\\n'
>>> replacement. Debugging the code, I see that my '\r' character is
>>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>>> the file.
>>>
>>> I've been googling hard and reading the Python docs, trying to get my
>>> head around character encoding, but I just can't figure out how to get
>>> these bits of code to do what I want.
>>>
>>> It seems to me that I need to either:
>>>
>>>     * change the way I represent '\r' and '\\n' in my mapping file; or
>>>     * transform them somehow when I read them in
>>>
>>> However, I haven't figured out how to do either of these.
>>>
>>> TIA,
>>>
>>>
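For the second option in the list above (transforming the values as they are read in), a minimal sketch could look like the following. The helper names unescape and parse_mapping_line are hypothetical, and the choice to recognise only \\, \n, \r and \t is an assumption about what the mapping file needs, not part of the original code. An explicit table is used so that non-ASCII entries in the file pass through untouched.

```python
import re

# Hypothetical helpers; only these four escape sequences are recognised.
_ESCAPES = {'\\\\': '\\', '\\n': '\n', '\\r': '\r', '\\t': '\t'}
_ESCAPE_RE = re.compile(r'\\[\\nrt]')


def unescape(text):
    """Turn the two-character sequences \\\\, \\n, \\r, \\t into the characters they name."""
    return _ESCAPE_RE.sub(lambda match: _ESCAPES[match.group(0)], text)


def parse_mapping_line(line):
    """Return (token, replacement) for a 'token = replacement' line, or None."""
    line = line.split('#', 1)[0].strip()  # drop comments and surrounding blanks
    if not line:
        return None
    name, value = line.split('=', 1)
    return unescape(name.strip()), unescape(value.strip())


# Usage: the mapping line "\r = \\n" yields a carriage return as the token
# and a literal backslash followed by 'n' as the replacement.
token, replacement = parse_mapping_line('\\r = \\\\n')
```

With this, the mapping file can keep the same notation as the hard-coded replace() calls.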
>> I have had this problem too and can propose a solution ready to run out
>> of my toolbox:
>>
>>
>> class editor:
>>
>>       def compile (self, replacements):
>>           targets, substitutes = zip (*replacements)
>>           re_targets = [re.escape (item) for item in targets]
>>           re_targets.sort (reverse = True)
>>           self.targets_set = set (targets)
>>           self.table = dict (replacements)
>>           regex_string = '|'.join (re_targets)
>>           self.regex = re.compile (regex_string, re.DOTALL)
>>
>>       def edit (self, text, eat = False):
>>           hits = self.regex.findall (text)
>>           nohits = self.regex.split (text)
>>           valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
> Can you give an example of an ignored target? I don't see the light...
>
>>           if valid_hits:
>>               substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
> That looks wrong...
>
>>               if eat:
>>                   output = ''.join (substitutes)
>>               else:
>>                   zipped = zip (nohits, substitutes)
>>                   output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>>           else:
>>               if eat:
>>                   output = ''
>>               else:
>>                   output = input
> ...and so does this.
>
>>           return output
>>
>>   >>> substitutions = (
>>       ('\r', '\n'),
>>       ('<', '<'),
>>       ('>', '>'),
>>       ('o', 'o'),
>>       ('f', 'f'),
>>       ('e', 'e'),
>>       ('O', 'O'),
>>       )
>>
>> Order doesn't matter. Add new ones at the end.
>>
>>   >>> e = editor ()
>>   >>> e.compile (substitutions)
>>
>> A simple way of testing is running the substitutions through the editor
>>
>>   >>> print e.edit (repr (substitutions))
>> (('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e',
>> 'e'), ('O', 'O'))
>>
>> The escapes need to be tested separately
>>
>>   >>> print e.edit ('abc\rdef')
>> abc
>> def
>>
>> Note: This editor's compiler compiles the substitution list to a regular
>> expression which the editor uses to find all matches in the text passed
>> to edit. There has got to be a limit to the size of a text which a
>> regular expression can handle. I don't know what this limit is. To be on
>> the safe side, edit a large text line by line or at least in sensible
>> chunks.
>>
>> Frederic
>>
>>
Peter, thanks for your comments.

valid_hits is the intersection of the targets compiled and the targets
matched. In the list comprehension it ensures that no match reaches the
lookup unless it is literally one of the substitution targets, as might
otherwise happen with repetition symbols (?, *, +). Although the compiler
escapes those, only omniscient developers know all contingencies in advance.
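
To illustrate the point about escaping, here is a minimal sketch of the same idea done in a single pass with re.sub and a callback. The name make_replacer is hypothetical and this is not the editor class quoted above. It assumes every target has been passed through re.escape, in which case each match is literally one of the targets and the table lookup cannot miss.

```python
import re


def make_replacer(replacements):
    """Build a function that applies all (target, substitute) pairs in one pass."""
    table = dict(replacements)
    # Longest targets first, so a longer target wins over a shorter prefix of it.
    targets = sorted(table, key=len, reverse=True)
    # re.escape makes every target match literally, including '?', '*' and '+'.
    pattern = re.compile('|'.join(re.escape(target) for target in targets))

    def replace(text):
        # Each match is exactly one of the targets, so it is a key of the table.
        return pattern.sub(lambda match: table[match.group(0)], text)

    return replace


# Usage: 'a*' is treated as the two literal characters, not as a repetition.
replace = make_replacer([('\r', '\n'), ('a*', 'X'), ('a', 'Y')])
result = replace('a*b\ra')
```

Sorting by length is a design choice for this sketch; it only matters when one target is a prefix of another.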

I appreciate your identifying two mistakes. I am curious to know what 
they are.

Frederic




