Unwanted Spaces and Iterative Loop

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jan 26 19:40:26 EST 2014


On Sun, 26 Jan 2014 13:46:21 -0800, matt.s.marotta wrote:

> I have been working on a python script that separates mailing addresses
> into different components.
> 
> Here is my code:
> 
> inFile = "directory"
> outFile = "directory"
> inHandler = open(inFile, 'r')
> outHandler = open(outFile, 'w')

Are you *really* opening the same file for reading and writing at the 
same time?

Even if your operating system allows that, surely it's not a good idea. 
You might get away with it for small files, but at some point you're 
going to run into weird, hard-to-diagnose bugs.
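
A minimal sketch of the safer arrangement, assuming Python 3 and that the input and output live at two different paths. The function name and the file names are made up for illustration; they are not from the original script.

```python
import os
import tempfile

def copy_lines(in_path, out_path):
    """Read in_path and write every line to out_path, a *different* file."""
    # Refuse to read and write the same file at the same time.
    if os.path.abspath(in_path) == os.path.abspath(out_path):
        raise ValueError("input and output must be different files")
    # The with statement closes both files, even if an error occurs.
    with open(in_path, "r") as in_handle, open(out_path, "w") as out_handle:
        for line in in_handle:
            out_handle.write(line)

# Demonstration with throwaway files in a temporary directory
# (hypothetical file names).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "addresses_in.txt")
dst = os.path.join(tmp_dir, "addresses_out.txt")
with open(src, "w") as f:
    f.write("FarmID\tAddress\n")
copy_lines(src, dst)
```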


> outHandler.write("FarmID\tAddress\tStreetNum\tStreetName\tSufType\tDir\tCity\tProvince\tPostalCode")

This looks like a CSV file using tabs as the separator. You really ought 
to use the csv module.

http://docs.python.org/3/library/csv.html
http://docs.python.org/2/library/csv.html

http://pymotw.com/2/csv/
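
As a rough illustration of what that looks like for tab-separated input, here is a sketch assuming Python 3. The in-memory `sample` object is only a stand-in for the real input file, filled with the sample line from the original post.

```python
import csv
import io

# Stand-in for the real input file (assumption: header line followed by
# one "FarmID TAB Address" record per line).
sample = io.StringIO(
    "FarmID\tAddress\n"
    "1\t1067 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0\n"
)

# delimiter="\t" tells the reader to split on tabs rather than commas,
# so the commas inside the address stay part of a single field.
reader = csv.reader(sample, delimiter="\t")
header = next(reader)  # the first row: the column names
rows = list(reader)    # one list of fields per remaining line
```

Reading the header with `next(reader)` also takes care of skipping it, so the loop over the remaining rows never sees it.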


> for line in inHandler:
>     str = line.replace("FarmID\tAddress", " ")
>     outHandler.write(str[0:-1])
>     str = str.replace(" ","\t", 1)
>     str = str.replace(" Rd,","\tRd\t\t")
>     str = str.replace(" Rd","\tRd\t")
>     str = str.replace("Ave,","\tAve\t\t") 
>     str = str.replace("Ave","\tAve\t\t")
>     str = str.replace("St ","\tSt\t\t")
>     str = str.replace("St,","\tSt\t\t")
>     str = str.replace("Dr,","\tDr\t\t")
      [snip additional string manipulations]
>     str = str.replace(",","\t")
>     str = str.replace(" ON","ON\t")
>     outHandler.write(str)


Aiy aiy aiy, what a mess! I get a headache just trying to understand it!

The first question that comes to mind is that you appear to be writing 
each input line *twice*, first after a very minimal set of string 
manipulations (you convert the literal string "FarmID\tAddress" to a 
space, then write the whole line out), the second time after a whole mess 
of string replacements. Why?

If the sample data you show below is accurate, I *think* what you are 
trying to do is simply suppress the header line. The first line in the 
input file is:

FarmID	Address

and rather than write that you want to write a space. I don't know why 
you want the output file to begin with a space, but this would be better:

for line in inHandler:
    # Remove any leading and trailing whitespace, including the
    # trailing newline. Later, we'll add a newline back in.
    line = line.strip()
    if line == "FarmID\tAddress":
        outHandler.write(" ")  # Write a mysterious space.
        continue  # And skip to the next line.
    # Now process the non-header lines.


Now, as for the non-header lines, you do a whole lot of complex string
manipulations, replacing chunks of text, with or without tabs or commas,
with the same text, with or without tabs, in a different order. The logic
of these manipulations completely escapes me: what are you actually trying
to do here?

I *strongly* suggest that you don't try to implement your program logic 
in the form of string manipulations. According to your sample data, your 
data looks like this:

1	1067 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0

i.e. 

farmId TAB address COMMA district COMMA postcode

It is much better to pull the line apart into named components, 
manipulate the components directly, then put it back together in the 
order you want. This makes the code more understandable, and easier to 
change if you ever need to change things.

for line in inHandler:
    line = line.strip()
    if line == "FarmID\tAddress":
        outHandler.write(" ")  # Write a mysterious space.
        continue
    # Now process the non-header lines.
    farmid, address = line.split("\t")
    farmid = farmid.strip()
    address, district, postcode = address.split(",")
    address = address.strip()
    district = district.strip()
    postcode = postcode.strip()
    # Now process the fields however you like.
    parts_of_address = address.split(" ")
    street_number = parts_of_address[0]  # first part
    street_type = parts_of_address[-1]  # last part
    street_name = parts_of_address[1:-1]  # everything else
    street_name = " ".join(street_name)

and so on for the post code. Then, at the very end, assemble the parts 
you want to write out, join them with tabs, and write:

    fields = [farmid, street_number, street_name, street_type, ... ]
    outHandler.write("\t".join(fields))
    outHandler.write("\n")
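
Putting the steps above together, here is a self-contained sketch for a single line, assuming Python 3. The function name, the way the last chunk is split into province and postal code, and the order of the output fields are all assumptions for illustration; adjust them to the columns you actually need.

```python
def split_address_line(line):
    """Split one 'farmid TAB address, district, postcode' line into fields."""
    farmid, rest = line.strip().split("\t")
    address, district, postcode = [part.strip() for part in rest.split(",")]
    parts = address.split(" ")
    street_number = parts[0]             # first part
    street_type = parts[-1]              # last part
    street_name = " ".join(parts[1:-1])  # everything else
    # Assumption: the last chunk looks like "ON L0S 1J0", i.e. a province
    # code, one space, then the postal code.
    province, postal_code = postcode.split(" ", 1)
    # Assumption: this field order is only an example.
    return [farmid.strip(), street_number, street_name, street_type,
            district, province, postal_code]

sample = "1\t1067 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0\n"
fields = split_address_line(sample)
output_line = "\t".join(fields) + "\n"
```

A line that does not have exactly one tab and two commas raises ValueError at the unpacking step, which is usually preferable to silently writing a mangled row.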


Or use the csv module to do the actual writing. It will handle escaping 
anything that needs escaping, newlines, tabs, etc.
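
A small sketch of the escaping behaviour, assuming Python 3. The in-memory buffer stands in for the real output file, and the field with an embedded tab is a made-up value chosen to trigger quoting.

```python
import csv
import io

buffer = io.StringIO()  # stand-in for the real output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["FarmID", "StreetNum", "StreetName"])
# Hypothetical awkward value: a field that itself contains a tab.
writer.writerow(["1", "1067", "Niagara\tStone"])
result = buffer.getvalue()

# Reading the text back with the same delimiter recovers the fields intact.
round_trip = list(csv.reader(io.StringIO(result), delimiter="\t"))
```

With the default quoting rules, the writer wraps the awkward field in double quotes, so a reader using the same delimiter gets the original value back. When writing to a real file in Python 3, the csv documentation recommends opening it with newline="".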



-- 
Steven


