Delete duplicate rows in textfile - except it contains a "{" or "}"

Peter Otten __peter__ at web.de
Wed Oct 10 06:32:11 EDT 2012


Joon Ki Choi wrote:

> 
> Hello Pythonistas,
> 
> I have a very large text file with contents like:
> 
> @INBOOK{Ackermann1999-b,
>   author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
>   Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
>   year = {1980},
>   timestamp = {1995-12-02}
> }
> 
> I want to delete the duplicate rows, except for the rows containing the
> brackets { or }. The result should look like:
> 
> @INBOOK{Ackermann1999-b,
>   author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
>   Ackermann,
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
>   year = {1980},
>   timestamp = {1995-12-02}
> }
> 
> I came across this Python script:
> 
> lines_seen = set() # holds lines already seen
> outfile = open("literatur_clean.txt", "w")
> for line in open("literatur_dupl.txt", "r"):
>     if line not in lines_seen: # not a duplicate
>         outfile.write(line)
>         lines_seen.add(line)
> outfile.close()
> 
> But it also deletes the lines with a closing bracket } and the lines with
> the same author data. Therefore I need the condition on the brackets.
> 
> Could someone point me to how to add this condition?
> 
> Thanks in advance,
> Joon

Not what you asked for, but here is something that is quick-and-dirty, too, 
but tries a bit harder:

import re

def unique(match):
    # strip the surrounding braces, then split the group on commas
    names = match.group()[1:-1].split(",")
    # normalize whitespace and drop exact repeats (a set, so the original
    # order of the pieces is not preserved)
    parts = set(" ".join(author.split()) for author in names)
    return "{%s}" % ", ".join(parts)

if __name__ == "__main__":
    with open("literatur_dupl.txt") as f:
        data = f.read()
    # rewrite every {...} group with its de-duplicated contents
    data = re.compile("{[^{}]*}", re.DOTALL).sub(unique, data)

    with open("literatur_clean.txt", "w") as f:
        f.write(data)
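
As a quick illustration of what the unique() callback does to a single brace
group (a toy line, not taken from the real file; the order of the pieces in
the output can vary because `parts` is a set):

import re

def unique(match):
    # same callback as in the script above
    names = match.group()[1:-1].split(",")
    parts = set(" ".join(author.split()) for author in names)
    return "{%s}" % ", ".join(parts)

sample = "author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann}"
print(re.compile("{[^{}]*}", re.DOTALL).sub(unique, sample))
# prints something like: author = {Ackermann, K.-F. and Ackermann}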

I'm assuming that "very large" means that the file contents still 
comfortably fit into your computer's memory...
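
For completeness, the condition asked about in the original post can also be
bolted onto Joon's line-by-line script directly. A minimal sketch (reusing the
file names from the post; any line containing "{" or "}" is written out
unconditionally and never recorded as seen):

lines_seen = set()  # holds ordinary lines already seen

with open("literatur_dupl.txt") as infile, \
     open("literatur_clean.txt", "w") as outfile:
    for line in infile:
        if "{" in line or "}" in line:
            # brace lines are never treated as duplicates
            outfile.write(line)
        elif line not in lines_seen:  # first occurrence of an ordinary line
            outfile.write(line)
            lines_seen.add(line)

This variant also processes the file one line at a time, so it does not need
to hold the whole file in memory.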



