Delete duplicate rows in textfile - except it contains a "{" or "}"
Peter Otten
__peter__ at web.de
Wed Oct 10 06:32:11 EDT 2012
Joon Ki Choi wrote:
>
> Hello Pythonistas,
>
> I have a very large text file with contents like:
>
> @INBOOK{Ackermann1999-b,
> author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
> year = {1980},
> timestamp = {1995-12-02}
> }
>
> And I want to delete the duplicate rows, except those rows containing the
> braces { or }. The result should look like:
>
> @INBOOK{Ackermann1999-b,
> author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann,
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
> year = {1980},
> timestamp = {1995-12-02}
> }
>
> I came across this Python script:
>
> lines_seen = set() # holds lines already seen
> outfile = open("literatur_clean.txt", "w")
> for line in open("literatur_dupl.txt", "r"):
>     if line not in lines_seen: # not a duplicate
>         outfile.write(line)
>         lines_seen.add(line)
> outfile.close()
>
> But it also deletes the lines with a closing brace } and the lines with
> the same author data. Therefore I need a condition on the braces.
>
> Could someone show me how to add this condition?
>
> Thanks in advance,
> Joon
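Not what you asked for, but here is something that is quick-and-dirty, too,
but tries a bit harder. It removes duplicates among the comma-separated parts
inside each {...} group instead of removing duplicate lines: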
import re

def unique(match):
    names = match.group()[1:-1].split(",")
    parts = set(" ".join(author.split()) for author in names)
    return "{%s}" % ", ".join(parts)

if __name__ == "__main__":
    with open("literatur_dupl.txt") as f:
        data = f.read()
    data = re.compile("{[^{}]*}", re.DOTALL).sub(unique, data)
    with open("literatur_clean.txt", "w") as f:
        f.write(data)
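Note that unique() above splits on commas and collects the parts in a set, so
the order of the output is arbitrary. A possible variant is sketched below. It
assumes that the authors are separated by the word "and" (as is usual in
BibTeX) and that the first occurrence of each name should stay in place. The
name unique_ordered and the sample string are made up for illustration; the
function is meant as a drop-in replacement for unique() in the script above.

```python
import re

def unique_ordered(match):
    # Assumption: names inside the braces are separated by the word "and".
    body = match.group()[1:-1]
    seen = set()
    kept = []
    for author in re.split(r"\s+and\s+", body):
        name = " ".join(author.split())  # collapse line breaks and extra spaces
        if name not in seen:
            seen.add(name)
            kept.append(name)  # keep the first occurrence, in original order
    return "{%s}" % " and ".join(kept)

# Hypothetical sample data, not taken from the real file.
sample = (
    "author = {Ackermann, K.-F. and Ackermann,\n"
    "  K.-F. and Meier, A.},\n"
    "year = {1980}"
)
cleaned = re.sub(r"\{[^{}]*\}", unique_ordered, sample)
print(cleaned)
```

Like the original, this rewrites every {...} group, not only the author field,
so a title that happens to contain " and " would be treated the same way.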
I'm assuming that "very large" means that the file contents still
comfortably fit into your computer's memory...
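If the file does not fit into memory, or if you really want the line-based
filter you asked for, the brace condition could be added roughly as follows.
This is only a sketch: dedupe_lines and clean_file are hypothetical helper
names, and the sample lines are invented for the demonstration.

```python
def dedupe_lines(lines):
    """Yield lines, skipping repeats unless the line contains a brace."""
    seen = set()  # brace-free lines already written
    for line in lines:
        if "{" in line or "}" in line:
            yield line  # lines with braces are always kept
        elif line not in seen:
            seen.add(line)
            yield line

def clean_file(src, dst):
    # Streams the input, so only the distinct brace-free lines are held in memory.
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(dedupe_lines(infile))

# Hypothetical sample data, not taken from the real file.
sample = [
    "@INBOOK{Ackermann1999-b,\n",
    "author = {Ackermann, K.-F. and\n",
    "Ackermann,\n",
    "K.-F. and Ackermann, K.-F.\n",
    "K.-F. and Ackermann, K.-F.\n",
    "Ackermann,\n",
    "year = {1980},\n",
    "}\n",
    "}\n",
]
result = list(dedupe_lines(sample))
print("".join(result), end="")
```

With the file names from your script the call would be
clean_file("literatur_dupl.txt", "literatur_clean.txt"). The set is never
reset, so a brace-free line that legitimately recurs in a later entry is
dropped as well; clearing it whenever a line starts with "@" might be closer
to what you want.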