Splitting Tree

Cameron Simpson cs at zip.com.au
Sun Dec 2 17:11:51 EST 2012


On 02Dec2012 07:02, subhabangalore at gmail.com <subhabangalore at gmail.com> wrote:
| On Sunday, December 2, 2012 5:39:32 PM UTC+5:30, subhaba... at gmail.com wrote:
| > I am using NLTK and I used the following command,
| > chunk=nltk.ne_chunk(tag)
| > 
| > print "The Chunk of the Line Is:",chunk
| > 
| > The Chunk of the Line Is: (S
| >   ''/''
| >   It/PRP
[...]
| > Now I am trying to split the output preferably by ",/,".
[...]
|
| Sorry to ask this. I converted in string and then splitted it.

I'm glad you solved your problem, but I would like to point out that
this is generally a risky way of manipulating data.

The problem arises if the string you're splitting on occurs as a literal
piece of text, but _not_ in the sense you intend. It may be the case
that it will not happen in your particular situation, but in general the
procedure:
  - convert structure to string somehow
  - perform simple text manipulation
  - unconvert
is at risk of simplistic parsing of the string.

A common example is with CSV data. Suppose you wanted the third
column from an array of tuples:

  rows = [ (1,2,"A",4),
           (5,6,"B",8),
           (9,10,"C,D",12),
         ]

and you wanted [ "A", "B", "C,D" ]. If one went with the "convert to
text" approach, and decided that converting each tuple to a CSV-style
data row was a good idea, you might write:

  column_3 = []
  for row in rows:
    csv_string = ",".join( str(item) for item in row )
    item3 = csv_string.split(",")[2]
    column_3.append(item3)

The (simplistic) code above will give you "C" from the third row, not
"C,D", because it naively assumes there are no commas in the data and
then does a simplistic textual split to find the third column.
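
For comparison, here is a minimal sketch that runs both approaches side
by side on the same rows. The names via_text and direct are just mine
for illustration; the point is only that indexing the tuple directly
never has to guess what a comma means.

```python
rows = [
    (1, 2, "A", 4),
    (5, 6, "B", 8),
    (9, 10, "C,D", 12),
]

# Round trip through text: the comma inside "C,D" is mistaken for a
# column separator, so the third field of the last row is cut short.
via_text = [",".join(str(item) for item in row).split(",")[2] for row in rows]

# Direct indexing: no text parsing at all, so nothing can be misread.
direct = [row[2] for row in rows]

print(via_text)  # ['A', 'B', 'C']
print(direct)    # ['A', 'B', 'C,D']
```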

Obviously you wouldn't really do that for something this simple; it is to
show the issue. But your situation, where manipulating a tree was tricky
and you converted it to a string, is very similar conceptually.
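
To make the tree case concrete, here is a rough sketch of splitting on
the comma token by walking the structure instead of its string form.
Note the assumptions: the nested list below is a hypothetical stand-in
for your chunk tree (a list is a subtree whose first element is its
label, a (word, tag) tuple is a leaf), not NLTK's own tree class, and
the sentence and the helper names leaves and split_leaves are invented
for illustration. With the real NLTK object you would use its own
traversal facilities in place of my leaves function.

```python
# Hypothetical stand-in for a chunk tree; not NLTK's Tree class.
tree = ["S",
        ("It", "PRP"), ("costs", "VBZ"), ("1,/,000", "CD"), (",", ","),
        ["PERSON", ("Anna", "NNP")], ("said", "VBD"), (",", ","),
        ("roughly", "RB")]

def leaves(node):
    """Yield the (word, tag) leaves of a nested-list tree, left to right."""
    for child in node[1:]:              # node[0] is the subtree label
        if isinstance(child, list):     # a subtree: descend into it
            for leaf in leaves(child):
                yield leaf
        else:                           # a (word, tag) leaf
            yield child

def split_leaves(node, separator=(",", ",")):
    """Group the leaves into runs, starting a new run at each separator leaf."""
    groups = [[]]
    for leaf in leaves(node):
        if leaf == separator:
            groups.append([])
        else:
            groups[-1].append(leaf)
    return groups

groups = split_leaves(tree)
for group in groups:
    print(group)
```

The leaf whose word is "1,/,000" stays in one piece here, because the
comparison is against a whole (word, tag) pair and not against a
substring of some printed form. Splitting the string on ",/," would
have cut it in two.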

Hoping this shows you the issue,
-- 
Cameron Simpson <cs at zip.com.au>

I'm not making any of this up you know. - Anna Russell


