[SciPy-User] Column extraction from line based text

Thøger Emil Rivera-Thorsen trive at astro.su.se
Fri May 2 06:52:42 EDT 2014


Maybe you could use the Pandas read_csv() function and give the '|'
character as delimiter; and then, for the offending columns, use the
'map()' method to split up the strings in the column and move them to a
separate column. This of course requires you to know which columns may
have offending content, and adress them individually.
Could be done like:

|import pandas as pd

table = pd.read_csv('filename.csv', 
                    delimiter='|', 
                    names=['number', 'street', 'city', 'state', 'zip'])

# Do something like the following for all columns with offending content
table['address2'] = table.street.map(lambda x: nan if len(x.split(',')) < 2 else x.split(',')[-1].strip())
table.street = table.street.map(lambda x: x.split(',')[0].strip())

# Now order columns to your liking:
table = table[['number', 'street', 'address2', 'city', 'state', 'zip']]

print table.to_string()
|

|   number                        street   address2           city state          zip
0                     One Bowling Green  3rd Floor   New York       NY         10004
1     228                    Park Ave S        NaN   New York       NY    10003-1502
2      85  West Street at Albany Street        NaN   New York       NY         10006
3  13031       West Jefferson Boulevard  Suite 200   Los Angeles    CA         90094
4    550                    S Flower St        NaN   Los Angeles    CA         90071
|

On 2014-05-02 11:56, Patrick Westphal wrote:

> Hi Athanasios,
>
> thanks for your answer. Unfortunately, this was just an example and
> the content is rather random. So fixed regexes won't help here. I
> thought of a solution, that takes the field type, (maybe field
> lengths) and number of fields into account to find homogeneities. In
> the given example, this would mean that it is detected, that
> - the line usually starts with a string that contains a number
> - usually followed by some inhomogeneous text (e.g. West Street at
> Albany Street  New York)
> - usually followed by a two letter word (e.g. NY)
> - and the line ends with a string, again containing numbers
> which would at least suffice to generate some of the desired columns, e.g.
>
> 85 | West Street at Albany Street  New York | NY | 10006
>
> But, I could try to generate regexes from the text lines and use
> numpy.fromregex. I'll think about that.
>
> Thanks!
>
> On 02.05.2014 11:24, Athanasios Anastasiou wrote:
>> Hello Patrick
>>
>> In general, you can use "Regular Expressions" to extract your fields
>> and re-format them in any way you like.
>>
>> I don't think that this is a "homogeneous" structure though. The
>> postal address seems to have been formatted following a specific
>> template of
>>
>> number (maybe character, maybe numeric), street name (could go up to
>> 3 space separated tokens) coma (sometimes) city, postcode
>>
>> Furthermore: street name -> street (any number of space separated
>> tokens) coma(always?) location (Floor, Suite, other? followed by a
>> number (always?))
>>
>> The city and postcode fields are the easiest to extract. For example,
>> you could establish a regular expression like ".*? [,]? ("Los
>> Angeles"|"New York"), [A-Z]{2}[0-9]{5}" which is roughly translated
>> to "Match in a non-greedy way everything up to an optional coma
>> followed by something that looks like a city name followed by a coma
>> followed by a postcode".
>>
>> You can extract such fields with FROMREGEXP
>> (http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromregex.html)
>> or pyparsing (http://pyparsing.wikispaces.com/)
>>
>> As you can see, some things have to be kept or assumed "fixed".
>>
>> For example, you might have to run a quick check of how many cities
>> you have in your dataset or at least, how many distinct words you
>> expect a city to occupy to be able to describe roughly how does a
>> city name (or other field) looks like. The same applies when you try
>> to fix the way that the location is being described. Is it always
>> Floor? Could it be for example "block, Block, Floor, floor,
>> Apartment, Apt, apt, apartment, room, Room, RM, rm, ....". Could it
>> be 1st Floor? Maybe Floor 1, Floor 2 and so on.
>>
>> Hope this helps.
>>
>> All the best.
>>
>>
>> On Fri, May 2, 2014 at 9:56 AM, Patrick Westphal
>> <patrick.westphal at informatik.uni-leipzig.de
>> <mailto:patrick.westphal at informatik.uni-leipzig.de>> wrote:
>>
>>     Hello Scipy users,
>>
>>     I have to deal with messy line based text files that have certain
>>     fields, e.g.
>>
>>     One Bowling Green, 3rd Floor New York, NY 10004
>>     228 Park Ave S New York, NY 10003-1502
>>     85 West Street at Albany Street  New York, NY 10006
>>     13031 West Jefferson Boulevard, Suite 200 Los Angeles, CA 90094
>>     550 S Flower St, Los Angeles, CA 90071
>>
>>     and I'd like to have these columns extracted
>>
>>           | One Bowling Green, 3rd Floor        | New York    | NY |
>>     10004
>>       228| Park Ave S                          | New York    | NY |
>>     10003-1502
>>        85| West Street at Albany Street        | New York    | NY | 10006
>>     13031 | West Jefferson Boulevard, Suite 200 | Los Angeles | CA |
>>     90094
>>       550 | S Flower St                         | Los Angeles | CA |
>>     90071
>>
>>     I searched quite a lot and found projects like Asciitable [0],
>>     but these
>>     were not built to find homogeneous column structures by combining
>>     multiple line tokens, as e.g. in 'One Bowling Green, 3rd Floor', but
>>     rather try to generate one column per token.
>>
>>     My question is, if you are aware of any projects that better suit my
>>     task, or if you could give me some hints where and what to search.
>>
>>     Thanks in advance,
>>     Patrick
>>
>>     [0] http://cxc.cfa.harvard.edu/contrib/asciitable/
>>     _______________________________________________
>>     SciPy-User mailing list
>>     SciPy-User at scipy.org <mailto:SciPy-User at scipy.org>
>>     http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

-- 
--------------------------
Thøger Emil Rivera-Thorsen
Ph.D. student
Stockholm University, 
Department of Astronomy
--------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20140502/66ae22a6/attachment.html>


More information about the SciPy-User mailing list