[SciPy-User] Column extraction from line based text
Thøger Emil Rivera-Thorsen
trive at astro.su.se
Fri May 2 06:52:42 EDT 2014
Maybe you could use the Pandas read_csv() function and give the '|'
character as delimiter; and then, for the offending columns, use the
'map()' method to split up the strings in the column and move them to a
separate column. This of course requires you to know which columns may
have offending content, and adress them individually.
Could be done like:
|import pandas as pd
table = pd.read_csv('filename.csv',
delimiter='|',
names=['number', 'street', 'city', 'state', 'zip'])
# Do something like the following for all columns with offending content
table['address2'] = table.street.map(lambda x: nan if len(x.split(',')) < 2 else x.split(',')[-1].strip())
table.street = table.street.map(lambda x: x.split(',')[0].strip())
# Now order columns to your liking:
table = table[['number', 'street', 'address2', 'city', 'state', 'zip']]
print table.to_string()
|
| number street address2 city state zip
0 One Bowling Green 3rd Floor New York NY 10004
1 228 Park Ave S NaN New York NY 10003-1502
2 85 West Street at Albany Street NaN New York NY 10006
3 13031 West Jefferson Boulevard Suite 200 Los Angeles CA 90094
4 550 S Flower St NaN Los Angeles CA 90071
|
On 2014-05-02 11:56, Patrick Westphal wrote:
> Hi Athanasios,
>
> thanks for your answer. Unfortunately, this was just an example and
> the content is rather random. So fixed regexes won't help here. I
> thought of a solution, that takes the field type, (maybe field
> lengths) and number of fields into account to find homogeneities. In
> the given example, this would mean that it is detected, that
> - the line usually starts with a string that contains a number
> - usually followed by some inhomogeneous text (e.g. West Street at
> Albany Street New York)
> - usually followed by a two letter word (e.g. NY)
> - and the line ends with a string, again containing numbers
> which would at least suffice to generate some of the desired columns, e.g.
>
> 85 | West Street at Albany Street New York | NY | 10006
>
> But, I could try to generate regexes from the text lines and use
> numpy.fromregex. I'll think about that.
>
> Thanks!
>
> On 02.05.2014 11:24, Athanasios Anastasiou wrote:
>> Hello Patrick
>>
>> In general, you can use "Regular Expressions" to extract your fields
>> and re-format them in any way you like.
>>
>> I don't think that this is a "homogeneous" structure though. The
>> postal address seems to have been formatted following a specific
>> template of
>>
>> number (maybe character, maybe numeric), street name (could go up to
>> 3 space separated tokens) coma (sometimes) city, postcode
>>
>> Furthermore: street name -> street (any number of space separated
>> tokens) coma(always?) location (Floor, Suite, other? followed by a
>> number (always?))
>>
>> The city and postcode fields are the easiest to extract. For example,
>> you could establish a regular expression like ".*? [,]? ("Los
>> Angeles"|"New York"), [A-Z]{2}[0-9]{5}" which is roughly translated
>> to "Match in a non-greedy way everything up to an optional coma
>> followed by something that looks like a city name followed by a coma
>> followed by a postcode".
>>
>> You can extract such fields with FROMREGEXP
>> (http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromregex.html)
>> or pyparsing (http://pyparsing.wikispaces.com/)
>>
>> As you can see, some things have to be kept or assumed "fixed".
>>
>> For example, you might have to run a quick check of how many cities
>> you have in your dataset or at least, how many distinct words you
>> expect a city to occupy to be able to describe roughly how does a
>> city name (or other field) looks like. The same applies when you try
>> to fix the way that the location is being described. Is it always
>> Floor? Could it be for example "block, Block, Floor, floor,
>> Apartment, Apt, apt, apartment, room, Room, RM, rm, ....". Could it
>> be 1st Floor? Maybe Floor 1, Floor 2 and so on.
>>
>> Hope this helps.
>>
>> All the best.
>>
>>
>> On Fri, May 2, 2014 at 9:56 AM, Patrick Westphal
>> <patrick.westphal at informatik.uni-leipzig.de
>> <mailto:patrick.westphal at informatik.uni-leipzig.de>> wrote:
>>
>> Hello Scipy users,
>>
>> I have to deal with messy line based text files that have certain
>> fields, e.g.
>>
>> One Bowling Green, 3rd Floor New York, NY 10004
>> 228 Park Ave S New York, NY 10003-1502
>> 85 West Street at Albany Street New York, NY 10006
>> 13031 West Jefferson Boulevard, Suite 200 Los Angeles, CA 90094
>> 550 S Flower St, Los Angeles, CA 90071
>>
>> and I'd like to have these columns extracted
>>
>> | One Bowling Green, 3rd Floor | New York | NY |
>> 10004
>> 228| Park Ave S | New York | NY |
>> 10003-1502
>> 85| West Street at Albany Street | New York | NY | 10006
>> 13031 | West Jefferson Boulevard, Suite 200 | Los Angeles | CA |
>> 90094
>> 550 | S Flower St | Los Angeles | CA |
>> 90071
>>
>> I searched quite a lot and found projects like Asciitable [0],
>> but these
>> were not built to find homogeneous column structures by combining
>> multiple line tokens, as e.g. in 'One Bowling Green, 3rd Floor', but
>> rather try to generate one column per token.
>>
>> My question is, if you are aware of any projects that better suit my
>> task, or if you could give me some hints where and what to search.
>>
>> Thanks in advance,
>> Patrick
>>
>> [0] http://cxc.cfa.harvard.edu/contrib/asciitable/
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org <mailto:SciPy-User at scipy.org>
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
--
--------------------------
Thøger Emil Rivera-Thorsen
Ph.D. student
Stockholm University,
Department of Astronomy
--------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20140502/66ae22a6/attachment.html>
More information about the SciPy-User
mailing list