[Tutor] My problem in simple terms

Mats Wichmann mats at wichmann.us
Fri Mar 22 10:53:39 EDT 2019


On 3/21/19 11:54 PM, Edward Kanja wrote:
> Greetings,
> I'm referring to the question I sent earlier; if you have a hint on how I
> can solve my problem I will really appreciate it. After running regular
> expressions using Python, my output has a lot of square brackets, i.e.
> [][][][][][][][][]. How do I substitute these with an empty string so as to
> have a clean output which I will later export to an Excel file?
> Thanks a lot.

I think you got the key part of the answer already: you're getting empty
lists as matches, which, when printed, look like []. Let's try to be more
explicit:

$ python3
Python 3.7.2 (default, Jan 16 2019, 19:49:22)
[GCC 8.2.1 20181215 (Red Hat 8.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> help(re.findall)

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.


re.findall *always* returns a list, even if there is no match.  If we
add more debug prints in your code so it looks like this:


import re

with open('unon.txt') as csvfile:
    for line in csvfile:
        print("line=", line)

        index_no = re.findall(r'(\|\s\d{5,8}\s)', line)
        print("index_no (type %s)" % type(index_no), index_no)

        names = re.findall(r'(\|[A-Za-z]\w*\s\w*\s\w*\s\w*\s)', line)
        print("names (type %s)" % type(names), names)
        # Address = re.findall(r'\|\s([A-Z0-9-,/]\w*\s\w*\s)', line)

        duty_station = re.findall(r'\|\s[A-Z]*\d{2}\-\w\w\w\w\w\w\w\s', line)
        print("duty_station (type %s)" % type(duty_station), duty_station)


You can easily see what happens as your data is processed - I ran this
on your data file and the first few times through looks like this:

line=
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

index_no (type <class 'list'>) []
names (type <class 'list'>) []
duty_station (type <class 'list'>) []
line= |Rawzeea NLKPP                         | VE11-Nairobi
               | 20002254-MADIZ                        | 00           |
00               |Regular Scheme B | 15-JAN-2019 To 31-DEC-2019 | No       |

index_no (type <class 'list'>) []
names (type <class 'list'>) ['|Rawzeea NLKPP   ']
duty_station (type <class 'list'>) ['| VE11-Nairobi ']
line=
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

index_no (type <class 'list'>) []
names (type <class 'list'>) []
duty_station (type <class 'list'>) []


You see each result of re.findall has given you a list, and most are
empty.  The first and third lines are separators, containing no useful
data, and you get no matches at all. The second line provided you with a
match for "names" and for "duty_station", but not for "index_no".  Your
code will need to be prepared for those sorts of outcomes.
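
One way to be prepared, and to get an empty string instead of [] in the
output, is to test whether the list is empty before using it. The following
is only a sketch: first_match is a hypothetical helper name, the sample line
is a shortened version of the second line shown above, and it assumes each
pattern has at most one capturing group, so every match is a plain string.

```python
import re


def first_match(pattern, line, default=""):
    """Return the first re.findall() result, or default if there is none.

    Assumes the pattern has at most one capturing group, so each match
    is a string rather than a tuple.
    """
    matches = re.findall(pattern, line)
    if not matches:
        return default
    # Drop the leading "|" and surrounding spaces the patterns capture.
    return matches[0].strip("| ")


# Shortened sample line, for illustration only.
line = ("|Rawzeea NLKPP                         | VE11-Nairobi"
        "               | 20002254-MADIZ |")

index_no = first_match(r'(\|\s\d{5,8}\s)', line)
names = first_match(r'(\|[A-Za-z]\w*\s\w*\s\w*\s\w*\s)', line)
duty_station = first_match(r'\|\s[A-Z]*\d{2}\-\w\w\w\w\w\w\w\s', line)

print(repr(index_no))      # ''
print(repr(names))         # 'Rawzeea NLKPP'
print(repr(duty_station))  # 'VE11-Nairobi'
```

An empty list is false in a boolean test, so "if not matches" covers the
no-match case without comparing against [] explicitly.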

Just looking at the data, it's table data, presumably from a
spreadsheet, but it is not in a format that is easy to process, because
an individual line is not a complete entry. A separator line of all
dashes seems to be the marker between complete entries, which then take
up 14 lines, including additional marker lines which follow slightly
different patterns: they may contain | marks or leading spaces.
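
If that reading of the file is right, the lines could be grouped into
entries before any regular expressions are applied. This is a sketch under
that assumption: split_records is a hypothetical name, the sample lines are
invented, and only lines made up entirely of dashes are treated as entry
separators (marker lines containing | stay inside the entry).

```python
def split_records(lines):
    """Yield one list of lines per entry, split at all-dash lines."""
    record = []
    for line in lines:
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            # Separator between entries: hand back what was collected.
            if record:
                yield record
            record = []
        elif stripped:
            record.append(line.rstrip("\n"))
    # The last entry may not be followed by a separator.
    if record:
        yield record


# Invented sample input, for illustration only.
sample = [
    "----------\n",
    "|a | b |\n",
    "|--------|\n",
    "|c | d |\n",
    "----------\n",
    "|e | f |\n",
]

records = list(split_records(sample))
print(len(records))  # 2
print(records[0])    # ['|a | b |', '|--------|', '|c | d |']
```

Since a file object is iterable line by line, the same function should also
accept the open file directly, e.g. split_records(csvfile).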

You will need to decide how regular your table data is and how to work
with it. Most examples of handling table data assume that one row is a
complete entry, so you probably won't find a lot of information on this.
In your case I'm looking at line 2 containing 8 fields, line 4
containing 9 fields, line 6 containing 10 fields, and then lines 8-14
being relatively free-form and spanning multiple lines.
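
For the lines that do hold fields, splitting on the | character may be
simpler than writing one regular expression per field. A minimal sketch,
with split_fields as a hypothetical name and the row being the second line
shown above with its padding removed:

```python
def split_fields(row):
    """Split a '|'-delimited table row into a list of stripped fields."""
    # Remove surrounding whitespace, then the outer "|" characters,
    # so no empty field appears at either end.
    inner = row.strip().strip("|")
    return [field.strip() for field in inner.split("|")]


# The second line of the output above, with padding removed.
row = ("|Rawzeea NLKPP | VE11-Nairobi | 20002254-MADIZ | 00 | 00 "
       "|Regular Scheme B | 15-JAN-2019 To 31-DEC-2019 | No |")

fields = split_fields(row)
print(len(fields))  # 8
print(fields[1])    # VE11-Nairobi
```

Note that strip("|") removes every leading and trailing | character, so a
row that legitimately starts or ends with an empty field would lose it.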

Is there any chance you can generate your data file in a different way
to make it easier to process?




