[Tutor] Fwd: Data pattern query.

Mon Jan 7 10:49:22 EST 2019

Sending to list. Please use reply-all when responding to the list.

-------- Original Message --------
 Subject: RE: [Tutor] Data pattern query.
 From: mhysnm1964 at gmail.com
 Sent: Monday, 7 January 2019 11:58
 To: 'Alan Gauld' <alan.gauld at yahoo.co.uk>
 CC: 

Alan,

Thanks for responding. I am storing the spreadsheet into a dictionary using
a two dimension list. The key for the dictionary is the sheet name. Possibly
in the future I might just have a simple list. 6 columns across with
variable rows. Max is over 1000 rows.

Below is the code:

import openpyxl # module to import spreadsheet.
# module found on the net to remove duplicates.
from  more_itertools import unique_everseen 

# Loding the workbook and names for the sheets.
wb = openpyxl.load_workbook("test.xlsx")
# Storing sheet names into a list for the loop.
wss = wb.get_sheet_names()

# defining variables for data structure
transaction = {} # dictionary of bank transactions.
description = [] # the descriptions of transactions

for ws_name in wss:
	# sheet being worked upon.
	ws = wb.get_sheet_by_name(ws_name) 
	data = []# temp variable to store 2 dimension list.
	for column in ws.columns:
		c = [] # temp variable for current column 
		for cell in column:
			c.append(cell.value) # is this correct?
		# end for cell loop 
		if c[0] == "Description":
		#Populating the description list
			description.extend(c)
		data.append(c)  # creating two dimension list
	# end for for column 
	# creating dictionary with 2 dimension list 
	transaction[ws_name] = data 
#end for 

/*** the below code is not working ***/

If I sort the description list. I get untyped errors. When I use the debug.
The items (elements) in the description list at the end are 'None" value.
Not sure if this is due to the location of defining the empty list or the
method of using append with the list c. If the below code could sort, then
the remaining lines would perform the duplicate on the whole value of the
item. While I want to identify duplicates with the vendor name only.

description.sort()
for result in unique_everseen (description):
	print (result)

Some example data structure. I have modified for privacy reasons.

PAY/SALARY FROM vendor-name 
EFTPOS vendor-name ####HORNSBY      country-code
bank-name BANKING FUNDS TFER TRANSFER ######  TO ######
vendor-name suburb state 
vendor-name #### suburb 
DEPOSIT vendor-name PAYMENT        ######## state-name
WITHDRAWAL BY EFTPOS #### vendor-name #### suburb 12/06 
DEPOSIT-vendor-name Aust organisation_name #########

-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Alan Gauld via Tutor
Sent: Monday, 7 January 2019 8:20 PM
To: tutor at python.org
Subject: Re: [Tutor] Data pattern query.

On 07/01/2019 02:38, mhysnm1964 at gmail.com wrote:

> All the descriptions of the transactions are in a single column. I am 
> trying to work out the easiest method of identifying the same pattern 
> of text in the fields.

What does a singe column mean? That presumably is how it appears in the
spreadsheet? But how is it stored in your Python code? A list? a list of
lists? a dictionary?

We don't know what your data looks like.
Post a sample along with an explanation of how it is structured.

In general when looking for patterns in text a regular expression is the
tool of choice. But only if you know what the pattern looks like.
Identifying patterns as you go is a much more difficult challenge

> Then I am going to group these vendors by categories. 

And how do you categorize them? Is the category also in the data or is it
some arbitrary thing that you have devised?

> In the field, there is the vendor name, suburb/town, type of transaction,
etc.

etc is kind of vague!
Show us some data and tel;l us which field is which.
Without that its difficult to impossible to tell you how to extract
anything!

The important thing is not how it looked in the spreadsheet but how it looks
now you have it in Python.

> How can I teach the program to learn new vendor names? 

Usually you would use a set or dictionary and add new names as you find
them.

> I was thinking of removing all the duplicate entries

Using a set would do that for you automatically

> Was thinking of using dictionaries for this. 
> But not sure if this is the best approach. 

If you make the vendor name the key of a dictionary then it has the same
effect as using a set. But whether a set or dict is best depends on what
else you need to store. If its only the vendor names then a set is best. If
you want to store associated data then a dict is better.

You need to be much more specific about what your data looks like, how you
identify the fields you want, and how you will categorize them.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

_______________________________________________
Tutor maillist  -  Tutor at python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor