How to escape # hash character in regex match strings

Brian D briandenzer at gmail.com
Thu Jun 11 10:25:18 EDT 2009


On Jun 11, 9:22 am, Brian D <brianden... at gmail.com> wrote:
> On Jun 11, 2:01 am, Lie Ryan <lie.1... at gmail.com> wrote:
>
>
>
> > 504cr... at gmail.com wrote:
> > > I've encountered a problem with my RegEx learning curve -- how to
> > > escape hash characters # in strings being matched, e.g.:
>
> > > >>> string = re.escape('123#abc456')
> > > >>> match = re.match('\d+', string)
> > > >>> print match
> > > <_sre.SRE_Match object at 0x00A6A800>
> > > >>> print match.group()
> > > 123
>
> > > The correct result should be:
>
> > > 123456
>
> > > I've tried to escape the hash symbol in the match string without
> > > result.
>
> > > Any ideas? Is the answer something I overlooked in my lurching Python
> > > schooling?
>
> > As you're not being clear on what you wanted, I'm just guessing this is
> > what you wanted:
>
> > >>> s = '123#abc456'
> > >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> > '123456'
> > >>> s = '123#this is a comment and is ignored456'
> > >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> > '123456'
>
> Sorry I wasn't clearer. I really appreciate your reply. It provides
> half of what I'm hoping to learn. The hash character is actually a
> useful hook for identifying a data entity in a scraping routine I'm
> developing, but it is not a character I want in the scrubbed data.
>
> In my application, the hash distinguishes a string of alphanumeric
> characters from other alphanumeric strings. The strings I'm looking
> for are manually entered identifiers; a real machine-created
> identifier shouldn't contain the hash character. The correct form is
> 'A1234509', but it is often entered as just '#12345', since the first
> character (a letter representing the month in alphabetical sequence)
> and the last two characters (a two-digit year) can be assumed.
> Matching the hash character in a RegEx is a way of trapping the
> string and transforming it into its correct machine-generated form.
>
> I'm surprised it's been so difficult to find an example of the hash
> character in a RegEx string -- for exactly this type of situation,
> since it's so common in the real world that people want to put a pound
> symbol in front of a number.
>
> Thanks!

By the way, here are other forms the manually entered strings can
take:

A#12345
#1234509
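
For what it's worth, here is a minimal sketch of matching these forms.
As far as I can tell, '#' has no special meaning in an ordinary
pattern, so it needs no escaping; it only starts a comment when the
pattern is compiled with re.VERBOSE. Note also that re.escape() is
meant for building patterns, not for the string being searched. The
names HASH_ID and VERBOSE_ID are made up for illustration, and the
code uses Python 3 print() syntax rather than the Python 2 statements
quoted above.

```python
import re

# '#' is an ordinary character in a normal pattern: no backslash needed.
HASH_ID = re.compile(r'([A-Z]?)#(\d+)')

for raw in ['#12345', 'A#12345', '#1234509']:
    m = HASH_ID.match(raw)
    print(raw, '->', m.groups())

# With re.VERBOSE an unescaped '#' begins a comment, so it must be
# escaped (or written as [#]) to be matched literally.
VERBOSE_ID = re.compile(r'''
    ([A-Z]?)   # optional month letter
    \#         # literal hash character
    (\d+)      # the digits that follow
''', re.VERBOSE)
```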

Garbage in, garbage out -- I know. I wish I could tell the people
entering the data how challenging it is to work with what they
provide, but it is, after all, a screen-scraping routine.
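
A rough sketch of the transformation I have in mind, under some loud
assumptions: the serial part is always exactly five digits, the month
letter is a single uppercase letter, and the defaults 'A' and '09' are
placeholders for whatever month and year can be assumed at scrape
time. MANUAL_ID and normalize() are hypothetical names, not part of
any existing routine.

```python
import re

# Assumed layout: optional month letter, '#', five-digit serial,
# optional two-digit year. The five-digit width is an assumption.
MANUAL_ID = re.compile(r'^([A-Z])?#(\d{5})(\d{2})?$')

def normalize(raw, month='A', year='09'):
    """Return the machine-style identifier, or None if raw has no hash form."""
    m = MANUAL_ID.match(raw.strip())
    if m is None:
        return None
    letter, serial, yy = m.groups()
    # Fill in the assumed month letter and year when they were left out.
    return (letter or month) + serial + (yy or year)

for raw in ['#12345', 'A#12345', '#1234509']:
    print(raw, '->', normalize(raw))
```

Strings without a hash return None, so identifiers that are already
in machine form pass through the scraper untouched.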


