[omaha] Web scraping and funky characters

Wes Turner wes.turner at gmail.com
Wed Nov 5 08:01:18 CET 2014


If it is a charset issue, these packages and documentation links may be
helpful for working with Unicode characters and HTML named entities in
email with Python 2.7:


* http://dev.w3.org/html5/html-author/charref
* http://www.w3.org/html/wg/drafts/html/master/syntax.html#named-character-references
* https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

* https://pypi.python.org/pypi/namedentities
* https://bitbucket.org/jeunice/namedentities

* https://docs.python.org/2/howto/unicode.html
* http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

* https://pypi.python.org/pypi/chardet
* https://github.com/chardet/chardet
* https://chardet.github.io/chardet/
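
As a rough sketch of how the character references described in the links
above could be decoded before searching or sending: the standard library
can turn a reference such as &#215 into the Unicode character it stands
for. This assumes Python 3.4+ for html.unescape; the 2.7 fallback shown
is stricter and, as far as I know, may leave references without a
trailing semicolon untouched. The name decode_entities and the sample
string are only illustrative.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2.7 fallback (assumption: only handles "&...;" forms)
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Replace HTML character references with the characters they name."""
    return unescape(text)


# Hypothetical scraped text containing a numeric and a named reference.
sample = u"Lot size: 50 &#215 100 ft &amp; up"
decoded = decode_entities(sample)
```

The result is a Unicode string, so it still has to be encoded (for
example as UTF-8) at the point where it leaves the program.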




On Tue, Nov 4, 2014 at 11:57 PM, Jeff Hinrichs - DM&T <jeffh at dundeemt.com>
wrote:

> If the character(s) in question don't convey meaning/context, you could
> potentially replace/remove those characters.  Something like:
>
> def scrub(buf):
>     # strip each unwanted sequence from the buffer
>     bad_things = ['&#215', ...]
>     for bad in bad_things:
>         buf = buf.replace(bad, '')
>     return buf
>
> or it could be a replacement
>
>     censored = {'sad': 'happy', ...}
>     # swap each unwanted sequence for its substitute
>     for bad, good in censored.items():
>         buf = buf.replace(bad, good)
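
If the target really has to be plain ASCII (for example the text
messages), one possible variation on the replacement idea is to apply a
small hand-written table first and then let Unicode normalization drop
or simplify whatever is left. This is only a sketch: the function name
to_ascii, the table contents, and the sample string are made up for
illustration, and characters with no ASCII decomposition are silently
discarded.

```python
import unicodedata


def to_ascii(text, replacements):
    """Apply explicit substitutions, then reduce the rest to ASCII."""
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # NFKD splits accented letters into base letter + combining mark,
    # and expands compatibility characters (e.g. the trademark sign).
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical table: map the multiplication sign to a plain "x".
replacements = {u"\u00d7": u"x"}
plain = to_ascii(u"Caf\u00e9 lot: 50 \u00d7 100 \u2122", replacements)
```

The explicit table matters because the multiplication sign has no
decomposition, so without it that character would simply vanish.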
>
>
> ...
>
>
>
> On Tue, Nov 4, 2014 at 10:24 PM, Bob Haffner <bob.haffner at gmail.com>
> wrote:
>
> > Hey Jeff, thanks for the input.
> >
> > I'm using requests.  As for the message, it's plain text.  I'm assuming I
> > have to do that because I'm sending a mix of texts and emails.
> >
> > When I did encode it, the crashes stopped, but the funky output persisted.
> >
> > Yeah, I think that's exactly what's going on.  Meaning, the copying and
> > pasting of some cool chars.
> >
> >
> >
> >
> >
> > On Tue, Nov 4, 2014 at 9:25 PM, Jeff Hinrichs - DM&T <jeffh at dundeemt.com>
> > wrote:
> >
> > > What library are you using?  Built-in, Beautiful Soup, requests?  And
> > > are you sending the information in a plain-text email or an HTML email?
> > > Plain-text won't be able to display extended characters without some
> > > encoding in the message.  Others can chime in here, but it seems that
> > > you either have to transmit those chars in a format that can handle them
> > > or, if you are using plain-text, remove them completely or substitute
> > > another character for them.  Ahh, the joys of glyphs - or forget English,
> > > everyone speak ASCII :)
> > >
> > > Almost as enjoyable as users who cut and paste random things from the
> > > internet and then save them to an accounting database via Windows, which
> > > speaks cp1252. :p
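
On the point about plain-text mail needing some encoding in the message:
a minimal sketch of building a plain-text message that declares UTF-8,
using the standard library's email package. It assumes Python 3, where
MIMEText accepts a str body; on 2.7 passing body.encode('utf-8') is
probably the safer form. The addresses, subject, and body are
placeholders, and the actual sending step (smtplib) is left out.

```python
from email.mime.text import MIMEText

# Hypothetical body containing a non-ASCII character (multiplication sign).
body = u"Lot size: 50 \u00d7 100 ft"

# The third argument declares the charset, so the receiving mail client
# knows how to decode the payload.
msg = MIMEText(body, "plain", "utf-8")
msg["Subject"] = "Daily listing check"
msg["From"] = "sender@example.com"    # placeholder address
msg["To"] = "recipient@example.com"   # placeholder address

# Serialized form, suitable for passing to smtplib's sendmail().
wire_text = msg.as_string()
```

Calling .encode('utf-8') on the text alone avoids the crash but does not
tell the recipient which charset was used, which would be consistent
with the characters still coming across funny.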
> > >
> > > On Tue, Nov 4, 2014 at 9:14 PM, Bob Haffner <bob.haffner at gmail.com>
> > > wrote:
> > >
> > > > Hey gang,
> > > >
> > > > I've got a script running in 2.7 that checks a website daily, finds some
> > > > info and then sends some messages (email and text) through Gmail.
> > > >
> > > > Pretty simple and runs without any problems... most of the time.
> > > >
> > > > When it does have problems it's usually because of some funky character
> > > > (ex: &#215) in the HTML.  These cause problems when searching for
> > > > keywords or with the Gmail portion.
> > > >
> > > > Doing a .encode('utf-8') seemed to help with the crashing while sending,
> > > > but the characters still come across funny.
> > > >
> > > > Any advice?
> > > >
> > > > -bob
> > > > _______________________________________________
> > > > Omaha Python Users Group mailing list
> > > > Omaha at python.org
> > > > https://mail.python.org/mailman/listinfo/omaha
> > > > http://www.OmahaPython.org
> > > >
> > >
> > >
> > >
> > > --
> > > Best,
> > >
> > > Jeff Hinrichs
> > > 402.218.1473
> > >
> >
>
>
>
>



-- 
Wes Turner
https://westurner.github.io/

