[omaha] Web scraping and funky characters
Wes Turner
wes.turner at gmail.com
Wed Nov 5 08:01:18 CET 2014
If it is a charset issue, these packages and documentation links may be
helpful for working with Unicode characters and HTML named entities in
email with Python 2.7:
* http://dev.w3.org/html5/html-author/charref
* http://www.w3.org/html/wg/drafts/html/master/syntax.html#named-character-references
* https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
* https://pypi.python.org/pypi/namedentities
* https://bitbucket.org/jeunice/namedentities
* https://docs.python.org/2/howto/unicode.html
* http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
* https://pypi.python.org/pypi/chardet
* https://github.com/chardet/chardet
* https://chardet.github.io/chardet/
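
As a rough illustration of what those links cover, here is a minimal sketch
that decodes scraped bytes to unicode and then resolves named entities such
as &times; using only the standard library. The function name
html_bytes_to_text and the sample page bytes are made up for the example,
and it assumes the page really is UTF-8 (with requests, the raw bytes would
be response.content). If the encoding is unknown, chardet can be used to
guess it first.

```python
from __future__ import unicode_literals  # harmless on Python 3

try:
    from html import unescape  # Python 3.4+
except ImportError:  # Python 2.7 fallback
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def html_bytes_to_text(raw, encoding="utf-8"):
    """Decode scraped bytes to unicode, then resolve entities like &times;."""
    # "replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = raw.decode(encoding, "replace")
    return unescape(text)


# Hypothetical sample: a named entity, an escaped ampersand, and UTF-8 bytes.
page = b"Size: 5 &times; 3 &amp; up, caf\xc3\xa9"
text = html_bytes_to_text(page)
print(repr(text))
```

Keeping everything as unicode until the last moment, and encoding only when
writing to the wire, tends to avoid most of the surprises.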
On Tue, Nov 4, 2014 at 11:57 PM, Jeff Hinrichs - DM&T <jeffh at dundeemt.com> wrote:
> If the character(s) in question don't convey meaning/context, you could
> potentially replace/remove those characters. Something like:
>
> def scrub(buf):
>     bad_things = ['×', ...]
>     for bad in bad_things:
>         buf = buf.replace(bad, '')
>     return buf
>
> or it could be a replacement
>
> censored = {'sad': 'happy', ...}
> for bad in censored.keys():
>     buf = buf.replace(bad, censored[bad])
>
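
Inline: a variation on the two snippets above, as a sketch only. It applies
an explicit replacement table first, then uses unicodedata to strip accents
and drops whatever still is not ASCII. The REPLACEMENTS table and the
to_ascii name are hypothetical choices for this example; the right
substitutions depend on what actually shows up in the scraped pages.

```python
import unicodedata

# Hypothetical substitutions for characters with no ASCII decomposition.
REPLACEMENTS = {
    "\u00d7": "x",   # multiplication sign
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def to_ascii(text, replacements=REPLACEMENTS):
    """Return an ASCII-only approximation of a unicode string."""
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (including the combining marks) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("caf\u00e9 5 \u00d7 3")
```

This is lossy by design, so it probably only makes sense for the text
message side, where plain ASCII is the safest bet.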
>
> ...
>
>
>
> On Tue, Nov 4, 2014 at 10:24 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
>
> > Hey Jeff, thanks for the input.
> >
> > I'm using requests. As far as the message, plain text. I'm assuming I
> > have to do that because I'm sending a mix of texts and emails.
> >
> > When I did encode it the crashes stopped, but the funky output persisted.
> >
> > Yeah, I think that's exactly what's going on. Meaning, the copying and
> > pasting of some cool chars.
> >
> >
> >
> >
> >
> > On Tue, Nov 4, 2014 at 9:25 PM, Jeff Hinrichs - DM&T <jeffh at dundeemt.com> wrote:
> >
> > > What library are you using? Built-in, Beautiful Soup, requests? And then
> > > are you sending the information in a plain-text email or an HTML email?
> > > Plain-text won't be able to display extended characters without some
> > > encoding in the message. Others can chime in here, but it seems that you
> > > either have to transmit those chars in a format that can handle them or,
> > > if you are using plain-text, remove them completely or substitute another
> > > character for them. Ahh, the joys of glyphs - or forget English, everyone
> > > speak ASCII :)
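
Inline: on the "some encoding in the message" point, a plain-text email can
carry non-ASCII characters as long as its headers declare the charset. A
minimal sketch using the standard library email package follows, written
for Python 3 (on 2.7 the body would likely need to be passed as
UTF-8-encoded bytes). The build_plain_message helper and the example.com
addresses are placeholders, not anything from Bob's script.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_plain_message(subject, body, sender, recipient):
    """Build a plain-text message whose headers declare UTF-8."""
    # The third argument sets charset="utf-8" on the Content-Type header.
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = Header(subject, "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    return msg


msg = build_plain_message(
    "Daily check",
    "Found: 5 \u00d7 3",
    "sender@example.com",
    "recipient@example.com",
)
# as_string() yields an ASCII-safe serialization suitable for smtplib.
wire = msg.as_string()
```

Calling .encode('utf-8') on the body alone stops the crash but tells the
receiving client nothing about the charset, which may be why the characters
still come across funny.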
> > >
> > > Almost as enjoyable as users who cut and paste random things from the
> > > internet and then save them to an accounting database via Windows, which
> > > speaks cp1252. :p
> > >
> > > On Tue, Nov 4, 2014 at 9:14 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
> > >
> > > > Hey gang,
> > > >
> > > > I got a script running in 2.7 that checks a website daily, finds some
> > > > info and then sends some messages (email and text) through Gmail.
> > > >
> > > > Pretty simple and runs without any problems... most of the time.
> > > >
> > > > When it does have problems it's usually because of some funky character
> > > > (ex: ×) in the HTML. These cause problems when searching for keywords or
> > > > with the Gmail portion.
> > > >
> > > > Doing a .encode('utf-8') seemed to help with the crashing while sending,
> > > > but the characters still come across funny.
> > > >
> > > > Any advice?
> > > >
> > > > -bob
> > > > _______________________________________________
> > > > Omaha Python Users Group mailing list
> > > > Omaha at python.org
> > > > https://mail.python.org/mailman/listinfo/omaha
> > > > http://www.OmahaPython.org
> > > >
> > >
> > >
> > >
> > > --
> > > Best,
> > >
> > > Jeff Hinrichs
> > > 402.218.1473
>
>
>
> --
> Best,
>
> Jeff Hinrichs
> 402.218.1473
--
Wes Turner
https://westurner.github.io/