[Baypiggies] BayPiggies meeting next Thursday (Sept 22): Debugging, Scraping, and NLP

Fri Sep 16 16:07:46 EDT 2022

BayPiggies Sept 22, 2022 7:00 pm - 8:30 pm PDT (online)
This month, we'll have a lightning talk from Ryan Kuhl on debugging and a
full talk from Stephen McInerney on Web scraping and NLP. We hope that you
can join us!

*Lightning Talk: Debugging with ipdb*
*Speaker:* Ryan Kuhl
*Speaker Bio:*
Ryan is a Miami-based software engineer at Tatari, co-founder of Public
Sector ML, and student at Georgia Institute of Technology. Ryan has been
programming professionally with Python for 9 years and loves to build
performant APIs and chunky SQL queries! When not programming for work, he's
studying machine learning and quantum computing. Connect with Ryan via email
at ryan at kuhl.dev, LinkedIn at linkedin.com/in/kuhl or GitHub at
GitHub.com/lame.

*Main Talk: NLP, Topic Modeling and Scraping of conference talks to find
which topics are hot and not*
*Speaker:* Stephen McInerney
NLP (Natural Language Processing) and Topic Modeling are subdomains of
Machine Learning which are core technologies for Python data scientists;
and the automated collection of data by Scraping (in a TOS-compliant,
ethical way) is a rarely-discussed practice. Outline:

   - Review the basic steps, present a typical pipeline for
   Scraping+NLP+Topic Modeling and cover packages used
   - As a motivating example, we investigate changes in Python conference
   topics 2016-2022, and statistically extract conclusions on what's hot and
   not, as of 2022
   - We also handle foreign-language abstracts and outline how machine
   translation can be used for Topic Modeling
   - We illustrate best practices in Scraping on text data, maximally
   preserving and augmenting with metadata
   - Review the NLP steps of a typical pipeline (segmentation, handling
   Unicode, Levenshtein distance, word vectors, Transformers, NER, IE)
   - Overview of related NLP/ML/Deep Learning packages we use both for
   prototyping and production
   - Topic Modeling using LDA is a highly iterative clustering process to
   "learn" which topics seem to be similar/related/identical/different
   - In this specific case, we augment conference abstracts with whatever
   metadata is helpful to topic-modeling e.g. speaker interests, affiliation,
   links to Twitter
   - Example: "token" means an entirely different topic when it co-occurs
   with "crypto"/"blockchain"/"web3" versus when it co-occurs with
   "API"/"authentication"/"appsec"/"2FA"/"OAuth". But how do we automatically
   learn hundreds and then thousands of such cases?
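
To make the "token" example concrete, here is a minimal sketch (not the
speaker's code) that counts which words share an abstract with a target
word. The abstracts and the function name are made up for illustration, and
it uses only the Python standard library; a real pipeline would presumably
use proper tokenization and a topic model such as LDA instead.

```python
from collections import Counter

# Hypothetical toy abstracts, invented for this sketch (not real talk data).
ABSTRACTS = [
    "minting a token on the blockchain for web3 apps",
    "crypto token economics and blockchain scaling",
    "securing an API with token based authentication and OAuth",
    "2FA and token refresh for API authentication",
]


def cooccurrence_counts(docs, target):
    """Count words appearing in the same document as `target`.

    Each word is counted at most once per document, and the target
    itself is excluded from the result.
    """
    counts = Counter()
    for doc in docs:
        words = set(doc.lower().split())  # naive whitespace tokenization
        if target in words:
            counts.update(words - {target})
    return counts


counts = cooccurrence_counts(ABSTRACTS, "token")
print(counts["blockchain"], counts["authentication"])
```

Even on four toy abstracts, "token" splits into a blockchain neighborhood
and an authentication neighborhood; doing this by hand does not scale, which
is the motivation for learning such groupings automatically.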

*Speaker Bio:* Stephen McInerney
Data scientist and NLP specialist for over a decade, specializing in
domain-specific (biotech/legal/financial) and multilingual NLP, in both
startups and large companies. Kaggle competitor; has led "Kaggle Together"
classes. Former Data Science co-chair of SF Bay Area ACM and organizer of
multiple Data Science Camps. Passionate about open-source.
www.linkedin.com/in/stephenmcinerney

*RSVP*
We will conduct the meeting via Zoom. To RSVP, go to
https://www.meetup.com/baypiggies/events/288471326/. When you RSVP "Yes" to
this event, the link to the Zoom meeting will become visible in Meetup.

*Code of Conduct*
https://baypiggies.net/pages/code_of_conduct.html
Interactions online have less nuance than in-person interactions. Please be
Open, Considerate and Respectful. Also, please refrain from discussing
topics unrelated to the Python community or the technical content of the
meeting.