[Baypiggies] BayPiggies Oct? "NLP, Topic Modeling and Scraping of conference talks to find which topics are hot and not"

Sun Oct 9 23:01:48 EDT 2022

Dear BayPIGgies,

Please reply +1 if you would like me to give, in October, the talk I had planned for September, which was disrupted.
The abstract is the same as before. Hope we can finally do this! - Stephen

________________________________
From: Baypiggies <baypiggies-bounces+spmcinerney=hotmail.com at python.org> on behalf of Jeff Fischer <jeffrey.fischer at gmail.com>
To: Baypiggies <baypiggies at python.org>
Subject: [Baypiggies] BayPiggies meeting next Thursday (Sept 22): Debugging, Scraping, and NLP

[Sept 22], we'll have a lightning talk from Ryan Kuhl on debugging and a full talk from Stephen McInerney on Web scraping and NLP. We hope that you can join us!

Main Talk: NLP, Topic Modeling and Scraping of conference talks to find which topics are hot and not
Speaker: Stephen McInerney
NLP (Natural Language Processing) and Topic Modeling are subdomains of Machine Learning and core technologies for Python data scientists. The automated collection of data by Scraping (in a TOS-compliant, ethical way) is a rarely discussed practice. Outline:

  *   Review the basic steps, present a typical pipeline for Scraping+NLP+Topic Modeling and cover packages used
  *   As a motivating example, we investigate changes in Python conference topics 2016-2022, and draw statistical conclusions about what's hot and not, as of 2022
  *   We also handle foreign-language abstracts and outline how machine translation can be used for Topic Modeling
  *   We illustrate best practices for Scraping text data, preserving as much of it as possible and augmenting it with metadata
  *   Review the basic NLP steps and present a typical pipeline (segmentation, Unicode handling, Levenshtein distance, word vectors, Transformers, NER, IE).
  *   Overview of related NLP/ML/Deep Learning packages we use both for prototyping and production.
  *   Topic Modeling using LDA is a highly iterative clustering process to "learn" which topics seem to be similar/related/identical/different
  *   In this specific case, we augment conference abstracts with whatever metadata is helpful for topic modeling, e.g. speaker interests, affiliation, links to Twitter
  *   Example: "token" signals an entirely different topic when it co-occurs with "crypto"/"blockchain"/"web3" versus when it co-occurs with "API"/"authentication"/"appsec"/"2FA"/"OAuth". But how do we automatically learn hundreds and then thousands of such cases?
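
To make the "token" example concrete, here is a minimal sketch of the co-occurrence signal it describes. It is not taken from the talk. The seed words come from the bullet above, but the sense labels ("crypto", "security"), the function names and the three sample abstracts are all made up for illustration. Hand-written seed lists like these are exactly what does not scale to thousands of cases, which is the gap a learned topic model such as LDA is meant to fill.

```python
import re
from collections import Counter

# Seed words from the outline's example; the sense labels are hypothetical.
SEEDS = {
    "crypto": {"crypto", "blockchain", "web3"},
    "security": {"api", "authentication", "appsec", "2fa", "oauth"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sense_of_token(abstract, target="token"):
    """Guess which sense of `target` an abstract uses from co-occurring seed words.

    Returns a sense label, or None when the target word is absent,
    no seed word co-occurs with it, or the two senses tie.
    """
    words = tokenize(abstract)
    if target not in words and target + "s" not in words:
        return None
    counts = Counter(words)
    # Score each sense by how many of its seed words appear in the abstract.
    scores = {
        sense: sum(counts[word] for word in seeds)
        for sense, seeds in SEEDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up_score:
        return None
    return best


# Made-up abstracts, for illustration only.
ABSTRACTS = [
    "Minting a token on a blockchain with Python and web3",
    "Securing your API: OAuth token refresh and 2FA",
    "Writing a tokenizer for a toy language",
]

if __name__ == "__main__":
    for abstract in ABSTRACTS:
        print(sense_of_token(abstract), "<-", abstract)
```

The third abstract is there on purpose: it never uses "token" as a standalone word, so the sketch declines to label it rather than guess.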

Speaker Bio: Stephen McInerney
Data scientist and NLP specialist for over a decade, specializing in domain-specific (biotech/legal/financial) and multilingual NLP, in both startups and large companies. Kaggle competitor; have led "Kaggle Together" classes. Former Data Science co-chair of SF Bay Area ACM and organizer of multiple Data Science Camps. Passionate about open-source. www.linkedin.com/in/stephenmcinerney
