[Baypiggies] BayPiggies Oct? "NLP, Topic Modeling and Scraping of conference talks to find which topics are hot and not"
Stephen McInerney
spmcinerney at hotmail.com
Sun Oct 9 23:01:48 EDT 2022
Dear BayPIGgies,
Please reply +1 if you would like me to give in October the talk I had planned for September, which was disrupted.
The abstract is the same as before. Hope we can finally do this! - Stephen
________________________________
From: Baypiggies <baypiggies-bounces+spmcinerney=hotmail.com at python.org> on behalf of Jeff Fischer <jeffrey.fischer at gmail.com>
To: Baypiggies <baypiggies at python.org>
Subject: [Baypiggies] BayPiggies meeting next Thursday (Sept 22): Debugging, Scraping, and NLP
[Sept 22], we'll have a lightning talk from Ryan Kuhl on debugging and a full talk from Stephen McInerney on Web scraping and NLP. We hope that you can join us!
Main Talk: NLP, Topic Modeling and Scraping of conference talks to find which topics are hot and not
Speaker: Stephen McInerney
NLP (Natural Language Processing) and Topic Modeling are subdomains of Machine Learning and core technologies for Python data scientists. The automated collection of data by Scraping (in a TOS-compliant, ethical way) is a rarely discussed practice. Outline:
* Review the basic steps, present a typical pipeline for Scraping+NLP+Topic Modeling and cover packages used
* As a motivating example, we investigate changes in Python conference topics 2016-2022, and statistically extract conclusions on what's hot and not, as of 2022
* We also handle foreign-language abstracts and outline how machine translation can be used for Topic Modeling
* We illustrate best practices in Scraping on text data, maximally preserving and augmenting with metadata
* Review the basic steps, present a typical pipeline (segmentation, handling Unicode, Levenshtein distance, word-vectors, Transformer, NER, IE).
* Overview of related NLP/ML/Deep Learning packages we use both for prototyping and production.
* Topic Modeling using LDA (Latent Dirichlet Allocation) is a highly iterative clustering process that "learns" which topics seem to be similar/related/identical/different
* In this specific case, we augment conference abstracts with whatever metadata is helpful for topic modeling, e.g. speaker interests, affiliation, links to Twitter
* Example: "token" means an entirely different topic when it co-occurs with "crypto"/"blockchain"/"web3" versus when it co-occurs with "API"/"authentication"/"appsec"/"2FA"/"OAuth". But how do we automatically learn hundreds and then thousands of such cases?
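To make the "token" example concrete, here is a minimal sketch of the hand-written baseline that topic modeling is meant to replace. It is not the speaker's code: the seed word sets are copied from the example above, and the names SENSE_SEEDS, tokenize and token_sense are hypothetical. It labels an abstract by counting which seed words co-occur with "token"; a topic model would instead learn such word groupings from the corpus.

```python
import re

# Seed vocabularies copied from the example in the abstract.
# They are hand-picked here, not learned (assumption for illustration).
SENSE_SEEDS = {
    "crypto": {"crypto", "blockchain", "web3"},
    "auth": {"api", "authentication", "appsec", "2fa", "oauth"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_sense(abstract):
    """Guess which sense of 'token' an abstract uses.

    Returns a key of SENSE_SEEDS, or None when 'token' does not occur,
    no seed word co-occurs with it, or the senses are tied.
    """
    words = set(tokenize(abstract))
    if not words & {"token", "tokens"}:
        return None
    # Count distinct seed words of each sense that appear in the abstract.
    hits = {sense: len(seeds & words) for sense, seeds in SENSE_SEEDS.items()}
    ranked = sorted(hits.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None  # no evidence, or ambiguous
    return max(hits, key=hits.get)


if __name__ == "__main__":
    # Hypothetical abstracts, invented for this sketch.
    for text in (
        "Minting a token on the blockchain with web3.py",
        "Refreshing an OAuth token for API authentication",
    ):
        print(token_sense(text), "<-", text)
```

Maintaining seed lists like these by hand does not scale to hundreds or thousands of ambiguous terms, which is the motivation for learning the groupings automatically.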
Speaker Bio: Stephen McInerney
Data scientist and NLP specialist for over a decade, specializing in domain-specific (biotech/legal/financial) and multilingual NLP, in both startups and large companies. Kaggle competitor; have led "Kaggle Together" classes. Former Data Science co-chair of SF Bay Area ACM and organizer of multiple Data Science Camps. Passionate about open-source. http://www.linkedin.com/in/stephenmcinerney