[Chennaipy] MOM - September 2018 Chennaipy Meetup

Sun Oct 7 01:05:01 EDT 2018

Hello,

Apologies for the delay. Please find below the minutes of the September 2018 meetup.

*Data Compression Techniques*
Data compression involves minimizing the size of data in bytes without
degrading its quality to an unacceptable level. Compression can be lossy or
lossless.

But how can we measure information? Information theory provides a way to do
so. It defines a unit of information (the bit). Uncertainty, information
and entropy are the key terms used in information theory. If an event is
unlikely, i.e. it has low probability, observing it conveys a lot of
information; entropy is the average information over all possible outcomes.
(For example, "it is hot in Chennai" carries little information, but "it is
snowing in Chennai" carries a lot.)
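
As a rough illustration of the idea above, here is a minimal Python sketch of
self-information and entropy measured in bits. The probabilities for the
Chennai weather example are made-up numbers chosen only for illustration, and
the function names are not from the talk.

```python
import math


def self_information(p):
    """Information, in bits, gained by observing an event of probability p."""
    return -math.log2(p)


def entropy(probabilities):
    """Average information, in bits, over a full set of outcome probabilities."""
    return sum(p * self_information(p) for p in probabilities if p > 0)


# Assumed (invented) probabilities for the example in the minutes.
p_hot_in_chennai = 0.99
p_snow_in_chennai = 0.0001

print("hot :", self_information(p_hot_in_chennai), "bits")
print("snow:", self_information(p_snow_in_chennai), "bits")
print("fair coin entropy:", entropy([0.5, 0.5]), "bits")
```

The rarer event yields the larger number of bits, which matches the intuition
that surprising news is more informative.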

When data needs to be compressed, instead of encoding every symbol with the
same number of bits, we can vary the codeword length based on each symbol's
probability of occurrence: frequent symbols get shorter codewords. The
Huffman coding algorithm uses this method to achieve lossless data
compression. It maps symbols to probability-based codewords.
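
The following is a small sketch of Huffman coding in Python, not the code shown
at the meetup. It estimates symbol probabilities from character counts in the
input text, and the helper names (huffman_code, encode, decode) are chosen here
for illustration.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a {symbol: bitstring} table from symbol frequencies in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial codeword}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    reverse = {bitstring: sym for sym, bitstring in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free, so the first match is the symbol
            out.append(reverse[current])
            current = ""
    return "".join(out)


text = "aaaabbc"
code = huffman_code(text)
bits = encode(text, code)
print(code, bits, decode(bits, code))
```

Because no codeword is a prefix of another, the bit stream can be decoded
without separators, and the round trip recovers the original text exactly.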

Information theory is a well-developed field, and many ideas in data science
are drawn from it. On a lighter note, this idea was already used in Morse
code in 1836, before Shannon formalised it in 1948.

*Last mile problem in ML*
Software engineering involves writing a function which takes an input and
gives an output. Machine learning involves finding a good function, which is
called a model.

For ML we now have dead simple APIs with abundant power. The cornerstone
of science is repeatable results. Since data science is a science, it is
important to produce repeatable results and hence to track experiments.
When this is not done we end up with zombie models.

We need a way to do the following (wishlist):
- Remember what training data was used
- Remember what code was used
- Remember the configuration and hyperparameters used
- Remember the results
- Save the model
- Compare the results

There is a tool called mlflow which provides these. This can be achieved
with the help of APIs such as set_tracking_uri, start_run, log_param,
log_metric and log_artifact. We could also deploy to AWS SageMaker.
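
To make the wishlist concrete, here is a standard-library-only sketch of what
an experiment tracker records. It is not mlflow and does not use mlflow's
implementation; the RunTracker class, the log_data method, the compare_runs
helper and the run.json layout are all hypothetical, with method names chosen
only to echo the calls listed above.

```python
import hashlib
import json
import shutil
from pathlib import Path


class RunTracker:
    """Hypothetical stand-in for an experiment tracker (not mlflow)."""

    def __init__(self, root, run_name):
        self.run_dir = Path(root) / run_name
        self.artifact_dir = self.run_dir / "artifacts"
        self.artifact_dir.mkdir(parents=True, exist_ok=True)
        self.record = {"run": run_name, "params": {}, "metrics": {},
                       "data_sha256": {}}

    def log_param(self, key, value):
        # Configuration, hyperparameters, code version and so on.
        self.record["params"][key] = value

    def log_metric(self, key, value):
        # Results of the run.
        self.record["metrics"][key] = value

    def log_data(self, path):
        # Remember which training data was used by storing its hash.
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        self.record["data_sha256"][Path(path).name] = digest

    def log_artifact(self, path):
        # Save the model (or any other output file) next to the run record.
        shutil.copy(path, self.artifact_dir / Path(path).name)

    def save(self):
        (self.run_dir / "run.json").write_text(json.dumps(self.record, indent=2))


def compare_runs(root, metric):
    """Return (run name, metric value) pairs, best value first."""
    rows = []
    for run_file in Path(root).glob("*/run.json"):
        record = json.loads(run_file.read_text())
        if metric in record["metrics"]:
            rows.append((record["run"], record["metrics"][metric]))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A real tracker adds a server, a UI and deployment hooks on top of this, but the
core idea is the same: every run leaves behind enough information to be
reproduced and compared.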

The code should be well structured and should try to expose the models
like a library. A sample code structure was shared.

*Pysangamam - Lessons learnt*
Timeline: 2 keynotes, 16 twenty-minute talk slots, 16 poster slots, 12
lightning talk slots. The idea started on Dec 8, 2017.

Zen - Local > national > international. Stick to where the base is largest.
Use mailing lists in TN.

Prototype before implementing was the rule, and constraints lead to
quality. Organizers cannot be speakers, and they must ensure that the venue
is left clean after the event.

Good part:
All tasks were completed on time. There were rehearsals and the quality was
good. Posters were very engaging. Lightning talk timing was managed using a
countdown timer. Food was served on time. Reception was positive.

The website was great and social media accounts were kept updated. The name
& logo were well appreciated. The venue was spacious yet compact.
Contributor tickets helped provide discounts to students.

Hard part:
The process was painful. It was difficult to keep up the enthusiasm and set
the ball rolling. We ensured that the organizers met face to face once every
week. Sponsorship was difficult, as we did not have sufficient contacts.

There were few takers for posters, and logo approval took time. Video
recording and banners had issues. There was no on-the-spot registration.
There were very few volunteers, and food was wasted.

*Importance of unit testing:*
Unit testing makes the product stable and prevents regressions. A good unit
test = no network, no DB, no file modification, can run in parallel, needs
no special environment. Artima link:
https://www.artima.com/weblogs/viewpost.jsp?thread=126923

Mock external dependencies in unit tests.
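
As a sketch of what mocking an external dependency can look like, the example
below uses the standard library's unittest.mock to replace a network call.
The fetch_temperature function, its URL and the JSON field temp_c are
hypothetical and exist only to give the test something to exercise.

```python
import json
import urllib.request
from unittest import mock


def fetch_temperature(city):
    """Hypothetical function that depends on a remote HTTP service."""
    url = f"https://weather.example.invalid/api?city={city}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())["temp_c"]


def test_fetch_temperature_without_network():
    # Fake response object that behaves like the context manager
    # returned by urlopen and yields a canned JSON payload.
    fake_response = mock.MagicMock()
    fake_response.read.return_value = b'{"temp_c": 34}'
    fake_response.__enter__.return_value = fake_response

    # Patch urlopen so the test never touches the network.
    with mock.patch("urllib.request.urlopen",
                    return_value=fake_response) as fake_urlopen:
        assert fetch_temperature("Chennai") == 34

    fake_urlopen.assert_called_once()
```

The test needs no network, no database and no special environment, so it can
run in parallel with other tests.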

pip install exam (provides decorators like fixture, before, after). Use
flake8 to lint the code.

Importance of logging in databases: it enables disaster recovery.

Kind regards,
Bharath