[Chicago] Chicago Digest, Vol 121, Issue 6

Tue Sep 8 21:54:53 CEST 2015

Thanks Tanya!  Yes, someone at Northeastern told me the three keys to
success in big data are 1) Python, 2) R and finally 3) Hadoop, which he
said is really an extension of SQL.  I'm sure that next week the three keys
to success will be something else!  Technology is great, but it's changing
faster than I can keep pace with it.  I wonder how many people feel that
way???

What time does the presentation officially begin on Thursday evening?  6
pm?  What time does it end?  Is it structured like a classroom or is it
more open-ended?  Just one group?  Or does it get split up into many
different sub-groups?

I don't know a lot about "parallel computing" other than that it is what
Java programmers call "Threads".  I think (not too sure really) that when you
"thread" an algorithm, you allocate one core for one part of the algorithm
and then another core for another part of the algorithm, and so on and so
forth, and then at the end it all has to get magically pieced back
together.  ( Sounds like a merge sort problem to me! )  What about Python?
Does Python support this multicore approach or "threading"?  I really don't
know.  Jon Harrop, the author of "OCaml for Scientists", told me that the
OCaml programming language became much less popular in the early 2000s
because its main developer underestimated the impact of multicore
technology on modern programming.
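
For what it's worth, here is a minimal sketch of that "split the work,
then piece it back together" idea in Python.  It assumes the standard
library's concurrent.futures and heapq modules; the function names
split_into_chunks and parallel_sort, and the executor_cls parameter, are
made up for illustration.  As far as I understand it, Python does have a
threading module, but in the standard CPython interpreter threads do not
run CPU-bound Python code on several cores at once, so the sketch defaults
to separate processes instead.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sort(items, workers=4, executor_cls=ProcessPoolExecutor):
    """Sort each chunk in its own worker, then merge the sorted chunks.

    executor_cls is a hypothetical knob for this sketch: pass
    ThreadPoolExecutor to compare threads against processes.
    """
    chunks = split_into_chunks(items, workers)
    with executor_cls(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge combines already-sorted inputs into one sorted stream,
    # which is the "piecing back together" step of a merge sort.
    return list(heapq.merge(*sorted_chunks))


if __name__ == "__main__":
    import random

    data = [random.randint(0, 1000) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)
```

The main guard matters here because on some platforms each worker process
re-imports the script.  For a list this small, the cost of copying data
between processes may well outweigh any gain, so treat this as an
illustration of the pattern rather than a speedup.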

On Tue, Sep 8, 2015 at 8:28 AM, Tanya Schlusser <tanya at tickel.net> wrote:

>
>
>> Where is the ChiPy meeting this Thursday evening?  I checked the website,
>> but the location of the meeting had not yet been decided.
>>
>
> It's at Braintree again (8th floor, Merchandise Mart) -- sorry for taking
> so long to get it up!
>
>
>
>> When I think of "big data analysis" I think of something like, "Okay, read
>> all these data from an Excel spreadsheet into a huge Python array or
>> matrix, and then construct various Q-Q plots to see if the data are
>> normally distributed, exponentially distributed or something else, and
>> then determine the parameters of the distribution".  In other words, when
>> I hear "big data" I'm really thinking of a mixture of statistics and
>> computer programming.  Is that correct or is my "definition" a little too
>> narrow?
>>
>
>
> It's pretty correct. The 'analysis' part is correct -- it's still
> statistics / machine learning. The 'big data' part really was a catchall
> phrase for "anything that can't be done right now in a standard database".
> So 'big data' can mean a handful of things:
>
>    - Workarounds to key generation because inserts are happening faster
>    than a standard database can deal with them. This was the 'Velocity' part
>    of the big data marketers' advertising campaigns. Twitter's Snowflake
>    <https://blog.twitter.com/2010/announcing-snowflake> is a good example
>    of working around this.
>
>    - NoSQL (Not Only SQL) -- storing and doing computation over images,
>    MRIs, genetic data, PDFs, entire log files, et cetera... This is the
>    'Variety' in the big data marketing. Hadoop's Distributed File System and
>    MongoDB are good examples of systems that can store these sorts of files.
>
>    - Parallel computation on a(n inexpensive) cluster because it would
>    take too long or the data would not fit on one computer. This means the
>    algorithms had to be rewritten for parallel execution. This was the
>    'Volume' part of big data marketing.
>       - Apache Mahout
>       <https://mahout.apache.org/users/basics/algorithms.html> (in Java)
>       was, I think, one of the first open-source implementations of
>       parallelized machine learning algorithms.
>       - The hottest thing for this now is the Spark Machine Learning
>       library <http://spark.apache.org/docs/latest/mllib-guide.html> (see the
>       PyCon 2015 presentation on Spark with Python
>       <http://pyvideo.org/video/3407/introduction-to-spark-with-python>).
>       There is also a Chicago Spark Meetup
>       <http://www.meetup.com/Chicago-Spark-Users/>.
>       - And the newcomer Apache Flink, also in Java, bypasses Java's
>       garbage collection for speed, optimizes SQL queries (unlike Hive), and
>       claims to provide a truly streaming analytics option without some of the
>       hangups of Storm. It also has Python bindings
>       <http://mvnrepository.com/artifact/org.apache.flink/flink-python>.
>       There is a Chicago Flink meetup
>       <http://www.meetup.com/Chicago-Apache-Flink-Meetup/> -- I think
>       it's the 3rd Flink user group in North America.
>
>
> hope it was useful...see you @ Braintree Thursday!
>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> https://mail.python.org/mailman/listinfo/chicago
>
>