[omaha] Spark Data Frames - databricks blog post

Bob Haffner bob.haffner at gmail.com
Wed Feb 18 00:01:15 CET 2015


I've been meaning to check out Blaze.  Thanks for the link!

And that Xray looks interesting.  Can't say I have a need for it right now,
but it's definitely one to watch.
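
For anyone who wants to try the pandas/Spark interop mentioned in the post,
here is a minimal sketch. It assumes pandas and a recent PySpark are installed
locally (it uses the SparkSession entry point; older 1.x releases expose
createDataFrame on SQLContext instead). The sample data and the function names
make_sample_frame, pandas_totals, and spark_totals are made up for
illustration and do not come from the blog post.

```python
import pandas as pd


def make_sample_frame():
    """Build a tiny pandas DataFrame (hypothetical sample data)."""
    return pd.DataFrame(
        {
            "city": ["Omaha", "Lincoln", "Omaha"],
            "attendees": [12, 7, 20],
        }
    )


def pandas_totals(pdf):
    """Sum attendees per city using pandas only."""
    return pdf.groupby("city", as_index=False)["attendees"].sum()


def spark_totals(pdf):
    """Do the same aggregation in Spark and return a pandas DataFrame.

    PySpark is imported lazily so the pandas-only functions above
    still work on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    # Assumption: a single-threaded local session is enough for a demo.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-interop-sketch")
        .getOrCreate()
    )
    try:
        sdf = spark.createDataFrame(pdf)  # pandas -> Spark
        totals = (
            sdf.groupBy("city")
            .sum("attendees")  # Spark names this column "sum(attendees)"
            .withColumnRenamed("sum(attendees)", "attendees")
        )
        return totals.toPandas()  # Spark -> pandas
    finally:
        spark.stop()
```

Calling pandas_totals(make_sample_frame()) runs anywhere pandas is installed;
spark_totals(make_sample_frame()) needs a working local Spark and should give
the same per-city sums, though Spark does not guarantee the row order.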

On Tue, Feb 17, 2015 at 2:57 PM, Wes Turner <wes.turner at gmail.com> wrote:

> Great article, thanks!
>
> "Out of the box, DataFrame supports reading data from the most popular
> formats, including JSON files, Parquet files, Hive tables. It can read from
> local file systems, distributed file systems (HDFS), cloud storage (S3),
> and external relational database systems via JDBC. In addition, through
> Spark SQL’s external data sources API, DataFrames can be extended to
> support any third-party data formats or sources. Existing third-party
> extensions already include Avro, CSV, ElasticSearch, and Cassandra"
>
> " * APIs for Python, Java, Scala, and R (in development via SparkR)"
>
> ...
>
> Spark recently won the Daytona Gray Sort Benchmark 2014 title from Hadoop
> MapReduce:
>
> https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
>
> Blaze "translates a subset of modified NumPy and Pandas-like syntax to
> databases and other computing systems. Blaze allows Python users a familiar
> interface to query data living in other data storage systems."
> http://blaze.pydata.org/docs/dev/backends.html#id5 (Blaze interfaces with
> Spark w/ PySpark)
>
> * xray works with N-dimensional labeled data and netCDF (in lieu of
>   (Sparse) Panel and Panel4D):
>   http://pandas-docs.github.io/pandas-docs-travis/ecosystem.html#xray
> * http://pandas.pydata.org/pandas-docs/dev/api.html
> On Feb 17, 2015 2:39 PM, "Bob Haffner" <bob.haffner at gmail.com> wrote:
>
> > This post talks about the new data frames in Spark, which were inspired
> > in part by Pandas.  Looks like there's interop between pandas frames and
> > spark frames as well.
> >
> > Along with pyspark, I see this playing a significant role in bridging the
> > python and distributed analysis worlds.
> >
> >
> > https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
> > _______________________________________________
> > Omaha Python Users Group mailing list
> > Omaha at python.org
> > https://mail.python.org/mailman/listinfo/omaha
> > http://www.OmahaPython.org
> >
