[omaha] Spark Data Frames - databricks blog post

Wes Turner wes.turner at gmail.com
Tue Feb 17 21:57:54 CET 2015

Great article, thanks!

"Out of the box, DataFrame supports reading data from the most popular
formats, including JSON files, Parquet files, Hive tables. It can read from
local file systems, distributed file systems (HDFS), cloud storage (S3),
and external relational database systems via JDBC. In addition, through
Spark SQL’s external data sources API, DataFrames can be extended to
support any third-party data formats or sources. Existing third-party
extensions already include Avro, CSV, ElasticSearch, and Cassandra"

" * APIs for Python, Java, Scala, and R (in development via SparkR)"
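
For anyone who wants to try the data-source part of that quote, here is a minimal sketch of loading a file through Spark's generic reader. It makes some assumptions: the names FORMATS, guess_format, load_dataframe and main, and the path "people.json", are hypothetical and only for illustration. It also uses the SparkSession entry point from later Spark releases (2.0+), not the SQLContext that the 1.3-era blog post describes.

```python
import os

# Hypothetical mapping from file extension to a Spark data source name.
# "json" and "parquet" are built-in sources; other formats may need
# third-party packages, as the quoted passage notes.
FORMATS = {".json": "json", ".parquet": "parquet"}


def guess_format(path):
    """Return the Spark data source name for a path, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return FORMATS[ext]
    except KeyError:
        raise ValueError("no known data source for %r" % (path,))


def load_dataframe(spark, path):
    """Load `path` as a Spark DataFrame via the generic reader API.

    `spark` is assumed to be a pyspark.sql.SparkSession (Spark 2.0+).
    """
    return spark.read.format(guess_format(path)).load(path)


def main():
    # Requires PySpark; imported here so the helpers above work without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("sources").getOrCreate()
    )
    df = load_dataframe(spark, "people.json")  # placeholder path
    df.printSchema()
    spark.stop()
```

With PySpark installed and a JSON file at the placeholder path, calling main() should print the inferred schema. The same load_dataframe call should accept hdfs:// or s3:// style paths if the cluster is configured for them.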


Spark also recently took the 2014 Daytona GraySort benchmark title, a record
previously held by Hadoop MapReduce.

Blaze "translates a subset of modified NumPy and Pandas-like syntax to
databases and other computing systems. Blaze allows Python users a familiar
interface to query data living in other data storage systems."
http://blaze.pydata.org/docs/dev/backends.html#id5 (Blaze interfaces with
Spark w/ PySpark)

* http://pandas-docs.github.io/pandas-docs-travis/ecosystem.html#xray works
with N-dimensional labeled data and netCDF (in lieu of (Sparse) Panel and
* http://pandas.pydata.org/pandas-docs/dev/api.html

On Feb 17, 2015 2:39 PM, "Bob Haffner" <bob.haffner at gmail.com> wrote:

> This post talks about the new data frames in Spark, which were inspired
> in part by Pandas.  Looks like there's interop between pandas frames and
> Spark frames as well.
> Along with PySpark, I see this playing a significant role in bridging the
> Python and distributed analysis worlds.
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
