Thursday, March 3, 2016

Pandas and Spark DataFrame: structural difference

Pandas is a Python package for easy data manipulation and analysis. Its basic data structures are built on top of the NumPy array, and it is very easy to see the trace of NumPy in both the Series and the data frame. The data must fit into memory for Pandas to function.

Spark is a different eco-system designed for big-data analytics. Its first popular data structure was the RDD (resilient distributed dataset); then, as Spark expanded into the data-science domain, the concept of a data frame was introduced (initially named SchemaRDD, and renamed DataFrame in version 1.3.0). So the Spark DF is built on top of the RDD. Keep this in mind, and it becomes easy to understand why so many "seemingly easy" operations are not there yet.

(Note. Pandas version 0.17.1, Spark version 1.5.2)

Basic data structures in Pandas

Two commonly used data structures are: Series and Data Frame.

Series

A Pandas Series has an index and values. The index is a wrapper structure around a NumPy array (exposed via its values attribute), and the Series itself has a values attribute as well, which is also a NumPy array. So basically, under the hood, two NumPy arrays make up one Pandas Series.
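A quick way to see the two underlying NumPy arrays for yourself (a minimal sketch with made-up data):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Both the values and the index are backed by NumPy arrays.
print(type(s.values))        # <class 'numpy.ndarray'>
print(type(s.index.values))  # <class 'numpy.ndarray'>
```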


Data Frame

A Pandas data frame can be considered as two 1-D NumPy arrays plus one 2-D NumPy array. It has the attributes index, columns, and values, each a wrapper over a differently sized NumPy array. This makes it easy to understand why transposing a Pandas DF is so straightforward: you just swap the "index" and the "columns"!
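That swap is literally visible in the transpose (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])

# Transposing just exchanges the index and columns wrappers
# over the underlying 2-D values array.
t = df.T
print(list(t.index))    # ['x', 'y']   (was columns)
print(list(t.columns))  # ['r1', 'r2'] (was index)
```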


Basic data structures in Spark

RDD

(from Wikipedia): "...... resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way". In principle, an RDD is immutable, meaning it cannot be changed once constructed. Although it can undergo different transformations and be realized by actions, the data inside an RDD will not change once it is initiated. Note that it only carries the data; not much meta-info is stored. The data scale usually requires multiple machines to parallelize storage and computation.

Data Frame

A Spark DF superimposes a schema over an RDD, so that the RDD carries a schema; from an analytical point of view, this makes it easy to refer to each column and dramatically facilitates data manipulation. Note that no NumPy structure is involved here at all: everything is derived from the RDD concept, and there is no index along a Spark DF either. There are three elements: the RDD, the schema, and the columns (there is a lot more under the hood, but for pure analytics purposes this comprehension is sufficient).




Now, based on the conceptual differences above, some properties of the Spark DF can easily be derived:
1. transposing a Spark DF would be very difficult (this is not what it was designed for)
2. adding a new column requires changing the schema (by adding a new field to the list) and creating a new RDD based on the original RDD plus the newly added column.

The list goes on and on; however, keep one thing in mind and most of the differences in the Spark DF become understandable: the RDD is immutable, and it is designed for big data.

Wednesday, March 2, 2016

Pandas and Spark DataFrame

Spark DataFrame is a great way to do data analytics over big data, and it has many similar (but slightly different) APIs to the well-adopted Python package Pandas. Recently, I have been working with both of them quite frequently, and I found it is very easy to mistake one for the other.

Here are several great posts about the comparison between Pandas and Spark DF:

@chris_bour/6-differences-between-pandas-and-spark-dataframes

from-pandas-to-apache-sparks-dataframe

pandarize-spark-dataframes