Aries Research Note: March 2014

Monday, March 17, 2014

Hadoop Study Note: Hive operations

In the big data era, Hadoop is the mainstream architecture for organizing and managing data analysis. Hive is a data warehouse tool built on top of Hadoop, and it is very helpful for data summarization, query, and analysis. Here I document the three main groups of operations in Hive (following the Apache Hive website): Data Definition Language (DDL), Data Manipulation Language (DML), and SQL operations.

The three interact as follows: DDL defines the metadata structure of the tables, DML loads and manages the data in the tables, and SQL queries the tables to produce insightful analysis.

Part 1. DDL Operation

Create/Drop/Alter Database
Create/Drop/Truncate Table
Alter Table/Partition/Column
Create/Drop/Alter View
Create/Drop/Alter Index
Create/Drop Function
Create/Drop/Grant/Revoke Roles and Privileges
Show
Describe

Part 2. DML Operation

Loading files into tables
Inserting data into Hive Tables from queries
Writing data into the filesystem from queries

There are two primary ways of modifying data in Hive:

LOAD
INSERT

Part 3. SQL Operation

This is quite similar to standard SQL; it is built around the SELECT syntax:

WHERE Clause
ALL and DISTINCT Clauses
Partition Based Queries
HAVING Clause
LIMIT Clause
REGEX Column Specification

Other parts of the SELECT syntax are:

GROUP BY
SORT BY, ORDER BY, CLUSTER BY, DISTRIBUTE BY
JOIN
UNION ALL
TABLESAMPLE
Subqueries
Virtual Columns
Operators and UDFs
LATERAL VIEW
Windowing, OVER, and Analytics
Common Table Expressions

Linux Study Note: password-less inter-server communication

Logging into different servers, or transferring files between them, is always frustrating if you cannot remember the "just updated" passwords. An easy solution is to use SSH keys to access remote servers in a secure, yet password-less, fashion.

The following approach is the one I consider most straightforward, although its usability may differ depending on each server's restrictions.

MISSION:

Log into "Server B" from "Server A", or transfer files from "Server A" to "Server B", without a password.

PROCEDURE:

1. Log into "Server A" and go to the ~/.ssh directory (at this point there should be no id_rsa or id_rsa.pub files).

2. Run the command: ssh-keygen -t rsa (just press 'Enter' at every prompt; you will see that two files, id_rsa and id_rsa.pub, are generated).

3. Log into "Server B" using the password, then append the contents of "Server A"'s id_rsa.pub file to "Server B"'s ~/.ssh/authorized_keys file.

4. Now, from "Server A", you can run "scp file-from-server-B location-on-server-A", "scp file-from-server-A location-on-server-B", or "ssh server-B" directly, without being asked for a password.

NOTE:

It seems that some servers are configured so that a generated SSH key cannot be used to access other servers (I am not sure). So it is always a good idea to consult the server administrators first on these issues.

Good luck with multi-server communication :)

Sunday, March 9, 2014

Python Study Note: Three important Python packages

NumPy provides a very convenient data structure; however, it is still not quite handy for running data analysis or machine learning directly. For that, three more packages are needed:

pandas --- provide high-performance, easy-to-use data structures and data analysis tools
scikit-learn --- simple and efficient tools for data mining and data analysis
matplotlib --- 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms

Both pandas and scikit-learn use matplotlib as their plotting tool, and both are built on NumPy and SciPy for basic data structures and scientific computation.

matplotlib has extensive documentation on its website. Since I only need matplotlib to visualize my data analysis, not necessarily to publish a paper, its matplotlib.pyplot module is sufficient for me. The detailed documentation is HERE.

In later posts, I would like to focus on pandas and scikit-learn for data analysis and modeling.

Python Study Note: NumPy - axis

There is one important concept in NumPy: the axis.

One easy way to explain axis is: "In Numpy dimensions are called axes. The number of axes is rank. For example, the coordinates of a point in 3D space [1, 2, 1] is an array of rank 1, because it has one axis. That axis has a length of 3. In example pictured below, the array has rank 2 (it is 2-dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of 3."

However, I found it quite difficult to understand the results of sum(axis=0), sum(axis=1), sum(axis=2), etc. Here I provide some examples to show what an axis means.

If we run:

>> a=np.arange(10);

>> a.resize([2,5]);

Then a is: array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]), a.shape = (2,5)

Here, axis 0 has a length of 2 and axis 1 has a length of 5.

If we run:

>> b=np.arange(24);

>> b.resize([2,3,4]);

Then b's axis=0 has length 2, axis=1 has length 3, axis = 2 has length 4.

The axes act like an internal index for each entry: to find an entry such as b[1][2][3] or b[1,2,3], one goes through (axis 0 index = 1, axis 1 index = 2, axis 2 index = 3). So when we carry out a sum over a specific axis, what we are doing is "reducing" (collapsing) that axis.

For array a, if we run '' >> a.sum(axis=0) '', we collapse axis 0 and generate a new array with new.shape = (5,): because axis 0 was removed, axis 1 moves up to serve as axis 0.

If we run '' >> a.sum(axis=1) '', we collapse axis 1 and generate a new array with new.shape = (2,): axis 1 was removed and only axis 0 is left.

For array b, if we run '' >> b.sum(axis=0) '', axis 0 is collapsed, axis 1 becomes axis 0, and axis 2 becomes axis 1, so new.shape = (3,4).

If we run '' >> b.sum(axis=1) '', the values are aggregated along axis 1, axis 1 is removed, and axis 2 becomes axis 1, so new.shape = (2,4).

If we run '' >> b.sum(axis=2) '', axis 2 is collapsed and aggregated accordingly, so new.shape = (2,3).

Now it is quite clear what the "axis=?" option, which appears in many built-in functions, does: it selects the axis to collapse and act on.
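The shapes described above can be checked directly. This is a small sketch, written for Python 3, that builds the same arrays a and b (using reshape instead of resize, purely for brevity) and sums over each axis:

```python
import numpy as np

a = np.arange(10).reshape(2, 5)      # same contents as array a above
b = np.arange(24).reshape(2, 3, 4)   # same contents as array b above

# Summing over an axis removes that axis from the shape.
shape_a0 = a.sum(axis=0).shape   # axis 0 collapsed
shape_a1 = a.sum(axis=1).shape   # axis 1 collapsed
shape_b0 = b.sum(axis=0).shape
shape_b1 = b.sum(axis=1).shape
shape_b2 = b.sum(axis=2).shape

# The values show which entries were added together.
col_sums = a.sum(axis=0)   # a[0, j] + a[1, j] for each column j
row_sums = a.sum(axis=1)   # sum of the entries in each row

print(shape_a0, shape_a1, shape_b0, shape_b1, shape_b2)
print(col_sums, row_sums)
```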

Python Study Note: NumPy array - 2

There is more to the ndarray: its operations.

1. Array conversion

ndarray.tolist() --- Return the array as a (possibly nested) list.

ndarray.astype(dtype[, order, casting, ...])   ---   Copy of the array, cast to a specified type.
ndarray.copy([order])   ---   Return a copy of the array.
ndarray.fill(value)   ---   Fill the array with a scalar value.

2. Shape manipulation

ndarray.reshape(shape[, order]) --- Returns an array containing the same data with a new shape.

ndarray.resize(new_shape[, refcheck]) --- Change shape and size of array in-place.

ndarray.flatten([order]) --- Return a copy of the array collapsed into one dimension.

ndarray.ravel([order]) --- Return a flattened array.

The difference between reshape and resize: reshape returns a new array object with the new shape and leaves the shape of the original array unchanged; resize changes the original array directly.

The difference between flatten and ravel: ravel returns a view of the original array whenever possible, so its entries change when the original array's values change; flatten always returns an independent copy.
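A short sketch of both differences, written for Python 3. The variable names are only for illustration:

```python
import numpy as np

# reshape returns a new array object; the shape of `a` itself is unchanged.
a = np.arange(6)
reshaped = a.reshape(2, 3)
shape_after_reshape = a.shape      # still (6,)

# resize changes the array in place and returns None.
c = np.arange(6)
resize_result = c.resize((2, 3))
shape_after_resize = c.shape       # now (2, 3)

# ravel gives a view when it can; flatten always makes a copy.
m = np.arange(6).reshape(2, 3)
flat_view = m.ravel()
flat_copy = m.flatten()
m[0, 0] = 99                       # visible through the view, not the copy
```

Here m is a contiguous array, so ravel can return a view. For arrays where that is not possible, ravel falls back to a copy.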

3. Item selection

ndarray.sort([axis, kind, order]) --- Sort an array, in-place.

ndarray.argsort([axis, kind, order]) --- Returns the indices that would sort this array.

ndarray.repeat(repeats[, axis]) --- Repeat elements of an array.
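A small sketch of these three methods on a made-up array (the values are hypothetical). Note the order of the calls: sort is applied last because it modifies the array in place.

```python
import numpy as np

scores = np.array([30, 10, 20])   # hypothetical data

order = scores.argsort()          # indices that would sort the array
sorted_copy = scores[order]       # indexing with them builds a sorted copy

repeated = scores.repeat(2)       # each element repeated twice, flattened

scores.sort()                     # in place: `scores` itself is now sorted
```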

4. Calculation

The calculation functions are quite straightforward: ndarray.func([axis, out]). However, remember that there is one major parameter: axis.

If axis is None (the default), the array is treated as a 1-D array and the operation is performed over the entire array.

If axis is an integer, then the operation is done over the given axis (for each 1-D subarray that can be created along the given axis).

Python Study Note: NumPy array - 1

The array is the key concept in NumPy, and its creation and operations are worth remembering.

First, there are two new types: ndarray and dtype. ndarray represents an n-dimensional array, and dtype describes the data type of its elements.

1. Array creation

One comprehensive list is given Here. I will only write down some important ones.

1.1 ones and zeros

eye(N[, M, k, dtype]) ---   Return a 2-D array with ones on the diagonal and zeros elsewhere.
zeros(shape[, dtype, order]) ---   Return a new array of given shape and type, filled with zeros.
zeros_like(a[, dtype, order, subok]) ---   Return an array of zeros with the same shape and type as a given array.
There are also ones() and ones_like() functions, whose usage is the same as zeros() and zeros_like().

1.2 from existing data

numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0) --- convert other object into np.array

1.3 numeric range

arange([start,] stop[, step,][, dtype]) --- Return evenly spaced values within a given interval.

linspace(start, stop[, num, endpoint, retstep]) --- Return evenly spaced numbers over a specified interval.

logspace(start, stop[, num, endpoint, base]) --- Return numbers spaced evenly on a log scale.
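The three range functions differ mainly in how the interval is specified. A minimal sketch with arbitrarily chosen arguments:

```python
import numpy as np

stepped = np.arange(0, 10, 2)        # fixed step of 2; the stop value is excluded
evenly = np.linspace(0.0, 1.0, 5)    # 5 points; the stop value is included by default
powers = np.logspace(0, 3, 4)        # 4 points from 10**0 to 10**3 (base 10 by default)

print(stepped)
print(evenly)
print(powers)
```

arange is convenient when the step matters, linspace when the number of points matters.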

2. Array attributes

It is always helpful to know the size and shape of the current array; these are the array's attributes.

ndarray.shape --- Tuple of array dimensions.

ndarray.ndim --- Number of array dimensions.

ndarray.size --- Number of elements in the array.

ndarray.nbytes --- Total bytes consumed by the elements of the array.

ndarray.dtype --- Data-type of the array’s elements.

ndarray.T ---   Same as self.transpose(), except that self is returned if self.ndim < 2.
ndarray.real ---   The real part of the array.
ndarray.imag ---   The imaginary part of the array.
ndarray.flat ---   A 1-D iterator over the array.
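As a quick check of these attributes, here is a sketch on a small array of zeros. The shape and dtype are chosen only for illustration:

```python
import numpy as np

z = np.zeros((2, 3), dtype=np.float64)

shape = z.shape            # tuple of axis lengths
ndim = z.ndim              # number of axes
size = z.size              # total number of elements
nbytes = z.nbytes          # size * bytes per element (8 for float64)
dtype = z.dtype            # element type
t_shape = z.T.shape        # transposing reverses the axes
flat_values = list(z.flat) # iterate over all elements as if 1-D
```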

Python Study Note: Numpy & Scipy

In my daily work, I always need to import two modules: NumPy and SciPy. It feels really weird to "use" them every day without really "understanding" them. In this post, I would like to learn and document the essentials of NumPy and SciPy.

The introduction in SciPy official website is quite helpful, and I simply copy it here for my reference.

"NumPy's array type augments the Python language with an efficient data structure useful for numerical work, e.g., manipulating matrices. NumPy also provides basic numerical routines, such as tools for finding eigenvectors.
SciPy contains additional routines needed in scientific work: for example, routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices."

It looks like NumPy provides an important data structure (the array) and related operations, and SciPy contains additional tools for mathematically heavy calculations. Given my own interests, I want to focus on NumPy and its data structure in the following discussions.

1. Basic Structure
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes. The number of axes is rank.

There are several kinds of operations on the NumPy array: creation, self-operation, inter-operation, and math-rich operations. I will write about them in following posts.

Python Study Note: Function

Although Python is an object-oriented programming language, functions still play quite a large role in data analysis. Usually we need to write a customized function to run some simulation, refine the data set, etc. This post is a brief summary of user-defined functions in Python.

1. How to define a function

def function_name(x):
    a = 10
    return a

function_name(5)

There is not much to say about how to define a function... what is more interesting is how Python functions differ from functions in C or other lower-level languages.

2. Default argument value

def func_name(parm_1, parm_2=10, parm_3='one string'):
    return 0

3. Keyword argument

func_name(parm_1 = 100)

4. Arbitrary argument list

def func_name(parm_1, *parm_list, **parm_dict):
    for arg in parm_list:
        print arg
    for kw in parm_dict:
        print kw, ':', parm_dict[kw]

If the arguments already exist in a list, one can unpack the list into the function call: a = [2,3,4,5]; func_name(parm_1, *a)
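The argument styles from sections 2 to 4 can be combined in one function. The sketch below uses a hypothetical function named describe (not from the notes above) and returns a list instead of printing, so it runs the same way on Python 2 and Python 3:

```python
def describe(name, greeting='hello', *extras, **options):
    """Return a list of text pieces built from every kind of argument."""
    pieces = ['%s %s' % (greeting, name)]
    for item in extras:             # extra positional arguments arrive as a tuple
        pieces.append(str(item))
    for key in sorted(options):     # extra keyword arguments arrive as a dict
        pieces.append('%s=%s' % (key, options[key]))
    return pieces

only_required = describe('world')                 # default value used for greeting
with_keyword = describe('world', greeting='hi')   # keyword argument
values = [1, 2]
unpacked = describe('world', 'hey', *values, mode='fast')  # list unpacked with *
```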

5. Lambda expressions
This one confused me for months after I started using Python. Now it is time to unravel the mystery.
"Lambda functions can be used wherever function objects are required. They are syntactically restricted to a single expression. Semantically, they are just syntactic sugar for a normal function definition."
It is just a shorthand for defining a function, and it can be quite convenient.
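A typical place where a lambda is convenient is the key argument of sorted(). A minimal sketch with made-up data:

```python
# A lambda bound to a name behaves like: def square(x): return x ** 2
square = lambda x: x ** 2

pairs = [(1, 'one'), (3, 'three'), (2, 'two')]   # hypothetical data

# The lambda tells sorted() which part of each pair to compare.
by_number = sorted(pairs, key=lambda p: p[0])
by_word = sorted(pairs, key=lambda p: p[1])
```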

6. Pass parameter by reference or by Value?
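I will refer to this Link, which has quite a detailed discussion of this topic. A rule of thumb: if a mutable object such as a list is passed as a parameter, the function can modify its entries in place and the caller sees the change; for immutable objects such as numbers, strings, and tuples, no direct modification is possible.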
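The rule of thumb can be seen in a few lines. The function names below are hypothetical and exist only for this sketch:

```python
def append_item(items):
    items.append(99)      # mutates the caller's list in place

def rebind(items):
    items = [0]           # rebinds the local name only; the caller is unaffected

def add_one(n):
    n = n + 1             # ints are immutable, so this creates a new object
    return n

data = [1, 2, 3]
append_item(data)         # data now ends with 99
rebind(data)              # data is unchanged by this call

count = 5
result = add_one(count)   # count itself stays 5
```

So modifying a list's entries inside a function is visible outside, but assigning a new object to the parameter name is not.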

That seems to be what I need to know about Python functions. Once a function is written, it is always nice to add a documentation string (docstring) describing the function's usage. Here are some guidelines:

1. The first line should always be a short, concise summary of the object's purpose.
2. If there are more lines in the documentation string, the second line should be blank.
3. The first non-blank line after the summary sets the indentation for the rest of the docstring; lines that are indented less should not occur.

Ex.
---

>>> def my_function():
...     """Do nothing, but document it.
...
...     No, really, it doesn't do anything.
...     """
...     pass
...

Saturday, March 8, 2014

Python Study Note: Data Structure - Dict

"Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object either directly or indirectly, it cannot be used as a key. "

Think of a dict as an unordered set of key:value pairs, with the requirement that the keys are unique.

1. How to generate a dictionary?
1.1 Direct way
a = {} --- empty dictionary
1.2 Dict comprehension
a = {x:x**2 for x in range(10)}
1.3 Convert
a = [[1,2], [2,3]]; b = dict(a);

2. Operators (what is in the dict?)
2.1 a.keys() --- return all keys as a list
2.2 a.values() --- return all values as a list, including duplicates
2.3 a.has_key(x) --- return True/False, based on whether dict has the key (Python 2 only; "x in a" works in both Python 2 and Python 3)
2.4 a.pop(x) --- return the value for key x, and then remove the key:value pair
2.5 del a[x] --- directly delete the element from dict a
2.6 a.update(b) --- add the entries in dict b into a
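A sketch of these operations written for Python 3, where has_key() no longer exists and keys() returns a view rather than a list. The dict contents are arbitrary:

```python
squares = {x: x ** 2 for x in range(4)}   # {0: 0, 1: 1, 2: 4, 3: 9}

has_two = 2 in squares                    # membership test instead of has_key
keys = sorted(squares.keys())             # turn the key view into a sorted list
popped = squares.pop(3)                   # returns the value and removes key 3
del squares[0]                            # removes key 0 without returning anything
squares.update({10: 100})                 # adds the entries of another dict
```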

That covers the basic data structures in "raw" Python. It is very interesting that the Python language makes full use of the keyboard's brackets: () for tuples, [] for lists, and {} for dicts.

By the way, to loop over a sequence together with an index, one can use enumerate() to generate an enumerate object:

for i, j in enumerate(['this', 'is', 'a', 'good', 'day']):
    print i, j

The generated enumerate object can be converted directly into a list/tuple/dict by using the list()/tuple()/dict() functions.
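For instance, a small sketch of the list() and dict() conversions on the same word list:

```python
words = ['this', 'is', 'a', 'good', 'day']

as_list = list(enumerate(words))   # list of (index, word) tuples
as_dict = dict(enumerate(words))   # dict mapping index -> word
```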

Python Study Note: Data Structure - Set

"A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference."

That is pretty much all there is to a "set". A set is a simple object that can be very helpful as a "check-existence" or "eliminate duplicates" tool.

1. Generate set
1.1 From List or tuple
a = [1,2,3]; b = (4,5,6); c = set(a); c = set(b)

Here is one interesting fact: the elements of a set must be "hashable" (which means "An object is hashable if it has a hash value which never changes during its lifetime"). Since a list's contents can change, a list is not hashable, so one cannot include a list as an element of a set; but one can include a tuple, as long as the tuple itself contains only hashable objects.
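The hashability requirement is easy to try out. A minimal sketch:

```python
points = set()
points.add((1, 2))            # a tuple of ints is hashable, so this works

try:
    points.add([1, 2])        # a list is mutable and therefore unhashable
    list_accepted = True
except TypeError:
    list_accepted = False     # Python refuses with a TypeError
```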

2. Usage
2.1 "check-existence"
a = range(10); b = set(a);
10 in b --- returns False, because 10 is not in the set
2.2 "eliminate duplicates"
a = [1,2,3,1,2,5,2,10]; b = set(a); c = list(b);

3. Evaluation
3.1 len(s); x in s; x not in s; --- whether element inside set
3.2 s.issubset(t); s.issuperset(t) --- whether set s is sub/super set of t
3.3 s.union(t) --- equivalent to s | t
3.4 s.intersection(t) --- equivalent to s & t
3.5 s.difference(t) --- equivalent to s - t
3.6 s.symmetric_difference(t) --- new set with elements in either s or t but not both
3.7 s.copy() --- new set with a shallow copy of s

4. Operator
4.1 s.update(t) --- update set s in place, adding elements from t
4.2 s.difference_update(t) --- update set s in place, removing elements found in t
4.3 s.intersection_update(t) --- update set s in place, keeping only elements also found in t
4.4 s.symmetric_difference_update(t) --- update set s in place, keeping elements found in either s or t but not both
4.5 s.add(x) --- add element x to set s
4.6 s.remove(x); s.discard(x) --- remove element x; if there is no such element, remove raises a KeyError, while discard does nothing
4.7 s.pop() --- remove and return an arbitrary element from s; raises a KeyError if s is empty
4.8 s.clear() --- remove all elements from a set
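A sketch contrasting the operators that build a new set with the methods that change a set in place. The set contents are arbitrary:

```python
s = {1, 2, 3}
t = {3, 4}

union = s | t            # new set, same as s.union(t)
common = s & t           # new set, same as s.intersection(t)
only_s = s - t           # new set, same as s.difference(t)
either = s ^ t           # new set, same as s.symmetric_difference(t)

result = s.update(t)     # changes s in place; the return value is None
s.discard(100)           # missing element: silently ignored

try:
    s.remove(100)        # missing element: raises KeyError
    removed = True
except KeyError:
    removed = False
```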

Python Study Note: Data Structure - Tuple

In my opinion, a tuple is similar to a list, except that it is "immutable", which means its elements cannot be changed.

For example,

a = [1,2,3] is a list

b = (1,2,3) is a tuple

We can run: a[0] = 1000;

However, we cannot run b[0] = 1000;

Basically, once the tuple is set, it is set!

If we try to understand lists and tuples in C terms, a list variable name is more like a pointer, while a tuple is more like a well-structured box.

1. Why is a list like a pointer?

a = [1,2,3];

b = a;

a[1] = 1000;

then b's value also changes to [1,1000,3].

2. List in Tuple

a = [1,2,3]; b = [4,5,6]; c = (a,b);

a[0] = 1000;

then the tuple's value also changes to ([1000,2,3], [4,5,6]). This is probably because the tuple holds fixed references (pointers) to the lists, while the contents they point to can still change.

If we want to make a "real" copy of one list, we can do:

a = [1,2,3]; b = list(a);
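One caveat worth sketching: list(a) copies only the outer list (a shallow copy). If the list contains other lists, those inner lists are still shared, and the standard library's copy.deepcopy is one way to copy them too:

```python
import copy

a = [1, 2, 3]
alias = a                 # second name for the same list
shallow = list(a)         # new outer list with the same elements
a[0] = 1000               # seen through alias, not through shallow

nested = [[1, 2], [3, 4]]
shallow_nested = list(nested)          # inner lists are still shared
deep_nested = copy.deepcopy(nested)    # inner lists are copied as well
nested[0][0] = 1000       # seen through shallow_nested, not deep_nested
```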

Done ~

Python Study Note: Data Structure - List

List is one of the most powerful structures in Python. It is defined as "a container that holds a number of other objects, in a given order". So it is quite general and extendable.

For List type, there are several aspects we need to know:

1. Generation
1.1. Direct way
a = []; b = [1,2,3]; c = ['today', 'is', 1, 'of', 'the', 0.5, 'day'];
b = list((1,2,3));
1.2. Sequence generation
a = [x**2 for x in range(10)]
1.3 Convert from other data structures
a = (1,2,3); b = list(a)

2. Access
2.1. Direct way
a = range(10); b = a[1:5]
2.2 Sequence way
a = range(100); b = [1,2,3,10,40]; c = [a[i] for i in b];

3. Modification (list-level)
3.1 list.append() --- add item to the list
a.append(10);
3.2 list.extend() --- add elements from list to the list
a.extend([1,2,3]);
3.3 list.insert(i, x) --- add item to the list on specific location
a.insert(len(a), 10) is equivalent to a.append(10)
3.4 list.remove(x) --- remove the first item from the list whose value is x; raises a ValueError if there is no such item.
3.5 list.pop([i]) --- Remove the item at the given position in the list, and return it. If no index is specified, a.pop() removes and returns the last item in the list.
3.6 list.index(x) --- Return the index in the list of the first item whose value is x; raises a ValueError if there is no such item
3.7 list.count(x) --- Return the number of times x appears in the list; returns 0 if there is no such item
3.8 list.sort() or list.sort(reverse = True) or list.reverse() --- sort the list ...
3.9 del statement --- delete either any element (del(a[0])) or the whole list! (del a). Powerful, yet dangerous :)

4. Nested List
If one generates a nested list like this:
a = [[x,y] for x in range(10) for y in range(5)]
then each individual entry is accessed as:
a[i][j]

Some additional useful functions on List:
1. How to merge two lists?
c = a + b --- very straightforward; remember there is no a - b operation

NOTE: A list is different from a matrix or vector in Matlab: one cannot directly run element-wise mathematical operations on it, such as a = a + 1.
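A short sketch of what happens with a + 1 on a list, and of one common workaround using a list comprehension:

```python
a = [1, 2, 3]

try:
    a + 1                         # a list can only be concatenated with a list
    addition_worked = True
except TypeError:
    addition_worked = False

plus_one = [x + 1 for x in a]     # element-wise arithmetic via a comprehension
merged = a + [4, 5]               # "+" between two lists concatenates them
```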

NOTE: Two derived structures, Queue and Stack, can be built from a list, as in the following examples:
1. Queue
a = range(10); a.append('new-entry'); a.pop(0);
2. Stack
a = range(10); a.append('new-entry'); a.pop();
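The list-based queue works, but a.pop(0) has to shift every remaining element. As an alternative (not part of the notes above), the standard library's collections.deque removes items from the front in constant time. A sketch for Python 3:

```python
from collections import deque

# Queue: first in, first out.
queue = deque(range(3))
queue.append('new-entry')      # enqueue at the right end
first = queue.popleft()        # dequeue from the left end

# Stack: last in, first out. A plain list is fine for this.
stack = list(range(3))
stack.append('new-entry')      # push
last = stack.pop()             # pop the most recently added item
```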

Python Study Note: "Help-like" command & Data Structures Intro.

In the "raw" Python language, there are quite a few commands that help me understand what is going on in the session; these commands act like "help" functions for me. They are:

1. dir
Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.
So, if I have no idea which variables exist in the current scope, I run "dir()"; if I forget what attributes a variable has, I run "dir(variable_name)". Quite handy.

2. type
With one argument, return the type of an object. The return value is a type object.
Many times I forget the type of some variable, and so do not know which functions can be applied to it. type() helps a lot with this.
Ex. >> type('') is str

The basic data structures are the basis for any "fancy" structures (pandas.DataFrame, etc.). They are:
1. Basic: int, float, string
2. List
3. Tuple
4. Dictionary
5. Set
6. Sequence

The data structures are quite rich, and I will write about them in separate posts.

Python Study Note: Raw Python -> Numpy/SciPy -> Pandas

After using so many different software packages and programming languages, I have found it quite challenging to switch from one language mindset to another. For example, these days I use Python the most; after a while I may need to switch to Matlab for a specific programming assignment, and a week later go back to Python for daily work... The barriers to switching quickly between languages are their syntax and common functions, so it is better to write them down in case I need to make the switch again :)

I like Python very much. It is very useful for data analysis and for applying machine learning. Better to make sure I can dive back into Python programming anytime in the future ~

High-level Python data analysis tools, such as pandas, are built on lower-level packages, such as NumPy and SciPy, and those packages are in turn built on the basic syntax of Python itself... what a hierarchical structure. To use pandas smoothly, it is important to fix the low-level concepts firmly in my mind.