Monday, March 17, 2014

Hadoop Study Note: Hive operations

In the big data era, hadoop is the main-stream architecture to organize and manage data analysis. Hive is a great tool designed for data warehouse building on top of hadoop, and it is very helpful to provide data summarization, query and analysis. Here, I document three main operations in Hive (according to the Apache Hive website), they are: Data Definition Language (DDL), Data Manipulation Language (DML), and SQL operation.

The interaction between DDL, DML and SQL is: DDL defines a meta-structure for tables, DML helps to manage/load the data into the tables, and SQL queries the tables to get insightful analysis.

Part 1. DDL Operation

  • Create/Drop/Alter Database
  • Create/Drop/Truncate Table
  • Alter Table/Partition/Column
  • Create/Drop/Alter View
  • Create/Drop/Alter Index
  • Create/Drop Function
  • Create/Drop/Grant/Revoke Roles and Privileges
  • Show
  • Describe

Part 2. DML Operation

  • Loading files into tables
  • Inserting data into Hive Tables from queries
  • Writing data into the filesystem from queries
There are two primary ways of modifying data in Hive:
  • LOAD
  • INSERT

Part 3. SQL Operation

    This is quite similar as SQL language, basically it is based on the Select Syntax
  • WHERE Clause
  • ALL and DISTINCT Clauses
  • Partition Based Queries
  • HAVING Clause
  • LIMIT Clause
  • REGEX Column Specification
    More about Select Syntax are:
  • GROUP BY
  • SORT BY, ORDER BY, CLUSTER BY, DISTRIBUTE BY
  • JOIN
  • UNION ALL
  • TABLESAMPLE
  • Subqueries
  • Virtual Columns
  • Operators and UDFs
  • LATERAL VIEW
  • Windowing, OVER, and Analytics
  • Common Table Expressions

1 comment: