Top Most IT Courses: Hadoop BigData

Hadoop Big data Commands:

User Commands

• archive
• distcp
• fs
• fsck
• fetchdt
• jar
• job
• pipes
• queue
• version
• CLASSNAME
• classpath

Administration Commands

• balancer
• daemonlog
• datanode
• dfsadmin
• mradmin
• jobtracker
• namenode
• secondarynamenode
• tasktracker

Overview

All hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop has an option parsing framework that handles generic options as well as running classes.

Generic Options

Generic options are supported by dfsadmin, fs, fsck, job and fetchdt. Applications should implement Tool to support generic options.

User Commands

Commands useful for users of a hadoop cluster.

`archive`

Creates a hadoop archive. More information can be found at Hadoop Archives.

Usage: hadoop archive -archiveName NAME <src>* <dest>

`distcp`

Copies files or directories recursively. More information can be found at Hadoop DistCp Guide.

Usage: hadoop distcp <srcurl> <desturl>

`fs`

Runs a generic filesystem user client. Deprecated, use hdfs dfs instead.

Usage: hadoop fs [GENERIC_OPTIONS] [COMMAND_OPTIONS]

The various COMMAND_OPTIONS can be found at File System Shell Guide.

`fsck`

Runs an HDFS filesystem checking utility. See Fsck for more info.

Usage: hadoop fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

`fetchdt`

Gets a Delegation Token from a NameNode. See fetchdt for more info.
Usage: `hadoop fetchdt [GENERIC_OPTIONS] [--webservice <namenode_http_addr>] <path>`

`jar`

Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: `hadoop jar <jar> [mainClass] args...`
Streaming jobs are run via this command; see the Streaming examples.
The word count example is also run using the jar command; see the Wordcount example.

`job`

Command to interact with Map Reduce Jobs.

Usage: `hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]`

`pipes`

Runs a pipes job.

Usage: `hadoop pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>]`

`queue`

Command to interact with and view Job Queue information.

Usage: `hadoop queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]`

`version`

Prints the version.
Usage: `hadoop version`

`CLASSNAME`

The hadoop script can be used to invoke any class.
Usage: `hadoop CLASSNAME`
Runs the class named `CLASSNAME`.

`classpath`

Prints the class path needed to get the Hadoop jar and the required libraries.
Usage: `hadoop classpath`

Administration Commands

Commands useful for administrators of a hadoop cluster.

`balancer`

Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Rebalancer for more details.

Usage: `hadoop balancer [-threshold <threshold>]`

Big Data Architecture:

Most Big Data projects use variations of a Big Data reference architecture. Understanding the high-level view of this reference architecture provides a good background for understanding Big Data and how it complements existing analytics, BI, databases and systems. This architecture is not a fixed, one-size-fits-all approach. Each component of the architecture has several alternatives, each with its own advantages and disadvantages for a particular workload. Companies often start with a subset of the patterns in this architecture, and as they realize value from the insight gained into key business outcomes, they expand the breadth of use.

NameNode and DataNodes:
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a
master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is split into one or more
blocks and these blocks are stored in a set of DataNodes. The NameNode executes file
system namespace operations like opening, closing, and renaming files and directories. It
also determines the mapping of blocks to DataNodes. The DataNodes are responsible for
serving read and write requests from the file system’s clients. The DataNodes also perform
block creation, deletion, and replication upon instruction from the NameNode.
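
The division of labor described above can be pictured as two lookup tables kept by the NameNode: one from file path to block list, and one from block to the DataNodes holding it. The following is only a minimal sketch of that idea; the class `ToyNameNode` and its method names are hypothetical and are not part of the HDFS API.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of NameNode metadata; not the real HDFS implementation."""

    def __init__(self):
        self.file_blocks = {}                     # path -> ordered list of block ids
        self.block_locations = defaultdict(set)   # block id -> DataNode names

    def add_block(self, path, block_id, datanodes):
        # Record a new block for the file and where its replicas live.
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id].update(datanodes)

    def rename(self, old_path, new_path):
        # A rename touches only the namespace; no block data moves.
        self.file_blocks[new_path] = self.file_blocks.pop(old_path)

    def locate(self, path):
        # Clients ask where blocks live, then read from DataNodes directly.
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]
```

In this model a rename changes one dictionary key, which mirrors the point that namespace operations are handled by the NameNode while the block data stays on the DataNodes.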

The NameNode and DataNode are pieces of software designed to run on commodity
machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built
using the Java language; any machine that supports Java can run the NameNode or the
DataNode software. Usage of the highly portable Java language means that HDFS can be
deployed on a wide range of machines. A typical deployment has a dedicated machine that
runs only the NameNode software. Each of the other machines in the cluster runs one
instance of the DataNode software. The architecture does not preclude running multiple
DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the
system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

The File System Namespace:

HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is
similar to most other existing file systems; one can create and remove files, move a file from
one directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS
does not support hard links or soft links. However, the HDFS architecture does not preclude
implementing these features.
The NameNode maintains the file system namespace. Any change to the file system
namespace or its properties is recorded by the NameNode. An application can specify the
number of replicas of a file that should be maintained by HDFS. The number of copies of a
file is called the replication factor of that file. This information is stored by the NameNode.

Data Replication:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed later. Files in
HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a
Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of
all blocks on a DataNode.
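
One way to picture how Heartbeats and Blockreports feed replication decisions is the sketch below. It counts, for each block, the replicas held by DataNodes whose last Heartbeat is recent enough. The function name, the data layout and the `timeout` value are assumptions made for illustration; they are not HDFS defaults or APIs.

```python
def under_replicated(block_reports, last_heartbeat, now, target, timeout=30.0):
    """Return {block id: live replica count} for blocks below the target.

    block_reports:  DataNode name -> set of block ids it reported
    last_heartbeat: DataNode name -> time of its last Heartbeat (seconds)
    timeout:        illustrative value only, not an HDFS default
    """
    # A DataNode counts as live only if it sent a Heartbeat recently.
    live = {dn for dn, t in last_heartbeat.items() if now - t <= timeout}
    counts = {}
    for dn, blocks in block_reports.items():
        for block in blocks:
            counts.setdefault(block, 0)
            if dn in live:
                counts[block] += 1
    return {b: n for b, n in counts.items() if n < target}
```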

Replica Placement: The First Baby Steps:

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature that
needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is
to improve data reliability, availability, and network bandwidth utilization. The current
implementation for the replica placement policy is a first effort in this direction. The
short-term goals of implementing this policy are to validate it on production systems, learn
more about its behavior, and build a foundation to test and research more sophisticated
policies.
Large HDFS instances run on a cluster of computers that is commonly spread across many
racks. Communication between two nodes in different racks has to go through switches. In
most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in
Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique
racks. This prevents losing data when an entire rack fails and allows use of bandwidth from
multiple racks when reading data. This policy evenly distributes replicas in the cluster which
makes it easy to balance load on component failure. However, this policy increases the cost
of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put
one replica on one node in the local rack, another on a node in a different (remote) rack, and
the last on a different node in the same remote rack. This policy cuts the inter-rack write
traffic which generally improves write performance. The chance of rack failure is far less
than that of node failure; this policy does not impact data reliability and availability
guarantees. However, it does reduce the aggregate network bandwidth used when reading
data since a block is placed in only two unique racks rather than three. With this policy, the
replicas of a file do not evenly distribute across the racks. One third of replicas are on one
node, two thirds of replicas are on one rack, and the other third are evenly distributed across
the remaining racks. This policy improves write performance without compromising data
reliability or read performance.
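
The three-replica policy described above can be sketched as follows. This is a simplified illustration, not the HDFS placement code: the `topology` layout and the function name are hypothetical, the writer is assumed to run on a DataNode, and the first replica is placed on the writer's own node as one possible "node in the local rack".

```python
import random


def choose_replicas(topology, writer, rng=random):
    """Pick three DataNodes for a block following the policy described above.

    topology: rack id -> list of node names (hypothetical layout)
    writer:   node name of the client, assumed to be a DataNode
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # The remote rack must hold both the second and the third replica.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]
```

Only one inter-rack transfer is needed to reach the remote rack; the third replica is then copied between two nodes inside that rack, which is where the saving in inter-rack write traffic comes from.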

The current, default replica placement policy described here is a work in progress.

Safemode:

On startup, the NameNode enters a special state called Safemode. Replication of data blocks
does not occur when the NameNode is in the Safemode state. The NameNode receives
Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of
data blocks that a DataNode is hosting. Each block has a specified minimum number of
replicas. A block is considered safely replicated when the minimum number of replicas of
that data block has checked in with the NameNode. After a configurable percentage of safely
replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the
NameNode exits the Safemode state. It then determines the list of data blocks (if any) that
still have fewer than the specified number of replicas. The NameNode then replicates these
blocks to other DataNodes.
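
The exit condition can be expressed as a small check. The sketch below is an illustration under assumptions: the function names and the `threshold` parameter (a fraction between 0 and 1) are hypothetical, and the additional 30-second wait is not modeled.

```python
def can_leave_safemode(reported, min_replicas, threshold):
    """reported: block id -> number of replicas that have checked in."""
    if not reported:
        return True  # no blocks to wait for
    safe = sum(1 for n in reported.values() if n >= min_replicas)
    return safe / len(reported) >= threshold


def needs_replication(reported, target):
    # Blocks that still have fewer replicas than the specified number.
    return sorted(b for b, n in reported.items() if n < target)
```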

The Communication Protocols:

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client
establishes a connection to a configurable TCP port on the NameNode machine. It talks the
ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the
DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client
Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs.
Instead, it only responds to RPC requests issued by DataNodes or clients.

Data Integrity:

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption
can occur because of faults in a storage device, network faults, or buggy software. The HDFS
client software implements checksum checking on the contents of HDFS files. When a client
creates an HDFS file, it computes a checksum of each block of the file and stores these
checksums in a separate hidden file in the same HDFS namespace. When a client retrieves
file contents it verifies that the data it received from each DataNode matches the checksum
stored in the associated checksum file. If not, then the client can opt to retrieve that block
from another DataNode that has a replica of that block.
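
The verify-then-fall-back behavior can be illustrated with a short sketch. CRC32 from Python's `zlib` module is used here purely as a stand-in checksum, and the function name and the list-of-bytes representation of replicas are assumptions, not the HDFS client implementation.

```python
import zlib


def read_block(replicas, expected_crc):
    """Return the first replica whose CRC32 matches the stored checksum.

    replicas: list of byte strings, one per DataNode holding the block
    """
    for data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data
        # Mismatch: treat this copy as corrupt and try the next DataNode.
    raise IOError("all replicas failed checksum verification")
```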

Data Organization

Data Blocks:

HDFS is designed to support very large files. Applications that are compatible with HDFS
are those that deal with large data sets. These applications write their data only once but they
read it one or more times and require these reads to be satisfied at streaming speeds. HDFS
supports write-once-read-many semantics on files. A typical block size used by HDFS is 64
MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will
reside on a different DataNode.
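
As a quick worked example of the chunking arithmetic, the sketch below lists the offset and length of each block for a file of a given size. The function name is hypothetical; the 64 MB default simply follows the typical block size mentioned above.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs for each block of a file."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]
```

For a 200 MB file this gives three full 64 MB blocks followed by one 8 MB block, which matches the earlier statement that all blocks except the last are the same size.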

Staging:

A client request to create a file does not reach the NameNode immediately. In fact, initially
the HDFS client caches the file data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the local file accumulates data
worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts
the file name into the file system hierarchy and allocates a data block for it. The NameNode
responds to the client request with the identity of the DataNode and the destination data
block. Then the client flushes the block of data from the local temporary file to the specified
DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the NameNode that the file is closed. At
this point, the NameNode commits the file creation operation into a persistent store.
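
The client-side staging behavior can be sketched as a buffer that ships data one block at a time. This is a toy model under assumptions: `StagingWriter` and the `flush_block` callback are hypothetical names, an in-memory buffer stands in for the temporary local file, and the NameNode interaction is left out.

```python
class StagingWriter:
    """Toy client-side staging buffer; flush_block stands in for the DataNode transfer."""

    def __init__(self, block_size, flush_block):
        self.block_size = block_size
        self.flush_block = flush_block  # hypothetical callback: bytes -> None
        self.buffer = bytearray()

    def write(self, data):
        self.buffer.extend(data)
        # Ship every full block as soon as enough data has accumulated.
        while len(self.buffer) >= self.block_size:
            self.flush_block(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def close(self):
        # Remaining un-flushed data goes out as a final, shorter block.
        if self.buffer:
            self.flush_block(bytes(self.buffer))
            self.buffer.clear()
```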

FS Shell:

HDFS allows user data to be organized in the form of files and directories. It provides a
command-line interface called FS shell that lets a user interact with the data in HDFS. The
syntax of this command set is similar to other shells (e.g. bash, csh) that users are already
familiar with. Here are some sample action/command pairs:
Action                                                 Command
Create a directory named /foodir                       bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                       bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt   bin/hadoop dfs -cat /foodir/myfile.txt

DFSAdmin:

The DFSAdmin command set is used for administering an HDFS cluster. These are
commands that are used only by an HDFS administrator. Here are some sample
action/command pairs:
Action                                     Command
Put the cluster in Safemode                bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes               bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)   bin/hadoop dfsadmin -refreshNodes

Space Reclamation:

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS first renames it to a file in the /trash directory. The file can be restored
quickly as long as it remains in /trash. A file remains in /trash for a configurable
amount of time. After the expiry of its life in /trash, the NameNode deletes the file from
the HDFS namespace. The deletion of a file causes the blocks associated with the file to be
freed. Note that there could be an appreciable time delay between the time a file is deleted by
a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory. To
undelete a file, the user can navigate the /trash directory and retrieve the file. The /trash
directory contains only the latest copy of the file that was deleted. The /trash directory is
just like any other directory with one special feature: HDFS applies specified policies to
automatically delete files from this directory. The current default policy is to delete files
from /trash that are more than 6 hours old. In the future, this policy will be configurable
through a well defined interface.
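
The expiry policy amounts to a simple age check over the entries in /trash. The sketch below is illustrative only; the function name and the mapping from path to deletion time are assumptions, and the 6-hour default follows the policy stated above.

```python
def expired_trash(entries, now, max_age_hours=6):
    """entries: path -> deletion time in seconds; returns paths due for removal."""
    limit = max_age_hours * 3600
    return sorted(p for p, t in entries.items() if now - t > limit)
```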

BigData References:

HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
HDFS source code: http://hadoop.apache.org/hdfs/version_control.html

Big data Course Objective Summary:

• Introduction to Big Data and Analytics
• Introduction to Hadoop
• Hadoop ecosystem - Concepts
• Hadoop Map-reduce concepts and features
• Developing the map-reduce Applications
• Pig concepts
• Hive concepts
• Sqoop concepts
• Flume Concepts
• Oozie workflow concepts
• Impala Concepts
• Hue Concepts
• HBASE Concepts
• ZooKeeper Concepts
• Real Life Use Cases
