Thursday 24 May 2012

Hadoop – HBase Installation On Windows


Hadoop and HBase are designed to run on Linux, but they can also be installed on Windows. Here I will explain how to install Hadoop and HBase on Windows.

SOFTWARE INSTALLATION

1. Prerequisite

1. Hadoop-0.19.1
2. HBase-0.19.0
3. Cygwin
4. Java 1.6
5. Eclipse Europa 3.3.2

2. Downloads


Software Links

a. Hadoop-0.19.1: http://hadoop.apache.org/core/releases.htm
b. Hbase-0.19.0: http://hadoop.apache.org/hbase/releases.html
c. Cygwin: http://www.cygwin.com/
d. Java 1.6: http://java.sun.com/javase/downloads/index.jsp

e. Eclipse Europa 3.3.2: http://www.eclipse.org/downloads/packages/release/europa/winter, http://archive.eclipse.org/eclipse/downloads/

3. Creating a Local Account (optional; it is better to have a separate local account for Hadoop)

1. Right-click on My Computer, click Manage, and create a new user account named ‘HadoopAdmin’ with a password of your choice.
2. Go to Local Users and Groups -> Users -> HadoopAdmin -> Properties and give the account administrator rights.
3. Log off and log in with the HadoopAdmin account.


4. Install Cygwin

1. Download Cygwin installer.
2. Run the downloaded file.
3. Keep pressing the 'Next' button until you see the package selection screen.
4. Click the little View button to switch to the "Full" view.
5. Find the package "openssh" and click the word "skip" so that it changes to a version number, marking the package for installation.
6. After you have selected the package, press the 'Next' button to complete the installation.

7. Set Environment Variables
i. Find "My Computer" icon either on the desktop or in the start menu, right-click on it and select Properties item from the menu.
ii. When you see the Properties dialog box, click on the Environment Variables button
iii. When the Environment Variables dialog shows up, click on the Path variable located in the System Variables box and then click the Edit button.
iv. When the Edit dialog appears, append the following text to the end of the Variable value field: “;C:\cygwin\bin;C:\cygwin\usr\bin”
v. Close all three dialog boxes by pressing the OK button in each dialog box.
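
To confirm that the PATH change took effect, you can open a new Command Prompt window and run a quick sanity check; this assumes the default C:\cygwin installation directory used above:

C:\> ssh -V
C:\> bash --version

Each command should print a version string rather than a "not recognized" error.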


5. Setup SSH daemon

Both Hadoop scripts and Eclipse plug-in need password-less SSH to operate. This section describes how to set it up in the Cygwin environment.

5.1 Configure ssh daemon

1. Open the Cygwin command prompt.
2. Execute the following command: “ssh-host-config”
3. When asked if privilege separation should be used, answer no.
4. When asked if sshd should be installed as a service, answer yes.
5. When asked about the value of the CYGWIN environment variable, enter ntsec.

5.2 Start SSH daemon

Find My Computer icon either on your desktop or in the start-up menu, right-click on it and select Manage from the context menu.

1. Open Services and Applications in the left-hand panel then select the Services item.
2. Find the CYGWIN sshd item in the main section and right-click on it.
3. Select Start from the context menu.

5.3 Setup authorization keys

Eclipse plug-in and Hadoop scripts require ssh authentication to be performed through authorization keys rather than passwords. The following steps describe how authorization keys are set up.

1. Open Cygwin command prompt
2. Execute the following command to generate keys: “ssh-keygen”
3. When prompted for filenames and pass phrases press ENTER to accept default values.
4. After the command has finished generating keys, enter the following command to change into your .ssh directory: “cd ~/.ssh”
5. Check if the keys were indeed generated by executing the following command: “ls -l”
6. You should see two files id_rsa.pub and id_rsa with recent creation dates. These files contain authorization keys.
7. To register the new authorization keys, enter the following command (note the double greater-than signs “>>”, which append to the file; they are very important): “cat id_rsa.pub >> authorized_keys”
8. Now check if the keys were set up correctly by executing the following command: “ssh localhost”
Since it is a new ssh installation, you will be warned that authenticity of the host could not be established and will be asked whether you really want to connect. Answer yes and press ENTER. You should see the Cygwin prompt again, which means that you have successfully connected.
9. Now execute the command again: “ssh localhost”. This time you should not be prompted for anything.


6. Install java

1. Download Java 1.6 installer.
2. Run the downloaded file.
3. Change the installation path to: “C:\cygwin\home\HadoopAdmin\java”
4. Click finish when installation is complete.


7. Download and Extract Hadoop-0.19.1 and HBase-0.19.0

1. Download hadoop-0.19.1.tar.gz and hbase-0.19.0.tar.gz and place in some folder on your computer.
2. Right click on them and click Extract Files.
3. Give the destination path as: “C:\cygwin\home\HadoopAdmin”


CONFIGURE HADOOP

1. Supported modes

Hadoop runs in one of three modes:

* Standalone: All Hadoop functionality runs in one Java process. This works “out of the box” and is trivial to use on any platform, Windows included.
* Pseudo-Distributed: Hadoop functionality all runs on the local machine but the various components will run as separate processes. This is much more like “real” Hadoop and does require some configuration as well as SSH. It does not, however, permit distributed storage or processing across multiple machines.
* Fully Distributed: Hadoop functionality is distributed across a “cluster” of machines. Each machine participates in somewhat different (and occasionally overlapping) roles. This allows multiple machines to contribute processing power and storage to the cluster.

This post focuses on the Fully Distributed mode of Hadoop. If you want to try the other modes, the Hadoop Quick Start guide can get you started with the Standalone and Pseudo-Distributed modes.


2. Configure your hosts file (All machines)

Open your Windows hosts file located at c:\windows\system32\drivers\etc\hosts (the file is named “hosts” with no extension) in a text editor and add the following lines (replacing the NNNs with the IP addresses of both master and slave):
NNN.NNN.NNN.NNN master
NNN.NNN.NNN.NNN slave

Save the file. This step isn’t strictly necessary, but it will make life easier if your computers’ IP addresses change.
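
For example, assuming the master machine is at 192.168.1.10 and the slave at 192.168.1.11 (these addresses are purely illustrative; substitute your own), the added lines would look like this:

192.168.1.10 master
192.168.1.11 slave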


3. Generate public/private key pairs (All Machines)

Hadoop uses SSH to allow the master computer(s) in a cluster to start and stop processes on the slave computers. It supports several modes of secure authentication: you can use passwords, or you can use public/private keys to connect without passwords (“passwordless”).

1. To generate a key pair, open Cygwin and issue the following commands ($> is the command prompt):

a) $> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

b) $> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

2. Now, you should be able to SSH into your local machine using the following command:
$> ssh localhost

3. When prompted for your password, enter it. You’ll see something like the following in your Cygwin terminal.

HadoopAdmin@localhost's password:

Last login: Mon Apr 6 11:44:01 2009 from master

4. To quit the SSH session and go back to your regular terminal, use:

$> exit

Note: Make sure to do this on all computers in your cluster.


4. Exchange public keys

Now that you have public and private key pairs on each machine in your cluster, you need to share your public keys around to permit passwordless login from one machine to the other. Once a machine has a public key, it can safely authenticate a request from a remote machine that is encrypted using the private key that matches that public key.

On the master, issue the following command in Cygwin (where <username> is the username you use to log in to Windows on the slave computer):

$> scp ~/.ssh/id_dsa.pub <username>@slave:~/.ssh/master-key.pub

Example:

$> scp ~/.ssh/id_dsa.pub HadoopAdmin@slave:~/.ssh/master-key.pub

Enter your password when prompted. This will copy your public key file in use on the master to the slave.

On the slave, issue the following command in cygwin:

$> cat ~/.ssh/master-key.pub >> ~/.ssh/authorized_keys

This will append your public key to the set of authorized keys the slave accepts for authentication purposes.

Back on the master, test this out by issuing the following command in cygwin:

$> ssh HadoopAdmin@slave

If all is well, you should be logged into the slave computer with no password required.

Note: Repeat this process in reverse, copying the slave’s public key to the master. Also, make sure to exchange public keys between the master and any other slaves that may be in your cluster.
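
As a concrete sketch of that reverse direction (assuming the same HadoopAdmin account name is used on both machines), you would run the following on the slave:

$> scp ~/.ssh/id_dsa.pub HadoopAdmin@master:~/.ssh/slave-key.pub

then run this on the master:

$> cat ~/.ssh/slave-key.pub >> ~/.ssh/authorized_keys

and finally test from the slave with “ssh HadoopAdmin@master”.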

Remarks: It is better to run the Hadoop NameNode/JobTracker and the HBase Master on the same node. This reduces the amount of key exchange needed, because with two separate master nodes you would have to repeat the process twice.

5. Configure hadoop-env.sh (All Machines)

The conf/hadoop-env.sh file is a shell script that sets up various environment variables that Hadoop needs to run.

1. Open hadoop-0.19.1 folder
2. Open conf/hadoop-env.sh in a text editor. Look for the line that starts with “#export JAVA_HOME”.
3. Change that line to something like the following:
“export JAVA_HOME=/home/HadoopAdmin/Java/jdk1.6.0_11”
Note: This should be the home directory of your Java installation, and you need to remove the leading “#” (comment) symbol.
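
A quick way to confirm that the path is correct is to run the java binary directly from Cygwin; this assumes the installation directory chosen in the Java step above (adjust the jdk1.6.0_11 folder name to match whatever your installer actually created):

$> /home/HadoopAdmin/Java/jdk1.6.0_11/bin/java -version

If this prints the Java version, the same path will work for JAVA_HOME.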


6. Configure hadoop-site.xml (All Machines)

The conf/hadoop-site.xml file is basically a properties file that lets you configure all sorts of HDFS and MapReduce parameters on a per-machine basis.

1. Open hadoop-0.19.1 folder
2. Open conf/hadoop-site.xml in a text editor.
3. Insert the following lines between the <configuration> and </configuration> tags.

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
  <final>true</final>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
  <final>true</final>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
  <final>true</final>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <final>true</final>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
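
Hadoop will normally create these local directories itself (the NameNode format step creates dfs.name.dir, and the DataNodes create dfs.data.dir on startup), but if you prefer to pre-create the paths referenced above you can do so from Cygwin, for example:

$> mkdir -p /home/hadoop/dfs/data /home/hadoop/dfs/name /tmp/hadoop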

7. Important Directories


1. fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode. The DataNode instances will register with this NameNode, and make their data available through it. Individual client programs will connect to this address to retrieve the locations of actual file blocks.
2. dfs.data.dir - This is the path on the local file system in which the DataNode instance should store its data. It is not necessary that all DataNode instances store their data under the same local path prefix, as they will all be on separate machines; it is acceptable that these machines are heterogeneous. However, it will simplify configuration if this directory is standardized throughout the system. By default, Hadoop will place this under /tmp. This is fine for testing purposes, but is an easy way to lose actual data in a production system, and thus must be overridden.
3. dfs.name.dir - This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. It is only used by the NameNode instance to find its information, and does not exist on the DataNodes. The caveat above about /tmp applies to this as well; this setting must be overridden in a production system.
4. dfs.replication - This is the default replication factor for each block of data in the file system. For a production cluster, this should usually be left at its default value of 3.
5. mapred.system.dir - Hadoop's default installation is designed for standalone operation, which does not use HDFS, so it conflates HDFS and local file system paths. When HDFS is enabled, however, MapReduce will store shared information about jobs in mapred.system.dir on the DFS.


8. Configure slaves file (Master only)

The conf/slaves file tells the master where it can find slaves to do work.

1. Open hadoop-0.19.1 folder

2. Open conf/slaves in a text editor. It will probably have one line which says “localhost”.

3. Replace that with the following:
master
slave


9. Format the namenode

The next step is to format the NameNode, which creates the Hadoop Distributed File System (HDFS).

1. Open a new Cygwin window.

2. Execute the following commands:

a. cd hadoop-0.19.1

b. mkdir logs

c. bin/hadoop namenode -format


10. Starting your cluster (Master Only)

1. Open a new Cygwin window.
2. To fully start your cluster, execute the following commands:
1. cd hadoop-0.19.1
2. $ bin/start-all.sh

Browse the web interface for the NameNode and the JobTracker; they are available at:

* NameNode - http://master:50070/
* JobTracker - http://master:50030/
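
Besides the web interfaces, you can check which daemons are actually running with the JDK's jps tool (run it from the JDK bin directory, e.g. /home/HadoopAdmin/Java/jdk1.6.0_11/bin/jps, or add that directory to your PATH); on the master you would expect to see processes such as NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker. You can also ask HDFS for a cluster summary from the hadoop-0.19.1 folder:

$> bin/hadoop dfsadmin -report

This should list the DataNodes that have registered with the NameNode.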

11. Stop your cluster

When you're done, stop the daemons with:

$ bin/stop-all.sh


CONFIGURE HBASE


1. Configure hbase-env.sh (All Machines)

The conf/hbase-env.sh file is a shell script that sets up various environment variables that HBase needs to run.

1. Open hbase-0.19.0 folder

2. Open conf/hbase-env.sh in a text editor. Look for the line that starts with “#export JAVA_HOME”.

3. Change that line to something like the following:

“export JAVA_HOME=/home/HadoopAdmin/Java/jdk1.6.0_11”

Note: This should be the home directory of your Java installation, and you need to remove the leading “#” (comment) symbol.


2. Configure hbase-site.xml (All Machines)

The conf/hbase-site.xml file is basically a properties file that lets you configure HBase parameters on a per-machine basis.

1. Open hbase-0.19.0 folder
2. Open conf/hbase-site.xml in a text editor.
3. Insert the following lines between the <configuration> and </configuration> tags.

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9000/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

<property>
  <name>hbase.master</name>
  <value>master:60000</value>
  <description>The host and port that the HBase master runs at.</description>
</property>

<property>
  <name>hbase.regionserver</name>
  <value>master:60020</value>
  <description>The host and port a HBase region server runs at.</description>
</property>


3. Configure regionservers file

The conf/regionservers file tells the master where it can find slaves to do work.

1. Open hbase-0.19.0 folder
2. Open conf/regionservers in a text editor. It will probably have one line which says “localhost”.
3. Replace that with the following:
master
slave


4. Starting HBase

1. Open a new Cygwin window.
2. To fully start your cluster, execute the following commands:
a. cd hbase-0.19.0

b. $ bin/start-hbase.sh

Browse the web interface for the Master and the Regionserver; they are available at:

* HBase Master - http://master:60010/
* HBase Regionserver - http://master:60030/

5. HBase Shell

Once HBase has started, enter $ bin/hbase shell to obtain a shell against HBase from which you can execute commands. Test your installation by creating, scanning, and dropping a table, as in the example below.
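
A minimal smoke test might look like the following (the table name 'test' and the column family 'cf' are arbitrary choices, not anything HBase requires):

hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:greeting', 'hello'
hbase> scan 'test'
hbase> disable 'test'
hbase> drop 'test'
hbase> exit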


6. Stop HBase

To stop HBase, exit the HBase shell and enter:
$ bin/stop-hbase.sh


Getting Started With Eclipse

1. Downloading and Installing

1. Download eclipse-SDK-3.3.2-win32.zip and place it in some folder on your computer.
2. Right-click on it and click Extract Files.
3. Give the destination path as: “C:\”


2. Installing the Hadoop MapReduce Plug-in

1. Open hadoop-0.19.1 folder
2. In the hadoop-0.19.1/contrib/eclipse-plugin directory, you will find a file named hadoop-0.19.1-eclipse-plugin.jar.
3. Copy this into the “C:\eclipse\plugins” subdirectory of Eclipse.


3. Configuring the MapReduce Plug-in

1. Start Eclipse and choose a workspace directory. If you are presented with a "welcome" screen, click the button that says "Go to the Workbench." The Workbench is the main view of Eclipse, where you can write source code, launch programs, and manage your projects.
2. Switch to the MapReduce perspective. In the upper-right corner of the workbench, click the "Open Perspective" button
3. Select "Other," followed by "Map/Reduce" in the window that opens up. At first, nothing may appear to change. In the menu, choose Window * Show View * Other. Under "MapReduce Tools," select "Map/Reduce Locations." This should make a new panel visible at the bottom of the screen, next to Problems and Tasks.
4. Add the Server. In the Map/Reduce Locations panel, click on the elephant logo in the upper-right corner to add a new server to Eclipse.
5. You will now be asked to fill in a number of parameters identifying the server. To connect to Hadoop Server, the values are:
a. Location name: (Any descriptive name you want; e.g., "HDFS")
b. Map/Reduce Master Host: master
c. Map/Reduce Master Port: 9001
d. DFS Master Port: 9000
e. User name: HadoopAdmin

6. Next, click on the "Advanced" tab. There are two settings here which must be changed.
7. When you are done, click "Finish." Your server will now appear in the Map/Reduce Locations panel. If you look in the Project Explorer (upper-left corner of Eclipse), you will see that the MapReduce plug-in has added the ability to browse HDFS. Click the [+] buttons to expand the directory tree to see any files already there. If you inserted files into HDFS yourself, they will be visible in this tree.
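
If the HDFS tree in Eclipse appears empty, you can put a file into HDFS from Cygwin and then refresh the view in Eclipse; for example (the directory and file names here are arbitrary):

$> cd hadoop-0.19.1
$> bin/hadoop fs -mkdir /user/HadoopAdmin
$> bin/hadoop fs -put conf/hadoop-site.xml /user/HadoopAdmin/
$> bin/hadoop fs -ls /user/HadoopAdmin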
