Monday, 4 June 2012

Hadoop course

1. An Introduction To Hadoop And HDFS
  • Why Hadoop?
  • HDFS
  • MapReduce
  • Hive, Pig, HBase and other ecosystem projects
  • Hands-On Exercise: Installing a pseudo-distributed cluster
2. Planning Your Hadoop Cluster
  • General Planning Considerations
  • Choosing The Right Hardware
  • Node Topologies
  • Choosing The Right Software
3. Deploying Your Cluster
  • Installing Hadoop
  • Using SCM Express for easy installation
  • Typical Configuration Parameters
  • Configuring Rack Awareness
  • Using Configuration Management Tools
  • Hands-On Exercise: Installing a Hadoop Cluster
4. Cluster Maintenance
  • Checking HDFS with fsck
  • Hands-On Exercise: Breaking the Cluster
  • Copying data with distcp
  • Rebalancing cluster nodes
  • Adding and removing cluster nodes
  • Hands-On Exercise: Verifying the Cluster's Self-Healing Features
  • Backup And Restore
  • Upgrading and Migrating
  • Hands-On Exercise: Backing Up and Restoring the NameNode Metadata
5. Cloudera Certified Administrator Exam
  • Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Administrator exam
6. Managing and Scheduling Jobs
  • Starting and stopping MapReduce jobs
  • Hands-On Exercise: Managing jobs
  • The FIFO Scheduler
  • The Fair Scheduler
  • Hands-On Exercise: Using the FairScheduler
7. Installing And Managing Other Hadoop Projects
  • Hive
  • Pig
  • HBase
  • Hands-On Exercise: Configuring the Hive Shared Metastore
8. Populating HDFS From External Sources
  • Using Sqoop
  • Using Flume
  • Best Practices for Data Ingestion
9. Cluster Monitoring, Troubleshooting and Optimizing
  • Hadoop Log Files
  • Using the NameNode and JobTracker Web UIs
  • Interpreting Job Logs
  • Monitoring with Ganglia
  • Other monitoring tools
  • General Optimization Tips
  • Benchmarking Your Cluster

As developers, we should know the following:


1. The Motivation For Hadoop

    Problems with traditional large-scale systems
    Requirements for a new approach

2.  Hadoop: Basic Concepts

    An Overview of Hadoop
    The Hadoop Distributed File System
    Hands-On Exercise
    How MapReduce Works
    Hands-On Exercise
    Anatomy of a Hadoop Cluster
    Other Hadoop Ecosystem Components

3. Writing a MapReduce Program

    The MapReduce Flow
    Examining a Sample MapReduce Program
    Basic MapReduce API Concepts
    The Driver Code
    The Mapper
    The Reducer
    Hadoop’s Streaming API
    Using Eclipse for Rapid Development
    Hands-On Exercise

4. Integrating Hadoop Into The Workflow

    Relational Database Management Systems
    Storage Systems
    Importing Data from RDBMSs With Sqoop
    Hands-On Exercise
    Importing Real-Time Data with Flume
    Accessing HDFS Using FuseDFS and Hoop

5. More Advanced MapReduce Programming

    Custom Writables and WritableComparables
    Saving Binary Data using SequenceFiles and Avro Files
    Creating InputFormats and OutputFormats
    Hands-On Exercise

6. Graph Manipulation in Hadoop

    Introduction to graph techniques
    Representing graphs in Hadoop
    Implementing a sample algorithm: Single Source Shortest Path

7. Cloudera Certified Hadoop Developer Exam

    Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Developer exam

8. Using Hive and Pig

    Hive Basics
    Pig Basics
    Hands-On Exercise

9. Delving Deeper Into The Hadoop API

    Using LocalJobRunner Mode for Faster Development
    Reducing Intermediate Data With Combiners
    The configure and close methods for Map/Reduce Setup and Teardown
    Writing Partitioners for Better Load Balancing
    Directly Accessing HDFS
    Using the Distributed Cache
    Hands-On Exercise

10. Practical Development Tips and Techniques

    Testing with MRUnit
    Debugging MapReduce Code
    Using LocalJobRunner Mode For Easier Debugging
    Retrieving Job Information with Counters
    Logging
    Splittable File Formats
    Determining the Optimal Number of Reducers
    Map-Only MapReduce Jobs
    Implementing Multiple Mappers using ChainMapper
    Hands-On Exercise

11. Common MapReduce Algorithms

    Sorting and Searching
    Indexing
    Machine Learning With Mahout
    Term Frequency – Inverse Document Frequency
    Word Co-Occurrence
    Hands-On Exercise

12. Joining Data Sets in MapReduce Jobs

    Map-Side Joins
    The Secondary Sort
    Reduce-Side Joins
    Hands-On Exercise

13. Creating Workflows with Oozie

    The Motivation for Oozie
    Oozie's Workflow Definition Format
    Hands-On Exercise


Syllabus guidelines for Developer exam
    Core Hadoop Concepts
    Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing. Understand how Apache Hadoop exploits data locality. Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.
    Storing Files in Hadoop
    Analyze the benefits and challenges of the HDFS architecture, including how HDFS implements file sizes, block sizes, and block abstraction. Understand default replication values and storage requirements for replication. Determine how HDFS stores, reads, and writes files. Given a sample architecture, determine how HDFS handles hardware failure.
    Job Configuration and Submission
    Construct proper job configuration parameters, including using JobConf and appropriate properties. Identify the correct procedures for MapReduce job submission. Know how to use the various job submission commands (“hadoop jar”, etc.).
    Job Execution Environment
    Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer. Understand the key fault tolerance principles at work in a MapReduce job. Identify the role of Apache Hadoop Classes, Interfaces, and Methods. Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.
    Input and Output
    Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements. Understand the role of the RecordReader, and of sequence files and compression.
    Job Lifecycle
    Analyze the order of operations in a MapReduce job, how data moves from place to place, how partitioners and combiners function, and the sort and shuffle process.
    Data processing
    Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values. Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).
    Key and Value Types
    Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job. Understand common key and value types in the MapReduce framework and the interfaces they implement.
    Common Algorithms and Design Patterns
    Evaluate whether an algorithm is well-suited for expression in MapReduce. Understand implementation and limitations and strategies for joining datasets in MapReduce. Analyze the role of DistributedCache and Counters.
    The Hadoop Ecosystem
    Analyze a workflow scenario and determine how and when to leverage ecosystem projects, including Apache Hive, Apache Pig, Sqoop and Oozie. Understand how Hadoop Streaming might apply to a job workflow.
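
To make the "Job Lifecycle" and "Data processing" items above more concrete, here is a minimal sketch of the order of operations in a word-count job: map, combine, partition, sort and shuffle, reduce. It is plain Python that only simulates the flow in memory; it does not use the Hadoop API, and every function name in it (mapper, combiner, partitioner, reducer, run_job) is made up for illustration. The CRC32-based partitioner is an assumption standing in for a hash partitioner.

```python
import zlib
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combine step: sum counts locally to shrink the intermediate data."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partitioner(key, num_reducers):
    """Partition step: route a key to one reducer with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Reduce step: sum all partial counts received for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Simulate one job: each split is processed by its own map task."""
    # One bucket per reducer, holding key -> list of values (the shuffle).
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        map_output = [pair for line in split for pair in mapper(line)]
        for key, value in combiner(map_output):
            partitions[partitioner(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order (the sort).
    return [[reducer(key, bucket[key]) for key in sorted(bucket)]
            for bucket in partitions]


if __name__ == "__main__":
    splits = [["the quick brown fox", "the lazy dog"], ["the dog barks"]]
    for index, part in enumerate(run_job(splits)):
        print(f"part-{index:05d}", part)
```

Each reducer produces one output list here, which mirrors the point in the syllabus that the number of output files follows the number of reducers, and that every key ends up in exactly one of them.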

Syllabus guidelines for Admin exam
    Apache Hadoop Cluster Core Technologies
    Daemons and normal operation of an Apache Hadoop cluster, both in data storage and in data processing. The current features of computing systems that motivate a system like Apache Hadoop.
    Apache Hadoop Cluster Planning
    Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
    Apache Hadoop Cluster Management
    Cluster handling of disk and machine failures. Regular tools for monitoring and managing the Apache Hadoop file system.
    Job Scheduling
    How the default FIFO scheduler and the FairScheduler handle the tasks in a mix of jobs running on a cluster.
    Monitoring and Logging
    Functions and features of Apache Hadoop’s logging and monitoring systems.
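
Both syllabi touch on block sizes, replication and the storage they require, which matters when planning a cluster. The arithmetic can be sketched as below. This is a rough estimate only: the function name hdfs_footprint is hypothetical, and the defaults (64 MB blocks, replication factor 3) are assumptions based on common settings in Hadoop releases of this period, so check your own cluster's configuration before relying on them.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=64 * 1024**2, replication=3):
    """Estimate how many blocks a file occupies and the raw disk space used.

    Assumes the final block only takes up the bytes actually written,
    so raw usage is simply file size multiplied by the replication factor.
    """
    # Integer ceiling division: number of blocks needed for the file.
    blocks = -(-file_size_bytes // block_size_bytes)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }


if __name__ == "__main__":
    one_gib = 1024**3
    print(hdfs_footprint(one_gib))
```

Under these assumptions a 1 GiB file is split into 16 blocks, stored as 48 block replicas, and consumes 3 GiB of raw disk across the cluster.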

References:

http://www.philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/
http://www.cs.bgu.ac.il/~dsp112/The_Map-Reduce_Pattern
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

