An Introduction To Hadoop And HDFS
As developers, we should know the following:
1. The Motivation For Hadoop
Problems with traditional large-scale systems
Requirements for a new approach
2. Hadoop: Basic Concepts
An Overview of Hadoop
The Hadoop Distributed File System
Hands-On Exercise
How MapReduce Works
Hands-On Exercise
Anatomy of a Hadoop Cluster
Other Hadoop Ecosystem Components
3. Writing a MapReduce Program
The MapReduce Flow
Examining a Sample MapReduce Program (a minimal sketch follows this section)
Basic MapReduce API Concepts
The Driver Code
The Mapper
The Reducer
Hadoop’s Streaming API
Using Eclipse for Rapid Development
Hands-On Exercise
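To make the driver, mapper, and reducer concrete, here is a minimal WordCount sketch using the classic org.apache.hadoop.mapred API. It is an illustrative assumption, not code from the course materials:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: wires the pieces together and submits the job.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}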
4. Integrating Hadoop Into The Workflow
Relational Database Management Systems
Storage Systems
Importing Data from RDBMSs With Sqoop
Hands-On Exercise
Importing Real-Time Data with Flume
Accessing HDFS Using FuseDFS and Hoop
5. More Advanced MapReduce Programming
Custom Writables and WritableComparables (see the sketch after this section)
Saving Binary Data using SequenceFiles and Avro Files
Creating InputFormats and OutputFormats
Hands-On Exercise
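As a sketch of the custom Writables topic: a key type must serialize its fields, deserialize them in the same order, and define a sort order. The TextPair class below is hypothetical, written against the standard WritableComparable interface:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: serializes two strings and sorts by both.
public class TextPair implements WritableComparable<TextPair> {
  private String first = "";
  private String second = "";

  public TextPair() {}                          // Hadoop needs a no-arg constructor

  public TextPair(String first, String second) {
    this.first = first;
    this.second = second;
  }

  public String getFirst() { return first; }
  public String getSecond() { return second; }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(first);                        // serialize fields in a fixed order
    out.writeUTF(second);
  }

  public void readFields(DataInput in) throws IOException {
    first = in.readUTF();                       // deserialize in the same order
    second = in.readUTF();
  }

  public int compareTo(TextPair other) {        // defines the full sort order
    int cmp = first.compareTo(other.first);
    return cmp != 0 ? cmp : second.compareTo(other.second);
  }

  @Override
  public int hashCode() {                       // used by the default HashPartitioner
    return first.hashCode() * 163 + second.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof TextPair)) return false;
    TextPair p = (TextPair) o;
    return first.equals(p.first) && second.equals(p.second);
  }
}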
6. Graph Manipulation in Hadoop
Introduction to graph techniques
Representing graphs in Hadoop
Implementing a sample algorithm: Single Source Shortest Path (a sketch follows below)
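A hedged sketch of one SSSP iteration follows. The record layout, sentinel value, and class names are assumptions for illustration; the overall pattern (pass the graph structure through, propagate tentative distances, keep the minimum) is the standard iterative approach:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One BFS-style iteration of Single Source Shortest Path with unit edge
// weights. Assumed record layout (hypothetical): "nodeId<TAB>distance|n1,n2,..."
// where unreached nodes carry the sentinel distance 999999.
public class SsspIteration {
  private static final int UNREACHED = 999999;

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] parts = value.toString().split("\t");
      String[] rec = parts[1].split("\\|", -1);
      int dist = Integer.parseInt(rec[0]);
      out.collect(new Text(parts[0]), new Text("GRAPH|" + parts[1]));  // pass structure on
      if (dist < UNREACHED && rec.length > 1 && !rec[1].isEmpty()) {
        for (String neighbor : rec[1].split(",")) {
          out.collect(new Text(neighbor), new Text("DIST|" + (dist + 1)));  // tentative
        }
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String adjacency = "";
      int best = UNREACHED;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("GRAPH|")) {
          String[] rec = v.substring(6).split("\\|", -1);
          best = Math.min(best, Integer.parseInt(rec[0]));
          adjacency = rec.length > 1 ? rec[1] : "";
        } else {                                   // "DIST|<n>" candidates
          best = Math.min(best, Integer.parseInt(v.substring(5)));
        }
      }
      out.collect(key, new Text(best + "|" + adjacency));  // same layout as the input
    }
  }
}

The driver (omitted) would set Text as the map output key and value classes and re-run the job, typically using a counter to detect when an iteration no longer improves any distance.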
7. Cloudera Certified Hadoop Developer Exam
Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Developer exam.
8. Using Hive and Pig
Hive Basics
Pig Basics
Hands-On Exercise
9. Delving Deeper Into The Hadoop API
Using LocalJobRunner Mode for Faster Development
Reducing Intermediate Data With Combiners
The configure and close methods for Map/Reduce Setup and Teardown
Writing Partitioners for Better Load Balancing (see the sketch after this list)
Directly Accessing HDFS
Using the Distributed Cache
Hands-On Exercise
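For the partitioner topic, a minimal sketch (hypothetical class, classic API). A partitioner decides which reducer receives each intermediate key, so a skewed default hash can be replaced with domain-aware routing:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: routes keys to reducers by first letter instead of
// the full hash, so related words land on the same reducer.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // Called once per task before getPartition(); nothing to set up here.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
    return first % numPartitions;               // char is non-negative, so this is safe
  }
}

It would be enabled with conf.setPartitionerClass(FirstLetterPartitioner.class). A combiner, by contrast, is registered with conf.setCombinerClass(...) and should be an associative, commutative reduce function so that partial aggregation on the map side is safe.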
10. Practical Development Tips and Techniques
Testing with MRUnit (see the sketch after this list)
Debugging MapReduce Code
Using LocalJobRunner Mode For Easier Debugging
Retrieving Job Information with Counters
Logging
Splittable File Formats
Determining the Optimal Number of Reducers
Map-Only MapReduce Jobs
Implementing Multiple Mappers using ChainMapper
Hands-On Exercise
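For the MRUnit topic, a sketch of a mapper unit test. It assumes the hypothetical WordCount.Map class from the earlier example and MRUnit's classic-API MapDriver:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Before;
import org.junit.Test;

// Unit test for the WordCount.Map mapper sketched earlier. No cluster and no
// HDFS: input goes in, expected (key, value) pairs are asserted in order.
public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
    mapDriver.setMapper(new WordCount.Map());
  }

  @Test
  public void testSingleLine() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("cat cat dog"))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("dog"), new IntWritable(1))
        .runTest();
  }
}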
11. Common MapReduce Algorithms
Sorting and Searching
Indexing
Machine Learning With Mahout
Term Frequency – Inverse Document Frequency
Word Co-Occurrence
Hands-On Exercise
12. Joining Data Sets in MapReduce Jobs
Map-Side Joins
The Secondary Sort (see the sketch after this list)
Reduce-Side Joins
Hands-On Exercise
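The crux of the secondary sort is separating how composite keys sort from how they group at the reducer. A hedged sketch, reusing the hypothetical TextPair key from the earlier example:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys by their natural (first) field only, so one reduce()
// call receives every value for a natural key, while the full-key sort has
// already ordered those values by the secondary field.
public class NaturalKeyGroupingComparator extends WritableComparator {

  public NaturalKeyGroupingComparator() {
    super(TextPair.class, true);                // true: instantiate keys for compare()
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    TextPair p1 = (TextPair) a;
    TextPair p2 = (TextPair) b;
    return p1.getFirst().compareTo(p2.getFirst());  // ignore the secondary field
  }
}

The driver would register this with conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class) and pair it with a partitioner that hashes only the natural key, so all records for one natural key reach the same reducer with their values already ordered by the secondary field.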
13. Creating Workflows with Oozie
The Motivation for Oozie
Oozie's Workflow Definition Format (a sample workflow follows below)
Hands-On Exercise
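Oozie workflows are defined in XML (hPDL). A minimal, hypothetical workflow with a single MapReduce action might look like the following; all names and properties are illustrative:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>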
As administrators, we should know the following:
- Why Hadoop?
- HDFS
- MapReduce
- Hive, Pig, HBase and other ecosystem projects
- Hands-On Exercise: Installing a pseudo-distributed cluster
- General Planning Considerations
- Choosing The Right Hardware
- Node Topologies
- Choosing The Right Software
- Installing Hadoop
- Using SCM Express for easy installation
- Typical Configuration Parameters
- Configuring Rack Awareness
- Using Configuration Management Tools
- Hands-On Exercise: Installing a Hadoop Cluster
- Checking HDFS with fsck
- Hands-On Exercise: Breaking the Cluster
- Copying data with distcp
- Rebalancing cluster nodes
- Adding and removing cluster nodes
- Hands-On Exercise: Verifying the Cluster's Self-Healing Features
- Backup And Restore
- Upgrading and Migrating
- Hands-On Exercise: Backing Up and Restoring the NameNode Metadata
- Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Administrator exam.
- Starting and stopping MapReduce jobs
- Hands-On Exercise: Managing jobs
- The FIFO Scheduler
- The Fair Scheduler
- Hands-On Exercise: Using the FairScheduler
Installing And Managing Other Hadoop Projects
- Hive
- Pig
- HBase
- Hands-On Exercise: Configuring the Hive Shared Metastore
- Using Sqoop
- Using Flume
- Best Practices for Data Ingestion
- Hadoop Log Files
- Using the NameNode and JobTracker Web UIs
- Interpreting Job Logs
- Monitoring with Ganglia
- Other monitoring tools
- General Optimization Tips
- Benchmarking Your Cluster
Syllabus guidelines for Developer exam
Core Hadoop Concepts
Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing. Understand how Apache Hadoop exploits data locality. Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.
Storing Files in Hadoop
Analyze the benefits and challenges of the HDFS architecture, including how HDFS implements file sizes, block sizes, and block abstraction. Understand default replication values and storage requirements for replication. Determine how HDFS stores, reads, and writes files. Given a sample architecture, determine how HDFS handles hardware failure.
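To make the read and write paths concrete, a small sketch using Hadoop's Java FileSystem API (the path is hypothetical). The client obtains metadata from the NameNode and then streams block data to or from DataNodes directly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);         // HDFS if the default FS points there
    Path p = new Path("/user/demo/hello.txt");    // hypothetical path

    FSDataOutputStream out = fs.create(p);        // NameNode allocates blocks,
    out.writeUTF("hello, hdfs");                  // data is pipelined to DataNodes
    out.close();

    FSDataInputStream in = fs.open(p);            // read the file back
    System.out.println(in.readUTF());
    in.close();

    // Replication is tracked per file; the default factor is 3.
    System.out.println("replication: " + fs.getFileStatus(p).getReplication());
  }
}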
Job Configuration and Submission
Construct proper job configuration parameters, including using JobConf and appropriate properties. Identify the correct procedures for MapReduce job submission. Understand how to use various commands in job submission (e.g., “hadoop jar”).
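A hedged sketch of driver-side configuration and submission with JobConf; the driver class is hypothetical and reuses the WordCount mapper and reducer sketched earlier:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver showing common JobConf properties.
public class TunedDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TunedDriver.class);
    conf.setJobName("tuned-wordcount");
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setNumReduceTasks(10);                       // reducer count is honored exactly
    conf.set("mapred.child.java.opts", "-Xmx512m");   // per-task JVM options
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                           // blocks until completion
  }
}

Packaged into a jar, it would be launched with a command along the lines of hadoop jar tuned.jar TunedDriver <input> <output>.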
Job Execution Environment
Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer. Understand the key fault tolerance principles at work in a MapReduce job. Identify the role of Apache Hadoop Classes, Interfaces, and Methods. Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.
Input and Output
Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements. Understand the role of the RecordReader, and of sequence files and compression.
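Format choice is a driver-side decision. The hedged sketch below (hypothetical helper class, classic API) configures a job to read and write block-compressed SequenceFiles:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Hypothetical helper: read and write SequenceFiles with block compression.
public class FormatConfig {
  public static void apply(JobConf conf) {
    conf.setInputFormat(SequenceFileInputFormat.class);    // reader: sequence files
    conf.setOutputFormat(SequenceFileOutputFormat.class);  // writer: sequence files
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
  }
}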
Job Lifecycle
Analyze the order of operations in a MapReduce job, how data moves from place to place, how partitioners and combiners function, and the sort and shuffle process.
Data processing
Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values. Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).
Key and Value Types
Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job. Understand common key and value types in the MapReduce framework and the interfaces they implement.
Common Algorithms and Design Patterns
Evaluate whether an algorithm is well-suited for expression in MapReduce. Understand implementation and limitations and strategies for joining datasets in MapReduce. Analyze the role of DistributedCache and Counters.
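A combined sketch of the DistributedCache and Counters topics (hypothetical class and file path, classic API): the cache ships a small side file to every task, and counters report aggregate statistics back to the client:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: loads a stop-word list shipped via the DistributedCache
// in configure(), and counts skipped words with a custom counter.
public class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Set<String> stopWords = new HashSet<String>();
  private final Text word = new Text();

  @Override
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader r = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = r.readLine()) != null) {
          stopWords.add(line.trim());
        }
        r.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("failed to load cached stop words", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (stopWords.contains(token)) {
        reporter.incrCounter("StopWords", "SKIPPED", 1);  // visible in the job UI and client
      } else if (!token.isEmpty()) {
        word.set(token);
        out.collect(word, ONE);
      }
    }
  }
}

The driver side would stage the file before submission with DistributedCache.addCacheFile(new java.net.URI("/user/demo/stopwords.txt"), conf); the path is hypothetical.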
The Hadoop Ecosystem
Analyze a workflow scenario and determine how and when to leverage ecosystems projects, including Apache Hive, Apache Pig, Sqoop and Oozie. Understand how Hadoop Streaming might apply to a job workflow.
Syllabus guidelines for Admin exam
Apache Hadoop Cluster Core Technologies
Daemons and normal operation of an Apache Hadoop cluster, both in data storage and in data processing. The current features of computing systems that motivate a system like Apache Hadoop.
Apache Hadoop Cluster Planning
Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
Apache Hadoop Cluster Management
Cluster handling of disk and machine failures. Standard tools for monitoring and managing the Apache Hadoop file system.
Job Scheduling
How the default FIFO scheduler and the FairScheduler handle the tasks in a mix of jobs running on a cluster.
Monitoring and Logging
Functions and features of Apache Hadoop’s logging and monitoring systems.
References:
http://www.philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/
http://www.cs.bgu.ac.il/~dsp112/The_Map-Reduce_Pattern
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/