1. Why do we use Hadoop?
1) Stripped to its core, the tools that Hadoop provides for building distributed systems – for data storage, data analysis, and coordination – are simple. If there is a common theme, it is raising the level of abstraction: creating building blocks for programmers who happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don't have the time, the skill, or the inclination to become distributed-systems experts in order to build the infrastructure to handle it.
2) More data usually beats better algorithms.
2. What is the core problem when processing large data?
The rate at which drive access speed improves can hardly keep up with the growth of hard-drive capacity and data volume. –> Bottleneck: ACCESS SPEED TO DATA
So we need to process a dataset in parallel: divide one long queue of data waiting to be processed into multiple short queues (smaller datasets), and process them simultaneously.
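The idea can be sketched in a few lines of plain Python (a teaching sketch, not Hadoop itself): split one large dataset into chunks and hand the chunks to worker processes that run at the same time. The chunk count and the per-chunk work (`sum` here) are placeholders.

```python
# Minimal sketch of divide-and-process-in-parallel (not Hadoop itself).
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for the real per-record work (e.g. parsing records).
    return sum(chunk)

def process_in_parallel(data, n_chunks=4):
    # Divide the one long "queue" of data into several short ones...
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...and process the short queues simultaneously.
    with Pool() as pool:
        partials = pool.map(process_chunk, chunks)
    return sum(partials)  # combine the partial results

if __name__ == "__main__":
    print(process_in_parallel(list(range(100))))  # 4950
```

Note that the "combine the partial results" step at the end is exactly the data-combination problem that question 3 below raises.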
3. What problems will we face when implementing Hadoop for data processing? And how does Hadoop deal with them?
1) Hardware failure – Replication on the Hadoop Distributed Filesystem (HDFS)
2) Data combination – MapReduce
The two components above are the kernel of Hadoop.
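How MapReduce combines data can be illustrated with a toy, single-machine word count (an illustration of the model only, not real Hadoop): map turns each input record into key/value pairs, the framework groups the pairs by key (the "shuffle"), and reduce combines the values for each key.

```python
# Toy, single-machine illustration of the MapReduce model.
from collections import defaultdict

def map_func(line):
    # map: one input record -> a list of (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reduce_func(word, counts):
    # reduce: combine all values that share one key
    return word, sum(counts)

def mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in map_func(line):
            grouped[key].append(value)  # shuffle: group values by key
    # reduce phase
    return dict(reduce_func(k, v) for k, v in grouped.items())

print(mapreduce(["hadoop hdfs hadoop", "hdfs mapreduce"]))
# {'hadoop': 2, 'hdfs': 2, 'mapreduce': 1}
```

In real Hadoop the map and reduce phases run on different machines, and the shuffle moves data across the network between them.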
4. What advantages make Hadoop stand out in data processing? What scenarios does Hadoop fit?
1) Efficiency of reading and updating the majority of a database – better than the B-tree used in relational databases when most of the data is touched.
-- Small data, structured, normalized (redundancy optimized), continually updated – RDBMS (MySQL, etc.)
Large data, unstructured or semi-structured, not normalized, written once – MapReduce
What is B-tree?:http://en.wikipedia.org/wiki/B-tree
2) Speed of processing large datasets – faster than MPI (since data is processed locally in MapReduce, while it is transferred frequently between nodes in grid computing)
-- Predominantly compute-intensive jobs – MPI
Large data – MapReduce
3) Control of the process – more limited but easier than MPI, with higher availability (since MapReduce operates at a higher level and shares nothing among nodes, while MPI operates at a lower level and its nodes are tightly coupled)
-- All kinds of data processing – MPI
General data process - MapReduce
RDBMS=Relational Database Management System
MPI=Message Passing Interface
1. Who is the creator of Hadoop?
Doug Cutting, who is also the creator of Apache Lucene.
What is Lucene?:http://en.wikipedia.org/wiki/Lucene
What is Nutch?:http://en.wikipedia.org/wiki/Nutch
2. What subprojects does Hadoop include?
1) Core: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
2) Avro: A data serialization system for efficient, cross-language RPC, and persistent data storage.
3) MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
4) HDFS: A distributed filesystem that runs on large clusters of commodity machines.
5) Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
6) HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
7) ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
8) Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
9) Chukwa: A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.
What is serialization?:http://en.wikipedia.org/wiki/Serialization
What is Java RPC(JAX-RPC)?:http://en.wikipedia.org/wiki/Java_API_for_XML-based_RPC
SETI@home = Search for Extra-Terrestrial Intelligence, a CPU-intensive job in which data is not processed locally (it is shipped over the network to volunteers' machines).
crawler = a program that downloads pages from web servers.
RPC= Remote Procedure Call
1. How to prepare the NCDC Weather Data for MapReduce experiment?
It implements only the map function: each map task retrieves files from S3 to local storage, untars them (the files are bzip2-compressed tarballs), concatenates their contents into a single ASCII text file, compresses that file into gzip format, and finally puts it into HDFS.
Probably like this:
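The exact script is not reproduced here; a rough local sketch of the transformation each map task performs might look like the following. The function name is hypothetical, and the S3 download and the final `hadoop fs -put` appear only as comments, since they depend on the cluster setup:

```python
# Rough sketch (an assumption, not the book's actual script) of the
# local transformation one map task performs on one year's data:
# bzip2 tarball -> one concatenated ASCII file -> gzip.
import gzip
import tarfile

def repack_year(tar_path, gzip_path):
    # (In the real job, tar_path would first be fetched from S3.)
    with tarfile.open(tar_path, "r:bz2") as tar, \
         gzip.open(gzip_path, "wb") as out:
        for member in tar:
            if member.isfile():
                # Concatenate every file in the tarball into one stream.
                out.write(tar.extractfile(member).read())
    # (Then the result would be put into HDFS, e.g. via
    #  subprocess.run(["hadoop", "fs", "-put", gzip_path, "dest"]).)
```

The point of the repacking is that one large gzip file per year is a much better input for MapReduce than thousands of small bzip2-compressed members.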