------------------------------------------------------------------------------------
简介
------------------------------------------------------------------------------------
(1) 适用于大规模数据并行处理,可扩展到成千上百台机器,每台机器拥有多核。
适用于在一系列机器上处理大量分布式任务。
(2) Hadoop includes a DFS (distributed file system) which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.
------------------------------------------------------------------------------------
challenges
------------------------------------------------------------------------------------
(4)In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.
(5) Hadoop is designed to handle hardware failure and data congestion issues very robustly.
------------------------------------------------------------------------------------
Data Distribution
------------------------------------------------------------------------------------
(9) HDFS will split large data files into chunks which are managed by different nodes in the cluster. In addition to this each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
(10) Each process running on a node in the cluster then processes a subset of these records. The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.
------------------------------------------------------------------------------------
MapReduce: Isolated Processes
------------------------------------------------------------------------------------
(12)
Mapping and reducing tasks run on nodes where individual records of data are already present.
(13)Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
(15) 机器数目少时(10台以内),MPI也许会快些。然而,话费在机器和开发、重开发上精力巨大,可扩展性极差。
(16) 然而Hadoop的可扩展性极强。