Hadoop: The Definitive Guide, Reading Notes 3: Chapter 3

1. What is a distributed filesystem?


Filesystems that manage the storage across a network of machines are called distributed filesystems.


2. HDFS concepts

2.1 block

What is a block? How does it differ from a block in Linux? And why does HDFS introduce a block concept of its own?


  • 01. At the disk level, a block is the unit into which the physical disk is divided.
  • 02. In a filesystem, a block is the smallest unit of data that can be read or written; the default block size differs from one filesystem to another.


Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.


  • 03. A block in Hadoop is different from a filesystem block: in HDFS the default block size is 128 MB.
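2.2 namenode & datanode

A summary of their main functions and how they differ:


  • namenode (the master) 01. Manages the filesystem namespace.
    02. Maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on disk in two forms: the namespace image and the edit log. The namenode does not, however, store block locations persistently, because that information is reconstructed from the datanodes when the system starts.
  • a number of datanodes (workers): they store and retrieve blocks, and periodically report to the namenode the list of blocks they are storing.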

2.3 block cache


Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.



By default, a block is cached in only one datanode’s memory, although the number is configurable on a per-file basis.



Users or applications instruct the namenode which files to cache (and for how long) by adding a cache directive to a cache pool.


2.4 HDFS Federation


The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
The remedy is to manage the datanodes with several independent namenodes.



HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.


HDFS federation is really NameNode federation: several namenodes jointly manage the filesystem.


For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.


Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.

  • The namespace volumes of the namenodes are independent of each other


Namespace volumes are independent of each other, which means namenodes do not communicate with one another, and furthermore the failure of one namenode does not affect the availability of the namespaces managed by other namenodes.


  • Each datanode has to register with every namenode


so datanodes register with each namenode in the cluster and store blocks from multiple block pools.


  • How do clients access a federated cluster?


To access a federated HDFS cluster, clients use client-side mount tables to map file paths to namenodes.
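
A client-side mount table can be pictured as a longest-prefix lookup from file path to namenode. The sketch below only illustrates that idea and is not how Hadoop implements it; the namenode addresses and the `resolve` helper are invented for the example, while the /user and /share prefixes are the ones from the example above.

```python
# Hypothetical mount table: path prefix -> namenode address (addresses are made up).
MOUNT_TABLE = {
    "/user": "hdfs://namenode1.example.com:8020",
    "/share": "hdfs://namenode2.example.com:8020",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the namenode responsible for `path`, using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/users" does not match "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namenode is mounted for {path}")
    return mount_table[best]
```

With this table, `resolve("/user/tom/data.txt")` picks the first namenode and `resolve("/share/lib")` the second; a path outside both prefixes raises an error, since no namenode owns it.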


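3. Why are blocks in HDFS so much larger than filesystem blocks?


01. The reason is to minimize the cost of seeks: if the block is large enough, the time to transfer the data from disk is significantly longer than the time to seek to the start of the block.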
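
A quick back-of-the-envelope calculation makes the seek argument concrete. The figures used here (a 10 ms seek and a 100 MB/s transfer rate) are illustrative round numbers, not measurements, and the function names are made up for this sketch.

```python
def block_size_for_seek_overhead(seek_time_s, transfer_rate_mb_s, overhead):
    """Block size (in MB) at which the seek costs `overhead` times the transfer time."""
    transfer_time_s = seek_time_s / overhead
    return transfer_time_s * transfer_rate_mb_s


def seek_overhead(block_size_mb, seek_time_s, transfer_rate_mb_s):
    """Ratio of seek time to transfer time when reading one block."""
    transfer_time_s = block_size_mb / transfer_rate_mb_s
    return seek_time_s / transfer_time_s


# Assumed figures: 10 ms seek, 100 MB/s transfer, seek should cost 1% of the transfer.
size_mb = block_size_for_seek_overhead(0.010, 100, 0.01)   # about 100 MB
ratio_128mb = seek_overhead(128, 0.010, 100)               # under 1% for a 128 MB block
ratio_4kb = seek_overhead(4 / 1024, 0.010, 100)            # seek dominates for a 4 KB block
```

Under these assumptions a block of roughly 100 MB keeps the seek at about 1% of the transfer time, which is in the same range as the 128 MB default, whereas for a 4 KB block the seek would take far longer than the transfer itself.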


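4. Writing files in HDFS


  • 01. A file in HDFS may be written to by only a single writer at a time. [What exactly does "single writer" mean? updating…]
  • 02. Writes are always made at the end of the file [that is, in append-only fashion]


There is no support for multiple writers or for modifications at arbitrary offsets in the file.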


5. The relationship between HDFS throughput and latency


HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.


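6. How much data can HDFS store?


  • 01. In theory HDFS has no upper limit on storage, because the data can be spread over many machines.
  • 02. In practice the capacity is limited. Every block, directory, and file in HDFS is held as a metadata object in the namenode's memory, and because that memory is finite, so is the amount of data that can be kept in HDFS.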
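
To get a feel for the memory limit, here is a rough estimate. It assumes the commonly quoted rule of thumb of about 150 bytes of namenode memory per file, directory, or block; that figure is an approximation, and the function name is made up for this sketch.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate namenode memory for the given number of files, blocks, and directories."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks + num_dirs) * BYTES_PER_OBJECT


# One million single-block files: one million file objects plus one million block objects.
one_million_files = namenode_memory_bytes(1_000_000)  # 300,000,000 bytes, about 300 MB
```

The estimate grows with the number of objects rather than with the bytes stored, so a billion small files would cost far more namenode memory than the same data packed into fewer large files.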

7. What is the point of Hadoop defining its own block?


Having a block abstraction for a distributed filesystem brings several benefits.



  • A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
  • Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
  • Blocks fit well with replication for providing fault tolerance and availability.
    [abstraction, generalization…]

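8. Keeping an HDFS Cluster Balanced

How do you keep an HDFS cluster balanced?

The main scenarios to consider are:

01. When copying data into HDFS, make sure the distcp command does not upset the balance.

02. When the copy runs as a MapReduce job, use many map tasks rather than a single one. [Do you know why?]

The reason: with only one map task, that single map does all the copying, so the first replica of every block ends up on the node running the map (until its disk fills up). The second and third replicas are spread across the cluster, but that one node becomes unbalanced, which skews the distribution of the data.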

9. How does Hadoop make the namenode resilient?

Before answering, ask why the namenode needs to be resilient in the first place. And before that, ask what the namenode actually does.

9.1 What does the namenode do?

See section 2.2 above.

9.2 Why does the namenode need to be resilient?


Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes.



For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.


9.3 How is the namenode made resilient?

  • 01. Back up the files that make up the persistent state of the filesystem metadata
  • 02. Run a secondary namenode

9.4 What does the secondary namenode do?



Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.



It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so if the primary fails completely, some data loss is almost certain.
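
The merge itself can be pictured as replaying the edit log on top of the last namespace image. The sketch below is a toy model only: the dictionary layout and the operation names (`mkdir`, `create`, `add_block`, `delete`) are invented for illustration and do not reflect the real on-disk formats.

```python
import copy


def apply_edit(image, edit):
    """Apply one logged operation to the in-memory namespace (toy operations only)."""
    op, path = edit[0], edit[1]
    if op == "mkdir":
        image[path] = {"type": "dir"}
    elif op == "create":
        image[path] = {"type": "file", "blocks": []}
    elif op == "add_block":
        image[path]["blocks"].append(edit[2])
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return the new image and an empty log."""
    merged = copy.deepcopy(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []
```

After a checkpoint the edit log starts again from empty, which is what keeps it from growing without bound; any edits made after the last checkpoint are exactly the part in which a secondary lags behind the primary.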

10. How does a Hadoop cluster stay highly available?

10.1 Does replicating the namenode metadata on multiple filesystems, combined with a secondary namenode, keep the cluster highly available?


The combination of replicating namenode metadata on multiple filesystems and using the secondary namenode to create checkpoints protects against data loss, but it does not provide high availability of the filesystem.


No. Why not?


Because the namenode is the sole repository of the metadata and the file-to-block mapping, it remains a single point of failure.


So how is Hadoop High Availability achieved? The main idea is:


There is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.


The main changes needed to make this active-standby arrangement work are:





The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.






Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.






Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.






The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace




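The transition from the active namenode to the standby is managed by an entity called the failover controller.

Failover transition in the new system:


In the case of an ungraceful failover, it is impossible to be sure that the failed namenode has stopped running.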


With that, HA is finally in place.


If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The actual observed failover time will be longer in practice (around a minute or so), because the system needs to be conservative in deciding that the active namenode has failed.


The steps above raise some further questions:

  • How is the highly available shared storage implemented? There are two choices:


an NFS filer
or a quorum journal manager (QJM).


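Of the two, the QJM (not NFS) is the recommended choice:


The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
The QJM runs as a group of journal nodes, and each edit must be written to a majority of the journal nodes.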
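
The majority rule can be sketched in a few lines. This is a simplified model, not the real QJM protocol: the `JournalNode` class and the `write_edit` helper are hypothetical, and recovery of partially written edits is left out entirely.

```python
class JournalNode:
    """Toy journal node that stores edits in a list and can be marked as down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.edits = []

    def append(self, edit):
        """Store the edit and acknowledge it, unless the node is down."""
        if not self.up:
            return False
        self.edits.append(edit)
        return True


def quorum_size(num_nodes):
    """Smallest number of nodes that forms a majority."""
    return num_nodes // 2 + 1


def write_edit(edit, nodes):
    """Send the edit to every node; it counts as written only if a majority acknowledges."""
    acks = sum(1 for node in nodes if node.append(edit))
    return acks >= quorum_size(len(nodes))
```

With three journal nodes the quorum is two, so a write still succeeds when one node is down but fails when two are down.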


  • What does "block mapping" refer to?

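11. How do FSDataOutputStream and FSDataInputStream differ?

01. FSDataOutputStream does not support seek(). The reason: HDFS allows only sequential writes to a new file or appends to an existing one.

02. FSDataInputStream supports seek(), that is, random access to any position in the file.

12. The main file classes used in Hadoop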


  • FileSystem
  • FileStatus

13. Data Flow: anatomy of a file read


How the client interacts with HDFS, the namenode, and the datanodes. The main steps are:


  • 01. The client calls open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem
  • 02. DistributedFileSystem calls the namenode [using RPC] to get the locations of the first few blocks in the file
  • 03. For each block, the namenode returns the addresses of the datanodes that have a copy of that block
  • The datanodes are sorted according to their proximity to the client, based on the topology of the cluster’s network [see Network Topology and Hadoop]
  • 04. DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from
  • 05. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O
  • 06. The client calls read() on the stream repeatedly, and DFSInputStream streams the data from the datanode back to the client
  • 07. When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block
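
The core of the read path (for each block, pick the closest datanode that holds a copy, then move on to the next block) can be sketched as follows. This is a toy model with invented data structures: plain dictionaries stand in for the namenode's answer and for the datanodes, and none of the names correspond to Hadoop classes.

```python
def read_file(block_locations, datanodes, distance_to_client):
    """Read a file block by block, each time from the closest datanode.

    block_locations:    list of (block_id, [datanode names]) in file order,
                        standing in for what the namenode returns.
    datanodes:          datanode name -> {block_id: bytes}, standing in for the cluster.
    distance_to_client: datanode name -> network distance from the client.
    """
    chunks = []
    for block_id, holders in block_locations:
        # Choose the replica with the smallest distance to the client.
        closest = min(holders, key=lambda name: distance_to_client[name])
        chunks.append(datanodes[closest][block_id])
    return b"".join(chunks)
```

In this model the client never reads data through the namenode; it only uses the block locations to decide which datanode to contact for each block.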

14. Data Flow: anatomy of a file write


It is instructive to understand the data flow because it clarifies HDFS’s coherency model.



  • 01. The client calls create() on DistributedFileSystem
  • 02. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace
  • 03. The namenode checks that the file does not already exist and that the client has permission to create it
  • 04. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to
  • 05. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue
  • 06. The data queue is consumed by the DataStreamer
  • 07. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue
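
The interplay of the two queues can be sketched as below. This is a simplified, single-threaded model and not the real DFSOutputStream: the class name, the tiny packet size, and the use of plain lists as datanodes are all assumptions made for the illustration.

```python
from collections import deque


def split_into_packets(data, packet_size):
    """Cut the written bytes into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class OutputStreamSketch:
    """Toy model of the data queue and ack queue used while writing a file."""

    def __init__(self, pipeline, packet_size=4):
        self.pipeline = pipeline        # list of lists, each standing in for a datanode
        self.packet_size = packet_size  # tiny on purpose, just for the demonstration
        self.data_queue = deque()       # packets waiting to be sent
        self.ack_queue = deque()        # packets sent but not yet acknowledged

    def write(self, data):
        self.data_queue.extend(split_into_packets(data, self.packet_size))

    def stream_one(self):
        """Streamer role: send the next packet down the pipeline, then await its ack."""
        packet = self.data_queue.popleft()
        self.ack_queue.append(packet)
        for datanode in self.pipeline:
            datanode.append(packet)

    def ack_one(self):
        """Drop the oldest packet once all datanodes in the pipeline have acknowledged it."""
        return self.ack_queue.popleft()
```

Keeping sent packets in the ack queue until they are acknowledged is what allows them to be resent if a datanode in the pipeline fails mid-write.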

15. NETWORK TOPOLOGY AND HADOOP

Network topology in Hadoop: this has a large effect on the performance of MapReduce jobs.


What does it mean for two nodes in a local network to be “close” to each other? In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes — bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.



Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on. The idea is that the bandwidth available for each of the following scenarios becomes progressively less.
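
The tree distance described above is easy to compute when each node is written as a path. The sketch assumes a three-level layout of the form /datacenter/rack/node (for example /d1/r1/n1); the labels are illustrative and the function is not Hadoop's own implementation.

```python
def distance(a, b):
    """Sum of the two nodes' distances to their closest common ancestor in the tree."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


same_node = distance("/d1/r1/n1", "/d1/r1/n1")        # 0: processes on the same node
same_rack = distance("/d1/r1/n1", "/d1/r1/n2")        # 2: different nodes on the same rack
same_datacenter = distance("/d1/r1/n1", "/d1/r2/n3")  # 4: different racks, same data center
other_datacenter = distance("/d1/r1/n1", "/d2/r3/n4") # 6: different data centers
```

The four cases correspond to progressively less available bandwidth: same node, same rack, same data center, and different data centers.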


16. How are replica locations chosen? Replica placement


  • 01. The first replica is placed on the same node as the client
  • 02. The second replica is placed on a different rack from the first (off-rack), chosen at random
  • 03. The third replica is placed on the same rack as the second, but on a different node chosen at random


Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
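
The three placement rules can be sketched as follows. This is a simplified illustration, not Hadoop's placement code: it assumes the client runs on a node inside the cluster and that at least one other rack has two or more nodes, and the `place_replicas` helper and the topology dictionary are made up for the example.

```python
import random


def place_replicas(client_node, topology, rng=random):
    """Pick three replica locations; `topology` maps rack name -> list of node names."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Rule 1: first replica on the client's own node.
    first = client_node

    # Rule 2: second replica on a randomly chosen node of a different rack.
    # Assumption: some other rack has at least two nodes, so rule 3 can be satisfied.
    other_racks = [
        rack for rack, nodes in topology.items()
        if rack != rack_of[first] and len(nodes) >= 2
    ]
    second_rack = rng.choice(other_racks)
    second = rng.choice(topology[second_rack])

    # Rule 3: third replica on the same rack as the second, but on a different node.
    third = rng.choice([node for node in topology[second_rack] if node != second])
    return [first, second, third]
```

For example, with racks `r1` and `r2` and a client on a node in `r1`, the result always has one replica on the client's node and two replicas on distinct nodes of `r2`.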


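17. Coherency Model


  • 01. hflush() forces the data written so far out to the memory of all the datanodes in the write pipeline and makes it visible to new readers; it does not guarantee that the data has reached disk
  • 02. hsync() additionally commits the data to disk on the datanodes, much like fsync in POSIX
    To avoid losing written data, the application should call these methods at suitable points.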

18. Advantages and disadvantages of HDFS

18.1 advantage
  • Very large files


“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.


  • Streaming data access


HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.


  • Commodity hardware
18.2 disadvantage


It is also worth examining the applications for which using HDFS does not work so well. Although this may change in the future, these are areas where HDFS is not a good fit today:




Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.



Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
So if there are too many small files, they take up a large share of the namenode's memory, and data can no longer be stored efficiently.



Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion.