Reading notes on "Hadoop: The Definitive Guide", part 3: Chapter 3
1. What is a distributed filesystem?
Filesystems that manage the storage across a network of machines are called distributed filesystems.
2. HDFS concepts
2.1 block
What is a block? How does it differ from a block in Linux? And why does HDFS introduce a block concept of its own?
- 01. On a disk, a block is the unit used to physically divide the disk.
- 02. In a filesystem, a block is the smallest unit of reading and writing; the default block size differs between filesystems.
Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
- 03. A block in HDFS is different from a filesystem block: the default HDFS block size is 128 MB.
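To make the 128 MB figure concrete, here is a minimal sketch of how a file's bytes would be divided into blocks. The function name `split_into_blocks` is made up for these notes and is not part of any Hadoop API; the only fact taken from above is the default block size.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size noted above (128 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration only.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
print([size // MB for size in split_into_blocks(300 * MB)])
```

In this sketch a 300 MB file becomes two full blocks plus one 44 MB block, and a 1 MB file becomes a single 1 MB block rather than a padded 128 MB one.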
2.2 namenode & datanode
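Their main functions, and how they differ:
- namenode (the master)
01. Manages the filesystem namespace.
02. Maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is persisted on disk in two forms: the namespace image and the edit log. The namenode does not persist block locations, however, because that information is rebuilt from the datanodes when the cluster starts.
- a number of datanodes (workers)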
2.3 block cache
Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
By default, a block is cached in only one datanode’s memory, although the number is configurable on a per-file basis.
Users or applications instruct the namenode which files to cache (and for how long) by adding a cache directive to a cache pool.
2.4 HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
The answer is to manage the datanodes with several independent namenodes.
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
HDFS federation is really namenode federation. "Federation" here means an alliance: several namenodes manage the filesystem together.
For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
- The namespace volumes of the namenodes are independent of each other:
Namespace volumes are independent of each other, which means namenodes do not communicate with one another, and furthermore the failure of one namenode does not affect the availability of the namespaces managed by other namenodes.
- Datanodes must register with every namenode:
Block pool storage is not partitioned, so datanodes register with each namenode in the cluster and store blocks from multiple block pools.
- How do clients access a federated cluster?
To access a federated HDFS cluster, clients use client-side mount tables to map file paths to namenodes.
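A client-side mount table can be pictured as a prefix lookup from file paths to namenodes. The sketch below is only an illustration of that idea under my own assumptions: the table contents, the namenode names, and `resolve_namenode` are hypothetical and do not reproduce Hadoop's actual mount table mechanism.

```python
# Hypothetical mount table: path prefix -> namenode responsible for it.
MOUNT_TABLE = {
    "/user": "namenode1",
    "/share": "namenode2",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the namenode whose mount point is the longest prefix of path."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/username" is not under "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError("no mount point covers " + path)
    return mount_table[best]


print(resolve_namenode("/user/tom/data.txt"))
print(resolve_namenode("/share/lib/tool.jar"))
```

The lookup happens entirely on the client, which fits the point above that namenodes do not communicate with one another.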
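3. Why are HDFS blocks so much larger than filesystem blocks?
01. The reason is to minimize the cost of seeks: if the block is large enough, the time spent seeking to the start of the block is small compared with the time spent transferring the data.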
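A quick back-of-the-envelope calculation shows why a block size around 100 MB is reasonable. The inputs (a 10 ms seek time, a 100 MB/s transfer rate, and a target of seek time being 1% of transfer time) are illustrative assumptions, and `min_block_size_mb` is a helper invented for these notes.

```python
def min_block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Block size (in MB) at which seek time is seek_fraction of transfer time."""
    # Transfer time needed so that the seek is only seek_fraction of it.
    transfer_time_s = (seek_ms / 1000.0) / seek_fraction
    return transfer_mb_per_s * transfer_time_s


# Assumed numbers: 10 ms seek, 100 MB/s transfer, seek should cost about 1%.
print(min_block_size_mb(seek_ms=10, transfer_mb_per_s=100, seek_fraction=0.01))
```

With these assumed numbers the result is about 100 MB, which is in the same range as the 128 MB default.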
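4. Writing files in HDFS
- 01. A file in HDFS can be written to by a single writer only. [What exactly does "single writer" mean? updating…]
- 02. Writes are always made at the end of the file [that is, in append-only fashion].
There is no support for multiple writers or for modifications at arbitrary offsets in the file.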
5. The relationship between HDFS throughput and latency
HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.
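6. How much can HDFS store?
- 01. In theory, HDFS has no upper limit on storage, because the data can be spread over many machines.
- 02. In practice, HDFS storage is limited. Every block, directory, and file in HDFS is held as a metadata object in the namenode's memory, and because that memory is finite, so is the amount of data HDFS can hold.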
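The memory limit can be estimated with simple arithmetic. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of namenode memory per file, directory, or block; that figure is an approximation, and `namenode_memory_bytes` is a name invented for these notes.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Rough namenode memory needed for the given number of namespace objects."""
    num_objects = num_files + num_files * blocks_per_file + num_dirs
    return num_objects * BYTES_PER_OBJECT


# One million files, each occupying a single block.
estimate = namenode_memory_bytes(1_000_000)
print(estimate / 1_000_000, "MB")
```

Under that assumption, one million single-block files already need about 300 MB of namenode memory, so billions of files quickly become impractical.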
7. What is the point of Hadoop defining its own block?
Having a block abstraction for a distributed filesystem brings several benefits.
- A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
- Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
- Blocks fit well with replication for providing fault tolerance and availability.
[abstraction, generalization…]
8. Keeping an HDFS Cluster Balanced
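How do we keep an HDFS cluster balanced?
The main scenarios to consider are:
01. When copying data into HDFS, make sure the distcp command does not upset the balance.
02. distcp runs as a MapReduce job, so let it use many map tasks rather than a single one. [Do you know why?]
The reason: with only one map task, that single map does the whole copy, so the first replica of every block is written to the node running the map (until its disk fills up). The remaining replicas are spread across the cluster, but that one node ends up unbalanced, which skews the distribution of the data.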
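9. How does Hadoop make the namenode resilient?
Before answering this, ask yourself why the namenode needs to be resilient at all. And before answering that, think about what the namenode actually does.
9.1 What does the namenode do?
See 2.2 above.
9.2 Why does the namenode need to be resilient?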
Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes.
For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.
9.3 How is the namenode made resilient?
- 01. Back up the files that make up the persistent state of the filesystem metadata.
- 02. Run a secondary namenode.
9.4 What does the secondary namenode do?
Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so some data loss is almost certain if the primary fails completely.
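The checkpoint merge can be sketched as replaying the edit log on top of the namespace image. Everything below is a toy model: the tuple format of the edits, the dictionary used as the image, and the function names are assumptions made for illustration and do not mirror Hadoop's on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one edit-log entry (a tuple) to the in-memory namespace dict."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]                   # edit[2] is the file's metadata
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)    # edit[2] is the new path
    else:
        raise ValueError("unknown operation: " + op)


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return (new_image, empty_log)."""
    merged = dict(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a": {"blocks": 1}}
edits = [("create", "/b", {"blocks": 2}), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(image, edits)
print(new_image, new_log)
```

After the merge the edit log starts again from empty, which is what keeps it from growing without bound.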
10. How does a Hadoop cluster stay highly available?
10.1 Does replicating the namenode metadata on multiple filesystems, combined with a secondary namenode, keep the cluster highly available?
The combination of replicating namenode metadata on multiple filesystems and using the secondary namenode to create checkpoints protects against data loss, but it does not provide high availability of the filesystem.
No. Why not?
Because the namenode is still a single point of failure: it is the sole repository of the metadata and the file-to-block mapping.
So how can Hadoop high availability be achieved? The main idea is:
There is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
Making this active-standby setup work requires the following changes:
The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.
Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.
Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
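The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
The transition from the active namenode to the standby is managed by an entity called the failover controller.
In the new setup, during an ungraceful failover transition, it is impossible to be sure that the failed namenode has stopped running, so the HA implementation uses fencing to make sure the previously active namenode cannot do any damage.
With this in place, we finally have HA.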
If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The actual observed failover time will be longer in practice (around a minute or so), because the system needs to be conservative in deciding that the active namenode has failed.
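The steps above raise some further questions:
- How is the highly available shared storage implemented? There are two choices: an NFS filer or a quorum journal manager (QJM).
NFS: the edit log is kept on a shared NFS filer.
QJM: a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and the recommended choice for most HDFS installations. The QJM runs as a group of journal nodes, and each edit must be written to a majority of the journal nodes.
- What does "block mapping" refer to?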
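The QJM rule that each edit must reach a majority of the journal nodes can be sketched as a simple quorum check. The `JournalNode` class and `write_edit` function below are toy stand-ins written for these notes; the real QJM involves RPC, epochs, and recovery that this sketch ignores.

```python
class JournalNode:
    """Toy journal node that either accepts an edit or is unreachable."""

    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, edit):
        if not self.up:
            return False
        self.edits.append(edit)
        return True


def write_edit(journal_nodes, edit):
    """Send the edit to every journal node; succeed only with a majority of acks."""
    acks = sum(1 for node in journal_nodes if node.write(edit))
    return acks > len(journal_nodes) // 2


nodes = [JournalNode(), JournalNode(), JournalNode(up=False)]
print(write_edit(nodes, "mkdir /user/tom"))
```

With three journal nodes the write still succeeds when one node is down, but not when two are down, which is why the journal nodes are run as a group rather than as a single server.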
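11. How do FSDataOutputStream and FSDataInputStream differ?
01. FSDataOutputStream does not support seek(). The reason: HDFS only allows sequential writes to a new file or appends to an existing one.
02. FSDataInputStream supports seek(), that is, random access.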
12. The main file classes used in Hadoop
- FileSystem
- FileStatus
13. Data Flow: anatomy of a file read
How the client interacts with HDFS, the namenode, and the datanodes. The main steps are:
- 01. The client calls open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
- 02. DistributedFileSystem calls the namenode, using RPC, to get the locations of the first few blocks in the file.
- 03. For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
- The datanodes are sorted according to their proximity to the client, based on the topology of the cluster’s network [see Network Topology and Hadoop].
- 04. DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.
- 05. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
- 06. The client calls read() repeatedly on the stream.
- 07. When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block.
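The read steps above can be sketched as a loop that, for each block, picks the closest datanode holding a copy and reads the block from it. This is a toy simulation under my own assumptions: the dictionaries stand in for the namenode's answers and the datanodes' storage, and `read_file` is not a Hadoop function.

```python
def read_file(block_locations, distance_to_client, datanodes):
    """Read a file block by block, always from the closest datanode.

    block_locations: list of (block_id, [datanode names]) in file order,
                     standing in for what the namenode returns.
    distance_to_client: dict of datanode name -> network distance.
    datanodes: dict of datanode name -> {block_id: bytes}.
    Returns (file contents, list of datanodes that were read from).
    """
    data = b""
    chosen = []
    for block_id, holders in block_locations:
        # Pick the replica holder nearest to the client for this block.
        closest = min(holders, key=lambda name: distance_to_client[name])
        data += datanodes[closest][block_id]
        chosen.append(closest)
    return data, chosen


datanodes = {
    "dn1": {"b1": b"hello "},
    "dn2": {"b1": b"hello ", "b2": b"world"},
    "dn3": {"b2": b"world"},
}
distance = {"dn1": 0, "dn2": 2, "dn3": 4}
locations = [("b1", ["dn2", "dn1"]), ("b2", ["dn3", "dn2"])]
print(read_file(locations, distance, datanodes))
```

Note that the data itself flows from the datanodes to the client; in this sketch the namenode's part is limited to supplying `block_locations`.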
14. Data Flow: anatomy of a file write
It is instructive to understand the data flow because it clarifies HDFS’s coherency model.
- 01. The client calls create() on DistributedFileSystem.
- 02. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace.
- 03. The namenode checks that the file does not already exist and that the client has permission to create it.
- 04. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data.
- 05. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.
- 06. The data queue is consumed by the DataStreamer.
- 07. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue.
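The interplay of the data queue and the ack queue in the steps above can be sketched with two deques. This is a simplified model written for these notes: the class name, the packet size, and the single `acknowledge()` call standing in for acknowledgements from all datanodes in the pipeline are assumptions.

```python
from collections import deque


def split_into_packets(data, packet_size):
    """Cut the written bytes into fixed-size packets (the last may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class WritePipelineSketch:
    """Toy model of the two queues kept on the client side of a write."""

    def __init__(self, packet_size):
        self.packet_size = packet_size
        self.data_queue = deque()   # packets waiting to be sent
        self.ack_queue = deque()    # packets sent but not yet acknowledged
        self.acknowledged = []      # packets confirmed by the pipeline

    def write(self, data):
        self.data_queue.extend(split_into_packets(data, self.packet_size))

    def stream_one(self):
        """What the streamer does: take a packet off the data queue and send it."""
        packet = self.data_queue.popleft()
        self.ack_queue.append(packet)
        return packet

    def acknowledge(self):
        """Drop the oldest in-flight packet once the pipeline has confirmed it."""
        self.acknowledged.append(self.ack_queue.popleft())


pipeline = WritePipelineSketch(packet_size=4)
pipeline.write(b"abcdefghij")
pipeline.stream_one()
pipeline.acknowledge()
print(list(pipeline.data_queue), list(pipeline.ack_queue), pipeline.acknowledged)
```

Keeping unacknowledged packets in a separate queue is what would let them be resent if a datanode in the pipeline failed mid-write.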
15. NETWORK TOPOLOGY AND HADOOP
Network topology in Hadoop matters a great deal for the performance of MapReduce jobs.
What does it mean for two nodes in a local network to be “close” to each other? In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes — bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.
Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on. The idea is that the bandwidth available for each of the following scenarios becomes progressively less:
- Processes on the same node
- Different nodes on the same rack
- Nodes on different racks in the same data center
- Nodes in different data centers
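The tree distance described above is easy to compute once each node is written as a path of the form /datacenter/rack/node. The path notation and the `distance` function below are my own sketch of the idea, not Hadoop's implementation.

```python
def distance(a, b):
    """Hops from a and b up to their closest common ancestor, summed."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack, different nodes
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # same data center, different racks
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

With three levels in the tree, the four cases give distances 0, 2, 4, and 6, matching the intuition that available bandwidth shrinks as nodes get further apart.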
16. How are replica locations chosen? Replica placement
- 01. The first replica is placed on the same node as the client.
- 02. The second replica is placed on a different rack from the first (off-rack), chosen at random.
- 03. The third replica is placed on the same rack as the second, but on a different node chosen at random.
Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
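The three placement rules above can be sketched as follows. This is a simplified illustration: `place_replicas` and the topology dictionary are hypothetical, the client is assumed to run on a node inside the cluster, and every rack is assumed to have at least two nodes. A real placement policy also weighs things such as how full or busy a node is, which this sketch ignores.

```python
import random


def place_replicas(client_node, topology, rng=random):
    """Pick three replica locations following the rules listed above.

    topology: dict of rack name -> list of node names (assumed >= 2 per rack).
    rng: anything with a choice() method, so tests can pass a seeded Random.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    first = client_node                                   # rule 01: the client's node
    other_racks = [rack for rack in topology if rack != rack_of[first]]
    second_rack = rng.choice(other_racks)                 # rule 02: a different rack
    second = rng.choice(topology[second_rack])
    third = rng.choice(                                   # rule 03: same rack, other node
        [node for node in topology[second_rack] if node != second]
    )
    return [first, second, third]


topology = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
print(place_replicas("n1", topology))
```

The result puts two of the three replicas off the client's rack, so losing a whole rack never loses every copy of a block.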
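17. Coherency Model
- 01. hflush() flushes the written data to the datanodes in the write pipeline, where it may still sit only in memory; it becomes visible to new readers but is not guaranteed to be on disk.
- 02. hsync() additionally syncs the data to disk on the datanodes, much like fsync in POSIX.
To avoid losing written data, applications should call these methods at suitable points.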
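A rough local-filesystem analogy may help here: flushing a buffer and syncing to disk are separate steps for an ordinary file too. The sketch below uses only Python's standard library on a local file; it is an analogy and not HDFS code, and `write_durably` is a name invented for these notes.

```python
import os


def write_durably(path, data):
    """Write bytes to a local file, flushing and then syncing to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # push the program's buffer to the OS (loosely like hflush())
        os.fsync(f.fileno())    # ask the OS to commit the data to disk (loosely like hsync())
    return os.path.getsize(path)
```

As with hflush() and hsync(), the sync step costs more than the flush, so how often to call it is a trade-off between durability and throughput.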
18. Strengths and weaknesses of HDFS
18.1 advantage
- Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
- Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
- Commodity hardware
18.2 disadvantage
It is also worth examining the applications for which using HDFS does not work so well. Although this may change in the future, these are areas where HDFS is not a good fit today:
- Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
- Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
So if there are too many small files, their metadata takes up a large share of the namenode's memory, and the data can no longer be stored efficiently.
- Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion.