A Big Data Architect Walks You Through the HDFS Format Process, Possibly the Most Worthwhile Classic

程序员小英

Published 2024-03-22 16:16:14

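As we know, the Namenode can be started with different options. Specifying the -format option formats the Namenode. The formatting method can be found in the NameNode class, with the following signature: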

private static boolean format(Configuration conf,
      boolean isConfirmationNeeded, boolean isInteractive) throws IOException

This method first calls methods of the FSNamesystem class to obtain the name directories and edit directories to be formatted:

Collection<File> dirsToFormat = FSNamesystem.getNamespaceDirs(conf);
Collection<File> editDirsToFormat = FSNamesystem.getNamespaceEditsDirs(conf);

Tracing into these FSNamesystem methods shows that the directories actually obtained are:

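name directory: taken from the configured dfs.name.dir property; if it is not configured, the default directory /tmp/hadoop/dfs/name is used.
edit directory: taken from the configured dfs.name.edits.dir property; if it is not configured, the default directory /tmp/hadoop/dfs/name is used.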
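The fallback rule above can be sketched in plain Java. This is only an illustration under stated assumptions: NamespaceDirsSketch and resolveDirs are hypothetical names, java.util.Properties stands in for Hadoop's Configuration, and the comma-separated value format is assumed rather than taken from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; this is not Hadoop's FSNamesystem code.
final class NamespaceDirsSketch {
    // Fallback directory named in the text above.
    static final String DEFAULT_DIR = "/tmp/hadoop/dfs/name";

    // Returns the comma-separated directories stored under key (assumed format),
    // or the default directory when the key is unset or empty.
    static List<String> resolveDirs(Properties conf, String key) {
        List<String> dirs = new ArrayList<>();
        String raw = conf.getProperty(key);
        if (raw != null) {
            for (String part : raw.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    dirs.add(trimmed);
                }
            }
        }
        if (dirs.isEmpty()) {
            dirs.add(DEFAULT_DIR);
        }
        return dirs;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("dfs.name.dir", "/data/1/name,/data/2/name");
        // dfs.name.dir is configured, dfs.name.edits.dir is not.
        System.out.println(resolveDirs(conf, "dfs.name.dir"));
        System.out.println(resolveDirs(conf, "dfs.name.edits.dir"));
    }
}
```

With only dfs.name.dir set, the second call falls back to the default directory, which is the situation the article describes for an unconfigured edit directory.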

Still in the format method above, the objects representing the name and edit directories are created by the following line:

FSNamesystem nsys = new FSNamesystem(new FSImage(dirsToFormat, editDirsToFormat), conf);

The HDFS file system is actually formatted by calling the format method of the FSImage object, as follows:

nsys.dir.fsImage.format();

The key operations mentioned above are described in detail below.

FSImage object initialization

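From the FSImage constructor used above, we can see that the Namenode's directory objects are created separately for name directories and edit directories. For a name directory, the storage directory type is either IMAGE or IMAGE_AND_EDITS: IMAGE_AND_EDITS when the configured name directory is also an edit directory, and IMAGE otherwise. For an edit directory that is not also a name directory, the type is EDITS. The name and edit directories are in effect what an FSImage object contains: it holds a list of StorageDirectory objects. FSImage inherits from the abstract class org.apache.hadoop.hdfs.server.common.Storage, which defines the list as follows: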

protected List<StorageDirectory> storageDirs = new ArrayList<StorageDirectory>();

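Each storage directory in this list is a Storage.StorageDirectory object. According to the Storage.StorageDirectory class diagram, it mainly carries the following three pieces of information: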

root: the configured root directory path
lock: a FileLock object that controls write operations under root
dirType: the type of directory used by the StorageDirectory object

The dirType field is of type Storage.StorageDirType, an interface defined as follows:

public interface StorageDirType {
  public StorageDirType getStorageDirType();
  public boolean isOfType(StorageDirType type);
}

For the Storage.StorageDirectory objects of Namenode directories, dirType is a value of the enum FSImage.NameNodeDirType, which implements the Storage.StorageDirType interface and is defined as follows:

static enum NameNodeDirType implements StorageDirType {
  UNDEFINED,
  IMAGE,
  EDITS,
  IMAGE_AND_EDITS;
  
  public StorageDirType getStorageDirType() {
    return this;
  }
  
  public boolean isOfType(StorageDirType type) {
    if ((this == IMAGE_AND_EDITS) && (type == IMAGE || type == EDITS))
      return true;
    return this == type;
  }
}

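The values defined in this enum are exactly the types of the Storage.StorageDirectory objects held by the FSImage object mentioned earlier. Initializing the FSImage object therefore determines the list of Storage.StorageDirectory objects it contains, together with their types.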
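The type assignment described in this section can be sketched as follows. This is a simplified stand-in, not the FSImage constructor: DirTypeSketch and classify are hypothetical names, directories are plain strings instead of File objects, and the enum copies only the isOfType rule quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of how name and edit directories map to directory types.
final class DirTypeSketch {
    enum DirType {
        IMAGE, EDITS, IMAGE_AND_EDITS;

        // Same rule as NameNodeDirType.isOfType quoted in the article.
        boolean isOfType(DirType type) {
            if (this == IMAGE_AND_EDITS && (type == IMAGE || type == EDITS)) {
                return true;
            }
            return this == type;
        }
    }

    // A name directory that is also an edit directory becomes IMAGE_AND_EDITS,
    // other name directories become IMAGE, remaining edit directories become EDITS.
    static Map<String, DirType> classify(List<String> nameDirs, List<String> editDirs) {
        Set<String> edits = new HashSet<>(editDirs);
        Map<String, DirType> result = new LinkedHashMap<>();
        for (String dir : nameDirs) {
            result.put(dir, edits.contains(dir) ? DirType.IMAGE_AND_EDITS : DirType.IMAGE);
        }
        for (String dir : editDirs) {
            result.putIfAbsent(dir, DirType.EDITS);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, DirType> types =
                classify(Arrays.asList("/a", "/b"), Arrays.asList("/a", "/c"));
        for (Map.Entry<String, DirType> e : types.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + ", holds image: " + e.getValue().isOfType(DirType.IMAGE)
                    + ", holds edits: " + e.getValue().isOfType(DirType.EDITS));
        }
    }
}
```

The isOfType check is what later lets a single IMAGE_AND_EDITS directory receive both the fsimage and the edits file during formatting.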

FSNamesystem object initialization

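FSNamesystem is a critical class. It holds information related to Datanodes, such as the mapping from Blocks to Datanodes and the mapping from StorageIDs to Datanodes. The FSNamesystem constructor called earlier is shown below: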

FSNamesystem(FSImage fsImage, Configuration conf) throws IOException {
  setConfigurationParameters(conf);
  this.dir = new FSDirectory(fsImage, this, conf);
  dtSecretManager = createDelegationTokenSecretManager(conf);
}

Initialization mainly covers the following:

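The setConfigurationParameters method sets the parameter values used by FSNamesystem from the conf object passed in.
An FSDirectory object dir is created. It provides the operations that maintain the directory state of the Hadoop file system and controls the actual operations on directories, such as writes and loads. It also keeps the "file -> Block list" mapping up to date and records changes to the log.
A DelegationTokenSecretManager object is created to manage secure access to HDFS.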

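The FSDirectory object dir created in FSNamesystem represents the root of the whole HDFS file system. Internally, FSDirectory dir holds an inode for it, the quota-aware INodeDirectoryWithQuota rootDir; see the analysis below for details.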

FSDirectory object initialization

The FSDirectory object is important. The class defines the following fields:

final FSNamesystem namesystem;
final INodeDirectoryWithQuota rootDir;
FSImage fsImage;
private boolean ready = false;
private final int lsLimit;  // max list limit
private final NameCache<ByteArray> nameCache;

Here rootDir is an inode object with quota limits. Now look at the FSDirectory constructor:

FSDirectory(FSImage fsImage, FSNamesystem ns, Configuration conf) {
  rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME,
      ns.createFsOwnerPermissions(new FsPermission((short)0755)), Integer.MAX_VALUE, -1);
  this.fsImage = fsImage;
  fsImage.setRestoreRemovedDirs(conf.getBoolean(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_KEY,
      DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_DEFAULT));
  fsImage.setEditsTolerationLength(conf.getInt(DFSConfigKeys.DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY,
      DFSConfigKeys.DFS_NAMENODE_EDITS_TOLERATION_LENGTH_DEFAULT));
 
  namesystem = ns;
  int configuredLimit = conf.getInt(DFSConfigKeys.DFS_LIST_LIMIT, DFSConfigKeys.DFS_LIST_LIMIT_DEFAULT);
  this.lsLimit = configuredLimit>0 ?
      configuredLimit : DFSConfigKeys.DFS_LIST_LIMIT_DEFAULT;
  
  int threshold = conf.getInt(DFSConfigKeys.DFS_NAMENODE_NAME_CACHE_THRESHOLD_KEY,
      DFSConfigKeys.DFS_NAMENODE_NAME_CACHE_THRESHOLD_DEFAULT);
  NameNode.LOG.info("Caching file names occuring more than " + threshold + " times ");
  nameCache = new NameCache<ByteArray>(threshold);
 
}

A rootDir object is created here. Stepping through this code in a debugger with the user name shirdrn, its value can be rendered as follows:

"":shirdrn:supergroup:rwxr-xr-x

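So, in the namespace maintained by the FSNamesystem object, an inode object contains the directory name, owning user, owning group, and permission information. The constructor above also initializes a NameCache object for caching frequently used file names. A threshold is supplied here, with a default of 10: once a file name has been used more than threshold times, it is placed in the NameCache. More precisely, the key is the ByteArray representation of the file name's bytes, which identifies the name of the file's inode. Inside NameCache the name is stored in an internal HashMap, where the key is the ByteArray form of the file name and the value wraps the usage count.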
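The counting behaviour described above can be sketched with two maps. This is an assumption-laden illustration, not Hadoop's NameCache: NameCacheSketch is a hypothetical class, names are plain String values instead of ByteArray, and the promotion rule follows the article's wording ("more than threshold times").

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a threshold-based name cache.
final class NameCacheSketch {
    private final int threshold;
    // Names seen so far that have not crossed the threshold, with their use counts.
    private final Map<String, Integer> useCounts = new HashMap<>();
    // Names that crossed the threshold; the value is the shared canonical instance.
    private final Map<String, String> cache = new HashMap<>();

    NameCacheSketch(int threshold) {
        this.threshold = threshold;
    }

    // Records one use of name. Returns the cached instance if the name is
    // (or has just become) cached, otherwise null.
    String put(String name) {
        String cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        int count = useCounts.merge(name, 1, Integer::sum);
        if (count > threshold) {
            useCounts.remove(name);
            cache.put(name, name);
            return name;
        }
        return null;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        NameCacheSketch names = new NameCacheSketch(2);
        for (int i = 1; i <= 4; i++) {
            System.out.println("use " + i + ": cached=" + (names.put("part-00000") != null));
        }
    }
}
```

Sharing one instance per frequently repeated name is the point of such a cache: many inodes with the same name can reference the same bytes.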

Formatting HDFS

The format method of the FSImage object is called. Its implementation is shown below:

public void format() throws IOException {
  this.layoutVersion = FSConstants.LAYOUT_VERSION;
  this.namespaceID = newNamespaceID();
  this.cTime = 0L;
  this.checkpointTime = FSNamesystem.now();
  for (Iterator<StorageDirectory> it = dirIterator(); it.hasNext();) {
    StorageDirectory sd = it.next();
    format(sd);
  }
}

The logic of the code above is explained in detail as follows:

layoutVersion

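layoutVersion is the version number of HDFS's persistent data structures, and it is a negative value. Whenever the persistent data structures change, for example when new operations or fields are added, the version number is decremented by 1. In Hadoop 1.2.1, layoutVersion is -41. It is unrelated to the Hadoop release version number. If layoutVersion changes (it changes by being decremented, so its value becomes smaller), an upgrade (Upgrade) must be performed before data written by the older version can be read. layoutVersion is mainly used in the fsimage and edit log files and in the data storage files.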
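Because layout versions are negative and decrease over time, "older" means numerically larger. A minimal sketch of that comparison, assuming the -41 value cited above; LayoutVersionSketch and its methods are hypothetical names and do not reproduce Hadoop's own checks:

```java
// Hypothetical sketch of comparing a stored layout version with the software's.
final class LayoutVersionSketch {
    // Value the article cites for Hadoop 1.2.1.
    static final int SOFTWARE_LAYOUT_VERSION = -41;

    // Stored data is older than the software when its version is numerically larger.
    static boolean needsUpgrade(int storedLayoutVersion) {
        return storedLayoutVersion > SOFTWARE_LAYOUT_VERSION;
    }

    // Stored data was written by newer software when its version is numerically smaller.
    static boolean writtenByNewerSoftware(int storedLayoutVersion) {
        return storedLayoutVersion < SOFTWARE_LAYOUT_VERSION;
    }

    public static void main(String[] args) {
        // -40 and -42 are illustrative neighbours of -41, not specific releases.
        for (int stored : new int[] {-40, -41, -42}) {
            System.out.println(stored + ": needsUpgrade=" + needsUpgrade(stored)
                    + ", writtenByNewerSoftware=" + writtenByNewerSoftware(stored));
        }
    }
}
```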

namespaceID

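namespaceID uniquely identifies an HDFS instance, and its value is assigned when HDFS is formatted. After the HDFS cluster starts, namespaceID is used to recognize the Datanodes in the cluster: when the cluster starts, each Datanode automatically registers with the Namenode, obtains the namespaceID value, and then stores it in the VERSION file on the Datanode.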

cTime

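cTime is the creation time of the Namenode storage object (that is, the FSImage object), but it is initialized to 0. If a change in layoutVersion triggers an upgrade, this time field is updated.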

checkpointTime

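checkpointTime controls the execution of checkpoints (Checkpoint). To obtain a consistent time across the cluster, the timestamp is generated by calling the now method of FSNamesystem. Hadoop uses checkpointing to make the data stored by the Namenode reliable; if the Namenode goes down and its data cannot be recovered, the whole cluster stops working.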

Formatting StorageDirectory objects

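As we know, every Storage object contains a list of StorageDirectory objects, and FSImage is the Storage implementation the Namenode uses to store its data. The for loop in the code above formats each StorageDirectory in turn; the corresponding format method is shown below: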

void format(StorageDirectory sd) throws IOException {
  sd.clearDirectory(); // create current dir
  sd.lock();
  try {
    saveCurrent(sd);
  } finally {
    sd.unlock();
  }
  LOG.info("Storage directory " + sd.getRoot() + " has been successfully formatted.");
}

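The call to sd.lock() above creates the lock file ${dfs.name.dir}/in_use.lock, which ensures that only one process at a time can perform the format operation. The key formatting logic is in the saveCurrent method, shown below: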

protected void saveCurrent(StorageDirectory sd) throws IOException {
  File curDir = sd.getCurrentDir();
  NameNodeDirType dirType = (NameNodeDirType)sd.getStorageDirType();
  // save new image or new edits
  if (!curDir.exists() && !curDir.mkdir())
    throw new IOException("Cannot create directory " + curDir);
  if (dirType.isOfType(NameNodeDirType.IMAGE))
    saveFSImage(getImageFile(sd, NameNodeFile.IMAGE));
  if (dirType.isOfType(NameNodeDirType.EDITS))
    editLog.createEditLogFile(getImageFile(sd, NameNodeFile.EDITS));
  // write version and time files
  sd.write();
}
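The in_use.lock mechanism behind sd.lock() and sd.unlock() can be illustrated with java.nio file locks. This is a sketch under assumptions, not the StorageDirectory code: StorageLockSketch, tryLock, and unlock are hypothetical names, and only the lock file name comes from the article.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;

// Hypothetical sketch of guarding a storage directory with a lock file.
final class StorageLockSketch {
    // Tries to lock <root>/in_use.lock. Returns the lock, or null if it is already held.
    static FileLock tryLock(File root) throws IOException {
        RandomAccessFile file = new RandomAccessFile(new File(root, "in_use.lock"), "rws");
        FileLock lock = null;
        try {
            lock = file.getChannel().tryLock();
        } catch (OverlappingFileLockException e) {
            // The lock is already held inside this JVM; report it as unavailable.
        } finally {
            if (lock == null) {
                file.close();
            }
        }
        return lock;
    }

    // Releases the lock and closes the underlying channel.
    static void unlock(FileLock lock) throws IOException {
        lock.release();
        lock.channel().close();
    }

    public static void main(String[] args) throws IOException {
        File root = Files.createTempDirectory("name").toFile();
        FileLock first = tryLock(root);
        System.out.println("first attempt locked: " + (first != null));
        System.out.println("second attempt locked: " + (tryLock(root) != null));
        unlock(first);
    }
}
```

tryLock returns null when another process holds the lock and throws OverlappingFileLockException when the same JVM already holds it, so the sketch folds both cases into a null result.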

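Each StorageDirectory object is an abstraction of one storage directory and has three properties: root, lock, and dirType. During formatting, the directory is first deleted if it already exists and then created again. Its absolute path is: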

${dfs.name.dir}/current/

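With the directory in place, the corresponding files have to be created. Two important files, fsimage and edits, are generated here; the contents of each are described in detail below: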

Initializing the fsimage file data

The corresponding lines of code are:

if (dirType.isOfType(NameNodeDirType.IMAGE))
  saveFSImage(getImageFile(sd, NameNodeFile.IMAGE));

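If the dirType of the StorageDirectory object is IMAGE (or IMAGE_AND_EDITS, which isOfType also treats as IMAGE), a file is created under the current directory above: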

${dfs.name.dir}/current/fsimage

The saveFSImage method shows the main operations that store data into the fsimage file:

try {
  out.writeInt(FSConstants.LAYOUT_VERSION);
  out.writeInt(namespaceID);
  out.writeLong(fsDir.rootDir.numItemsInTree());
  out.writeLong(fsNamesys.getGenerationStamp());
  byte[] byteStore = new byte[4*FSConstants.MAX_PATH_LENGTH];
  ByteBuffer strbuf = ByteBuffer.wrap(byteStore);
  // save the root
  saveINode2Image(strbuf, fsDir.rootDir, out);
  // save the rest of the nodes
  saveImage(strbuf, 0, fsDir.rootDir, out);
  fsNamesys.saveFilesUnderConstruction(out);
  fsNamesys.saveSecretManagerState(out);
  strbuf = null;
} finally {
  out.close();
}

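First, some basic file system information is saved. In write order, per the code above: the layout version (int), namespaceID (int), the number of items in the namespace tree (long), and the generation stamp (long).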
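The fixed-size header that saveFSImage writes first can be reproduced in isolation with DataOutputStream. The sketch below is not the FSImage code: ImageHeaderSketch and its methods are hypothetical names, and the values passed in main are placeholders (the namespaceID is borrowed from the VERSION sample later in the article; the item count and generation stamp are made up).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of the first four fields written to fsimage.
final class ImageHeaderSketch {
    // Writes the header fields in the same order as the saveFSImage excerpt.
    static byte[] writeHeader(int layoutVersion, int namespaceId,
                              long numItems, long generationStamp) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(layoutVersion);
            out.writeInt(namespaceId);
            out.writeLong(numItems);
            out.writeLong(generationStamp);
        }
        return bytes.toByteArray();
    }

    // Reads the four fields back in the order they were written.
    static long[] readHeader(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int layoutVersion = in.readInt();
            int namespaceId = in.readInt();
            long numItems = in.readLong();
            long generationStamp = in.readLong();
            return new long[] {layoutVersion, namespaceId, numItems, generationStamp};
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] header = writeHeader(-41, 1858400315, 1L, 1000L);
        // Two 4-byte ints plus two 8-byte longs.
        System.out.println("header bytes: " + header.length);
        System.out.println("layoutVersion read back: " + readHeader(header)[0]);
    }
}
```

DataOutputStream writes big-endian fixed-width values, which is why the file is not readable as text and must be decoded field by field in the same order.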

Next, the call to saveINode2Image saves the root directory's name and name length, together with its inode information.

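Then the saveImage method is called to save the information of the remaining directory nodes below the root. saveImage is recursive: given a root directory, it saves the information of every directory and file beneath it. At this point only the root directory of the file system has been created and it has no child inodes, so this step does not actually store any inode information. Next, fsNamesys.saveFilesUnderConstruction(out) saves the lease (Lease) information of files under construction, as shown below: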

void saveFilesUnderConstruction(DataOutputStream out) throws IOException {
  synchronized (leaseManager) {
    out.writeInt(leaseManager.countPath()); // write the size
 
    for (Lease lease : leaseManager.getSortedLeases()) {
      for(String path : lease.getPaths()) {
        // verify that path exists in namespace
        INode node = dir.getFileINode(path);
        if (node == null) {
          throw new IOException("saveLeases found path " + path + " but no matching entry in namespace.");
        }
        if (!node.isUnderConstruction()) {
          throw new IOException("saveLeases found path " + path + " but is not under construction.");
        }
        INodeFileUnderConstruction cons = (INodeFileUnderConstruction) node;
        FSImage.writeINodeUnderConstruction(out, cons, path);
      }
    }
  }
}

Here leaseManager.countPath() is 0 because no file has a lease yet, so the for loop does not execute. Only a single int value 0 is written, indicating that the leaseManager object manages 0 paths.

The call to fsNamesys.saveSecretManagerState(out) saves the state of the SecretManager. Tracing the code leads to saveSecretManagerState in the DelegationTokenSecretManager class, shown below:

public synchronized void saveSecretManagerState(DataOutputStream out) throws IOException {
  out.writeInt(currentId);
  saveAllKeys(out);
  out.writeInt(delegationTokenSequenceNumber);
  saveCurrentTokens(out);
}

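The data is written in this order: currentId (int), all keys (saveAllKeys), delegationTokenSequenceNumber (int), and the current tokens (saveCurrentTokens).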

Everything above makes up the data stored in the fsimage file.

Initializing the edits file data

The corresponding lines of code are:

if (dirType.isOfType(NameNodeDirType.EDITS))
  editLog.createEditLogFile(getImageFile(sd, NameNodeFile.EDITS));

First the name of the edits file is obtained, namely:

${dfs.name.dir}/current/edits

Then the createEditLogFile method of the editLog object is called to actually create the file. Its implementation is shown below:

public synchronized void createEditLogFile(File name) throws IOException {
  EditLogOutputStream eStream = new EditLogFileOutputStream(name);
  eStream.create();
  eStream.close();
}

A stream object EditLogOutputStream eStream is created and some basic information is initialized for operating on the edits file, as the create method shows clearly:

@Override
void create() throws IOException {
  fc.truncate(0);
  fc.position(0);
  bufCurrent.writeInt(FSConstants.LAYOUT_VERSION);
  setReadyToFlush();
  flush();
}

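This serializes the layoutVersion value, -41 here. EditLogOutputStream maintains two buffers internally, bufCurrent and bufReady. Data to be written goes into bufCurrent first; bufCurrent and bufReady are then swapped, which leaves bufCurrent free to accept new data, while the data in bufReady is persisted to the edits file when flush() is called. The setReadyToFlush() method above is what swaps the two buffers. flush() in turn calls the flushAndSync() method in the FSEditLog class, which finally writes the data to the file. The implementation is shown below: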

@Override
protected void flushAndSync() throws IOException {
  preallocate();            // preallocate file if necessary
  bufReady.writeTo(fp);     // write data to file
  bufReady.reset();         // erase all data in the buffer
  fc.force(false);          // metadata updates not needed because of preallocation
}

At this point the edits file has been initialized.
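The two-buffer scheme described for the edits stream can be sketched without any Hadoop classes. This is an illustration only: DoubleBufferSketch is a hypothetical class that borrows the names bufCurrent, bufReady, and setReadyToFlush from the article, uses ByteArrayOutputStream for both buffers, and leaves out preallocation and fsync.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of a double-buffered writer.
final class DoubleBufferSketch {
    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives new writes
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // waits to be flushed

    // New data always lands in bufCurrent.
    void writeInt(int value) throws IOException {
        new DataOutputStream(bufCurrent).writeInt(value);
    }

    // Swaps the two buffers so that pending data becomes flushable.
    void setReadyToFlush() {
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    // Writes the ready buffer to the sink and clears it.
    void flushTo(OutputStream sink) throws IOException {
        bufReady.writeTo(sink);
        bufReady.reset();
    }

    public static void main(String[] args) throws IOException {
        DoubleBufferSketch buffers = new DoubleBufferSketch();
        ByteArrayOutputStream editsFile = new ByteArrayOutputStream(); // stands in for the edits file
        buffers.writeInt(-41);      // layoutVersion, as in create()
        buffers.setReadyToFlush();
        buffers.flushTo(editsFile);
        System.out.println("bytes in edits after create: " + editsFile.size());
    }
}
```

The swap is what allows new writes to continue into bufCurrent while the contents of bufReady are still being pushed to disk.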

Initializing the VERSION file data

The sd.write() call above initializes the VERSION file. It is implemented in the Storage.StorageDirectory.write() method, shown below:

public void write() throws IOException {
  corruptPreUpgradeStorage(root);
  write(getVersionFile());
}

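The corruptPreUpgradeStorage method (re)creates an fsimage file in the legacy image directory and writes data into it so that, as the code comment puts it, pre-upgrade versions fail when they try to use this storage directory. If the image directory or the file cannot be created, an IOException is thrown and formatting fails. The method is implemented as follows: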

protected void corruptPreUpgradeStorage(File rootDir) throws IOException {
  File oldImageDir = new File(rootDir, "image");
  if (!oldImageDir.exists())
    if (!oldImageDir.mkdir())
      throw new IOException("Cannot create directory " + oldImageDir);
  File oldImage = new File(oldImageDir, "fsimage");
  if (!oldImage.exists())
    // recreate old image file to let pre-upgrade versions fail
    if (!oldImage.createNewFile())
      throw new IOException("Cannot create file " + oldImage);
  RandomAccessFile oldFile = new RandomAccessFile(oldImage, "rws");
  // write new version into old image file
  try {
    writeCorruptedData(oldFile);
  } finally {
    oldFile.close();
  }
}

First, if no image directory exists under ${dfs.name.dir}, it is created. Then the file fsimage is created under the image directory, and writeCorruptedData writes its contents.

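If corruptPreUpgradeStorage does not throw an exception, initialization of the VERSION file begins. The file path is ${dfs.name.dir}/current/VERSION, and the work is done by the call write(getVersionFile()), which mainly writes the relevant properties to the VERSION file through a Properties props object, as the setFields method shows: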

protected void setFields(Properties props, StorageDirectory sd) throws IOException {
  super.setFields(props, sd);
  boolean uState = getDistributedUpgradeState();
  int uVersion = getDistributedUpgradeVersion();
  if(uState && uVersion != getLayoutVersion()) {
    props.setProperty("distributedUpgradeState", Boolean.toString(uState));
    props.setProperty("distributedUpgradeVersion", Integer.toString(uVersion));
  }
  writeCheckpointTime(sd);
}

It calls the base class method super.setFields(props, sd), implemented as follows:

protected void setFields(Properties props, StorageDirectory sd) throws IOException {
  props.setProperty("layoutVersion", String.valueOf(layoutVersion));
  props.setProperty("storageType", storageType.toString());
  props.setProperty("namespaceID", String.valueOf(namespaceID));
  props.setProperty("cTime", String.valueOf(cTime));
}

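Putting the analysis above together gives the content written to the VERSION file. In the code above uState=false, uVersion=0, and getLayoutVersion()=-41, so the properties distributedUpgradeState and distributedUpgradeVersion are not added to the Properties object. For example, the properties look something like this: {namespaceID=64614865, cTime=0, storageType=NAME_NODE, layoutVersion=-41}. This data is not written to VERSION right away: setFields calls writeCheckpointTime first, so the fstime file is written before the VERSION file.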
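The way the two setFields methods assemble the VERSION properties can be sketched with java.util.Properties alone. VersionFieldsSketch and buildFields are hypothetical names; the property keys and the condition on the upgrade fields are taken from the code excerpts above, and the values in main are the ones quoted in the text.

```java
import java.util.Properties;

// Hypothetical sketch of assembling the VERSION properties.
final class VersionFieldsSketch {
    static Properties buildFields(int layoutVersion, String storageType, int namespaceId,
                                  long cTime, boolean upgradeState, int upgradeVersion) {
        Properties props = new Properties();
        // Fields set by the base class setFields excerpt.
        props.setProperty("layoutVersion", String.valueOf(layoutVersion));
        props.setProperty("storageType", storageType);
        props.setProperty("namespaceID", String.valueOf(namespaceId));
        props.setProperty("cTime", String.valueOf(cTime));
        // Upgrade fields are added only while a distributed upgrade to a
        // different layout version is in progress.
        if (upgradeState && upgradeVersion != layoutVersion) {
            props.setProperty("distributedUpgradeState", Boolean.toString(upgradeState));
            props.setProperty("distributedUpgradeVersion", Integer.toString(upgradeVersion));
        }
        return props;
    }

    public static void main(String[] args) {
        Properties fresh = buildFields(-41, "NAME_NODE", 64614865, 0L, false, 0);
        System.out.println("properties after a fresh format: " + fresh.size());
        System.out.println("has upgrade state: " + fresh.containsKey("distributedUpgradeState"));
    }
}
```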

Initializing the fstime file data

While the VERSION file is being initialized, the writeCheckpointTime(sd) method is called to write checkpointTime to the file ${dfs.name.dir}/current/fstime, as shown below:

void writeCheckpointTime(StorageDirectory sd) throws IOException {
  if (checkpointTime < 0L)
    return; // do not write negative time
  File timeFile = getImageFile(sd, NameNodeFile.TIME);
  DataOutputStream out = new DataOutputStream(new AtomicFileOutputStream(timeFile));
  try {
    out.writeLong(checkpointTime);
  } finally {
    out.close();
  }
}

Only the checkpoint time is written to the fstime file, as a single long value.
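A standalone sketch of writing and reading such a file follows. It is not the Hadoop implementation: CheckpointTimeSketch is a hypothetical class, and the write-to-temporary-file-then-rename step is only an assumption about what a stream named AtomicFileOutputStream is meant to achieve.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of persisting a checkpoint time as one long value.
final class CheckpointTimeSketch {
    // Writes the time to a temporary sibling file, then renames it into place
    // (assumed behaviour, so readers never see a half-written file).
    static void write(Path timeFile, long checkpointTime) throws IOException {
        if (checkpointTime < 0L) {
            return; // mirrors the "do not write negative time" check
        }
        Path tmp = timeFile.resolveSibling(timeFile.getFileName() + ".tmp");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
            out.writeLong(checkpointTime);
        }
        Files.move(tmp, timeFile, StandardCopyOption.REPLACE_EXISTING);
    }

    // Reads the single long value back.
    static long read(Path timeFile) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(timeFile))) {
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        Path fstime = Files.createTempDirectory("current").resolve("fstime");
        long now = System.currentTimeMillis();
        write(fstime, now);
        System.out.println("file size in bytes: " + Files.size(fstime));
        System.out.println("round trip ok: " + (read(fstime) == now));
    }
}
```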

A formatting example

Below, we configure Hadoop-1.2.1, run the HDFS format operation, and look at the resulting directory structure and data. First, Hadoop is configured.

After formatting, the following 4 files are generated in the /home/shirdrn/programs/hadoop/dfs/name/current directory:

edits
fsimage
fstime
VERSION

Of these 4 files, VERSION is actually a properties file whose content is human-readable text, as follows:

#Thu Apr 10 21:49:18 PDT 2014
namespaceID=1858400315
cTime=0
storageType=NAME_NODE
layoutVersion=-41

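Because this is the first format, cTime=0. The other files are written in binary form using Java serialization and are not human-readable text; for what is actually serialized, refer to the source code and the field information described above. (时延军, http://shiyanjun.cn)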
