Hive built-in md5 function: using the Hive md5 function

Reposted

mob64ca14092155 2024-05-14 15:36:57


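MD5 can be used to deduplicate files because it accepts input of any size and outputs a fixed-length hash value (128 bits). Two different files will, in practice, almost never produce the same MD5 hash by accident. Deliberately constructed MD5 collisions do exist, however, so this should not be relied on against malicious input.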

So if two files produce the same hash value, they can be treated as the same file.

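Here is a file upload example. We first compute the file's hash with MD5, then check whether a file with the same name already exists on disk. If it does, we skip the upload and return the access path directly; only if it does not do we upload.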

@Override
    public String uploadFile(MultipartFile file, String path) {
        try {
            // Compute the MD5 of the file content
            String md5 = FileUtils.getMd5(file.getInputStream());
            // Get the file extension
            String extName = FileUtils.getExtName(file.getOriginalFilename());
            // Build the new file name from the hash
            String fileName = md5 + extName;
            System.out.println("filename  "+fileName);
            // Check whether the file already exists
            if (!exists(path + fileName)) {
                // Not present yet, so upload it
                upload(path, fileName, file.getInputStream());
            }
            // Return the file access URL
            return getFileAccessUrl(path + fileName);
        } catch (Exception e) {
            e.printStackTrace();
            throw new BizException("File upload failed");
        }
    }
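The helpers `exists`, `upload` and `getFileAccessUrl` are not shown in the article. As a rough illustration of the same idea on the local file system, the following is a minimal sketch; the class `ContentAddressedStore` and its methods are hypothetical names, not part of the original project, and it assumes the extension is passed with its leading dot.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not the article's FileUtils.
final class ContentAddressedStore {

    // MD5 of the content as 32 lowercase hex characters.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            // %032x left-pads with zeros so leading zero bytes are not lost.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Writes the content to dir/<md5><extName> unless that file already exists.
    // extName is assumed to include the leading dot, e.g. ".txt".
    static Path store(Path dir, byte[] content, String extName) throws IOException {
        Path target = dir.resolve(md5Hex(content) + extName);
        if (!Files.exists(target)) {
            Files.write(target, content);
        }
        return target;
    }
}
```

Storing the same bytes twice returns the same path and leaves a single file in the directory. Note that the check and the write are two separate steps, so concurrent uploads of the same content could both perform the write.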

Let's first look at how the MD5 hash value is computed.

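public static String getMd5(InputStream inputStream) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("md5");
            // Read the stream in 8 KB chunks and feed each chunk to the digest
            byte[] buffer = new byte[8192];
            int length;
            while ((length = inputStream.read(buffer)) != -1) {
                md5.update(buffer, 0, length);
            }
            // Encode the 16-byte digest as hexadecimal characters
            return new String(Hex.encodeHex(md5.digest()));
        } catch (Exception e) {
            e.printStackTrace();
            // Returns null on failure, so callers need to handle that case
            return null;
        } finally {
            // Always close the stream passed in by the caller
            try {
                if (inputStream != null) {
                    inputStream.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }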

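This uses the MD5 implementation that ships with the JDK; an MD5 implementation from another library would work as well. The following is a rough outline of how the MD5 digest object is obtained. You can skip it without missing anything needed for the rest of the article.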

First we pass in the name of the algorithm we want:

MessageDigest.getInstance("md5");

MessageDigest.getInstance in turn calls this method:

GetInstance.Instance instance = GetInstance.getInstance("MessageDigest",
                MessageDigestSpi.class, algorithm);

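This first obtains the list of all installed algorithm providers, then looks up the service offered by the first provider in the list that supports the algorithm. If that service can supply an instance of the specified clazz, the Instance is returned directly; if not, the next provider that offers the algorithm is tried.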

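Here clazz is the MessageDigestSpi class. It defines the Service Provider Interface (SPI) for the MessageDigest class, which provides the functionality of a message digest algorithm such as MD5 or SHA. A message digest is a one-way hash function that accepts data of any size and outputs a fixed-length hash value.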

public static Instance getInstance(String type, Class<?> clazz,
            String algorithm) throws NoSuchAlgorithmException {
        // in the almost all cases, the first service will work
        // avoid taking long path if so
        ProviderList list = Providers.getProviderList();
        Service firstService = list.getService(type, algorithm);
        if (firstService == null) {
            throw new NoSuchAlgorithmException
                    (algorithm + " " + type + " not available");
        }
        NoSuchAlgorithmException failure;
        try {
            return getInstance(firstService, clazz);
        } catch (NoSuchAlgorithmException e) {
            failure = e;
        }
        
        for (Service s : list.getServices(type, algorithm)) {
            if (s == firstService) {
                // do not retry initial failed service
                continue;
            }
            try {
                return getInstance(s, clazz);
            } catch (NoSuchAlgorithmException e) {
                failure = e;
            }
        }
        throw failure;
    }
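To see this lookup from the outside, the sketch below lists the installed providers that offer a given digest algorithm and reports which one `MessageDigest.getInstance` picked. It uses only the public `java.security.Security` and `MessageDigest` APIs; the class name `ProviderLookup` and its methods are made up for this illustration, and the provider names printed depend on the JDK in use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class ProviderLookup {

    // Names of installed providers that offer the given MessageDigest
    // algorithm. Empty if no provider offers it.
    static List<String> providersFor(String algorithm) {
        Provider[] providers = Security.getProviders("MessageDigest." + algorithm);
        List<String> names = new ArrayList<>();
        if (providers != null) { // getProviders returns null when nothing matches
            for (Provider p : providers) {
                names.add(p.getName());
            }
        }
        return names;
    }

    // Name of the provider that MessageDigest.getInstance actually selected.
    static String selectedProvider(String algorithm) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).getProvider().getName();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(providersFor("MD5"));
        System.out.println(selectedProvider("MD5"));
    }
}
```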

In the end we get back the MessageDigest for the algorithm we asked for. Besides MD5, other algorithms such as SHA-1 are available.

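MessageDigest methods:

getInstance: obtains a MessageDigest object for the named algorithm
update: feeds data into the digest
digest: completes the computation and returns the result as a byte array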
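The three methods above can be sketched as follows. The class `DigestSteps` and its helpers are hypothetical names used only for this example; it also shows that feeding the data in several `update` calls gives the same digest as feeding it all at once.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class DigestSteps {

    // Two lowercase hex digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Hashes the UTF-8 bytes of the chunks, fed to the digest one by one.
    static String md5InChunks(String... chunks) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");   // 1. getInstance
            for (String chunk : chunks) {
                md.update(chunk.getBytes(StandardCharsets.UTF_8)); // 2. update
            }
            return toHex(md.digest());                             // 3. digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5InChunks("hello"));     // 5d41402abc4b2a76b9719d911017c592
        System.out.println(md5InChunks("hel", "lo")); // same value
    }
}
```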

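When checking for duplicate files, the last step is to encode the byte array returned by MD5 as an array of hexadecimal characters and convert that character array into a string, which becomes the file name. If the same file is uploaded again later, we check whether a file with that name already exists.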

Because one byte needs two hexadecimal digits to represent it, the character array is twice the size of the byte array.

public static String encodeHex(byte[] byteArray) {
      // Lookup table holding the 16 hexadecimal digits
      char[] hexDigits = {'0','1','2','3','4','5','6','7','8','9', 'A','B','C','D','E','F' };

      // Character array that will form the result string. One byte is 8 bits,
      // which is 2 hexadecimal digits (2 to the 8th equals 16 squared).
      char[] resultCharArray = new char[byteArray.length * 2];

      // Walk the byte array and convert each byte to two characters using bit operations
      int index = 0;
      for (byte b : byteArray) {
         resultCharArray[index++] = hexDigits[b >>> 4 & 0xf]; // high 4 bits
         resultCharArray[index++] = hexDigits[b & 0xf];       // low 4 bits
      }

      // Build the result string from the character array
      return new String(resultCharArray);
}
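Java bytes are signed, so a byte such as 0xFF is sign-extended to a negative int when widened. The expression in the method above still works because the shift is applied before the `& 0xf` mask. The sketch below is an equivalent formulation that masks to an unsigned value first, with a small worked input; `HexNibbles` is a hypothetical name used only here.

```java
// Hypothetical helper for illustration only.
final class HexNibbles {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String encode(byte[] bytes) {
        char[] out = new char[bytes.length * 2]; // two hex digits per byte
        int i = 0;
        for (byte b : bytes) {
            int unsigned = b & 0xff;          // 0..255, no sign extension
            out[i++] = HEX[unsigned >>> 4];   // high nibble
            out[i++] = HEX[unsigned & 0x0f];  // low nibble
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Three bytes in, six characters out.
        System.out.println(encode(new byte[] {(byte) 0xFF, 0x0A, 0x00})); // FF0A00
    }
}
```

A 16-byte MD5 digest therefore always encodes to a 32-character string.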

To compute the MD5 hash of a file, we can use a plain InputStream like this:

MessageDigest md5 = MessageDigest.getInstance("md5");
byte[] buffer = new byte[8192];
int length;
// Feed each chunk that was read into the digest
while ((length = inputStream.read(buffer)) != -1) {
    md5.update(buffer, 0, length);
}
byte[] resultByteArray = md5.digest();

Alternatively, we can use DigestInputStream:

MessageDigest messageDigest = MessageDigest.getInstance("MD5");
// Wrap the input stream in a DigestInputStream
DigestInputStream digestInputStream = new DigestInputStream(inputStream, messageDigest);

// The digest is updated as a side effect of reading; read until end of file.
// The buffer size of 8192 matches the earlier example.
byte[] buffer = new byte[8192];
while (digestInputStream.read(buffer) != -1);

// Get the final MessageDigest
messageDigest = digestInputStream.getMessageDigest();
byte[] resultByteArray = messageDigest.digest();
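Putting the DigestInputStream variant together with stream cleanup and hex encoding, a minimal self-contained sketch could look like the following. The class `StreamMd5` and the 8192-byte buffer are choices made for this example, not something the article prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class StreamMd5 {

    // Reads the stream to the end and returns its MD5 as 32 lowercase hex characters.
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        // try-with-resources closes the wrapper and the underlying stream.
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buffer = new byte[8192];
            while (dis.read(buffer) != -1) {
                // nothing to do: reading updates the digest as a side effect
            }
        }
        // %032x left-pads with zeros so leading zero bytes are kept.
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(new ByteArrayInputStream(data)));
        // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```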
