I recently wanted to learn Hive and found that setting up the environment took a surprisingly long time, even though in practice most people never need to build it themselves at work. So I am sharing the virtual machine I have already set up (Java & Hadoop & MySQL & Hive) for others to use directly, along with my notes on the setup process below.

Because of upload size limits on the network drive, the files are provided as a split (multi-volume) archive. The OVF directory contains an exported VM image, for which the network adapter must be reconfigured; the VirtualBox_VMs and Virtual_Machines directories contain the complete working directories of the Linux VM created in VirtualBox and VMware Workstation respectively, which should not need any network configuration. All passwords in the system are Hadoop. Hadoop runs in pseudo-distributed mode, so the archive contains only one virtual machine. The VM was built with VMware Workstation 16 and should, in principle, load in both VMware Workstation and Oracle VM VirtualBox.


  • Environment --> Ubuntu 20.04

Software    Version
Java        1.8
Hadoop      2.7.1
MySQL       8.0
Hive        2.3.8


  • Setup process

Table of Contents

  • Arrow keys producing garbage characters
  • Install JDK 1.8
  • Download & install
  • Configure the environment
  • Test the Java environment
  • Hadoop
  • Create a hadoop user (optional)
  • Configure passwordless SSH login
  • Download & install Hadoop
  • Configure Hadoop
  • 1. Configure hadoop-env.sh
  • 2. Configure core-site.xml
  • 3. Configure hdfs-site.xml
  • 4. Configure mapred-site.xml
  • 5. Configure yarn-site.xml
  • Start Hadoop
  • Hive
  • Install Hive 2.3.8
  • Start Hive
  • Initialize the default Derby database (skip this step if using MySQL)
  • Connect to a MySQL (8.0) database

Arrow keys producing garbage characters

sudo gedit /etc/vim/vimrc.tiny
  • Add the following settings
    set nocompatible
    set backspace=2

Install JDK 1.8

Download & install

  • Download, as the root user --> # <--, in the directory /opt
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie"  http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
  • Extract, as the root user --> # <--, in the directory /opt
tar -zxvf jdk-8u131-linux-x64.tar.gz    # extract
mv jdk1.8.0_131/ jdk                    # rename (the tarball extracts to jdk1.8.0_131)

Configure the environment

  • Edit /etc/profile, as the root user --> # <--
vi /etc/profile
  • Add the following settings
export JAVA_HOME=/opt/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export JAVA_PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin
export PATH=$PATH:${JAVA_PATH}
  • Apply the environment variables, as the root user --> # <--
source /etc/profile

Test the Java environment

java -version
java version "1.8.0_131"
 Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
 Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Hadoop

Create a hadoop user (optional)

The later steps (passwordless login, granting permissions, and so on) all use the user name hadoop as the example, so if you work as a different user, remember to substitute the corresponding user name when permissions are granted later.

sudo useradd -m hadoop -s /bin/bash # create a user named hadoop
sudo passwd hadoop                  # set the hadoop user's password
sudo adduser hadoop sudo            # grant the hadoop user administrator (sudo) rights
sudo chown -R hadoop /opt           # give the hadoop user read/write access to /opt
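  • A quick check that the account and sudo rights are in place (my addition; the expected output assumes a default Ubuntu setup)
id hadoop                           # should list the hadoop and sudo groups
sudo -l -U hadoop                   # should show that hadoop may run commands as root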

Configure passwordless SSH login

  • Install the ssh server
sudo apt-get install openssh-server
  • Test logging in to localhost, as the hadoop user --> $ <--; at this point a password should still be required
ssh localhost
exit

logout
Connection to localhost closed.

  • Generate an SSH key pair, as the hadoop user --> $ <--
cd ~/.ssh/
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
  • Test logging in to localhost again, as the hadoop user --> $ <--; this time the login should succeed without a password
ssh localhost
exit
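  • If the second login still prompts for a password, a common cause is overly permissive file modes on ~/.ssh; tightening them as below is a suggested fix, not part of the original steps
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys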

Download & install Hadoop

wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -zxvf hadoop-2.7.1.tar.gz       # extract
mv hadoop-2.7.1/ hadoop             # rename
rm -f hadoop-2.7.1.tar.gz           # delete the downloaded archive
chown -R hadoop ./hadoop            # change the directory ownership
  • Create working directories, as the hadoop user --> $ <--
mkdir /opt/hadoop/tmp               # create directory
mkdir /opt/hadoop/hdfs
mkdir /opt/hadoop/hdfs/data
mkdir /opt/hadoop/hdfs/name
  • If the directories were created as the root user, grant the hadoop user read/write permission on them
    chown -R hadoop /opt/hadoop
  • Set environment variables, as the hadoop user --> $ <-- (they take effect only for the hadoop user)
vi ~/.bash_profile
  • Add the following configuration
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
  • Apply the environment variables, as the hadoop user --> $ <--
source ~/.bash_profile
  • Check that the environment variables work
hadoop version
Hadoop 2.7.1
 Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
 Compiled by jenkins on 2015-06-29T06:04Z
 Compiled with protoc 2.5.0
 From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
 This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.1.jar

Configure Hadoop

All of the following steps can be performed as the hadoop user --> $ <--

1. Configure hadoop-env.sh

vi /opt/hadoop/etc/hadoop/hadoop-env.sh

Change export JAVA_HOME=${JAVA_HOME} to the absolute path of the JDK: export JAVA_HOME=/opt/jdk

You can leave this unset, but starting the NameNode then sometimes fails with
Error: JAVA_HOME is not set and could not be found. so it is better to set it here first.
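If you prefer to make the change non-interactively, here is a small sed sketch (my addition; it assumes the stock export JAVA_HOME=${JAVA_HOME} line is still present and unmodified)

cp /opt/hadoop/etc/hadoop/hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh.bak    # keep a backup
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/opt/jdk|' /opt/hadoop/etc/hadoop/hadoop-env.sh
grep '^export JAVA_HOME' /opt/hadoop/etc/hadoop/hadoop-env.sh                       # should now print /opt/jdk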

2. Configure core-site.xml

vi /opt/hadoop/etc/hadoop/core-site.xml

Add the following content

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
        <description>HDFS URI: filesystem://namenode-host:port</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
        <description>Local temporary directory for Hadoop on the namenode</description>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
</configuration>

3. Configure hdfs-site.xml

vi /opt/hadoop/etc/hadoop/hdfs-site.xml

Add the following content

<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/opt/hadoop/hdfs/name</value>
        <description>Where the namenode stores HDFS namespace metadata</description>
    </property>

    <property>
        <name>dfs.data.dir</name>
        <value>/opt/hadoop/hdfs/data</value>
        <description>Where datanodes physically store data blocks</description>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Replication factor; the default is 3, and it should not exceed the number of datanodes</description>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
</configuration>

4. Configure mapred-site.xml

vi /opt/hadoop/etc/hadoop/mapred-site.xml

Add the following content

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. Configure yarn-site.xml

vi /opt/hadoop/etc/hadoop/yarn-site.xml

Add the following content

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Start Hadoop

  • Format the NameNode
/opt/hadoop/bin/hdfs namenode -format

21/06/05 07:41:01 INFO common.Storage: Storage directory /opt/hadoop/hdfs/name has been successfully formatted.
21/06/05 07:41:01 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
21/06/05 07:41:01 INFO util.ExitUtil: Exiting with status 0
21/06/05 07:41:01 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

  • Start HDFS (start-dfs.sh brings up the NameNode, DataNode, and SecondaryNameNode)
/opt/hadoop/sbin/start-dfs.sh
Starting namenodes on [localhost]
 localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-ubuntu.out
 localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-ubuntu.out
 Starting secondary namenodes [0.0.0.0]
 0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-hadoop-secondarynamenode-ubuntu.out
  • Start YARN
/opt/hadoop/sbin/start-yarn.sh
starting yarn daemons
 starting resourcemanager, logging to /opt/hadoop/logs/yarn-hadoop-resourcemanager-ubuntu.out
 localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-hadoop-nodemanager-ubuntu.out
  • Check that all daemons are running
jps
49121 NodeManager
 49329 Jps
 48546 DataNode
 48995 ResourceManager
 48730 SecondaryNameNode
 48395 NameNode
  • In a browser, open http://localhost:50070 to view information about the NameNode, DataNodes, and HDFS.
  • Open http://localhost:8088 to view the status of running jobs.
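  • An optional smoke test (my addition, following the standard pseudo-distributed example): run one of the bundled MapReduce examples to confirm that HDFS and YARN work end to end; the jar path matches the Hadoop 2.7.1 layout used above
/opt/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop/input
/opt/hadoop/bin/hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml /user/hadoop/input
/opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'
/opt/hadoop/bin/hdfs dfs -cat /user/hadoop/output/*    # a few matched lines mean the job ran successfully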

Hive

Install Hive 2.3.8

wget http://archive.apache.org/dist/hive/hive-2.3.8/apache-hive-2.3.8-bin.tar.gz
tar -zxvf apache-hive-2.3.8-bin.tar.gz  # extract
mv apache-hive-2.3.8-bin/ hive          # rename
rm -f apache-hive-2.3.8-bin.tar.gz      # delete the downloaded tarball
chown -R hadoop /opt/hive               # give the hadoop user read/write access
  • Configure Hive, as the hadoop user --> $ <--
mv /opt/hive/conf/hive-env.sh.template /opt/hive/conf/hive-env.sh
vi /opt/hive/conf/hive-env.sh
  • Append the following two lines, i.e. the Hadoop path and the Hive configuration directory
    export HADOOP_HOME=/opt/hadoop
    export HIVE_CONF_DIR=/opt/hive/conf
  • Start Hadoop; make sure Hadoop is running before continuing
/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh
  • Create the required directories in HDFS and grant permissions (this step must be run after Hadoop has been started)
/opt/hadoop/bin/hadoop fs -mkdir /tmp
/opt/hadoop/bin/hadoop fs -mkdir -p /user/hive/warehouse
/opt/hadoop/bin/hadoop fs -chmod g+w /tmp
/opt/hadoop/bin/hadoop fs -chmod g+w /user/hive/warehouse
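  • A quick way to confirm the directories exist with group write permission (a suggested check, not in the original notes)
/opt/hadoop/bin/hadoop fs -ls -d /tmp /user/hive/warehouse    # the group permission bits should include w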

Start Hive

Initialize the default Derby database (skip this step if using MySQL)

/opt/hive/bin/schematool -initSchema -dbType derby
/opt/hive/bin/hive
SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
 SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
 Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.8.jar!/hive-log4j2.properties Async: true
 Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 hive>
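  • A minimal sanity check I like to run at this point (not in the original notes): create a throwaway table, list it, and drop it; smoke_test is an arbitrary name. With Derby, run this from the same directory where schematool was executed, since the metastore_db directory is created in the current working directory
/opt/hive/bin/hive -e "CREATE TABLE smoke_test (id INT); SHOW TABLES; DROP TABLE smoke_test;"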

Connect to a MySQL (8.0) database

  • Install MySQL
wget https://dev.mysql.com/get/mysql-apt-config_0.8.17-1_all.deb
sudo dpkg -i mysql-apt-config_0.8.17-1_all.deb

In the package configuration dialog, select (leave the other options at OK):
MySQL Server & Cluster (Currently selected: mysql 8.0) --> mysql-8.0

sudo apt update
sudo apt install mysql-server

When asked which default authentication method to use, choose:
Use Legacy Authentication Method (Retain MySQL 5.x ...
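  • Before continuing, it may be worth confirming that the server is running and that the root password set during installation works (a suggested check; enter whatever root password you chose)
systemctl status mysql --no-pager           # should report active (running)
mysql -u root -p -e "SELECT VERSION();"     # should print a MySQL 8.0.x version string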

  • Configure the Metastore to use MySQL
vi /opt/hive/conf/hive-site.xml

Add the following content

<configuration>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://localhost:3306/hive?useUnicode=true&amp;characterEncoding=utf-8&amp;useSSL=false&amp;serverTimezone=GMT&amp;createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.cj.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>root</value>
          <description>username to use against metastore database</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>hadoop</value>
          <description>password to use against metastore database</description>
        </property>
</configuration>
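  • Two optional checks before running schematool (my addition): verify the XML is well-formed (an unescaped & in the connection URL is a common cause of parse errors, hence the &amp; entities above), and check that the credentials in hive-site.xml can actually reach MySQL
sudo apt install libxml2-utils                  # provides xmllint, if it is not already installed
xmllint --noout /opt/hive/conf/hive-site.xml    # no output means the file parses cleanly
mysql -u root -p -e "SELECT 1;"                 # log in with the password configured in hive-site.xml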
  • Install the MySQL JDBC driver, in the /opt/hive directory
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.11.tar.gz
tar -zxvf mysql-connector-java-8.0.11.tar.gz
mv /opt/hive/mysql-connector-java-8.0.11/mysql-connector-java-8.0.11.jar /opt/hive/lib/mysql-connector-java-8.0.11.jar
rm -f /opt/hive/mysql-connector-java-8.0.11.tar.gz
rm -rf /opt/hive/mysql-connector-java-8.0.11
  • Initialize the metastore schema, in the /opt/hive/bin directory
./schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
 SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
 Metastore connection URL: jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=utf-8&useSSL=false&serverTimezone=GMT&createDatabaseIfNotExist=true
 Metastore Connection Driver : com.mysql.cj.jdbc.Driver
 Metastore connection User: root
 Starting metastore schema initialization to 2.3.0
 Initialization script hive-schema-2.3.0.mysql.sql
 Initialization script completed
 schemaTool completed
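  • To confirm the initialization, the metastore tables should now exist in the hive database in MySQL (a suggested check, not in the original notes)
mysql -u root -p -e "USE hive; SHOW TABLES;"    # should list tables such as DBS, TBLS and VERSION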
  • Set environment variables, as the hadoop user --> $ <-- (they take effect only for the hadoop user)
vi ~/.bash_profile
  • Add the following configuration
    export HIVE_HOME=/opt/hive
    export PATH=$PATH:$HIVE_HOME/bin
  • Apply the environment variables, as the hadoop user --> $ <--
source ~/.bash_profile
  • Start Hive
hive
SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
 SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
 Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.8.jar!/hive-log4j2.properties Async: true
 Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 hive>
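  • As a final end-to-end check (my addition), create a table in Hive and confirm that it appears in the MySQL metastore; test_tbl is an arbitrary name
hive -e "CREATE TABLE test_tbl (id INT, name STRING);"
mysql -u root -p -e "USE hive; SELECT TBL_NAME, TBL_TYPE FROM TBLS;"    # should list test_tbl as MANAGED_TABLE
hive -e "DROP TABLE test_tbl;"                                          # clean up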