Using Spark 3.0 SQL to Work with Hive from IDEA

Prerequisite: HDFS and Hadoop YARN are already installed according to the cluster plan used throughout this walkthrough (hadoop1 runs HDFS, hadoop2 runs YARN, and hadoop3 will host MySQL and Hive; see section 2.6).


1. Installing Hive

Pick one node for the Hive components; here hadoop3 is used.

1.1 Install MySQL

# CentOS 7.6 ships with MariaDB (a fork of MySQL), which conflicts with the MySQL we are about to install, so remove it first
# Check whether mariadb is installed
rpm -aq | grep mariadb
# Remove mariadb. -e removes the given package; --nodeps skips the dependency check
rpm -e --nodeps mariadb-libs

# Install dependencies
yum install perl -y
yum install net-tools -y

# Extract the bundle
tar xvf mysql-5.7.26-1.el7.x86_64.rpm-bundle.tar
# Install the RPMs in this order
rpm -ivh mysql-community-common-5.7.26-1.el7.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.26-1.el7.x86_64.rpm
rpm -ivh mysql-community-client-5.7.26-1.el7.x86_64.rpm
rpm -ivh mysql-community-server-5.7.26-1.el7.x86_64.rpm

# Start the MySQL service
systemctl start mysqld

# Find the temporary root password
grep password /var/log/mysqld.log


# Log in to MySQL with the temporary password found above
mysql -u root -p
# Relax the password policy, set the root password to 12345678, then flush privileges
set global validate_password_policy=0;
set password for 'root'@'localhost' =password('12345678');
flush privileges;

# Create the hive user
# Create the user with a password, grant privileges, flush
CREATE USER 'hive'@'%' IDENTIFIED BY '12345678';
GRANT ALL ON *.* TO 'hive'@'%';
FLUSH PRIVILEGES;
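
A quick way to confirm the grants took effect is to log in as the new user and list databases (a sketch; hadoop3 is simply the MySQL host in this setup):

# Connect as the hive user from any node that can reach hadoop3
mysql -h hadoop3 -u hive -p12345678 -e "show databases;"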

1.2 Install Hive

(1) Download the Hive binary package and configure it
Download apache-hive-2.3.9-bin.tar.gz from http://archive.apache.org/dist/hive/hive-2.3.9/

If the download from the Apache archive is slow, a domestic mirror can be used instead.

cd /opt/cwc/software
tar zxvf apache-hive-2.3.9-bin.tar.gz -C ../servers/
cd ../servers
mv apache-hive-2.3.9-bin hive-2.3.9

# Update environment variables
# Add the following to /etc/profile
export HIVE_HOME=/opt/cwc/servers/hive-2.3.9
export PATH=$PATH:$HIVE_HOME/bin
# Apply the changes
source /etc/profile
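
To check that the environment variables are picked up, hive should now resolve from the PATH (this only prints the version and does not need the metastore yet):

echo $HIVE_HOME
hive --version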

(2) Create hive-site.xml with the following configuration

# Create hive-site.xml under $HIVE_HOME/conf
cd $HIVE_HOME/conf 

vim hive-site.xml

Add the following content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Where Hive metadata is stored -->
    <!-- Note the JDBC connection string: without useSSL=false there will be many warnings; in an XML file & must be written as &amp; -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <!-- JDBC driver class -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <!-- User name for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <!-- Password for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>12345678</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <!-- Default data storage location (on HDFS) -->
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <!-- Show the current database in the CLI prompt -->
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
    <property>
        <!-- Print column headers in query results -->
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
</configuration>

(3) Copy the MySQL JDBC driver and initialize the Hive metastore

Copy mysql-connector-java-5.1.46.jar into $HIVE_HOME/lib

# Initialize the metastore schema
schematool -dbType mysql -initSchema
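
If the initialization succeeded, schematool will have created the metastore tables inside the hivemetadata database named in the JDBC URL; a quick check (the table names mentioned are the usual Hive 2.x metastore tables):

mysql -u hive -p12345678 -e "use hivemetadata; show tables;"
# Expect tables such as DBS, TBLS and VERSION in the output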

# Start HDFS and YARN before starting the Hive service
[root@hadoop3 ~]# hive
hive> show functions;
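
A slightly stronger smoke test than show functions is to create a throw-away database and table, which also verifies that Hive can write to the warehouse directory on HDFS (the names here are arbitrary):

hive> create database if not exists test;
hive> create table if not exists test.t1 (id int);
hive> show tables in test;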

sz: sends the selected file from the server to the local machine
rz: opens a file-selection window to upload a local file to the Linux server

yum install lrzsz

2. Connecting to Hive from IDEA

2.1 Sample code

import org.apache.spark.sql.SparkSession

object hive_read {
  def main(args: Array[String]): Unit = {

    val ss = SparkSession.builder()
      .master("local[2]")
      .appName("the test of SparkSession")
      .enableHiveSupport()
      .getOrCreate()
    // Check that the connection to the Hive metastore works
    val df = ss.sql("show databases")
    // Check that Spark SQL can actually read a Hive table
    //val df = ss.sql("select count(1) from biods.ods_start_log where dt='20210721'")
    df.show
    // Keep the application alive for a while so the Spark UI can be inspected
    Thread.sleep(1000000)
    ss.stop()
  }
}
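
If the read test works, a minimal write test might look like the sketch below; it assumes the biods database from section 2.5 already exists, and demo_spark_tbl is just a placeholder table name. If the write fails with an HDFS permission error, see section 2.4.1.

import org.apache.spark.sql.SparkSession

object hive_write {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder()
      .master("local[2]")
      .appName("hive write test")
      .enableHiveSupport()
      .getOrCreate()
    import ss.implicits._

    // Build a tiny DataFrame and save it as a managed Hive table (placeholder name)
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df.write.mode("overwrite").saveAsTable("biods.demo_spark_tbl")

    // Read it back through Spark SQL
    ss.sql("select * from biods.demo_spark_tbl").show()
    ss.stop()
  }
}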

2.2 pom dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>lagou-spark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.12.10</scala.version>
        <spark.version>3.0.1</spark.version>
        <hadoop.version>2.9.2</hadoop.version>
        <encoding>UTF-8</encoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.46</version>
        </dependency>
    </dependencies>
    <build>
        <pluginManagement>
            <plugins>
                <!-- Plugin that compiles Scala sources -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- Plugin that compiles Java sources -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Plugin that builds the shaded jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

2.3 Put hive-site.xml into the resources directory

Copy the hive-site.xml configured in section 1.2 into src/main/resources so that Spark, with enableHiveSupport(), can find the metastore connection on the classpath; a minimal version is sketched below.
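
Since this setup connects straight to the metastore database (there is no separate metastore service), the copy in resources only really needs the connection properties; a minimal sketch reusing the values from section 1.2:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>12345678</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>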

2.4 Problems encountered

2.4.1 Permission denied on a directory

Error log:

org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=CaiWencheng, access=WRITE, inode="/user/hive/warehouse/temp.db":root:supergroup:drwxr-xr-x

Fix: grant write permission on the directory

hadoop fs -chmod -R 777 /user/hive/warehouse/temp.db
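
An alternative to opening up the directory is to make the Spark job act as an HDFS user that already owns the warehouse path; a sketch (here "root" is simply the owner shown in the error above):

// Must run before SparkSession.builder() so the Hadoop client picks it up
System.setProperty("HADOOP_USER_NAME", "root")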

2.4.2 MySQL driver not found

Error log:

org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.

Fix: add the driver dependency to the pom

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.46</version>
</dependency>

2.5 Creating the Hive tables

-- Create the database
create database biods;

-- Create an external ORC table
create external table biods.ods_start_log
(
`str` string
)
comment 'user start-up log'
partitioned by (`dt` string)
stored as orc
location '/bi/ods/ods_start_log';

-- Create an external textfile table
create external table biods.txt_ods_start_log
(
`str` string
)
comment 'user start-up log'
partitioned by (`dt` string)
stored as textfile
location '/bi/ods/txt_ods_start_log';
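
For the count query in section 2.1 to return data, the dt='20210721' partition has to exist and contain files; a sketch (the HDFS and local paths below are only examples):

-- Attach an existing HDFS directory as a partition of the external ORC table
alter table biods.ods_start_log
  add if not exists partition (dt='20210721')
  location '/bi/ods/ods_start_log/dt=20210721';

-- Or load a local text file into the textfile table
load data local inpath '/opt/cwc/data/start_log.txt'
  into table biods.txt_ods_start_log partition (dt='20210721');

select count(1) from biods.ods_start_log where dt='20210721';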

2.6 Starting the environment

(1) Start HDFS

On the hadoop1 node, run start-dfs.sh

(2) Start YARN

On the hadoop2 node, run start-yarn.sh

(3) Start Hive

On the hadoop3 node, run hive
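
A quick way to confirm the processes are up before running the IDEA job (which daemons appear on which node depends on your cluster plan):

# Run on each node
jps
# hadoop1 should show NameNode / DataNode, hadoop2 ResourceManager / NodeManager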