Component versions

| Component | Version | Download |
| --- | --- | --- |
| Hadoop | 2.7.7 | hadoop 2.7.7 |
| JDK | 1.8 | jdk 8 |
| MySQL | 5.7 | MySQL 5.7 |
| Hive | 2.3.4 | Hive 2.3.4 |
| Spark | 2.1.1 | Spark 2.1.1 |

**Machine environment**

| IP | Hostname | Password |
| --- | --- | --- |
| 192.168.222.201 | master | password |
| 192.168.222.202 | slave1 | password |
| 192.168.222.203 | slave2 | password |

pom.xml

<properties>
  <maven.compiler.source>8</maven.compiler.source>
  <maven.compiler.target>8</maven.compiler.target>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <!--scala version-->
  <scala.version>2.11.0</scala.version>
  <!-- spark version-->
  <spark.version>2.1.1</spark.version>
  <scala.binary.version>2.11</scala.binary.version>
  <!-- hadoop version-->
  <hadoop.version>2.7.7</hadoop.version>
  <!--log4j version-->
  <log4j.version>1.2.17</log4j.version>
  <!--hive-->
  <hive.version>2.3.4</hive.version>
  <!--flink-->
  <flink.version>1.10.2</flink.version>
  <mysql.connect.version>5.1.17</mysql.connect.version>
</properties>
<dependencies>
  <!--scala -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-reflect</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-compiler</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <!-- Spark dependencies -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <!-- Hadoop dependencies -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-api</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <!-- Hive -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
  </dependency>
  <!--Flink (java API)-->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <!--Flink (scala API)-->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <!-- MySQL driver -->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql.connect.version}</version>
  </dependency>
  <!-- Hive JDBC driver -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.scala-tools</groupId>
      <artifactId>maven-scala-plugin</artifactId>
      <version>2.15.2</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.19</version>
      <configuration>
        <skip>true</skip>
      </configuration>
    </plugin>
  </plugins>
</build>

Code:

import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode, SparkSession}

import java.util.Properties

object ReadMysqlDemo {
  def main(args: Array[String]): Unit = {
    val warehouse = "C:\\Users\\smy\\IdeaProjects\\SparkSQLDemo\\warehouse"
    val sparkSession: SparkSession = SparkSession.builder()
      .appName("read mysql data") // application name
      .master("local[*]") // run locally, using all available cores
      .config("spark.sql.warehouse.dir", warehouse) // location of the Spark SQL / Hive warehouse directory
      .getOrCreate()
    // Approach 1: generic DataFrameReader with format("jdbc")
    val dataFrame: DataFrame = sparkSession
      .read // read data
      .format("jdbc") // JDBC data source
      .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8") // MySQL JDBC URL
      .option("dbtable", "food") // table name
      .option("user", "root") // database user
      .option("password", "password") // database password
      .load() // load the data
    // show the first five rows
    dataFrame.show(5)
    // Approach 2: read.jdbc() with a Properties object
    val properties = new Properties()
    properties.put("user", "root") // database user
    properties.put("password", "password") // database password
    //    val jdbc_url = "jdbc:mysql://192.168.222.201:3306/datasource?user=root&password=password&useUnicode=true&characterEncoding=utf-8"
    val jdbc_url = "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8"
    val dataFrame1: DataFrame = sparkSession.read.jdbc(jdbc_url, "food", properties)
    // show the first five rows
    dataFrame1.show(5)
    // check that Chinese column names and values are handled correctly
    dataFrame.filter("`主营类型`='甜点饮品'").filter("`店铺评分`>0.3").show(6)
    val result: Dataset[Row] = dataFrame1.filter("`主营类型`='甜点饮品'")
      .filter("`店铺评分`>0.3").select("商家名称", "店铺评分")
    result.show(5)
    // Write approach 1: generic DataFrameWriter with format("jdbc")
    result.write.mode(SaveMode.Overwrite)
      .format("jdbc")
      .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8") // MySQL JDBC URL
      .option("dbtable", "test1") // target table name
      .option("user", "root") // database user
      .option("password", "password") // database password
      .save()
    // Write approach 2: DataFrameWriter.jdbc()
//    result.write.mode(SaveMode.Overwrite).jdbc(jdbc_url, "test1", properties)
  }
}

JDBC data source options:

| Option | Description | Example / Default |
| --- | --- | --- |
| url | The JDBC URL to connect to | jdbc:mysql://localhost:3306/datasource?user=root&password=password&useUnicode=true&characterEncoding=utf-8 |
| dbtable | The JDBC table to read; anything that is valid in a FROM clause of a SQL query can be used | A parenthesized subquery can be used instead of a full table |
| driver | The class name of the JDBC driver used to connect to this URL | com.mysql.jdbc.Driver |
| fetchsize | The JDBC fetch size, which determines how many rows to fetch per round trip; applies to reads | By default all rows are fetched |
| batchsize | The JDBC batch size, which determines how many rows to insert per round trip; applies only to writes | Defaults to 1000 |
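
A minimal sketch of how these options fit together when reading, assuming the same food table and MySQL instance as above; the subquery alias t, the fetchsize of 1000, and the explicit driver class are illustrative choices, not taken from the example above:

import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc options sketch")
      .master("local[*]")
      .getOrCreate()

    // "dbtable" accepts a parenthesized subquery instead of a full table;
    // the alias "t" and the WHERE clause are illustrative only.
    val subquery = "(SELECT `商家名称`, `店铺评分` FROM food WHERE `店铺评分` > 0.3) t"

    val df: DataFrame = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8")
      .option("driver", "com.mysql.jdbc.Driver") // driver class for MySQL Connector/J 5.x
      .option("dbtable", subquery)
      .option("user", "root")
      .option("password", "password")
      .option("fetchsize", "1000") // rows fetched per round trip when reading
      .load()

    df.show(5)
    spark.stop()
  }
}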

SaveMode options:

| Save mode | Description |
| --- | --- |
| SaveMode.ErrorIfExists (default) | When saving a DataFrame to a data source, an exception is expected to be thrown if data already exists. |
| SaveMode.Append | When saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data. |
| SaveMode.Overwrite | When saving a DataFrame to a data source, if the data/table already exists, the existing data is expected to be overwritten by the contents of the DataFrame. |
| SaveMode.Ignore | When saving a DataFrame to a data source, if data already exists, the save operation is expected not to save the contents of the DataFrame and not to change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
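
A minimal sketch of how the save modes are passed when writing back over JDBC, reusing the connection settings above; the target table name test2 is an illustrative assumption:

import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save mode sketch")
      .master("local[*]")
      .getOrCreate()

    val properties = new Properties()
    properties.put("user", "root")
    properties.put("password", "password")
    val jdbcUrl = "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8"

    val df = spark.read.jdbc(jdbcUrl, "food", properties)

    // Append: keep existing rows in the target table and add the new ones.
    df.write.mode(SaveMode.Append).jdbc(jdbcUrl, "test2", properties)

    // Ignore: do nothing if the target table already exists,
    // similar to CREATE TABLE IF NOT EXISTS in SQL.
    df.write.mode(SaveMode.Ignore).jdbc(jdbcUrl, "test2", properties)

    spark.stop()
  }
}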