Component versions

| Component | Version | Download |
|---|---|---|
| Hadoop | 2.7.7 | hadoop 2.7.7 |
| JDK | 1.8 | jdk 8 |
| MySQL | 5.7 | MySQL 5.7 |
| Hive | 2.3.4 | Hive 2.3.4 |
| Spark | 2.1.1 | Spark 2.1.1 |

**Machine environment**

| IP | Hostname | Password |
|---|---|---|
| 192.168.222.201 | master | password |
| 192.168.222.202 | slave1 | password |
| 192.168.222.203 | slave2 | password |
pom file (`pom.xml`):

```xml
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!--scala version-->
<scala.version>2.11.0</scala.version>
<!-- spark version-->
<spark.version>2.1.1</spark.version>
<scala.binary.version>2.11</scala.binary.version>
<!-- hadoop version-->
<hadoop.version>2.7.7</hadoop.version>
<!--log4j version-->
<log4j.version>1.2.17</log4j.version>
<!--hive-->
<hive.version>2.3.4</hive.version>
<!--flink-->
<flink.version>1.10.2</flink.version>
<mysql.connect.version>5.1.17</mysql.connect.version>
</properties>
<dependencies>
<!--scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark dependencies -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Hadoop dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-api</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- Hive -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.version}</version>
</dependency>
<!--Flink (java API)-->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<!--Flink (scala API)-->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- MySQL JDBC driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.connect.version}</version>
</dependency>
<!-- Hive JDBC driver -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.19</version>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
</plugins>
</build>
```
Code:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode, SparkSession}
import java.util.Properties

object ReadMysqlDemo {
  def main(args: Array[String]): Unit = {
    val warehouse = "C:\\Users\\smy\\IdeaProjects\\SparkSQLDemo\\warehouse"
    val sparkSession: SparkSession = SparkSession.builder()
      .appName("read mysql data")                   // application name
      .master("local[*]")                           // run locally, using all available cores
      .config("spark.sql.warehouse.dir", warehouse) // warehouse directory (Hive warehouse location)
      .getOrCreate()

    // Method 1: read via format("jdbc") with options
    val dataFrame: DataFrame = sparkSession
      .read           // read data
      .format("jdbc") // JDBC data source
      .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8") // MySQL JDBC URL
      .option("dbtable", "food")      // table name
      .option("user", "root")         // database user
      .option("password", "password") // database password
      .load()                         // load the data
    // show the first five rows
    dataFrame.show(5)

    // Method 2: read via read.jdbc(url, table, properties)
    val properties = new Properties()
    properties.put("user", "root")         // user
    properties.put("password", "password") // password
    // val jdbc_url = "jdbc:mysql://192.168.222.201:3306/datasource?user=root&password=password&useUnicode=true&characterEncoding=utf-8"
    val jdbc_url = "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8"
    val dataFrame1: DataFrame = sparkSession.read.jdbc(jdbc_url, "food", properties)
    // show the first five rows
    dataFrame1.show(5)

    // check that Chinese characters are handled correctly
    dataFrame.filter("`主营类型`='甜点饮品'").filter("`店铺评分`>0.3").show(6)
    val result: Dataset[Row] = dataFrame1.filter("`主营类型`='甜点饮品'")
      .filter("`店铺评分`>0.3").select("商家名称", "店铺评分")
    result.show(5)

    // Write method 1: format("jdbc") with options
    result.write.mode(SaveMode.Overwrite)
      .format("jdbc")
      .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8") // MySQL JDBC URL
      .option("dbtable", "test1")     // table name
      .option("user", "root")         // database user
      .option("password", "password") // database password
      .save()
    // Write method 2: write.jdbc(url, table, properties)
    // result.write.mode(SaveMode.Overwrite).jdbc(jdbc_url, "test1", properties)
  }
}
```
Parameter reference (JDBC data source options):

| Parameter | Description | Example |
|---|---|---|
| url | The JDBC URL to connect to | jdbc:mysql://localhost:3306/datasource?user=root&password=password&useUnicode=true&characterEncoding=utf-8 |
| dbtable | The JDBC table to read. Anything that is valid in the FROM clause of a SQL query can be used; for example, a parenthesized subquery instead of a full table. | food |
| driver | Class name of the JDBC driver used to connect to this URL | com.mysql.jdbc.Driver |
| fetchsize | The JDBC fetch size, which determines how many rows to fetch per round trip. Applies only to reading. | By default (for MySQL) all rows are fetched at once |
| batchsize | The JDBC batch size, which determines how many rows to insert per round trip. Applies only to writing. | Defaults to 1000 |
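
The driver, fetchsize, and batchsize options plug into the same read/write calls used in the code above. The following is a minimal sketch (not part of the original program): it reuses the sparkSession, database, and food table from the example, reads through a parenthesized subquery instead of a full table, and tunes the per-round-trip sizes. The target table name test2 is hypothetical.

```scala
// Sketch only: assumes the sparkSession, MySQL instance, and `food` table from the example above.
val tuned: DataFrame = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8")
  .option("driver", "com.mysql.jdbc.Driver") // JDBC driver class (MySQL Connector/J 5.x)
  .option("dbtable", "(SELECT 商家名称, 店铺评分 FROM food WHERE 店铺评分 > 0.3) AS t") // subquery instead of a full table
  .option("user", "root")
  .option("password", "password")
  .option("fetchsize", "1000")               // rows fetched per round trip (read side)
  .load()

tuned.write
  .mode(SaveMode.Append)
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.222.201:3306/datasource?useUnicode=true&characterEncoding=utf-8")
  .option("dbtable", "test2")                // hypothetical target table
  .option("user", "root")
  .option("password", "password")
  .option("batchsize", "2000")               // rows inserted per round trip (write side)
  .save()
```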
SaveMode reference:

| Parameter | Description |
|---|---|
| SaveMode.ErrorIfExists (default) | When saving a DataFrame to a data source, an exception is expected to be thrown if the data already exists. |
| SaveMode.Append | When saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are appended to the existing data. |
| SaveMode.Overwrite | Overwrite mode means that when saving a DataFrame to a data source, if the data/table already exists, the existing data is overwritten by the contents of the DataFrame. |
| SaveMode.Ignore | Ignore mode means that when saving a DataFrame to a data source, if the data already exists, the save operation does not save the DataFrame's contents and does not change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
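
As a quick illustration of the modes in the table, here is a small sketch (not part of the original program) that reuses result, jdbc_url, and properties from the code above and writes to the same test1 table with different modes.

```scala
// Sketch only: reuses result, jdbc_url and properties from the example above.
// Append: add the rows of `result` to the existing test1 table.
result.write.mode(SaveMode.Append).jdbc(jdbc_url, "test1", properties)

// Ignore: test1 already exists, so nothing is written and the existing data is untouched.
result.write.mode(SaveMode.Ignore).jdbc(jdbc_url, "test1", properties)

// ErrorIfExists (the default): would throw an exception because test1 already exists.
// result.write.mode(SaveMode.ErrorIfExists).jdbc(jdbc_url, "test1", properties)
```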