Part I. Reading files. First, prepare a people.json file and a people.csv file with the following contents:

The people.json file:

{"name":"Michael"}
{"name":"Andy","age":30}
{"name":"Justin","age":19}

The people.csv file:

name,age
Michael,
Andy,30
Justin,19

1. Reading and writing JSON files

(1) Reading from a local file. In the spark-shell, first run import spark.implicits._, then execute val df=spark.read.json("file:///sunxj/work/spark/people.json") to read the file, and finally call df.show() to display its contents.


The same file can also be read with val df1=spark.read.format("json").load("file:///sunxj/work/spark/people.json").

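Putting the two read styles together, a minimal spark-shell sketch looks like this; with the sample people.json above, df.show() should print something close to the commented table:

import spark.implicits._
// shorthand JSON reader
val df = spark.read.json("file:///sunxj/work/spark/people.json")
// equivalent generic reader with an explicit format
val df1 = spark.read.format("json").load("file:///sunxj/work/spark/people.json")
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+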

(2) Writing JSON files. Execute df.write.json("file:///sunxj/work/spark/people1.json") or df.write.format("json").save("file:///sunxj/work/spark/people2.json") to write the DataFrame out as JSON. Note that each call creates a directory at the given path, and the data lands in one or more part files inside that directory.

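Because each write produces a directory of part files, the directory itself (not an individual part file) is what the reader is pointed at afterwards. A minimal sketch; the people1_single.json path is just an illustrative name:

// read the directory written above back into a DataFrame
val people1 = spark.read.json("file:///sunxj/work/spark/people1.json")
people1.show()
// if a single output file is wanted, reduce to one partition before writing
df.coalesce(1).write.json("file:///sunxj/work/spark/people1_single.json")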

(3) Selected columns can also be written out, for example df.select("name","age").write.json("file:///sunxj/work/spark/newpeople1.json") or df.select("name","age").write.format("json").save("file:///sunxj/work/spark/newpeople.json"). Again, each call creates a directory containing the part file(s).


2. Reading and writing CSV files

(1) Reading from a local file. After import spark.implicits._, execute val csvdf=spark.read.csv("file:///sunxj/work/spark/people.csv") or val csvdf1=spark.read.format("csv").load("file:///sunxj/work/spark/people.csv") to read the file, then call show() to display the contents. Without further options the header row is read as ordinary data and the columns are named _c0 and _c1.

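A minimal spark-shell sketch; the header and inferSchema options are standard Spark CSV reader options, shown here as an optional variant for treating the first line as column names rather than data:

val csvdf = spark.read.csv("file:///sunxj/work/spark/people.csv")
csvdf.show()   // columns are named _c0 and _c1, and the header line appears as a data row
// optional: treat the first line as a header and infer the column types
val csvdf2 = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("file:///sunxj/work/spark/people.csv")
csvdf2.printSchema()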

 

(2) Writing CSV files. Execute csvdf.select("_c0","_c1").write.format("csv").save("file:///sunxj/work/spark/newcsv/") or csvdf.select("_c0","_c1").write.csv("file:///sunxj/work/spark/newcsv1/") to write the data out as CSV. As with JSON, each call creates a directory containing the part file(s).

 



3. Whether the data came from JSON or CSV, it can also be saved with rdd.saveAsTextFile(), for example csvdf.rdd.saveAsTextFile("file:///sunxj/work/spark/rddcsv/") or df.rdd.saveAsTextFile("file:///sunxj/work/spark/rddjson/").

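saveAsTextFile() writes each Row using its toString form, so the part files contain bracketed values rather than real CSV or JSON. A sketch of what to expect with the sample JSON DataFrame (row order across part files may vary):

df.rdd.saveAsTextFile("file:///sunxj/work/spark/rddjson/")
// the part files contain lines such as:
// [null,Michael]
// [30,Andy]
// [19,Justin]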

4. Use printSchema() to print the schema of a DataFrame:

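For the sample people.json, the inferred schema should look roughly like the comment below:

df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)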

5. select() picks a subset of columns from a DataFrame, and can also rename columns or derive new ones:

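A minimal sketch using the df read from people.json above:

// select two columns, rename one, and derive a computed column
df.select(df("name").as("username"), df("age") + 1).show()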

6. filter() filters rows by a condition on a column:

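A minimal sketch on the same df:

// keep only the rows whose age is greater than 20
df.filter(df("age") > 20).show()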

7. groupBy() groups rows by a column; combined with count() it reports the number of rows in each group:

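A minimal sketch:

// count how many rows share each age value
df.groupBy("age").count().show()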

8. sort() orders the rows; for example, sort by age in descending order and, where ages are equal, by name in ascending order:

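A minimal sketch:

// descending by age, then ascending by name for equal ages
df.sort(df("age").desc, df("name").asc).show()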

9. Inferring the RDD schema by reflection (define the table structure with a case class, then query it with SQL)

(1) First create a people.txt file with the following contents:

Michael,29
Andy,30
Justin,19

(2) Create a pom.xml file with the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>comprehensive-example</groupId>
    <artifactId>comprehensiveexample</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>PeopleSql</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(3) Create a PeopleSql.scala file with the following contents:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.SparkSession

case class People(name: String, age: Long)

object PeopleSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // read the raw text file
    val lines = spark.sparkContext.textFile("file:///sunxj/work/spark/people.txt")
    // to convert an RDD to a DataFrame by reflection, a case class is required:
    // split each line, map it to a People instance, then call toDF()
    val pdf = lines.map(line => line.split(",")).map(x => People(x(0), x(1).trim.toLong)).toDF()
    // register the DataFrame as a temporary view named "people"
    pdf.createOrReplaceTempView("people")
    // run a SQL query, which returns a new DataFrame
    val personDF = spark.sql("select name, age from people where age > 20")
    // format each row as a string and print the result
    personDF.map(x => "Name:" + x(0) + "," + "Age:" + x(1)).show()
  }
}

(4) Package the project with mvn package, then run spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar. The output should resemble the sketch below.

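With the sample people.txt, only the rows with age greater than 20 survive the filter, so the show() output should look roughly like:

+-------------------+
|              value|
+-------------------+
|Name:Michael,Age:29|
|   Name:Andy,Age:30|
+-------------------+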

10. Defining the RDD schema programmatically (turn the file contents into a table, a header plus rows, and query it with SQL)

(1) Create a PeopleSqlProgramming.scala file with the following contents:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object PeopleSqlProgramming {
  def main(args: Array[String]): Unit = {
    // define the fields: the first argument is the column name, the second the column type,
    // and true means the column may contain null values
    val fields = Array(StructField("name", StringType, true), StructField("age", IntegerType, true))
    // build the schema, i.e. the table header
    val schema = StructType(fields)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // read the raw text file
    val peopleRDD = spark.sparkContext.textFile("file:///sunxj/work/spark/people.txt")
    // convert each line into a Row
    val rowRDD = peopleRDD.map(_.split(",")).map(x => Row(x(0), x(1).trim.toInt))
    // combine the Row RDD with the schema
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    // register a temporary view
    peopleDF.createOrReplaceTempView("people")
    // run the SQL query
    val result = spark.sql("select name, age from people")
    // print the result
    result.map(x => "Name:" + x(0) + "," + "Age:" + x(1)).show()
  }
}

(2) Change <mainClass>PeopleSql</mainClass> to <mainClass>PeopleSqlProgramming</mainClass> in pom.xml, rebuild with mvn package, and run spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar. The output should resemble the sketch below.

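This query has no filter, so all three rows from people.txt come back, roughly as:

+-------------------+
|              value|
+-------------------+
|Name:Michael,Age:29|
|   Name:Andy,Age:30|
| Name:Justin,Age:19|
+-------------------+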

11. Reading and writing MySQL with Spark SQL over JDBC. First create the database and table.

(1) Log in to MySQL with mysql -u root -p and enter the password 123456.

(2) Create the database with: create database spark;


(3) Then run the following statements in order:

use spark;

create table student(id int(4),name varchar(20),gender varchar(4),age int(4));

insert into student values (1,'Xueqian','F',23);

insert into student values (2,'Weiliang','M',24);


(4) Querying the table shows the two rows just inserted:

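For reference, the query and (roughly) the result expected from the two inserts above:

mysql> select * from student;
+------+----------+--------+------+
| id   | name     | gender | age  |
+------+----------+--------+------+
|    1 | Xueqian  | F      |   23 |
|    2 | Weiliang | M      |   24 |
+------+----------+--------+------+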

(5) Create a new project with the following pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark-sql-mysql-hive</groupId>
    <artifactId>spark-sql-mysql-hive</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.17</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>SparkSqlOperateMysql</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(6) Create a SparkSqlOperateMysql.scala file with the following contents:

import org.apache.spark.sql.SparkSession

object SparkSqlOperateMysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    // read the student table from the spark database over JDBC
    val jdbcDF = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/spark")
      .option("driver", "com.mysql.cj.jdbc.Driver") // Connector/J 8.x driver class; com.mysql.jdbc.Driver is the deprecated 5.x name
      .option("dbtable", "student")
      .option("user", "root")
      .option("password", "123456")
      .load()
    jdbcDF.show()
  }
}

(7) Package with mvn package, then run spark-submit spark-sql-mysql-hive/target/spark-sql-mysql-hive-1.0-SNAPSHOT-jar-with-dependencies.jar. The output should resemble the sketch below.

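With the two rows inserted earlier, jdbcDF.show() should print roughly:

+---+--------+------+---+
| id|    name|gender|age|
+---+--------+------+---+
|  1| Xueqian|     F| 23|
|  2|Weiliang|     M| 24|
+---+--------+------+---+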

(8) Modify the code so that it can also insert records:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.util.Properties

object SparkSqlOperateMysql {
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._
  // build a DataFrame for the given table over JDBC
  def getDF(databaseName: String, tablename: String, user: String, password: String): DataFrame = {
    spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/" + databaseName)
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", tablename)
      .option("user", user)
      .option("password", password)
      .load()
  }
  def printStudent(databaseName: String, tablename: String, user: String, password: String): Unit = {
    val jdbcDF = getDF(databaseName, tablename, user, password)
    jdbcDF.show()
  }
  def getStudentCount(databaseName: String, tablename: String, user: String, password: String): Long = {
    val jdbcDF = getDF(databaseName, tablename, user, password)
    jdbcDF.count()
  }
  def main(args: Array[String]): Unit = {
    val dbName = "spark"
    val tabName = "student"
    val user = "root"
    val pwd = "123456"
    var count = getStudentCount(dbName, tabName, user, pwd)
    println("Row count before insert: %d".format(count))
    println("Rows before insert:")
    printStudent(dbName, tabName, user, pwd)
    // build two new student records
    val studentRDD = spark.sparkContext.parallelize(Array(
        "%d Rongcheng%d M 26".format(count + 1, count + 1),
        "%d Guanhua%d M 27".format(count + 2, count + 2)))
      .map(_.split(" "))
    // define the schema, i.e. the column names and types
    val schema = StructType(List(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("gender", StringType, true),
      StructField("age", IntegerType, true)))
    // turn each record into a Row
    val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
    // combine the rows with the schema
    val studentDF = spark.createDataFrame(rowRDD, schema)
    // JDBC connection properties
    val prop = new Properties()
    prop.put("user", user)
    prop.put("password", pwd)
    prop.put("driver", "com.mysql.cj.jdbc.Driver")
    // append the new rows to the student table in the spark database
    studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/" + dbName, "spark." + tabName, prop)
    count = getStudentCount(dbName, tabName, user, pwd)
    println("Row count after insert: %d".format(count))
    println("Rows after insert:")
    printStudent(dbName, tabName, user, pwd)
  }
}

(9) Repackage and run with spark-submit; the output prints the row count and the rows before and after the insert.


(10) Querying the table directly in MySQL confirms that the new rows were appended.


12. Operating on Hive from Spark

(1) First open the hive CLI and create a database named spark_test with a table named student:

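A minimal HiveQL sketch; the column layout is an assumption chosen to match the schema used by the Scala code later in this section (id, name, gender, age):

create database spark_test;
use spark_test;
-- column types assumed to mirror the MySQL student table above
create table student(id int, name string, gender string, age int);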

(2) Insert a few rows.


The console output shows that Hive executes these SQL statements as MapReduce jobs.

(3) Create a SprkSqlOperateHive.scala file with the following contents:

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

object SprkSqlOperateHive {
  val warehouseLocation = "hdfs://master:9000/user/hive/warehouse"
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark Hive Example")
      .config("hive.metastore.uris", "thrift://master:9083") // URI of the Hive metastore service
      //.config("spark.sql.warehouse.dir", warehouseLocation) // already configured in hive-site.xml
      .enableHiveSupport() // enable Hive support
      .getOrCreate()
    spark.sql("select * from spark_test.student").show()
  }
}

The pom.xml file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark-sql-mysql-hive</groupId>
    <artifactId>spark-sql-mysql-hive</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>SprkSqlOperateHive</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(4)将pom.xml中的<mainClass>SparkSqlOperateMysql</mainClass>改为:<mainClass>SprkSqlOperateHive</mainClass>,然后重新打包,然后执行,输出结果如下:


(5) Modify the code so that it can also write records to the table:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object SprkSqlOperateHive {
  val spark = SparkSession.builder()
    .appName("Spark Hive Example")
    .config("hive.metastore.uris", "thrift://master:9083") // URI of the Hive metastore service
    .enableHiveSupport() // enable Hive support
    .getOrCreate()
  def printStudent(): Unit = {
    spark.sql("select * from spark_test.student").show()
  }
  def getStudentCount(): Long = {
    spark.sql("select * from spark_test.student").count()
  }
  def main(args: Array[String]): Unit = {
    var count = getStudentCount()
    println("Row count before insert: %d".format(count))
    println("Rows before insert:")
    printStudent()

    // build two new student records
    val studentRDD = spark.sparkContext.parallelize(Array(
        "%d Rongcheng_%d_hive M 26".format(count + 1, count + 1),
        "%d Guanhua_%d_hive M 27".format(count + 2, count + 2)))
      .map(_.split(" "))
    // define the schema, i.e. the column names and types
    val schema = StructType(List(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("gender", StringType, true),
      StructField("age", IntegerType, true)))
    // turn each record into a Row
    val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
    // combine the rows with the schema
    val studentDF = spark.createDataFrame(rowRDD, schema)
    // register a temporary view and insert its contents into the Hive table
    studentDF.createOrReplaceTempView("tempTable")
    spark.sql("insert into spark_test.student select * from tempTable")
    count = getStudentCount()
    println("Row count after insert: %d".format(count))
    println("Rows after insert:")
    printStudent()
  }
}

The output again prints the row count and the rows before and after the insert.


Querying the table in hive afterwards confirms that the new rows are present.
