Introduction
The DataFrame was introduced in Spark 1.3 and greatly extends Spark SQL's programming model. With DataFrames you can operate on many different data sources, including RDDs, JSON, Parquet, JDBC, and Hive tables.
Impressed by how powerful DataFrames are, this post records some notes on using them.
Suppose we have two tables, studentInfo and studentScore, with the structures shown below:
We want to filter out the students whose score is greater than 80, then join the two tables.
The approach is as follows:
To keep things simple, the table data is represented as JSON; Spark SQL can read JSON files directly and turn them into DataFrames.
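For concreteness, the two JSON files might look like the following (hypothetical sample data, not from the original post; Spark's JSON reader expects one JSON object per line). studentScore.json:

```json
{"name":"Tom","score":85}
{"name":"Jack","score":60}
{"name":"Lucy","score":92}
```

And studentInfo.json:

```json
{"name":"Tom","age":18,"area":"Beijing"}
{"name":"Jack","age":19,"area":"Shanghai"}
{"name":"Lucy","age":18,"area":"Guangzhou"}
```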
First, use SQLContext to read each JSON file from HDFS and convert it into a DataFrame:
// Read the student-score JSON data and convert it into a DataFrame
DataFrame studentScoreDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentScore.json");
// Read the student-info JSON data and convert it into a DataFrame
DataFrame studentInfoDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentInfo.json");
Next, register them as two temporary tables so we can filter with plain SQL:
// Register studentScoreDF as a temporary table, then filter it with SQL
studentScoreDF.registerTempTable("studentScore");
// Register studentInfoDF as a temporary table
studentInfoDF.registerTempTable("studentInfo");
Filter studentScore to keep only the students whose score is greater than 80:
DataFrame filteredScoreDF = sqlContext.sql("select * from studentScore where score>80");
List<String> filtedStudentNameList = filteredScoreDF.javaRDD().map(new Function<Row, String>() {
    @Override
    public String call(Row row) throws Exception {
        return (String) row.getAs("name");
    }
}).collect();
The resulting filtedStudentNameList contains the names of the students who scored above 80. Use these names to filter studentInfo so that only those students' records remain:
String sql = "select * from studentInfo where name in (";
// Build the IN clause
for (int i = 0; i < filtedStudentNameList.size(); i++) {
    sql += "'" + filtedStudentNameList.get(i) + "'";
    if (i < filtedStudentNameList.size() - 1) {
        sql += ",";
    }
}
sql += ")";
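Concatenating the IN clause in a loop works, but two details are worth noting: the query is invalid SQL when the list is empty, and repeated `+=` on a String allocates a new String each iteration. A minimal pure-Java sketch of a safer builder (the class name `InClauseBuilder` is just for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class InClauseBuilder {
    // Builds "select * from studentInfo where name in ('a','b')".
    // Returns null for an empty list, since "in ()" is not valid SQL.
    static String buildQuery(List<String> names) {
        if (names.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder("select * from studentInfo where name in (");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append("'").append(names.get(i)).append("'");
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // → select * from studentInfo where name in ('Tom','Lucy')
        System.out.println(buildQuery(Arrays.asList("Tom", "Lucy")));
    }
}
```

Note that this still embeds raw values in the SQL text; if the names could contain single quotes they would need escaping.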
// Keep only the info of students whose score is greater than 80
DataFrame filteredStudentInfoDF = sqlContext.sql(sql);
Both filtering steps yield DataFrame objects. To join them by the student's name, convert each into a JavaRDD of key-value pairs and join on the name key:
// Convert the filtered DataFrames to pair RDDs, then join them
JavaPairRDD<String, Tuple2<Integer, String>> joined =
    filteredScoreDF.javaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(Row row) throws Exception {
            return new Tuple2<String, Integer>((String) row.getAs("name"),
                    Integer.valueOf(String.valueOf(row.getAs("score"))));
        }
    }).join(filteredStudentInfoDF.javaRDD().mapToPair(new PairFunction<Row, String, String>() {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception {
            // The key is the student's name; age and area are packed into one string
            return new Tuple2<String, String>((String) row.getAs("name"),
                    String.valueOf(row.getAs("age")) + "," + (String) row.getAs("area"));
        }
    }));
After the join, convert the result into a JavaRDD&lt;Row&gt; and print it:
// Convert the joined pair RDD back into a JavaRDD<Row>
// (field order matches the schema defined below: name, age, area, score)
JavaRDD<Row> joinedRow = joined.map(new Function<Tuple2<String, Tuple2<Integer, String>>, Row>() {
    @Override
    public Row call(Tuple2<String, Tuple2<Integer, String>> t) throws Exception {
        String[] info = t._2._2.split(",");
        return RowFactory.create(t._1, Integer.valueOf(info[0]), info[1], t._2._1);
    }
});
joinedRow.foreach(new VoidFunction<Row>() {
    @Override
    public void call(Row row) throws Exception {
        System.out.println(row);
    }
});
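Packing age and area into a single comma-separated string is a quick trick, but it silently breaks if an area name ever contains a comma. A small pure-Java sketch of the pack/unpack round-trip (the class name `PackDemo` is just for illustration; the `split(",", 2)` limit, which keeps a comma-containing area intact, is a defensive tweak not present in the original code):

```java
public class PackDemo {
    // Pack age and area into one string, as the join step does
    static String pack(int age, String area) {
        return age + "," + area;
    }

    // Unpack with limit = 2 so only the first comma delimits the fields
    static int unpackAge(String packed) {
        return Integer.parseInt(packed.split(",", 2)[0]);
    }

    static String unpackArea(String packed) {
        return packed.split(",", 2)[1];
    }

    public static void main(String[] args) {
        String packed = pack(18, "Washington, D.C.");
        System.out.println(unpackAge(packed));  // 18
        System.out.println(unpackArea(packed)); // Washington, D.C.
    }
}
```

In practice a `Tuple2<Integer, String>` value would avoid the string packing entirely.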
Alternatively, save the result to HDFS as JSON.
Since the data is now a JavaRDD&lt;Row&gt;, we must define its schema so that Spark SQL can convert it back into a DataFrame. With the schema in hand, convert joinedRow to a DataFrame, then call write() and specify the JSON format and the output path:
// Define the schema for the JavaRDD<Row>
List<StructField> structFields = new ArrayList<StructField>();
structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
structFields.add(DataTypes.createStructField("area", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
StructType structType = DataTypes.createStructType(structFields);
// Convert the joinedRow RDD into a DataFrame so it can be saved in other formats
DataFrame joinedDF = sqlContext.createDataFrame(joinedRow, structType);
joinedDF.write().format("json").save("hdfs://cqt01:9000/spark-data/joined-students");
The program output is:
The complete program:
public class DataFrameFilterAndJoin {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DataFrameFilterAndJoin");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        // Read the student-score JSON data and convert it into a DataFrame
        DataFrame studentScoreDF = sqlContext.read().json("C:\\Temp\\studentScore.json");
        // Read the student-info JSON data and convert it into a DataFrame
        DataFrame studentInfoDF = sqlContext.read().json("C:\\Temp\\studentInfo.json");
        // Filter out the students whose score is greater than 80:
        // first register studentScoreDF as a temporary table, then filter with SQL
        studentScoreDF.registerTempTable("studentScore");
        DataFrame filteredScoreDF = sqlContext.sql("select * from studentScore where score>80");
        List<String> filtedStudentNameList = filteredScoreDF.javaRDD().map(new Function<Row, String>() {
            @Override
            public String call(Row row) throws Exception {
                return (String) row.getAs("name");
            }
        }).collect();
        // Join the filtered students with the student-info table
        studentInfoDF.registerTempTable("studentInfo");
        String sql = "select * from studentInfo where name in (";
        // Build the IN clause
        for (int i = 0; i < filtedStudentNameList.size(); i++) {
            sql += "'" + filtedStudentNameList.get(i) + "'";
            if (i < filtedStudentNameList.size() - 1) {
                sql += ",";
            }
        }
        sql += ")";
        DataFrame filteredStudentInfoDF = sqlContext.sql(sql);
        // Convert the filtered DataFrames to pair RDDs, then join them
        JavaPairRDD<String, Tuple2<Integer, String>> joined =
            filteredScoreDF.javaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(Row row) throws Exception {
                    return new Tuple2<String, Integer>((String) row.getAs("name"),
                            Integer.valueOf(String.valueOf(row.getAs("score"))));
                }
            }).join(filteredStudentInfoDF.javaRDD().mapToPair(new PairFunction<Row, String, String>() {
                @Override
                public Tuple2<String, String> call(Row row) throws Exception {
                    // The key is the student's name; age and area are packed into one string
                    return new Tuple2<String, String>((String) row.getAs("name"),
                            String.valueOf(row.getAs("age")) + "," + (String) row.getAs("area"));
                }
            }));
        // Convert the joined pair RDD back into a JavaRDD<Row>
        // (field order matches the schema defined below: name, age, area, score)
        JavaRDD<Row> joinedRow = joined.map(new Function<Tuple2<String, Tuple2<Integer, String>>, Row>() {
            @Override
            public Row call(Tuple2<String, Tuple2<Integer, String>> t) throws Exception {
                String[] info = t._2._2.split(",");
                return RowFactory.create(t._1, Integer.valueOf(info[0]), info[1], t._2._1);
            }
        });
        joinedRow.foreach(new VoidFunction<Row>() {
            @Override
            public void call(Row row) throws Exception {
                System.out.println(row);
            }
        });
        // Define the schema for the JavaRDD<Row>
        List<StructField> structFields = new ArrayList<StructField>();
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("area", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(structFields);
        // Convert the joinedRow RDD into a DataFrame so it can be saved in other formats
        DataFrame joinedDF = sqlContext.createDataFrame(joinedRow, structType);
        joinedDF.write().format("json").save("hdfs://cqt01:9000/spark-data/joined-students");
        sc.close();
    }
}