Introduction
The DataFrame API was introduced in Spark 1.3 and greatly extends Spark SQL programming: with DataFrames you can work against many different data sources, including RDDs, JSON, Parquet, JDBC and Hive tables.
This post is a set of working notes on using DataFrames, written after being impressed by how much they can do.

Suppose we have two tables, studentInfo and studentScore, whose structure is shown below:

(Figure: structure of the studentInfo and studentScore tables)
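
For illustration only, the two JSON files might look like the following. The field names (name, score, age, area) are the ones used by the code below; the concrete values are made up. Each line is one JSON object, which is the layout Spark's json() reader expects:

studentScore.json:

{"name":"Tom","score":85}
{"name":"Marry","score":78}
{"name":"Jack","score":92}

studentInfo.json:

{"name":"Tom","age":23,"area":"Beijing"}
{"name":"Marry","age":22,"area":"Shanghai"}
{"name":"Jack","age":24,"area":"Guangzhou"}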

The task is to keep only students whose score is above 80 and then join the two tables.

The approach is as follows:
To keep things simple, the table data is represented as JSON (as sketched above), and Spark SQL can read JSON directly and turn it into a DataFrame.
First, use the SQLContext to read each JSON file from HDFS and convert it into a DataFrame:

// read the student-score JSON data and convert it to a DataFrame
DataFrame studentScoreDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentScore.json");

// read the student-info JSON data and convert it to a DataFrame
DataFrame studentInfoDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentInfo.json");

Then register both DataFrames as temporary tables so they can be filtered with plain SQL:

// register studentScoreDF as a temporary table so it can be filtered with SQL
studentScoreDF.registerTempTable("studentScore");

// register studentInfoDF as a temporary table
studentInfoDF.registerTempTable("studentInfo");

Filter studentScore, keeping only students who scored above 80, then collect their names into a list:

DataFrame filteredScoreDF = sqlContext.sql("select * from studentScore where score>80");

List<String> filtedStudentNameList = filteredScoreDF.javaRDD().map(new Function<Row, String>() {

    @Override
    public String call(Row row) throws Exception {
        return (String)row.getAs("name");
    }
}).collect();

The resulting filtedStudentNameList holds the names of all students scoring above 80. Using these names, filter studentInfo so that only those students' rows remain:

String sql = "select * from studentInfo where name in (";

// build the IN (...) clause
for(int i = 0;i < filtedStudentNameList.size();i++){
    sql += "'" + filtedStudentNameList.get(i) + "'";
    if(i<filtedStudentNameList.size()-1){
        sql += ",";
    }
}
sql += ")";
// keep only the info rows of students scoring above 80
DataFrame filteredStudentInfoDF = sqlContext.sql(sql);
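
With the made-up sample data shown earlier (where only Tom and Jack score above 80), the string assembled by the loop would look like:

select * from studentInfo where name in ('Tom','Jack')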

After the filtering we still have DataFrame objects. Convert each one to a JavaPairRDD keyed by the student's name, then join them:

// convert the two filtered DataFrames to pair RDDs keyed by name, then join them
JavaPairRDD<String, Tuple2<Integer, String>> joined =
        filteredScoreDF.javaRDD().mapToPair(new PairFunction<Row, String, Integer>() {

            @Override
            public Tuple2<String, Integer> call(Row row) throws Exception {
                // key: name, value: score
                return new Tuple2<String, Integer>((String) row.getAs("name"),
                        Integer.valueOf(String.valueOf(row.getAs("score"))));
            }
        }).join(filteredStudentInfoDF.javaRDD().mapToPair(new PairFunction<Row, String, String>() {

            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                // key: name, value: "age,area"
                return new Tuple2<String, String>((String) row.getAs("name"),
                        String.valueOf(row.getAs("age")) + "," + (String) row.getAs("area"));
            }
        }));
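
As an aside, the same filter-and-join can also be expressed without dropping to the RDD level, using the DataFrame API itself. A minimal sketch, reusing the studentScoreDF and studentInfoDF defined above (note that the joined result keeps both name columns, so select the columns you need afterwards):

// sketch: filter and join entirely with the DataFrame API (Column-based join, available since Spark 1.3)
DataFrame highScoreDF = studentScoreDF.filter(studentScoreDF.col("score").gt(80));
DataFrame joinedDF2 = highScoreDF.join(studentInfoDF,
        highScoreDF.col("name").equalTo(studentInfoDF.col("name")));
joinedDF2.show();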

From the joined pair RDD, build a JavaRDD<Row> and print it:

// convert the joined pair RDD into a JavaRDD<Row>;
// the field order and types must match the schema defined below: name, age, area, score
JavaRDD<Row> joinedRow = joined.map(new Function<Tuple2<String, Tuple2<Integer, String>>, Row>() {

    @Override
    public Row call(Tuple2<String, Tuple2<Integer, String>> t) throws Exception {
        String[] info = t._2._2.split(",");   // info[0] = age, info[1] = area
        return RowFactory.create(t._1,        // name
                Integer.valueOf(info[0]),     // age
                info[1],                      // area
                t._2._1);                     // score
    }
});

joinedRow.foreach(new VoidFunction<Row>() {

    @Override
    public void call(Row row) throws Exception {
        System.out.println(row);

    }
});

Alternatively, save the result to HDFS as a JSON file.
Since the joined data is now a JavaRDD<Row>, we first have to define its schema (the metadata describing each Row) so that it can be converted back into a DataFrame; with the schema in place, createDataFrame turns joinedRow into a DataFrame, and calling write() with the json format and a target path does the rest:

// define the schema for the JavaRDD<Row>
List<StructField> structFields = new ArrayList<StructField>();
structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); 
structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));  
structFields.add(DataTypes.createStructField("area", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
StructType structType = DataTypes.createStructType(structFields);

// convert the joined RDD back into a DataFrame so it can be saved in other formats
DataFrame joinedDF = sqlContext.createDataFrame(joinedRow, structType);
joinedDF.write().format("json").save("hdfs://cqt01:9000/spark-data/joined-students");

The program output is:

(Figure: program output)

The complete program:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

public class DataFrameFilterAndJoin {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("DataFrameFilterAndJoin");
        JavaSparkContext sc = new JavaSparkContext(conf);

        SQLContext sqlContext = new SQLContext(sc);

        // read the student-score JSON data and convert it to a DataFrame
        DataFrame studentScoreDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentScore.json");

        // read the student-info JSON data and convert it to a DataFrame
        DataFrame studentInfoDF = sqlContext.read().json("hdfs://cqt01:9000/spark-data/studentInfo.json");

        // filter out students scoring above 80:
        // first register studentScoreDF as a temporary table, then filter with SQL
        studentScoreDF.registerTempTable("studentScore");
        DataFrame filteredScoreDF = sqlContext.sql("select * from studentScore where score>80");

        List<String> filtedStudentNameList = filteredScoreDF.javaRDD().map(new Function<Row, String>() {

            @Override
            public String call(Row row) throws Exception {
                return (String)row.getAs("name");
            }
        }).collect();

        // using the filtered students, join against the student-info table
        studentInfoDF.registerTempTable("studentInfo");
        String sql = "select * from studentInfo where name in (";

        // build the IN (...) clause
        for(int i = 0;i < filtedStudentNameList.size();i++){
            sql += "'" + filtedStudentNameList.get(i) + "'";
            if(i<filtedStudentNameList.size()-1){
                sql += ",";
            }
        }
        sql += ")";

        DataFrame filteredStudentInfoDF = sqlContext.sql(sql);

        // convert the two filtered DataFrames to pair RDDs keyed by name, then join them
        JavaPairRDD<String, Tuple2<Integer, String>> joined =
                filteredScoreDF.javaRDD().mapToPair(new PairFunction<Row, String, Integer>() {

                    @Override
                    public Tuple2<String, Integer> call(Row row) throws Exception {
                        // key: name, value: score
                        return new Tuple2<String, Integer>((String) row.getAs("name"),
                                Integer.valueOf(String.valueOf(row.getAs("score"))));
                    }
                }).join(filteredStudentInfoDF.javaRDD().mapToPair(new PairFunction<Row, String, String>() {

                    @Override
                    public Tuple2<String, String> call(Row row) throws Exception {
                        // key: name, value: "age,area"
                        return new Tuple2<String, String>((String) row.getAs("name"),
                                String.valueOf(row.getAs("age")) + "," + (String) row.getAs("area"));
                    }
                }));

        // convert the joined pair RDD into a JavaRDD<Row>;
        // the field order and types must match the schema defined below: name, age, area, score
        JavaRDD<Row> joinedRow = joined.map(new Function<Tuple2<String, Tuple2<Integer, String>>, Row>() {

            @Override
            public Row call(Tuple2<String, Tuple2<Integer, String>> t) throws Exception {
                String[] info = t._2._2.split(",");   // info[0] = age, info[1] = area
                return RowFactory.create(t._1,        // name
                        Integer.valueOf(info[0]),     // age
                        info[1],                      // area
                        t._2._1);                     // score
            }
        });

        joinedRow.foreach(new VoidFunction<Row>() {

            @Override
            public void call(Row row) throws Exception {
                System.out.println(row);

            }
        });


        // define the schema for the JavaRDD<Row>
        List<StructField> structFields = new ArrayList<StructField>();
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true)); 
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));  
        structFields.add(DataTypes.createStructField("area", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(structFields);

        // convert the joined RDD back into a DataFrame so it can be saved in other formats
        DataFrame joinedDF = sqlContext.createDataFrame(joinedRow, structType);
        joinedDF.write().format("json").save("hdfs://cqt01:9000/spark-data/joined-students");  
    }

}