The following code writes three rows of data:
张三,20
李四,22
王五,30
into the file /tmp/lxw1234/orcoutput/lxw1234.com.orc on HDFS.
package com.lxw1234.test;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

/**
 * lxw的大数据田地 -- http://lxw1234.com
 * @author lxw.com
 */
public class TestOrcWriter {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path("/tmp/lxw1234/orcoutput/lxw1234.com.orc");

        // Build a struct ObjectInspector from MyRow via reflection;
        // it defines the ORC schema (name: string, age: int).
        StructObjectInspector inspector =
                (StructObjectInspector) ObjectInspectorFactory.getReflectionObjectInspector(
                        MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        // OrcSerde turns each MyRow into a Writable that OrcOutputFormat can write.
        OrcSerde serde = new OrcSerde();
        OutputFormat outFormat = new OrcOutputFormat();
        RecordWriter writer = outFormat.getRecordWriter(fs, conf,
                outputPath.toString(), Reporter.NULL);

        writer.write(NullWritable.get(), serde.serialize(new MyRow("张三", 20), inspector));
        writer.write(NullWritable.get(), serde.serialize(new MyRow("李四", 22), inspector));
        writer.write(NullWritable.get(), serde.serialize(new MyRow("王五", 30), inspector));

        writer.close(Reporter.NULL);
        fs.close();
        System.out.println("write success.");
    }

    /**
     * One row of data. The fields (in declaration order) determine the ORC schema.
     * Writable is implemented only to satisfy the OutputFormat signature;
     * serialization is done by OrcSerde, so these methods are never called.
     */
    static class MyRow implements Writable {
        String name;
        int age;

        MyRow(String name, int age) {
            this.name = name;
            this.age = age;
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            throw new UnsupportedOperationException("no read");
        }

        @Override
        public void write(DataOutput out) throws IOException {
            throw new UnsupportedOperationException("no write");
        }
    }
}
Package the program above as orc.jar, upload it to a Hadoop client machine, and run the following commands:
export HADOOP_CLASSPATH=/usr/local/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar:$HADOOP_CLASSPATH
hadoop jar orc.jar com.lxw1234.test.TestOrcWriter
After it finishes successfully, check the file on HDFS:
[liuxiaowen@dev tmp]$ hadoop fs -ls /tmp/lxw1234/orcoutput/
Found 1 items
-rw-r--r-- 2 liuxiaowen supergroup 312 2015-08-18 18:09 /tmp/lxw1234/orcoutput/lxw1234.com.orc
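If you want to double-check the file's contents before going to Hive, below is a minimal read-back sketch. It is not part of the original program: it assumes the same hive-exec ORC reader API (OrcFile, Reader, RecordReader) is on the classpath, and the class name TestOrcReader is just illustrative. It prints each row's struct representation.

package com.lxw1234.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;

public class TestOrcReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/lxw1234/orcoutput/lxw1234.com.orc");

        // Open the ORC file and iterate over its rows;
        // passing null reads all columns.
        Reader reader = OrcFile.createReader(fs, path);
        RecordReader rows = reader.rows(null);
        Object row = null;
        while (rows.hasNext()) {
            row = rows.next(row);
            System.out.println(row);
        }
        rows.close();
        fs.close();
    }
}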
Next, create an external table in Hive whose LOCATION points to that directory and whose storage format is ORC (the column names and types match the fields of MyRow):
CREATE EXTERNAL TABLE lxw1234 (
  name STRING,
  age INT
) STORED AS ORC
LOCATION '/tmp/lxw1234/orcoutput/';
Query the table in Hive:
hive> desc lxw1234;
OK
name string
age int
Time taken: 0.148 seconds, Fetched: 2 row(s)
hive> select * from lxw1234;
OK
张三 20
李四 22
王五 30
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
OK, the data shows up correctly.
Note: this program is only a feasibility test. If the amount of ORC data is large, it needs to be improved, for example by streaming rows from an input file instead of hard-coding them (see the sketch below), or the job should be written with MapReduce instead.
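As a rough illustration of that kind of improvement, the sketch below streams "name,age" lines from a local text file. The class name TestOrcStreamWriter, its command-line arguments, and the input-file format are my own assumptions; it simply reuses MyRow and the same OrcSerde/OrcOutputFormat calls from TestOrcWriter above.

package com.lxw1234.test;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

public class TestOrcStreamWriter {

    // args[0]: local "name,age" text file; args[1]: target ORC file path on HDFS
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);

        // Same reflection-based inspector and serde as in TestOrcWriter above.
        StructObjectInspector inspector =
                (StructObjectInspector) ObjectInspectorFactory.getReflectionObjectInspector(
                        TestOrcWriter.MyRow.class,
                        ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        OrcSerde serde = new OrcSerde();
        RecordWriter writer = new OrcOutputFormat().getRecordWriter(
                fs, conf, new Path(args[1]).toString(), Reporter.NULL);

        // Stream rows from the input file instead of hard-coding them.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(",");
            writer.write(NullWritable.get(), serde.serialize(
                    new TestOrcWriter.MyRow(fields[0], Integer.parseInt(fields[1])),
                    inspector));
        }
        in.close();
        writer.close(Reporter.NULL);
        fs.close();
    }
}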
A follow-up post will cover reading and writing Hive ORC files with MapReduce.