Contents
- bug
- Dataset operations
Installing Pig
1. Install and unpack the software on the client host
hadoop-desktop:~$ cd /opt/
hadoop-desktop:/opt$ sudo tar xvzf /home/hadoop/pig-0.17.0.tar.gz
hadoop-desktop:/opt$ sudo chown -R hadoop:hadoop pig-0.17.0/
2. Edit the configuration parameters
hadoop-desktop:~$ cd /opt/pig-0.17.0/
hadoop-desktop:/opt/pig-0.17.0$ cd conf/
hadoop-desktop:/opt/pig-0.17.0/conf$ mv log4j.properties.template log4j.properties
hadoop-desktop:/opt/pig-0.17.0/conf$ vim pig.properties
pig.logfile=/opt/pig-0.17.0/logs
log4jconf=/opt/pig-0.17.0/conf/log4j.properties
exectype=mapreduce
3. Edit the environment variables and apply them
hadoop-desktop:~$ vim /home/hadoop/.profile
export PIG_HOME=/opt/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin
hadoop-desktop:~$ source /home/hadoop/.profile
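Running Pig
1. Start the Hadoop services on the master node
2. Start Pig on the client host
Basic usage
(1) Create the /test directory and upload the data files to HDFS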
grunt> mkdir /test
grunt> copyFromLocal A.txt /test;
grunt> copyFromLocal B.txt /test;
grunt> copyFromLocal TP.txt /test;
grunt> copyFromLocal MP.txt /test;
(2) Load A.txt into relation a; relation b is the sum of a's columns $0 and $1
grunt> a = load '/test/A.txt' using PigStorage(',') as (c1:int,c2:double,c3:float);
grunt> b = foreach a generate $0+$1 as b1;
604000 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
21/08/13 16:29:47 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump b;
(1.0)
(4.0)
grunt> describe b;
b: {b1: double}
(3) Relation c is column b1 of b minus 1
grunt> c = foreach b generate b1-1;
grunt> dump c;
(4) Relation d tests the first column of a: if it is 0, output (c1, c2); otherwise output (c1, c3)
grunt> d = foreach a generate c1,($0==0?$1:$2);
grunt> dump d;
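The conditional expression in step (4) can be pictured with a small plain-Python emulation. This is only a sketch of the semantics, not Pig itself; the rows are hypothetical sample data (the contents of A.txt are not shown here) and the function name bincond_projection is made up for illustration.

```python
# Plain-Python emulation of: d = foreach a generate c1, ($0==0 ? $1 : $2);
# ASSUMPTION: the rows below are hypothetical, not the real contents of A.txt.
rows = [(0, 1.0, 2.5), (1, 3.0, 4.5)]  # each row is (c1, c2, c3)


def bincond_projection(rows):
    """Keep c1, then pick c2 when c1 == 0 and c3 otherwise."""
    return [(c1, c2 if c1 == 0 else c3) for c1, c2, c3 in rows]


d = bincond_projection(rows)
print(d)
```

The first row has c1 == 0 and keeps c2, while the second row falls through to c3.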
(5) Relation f keeps the rows of a where c1 > 0 and c2 > 1
grunt> f = filter a by c1>0 and c2>1;
grunt> dump f;
(6) Load the tuple data in TP.txt into relation tp; relation g projects the fields of that tuple
grunt> tp = load '/test/TP.txt' as t:tuple(c1:int,c2:int,c3:int);
grunt> describe tp;
grunt> dump tp;
grunt> g = foreach tp generate t.c1,t.c2,t.c3;
grunt> describe g;
grunt> dump g;
(7) Group g by c1, producing bag data in relation bg
grunt> bg = group g by c1;
grunt> describe bg;
grunt> dump bg;
grunt> illustrate bg;
grunt> x = foreach bg generate g.c1;
grunt> dump x;
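To make the shape of bg easier to see, here is a rough plain-Python emulation of the grouping in step (7). It is a sketch under assumptions: the rows of g are hypothetical, group_by_first is an illustrative helper name, and the groups are sorted only to make the printout deterministic.

```python
from collections import defaultdict

# ASSUMPTION: hypothetical rows of g, each (c1, c2, c3); not the real TP.txt data.
g = [(1, 2, 3), (1, 5, 6), (4, 7, 8)]


def group_by_first(rows):
    """Emulate `bg = group g by c1`: one (key, bag) pair per distinct c1."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    # Sorted by key only so that the output order is predictable here.
    return sorted(bags.items())


bg = group_by_first(g)
# Emulate `x = foreach bg generate g.c1`: project c1 out of every tuple in each bag.
x = [[(row[0],) for row in bag] for _key, bag in bg]
print(bg)
print(x)
```

Each element of bg pairs the group key with a bag holding the full original tuples, which is why g.c1 in the next statement yields one bag of single-field tuples per group.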
(8) Load the map data in MP.txt into relation mp; relation h is the value looked up in mp under the key 'Pig'
grunt> mp = load '/test/MP.txt' as (m:map[]);
grunt> describe mp;
mp: {m: map[]}
grunt> h = foreach mp generate m#'Pig';
grunt> describe h;
h: {bytearray}
grunt> dump h;
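The map lookup m#'Pig' behaves much like a dictionary lookup that returns null when the key is absent. The following plain-Python sketch illustrates that; the map contents are hypothetical (MP.txt is not shown here) and lookup is an illustrative helper name.

```python
# ASSUMPTION: hypothetical rows of mp, each holding one map; not the real MP.txt data.
mp = [{"Pig": "latin", "Hive": "sql"}, {"Hive": "sql"}]


def lookup(rows, key):
    """Emulate `foreach mp generate m#'key'`; a missing key gives null (None here)."""
    return [m.get(key) for m in rows]


h = lookup(mp, "Pig")
print(h)
```

The second row has no 'Pig' key, so its result is None, which mirrors the null that Pig produces for a missing key.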
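bug
The fix is to take the NameNode out of safe mode:
bin/hadoop dfsadmin -safemode leave
// Run the command above from the Hadoop installation directory (it calls bin/hadoop)
// If the Hadoop environment variables are configured, use the following command instead
hadoop dfsadmin -safemode leave
After the fix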
Dataset operations
(1) Load the data
grunt> a = load '/test/A.txt' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load '/test/B.txt' using PigStorage(',') as (b1:int, b2:int, b3:int);
(2) Union of a and b
grunt> c = union a, b;
grunt> dump c;
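(3) Split c into d and e: d holds the rows whose first column is 0, and e holds the rows whose first column is 1 ($0 denotes the first column of the dataset)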
grunt> split c into d if $0 == 0, e if $0 == 1;
grunt> dump d;
grunt> dump e;
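One detail of split worth keeping in mind is that a row matching none of the conditions ends up in neither output. A minimal plain-Python sketch of step (3), using hypothetical rows for c and an illustrative helper name split_rows:

```python
# ASSUMPTION: hypothetical rows of c; the real A.txt and B.txt contents are not shown.
c = [(0, 1, 2), (1, 3, 4), (2, 5, 6)]


def split_rows(rows):
    """Emulate `split c into d if $0 == 0, e if $0 == 1`."""
    d = [r for r in rows if r[0] == 0]
    e = [r for r in rows if r[0] == 1]
    return d, e


d, e = split_rows(c)
print(d)
print(e)
```

The row (2, 5, 6) satisfies neither condition, so it appears in neither d nor e.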
(4) Select a subset of c (rows whose second column is greater than 3)
grunt> f = filter c by $1 > 3;
grunt> dump f;
(5) Group the data by the third column
grunt> g = group c by $2;
grunt> dump g;
(6) Collect all elements into a single group
grunt> h = group c all;
grunt> dump h;
(7) Count the elements in h
grunt> i = foreach h generate COUNT($1);
grunt> dump i;
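Steps (6) and (7) together amount to putting every row into one bag and counting it. The plain-Python sketch below emulates that, including the documented behaviour that COUNT skips tuples whose first field is null. The rows are hypothetical, and group_all and count are illustrative helper names.

```python
# ASSUMPTION: hypothetical rows of c; None stands in for a Pig null.
c = [(0, 1, 2), (1, 3, 4), (None, 5, 6)]


def group_all(rows):
    """Emulate `h = group c all`: a single row holding the key 'all' and one bag."""
    return [("all", list(rows))]


def count(bag):
    """Emulate COUNT: tuples whose first field is null are not counted."""
    return sum(1 for t in bag if t[0] is not None)


h = group_all(c)
# Emulate `i = foreach h generate COUNT($1)`, where $1 is the bag.
i = [(count(bag),) for _key, bag in h]
print(i)
```

With these sample rows the result is 2 rather than 3, because the third row starts with a null.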
(8) Join a and b on the condition a.$2 == b.$2
grunt> j = join a by $2, b by $2;
grunt> dump j;
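The join in step (8) is an inner equi-join: each output row is a row of a followed by a matching row of b. Below is a plain-Python sketch of that behaviour using a simple hash index. The rows are hypothetical and inner_join is an illustrative helper name; it says nothing about how Pig executes the join internally.

```python
from collections import defaultdict

# ASSUMPTION: hypothetical rows; the real A.txt and B.txt contents are not shown.
a = [(0, 1, 2), (1, 3, 4)]
b = [(0, 5, 2), (1, 7, 8)]


def inner_join(left, right, left_col, right_col):
    """Emulate `join left by $left_col, right by $right_col` (inner join)."""
    index = defaultdict(list)
    for r in right:
        index[r[right_col]].append(r)
    # Rows with a null (None) key never match, as in an inner join on nulls.
    return [
        l + r
        for l in left
        if l[left_col] is not None
        for r in index.get(l[left_col], [])
    ]


j = inner_join(a, b, 2, 2)
print(j)
```

Only the rows that share the value 2 in the third column are paired, and the joined row carries all six fields.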
(9) Relation k outputs $1 and $1 * $2 from c
grunt> k = foreach c generate $1, $1 * $2;
grunt> dump k;