Pig installation

1. Install the software on the client host and extract it

hadoop@ddai-desktop:~$ cd /opt/
hadoop@ddai-desktop:/opt$ sudo tar xvzf /home/hadoop/pig-0.17.0.tar.gz
hadoop@ddai-desktop:/opt$ sudo chown -R hadoop:hadoop pig-0.17.0/

2. Edit the configuration

hadoop@ddai-desktop:~$ cd /opt/pig-0.17.0/
hadoop@ddai-desktop:/opt/pig-0.17.0$ cd conf/
hadoop@ddai-desktop:/opt/pig-0.17.0/conf$ mv log4j.properties.template log4j.properties
hadoop@ddai-desktop:/opt/pig-0.17.0/conf$ vim pig.properties
pig.logfile=/opt/pig-0.17.0/logs
log4jconf=/opt/pig-0.17.0/conf/log4j.properties

exectype=mapreduce

3. Set the environment variables and make them take effect

hadoop@ddai-desktop:~$ vim /home/hadoop/.profile
export PIG_HOME=/opt/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin

hadoop@ddai-desktop:~$ source /home/hadoop/.profile

Running Pig

1. Start the Hadoop services on the master node

2. Start Pig on the client host

Basic usage

(1) Create a test directory and upload the files to HDFS

grunt> mkdir /test
grunt> copyFromLocal A.txt /test
grunt> copyFromLocal B.txt /test
grunt> copyFromLocal TP.txt /test
grunt> copyFromLocal MP.txt /test

(2) Load A.txt into relation a; relation b is column $0 plus column $1 of a

grunt> a = load '/test/A.txt' using PigStorage(',') as (c1:int,c2:double,c3:float);
grunt> b = foreach a generate $0+$1 as b1;

604000 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
21/08/13 16:29:47 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).

grunt> dump b;
(1.0)
(4.0)
grunt> describe b;
b: {b1: double}
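What `PigStorage(',')` plus the `generate $0+$1` projection computes can be sketched in plain Python. The two sample rows below are hypothetical, chosen so that the sums match the (1.0) and (4.0) tuples dumped above:

```python
# Rough sketch of: a = load ... using PigStorage(','); b = foreach a generate $0+$1;
rows = ["0,1.0,0.5", "1,3.0,2.5"]  # hypothetical file contents

def parse(line):
    # PigStorage(',') splits each line on commas; the schema then
    # casts the fields to (int, double, float)
    c1, c2, c3 = line.split(",")
    return int(c1), float(c2), float(c3)

a = [parse(r) for r in rows]
# Pig promotes the int to double before adding, which is what the
# IMPLICIT_CAST_TO_DOUBLE warning in the log refers to
b = [float(c1) + c2 for (c1, c2, c3) in a]
print(b)  # [1.0, 4.0]
```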

(3) Relation c is b's column b1 minus 1

grunt> c = foreach b generate b1-1;
grunt> dump c;

(4) Relation d outputs a's c1 together with c2 when the first column is 0, or c3 otherwise

grunt> d = foreach a generate c1,($0==0?$1:$2);
grunt> dump d;
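The bincond operator `?:` acts like a per-row ternary. A minimal Python sketch, with hypothetical sample rows:

```python
# Sketch of: d = foreach a generate c1, ($0==0 ? $1 : $2);
a = [(0, 1.0, 0.5), (1, 3.0, 2.5)]  # hypothetical (c1, c2, c3) tuples
d = [(c1, c2 if c1 == 0 else c3) for (c1, c2, c3) in a]
print(d)  # [(0, 1.0), (1, 2.5)]
```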

(5) Relation f is the rows of a where c1 > 0 and c2 > 1

grunt> f = filter a by c1>0 and c2>1;
grunt> dump f;
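`filter` keeps only the rows satisfying the predicate; a quick sketch over hypothetical rows:

```python
# Sketch of: f = filter a by c1>0 and c2>1;
a = [(0, 1.0, 0.5), (1, 3.0, 2.5), (2, 0.5, 1.0)]  # hypothetical rows
f = [(c1, c2, c3) for (c1, c2, c3) in a if c1 > 0 and c2 > 1]
print(f)  # [(1, 3.0, 2.5)]
```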

(6) Load the tuple data TP.txt into relation tp; relation g is the output produced from tp

grunt> tp = load '/test/TP.txt' as (t:tuple(c1:int,c2:int,c3:int));
grunt> describe tp;
grunt> dump tp;

grunt> g = foreach tp generate t.c1,t.c2,t.c3;
grunt> describe g;
grunt> dump g;
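Each row of tp holds a single field t that is itself a tuple, and g projects its members. Since the contents of TP.txt are not shown, the rows below are hypothetical:

```python
# Sketch of: g = foreach tp generate t.c1, t.c2, t.c3;
tp = [((1, 2, 3),), ((4, 5, 6),)]  # each row has one tuple field t (hypothetical)
g = [(t[0], t[1], t[2]) for (t,) in tp]
print(g)  # [(1, 2, 3), (4, 5, 6)]
```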

(7) Group g and output bag data into relation bg

grunt> bg = group g by c1;
grunt> describe bg;
grunt> dump bg;

grunt> illustrate bg;

grunt> x = foreach bg generate g.c1;
grunt> dump x;
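`group by` produces one row per key, holding the key and a bag of the whole input tuples; projecting g.c1 from each group then yields a bag of c1 values. A Python sketch with hypothetical rows:

```python
from collections import defaultdict

# Sketch of: bg = group g by c1;  x = foreach bg generate g.c1;
g = [(1, 2, 3), (4, 5, 6), (1, 7, 8)]  # hypothetical rows

bags = defaultdict(list)
for row in g:
    bags[row[0]].append(row)       # group key is c1; value is a bag of whole tuples
bg = sorted(bags.items())          # like bg: {group: int, g: {(c1,c2,c3)}}

# projecting g.c1 keeps only the c1 values inside each group's bag
x = [[row[0] for row in bag] for _key, bag in bg]
print(bg)  # [(1, [(1, 2, 3), (1, 7, 8)]), (4, [(4, 5, 6)])]
print(x)   # [[1, 1], [4]]
```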

(8) Load the map data MP.txt into relation mp; relation h is the output produced from mp

grunt> mp = load '/test/MP.txt' as (m:map[]);
grunt> describe mp;
mp: {m: map[]}
grunt> h = foreach mp generate m#'Pig';
grunt> describe h;
h: {bytearray}

grunt> dump h;
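`m#'Pig'` looks up the value stored under key 'Pig' in the map field. MP.txt is not shown, so the sample line below is a guess written in Pig's `[key#value,...]` map-literal style:

```python
# Sketch of: h = foreach mp generate m#'Pig';
line = "[Pig#Apache,Hive#SQL]"          # hypothetical MP.txt line
entries = line.strip("[]").split(",")
m = dict(e.split("#", 1) for e in entries)
h = m.get("Pig")                        # m#'Pig' returns the value for key 'Pig'
print(h)  # Apache
```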

Bug

If the dump fails because HDFS is still in safe mode, leave safe mode manually:

bin/hadoop dfsadmin -safemode leave
//run from the Hadoop installation directory
//if the environment variables are configured, the command can be run directly:
hadoop dfsadmin -safemode leave

After leaving safe mode, the dump completes normally.

Dataset operations

(1) Load the data

grunt> a = load '/test/A.txt' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load '/test/B.txt' using PigStorage(',') as (b1:int, b2:int, b3:int);

(2) Union of a and b

grunt> c = union a, b;
grunt> dump c;

(3) Split c into d and e, where d holds the rows whose first column is 0 and e the rows whose first column is 1 ($0 denotes the first column of the relation)

grunt> split c into d if $0 == 0, e if $0 == 1;
grunt> dump d;
grunt> dump e;
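The union and split steps above can be sketched together in Python; the sample rows are hypothetical:

```python
# Sketch of: c = union a, b;  split c into d if $0==0, e if $0==1;
a = [(0, 1, 2), (1, 3, 4)]   # hypothetical rows
b = [(0, 5, 6), (1, 7, 8)]

c = a + b                          # union keeps duplicates; no ordering guarantee
d = [t for t in c if t[0] == 0]    # rows whose first column ($0) is 0
e = [t for t in c if t[0] == 1]    # rows whose first column ($0) is 1
print(d)  # [(0, 1, 2), (0, 5, 6)]
print(e)  # [(1, 3, 4), (1, 7, 8)]
```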

(4) Select a subset of c

grunt> f = filter c by $1 > 3;
grunt> dump f;

(5) Group the data

grunt> g = group c by $2;
grunt> dump g;

(6) Gather all the elements into a single group

grunt> h = group c all;
grunt> dump h;

(7) Count the elements in h

grunt> i = foreach h generate COUNT($1);
grunt> dump i;
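`group ... all` puts every tuple into one group, and COUNT then sizes that bag. A sketch over hypothetical rows:

```python
# Sketch of: h = group c all;  i = foreach h generate COUNT($1);
c = [(0, 1, 2), (1, 3, 4), (0, 5, 6), (1, 7, 8)]  # hypothetical rows
h = [("all", c)]                    # one group holding every tuple
i = [len(bag) for _key, bag in h]   # COUNT over the bag
print(i)  # [4]
```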

(8) Join the two tables on a.$2 == b.$2

grunt> j = join a by $2, b by $2;
grunt> dump j;
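`join` here is an inner join on the third column, concatenating the matching tuples. A sketch with hypothetical rows:

```python
# Sketch of: j = join a by $2, b by $2;
a = [(0, 1, 2), (1, 3, 4)]   # hypothetical rows
b = [(5, 6, 2), (7, 8, 4)]
# keep pairs whose third column ($2) matches, concatenating the tuples
j = [ta + tb for ta in a for tb in b if ta[2] == tb[2]]
print(j)  # [(0, 1, 2, 5, 6, 2), (1, 3, 4, 7, 8, 4)]
```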

(9) Relation k outputs c's $1 and $1 * $2

grunt> k = foreach c generate $1, $1 * $2;
grunt> dump k;
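This last `foreach` just projects one column and a derived arithmetic column per row; sketched with hypothetical rows:

```python
# Sketch of: k = foreach c generate $1, $1 * $2;
c = [(0, 1, 2), (1, 3, 4)]  # hypothetical rows
k = [(t[1], t[1] * t[2]) for t in c]
print(k)  # [(1, 2), (3, 12)]
```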

pig安装应用_desktop_25