COGROUP/GROUP |
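Groups the data in one or more relations by a key field; tuples with the same group key are collected together. GROUP and COGROUP are the same operator: by convention, GROUP is used with one relation and COGROUP with two or more. |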
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age; |
CROSS |
Computes the cross product of two or more relations. |
Input A: (1,2,3), (4,2,1)
Input B: (2,4), (8,9), (1,3)
X = CROSS A, B;
DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3) |
DEFINE |
Assigns an alias to a UDF or streaming command. |
DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using PigStreaming(',')) output(stdout using PigStreaming(','));
A = LOAD 'file';
B = STREAM A THROUGH CMD; |
DISTINCT |
Removes duplicate tuples in a relation. |
Input A: (8,3,4), (1,2,3), (4,3,3), (4,3,3), (1,2,3)
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4) |
FILTER |
Selects tuples from a relation based on some condition. |
Input A: (1,2,3), (4,5,6), (7,8,9), (4,3,3), (8,4,3)
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3) |
FOREACH |
Generates a new relation by applying the specified expressions to each row. |
Input A: (1,2,3), (4,2,5), (8,3,6)
X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(8,3) |
IMPORT |
Import macros defined in a separate file. |
/* myscript.pig */
IMPORT 'my_macro.pig'; |
JOIN |
Performs an inner join of two or more relations based on common field values. |
Input A: (1,2), (4,5)
Input B: (1,3), (1,2), (4,7)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,1,3)
(1,2,1,2)
(4,5,4,7) |
LOAD |
Loads data from the file system. |
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int); |
MAPREDUCE |
Executes native MapReduce jobs inside a Pig script. |
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; |
ORDER BY |
Sorts a relation based on one or more fields. |
A = LOAD 'mydata' AS (x: int, y: map[]);
B = ORDER A BY x; |
SAMPLE |
Selects a random sample of a relation based on the stated sample size. |
Relation X will contain roughly 1% of the data in relation A; sampling is probabilistic, so the exact number of tuples can vary between runs.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
X = SAMPLE A 0.01; |
SPLIT |
Partitions a relation into two or more relations based on some expression. |
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null); |
STORE |
Stores or saves results to the file system. |
STORE A INTO 'myoutput' USING PigStorage('*');
1*2*3
4*2*1 |
STREAM |
Sends data to an external script or program. |
A = LOAD 'data';
B = STREAM A THROUGH `stream.pl -n 5`; |
UNION |
Computes the union of two or more relations. (Does not preserve the order of tuples) |
Input A: (1,2,3), (4,2,1)
Input B: (2,4), (8,9), (1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3) |
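
The sample outputs in the table can be checked by hand with a small model of the operators. The sketch below is plain Python, not Pig Latin: relations are modelled as lists of tuples, and the helper names (`cross`, `union`, `distinct`, `filter_by`, `join`) are hypothetical, chosen only for this illustration. It assumes fields are addressed by zero-based position, so `a1` and `b1` correspond to index 0 and `f3` to index 2.

```python
from itertools import product

def cross(left, right):
    """CROSS: pair every tuple of `left` with every tuple of `right`."""
    return [l + r for l, r in product(left, right)]

def union(left, right):
    """UNION: all tuples of both relations (Pig does not guarantee this order)."""
    return left + right

def distinct(rel):
    """DISTINCT: drop duplicate tuples; sorted only to make the result stable."""
    return sorted(set(rel))

def filter_by(rel, predicate):
    """FILTER: keep the tuples for which `predicate` is true."""
    return [t for t in rel if predicate(t)]

def join(left, lkey, right, rkey):
    """Inner JOIN: concatenate tuples whose key fields are equal."""
    return [l + r for l in left for r in right if l[lkey] == r[rkey]]

# Sample relations copied from the CROSS and UNION rows.
A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9), (1, 3)]

print(cross(A, B))   # six tuples, as in the CROSS row
print(union(A, B))   # five tuples, as in the UNION row

# Sample relation from the FILTER row: keep tuples whose third field is 3.
F = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (4, 3, 3), (8, 4, 3)]
print(filter_by(F, lambda t: t[2] == 3))

# Sample relations from the JOIN row, joined on their first fields.
print(join([(1, 2), (4, 5)], 0, [(1, 3), (1, 2), (4, 7)], 0))
```

The model ignores everything that makes Pig useful at scale (lazy evaluation, schemas, distributed execution); it only mirrors the relational semantics so that each expected result can be reproduced on the small inputs shown.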