[pig] set pig.splitCombination false
Myungchul Shin edited this page May 19, 2016
- Suppose the input is one big file (e.g. 1 GB), the number of mappers is smaller than expected, and the processing is CPU-bound (e.g. morphological analysis) rather than I/O-bound.
- In that case, split the input file into the expected number of pieces and upload those files to HDFS.
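split_num=20
# round the lines-per-file up so that the remainder does not produce an extra (21st) file
split -l $(( ($(wc -l < item.txt) + split_num - 1) / split_num )) item.txt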
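The split step can also be wrapped in a small helper. The following is only a sketch: `split_evenly`, the `part_` prefix, the temporary demo directory, and the HDFS target path in the trailing comment are illustrative names, not part of the original setup. It rounds the lines-per-piece up, so a line count that is not a multiple of `split_num` does not spill into an extra small file.

```shell
#!/bin/sh
# Sketch: split a line-oriented file into at most N pieces (ceiling division).
# split_evenly and the "part_" prefix are hypothetical names for illustration.
split_evenly() {
    input=$1
    chunks=$2
    out_dir=$3
    total=$(wc -l < "$input")
    # Ceiling division: lines per piece, rounded up.
    per_chunk=$(( (total + chunks - 1) / chunks ))
    # Guard against an empty input, since split rejects a line count of 0.
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    mkdir -p "$out_dir"
    split -l "$per_chunk" "$input" "$out_dir/part_"
}

# Demo on a generated 1003-line file (not a multiple of 20 on purpose).
demo_dir=$(mktemp -d)
seq 1 1003 > "$demo_dir/item.txt"
split_evenly "$demo_dir/item.txt" 20 "$demo_dir/parts"
ls "$demo_dir/parts" | wc -l

# Upload step, assuming a Hadoop client is installed and /user/me/input is
# the (hypothetical) directory later passed to the Pig script as $input:
#   hdfs dfs -mkdir -p /user/me/input
#   hdfs dfs -put "$demo_dir"/parts/part_* /user/me/input/
```

With floor division the same 1003-line file would be cut into 50-line pieces and leave a 21st file holding the last 3 lines, which would cost one more, nearly idle mapper.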
- Then write the Pig script as below:
define proc1 `$progname1 -p '$dicpath'` ship('$progpath1','$sopath1','$sopath2','$sopath3');
define proc2 `$progname2` ship('$progpath2');
-- set mapreduce.map.memory.mb 3072
-- set mapreduce.map.java.opts -Xmx2048m
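-- do not combine small input files into one split, so that each uploaded piece gets its own mapper
set pig.splitCombination false;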
A = LOAD '$input' USING PigStorage('\t');
B = STREAM A THROUGH proc1;
C = GROUP B BY $0 PARALLEL 10;
D = FOREACH C GENERATE FLATTEN($1);
E = STREAM D THROUGH proc2;
STORE E INTO '$output' USING PigStorage('\t');
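The script above relies on parameter substitution (`$input`, `$output`, `$progname1`, and so on), which Pig fills in from `-param name=value` options on the command line. A minimal sketch of assembling that command follows; `build_pig_cmd`, the script name `process.pig`, and every value shown are hypothetical placeholders, and the helper assumes the values contain no spaces.

```shell
#!/bin/sh
# Sketch: build a "pig -param ..." command line for the script above.
# build_pig_cmd, process.pig, and all values below are hypothetical.
build_pig_cmd() {
    script=$1
    shift
    cmd="pig"
    # Each remaining argument is a name=value pair for one $parameter.
    for kv in "$@"; do
        cmd="$cmd -param $kv"
    done
    printf '%s -f %s\n' "$cmd" "$script"
}

pig_cmd=$(build_pig_cmd process.pig \
    input=/user/me/input \
    output=/user/me/output \
    progname1=analyzer \
    dicpath=dic \
    progpath1=./bin/analyzer \
    sopath1=./lib/liba.so \
    sopath2=./lib/libb.so \
    sopath3=./lib/libc.so \
    progname2=postproc \
    progpath2=./bin/postproc)

# Dry run: print the command instead of executing it.
echo "$pig_cmd"
```

Printing the command first makes it easy to check that every parameter used in the script has a value before submitting the job.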