
[pig] set pig.splitCombination false

Myungchul Shin edited this page May 19, 2016 · 4 revisions
  • Problem: we have one big input file (e.g., 1 GB), but the job runs with fewer mappers than expected, and the processing is CPU-bound (e.g., morphological analysis) rather than I/O-bound.
  • Fix: split the input file into the expected number of pieces and upload those pieces to HDFS.
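split_num=20
# round up, so a remainder does not spill into an extra (split_num + 1)th file
lines=$(wc -l < item.txt)
split -l $(( (lines + split_num - 1) / split_num )) item.txt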
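The split and upload steps can be wrapped into two small helpers. This is only a sketch: the function names `split_for_mappers` and `upload_chunks`, the `part_` prefix, and the directory arguments are made up for illustration, and it assumes the `hdfs` command is on the PATH.

```shell
# split_for_mappers FILE NUM OUTDIR
# Hypothetical helper: split FILE by lines into at most NUM chunks under OUTDIR.
split_for_mappers() {
    file=$1; num=$2; outdir=$3
    total=$(wc -l < "$file")
    # Ceiling division keeps the chunk count at or below NUM.
    per_chunk=$(( (total + num - 1) / num ))
    # Guard against an empty input, where the division yields 0.
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    mkdir -p "$outdir"
    split -l "$per_chunk" "$file" "$outdir/part_"
}

# upload_chunks OUTDIR HDFS_DIR
# Hypothetical helper: copy every chunk into an HDFS directory.
upload_chunks() {
    hdfs dfs -mkdir -p "$2" && hdfs dfs -put "$1"/part_* "$2"/
}

# Usage (paths are placeholders):
#   split_for_mappers item.txt 20 ./chunks
#   upload_chunks ./chunks /user/me/item_chunks
```

Splitting by line count rather than by bytes keeps every record whole, which matters because the streaming program reads its input line by line.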
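  • write the Pig script as below, with pig.splitCombination set to false so that Pig does not merge the small files back into fewer input splits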
define proc1 `$progname1 -p '$dicpath'` ship('$progpath1','$sopath1','$sopath2','$sopath3');
define proc2 `$progname2` ship('$progpath2');

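-- optional memory settings for the map tasks (left disabled here)
-- set mapreduce.map.memory.mb 3072
-- set mapreduce.map.java.opts -Xmx2048m
-- do not combine small input files into one split; each uploaded file gets its own mapper
set pig.splitCombination false;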

A = LOAD '$input' USING PigStorage('\t');
B = STREAM A THROUGH proc1;

C = GROUP B BY $0 PARALLEL 10;
D = FOREACH C GENERATE FLATTEN($1);
E = STREAM D THROUGH proc2;

STORE E INTO '$output' USING PigStorage('\t');
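The script takes all of its paths and program names as parameters, so it has to be launched with matching `-param` options. This is a sketch of one way to do that: the wrapper name `run_pig_job` is hypothetical, and the values are read from environment variables (`PROGNAME1`, `DICPATH`, and so on) that are assumed to be set beforehand, since the original page does not give them.

```shell
# run_pig_job SCRIPT INPUT OUTPUT
# Hypothetical wrapper: passes every parameter the script references.
# The environment variables are assumptions; set them to your own paths.
run_pig_job() {
    pig \
        -param input="$2" \
        -param output="$3" \
        -param progname1="$PROGNAME1" \
        -param dicpath="$DICPATH" \
        -param progpath1="$PROGPATH1" \
        -param sopath1="$SOPATH1" \
        -param sopath2="$SOPATH2" \
        -param sopath3="$SOPATH3" \
        -param progname2="$PROGNAME2" \
        -param progpath2="$PROGPATH2" \
        -f "$1"
}

# Usage (script name and paths are placeholders):
#   run_pig_job job.pig /user/me/item_chunks /user/me/item_out
```

Pointing `input` at the HDFS directory that holds the split files makes `LOAD` read all of them, and with split combination disabled each file is handled by its own mapper.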