Version 24 (modified by waue, 15 years ago)

Hadoop Streaming

Stream Example 1: Using Shell
Stream Example 2: Using PHP
Python Implementation

Hadoop Streaming is a utility that ships with Hadoop. It lets users create and run a special kind of map/reduce job in which any executable or script acts as the mapper or the reducer.

Usage:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input $INPUT -output $OUTPUT -mapper $MAPPER -reducer $REDUCER

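Breakdown of the command:

bin/hadoop    Invokes the hadoop program.
jar contrib/streaming/hadoop-0.20.2-streaming.jar    Selects the streaming feature. Note: by default this jar file is located in contrib/streaming/.
-input $INPUT    Sets the input directory on HDFS. Note: the data must be uploaded to HDFS first.
-output $OUTPUT    Sets the output directory on HDFS. Note: the output directory must not already exist on HDFS.
-mapper $MAPPER    Sets the mapper program. Note: give the full path.
-reducer $REDUCER    Sets the reducer program. Note: give the full path.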
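
To make the role of each placeholder concrete, here is a minimal sketch that assembles the command line from its four arguments and prints it for review. The function name `streaming_cmd` is hypothetical and not part of Hadoop; the jar path is the one used on this page.

```shell
# Hypothetical helper (not part of Hadoop): print the streaming command
# for a given input directory, output directory, mapper and reducer.
streaming_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: streaming_cmd INPUT OUTPUT MAPPER REDUCER" >&2
    return 1
  fi
  printf 'bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -input %s -output %s -mapper %s -reducer %s\n' \
    "$1" "$2" "$3" "$4"
}

# Example values taken from Stream Example 1 on this page
streaming_cmd lab4_input stream-out1 /bin/cat /usr/bin/wc
```

The helper only prints the command, so it can be tried on a machine without Hadoop installed.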

Stream Example 1: Using Shell

This example uses cat as the mapper and wc as the reducer.

Run it as follows:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input lab4_input -output stream-out1 -mapper /bin/cat -reducer /usr/bin/wc

The output is:

$ bin/hadoop fs -cat stream-out1/part-00000

Lines	Words	Characters
2	15	80
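
The data flow of this job can be approximated on a single machine with an ordinary pipeline: mapper, then sort (standing in for the shuffle phase), then reducer. The sketch below is only a local stand-in, not how Hadoop runs the job internally. The sample text and the function names `sample_input` and `simulate_job` are made up for illustration, and `cat` and `wc` are looked up on the PATH rather than by full path.

```shell
# Hypothetical sample input: two short lines of text
sample_input() {
  printf 'hello hadoop streaming\nhello world\n'
}

# Local stand-in for the streaming job:
#   cat  plays the mapper (passes every line through unchanged)
#   sort plays the shuffle phase (groups identical keys together)
#   wc   plays the reducer (counts lines, words and bytes)
simulate_job() {
  sample_input | cat | sort | wc
}

# Prints the three counts: 2 lines, 5 words, 35 bytes
simulate_job
```

The byte count of a real job may differ from the local one, since the streaming framework can add a key/value separator to the lines it passes to the reducer.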

Stream Example 2: Using PHP

Adapted from the Hadoop Taiwan User Group.

Install the PHP command-line interpreter:

$ cd /opt/hadoop/
$ sudo apt-get install php5-cli

Edit the mapper PHP program:
```
$ gedit mapper.php
```

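Contents:

#!/usr/bin/php
<?php

$word2count = array();

// Read input line by line from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
   // Trim surrounding whitespace and convert to lowercase
   $line = strtolower(trim($line));
   // Split the line into words on non-word characters
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // Increment the count for each word, initializing it on first sight
   foreach ($words as $word) {
       if (!isset($word2count[$word])) {
           $word2count[$word] = 0;
       }
       $word2count[$word] += 1;
   }
}

// Write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
   // Emit: word, tab character, count, end-of-line
   echo $word, chr(9), $count, PHP_EOL;
}
?>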

Edit the reducer PHP program:
```
$ gedit reducer.php
```

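Contents:

#!/usr/bin/php
<?php

$word2count = array();

// Read input line by line from STDIN
while (($line = fgets(STDIN)) !== false) {
    // Strip surrounding whitespace
    $line = trim($line);
    // Each line has the form (word, tab, count); skip lines without a tab
    $parts = explode(chr(9), $line);
    if (count($parts) < 2) continue;
    list($word, $count) = $parts;
    // Convert the count from string to int
    $count = intval($count);
    // Accumulate the total, initializing it on first sight
    if ($count > 0) {
        if (!isset($word2count[$word])) $word2count[$word] = 0;
        $word2count[$word] += $count;
    }
}

// Optional: sort by word so the output is easier to read
ksort($word2count);

// Write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}

?>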

Make the scripts executable:

$ chmod 755 *.php

Test locally that the scripts work:

$ echo "i love hadoop, hadoop love u" | ./mapper.php | ./reducer.php
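
For machines without PHP, the same word count can be sketched with standard shell tools. This is not the page's PHP code, only a rough equivalent: the function names `wc_mapper` and `wc_reducer` are hypothetical, and unlike the PHP mapper, this mapper emits one "word, tab, 1" line per occurrence instead of pre-aggregated counts. The `sort` between the two stages stands in for the shuffle phase of a real job.

```shell
# Hypothetical mapper: lowercase the input, split it on non-word
# characters, and emit "word<TAB>1" for every word found.
wc_mapper() {
  tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alnum:]_' '\n' \
    | awk 'NF { print $0 "\t1" }'
}

# Hypothetical reducer: sum the counts per word, then sort by word
# so the output order is stable.
wc_reducer() {
  awk -F '\t' '{ count[$1] += $2 }
       END { for (w in count) print w "\t" count[w] }' \
    | LC_ALL=C sort
}

# Same test sentence as above; sort stands in for the shuffle phase
echo "i love hadoop, hadoop love u" | wc_mapper | LC_ALL=C sort | wc_reducer
```

This prints four tab-separated lines: hadoop 2, i 1, love 2, u 1.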

Run the job:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar  \
-mapper /opt/hadoop/mapper.php -reducer /opt/hadoop/reducer.php -input lab4_input -output stream_out2

Check the results:

$ bin/hadoop dfs -cat stream_out2/part-00000
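
The result file holds one "word, tab, count" line per word, sorted by word. As an optional follow-up, the sketch below ranks such lines by count instead. The function name `rank_by_count` is hypothetical, and the sample lines are made up so the snippet runs without a cluster.

```shell
# Hypothetical helper: order "word<TAB>count" lines by count
# (highest first), breaking ties alphabetically by word.
rank_by_count() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# On a cluster the input would come from the result file, for example:
#   bin/hadoop dfs -cat stream_out2/part-00000 | rank_by_count
# Here, made-up sample lines are used instead.
printf 'hadoop\t2\ni\t1\nlove\t2\nu\t1\n' | rank_by_count
```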

Python Implementation

Hadoop Example Program from Brandeis University
