[[PageOutline]]

◢ <[wiki:NTUT160220/Lab4 Lab 4]> | <[wiki:NTUT160220 Course Outline]> ▲ | <[wiki:NTUT160220/Lab6 Lab 6]> ◣

= Lab 5 =

{{{
#!html
<div style="text-align: center;"><big style="font-weight: bold;"><big>Running Basic MapReduce Computations in Local Mode</big></big></div>
}}}

{{{
#!text
Please run the following exercises in your local Hadoop4Win environment.
}}}

== Example 1: Word Count (WordCount) ==

 * STEP 1 : Practice the MapReduce job-submission command: __'''hadoop jar <local jar file> <class name> <parameters>'''__
                  
{{{
Jazz@human ~
$ cd /opt/hadoop/

Jazz@human /opt/hadoop
$ hadoop jar hadoop-*-examples.jar wordcount input output
11/10/21 14:08:58 INFO input.FileInputFormat: Total input paths to process : 12
11/10/21 14:09:00 INFO mapred.JobClient: Running job: job_201110211130_0001
11/10/21 14:09:01 INFO mapred.JobClient:  map 0% reduce 0%
11/10/21 14:09:31 INFO mapred.JobClient:  map 16% reduce 0%
11/10/21 14:10:29 INFO mapred.JobClient:  map 100% reduce 27%
11/10/21 14:10:33 INFO mapred.JobClient:  map 100% reduce 100%
11/10/21 14:10:35 INFO mapred.JobClient: Job complete: job_201110211130_0001
11/10/21 14:10:35 INFO mapred.JobClient: Counters: 17
11/10/21 14:10:35 INFO mapred.JobClient:   Job Counters
11/10/21 14:10:35 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/21 14:10:35 INFO mapred.JobClient:     Launched map tasks=12
11/10/21 14:10:35 INFO mapred.JobClient:     Data-local map tasks=12
11/10/21 14:10:35 INFO mapred.JobClient:   FileSystemCounters
11/10/21 14:10:35 INFO mapred.JobClient:     FILE_BYTES_READ=16578
11/10/21 14:10:35 INFO mapred.JobClient:     HDFS_BYTES_READ=18312
11/10/21 14:10:35 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=32636
11/10/21 14:10:35 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10922
11/10/21 14:10:35 INFO mapred.JobClient:   Map-Reduce Framework
11/10/21 14:10:35 INFO mapred.JobClient:     Reduce input groups=592
11/10/21 14:10:35 INFO mapred.JobClient:     Combine output records=750
11/10/21 14:10:35 INFO mapred.JobClient:     Map input records=553
11/10/21 14:10:35 INFO mapred.JobClient:     Reduce shuffle bytes=15674
11/10/21 14:10:35 INFO mapred.JobClient:     Reduce output records=592
11/10/21 14:10:35 INFO mapred.JobClient:     Spilled Records=1500
11/10/21 14:10:35 INFO mapred.JobClient:     Map output bytes=24438
11/10/21 14:10:35 INFO mapred.JobClient:     Combine input records=1755
11/10/21 14:10:35 INFO mapred.JobClient:     Map output records=1755
11/10/21 14:10:35 INFO mapred.JobClient:     Reduce input records=750
}}}
   * [[BR]][[Image(Hadoop4Win:hadoop4win_14.jpg,width=600)]]
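The map/combine/reduce flow that the WordCount job above performs can be sketched as a small local Python simulation. This is only an illustration of the logic (the real example is Hadoop's Java implementation, and the input strings here are made up):

```python
from collections import Counter

def wordcount(documents):
    """Local simulation of WordCount: the map phase emits one count per
    whitespace-separated token; the reduce phase sums counts per word."""
    counts = Counter()
    for doc in documents:
        for line in doc.splitlines():
            # Counter.update() adds 1 per token, grouping by key like the reducer
            counts.update(line.split())
    return dict(counts)

print(wordcount(["hello hadoop\nhello world"]))
# → {'hello': 2, 'hadoop': 1, 'world': 1}
```

In the real job, the per-map-task partial sums are what the "Combine input/output records" counters in the log above are counting.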
                  

 * STEP 2 : Check how the running MapReduce job is progressing at http://localhost:50030
   * [[BR]][[Image(Hadoop4Win:hadoop4win_15.jpg,width=600)]]

 * STEP 3 : Use the HDFS command __'''hadoop fs -get <HDFS file/dir> <local file/dir>'''__ to download the results. Note that the output files are all named part-r-*****, and the job's runtime parameters are recorded in <HOSTNAME>_<TIME>_job_<JOBID>_0001_conf.xml; it is worth comparing the contents of that xml with the parameters in the hadoop config files.
                  
{{{
Jazz@human /opt/hadoop
$ hadoop fs -get output my_output

Jazz@human /opt/hadoop
$ ls -alR my_output
my_output:
total 12
drwxr-xr-x+  3 Jazz None     0 Oct 21 14:12 .
drwxr-xr-x+ 15 Jazz None     0 Oct 21 14:12 ..
drwxr-xr-x+  3 Jazz None     0 Oct 21 14:12 _logs
-rwxr-xr-x   1 Jazz None 10922 Oct 21 14:12 part-r-00000

my_output/_logs:
total 0
drwxr-xr-x+ 3 Jazz None 0 Oct 21 14:12 .
drwxr-xr-x+ 3 Jazz None 0 Oct 21 14:12 ..
drwxr-xr-x+ 2 Jazz None 0 Oct 21 14:12 history

my_output/_logs/history:
total 48
drwxr-xr-x+ 2 Jazz None     0 Oct 21 14:12 .
drwxr-xr-x+ 3 Jazz None     0 Oct 21 14:12 ..
-rwxr-xr-x  1 Jazz None 26004 Oct 21 14:12 localhost_1319167815125_job_201110211130_0001_Jazz_word+count
-rwxr-xr-x  1 Jazz None 16984 Oct 21 14:12 localhost_1319167815125_job_201110211130_0001_conf.xml
}}}
                  
   * [[BR]][[Image(Hadoop4Win:hadoop4win_22.jpg,width=600)]]
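The conf.xml saved alongside the job history is an ordinary Hadoop configuration dump made of `<property><name>/<value>` pairs. A short Python sketch shows one way to inspect it; the two-property sample below is hypothetical, not taken from an actual job:

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a job conf.xml, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>mapred.job.name</name><value>word count</value></property>
</configuration>"""

def parse_job_conf(xml_text):
    """Collect <property><name>/<value> pairs into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

print(parse_job_conf(SAMPLE))
# → {'mapred.reduce.tasks': '1', 'mapred.job.name': 'word count'}
```

Comparing the parsed names against the keys in your hadoop config files shows which settings the job actually ran with.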
                  

 * If you are more familiar with Windows, you can also open the downloaded results in Windows Explorer with cygstart my_output
{{{
Jazz@human /opt/hadoop
$ cygstart my_output
}}}

 * You can also view the results through the NameNode web interface
   * http://localhost:50075/browseDirectory.jsp?dir=/user&namenodeInfoPort=50070

== Example 2: Filtering Content with Regular Expressions (grep) ==

 * grep is a command for extracting specific text from documents. The grep job in the Hadoop examples extracts every string in the input files that matches a given regular expression and counts the matched strings.
                  
                          |   | 101 | {{{ | 
                  
                          |   | 102 | Jazz@human /opt/hadoop | 
                  
                          |   | 103 | $ hadoop jar hadoop-*-examples.jar  grep input lab5_out1 'dfs[a-z.]+' | 
                  
                          |   | 104 | }}} | 
                  
                          |   | 105 |  * 運作的畫面如下:[[BR]]You should see procedure like this:  | 
                  
{{{
Jazz@human /opt/hadoop
$ hadoop jar hadoop-*-examples.jar  grep input lab5_out1 'dfs[a-z.]+'
11/10/21 14:17:39 INFO mapred.FileInputFormat: Total input paths to process : 12

11/10/21 14:17:39 INFO mapred.JobClient: Running job: job_201110211130_0002
11/10/21 14:17:40 INFO mapred.JobClient:  map 0% reduce 0%
11/10/21 14:17:54 INFO mapred.JobClient:  map 8% reduce 0%
11/10/21 14:17:57 INFO mapred.JobClient:  map 16% reduce 0%
11/10/21 14:18:03 INFO mapred.JobClient:  map 33% reduce 0%
11/10/21 14:18:13 INFO mapred.JobClient:  map 41% reduce 0%
11/10/21 14:18:16 INFO mapred.JobClient:  map 50% reduce 11%
11/10/21 14:18:19 INFO mapred.JobClient:  map 58% reduce 11%
11/10/21 14:18:23 INFO mapred.JobClient:  map 66% reduce 11%
11/10/21 14:18:30 INFO mapred.JobClient:  map 83% reduce 16%
11/10/21 14:18:33 INFO mapred.JobClient:  map 83% reduce 22%
11/10/21 14:18:36 INFO mapred.JobClient:  map 91% reduce 22%
11/10/21 14:18:39 INFO mapred.JobClient:  map 100% reduce 22%
11/10/21 14:18:42 INFO mapred.JobClient:  map 100% reduce 27%
11/10/21 14:18:48 INFO mapred.JobClient:  map 100% reduce 30%
11/10/21 14:18:54 INFO mapred.JobClient:  map 100% reduce 100%
11/10/21 14:18:56 INFO mapred.JobClient: Job complete: job_201110211130_0002
11/10/21 14:18:56 INFO mapred.JobClient: Counters: 18
11/10/21 14:18:56 INFO mapred.JobClient:   Job Counters
11/10/21 14:18:56 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/21 14:18:56 INFO mapred.JobClient:     Launched map tasks=12
11/10/21 14:18:56 INFO mapred.JobClient:     Data-local map tasks=12
11/10/21 14:18:56 INFO mapred.JobClient:   FileSystemCounters
11/10/21 14:18:56 INFO mapred.JobClient:     FILE_BYTES_READ=888
11/10/21 14:18:56 INFO mapred.JobClient:     HDFS_BYTES_READ=18312
11/10/21 14:18:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1496
11/10/21 14:18:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=280
11/10/21 14:18:56 INFO mapred.JobClient:   Map-Reduce Framework
11/10/21 14:18:56 INFO mapred.JobClient:     Reduce input groups=7
11/10/21 14:18:56 INFO mapred.JobClient:     Combine output records=7
11/10/21 14:18:56 INFO mapred.JobClient:     Map input records=553
11/10/21 14:18:56 INFO mapred.JobClient:     Reduce shuffle bytes=224
11/10/21 14:18:56 INFO mapred.JobClient:     Reduce output records=7
11/10/21 14:18:56 INFO mapred.JobClient:     Spilled Records=14
11/10/21 14:18:56 INFO mapred.JobClient:     Map output bytes=193
11/10/21 14:18:56 INFO mapred.JobClient:     Map input bytes=18312
11/10/21 14:18:56 INFO mapred.JobClient:     Combine input records=10
11/10/21 14:18:56 INFO mapred.JobClient:     Map output records=10
11/10/21 14:18:56 INFO mapred.JobClient:     Reduce input records=7
11/10/21 14:18:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/21 14:18:57 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/21 14:18:57 INFO mapred.JobClient: Running job: job_201110211130_0003
( ... skip ... )
}}}
                  
 * Let's check the computed result of '''grep''' from HDFS:
 * This example finds, in every file under the input directory, all strings consisting of dfs followed by one or more characters from a-z or a dot
                  
{{{
Jazz@human /opt/hadoop
$ hadoop fs -ls lab5_out1
Found 2 items
drwxr-xr-x   - Jazz supergroup          0 2011-10-21 14:18 /user/Jazz/lab5_out1/_logs
-rw-r--r--   1 Jazz supergroup         96 2011-10-21 14:19 /user/Jazz/lab5_out1/part-00000

Jazz@human /opt/hadoop
$ hadoop fs -cat lab5_out1/part-00000
3       dfs.class
2       dfs.period
1       dfs.file
1       dfs.replication
1       dfs.servers
1       dfsadmin
1       dfsmetrics.log
}}}
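The grep example actually runs two chained jobs: the first counts every match of the regular expression, and the second sorts the results by count, producing the part-00000 shown above. That logic can be sketched locally in Python; the sample lines below are invented for illustration and are not the lab's input files:

```python
import re
from collections import Counter

def grep_count(lines, pattern):
    """First job: emit every regex match and sum counts per matched string.
    Second job: order the results by count, descending, like part-00000."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(regex.findall(line))  # findall returns full matches
    return counts.most_common()             # sorted by count, highest first

sample = ["dfs.replication=1", "run dfsadmin", "dfs.replication=3"]
print(grep_count(sample, r"dfs[a-z.]+"))
# → [('dfs.replication', 2), ('dfsadmin', 1)]
```

Note that the match is greedy: `dfs[a-z.]+` consumes letters and dots until it hits a character outside the class (here, `=` or end of line), which is why whole keys like dfs.replication appear in the output rather than just dfs plus one character.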