Changes between Version 18 and Version 19 of LogParser

Timestamp:: Jul 8, 2008, 10:03:10 AM (17 years ago)
Author:: waue
Comment:: --

LogParser

-                      v18
+                      v19
 $ bin/hadoop dfs -put /var/log/apache2/ apache-log
 }}}
 parameter "dir" in main contains the logs.
 you should filter the exception contents manually,
+Set the parameter "dir" in main to the directory that contains the logs.
+Manually filter out or delete exceptional entries such as the one below,
 {{{
 ex:  ::1 - - [29/Jun/2008:07:35:15 +0800] "GET / HTTP/1.0" 200 729 "...
+}}}
+::1 - - [29/Jun/2008:07:35:15 +0800] "GET / HTTP/1.0" 200 729 "...
+}}}
+Run it from Eclipse
 = Results =
 Run the following commands
 …
         hql > select * from apache-log;
 }}}
+Results
+{{{
+The original Apache log is as follows:
+{{{
+.170.101.250 - - [19/Jun/2008:23:21:12 +0800] "GET http://203.187.1.180/goldchun555/index.htm HTTP/1.1" 404 318 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
+ ... (skip)
+.65.93.58 - - [18/Jun/2008:06:54:57 +0800] "OPTIONS * HTTP/1.1" 400 300 "-" "-"
+}}}
+Results
  || Row || Column || Cell ||
  || 118.170.101.250 || http:agent || Mozilla/4.0 (compatible; ||
 …
  || 87.65.93.58 || http:protocol || HTTP/1.1 ||
 row(s) in set. (0.58 sec)
-}}}
  = !LogParserGo.java =
 …
 }}}
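 LogParserGo declares the following global variables and methods:
 HBaseConfiguration conf is the key configuration object; it defines many methods for setting or getting the values a map reduce program needs in order to run
+!HBaseConfiguration conf is the key configuration object; it defines many methods for setting or getting the values a map reduce program needs in order to run
 TABLE is defined as "table.name", where table.name is the name property
 String tableName is the name of the table
 HTable table is the variable used to operate on HBase
 class MapClass is an inner class that implements map
 Path[] listPaths lists the files and directories under a given path; the original method is already declared Deprecated in the 0.16 api, so it is reimplemented here to get rid of the warning
 void runMapReduce(String table, String dir) runs the MapReduce job
 void creatTable(String table)  creates the hbase table
+String !tableName is the name of the table
+!HTable table is the variable used to operate on HBase
+class !MapClass is an inner class that implements map
+Path[] !listPaths lists the files and directories under a given path; the original method is already declared Deprecated in the 0.16 api, so it is reimplemented here to get rid of the warning
+void !runMapReduce(String table, String dir) runs the MapReduce job
+void !creatTable(String table)  creates the hbase table
 void main(String[] args)  the main function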
 …
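 First, what does the main function actually do?[[br]]
 It declares the table name and the path in '''hdfs''' under which the files to be parsed are stored. Note that this path is an hdfs path; if a local file system path is given instead, the program fails at run time with a NullPointer Exception. It then calls creatTable, which creates the table, and then runs runMapReduce; the main body of the whole program is in runMapReduce
+It declares the table name and the path in '''hdfs''' under which the files to be parsed are stored. Note that this path is an hdfs path; if a local file system path is given instead, the program fails at run time with a !NullPointer Exception. It then calls !creatTable, which creates the table, and then runs runMapReduce; the main body of the whole program is in runMapReduce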
 ------------------------------------
 …
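 This inner class extends [http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/MapReduceBase.html org.apache.hadoop.mapred.MapReduceBase] and implements the Mapper<WritableComparable, Text, Text, Writable> interface.
 Not every map reduce program needs to implement this interface, but any work that map is meant to handle must be written in the following function:[[BR]]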
 map(WritableComparable key, Text value, OutputCollector<Text, Writable> output, Reporter reporter) [[BR]]
+map(!WritableComparable key, Text value,        !OutputCollector<Text, Writable> output, Reporter reporter) [[BR]]
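 The variable key is the row key in hbase and value is the value; output can write values into the hbase table through collect(). In this example, however,
 neither the output write path nor reporter is used.[[br]]
 Because this method performs IO, it must declare throws IOException and open its body with a try block.[[br]]
 First, LogParser log = new LogParser(value.toString()); value is one line of the content to be parsed. Because this runs on the hdfs-based map-reduce framework, hadoop takes care of combining the data for us, so the program logic only has to handle this single line. LogParser is introduced below; for now it is enough to know that the log object is the result of passing the raw value through LogParser. With the log object's methods getIP, getProtocol(), and so on, we can easily get the data we need, and use table.put( Row_Key , Column_Qualify_Name , Value) to store Value in the Column_Qualify_Name column of row Row_Key. Next, we look at the table object.[[br]]
+Because this method performs IO, it must declare throws !IOException and open its body with a try block.[[br]]
+First, LogParser log = new !LogParser(value.toString()); value is one line of the content to be parsed. Because this runs on the hdfs-based map-reduce framework, hadoop takes care of combining the data for us, so the program logic only has to handle this single line. LogParser is introduced below; for now it is enough to know that the log object is the result of passing the raw value through LogParser. With the log object's methods !getIP, !getProtocol(), and so on, we can easily get the data we need, and use table.put( Row_Key , Column_Qualify_Name , Value) to store Value in the Column_Qualify_Name column of row Row_Key. Next, we look at the table object.[[br]]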
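The per-line parsing step can be illustrated with a minimal Java sketch. The real LogParser source is not shown in this section, so the class `LogLineSketch`, its regular expression, and its field names are assumptions made for illustration, not the author's implementation. It assumes the Apache combined log format seen in the sample lines above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for LogParser: splits ONE combined-format log line
// into named fields. Class name, regex, and field names are assumptions.
class LogLineSketch {
    // host ident user [time] "method url protocol" status bytes "referrer" "agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" "
        + "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Returns the fields of one log line, or throws if the line does not match.
    static Map<String, String> parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a combined-format line: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ip", m.group(1));
        fields.put("time", m.group(4));
        fields.put("method", m.group(5));
        fields.put("url", m.group(6));
        fields.put("protocol", m.group(7));
        fields.put("code", m.group(8));
        fields.put("bytes", m.group(9));
        fields.put("referrer", m.group(10));
        fields.put("agent", m.group(11));
        return fields;
    }

    public static void main(String[] args) {
        String line = "87.65.93.58 - - [18/Jun/2008:06:54:57 +0800] "
            + "\"OPTIONS * HTTP/1.1\" 400 300 \"-\" \"-\"";
        Map<String, String> f = parse(line);
        // The ip would serve as the row key; the other fields become cells.
        System.out.println(f.get("ip") + " " + f.get("protocol"));
    }
}
```

Throwing on a non-matching line is one possible design choice; a mapper could equally skip such lines, which would be in the spirit of filtering exceptional entries before loading.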
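 table is one of the global variables and is defined by the [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HTable.html org.apache.hadoop.hbase.HTable] class. Creating an HTable object '''always requires''' two initial values: one is conf, the other global variable and the key configuration object; the other is tableName, the name of the table. Once the HTable object table has been created, we can use put to insert data. But how do we give a new table its row_key?
 That is what table.startUpdate(new Text(log.getIp())) does: it sets the ip as the table's row_key. For details, see [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HTable.html#startUpdate(org.apache.hadoop.io.Text) the official startUpdate documentation] [[br]]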
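To make the relationship between row key, column, and cell concrete, here is a toy in-memory model in Java. It is not the HBase API: `RowUpdateSketch` and its `startUpdate`, `put`, `commit`, and `get` methods are hypothetical names that only imitate the order of calls described above (choose a row key first, then put cells into that row). The `commit` step is an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: row key -> (column name -> cell value).
// This is NOT HTable; all names here are hypothetical.
class RowUpdateSketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    private final Map<String, String> pendingCells = new TreeMap<>();
    private String pendingRow;

    // Select the row that the following put() calls will write to.
    void startUpdate(String rowKey) {
        pendingRow = rowKey;
        pendingCells.clear();
    }

    // Buffer one cell for the row chosen by startUpdate().
    void put(String column, String value) {
        if (pendingRow == null) {
            throw new IllegalStateException("call startUpdate() first");
        }
        pendingCells.put(column, value);
    }

    // Make the buffered cells visible in the table (assumed step).
    void commit() {
        if (pendingRow == null) {
            throw new IllegalStateException("call startUpdate() first");
        }
        rows.computeIfAbsent(pendingRow, k -> new TreeMap<>()).putAll(pendingCells);
        pendingRow = null;
        pendingCells.clear();
    }

    // Read one cell back; returns null if the row or column is absent.
    String get(String rowKey, String column) {
        Map<String, String> cells = rows.get(rowKey);
        return cells == null ? null : cells.get(column);
    }

    public static void main(String[] args) {
        RowUpdateSketch table = new RowUpdateSketch();
        table.startUpdate("87.65.93.58");          // the ip becomes the row key
        table.put("http:protocol", "HTTP/1.1");    // column and cell value
        table.commit();
        System.out.println(table.get("87.65.93.58", "http:protocol"));
    }
}
```

The row, column, and cell used in `main` mirror one line of the result table shown earlier on this page.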