| Version 13 (modified by waue, 17 years ago) |
|---|
Purpose
This program parses Apache access logs and stores the parsed fields in HBase.
How to use
1 Upload the Apache logs ( /var/log/apache2/access.log* ) to HDFS (default: /user/waue/apache-log)
$ bin/hadoop dfs -put /var/log/apache2/ apache-log
2 The parameter "dir" in main names the directory that contains the logs.
3 Filter out exceptional lines manually before running the job,
ex: ::1 - - [29/Jun/2008:07:35:15 +0800] "GET / HTTP/1.0" 200 729 "...
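Step 3 can also be done with a small pre-filter instead of by hand. The following is only a sketch under the assumption that every valid line starts with a dotted IPv4 address followed by a space; the class name `LogLineFilter` and its methods are hypothetical and not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LogLineFilter {
    // Assumption: a usable line begins with four dot-separated numbers and a space.
    private static final Pattern STARTS_WITH_IPV4 =
            Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3} ");

    /** Returns true when the line starts with something shaped like an IPv4 address. */
    static boolean keep(String line) {
        return STARTS_WITH_IPV4.matcher(line).find();
    }

    /** Returns only the lines that pass keep(), in their original order. */
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            if (keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "::1 - - [29/Jun/2008:07:35:15 +0800] \"GET / HTTP/1.0\" 200 729 \"-\" \"-\"",
                "140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] \"GET /hbase-0.1.3.zip HTTP/1.0\" 200 10249801 \"-\" \"Wget/1.10.2\"");
        for (String line : filter(lines)) {
            System.out.println(line);
        }
    }
}
```

The filter only looks at the shape of the first token; range checks on each number are left to the parser itself.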
Result
1 Run the following command
hql > select * from apache-log;
2 Result
+-------------------------+-------------------------+-------------------------+
| Row                     | Column                  | Cell                    |
+-------------------------+-------------------------+-------------------------+
| 118.170.101.250         | http:agent              | Mozilla/4.0 (compatible;|
|                         |                         | MSIE 4.01; Windows 95)  |
..........(skip)........
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:method             | OPTIONS                 |
+-------------------------+-------------------------+-------------------------+
| 87.65.93.58             | http:protocol           | HTTP/1.1                |
31 row(s) in set. (0.58 sec)
LogParserGo.java
public static class MapClass extends MapReduceBase implements
        Mapper<WritableComparable, Text, Text, Writable> {
    // MapReduceBase.configure(JobConf job)
    // Default implementation that does nothing.
    @Override
    public void configure(JobConf job) {
        // String get(String name, String defaultValue)
        // Get the value of the name property. If no such property exists,
        // then defaultValue is returned.
        tableName = job.get(TABLE, "");
    }

    public void map(WritableComparable key, Text value,
            OutputCollector<Text, Writable> output, Reporter reporter)
            throws IOException {
        try {
            /*
            print(value.toString());
            FileWriter out = new FileWriter(new File(
                    "/home/waue/mr-result.txt"));
            out.write(value.toString());
            out.flush();
            out.close();
            */
            LogParser log = new LogParser(value.toString());
            if (table == null)
                table = new HTable(conf, new Text(tableName));
            // The client IP is the row key; each parsed field goes into its own column.
            long lockId = table.startUpdate(new Text(log.getIp()));
            table.put(lockId, new Text("http:protocol"), log.getProtocol()
                    .getBytes());
            table.put(lockId, new Text("http:method"), log.getMethod()
                    .getBytes());
            table.put(lockId, new Text("http:code"), log.getCode()
                    .getBytes());
            table.put(lockId, new Text("http:bytesize"), log.getByteSize()
                    .getBytes());
            table.put(lockId, new Text("http:agent"), log.getAgent()
                    .getBytes());
            table.put(lockId, new Text("url:" + log.getUrl()), log
                    .getReferrer().getBytes());
            table.put(lockId, new Text("referrer:" + log.getReferrer()),
                    log.getUrl().getBytes());
            table.commit(lockId, log.getTimestamp());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
LogParser.java
This Java file parses each line of the log file.
private String ip;
private String protocol;
private String method;
private String url;
private String code;
private String byteSize;
private String referrer;
private String agent;
private long timestamp;
private static Pattern p = Pattern
        .compile("([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\"" +
                " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");
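First, a java.util.regex.Pattern object is declared.
This class has no public constructor, so the object is created with the static method compile(String regex), which builds a pattern that follows the given regular expression. The API description reads:
Compiles the given regular expression into a pattern.
Passing the regular expression string as the argument yields the Pattern object p. The regular expression used here is:
([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*
Given the following sample Apache log line: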
140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "GET /hbase-0.1.3.zip HTTP/1.0" 200 10249801 "-" "Wget/1.10.2"
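the regular expression can be read as follows (expression, meaning of each part, matching text):
([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*
ip - - [time] "http" return-code length "referrer" "agent"
140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "GET /hbase-0.1.3.zip HTTP/1.0" 200 10249801 "-" "Wget/1.10.2"
Pattern can be thought of here as a prototype class: calling compile(expression) produces a template p that follows the rules of the expression.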
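To see the expression at work outside the MapReduce job, a minimal standalone sketch follows. The class name `ApacheLogRegexDemo` and the helper `fields` are hypothetical; the expression is the one from LogParser, except that the closing bracket inside the character class is written as `[^\\]]` for readability, which matches the same text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApacheLogRegexDemo {
    // Expression from LogParser; nine capturing groups in total.
    static final Pattern P = Pattern.compile(
            "([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\""
            + " ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\".*");

    /** Returns the captured groups in order, or null when the line does not match. */
    static String[] fields(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // group numbering starts at 1
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "140.110.138.176 - - [02/Jul/2008:16:55:02 +0800] "
                + "\"GET /hbase-0.1.3.zip HTTP/1.0\" 200 10249801 \"-\" \"Wget/1.10.2\"";
        String[] f = fields(line);
        for (int i = 0; i < f.length; i++) {
            System.out.println((i + 1) + ": " + f[i]);
        }
    }
}
```

Because `matches()` requires the whole line to fit the expression, a line with a missing bracket or quote yields `null` here rather than a partial result.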
public LogParser(String line) throws ParseException, Exception {
    Matcher matcher = p.matcher(line);
    if (matcher.matches()) {
        this.ip = matcher.group(1);
        // IP address of the client requesting the web page.
        if (isIpAddress(ip)) {
            SimpleDateFormat sdf = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
            this.timestamp = sdf.parse(matcher.group(4)).getTime();
            String[] http = matcher.group(5).split(" ");
            this.method = http[0];
            this.url = http[1];
            this.protocol = http[2];
            this.code = matcher.group(6);
            this.byteSize = matcher.group(7);
            this.referrer = matcher.group(8);
            this.agent = matcher.group(9);
        }
    }
}
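Next the constructor is defined. It declares a java.util.regex.Matcher, an object that works together with the Pattern above.
The template p declared earlier has a method matcher(String), which presses the material (the String) into the shape of the template; the pressed result is the object named matcher. To read the n-th segment afterwards, call matcher.group(n), which returns the content of that segment as a String.
Looking back at the line that was passed in:

| Group | Field | Value |
|---|---|---|
| 1 | ip | 140.110.138.176 |
| 2 | - | - |
| 3 | - | - |
| 4 | time | 02/Jul/2008:16:55:02 +0800 |
| 5 | "http" | GET /hbase-0.1.3.zip HTTP/1.0 |
| 6 | return code | 200 |
| 7 | length | 10249801 |
| 8 | "referrer" | - |
| 9 | "agent" | Wget/1.10.2 |

The rest is straightforward: each value is fetched with matcher.group(n) and assigned to the corresponding field through this.field. The code also compiles without this; using it inside a constructor is simply a habit when referring to the class's own fields (perhaps to tell them apart from fields inherited from a parent class?). The time needs a small conversion with SimpleDateFormat, and the HTTP request string is broken down further with split().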
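The two post-processing steps, converting the time and splitting the request, can be tried in isolation. This is a sketch only: the class `RequestFieldsDemo` and its method names are hypothetical, and the checked ParseException is wrapped in an unchecked exception purely to keep the example short.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class RequestFieldsDemo {
    /** Converts an Apache timestamp such as "02/Jul/2008:16:55:02 +0800" to epoch milliseconds. */
    static long toEpochMillis(String apacheTime) {
        // Locale.US makes the month abbreviation ("Jul") parse regardless of the system locale.
        SimpleDateFormat sdf = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
        try {
            return sdf.parse(apacheTime).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + apacheTime, e);
        }
    }

    /** Splits "METHOD URL PROTOCOL" on single spaces, as the constructor does. */
    static String[] splitRequest(String request) {
        return request.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("02/Jul/2008:16:55:02 +0800"));
        String[] http = splitRequest("GET /hbase-0.1.3.zip HTTP/1.0");
        System.out.println("method=" + http[0] + " url=" + http[1] + " protocol=" + http[2]);
    }
}
```

A request string with fewer than three parts would make `http[2]` throw ArrayIndexOutOfBoundsException, so a length check before indexing may be worth adding.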
    public static boolean isIpAddress(String inputString) {
        StringTokenizer tokenizer = new StringTokenizer(inputString, ".");
        if (tokenizer.countTokens() != 4) {
            return false;
        }
        try {
            for (int i = 0; i < 4; i++) {
                String t = tokenizer.nextToken();
                int chunk = Integer.parseInt(t);
                // Reject any value outside 0..255.
                if ((chunk & 255) != chunk) {
                    return false;
                }
            }
        } catch (NumberFormatException e) {
            return false;
        }
        // StringTokenizer skips empty tokens, so consecutive dots are checked separately.
        if (inputString.indexOf("..") >= 0) {
            return false;
        }
        return true;
    }
}
This function only checks whether the IP address is well-formed.
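A few sample inputs show which case each check in the function handles. The method body below is copied from LogParser so that the sketch compiles on its own; the wrapper class `IpCheckDemo` and the sample values are assumptions for illustration.

```java
import java.util.StringTokenizer;

class IpCheckDemo {
    // Copy of LogParser.isIpAddress, unchanged in behavior.
    static boolean isIpAddress(String inputString) {
        StringTokenizer tokenizer = new StringTokenizer(inputString, ".");
        if (tokenizer.countTokens() != 4) {
            return false; // not exactly four dot-separated parts
        }
        try {
            for (int i = 0; i < 4; i++) {
                int chunk = Integer.parseInt(tokenizer.nextToken());
                if ((chunk & 255) != chunk) {
                    return false; // part outside 0..255
                }
            }
        } catch (NumberFormatException e) {
            return false; // part is not a number
        }
        if (inputString.indexOf("..") >= 0) {
            return false; // empty part hidden by the tokenizer
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"140.110.138.176", "::1", "256.1.1.1", "1..2.3.4", "a.b.c.d"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIpAddress(s));
        }
    }
}
```

Since StringTokenizer also skips leading and trailing delimiters, an input such as `1.2.3.4.` still passes this check, which may or may not matter for log data.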
