Version 11 (modified by jazz, 16 years ago)
--

Applying Hypertable to the BioInfo Project

Background

Yang-Ming bioinformatics collaboration project

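Application Notes

The database is built using Yang-Ming's patient attribute classification data as the template.
- References:
  - Open question 1
  - Conversion program
- Import:
  Convert Yang-Ming's patient attribute classification data (XML format) into the TSV format that Hypertable can batch-load, then load it into Hypertable with HQL commands.
- Export:
  Use HQL commands to export a TSV file.
- Search:
  Hypertable currently supports only exact matching on the Rowkey and range matching on the Rowkey or Timestamp. Taking Yang-Ming's sample data as an example:
  Source is the cause of disease as diagnosed before the examination;
  Primary Site is the cause of disease as diagnosed after the examination.
  If Source is used as the Rowkey, searching for "Breast" returns all records whose Source is "Breast".
  If Primary Site is used as the Rowkey, searching for "Breast" returns all records whose Primary Site is "Breast".
  - NOTE: For cross matching, Hypertable currently does not seem to support searching by column family and column qualifier.
    One approach under consideration is to combine source and primary site into a single Rowkey,
    that is, (source)+(primary site) forms one Rowkey. For example, "Breast Colon" would find
    the records whose Source is Breast and whose Primary Site is Colon.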
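
The combined Rowkey idea in the note can be sketched as follows. This is only an in-memory illustration, not Hypertable client code: `make_rowkey`, `exact_match` and `prefix_range` are hypothetical helper names, a sorted Python list stands in for Hypertable's sorted Rowkeys, the sample values are made up, and the single-space separator is taken from the "Breast Colon" example (it assumes Source values contain no spaces).

```python
import bisect

SEPARATOR = " "  # assumption: taken from the "Breast Colon" example


def make_rowkey(source, primary_site):
    """Join Source and Primary Site into one combined Rowkey."""
    return source + SEPARATOR + primary_site


def exact_match(sorted_keys, source, primary_site):
    """Exact Rowkey match: this Source AND this Primary Site."""
    key = make_rowkey(source, primary_site)
    i = bisect.bisect_left(sorted_keys, key)
    return [key] if i < len(sorted_keys) and sorted_keys[i] == key else []


def prefix_range(sorted_keys, source):
    """Rowkey range match: every key whose Source part equals `source`."""
    start = source + SEPARATOR
    end = start + "\uffff"  # upper bound just past all keys with this prefix
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Stand-in for Rowkeys kept in sorted order (made-up sample values).
keys = sorted([
    make_rowkey("Breast", "Breast"),
    make_rowkey("Breast", "Colon"),
    make_rowkey("Colon", "Colon"),
    make_rowkey("Lung", "Breast"),
])

cross = exact_match(keys, "Breast", "Colon")
by_source = prefix_range(keys, "Breast")
```

With this key order, an exact match answers the cross query and a Rowkey range answers "all records with a given Source". A query by Primary Site alone is not covered by the same key; it would presumably need a second key with the two parts in the opposite order.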

Building the Table

Edit the table schema

$ vim gse.hql
CREATE TABLE GSEFamily (
  'Sample-iid',
  'Supplementary-Data',
  Description
);

Create the GSEFamily table
```
$ hypertable --batch < gse.hql
```
Generate GSEFamily.tsv with the conversion program (see TSV Format 1)
- Conversion program (Windows)
```
C:\> GSEXmlParser.exe -f tsv -i GSE2109_family.xml -o GSEFamily.tsv
```
Start hypertable in HQL command mode
```
$ hypertable
```

Import data into the GSEFamily table

hypertable>  load data infile "GSEFamily.tsv" into table GSEFamily;

Query data

hypertable> select * from GSEFamily;
hypertable> select "Supplementary-Data" from GSEFamily;
hypertable> select "Sample-iid" from GSEFamily;
hypertable> select "Description" from GSEFamily;

Delete data

hypertable> delete * from GSEFamily where ROW="Breast";

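TSV Format Description

Notation
- []: optional item, may be omitted
- (): required field
- "": literal string
- tab: tab character
- space: space character
- columnkey: column family

Format 1:
The data source may specify a Column Qualifier

The first line is the column header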

("#")[space]["timestamp"](tab)("rowkey")(tab)("columnkey")(tab)("value")

The second and following lines are the data

[timestamp(tab)](rowkey)(tab)(column family[:column qualifier])(tab)(value)
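
As an illustration of Format 1, the following sketch builds a header line and data lines by reading the notation above literally. It is not the conversion program referenced on this page: `format1_header` and `format1_row` are hypothetical helper names, the sample values are made up, and the exact header text expected by a given Hypertable version should be checked before relying on it.

```python
def format1_header(with_timestamp=False):
    """Header line: ("#")[space]["timestamp"](tab)("rowkey")(tab)("columnkey")(tab)("value")."""
    first = "# timestamp" if with_timestamp else "#"
    return "\t".join([first, "rowkey", "columnkey", "value"])


def format1_row(rowkey, family, value, qualifier=None, timestamp=None):
    """Data line: [timestamp(tab)](rowkey)(tab)(family[:qualifier])(tab)(value)."""
    column = family if qualifier is None else family + ":" + qualifier
    fields = [rowkey, column, value]
    if timestamp is not None:
        fields.insert(0, timestamp)
    for field in fields:
        # A tab or newline inside a field would break the TSV layout.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: %r" % field)
    return "\t".join(fields)


# Made-up sample cells for the GSEFamily table defined on this page.
lines = [
    format1_header(),
    format1_row("Breast", "Description", "example description"),
    format1_row("Breast", "Supplementary-Data", "example.dat", qualifier="file"),
]
tsv_text = "\n".join(lines) + "\n"
```

Each data line carries one cell, so a row with several columns appears as several lines that share the same Rowkey.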

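Format 2:
The data source does not include a Column Qualifier.
The Rowkey column and the Timestamp column can be specified with ROW_KEY_COLUMN and TIMESTAMP_COLUMN at load data infile time,
so the order and position of the Rowkey and Timestamp columns can be changed freely, provided that the position of each data value matches the column header.
- The first line is the column header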
```
("#")[space][timestamp](tab)(rowkey)(tab)(column1)(tab)(column2)...
```
- The second and following lines are the data
```
[timestamp](tab)(rowkey)(tab)(column1's value)(tab)(column2's value)...
```
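
The point that column positions may vary as long as the data matches the header can be sketched with a small reader. This only mimics the idea behind ROW_KEY_COLUMN and TIMESTAMP_COLUMN; it is not Hypertable's loader. `parse_format2` is a hypothetical helper, the default column names "rowkey" and "timestamp" are assumptions, and the sample lines are made up.

```python
def parse_format2(header, line, row_key_column="rowkey",
                  timestamp_column="timestamp"):
    """Split one Format 2 data line using the names found in the header.

    Returns (rowkey, timestamp or None, {column name: value}).
    """
    # Drop the leading "#" and the optional space before the first name.
    names = header.rstrip("\n").lstrip("#").lstrip(" ").split("\t")
    values = line.rstrip("\n").split("\t")
    if len(names) != len(values):
        raise ValueError("data line does not match the header")
    record = dict(zip(names, values))
    record.pop("", None)  # empty leading field when the header starts "#<tab>"
    rowkey = record.pop(row_key_column)
    timestamp = record.pop(timestamp_column, None)
    return rowkey, timestamp, record


# Timestamp first, then Rowkey, then the data columns.
header_a = "# timestamp\trowkey\tSample-iid\tDescription"
line_a = "2009-01-01 00:00:00\tBreast\tS1\texample description"

# Same data with the Rowkey column in a different position, no timestamp.
header_b = "# Sample-iid\trowkey\tDescription"
line_b = "S1\tBreast\texample description"

parsed_a = parse_format2(header_a, line_a)
parsed_b = parse_format2(header_b, line_b)
```

Both layouts yield the same Rowkey and column values, which is the property the Format 2 description relies on.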

Hadoop Integration
