[IR] Ch3 Modeling

March 19, 2015

Reference : Introduction to Information Retrieval Ch1

Unstructured data in 1650:
For example, among all of Shakespeare's plays, how do we find the plays that contain BRUTUS and CAESAR but not CALPURNIA?

GAIS: Professor 吳昇
http://zh.wikipedia.org/wiki/%E5%90%B3%E6%98%87

MapReduce
http://research.google.com/archive/mapreduce-osdi04-slides/index.html

NoSQL : not only SQL

inverted index: usually found at the very back of a book, as a word index. An observation: many popular-science books have one, but literary books do not. Why?
http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95
http://myir-note.blogspot.tw/2012/11/inverted-index.html
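In fact, an occurrence list is only one kind of inverted index.
Think of a Jin Yong (金庸) novel: how would you tell a reader where a given term can be looked up?
In a traditional chapter-based novel, what does "chapter N" actually mean? --> A scene!
Each chapter is like a scene, so "chapter N" means "the N-th scene".
This is completely different from a Chapter in a Java textbook.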
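
As a minimal sketch of the book-index idea above, the following Python code builds an inverted index that maps each word to the chapters where it occurs, together with the word positions inside each chapter. The chapter texts and the name `build_inverted_index` are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict


def build_inverted_index(chapters):
    """Map each word to {chapter number: [positions of the word in that chapter]}."""
    index = defaultdict(dict)
    for chapter_no, text in chapters.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(chapter_no, []).append(position)
    return dict(index)


# Hypothetical two-chapter "book", for illustration only.
chapters = {
    1: "the hero meets the master",
    2: "the master teaches the hero the sword",
}
index = build_inverted_index(chapters)
print(index["hero"])   # chapters and positions where "hero" occurs
print(index["sword"])
```

Dropping the position lists and keeping only the chapter numbers gives the kind of index printed at the back of a book.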

Boolean queries:
If the data we are looking for is recast as a search between sets, the problem becomes a Boolean query.
A hash table is currently the best solution for Boolean queries.
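
A minimal sketch of this idea, assuming a Python `dict` as the hash table and Python sets as the posting sets. The three toy documents and the name `build_postings` are invented for illustration; they are not the real plays.

```python
def build_postings(docs):
    """Hash table (dict) from term to the set of document names containing it."""
    postings = {}
    for name, text in docs.items():
        for term in set(text.upper().split()):
            postings.setdefault(term, set()).add(name)
    return postings


# Toy stand-ins for the plays; the contents are invented for illustration.
docs = {
    "play_a": "BRUTUS CAESAR CALPURNIA",
    "play_b": "BRUTUS CAESAR",
    "play_c": "CAESAR CLEOPATRA",
}
postings = build_postings(docs)

# BRUTUS AND CAESAR AND NOT CALPURNIA, as set intersection and difference.
empty = set()
result = (postings.get("BRUTUS", empty)
          & postings.get("CAESAR", empty)) - postings.get("CALPURNIA", empty)
print(sorted(result))
```

Each term lookup is a single hash-table access, and AND / NOT become set intersection and set difference.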

Modeling in IR is a complex process aimed at producing a ranking function

Ranking function: a function that assigns scores to documents with regard to a given query

IR systems usually adopt index terms to index and retrieve documents

More generally:

An index term is a word or group of consecutive words in a document

A pre-selected set of index terms can be used to summarize the document contents

If the extracted terms cannot be tied back to what the document is actually about, they are meaningless.

However, it might be interesting to assume that all words are index terms (full text representation)

That is, assume every word is an index term.
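
A small sketch of the full text representation, assuming whitespace tokenization and lowercasing (both simplifications). The helper `index_terms` and its `max_len` parameter are hypothetical names; with `max_len=2` it also yields groups of two consecutive words, matching the more general definition of an index term.

```python
def index_terms(text, max_len=2):
    """Return all words and groups of up to max_len consecutive words."""
    words = text.lower().split()
    terms = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms


terms = index_terms("information retrieval systems")
print(sorted(terms))
```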

Documents and queries can be represented by patterns of term co-occurrences

That is, whether or not two terms occur together.

The structure of a document strongly affects how likely a given piece of its text is to be read.

Boolean Model:

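The presence or absence of a term is turned into a Boolean value.

It can also be written as 1 and 0, which extends naturally to counting occurrence frequencies -> computing a weight.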

The similarity of the document dj to the query q is defined as
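
A sketch of this definition, assuming the standard textbook form of the Boolean model. Here $c(d_j)$, a symbol these notes do not define, is assumed to be the conjunctive component describing which index terms occur in $d_j$, and $q_{DNF}$ is the query rewritten in disjunctive normal form:

```latex
sim(d_j, q) =
\begin{cases}
1 & \text{if there is a conjunctive component } c(q) \text{ of } q_{DNF} \text{ with } c(q) = c(d_j) \\
0 & \text{otherwise}
\end{cases}
```

Under this definition a document is either a match (1) or not (0); there is no partial score.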

Here DNF stands for disjunctive normal form.

c(q) is a conjunctive component of the query.
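
A minimal Python sketch of how a Boolean-model similarity can be evaluated once the query is in DNF. It assumes each conjunctive component is written as a tuple of 0/1 flags over a fixed vocabulary; the names `conjunctive_component`, `boolean_sim`, and the terms `ka`, `kb`, `kc` are hypothetical.

```python
def conjunctive_component(doc_terms, vocabulary):
    """c(d): tuple of 0/1 flags saying which vocabulary terms the document contains."""
    return tuple(int(term in doc_terms) for term in vocabulary)


def boolean_sim(doc_terms, query_dnf, vocabulary):
    """Return 1 if some conjunctive component of the query equals c(d), else 0."""
    return int(conjunctive_component(doc_terms, vocabulary) in query_dnf)


vocabulary = ("ka", "kb", "kc")  # hypothetical index terms
# q = ka AND (kb OR NOT kc), written in full DNF over (ka, kb, kc):
query_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

print(boolean_sim({"ka", "kb"}, query_dnf, vocabulary))  # c(d) = (1, 1, 0), in the DNF
print(boolean_sim({"ka", "kc"}, query_dnf, vocabulary))  # c(d) = (1, 0, 1), not in the DNF
```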

p.s. [Recall digital logic circuits]

Conjunctive normal form: also called Product of Sums

e.g. (A+B'+C)(A'+B+C')

Disjunctive normal form: also called Sum of Products

e.g. (A'BC')+(AB'C)

The truth tables of these two examples are exactly opposite (complementary).
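
The complement claim can be checked by enumerating the truth table. A short sketch, where the function names `cnf` and `dnf` are chosen just for this example:

```python
from itertools import product


def cnf(a, b, c):
    # (A + B' + C)(A' + B + C')
    return (a or not b or c) and (not a or b or not c)


def dnf(a, b, c):
    # (A'BC') + (AB'C)
    return (not a and b and not c) or (a and not b and c)


rows = list(product([False, True], repeat=3))
for a, b, c in rows:
    # Columns: A, B, C, CNF value, DNF value
    print(int(a), int(b), int(c), int(cnf(a, b, c)), int(dnf(a, b, c)))

# True when the two columns disagree on every row, i.e. they are complements.
complementary = all(cnf(a, b, c) != dnf(a, b, c) for a, b, c in rows)
print(complementary)
```

The DNF example is true only for (A, B, C) = (0, 1, 0) and (1, 0, 1), which are exactly the two rows where the CNF example is false.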

Term Weighting

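Where can a weight come from?

frequency of occurrence: the number of times the term appears in a document

total frequency of occurrence Fi: the term's occurrences summed over all documents

For example: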
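
A small sketch of both quantities on a toy collection. The documents are invented for illustration; `f` holds the per-document counts and `F` the totals over the whole collection.

```python
from collections import Counter

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "to do is to be",
    "d2": "to be or not to be",
    "d3": "do be do",
}

# f[j][i]: number of times term i occurs in document j.
f = {name: Counter(text.split()) for name, text in docs.items()}

# F[i]: total number of occurrences of term i over all documents.
F = Counter()
for counts in f.values():
    F.update(counts)

print(f["d1"]["to"])
print(F["to"], F["be"], F["do"])
```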

Term-term Correlation Matrix

Start from the term-to-document matrix

and the document-to-term matrix (its transpose).

Multiplying these two matrices

gives a term-to-term correlation matrix.
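
A minimal sketch of that product in plain Python, assuming rows are terms, columns are documents, and entries are term frequencies. The numbers and the helper names `matmul` and `transpose` are made up for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


# Rows are terms, columns are documents; entries are made-up frequencies.
term_doc = [
    [1, 0, 2],  # term k1
    [0, 3, 1],  # term k2
]
doc_term = transpose(term_doc)

# (terms x docs) times (docs x terms) gives a terms x terms matrix.
term_term = matmul(term_doc, doc_term)
print(term_term)
```

The off-diagonal entry is nonzero only because both terms occur in the third document, which is how co-occurrence shows up in the matrix.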

TF-IDF Weights

Term frequency (TF)

Inverse document frequency (IDF)

Luhn Assumption. The value of wi,j is proportional to the term frequency fi,j
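
A sketch of one common TF-IDF variant, assuming the weight is the raw term frequency times log2(N / n_i), where N is the number of documents and n_i is the number of documents containing the term. These notes do not fix the exact formula, so this weighting, the function name `tf_idf`, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc: {term: weight}} with weight = f_ij * log2(N / n_i)."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n_docs = len(docs)
    # n_i: number of documents that contain term i.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        name: {t: f * math.log2(n_docs / doc_freq[t]) for t, f in c.items()}
        for name, c in counts.items()
    }


# Hypothetical toy collection, for illustration only.
docs = {"d1": "apple apple banana", "d2": "banana cherry"}
weights = tf_idf(docs)
print(weights["d1"])
```

A term that appears in every document gets weight 0 under this variant, since log2(N / N) = 0.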

Before the midterm, build a framework
that can handle:
Boolean Model
Vector Space Model

陳雲濤's blog
