Introduction to Information Retrieval: Reading Notes (1)

fuliang

Blog category:

Search Engine


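A very good book with very comprehensive coverage. I have been reading it for a long time and still have not finished. I just got through the first ten chapters and realized I had forgotten most of the earlier material, so I am coming back to write it down.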

Boolean Retrieval

1. Definition of information retrieval:

Academic definition:

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).

Categories:

Categories by scale:

web search, domain-specific search, personal information retrieval

Basic needs:

1. To process large document collections quickly.

2. To allow more flexible matching operations.

3. To allow ranked retrieval.

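Simple idea:

Build a term-document incidence matrix and combine the rows (one bit vector per term) with the bitwise logical operators AND, OR, and NOT. For the query Brutus AND Caesar AND NOT Calpurnia: 110100 AND 110111 AND 101111 = 100100, where 101111 is the complement of the Calpurnia row 010000.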
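The bit-vector arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the three vectors are the rows for Brutus, Caesar, and the complement of Calpurnia from the book's Shakespeare example, and the names `incidence`, `MASK`, and `bits` are made up for illustration.

```python
# One bit per document, leftmost bit = first document (6 documents assumed).
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps the result of NOT within 6 bits

# Assumed rows of the term-document incidence matrix.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

def bits(vector: int) -> str:
    """Render a bit vector as a fixed-width binary string."""
    return format(vector, f"0{N_DOCS}b")

# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ~incidence["Calpurnia"] & MASK
result = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia

print(bits(not_calpurnia))  # 101111
print(bits(result))         # 100100
```

The set bits of the result mark the documents that satisfy the query, here the first and the fourth.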

What is Boolean retrieval:

The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
Such queries effectively view each document as a set of words.

What a Boolean retrieval query looks like:

(Calpurnia AND Brutus) AND Caesar

How to assess an IR system:

Precision : What fraction of the returned results are relevant to the information
need?
Recall : What fraction of the relevant documents in the collection were returned
by the system?
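Both measures reduce to set arithmetic over document IDs. A minimal sketch, where the function name `precision_recall` and the sample ID sets are invented for illustration:

```python
def precision_recall(returned: set[int], relevant: set[int]) -> tuple[float, float]:
    """Compute (precision, recall) from sets of document IDs."""
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: the system returns 4 documents, 5 documents are relevant,
# and 2 documents are in both sets.
returned = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
p, r = precision_recall(returned, relevant)
print(p, r)  # 0.5 0.4
```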

Vector space model: easy to rank.

Term-document matrix: not scalable.

Inverted index: a dictionary and postings lists.

How to build an inverted index:

1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .

3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
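The four steps above can be sketched in Python as follows. This is a simplified illustration, not the book's implementation: `tokenize`, `normalize`, and `build_index` are hypothetical helpers, normalization here only lowercases (no stemming, so "friends" is not reduced to "friend"), and document 3 is an invented sentence added so that some terms occur in more than one document.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Step 2: split a document into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token: str) -> str:
    """Step 3 (simplified): lowercase only; a real system would also stem."""
    return token.lower()

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Step 4: map each term to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Step 1: collect the documents (document 3 is made up for illustration).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Caesar was ambitious, friends.",
}
index = build_index(docs)
print(index["caesar"])   # [2, 3]
print(index["friends"])  # [1, 3]
```

Keeping each postings list sorted by document ID is what makes the linear-time merge for AND queries possible.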

Processing Boolean queries:

AND operation:

Intersect two postings lists:

INTERSECT(p1, p2)
    answer ← ⟨ ⟩
    while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
          then ADD(answer, docID(p1))
               p1 ← next(p1)
               p2 ← next(p2)
          else if docID(p1) < docID(p2)
               then p1 ← next(p1)
               else p2 ← next(p2)
    return answer
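A runnable Python version of the same merge, as a sketch: it assumes each postings list is a Python list of document IDs sorted in increasing order, and uses list indices in place of the linked-list pointers in the pseudocode. The sample lists are illustrative.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms: keep it and advance both lists.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```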

Multiple-term AND operation:

Process terms in order of increasing document frequency:

If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.

INTERSECT(⟨t1, . . . , tn⟩)
    terms ← SORTBYINCREASINGFREQUENCY(⟨t1, . . . , tn⟩)
    result ← postings(first(terms))
    terms ← rest(terms)
    while terms ≠ NIL and result ≠ NIL
    do result ← INTERSECT(result, postings(first(terms)))
       terms ← rest(terms)
    return result
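The multi-term version can be sketched the same way. Assumptions: the index is a plain dict from term to sorted postings list, and a term's document frequency is the length of its postings list. The name `intersect_terms` and the sample index are illustrative only.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Two-list merge intersection over sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms: list[str], index: dict[str, list[int]]) -> list[int]:
    """AND together all terms, rarest term first."""
    if not terms:
        return []
    # Document frequency of a term = length of its postings list.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, index.get(term, []))
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Starting with the shortest list (here "calpurnia") means every later intersection works on at most four candidates.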

OR operation:

The idea is the n-way merge used in merge sort, similar to the AND operation.
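One way to sketch that n-way merge in Python is with the standard library's `heapq.merge`, which lazily merges already-sorted inputs. The function name `union` and the sample lists are invented for illustration; the postings lists are assumed to be sorted by document ID.

```python
import heapq

def union(*postings: list[int]) -> list[int]:
    """OR over any number of sorted postings lists via an n-way merge."""
    answer = []
    for doc_id in heapq.merge(*postings):
        # The merged stream is sorted, so duplicates are always adjacent.
        if not answer or answer[-1] != doc_id:
            answer.append(doc_id)
    return answer

print(union([1, 2, 4, 11], [2, 31, 54], [4, 5, 31]))
# [1, 2, 4, 5, 11, 31, 54]
```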

The extended Boolean model versus ranked retrieval:

Proximity operator:

A proximity operator is a way of specifying that two terms in a query must occur in a document close to each other, where closeness may be measured
by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
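A word-distance proximity check needs term positions, which a plain docID-only postings list does not store. The sketch below assumes a hypothetical positional index of the form term → {docID → sorted token positions}; `within_k`, `near`, and the sample data are all invented for illustration, and "close" is taken to mean that the two positions differ by at most k.

```python
def within_k(positions_a: list[int], positions_b: list[int], k: int) -> bool:
    """True if some occurrence of A and some occurrence of B differ by <= k positions."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance whichever occurrence comes earlier in the document.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def near(term_a: str, term_b: str, k: int,
         positional_index: dict[str, dict[int, list[int]]]) -> list[int]:
    """Documents in which term_a and term_b occur within k positions of each other."""
    docs_a = positional_index.get(term_a, {})
    docs_b = positional_index.get(term_b, {})
    common = docs_a.keys() & docs_b.keys()
    return sorted(d for d in common if within_k(docs_a[d], docs_b[d], k))

# Made-up positional index: term -> {docID: sorted token positions}.
positional_index = {
    "gates": {1: [3, 40], 2: [7]},
    "microsoft": {1: [5], 2: [90], 3: [1]},
}
print(near("gates", "microsoft", 3, positional_index))  # [1]
```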

What still needs to be done:

1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, giving more weight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need the term frequency information (the number of
times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.
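Items 3 and 4 can be illustrated together with a small sketch: store the term frequency next to each docID in the postings, then order documents by a score. The scoring used here (sum of raw term frequencies over the query terms) is a deliberately naive stand-in chosen for illustration, not a weighting scheme taken from the book, and `build_tf_index` and `rank` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_tf_index(docs: dict[int, list[str]]) -> dict[str, dict[int, int]]:
    """Postings that store term frequency: term -> {docID: count in that document}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return dict(index)

def rank(query_terms: list[str], index: dict[str, dict[int, int]]) -> list[tuple[int, int]]:
    """Score each document by the summed term frequency of the query terms."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by smaller docID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up, already tokenized and normalized documents.
docs = {
    1: ["caesar", "brutus", "caesar"],
    2: ["caesar", "calpurnia"],
    3: ["brutus"],
}
index = build_tf_index(docs)
print(rank(["caesar", "brutus"], index))  # [(1, 3), (2, 1), (3, 1)]
```

Unlike a Boolean AND, every document that matches at least one query term is returned, and document 1 ranks first because it accumulates the most evidence.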


2009-10-25 23:49