有关Lucene的问题4影响Lucene对文档打分的四种方式Word文档格式.docx

资源描述

有关Lucene的问题4影响Lucene对文档打分的四种方式Word文档格式.docx

《有关Lucene的问题4影响Lucene对文档打分的四种方式Word文档格式.docx》由会员分享，可在线阅读，更多相关《有关Lucene的问题4影响Lucene对文档打分的四种方式Word文档格式.docx（21页珍藏版）》请在冰点文库上搜索。

有关Lucene的问题4影响Lucene对文档打分的四种方式Word文档格式.docx

score（q,d）

coord（q,d）

queryNorm（q）

∑（tf（tind）

idf（t）2

t.getBoost（）·

norm（t,d））

tinq

DocumentBoost和FieldBoost影响的是norm（t,d），其公式如下：

norm（t,d）

doc.getBoost（）

lengthNorm（field）

∏f.getBoost（）

fieldfindnamedast

它包括三个参数：

∙Documentboost：

此值越大，说明此文档越重要。

∙Fieldboost：

此域越大，说明此域越重要。

∙lengthNorm（field）=（1.0/Math.sqrt（numTerms））：

一个域中包含的Term总数越多，也即文档越长，此值越小，文档越短，此值越大。

其中第三个参数可以在自己的Similarity中影响打分，下面会论述。

当然，也可以在添加Field的时候，设置Field.Index.ANALYZED_NO_NORMS或Field.Index.NOT_ANALYZED_NO_NORMS，完全不用norm，来节约空间。

根据Lucene的注释，Nonormsmeansthatindex-timefieldanddocumentboostingandfieldlengthnormalizationaredisabled.

ThebenefitislessmemoryusageasnormstakeuponebyteofRAMperindexedfieldforeverydocumentintheindex,duringsearching.

Notethatonceyouindexagivenfieldwithnormsenabled,disablingnormswillhavenoeffect.没有norms意味着索引阶段禁用了文档boost和域的boost及长度标准化。

好处在于节省内存，不用在搜索阶段为索引中的每篇文档的每个域都占用一个字节来保存norms信息了。

但是对norms信息的禁用是必须全部域都禁用的，一旦有一个域不禁用，则其他禁用的域也会存放默认的norms值。

因为为了加快norms的搜索速度，Lucene是根据文档号乘以每篇文档的norms信息所占用的大小来计算偏移量的，中间少一篇文档，偏移量将无法计算。

也即norms信息要么都保存，要么都不保存。

下面几个试验可以验证norms信息的作用：

试验一：

DocumentBoost的作用

publicvoidtestNormsDocBoost（）throwsException{

FileindexDir=newFile（"

testNormsDocBoost"

）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

writer.setUseCompoundFile（false）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"

commonhellohello"

doc1.add（f1）;

doc1.setBoost（100）;

writer.addDocument（doc1）;

Documentdoc2=newDocument（）;

Fieldf2=newField（"

commoncommonhello"

Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc2.add（f2）;

writer.addDocument（doc2）;

Documentdoc3=newDocument（）;

Fieldf3=newField（"

commoncommoncommon"

doc3.add（f3）;

writer.addDocument（doc3）;

writer.close（）;

IndexReaderreader=IndexReader.open（FSDirectory.open（indexDir））;

IndexSearchersearcher=newIndexSearcher（reader）;

TopDocsdocs=searcher.search（newTermQuery（newTerm（"

common"

））,10）;

for（ScoreDocdoc:

docs.scoreDocs）{

System.out.println（"

docid:

+doc.doc+"

score:

+doc.score）;

}

如果第一篇文档的域f1也为Field.Index.ANALYZED_NO_NORMS的时候，搜索排名如下：

2score:

1.2337708

1score:

1.0073696

0score:

0.71231794

如果第一篇文档的域f1设为Field.Index.ANALYZED，则搜索排名如下：

39.889805

0.6168854

0.5036848

试验二：

FieldBoost的作用

如果我们觉得title要比contents要重要，可以做一下设定。

publicvoidtestNormsFieldBoost（）throwsException{

testNormsFieldBoost"

title"

f1.setBoost（100）;

QueryParserparser=newQueryParser（Version.LUCENE_CURRENT,"

newStandardAnalyzer（Version.LUCENE_CURRENT））;

Queryquery=parser.parse（"

title:

commoncontents:

TopDocsdocs=searcher.search（query,10）;

0.49999997

0.35355338

19.79899

0.49999997

试验三：

norms中文档长度对打分的影响

publicvoidtestNormsLength（）throwsException{

testNormsLength"

commoncommonhellohellohellohello"

当norms被禁用的时候，包含两个common的第二篇文档打分较高：

0.13928263

0.09848769

当norms起作用的时候，虽然包含两个common的第二篇文档，由于长度较长，因而打分较低：

0.09848769

0.052230984

试验四：

norms信息要么都保存，要么都不保存的特性

publicvoidtestOmitNorms（）throwsException{

testOmitNorms"

for（inti=0;

10000;

i++）{

}

当我们添加10001篇文档，所有的文档都设为Field.Index.ANALYZED_NO_NORMS的时候，我们看索引文件，发现.nrm文件只有1K，也即其中除了保持一定的格式信息，并无其他数据。

当我们把第一篇文档设为Field.Index.ANALYZED，而其他10000篇文档都设为Field.Index.ANALYZED_NO_NORMS的时候，发现.nrm文件又10K，也即所有的文档都存储了norms信息，而非只有第一篇文档。

在搜索语句中，设置QueryBoost.

在搜索中，我们可以指定，某些词对我们来说更重要，我们可以设置这个词的boost：

common^4hello

使得包含common的文档比包含hello的文档获得更高的分数。

由于在Lucene中，一个Term定义为Field:

Term，则也可以影响不同域的打分：

common^4content:

common

使得title中包含common的文档比content中包含common的文档获得更高的分数。

实例：

publicvoidtestQueryBoost（）throwsException{

TestQueryBoost"

common1hellohello"

common2common2hello"

common1common2"

根据tf/idf，包含两个common2的第二篇文档打分较高：

0.24999999

0.17677669

如果我们输入的查询语句为：

common1^100common2"

，则第一篇文档打分较高：

0.2499875

0.0035353568

那QueryBoost是如何影响文档打分的呢？

根据Lucene的打分计算公式：

注：

在queryNorm的部分，也有q.getBoost（）的部分，但是对query向量的归一化（见向量空间模型与Lucene的打分机制[

继承并实现自己的Similarity

Similariy是计算Lucene打分的最主要的类，实现其中的很多借口可以干预打分的过程。

（1）floatcomputeNorm（Stringfield,FieldInvertStatestate）

（2）floatlengthNorm（StringfieldName,intnumTokens）

（3）floatqueryNorm（floatsumOfSquaredWeights）

（4）floattf（floatfreq）

（5）floatidf（intdocFreq,intnumDocs）

（6）floatcoord（intoverlap,intmaxOverlap）

（7）floatscorePayload（intdocId,StringfieldName,intstart,intend,byte[]payload,intoffset,intlength）

它们分别影响Lucene打分计算的如下部分：

（6）coord（q,d）

（3）queryNorm（q）

∑（（4）tf（tind）

（5）idf（t）2

（1）norm（t,d））

tinq

（2）lengthNorm（field）

fieldfind

展开阅读全文