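I index my text with StandardAnalyzer. At query time I run both term queries and phrase queries. I believe Lucene has no trouble computing the term frequency and the phrase frequency for these, which is all a model such as Dirichlet similarity needs. BM25Similarity and TFIDFSimilarity, however, need IDF(term) and IDF(phrase). How does Lucene handle this?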
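With TFIDFSimilarity, the IDF of a phrase is the sum of the IDFs of its constituent terms. That is: idf("ab cd") = idf(ab) + idf(cd)
That value is then multiplied by the tf of the phrase frequency, and from that point on the phrase is scored much like a single term.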
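The sum can be checked by hand. The sketch below is not Lucene code: it assumes the DefaultSimilarity formula idf = 1 + ln(maxDocs / (docFreq + 1)), and the class and method names (`PhraseIdfSketch`, `idf`, `phraseIdf`) are made up for illustration. The docFreq and maxDocs values are the ones reported in the explain output for this example.

```java
final class PhraseIdfSketch {

    // Assumed DefaultSimilarity-style idf: 1 + ln(maxDocs / (docFreq + 1)).
    public static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // Phrase idf as described above: the sum of the idf of each term in the phrase.
    public static double phraseIdf(int maxDocs, int... docFreqs) {
        double sum = 0.0;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, maxDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        int maxDocs = 4;
        // "text" occurs in 4 documents and "ab" in 2, per the explain output.
        System.out.println("idf(text)      = " + idf(4, maxDocs));
        System.out.println("idf(ab)        = " + idf(2, maxDocs));
        System.out.println("idf(\"text ab\") = " + phraseIdf(maxDocs, 4, 2));
    }
}
```

The last value comes out at roughly 2.0645, which agrees with the `idf(), sum of:` line in the explain output.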
To see the whole story, I think it makes the most sense to look at an example. IndexSearcher.explain is a very useful tool for understanding scores:
Index:
Query: "text ab" unique
Explain output for the first (top-scoring) hit (doc 0):
1.3350155 = (MATCH) sum of:
  0.7981777 = (MATCH) weight(content:"text ab" in 0) [DefaultSimilarity], result of:
    0.7981777 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
      0.7732263 = queryWeight, product of:
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.37452745 = queryNorm
      1.0322692 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.5 = fieldNorm(doc=0)
  0.5368378 = (MATCH) weight(content:unique in 0) [DefaultSimilarity], result of:
    0.5368378 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      0.6341301 = queryWeight, product of:
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.37452745 = queryNorm
      0.8465736 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.5 = fieldNorm(doc=0)
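Note that, apart from the extra summation in the phrase idf calculation, the first half of the score, which handles the "text ab" part of the query, uses almost the same algorithm as the second half, which scores unique.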
Explain output for the second hit (doc 2), for good measure:
0.49384725 = (MATCH) product of:
  0.9876945 = (MATCH) sum of:
    0.9876945 = (MATCH) weight(content:"text ab" in 2) [DefaultSimilarity], result of:
      0.9876945 = score(doc=2,freq=2.0 = phraseFreq=2.0
), product of:
        0.7732263 = queryWeight, product of:
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.37452745 = queryNorm
        1.277368 = fieldWeight in 2, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = phraseFreq=2.0
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.4375 = fieldNorm(doc=2)
  0.5 = coord(1/2)
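Both explain trees can be reproduced with a few lines of arithmetic. This is a rough sketch rather than Lucene's implementation: it assumes the DefaultSimilarity formulas tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and queryNorm = 1 / sqrt(sum of squared clause idfs) with all boosts equal to 1. The fieldNorm values (0.5 and 0.4375) are copied from the explain output, not derived. The class and method names are hypothetical.

```java
final class ClassicScoreSketch {

    // Assumed DefaultSimilarity-style idf.
    public static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // Assumed DefaultSimilarity-style tf: square root of the term or phrase frequency.
    public static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // queryNorm = 1 / sqrt(sum of squared clause weights), assuming every boost is 1.
    public static double queryNorm(double... clauseIdfs) {
        double sumOfSquares = 0.0;
        for (double v : clauseIdfs) {
            sumOfSquares += v * v;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // score = queryWeight * fieldWeight, the same for a term clause and a phrase clause.
    public static double clauseScore(double idf, double queryNorm, double freq, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Doc 0 matches both clauses: the phrase "text ab" once and the term unique once.
    public static double scoreDoc0() {
        double phraseIdf = idf(4, 4) + idf(2, 4);
        double termIdf = idf(1, 4);
        double norm = queryNorm(phraseIdf, termIdf);
        return clauseScore(phraseIdf, norm, 1.0, 0.5) + clauseScore(termIdf, norm, 1.0, 0.5);
    }

    // Doc 2 matches only the phrase (twice), so coord(1/2) halves the sum.
    public static double scoreDoc2() {
        double phraseIdf = idf(4, 4) + idf(2, 4);
        double termIdf = idf(1, 4);
        double norm = queryNorm(phraseIdf, termIdf);
        double coord = 1.0 / 2.0;
        return clauseScore(phraseIdf, norm, 2.0, 0.4375) * coord;
    }

    public static void main(String[] args) {
        System.out.println("doc 0: " + scoreDoc0());
        System.out.println("doc 2: " + scoreDoc2());
    }
}
```

Because the sketch uses double precision and Lucene uses float, the results should agree with the explain output (about 1.3350 for doc 0 and 0.4938 for doc 2) only up to small rounding differences.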