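Basically, I have indexed 85k HTML files (Google result pages, where the search keywords are different university names), and I store each page's title in my Lucene index as a field named "title". When I search for "duquesne AND university", no results come back. However, when I change the query to just "duquesne", I get a result whose title is "title:Duquesne Univeristy - Google Search". Why does this happen? The second attempt shows that the file titled Duquesne Univeristy has been indexed, yet I cannot retrieve it with the first query. Thanks a lot!
Here is the code I use to build the index; I use Jsoup to get the title from each web page: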
//indexDir is the directory that hosts Lucene's index files
File indexDir = new File("F:\\luceneIndex");
Directory myindex=SimpleFSDirectory.open(indexDir);
//dataDir is the directory that hosts the text files to be indexed
File dataDir = new File("I:\\luceneTextFiles");
Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
File[] dataFiles = dataDir.listFiles();
IndexWriterConfig indexConfig=new IndexWriterConfig(Version.LUCENE_CURRENT,luceneAnalyzer);
IndexWriter indexWriter = new IndexWriter(myindex, indexConfig);
long startTime = new Date().getTime();
System.out.println("Total file number is "+dataFiles.length+"");
for(int i = 0; i < dataFiles.length; i++){
    if(dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")){
        org.jsoup.nodes.Document t=Jsoup.parse(dataFiles[i], "UTF-8");
        Document document = new Document();
        Reader txtReader = new FileReader(dataFiles[i]);
        document.add(new Field("title",t.title(),Field.Store.YES,Field.Index.ANALYZED));
        document.add(new Field("path",dataFiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED));
        document.add(new Field("count",i+"",Field.Store.YES,Field.Index.NOT_ANALYZED));
        document.add(new Field("contents",txtReader));
        indexWriter.addDocument(document);
    }
}
//indexWriter.getCommitData();
indexWriter.close();
long endTime = new Date().getTime();
And here is the code I use for searching:
String queryKey="duquesne";
String subqueryKey="university";
String queryField="contents";
String subqueryField="title";
/*
* 0------>normal search
* 1------>range search
* 2------>prefix search
* 3------>combine search
* 4------>phrase query
* 5------>wild card query
* 6------>fuzzy query
*/
int querychoice=0;
//initialize the directory
File indexDir=new File("F:\\luceneIndex");
Directory directory=SimpleFSDirectory.open(indexDir);
IndexReader reader=IndexReader.open(directory);
//initialize the searcher
IndexSearcher searcher=new IndexSearcher(reader);
Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_CURRENT);
Query query;
switch(querychoice){
    case 0:
        QueryParser parser=new QueryParser(Version.LUCENE_CURRENT,subqueryField,analyzer);
        query=parser.parse(queryKey);
        break;
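Parsing title:Duquesne Univeristy - Google Search with the StandardAnalyzer results in the query title:duquesne defaultfield:univeristy defaultfield:google defaultfield:search. A field prefix applies only to the term that directly follows it, so only the first term is bound to the title field; the remaining terms fall back to the parser's default field.
The terms are also combined with OR.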
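To make the field assignment described above concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene's QueryParser or StandardAnalyzer: the class QueryFieldSketch and the method assignFields are hypothetical names made up for illustration, analysis is approximated by lowercasing and dropping non-alphanumeric characters, and operators such as AND are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class QueryFieldSketch {

    // Hypothetical helper, NOT part of Lucene: shows which field each
    // whitespace-separated term ends up in when only some terms carry a prefix.
    public static List<String> assignFields(String defaultField, String query) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            String field = defaultField;
            String text = token;
            int colon = token.indexOf(':');
            if (colon > 0) {
                // A "field:" prefix binds only the term attached to it.
                field = token.substring(0, colon);
                text = token.substring(colon + 1);
            }
            // Crude stand-in for analysis: lowercase, keep letters and digits only.
            String term = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
            if (!term.isEmpty()) {
                clauses.add(field + ":" + term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> clauses =
            assignFields("defaultfield", "title:Duquesne Univeristy - Google Search");
        // Clauses are joined with OR, mirroring the behaviour described above.
        System.out.println(String.join(" OR ", clauses));
    }
}
```

In real Lucene query syntax, every term that should be matched against the title needs its own prefix, or the terms can be grouped, as in title:(duquesne univeristy).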