I have the following mapping, and it works fine:
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords" : ["http", "https", "ftp", "www"],
"type": "stemmer"
},
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : true
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"synonym_filter",
"shingle_filter"
],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
I index the following document:
{
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
If I run the following query with the AND operator, it finds the document as expected, because every search term is present in the document.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
But if I also include "1960" in the search, the query below returns nothing:
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
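I found that my "lowercase" tokenizer does not generate tokens for numbers. So I changed the tokenizer to "standard", and the numeric token 1960 was generated.
But the query still finds nothing, because the url field containing the link www.nasa.com no longer produces the tokens "www nasa com"; the only token generated is the whole link www.nasa.com.
The query only works when I type the full URL www.nasa.com, as shown below: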
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space www.nasa.com rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
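As a rough illustration of the tokenizer behavior described above, the following Python sketch approximates the two tokenizers with regular expressions. The function names and regexes are assumptions made for illustration, not Elasticsearch code, and the real "standard" tokenizer follows Unicode word segmentation rules that this sketch only imitates for simple inputs like the ones in this question.

```python
import re


def lowercase_tokenizer(text):
    """Approximates the "lowercase" tokenizer: split on every non-letter
    character and lowercase the pieces. Digits are non-letters, so they
    never become tokens."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def standard_tokenizer(text):
    """Rough stand-in for the "standard" tokenizer: letters and digits are
    kept, and a "." between alphanumeric characters does not split the
    token. (Assumption: good enough for the sample strings used here.)"""
    return re.findall(r"[^\W_]+(?:\.[^\W_]+)*", text)


for sample in ("rocket 1960", "www.nasa.com"):
    print(sample, lowercase_tokenizer(sample), standard_tokenizer(sample))
```

Under these assumptions, the letter-only tokenizer drops "1960" but splits the link into three tokens, while the standard-like tokenizer keeps "1960" but leaves www.nasa.com as a single token.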
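If I set up a separate "lowercase" tokenizer just for the url field, the link www.nasa.com is again split into the separate tokens "www nasa com".
But the query below still finds nothing, because the tokenizer of the url field now differs from the one used by the title and description fields. The query only works with the OR operator, but I need the AND operator.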
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
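To make the failure above easier to reason about, here is a simplified Python model of how a cross_fields query with the AND operator behaves when fields use different analyzers. It assumes, per the multi_match documentation, that fields sharing an analyzer are grouped together and that all terms must be found within a single group. The function names and regex tokenizers are hypothetical stand-ins, and stemming, ASCII folding, shingles, synonyms and scoring are ignored.

```python
import re


def lowercase_tok(text):
    # Letter-only tokens, lowercased (approximates the "lowercase" tokenizer).
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def standard_tok(text):
    # Keeps digits and does not split on "." between alphanumerics
    # (rough approximation of the "standard" tokenizer plus lowercasing).
    return [t.lower() for t in re.findall(r"[^\W_]+(?:\.[^\W_]+)*", text)]


def cross_fields_and(query, doc, field_tokenizers):
    """Simplified model: group fields by tokenizer; the document matches if,
    in at least one group, every query term (analyzed with that group's
    tokenizer) occurs in some field of the group."""
    groups = {}
    for field, tok in field_tokenizers.items():
        groups.setdefault(tok, []).append(field)
    for tok, fields in groups.items():
        indexed = {t for f in fields for t in tok(doc[f])}
        terms = tok(query)
        if terms and all(t in indexed for t in terms):
            return True
    return False


doc = {"title": "rocket 1960", "description": "space", "url": "www.nasa.com"}

all_standard = dict.fromkeys(doc, standard_tok)
mixed = {"title": standard_tok, "description": standard_tok, "url": lowercase_tok}

print(cross_fields_and("1960 space nasa rocket", doc, all_standard))
print(cross_fields_and("1960 space www.nasa.com rocket", doc, all_standard))
print(cross_fields_and("1960 space nasa rocket", doc, mixed))
```

In this model the mixed setup fails because neither group contains every term on its own: the title/description group has no "nasa", and the url group has no "space" or "rocket".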
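I cannot use ngrams in my mapping because I use the phrase suggester, and with ngrams the suggestions were generated from hundreds of tokens, which made them inaccurate.
Does anyone know a solution that lets my mapping generate numeric tokens in the title and description fields, while the url field keeps splitting website links into several tokens ("www nasa com") instead of keeping the whole link "www.nasa.com", and that lets my query search all fields at once with the AND operator?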
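Regarding "if I also include 1960 in the search, the query below returns nothing":
In the index mapping below, I removed synonym_filter. After removing it, indexing the sample document, and running the same search query you mention in the question, I get the desired result.
Index mapping: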
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords": [
"http",
"https",
"ftp",
"www"
],
"type": "stemmer"
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"shingle_filter"
],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
Search query:
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
Search result:
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "1",
"_score": 0.9370217,
"_source": {
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
}
]
As @Gibbs mentioned, I think there is some issue with synonym_filter, so it would be better if you shared your synonym.txt; otherwise the search query works fine.
Update 1: (including synonym_filter)
If you want to include the synonym token filter, keep the index mapping the same as yours and make just one change to it:
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : false --> set this to false
},
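You set the synonym filter to "updateable", presumably because you want to change synonyms without closing and reopening the index, using the reload API instead. Updateable synonyms restrict the analyzer that uses them to search time only.
For a full explanation, you can refer to this ES discussion.
With the same search query as above (after changing the mapping), you will get the desired result.
However, if you still want to set "updateable" : true, you can refer to the official documentation of the Reload search analyzers API.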
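If you do keep "updateable" : true, one possible arrangement is to use the synonym filter only in a search-time analyzer and index with an analyzer that omits it. The Python sketch below just builds such a request body. The analyzer names analyzer_indexacao and analyzer_busca are hypothetical, the filter definitions are copied from the question, and I have not tested this against your index, so treat it as a starting point rather than a verified fix.

```python
import json

# Filters shared by both analyzers, in the order used in the question.
common = ["lowercase", "stemmer_plural_portugues", "asciifolding"]

analysis = {
    "filter": {
        "stemmer_plural_portugues": {
            "name": "minimal_portuguese",
            "stopwords": ["http", "https", "ftp", "www"],
            "type": "stemmer",
        },
        "synonym_filter": {
            "type": "synonym",
            "lenient": True,
            "synonyms_path": "analysis/synonym.txt",
            "updateable": True,  # only allowed in search-time analyzers
        },
        "shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3,
        },
    },
    "analyzer": {
        # Hypothetical name: index-time analyzer without synonyms.
        "analyzer_indexacao": {
            "tokenizer": "lowercase",
            "filter": common + ["shingle_filter"],
        },
        # Hypothetical name: search-time analyzer with updateable synonyms.
        "analyzer_busca": {
            "tokenizer": "lowercase",
            "filter": common + ["synonym_filter", "shingle_filter"],
        },
    },
}


def text_field():
    # Same text field shape as in the question, plus a search_analyzer.
    return {
        "type": "text",
        "analyzer": "analyzer_indexacao",
        "search_analyzer": "analyzer_busca",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


body = {
    "settings": {"index": {"analysis": analysis}},
    "mappings": {
        "properties": {name: text_field() for name in ("title", "description", "url")}
    },
}

print(json.dumps(body, indent=2))
```

After editing synonym.txt on the nodes, the Reload search analyzers API (POST /<your-index>/_reload_search_analyzers) should pick up the changes for the search analyzer without closing the index.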