I have documents that I want to index in ElasticSearch, each containing a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without included spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Additionally, documents with a single-word name like "ExxonMobil" should match "exxon mobil" and "exxonmobil".
I can't seem to find the right combination of analyzer/filters to accomplish this.
I think the most direct approach to this problem would be to apply a shingle token filter, which, instead of creating n-grams of characters, creates combinations of adjacent incoming tokens. You can add it to your analyzer with something like:
filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2
        max_shingle_size: 3
        output_unigrams: true
        token_separator: ""
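To make the effect of those settings concrete, here is a rough Python sketch of what such a filter emits for a token list. It is only an imitation for illustration, not the Lucene implementation: the function name `shingles` is made up, the input is assumed to be already lowercased with stop words removed, and stemming and position gaps are ignored.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True, separator=""):
    """Imitate a shingle filter: join runs of adjacent tokens with `separator`."""
    out = []
    for start in range(len(tokens)):
        # Optionally keep the original single token (output_unigrams: true).
        if output_unigrams:
            out.append(tokens[start])
        # Emit every shingle of min_size..max_size tokens that fits from here.
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(separator.join(tokens[start:start + size]))
    return out


# "The Home Depot", assuming "the" was dropped as a stop word earlier in the chain.
indexed_home_depot = set(shingles(["home", "depot"]))
# "ExxonMobil" is a single token, so no shingles are added at index time.
indexed_exxon = set(shingles(["exxonmobil"]))

# Queries go through the same (assumed) analysis.
for query, indexed in [
    (["homedepot"], indexed_home_depot),
    (["home"], indexed_home_depot),
    (["home", "depot"], indexed_home_depot),
    (["exxon", "mobil"], indexed_exxon),
    (["exxonmobil"], indexed_exxon),
]:
    overlap = indexed & set(shingles(query))
    print(query, "matches" if overlap else "does not match")
```

Because the separator is empty, "home depot" yields the extra term "homedepot" at index time, and the query "exxon mobil" yields "exxonmobil" at search time, which is why the same analyzer would need to run on both sides for every case in the question to match.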
You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e. after any StopFilters, SynonymFilters, stemmers, etc.).
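As a sketch of that ordering, the index settings could look roughly like the following, written here as a Python dict that serializes to the JSON body of an index-creation request. The analyzer name `name_analyzer` is hypothetical, and the choice of the `standard` tokenizer with the `lowercase`, `stop`, and `snowball` filters is an assumption meant to approximate the snowball analyzer from the question, so adjust it to your actual chain.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Same shingle settings as in the answer above.
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                    "token_separator": "",
                }
            },
            "analyzer": {
                # "name_analyzer" is a made-up name for illustration.
                "name_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Shingling comes last, after lowercasing, stop word
                    # removal, and stemming have already run.
                    "filter": ["lowercase", "stop", "snowball", "my_shingle_filter"],
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

One thing worth checking with the analyze API: when a stop filter removes a token, the shingle filter may insert a filler token (the `filler_token` setting, "_" by default) in its place, which can show up inside the generated shingles.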