Synonym Token Filter and NGram Token Filter are probably two of the most frequently used token filters in Elasticsearch (ES). In this article, I'm going to talk about some of the annoying issues I ran into when combining both of them in an analyzer.
Background
A quick intro to these two filters:
Synonym Token Filter - As the name states, this is a filter that lets users handle synonyms easily during text analysis.
e.g.
In Malaysia, McDonald's/McD is fondly known as "mekdi" (I know this is weird, but it's real -> mekdi). To create this synonym link, we can specify it in ES as: mcdonalds, mcd => mekdi.
NGram Token Filter - This is a filter that splits a token into all substrings ("grams") of a specified size "N".
e.g.
With a trigram filter (N=3), the word aaabbb will become ["aaa", "aab", "abb", "bbb"].
The Problem
Let's talk about the problems that come up when we try to use these two filters in one analyzer.
Problem 1: ES does not allow us to apply NGram before Synonym. The reason is that ES parses the synonym rules themselves using the filters that precede the synonym filter in the chain, and a filter like ngram, which splits one token into many, cannot be used for that. You will get the following error if you try:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [trigrams_filter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [trigrams_filter] cannot be used to parse synonyms"
  },
  "status": 400
}
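For reference, here is a minimal settings sketch (my own reconstruction, reusing the filter definitions from the working example below) that reproduces this error. The only difference is that trigrams_filter is listed before synonym:

PUT /restaurants
{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          // ngram before synonym: rejected with the error above
          "filter": ["trigrams_filter", "synonym"]
        }
      },
      "filter": {
        "trigrams_filter": { "type": "ngram", "min_gram": 3, "max_gram": 3 },
        "synonym": {
          "type": "synonym",
          "lenient": true,
          "synonyms": ["mcdonalds, mcd => mekdi"]
        }
      }
    }
  }
}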
Problem 2: You will get some irrelevant match results.
Say we have an index restaurants that applies the synonym filter followed by a trigram (N=3) filter. The synonym filter will convert mcdonalds and mcd to mekdi.
PUT /restaurants
{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["synonym", "trigrams_filter"]
        }
      },
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        },
        "synonym": {
          "type": "synonym",
          "lenient": true,
          "synonyms": ["mcdonalds, mcd => mekdi"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "name_analyzer" }
    }
  }
}
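As a quick sanity check (my own addition, not required for the walkthrough), we can confirm the chain also works for the other synonym input, mcd:

GET /restaurants/_analyze
{
  "analyzer": "name_analyzer",
  "text": "mcd"
}

This returns the tokens mek, ekd and kdi, because mcd is first rewritten to mekdi by the synonym filter and then split into trigrams.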
Then we insert the following documents into ES:
POST /_bulk
{ "create" : { "_index" : "restaurants", "_id": "1", "_type" : "_doc" } }
{ "name": "mcdonalds" }
{ "create" : { "_index" : "restaurants", "_id": "2", "_type" : "_doc" } }
{ "name": "mek sue cafe" }
The problem starts when we query for the name mcdonalds:
GET /_search
{
  "query": {
    "match": {
      "name": {
        "query": "mcdonalds",
        "operator": "and"
      }
    }
  }
}
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.3382834,
    "hits": [
      {
        "_index": "restaurants",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.3382834,
        "_source": {
          "name": "mcdonalds"
        }
      },
      {
        "_index": "restaurants",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.19363807,
        "_source": {
          "name": "mek sue cafe"
        }
      }
    ]
  }
}
You will notice that we get an irrelevant result: mek sue cafe appears in our results.
Why?
Let's examine how the analyzer processes the word mcdonalds:
GET /restaurants/_analyze
{
  "analyzer" : "name_analyzer",
  "text" : "mcdonalds"
}
Result:
{
  "tokens": [
    {
      "token": "mek",
      "start_offset": 0,
      "end_offset": 9,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ekd",
      "start_offset": 0,
      "end_offset": 9,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "kdi",
      "start_offset": 0,
      "end_offset": 9,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
From the result, we can see that the word mcdonalds is first rewritten to mekdi and then converted into the trigrams ["mek", "ekd", "kdi"]. Notice that all three trigrams share position 0, so the match query treats them as synonyms of one another and requires only one of them to match, even with operator set to and. The token mek alone is therefore enough to cause the match of mek sue cafe.
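We can verify this from the other side too. Analyzing the stored name (a quick check I added) shows that its token stream also contains mek:

GET /restaurants/_analyze
{
  "analyzer": "name_analyzer",
  "text": "mek sue cafe"
}

Among the output tokens are mek, sue, caf and afe, and the shared mek token is what produces the spurious hit.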
Conclusion
It is totally fine to combine the synonym and ngram filters, but we should be extra cautious when doing so, as we might get some undesired search results.
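One possible mitigation (a sketch of my own, not the only option) is to index the name twice with a multi-field: keep the ngram-analyzed field for partial matching, add a stricter subfield that applies only the synonym filter, and require the strict subfield in the query. The subfield and analyzer names (strict, strict_analyzer) are illustrative:

PUT /restaurants
{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["synonym", "trigrams_filter"]
        },
        // same chain minus the ngram filter (illustrative name)
        "strict_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["synonym"]
        }
      },
      "filter": {
        "trigrams_filter": { "type": "ngram", "min_gram": 3, "max_gram": 3 },
        "synonym": {
          "type": "synonym",
          "lenient": true,
          "synonyms": ["mcdonalds, mcd => mekdi"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "name_analyzer",
        "fields": {
          "strict": { "type": "text", "analyzer": "strict_analyzer" }
        }
      }
    }
  }
}

GET /restaurants/_search
{
  "query": {
    "bool": {
      // the strict field gates relevance: "mcdonalds" becomes the single token "mekdi" here
      "must": [
        { "match": { "name.strict": "mcdonalds" } }
      ],
      // the ngram field still contributes to scoring for partial matches
      "should": [
        { "match": { "name": { "query": "mcdonalds", "operator": "and" } } }
      ]
    }
  }
}

With this setup, mek sue cafe no longer matches, because on name.strict it is indexed as mek, sue and cafe, none of which equals mekdi. Of course, gating on the strict field also disables pure partial matching, so whether this trade-off is acceptable depends on your use case.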