Background
We often get back-office requirements like this: implement the matching behavior of SQL's LIKE "%xxxx%". The backend needs search results to be reasonably precise so that the operations team can work with them.
Wildcard query
The closest equivalent in ES is the wildcard query. It does not analyze the query string; instead it walks the inverted index and pattern-matches every term one by one, so the cost can be enormous. Use it with great caution.
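To see why this is expensive, here is a minimal sketch in plain Python (not ES internals) of what a wildcard query effectively does: compare the pattern against every term in the field's term dictionary. The term dictionary and document ids below are made up for illustration.

```python
from fnmatch import fnmatch

# Hypothetical term dictionary for a text field: every unique term in the
# index, mapped to the ids of documents containing it (illustrative values).
term_dict = {
    "青岛": [1, 2, 3, 4],
    "岛上": [1, 3],
    "合": [1, 3],
    "蓝": [1, 3],
    "篮球": [2, 4],
}

def wildcard_search(pattern, term_dict):
    """A wildcard query effectively scans EVERY term in the dictionary and
    collects the postings of the terms that match; this full scan is what
    makes it so expensive on large indexes."""
    hits = set()
    for term, postings in term_dict.items():
        if fnmatch(term, pattern):
            hits.update(postings)
    return sorted(hits)

print(wildcard_search("*岛*", term_dict))  # [1, 2, 3, 4]
```

The work grows with the number of unique terms, not with the number of matching documents, which is why a leading-wildcard pattern cannot be served by an ordinary prefix lookup.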
match full-text query
The worst option is a match full-text query: as long as any single term from the analyzed query appears in the inverted index, the document is recalled, so it easily returns documents that have nothing to do with the query.
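That recall behavior can be sketched as follows: with the default OR semantics of match, sharing a single term with the query is enough to recall a document. The tokenizations below are assumed for illustration.

```python
def match_recall(query_terms, doc_terms):
    """Default `match` (OR) semantics: sharing a single term with the
    query is enough for a document to be recalled."""
    return any(t in doc_terms for t in query_terms)

# Assume "青岛篮球" is analyzed into ["青岛", "篮球"]. A document that only
# mentions basketball is still recalled, even though 青岛 never appears in it.
doc_terms = ["张三", "去", "买", "篮球"]
print(match_recall(["青岛", "篮球"], doc_terms))  # True
```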
match_phrase (phrase matching)
A recommended compromise between performance and accuracy is the match_phrase query.
match_phrase works by analyzing the query and requiring that all of its terms appear in the inverted index, at consecutive positions and in the same order.
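That rule can be sketched in plain Python. This is a toy model of position-based phrase matching, not Lucene's actual implementation (which also handles slop and repeated terms):

```python
def phrase_match(query_terms, positions):
    """positions: term -> sorted list of positions of that term in one
    document. Returns True only if every query term is present and the
    terms occur at consecutive positions, in query order."""
    if any(t not in positions for t in query_terms):
        return False  # every term must exist in the document
    first, rest = query_terms[0], query_terms[1:]
    for start in positions[first]:
        # each following term must sit exactly one position further on
        if all(start + i + 1 in positions[t] for i, t in enumerate(rest)):
            return True
    return False

# ik_smart tokens of "青岛上合蓝": 青(0) 岛上(1) 合(2) 蓝(3)
doc = {"青": [0], "岛上": [1], "合": [2], "蓝": [3]}
print(phrase_match(["合", "蓝"], doc))  # True: positions 2, 3 are consecutive
print(phrase_match(["蓝", "合"], doc))  # False: wrong order
```

This already hints at the pitfall explored below: recall depends entirely on how the analyzer cut the document into terms and positions.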
Hands-on
Create a test index
PUT test
{
  "settings": {
    "index": {
      "refresh_interval": "1s",
      "number_of_shards": "3"
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "@version": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
Insert test data
POST /_bulk
{"index":{"_index":"test","_id":1}}
{"title":"青岛上合蓝"}
{"index":{"_index":"test","_id":2}}
{"title":"张三去青岛买篮球"}
{"index":{"_index":"test","_id":3}}
{"title":"青岛上面合蓝"}
{"index":{"_index":"test","_id":4}}
{"title":"张三去青岛上面买篮球"}
Search:
POST /test/_analyze
{
  "analyzer": "ik_smart",
  "field": "title",
  "text": "上合蓝"
}

The result:

{
  "tokens" : [
    { "token" : "上", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "合", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "蓝", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 }
  ]
}
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "title": { "query": "上合蓝" } } }
      ]
    }
  }
}

The result:

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  }
}
Nothing is found. Let's look at why.
We use the ik_smart Chinese analyzer to tokenize "青岛上合蓝":
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "青岛上合蓝"
}

The result:

{
  "tokens" : [
    { "token" : "青", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "岛上", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 },
    { "token" : "合", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "蓝", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 }
  ]
}
Tokenize "张三去青岛买篮球":
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "张三去青岛买篮球"
}

The result:

{
  "tokens" : [
    { "token" : "张三", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "去", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "青岛", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "买", "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "篮球", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 4 }
  ]
}
Notice that every term carries a position field marking where it occurs, and this directly determines whether match_phrase can recall a document. When we run a match_phrase query for "上合蓝", the first document's tokens are "青" "岛上" "合" "蓝". There is no term "上", so the document cannot be recalled.
If we instead search for "合蓝", we do get hits, because "合" and "蓝" both match and sit at consecutive positions. Searching for "蓝合" finds nothing, because the terms are not consecutive in that order.
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "title": { "query": "合蓝" } } }
      ]
    }
  }
}

The result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 2.1364596,
    "hits" : [
      { "_index" : "test", "_type" : "_doc", "_id" : "3", "_score" : 2.1364596, "_source" : { "title" : "青岛上面合蓝" } },
      { "_index" : "test", "_type" : "_doc", "_id" : "1", "_score" : 0.5753642, "_source" : { "title" : "青岛上合蓝" } }
    ]
  }
}
Now let's try searching for "青岛":
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "title": { "query": "青岛" } } }
      ]
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 0.14543022,
    "hits" : [
      { "_index" : "test", "_type" : "_doc", "_id" : "3", "_score" : 0.14543022, "_source" : { "title" : "青岛上面合蓝" } },
      { "_index" : "test", "_type" : "_doc", "_id" : "2", "_score" : 0.13353139, "_source" : { "title" : "张三去青岛买篮球" } },
      { "_index" : "test", "_type" : "_doc", "_id" : "4", "_score" : 0.12343238, "_source" : { "title" : "张三去青岛上面买篮球" } }
    ]
  }
}
Again, the match_phrase query for "青岛" cannot recall the first document: its tokens are "青" "岛上" "合" "蓝", and none of them matches "青岛".
One more caveat: match_phrase and the ik_max_word analyzer reportedly do not work together (I could not reproduce this on ES 7.5; I'll update this article if I hit it again), because ik_max_word produces overlapping terms:
Let's create another test index, test2, analyzed with ik_max_word, and insert the same test data.
PUT test2
{
  "settings": {
    "index": {
      "refresh_interval": "1s",
      "number_of_shards": "3"
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "@version": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
First, tokenize with ik_max_word:
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "青岛上合蓝"
}

The result:

{
  "tokens" : [
    { "token" : "青岛", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "岛上", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 },
    { "token" : "合", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "蓝", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 }
  ]
}
From "青岛" and "岛上" you can see that its terms overlap, which is completely different from ik_smart: ik_max_word aims to generate as many term combinations as possible and is usually used in full-text search to improve recall.
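The overlap shows up in the character offsets: adjacent tokens can share characters even though each one gets its own position. A quick check over the _analyze output above:

```python
# Tokens from the ik_max_word _analyze output for "青岛上合蓝"
tokens = [
    {"token": "青岛", "start": 0, "end": 2, "position": 0},
    {"token": "岛上", "start": 1, "end": 3, "position": 1},
    {"token": "合",  "start": 3, "end": 4, "position": 2},
    {"token": "蓝",  "start": 4, "end": 5, "position": 3},
]

def overlapping_pairs(tokens):
    """Pairs of consecutive tokens whose character ranges overlap."""
    return [
        (a["token"], b["token"])
        for a, b in zip(tokens, tokens[1:])
        if b["start"] < a["end"]  # next token starts before this one ends
    ]

print(overlapping_pairs(tokens))  # [('青岛', '岛上')]
```

With ik_smart, by contrast, every token starts where the previous one ends, so this list would be empty.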
Let's try searching for "青岛" again:
GET /test2/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "title": { "query": "青岛" } } }
      ]
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 0.14543022,
    "hits" : [
      { "_index" : "test", "_type" : "_doc", "_id" : "3", "_score" : 0.14543022, "_source" : { "title" : "青岛上面合蓝" } },
      { "_index" : "test", "_type" : "_doc", "_id" : "2", "_score" : 0.13353139, "_source" : { "title" : "张三去青岛买篮球" } },
      { "_index" : "test", "_type" : "_doc", "_id" : "4", "_score" : 0.12343238, "_source" : { "title" : "张三去青岛上面买篮球" } }
    ]
  }
}
Conclusions
If you use match_phrase, keep the following points in mind:
1. An inaccurate analyzer will hurt recall;
2. On versions affected by the overlap issue described above, only ik_smart works reliably with match_phrase;
3. wildcard searches should run against a keyword field; against an analyzed text field, searching for *青岛* will not return the documents that start with "青岛";
4. match_phrase may fail to retrieve some documents that wildcard can find, while wildcard has the performance problem described above. In that situation, consider match_phrase_prefix.
5. If you choose ik, prefer ik_max_word for indexing, because its token output is a superset of ik_smart's. At query time: use match if you want as many results as possible; use match_phrase if you want to match the analyzed phrase as precisely as possible; and use match_phrase_prefix if you are worried that phrase matching alone will miss results.
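As a sketch of the match_phrase_prefix suggestion above, the request body differs from match_phrase only in the query type, plus an optional max_expansions parameter (here built as a Python dict; the query text is just an example):

```python
import json

# A hypothetical request body for GET /test/_search. max_expansions caps
# how many index terms the trailing prefix token may expand to (ES default: 50).
body = {
    "query": {
        "match_phrase_prefix": {
            "title": {
                "query": "青岛上",
                "max_expansions": 50
            }
        }
    }
}
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The last token of the query is treated as a prefix, so "青岛上" can match titles whose next term merely starts with "上", easing the exact-term requirement that made the earlier "上合蓝" search return nothing.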