

2019-08-16 12:20:17 Source: 博客园 (Cnblogs)

Practical Notes | A Hands-On Breakdown of《从Lucene到Elasticsearch全文检索实战》(From Lucene to Elasticsearch: Full-Text Search in Practice)

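1. Preface

In early March 2018 I had an idea: read the technical books on Elasticsearch by taking them apart piece by piece. The idea came from formats that have long been popular outside computing, such as 樊登读书会 (Fan Deng Reading Club), 得到's 每天听本书 ("a book a day") feature, and the various 拆书帮 ("book breakdown") groups.

Only a handful of Chinese-language Elasticsearch books are on the market today, about three of which cover ES 5.X or later. There are a few translated foreign titles, all targeting ES 1.X or 2.X; of these,《深入理解Elasticsearch》is something of a classic.

Goals of the breakdown:

1) Organize the Elasticsearch knowledge I already have;
2) Pick up the Elasticsearch details that got left in the corners;
3) Type out the code and commands by hand to relearn the material through practice, building up knowledge ahead of time instead of cramming at the last minute in a real project or product;
4) Save as much of your time as possible, so you can absorb the most essential points as quickly as possible.

This installment covers《从Lucene到Elasticsearch全文检索实战》.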

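2. Overview of the Book

The author, Yao Pan (姚攀, born in the 1990s), holds a master's degree from the Chinese Academy of Sciences. During his graduate studies he wrote up his internship experience as CSDN blog posts, which eventually became this book.

About 1/4 of the chapters cover Lucene principles and practice;
about 1/2 cover Elasticsearch fundamentals: getting started with clusters, search types in detail, aggregations, and the Java API;
the remaining 1/4 cover Elasticsearch cluster management, project case studies, and Hadoop integration.

Overall assessment:

Strengths:

1) Covers the basic concepts and underlying principles of Elasticsearch;
2) Shares two hands-on projects.

Weaknesses:

1) Some concepts are only categorized, without explaining how the categories differ or which scenarios each one suits;
2) Some details are incomplete and the treatment leans theoretical; many of the techniques play out differently in real-world use;
3) The book is based on Elasticsearch 5.4.0, and some of its features no longer apply in 6.X.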

3. Key Points

All the DSL below was tested on Elasticsearch 6.2.2.

3.1 mget: fetch multiple documents in one request

GET test_index/test_type/_mget
{
  "docs":[
    {"_id":1},
    {"_id":3}
  ]
}

Most concise form:

GET test_index/test_type/_mget
{
  "ids":[1,3]
}

3.2 update

Adding, removing, and updating fields, starting from this sample document:

POST test_index/test_type/1
{
  "no":1,
  "name":"奔驰X100",
  "addr":"德国",
  "price":1000000,
  "tags" : ["red"]
}

3.2.1 Adding a field

The following sets the tags field to "red", creating the field if it does not already exist.

POST test_index/test_type/1/_update
{
  "script":"ctx._source.tags = \"red\""
}

The document after the update:

{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "1",
  "_version": 6,
  "found": true,
  "_source": {
    "no": 1,
    "name": "奔驰X100",
    "addr": "德国",
    "price": 1000000,
    "tags": "red"
  }
}

3.2.2 Removing a field

POST test_index/test_type/1/_update
{
  "script":"ctx._source.remove(\"new_field\")"
}

3.2.3 Updating a field: appending a value

This requires tags to be an array, as in the sample document in 3.2; calling add on a plain string value fails.

POST test_index/test_type/1/_update
{
  "script" : {
    "source": "ctx._source.tags.add(params.tag)",
    "lang": "painless",
    "params" : {
      "tag" : "blue"
    }
  }
}

The document after the update:

{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "1",
  "_version": 8,
  "found": true,
  "_source": {
    "no": 1,
    "name": "奔驰X100",
    "addr": "德国",
    "price": 1000000,
    "tags": [
      "red",
      "blue"
    ]
  }
}

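3.2.4 Conditional delete (if check)

Note that ctx.op = 'delete' removes the whole document, not just the field: the document is deleted when tags contains the given value, and otherwise the update is a no-op.

POST test_index/test_type/1/_update
{
  "script" : {
    "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
    "lang": "painless",
    "params" : {
      "tag" : "red"
    }
  }
}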

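3.3 Things to watch for with bulk requests

Every line must end with a newline character "\n", including the last one; the newlines are what separate the lines from each other.
Watch the size of each submission. The whole bulk request has to be loaded into memory on the node that receives it, so the larger the request, the less memory is left for other requests.
The optimal bulk request size depends entirely on the server hardware, the size and complexity of the documents, and the indexing and search load.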
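
The two rules above (a trailing newline on every line, and a bounded request size) can be sketched in a few lines of Python. This is only an illustration using the standard library: build_bulk_body and chunk_by_bytes are hypothetical helper names, not part of any Elasticsearch client, and the byte limit is a value you would tune for your own cluster.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, source) pairs into a newline-delimited bulk body.

    Every line, including the last one, ends with "\n".
    """
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action, ensure_ascii=False))
        if source is not None:  # e.g. delete actions carry no source line
            lines.append(json.dumps(source, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def chunk_by_bytes(actions, max_bytes):
    """Yield bulk bodies, starting a new one before max_bytes would be exceeded.

    A single pair larger than max_bytes is still sent on its own.
    """
    batch, size = [], 0
    for pair in actions:
        pair_size = len(build_bulk_body([pair]).encode("utf-8"))
        if batch and size + pair_size > max_bytes:
            yield build_bulk_body(batch)
            batch, size = [], 0
        batch.append(pair)
        size += pair_size
    if batch:
        yield build_bulk_body(batch)


if __name__ == "__main__":
    demo = [
        ({"index": {"_index": "test_index", "_type": "test_type", "_id": "1"}},
         {"no": 1, "name": "奔驰X100"}),
        ({"delete": {"_index": "test_index", "_type": "test_type", "_id": "2"}},
         None),
    ]
    print(build_bulk_body(demo), end="")
```

Each yielded body would be sent as one _bulk request; starting with a modest limit and raising it while watching indexing throughput is one way to find a workable size.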

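3.4 Version conflicts caused by concurrent document updates

The following question comes from the community; I think it fits this topic well.

In production, a single document may be modified concurrently within the same second, which occasionally produces a VersionConflictEngineException. My guess is that, among concurrent upsert requests, one that read the version number earlier can end up executing more slowly or later than one that read it afterwards; after all, ES does not lock documents by default. How can this be handled without introducing a locking mechanism?

Solution (preliminary):

ES has two kinds of version control, internal and external. By default it uses internal versioning.

With version_type=external, the version is controlled by an externally supplied value. When the external version type is used, the system checks whether the version number passed with the index request is greater than the version of the currently stored document. If it is, the document is indexed and takes the new version number.

If the supplied value is less than or equal to the stored document's version number, a version conflict occurs and the index operation fails.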

PUT /test_index/test_type/10?version=1520834740000&version_type=external
{
  "newadd":11,
  "test":"true"
}

Result:

{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "10",
  "_version": 1520834740000,
  "found": true,
  "_source": {
    "newadd": 11,
    "test": "true"
  }
}

So the simplest implementation is to use the current timestamp as the version number on every update.
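
To make the external-version rule concrete, here is a small in-memory sketch in Python. It does not talk to Elasticsearch; ExternalVersionStore, VersionConflict, and timestamp_version are hypothetical names that only mimic the "supplied version must be strictly greater than the stored one" check described above.

```python
import time


class VersionConflict(Exception):
    """Raised when the supplied version is not greater than the stored one."""


class ExternalVersionStore:
    """Toy document store that applies the external-versioning rule."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        stored = self._docs.get(doc_id)
        # Reject versions that are less than or equal to the stored version.
        if stored is not None and version <= stored[0]:
            raise VersionConflict(
                f"doc {doc_id}: version {version} <= stored {stored[0]}"
            )
        self._docs[doc_id] = (version, source)
        return version

    def get(self, doc_id):
        return self._docs[doc_id]


def timestamp_version():
    """Current time in milliseconds, used as the external version number."""
    return int(time.time() * 1000)


if __name__ == "__main__":
    store = ExternalVersionStore()
    store.index("10", {"newadd": 11, "test": "true"}, timestamp_version())
    try:
        store.index("10", {"newadd": 12}, 1)  # a stale version is rejected
    except VersionConflict as exc:
        print("conflict:", exc)
```

Under this scheme a write carrying an older or equal timestamp is still rejected, so the caller has to decide whether to drop it or retry; two writers that land on the same millisecond will also conflict.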

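3.5 Dynamic mapping vs. static mapping

Dynamic mapping: when a document is written to ES, ES detects each field's type automatically;
Static mapping: the field properties are set by hand before any data is written.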
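
As a rough illustration of what "detects each field's type automatically" means, the Python sketch below guesses a field type from a JSON value. It is a simplification based on my understanding of the default dynamic-mapping rules; guess_dynamic_type is a hypothetical helper, and it deliberately ignores date detection and the keyword sub-field that ES adds to dynamically mapped strings.

```python
def guess_dynamic_type(value):
    """Guess the field type that dynamic mapping would pick for a JSON value.

    Simplified: date detection and numeric detection on strings are ignored.
    Returns None when no field would be added (null or an all-null array).
    """
    if value is None:
        return None
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return guess_dynamic_type(item)
        return None
    if isinstance(value, str):
        return "text"
    raise TypeError(f"unsupported JSON value: {value!r}")


if __name__ == "__main__":
    doc = {"no": 1, "name": "奔驰X100", "price": 1000000, "tags": ["red"]}
    for field, value in doc.items():
        print(field, "->", guess_dynamic_type(value))
```

With a static mapping you would instead declare these types up front, which is how price could be given a narrower type than the long that dynamic mapping picks.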

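3.6 What is special about text fields

text fields are not used for sorting and are rarely used for aggregations (the terms aggregation being an exception; future versions will prohibit aggregations on text fields entirely).
As an aside: if you do need sorting or aggregations, use multi-fields to add a keyword sub-field.
Official documentation:

http://t.cn/R6jy9Z3, http://t.cn/RnKU4tG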
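
A minimal sketch of the multi-fields idea, written as a Python dict that serializes to a 6.x-style mapping body. The index layout (test_type, addr, name) is borrowed from the examples in this post; text_with_keyword is a hypothetical helper, and the ignore_above value of 256 is an assumption mirroring what dynamic mapping uses by default.

```python
import json


def text_with_keyword(ignore_above=256):
    """A text field (for full-text search) with a keyword sub-field
    (for sorting, aggregations, and exact term matches)."""
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


# Body for creating the index, in the ES 6.x layout with a mapping type name.
mapping = {
    "mappings": {
        "test_type": {
            "properties": {
                "addr": text_with_keyword(),
                "name": text_with_keyword(),
            }
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(mapping, ensure_ascii=False, indent=2))
```

With such a mapping, full-text queries go against addr while sorting, aggregations, and term filters go against addr.keyword, which is the field name used in the constant_score example in 3.11.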

3.7 Choosing numeric types for storage

For numeric fields, choose the narrowest numeric type that still meets your requirements.
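
One way to apply this advice for whole numbers is to look at the expected value range and pick the narrowest of the signed integer types byte, short, integer, and long (8, 16, 32, and 64 bits). The helper below is a hypothetical sketch, not an Elasticsearch API.

```python
# Signed integer field types, narrowest first, with their bit widths.
INTEGER_TYPES = [("byte", 8), ("short", 16), ("integer", 32), ("long", 64)]


def smallest_integer_type(min_value, max_value):
    """Return the narrowest integer type that can hold [min_value, max_value]."""
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    for name, bits in INTEGER_TYPES:
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")


if __name__ == "__main__":
    # The price values in this post (up to 1000000) fit in an integer.
    print(smallest_integer_type(0, 1000000))
```

The range passed in should cover values you expect in the future as well, since changing a field's type later means reindexing.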

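3.8 Filtering vs. searching

Filtering: documents are only filtered against the conditions; no score is computed;
Searching: addresses the question of relevance.

When a user submits a query, Elasticsearch uses a ranking model to compute the relevance between each document and the query terms, sorts by score, and returns the most relevant documents to the user.

In more detail: after receiving the query terms, Elasticsearch looks them up in the inverted index, uses the postings lists maintained there to find the set of documents for each term, then scores, sorts, and highlights them before returning the search results to the user.

Note: ES sorts by the relevance between query and document; by default, results are sorted by score in descending order.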
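
The difference can be shown with a toy inverted index in Python. This is only a sketch of the idea: it tokenizes on whitespace and uses summed term frequency as the score, whereas Elasticsearch uses analyzers and a proper similarity model (BM25 by default in the versions discussed here). All function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def filter_docs(index, term):
    """Filtering: a yes/no match on the postings list, no score computed."""
    return sorted(index.get(term, {}))


def search_docs(index, terms):
    """Searching: score each matching doc, then sort by score descending."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # toy score: summed term frequency
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = {1: "ford usa", 2: "usa usa truck", 3: "benz germany"}
    index = build_inverted_index(docs)
    print(filter_docs(index, "usa"))
    print(search_docs(index, ["usa", "truck"]))
```

Because a filter has no score to compute, its result is just a set of matching documents, which is part of why filters are cheaper than scored queries.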

3.9 Setting per-field weights in a search

In the fields list, addr^5 boosts matches on addr by a factor of 5 relative to name.

GET _search
{
  "query":{
    "multi_match": {
      "query": "美国",
      "fields": ["addr^5", "name"]
    }
  }
}

3.10 Returning documents in which a field has at least one non-null value

GET _search
{
  "query":{
    "exists":{
      "field":"name"
    }
  }
}

3.11 Constant-score search

GET /_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                "term" : { "addr.keyword" : "美国"}
            },
            "boost" : 1.2
        }
    }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 32,
    "successful": 32,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.2,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "5",
        "_score": 1.2,
        "_source": {
          "no": 5,
          "name": "福特500",
          "addr": "美国",
          "price": 180000
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": 1.2,
        "_source": {
          "no": 6,
          "name": null,
          "addr": "美国",
          "price": 180000
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "3",
        "_score": 1.2,
        "_source": {
          "no": 3,
          "name": "福特300",
          "addr": "美国",
          "price": 300000
        }
      }
    ]
  }
}

3.12 Modifying document scores

Use the function_score query.
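
The post gives no example here, so the following is only a hedged sketch of what a function_score request body might look like, built as a Python dict. The choice of field_value_factor on the price field, the log1p modifier, and the factor value are all assumptions made for illustration against the sample data in this post; build_function_score_body is a hypothetical helper.

```python
import json


def build_function_score_body(query_text, factor=0.001):
    """Build a search body that multiplies the match score by a function of price.

    Assumption: documents have a numeric "price" field, as in the samples above.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {"addr": query_text}},
                "field_value_factor": {
                    "field": "price",
                    "factor": factor,
                    "modifier": "log1p",
                    "missing": 1,
                },
                # Combine the query score and the function result by multiplying.
                "boost_mode": "multiply",
            }
        }
    }


if __name__ == "__main__":
    body = build_function_score_body("美国")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON could be sent as the body of a _search request; other functions, such as script_score or the decay functions, slot into the same structure.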

3.13 Finding similar articles

{
  "query": {
    "more_like_this": {
      "fields": [
        "title"
      ],
      "like": "新时代的领路人",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  },
  "_source": "title",
  "from": 1000,
  "size": 5
}

3.14 Script queries

The following was verified on 6.X.
On 5.X, change source to inline.

POST test_index/_search
{
  "query":{
    "bool":{
      "must":{
        "script":{
          "script":{
            "source": "doc['price'].value > 100000",
            "lang":"painless"
          }
        }
      }
    }
  }
}

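3.15 Highlighting multiple fields

Highlighting a single field is familiar by now. Here is another scenario: when I search the title field, I also want highlights in title, content, and abstr. How can that be done?

Put simply:
a field can be highlighted along with the results even though it is not the field being searched. Setting require_field_match to false is what enables this.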

POST test_index/test_type/_search
{
  "query":{
    "match_phrase":{
      "addr":"美国"
    }
  },
  "highlight": {
    "require_field_match":false,
    "fields":{
      "addr":{
        "pre_tags":["<strong>"],
        "post_tags":["</strong>"]
      },
      "name":{
        "pre_tags":["<strong>"],
        "post_tags":["</strong>"]
      }
    }
  }
}

Result:

{
  "took": 116,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.1143606,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": 1.1143606,
        "_source": {
          "no": 6,
          "name": "大片美国",
          "addr": "美国",
          "price": 180000
        },
        "highlight": {
          "name": [
            "大片<strong>美</strong><strong>国</strong>"
          ],
          "addr": [
            "<strong>美</strong><strong>国</strong>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "no": 5,
          "name": "福特500",
          "addr": "美国",
          "price": 180000
        },
        "highlight": {
          "addr": [
            "<strong>美</strong><strong>国</strong>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "no": 3,
          "name": "福特300",
          "addr": "美国",
          "price": 300000
        },
        "highlight": {
          "addr": [
            "<strong>美</strong><strong>国</strong>"
          ]
        }
      }
    ]
  }
}