第三节 IK中文分词器的安装

亮子 2021-06-15 03:44:55

1、IK分词器的下载

# 下载地址
https://github.com/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-ik/releases

解压到elasticsearch的plug目录即可，版本一定要保持一致。建议不要下载最新版本，因为最新版不一定有相匹配的ik分词器。

图片alt

2、安装分词器

把elasticsearch-analysis-ik-7.8.1.zip解压到elasticsearch-7.8.1\plugins目录下，如下图：

ik安装

然后运行elasticsearch.bat，如果没有出错，那就说明ik分词器安装正确了。

注意：解压elasticsearch-analysis-ik-7.8.1.zip文件后，一定要把zip文件删除，否则运行会出错

3、查看插件安装状态

在浏览器输入下面地址，即可查看当前的es都安装了哪些插件：

http://localhost:9200/_cat/plugins

图片alt

4、测试分词功能

使用crul命令，输入下面的URL地址，验证分词器是否成功。

$ curl -X GET -H "Content-Type: application/json"  "http://localhost:9200/_analyze?pretty=true" -d'{"text":"中华五千年华夏"}';

当然也可以使用postman来进行测试，具体测试一下：

请求方法： GET
请求地址： http://localhost:9200/_analyze?pretty=true
请求体： body 为 json

具体如下图：

图片alt

返回结果如下：

{
    "tokens": [
        {
            "token": "中",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "华",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "五",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "千",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "年",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "华",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "夏",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}

从返回结果看，ik分词器把每个字都进行了拆分。但是有时候我们不需要分拆这么细，那么怎么来控制呢？这就需要设置分词器的细度颗粒来实现了。

如果有kibana的话，可以使用下面的语句来查询

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中国共产党"
}

图片alt

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中国共产党"
}

图片alt

5、IK分词器的粒度

ik_max_word和ik_smart

ik_max_word: 将文本按最细粒度的组合来拆分，比如会将“中华五千年华夏”拆分为“五千年、五千、五千年华、华夏、千年华夏”，总之是可能的组合；

ik_smart: 最粗粒度的拆分，比如会将“五千年华夏”拆分为“五千年、华夏”

1）、ik_smart分词

在JSON格式中添加**analyzer**节点内容为**ik_smart**

$ curl -X GET -H "Content-Type: application/json"  "http://localhost:9200/_analyze?pretty=true" -d'{"text":"中华五千年华夏","analyzer": "ik_smart"}';

2）、ik_max_word分词

在JSON格式中添加**analyzer**节点内容为**ik_max_word**

$ curl -X GET -H "Content-Type: application/json"  "http://localhost:9200/_analyze?pretty=true" -d'{"text":"中华五千年华夏","analyzer": "ik_max_word"}';

博主

标签

专辑