Installing Elasticsearch and the IK Analyzer with Docker and docker-compose

葵花下的獾 · Published 2022-08-17 14:34:36

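1. Versions and reference documentation

1.1 Everything uses version 7.8.0

The versions used are elasticsearch:7.8.0, kibana:7.8.0, and the IK analyzer plugin elasticsearch-analysis-ik-7.8.0. The IK plugin release must match the Elasticsearch version exactly, which is why the same version is used throughout.

1.2 The installation steps follow the documents below. This setup is a local single-node instance, so treat the documents as reference only.

For details, see:

Official Elasticsearch documentation

Elasticsearch on Docker Hub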

2. Installing Elasticsearch

2.1 Pull the Elasticsearch image

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.8.0

2.2 Run a container from the Elasticsearch image

docker run --name es01 -d \
-p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" \
docker.elastic.co/elasticsearch/elasticsearch:7.8.0

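# -e sets an environment variable in the container;
# discovery.type=single-node starts Elasticsearch in single-node mode:
#   -e "discovery.type=single-node"
# --name sets the container name:
#   --name es01

To set up a cluster instead, see the official Elasticsearch documentation.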

2.3 Quick test

Open http://localhost:9200/ in a browser.

The returned JSON looks like this:

{
  "name" : "5663ac33f3ed",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "PisVNRb7QHmvjJNoK08HSQ",
  "version" : {
    "number" : "7.8.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date" : "2020-06-14T19:35:50.234439Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
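Elasticsearch can take a little while to start, so the browser check above may fail at first. The following is a minimal sketch of scripting the same check with curl. The function name wait_for_es, the retry count, and the delay are assumptions made for illustration, not part of the original setup.

```shell
# Poll the Elasticsearch root endpoint until it answers or the attempts run out.
# wait_for_es and its default values are illustrative assumptions.
wait_for_es() {
  url="${1:-http://localhost:9200/}"   # the endpoint opened in the browser above
  attempts="${2:-30}"                  # assumed number of retries
  delay="${3:-2}"                      # assumed seconds between retries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s hides progress output, -f makes curl return non-zero on HTTP errors
    if curl -sf "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A possible use is `wait_for_es && curl -s http://localhost:9200/`, which prints the JSON only once the node responds.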

2.4 Enter the container and find the directories worth mounting

Open a shell in the container and look at the Elasticsearch working directory:

# es01 is the name of the container started in step 2.2
docker exec -it es01 /bin/bash

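The directories we need, all under /usr/share/elasticsearch, are:

config: configuration files

data: data storage

plugins: plugins; this is where the IK analyzer plugin will go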

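2.5 Copy the container's configuration files to the host

# Step 2.4 left you inside the container, so exit it first; skip this if you have already exited
exit
# Copy the config directory to the host
docker cp es01:/usr/share/elasticsearch/config /Users/liqi/docker-compose/elasticsearch/config
# Stop and remove the current Elasticsearch container
docker stop es01
docker rm es01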
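The copy above needs the parent directory to exist on the host, and docker cp behaves differently when the destination config directory is already there (the files then land in config/config). A minimal sketch of preparing the host side beforehand follows; prepare_es_dirs is a hypothetical helper name, and the directory is passed in rather than hard-coded.

```shell
# prepare_es_dirs is a hypothetical helper; $1 is the host directory that will
# hold config, data and plugins (the article uses /Users/liqi/docker-compose/elasticsearch).
prepare_es_dirs() {
  es_home="$1"
  # data and plugins are bind-mounted in step 2.6; creating them also creates the parent.
  mkdir -p "$es_home/data" "$es_home/plugins" || return 1
  # config is deliberately not created: docker cp creates it when it copies.
  if [ -d "$es_home/config" ]; then
    echo "warning: $es_home/config already exists; docker cp would copy into config/config" >&2
  fi
}
```

For the paths used in this article, the call would be `prepare_es_dirs /Users/liqi/docker-compose/elasticsearch`, run before the docker cp command.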

2.6 Start the container again with the directories mounted

docker run --name es01 -d \
	-p 9200:9200 \
	-p 9300:9300 \
	-e "discovery.type=single-node" \
	-v /Users/liqi/docker-compose/elasticsearch/config:/usr/share/elasticsearch/config \
	-v /Users/liqi/docker-compose/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
	-v /Users/liqi/docker-compose/elasticsearch/data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:7.8.0

3. The docker-compose.yml file

version: '3.1'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    container_name: es01
    environment:
      - discovery.type=single-node
    volumes:
      - ./data:/usr/share/elasticsearch/data
      - ./plugins:/usr/share/elasticsearch/plugins
      - ./config:/usr/share/elasticsearch/config
    ports:
      - 9200:9200
      - 9300:9300
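The volumes in this file use relative paths, so the commands have to be run from the directory that contains docker-compose.yml together with config, data and plugins. The wrappers below are a sketch of the usual lifecycle commands. The function names and the COMPOSE variable are assumptions introduced here, while the subcommands themselves are standard docker-compose ones.

```shell
# COMPOSE is a hypothetical override hook; it defaults to the docker-compose CLI.
COMPOSE="${COMPOSE:-docker-compose}"

# Create and start the es01 service in the background.
es_up()   { $COMPOSE up -d es01; }
# List the project's containers and their state.
es_ps()   { $COMPOSE ps; }
# Show the last 50 log lines of es01, useful when the node does not come up.
es_logs() { $COMPOSE logs --tail=50 es01; }
# Stop and remove the project's containers.
es_down() { $COMPOSE down; }
```

A typical session would be `es_up`, then `es_ps` or `es_logs` to confirm the node started, and `es_down` when finished.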

4. Testing and using the analyzer

4.1 Test with Postman before installing the Chinese analyzer

GET http://localhost:9200/_analyze
{
    "text":"测试数据"
}

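In the result, every Chinese character is its own token. With no analyzer specified, Elasticsearch uses its standard analyzer, which does not segment the text into Chinese words: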

{
    "tokens": [
        {
            "token": "测",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "试",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "数",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "据",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        }
    ]
}
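The same request can be sent from a terminal instead of Postman. The sketch below assumes curl is available; analyze_body and analyze are hypothetical helper names, and the body builder does no JSON escaping, so it only suits text without double quotes or backslashes.

```shell
# analyze_body and analyze are hypothetical helper names.
# Build the JSON request body; an optional second argument names an analyzer.
# No JSON escaping is done, so the text must not contain " or \ characters.
analyze_body() {
  if [ -n "${2:-}" ]; then
    printf '{"text":"%s","analyzer":"%s"}' "$1" "$2"
  else
    printf '{"text":"%s"}' "$1"
  fi
}

# Send the body to the _analyze endpoint of the local node.
analyze() {
  curl -s -X GET "http://localhost:9200/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "${2:-}")"
}
```

Calling `analyze "测试数据"` sends the request shown in this step; a second argument fills the request's analyzer field.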

4.2 Install the IK Chinese analyzer plugin

Download it from:

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.8.0

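After downloading, unzip it into its own subdirectory of the mounted plugins directory (here plugins/elasticsearch-analysis-ik-7.8.0).

Then restart the Elasticsearch container: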

docker restart es01
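If the plugin is not picked up after the restart, one thing to check is the directory layout on the host: Elasticsearch expects plugin-descriptor.properties directly inside the plugin's own directory, not in a nested folder. The following is a small sketch of that check. The helper name check_ik_layout is an assumption, and the directory name is the one used later in this article.

```shell
# check_ik_layout is a hypothetical helper; $1 is the host plugins directory.
check_ik_layout() {
  plugin_dir="$1/elasticsearch-analysis-ik-7.8.0"
  # Elasticsearch reads plugin-descriptor.properties from the plugin's own directory.
  if [ ! -f "$plugin_dir/plugin-descriptor.properties" ]; then
    echo "missing $plugin_dir/plugin-descriptor.properties" >&2
    return 1
  fi
  # The IK dictionaries and IKAnalyzer.cfg.xml live in the plugin's config directory.
  if [ ! -d "$plugin_dir/config" ]; then
    echo "missing $plugin_dir/config" >&2
    return 1
  fi
  echo "layout ok: $plugin_dir"
}
```

With the paths from step 2.6 this would be called as `check_ik_layout /Users/liqi/docker-compose/elasticsearch/plugins`.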

Add the new request parameter "analyzer":"ik_max_word":

GET http://localhost:9200/_analyze
{
    "text":"测试数据",
    "analyzer":"ik_max_word"
}

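ik_max_word: splits the text at the finest granularity

ik_smart: splits the text at the coarsest granularity

The returned JSON shows that the text is now segmented into Chinese words: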

{
    "tokens": [
        {
            "token": "测试数据",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "测试",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "数据",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

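4.3 Custom dictionary for the IK Chinese analyzer

In real-world segmentation there are common terms that should not be split (drug names, foreign place names, product brands such as 蓝月亮, and so on) and that need to be searched as a whole. For these we define our own dictionary entries.

In the plugin's config directory

/Users/liqi/docker-compose/elasticsearch/plugins/elasticsearch-analysis-ik-7.8.0/config

the dictionary files use the .dic extension; our custom dictionary follows the same naming convention.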

# Create the dictionary file; note that it must be created in the elasticsearch-analysis-ik-7.8.0/config directory
touch custom.dic
# Edit the file and add the phrase
vim custom.dic

Here I added the phrase 蓝月亮.
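The dictionary is a plain UTF-8 text file with one term per line, so it can also be written without an editor. The sketch below shows one way to do that; write_custom_dict is a hypothetical helper name, and it overwrites any existing custom.dic in the given directory.

```shell
# write_custom_dict is a hypothetical helper: $1 is the IK config directory,
# the remaining arguments are the terms that should be kept whole.
# It overwrites an existing custom.dic.
write_custom_dict() {
  dict_dir="$1"
  shift
  # One term per line; printf repeats the format for every remaining argument.
  printf '%s\n' "$@" > "$dict_dir/custom.dic"
}
```

For this article the call would be `write_custom_dict /Users/liqi/docker-compose/elasticsearch/plugins/elasticsearch-analysis-ik-7.8.0/config 蓝月亮`; further terms can be appended as extra arguments.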

In the configuration file IKAnalyzer.cfg.xml, put the custom file name inside the <entry key="ext_dict"></entry> element:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">custom.dic</entry>
	<!-- Configure your own extension stop-word dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
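The same change to IKAnalyzer.cfg.xml can be scripted. This is a rough sketch using sed rather than an XML tool. It assumes the ext_dict entry sits on a single line, as in the file above, and that the dictionary name contains no `|` or `&` characters. set_ext_dict is a hypothetical helper name.

```shell
# set_ext_dict is a hypothetical helper: $1 is the path to IKAnalyzer.cfg.xml,
# $2 is the dictionary file name, relative to the plugin's config directory.
set_ext_dict() {
  cfg="$1"
  dict="$2"
  # Replace whatever sits between the ext_dict tags. The result goes to a temp
  # file first because in-place sed flags differ between GNU and BSD/macOS sed.
  sed "s|<entry key=\"ext_dict\">[^<]*</entry>|<entry key=\"ext_dict\">$dict</entry>|" \
    "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}
```

The pattern starts at `<entry key="ext_dict">`, so the ext_stopwords entry and the commented-out remote_ext_dict entry are left untouched.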

Restart the container:

docker restart es01

Test the custom phrase:

GET http://localhost:9200/_analyze
{
    "text":"蓝月亮",
    "analyzer":"ik_max_word"
}

The returned JSON now contains 蓝月亮 as a single token:

{
    "tokens": [
        {
            "token": "蓝月亮",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "月亮",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}