Elasticsearch


AI Summary: This article covers Elasticsearch index management and analyzer usage: how to create an index and define a custom analyzer in its settings. It then lists the common built-in analyzers (Standard, Simple, Stop, Whitespace, Language, Pattern, Keyword, plus custom analyzers) and walks through _analyze examples showing how each one tokenizes text. Note in particular that the Pattern analyzer's default regular expression is \W+; a user-defined pattern (such as \| here) changes the tokenization accordingly.

Index Management

Creating an Index

A custom analyzer, custom_pattern, can be defined in the index settings when the index is created:

PUT {{myes}}/my_index  
{  
  "settings": {  
    "number_of_shards": 1,  
    "number_of_replicas": 1,  
    "analysis": {  
      "analyzer": {  
        "custom_pattern": {  
          "type": "pattern",  
          "pattern": "\\|"  
        }  
      }  
    }  
  },  
  "mappings": {  
    "properties": {  
      "title": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "content": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "timestamp": {  
        "type": "date"  
      }  
    }  
  }  
}  
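
The mappings above still analyze title and content with the standard analyzer, so custom_pattern is only registered, not used, at this point. To index a field with it, point that field's analyzer at custom_pattern. A minimal sketch using the update-mapping API, where the field name pipe_text is a hypothetical addition, not part of the original index:

PUT {{myes}}/my_index/_mapping
{
  "properties": {
    "pipe_text": {
      "type": "text",
      "analyzer": "custom_pattern"
    }
  }
}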

Common Elasticsearch built-in analyzers and their main uses:

Built-in analyzer   Purpose
Standard            The default analyzer; splits text on word boundaries and lowercases the terms
Simple              Splits on any character that is not a letter, drops those characters, and lowercases the terms
Stop                Like Simple, but also removes stop words such as "the", "a", "is" (see the sketch after this table)
Whitespace          Splits on whitespace only; does not lowercase
Language            A family of analyzers for common natural languages
Pattern             Splits on a regular expression; the default is \W+, i.e. any non-word character
Keyword             No tokenization; the whole input is emitted as a single token
Custom              A user-defined analyzer built from a tokenizer and token filters
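
The stop analyzer from the table is not demonstrated in the sections below, so here is a minimal sketch. With the default English stop word list, terms such as "this", "is", "a", "to" and "be" should be removed and the remaining terms lowercased, leaving roughly hello, document and analyzed:

GET {{myes}}/my_index/_analyze
{
  "analyzer": "stop",
  "text": "Hello, this is a document to be analyzed."
}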

Analyze API endpoint used for the following examples:

GET {{myes}}/my_index/_analyze  

Standard Analyzer

Description: the standard analyzer splits text on word boundaries (Unicode text segmentation) and lowercases the resulting terms; it is the default analyzer when none is specified.
Example

{  
  "analyzer": "standard",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "hello",  
      "start_offset": 0,  
      "end_offset": 5,  
      "type": "<ALPHANUM>",  
      "position": 0  
    },  
    {  
      "token": "this",  
      "start_offset": 7,  
      "end_offset": 11,  
      "type": "<ALPHANUM>",  
      "position": 1  
    },  
    {  
      "token": "is",  
      "start_offset": 12,  
      "end_offset": 14,  
      "type": "<ALPHANUM>",  
      "position": 2  
    },  
    {  
      "token": "a",  
      "start_offset": 15,  
      "end_offset": 16,  
      "type": "<ALPHANUM>",  
      "position": 3  
    },  
    {  
      "token": "document",  
      "start_offset": 17,  
      "end_offset": 25,  
      "type": "<ALPHANUM>",  
      "position": 4  
    },  
    {  
      "token": "to",  
      "start_offset": 26,  
      "end_offset": 28,  
      "type": "<ALPHANUM>",  
      "position": 5  
    },  
    {  
      "token": "be",  
      "start_offset": 29,  
      "end_offset": 31,  
      "type": "<ALPHANUM>",  
      "position": 6  
    },  
    {  
      "token": "analyzed",  
      "start_offset": 32,  
      "end_offset": 40,  
      "type": "<ALPHANUM>",  
      "position": 7  
    }  
  ]  
}  

Example 2

{  
  "analyzer": "standard",  
  "text": "aaa|bbb|ccc"  
}  

{  
  "tokens": [  
    {  
      "token": "aaa",  
      "start_offset": 0,  
      "end_offset": 3,  
      "type": "<ALPHANUM>",  
      "position": 0  
    },  
    {  
      "token": "bbb",  
      "start_offset": 4,  
      "end_offset": 7,  
      "type": "<ALPHANUM>",  
      "position": 1  
    },  
    {  
      "token": "ccc",  
      "start_offset": 8,  
      "end_offset": 11,  
      "type": "<ALPHANUM>",  
      "position": 2  
    }  
  ]  
}  

Example 3

{  
  "analyzer": "standard",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中",  
      "start_offset": 0,  
      "end_offset": 1,  
      "type": "<IDEOGRAPHIC>",  
      "position": 0  
    },  
    {  
      "token": "文",  
      "start_offset": 1,  
      "end_offset": 2,  
      "type": "<IDEOGRAPHIC>",  
      "position": 1  
    },  
    {  
      "token": "汉",  
      "start_offset": 3,  
      "end_offset": 4,  
      "type": "<IDEOGRAPHIC>",  
      "position": 2  
    },  
    {  
      "token": "字",  
      "start_offset": 4,  
      "end_offset": 5,  
      "type": "<IDEOGRAPHIC>",  
      "position": 3  
    },  
    {  
      "token": "测",  
      "start_offset": 6,  
      "end_offset": 7,  
      "type": "<IDEOGRAPHIC>",  
      "position": 4  
    },  
    {  
      "token": "试",  
      "start_offset": 7,  
      "end_offset": 8,  
      "type": "<IDEOGRAPHIC>",  
      "position": 5  
    }  
  ]  
}  

Simple Analyzer

Description: the simple analyzer splits text wherever it encounters a character that is not a letter (whitespace, digits, hyphens, underscores, commas, colons, semicolons, and so on), discards those characters, and lowercases every term. It works for many languages, as the Chinese example below shows.
Example

{  
  "analyzer": "simple",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中文",  
      "start_offset": 0,  
      "end_offset": 2,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "汉字",  
      "start_offset": 3,  
      "end_offset": 5,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "测试",  
      "start_offset": 6,  
      "end_offset": 8,  
      "type": "word",  
      "position": 2  
    }  
  ]  
}  
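
The Chinese example does not show the lowercasing or the handling of digits, so here is one more small request; the simple analyzer should drop the underscore and the digits and emit roughly hello, world and test:

GET {{myes}}/my_index/_analyze
{
  "analyzer": "simple",
  "text": "Hello World_2024 TEST"
}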

Whitespace Analyzer

Description: splits text on whitespace only; punctuation stays attached to the tokens and case is preserved (note "Hello," and "analyzed." in the output below).
Example

{  
  "analyzer": "whitespace",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "Hello,",  
      "start_offset": 0,  
      "end_offset": 6,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "this",  
      "start_offset": 7,  
      "end_offset": 11,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "is",  
      "start_offset": 12,  
      "end_offset": 14,  
      "type": "word",  
      "position": 2  
    },  
    {  
      "token": "a",  
      "start_offset": 15,  
      "end_offset": 16,  
      "type": "word",  
      "position": 3  
    },  
    {  
      "token": "document",  
      "start_offset": 17,  
      "end_offset": 25,  
      "type": "word",  
      "position": 4  
    },  
    {  
      "token": "to",  
      "start_offset": 26,  
      "end_offset": 28,  
      "type": "word",  
      "position": 5  
    },  
    {  
      "token": "be",  
      "start_offset": 29,  
      "end_offset": 31,  
      "type": "word",  
      "position": 6  
    },  
    {  
      "token": "analyzed.",  
      "start_offset": 32,  
      "end_offset": 41,  
      "type": "word",  
      "position": 7  
    }  
  ]  
}  

Keyword Analyzer

Description: performs no tokenization; the entire input is emitted as a single token.
Example

{  
  "analyzer": "keyword",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "Hello, this is a document to be analyzed.",  
      "start_offset": 0,  
      "end_offset": 41,  
      "type": "word",  
      "position": 0  
    }  
  ]  
}  
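
For exact-match fields such as IDs or tags, mappings normally use the keyword field type rather than a text field with the keyword analyzer; either way the value is indexed as a single term. A minimal sketch, where the field name tag is a hypothetical addition:

PUT {{myes}}/my_index/_mapping
{
  "properties": {
    "tag": {
      "type": "keyword"
    }
  }
}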

Pattern Analyzer

Description: a regex-based analyzer that splits on the configured pattern. The custom_pattern analyzer defined when the index was created (pattern \|) is used here, so the request must go through the index's _analyze endpoint.
Example

{  
  "analyzer": "custom_pattern",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中文",  
      "start_offset": 0,  
      "end_offset": 2,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "汉字",  
      "start_offset": 3,  
      "end_offset": 5,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "测试",  
      "start_offset": 6,  
      "end_offset": 8,  
      "type": "word",  
      "position": 2  
    }  
  ]  
}
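
The same pipe-splitting can also be tried without defining anything on an index: the _analyze API accepts an inline tokenizer definition (a name or an object), so a transient pattern tokenizer can be tested directly. A minimal sketch, with a lowercase filter added for illustration; the expected terms are aaa, bbb and ccc:

GET {{myes}}/_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\|"
  },
  "filter": ["lowercase"],
  "text": "AAA|bbb|CCC"
}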