Elasticsearch


AI Summary: This article covers Elasticsearch index management and analyzer usage: how to create an index and define a custom analyzer in its settings. It then lists the common built-in analyzers (Standard, Simple, Stop, Whitespace, Language, Pattern, Keyword, plus custom analyzers) and walks through _analyze examples showing how each one tokenizes text. Note in particular that the Pattern analyzer's default regular expression is \W+; a user-defined pattern (such as \| here) changes the tokenization accordingly.

Index Management

Creating an Index

A custom analyzer, custom_pattern, can be defined in the index settings when the index is created:

PUT {{myes}}/my_index  
{  
  "settings": {  
    "number_of_shards": 1,  
    "number_of_replicas": 1,  
    "analysis": {  
      "analyzer": {  
        "custom_pattern": {  
          "type": "pattern",  
          "pattern": "\\|"  
        }  
      }  
    }  
  },  
  "mappings": {  
    "properties": {  
      "title": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "content": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "timestamp": {  
        "type": "date"  
      }  
    }  
  }  
}  
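
The mappings above still analyze title and content with the standard analyzer, so custom_pattern is only registered, not used, at this point. To index a field with it, point that field's analyzer at custom_pattern. A minimal sketch using the update-mapping API, where the field name pipe_text is a hypothetical addition, not part of the original index:

PUT {{myes}}/my_index/_mapping
{
  "properties": {
    "pipe_text": {
      "type": "text",
      "analyzer": "custom_pattern"
    }
  }
}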

Common Elasticsearch built-in analyzers and their main uses:

Built-in analyzer   Purpose
Standard            The default analyzer; splits text on word boundaries and lowercases the terms
Simple              Splits on any character that is not a letter, drops those characters, and lowercases the terms
Stop                Like Simple, but also removes stop words such as "the", "a", "is" (see the sketch after this table)
Whitespace          Splits on whitespace only; does not lowercase
Language            A family of analyzers for common natural languages
Pattern             Splits on a regular expression; the default is \W+, i.e. any non-word character
Keyword             No tokenization; the whole input is emitted as a single token
Custom              A user-defined analyzer built from a tokenizer and token filters
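
The stop analyzer from the table is not demonstrated in the sections below, so here is a minimal sketch. With the default English stop word list, terms such as "this", "is", "a", "to" and "be" should be removed and the remaining terms lowercased, leaving roughly hello, document and analyzed:

GET {{myes}}/my_index/_analyze
{
  "analyzer": "stop",
  "text": "Hello, this is a document to be analyzed."
}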

Analyze API endpoint used for the following examples:

GET {{myes}}/my_index/_analyze  

Standard Analyzer

Description: the standard analyzer splits text on word boundaries (Unicode text segmentation) and lowercases the resulting terms; it is the default analyzer when none is specified.
Example

{  
  "analyzer": "standard",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "hello",  
      "start_offset": 0,  
      "end_offset": 5,  
      "type": "<ALPHANUM>",  
      "position": 0  
    },  
    {  
      "token": "this",  
      "start_offset": 7,  
      "end_offset": 11,  
      "type": "<ALPHANUM>",  
      "position": 1  
    },  
    {  
      "token": "is",  
      "start_offset": 12,  
      "end_offset": 14,  
      "type": "<ALPHANUM>",  
      "position": 2  
    },  
    {  
      "token": "a",  
      "start_offset": 15,  
      "end_offset": 16,  
      "type": "<ALPHANUM>",  
      "position": 3  
    },  
    {  
      "token": "document",  
      "start_offset": 17,  
      "end_offset": 25,  
      "type": "<ALPHANUM>",  
      "position": 4  
    },  
    {  
      "token": "to",  
      "start_offset": 26,  
      "end_offset": 28,  
      "type": "<ALPHANUM>",  
      "position": 5  
    },  
    {  
      "token": "be",  
      "start_offset": 29,  
      "end_offset": 31,  
      "type": "<ALPHANUM>",  
      "position": 6  
    },  
    {  
      "token": "analyzed",  
      "start_offset": 32,  
      "end_offset": 40,  
      "type": "<ALPHANUM>",  
      "position": 7  
    }  
  ]  
}  

Example 2

{  
  "analyzer": "standard",  
  "text": "aaa|bbb|ccc"  
}  

{  
  "tokens": [  
    {  
      "token": "aaa",  
      "start_offset": 0,  
      "end_offset": 3,  
      "type": "<ALPHANUM>",  
      "position": 0  
    },  
    {  
      "token": "bbb",  
      "start_offset": 4,  
      "end_offset": 7,  
      "type": "<ALPHANUM>",  
      "position": 1  
    },  
    {  
      "token": "ccc",  
      "start_offset": 8,  
      "end_offset": 11,  
      "type": "<ALPHANUM>",  
      "position": 2  
    }  
  ]  
}  

Example 3

{  
  "analyzer": "standard",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中",  
      "start_offset": 0,  
      "end_offset": 1,  
      "type": "<IDEOGRAPHIC>",  
      "position": 0  
    },  
    {  
      "token": "文",  
      "start_offset": 1,  
      "end_offset": 2,  
      "type": "<IDEOGRAPHIC>",  
      "position": 1  
    },  
    {  
      "token": "汉",  
      "start_offset": 3,  
      "end_offset": 4,  
      "type": "<IDEOGRAPHIC>",  
      "position": 2  
    },  
    {  
      "token": "字",  
      "start_offset": 4,  
      "end_offset": 5,  
      "type": "<IDEOGRAPHIC>",  
      "position": 3  
    },  
    {  
      "token": "测",  
      "start_offset": 6,  
      "end_offset": 7,  
      "type": "<IDEOGRAPHIC>",  
      "position": 4  
    },  
    {  
      "token": "试",  
      "start_offset": 7,  
      "end_offset": 8,  
      "type": "<IDEOGRAPHIC>",  
      "position": 5  
    }  
  ]  
}  

Simple Analyzer

Description: the simple analyzer splits text wherever it encounters a character that is not a letter (whitespace, digits, hyphens, underscores, commas, colons, semicolons, and so on), discards those characters, and lowercases every term. It works for many languages, as the Chinese example below shows.
Example

{  
  "analyzer": "simple",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中文",  
      "start_offset": 0,  
      "end_offset": 2,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "汉字",  
      "start_offset": 3,  
      "end_offset": 5,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "测试",  
      "start_offset": 6,  
      "end_offset": 8,  
      "type": "word",  
      "position": 2  
    }  
  ]  
}  
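
The Chinese example does not show the lowercasing or the handling of digits, so here is one more small request; the simple analyzer should drop the underscore and the digits and emit roughly hello, world and test:

GET {{myes}}/my_index/_analyze
{
  "analyzer": "simple",
  "text": "Hello World_2024 TEST"
}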

Whitespace Analyzer

Description: splits text on whitespace only; punctuation stays attached to the tokens and case is preserved (note "Hello," and "analyzed." in the output below).
Example

{  
  "analyzer": "whitespace",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "Hello,",  
      "start_offset": 0,  
      "end_offset": 6,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "this",  
      "start_offset": 7,  
      "end_offset": 11,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "is",  
      "start_offset": 12,  
      "end_offset": 14,  
      "type": "word",  
      "position": 2  
    },  
    {  
      "token": "a",  
      "start_offset": 15,  
      "end_offset": 16,  
      "type": "word",  
      "position": 3  
    },  
    {  
      "token": "document",  
      "start_offset": 17,  
      "end_offset": 25,  
      "type": "word",  
      "position": 4  
    },  
    {  
      "token": "to",  
      "start_offset": 26,  
      "end_offset": 28,  
      "type": "word",  
      "position": 5  
    },  
    {  
      "token": "be",  
      "start_offset": 29,  
      "end_offset": 31,  
      "type": "word",  
      "position": 6  
    },  
    {  
      "token": "analyzed.",  
      "start_offset": 32,  
      "end_offset": 41,  
      "type": "word",  
      "position": 7  
    }  
  ]  
}  

Keyword Analyzer

Description: performs no tokenization; the entire input is emitted as a single token.
Example

{  
  "analyzer": "keyword",  
  "text": "Hello, this is a document to be analyzed."  
}  

{  
  "tokens": [  
    {  
      "token": "Hello, this is a document to be analyzed.",  
      "start_offset": 0,  
      "end_offset": 41,  
      "type": "word",  
      "position": 0  
    }  
  ]  
}  
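
For exact-match fields such as IDs or tags, mappings normally use the keyword field type rather than a text field with the keyword analyzer; either way the value is indexed as a single term. A minimal sketch, where the field name tag is a hypothetical addition:

PUT {{myes}}/my_index/_mapping
{
  "properties": {
    "tag": {
      "type": "keyword"
    }
  }
}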

Pattern Analyzer

Description: a regex-based analyzer that splits on the configured pattern. The custom_pattern analyzer defined when the index was created (pattern \|) is used here, so the request must go through the index's _analyze endpoint.
Example

{  
  "analyzer": "custom_pattern",  
  "text": "中文|汉字|测试"  
}  

{  
  "tokens": [  
    {  
      "token": "中文",  
      "start_offset": 0,  
      "end_offset": 2,  
      "type": "word",  
      "position": 0  
    },  
    {  
      "token": "汉字",  
      "start_offset": 3,  
      "end_offset": 5,  
      "type": "word",  
      "position": 1  
    },  
    {  
      "token": "测试",  
      "start_offset": 6,  
      "end_offset": 8,  
      "type": "word",  
      "position": 2  
    }  
  ]  
}
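
The same pipe-splitting can also be tried without defining anything on an index: the _analyze API accepts an inline tokenizer definition (a name or an object), so a transient pattern tokenizer can be tested directly. A minimal sketch, with a lowercase filter added for illustration; the expected terms are aaa, bbb and ccc:

GET {{myes}}/_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\|"
  },
  "filter": ["lowercase"],
  "text": "AAA|bbb|CCC"
}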