📅  Last modified: 2020-10-30 14:20:59             🧑  Author: Mango
When a query is processed during a search operation, the analysis module analyzes the content of any index. The module consists of analyzers, tokenizers, token filters and character filters. If no analyzer is defined, the built-in analyzers, tokenizers, token filters and character filters are registered with the analysis module by default.
In the following example we use the standard analyzer, which is the one used when no other analyzer is specified. It analyzes the sentence based on grammar and produces the words used in it.
POST _analyze
{
  "analyzer": "standard",
  "text": "Today's weather is beautiful"
}
On running the above code, we get the response as shown below:
{
  "tokens" : [
    {
      "token" : "today's",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "weather",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "beautiful",
      "start_offset" : 19,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
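The standard analyzer used above is a prebuilt combination of the components mentioned at the start: character filters, a tokenizer and token filters. The _analyze API also accepts such a combination ad hoc, which makes the role of each component visible. The request below is a minimal sketch using only built-in components (html_strip, standard, lowercase) and is not part of the original example:

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Today's Weather</b> is beautiful"
}

Here html_strip removes the <b> tags before tokenization, the standard tokenizer splits the remaining text into words, and the lowercase token filter normalizes the resulting tokens.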
We can configure the standard analyzer with various parameters to meet our custom requirements.
In the following example, we configure the standard analyzer to have a max_token_length of 5.
For this, we first create an index with an analyzer that has the max_token_length parameter.
PUT index_4_analysis
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
Next, we apply the analyzer to a text as shown below. Note that the token "is" does not appear in the response: since the analyzer was created with stopwords set to _english_, "is" is removed as an English stop word (its position, 4, is simply skipped in the output). Also, because max_token_length is 5, any token longer than five characters is cut at that length, so "today's" becomes "today" and "s", "weather" becomes "weath" and "er", and "beautiful" becomes "beaut" and "iful".
POST index_4_analysis/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "Today's weather is beautiful"
}
On running the above code, we get the response as shown below:
{
  "tokens" : [
    {
      "token" : "today",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "weath",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "er",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "beaut",
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "iful",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}
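To use such a custom analyzer at index time, it must be attached to a field in the index mapping. A minimal sketch, assuming a hypothetical text field named title on the same index:

PUT index_4_analysis/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_english_analyzer"
    }
  }
}

Documents indexed into the title field will then be analyzed with my_english_analyzer instead of the default standard analyzer.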
A list of various analyzers and their descriptions is given in the table below:
S.No | Analyzer & Description
---|---
1 | Standard analyzer (standard): the stopwords and max_token_length settings can be configured for this analyzer. By default, the stopwords list is empty and max_token_length is 255.
2 | Simple analyzer (simple): this analyzer is composed of a lowercase tokenizer.
3 | Whitespace analyzer (whitespace): this analyzer is composed of a whitespace tokenizer.
4 | Stop analyzer (stop): stopwords and stopwords_path can be configured. By default, stopwords is initialized to English stop words and stopwords_path contains the path to a text file with stop words.
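A quick way to compare the analyzers above is to run the same sentence from the earlier examples through each of them. For instance, the stop analyzer combines a lowercase tokenizer with English stop-word removal; this request is an illustration, not part of the original examples:

POST _analyze
{
  "analyzer": "stop",
  "text": "Today's weather is beautiful"
}

The output contains only today, s, weather and beautiful: the stop word "is" is dropped, and the apostrophe splits "today's" because the lowercase tokenizer breaks on every non-letter character.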
Tokenizers are used for generating tokens from a text in Elasticsearch. Text can be broken down into tokens by taking whitespace or other punctuation into account. Elasticsearch has plenty of built-in tokenizers, which can be used in custom analyzers.
An example of a tokenizer that breaks text into terms whenever it encounters a character which is not a letter, and which also lowercases all terms, is shown below:
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "It Was a Beautiful Weather 5 Days ago."
}
On running the above code, we get the response as shown below:
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "was",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "beautiful",
      "start_offset" : 9,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "weather",
      "start_offset" : 19,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "days",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "ago",
      "start_offset" : 34,
      "end_offset" : 37,
      "type" : "word",
      "position" : 6
    }
  ]
}
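For contrast, the whitespace tokenizer splits only on whitespace and leaves everything else untouched; a minimal sketch with the same sentence:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "It Was a Beautiful Weather 5 Days ago."
}

This time the output keeps the original capitalization of "It", "Was" and "Beautiful", keeps the token "5" (which the lowercase tokenizer above dropped because digits are not letters), and emits "ago." with its trailing period attached.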
A list of tokenizers and their descriptions is shown in the table below:
S.No | Tokenizer & Description
---|---
1 | Standard tokenizer (standard): this is built on a grammar-based tokenizer, and max_token_length can be configured for it.
2 | Edge NGram tokenizer (edgeNGram): settings like min_gram, max_gram and token_chars can be set for this tokenizer.
3 | Keyword tokenizer (keyword): this generates the entire input as a single output token; buffer_size can be set for it.
4 | Letter tokenizer (letter): this captures a whole word until a non-letter character is encountered.
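To show how a configurable tokenizer from the table is wired into a custom analyzer, here is a sketch using the edge n-gram tokenizer (registered as edge_ngram in current Elasticsearch versions; the older name edgeNGram is deprecated). The index and component names are made up for this illustration:

PUT index_5_tokenizer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": ["letter"]
        }
      },
      "analyzer": {
        "my_autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Analyzing a word such as "Weather" with my_autocomplete_analyzer then yields the prefixes we, wea, weat and weath, the usual building block for autocomplete-style search.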