📅  Last modified: 2020-11-12 04:47:49             🧑  Author: Mango
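In the previous chapters, we saw that Lucene's IndexWriter uses an Analyzer to analyze documents and then creates, opens, or edits indexes as required. In this chapter, we discuss the various types of Analyzer objects and the other related objects used during the analysis process. Understanding the analysis process and how analyzers work will give you deeper insight into how Lucene indexes documents.
The following objects are discussed in this chapter.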
S.No. | Class & Description |
---|---|
1 | Token: Represents a piece of text or a word in a document, together with metadata such as its start offset, end offset, token type, and position increment. |
2 | TokenStream: The output of the analysis process, consisting of a series of tokens. It is an abstract class. |
3 | Analyzer: The abstract base class for every type of analyzer. |
4 | WhitespaceAnalyzer: Splits the text in a document on whitespace. |
5 | SimpleAnalyzer: Splits the text in a document on non-letter characters and converts the text to lowercase. |
6 | StopAnalyzer: Works like SimpleAnalyzer and additionally removes common words such as 'a', 'an', and 'the'. |
7 | StandardAnalyzer: The most sophisticated of the analyzers listed here. It tokenizes text with grammar-based rules, lowercases each token, and strips punctuation; depending on the Lucene version, it also recognizes names and email addresses and removes common words. |
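
To make the differences between the first three concrete analyzers easier to see, here is a minimal plain-Java sketch that imitates their described behavior. It deliberately does not use Lucene's classes: the class name `AnalyzerSketch`, its methods, and the three-word stop list are all hypothetical and exist only for illustration. The real analyzers produce a TokenStream with offsets and position increments, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java imitation of three analyzer behaviors. These are NOT Lucene classes.
class AnalyzerSketch {

    // Hypothetical stop-word list, taken from the examples in the table above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // WhitespaceAnalyzer-like: split on whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // SimpleAnalyzer-like: every non-letter character ends a token; letters are lowercased.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // StopAnalyzer-like: the simple() tokens with stop words filtered out.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox, a FOX!";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

For the sample sentence, the whitespace variant keeps `Quick-Brown` and `fox,` intact, the simple variant yields `the, quick, brown, fox, a, fox`, and the stop variant drops `the` and `a` from that list.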