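Tokens, Patterns, and Lexemes
A compiler is system software that translates a source program written in a high-level language into a low-level language. Compilation is divided into several phases to simplify design and development. The phases run in sequence, with the output of each phase serving as the input to the next. The phases are:
- Lexical analysis
- Syntax analysis
- Semantic analysis
- Intermediate code generation
- Code optimization
- Storage allocation
- Code generation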
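Lexical analysis phase: In this phase, the input is the source program, read from left to right, and the output is a sequence of tokens that the next phase, syntax analysis, will analyze. While scanning the source code, the scanner discards whitespace (spaces, tabs, newlines, and carriage returns) and comments; in C, preprocessor directives and macros are also resolved before tokens reach the parser. The lexical analyzer, or scanner, also helps with error detection: for example, an invalid constant or an illegal character in the source is reported in this phase. (A misspelled keyword is normally scanned as an identifier, so that error usually surfaces during syntax analysis.) Regular expressions are the standard notation for specifying the tokens of a programming language.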
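Token
A token is a sequence of characters treated as a single unit because it cannot be broken down further. In a programming language such as C, keywords (int, char, float, const, goto, continue, etc.), identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuation such as the comma (,), semicolon (;), and braces ({ }), and strings can all be considered tokens. This phase recognizes three types of tokens: terminal symbols (TRM), which are keywords and operators; literals (LIT); and identifiers (IDN).
Now let us see how to count the tokens in a piece of source code (C language):
Example 1: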
int a = 10; //Input Source code
Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuation, semicolon)
Answer: total number of tokens = 5
Example 2:
int main() {
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Welcome to GeeksforGeeks!");
return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to GeeksforGeeks!"',
')', ';', 'return', '0', ';', '}'
Answer: total number of tokens = 14
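Lexeme
A lexeme is a sequence of characters in the source code that matches the pattern of a token; each lexeme is a concrete instance of a valid token.
Example:
main is a lexeme of type identifier (token)
(, ), {, } are lexemes of type punctuation (token)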
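Pattern
A pattern specifies the set of rules that the scanner follows to create a token.
Example for programming languages (C, C++):
For a keyword to be recognized as a valid token, the pattern is the exact sequence of characters that makes up the keyword.
For an identifier to be recognized as a valid token, the pattern is the predefined rule that it must start with a letter (or, in C and C++, an underscore), followed by any number of letters, digits, or underscores.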
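The two patterns described above can be written as small C predicates. This is a sketch under stated assumptions: the function names are hypothetical, the identifier rule is the regular expression `[A-Za-z_][A-Za-z0-9_]*`, and the keyword list is deliberately partial.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Identifier pattern: a letter or underscore, followed by any number of
   letters, digits, or underscores. Returns 1 on a match, 0 otherwise. */
int matches_identifier_pattern(const char *lexeme) {
    if (!(isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')) return 0;
    for (size_t i = 1; lexeme[i] != '\0'; i++) {
        if (!(isalnum((unsigned char)lexeme[i]) || lexeme[i] == '_')) return 0;
    }
    return 1;
}

/* Keyword pattern: the exact character sequence of the keyword.
   The list below is a partial, illustrative subset of the C keywords. */
int matches_keyword_pattern(const char *lexeme) {
    static const char *keywords[] = {
        "int", "char", "float", "const", "goto", "continue", "return"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(lexeme, keywords[i]) == 0) return 1;
    }
    return 0;
}
```

A lexeme such as `int` satisfies both predicates, so a scanner built this way would need to test the keyword pattern first and fall back to the identifier pattern only when no keyword matches.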
Differences between Token, Lexeme, and Pattern
Criteria | Token | Lexeme | Pattern |
---|---|---|---|
Definition | A token is a sequence of characters treated as a unit because it cannot be broken down further. | A lexeme is a sequence of characters in the source code that matches the pattern of a token. | A pattern specifies the set of rules that a scanner follows to create a token. |
Interpretation of type Keyword | all the reserved keywords of the language (int, return, goto, etc.) | int, goto | the exact sequence of characters that makes up the keyword. |
Interpretation of type Identifier | the name of a variable, function, etc. | main, a | it must start with a letter or underscore, followed by letters, digits, or underscores. |
Interpretation of type Operator | all the operators are considered tokens. | +, = | +, = |
Interpretation of type Punctuation | each kind of punctuation is considered a token, e.g. semicolon, bracket, comma. | (, ), {, } | (, ), {, } |
Interpretation of type Literal | a string or boolean literal. | "Welcome to GeeksforGeeks!" | any sequence of characters (except ") between " and " |
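Output of the lexical analysis phase:
The lexical analyzer passes a sequence of tokens, rather than a sequence of lexemes, to the syntax analyzer, because during syntax analysis the individual lexeme does not matter; what matters is the category or class to which that lexeme belongs.
Example: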
z = x + y;
The syntax analyzer sees this statement in the following form:
<id> = <id> + <id> ; // <id> - identifier (token)
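The lexical analyzer not only produces a sequence of tokens but also builds a symbol table that records the tokens found in the source code (in practice, mainly the identifiers and their attributes); whitespace and comments are not recorded.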
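The token-class view of the `z = x + y;` example can be sketched in C as follows. This is an illustration only: `token_class_view` is a hypothetical helper, every name is reported as `<id>` (keywords are not distinguished), and any other non-space character is treated as a one-character token.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Write the token-class view of `src` into `out` (capacity `cap` bytes).
   Identifiers are replaced by "<id>"; every other non-space character is
   copied as a one-character token. Tokens are separated by single spaces. */
void token_class_view(const char *src, char *out, size_t cap) {
    size_t used = 0;
    const char *p = src;
    if (cap == 0) return;
    out[0] = '\0';
    while (*p != '\0') {
        char tok[8];
        int n;
        if (isspace((unsigned char)*p)) {   /* whitespace is discarded */
            p++;
            continue;
        }
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* consume the whole lexeme, keep only its class */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            strcpy(tok, "<id>");
        } else {
            tok[0] = *p++;                  /* operator or punctuation */
            tok[1] = '\0';
        }
        n = snprintf(out + used, cap - used, "%s%s", used ? " " : "", tok);
        if (n < 0 || (size_t)n >= cap - used) break;   /* output buffer full */
        used += (size_t)n;
    }
}
```

With a sufficiently large buffer, the inputs `z = x + y;` and `sum=a+b;` yield the same string, which is the point of the passage: the parser sees the classes, not the particular lexemes.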