Python Syntax Tree Implementation: Python Syntax Tree Generation Tools

Reposted

mob64ca13f40f3d 2023-12-18 22:18:45


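Python Metagrammar (MetaGrammar)

Python's grammar file, Grammar/Grammar, defines the grammar rules of the language, and that file is itself written in a grammar of its own: the metagrammar. The pgen program has the metagrammar built in and uses it to parse Grammar/Grammar and generate the DFAs for Python's grammar, which it writes out as graminit.h and graminit.c. These generated files are compiled into Python's compiler, where the parser uses the DFAs to turn the token sequence produced by lexing a .py file into a parse tree, from which the abstract syntax tree (AST) is finally built.

This article mainly analyzes how pgen builds the DFA state graphs for the metagrammar, along with the data structures involved and the generation flow.

1. The description at the end of pgen.c: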

/*
Description
-----------

Input is a grammar in extended BNF (using * for repetition, + for
at-least-once repetition, [] for optional parts, | for alternatives and
() for grouping).  This has already been parsed and turned into a parse
tree.

Each rule is considered as a regular expression in its own right.
It is turned into a Non-deterministic Finite Automaton (NFA), which
is then turned into a Deterministic Finite Automaton (DFA), which is then
optimized to reduce the number of states.  See [Aho&Ullman 77] chapter 3,
or similar compiler books (this technique is more often used for lexical
analyzers).

The DFA's are used by the parser as parsing tables in a special way
that's probably unique.  Before they are usable, the FIRST sets of all
non-terminals are computed.

Reference
---------

[Aho&Ullman 77]
    Aho&Ullman, Principles of Compiler Design, Addison-Wesley 1977
    (first edition)
*/

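2. Concept review

2.1 FIRST set of a non-terminal

First(A) is the set of terminals that can appear first in a string derived from the non-terminal A.

2.2 FOLLOW set of a non-terminal

Follow(A) is the set of terminals that can appear immediately after the non-terminal A in some derivation; it never contains the empty string.

3. The metagrammar file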

MSTART: (NEWLINE | RULE)* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ('|' ALT)*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM ['*' | '+']
ATOM: NAME | STRING | '(' RHS ')'

Python’s compiler - from grammar to dfa

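Notes on the metagrammar file:
Start symbol: MSTART
MSTART: zero or more (NEWLINE or RULE), followed by the end marker (ENDMARKER)
RULE: a name (NAME), then ':', then RHS, then a line break (NEWLINE)

4. Metagrammar definition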

// Function that returns the metagrammar
grammar *
meta_grammar(void)
{
    // Return the address of the metagrammar definition struct
    return &_PyParser_Grammar;
}

// The metagrammar definition struct
static grammar _PyParser_Grammar = {
    6,                           // 6 DFAs
    dfas,                        // address of the DFA array
    {19, labels},                // 19 labels
    256                          // type of the start DFA is 256 (MSTART)
};

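// The metagrammar contains 6 DFAs
static dfa dfas[6] = {
    // type: number of the non-terminal this DFA represents
    // name: name of the non-terminal
    // initial state
    // state count
    // states: array of states
    // first set: FIRST set of the non-terminal, stored as a bitset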
    {256, "MSTART", 0, 2, states_0, "\070\000\000"},
    {257, "RULE",   0, 5, states_1, "\040\000\000"},
    {258, "RHS",    0, 2, states_2, "\040\010\003"},
    {259, "ALT",    0, 2, states_3, "\040\010\003"},
    {260, "ITEM",   0, 5, states_4, "\040\010\003"},
    {261, "ATOM",   0, 4, states_5, "\040\000\003"},
};
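
Because the six DFAs are stored in order of their non-terminal number, the DFA for a given type can be located by subtracting 256 from the type. The following is a minimal sketch of that lookup, assuming simplified stand-ins for the real structs: `mini_dfa`, `mini_dfas` and `find_dfa` are hypothetical names used only for illustration, while `NT_OFFSET` mirrors the constant of the same name in CPython's token.h.

```c
#include <assert.h>
#include <stddef.h>

#define NT_OFFSET 256  /* non-terminal numbers start here */

/* Simplified stand-in for the dfa struct (illustrative only). */
typedef struct {
    int         type;     /* non-terminal number, >= NT_OFFSET */
    const char *name;     /* rule name */
    int         nstates;  /* number of states */
} mini_dfa;

/* Same order and values as the dfas table above. */
static const mini_dfa mini_dfas[6] = {
    {256, "MSTART", 2},
    {257, "RULE",   5},
    {258, "RHS",    2},
    {259, "ALT",    2},
    {260, "ITEM",   5},
    {261, "ATOM",   4},
};

/* Return the DFA for a non-terminal, or NULL if type is out of range. */
const mini_dfa *find_dfa(int type)
{
    int index = type - NT_OFFSET;
    if (index < 0 || index >= 6)
        return NULL;
    /* The array is ordered by type, so the index must match. */
    assert(mini_dfas[index].type == type);
    return &mini_dfas[index];
}
```

For example, `find_dfa(257)` returns the entry for RULE, and `find_dfa(4)` returns NULL because 4 (NEWLINE) is a terminal.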

// Labels used by the state transitions of the metagrammar's DFAs
// type >= 256 is a non-terminal, type < 256 is a terminal
static label labels[19] = {
    {0, "EMPTY"},  // EMPTY: special label; an arc on it marks an accepting state
    {256, 0},      // MSTART
    {257, 0},      // RULE
    {4, 0},        // NEWLINE (line break)
    {0, 0},        // ENDMARKER (end of input)
    {1, 0},        // NAME (identifier)
    {11, 0},       // COLON (':')
    {258, 0},      // RHS
    {259, 0},      // ALT
    {18, 0},       // VBAR ('|')
    {260, 0},      // ITEM
    {9, 0},        // LSQB ('[')
    {10, 0},       // RSQB (']')
    {261, 0},      // ATOM
    {16, 0},       // STAR ('*')
    {14, 0},       // PLUS ('+')
    {3, 0},        // STRING (string literal)
    {7, 0},        // LPAR ('(')
    {8, 0},        // RPAR (')')
};

5. DFA data structures of the metagrammar

The DFA struct definition:

typedef struct {
    int		 d_type;	/* Non-terminal this represents */
    char	*d_name;	/* For printing */
    int		 d_initial;	/* Initial state */
    int		 d_nstates;	// number of states
    state	*d_state;	/* Array of states */
    bitset	 d_first;	// FIRST set of this non-terminal: a byte array with one bit per label index (the initializers write each byte as an octal escape)
} dfa;
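
The octal strings in the dfas table can be decoded by hand once the bit layout is known. The sketch below assumes the convention used by CPython's bitset.h, where bit i lives in byte i / 8 at bit position i % 8, counting from the least significant bit. The helper names `first_contains` and `print_first` and the `label_names` array are hypothetical; the names simply follow the order of the labels table.

```c
#include <assert.h>
#include <stdio.h>

#define NLABELS 19

/* Label names in index order, following the labels table. */
static const char *const label_names[NLABELS] = {
    "EMPTY", "MSTART", "RULE", "NEWLINE", "ENDMARKER", "NAME", "COLON",
    "RHS", "ALT", "VBAR", "ITEM", "LSQB", "RSQB", "ATOM", "STAR", "PLUS",
    "STRING", "LPAR", "RPAR",
};

/* Return 1 if label index ibit is set in the bitset, else 0. */
int first_contains(const char *first, int ibit)
{
    assert(ibit >= 0 && ibit < NLABELS);
    return ((unsigned char)first[ibit / 8] >> (ibit % 8)) & 1;
}

/* Print the members of a FIRST set and return how many there are. */
int print_first(const char *rule, const char *first)
{
    int count = 0;
    printf("FIRST(%s) = {", rule);
    for (int i = 0; i < NLABELS; i++) {
        if (first_contains(first, i)) {
            printf("%s%s", count ? ", " : " ", label_names[i]);
            count++;
        }
    }
    printf(" }\n");
    return count;
}
```

Calling `print_first("MSTART", "\070\000\000")` prints `FIRST(MSTART) = { NEWLINE, ENDMARKER, NAME }`: octal 070 is binary 00111000, so bits 3, 4 and 5 are set. Likewise `"\040\010\003"` for RHS decodes to NAME, LSQB, STRING and LPAR, which matches the ways an RHS can begin.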

Now the definition of a DFA state:

typedef struct {
    int		 s_narcs;	// number of arcs (transitions) leaving this state
    arc		*s_arc;		/* Array of arcs */

    /* Optional accelerators */
    int		 s_lower;	/* Lowest label index */
    int		 s_upper;	/* Highest label index */
    int		*s_accel;	/* Accelerator */  // lookup table over the label indices from s_lower to s_upper
    int		 s_accept;	/* Nonzero for accepting state */
} state;

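Next, the definition of a DFA arc (edge):

// Definition of a state transition (arc)
typedef struct {
    short	a_lbl;		/* Label of this arc */               // index into the labels array
    short	a_arrow;	/* State where this arc goes to */    // index of the target state in the DFA's state array
} arc;

Finally, the definition of the label on a DFA arc:
A label is the input symbol that moves the DFA from one state to another; the symbol may be a terminal or a non-terminal.

typedef struct {
    int		 lb_type;  // label type: a token number (< 256) or a non-terminal number (>= 256)
    char	*lb_str;   // optional string value of the label (0 when unused)
} label;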
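
The three structs refer to each other by index: an arc names its input through an index into the labels array and its destination through an index into the state array. The sketch below shows that indirection for a single arc, using only the first six labels of the metagrammar. The `label_kind` enum and the functions `arc_label_kind` and `print_arc` are hypothetical helpers, not part of pgen.

```c
#include <assert.h>
#include <stdio.h>

#define NT_OFFSET 256

typedef struct { int lb_type; const char *lb_str; } label;
typedef struct { short a_lbl; short a_arrow; } arc;

/* First six entries of the metagrammar's labels table. */
static const label labels[] = {
    {0, "EMPTY"}, {256, 0}, {257, 0}, {4, 0}, {0, 0}, {1, 0},
};
#define NLABELS ((int)(sizeof labels / sizeof labels[0]))

typedef enum { KIND_EMPTY, KIND_TERMINAL, KIND_NONTERMINAL } label_kind;

/* Classify the label that an arc refers to. */
label_kind arc_label_kind(const arc *a)
{
    assert(a->a_lbl >= 0 && a->a_lbl < NLABELS);
    if (a->a_lbl == 0)
        return KIND_EMPTY;  /* label[0] is the special EMPTY label */
    return labels[a->a_lbl].lb_type >= NT_OFFSET ? KIND_NONTERMINAL
                                                 : KIND_TERMINAL;
}

/* Print one arc in readable form. */
void print_arc(int from_state, const arc *a)
{
    static const char *const names[] = {"EMPTY", "terminal", "non-terminal"};
    printf("state %d --label[%d] (type %d, %s)--> state %d\n",
           from_state, a->a_lbl, labels[a->a_lbl].lb_type,
           names[arc_label_kind(a)], a->a_arrow);
}
```

Given an arc `{2, 0}` leaving state 0, `print_arc` reports that the input is label[2], whose type 257 is a non-terminal, and that the arc leads back to state 0.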

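6. Analysis of the metagrammar's DFAs

The metagrammar defines 6 rules, each treated as a regular expression and each corresponding to one DFA.

First rule:

MSTART: (NEWLINE | RULE)* ENDMARKER

The first DFA: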

{256, "MSTART", 0, 2, states_0, "\070\000\000"}

Explanation:
type: 256
name: "MSTART"
initial: 0
nstates: 2
state *: states_0 (the DFA's state array)
bitset: "\070\000\000"

The state array states_0 of this DFA holds 2 state nodes: the first has 3 arcs and the second has 1 arc.

// The DFA's state array
static state states_0[2] = {   // 2 state nodes
    {3, arcs_0_0},             // 3 transitions, stored in arcs_0_0
    {1, arcs_0_1},             // 1 transition, stored in arcs_0_1
};
// First state node
static arc arcs_0_0[3] = {     // 3 transitions (arcs)
    {2, 0},                    // on label[2], go to states_0[0]
    {3, 0},                    // on label[3], go to states_0[0]
    {4, 1},                    // on label[4], go to states_0[1]
};
// Second state node
static arc arcs_0_1[1] = {     // 1 transition (arc)
    {0, 1},                    // label[0] (EMPTY) loops back to states_0[1], marking it as accepting
};

The labels involved are:

label[0] = EMPTY
label[1] = MSTART
label[2] = RULE
label[3] = NEWLINE
label[4] = ENDMARKER
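
To see how these tables behave, the MSTART DFA can be run over a sequence of label indices. The sketch below is a simplification: it treats RULE as a single input symbol, whereas the real parser would enter the RULE DFA when it meets a non-terminal. It also relies on the pgen convention that an accepting state carries an arc on label 0 (EMPTY). The struct is trimmed to two fields, and `step`, `accepting` and `mstart_accepts` are hypothetical helper names.

```c
#include <assert.h>

typedef struct { short a_lbl; short a_arrow; } arc;
typedef struct { int s_narcs; const arc *s_arc; } state;

enum { L_EMPTY = 0, L_RULE = 2, L_NEWLINE = 3, L_ENDMARKER = 4 };

/* Tables copied from the MSTART DFA. */
static const arc arcs_0_0[3] = { {2, 0}, {3, 0}, {4, 1} };
static const arc arcs_0_1[1] = { {0, 1} };
static const state states_0[2] = { {3, arcs_0_0}, {1, arcs_0_1} };

/* Follow the arc for one label; return the new state, or -1 if none. */
int step(const state *states, int current, int lbl)
{
    const state *s = &states[current];
    for (int i = 0; i < s->s_narcs; i++)
        if (s->s_arc[i].a_lbl == lbl)
            return s->s_arc[i].a_arrow;
    return -1;
}

/* A state is accepting if it has an arc on the EMPTY label. */
int accepting(const state *states, int current)
{
    return step(states, current, L_EMPTY) >= 0;
}

/* Run MSTART over n label indices; return 1 if the input is accepted. */
int mstart_accepts(const int *input, int n)
{
    int current = 0;  /* d_initial of MSTART */
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        assert(input[i] != L_EMPTY);  /* EMPTY is never real input */
        current = step(states_0, current, input[i]);
        if (current < 0)
            return 0;  /* no arc for this label: reject */
    }
    return accepting(states_0, current);
}
```

The sequence NEWLINE, RULE, RULE, ENDMARKER is accepted: the first three inputs loop on state 0 and ENDMARKER moves to state 1, which is accepting. A sequence without the final ENDMARKER stays in state 0 and is rejected.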


Second rule:

RULE: NAME ':' RHS NEWLINE

Its DFA definition:

{257, "RULE",   0, 5, states_1, "\040\000\000"},

This DFA has 5 state nodes:

static state states_1[5] = {
    {1, arcs_1_0},
    {1, arcs_1_1},
    {1, arcs_1_2},
    {1, arcs_1_3},
    {1, arcs_1_4},
};

static arc arcs_1_0[1] = {
    {5, 1},  // label[5] -> states_1[1]
};
static arc arcs_1_1[1] = {
    {6, 2},  // label[6] -> states_1[2]
};
static arc arcs_1_2[1] = {
    {7, 3},  // label[7] -> states_1[3]
};
static arc arcs_1_3[1] = {
    {3, 4},  // label[3] -> states_1[4]
};
static arc arcs_1_4[1] = {
    {0, 4},  // label[0] (EMPTY) -> states_1[4], marking state 4 as accepting
};

The labels involved are:

label[0] = EMPTY
label[3] = NEWLINE
label[5] = NAME
label[6] = COLON
label[7] = RHS

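With the metagrammar loaded, we can try using it to generate the DFA data structures for a grammar file, grammar.txt.
The contents of grammar.txt:

arglist: NUMBER (',' NUMBER)*  [',']

This grammar file contains a single rule, arglist: a NUMBER, followed by zero or more (',' NUMBER) groups, followed by an optional trailing comma.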
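
As a preview of what a DFA for this rule looks like, the sketch below encodes a three-state automaton derived by hand from the rule. It is not pgen's actual output: the state numbering, the `T_NUMBER` and `T_COMMA` symbols and the function `arglist_accepts` are assumptions made for illustration.

```c
#include <assert.h>

/* Input symbols for this sketch (not pgen label indices). */
enum { T_NUMBER, T_COMMA, T_COUNT };

/* Transition table: next_state[state][symbol], -1 means no arc. */
static const int next_state[3][T_COUNT] = {
    /* state 0: start        */ {  1, -1 },
    /* state 1: after NUMBER */ { -1,  2 },
    /* state 2: after ','    */ {  1, -1 },
};

/* States 1 and 2 accept: the trailing comma is optional. */
static const int is_accepting[3] = { 0, 1, 1 };

/* Return 1 if the n tokens form a valid arglist, else 0. */
int arglist_accepts(const int *tokens, int n)
{
    int current = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        assert(tokens[i] >= 0 && tokens[i] < T_COUNT);
        current = next_state[current][tokens[i]];
        if (current < 0)
            return 0;  /* no arc for this token: reject */
    }
    return is_accepting[current];
}
```

Under this table, inputs such as `1`, `1,`, `1,2` and `1,2,` are accepted, while an empty input, a leading comma, two numbers in a row, or two commas in a row are rejected.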
