Spear Parser (2): EdgeLexer, the Treebank Token Reader Class

snowteng17, 2011-04-11 16:29:25

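© Copyright belongs to the author. This is an original work by snowteng17 on the 51CTO blog; contact the author for permission before reposting, otherwise legal liability will be pursued.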

Penn Treebank annotation example

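The most basic step in training a syntactic model is extracting rules from a treebank. Rules are built from non-terminals, words, and similar information, so the first step of training is being able to read that information out of the trees. A tree in the WSJ mrg annotation style of the Penn Treebank looks like this: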

    ( (S
        (NP-SBJ
          (NP (NNP Pierre) (NNP Vinken) )
          (, ,)
          (ADJP
            (NP (CD 61) (NNS years) )
            (JJ old) )
          (, ,) )
        (VP (MD will)
          (VP (VB join)
            (NP (DT the) (NN board) )
            (PP-CLR (IN as)
              (NP (DT a) (JJ nonexecutive) (NN director) ))
            (NP-TMP (NNP Nov.) (CD 29) )))
        (. .) ))

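Clearly, the markup of such a tree consists of three kinds of symbols: the left parenthesis "(", the right parenthesis ")", and strings such as S, NP-SBJ, and director.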
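To make the goal of rule extraction concrete, here is a minimal sketch of how context-free rules could be read off a bracketed tree. It is not Spear's extraction code: the names `splitTokens`, `parseNode`, and `extractRules` are hypothetical, and it tokenizes by padding parentheses with spaces instead of using a real lexer. It also assumes well-formed input in which every node has the form `(LABEL child ...)`, so the unlabeled outer wrapper of a WSJ mrg tree is not handled.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a bracketed tree into "(", ")" and string tokens
// by padding every parenthesis with spaces and splitting on whitespace.
static std::vector<std::string> splitTokens(const std::string& tree) {
    std::string padded;
    for (char ch : tree) {
        if (ch == '(' || ch == ')') {
            padded += ' ';
            padded += ch;
            padded += ' ';
        } else {
            padded += ch;
        }
    }
    std::istringstream in(padded);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// Parse one node starting at tokens[pos], which must be "(".
// Appends one rule "LHS -> RHS" per node to `rules` (children first)
// and returns the label of the node. Assumes well-formed input.
static std::string parseNode(const std::vector<std::string>& tokens,
                             std::size_t& pos,
                             std::vector<std::string>& rules) {
    ++pos;                                   // consume "("
    std::string label = tokens[pos++];       // first string after "(" is the label
    std::string rhs;
    while (pos < tokens.size() && tokens[pos] != ")") {
        std::string child = (tokens[pos] == "(")
            ? parseNode(tokens, pos, rules)  // nested constituent
            : tokens[pos++];                 // a word
        if (!rhs.empty()) rhs += ' ';
        rhs += child;
    }
    ++pos;                                   // consume ")"
    rules.push_back(label + " -> " + rhs);
    return label;
}

// Return the rules of a single labeled tree, e.g. "(NP (DT the) (NN board) )".
std::vector<std::string> extractRules(const std::string& tree) {
    std::vector<std::string> tokens = splitTokens(tree);
    std::vector<std::string> rules;
    std::size_t pos = 0;
    if (!tokens.empty() && tokens[0] == "(") parseNode(tokens, pos, rules);
    return rules;
}
```

For the fragment `(NP (DT the) (NN board) )`, `extractRules` returns the three rules `DT -> the`, `NN -> board`, and `NP -> DT NN`, in that order.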

EdgeLexer, the treebank token reader class

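Spear provides a class, EdgeLexer, to read these three kinds of token. Because the input is a file, it also adds a fourth token that marks the end of input. During model training this class is responsible for reading the four token types and, when the token is a string, for returning the text that was read.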

EdgeLexer is declared as follows.

    class EdgeLexer
    {
     public:
      /* Token types */
      static const int TOKEN_EOF = 0;    /* end of input */
      static const int TOKEN_STRING = 1; /* string */
      static const int TOKEN_LP = 2;     /* left parenthesis */
      static const int TOKEN_RP = 3;     /* right parenthesis */

      EdgeLexer(IStream &);

      /* Core function: read the next token */
      int lexem(String &);

      int getLineCount() const { return _lineCount; }

     private:
      /** The stream */
      IStream & _stream;

      /** Line count */
      int _lineCount;

      /** Advance over white spaces */
      void skipWhiteSpaces();

      bool isSpace(Char c) const;
    };

As the declaration shows, EdgeLexer does three things: it counts lines, tests for whitespace, and reads tokens. Its implementation is shown below.

    /** Constructor */
    EdgeLexer::EdgeLexer(IStream & stream)
      : _stream(stream), _lineCount(1){}

    /**
     * Test whether a character is whitespace.
     * @param c the character to test
     */
    bool EdgeLexer::isSpace(Char c) const
    {
      return c == W(' ') ||
             c == W('\t') ||
             c == W('\n') ||
             c == W('\r');
    }

    /** Skip whitespace, counting newlines along the way */
    void EdgeLexer::skipWhiteSpaces()
    {
      Char c;
      while((c = _stream.get()) != EOF && isSpace(c)){
        if(c == W('\n')){
          _lineCount ++;
        }
      }
      /* push back the first non-whitespace character */
      _stream.unget();
    }

    /** True if c can be part of a string token: neither a parenthesis nor whitespace */
    #define STRING_CHAR(c) ( \
      (c) != W('(') && \
      (c) != W(')') && \
      ! isSpace(c) \
    )

    /**
     * Read one token and return its type; the text of the token is stored in text.
     * For a parenthesis or EOF only the type is returned and text is left untouched.
     * @param text receives the string that was read
     * @return TOKEN_EOF at end of input, TOKEN_STRING for a string,
     *         TOKEN_LP or TOKEN_RP for a parenthesis
     */
    int EdgeLexer::lexem(String & text)
    {
      skipWhiteSpaces();
      Char c = _stream.get();
      if(c == EOF){
        return TOKEN_EOF;
      } else if(c == W('(')){
        return TOKEN_LP;
      } else if(c == W(')')){
        return TOKEN_RP;
      } else if(STRING_CHAR(c)){
        OStringStream buffer;
        buffer << c;
        /* stop at end of input or at the first delimiter */
        while((c = _stream.get()) != EOF && STRING_CHAR(c)){
          buffer << c;
        }
        text = buffer.str();
        /* push back the delimiter that ended the string */
        _stream.unget();
        return TOKEN_STRING;
      }
      // should never get here
      return TOKEN_EOF;
    }
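To experiment with the lexer outside Spear, the following is a self-contained sketch of the same logic on top of the standard library. It assumes that Spear's `IStream`, `String`, and `Char` correspond to `std::istream`, `std::string`, and `char`, which this article does not confirm, and the class name `MiniEdgeLexer` is hypothetical. It differs from the listing above in two ways: it keeps the result of `get()` in an `int`, so the comparison with `EOF` does not depend on the signedness of `char`, and it looks ahead with `peek()` instead of calling `get()` followed by `unget()`.

```cpp
#include <cstdio>    // EOF
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the lexer on in-memory text
#include <string>

class MiniEdgeLexer {
public:
    enum Token { TOKEN_EOF = 0, TOKEN_STRING = 1, TOKEN_LP = 2, TOKEN_RP = 3 };

    explicit MiniEdgeLexer(std::istream& stream)
        : _stream(stream), _lineCount(1) {}

    // Read the next token; `text` is filled only for TOKEN_STRING.
    Token lexem(std::string& text) {
        skipWhiteSpaces();
        int c = _stream.get();               // int, so EOF stays distinguishable
        if (c == EOF) return TOKEN_EOF;
        if (c == '(') return TOKEN_LP;
        if (c == ')') return TOKEN_RP;
        text.assign(1, static_cast<char>(c));
        // Consume characters until a delimiter or end of input is next.
        while (isStringChar(_stream.peek())) {
            text += static_cast<char>(_stream.get());
        }
        return TOKEN_STRING;
    }

    // 1-based number of the line the lexer has reached.
    int getLineCount() const { return _lineCount; }

private:
    std::istream& _stream;
    int _lineCount;

    static bool isSpace(int c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    static bool isStringChar(int c) {
        return c != EOF && c != '(' && c != ')' && !isSpace(c);
    }

    // Skip whitespace, counting newlines along the way.
    void skipWhiteSpaces() {
        while (isSpace(_stream.peek())) {
            if (_stream.get() == '\n') ++_lineCount;
        }
    }
};
```

Constructing a `MiniEdgeLexer` over a `std::istringstream` that holds `(NP (DT the)` followed by a newline and ` (NN board) )` yields the tokens LP, NP, LP, DT, the, RP, LP, NN, board, RP, RP, and then `TOKEN_EOF`; `getLineCount()` reports 2 once the second line has been reached.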

EdgeLexer usage example

Read a treebank file and print every non-terminal and word.

    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include "EdgeLexer.h"  // assumed header name; adjust to your Spear checkout

    using namespace std;

    int main(int argc,char **argv)
    {
      if(argc!=2){
        printf("[Usage]:%s [treebank]\n",argv[0]);
        return 1;
      }
      ifstream is(argv[1]);
      if(!is){
        printf("Cannot open %s\n",argv[1]);
        return 1;
      }
      EdgeLexer lex(is);
      string text;
      int l;
      while((l = lex.lexem(text)) != EdgeLexer::TOKEN_EOF){
        if(l == EdgeLexer::TOKEN_STRING){ // print string tokens only
          cout << " " << text << endl;
        }
      }
      return 0;
    }
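The loop above prints non-terminals and words mixed together. One way to tell them apart, sketched here under assumptions, is to remember the previous token: a string that directly follows a left parenthesis is a label, and any other string is a word. The names `Symbols` and `splitSymbols` are hypothetical, and the token source is passed in as a callable so the block compiles without Spear; the token codes are copied from the EdgeLexer declaration.

```cpp
#include <functional>
#include <string>
#include <vector>

// Token codes copied from the EdgeLexer declaration.
enum { TOKEN_EOF = 0, TOKEN_STRING = 1, TOKEN_LP = 2, TOKEN_RP = 3 };

struct Symbols {
    std::vector<std::string> labels;  // strings that directly follow "("
    std::vector<std::string> words;   // all other strings
};

// `next` is any callable that behaves like EdgeLexer::lexem: it returns a
// token code and stores the text of string tokens in its argument.
Symbols splitSymbols(const std::function<int(std::string&)>& next) {
    Symbols out;
    std::string text;
    int prev = TOKEN_EOF;  // no token seen yet
    for (int tok = next(text); tok != TOKEN_EOF; tok = next(text)) {
        if (tok == TOKEN_STRING) {
            (prev == TOKEN_LP ? out.labels : out.words).push_back(text);
        }
        prev = tok;
    }
    return out;
}
```

With the Spear class, the callable could be a lambda that wraps `lex.lexem`, provided `String` is `std::string` in your build. For the token sequence of `(NP (DT the) (NN board) )` the result holds the labels NP, DT, NN and the words the, board.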