快速学习Lucene-Lucene分析器

原创

wx5d0241bb88268 2021-08-18 10:41:50 博主文章分类：快速学习 ©著作权

文章标签 Lucene分析器 Lucene analyzer lucene spring 文章分类 代码人生

©著作权归作者所有：来自51CTO博客作者wx5d0241bb88268的原创作品，请联系作者获取转载授权，否则将追究法律责任

分析器的分词效果

    //查看标准分析器的分词效果
    @Test
    public void testTokenStream() throws Exception {
        //创建一个标准分析器对象
        Analyzer analyzer = new StandardAnalyzer();
        //获得tokenStream对象
        //第一个参数：域名，可以随便给一个
        //第二个参数：要分析的文本内容
        TokenStream tokenStream = analyzer.tokenStream("test", "The Spring Framework provides a comprehensive programming and configuration model.");
        //添加一个引用，可以获得每个关键词
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        //添加一个偏移量的引用，记录了关键词的开始位置以及结束位置
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        //将指针调整到列表的头部
        tokenStream.reset();
        //遍历关键词列表，通过incrementToken方法判断列表是否结束
        while(tokenStream.incrementToken()) {
            //关键词的起始位置
            System.out.println("start->" + offsetAttribute.startOffset());
            //取关键词
            System.out.println(charTermAttribute);
            //结束位置
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        tokenStream.close();
    }

中文分析器

Lucene的自带中文分析器

StandardAnalyzer：
单字分词：就是按照中文一个字一个字地进行分词。如：“我爱中国”，
效果：“我”、“爱”、“中”、“国”。
SmartChineseAnalyzer
对中文支持较好，但扩展性差，扩展词库，禁用词库和同义词库等不好处理

IKAnalyzer

快速学习Lucene-Lucene分析器_Lucene
使用方法：
第一步：把jar包添加到工程中
第二步：把配置文件和扩展词典和停用词词典添加到classpath下

注意：hotword.dic和ext_stopword.dic文件的格式为UTF-8，注意是无BOM 的UTF-8 编码。也就是说禁止使用windows记事本编辑扩展词典文件

使用EditPlus.exe保存为无BOM 的UTF-8 编码格式，如下图
快速学习Lucene-Lucene分析器_Lucene分析器_02

使用自定义分析器

    @Test
    public void addDocument() throws Exception {
        //索引库存放路径
        Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        //创建一个indexwriter对象
        IndexWriter indexWriter = new IndexWriter(directory, config);
        //...
    }