I am using the latest Neo4j release at the time of writing, 3.1.0, and have hit plenty of pitfalls along the way. This post describes how to set up a Chinese full-text index in Neo4j 3.1.0, using IKAnalyzer as the tokenizer.

That article gives a rough outline of how to index with IKAnalyzer, but it is not very clear. In particular, its whole setup assumes embedded Neo4j, i.e. Neo4j must be embedded in your Java application (https://neo4j.com/docs/java-reference/current/#tutorials-java-embedded); keep that in mind, because otherwise you cannot plug in a custom Analyzer at all. Moreover, the approach described there no longer works as-is: Neo4j 3.1.0 ships with Lucene 5.5, so the official IKAnalyzer build is no longer compatible.
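For context, here is a minimal sketch of the embedded setup the rest of this post assumes, following the Neo4j Java reference linked above; the store path and the static Neo4j.graphDb holder that the later snippets refer to are my own assumptions, not something prescribed by Neo4j:

import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class Neo4j {
    // hypothetical static holder; the index examples below access it as Neo4j.graphDb
    public static final GraphDatabaseService graphDb =
            new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db")); // assumed store path

    static {
        // shut the embedded database down cleanly when the JVM exits
        Runtime.getRuntime().addShutdownHook(new Thread(graphDb::shutdown));
    }
}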


2. The fix

Switch to IKAnalyzer2012FF_u1.jar, which can be downloaded from Google Code (https://code.google.com/archive/p/ik-analyzer/downloads). This build is a community-patched version of IKAnalyzer that fixes its incompatibility with Lucene 3.5 and later. Even with this jar, however, there is still a problem, and the following error is thrown:

Caused by: java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;


In other words, IKAnalyzer's Analyzer class is still not fully compatible with the Lucene version bundled with Neo4j.
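For illustration, the signature change behind the AbstractMethodError looks roughly like this; an analyzer compiled against the old two-argument method never implements the new one, so Lucene 5.x fails at runtime when it calls it:

// Lucene 4.x and earlier: the Reader is passed to the analyzer
protected TokenStreamComponents createComponents(String fieldName, Reader reader)

// Lucene 5.x (bundled with Neo4j 3.1.0): only the field name is passed
protected abstract TokenStreamComponents createComponents(String fieldName)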

The solution is to add two more classes:

package com.uc.wa.function;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public class IKAnalyzer5x extends Analyzer {

    private boolean useSmart;

    public boolean useSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzer5x() {
        this(false);
    }

    public IKAnalyzer5x(boolean useSmart) {
        super();
        this.useSmart = useSmart;
    }

    /**
     * The pre-5.x signature, kept for reference; it no longer exists in the Lucene 5.x Analyzer API.
    protected TokenStreamComponents createComponents(String fieldName, final Reader in) {
        Tokenizer _IKTokenizer = new IKTokenizer(in, this.useSmart());
        return new TokenStreamComponents(_IKTokenizer);
    }
    **/

    /**
     * Override the Lucene 5.x version of createComponents:
     * implement the new Analyzer contract and build the tokenization components.
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer _IKTokenizer = new IKTokenizer5x(this.useSmart());
        return new TokenStreamComponents(_IKTokenizer);
    }
}


package com.uc.wa.function;

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKTokenizer5x extends Tokenizer {

    // the IK segmenter instance
    private IKSegmenter _IKImplement;

    // term text attribute
    private final CharTermAttribute termAtt;
    // term offset attribute
    private final OffsetAttribute offsetAtt;
    // term type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
    private final TypeAttribute typeAtt;
    // end position of the last term
    private int endPosition;

    /**
     * The pre-5.x constructor that took a Reader, kept for reference:
    public IKTokenizer(Reader in, boolean useSmart) {
        super(in);
        offsetAtt = addAttribute(OffsetAttribute.class);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
        _IKImplement = new IKSegmenter(input, useSmart);
    }
    **/

    /**
     * Constructor for the Lucene 5.x Tokenizer base class:
     * implements the new Tokenizer contract, which no longer takes a Reader.
     * @param useSmart
     */
    public IKTokenizer5x(boolean useSmart) {
        super();
        offsetAtt = addAttribute(OffsetAttribute.class);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
        _IKImplement = new IKSegmenter(input, useSmart);
    }

    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        // clear all term attributes
        clearAttributes();
        Lexeme nextLexeme = _IKImplement.next();
        if (nextLexeme != null) {
            // copy the Lexeme into the token attributes
            // set the term text
            termAtt.append(nextLexeme.getLexemeText());
            // set the term length
            termAtt.setLength(nextLexeme.getLength());
            // set the term offsets
            offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            // record the end position of this term
            endPosition = nextLexeme.getEndPosition();
            // record the term type
            typeAtt.setType(nextLexeme.getLexemeTypeString());
            // return true to signal that another term may follow
            return true;
        }
        // return false to signal that the token stream is exhausted
        return false;
    }

    /*
     * (non-Javadoc)
     * @see org.apache.lucene.analysis.Tokenizer#reset(java.io.Reader)
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        _IKImplement.reset(input);
    }

    @Override
    public final void end() {
        // set the final offset
        int finalOffset = correctOffset(this.endPosition);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}


These two classes resolve the incompatibility between IKAnalyzer2012FF_u1.jar and Lucene 5. When building the index, simply use IKAnalyzer5x wherever you would otherwise use IKAnalyzer.
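As a quick sanity check before wiring the analyzer into Neo4j, a small standalone sketch like the following can be used to confirm that IKAnalyzer5x tokenizes Chinese text under Lucene 5.5; the field name and sample sentence here are arbitrary:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.uc.wa.function.IKAnalyzer5x;

public class IKAnalyzer5xCheck {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new IKAnalyzer5x(true);
             TokenStream ts = analyzer.tokenStream("address", "南昌市红谷滩新区")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // print each segmented term on its own line
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}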


3. Finally

An example of building and querying a Chinese full-text index in Neo4j:

    /**
     * Create a full-text index for a single node
     *
     * @param propKeys
     */
    public static void createFullTextIndex(long id, List<String> propKeys) {
        log.info("method[createFullTextIndex] begin.propKeys<" + propKeys + ">");
        Index<Node> entityIndex = null;

        try (Transaction tx = Neo4j.graphDb.beginTx()) {
            // legacy (manual) node index backed by Lucene, configured with IKAnalyzer5x
            entityIndex = Neo4j.graphDb.index().forNodes("NodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));

            Node node = Neo4j.graphDb.getNodeById(id);
            log.info("method[createFullTextIndex] get node id<" + node.getId() + "> name<"
                    + node.getProperty("knowledge_name") + ">");
            /** fetch the node's properties and index each requested key */
            Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0]))
                    .entrySet();
            for (Map.Entry<String, Object> property : properties) {
                log.info("method[createFullTextIndex] index prop<" + property.getKey() + ":" + property.getValue() + ">");
                entityIndex.add(node, property.getKey(), property.getValue());
            }
            tx.success();
        }
    }

    /**
     * Query using the index
     *
     * @param query
     * @return
     * @throws IOException
     */
    public static List<Map<String, Object>> selectByFullTextIndex(String[] fields, String query) throws IOException {
        List<Map<String, Object>> ret = Lists.newArrayList();
        try (Transaction tx = Neo4j.graphDb.beginTx()) {
            IndexManager index = Neo4j.graphDb.index();
            /** look up the index with the same analyzer configuration */
            Index<Node> addressNodeFullTextIndex = index.forNodes("NodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));
            Query q = IKQueryParser.parseMultiField(fields, query);

            IndexHits<Node> foundNodes = addressNodeFullTextIndex.query(q);

            for (Node n : foundNodes) {
                Map<String, Object> m = n.getAllProperties();
                if (!Float.isNaN(foundNodes.currentScore())) {
                    m.put("score", foundNodes.currentScore());
                }
                log.info("method[selectByIndex] score<" + foundNodes.currentScore() + ">");
                ret.add(m);
            }
            tx.success();
        } catch (IOException e) {
            log.error("method[selectByIndex] fields<" + Joiner.on(",").join(fields) + "> query<" + query + ">", e);
            throw e;
        }
        return ret;
    }


Note that I use IKQueryParser here, which builds the Query automatically from the query terms and the fields to search. This works around another pitfall: querying with a raw Lucene query string is problematic. For example, the query string "address:南昌市" matches every address containing the character "市", which is clearly unreasonable. Switching to IKQueryParser fixes this. IKQueryParser is a utility that ships with the original IKAnalyzer, but it was stripped out of IKAnalyzer2012FF_u1.jar, so I pulled the original IKAnalyzer jar back into the project; in the end the project carries both jars side by side.
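To make the contrast concrete, here is a minimal sketch of the two query paths against the same legacy index, reusing the addressNodeFullTextIndex handle from the method above; the field name and query term are just the example from the text:

// Raw Lucene query string: as described above, this also matches every
// address that merely contains the character "市", which is far too broad.
IndexHits<Node> tooBroad = addressNodeFullTextIndex.query("address:南昌市");

// IKQueryParser segments the query text itself and builds the Query object,
// which avoids those single-character matches.
Query q = IKQueryParser.parseMultiField(new String[]{"address"}, "南昌市");
IndexHits<Node> hits = addressNodeFullTextIndex.query(q);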

That covers most of the pitfalls.