Lucene Journey (31): SpellChecker, Part 1

yanlijun250

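SpellChecker is a new Lucene component, at least to me, since I had not touched Lucene for about a year.
Someone in my chat group was discussing SpellChecker, so I looked into it. It turns out not to be very hard.
SpellChecker does a good job at spelling correction. Some people also use it for related-search suggestions, but the results are probably not ideal.
OK, down to business. Let's start with an example.
Step 1: load the dictionary of correctly spelled words into the spell index: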

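// "book.index" is the directory that will hold the spell index (Lucene 2.x style API)
SpellChecker sc = new SpellChecker(FSDirectory.getDirectory("book.index"));
// Index every word listed in spell.txt
sc.indexDictionary(new PlainTextDictionary(new File("spell.txt")));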

Step 2: ask for the list of suggestions:

// Request up to 3 suggestions for the misspelled input
String[] strs = sc.suggestSimilar("明朝那点屎", 3);
for (int i = 0; i < strs.length; i++) {
    System.out.println(strs[i]);
}

= =! A bit gross (the typo turns 事, "matters", into 屎, "poop"). Still, let's look at the output.

明朝那点事

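I forgot to introduce one file: the dictionary file. A dictionary can be built from an index or from a text file. Let's start with the simpler option, the PlainTextDictionary you saw above. The contents of spell.txt are: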

明朝那点事
明朝五好家庭
明朝的皇帝

You may wonder what happens if you enter just 明朝. The answer is: no results, because

public SpellChecker(Directory spellIndex) throws IOException {
    this(spellIndex, new LevensteinDistance());
}

LevensteinDistance is the default. To get suggestions for 明朝 as well, switch to JaroWinklerDistance:

// Same setup as before, but pass the string distance explicitly
SpellChecker sc = new SpellChecker(FSDirectory.getDirectory("book.index"), new JaroWinklerDistance());
sc.indexDictionary(new PlainTextDictionary(new File("spell.txt")));

JaroWinklerDistance has a similarity-like threshold parameter. It defaults to 0.7 and can be changed with setThreshold.
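To see why 明朝 gets no suggestion under the default but does after the switch, here is a rough, Lucene-free sketch that scores the inputs against the three dictionary entries. It rests on two assumptions: that LevensteinDistance reports 1 - edits / max(length), and that SpellChecker drops candidates scoring below a default minimum of 0.5. The Jaro-Winkler code is the textbook formulation (prefix capped at 4, scaling factor 0.1), not Lucene's own source, and the class and method names are made up for illustration.

```java
import java.util.Locale;

class SimilarityDemo {

    // Normalized Levenshtein similarity: 1 - editDistance / max(length).
    static float levenshtein(String a, String b) {
        int n = a.length(), m = b.length();
        if (n == 0 || m == 0) return (n == m) ? 1f : 0f;
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            cur[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t; // reuse the two rows
        }
        return 1f - (float) prev[m] / Math.max(n, m);
    }

    // Textbook Jaro-Winkler; the prefix bonus applies only when jaro >= threshold.
    static float jaroWinkler(String a, String b, float threshold) {
        String s = a.length() <= b.length() ? a : b; // shorter string
        String l = a.length() <= b.length() ? b : a; // longer string
        if (s.isEmpty()) return l.isEmpty() ? 1f : 0f;
        int window = Math.max(l.length() / 2 - 1, 0);
        boolean[] sHit = new boolean[s.length()];
        boolean[] lHit = new boolean[l.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(l.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!lHit[j] && s.charAt(i) == l.charAt(j)) {
                    sHit[i] = true; lHit[j] = true; matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0f;
        // Matched characters that appear in a different order count as half-transpositions.
        int k = 0, half = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sHit[i]) continue;
            while (!lHit[k]) k++;
            if (s.charAt(i) != l.charAt(k)) half++;
            k++;
        }
        float m = matches;
        float jaro = (m / s.length() + m / l.length() + (m - half / 2) / m) / 3f;
        if (jaro < threshold) return jaro;
        int prefix = 0;
        while (prefix < s.length() && prefix < 4 && s.charAt(prefix) == l.charAt(prefix)) prefix++;
        return jaro + 0.1f * prefix * (1f - jaro);
    }

    public static void main(String[] args) {
        String[] dict = {"明朝那点事", "明朝五好家庭", "明朝的皇帝"};
        for (String input : new String[] {"明朝那点屎", "明朝"}) {
            for (String word : dict) {
                System.out.println(String.format(Locale.ROOT,
                        "%s vs %s: levenshtein=%.2f jaroWinkler=%.2f",
                        input, word, levenshtein(input, word), jaroWinkler(input, word, 0.7f)));
            }
        }
    }
}
```

Under these assumptions, 明朝那点屎 against 明朝那点事 scores 0.80 with Levenshtein (one edit out of five characters), while 明朝 against the same entry scores only 0.40 (three edits out of five), which falls below 0.5. With Jaro-Winkler the shared prefix lifts 明朝 to about 0.84, which would explain why it starts producing suggestions.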

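To summarize, there are three pieces: SpellChecker, StringDistance, and Dictionary.

The StringDistance interface is implemented by JaroWinklerDistance and LevensteinDistance.

The Dictionary interface is implemented by LuceneDictionary and PlainTextDictionary.

JaroWinklerDistance is mainly useful for related-search suggestions.

LevensteinDistance is meant for spelling correction.

PlainTextDictionary reads words from a plain text file, one word per line.

LuceneDictionary reads words from the terms of an existing Lucene index.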

The spell index that SpellChecker builds from the dictionary has the following structure:

IndexStructure	Example
word	kings
gram3	kin, ing, ngs
gram4	king, ings
start3	kin
start4	king
end3	ngs
end4	ings
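The table can be reproduced with a few lines of plain Java. This is only an illustration of the n-gram decomposition shown above: the gram sizes 3 and 4 are taken from the table, whereas Lucene, as far as I know, picks the sizes according to word length. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NGramDemo {

    // All contiguous substrings of length n, from left to right.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String word = "kings";
        System.out.println("word   " + word);
        for (int n = 3; n <= 4; n++) {
            List<String> g = grams(word, n);
            System.out.println("gram" + n + "  " + g);
            // The first and last grams are also stored separately as startN and endN.
            System.out.println("start" + n + " " + g.get(0));
            System.out.println("end" + n + "   " + g.get(g.size() - 1));
        }
    }
}
```

Storing the leading and trailing grams in their own fields presumably lets the lookup weight matches at the beginning or end of a word differently from matches in the middle.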

That covers usage. The algorithms behind it will be discussed in the next post.

2009-08-20 17:12