

Compiling the Ansj Solr Plugin

2019-11-15 00:03:30

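  Ansj is a capable Chinese word segmentation component; its details are outside the scope of this article. The ansj author's official code already supports the Lucene interface. Using it under Solr requires one small extension.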

1. Managing the project with Maven

ansj is developed and managed with Maven. First, modify its pom.xml as shown below:

  

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.ansj</groupId>
    <artifactId>MavenAccount-aggregator</artifactId>
    <version>0.0.1</version>
    <relativePath>../pom.xml</relativePath>
  </parent>
  <artifactId>ansj_lucene4_plug</artifactId>
  <version>2.0.2</version>
  <packaging>jar</packaging>
  <name>ansj_lucene4_plug</name>
  <properties>
    <solr.version>4.8.0</solr.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.ansj</groupId>
      <artifactId>ansj_seg</artifactId>
      <version>2.0.5</version>
      <classifier>min</classifier>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>${solr.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-highlighter</artifactId>
      <version>${solr.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queries</artifactId>
      <version>${solr.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>${solr.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-dataimporthandler</artifactId>
      <version>${solr.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

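  In the dependency entries, <scope>provided</scope> means the dependency is needed only at compile time; it is not bundled into the jar, because the runtime environment is expected to supply it. With the dependencies sorted out, write a TokenizerFactory class that can be referenced from the Solr configuration. The code is as follows: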

package org.ansj.solr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class AnsjTokenizerFactory extends TokenizerFactory {

    boolean pstemming;
    boolean isQuery;
    private String stopwordsDir;
    public Set<String> filter;

    public AnsjTokenizerFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        isQuery = getBoolean(args, "isQuery", true);
        pstemming = getBoolean(args, "pstemming", false);
        stopwordsDir = get(args, "words");
        addStopwords(stopwordsDir);
    }

    // Load the stopword list (one word per line, UTF-8) into the filter set.
    private void addStopwords(String dir) {
        if (dir == null) {
            System.out.println("no stopwords dir");
            return;
        }
        System.out.println("stopwords: " + dir);
        filter = new HashSet<String>();
        File file = new File(dir);
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            String word = br.readLine();
            while (word != null) {
                filter.add(word);
                word = br.readLine();
            }
        } catch (FileNotFoundException e) {
            System.out.println("No stopword file found");
        } catch (IOException e) {
            System.out.println("stopword file io exception");
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        if (isQuery) {
            // query: coarser segmentation
            return new AnsjTokenizer(new ToAnalysis(new BufferedReader(input)), input, filter, pstemming);
        } else {
            // index: finer segmentation
            return new AnsjTokenizer(new IndexAnalysis(new BufferedReader(input)), input, filter, pstemming);
        }
    }
}

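  The pstemming parameter is required by ansj.

  The isQuery parameter tells the factory whether it is tokenizing for queries or for indexing. In search systems, segmentation is usually finer-grained at index time and coarser at query time.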
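  The factory also reads a "words" argument that points to a stopword file with one word per line in UTF-8. As a rough illustration, that loading step can be sketched with the JDK alone. The class name StopwordLoader is hypothetical, and trimming whitespace and skipping blank lines are assumptions that the factory code above does not make.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of ansj or Solr: reads a UTF-8 stopword
// file (one word per line) into a set.
final class StopwordLoader {

    private StopwordLoader() {
    }

    static Set<String> load(String path) throws IOException {
        Set<String> words = new HashSet<>();
        // try-with-resources closes the reader even when reading fails
        try (BufferedReader br = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Assumption: surrounding whitespace and blank lines are ignored
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

  Unlike the factory's addStopwords method, this sketch lets the IOException propagate instead of printing a message, so a missing or unreadable file surfaces as an error rather than as an empty filter.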

2. Building the jar

The code structure is as follows:

  The Maven build command is: mvn install -DskipTests=true  # skips running the unit tests (the test sources are still compiled)

  

Run the build:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building ansj_lucene4_plug 2.0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ ansj_lucene4_plug ---
[INFO] Deleting R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/target
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:resources (default-resources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ansj_lucene4_plug ---
[INFO] Compiling 5 source files to R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:testResources (default-testResources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ansj_lucene4_plug ---
[INFO] Compiling 3 source files to R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/target/test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.7.1:test (default-test) @ ansj_lucene4_plug ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.3.1:jar (default-jar) @ ansj_lucene4_plug ---
[INFO] Building jar: R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/target/ansj_lucene4_plug-2.0.2.jar
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ ansj_lucene4_plug ---
[INFO] Installing R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/target/ansj_lucene4_plug-2.0.2.jar to C:/Users/GCZX-016/.m2/repository/org/ansj/ansj_lucene4_plug/2.0.2/ansj_lucene4_plug-2.0.2.jar
[INFO] Installing R:/ansj-seg/ansj_seg/plug/ansj_lucene4_plug/pom.xml to C:/Users/GCZX-016/.m2/repository/org/ansj/ansj_lucene4_plug/2.0.2/ansj_lucene4_plug-2.0.2.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.149s
[INFO] Finished at: Tue May 05 15:29:19 CST 2015
[INFO] Final Memory: 27M/245M
[INFO] ------------------------------------------------------------------------

  

  

  

