Stanford Word Segmenter in Practice

AI小白入门 · Published 2017-09-20 19:08:24

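Introduction

English text already has word boundaries marked by spaces, whereas the smallest unit of written Chinese is the character, so Chinese text has to be segmented into words before it can be processed.
Home page of the open-source Stanford segmenter: https://nlp.stanford.edu/software/segmenter.shtml

Tokenization of raw text is a standard preprocessing step for many NLP tasks. For English, tokenization usually involves splitting off punctuation and separating some affixes. Other languages need more extensive token preprocessing, which is usually called segmentation.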
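To make the contrast concrete, here is a minimal sketch of a naive rule-based tokenizer. It is not part of the Stanford tools; the class name `NaiveTokenizer` and its regular expression are made up for illustration. Splitting on spaces and punctuation works reasonably for an English sentence, but it returns a Chinese sentence as a single chunk, which is why a trained segmenter is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Stanford segmenter.
public class NaiveTokenizer {

    // A run of letters/digits is one token; any other non-space character is its own token.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English: spaces and punctuation are enough to find the words.
        System.out.println(tokenize("I live in America."));  // [I, live, in, America, .]
        // Chinese: no spaces, so the whole sentence stays in one piece.
        System.out.println(tokenize("我住在美国。"));          // [我住在美国, 。]
    }
}
```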

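The Stanford Word Segmenter currently supports Arabic and Chinese. The Stanford Tokenizer can be used for English, French, and Spanish.
JDK 1.8 or later is required.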

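Chinese segmentation with the Stanford tools:
Chinese needs to be segmented into words. This tool is a Java implementation of a CRF-based Chinese word segmenter.
The implementation is based on the paper: A Conditional Random Field Word Segmenter
Paper: https://nlp.stanford.edu/pubs/sighan2005.pdf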
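A character-level segmenter of this kind assigns a label to every character, and the words are then read off from the labels. As I understand the paper, each character is labelled as either starting a new word or continuing the current one; treat that reading as an assumption. The sketch below only shows the decoding step. The class `LabelDecodingSketch` is hypothetical, and the labels are written by hand rather than predicted by a CRF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of turning per-character labels into words.
// The labels here are hand-written; in the real tool a CRF model predicts them.
public class LabelDecodingSketch {

    /** Rebuilds words from labels, where startsWord[i] == true means character i begins a new word. */
    public static List<String> decode(String text, boolean[] startsWord) {
        if (text.length() != startsWord.length) {
            throw new IllegalArgumentException("need exactly one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // Close the word in progress whenever a new word starts.
            if (startsWord[i] && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Labels for 我 | 住 在 | 美 国 | 。
        boolean[] labels = {true, true, false, true, false, true};
        System.out.println(decode("我住在美国。", labels));  // [我, 住在, 美国, 。]
    }
}
```

For simplicity the sketch works on Java `char` values, so it assumes the text contains no characters outside the Basic Multilingual Plane.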

This release contains two independent segmenters: one following the Chinese Penn Treebank standard and one following the Peking University standard.
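In code, choosing between the two standards comes down to which model file is loaded. The helper below is a small sketch of that choice. The class and enum names are hypothetical, and the file names `ctb.gz` and `pku.gz` are my assumption about how the models are named in the download; check your own data directory before relying on them.

```java
// Hypothetical helper; the model file names are assumptions about the download's layout.
public class ModelPathSketch {

    public enum Standard { CTB, PKU }

    /** Returns the path of the model file for the chosen segmentation standard. */
    public static String modelPath(String baseDir, Standard standard) {
        String file = (standard == Standard.CTB) ? "ctb.gz" : "pku.gz";
        return baseDir + "/" + file;
    }

    public static void main(String[] args) {
        System.out.println(modelPath("data", Standard.CTB));  // data/ctb.gz
        System.out.println(modelPath("data", Standard.PKU));  // data/pku.gz
    }
}
```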

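A later release can make use of external lexicon features and segments more accurately. It is based on the
paper: Optimizing Chinese Word Segmentation for Machine Translation Performance.
Paper: https://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf

The package includes components for command-line invocation and a Java API. The download contains the model files, compiled code, and source files. Once you unpack the tar file, you should have everything you need. Simple scripts for invoking the segmenter are included.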

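Hands-on

Copy the data directory from the download into your project, then add the three jars stanford-segmenter-3.8.0.jar, stanford-segmenter-3.8.0-javadoc.jar, and stanford-segmenter-3.8.0-sources.jar to lib. Next, copy in the SegDemo class that ships with the download and run it directly. Pay attention to where the files are placed; if something goes wrong, adjust the file paths in the code so that they match your layout.

You can also run it with an input argument: in the run configuration, enter a file path under Program arguments, for example src\test.txt.
Running the program prints the segmentation result.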
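The demo reads its input as UTF-8, so the test file should be saved in that encoding. If you would rather create the file from code than from an editor, something like the following sketch should work. The class name `WriteTestFile` is hypothetical, and the sample sentence is simply the one that SegDemo uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for preparing an input file; not part of the Stanford download.
public class WriteTestFile {

    /** Writes one sentence per line as UTF-8, creating parent directories if needed. */
    public static Path write(Path target, List<String> sentences) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, sentences, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path written = write(Paths.get("src", "test.txt"), Arrays.asList("我住在美国。"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Pass the resulting path as the program argument, as described above.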

SegDemo code:

package WordSegmenter;


import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.io.*;
import java.util.List;
import java.util.Properties;

/*
 * Reads text from the files given as arguments and prints the segmentation result.
 */

public class SegDemo {

  private static final String basedir = System.getProperty("SegDemo", "data/pos_model");

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    // Set the segmenter properties
    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    // Segment each file passed as an argument
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    String sample = "我住在美国。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}

Output:
src\test.txt is the file to segment, passed in as the program argument.

testFile=src\test.txt
serDictionary=data/pos_model/dict-chris6.ser.gz
sighanCorporaDict=data/pos_model
inputEncoding=UTF-8
sighanPostProcessing=true
Loading Chinese dictionaries from 1 file:
  data/pos_model/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
Loading classifier from data/pos_model/ctb.gz ... done [10.6 sec].
Loading character dictionary file from data/pos_model/dict/character_list [done].
Loading affix dictionary from data/pos_model/dict/in.ctb [done].
我的 是 你 的 嘛
CRFClassifier tagged 6 words in 1 documents at 81.08 words per second.
[我, 住在, 美国, 。]