Removing tones from pinyin

lawenliu · published 2020-05-12 11:23:03

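Pinyin can be written in several ways: with Unicode tone marks or with tone numbers. We often need to convert between these formats. Here is a simple example of such a conversion.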

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

# map vowel-number combination to unicode
mapVowelTone2Unicode = {'a1': 'ā',
                        'a2': 'á',
                        'a3': 'ǎ',
                        'a4': 'à',
                        'e1': 'ē',
                        'e2': 'é',
                        'e3': 'ě',
                        'e4': 'è',
                        'i1': 'ī',
                        'i2': 'í',
                        'i3': 'ǐ',
                        'i4': 'ì',
                        'o1': 'ō',
                        'o2': 'ó',
                        'o3': 'ǒ',
                        'o4': 'ò',
                        'u1': 'ū',
                        'u2': 'ú',
                        'u3': 'ǔ',
                        'u4': 'ù',
                        'v1': 'ǖ',
                        'v2': 'ǘ',
                        'v3': 'ǚ',
                        'v4': 'ǜ',
                       }

# map unicode to vowel-number combination
mapVowelUnicode2Tone = {'ā': 'a1',
                        'á': 'a2',
                        'ǎ': 'a3',
                        'à': 'a4',
                        'ē': 'e1',
                        'é': 'e2',
                        'ě': 'e3',
                        'è': 'e4',
                        'ī': 'i1',
                        'í': 'i2',
                        'ǐ': 'i3',
                        'ì': 'i4',
                        'ō': 'o1',
                        'ó': 'o2',
                        'ǒ': 'o3',
                        'ò': 'o4',
                        'ū': 'u1',
                        'ú': 'u2',
                        'ǔ': 'u3',
                        'ù': 'u4',
                        'ǖ': 'v1',
                        'ǘ': 'v2',
                        'ǚ': 'v3',
                        'ǜ': 'v4',
                       }

# map vowel unicode to vowel
mapVowelUnicode2WithoutTone = {'ā': 'a',
                               'á': 'a',
                               'ǎ': 'a',
                               'à': 'a',
                               'ē': 'e',
                               'é': 'e',
                               'ě': 'e',
                               'è': 'e',
                               'ī': 'i',
                               'í': 'i',
                               'ǐ': 'i',
                               'ì': 'i',
                               'ō': 'o',
                               'ó': 'o',
                               'ǒ': 'o',
                               'ò': 'o',
                               'ū': 'u',
                               'ú': 'u',
                               'ǔ': 'u',
                               'ù': 'u',
                               'ǖ': 'v',
                               'ǘ': 'v',
                               'ǚ': 'v',
                               'ǜ': 'v',
                              }

def ParsePinyin(iFilename, oFilename):
    wordList = []
    seen = set()  # pinyin already added, for fast duplicate checks
    # the pinyin data contains non-ASCII characters, so read it as UTF-8
    with open(iFilename, 'r', encoding='utf-8') as fileReader:
        # skip the two header lines
        fileReader.readline()
        fileReader.readline()

        # read and parse pinyin
        for line in fileReader:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            uPinyin = fields[1]
            for vPinyin in uPinyin.split(','):
                for x, y in mapVowelUnicode2WithoutTone.items():
                    vPinyin = vPinyin.replace(x, y).replace(x.upper(), y.upper())
                # a toneless ü is written as v
                vPinyin = vPinyin.replace('Ü', 'V').replace('ü', 'v')
                if vPinyin not in seen:
                    seen.add(vPinyin)
                    wordList.append(vPinyin)
                print(uPinyin + " " + vPinyin)

    # write word list to file
    with open(oFilename, 'w', encoding='utf-8') as fileWriter:
        for pinyin in wordList:
            fileWriter.write(pinyin)
            fileWriter.write('\n')

# parse pinyin file to remove tone information
ParsePinyin("pinyin.txt", "pinyin_dict.txt")
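
The first two tables in the script are defined but never used by ParsePinyin. As a rough sketch of how they could drive a conversion between the two notations, the snippet below rebuilds the same vowel-number mapping in compact form. The names VOWELS, TONE_TO_UNICODE, UNICODE_TO_TONE, marks_to_numbers and numbers_to_marks are made up for this illustration. The tone digit stays directly after the vowel, as in the tables, rather than moving to the end of the syllable.

```python
# Hypothetical helpers; the data mirrors the vowel-number tables in the script.
VOWELS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'v': 'ǖǘǚǜ',
}

# 'a1' -> 'ā', 'a2' -> 'á', ... built from the compact table above
TONE_TO_UNICODE = {base + str(i + 1): ch
                   for base, marked in VOWELS.items()
                   for i, ch in enumerate(marked)}
# Reverse direction: 'ā' -> 'a1', ...
UNICODE_TO_TONE = {ch: key for key, ch in TONE_TO_UNICODE.items()}


def marks_to_numbers(text):
    """Replace each tone-marked vowel with vowel + tone digit, e.g. 'ǎ' -> 'a3'."""
    return ''.join(UNICODE_TO_TONE.get(ch, ch) for ch in text)


def numbers_to_marks(text):
    """Replace each vowel + tone digit pair with the tone-marked vowel."""
    for key, ch in TONE_TO_UNICODE.items():
        text = text.replace(key, ch)
    return text
```

With these helpers, marks_to_numbers('nǐ hǎo') gives 'ni3 ha3o', and numbers_to_marks turns that string back into 'nǐ hǎo'.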

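The input file "pinyin.txt" can be downloaded from the mozillazg/pinyin-data repository on GitHub (汉字拼音数据, pinyin data for Chinese characters). The code above strips all tone information from it.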
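
A lookup table is not the only way to strip tones. As an alternative sketch, not the approach used in the script, Unicode normalization can split each letter from its tone mark so the mark can be dropped. This also covers tone-marked letters that are not in the table, such as ń. The names TONE_MARKS and strip_tones are hypothetical.

```python
import unicodedata

# Combining marks for the four tones: macron, acute, caron, grave.
TONE_MARKS = {'\u0304', '\u0301', '\u030c', '\u0300'}


def strip_tones(pinyin):
    """Remove tone marks and write ü as v, matching the script's output style."""
    # NFD splits e.g. 'ǜ' into 'u' + diaeresis + grave accent
    decomposed = unicodedata.normalize('NFD', pinyin)
    kept = ''.join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC joins 'u' + diaeresis back into a single 'ü'
    recomposed = unicodedata.normalize('NFC', kept)
    return recomposed.replace('ü', 'v').replace('Ü', 'V')
```

The diaeresis on ü is deliberately not treated as a tone mark, so ü survives the first step and is then mapped to v as in the original tables.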

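Tones are removed here so that the result can be matched against pinyin typed by users and then used for segmentation. Users usually type pinyin without tones, which is one use case for toneless pinyin.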
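
To make the matching step concrete, here is a minimal sketch of segmenting unspaced user input against a set of toneless syllables. The post does not say which segmentation method is used, so the greedy longest-match strategy, the function name segment, and the sample set are all assumptions for illustration.

```python
def segment(text, syllables, max_len=6):
    """Split toneless pinyin into syllables by greedy longest match.

    Returns None when some part of the input matches no known syllable.
    """
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in syllables:
                result.append(candidate)
                pos += size
                break
        else:
            return None  # no syllable starts at this position
    return result


# Tiny hand-written sample; in practice the set would be read from pinyin_dict.txt.
sample = {'ni', 'hao', 'zhong', 'guo', 'xi', 'an', 'xian'}
print(segment('nihao', sample))
print(segment('zhongguo', sample))
```

Greedy matching is simple but cannot resolve ambiguous input: with this sample set, 'xian' is always returned as one syllable and never as 'xi' + 'an'.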
