Processing Specification for a Large-Scale Annotated Corpus of Contemporary Chinese[1]

Yu Shiwen (俞士汶), Zhu Xuefeng (朱学锋), Duan Huiming (段慧明)

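Abstract: Building on language resources it has already developed, such as the Grammatical Knowledge-base of Contemporary Chinese (《现代汉语语法信息词典》), the Institute of Computational Linguistics at Peking University is carrying out another large language engineering project: multi-level processing of a large-scale raw corpus of contemporary Chinese. The current processing tasks include word segmentation and part-of-speech tagging (including the special usages of verbs and adjectives), as well as marking proper nouns and phrase-type place names, organization names, and so on.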

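The planned size of the corpus is about 27 million characters. So far 14 million characters have been completed, and the quality is very high.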

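Building a high-quality annotated corpus requires a complete processing specification. This paper introduces the principles for drawing up the specification and the experience gained in applying it.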

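Keywords: contemporary Chinese annotated corpus, word segmentation, part-of-speech tagging, Grammatical Knowledge-base of Contemporary Chinese, processing specification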

The Guideline for Segmentation and Part-Of-Speech Tagging on Very Large Scale Corpus of Contemporary Chinese

Yu Shiwen, Zhu Xuefeng, Duan Huiming

Abstract: Building on its existing resources, e.g. the Grammatical Knowledge-base of Contemporary Chinese, the Institute of Computational Linguistics of Peking University is developing a very large-scale corpus of contemporary Chinese that is segmented and annotated with several kinds of tags. The tag set has about 40 tags. It contains common part-of-speech tags, tags for special usages of verbs and adjectives, and tags for proper nouns, phrase-type place names, phrase-type organization names, and so on.

The planned size of the corpus is about 27 million Chinese characters. The Institute of Computational Linguistics of PKU has completed 14 million characters, and the processing quality is very high.

A complete guideline for corpus processing is necessary to obtain a high-quality tagged corpus. This paper introduces the principles for drawing up the guideline and the experience gained in carrying it out.

Keywords: Contemporary Chinese Tagged Corpus, Segmentation, Part-Of-Speech Tagging, The Grammatical Knowledge-base of Contemporary Chinese, processing guideline

