Chinese Text Annotation (NLP Toolkit: doccano): User Manual for the doccano Annotation Platform

Date: 2025-07-06 19:15:07 | Category: IT & Technology


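I. Introduction

doccano is an open-source text annotation platform. It provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization, machine translation, and similar tasks. Create a project, upload your data, and start annotating; a usable dataset can be built within a few hours.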

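doccano features:

Collaborative annotation: several annotators can work on the same project, and annotation tasks can be assigned among them.
Multilingual text annotation: languages currently known to be supported include English, Chinese, Japanese, Arabic, and Indonesian, among others.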

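II. Installation and Usage

Installing with Docker is relatively simple, so only the Docker method is described here:

Step 1. Pull the image

An image is comparable to a class in an object-oriented language.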

docker pull doccano/doccano

Step 2. Create the container

A container is the instantiated object that corresponds to an image.
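The class/instance analogy can be made concrete in a few lines of Python. This is only an illustration of the analogy: `Image` and `Container` are hypothetical names invented for this sketch and are not part of Docker or any Docker SDK.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Image:
    """A read-only template, playing the role of a class."""
    name: str

    def create_container(self, container_name: str) -> "Container":
        # Each call yields a new, independent "instance" of the image.
        return Container(image=self, name=container_name)


@dataclass
class Container:
    """An instance of an image with its own mutable state."""
    image: Image
    name: str
    files: dict = field(default_factory=dict)


image = Image("doccano/doccano")
first = image.create_container("doccano")
second = image.create_container("doccano-test")

# Writing inside one container does not affect the other or the image.
first.files["/data/db.sqlite3"] = "annotations"
print(second.files)  # the second container's state is still empty
```

Both containers share the same image, yet each keeps its own state, which is the point of the comparison.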

docker container create --name doccano \
  -e "ADMIN_USERNAME=admin" \
  -e "ADMIN_EMAIL=admin@example.com" \
  -e "ADMIN_PASSWORD=123456" \
  -v doccano-db:/data \
  -p 8000:8000 doccano/doccano

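The parameters have the following meanings:

--name doccano: names the new container doccano.
-e "ADMIN_USERNAME=admin": sets the doccano administrator account to admin.
-e "ADMIN_EMAIL=admin@example.com": sets the administrator's contact email to admin@example.com.
-e "ADMIN_PASSWORD=123456": sets the administrator's login password to 123456.
-v doccano-db:/data: mounts the Docker volume doccano-db at /data inside the container, so project data is kept on the host and survives container removal.
-p 8000:8000: maps the host port (first) to the container port (second).
doccano/doccano: the image name.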
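When the same container has to be created with different settings (another password, another host port), it can help to assemble the command programmatically. The following is a minimal sketch, assuming the flags listed above; `build_create_command` is a hypothetical helper written for this example, not something shipped with doccano or Docker.

```python
from __future__ import annotations

import shlex


def build_create_command(
    container_name: str = "doccano",
    admin_username: str = "admin",
    admin_email: str = "admin@example.com",
    admin_password: str = "123456",
    volume: str = "doccano-db:/data",
    host_port: int = 8000,
    container_port: int = 8000,
    image: str = "doccano/doccano",
) -> list[str]:
    """Return the argv list for `docker container create` (hypothetical helper)."""
    return [
        "docker", "container", "create",
        "--name", container_name,
        "-e", f"ADMIN_USERNAME={admin_username}",
        "-e", f"ADMIN_EMAIL={admin_email}",
        "-e", f"ADMIN_PASSWORD={admin_password}",
        "-v", volume,
        # Host port comes first, container port second.
        "-p", f"{host_port}:{container_port}",
        image,
    ]


# Example: expose doccano on host port 8080 instead of 8000.
command = build_create_command(host_port=8080)
print(shlex.join(command))
```

The sketch only prints the command. To actually run it, the list could be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed.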

Step 3. Start the container

docker container start doccano

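Step 4. Open it in a browser

Open the IP address of the server where doccano is deployed, followed by the mapped port, in a browser:

In this example we enter: http://10.6.16.96:8000/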
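Before opening the browser, it can be useful to confirm that the service answers on the mapped port. The sketch below uses only the Python standard library; `base_url` and `is_reachable` are hypothetical helper names, and the host and port are simply the ones from this example.

```python
import urllib.error
import urllib.request


def base_url(host: str, port: int) -> str:
    """Build the address to type into the browser."""
    return f"http://{host}:{port}/"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers an HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


url = base_url("10.6.16.96", 8000)
print(url)
```

Calling `is_reachable(url)` from a machine on the same network should return True once the container has started; a False result usually points to a stopped container, a wrong port mapping, or a firewall.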

Click the login button in the upper-right corner to sign in.

III. Data Annotation

1. Create a project

We use an agricultural peanut data annotation task as the example:

2. Upload data
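Raw text usually has to be put into an importable file before uploading. A minimal sketch follows, assuming the JSONL import option is used and that each line is a JSON object with a "text" field; that field name is an assumption to check against the import dialog of your doccano version. The sample sentences are invented placeholders, and `write_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line; return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                continue  # skip blank entries
            # ensure_ascii=False keeps Chinese characters readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Invented placeholder sentences for the peanut example.
sample_texts = [
    "花生易感染叶斑病。",
    "",
    "花生适宜在沙壤土中种植。",
]

out_path = os.path.join(tempfile.gettempdir(), "peanut_upload.jsonl")
print(write_jsonl(sample_texts, out_path), "lines written to", out_path)
```

The resulting file can then be selected in the upload dialog of the project.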

3. Define labels

Create entity labels
Create relation labels

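4. Annotate

When annotating a relation, left-click the tag of the head entity and then the tag of the tail entity; the available relation options appear only after both have been clicked in that order.

Annotation result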

5. Export data

The exported data has the following format:
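As a rough illustration of how such an export can be consumed, the sketch below assumes each line of the JSONL(relation) file is a JSON object with "text", "entities" (id, label, start_offset, end_offset), and "relations" (id, from_id, to_id, type) fields. These field names follow the layout commonly seen in doccano relation exports but are assumptions here, so compare them with your own file. The sample record and its labels are invented.

```python
import json

# One invented record in the assumed JSONL(relation) layout.
sample_line = json.dumps(
    {
        "id": 1,
        "text": "花生易感染叶斑病",
        "entities": [
            {"id": 10, "label": "作物", "start_offset": 0, "end_offset": 2},
            {"id": 11, "label": "病害", "start_offset": 5, "end_offset": 8},
        ],
        "relations": [
            {"id": 100, "from_id": 10, "to_id": 11, "type": "易感染"},
        ],
    },
    ensure_ascii=False,
)


def extract_triples(line):
    """Return (head text, relation type, tail text) triples for one record."""
    record = json.loads(line)
    text = record["text"]
    # Map each entity id to the text span it covers.
    spans = {
        entity["id"]: text[entity["start_offset"]:entity["end_offset"]]
        for entity in record["entities"]
    }
    return [
        (spans[rel["from_id"]], rel["type"], spans[rel["to_id"]])
        for rel in record["relations"]
    ]


print(extract_triples(sample_line))
```

Relations refer to entities by id rather than by text, so resolving ids to spans first keeps the lookup simple.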

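6. Convert data

Data conversion for extraction tasks:

Once annotation is complete, export the data from the doccano platform as a JSONL(relation) file, rename it to doccano_ext.json, and place it in the ./data directory. Then run the doccano.py script to convert the data format, after which model training can begin. Note that the script's default --doccano_file is ../data/doccano_ext.json, so pass the path explicitly if your directory layout differs.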

The code of doccano.py is as follows:

import os
import time
import argparse
import json

import numpy as np

# These helpers are provided by the project's own utils.py.
from utils import set_seed, convert_ext_examples, convert_cls_examples


def do_convert():
    set_seed(args.seed)

    tic_time = time.time()
    if not os.path.exists(args.doccano_file):
        raise ValueError("Please input the correct path of doccano file.")

    if not os.path.exists(args.save_dir):
        os.makedirs(args.save_dir)

    if len(args.splits) != 0 and len(args.splits) != 3:
        raise ValueError("Only []/ len(splits)==3 accepted for splits.")

    # Compare with a tolerance: float ratios may not sum to exactly 1.
    if args.splits and abs(sum(args.splits) - 1) > 1e-6:
        raise ValueError(
            "Please set correct splits, sum of elements in splits should be equal to 1."
        )

    with open(args.doccano_file, "r", encoding="utf-8") as f:
        raw_examples = f.readlines()

    def _create_ext_examples(examples, negative_ratio=0, shuffle=False, is_train=True):
        entities, relations = convert_ext_examples(
            examples, negative_ratio, is_train=is_train)
        examples = entities + relations
        if shuffle:
            indexes = np.random.permutation(len(examples))
            examples = [examples[i] for i in indexes]
        return examples

    def _create_cls_examples(examples, prompt_prefix, options, shuffle=False):
        examples = convert_cls_examples(examples, prompt_prefix, options)
        if shuffle:
            indexes = np.random.permutation(len(examples))
            examples = [examples[i] for i in indexes]
        return examples

    def _save_examples(save_dir, file_name, examples):
        count = 0
        save_path = os.path.join(save_dir, file_name)
        with open(save_path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
                count += 1
        print("\nSave %d examples to %s." % (count, save_path))

    if len(args.splits) == 0:
        # No split requested: write everything to train.txt.
        if args.task_type == "ext":
            examples = _create_ext_examples(raw_examples, args.negative_ratio,
                                            args.is_shuffle)
        else:
            examples = _create_cls_examples(raw_examples, args.prompt_prefix,
                                            args.options, args.is_shuffle)
        _save_examples(args.save_dir, "train.txt", examples)
    else:
        if args.is_shuffle:
            indexes = np.random.permutation(len(raw_examples))
            raw_examples = [raw_examples[i] for i in indexes]

        # Cut points for the train/dev/test slices.
        i1, i2, _ = args.splits
        p1 = int(len(raw_examples) * i1)
        p2 = int(len(raw_examples) * (i1 + i2))

        if args.task_type == "ext":
            train_examples = _create_ext_examples(
                raw_examples[:p1], args.negative_ratio, args.is_shuffle)
            dev_examples = _create_ext_examples(
                raw_examples[p1:p2], -1, is_train=False)
            test_examples = _create_ext_examples(
                raw_examples[p2:], -1, is_train=False)
        else:
            train_examples = _create_cls_examples(
                raw_examples[:p1], args.prompt_prefix, args.options)
            dev_examples = _create_cls_examples(
                raw_examples[p1:p2], args.prompt_prefix, args.options)
            test_examples = _create_cls_examples(
                raw_examples[p2:], args.prompt_prefix, args.options)

        _save_examples(args.save_dir, "train.txt", train_examples)
        _save_examples(args.save_dir, "dev.txt", dev_examples)
        _save_examples(args.save_dir, "test.txt", test_examples)

    print("Finished! It takes %.2f seconds" % (time.time() - tic_time))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()

    parser.add_argument("--doccano_file", default=r"../data/doccano_ext.json", type=str, help="The doccano file exported from doccano platform.")
    parser.add_argument("--save_dir", default=r"../data", type=str, help="The path of data that you wanna save.")
    parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples, number of negative samples = negative_ratio * number of positive samples")
    # Percent signs are doubled because argparse applies %-formatting to help strings.
    parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60%% samples used for training, 20%% for evaluation and 20%% for test.")
    parser.add_argument("--task_type", choices=["ext", "cls"], default="ext", type=str, help="Select task type, ext for the extraction task and cls for the classification task, defaults to ext.")
    parser.add_argument("--options", default=["正向", "负向"], type=str, nargs="+", help="Used only for the classification task, the options for classification")
    parser.add_argument("--prompt_prefix", default="情感倾向", type=str, help="Used only for the classification task, the prompt prefix for classification")
    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly.
    parser.add_argument("--is_shuffle", default=True, type=lambda s: str(s).lower() in ("true", "1", "yes"), help="Whether to shuffle the labeled dataset, defaults to True.")
    parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")

    args = parser.parse_args()
    # yapf: enable

    do_convert()
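The way the script turns the --splits ratios into train/dev/test slices is easy to check in isolation. The sketch below reproduces only that index arithmetic (p1 and p2) on a toy list. `split_examples` is a hypothetical helper written for this illustration; it checks the ratio sum with a small tolerance, which is a choice made here.

```python
def split_examples(examples, splits):
    """Slice examples into train/dev/test with the script's index arithmetic."""
    if len(splits) != 3 or abs(sum(splits) - 1.0) > 1e-6:
        raise ValueError("splits must hold three ratios that sum to 1")
    train_ratio, dev_ratio, _ = splits
    # Cut points are truncated to integers; the test slice takes the remainder.
    p1 = int(len(examples) * train_ratio)
    p2 = int(len(examples) * (train_ratio + dev_ratio))
    return examples[:p1], examples[p1:p2], examples[p2:]


records = [f"example-{i}" for i in range(10)]
train, dev, test = split_examples(records, [0.8, 0.1, 0.1])
print(len(train), len(dev), len(test))  # 8 1 1
```

Because the cut points are truncated, very small datasets can end up with an empty dev slice, so it is worth printing the slice sizes before training.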

Appendix

Official source code of the doccano annotation platform: https://github.com/doccano/doccano

Official documentation of the doccano annotation platform: https://doccano.github.io/doccano/