Existing corpus formats
Differences
This page shows the differences between the revision you selected and the current version.
| Existing corpus formats [2023/11/11 10:54] – MNBVC项目组 | Existing corpus formats [2025/06/02 15:17] (current version) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 18: | Line 18: | ||
| Parallel corpus format | Parallel corpus format | ||
| [[https:// | [[https:// | ||
| + | |||
| + | Multimodal corpus | ||
| + | [[https:// | ||
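| ====== MNBVC corpus format checker ====== | ====== MNBVC corpus format checker ====== | ||
| All MNBVC corpora will eventually share a unified format. Anyone submitting data should run the format checker first: [[https:// | All MNBVC corpora will eventually share a unified format. Anyone submitting data should run the format checker first: [[https:// | ||
| - | ====== MNBVC corpus format details ====== | + | ======MNBVC corpus format details ====== |
| Each jsonl file in a corpus format is slightly larger than 500 MB. | Each jsonl file in a corpus format is slightly larger than 500 MB. | ||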
| + | |||
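| + | ==== About “时间” (time) ==== | ||
| + | The 时间 (time) field is present in every corpus format and is required. It records the earliest time at which the corpus item appeared and is always a string in yyyymmdd format, following these rules: | ||
| + | - The year is always 4 digits, and the month and day are always 2 digits each; for example, January 1, 2024 is written as ' | ||
| + | - A year with fewer than 4 digits is left-padded with zeros to 4 digits; for example, March 3, 738 is written as ' | ||
| + | - When the day or month is not known, it is recorded as 01; for example, the year 738 CE is written as ' | ||
| + | - For BCE dates, a minus sign is added in front; for example, 5000 BCE is written as ' | ||
| + | |||
| + | Note: in Python, zero-padding to 4 digits only requires adding: | ||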
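The four rules above can be sketched in Python as follows. This is a minimal sketch rather than the project's own code: the function name `format_time` and the convention of passing BCE years as negative integers are assumptions made for illustration, and the outputs follow from the rules as stated, since the sample values on this page are cut off.

```python
def format_time(year, month=None, day=None):
    """Build a yyyymmdd string following the rules listed above.

    year  -- integer; a negative value stands for a BCE year (assumption of this sketch)
    month -- 1-12, or None when unknown (recorded as 01)
    day   -- 1-31, or None when unknown (recorded as 01)
    """
    month = 1 if month is None else month
    day = 1 if day is None else day
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        raise ValueError("month or day out of range")
    sign = "-" if year < 0 else ""
    # Zero-pad the year to 4 digits and the month and day to 2 digits each.
    return f"{sign}{abs(year):04d}{month:02d}{day:02d}"


print(format_time(2024, 1, 1))  # full date
print(format_time(738, 3, 3))   # year shorter than 4 digits
print(format_time(738))         # month and day unknown
print(format_time(-5000))       # BCE year
```

The zero-padding itself is just the `:04d` and `:02d` format specifiers; the rest of the function only handles the unknown-month/day and BCE cases.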
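| ==== General text output jsonl format ==== | ==== General text output jsonl format ==== | ||
| - | 1. For each file, the json structure is as follows: | + | 1. For each file, the time format is yyyymmdd (see the section above for details), and the json structure is as follows: |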
| < | < | ||
| Line 42: | Line 54: | ||
| ' | ' | ||
| ' | ' | ||
| - | ' | + | ' |
| + | ' | ||
| + | ' | ||
| } | } | ||
| </ | </ | ||
| Line 54: | Line 68: | ||
| ' | ' | ||
| ' | ' | ||
| - | ' | + | ' |
| + | ' | ||
| } | } | ||
| </ | </ | ||
| Line 77: | Line 92: | ||
| ' | ' | ||
| ' | ' | ||
| - | ' | + | ' |
| - | } | + | ' |
| + | }, | ||
| + | ' | ||
| + | ' | ||
| ] | ] | ||
| } | } | ||
| Line 98: | Line 116: | ||
| " | " | ||
| " | " | ||
| - | } | + | }, |
| + | " | ||
| } | } | ||
| </ | </ | ||
| Line 131: | Line 150: | ||
| " | " | ||
| " | " | ||
| + | " | ||
| " | " | ||
| " | " | ||
| Line 178: | Line 198: | ||
| " | " | ||
| " | " | ||
| - | " | + | " |
| + | " | ||
| } | } | ||
| </ | </ | ||
| Line 194: | Line 215: | ||
| " | " | ||
| " | " | ||
| - | " | + | " |
| + | " | ||
| } | } | ||
| </ | </ | ||
| + | |||
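| + | ==== Code commit corpus output jsonl format ===== | ||
| + | 1. Each line holds the data for one text, corresponding to the change made to one text file in a code repository. | ||
| + | |||
| + | 2. For each line of data, the top-level structure is as follows. | ||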
| + | < | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | This is the first line. | ||
| + | -This is the second line. | ||
| + | +This line has been modified. | ||
| + | @@ -5,2 +6,3 @@ | ||
| + | +This line has been modified again. | ||
| + | +This is another new line added.", | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | ' | ||
| + | } | ||
| + | </ | ||
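The sample above embeds a unified diff (the `-`/`+` lines and the `@@ -5,2 +6,3 @@` hunk header). As a rough illustration of how such a diff string can be produced, here is a minimal sketch using Python's standard `difflib`. It is not confirmed to be how the project generates its data; the file names and the variable `diff_text` are hypothetical, and the name of the json field that holds the diff is cut off in the sample, so it is not shown.

```python
import difflib

# Two hypothetical versions of the same text file, as lists of lines.
old_lines = ["This is the first line.\n", "This is the second line.\n"]
new_lines = ["This is the first line.\n", "This line has been modified.\n"]

# unified_diff yields the header lines, hunk markers, and -/+ lines one by one.
diff_text = "".join(
    difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="a/example.txt",  # hypothetical path
        tofile="b/example.txt",    # hypothetical path
    )
)
print(diff_text)
```

Removed lines are prefixed with `-`, added lines with `+`, and unchanged context lines with a single space, which matches the shape of the sample.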
| ==== Multi-turn dialogue output jsonl format ===== | ==== Multi-turn dialogue output jsonl format ===== | ||
| Line 208: | Line 257: | ||
| " | " | ||
| " | " | ||
| + | " | ||
| " | " | ||
| " | " | ||
| Line 246: | Line 296: | ||
| " | " | ||
| " | " | ||
| + | " | ||
| " | " | ||
| " | " | ||
| " | " | ||
| " | " | ||
| - | " | + | " |
| " | " | ||
| " | " | ||
| " | " | ||
| - | } | + | }" |
| } | } | ||
| } | } | ||
| Line 269: | Line 320: | ||
| " | " | ||
| " | " | ||
| + | " | ||
| " | " | ||
| " | " | ||
| Line 325: | Line 377: | ||
| " | " | ||
| " | " | ||
| + | " | ||
| " | " | ||
| { | { | ||
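| Line 351: | Line 404: | ||
| ==== Parallel corpus output jsonl format ==== | ==== Parallel corpus output jsonl format ==== | ||
| - | 1. For each file, the json structure is as follows: | + | A corpus file is in multi-line jsonl format. Below is a sample of one line (in the actual file each line is a single json object and is not pretty-printed with indentation): |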
| < | < | ||
| { | { | ||
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | } | + | |
| - | </ | + | |
| - | + | | |
| - | 2. Each line is treated as one paragraph; the json structure of a paragraph is as follows: | + | |
| - | < | + | |
| - | { | + | |
| - | '行号': line_number, | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | 'ja_text': Japanese, | + | |
| - | | + | " |
| - | | + | " |
| - | | + | " |
| - | | + | " |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| } | } | ||
| </ | </ | ||
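Since each record occupies exactly one line with no indentation, writing and reading such a file can be sketched as below. This is a minimal sketch under stated assumptions: the record is a hypothetical, incomplete example that uses only a few of the field names from this page (文件名, 段落数, 时间), and an in-memory buffer stands in for a real file.

```python
import io
import json

# Hypothetical, incomplete record; the values are placeholders.
record = {"文件名": "example.txt", "段落数": 1, "时间": "20240101"}

buffer = io.StringIO()  # stands in for an opened .jsonl file

# No indent argument, so the whole object is serialized on a single line.
# ensure_ascii=False keeps the Chinese field names readable instead of \u escapes.
line = json.dumps(record, ensure_ascii=False)
buffer.write(line + "\n")

# Reading back: one json.loads call per line.
parsed = [json.loads(row) for row in buffer.getvalue().splitlines()]
print(parsed[0]["文件名"])
```

Pretty-printing with `indent=` would spread one record over several lines and break the one-record-per-line assumption that jsonl readers rely on.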
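| - | 3. Example result: | + | If the corpus format described here differs from the parallel corpus group's main GitHub repository, **the README in the repository takes precedence**. https:// |
| + | |||
| + | Field descriptions: | ||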
| + | |||
| + | **文件名**: | ||
| + | |||
| + | **是否待查文件**: | ||
| + | |||
| + | **是否重复文件**: | ||
| + | |||
| + | **段落数**: | ||
| + | |||
| + | **去重段落数**: | ||
| + | |||
| + | **低质量段落数**: | ||
| + | |||
| + | **行号**: the paragraph index, an integer in the range `[1, 段落数]` | ||
| + | |||
| + | **是否重复**: | ||
| + | |||
| + | **是否跨文件重复**: | ||
| + | |||
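| + | **时间**: a date string in `yyyymmdd` format giving the time at which this corpus was converted into the standard parallel corpus format defined in this document. See the sample for reference. | ||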
| + | |||
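Two of the field constraints above, the `[1, 段落数]` range of 行号 and the `yyyymmdd` shape of 时间, are easy to check mechanically. The helpers below are a minimal sketch with hypothetical names; they are not the MNBVC format checker, and they cover only these two fields.

```python
import re

# Four-digit year, month 01-12, day 01-31 (no per-month day-count check).
TIME_PATTERN = re.compile(r"\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])")


def is_valid_time(value):
    """Return True if value is a string shaped like yyyymmdd."""
    return isinstance(value, str) and TIME_PATTERN.fullmatch(value) is not None


def is_valid_line_number(line_number, paragraph_count):
    """Return True if line_number is an integer within [1, paragraph_count]."""
    if isinstance(line_number, bool) or not isinstance(line_number, int):
        return False
    return 1 <= line_number <= paragraph_count


print(is_valid_time("20250602"), is_valid_line_number(2, 3))
```

The pattern deliberately rejects a leading minus sign, on the assumption that a conversion date is never a BCE date; the general 时间 field of the other formats would need the sign allowed.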
| - | < | ||
| - | { | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | ' | ||
| - | }] | ||
| - | } | ||
| - | </ | ||
现有语料格式.1699671249.txt.gz · Last modified: (external edit)
