Existing corpus formats
Differences
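Existing corpus formats [2024/02/02 17:50] – [General text output jsonl format] MNBVC project team (MNBVC项目组) | Existing corpus formats [2024/09/21 19:28] (current version) – [Code commit corpus output jsonl format] MNBVC project team (MNBVC项目组) | ||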
---|---|---|---|
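Line 25: | Line 25: | ||
In the future, all MNBVC corpora will use a unified format. Anyone submitting data should run the format checking tool: [[https:// | In the future, all MNBVC corpora will use a unified format. Anyone submitting data should run the format checking tool: [[https:// | ||
- | ====== MNBVC corpus format details ====== | + | ======MNBVC corpus format details ====== |
Each jsonl file in the corpus format is slightly larger than 500MB. | Each jsonl file in the corpus format is slightly larger than 500MB. | ||
+ | |||
+ | ==== About "时间" (time) ==== | ||
+ | The time field ("时间") is present in every corpus format and is required. It records the earliest time at which the corpus item appeared and is always a string in yyyymmdd format, with the following rules: | ||
+ | - The year is always 4 digits, and the month and day are always 2 digits each; for example, January 1, 2024 is written as ' | ||
+ | - A year with fewer than 4 digits is left-padded with zeros to 4 digits; for example, March 3, 738 is written as ' | ||
+ | - When the day or month is not known, it is written as 01; for example, the year 738 AD is written as ' | ||
+ | - For dates BC, a minus sign is added in front; for example, 5000 BC is written as ' | ||
+ | |||
+ | Note: zero-padding to 4 digits in Python only requires adding: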
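The Python snippet referred to above is cut off in this diff view and is not reconstructed here. As an independent, minimal sketch of the four rules, the function below builds the time string from numeric parts. The name `format_mnbvc_date` and the convention of passing a negative `year` for BC dates are assumptions made for illustration; they are not part of the MNBVC tooling.

```python
from typing import Optional


def format_mnbvc_date(year: int, month: Optional[int] = None,
                      day: Optional[int] = None) -> str:
    """Build a yyyymmdd time string following the rules listed above.

    Hypothetical helper: a negative `year` stands for a BC date
    (an assumption of this sketch, not an MNBVC convention).
    """
    # Unknown month or day is recorded as 01.
    mm = 1 if month is None else month
    dd = 1 if day is None else day
    if not 1 <= mm <= 12:
        raise ValueError(f"month out of range: {mm}")
    if not 1 <= dd <= 31:
        raise ValueError(f"day out of range: {dd}")
    # BC dates get a leading minus sign; the year itself is zero-padded to 4 digits.
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}{mm:02d}{dd:02d}"


print(format_mnbvc_date(2024, 1, 1))  # full date
print(format_mnbvc_date(738, 3, 3))   # year shorter than 4 digits
print(format_mnbvc_date(738))         # month and day unknown
print(format_mnbvc_date(-5000))       # BC year
```

The `:04d` format specifier does the zero-padding; applying it to `abs(year)` keeps the minus sign outside the padded digits.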
==== General text output jsonl format ==== | ==== General text output jsonl format ==== | ||
Line 46: | Line 55: | ||
' | ' | ||
' | ' | ||
- | '拓展字段': | + | '扩展字段': |
- | ' | + | ' |
} | } | ||
</ | </ | ||
Line 60: | Line 69: | ||
' | ' | ||
' | ' | ||
- | '拓展字段': | + | '扩展字段': |
} | } | ||
</ | </ | ||
Line 84: | Line 93: | ||
' | ' | ||
' | ' | ||
- | '拓展字段': | + | '扩展字段': |
}, | }, | ||
- | '拓展字段': | + | '扩展字段': |
- | ' | + | ' |
] | ] | ||
} | } | ||
Line 107: | Line 116: | ||
" | " | ||
" | " | ||
- | } | + | }, |
+ | " | ||
} | } | ||
</ | </ | ||
Line 140: | Line 150: | ||
" | " | ||
" | " | ||
+ | " | ||
" | " | ||
" | " | ||
Line 187: | Line 198: | ||
" | " | ||
" | " | ||
- | " | + | " |
+ | " | ||
} | } | ||
</ | </ | ||
Line 203: | Line 215: | ||
" | " | ||
" | " | ||
- | " | + | " |
+ | " | ||
} | } | ||
</ | </ | ||
+ | |||
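+ | ==== Code commit corpus output jsonl format ===== | ||
+ | 1. Each line holds the data for one text and corresponds to a change to one text file in a code repository. | ||
+ | |||
+ | 2. For each line of data, the top-level structure is as follows. | ||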
+ | < | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | This is the first line. | ||
+ | -This is the second line. | ||
+ | +This line has been modified. | ||
+ | @@ -5,2 +6,3 @@ | ||
+ | +This line has been modified again. | ||
+ | +This is another new line added.", | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | ' | ||
+ | } | ||
+ | </ | ||
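The sample record above embeds a change in unified-diff style, with `@@ -5,2 +6,3 @@` hunk headers and `+`/`-` line prefixes. As a rough sketch of how a consumer might inspect such text, the function below counts hunks and added or removed lines. The name `summarize_diff` is hypothetical, the function works on a plain string because the JSON field names are truncated in this view, and the first hunk header in the usage example is an assumed value.

```python
import re

# Matches unified-diff hunk headers such as "@@ -5,2 +6,3 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def summarize_diff(diff_text: str) -> dict:
    """Count hunks, added lines and removed lines in unified-diff text.

    Simplification of this sketch: lines starting with "+++" or "---"
    are treated as file headers and skipped, so a changed line whose own
    content begins with "++" or "--" would be missed.
    """
    hunks = added = removed = 0
    for line in diff_text.splitlines():
        if HUNK_HEADER.match(line):
            hunks += 1
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"hunks": hunks, "added": added, "removed": removed}


# The first hunk header below is an assumption; the remaining lines
# follow the sample shown above.
sample = (
    "@@ -1,3 +1,3 @@\n"
    " This is the first line.\n"
    "-This is the second line.\n"
    "+This line has been modified.\n"
    "@@ -5,2 +6,3 @@\n"
    "+This line has been modified again.\n"
    "+This is another new line added."
)
print(summarize_diff(sample))
```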
==== Multi-turn dialogue output jsonl format ===== | ==== Multi-turn dialogue output jsonl format ===== | ||
Line 217: | Line 257: | ||
" | " | ||
" | " | ||
+ | " | ||
" | " | ||
" | " | ||
Line 255: | Line 296: | ||
" | " | ||
" | " | ||
+ | " | ||
" | " | ||
" | " | ||
" | " | ||
" | " | ||
- | " | + | " |
" | " | ||
" | " | ||
" | " | ||
- | } | + | }" |
} | } | ||
} | } | ||
Line 278: | Line 320: | ||
" | " | ||
" | " | ||
+ | " | ||
" | " | ||
" | " | ||
Line 334: | Line 377: | ||
" | " | ||
" | " | ||
+ | " | ||
" | " | ||
{ | { | ||
Line 363: | Line 407: | ||
< | < | ||
{ | { | ||
- | ' | + | ' |
- | ' | + | ' |
- | ' | + | ' |
' | ' | ||
- | ' | + | ' |
- | ' | + | ' |
' | ' | ||
- | '拓展字段': | + | '扩展字段': |
+ | ' | ||
} | } | ||
</ | </ | ||
+ | |||
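+ | **Note:** For the two-letter abbreviations used in all language fields, prefer [ISO 639-1](https:// | ||
2. Each line is one paragraph; the JSON structure of a paragraph is as follows: | 2. Each line is one paragraph; the JSON structure of a paragraph is as follows: | ||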
+ | |||
< | < | ||
{ | { | ||
- | ' | + | ' |
- | ' | + | ' |
- | ' | + | ' |
- | ' | + | ' |
' | ' | ||
' | ' | ||
Line 397: | Line 445: | ||
' | ' | ||
' | ' | ||
- | ' | + | |
- | ' | + | ' |
- | '拓展字段': | + | ' |
+ | | ||
+ | ' | ||
+ | '扩展字段': | ||
} | } | ||
</ | </ | ||
- | 3. Example result: | + | **Paragraph** |
- | < | + | < |
{ | { | ||
- | ' | + | other_texts: { |
- | | + | |
- | | + | |
- | ' | + | }, |
- | ' | + | ... |
- | ' | + | } |
- | ' | + | </ |
- | ' | + | |
- | ' | + | **File** |
- | ' | + | |
- | ' | + | < |
- | ' | + | { |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | }], | + | |
- | ' | + | |
} | } | ||
+ | } | ||
</ | </ | ||
+ | |||
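+ | If there are no other languages to include and no other information that needs to be recorded in the extension field ("扩展字段"), the convention is to fill the extension field with {} so that json.loads does not fail. | ||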
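To illustrate the convention, here is a minimal sketch in which the extension field is stored as a JSON string and defaults to `{}`. The helper names `encode_extension` and `decode_extension`, the `ensure_ascii=False` argument, and the example content under `other_texts` are assumptions made for illustration; the source only states that `{}` should be used so that `json.loads` succeeds.

```python
import json
from typing import Optional


def encode_extension(extra: Optional[dict] = None) -> str:
    """Serialize the extension field to a JSON string (hypothetical helper).

    With nothing to record, this yields "{}" rather than an empty string,
    so json.loads can always parse the result.
    """
    # ensure_ascii=False keeps non-ASCII text readable; this is a choice
    # made in this sketch, not something the format description states.
    return json.dumps(extra or {}, ensure_ascii=False)


def decode_extension(raw: str) -> dict:
    """Parse the extension field back into a dict."""
    return json.loads(raw)


empty = encode_extension()
print(empty)                    # the agreed placeholder
print(decode_extension(empty))  # parses without error

# Hypothetical payload; the real layout of other_texts is truncated above.
payload = {"other_texts": {"fr": "Bonjour"}}
print(decode_extension(encode_extension(payload)) == payload)
```

An empty string in this field would make `json.loads` raise `json.JSONDecodeError`, which is the failure the `{}` convention avoids.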
+ | |||
+ | 3. A sample corpus record (note: the extension field is produced directly with json.dumps(obj, | ||
+ | |||
+ | < | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | }, | ||
+ | }", | ||
+ | " | ||
+ | " | ||
+ | } | ||
+ | ], | ||
+ | " | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | \" | ||
+ | } | ||
+ | }", | ||
+ | " | ||
+ | } | ||
+ | </ | ||
+ |
现有语料格式.1706867421.txt.gz · Last modified: 2024/02/02 17:50 by the MNBVC project team (MNBVC项目组)