{
"id":"82b2834abe2ed41a26b6b06317114f8f",
"问":"写一个超短小说",
"答":"他们相遇,又别离。岁月如梭,情感却不减。",
"来源":"ShareGPT",
"元数据":{
"create_time":"20230511 15:56:03",
"问题明细":"\"from\": \"human\"",
"回答明细":"\"from\": \"gpt\"",
"扩展字段":"{\"会话\": 1, \"多轮序号\": 1, \"解析模型\": gpt4}"
}
}
| KEY | VALUE说明 |
|---|---|
| id | 每一对问答的唯一标识,使用json串的md5作为唯一标识id |
| 问 | 问的文本 |
| 答 | 答的文本 |
| 来源 | 固定为’ShareGPT’ |
| 元数据 | 包含创建时间、问题明细、回答明细、扩展字段 |
| KEY | VALUE说明 |
|---|---|
| create_time | 问答创建时间,格式为%Y%m%d %H:%M:%S |
| 问题明细 | 原始语料中问的来源,例如 “from”: “human” |
| 回答明细 | 原始语料中答的来源,例如 “from”: “gpt” |
| 扩展字段 | 包含会话的唯一标识和本条在会话中的序号,以及解析模型 |
| KEY | VALUE说明 |
|---|---|
| 会话 | 会话的唯一标识,例如 “会话”: “yOKd88p” |
| 多轮序号 | 本条在会话中的序号,例如 “多轮序号”: 1 |
| 解析模型 | 用于标识原始语料的来源,例如 “解析模型”: “gpt4” |
| 其他。。等等 | 语料中问答相关的其他补充信息字段 |
git clone ShareGPTQAExtractor-mnbvc
<HTML><ol start=“2” style=“list-style-type: decimal;”></HTML> <HTML><li></HTML>进入目录并安装依赖<HTML></li></HTML><HTML></ol></HTML>
cd ShareGPTQAExtractor-mnbvc pip install -r requirements.txt
通过以下命令将FILE文件转化并输出到以ShareGPT为开头的结果文件中。
python sharegpt_extract.py FILE -m MODEL
以上命令将输出时间戳结果文件例子shareGPT_gpt4_2023-09-17-00-21-06.jsonl。 模型和原始语料的对应关系如下:
| MODEL | 对应原始语料 |
|---|---|
| multilang | scryscan/multilingual-share |
| vicuna | anon8231489123/ShareGPT_Vicuna_unfiltered |
| gpt4 | Ejafa/GPT_4_with_ShareGPT |
| common_en/common_zh | shareAI/ShareGPT-Chinese-English-90k |
| baiduzhidao | shareGPT的中文问答 |
sharegpt_extract.py 入口程序schema.py 输出json模板 {
"id": "yOKd88p",
"conversations": [
{
"from": "human",
"value": "Can you make me a Shakespearean script about a girl who has tummy troubles and can\u2019t fart not matter how hard she tries- so they think she is a witch"
},
{
"from": "gpt",
"value": "Sure, here's a Shakespearean script about a girl who c..."
},
{
"from": "human",
"value": "Can you change Mary\u2019s name to Katy"
},
{
"from": "gpt",
"value": "Certainly! Here's the revised script:\n\nAct I, Scene I\n\nEnter KATY,..."
}
]
}
{
"id": "82b2834abe2ed41a26b6b06317114f8f",
"问": "Can you make me a Shakespearean script about a girl who has tummy troubles and can\u2019t fart not matter how hard she tries- so they think she is a witch",
"答": "Sure, here's a Shakespearean script about a girl who c...",
"来源": "ShareGPT",
"元数据": {
"create_time": "20230517 10:41:58",
"问题明细":"\"from\": \"human\"",
"回答明细":"\"from\": \"gpt\"",
"扩展字段": {
"会话": "yOKd88p",
"多轮序号": 1,
"解析模型": "gpt4"
}
}
}
{
"id": "82b2834abe2ed41a26bbbbbbbbbbbbbbbb",
"问": "Can you change Mary\u2019s name to Katy",
"答": "Certainly! Here's the revised script:\n\nAct I, Scene I\n\nEnter KATY,...",
"来源": "ShareGPT",
"元数据": {
"create_time": "20230517 10:41:58",
"问题明细":"\"from\": \"human\"",
"回答明细":"\"from\": \"gpt\"",
"扩展字段": {
"会话": "yOKd88p",
"多轮序号": 2,
"解析模型": "gpt4"
}
}
},
上面的格式方便查看,最终输出到文件仍然为jsonl的规范,如下:
{"id": "82b...", "问": "Can you make me a ...", "答": "Sure...", "来源": "ShareGPT", "元数据": {"create_time": "20230517 10:41:58",...}}
{"id": "82b...", "问": "Can you make me a ...", "答": "Sure...", "来源": "ShareGPT", "元数据": {"create_time": "20230517 10:41:58",...}}