
Contents

Multimodal Group Goals
Group Results
Group Tasks
Group Results
Other

Multimodal Group Goals

Convert web pages, PDFs, Word documents, and other sources that combine images and text into a multimodal corpus.

Group Results

Onboarding guide for new members: https://v61g3vcxy7.feishu.cn/wiki/G0OAwqhA2iNYGrkFOPUc2CpvnIh?from=from_copylink

Group internal wiki (Feishu): https://v61g3vcxy7.feishu.cn/wiki/H8D1wqyIXim3wcktxgqc0IIlnAf

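Group Tasks

1. Extract text from PDFs written purely in Chinese or English to build a plain-text dataset

2. Extract academic papers into a multimodal dataset

3. Extract content from complex PDFs

4. Cover more multimodal data (text, audio, video, etc.)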
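For the first task, one simple way to decide whether extracted text counts as "purely Chinese or English" is to measure how many of its characters fall in ASCII or common CJK ranges. The sketch below is only an illustration under that assumption and is not the group's actual classifier; the function names, the Unicode ranges chosen, and the 0.95 threshold are all hypothetical.

```python
def zh_en_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are ASCII,
    CJK unified ideographs, CJK punctuation, or fullwidth forms."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def accepted(c: str) -> bool:
        return (
            c.isascii()
            or "\u4e00" <= c <= "\u9fff"   # CJK unified ideographs
            or "\u3000" <= c <= "\u303f"   # CJK symbols and punctuation
            or "\uff00" <= c <= "\uffef"   # halfwidth and fullwidth forms
        )

    return sum(accepted(c) for c in chars) / len(chars)


def is_zh_en_text(text: str, threshold: float = 0.95) -> bool:
    """Hypothetical filter: keep text whose Chinese/English share
    reaches the (assumed) threshold."""
    return zh_en_ratio(text) >= threshold
```

A filter like this would run on text that has already been extracted from a PDF page; how the text is extracted is outside the scope of the sketch.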

Group Results

Scan PDF folders and sample from them: https://github.com/wanng-ide/scan_copy_pdfs_mnbvc

PDF language classifier: https://github.com/Lu-Tan/pdf_CN_EN_filter_mnbvc

ChinaXiv crawler: https://github.com/wyzhangyuhan/chinaxivCrawler_mnbvc

arXiv crawler: https://github.com/wanng-ide/arxivSpider_mnbvc

arXiv TeX extraction: https://github.com/wanng-ide/arxiv_tex_mnbvc

LDA classification of PDF metadata: https://github.com/FantasticCode2019/pdf_lda_mnbvc

PDF tools: https://github.com/akira-l/pdf-tools

PDF metadata extraction: https://github.com/MIracleyin/pdf_meta_data_mnbvc

PDF size classifier: https://github.com/MIracleyin/pdf_size_mnbvc

Multimodal doc processing framework: https://github.com/MIracleyin/mmdp_mnbvc

PDF multimodal analysis: https://github.com/MIracleyin/mmda_mnbvc

Other

Paper reading notes: https://v61g3vcxy7.feishu.cn/wiki/MykAw1S15iqA5jkETKxc99BJnbf