4.2 Extracting Papers with MinerU and Preparing Them for RAG
magic-pdf
Reference
Installing dependencies
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/ -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "magic-pdf[full]==0.6.2b1" -i https://mirrors.aliyun.com/pypi/simple/
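Downloading the models
import os
# Point the Hugging Face client at the hf-mirror.com mirror
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# Download the models (the child process inherits the environment variable set above)
os.system('huggingface-cli download --resume-download wanderkid/PDF-Extract-Kit --local-dir /root/pro/model/PDF-Extract-Kit')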
Source download
https://github.com/opendatalab/MinerU.git
Configuration file
cp magic-pdf.template.json ~/magic-pdf.json
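After the template has been copied to ~/magic-pdf.json, it usually still has to be pointed at the downloaded models. The following is a minimal sketch of doing that from Python. The key names "models-dir" and "device-mode" are assumptions based on the template shipped with this release and should be checked against your own magic-pdf.template.json; the helper name update_magic_pdf_config is hypothetical.

```python
import json
from pathlib import Path


def update_magic_pdf_config(config_path, models_dir, device_mode="cpu"):
    """Load a JSON config, set the model directory and device, and save it back."""
    path = Path(config_path).expanduser()
    # Start from the existing file if there is one, so other keys are preserved.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Assumed key names; verify them against magic-pdf.template.json.
    config["models-dir"] = str(models_dir)
    config["device-mode"] = device_mode
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For the paths used in these notes, a call might look like update_magic_pdf_config("~/magic-pdf.json", "/root/pro/model/PDF-Extract-Kit", device_mode="cuda"). Adjust the model path if the weights live in a subdirectory of the download.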
GPU support
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
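After reinstalling the CUDA 11.8 (cu118) builds, it is worth confirming that PyTorch can actually see the GPU before running a long extraction job. A small check, written so that it also works when torch is not installed (the function name cuda_status is only illustrative):

```python
def cuda_status():
    """Report whether torch is importable and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "torch_version": None}
    return {
        "torch_installed": True,
        "cuda_available": bool(torch.cuda.is_available()),
        "torch_version": torch.__version__,
    }


print(cuda_status())
```

If cuda_available is False even though a GPU is present, the installed wheel is probably a CPU build or the driver does not match the CUDA version of the wheel.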
Running
magic-pdf pdf-command --pdf "/root/pro/files" --inside_model true
magic-pdf pdf-command --pdf "/root/pro/files/SUAN0.3.pdf" --inside_model true
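These notes stop at extraction and do not describe the RAG step itself. As a rough sketch of what could come next, the Markdown that magic-pdf produces can be split into chunks before embedding. The splitter below is an assumption about a reasonable approach, not part of MinerU: it cuts at Markdown headings first and then at paragraph boundaries once a chunk would exceed max_chars. The function name split_markdown and the default size are hypothetical.

```python
import re


def split_markdown(text, max_chars=1000):
    """Split Markdown into chunks at headings, then cap chunk size by paragraph."""
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                # Adding this paragraph would overflow: close the current chunk.
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    # A single paragraph longer than max_chars is kept whole, not cut mid-text.
    return chunks
```

Each returned chunk can then be passed to whatever embedding model and vector store the RAG pipeline uses. Note that a line beginning with "# " inside a fenced code block would also be treated as a heading, so extracted papers with many code listings may need a more careful splitter.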
magic-doc
https://github.com/opendatalab/magic-doc
magic-html
https://github.com/opendatalab/magic-html