Unlimited-OCR: One-Shot Long-Horizon Parsing
Welcome to the era of one-shot long-horizon parsing.

Release

[2026/06/23] 📄 Our paper is now available on arXiv.
[2026/06/23] 🤝 Thanks to the ModelScope community for its support. Our model is now available on ModelScope.
[2026/06/22] 🚀 We introduce Unlimited-OCR, which aims to push DeepSeek-OCR to the next level.

Run inference with Hugging Face transformers on NVIDIA GPUs. Tested requirements (Python 3.12.3 + CUDA 12.9):

    torch==2.10.0
    torchvision==0.25.0
    transformers==4.57.1
    Pillow==12.1.1
    matplotlib==3.10.8
    einops==0.8.2
    addict==2.4.0
    easydict==1.13
    pymupdf==1.27.2.2
    psutil==7.2.2

    import os
    import tempfile

    import fitz  # PyMuPDF
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = 'baidu/Unlimited-OCR'
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True,
                                      use_safetensors=True, torch_dtype=torch.bfloat16)
    model = model.eval().cuda()

    # ── Single image: two supported configs, gundam or base ──
    # gundam: base_size=1024, image_size=640,  crop_mode=True
    # base:   base_size=1024, image_size=1024, crop_mode=False
    model.infer(tokenizer,
                prompt='<image>文档解析。',
                image_file='your_image.jpg',
                output_path='your/output/dir',
                base_size=1024, image_size=640, crop_mode=True,
                max_length=32768,
                no_repeat_ngram_size=35, ngram_window=128,
                save_results=True)

    # ── Multi-page / PDF: base only (image_size=1024) ──
    model.infer_multi(tokenizer,
                      prompt='<image>多页解析。',
                      image_files=['page1.png', 'page2.png', 'page3.png'],
                      output_path='your/output/dir',
                      image_size=1024,
                      max_length=32768,
                      no_repeat_ngram_size=35, ngram_window=1024,
                      save_results=True)

    # ── PDF: convert pages to images, then run multi-page parsing ──
    def pdf_to_images(pdf_path, dpi=300):
        """Render every page of a PDF to a PNG in a temp dir; return the paths."""
        doc = fitz.open(pdf_path)
        tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
        mat = fitz.Matrix(dpi / 72, dpi / 72)  # PDF user space is 72 units per inch
        paths = []
        for i, page in enumerate(doc):
            out = os.path.join(tmp_dir, f'page_{i + 1:04d}.png')
            page.get_pixmap(matrix=mat).save(out)
            paths.append(out)
        doc.close()
        return paths

    model.infer_multi(tokenizer,
                      prompt='<image>多页解析。',
                      image_files=pdf_to_images('your_doc.pdf', dpi=300),
                      output_path='your/output/dir',
                      image_size=1024,
                      max_length=32768,
                      no_repeat_ngram_size=35, ngram_window=1024,
                      save_results=True)

SGLang
Set up the environment (a uv-managed virtual environment). Install the local SGLang wheel first, then pin kernels and install PyMuPDF for PDF-to-image conversion:

    uv venv --python 3.12
    source .venv/bin/activate
    uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
    uv pip install kernels==0.11.7
    uv pip install pymupdf==1.27.2.2

Launch the SGLang server:

    python -m sglang.launch_server \
        --model baidu/Unlimited-OCR \
        --served-model-name Unlimited-OCR \
        --attention-backend fa3 \
        --page-size 1 \
        --mem-fraction-static 0.8 \
        --context-length 32768 \
        --enable-custom-logit-processor \
        --disable-overlap-schedule \
        --skip-server-warmup \
        --host 0.0.0.0 \
        --port 10000

Send a streaming request to the OpenAI-compatible API:

    import base64
    import json
    import os
    import tempfile

    import fitz  # PyMuPDF
    import requests
    from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

    server_url = "http://127.0.0.1:10000"
    session = requests.Session()
    session.trust_env = False  # ignore proxy settings from the environment

    def pdf_to_images(pdf_path, dpi=300):
        """Render every page of a PDF to a PNG in a temp dir; return the paths."""
        doc = fitz.open(pdf_path)
        tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
        mat = fitz.Matrix(dpi / 72, dpi / 72)
        image_paths = []
        for i, page in enumerate(doc):
            image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
            page.get_pixmap(matrix=mat).save(image_path)
            image_paths.append(image_path)
        doc.close()
        return image_paths

    def encode_image(image_path):
        """Wrap an image file as a base64 data-URL content part."""
        ext = os.path.splitext(image_path)[1].lower()
        mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
        with open(image_path, "rb") as f:
            data = base64.b64encode(f.read()).decode("utf-8")
        # A data URL must not contain spaces after "data:" or "base64,".
        return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}

    def build_content(prompt, image_paths):
        """Text prompt first, followed by one image part per page."""
        return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]

    def generate(prompt, image_paths, image_mode, ngram_window):
        payload = {
            "model": "Unlimited-OCR",
            "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
            "temperature": 0,
            "skip_special_tokens": False,
            "images_config": {"image_mode": image_mode},
            "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
            "custom_params": {"ngram_size": 35, "window_size": ngram_window},
            "stream": True,
        }
        # POST to the OpenAI-compatible chat completions route; stream=True makes
        # requests deliver the response body incrementally.
        response = session.post(f"{server_url}/v1/chat/completions", json=payload, stream=True)
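Because the payload sets `"stream": True`, the reply arrives as server-sent events, not as a single JSON body. The following is a minimal sketch of how such a stream could be turned back into text. It assumes the usual OpenAI-style streaming format: lines prefixed with `data:`, JSON chunks that carry the text in `choices[0].delta.content`, and a final `data: [DONE]` marker. The helper name `iter_stream_text` and the sample events are hypothetical and are not part of the original README.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Offline demonstration with hand-written events (assumed format, no server needed).
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "# Title"}}]}',
    b'data: {"choices": [{"delta": {"content": "\\nBody"}}]}',
    b"data: [DONE]",
]
result = "".join(iter_stream_text(sample))
print(result)
```

Against a live server, the same helper could be fed `response.iter_lines()` from a `requests` response opened with `stream=True`, printing each fragment as it arrives. Keeping the parser separate from the HTTP call makes it easy to check without a GPU or a running server.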