USMLE Clinical Inference Engine (Llama-3.1-8B-Instruct)

Retrieval Architecture: hybrid sparse/dense retrieval (BM25 + BGE-Small) with candidates reranked by a MiniLM cross-encoder.
Inference Backend: Llama-3.1-8B-Instruct with a USMLE-finetuned LoRA adapter merged into the base weights, then quantized to W4A16 with AWQ.
Deployment: serverless A10G GPU.
Execution: vLLM continuous batching with eager execution enforced (CUDA graph capture disabled, reducing cold-start time at some latency cost).
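The retrieval stage above can be sketched end to end. This is a minimal, library-free illustration of the control flow, not the production stack: toy scorers stand in for the real components (a real deployment would use a BM25 library or Lucene for the sparse channel, BGE-Small embeddings for the dense channel, and a MiniLM cross-encoder for reranking). The documents, query, and fusion weight `alpha` are hypothetical.

```python
import math
from collections import Counter

DOCS = [
    "metformin is first-line therapy for type 2 diabetes",
    "beta blockers reduce mortality after myocardial infarction",
    "metformin can cause lactic acidosis in renal failure",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse channel: Okapi BM25 over whitespace tokens."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for q in query.split():
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def dense_scores(query, docs):
    """Dense channel stand-in: cosine similarity over bag-of-words vectors
    (the real system embeds query and documents with BGE-Small)."""
    def vec(text):
        return Counter(text.split())
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    qv = vec(query)
    return [cosine(qv, vec(d)) for d in docs]

def hybrid_retrieve(query, docs, alpha=0.5, top_k=2):
    """Fuse min-max-normalized sparse and dense scores, then take the
    top-k fused candidates. The rerank step is elided here; the real
    system rescores these candidates with a MiniLM cross-encoder on
    (query, doc) pairs before returning them."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    sparse = norm(bm25_scores(query, docs))
    dense = norm(dense_scores(query, docs))
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    ranked = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]

print(hybrid_retrieve("metformin renal failure", DOCS))
```

Fusing normalized scores from both channels lets lexical matches (drug names, eponyms) and semantic matches (paraphrased clinical findings) surface candidates the other channel would miss; the cross-encoder then does the expensive pairwise scoring only on that short list.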
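The build-and-serve path (LoRA merge, then AWQ W4A16 quantization, then vLLM serving with eager execution) can be outlined as below. Model paths and the adapter repo name are hypothetical placeholders; the `peft`, `autoawq`, and `vllm` calls follow those libraries' usual APIs, but this is an untested sketch of the pipeline, not a verified build script.

```python
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LORA_ADAPTER = "your-org/usmle-lora"     # hypothetical adapter repo
MERGED_DIR = "llama31-usmle-merged"      # hypothetical output paths
AWQ_DIR = "llama31-usmle-awq"

# W4A16 AWQ: 4-bit weights, 16-bit activations.
AWQ_CONFIG = {"w_bit": 4, "q_group_size": 128,
              "zero_point": True, "version": "GEMM"}

def merge_and_quantize():
    """Offline step: merge the LoRA adapter into the base weights,
    then AWQ-quantize the merged checkpoint."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    from awq import AutoAWQForCausalLM

    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, LORA_ADAPTER).merge_and_unload()
    merged.save_pretrained(MERGED_DIR)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.save_pretrained(MERGED_DIR)

    model = AutoAWQForCausalLM.from_pretrained(MERGED_DIR)
    model.quantize(tokenizer, quant_config=AWQ_CONFIG)
    model.save_quantized(AWQ_DIR)

def serve():
    """Serving step: load the quantized checkpoint into vLLM.
    enforce_eager=True skips CUDA graph capture, which shortens
    cold starts on a serverless A10G at some per-token latency cost;
    continuous batching is vLLM's default scheduling behavior."""
    from vllm import LLM, SamplingParams

    llm = LLM(model=AWQ_DIR, quantization="awq", enforce_eager=True)
    params = SamplingParams(temperature=0.0, max_tokens=256)
    return llm, params
```

Merging the adapter before quantization matters: quantizing the base model and then applying a LoRA adapter in 16-bit would leave the adapter path unquantized and complicate the AWQ calibration, whereas merging first gives AWQ a single dense checkpoint to calibrate.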