As others have said, you want RAG (retrieval-augmented generation).
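
For anyone who hasn't built one: the RAG loop is just embed your docs, retrieve the nearest chunks for a query, and paste them into the prompt. A minimal sketch using sentence-transformers (the model name, documents, and prompt wording are placeholders, nothing h2ogpt-specific):

    # pip install sentence-transformers numpy
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["The T4 has 16 GB of VRAM.",
            "h2oGPT can ingest PDFs, docx files and YouTube transcripts."]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=2):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q        # cosine similarity, since vectors are normalized
        return [docs[i] for i in np.argsort(-scores)[:k]]

    question = "How much memory does a T4 have?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # `prompt` then goes to whatever LLM backend you run (see the vLLM example further down)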

The most feature-complete implementation I've seen is h2ogpt[0] (not affiliated).

The code is kind of a mess (most of the logic lives in a ~8,000-line Python file), but it supports ingestion of everything from YouTube videos to docx, pdf, etc., either offline or from the web interface. It uses langchain and a ton of additional open-source libraries under the hood. It can run directly on Linux, via Docker, or with one-click installers for Mac and Windows.
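
If you only need the ingestion side and not the whole app, the plain langchain building blocks it wraps look roughly like this. Class and package names move around between langchain releases, so treat it as a sketch rather than h2ogpt's actual code:

    # pip install langchain-community langchain-text-splitters pypdf
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    pages = PyPDFLoader("report.pdf").load()     # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)     # these get embedded into the vector store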

It has various model-hosting backends built in (transformers, exllama, llama.cpp), as well as support for model-serving frameworks like vLLM and HF TGI, or just the OpenAI API.
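
In practice most of those backends sit behind an OpenAI-compatible HTTP endpoint, so the client side looks the same regardless. A rough example against a local vLLM server (the model name and port are just examples, and newer vLLM versions expose the server as `vllm serve` instead):

    # start the server, e.g.:
    #   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
    # pip install openai
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": "What is RAG in one sentence?"}],  # or a RAG prompt as above
    )
    print(resp.choices[0].message.content)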

You can also define your preferred embedding model along with various other parameters, but I've found the out-of-the-box defaults to be pretty sane and usable.

[0] - https://github.com/h2oai/h2ogpt



I just tried installing this on a fresh GCP instance with an NVIDIA T4 GPU, and let's just say it was non-trivial. The CPU version on my Mac installed mostly without issues and worked pretty well.



