Create a web app with a Python backend for switching LLM models.
I have an AI server running llama.cpp on 3 GTX 1080 GPUs. The VRAM is
only enough to run a small model. I have downloaded some small models
and created llama.cpp parameters for them. This web app will let me
change which model llama.cpp runs without SSHing into the AI server
and manually changing the settings. It works as follows:
* The model files (.gguf) are in the `/mnt/data/models/llm` directory.
* The directory `/etc/llama.cpp.d` stores the llama.cpp parameters for
  these models, one file per model. Each file contains the environment
  variables that the systemd service file reads and substitutes into
  the llama.cpp command line. For example, the file
  `/etc/llama.cpp.d/mistral-venice-edition.conf` contains the
  parameters for the `Dolphin-Mistral-24B-Venice-Edition-Q5_K_M.gguf`
  model (the path of the model file is one of the parameters in the
  conf file).
* `/etc/llama.cpp.conf` is a symbolic link to one of those conf files.
  The systemd service reads this file for its parameters, so whichever
  file the link points to is the current configuration (see the sketch
  after this list for how the backend might read this layout).
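To make the layout concrete, here is a minimal sketch of how the backend could list the models and detect the active one. The paths come from the layout above; the constants and function names are placeholders I made up for this description, not existing code.

```python
# Illustrative sketch only: paths match the layout described above, but the
# constants and function names are placeholders, not an existing API.
import os

CONF_DIR = "/etc/llama.cpp.d"        # per-model parameter files
ACTIVE_CONF = "/etc/llama.cpp.conf"  # symlink to the currently active conf

def listModels():
    """Model names are just the conf file names without the .conf suffix."""
    return sorted(
        os.path.splitext(entry)[0]
        for entry in os.listdir(CONF_DIR)
        if entry.endswith(".conf")
    )

def currentModel():
    """Resolve the symlink to see which conf file is active right now."""
    target = os.path.realpath(ACTIVE_CONF)
    return os.path.splitext(os.path.basename(target))[0]
```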
The web app you are creating will show a list of all the models (just
use the conf file names as the model names). I choose a model and
click a button; the backend then relinks `/etc/llama.cpp.conf` to the
chosen conf file and restarts the llama.cpp service. That's it.
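As a rough illustration of what the switch action would do, here is a sketch of one possible backend route. It assumes Flask, passwordless `sudo` for `systemctl`, and a unit named `llama.cpp.service`; all of those are my placeholder assumptions for the plan, not fixed requirements.

```python
# Rough sketch of the switch action. Flask, passwordless sudo for systemctl,
# and the unit name llama.cpp.service are placeholder assumptions.
import os
import subprocess

from flask import Flask, jsonify

APP = Flask(__name__)
CONF_DIR = "/etc/llama.cpp.d"
ACTIVE_CONF = "/etc/llama.cpp.conf"

@APP.route("/switch/<model_name>", methods=["POST"])
def switchModel(model_name):
    conf_path = os.path.join(CONF_DIR, model_name + ".conf")
    if not os.path.isfile(conf_path):
        return jsonify(error="unknown model"), 404
    # Create the new symlink beside the old one, then rename it into place,
    # so the active conf is swapped without a window where the link is gone.
    tmp_link = ACTIVE_CONF + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(conf_path, tmp_link)
    os.replace(tmp_link, ACTIVE_CONF)
    subprocess.run(
        ["sudo", "systemctl", "restart", "llama.cpp.service"], check=True
    )
    return jsonify(active=model_name)
```

The rename-over-the-old-link step is only there to keep the switch atomic; simplify it if that doesn't matter to you.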
Put all the files in the `model-switcher` sub-directory.
Naming style (see the short example after this list):
* Use `CapitalizedCase` for classes.
* Use `snake_case` for local variables.
* Use `UPPER_CASE` for global constants.
* Use `camelCase` for functions.
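A tiny illustration of those conventions, with made-up names:

```python
# Naming-style illustration only; none of these names are part of the app.
MAX_RETRIES = 3                      # UPPER_CASE global constant

class ModelSwitcher:                 # CapitalizedCase class
    def restartService(self):        # camelCase function/method
        retry_count = 0              # snake_case local variable
        return retry_count < MAX_RETRIES
```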
Give me a plan of how you would implement this. Don’t edit anything
yet.