When I call the /free API endpoint, it invokes the free_memory() function, which moves the model from VRAM back into RAM.
With a 20GB model, this takes about 8 seconds. What is the bottleneck here? RAM (DDR4), VRAM and PCIe 4.0 are all much faster than this, and I see no CPU activity during the operation.
Freeing VRAM should be possible in a split second, but how?
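To put a number on "much faster than this", here is a quick back-of-envelope check in Python. It assumes the GPU sits on a full PCIe 4.0 x16 link and treats "20GB" as decimal gigabytes. The nominal link rate is 16 GT/s per lane with 128b/130b encoding. Real-world copies never reach the nominal figure, so this only shows the size of the gap, not what the transfer should actually achieve.

```python
def effective_gb_per_s(size_gb: float, seconds: float) -> float:
    """Throughput actually achieved by the unload."""
    return size_gb / seconds


def pcie4_x16_gb_per_s() -> float:
    """Nominal one-direction PCIe 4.0 x16 bandwidth in GB/s."""
    gt_per_s_per_lane = 16.0  # PCIe 4.0 signalling rate per lane
    encoding = 128 / 130      # 128b/130b line coding overhead
    lanes = 16                # assumption: the GPU has a full x16 link
    return gt_per_s_per_lane * encoding * lanes / 8  # bits -> bytes


observed = effective_gb_per_s(20, 8)
nominal = pcie4_x16_gb_per_s()
print(f"observed {observed:.1f} GB/s vs nominal {nominal:.1f} GB/s "
      f"({observed / nominal:.0%} of the link)")
```

This prints `observed 2.5 GB/s vs nominal 31.5 GB/s (8% of the link)`, so the bus itself is very unlikely to be the limiting factor.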
Thanks. In case anyone else comes across this topic: the --high-vram flag seems to fix it, because it keeps models from being offloaded into RAM. Models in VRAM now unload in about 1 second instead of 8 or more. Works for me.
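The explanation above (offloading copies the weights into RAM, while skipping the offload avoids that copy) can be checked with a small PyTorch benchmark. This is a sketch under stated assumptions. It needs PyTorch with a CUDA GPU, and it uses a hypothetical stand-in model built from fp16 Linear layers. The names make_model, model_bytes, benchmark and HIDDEN are illustrative and not part of the tool being discussed. It does not reproduce what the flag does internally. It only contrasts `model.to("cpu")`, which copies every tensor over PCIe into host memory, with deleting the model and calling `torch.cuda.empty_cache()`, which copies nothing.

```python
import gc
import time

try:
    import torch
except ImportError:  # keeps the file importable where PyTorch is absent
    torch = None

HIDDEN = 4096  # hypothetical layer width for the stand-in model


def model_bytes(n_layers, hidden=HIDDEN, bytes_per_elem=2):
    """Size of the stand-in model: n fp16 Linear layers (weight + bias)."""
    return n_layers * (hidden * hidden + hidden) * bytes_per_elem


def make_model(n_layers, device="cuda"):
    """Hypothetical stand-in for a real checkpoint, allocated directly on the GPU."""
    layers = [
        torch.nn.Linear(HIDDEN, HIDDEN, device=device, dtype=torch.float16)
        for _ in range(n_layers)
    ]
    return torch.nn.Sequential(*layers)


def benchmark(n_layers=64):
    """Time 'offload to RAM' against 'drop and release VRAM' for the same model."""
    results = {"size_gib": model_bytes(n_layers) / 2**30}

    # Strategy 1: offload. Every parameter is copied over PCIe into host RAM.
    model = make_model(n_layers)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.to("cpu")
    torch.cuda.synchronize()
    results["offload_s"] = time.perf_counter() - start
    del model
    gc.collect()

    # Strategy 2: drop. Nothing is copied; the GPU tensors are simply released.
    model = make_model(n_layers)
    torch.cuda.synchronize()
    start = time.perf_counter()
    del model
    gc.collect()
    torch.cuda.empty_cache()  # hand cached blocks back to the driver
    torch.cuda.synchronize()
    results["drop_s"] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    if torch is not None and torch.cuda.is_available():
        print(benchmark())
    else:
        print("PyTorch with CUDA is required to run this benchmark")
```

With the default 64 layers the stand-in is about 2 GiB. If offload_s is much larger than drop_s, the cost lies in the device-to-host copy, not in releasing VRAM.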
Still, I believe the whole SSD → RAM → VRAM transfer could be faster, though maybe Python is not the best tool for this.