Crashing with "CUDA kernel errors might be asynchronously reported at some other API call" since updating desktop to 0.3.71

Hi,

Since updating my desktop version from 0.3.67 to 0.3.71, my workflow crashes when I run batches. I used to set the batch count to 16 and come back later, but now it crashes with the error below.

It can crash at any point in the workflow. This has never happened before the update.

Can you tell anything from my error? Or is it possible to revert to 0.3.67?

Requested to load WAN21
loaded partially; 8793.59 MB usable, 8788.37 MB loaded, 548.82 MB offloaded, lowvram patches: 0
Attempting to release mmap (39)
100%|██████████| 3/3 [01:04<00:00, 21.41s/it]
Requested to load WanVAE
loaded completely; 2538.90 MB usable, 242.03 MB loaded, full load: True
!!! Exception during processing !!! CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "C:\ComfyUI\resources\ComfyUI\execution.py", line 510, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\resources\ComfyUI\execution.py", line 324, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\resources\ComfyUI\execution.py", line 298, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\ComfyUI\resources\ComfyUI\execution.py", line 286, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "C:\ComfyUI\resources\ComfyUI\comfy_api\internal\__init__.py", line 149, in wrapped_func
return method(locked_class, **inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\resources\ComfyUI\comfy_api\latest\_io.py", line 1275, in EXECUTE_NORMALIZED
to_return = cls.execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\resources\ComfyUI\comfy_extras\nodes_upscale_model.py", line 92, in execute
upscale_model.to("cpu")
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\spandrel\__helpers\model_descriptor.py", line 331, in to
self.model.to(device=device, dtype=dtype)
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1369, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
module._apply(fn)
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
module._apply(fn)
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 955, in _apply
param_applied = fn(param)
^^^^^^^^^
File "C:\Users\xxx\Documents\ComfyUI\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1355, in convert
return t.to(
^^^^^
torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
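As the message above suggests, setting `CUDA_LAUNCH_BLOCKING=1` makes CUDA calls synchronous, so the traceback points at the call that actually failed instead of a later, unrelated one. A minimal sketch (the variable must be in the environment before torch initializes CUDA, so it has to be set before `import torch` runs, or in the shell before launching ComfyUI):

```python
import os

# Must be set before torch is imported; once torch has initialized CUDA,
# changing this variable has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

From a Windows command prompt you can instead run `set CUDA_LAUNCH_BLOCKING=1` before starting ComfyUI, which avoids editing any code.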

I uninstalled and used my old installer to reinstall 0.3.67. Using the exact same custom nodes as before (I just moved the folders I needed back), it works perfectly again.

I tried the portable 0.3.71, and it did the exact same thing with the same CUDA error. I can't use the batch mode built into ComfyUI in either 0.3.71 build; it crashes with this same error. They both also do this weird thing where the top left-hand corner says "running in another tab" and you can't see anything happening, no progress bar; you have to restart to see anything happen.

I'm scared to update now. I don't think I can ever move on from 0.3.67, since batches with ComfyUI's built-in batch mode don't work anymore. I guess I could try a batch node?

I'm very new at this; I've been using Comfy for about 2 months.

I'm having the same issues since the last update, but it only reports this error in rare cases. I suspect it doesn't always do so, and we might get the red "disconnected" pop-up instead. I believe the devs need to reproduce the error themselves to look into it. Wait and see, fingers crossed.

I'm getting the exact same thing when I can catch it. Same workflow, models, inputs, and everything as the prior version, just with random crashes that sometimes surface as this error, other times as OOM, and most times as a dead crash with "Disconnected".

I thought I had a lead: if you close and reopen ComfyUI, the previously closed instance stays in your process list, so if you close and open it 10 times, you'll have 10 copies of ComfyUI running. I thought killing all those leftover processes fixed it, but the issue came back; no idea why. Comfy is unstable enough that it's hard to pinpoint what's going wrong. Try killing the extra processes, and/or tell us if you also have extra processes running after you close Comfy.
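To spot the leftover processes described above without opening Task Manager, a rough cross-platform sketch (matching on the name "python" is an assumption; a desktop install's processes may be named differently, e.g. after the launcher executable):

```python
import subprocess
import sys

def list_python_processes():
    """Return lines of the system process list that mention python."""
    if sys.platform == "win32":
        # Windows: tasklist prints one row per running process.
        out = subprocess.run(["tasklist"], capture_output=True, text=True).stdout
    else:
        # POSIX: pid and command name for every process, no header rows.
        out = subprocess.run(["ps", "-A", "-o", "pid=,comm="],
                             capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if "python" in line.lower()]

procs = list_python_processes()
print(f"{len(procs)} python process(es) running")
```

If you see more entries than the ComfyUI instances you actually launched, the extras can be ended from Task Manager or with `taskkill /PID <pid> /F` on Windows.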

I noticed that too; it's like the Python process was still running in the Windows Task Manager. Last night I switched over to a manual install, and while the zombie Python problem cleared up, I was still getting random crashes on version 0.3.76.

I switched over to 0.3.77 (which I think is the dev release), and while it's kind of shaky, it seems a bit more stable with respect to the random CUDA API / OOM errors. Although now I'm running into an issue where my generations are basically spitting out mannequins with camera movement instead of the normal full motion I was getting in previous versions. But at least it looks like they're working on it! >_<

Yeah, I also noticed that; something's screwing up the models or something. With Wan 2.2, instead of having normal motion, mine just fades away into noise and remains mostly static. Then I either restart the computer or find the zombies in the process list, and that fixes it. Anyway, Comfy is cool, but we have to keep in mind it's cutting-edge tech they're putting together, so there will be times like this, unfortunately.