### Name and Version
```shell
❯ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:… no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6297 (1cf123a3)
built with clang version 19.0.0git (/srcdest/rocm-llvm d366fa84f3fdcbd4b10847ebd5db572ae12a34fb) for x86_64-pc-linux-gnu
```
The following environment variables have been set to configure `llama-server`:
```sh
export LLAMA_ARG_BATCH=2048
export LLAMA_ARG_UBATCH=2048
export LLAMA_ARG_SWA_FULL=false
export LLAMA_ARG_KV_SPLIT=false
export LLAMA_SET_ROWS=1 # for ARG_KV_SPLIT=false to work
export LLAMA_ARG_FLASH_ATTN=true
export LLAMA_ARG_MLOCK=true
export LLAMA_ARG_NO_MMAP=false
export LLAMA_ARG_N_GPU_LAYERS=999
export LLAMA_OFFLINE=false
export LLAMA_ARG_ENDPOINT_SLOTS=true
export LLAMA_ARG_ENDPOINT_PROPS=true
```
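For reference, this is roughly how the server is launched with the configuration above; the env file name is hypothetical, and the `llama-server` invocation matches the one visible at the end of the valgrind log below:
```sh
# Export the variables above (stored in a hypothetical env file), then start the server.
. ./llama-server.env
llama-server --model ~/LLMs/gpt-oss-20b.auto.gguf --jinja --temp 1.0
```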
### Operating systems
Linux
### GGML backends
HIP
### Hardware
Ryzen 9 5900X + Radeon RX 7900 XT
### Models
GPT-OSS 20B
Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf (Unsloth quant)
### Problem description & steps to reproduce
Running llama-server built with the latest ROCm version on Arch Linux (6.4.3-1 as of 2025-08-27) crashes when trying to generate a completion.
To reproduce: run GPT-OSS-20B with `llama-server` and request a completion, for example via the web UI (a curl sketch is included below). The server crashes immediately, without any additional message in non-verbose mode.
I ran `llama-server` under `valgrind` to see where the issue comes from. The crash appears to originate inside ROCm, but since it happens in destructors it may also be improperly handled destruction triggered by an underlying issue elsewhere.
I've checked whether other models also trigger this issue; Qwen3-4B works without problems.
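For reproduction without the web UI, the same request can be sent directly. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` endpoint and the port visible in the logs below (adjust to your own setup):
```sh
# Minimal reproduction sketch: ask the running server for a streamed chat completion.
# Port 51536 matches the server logs below; replace it with your own --port value.
curl http://127.0.0.1:51536/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"test"}],"stream":true}'
```
With GPT-OSS 20B loaded, the server segfaults right after the prompt batch is decoded (see the valgrind trace below).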
### First Bad Commit
Probably unrelated to a specific commit.
### Relevant log output
```shell
main: server is listening on http://0.0.0.0:51536 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET / 127.0.0.1 200
Unauthorized: Invalid API Key
srv log_server_r: request: GET /favicon.ico 127.0.0.1 401
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 68
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 68, n_tokens = 68, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 68, n_tokens = 68
==17652== Invalid read of size 8
==17652== at 0x1584F640: _M_begin (hashtable.h:440)
==17652== by 0x1584F640: begin (hashtable.h:642)
==17652== by 0x1584F640: begin (unordered_map.h:388)
==17652== by 0x1584F640: amd::device::Program::runInitFiniKernel(amd::device::Program::kernel_kind_t) const (devprogram.cpp:2958)
==17652== by 0x15882BC1: amd::Program::unload() (program.cpp:95)
==17652== by 0x15546C11: hip::FatBinaryDeviceInfo::~FatBinaryDeviceInfo() (hip_fatbin.cpp:98)
==17652== by 0x15547908: hip::FatBinaryInfo::~FatBinaryInfo() (hip_fatbin.cpp:129)
==17652== by 0x154E21B1: hip::DynCO::~DynCO() (hip_code_object.cpp:1152)
==17652== by 0x154E2355: hip::DynCO::~DynCO() (hip_code_object.cpp:1153)
==17652== by 0x1575BAC5: hip::PlatformState::loadModule(ihipModule_t**, char const*, void const*) (hip_platform.cpp:759)
==17652== by 0x15716EAE: hip::hipModuleLoadData(ihipModule_t**, void const*) (hip_module.cpp:61)
==17652== by 0xB898649: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xB899DB3: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xAE4AB6A: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xAE80C49: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== Address 0x28 is not stack'd, malloc'd or (recently) free'd
==17652==
==17652==
==17652== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==17652== Access not within mapped region at address 0x28
==17652== at 0x1584F640: _M_begin (hashtable.h:440)
==17652== by 0x1584F640: begin (hashtable.h:642)
==17652== by 0x1584F640: begin (unordered_map.h:388)
==17652== by 0x1584F640: amd::device::Program::runInitFiniKernel(amd::device::Program::kernel_kind_t) const (devprogram.cpp:2958)
==17652== by 0x15882BC1: amd::Program::unload() (program.cpp:95)
==17652== by 0x15546C11: hip::FatBinaryDeviceInfo::~FatBinaryDeviceInfo() (hip_fatbin.cpp:98)
==17652== by 0x15547908: hip::FatBinaryInfo::~FatBinaryInfo() (hip_fatbin.cpp:129)
==17652== by 0x154E21B1: hip::DynCO::~DynCO() (hip_code_object.cpp:1152)
==17652== by 0x154E2355: hip::DynCO::~DynCO() (hip_code_object.cpp:1153)
==17652== by 0x1575BAC5: hip::PlatformState::loadModule(ihipModule_t**, char const*, void const*) (hip_platform.cpp:759)
==17652== by 0x15716EAE: hip::hipModuleLoadData(ihipModule_t**, void const*) (hip_module.cpp:61)
==17652== by 0xB898649: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xB899DB3: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xAE4AB6A: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== by 0xAE80C49: ??? (in /opt/rocm/lib/librocblas.so.4.4)
==17652== If you believe this happened as a result of a stack
==17652== overflow in your program's main thread (unlikely but
==17652== possible), you can try to increase the size of the
==17652== main thread stack using the --main-stacksize= flag.
==17652== The main thread stack size used in this run was 8388608.
==17652==
==17652== HEAP SUMMARY:
==17652== in use at exit: 1,278,763,386 bytes in 994,033 blocks
==17652== total heap usage: 3,326,026 allocs, 2,331,993 frees, 32,436,809,144 bytes allocated
==17652==
==17652== LEAK SUMMARY:
==17652== definitely lost: 0 bytes in 0 blocks
==17652== indirectly lost: 0 bytes in 0 blocks
==17652== possibly lost: 97,304 bytes in 231 blocks
==17652== still reachable: 1,278,665,458 bytes in 993,801 blocks
==17652== of which reachable via heuristic:
==17652== multipleinheritance: 1,664 bytes in 20 blocks
==17652== suppressed: 624 bytes in 1 blocks
==17652== Rerun with --leak-check=full to see details of leaked memory
==17652==
==17652== For lists of detected and suppressed errors, rerun with: -s
==17652== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
[1] 17652 segmentation fault (core dumped) valgrind llama-server --model ~/LLMs/gpt-oss-20b.auto.gguf --jinja --temp 1.0
```
This is the output with the `-v` flag and without valgrind:
```
main: server is listening on http://0.0.0.0:51536 - starting the main loop
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
srv kv_cache_cle: clearing KV cache
que start_loop: waiting for new tasks
request: {"messages":[{"role":"user","content":"test"}],"stream":true,"cache_prompt":true,"reasoning_format":"none","samplers":"edkypmxt","temperature":0.8,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"typical_p":1,"xtc_probability":0,"xtc_threshold":0.1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"max_tokens":-1,"timings_per_token":false}
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: GPT-OSS
srv params_from_: Preserved token: 200005
srv params_from_: Preserved token: 200003
srv params_from_: Preserved token: 200008
srv params_from_: Preserved token: 200006
srv params_from_: Preserved token: 200007
srv add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by lru, t_last = -1
slot reset: id 0 | task -1 |
slot launch_slot_: id 0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[200003,200005,200006,200007,200008],"chat_format":"GPT-OSS","reasoning_format":"none","reasoning_in_content":false,"thinking_forced_open":false,"samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-27\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>test<|end|><|start|>assistant","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 68
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 68, n_tokens = 68, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 68, n_tokens = 68
srv update_slots: decoding batch, n_tokens = 68
clear_adapter_lora: call
set_embeddings: value = 0
```