I think I just killed my RTX 2080 and possibly mobo with cuda_memtest

Okay, so I think I might have just killed 3 GPUs and a motherboard in a series of unfortunate events.

Last night I finally got Manjaro installed after days of install issues caused by NVIDIA driver problems (see my other post on the forum for that ordeal).

So today I was thinking I should find a way to test my RTX 2080 and make sure it isn’t throwing any errors. I did a bit of research and came across a program called cuda_memtest, which sounded like just what I needed. I found it on the AUR and installed it.

Now, I’m fairly new to Linux, so this next part is going to sound dumb. I wasn’t sure whether cuda_memtest had a GUI, so I typed it into the KDE start menu (probably not the actual name for it, but it looks like the Windows Start menu). It didn’t come up until I typed the full name, and then it showed "Run cuda_memtest" with a terminal icon. I figured it must be terminal-only, but I clicked "Run cuda_memtest" anyway, expecting a terminal window to pop up. No window ever appeared. I did hear my fans ramp up quite a bit, but I couldn’t see what was happening, so I opened a terminal and ran sudo cuda_memtest. It displayed some info, but it said it was out of memory.

Unsure what to do at this point, I rebooted. My fans sounded like they were running at 100% until the PC shut down; I assume cuda_memtest was still running in the background. After the reboot I no longer got any display at all. I couldn’t even see the BIOS, and the PC seemed not to be POSTing. The troubleshooting LEDs on my motherboard showed the VGA light lit. I tried a CMOS reset and rebooted a few times, but nothing changed.

At this point I started to worry, so I decided to pull out the Radeon RX 580 I have and put that in, but I somehow managed to drop it on the floor before getting it into my PC. I don’t see any physical damage; it only fell about 2 or 3 feet. I put it in anyway to see what would happen. I got no video, and the screen kept cycling between completely off and "no signal" every 30 seconds or so. BUT the VGA troubleshooting LED on the motherboard didn’t light up, and I could tell the PC booted into Windows (my primary boot device at the moment) because I could hear the USB plug-in sound when I plugged and unplugged a device. I tried another CMOS reset and a few reboots, with no change. I still couldn’t even see the BIOS on boot.

Next I pulled out the last GPU I have available, a GTX 770 that’s been in a closet for years, to see what would happen. This time I got into Windows! Great news, right? So I immediately pulled up this forum and started typing this post to see if anyone had suggestions on how to fix the RTX 2080. While I was typing, my screen suddenly went black. I assumed Windows had detected the GPU and was doing a driver update, so I waited about 10 minutes, but the screen never came back on. When I rebooted, I could see the BIOS and everything, but when booting into either Windows or Manjaro I get no signal.

Here’s what I think my problems might be:

  1. RTX 2080 is dead because I rebooted while cuda_memtest was running
  2. My motherboard now has an issue caused by rebooting during cuda_memtest.
  3. The RX 580 is dead either because I dropped it OR it’s not showing video because my motherboard now has an issue caused by rebooting during cuda_memtest
  4. The GTX 770 is dead because my motherboard now has an issue caused by rebooting during cuda_memtest OR because it’s old and just died after being turned on for about 5 minutes?

I’m really not that concerned about the RX 580 or the GTX 770, if they are dead, whatever. I planned on selling them, but oh well.

I REALLY want to get the RTX 2080 working, and I REALLY don’t want to buy another high-end GPU right now, especially with the current shortages.

I also hope the motherboard isn’t the problem. I just built this PC 3 weeks ago with pretty much all new parts, except the 2080 and the PSU, which I carried over from my old PC.

TLDR: Rebooted while cuda_memtest was running, probably killed RTX 2080, dropped my RX 580 while trying to troubleshoot, GTX 770 might be dead too.

My PC Specs:
CPU: AMD Ryzen 7 5800x
GPU: MSI Ventus Nvidia RTX 2080 8GB
Motherboard: MSI MAG X570 TOMAHAWK WIFI
RAM: Team Group T-FORCE VULCAN Z 32GB (2 x 16GB)
PSU: 750W

Did you try booting from a USB stick?
There are many distros specialized in diagnostics, SystemRescue for instance.

I have seen a lot of things - but I have yet to see a memory test kill the graphics and the mainboard.

The only place I know of where you need to be careful is hdparm, but that has nothing to do with graphics and memory. Unless you ran a random, unverified script found on the internet as root, I highly doubt your system is fried.

But I cannot be sure. Theoretically, such a script could cause extreme overheating of every system component, including RAM, GPU and CPU, if it had been deliberately coded to disable the thermal sensors. Even in that case, one would think the firmware would kick in and shut the system down.

I would boot the system from a live USB with only a single known-good GPU installed.

I will give this a try.

I’m going to give it a try with SystemRescue. I don’t really have a known-good GPU to test with. The 2080 gives me the VGA light, the RX 580 was dropped, and the 770 hasn’t been plugged in for about 3 years.

I do work in a computer repair shop, so if I can’t get anything figured out this weekend, I can probably take my PC and GPUs into work, where I’ll have more hardware to test with.

I hope you have good news; tell us what you find.
I was always sad when I had a hardware failure; it’s like a member of your family (kind of).

Okay, I’m able to get into SystemRescue with the GTX 770 if I use the nomodeset option (a regular boot didn’t work, I’m assuming because of the NVIDIA card). I was even able to start the GUI!

I don’t think I mentioned it in my first post, but earlier I did try booting Manjaro with nomodeset while the 770 was plugged in, without any luck.

I haven’t tried the other GPUs because the 770 is the only one I can see anything on at boot; with the others I get no display at boot.

Now, I’m not really familiar with SystemRescue, and I’m not sure what I should run to test my system at this point. Do you have any suggestions on what to try now that I’m in SystemRescue?

A good start here:
https://www.system-rescue.org/System-tools/

sysrescd is built on Arch, so if you can boot sysrescd, you can boot Manjaro.

After playing around with a few things, I was able to get my Windows install working with the 770 by using a Windows install disk to do a system restore to yesterday. But after about 2 minutes, I’m pretty sure Windows tries to update the GPU driver (thanks, Microsoft), and the screen goes black and never comes back. I’m not sure if it’s the driver or maybe the GPU overheating or something. Rebooting gives me no video after it POSTs.

Thanks. I tried to run the memtester tool, but I don’t quite understand its parameters; I feel like I need to see an example of how it’s used. The man page didn’t really help me.
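For reference, memtester’s usual form is `memtester <MEMORY> [ITERATIONS]`: the first argument is how much memory to test (a B, K, M or G suffix sets the unit) and the optional second one is how many loops to run. It is usually run as root so it can lock that memory. Note that memtester tests system RAM, not the GPU’s VRAM (that is what cuda_memtest is for). Below is a minimal sketch that reads MemAvailable from /proc/meminfo and prints a memtester command that tests about 90% of the currently available RAM, leaving some headroom so the system isn’t pushed out of memory. The helper names and the 90% figure are hypothetical choices for illustration, not anything memtester itself defines.

```python
import os
import re


def mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1))


def memtester_command(available_kib: int, percent: int = 90, iterations: int = 1) -> list[str]:
    """Build a memtester command line that tests `percent` of available RAM.

    Assumption for illustration: 90% leaves enough headroom for the rest of
    the system. The size is passed with an explicit M (megabytes) suffix.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    size_mib = available_kib * percent // 100 // 1024  # KiB -> MiB, integer math
    return ["sudo", "memtester", f"{size_mib}M", str(iterations)]


if __name__ == "__main__":
    # Only prints the command; it does not run memtester.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            cmd = memtester_command(mem_available_kib(f.read()))
        print(" ".join(cmd))
```

Copy the printed command into a terminal; memtester then runs through its test patterns for the requested number of loops and reports any failures it finds.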

The rest of the tools included don’t really seem all that useful for my situation, unless I’m missing something.

If anyone has any recommendations on tools I can try I’d love to know! Internet connection is working in SystemRescue, so I’m sure I can download other tools not included as well. I just don’t really know what to use.

I’ll try it again later today or tomorrow. Maybe I typoed nomodeset or something, or maybe I can try a couple of other things to get it working.

I need to take care of a few things today, so I’ll come back to this later. If anyone has any additional recommendations, I’d love to hear it. Especially if anyone has any recommendations on getting my 2080 working again!

This is a picture of the light on my motherboard when I try to use the 2080. It’s a solid light (sorry, I know it’s a bad picture; the LED brightness made it difficult to take).

The manual is quite clear

VGA - indicates GPU is not detected or fail.

Well… Had an interesting turn of events this morning. I didn’t really try anything more after my last post, was busy yesterday.

So this morning I decided to just try putting the 2080 back in and seeing what happens, and it just worked! I literally just plugged it back in like I had done 3 or 4 times yesterday with no success and it seems to be working perfectly fine today. No idea what changed.

Maybe you almost fried it and it went into some sort of protective state that prevented it from working? I don’t even know if that exists on video cards, but it’s possible something got too hot and a fail-safe was triggered, or maybe a component is close to failing and letting it rest magically fixed it. Maybe your power supply? It could be anything at this point.

Yeah, I’m a little concerned that it might be on the verge of death. I won’t be running any more strenuous tests on it; I guess I’ll just hope it keeps working. So far it’s fine: I opened a few GPU-heavy games yesterday and it worked well, so I’ll keep my fingers crossed. I did buy the 2080 used from a crypto miner about 2 years ago, so it’s probably not 100% healthy anyway. I didn’t have any issues with it until now, though.

Yeah, don’t push your luck, especially nowadays when video cards are just impossible to buy.