Argh, My Display!
There comes a time in life. A time when you need to release a new product onto the market. That product must meet both the users' needs (so that they can use it after they pay for it) and the company's needs (it has to actually be received well so that it brings in more money as a result).
Now, imagine, if you will, a culture driver war.
You need to accommodate the new display hardware changes, meaning you need new code to handle them properly.
Then, imagine, if you will, that the new display driver code does not work as intended.
'Ok, surely, it's not a big deal. It can't be,' you think.
'There must be a way to fix these obvious issues that do indeed affect real-world people.'
...
Before we start, however: whilst I would like to truly express my disappointment to the most accurate extent possible (I have literally spent around $1700 on a shiny new laptop, after all) (I know I shouldn't have) (I will never do it again), I must refrain from making irrational claims while taking a shit on people who do not deserve it. They are working to fix what they can reproduce, and they have my respect for that. For this reason, I will be very gentle.
Nonetheless, I do have a problem with the way AMD tests display code changes. It's so... underwhelming, it's not even funny. Here is an example of their testing phase: 4 systems. FOUR SYSTEMS. Surely that accounts for all possible DCN variants, right? There's no way you'd ever miss a single DCN version that might behave erratically, right? There's no way you'd ever overlook issues specific to dGPU DCN vs. APU DCN, right? That isn't very good.
As such, the only thing I ask of you, dear reader, is that you at least acknowledge that when a company works on a product that is then sold to system integrators or AIB partners, they should be able to thoroughly test not just their own reference hardware, but also third-party devices. This is the only thing that matters, as will become very apparent soon.
We have been told that AMD cannot reproduce most of these issues. Now that's strange, isn't it? How is it possible that they can't see what we are so obviously seeing? There always seem to be a few people who join these discussions to notify the issue reporters that they're not alone. And all of it happens on end-user hardware.
Obviously, Strix Point has not gone mainstream, at least not yet, especially with its prices and early adopter issues (but that applies to most laptop hardware unfortunately). It will definitely take some time to make it work the way my old Raven laptop does - absolutely no issues with anything as of right now despite some sporadic GPU timeouts in its early days.
Going hard on Strix isn't very valid in my mind, either: the hardware support is genuinely good! I, for example, do not have issues with PSR or hard lockups. Others do and I sympathize with them. I seem to have drawn the long straw here. The only real issue with these chips that I intensely dislike is the occasional VCN timeouts, but since decoding's mostly ok on Chromium and perfect with mpv while single display reset handling is well done, I can reluctantly say 'it's fine.'
Also, I don't really care for suspend-to-idle. I don't like it, it's buggy enough to avoid altogether, and I don't need it: the laptop boots up and shuts down plenty fast, just like a desktop computer does. It really is fine, leave it. If I can make it work after a single GPU reset (regardless of whether it is intentional or a video decoder bug lol), I care even less. 'Yay, another workaround,' he said.
The real problem kicks in when there is no workaround.
Why this post
If you take a closer look at issue 3808, you can see that the only reason I can use my laptop without feeling like my work's gonna go to shit at random is that I hacked my way around the GPU reset mechanism, tricking the driver into thinking that it's actually suspending the hardware and not 'just' resetting the chip.
I am not well acquainted with the display part of amdgpu. I don't know most of the driver, really, and I have only ever written the most basic kinds of patches. I can code, however, and I can use Bootlin; this is how I was able to make that simple hack I mentioned before. I do not understand why display stuff absolutely needs state copies and cannot just re-probe all connectors, which might even be easier to maintain, who knows, and actually seems to work, at least on laptop APUs. For now, I'm enjoying a pretty stable system and have a hack that works in case something goes wrong. All's good.
So, what's next? I haven't the slightest clue. Maybe I should just make an offer to mail my laptop then... I guess this is the only way forward at this strix point.
oh god that was so fucking unfunny, I am genuinely sorry
For real, though, I guess we just need to wait. AMD *has* fixed an annoying compute bug on GFX10+; it only needed time. Easy does it, I suppose.
By the way, the patch I mentioned is also available here for your kernel compiling pleasure. Apply on top of your tree if necessary. Don't harass and, most importantly, have fun.
2025-02-11