Discussion:
[PATCH] radeonsi: don't use fast color clear for small images even on APUs
Marek Olšák
2017-12-12 23:53:12 UTC
From: Marek Olšák <***@amd.com>

Increase the limit and handle non-square images better.

This makes glxgears 20% faster on APUs, and a little more on dGPUs.
We all use and love glxgears.
---
src/gallium/drivers/radeonsi/si_clear.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/gallium/drivers/radeonsi/si_clear.c b/src/gallium/drivers/radeonsi/si_clear.c
index 0ac83f4..464b9d7 100644
--- a/src/gallium/drivers/radeonsi/si_clear.c
+++ b/src/gallium/drivers/radeonsi/si_clear.c
@@ -418,26 +418,25 @@ static void si_do_fast_color_clear(struct si_context *sctx,
 			    sctx->b.family == CHIP_STONEY)
 				tex->num_slow_clears++;
 		}

 		bool need_decompress_pass = false;

 		/* Use a slow clear for small surfaces where the cost of
 		 * the eliminate pass can be higher than the benefit of fast
 		 * clear. The closed driver does this, but the numbers may differ.
 		 *
-		 * Always use fast clear on APUs.
+		 * This helps on both dGPUs and APUs, even small APUs like Mullins.
 		 */
-		bool too_small = sctx->screen->info.has_dedicated_vram &&
-				 tex->resource.b.b.nr_samples <= 1 &&
-				 tex->resource.b.b.width0 <= 256 &&
-				 tex->resource.b.b.height0 <= 256;
+		bool too_small = tex->resource.b.b.nr_samples <= 1 &&
+				 tex->resource.b.b.width0 *
+				 tex->resource.b.b.height0 <= 512 * 512;

 		/* Try to clear DCC first, otherwise try CMASK. */
 		if (vi_dcc_enabled(tex, 0)) {
 			uint32_t reset_value;
 			bool clear_words_needed;

 			if (sctx->screen->debug_flags & DBG(NO_DCC_CLEAR))
 				continue;

 			/* This can only occur with MSAA. */
--
2.7.4
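
For readers who want to compare the two predicates outside the driver, here is a minimal standalone sketch. The struct surf_info and the function names too_small_old/too_small_new are hypothetical stand-ins introduced only for illustration; the real code reads these fields from sctx->screen->info and tex->resource.b.b as shown in the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few fields the patch reads; not a Mesa type. */
struct surf_info {
	unsigned width0;
	unsigned height0;
	unsigned nr_samples;
	bool has_dedicated_vram;
};

/* Predicate before the patch: dGPUs only, each dimension checked on its own. */
static bool too_small_old(const struct surf_info *s)
{
	return s->has_dedicated_vram &&
	       s->nr_samples <= 1 &&
	       s->width0 <= 256 &&
	       s->height0 <= 256;
}

/* Predicate after the patch: dGPUs and APUs alike, total area checked. */
static bool too_small_new(const struct surf_info *s)
{
	return s->nr_samples <= 1 &&
	       s->width0 * s->height0 <= 512u * 512u;
}
```

With these definitions, a single-sample 1024x64 surface is not "too small" under the old test (its width exceeds 256) but is under the new one (65536 pixels, well under 512 * 512 = 262144), which is one way to read "handle non-square images better". A 200x200 surface on an APU likewise changes from fast clear to slow clear.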
Dieter Nützel
2017-12-13 09:03:40 UTC
Tested-by: Dieter Nützel <***@nuetzel-hh.de>

Yes, on RX580 it is slightly more than 20%, but GREAT ;-)

Dieter
Samuel Pitoiset
2017-12-14 12:54:01 UTC
Post by Marek Olšák
Increase the limit and handle non-square images better.
This makes glxgears 20% faster on APUs, and a little more on dGPUs.
We all use and love glxgears.
We love it. :)
Konstantin Kharlamov
2017-12-28 11:29:16 UTC
I'm wondering how r600g differs in that regard. I tried wiring the code into evergreen_do_fast_color_clear(), both as posted and with a 256*256 limit; however, the FPS for me always hovers around the same 1420.

That said, I'm seeing a lot of CPU used by Xorg, glxgears, and compton. Could a CPU bottleneck be the reason?
Marek Olšák
2017-12-28 14:54:33 UTC
On Thu, Dec 28, 2017 at 12:29 PM, Konstantin Kharlamov
Post by Konstantin Kharlamov
I'm wondering how r600g differs in that regard. I tried wiring the code into evergreen_do_fast_color_clear(), both as posted and with a 256*256 limit; however, the FPS for me always hovers around the same 1420.
That said, I'm seeing a lot of CPU used by Xorg, glxgears, and compton. Could a CPU bottleneck be the reason?
r600g might benefit in the same way. glxgears requires the limit to be
at least 300*300.

Marek
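
The 300*300 figure can be checked with a small sketch. It assumes glxgears clears a single-sample 300x300 surface, which is what the number above suggests. The helper too_small_with_limit and its limit_side parameter are hypothetical; the patch hard-codes 512 * 512.

```c
#include <stdbool.h>

/* Hypothetical helper: the patch's area test with the limit made a parameter.
 * limit_side is the side length of the square whose area is the threshold. */
static bool too_small_with_limit(unsigned width, unsigned height,
				 unsigned nr_samples, unsigned limit_side)
{
	return nr_samples <= 1 &&
	       width * height <= limit_side * limit_side;
}
```

Under that assumption, 300 * 300 = 90000 pixels exceeds 256 * 256 = 65536, so an area limit of 256*256 leaves the surface on the fast clear path, while a limit of 300 or the patch's 512 classifies it as small.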
Bas Nieuwenhuizen
2017-12-28 15:02:45 UTC
Post by Marek Olšák
On Thu, Dec 28, 2017 at 12:29 PM, Konstantin Kharlamov
Post by Konstantin Kharlamov
I'm wondering how r600g differs in that regard. I tried wiring the code into evergreen_do_fast_color_clear(), both as posted and with a 256*256 limit; however, the FPS for me always hovers around the same 1420.
That said, I'm seeing a lot of CPU used by Xorg, glxgears, and compton. Could a CPU bottleneck be the reason?
r600g might benefit in the same way. glxgears requires the limit to be
at least 300*300.
As was discussed on #radeon, his default window was much larger due to
a tiling window manager (683x768) and hence his changes did not
trigger.

- Bas
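
To put a number on why a 683x768 window is unaffected: its area is 683 * 768 = 524544 pixels, about twice the 512 * 512 = 262144 threshold. The sketch below computes the smallest square limit that would still cover a given surface. min_square_limit is a hypothetical helper written for this illustration, not something in Mesa.

```c
/* Hypothetical helper: smallest N such that N * N >= width * height, i.e. the
 * smallest square area limit that would still treat the surface as small. */
static unsigned min_square_limit(unsigned width, unsigned height)
{
	unsigned long long area = (unsigned long long)width * height;
	unsigned n = 0;

	/* Linear search is fine for window-sized inputs and avoids libm. */
	while ((unsigned long long)n * n < area)
		n++;
	return n;
}
```

For 683x768 this gives 725, so neither the 256*256 nor the 512*512 variant can change behavior for that window; for 300x300 it gives 300, matching the figure quoted above.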
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev