[PATCH] glsl: optimize sqrt

Discussion:

Marek Olšák

2010-03-29 02:50:41 UTC

We discussed on IRC that the GLSL compiler implements the sqrt function
somewhat inefficiently. Instead of the rsq+rcp+cmp instructions in the
original code, the proposed patch uses just rsq+mul. Please see the
patch log for further explanation, and please review.

-Marek

Roland Scheidegger

2010-03-29 15:07:44 UTC

Post by Marek Olšák
We discussed on IRC that the GLSL compiler implements the sqrt function
somewhat inefficiently. Instead of the rsq+rcp+cmp instructions in the
original code, the proposed patch uses just rsq+mul. Please see the
patch log for further explanation, and please review.

I definitely agree with using mul instead of rcp, as that should be
more efficient on a lot of modern hardware (rcp is usually part of a
special-function block rather than the main ALU).
As far as I can tell, though, we unfortunately still need the cmp:
invsqrt(0) is infinite, and multiplying it by 0 gives an undefined
result, which for IEEE should be NaN. (This depends on the hardware, I
guess; an implementation that clamps infinity to its maximum
representable number should be OK.) In any case, GLSL says invsqrt(0)
is undefined, so we can't rely on this.
Thinking about it, we'd possibly want a SQRT opcode, both in Mesa and
TGSI, because there's actually hardware that can do sqrt (i965
MathBox) and, just as importantly, because this gives drivers a way to
implement it as invsqrt + mul without the cmp, if they can. For
instance, AMD hardware generally has 3 rounding modes for these ops:
"IEEE" (which gives infinity for invsqrt(0)), "DX" (which clamps to
MAX_FLOAT), and "FF" (which clamps infinity to 0, exactly what you need
to implement sqrt with a mul and invsqrt and no cmp, though actually it
should work with "DX" clamping as well).

Roland

Post by Marek Olšák
-Marek
------------------------------------------------------------------------
From 9b834a79a1819f3b4b9868be3e2696667791c83e Mon Sep 17 00:00:00 2001
Date: Sat, 27 Mar 2010 13:49:09 +0100
Subject: [PATCH] glsl: optimize sqrt
sqrt(x) =
sqrt(x)^2 / sqrt(x) =
x / sqrt(x) =
x * rsqrt(x)
Also the need for the CMP instruction is gone because there is no division
by zero.
---
.../shader/slang/library/slang_common_builtin.gc | 22 +++----------------
1 files changed, 4 insertions(+), 18 deletions(-)
diff --git a/src/mesa/shader/slang/library/slang_common_builtin.gc b/src/mesa/shader/slang/library/slang_common_builtin.gc
index a25ca55..3f6596c 100644
--- a/src/mesa/shader/slang/library/slang_common_builtin.gc
+++ b/src/mesa/shader/slang/library/slang_common_builtin.gc
@@ -602,50 +602,36 @@ vec4 exp2(const vec4 a)
float sqrt(const float x)
{
- const float nx = -x;
float r;
__asm float_rsq r, x;
- __asm float_rcp r, r;
- __asm vec4_cmp __retVal, nx, r, 0.0;
+ __retVal = r * x;
}
vec2 sqrt(const vec2 x)
{
- const vec2 nx = -x, zero = vec2(0.0);
vec2 r;
__asm float_rsq r.x, x.x;
__asm float_rsq r.y, x.y;
- __asm float_rcp r.x, r.x;
- __asm float_rcp r.y, r.y;
- __asm vec4_cmp __retVal, nx, r, zero;
+ __retVal = r * x;
}
vec3 sqrt(const vec3 x)
{
- const vec3 nx = -x, zero = vec3(0.0);
vec3 r;
__asm float_rsq r.x, x.x;
__asm float_rsq r.y, x.y;
__asm float_rsq r.z, x.z;
- __asm float_rcp r.x, r.x;
- __asm float_rcp r.y, r.y;
- __asm float_rcp r.z, r.z;
- __asm vec4_cmp __retVal, nx, r, zero;
+ __retVal = r * x;
}
vec4 sqrt(const vec4 x)
{
- const vec4 nx = -x, zero = vec4(0.0);
vec4 r;
__asm float_rsq r.x, x.x;
__asm float_rsq r.y, x.y;
__asm float_rsq r.z, x.z;
__asm float_rsq r.w, x.w;
- __asm float_rcp r.x, r.x;
- __asm float_rcp r.y, r.y;
- __asm float_rcp r.z, r.z;
- __asm float_rcp r.w, r.w;
- __asm vec4_cmp __retVal, nx, r, zero;
+ __retVal = r * x;
}
------------------------------------------------------------------------
_______________________________________________
Mesa3d-dev mailing list
https://lists.sourceforge.net/lists/listinfo/mesa3d-dev

Brian Paul

2010-03-29 17:34:10 UTC

Post by Roland Scheidegger

Yeah, I'm going to keep the x==0 test for now. I'm replacing the rcp
with mul, per Marek's idea. Thanks, Marek!

Post by Roland Scheidegger
Thinking about it, we'd possibly want a SQRT opcode, both in Mesa and
TGSI, because there's actually hardware that can do sqrt (i965
MathBox) and, just as importantly, because this gives drivers a way to
implement it as invsqrt + mul without the cmp, if they can. For
instance, AMD hardware generally has 3 rounding modes for these ops:
"IEEE" (which gives infinity for invsqrt(0)), "DX" (which clamps to
MAX_FLOAT), and "FF" (which clamps infinity to 0, exactly what you need
to implement sqrt with a mul and invsqrt and no cmp, though actually it
should work with "DX" clamping as well).

I'd be happy to see a new SQRT instruction. I'll put it on my to-do list.

-Brian