[RFC PATCH 0/6] r600: speed up tesselation shaders
Gert Wollny
2017-11-15 09:29:10 UTC
Dear all,

since the tessellation shaders on r600 don't go through the sb optimizer, I
thought it might improve performance to apply some optimizations directly to
the generated assembly. The patches are experimental, but they have reached a
point where I think some input from you would be helpful.

This patch series applies the following optimizations:
- pre-calculate and re-use address offsets that were previously always
  calculated on the fly
- only load from LDS what is really requested (based on the source swizzle
  masks of the input values)
- preload all used elements in cases where the shader would otherwise load
  the data only partially and in different places

At this point there are no piglit regressions, but an unrelated GPU lockup is
triggered (Dave and I are already testing patches for this).

Benchmarking on BARTS with Unigine-Heaven and Tessmark x32 (with
MESA_GL_VERSION_OVERRIDE=4.0) I get the following improvements:

                   pre-opt    post-opt
master: fb0e9b5197
===============================================
Heaven

Res:  1280x1024
Q:    High
Tess: Normal
-----------------------------------------------
Time:                260.2       260.2
Frames:               3276        4192

FPS:                  12.6        16.1
Min FPS:               4.0         4.6
Max FPS:              60.9        69.0

Score:               317.2       405.8
-----------------------------------------------

Tessmark x32
R: 1024x640
-----------------------------------------------
Points:                635         700
FPS:                    10          11
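
To put the figures above in relation to each other, the relative gain of a higher-is-better metric can be worked out as in the following minimal sketch. The helper name `relative_gain_percent` is made up for illustration and is not part of the series.

```c
#include <assert.h>

/* Hypothetical helper: relative improvement in percent when a
 * higher-is-better metric moves from `before` to `after`. */
static double relative_gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the table, the Heaven average FPS (12.6 to 16.1) and score (317.2 to 405.8) both come out at roughly 28 %, and the Tessmark points (635 to 700) at roughly 10 %.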

A GitHub repo including these patches can be found at

https://github.com/gerddie/mesa/tree/r600-tess-speedup

Many thanks for any comments,
Gert

Gert Wollny (6):
r600:shader: Fix all warnings issued with "-Wall -Wextra"
r600_shader: only load from LDS what is really used
r600_shader.c: Add a caching structure for load tesselation data
r600_shader: Move calculation of offset to do_lds_fetch_values
r600_shader.c: Pre-calculate some offsets for LDS access
r600_shader.c: Preload some LDS values.

src/gallium/drivers/r600/r600_shader.c | 636 ++++++++++++++++++++++++---------
1 file changed, 476 insertions(+), 160 deletions(-)
--
2.13.6
Gert Wollny
2017-11-15 09:29:11 UTC
- fix a number of -Wsign-compare warnings
- fix two -Woverride-init warnings: TGSI_OPCODE_CEIL == 83, so the
  corresponding table entry was defined twice.

Signed-off-by: Gert Wollny <***@gmail.com>
---
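Note on the -Wsign-compare changes: in a mixed comparison C converts the signed operand to unsigned, so `-1 < 1u` evaluates to false. The following is a minimal sketch of the two usual remedies (matching the counter's type to the bound, and checking the sign before casting). The helper names are hypothetical and are not taken from r600_shader.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: search an array whose length is stored as unsigned.
 * The loop counter has the same signedness as `count`, so the comparison
 * `i < count` does not trigger -Wsign-compare. */
static int find_index(const unsigned *names, unsigned count, unsigned wanted)
{
    unsigned i;

    assert(names != NULL);
    for (i = 0; i < count; i++) {
        if (names[i] == wanted)
            return (int)i;      /* explicit, intentional conversion */
    }
    return -1;                  /* "not found" marker */
}

/* Hypothetical helper: compare a signed with an unsigned value without
 * relying on the implicit conversion of negative numbers. */
static int int_less_than_unsigned(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

For example, `int_less_than_unsigned(-1, 1u)` is true, whereas the plain expression `-1 < 1u` is not.
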
src/gallium/drivers/r600/r600_shader.c | 67 ++++++++++++++++++----------------
1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 625537b48b..a2dc08c596 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -289,7 +289,7 @@ error:
return r;
}

-void r600_pipe_shader_destroy(struct pipe_context *ctx, struct r600_pipe_shader *shader)
+void r600_pipe_shader_destroy(struct pipe_context *ctx UNUSED, struct r600_pipe_shader *shader)
{
r600_resource_reference(&shader->bo, NULL);
r600_bytecode_clear(&shader->shader.bc);
@@ -1094,7 +1094,8 @@ static int allocate_system_value_inputs(struct r600_shader_ctx *ctx, int gpr_off

{ false, &ctx->fixed_pt_position_gpr, TGSI_SEMANTIC_SAMPLEID, TGSI_SEMANTIC_SAMPLEPOS } /* SAMPLEID is in Fixed Point Position GPR.w */
};
- int i, k, num_regs = 0;
+ int num_regs = 0;
+ unsigned k, i;

if (tgsi_parse_init(&parse, ctx->tokens) != TGSI_PARSE_OK) {
return 0;
@@ -1997,11 +1998,12 @@ static int process_twoside_color_inputs(struct r600_shader_ctx *ctx)
}

static int emit_streamout(struct r600_shader_ctx *ctx, struct pipe_stream_output_info *so,
- int stream, unsigned *stream_item_size)
+ int stream, unsigned *stream_item_size UNUSED)
{
unsigned so_gpr[PIPE_MAX_SHADER_OUTPUTS];
unsigned start_comp[PIPE_MAX_SHADER_OUTPUTS];
- int i, j, r;
+ int j, r;
+ unsigned i;

/* Sanity checking. */
if (so->num_outputs > PIPE_MAX_SO_OUTPUTS) {
@@ -2153,13 +2155,14 @@ static int generate_gs_copy_shader(struct r600_context *rctx,
struct r600_shader_ctx ctx = {};
struct r600_shader *gs_shader = &gs->shader;
struct r600_pipe_shader *cshader;
- int ocnt = gs_shader->noutput;
+ unsigned ocnt = gs_shader->noutput;
struct r600_bytecode_alu alu;
struct r600_bytecode_vtx vtx;
struct r600_bytecode_output output;
struct r600_bytecode_cf *cf_jump, *cf_pop,
*last_exp_pos = NULL, *last_exp_param = NULL;
- int i, j, next_clip_pos = 61, next_param = 0;
+ int next_clip_pos = 61, next_param = 0;
+ unsigned i, j;
int ring;
bool only_ring_0 = true;
cshader = calloc(1, sizeof(struct r600_pipe_shader));
@@ -2475,10 +2478,11 @@ static int emit_inc_ring_offset(struct r600_shader_ctx *ctx, int idx, bool ind)
return 0;
}

-static int emit_gs_ring_writes(struct r600_shader_ctx *ctx, const struct pipe_stream_output_info *so, int stream, bool ind)
+static int emit_gs_ring_writes(struct r600_shader_ctx *ctx, const struct pipe_stream_output_info *so UNUSED, int stream, bool ind)
{
struct r600_bytecode_output output;
- int i, k, ring_offset;
+ int ring_offset;
+ unsigned i, k;
int effective_stream = stream == -1 ? 0 : stream;
int idx = 0;

@@ -2619,8 +2623,9 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)

static int emit_lds_vs_writes(struct r600_shader_ctx *ctx)
{
- int i, j, r;
+ int j, r;
int temp_reg;
+ unsigned i;

/* fetch tcs input values into input_vals */
ctx->tess_input_info = r600_get_temp(ctx);
@@ -2793,10 +2798,10 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,

static int r600_emit_tess_factor(struct r600_shader_ctx *ctx)
{
- unsigned i;
int stride, outer_comps, inner_comps;
int tessinner_idx = -1, tessouter_idx = -1;
- int r;
+ int i, r;
+ unsigned j;
int temp_reg = r600_get_temp(ctx);
int treg[3] = {-1, -1, -1};
struct r600_bytecode_alu alu;
@@ -2843,11 +2848,11 @@ static int r600_emit_tess_factor(struct r600_shader_ctx *ctx)

/* R0 is InvocationID, RelPatchID, PatchID, tf_base */
/* TF_WRITE takes index in R.x, value in R.y */
- for (i = 0; i < ctx->shader->noutput; i++) {
- if (ctx->shader->output[i].name == TGSI_SEMANTIC_TESSINNER)
- tessinner_idx = i;
- if (ctx->shader->output[i].name == TGSI_SEMANTIC_TESSOUTER)
- tessouter_idx = i;
+ for (j = 0; j < ctx->shader->noutput; j++) {
+ if (ctx->shader->output[j].name == TGSI_SEMANTIC_TESSINNER)
+ tessinner_idx = j;
+ if (ctx->shader->output[j].name == TGSI_SEMANTIC_TESSOUTER)
+ tessouter_idx = j;
}

if (tessouter_idx == -1)
@@ -2948,7 +2953,8 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
struct r600_bytecode_output output[ARRAY_SIZE(shader->output)];
unsigned output_done, noutput;
unsigned opcode;
- int i, j, k, r = 0;
+ int j, k, r = 0;
+ unsigned i;
int next_param_base = 0, next_clip_base;
int max_color_exports = MAX2(key.ps.nr_cbufs, 1);
bool indirect_gprs;
@@ -3638,7 +3644,7 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
goto out_err;
}

- if (output[j].type==-1) {
+ if ((int)output[j].type==-1) {
output[j].type = V_SQ_CF_ALLOC_EXPORT_WORD0_SQ_EXPORT_PARAM;
output[j].array_base = next_param_base++;
}
@@ -3696,10 +3702,10 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
noutput = j;

/* set export done on last export of each type */
- for (i = noutput - 1, output_done = 0; i >= 0; i--) {
- if (!(output_done & (1 << output[i].type))) {
- output_done |= (1 << output[i].type);
- output[i].op = CF_OP_EXPORT_DONE;
+ for (k = noutput - 1, output_done = 0; k >= 0; k--) {
+ if (!(output_done & (1 << output[k].type))) {
+ output_done |= (1 << output[k].type);
+ output[k].op = CF_OP_EXPORT_DONE;
}
}
/* add output to bytecode */
@@ -3759,7 +3765,7 @@ static int tgsi_unsupported(struct r600_shader_ctx *ctx)
return -EINVAL;
}

-static int tgsi_end(struct r600_shader_ctx *ctx)
+static int tgsi_end(struct r600_shader_ctx *ctx UNUSED)
{
return 0;
}
@@ -7645,7 +7651,7 @@ static int tgsi_tex(struct r600_shader_ctx *ctx)
static int find_hw_atomic_counter(struct r600_shader_ctx *ctx,
struct tgsi_full_src_register *src)
{
- int i;
+ unsigned i;

if (src->Register.Indirect) {
for (i = 0; i < ctx->shader->nhwatomic_ranges; i++) {
@@ -7655,7 +7661,7 @@ static int find_hw_atomic_counter(struct r600_shader_ctx *ctx,
} else {
uint32_t index = src->Register.Index;
for (i = 0; i < ctx->shader->nhwatomic_ranges; i++) {
- if (ctx->shader->atomics[i].buffer_id != src->Dimension.Index)
+ if (ctx->shader->atomics[i].buffer_id != (unsigned)src->Dimension.Index)
continue;
if (index > ctx->shader->atomics[i].end)
continue;
@@ -7821,7 +7827,7 @@ static int tgsi_lrp(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
struct r600_bytecode_alu alu;
- int lasti = tgsi_last_instruction(inst->Dst[0].Register.WriteMask);
+ unsigned lasti = tgsi_last_instruction(inst->Dst[0].Register.WriteMask);
unsigned i, temp_regs[2];
int r;

@@ -8616,7 +8622,8 @@ static inline void callstack_update_max_depth(struct r600_shader_ctx *ctx,
unsigned reason)
{
struct r600_stack_info *stack = &ctx->bc->stack;
- unsigned elements, entries;
+ unsigned elements;
+ int entries;

unsigned entry_size = stack->entry_size;

@@ -8871,7 +8878,7 @@ static int tgsi_bgnloop(struct r600_shader_ctx *ctx)

static int tgsi_endloop(struct r600_shader_ctx *ctx)
{
- unsigned i;
+ int i;

r600_bytecode_add_cfinst(ctx->bc, CF_OP_LOOP_END);

@@ -9443,8 +9450,7 @@ static const struct r600_shader_tgsi_instruction eg_shader_tgsi_instruction[] =
[TGSI_OPCODE_ENDIF] = { ALU_OP0_NOP, tgsi_endif},
[TGSI_OPCODE_DDX_FINE] = { FETCH_OP_GET_GRADIENTS_H, tgsi_tex},
[TGSI_OPCODE_DDY_FINE] = { FETCH_OP_GET_GRADIENTS_V, tgsi_tex},
- [82] = { ALU_OP0_NOP, tgsi_unsupported},
- [83] = { ALU_OP0_NOP, tgsi_unsupported},
+ [82] = { ALU_OP0_NOP, tgsi_unsupported},
[TGSI_OPCODE_CEIL] = { ALU_OP1_CEIL, tgsi_op2},
[TGSI_OPCODE_I2F] = { ALU_OP1_INT_TO_FLT, tgsi_op2_trans},
[TGSI_OPCODE_NOT] = { ALU_OP1_NOT_INT, tgsi_op2},
@@ -9667,7 +9673,6 @@ static const struct r600_shader_tgsi_instruction cm_shader_tgsi_instruction[] =
[TGSI_OPCODE_DDX_FINE] = { FETCH_OP_GET_GRADIENTS_H, tgsi_tex},
[TGSI_OPCODE_DDY_FINE] = { FETCH_OP_GET_GRADIENTS_V, tgsi_tex},
[82] = { ALU_OP0_NOP, tgsi_unsupported},
- [83] = { ALU_OP0_NOP, tgsi_unsupported},
[TGSI_OPCODE_CEIL] = { ALU_OP1_CEIL, tgsi_op2},
[TGSI_OPCODE_I2F] = { ALU_OP1_INT_TO_FLT, tgsi_op2},
[TGSI_OPCODE_NOT] = { ALU_OP1_NOT_INT, tgsi_op2},
--
2.13.6
Gert Wollny
2017-11-15 09:29:13 UTC
Cache values that are loaded more than once, or whose components are loaded
in separate places. This avoids recalculating the offsets repeatedly.

Signed-off-by: Gert Wollny <***@gmail.com>
---
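The two-pass idea can be sketched as follows: a first pass over the instructions records every LDS read and merges the component masks, and afterwards only the entries that were read more than once are kept for caching. This is a simplified illustration only. The `key` struct and all function names are hypothetical, and the patch itself compares full `tgsi_full_src_register` values and additionally tracks writes.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 32

/* Hypothetical, simplified key for one LDS source. */
struct key { int file; int index; int vertex; };

struct entry {
    struct key key;
    unsigned mask;  /* union of all component masks seen for this key */
    unsigned uses;  /* number of reads recorded for this key */
};

struct cache {
    struct entry data[CACHE_SLOTS];
    int fill;
};

static int key_equal(const struct key *a, const struct key *b)
{
    return a->file == b->file && a->index == b->index &&
           a->vertex == b->vertex;
}

/* Pass 1: record one read; repeated reads merge their masks. */
static void cache_note_read(struct cache *c, struct key k, unsigned mask)
{
    assert(c != NULL);
    for (int i = 0; i < c->fill; i++) {
        if (key_equal(&c->data[i].key, &k)) {
            c->data[i].mask |= mask;
            c->data[i].uses++;
            return;
        }
    }
    if (c->fill < CACHE_SLOTS) {    /* silently drop reads when full */
        c->data[c->fill].key = k;
        c->data[c->fill].mask = mask;
        c->data[c->fill].uses = 1;
        c->fill++;
    }
}

/* Pass 2: compact the table to the entries read more than once. */
static int cache_keep_multiused(struct cache *c)
{
    int kept = 0;

    for (int i = 0; i < c->fill; i++) {
        if (c->data[i].uses > 1)
            c->data[kept++] = c->data[i];
    }
    c->fill = kept;
    return kept;
}
```

Reading components x and z of the same input in two different instructions leaves a single entry with mask 0x5, so one fetch can serve both uses.
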
src/gallium/drivers/r600/r600_shader.c | 211 +++++++++++++++++++++++++++++----
1 file changed, 190 insertions(+), 21 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 9fa83189bc..5713eda6b0 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -317,6 +317,20 @@ struct eg_interp {
unsigned ij_index;
};

+struct r600_tess_input_cache_entry {
+ struct tgsi_full_src_register key;
+ unsigned reg: 16;
+ unsigned initialized:1;
+ unsigned read_access:1;
+ unsigned was_written:1;
+ unsigned mask:4;
+};
+
+struct r600_tess_input_cache {
+ struct r600_tess_input_cache_entry data[32];
+ int fill;
+};
+
struct r600_shader_ctx {
struct tgsi_shader_info info;
struct tgsi_parse_context parse;
@@ -353,6 +367,7 @@ struct r600_shader_ctx {
unsigned enabled_stream_buffers_mask;
unsigned tess_input_info; /* temp with tess input offsets */
unsigned tess_output_info; /* temp with tess input offsets */
+ struct r600_tess_input_cache tess_input_cache;
};

struct r600_shader_tgsi_instruction {
@@ -1810,7 +1825,8 @@ static int fetch_mask( struct tgsi_src_register *reg)
return mask;
}

-static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1826,13 +1842,14 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

-static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1852,13 +1869,14 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

-static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1874,12 +1892,153 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

+static int tgsi_full_src_register_equal_for_cache(struct tgsi_full_src_register *lhs,
+ struct tgsi_full_src_register *rhs)
+{
+ if (lhs->Register.Index != rhs->Register.Index)
+ return 0;
+
+ if (lhs->Register.File != rhs->Register.File)
+ return 0;
+ if (lhs->Register.Indirect || rhs->Register.Indirect)
+ return 0;
+
+ if (lhs->Register.Dimension) {
+ if (!rhs->Register.Dimension ||
+ (rhs->Dimension.Index != lhs->Dimension.Index) ||
+ (rhs->Dimension.Dimension != lhs->Dimension.Dimension))
+ return 0;
+
+ if (lhs->Dimension.Indirect || rhs->Dimension.Indirect)
+ return 0;
+ } else if (rhs->Register.Dimension)
+ return 0;
+
+ return 1;
+}
+
+static void tess_input_cache_store(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ if (cache->fill < 32) {
+ memcpy(&cache->data[cache->fill].key, src, sizeof(struct tgsi_full_src_register));
+ cache->data[cache->fill].mask = fetch_mask(&src->Register);
+ cache->data[cache->fill].reg = 0;
+ cache->data[cache->fill].was_written = src->Register.File == TGSI_FILE_OUTPUT;
+ ++cache->fill;
+ }
+}
+
+static void tess_input_cache_check(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ int i;
+ for (i = 0; i < cache->fill; ++i) {
+ /* indirect loads can come from anywhere, no use caching them */
+ if (src->Register.Indirect || src->Dimension.Indirect)
+ return;
+
+ if (tgsi_full_src_register_equal_for_cache(src, &cache->data[i].key)) {
+ cache->data[i].mask |= fetch_mask(&src->Register);
+ cache->data[i].read_access = src->Register.File == TGSI_FILE_INPUT;
+ if (!cache->data[i].was_written) {
+ ++cache->data[i].reg;
+ cache->data[i].was_written = src->Register.File == TGSI_FILE_OUTPUT;
+ } else {
+ /* FIXME: If the entry was written before reading it, we cannot cache it;
+ * instead we could store the address to speed up access, or keep the written
+ * value. The latter should check whether there is synchronisation within the
+ * work group to ensure that the stored value is not overwritten by another
+ * thread.
+ */
+ cache->data[i].reg = 0;
+ }
+ return;
+ }
+ }
+ tess_input_cache_store(cache, src);
+}
+
+static int tess_input_cache_count_multiused(struct r600_tess_input_cache *cache,
+ unsigned reg_base)
+{
+ int i;
+ int cnt = 0;
+ for (i = 0; i < cache->fill; ++i) {
+ if (cache->data[i].reg > 0 && cache->data[i].read_access) {
+ if (i != cnt)
+ memcpy(&cache->data[cnt], &cache->data[i],
+ sizeof(struct r600_tess_input_cache_entry));
+ cache->data[cnt].reg = reg_base + cnt;
+ cache->data[cnt].initialized = 0;
+ ++cnt;
+ }
+ }
+ cache->fill = cnt;
+ return cnt;
+}
+
+static struct r600_tess_input_cache_entry *
+tess_input_cache_load(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ struct r600_tess_input_cache_entry *retval = NULL;
+ int i;
+ for (i = 0; i < cache->fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &cache->data[i];
+ if (tgsi_full_src_register_equal_for_cache(src, &ce->key)) {
+ retval = ce;
+ break;
+ }
+ }
+ return retval;
+}
+
+typedef int (*fetch_tessdata_from_lds)(struct r600_shader_ctx *ctx,
+ struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask);
+
+static int r600_load_tess_data(struct r600_shader_ctx *ctx,
+ struct tgsi_full_src_register *src,
+ fetch_tessdata_from_lds fetch_call)
+{
+ int treg;
+ struct r600_tess_input_cache_entry *ce;
+ ce = tess_input_cache_load(&ctx->tess_input_cache, src);
+ if (!ce) {
+ treg = r600_get_temp(ctx);
+ fetch_call(ctx, src, treg, fetch_mask(&src->Register));
+ } else {
+ if (!ce->initialized) {
+ fetch_call(ctx, src, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+ treg = ce->reg;
+ }
+ return treg;
+}
+
+
+static void count_tess_inputs(struct r600_shader_ctx *ctx)
+{
+ struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
+ unsigned i;
+
+ for (i = 0; i < inst->Instruction.NumSrcRegs; i++) {
+ struct tgsi_full_src_register *src = &inst->Src[i];
+ if (((src->Register.File == TGSI_FILE_INPUT) && (ctx->type == PIPE_SHADER_TESS_EVAL)) ||
+ (ctx->type == PIPE_SHADER_TESS_CTRL &&
+ (src->Register.File == TGSI_FILE_INPUT || src->Register.File == TGSI_FILE_OUTPUT)))
+ tess_input_cache_check(&ctx->tess_input_cache, src);
+ }
+}
+
static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
@@ -1889,21 +2048,15 @@ static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
struct tgsi_full_src_register *src = &inst->Src[i];

if (ctx->type == PIPE_SHADER_TESS_EVAL && src->Register.File == TGSI_FILE_INPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tes_input(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tes_input);
ctx->src[i].rel = 0;
}
if (ctx->type == PIPE_SHADER_TESS_CTRL && src->Register.File == TGSI_FILE_INPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tcs_input(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tcs_input);
ctx->src[i].rel = 0;
}
if (ctx->type == PIPE_SHADER_TESS_CTRL && src->Register.File == TGSI_FILE_OUTPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tcs_output(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tcs_output);
ctx->src[i].rel = 0;
}
}
@@ -2982,6 +3135,8 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
bool lds_inputs = false;
bool pos_emitted = false;

+ ctx.tess_input_cache.fill = 0;
+
ctx.bc = &shader->bc;
ctx.shader = shader;
ctx.native_integers = true;
@@ -3162,21 +3317,35 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
ctx.temp_reg = ctx.bc->ar_reg + 3;
}

+ if (lds_inputs) {
+ tgsi_parse_init(&ctx.parse, tokens);
+ while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
+ tgsi_parse_token(&ctx.parse);
+
+ if (ctx.parse.FullToken.Token.Type != TGSI_TOKEN_TYPE_INSTRUCTION)
+ continue;
+
+ count_tess_inputs(&ctx);
+ }
+ ctx.temp_reg += tess_input_cache_count_multiused(&ctx.tess_input_cache, ctx.temp_reg);
+ tgsi_parse_init(&ctx.parse, tokens);
+ }
+
shader->max_arrays = 0;
shader->num_arrays = 0;
if (indirect_gprs) {

if (ctx.info.indirect_files & (1 << TGSI_FILE_INPUT)) {
r600_add_gpr_array(shader, ctx.file_offset[TGSI_FILE_INPUT],
- ctx.file_offset[TGSI_FILE_OUTPUT] -
- ctx.file_offset[TGSI_FILE_INPUT],
- 0x0F);
+ ctx.file_offset[TGSI_FILE_OUTPUT] -
+ ctx.file_offset[TGSI_FILE_INPUT],
+ 0x0F);
}
if (ctx.info.indirect_files & (1 << TGSI_FILE_OUTPUT)) {
r600_add_gpr_array(shader, ctx.file_offset[TGSI_FILE_OUTPUT],
- ctx.file_offset[TGSI_FILE_TEMPORARY] -
- ctx.file_offset[TGSI_FILE_OUTPUT],
- 0x0F);
+ ctx.file_offset[TGSI_FILE_TEMPORARY] -
+ ctx.file_offset[TGSI_FILE_OUTPUT],
+ 0x0F);
}
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:12 UTC
Use the swizzle of the source register to determine which components really
need to be read from LDS, and load only these.

Signed-off-by: Gert Wollny <***@gmail.com>
---
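As an illustration of how such a mask can be derived: each swizzle field selects one source component, and the mask gets one bit per component that is referenced at least once. The sketch below uses a hypothetical `swizzle4` struct as a stand-in for the swizzle fields of `struct tgsi_src_register`, and the helper names are made up as well.

```c
#include <assert.h>

/* Hypothetical stand-in for the four swizzle fields of a source register;
 * each field selects a source component: 0 = x, 1 = y, 2 = z, 3 = w. */
struct swizzle4 {
    unsigned x, y, z, w;
};

/* Build a 4-bit mask with one bit per referenced source component. */
static unsigned swizzle_fetch_mask(struct swizzle4 s)
{
    assert(s.x < 4 && s.y < 4 && s.z < 4 && s.w < 4);
    return (1u << s.x) | (1u << s.y) | (1u << s.z) | (1u << s.w);
}

/* Number of LDS reads needed for a given mask. */
static unsigned lds_reads_needed(unsigned mask)
{
    unsigned n = 0;

    for (unsigned chan = 0; chan < 4; chan++) {
        if (mask & (1u << chan))
            n++;
    }
    return n;
}
```

A source used as `.xxxx` therefore needs one LDS read instead of four, and `.zzxx` needs two.
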
src/gallium/drivers/r600/r600_shader.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index a2dc08c596..9fa83189bc 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -377,7 +377,7 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg);
+ unsigned dst_reg, unsigned mask);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1025,7 +1025,7 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
if (r)
return r;

- do_lds_fetch_values(ctx, temp_reg, dreg);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xF);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1743,14 +1743,18 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
}

static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg)
+ unsigned dst_reg, unsigned mask)
{
struct r600_bytecode_alu alu;
int r, i;

if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;
+
for (i = 1; i < 4; i++) {
+ if (!(mask & (1 << i)))
+ continue;
+
r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
temp_reg, i,
temp_reg, 0,
@@ -1759,6 +1763,9 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return r;
}
for (i = 0; i < 4; i++) {
+ if (! (mask & (1 << i)))
+ continue;
+
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
@@ -1774,6 +1781,8 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return r;
}
for (i = 0; i < 4; i++) {
+ if (! (mask & (1 << i)))
+ continue;
/* then read from LDS_OQ_A_POP */
memset(&alu, 0, sizeof(alu));

@@ -1791,6 +1800,16 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return 0;
}

+static int fetch_mask( struct tgsi_src_register *reg)
+{
+ int mask = 0;
+ mask |= 1 << reg->SwizzleX;
+ mask |= 1 << reg->SwizzleY;
+ mask |= 1 << reg->SwizzleZ;
+ mask |= 1 << reg->SwizzleW;
+ return mask;
+}
+
static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
{
int r;
@@ -1807,7 +1826,7 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -1833,7 +1852,7 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -1855,7 +1874,7 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -2792,7 +2811,7 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
if (r)
return r;

- do_lds_fetch_values(ctx, temp_reg, dreg);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xf);
return 0;
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:14 UTC
Instead of emitting code that first calculates a base offset and then
calculates the component offsets from it, calculate the final offset
for all components directly. This saves one instruction group.

Signed-off-by: Gert Wollny <***@gmail.com>
---
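The folded address arithmetic can be sketched as below, assuming 16 bytes per parameter slot and 4 bytes per component, which is what the literals `16 * param` and `4 * i` in the diff imply. The helper names are hypothetical and only illustrate the calculation, not the emitted ALU instructions.

```c
#include <assert.h>

/* Byte offset of component `chan` (0 = x .. 3 = w) within parameter slot
 * `param`, relative to the base address. */
static unsigned lds_component_offset(unsigned param, unsigned chan)
{
    assert(chan < 4);
    return 16u * param + 4u * chan;
}

/* Hypothetical helper: write the absolute address of every component
 * selected by `mask` to out[] and return how many were written. */
static unsigned lds_component_addresses(unsigned base, unsigned param,
                                        unsigned mask, unsigned out[4])
{
    unsigned n = 0;

    for (unsigned chan = 0; chan < 4; chan++) {
        if (mask & (1u << chan))
            out[n++] = base + lds_component_offset(param, chan);
    }
    return n;
}
```

For param == 0 the x component needs no addition at all, which is why the loop in the diff starts at channel 1 in that case.
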
src/gallium/drivers/r600/r600_shader.c | 113 ++++++++++++++++-----------------
1 file changed, 56 insertions(+), 57 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 5713eda6b0..873b525449 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -392,7 +392,7 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask);
+ unsigned dst_reg, unsigned mask, int param);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1033,14 +1033,7 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
if (r)
return r;

- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xF);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xF, param);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1658,12 +1651,11 @@ static int tgsi_split_gs_inputs(struct r600_shader_ctx *ctx)
static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
const struct tgsi_full_dst_register *dst,
const struct tgsi_full_src_register *src,
- int stride_bytes_reg, int stride_bytes_chan)
+ int stride_bytes_reg, int stride_bytes_chan, int *param)
{
struct tgsi_full_dst_register reg;
ubyte *name, *index, *array_first;
int r;
- int param;
struct tgsi_shader_info *info = &ctx->info;
/* Set the register description. The address computation is the same
* for sources and destinations. */
@@ -1736,51 +1728,54 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
if (r)
return r;

- param = r600_get_lds_unique_index(name[first],
+ *param = r600_get_lds_unique_index(name[first],
index[first]);

} else {
- param = r600_get_lds_unique_index(name[reg.Register.Index],
+ *param = r600_get_lds_unique_index(name[reg.Register.Index],
index[reg.Register.Index]);
}

- /* add to base_addr - passed in temp_reg.x */
- if (param) {
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- }
return 0;
}

static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask)
+ unsigned dst_reg, unsigned mask, int param)
{
struct r600_bytecode_alu alu;
int r, i;
+ int lasti = tgsi_last_instruction(mask);
+ int firsti = param > 0 ? 0 : 1;

if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;
-
- for (i = 1; i < 4; i++) {
+
+ /* Add the offsets to the base address */
+ for (i = firsti; i <= lasti; i++) {
if (!(mask & (1 << i)))
continue;
-
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, i,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, 4 * i);
+
+ memset(&alu, 0, sizeof(struct r600_bytecode_alu));
+ alu.dst.sel = temp_reg;
+ alu.dst.chan = i;
+ alu.dst.write = 1;
+ alu.op = ALU_OP2_ADD_INT;
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = 0;
+ alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
+ alu.src[1].value = 4 * i + 16 * param;
+
+ if (i == lasti)
+ alu.last = 1;
+
+ r = r600_bytecode_add_alu(ctx->bc, &alu);
if (r)
return r;
}
+
for (i = 0; i < 4; i++) {
if (! (mask & (1 << i)))
continue;
-
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
@@ -1828,7 +1823,7 @@ static int fetch_mask( struct tgsi_src_register *reg)
static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r, param;
unsigned temp_reg = r600_get_temp(ctx);

r = get_lds_offset0(ctx, 2, temp_reg,
@@ -1838,11 +1833,11 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- NULL, src, ctx->tess_output_info, 1);
+ NULL, src, ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1851,7 +1846,7 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r,param;
unsigned temp_reg = r600_get_temp(ctx);

/* t.x = ips * r0.y */
@@ -1865,11 +1860,11 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- NULL, src, ctx->tess_input_info, 1);
+ NULL, src, ctx->tess_input_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1878,7 +1873,7 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r, param;
unsigned temp_reg = r600_get_temp(ctx);

r = get_lds_offset0(ctx, 1, temp_reg,
@@ -1888,11 +1883,11 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
NULL, src,
- ctx->tess_output_info, 1);
+ ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -2831,8 +2826,8 @@ static int emit_lds_vs_writes(struct r600_shader_ctx *ctx)

r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
temp_reg, 2,
- temp_reg, param ? 1 : 0,
- V_SQ_ALU_SRC_LITERAL, 8);
+ temp_reg, 0,
+ V_SQ_ALU_SRC_LITERAL, 8 + param * 16);
if (r)
return r;

@@ -2867,6 +2862,7 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
int temp_reg = r600_get_temp(ctx);
struct r600_bytecode_alu alu;
unsigned write_mask = dst->Register.WriteMask;
+ int param;

if (inst->Dst[0].Register.File != TGSI_FILE_OUTPUT)
return 0;
@@ -2877,20 +2873,30 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- &inst->Dst[0], NULL, ctx->tess_output_info, 1);
+ &inst->Dst[0], NULL, ctx->tess_output_info, 1, &param);
if (r)
return r;

/* LDS write */
lasti = tgsi_last_instruction(write_mask);
- for (i = 1; i <= lasti; i++) {
+ for (i = (param > 0 ? 0: 1); i <= lasti; i++) {

if (!(write_mask & (1 << i)))
continue;
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, i,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, 4 * i);
+ memset(&alu, 0, sizeof(struct r600_bytecode_alu));
+ alu.dst.sel = temp_reg;
+ alu.dst.chan = i;
+ alu.dst.write = 1;
+ alu.op = ALU_OP2_ADD_INT;
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = 0;
+ alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
+ alu.src[1].value = 4 * i + 16 * param;
+
+ if (i == lasti)
+ alu.last = 1;
+
+ r = r600_bytecode_add_alu(ctx->bc, &alu);
if (r)
return r;
}
@@ -2957,14 +2963,7 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
if (r)
return r;

- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xf);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xf, param);
return 0;
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:16 UTC
Pre-load all the LDS values whose range is accessed more than once.

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 163ae75eb5..7c999fbb0b 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -2047,6 +2047,34 @@ static void count_tess_inputs(struct r600_shader_ctx *ctx)
}
}

+static void preload_tes_lds(struct r600_shader_ctx *ctx)
+{
+ int i;
+ ctx->max_driver_temp_used = 0;
+ r600_get_temp(ctx);
+
+ for (i = 0; i < ctx->tess_input_cache.fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &ctx->tess_input_cache.data[i];
+ fetch_tes_input(ctx, &ce->key, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+}
+
+static void preload_tcs_lds(struct r600_shader_ctx *ctx)
+{
+ int i;
+ ctx->max_driver_temp_used = 0;
+ r600_get_temp(ctx);
+ for (i = 0; i < ctx->tess_input_cache.fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &ctx->tess_input_cache.data[i];
+ if (ce->key.Register.File == TGSI_FILE_INPUT)
+ fetch_tcs_input(ctx, &ce->key, ce->reg, ce->mask);
+ else
+ fetch_tcs_output(ctx, &ce->key, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+}
+
static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
@@ -3624,6 +3652,11 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
return r;
}

+ if (ctx.type == PIPE_SHADER_TESS_EVAL)
+ preload_tes_lds(&ctx);
+ else if (ctx.type == PIPE_SHADER_TESS_CTRL)
+ preload_tcs_lds(&ctx);
+
tgsi_parse_init(&ctx.parse, tokens);
while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
tgsi_parse_token(&ctx.parse);
--
2.13.6
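The preload idea from the commit message above can be sketched outside the driver as a two-step process: first count how often each input is read, then load the inputs that are read more than once up front. This is a simplified model under stated assumptions: the struct layout, the integer key, and both function names are hypothetical stand-ins, and the real patch emits LDS fetches where this sketch only sets a flag.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 32

/* Hypothetical, simplified cache entry; `key` stands in for the TGSI
 * source register that the real cache compares. */
struct input_cache_entry {
	int key;
	int use_count;
	int initialized;
};

struct input_cache {
	struct input_cache_entry data[MAX_ENTRIES];
	int fill;
};

/* Record one use of `key`. Returns 0 on success, -1 if the cache is full. */
static int cache_note_use(struct input_cache *c, int key)
{
	assert(c != NULL);
	for (int i = 0; i < c->fill; ++i) {
		if (c->data[i].key == key) {
			c->data[i].use_count++;
			return 0;
		}
	}
	if (c->fill == MAX_ENTRIES)
		return -1;
	c->data[c->fill].key = key;
	c->data[c->fill].use_count = 1;
	c->data[c->fill].initialized = 0;
	c->fill++;
	return 0;
}

/* Preload pass: mark every entry that is used more than once as loaded
 * (real code would emit the LDS fetch here). Returns the number of
 * entries that were preloaded. */
static int cache_preload(struct input_cache *c)
{
	int loaded = 0;

	assert(c != NULL);
	for (int i = 0; i < c->fill; ++i) {
		if (c->data[i].use_count > 1) {
			c->data[i].initialized = 1;
			loaded++;
		}
	}
	return loaded;
}
```

An input that is read only once gains nothing from being preloaded, which is why the sketch leaves such entries untouched.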
Gert Wollny
2017-11-15 09:29:15 UTC
Some offsets used for LDS access are recalculated repeatedly.
Since tessellation shaders are not optimized by SB, manually pre-evaluate
some offsets to speed up this type of shader.

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 253 ++++++++++++++++++++++-----------
1 file changed, 172 insertions(+), 81 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 873b525449..163ae75eb5 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -183,6 +183,7 @@ int r600_pipe_shader_create(struct pipe_context *ctx,
R600_ERR("translation from TGSI failed !\n");
goto error;
}
+
if (shader->shader.processor_type == PIPE_SHADER_VERTEX) {
/* only disable for vertex shaders in tess paths */
if (key.vs.as_ls)
@@ -329,6 +330,7 @@ struct r600_tess_input_cache_entry {
struct r600_tess_input_cache {
struct r600_tess_input_cache_entry data[32];
int fill;
+ int uses_lds_io;
};

struct r600_shader_ctx {
@@ -367,7 +369,8 @@ struct r600_shader_ctx {
unsigned enabled_stream_buffers_mask;
unsigned tess_input_info; /* temp with tess input offsets */
unsigned tess_output_info; /* temp with tess input offsets */
- struct r600_tess_input_cache tess_input_cache;
+ unsigned tess_io_info_precalc; /* temp with precalculated offsets */
+ struct r600_tess_input_cache tess_input_cache;
};

struct r600_shader_tgsi_instruction {
@@ -392,7 +395,8 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask, int param);
+ unsigned temp_chan, unsigned dst_reg,
+ unsigned mask, int param);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1027,13 +1031,8 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
d->Semantic.Name == TGSI_SEMANTIC_TESSOUTER) {
int param = r600_get_lds_unique_index(d->Semantic.Name, 0);
int dreg = d->Semantic.Name == TGSI_SEMANTIC_TESSINNER ? 3 : 2;
- unsigned temp_reg = r600_get_temp(ctx);
-
- r = get_lds_offset0(ctx, 2, temp_reg, true);
- if (r)
- return r;

- do_lds_fetch_values(ctx, temp_reg, dreg, 0xF, param);
+ do_lds_fetch_values(ctx, ctx->tess_io_info_precalc, 3, dreg, 0xF, param);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1648,7 +1647,9 @@ static int tgsi_split_gs_inputs(struct r600_shader_ctx *ctx)
* All three shaders VS(LS), TCS, TES share the same LDS space.
*/
/* this will return with the dw address in temp_reg.x */
-static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
+static int r600_get_byte_address(struct r600_shader_ctx *ctx,
+ unsigned *result_reg, unsigned *result_chan,
+ int base_offset_reg, int base_offset_chan,
const struct tgsi_full_dst_register *dst,
const struct tgsi_full_src_register *src,
int stride_bytes_reg, int stride_bytes_chan, int *param)
@@ -1656,7 +1657,11 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
struct tgsi_full_dst_register reg;
ubyte *name, *index, *array_first;
int r;
+ int temp_reg = -1;
struct tgsi_shader_info *info = &ctx->info;
+ *result_reg = base_offset_reg;
+ *result_chan = base_offset_chan;
+
/* Set the register description. The address computation is the same
* for sources and destinations. */
if (src) {
@@ -1686,14 +1691,18 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
sel = V_SQ_ALU_SRC_LITERAL;
chan = reg.Dimension.Index;
}
-
+ temp_reg = r600_get_temp(ctx);
+ *result_reg = temp_reg;
+ *result_chan = 0;
r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
temp_reg, 0,
stride_bytes_reg, stride_bytes_chan,
sel, chan,
- temp_reg, 0);
+ base_offset_reg, base_offset_chan);
if (r)
return r;
+ } else {
+
}

if (reg.Register.File == TGSI_FILE_INPUT) {
@@ -1719,15 +1728,20 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,

addr_reg = get_address_file_reg(ctx, reg.Indirect.Index);

- /* pull the value from index_reg */
+ if (temp_reg < 0)
+ temp_reg = r600_get_temp(ctx);
+
r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
temp_reg, 0,
V_SQ_ALU_SRC_LITERAL, 16,
addr_reg, 0,
- temp_reg, 0);
+ *result_reg, *result_chan);
if (r)
return r;

+ *result_reg = temp_reg;
+ *result_chan = 0;
+
*param = r600_get_lds_unique_index(name[first],
index[first]);

@@ -1739,14 +1753,17 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
return 0;
}

-static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
+static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned offs_reg,
+ unsigned offs_chan,
unsigned dst_reg, unsigned mask, int param)
+
{
struct r600_bytecode_alu alu;
int r, i;
int lasti = tgsi_last_instruction(mask);
int firsti = param > 0 ? 0 : 1;

+
if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;

@@ -1756,12 +1773,12 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
continue;

memset(&alu, 0, sizeof(struct r600_bytecode_alu));
- alu.dst.sel = temp_reg;
+ alu.dst.sel = ctx->temp_reg;
alu.dst.chan = i;
alu.dst.write = 1;
alu.op = ALU_OP2_ADD_INT;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = 0;
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
alu.src[1].value = 4 * i + 16 * param;

@@ -1779,8 +1796,13 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+ if (i > 0 || firsti == 0) {
+ alu.src[0].sel = ctx->temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }
alu.src[1].sel = V_SQ_ALU_SRC_0;
alu.src[2].sel = V_SQ_ALU_SRC_0;
alu.dst.chan = 0;
@@ -1824,20 +1846,18 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
unsigned int dst_reg, unsigned mask)
{
int r, param;
- unsigned temp_reg = r600_get_temp(ctx);
-
- r = get_lds_offset0(ctx, 2, temp_reg,
- src->Register.Dimension ? false : true);
- if (r)
- return r;
+ unsigned temp_reg;
+ unsigned temp_chan;

/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc,
+ src->Register.Dimension ? 2:3,
NULL, src, ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1848,23 +1868,16 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
{
int r,param;
unsigned temp_reg = r600_get_temp(ctx);
-
- /* t.x = ips * r0.y */
- r = single_alu_op2(ctx, ALU_OP2_MUL_UINT24,
- temp_reg, 0,
- ctx->tess_input_info, 0,
- 0, 1);
-
- if (r)
- return r;
+ unsigned temp_chan = 0;

/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc, 3,
NULL, src, ctx->tess_input_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1874,20 +1887,18 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
unsigned int dst_reg, unsigned mask)
{
int r, param;
- unsigned temp_reg = r600_get_temp(ctx);
+ unsigned temp_reg;
+ unsigned temp_chan;

- r = get_lds_offset0(ctx, 1, temp_reg,
- src->Register.Dimension ? false : true);
- if (r)
- return r;
- /* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc,
+ src->Register.Dimension ? 0:1,
NULL, src,
ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1896,11 +1907,12 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
static int tgsi_full_src_register_equal_for_cache(struct tgsi_full_src_register *lhs,
struct tgsi_full_src_register *rhs)
{
+ if (lhs->Register.File != rhs->Register.File)
+ return 0;
+
if (lhs->Register.Index != rhs->Register.Index)
return 0;

- if (lhs->Register.File != rhs->Register.File)
-
if (lhs->Register.Indirect || rhs->Register.Indirect)
return 0;

@@ -2028,9 +2040,10 @@ static void count_tess_inputs(struct r600_shader_ctx *ctx)
for (i = 0; i < inst->Instruction.NumSrcRegs; i++) {
struct tgsi_full_src_register *src = &inst->Src[i];
if (((src->Register.File == TGSI_FILE_INPUT) && (ctx->type == PIPE_SHADER_TESS_EVAL)) ||
- (ctx->type == PIPE_SHADER_TESS_CTRL &&
- (src->Register.File == TGSI_FILE_INPUT || src->Register.File == TGSI_FILE_OUTPUT)))
+ (ctx->type == PIPE_SHADER_TESS_CTRL)) {
tess_input_cache_check(&ctx->tess_input_cache, src);
+ ctx->tess_input_cache.uses_lds_io = 1;
+ }
}
}

@@ -2729,7 +2742,7 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
0, 0);
if (r)
return r;
-
+
/* used by VS/TCS */
if (ctx->tess_input_info) {
/* fetch tcs input values into resv space */
@@ -2752,12 +2765,13 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
vtx.dst_sel_w = 3;
vtx.src_gpr = temp_val;
vtx.src_sel_x = 0;
-
+
r = r600_bytecode_add_vtx(ctx->bc, &vtx);
if (r)
return r;
+
}
-
+
/* used by TCS/TES */
if (ctx->tess_output_info) {
/* fetch tcs output values into resv space */
@@ -2784,6 +2798,64 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
r = r600_bytecode_add_vtx(ctx->bc, &vtx);
if (r)
return r;
+
+ if (ctx->tess_input_cache.uses_lds_io) {
+
+ /* Precalculate some offsets; the comment above each operation
+ * below states what the written channel holds afterwards.
+ */
+
+ /* tess_io_info_precalc.x = tess_output_info.x * R0.y + tess_output_info.z */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 0,
+ ctx->tess_output_info, 0,
+ 0, 1,
+ ctx->tess_output_info, 2);
+ if (r)
+ return r;
+
+ /* tess_io_info_precalc.y = tess_output_info.x * R0.y + tess_output_info.w */
+
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 1,
+ ctx->tess_output_info, 0,
+ 0, 1,
+ ctx->tess_output_info, 3);
+ if (r)
+ return r;
+
+
+ /* tess_io_info_precalc.z = tess_output_info.x * R0.z + tess_output_info.z */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 2,
+ ctx->tess_output_info, 0,
+ 0, 2,
+ ctx->tess_output_info, 2);
+ if (r)
+ return r;
+
+ /* This is a TCS shader */
+ if (ctx->tess_input_info) {
+
+ /* tess_io_info_precalc.w = tess_input_info.x * R0.y */
+ r = single_alu_op2(ctx, ALU_OP2_MUL_UINT24,
+ ctx->tess_io_info_precalc, 3,
+ ctx->tess_input_info, 0,
+ 0, 1);
+ if (r)
+ return r;
+ } else {
+
+ /* tess_io_info_precalc.w = tess_output_info.x * R0.z + tess_output_info.w */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 3,
+ ctx->tess_output_info, 0,
+ 0, 2,
+ ctx->tess_output_info, 3);
+ if (r)
+ return r;
+ }
+ }
}
return 0;
}
@@ -2858,8 +2930,10 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
const struct tgsi_full_dst_register *dst = &inst->Dst[0];
- int i, r, lasti;
+ int i, r, lasti, firsti;
int temp_reg = r600_get_temp(ctx);
+ unsigned offs_reg;
+ unsigned offs_chan;
struct r600_bytecode_alu alu;
unsigned write_mask = dst->Register.WriteMask;
int param;
@@ -2867,19 +2941,18 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
if (inst->Dst[0].Register.File != TGSI_FILE_OUTPUT)
return 0;

- r = get_lds_offset0(ctx, 1, temp_reg, dst->Register.Dimension ? false : true);
- if (r)
- return r;
-
/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &offs_reg, &offs_chan,
+ ctx->tess_io_info_precalc,
+ dst->Register.Dimension ? 0:1,
&inst->Dst[0], NULL, ctx->tess_output_info, 1, &param);
if (r)
return r;

+ firsti = param > 0 ? 0 : 1;
/* LDS write */
lasti = tgsi_last_instruction(write_mask);
- for (i = (param > 0 ? 0: 1); i <= lasti; i++) {
+ for (i = firsti; i <= lasti; i++) {

if (!(write_mask & (1 << i)))
continue;
@@ -2888,8 +2961,8 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
alu.dst.chan = i;
alu.dst.write = 1;
alu.op = ALU_OP2_ADD_INT;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = 0;
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
alu.src[1].value = 4 * i + 16 * param;

@@ -2909,8 +2982,14 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
(i == 2 && ((write_mask & 0xc) == 0xc))) {
memset(&alu, 0, sizeof(struct r600_bytecode_alu));
alu.op = LDS_OP3_LDS_WRITE_REL;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+
+ if (firsti == 0 || i > 0) {
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }

alu.src[1].sel = dst->Register.Index;
alu.src[1].sel += ctx->file_offset[dst->Register.File];
@@ -2931,8 +3010,14 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
}
memset(&alu, 0, sizeof(struct r600_bytecode_alu));
alu.op = LDS_OP2_LDS_WRITE;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+
+ if (firsti == 0 || i > 0) {
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }

alu.src[1].sel = dst->Register.Index;
alu.src[1].sel += ctx->file_offset[dst->Register.File];
@@ -2953,17 +3038,12 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
int output_idx)
{
int param;
- unsigned temp_reg = r600_get_temp(ctx);
unsigned name = ctx->shader->output[output_idx].name;
int dreg = ctx->shader->output[output_idx].gpr;
- int r;

param = r600_get_lds_unique_index(name, 0);
- r = get_lds_offset0(ctx, 1, temp_reg, true);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xf, param);
+
+ do_lds_fetch_values(ctx, ctx->tess_io_info_precalc, 1, dreg, 0xf, param);
return 0;
}

@@ -3293,11 +3373,13 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
if (ctx.type == PIPE_SHADER_TESS_CTRL) {
ctx.tess_input_info = ctx.bc->ar_reg + 3;
ctx.tess_output_info = ctx.bc->ar_reg + 4;
- ctx.temp_reg = ctx.bc->ar_reg + 5;
+ ctx.tess_io_info_precalc = ctx.bc->ar_reg + 5;
+ ctx.temp_reg = ctx.bc->ar_reg + 6;
} else if (ctx.type == PIPE_SHADER_TESS_EVAL) {
ctx.tess_input_info = 0;
ctx.tess_output_info = ctx.bc->ar_reg + 3;
- ctx.temp_reg = ctx.bc->ar_reg + 4;
+ ctx.tess_io_info_precalc = ctx.bc->ar_reg + 4;
+ ctx.temp_reg = ctx.bc->ar_reg + 5;
} else if (ctx.type == PIPE_SHADER_GEOMETRY) {
ctx.gs_export_gpr_tregs[0] = ctx.bc->ar_reg + 3;
ctx.gs_export_gpr_tregs[1] = ctx.bc->ar_reg + 4;
@@ -3316,18 +3398,27 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
ctx.temp_reg = ctx.bc->ar_reg + 3;
}

- if (lds_inputs) {
+ ctx.tess_input_cache.uses_lds_io = 0;
+ if (lds_inputs || lds_outputs) {
tgsi_parse_init(&ctx.parse, tokens);
+
while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
tgsi_parse_token(&ctx.parse);
-
- if (ctx.parse.FullToken.Token.Type != TGSI_TOKEN_TYPE_INSTRUCTION)
- continue;
-
- count_tess_inputs(&ctx);
+ if (ctx.parse.FullToken.Token.Type == TGSI_TOKEN_TYPE_INSTRUCTION)
+ count_tess_inputs(&ctx);
+ else if (ctx.parse.FullToken.Token.Type == TGSI_TOKEN_TYPE_DECLARATION) {
+ struct tgsi_full_declaration *d = &ctx.parse.FullToken.FullDeclaration;
+ if (d->Declaration.File == TGSI_FILE_SYSTEM_VALUE &&
+ (d->Semantic.Name == TGSI_SEMANTIC_TESSINNER ||
+ d->Semantic.Name == TGSI_SEMANTIC_TESSOUTER))
+ ctx.tess_input_cache.uses_lds_io = 1;
+
+ }
}
ctx.temp_reg += tess_input_cache_count_multiused(&ctx.tess_input_cache, ctx.temp_reg);
tgsi_parse_init(&ctx.parse, tokens);
+ } else {
+
}

shader->max_arrays = 0;
--
2.13.6
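The four precalculated channels described by the comments in r600_fetch_tess_io_info above can be written out as plain arithmetic. This is a sketch under stated assumptions: `muladd_uint24` is a stand-in that assumes MULADD_UINT24 multiplies the low 24 bits of its first two operands and adds the third, and the struct and function names are hypothetical. `r0_y` and `r0_z` simply name the R0.y and R0.z inputs used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tess_precalc {
	uint32_t x, y, z, w;
};

/* Stand-in for ALU_OP3_MULADD_UINT24 (assumed semantics): multiply the
 * low 24 bits of a and b, then add c, wrapping at 32 bits. */
static uint32_t muladd_uint24(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & 0xffffffu) * (b & 0xffffffu) + c;
}

/* out[] models tess_output_info.xyzw; in[] models tess_input_info.xyzw
 * and is NULL when there is no input info (the TES case in the patch). */
static struct tess_precalc precalc_offsets(const uint32_t out[4],
					   const uint32_t *in,
					   uint32_t r0_y, uint32_t r0_z)
{
	struct tess_precalc p;

	assert(out != NULL);
	p.x = muladd_uint24(out[0], r0_y, out[2]);
	p.y = muladd_uint24(out[0], r0_y, out[3]);
	p.z = muladd_uint24(out[0], r0_z, out[2]);
	if (in != NULL)
		p.w = muladd_uint24(in[0], r0_y, 0); /* TCS: plain multiply */
	else
		p.w = muladd_uint24(out[0], r0_z, out[3]);
	return p;
}
```

Each fetch or store then picks one of these channels as its base offset instead of recomputing the multiply-add per access, which is the saving the commit message describes.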
Dave Airlie
2017-12-08 06:30:06 UTC
Post by Gert Wollny
Dear all,
since on r600 the tessellation shaders don't go through the sb-optimizer I
thought it might help to improve performance by applying some optimizations
to the created assembly. The patches are experimental but to a point where
I think some input from you could be helpful.
- pre-calculate and re-use address offsets that were always calculated
on the fly
- only load from LDS what is really requested (based on the source swizzle masks
of the input values).
- preload all used elements in cases where the shader would only partially load
data in different places.
At this point there are no piglit regressions, but an unrelated GPU lockup is
triggered. (Dave and I are already testing patches for this).
So I haven't committed these yet, because I wanted to see if I could
get sb to work.

https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip

is my non-functional attempt so far, but it GPU hangs on the nop shader.

I'm away for a week, so I might try and look at it again after that.

Dave.
Gert Wollny
2017-12-11 12:49:14 UTC
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS reads/writes should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).


However, gcm moves the LDS_OQ* loads around, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.

I tried to persuade the optimizer to not reorder these move
instructions by adding a "use" to the dst-value of a node that reads
from a LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.

Best,
Gert
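The ordering problem described above can be illustrated with a toy model. The sketch assumes the LDS output queue behaves as a strict FIFO: each LDS_READ_RET pushes one result, and each read from LDS_OQ pops the oldest one. The queue type and function names are hypothetical and only model the constraint, not the hardware or the sb scheduler.

```c
#include <assert.h>

#define OQ_DEPTH 8

/* Toy FIFO standing in for one LDS output queue. */
struct lds_oq {
	int data[OQ_DEPTH];
	int head;
	int count;
};

/* Models the result of an LDS_READ_RET arriving in the queue. */
static void oq_push(struct lds_oq *q, int value)
{
	assert(q->count < OQ_DEPTH);
	q->data[(q->head + q->count) % OQ_DEPTH] = value;
	q->count++;
}

/* Models an instruction that reads its source from the queue. */
static int oq_pop(struct lds_oq *q)
{
	int value;

	assert(q->count > 0);
	value = q->data[q->head];
	q->head = (q->head + 1) % OQ_DEPTH;
	q->count--;
	return value;
}
```

If the reads are issued for values A then B, the first pop always yields A. A scheduler that moves the consumer of B ahead of the consumer of A without also swapping the reads therefore hands A's value to B's consumer, which matches the failure mode described for gcm.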
Dave Airlie
2017-12-29 06:38:23 UTC
Post by Gert Wollny
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS reads/writes should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).
However, gcm moves the LDS_OQ* loads around, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.
I tried to persuade the optimizer to not reorder these move
instructions by adding a "use" to the dst-value of a node that reads
from a LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.
I thought I'd done this already, I must dig a bit more.

I've pushed more stuff to the branch, nop still doesn't work.

I've included your patch in one of the squashes, I think we should be
pretty close.

Dave.
Dave Airlie
2017-12-29 07:18:43 UTC
Post by Dave Airlie
Post by Gert Wollny
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS reads/writes should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).
However, gcm moves the LDS_OQ* loads around, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.
I tried to persuade the optimizer to not reorder these move
instructions by adding a "use" to the dst-value of a node that reads
from a LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.
I thought I'd done this already, I must dig a bit more.
I've pushed more stuff to the branch, nop still doesn't work.
I've included your patch in one of the squashes, I think we should be
pretty close.
I think the top patch in my tree fixes the LDS reordering, but nop still
doesn't work, which is annoying. Maybe you can spot the problem; I've been
staring at it too long.

Dave.
