[RFC PATCH 0/6] r600: speed up tesselation shaders
Gert Wollny
2017-11-15 09:29:10 UTC
Dear all,

since on r600 the tessellation shaders don't go through the sb optimizer, I
thought it might help performance to apply some optimizations to the
generated assembly. The patches are experimental, but they are at a point
where I think some input from you would be helpful.

This patch series does the following optimizations:
- pre-calculate and re-use address offsets that were previously recalculated
on the fly for every access
- only load from LDS what is really requested (based on the source swizzle masks
of the input values)
- preload all used elements in cases where the shader would otherwise load
parts of the same data in different places
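
As a rough illustration of the second point, here is a minimal standalone sketch (not driver code) of how a source swizzle translates into the set of components that need to be fetched. The function names and the plain `unsigned` swizzle arguments are assumptions made for this example; the series itself works on the swizzle fields of the TGSI source register.

```c
#include <assert.h>

/* Each argument selects which component (0..3 = x..w) is read for one
 * output channel, e.g. the swizzle .xxzz is (0, 0, 2, 2).
 * Hypothetical helper, written only for this illustration. */
static unsigned swizzle_read_mask(unsigned sx, unsigned sy,
                                  unsigned sz, unsigned sw)
{
    assert(sx < 4 && sy < 4 && sz < 4 && sw < 4);
    return (1u << sx) | (1u << sy) | (1u << sz) | (1u << sw);
}

/* Number of components that actually have to be loaded for a mask. */
static unsigned loads_needed(unsigned mask)
{
    unsigned chan, n = 0;

    for (chan = 0; chan < 4; chan++)
        if (mask & (1u << chan))
            n++;
    return n;
}
```

With a swizzle of .xxzz only x and z are referenced, so two loads are enough instead of four.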

At this point there are no piglit regressions, but an unrelated GPU lockup is
triggered. (Dave and I are already testing patches for this.)

Benchmarking on BARTS with Unigine-Heaven and Tessmark x32 (with
MESA_GL_VERSION_OVERRIDE=4.0) I get the following improvements:

                  pre-opt    post-opt
master: fb0e9b5197
===============================================
Heaven

Res:  1280x1024
Q:    High
Tess: Normal
-----------------------------------------------
Time:             260.2      260.2
Frames:           3276       4192

FPS:              12.6       16.1
Min FPS:          4.0        4.6
Max FPS:          60.9       69.0

Score:            317.2      405.8
-----------------------------------------------

Tessmark x32
R: 1024x640
-----------------------------------------------
Points:           635        700
FPS:              10         11
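
To put the numbers above into relation, the relative gain can be computed as (after - before) / before. The helper below is only a small sketch for this arithmetic; its name is made up for the example.

```c
#include <assert.h>

/* Relative gain in percent when a metric goes from `before` to `after`.
 * Hypothetical helper for the benchmark figures quoted above. */
static double gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the figures above this gives roughly 28% for Heaven (both average FPS, 12.6 to 16.1, and score, 317.2 to 405.8) and roughly 10% for the Tessmark points (635 to 700).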

A GitHub repo including these patches can be found at

https://github.com/gerddie/mesa/tree/r600-tess-speedup

many thanks for any comments,
Gert

Gert Wollny (6):
r600:shader: Fix all warnings issed with "-Wall -Wextra"
r600_shader: only load from LDS what is really used
r600_shader.c: Add a caching structure for load tesselation data
r600_shader: Move calculation of offset to do_lds_fetch_values
r600_shader.c: Pre-caclculate some offsets for LDS access
r600_shader.c: Preload some LDS values.

src/gallium/drivers/r600/r600_shader.c | 636 ++++++++++++++++++++++++---------
1 file changed, 476 insertions(+), 160 deletions(-)
--
2.13.6
Gert Wollny
2017-11-15 09:29:11 UTC
- fix a number of -Wsign-compare warnings
- fix two warnings for -Woverride-init: because TGSI_OPCODE_CEIL == 83,
the corresponding table entry was initialized twice.
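
For readers not familiar with the two warning classes, here is a minimal standalone sketch (not taken from the driver; all names are made up for the example). With GCC, both -Wsign-compare (for C) and -Woverride-init are enabled by -Wextra.

```c
#include <assert.h>

enum { OP_CEIL = 83, OP_COUNT = 84 };   /* hypothetical opcode numbers */

static const int op_table[OP_COUNT] = {
    [82] = 1,
    /* Adding "[83] = 1," here as well would initialize slot 83 twice,
     * because OP_CEIL == 83; that is what -Woverride-init reports. */
    [OP_CEIL] = 2,
};

/* The loop index has the same signedness as the bound, so the comparison
 * "i < n" does not mix signed and unsigned operands (-Wsign-compare). */
static unsigned count_matches(const int *vals, unsigned n, int wanted)
{
    unsigned i, hits = 0;

    for (i = 0; i < n; i++)
        if (vals[i] == wanted)
            hits++;
    return hits;
}
```

The patch follows the same two patterns: loop counters take the signedness of the value they are compared against, and the duplicate `[83]` entries are dropped from the opcode tables.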

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 67 ++++++++++++++++++----------------
1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 625537b48b..a2dc08c596 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -289,7 +289,7 @@ error:
return r;
}

-void r600_pipe_shader_destroy(struct pipe_context *ctx, struct r600_pipe_shader *shader)
+void r600_pipe_shader_destroy(struct pipe_context *ctx UNUSED, struct r600_pipe_shader *shader)
{
r600_resource_reference(&shader->bo, NULL);
r600_bytecode_clear(&shader->shader.bc);
@@ -1094,7 +1094,8 @@ static int allocate_system_value_inputs(struct r600_shader_ctx *ctx, int gpr_off

{ false, &ctx->fixed_pt_position_gpr, TGSI_SEMANTIC_SAMPLEID, TGSI_SEMANTIC_SAMPLEPOS } /* SAMPLEID is in Fixed Point Position GPR.w */
};
- int i, k, num_regs = 0;
+ int num_regs = 0;
+ unsigned k, i;

if (tgsi_parse_init(&parse, ctx->tokens) != TGSI_PARSE_OK) {
return 0;
@@ -1997,11 +1998,12 @@ static int process_twoside_color_inputs(struct r600_shader_ctx *ctx)
}

static int emit_streamout(struct r600_shader_ctx *ctx, struct pipe_stream_output_info *so,
- int stream, unsigned *stream_item_size)
+ int stream, unsigned *stream_item_size UNUSED)
{
unsigned so_gpr[PIPE_MAX_SHADER_OUTPUTS];
unsigned start_comp[PIPE_MAX_SHADER_OUTPUTS];
- int i, j, r;
+ int j, r;
+ unsigned i;

/* Sanity checking. */
if (so->num_outputs > PIPE_MAX_SO_OUTPUTS) {
@@ -2153,13 +2155,14 @@ static int generate_gs_copy_shader(struct r600_context *rctx,
struct r600_shader_ctx ctx = {};
struct r600_shader *gs_shader = &gs->shader;
struct r600_pipe_shader *cshader;
- int ocnt = gs_shader->noutput;
+ unsigned ocnt = gs_shader->noutput;
struct r600_bytecode_alu alu;
struct r600_bytecode_vtx vtx;
struct r600_bytecode_output output;
struct r600_bytecode_cf *cf_jump, *cf_pop,
*last_exp_pos = NULL, *last_exp_param = NULL;
- int i, j, next_clip_pos = 61, next_param = 0;
+ int next_clip_pos = 61, next_param = 0;
+ unsigned i, j;
int ring;
bool only_ring_0 = true;
cshader = calloc(1, sizeof(struct r600_pipe_shader));
@@ -2475,10 +2478,11 @@ static int emit_inc_ring_offset(struct r600_shader_ctx *ctx, int idx, bool ind)
return 0;
}

-static int emit_gs_ring_writes(struct r600_shader_ctx *ctx, const struct pipe_stream_output_info *so, int stream, bool ind)
+static int emit_gs_ring_writes(struct r600_shader_ctx *ctx, const struct pipe_stream_output_info *so UNUSED, int stream, bool ind)
{
struct r600_bytecode_output output;
- int i, k, ring_offset;
+ int ring_offset;
+ unsigned i, k;
int effective_stream = stream == -1 ? 0 : stream;
int idx = 0;

@@ -2619,8 +2623,9 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)

static int emit_lds_vs_writes(struct r600_shader_ctx *ctx)
{
- int i, j, r;
+ int j, r;
int temp_reg;
+ unsigned i;

/* fetch tcs input values into input_vals */
ctx->tess_input_info = r600_get_temp(ctx);
@@ -2793,10 +2798,10 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,

static int r600_emit_tess_factor(struct r600_shader_ctx *ctx)
{
- unsigned i;
int stride, outer_comps, inner_comps;
int tessinner_idx = -1, tessouter_idx = -1;
- int r;
+ int i, r;
+ unsigned j;
int temp_reg = r600_get_temp(ctx);
int treg[3] = {-1, -1, -1};
struct r600_bytecode_alu alu;
@@ -2843,11 +2848,11 @@ static int r600_emit_tess_factor(struct r600_shader_ctx *ctx)

/* R0 is InvocationID, RelPatchID, PatchID, tf_base */
/* TF_WRITE takes index in R.x, value in R.y */
- for (i = 0; i < ctx->shader->noutput; i++) {
- if (ctx->shader->output[i].name == TGSI_SEMANTIC_TESSINNER)
- tessinner_idx = i;
- if (ctx->shader->output[i].name == TGSI_SEMANTIC_TESSOUTER)
- tessouter_idx = i;
+ for (j = 0; j < ctx->shader->noutput; j++) {
+ if (ctx->shader->output[j].name == TGSI_SEMANTIC_TESSINNER)
+ tessinner_idx = j;
+ if (ctx->shader->output[j].name == TGSI_SEMANTIC_TESSOUTER)
+ tessouter_idx = j;
}

if (tessouter_idx == -1)
@@ -2948,7 +2953,8 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
struct r600_bytecode_output output[ARRAY_SIZE(shader->output)];
unsigned output_done, noutput;
unsigned opcode;
- int i, j, k, r = 0;
+ int j, k, r = 0;
+ unsigned i;
int next_param_base = 0, next_clip_base;
int max_color_exports = MAX2(key.ps.nr_cbufs, 1);
bool indirect_gprs;
@@ -3638,7 +3644,7 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
goto out_err;
}

- if (output[j].type==-1) {
+ if ((int)output[j].type==-1) {
output[j].type = V_SQ_CF_ALLOC_EXPORT_WORD0_SQ_EXPORT_PARAM;
output[j].array_base = next_param_base++;
}
@@ -3696,10 +3702,10 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
noutput = j;

/* set export done on last export of each type */
- for (i = noutput - 1, output_done = 0; i >= 0; i--) {
- if (!(output_done & (1 << output[i].type))) {
- output_done |= (1 << output[i].type);
- output[i].op = CF_OP_EXPORT_DONE;
+ for (k = noutput - 1, output_done = 0; k >= 0; k--) {
+ if (!(output_done & (1 << output[k].type))) {
+ output_done |= (1 << output[k].type);
+ output[k].op = CF_OP_EXPORT_DONE;
}
}
/* add output to bytecode */
@@ -3759,7 +3765,7 @@ static int tgsi_unsupported(struct r600_shader_ctx *ctx)
return -EINVAL;
}

-static int tgsi_end(struct r600_shader_ctx *ctx)
+static int tgsi_end(struct r600_shader_ctx *ctx UNUSED)
{
return 0;
}
@@ -7645,7 +7651,7 @@ static int tgsi_tex(struct r600_shader_ctx *ctx)
static int find_hw_atomic_counter(struct r600_shader_ctx *ctx,
struct tgsi_full_src_register *src)
{
- int i;
+ unsigned i;

if (src->Register.Indirect) {
for (i = 0; i < ctx->shader->nhwatomic_ranges; i++) {
@@ -7655,7 +7661,7 @@ static int find_hw_atomic_counter(struct r600_shader_ctx *ctx,
} else {
uint32_t index = src->Register.Index;
for (i = 0; i < ctx->shader->nhwatomic_ranges; i++) {
- if (ctx->shader->atomics[i].buffer_id != src->Dimension.Index)
+ if (ctx->shader->atomics[i].buffer_id != (unsigned)src->Dimension.Index)
continue;
if (index > ctx->shader->atomics[i].end)
continue;
@@ -7821,7 +7827,7 @@ static int tgsi_lrp(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
struct r600_bytecode_alu alu;
- int lasti = tgsi_last_instruction(inst->Dst[0].Register.WriteMask);
+ unsigned lasti = tgsi_last_instruction(inst->Dst[0].Register.WriteMask);
unsigned i, temp_regs[2];
int r;

@@ -8616,7 +8622,8 @@ static inline void callstack_update_max_depth(struct r600_shader_ctx *ctx,
unsigned reason)
{
struct r600_stack_info *stack = &ctx->bc->stack;
- unsigned elements, entries;
+ unsigned elements;
+ int entries;

unsigned entry_size = stack->entry_size;

@@ -8871,7 +8878,7 @@ static int tgsi_bgnloop(struct r600_shader_ctx *ctx)

static int tgsi_endloop(struct r600_shader_ctx *ctx)
{
- unsigned i;
+ int i;

r600_bytecode_add_cfinst(ctx->bc, CF_OP_LOOP_END);

@@ -9443,8 +9450,7 @@ static const struct r600_shader_tgsi_instruction eg_shader_tgsi_instruction[] =
[TGSI_OPCODE_ENDIF] = { ALU_OP0_NOP, tgsi_endif},
[TGSI_OPCODE_DDX_FINE] = { FETCH_OP_GET_GRADIENTS_H, tgsi_tex},
[TGSI_OPCODE_DDY_FINE] = { FETCH_OP_GET_GRADIENTS_V, tgsi_tex},
- [82] = { ALU_OP0_NOP, tgsi_unsupported},
- [83] = { ALU_OP0_NOP, tgsi_unsupported},
+ [82] = { ALU_OP0_NOP, tgsi_unsupported},
[TGSI_OPCODE_CEIL] = { ALU_OP1_CEIL, tgsi_op2},
[TGSI_OPCODE_I2F] = { ALU_OP1_INT_TO_FLT, tgsi_op2_trans},
[TGSI_OPCODE_NOT] = { ALU_OP1_NOT_INT, tgsi_op2},
@@ -9667,7 +9673,6 @@ static const struct r600_shader_tgsi_instruction cm_shader_tgsi_instruction[] =
[TGSI_OPCODE_DDX_FINE] = { FETCH_OP_GET_GRADIENTS_H, tgsi_tex},
[TGSI_OPCODE_DDY_FINE] = { FETCH_OP_GET_GRADIENTS_V, tgsi_tex},
[82] = { ALU_OP0_NOP, tgsi_unsupported},
- [83] = { ALU_OP0_NOP, tgsi_unsupported},
[TGSI_OPCODE_CEIL] = { ALU_OP1_CEIL, tgsi_op2},
[TGSI_OPCODE_I2F] = { ALU_OP1_INT_TO_FLT, tgsi_op2},
[TGSI_OPCODE_NOT] = { ALU_OP1_NOT_INT, tgsi_op2},
--
2.13.6
Gert Wollny
2017-11-15 09:29:13 UTC
Cache values that are loaded more than once, or where various components
are loaded in separate places. This saves repeated calculation of the offsets.
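
The idea can be sketched outside the driver as a two-pass scheme: a first pass counts how often each value is read, entries read more than once get a fixed register, and the second pass emits a load only on the first access. The code below is a simplified model under that assumption; an `int` key stands in for the TGSI source register, all names are invented for the example, and the handling of outputs that were written before being read is left out.

```c
#include <assert.h>

#define SLOTS 32

struct entry {
    int key;        /* stand-in for the source register being read */
    unsigned uses;  /* how often pass 1 saw this key */
    unsigned mask;  /* union of all components requested */
    int reg;        /* register assigned to the cached value */
    int loaded;     /* set once the load has been emitted */
};

struct cache {
    struct entry e[SLOTS];
    int fill;
};

/* Pass 1: record one read of `key` that needs the components in `mask`. */
static void note_use(struct cache *c, int key, unsigned mask)
{
    int i;

    for (i = 0; i < c->fill; i++) {
        if (c->e[i].key == key) {
            c->e[i].uses++;
            c->e[i].mask |= mask;
            return;
        }
    }
    if (c->fill < SLOTS) {
        struct entry fresh = { key, 1, mask, -1, 0 };
        c->e[c->fill++] = fresh;
    }
}

/* Between the passes: keep only entries read more than once and give
 * each of them its own register, starting at reg_base. */
static int assign_regs(struct cache *c, int reg_base)
{
    int i, cnt = 0;

    for (i = 0; i < c->fill; i++) {
        if (c->e[i].uses > 1) {
            c->e[cnt] = c->e[i];
            c->e[cnt].reg = reg_base + cnt;
            cnt++;
        }
    }
    c->fill = cnt;
    return cnt;
}

/* Pass 2: returns 1 if a load has to be emitted for this read. */
static int needs_load(struct cache *c, int key)
{
    int i;

    for (i = 0; i < c->fill; i++) {
        if (c->e[i].key == key) {
            int first = !c->e[i].loaded;
            c->e[i].loaded = 1;
            return first;
        }
    }
    return 1;   /* not cached: load into a fresh temporary every time */
}

/* Four reads (keys 1, 2, 1, 1); returns the number of loads emitted. */
static int demo_loads_emitted(void)
{
    static const int reads[] = { 1, 2, 1, 1 };
    struct cache c = { .fill = 0 };
    int i, loads = 0;

    for (i = 0; i < 4; i++)
        note_use(&c, reads[i], 1u << i);
    assign_regs(&c, 10);
    for (i = 0; i < 4; i++)
        loads += needs_load(&c, reads[i]);
    return loads;
}
```

In this model the four reads result in two loads: one for the cached key 1 (with the union of all requested components) and one for key 2, which is read only once and therefore not cached.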

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 211 +++++++++++++++++++++++++++++----
1 file changed, 190 insertions(+), 21 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 9fa83189bc..5713eda6b0 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -317,6 +317,20 @@ struct eg_interp {
unsigned ij_index;
};

+struct r600_tess_input_cache_entry {
+ struct tgsi_full_src_register key;
+ unsigned reg: 16;
+ unsigned initialized:1;
+ unsigned read_access:1;
+ unsigned was_written:1;
+ unsigned mask:4;
+};
+
+struct r600_tess_input_cache {
+ struct r600_tess_input_cache_entry data[32];
+ int fill;
+};
+
struct r600_shader_ctx {
struct tgsi_shader_info info;
struct tgsi_parse_context parse;
@@ -353,6 +367,7 @@ struct r600_shader_ctx {
unsigned enabled_stream_buffers_mask;
unsigned tess_input_info; /* temp with tess input offsets */
unsigned tess_output_info; /* temp with tess input offsets */
+ struct r600_tess_input_cache tess_input_cache;
};

struct r600_shader_tgsi_instruction {
@@ -1810,7 +1825,8 @@ static int fetch_mask( struct tgsi_src_register *reg)
return mask;
}

-static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1826,13 +1842,14 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

-static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1852,13 +1869,14 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

-static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
+static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask)
{
int r;
unsigned temp_reg = r600_get_temp(ctx);
@@ -1874,12 +1892,153 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
if (r)
return r;
return 0;
}

+static int tgsi_full_src_register_equal_for_cache(struct tgsi_full_src_register *lhs,
+ struct tgsi_full_src_register *rhs)
+{
+ if (lhs->Register.Index != rhs->Register.Index)
+ return 0;
+
+ if (lhs->Register.File != rhs->Register.File)
+ return 0;
+ if (lhs->Register.Indirect || rhs->Register.Indirect)
+ return 0;
+
+ if (lhs->Register.Dimension) {
+ if (!rhs->Register.Dimension ||
+ (rhs->Dimension.Index != lhs->Dimension.Index) ||
+ (rhs->Dimension.Dimension != lhs->Dimension.Dimension))
+ return 0;
+
+ if (lhs->Dimension.Indirect || rhs->Dimension.Indirect)
+ return 0;
+ } else if (rhs->Register.Dimension)
+ return 0;
+
+ return 1;
+}
+
+static void tess_input_cache_store(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ if (cache->fill < 32) {
+ memcpy(&cache->data[cache->fill].key, src, sizeof(struct tgsi_full_src_register));
+ cache->data[cache->fill].mask = fetch_mask(&src->Register);
+ cache->data[cache->fill].reg = 0;
+ cache->data[cache->fill].was_written = src->Register.File == TGSI_FILE_OUTPUT;
+ ++cache->fill;
+ }
+}
+
+static void tess_input_cache_check(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ int i;
+ for (i = 0; i < cache->fill; ++i) {
+ /* indirect loads can come from anywhere, no use caching them */
+ if (src->Register.Indirect || src->Dimension.Indirect)
+ return;
+
+ if (tgsi_full_src_register_equal_for_cache(src, &cache->data[i].key)) {
+ cache->data[i].mask |= fetch_mask(&src->Register);
+ cache->data[i].read_access = src->Register.File == TGSI_FILE_INPUT;
+ if (!cache->data[i].was_written) {
+ ++cache->data[i].reg;
+ cache->data[i].was_written = src->Register.File == TGSI_FILE_OUTPUT;
+ } else {
+ /* FIXME: If the entry was written before reading it, we can not cache it,
+ * instead we could store the address to speed up access, or keep the written
+ * value. The latter should check whether there is synchronisation within the
+ * work group to ensure that the stored value is not overwritten by another
+ * thread.
+ */
+ cache->data[i].reg = 0;
+ }
+ return;
+ }
+ }
+ tess_input_cache_store(cache, src);
+}
+
+static int tess_input_cache_count_multiused(struct r600_tess_input_cache *cache,
+ unsigned reg_base)
+{
+ int i;
+ int cnt = 0;
+ for (i = 0; i < cache->fill; ++i) {
+ if (cache->data[i].reg > 0 && cache->data[i].read_access) {
+ if (i != cnt)
+ memcpy(&cache->data[cnt], &cache->data[i],
+ sizeof(struct r600_tess_input_cache_entry));
+ cache->data[cnt].reg = reg_base + cnt;
+ cache->data[cnt].initialized = 0;
+ ++cnt;
+ }
+ }
+ cache->fill = cnt;
+ return cnt;
+}
+
+static struct r600_tess_input_cache_entry *
+tess_input_cache_load(struct r600_tess_input_cache *cache,
+ struct tgsi_full_src_register *src)
+{
+ struct r600_tess_input_cache_entry *retval = NULL;
+ int i;
+ for (i = 0; i < cache->fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &cache->data[i];
+ if (tgsi_full_src_register_equal_for_cache(src, &ce->key)) {
+ retval = ce;
+ break;
+ }
+ }
+ return retval;
+}
+
+typedef int (*fetch_tessdata_from_lds)(struct r600_shader_ctx *ctx,
+ struct tgsi_full_src_register *src,
+ unsigned int dst_reg, unsigned mask);
+
+static int r600_load_tess_data(struct r600_shader_ctx *ctx,
+ struct tgsi_full_src_register *src,
+ fetch_tessdata_from_lds fetch_call)
+{
+ int treg;
+ struct r600_tess_input_cache_entry *ce;
+ ce = tess_input_cache_load(&ctx->tess_input_cache, src);
+ if (!ce) {
+ treg = r600_get_temp(ctx);
+ fetch_call(ctx, src, treg, fetch_mask(&src->Register));
+ } else {
+ if (!ce->initialized) {
+ fetch_call(ctx, src, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+ treg = ce->reg;
+ }
+ return treg;
+}
+
+
+static void count_tess_inputs(struct r600_shader_ctx *ctx)
+{
+ struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
+ unsigned i;
+
+ for (i = 0; i < inst->Instruction.NumSrcRegs; i++) {
+ struct tgsi_full_src_register *src = &inst->Src[i];
+ if (((src->Register.File == TGSI_FILE_INPUT) && (ctx->type == PIPE_SHADER_TESS_EVAL)) ||
+ (ctx->type == PIPE_SHADER_TESS_CTRL &&
+ (src->Register.File == TGSI_FILE_INPUT || src->Register.File == TGSI_FILE_OUTPUT)))
+ tess_input_cache_check(&ctx->tess_input_cache, src);
+ }
+}
+
static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
@@ -1889,21 +2048,15 @@ static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
struct tgsi_full_src_register *src = &inst->Src[i];

if (ctx->type == PIPE_SHADER_TESS_EVAL && src->Register.File == TGSI_FILE_INPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tes_input(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tes_input);
ctx->src[i].rel = 0;
}
if (ctx->type == PIPE_SHADER_TESS_CTRL && src->Register.File == TGSI_FILE_INPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tcs_input(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tcs_input);
ctx->src[i].rel = 0;
}
if (ctx->type == PIPE_SHADER_TESS_CTRL && src->Register.File == TGSI_FILE_OUTPUT) {
- int treg = r600_get_temp(ctx);
- fetch_tcs_output(ctx, src, treg);
- ctx->src[i].sel = treg;
+ ctx->src[i].sel = r600_load_tess_data(ctx, src, fetch_tcs_output);
ctx->src[i].rel = 0;
}
}
@@ -2982,6 +3135,8 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
bool lds_inputs = false;
bool pos_emitted = false;

+ ctx.tess_input_cache.fill = 0;
+
ctx.bc = &shader->bc;
ctx.shader = shader;
ctx.native_integers = true;
@@ -3162,21 +3317,35 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
ctx.temp_reg = ctx.bc->ar_reg + 3;
}

+ if (lds_inputs) {
+ tgsi_parse_init(&ctx.parse, tokens);
+ while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
+ tgsi_parse_token(&ctx.parse);
+
+ if (ctx.parse.FullToken.Token.Type != TGSI_TOKEN_TYPE_INSTRUCTION)
+ continue;
+
+ count_tess_inputs(&ctx);
+ }
+ ctx.temp_reg += tess_input_cache_count_multiused(&ctx.tess_input_cache, ctx.temp_reg);
+ tgsi_parse_init(&ctx.parse, tokens);
+ }
+
shader->max_arrays = 0;
shader->num_arrays = 0;
if (indirect_gprs) {

if (ctx.info.indirect_files & (1 << TGSI_FILE_INPUT)) {
r600_add_gpr_array(shader, ctx.file_offset[TGSI_FILE_INPUT],
- ctx.file_offset[TGSI_FILE_OUTPUT] -
- ctx.file_offset[TGSI_FILE_INPUT],
- 0x0F);
+ ctx.file_offset[TGSI_FILE_OUTPUT] -
+ ctx.file_offset[TGSI_FILE_INPUT],
+ 0x0F);
}
if (ctx.info.indirect_files & (1 << TGSI_FILE_OUTPUT)) {
r600_add_gpr_array(shader, ctx.file_offset[TGSI_FILE_OUTPUT],
- ctx.file_offset[TGSI_FILE_TEMPORARY] -
- ctx.file_offset[TGSI_FILE_OUTPUT],
- 0x0F);
+ ctx.file_offset[TGSI_FILE_TEMPORARY] -
+ ctx.file_offset[TGSI_FILE_OUTPUT],
+ 0x0F);
}
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:12 UTC
Use the source swizzle to determine which components really have to be
read from LDS and load only these.

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index a2dc08c596..9fa83189bc 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -377,7 +377,7 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg);
+ unsigned dst_reg, unsigned mask);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1025,7 +1025,7 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
if (r)
return r;

- do_lds_fetch_values(ctx, temp_reg, dreg);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xF);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1743,14 +1743,18 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
}

static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg)
+ unsigned dst_reg, unsigned mask)
{
struct r600_bytecode_alu alu;
int r, i;

if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;
+
for (i = 1; i < 4; i++) {
+ if (!(mask & (1 << i)))
+ continue;
+
r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
temp_reg, i,
temp_reg, 0,
@@ -1759,6 +1763,9 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return r;
}
for (i = 0; i < 4; i++) {
+ if (! (mask & (1 << i)))
+ continue;
+
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
@@ -1774,6 +1781,8 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return r;
}
for (i = 0; i < 4; i++) {
+ if (! (mask & (1 << i)))
+ continue;
/* then read from LDS_OQ_A_POP */
memset(&alu, 0, sizeof(alu));

@@ -1791,6 +1800,16 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
return 0;
}

+static int fetch_mask( struct tgsi_src_register *reg)
+{
+ int mask = 0;
+ mask |= 1 << reg->SwizzleX;
+ mask |= 1 << reg->SwizzleY;
+ mask |= 1 << reg->SwizzleZ;
+ mask |= 1 << reg->SwizzleW;
+ return mask;
+}
+
static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src, unsigned int dst_reg)
{
int r;
@@ -1807,7 +1826,7 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -1833,7 +1852,7 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -1855,7 +1874,7 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, fetch_mask(&src->Register));
if (r)
return r;
return 0;
@@ -2792,7 +2811,7 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
if (r)
return r;

- do_lds_fetch_values(ctx, temp_reg, dreg);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xf);
return 0;
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:14 UTC
Instead of creating code that first calculates a base offset and then
calculates the component offsets from it, calculate the final offset
for all components directly. This saves one instruction group.
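
The arithmetic behind this can be shown with a small standalone sketch (not driver code; the function names are made up for the example). It assumes, as in the diff below, that each LDS parameter slot takes 16 bytes and each component 4 bytes.

```c
#include <assert.h>

/* Byte offset of component `chan` (0..3 = x..w) of parameter `param`,
 * relative to the base address. */
static unsigned lds_component_offset(unsigned param, unsigned chan)
{
    assert(chan < 4);
    return 16u * param + 4u * chan;
}

/* Old scheme: two dependent additions, i.e. two ALU instruction groups. */
static unsigned address_two_step(unsigned base, unsigned param, unsigned chan)
{
    unsigned param_base = base + 16u * param;   /* first group */

    return param_base + 4u * chan;              /* second group */
}

/* New scheme: one addition with a literal that is known at compile time. */
static unsigned address_one_step(unsigned base, unsigned param, unsigned chan)
{
    return base + lds_component_offset(param, chan);
}
```

Both variants yield the same address; the second only needs a single addition per component because `16 * param + 4 * chan` is a constant when the shader is compiled.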

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 113 ++++++++++++++++-----------------
1 file changed, 56 insertions(+), 57 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 5713eda6b0..873b525449 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -392,7 +392,7 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask);
+ unsigned dst_reg, unsigned mask, int param);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1033,14 +1033,7 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
if (r)
return r;

- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xF);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xF, param);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1658,12 +1651,11 @@ static int tgsi_split_gs_inputs(struct r600_shader_ctx *ctx)
static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
const struct tgsi_full_dst_register *dst,
const struct tgsi_full_src_register *src,
- int stride_bytes_reg, int stride_bytes_chan)
+ int stride_bytes_reg, int stride_bytes_chan, int *param)
{
struct tgsi_full_dst_register reg;
ubyte *name, *index, *array_first;
int r;
- int param;
struct tgsi_shader_info *info = &ctx->info;
/* Set the register description. The address computation is the same
* for sources and destinations. */
@@ -1736,51 +1728,54 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
if (r)
return r;

- param = r600_get_lds_unique_index(name[first],
+ *param = r600_get_lds_unique_index(name[first],
index[first]);

} else {
- param = r600_get_lds_unique_index(name[reg.Register.Index],
+ *param = r600_get_lds_unique_index(name[reg.Register.Index],
index[reg.Register.Index]);
}

- /* add to base_addr - passed in temp_reg.x */
- if (param) {
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- }
return 0;
}

static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask)
+ unsigned dst_reg, unsigned mask, int param)
{
struct r600_bytecode_alu alu;
int r, i;
+ int lasti = tgsi_last_instruction(mask);
+ int firsti = param > 0 ? 0 : 1;

if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;
-
- for (i = 1; i < 4; i++) {
+
+ /* Add the offsets to the base address */
+ for (i = firsti; i <= lasti; i++) {
if (!(mask & (1 << i)))
continue;
-
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, i,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, 4 * i);
+
+ memset(&alu, 0, sizeof(struct r600_bytecode_alu));
+ alu.dst.sel = temp_reg;
+ alu.dst.chan = i;
+ alu.dst.write = 1;
+ alu.op = ALU_OP2_ADD_INT;
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = 0;
+ alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
+ alu.src[1].value = 4 * i + 16 * param;
+
+ if (i == lasti)
+ alu.last = 1;
+
+ r = r600_bytecode_add_alu(ctx->bc, &alu);
if (r)
return r;
}
+
for (i = 0; i < 4; i++) {
if (! (mask & (1 << i)))
continue;
-
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
@@ -1828,7 +1823,7 @@ static int fetch_mask( struct tgsi_src_register *reg)
static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r, param;
unsigned temp_reg = r600_get_temp(ctx);

r = get_lds_offset0(ctx, 2, temp_reg,
@@ -1838,11 +1833,11 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- NULL, src, ctx->tess_output_info, 1);
+ NULL, src, ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1851,7 +1846,7 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r,param;
unsigned temp_reg = r600_get_temp(ctx);

/* t.x = ips * r0.y */
@@ -1865,11 +1860,11 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- NULL, src, ctx->tess_input_info, 1);
+ NULL, src, ctx->tess_input_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1878,7 +1873,7 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_register *src,
unsigned int dst_reg, unsigned mask)
{
- int r;
+ int r, param;
unsigned temp_reg = r600_get_temp(ctx);

r = get_lds_offset0(ctx, 1, temp_reg,
@@ -1888,11 +1883,11 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
NULL, src,
- ctx->tess_output_info, 1);
+ ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask);
+ r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -2831,8 +2826,8 @@ static int emit_lds_vs_writes(struct r600_shader_ctx *ctx)

r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
temp_reg, 2,
- temp_reg, param ? 1 : 0,
- V_SQ_ALU_SRC_LITERAL, 8);
+ temp_reg, 0,
+ V_SQ_ALU_SRC_LITERAL, 8 + param * 16);
if (r)
return r;

@@ -2867,6 +2862,7 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
int temp_reg = r600_get_temp(ctx);
struct r600_bytecode_alu alu;
unsigned write_mask = dst->Register.WriteMask;
+ int param;

if (inst->Dst[0].Register.File != TGSI_FILE_OUTPUT)
return 0;
@@ -2877,20 +2873,30 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)

/* the base address is now in temp.x */
r = r600_get_byte_address(ctx, temp_reg,
- &inst->Dst[0], NULL, ctx->tess_output_info, 1);
+ &inst->Dst[0], NULL, ctx->tess_output_info, 1, &param);
if (r)
return r;

/* LDS write */
lasti = tgsi_last_instruction(write_mask);
- for (i = 1; i <= lasti; i++) {
+ for (i = (param > 0 ? 0: 1); i <= lasti; i++) {

if (!(write_mask & (1 << i)))
continue;
- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, i,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, 4 * i);
+ memset(&alu, 0, sizeof(struct r600_bytecode_alu));
+ alu.dst.sel = temp_reg;
+ alu.dst.chan = i;
+ alu.dst.write = 1;
+ alu.op = ALU_OP2_ADD_INT;
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = 0;
+ alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
+ alu.src[1].value = 4 * i + 16 * param;
+
+ if (i == lasti)
+ alu.last = 1;
+
+ r = r600_bytecode_add_alu(ctx->bc, &alu);
if (r)
return r;
}
@@ -2957,14 +2963,7 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
if (r)
return r;

- r = single_alu_op2(ctx, ALU_OP2_ADD_INT,
- temp_reg, 0,
- temp_reg, 0,
- V_SQ_ALU_SRC_LITERAL, param * 16);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xf);
+ do_lds_fetch_values(ctx, temp_reg, dreg, 0xf, param);
return 0;
}
--
2.13.6
Gert Wollny
2017-11-15 09:29:16 UTC
Pre-load all the LDS values whose range is accessed more than once.

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 163ae75eb5..7c999fbb0b 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -2047,6 +2047,34 @@ static void count_tess_inputs(struct r600_shader_ctx *ctx)
}
}

+static void preload_tes_lds(struct r600_shader_ctx *ctx)
+{
+ int i;
+ ctx->max_driver_temp_used = 0;
+ r600_get_temp(ctx);
+
+ for (i = 0; i < ctx->tess_input_cache.fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &ctx->tess_input_cache.data[i];
+ fetch_tes_input(ctx, &ce->key, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+}
+
+static void preload_tcs_lds(struct r600_shader_ctx *ctx)
+{
+ int i;
+ ctx->max_driver_temp_used = 0;
+ r600_get_temp(ctx);
+ for (i = 0; i < ctx->tess_input_cache.fill; ++i) {
+ struct r600_tess_input_cache_entry *ce = &ctx->tess_input_cache.data[i];
+ if (ce->key.Register.File == TGSI_FILE_INPUT)
+ fetch_tcs_input(ctx, &ce->key, ce->reg, ce->mask);
+ else
+ fetch_tcs_output(ctx, &ce->key, ce->reg, ce->mask);
+ ce->initialized = 1;
+ }
+}
+
static int tgsi_split_lds_inputs(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
@@ -3624,6 +3652,11 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
return r;
}

+ if (ctx.type == PIPE_SHADER_TESS_EVAL)
+ preload_tes_lds(&ctx);
+ else if (ctx.type == PIPE_SHADER_TESS_CTRL)
+ preload_tcs_lds(&ctx);
+
tgsi_parse_init(&ctx.parse, tokens);
while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
tgsi_parse_token(&ctx.parse);
--
2.13.6
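
The idea behind the preload can be illustrated with a simplified cache. This is only a sketch under assumptions: the integer key, the uses counter and all function names are hypothetical stand-ins for the TGSI register key and the helpers in the series, and restricting the preload to entries read more than once follows the commit message rather than the loop shown in the patch.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE 32

/* Simplified stand-in for r600_tess_input_cache_entry. The integer key
 * replaces the TGSI source register used as key in the patch, and the
 * "uses" counter is an assumption of this sketch. */
struct cache_entry {
    int key;          /* identifies the input or output being read */
    unsigned mask;    /* union of all channels read anywhere in the shader */
    unsigned uses;    /* number of reads seen during the counting pass */
    int initialized;  /* set once the value has been preloaded */
};

struct cache {
    struct cache_entry data[CACHE_SIZE];
    int fill;
};

static void cache_init(struct cache *c)
{
    memset(c, 0, sizeof(*c));
}

/* Counting pass: record one read of `key` that uses the channels in `mask`. */
static void cache_note_read(struct cache *c, int key, unsigned mask)
{
    int i;

    for (i = 0; i < c->fill; ++i) {
        if (c->data[i].key == key) {
            c->data[i].mask |= mask;
            c->data[i].uses++;
            return;
        }
    }
    assert(c->fill < CACHE_SIZE);
    c->data[c->fill].key = key;
    c->data[c->fill].mask = mask;
    c->data[c->fill].uses = 1;
    c->data[c->fill].initialized = 0;
    c->fill++;
}

/* Preload pass: mark every entry that is read more than once as loaded.
 * A real implementation would emit the LDS fetch for ce->mask here.
 * Returns the number of preloaded entries. */
static int cache_preload(struct cache *c)
{
    int i, n = 0;

    for (i = 0; i < c->fill; ++i) {
        struct cache_entry *ce = &c->data[i];
        if (ce->uses > 1) {
            ce->initialized = 1;
            n++;
        }
    }
    return n;
}
```

Merging the masks means that a value read as .x in one place and as .z in another is fetched once with mask 0x5 instead of twice.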
Gert Wollny
2017-11-15 09:29:15 UTC
Permalink
Some offsets used for LDS access are recalculated repeatedly.
Since tessellation shaders are not optimized by SB, manually pre-evaluate
some of these offsets to speed up this type of shader.

Signed-off-by: Gert Wollny <***@gmail.com>
---
src/gallium/drivers/r600/r600_shader.c | 253 ++++++++++++++++++++++-----------
1 file changed, 172 insertions(+), 81 deletions(-)

diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
index 873b525449..163ae75eb5 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -183,6 +183,7 @@ int r600_pipe_shader_create(struct pipe_context *ctx,
R600_ERR("translation from TGSI failed !\n");
goto error;
}
+
if (shader->shader.processor_type == PIPE_SHADER_VERTEX) {
/* only disable for vertex shaders in tess paths */
if (key.vs.as_ls)
@@ -329,6 +330,7 @@ struct r600_tess_input_cache_entry {
struct r600_tess_input_cache {
struct r600_tess_input_cache_entry data[32];
int fill;
+ int uses_lds_io;
};

struct r600_shader_ctx {
@@ -367,7 +369,8 @@ struct r600_shader_ctx {
unsigned enabled_stream_buffers_mask;
unsigned tess_input_info; /* temp with tess input offsets */
unsigned tess_output_info; /* temp with tess input offsets */
- struct r600_tess_input_cache tess_input_cache;
+ unsigned tess_io_info_precalc; /* temp with precalculated offsets */
+ struct r600_tess_input_cache tess_input_cache;
};

struct r600_shader_tgsi_instruction {
@@ -392,7 +395,8 @@ static void r600_bytecode_src(struct r600_bytecode_alu_src *bc_src,
const struct r600_shader_src *shader_src,
unsigned chan);
static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
- unsigned dst_reg, unsigned mask, int param);
+ unsigned temp_chan, unsigned dst_reg,
+ unsigned mask, int param);

static int tgsi_last_instruction(unsigned writemask)
{
@@ -1027,13 +1031,8 @@ static int tgsi_declaration(struct r600_shader_ctx *ctx)
d->Semantic.Name == TGSI_SEMANTIC_TESSOUTER) {
int param = r600_get_lds_unique_index(d->Semantic.Name, 0);
int dreg = d->Semantic.Name == TGSI_SEMANTIC_TESSINNER ? 3 : 2;
- unsigned temp_reg = r600_get_temp(ctx);
-
- r = get_lds_offset0(ctx, 2, temp_reg, true);
- if (r)
- return r;

- do_lds_fetch_values(ctx, temp_reg, dreg, 0xF, param);
+ do_lds_fetch_values(ctx, ctx->tess_io_info_precalc, 3, dreg, 0xF, param);
}
else if (d->Semantic.Name == TGSI_SEMANTIC_TESSCOORD) {
/* MOV r1.x, r0.x;
@@ -1648,7 +1647,9 @@ static int tgsi_split_gs_inputs(struct r600_shader_ctx *ctx)
* All three shaders VS(LS), TCS, TES share the same LDS space.
*/
/* this will return with the dw address in temp_reg.x */
-static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
+static int r600_get_byte_address(struct r600_shader_ctx *ctx,
+ unsigned *result_reg, unsigned *result_chan,
+ int base_offset_reg, int base_offset_chan,
const struct tgsi_full_dst_register *dst,
const struct tgsi_full_src_register *src,
int stride_bytes_reg, int stride_bytes_chan, int *param)
@@ -1656,7 +1657,11 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
struct tgsi_full_dst_register reg;
ubyte *name, *index, *array_first;
int r;
+ int temp_reg = -1;
struct tgsi_shader_info *info = &ctx->info;
+ *result_reg = base_offset_reg;
+ *result_chan = base_offset_chan;
+
/* Set the register description. The address computation is the same
* for sources and destinations. */
if (src) {
@@ -1686,14 +1691,18 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
sel = V_SQ_ALU_SRC_LITERAL;
chan = reg.Dimension.Index;
}
-
+ temp_reg = r600_get_temp(ctx);
+ *result_reg = temp_reg;
+ *result_chan = 0;
r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
temp_reg, 0,
stride_bytes_reg, stride_bytes_chan,
sel, chan,
- temp_reg, 0);
+ base_offset_reg, base_offset_chan);
if (r)
return r;
+ } else {
+
}

if (reg.Register.File == TGSI_FILE_INPUT) {
@@ -1719,15 +1728,20 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,

addr_reg = get_address_file_reg(ctx, reg.Indirect.Index);

- /* pull the value from index_reg */
+ if (temp_reg < 0)
+ temp_reg = r600_get_temp(ctx);
+
r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
temp_reg, 0,
V_SQ_ALU_SRC_LITERAL, 16,
addr_reg, 0,
- temp_reg, 0);
+ *result_reg, *result_chan);
if (r)
return r;

+ *result_reg = temp_reg;
+ *result_chan = 0;
+
*param = r600_get_lds_unique_index(name[first],
index[first]);

@@ -1739,14 +1753,17 @@ static int r600_get_byte_address(struct r600_shader_ctx *ctx, int temp_reg,
return 0;
}

-static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
+static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned offs_reg,
+ unsigned offs_chan,
unsigned dst_reg, unsigned mask, int param)
+
{
struct r600_bytecode_alu alu;
int r, i;
int lasti = tgsi_last_instruction(mask);
int firsti = param > 0 ? 0 : 1;

+
if ((ctx->bc->cf_last->ndw>>1) >= 0x60)
ctx->bc->force_add_cf = 1;

@@ -1756,12 +1773,12 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
continue;

memset(&alu, 0, sizeof(struct r600_bytecode_alu));
- alu.dst.sel = temp_reg;
+ alu.dst.sel = ctx->temp_reg;
alu.dst.chan = i;
alu.dst.write = 1;
alu.op = ALU_OP2_ADD_INT;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = 0;
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
alu.src[1].value = 4 * i + 16 * param;

@@ -1779,8 +1796,13 @@ static int do_lds_fetch_values(struct r600_shader_ctx *ctx, unsigned temp_reg,
/* emit an LDS_READ_RET */
memset(&alu, 0, sizeof(alu));
alu.op = LDS_OP1_LDS_READ_RET;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+ if (i > 0 || firsti == 0) {
+ alu.src[0].sel = ctx->temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }
alu.src[1].sel = V_SQ_ALU_SRC_0;
alu.src[2].sel = V_SQ_ALU_SRC_0;
alu.dst.chan = 0;
@@ -1824,20 +1846,18 @@ static int fetch_tes_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
unsigned int dst_reg, unsigned mask)
{
int r, param;
- unsigned temp_reg = r600_get_temp(ctx);
-
- r = get_lds_offset0(ctx, 2, temp_reg,
- src->Register.Dimension ? false : true);
- if (r)
- return r;
+ unsigned temp_reg;
+ unsigned temp_chan;

/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc,
+ src->Register.Dimension ? 2:3,
NULL, src, ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1848,23 +1868,16 @@ static int fetch_tcs_input(struct r600_shader_ctx *ctx, struct tgsi_full_src_reg
{
int r,param;
unsigned temp_reg = r600_get_temp(ctx);
-
- /* t.x = ips * r0.y */
- r = single_alu_op2(ctx, ALU_OP2_MUL_UINT24,
- temp_reg, 0,
- ctx->tess_input_info, 0,
- 0, 1);
-
- if (r)
- return r;
+ unsigned temp_chan = 0;

/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc, 3,
NULL, src, ctx->tess_input_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1874,20 +1887,18 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
unsigned int dst_reg, unsigned mask)
{
int r, param;
- unsigned temp_reg = r600_get_temp(ctx);
+ unsigned temp_reg;
+ unsigned temp_chan;

- r = get_lds_offset0(ctx, 1, temp_reg,
- src->Register.Dimension ? false : true);
- if (r)
- return r;
- /* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &temp_reg, &temp_chan,
+ ctx->tess_io_info_precalc,
+ src->Register.Dimension ? 0:1,
NULL, src,
ctx->tess_output_info, 1, &param);
if (r)
return r;

- r = do_lds_fetch_values(ctx, temp_reg, dst_reg, mask, param);
+ r = do_lds_fetch_values(ctx, temp_reg, temp_chan, dst_reg, mask, param);
if (r)
return r;
return 0;
@@ -1896,11 +1907,12 @@ static int fetch_tcs_output(struct r600_shader_ctx *ctx, struct tgsi_full_src_re
static int tgsi_full_src_register_equal_for_cache(struct tgsi_full_src_register *lhs,
struct tgsi_full_src_register *rhs)
{
+ if (lhs->Register.File != rhs->Register.File)
+ return 0;
+
if (lhs->Register.Index != rhs->Register.Index)
return 0;

- if (lhs->Register.File != rhs->Register.File)
-
if (lhs->Register.Indirect || rhs->Register.Indirect)
return 0;

@@ -2028,9 +2040,10 @@ static void count_tess_inputs(struct r600_shader_ctx *ctx)
for (i = 0; i < inst->Instruction.NumSrcRegs; i++) {
struct tgsi_full_src_register *src = &inst->Src[i];
if (((src->Register.File == TGSI_FILE_INPUT) && (ctx->type == PIPE_SHADER_TESS_EVAL)) ||
- (ctx->type == PIPE_SHADER_TESS_CTRL &&
- (src->Register.File == TGSI_FILE_INPUT || src->Register.File == TGSI_FILE_OUTPUT)))
+ (ctx->type == PIPE_SHADER_TESS_CTRL)) {
tess_input_cache_check(&ctx->tess_input_cache, src);
+ ctx->tess_input_cache.uses_lds_io = 1;
+ }
}
}

@@ -2729,7 +2742,7 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
0, 0);
if (r)
return r;
-
+
/* used by VS/TCS */
if (ctx->tess_input_info) {
/* fetch tcs input values into resv space */
@@ -2752,12 +2765,13 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
vtx.dst_sel_w = 3;
vtx.src_gpr = temp_val;
vtx.src_sel_x = 0;
-
+
r = r600_bytecode_add_vtx(ctx->bc, &vtx);
if (r)
return r;
+
}
-
+
/* used by TCS/TES */
if (ctx->tess_output_info) {
/* fetch tcs output values into resv space */
@@ -2784,6 +2798,64 @@ static int r600_fetch_tess_io_info(struct r600_shader_ctx *ctx)
r = r600_bytecode_add_vtx(ctx->bc, &vtx);
if (r)
return r;
+
+ if (ctx->tess_input_cache.uses_lds_io) {
+
+ /* Precalculate some offsets; after this, tess_io_info_precalc
+ * holds the values described in the comments below.
+ */
+
+ /* tess_io_info_precalc.x = tess_output_info.x * R0.y + tess_output_info.z */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 0,
+ ctx->tess_output_info, 0,
+ 0, 1,
+ ctx->tess_output_info, 2);
+ if (r)
+ return r;
+
+ /* tess_io_info_precalc.y = tess_output_info.x * R0.y + tess_output_info.w */
+
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 1,
+ ctx->tess_output_info, 0,
+ 0, 1,
+ ctx->tess_output_info, 3);
+ if (r)
+ return r;
+
+
+ /* tess_io_info_precalc.z = tess_output_info.x * R0.z + tess_output_info.z */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 2,
+ ctx->tess_output_info, 0,
+ 0, 2,
+ ctx->tess_output_info, 2);
+ if (r)
+ return r;
+
+ /* This is a TCS shader */
+ if (ctx->tess_input_info) {
+
+ /* tess_io_info_precalc.w = tess_input_info.x * R0.y */
+ r = single_alu_op2(ctx, ALU_OP2_MUL_UINT24,
+ ctx->tess_io_info_precalc, 3,
+ ctx->tess_input_info, 0,
+ 0, 1);
+ if (r)
+ return r;
+ } else {
+
+ /* tess_io_info_precalc.w = tess_output_info.x * R0.z + tess_output_info.w */
+ r = single_alu_op3(ctx, ALU_OP3_MULADD_UINT24,
+ ctx->tess_io_info_precalc, 3,
+ ctx->tess_output_info, 0,
+ 0, 2,
+ ctx->tess_output_info, 3);
+ if (r)
+ return r;
+ }
+ }
}
return 0;
}
@@ -2858,8 +2930,10 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
{
struct tgsi_full_instruction *inst = &ctx->parse.FullToken.FullInstruction;
const struct tgsi_full_dst_register *dst = &inst->Dst[0];
- int i, r, lasti;
+ int i, r, lasti, firsti;
int temp_reg = r600_get_temp(ctx);
+ unsigned offs_reg;
+ unsigned offs_chan;
struct r600_bytecode_alu alu;
unsigned write_mask = dst->Register.WriteMask;
int param;
@@ -2867,19 +2941,18 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
if (inst->Dst[0].Register.File != TGSI_FILE_OUTPUT)
return 0;

- r = get_lds_offset0(ctx, 1, temp_reg, dst->Register.Dimension ? false : true);
- if (r)
- return r;
-
/* the base address is now in temp.x */
- r = r600_get_byte_address(ctx, temp_reg,
+ r = r600_get_byte_address(ctx, &offs_reg, &offs_chan,
+ ctx->tess_io_info_precalc,
+ dst->Register.Dimension ? 0:1,
&inst->Dst[0], NULL, ctx->tess_output_info, 1, &param);
if (r)
return r;

+ firsti = param > 0 ? 0 : 1;
/* LDS write */
lasti = tgsi_last_instruction(write_mask);
- for (i = (param > 0 ? 0: 1); i <= lasti; i++) {
+ for (i = firsti; i <= lasti; i++) {

if (!(write_mask & (1 << i)))
continue;
@@ -2888,8 +2961,8 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
alu.dst.chan = i;
alu.dst.write = 1;
alu.op = ALU_OP2_ADD_INT;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = 0;
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
alu.src[1].sel = V_SQ_ALU_SRC_LITERAL;
alu.src[1].value = 4 * i + 16 * param;

@@ -2909,8 +2982,14 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
(i == 2 && ((write_mask & 0xc) == 0xc))) {
memset(&alu, 0, sizeof(struct r600_bytecode_alu));
alu.op = LDS_OP3_LDS_WRITE_REL;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+
+ if (firsti == 0 || i > 0) {
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }

alu.src[1].sel = dst->Register.Index;
alu.src[1].sel += ctx->file_offset[dst->Register.File];
@@ -2931,8 +3010,14 @@ static int r600_store_tcs_output(struct r600_shader_ctx *ctx)
}
memset(&alu, 0, sizeof(struct r600_bytecode_alu));
alu.op = LDS_OP2_LDS_WRITE;
- alu.src[0].sel = temp_reg;
- alu.src[0].chan = i;
+
+ if (firsti == 0 || i > 0) {
+ alu.src[0].sel = temp_reg;
+ alu.src[0].chan = i;
+ } else {
+ alu.src[0].sel = offs_reg;
+ alu.src[0].chan = offs_chan;
+ }

alu.src[1].sel = dst->Register.Index;
alu.src[1].sel += ctx->file_offset[dst->Register.File];
@@ -2953,17 +3038,12 @@ static int r600_tess_factor_read(struct r600_shader_ctx *ctx,
int output_idx)
{
int param;
- unsigned temp_reg = r600_get_temp(ctx);
unsigned name = ctx->shader->output[output_idx].name;
int dreg = ctx->shader->output[output_idx].gpr;
- int r;

param = r600_get_lds_unique_index(name, 0);
- r = get_lds_offset0(ctx, 1, temp_reg, true);
- if (r)
- return r;
-
- do_lds_fetch_values(ctx, temp_reg, dreg, 0xf, param);
+
+ do_lds_fetch_values(ctx, ctx->tess_io_info_precalc, 1, dreg, 0xf, param);
return 0;
}

@@ -3293,11 +3373,13 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
if (ctx.type == PIPE_SHADER_TESS_CTRL) {
ctx.tess_input_info = ctx.bc->ar_reg + 3;
ctx.tess_output_info = ctx.bc->ar_reg + 4;
- ctx.temp_reg = ctx.bc->ar_reg + 5;
+ ctx.tess_io_info_precalc = ctx.bc->ar_reg + 5;
+ ctx.temp_reg = ctx.bc->ar_reg + 6;
} else if (ctx.type == PIPE_SHADER_TESS_EVAL) {
ctx.tess_input_info = 0;
ctx.tess_output_info = ctx.bc->ar_reg + 3;
- ctx.temp_reg = ctx.bc->ar_reg + 4;
+ ctx.tess_io_info_precalc = ctx.bc->ar_reg + 4;
+ ctx.temp_reg = ctx.bc->ar_reg + 5;
} else if (ctx.type == PIPE_SHADER_GEOMETRY) {
ctx.gs_export_gpr_tregs[0] = ctx.bc->ar_reg + 3;
ctx.gs_export_gpr_tregs[1] = ctx.bc->ar_reg + 4;
@@ -3316,18 +3398,27 @@ static int r600_shader_from_tgsi(struct r600_context *rctx,
ctx.temp_reg = ctx.bc->ar_reg + 3;
}

- if (lds_inputs) {
+ ctx.tess_input_cache.uses_lds_io = 0;
+ if (lds_inputs || lds_outputs) {
tgsi_parse_init(&ctx.parse, tokens);
+
while (!tgsi_parse_end_of_tokens(&ctx.parse)) {
tgsi_parse_token(&ctx.parse);
-
- if (ctx.parse.FullToken.Token.Type != TGSI_TOKEN_TYPE_INSTRUCTION)
- continue;
-
- count_tess_inputs(&ctx);
+ if (ctx.parse.FullToken.Token.Type == TGSI_TOKEN_TYPE_INSTRUCTION)
+ count_tess_inputs(&ctx);
+ else if (ctx.parse.FullToken.Token.Type == TGSI_TOKEN_TYPE_DECLARATION) {
+ struct tgsi_full_declaration *d = &ctx.parse.FullToken.FullDeclaration;
+ if (d->Declaration.File == TGSI_FILE_SYSTEM_VALUE &&
+ (d->Semantic.Name == TGSI_SEMANTIC_TESSINNER ||
+ d->Semantic.Name == TGSI_SEMANTIC_TESSOUTER))
+ ctx.tess_input_cache.uses_lds_io = 1;
+
+ }
}
ctx.temp_reg += tess_input_cache_count_multiused(&ctx.tess_input_cache, ctx.temp_reg);
tgsi_parse_init(&ctx.parse, tokens);
+ } else {
+
}

shader->max_arrays = 0;
--
2.13.6
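
The four precalculated values can be written down as plain arithmetic. The following is a rough sketch and not driver code: the function and struct names are hypothetical, out_info[] stands for tess_output_info.xyzw, and MULADD_UINT24 is modelled as a multiply of the low 24 bits of both operands followed by an add, which is an assumption about the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of MULADD_UINT24: multiply the low 24 bits of a and b,
 * then add c (wrapping at 32 bits). */
static uint32_t muladd_uint24(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0xffffffu) * (b & 0xffffffu) + c;
}

struct precalc {
    uint32_t x, y, z, w;
};

/* out_info[] stands for tess_output_info.xyzw, r0y and r0z for R0.y and
 * R0.z, in_stride for tess_input_info.x. is_tcs selects the TCS variant
 * of the w component, as in the patch. */
static struct precalc precalc_offsets(const uint32_t out_info[4],
                                      uint32_t r0y, uint32_t r0z,
                                      int is_tcs, uint32_t in_stride)
{
    struct precalc p;

    assert(out_info != 0);
    p.x = muladd_uint24(out_info[0], r0y, out_info[2]);
    p.y = muladd_uint24(out_info[0], r0y, out_info[3]);
    p.z = muladd_uint24(out_info[0], r0z, out_info[2]);
    if (is_tcs)
        p.w = muladd_uint24(in_stride, r0y, 0);
    else
        p.w = muladd_uint24(out_info[0], r0z, out_info[3]);
    return p;
}
```

The callers in the patch then only pick a channel of tess_io_info_precalc (for example 0 or 1 in fetch_tcs_output) instead of emitting the multiply-add again for every access.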
Dave Airlie
2017-12-08 06:30:06 UTC
Permalink
Post by Gert Wollny
Dear all,
since on r600 the tessellation shaders don't go through the sb-optimizer I
thought it might help to improve performance by applying some optimizations
to the created assembly. The patches are experimental but to a point where
I think some input from you could be helpful.
This patch series does the following optimizations:
- pre-calculate and re-use address offsets that were always calculated
on the fly
- only load from LDS what is really requested (based on the source swizzle masks
of the input values).
- preload all used elements in cases where the shader would only partially load
data in different places.
At this point there are no piglit regressions, but an unrelated GPU lockup is
triggered. (Dave and I are already testing patches for this).
So I haven't committed these yet, because I wanted to see if I could
get sb to work.

https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip

is my non-functional attempt so far, but it GPU-hangs on the nop shader.

I'm away for a week, so I might try to look at it again after that.

Dave.
Gert Wollny
2017-12-11 12:49:14 UTC
Permalink
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU-hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS read/write should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).


However, gcm moves around the LDS_OQ* loads, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.

I tried to persuade the optimizer not to reorder these move
instructions by adding a "use" of the dst-value of a node that reads
from an LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.
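
Why the order matters can be shown with a toy model. This is a minimal sketch, assuming that the LDS output queue behaves as a plain FIFO: every LDS_READ_RET appends one result and every read from the queue removes the oldest one. The type and function names are made up for the illustration and do not exist in the driver.

```c
#include <assert.h>

#define OQ_DEPTH 8

/* Toy model of an LDS output queue; the depth is arbitrary. */
struct lds_oq {
    unsigned data[OQ_DEPTH];
    int head;   /* next slot to pop */
    int tail;   /* next free slot */
};

static void oq_init(struct lds_oq *q)
{
    q->head = 0;
    q->tail = 0;
}

/* Models LDS_READ_RET: read the dword at byte address `addr` and queue it. */
static void lds_read_ret(struct lds_oq *q, const unsigned *lds, unsigned addr)
{
    assert(q->tail < OQ_DEPTH);
    q->data[q->tail++] = lds[addr / 4];
}

/* Models a move that reads from the queue: returns the oldest entry. */
static unsigned oq_pop(struct lds_oq *q)
{
    assert(q->head < q->tail);
    return q->data[q->head++];
}
```

If two reads are issued for byte addresses 0 and 8, the first pop yields the dword at address 0 and the second the dword at address 8. A scheduler that swaps the two pops without swapping the reads hands each consumer the other one's value, because the queue carries no tag saying which read an entry belongs to.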

Best,
Gert
Dave Airlie
2017-12-29 06:38:23 UTC
Permalink
Post by Gert Wollny
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU-hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS read/write should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).
However, gcm moves around the LDS_OQ* loads, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.
I tried to persuade the optimizer not to reorder these move
instructions by adding a "use" of the dst-value of a node that reads
from an LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.
I thought I'd done this already, I must dig a bit more.

I've pushed more stuff to the branch, nop still doesn't work.

I've included your patch in one of the squashes, I think we should be
pretty close.

Dave.
Dave Airlie
2017-12-29 07:18:43 UTC
Permalink
Post by Dave Airlie
Post by Gert Wollny
[snip]
So I haven't committed these yet, because I wanted to see if I could
get sb to work.
Well, it was very much work in progress, I didn't expect it to be
committed as is anyway.
https://cgit.freedesktop.org/~airlied/mesa/log/?h=r600-sb-lds-wip
is my non-functional attempt so far, but it GPU-hangs on the nop shader.
I've played around with it a bit and added some hacks to make it not hang,
i.e. sb schedules calls into any slot, but LDS read/write should go only
into SLOT_X, and not splitting up the fetch seemed to be important
(patch attached).
However, gcm moves around the LDS_OQ* loads, changing their order without
changing the order of the corresponding LDS_READ_RET calls. At least because
of this, the nop shader still fails.
I tried to persuade the optimizer not to reorder these move
instructions by adding a "use" of the dst-value of a node that reads
from an LDS_OQ to the next node that reads from the same queue, but to
no avail. I guess I didn't figure out how to count these extra uses
properly when the instructions are scheduled.
I thought I'd done this already, I must dig a bit more.
I've pushed more stuff to the branch, nop still doesn't work.
I've included your patch in one of the squashes, I think we should be
pretty close.
I think the top patch in my tree fixes the LDS reordering; nop still
doesn't work, though, which is annoying. Maybe you can spot the problem;
I've been staring at it too long.

Dave.
Gert Wollny
2018-01-04 14:05:19 UTC
Permalink
Post by Dave Airlie
I think the top patch in my tree fixes the LDS reordering; nop still
doesn't work, though, which is annoying. Maybe you can spot the problem;
I've been staring at it too long.
Unfortunately my monitor decided to die while I was testing the code.
Once I have replaced it and can test again, I'll get back to you.

happy new year,
Gert
Gert Wollny
2018-01-05 17:18:07 UTC
Permalink
Post by Dave Airlie
Post by Dave Airlie
I thought I'd done this already, I must dig a bit more.
I've pushed more stuff to the branch, nop still doesn't work.
I've included your patch in one of the squashes, I think we should
be pretty close.
I think the top patch in my tree fixes the LDS reordering; nop still
doesn't work, though, which is annoying. Maybe you can spot the problem;
I've been staring at it too long.
Well, I have tested some piglits now and the behaviour is quite weird:

When I run nop as the very first piglit after booting the machine, it
works. After running other piglits (specifically
tcs-input-read-array-interface and tcs-input-read-mat), nop starts to
fail, also without sb.

Restarting X is not enough to get nop to pass again.

If I run piglit normally on the shader subset, I also get lockups and I
even got kicked out of X; the last syslog messages related to this were:

[ 1403.211887] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[ 1403.211932] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).


Best,
Gert
Gert Wollny
2018-01-05 17:41:40 UTC
Permalink
Post by Gert Wollny
Well, I have tested some piglits now and the behaviour is quite
weird:
When I run nop as the very first piglit after booting the machine, it
works. After running other piglits (specifically
tcs-input-read-array-interface and tcs-input-read-mat), nop starts to
fail, also without sb.
Restarting X is not enough to get nop to pass again.
If I run piglit normally on the shader subset, I also get lockups and
I even got kicked out of X; the last syslog messages related to this were:
[ 1403.211887] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[ 1403.211932] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
When I run Unigine_Heaven with your WIP code and all sb passes for
tessellation enabled, I get a crash because of a stack overflow, i.e.
the hash evaluation ends up in an infinite recursion, ping-ponging
between two nodes:

...
#747 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
#748 in r600_sb::value::hash (this=0x1e39cd0) at sb/sb_valtable.cpp:189
#749 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#750 in r600_sb::node::hash_src (this=***@entry= ) at sb/sb_ir.cpp:265
#751 in r600_sb::node::hash (this=0x1e00bf0) at sb/sb_ir.cpp:277
#752 in r600_sb::value::hash (this=0x1e39e70) at sb/sb_valtable.cpp:189
#753 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#754 in r600_sb::node::hash_src (this=***@entry= ) at sb/sb_ir.cpp:265
#755 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
...
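
The shape of this recursion can be reproduced with a generic example. This is not the sb code and not a proposed fix, only a sketch of the pattern: a value is hashed through its defining node, a node is hashed through its source values, and two nodes that reference each other's results therefore recurse forever unless something breaks the cycle. All names and the "in progress" marker are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct node;

struct value {
    struct node *def;  /* node that produces this value, or NULL */
    unsigned id;       /* placeholder hash used to break cycles */
    unsigned hash;     /* cached result */
    int state;         /* 0 = not hashed, 1 = in progress, 2 = done */
};

struct node {
    struct value *src[2];  /* source operands, NULL if unused */
    unsigned op;
};

static unsigned value_hash(struct value *v);

/* Hash of a node: combines the opcode with the hashes of its sources. */
static unsigned node_hash(struct node *n)
{
    unsigned h = n->op;
    int i;

    for (i = 0; i < 2; ++i) {
        if (n->src[i])
            h = h * 31u + value_hash(n->src[i]);
    }
    return h;
}

/* Hash of a value: delegates to the defining node. Without the
 * "in progress" check, two mutually dependent nodes recurse until the
 * stack overflows. */
static unsigned value_hash(struct value *v)
{
    assert(v != NULL);
    if (v->state == 2)
        return v->hash;
    if (v->state == 1)
        return v->id;  /* cycle detected: use the placeholder */
    v->state = 1;
    v->hash = v->def ? node_hash(v->def) : v->id;
    v->state = 2;
    return v->hash;
}
```

With two nodes whose sources are each other's results, value_hash terminates after visiting each value once; whether such a cycle is legitimate in the IR in the first place is a separate question that the sketch does not answer.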

Best,
Gert
Dave Airlie
2018-01-08 07:12:51 UTC
Permalink
Post by Gert Wollny
Post by Gert Wollny
When I run nop as the very first piglit after booting the machine, it
works. After running other piglits (specifically
tcs-input-read-array-interface and tcs-input-read-mat), nop starts to
fail, also without sb.
Restarting X is not enough to get nop to pass again.
If I run piglit normally on the shader subset, I also get lockups and
I even got kicked out of X; the last syslog messages related to this were:
[ 1403.211887] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[ 1403.211932] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
When I run Unigine_Heaven with your WIP code and all sb passes for
tessellation enabled, I get a crash because of a stack overflow, i.e.
the hash evaluation ends up in an infinite recursion, ping-ponging
between two nodes:
...
#747 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
#748 in r600_sb::value::hash (this=0x1e39cd0) at sb/sb_valtable.cpp:189
#749 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#751 in r600_sb::node::hash (this=0x1e00bf0) at sb/sb_ir.cpp:277
#752 in r600_sb::value::hash (this=0x1e39e70) at sb/sb_valtable.cpp:189
#753 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#755 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
Yeah, I see the same. Not 100% sure why yet.

For nop.shader_test I've noticed that if you move the position line above the
tess factor emission, things start to work, which is confusing me no end;
it sounds like we are still doing something bad with LDS.

Dave.
Dave Airlie
2018-01-09 07:32:36 UTC
Permalink
Post by Dave Airlie
Post by Gert Wollny
Post by Gert Wollny
When I run nop as the very first piglit after booting the machine, it
works. After running other piglits (specifically
tcs-input-read-array-interface and tcs-input-read-mat), nop starts to
fail, also without sb.
Restarting X is not enough to get nop to pass again.
If I run piglit normally on the shader subset, I also get lockups and
I even got kicked out of X; the last syslog messages related to this were:
[ 1403.211887] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[ 1403.211932] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
When I run Unigine_Heaven with your WIP code and all sb passes for
tessellation enabled, I get a crash because of a stack overflow, i.e.
the hash evaluation ends up in an infinite recursion, ping-ponging
between two nodes:
...
#747 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
#748 in r600_sb::value::hash (this=0x1e39cd0) at sb/sb_valtable.cpp:189
#749 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#751 in r600_sb::node::hash (this=0x1e00bf0) at sb/sb_ir.cpp:277
#752 in r600_sb::value::hash (this=0x1e39e70) at sb/sb_valtable.cpp:189
#753 in r600_sb::value::hash (this=< >) at sb/sb_valtable.cpp:184
#755 in r600_sb::node::hash (this=0x1e01228) at sb/sb_ir.cpp:277
Yeah, I see the same. Not 100% sure why yet.
For nop.shader_test I've noticed that if you move the position line above the
tess factor emission, things start to work, which is confusing me no end;
it sounds like we are still doing something bad with LDS.
I've pushed out another few hacks in progress.

I've got heaven running now, and seem to get about the same speedup
you were getting with this series.

On piglit -t tessellation I've got about 13 crashes in some variable
indexing tests, and they are all GCM related. I've torn out a fair bit
of hair this afternoon trying to keep the gcm scheduler happy, but to
no avail.

tests/spec/arb_tessellation_shader/execution/variable-indexing/tcs-input-array-float-index-rd.shader_test
is one of the culprits; it looks like GCM schedules a bunch of basic
blocks, but then some instructions are dont_move but get scheduled
wrong.

Dave.
Gert Wollny
2018-01-09 09:07:17 UTC
Permalink
Gert Wollny
2018-01-09 16:14:03 UTC
Permalink
Post by Dave Airlie
tests/spec/arb_tessellation_shader/execution/variable-indexing/tcs-input-array-float-index-rd.shader_test
is one of the culprits; it looks like GCM schedules a bunch of basic
blocks, but then some instructions are dont_move but get scheduled
wrong.
Strangely, for this shader sb also creates something that the
disassembler interprets as

 0068  001f0dff 06422000  11  x: LDS_READ_RET       __.x,  Param63.w
 0070  801fadff 2f801a10  y: ADD_INT            T0.y,  Param63.w,

i.e. src0 is 511, which is not documented as an allowed value for
Evergreen in LDS_IDX_OP.
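
The value 511 can be read directly from the first dword of each instruction in the listing above. The sketch below assumes the usual ALU word 0 layout with SRC0_SEL in bits 0..8, SRC0_REL in bit 9 and SRC0_CHAN in bits 10..11; the struct and function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

struct alu_src0 {
    unsigned sel;   /* source select, 9 bits */
    unsigned rel;   /* relative addressing flag */
    unsigned chan;  /* channel index, 0..3 */
};

/* Decode the src0 fields of an ALU word 0, assuming SRC0_SEL in bits
 * 0..8, SRC0_REL in bit 9 and SRC0_CHAN in bits 10..11. */
static struct alu_src0 decode_src0(uint32_t word0)
{
    struct alu_src0 s;

    s.sel = word0 & 0x1ff;
    s.rel = (word0 >> 9) & 0x1;
    s.chan = (word0 >> 10) & 0x3;
    return s;
}

/* Channel index to the letter used by the disassembler. */
static char chan_name(unsigned chan)
{
    assert(chan < 4);
    return "xyzw"[chan];
}
```

Applied to 0x001f0dff this gives sel = 511 and chan = 3 (w), which matches the "Param63.w" printed by the disassembler; the ADD_INT word 0x801fadff decodes to the same source.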

Best,
Gert
