Discussion:
[PATCH v2 0/9] anv: Push constant 16bit storage and push UBOs
Jason Ekstrand
2017-12-07 04:34:18 UTC
This series is a combination of my push UBOs series, a patch I sent out
this last week and the patch from Jose to enable 16bit storage for push
constants.

I haven't landed UBO pushing yet because I wanted to let Jose land 16bit
storage. In order for UBO pushing to work now that 16bit storage is
enabled, we need 16bit push constants. The first patch is a rewrite of
assign_constant_locations to support bit-widths other than 32 and 64. This
is followed by the patch to enable 16bit storage which is trivial once
assign_constant_locations works. The last 7 patches are mostly the ones
Jordan reviewed already but with one more patch on top.

Chema Casanova (1):
anv: Enable VK_KHR_16bit_storage for push_constant

Jason Ekstrand (8):
i965/fs: Rewrite assign_constant_locations
anv/pipeline: Translate vulkan_resource_index to a constant when
possible
anv/cmd_buffer: Add some helpers for working with descriptor sets
anv/cmd_buffer: Add some stage asserts
anv/cmd_buffer: Add support for pushing UBO ranges
anv/device: Increase the UBO alignment requirement to 32
i965/fs: Handle !supports_pull_constants and push UBOs properly
anv: Enable UBO pushing

src/intel/compiler/brw_fs.cpp | 307 +++++++++++++----------
src/intel/vulkan/anv_device.c | 15 +-
src/intel/vulkan/anv_nir_apply_pipeline_layout.c | 17 +-
src/intel/vulkan/anv_pipeline.c | 6 +
src/intel/vulkan/genX_cmd_buffer.c | 191 ++++++++++----
src/intel/vulkan/genX_pipeline.c | 3 +-
6 files changed, 356 insertions(+), 183 deletions(-)
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:19 UTC
This rewires the logic for assigning uniform locations to work in terms
of "complex alignments". The basic idea is that, as we walk the list of
instructions, we keep track of the alignment and continuity requirements
of each slot and assert that the alignments all match up. We then use
those alignments in the compaction stage to ensure that everything gets
placed at a properly aligned register. The old mechanism handled
alignments by special-casing each of the bit sizes and placing 64-bit
values first followed by 32-bit values.

The old scheme had the advantage of never leaving a hole since all the
64-bit values could be tightly packed and so could the 32-bit values.
However, the new scheme has no type size special cases so it handles not
only 32 and 64-bit types but should gracefully extend to 16 and 8-bit
types as the need arises.
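
To make the "complex alignment" semantics concrete, here is a small standalone
sketch (illustration only, not part of the patch; the struct below is a
simplified stand-in for the cplx_align in this patch, and ALIGN is just the
usual round-up-to-a-power-of-two helper):

#include <assert.h>
#include <stdio.h>

/* Round x up to the next multiple of the power-of-two a. */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* A value v is "aligned" to {mul, offset} when v % mul == offset.
 * (Simplified: no bitfields, no sanity asserts.)
 */
struct cplx_align { unsigned mul, offset; };

/* Smallest value >= v that is congruent to a.offset modulo a.mul. */
static unsigned
align_apply(struct cplx_align a, unsigned v)
{
   return ALIGN(v - a.offset, a.mul) + a.offset;
}

int
main(void)
{
   struct cplx_align a = { 8, 4 };   /* "4 more than a multiple of 8" */
   assert(align_apply(a, 4)  == 4);
   assert(align_apply(a, 5)  == 12);
   assert(align_apply(a, 12) == 12);
   assert(align_apply(a, 13) == 20);

   struct cplx_align b = { 8, 0 };   /* a plain alignment is just offset 0 */
   assert(align_apply(b, 5) == 8);

   printf("ok\n");
   return 0;
}

A plain power-of-two alignment N is just the special case {mul = N, offset = 0}.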

Tested-by: Jose Maria Casanova Crespo <***@igalia.com>
---
src/intel/compiler/brw_fs.cpp | 307 ++++++++++++++++++++++++------------------
1 file changed, 174 insertions(+), 133 deletions(-)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 93bb6b4..41260b4 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -1906,62 +1906,6 @@ fs_visitor::compact_virtual_grfs()
return progress;
}

-static void
-set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
- unsigned *max_chunk_bitsize,
- bool contiguous, unsigned bitsize,
- const unsigned target_bitsize,
- int *push_constant_loc, int *pull_constant_loc,
- unsigned *num_push_constants,
- unsigned *num_pull_constants,
- const unsigned max_push_components,
- const unsigned max_chunk_size,
- bool allow_pull_constants,
- struct brw_stage_prog_data *stage_prog_data)
-{
- /* This is the first live uniform in the chunk */
- if (*chunk_start < 0)
- *chunk_start = uniform;
-
- /* Keep track of the maximum bit size access in contiguous uniforms */
- *max_chunk_bitsize = MAX2(*max_chunk_bitsize, bitsize);
-
- /* If this element does not need to be contiguous with the next, we
- * split at this point and everything between chunk_start and u forms a
- * single chunk.
- */
- if (!contiguous) {
- /* If bitsize doesn't match the target one, skip it */
- if (*max_chunk_bitsize != target_bitsize) {
- /* FIXME: right now we only support 32 and 64-bit accesses */
- assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- return;
- }
-
- unsigned chunk_size = uniform - *chunk_start + 1;
-
- /* Decide whether we should push or pull this parameter. In the
- * Vulkan driver, push constants are explicitly exposed via the API
- * so we push everything. In GL, we only push small arrays.
- */
- if (!allow_pull_constants ||
- (*num_push_constants + chunk_size <= max_push_components &&
- chunk_size <= max_chunk_size)) {
- assert(*num_push_constants + chunk_size <= max_push_components);
- for (unsigned j = *chunk_start; j <= uniform; j++)
- push_constant_loc[j] = (*num_push_constants)++;
- } else {
- for (unsigned j = *chunk_start; j <= uniform; j++)
- pull_constant_loc[j] = (*num_pull_constants)++;
- }
-
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- }
-}
-
static int
get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
{
@@ -1977,6 +1921,98 @@ get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
}

/**
+ * Struct for handling complex alignments.
+ *
+ * A complex alignment is stored as a multiplier and an offset. A value is
+ * considered to be aligned if it is congruent to the offset modulo the
+ * multiplier.
+ */
+struct cplx_align {
+ unsigned mul:4;
+ unsigned offset:4;
+};
+
+#define CPLX_ALIGN_MAX_MUL 8
+
+static void
+cplx_align_assert_sane(struct cplx_align a)
+{
+ assert(a.mul > 0 && util_is_power_of_two(a.mul));
+ assert(a.offset < a.mul);
+}
+
+/**
+ * Combines two alignments to produce a least multiple of sorts.
+ *
+ * The returned alignment is the smallest (in terms of multiplier) such that
+ * anything aligned to both a and b will be aligned to the new alignment.
+ * This function will assert-fail if a and b are not compatible, i.e. if the
+ * offset parameters are such that no common alignment is possible.
+ */
+static struct cplx_align
+cplx_align_combine(struct cplx_align a, struct cplx_align b)
+{
+ cplx_align_assert_sane(a);
+ cplx_align_assert_sane(b);
+
+ /* Assert that the alignments agree. */
+ assert((a.offset & (b.mul - 1)) == (b.offset & (a.mul - 1)));
+
+ return a.mul > b.mul ? a : b;
+}
+
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or equal to
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
+}
+
+#define UNIFORM_SLOT_SIZE 4
+
+struct uniform_slot_info {
+ /** True if the given uniform slot is live */
+ unsigned is_live:1;
+
+ /** True if this slot and the next slot must remain contiguous */
+ unsigned contiguous:1;
+
+ struct cplx_align align;
+};
+
+static void
+mark_uniform_slots_read(struct uniform_slot_info *slots,
+ unsigned num_slots, unsigned alignment)
+{
+ assert(alignment > 0 && util_is_power_of_two(alignment));
+ assert(alignment <= CPLX_ALIGN_MAX_MUL);
+
+ /* We can't align a slot to anything less than the slot size */
+ alignment = MAX2(alignment, UNIFORM_SLOT_SIZE);
+
+ struct cplx_align align = {alignment, 0};
+ cplx_align_assert_sane(align);
+
+ for (unsigned i = 0; i < num_slots; i++) {
+ slots[i].is_live = true;
+ if (i < num_slots - 1)
+ slots[i].contiguous = true;
+
+ align.offset = (i * UNIFORM_SLOT_SIZE) & (align.mul - 1);
+ if (slots[i].align.mul == 0) {
+ slots[i].align = align;
+ } else {
+ slots[i].align = cplx_align_combine(slots[i].align, align);
+ }
+ }
+}
+
+/**
* Assign UNIFORM file registers to either push constants or pull constants.
*
* We allow a fragment shader to have more than the specified minimum
@@ -1994,60 +2030,44 @@ fs_visitor::assign_constant_locations()
return;
}

- bool is_live[uniforms];
- memset(is_live, 0, sizeof(is_live));
- unsigned bitsize_access[uniforms];
- memset(bitsize_access, 0, sizeof(bitsize_access));
-
- /* For each uniform slot, a value of true indicates that the given slot and
- * the next slot must remain contiguous. This is used to keep us from
- * splitting arrays and 64-bit values apart.
- */
- bool contiguous[uniforms];
- memset(contiguous, 0, sizeof(contiguous));
+ struct uniform_slot_info slots[uniforms];
+ memset(slots, 0, sizeof(slots));

- /* First, we walk through the instructions and do two things:
- *
- * 1) Figure out which uniforms are live.
- *
- * 2) Mark any indirectly used ranges of registers as contiguous.
- *
- * Note that we don't move constant-indexed accesses to arrays. No
- * testing has been done of the performance impact of this choice.
- */
foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
for (int i = 0 ; i < inst->sources; i++) {
if (inst->src[i].file != UNIFORM)
continue;

- int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
+ /* NIR tightly packs things so the uniform number might not be
+ * aligned (if we have a double right after a float, for instance).
+ * This is fine because the process of re-arranging them will ensure
+ * that things are properly aligned. The offset into that uniform,
+ * however, must be aligned.
+ *
+ * In Vulkan, we have explicit offsets but everything is crammed
+ * into a single "variable" so inst->src[i].nr will always be 0.
+ * Everything will be properly aligned relative to that one base.
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) == 0);
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;

+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j], type_sz(inst->src[i].type));
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last], type_sz(inst->src[i].type));
+ slots_read = DIV_ROUND_UP(inst->src[2].ud, UNIFORM_SLOT_SIZE);
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j], type_sz(inst->src[i].type));
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read, UNIFORM_SLOT_SIZE);
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}

@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms * sizeof(*pull_constant_loc));

int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly aligned */
- const unsigned uniform_64_bit_size = type_sz(BRW_REGISTER_TYPE_DF);
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }

- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push register. */
+ if (subgroup_id_index == (int)u)
+ continue;

- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) * UNIFORM_SLOT_SIZE;

- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size = type_sz(BRW_REGISTER_TYPE_F);
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it is
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);

- /* Skip subgroup_id_index to put it in the last push register. */
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;

- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align, num_push_constants);
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] = num_push_constants++;
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align, num_pull_constants);
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] = num_pull_constants++;
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}

/* Add the CS local thread ID uniform at the end of the push constants */
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
- num_pull_constants);
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+ num_pull_constants);
}

/* Now that we know how many regular uniforms we'll push, reduce the
--
2.5.0.400.gff86faf
Pohjolainen, Topi
2017-12-07 17:10:49 UTC
Post by Jason Ekstrand
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or equal to
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
Should we assert that offset >= 4?
Post by Jason Ekstrand
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
Soon I'm going to continue with glsl mediump support and thought it better to
read these patches first. I stumbled on something (probably missing something
again):

To me it looks like there can only ever be three different instances of
cplx_align:

{8, 0}, {8, 4} and {4, 0}, and the "offset" argument here is always a multiple
of 4.

In the case of align == {8, 0}, "offset" gets aligned to 8, and in the case of
align == {4, 0} cplx_align_apply() is a nop.

But in the case of {8, 4} with "offset" not a multiple of 8, "offset" - 4 is a
multiple of 8, so the returned value becomes ALIGN(offset - 4, 8) + 4, which in
turn isn't aligned to 8.
Jason Ekstrand
2017-12-07 17:25:08 UTC
On Thu, Dec 7, 2017 at 9:10 AM, Pohjolainen, Topi wrote:
Post by Pohjolainen, Topi
Post by Jason Ekstrand
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
Should we assert that offset >= 4?
Not here, no. I intentionally want this code to be as generic as
possible. We may have reason to re-use it (or something similar) elsewhere
and I don't want to build in lots of push-constant assumptions.
Post by Jason Ekstrand
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
Post by Pohjolainen, Topi
Soon I'm going to continue with glsl mediump support and thought it better to
read these patches first. I stumbled on something (probably missing something
again):
To me it looks like there can only ever be three different instances of
cplx_align:
Today, yes, but there may come a day when we want to re-arrange with a
finer granularity than dwords.
Post by Pohjolainen, Topi
{8, 0}, {8, 4} and {4, 0}, and the "offset" argument here is always a multiple
of 4.
In the case of align == {8, 0}, "offset" gets aligned to 8, and in the case of
align == {4, 0} cplx_align_apply() is a nop.
But in the case of {8, 4} with "offset" not a multiple of 8, "offset" - 4 is a
multiple of 8, so the returned value becomes ALIGN(offset - 4, 8) + 4, which in
turn isn't aligned to 8.
Correct, and it shouldn't be. Let's walk through the calculation with a
more complex alignment: {8, 6} (no, that isn't possible today but the
calculations should work with it) and some concrete numbers: 6, 8, 10, 12,
14, 16

6 -> ALIGN(6 - 6, 8) + 6 = 6
8 -> ALIGN(8 - 6, 8) + 6 = 14
10 -> ALIGN(10 - 6, 8) + 6 = 14
12 -> ALIGN(12 - 6, 8) + 6 = 14
14 -> ALIGN(14 - 6, 8) + 6 = 14
16 -> ALIGN(16 - 6, 8) + 6 = 22

As you can see, the result of each is the next number that is 6 more than a
multiple of 8. There is also a question about negative numbers. Let's
look at one more calculation, this time with more detail about the guts of
ALIGN():

4 -> ALIGN(4 - 6, 8) + 6 = (((4 - 6) + (8 - 1)) & ~(8 - 1)) + 6 = (5 & ~7) + 6 = 6

That one's a bit more complicated but I think you see what's going on. If
it makes things more clear, we could do something such as

return ((offset + (align.mul - align.offset - 1)) & ~(align.mul - 1)) + align.offset;
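
For what it's worth, a quick throwaway check (just the usual round-up ALIGN
macro, nothing from the patch) reproduces the numbers above, including the
4 -> 6 case:

#include <stdio.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

int
main(void)
{
   /* cplx_align {mul = 8, offset = 6}: every result should be 6 more than a
    * multiple of 8 and >= the input.  The 4 case relies on the unsigned
    * wrap-around in (4 - 6) cancelling out, just like the signed math above.
    */
   const unsigned mul = 8, off = 6;
   const unsigned in[] = { 4, 6, 8, 10, 12, 14, 16 };
   for (unsigned i = 0; i < sizeof(in) / sizeof(in[0]); i++)
      printf("%u -> %u\n", in[i], ALIGN(in[i] - off, mul) + off);
   /* Prints: 4 -> 6, 6 -> 6, 8 -> 14, 10 -> 14, 12 -> 14, 14 -> 14, 16 -> 22 */
   return 0;
}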
Pohjolainen, Topi
2017-12-07 17:57:08 UTC
Post by Jason Ekstrand
That one's a bit more complicated but I think you see what's going on. If
it makes things more clear, we could do something such as

return ((offset + (align.mul - align.offset - 1)) & ~(align.mul - 1)) + align.offset;
Big thanks for the good explanation. I was also wondering why I didn't fully
get the documentation of cplx_align. I kept asking myself "but offset is
always modulo of mul". Now the comment there makes sense :) Some of your reply
here would probably go nicely there.

I'll read the whole thing again to see if there was anything else I wanted to
ask.
Post by Jason Ekstrand
Post by Jason Ekstrand
+}
Post by Jason Ekstrand
+
+#define UNIFORM_SLOT_SIZE 4
+
+struct uniform_slot_info {
+ /** True if the given uniform slot is live */
+ unsigned is_live:1;
+
+ /** True if this slot and the slot must remain contiguous */
+ unsigned contiguous:1;
+
+ struct cplx_align align;
+};
+
+static void
+mark_uniform_slots_read(struct uniform_slot_info *slots,
+ unsigned num_slots, unsigned alignment)
+{
+ assert(alignment > 0 && util_is_power_of_two(alignment));
+ assert(alignment <= CPLX_ALIGN_MAX_MUL);
+
+ /* We can't align a slot to anything less than the slot size */
+ alignment = MAX2(alignment, UNIFORM_SLOT_SIZE);
+
+ struct cplx_align align = {alignment, 0};
+ cplx_align_assert_sane(align);
+
+ for (unsigned i = 0; i < num_slots; i++) {
+ slots[i].is_live = true;
+ if (i < num_slots - 1)
+ slots[i].contiguous = true;
+
+ align.offset = (i * UNIFORM_SLOT_SIZE) & (align.mul - 1);
+ if (slots[i].align.mul == 0) {
+ slots[i].align = align;
+ } else {
+ slots[i].align = cplx_align_combine(slots[i].align, align);
+ }
+ }
+}
+
+/**
* Assign UNIFORM file registers to either push constants or pull
constants.
Post by Jason Ekstrand
*
* We allow a fragment shader to have more than the specified minimum
@@ -1994,60 +2030,44 @@ fs_visitor::assign_constant_locations()
return;
}
- bool is_live[uniforms];
- memset(is_live, 0, sizeof(is_live));
- unsigned bitsize_access[uniforms];
- memset(bitsize_access, 0, sizeof(bitsize_access));
-
- /* For each uniform slot, a value of true indicates that the given
slot and
Post by Jason Ekstrand
- * the next slot must remain contiguous. This is used to keep us
from
Post by Jason Ekstrand
- * splitting arrays and 64-bit values apart.
- */
- bool contiguous[uniforms];
- memset(contiguous, 0, sizeof(contiguous));
+ struct uniform_slot_info slots[uniforms];
+ memset(slots, 0, sizeof(slots));
- *
- * 1) Figure out which uniforms are live.
- *
- * 2) Mark any indirectly used ranges of registers as contiguous.
- *
- * Note that we don't move constant-indexed accesses to arrays. No
- * testing has been done of the performance impact of this choice.
- */
foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
for (int i = 0 ; i < inst->sources; i++) {
if (inst->src[i].file != UNIFORM)
continue;
- int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
+ /* NIR tightly packs things so the uniform number might not be
+ * aligned (if we have a double right after a float, for
instance).
Post by Jason Ekstrand
+ * This is fine because the process of re-arranging them will
ensure
Post by Jason Ekstrand
+ * that things are properly aligned. The offset into that
uniform,
Post by Jason Ekstrand
+ * however, must be aligned.
+ *
+ * In Vulkan, we have explicit offsets but everything is
crammed
Post by Jason Ekstrand
+ * into a single "variable" so inst->src[i].nr will always be
0.
Post by Jason Ekstrand
+ * Everything will be properly aligned relative to that one
base.
Post by Jason Ekstrand
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) == 0);
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;
+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
+ slots_read = DIV_ROUND_UP(inst->src[2].ud,
UNIFORM_SLOT_SIZE);
Post by Jason Ekstrand
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read, UNIFORM_SLOT_SIZE);
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}
@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms * sizeof(*pull_constant_loc));
int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly aligned */
- const unsigned uniform_64_bit_size = type_sz(BRW_REGISTER_TYPE_DF);
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push register. */
+ if (subgroup_id_index == (int)u)
+ continue;
- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) * UNIFORM_SLOT_SIZE;
- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size = type_sz(BRW_REGISTER_TYPE_F);
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it is
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);
- /* Skip subgroup_id_index to put it in the last push register. */
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align, num_push_constants);
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] = num_push_constants++;
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align, num_pull_constants);
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] = num_pull_constants++;
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}
/* Add the CS local thread ID uniform at the end of the push constants */
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
- num_pull_constants);
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+ num_pull_constants);
}
/* Now that we know how many regular uniforms we'll push, reduce the
--
2.5.0.400.gff86faf
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
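As a reading aid, here is a minimal standalone sketch of the "complex
alignment" idea this patch introduces: an alignment is a (multiplier, offset)
pair, a value is aligned if it is congruent to the offset modulo the
multiplier, and two alignments can be combined into the stricter of the two.
It mirrors the cplx_align helpers in the diff above but substitutes a local
is_power_of_two() for Mesa's util_is_power_of_two() so it compiles on its
own; it is an illustration, not the driver code.

#include <assert.h>
#include <stdbool.h>

struct cplx_align {
   unsigned mul;     /* power-of-two multiplier */
   unsigned offset;  /* required residue: value % mul == offset */
};

static bool
is_power_of_two(unsigned x)
{
   return x != 0 && (x & (x - 1)) == 0;
}

static void
cplx_align_assert_sane(struct cplx_align a)
{
   assert(a.mul > 0 && is_power_of_two(a.mul));
   assert(a.offset < a.mul);
}

/* Pick the stricter of two compatible alignments: anything satisfying the
 * result satisfies both a and b.  Asserts if no common alignment exists. */
static struct cplx_align
cplx_align_combine(struct cplx_align a, struct cplx_align b)
{
   cplx_align_assert_sane(a);
   cplx_align_assert_sane(b);
   assert((a.offset & (b.mul - 1)) == (b.offset & (a.mul - 1)));
   return a.mul > b.mul ? a : b;
}

int
main(void)
{
   /* The second 4-byte slot of an 8-byte-aligned double wants {8, 4}; a
    * plain 32-bit slot wants {4, 0}.  Combining keeps the stricter {8, 4},
    * since any value that is 4 mod 8 is also 0 mod 4. */
   struct cplx_align dbl_hi = {8, 4};
   struct cplx_align dword  = {4, 0};
   struct cplx_align c = cplx_align_combine(dbl_hi, dword);
   assert(c.mul == 8 && c.offset == 4);
   return 0;
}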
Jason Ekstrand
2017-12-07 18:57:31 UTC
Reply
Permalink
Raw Message
On Thu, Dec 7, 2017 at 9:57 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
On Thu, Dec 7, 2017 at 9:10 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or equal to
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
Should we assert that offset >= 4?
Not here, no.  I intentionally want this code to be as generic as possible.
We may have reason to re-use it (or something similar) elsewhere and I don't
want to build in lots of push-constant assumptions.
Post by Jason Ekstrand
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
Soon I'm going to continue with glsl mediump support and thought better to
read these patches. I stumbled on something (probably missing something :-)
Today, yes, but there may come a day when we want to re-arrange with a finer
granularity than dwords.
Post by Jason Ekstrand
{8, 0}, {8, 4} and {4, 0} and that "offset" argument here is always a
multiple of 4.
In case of align == {8, 0} "offset" gets aligned to 8, and in case of
align == {4, 0} cplx_align_apply() is a nop.
But in case of {8, 4} and "offset" not a multiple of 8, "offset" - 4 is a
multiple of 8 and the returned value becomes ALIGN(offset - 4, 8) + 4, which
in turn isn't aligned to 8.
Correct, and it shouldn't be.  Let's walk through the calculation with a
more complex alignment, {8, 6} (no, that isn't possible today but the
calculations should work with it), and some concrete numbers: 6, 8, 10, 12,
14, 16:

6  -> ALIGN(6 - 6, 8) + 6 = 6
8  -> ALIGN(8 - 6, 8) + 6 = 14
10 -> ALIGN(10 - 6, 8) + 6 = 14
12 -> ALIGN(12 - 6, 8) + 6 = 14
14 -> ALIGN(14 - 6, 8) + 6 = 14
16 -> ALIGN(16 - 6, 8) + 6 = 22

As you can see, the result of each is the next number that is 6 more than a
multiple of 8.  There is also a question about negative numbers.  Let's look
at one more calculation, this time with more detail about the guts of
ALIGN():

4 -> ALIGN(4 - 6, 8) + 6 = (((4 - 6) + (8 - 1)) & ~(8 - 1)) + 6 = (5 & ~7) + 6 = 6

That one's a bit more complicated but I think you see what's going on.  If
it makes things more clear, we could do something such as

   return ((offset + (align.mul - align.offset - 1)) & ~(align.mul - 1)) +
          align.offset;

Big thanks for the good explanation. I was also wondering why I didn't fully
get the documentation of cplx_align. I kept asking myself "but offset is
always modulo of mul". Now the comment there makes sense :) Some of your
reply here would probably go nicely there.
Any preference whether it goes in the documentation of struct cplx_align or
cplx_align_apply?  I'm leaning towards cplx_align.
Post by Jason Ekstrand
I'll read the whole thing again to see if there was anything else I wanted
to ask.
Cool

--Jason
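To make the walkthrough above easy to check, here is a small standalone
program (not part of the patch) that evaluates cplx_align_apply() for the
hypothetical {8, 6} alignment and the inputs used above; ALIGN() is written
out as the usual round-up-to-a-power-of-two macro, and the inputs stay at or
above align.offset so unsigned arithmetic is safe.

#include <assert.h>
#include <stdio.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct cplx_align {
   unsigned mul;
   unsigned offset;
};

/* Smallest value >= offset that is congruent to align.offset modulo
 * align.mul, as in the patch. */
static unsigned
cplx_align_apply(struct cplx_align align, unsigned offset)
{
   return ALIGN(offset - align.offset, align.mul) + align.offset;
}

int
main(void)
{
   const struct cplx_align a = {8, 6};
   const unsigned in[]  = { 6,  8, 10, 12, 14, 16};
   const unsigned out[] = { 6, 14, 14, 14, 14, 22};

   for (unsigned i = 0; i < 6; i++) {
      unsigned r = cplx_align_apply(a, in[i]);
      printf("%u -> %u\n", in[i], r);
      assert(r == out[i]);
      /* Every result is the next value that is 6 more than a multiple of 8. */
      assert(r % a.mul == a.offset);
   }
   return 0;
}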
Pohjolainen, Topi
2017-12-08 08:27:23 UTC
Reply
Permalink
Raw Message
Post by Jason Ekstrand
Big thanks for the good explanation. I was also wondering why I didn't fully
get the documentation of cplx_align. I kept asking myself "but offset is
always modulo of mul". Now the comment there makes sense :) Some of your reply
here would probably go nicely there.
Any preference whether it goes in the documentation of struct cplx_align or
cplx_align_apply? I'm leaning towards cplx_align
Sounds good.
Post by Jason Ekstrand
Post by Jason Ekstrand
I'll read the whole thing again to see if there was anything else I wanted to
ask.
I have a few more questions. Replying again to the patch directly as things
are getting too indented here.
Pohjolainen, Topi
2017-12-08 08:40:55 UTC
Reply
Permalink
Raw Message
Post by Jason Ekstrand
This rewires the logic for assigning uniform locations to work in terms
of "complex alignments". The basic idea is that, as we walk the list of
instructions, we keep track of the alignment and continuity requirements
of each slot and assert that the alignments all match up. We then use
those alignments in the compaction stage to ensure that everything gets
placed at a properly aligned register. The old mechanism handled
alignments by special-casing each of the bit sizes and placing 64-bit
values first followed by 32-bit values.
The old scheme had the advantage of never leaving a hole since all the
64-bit values could be tightly packed and so could the 32-bit values.
However, the new scheme has no type size special cases so it handles not
only 32 and 64-bit types but should gracefully extend to 16 and 8-bit
types as the need arises.
---
src/intel/compiler/brw_fs.cpp | 307 ++++++++++++++++++++++++------------------
1 file changed, 174 insertions(+), 133 deletions(-)
diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 93bb6b4..41260b4 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -1906,62 +1906,6 @@ fs_visitor::compact_virtual_grfs()
return progress;
}
-static void
-set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
- unsigned *max_chunk_bitsize,
- bool contiguous, unsigned bitsize,
- const unsigned target_bitsize,
- int *push_constant_loc, int *pull_constant_loc,
- unsigned *num_push_constants,
- unsigned *num_pull_constants,
- const unsigned max_push_components,
- const unsigned max_chunk_size,
- bool allow_pull_constants,
- struct brw_stage_prog_data *stage_prog_data)
-{
- /* This is the first live uniform in the chunk */
- if (*chunk_start < 0)
- *chunk_start = uniform;
-
- /* Keep track of the maximum bit size access in contiguous uniforms */
- *max_chunk_bitsize = MAX2(*max_chunk_bitsize, bitsize);
-
- /* If this element does not need to be contiguous with the next, we
- * split at this point and everything between chunk_start and u forms a
- * single chunk.
- */
- if (!contiguous) {
- /* If bitsize doesn't match the target one, skip it */
- if (*max_chunk_bitsize != target_bitsize) {
- /* FIXME: right now we only support 32 and 64-bit accesses */
- assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- return;
- }
-
- unsigned chunk_size = uniform - *chunk_start + 1;
-
- /* Decide whether we should push or pull this parameter. In the
- * Vulkan driver, push constants are explicitly exposed via the API
- * so we push everything. In GL, we only push small arrays.
- */
- if (!allow_pull_constants ||
- (*num_push_constants + chunk_size <= max_push_components &&
- chunk_size <= max_chunk_size)) {
- assert(*num_push_constants + chunk_size <= max_push_components);
- for (unsigned j = *chunk_start; j <= uniform; j++)
- push_constant_loc[j] = (*num_push_constants)++;
- } else {
- for (unsigned j = *chunk_start; j <= uniform; j++)
- pull_constant_loc[j] = (*num_pull_constants)++;
- }
-
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- }
-}
-
static int
get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
{
@@ -1977,6 +1921,98 @@ get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
}
/**
+ * Struct for handling complex alignments.
+ *
+ * A complex alignment is stored as multiplier and an offset. A value is
+ * considered to be aligned if it is congruent to the offset modulo the
+ * multiplier.
+ */
+struct cplx_align {
+ unsigned mul:4;
+ unsigned offset:4;
+};
+
+#define CPLX_ALIGN_MAX_MUL 8
+
+static void
+cplx_align_assert_sane(struct cplx_align a)
+{
+ assert(a.mul > 0 && util_is_power_of_two(a.mul));
+ assert(a.offset < a.mul);
+}
+
+/**
+ * Combines two alignments to produce a least multiple of sorts.
+ *
+ * The returned alignment is the smallest (in terms of multiplier) such that
+ * anything aligned to both a and b will be aligned to the new alignment.
+ * This function will assert-fail if a and b are not compatible, i.e. if the
+ * offset parameters are such that no common alignment is possible.
+ */
+static struct cplx_align
+cplx_align_combine(struct cplx_align a, struct cplx_align b)
+{
+ cplx_align_assert_sane(a);
+ cplx_align_assert_sane(b);
+
+ /* Assert that the alignments agree. */
+ assert((a.offset & (b.mul - 1)) == (b.offset & (a.mul - 1)));
+
+ return a.mul > b.mul ? a : b;
+}
+
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or equal to
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
+}
+
+#define UNIFORM_SLOT_SIZE 4
+
+struct uniform_slot_info {
+ /** True if the given uniform slot is live */
+ unsigned is_live:1;
+
+ /** True if this slot and the slot must remain contiguous */
Missing a word? Maybe "...this slot and the slot after must..."?
Post by Jason Ekstrand
+ unsigned contiguous:1;
+
+ struct cplx_align align;
+};
+
+static void
+mark_uniform_slots_read(struct uniform_slot_info *slots,
+ unsigned num_slots, unsigned alignment)
+{
+ assert(alignment > 0 && util_is_power_of_two(alignment));
+ assert(alignment <= CPLX_ALIGN_MAX_MUL);
+
+ /* We can't align a slot to anything less than the slot size */
+ alignment = MAX2(alignment, UNIFORM_SLOT_SIZE);
+
+ struct cplx_align align = {alignment, 0};
+ cplx_align_assert_sane(align);
+
+ for (unsigned i = 0; i < num_slots; i++) {
+ slots[i].is_live = true;
+ if (i < num_slots - 1)
+ slots[i].contiguous = true;
+
+ align.offset = (i * UNIFORM_SLOT_SIZE) & (align.mul - 1);
+ if (slots[i].align.mul == 0) {
+ slots[i].align = align;
+ } else {
+ slots[i].align = cplx_align_combine(slots[i].align, align);
+ }
+ }
+}
+
+/**
* Assign UNIFORM file registers to either push constants or pull constants.
*
* We allow a fragment shader to have more than the specified minimum
@@ -1994,60 +2030,44 @@ fs_visitor::assign_constant_locations()
return;
}
- bool is_live[uniforms];
- memset(is_live, 0, sizeof(is_live));
- unsigned bitsize_access[uniforms];
- memset(bitsize_access, 0, sizeof(bitsize_access));
-
- /* For each uniform slot, a value of true indicates that the given slot and
- * the next slot must remain contiguous. This is used to keep us from
- * splitting arrays and 64-bit values apart.
- */
- bool contiguous[uniforms];
- memset(contiguous, 0, sizeof(contiguous));
+ struct uniform_slot_info slots[uniforms];
+ memset(slots, 0, sizeof(slots));
- *
- * 1) Figure out which uniforms are live.
- *
- * 2) Mark any indirectly used ranges of registers as contiguous.
- *
- * Note that we don't move constant-indexed accesses to arrays. No
- * testing has been done of the performance impact of this choice.
- */
foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
for (int i = 0 ; i < inst->sources; i++) {
if (inst->src[i].file != UNIFORM)
continue;
- int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
+ /* NIR tightly packs things so the uniform number might not be
+ * aligned (if we have a double right after a float, for instance).
+ * This is fine because the process of re-arranging them will ensure
+ * that things are properly aligned. The offset into that uniform,
+ * however, must be aligned.
What about in case of types having size less than 32 bits (i.e., smaller than
the uniform slot size)? Take, for example, individual components of 16-bit
vectors. Their offsets are not necessarily aligned on 32-bit slot boundaries.

See also related question a few lines further.
Post by Jason Ekstrand
+ *
+ * In Vulkan, we have explicit offsets but everything is crammed
+ * into a single "variable" so inst->src[i].nr will always be 0.
+ * Everything will be properly aligned relative to that one base.
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) == 0);
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;
This effectively gives the same uniform slot no matter which individual
byte(s) within are addressed. The logic here is about slots so this seems
fine. Just wanted to check that this is your intention. And if so, should we
comment here a little?
Post by Jason Ekstrand
+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j], type_sz(inst->src[i].type));
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last], type_sz(inst->src[i].type));
+ slots_read = DIV_ROUND_UP(inst->src[2].ud, UNIFORM_SLOT_SIZE);
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j], type_sz(inst->src[i].type));
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read, UNIFORM_SLOT_SIZE);
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}
@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms * sizeof(*pull_constant_loc));
int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly aligned */
- const unsigned uniform_64_bit_size = type_sz(BRW_REGISTER_TYPE_DF);
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push register. */
+ if (subgroup_id_index == (int)u)
+ continue;
- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) * UNIFORM_SLOT_SIZE;
- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size = type_sz(BRW_REGISTER_TYPE_F);
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it is
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);
- /* Skip subgroup_id_index to put it in the last push register. */
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align, num_push_constants);
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] = num_push_constants++;
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align, num_pull_constants);
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] = num_pull_constants++;
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}
/* Add the CS local thread ID uniform at the end of the push constants */
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
- num_pull_constants);
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+ num_pull_constants);
}
/* Now that we know how many regular uniforms we'll push, reduce the
--
2.5.0.400.gff86faf
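To make the two review questions above concrete (which 4-byte slot a byte
offset lands in, and how the per-slot alignments {8, 0}, {8, 4} and {4, 0}
arise), here is a small standalone sketch; show_slots() is a hypothetical
helper written for illustration only, not a function from the patch, and it
spells out DIV_ROUND_UP and MAX2 as plain arithmetic.

#include <stdio.h>

#define UNIFORM_SLOT_SIZE 4

/* For a read of `bytes` bytes of a type of size `type_sz`, starting at byte
 * `offset` of uniform register `nr`, print the first 4-byte slot it lands in
 * and the {mul, offset} alignment each touched slot would be given. */
static void
show_slots(unsigned nr, unsigned offset, unsigned type_sz, unsigned bytes)
{
   unsigned first_slot = nr + offset / UNIFORM_SLOT_SIZE;
   unsigned num_slots = (bytes + UNIFORM_SLOT_SIZE - 1) / UNIFORM_SLOT_SIZE;
   /* A slot can never be aligned to less than the slot size. */
   unsigned mul = type_sz < UNIFORM_SLOT_SIZE ? UNIFORM_SLOT_SIZE : type_sz;

   for (unsigned i = 0; i < num_slots; i++) {
      unsigned align_offset = (i * UNIFORM_SLOT_SIZE) & (mul - 1);
      printf("slot %u: align {%u, %u}\n", first_slot + i, mul, align_offset);
   }
}

int
main(void)
{
   /* A dvec2 read at byte offset 8 of uniform 0 touches slots 2..5; the
    * even slots get {8, 0} and the odd slots get {8, 4}. */
   show_slots(0, 8, 8, 16);

   /* A float read at byte offset 4 touches a single slot with {4, 0}. */
   show_slots(0, 4, 4, 4);
   return 0;
}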
Jason Ekstrand
2017-12-08 16:55:56 UTC
Reply
Permalink
Raw Message
On Fri, Dec 8, 2017 at 12:40 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
+ /** True if this slot and the slot must remain contiguous */
Missing a word? Maybe "...this slot and the slot after must..."?
Yes, "slot after" and "next slot" both work.  I'll fix that up.
Post by Jason Ekstrand
+ /* NIR tightly packs things so the uniform number might not be
+ * aligned (if we have a double right after a float, for instance).
+ * This is fine because the process of re-arranging them will ensure
+ * that things are properly aligned. The offset into that uniform,
+ * however, must be aligned.
What about in case of types having size less than 32 bits (i.e., smaller than
the uniform slot size)? Take, for example, individual components of 16-bit
vectors. Their offsets are not necessarily aligned on 32-bit slot boundaries.
See also related question a few lines further.
I did think about that. :-)  My plan was that they would have their uniform
nr set to floor(offset/4) with an offset of offset % 4.  Then their slot
would get aligned to MAX(2, UNIFORM_SLOT_SIZE) = 4 (see
mark_uniform_slots_read above).  Assuming things are reasonably aligned
(i.e., no one puts a float at an offset of 2 bytes), this should work out
ok.

Unfortunately, I see that we still have asserts that byte_offset % 4 == 0
in our handling of nir_intrinsic_load_uniform and we're passing the tests.
They must be pretty bad.

--Jason
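A sketch of the plan described above for sub-dword types (an illustration of
the idea only, not code from this series): a 16-bit component at byte offset
`off` would keep nr = off / 4 plus a residual offset of off % 4, and the slot
it lands in would still be aligned to MAX2(2, UNIFORM_SLOT_SIZE) = 4.  The
helper name place_16bit_component() is hypothetical.

#include <assert.h>
#include <stdio.h>

#define UNIFORM_SLOT_SIZE 4

/* Split a byte offset into a 4-byte slot number and a residual byte offset
 * within that slot, and compute the alignment the slot would be given for a
 * 16-bit access. */
static void
place_16bit_component(unsigned byte_offset)
{
   unsigned slot      = byte_offset / UNIFORM_SLOT_SIZE;   /* floor(offset/4) */
   unsigned sub_off   = byte_offset % UNIFORM_SLOT_SIZE;   /* offset % 4      */
   unsigned type_sz   = 2;                                 /* 16-bit value    */
   unsigned alignment = type_sz < UNIFORM_SLOT_SIZE ? UNIFORM_SLOT_SIZE
                                                    : type_sz; /* MAX2(2, 4)  */

   /* A 16-bit value is assumed to be at least 2-byte aligned. */
   assert(sub_off % type_sz == 0);
   printf("byte %2u -> slot %u, byte %u within the slot, slot align %u\n",
          byte_offset, slot, sub_off, alignment);
}

int
main(void)
{
   /* The four components of a 16-bit vec4 starting at byte offset 8. */
   for (unsigned c = 0; c < 4; c++)
      place_16bit_component(8 + 2 * c);
   return 0;
}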
Post by Jason Ekstrand
See also related question a few lines further.
Post by Jason Ekstrand
+ *
+ * In Vulkan, we have explicit offsets but everything is
crammed
Post by Jason Ekstrand
+ * into a single "variable" so inst->src[i].nr will always be
0.
Post by Jason Ekstrand
+ * Everything will be properly aligned relative to that one
base.
Post by Jason Ekstrand
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) == 0);
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;
This effectively gives the same uniform slot no matter which individual
byte(s) within are addressed. The logic here is about slots so this seems
fine. Just wanted to check that this is your intention. And if so, should we
comment here a little?
Post by Jason Ekstrand
+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
+ slots_read = DIV_ROUND_UP(inst->src[2].ud,
UNIFORM_SLOT_SIZE);
Post by Jason Ekstrand
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j], type_sz(inst->src[i].type));
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read, UNIFORM_SLOT_SIZE);
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}
@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms * sizeof(*pull_constant_loc));
int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly aligned */
- const unsigned uniform_64_bit_size = type_sz(BRW_REGISTER_TYPE_DF);
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push register. */
+ if (subgroup_id_index == (int)u)
+ continue;
- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) * UNIFORM_SLOT_SIZE;
- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size = type_sz(BRW_REGISTER_TYPE_F);
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it is
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);
- /* Skip subgroup_id_index to put it in the last push register. */
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants, &num_pull_constants,
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align, num_push_constants);
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] = num_push_constants++;
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align, num_pull_constants);
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] = num_pull_constants++;
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}
/* Add the CS local thread ID uniform at the end of the push
constants */
Post by Jason Ekstrand
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
- num_pull_constants);
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+ num_pull_constants);
}
/* Now that we know how many regular uniforms we'll push, reduce the
--
2.5.0.400.gff86faf
Pohjolainen, Topi
2017-12-08 18:04:19 UTC
Reply
Permalink
Raw Message
Post by Jason Ekstrand
On Fri, Dec 8, 2017 at 12:40 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
Post by Jason Ekstrand
This rewires the logic for assigning uniform locations to work in terms
of "complex alignments". The basic idea is that, as we walk the list of
instructions, we keep track of the alignment and continuity requirements
of each slot and assert that the alignments all match up. We then use
those alignments in the compaction stage to ensure that everything gets
placed at a properly aligned register. The old mechanism handled
alignments by special-casing each of the bit sizes and placing 64-bit
values first followed by 32-bit values.
The old scheme had the advantage of never leaving a hole since all the
64-bit values could be tightly packed and so could the 32-bit values.
However, the new scheme has no type size special cases so it handles not
only 32 and 64-bit types but should gracefully extend to 16 and 8-bit
types as the need arises.
---
src/intel/compiler/brw_fs.cpp | 307 ++++++++++++++++++++++++------
------------
Post by Jason Ekstrand
1 file changed, 174 insertions(+), 133 deletions(-)
diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.
cpp
Post by Jason Ekstrand
index 93bb6b4..41260b4 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -1906,62 +1906,6 @@ fs_visitor::compact_virtual_grfs()
return progress;
}
-static void
-set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
- unsigned *max_chunk_bitsize,
- bool contiguous, unsigned bitsize,
- const unsigned target_bitsize,
- int *push_constant_loc, int
*pull_constant_loc,
Post by Jason Ekstrand
- unsigned *num_push_constants,
- unsigned *num_pull_constants,
- const unsigned max_push_components,
- const unsigned max_chunk_size,
- bool allow_pull_constants,
- struct brw_stage_prog_data *stage_prog_data)
-{
- /* This is the first live uniform in the chunk */
- if (*chunk_start < 0)
- *chunk_start = uniform;
-
- /* Keep track of the maximum bit size access in contiguous uniforms
*/
Post by Jason Ekstrand
- *max_chunk_bitsize = MAX2(*max_chunk_bitsize, bitsize);
-
- /* If this element does not need to be contiguous with the next, we
- * split at this point and everything between chunk_start and u
forms a
Post by Jason Ekstrand
- * single chunk.
- */
- if (!contiguous) {
- /* If bitsize doesn't match the target one, skip it */
- if (*max_chunk_bitsize != target_bitsize) {
- /* FIXME: right now we only support 32 and 64-bit accesses */
- assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- return;
- }
-
- unsigned chunk_size = uniform - *chunk_start + 1;
-
- /* Decide whether we should push or pull this parameter. In the
- * Vulkan driver, push constants are explicitly exposed via the
API
Post by Jason Ekstrand
- * so we push everything. In GL, we only push small arrays.
- */
- if (!allow_pull_constants ||
- (*num_push_constants + chunk_size <= max_push_components &&
- chunk_size <= max_chunk_size)) {
- assert(*num_push_constants + chunk_size <=
max_push_components);
Post by Jason Ekstrand
- for (unsigned j = *chunk_start; j <= uniform; j++)
- push_constant_loc[j] = (*num_push_constants)++;
- } else {
- for (unsigned j = *chunk_start; j <= uniform; j++)
- pull_constant_loc[j] = (*num_pull_constants)++;
- }
-
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- }
-}
-
static int
get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
{
@@ -1977,6 +1921,98 @@ get_subgroup_id_param_index(const
brw_stage_prog_data *prog_data)
Post by Jason Ekstrand
}
/**
+ * Struct for handling complex alignments.
+ *
+ * A complex alignment is stored as multiplier and an offset. A value
is
Post by Jason Ekstrand
+ * considered to be aligned if it is congruent to the offset modulo the
+ * multiplier.
+ */
+struct cplx_align {
+ unsigned mul:4;
+ unsigned offset:4;
+};
+
+#define CPLX_ALIGN_MAX_MUL 8
+
+static void
+cplx_align_assert_sane(struct cplx_align a)
+{
+ assert(a.mul > 0 && util_is_power_of_two(a.mul));
+ assert(a.offset < a.mul);
+}
+
+/**
+ * Combines two alignments to produce a least multiple of sorts.
+ *
+ * The returned alignment is the smallest (in terms of multiplier) such
that
Post by Jason Ekstrand
+ * anything aligned to both a and b will be aligned to the new
alignment.
Post by Jason Ekstrand
+ * This function will assert-fail if a and b are not compatible, i.e.
if the
Post by Jason Ekstrand
+ * offset parameters are such that no common alignment is possible.
+ */
+static struct cplx_align
+cplx_align_combine(struct cplx_align a, struct cplx_align b)
+{
+ cplx_align_assert_sane(a);
+ cplx_align_assert_sane(b);
+
+ /* Assert that the alignments agree. */
+ assert((a.offset & (b.mul - 1)) == (b.offset & (a.mul - 1)));
+
+ return a.mul > b.mul ? a : b;
+}
+
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or equal
to
Post by Jason Ekstrand
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
+}
+
+#define UNIFORM_SLOT_SIZE 4
+
+struct uniform_slot_info {
+ /** True if the given uniform slot is live */
+ unsigned is_live:1;
+
+ /** True if this slot and the slot must remain contiguous */
Missing a word? Maybe "...this slot and the slot after must..."?
Yes, "slot after" and "next slot" both work. I'll fix that up.
Post by Jason Ekstrand
Post by Jason Ekstrand
+ unsigned contiguous:1;
+
+ struct cplx_align align;
+};
+
+static void
+mark_uniform_slots_read(struct uniform_slot_info *slots,
+ unsigned num_slots, unsigned alignment)
+{
+ assert(alignment > 0 && util_is_power_of_two(alignment));
+ assert(alignment <= CPLX_ALIGN_MAX_MUL);
+
+ /* We can't align a slot to anything less than the slot size */
+ alignment = MAX2(alignment, UNIFORM_SLOT_SIZE);
+
+ struct cplx_align align = {alignment, 0};
+ cplx_align_assert_sane(align);
+
+ for (unsigned i = 0; i < num_slots; i++) {
+ slots[i].is_live = true;
+ if (i < num_slots - 1)
+ slots[i].contiguous = true;
+
+ align.offset = (i * UNIFORM_SLOT_SIZE) & (align.mul - 1);
+ if (slots[i].align.mul == 0) {
+ slots[i].align = align;
+ } else {
+ slots[i].align = cplx_align_combine(slots[i].align, align);
+ }
+ }
+}
+
+/**
* Assign UNIFORM file registers to either push constants or pull
constants.
Post by Jason Ekstrand
*
* We allow a fragment shader to have more than the specified minimum
@@ -1994,60 +2030,44 @@ fs_visitor::assign_constant_locations()
return;
}
- bool is_live[uniforms];
- memset(is_live, 0, sizeof(is_live));
- unsigned bitsize_access[uniforms];
- memset(bitsize_access, 0, sizeof(bitsize_access));
-
- /* For each uniform slot, a value of true indicates that the given
slot and
Post by Jason Ekstrand
- * the next slot must remain contiguous. This is used to keep us
from
Post by Jason Ekstrand
- * splitting arrays and 64-bit values apart.
- */
- bool contiguous[uniforms];
- memset(contiguous, 0, sizeof(contiguous));
+ struct uniform_slot_info slots[uniforms];
+ memset(slots, 0, sizeof(slots));
- *
- * 1) Figure out which uniforms are live.
- *
- * 2) Mark any indirectly used ranges of registers as contiguous.
- *
- * Note that we don't move constant-indexed accesses to arrays. No
- * testing has been done of the performance impact of this choice.
- */
foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
for (int i = 0 ; i < inst->sources; i++) {
if (inst->src[i].file != UNIFORM)
continue;
- int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
+ /* NIR tightly packs things so the uniform number might not be
+ * aligned (if we have a double right after a float, for instance).
+ * This is fine because the process of re-arranging them will ensure
+ * that things are properly aligned. The offset into that uniform,
+ * however, must be aligned.
What about types whose size is less than 32 bits (i.e., smaller than the
uniform slot size)? Take, for example, individual components of 16-bit
vectors. Their offsets are not necessarily aligned on 32-bit slot boundaries.
I did think about that. :-) My plan was that they would have their uniform
nr set to floor(offset/4) with an offset of offset % 4. Then their slot
would get aligned to MAX(2, UNIFORM_SLOT_SIZE) = 4 (see
mark_uniform_slots_read above). Assuming things are reasonably aligned
(i.e., no one puts a float at an offset of 2 bytes), this should work out
ok.
Right, I did notice the MAX2() in mark_uniform_slots_read(), and as I said
below, things here seem to work as well. Mostly I was thinking about the
division inst->src[i].offset / UNIFORM_SLOT_SIZE. A little note saying that
the offset may not be slot-aligned but that it doesn't need to be: we only
care whether the slot itself is used, and it is insignificant which part of
the slot the instruction sources.

Took me a while to grasp things but this is actually clearer than what we had
Post by Jason Ekstrand
Unfortunately, I see that we still have asserts that byte_offset % 4 == 0
in our handling of nir_intrinsic_load_uniform and we're passing the tests.
They must be pretty bad.
Yeah, I've been wondering how many things get exercised. I seemed to hit a
number of things in liveness analysis and optimization passes with SIMD8
and 16-bit regs with GLSL. Then again, I don't have a clear picture of how
much the 16-bit storage support requires from the 16-bit compiler
infrastructure.
Post by Jason Ekstrand
--Jason
Post by Jason Ekstrand
See also related question a few lines further.
Post by Jason Ekstrand
+ *
+ * In Vulkan, we have explicit offsets but everything is
crammed
Post by Jason Ekstrand
+ * into a single "variable" so inst->src[i].nr will always be
0.
Post by Jason Ekstrand
+ * Everything will be properly aligned relative to that one
base.
Post by Jason Ekstrand
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) == 0);
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;
This effectively gives the same uniform slot no matter which individual
byte(s) within are addressed. The logic here is about slots so this seems
fine. Just wanted to check that this is your intention. And if so, should we
comment here a little?
Post by Jason Ekstrand
+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
+ slots_read = DIV_ROUND_UP(inst->src[2].ud,
UNIFORM_SLOT_SIZE);
Post by Jason Ekstrand
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read, UNIFORM_SLOT_SIZE);
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}
@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms * sizeof(*pull_constant_loc));
int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly aligned */
- const unsigned uniform_64_bit_size = type_sz(BRW_REGISTER_TYPE_DF);
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants,
&num_pull_constants,
Post by Jason Ekstrand
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push register. */
+ if (subgroup_id_index == (int)u)
+ continue;
- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) * UNIFORM_SLOT_SIZE;
- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size = type_sz(BRW_REGISTER_TYPE_F);
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it is
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);
- /* Skip subgroup_id_index to put it in the last push register. */
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;
- set_push_pull_constant_loc(u, &chunk_start, &max_chunk_bitsize,
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc, pull_constant_loc,
- &num_push_constants,
&num_pull_constants,
Post by Jason Ekstrand
- max_push_components, max_chunk_size,
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align,
num_push_constants);
Post by Jason Ekstrand
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] = num_push_constants++;
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align,
num_pull_constants);
Post by Jason Ekstrand
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] = num_pull_constants++;
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}
/* Add the CS local thread ID uniform at the end of the push
constants */
Post by Jason Ekstrand
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
- num_pull_constants);
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+ num_pull_constants);
}
/* Now that we know how many regular uniforms we'll push, reduce the
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-08 23:18:36 UTC
Reply
Permalink
Raw Message
On Fri, Dec 8, 2017 at 10:04 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
Post by Jason Ekstrand
On Fri, Dec 8, 2017 at 12:40 AM, Pohjolainen, Topi <
Post by Jason Ekstrand
Post by Jason Ekstrand
This rewires the logic for assigning uniform locations to work in
terms
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
of "complex alignments". The basic idea is that, as we walk the
list of
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
instructions, we keep track of the alignment and continuity
requirements
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
of each slot and assert that the alignments all match up. We then
use
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
those alignments in the compaction stage to ensure that everything
gets
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
placed at a properly aligned register. The old mechanism handled
alignments by special-casing each of the bit sizes and placing 64-bit
values first followed by 32-bit values.
The old scheme had the advantage of never leaving a hole since all
the
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
64-bit values could be tightly packed and so could the 32-bit values.
However, the new scheme has no type size special cases so it handles
not
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
only 32 and 64-bit types but should gracefully extend to 16 and 8-bit
types as the need arises.
---
src/intel/compiler/brw_fs.cpp | 307 ++++++++++++++++++++++++------
------------
Post by Jason Ekstrand
1 file changed, 174 insertions(+), 133 deletions(-)
diff --git a/src/intel/compiler/brw_fs.cpp
b/src/intel/compiler/brw_fs.
Post by Jason Ekstrand
Post by Jason Ekstrand
cpp
Post by Jason Ekstrand
index 93bb6b4..41260b4 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -1906,62 +1906,6 @@ fs_visitor::compact_virtual_grfs()
return progress;
}
-static void
-set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
- unsigned *max_chunk_bitsize,
- bool contiguous, unsigned bitsize,
- const unsigned target_bitsize,
- int *push_constant_loc, int
*pull_constant_loc,
Post by Jason Ekstrand
- unsigned *num_push_constants,
- unsigned *num_pull_constants,
- const unsigned max_push_components,
- const unsigned max_chunk_size,
- bool allow_pull_constants,
- struct brw_stage_prog_data
*stage_prog_data)
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
-{
- /* This is the first live uniform in the chunk */
- if (*chunk_start < 0)
- *chunk_start = uniform;
-
- /* Keep track of the maximum bit size access in contiguous
uniforms
Post by Jason Ekstrand
Post by Jason Ekstrand
*/
Post by Jason Ekstrand
- *max_chunk_bitsize = MAX2(*max_chunk_bitsize, bitsize);
-
- /* If this element does not need to be contiguous with the next,
we
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- * split at this point and everything between chunk_start and u
forms a
Post by Jason Ekstrand
- * single chunk.
- */
- if (!contiguous) {
- /* If bitsize doesn't match the target one, skip it */
- if (*max_chunk_bitsize != target_bitsize) {
- /* FIXME: right now we only support 32 and 64-bit accesses
*/
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- return;
- }
-
- unsigned chunk_size = uniform - *chunk_start + 1;
-
- /* Decide whether we should push or pull this parameter. In
the
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- * Vulkan driver, push constants are explicitly exposed via
the
Post by Jason Ekstrand
Post by Jason Ekstrand
API
Post by Jason Ekstrand
- * so we push everything. In GL, we only push small arrays.
- */
- if (!allow_pull_constants ||
- (*num_push_constants + chunk_size <= max_push_components
&&
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- chunk_size <= max_chunk_size)) {
- assert(*num_push_constants + chunk_size <=
max_push_components);
Post by Jason Ekstrand
- for (unsigned j = *chunk_start; j <= uniform; j++)
- push_constant_loc[j] = (*num_push_constants)++;
- } else {
- for (unsigned j = *chunk_start; j <= uniform; j++)
- pull_constant_loc[j] = (*num_pull_constants)++;
- }
-
- *max_chunk_bitsize = 0;
- *chunk_start = -1;
- }
-}
-
static int
get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
{
@@ -1977,6 +1921,98 @@ get_subgroup_id_param_index(const
brw_stage_prog_data *prog_data)
Post by Jason Ekstrand
}
/**
+ * Struct for handling complex alignments.
+ *
+ * A complex alignment is stored as multiplier and an offset. A
value
Post by Jason Ekstrand
Post by Jason Ekstrand
is
Post by Jason Ekstrand
+ * considered to be aligned if it is congruent to the offset modulo
the
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ * multiplier.
+ */
+struct cplx_align {
+ unsigned mul:4;
+ unsigned offset:4;
+};
+
+#define CPLX_ALIGN_MAX_MUL 8
+
+static void
+cplx_align_assert_sane(struct cplx_align a)
+{
+ assert(a.mul > 0 && util_is_power_of_two(a.mul));
+ assert(a.offset < a.mul);
+}
+
+/**
+ * Combines two alignments to produce a least multiple of sorts.
+ *
+ * The returned alignment is the smallest (in terms of multiplier)
such
Post by Jason Ekstrand
Post by Jason Ekstrand
that
Post by Jason Ekstrand
+ * anything aligned to both a and b will be aligned to the new
alignment.
Post by Jason Ekstrand
+ * This function will assert-fail if a and b are not compatible,
i.e.
Post by Jason Ekstrand
Post by Jason Ekstrand
if the
Post by Jason Ekstrand
+ * offset parameters are such that no common alignment is possible.
+ */
+static struct cplx_align
+cplx_align_combine(struct cplx_align a, struct cplx_align b)
+{
+ cplx_align_assert_sane(a);
+ cplx_align_assert_sane(b);
+
+ /* Assert that the alignments agree. */
+ assert((a.offset & (b.mul - 1)) == (b.offset & (a.mul - 1)));
+
+ return a.mul > b.mul ? a : b;
+}
+
+/**
+ * Apply a complex alignment
+ *
+ * This function will return the smallest number greater than or
equal
Post by Jason Ekstrand
Post by Jason Ekstrand
to
Post by Jason Ekstrand
+ * offset that is aligned to align.
+ */
+static unsigned
+cplx_align_apply(struct cplx_align align, unsigned offset)
+{
+ return ALIGN(offset - align.offset, align.mul) + align.offset;
+}
+
+#define UNIFORM_SLOT_SIZE 4
+
+struct uniform_slot_info {
+ /** True if the given uniform slot is live */
+ unsigned is_live:1;
+
+ /** True if this slot and the slot must remain contiguous */
Missing a word? Maybe "...this slot and the slot after must..."?
Yes, "slot after" and "next slot" both work. I'll fix that up.
Post by Jason Ekstrand
Post by Jason Ekstrand
+ unsigned contiguous:1;
+
+ struct cplx_align align;
+};
+
+static void
+mark_uniform_slots_read(struct uniform_slot_info *slots,
+ unsigned num_slots, unsigned alignment)
+{
+ assert(alignment > 0 && util_is_power_of_two(alignment));
+ assert(alignment <= CPLX_ALIGN_MAX_MUL);
+
+ /* We can't align a slot to anything less than the slot size */
+ alignment = MAX2(alignment, UNIFORM_SLOT_SIZE);
+
+ struct cplx_align align = {alignment, 0};
+ cplx_align_assert_sane(align);
+
+ for (unsigned i = 0; i < num_slots; i++) {
+ slots[i].is_live = true;
+ if (i < num_slots - 1)
+ slots[i].contiguous = true;
+
+ align.offset = (i * UNIFORM_SLOT_SIZE) & (align.mul - 1);
+ if (slots[i].align.mul == 0) {
+ slots[i].align = align;
+ } else {
+ slots[i].align = cplx_align_combine(slots[i].align,
align);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ }
+ }
+}
+
+/**
* Assign UNIFORM file registers to either push constants or pull
constants.
Post by Jason Ekstrand
*
* We allow a fragment shader to have more than the specified
minimum
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
@@ -1994,60 +2030,44 @@ fs_visitor::assign_constant_locations()
return;
}
- bool is_live[uniforms];
- memset(is_live, 0, sizeof(is_live));
- unsigned bitsize_access[uniforms];
- memset(bitsize_access, 0, sizeof(bitsize_access));
-
- /* For each uniform slot, a value of true indicates that the
given
Post by Jason Ekstrand
Post by Jason Ekstrand
slot and
Post by Jason Ekstrand
- * the next slot must remain contiguous. This is used to keep us
from
Post by Jason Ekstrand
- * splitting arrays and 64-bit values apart.
- */
- bool contiguous[uniforms];
- memset(contiguous, 0, sizeof(contiguous));
+ struct uniform_slot_info slots[uniforms];
+ memset(slots, 0, sizeof(slots));
- *
- * 1) Figure out which uniforms are live.
- *
- * 2) Mark any indirectly used ranges of registers as
contiguous.
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- *
- * Note that we don't move constant-indexed accesses to arrays.
No
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- * testing has been done of the performance impact of this
choice.
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- */
foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
for (int i = 0 ; i < inst->sources; i++) {
if (inst->src[i].file != UNIFORM)
continue;
- int constant_nr = inst->src[i].nr + inst->src[i].offset /
4;
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ /* NIR tightly packs things so the uniform number might
not be
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ * aligned (if we have a double right after a float, for
instance).
Post by Jason Ekstrand
+ * This is fine because the process of re-arranging them
will
Post by Jason Ekstrand
Post by Jason Ekstrand
ensure
Post by Jason Ekstrand
+ * that things are properly aligned. The offset into that
uniform,
Post by Jason Ekstrand
+ * however, must be aligned.
What about types whose size is less than 32 bits (i.e., smaller than the
uniform slot size)? Take, for example, individual components of 16-bit
vectors. Their offsets are not necessarily aligned on 32-bit slot boundaries.
I did think about that. :-) My plan was that they would have their
uniform
Post by Jason Ekstrand
nr set to floor(offset/4) with an offset of offset % 4. Then their slot
would get aligned to MAX(2, UNIFORM_SLOT_SIZE) = 4 (see
mark_uniform_slots_read above). Assuming things are reasonably aligned
(i.e., no one puts a float at an offset of 2 bytes), this should work out
ok.
Right, I did notice the MAX2() in mark_uniform_slots_read(), and as I said
below, things here seem to work as well. Mostly I was thinking about the
division inst->src[i].offset / UNIFORM_SLOT_SIZE. A little note saying that
the offset may not be slot-aligned but that it doesn't need to be: we only
care whether the slot itself is used, and it is insignificant which part of
the slot the instruction sources.
Took me a while to grasp things but this is actually clearer than what we had
Post by Jason Ekstrand
Unfortunately, I see that we still have asserts that byte_offset % 4 == 0
in our handling of nir_intrinsic_load_uniform and we're passing the
tests.
Post by Jason Ekstrand
They must be pretty bad.
Yeah, I've been wondering how many things get exercised. I seemed to hit a
number of things in liveness analysis and optimization passes with SIMD8
and 16-bit regs with GLSL. Then again, I don't have a clear picture of how
much the 16-bit storage support requires from the 16-bit compiler
infrastructure.
I just realized why this is Ok. Vulkan has a single variable for all push
constants and it's at offset 0. That said, we also have an assert that the
offset is a multiple of 4 and that definitely doesn't hold.

--Jason
Post by Jason Ekstrand
Post by Jason Ekstrand
--Jason
Post by Jason Ekstrand
See also related question a few lines further.
Post by Jason Ekstrand
+ *
+ * In Vulkan, we have explicit offsets but everything is
crammed
Post by Jason Ekstrand
+ * into a single "variable" so inst->src[i].nr will always
be
Post by Jason Ekstrand
Post by Jason Ekstrand
0.
Post by Jason Ekstrand
+ * Everything will be properly aligned relative to that one
base.
Post by Jason Ekstrand
+ */
+ assert(inst->src[i].offset % type_sz(inst->src[i].type) ==
0);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+
+ unsigned u = inst->src[i].nr +
+ inst->src[i].offset / UNIFORM_SLOT_SIZE;
This effectively gives the same uniform slot no matter which individual
byte(s) within are addressed. The logic here is about slots so this
seems
Post by Jason Ekstrand
Post by Jason Ekstrand
fine. Just wanted to check that this is your intention. And if so,
should
Post by Jason Ekstrand
Post by Jason Ekstrand
we
comment here a little?
Post by Jason Ekstrand
+ if (u >= uniforms)
+ continue;
+
+ unsigned slots_read;
if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
- assert(inst->src[2].ud % 4 == 0);
- unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
- assert(last < uniforms);
-
- for (unsigned j = constant_nr; j < last; j++) {
- is_live[j] = true;
- contiguous[j] = true;
- bitsize_access[j] = MAX2(bitsize_access[j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- is_live[last] = true;
- bitsize_access[last] = MAX2(bitsize_access[last],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
+ slots_read = DIV_ROUND_UP(inst->src[2].ud,
UNIFORM_SLOT_SIZE);
Post by Jason Ekstrand
} else {
- if (constant_nr >= 0 && constant_nr < (int) uniforms) {
- int regs_read = inst->components_read(i) *
- type_sz(inst->src[i].type) / 4;
- assert(regs_read <= 2);
- if (regs_read == 2)
- contiguous[constant_nr] = true;
- for (int j = 0; j < regs_read; j++) {
- is_live[constant_nr + j] = true;
- bitsize_access[constant_nr + j] =
- MAX2(bitsize_access[constant_nr + j],
type_sz(inst->src[i].type));
Post by Jason Ekstrand
- }
- }
+ unsigned bytes_read = inst->components_read(i) *
+ type_sz(inst->src[i].type);
+ slots_read = DIV_ROUND_UP(bytes_read,
UNIFORM_SLOT_SIZE);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
}
+
+ assert(u + slots_read <= uniforms);
+ mark_uniform_slots_read(&slots[u], slots_read,
+ type_sz(inst->src[i].type));
}
}
@@ -2082,43 +2102,64 @@ fs_visitor::assign_constant_locations()
memset(pull_constant_loc, -1, uniforms *
sizeof(*pull_constant_loc));
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
int chunk_start = -1;
- unsigned max_chunk_bitsize = 0;
-
- /* First push 64-bit uniforms to ensure they are properly
aligned */
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- const unsigned uniform_64_bit_size =
type_sz(BRW_REGISTER_TYPE_DF);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ struct cplx_align align;
for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
+ if (!slots[u].is_live) {
+ assert(chunk_start == -1);
continue;
+ }
- set_push_pull_constant_loc(u, &chunk_start,
&max_chunk_bitsize,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- contiguous[u], bitsize_access[u],
- uniform_64_bit_size,
- push_constant_loc,
pull_constant_loc,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- &num_push_constants,
&num_pull_constants,
Post by Jason Ekstrand
- max_push_components,
max_chunk_size,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Skip subgroup_id_index to put it in the last push
register. */
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ if (subgroup_id_index == (int)u)
+ continue;
- }
+ if (chunk_start == -1) {
+ chunk_start = u;
+ align = slots[u].align;
+ } else {
+ /* Offset into the chunk */
+ unsigned chunk_offset = (u - chunk_start) *
UNIFORM_SLOT_SIZE;
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- /* Then push the rest of uniforms */
- const unsigned uniform_32_bit_size =
type_sz(BRW_REGISTER_TYPE_F);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- for (unsigned u = 0; u < uniforms; u++) {
- if (!is_live[u])
- continue;
+ /* Shift the slot alignment down by the chunk offset so it
is
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ * comparable with the base chunk alignment.
+ */
+ struct cplx_align slot_align = slots[u].align;
+ slot_align.offset =
+ (slot_align.offset - chunk_offset) & (align.mul - 1);
- /* Skip subgroup_id_index to put it in the last push
register. */
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- if (subgroup_id_index == (int)u)
+ align = cplx_align_combine(align, slot_align);
+ }
+
+ /* Sanity check the alignment */
+ cplx_align_assert_sane(align);
+
+ if (slots[u].contiguous)
continue;
- set_push_pull_constant_loc(u, &chunk_start,
&max_chunk_bitsize,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- contiguous[u], bitsize_access[u],
- uniform_32_bit_size,
- push_constant_loc,
pull_constant_loc,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- &num_push_constants,
&num_pull_constants,
Post by Jason Ekstrand
- max_push_components,
max_chunk_size,
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
- compiler->supports_pull_constants,
- stage_prog_data);
+ /* Adjust the alignment to be in terms of slots, not bytes */
+ assert((align.mul & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ assert((align.offset & (UNIFORM_SLOT_SIZE - 1)) == 0);
+ align.mul /= UNIFORM_SLOT_SIZE;
+ align.offset /= UNIFORM_SLOT_SIZE;
+
+ unsigned push_start_align = cplx_align_apply(align,
num_push_constants);
Post by Jason Ekstrand
+ unsigned chunk_size = u - chunk_start + 1;
+ if (!compiler->supports_pull_constants ||
+ (chunk_size < max_chunk_size &&
+ push_start_align + chunk_size <= max_push_components)) {
+ /* Align up the number of push constants */
+ num_push_constants = push_start_align;
+ for (unsigned i = 0; i < chunk_size; i++)
+ push_constant_loc[chunk_start + i] =
num_push_constants++;
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ } else {
+ /* We need to pull this one */
+ num_pull_constants = cplx_align_apply(align,
num_pull_constants);
Post by Jason Ekstrand
+ for (unsigned i = 0; i < chunk_size; i++)
+ pull_constant_loc[chunk_start + i] =
num_pull_constants++;
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ }
+
+ /* Reset the chunk and start again */
+ chunk_start = -1;
}
/* Add the CS local thread ID uniform at the end of the push
constants */
Post by Jason Ekstrand
@@ -2131,8 +2172,8 @@ fs_visitor::assign_constant_locations()
uint32_t *param = stage_prog_data->param;
stage_prog_data->nr_params = num_push_constants;
if (num_push_constants) {
- stage_prog_data->param = ralloc_array(mem_ctx, uint32_t,
- num_push_constants);
+ stage_prog_data->param = rzalloc_array(mem_ctx, uint32_t,
+ num_push_constants);
} else {
stage_prog_data->param = NULL;
}
@@ -2140,8 +2181,8 @@ fs_visitor::assign_constant_locations()
assert(stage_prog_data->pull_param == NULL);
if (num_pull_constants > 0) {
stage_prog_data->nr_pull_params = num_pull_constants;
- stage_prog_data->pull_param = ralloc_array(mem_ctx, uint32_t,
-
num_pull_constants);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
+ stage_prog_data->pull_param = rzalloc_array(mem_ctx, uint32_t,
+
num_pull_constants);
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
}
/* Now that we know how many regular uniforms we'll push, reduce
the
Post by Jason Ekstrand
Post by Jason Ekstrand
Post by Jason Ekstrand
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:20 UTC
Reply
Permalink
Raw Message
From: Chema Casanova <***@igalia.com>

Enables the storagePushConstant16 feature of VK_KHR_16bit_storage
for Gen8+.

Reviewed-by: Jason Ekstrand <***@jlekstrand.net>
---
src/intel/vulkan/anv_device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
index 81a2ed6..fd5326b 100644
--- a/src/intel/vulkan/anv_device.c
+++ b/src/intel/vulkan/anv_device.c
@@ -747,7 +747,7 @@ void anv_GetPhysicalDeviceFeatures2KHR(

features->storageBuffer16BitAccess = pdevice->info.gen >= 8;
features->uniformAndStorageBuffer16BitAccess = pdevice->info.gen >= 8;
- features->storagePushConstant16 = false;
+ features->storagePushConstant16 = pdevice->info.gen >= 8;
features->storageInputOutput16 = false;
break;
}
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:21 UTC
Reply
Permalink
Raw Message
We want to call brw_nir_analyze_ubo_ranges immediately after
anv_nir_apply_pipeline_layout and it badly wants constants. We could
run an optimization step and let constant folding do it but that's way
more expensive than needed. It's really easy to just handle constants
in apply_pipeline_layout.
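
For illustration, here is a stand-alone sketch of the idea (not the actual
NIR lowering; the function name is made up): when the array index is a
compile-time constant, the bounds clamp and the add can be folded right
away instead of emitting umin/iadd instructions.

  #include <stdio.h>

  static unsigned
  resolve_block_index(unsigned surface_index, unsigned array_index,
                      unsigned array_size)
  {
     /* Same clamp the runtime bounds check would do */
     if (array_index > array_size - 1)
        array_index = array_size - 1;
     return surface_index + array_index;
  }

  int main(void)
  {
     printf("%u\n", resolve_block_index(8, 3, 4));   /* 11 */
     printf("%u\n", resolve_block_index(8, 9, 4));   /* clamped to 11 */
     return 0;
  }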

Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/anv_nir_apply_pipeline_layout.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/intel/vulkan/anv_nir_apply_pipeline_layout.c b/src/intel/vulkan/anv_nir_apply_pipeline_layout.c
index 612b3f7..978a8a5 100644
--- a/src/intel/vulkan/anv_nir_apply_pipeline_layout.c
+++ b/src/intel/vulkan/anv_nir_apply_pipeline_layout.c
@@ -116,12 +116,21 @@ lower_res_index_intrinsic(nir_intrinsic_instr *intrin,
uint32_t array_size =
state->layout->set[set].layout->binding[binding].array_size;

- nir_ssa_def *block_index = nir_ssa_for_src(b, intrin->src[0], 1);
+ nir_const_value *const_array_index = nir_src_as_const_value(intrin->src[0]);

- if (state->add_bounds_checks)
- block_index = nir_umin(b, block_index, nir_imm_int(b, array_size - 1));
+ nir_ssa_def *block_index;
+ if (const_array_index) {
+ unsigned array_index = const_array_index->u32[0];
+ array_index = MIN2(array_index, array_size - 1);
+ block_index = nir_imm_int(b, surface_index + array_index);
+ } else {
+ block_index = nir_ssa_for_src(b, intrin->src[0], 1);

- block_index = nir_iadd(b, nir_imm_int(b, surface_index), block_index);
+ if (state->add_bounds_checks)
+ block_index = nir_umin(b, block_index, nir_imm_int(b, array_size - 1));
+
+ block_index = nir_iadd(b, nir_imm_int(b, surface_index), block_index);
+ }

assert(intrin->dest.is_ssa);
nir_ssa_def_rewrite_uses(&intrin->dest.ssa, nir_src_for_ssa(block_index));
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:22 UTC
Reply
Permalink
Raw Message
Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/genX_cmd_buffer.c | 45 ++++++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c
index ab5590d..e4362d1 100644
--- a/src/intel/vulkan/genX_cmd_buffer.c
+++ b/src/intel/vulkan/genX_cmd_buffer.c
@@ -1432,6 +1432,35 @@ cmd_buffer_alloc_push_constants(struct anv_cmd_buffer *cmd_buffer)
cmd_buffer->state.push_constants_dirty |= VK_SHADER_STAGE_ALL_GRAPHICS;
}

+static const struct anv_descriptor *
+anv_descriptor_for_binding(const struct anv_cmd_buffer *cmd_buffer,
+ const struct anv_pipeline_binding *binding)
+{
+ assert(binding->set < MAX_SETS);
+ const struct anv_descriptor_set *set =
+ cmd_buffer->state.descriptors[binding->set];
+ const uint32_t offset =
+ set->layout->binding[binding->binding].descriptor_index;
+ return &set->descriptors[offset + binding->index];
+}
+
+static uint32_t
+dynamic_offset_for_binding(const struct anv_cmd_buffer *cmd_buffer,
+ const struct anv_pipeline *pipeline,
+ const struct anv_pipeline_binding *binding)
+{
+ assert(binding->set < MAX_SETS);
+ const struct anv_descriptor_set *set =
+ cmd_buffer->state.descriptors[binding->set];
+
+ uint32_t dynamic_offset_idx =
+ pipeline->layout->set[binding->set].dynamic_offset_start +
+ set->layout->binding[binding->binding].dynamic_offset_index +
+ binding->index;
+
+ return cmd_buffer->state.dynamic_offsets[dynamic_offset_idx];
+}
+
static VkResult
emit_binding_table(struct anv_cmd_buffer *cmd_buffer,
gl_shader_stage stage,
@@ -1534,10 +1563,8 @@ emit_binding_table(struct anv_cmd_buffer *cmd_buffer,
continue;
}

- struct anv_descriptor_set *set =
- cmd_buffer->state.descriptors[binding->set];
- uint32_t offset = set->layout->binding[binding->binding].descriptor_index;
- struct anv_descriptor *desc = &set->descriptors[offset + binding->index];
+ const struct anv_descriptor *desc =
+ anv_descriptor_for_binding(cmd_buffer, binding);

switch (desc->type) {
case VK_DESCRIPTOR_TYPE_SAMPLER:
@@ -1611,14 +1638,10 @@ emit_binding_table(struct anv_cmd_buffer *cmd_buffer,

case VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC:
case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC: {
- uint32_t dynamic_offset_idx =
- pipeline->layout->set[binding->set].dynamic_offset_start +
- set->layout->binding[binding->binding].dynamic_offset_index +
- binding->index;
-
/* Compute the offset within the buffer */
- uint64_t offset = desc->offset +
- cmd_buffer->state.dynamic_offsets[dynamic_offset_idx];
+ uint32_t dynamic_offset =
+ dynamic_offset_for_binding(cmd_buffer, pipeline, binding);
+ uint64_t offset = desc->offset + dynamic_offset;
/* Clamp to the buffer size */
offset = MIN2(offset, desc->buffer->size);
/* Clamp the range to the buffer size */
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:23 UTC
Reply
Permalink
Raw Message
There are several places where we look up opcodes in an array of stages.
Assert that we don't end up going out of bounds.

Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/genX_cmd_buffer.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c
index e4362d1..16b4ca6 100644
--- a/src/intel/vulkan/genX_cmd_buffer.c
+++ b/src/intel/vulkan/genX_cmd_buffer.c
@@ -1822,6 +1822,9 @@ cmd_buffer_emit_descriptor_pointers(struct anv_cmd_buffer *cmd_buffer,
};

anv_foreach_stage(s, stages) {
+ assert(s < ARRAY_SIZE(binding_table_opcodes));
+ assert(binding_table_opcodes[s] > 0);
+
if (cmd_buffer->state.samplers[s].alloc_size > 0) {
anv_batch_emit(&cmd_buffer->batch,
GENX(3DSTATE_SAMPLER_STATE_POINTERS_VS), ssp) {
@@ -1858,6 +1861,9 @@ cmd_buffer_flush_push_constants(struct anv_cmd_buffer *cmd_buffer)
if (stage == MESA_SHADER_COMPUTE)
continue;

+ assert(stage < ARRAY_SIZE(push_constant_opcodes));
+ assert(push_constant_opcodes[stage] > 0);
+
struct anv_state state = anv_cmd_buffer_push_constants(cmd_buffer, stage);

if (state.offset == 0) {
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:24 UTC
Reply
Permalink
Raw Message
In order to do this we have to modify push constant setup to handle
ranges. We also have to tweak the way we handle dirty bits a bit so
that we re-push whenever a descriptor set changes.
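
As a stand-alone illustration of that interaction (names are made up; the
real logic is in the genX_cmd_buffer.c hunk below): any stage whose
descriptors get flushed must also be re-pushed, because pushed UBO ranges
are read through those descriptors.

  #include <stdio.h>

  #define STAGE_VERTEX   (1u << 0)
  #define STAGE_FRAGMENT (1u << 1)
  #define ALL_GRAPHICS   (STAGE_VERTEX | STAGE_FRAGMENT)

  int main(void)
  {
     unsigned descriptors_dirty    = STAGE_FRAGMENT; /* a set was rebound  */
     unsigned push_constants_dirty = STAGE_VERTEX;   /* vkCmdPushConstants */

     /* Re-push the union of both dirty masks, restricted to graphics. */
     unsigned push_stages = (descriptors_dirty | push_constants_dirty)
                            & ALL_GRAPHICS;

     printf("re-pushing stages mask: 0x%x\n", push_stages); /* 0x3 */
     return 0;
  }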

Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/genX_cmd_buffer.c | 142 ++++++++++++++++++++++++++++---------
src/intel/vulkan/genX_pipeline.c | 3 +-
2 files changed, 112 insertions(+), 33 deletions(-)

diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c
index 16b4ca6..0bd3874 100644
--- a/src/intel/vulkan/genX_cmd_buffer.c
+++ b/src/intel/vulkan/genX_cmd_buffer.c
@@ -1843,9 +1843,12 @@ cmd_buffer_emit_descriptor_pointers(struct anv_cmd_buffer *cmd_buffer,
}
}

-static uint32_t
-cmd_buffer_flush_push_constants(struct anv_cmd_buffer *cmd_buffer)
+static void
+cmd_buffer_flush_push_constants(struct anv_cmd_buffer *cmd_buffer,
+ VkShaderStageFlags dirty_stages)
{
+ UNUSED const struct anv_pipeline *pipeline = cmd_buffer->state.pipeline;
+
static const uint32_t push_constant_opcodes[] = {
[MESA_SHADER_VERTEX] = 21,
[MESA_SHADER_TESS_CTRL] = 25, /* HS */
@@ -1857,39 +1860,117 @@ cmd_buffer_flush_push_constants(struct anv_cmd_buffer *cmd_buffer)

VkShaderStageFlags flushed = 0;

- anv_foreach_stage(stage, cmd_buffer->state.push_constants_dirty) {
- if (stage == MESA_SHADER_COMPUTE)
- continue;
-
+ anv_foreach_stage(stage, dirty_stages) {
assert(stage < ARRAY_SIZE(push_constant_opcodes));
assert(push_constant_opcodes[stage] > 0);

- struct anv_state state = anv_cmd_buffer_push_constants(cmd_buffer, stage);
+ anv_batch_emit(&cmd_buffer->batch, GENX(3DSTATE_CONSTANT_VS), c) {
+ c._3DCommandSubOpcode = push_constant_opcodes[stage];

- if (state.offset == 0) {
- anv_batch_emit(&cmd_buffer->batch, GENX(3DSTATE_CONSTANT_VS), c)
- c._3DCommandSubOpcode = push_constant_opcodes[stage];
- } else {
- anv_batch_emit(&cmd_buffer->batch, GENX(3DSTATE_CONSTANT_VS), c) {
- c._3DCommandSubOpcode = push_constant_opcodes[stage],
- c.ConstantBody = (struct GENX(3DSTATE_CONSTANT_BODY)) {
-#if GEN_GEN >= 9
- .Buffer[2] = { &cmd_buffer->device->dynamic_state_pool.block_pool.bo, state.offset },
- .ReadLength[2] = DIV_ROUND_UP(state.alloc_size, 32),
+ if (anv_pipeline_has_stage(cmd_buffer->state.pipeline, stage)) {
+#if GEN_GEN >= 8 || GEN_IS_HASWELL
+ const struct brw_stage_prog_data *prog_data =
+ pipeline->shaders[stage]->prog_data;
+ const struct anv_pipeline_bind_map *bind_map =
+ &pipeline->shaders[stage]->bind_map;
+
+ /* The Skylake PRM contains the following restriction:
+ *
+ * "The driver must ensure The following case does not occur
+ * without a flush to the 3D engine: 3DSTATE_CONSTANT_* with
+ * buffer 3 read length equal to zero committed followed by a
+ * 3DSTATE_CONSTANT_* with buffer 0 read length not equal to
+ * zero committed."
+ *
+ * To avoid this, we program the buffers in the highest slots.
+ * This way, slot 0 is only used if slot 3 is also used.
+ */
+ int n = 3;
+
+ for (int i = 3; i >= 0; i--) {
+ const struct brw_ubo_range *range = &prog_data->ubo_ranges[i];
+ if (range->length == 0)
+ continue;
+
+ const unsigned surface =
+ prog_data->binding_table.ubo_start + range->block;
+
+ assert(surface <= bind_map->surface_count);
+ const struct anv_pipeline_binding *binding =
+ &bind_map->surface_to_descriptor[surface];
+
+ const struct anv_descriptor *desc =
+ anv_descriptor_for_binding(cmd_buffer, binding);
+
+ struct anv_address read_addr;
+ uint32_t read_len;
+ if (desc->type == VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER) {
+ read_len = MIN2(range->length,
+ DIV_ROUND_UP(desc->buffer_view->range, 32) - range->start);
+ read_addr = (struct anv_address) {
+ .bo = desc->buffer_view->bo,
+ .offset = desc->buffer_view->offset +
+ range->start * 32,
+ };
+ } else {
+ assert(desc->type == VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC);
+
+ uint32_t dynamic_offset =
+ dynamic_offset_for_binding(cmd_buffer, pipeline, binding);
+ uint32_t buf_offset =
+ MIN2(desc->offset + dynamic_offset, desc->buffer->size);
+ uint32_t buf_range =
+ MIN2(desc->range, desc->buffer->size - buf_offset);
+
+ read_len = MIN2(range->length,
+ DIV_ROUND_UP(buf_range, 32) - range->start);
+ read_addr = (struct anv_address) {
+ .bo = desc->buffer->bo,
+ .offset = desc->buffer->offset + buf_offset +
+ range->start * 32,
+ };
+ }
+
+ if (read_len > 0) {
+ c.ConstantBody.Buffer[n] = read_addr;
+ c.ConstantBody.ReadLength[n] = read_len;
+ n--;
+ }
+ }
+
+ struct anv_state state =
+ anv_cmd_buffer_push_constants(cmd_buffer, stage);
+
+ if (state.alloc_size > 0) {
+ c.ConstantBody.Buffer[n] = (struct anv_address) {
+ .bo = &cmd_buffer->device->dynamic_state_pool.block_pool.bo,
+ .offset = state.offset,
+ };
+ c.ConstantBody.ReadLength[n] =
+ DIV_ROUND_UP(state.alloc_size, 32);
+ }
#else
- .Buffer[0] = { .offset = state.offset },
- .ReadLength[0] = DIV_ROUND_UP(state.alloc_size, 32),
+ /* For Ivy Bridge, the push constants packets have a different
+ * rule that would require us to iterate in the other direction
+ * and possibly mess around with dynamic state base address.
+ * Don't bother; just emit regular push constants at n = 0.
+ */
+ struct anv_state state =
+ anv_cmd_buffer_push_constants(cmd_buffer, stage);
+
+ if (state.alloc_size > 0) {
+ c.ConstantBody.Buffer[0].offset = state.offset,
+ c.ConstantBody.ReadLength[0] =
+ DIV_ROUND_UP(state.alloc_size, 32);
+ }
#endif
- };
}
}

flushed |= mesa_to_vk_shader_stage(stage);
}

- cmd_buffer->state.push_constants_dirty &= ~VK_SHADER_STAGE_ALL_GRAPHICS;
-
- return flushed;
+ cmd_buffer->state.push_constants_dirty &= ~flushed;
}

void
@@ -2003,16 +2084,13 @@ genX(cmd_buffer_flush_state)(struct anv_cmd_buffer *cmd_buffer)
if (cmd_buffer->state.descriptors_dirty)
dirty = flush_descriptor_sets(cmd_buffer);

- if (cmd_buffer->state.push_constants_dirty) {
-#if GEN_GEN >= 9
- /* On Sky Lake and later, the binding table pointers commands are
- * what actually flush the changes to push constant state so we need
- * to dirty them so they get re-emitted below.
+ if (dirty || cmd_buffer->state.push_constants_dirty) {
+ /* Because we're pushing UBOs, we have to push whenever either
+ * descriptors or push constants is dirty.
*/
- dirty |= cmd_buffer_flush_push_constants(cmd_buffer);
-#else
- cmd_buffer_flush_push_constants(cmd_buffer);
-#endif
+ dirty |= cmd_buffer->state.push_constants_dirty;
+ dirty &= ANV_STAGE_MASK & VK_SHADER_STAGE_ALL_GRAPHICS;
+ cmd_buffer_flush_push_constants(cmd_buffer, dirty);
}

if (dirty)
diff --git a/src/intel/vulkan/genX_pipeline.c b/src/intel/vulkan/genX_pipeline.c
index 7e3a785..0ae9ead 100644
--- a/src/intel/vulkan/genX_pipeline.c
+++ b/src/intel/vulkan/genX_pipeline.c
@@ -1480,7 +1480,8 @@ emit_3dstate_ps(struct anv_pipeline *pipeline,
ps.VectorMaskEnable = true;
ps.SamplerCount = get_sampler_count(fs_bin);
ps.BindingTableEntryCount = get_binding_table_entry_count(fs_bin);
- ps.PushConstantEnable = wm_prog_data->base.nr_params > 0;
+ ps.PushConstantEnable = wm_prog_data->base.nr_params > 0 ||
+ wm_prog_data->base.ubo_ranges[0].length;
ps.PositionXYOffsetSelect = wm_prog_data->uses_pos_offset ?
POSOFFSET_SAMPLE: POSOFFSET_NONE;
#if GEN_GEN < 8
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:25 UTC
Reply
Permalink
Raw Message
Push constants work in terms of 32-byte chunks so if we want to be able
to push UBOs, everything needs to be 32-byte aligned. Currently, we
only require 16-byte alignment, which is too small.
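
As a quick stand-alone illustration (not driver code): push constant
buffers are programmed in 32-byte units, so a UBO bound at a merely
16-byte-aligned offset cannot be described exactly in those units.

  #include <stdio.h>

  int main(void)
  {
     unsigned chunk = 32;

     unsigned offset_ok  = 64;  /* 32-byte aligned: lands exactly on chunk 2 */
     unsigned offset_bad = 48;  /* only 16-byte aligned: falls mid-chunk     */

     printf("64 -> chunk %u, remainder %u\n", offset_ok / chunk, offset_ok % chunk);
     printf("48 -> chunk %u, remainder %u\n", offset_bad / chunk, offset_bad % chunk);
     return 0;
  }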

Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/anv_device.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
index fd5326b..013823e 100644
--- a/src/intel/vulkan/anv_device.c
+++ b/src/intel/vulkan/anv_device.c
@@ -849,7 +849,8 @@ void anv_GetPhysicalDeviceProperties(
.viewportSubPixelBits = 13, /* We take a float? */
.minMemoryMapAlignment = 4096, /* A page */
.minTexelBufferOffsetAlignment = 1,
- .minUniformBufferOffsetAlignment = 16,
+ /* We need 16 for UBO block reads to work and 32 for push UBOs */
+ .minUniformBufferOffsetAlignment = 32,
.minStorageBufferOffsetAlignment = 4,
.minTexelOffset = -8,
.maxTexelOffset = 7,
@@ -1915,8 +1916,15 @@ void anv_GetBufferMemoryRequirements(
memory_types |= (1u << i);
}

+ /* Base alignment requirement of a cache line */
+ uint32_t alignment = 16;
+
+ /* We need an alignment of 32 for pushing UBOs */
+ if (buffer->usage & VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT)
+ alignment = MAX2(alignment, 32);
+
pMemoryRequirements->size = buffer->size;
- pMemoryRequirements->alignment = 16;
+ pMemoryRequirements->alignment = alignment;
pMemoryRequirements->memoryTypeBits = memory_types;
}
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:26 UTC
Reply
Permalink
Raw Message
In Vulkan, we don't support classic pull constants; everything the
client asks us to push, we push. However, for pushed UBOs, we still
want to fall back to conventional pulls if we run out of space.
---
src/intel/compiler/brw_fs.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 41260b4..200b77b 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -2144,7 +2144,7 @@ fs_visitor::assign_constant_locations()

unsigned push_start_align = cplx_align_apply(align, num_push_constants);
unsigned chunk_size = u - chunk_start + 1;
- if (!compiler->supports_pull_constants ||
+ if ((!compiler->supports_pull_constants && u < UBO_START) ||
(chunk_size < max_chunk_size &&
push_start_align + chunk_size <= max_push_components)) {
/* Align up the number of push constants */
--
2.5.0.400.gff86faf
Jason Ekstrand
2017-12-07 04:34:27 UTC
Reply
Permalink
Raw Message
Push constants on Intel hardware are significantly more performant than
pull constants. Since most Vulkan applications don't actively use push
constants, or at least don't use them heavily, we're pulling way
more than we should be. By enabling pushing chunks of UBOs we can get
rid of a lot of those pulls.

On my SKL GT4e, this improves the performance of Dota 2 and Talos by
around 2.5% and improves Aztec Ruins by around 2%.

Reviewed-by: Jordan Justen <***@intel.com>
---
src/intel/vulkan/anv_device.c | 1 +
src/intel/vulkan/anv_pipeline.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
index 013823e..a9364b5 100644
--- a/src/intel/vulkan/anv_device.c
+++ b/src/intel/vulkan/anv_device.c
@@ -419,6 +419,7 @@ anv_physical_device_init(struct anv_physical_device *device,
device->compiler->shader_debug_log = compiler_debug_log;
device->compiler->shader_perf_log = compiler_perf_log;
device->compiler->supports_pull_constants = false;
+ device->compiler->constant_buffer_0_is_relative = true;

isl_device_init(&device->isl_dev, &device->info, swizzled);

diff --git a/src/intel/vulkan/anv_pipeline.c b/src/intel/vulkan/anv_pipeline.c
index cf2079d..f2d4d113 100644
--- a/src/intel/vulkan/anv_pipeline.c
+++ b/src/intel/vulkan/anv_pipeline.c
@@ -389,6 +389,9 @@ anv_pipeline_compile(struct anv_pipeline *pipeline,
struct brw_stage_prog_data *prog_data,
struct anv_pipeline_bind_map *map)
{
+ const struct brw_compiler *compiler =
+ pipeline->device->instance->physicalDevice.compiler;
+
nir_shader *nir = anv_shader_compile_to_nir(pipeline, mem_ctx,
module, entrypoint, stage,
spec_info);
@@ -438,6 +441,9 @@ anv_pipeline_compile(struct anv_pipeline *pipeline,
if (pipeline->layout)
anv_nir_apply_pipeline_layout(pipeline, nir, prog_data, map);

+ if (stage != MESA_SHADER_COMPUTE)
+ brw_nir_analyze_ubo_ranges(compiler, nir, prog_data->ubo_ranges);
+
assert(nir->num_uniforms == prog_data->nr_params * 4);

return nir;
--
2.5.0.400.gff86faf