Discussion:
GBM and the Device Memory Allocator Proposals
James Jones
2017-11-21 01:11:03 UTC
As many here know at this point, I've been working on solving issues
related to DMA-capable memory allocation for various devices for some
time now. I'd like to take this opportunity to apologize for the way I
handled the EGL stream proposals. I understand now that the development
process followed there was unacceptable to the community and likely
offended many great engineers.

Moving forward, I attempted to reboot talks in a more constructive
manner with the generic allocator library proposals & discussion forum
at XDC 2016. Some great design ideas came out of that, and I've since
been prototyping some code to prove them out before bringing them back
as official proposals. Again, I understand some people are growing
concerned that I've been doing this off on the side in a github project
that has primarily NVIDIA contributors. My goal was only to avoid
wasting everyone's time with unproven ideas. The intent was never to
dump the prototype code as-is on the community and presume acceptance.
It's just a public research project.

Now the prototyping is nearing completion, and I'd like to renew
discussion on whether and how the new mechanisms can be integrated with
the Linux graphics stack.

I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have
value at this point.

After talking with people on the hallway track at XDC this year, I've
heard several proposals for incorporating the new mechanisms:

-Include ideas from the generic allocator design into GBM. This could
take the form of designing a "GBM 2.0" API, or incrementally adding to
the existing GBM API.

-Develop a library to replace GBM. The allocator prototype code could
be massaged into something production worthy to jump start this process.

-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics
kernel APIs directly. The additional cross-device negotiation and
sorting of capabilities would be handled in this slightly higher-level
API before handing off to GBM and other APIs for actual allocation somehow.

-I have also heard some general comments that regardless of the
relationship between GBM and the new allocator mechanisms, it might be
time to move GBM out of Mesa so it can be developed as a stand-alone
project. I'd be interested in what others think about that, as it would be
something worth coordinating with any other new development based on or
inside of GBM.
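
To make the "cross-device negotiation and sorting of capabilities" in the third option a bit more concrete, here is a minimal sketch in C. Everything in it is hypothetical: the layout bits, struct device_caps, negotiate() and pick_layout() are invented for illustration and are not taken from the allocator prototype or from GBM. It assumes each device reports a bitmask of layouts it can handle plus a power-of-two pitch alignment, and that higher layout bits are "more optimal".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout bits, ordered from least to most optimal. */
enum {
    LAYOUT_LINEAR     = 1u << 0,
    LAYOUT_TILED      = 1u << 1,
    LAYOUT_COMPRESSED = 1u << 2,
};

/* Hypothetical per-device description of what it can share. */
struct device_caps {
    const char *name;
    uint32_t    layouts;      /* bitmask of LAYOUT_* the device supports */
    uint32_t    pitch_align;  /* required pitch alignment, power of two */
};

/* Result of negotiating across all participating devices. */
struct negotiated {
    uint32_t layouts;         /* layouts every device supports */
    uint32_t pitch_align;     /* strictest alignment of any device */
};

/*
 * Intersect the layout sets and take the largest alignment. Because
 * alignments are assumed to be powers of two, the largest one also
 * satisfies all smaller ones. Returns 0 on success, -1 if the devices
 * have no layout in common.
 */
int negotiate(const struct device_caps *devs, size_t n, struct negotiated *out)
{
    assert(out != NULL);
    if (devs == NULL || n == 0)
        return -1;

    uint32_t layouts = ~0u;
    uint32_t align = 1;
    for (size_t i = 0; i < n; i++) {
        layouts &= devs[i].layouts;
        if (devs[i].pitch_align > align)
            align = devs[i].pitch_align;
    }
    if (layouts == 0)
        return -1;

    out->layouts = layouts;
    out->pitch_align = align;
    return 0;
}

/* "Sorting" step: pick the highest common layout bit, or 0 if none. */
uint32_t pick_layout(uint32_t layouts)
{
    uint32_t best = 0;
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        if (layouts & bit)
            best = bit;
    }
    return best;
}
```

With a GPU that supports all three layouts at 64-byte alignment and a display engine that supports linear and tiled at 256-byte alignment, this would settle on a tiled buffer with a 256-byte pitch alignment. The actual allocation would then be handed off to whichever API sits underneath.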

And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to
kick things off with something high level to start.

For reference, the code Miguel and I have been developing for the
prototype is here:

https://github.com/cubanismo/allocator

And we've posted a port of kmscube that uses the new interfaces as a
demonstration here:

https://github.com/cubanismo/kmscube

There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.

In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words,
wherever modifications to the nouveau kernel & userspace drivers are
needed to implement the improved allocator mechanisms, we'll be
contributing patches if no one beats us to it.

Thanks in advance for any feedback!

-James Jones
Emil Velikov
2017-11-23 16:00:00 UTC
Hi James,
Post by James Jones
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
Having a GBM frontend is one thing I've been pondering as well.

Regardless of exact solution wrt the new allocator, having a clear
frontend/backend separation for GBM will be beneficial.
I'll be giving it a stab these days.

Disclaimer: Mostly thinking out loud, so please take the following
with a grain of salt.

On the details wrt the new allocator project, I think that having a
new lean library would be a good idea.
One could borrow ideas from GBM, but by default no connection between
the two should be required.

That might make the initial porting a bit harder, but it will allow
for more efficient driver implementation.

HTH
Emil
Jason Ekstrand
2017-11-24 16:45:02 UTC
Post by Emil Velikov
Hi James,
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
Having a GBM frontend is one thing I've been pondering as well.
Regardless of exact solution wrt the new allocator, having a clear
frontend/backend separation for GBM will be beneficial.
I'll be giving it a stab these days.
I'm not sure what you mean by that. It currently has something that looks
like separation but it's a joke. Unless we have a real reason to have
anything other than a dri_interface back-end, I'd rather we just stop
pretending and drop the extra layer of function pointer indirection entirely.

--Jason
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Jason Ekstrand
2017-11-24 16:50:11 UTC
Post by Jason Ekstrand
Post by Emil Velikov
Hi James,
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
Having a GBM frontend is one thing I've been pondering as well.
Regardless of exact solution wrt the new allocator, having a clear
frontend/backend separation for GBM will be beneficial.
I'll be giving it a stab these days.
I'm not sure what you mean by that. It currently has something that looks
like separation but it's a joke. Unless we have a real reason to have
anything other than a dri_interface back-end, I'd rather we just stop
pretending and drop the extra layer of function pointer indirection entirely.
Gah! I didn't read Rob's email before writing this. It looks like there
is a use-case for this. I'm still a bit skeptical about whether or not we
really want to extend what we have or if it would be better to start over
and just require that the new thing also support the current GBM ABI.
Rob Clark
2017-11-24 16:29:10 UTC
Post by James Jones
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.

The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)

[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
Post by James Jones
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1

We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).

The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
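
A rough sketch of what that versioning could look like, in C. This is purely illustrative: struct shim_backend, SHIM_BACKEND_VERSION_BO_MAP, shim_bo_map() and demo_bo_map() are made-up names, and the real struct gbm_device layout is different. The assumption is that the backend records the interface version it was built against, and the shim checks that before touching any entry point that was added later.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical backend table exported by a vendor implementation. */
struct shim_backend {
    unsigned version;  /* interface version the backend was built against */
    int    (*bo_create)(unsigned width, unsigned height);
    /* Added in interface version 2; older backends do not have it. */
    void  *(*bo_map)(int bo, size_t *stride);
};

#define SHIM_BACKEND_VERSION_BO_MAP 2

/*
 * Public entry point provided by the shim. The version is checked first,
 * so the bo_map member is never read from a backend that predates it.
 * Old backends get a runtime error (ENOSYS) instead of a link failure.
 */
void *shim_bo_map(const struct shim_backend *be, int bo, size_t *stride)
{
    assert(be != NULL);
    if (be->version < SHIM_BACKEND_VERSION_BO_MAP || be->bo_map == NULL) {
        errno = ENOSYS;
        return NULL;
    }
    return be->bo_map(bo, stride);
}

/* Stand-in for a version-2 backend's map function, for demonstration. */
static char demo_storage[16];

void *demo_bo_map(int bo, size_t *stride)
{
    (void)bo;
    *stride = 4;
    return demo_storage;
}
```

An application built against the shim would then link fine against $vendor's old backend and only see the failure if it actually calls the missing entry point.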
Post by James Jones
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02 is that the place where devel happens and the place to go for
releases could be different. Either way, I would like to see the git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.

[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
Post by James Jones
For reference, the code Miguel and I have been developing for the prototype
is here:
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
demonstration here:
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.

AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.

Maybe that is getting a step ahead; there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?

BR,
-R
Jason Ekstrand
2017-11-25 17:46:31 UTC
Post by Rob Clark
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.

I *think* I like the idea of having $new_thing implement GBM as a
deprecated legacy API. Whether that means we start by pulling GBM out into
its own project or we start over, I don't know. My feeling is that the
current dri_interface is *not* what we want, which is why starting with GBM
makes me nervous.

I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We
can have mirrors on GitHub and other places but fd.o is where Linux
graphics stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
is here:
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
demonstration here:
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Rob Clark
2017-11-25 21:20:59 UTC
Post by Jason Ekstrand
Post by Rob Clark
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
Post by Jason Ekstrand
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.

But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.

*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.

BR,
-R
Jason Ekstrand
2017-11-29 17:33:29 UTC
Post by James Jones
Post by Jason Ekstrand
Post by Rob Clark
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by James Jones
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it
because GBM has lots of wishy-washy concepts such as "cursor plane" which
may not map well, at least not without querying the kernel about specific
display planes. In particular, I don't want someone to feel like they need
to use $new_thing and GBM at the same time or together. Ideally, I'd like
them to never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
Post by James Jones
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
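The versioning idea in the quoted paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (shim_backend, shim_bo_map, SHIM_BACKEND_VERSION_BO_MAP) is hypothetical and is not the actual GBM backend interface. The shim always exports the entry point, so applications link against any backend, and an old backend simply gets a run-time error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; this is not the real Mesa/GBM backend ABI. */
#define SHIM_BACKEND_VERSION_BO_MAP 2u /* assumed version that introduced bo_map */

struct shim_bo; /* opaque buffer object owned by the backend */

struct shim_backend {
    uint32_t version; /* interface version the backend was built against */
    void *(*bo_map)(struct shim_bo *bo, uint32_t *stride); /* added in version 2 */
};

/* Nonzero only if the backend is new enough and actually provides bo_map.
 * The version is checked first so a field an old backend never set is not read. */
int shim_has_bo_map(const struct shim_backend *be)
{
    assert(be != NULL);
    return be->version >= SHIM_BACKEND_VERSION_BO_MAP && be->bo_map != NULL;
}

/* Always present in the shim, so applications link against any backend.
 * With an old backend it fails at run time: returns NULL and sets errno. */
void *shim_bo_map(const struct shim_backend *be, struct shim_bo *bo, uint32_t *stride)
{
    if (!shim_has_bo_map(be)) {
        errno = ENOSYS;
        return NULL;
    }
    return be->bo_map(bo, stride);
}
```

A caller could either probe with shim_has_bo_map() up front or just call shim_bo_map() and treat ENOSYS as "fall back to another path".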
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
Miguel Angel Vico
2017-11-29 19:41:02 UTC
Many of you may already know, but James is going to be out for a few
weeks and I'll be taking over this in the meantime.

See inline for comments.

On Wed, 29 Nov 2017 09:33:29 -0800
Post by Jason Ekstrand
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it
because GBM has lots of wishy-washy concepts such as "cursor plane" which
may not map well at least not without querying the kernel about specifc
display planes. In particular, I don't want someone to feel like they need
to use $new_thing and GBM at the same time or together. Ideally, I'd like
them to never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
I'm not really familiar with GBM guts, so I don't know how easy it
would be to make GBM rely on the allocator for the buffer allocations.
Maybe that's something worth exploring. What I wouldn't like is
$new_thing to fall short because we are trying to shove it under GBM's
hood.

It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?

Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".

Once we agree upon something, I can take care of filing the request,
but I'm unclear what the initial list of approvers should be.
Looking at the main contributors of both the initial draft of
$new_thing and git repository, does the following list of people seem
reasonable?

* Rob Clark
* Jason Ekstrand
* James Jones
* Chad Versace
* Miguel A Vico

I never started a project in fd.o, so any useful advice will be
appreciated.

Thanks,
Miguel.
--
Miguel
Rob Clark
2017-11-29 21:28:15 UTC
Post by Miguel Angel Vico
Many of you may already know, but James is going to be out for a few
weeks and I'll be taking over this in the meantime.
See inline for comments.
I'm not really familiar with GBM guts, so I don't know how easy it
would be to make GBM rely on the allocator for the buffer allocations.
Maybe that's something worth exploring. What I wouldn't like is
$new_thing to fall short because we are trying to shove it under GBM's
hood.
yeah, I think we should consider functionality of $new_thing
independent of GBM.. how to go from individual buffers allocated via
$new_thing to EGL surface/swapchain is I think out of scope for
$new_thing.
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..

I think it is reasonable to live on github until we figure out how
transitions work.. or in particular are there any thread restrictions
or interactions w/ gl context if transitions are done on the gpu or
anything like that? Or can we just make it more vulkan like w/
explicit ctx ptr, and pass around fence fd's to synchronize everyone??
I haven't thought about the transition part too much but I guess we
should have a reasonable idea for how that should work before we start
getting too many non-toy users, lest we find big API changes are
needed..

Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..

Maybe you already have some thoughts about that?
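To make the questions above a bit more concrete, here is a rough C sketch of what a Vulkan-like transition request with an explicit context pointer and fence fds might look like. Everything in it is an assumption for discussion: the xfer_* names, the struct fields, and the rule that an in-place transition is refused while the source is still being read are all hypothetical and are not taken from the allocator prototype.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* All names are hypothetical; nothing here comes from the allocator prototype. */

enum xfer_mode {
    XFER_IN_PLACE, /* rewrite the layout of the existing buffer */
    XFER_COPY      /* write the new layout into a second buffer */
};

struct xfer_ctx {
    int device_fd; /* explicit context: no hidden per-thread or GL state */
};

struct xfer_request {
    uint32_t src_buffer;   /* handle of the buffer in its current layout */
    uint32_t dst_buffer;   /* same as src_buffer for an in-place transition */
    int wait_fence_fd;     /* fence to wait on before starting, or -1 for none */
    int src_still_read;    /* nonzero if e.g. the GPU keeps sampling the source */
};

/* The mode follows from whether a separate destination buffer was supplied. */
enum xfer_mode xfer_mode_of(const struct xfer_request *req)
{
    assert(req != NULL);
    return req->src_buffer == req->dst_buffer ? XFER_IN_PLACE : XFER_COPY;
}

/* Returns 0 if the request could be submitted, or a negative errno value. */
int xfer_validate(const struct xfer_ctx *ctx, const struct xfer_request *req)
{
    if (ctx == NULL || req == NULL)
        return -EINVAL;
    if (ctx->device_fd < 0 || req->wait_fence_fd < -1)
        return -EINVAL;
    /* Changing the layout in place would corrupt reads that are still in flight. */
    if (xfer_mode_of(req) == XFER_IN_PLACE && req->src_still_read)
        return -EBUSY;
    return 0;
}
```

In this sketch the "GPU still sampling the previous frame" case comes back as -EBUSY for an in-place request, and the caller would retry with a separate destination buffer, i.e. a copy transition. A real submit call would presumably also hand back a completion fence fd, which is left out here.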
Post by Miguel Angel Vico
Once we agree upon something, I can take care of filing the request,
but I'm unclear what the initial list of approvers should be.
Looking at the main contributors of both the initial draft of
$new_thing and git repository, does the following list of people seem
reasonable?
* Rob Clark
* Jason Ekstrand
* James Jones
* Chad Versace
* Miguel A Vico
I never started a project in fd.o, so any useful advice will be
appreciated.
fwiw, https://www.freedesktop.org/wiki/NewProject/

BR,
-R
Post by Miguel Angel Vico
Thanks,
Miguel.
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a
deprecated
Post by Jason Ekstrand
legacy API. Whether that means we start by pulling GBM out into it's own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes
me
Post by Jason Ekstrand
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or
more
Post by Jason Ekstrand
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Post by Rob Clark
Maybe that is getting a step ahead, there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?
BR,
-R
In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words, wherever
modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
--
Miguel
Miguel Angel Vico
2017-11-30 00:09:34 UTC
On Wed, 29 Nov 2017 16:28:15 -0500
Post by Rob Clark
Post by Miguel Angel Vico
Many of you may already know, but James is going to be out for a few
weeks and I'll be taking over this in the meantime.
See inline for comments.
On Wed, 29 Nov 2017 09:33:29 -0800
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
Post by Rob Clark
Post by James Jones
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
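As a rough illustration of "figuring out what sort of buffer can be allocated by who", here is a small C sketch of merging two devices' requirements. The struct, the LAYOUT_* bits and negotiate_caps() are all made up for this example and are not the prototype's actual types; it only shows the general shape, where capabilities are intersected and constraints are combined. Sorting the surviving capabilities by preference is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout bits a device can read or write. */
#define LAYOUT_LINEAR      (1u << 0)
#define LAYOUT_TILED       (1u << 1)
#define LAYOUT_COMPRESSED  (1u << 2)

/* Hypothetical per-device description; not the prototype's real structs. */
struct device_caps {
    uint32_t layouts;      /* bitmask of LAYOUT_* the device supports */
    uint32_t pitch_align;  /* required pitch alignment in bytes, nonzero */
    uint32_t max_pitch;    /* largest pitch in bytes the device accepts */
};

static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Merge two devices' requirements into one set both can live with.
 * Returns 0 on success, -1 if no buffer can satisfy both devices.
 */
int negotiate_caps(const struct device_caps *a, const struct device_caps *b,
                   struct device_caps *out)
{
    assert(a->pitch_align != 0 && b->pitch_align != 0);

    /* Capabilities intersect: only layouts both devices understand survive. */
    out->layouts = a->layouts & b->layouts;

    /* Constraints combine: the alignment must satisfy both devices (least
     * common multiple), and size limits take the smaller value. */
    out->pitch_align = a->pitch_align / gcd_u32(a->pitch_align, b->pitch_align)
                       * b->pitch_align;
    out->max_pitch = a->max_pitch < b->max_pitch ? a->max_pitch : b->max_pitch;

    if (out->layouts == 0 || out->pitch_align > out->max_pitch)
        return -1;
    return 0;
}
```

With a GPU that supports all three layouts and a camera that only writes linear buffers, the merged result keeps only LAYOUT_LINEAR, which is the kind of answer a winsys-agnostic allocator would hand back before any actual allocation happens.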
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by James Jones
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it
because GBM has lots of wishy-washy concepts such as "cursor plane" which
may not map well, at least not without querying the kernel about specific
display planes. In particular, I don't want someone to feel like they need
to use $new_thing and GBM at the same time or together. Ideally, I'd like
them to never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
I'm not really familiar with GBM guts, so I don't know how easy it would
be to make GBM rely on the allocator for the buffer allocations.
Maybe that's something worth exploring. What I wouldn't like is
$new_thing to fall short because we are trying to shove it under GBM's
hood.
yeah, I think we should consider functionality of $new_thing
independent of GBM.. how to go from individual buffers allocated via
$new_thing to EGL surface/swapchain is I think out of scope for
$new_thing.
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
Post by Rob Clark
I think it is reasonable to live on github until we figure out how
transitions work.. or in particular are there any thread restrictions
or interactions w/ gl context if transitions are done on the gpu or
anything like that? Or can we just make it more vulkan like w/
explicit ctx ptr, and pass around fence fd's to synchronize everyone??
I haven't thought about the transition part too much but I guess we
should have a reasonable idea for how that should work before we start
getting too many non-toy users, lest we find big API changes are
needed..
Seems fine, but I would like to get people other than NVIDIANs
involved in giving feedback on the design as we move forward with the
prototype.

For lack of a better list, is it okay to start sending patches to
mesa-dev? If that's too broad an audience, should I just CC specific
individuals who have contributed to the project in some way?
Post by Rob Clark
Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..
Maybe you already have some thoughts about that?
This is the next thing I'll be working on. I haven't given it much
thought myself so far, but I think James might have had some insights.
I'll read through some of his notes to double-check.

Thanks,
Miguel.
Post by Rob Clark
Post by Miguel Angel Vico
Once we agree upon something, I can take care of filing the request,
but I'm unclear what the initial list of approvers should be.
Looking at the main contributors of both the initial draft of
$new_thing and git repository, does the following list of people seem
reasonable?
* Rob Clark
* Jason Ekstrand
* James Jones
* Chad Versace
* Miguel A Vico
I never started a project in fd.o, so any useful advice will be
appreciated.
fwiw, https://www.freedesktop.org/wiki/NewProject/
BR,
-R
Post by Miguel Angel Vico
Thanks,
Miguel.
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
Post by James Jones
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
Post by James Jones
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
Post by James Jones
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Post by Rob Clark
Maybe that is getting a step ahead, there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?
BR,
-R
Post by James Jones
In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words, wherever
modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
--
Miguel
James Jones
2017-11-30 05:59:30 UTC
Post by Miguel Angel Vico
On Wed, 29 Nov 2017 16:28:15 -0500
Post by Rob Clark
Post by Miguel Angel Vico
Many of you may already know, but James is going to be out for a few
weeks and I'll be taking over this in the meantime.
Sorry for the unfortunate timing. I am indeed on paternity leave at the
moment. Some quick comments below. I'll be trying to follow the
discussion as time allows while I'm out.
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
See inline for comments.
On Wed, 29 Nov 2017 09:33:29 -0800
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
Post by Rob Clark
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by James Jones
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it
because GBM has lots of wishy-washy concepts such as "cursor plane" which
may not map well, at least not without querying the kernel about specific
display planes. In particular, I don't want someone to feel like they need
to use $new_thing and GBM at the same time or together. Ideally, I'd like
them to never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
I'm not really familiar with GBM guts, so I don't know how easy it would
be to make GBM rely on the allocator for the buffer allocations.
Maybe that's something worth exploring. What I wouldn't like is
$new_thing to fall short because we are trying to shove it under GBM's
hood.
yeah, I think we should consider functionality of $new_thing
independent of GBM.. how to go from individual buffers allocated via
$new_thing to EGL surface/swapchain is I think out of scope for
$new_thing.
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
Post by Rob Clark
I think it is reasonable to live on github until we figure out how
transitions work.. or in particular are there any thread restrictions
or interactions w/ gl context if transitions are done on the gpu or
anything like that? Or can we just make it more vulkan like w/
explicit ctx ptr, and pass around fence fd's to synchronize everyone??
I haven't thought about the transition part too much but I guess we
should have a reasonable idea for how that should work before we start
getting too many non-toy users, lest we find big API changes are
needed..
Seems fine, but I would like to get people other than NVIDIANs
involved in giving feedback on the design as we move forward with the
prototype.
For lack of a better list, is it okay to start sending patches to
mesa-dev? If that's too broad an audience, should I just CC specific
individuals who have contributed to the project in some way?
Post by Rob Clark
Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..
Maybe you already have some thoughts about that?
This is the next thing I'll be working on. I haven't given it much
thought myself so far, but I think James might have had some insights.
I'll read through some of his notes to double-check.
A couple of notes on usage transitions:

While chatting about transitions, a few assertions were made by others
that I've come to accept, despite the fact that they reduce the
generality of the allocator mechanisms:

-GPUs are the only things that actually need usage transitions, as far as
I know. Other engines either share the GPU representations of
data, or use more limited representations; the latter being the reason
non-GPU usage transitions are a useful thing.

-It's reasonable to assume that a GPU is required to perform a usage
transition. This follows from the above postulate. If only GPUs are
using more advanced representations, you don't need any transitions
unless you have a GPU available.

From that, I derived the rough API proposal for transitions presented
on my XDC 2017 slides. Transition "metadata" is queried from the
allocator given a pair of usages (which may refer to more than one
device), but the realization of the transition is left to existing GPU
APIs. I think I put Vulkan-like pseudo-code in the slides, but the GL
external objects extensions (GL_EXT_memory_object and GL_EXT_semaphore)
would work as well.

Regarding in-place Vs. copy: To me a transition is something that
happens in-place, at least semantically. If you need to make copies,
that's a format conversion blit not a transition, and graphics APIs are
already capable of expressing that without any special transitions or
help from the allocator. However, I understand some chipsets perform
transitions using something that looks kind of like a blit using on-chip
caches and constrained usage semantics. There's probably some work to
do to see whether those need to be accommodated as conversion blits or
usage transitions.

For our hardware's purposes, transitions are just various levels of
decompression or compression reconfiguration and potentially cache
flushing/invalidation, so our transition metadata will just be some bits
signaling which compression operation is needed, if any. That's the
sort of operation I modeled the API around, so if things are much more
exotic than that for others, it will probably require some adjustments.
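
To make the shape of that concrete, here is a minimal C sketch of "query transition metadata for a pair of usages, get back a few bits". All names are hypothetical (the usage enum, the TRANSITION_* bits, query_transition()), and the rules inside are invented for illustration; they do not describe the prototype's API or any real hardware. The allocator side only answers the query. Actually performing the decompress or flush would be left to the GPU API, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical usages; the prototype's real usage list differs. */
enum usage { USAGE_TEXTURE, USAGE_RENDER, USAGE_DISPLAY, USAGE_COUNT };

/* Hypothetical metadata bits describing the work a transition needs. */
#define TRANSITION_NONE         0u
#define TRANSITION_DECOMPRESS   (1u << 0)
#define TRANSITION_FLUSH_CACHE  (1u << 1)

/* Invented rule table: can this usage consume compressed data directly? */
static const int usage_reads_compressed[USAGE_COUNT] = {
    1, /* USAGE_TEXTURE */
    1, /* USAGE_RENDER  */
    0, /* USAGE_DISPLAY */
};

/*
 * Return the metadata bits for moving a buffer from one usage to another.
 * The transition is in-place: no copy is implied by any of the bits.
 */
uint32_t query_transition(enum usage from, enum usage to)
{
    uint32_t bits = TRANSITION_NONE;

    assert(from < USAGE_COUNT && to < USAGE_COUNT);

    if (from == to)
        return bits;

    /* A rendering engine may still hold dirty data in its caches. */
    if (from == USAGE_RENDER)
        bits |= TRANSITION_FLUSH_CACHE;

    /* The consumer cannot read the producer's compressed representation. */
    if (usage_reads_compressed[from] && !usage_reads_compressed[to])
        bits |= TRANSITION_DECOMPRESS;

    return bits;
}
```

A caller would pass the returned bits, together with the buffer, into whatever barrier or release/acquire operation its graphics API provides.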
Post by Miguel Angel Vico
Thanks,
Miguel.
Post by Rob Clark
Post by Miguel Angel Vico
Once we agree upon something, I can take care of filing the request,
but I'm unclear what the initial list of approvers should be.
Looking at the main contributors of both the initial draft of
$new_thing and git repository, does the following list of people seem
reasonable?
* Rob Clark
* Jason Ekstrand
* James Jones
* Chad Versace
* Miguel A Vico
I never started a project in fd.o, so any useful advice will be
appreciated.
fwiw, https://www.freedesktop.org/wiki/NewProject/
BR,
-R
Post by Miguel Angel Vico
Thanks,
Miguel.
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
I hope you and others, especially those of you who seem to already have
some well-formed ideas about end-goals for this project, do get a chance
to go through the prototype code and simple kmscube example at some
point. A code review is worth a thousand high-level design discussions
IMHO, and it really isn't that much code at this point. Of course, I
understand everyone's busy this time of year.
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly)
that aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Post by Rob Clark
Maybe that is getting a step ahead, there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?
Gralloc-on-$new_thing, as well as hwcomposer-on-$new_thing is one of my
primary goals. However, it's a pretty heavy thing to prototype. If
someone has the time though, I think it would be a great experiment. It
would help flesh out the paltry list of usages, constraints, and
capabilities in the existing prototype codebase. The kmscube example
really should have added at least a "render" usage, but I got lazy and
just re-used texture for now. That won't actually work on our HW in all
cases, but it's good enough for kmscube.

Thanks,
-James
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
Post by Jason Ekstrand
Post by James Jones
Post by Jason Ekstrand
Post by Rob Clark
BR,
-R
In addition, I'd like to note that NVIDIA is committed to providing
open source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words,
wherever modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
--
Miguel
Rob Clark
2017-11-30 18:20:16 UTC
Post by Miguel Angel Vico
On Wed, 29 Nov 2017 16:28:15 -0500
Post by Rob Clark
Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..
Maybe you already have some thoughts about that?
This is the next thing I'll be working on. I haven't given it much
thought myself so far, but I think James might have had some insights.
I'll read through some of his notes to double-check.
While chatting about transitions, a few assertions were made by others that
I've come to accept, despite the fact that they reduce the generality of the
-GPUs are the only things that actually need usage transitions as far as I
know thus far. Other engines either share the GPU representations of data,
or use more limited representations; the latter being the reason non-GPU
usage transitions are a useful thing.
-It's reasonable to assume that a GPU is required to perform a usage
transition. This follows from the above postulate. If only GPUs are using
more advanced representations, you don't need any transitions unless you
have a GPU available.
This seems reasonable. I can't think of any non-gpu related case
where you would need a transition, other than perhaps cache flush/inv.
From that, I derived the rough API proposal for transitions presented on my
XDC 2017 slides. Transition "metadata" is queried from the allocator given
a pair of usages (which may refer to more than one device), but the
realization of the transition is left to existing GPU APIs. I think I put
Vulkan-like pseudo-code in the slides, but the GL external objects
extensions (GL_EXT_memory_object and GL_EXT_semaphore) would work as well.
I haven't quite wrapped my head around how this would work in the
cross-device case.. I mean from the API standpoint for the user, it
seems straightforward enough. Just not sure how to implement that and
what the driver interface would look like.

I guess we need a capability-conversion (?).. I mean take for example
the fb compression capability from your slide #12[1]. If we knew
there was an available transition to go from "Dev2 FB compression" to
"normal", then we could have allowed the "Dev2 FB compression" valid
set?

[1] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
Regarding in-place Vs. copy: To me a transition is something that happens
in-place, at least semantically. If you need to make copies, that's a
format conversion blit not a transition, and graphics APIs are already
capable of expressing that without any special transitions or help from the
allocator. However, I understand some chipsets perform transitions using
something that looks kind of like a blit using on-chip caches and
constrained usage semantics. There's probably some work to do to see
whether those need to be accommodated as conversion blits or usage
transitions.
I guess part of what I was thinking of, is what happens if the
producing device is still reading from the buffer. For example,
viddec -> gpu use case, where the video decoder is also still hanging
on to the frame to use as a reference frame to decode future frames?

I guess if transition from devA -> devB can be done in parallel with
devA still reading the buffer, it isn't a problem. I guess that
limits (non-blit) transitions to decompression and cache op's? Maybe
that is ok..
For our hardware's purposes, transitions are just various levels of
decompression or compression reconfiguration and potentially cache
flushing/invalidation, so our transition metadata will just be some bits
signaling which compression operation is needed, if any. That's the sort of
operation I modeled the API around, so if things are much more exotic than
that for others, it will probably require some adjustments.
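As a rough illustration of "some bits signaling which compression operation is needed", the metadata could be modeled as a small bitmask that the GPU driver walks when realizing the transition. Everything here is an assumption for the sake of the sketch: the XFER_* bit names, the xfer_plan helper, and in particular the fixed ordering (compression work before cache maintenance) are hypothetical and not taken from the prototype.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transition metadata bits; a real layout would be vendor-defined. */
#define XFER_DECOMPRESS        (1u << 0)
#define XFER_COMPRESS_RECONFIG (1u << 1)
#define XFER_CACHE_FLUSH       (1u << 2)
#define XFER_CACHE_INVALIDATE  (1u << 3)
#define XFER_MAX_STEPS         4

/*
 * Expand a metadata word into an ordered list of steps.
 * Returns the number of steps written to 'steps' (0 means no-op transition).
 * The ordering below is an assumption made for this sketch.
 */
static size_t xfer_plan(uint32_t metadata, uint32_t steps[XFER_MAX_STEPS])
{
    static const uint32_t order[XFER_MAX_STEPS] = {
        XFER_DECOMPRESS,
        XFER_COMPRESS_RECONFIG,
        XFER_CACHE_FLUSH,
        XFER_CACHE_INVALIDATE,
    };
    size_t n = 0;

    for (size_t i = 0; i < XFER_MAX_STEPS; i++) {
        if (metadata & order[i])
            steps[n++] = order[i];
    }
    return n;
}
```

Hardware with more exotic transitions would presumably need a richer, opaque metadata blob rather than a handful of flags.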
[snip]
Gralloc-on-$new_thing, as well as hwcomposer-on-$new_thing is one of my
primary goals. However, it's a pretty heavy thing to prototype. If someone
has the time though, I think it would be a great experiment. It would help
flesh out the paltry list of usages, constraints, and capabilities in the
existing prototype codebase. The kmscube example really should have added
at least a "render" usage, but I got lazy and just re-used texture for now.
That won't actually work on our HW in all cases, but it's good enough for
kmscube.
btw, I did start looking at it.. I guess this gets a bit into the
other side of this thread (ie. where/if GBM fits in). So far I don't
think mesa has EGL_EXT_device_base, but I'm guessing that is part of
what you had in mind as alternative to GBM ;-)


BR,
-R
Lyude Paul
2017-11-30 20:06:02 UTC
Post by Rob Clark
Post by Miguel Angel Vico
On Wed, 29 Nov 2017 16:28:15 -0500
Post by Rob Clark
Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..
Maybe you already have some thoughts about that?
This is the next thing I'll be working on. I haven't given it much
thought myself so far, but I think James might have had some insights.
I'll read through some of his notes to double-check.
While chatting about transitions, a few assertions were made by others that
I've come to accept, despite the fact that they reduce the generality of the
-GPUs are the only things that actually need usage transitions as far as I
know thus far. Other engines either share the GPU representations of data,
or use more limited representations; the latter being the reason non-GPU
usage transitions are a useful thing.
-It's reasonable to assume that a GPU is required to perform a usage
transition. This follows from the above postulate. If only GPUs are using
more advanced representations, you don't need any transitions unless you
have a GPU available.
This seems reasonable. I can't think of any non-gpu related case
where you would need a transition, other than perhaps cache flush/inv.
From that, I derived the rough API proposal for transitions presented on my
XDC 2017 slides. Transition "metadata" is queried from the allocator given
a pair of usages (which may refer to more than one device), but the
realization of the transition is left to existing GPU APIs. I think I put
Vulkan-like pseudo-code in the slides, but the GL external objects
extensions (GL_EXT_memory_object and GL_EXT_semaphore) would work as well.
I haven't quite wrapped my head around how this would work in the
cross-device case.. I mean from the API standpoint for the user, it
seems straightforward enough. Just not sure how to implement that and
what the driver interface would look like.
I guess we need a capability-conversion (?).. I mean take for example
the fb compression capability from your slide #12[1]. If we knew
there was an available transition to go from "Dev2 FB compression" to
"normal", then we could have allowed the "Dev2 FB compression" valid
set?
[1] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
Regarding in-place Vs. copy: To me a transition is something that happens
in-place, at least semantically. If you need to make copies, that's a
format conversion blit not a transition, and graphics APIs are already
capable of expressing that without any special transitions or help from the
allocator. However, I understand some chipsets perform transitions using
something that looks kind of like a blit using on-chip caches and
constrained usage semantics. There's probably some work to do to see
whether those need to be accommodated as conversion blits or usage
transitions.
I guess part of what I was thinking of, is what happens if the
producing device is still reading from the buffer. For example,
viddec -> gpu use case, where the video decoder is also still hanging
on to the frame to use as a reference frame to decode future frames?
I guess if transition from devA -> devB can be done in parallel with
devA still reading the buffer, it isn't a problem. I guess that
limits (non-blit) transitions to decompression and cache op's? Maybe
that is ok..
For our hardware's purposes, transitions are just various levels of
decompression or compression reconfiguration and potentially cache
flushing/invalidation, so our transition metadata will just be some bits
signaling which compression operation is needed, if any. That's the sort of
operation I modeled the API around, so if things are much more exotic than
that for others, it will probably require some adjustments.
[snip]
Gralloc-on-$new_thing, as well as hwcomposer-on-$new_thing is one of my
primary goals. However, it's a pretty heavy thing to prototype. If someone
has the time though, I think it would be a great experiment. It would help
flesh out the paltry list of usages, constraints, and capabilities in the
existing prototype codebase. The kmscube example really should have added
at least a "render" usage, but I got lazy and just re-used texture for now.
That won't actually work on our HW in all cases, but it's good enough for
kmscube.
btw, I did start looking at it.. I guess this gets a bit into the
other side of this thread (ie. where/if GBM fits in). So far I don't
think mesa has EGL_EXT_device_base, but I'm guessing that is part of
There is wip from ajax to add support for this actually, although it didn't do
much correctly the last time I played with it:

https://cgit.freedesktop.org/~ajax/mesa/log/?h=egl-ext-device

I was also hoping to write a simple egl device testing extension that lists
devices and that sort of stuff, as well as make an entire separate repo to start
holding glxinfo, eglinfo, and group said tool in with that. Haven't actually
written any code for this yet though.
Post by Rob Clark
what you had in mind as alternative to GBM ;-)
BR,
-R
James Jones
2017-12-06 06:27:41 UTC
Post by Lyude Paul
Post by Rob Clark
Post by Miguel Angel Vico
On Wed, 29 Nov 2017 16:28:15 -0500
Post by Rob Clark
Do we need to define both in-place and copy transitions? Ie. what if
GPU is still reading a tiled or compressed texture (ie. sampling from
previous frame for some reason), but we need to untile/uncompress for
display.. or maybe there are some other cases like that we should
think about..
Maybe you already have some thoughts about that?
This is the next thing I'll be working on. I haven't given it much
thought myself so far, but I think James might have had some insights.
I'll read through some of his notes to double-check.
While chatting about transitions, a few assertions were made by others that
I've come to accept, despite the fact that they reduce the generality of the
-GPUs are the only things that actually need usage transitions as far as I
know thus far. Other engines either share the GPU representations of data,
or use more limited representations; the latter being the reason non-GPU
usage transitions are a useful thing.
-It's reasonable to assume that a GPU is required to perform a usage
transition. This follows from the above postulate. If only GPUs are using
more advanced representations, you don't need any transitions unless you
have a GPU available.
This seems reasonable. I can't think of any non-gpu related case
where you would need a transition, other than perhaps cache flush/inv.
From that, I derived the rough API proposal for transitions presented on my
XDC 2017 slides. Transition "metadata" is queried from the allocator given
a pair of usages (which may refer to more than one device), but the
realization of the transition is left to existing GPU APIs. I think I put
Vulkan-like pseudo-code in the slides, but the GL external objects
extensions (GL_EXT_memory_object and GL_EXT_semaphore) would work as well.
I haven't quite wrapped my head around how this would work in the
cross-device case.. I mean from the API standpoint for the user, it
seems straightforward enough. Just not sure how to implement that and
what the driver interface would look like.
I guess we need a capability-conversion (?).. I mean take for example
the fb compression capability from your slide #12[1]. If we knew
there was an available transition to go from "Dev2 FB compression" to
"normal", then we could have allowed the "Dev2 FB compression" valid
set?
[1] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
Regarding in-place Vs. copy: To me a transition is something that happens
in-place, at least semantically. If you need to make copies, that's a
format conversion blit not a transition, and graphics APIs are already
capable of expressing that without any special transitions or help from the
allocator. However, I understand some chipsets perform transitions using
something that looks kind of like a blit using on-chip caches and
constrained usage semantics. There's probably some work to do to see
whether those need to be accommodated as conversion blits or usage
transitions.
I guess part of what I was thinking of, is what happens if the
producing device is still reading from the buffer. For example,
viddec -> gpu use case, where the video decoder is also still hanging
on to the frame to use as a reference frame to decode future frames?
I guess if transition from devA -> devB can be done in parallel with
devA still reading the buffer, it isn't a problem. I guess that
limits (non-blit) transitions to decompression and cache op's? Maybe
that is ok..
I don't know of a real case it would be a problem. Note you can
transition to multiple usages in the proposed API, so for the video
decoder example, you would transition from [video decode target] to
[video decode target, GPU sampler source] for simultaneous texturing and
reference frame usage.
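The multi-usage transition described above can be sketched with usages as a bitmask. This is a hedged illustration only: the USAGE_* names, struct usage_delta, and usage_transition_delta are hypothetical and do not come from the prototype allocator. It shows that going from [video decode target] to [video decode target, GPU sampler source] keeps the decoder's usage valid while adding the sampler.

```c
#include <stdint.h>

/* Hypothetical usage bits, one per (device, usage) pair. */
#define USAGE_VIDEO_DECODE_TARGET (1u << 0)
#define USAGE_GPU_SAMPLER_SOURCE  (1u << 1)

/* Result of comparing the usage set before and after a transition. */
struct usage_delta {
    uint32_t kept;    /* usages valid both before and after */
    uint32_t added;   /* usages that become valid after the transition */
    uint32_t dropped; /* usages that are no longer valid afterwards */
};

static struct usage_delta usage_transition_delta(uint32_t from, uint32_t to)
{
    struct usage_delta d;

    d.kept    = from & to;
    d.added   = to & ~from;
    d.dropped = from & ~to;
    return d;
}
```

For the video decoder case, 'kept' holds the decode target (the reference frame stays usable), 'added' holds the GPU sampler source, and 'dropped' is empty.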
Post by Lyude Paul
Post by Rob Clark
For our hardware's purposes, transitions are just various levels of
decompression or compression reconfiguration and potentially cache
flushing/invalidation, so our transition metadata will just be some bits
signaling which compression operation is needed, if any. That's the sort of
operation I modeled the API around, so if things are much more exotic than
that for others, it will probably require some adjustments.
[snip]
Gralloc-on-$new_thing, as well as hwcomposer-on-$new_thing is one of my
primary goals. However, it's a pretty heavy thing to prototype. If someone
has the time though, I think it would be a great experiment. It would help
flesh out the paltry list of usages, constraints, and capabilities in the
existing prototype codebase. The kmscube example really should have added
at least a "render" usage, but I got lazy and just re-used texture for now.
That won't actually work on our HW in all cases, but it's good enough for
kmscube.
btw, I did start looking at it.. I guess this gets a bit into the
other side of this thread (ie. where/if GBM fits in). So far I don't
think mesa has EGL_EXT_device_base, but I'm guessing that is part of
There is wip from ajax to add support for this actually, although it didn't do
much correctly the last time I played with it:
https://cgit.freedesktop.org/~ajax/mesa/log/?h=egl-ext-device
I was also hoping to write a simple egl device testing extension that lists
devices and that sort of stuff, as well as make an entire separate repo to start
holding glxinfo, eglinfo, and group said tool in with that. Haven't actually
written any code for this yet though.
Yes, or there's also this:

https://github.com/KhronosGroup/EGL-Registry/pull/23

Which combined with:

https://www.khronos.org/registry/EGL/extensions/MESA/EGL_MESA_platform_surfaceless.txt

provides a semantic alternative method to instantiate a platform-less
EGLDisplay on an EGLDevice. It's functionally equivalent to
EGL_EXT_platform_device, but some people find it more palatable. I'm
roughly indifferent between the two at this point, but I slightly prefer
EGL_EXT_platform_device just because we already have it implemented and
have a bunch of internal code and external customers using it, so we
have to maintain it anyway.

And just a reminder to avoid opening old wounds: EGLDevice is in no way
tied to EGLStreams. EGLDevice is basically just a slightly more
detailed version of EGL_MESA_platform_surfaceless.

Thanks,
-James
Post by Lyude Paul
Post by Rob Clark
what you had in mind as alternative to GBM ;-)
BR,
-R
Nicolai Hähnle
2017-11-30 09:21:15 UTC
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
libgalloc with a galloc prefix seems fine.
Post by Miguel Angel Vico
Post by Rob Clark
I think it is reasonable to live on github until we figure out how
transitions work.. or in particular are there any thread restrictions
or interactions w/ gl context if transitions are done on the gpu or
anything like that? Or can we just make it more vulkan like w/
explicit ctx ptr, and pass around fence fd's to synchronize everyone??
I haven't thought about the transition part too much but I guess we
should have a reasonable idea for how that should work before we start
getting too many non-toy users, lest we find big API changes are
needed..
Seems fine, but I would like to get other people other than NVIDIANs
involved giving feedback on the design as we move forward with the
prototype.
Due to lack of a better list, is it okay to start sending patches to
mesa-dev? If that's a too broad audience, should I just CC specific
individuals that have somewhat contributed to the project?
Keeping it on mesa-dev seems like the best way to ensure the relevant
people actually see it.

Cheers,
Nicolai
--
Learn how the world really is,
but never forget how it should be.
Rob Clark
2017-11-30 18:52:55 UTC
Post by Nicolai Hähnle
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
libgalloc with a galloc prefix seems fine.
I keep reading "galloc" as "gralloc".. I suspect that will be
confusing. Maybe libgal/gal_.. or just liballoc/al_?

BR,
-R
Nicolai Hähnle
2017-11-30 19:10:15 UTC
Post by Rob Clark
Post by Nicolai Hähnle
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
libgalloc with a galloc prefix seems fine.
I keep reading "galloc" as "gralloc".. I suspect that will be
confusing. Maybe libgal/gal_.. or just liballoc/al_?
True, but liballoc is *very* generic.

libimagealloc?
libsurfacealloc?
contractions thereof?

Cheers,
Nicolai
Post by Rob Clark
BR,
-R
--
Learn how the world really is,
but never forget how it should be.
Alex Deucher
2017-11-30 19:20:24 UTC
Post by Nicolai Hähnle
Post by Rob Clark
Post by Nicolai Hähnle
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
libgalloc with a galloc prefix seems fine.
I keep reading "galloc" as "gralloc".. I suspect that will be
confusing. Maybe libgal/gal_.. or just liballoc/al_?
True, but liballoc is *very* generic.
libimagealloc?
libsurfacealloc?
contractions thereof?
libdevicealloc?

Alex
Lyude Paul
2017-11-30 19:51:44 UTC
Post by Alex Deucher
Post by Nicolai Hähnle
Post by Rob Clark
Post by Nicolai Hähnle
Post by Miguel Angel Vico
Post by Rob Clark
Post by Miguel Angel Vico
It seems to me that $new_thing should grow as a separate thing whether
it ends up replacing GBM or GBM internals are somewhat rewritten on top
of it. If I'm reading you both correctly, you agree with that, so in
order to move forward, should we go ahead and create a project in fd.o?
Before filing the new project request though, we should find an
appropriate name for $new_thing. Creativity isn't one of my strengths,
but I'll go ahead and start the bikeshedding with "Generic Device
Memory Allocator" or "Generic Device Memory Manager".
liballoc - Generic Device Memory Allocator ... seems reasonable to me..
Cool. If there aren't better suggestions, we can go with that. We
should also namespace all APIs and structures. Is 'galloc' distinctive
enough to be used as namespace? Being an 'r' away from gralloc maybe
it's a bit confusing?
libgalloc with a galloc prefix seems fine.
I keep reading "galloc" as "gralloc".. I suspect that will be
confusing. Maybe libgal/gal_.. or just liballoc/al_?
True, but liballoc is *very* generic.
libimagealloc?
libsurfacealloc?
contractions thereof?
libdevicealloc?
libhwalloc
Post by Alex Deucher
Alex
Rob Clark
2017-11-29 21:10:17 UTC
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
Post by Rob Clark
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have
value at this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by Rob Clark
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it because
GBM has lots of wishy-washy concepts such as "cursor plane" which may not
map well at least not without querying the kernel about specific display
planes. In particular, I don't want someone to feel like they need to use
$new_thing and GBM at the same time or together. Ideally, I'd like them to
never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
(just to repeat what I mentioned on irc)

I think main thing is how do you create a swapchain/surface and know
which is current front buffer after SwapBuffers().. those are the only
bits of GBM that seem like they would still be useful. idk, maybe
there is some other idea.

BR,
-R
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
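The versioning idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct shim_backend`, `shim_bo_map()`, `SHIM_VERSION_BO_MAP` and the two example backends are hypothetical names, not the real `struct gbm_device` layout or any existing GBM entry point. The point is that the shim always exports the symbol, so linking succeeds, and an old backend gets a runtime error instead.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical backend table; NOT the real 'struct gbm_device' layout. */
struct shim_backend {
    unsigned version;                         /* interface version the backend implements */
    void *(*bo_map)(int bo, size_t *stride);  /* only valid from SHIM_VERSION_BO_MAP on */
};

#define SHIM_VERSION_BO_MAP 2u

/* Entry point exported by the shim: always links, may fail at runtime. */
void *shim_bo_map(const struct shim_backend *be, int bo, size_t *stride)
{
    if (be->version < SHIM_VERSION_BO_MAP || be->bo_map == NULL) {
        errno = ENOSYS;   /* stub behaviour for backends that predate bo_map */
        return NULL;
    }
    return be->bo_map(bo, stride);
}

/* Example data for the sketch: one old backend, one that supports mapping. */
static unsigned char example_pixels[64 * 4];

void *example_bo_map(int bo, size_t *stride)
{
    (void)bo;
    *stride = 64;
    return example_pixels;
}

const struct shim_backend old_backend = { 1u, NULL };
const struct shim_backend new_backend = { 2u, example_bo_map };
```

With this shape, calling `shim_bo_map(&old_backend, ...)` returns NULL with `errno` set to `ENOSYS`, while the same call on `new_backend` is forwarded to the backend.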
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Post by Rob Clark
Maybe that is getting a step ahead, there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?
BR,
-R
In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words, wherever
modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
James Jones
2017-11-30 06:28:56 UTC
Post by Rob Clark
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
Post by Rob Clark
As many here know at this point, I've been working on solving issues related
to DMA-capable memory allocation for various devices for some time now. I'd
like to take this opportunity to apologize for the way I handled the EGL
stream proposals. I understand now that the development process followed
there was unacceptable to the community and likely offended many great
engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew discussion
on whether and how the new mechanisms can be integrated with the Linux
graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've heard
several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could take
the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
tbh, I kinda see GBM and $new_thing sitting side by side.. GBM is
still the "winsys" for running on "bare metal" (ie. kms). And we
don't want to saddle $new_thing with aspects of that, but rather have
it focus on being the thing that in multiple-"device"[1] scenarios
figures out what sort of buffer can be allocated by who for sharing.
Ie $new_thing should really not care about winsys level things like
cursors or surfaces.. only buffers.
The mesa implementation of $new_thing could sit on top of GBM,
although it could also just sit on top of the same internal APIs that
GBM sits on top of. That is an implementation detail. It could be
that GBM grows an API to return an instance of $new_thing for
use-cases that involve sharing a buffer with the GPU. Or perhaps that
is exposed via some sort of EGL extension. (We probably also need a
way to get an instance from libdrm (?) for display-only KMS drivers,
to cover cases like etnaviv sharing a buffer with a separate display
driver.)
[1] where "devices" could be multiple GPUs or multiple APIs for one or
more GPUs, but also includes non-GPU devices like camera, video
decoder, "image processor" (which may or may not be part of camera),
etc, etc
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by Rob Clark
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it because
GBM has lots of wishy-washy concepts such as "cursor plane" which may not
map well, at least not without querying the kernel about specific display
planes. In particular, I don't want someone to feel like they need to use
$new_thing and GBM at the same time or together. Ideally, I'd like them to
never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
(just to repeat what I mentioned on irc)
I think main thing is how do you create a swapchain/surface and know
which is current front buffer after SwapBuffers().. those are the only
bits of GBM that seem like they would still be useful. idk, maybe
there is some other idea.
I don't view this as terribly useful except for legacy apps that need an
EGL window surface and can't be updated to use new methods. Wayland
compositors certainly don't fall in that category. I don't know that
any GBM apps do.

Rather, I think the way forward for the classes of apps that need
something like GBM or the generic allocator is more or less the path
ChromeOS took with their graphics architecture: Render to individual
buffers (using FBOs bound to imported buffers in GL) and manage buffer
exchanges/blits manually.

The useful abstraction surfaces provide isn't so much deciding which
buffer is currently "front" and "back", but rather handling the
transition/hand-off to the window system/display device/etc. in
SwapBuffers(), and the whole idea of the allocator proposals is to make
that something the application or at least some non-driver utility
library handles explicitly based on where exactly the buffer is being
handed off to.

The one other useful piece of information provided by EGL surfaces that I suspect
only our hardware cares about is whether the app is potentially going to
bind a depth buffer along with the color buffers from the surface, and
AFAICT, the GBM notion of surfaces doesn't provide enough information
for our driver to determine that at surface creation time, so the GBM
surface mechanism doesn't fit quite right with NVIDIA hardware anyway.

That's all for the compositors, embedded apps, demos, and whatnot that
are using GBM directly though. Every existing GL wayland client needs
to be able to get an EGLSurface and call eglSwapBuffers() on it. As I
mentioned in my XDC 2017 slides, I think that's best handled by a
generic EGL window system implementation that all drivers could share,
and which uses allocator mechanisms behind the scenes to build up an
EGLSurface from individual buffers. It would all have to be transparent
to apps, but we already had that working with our EGLStreams wayland
implementation, and the Mesa Wayland EGL client does roughly the same
thing with DRM or GBM buffers IIRC, but without a driver-external
interface. It should be possible with generic allocator buffers too.
Jason's Vulkan WSI improvements that were sent out recently move Vulkan
in that direction already as well, and that was always one of the goals
of the Vulkan external objects extensions.

This is all a really long-winded way of saying yeah I think it would be
technically feasible to implement GBM on top of the generic allocator
mechanisms, but I don't think that's a very interesting undertaking.
It'd just be an ABI-compatibility thing for a bunch of open-source apps,
which seems unnecessary in the long run since the apps can just be
patched instead. Maybe it's useful as a transition mechanism though.

However, if the generic allocator is going to be something separate from
GBM, I think the idea of modernizing & adapting the existing GBM backend
infrastructure in Mesa to serve as a backend for the allocator is a good
idea. Maybe it's easier to just let GBM sit on that same updated
backend beside the allocator API. For GBM, all the interesting stuff
happens in the backend anyway.

Thanks,
-James
Post by Rob Clark
BR,
-R
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
I *think* I like the idea of having $new_thing implement GBM as a deprecated
legacy API. Whether that means we start by pulling GBM out into its own
project or we start over, I don't know. My feeling is that the current
dri_interface is *not* what we want which is why starting with GBM makes me
nervous.
/me expects if we pull GBM out of mesa, the interface between GBM and
mesa (or other GL drivers) is 'struct gbm_device'.. so "GBM the
project" is just a thin shim plus some 'struct gbm_device' versioning.
BR,
-R
Post by Jason Ekstrand
I need to go read through your code before I can provide a stronger or more
nuanced opinion. That's not going to happen before the end of the year.
Post by Rob Clark
-I have also heard some general comments that regardless of the relationship
between GBM and the new allocator mechanisms, it might be time to move GBM
out of Mesa so it can be developed as a stand-alone project. I'd be
interested what others think about that, as it would be something worth
coordinating with any other new development based on or inside of GBM.
+1
We already have at least a couple different non-mesa implementations
of GBM (which afaict tend to lag behind mesa's GBM and cause
headaches).
The extracted part probably isn't much more than a header and shim.
But probably does need to grow some versioning for the backend to know
if, for example, gbm->bo_map() is supported.. at least it could
provide stubs that return an error, rather than having link-time fail
if building something w/ $vendor's old gbm implementation.
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
My $0.02, is that the place where devel happens and place to go for
releases could be different. Either way, I would like to see git tree
for tagged release versions live on fd.o and use the common release
process[2] for generating/uploading release tarballs that distros can
use.
Agreed. I think fd.o is the right place for such a project to live. We can
have mirrors on GitHub and other places but fd.o is where Linux graphics
stack development currently happens.
Post by Rob Clark
[2] https://cgit.freedesktop.org/xorg/util/modular/tree/release.sh
For reference, the code Miguel and I have been developing for the prototype
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
btw, I think a nice end goal would be a gralloc implementation using
this new API for sharing buffers in various use-cases. That could
mean converting gbm-gralloc, or perhaps it means something new.
AOSP has support for mesa + upstream kernel for some devices which
also have upstream camera and/or video decoder in addition to just
GPU.. and this is where you start hitting the limits of a GBM based
gralloc. In a lot of ways, I view $new_thing as what gralloc *should*
have been, but at least it provides a way to implement a generic
gralloc.
+100
Post by Rob Clark
Maybe that is getting a step ahead, there is a lot we can prototype
with kmscube. But gralloc gets us into interesting real-world
use-cases that involve more than just GPUs. Possibly this would be
something that linaro might be interested in getting involved with?
BR,
-R
In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words, wherever
modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Nicolai Hähnle
2017-11-30 09:29:19 UTC
Post by James Jones
This is all a really long-winded way of saying yeah I think it would be
technically feasible to implement GBM on top of the generic allocator
mechanisms, but I don't think that's a very interesting undertaking.
It'd just be an ABI-compatibility thing for a bunch of open-source apps,
which seems unnecessary in the long run since the apps can just be
patched instead.  Maybe it's useful as a transition mechanism though.
However, if the generic allocator is going to be something separate from
GBM, I think the idea of modernizing & adapting the existing GBM backend
infrastructure in Mesa to serve as a backend for the allocator is a good
idea.  Maybe it's easier to just let GBM sit on that same updated
backend beside the allocator API.  For GBM, all the interesting stuff
happens in the backend anyway.
That's precisely why I brought up the libgalloc <-> driver interface in
another mail. If the libgalloc <-> driver interface uses the same
extension mechanism that is in place for libgbm <-> driver today, just
with different extensions, the transition can be made very seamless.

For example, I think we could let whatever "device handle" we use in
that interface simply be an alias for __DRIscreen as far as drivers from
Mesa are concerned. Other drivers (which won't implement the DRI_XXX
extensions) won't have to concern themselves with that if they don't
want to.
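As a rough illustration of the extension-style lookup described above: the sketch below assumes a NULL-terminated table of name + version records that the library searches by name. Every identifier in it (`struct ext`, `device_handle`, `find_ext()`, the `EXAMPLE_*` extension names) is made up for the example and is not the actual libgbm or DRI interface; for Mesa drivers the opaque handle would simply be whatever screen object the driver already uses.

```c
#include <stddef.h>
#include <string.h>

/* Name + version record; hypothetical, modelled loosely on extension tables. */
struct ext {
    const char *name;
    int version;
};

/* Opaque device handle; a driver is free to alias it to its own screen type. */
typedef struct device_handle device_handle;

struct driver {
    device_handle *dev;
    const struct ext *const *extensions;   /* NULL-terminated list */
};

/* Return the named extension if the driver exposes it at >= min_version. */
const struct ext *find_ext(const struct driver *drv, const char *name,
                           int min_version)
{
    for (size_t i = 0; drv->extensions && drv->extensions[i]; i++) {
        const struct ext *e = drv->extensions[i];
        if (strcmp(e->name, name) == 0 && e->version >= min_version)
            return e;
    }
    return NULL;
}

/* Example driver exposing an existing-style extension plus an allocator one. */
const struct ext example_image = { "EXAMPLE_image", 3 };
const struct ext example_allocator = { "EXAMPLE_allocator", 1 };
const struct ext *const example_exts[] = {
    &example_image, &example_allocator, NULL
};
const struct driver example_driver = { NULL, example_exts };
```

A driver that does not implement the allocator extension just leaves it out of the table, and the lookup returns NULL without the driver needing to know anything about the new library.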

Cheers,
Nicolai
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
Rob Clark
2017-11-30 18:48:45 UTC
Post by Rob Clark
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by Rob Clark
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it because
GBM has lots of wishy-washy concepts such as "cursor plane" which may not
map well, at least not without querying the kernel about specific display
planes. In particular, I don't want someone to feel like they need to use
$new_thing and GBM at the same time or together. Ideally, I'd like them to
never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
(just to repeat what I mentioned on irc)
I think main thing is how do you create a swapchain/surface and know
which is current front buffer after SwapBuffers().. those are the only
bits of GBM that seem like they would still be useful. idk, maybe
there is some other idea.
I don't view this as terribly useful except for legacy apps that need an EGL
window surface and can't be updated to use new methods. Wayland compositors
certainly don't fall in that category. I don't know that any GBM apps do.
kmscube doesn't count? :-P

Hmm, I assumed weston and the other wayland compositors were still
using gbm to create EGL surfaces, but I confess to have not actually
looked at weston src code for quite a few years now.

Anyways, I think it is perfectly fine for GBM to stay as-is in its
current form. It can already import dma-buf fd's, and those can
certainly come from $new_thing.

So I guess we want an EGL extension to return the allocator device
instance for the GPU. That also takes care of the non-bare-metal
case.
Rather, I think the way forward for the classes of apps that need something
like GBM or the generic allocator is more or less the path ChromeOS took
with their graphics architecture: Render to individual buffers (using FBOs
bound to imported buffers in GL) and manage buffer exchanges/blits manually.
The useful abstraction surfaces provide isn't so much deciding which buffer
is currently "front" and "back", but rather handling the transition/hand-off
to the window system/display device/etc. in SwapBuffers(), and the whole
idea of the allocator proposals is to make that something the application or
at least some non-driver utility library handles explicitly based on where
exactly the buffer is being handed off to.
Hmm, ok.. I guess the transition will need some hook into the driver.
For freedreno and vc4 (and I suspect this is not uncommon for tiler
GPUs), switching FBOs doesn't necessarily flush rendering to hw.
Maybe it would work out if you requested the sync fd from an EGL fence
before passing things to the next device, as that would flush
rendering.

I wonder a bit about perf tools and related things.. gallium HUD and
apitrace use SwapBuffers() as a frame marker..
The one other useful piece of information provided by EGL surfaces that I suspect
only our hardware cares about is whether the app is potentially going to
bind a depth buffer along with the color buffers from the surface, and
AFAICT, the GBM notion of surfaces doesn't provide enough information for
our driver to determine that at surface creation time, so the GBM surface
mechanism doesn't fit quite right with NVIDIA hardware anyway.
That's all for the compositors, embedded apps, demos, and whatnot that are
using GBM directly though. Every existing GL wayland client needs to be
able to get an EGLSurface and call eglSwapBuffers() on it. As I mentioned
in my XDC 2017 slides, I think that's best handled by a generic EGL window
system implementation that all drivers could share, and which uses allocator
mechanisms behind the scenes to build up an EGLSurface from individual
buffers. It would all have to be transparent to apps, but we already had
that working with our EGLStreams wayland implementation, and the Mesa
Wayland EGL client does roughly the same thing with DRM or GBM buffers IIRC,
but without a driver-external interface. It should be possible with generic
allocator buffers too. Jason's Vulkan WSI improvements that were sent out
recently move Vulkan in that direction already as well, and that was always
one of the goals of the Vulkan external objects extensions.
This is all a really long-winded way of saying yeah I think it would be
technically feasible to implement GBM on top of the generic allocator
mechanisms, but I don't think that's a very interesting undertaking. It'd
just be an ABI-compatibility thing for a bunch of open-source apps, which
seems unnecessary in the long run since the apps can just be patched
instead. Maybe it's useful as a transition mechanism though.
However, if the generic allocator is going to be something separate from
GBM, I think the idea of modernizing & adapting the existing GBM backend
infrastructure in Mesa to serve as a backend for the allocator is a good
idea. Maybe it's easier to just let GBM sit on that same updated backend
beside the allocator API. For GBM, all the interesting stuff happens in the
backend anyway.
right

BR,
-R
James Jones
2017-12-06 05:52:07 UTC
Post by Rob Clark
Post by Rob Clark
Post by Jason Ekstrand
Post by Rob Clark
Post by Jason Ekstrand
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
Post by Rob Clark
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it because
GBM has lots of wishy-washy concepts such as "cursor plane" which may not
map well, at least not without querying the kernel about specific display
planes. In particular, I don't want someone to feel like they need to use
$new_thing and GBM at the same time or together. Ideally, I'd like them to
never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
(just to repeat what I mentioned on irc)
I think main thing is how do you create a swapchain/surface and know
which is current front buffer after SwapBuffers().. those are the only
bits of GBM that seem like they would still be useful. idk, maybe
there is some other idea.
I don't view this as terribly useful except for legacy apps that need an EGL
window surface and can't be updated to use new methods. Wayland compositors
certainly don't fall in that category. I don't know that any GBM apps do.
kmscube doesn't count? :-P
Hmm, I assumed weston and the other wayland compositors were still
using gbm to create EGL surfaces, but I confess to have not actually
looked at weston src code for quite a few years now.
Anyways, I think it is perfectly fine for GBM to stay as-is in its
current form. It can already import dma-buf fd's, and those can
certainly come from $new_thing.
So I guess we want an EGL extension to return the allocator device
instance for the GPU. That also takes care of the non-bare-metal
case.
Rather, I think the way forward for the classes of apps that need something
like GBM or the generic allocator is more or less the path ChromeOS took
with their graphics architecture: Render to individual buffers (using FBOs
bound to imported buffers in GL) and manage buffer exchanges/blits manually.
The useful abstraction surfaces provide isn't so much deciding which buffer
is currently "front" and "back", but rather handling the transition/hand-off
to the window system/display device/etc. in SwapBuffers(), and the whole
idea of the allocator proposals is to make that something the application or
at least some non-driver utility library handles explicitly based on where
exactly the buffer is being handed off to.
Hmm, ok.. I guess the transition will need some hook into the driver.
For freedreno and vc4 (and I suspect this is not uncommon for tiler
GPUs), switching FBOs doesn't necessarily flush rendering to hw.
Maybe it would work out if you requested the sync fd from an EGL fence
before passing things to the next device, as that would flush
rendering.
This "flush" is exactly what usage transitions are for:

1) Perform rendering or texturing
2) Insert a transition into the command stream, passing metadata
extracted from the allocator library to the rendering/texturing API
through a new entry point. This instructs the driver to perform any
flushes/decompressions/etc. needed to transition to the next usage in
the pipeline.
3) Insert/extract your fence (potentially this is combined with the
above entry point, as it is in GL_EXT_semaphore).
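The ordering of those three steps can be sketched with a toy command stream. This is purely illustrative: the command types, `struct usage_metadata`, `cmd_push()` and `render_and_hand_off()` are hypothetical stand-ins, not the prototype allocator's API or any GL/Vulkan entry point. It only shows that the transition is recorded after the rendering work and before the fence that the next device waits on.

```c
#include <stddef.h>

/* Toy command stream; all names here are hypothetical. */
enum cmd_type { CMD_DRAW, CMD_TRANSITION, CMD_SIGNAL_FENCE };

struct cmd {
    enum cmd_type type;
    int arg;
};

#define MAX_CMDS 16

struct cmd_stream {
    struct cmd cmds[MAX_CMDS];
    size_t count;
};

/* Stand-in for the opaque usage metadata obtained from the allocator library. */
struct usage_metadata {
    int next_usage;
};

/* Append one command; returns -1 if the stream is full. */
int cmd_push(struct cmd_stream *cs, enum cmd_type type, int arg)
{
    if (cs->count >= MAX_CMDS)
        return -1;
    cs->cmds[cs->count].type = type;
    cs->cmds[cs->count].arg = arg;
    cs->count++;
    return 0;
}

/* 1) render, 2) transition to the next usage, 3) signal the fence. */
int render_and_hand_off(struct cmd_stream *cs,
                        const struct usage_metadata *next, int fence_id)
{
    if (cmd_push(cs, CMD_DRAW, 0) != 0)
        return -1;
    /* The driver would do its flushes/decompression when it sees this. */
    if (cmd_push(cs, CMD_TRANSITION, next->next_usage) != 0)
        return -1;
    return cmd_push(cs, CMD_SIGNAL_FENCE, fence_id);
}
```

Because the fence is signalled only after the transition command, a consumer that waits on it sees the buffer in the layout and cache state the next usage expects.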
Post by Rob Clark
I wonder a bit about perf tools and related things.. gallium HUD and
apitrace use SwapBuffers() as a frame marker..
Yes, end frame markers are convenient but have never been completely
reliable anyway. Many apps exist now that never call SwapBuffers().
Presumably these tools could add more markers to detect frames, such as
transitions, or pseudo-extensions implemented in their wrapper code
(similar to Vulkan layers) that these special types of apps could call
explicitly.

Thanks,
-James
Post by Rob Clark
The one other useful piece of information provided by EGL surfaces that I suspect
only our hardware cares about is whether the app is potentially going to
bind a depth buffer along with the color buffers from the surface, and
AFAICT, the GBM notion of surfaces doesn't provide enough information for
our driver to determine that at surface creation time, so the GBM surface
mechanism doesn't fit quite right with NVIDIA hardware anyway.
That's all for the compositors, embedded apps, demos, and whatnot that are
using GBM directly though. Every existing GL wayland client needs to be
able to get an EGLSurface and call eglSwapBuffers() on it. As I mentioned
in my XDC 2017 slides, I think that's best handled by a generic EGL window
system implementation that all drivers could share, and which uses allocator
mechanisms behind the scenes to build up an EGLSurface from individual
buffers. It would all have to be transparent to apps, but we already had
that working with our EGLStreams wayland implementation, and the Mesa
Wayland EGL client does roughly the same thing with DRM or GBM buffers IIRC,
but without a driver-external interface. It should be possible with generic
allocator buffers too. Jason's Vulkan WSI improvements that were sent out
recently move Vulkan in that direction already as well, and that was always
one of the goals of the Vulkan external objects extensions.
This is all a really long-winded way of saying yeah I think it would be
technically feasible to implement GBM on top of the generic allocator
mechanisms, but I don't think that's a very interesting undertaking. It'd
just be an ABI-compatibility thing for a bunch of open-source apps, which
seems unnecessary in the long run since the apps can just be patched
instead. Maybe it's useful as a transition mechanism though.
However, if the generic allocator is going to be something separate from
GBM, I think the idea of modernizing & adapting the existing GBM backend
infrastructure in Mesa to serve as a backend for the allocator is a good
idea. Maybe it's easier to just let GBM sit on that same updated backend
beside the allocator API. For GBM, all the interesting stuff happens in the
backend anyway.
right
BR,
-R
Rob Clark
2017-12-06 13:03:09 UTC
Post by James Jones
Post by Rob Clark
Post by Rob Clark
Post by Jason Ekstrand
On Sat, Nov 25, 2017 at 12:46 PM, Jason Ekstrand
Post by Jason Ekstrand
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I don't really view them as competing.. there is *some* overlap, ie.
allocating a buffer.. but even if you are using GBM w/out $new_thing
you could allocate a buffer externally and import it. I don't see
$new_thing as that much different from GBM PoV.
But things like surfaces (aka swap chains) seem a bit out of place
when you are thinking about implementing $new_thing for non-gpu
devices. Plus EGL<->GBM tie-ins that seem out of place when talking
about a (for ex.) camera. I kinda don't want to throw out the baby
with the bathwater here.
Agreed. GBM is very EGLish and we don't want the new allocator to be that.
*maybe* GBM could be partially implemented on top of $new_thing. I
don't quite see how that would work. Possibly we could deprecate
parts of GBM that are no longer needed? idk.. Either way, I fully
expect that GBM and mesa's implementation of $new_thing could perhaps
sit on top of some of the same set of internal APIs. The public
interface can be decoupled from the internal implementation.
Maybe I should restate things a bit. My real point was that modifiers +
$new_thing + Kernel blob should be a complete and more powerful replacement
for GBM. I don't know that we really can implement GBM on top of it because
GBM has lots of wishy-washy concepts such as "cursor plane" which may not
map well at least not without querying the kernel about specific display
planes. In particular, I don't want someone to feel like they need to use
$new_thing and GBM at the same time or together. Ideally, I'd like them to
never do that unless we decide gbm_bo is a useful abstraction for
$new_thing.
(just to repeat what I mentioned on irc)
I think the main thing is how do you create a swapchain/surface and know
which is the current front buffer after SwapBuffers().. those are the only
bits of GBM that seem like they would still be useful. idk, maybe
there is some other idea.
I don't view this as terribly useful except for legacy apps that need an EGL
window surface and can't be updated to use new methods. Wayland compositors
certainly don't fall in that category. I don't know that any GBM apps do.
kmscube doesn't count? :-P
Hmm, I assumed weston and the other wayland compositors were still
using gbm to create EGL surfaces, but I confess to have not actually
looked at weston src code for quite a few years now.
Anyways, I think it is perfectly fine for GBM to stay as-is in its
current form. It can already import dma-buf fd's, and those can
certainly come from $new_thing.
So I guess we want an EGL extension to return the allocator device
instance for the GPU. That also takes care of the non-bare-metal
case.
Rather, I think the way forward for the classes of apps that need something
like GBM or the generic allocator is more or less the path ChromeOS took
with their graphics architecture: Render to individual buffers (using FBOs
bound to imported buffers in GL) and manage buffer exchanges/blits manually.
The useful abstraction surfaces provide isn't so much deciding which buffer
is currently "front" and "back", but rather handling the transition/hand-off
to the window system/display device/etc. in SwapBuffers(), and the whole
idea of the allocator proposals is to make that something the application or
at least some non-driver utility library handles explicitly based on where
exactly the buffer is being handed off to.
Hmm, ok.. I guess the transition will need some hook into the driver.
For freedreno and vc4 (and I suspect this is not uncommon for tiler
GPUs), switching FBOs doesn't necessarily flush rendering to hw.
Maybe it would work out if you requested the sync fd file descriptor
from an EGL fence before passing things to next device, as that would
flush rendering.
This "flush" is exactly what usage transitions are for:
1) Perform rendering or texturing
2) Insert a transition into the command stream, passing metadata extracted
from the allocator library to the rendering/texturing API through a new
entry point. This instructs the driver to perform any
flushes/decompressions/etc. needed to transition to the next usage in the
pipeline.
3) Insert/extract your fence (potentially this is combined with above entry
point like it is in GL_EXT_semaphore).
yeah, I'm coming to the conclusion that either a transition or simply
the act of requesting the fence fd from an EGL (or vk?) fence is
sufficient hint to the driver to know to flush a tiling pass.
Post by James Jones
Post by Rob Clark
I wonder a bit about perf tools and related things.. gallium HUD and
apitrace use SwapBuffers() as a frame marker..
Yes, end frame markers are convenient but have never been completely
reliable anyway. Many apps exist now that never call SwapBuffers().
Presumably these tools could add more markers to detect frames, including
transitions or pseudo-extensions they implement in their wrapper code
similar to Vulkan layers that these special types of apps could use
explicitly.
I guess there are even things like GREMEDY_frame_terminator
extension.. which could be extended to GLES or perhaps an EGL equiv
introduced. I suppose there are a few different ways we could solve
this..

BR,
-R
Nicolai Hähnle
2017-11-29 12:19:20 UTC
I'm not quite sure what I think about this. I think I would like
to see $new_thing at least replace the guts of GBM. Whether GBM becomes
a wrapper around $new_thing or $new_thing implements the GBM API, I'm
not sure. What I don't think I want is to see GBM development
continuing on its own so we have two competing solutions.
I *think* I like the idea of having $new_thing implement GBM as a
deprecated legacy API. Whether that means we start by pulling GBM out
into its own project or we start over, I don't know. My feeling is
that the current dri_interface is *not* what we want which is why
starting with GBM makes me nervous.
Why not?

The most basic part of the dri_interface is just a
__driDriverGetExtensions_xxx function that returns an array of pointers
to extension structures derived from __DRIextension.
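
To make the shape of that mechanism concrete, here is a minimal sketch of walking such an extension array. All names below (ext_base, ext_example, get_extensions_example, find_extension, the EXAMPLE_* strings) are stand-ins made up for illustration, not the real dri_interface.h identifiers, and the NULL-terminated array is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the base extension struct: a name plus a version. */
struct ext_base {
    const char *name;
    int version;
};

/* A "derived" extension embeds the base as its first member. */
struct ext_example {
    struct ext_base base;
    int (*answer)(void);
};

static int example_answer(void) { return 42; }

static const struct ext_base core_ext = { "EXAMPLE_Core", 1 };
static const struct ext_example example_ext = {
    { "EXAMPLE_Thing", 2 }, example_answer
};

/* Stand-in for the driver entry point; returns a NULL-terminated array. */
static const struct ext_base **get_extensions_example(void)
{
    static const struct ext_base *exts[] = {
        &core_ext, &example_ext.base, NULL
    };
    return exts;
}

/* Look up an extension by name, requiring at least min_version. */
static const struct ext_base *find_extension(const struct ext_base **exts,
                                             const char *name,
                                             int min_version)
{
    assert(exts && name);
    for (size_t i = 0; exts[i] != NULL; i++) {
        if (strcmp(exts[i]->name, name) == 0 &&
            exts[i]->version >= min_version)
            return exts[i];
    }
    return NULL;
}
```

A caller that finds "EXAMPLE_Thing" at version 2 or later can cast the returned pointer to struct ext_example and use its function pointers. Cleaning up the set of extensions would not need to touch this lookup at all.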

That is *perfectly fine*.

I completely agree if you limit your statement to saying that the
current *set of extensions* that are exposed by this interface are full
of X-isms, and it's a good idea to do a thorough house-cleaning in
there. This can go all the way up to eventually phasing out the DRI_Core
"extension" as far as I'm concerned.

I know it's tempting to reinvent the world every couple of years, but
it's just *better* to find an evolutionary path that makes sense.

Cheers,
Nicolai
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
Jason Ekstrand
2017-11-29 17:36:51 UTC
Post by Nicolai Hähnle
Post by Jason Ekstrand
I'm not quite sure what I think about this. I think I would like to
see $new_thing at least replace the guts of GBM. Whether GBM becomes a
wrapper around $new_thing or $new_thing implements the GBM API, I'm not
sure. What I don't think I want is to see GBM development continuing on
its own so we have two competing solutions.
I *think* I like the idea of having $new_thing implement GBM as a
deprecated legacy API. Whether that means we start by pulling GBM out into
its own project or we start over, I don't know. My feeling is that the
current dri_interface is *not* what we want which is why starting with GBM
makes me nervous.
Why not?
The most basic part of the dri_interface is just a
__driDriverGetExtensions_xxx function that returns an array of pointers to
extension structures derived from __DRIextension.
That is *perfectly fine*.
Fair enough. I'm perfectly happy to re-use a well-tested API extension
mechanism.
Post by Nicolai Hähnle
I completely agree if you limit your statement to saying that the current
*set of extensions* that are exposed by this interface are full of X-isms,
and it's a good idea to do a thorough house-cleaning in there. This can go
all the way up to eventually phasing out the DRI_Core "extension" as far as
I'm concerned.
That's more of what I was getting at. In particular, I don't want the
design of $new_thing to be constrained by trying to cram into the current
DRI extensions nor do I want it to attempt to have exactly the same set of
functionality as the current DRI extensions (or GBM) support.
Nicolai Hähnle
2017-11-30 22:43:17 UTC
Hi,

I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback,
but for now let's focus on the high-level stuff first.

Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.

Capabilities and capability sets should be cleaned up in my opinion, as
the status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so
that's what they should be called.

This conceptually simplifies `derive_capabilities` significantly without
any loss of expressiveness as far as I can see. Given two lists of
memory layouts, we simply look for which memory layouts appear in both
lists, and then merge their constraints and capabilities.

Merging constraints looks good to me.

Capabilities need some more thought. The prototype removes capabilities
when merging layouts, but I'd argue that that is often undesirable. (In
fact, I cannot think of capabilities which we'd always want to remove.)

A typical example for this is compression (i.e. DCC in our case). For
rendering usage, we'd return something like:

Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)

For display usage, we might return (depending on hardware):

Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)

Merging these in the prototype would remove the DCC capability, even
though it might well make sense to keep it there for rendering. Dealing
with the fact that display usage does not have this capability is
precisely one of the two things that transitions are about! The other
thing that transitions are about is caches.
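
To illustrate the concern, here is a rough sketch of an intersect-style merge as I understand it from the description above. The struct and function names (mem_layout, merge_layouts, CAP_DCC) are hypothetical and are not taken from the prototype repository. Constraints are merged by taking the larger alignment, which assumes power-of-two alignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not the prototype's real structs. */
enum { CAP_DCC = 1u << 0 };

struct mem_layout {
    const char *name;    /* e.g. "AMD/tiled" */
    uint32_t alignment;  /* bytes, assumed to be a power of two */
    uint32_t caps;       /* bitmask of optional capabilities */
};

/*
 * Intersect two layout lists by name. Constraints are merged by taking
 * the stricter (larger) alignment. Capabilities are intersected, which
 * mirrors the "remove on merge" behaviour being questioned here.
 * Returns the number of merged layouts written to out.
 */
static size_t merge_layouts(const struct mem_layout *a, size_t na,
                            const struct mem_layout *b, size_t nb,
                            struct mem_layout *out, size_t max_out)
{
    size_t n = 0;

    assert(a && b && out);
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb && n < max_out; j++) {
            if (strcmp(a[i].name, b[j].name) != 0)
                continue;
            out[n].name = a[i].name;
            out[n].alignment = a[i].alignment > b[j].alignment
                             ? a[i].alignment : b[j].alignment;
            out[n].caps = a[i].caps & b[j].caps;
            n++;
        }
    }
    return n;
}
```

Fed the rendering and display layouts from the example, this returns a single AMD/tiled entry with 64k alignment and an empty capability mask, so DCC is lost even for the rendering side.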

I think this is kind of what Rob was saying in one of his mails.

Two interesting questions:

1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?

2. What happens when we merge memory layouts with sets of capabilities
where neither is a subset of the other?

As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory
layout, which is often vendor-specific. But even linear layouts need
some transitions for caches. We probably need at least some generic
"off-device usage" bit.

Cheers,
Nicolai

[0] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
Rob Clark
2017-12-01 15:06:53 UTC
Post by Nicolai Hähnle
Hi,
I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback, but
for now let's focus on the high-level stuff first.
Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.
Capabilities and capability sets should be cleaned up in my opinion, as the
status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so that's
what they should be called.
This conceptually simplifies `derive_capabilities` significantly without any
loss of expressiveness as far as I can see. Given two lists of memory
layouts, we simply look for which memory layouts appear in both lists, and
then merge their constraints and capabilities.
Merging constraints looks good to me.
Capabilities need some more thought. The prototype removes capabilities when
merging layouts, but I'd argue that that is often undesirable. (In fact, I
cannot think of capabilities which we'd always want to remove.)
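A typical example for this is compression (i.e. DCC in our case). For
rendering usage, we'd return something like:
Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
For display usage, we might return (depending on hardware):
Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)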
Merging these in the prototype would remove the DCC capability, even though
it might well make sense to keep it there for rendering. Dealing with the
fact that display usage does not have this capability is precisely one of
the two things that transitions are about! The other thing that transitions
are about is caches.
I think this is kind of what Rob was saying in one of his mails.
Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC. Other than that they are both things describing
the layout.

So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
(FOO/cached). But the GPU supported the following transitions:

trans_a: FOO/CC -> null
trans_b: FOO/cached -> null

Then the sets for each device (in order of preference):

GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)

Display:
1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
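
As a rough sketch of the merge I have in mind (all names here are made up for illustration, and it only handles the GPU->display direction): a GPU set survives the merge if every cap the display lacks has a "cap -> null" transition, the alignment becomes the stricter of the two (assuming power-of-two alignments), and the GPU-only caps are recorded as the transitions to run on hand-off.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FOO_TILED  = 1u << 0,
    FOO_CC     = 1u << 1,
    FOO_CACHED = 1u << 2,
};

struct cap_set {
    uint32_t caps;
    uint32_t alignment;  /* bytes, power of two assumed */
};

struct merged_set {
    uint32_t caps;            /* caps the buffer is allocated with */
    uint32_t alignment;
    uint32_t gpu_to_display;  /* caps to transition to null on hand-off */
};

/*
 * gpu_transitions is the mask of caps for which the GPU offers a
 * "cap -> null" transition. Returns 1 and fills *out if the two sets
 * can be merged, 0 otherwise.
 */
static int merge_with_transitions(const struct cap_set *gpu,
                                  const struct cap_set *display,
                                  uint32_t gpu_transitions,
                                  struct merged_set *out)
{
    assert(gpu && display && out);

    /* Simplification: the display must not need caps the GPU set lacks. */
    if (display->caps & ~gpu->caps)
        return 0;

    /* Every GPU-only cap needs a transition that removes it. */
    uint32_t extra = gpu->caps & ~display->caps;
    if (extra & ~gpu_transitions)
        return 0;

    out->caps = gpu->caps;
    out->alignment = gpu->alignment > display->alignment
                   ? gpu->alignment : display->alignment;
    out->gpu_to_display = extra;
    return 1;
}
```

Running each of the three GPU sets above against the display set with trans_a and trans_b available gives the three merged results listed, in the same preference order.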
Post by Nicolai Hähnle
1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?
I think the original idea was, "no".. perhaps that restriction
could be lifted if transitions were part of the result. Or maybe you
just query independently the same device for multiple different
usages, and then merge that cap-set.

(Do we need to care about intra-device transitions? Or can we just
let the driver care about that, same as it always has?)
Post by Nicolai Hähnle
2. What happens when we merge memory layouts with sets of capabilities where
neither is a subset of the other?
I think this is a case where no zero-copy sharing is possible, right?
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps. Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)

BR,
-R
Post by Nicolai Hähnle
Cheers,
Nicolai
[0] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
Nicolai Hähnle
2017-12-01 17:09:32 UTC
Post by Rob Clark
Post by Nicolai Hähnle
Hi,
I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback, but
for now let's focus on the high-level stuff first.
Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.
Capabilities and capability sets should be cleaned up in my opinion, as the
status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so that's
what they should be called.
This conceptually simplifies `derive_capabilities` significantly without any
loss of expressiveness as far as I can see. Given two lists of memory
layouts, we simply look for which memory layouts appear in both lists, and
then merge their constraints and capabilities.
Merging constraints looks good to me.
Capabilities need some more thought. The prototype removes capabilities when
merging layouts, but I'd argue that that is often undesirable. (In fact, I
cannot think of capabilities which we'd always want to remove.)
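A typical example for this is compression (i.e. DCC in our case). For
rendering usage, we'd return something like:
Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
For display usage, we might return (depending on hardware):
Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)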
Merging these in the prototype would remove the DCC capability, even though
it might well make sense to keep it there for rendering. Dealing with the
fact that display usage does not have this capability is precisely one of
the two things that transitions are about! The other thing that transitions
are about is caches.
I think this is kind of what Rob was saying in one of his mails.
Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC. Other than that they are both things describing
the layout.
The reason that a transition can be provided is that they aren't quite
the same thing, though. In a very real sense, AMD/DCC is a "child"
property of AMD/tiled: DCC is implemented as a meta surface whose memory
layout depends on the layout of the main surface.

Although, if there are GPUs that can do an in-place "transition" between
different tiling layouts, then the distinction is perhaps really not as
clear-cut. I guess that would only apply to tiled renderers.
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
(FOO/cached). But the GPU supported the following transitions:
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
Then the sets for each device (in order of preference):
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to
program our hardware so that the backend bypasses all caches, but (a)
nobody validates that and (b) it's basically suicide in terms of
performance. Let's build fewer footguns :)

So at least for radeonsi, we wouldn't want to have an AMD/cached bit,
but we'd still want to have a transition between the GPU and display
precisely to flush caches.
Post by Rob Clark
Post by Nicolai Hähnle
1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?
I think the original idea was, "no".. perhaps that restriction
could be lifted if transitions were part of the result. Or maybe you
just query independently the same device for multiple different
usages, and then merge that cap-set.
(Do we need to care about intra-device transitions? Or can we just
let the driver care about that, same as it always has?)
Post by Nicolai Hähnle
2. What happens when we merge memory layouts with sets of capabilities where
neither is a subset of the other?
I think this is a case where no zero-copy sharing is possible, right?
Not necessarily. Let's say we have some industry-standard tiling layout
foo, and vendors support their own proprietary framebuffer compression
on top of that.

In that case, we may get:

Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)

It should be possible to allocate a surface as

caps(BASE/foo, VND1/compressed)

and just transition the cap(VND1/compressed) away after rendering before
accessing it with device 2.

The interesting question is whether it would be possible or ever useful
to have a surface allocated as caps(BASE/foo, VND1/compressed,
VND2/compressed).

My guess is: there will be cases where it's possible, but there won't be
cases where it's useful (because you tend to render on device 1 and just
sample or scanout on device 2).

So it makes sense to say that derive_capabilities should just provide
both layouts in this case.
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps. Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.

From the desktop perspective, I would say let's ignore caches, the
drivers know which caches they need to flush to make data visible to
other devices on the system.

On the other hand, there are probably SoC cases where non-coherent
caches are shared between some but not all devices, and in that case
perhaps we do need to communicate this.

So perhaps we should have two kinds of "capabilities".

The first, like framebuffer compression, is a capability of the
allocated memory layout (because the compression requires a meta
surface), and devices that expose it may opportunistically use it.

The second, like caches, is a capability that the device/driver will use
and you don't get a say in it, but other devices/drivers also don't need
to be aware of them.

So then you could theoretically have a system that gives you:

GPU: FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video: FOO/tiled(dev-caps=FOO/vid-cache)
Camera: FOO/tiled(dev-caps=FOO/vid-cache)

... from which a FOO/tiled(FOO/cc) surface would be allocated.

The idea here is that whether a transition is required is fully visible
from the capabilities:

1. Moving an image from the camera to the video engine for immediate
compression requires no transition.

2. Moving an image from the camera or video engine to the display
requires a transition by the video/camera device/API, which may flush
the video cache.

3. Moving an image from the camera or video engine to the GPU
additionally requires a transition by the GPU, which may invalidate the
GPU cache.

4. Moving an image from the GPU anywhere else requires a transition by
the GPU; in all cases, GPU caches may be flushed. When moving to the
video engine or camera, the image additionally needs to be decompressed.
When moving to the video engine (or camera? :)), a transition by the
video engine is also required, which may invalidate the video cache.

5. Moving an image from the display to the video engine requires a
decompression -- oops! :)

Ignoring that last point for now, I don't think you actually need a
"query_transition" function in libdevicealloc with this approach, for
the most part.

Instead, each API needs to provide import and export transition/barrier
functions which receive the previous/next layout and capability set.

Basically, to import a frame from the camera to OpenGL/Vulkan in the
above system, you'd first do the camera transition:

struct layout_capability cc_cap = { FOO, FOO_CC };
struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };

cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);

and then e.g. an OpenGL import transition:

struct device_capability vid_cache = { FOO, FOO_VID_CACHE };

glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);

By looking at the capabilities for the other device, each API's driver
can derive the required transition steps.
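
To make the five cases above concrete, here is a minimal sketch in C of how a driver might derive the steps from the two capability sets. Everything in it is an assumption for illustration: the bitmask encoding, the names device_caps, transition_steps and derive_transition, and the rule that the exporter handles whatever the importer lacks. None of it is part of the allocator prototype or of any existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits for the made-up vendor FOO. */
enum { FOO_CC = 1 << 0 };                                /* layout capability  */
enum { FOO_GPU_CACHE = 1 << 0, FOO_VID_CACHE = 1 << 1 }; /* device capabilities */

#define KNOWN_LAYOUT_CAPS ((uint32_t)FOO_CC)
#define KNOWN_DEV_CAPS    ((uint32_t)(FOO_GPU_CACHE | FOO_VID_CACHE))

struct device_caps {
    uint32_t layout_caps; /* layout features the device can read and write */
    uint32_t dev_caps;    /* non-coherent caches the device goes through   */
};

/* The example system from the text. */
const struct device_caps GPU     = { FOO_CC, FOO_GPU_CACHE };
const struct device_caps DISPLAY = { FOO_CC, 0 };
const struct device_caps VIDEO   = { 0, FOO_VID_CACHE };
const struct device_caps CAMERA  = { 0, FOO_VID_CACHE };

struct transition_steps {
    bool export_decompress; /* exporter must resolve layout caps the importer lacks */
    bool export_flush;      /* exporter must flush caches the importer cannot see   */
    bool import_invalidate; /* importer must invalidate caches the exporter bypassed */
};

struct transition_steps derive_transition(struct device_caps src,
                                          struct device_caps dst)
{
    struct transition_steps steps;

    /* Reject capability bits this sketch does not know about. */
    assert((src.layout_caps & ~KNOWN_LAYOUT_CAPS) == 0);
    assert((dst.layout_caps & ~KNOWN_LAYOUT_CAPS) == 0);
    assert((src.dev_caps & ~KNOWN_DEV_CAPS) == 0);
    assert((dst.dev_caps & ~KNOWN_DEV_CAPS) == 0);

    steps.export_decompress = (src.layout_caps & ~dst.layout_caps) != 0;
    steps.export_flush      = (src.dev_caps & ~dst.dev_caps) != 0;
    steps.import_invalidate = (dst.dev_caps & ~src.dev_caps) != 0;
    return steps;
}
```

Calling derive_transition(CAMERA, VIDEO) reports no steps at all, while derive_transition(GPU, VIDEO) reports all three. For derive_transition(DISPLAY, VIDEO) the sketch asks the exporter to decompress, which is exactly the case the display engine cannot handle (point 5).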

There are probably more gaps, but these are the two I can think of right
now, and both related to the initialization status of meta surfaces,
i.e. FOO/cc:

1. Point 5 above about moving away from the display engine in the
example. This is an ugly asymmetry in the rule that each engine performs
its required import and export transitions.

2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression
meta surface can be in one of two states:
- reflecting a fully decompressed surface (if the surface was previously
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then
handed off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't
with the scheme above.

Something to think about :)

Also, not really a gap, but something to keep in mind: for multi-GPU
systems, the cache-capability needs to carry the device number or PCI
bus id or something, at least as long as those caches are not coherent
between GPUs.
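
One way to picture this is a cache capability that names a specific device instance. The sketch below assumes a third field, device_id, added to the two-field device_capability used in the example above. The field, its 64-bit width, and the helper same_cache are hypothetical and only show the comparison a driver would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor and capability identifiers. */
enum { FOO = 1 };
enum { FOO_GPU_CACHE = 1 };

/* Assumed extension of the two-field capability from the example:
 * device_id could be a DRM device number or a packed PCI bus id. */
struct device_capability {
    uint32_t vendor;
    uint32_t name;
    uint64_t device_id;
};

/* Two capabilities describe the same cache only if they also refer
 * to the same device instance. */
bool same_cache(const struct device_capability *a,
                const struct device_capability *b)
{
    assert(a != NULL && b != NULL);
    return a->vendor == b->vendor &&
           a->name == b->name &&
           a->device_id == b->device_id;
}
```

With this, two GPUs from the same vendor report distinct cache capabilities, so a transition between them would still see a flush and an invalidate as required.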

Cheers,
Nicolai
Post by Rob Clark
BR,
-R
Post by Nicolai Hähnle
Cheers,
Nicolai
[0] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
Post by James Jones
As many here know at this point, I've been working on solving issues
related to DMA-capable memory allocation for various devices for some time
now. I'd like to take this opportunity to apologize for the way I handled
the EGL stream proposals. I understand now that the development process
followed there was unacceptable to the community and likely offended many
great engineers.
Moving forward, I attempted to reboot talks in a more constructive manner
with the generic allocator library proposals & discussion forum at XDC 2016.
Some great design ideas came out of that, and I've since been prototyping
some code to prove them out before bringing them back as official proposals.
Again, I understand some people are growing concerned that I've been doing
this off on the side in a github project that has primarily NVIDIA
contributors. My goal was only to avoid wasting everyone's time with
unproven ideas. The intent was never to dump the prototype code as-is on
the community and presume acceptance. It's just a public research project.
Now the prototyping is nearing completion, and I'd like to renew
discussion on whether and how the new mechanisms can be integrated with the
Linux graphics stack.
I'd be interested to know if more work is needed to demonstrate the
usefulness of the new mechanisms, or whether people think they have value at
this point.
After talking with people on the hallway track at XDC this year, I've
heard several proposals for incorporating the new mechanisms:
-Include ideas from the generic allocator design into GBM. This could
take the form of designing a "GBM 2.0" API, or incrementally adding to the
existing GBM API.
-Develop a library to replace GBM. The allocator prototype code could be
massaged into something production worthy to jump start this process.
-Develop a library that sits beside or on top of GBM, using GBM for
low-level graphics buffer allocation, while supporting non-graphics kernel
APIs directly. The additional cross-device negotiation and sorting of
capabilities would be handled in this slightly higher-level API before
handing off to GBM and other APIs for actual allocation somehow.
-I have also heard some general comments that regardless of the
relationship between GBM and the new allocator mechanisms, it might be time
to move GBM out of Mesa so it can be developed as a stand-alone project.
I'd be interested what others think about that, as it would be something
worth coordinating with any other new development based on or inside of GBM.
And of course I'm open to any other ideas for integration. Beyond just
where this code would live, there is much to debate about the mechanisms
themselves and all the implementation details. I was just hoping to kick
things off with something high level to start.
For reference, the code Miguel and I have been developing for the
https://github.com/cubanismo/allocator
And we've posted a port of kmscube that uses the new interfaces as a
https://github.com/cubanismo/kmscube
There are still some proposed mechanisms (usage transitions mainly) that
aren't prototyped, but I think it makes sense to start discussing
integration while prototyping continues.
In addition, I'd like to note that NVIDIA is committed to providing open
source driver implementations of these mechanisms for our hardware, in
addition to support in our proprietary drivers. In other words, wherever
modifications to the nouveau kernel & userspace drivers are needed to
implement the improved allocator mechanisms, we'll be contributing patches
if no one beats us to it.
Thanks in advance for any feedback!
-James Jones
_______________________________________________
mesa-dev mailing list
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
--
Learn how the world really is,
but never forget how it should be.
Nicolai Hähnle
2017-12-01 18:34:04 UTC
On 01.12.2017 18:09, Nicolai Hähnle wrote:
[snip]
Post by Nicolai Hähnle
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps.  Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.
From the desktop perspective, I would say let's ignore caches, the
drivers know which caches they need to flush to make data visible to
other devices on the system.
On the other hand, there are probably SoC cases where non-coherent
caches are shared between some but not all devices, and in that case
perhaps we do need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the
allocated memory layout (because the compression requires a meta
surface), and devices that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will use
and you don't get a say in it, but other devices/drivers also don't need
to be aware of them.
GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video:   FOO/tiled(dev-caps=FOO/vid-cache)
Camera:  FOO/tiled(dev-caps=FOO/vid-cache)
[snip]

FWIW, I think all that stuff about different caches quite likely
over-complicates things. At the end of each "command submission" of
whichever type of engine, the buffer must be in a state where the kernel
is free to move it around for memory management purposes. This already
puts a big constraint on the kind of (non-coherent) caches that can be
supported anyway, so I wouldn't be surprised if we could get away with a
*much* simpler approach.

Cheers,
Nicolai
--
Learn how the world really is,
but never forget how it should be.
James Jones
2017-12-06 07:01:00 UTC
Post by Rob Clark
[snip]
Post by Nicolai Hähnle
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps.  Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability
at least for radeonsi.
 From the desktop perspective, I would say let's ignore caches, the
drivers know which caches they need to flush to make data visible to
other devices on the system.
On the other hand, there are probably SoC cases where non-coherent
caches are shared between some but not all devices, and in that case
perhaps we do need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the
allocated memory layout (because the compression requires a meta
surface), and devices that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will
use and you don't get a say in it, but other devices/drivers also
don't need to be aware of them.
GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video:   FOO/tiled(dev-caps=FOO/vid-cache)
Camera:  FOO/tiled(dev-caps=FOO/vid-cache)
[snip]
FWIW, I think all that stuff about different caches quite likely
over-complicates things. At the end of each "command submission" of
whichever type of engine, the buffer must be in a state where the kernel
is free to move it around for memory management purposes. This already
puts a big constraint on the kind of (non-coherent) caches that can be
supported anyway, so I wouldn't be surprised if we could get away with a
*much* simpler approach.
I'd rather not depend on this type of cleverness if possible. Other
kernels/OS's may not behave this way, and I'd like the allocator
mechanism to be something we can use across all or at least most of the
POSIX and POSIX-like OS's we support. Also, this particular example is
not true of our proprietary Linux driver, and I suspect it won't always
be the case for other drivers. If a particular driver or OS fits this
assumption, the driver is always free to return no-op transitions in
that case.

Thanks,
-James
Post by Rob Clark
Cheers,
Nicolai
Nicolai Hähnle
2017-12-06 10:38:28 UTC
Post by Rob Clark
[snip]
Post by Nicolai Hähnle
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps.  Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability
at least for radeonsi.
 From the desktop perspective, I would say let's ignore caches, the
drivers know which caches they need to flush to make data visible to
other devices on the system.
On the other hand, there are probably SoC cases where non-coherent
caches are shared between some but not all devices, and in that case
perhaps we do need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the
allocated memory layout (because the compression requires a meta
surface), and devices that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will
use and you don't get a say in it, but other devices/drivers also
don't need to be aware of them.
GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video:   FOO/tiled(dev-caps=FOO/vid-cache)
Camera:  FOO/tiled(dev-caps=FOO/vid-cache)
[snip]
FWIW, I think all that stuff about different caches quite likely
over-complicates things. At the end of each "command submission" of
whichever type of engine, the buffer must be in a state where the
kernel is free to move it around for memory management purposes. This
already puts a big constraint on the kind of (non-coherent) caches
that can be supported anyway, so I wouldn't be surprised if we could
get away with a *much* simpler approach.
I'd rather not depend on this type of cleverness if possible.  Other
kernels/OS's may not behave this way, and I'd like the allocator
mechanism to be something we can use across all or at least most of the
POSIX and POSIX-like OS's we support.  Also, this particular example is
not true of our proprietary Linux driver, and I suspect it won't always
be the case for other drivers.  If a particular driver or OS fits this
assumption, the driver is always free to return no-op transitions in
that case.
Agreed.

(What I wrote about memory management should be true for all systems,
but the kernel could use an engine that goes through the relevant caches
for memory management-related buffer moves. It just so happens that it
doesn't do that on our hardware, but that's by no means universal.)

Cheers,
Nicolai
--
Learn how the world really is,
but never forget how it should be.
Rob Clark
2017-12-01 18:38:41 UTC
Post by Rob Clark
Post by Nicolai Hähnle
Hi,
I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback, but
for now let's focus on the high-level stuff first.
Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.
Capabilities and capability sets should be cleaned up in my opinion, as the
status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so that's
what they should be called.
This conceptually simplifies `derive_capabilities` significantly without any
loss of expressiveness as far as I can see. Given two lists of memory
layouts, we simply look for which memory layouts appear in both lists, and
then merge their constraints and capabilities.
Merging constraints looks good to me.
Capabilities need some more thought. The prototype removes capabilities when
merging layouts, but I'd argue that that is often undesirable. (In fact, I
cannot think of capabilities which we'd always want to remove.)
A typical example for this is compression (i.e. DCC in our case). For
Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
Merging these in the prototype would remove the DCC capability, even though
it might well make sense to keep it there for rendering. Dealing with the
fact that display usage does not have this capability is precisely one of
the two things that transitions are about! The other thing that transitions
are about is caches.
I think this is kind of what Rob was saying in one of his mails.
Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC. Other than that they are both things describing
the layout.
The reason that a transition can be provided is that they aren't quite the
same thing, though. In a very real sense, AMD/DCC is a "child" property of
AMD/tiled: DCC is implemented as a meta surface whose memory layout depends
on the layout of the main surface.
I suppose this is six-of-one, half-dozen of the other..

what you are calling a layout is what I'm calling a cap that just
happens not to have an associated transition
Although, if there are GPUs that can do an in-place "transition" between
different tiling layouts, then the distinction is perhaps really not as
clear-cut. I guess that would only apply to tiled renderers.
I suppose the advantage of just calling both layout and caps the same
thing, and just saying that a "cap" (or "layout" if you prefer that
name) can optionally have one or more associated transitions, is that
you can deal with cases where sometimes a tiled format might actually
have an in-place transition ;-)
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
with:

trans_a: FOO/CC -> null
trans_b: FOO/cached -> null

GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)

Display:
1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)

So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display

Hmm, I guess this does require the concept of a required cap..
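
The merge in the example above could be sketched roughly as follows. This is only an illustration under stated assumptions: caps are encoded as bits, alignments are powers of two (so the larger one satisfies both sides), and a cap that only one side has is acceptable only if a "cap -> null" transition exists for it. The names cap_set, merged_set, merge_cap_sets and TRANSITIONABLE are hypothetical and do not come from the allocator prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits for the made-up vendor FOO. */
enum {
    FOO_TILED  = 1 << 0,
    FOO_CC     = 1 << 1,
    FOO_CACHED = 1 << 2
};

/* Caps with a "cap -> null" transition (trans_a and trans_b in the text). */
#define TRANSITIONABLE ((uint32_t)(FOO_CC | FOO_CACHED))

struct cap_set {
    uint32_t caps;
    uint32_t alignment; /* bytes, assumed to be a power of two */
};

struct merged_set {
    uint32_t caps;
    uint32_t alignment;
    uint32_t gpu_to_display; /* caps to transition away for GPU->display */
    uint32_t display_to_gpu; /* caps to transition away for display->GPU */
};

/* Pairs every GPU set with every display set. `out` must have room for
 * n_gpu * n_disp entries. Returns the number of entries written. */
size_t merge_cap_sets(const struct cap_set *gpu, size_t n_gpu,
                      const struct cap_set *disp, size_t n_disp,
                      struct merged_set *out)
{
    size_t n = 0;

    assert(gpu != NULL && disp != NULL && out != NULL);
    for (size_t i = 0; i < n_gpu; i++) {
        for (size_t j = 0; j < n_disp; j++) {
            uint32_t only_gpu  = gpu[i].caps & ~disp[j].caps;
            uint32_t only_disp = disp[j].caps & ~gpu[i].caps;

            /* A cap known to one side only needs a transition to null;
             * otherwise the two sets cannot share a buffer. */
            if ((only_gpu | only_disp) & ~TRANSITIONABLE)
                continue;

            out[n].caps = gpu[i].caps | disp[j].caps;
            out[n].alignment = gpu[i].alignment > disp[j].alignment
                                   ? gpu[i].alignment : disp[j].alignment;
            out[n].gpu_to_display = only_gpu;
            out[n].display_to_gpu = only_disp;
            n++;
        }
    }
    return n;
}
```

Fed with the two GPU sets and the single display set from the example, this yields two merged entries, both with 64k alignment, and FOO_CACHED appears in gpu_to_display for each of them.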
So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
we'd still want to have a transition between the GPU and display precisely
to flush caches.
Post by Rob Clark
Post by Nicolai Hähnle
1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?
I think the original idea was, "no".. perhaps that restriction
could be lifted if transitions were part of the result. Or maybe you
just query independently the same device for multiple different
usages, and then merge that cap-set.
(Do we need to care about intra-device transitions? Or can we just
let the driver care about that, same as it always has?)
Post by Nicolai Hähnle
2. What happens when we merge memory layouts with sets of capabilities where
neither is a subset of the other?
I think this is a case where no zero-copy sharing is possible, right?
Not necessarily. Let's say we have some industry-standard tiling layout foo,
and vendors support their own proprietary framebuffer compression on top of
that.
Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)
It should be possible to allocate a surface as
caps(BASE/foo, VND1/compressed)
and just transition the cap(VND1/compressed) away after rendering before
accessing it with device 2.
In this case presumably VND1 defines transitions VND1/cc <-> null, and
VND2 does the same for VND2/cc ?
The interesting question is whether it would be possible or ever useful to
have a surface allocated as caps(BASE/foo, VND1/compressed,
VND2/compressed).
not sure if it is useful or not, but I think the idea of defining cap
transitions and returning the associated transitions when merging a
caps set could express this..
My guess is: there will be cases where it's possible, but there won't be
cases where it's useful (because you tend to render on device 1 and just
sample or scanout on device 2).
So it makes sense to say that derive_capabilities should just provide both
layouts in this case.
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps. Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.
From the desktop perspective, I would say let's ignore caches, the drivers
know which caches they need to flush to make data visible to other devices
on the system.
On the other hand, there are probably SoC cases where non-coherent caches
are shared between some but not all devices, and in that case perhaps we do
need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the allocated
memory layout (because the compression requires a meta surface), and devices
that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will use and
you don't get a say in it, but other devices/drivers also don't need to be
aware of them.
yeah, a required cap.. we had tried to avoid this, since unlike
constraints which are well defined, the core constraint/capability
merging wouldn't know what to do about merging parameterized caps.
But I guess if transitions are provided then it doesn't have to.
GPU: FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video: FOO/tiled(dev-caps=FOO/vid-cache)
Camera: FOO/tiled(dev-caps=FOO/vid-cache)
... from which a FOO/tiled(FOO/cc) surface would be allocated.
The idea here is that whether a transition is required is fully visible from
the capabilities:
1. Moving an image from the camera to the video engine for immediate
compression requires no transition.
2. Moving an image from the camera or video engine to the display requires a
transition by the video/camera device/API, which may flush the video cache.
3. Moving an image from the camera or video engine to the GPU additionally
requires a transition by the GPU, which may invalidate the GPU cache.
4. Moving an image from the GPU anywhere else requires a transition by the
GPU; in all cases, GPU caches may be flushed. When moving to the video
engine or camera, the image additionally needs to be decompressed. When
moving to the video engine (or camera? :)), a transition by the video engine
is also required, which may invalidate the video cache.
5. Moving an image from the display to the video engine requires a
decompression -- oops! :)
I guess it should be possible for devices to provide transitions in
both directions, which would deal with this..
Ignoring that last point for now, I don't think you actually need a
"query_transition" function in libdevicealloc with this approach, for the
most part.
with the idea of being able to provide optional transitions, I'm
leaning towards just having one of outputs of merging the caps sets
being the sets of transitions required to pass a buffer between
different devices..

although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..

/me shrugs
Instead, each API needs to provide import and export transition/barrier
functions which receive the previous/next layout and capability set.
Basically, to import a frame from the camera to OpenGL/Vulkan in the above
system, you'd first do the camera transition:
struct layout_capability cc_cap = { FOO, FOO_CC };
struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };
cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);
and then e.g. an OpenGL import transition:
struct device_capability vid_cache = { FOO, FOO_VID_CACHE };
glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);
By looking at the capabilities for the other device, each API's driver can
derive the required transition steps.
There are probably more gaps, but these are the two I can think of right
now, and both related to the initialization status of meta surfaces, i.e.
FOO/cc:
1. Point 5 above about moving away from the display engine in the example.
This is an ugly asymmetry in the rule that each engine performs its required
import and export transitions.
2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
surface can be in one of two states:
- reflecting a fully decompressed surface (if the surface was previously
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then handed
off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't with
the scheme above.
hmm, so I suppose this is also true in the cache case.. you want to
know if the buffer was written by someone else since you saw it last..
Something to think about :)
Also, not really a gap, but something to keep in mind: for multi-GPU
systems, the cache-capability needs to carry the device number or PCI bus id
or something, at least as long as those caches are not coherent between
GPUs.
yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..


BR,
-R
Miguel Angel Vico
2017-12-01 21:52:22 UTC
On Fri, 1 Dec 2017 13:38:41 -0500
Post by Rob Clark
Post by Rob Clark
Post by Nicolai Hähnle
Hi,
I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback, but
for now let's focus on the high-level stuff first.
Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.
Capabilities and capability sets should be cleaned up in my opinion, as the
status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so that's
what they should be called.
This conceptually simplifies `derive_capabilities` significantly without any
loss of expressiveness as far as I can see. Given two lists of memory
layouts, we simply look for which memory layouts appear in both lists, and
then merge their constraints and capabilities.
Merging constraints looks good to me.
Capabilities need some more thought. The prototype removes capabilities when
merging layouts, but I'd argue that that is often undesirable. (In fact, I
cannot think of capabilities which we'd always want to remove.)
A typical example for this is compression (i.e. DCC in our case). For
Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
Merging these in the prototype would remove the DCC capability, even though
it might well make sense to keep it there for rendering. Dealing with the
fact that display usage does not have this capability is precisely one of
the two things that transitions are about! The other thing that transitions
are about is caches.
I think this is kind of what Rob was saying in one of his mails.
Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC. Other than that they are both things describing
the layout.
The reason that a transition can be provided is that they aren't quite the
same thing, though. In a very real sense, AMD/DCC is a "child" property of
AMD/tiled: DCC is implemented as a meta surface whose memory layout depends
on the layout of the main surface.
I suppose this is six-of-one, half-dozen of the other..
what you are calling a layout is what I'm calling a cap that just
happens not to have an associated transition
Although, if there are GPUs that can do an in-place "transition" between
different tiling layouts, then the distinction is perhaps really not as
clear-cut. I guess that would only apply to tiled renderers.
I suppose the advantage of just calling both layout and caps the same
thing, and just saying that a "cap" (or "layout" if you prefer that
name) can optionally have one or more associated transitions, is that
you can deal with cases where sometimes a tiled format might actually
have an in-place transition ;-)
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
with:
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
GPU:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
Display:
1: caps(FOO/tiled); constraints(alignment=64k)
Merged Result:
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
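A minimal sketch of the merge described above, reading the flattened example as "GPU cap sets at alignment=32k, one display cap set at alignment=64k, then the merged result". The `CapSet` type, the `merge` function, and the rule that a GPU-only cap survives only if it has a cap -> null transition are illustrative assumptions, not the allocator prototype's API.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass(frozen=True)
class CapSet:
    caps: frozenset   # capability names, e.g. "FOO/tiled"
    alignment: int    # required alignment in bytes


def merge(gpu_sets, display_sets, gpu_transitions):
    """Merge GPU and display cap sets, keeping GPU-only caps that can be
    transitioned away before the display touches the buffer."""
    merged = []
    for g in gpu_sets:
        for d in display_sets:
            if not d.caps <= g.caps:
                continue  # display needs something this GPU entry lacks
            extra = g.caps - d.caps
            if not all(cap in gpu_transitions for cap in extra):
                continue  # a GPU-only cap without a cap -> null transition
            merged.append({
                "caps": g.caps,
                # Assumes power-of-two alignments, so the stricter one wins.
                "alignment": max(g.alignment, d.alignment),
                "gpu_to_display": sorted(gpu_transitions[c] for c in extra),
                "display_to_gpu": [],
            })
    return merged


# The second example above: no GPU entry without FOO/cached.
gpu_transitions = {"FOO/CC": "trans_a", "FOO/cached": "trans_b"}
gpu = [
    CapSet(frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32 * KIB),
    CapSet(frozenset({"FOO/tiled", "FOO/cached"}), 32 * KIB),
]
display = [CapSet(frozenset({"FOO/tiled"}), 64 * KIB)]

result = merge(gpu, display, gpu_transitions)
for entry in result:
    print(sorted(entry["caps"]), entry["alignment"], entry["gpu_to_display"])
```

Under these assumptions every merged entry still carries FOO/cached, and trans_b shows up in every GPU->display transition list, which is the property the example is after.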
Post by Rob Clark
So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
we'd still want to have a transition between the GPU and display precisely
to flush caches.
Post by Rob Clark
Post by Nicolai Hähnle
1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?
I think the original idea was, "no".. perhaps that restriction could be
lifted if transitions were part of the result. Or maybe you
just query independently the same device for multiple different
usages, and then merge that cap-set.
(Do we need to care about intra-device transitions? Or can we just
let the driver care about that, same as it always has?)
Post by Nicolai Hähnle
2. What happens when we merge memory layouts with sets of capabilities where
neither is a subset of the other?
I think this is a case where no zero-copy sharing is possible, right?
Not necessarily. Let's say we have some industry-standard tiling layout foo,
and vendors support their own proprietary framebuffer compression on top of
that.
Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)
It should be possible to allocate a surface as
caps(BASE/foo, VND1/compressed)
and just transition the cap(VND1/compressed) away after rendering before
accessing it with device 2.
In this case presumably VND1 defines transitions VND1/cc <-> null, and
VND2 does the same for VND2/cc ?
The interesting question is whether it would be possible or ever useful to
have a surface allocated as caps(BASE/foo, VND1/compressed,
VND2/compressed).
not sure if it is useful or not, but I think the idea of defining cap
transitions and returning the associated transitions when merging a
caps set could express this..
Yeah. I guess if we define per-cap transitions, the allocator should be
able to find a transition chain to express layouts like the above when
merging.
Post by Rob Clark
My guess is: there will be cases where it's possible, but there won't be
cases where it's useful (because you tend to render on device 1 and just
sample or scanout on device 2).
So it makes sense to say that derive_capabilities should just provide both
layouts in this case.
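A rough sketch of what "provide both layouts" could look like for the VND1/VND2 case. It assumes each vendor's compression cap has a cap -> null transition, as suggested above; `derive_layouts` and the dictionary keys are hypothetical names, not the prototype's `derive_capabilities`.

```python
def derive_layouts(dev1, dev2, removable):
    """Return candidate allocations when neither cap set is a subset of the
    other. `removable` holds caps that have a cap -> null transition."""
    (name1, caps1), (name2, caps2) = dev1, dev2
    common = caps1 & caps2
    if not common:
        return []  # no shared base layout: no zero-copy option in this sketch
    layouts = []
    # Try allocating with each device's full cap set in turn.
    for other, caps in ((name2, caps1), (name1, caps2)):
        extra = caps - common
        if extra <= removable:
            layouts.append({
                "allocate_as": sorted(caps),
                "drop_before": other,   # device that cannot read the extras
                "drop": sorted(extra),  # caps to transition away first
            })
    return layouts


dev1 = ("device1", frozenset({"BASE/foo", "VND1/compressed"}))
dev2 = ("device2", frozenset({"BASE/foo", "VND2/compressed"}))
removable = {"VND1/compressed", "VND2/compressed"}

layouts = derive_layouts(dev1, dev2, removable)
for layout in layouts:
    print(layout)
```

The first candidate is the one discussed above: allocate as caps(BASE/foo, VND1/compressed) and transition VND1/compressed away before device 2 accesses the surface.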
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps. Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.
From the desktop perspective, I would say let's ignore caches, the drivers
know which caches they need to flush to make data visible to other devices
on the system.
On the other hand, there are probably SoC cases where non-coherent caches
are shared between some but not all devices, and in that case perhaps we do
need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the allocated
memory layout (because the compression requires a meta surface), and devices
that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will use and
you don't get a say in it, but other devices/drivers also don't need to be
aware of them.
yeah, a required cap.. we had tried to avoid this, since unlike
constraints which are well defined, the core constraint/capability
merging wouldn't know what to do about merging parameterized caps.
But I guess if transitions are provided then it doesn't have to.
We are going to need required caps either way, right? The core
capability merging logic would try to find a compatible layout to be
used across engines by either using available transitions or dropping
caps. We'd need a way to indicate that a particular engine won't be
able to handle the resulting layout if a certain capability was dropped.
Post by Rob Clark
GPU: FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video: FOO/tiled(dev-caps=FOO/vid-cache)
Camera: FOO/tiled(dev-caps=FOO/vid-cache)
... from which a FOO/tiled(FOO/cc) surface would be allocated.
The idea here is that whether a transition is required is fully visible from
1. Moving an image from the camera to the video engine for immediate
compression requires no transition.
2. Moving an image from the camera or video engine to the display requires a
transition by the video/camera device/API, which may flush the video cache.
3. Moving an image from the camera or video engine to the GPU additionally
requires a transition by the GPU, which may invalidate the GPU cache.
4. Moving an image from the GPU anywhere else requires a transition by the
GPU; in all cases, GPU caches may be flushed. When moving to the video
engine or camera, the image additionally needs to be decompressed. When
moving to the video engine (or camera? :)), a transition by the video engine
is also required, which may invalidate the video cache.
5. Moving an image from the display to the video engine requires a
decompression -- oops! :)
I guess it should be possible for devices to provide transitions in
both directions, which would deal with this..
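The five cases above can be derived mechanically from the layout-caps and dev-caps of the two engines involved. The sketch below is one assumed reading of that rule (the source engine flushes and decompresses on export, the destination engine invalidates on import); the table and the `handoff` helper are illustrative only.

```python
# Per-engine capabilities from the example above.
DEVICES = {
    "gpu":     {"layout": {"FOO/cc"}, "dev": {"FOO/gpu-cache"}},
    "display": {"layout": {"FOO/cc"}, "dev": set()},
    "video":   {"layout": set(),      "dev": {"FOO/vid-cache"}},
    "camera":  {"layout": set(),      "dev": {"FOO/vid-cache"}},
}


def handoff(src, dst):
    """List the (engine, action, capability) steps needed to move an image
    from `src` to `dst`."""
    s, d = DEVICES[src], DEVICES[dst]
    steps = []
    # Export: flush device caches the destination does not share.
    for cap in sorted(s["dev"] - d["dev"]):
        steps.append((src, "flush", cap))
    # Export: remove layout caps the destination cannot interpret.
    for cap in sorted(s["layout"] - d["layout"]):
        steps.append((src, "decompress", cap))
    # Import: invalidate destination caches the source does not share.
    for cap in sorted(d["dev"] - s["dev"]):
        steps.append((dst, "invalidate", cap))
    return steps


for src, dst in [("camera", "video"), ("video", "display"), ("video", "gpu"),
                 ("gpu", "display"), ("gpu", "video"), ("display", "video")]:
    print(f"{src} -> {dst}: {handoff(src, dst)}")
```

For display -> video the rule assigns the decompress step to the display engine, which is the asymmetry point 5 flags: the engine asked to export is not one that can decompress.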
Ignoring that last point for now, I don't think you actually need a
"query_transition" function in libdevicealloc with this approach, for the
most part.
with the idea of being able to provide optional transitions, I'm
leaning towards just having one of the outputs of merging the caps sets
being the sets of transitions required to pass a buffer between
different devices..
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.

I think James's proposal for usage transitions was intended to work
with flows like:

1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display

The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.

Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.

I'll look into extending the current merging logic to also take into
account transitions.
Post by Rob Clark
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?

Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
Post by Rob Clark
/me shrugs
Instead, each API needs to provide import and export transition/barrier
functions which receive the previous/next layout- and capability-set.
Basically, to import a frame from the camera to OpenGL/Vulkan in the above
struct layout_capability cc_cap = { FOO, FOO_CC };
struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };
/* Camera export: describe the next user (GPU) by its layout and device caps. */
cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);
struct device_capability vid_cache = { FOO, FOO_VID_CACHE };
/* GL import: describe the previous user (camera) by its device caps. */
glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);
By looking at the capabilities for the other device, each API's driver can
derive the required transition steps.
There are probably more gaps, but these are the two I can think of right
now, and both related to the initialization status of meta surfaces, i.e.
1. Point 5 above about moving away from the display engine in the example.
This is an ugly asymmetry in the rule that each engine performs its required
import and export transitions.
2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
- reflecting a fully decompressed surface (if the surface was previously
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then handed
off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't with
the scheme above.
hmm, so I suppose this is also true in the cache case.. you want to
know if the buffer was written by someone else since you saw it last..
Something to think about :)
Also, not really a gap, but something to keep in mind: for multi-GPU
systems, the cache-capability needs to carry the device number or PCI bus id
or something, at least as long as those caches are not coherent between
GPUs.
yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
That just seems an implementation detail of the representation the
particular vendor chooses for the CACHE capability, right?

Thanks,
Miguel.
Post by Rob Clark
BR,
-R
--
Miguel
James Jones
2017-12-06 07:07:49 UTC
Post by Miguel Angel Vico
On Fri, 1 Dec 2017 13:38:41 -0500
Post by Rob Clark
Post by Rob Clark
Post by Nicolai Hähnle
Hi,
I've had a chance to look a bit more closely at the allocator prototype
repository now. There's a whole bunch of low-level API design feedback, but
for now let's focus on the high-level stuff first.
Thanks for taking a look.
Post by Miguel Angel Vico
Post by Rob Clark
Post by Rob Clark
Post by Nicolai Hähnle
Going by the 4.5 major object types (as also seen on slide 5 of your
presentation [0]), assertions and usages make sense to me.
Capabilities and capability sets should be cleaned up in my opinion, as the
status quo is overly obfuscating things. What capability sets really
represent, as far as I understand them, is *memory layouts*, and so that's
what they should be called.
This conceptually simplifies `derive_capabilities` significantly without any
loss of expressiveness as far as I can see. Given two lists of memory
layouts, we simply look for which memory layouts appear in both lists, and
then merge their constraints and capabilities.
Merging constraints looks good to me.
Capabilities need some more thought. The prototype removes capabilities when
merging layouts, but I'd argue that that is often undesirable. (In fact, I
cannot think of capabilities which we'd always want to remove.)
A typical example for this is compression (i.e. DCC in our case). For
Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
Merging these in the prototype would remove the DCC capability, even though
it might well make sense to keep it there for rendering. Dealing with the
fact that display usage does not have this capability is precisely one of
the two things that transitions are about! The other thing that transitions
are about is caches.
I think this is kind of what Rob was saying in one of his mails.
Perhaps "layout" is a better name than "caps".. either way I think of
both AMD/tiled and AMD/DCC as the same type of "thing".. the
difference between AMD/tiled and AMD/DCC is that a transition can be
provided for AMD/DCC. Other than that they are both things describing
the layout.
The reason that a transition can be provided is that they aren't quite the
same thing, though. In a very real sense, AMD/DCC is a "child" property of
AMD/tiled: DCC is implemented as a meta surface whose memory layout depends
on the layout of the main surface.
I suppose this is six-of-one, half-dozen of the other..
what you are calling a layout is what I'm calling a cap that just
happens not to have an associated transition
Although, if there are GPUs that can do an in-place "transition" between
different tiling layouts, then the distinction is perhaps really not as
clear-cut. I guess that would only apply to tiled renderers.
I suppose the advantage of just calling both layout and caps the same
thing, and just saying that a "cap" (or "layout" if you prefer that
name) can optionally have one or more associated transitions, is that
you can deal with cases where sometimes a tiled format might actually
have an in-place transition ;-)
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
1: caps(FOO/tiled); constraints(alignment=64k)
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
1: caps(FOO/tiled); constraints(alignment=64k)
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an equivalent
mechanism to a cache) should be capabilities at all in one of the open
issues on my XDC 2017 slides, because of this very over-pruning problem.
It's on slide 15, as "No device-local capabilities". You'll have to
listen to my coverage of that slide in the recorded presentation for it
to make any sense, but it's the same thing Nicolai has laid out here.

As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities:
The GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines. That does mean the GPU driver
currently needs to be the one to realize the allocation from the
capability set to get optimal behavior. We could fix that by reworking
our driver though. At this point, not including device-local properties
like on-device caching in capabilities seems like the right solution to
me. I'm curious whether this applies universally though, or if other
hardware doesn't fit the "compression and stuff all behaves like a
cache" idiom.
Post by Miguel Angel Vico
Post by Rob Clark
So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
we'd still want to have a transition between the GPU and display precisely
to flush caches.
Post by Rob Clark
Post by Nicolai Hähnle
1. If we query for multiple usages on the same device, can we get a
capability which can only be used for a subset of those usages?
I think the original idea was, "no".. perhaps that restriction could be
lifted if transitions were part of the result. Or maybe you
just query independently the same device for multiple different
usages, and then merge that cap-set.
(Do we need to care about intra-device transitions? Or can we just
let the driver care about that, same as it always has?)
Post by Nicolai Hähnle
2. What happens when we merge memory layouts with sets of capabilities where
neither is a subset of the other?
I think this is a case where no zero-copy sharing is possible, right?
Not necessarily. Let's say we have some industry-standard tiling layout foo,
and vendors support their own proprietary framebuffer compression on top of
that.
Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)
It should be possible to allocate a surface as
caps(BASE/foo, VND1/compressed)
and just transition the cap(VND1/compressed) away after rendering before
accessing it with device 2.
In this case presumably VND1 defines transitions VND1/cc <-> null, and
VND2 does the same for VND2/cc ?
The interesting question is whether it would be possible or ever useful to
have a surface allocated as caps(BASE/foo, VND1/compressed,
VND2/compressed).
not sure if it is useful or not, but I think the idea of defining cap
transitions and returning the associated transitions when merging a
caps set could express this..
Yeah. I guess if we define per-cap transitions, the allocator should be
able to find a transition chain to express layouts like the above when
merging.
Post by Rob Clark
My guess is: there will be cases where it's possible, but there won't be
cases where it's useful (because you tend to render on device 1 and just
sample or scanout on device 2).
So it makes sense to say that derive_capabilities should just provide both
layouts in this case.
Post by Rob Clark
Post by Nicolai Hähnle
As for the actual transition API, I accept that some metadata may be
required, and the metadata probably needs to depend on the memory layout,
which is often vendor-specific. But even linear layouts need some
transitions for caches. We probably need at least some generic "off-device
usage" bit.
I've started thinking of cached as a capability with a transition.. I
think that helps. Maybe it needs to somehow be more specific (ie. if
you have two devices both with their own cache with no coherency
between the two)
As I wrote above, I'd prefer not to think of "cached" as a capability at
least for radeonsi.
From the desktop perspective, I would say let's ignore caches, the drivers
know which caches they need to flush to make data visible to other devices
on the system.
On the other hand, there are probably SoC cases where non-coherent caches
are shared between some but not all devices, and in that case perhaps we do
need to communicate this.
So perhaps we should have two kinds of "capabilities".
The first, like framebuffer compression, is a capability of the allocated
memory layout (because the compression requires a meta surface), and devices
that expose it may opportunistically use it.
The second, like caches, is a capability that the device/driver will use and
you don't get a say in it, but other devices/drivers also don't need to be
aware of them.
yeah, a required cap.. we had tried to avoid this, since unlike
constraints which are well defined, the core constraint/capability
merging wouldn't know what to do about merging parameterized caps.
But I guess if transitions are provided then it doesn't have to.
We are going to need required caps either way, right? The core
capability merging logic would try to find a compatible layout to be
used across engines by either using available transitions or dropping
caps. We'd need a way to indicate that a particular engine won't be
able to handle the resulting layout if a certain capability was dropped.
Post by Rob Clark
GPU: FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video: FOO/tiled(dev-caps=FOO/vid-cache)
Camera: FOO/tiled(dev-caps=FOO/vid-cache)
... from which a FOO/tiled(FOO/cc) surface would be allocated.
The idea here is that whether a transition is required is fully visible from
1. Moving an image from the camera to the video engine for immediate
compression requires no transition.
2. Moving an image from the camera or video engine to the display requires a
transition by the video/camera device/API, which may flush the video cache.
3. Moving an image from the camera or video engine to the GPU additionally
requires a transition by the GPU, which may invalidate the GPU cache.
4. Moving an image from the GPU anywhere else requires a transition by the
GPU; in all cases, GPU caches may be flushed. When moving to the video
engine or camera, the image additionally needs to be decompressed. When
moving to the video engine (or camera? :)), a transition by the video engine
is also required, which may invalidate the video cache.
5. Moving an image from the display to the video engine requires a
decompression -- oops! :)
I guess it should be possible for devices to provide transitions in
both directions, which would deal with this..
Ignoring that last point for now, I don't think you actually need a
"query_transition" function in libdevicealloc with this approach, for the
most part.
with the idea of being able to provide optional transitions, I'm
leaning towards just having one of the outputs of merging the caps sets
being the sets of transitions required to pass a buffer between
different devices..
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
with flows like:
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up
front when initially querying caps in the model I assumed. The app then
specifies some subset (up to the full set) of the specified usages as a
src and dst when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it
would be required that the driver support transitioning to any single
usage given the capabilities returned. However, transitioning to
multiple usages (e.g., to simultaneously rendering and scanning out)
could fail to produce a valid transition, in which case the app would
have to fall back to a copy or avoid that simultaneous usage
combination in some other way.
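A small sketch of that model, purely for illustration: usages are declared up front, a transition query names src and dst as subsets of the declared usages, a single-usage destination always succeeds, and a multi-usage destination may fail and push the app to a copy. `Surface`, `query_transition`, `TransitionUnavailable`, and `hand_off` are hypothetical names, not the prototype's API.

```python
class TransitionUnavailable(Exception):
    """No valid transition exists for the requested simultaneous usages."""


class Surface:
    def __init__(self, declared_usages, simultaneous_ok=()):
        # All usages the app will ever transition to, given at cap-query time.
        self.declared = frozenset(declared_usages)
        # Multi-usage states this (hypothetical) driver can actually reach.
        self.simultaneous_ok = {frozenset(s) for s in simultaneous_ok}

    def query_transition(self, src, dst):
        src, dst = frozenset(src), frozenset(dst)
        if not src or not dst:
            raise ValueError("src and dst must each name at least one usage")
        if not (src <= self.declared and dst <= self.declared):
            raise ValueError("usage was not declared when caps were queried")
        # Any single declared usage must be reachable; combinations may not be.
        if len(dst) > 1 and dst not in self.simultaneous_ok:
            raise TransitionUnavailable(sorted(dst))
        return {"src": sorted(src), "dst": sorted(dst)}  # stand-in for metadata


def hand_off(surface, src, dst):
    """Use a transition when one exists, otherwise fall back to a copy."""
    try:
        return ("transition", surface.query_transition(src, dst))
    except TransitionUnavailable:
        return ("copy", None)


surface = Surface({"RENDER", "SCANOUT"})
print(hand_off(surface, {"RENDER"}, {"SCANOUT"}))
print(hand_off(surface, {"RENDER"}, {"RENDER", "SCANOUT"}))
```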
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree
Rob's example outcomes above are ideal, but it's not clear to me how to
code up such an algorithm. This also all seems unnecessary if "device
local" capabilities aren't needed, as posited above.
Post by Miguel Angel Vico
Post by Rob Clark
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently.
The number of possible transitions grows combinatorially for every usage
requested I believe. I expect there will be cases where ~10 usages are
specified, so generating all possible transitions all the time may be
excessive, when the app will probably generally only care about 2 or 3
states, and in practice, there will probably only actually be 2 or 3
different underlying possible combinations of operations.
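To put rough numbers on that growth, here is a back-of-the-envelope count. It assumes a buffer state is any non-empty subset of the requested usages and that a transition is an ordered (src, dst) pair of distinct states; that counting rule is an assumption made for illustration, not something the prototype defines.

```python
def all_transitions(n_usages):
    """Ordered (src, dst) pairs over every non-empty subset of the usages."""
    states = 2 ** n_usages - 1      # non-empty subsets of the usages
    return states * (states - 1)    # ordered pairs with src != dst


def transitions_between(n_states):
    """Ordered pairs among only the states the app actually uses."""
    return n_states * (n_states - 1)


print(all_transitions(10))      # every subset pair for 10 usages
print(transitions_between(3))   # an app that only cares about 3 states
```

Under that rule, 10 usages give 1023 states and 1,045,506 ordered pairs, while an app that only ever moves between 3 states needs 6.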
Post by Miguel Angel Vico
Post by Rob Clark
/me shrugs
Instead, each API needs to provide import and export transition/barrier
functions which receive the previous/next layout- and capability-set.
Basically, to import a frame from the camera to OpenGL/Vulkan in the above
struct layout_capability cc_cap = { FOO, FOO_CC };
struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };
/* Camera export: describe the next user (GPU) by its layout and device caps. */
cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);
struct device_capability vid_cache = { FOO, FOO_VID_CACHE };
/* GL import: describe the previous user (camera) by its device caps. */
glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);
By looking at the capabilities for the other device, each API's driver can
derive the required transition steps.
There are probably more gaps, but these are the two I can think of right
now, and both related to the initialization status of meta surfaces, i.e.
1. Point 5 above about moving away from the display engine in the example.
This is an ugly asymmetry in the rule that each engine performs its required
import and export transitions.
2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
- reflecting a fully decompressed surface (if the surface was previously
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then handed
off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't with
the scheme above.
hmm, so I suppose this is also true in the cache case.. you want to
know if the buffer was written by someone else since you saw it last..
Something to think about :)
Also, not really a gap, but something to keep in mind: for multi-GPU
systems, the cache-capability needs to carry the device number or PCI bus id
or something, at least as long as those caches are not coherent between
GPUs.
yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
That just seems an implementation detail of the representation the
particular vendor chooses for the CACHE capability, right?
Agreed.

One final note: When I initially wrote up the capability merging logic,
I treated "layout" as a sort of "special" capability, basically like
Nicolai originally outlined above. Miguel suggested I add the
"required" bit instead to generalize things, and it ended up working out
much cleaner. Besides the layout, there is at least one other
candidate for a "required" capability that became obvious as soon as I
started coding up the prototype driver: memory location. It might seem
like memory location is a simple device-agnostic constraint rather than
a capability, but it's actually too complicated (we need more memory
locations than "device" and "host"). It has to be vendor specific, and
hence fits in better as a capability.
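As a sketch of how a "required" bit generalizes the special-cased layout: merging keeps only capabilities both sides share, and fails if that would drop a capability either side marked required. The `Capability` type, `merge_caps`, and the capability names (including the memory-location ones) are made up for illustration and are not the prototype's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    vendor: str
    name: str
    required: bool = False


def merge_caps(a, b):
    """Intersect two capability sets. Returns None when a required
    capability on either side would have to be dropped."""
    a_by = {(c.vendor, c.name): c for c in a}
    b_by = {(c.vendor, c.name): c for c in b}
    common = a_by.keys() & b_by.keys()
    for caps in (a_by, b_by):
        for key, cap in caps.items():
            if cap.required and key not in common:
                return None  # an engine cannot cope without this capability
    return {
        Capability(v, n, a_by[(v, n)].required or b_by[(v, n)].required)
        for (v, n) in common
    }


gpu = {
    Capability("BASE", "layout/foo", required=True),
    Capability("VND", "mem/vidmem", required=True),
    Capability("VND", "compressed"),  # optional: may be dropped
}
display = {
    Capability("BASE", "layout/foo", required=True),
    Capability("VND", "mem/vidmem", required=True),
}
display_sysmem = {
    Capability("BASE", "layout/foo", required=True),
    Capability("VND", "mem/sysmem", required=True),
}

merged = merge_caps(gpu, display)
conflict = merge_caps(gpu, display_sysmem)
print(merged)
print(conflict)
```

Here layout and memory location go through the same code path as any other capability; only the `required` flag distinguishes them from something droppable like compression.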

I think if possible, we should try to keep the design generalized to as
few types of objects and special cases as possible. The more we can
generalize the solutions to our existing problem set, the better the
mechanism should hold up as we apply it to new and unknown problems as
they arise.

Thanks,
-James
Post by Miguel Angel Vico
Thanks,
Miguel.
Post by Rob Clark
BR,
-R
Nicolai Hähnle
2017-12-06 11:25:34 UTC
On 06.12.2017 08:07, James Jones wrote:
[snip]
Post by James Jones
Post by Miguel Angel Vico
Post by Nicolai Hähnle
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
    3: caps(FOO/tiled); constraints(alignment=32k)
    1: caps(FOO/tiled); constraints(alignment=64k)
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
       transition(GPU->display: trans_a, trans_b; display->GPU: none)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
       transition(GPU->display: trans_a; display->GPU: none)
    3: caps(FOO/tiled); constraints(alignment=64k);
       transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance.
Let's build fewer footguns :)
sure, this was just a hypothetical example.  But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
   1: caps(FOO/tiled); constraints(alignment=64k)
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_a, trans_b; display->GPU: none)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an equivalent
mechanism to a cache) in one of the open issues on my XDC 2017 slides
because of this very problem of over-pruning it causes.  It's on slide
15, as "No device-local capabilities".  You'll have to listen to my
coverage of it in the recorded presentation for that slide to make any
sense, but it's the same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities:
The GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines.  That does mean the GPU driver
currently needs to be the one to realize the allocation from the
capability set to get optimal behavior.  We could fix that by reworking
our driver though.  At this point, not including device-local properties
like on-device caching in capabilities seems like the right solution to
me.  I'm curious whether this applies universally though, or if other
hardware doesn't fit the "compression and stuff all behaves like a
cache" idiom.
Compression is a part of the memory layout for us: framebuffer
compression uses an additional "meta surface". At the most basic level,
an allocation with loss-less compression support is by necessity bigger
than an allocation without.

We can allocate this meta surface separately, but then we're forced to
decompress when passing the surface around (e.g. to a compositor.)

Consider also the example I gave elsewhere, where a cross-vendor tiling
layout is combined with vendor-specific compression:

Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)

Some more thoughts on caching or "device-local" properties below.


[snip]
Post by James Jones
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
   1. App gets GPU caps for RENDER usage
   2. App allocates GPU memory using a layout from (1)
   3. App now decides it wants to use the buffer for SCANOUT
   4. App queries usage transition metadata from RENDER to SCANOUT,
      given the current memory layout.
   5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up
front when initially querying caps in the model I assumed.  The app then
specifies some subset (up to the full set) of the specified usages as a
src and dst when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it
would be required that the driver support transitioning to any single
usage given the capabilities returned.  However, transitioning to
multiple usages (E.g., to simultaneously rendering and scanning out)
could fail to produce a valid transition, in which case the app would
have to fall back to a copy in that case, or avoid that simultaneous
usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work.  I agree
Rob's example outcomes above are ideal, but it's not clear to me how to
code up such an algorithm.  This also all seems unnecessary if "device
local" capabilities aren't needed, as posited above.
Post by Miguel Angel Vico
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently.
The number of possible transitions grows combinatorially for every usage
requested I believe.  I expect there will be cases where ~10 usages are
specified, so generating all possible transitions all the time may be
excessive, when the app will probably generally only care about 2 or 3
states, and in practice, there will probably only actually be 2 or 3
different underlying possible combinations of operations.
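As a rough illustration of that growth, the sketch below counts transitions under one possible reading: a "state" is any non-empty subset of the requested usages (since src and dst are each a subset of them), and a transition is an ordered pair of two different states. That counting model is an assumption made for illustration, not something specified by the proposal.

```python
def count_transitions(num_usages: int) -> int:
    """Count ordered (src, dst) pairs of distinct non-empty usage subsets."""
    states = 2 ** num_usages - 1        # non-empty subsets of the usages
    return states * (states - 1)        # ordered pairs with src != dst


for n in (2, 3, 10):
    print(n, count_transitions(n))
```

Under this model 2 usages give 6 transitions, 3 give 42, and 10 give 1045506.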
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?

I'm looking for something that would also eliminate another part of the
design that makes me uncomfortable: the metadata for transitions. This
makes me uncomfortable for a number of reasons. Who computes the
metadata? How is the metadata represented? With cross-device
usages (which is the whole point of the exercise), this quickly becomes
infeasible.

So instead as a thought experiment, let's just use what we already have:
capabilities and constraints (or properties/attributes).

I kind of already outlined this with the long example in my email here
https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html

Let me try to summarize the transition algorithm. Its inputs are:
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the
surface was allocated

Steps of the algorithm:

1. Compute the merged capability set for the new usages (the destination
capability set).
2. Compute the transition capability set, which is the merger of the
source and destination sets.
3. Determine whether a "release" transition is required on the source
device(s):
3a. For global properties, a transition is required if the source
capability set is a superset of the transition set.
3b. For device-local properties, a transition is required if there is
some destination device for which the device-local properties are a
subset of the source set.
4. Determine whether an "acquire" transition is required on the
destination device(s) in a similar way.

Finally, execute the transitions using corresponding APIs, where the
APIs simply receive the computed capability sets.

For example, release transitions would receive the source capability set
(and perhaps the source usages), the transition capability set, and the
set difference of device-local capabilities, and nothing else.

The point is that all steps of the algorithm can be implemented in a
device-agnostic way in libdevicealloc, without calling into any
device/driver callbacks.
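The steps above could be sketched roughly as follows in Python. Several things here are assumptions rather than part of the proposal: merging is modelled as intersecting global properties, "superset" and "subset" are read as strict, step 4 is a mirror image of step 3, and the names `CapSet`, `merge` and `plan_transition` are hypothetical.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class CapSet:
    global_props: frozenset
    # device name -> frozenset of device-local properties
    local_props: dict = field(default_factory=dict)


def merge(a, b):
    """Assumed merge rule: intersect global props, keep per-device local props."""
    local = dict(a.local_props)
    for dev, props in b.local_props.items():
        local[dev] = local[dev] & props if dev in local else props
    return CapSet(a.global_props & b.global_props, local)


def _all_local(cs):
    return frozenset().union(*cs.local_props.values())


def _needs_transition(mine, other, transition):
    # 3a: global properties that do not survive into the transition set
    global_change = mine.global_props > transition.global_props
    # 3b: some device on the other side lacks local properties that "mine" has
    local_change = any(p < _all_local(mine) for p in other.local_props.values())
    return global_change or local_change


def plan_transition(source, new_usages, usage_caps):
    dest = reduce(merge, (usage_caps[u] for u in new_usages))   # step 1
    transition = merge(source, dest)                            # step 2
    release = _needs_transition(source, dest, transition)       # step 3
    acquire = _needs_transition(dest, source, transition)       # step 4
    return dest, transition, release, acquire


usage_caps = {
    "render": CapSet(frozenset({"FOO/tiled", "FOO/CC"}),
                     {"gpu": frozenset({"FOO/cached"})}),
    "scanout": CapSet(frozenset({"FOO/tiled"}), {"display": frozenset()}),
}
dest, transition, release, acquire = plan_transition(
    usage_caps["render"], ["scanout"], usage_caps)
```

In this example the render -> scanout hand-off needs a release on the GPU (compression and cache are not usable by the display) and no acquire on the display. Nothing in the sketch calls into a driver.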

I'm pretty sure this or something like it can be made to work. We need
to think through a lot of example cases, but at least we'll have thought
them through, which is better than relying on some opaque metadata thing
and then finding out later that there are some new cross-device cases
where things don't work out because the piece of (presumably
device-specific driver) code that computes the metadata isn't aware of them.


[snip]
Post by James Jones
One final note:  When I initially wrote up the capability merging logic,
I treated "layout" as a sort of "special" capability, basically like
Nicolai originally outlined above.  Miguel suggested I add the
"required" bit instead to generalize things, and it ended up working out
much cleaner.  Besides the layout, there is at least one other obvious
candidate for a "required" capability that became obvious as soon as I
started coding up the prototype driver: memory location.  It might seem
like memory location is a simple device-agnostic constraint rather than
a capability, but it's actually too complicated (we need more memory
locations than "device" and "host").  It has to be vendor specific, and
hence fits in better as a capability.
Could you give more concrete examples of what you'd like to see, and why
having this as constraints is insufficient?
Post by James Jones
I think if possible, we should try to keep the design generalized to as
few types of objects and special cases as possible.  The more we can
generalize the solutions to our existing problem set, the better the
mechanism should hold up as we apply it to new and unknown problems as
they arise.
I'm coming around to the fact that those things should perhaps live in a
single list/array, but I still don't like the term "capability".

I admit it's a bit of bike-shedding, but I'm starting to think it would
be better to go with the generic term "property" or "attribute", and
then add flags/adjectives to that based on how merging should work.

This would include the constraints as well -- it seems arbitrary to me
that those would be singled out into their own list.

Basically, the underlying principle is that a good API would have either
one list that includes all the properties, or one list per
merging-behavior. And I think one single list is easier on the API
consumer and easier to extend.

Cheers,
Nicolai
--
Learn how the world really is,
but never forget how it should be.
Rob Clark
2017-12-06 13:36:25 UTC
Post by Rob Clark
[snip]
Post by James Jones
Post by Miguel Angel Vico
Post by Rob Clark
Post by Nicolai Hähnle
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
3: caps(FOO/tiled); constraints(alignment=32k)
1: caps(FOO/tiled); constraints(alignment=64k)
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
transition(GPU->display: trans_a; display->GPU: none)
3: caps(FOO/tiled); constraints(alignment=64k);
transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
1: caps(FOO/tiled); constraints(alignment=64k)
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an equivalent
mechanism to a cache) in one of the open issues on my XDC 2017 slides
because of this very problem of over-pruning it causes. It's on slide 15,
as "No device-local capabilities". You'll have to listen to my coverage of
it in the recorded presentation for that slide to make any sense, but it's
the same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities: The
GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines. That does mean the GPU driver currently
needs to be the one to realize the allocation from the capability set to get
optimal behavior. We could fix that by reworking our driver though. At
this point, not including device-local properties like on-device caching in
capabilities seems like the right solution to me. I'm curious whether this
applies universally though, or if other hardware doesn't fit the
"compression and stuff all behaves like a cache" idiom.
Compression is a part of the memory layout for us: framebuffer compression
uses an additional "meta surface". At the most basic level, an allocation
with loss-less compression support is by necessity bigger than an allocation
without.
We can allocate this meta surface separately, but then we're forced to
decompress when passing the surface around (e.g. to a compositor.)
side note: I think this is pretty typical.. although afaict for
adreno at least, when you start getting into sampling from things with
multiple layers/levels, the meta surface needs to be interleaved with
the "main" surface, so it can't really be allocated after the fact.

Also for depth buffer, there is potentially an additional meta buffer
for lrz. (Although cross-device depth buffer sharing seems a bit...
strange.)
Post by Rob Clark
Consider also the example I gave elsewhere, where a cross-vendor tiling
layout is combined with vendor-specific compression:
Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)
Some more thoughts on caching or "device-local" properties below.
[snip]
Post by James Jones
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up front
when initially querying caps in the model I assumed. The app then specifies
some subset (up to the full set) of the specified usages as a src and dst
when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it would
be required that the driver support transitioning to any single usage given
the capabilities returned. However, transitioning to multiple usages (E.g.,
to simultaneously rendering and scanning out) could fail to produce a valid
transition, in which case the app would have to fall back to a copy in that
case, or avoid that simultaneous usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree Rob's
example outcomes above are ideal, but it's not clear to me how to code up
such an algorithm. This also all seems unnecessary if "device local"
capabilities aren't needed, as posited above.
Post by Miguel Angel Vico
Post by Rob Clark
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently. The
number of possible transitions grows combinatorially for every usage
requested I believe. I expect there will be cases where ~10 usages are
specified, so generating all possible transitions all the time may be
excessive, when the app will probably generally only care about 2 or 3
states, and in practice, there will probably only actually be 2 or 3
different underlying possible combinations of operations.
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
I'm looking for something that would also eliminate another part of the
design that makes me uncomfortable: the metadata for transitions. This makes
me uncomfortable for a number of reasons. Who computes the metadata? How is
the metadata represented? With cross-device usages (which is the
whole point of the exercise), this quickly becomes infeasible.
So instead as a thought experiment, let's just use what we already have:
capabilities and constraints (or properties/attributes).
I kind of already outlined this with the long example in my email here
https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
Let me try to summarize the transition algorithm. Its inputs are:
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the
surface was allocated
Steps of the algorithm:
1. Compute the merged capability set for the new usages (the destination
capability set).
2. Compute the transition capability set, which is the merger of the source
and destination sets.
3. Determine whether a "release" transition is required on the source
device(s):
3a. For global properties, a transition is required if the source capability
set is a superset of the transition set.
3b. For device-local properties, a transition is required if there is some
destination device for which the device-local properties are a subset of the
source set.
4. Determine whether an "acquire" transition is required on the destination
device(s) in a similar way.
Finally, execute the transitions using corresponding APIs, where the APIs
simply receive the computed capability sets.
For example, release transitions would receive the source capability set
(and perhaps the source usages), the transition capability set, and the set
difference of device-local capabilities, and nothing else.
The point is that all steps of the algorithm can be implemented in a
device-agnostic way in libdevicealloc, without calling into any
device/driver callbacks.
I'm pretty sure this or something like it can be made to work. We need to
think through a lot of example cases, but at least we'll have thought them
through, which is better than relying on some opaque metadata thing and then
finding out later that there are some new cross-device cases where things
don't work out because the piece of (presumably device-specific driver) code
that computes the metadata isn't aware of them.
[snip]
Post by James Jones
One final note: When I initially wrote up the capability merging logic, I
treated "layout" as a sort of "special" capability, basically like Nicolai
originally outlined above. Miguel suggested I add the "required" bit
instead to generalize things, and it ended up working out much cleaner.
Besides the layout, there is at least one other obvious candidate for a
"required" capability that became obvious as soon as I started coding up the
prototype driver: memory location. It might seem like memory location is a
simple device-agnostic constraint rather than a capability, but it's
actually too complicated (we need more memory locations than "device" and
"host"). It has to be vendor specific, and hence fits in better as a
capability.
Could you give more concrete examples of what you'd like to see, and why
having this as constraints is insufficient?
Post by James Jones
I think if possible, we should try to keep the design generalized to as
few types of objects and special cases as possible. The more we can
generalize the solutions to our existing problem set, the better the
mechanism should hold up as we apply it to new and unknown problems as they
arise.
I'm coming around to the fact that those things should perhaps live in a
single list/array, but I still don't like the term "capability".
I admit it's a bit of bike-shedding, but I'm starting to think it would be
better to go with the generic term "property" or "attribute", and then add
flags/adjectives to that based on how merging should work.
This would include the constraints as well -- it seems arbitrary to me that
those would be singled out into their own list.
I'm not picky about the names, but I think "constraint" has to be
something well specified, if we are to merge constraints. If you
don't know what "pitch alignment" is how do you merge the pitch
constraints from two different devices? This was the reason for the
separate list.
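A small Python sketch of that point: merging only works when the merger knows the semantics of each constraint. The constraint names (`pitch_alignment`, `address_alignment`, `max_pitch`) and their merge rules (least common multiple for alignments, minimum for upper limits) are assumptions for illustration, not definitions from the allocator prototype.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple: the smallest value satisfying both alignments."""
    return a * b // gcd(a, b)


# Each well-specified constraint has a known way to combine two values.
MERGE_RULES = {
    "address_alignment": lcm,
    "pitch_alignment": lcm,
    "max_pitch": min,
}


def merge_constraints(a: dict, b: dict) -> dict:
    merged = {}
    for name in a.keys() | b.keys():
        if name not in MERGE_RULES:
            # An unknown constraint cannot be merged safely.
            raise KeyError(f"no merge rule for constraint {name!r}")
        if name in a and name in b:
            merged[name] = MERGE_RULES[name](a[name], b[name])
        else:
            merged[name] = a[name] if name in a else b[name]
    return merged


gpu = {"pitch_alignment": 64, "address_alignment": 32 * 1024}
display = {"pitch_alignment": 256, "address_alignment": 64 * 1024,
           "max_pitch": 16384}
merged = merge_constraints(gpu, display)
```

A vendor-specific entry with no rule in the table raises an error here, whereas an opaque capability can still be compared for equality without knowing what it means.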

BR,
-R
Post by Rob Clark
Basically, the underlying principle is that a good API would have either one
list that includes all the properties, or one list per merging-behavior. And
I think one single list is easier on the API consumer and easier to extend.
Cheers,
Nicolai
--
Learn how the world really is,
but never forget how it should be.
James Jones
2017-12-07 00:57:45 UTC
Post by Rob Clark
[snip]
Post by James Jones
Post by Miguel Angel Vico
Post by Nicolai Hähnle
Post by Rob Clark
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
    3: caps(FOO/tiled); constraints(alignment=32k)
    1: caps(FOO/tiled); constraints(alignment=64k)
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
       transition(GPU->display: trans_a, trans_b; display->GPU: none)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
       transition(GPU->display: trans_a; display->GPU: none)
    3: caps(FOO/tiled); constraints(alignment=64k);
       transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance.
Let's build fewer footguns :)
sure, this was just a hypothetical example.  But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
   1: caps(FOO/tiled); constraints(alignment=64k)
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_a, trans_b; display->GPU: none)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an
equivalent mechanism to a cache) in one of the open issues on my XDC
2017 slides because of this very problem of over-pruning it causes.
It's on slide 15, as "No device-local capabilities".  You'll have to
listen to my coverage of it in the recorded presentation for that
slide to make any sense, but it's the same thing Nicolai has laid out
here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities:
The GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines.  That does mean the GPU driver
currently needs to be the one to realize the allocation from the
capability set to get optimal behavior.  We could fix that by
reworking our driver though.  At this point, not including
device-local properties like on-device caching in capabilities seems
like the right solution to me.  I'm curious whether this applies
universally though, or if other hardware doesn't fit the "compression
and stuff all behaves like a cache" idiom.
Compression is a part of the memory layout for us: framebuffer
compression uses an additional "meta surface". At the most basic level,
an allocation with loss-less compression support is by necessity bigger
than an allocation without.
We can allocate this meta surface separately, but then we're forced to
decompress when passing the surface around (e.g. to a compositor.)
Consider also the example I gave elsewhere, where a cross-vendor tiling
layout is combined with vendor-specific compression:
Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)
Some more thoughts on caching or "device-local" properties below.
Compression requires extra resources for us as well. That's probably
universal. I think the distinction between the two approaches is
whether the allocating driver deduces that compression can be used with
a given capability set and hence adds the resources implicitly, or
whether the capability set indicates it explicitly. My theory is that
the implicit path is possible, but it has downsides. The explicit path
is attractive due to its exact nature, as I alluded to in my talk: You
can tell the exact properties of an allocation given the capability set
used to allocate it. If that can be made to work, I prefer that path as
well. Agreed that your path also works better for the
multi-vendor+device example.
Post by Rob Clark
[snip]
Post by James Jones
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
   1. App gets GPU caps for RENDER usage
   2. App allocates GPU memory using a layout from (1)
   3. App now decides it wants to use the buffer for SCANOUT
   4. App queries usage transition metadata from RENDER to SCANOUT,
      given the current memory layout.
   5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up
front when initially querying caps in the model I assumed.  The app
then specifies some subset (up to the full set) of the specified
usages as a src and dst when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it
would be required that the driver support transitioning to any single
usage given the capabilities returned.  However, transitioning to
multiple usages (E.g., to simultaneously rendering and scanning out)
could fail to produce a valid transition, in which case the app would
have to fall back to a copy in that case, or avoid that simultaneous
usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work.  I agree
Rob's example outcomes above are ideal, but it's not clear to me how
to code up such an algorithm.  This also all seems unnecessary if
"device local" capabilities aren't needed, as posited above.
Post by Miguel Angel Vico
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently.
The number of possible transitions grows combinatorially for every
usage requested I believe.  I expect there will be cases where ~10
usages are specified, so generating all possible transitions all the
time may be excessive, when the app will probably generally only care
about 2 or 3 states, and in practice, there will probably only
actually be 2 or 3 different underlying possible combinations of
operations.
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
I'm looking for something that would also eliminate another part of the
design that makes me uncomfortable: the metadata for transitions. This
makes me uncomfortable for a number of reasons. Who computes the
metadata? How is the metadata represented? With cross-device
usages (which is the whole point of the exercise), this quickly becomes
infeasible.
So instead as a thought experiment, let's just use what we already have:
capabilities and constraints (or properties/attributes).
I kind of already outlined this with the long example in my email here
https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
Let me try to summarize the transition algorithm. Its inputs are:
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the
surface was allocated
Steps of the algorithm:
1. Compute the merged capability set for the new usages (the destination
capability set).
2. Compute the transition capability set, which is the merger of the
source and destination sets.
3. Determine whether a "release" transition is required on the source
device(s):
3a. For global properties, a transition is required if the source
capability set is a superset of the transition set.
3b. For device-local properties, a transition is required if there is
some destination device for which the device-local properties are a
subset of the source set.
4. Determine whether an "acquire" transition is required on the
destination device(s) in a similar way.
Finally, execute the transitions using corresponding APIs, where the
APIs simply receive the computed capability sets.
For example, release transitions would receive the source capability set
(and perhaps the source usages), the transition capability set, and the
set difference of device-local capabilities, and nothing else.
The point is that all steps of the algorithm can be implemented in a
device-agnostic way in libdevicealloc, without calling into any
device/driver callbacks.
I'm pretty sure this or something like it can be made to work. We need
to think through a lot of example cases, but at least we'll have thought
them through, which is better than relying on some opaque metadata thing
and then finding out later that there are some new cross-device cases
where things don't work out because the piece of (presumably
device-specific driver) code that computes the metadata isn't aware of them.
This sounds pretty good. I'd like to see more detailed pseudo-code of a
full cycle (cap query, allocation, transition to and from a few usages),
but it seems pretty solid. I very much like that it enables the
explicit capability sets, but I'm mildly worried it might add API
complexity overall rather than reduce it.

I think in the end our two proposals are very similar: Yours just moves
the conversion from high-level properties -> device commands to the
driver applying the transition. That's fine in theory, though it shifts
some minor overhead to the time of the transition. We could design the
APIs such that it's possible to cache/pre-bake the device commands for a
given transition though to alleviate that if it proves meaningful.

To make it clearer what the "metadata" is in my version and hence
perhaps make it clearer how similar the two are, a few notes:

Transitions are queried per device in my proposal. Note this means you
need to query two different sets of transition metadata for a
cross-device transition, one from the source that would be applied on
that device in the source API, and one from the destination that would
be applied on that device in the destination API. APIs/engines that
don't require transitions would return some NULL metadata indicating no
required transition on that side.

Some examples of the metadata approach:

1) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
Vulkan, same device:

-Query transition. You'd get some metadata representing very simple
cache management stuff if anything. You'd apply it using some form of
pipeline barrier on the relevant image.

2) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
Vulkan, different device:

-Query transition from each device. You'd get some metadata
representing more complex cache management, and potentially a decompress
depending on the compatibility of the two devices. The driver is the
same for both devices in this case, so it can calculate the similarities
exactly by examining the capability set and each device's properties.
You'd apply it using some form of pipeline barrier with the respective
metadata on the relevant image on each device.


3) transition from NVIDIA dev rendering -> AMD dev texturing both in Vulkan:

-Query transition from each device. NVIDIA driver would see the
destination usage is a foreign device it has no knowledge of and perform
a complete cache flush and decompress. AMD driver would see the source
usage is something it doesn't recognize and perform a full cache
invalidate (and compression surface invalidate, if any?). You'd apply
it using some form of pipeline barrier with the respective metadata on
the relevant image on each device.

4) transition from NVIDIA dev rendering -> NVIDIA encoder with cache
coherence

-Query source transition on GPU dev. Query destination transition on
video encoder dev. GPU recognizes the destination is a device it is
aware has certain properties and hence returns a decompress only since
it knows it has cache coherence. Video encoder dev returns NULL
transition. Apply source transition on source graphics API. Note this
case requires some careful coordination across a vendor's various driver
stacks to perform optimally. It would automatically degrade to the
foreign device case for naive/incomplete drivers though.
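
The four cases above can be sketched as a pair of per-device queries. This is only an illustration of the idea, not the prototype's API: `Device`, `TransitionMetadata`, both query functions, the device names and the operation strings are all hypothetical, and the destination-side result for case 2 is an assumption since it is not spelled out above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Device:
    name: str                                # hypothetical device identifier
    vendor: str                              # driver/vendor owning the device
    coherent_with: frozenset = frozenset()   # devices known to be cache coherent with this one


@dataclass(frozen=True)
class TransitionMetadata:
    ops: Tuple[str, ...]                     # opaque to the app; names are illustrative


def query_source_transition(src: Device, dst: Device) -> Optional[TransitionMetadata]:
    """Metadata to apply on the source device, in the source API."""
    if src.name == dst.name:
        return TransitionMetadata(("cache_sync",))                      # case 1
    if src.vendor != dst.vendor:
        # Foreign device: assume nothing, flush and decompress.        # case 3
        return TransitionMetadata(("full_cache_flush", "decompress"))
    if dst.name in src.coherent_with:
        return TransitionMetadata(("decompress",))                      # case 4
    return TransitionMetadata(("cache_flush", "decompress_if_incompatible"))  # case 2


def query_destination_transition(src: Device, dst: Device) -> Optional[TransitionMetadata]:
    """Metadata to apply on the destination device, in the destination API."""
    if src.name == dst.name or src.name in dst.coherent_with:
        return None                                                     # NULL metadata
    if src.vendor != dst.vendor:
        return TransitionMetadata(("full_cache_invalidate",))           # case 3
    return TransitionMetadata(("cache_invalidate",))                    # case 2 (assumed)


gpu0 = Device("nv-gpu0", "NVIDIA", frozenset({"nv-enc0"}))
gpu1 = Device("nv-gpu1", "NVIDIA")
amd0 = Device("amd-gpu0", "AMD")
enc0 = Device("nv-enc0", "NVIDIA", frozenset({"nv-gpu0"}))
```

A driver that knows nothing about coherence between its own engines would simply leave `coherent_with` empty and fall through to the more conservative branches.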
Post by Rob Clark
[snip]
Post by James Jones
One final note:  When I initially wrote up the capability merging
logic, I treated "layout" as a sort of "special" capability, basically
like Nicolai originally outlined above.  Miguel suggested I add the
"required" bit instead to generalize things, and it ended up working
out much cleaner.  Besides the layout, there is at least one other
obvious candidate for a "required" capability that became obvious as
soon as I started coding up the prototype driver: memory location.  It
might seem like memory location is a simple device-agnostic constraint
rather than a capability, but it's actually too complicated (we need
more memory locations than "device" and "host").  It has to be vendor
specific, and hence fits in better as a capability.
Could you give more concrete examples of what you'd like to see, and why
having this as constraints is insufficient?
We have more than one "device local" memory with different capabilities
on some devices. I think you guys have this situation as well with your
cards with an SSD on them or something if I'm interpreting the
marketing stuff right. I'd like to be able to express those all without
needing to code them into the device-agnostic portion of the allocator
library ahead of time. That way, if we come up with any new clever
ones, we don't need to wait for everyone to update their allocator
library to make use of them.

Additionally, with things like SLI/Crossfire, we end up with a sort of
NUMA memory architecture, where memory on a "remote" card might have
similar but not exactly the same capabilities as device-local memory.
This would be rather complex to represent in the generic constraints as
well.
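
A minimal sketch of how a vendor-specific memory location could ride along as a "required" capability. The `Cap` type, the vendor tags and the capability names are hypothetical, and the rule used here (a merge fails when a required capability from either side does not survive the intersection) is an assumed reading of the "required" bit, not the prototype's exact logic.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Cap:
    vendor: str              # namespace, so the generic library never interprets the name
    name: str                # opaque, vendor-defined
    required: bool = False   # True: the merge must fail rather than drop this cap


def merge_caps(a: FrozenSet[Cap], b: FrozenSet[Cap]) -> Optional[FrozenSet[Cap]]:
    """Intersect two capability sets; return None if a required cap would be lost."""
    common = a & b
    for cap in a | b:
        if cap.required and cap not in common:
            return None
    return common


tiling = Cap("BASE", "foo-tiling")
compression = Cap("VND1", "compression")
fast_mem = Cap("VND1", "mem/device-fast", required=True)
remote_mem = Cap("VND1", "mem/remote-card", required=True)

render = frozenset({tiling, compression, fast_mem})
sample = frozenset({tiling, fast_mem})
remote = frozenset({tiling, remote_mem})
```

Merging `render` with `sample` drops the optional compression capability but keeps the memory location, while merging `render` with `remote` fails outright because the two required locations disagree.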
Post by Rob Clark
Post by James Jones
I think if possible, we should try to keep the design generalized to
as few types of objects and special cases as possible.  The more we
can generalize the solutions to our existing problem set, the better
the mechanism should hold up as we apply it to new and unknown
problems as they arise.
I'm coming around to the fact that those things should perhaps live in a
single list/array, but I still don't like the term "capability".
I admit it's a bit of bike-shedding, but I'm starting to think it would
be better to go with the generic term "property" or "attribute", and
then add flags/adjectives to that based on how merging should work.
This would include the constraints as well -- it seems arbitrary to me
that those would be singled out into their own list.
Basically, the underlying principle is that a good API would have either
one list that includes all the properties, or one list per
merging-behavior. And I think one single list is easier on the API
consumer and easier to extend.
Agreed with Rob. Constraints are different for a reason: They're
non-extensible and hence can merge in more complex ways. Capabilities
are extensible, but must be merged by simple memcmp()-style operations,
currently more or less simple intersection.
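
To illustrate the difference, here is a small sketch under stated assumptions: constraints come from a fixed vocabulary, so each one can carry its own merge rule, while capabilities are opaque blobs that can only be compared for equality. The `max_pitch` constraint and both function names are made up for the example; only alignment appears in this thread.

```python
from math import gcd
from typing import Dict, FrozenSet


def merge_constraints(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Each known constraint gets a rule suited to its meaning."""
    return {
        # An address must satisfy both alignments: least common multiple.
        "alignment": a["alignment"] * b["alignment"] // gcd(a["alignment"], b["alignment"]),
        # Hypothetical upper bound: the tighter limit wins.
        "max_pitch": min(a["max_pitch"], b["max_pitch"]),
    }


def merge_capabilities(a: FrozenSet[bytes], b: FrozenSet[bytes]) -> FrozenSet[bytes]:
    """Capabilities are opaque bytes; keep those that compare equal on both sides."""
    return a & b


gpu_constraints = {"alignment": 32 * 1024, "max_pitch": 8192}
display_constraints = {"alignment": 64 * 1024, "max_pitch": 4096}
gpu_caps = frozenset({b"FOO/tiled", b"FOO/CC"})
display_caps = frozenset({b"FOO/tiled"})
```

With these inputs the merged alignment is 64k and the only surviving capability is `FOO/tiled`.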

However, I also don't care about naming. "Constraints" was chosen
because it connotes negatively, since they "limit" what an allocation
created from a capability set can do, and similarly "capabilities"
connotes positively because it indicates things that are built up
additively to describe abilities of an allocation. However, I don't
know that that metaphor held up entirely as the design was realized, so
it might be a good time to bikeshed new names anyway.

Thanks,
-James
Post by Rob Clark
Cheers,
Nicolai
Miguel Angel Vico
2017-12-08 18:52:39 UTC
On Wed, 6 Dec 2017 16:57:45 -0800
Post by James Jones
Post by Rob Clark
[snip]
Post by James Jones
Post by Miguel Angel Vico
Post by Nicolai Hähnle
Post by Rob Clark
So lets say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
    3: caps(FOO/tiled); constraints(alignment=32k)
    1: caps(FOO/tiled); constraints(alignment=64k)
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
       transition(GPU->display: trans_a, trans_b; display->GPU: none)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
       transition(GPU->display: trans_a; display->GPU: none)
    3: caps(FOO/tiled); constraints(alignment=64k);
       transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example.  But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
   1: caps(FOO/tiled); constraints(alignment=64k)
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_a, trans_b; display->GPU: none)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
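
The merge in this example can be reproduced with a short sketch. It assumes every GPU-only capability either has a transition that removes it or blocks the merge, that alignments are powers of two so the larger one satisfies both, and that the display adds no capability the GPU lacks. `CapSet`, `NULL_TRANSITIONS` and `merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class CapSet:
    caps: FrozenSet[str]
    alignment: int
    gpu_to_display: Tuple[str, ...] = ()   # transitions required GPU -> display


# Capability -> name of the transition that takes it to "null".
NULL_TRANSITIONS = {"FOO/CC": "trans_a", "FOO/cached": "trans_b"}


def merge(gpu_sets: List[CapSet], display_sets: List[CapSet]) -> List[CapSet]:
    merged = []
    for g in gpu_sets:
        for d in display_sets:
            if not d.caps <= g.caps:
                continue                   # display needs something the GPU set lacks
            extra = g.caps - d.caps        # caps only the GPU understands
            if any(c not in NULL_TRANSITIONS for c in extra):
                continue                   # no way to get rid of it before scanout
            merged.append(CapSet(
                caps=g.caps,
                alignment=max(g.alignment, d.alignment),
                gpu_to_display=tuple(sorted(NULL_TRANSITIONS[c] for c in extra)),
            ))
    return merged


gpu = [
    CapSet(frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32 * 1024),
    CapSet(frozenset({"FOO/tiled", "FOO/cached"}), 32 * 1024),
]
display = [CapSet(frozenset({"FOO/tiled"}), 64 * 1024)]
result = merge(gpu, display)
```

The two entries in `result` keep the GPU-side capabilities, take the 64k alignment, and carry `trans_a, trans_b` and `trans_b` respectively as the GPU -> display transitions, matching the listing above.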
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and
similarly compression, since I view compression as roughly an
equivalent mechanism to a cache) in one of the open issues on my XDC
2017 slides because of this very problem of over-pruning it causes.
It's on slide 15, as "No device-local capabilities".  You'll have to
listen to my coverage of it in the recorded presentation for that
slide to make any sense, but it's the same thing Nicolai has laid out
here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities:
The GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines.  That does mean the GPU driver
currently needs to be the one to realize the allocation from the
capability set to get optimal behavior.  We could fix that by
reworking our driver though.  At this point, not including
device-local properties like on-device caching in capabilities seems
like the right solution to me.  I'm curious whether this applies
universally though, or if other hardware doesn't fit the "compression
and stuff all behaves like a cache" idiom.
Compression is a part of the memory layout for us: framebuffer
compression uses an additional "meta surface". At the most basic level,
an allocation with loss-less compression support is by necessity bigger
than an allocation without.
We can allocate this meta surface separately, but then we're forced to
decompress when passing the surface around (e.g. to a compositor.)
Consider also the example I gave elsewhere, where a cross-vendor tiling
Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)
Some more thoughts on caching or "device-local" properties below.
Compression requires extra resources for us as well. That's probably
universal. I think the distinction between the two approaches is
whether the allocating driver deduces that compression can be used with
a given capability set and hence adds the resources implicitly, or
whether the capability set indicates it explicitly. My theory is that
the implicit path is possible, but it has downsides. The explicit path
is attractive due to its exact nature, as I alluded to in my talk: You
can tell the exact properties of an allocation given the capability set
used to allocate it. If that can be made to work, I prefer that path as
well. Agreed that your path also works better for the
multi-vendor+device example.
Post by Rob Clark
[snip]
Post by James Jones
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
   1. App gets GPU caps for RENDER usage
   2. App allocates GPU memory using a layout from (1)
   3. App now decides it wants to use the buffer for SCANOUT
   4. App queries usage transition metadata from RENDER to SCANOUT,
      given the current memory layout.
   5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up
front when initially querying caps in the model I assumed.  The app
then specifies some subset (up to the full set) of the specified
usages as a src and dst when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it
would be required that the driver support transitioning to any single
usage given the capabilities returned.  However, transitioning to
multiple usages (E.g., to simultaneously rendering and scanning out)
could fail to produce a valid transition, in which case the app would
have to fall back to a copy in that case, or avoid that simultaneous
usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work.  I agree
Rob's example outcomes above are ideal, but it's not clear to me how
to code up such an algorithm.  This also all seems unnecessary if
"device local" capabilities aren't needed, as posited above.
Even if "device local" capabilities aren't exposed in the capability
set, we might still want to have capabilities exposed that may have an
associated transition, right? And as already mentioned, it looks like some
capabilities such as shared caches might not qualify as "device local"
and must be exposed either way.

If we don't embed transition information in the capability set somehow,
I don't see how we can avoid the merge operation dropping certain
capabilities because they aren't found in all sets.

My background in the matter is limited, so I'm probably missing some
points, but here's an idea of how to implement Rob's suggestion, which
I think also ties to Nicolai's transition pseudo-algorithm below:

1. Upon capabilities query, a device puts together a list of optimal
capability sets to best satisfy the intended usage for that
particular device.

2. Since the list of all usages is provided, we can guess whether we
might need subsets of those sets from (1) to satisfy all usages. We
also know whether we can provide transitions that convert those
sets (1) to simpler subsets.

3. We add to the list of returned capability sets all optimal sets
from (1) plus all suboptimal sets from (2) that can actually be
obtained through transitions from (1).

4. We add information to each capability set so that we know what
other sets in the list are obtained by applying transitions. Each
set has a list of 'source transitions' (pointer to source superset +
pointer to transition to apply) and a list of 'destination
transitions' (pointer to destination subset + pointer to transition
to apply).

Thus, we end up with a list of sets and subsets connected to each other
according to the available transitions.


Then, we can modify the capability merge logic such that:

1. We compute the union of constraints.

2. We search for the sets whose capabilities are found in both provided
lists. Since we have the connectivity information, we can actually
return more complex sets that will be converted to the simpler
found one (no need for dropping capabilities).

3. If no intersection of sets is found by (2), we start dropping
capabilities until we find an intersection, or fail the merge
operation.

Note that transition information will be preserved in the new returned
list of sets.

When actually transitioning from one usage to another, we just navigate
the capability set graph from the corresponding source set to the
destination set, applying any transitions required, which are encoded
in the sets themselves.
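
A rough sketch of the graph navigation described above, under simplifying assumptions: capability sets are plain sets of strings, every transition removes exactly one capability, and only the 'destination transitions' direction (superset to subset) is modelled. `build_graph` and `find_transitions` are hypothetical helpers.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Tuple

CapSet = FrozenSet[str]
Graph = Dict[CapSet, List[Tuple[str, CapSet]]]


def build_graph(sets: List[CapSet], drop_transitions: Dict[str, str]) -> Graph:
    """Link each set to the subsets reachable by one capability-removing transition."""
    known = set(sets)
    graph: Graph = {s: [] for s in sets}
    for s in sets:
        for cap in sorted(s):
            target = s - {cap}
            if cap in drop_transitions and target in known:
                graph[s].append((drop_transitions[cap], target))
    return graph


def find_transitions(graph: Graph, src: CapSet, dst: CapSet) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of transitions from src to dst."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for name, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None   # no chain exists; the app would have to fall back to a copy


full = frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"})
cached_only = frozenset({"FOO/tiled", "FOO/cached"})
plain = frozenset({"FOO/tiled"})
graph = build_graph([full, cached_only, plain],
                    {"FOO/CC": "trans_a", "FOO/cached": "trans_b"})
```

Going from `full` to `plain` yields `trans_a` followed by `trans_b`, because the intermediate set without `FOO/cached` but with `FOO/CC` is not in the list. The reverse direction returns `None` in this sketch since 'source transitions' are left out.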
Post by James Jones
Post by Rob Clark
Post by James Jones
Post by Miguel Angel Vico
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently.
The number of possible transitions grows combinatorially for every
usage requested I believe.  I expect there will be cases where ~10
usages are specified, so generating all possible transitions all the
time may be excessive, when the app will probably generally only care
about 2 or 3 states, and in practice, there will probably only
actually be 2 or 3 different underlying possible combinations of
operations.
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
Rather than expressing usage as a union of uses, would it make sense to
express it as a directed graph somehow so that the application can
specify how it intends to move the allocation around?

If we had a directed usage graph, upon capability query, we'd know what
transitions the application is going to care about, and expose one set
of capabilities and transitions or another accordingly.
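
As a rough illustration of why a directed usage graph would help, the sketch below compares the number of transitions an allocator would have to consider when any non-empty subset of usages can be a source or destination state against the handful of edges an application actually declares. `UsageGraph` and `count_all_transitions` are hypothetical, and the usage names are only examples.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Usage = FrozenSet[str]
Edge = Tuple[Usage, Usage]


def count_all_transitions(num_usages: int) -> int:
    """Ordered pairs of distinct non-empty usage subsets."""
    states = 2 ** num_usages - 1
    return states * (states - 1)


class UsageGraph:
    """Directed graph of the usage changes the application intends to perform."""

    def __init__(self) -> None:
        self.edges: Set[Edge] = set()

    def add_transition(self, src: Iterable[str], dst: Iterable[str]) -> None:
        self.edges.add((frozenset(src), frozenset(dst)))

    def usages(self) -> Usage:
        """Union of all usages, i.e. what the allocator is told today."""
        out: Set[str] = set()
        for src, dst in self.edges:
            out |= src | dst
        return frozenset(out)

    def transitions(self) -> List[Edge]:
        """Only the transitions the allocator needs to be able to provide."""
        return sorted(self.edges, key=lambda e: (sorted(e[0]), sorted(e[1])))


app = UsageGraph()
app.add_transition({"RENDER"}, {"TEXTURE"})
app.add_transition({"RENDER"}, {"SCANOUT"})
app.add_transition({"SCANOUT"}, {"RENDER"})
```

Here the application declares three transitions over three usages, whereas `count_all_transitions(3)` is 42 and `count_all_transitions(10)` is 1045506.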
Post by James Jones
Post by Rob Clark
I'm looking for something that would also eliminate another part of the
design that makes me uncomfortable: the metadata for transitions. This
makes me uncomfortable for a number of reasons. Who computes the
metadata? What is the representation of the metadata? With cross-device
usages (which is the whole point of the exercise), this quickly becomes
infeasible.
capabilities and constraints (or properties/attributes).
I kind of already outlined this with the long example in my email here
https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the
surface was allocated
1. Compute the merged capability set for the new usages (the destination
capability set).
2. Compute the transition capability set, which is the merger of the
source and destination sets.
3. Determine whether a "release" transition is required on the source
3a. For global properties, a transition is required if the source
capability set is a superset of the transition set.
3b. For device-local properties, a transition is required if there is
some destination device for which the device-local properties are a
subset of the source set.
4. Determine whether an "acquire" transition is required on the
destination device(s) in a similar way.
Finally, execute the transitions using corresponding APIs, where the
APIs simply receive the computed capability sets.
For example, release transitions would receive the source capability set
(and perhaps the source usages), the transition capability set, and the
set difference of device-local capabilities, and nothing else.
The point is that all steps of the algorithm can be implemented in a
device-agnostic way in libdevicealloc, without calling into any
device/driver callbacks.
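
A minimal, device-agnostic sketch of steps 1, 2, 3a and 4, covering global properties only. It assumes merging is plain set intersection, reads "superset" as strict superset (otherwise there would be nothing to release), treats the acquire check as the mirror image of the release check, and omits step 3b for device-local properties. `merge` and `plan_transition` are hypothetical names.

```python
from functools import reduce
from typing import Dict, FrozenSet, Iterable, List

CapSet = FrozenSet[str]


def merge(sets: Iterable[CapSet]) -> CapSet:
    """Merge capability sets by intersection (needs at least one set)."""
    return reduce(lambda a, b: a & b, list(sets))


def plan_transition(source: CapSet,
                    new_usages: List[str],
                    caps_by_usage: Dict[str, CapSet]) -> dict:
    destination = merge(caps_by_usage[u] for u in new_usages)   # step 1
    transition = source & destination                           # step 2
    return {
        "destination": destination,
        "transition": transition,
        "release": source > transition,        # step 3a: source has more than survives
        "acquire": destination > transition,   # step 4: destination has more than survives
    }


# Capability sets recorded per usage when the surface was allocated.
caps_by_usage = {
    "RENDER": frozenset({"FOO/tiled", "FOO/CC"}),
    "TEXTURE": frozenset({"FOO/tiled", "FOO/CC"}),
    "SCANOUT": frozenset({"FOO/tiled"}),
}
to_scanout = plan_transition(caps_by_usage["RENDER"], ["SCANOUT"], caps_by_usage)
to_render = plan_transition(caps_by_usage["SCANOUT"], ["RENDER"], caps_by_usage)
to_texture = plan_transition(caps_by_usage["RENDER"], ["TEXTURE"], caps_by_usage)
```

Going from RENDER to SCANOUT needs a release on the source only, going back needs an acquire on the destination only, and RENDER to TEXTURE needs neither because both sets are identical.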
I'm pretty sure this or something like it can be made to work. We need
to think through a lot of example cases, but at least we'll have thought
them through, which is better than relying on some opaque metadata thing
and then finding out later that there are some new cross-device cases
where things don't work out because the piece of (presumably
device-specific driver) code that computes the metadata isn't aware of them.
This sounds pretty good. I'd like to see more detailed pseudo-code of a
full cycle (cap query, allocation, transition to and from a few usages),
but it seems pretty solid. I very much like that it enables the
explicit capability sets, but I'm mildly worried it might add API
complexity overall rather than reduce it.
I think in the end our two proposals are very similar: Yours just moves
the conversion from high-level properties -> device commands to the
driver applying the transition. That's fine in theory, though it shifts
some minor overhead to the time of the transition. We could design the
APIs such that it's possible to cache/pre-bake the device commands for a
given transition though to alleviate that if it proves meaningful.
To make it clearer what the "metadata" is in my version and hence
perhaps make it clearer how similar the two are, a few notes:
Transitions are queried per device in my proposal. Note this means you
need to query two different sets of transition metadata for a
cross-device transition, one from the source that would be applied on
that device in the source API, and one from the destination that would
be applied on that device in the destination API. APIs/engines that
don't require transitions would return some NULL metadata indicating no
required transition on that side.
1) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
Vulkan, same device:
-Query transition. You'd get some metadata representing very simple
cache management stuff if anything. You'd apply it using some form of
pipeline barrier on the relevant image.
2) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
Vulkan, different device:
-Query transition from each device. You'd get some metadata
representing more complex cache management, and potentially a decompress
depending on the compatibility of the two devices. The driver is the
same for both devices in this case, so it can calculate the similarities
exactly by examining the capability set and each device's properties.
You'd apply it using some form of pipeline barrier with the respective
metadata on the relevant image on each device.
3) transition from NVIDIA dev rendering -> AMD dev texturing both in Vulkan:
-Query transition from each device. NVIDIA driver would see the
destination usage is a foreign device it has no knowledge of and perform
a complete cache flush and decompress. AMD driver would see the source
usage is something it doesn't recognize and perform a full cache
invalidate (and compression surface invalidate, if any?). You'd apply
it using some form of pipeline barrier with the respective metadata on
the relevant image on each device.
4) transition from NVIDIA dev rendering -> NVIDIA encoder with cache
coherence
-Query source transition on GPU dev. Query destination transition on
video encoder dev. GPU recognizes the destination is a device it is
aware has certain properties and hence returns a decompress only since
it knows it has cache coherence. Video encoder dev returns NULL
transition. Apply source transition on source graphics API. Note this
case requires some careful coordination across a vendor's various driver
stacks to perform optimally. It would automatically degrade to the
foreign device case for naive/incomplete drivers though.
Post by Rob Clark
[snip]
Post by James Jones
One final note:  When I initially wrote up the capability merging
logic, I treated "layout" as a sort of "special" capability, basically
like Nicolai originally outlined above.  Miguel suggested I add the
"required" bit instead to generalize things, and it ended up working
out much cleaner.  Besides the layout, there is at least one other
obvious candidate for a "required" capability that became obvious as
soon as I started coding up the prototype driver: memory location.  It
might seem like memory location is a simple device-agnostic constraint
rather than a capability, but it's actually too complicated (we need
more memory locations than "device" and "host").  It has to be vendor
specific, and hence fits in better as a capability.
Could you give more concrete examples of what you'd like to see, and why
having this as constraints is insufficient?
We have more than one "device local" memory with different capabilities
on some devices. I think you guys have this situation as well with your
cards with an SSD on them or something if I'm interpreting the
marketing stuff right. I'd like to be able to express those all without
needing to code them into the device-agnostic portion of the allocator
library ahead of time. That way, if we come up with any new clever
ones, we don't need to wait for everyone to update their allocator
library to make use of them.
Additionally, with things like SLI/Crossfire, we end up with a sort of
NUMA memory architecture, where memory on a "remote" card might have
similar but not exactly the same capabilities as device-local memory.
This would be rather complex to represent in the generic constraints as
well.
Post by Rob Clark
Post by James Jones
I think if possible, we should try to keep the design generalized to
as few types of objects and special cases as possible.  The more we
can generalize the solutions to our existing problem set, the better
the mechanism should hold up as we apply it to new and unknown
problems as they arise.
I'm coming around to the fact that those things should perhaps live in a
single list/array, but I still don't like the term "capability".
I admit it's a bit of bike-shedding, but I'm starting to think it would
be better to go with the generic term "property" or "attribute", and
then add flags/adjectives to that based on how merging should work.
This would include the constraints as well -- it seems arbitrary to me
that those would be singled out into their own list.
Basically, the underlying principle is that a good API would have either
one list that includes all the properties, or one list per
merging-behavior. And I think one single list is easier on the API
consumer and easier to extend.
Agreed with Rob. Constraints are different for a reason: They're
non-extensible and hence can merge in more complex ways. Capabilities
are extensible, but must be merged by simple memcmp()-style operations,
currently more or less simple intersection.
However, I also don't care about naming. "Constraints" was chosen
because it connotes negatively, since they "limit" what an allocation
created from a capability set can do, and similarly "capabilities"
connotes positively because it indicates things that are built up
additively to describe abilities of an allocation. However, I don't
know that that metaphor held up entirely as the design was realized, so
it might be a good time to bikeshed new names anyway.
There have been several naming suggestions for different pieces of the
library. I'll start separate threads with patches with some of the
name changes, so we can keep the bike-shedding separate from the design
discussion.

Thanks.
Post by James Jones
Thanks,
-James
Post by Rob Clark
Cheers,
Nicolai
--
Miguel
Rob Clark
2017-12-06 13:25:19 UTC
Post by Miguel Angel Vico
On Fri, 1 Dec 2017 13:38:41 -0500
Post by Rob Clark
sure, this was just a hypothetical example. But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
trans_a: FOO/CC -> null
trans_b: FOO/cached -> null
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
1: caps(FOO/tiled); constraints(alignment=64k)
1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_a, trans_b; display->GPU: none)
2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
transition(GPU->display: trans_b; display->GPU: none)
So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and similarly
compression, since I view compression as roughly an equivalent mechanism to
a cache) in one of the open issues on my XDC 2017 slides because of this
very problem of over-pruning it causes. It's on slide 15, as "No
device-local capabilities". You'll have to listen to my coverage of it in
the recorded presentation for that slide to make any sense, but it's the
same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities: The
GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines. That does mean the GPU driver currently
needs to be the one to realize the allocation from the capability set to get
optimal behavior. We could fix that by reworking our driver though. At
this point, not including device-local properties like on-device caching in
capabilities seems like the right solution to me. I'm curious whether this
applies universally though, or if other hardware doesn't fit the
"compression and stuff all behaves like a cache" idiom.
Possibly a SoC(ish) type device which has a "system" cache that some
but not all devices fall into. I *think* the intel chips w/ EDRAM
might fall into this category. I know the idea has come up elsewhere,
although not sure if anything like that ended up in production. It
seems like something we'd at least want to have an idea how to deal
with, even if it isn't used for device internal caches.

Not sure if similar situation could come up w/ discrete GPU and video
decode/encode engines on the same die?

[snip]
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up front
when initially querying caps in the model I assumed. The app then specifies
some subset (up to the full set) of the specified usages as a src and dst
when querying transition metadata.
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
hmm, I guess if a buffer *can* be shared across all uses, there by
definition has to be a chain of transitions to go from any
usage+device to any other usage+device.

Possibly a separate step to query transitions avoids solving for every
possible transition when merging the caps set.. although until you do
that query I don't think you know the resulting merged caps set is
valid.

Maybe in practice for every cap FOO there exists a FOO->null (or
FOO->generic if you prefer) transition, ie. compressed->uncompressed,
cached->clean, etc. I suppose that makes the problem easier to solve.
I hadn't thought hard about it, but my initial thoughts were that it would
be required that the driver support transitioning to any single usage given
the capabilities returned. However, transitioning to multiple usages (E.g.,
to simultaneously rendering and scanning out) could fail to produce a valid
transition, in which case the app would have to fall back to a copy in that
case, or avoid that simultaneous usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree Rob's
example outcomes above are ideal, but it's not clear to me how to code up
such an algorithm. This also all seems unnecessary if "device local"
capabilities aren't needed, as posited above.
Probably things like device private caches, and transitions between
usages on the same device(+driver?[1]) could be left out. For the
cache case, if you have a cache shared between some but not all
devices, that problem looks to me to be basically the same problem as
compressed buffers when some but not all devices support a particular
compression scheme.

[1] can we assume magic under the hood for vk and gl interop with
drivers from same vendor on same device?

[snip]
Post by Miguel Angel Vico
Post by Rob Clark
yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
That just seems an implementation detail of the representation the
particular vendor chooses for the CACHE capability, right?
Agreed.
One final note: When I initially wrote up the capability merging logic, I
treated "layout" as a sort of "special" capability, basically like Nicolai
originally outlined above. Miguel suggested I add the "required" bit
instead to generalize things, and it ended up working out much cleaner.
Besides the layout, there is at least one other clear candidate for a
"required" capability that became obvious as soon as I started coding up the
prototype driver: memory location. It might seem like memory location is a
simple device-agnostic constraint rather than a capability, but it's
actually too complicated (we need more memory locations than "device" and
"host"). It has to be vendor specific, and hence fits in better as a
capability.
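
A minimal sketch of how a "required" bit could generalize the merge, assuming capability sets are intersected by name and that a required capability missing from the other set makes the two sets incompatible. The `Cap` type, the `merge_caps` helper, and the capability names are all hypothetical and are not taken from the prototype:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cap:
    name: str               # made-up names, e.g. "layout:tiled", "loc:vidmem"
    required: bool = False  # the "required" bit discussed above


def merge_caps(a, b):
    """Intersect two capability lists by name.

    Returns None when a required capability from either side is absent
    from the other, since it cannot be silently dropped.
    """
    common = {c.name for c in a} & {c.name for c in b}
    for cap in list(a) + list(b):
        if cap.required and cap.name not in common:
            return None
    merged = []
    for cap in a:
        if cap.name in common:
            # Stay required if either side marked the capability required.
            req = cap.required or any(
                d.required for d in b if d.name == cap.name
            )
            merged.append(Cap(cap.name, req))
    return merged


gpu = [Cap("layout:tiled", True), Cap("loc:vidmem", True), Cap("compressed")]
display = [Cap("layout:tiled", True), Cap("loc:vidmem", True)]
camera = [Cap("layout:linear", True), Cap("loc:sysmem", True)]

gpu_display = merge_caps(gpu, display)  # optional "compressed" is pruned
gpu_camera = merge_caps(gpu, camera)    # incompatible required caps
```

With this rule, layout and memory location need no special-casing: both are simply capabilities with the required bit set, while optional ones are pruned when the other side lacks them.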
I think if possible, we should try to keep the design generalized to as few
types of objects and special cases as possible. The more we can generalize
the solutions to our existing problem set, the better the mechanism should
hold up as we apply it to new and unknown problems as they arise.
agreed

BR,
-R
Nicolai Hähnle
2017-12-06 15:45:41 UTC
Post by Rob Clark
Note I also posed the question of whether things like cached (and similarly
compression, since I view compression as roughly an equivalent mechanism to
a cache) should be capabilities at all in one of the open issues on my XDC
2017 slides, because of this very over-pruning problem. It's on slide 15, as "No
device-local capabilities". You'll have to listen to my coverage of it in
the recorded presentation for that slide to make any sense, but it's the
same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I
didn't actually need to include cached or compressed as capabilities: The
GPU just applies them as needed and the usage transitions make it
transparent to the non-GPU engines. That does mean the GPU driver currently
needs to be the one to realize the allocation from the capability set to get
optimal behavior. We could fix that by reworking our driver though. At
this point, not including device-local properties like on-device caching in
capabilities seems like the right solution to me. I'm curious whether this
applies universally though, or if other hardware doesn't fit the
"compression and stuff all behaves like a cache" idiom.
Possibly a SoC(ish) type device which has a "system" cache that some
but not all devices fall into. I *think* the intel chips w/ EDRAM
might fall into this category. I know the idea has come up elsewhere,
although not sure if anything like that ended up in production. It
seems like something we'd at least want to have an idea how to deal
with, even if it isn't used for device internal caches.
Not sure if similar situation could come up w/ discrete GPU and video
decode/encode engines on the same die?
It definitely could. Our GPUs currently don't have shared caches between
gfx and video engines, but moving more and more clients under a shared
L2 cache has been a theme over the last few generations. I doubt that's
going to happen for the video engines any time soon, but you never know.

I don't think we really need caches as a capability for our current
GPUs, but it may change, and in any case, we do want compression as a
capability.
Post by Rob Clark
[snip]
Post by Miguel Angel Vico
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work like this:
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up front
when initially querying caps in the model I assumed. The app then specifies
some subset (up to the full set) of the specified usages as a src and dst
when querying transition metadata.
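
A rough sketch of the model described above, assuming all usages are declared when caps are first queried and that a transition query only accepts subsets of those usages as src and dst. The `allocate` and `query_transition` names and the returned dictionary are placeholders, not the prototype's API:

```python
RENDER, SCANOUT, TEXTURE = "render", "scanout", "texture"


class Allocation:
    def __init__(self, declared_usages):
        # Every usage the app may ever transition to, fixed up front.
        self.declared = frozenset(declared_usages)

    def query_transition(self, src, dst):
        """Return stand-in transition metadata for src -> dst usages."""
        src, dst = frozenset(src), frozenset(dst)
        if not src or not dst:
            raise ValueError("src and dst must each name at least one usage")
        undeclared = (src | dst) - self.declared
        if undeclared:
            raise ValueError(
                f"usages not declared at caps query: {sorted(undeclared)}"
            )
        # Real metadata would be opaque and driver-defined.
        return {"src": sorted(src), "dst": sorted(dst)}


def allocate(usages):
    # Stands in for: query caps for *all* usages, merge them, allocate.
    return Allocation(usages)


buf = allocate([RENDER, SCANOUT])
meta = buf.query_transition([RENDER], [SCANOUT])

try:
    buf.query_transition([RENDER], [TEXTURE])  # TEXTURE was never declared
    rejected = False
except ValueError:
    rejected = True
```

The point of the up-front declaration is that the layout chosen at allocation time already accounts for every later usage, so step 3 in the list above cannot introduce a usage the allocation was never validated against.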
Post by Miguel Angel Vico
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
hmm, I guess if a buffer *can* be shared across all uses, there by
definition has to be a chain of transitions to go from any
usage+device to any other usage+device.
Possibly a separate step to query transitions avoids solving for every
possible transition when merging the caps set.. although until you do
that query I don't think you know the resulting merged caps set is
valid.
Maybe in practice for every cap FOO there exists a FOO->null (or
FOO->generic if you prefer) transition, i.e. compressed->uncompressed,
cached->clean, etc. I suppose that makes the problem easier to solve.
It really would, to the extent that I would prefer if we could bake it
into the system as an assumption.
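
To illustrate why that assumption simplifies things, here is a small sketch in which every capability has a FOO->null step, so a transition plan only has to undo whatever the destination does not understand. The `TO_NULL` table, the step names, and `plan_transition` are invented for illustration:

```python
# Hypothetical FOO -> null operation for each capability.
TO_NULL = {
    "compressed": "decompress",  # compressed -> uncompressed
    "cached": "flush",           # cached -> clean
}


def plan_transition(active_caps, dst_supported):
    """List the FOO->null steps needed before the destination can use
    a buffer whose currently active capabilities are `active_caps`."""
    steps = []
    for cap in sorted(active_caps):
        if cap not in dst_supported:
            # Raises KeyError if a capability has no null transition,
            # i.e. if the baked-in assumption does not hold.
            steps.append(TO_NULL[cap])
    return steps


to_display = plan_transition({"compressed", "cached"}, {"compressed"})
to_generic = plan_transition({"compressed", "cached"}, set())
```

Because every capability can fall back to the generic state, a plan always exists between any two usage+device pairs, and the merge step does not have to solve for every possible transition in advance.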

I have my doubts about how to manage calculating transitions cleanly at
all without it. The metadata stuff is very vague to me.
Post by Rob Clark
I hadn't thought hard about it, but my initial thoughts were that it would
be required that the driver support transitioning to any single usage given
the capabilities returned. However, transitioning to multiple usages (e.g.,
simultaneously to rendering and scanning out) could fail to produce a valid
transition, in which case the app would have to fall back to a copy or
avoid that simultaneous usage combination in some other way.
Post by Miguel Angel Vico
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree Rob's
example outcomes above are ideal, but it's not clear to me how to code up
such an algorithm. This also all seems unnecessary if "device local"
capabilities aren't needed, as posited above.
Probably things like device private caches, and transitions between
usages on the same device(+driver?[1]) could be left out. For the
cache case, if you have a cache shared between some but not all
devices, that problem looks to me to be basically the same problem as
compressed buffers when some but not all devices support a particular
compression scheme.
[1] can we assume magic under the hood for vk and gl interop with
drivers from same vendor on same device?
In my book, the fewer assumptions we have to make for that, the better.

Cheers,
Nicolai
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.