/srv/irclogs.linaro.org/2011/06/24/#linaro-mm-sig.txt

01:05 <michaedw> robclark: I'd be interested in your take on the topic of user allocated buffers
01:05 <michaedw> especially since you know all about the GStreamer / OpenMAX angle :-)
01:19 <michaedw> robclark: we had actually looked a bit at VDPAU as well, on the theory that it would help with getting Flash 10.x ported
03:13 <robclark> michaedw, sorry, didn't notice you earlier..
03:13 <michaedw> no worries
03:13 <robclark> my preference is for kernel allocated buffers
03:13 <robclark> it gives more flexibility..
03:14 <robclark> ie, if you need to unmap buffers, take faults, map them back in on demand, etc..
03:14 <robclark> I think something where you are passing around opaque handles 90+% of the time also helps.. vdpau or vaapi get it right on that count
03:15 <michaedw> did you see my postings about userland asking the kernel for buffers with the appropriate properties, as a separate step from registering them with individual drivers?
03:15 <michaedw> for when you need to merge the requirements of producer and consumer
03:15 <robclark> no.. not yet.. which list?
03:15 <michaedw> linaro-mm-sig
03:16 <robclark> ahh, ok.. sorry, I'm a bit behind
03:16 <michaedw> say, the GPU wants uncacheable write-combining, the video capture block wants megabyte alignment
03:16 <robclark> I think if the kernel has control of buffers, they can be migrated.. mapped.. whatever is needed..
03:17 <robclark> although you need some sort of sync object to know when no hw is accessing the buffer..
03:17 <michaedw> it can be very hairy to change the cache policy once the buffer has been handed out
03:17 <robclark> knowing about userspace is easy.. page faulting
03:17 <robclark> true.. that is why it is better if you don't have to map it at all ;-)
03:17 <michaedw> you may have to unmap all existing mappings and force a full cache flush
03:18 <robclark> ie. if you only have to map it for edge cases, like generating a thumbnail w/ a sw jpeg encoder.. then life gets easier..
03:18 <michaedw> and tracing indirections through the IOMMU and such is likely to be a giant pain
03:19 <robclark> but normally you use a buffer over and over.. so if you have to take a hit the first time the buffer makes it through the multimedia pipeline, no biggie
03:19 <michaedw> I also have in mind non-graphics uses, like handing over whole data structures from one process to another
03:20 <robclark> hmm... well, admittedly my focus has been more video/gfx buffers..
03:20 <michaedw> using hugepages mmap'ed from a shared file descriptor
03:20 <robclark> who big of data structures are you talking about?
03:20 <michaedw> and offset pointers
03:20 <robclark> s/who/how/
03:20 <michaedw> multi-megabyte; things like priority search trees full of log entries
03:21 <robclark> hmm.. well, for that sort of thing, if sw is operating on it, you probably want it cached, don't you?
03:21 <michaedw> not if it's being handed from DSP to ARM ;-)
03:21 <robclark> hmm, ok
03:22 <michaedw> I am trying to create a generic solution to this shared-buffer problem that supports inter-process and inter-processor equally well
03:22 <robclark> would that be a single allocator/producer on one side (ARM or DSP) and a consumer on the other.. or multiple ARM side processes?
03:23 <michaedw> potentially even concurrent access using lock-free algorithms
03:23 <robclark> right.. but the DSP is no different from a GPU.. it is just "dma" from the ARM's perspective..
03:23 <robclark> I mean, it is not completely different from GPU vs CPU..
03:24 <michaedw> not on a processor with tightly coupled memory, where atomic accesses from the ARM and DSP sides can share operands as long as they bypass the cache hierarchy
03:24 <robclark> in some cases, w/ a GPU, you want to map buffers uncached.. but usually at the time you map them into userspace, you know what they are for..
03:24 <michaedw> imagine, say, lens distortion correction done by the GPU
03:25 <michaedw> so you might split the capture buffer into four quadrants per frame
03:25 <robclark> ok..
03:26 <michaedw> split at the optical center, which may not be the center of the frame
03:26 <michaedw> and may even be a moving target, in the case of "digital pan/tilt/zoom"
03:26 <robclark> ok..
03:27 <robclark> I guess I probably need to read your emails first, maybe I'm misunderstanding what you want..
03:27 <michaedw> the coordinating entity may have to reap and reallocate buffers from a larger block
03:27 <robclark> I mean, I'm convinced there needs to be a way to get uncached buffers..
03:27 <michaedw> I need to go anyway, but I'd love to talk with you about this at your convenience, in the next week or so
03:28 <robclark> ok, well let me get caught up on email ;-)
03:28 <michaedw> very good; catch you whenever, or you can email at m.k.edwards@gmail.com or michaedw@cisco.com
03:28 <robclark> sounds good
03:28 <michaedw> ta
03:28 <robclark> ttyl
07:49 <mszyprow> arnd: hello
08:23 <arnd> hi mszyprow
08:24 <mszyprow> arnd: I wanted to ask if you saw my latest dma-mapping patches for the ARM architecture
08:25 <mszyprow> arnd: and I wonder if I should start working on alloc_attrs @ dma_map_ops patches for other architectures
08:25 <arnd> mszyprow: I've seen that you sent patches, but didn't have the time yet to look at them in detail
08:26 <mszyprow> arnd: ok, no problem, I can wait until you have some time to comment on them
08:28 <arnd> mszyprow: are you using the common dma_map_ops already, or do you define an ARM specific version right now?
08:29 <mszyprow> arnd: this version uses the common dma_map_ops
08:30 <mszyprow> but with an additional patch that adds alloc_attrs
08:30 <arnd> ok
08:30 <arnd> I also just realized that dma_map_ops is defined in linux/dma-mapping.h, so you can't override it
08:31 <arnd> in that case, changing the other architectures would be necessary to get the patches merged, I assume
08:31 <mszyprow> arnd: I know
08:32 <mszyprow> arnd: but I wanted to get some feedback on whether such a change will be accepted
08:34 <michaedw> mszyprow: have you given any thought to my suggestion that userland should ask drivers for attrs, merge them, and ask the kernel for a region having the merged attrs
08:34 <arnd> I think the chances of getting that accepted are pretty good; convincing Fujita Tomonori is most important there
08:34 <michaedw> which it can then slice into buffers and register with the devices that will touch those buffers
08:34 <mszyprow> hmm, I probably forgot to CC: him on my patches :(
08:35 <arnd> mszyprow: since most architectures only need one kind of allocation (coherent), the dma_alloc_coherent in include/asm-generic/dma-mapping-common.h can just pass an empty attribute pointer
08:35 <mszyprow> michaedw: right now I don't consider the userspace api
08:35 <arnd> so you wouldn't have to do much at all for the other architectures
08:35 <mszyprow> arnd: yes, right
08:36 <arnd> michaedw: I agree with mszyprow, it's a completely separate discussion
08:36 <mszyprow> arnd: and I also noticed that dma_alloc_non_coherent can be redirected to dma_alloc_attrs() with a DMA_ATTR_NON_COHERENT attribute
08:37 <mszyprow> arnd: some archs do #define dma_alloc_non_coherent dma_alloc_coherent anyway
08:37 <arnd> exactly, we would do that on architectures that require multiple dma_alloc_* variants
08:37 <arnd> yes, those are the ones where every DMA is coherent
08:37 <michaedw> arnd: ok; so there's an in-kernel api that takes an attrs struct and a size, and returns a physical address region?
08:37 <mszyprow> imho dma_alloc_attrs() will be a really nice and clean interface
08:38 <arnd> mszyprow: agreed
08:38 <arnd> michaedw: we've discussed the dma-buffer subsystem at length in Budapest
08:39 <michaedw> which can be mapped with the appropriate set of attributes into kernel address space if needed, and via IOMMU and/or process address space if not?
08:39 <arnd> the idea is that you can have an in-kernel data structure to point to a dma buffer (sg_list), and a handle you can pass around in user space
08:39 <michaedw> ok; and you can convert this handle to an actual mmap'ed block if you need to write to it?
08:40 <arnd> normally, the attributes would be determined by the kernel subsystem that uses the dma-buffer
08:40 <michaedw> as, say, a write-combining uncacheable range?
08:40 <arnd> including ways to mmap it into user space, if that's permitted
08:40 <michaedw> I have in mind use cases outside the video pipeline per se
08:41 <michaedw> such as passing image assets from a userland framework such as Qt to the GPU
08:41 <arnd> the dma-buffer is not strictly tied to video processing, we also discussed uses for DSPs etc
08:41 <arnd> right
08:42 <michaedw> and passing large data structures between the DSP and Linux userland
08:42 <arnd> the case you refer to is one of the easier ones, because there is only a single hardware component accessing it
08:42 <arnd> the dma-buffer framework is mostly for the complicated cases, where we have two or more subsystems interacting
08:42 <michaedw> possibly even concurrent access with lockless algorithms, if the chip has atomic operations that work among cores (as long as each core's cache hierarchy is bypassed)
08:43 <michaedw> I also have in mind delivering decoded H.264 to GPU video textures
08:43 <michaedw> and using the GPU for lens distortion correction in between stages of the video capture pipeline
08:44 <arnd> yes, that is one of the cases where two subsystems (v4l and drm) need to talk to one another
08:44 <arnd> so you get a dma-buffer pointer from e.g. drm and pass it to v4l, then tell v4l to render video to the buffer, and tell drm to show the buffer as a texture
08:45 <arnd> in that case, you would not mmap it to user space
08:45 <arnd> and possibly also not to kernel space
08:46 <michaedw> what if v4l and drm have distinct, but non-conflicting, allocation requirements?
08:47 <michaedw> say, v4l wants it mapped uncacheable write-combining (so userland can fiddle with the frame metadata), and drm wants it megabyte-aligned?
08:47 <arnd> that's part of the dma-buffer API to figure out. the details depend on the implementation, which is not yet done
08:47 <michaedw> it seems to me that the attrs themselves have to be exported from driver to userland
08:48 <arnd> in your example, drm would allocate the buffer with the alignment it wants and give a handle to user space
08:48 <michaedw> and merged, and the result used to set up the allocation and mapping(s)
08:48 <michaedw> I don't see why it should be either driver's job to do the allocating
08:48 <arnd> v4l can expose a method of mmapping the buffer to user space, and has to ask the creator of the buffer if the mmap attributes are possible
08:49 <arnd> ideally you don't want the user to care about the attributes, because they are highly arch specific and most of the time the kernel will already know what they must be
08:49 <michaedw> they should be opaque to the user
08:50 <michaedw> but the user should do the attribute merging and allocating
08:50 <arnd> many subsystems already have ways to expose attributes to the user, I wouldn't want to introduce yet another conflicting way
08:51 <michaedw> because only the user knows that this buffer will be passed back and forth between, say, the video capture block, the face detection block, and the H.264 encoder
08:51 <arnd> and you probably don't want user space to be responsible for allocation
08:52 <michaedw> user space can't be responsible for the allocation of physical memory or for the creation of the mapping(s)
08:52 <arnd> i.e. not have the user allocate an arbitrary piece of memory and then map it to the device
08:52 <michaedw> but it can request that the kernel allocate a block with the merged attribute set, large enough to hold the set of buffers needed
08:52 <arnd> well, a lot of subsystems do that today (infiniband, v4l2), but it's rather flawed imho
08:53 <michaedw> and then manage the allocation of buffers within that block
08:53 <arnd> why can't we do the API per buffer?
08:53 <michaedw> one use case I have in mind is digital pan/tilt/zoom, followed by a GPU-based lens distortion correction
08:54 <arnd> it sounds like an unnecessary complication to provide an API to deal with multiple buffers at once
08:54 <michaedw> the lens distortion correction has to be divided into quadrants, split at the optical center
08:54 <michaedw> which moves around as you pan/tilt
08:55 <michaedw> so you need to resplit the frame's worth of space into buffers, dynamically
08:55 <michaedw> if you want sub-frame latency
08:55 <michaedw> this is an actual use case in an actual product we are building today
08:56 <michaedw> and we are suffering from the frame-centric buffer model in the vendor's capture pipeline
08:57 <michaedw> there is similar pain associated with long-term reference pictures and slice-based gradual data refresh
08:58 <arnd> I don't see what that has to do with the user interface, but that may have to do with my missing knowledge of video processing
08:59 <michaedw> where the LTRP is composed of slices retained from various past decoded frames
08:59 <michaedw> so the buffers actually have to be refcounted, and not recycled until they are no longer in use as part of the LTRP
08:59 <arnd> you can either pass one large buffer between subsystems and have them know about the data format within, or you can pass lots of buffers between them
09:00 <arnd> each buffer of course is refcounted
09:00 <arnd> but I would not want to have the complexity in the dma-buffer subsystem to handle sub-allocations
09:00 <arnd> that belongs in the drivers using the buffers
09:00 <michaedw> or you can have a region with a uniform allocation policy, and let the framework slice and dice it into buffers and register them with the components
09:01 <michaedw> "framework" here being GStreamer or the like, in userland
09:02 <arnd> one thing you can do with dma-buffers is to have a user interface that is separate from all the others and only knows how to do allocation based on user space requests
09:02 <arnd> and then pass the buffer into other subsystems
09:02 <arnd> some misc chardev that takes ioctls to get buffers
09:03 <arnd> but some subsystems also really like to be the ones that do the allocation themselves
09:03 <michaedw> that coordinating entity is really the only one that can know that a given region needs to be a) mapped uncacheable write-combining, so that it can be touched efficiently from the CPU; b) in tightly coupled memory, so that there is enough memory bandwidth to/from the video capture block; and c) allocated on a megabyte boundary, so the GPU can treat it like a framebuffer
09:03 <arnd> one of the really nice aspects of the dma-buffer interface is that a consumer (e.g. v4l) doesn't need to care if the buffer was allocated by drm, v4l or a pure allocation driver
09:04 <michaedw> to me, the natural object to represent a shared memory region and its mapping policy is a file descriptor
09:04 <arnd> I absolutely agree, a file descriptor would also be my preferred choice
09:04 <michaedw> easily passed to unprivileged processes over a local domain socket
09:05 <michaedw> already a parameter to mmap()
09:05 <arnd> the other option is a global cookie like in SysV SHM or in GEM
09:06 <arnd> an fd can also support a standard set of ioctls, like GET_ATTRIBUTES
09:06 <arnd> to find out if it's contiguous and/or aligned
09:06 <michaedw> a non-file-descriptor cookie means a whole new set of APIs
09:06 <michaedw> and is a lot harder to inspect with existing extrinsic tools
09:07 <arnd> well, it would probably mean keeping the subsystem specific APIs
09:07 <michaedw> I like being able to grovel around in /proc/*/fd
09:08 <arnd> e.g. DRM, infiniband and V4L all have ways to map buffers into userspace and see a set of attributes that is interesting for the respective subsystem
09:08 <arnd> all different
09:08 <michaedw> so let's fix that
09:08 <arnd> we can't remove the interfaces that already exist, but we can add a new set on top
09:08 <michaedw> fds are also easy to pass back to in-kernel consumers, via a netlink socket
09:09 <michaedw> or rebase the existing interfaces on top of a uniform solution
09:10 <arnd> I wouldn't mention netlink in that context, that is likely to cause negative feelings
09:11 <arnd> but you can certainly pass fds to other ioctls
09:11 <michaedw> well, just because libnl is almost totally undocumented, and its API has been a moving target forever ...
09:11 <michaedw> (it's not that bad, though; I ported libpcap and NetworkManager to the libnl 3.0 API)
09:12 <arnd> mostly because a lot of subsystems already have ioctl interfaces and you don't want to mix too many ways of doing stuff
09:12 <arnd> if you ever want to pass a dma-buffer between v4l and a socket, using netlink is probably appropriate on the socket side
09:13 <michaedw> sure, ioctls on files opened from /dev are a perfectly good way to do things
09:14 <michaedw> and adding an ioctl that converts an allocation fd to a cookie, for use in existing ioctls, is fairly painless
09:16 <michaedw> anyway, I don't want to hijack the good progress that is being made on the in-kernel allocation and DMA infrastructure
09:17 <arnd> michaedw: I think it's good to keep talking about this part as well, I'm not sure we're making enough progress on the user side right now
09:17 <arnd> michaedw: are you aware of https://blueprints.launchpad.net/linaro-graphics-wg/+spec/engr-mm-bo-sharing-1111 ?
09:18 <michaedw> I am just concerned about locking in the existing models for buffer allocation, given the foreseeable twists (like tightly-coupled memory and ping-ponging between GPU and other hardware units)
09:19 <arnd> michaedw: yes
09:19 <michaedw> and I am also conscious of the cost of TLB misses in userland-resident pipeline stages, which is part of why I want to be able to map a hugepage and slice and dice it into buffers
09:19 <arnd> I realized that https://wiki.linaro.org/WorkingGroups/Middleware/Graphics/Specs/1111/engr-mm-bo-sharing is still just the template, it should really list our plans
09:20 <michaedw> arnd: I am now :-)
09:21 <arnd> once we have more content in the spec page, listing huge pages as a requirement is probably a good idea
09:21 <arnd> most users won't need that, but it can be rather important for a few others
09:22 <michaedw> it seems like a natural fit for slice-based processing stages
09:23 <michaedw> and for the sort of non-video-pipeline use cases I try to keep in mind, like image texture assets and priority search trees full of log entries
09:24 <arnd> one way to do that would be making hugetlbfs a provider of dma-buffers
09:24 <michaedw> that sounds sane
09:24 <michaedw> I don't know what the interactions with the IOMMU imply
09:24 <arnd> so you open a hugetlbfs file and pass the fd to the subsystems that you want to interact with the buffer
09:25 <arnd> well, if you have a hugetlbfs file, it's contiguous already, so the iommu doesn't cause problems, and it also doesn't help
09:25 <michaedw> mm, with xattrs to control the allocation and mapping policy?
09:25 <michaedw> I could get to like that
09:25 <arnd> I would stay with ioctls
09:25 <arnd> not every provider is going to have dentries
09:26 <arnd> at least not visible ones
09:26 <michaedw> why not set the attrs on the file, so whoever opens it gets the right attributes?
09:26 <michaedw> that makes the first-class object the inode rather than the fd
09:26 <arnd> because that doesn't work with subsystems that only have an fd interface
09:26 <michaedw> mmm, too bad
09:27 <michaedw> though you could of course still pass the fd once you've opened it
09:27 <arnd> e.g. when you let DRM provide a dma-buffer using an ioctl that returns a file from anon_inode_getfd
09:28 <arnd> I guess it's also possible to make the mapping policy dependent on the provider
09:28 <arnd> in that case an xattr would be appropriate for hugetlbfs, while drm would use some other method
09:28 <michaedw> fair enough; but you could still expose the "generic" provider as an FS
09:28 <arnd> yes
09:29 <arnd> or lots of FSs ;-)
09:29 <michaedw> the privileged process creates the inodes and massages their xattrs
09:29 <arnd> hugetlbfs, CMAfs, ...
09:29 <michaedw> and then moves them into a directory with the right group permissions
09:30 <michaedw> and any process with that group can open one and mmap the fd it gets
09:30 <michaedw> ta-da!  coherent mapping policy
09:30 <michaedw> without passing fds over local domain sockets
09:31 <arnd> yes, that's all possible. I still think though that the most important aspect is to allow multiple providers, each with their own policies
09:31 <michaedw> agreed
09:31 <arnd> the common part is the way that the handle looks (fd) and what operations are possible on it in the kernel and in userland
09:32 <michaedw> yes; and it's immaterial whether, say, tightly coupled memory is its own provider or an xattr in the generic provider
09:33 <arnd> yes
09:34 <michaedw> I like that just fine; make it so ;-)

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!