Re: [PATCH v5] mm/gup: disallow GUP writing to file-backed mappings by default

From: Jason Gunthorpe
Date: Mon May 01 2023 - 08:40:27 EST


On Mon, May 01, 2023 at 05:27:18PM +1000, Dave Chinner wrote:
> On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote:
> > On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote:
> >
> > > In any case, the file system maintainers' position (mine and I doubt
> > > Dave Chinner's position has changed) is that if you write to
> > > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
> > > silent data corruption, you get to keep both pieces, and don't go
> > > looking for us for anything other than sympathy...
> >
> > This alone is enough reason to block it. I'm tired of this round and
> > round and I think we should just say enough, the mm will work to
> > enforce this view point. Files can only be written through PTEs.
>
> It has to be at least 5 years ago now that we were told that the
> next-gen RDMA hardware would be able to trigger hardware page faults
> when remote systems dirtied local pages. This would enable
> ->page_mkwrite to be run on file-backed mapped pages just like
> local CPU write faults and everything would be fine.

Things are progressing, but I'm not as optimistic as I once was..

- Today mlx5 has ODP, which allows this to work using hmm_range_fault()
  techniques (a minimal sketch of the pattern is below). I know of at
  least one deployment using this with a DAX configuration, and the
  technology is now at least 5 years old. The downside is that HMM
  approaches yield poor worst-case performance and have weird caching
  corner cases. It is also still only one vendor; in the past 5 years
  nobody else has stepped up to implement it.

- Intel Sapphire Rapids chips have ATS/PRI support and we are doing
  research on integrating mlx5 with that. In Linux this is called
  "IOMMU SVA" (a sketch of the driver-side bind is below).

  However, performance is wonky: the best case is worse than ODP's,
  though it removes ODP's worst-case corners. It also makes the entire
  MM notably slower for processes that turn it on. Who knows when, or
  for what, this will turn out to be useful.

- Full cache coherence with CXL. CXL has taken a long time to really
  reach the mainstream market - maybe the next gen of server CPUs. I'm
  not aware of anyone doing work here in the RDMA space; it is
  difficult to see the benefit. This does seem likely to be very
  popular in the GPU space; I already see some products announced.
  This is a big topic on its own for FSs..
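
For reference, here is a minimal sketch of the hmm_range_fault()
pattern mentioned above (modeled on Documentation/mm/hmm.rst; "odp",
"pfns", "addr", "size" and "mm" are placeholders, and the driver lock
and device page table update are hand-waved):

  struct hmm_range range = {
          .notifier = &odp->notifier,     /* mmu_interval_notifier */
          .start = addr,
          .end = addr + size,
          .hmm_pfns = pfns,
          .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
  };
  int ret;

  again:
  range.notifier_seq = mmu_interval_read_begin(range.notifier);
  mmap_read_lock(mm);
  ret = hmm_range_fault(&range);        /* faults pages, fills pfns[] */
  mmap_read_unlock(mm);
  if (ret) {
          if (ret == -EBUSY)            /* raced with an invalidation */
                  goto again;
          return ret;
  }

  mutex_lock(&odp->lock);               /* driver lock, placeholder */
  if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
          mutex_unlock(&odp->lock);
          goto again;
  }
  /* program the device page table from pfns[] here */
  mutex_unlock(&odp->lock);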
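
And the driver-side SVA bind is roughly this (a sketch only; the exact
iommu_sva_bind_device() signature has varied across kernel versions,
and "dev" is a placeholder):

  struct iommu_sva *handle;
  u32 pasid;

  if (iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA))
          return -ENODEV;

  handle = iommu_sva_bind_device(dev, current->mm);
  if (IS_ERR(handle))
          return PTR_ERR(handle);
  pasid = iommu_sva_get_pasid(handle);

  /*
   * DMA tagged with this PASID is translated through the CPU page
   * tables; faults are resolved via ATS/PRI instead of pinning.
   */

  iommu_sva_unbind_device(handle);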

So, basically, you can make it work on the most popular HW, but at the
cost of top performance, which makes it unpopular.

I don't expect anything on the horizon to substantially change this
calculus; the latency cost of doing ATS-like things is an inherent
performance penalty that can't be overcome.

Jason