In this deck from the 2016 Stanford HPC Conference, Robert Roy from Seagate Technologies presents: Debugging Slow Buffered Reads to the Lustre File System.
"Buffered read performance under Lustre has been inexplicably slow when compared to writes or even direct IO reads. A balanced FDR-based Object Storage Server can easily saturate the network or backend disk storage using o_direct based IO. However, buffered IO reads remain at 80% of write bandwidth. In this presentation we will characterize the problem, discuss how it was debugged and proposed resolution. The format will be a presentation followed by Q&A."
Learn more: http://seagate.com
See more talks from the Stanford HPC Conference: http://insidehpc.com/2016-stanford-hpc-conference-video-gallery/
2.
The Problem
Direct IO reads are better than buffered IO
Seagate CS9000 with 4M RPCs
Reads Buffered ~3.5 GB/s per OST
Reads o_direct ~4.5 GB/s per OST
Writes Buffered ~4.5 GB/s per OST
More clients do not produce more bandwidth
Suggests server side
Data path on the server side is the same for o_direct and buffered IO
Suggests client side
Buffered IO uses the page cache, which is populated by readahead
Client side readahead is suspect
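As a quick check of the figures on this slide, the sketch below expresses buffered read bandwidth as a percentage of write bandwidth. The helper name `sketch_percent_of` and the constant names are hypothetical and used only for illustration; the numbers are the per-OST values quoted above.

```c
#include <assert.h>

/* Per-OST bandwidth figures quoted on the slide, in GB/s. */
static const double buffered_read_gbs  = 3.5;
static const double direct_read_gbs    = 4.5;
static const double buffered_write_gbs = 4.5;

/* Hypothetical helper: express one bandwidth as a percentage of another. */
static double sketch_percent_of(double value, double reference)
{
    assert(reference > 0.0);   /* avoid dividing by zero */
    return 100.0 * value / reference;
}
```

With these inputs, `sketch_percent_of(buffered_read_gbs, buffered_write_gbs)` comes out at roughly 78%, in line with the "80% of write bandwidth" stated in the abstract, while o_direct reads match writes.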
5.
The Source of the Problem
And right above that line…
/lustre/llite/rw.c
#define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
/* RAS_INCREASE_STEP should be (1UL << (inode->i_blkbits - PAGE_CACHE_SHIFT)).
 * Temporarily set RAS_INCREASE_STEP to 1MB. After 4MB RPC is enabled
 * by default, this should be adjusted corresponding with max_read_ahead_mb
 * and max_read_ahead_per_file_mb otherwise the readahead budget can be used
 * up quickly which will affect read performance significantly. See LU-2816 */
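To make the macro concrete, here is a minimal standalone sketch of the arithmetic it performs. It is not Lustre code: it assumes 4 KiB pages (a PAGE_CACHE_SHIFT of 12) and a ONE_MB_BRW_SIZE of 1 MiB, and the `SKETCH_`/`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Assumed values for illustration only; the real definitions come from
 * the kernel and Lustre headers. */
#define SKETCH_PAGE_CACHE_SHIFT 12u           /* assumes 4 KiB pages */
#define SKETCH_ONE_MB_BRW_SIZE  (1UL << 20)   /* assumes 1 MiB */

/* Hypothetical helper with the same shape as RAS_INCREASE_STEP:
 * convert a size in bytes into a count of page-cache pages. */
static unsigned long sketch_increase_step_pages(unsigned long brw_bytes,
                                                unsigned int page_shift)
{
    /* Shifting by the full width of the type is undefined in C. */
    assert(page_shift < sizeof(unsigned long) * CHAR_BIT);
    return brw_bytes >> page_shift;
}
```

Under these assumptions, `sketch_increase_step_pages(SKETCH_ONE_MB_BRW_SIZE, SKETCH_PAGE_CACHE_SHIFT)` yields 256 pages, so the readahead window grows by 1 MiB at a time regardless of the RPC size in use.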
6.
The Solution
Set the increase step to the same value as the RPC size
< #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
> #define RAS_INCREASE_STEP(inode) (PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)
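The effect of the one-line change can be sketched with plain arithmetic. The example below assumes 4 KiB pages and a PTLRPC_MAX_BRW_SIZE of 4 MiB, matching the 4M RPCs used in this deck; the real value depends on the Lustre build. The `SKETCH_`/`sketch_` names are hypothetical, and the code does not model Lustre's actual readahead state machine.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_CACHE_SHIFT    12u          /* assumes 4 KiB pages */
#define SKETCH_ONE_MB_BRW_SIZE     (1UL << 20)  /* old step: 1 MiB */
#define SKETCH_PTLRPC_MAX_BRW_SIZE (4UL << 20)  /* assumed 4 MiB, as in 4M RPCs */

/* Pages added to the readahead window per increase, for a given step size. */
static unsigned long sketch_step_pages(unsigned long step_bytes)
{
    return step_bytes >> SKETCH_PAGE_CACHE_SHIFT;
}

/* Number of window increases needed to span one RPC, rounded up. */
static unsigned long sketch_increases_per_rpc(unsigned long rpc_bytes,
                                              unsigned long step_bytes)
{
    assert(step_bytes > 0);   /* avoid dividing by zero */
    return (rpc_bytes + step_bytes - 1) / step_bytes;
}
```

With these assumed sizes, the old step is 256 pages and takes four increases to span a single 4 MiB RPC, while the new step is 1024 pages and spans it in one.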
9.
Conclusion and More Information
Buffered reads can be improved significantly when 4M RPCs are in use
Seagate implemented a parameter to address the issue
lctl set_param -n llite.*.read_ahead_step 4
https://github.com/Xyratex/lustre-stable/commit/2395f8e0e7e963aec43deb07d719e9229884758c
LU-7140 tracks the upstream work
https://jira.hpdd.intel.com/browse/LU-7140