
Tuning Lustre* Systems Training





Intel® Lustre* system and network administration
Lustre tuning
11.2016

Lustre Tuning

Lustre has many options for tuning the file system.
Some tuning is done at file system creation:
• Creating and defining an external journal
• Setting the stride and stripe-width
Some is performed after file system creation.
Tuning is often a trade-off between performance and stability.
Tuning is an iterative process:
• Benchmark, tune in one direction, retest, tune further/back based on results
• Repeat for the next tuning option
It can get very complex very quickly, so:
• Always best to start with the "low hanging fruit"

Linux IO Scheduler

IO scheduler (elevator) comparison (one output line per block device):

  [root@st02-oss1 ~]# uname -r
  2.6.32-358.11.1.el6_lustre.x86_64
  [root@st02-oss1 ~]# cat /sys/block/sd*/queue/scheduler
  noop anticipatory [deadline] cfq
  noop anticipatory [deadline] cfq
  noop anticipatory [deadline] cfq
  noop anticipatory [deadline] cfq

The default kernel elevator (CFQ) is wrong for Lustre.
Ensure use of the deadline IO scheduler:
• The Lustre patched kernel already uses the "deadline" scheduler by default

Linux – Other non-Lustre params

Linux network stack (just some of the options):
• TCP read/write buffers (default): net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
• TCP read/write buffers (maximum): net.core.rmem_max / net.core.wmem_max
• Queue length (maximum):
  • Receive: net.core.netdev_max_backlog
  • Transmit: txqueuelen
• Flow control, TCP window scaling, etc.
Linux kernel memory: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
Example parameters (check with your network vendors):
• net.ipv4.tcp_timestamps=1
• net.ipv4.tcp_low_latency=1
• net.core.rmem_max=4194304
• net.core.wmem_max=4194304
• net.core.rmem_default=4194304
• net.core.wmem_default=4194304
• net.core.optmem_max=4194304
• net.ipv4.tcp_rmem=4096 87380 4194304
• net.ipv4.tcp_wmem=4096 65536 4194304
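A minimal sketch for persisting the example parameters above across reboots; the values are only the examples from this slide, so confirm them with your network vendor before deploying:

  # cat >> /etc/sysctl.conf <<'EOF'
  # Example Lustre network tuning -- validate for your fabric
  net.ipv4.tcp_timestamps = 1
  net.ipv4.tcp_low_latency = 1
  net.core.rmem_max = 4194304
  net.core.wmem_max = 4194304
  net.core.rmem_default = 4194304
  net.core.wmem_default = 4194304
  net.core.optmem_max = 4194304
  net.ipv4.tcp_rmem = 4096 87380 4194304
  net.ipv4.tcp_wmem = 4096 65536 4194304
  EOF
  # sysctl -p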
Linux – tuned to simplify

tuned is a daemon that monitors the use of system components and dynamically tunes system settings based on that monitoring information. Dynamic tuning accounts for the way various system components are used differently throughout the uptime of any given system.
To install tuned on RHEL/CentOS 6.x:
  yum install tuned-utils.noarch tuned.noarch
To list all the profiles available:
  tuned-adm list
To set a profile:
  tuned-adm profile throughput-performance
To verify which profile is active:
  tuned-adm active

What is in the throughput-performance profile?

  set_cpu_governor performance
  set_transparent_hugepages always
  Scheduler => deadline
  kernel.sched_min_granularity_ns = 10000000
  kernel.sched_wakeup_granularity_ns = 15000000
  vm.dirty_ratio = 40
  tuned monitor = off

The CPUfreq governor "performance" sets the CPU statically to the highest frequency within the borders of scaling_min_freq and scaling_max_freq.

MDS Tuning – IOPS, RAM

Storage with high-IOPS characteristics is best:
• High-IOPS storage configured as a RAID 10 array and/or SSD
Take advantage of Linux write-through caching:
• Amount of RAM greater than the size of the MDT
• Cache all the data from the MDT, or as much of the working set as you can afford
Improved MDS SMP performance in Lustre 2.3:
• Lustre's metadata code is still CPU bound; use high-frequency Intel CPUs
Distributed Namespace allows for multiple MDTs in each Lustre file system:
• A single metadata target can be a bottleneck
• Remote directory implementation in Lustre 2.4
• Directories striped across MDTs in Lustre 2.7
• Recommended maximum of 4 MDTs for each MDS (a balance of increased IO against CPU and memory contention)

OSS Tuning – Throw Hardware At It

• OSS backplane
• Type and placement of HBAs (NUMA)
• Number of CPU cores
  • Improved LNET/OSS SMP performance in Lustre 2.3
• CPU / thread ratio
OSS bandwidth is often limited by:
• Speed of the network (with modern storage arrays)
• Speed of the backend storage
• Speed of the I/O controller(s) (or PCIe slot)
  • Increasing the number of controllers can help offset this bottleneck

Disks: SATA / Near Line SAS

SATA – enterprise vs. consumer grade:
• Be aware of the many differences; use only enterprise grade
NL-SAS is enterprise SATA with a SAS interface, and more:
• Higher-speed interface, longer cabling, better features, etc.
• Similar cost versus SATA effectively makes plain SATA a bad choice
SATA / NL-SAS is not recommended for the MDT:
• MDTs need more IOPS than these disks can provide
• If you insist, an external journal is strongly advised:
  # mke2fs -O journal_dev -b 4096 /dev/sdf
  # mkfs.lustre --mkfsoptions="-j -J device=/dev/sdf" --ost /dev/sda
When SATA / NL-SAS is used in OSTs for large-block sequential IO:
• Price/performance compared to SAS disks is excellent

Disks: SSD and fast Storage Array

SSD – many options available:
• Intel (and other vendors) provide several models specialized for different workloads
• For the MDT, Lustre needs high IOPS for reads AND writes, plus low latency
• Verify the QoS and the endurance
• With ldiskfs as the MDT backend, use a larger journal (e.g., 2 GB) to avoid being IOPS-bound
Fast storage array:
• Modern, very fast storage arrays need a larger journal (2 GB+) than the default
• The design of the chain from the HBA to the storage backplane is very important, especially in NUMA servers
• sgpdd-survey and obdfilter-survey can help identify bottlenecks

Asynchronous Journal on OST

Prior to 1.8, all OST I/O was synchronous:
• When the OST sent a commit to the client, all data was on disk
• Required forcing a flush after every bulk write
The option for async journaling was added in 1.8:
• Block data is still written synchronously
• OST journal transactions are written asynchronously
• A reduced number of small journal I/Os means better performance
• A single client can push many IOs and get "commits" faster, as the journal entries remain in cache
From 2.x on, async journal is enabled by default.
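A quick way to inspect this on a live OSS, assuming the obdfilter sync_journal tunable described in the Lustre manual (0 = asynchronous journal commits, 1 = synchronous):

  # Check whether async journal commits are enabled (0 = async)
  oss# lctl get_param obdfilter.*.sync_journal
  # Fall back to synchronous journaling if the backend misbehaves under async
  oss# lctl set_param obdfilter.*.sync_journal=1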
Tuning LDISKFS backend file systems – OST

Lustre "tries" to aggregate all I/O into 1 MB increments.
Writes should be aligned in order to avoid the read-modify-write penalty:
• mballoc tries to locate aligned, contiguous disk blocks
• ext-based file systems try to locate all the blocks of a file within one block group
1 MB reads/writes are aligned via stride and stripe-width:
• Block size is 4 KB
• stride equals the number of blocks written per disk before writing the next stride to the next disk
• stripe-width equals the number of blocks per full stripe
• Note: these two parameters are exposed by mke2fs
Ensure that "stripe-width" equals "stride" times the number of "data" disks in the RAID set.
Example: a 10-disk RAID6 array (equivalent of 8 data disks, 2 parity) using 4 KB blocks:
• Stride = 1024 KB / 8 disks = 128 KB per disk, and 128 KB / 4 KB per block = 32 blocks
• Stripe-width (in blocks) = 1024 KB / 4 KB = 256
• Check: 256 = 32 × 8, thus we are good to go:
  # mkfs.lustre --mkfsoptions="-E stride=32,stripe_width=256" --ost --mgsnode=192.168.0.22@tcp /dev/sda1

OST inodes and LUN settings

Blocks-per-inode ratio:
• The ext* default is to create one (1) inode for every 16 KB
• Appropriate for enterprise workloads, not ideal for typical HPC apps
• Instead, set the blocks-per-inode ratio higher
• One inode per 256/512 KB may be more appropriate for large-block sequential IO
• Set with --mkfsoptions="..." when formatting OSTs
Configure LUNs for performance:
• Write-through caching
• Read-ahead
• Max sectors / KB

OST Striping

Recall that speed comes from parallel IO:
• Accomplished by striping files across OSTs
Striping files across more OSTs:
• An "obvious" way to improve performance, typically
Maximum stripe count was originally limited to 160 OSTs:
• A restriction caused by the size of the EA in MDT inodes
Lustre 2.2+ supports wide striping (has to be enabled at format time):
• Maximum of 2000 OSTs per file

Expanding the file system – indirect tuning

• Simple (software-wise) to add more OSTs
• Format and mount the new OSTs
• Clients automatically learn of and use the new OSTs
  • Will support larger stripe counts for files
• Ideally, rebalance after mounting
  • Any OSTs that were previously nearly full will have faster IO

Software RAID using MDRAID

Software RAID using MDRAID on OSS servers is not advised.
Benefits include:
• Lower capital cost to purchase and upgrade
• Higher "potential" maximum performance, and fewer hardware components to fail
• Vendor agnostic; mix and match drive types (to a degree)
Downsides include:
• Usually more complicated to manage
• Recovery/rebuild is typically slower/longer
• Uses host CPUs to perform RAID calculations; excessive use of CPU for storage management affects overall IO
If you still insist (see the sketch below):
• RAID-6 for better reliability
• Ensure that you specify the proper chunk size (with -c) when creating the array
Consider ZFS as an alternative.
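A minimal sketch of an aligned MDRAID layout, under the assumption that the device names are illustrative and your controller has no other alignment constraints. A 128 KB chunk on 8 data disks matches the stride=32 / stripe_width=256 example above:

  # 10-disk RAID-6: 8 data + 2 parity, 128 KB chunk (-c/--chunk is in KB)
  # mdadm --create /dev/md0 --level=6 --raid-devices=10 -c 128 /dev/sd[b-k]
  # mkfs.lustre --mkfsoptions="-E stride=32,stripe_width=256" --ost \
      --mgsnode=192.168.0.22@tcp /dev/md0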
Application / Site specific Lustre tuning

OSS – Service Threads

Three different types of OSS threads:
• ost_creat (actually ll_ost_creat_XX, where XX is the OST index): used for statfs and object creation
• ost (also ll_ost_XX) and ost_io (likewise): all other operations (read/write, truncate, setattr, etc.)
The correct setting depends on:
• Speed of the storage (hardware, running synchronous, etc.)
• Number of OSTs exported from the OSS
• Capacity of the server
• Workload from clients

OSS – Service Threads

Two ways to manage service threads:
1. Set the initial number of threads when the module loads:
  • For ost_creat threads:
    options oss oss_num_create_threads=8
  • For ost and ost_io threads:
    options oss oss_num_threads=64
  • Minimum initial thread count is 2 (max is 512)
2. Set the max (and min) number of threads:
  • Lustre starts more service threads as needed; threads increase up to 4x the minimum, or to the maximum (never > 512)
  • Example: set the max ost_io threads:
    # lctl set_param ost.OSS.ost_io.threads_max=128
  • Determine how many threads have been started:
    # lctl get_param ost.OSS.ost_io.threads_started

OSS Threads – Tuning Guidelines

Increase the number of threads if:
• Several OSTs are exported from one OSS
• Back-end storage is running in synchronous mode (commits are not cached)
• I/O completions take excessive time due to slow storage
Decrease the number of threads if:
• Storage is being overwhelmed
• There are "slow I/O" or watchdog messages on clients
• The OSS appears to be resource constrained (e.g., CPU load or RAM utilization is excessively high)
Additional information:
• Thread tuning is applied similarly for the MDS
• MDS/OSS SMP thread affinity is supported in 2.3+
  • Threads are more likely to access "hot" caches

OSS Read Cache – LDISKFS only

Linux provides (for "free") read caching:
• Data is cached in unused memory
• However, this data is frequently overwritten, as there is more IO than memory available for caching
Lustre provides an OSS read cache tunable:
• Ideally, we would want to cache all files (the default)
• However, this is not practical:
  • Not enough memory available
  • Results in too much thrashing inside the cache
• Instead, the setting can be used to cache only "small" files
First, define the max size for a small file; next, instruct the OSS to cache those files:
  # lctl conf_param obdfilter.*.readcache_max_filesize=5

MDS – Service Threads

These work similarly to the OSS service threads.
The MDS threads are:
• mds
• mds_readpage
• mds_setattr
Define the thread count when loading the module:
  options mds mds_num_threads=XX
...or let Lustre auto-tune the thread count:
• Setting min/max and getting the number started applies here also:
  lctl {get,set}_param {service}.thread_{min,max,started}
See the Lustre manual for details.
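A minimal /etc/modprobe.d sketch combining the module options above; the file name and the mds_num_threads value are illustrative, while the option names are the ones shown on these slides:

  # cat /etc/modprobe.d/lustre-threads.conf
  # OSS: initial service thread counts, applied at module load
  options oss oss_num_create_threads=8
  options oss oss_num_threads=64
  # MDS: initial service thread count (XX above; 32 here as an example)
  options mds mds_num_threads=32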
MDS – Caching

As presented earlier, fully caching OST data is not practical.
However, it can be feasible to cache most or all of an MDT:
• Accomplished by utilizing Linux's native read cache
• The only configuration requirement is lots of RAM

Client Caching – Inactive Data

Lustre provides caching similar to Linux.
Read the existing setting:
  lctl get_param llite.*.max_cached_mb
Tunable to set the amount of inactive data cached:
  lctl set_param llite.*.max_cached_mb=512
• ¾ of RAM is the default setting

Client Caching – Active Data

Performance gains from caching some dirty data:
• Clients performing small IO need not transmit every write
If caching too much dirty data:
• Events that force a cache flush may block the client thread
• Also, more risk if the system experiences an interruption
Dirty cache size is set per OST:
• Set in /proc/fs/lustre/osc/<target>/max_dirty_mb
• Default is 32 MB, max is 1024 MB
  lctl get_param osc.*.max_dirty_mb
Dirty cache is backed by "granted" space on the OSSs:
• Ensures I/O completion

Client Tuning – Read-ahead

Lustre maintains a read-ahead value:
• The value starts at 1 MB and increments linearly
• The value increases when there are 2 sequential Linux buffer cache misses
• The value increases up to the defined maximum:
  /proc/fs/lustre/llite/<fsname>/max_read_ahead_mb
• Default maximum is 40 MB
• Disabled when set to 0
  # lctl get_param llite.*.max_read_ahead_mb
• Non-sequential IO sets the read-ahead value back to 1 MB

Client Tuning - Readahead

Read an entire "small" file into cache on first access:
  client# lctl set_param llite.*.max_read_ahead_mb=10
  client# lctl set_param llite.*.max_read_ahead_per_file_mb=6
  client# lctl set_param llite.*.max_read_ahead_whole_mb=5.5
Cache only files < 6 MB on the OSS, to avoid cache thrashing:
  oss# lctl set_param obdfilter.*.readcache_max_filesize=6M

Client Tuning - Statahead

Max number of directory entries to be pre-cached when a directory is stat'd:
  /proc/fs/lustre/llite/<fsname>/statahead_max
  lctl get_param llite.*.statahead_max
Read-only variable showing the current status:
  /proc/fs/lustre/llite/<fsname>/statahead_status
  lctl get_param llite.*.statahead_status

Client Tunables - Networking

Goal: keep the network pipe full, but not overloaded.
Control how much data each client can send:
• Maximum number of RPCs that the OST can have pending per client
• Eight (8) is often considered optimal
• May be increased on faster networks
• May be increased with larger OST stripe counts
The maximum number of RPCs for the MDT is 1:
• To disable this limit:
  # lctl set_param fail_loc=0x804
• No recovery is possible (be careful!)
• Only for benchmarking

Client Tunables – Networking

Optimal settings depend on the client / network.
Max number of 4K pages per RPC:
• Ideally 256, i.e. 1 MB per RPC
  /proc/fs/lustre/osc/<target>/max_pages_per_rpc
  lctl get_param osc.*.max_pages_per_rpc
Max RPCs in flight between an OSC and an OST:
• Range is 1-256
  /proc/fs/lustre/osc/<target>/max_rpcs_in_flight
  lctl set_param osc.*.max_rpcs_in_flight=256
Also consider LNet credits and LND peer credits.

Client Tunables – All available

How to list tunables without ls /proc:
  lctl get_param -NF osc.*.*
  lctl get_param -NF llite.*.*
Client import state:
  lctl get_param osc.*.import
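A short sketch that snapshots the client-side tunables discussed above into one file, which is handy as a "before" record for the iterative tuning loop (the output path is illustrative):

  client# for p in llite.*.max_cached_mb llite.*.max_read_ahead_mb \
           llite.*.max_read_ahead_per_file_mb llite.*.max_read_ahead_whole_mb \
           llite.*.statahead_max osc.*.max_dirty_mb \
           osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight; do
      lctl get_param "$p"
  done > /tmp/client-tunables-$(date +%F).txt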
Tuning Scenario - #1

Clients writing large, sequential block I/O:
• This is Lustre's sweet spot
• Is the system designed, built and functioning properly?
OSS tunables:
• Disable the read cache:
  lctl set_param obdfilter.*.read_cache_enable=0
• zone_reclaim_mode = 1 (on NUMA only)
• swappiness = 10

Tuning Scenario - #2 (1/2)

Clients writing random or small-block IO; this is the worst-case IO scenario for Lustre.
Determine which application is performing this IO:
• Use the brw_stats, rpc_stats and extents files from /proc
• Verify using:
  # strace -T -ttt -p <pid>
• strace (with those args) will show the IO size
See if the developer will optimize the IO components.
Increase the write-back cache:
• This allows the per-OST IO to build up on the client
• Fewer I/Os to the servers, less waiting, more processing
• The tunable is /proc/fs/lustre/osc/*/max_dirty_mb

Tuning Scenario - #2 (2/2)

OSS: optimally set readcache_max_filesize to cache small files:
  lctl set_param obdfilter.*.readcache_max_filesize=6M
Clients: optimally set read-ahead:
  lctl set_param llite.*.max_read_ahead_whole_mb=5.5
  lctl set_param llite.*.max_read_ahead_per_file_mb=10
Consider increasing the RPCs in flight.
Random IO to one large file?
• Maximize the stripe count
• The file will get IOPS from all OSTs
Random IO to many small files?
• Set the stripe count to 1
• Spread the distribution of files across many/all OSTs
• This scenario will still have large MDS overhead
If this is the case with many/all apps, configure the OST storage as RAID-10 (more IOPS on RAID-10 versus RAID-6) or use SSDs.

Tuning Scenario - #3

Clients waiting on IO:
• Seen as many entries in the highest "RPCs in flight" bucket (#8) of rpc_stats:
  lctl get_param osc.*.rpc_stats
If significant, increase the number of RPCs in flight:
  /proc/fs/lustre/osc/<target>/max_rpcs_in_flight
  lctl set_param osc.*.max_rpcs_in_flight=<value>
How far?
• Until the highest RPC bucket is much less used
• Check the peers, and also increase LNet credits and LNet peer credits
• Use an iterative approach when increasing

Tuning Scenario - #4

Backend disk I/O is slow:
• Seen as applications stalling, waiting for IO
Increase the number of OSS threads:
• Initial thread count
• Maximum thread count
How far?
• Again, use the iterative approach
• A good starting point is 32 threads per OST

Tuning Scenario - #5 (1/2)

High-bandwidth / high-latency link:
• Possibly a WAN or MAN
• See: http://www.kehlet.cx/articles/99.html
The problem is latency, so increase:
• The client read-ahead cache:
  /proc/fs/lustre/llite/<fsname>/max_read_ahead_mb
  /proc/fs/lustre/llite/<fsname>/max_read_ahead_whole_mb
• The write-behind cache:
  /proc/fs/lustre/osc/<target>/max_dirty_mb

Tuning Scenario - #5 (2/2)

• Increase max RPCs in flight
• Increase LNet credits to match the RPCs in flight
• Also increase LND peer_credits
If using o2iblnd:
• Set concurrent_sends manually
• o2iblnd may or may not determine a good value based on peer_credits
If using socklnd:
• Increase the TCP send buffer
• Increase the Tx/Rx window sizes

Tuning Scenario - #6

Intel True Scale InfiniBand card:
• The verbs implementation is on-load (CPU bound)
• Disable HyperThreading
• Set the CPU governor to performance
• QIB options:
  options ib_qib singleport=1 pcie_caps=0x51 krcvqs=4 rcvhdrcnt=4096
Recommended LNet IB configuration parameters for Intel True Scale InfiniBand:
  options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 \
    concurrent_sends=256 ntx=2048 map_on_demand=32 \
    fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

Tuning Scenario - #7

Intel SSD disks:
• Verify the configuration with FIO prior to use with Lustre (see the sketch below)
• Avoid any write cache (read cache is disabled on Intel SSDs):
  hdparm -W0 /dev/sdb
• Increase the journal to 2 GB during Lustre's format:
  --mkfsoptions="-J size=2048"
• Scheduler: deadline
• Verify endurance:
  smartctl -a /dev/sda | grep 233
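One possible FIO job for the "verify with FIO" step above, exercising the small random writes an MDT sees; this is a destructive raw-device test, and the device name, queue depth and runtime are illustrative:

  # 4 KiB direct random writes -- destroys data on /dev/sdb
  # fio --name=ssd-verify --filename=/dev/sdb --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
        --runtime=60 --time_based --group_reporting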
Legal Information

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.

Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation