ZFS and MySQL at Salesforce

Welcome!
David Peterson "NoPantsPeterson"
Lead Systems Engineer
[email protected]
@geekhead7

ZFS/MySQL Stats
Slide missing due to PR/legal reasons

Our Hardware
Slide missing due to PR/legal reasons

Hardware Roadblocks
•  SSD not big enough to store our MySQL dataset
•  2 SSD drives != the performance we need vs. the reliability we need (RAID 1 vs. RAID 0)
•  Using SATA spindles for all MySQL data storage
•  Slow SATA spindle disks for MySQL = the reason we looked at ZFS

Why ZFS?
•  Throttles writes to avoid issuing so many to disks concurrently that they start behaving poorly; performs write coalescing
•  ZIL/SLOG: the ability to put sync writes on the SSDs = lower write latency = a happier MySQL host
•  Compression
•  ARC/L2ARC with MFU and MRU is much more intelligent than your Linux page cache (LRU)
•  Block size matching
•  Data integrity and resiliency
•  Kernel version independent. DKMS for the win!
•  Snapshots
•  Active development community

ZFS Cons
•  Fragmentation, but this is true for any CoW system
•  Simple to set up but complex to tweak and tune
•  Likes lots of RAM, but doesn't necessarily require lots of it
•  Requires extra free space or pool performance can suffer. Big SATA disks are cheap!

Hardware/ZFS Setup

ZFS ZIL/SLOG — Synchronous Writes

ZFS/Kernel — What versions we currently run
•  ZFS v0.6.5.2 and v0.6.5.3, with plans to get to v0.6.5.8
   •  http://zfsonlinux.org/
•  Kernel 3.10.x
   •  http://elrepo.org/
   •  Do not use the stock kernel with CentOS 6 / RHEL 6 (2.6.x)

ZFS Myths

Myth #1: Don't run on hardware RAID
•  ZFS can still detect all data corruption issues, it just can't fix them. We're OK with this
•  The write hole is extremely rare with modern controllers, now that write caches are protected by proper BBUs, capacitors, and NVRAM/flash

Myth #2: ZoL is unreliable
•  After almost two years, no data loss or corruption caused by ZoL
•  Hard resets were required due to a controller firmware bug; no data corruption resulted

Myth #3: ZoL is slow
•  Our numbers speak for themselves
•  Yes, CoW takes a perf hit, but with proper tuning and setup ZFS can perform very well
•  Requires properly understanding how ZFS works and your workload's IO patterns
•  Great for RDBMS systems like MySQL running on commodity hardware that make lots of fsync() calls

Myth #4: Don't use ZFS caching for InnoDB
•  Only true if your MySQL working data set can fit in the allocated InnoDB buffer pool
•  Our data set is much larger than our available memory

ZFS Configurations

  Parameter             Value         Default
  zfs_arc_max           3435973836    ½ RAM = 64GB
  zfs_nocacheflush      1             0
  zfs_prefetch_disable  1             0
  zil_slog_limit        104857600     1048576 (1MB)
  zfs_txg_timeout       60            5
  zfs_txg_history       100           0

•  Limit the ARC size. By default it is dynamic, but it is more efficient to limit it
•  zil_slog_limit => if the current ZIL commit is over this setting, or the current total ZIL log size is over twice this, the ZIL commit is not written to your SLOG device but instead written into your main pool. The default of 1MB is way too low for a very busy MySQL host
•  txg history location: /proc/spl/kstat/zfs/${pool}/txgs

ZFS Configurations

  Parameter                                      Value      Default
  zfs_dirty_data_sync                            536870912  67108864
  zfs_vdev_sync_write_min_active                 16         10
  zfs_vdev_sync_write_max_active                 20         10
  zfs_vdev_sync_read_min_active                  16         10
  zfs_vdev_sync_read_max_active                  20         10
  zfs_vdev_async_write_min_active                3          1
  zfs_vdev_async_write_active_min_dirty_percent  5          30

ZFS Configurations — Observations for async operations
•  The default value of 30 for zfs_vdev_async_write_active_min_dirty_percent is too high for 128GB of RAM: it amounts to ~4GB of dirty data, since zfs_dirty_data_max (10% of RAM) gets set to ~13GB
•  The rate of async writes is enough to cause long transactions, but not enough dirty memory to raise the cap on the number of async write operations per vdev
•  zfs_vdev_async_write_active_min_dirty_percent (5%) × zfs_dirty_data_max (13GB) = 650MB

ZFS Configurations — Observations for sync operations (ZIL/SLOG)
•  Two variables control when data in the ZIL (technically in RAM) gets flushed back to your main pool:
   •  zfs_txg_timeout => default is 5s
   •  zfs_dirty_data_sync => default is 64MB
•  The zfs_txg_timeout parameter only takes effect when:
   1.  The amount of time gone by is above the zfs_txg_timeout value, AND
   2.  The amount of dirty data does not go above the zfs_dirty_data_sync value
•  zfs_dirty_data_sync will initiate a txg flush when dirty data goes above this value, regardless of the zfs_txg_timeout value
•  The default value of 64MB is too low for busy MySQL hosts and introduced increased latency on our main pool
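A hedged sketch of how tunables like these can be applied (illustrative, not the exact tooling from the talk): ZoL exposes each parameter under /sys/module/zfs/parameters for runtime changes, and an "options zfs" line in modprobe.d makes them persistent — the same mechanism this deck uses later for the spl taskq setting:

  # Runtime: takes effect immediately, lost on reboot
  echo 60        > /sys/module/zfs/parameters/zfs_txg_timeout
  echo 536870912 > /sys/module/zfs/parameters/zfs_dirty_data_sync

  # Persistent: read when the zfs module loads
  echo "options zfs zfs_arc_max=3435973836 zfs_nocacheflush=1 zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf
  echo "options zfs zil_slog_limit=104857600 zfs_txg_timeout=60 zfs_txg_history=100 zfs_dirty_data_sync=536870912" >> /etc/modprobe.d/zfs.conf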
ZFS Configurations — Pool Settings
•  Use ashift=12 when creating your pool
•  atime=off
•  compression=lz4
   •  LZ4 gives great compression ratios with minimal CPU cost

ZFS Configurations — MySQL Filesystems

  zfs create data/mysql
  zfs create data/mysql-log
  zfs create data/mysql-tmp
  zfs set recordsize=16K data/mysql
  zfs set mountpoint=/var/lib/mysql data/mysql
  zfs set mountpoint=/var/log/mysql data/mysql-log
  zfs set mountpoint=/var/lib/mysql/tmp data/mysql-tmp

Pro tip: the biggest boost in performance can be obtained by matching the ZFS record size to the size of the IO. An InnoDB page is 16KB in size.
•  Prevents the read-modify-write penalty
•  Reads only the data you want and nothing more
•  Keep the 128K (default) recordsize for the log/binlog filesystems

Note: don't set the ZFS MySQL data filesystem to "logbias=throughput", despite what the internet might say :)

ZFS Configurations — MySQL Settings (specific to running on ZFS)

  Parameter           Value  Default
  innodb_doublewrite  0      1
  innodb_checksums    0      1

•  InnoDB uses a doublewrite buffer to prevent partial page writes. ZFS does not allow partial writes, so you can safely disable this for an improvement in performance
•  ZFS already checksums everything, and does it more efficiently, so there is no need to do it again at the MySQL layer
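In my.cnf terms, a minimal sketch of the two settings above (illustrative, not the talk's full production config):

  [mysqld]
  # ZFS writes each record atomically, so a 16K InnoDB page can never be
  # partially written; the doublewrite buffer becomes redundant
  innodb_doublewrite = 0
  # ZFS already checksums every block, so skip InnoDB page checksums
  innodb_checksums = 0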
Graphs/Metrics — a single production MySQL host

Challenges

ZFS Fragmentation
•  Do not run more than 80% full
•  Even if you don't run full, a high-IO application like MySQL will eventually incur fragmentation
•  We see performance issues when fragmentation reaches around 70-80%
•  You have to destroy and recreate your pool to fix fragmentation. Not fun with production data

ZFS Fragmentation — Let's automate the fixing
Open source script: {Link Here}
1.  Uses an S3 API compatible storage tier, internal or external (S3, Ceph, etc.)
2.  Detects open files on any ZFS filesystems
3.  Takes a recursive snapshot of the pool and sends it, compressed, to your S3 bucket
    •  Applies very specific ZFS settings to double throughput
4.  Captures how your pool was created, with any custom parameters
5.  Verifies the snapshot in S3 and then destroys the pool
6.  Re-creates the pool using the exact same pool create flags, or any custom parameters specified when the script was run
7.  Pulls down the snapshot from S3 and imports it
8.  Cleans up snaps, reverts the ZFS settings, and reports your new low fragmentation %

ZFS Fragmentation — Example Host

  zfs_defrag.sh -x ashift=12 -y

•  Host had a ~800GB pool compressed, ~1.5TB uncompressed
•  Over 80% fragmentation
•  Took around 17 hours for the script to finish; without the "turbo" ZFS settings, it would have taken around 40 hours
•  Fragmentation went from ~80% down to ~8%

ZFS Fragmentation — Script limitations and future features
•  Only supports an S3 API compatible storage tier
   •  Plans to add additional options like NFS mounts or a separate device on the host itself. It's OSS, so you can add your own as well :)
•  Does not support ZVOLs
•  Can only detect open files on the ZFS filesystems
   •  Plans to add logic to safely shut down services/daemons that have open files

MySQL Multi-Host Re-seeds or Shard Splits — Typical Way
Source (DC 1) copies separately to Host 1, Host 2, and Host 3 (DC 2)
Issues with this:
•  The source server becomes the bottleneck
•  Data is duplicated twice across DCs
•  Large datasets (>1TB) can take days or more, depending on your network and WAN setup

MySQL Multi-Host Re-seeds or Shard Splits — Our Way
Source (DC 1) → Host 1 → Host 2 → Host 3 (DC 2), all automated with Ansible
What's going on here?
•  We daisy-chain the copy process, so all hosts receive the data in the same amount of time it takes for one
•  Only one copy gets sent over the WAN
•  We use ZFS send/receive, which is more efficient since it works at the block level and doesn't require replaying logs like innobackupex does

MySQL Multi-Host Re-seeds or Shard Splits — How
•  We use a combination of netcat, a fifo, and ZFS snapshots
•  The last host in the chain runs something like this:

  nc -ld 1234 | zfs receive {zfs_mount}

•  Every other host runs:

  mkfifo testfifo
  nc ip_of_next_box 1234 < testfifo &
  nc -ld 1234 | tee testfifo | zfs receive {zfs_mount}

•  The source host runs:

  zfs send -R {zfs_mount}@{zfs_snap} | nc ip_of_next_box 1234

•  We also temporarily set the same specific ZFS settings we use in our frag script to increase throughput

MySQL Multi-Host Re-seeds or Shard Splits — ZFS "TURBO" Settings

  Parameter                        Value
  zfs_vdev_sync_write_min_active   512
  zfs_vdev_sync_write_max_active   512
  zfs_vdev_sync_read_min_active    512
  zfs_vdev_sync_read_max_active    512
  zfs_vdev_async_read_min_active   128
  zfs_vdev_async_read_max_active   128
  zfs_vdev_async_write_min_active  128
  zfs_vdev_async_write_max_active  128
  zfs_vdev_max_active              5000

Note: these settings maximize throughput at the expense of latency. Please resist the urge to apply them on a production MySQL host :) We don't care about latency here because MySQL isn't running while we perform a re-seed or run our ZFS defrag script.
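As a hedged sketch of flipping these on temporarily (the loop is illustrative; the talk doesn't show its exact tooling), each of these tunables is writable at runtime under /sys/module/zfs/parameters:

  # Illustrative only: raise vdev queue depths for a re-seed/defrag window.
  # MySQL must not be running; revert the values (or reboot) when done.
  cd /sys/module/zfs/parameters
  for p in zfs_vdev_sync_write_min_active zfs_vdev_sync_write_max_active \
           zfs_vdev_sync_read_min_active  zfs_vdev_sync_read_max_active; do
      echo 512 > "$p"
  done
  for p in zfs_vdev_async_read_min_active  zfs_vdev_async_read_max_active \
           zfs_vdev_async_write_min_active zfs_vdev_async_write_max_active; do
      echo 128 > "$p"
  done
  echo 5000 > zfs_vdev_max_active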
Final Summary — Lessons Learned
•  Proper sizing of the L2ARC device
   •  Data stored in L2ARC is tracked in your ARC by headers
   •  1GB of L2ARC ~= 25-30MB of ARC
   •  Too big an L2ARC can consume your entire ARC with these headers rather than with real data that needs to be cached
   •  Verify how much memory the headers are consuming:

  cat /proc/spl/kstat/zfs/arcstats | grep l2_hdr_size | awk '{print $3}'
  arc_summary.py | grep -i header

•  Even though you can remove an L2ARC/SLOG device online, don't do it on your primary MySQL host
   •  It will cause stalls and timeouts for a short period of time
•  If running ZoL <= v0.6.5.2, turn off dynamic taskqs:

  echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic
  echo "options spl spl_taskq_thread_dynamic=0" >> /etc/modprobe.d/zfs.conf

Future ZoL Features
•  Resumable send => https://github.com/zfsonlinux/zfs/issues/3896 (will be part of the 0.7.x release)
•  Data at rest encryption => https://github.com/zfsonlinux/zfs/pull/5769
•  Persistent L2ARC => https://github.com/zfsonlinux/zfs/issues/3744
•  SSD TRIM => https://github.com/zfsonlinux/zfs/pull/5925
•  New zpool iostat options for average latency, histograms, queues, and request size histograms (0.7.x release)

[email protected]
@geekhead7