Transcript
HIP Software and Physics project

Experiences setting up a Rocks based Linux cluster
Tomas Lindén, Helsinki Institute of Physics, CMS programme

Physicists' network December meeting, CSC, Espoo, Finland, 08.12.2003
Contents
1. Introduction
2. Linux cluster mill hardware
3. Disk performance results
4. NPACI Rocks Cluster Distribution
5. Conclusions
6. Acknowledgements
1. Introduction

Commodity off the shelf hardware (COTS) gives an excellent price/performance ratio for High Throughput Computing (HTC). Top500 class computers can be built using COTS. There is no public x86 Linux HTC cluster at CSC nor in Kumpula. To meet the demand for large affordable CPU resources an x86 Linux cluster project was started in Kumpula with:
• Hardware optimized for Monte Carlo simulations in Computational Material Science (MD simulations) and High Energy Physics (CMS Monte Carlo production runs for the CMS Data Challenges).
• The applications are mostly of the pleasantly parallel type.
• Network bandwidth and latency for message passing are therefore not an issue, so only a 100 Mb/s network is required between the nodes.
• One subphase of the simulations is I/O bound and needs good random read performance.
The budget for the system was ≈50 kEUR (without VAT).
Main design goals
• Maximize CPU power
• Maximize random disk reading bandwidth
• Minimize costs

Casing comparison
• minitower
  + standard commodity solution
  + inexpensive
  - big space requirement
• 1U or 2U 19" rack case
  + more compact than minitower
  - more expensive than minitower
• home made "ATX blade server"
  + more compact than minitower
  + least expensive if assembly costs are neglected
  - labour intensive
  - cooling problems can be an issue
• blade server
  + most compact
  - most expensive
[Figure: A 32 node OpenMosix AMD Athlon XP ATX blade server at the Institute of Biotechnology at the University of Helsinki.]
Heat dissipation is obviously a major problem with any sizeable cluster. Other factors affecting the choice of casing are space and cost issues. The idea of building an "ATX blade server" was very attractive to us in terms of needed space and costs, but we were somewhat discouraged by the heat problems with the previously shown cluster (the heat problem was subsequently solved with more effective CPU coolers). It was also felt that one would have more problems with warranties with a completely home built system. Also, the mechanical design of an ATX blade server would take some extra effort compared to a more standard solution. Because of space limitations a minitower solution was not possible, so we chose a rack based solution with 2U cases for the nodes and a 4U case for the frontend of the Linux cluster mill.
1U cases are very compact, but the only possibility to add expansion cards is through a PCI riser card, which can be problematic, so when using a 1U case one really wants a motherboard with as many integrated components as possible. The cooling of a 1U case needs to be designed very carefully. The advantage of a 2U case is that half height or low-profile PCI cards can be used without any PCI riser card (Intel Gb/s NICs are available in half height PCI size). Heat dissipation problems are probably less likely to occur in a 2U case because of the additional space available for airflow. The advantage of 4U cases is that standard PCI expansion cards can be used.
The motherboard requirements were:
• Dual CPU support
  – minimizes the number of boxes to save space and maintenance effort
• At least dual ATA 100 controllers
• Integrated fast ethernet or Gb/s ethernet
• Serial console support or integrated display adapter
• Support for bootable USB devices (CDROM, FDD)
  – Each worker node does not need a CDROM or FD drive if the motherboard BIOS supports booting from a corresponding USB device and the network card and the cluster management software support PXE booting. Worker nodes can then be booted for maintenance or BIOS upgrades from USB devices that are plugged into the node only when needed.
• Support for PXE booting
2. Linux cluster mill hardware

[Figure: A schematic view of the mill cluster network connections. The frontend mill (eth1 100 Mb/s to the public LAN, eth0 1 Gb/s Cu to the private LAN) and the 1.4 TB fileserver silo (eth0 1 Gb/s Cu, eth1 1 Gb/s fiber) are connected through a 1000/100 Mb/s switch together with the console and the compute nodes compute-0-0 (node-0) ... compute-1-17 (node-31), each with a 100 Mb/s eth0 interface.]
Node hardware:
• CPU: 2 * AMD 2.133 GHz Athlon MP
• MB: Tyan Tiger MPX S2466N-4M
• Memory: 1 GB ECC registered DDR
• IDE disks: 2 * 80 GB 7200 rpm Hitachi 180 GXP DeskStar
• NIC: 3Com 3C920C 100 Mb/s
• Case: Supermicro SC822I-300LP 2U
• Power: Ablecom SP302-2C 300 W
• FDD: integrated in the case
• Price: ≈1.4 kEUR/node with 0% VAT

[Figure: The 32+1 node dual AMD Athlon MP Rocks 2U rack cluster mill.]

One USB 2.0 - IDE case with a 52x IDE CDROM can be connected to the USB 1.1 ports of any node. The 1000/100 Mb/s network switches existed already previously.
Frontend hardware:
• CPU: 2 * AMD 1.667 GHz Athlon MP
• MB: Tyan Tiger MPX S2466N-4M
• Memory: 1 GB ECC registered DDR
• IDE disks: 2 * 60 GB 7200 rpm IBM 180 GXP DeskStar
• NIC: 3Com 3C996-TX Gb/s
• NIC: 3Com 3C920C 100 Mb/s
• Graphics: ATI Rage Pro Turbo 8MB AGP
• Case: Compucase S466A 4U
• Power supply: HEC300LR-PT
• CDROM: LG 48X/16X/48X CD-RW
• FDD: yes

Node cables:
• Power supply
• Ethernet
• Serial console CAT-5

Console monitoring:
• Digi EtherLite 32 with 32 RS232 ports
• 2 port D-Link DKVM-2 switch

Recycled hardware:
• rack shelves
• serial console control computer
• 17" display, keyboard and mouse

Racks:
• 2 Rittal TS609/19 42U 60x90x200 cm

In total:
• CPUs: 64+2
• Memory: 30 GB
• disk: 5 TB (2.5 TB RAID1)
Cooling
• The cases have large air exhaust holes near the CPUs.
• Rounded IDE cables on the nodes.
• The frontend draws 166 W (idle) to 200 W (full load).
• Node CPU temperatures are ≤ 40 °C when idle.

[Figure: The interior of one of the nodes.]
About the chosen hardware
• PXE booting works mostly fine.
• USB FD and USB CDROM booting works OK (USB 1.1 is a bit slow).
• The Digi EtherLite 32 has worked fine after the correct DB9 to RJ-45 adapter wiring was found.
• The ATX power cable extensions have created problems on a few nodes.
• Booting from a USB memory stick does not work well, because the nodes sometimes hang when the flash memory is inserted. Booting several nodes from a USB storage device is impractical, because the BIOS boot order list has to be edited every time the memory stick is inserted, unlike the case for CDROM and FD drives.
• The remote serial console is a bit tricky to set up. Unfortunately the slow BIOS memory test cannot be interrupted from a serial console.
• The Rittal 1U power outlets have a fragile plastic mounting plate.
Node console BIOS settings can be replicated with /dev/nvram and many error conditions are found in the Linux log files, but to see BIOS POST messages or console error messages some kind of console is needed [1]. Some popular alternatives are:
• keyboard and display attached only when needed
• Keyboard Video Mouse (KVM) switch
• serial switch
• remote management card

        cabling   LAN access   all consoles at once   graphics card   speed   special keys   mouse
Serial  +         +            +                      not required    -       -              -
KVM     -         -            -                      required        +       +              +
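As an illustration of the serial console alternative, below is a minimal sketch of the classic configuration described in the Remote Serial Console HOWTO [1] for a Red Hat 7.3 style system; the serial port, baud rate and terminal type are assumptions and must match the actual console server settings.

    # Kernel messages to the first serial port: append to the kernel line
    # in /boot/grub/grub.conf (ttyS0 and 9600 baud are assumptions)
    console=tty0 console=ttyS0,9600n8

    # Login prompt on the serial port: add to /etc/inittab
    S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100

    # Allow root logins on the serial line: add to /etc/securetty
    ttyS0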
Useful special Red Hat Linux kernel options:
• apm=power-off (useful on SMP machines)
• display=IP:0 (allows kernel Oops and Panics to be viewed remotely; this has not been tested on mill)
Compilers to install:
• C/C++ and F77/F90 compilers by Intel and the Portland Group
Grid software to install:
• NorduGrid software
• Large Hadron Collider Computing Grid (LCG) software
Application software to install:
• Molecular dynamics simulation package PARCAS
• Compact Muon Solenoid standard and production software
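To show where such options go, here is a minimal sketch of a /boot/grub/grub.conf entry with the extra option appended to the kernel line; the kernel version and root device are placeholders, not the actual mill settings.

    title Red Hat Linux (2.4.20-28.7smp)
            root (hd0,0)
            # placeholder kernel and root device; extra options go at the end
            kernel /vmlinuz-2.4.20-28.7smp ro root=/dev/hda2 apm=power-off
            initrd /initrd-2.4.20-28.7smp.img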
3. Disk performance

Equipping each node with two disks gives flexibility in the disk space configuration. I/O performance can be maximized using software RAID 1 and storage space can be maximized using individual disks or software RAID 0 (with or without LVM). In the following, performance figures for single disks and software RAID 1 are presented.

The seek-test benchmark by Tony Wildish is a tool to study sequential and random disk reading speed as a function of the number of processes [2].
• Each filesystem was filled, so that the disk speed variation across different disk regions was averaged over.
• The files were read sequentially (1-1) and randomly (5-10) (the test can randomly skip a random number of blocks within an interval given by a minimum and maximum number of blocks).
• All of the RAM available was used in these tests and each point reads data for 600 s.
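As an illustration of the software RAID 1 setup discussed above, a minimal sketch using mdadm is shown below; the device names and filesystem are assumptions, not the actual mill configuration.

    # Mirror the two IDE disks of a node (device names are assumptions)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
    # Create a filesystem on the mirror and check that both halves are active
    mkfs.ext3 /dev/md0
    mdadm --detail /dev/md0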
[Figure: Disk read throughput (kB/s) versus number of processes. Single disk sequential read performance: Tiger MPX MB IDE-controller with a 180 GXP disk, Tyan MP MB IDE-controller with a 120 GXP disk and 3ware 7850 IDE-controller with a 120 GXP disk on a Tiger MP.]
[Figure: Disk read throughput (kB/s) versus number of processes. Software RAID 1 sequential read performance: Tiger MPX MB IDE-controller with 180 GXP disks and 3ware 7850 IDE-controller on a Tiger MP with 120 GXP disks.]
[Figure: Disk read throughput (kB/s) versus number of processes. Software RAID 1 random read performance: Tiger MPX MB IDE-controller with 180 GXP disks and 3ware 7850 IDE-controller on a Tiger MP with 120 GXP disks, together with single 180 GXP disk random read performance on a Tiger MPX MB IDE-controller.]
4. NPACI Rocks Cluster Distribution

Cluster software requirements
• Cluster software should reduce the task of maintaining N+1 computers to maintaining only 1+1 computers and provide good tools for cluster installation, configuration, monitoring and maintenance.
• Manage software components, not the bits on the disk. This is the only way to deal with heterogeneous hardware:
  – System imaging relies on homogeneity (bit blasting).
  – Clusters stay homogeneous only for a very short time.
• Cluster software should work with CERN Red Hat Linux.
• No cluster software with a modified Linux kernel (OpenMosix, Scyld).
NPACI Rocks was chosen as the cluster software because of a favorable review and positive experiences within CMS [3].
Some other cluster tools are:
• OSCAR (Open Source Cluster Application Resources)
  – Image based, more complex to set up than Rocks, but better documentation.
  – http://oscar.sourceforge.net/
• LCFG (Local ConFiGuration system)
  – Powerful but difficult to use. Used and developed by EDG/EGEE (LCG).
  – http://www.lcfg.org/
Some widely used automated installation tools are:
• SystemImager
  – Assumes homogeneous hardware.
  – http://www.systemimager.org/
• LUI (Linux Utility for cluster Installation)
  – Created by IBM, no development since 2001?
  – http://oss.software.ibm.com/developerworks/projects/lui/
• FAI (Fully Automatic Installation)
  – Only for Debian Linux.
  – http://www.informatik.uni-koeln.de/fai/
NPACI Rocks Cluster Distribution is an RPM based cluster management software package for scientific computation based on Red Hat Linux [4]. Both the latest version 3.0.0 and the previous one, 2.3.2, are based on Red Hat 7.3. There are at least four Rocks based clusters on the November 2003 Top500 list (# 26, # 176, # 201 and # 408). There are some 140 registered Rocks clusters with more than 8000 CPUs. All nodes are considered to have soft state and any upgrade, installation or configuration change is done by node reinstallation, which takes about 6–8 min for a node and about 20 min for the whole mill cluster. The default configuration is to reinstall a node also after each power down. Settings like this can easily be changed according to taste. Rocks makes it possible for nonexperts to set up a Linux cluster for scientific computation in a short amount of time.
Features of Rocks
• Supported architectures: IA32, IA64, IA32-64 soon (no Alpha, no SPARC)
• Networks: Ethernet, Myrinet (no SCALI)
• Requires nodes with local disk
• Requires nodes able to boot without a keyboard (headless support)
• Relies heavily on Kickstart and Anaconda (works only with Red Hat)
• Can handle a cluster with heterogeneous hardware
• XML configuration scripts (a rough sketch follows below)
• Non RPM packages can be handled with XML post configuration scripts
• eKV (ethernet, Keyboard and Video), a telnet based tool for remote installation monitoring
• Supports PXE booting and installation
• Installation is very easy on well behaved hardware
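As a rough illustration of what such an XML configuration script can look like, the sketch below adds an extra package and a post-install command to the compute node configuration; the tag names and file location are given from memory and should be checked against the Rocks documentation.

    <?xml version="1.0" standalone="no"?>
    <!-- assumed location: /home/install/site-profiles/<version>/nodes/extend-compute.xml -->
    <kickstart>
      <description>Site specific additions to the compute nodes</description>
      <package>emacs</package>
      <post>
        <!-- arbitrary shell commands run in the post-install phase;
             note the XML escaping of special characters (see the table later) -->
        echo "Managed by Rocks, do not edit by hand" &gt;&gt; /etc/motd
      </post>
    </kickstart>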
• Services and libraries out of the box
  – Ganglia (monitoring with a nice graphical interface)
  – SNMP (text mode monitoring information)
  – PBS (batch queue system)
  – Maui (scheduler)
  – Sun Grid Engine (alternative to PBS)
  – MPICH (parallel libraries)
  – DHCP (node IP addresses)
  – NIS (user management); 411 SIS is beta in v. 3.0.0
  – NFS (global disk space)
  – MySQL (cluster internal configuration bookkeeping)
  – HTTP (cluster installation)
  – PVFS (distributed cluster file system kernel support)
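For example, jobs are typically submitted to the PBS/Maui batch system with a small job script; the sketch below is generic (job name, resource request and executable are placeholders, not mill specific settings). The script would be submitted with qsub and followed with qstat.

    #!/bin/sh
    # Job name, resources and wall time are placeholders, not mill settings.
    #PBS -N cms_mc_test
    #PBS -l nodes=1:ppn=2
    #PBS -l walltime=01:00:00

    # Run in the directory the job was submitted from.
    cd $PBS_O_WORKDIR
    ./my_simulation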
The most important Rocks commands
• insert-ethers
  Insert/remove a node into/from the MySQL cluster database.
• rocks-dist mirror
  Build or update a Rocks mirror.
• rocks-dist dist
  Build the RPM distribution for the compute nodes.
• shoot-node
  Reinstall a compute node.
• cluster-fork
  Run any command serially on the cluster; for example
  cluster-fork /boot/kickstart/cluster-kickstart
  reinstalls the whole cluster.
• cluster-ps
  Get a cluster wide process list.
• cluster-kill
  Kill processes running on the cluster.

Test whether the XML/kickstart infrastructure returns an OK kickstart file:
cd /export/home/install ; ./kickstart.cgi --client="compute-0-0"

Special characters must be escaped in the XML configuration files:

character   XML entity
"           &quot;
&           &amp;
'           &apos;
<           &lt;
>           &gt;
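Putting the commands above together, a typical update cycle might look like the sketch below; only commands already listed above are used, and exact arguments can differ between Rocks versions.

    rocks-dist mirror        # refresh the local Rocks/Red Hat mirror
    rocks-dist dist          # rebuild the RPM distribution for the compute nodes
    shoot-node compute-0-0   # reinstall one node to test the new configuration
    cluster-fork /boot/kickstart/cluster-kickstart   # reinstall every node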
Documentation
Tutorial: The Rocks CDROM contains the slides of a good tutorial talk [5].
User Guide: The Rocks manual covers the minimum to get one started.
Reference Guide: Some basic configuration changes are covered in the manual, but more advanced issues usually need to be resolved with the help of the very active Rocks mailing list, which also has a web archive.

Minimum hardware to try out Rocks
To try out Rocks on a minimum 1+1 cluster only two computers, three NICs and one cross-connect ethernet cable are needed.

Future of Rocks
The next version of NPACI Rocks will be based on Red Hat Enterprise Linux compiled from the source RPMs. The project has financing for at least three years from now.
5. Conclusions
• The hardware on mill works well and application software installation has started. Also the NorduGrid installation has begun.
• The Tyan MPX BIOS still has room for improvement.
• Software RAID 1 gives good sequential and random (2 processes) disk reading performance.
• The 3ware IDE-controller driver or the Linux SCSI driver has some room for improvement compared to the standard IDE driver.
• The NPACI Rocks cluster distribution has proven to be a powerful tool enabling nonexperts to set up a Linux cluster. However, the Rocks documentation level is not quite up to the software quality; this is partly compensated by the active Rocks user mailing list.
• The NPACI Rocks cluster distribution is an interesting option worth considering for the Material science grid project.
6. Acknowledgements
The Institute of Physical Sciences has financed the mill cluster nodes. The Kumpula Campus Computational Unit hosts the mill cluster in their machine room and has provided the needed network switches. N. Jiganova has helped with the software and hardware of the cluster. P. Lähteenmäki has been very helpful in clarifying network issues and setting up the network for the cluster. Damicon Kraa, the vendor of the nodes, has given very good service.
References
[1] Remote Serial Console HOWTO, http://www.dc.turkuamk.fi/LDP/HOWTO/Remote-Serial-Console-HOWTO/index.html
[2] Seek-test, by T. Wildish, http://wildish.home.cern.ch/wildish/Benchmark-results/Performance.html
[3] Analysis and Evaluation of Open Source Solutions for the Installation and Management of Clusters of PCs under Linux, R. Leiva, http://heppc11.ft.uam.es/galera/doc/ATL-SOFT-2003-001.pdf
[4] Rocks homepage, http://rocks.sdsc.edu/Rock
[5] NPACI All Hands Meeting, Rocks v2.3.2 Tutorial Session, March 2003, http://rocks.sdsc.edu/rocks-documentation/3.0.0/talks/npaci-ahm-2003.pdf