[images/CAMP_small.jpg] VALHAL - THE CAMP COMPAQ-ALPHA SUPERCOMPUTER

The parallel Compaq Alpha-based supercomputer named VALHAL was installed on November 5, 1999 by the CAMP research center at DTU. The funding was obtained through a donation of 2,850,000 DKK (approx. US$ 400,000) from the Villum Kann Rasmussen Foundation.

Notice: On May 17, 2004 VALHAL was turned off after 4 1/2 years of eminent service. A new Linux cluster takes up the VALHAL computer room.

The supercomputer will be used for quantum mechanical calculations and atomistic simulations of complex materials, enabling the prediction of mechanical, electrical, magnetic and chemical properties. Among other topics, researchers at CAMP will study deformations of materials, catalytic properties of surfaces, and properties of complex biological molecules.

[images/valhal6.jpg]
Photo showing part of VALHAL. The Ethernet switch is surrounded by Compaq Alphastation XP1000 cabinets. To the left is a rack of AlphaServer DS10L systems. Date: April 20, 2001

The supercomputer system was delivered by Compaq Danmark and Benau A/S.

In the following we describe the supercomputer's hardware and software, and explain the name VALHAL.

The author of this document is Ole Holm Nielsen, to whom any questions should be sent (E-mail address Ole.H.Nielsen (at) fysik.dtu.dk).


VALHAL hardware

VALHAL consists of a large number of Compaq Alphastation XP1000 and Alphaserver DS10L computers connected in parallel using Fast Ethernet switching technology, designed according to the "Beowulf" concept of parallel computing. (The name Beowulf derives from an Old English epic about Danish (!) kings in the sixth century A.D. There is a nice new translation by the Nobel laureate Seamus Heaney.)

The Beowulf concept is mainly about connecting powerful, mass-produced (and therefore affordable) computers by means of commodity networking. Beowulf-type computers are surprisingly efficient for a large class of (but not all!) supercomputing problems. There is a nice book, How to Build a Beowulf, which also explains the history and technology of Beowulf machines and gives an introduction to parallel programming.

The VALHAL nodes have the following characteristics:

Nodes      CPU          CPU Clock   RAM     Disk    SPECfp95   Peak speed   Linpack-1000
(number)   (type)       (MHz)       (MB)    (GB)    (marks)    (MFLOPS)     (MFLOPS)
     94    Alpha EV67      667        512     9.1      65.5         1334            984
     40    Alpha EV6       466        512    10        47.9          932              -
      6    Alpha EV6       500        640     4.5      52.2         1000            737
    140    TOTAL             -      72448    1282      8386       168676              -

VALHAL's peak speed: 168 GigaFLOPS

(the aggregate peak floating-point performance of all nodes, measured in billions of floating-point operations per second). Each Alpha EV6/EV67 CPU can complete two floating-point operations (one add and one multiply) per clock cycle, so a node's peak speed in MFLOPS is twice its clock frequency in MHz, for example 2 x 667 = 1334 MFLOPS for the EV67 nodes; summing over all 140 nodes gives the 168676 MFLOPS in the table above.

The reason we focus on floating-point performance is that this is the key performance parameter for CAMP's supercomputer applications. We find that our codes usually achieve some 40-50% of any processor's peak speed, so it makes sense to compare relative peak performances when evaluating different systems. The design of the VALHAL system was chosen to obtain maximum real throughput on CAMP's main production codes for the given amount of funding. It should be emphasized that our parallel codes work well over Ethernet interconnects, as we have determined experimentally. The price/performance of the VALHAL cluster technology is significantly better than that of traditional supercomputers, and is actually not far from the price/performance of high-end Intel-based PCs.

At the time of its installation in November 1999, VALHAL was the fastest computer in Denmark measured in terms of floating-point performance.

If you want to know more about the Alpha CPUs, there are some Compaq white papers.

Networking hardware

The network interconnect is the key element which turns a collection of workstations into a parallel computer. VALHAL employs a Fast Ethernet network which operates with 100 Mbit/sec full-duplex connections between the compute nodes and a central switch. The switch is a powerful Cisco Catalyst 4006. Our switch is at present configured with 3 modules containing a total of 176 ports of 100 Mbit/sec, plus 2 Gigabit/sec ports. The Catalyst 4006 backplane bandwidth of 24-60 Gigabit/sec and throughput of 18-48 million packets per second are more than sufficient to handle full media speed on all ports.

A word about configuration of the Cisco switch: Cisco provides on-line documentation for the Catalyst 4000 family switches.

In order to minimize the delay of nodes booting over the network, the following customizations should be performed for all switch-ports that connect directly to a workstation node:

set port host XXX
set port speed XXX 100
set port duplex XXX full
where XXX refers to the relevant workstation ports, for example 2/1-48. The network speed and duplex settings of the switch must match the settings on all workstations (can be changed at the SRM console by e.g. "set ewa0_mode FastFD" for 100 Mbit/s full-duplex).

For the Cisco supervisor engine software release 5.1 and earlier you must optimize the port configuration for host connections by the commands

set spantree portfast XXX enable
set port channel XXX off
set trunk XXX off
instead of the set port host command, which is only supported from release 5.2. Please refer to the document Using Portfast and Other Commands to Fix Workstation Startup Connectivity Delays, and to Cisco's software Release Notes for further details.

Ports connected to other switches must not have spantree portfast enabled, because of potential spanning-tree problems. Cisco has some useful advice in their Tech Notes on LAN Technologies Technical Tips.

With a Cisco customer login account, you can download LAN Switching Software for upgrading your switch software, if necessary. You can also upgrade the ROM-monitor software as described in this Field Notice.

Networking performance

We measured the network performance on a pair of Alphastation XP1000 machines running Tru64 UNIX, using the NetPIPE network performance evaluator tool version 2.3. From the "network signature" graph we find the following performance numbers using MPICH (version 1.1.2) communication with the ch_p4 device on top of Tru64's TCP/IP, as well as using TCP/IP directly:

Network             CPU speed   Protocol     Latency          Bandwidth        Bandwidth
(type)              (MHz)       (software)   (microseconds)   (Megabits/sec)   (Megabytes/sec)
Fast Ethernet          667       MPI              86               81.5             10.2
Fast Ethernet          500       MPI              99               81.6             10.2
Gigabit Ethernet       500       MPI             155              254               31.7
Fast Ethernet          667       TCP              48               83.4             10.4
Fast Ethernet          500       TCP              52               83.5             10.4
Gigabit Ethernet       500       TCP              98              429               53.6

This networking performance is excellent compared to the requirements of CAMP's parallel application codes.

Network configuration

The IP-network is configured as a Private Internet (RFC 1918), which does not consume scarce addresses from our IP-pool. A complication arising from the use of a Private Internet is that all network services required for the proper operation of the computers must be provided on the private network, in addition to being provided on the public network. This includes services such as DNS nameservice, an SMTP mail-gateway, NTP timeservice, NIS (Network Information Service), and NFS (Network File System).

Fileserver hardware

One XP1000 node is a dedicated fileserver providing a large disk space to the entire cluster. This node has a mirrored system disk (using the Tru64 UNIX Logical Storage Manager software) for reliability, and a Gigabit connection to the Ethernet switch for performance.

A RAID disk system of 216 GB is used as primary file storage on the fileserver. The RAID system is a Voyager system delivered by Heinex Data. It is based on the Chaparral G5312 RAID controller and on Ultra2 SCSI technology (bus speed of 80 MB/sec), and our IBM 36 GB disks are configured as a RAID level 3 set. With a 128 MB cache, the system can sustain more than 50 MB/sec throughput.

Backup: The capacity of the RAID-disks exceeds the capacity of our tape-jukebox backup system! Therefore we do not perform any backups of the RAID-disks; however, the data should be quite well protected against system malfunctions by the redundancy of the RAID technology. It is important for users to realize that if they destroy their files on the RAID-disks, the files will be lost forever!

Physical installation

The physical installation of the 60 computers is on standard shelves, as shown in the front-side photo. There are 5 shelves, and the dimensions of the entire system are approximately 4 meters in width and 2.5 meters in height. The bare shelves before mounting the computers are shown in this photo. The rear side, displaying our mounting of cables etc., is shown in this photo.

The cabling plan of the computers is illustrated in this figure. The network cables and the serial-port cables (for control purposes) are contained in plastic channels mounted on the rear side of the shelves. The Ethernet switch is located at the center of the system in order to minimize the cabling requirements. Two 32-port DECserver 900TM serial-line terminal servers are located on the top shelf, again because of cabling considerations.

The 220 Volt power cables are drawn via the ceiling and along the shelves' carrying rods, separated from the network and serial cables in order to avoid electrical interference. Groups of 5 computers each are supplied from one power outlet, for a total of 16 power groups (each rated at 16 Amps max). A surge protector is installed in the main supply power line. The maximum power consumption of an XP1000 is rated by Compaq as 615 W; however, this figure is far above what we observe. Each XP1000 workstation in our configuration consumes less than 200 Watts, so the total power consumption is about 12 kW.

A standalone cooling unit of 20 kW is installed for supplying chilled air at the front of the computer shelves. The XP1000 cooling fans draw air through the front-plate and emit air through the rear-plate, so the chilled air should be supplied at the front side.


VALHAL software

Operating system software

For efficiency reasons, the Compaq Tru64 UNIX operating system (in a previous life named Digital UNIX) is used on the computer nodes, rather than the Linux operating system usually found on Beowulf computers. We explain our reasoning below.

Each XP1000 computer node runs the Compaq Tru64 UNIX operating system version 4.0F, which is installed on the node's harddisk. We document our customized Tru64 UNIX installation on a Compaq Alpha computer in this document, which contains many hints learned "the hard way".

Each node's local disk contains the Tru64 UNIX operating system, and a large swap space of 2 GBytes. The remainder of the disk is laid out as a "scratch" disk area for the temporary files of running batch jobs.

Cluster management software

It is nontrivial to install and administer a many-node cluster of independent systems. The CAMP cluster of Alphastation XP1000 machines was bought together with Compaq's Cluster Management Utility (CMU) software, which is produced by Compaq Custom Systems in Annecy-le-Vieux, France. The CMU software is available for Linux and Tru64 UNIX Alpha clusters from Compaq, and an inquiry can be made via the above CMU Web-page.

The CMU cloning of nodes seems to have many ideas in common with the cloning methods explained in How to Build a Beowulf.

Performance Visualizer software

The CMU software contains the Performance Visualizer (PVIS) version 1.2.4 package, which allows monitoring of the operating system of an entire cluster. Download of PVIS is also available from the Compaq Web-page, but currently an old version 1.1.4 is offered. While the public version 1.1.4 comes with a "Getting_Started" document, the version 1.2.4 in CMU is undocumented at present. Therefore we give the relevant details here:

On the nodes which you wish to monitor, install the Performance Manager software located in the CMU-software directory PMGR440, or with the Tru64 UNIX CD set. Select for installation only item 3, "Performance Manager Daemons & Base" (PMGRBASE440). This will start the pmgrd daemon, which however requires that the snmpd daemon is configured and running (snmpd is enabled by default).

On the servers or nodes where you want to run the Performance Visualizer graphical display tool, you install the software located in the CMU-software directory PPM124 (PPMBASE124 and PPMCDE124). The Tcl and Tk software kits (OSFTCLBASE440 and OSFTKBASE440) are prerequisites.

Now start the pvis monitoring tool, preferably not on a compute-node. Select the menu item File->Connect and add the nodes you want to monitor by entering their names in the Add field. Finally press the Connect button. Now select the menu item View->All to pick the quantities that you want to monitor. There is unfortunately no documentation nor man-page for the pvis tool.

The configuration of pvis is performed on a per-user basis using the configuration file .pvis in the user's home directory. There does not seem to be any possibility of a system-wide configuration file.

Turning your cluster into a Beowulf system

Having installed the basic operating system onto a bunch of nodes does not by itself make a Beowulf cluster suitable for parallel computing. You will want to set up seamless integration among all of the nodes, and to enable some kind of centralized software management. We have written a Beowulf cluster mini-HowTo explaining one possible way to configure a cluster for parallel computing.

Batch system software

Several commercial batch system packages are available for Tru64 UNIX. However, we found that the commercial packages demanded a very high price, because we are building an unusual system with quite many nodes, and excessive license charges are apparently the policy of the vendors. We decided that we would rather buy a number of additional compute nodes than spend our money on commercial batch system software.

Our choice of batch system is the Open Source Portable Batch System (PBS), which is offered for free to registered users, with the possibility of commercial support. The commercial version of PBS is available from http://www.pbspro.com/. Patches to PBS developed in the user community are being collected at http://www-unix.mcs.anl.gov/openpbs/.

The Portable Batch System is a flexible batch software processing system developed at the NASA Ames Research Center. It operates on networked, multi-platform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems.

The PBS batch system can use plug-in scheduler modules for implementing local batch policies, and members of the PBS user community frequently write their own plug-ins for this purpose. However, we initially use the default FIFO scheduler provided in the PBS distribution.

A sophisticated batch scheduler is available from the Maui Scheduler Home Page. The Maui scheduler can interface with a number of batch systems, including PBS, and offers better control over the usage policy than the PBS FIFO scheduler. We have been using the Maui scheduler on our cluster since November 2000, and are very happy with the control over usage policies and resource utilization that Maui enables. We're currently running Maui version 3.0.7.

The current version of OpenPBS is 2.3 (released 18 September 2000). To configure PBS on Tru64 UNIX with the native cc compiler, we suggest the following initial configure command:

configure --set-server-home=/var/spool/PBS --set-default-server=XXX --set-cc '--set-cflags=-g3 -O2'
where XXX is the hostname of your designated PBS server.

We have written a mini-HowTo document going through the steps required to install a fully functional PBS environment.

Some of the most used commands in PBS are qsub (submit a batch job), qstat (show the status of jobs and queues), qdel (delete a job), and pbsnodes (show the status of the compute nodes).

Message-passing software

A distributed-memory parallel computer is usually programmed in parallel using explicit message-passing. That is, when parallel processes need to exchange data, the user code sends data from one CPU to the user code running on another CPU.

We use the popular MPICH freely available, portable implementation of the MPI message-passing standard. At the time of installation the version of MPICH was 1.1.2, but version 1.2.0 has been released on December 2, 1999. We recommend that you use MPICH version 1.2 (or later), and install the PBS batch system version 2.2 (or later). If you use MPICH 1.1.2 with PBS 2.1, please contact us for a bug fix.
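
To illustrate what explicit message-passing looks like at the source level, here is a minimal sketch of an MPI point-to-point exchange written in C (the buffer size and message tag below are arbitrary choices for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int        rank, nprocs, i;
    double     buf[1000];             /* 8000-byte message (arbitrary size) */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process' number */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */

    if (rank == 0 && nprocs > 1) {
        for (i = 0; i < 1000; i++)
            buf[i] = (double) i;
        /* send 1000 doubles from node 0 to node 1 over the network */
        MPI_Send(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive the matching message from node 0 */
        MPI_Recv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("Node 1 received %d doubles from node 0\n", 1000);
    }

    MPI_Finalize();
    return 0;
}

The program is compiled with MPICH's mpicc wrapper and started with, for example, mpirun -np 2 ./a.out; with the ch_p4 device the participating nodes are taken from the MPICH machines file.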

Compiler and library tools

It is very important to us that our parallel application codes deliver maximum performance on the Alpha CPUs. We usually find very good performance delivered by Compaq Fortran (A manual is on the Web) and the Compaq C and C++ compilers.

In addition our applications require optimally hand-tuned Basic Linear Algebra Subroutines (BLAS) as well as 3-dimensional complex Fast Fourier Transforms (FFT), which are delivered by the Compaq Extended Math Library (CXML/DXML).
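
As a small illustration of how the tuned BLAS in CXML/DXML is called from an application, the sketch below multiplies two 3x3 matrices with the standard BLAS routine DGEMM from C. The trailing-underscore name and pass-by-reference arguments are the usual Fortran calling convention on Tru64 UNIX, but the exact convention and the link flag for the CXML/DXML library should be checked in the CXML documentation:

#include <stdio.h>

/* Fortran BLAS routine: C := alpha*op(A)*op(B) + beta*C.
   The hidden string-length arguments of the Fortran interface are omitted,
   which works in practice for single-character flags but is compiler-dependent. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    /* BLAS assumes column-major (Fortran) storage of the matrices */
    double a[9] = { 1.0, 0.0, 0.0,   0.0, 1.0, 0.0,   0.0, 0.0, 1.0 };  /* identity */
    double b[9] = { 1.0, 2.0, 3.0,   4.0, 5.0, 6.0,   7.0, 8.0, 9.0 };
    double c[9] = { 0.0 };
    double alpha = 1.0, beta = 0.0;
    int    n = 3, i;

    /* C = A*B; since A is the identity, C should equal B */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    for (i = 0; i < 9; i++)
        printf("%g%c", c[i], (i % 3 == 2) ? '\n' : ' ');
    return 0;
}

Fortran codes call DGEMM directly with the same argument list; in both cases the program is linked against the CXML/DXML library.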

However, the FFT subroutine in DXML is at present only optimally efficient for power-of-2 grid sizes. For non-power-of-2 grids, we find that the Open Source FFT library FFTW provides robust all-round performance, and we recommend this FFT library for general grid sizes.
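
As an example, a 3-dimensional complex FFT on a non-power-of-2 grid with FFTW looks roughly as sketched below. Note that the sketch uses the interface of the later FFTW 3 releases for clarity; the FFTW 2.x library that was current when VALHAL was installed has a somewhat different API, so consult the manual of the version you install. The 18x18x20 grid is just an arbitrary example:

#include <fftw3.h>

int main(void)
{
    const int nx = 18, ny = 18, nz = 20;   /* arbitrary non-power-of-2 grid */
    fftw_complex *in, *out;
    fftw_plan plan;

    in  = fftw_malloc(sizeof(fftw_complex) * nx * ny * nz);
    out = fftw_malloc(sizeof(fftw_complex) * nx * ny * nz);

    /* Create the plan once (this may be expensive), then execute it
       as many times as needed on the same arrays. */
    plan = fftw_plan_dft_3d(nx, ny, nz, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill 'in' with data, e.g. a charge density on the grid ... */
    fftw_execute(plan);    /* 'out' now holds the forward transform */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

The planning cost is amortized over the many transforms performed per run; the program is linked with -lfftw3 (FFTW 3) or the corresponding FFTW 2 libraries.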

Why don't we use Linux on this Beowulf-like cluster?

We are actually happy users of Linux Intel PCs for our desktops, and it might make sense to put Linux on the supercomputer as well. The reason for choosing Tru64 UNIX is our extreme demands on performance: we will not tolerate any unnecessary bottlenecks on our batch production systems. Another aspect is stability, and we find Tru64 UNIX to be very stable on the XP1000 hardware.

The key software that allows our parallel application codes to deliver maximum performance on the Alpha CPUs includes the Compaq Fortran, C, and C++ compilers and the Compaq Extended Math Library (CXML/DXML) described above.

In conclusion, we feel that Compaq's Tru64 UNIX is to be preferred over the present-day Linux operating system for purely pragmatic reasons: the quest for maximum performance overrules any other concerns, such as the flavor of UNIX or Linux. In all other respects VALHAL is a true Beowulf machine!


VALHAL - the name

In Nordic mythology, Valhal is the home of Odin, the King of Gods. In Valhal the bravest warriors are gathered around Odin's table.

Valhal is massively parallel: It has 540 gates, and through each gate 800 warriors can walk side by side. Hence the name is appropriate for a highly parallel computer. The massive wall-like appearance of our VALHAL computer is also reminiscent of the monumental size of the mythological Valhal.

Further readings on Nordic mythology may be found at these links:


This page is maintained by Ole Holm Nielsen. Last update: 07 Jan 2003.
Copyright © 2003 Center for Atomic-scale Materials Physics . All rights reserved.
