The parallel Compaq Alpha-based supercomputer named VALHAL was installed on November 5, 1999 by the CAMP research center at DTU. The funding was obtained through a donation of 2.850.000 DKK (approx. US$ 400.000) by the Villum Kann Rasmussen Foundation.
Notice: On May 17, 2004 VALHAL was turned off after 4 1/2 years of eminent service. A new Linux cluster takes up the VALHAL computer room.
The supercomputer will be used for quantum mechanical calculations and atomistic simulations of complex materials, enabling the prediction of mechanical, electrical, magnetic and chemical properties. Among other topics, researchers at CAMP will study deformations of materials, catalytic properties of surfaces, and properties of complex biological molecules.
![[images/valhal6.jpg]](images/valhal6.jpg)
The supercomputer system was delivered by Compaq Danmark and Benau A/S.
In the following we describe the supercomputer's hardware and software, and explain the name VALHAL.
The author of this document is Ole Holm Nielsen, to whom any questions should be sent (E-mail address Ole.H.Nielsen (at) fysik.dtu.dk).
VALHAL consists of a large number of Compaq
Alphastation XP1000 and
Alphaserver DS10L computers connected in parallel using
Fast Ethernet switching technology, designed according to the
"Beowulf" concept
of parallel computing.
(The name Beowulf derives from an Old
English legend about Danish (!) kings in the sixth century A.D.
There is a nice, new
translation by the Nobel laureate Seamus Heaney).
The Beowulf concept is mainly about connecting powerful, mass-produced
(and therefore affordable) computers by means of commodity networking.
Beowulf-type computers are surprisingly efficient for
a large class of (but not all!) supercomputing problems.
There is a nice book on
How to Build a Beowulf, which also explains the history and
technology of Beowulf machines, as well as giving an introduction
to parallel programming.
The VALHAL nodes have the following characteristics:
VALHAL hardware
| Nodes | CPU | CPU Clock | RAM | Disk | SPECfp95 | Peak speed | Linpack-1000 |
|---|---|---|---|---|---|---|---|
| (number) | (type) | (MHz) | (MB) | (GB) | (marks) | (MFLOPS) | (MFLOPS) |
| 94 | Alpha EV67 | 667 | 512 | 9.1 | 65.5 | 1334 | 984 |
| 40 | Alpha EV6 | 466 | 512 | 10 | 47.9 | 932 | - |
| 6 | Alpha EV6 | 500 | 640 | 4.5 | 52.2 | 1000 | 737 |
| 140 TOTAL | - | - | 72448 | 1282 | 8386 | 168676 | - |
VALHAL's peak speed: 168 GigaFLOPS
(the aggregate peak floating-point performance of all nodes, measured in billions of floating-point operations per second).
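The aggregate figure can be reproduced from the hardware table. The following Python sketch assumes that each Alpha CPU completes two floating-point operations per clock cycle; this is inferred from the ratio of peak speed to clock frequency in the table and is not stated explicitly there. The variable and function names are illustrative only.

```python
# Node groups from the hardware table: (number of nodes, CPU clock in MHz).
NODE_GROUPS = [(94, 667), (40, 466), (6, 500)]

# Assumption: 2 floating-point operations per clock cycle, inferred from
# the table (e.g. 1334 MFLOPS peak at 667 MHz).
FLOPS_PER_CYCLE = 2


def peak_mflops(clock_mhz):
    """Peak speed of a single node in MFLOPS."""
    return clock_mhz * FLOPS_PER_CYCLE


# Sum the per-node peak speed over all nodes in all groups.
total_mflops = sum(count * peak_mflops(clock) for count, clock in NODE_GROUPS)
print(f"Aggregate peak: {total_mflops} MFLOPS ({total_mflops / 1000:.1f} GFLOPS)")
```

The same pattern (node count times per-node value, summed over the three groups) also reproduces the RAM, disk and SPECfp95 totals in the last table row.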
The reason why we focus on floating-point performance is that this aspect is the key performance parameter for CAMP's supercomputer applications. We find that our codes usually achieve overall some 40-50% of any processor's peak speed, and hence it makes some sense to compare relative peak-performances when evaluating different systems. The design of the VALHAL system has been chosen to obtain maximum real throughput performance on CAMP's main production codes, for the given amount of funding. It should be emphasized that our parallel codes are working well using Ethernet interconnecting networks, as we have determined experimentally. The price/performance of the VALHAL cluster technology is significantly better than for traditional supercomputers, and is actually not far from the price/performance of high-end Intel-based PCs.
At the time of its installation in November 1999, VALHAL was the fastest computer in Denmark measured in terms of floating-point performance.
If you want to know more about the Alpha CPUs, there are some
Compaq white papers.
Networking hardware
The network interconnect is the key element which turns a
collection of workstations into a parallel computer.
VALHAL employs a Fast Ethernet network which
operates with 100 Mbit/sec full-duplex
connections between compute nodes and a central switch.
The switch is a powerful Cisco
Catalyst 4006.
Our switch is at present configured with 3 modules containing
a total of 176 ports of 100 Mbit/sec, and 2 Gigabit ports.
The Catalyst 4006 switch backplane bandwidth of 24-60 Gigabit/sec
and 18-48 million packets-per-second throughput
is more than sufficient to handle full media speed on all ports.
A word about configuration of the Cisco switch:
Cisco provides on-line
documentation for the Catalyst 4000 family switches.
In order to minimize the delay of nodes booting over the network,
the following customizations should be performed for all switch ports
that connect directly to a workstation node.
You must optimize the port configuration for host connections
by the commands
set port host XXX
set port speed XXX 100
set port duplex XXX full
where XXX refers to the relevant workstation ports,
for example 2/1-48.
The network speed and duplex settings of the switch must match
the settings on all workstations (these can be changed at the SRM console
by e.g. "set ewa0_mode FastFD" for 100 Mbit/s full-duplex).
For the Cisco supervisor engine software release 5.1 and earlier
you must use the commands
set spantree portfast XXX enable
set port channel XXX off
set trunk XXX off
instead of the set port host command, which is only
supported from release 5.2.
Please refer to the document
Using Portfast and Other Commands to Fix Workstation Startup Connectivity Delays,
and to Cisco's software Release Notes
for further details.
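The choice of commands by software release can be summarized in a small Python sketch. It relies on the statement above that set port host is only supported from release 5.2, and it assumes (this is an interpretation, not something the text states) that the speed and duplex commands are issued on every release. The function name host_port_commands is hypothetical; only the generated command strings come from the text.

```python
def host_port_commands(ports, release):
    """Return the switch commands for workstation ports.

    ports   -- port range string as used above, e.g. "2/1-48"
    release -- supervisor engine software release as a tuple, e.g. (5, 2)
    """
    if release >= (5, 2):
        # "set port host" is only supported from release 5.2.
        commands = [f"set port host {ports}"]
    else:
        # Release 5.1 and earlier: the three separate commands.
        commands = [
            f"set spantree portfast {ports} enable",
            f"set port channel {ports} off",
            f"set trunk {ports} off",
        ]
    # Assumption: speed and duplex are fixed explicitly on every release.
    commands.append(f"set port speed {ports} 100")
    commands.append(f"set port duplex {ports} full")
    return commands


for line in host_port_commands("2/1-48", (5, 1)):
    print(line)
```

The printed lines are meant to be reviewed and typed at the switch console; the sketch does not talk to the switch itself.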
Ports connected to other switches must not have spantree portfast enabled because of potential spanning-tree problems. Cisco has some useful advice in their Tech Notes on LAN Technologies Technical Tips.
We measured the network performance on a pair of Alphastation XP1000 machines running Tru64 UNIX, using the NetPIPE network performance evaluator tool version 2.3. From the "network signature" graph we find the following performance numbers using MPICH (version 1.1.2) communication with the ch_p4 device on top of Tru64's TCP/IP, as well as using TCP/IP directly:
| Network | CPU speed | Protocol | Latency | Bandwidth | Bandwidth |
|---|---|---|---|---|---|
| (type) | (MHz) | (software) | (microseconds) | (Megabits/sec) | (Megabytes/sec) |
| Fast Ethernet | 667 | MPI | 86 | 81.5 | 10.2 |
| Fast Ethernet | 500 | MPI | 99 | 81.6 | 10.2 |
| Gigabit Ethernet | 500 | MPI | 155 | 254 | 31.7 |
| Fast Ethernet | 667 | TCP | 48 | 83.4 | 10.4 |
| Fast Ethernet | 500 | TCP | 52 | 83.5 | 10.4 |
| Gigabit Ethernet | 500 | TCP | 98 | 429 | 53.6 |
This networking performance is excellent compared to the
requirements of CAMP's parallel application codes.
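To get a feeling for what the measured latency and bandwidth mean for a given message size, one can use the common linear cost model, time = latency + size / bandwidth. This model is an assumption introduced here for illustration; NetPIPE itself measures the full curve. The sketch below also takes 1 Megabyte to be 10^6 bytes, so that 1 MB/sec equals 1 byte per microsecond. The function names are hypothetical.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mb_s):
    """Estimated time in microseconds to send nbytes in one message.

    With 1 MB = 10**6 bytes, a bandwidth of B MB/sec moves B bytes
    per microsecond.
    """
    return latency_us + nbytes / bandwidth_mb_s


def half_performance_size(latency_us, bandwidth_mb_s):
    """Message size in bytes at which half the time is spent on latency."""
    return latency_us * bandwidth_mb_s


# (label, latency in microseconds, bandwidth in MB/sec) taken from the MPI rows.
MEASURED = [
    ("Fast Ethernet, 667 MHz, MPI", 86, 10.2),
    ("Gigabit Ethernet, 500 MHz, MPI", 155, 31.7),
]

for label, latency, bandwidth in MEASURED:
    n_half = half_performance_size(latency, bandwidth)
    print(f"{label}: latency and transfer costs are equal near {n_half:.0f} bytes")
```

Messages much larger than this size are dominated by bandwidth, while much smaller messages are dominated by latency.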
The IP-network is configured as a Private Internet
(RFC 1918)
which does not consume scarce addresses from our IP-pool.
A complication arising from the use of a Private Internet is
that all network-services required for the proper operation
of the computers must be provided on the private network,
in addition to being provided on the public network.
This includes services such as DNS nameservice, an SMTP
mail-gateway, NTP timeservice, NIS (Network Information Service),
and NFS (Network File System).
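RFC 1918 reserves three address blocks for private internets: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. As a minimal sketch, the following Python code checks whether an address falls inside one of them, using only the standard ipaddress module. The addresses in the example are arbitrary and are not VALHAL's actual addresses, which are not given here.

```python
import ipaddress

# The three private address blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the IPv4 address string lies in an RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


# Arbitrary example addresses, for illustration only.
for candidate in ("10.1.2.3", "172.16.0.1", "192.0.2.1"):
    print(candidate, is_rfc1918(candidate))
```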
One XP1000 node is a dedicated fileserver providing a large disk space
to the entire cluster. This node has a mirrored system disk
(using the Tru64 UNIX Logical Storage Manager software) for reliability,
and a Gigabit connection to the Ethernet switch for performance.
A RAID disk system of 216 GB is used as primary file storage
on the fileserver.
The RAID system is a Voyager system delivered by Heinex Data.
The RAID system is based on the
Chaparral G5312
RAID controller, and on
Ultra2 SCSI technology
(bus-speed of 80 MB/sec), and our
IBM 36 GB disks
are configured as a RAID level 3 set.
With a 128 MB cache, the system may sustain more than 50 MB/sec throughput.
Backup: The capacity of the RAID-disks exceeds the capacity of
our tape-jukebox backup system!
Therefore, we do not perform any backups of the RAID-disks.
However, the data should be quite safely protected against
system malfunctions by the redundancy of the RAID technology.
It is important for the users to realize that if they destroy
their files on the RAID-disks, the files will be lost forever!
The physical installation of the 60 computers is on standard
shelves as shown in the
front-side photo.
There are 5 shelves, and the dimension of the entire system is
approximately 4 meters in width and 2.5 meters in height.
The bare shelves before mounting the computers are shown in
this photo.
The rear side displaying our mounting of cables etc. is shown in
this photo.
The cabling plan of the computers is
illustrated in this figure.
The network cables and the serial-port cables (for control purposes)
are contained in plastic channels mounted on the rear side
of the shelves.
The Ethernet switch is located at the center of the system
in order to minimize the cabling requirements.
Two 32-port
DECserver 900TM
serial-line terminal servers
are located on the top shelf, again because of cabling considerations.
The 220 Volt power cables are drawn
via the ceiling and along the shelves' carrying rods,
separated from the network and serial cables
in order to avoid electrical interference.
Groups consisting of 5 computers each are supplied from one power outlet,
for a total of 16 power groups (each rated at 16 Amps max).
A surge protector is installed in the main supply power line.
The maximum power consumption of an XP1000 is rated by Compaq as
615 W; however, this number is far from the reality.
Each XP1000 workstation in our configuration consumes less than
200 Watts of power, so the total power consumption is about 12 kW.
A standalone cooling unit of 20 kW is installed for supplying
chilled air at the front of the computer shelves.
The XP1000 cooling fans draw air through the front-plate
and emit air through the rear-plate, so the chilled air should
be supplied at the front side.
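The power and cooling budget can be checked with simple arithmetic. The sketch below uses the figures quoted above (60 computers, at most 200 W each, 5 computers per 16 A outlet, 220 V mains, 20 kW cooling) and assumes that the current is simply power divided by voltage, ignoring the power factor of the supplies. The constant names are illustrative.

```python
NODES = 60              # computers on the shelves
WATTS_PER_NODE = 200    # measured upper bound per XP1000
MAINS_VOLTS = 220
NODES_PER_OUTLET = 5
OUTLET_LIMIT_AMPS = 16
COOLING_KW = 20

# Total electrical load, which is also the heat the cooling unit must remove.
total_kw = NODES * WATTS_PER_NODE / 1000

# Assumption: current = power / voltage (power factor taken as 1).
amps_per_outlet = NODES_PER_OUTLET * WATTS_PER_NODE / MAINS_VOLTS

print(f"Total load: {total_kw:.1f} kW of {COOLING_KW} kW cooling capacity")
print(f"Current per outlet: {amps_per_outlet:.1f} A of {OUTLET_LIMIT_AMPS} A")
```

Had the rated 615 W per machine been used instead, the same calculation would give a total load well above the capacity of the cooling unit, which is why the measured consumption matters.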
Each XP1000 computer node runs the Compaq Tru64 UNIX operating system
version 4.0F, which is installed on the node's harddisk.
We document our customized Tru64 UNIX installation on a Compaq
Alpha computer in
this document,
which contains many hints learned "the hard way".
Each node's local disk contains the
Tru64 UNIX operating system, and a large swap space of 2 GBytes.
The remainder of the disk is laid out as a "scratch" disk area for the
temporary files of running batch jobs.
The CMU cloning of nodes seems to have many ideas in common with the
cloning methods explained in
How to Build a Beowulf.
The CMU software contains the
Performance Visualizer (PVIS) version 1.2.4
package, which allows monitoring of the operating system of
an entire cluster. Download of PVIS is also available from the
Compaq Web-page, but currently an old version 1.1.4 is offered.
While the public version 1.1.4 comes with a "Getting_Started"
document, the version 1.2.4 in CMU is undocumented at present.
Therefore we give the relevant details here:
On the nodes which you wish to monitor, install the
Performance Manager
software located in the CMU-software directory PMGR440,
or with the Tru64 UNIX CD set.
Select for installation only item 3,
"Performance Manager Daemons & Base" (PMGRBASE440).
This will start the pmgrd daemon, which however requires
that the snmpd daemon is configured and running
(snmpd is enabled by default).
On the servers or nodes where you want to run the Performance Visualizer
graphical display tool, you install the software located in the CMU-software
directory PPM124 (PPMBASE124 and PPMCDE124).
The Tcl and Tk software kits (OSFTCLBASE440 and OSFTKBASE440)
are prerequisites.
Now start the pvis monitoring tool, preferably not on a
compute-node. Select the menu item File->Connect and add the nodes
you want to monitor by entering their names in the Add field.
Finally press the Connect button.
Now select the menu item View->All to pick the quantities that
you want to monitor. Unfortunately, there is neither documentation
nor a man page for the pvis tool.
The configuration of pvis is performed on a per-user
basis using the configuration file .pvis in the user's
home directory. There does not seem to be any possibility of
a system-wide configuration file.
Our choice of batch system software is the Open Source
Portable Batch System (PBS),
which is offered for free to registered users, with the possibility
of commercial support.
The commercial version of PBS is available from
http://www.pbspro.com/.
Patches to PBS developed in the user-community are being
collected on the site http://www-unix.mcs.anl.gov/openpbs/.
The Portable Batch System is a flexible batch software processing system
developed at the
NASA Ames Research Center.
It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems.
The PBS batch system can use plug-in scheduler codes
for implementing local batch policies, and the PBS user community
seems to frequently use their own plug-in modules in order
to implement local policies.
However, we initially use the default FIFO-scheduler
provided in the PBS distribution.
A sophisticated batch scheduler is available from the
Maui Scheduler Home Page.
The Maui scheduler can interface with a number of batch systems,
including PBS, and offers better control over the usage policy
than the PBS FIFO scheduler.
We have been using the Maui scheduler on our cluster since November 2000,
and are very happy with the control over usage policies and resource utilization that Maui
enables. We're currently running Maui version 3.0.7.
The current version of OpenPBS is 2.3 (released 18 September 2000).
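For configuring PBS for Tru64 UNIX we suggest using the following
initial configuration command, which selects the native cc compiler:
configure --set-server-home=/var/spool/PBS --set-default-server=XXX --set-cc '--set-cflags=-g3 -O2'
where XXX is the hostname of your designated PBS server.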
We have written a mini-HowTo document
going through the steps required to install a fully functional
PBS environment.
Some of the most used commands in PBS are:
We use the popular MPICH, a freely available, portable implementation of
the MPI message-passing standard.
At the time of installation the version of MPICH was 1.1.2,
but version 1.2.0 was released on December 2, 1999.
We recommend that you use MPICH version 1.2 (or later),
and install the PBS batch system version 2.2 (or later).
If you use MPICH 1.1.2 with PBS 2.1, please contact us for a bug fix.
It is very important to us that our parallel application codes
deliver maximum performance on the Alpha CPUs.
We usually find very good performance delivered by
Compaq Fortran
(A manual
is on the Web)
and the Compaq C and C++ compilers.
In addition our applications require optimally hand-tuned
Basic Linear Algebra Subroutines (BLAS) as well as
3-dimensional complex Fast Fourier Transforms (FFT),
which are delivered by the
Compaq Extended Math Library (CXML/DXML).
However, the FFT subroutine in DXML is only optimally
efficient for power-of-2 grid sizes, at present.
For non-power-of-2 grids, the Open Source FFT library
FFTW is actually found
to provide robust all-round performance, and we recommend
this FFT library for general grid sizes.
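The rule of thumb above can be expressed as a small selection helper. This is only a sketch of the decision, not of either library's interface: the function names are hypothetical, and the strings "DXML" and "FFTW" merely label which library the text would favor for a given 3-dimensional grid.

```python
def is_power_of_two(n):
    """True if n is a positive integer power of 2 (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
    return n > 0 and (n & (n - 1)) == 0


def choose_fft_library(grid):
    """Pick a library label for a grid given as a tuple of dimensions.

    Hypothetical helper: DXML when every dimension is a power of 2,
    FFTW for general grid sizes.
    """
    if all(is_power_of_two(n) for n in grid):
        return "DXML"
    return "FFTW"


for grid in [(64, 64, 128), (48, 48, 60)]:
    print(grid, "->", choose_fft_library(grid))
```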
The key software packages that allow our parallel application codes to
deliver maximum performance on the Alpha CPUs include:
Valhal is massively parallel: It has 540 gates, and through each
gate 800 warriors can walk side by side. Hence the name is
appropriate for a highly parallel computer. The massive
wall-like appearance of our VALHAL computer
is also reminiscent of
the monumental size of the mythological Valhal.
Further readings on Nordic mythology may be found at these links:
Network configuration
Fileserver hardware
Physical installation
VALHAL software
Operating system software
For efficiency reasons, the Compaq
Tru64 UNIX operating system
(in a previous life named Digital UNIX)
is used on the computer nodes, rather than
the Linux operating system usually found on Beowulf computers.
We explain our reasoning below.
Cluster management software
It is nontrivial to install and administer a many-node cluster
of independent systems.
The CAMP cluster of Alphastation XP1000 machines has been bought together
with Compaq's
Cluster Management Utility (CMU) software,
which is produced by
Compaq Custom Systems in Annecy-le-Vieux, France.
The CMU software is available for Linux and Tru64 UNIX Alpha clusters
from Compaq, and an inquiry can be made via the above CMU Web-page.
Performance Visualizer software
Turning your cluster into a Beowulf system
Having installed the basic operating system onto a bunch
of nodes doesn't make a Beowulf cluster suitable for parallel
computing. You will want to set up seamless integration
among all of the nodes, and you want to enable some kind
of centralized software management.
We have written a Beowulf cluster mini-HowTo
explaining one possible way to configure a cluster for
parallel computing.
Batch system software
Several commercial batch system software packages are
also available on Tru64 UNIX. However, we found that the
commercial packages demanded a very high price, because we
are building an unusual system of quite many nodes,
and excessive license charges are apparently the policy of
the vendors.
We decided that we prefer to buy a number of additional
computer nodes, rather than spending our money on commercial batch
system software.
configure --set-server-home=/var/spool/PBS --set-default-server=XXX --set-cc '--set-cflags=-g3 -O2'
where XXX is the hostname of your designated PBS server.
Message-passing software
A distributed-memory parallel computer is usually programmed
in parallel using explicit message-passing.
That is, when parallel processes need to exchange data,
the user code sends data from one CPU to the user code
running on another CPU.
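The send/receive pattern can be illustrated without an MPI installation. The following Python sketch is a stand-in only: it uses standard-library threads and queues, which share one memory space, whereas real MPI processes on a distributed-memory machine do not. What it preserves is the programming style, in which data moves between workers solely through explicit send and receive calls. All names are illustrative.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive a list of numbers, sum it, and send the result back."""
    data = inbox.get()               # blocking receive
    outbox.put((rank, sum(data)))    # explicit send of the partial result


# One queue per direction plays the role of the network connection.
to_worker = queue.Queue()
from_worker = queue.Queue()

thread = threading.Thread(target=worker, args=(1, to_worker, from_worker))
thread.start()

to_worker.put([1.0, 2.0, 3.0])       # "rank 0" sends work to "rank 1"
rank, partial_sum = from_worker.get()  # "rank 0" receives the answer
thread.join()

print(f"Received partial sum {partial_sum} from rank {rank}")
```

In an MPI code the two queue operations would correspond to matching send and receive calls between two ranks.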
Compiler and library tools
Why don't we use Linux on this Beowulf-like cluster ?
We're actually happy users of Linux Intel PCs for our desktops,
and it might make sense to put Linux on the supercomputer as well.
The reason for choosing Tru64 UNIX is our extreme demands
for performance. We will not tolerate any unnecessary bottlenecks
on our batch production systems. Another aspect is stability,
where we find Tru64 UNIX to be very stable on the XP1000 hardware.
In conclusion we feel that Compaq's Tru64 UNIX is preferred over
the present-day Linux operating system for purely pragmatic reasons,
namely that the quest for maximum performance overrules any other
concerns such as the flavor of UNIX or Linux.
In all other respects VALHAL is a true Beowulf machine !
VALHAL - the name
Last update: 07 Jan 2003.
Copyright © 2003 Center for Atomic-scale Materials Physics. All rights reserved.