NCSA: National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Leading Edge Site

NCSA NT Cluster Consortium

October 22, 1998

An NCSA Private Sector Program Opportunity
For Infrastructure Development Partners, IDP

Executive Summary

Wintel-based computational systems are becoming increasingly interesting and relevant to scientific and technical computing, long the domain of UNIX-based systems. The rapidly growing capabilities of Intel processors, hardware interconnects and storage systems, together with the continued maturation of Microsoft Windows NT and NT-based software, have produced the technology to cluster NT systems for efficient allocation of all available cycles and for high performance computing.

During 1998 NCSA created a 256-processor "NT Supercluster" to explore the technical issues of both the architecture and the use of such systems. Architecture issues relate primarily to interconnect, storage, management and middleware. Usage studies have involved working with over a dozen application codes and their users as well as third party application suppliers.

In terms of management, it was important to treat the Supercluster not as 128 dual-processor NT machines but as a single "system," so our staff worked to put tools in place with this in mind.

Middleware for the NT Supercluster has included a project based on Microsoft DCOM -- the NCSA Symera Distributed System (Symera) project (http://symera.ncsa.uiuc.edu/). The Symera project is aimed at easily bringing the power of parallel computing to the growing number of codes being written for NT and using DCOM objects. Beyond the application development and runtime environment, Symera provides tools for management not only of tightly coupled clusters such as the Supercluster but also of loosely coupled systems on the desktop.

In 1999, NCSA plans to increase the size of the cluster to 512 processors. This will allow for further investigation and development of management tools, middleware, and applications.

The NT Cluster Consortium is an activity within the Alliance Private Sector Program (PSP) structured to provide Infrastructure Development Partners (IDP) with access to NCSA’s NT technical expertise, software and hardware. An IDP is one or more Alliance academic partners, national labs, a single company, a number of companies acting as a consortium, or a consortium of larger and/or smaller companies working with NCSA or with NCSA and other Alliance university partners. The Consortium is also open to Strategic Industrial Partners (SIP) and to Strategic Technology Partners (STP).

Consortium members will be kept abreast of current developments through technical updates and reports, and will be able to attend a yearly workshop for updates on NT technologies undertaken at NCSA. They will have access to the NT Supercluster and early access to the Symera software for internal commercial use, and will be able to interact with the development staff to discuss specific technical questions of interest to the member. The NT Cluster Consortium is also intended to serve as a forum for members to interact with one another and with key hardware and software suppliers involved in the NCSA efforts.

Of critical interest to many Consortium members will be the availability of direct access to the NCSA NT Supercluster researchers, developers and hardware systems for the purposes of performing software development, computational usage and consulting on a contractual basis. Potential projects include porting existing applications, developing new applications using object oriented techniques, data mining efforts, system evaluation, and assistance and consulting for on-site NT cluster programs.

Technology Overview

The two primary projects at NCSA that are the keystones of the Consortium are the NT Supercluster computational system and the NCSA Symera Distributed System (Symera). The NT Supercluster program is focused on providing a computational platform for current message passing applications. The Symera program is creating a distributed object system that provides both a development environment for new applications and an object management system for allocating resources on an NT cluster. These two programs are complementary and are aimed at accommodating current and future applications.

The NT Supercluster is a tightly coupled cluster with 192 PII cpus and a high speed interconnect (80MB/s bandwidth and 17 usec latency) and is focused on providing a platform for mature serial and parallel applications. It can provide computational cycles for these applications as well as a porting and testing environment for applications being migrated from UNIX systems. It is also a development environment for new applications that are implemented in C/C++/Fortran with message passing or shared memory. The capability of this system will keep pace with technical developments in hardware and software through a series of on-going upgrades and testbed systems.

Symera is a distributed object system that is built upon DCOM. Structurally, it consists of two lines of development. First, it is an object management system for a cluster of NT workstations which allocates resources, schedules processes/jobs, implements fault tolerance/object migration and provides user interfaces/APIs for controlling and observing distributed processes. Second, Symera is a set of object libraries that enables developers to create objects that inherently support the interfaces that are required to interact with the management system. The libraries are built using scalable levels of base class support, giving developers flexibility in how much of that support they use. Symera can be used to develop and run parallel, distributed applications exploiting the high availability and low cost of NT workstations. Symera can be used to run applications on the dedicated NT Cluster discussed above or on a set of distributed NT workstations with the intent of harvesting unused cycles.

Benefits of Consortium Membership

Participation in the NT Cluster Consortium provides the following benefits:

  1. Meeting with NCSA technical staff to discuss all options and benefits of the Consortium membership and explore the specific interests of the partner. An initial meeting between partner representatives and NCSA technical staff will be held to ensure clear understanding of the partner's needs. This half to full day technology orientation (to be held at NCSA) is intended to bring the partner up to date with the NCSA NT efforts and provide an opportunity for sufficient discussion to understand the individual needs of the partner in this technology area.
  2. Access to development reports and presentations. The current status of development will be made available on a monthly basis to the Consortium members in the form of short progress reports. These reports will outline the lessons learned in both NT Supercluster and Symera developments.
  3. Technology update. NCSA will provide an annual in-depth workshop at NCSA (customized to Consortium membership needs) to which representatives of all Consortium members will be invited. The purpose of the workshop will be to update their representatives about NT technology developments at NCSA and within the Alliance.
  4. Participation in a Symera early access program. Although new versions of the Symera software will be released on the ftp server when stabilized, the Consortium members will be able to use the software for internal business purposes with technical support for installing and using the software. The Symera development team would be available (on a priority basis) to answer questions and provide guidance concerning the installation and use of the Symera software. Support and training in the use of Symera would be limited to no more than a total of 4 hours in no more than 4 instances. This level of support and consulting is not available to the public at large.
  5. Access to the NT Supercluster. Consortium members will have access to the development machines for code porting and testing experiments by up to three people. A reservation of 2000 processor hours will be available on the NT Supercluster for Consortium members to conduct code-porting experiments as friendly users. This access can be in the form of a dedicated block of time on a subset of machines or through the general time-shared batch system. General technical information will be available online for using and programming the system. Direct interaction with the NT Supercluster developers will be limited to no more than a total of 4 hours in no more than 4 instances to handle specific code porting issues. The development team would be available on a priority basis to resolve system level problems.
  6. Identification of an NCSA contact person. A contact person from each of the Symera and Supercluster technical teams will be identified for technical questions and regular interaction. A specific contact person will be identified for administrative questions as well.
  7. Tailored training and projects. Consortium members will be able to contract with NCSA for specific, in-depth training in areas covered by the NT Cluster Consortium. (See the "Research Opportunities" section below.)
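
To give a sense of scale for the 2000 processor hour reservation in item 5, the short sketch below converts it into wall-clock time at a few job sizes. It assumes that a processor hour is charged as one CPU held for one hour of wall-clock time; the text does not spell out the accounting rule, and the job sizes are illustrative only.

```python
# Reservation quoted in benefit 5 above.
RESERVATION_CPU_HOURS = 2000

def wall_clock_hours(cpus: int) -> float:
    """Hours of wall-clock time the reservation covers when `cpus`
    processors are held continuously (assumed accounting rule:
    processor hours = CPUs x wall-clock hours)."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return RESERVATION_CPU_HOURS / cpus

# Illustrative job sizes; not limits stated by NCSA.
for cpus in (1, 16, 64, 128):
    print(f"{cpus:4d} CPUs -> {wall_clock_hours(cpus):8.2f} wall-clock hours")
```

Under that assumption, a run holding 128 processors would use up the reservation in a little under 16 hours, so the allocation suits porting and scaling tests rather than production runs.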

Requirements of Consortium Membership

Membership in the NT Cluster Consortium requires the following commitments:

  1. Joining the PSP as an IDP (SIPs and STPs are prequalified).
  2. Annual fee of $50,000.
  3. Identification of a contact person within the company willing to assist in securing the working relationship.
  4. One to two page description of member initiatives and existing expertise. Whenever possible, a short explanation of the partner's primary goals in the NT area would facilitate an understanding of the goals and interests for each member. This information would be considered proprietary and treated as confidential information unless otherwise indicated by the partner.

The NT Cluster Consortium is intended for companies building and using cluster technology as end-users. NCSA has a companion program, the Strategic Technology Partner (STP) program, intended for technology suppliers. STP partners are also encouraged to participate in the NT Cluster Consortium.

Potential Operating Agreement Projects

In addition to the base level benefits provided to Consortium members, as IDPs, the companies participating in this activity may have priority status in developing operating agreements concerning Symera and Supercluster developments. Since both staff and time resources are limited for the Supercluster and Symera projects, the operating agreements needed to facilitate continued development will be cultivated from Consortium members. Examples of the types of projects that could be defined as individually priced operating agreements are provided below:

  1. Port a key application to run on Symera. A Symera developer would work directly with a representative of the company to port a key Windows NT application to run on Symera. The company would retain ownership of the original application and the modifications to that application.
  2. On-site support for installing and using the Symera software. A member of the Symera development team would work with a company representative on-site at the company. The Symera team would advise the company representatives in the development of applications or porting of current applications using the company resources. The details of these on-site visits and supporting pre and post-visit interactions would be negotiated on an individual basis as an operating agreement.
  3. Priority technical support for installing and using the software. The Symera development team would be available on a priority basis to answer questions and provide guidance concerning the installation and use of the Symera software.
  4. Access to NT Supercluster machine cycles for the execution of serial and parallel applications. The NT Supercluster is a computational resource that would be available for use by a member. This could occur either through the batch system for long term, non-time critical applications on shared machines or by preset time periods on a subset of the cluster that have been dedicated solely to the member.
  5. Direct support in porting of applications to the NT Supercluster. Members would be able to set up extended visits to the NCSA by applications programmers who need focused time in the cluster development environment for specific projects, including benchmarking and testing of proprietary codes.
  6. Access to the NT Supercluster developers and the testbeds for high performance interconnects and storage area networks. These testbeds are planned as a proving ground for new technologies that may be adopted for future upgrades of the Supercluster. Participation in and support of the testbed efforts could give the Consortium member direct access to system test results and new technologies.
  7. On-site support for the replication of an NT Supercluster. The NT Supercluster technology might be replicated at a member’s home site. The NCSA developers could provide advice on purchases, assist in the setup and configuration of the cluster, and make available training materials for the member’s system administrators and users.

NCSA Symera Distributed System

Background

With the exponentially growing need to access and process large amounts of data in a relatively short period of time and the corresponding growth in the number of available personal computers, the practicality of coupling those resources to meet the data analysis needs is obvious. The best method for running a parallel process on a cluster of machines to capitalize on unused cycles is less obvious. However, Microsoft’s development of the Distributed Component Object Model (DCOM) provides a means of capitalizing on both the assets of object oriented program development and the benefits of parallel processing.

The NCSA Symera Distributed System (Symera) is a distributed object system that is built upon DCOM. Structurally, it consists of two lines of development. First, it is an object management system for a cluster of NT workstations which allocates resources, schedules processes/jobs, implements fault tolerance/object migration and provides user interfaces/APIs for controlling and observing distributed processes. Second, Symera is a set of object libraries that enables developers to create objects that inherently support the interfaces that are required to interact with the management system. The libraries are built using scalable levels of base class support, giving developers flexibility in how much of that support they use. Symera can be used to develop and run parallel, distributed applications exploiting the high availability and low cost of NT workstations.

The benefits of successfully harvesting unused cycles on a large cluster of machines can be gained by developing applications to be controlled through Symera. A consortium of companies and the Symera team will drive the development of the Symera system and provide a competitive advantage to these companies in fully utilizing their compute resources.

Status of Symera

Two developer-only releases of Symera were made in 1997 (as "developer previews I and II"). Since then, much of the Symera code has been rewritten to take into account feedback and experience gained with these earlier releases. Major changes include 1) a complete rewrite of the Symera object libraries; 2) a much-improved administrative user interface (now called the Symera Viewer) and 3) a streamlining of the Symera installation process. As a result of these changes, the next release of Symera will be considerably more user- and developer-friendly.

A special limited release of Symera is scheduled for October, 1998. This release will be restricted to a limited number of users and developers including testbeds at Princeton University and the University of Washington. The limited release will also be available to NT Cluster Consortium members. A full public release, Symera 1.0, is on target for completion at the end of January, 1999.

Future Objectives

Short term development goals for Symera include expanding its resource management capabilities (such as those listed below) across a wide cluster of NT machines. To drive this development, specific industrial applications are needed.

  1. Migration of jobs after a fault has occurred
  2. Migration of jobs after user intervention
  3. Scalability
  4. Security
  5. Cross domain support

Other Potential Collaborations

Consortium members will also be provided with the opportunity to work with Symera staff on special projects such as porting specific codes to NT or, based on mutually agreeable terms, to customize Symera for individual company needs. Training opportunities in NT, DCOM, Symera, and related areas may also be negotiated for broader technology transfer within member companies. These projects would incur additional expenses for use of NCSA resources and can be negotiated by Consortium members at any time. Ability to address prospective special project requests will be affected by availability of resources on the Symera team.

For additional information about Symera, consult the following URL: http://symera.ncsa.uiuc.edu/

NCSA NT Supercluster

The NT Supercluster is an Alliance project to enable the integration of NT-based Intel systems into high performance computing. It is being used for scientific applications, infrastructure research, development and deployment.

The NT Supercluster is a cluster of 192 Intel PII CPUs running NT. It uses Myrinet interconnect hardware with the HPVM software system from Andrew Chien’s Concurrent Systems Architecture Group to provide a high level of performance for scientific applications. Scientific applications moved to the NT cluster to date include six major codes in C, C++, and Fortran that have been run on up to 192 cpus.

The capability of the NT Supercluster is clearly evident in applications such as a 2D Navier-Stokes kernel, which executed at 6.9 GF on 128 300 MHz PII cpus in the cluster, compared with 14 GF on 128 R10000 cpus in an SGI O2000. This type of capability and the expertise to take advantage of it are available to Consortium members.
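
The comparison above can be put on a per-processor basis with a few lines of arithmetic. The sketch below is only an illustration using the two figures quoted in the text; it assumes "GF" means sustained GFLOPS for the whole run and that both runs used all 128 processors.

```python
# Figures quoted in the text for the 2D Navier-Stokes kernel.
CPUS = 128
NT_CLUSTER_GFLOPS = 6.9   # 128 x 300 MHz Pentium II, NT Supercluster
SGI_O2000_GFLOPS = 14.0   # 128 x R10000, SGI O2000

def per_cpu_mflops(total_gflops: float, cpus: int) -> float:
    """Average sustained MFLOPS per processor (assumes GF = GFLOPS)."""
    return total_gflops * 1000.0 / cpus

nt_per_cpu = per_cpu_mflops(NT_CLUSTER_GFLOPS, CPUS)
sgi_per_cpu = per_cpu_mflops(SGI_O2000_GFLOPS, CPUS)
ratio = NT_CLUSTER_GFLOPS / SGI_O2000_GFLOPS

print(f"NT Supercluster: {nt_per_cpu:.1f} MFLOPS per CPU")
print(f"SGI O2000:       {sgi_per_cpu:.1f} MFLOPS per CPU")
print(f"NT cluster delivers {ratio:.0%} of the O2000 rate on this kernel")
```

On these numbers the cluster sustains roughly 54 MFLOPS per processor against roughly 109 on the O2000, or about half the throughput at equal processor count for this one kernel.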

Status

The NT Supercluster is being deployed in stages as we establish the viability and robustness of the hardware and software systems that comprise the system. The cluster will become available

Model            Systems  CPU                Memory  Disk    Interconnects
HP Kayak XU      64       Dual PII, 300 MHz  512 MB  4.5 GB  Myrinet, Fast Enet
Compaq PWS 6000  32       Dual PII, 333 MHz  512 MB  4 GB    Myrinet, Fast Enet

The cluster is currently at 192 processors, which we plan to increase by a factor of 2-5 over the next three years. During this multi-year process, we will be constantly evaluating new technologies, improving the current system through continuous upgrades and deploying the established systems as computational systems for single cpu and distributed applications.
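
As a quick consistency check, the 192-processor figure follows from the hardware table above, and the planned growth factor gives a rough target range. The sketch below simply restates that arithmetic; the range is a projection from the stated factor of 2-5, not a committed configuration.

```python
# (model, number of systems, CPUs per system) from the hardware table.
NODES = [
    ("HP Kayak XU", 64, 2),      # dual PII, 300 MHz
    ("Compaq PWS 6000", 32, 2),  # dual PII, 333 MHz
]

total_systems = sum(count for _, count, _ in NODES)
total_cpus = sum(count * cpus for _, count, cpus in NODES)

# Planned growth "by a factor of 2-5 over the next three years".
projected_low, projected_high = 2 * total_cpus, 5 * total_cpus

print(f"{total_systems} systems, {total_cpus} CPUs today")
print(f"Projected range: {projected_low} to {projected_high} CPUs")
```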

We have been focusing our effort on the development and deployment of the NT Supercluster as a robust, scalable computational resource. Areas of focus include:

  1. Availability of the NT Supercluster system for batch oriented serial applications
  2. Maturation of the infrastructure of NT Supercluster to support parallel applications
  3. Development and deployment of a storage area network with the NT Supercluster
  4. Deployment of the necessary infrastructure to support parallel database applications
  5. Installation and support of third party applications.

Applications Porting and Development for NT Clusters

There will be vigorous activity in this area over the next 2-3 years as more applications make the transition to NT as a primary execution platform. We have an extensive experience base in porting serial and parallel (MPI) applications from UNIX to NT. For UNIX developers moving a functional code to the NT Supercluster, the development environment minimizes the barrier to getting started by providing a familiar set of commands (including gmake, tar, ar, bash, emacs, etc.). For experienced Windows developers, it provides the standard Microsoft development environment.

New software technologies such as OpenMP, software DSM, HPF and Globus are expected to be tools that enable the development of new distributed applications on the NT cluster. Symera is an example of an emerging technology, native to NT, that will provide the necessary framework to develop distributed applications for NT clusters. The NT Supercluster might be used as a scalability testbed for Symera applications as well as a controlled environment for evaluating and benchmarking them.

General porting assistance and direct involvement or collaboration on an application porting effort will require negotiation of an operational agreement for the particular effort.

Computational Usage

Serial Application Throughput Engine

A significant fraction, approximately 1/3 of the current systems, is being deployed as a large-scale batch throughput engine for single threaded or single system applications. The machines will only be running one or at most two user processes per dual cpu machine, thus allowing an individual process to access up to 512MB of memory and 6GB of disk.

Consortium members will be able to access these serial computational systems for execution of their applications at a level commensurate with their interest and support. It will also be possible to dedicate blocks of these machines to specific computational efforts by Consortium members for extended periods of time on a contract basis.

Parallel Applications Testbed and Computational Resource

The NT Supercluster is currently available as a porting platform for MPI-based parallel applications that have been matured on other platforms, such as the SGI O2000 or Cray T3E. It provides a unique large-scale testbed for parallel applications under the NT operating system on Intel hardware. This resource has very high bandwidth (~80MB/s), low latency (~17usec) and a high processor count (up to 128 PII cpus on the HP systems alone, or 192 across the HP and Compaq systems) for testing the performance and scaling of message passing applications.
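
To see what the quoted interconnect figures mean for a message passing code, the sketch below applies a simple first-order cost model: transfer time equals a fixed latency plus message size divided by peak bandwidth. This model is an assumption made for illustration, not NCSA's measurement method, and it takes 1 MB as 10^6 bytes; measured behavior under HPVM and MPI will differ, particularly for very small and very large messages.

```python
LATENCY_S = 17e-6       # ~17 usec, as quoted in the text
BANDWIDTH_BPS = 80e6    # ~80 MB/s, assuming 1 MB = 10**6 bytes

def transfer_time(nbytes: float) -> float:
    """Estimated one-way time in seconds for a message of nbytes
    (assumed model: fixed latency + size / peak bandwidth)."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS

def effective_bandwidth(nbytes: float) -> float:
    """Bytes per second actually delivered for a message of nbytes."""
    return nbytes / transfer_time(nbytes)

# Message size at which half of peak bandwidth is reached:
# latency * bandwidth, about 1360 bytes for these figures.
half_power_bytes = LATENCY_S * BANDWIDTH_BPS

for size in (64, 1024, 16 * 1024, 1024 * 1024):
    mb_per_s = effective_bandwidth(size) / 1e6
    print(f"{size:8d} bytes: {transfer_time(size) * 1e6:9.1f} usec, "
          f"{mb_per_s:5.1f} MB/s effective")
print(f"Half of peak bandwidth at about {half_power_bytes:.0f} bytes")
```

Under this model, codes that exchange messages much smaller than about 1.4 KB are dominated by latency, while codes that exchange larger messages approach the peak bandwidth. That is one reason aggregating small messages is a common step when tuning message passing applications.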

Consortium members will be able to access the parallel computational systems for execution of their applications on a research basis at a level commensurate with their interest and support. It will also be possible to dedicate blocks of these machines to specific computational efforts by Consortium members for extended periods of time on a contract basis.

Consulting and Replicating the NT Supercluster

One of the goals of the NT Supercluster effort is the creation of mechanisms to disseminate the knowledge and expertise associated with this type of technology to partners. We can provide research results, documentation and consulting experience to aid a Consortium member in setting up a functional serial or parallel NT cluster at their home site. This type of activity would entail very close collaboration with the partner and we would expect them to have considerable experience on the NCSA NT Supercluster before undertaking it. Arrangements can be made for training courses and on-site assistance in constructing such a cluster.

Research Opportunities

One of the major roles of the NCSA NT cluster effort is to perform the necessary path-finding work for large scale NT clusters in a high performance environment. The work includes a large proportion of research and development as well as experimenting with and evaluating new technologies in distributed systems and storage. Consortium members will have access to the results of the research conducted by the NCSA staff.

We are leveraging the expertise and experience of the NCSA NT Cluster group, the Supercluster hardware, the applications teams and our industrial partners to drive these efforts in the most fruitful directions. The real world requirements of the applications guide our work on a set of key technologies: interconnects, storage systems, storage area networks and middleware.

These key technologies enable us to push the leading edge in computational clusters, then deploy the technologies in a cluster for routine use and carry that development into the upcoming generations of systems, interconnects and storage architectures.

Collaborative research agreements in specific areas of interest with individual Consortium members can be developed at any time during the period of membership.

Points of Contact

For additional information about the NT Cluster Consortium, please contact the following people:

John McKelvey
217-265-5045
mckelvey@ncsa.uiuc.edu

Jae Allen
217-244-3364
jallen@ncsa.uiuc.edu

Rob Pennington
217-244-1052
robp@ncsa.uiuc.edu

Pat Flanigan
217-244-5602
pflaniga@ncsa.uiuc.edu

 

