An NCSA Private Sector Program Opportunity
For Infrastructure Development Partners, IDP
Wintel-based computational systems are becoming increasingly interesting and relevant to scientific and technical computing, long the domain of UNIX-based systems. The rapidly growing capabilities of Intel processors, hardware interconnects and storage systems, together with the continued maturation of Microsoft Windows NT and NT-based software, have produced the technology to cluster NT systems for efficient allocation of all available cycles and for high performance computing.
During 1998 NCSA created a 256-processor "NT Supercluster" to explore the technical issues of both the architecture and the use of such systems. Architecture issues relate primarily to interconnect, storage, management and middleware. Usage studies have involved working with over a dozen application codes and their users as well as third party application suppliers.
In terms of management, it was important to think about the Supercluster not as 128 dual-processor NT machines but as a single "system", so our staff worked to put tools in place with this in mind.
Middleware for the NT Supercluster has included a project based on Microsoft DCOM -- the NCSA Symera Distributed System (Symera) project (http://symera.ncsa.uiuc.edu/). The Symera project is aimed at easily bringing the power of parallel computing to the growing number of codes being written for NT and using DCOM objects. Beyond the application development and runtime environment, Symera provides tools for management not only of tightly coupled clusters such as the Supercluster but also of loosely coupled systems on the desktop.
In 1999, NCSA plans to increase the size of the cluster to 512 processors. This will allow for further investigation and development of management tools, middleware, and applications.
The NT Cluster Consortium is an activity within the Alliance Private Sector Program (PSP) structured to provide Infrastructure Development Partners (IDP) with access to NCSA’s NT technical expertise, software and hardware. An IDP is one or more Alliance academic partners, national labs, a single company, a number of companies acting as a consortium, or a consortium of larger and/or smaller companies working with NCSA or with NCSA and other Alliance university partners. The Consortium is also open to Strategic Industrial Partners (SIP) and to Strategic Technology Partners (STP).
Consortium members will be kept abreast of current developments through technical updates and reports, and will attend a yearly workshop for updates on NT technologies undertaken at NCSA. They will have access to the NT Supercluster and early access to the Symera software for internal commercial use, and will be able to interact with the development staff to discuss specific technical questions of interest to the member. The NT Cluster Consortium is also intended to serve as a forum for members to interact with one another and with key hardware and software suppliers involved in the NCSA efforts.
Of critical interest to many Consortium members will be the availability of direct access to the NCSA NT Supercluster researchers, developers and hardware systems for the purposes of performing software development, computational usage and consulting on a contractual basis. Potential projects include work in porting existing applications, developing new applications using object oriented techniques, data mining efforts, system evaluation and assistance and consulting for on-site NT cluster programs.
The two primary projects at NCSA that are the keystones of the Consortium are the NT Supercluster computational system and the NCSA Symera Distributed System (Symera). The NT Supercluster program is focused on providing a computational platform for current message passing applications. The Symera program is creating a distributed object system that provides both a development environment for new applications and an object management system for allocating resources on an NT cluster. These two programs are complementary and are aimed at accommodating current and future applications.
The NT Supercluster is a tightly coupled cluster with 192 PII cpus and a high speed interconnect (80MB/s bandwidth and 17 usec latency) and is focused on providing a platform for mature serial and parallel applications. It can provide computational cycles for these applications as well as a porting and testing environment for applications being migrated from UNIX systems. It is also a development environment for new applications that are implemented in C/C++/Fortran with message passing or shared memory. The capability of this system will keep pace with technical developments in hardware and software through a series of on-going upgrades and testbed systems.
Symera is a distributed object system that is built upon DCOM. Structurally, it consists of two lines of development. First, it is an object management system for a cluster of NT workstations which allocates resources, schedules processes/jobs, implements fault tolerance/object migration and provides user interfaces/APIs for controlling and observing distributed processes. Second, Symera is a set of object libraries that enables developers to create objects that inherently support the interfaces that are required to interact with the management system. The libraries are built using scalable levels of base class support, giving the developer flexibility in how much support to use. Symera can be used to develop and run parallel, distributed applications exploiting the high availability and low cost of NT workstations. Symera can be used to run applications on the dedicated NT Cluster discussed above or on a set of distributed NT workstations with the intent of harvesting unused cycles.
Participation in the NT Cluster Consortium provides the following benefits:
Membership in the NT Cluster Consortium requires the following commitments:
The NT Cluster Consortium is intended for companies building and using cluster technology as end-users. NCSA has a companion program, the Strategic Technology Partner (STP) program, intended for technology suppliers. STP partners are also encouraged to participate in the NT Cluster Consortium.
In addition to the base level benefits provided to Consortium members, as IDPs, the companies participating in this activity may have priority status in developing operating agreements concerning Symera and Supercluster developments. Since both staff and time resources are limited for the Supercluster and Symera projects, the operating agreements needed to facilitate continued development will be cultivated from Consortium members. Examples of the types of projects that could be defined as individually priced operating agreements are provided below:
With the exponentially growing need to access and process large amounts of data in a relatively short period of time and the corresponding growth in the number of available personal computers, the practicality of coupling those resources to meet the data analysis needs is obvious. The best method for running a parallel process on a cluster of machines to capitalize on unused cycles is less obvious. However, Microsoft’s development of the Distributed Component Object Model (DCOM) provides a means of capitalizing on both the assets of object oriented program development and the benefits of parallel processing.
The NCSA Symera Distributed System (Symera) is a distributed object system that is built upon DCOM. Structurally, it consists of two lines of development. First, it is an object management system for a cluster of NT workstations which allocates resources, schedules processes/jobs, implements fault tolerance/object migration and provides user interfaces/APIs for controlling and observing distributed processes. Second, Symera is a set of object libraries that enables developers to create objects that inherently support the interfaces that are required to interact with the management system. The libraries are built using scalable levels of base class support, giving the developer flexibility in how much support to use. Symera can be used to develop and run parallel, distributed applications exploiting the high availability and low cost of NT workstations.
The benefits of successfully harvesting unused cycles on a large cluster of machines can be gained by developing applications to be controlled through Symera. A consortium of companies and the Symera team will drive the development of the Symera system and provide a competitive advantage to these companies in fully utilizing their compute resources.
Two developer-only releases of Symera were made in 1997 (as "developer previews I and II"). Since then, much of the Symera code has been rewritten to take into account feedback and experience gained with these earlier releases. Major changes include 1) a complete rewrite of the Symera object libraries; 2) a much-improved administrative user interface (now called the Symera Viewer) and 3) a streamlining of the Symera installation process. As a result of these changes, the next release of Symera will be considerably more user- and developer-friendly.
A special limited release of Symera is scheduled for October, 1998. This release will be restricted to a limited number of users and developers including testbeds at Princeton University and the University of Washington. The limited release will also be available to NT Cluster Consortium members. A full public release, Symera 1.0, is on target for completion at the end of January, 1999.
Short term development goals for Symera include expanding its resource management capabilities (such as those listed below) across a wide cluster of NT machines. To drive this development, specific industrial applications are needed.
Consortium members will also be provided with the opportunity to work with Symera staff on special projects such as porting specific codes to NT or, based on mutually agreeable terms, to customize Symera for individual company needs. Training opportunities in NT, DCOM, Symera, and related areas may also be negotiated for broader technology transfer within member companies. These projects would incur additional expenses for use of NCSA resources and can be negotiated by Consortium members at any time. Ability to address prospective special project requests will be affected by availability of resources on the Symera team.
For additional information about Symera, consult the following URL: http://symera.ncsa.uiuc.edu/
The NT Supercluster is an Alliance project to enable the integration of NT-based Intel systems into high performance computing. It is being used for scientific applications, infrastructure research, development and deployment.
The NT cluster is an NT-based 192 Intel PII CPU cluster that uses Myrinet interconnect hardware with the HPVM software system from Andrew Chien’s Concurrent Systems Architecture Group to provide a high level of performance for scientific applications. Scientific applications moved to the NT cluster to date include six major codes in C, C++, and Fortran that have been run on up to 192 cpus.
The capability of the NT Supercluster is clearly evident in applications such as a 2D Navier-Stokes kernel, which executed at 6.9 GF on 128 300 MHz PII cpus in the cluster, compared with 14 GF on 128 R10000 cpus in an SGI O2000. This type of capability and the expertise to take advantage of it are available to Consortium members.
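To put the two aggregate rates on a per-processor basis, the short Python sketch below simply divides each figure by its processor count. The only inputs are the numbers quoted above; the function and variable names are illustrative and are not part of any NCSA software.

```python
def per_cpu_mflops(aggregate_gflops: float, cpus: int) -> float:
    """Return the average sustained rate per processor, in MFLOPS."""
    return aggregate_gflops * 1000.0 / cpus


# Figures quoted in the text for the 2D Navier-Stokes kernel.
nt_rate = per_cpu_mflops(6.9, 128)      # NT Supercluster, 300 MHz PII
o2000_rate = per_cpu_mflops(14.0, 128)  # SGI O2000, R10000

# Fraction of the O2000 per-processor rate reached by the NT cluster.
ratio = nt_rate / o2000_rate

print(f"NT Supercluster: {nt_rate:.1f} MFLOPS per cpu")
print(f"SGI O2000:       {o2000_rate:.1f} MFLOPS per cpu")
print(f"NT / O2000:      {ratio:.2f}")
```

On this kernel the cluster therefore sustains roughly half the per-processor rate of the O2000 at the same processor count.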
The NT Supercluster is being deployed in stages as we establish the viability and robustness of the hardware and software systems that comprise the system. The cluster will become available
| Vendor | Model | Systems | CPU | Memory | Disk | Interconnects |
|---|---|---|---|---|---|---|
| HP | Kayak XU | 64 | Dual PII, 300 MHz | 512 MB | 4.5 GB | Myrinet, Fast Enet |
| Compaq | PWS 6000 | 32 | Dual PII, 333 MHz | 512 MB | 4 GB | Myrinet, Fast Enet |
The cluster is currently at 192 processors, which we plan to increase by a factor of 2-5 over the next three years. During this multi-year process, we will be constantly evaluating new technologies, improving the current system through continuous upgrades and deploying the established systems as computational systems for single cpu and distributed applications.
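As a quick check on the arithmetic, the following Python sketch recomputes the processor count from the configuration table and the range implied by a 2-5x increase. The `NodeGroup` class and its field names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """One row of the configuration table (illustrative structure)."""
    vendor: str
    model: str
    systems: int
    cpus_per_system: int  # every system in the table is a dual-PII machine


groups = [
    NodeGroup("HP", "Kayak XU", systems=64, cpus_per_system=2),
    NodeGroup("Compaq", "PWS 6000", systems=32, cpus_per_system=2),
]

# Total processors across all systems in the table.
total_cpus = sum(g.systems * g.cpus_per_system for g in groups)

# Range implied by the planned factor of 2-5 growth.
planned_low, planned_high = 2 * total_cpus, 5 * total_cpus

print(f"current: {total_cpus} cpus")
print(f"planned: {planned_low} to {planned_high} cpus")
```

The table yields 192 processors, so the stated growth factor corresponds to a cluster of between 384 and 960 processors.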
We have been focusing our effort on the development and deployment of the NT Supercluster as a robust scalable computational resource. These areas include:
There will be vigorous activity in this area over the next 2-3 years as more applications make the transition to NT as a primary execution platform. We have an extensive experience base for applications porting from UNIX to NT for serial and parallel (MPI) applications. For UNIX developers moving a functional code to the NT Supercluster, the development environment minimizes the barrier to getting started by providing a familiar set of commands (including gmake, tar, ar, bash, emacs, etc.); for experienced Windows developers, it provides the standard Microsoft development environment.
New software technologies such as OpenMP, software DSM, HPF and Globus are expected to be tools that enable the development of new distributed applications on the NT cluster. Symera is an example of an emerging technology that is native to NT and that will provide the necessary framework to develop distributed applications for NT clusters. The NT Supercluster might be used as a scalability testbed for Symera applications as well as a controlled environment for evaluating and benchmarking them.
General porting assistance and direct involvement or collaboration on an application porting effort will require negotiation of an operational agreement for the particular effort.
A significant fraction, approximately 1/3 of the current systems, is being deployed as a large-scale batch throughput engine for single threaded or single system applications. The machines will only be running one or at most two user processes per dual cpu machine, thus allowing an individual process to access up to 512MB of memory and 6GB of disk.
Consortium members will be able to access these serial computational systems for execution of their applications at a level commensurate with their interest and support. It will also be possible to dedicate blocks of these machines to specific computational efforts by Consortium members for extended periods of time on a contract basis.
The NT Supercluster is currently available as a porting platform for MPI-based parallel applications that have been matured on other platforms, such as the SGI O2000 or Cray T3E. It provides a unique large-scale testbed for parallel applications under the NT operating system on Intel hardware. This resource has very high bandwidth (~80MB/s), low latency (~17usec) and a high processor count (up to 128 HP or 192 HP+Compaq system PII cpus) for testing the performance and scaling of message passing applications.
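The quoted bandwidth and latency can be combined into a rough estimate of message transfer time. The Python sketch below uses the common first-order model, time = latency + size / bandwidth. That model is an assumption made here for illustration, not a measurement or a method taken from the Supercluster project, and it treats 1 MB as 10^6 bytes.

```python
LATENCY_S = 17e-6      # ~17 usec latency, as quoted above
BANDWIDTH_BPS = 80e6   # ~80 MB/s, assuming 1 MB = 10**6 bytes


def transfer_time(nbytes: float) -> float:
    """Estimated time in seconds to deliver one message of nbytes."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS


def effective_bandwidth(nbytes: float) -> float:
    """Bytes per second actually achieved for a message of nbytes."""
    return nbytes / transfer_time(nbytes)


# Message size at which half of the peak bandwidth is reached:
# the latency term and the size/bandwidth term are then equal.
half_size = LATENCY_S * BANDWIDTH_BPS

for n in (64, 1024, 65536, 1048576):
    mbps = effective_bandwidth(n) / 1e6
    print(f"{n:>8} bytes: {transfer_time(n) * 1e6:9.1f} usec, {mbps:5.1f} MB/s")
print(f"half-bandwidth message size: {half_size:.0f} bytes")
```

Under this model, messages much smaller than about 1.4 KB are dominated by latency, while messages of tens of kilobytes or more approach the full link bandwidth. This matters when judging how a message passing application will scale.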
Consortium members will be able to access the parallel computational systems for execution of their applications on a research basis at a level commensurate with their interest and support. It will also be possible to dedicate blocks of these machines to specific computational efforts by Consortium members for extended periods of time on a contract basis.
One of the goals of the NT Supercluster effort is the creation of mechanisms to disseminate the knowledge and expertise associated with this type of technology to partners. We can provide research results, documentation and consulting experience to aid a Consortium member in setting up a functional serial or parallel NT cluster at their home site. This type of activity would entail very close collaboration with the partner and we would expect them to have considerable experience on the NCSA NT Supercluster before undertaking it. Arrangements can be made for training courses and on-site assistance in constructing such a cluster.
One of the major roles of the NCSA NT cluster effort is to perform the necessary path-finding work for large scale NT clusters in a high performance environment. The work includes a large proportion of research and development as well as experimenting with and evaluating new technologies in distributed systems and storage. Consortium members will have access to the results of the research conducted by the NCSA staff.
We are leveraging the expertise and experience of the NCSA NT Cluster group, Supercluster hardware and the applications teams and industrial partners to drive the efforts in the most fruitful directions by drawing on the real world requirements of the applications for a set of key technologies in interconnects, storage systems and storage area networks and middleware:
These key technologies enable us to push the leading edge in computational clusters, then deploy the technologies in a cluster for routine use and carry its development into the upcoming generations of systems, interconnects and storage architectures.
Collaborative research agreements in specific areas of interest with individual Consortium members can be developed at any time during the period of membership.
For additional information about the NT Cluster Consortium, please contact the following people:
John McKelvey
217-265-5045
mckelvey@ncsa.uiuc.edu
Jae Allen
217-244-3364
jallen@ncsa.uiuc.edu
Rob Pennington
217-244-1052
robp@ncsa.uiuc.edu
Pat Flanigan
217-244-5602
pflaniga@ncsa.uiuc.edu