HPC in Cloud Computing: March 2011

Tuesday, March 15, 2011

The Impact of Cloud Computing on Corporate IT Governance

While cloud computing is enabling some fundamental changes in how IT groups deliver services, from a corporate management viewpoint the basic principles of IT governance still remain true. However, the advent of cloud computing is having an increasing impact on how the components of the governance process are executed. For the purpose of this article, we will use the COBIT model (Control OBjectives for Information and related Technology), which comprises five major process focus areas: Strategic Alignment, Value Delivery, Resource Management, Risk Management, and Performance Measurement.
Governance at its core is the effective management of the IT function to ensure that an organization is realizing maximum value from its investments in information technology. Many companies, especially those with considerable IT budgets, have implemented significant internal IT governance procedures to manage their IT investment portfolio. This governance function provides the processes and framework for the management team to analyze, understand, and manage the level of return on the organization's technology investments. Industry studies show that companies with effective IT governance processes in place spend, on average, 5-7 percent less on IT to deliver the same functionality than companies that do not.
Any proper IT governance function also requires active management participation, the proper forum to make IT related decisions, and effective communication between the IT organization and the company's management team. While these factors are critical to creating a successful IT governance function, there are five essential areas of process focus as spelled out in the COBIT model, which are described here:

Strategic Alignment: This focuses on ensuring the linkage of business and IT plans; defining, maintaining and validating the IT value proposition; and aligning IT projects and operations with enterprise operations.
Value Delivery: This is about executing the value proposition throughout the delivery cycle, ensuring that IT delivers the promised benefits against the strategy, concentrating on optimizing costs and proving the intrinsic value of IT.
Resource Management: This is about the optimal investment in, and the proper management of, critical IT resources: applications, information, infrastructure and people. Key issues relate to the optimization of system knowledge and technical infrastructure.
Risk Management: This requires risk awareness by senior corporate officers, a clear understanding of the enterprise's appetite for risk, understanding of compliance requirements, transparency about the significant risks to the enterprise and embedding of risk management responsibilities into the IT organization.
Performance Measurement: This tracks and monitors strategy implementation, project completion, resource usage, process performance and service delivery, using, for example, balanced scorecards that translate strategy into action to achieve goals measurable beyond conventional accounting.

If the IT governance framework isn't implemented and managed correctly, this can adversely impact how well IT delivers on its commitments to its customers along with how IT is perceived within the organization. Lack of effective IT strategy, governance and oversight can cause continued issues with project overruns or even outright failures, project stakeholder dissatisfaction, and reduced business value received in relation to the resources expended. Companies that properly manage their IT function operate with a higher level of certainty that they are receiving an appropriate level of value from their investments in information technology. They also have the ability to ensure that the IT group is working on the projects that provide the most business value to the organization.
Now that we have discussed the impact of cloud computing on the IT group, let's examine how cloud computing affects the five governance factors as defined in the COBIT model.
Value Delivery: Under the pre-cloud provisioning model, most new projects included costs for hardware to support the application and usually for testing and development environments also. IT was also guilty of over-buying hardware to ensure that if there were performance issues they were at least not hardware-related and to provide capacity for peak loads that might never materialize. Cloud computing offers several options that can change the cost model and free up more of the IT budget for innovation and not for under-utilized hardware and associated support. One option would be to provision test and QA instances via the cloud instead of purchasing additional servers or to shift peak loads to the cloud instead of maintaining that capacity internally. Cloud-based tools could also enable rapid prototyping, allowing for quicker delivery of business applications. With the potential cost savings, projects that were cost prohibitive may now be viable or funds freed up to support additional projects. Certainly some of these issues can be addressed using virtualization but cloud gives the IT group another tool in its tool kit to attack business problems. With the right strategy and mix of technologies, the IT group can deliver more value for potentially less money. There is one caveat. In order to ensure that proper value is being delivered, the IT organization needs to have a firm grasp on its internal cost structure as mentioned above in order to correctly drive investments.
Resource Management: One of the challenges in any IT group is appropriately managing the resources at its disposal to provide as much business value as possible. Cloud computing can impact the resources available to IT in a variety of ways. From a personnel standpoint, cloud will require a shift in operational skill sets from a more internally focused system services mentality to a more holistic system viewpoint oriented around delivering business value and not system infrastructure. IT staff will need to have increased knowledge of the value chain in the business to better understand where cloud technologies can fit in and to also recognize where they are not appropriate. IT management should include a plan to deal with the personnel skills changes required and incorporate that into any overall cloud adoption strategy. Cloud can also impact system resources by requiring additional network bandwidth, monitoring tools, or other items to appropriately manage and maintain this new hybrid environment.
Risk Management: This is one of the most critical areas of governance impacted by cloud computing. Critical questions arise when cloud computing is brought into the existing IT ecosystem. These questions include those oriented to data protection and business continuity, such as the impact on existing disaster recovery plans, how backups/restores and data archival policies are affected, and how any business continuity plans are affected. IT management must have a clear understanding of risk related to vendor service levels, strategies for mitigating that risk and how any potential outages would impact the business. IT also must examine security access and potential risks from putting corporate data into the cloud and what the potential impact might be on the business if data is lost or access control is breached. Other risks that need to be addressed revolve around the viability of the vendor, long-term prospects of any particular technology, and the impact to the existing IT infrastructure. All these questions and more must be asked and addressed, particularly as cloud computing is embraced for more critical business applications and IT services.
Performance Measurement: This area looks at the overall achievement of the IT organization. While cloud does not directly impact the purpose of this portion of the governance process, it does modify some aspects of the underlying key performance measures. Performance measurement is directed at providing management with information on how the IT group is performing outside of conventional accounting measures, such as project completion, resource usage, service delivery, and user support metrics. While not integral to the adoption of cloud computing, the setting of governance goals and objectives should take into account the impact of using cloud resources. This could include completing projects quicker by provisioning resources via the cloud or using cloud resources to speed prototyping, or higher efficiencies in using funding and personnel resources by leveraging cloud capabilities. IT organizations will need to review their metrics and measurements and adjust them accordingly.
Strategic Alignment: The primary goal of IT governance is to ensure alignment with organizational objectives; cloud computing would not have a significant impact on this area of the IT governance process. Regardless of the technical architecture being proposed for a project, the management team needs to maintain the linkage of business goals and IT plans and ensure that IT projects and operations align with the enterprise needs.

Cloud Computing Vendors for HPC

---

3Leaf Systems

Newcomer 3Leaf Systems enables a cloud computing environment to be built from low-cost commodity servers by providing virtualization of CPU and memory resources for an entire server farm.

Adaptive Computing

Formerly Cluster Resources, Inc., Adaptive Computing offers a range of HPC cluster and cloud computing middleware products. When combined with the company's self-service Moab Cloud Portal, the Adaptive Operating Environment delivers an architecture for utility-based computing environments.

Amazon Elastic Compute Cloud

Amazon Elastic Compute Cloud (Amazon EC2) is a Web service that provides resizable compute capacity in a public cloud. It is designed to make Web-scale computing easier for developers. Services are provided as a pay-for-usage model, with high-memory and high-CPU instances available.

Bull

Bull's bullx servers are positioned as platforms for "Extreme Computing," optimized for running high performance computing, very large-scale online transaction processing (OLTP), and cloud computing workloads.

Cycle Computing

Cycle Computing provides open source cloud solutions, including CycleServer and CycleCloud, for deploying Condor grids in the Cloud, and CloudFS, a storage cloud based on Apache's Hadoop.

Darkstrand

Darkstrand is a Chicago-based startup that is commercializing the National LambdaRail (NLR) network and connecting HPC expertise and resources at NLR-affiliated national labs and universities with companies in media and entertainment, manufacturing, biotech and financial services.

Google App Engine

App Engine is a commercial cloud platform for building and deploying Web applications on Google's network of server farms -- the same infrastructure that powers Gmail, Google Docs, and Google Calendar.

Gompute (a Gridcore company)

Gompute provides on-demand high performance computing for technical and scientific computing applications. Consultants and ISVs are able to sell their services and software licenses via the Gompute platform.

HP

HP provides the building blocks for cloud infrastructure as well as design and integration support for large-scale datacenters. The company also offers its own Cloud Consulting Services, a Cloud Discovery Workshop, and a Cloud Roadmap Service.

IBM

IBM has a portfolio of cloud offerings, services, and technologies including hardware/software building blocks for cloud platforms. IBM's Computing on Demand (CoD), an HPC cloud infrastructure service, offers clients the ability to rent HPC Clusters hosted in global IBM cloud centers by the hour, week or year.

Microsoft Azure Platform

The Azure Services Platform is a cloud environment that provides a set of services for the development, management and hosting of applications across Microsoft datacenters. Using these services, developers can build their own applications to be deployed in the cloud.

Nimbis Services

Nimbis acts as a clearinghouse for buyers and sellers of what they call "Digital Analysis Computing (DAC)" services. It does this via pre-negotiated access to high performance computing services, software and expertise from on-demand vendors, ISVs and domain experts.

NVIDIA RealityServer

The NVIDIA RealityServer platform combines Tesla GPUs and 3D Web services software into a cloud platform that delivers interactive, photorealistic content over the Web. The resulting applications can be used by media artists, product designers, engineers, architects, scientists and consumers.

Penguin Computing

Penguin on Demand (or “POD”) is the company's HPC-as-a-service offering aimed at end users and SaaS providers looking for a high performance, on-demand environment. CPU cycles can be rented on a pay-as-you-go basis or through a monthly subscription.

Platform Computing

Platform's ISF product is a private cloud platform that builds on the company's HPC cluster and grid management expertise. ISF is designed to support a shared computing infrastructure and deliver application environments according to workload-aware and resource-aware policies.

R Systems

R Systems provides HPC hardware resources on-demand for academic researchers and commercial organizations, offering both Linux and Windows-based cluster systems.

SGI CloudRack

SGI's CloudRack C2 is a server cabinet targeted for cloud-scale infrastructure. The enclosure is built for extreme density and energy efficiency, and is designed to operate in hot datacenter environments -- up to 104°F. The CloudRack X2 is a variant designed for HPC workloads.

Sun Microsystems

Sun offers hardware and software building blocks for cloud building. The company's Open Cloud Platform is designed to support both public and private clouds. Its own public cloud service, the Sun Cloud, is due out later this year.

The MathWorks

MATLAB includes built-in support to run on the European EGEE grid (Enabling Grids for E-sciencE). In addition, MathWorks parallel computing products can be configured to run MATLAB and Simulink applications on Amazon's EC2 platform.

ToutVirtual

ToutVirtual enables enterprise and HPC users to automate cloud computing workloads. VirtualIQ, the company's flagship product, allows users to view and manage their servers, applications, storage and clients.

T-Services

An affiliate of T-Platforms Holding, T-Services is a Russian company that offers a range of high performance computing services including providing access to supercomputing infrastructure, computational software, and HPC expertise. The company also helps customers manage their own HPC sites.

Univa UD

Univa offers a range of cloud software products that allow HPC and enterprise users to build and manage cloud computing environments. Private, public, and hybrid cloud environments are all supported.

VMware

VMware's vCloud delivers a single way to run, manage and secure applications. Through VMware's ecosystem of cloud service providers, you can get VMware virtualized services ranging from on-demand, pay-as-you-go infrastructure to enterprise-class, production-ready offerings.

Wolfram Alpha

Wolfram Alpha is a Web-based computational platform, based on Mathematica and utilizing large-scale HPC infrastructure. A newly announced API gives users the capability to build custom applications on top of the platform.

Grids or Clouds for HPC?

Grids didn't keep all their promises
Grids did not evolve (as some of us originally thought) into the next fundamental IT infrastructure for everything and for everybody. Because of the diversity of computing environments we had to develop different middleware stacks (department, enterprise, global, compute, data, sensors, instruments, etc.), and had to face different usage models with different benefits. Enterprise grids were (and are) providing better resource utilization and business flexibility, while global grids are best suited for complex R&D application collaboration with resource sharing. For enterprise usage, setting up and operating grids was often complicated. For researchers this characteristic was seen to be a necessary evil. Implementing complex applications on HPC systems has never been easy. So what.
Grid: the way station to the cloud
After 40 years of dealing with HPC, grid computing was indeed the next big thing for the grand challenge, big-science researcher, while for the enterprise CIO, the grid was a way station on its way to the cloud model. For the enterprise today, clouds are providing all the missing pieces: ease of use, economies of scale, business elasticity up and down, and pay-as-you-go pricing (thus getting rid of some CapEx). And in cases where security matters, there is always the private cloud. In more complex enterprise environments, with applications running under different policies, private clouds can easily connect to public clouds -- and vice versa -- into a hybrid cloud infrastructure, to balance security with efficiency.
Different policies, what does that mean?
No two application jobs are alike. Jobs differ by priority, strategic importance, deadline, budget, IP and licenses. In addition, the nature of the code often necessitates a specific computer architecture, operating system, memory, and other resources. These important differentiating factors strongly influence where and when a job is running. For any new type of job, a set of specific requirements determines the set of specific policies that have to be defined and programmed, such that any of these jobs will run strictly according to these policies. Ideally, this is guaranteed by a dynamic resource broker that controls submission to grid or cloud resources, be they local or global, private or public.
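As a rough illustration of how such policies might be encoded, here is a minimal Python sketch of a broker decision. Everything in it is an assumption made for illustration: the `Job` fields, the `choose_target` function, the target names and the order of the rules are hypothetical and are not taken from any real broker product.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description; the fields are illustrative assumptions."""
    name: str
    tightly_coupled: bool   # needs a low-latency interconnect between processes
    restricted_data: bool   # IP or licence terms keep the job in-house
    deadline_hours: float   # time left until results are needed
    budget_usd: float       # maximum spend allowed for this job


def choose_target(job: Job, local_wait_hours: float, cloud_cost_usd: float) -> str:
    """Return the resource class a broker might pick under these toy policies."""
    if job.tightly_coupled:
        return "hpc-grid"            # architecture requirement is checked first
    if job.restricted_data:
        return "private-cloud"       # data policy rules out public resources
    if local_wait_hours > job.deadline_hours and cloud_cost_usd <= job.budget_usd:
        return "public-cloud"        # burst out only if deadline and budget allow
    return "private-cloud"


sweep = Job("param-sweep", False, False, deadline_hours=4, budget_usd=200)
print(choose_target(sweep, local_wait_hours=12, cloud_cost_usd=80))  # public-cloud
```

A real broker would weigh many more factors (priority, licences, operating system, memory), but the shape is the same: requirements go in, and a placement that satisfies the policies comes out.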
Grids or clouds?
One important question is still open: how do I find out, and then tell the resource broker, whether my application should run on the grid or in the cloud? The answer, among others, depends on the algorithmic structure of the compute-intensive part of the program, which might be intolerant of high latency and low bandwidth. This has been observed with benchmark results. The performance limitations are exhibited mainly by parallel applications with tightly-coupled, data-intensive inter-process communication, running on hundreds or even thousands of processors or cores.
The good news is, however, that many HPC applications do not require high bandwidth and low latency. Examples are parameter studies often seen in science and engineering, with one and the same application executed for many parameters, resulting in many independent jobs, such as analyzing the data from a particle physics collider, identifying the solution parameter in optimization, ensemble runs to quantify climate model uncertainties, identifying potential drug targets via screening a database of ligand structures, studying economic model sensitivity to parameters, and analyzing different materials and their resistance in crash tests, to name just a few.
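To make the "many independent jobs" pattern concrete, here is a small Python sketch of a parameter study. The `simulate` function is a hypothetical stand-in for one run of a real application, and the local thread pool is only a placeholder for the cloud instances or grid nodes that would run each job in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(param: float) -> float:
    """Hypothetical stand-in for one run of the application."""
    return param * param - 4.0 * param


def parameter_study(params, max_workers=4):
    """Run one independent job per parameter and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # No job depends on another, so they can run in any order, anywhere.
        results = list(pool.map(simulate, params))
    return dict(zip(params, results))


study = parameter_study([0.0, 1.0, 2.0, 3.0])
best = min(study, key=study.get)
print(best, study[best])
```

Because the jobs never talk to each other, interconnect latency and bandwidth hardly matter, which is what makes this kind of workload a natural fit for cloud resources.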

A Grid in the cloud
One great example of a project that has built a grid in the cloud is Gaia, a European Space Agency funded mission which aims to monitor one billion stars. Amazon Machine Images (AMIs) were configured for the Oracle database grid and processing software (AGIS). The result is an Oracle grid running inside the Amazon Elastic Compute Cloud (EC2). To process five years of data for 2 million stars, 24 iterations of 100 minutes each translates into 40 hours of 20 EC2 CPU instances. Benefits include reduced costs (50 percent compared to the in-house solution) and massive scalability on demand without having to invest in new infrastructure or train new personnel. And only a single line of source code was changed in order to get it to run in the cloud.
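The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above; the variable names are my own.

```python
ITERATIONS = 24               # iterations quoted in the text
MINUTES_PER_ITERATION = 100   # minutes per iteration quoted in the text
INSTANCES = 20                # EC2 instances quoted in the text

wall_clock_hours = ITERATIONS * MINUTES_PER_ITERATION / 60
instance_hours = wall_clock_hours * INSTANCES

print(wall_clock_hours, instance_hours)  # 40.0 800.0
```

In other words, the run amounts to 800 instance-hours, which is the quantity a pay-per-hour vendor would actually bill for.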
HPC needs grids and clouds
According to the DEISA Extreme Computing Initiative (DECI), there are still plenty of grand challenge science and engineering applications that can only run effectively on the largest and most expensive supercomputers. DEISA, a European HPC grid also called the HPC Ecosystem, is made up of 11-teraflops nodes.
Today, nobody would build an HPC cloud for these particular applications. It simply wouldn't be a profitable business: the "market" (i.e., the HPC users) is far too small and thus lacks economy of scale. In some specific science application scenarios, with complex workflows of different tasks (nodes), a hybrid infrastructure might make sense: cloud capacity resources and HPC capability nodes, providing the best of both worlds.

9 Things to Know When Comparing Cloud Vendors

1. Most HaaS and IaaS providers offer “four nines” (99.99%) uptime SLAs [42]. Due to customer demand, many cloud providers are seriously considering providing “five nines” SLAs later this year.
2. Amazon EC2 charges for incoming bandwidth, whereas GoGrid does not.
3. Amazon uses Xen virtualization.
4. Amazon officially only supports RHEL 5.1+.
5. Amazon charges $0.10 per VM per hour for compute capacity and $0.15 per GB per month for data storage [43].
6. Amazon EC2 provides its own firewall and networking configuration. Standard HPC networking configurations do not translate well to EC2.
7. Data in the Google App Engine datastore has to be in BigTable format, which is quite different from the relational database format.
8. The typical admin-to-user ratio in an enterprise environment is 1 to 100. By contrast, the typical admin-to-user ratio in the public cloud environment is 1 to 20,000 [44].
9. Some vendors provide you with a single bill for all your cloud computing services instead of separate bills from various cloud vendors. This centralized/unified billing model can result in economies of scale that lower per-unit costs.
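To put the uptime percentages from item 1 into perspective, the following Python sketch converts an availability figure into the downtime it allows per year. The function name is my own, and the calculation assumes a 365-day year and that the SLA is measured over the whole year, which real contracts may not do.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down while still meeting the SLA."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for label, pct in [("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label}: about {allowed_downtime_minutes(pct):.1f} minutes per year")
```

Four nines allows a little under an hour of downtime a year, while five nines allows only a few minutes, which is why the jump between the two is a much bigger commitment than it looks.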

Vendor Requirements: What to Look For In a Cloud Vendor

The option to utilize public clouds is not a black and white choice. It is very much a case-by-case evaluation based on application, tradeoffs and opportunity cost.
Companies looking to tap into the full potential and promise of cloud computing should partner with a trusted vendor who can devise a solution that meets their specific needs. Look for vendors that possess the following attributes and capabilities:
1. Have a good solution for HPC private clouds and can also provide seamless transition from there to public clouds.
2. Allow you to manage your compute capacity and facilitate similar setup, configuration and tear-down of your compute resources in both private and public HPC cloud environments.
3. Provide iron-clad user authentication and authorization in public clouds.
4. Guarantee the security of data while being transported to the public cloud or while inside the cloud.
5. Provide a complete cloud computing software stack (see Figure 1 on the next page).
6. Have experience deploying public clouds and have best practices that they can share with you of their current cloud deployments.
7. Have conducted extensive security audits for real-world customers.
8. Can provide ironclad guarantees in the contract and/or insurance about the persistency of your corporate data within the public cloud environment.
9. Have an enterprise architecture that can address any system integration concerns.
10. Can provide support for older versions of the OS in the public cloud environment.
11. Provide solutions that are open-standards-based and portable to prevent vendor lock-in, ideally supported by the open source community [20].
12. Are there when you need them – you should be absolutely clear on whom to call when there is a problem [21].
13. Are not one-trick ponies: they don’t do only one thing, like supporting just private HPC clouds. Customers should have the choice to decide which type of cloud solution is right for them: private, public or hybrid.

Grid, HPC Cluster and Cloud, Part 2: A Developer Perspective

Comparison

Table 1 outlines our discussion.

Oddly enough, there are similarities between an HPC Cluster, which is considered to be very expensive, and our Cloud infrastructure. There is one very important difference between the two, however: SLA requirements! This makes sense if you look at Part 1 of this article. HPC Clusters have very fast networks, whereas in the Cloud the public internet is used for communication. For this reason, the type of work that gets farmed out to a Cloud is very different from the type of work that is scheduled on an HPC Cluster. Typically speaking, Cloud workloads are longer running and resemble a workflow, like loan processing, large data crunching routines, data mining, etc. HPC Cluster workloads are more repetitive and shorter in duration -- problems that are known as Massively Parallelizable Problems, where a smaller chunk of the work is scheduled on each node. HPC clusters benefit from the fact that there are many nodes, a fast scheduler that can utilize these nodes, and high-speed connectivity for communication. Cloud infrastructures are typically managed by FIFO queues, and not much policy enforcement takes place. In addition, Cloud environments are VM based and as such less powerful than bare metal machines. Grids are somewhere in the middle of the pack. There is good connectivity between the nodes, but the nodes are more heterogeneous in nature: fast machines, VM's, old 486's, etc. For this reason, the jobs that get farmed out to the Grid are more diverse in nature. The infrastructure could get very large -- into the tens of thousands of nodes, dispersed globally. There are pockets of Grids (known as Virtual Organizations), and generally speaking a committee that manages the environment. As far as the types of jobs that get farmed out to the Grid, there is no rhyme or reason or pattern to them. The idea is that there is at least one (!) resource that is suitable to your needs. This is a very high-level view of things, but the reality is not much different:

Pockets of resources in different data centers
The resources are of different types
The scheduler works hard to try to figure out what is the suitable resource for your request

What does that mean to you? If you have a large enough Grid that is capable of handling a range of requests, it can be your savior. This is due to the fact that application integration is very challenging and time consuming.

Integration

In my opinion, Cloud application integration is the simplest. That is due to the fact that I would put my long-running jobs on the cloud and spend as little time as possible trying to get these applications integrated and fine-tuned with the cloud APIs. Cloud, if you recall, is connectivity over the public internet with little SLA requirement. As such, saving a couple of seconds will not matter to me. Grid is where I aim to be next. Application integration is tough and will require, at times, re-engineering of a legacy application to benefit from your Grid framework and infrastructure. Many choose to stop here. The application goes into maintenance mode after this phase, and some extra features get implemented over the years. Some are brave enough to have a tighter integration with the underlying hardware in order to get the best bang for the buck. This type of integration is very fine-tuned, and you are essentially re-engineering your application to benefit from everything from the amount of memory to the details of the physical layer protocol of your network for better performance.

There is no chicken and egg

The steps that I outlined in the previous section get repeated over and over again. You start with a wide-brush-stroke approach to the integration as new features are added and keep working your way down as you fine-tune your infrastructure. I guess that is all that I want you to take away from these two articles: continuous integration of your application to take into account the capabilities of your underlying infrastructure. If your application can be parallelized, then you want to move more towards an HPC Cluster with tight integration of the hardware. If your application is more workflow type that cannot be parallelized, then you will move more towards a Cloud infrastructure where low cost VM's can be used to process your request. Again, there are so many exceptions out there that make these statements look irrelevant, but I hope that you gained some perspective of how to at least approach the problem.

The Real Cost of Cloud Computing

We have up to four categories of cloud, which are shown in figure 1.

Figure 1: Classes of Cloud Computing.
As you can imagine, there is a cost associated with each of these options. What are these costs? Everywhere you look there is a cost-benefit analysis from vendors, especially for external public clouds. (I can't imagine why.) But what are the real costs? I will do some estimation to get to these numbers, but I do believe that these numbers are correct.

Overview on Cost

We are looking at an apples-to-apples comparison of what the cost would be for each of the options shown in figure 1. We are not interested in the pie-in-the-sky scenarios that cloud vendors publish. I have nothing against cloud or any vendor for that matter, but we need to get past the hype and start to justify the cost.
This is what I would like to get to:

* If you go and lease or buy 10 machines
* and one person is managing this cluster of 10 machines
* and you are utilizing this cluster 100% of the time for a month

What is the cost? (= A)

Then, I would like to get a number for a cloud vendor:

* you rent 10 machines, similar to above
* and you use these machines 100% of the time for a month

What is the cost? (= B)

Let’s compare A and B and see what we get. In fact, we can graph it and see where it makes sense to switch to cloud. I am sure that there is a turning point: if you are, for example, utilizing the machines over 50% of the time, it is cheaper to self-host.
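Here is a minimal Python sketch of that turning point. It assumes that the self-hosted cost A is a fixed monthly amount paid regardless of use, while the cloud cost B scales with the hours actually used. The dollar figures are hypothetical and were chosen only so that the break-even lands at the 50% used as an example above; plug in your own numbers.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def cloud_cost(machines: int, rate_per_machine_hour: float, utilization: float) -> float:
    """Cost B: pay only for the hours actually used (utilization in 0..1)."""
    return machines * HOURS_PER_MONTH * utilization * rate_per_machine_hour


def break_even_utilization(fixed_monthly_usd: float, machines: int,
                           rate_per_machine_hour: float) -> float:
    """Utilization at which A equals B; above it, self-hosting is cheaper."""
    return fixed_monthly_usd / (machines * HOURS_PER_MONTH * rate_per_machine_hour)


# Hypothetical inputs: $2,880/month all-in for 10 owned machines (cost A),
# and $0.80 per machine-hour from a cloud vendor.
u = break_even_utilization(2880.0, machines=10, rate_per_machine_hour=0.80)
print(f"break-even utilization: {u:.0%}")
```

Below the break-even utilization the cloud bill is smaller than the fixed cost of owning the machines; above it, the owned cluster wins.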
I wanted to add a very important note regarding the size of our infrastructure that we are considering for the Cloud. Obviously, if you want to Cloud-ify a single server, then Cloud makes sense. This is simply for the fact that a system admin to manage “a” server will cost too much. We will focus on economies of scale here. We will focus on a scenario where you need about 1000-10000 cores!
For anything smaller, a number of different variables need to come into play, which "clouds" the topic.

Cost Calculation

Imagine that you wanted to have a compute backbone of 1000 cores. The table below enumerates what you could expect under three different scenarios.

Interesting; not completely accurate, but very interesting.
Some readers will send me hate emails, but that’s ok. Some will say that I am completely off on my numbers, and that’s ok as well. You can play with the numbers all you want, but the plain and simple fact is that if you rent 1000 cores from a cloud vendor for 10 cents/hr per core, you will pay over $900k for one year of usage:

Obviously, this figure includes $50k for IT staff: that’s one System Admin part time for the year.
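For readers who want to play with the numbers, the rental part of that figure can be reproduced in a few lines of Python. The core count and hourly rate come from the text; the staff cost is passed in separately, and the 50,000 used here is an assumption for a part-time system admin.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def annual_rental_usd(cores: int, cents_per_core_hour: int) -> float:
    """Cost of renting every core for every hour of the year, in dollars."""
    return cores * cents_per_core_hour * HOURS_PER_YEAR / 100


rental = annual_rental_usd(cores=1000, cents_per_core_hour=10)
staff = 50_000  # assumption: part-time system admin cost per year

print(f"rental ${rental:,.0f}, with staff ${rental + staff:,.0f}")
# rental $876,000, with staff $926,000
```

The rental alone comes to $876,000 at 100% utilization, so the total only passes $900k once the staff cost is added.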

Arguments Against

There are a number of arguments against my calculation and generally speaking the approach that I have taken with the calculation. Let me cover some of them:

* One part-time IT staff is not enough.
* 5 cents/hr/core is too low.
* Some costs are not included, such as networking, security, building and configuring the OS, etc.

And last, but certainly not least:

* The assumption that cores are used for the entire year is wrong!

These are the basic arguments. There could be others, but the list represents the majority of them.

Counter Argument

Let’s take the simplest one first: IT staff size. I would break the IT staff into two parts: system administration and application support. Application support is required regardless of where your applications are deployed. Many think that once an application is moved to the cloud, the need for application support disappears as well. That’s where you get into trouble! Application support always needs to be there!
Your system admins need to have an automated way of maintaining your servers. It is beyond the scope of this article, but that is a must. Much of system admin in either Windows or Linux can be automated, and if you hire a part-time senior admin, s/he will automate your process to the last step. Why? Because it makes their life easier if everything is automated. If you have not automated your processes, cloud won’t help either, as you still need to create images, configure images, etc.
5 cents/core/hr might be a little too low, yes. Not by much though. I would argue that a cloud vendor needs to have a 100% markup on its services, and that’s where I got the 5 cents/core/hr. If you were a vendor, and I have been one, you have a markup of 100%, at least!
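The markup reasoning can be turned into a quick back-of-the-envelope check, which also speaks to the "cores are not used all year" objection. This is only a sketch under simplifying assumptions that are mine, not part of the calculation above: the in-house cost is treated as a fixed 5 cents/core/hr paid for every hour of the year, and the cloud is billed only for the hours actually used.

```python
def cost_from_price(price_per_core_hour, markup):
    """Back out the vendor's underlying cost from its price and markup.

    A markup of 1.0 means 100%: price = cost * (1 + markup).
    """
    return price_per_core_hour / (1 + markup)


def break_even_utilization(in_house_rate, cloud_rate):
    """Fraction of the year the cores must be busy before owning is cheaper.

    Assumes in-house cores cost `in_house_rate` for every hour of the year,
    while cloud cores cost `cloud_rate` only for the hours they are used.
    """
    return in_house_rate / cloud_rate


cloud_rate = 0.10  # dollars per core-hour, from the text
in_house_rate = cost_from_price(cloud_rate, markup=1.0)  # 100% markup
utilization = break_even_utilization(in_house_rate, cloud_rate)

print(f"Implied cost basis: ${in_house_rate:.2f} per core-hour")
print(f"Break-even utilization: {utilization:.0%}")
```

Under these assumptions, renting only wins if the cores sit idle more than half of the year. Change the markup or the rates and the break-even point moves accordingly.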
Now, as I mentioned in the previous article, the main difference between private and public clouds is that in a public cloud, you go over the Internet. Some argue that networking costs are not included in my calculations. I would counter that you need high-speed network connectivity in either case. In fact, you would need better connectivity to the Internet if you are connecting to a public cloud, as the uplink bandwidth (what you essentially pay for) determines how much you can transfer to your cores on the cloud.
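To get a feel for how much the uplink matters, here is a rough sketch that estimates how long it takes to push a dataset to the cloud. The dataset size and link speeds are hypothetical numbers picked for illustration, and the estimate ignores protocol overhead and contention, so real transfers will be slower.

```python
def transfer_hours(dataset_bytes, uplink_mbit_per_s):
    """Estimate hours needed to upload a dataset over a given uplink.

    Idealized: assumes the full link speed is sustained, with no overhead.
    """
    bits = dataset_bytes * 8
    seconds = bits / (uplink_mbit_per_s * 1_000_000)
    return seconds / 3600


ONE_TERABYTE = 10**12  # bytes (decimal terabyte); a hypothetical dataset size

for uplink in (100, 1000):  # hypothetical uplinks: 100 Mbit/s and 1 Gbit/s
    hours = transfer_hours(ONE_TERABYTE, uplink)
    print(f"{uplink} Mbit/s uplink: about {hours:.1f} hours per terabyte")
```

Even in this idealized case, a terabyte over a 100 Mbit/s uplink takes most of a day, which is time your rented cores may spend waiting for data.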
I would further argue that you have more to worry about as far as security is concerned when you are considering public clouds. For some organizations that deal with sensitive data, this cost is tremendously high!

What is the Difference between Grid, Cloud, and HPC

Scalability vs. Performance
First, it’s critical for readers to understand the fundamental difference between scalability and performance. While the two are frequently conflated, they are quite different. Performance is the capability of a particular component to provide a certain amount of capacity, throughput, or ‘yield’. Scalability, in contrast, is about the ability of a system to expand to meet demand. This is quite frequently measured by looking at the aggregate performance of the individual components of a particular system and how they function over time.
Put more simply, performance measures the capability of a single part of a large system while scalability measures the ability of a large system to grow to meet growing demand.
Scalable systems may have individual parts that are relatively low performing. I have heard that the Amazon.com retail website’s web servers went from 300 transactions per second (TPS) to a mere 3 TPS each after moving to a more scalable architecture. The upside is that while every web server might have lower individual performance, the overall system became significantly more scalable and new web servers could be added ad infinitum.
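A small sketch of the trade-off, using the per-server figures from the anecdote above. The target demand of 30,000 TPS is a made-up number for illustration, and the model assumes throughput adds up linearly across servers, which is exactly what a scalable architecture is meant to approximate.

```python
import math


def servers_needed(target_tps, per_server_tps):
    """Number of servers required to meet a demand, assuming linear scaling."""
    return math.ceil(target_tps / per_server_tps)


TARGET_TPS = 30_000  # hypothetical aggregate demand

fast_servers = servers_needed(TARGET_TPS, per_server_tps=300)
slow_servers = servers_needed(TARGET_TPS, per_server_tps=3)

print(f"At 300 TPS per server: {fast_servers} servers")
print(f"At 3 TPS per server: {slow_servers} servers")
```

The low-performance design needs a hundred times as many servers for the same load. What it buys is the ability to keep adding servers as demand grows, which a design that cannot scale out does not offer at any per-server speed.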
High-performing systems, on the other hand, focus on eking out every ounce of resource from a particular component, rather than focusing on the big picture. A very scalable system may or may not be built from high-performance components.
For most purposes, scalability and performance are orthogonal, but many either equate them or believe that one breeds the other.
Grid & High Performance Computing
The origins of HPC/Grid exist within the academic community, where needs arose to crunch large data sets very early on. Think satellite data, genomics, nuclear physics, etc. Grid, effectively, has been around since the beginning of the enterprise computing era, when it became easier for academic research institutions to move away from large mainframe-style supercomputers (e.g. Cray, Sequent) towards a more scale-out model using lots of relatively inexpensive x86 hardware in large clusters. The emphasis here is on *relatively*.
Most x86 clusters today are built out for very high performance *and* scalability, but with a particular focus on performance of individual components (servers) and the interconnect network for reasons that I will explain below. The price/performance of the overall system is not as important as aggregate throughput of the entire system. Most academic institutions build out a grid to the full budget they have, attempting to eke out every ounce of performance in each component.
This is not the way that cloud pioneers such as Amazon.com and Google built their infrastructures.
Cloud & High Scalability Computing
Cloud, or HSC, by contrast, focuses on hitting the price/performance sweet spot, using truly commodity components and buying *lots* more of them. This means building very large and scalable systems.
I was surprised at the ISC Cloud Conference when I heard one participant bragging about their cluster with 320,000 ‘cores’. Amazon EC2 (sans the new HPC offering) is at roughly 500,000 cores, quite possibly more. And Google is probably on the order of 10 million+ cores. Clouds built around High Scalability Computing are an order of magnitude larger than most grid clusters and are designed to handle generic workloads, which requires hitting the price/performance sweet spot when building them.
Grid workloads can be very, very different.
Some Grid Workloads Drive the Grid Community
In talking to the grid community I learned that there are effectively two key types of problem that are solved on large scale computing clusters: MPI (Message Passing Interface) and ‘embarrassingly parallel’ problems. These are the terms I heard at the conference; I will abbreviate them as MPI and EPP (embarrassingly parallel problem) throughout the rest of this article.
MPI is essentially a programming paradigm that allows for taking extremely large sets of data and crunching the information in parallel WHILE sharing the data between compute nodes. Sometimes this is also referred to as ‘clustering’, although that term is frequently overloaded today. Certain kinds of problems necessitate this sharing, as the computed results on one node may affect the computed results on another node in the grid. MPI-based grids, the de facto standard for most academic institutions, are built for maximum throughput and performance per system, including the lowest latency possible. Most of them use Infiniband technology, for example, to effectively turn the entire grid into a single ‘supercomputer’. In fact, most of these MPI-based grids are ranked in the Supercomputer Top500.
An MPI grid/cluster, in many ways, looks more like an old school mainframe and technology such as Infiniband essentially turns the network into a high-speed bus, just like a PCI bus inside a typical x86 server.
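To illustrate why some problems force nodes to share data, here is a toy sketch in plain Python. It does not use a real MPI library; the "nodes" are just list slices, and the smoothing step and function names are invented for the example. Each node can only produce the right answer for its edge cells after receiving a boundary value from its neighbor, which is the kind of exchange that MPI programs perform over the interconnect.

```python
def smooth(values):
    """Replace each interior cell by the mean of itself and its two neighbors.

    The first and last cells are left unchanged.
    """
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3
    return out


def smooth_partitioned(values, nodes):
    """Run the same step with the data split across simulated compute nodes.

    Assumes len(values) is evenly divisible by `nodes`.
    """
    size = len(values) // nodes
    chunks = [values[k * size:(k + 1) * size] for k in range(nodes)]
    result = []
    for k, chunk in enumerate(chunks):
        # Simulated message passing: fetch one boundary cell from each neighbor.
        left = [chunks[k - 1][-1]] if k > 0 else []
        right = [chunks[k + 1][0]] if k < nodes - 1 else []
        smoothed = smooth(left + chunk + right)
        # Drop the borrowed boundary cells; keep only this node's own cells.
        result.extend(smoothed[len(left):len(smoothed) - len(right)])
    return result


def smooth_without_sharing(values, nodes):
    """Same split, but each node works in isolation (no boundary exchange)."""
    size = len(values) // nodes
    chunks = [values[k * size:(k + 1) * size] for k in range(nodes)]
    return [x for chunk in chunks for x in smooth(chunk)]


data = [0, 9, 3, 12, 6, 15, 0, 9]
print(smooth_partitioned(data, 4) == smooth(data))
print(smooth_without_sharing(data, 4) == smooth(data))
```

The partitioned version matches the single-machine result only because of the boundary exchange; without it, the cells at the chunk edges come out wrong. In a real cluster that exchange happens on every step, which is why interconnect latency matters so much for this class of workload.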
EPP workloads, by contrast, have no data sharing requirements. A very large dataset is chopped into pieces, distributed to a large pool of workers, and then the results are brought back and reassembled. Does this sound familiar? It should: it’s very similar to Google’s MapReduce functionality and the open source tool, Hadoop. EPP workloads are very commonly run on top of MPI clusters, although some academic institutions build out separate or smaller grids to run them instead.
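The chop, distribute, and reassemble pattern can be sketched with nothing more than the Python standard library. This is an illustration of the pattern only, not of MapReduce or Hadoop themselves. The workload (a sum of squares) and the function names are placeholders, and a thread pool stands in for what would be a pool of separate machines on a real grid.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, pieces):
    """Chop a list into at most `pieces` contiguous chunks."""
    size = max(1, -(-len(data) // pieces))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(piece):
    """The per-worker job: needs only its own piece, no data from others."""
    return sum(x * x for x in piece)


def run_epp(data, workers=4):
    """Distribute the pieces to a pool of workers and combine the results."""
    pieces = split(data, workers)
    # Threads keep the sketch simple; for CPU-bound work in Python one would
    # normally use separate processes or separate machines instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(work, pieces))
    return sum(partial_results)


numbers = list(range(1000))
print(run_epp(numbers) == sum(x * x for x in numbers))
```

Because no worker ever needs another worker's data, the speed of the network between them matters far less than it does for MPI jobs.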
The majority of grid workloads are of the EPP type. The diagram below shows this.

I had one person confide in me that “MPI power users drive grid requirements for vendors and assume that if their problems are solved, then the problems of [EPP] users are solved.”
This is interesting since these two types of workloads have different needs.
HPC vs. HSC
The reality is that High Scalability Computing is ideal for the majority of EPP grid workloads. In fact, large amounts of this kind of work, in the form of MapReduce jobs, have been running on Amazon EC2 since its beginning and have driven much of its growth.
HPC is a different beast altogether, as many of the MPI workloads require very low latency and servers with individually high performance. It turns out, however, that not all MPI workloads are the same. The lower portion of the top part of that pyramid is filled with MPI workloads that require a great network, but not an Infiniband network:

In keeping with Amazon Web Services’ tendency to build out using commodity (cloud) techniques, their new HPC offering does not use Infiniband, but instead opts for 10 Gigabit Ethernet. This makes the network great, but not awesome, and allows them to create a cloud service tailored for many HPC jobs. In fact, this recent benchmark posting by CycleComputing shows that AWS’ Cloud HPC system has impressive performance, particularly for many MPI workloads.
HSC designed to accommodate HPC!

How to Build an HPC Cluster in the Cloud

Step 1: Deploy or migrate a management server in the cloud
A distributed computing management (DCM) server (also called a front end server or queuing server) is required for coordinating execution of jobs across a large number of compute servers. Products such as Oracle Grid Engine or Condor are commonly used to provide this capability. There are also tools such as Rocks Clusters which include the DCM software as part of an OS provisioning system. Regardless of the specific solution used, CloudSwitch enables an administrator to migrate an existing deployment or establish a new deployment in the cloud within a few minutes. Once the chosen software framework is in place, compute capacity to carry out the workload can be provisioned.
Step 2: Create compute servers quickly in the cloud
The CloudSwitch API makes it easy to quickly stamp out dozens or hundreds of virtual servers in the cloud that will form the cluster or grid. You can configure the virtual machine parameters that match your internal environment with a few clicks, or upload a gold image to make provisioning even easier. CloudSwitch automatically creates the appropriate cloud resources with the chosen configuration rather than relying on a cloud provider’s options. (A previous post describes the CloudSwitch point-and-click approach in more detail.) The CloudSwitch isolation layer extends the internal environment into the cloud so that when the servers are started, they appear to be running inside the data center, using the same management tools and processes.
Step 3: Install the operating system
Now that we’ve created the cloud infrastructure, we continue building up the stack, starting by installing the operating system on the virtual machines we created. There are a number of products that do this quickly with minimal human effort. If using Rocks, the compute servers automatically boot from the network using the “PXE” standard, and operating systems are pushed onto them. CloudSwitch also supports other solutions, including ISO-based or image-based provisioning in addition to network boot.
Step 4: Install the DCM software and build the cluster
Since the DCM software provides the overlay framework enabling HPC jobs to run in parallel, it must be installed onto each compute node. In some cases this step can be done automatically by the provisioning solution (as in the case of Rocks), or the software may be installed manually after OS provisioning completes. Regardless of the installation method, once the DCM software is installed onto the newly-provisioned compute nodes, they are available to the management server as workload targets.
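As a conceptual sketch of what "available to the management server as workload targets" means, the toy model below has compute nodes register with a management server, which then hands queued jobs out in round-robin order. It is not the CloudSwitch API, nor the interface of Grid Engine, Condor, or Rocks; every class and method name here is hypothetical, and real DCM schedulers use far richer policies.

```python
from collections import deque
from itertools import cycle


class ManagementServer:
    """Toy stand-in for a DCM front end: tracks nodes and a job queue."""

    def __init__(self):
        self.nodes = []
        self.queue = deque()

    def register_node(self, name):
        """Called once a compute node has its DCM software installed."""
        self.nodes.append(name)

    def submit(self, job):
        """Add a job to the queue to await dispatch."""
        self.queue.append(job)

    def dispatch(self):
        """Assign every queued job to a node, round-robin."""
        if not self.nodes:
            raise RuntimeError("no compute nodes registered")
        assignments = {node: [] for node in self.nodes}
        targets = cycle(self.nodes)
        while self.queue:
            assignments[next(targets)].append(self.queue.popleft())
        return assignments


server = ManagementServer()
for n in range(3):
    server.register_node(f"compute-{n}")  # hypothetical node names
for j in range(7):
    server.submit(f"job-{j}")

plan = server.dispatch()
for node, jobs in plan.items():
    print(node, jobs)
```

Provisioning more compute servers in the cloud amounts to more `register_node` calls in this model: the management server simply sees a larger pool of targets.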
The above process will vary slightly depending on which tool(s) you use, but the end result will be the same: a fully-functioning cluster in the cloud, with the same look and feel as if it were running within the data center. These steps can be repeated as often as needed to provision multiple clusters in the cloud, with each cluster running within its own private network to securely support separate users and groups.