{"id":65181,"date":"2024-09-04T10:48:52","date_gmt":"2024-09-04T15:48:52","guid":{"rendered":"https:\/\/connect-community.org\/?p=65181"},"modified":"2024-09-04T11:29:43","modified_gmt":"2024-09-04T16:29:43","slug":"maximize-performance-sustainably-for-ai-training-with-hpe-cray-xd670","status":"publish","type":"post","link":"https:\/\/connect-community.org\/maximize-performance-sustainably-for-ai-training-with-hpe-cray-xd670\/","title":{"rendered":"Maximize performance sustainably for AI training with HPE Cray XD670"},"content":{"rendered":"<h2><span style=\"font-size: 12pt;\"><em><strong>Discover how the HPE Cray XD670 delivers strong results in the MLPerf Training v4.0 benchmark, introduces support for NVIDIA H200 Tensor Core GPUs, and advances sustainability with optional direct liquid cooling.<\/strong><\/em><\/span><\/h2>\n<p><span style=\"font-size: 12pt;\">In the race to deploy and successfully implement AI environments, two key requirements are becoming increasingly critical for service providers and AI model builders:\u00a0<\/span><\/p>\n<ol>\n<li><span style=\"font-size: 12pt;\">Achieving the highest possible performance\u00a0with the latest accelerator technologies<\/span><\/li>\n<li><span style=\"font-size: 12pt;\">Addressing the escalating cooling needs of these accelerators<\/span><\/li>\n<\/ol>\n<p><span style=\"font-size: 12pt;\">On the first requirement, we are pleased to share that HPE\u2019s premier AI training platform, the\u00a0<a href=\"https:\/\/www.hpe.com\/us\/en\/hpe-cray-xd670.html\" target=\"_blank\" rel=\"noopener noreferrer\">HPE Cray XD670<\/a>, has once again demonstrated strong performance results, this time in the recently published\u00a0<a href=\"https:\/\/mlcommons.org\/2024\/06\/mlperf-training-v4-benchmark-results\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">MLPerf Training 4.0 benchmarks<\/a>. 
HPE submitted nine performance results against five AI models in three categories: LLM fine-tuning, NLP training, and computer vision training. HPE Cray XD670, in single- and double-node configurations,\u00a0<a href=\"https:\/\/mlcommons.org\/benchmarks\/training\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">delivered strong on-premises training and fine-tuning performance<\/a>,\u00a0including:<\/span><\/p>\n<ul>\n<li><span style=\"font-size: 12pt;\">The #2 fastest single-node system overall when compared to other 8x NVIDIA H100 SXM5 80GB servers in both NLP (BERT) training and LLM (Llama 2 70B with LoRA) fine-tuning.<\/span><\/li>\n<li><span style=\"font-size: 12pt;\">The 2-node HPE Cray XD670 configuration, with a total of 16 NVIDIA H100 SXM5 80GB GPUs, outperformed a 16-node server configuration with a total of 64 NVIDIA L40S GPUs on Llama 2 fine-tuning tasks.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-size: 12pt;\">Based on the limited comparable results, HPE Cray XD670 was the fastest 2-node configuration overall across all submitted models.<\/span><\/p>\n<p><span class=\"lia-inline-image-display-wrapper lia-image-align-center\" style=\"font-size: 12pt;\"><span class=\"lia-message-image-wrapper lia-message-image-actions-narrow lia-message-image-actions-below\"><img decoding=\"async\" class=\"lia-media-image aligncenter\" tabindex=\"0\" title=\"HPE-Cray-XD67-MLPerf-4-Benchmark.png\" role=\"button\" src=\"https:\/\/community.hpe.com\/t5\/image\/serverpage\/image-id\/143566iADCFA7A734DAC458\/image-size\/large?v=v2&amp;px=2000\" alt=\"*Lower values indicate better performance\" \/><\/span><\/span><\/p>\n<p style=\"text-align: center;\"><span class=\"lia-inline-image-display-wrapper lia-image-align-center\" style=\"font-size: 12pt;\"><span class=\"lia-inline-image-caption\">*Lower values indicate better performance<\/span><\/span><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><span class=\"lia-inline-image-display-wrapper lia-image-align-center\" 
style=\"font-size: 12pt;\"><span class=\"lia-message-image-wrapper lia-message-image-actions-narrow lia-message-image-actions-below\"><img decoding=\"async\" class=\"lia-media-image aligncenter\" tabindex=\"0\" title=\"HPE-Cray-Llama 2-Model.png\" role=\"button\" src=\"https:\/\/community.hpe.com\/t5\/image\/serverpage\/image-id\/143567iF24956E9214C6FD5\/image-size\/large?v=v2&amp;px=2000\" alt=\"*Lower values indicate better performance\" \/><\/span><span class=\"lia-inline-image-caption\">*Lower values indicate better performance<\/span><\/span><\/p>\n<p class=\"lia-align-left\"><span style=\"font-size: 12pt;\">These results are in addition to prior\u00a0<a href=\"https:\/\/mlcommons.org\/2024\/03\/mlperf-inference-v4\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">MLPerf Inference v4.0 benchmark results<\/a>\u00a0published in March, where HPE Cray XD670 achieved the #1 spot for Natural Language Processing (NLP with Bert 99.0 Offline scenario) and was also a top performer in all the categories in which it participated, including GenAI, computer vision and large language models. (Read our blog\u00a0<a href=\"https:\/\/community.hpe.com\/t5\/servers-systems-the-right\/boost-ai-performance-with-the-leading-server-for-natural\/ba-p\/7211514\" target=\"_blank\" rel=\"noopener\">&#8220;<\/a><a href=\"https:\/\/community.hpe.com\/t5\/servers-systems-the-right\/boost-ai-performance-with-the-leading-server-for-natural\/ba-p\/7211514\" target=\"_blank\" rel=\"noopener\">Boost AI performance with the leading server for natural language processing&#8221;<\/a>\u00a0for more details.)<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">Implementing the latest accelerator technology is another contributing factor to delivering high performance, so we expect future benchmark results to get even better as HPE Cray XD670 now supports eight NVIDIA H200 SXM Tensor Core GPUs. 
Additionally, our portfolio of AI training solutions will continue to evolve, and we plan to support key future GPU releases as they come to market, as shared by our CEO Antonio Neri during\u00a0<a href=\"https:\/\/www.hpe.com\/us\/en\/discover.html?media-id=\/us\/en\/resources\/discover\/las-vegas-2024\/keynote-by-antonio-neri-intelligence-has-no-limits-hpedlv2024\/_jcr_content.details.json&amp;media-strategy=embed\" target=\"_blank\" rel=\"noopener noreferrer\">his HPE Discover keynote<\/a>.<\/span><\/p>\n<h3><span style=\"font-size: 12pt;\"><strong>Liquid cooling: Efficient today, essential tomorrow<\/strong><\/span><\/h3>\n<p><span style=\"font-size: 12pt;\">With powerful CPUs and GPUs soon to draw over 500 watts, traditional air-cooling setups are being strained. Organizations are starting to realize that a different way to cool these environments is necessary, especially because energy requirements are expected to keep rising as technology evolves.\u00a0<strong>HPE Cray XD670 comes with a direct liquid cooling option<\/strong>\u00a0that addresses the power and cooling needs of today and tomorrow.<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">This option applies direct liquid cooling to the hottest components in the server, such as the GPUs and CPUs, which accounts for about 70% of the cooling, while air cooling handles the remaining 30% or so for the low-heat components. The racks come pre-filled with coolant and are ready to plug into facility water connections.<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">These racks are fully integrated, installed, and supported by HPE; they can become 100% liquid cooled when combined with HPE\u00a0liquid-to-air cooling solutions: HPE Rear Door Heat Exchanger (RDHX) or HPE Adaptive Rack Cooling Solution (ARCS). 
These solutions use facility chilled water to deliver cold air where it is needed most in the rack.<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">Some of the benefits of direct liquid cooling over air cooling include:<\/span><\/p>\n<ul>\n<li><span style=\"font-size: 12pt;\"><strong>Lower operational costs and improved efficiencies.\u00a0<\/strong>In an analysis conducted by HPE, liquid cooling was shown to deliver about\u00a020% more performance per kW and reduce\u00a0chassis power requirements by about 15% over five years. Although this study was performed using a different HPE platform, the results are representative of the expected benefits of a liquid-cooled setup.<\/span><\/li>\n<li><span style=\"font-size: 12pt;\"><strong>Reduced environmental impact.\u00a0<\/strong>Reducing power consumption with more efficient cooling can help organizations meet environmental, social, and governance (ESG) goals and reduce their data center\u2019s CO2 equivalent (CO2e) footprint.<\/span><\/li>\n<li><span style=\"font-size: 12pt;\"><strong>Higher density that can defer expensive data center upgrades.\u00a0<\/strong>In space-constrained data centers, liquid cooling can enable denser rack configurations, helping maximize available space.<\/span><\/li>\n<li><span style=\"font-size: 12pt;\"><strong>Improved reliability and predictability.\u00a0<\/strong>Liquid cooling can prolong component life by providing stable operating temperatures, avoiding overheating conditions, and improving overall availability.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-size: 12pt;\">The results of the analysis conducted by HPE, summarized in the chart below, include opportunities for an 87.3% reduction in carbon emissions and power consumption due to cooling and a potential 77.5% reduction in data center space requirements.<\/span><\/p>\n<p><span class=\"lia-inline-image-display-wrapper lia-image-align-center\" style=\"font-size: 12pt;\"><span 
class=\"lia-message-image-wrapper lia-message-image-actions-narrow lia-message-image-actions-below\"><img decoding=\"async\" class=\"lia-media-image aligncenter\" tabindex=\"0\" title=\"Air-based-cooling-vs-liquid-cooling.png\" role=\"button\" src=\"https:\/\/community.hpe.com\/t5\/image\/serverpage\/image-id\/143568i3C7AB6DAC47FA8EA\/image-size\/large?v=v2&amp;px=2000\" alt=\"Air-based-cooling-vs-liquid-cooling.png\" \/><\/span><\/span><\/p>\n<p><span style=\"font-size: 12pt;\">HPE\u2019s future-focused cooling strategy advances sustainability and increases efficiencies. Our liquid cooling expertise and portfolio are second to none, the culmination of over five decades of experience and innovation. Today, we continue to lead the industry in preparing for the next generation of liquid cooling.<\/span><\/p>\n<h3><span style=\"font-size: 12pt;\"><strong>Achieve leading-edge, sustainable performance for AI training and tuning<\/strong><\/span><\/h3>\n<p><span style=\"font-size: 12pt;\">In today\u2019s fast-paced and complex world of AI deployments, performance and sustainability are critical for organizations looking to get ahead of their competition. HPE\u2019s proven expertise in setting up high-performing, large-scale AI clusters\u00a0with bespoke cooling solutions positions us as a leader to help you on your journey. The HPE Cray XD670 proves this: the latest MLPerf benchmarks demonstrate exceptional performance results in AI training and fine-tuning across various models. 
Additionally, our commitment to liquid cooling addresses escalating cooling demands, not only with immediate efficiency gains, but also by future-proofing against rising energy requirements.<\/span><\/p>\n<h3><span style=\"font-size: 12pt;\"><strong>Ready for more?\u00a0<\/strong><\/span><\/h3>\n<p><span style=\"font-size: 12pt;\">Visit the\u00a0<a href=\"https:\/\/www.hpe.com\/us\/en\/hpe-cray-xd670.html\" target=\"_blank\" rel=\"noopener noreferrer\">webpage<\/a>\u00a0to see the explainer video, view a demo, and read the solution brief about the HPE Cray XD670.\u00a0\u00a0<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">Check out Jason Zeiler\u2019s Tech Talk\u00a0\u00a0from HPE Discover,\u00a0<a href=\"https:\/\/www.hpe.com\/h22228\/video-gallery\/us\/en\/Discover2024-26939\/video\/\" target=\"_blank\" rel=\"noopener noreferrer\">&#8220;The future of liquid cooling for data centers&#8221;<\/a>\u00a0for an overview of HPE&#8217;s liquid cooling solutions.\u00a0<\/span><\/p>\n<hr \/>\n<p><span style=\"font-size: 8pt; color: #808080;\">This article, republished with permission, originally appeared at https:\/\/community.hpe.com\/t5\/ai-unlocked\/maximize-performance-sustainably-for-ai-training-with-hpe-cray\/ba-p\/7222696 on August 12, 2024<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Discover how the HPE Cray XD670 delivers strong results in the MLPerf Training v4.0 benchmark, introduces support for NVIDIA H200 Tensor Core GPUs, and advances sustainability with optional direct 
liquid&hellip;<\/p>\n","protected":false},"author":2106,"featured_media":65184,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"411","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"content-type":"","footnotes":""},"categories":[16,411],"tags":[1552,1550,1549,1546,1553,1551,1547,1548],"coauthors":[1554],"class_list":["post-65181","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-blog","tag-carbon-emissions","tag-computer-vision","tag-genai","tag-hpe-cray-xd670","tag-liquid-cooling","tag-lm","tag-mlperf-training-v4-0-benchmark","tag-nvidia-h200-tensor-core-gpus","category-16","category-411","description-off"],"_links":{"self":[{"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/posts\/65181","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/users\/2106"}],"replies":[{"embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/comments?post=65181"}],"version-history":[{"count":3,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/posts\/65181\/revisions"}],"predecessor-version":[{"id":65188,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/posts\/65181\/revisions\/65188"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/media\/65184"}],"wp:attachment":[{"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/media?parent=65181"}],"wp:term":[{"t
axonomy":"category","embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/categories?post=65181"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/tags?post=65181"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/connect-community.org\/wp-json\/wp\/v2\/coauthors?post=65181"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}