So, You Want to Go Liquid – Here’s What Awaits Your Data Center Team
Yes, in some cases you’ll need gloves and aprons, but once you get some practice with liquid, it may provide a more comfortable working environment.
Mary Branscombe | Aug 09, 2018
Liquid cooling can be more efficient than air cooling, but data center operators have been slow to adopt it for a number of reasons, ranging from it being disruptive in terms of installation and management to it being simply unnecessary. In most cases where it does become necessary, the driver is high power density. So, if your data center (or parts of it) has reached the level of density that calls for liquid cooling, how will your day-to-day routine change?
Depending on how long you’ve been working in data centers, liquid cooling may seem brand new (and potentially disturbing) or pretty old-school. “Back in the 80s and 90s, liquid cooling was still common for mainframes as well as in the supercomputer world,” Chris Brown, CTO of the Uptime Institute, says. “Just being comfortable with water in data centers can be a big step. If data center managers are older, they may find it familiar, but the younger generation are nervous of any liquid.”
GRC's cooling tanks at a CGG data center in Houston
There’s often an instinctive reluctance to mix water and expensive IT assets. But that concern goes away once you understand the technology better, because in many cases, the liquid that’s close to the hardware isn’t actually water and can’t do any damage.
Modern immersion and some direct-to-chip liquid cooling systems, for example, use dielectric (non-conductive) non-flammable fluids, with standard cooling distribution units piping chilled water to a heat exchanger that removes heat from the immersion fluid. “That allows them to have the benefits of liquid cooling without having water right at the IT asset … so that if there is a leak, they’re not destroying millions of dollars’ worth of hardware,” Brown explains.
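To make the water side of that arrangement concrete, here is a minimal back-of-the-envelope sketch of how much chilled water a cooling distribution unit would need to move for a given tank load, using the standard Q = ṁ·c·ΔT relationship. The load and temperature figures are illustrative assumptions, not numbers from Brown or GRC, and real flow rates depend on the vendor’s heat exchanger design.

```python
# Back-of-the-envelope sizing for the chilled-water side of a cooling
# distribution unit (CDU). Illustrative only; real designs follow the
# cooling vendor's specifications.

CP_WATER = 4186.0  # specific heat of water, J/(kg*K)

def chilled_water_flow_lpm(heat_load_kw: float, delta_t_c: float) -> float:
    """Water flow (liters/minute) needed to carry heat_load_kw of IT heat
    with a delta_t_c temperature rise, from Q = m_dot * c_p * dT
    (treating 1 liter of water as roughly 1 kg)."""
    m_dot_kg_per_s = (heat_load_kw * 1000.0) / (CP_WATER * delta_t_c)
    return m_dot_kg_per_s * 60.0

# Example: a 40 kW immersion tank with a 10 C rise across the heat exchanger
print(f"{chilled_water_flow_lpm(40, 10):.0f} L/min")  # roughly 57 L/min
```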
Impact on Facilities
In fact, he says, data centers that already use chilled water won’t get much more complex to manage from switching to liquid cooling. “They’re already accustomed to dealing with hydraulics and chillers, and worrying about maintaining the water treatment in the piping to keep the algae growth down – because if the water quality is low, it’s going to plug the tubes in the heat exchangers.” The water loop that cools immersion tanks can run under an existing raised floor without needing extra structural support.
If you don’t have that familiarity with running a water plant because you’re using direct expansion air conditioning units, Brown warns that liquid cooling will require a steeper learning curve and more changes to your data center operations. But that’s true of any chilled-water system.
Impact on IT
How disruptive liquid cooling will be for day-to-day IT work depends on the type of cooling technology you choose. Rear-door heat exchangers will require the fewest changes, says Dale Sartor, an engineer at the US Department of Energy’s Lawrence Berkeley National Laboratory who oversees the federal Center of Expertise for Data Centers. “There are plumbing connections on the rear door, but they’re flexible, so you can open and close the rear door pretty much the same way as you did before; you just have a thicker, heavier door, but otherwise servicing is pretty much the same.”
Similarly, for direct-to-chip cooling there’s a manifold in the back of the rack, with narrow tubes running into the server from the manifold and on to the components. Those tubes have dripless connectors, Sartor explains. “The technician pops the connector off the server, and it’s designed not to drip, so they can pull the server out as they would before.”
One problem to watch out for here is putting the connections back correctly. “You could easily reverse the tubes, so the supply water could be incorrectly connected to the return, or vice versa,” he warns. Some connectors are color-coded, but an industry group that includes Microsoft, Facebook, Google, and Intel is working on an open specification for liquid-cooled server racks that would introduce non-reversible plugs to avoid the issue. “The cold should only be able to connect up to the cold and the hot to the hot to eliminate that human error,” Sartor says.
Adjusting to Immersion
Immersion cooling does significantly change maintenance processes and the equipment needed, says Ted Barragy, manager of the advanced systems group at geosciences company CGG, which has been using GRC’s liquid immersion systems for more than five years.
If your server supplier hasn’t made all the changes before shipping, you may have to remove fans or reverse rails, so that motherboards hang down into the immersion fluid. For older systems with a BIOS that monitors the speed of cooling fans, cooling vendors like GRC offer fan emulator circuits, but newer BIOSes don’t require that.
Networking equipment isn’t always suitable for immersion, Barragy says. “Commodity fiber is plastic-based and clouds in the oil.” In practice, CGG has found that network devices don’t actually need liquid cooling, and its data center team now attaches them outside the tanks, freeing up space for more compute.
While CGG had some initial teething troubles with liquid cooling, Barragy is confident that the technology is reliable once you understand how to adjust your data center architecture and operations to take advantage of it. “The biggest barrier is psychological,” he says.
Gloves and Aprons
To replace components like memory in a server dipped in a tub of coolant, you have to remove the whole motherboard from the fluid – which is expensive enough that you don’t want to waste it and messy enough that you don’t want to spill it – and allow it to drain before you service it.
Barragy recommends wearing disposable nitrile gloves and remembers spilling oil down his legs the first time he worked with immersed components. Some technicians wear rubber aprons; others, who have more experience, do it in business-casual and don’t get a drop on them. “Once you’ve done it a few times, you learn the do’s and don’ts, like pulling the system out of the oil very slowly,” he says. “Pretty much anyone that does break-fix will master this.”
A bigger difference is that you’re going to be servicing IT equipment in a specialized area off the data center floor rather than working directly in the racks. You might have to replace an entire chassis and bring the replacement online before taking the original chassis away to replace or upgrade the components, Brown suggests.
“You want to do break-fix in batches,” Barragy agrees. His team will wait until they have four or five systems to work on before starting repairs, often leaving faulty servers offline for days, with failed jobs automatically requeued on other systems. To speed the process up, he recommends having a spare-parts kiosk.
The Trade-Offs
There are relatively few suppliers of liquid-cooled systems to choose from, and until systems based on the upcoming open specification for liquid-cooled racks are on the market, you can’t mix and match vendors. “There is no interoperability,” Lawrence warns. “There are ten or 15 suppliers of immersive cooling and fewer of direct-to-chip [systems], and they tend to partner up with a hardware supplier. That means the ability to just choose the hardware you need is very limited once you're locked into a cooling provider.”
On the other hand, you don’t have to redo complex airflow dynamics calculations or figure out how to spread load across more racks if you want to increase power density. You can just switch from a 20kW to a 40kW tank and keep the same coolant and coolant distribution units.
Returns get somewhat more complicated and are best done in batches. “If you’ve got some broken components, you're going to let those drip dry for a couple of days,” Barragy explains. “They'll have an oil film on them, but you’re not going to wind up with a puddle of mineral oil on your hands.” Vendors who design motherboards for use in immersion systems will be comfortable dealing with components coming back in this condition, and CGG is able to process systems that reach end of life through their normal recycling channels.
Creature Comfort
Liquid cooling may mean extra work, but it also makes for a more pleasant working environment, says Scott Tease, executive director of high-performance computing and artificial intelligence at Lenovo’s Data Center Group. Heat is becoming a bigger problem than power in many data centers, with faster processors and accelerators coming in ever-smaller packages.
That means you need to move more and more air through each server. “The need for more air movement will drive up power consumption inside the server, and will also make air handlers and air conditioning work harder,” he says. Not only will it be hard to deliver enough cubic feet per minute of air, it will also be noisy.
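As a rough illustration of the point Tease is making, the sketch below combines two standard rules of thumb: the airflow a server needs is roughly CFM ≈ 3.16 × watts ÷ ΔT(°F), and fan power grows roughly with the cube of fan speed (the fan affinity laws). The wattages and temperature rise are assumptions chosen for illustration, not figures from Lenovo.

```python
# Why chasing airflow gets expensive: the air a server needs scales with
# its heat load, and fan power rises roughly with the cube of fan speed.
# All numbers below are illustrative.

def required_cfm(watts: float, delta_t_f: float = 20.0) -> float:
    """Rule of thumb for sea-level air: CFM ~= 3.16 * watts / deltaT(F)."""
    return 3.16 * watts / delta_t_f

def relative_fan_power(speed_ratio: float) -> float:
    """Fan affinity law: power scales roughly with the cube of fan speed."""
    return speed_ratio ** 3

print(f"{required_cfm(500):.0f} CFM")   # ~79 CFM for a 500 W server at a 20 F rise
print(f"{required_cfm(1000):.0f} CFM")  # ~158 CFM if the same box grows to 1 kW
print(f"{relative_fan_power(1.5):.1f}x fan power to spin fans 50% faster")  # ~3.4x
```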
The break-fix and first-level-fix IT staff at CGG now prefer to work in the immersion-cooled data center rather than the company’s other, air-cooled facility, which is state-of-the-art. “Once you learn the techniques so you don’t get the oil all over you, it’s a nicer data center, because it’s quiet and you can talk to people,” Barragy said. “The other data center with the 40mm high-speed fans is awful. It’s in the 80dB range.”
Liquid-cooled data centers also have more comfortable air temperature for the staff working inside. “A lot of work in data centers is done from the rear of the cabinet, where the hot air is exhausted, and those hot aisles can get to significant temperatures that are not very comfortable for people to work in,” Brown says. “The cold aisles get down to pretty cold temperatures, and that's not comfortable either.”
