Meta’s Vijay Rao on the future of the data center, how AI will grow up, and the ways computer architecture will evolve in the years ahead
For our inaugural Q&A, Atoms & Bits spoke with Vijay Rao, VP of Engineering at Meta, on AI and data, how he approaches the thorniest problems, and what plants can teach us about creativity.
If you pull the curtain back on the future of computing, you’ll find the data center. How these often unseen citadels of storage and processing are architected in the years ahead will determine whether AI reaches its full potential, whether Moore’s Law can persist deeper into the 21st century, and which tools will define the modern world.
Vijay Rao has a lot to say about the look, feel, and texture of this future. In his Atoms & Bits conversation, Rao gets into how GPUs are straining current data center designs, the best way to create computer chips that work effectively for the AI-enabled future, and what a thriving AI ecosystem could look like.
An advocate for more dynamic, open source computing infrastructure, Rao envisions an industry bursting with innovation and possibility.
This interview has been edited for clarity.
Atoms & Bits: You call yourself a “technologist.” What does that term mean to you?
Vijay Rao: At a high level, some technologists enjoy fixing organizational problems. Other technologists like to dig deep and fix more nuanced problems — like designing a chip. Interrogating the “why” of an issue is what I enjoy most at this point in my life. “Why are chips designed this way?” “Why does this memory need to be stored locally?” “Why does an AI system need to be designed a particular way?” Today, I can give more to the industry by solving these larger technology problems.
A&B: Tell us about your journey.
Vijay Rao: I came to the United States in 1994 to attend Purdue University. I did my master’s in computer engineering and joined Intel, where I worked for roughly seven years building chips. Towards the end of my time there, I started thinking about systems and left to start my own company. That year transformed me from an engineer into an entrepreneur. At Intel, the question on my mind was always, “How do I find a solution?” After that year, I started thinking, “What problem are we trying to solve?” Now, whether it’s a technology or business situation, I figure out what problem I am solving first. That has stayed with me for the rest of my life.
A&B: Tell us about your time at AMD.
Vijay Rao: I worked at AMD for close to 10 years. I was fortunate enough to join the office of the CTO and ended up working with Google. Google was starting to build data centers and wrap their heads around what scale meant. AMD taught me how to build chips and systems, while Google was all about how to think at an enormous scale. When I wrote a line of code, I had to grapple with how it was going to work across a million servers, not just one. There was no textbook I could consult. When building a single server, I made a whole set of assumptions about memory, dividing the application, and networking. Ninety percent of those considerations were the opposite when I turned to solving problems at a million-server scale.
A&B: How is engineering for a million servers different?
Vijay Rao: Reliability, availability, and scalability: those are the three qualities all distributed systems folks will say are critical to building at scale. The types of problems, the architecture of the software, the infrastructure — it all changes.
A&B: Give us an example.
Rao: To build a million servers, you must make the basic assumption that servers will fail at any time, all over the place. You can’t expect to have a particular server working at a given time at a given location. So applications must be built to tolerate server failure. Reliability, then, is critical. Availability matters, too: sometimes the Bay Area has a lot of people online; sometimes India will account for the majority of your traffic. My latency requirements are also different. Whether it’s a million servers or 100,000, it doesn’t matter: I’ve taken the problem of one server and spread it out across a number of them. Some requests may take a millisecond to come back; some requests may take 100 nanoseconds to come back. I need to think of my application very differently; I need to build in a lot of tolerance. One needs to think about graceful degradation. If a region goes down or a chunk of your servers are taken offline for whatever reason, you want to gracefully degrade the user experience or hide it entirely, rather than abruptly shut down. This requires a lot of management infrastructure. There are far too many examples to cover here, but you get the point.
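To make that concrete, here is a minimal Python sketch of the pattern Rao describes: assume any replica can fail, try another, and fall back to a degraded response rather than an error. The replica names, failure rate, and fallback payload are hypothetical; this is an illustration, not Meta’s infrastructure.

```python
import random

REPLICAS = ["server-a", "server-b", "server-c"]  # hypothetical replica pool
FALLBACK = {"feed": [], "degraded": True}        # reduced but usable response

def fetch_from(replica: str, request_id: int) -> dict:
    """Pretend remote call: any replica may be down at any moment."""
    if random.random() < 0.3:                    # simulate routine failures
        raise ConnectionError(f"{replica} unavailable")
    return {"feed": [f"item-{request_id}"], "degraded": False}

def handle(request_id: int) -> dict:
    """Try each replica; if all fail, degrade gracefully instead of crashing."""
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch_from(replica, request_id)
        except ConnectionError:
            continue                             # tolerate the failure, move on
    return FALLBACK                              # hide the outage from the user

if __name__ == "__main__":
    for rid in range(5):
        print(handle(rid))
```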
A&B: I imagine memory and storage needs change as well.
Rao: Of course. One machine can handle five, 10, or 20 terabytes. A million machines require petabytes of storage. And how do you keep that storage? Do you attach all the storage in one pod, or do you spread the storage across multiple data centers? We keep data that we need to access very quickly and frequently on flash. We can’t keep everything in memory; it’s just too much. This impacts how you think about spreading your compute and how much latency you can tolerate.
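As a rough sketch of that tiering, with hypothetical tier names and latencies rather than Meta’s actual storage stack, the read path below checks the fastest tier first and promotes frequently accessed data toward it.

```python
# Each tier: (name, backing store, illustrative access latency in seconds).
TIERS = [
    ("memory", {}, 0.0001),   # hot working set, smallest capacity
    ("flash",  {}, 0.001),    # data we need quickly and frequently
    ("disk",   {}, 0.010),    # bulk storage spread across data centers
]

def put(key, value, tier_name="disk"):
    """Write new data into a chosen tier (cold storage by default)."""
    for name, store, _ in TIERS:
        if name == tier_name:
            store[key] = value

def get(key):
    """Check the fastest tier first; promote hot data one tier up on a hit."""
    for i, (name, store, latency) in enumerate(TIERS):
        if key in store:
            if i > 0:
                TIERS[i - 1][1][key] = store[key]   # promote toward memory
            return store[key], name, latency
    return None, "miss", None

put("profile:42", {"name": "example"})
print(get("profile:42"))   # served from disk, promoted to flash
print(get("profile:42"))   # now served from flash
```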
A&B: When did you join Meta?
Rao: I joined Meta — formerly Facebook — in 2013. My first five years there were primarily about scaling out. We were building products really fast, and we did not have enough servers. Our growth was so enormous, we could not buy enough servers or build them fast enough. So we needed to use what we had on hand efficiently. As a result, a lot of my work was making our applications run efficiently across many, many servers. That was before AI.
A&B: Speaking of AI, what does the term “deep tech” mean to you?
Rao: I think deep tech is an overused term, which dilutes its meaning. I believe its origins lie in the act of going deep into technology to understand what problems we will need to solve next. However, today that depth is often quite shallow. But I understand the intent. The goal is to discover technologies that are far away from current development and will solve critical problems in two or more years. I think it is important that the industry spends resources on innovating solutions to fundamental challenges. We should look at where the puck is going, and not where it is today.
A&B: It sounds like you believe that the zeitgeist perception of deep tech has lost some of its meaning. Yet the work of looking at hard-to-solve problems is still a productive endeavor.
Rao: Yes. Some say, “The trend is your friend, except at the end where it bends.”
A&B: Tell us about the future of AI. What’s around the bend?
Rao: My expectation is that AI will get absorbed into everything as a utility in the next few years. Software will just have AI in it; we won’t call it AI or AI-enabled. It will just support the applications and hardware we use every day. People will use the value that AI brings to make things better without end users ever being aware.
A&B: Where are the biggest disagreements in AI?
Rao: There is a lot of debate right now about whether foundational models should be kept inside the company or whether to offer them to the world. There may be no right or wrong answer, but I favor the open-source approach. I believe the future of AI won’t be based on one company’s foundational model. We will need to use foundational models to create specific models for specific verticals — like law or medicine. A foundation model can get you to a place where it answers most questions. But it does not have the depth that you need in certain disciplines. More specialized models and applications can get you across the finish line. That’s the direction I see AI going in the end. Unique verticals will be built on foundational models and each vertical will in turn improve the foundation model in that space.
A&B: Do you see the builders of foundational models making way for collaboration?
Rao: Foundation models are like that movie, “Everything Everywhere All at Once”: They want to be everything; they want to be everywhere; and they want to do it all at once. But it’s important to bring foundational models to other companies so they can build and innovate on top of them. Think of a foundational model as going from zero to one. Now we have a chance to go from one to 10 if we share. That one to 10 will mean innovating within niches for each industry. Zero to one was hard, and getting everybody to recreate the process of going from zero to one is not an efficient way of utilizing the ecosystem’s resources. Let’s work on achieving 10.
A&B: How is AI affecting data centers?
Rao: GPUs don’t behave well in the data center. If I’ve got a kilowatt product, I’m not actually consuming all of that kilowatt for work. A lot of that kilowatt goes into keeping the HBM — High Bandwidth Memory — cool enough to run the bus at a high frequency. That memory feeds the beast — compute. Also, because you’ve got 1024 bits so close to each other, signal integrity becomes a problem. Transistors and wires don’t work well in hot environments. Cool transistors maintain a charge and can sustain the kind of signal integrity that allows the translation and transfer of zeroes and ones when data moves between HBM and compute. As our AI needs increase, we will only want more bandwidth and more capacity, which will compound the problem.
A&B: The disease of more.
Rao: Engineers always want more resources. It’s like a drug: you want more memory, more compute, more capacity, more networking. I’ve never had a software engineer tell me, “I want less.”
A&B: How did we get here?
Rao: If you turn back time about 40 years, CPUs were not built as one big block. You had one package with the compute cores. The memory controller was a separate package, and the cache was separate from the memory. All of the components were separate packages. They were all connected on a board, and they worked fine. The latencies were tolerable. The bandwidth was okay. Over time, every component needed more bandwidth and more wires until there was no space around the package for more pins. With Moore’s Law delivering ever more transistors per die, teams started integrating components into one package. The FPU, the memory controller, the southbridge and northbridge, everything was integrated. We ran down this path of integration for the next three decades until AI came around. Suddenly, we needed memory to be close to compute, and the only memory that met the required characteristics was High Bandwidth Memory (HBM). We needed compute and memory to be close to each other, and the only way we knew how to do that was by integrating both into the same package.
A&B: Why is data center design so critical to modern compute?
Rao: Data centers are made up of rows of racks. Typically, each row has 15 to 20 racks. In the past, those racks were built for 10 to 20 kilowatts, which made sense for general purpose compute. AI broke that. Today, a single GPU is approaching a kilowatt. Ten to 16 GPUs barely fit in one rack, which is quite inefficient. That is where power density becomes a problem. With CPUs, you can blow cold air from the cold aisles to the hot aisles. That’s the simplest data center design: no empty spots, the cooling is uniform, the heat is uniform, and the networking is good. The challenge with power-dense racks of GPUs is that everything is tightly coupled to ensure high networking bandwidth and low latency. That leaves a lot of empty space in the racks where cold air flows to no effect, and a bunch of hotspots where we have insufficient cold air. This makes it very difficult to manage. Designers now need to think hard about compute and networking, and about thermal density, too.
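Back-of-the-envelope arithmetic with the figures Rao cites shows why the old rack design breaks; the per-rack GPU count below is an assumption taken from his range.

```python
# Rough arithmetic using the figures above; the GPU count is illustrative.
legacy_rack_budget_kw = 20     # traditional racks built for 10 to 20 kilowatts
gpu_power_kw = 1.0             # a single modern GPU approaching a kilowatt
gpus_per_rack = 16             # 10 to 16 GPUs barely fitting in one rack

rack_draw_kw = gpus_per_rack * gpu_power_kw
print(f"GPU draw per rack: {rack_draw_kw:.0f} kW vs. a legacy budget of {legacy_rack_budget_kw} kW")
# The GPUs alone consume roughly the entire power budget of a legacy rack,
# before counting CPUs, networking, and cooling overhead.
```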
A&B: Do you see data center design changing in the next five to 10 years?
Rao: I’m concerned with the direction the industry is headed. Power density requirements are growing because of how tightly we’re coupling systems. Additionally, GPUs give off enormous amounts of heat. As a result, most companies are forced to head towards liquid or immersion cooling. Why? Air cooling becomes ineffective at roughly 400 or 500 watts. Liquid cooling gives data centers room for 1 to 1.5 kilowatts per device. So liquid-cooled systems are becoming popular with current GPUs.
A&B: Why does that concern you?
Rao: You should normally be worried when liquid and servers are near each other. It’s a pain for serviceability as well as design and maintenance. As I said earlier, we assume servers are unreliable by default. An engineer is always testing things, touching them, moving or replacing components. In immersion-cooled systems, when servers stop working, you don’t service them. You let them fail. It’s a pain from an operational standpoint, and we’re not even talking about cost. That’s a whole different ball game.
A&B: Is immersion cooling imminent?
Rao: It’s still many years out. Liquid-cooled systems are currently being deployed at a relatively large scale in the industry. We are learning how to run these designs efficiently.
A&B: Is there a better way?
Rao: Yes, I would like to see a better design of the components for the GPU/accelerator.
A&B: Say more.
Rao: Soon, we are going to have the challenge of 16 to 32 GPUs in a rack. I would like to split GPU/accelerators apart into a discrete compute package and an HBM memory package coupled with an optical interconnect. That way the two packages can each run at 500 watts. One can more easily design for and cool them even though their thermal characteristics are different. I would also like the industry to build an open optical standard to connect any package with any other package. That way you can have lots of memory connected to many GPU/accelerators — basically right-size memory and compute based on your workload.
A&B: Why does AI need memory to be so close to compute?
Rao: Memory is required even for regular compute. But that memory — or DRAM — sits on its own bus. Those latencies are on the order of 100 nanoseconds — maybe 200 nanoseconds — which is quite acceptable. In AI applications, most of the HBM need is in training. In training, I’m feeding data and embedding tables for a model that doesn’t fit into one machine. They’re spread across many machines. That’s why we have a smaller set of machines that are tightly coupled. The thing is, the AI model is constantly exchanging parameters, so the higher the latency to access the data, the more time it’s going to take to finish training. The number of transactions is significantly higher than in regular compute workloads — many, many times higher. That’s why you need a lot of memory close to the compute. When I say a lot, it’s not petabytes of HBM. It’s terabytes of HBM. But it needs to be close.
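A toy calculation illustrates why per-access latency matters so much at training scale. The transaction count and the faster latency are made-up illustrative numbers; only the 100 to 200 nanosecond DRAM figure comes from the conversation.

```python
# All numbers are illustrative except the DRAM-bus latency range quoted above.
transactions_per_step = 1_000_000   # hypothetical memory accesses per training step

close_memory_ns = 20    # assumed latency for memory packaged next to compute
dram_bus_ns     = 150   # within the 100 to 200 ns range for DRAM on its own bus

for name, latency_ns in [("close memory", close_memory_ns), ("DRAM bus", dram_bus_ns)]:
    stall_ms = transactions_per_step * latency_ns / 1e6
    print(f"{name}: ~{stall_ms:.0f} ms of memory stalls per training step")
# Multiplied across the huge number of parameter exchanges in training,
# the slower bus translates directly into longer time-to-train.
```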
A&B: It sounds like the future of data centers will be dependent on the characteristics of the chips they are designed for.
Rao: Yes, data center design will need to consider the architecture of future GPUs and accelerators.
A&B: Is anyone building these new architectures?
Rao: Some are looking at how to build the optics and the chiplets around an optical standard that allows two chips to talk to each other. Once that standard is done, it can be applied wherever you want, but it’s still a work in progress.
A&B: What is software’s role in all of this?
Rao: Software comes in many layers. I won’t bother talking about firmware, so let’s start with an Nvidia box of eight GPUs. CUDA is their compiler; if you were to use an AMD chip or someone else’s, you would use their compiler. If you have accelerators, you need a compiler for that accelerator, too. Then above that, you need management software for that rack, box, or whatever you want to call it. Above that you need orchestration infrastructure software that can take your job and run it over many servers or clusters of servers. On top of that is the application, and inside that application, you have your AI model. So there are many layers of software.
A&B: You’ve said that software is a key enabler for advanced compute technologies.
Rao: Everyone loves to talk about innovation at the model level, but innovation needs to happen all along the stack. Let’s talk at the compiler level. Right now, AI software written for CUDA cannot run on non-Nvidia hardware. That means other accelerators have a hard time integrating into systems. It would be great to see a compiler layer that is agnostic to hardware. If we look back, the x86 architecture was so successful largely because of GCC and LLVM. x86 itself was just a piece of hardware; the compilers allowed users to write programs in C, Python, or whatever their favorite language happened to be, and any GCC or LLVM compiler could map that software onto x86. On the other end of the spectrum were ASICs, which required specialized code. In GPU land, it’s the same story. Nvidia created CUDA as a software layer that allows the GPU to become general purpose, so coders don’t need to understand the programming language of the GPU. They only need to understand CUDA and its libraries. However, one cannot use CUDA with AMD or with some other ASIC. If we could abstract all hardware by putting a compiler stack in the middle, like GCC or LLVM, that would be tremendous. I know there are some companies working on it, but that’s going to take some time. It’s a major piece of software that needs to be built first, and it will enable hardware innovation to reach end users more quickly.
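Today you can get a weak version of that portability at the framework level rather than the compiler level. The sketch below uses PyTorch’s standard device-selection idiom; it is not the hardware-agnostic compiler stack Rao is calling for, just an illustration of writing to an abstraction instead of to a specific chip.

```python
import torch

# Pick whichever accelerator backend this build of PyTorch can see, and fall
# back to the CPU otherwise. (ROCm builds of PyTorch expose AMD GPUs through
# the same "cuda" device namespace.)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)
z = x @ y   # identical application code regardless of the backend
print(z.device)
```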
A&B: What does a healthy chip ecosystem look like?
Rao: Everything that’s relatively commoditized — NICs, GPUs, CPUs, memory, storage — needs an ecosystem with choice. These are all pieces of hardware that tend to experience issues ranging from supply chain problems to design problems. Companies don’t like their fleet to be stalled for a hiccup of that variety. If there is a bug in a chip, we don’t want that to prevent the launch of a new product or service. Having multiple players is always helpful. I’ve been in many situations where relying on a single source hit us hard.
A&B: So a healthy ecosystem is a diverse ecosystem.
Rao: Yes.
A&B: Then it’s something like agriculture where a monoculture can be existentially threatened by a single pest. The right locust or mold can wipe out an entire crop. Whereas if you cultivate many different species on the same plot of land, it supports the entire ecology’s survival.
Rao: I’m thinking of bananas. A blight nearly drove them to extinction.
A&B: Exactly.
Rao: One company cannot build chips for everybody.
A&B: So the future of AI may not be built exclusively on a single GPU foundation.
Rao: What I mean is that one solution does not cover every requirement efficiently. There are generative AI models, recommendation models, CNNs, and so on, and to the extent that any of them have grown thus far, they’ve all depended on GPUs, whether we like it or not. Now companies are doing what’s right for them to tackle exactly what makes sense for their models, and that won’t always be GPUs.
A&B: That sounds like an opportunity for entrepreneurs.
Rao: I do worry about where chip development may end up. Does it end with every company having their own internal chip design ecosystem? That would leave startups and current chip design companies out in the cold. If big companies alone have their own chip, I don’t think that’s healthy for the world. So diversity in itself is not sufficient. Chips must be diverse and available to everybody.
A&B: Where do you turn for inspiration?
Rao: Many times, I find the best answers in unexpected places. I find most of my inspiration in the unrelated — something that is completely different. I’ll be thinking about a system design problem while in the backyard planting flowers or watering vegetables when suddenly the way a plant is growing will make me think, “Ah, I could apply that here.” Even at work, when I take a break from working on one problem and discuss something unrelated with my team members, I’ll see the original issue in a new light. I find fostering creativity is about managing context across topics. Having open discussions with team members on varied topics is always enlightening.
A&B: I see a guitar case behind you. Do you find music to be a helpful context for problem solving?
Rao: My son and I played guitar when he was young. Sometimes, he wouldn’t play for a week, and then just pick it up and be right back to where he was. When I would not play for two days, I would forget where all the strings were.
A&B: I will have to ponder what the lesson for AI is there.
Rao: One thing about AI that people aren’t considering is how humanoid it can be. If you look at how children learn, it is very similar to AI. There are three main types of learning: supervised, unsupervised, and reinforcement. When your child makes a mistake, you can correct the mistake. We call that supervised learning. You could also let the child make a mistake and have them learn to stop repeating the mistake on their own. That is unsupervised learning. Alternatively, you can praise a child for doing something right, which is known as reinforcement learning. Models are very similar. You could either stop one when it’s going wrong, let it make a few mistakes and find its way to the right place, or encourage good behavior.
A&B: Models do sound a lot like children.
Rao: In more ways than you might imagine. When a child reaches their teenage years, sometimes they start behaving poorly and rebelling. They behave in a certain manner that makes sense to them and them alone. This is the norm. I’m sure you’ve heard of hallucinations in AI models, which are something of a rebellion. The model thinks what it’s doing is right, and it can get stuck there. When kids finally grow up, they become independent and go through the world in their own way. These models are trained similarly. They are fed more and more data, and out comes a trained model. After that, it will start exploring the world and returning answers. That’s inference. Sometimes it will hallucinate. Maybe someday it will grow up.
A&B: I wonder what kind of adult AI will be.
Rao: We’ll know in a decade whether it remained an infant or matured into a real adult.
A&B: Thank you for your time.