Systems for Artificial Intelligence

At the time of writing, the compute requirements of machine learning models and artificial intelligence are growing at a staggering rate of 200% every 3.5 months. Interest in the area is on the order of 10k papers per month, while investment in compute (energy, cooling, and the sustainability of compute in general) has had a hard time keeping up with the continuous stream of new requests. ...

June 4, 2025 · Reading Time: 12 minutes ·  By Xuanqiang Angelo Huang

Datacenter Hardware

We want to optimize the individual parts of the datacenter hardware so that the cost of operating the datacenter as a whole is lower, which means we need to reason about it as a whole. Datacenter CPUs: Desktop CPU vs Cloud CPU. Isolation: desktop CPUs have low isolation, since they are used by a single user; cloud CPUs have high isolation, since they are shared among different users. Workload and performance: workloads are usually heavy and move a lot of data around. Cloud CPUs offer a spectrum of low-end and high-end cores: with high parallelism you can use lower-end cores, while for resource-intensive tasks, especially latency-critical ones, it is better to have high-end cores. ...

June 4, 2025 · Reading Time: 19 minutes ·  By Xuanqiang Angelo Huang

Cloud Storage

Object Stores. Characteristics of Cloud Systems. Object storage design principles: we don’t want the hierarchy that is common in filesystems, so we simplify it and follow these four principles: black-box objects; a flat and global key-value model (a trivial model, easy to access, without the need to traverse a file hierarchy); flexible metadata; commodity hardware (the battery idea of Tesla until 2017). Object storage usages: object storage is useful for storing data that is usually read-intensive. Some examples are ...
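The first three principles in this excerpt (black-box objects, a flat global key-value model, flexible metadata) can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `FlatObjectStore`, `StoredObject`, and the method names `put`, `get`, `head`, and `list_keys` are hypothetical and do not correspond to any real provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                                   # black-box payload, never interpreted by the store
    metadata: dict = field(default_factory=dict)  # flexible, user-defined metadata


class FlatObjectStore:
    """Hypothetical in-memory store: one flat key space, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # The key is an opaque string; "/" has no special meaning to the store.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key):
        # A single lookup by key, with no hierarchy to traverse.
        return self._objects[key].data

    def head(self, key):
        # Return only the metadata, without the payload.
        return dict(self._objects[key].metadata)

    def list_keys(self, prefix=""):
        # "Folders" are emulated by filtering keys on a common prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("photos/2025/cat.jpg", b"\xff\xd8", content_type="image/jpeg")
store.put("photos/2025/dog.jpg", b"\xff\xd8", content_type="image/jpeg", owner="alice")
store.put("logs/app.txt", b"started")
```

Calling `store.list_keys("photos/")` returns the two photo keys even though no directory exists: the apparent hierarchy lives only in the key names.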

June 7, 2025 · Reading Time: 19 minutes ·  By Xuanqiang Angelo Huang

Green Computing

The cloud is inefficient, and it looks like there is a lot of room for improvement on this side. Computer science systems have reached industrial scale and can be compared to building airports, highways and metro systems in terms of public infrastructure; yet, due to their immaterial and intangible nature, most people's perception of these systems does not match their reality. While classical engineering designs physical objects, computer science designs virtual objects. ~Gustavo Alonso, CCA Lecture, 14 May 2025, ETH Zürich ...

June 6, 2025 · Reading Time: 5 minutes ·  By Xuanqiang Angelo Huang

Virtual Machines

The fundamental idea behind a virtual machine is to abstract the hardware of a single computer (the CPU, memory, disk drives, network interface cards, and so forth) into several different execution environments, thereby creating the illusion that each separate environment is running on its own private computer (Silberschatz et al. 2018). Virtualization allows a single computer to host multiple virtual machines, each potentially running a completely different operating system. It is virtual in the sense that the virtual machine has the same perception of reality as a real machine: something that is not reality but appears very similar to it. ...

June 6, 2025 · Reading Time: 13 minutes ·  By Xuanqiang Angelo Huang

Communication in the Cloud

How can we coordinate services so that they actually understand what they are doing, or what the user wants them to do? How do we manage network errors? This note mainly focuses on high-level communication protocols for this kind of coordination. Remote Procedure Calls. History: the Stub. The main idea, introduced in 1984, is the stub, see (Birrell & Nelson 1984). At a high level, the program calls the remote procedure as if it were local, but at a lower level a network request is sent. The architecture has remained the same ever since. It hides all the complexity in the stub (marshaling, binding and sending), so the programmer does not have to care about sockets and communication matters. One problem is that it might hide the complexity too well: programming certainly becomes easier, but design decisions should still account for the overhead generated by network communication. ...
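The stub idea in this excerpt can be sketched in a few lines. This is a minimal illustration, not the mechanism of Birrell & Nelson or of any real RPC library: the network is replaced by a plain function call, JSON stands in for the marshaling format, and the names `ClientStub`, `server_dispatch`, `PROCEDURES`, and `messages_sent` are all hypothetical.

```python
import json


def add(a, b):
    # The actual procedure, living on the "server" side.
    return a + b


# Table of procedures the server is willing to execute (hypothetical).
PROCEDURES = {"add": add}


def server_dispatch(request_bytes):
    # Server-side stub: unmarshal the request, call the procedure, marshal the reply.
    request = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[request["method"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


class ClientStub:
    """Client-side stub: makes a remote procedure look like a local method."""

    def __init__(self, transport):
        self._transport = transport  # stands in for the socket/network layer
        self.messages_sent = 0       # makes the hidden communication cost visible

    def __getattr__(self, method):
        # Only reached for names that are not real attributes, e.g. stub.add.
        def remote_call(*args):
            request = json.dumps({"method": method, "args": list(args)}).encode("utf-8")
            self.messages_sent += 1
            reply = self._transport(request)
            return json.loads(reply.decode("utf-8"))["result"]
        return remote_call


stub = ClientStub(server_dispatch)
total = stub.add(2, 3)  # reads like a local call, but a message round trip happens
```

The call site `stub.add(2, 3)` gives no hint that a request was serialized and sent, which is exactly the "hiding the complexity too well" concern; the `messages_sent` counter is one simple way to keep that cost observable.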

June 4, 2025 · Reading Time: 7 minutes ·  By Xuanqiang Angelo Huang

Redundant Array of Independent Disks

Introduction to Redundant Arrays of Independent Disks. We first mentioned RAID in Memoria. How do we keep up with the speed of the processor if it keeps growing exponentially? By parallelizing the lookup! That is why we need RAID (besides redundancy, which makes it safer). Disks can also fail, and RAID allows recovery. Another nice property of RAID is that the disks are hot-swappable, meaning you can replace them even while they are running. ...

June 4, 2025 · Reading Time: 4 minutes ·  By Xuanqiang Angelo Huang

Cloud Computing Services

Cloud Computing: An Overview. The cloud shifted the paradigm from owning hardware to renting computing resources on demand: hardware became a service. Key Players in the Cloud Industry. The cloud computing market is dominated by several major providers, often referred to as the “Big Seven” or hyper-scalers. They are usually not interested in making their services interoperable (they prefer lock-in). Amazon Web Services (AWS): the largest provider, offering a comprehensive suite of cloud services. Microsoft Azure: known for deep integration with enterprise systems and hybrid cloud solutions. Google Cloud Platform (GCP): excels in data analytics, AI/ML, and Kubernetes-based solutions. IBM Cloud: focuses on hybrid cloud and enterprise-grade AI. Oracle Cloud: specializes in database solutions and enterprise applications. Alibaba Cloud: the leading provider in Asia, offering services similar to AWS. Salesforce: a major player in SaaS, particularly for CRM and business applications. These providers collectively control the majority of the global cloud infrastructure market, enabling scalable and on-demand computing resources for businesses worldwide. Capital and Operational Expenses in the Cloud. Definition of CapEx and OpEx: cloud computing transforms traditional IT cost structures by shifting expenses from capital expenditures (CapEx), such as purchasing servers and data centers, to operational expenditures (OpEx), where users pay only for the resources they consume. ...

June 3, 2025 · Reading Time: 14 minutes ·  By Xuanqiang Angelo Huang

Cloud Reliability

Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for. Cloud hardware and software fail. In this note, we look for methods to analyze and predict when components fail, and for ways to prevent such failures. Defining the vocabulary. Reliability and Factors of Influence: reliability is the probability that a system will perform its intended function without failure over a specified period of time. There are many factors that influence this value: ...

June 3, 2025 · Reading Time: 8 minutes ·  By Xuanqiang Angelo Huang

Cluster Resource Management

We need to find an efficient and effective way to allocate resources. This is what the resource management layer does. Introduction to the problem. What is Cluster Resource Management? Most of the time, the user specifies an amount of resources, and then the cluster decides how much to allocate (although approaches like (Delimitrou & Kozyrakis 2014) do it differently). There are mainly two parts in cluster resource management. Allocation: deciding how many resources an application gets (techniques for this are presented in Cluster Management Policies). Assignment: deciding on which physical machine the application is actually placed. Types of management architectures. We mainly divide the management architectures in three ways: ...
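The split between allocation and assignment described in this excerpt can be made concrete with a small sketch. The policies used here are assumptions chosen for brevity and are not taken from the note: allocation simply clamps the user's request to a cap, and assignment uses a first-fit scan over machines. The names `allocate`, `assign`, and the machine labels are hypothetical.

```python
def allocate(requested_cpus, cap):
    # Allocation: decide how many CPUs the application gets.
    # Assumed policy: grant the user's request, clamped to a per-application cap.
    return min(requested_cpus, cap)


def assign(cpus, free_cpus_per_machine):
    # Assignment: decide which physical machine hosts the application.
    # Assumed policy: first fit, i.e. the first machine with enough free CPUs.
    for machine, free in free_cpus_per_machine.items():
        if free >= cpus:
            free_cpus_per_machine[machine] = free - cpus  # reserve the CPUs
            return machine
    return None  # no machine can host the application right now


free = {"m1": 2, "m2": 8}         # free CPUs per machine (hypothetical cluster)
granted = allocate(6, cap=4)      # the user asked for 6 CPUs, the cluster grants 4
placed = assign(granted, free)    # m1 is too small, so first fit picks m2
```

Keeping the two decisions in separate functions mirrors the split in the text: either policy can be swapped out without touching the other.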

June 3, 2025 · Reading Time: 7 minutes ·  By Xuanqiang Angelo Huang