Originally published as a research paper at CONNEPI 2016. Revisited here with interactive visualizations and updated context.

This article summarizes my research paper presented at CONNEPI 2016 (XI Congresso Norte Nordeste de Pesquisa e Inovação), co-authored with André Henrique Sousa de Menezes, Prof. Leandro Cavalcanti de Almeida, and Prof. Paulo Ditarso Maciel Junior at IFPB.

The full paper is embedded at the bottom of this post.

The question

Server virtualization lets multiple virtual machines share the same physical hardware. This is efficient — but what happens to the neighbors when one VM gets hit by a DDoS attack?

We already knew from prior experiments that neighboring VMs suffer performance degradation during an attack on a co-hosted VM. The question was why, and specifically: what role does CPU context switching play?

Background

In a paravirtualized environment like XenServer, the hypervisor mediates all system calls and hardware access from guest VMs. Every network packet, every I/O operation, every system call goes through the hypervisor. This creates overhead — and that overhead becomes critical under attack.

A context switch happens when the CPU transfers control from one process to another. A switch is triggered either when a process exhausts its time slice or when a high-priority interrupt forces the OS to reclaim the CPU. Context switches are expensive: the CPU must save the current process state, load the new one, and flush the relevant caches.
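Context switches aren't abstract — on Linux the kernel exposes counters for them, which is ultimately what made them measurable in this work. A quick look from the shell (Linux-only; these are standard procfs paths):

```shell
# Per-process counters: "voluntary" means the process yielded the CPU
# (e.g. blocked on I/O); "nonvoluntary" means the kernel preempted it
# (time slice expired, or an interrupt needed servicing).
grep ctxt /proc/self/status

# System-wide total context switches since boot:
grep '^ctxt' /proc/stat
```

Watching the nonvoluntary counter climb under load is a small-scale version of exactly what we measured from Dom0.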

Previous work by Shea and Liu (2012) showed that virtualized environments experience a non-linear increase in context switches under DDoS compared to bare-metal systems. They proposed kernel modifications to KVM to reduce context switching overhead.

We wanted to understand this relationship in XenServer specifically.

The experiment

We built a controlled lab environment:

  • Host: Intel i7 quad-core, 32 GB RAM, 1 TB HDD, two Gigabit Ethernet interfaces (LACP bonded for 2 Gbps)
  • Hypervisor: Citrix XenServer 7.2.1511
  • Two VMs: Each with 1 vCPU, 1 GB RAM, 10 GB disk, Debian 8.2 running Apache 2
  • Attack infrastructure: 10 slave machines for DDoS, plus a master controller and client machines
  • Network: Cisco Catalyst 2960 switch, VMs on separate virtual networks

The experiment ran 60 rounds: 30 without attack (baseline) and 30 with a DDoS attack targeting VM1. In both scenarios, both VMs ran synthetic workloads via Sysbench and Stress-ng to simulate realistic server load.

Each round followed four automated phases (initialization, execution, collection, and cleanup), orchestrated by shell scripts that coordinated all machines via SSH. We collected context-switch and interrupt data directly from Dom0 (the privileged Xen management domain) using vmstat, a native Linux tool that reports kernel statistics, including per-second interrupt and context-switch counts.
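The collection phase boiled down to sampling vmstat on Dom0 and averaging its `in` (interrupts/s) and `cs` (context switches/s) columns over the round. A self-contained sketch of that step — the file name is illustrative, and the two sample rows are invented so that their averages land exactly on the paper's baseline figures:

```shell
# In the real experiment this would be:  vmstat 1 60 > round.log
# Here we fake a captured round so the sketch runs anywhere.
cat > round.log <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 612340  84120 918232    0    0     2     5 25100 13400  3  1 95  1  0
 1  0      0 612212  84120 918240    0    0     0     0 25490 13608  3  1 95  1  0
EOF

# Skip vmstat's two header lines, then average column 11 ("in",
# interrupts/s) and column 12 ("cs", context switches/s).
awk 'NR > 2 { in_sum += $11; cs_sum += $12; n++ }
     END { printf "avg interrupts/s: %.0f  avg ctx switches/s: %.0f\n",
           in_sum / n, cs_sum / n }' round.log
# Prints: avg interrupts/s: 25295  avg ctx switches/s: 13504
```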

See it in action

Before diving into the numbers, here’s an interactive simulation of what happens inside the CPU. Toggle the DDoS attack to see how interrupts flood the system and force context switches:

[Interactive visualization: a CPU timeline showing VM1, VM2, and the Dom0 hypervisor, with markers for context switches and for high- vs. low-priority interrupts (high-priority interrupts force a context switch; low-priority ones are handled in place), plus live counters for interrupts/s, context switches/s, and the conversion rate — the percentage of interrupts that forced a context switch. An accompanying panel shows the actual baseline measurements from the 60 rounds: 25,295 interrupts, 13,504 context switches, and a 53.39% conversion rate.]

Results

The numbers were striking:

Metric (averages)       Without DDoS    With DDoS
Interrupts              25,295          141,157
Context switches        13,504          95,369
Conversion rate         53.39%          67.63%

During the DDoS attack:

  • Interrupts rose from an average of 25,295 to 141,157 — a 458% increase, meaning 82% of the interrupts observed during the attack were attack-induced
  • Context switches rose from 13,504 to 95,369 — a 606% increase, with 86% attack-induced
  • The conversion rate rose by roughly 14 percentage points, from 53.39% to 67.63%
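These deltas follow directly from the measured averages in the table; a quick check of the arithmetic:

```shell
# Sanity-check the reported deltas from the table's averages.
awk 'BEGIN {
  base_in = 25295;  ddos_in = 141157   # avg interrupts
  base_cs = 13504;  ddos_cs = 95369    # avg context switches
  printf "interrupts:       +%.0f%%\n", (ddos_in - base_in) / base_in * 100
  printf "context switches: +%.0f%%\n", (ddos_cs - base_cs) / base_cs * 100
  printf "conversion rate:  +%.2f points\n", 67.63 - 53.39
}'
# Prints:
#   interrupts:       +458%
#   context switches: +606%
#   conversion rate:  +14.24 points
```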

This last finding is the key insight. It’s not just that there are more interrupts — the nature of those interrupts changes. Under DDoS, a disproportionate number of interrupts are high-priority, forcing the CPU to perform expensive context switches rather than handling them in the current process context.

Why I thought this was interesting

This research, while academic, touches on something very practical: noisy neighbor problems in shared infrastructure. If you’re running workloads on shared virtualized infrastructure (which includes most cloud environments), a DDoS attack on a co-tenant can degrade your performance even if your VM isn’t the target.

The mechanism is clear: the flood of network packets generates hardware interrupts that propagate through the hypervisor, consuming CPU cycles across all VMs sharing that physical host. The non-linear relationship between interrupts and context switches means the degradation is worse than you’d expect from simple resource contention.

Reflections

This was undergraduate research — my first real exposure to systems-level thinking. Setting up the controlled environment, writing the automation scripts, collecting and analyzing the data — it taught me how to reason about system behavior from first principles.

That first-principles mindset, by the way, really stuck with me.

But looking back, I should have instrumented the VMs themselves to get more granular data on how the attack affected their workloads. We only had visibility from Dom0, which limited our understanding of the internal state of each VM during the attack.

Also, I should have explored the real-world implications more deeply. For example, did this present a significant risk to cloud customers? How did cloud providers view the problem, did it affect their revenue, and what mitigation strategies existed at the time?

I think these things would have made the paper more impactful and relevant to practitioners, rather than just an academic exercise. But hey, it was my first research project — I was learning as I went.

What’s next

It’s been nearly a decade since we ran these experiments. A lot has changed in virtualization and cloud infrastructure — container runtimes, eBPF-based observability, hardware-assisted virtualization improvements, and cloud providers implementing better noisy-neighbor isolation. In an upcoming post, I’ll revisit this research and explore what’s different now: which of our findings still hold, what new mitigation strategies exist, and how modern observability tools would have changed our experimental approach.


Full Paper (CONNEPI 2016)

Can't see the PDF? Download it here.