The use of eBPF – in Netflix, GPU infrastructure, Windows programs and more

Takeaways from eBPF Summit 2024

How are organizations applying eBPF to solve real problems in observability, security, profiling, and networking? It’s a question I’ve found myself asking as I work in and around the observability space – and I was pleasantly surprised when Isovalent’s recent eBPF Summit provided some answers.

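For those new to eBPF, it’s an open source technology that lets sandboxed programs run inside the operating system kernel without changing kernel source code or loading kernel modules, which makes it a rich data source for observability. Many organizations and vendors have adopted it for that purpose (including Causely, where we use it to enhance our instrumentation for Kubernetes).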

Many of the eBPF sessions highlighted real challenges companies faced and how they used eBPF to overcome them. In the spirit of helping others, my cliff notes and key takeaways from eBPF Summit are below.

Organizations like Netflix and Datadog are using eBPF in new, creative ways

The use of eBPF in Netflix

One of the keynote presentations was delivered by Shweta Saraf, who described specific problems Netflix overcame using eBPF, such as noisy neighbors. This is a common problem for companies running cloud-native environments.

Shweta Saraf described Netflix's use cases for eBPF

Netflix uses eBPF to measure how long processes wait in the run queue, that is, runnable but not yet scheduled onto a CPU. When that wait grows too long, it usually indicates a performance bottleneck on CPU resources, such as CPU throttling or over-allocation. (Netflix’s compute and performance team recently published a blog post with much more detail on the subject.) While solving the noisy neighbor problem, the Netflix team also created a tool called bpftop, which measures the CPU usage of the eBPF programs they had deployed.

The Netflix team released bpftop for the community to use, and it should help organizations write efficient eBPF programs. It is especially useful when an eBPF program is misbehaving, because it lets teams quickly see how much overhead that program adds. We have come full circle: monitoring our monitoring programs 😁.
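To make the scheduling measurement above concrete: one way to approximate it is to pair each wakeup of a process with the next moment that process is switched onto a CPU, and record the gap (often called run queue latency). The sketch below is plain Python rather than eBPF, and it is not Netflix's code. The event tuples, the `"wakeup"` and `"switch_in"` labels, and the function name are all hypothetical stand-ins for what a scheduler-attached eBPF program might emit.

```python
from collections import defaultdict


def run_queue_latencies(events):
    """Pair each wakeup with the next switch-in for the same PID.

    `events` is an iterable of (timestamp_ns, kind, pid) tuples, where
    `kind` is "wakeup" or "switch_in". Both labels are hypothetical.
    Returns {pid: [latency_ns, ...]}.
    """
    enqueued_at = {}                # pid -> timestamp of the pending wakeup
    latencies = defaultdict(list)   # pid -> observed waits in nanoseconds

    for timestamp_ns, kind, pid in events:
        if kind == "wakeup":
            # The process became runnable and is now waiting for a CPU.
            enqueued_at[pid] = timestamp_ns
        elif kind == "switch_in":
            # The process got a CPU; the gap since wakeup is its wait.
            started_waiting = enqueued_at.pop(pid, None)
            if started_waiting is not None:
                latencies[pid].append(timestamp_ns - started_waiting)

    return dict(latencies)


sample_events = [
    (100, "wakeup", 1),
    (150, "wakeup", 2),
    (400, "switch_in", 1),
    (450, "switch_in", 2),
    (900, "switch_in", 3),  # no wakeup seen for PID 3, so it is ignored
]
sample_latencies = run_queue_latencies(sample_events)
```

A real implementation would keep the pending timestamps in an eBPF map inside the kernel and aggregate per container, but the pairing logic is the same idea.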

The use of eBPF in Datadog

Another use case for eBPF, and one that is easily overlooked, is chaos engineering. Scott Gerring, a technical advocate at Datadog, shared his experience on the matter. This quote resonated with me: “with eBPF… we have this universal language of destruction” (controlled destruction, that is).

Scott Gerring discussed eBPF's use in Datadog

The benefit of eBPF is that we can inject failures into cloud-native systems without having to rewrite an application’s code. Interestingly, there are already open source chaos engineering projects that use eBPF, such as Chaos Mesh.

Scott listed a few examples. One is a kernel probe attached to the openat system call that returns access-denied errors for 50% of the calls made by processes the user selects or defines. Another uses the traffic control subsystem to drop packets on sockets belonging to the processes you want to mark for failure.
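The decision logic behind the first example can be sketched in a few lines. This is a plain Python simulation, not eBPF and not the code shown in the talk: `make_fault_injector`, `on_openat`, and the parameters are hypothetical names, and the only real value used is the standard `EACCES` error number.

```python
import errno
import random


def make_fault_injector(target_pids, failure_rate=0.5, seed=None):
    """Build a hook that fails a fraction of calls for selected PIDs.

    All names here are hypothetical; a real implementation would make
    this decision inside an eBPF program attached in the kernel.
    """
    targets = frozenset(target_pids)
    rng = random.Random(seed)  # seedable so experiments are repeatable

    def on_openat(pid):
        # Only processes marked for failure are ever affected.
        if pid in targets and rng.random() < failure_rate:
            return -errno.EACCES  # simulate "permission denied"
        return 0                  # let the call proceed normally

    return on_openat


hook = make_fault_injector(target_pids={4242}, failure_rate=0.5, seed=1)
results = [hook(4242) for _ in range(10_000)]
failed_fraction = sum(1 for r in results if r != 0) / len(results)
```

Keeping the target set and failure rate as parameters mirrors the point made in the talk: the blast radius is something the user chooses, which is what makes the destruction controlled.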

eBPF will underpin AI development

Isovalent co-founder and CTO Thomas Graf presented the eBPF roadmap and what he is most excited about. Notably: eBPF will deliver value in enabling the GPU and DPU infrastructure wave fueled by AI. AI is undoubtedly one of the hottest topics in tech right now. Many companies are using GPUs and DPUs to accelerate AI and machine learning (ML) tasks, because CPUs cannot deliver the processing power demanded by today’s AI models.

Thomas Graf talked about the value of eBPF in enabling GPU and DPU infrastructures

As Thomas mentioned, whether the AI wave produces anything meaningful is up for debate, but companies will undoubtedly try, and they will make significant investments in GPUs and DPUs along the way. The capabilities of eBPF will be applied to this new wave of infrastructure in the same way they were applied to CPUs.

GPUs and DPUs are expensive, so companies do not want to waste processing power on programs that needlessly drive up utilization. The efficiency of eBPF programs can help maximize the performance of costly GPUs. For example, eBPF can be used for GPU profiling by hooking into GPU events such as memory operations, synchronization, and kernel launches. This data shows which kernels are used most frequently, which can make AI development more efficient.
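As a rough illustration of the last step, suppose a profiler has already collected kernel launch events. Ranking kernels by launch count is then a simple aggregation. The event shape `(kernel_name, duration_us)`, the kernel names, and the function `top_kernels` are assumptions made for this sketch; they do not come from the talk or from any particular GPU profiling tool.

```python
from collections import Counter


def top_kernels(launch_events, n=3):
    """Return the n most frequently launched kernels.

    `launch_events` is an iterable of (kernel_name, duration_us) tuples,
    a hypothetical format for events captured at kernel-launch hooks.
    """
    counts = Counter(name for name, _duration_us in launch_events)
    return counts.most_common(n)


sample_launches = [
    ("matmul", 120),
    ("softmax", 30),
    ("matmul", 115),
    ("layernorm", 20),
    ("matmul", 118),
    ("softmax", 28),
]
most_used = top_kernels(sample_launches, n=2)
```

The same events could be summed by duration instead of counted, which would answer a slightly different question: where GPU time goes rather than what runs most often.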

eBPF support for Windows is growing

Another interesting milestone in eBPF’s journey is support for Windows. In fact, there is a growing GitHub repository for eBPF on Windows: https://github.com/microsoft/ebpf-for-windows

The project supports Windows 10 or later and Windows Server 2019 or later, and while it does not yet have feature parity with Linux, there is a lot of development in this space. The community is hard at work porting the eBPF tooling that exists on Linux, but it is a challenging endeavor because the hook points and components of the Linux eBPF stack (like just-in-time compilation or eBPF bytecode signatures) differ on Windows.

It will be exciting to watch the same networking, security, and observability eBPF capabilities on Linux become available for Windows.

The need for better observability is fueling eBPF ecosystem growth

The community has created eBPF tools for both application and infrastructure use cases. There are 9 major projects for applications and over 30 exciting emerging ones. Notably, while there are a few production-ready runtimes and tools within the infrastructure ecosystem (like Linux and the LLVM compiler), there are many emerging projects, such as eBPF for Windows.

With a user base across Meta, Apple, Capital One, LinkedIn, and Walmart (just to name a few), we can expect the number of eBPF projects to grow considerably in the coming years. The overall number of projects is forecast to reach triple digits by the end of 2025.

One of the top catalysts for growth? The urgent need for better observability. Of all the topics at last year’s KubeCon in Chicago, observability ranked the highest, beating competing topics like cost and automation. As with any other tool, eBPF lets organizations gather a lot of data, but the “why” is important. Are you using that data to create more noise and more alerts, or can you apply it to get to the root cause of the problems that surface, or to other applications?

It is exciting to watch the eBPF community develop and implement creative new ways to use eBPF, and the 2024 eBPF Summit was (and still is) an excellent source of real-world eBPF use cases and community-generated tooling.

