Tuesday, October 16, 2018

Summary: Analyzing Linux Game Server Performance (MGDC Presentation)

Performance Metrics

  1. Throughput:

    • Measures how much work the server handles.
    • Example: Maximum concurrent users the server can handle.
  2. Response Time:

    • Measures how quickly a task is completed.
    • Example: Detecting server lag.
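
As a rough illustration, both metrics can be derived from a log of request start and end times. The `Request` type, the sample timestamps, and the nearest-rank percentile below are assumptions made for this sketch, not something taken from the presentation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    start_ms: int  # when the request arrived, in ms since the window began
    end_ms: int    # when the response was sent


def throughput(requests, window_ms):
    """Completed requests per second over the measurement window."""
    return len(requests) / (window_ms / 1000)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: five requests observed during a 2-second window.
requests = [
    Request(0, 20),
    Request(100, 130),
    Request(500, 520),
    Request(1200, 1700),  # one slow request, which players would notice as lag
    Request(1800, 1830),
]
latencies_ms = [r.end_ms - r.start_ms for r in requests]

rps = throughput(requests, window_ms=2000)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"throughput={rps} req/s  p50={p50} ms  p99={p99} ms")
```

The median hides the slow request while the 99th percentile exposes it, which is why lag detection usually looks at the tail rather than the average.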

Challenges of Game Server Analysis

  • Complex Architecture: Game servers involve numerous components.
  • Desired Traits in Analysis Tools:
    1. Minimal performance impact (unlike tools such as Valgrind, which slow the target program significantly).
    2. Usable in production without causing crashes.
    3. Capable of continuous monitoring.
    4. Provide actionable and detailed data (e.g., response time distributions).
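
A response time distribution is often shown as a power-of-two histogram, similar in spirit to what eBPF tools print. The sketch below is one possible way to build such a histogram in plain Python. The function names and the sample values are hypothetical.

```python
from collections import Counter


def log2_bucket(value_us):
    """Return the power-of-two bucket index for a latency in microseconds.

    Bucket i holds values in [2**i, 2**(i+1)); values below 1 go to bucket 0.
    """
    return max(int(value_us), 1).bit_length() - 1


def histogram(latencies_us):
    """Count latencies per power-of-two bucket."""
    return Counter(log2_bucket(v) for v in latencies_us)


def render(hist, width=40):
    """Render the histogram as text rows: value range, count, bar."""
    if not hist:
        return []
    peak = max(hist.values())
    rows = []
    for bucket in range(min(hist), max(hist) + 1):
        count = hist.get(bucket, 0)
        low, high = 2 ** bucket, 2 ** (bucket + 1) - 1
        bar = "*" * round(width * count / peak)
        rows.append(f"{low:>8} -> {high:<8} : {count:<6} |{bar}")
    return rows


# Hypothetical response times in microseconds.
samples = [120, 130, 250, 260, 270, 300, 900, 15000]
hist = histogram(samples)
print("\n".join(render(hist)))
```

The single 15000 µs sample lands in its own distant bucket, so an outlier that an average would smooth over stays visible.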

Approach to Measurement

  1. Instrumentation in Code:

    • Add measurement code to the game server.
    • Example: Measure individual event execution times:
      • Client message processing time.
      • Queue wait times.
      • External API and database request-response times.
  2. Advantages:

    • Directly measure specific metrics, such as external API call latency or failure rates.
    • Control measurement scope, focusing on specific abstraction levels (e.g., system, process, or thread level).
  3. Disadvantages:

    • Manual work to insert measurement code.
    • Any code path left uninstrumented is missing from the measurements.
    • Modifications require redeploying or reloading the server.
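
A minimal sketch of in-code instrumentation, assuming a Python server for brevity. The names `measure`, `TimedQueue`, and `handle_message` are hypothetical stand-ins, and the handler body is a placeholder for real game logic.

```python
import time
from collections import defaultdict, deque
from contextlib import contextmanager

# Collected durations in seconds, keyed by event name.
timings = defaultdict(list)


@contextmanager
def measure(event_name, clock=time.perf_counter):
    """Record how long the wrapped block takes under event_name."""
    started = clock()
    try:
        yield
    finally:
        timings[event_name].append(clock() - started)


class TimedQueue:
    """FIFO queue that records how long each item waited before being taken."""

    def __init__(self, clock=time.perf_counter):
        self._items = deque()
        self._clock = clock

    def put(self, item):
        # Store the enqueue time next to the item.
        self._items.append((self._clock(), item))

    def get(self):
        enqueued_at, item = self._items.popleft()
        timings["queue_wait"].append(self._clock() - enqueued_at)
        return item


def handle_message(message):
    """Hypothetical handler standing in for real game logic."""
    with measure("client_message"):
        return message.upper()


queue = TimedQueue()
queue.put("move north")
result = handle_message(queue.get())
print(dict(timings))
```

Only code wrapped in `measure` or routed through `TimedQueue` shows up in `timings`, which illustrates the disadvantages listed above: each measurement point is added by hand, and changing one means shipping new server code.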

Alternative: External Observation

  • Use Linux tools and frameworks for performance monitoring and tracing.

Linux Performance Tools

1. Linux perf

  • Counts and samples hardware events (CPU cycles, cache misses) and kernel software events (context switches).
  • Part of the mainline kernel since version 2.6.31.

2. eBPF (Extended Berkeley Packet Filter)

  • Modern, high-performance tracing and monitoring framework.
  • Runs within the Linux kernel using a restricted virtual machine.

eBPF Features

  • Tracing Capabilities:
    • Hook kernel functions, system calls, or application functions.
    • Perform actions during function calls, such as filtering packets.
  • Tools:
    • BCC (BPF Compiler Collection): Compiles eBPF programs written in restricted C, loads them into the kernel, and controls them from Python or Lua scripts.
    • PLY: A scripting tool for eBPF.
    • Flame Graphs: Visualize CPU profiling data, mapping CPU usage to call stacks.
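
Flame graph tooling commonly takes "folded" input: one line per unique call stack, with frames joined by semicolons from root to leaf, followed by a space and the sample count. The sketch below shows how sampled stacks collapse into that form. The function name and the sample stacks are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into folded lines: 'root;...;leaf count'.

    Each sample is a list of frame names ordered from root to leaf.
    """
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical CPU samples from a game server's main loop.
samples = [
    ["main", "game_loop", "update_world", "physics_step"],
    ["main", "game_loop", "update_world", "physics_step"],
    ["main", "game_loop", "update_world", "ai_tick"],
    ["main", "game_loop", "send_packets"],
]
folded = fold_stacks(samples)
print("\n".join(folded))
```

In the rendered graph, the width of each frame is proportional to its sample count, so `physics_step` would appear twice as wide as `ai_tick` here.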

eBPF Use Cases

  • Identify bottlenecks, such as:
    • Long lock waits.
    • Slowest disk I/O operations.
    • Most frequent system calls.
  • Pass collected data from kernel space to user space safely, keeping the two separate.
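
As a hedged sketch of the "most frequent system calls" case, the following uses BCC's Python front end to count system calls by number through the `raw_syscalls:sys_enter` tracepoint. It assumes the `bcc` package is installed, root privileges, and a kernel with tracepoint support for eBPF. The names `BPF_PROGRAM`, `top_n`, and `count_syscalls` are chosen for this example.

```python
import time

# eBPF program in BCC's restricted C. It counts system calls by syscall
# number each time the raw_syscalls:sys_enter tracepoint fires.
BPF_PROGRAM = r"""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 syscall_id = args->id;
    counts.increment(syscall_id);
    return 0;
}
"""


def top_n(counts, n):
    """Return the n most frequent (syscall_id, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def count_syscalls(duration_seconds=5, n=10):
    """Load the program, let it run, then read the counts from the eBPF map.

    Needs root and the bcc package; the import is local so the rest of this
    module can be used without bcc installed.
    """
    from bcc import BPF

    bpf = BPF(text=BPF_PROGRAM)  # compiles, verifies, and attaches the probe
    time.sleep(duration_seconds)
    counts = {key.value: value.value for key, value in bpf["counts"].items()}
    return top_n(counts, n)
```

Calling `count_syscalls()` as root would return system-wide counts, since this sketch does not filter by process. The counting happens in the kernel and only the aggregated map is read from user space, so the target server is neither modified nor restarted.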

Advantages of eBPF:

  1. Safety: Does not crash the OS or target processes, because the in-kernel verifier rejects unsafe programs before they are loaded.
  2. Dynamic Measurement: Change measurement methods without restarting systems.
  3. Third-party Support: Utilize libraries and frameworks for complex tasks.

Disadvantages of eBPF:

  1. Kernel Dependency: Requires a recent kernel version (v3.18 or later; v4.9+ recommended).
  2. Knowledge Requirements: Familiarity with the target system’s libraries, OS, and call stack is essential.
  3. Compiler Challenges: Optimizations may obscure stack traces.
  4. Resource Overhead: Typically 3–6%, but up to 10% is acceptable in practice (e.g., Netflix's use case).

Conclusion

By combining in-code instrumentation with external tools like eBPF, Linux game server performance can be analyzed efficiently. This approach keeps the impact on production environments low, provides comprehensive insights, and lets measurements be changed dynamically as performance requirements evolve.