The server in question has two CPUs with 24 physical cores each and runs Ubuntu 23.10. I observed a significant slowdown in one of my Python scripts: with threads, it ran at roughly 1/100 of its single-threaded speed. Since threading in Python can be problematic in its own right, I wrote a simple test program in C to investigate. In its compute-intensive part, the code shares no memory between threads.
Here is the relevant information:
joe@galileo:~$ cat /proc/version
Linux version 6.5.0-28-generic (buildd@lcy02-amd64-001) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024
The CPU is:
root@galileo:~# cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
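Since this is a dual-socket machine, the socket and NUMA layout may matter; it can be confirmed with lscpu (a standard util-linux tool — this is a generic command, not output from the machine above):

```shell
# Show sockets, cores per socket, and the NUMA-node-to-CPU mapping
lscpu | grep -E "Socket|Core|NUMA"
```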
The test C code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int _memsize;
int _N;

double time_usec(void) {
    struct timeval tv;
    if (gettimeofday(&tv, NULL) == 0) {
        /* seconds as a double, with microsecond resolution */
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    } else {
        perror("gettimeofday failed");
        return 0.0;
    }
}

void *test(void *arg) {
    int M = 10000;
    double t0, t1;
    int count;
    int tot;
    unsigned char *mem;
    long int id = (long int)arg;
    int N = _N;
    int memsize = _memsize;

    printf("Threads=%d memsize=%d\n", N, memsize);
    mem = (unsigned char *)malloc(memsize);
    for (int i = 0; i < memsize; i++)
        mem[i] = random();

    t0 = time_usec();
    count = 0;
    tot = 0;
    while (1) {
        for (int i = 0; i < M; i++) {
            tot += mem[rand() % memsize];
            if (tot > 10000000)
                tot -= 10000000;
            count++;
        }
        t1 = time_usec();
        if ((t1 - t0) > 2.0) {
            double nop_per_sec = (double)count / (t1 - t0);
            t0 = t1;
            printf("%3ld %15.2lf nop/sec\n", id, nop_per_sec / 1000000.0);
            count = 0;
            M = nop_per_sec * 0.1;  /* recalibrate batch size */
        }
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    pthread_t thread;
    int rc;
    long t;

    _N = atoi(argv[1]);
    _memsize = atoi(argv[2]);
    for (t = 0; t < _N; t++) {
        printf("Creating thread %ld\n", t);
        rc = pthread_create(&thread, NULL, test, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    while (1)
        sleep(1);
    return 0;
}
The performance when using single thread:
joe@galileo:~$ ./perf 1 1000
Creating thread 0
Threads=1 memsize=1000
  0           60.61 nop/sec
  0           60.97 nop/sec
  0           61.05 nop/sec
  0           61.05 nop/sec
This shows about 60M operations per second (an arbitrary but consistent unit). With 2 threads:
joe@galileo:~$ ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2 memsize=1000
Threads=2 memsize=1000
  0            0.79 nop/sec
  1            0.71 nop/sec
  0            0.80 nop/sec
  1            0.72 nop/sec
  0            0.79 nop/sec
  1            0.67 nop/sec
That is roughly a 100x slowdown. CPU load is about 200%, as expected. When I instead run two processes, each with a single thread, there is no such slowdown:
joe@galileo:~$ ./perf 1 1000 & ./perf 1 1000
[1] 4989
Creating thread 0
Threads=1 memsize=1000
Creating thread 0
Threads=1 memsize=1000
  0           60.76 nop/sec
  0           60.76 nop/sec
  0           61.48 nop/sec
  0           61.48 nop/sec
  0           61.44 nop/sec
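To check whether thread placement matters (for example, the two threads landing on sibling hyperthreads, or being split across the two sockets), the threaded run can be pinned explicitly with taskset. The core IDs below are assumptions for illustration and should be checked against the machine's actual topology:

```shell
# Pin both threads to two cores that lscpu reports as being on the
# same socket (IDs 0 and 1 here are an assumption, not verified):
taskset -c 0,1 ./perf 2 1000

# For comparison, pin to cores on different sockets
# (ID 24 assumes the second socket starts there):
taskset -c 0,24 ./perf 2 1000
```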
The scheduling values are as follows:
root@galileo:~# sysctl -a | grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
I played with the scheduling settings to no avail. perf stat reports troubling "LLC-load-misses" and "iTLB-load-misses" figures, as shown below:
joe@galileo:~$ sudo perf stat -d -d -d --timeout 10000 ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2 memsize=1000
Threads=2 memsize=1000
  1            1.21 nop/sec
  0            0.76 nop/sec
  0            0.72 nop/sec
  1            1.00 nop/sec
  0            2.30 nop/sec
  1            1.30 nop/sec
  1            1.28 nop/sec
  0            2.32 nop/sec
./perf: Terminated

 Performance counter stats for './perf 2 1000':

         19,875.36 msec task-clock                #    1.985 CPUs utilized
            24,396      context-switches          #    1.227 K/sec
                 0      cpu-migrations            #    0.000 /sec
               100      page-faults               #    5.031 /sec
    62,051,572,900      cycles                    #    3.122 GHz                        (38.44%)
    12,305,915,977      instructions              #    0.20  insn per cycle             (46.07%)
     1,770,935,151      branches                  #   89.102 M/sec                      (46.19%)
        20,833,461      branch-misses             #    1.18% of all branches            (46.14%)
     2,886,437,444      L1-dcache-loads           #  145.227 M/sec                      (46.08%)
        62,673,587      L1-dcache-load-misses     #    2.17% of all L1-dcache accesses  (46.16%)
        36,380,110      LLC-loads                 #    1.830 M/sec                      (30.73%)
        36,335,895      LLC-load-misses           #   99.88% of all L1-icache accesses  (30.80%)
   <not supported>      L1-icache-loads
        14,122,757      L1-icache-load-misses                                           (30.91%)
     2,893,359,729      dTLB-loads                #  145.575 M/sec                      (30.77%)
        15,591,131      dTLB-load-misses          #    0.54% of all dTLB cache accesses (30.84%)
           154,787      iTLB-loads                #    7.788 K/sec                      (30.91%)
        30,148,243      iTLB-load-misses          # 19477.24% of all iTLB cache accesses (30.72%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

      10.010814854 seconds time elapsed

      11.662038000 seconds user
       8.215249000 seconds sys
I find this quite peculiar. I don't see such a slowdown on gcloud or AWS EC2 instances. It seems to defeat the purpose of threads altogether.
Does anyone have any insights?