Channel: Active questions tagged ubuntu - Stack Overflow

Linux pthreads are extremely slow on multicore server


The server in question has two CPUs with 24 physical cores each and runs Ubuntu 23.10. I observed a dramatic slowdown in one of my Python scripts: with threads, it ran at roughly one hundredth of its normal speed. Since threading in Python can be problematic in its own right, I wrote a simple C test program to investigate further. Its compute-intensive parts share no memory or other resources between threads.

Here is the relevant information:

joe@galileo:~$ cat /proc/version
Linux version 6.5.0-28-generic (buildd@lcy02-amd64-001) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024

The CPU is:

root@galileo:~# cat /proc/cpuinfo | grep "model name" | head -1
model name  : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz

The test C code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int _memsize;
int _N;

double time_usec(void) {
    struct timeval tv;
    if (gettimeofday(&tv, NULL) == 0) {
        // Convert seconds to double and add microseconds divided by 1,000,000
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    } else {
        // gettimeofday failed, handle error as appropriate
        perror("gettimeofday failed");
        return 0.0;
    }
}

void *test(void *arg) {
    int M = 10000;
    double t0, t1;
    int count;
    int tot;
    unsigned char *mem;
    long int id = (long int)arg;
    int N = _N;
    int memsize = _memsize;

    printf("Threads=%d   memsize=%d\n", N, memsize);
    mem = (unsigned char *)malloc(memsize);
    for (int i = 0; i < memsize; i++) mem[i] = random();
    t0 = time_usec();
    count = 0;
    tot = 0;
    while (1) {
        for (int i = 0; i < M; i++) {
            tot += mem[(rand()) % memsize];
            if (tot > 10000000) tot -= 10000000;
            count++;
        }
        t1 = time_usec();
        if ((t1 - t0) > 2.0) {
            double nop_per_sec = (double)count / (t1 - t0);
            t0 = t1;
            printf("%3ld  %15.2lf nop/sec\n", id, nop_per_sec / 1000000.0);
            count = 0;
            M = nop_per_sec * 0.1;
        }
    }
    pthread_exit(NULL);
}

int main(int argn, char *arg[]) {
    pthread_t thread;
    int rc;
    long t;

    _N = atoi(arg[1]);
    _memsize = atoi(arg[2]);
    for (t = 0; t < _N; t++) {
        printf("Creating thread %ld\n", t);
        rc = pthread_create(&thread, NULL, test, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    while (1) sleep(1);
    return 0;
}

The performance when using single thread:

joe@galileo:~$ ./perf 1 1000
Creating thread 0
Threads=1   memsize=1000
  0            60.61 nop/sec
  0            60.97 nop/sec
  0            61.05 nop/sec
  0            61.05 nop/sec

That shows about 60M operations per second (an arbitrary but consistent unit). When I use 2 threads:

joe@galileo:~$ ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2   memsize=1000
Threads=2   memsize=1000
  0             0.79 nop/sec
  1             0.71 nop/sec
  0             0.80 nop/sec
  1             0.72 nop/sec
  0             0.79 nop/sec
  1             0.67 nop/sec

That is roughly a 100x slowdown, even though CPU load is about 200% as expected. When I instead run two processes, each with a single thread, I do not observe any slowdown:

joe@galileo:~$ ./perf 1 1000 & ./perf 1 1000
[1] 4989
Creating thread 0
Threads=1   memsize=1000
Creating thread 0
Threads=1   memsize=1000
  0            60.76 nop/sec
  0            60.76 nop/sec
  0            61.48 nop/sec
  0            61.48 nop/sec
  0            61.44 nop/sec

The scheduling values are as follows:

root@galileo:~# sysctl -a | grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024

I played with the scheduling settings, to no avail. "perf stat" reports problematic "LLC-load-misses" and "iTLB-load-misses" figures, as shown below:

joe@galileo:~$ sudo perf stat -d -d -d --timeout 10000 ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2   memsize=1000
Threads=2   memsize=1000
  1             1.21 nop/sec
  0             0.76 nop/sec
  0             0.72 nop/sec
  1             1.00 nop/sec
  0             2.30 nop/sec
  1             1.30 nop/sec
  1             1.28 nop/sec
  0             2.32 nop/sec
./perf: Terminated

 Performance counter stats for './perf 2 1000':

         19,875.36 msec task-clock                       #    1.985 CPUs utilized
            24,396      context-switches                 #    1.227 K/sec
                 0      cpu-migrations                   #    0.000 /sec
               100      page-faults                      #    5.031 /sec
    62,051,572,900      cycles                           #    3.122 GHz                         (38.44%)
    12,305,915,977      instructions                     #    0.20  insn per cycle              (46.07%)
     1,770,935,151      branches                         #   89.102 M/sec                       (46.19%)
        20,833,461      branch-misses                    #    1.18% of all branches             (46.14%)
     2,886,437,444      L1-dcache-loads                  #  145.227 M/sec                       (46.08%)
        62,673,587      L1-dcache-load-misses            #    2.17% of all L1-dcache accesses   (46.16%)
        36,380,110      LLC-loads                        #    1.830 M/sec                       (30.73%)
        36,335,895      LLC-load-misses                  #   99.88% of all L1-icache accesses   (30.80%)
   <not supported>      L1-icache-loads
        14,122,757      L1-icache-load-misses                                                   (30.91%)
     2,893,359,729      dTLB-loads                       #  145.575 M/sec                       (30.77%)
        15,591,131      dTLB-load-misses                 #    0.54% of all dTLB cache accesses  (30.84%)
           154,787      iTLB-loads                       #    7.788 K/sec                       (30.91%)
        30,148,243      iTLB-load-misses                 # 19477.24% of all iTLB cache accesses (30.72%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

      10.010814854 seconds time elapsed

      11.662038000 seconds user
       8.215249000 seconds sys
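For what it's worth, one way to test whether the two threads are fighting over a userspace lock (a guess on my part, not a confirmed diagnosis) is to count syscalls: a contended glibc mutex falls back to the futex syscall, so heavy futex traffic would point at lock contention rather than a scheduler or cache problem. A sketch, assuming ./perf is the binary built from the code above:

```shell
# Run the test binary for a few seconds under strace and print a per-syscall
# summary; a very large futex count would indicate mutex contention.
timeout 5 strace -f -c ./perf 2 1000
```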

I find this quite peculiar; I don't see such a slowdown on gcloud or AWS EC2 instances. It seems to defeat the whole purpose of threads.

Does anyone have any insights?


