The Tragedy of Lost Efficiency

The mere thought of the amount of energy wasted by computers running precompiled binaries that are not optimized for the processors they run on keeps me up at night. There needs to be a better solution than just compiling on-site.

Potential Solutions

I see two problems each with two potential solutions here:

How many build choices will we supply?

For every optimization level combination possible.
Only for Processor Supplement ABIs.

In an ideal world we would have a build for every possible combination of optimization levels, but in the real world that is just not practically viable. In April 2021, Arch Linux merged an RFC to provide x86-64-v3 feature level builds; the RFC also discusses the problems that will be encountered along the way. The efforts to make the RFC a reality are still ongoing.

How will the optimized code be loaded?

Every build will be compiled per optimization level and be provided as a different download.
The main application will be compiled normally but the processor intensive part will be compiled optimized per optimization level and stored in different shared objects.

Let's do an experiment to see how hard the second option would be.

The Experiment

I'll create a basic function to export later. Let's call this file libshared.c for future reference.

#include <stdio.h>
#include <math.h>
 
int fun_had(int in1) {
    int ret = in1;
    ret = ret * ret - 1;
    ret = pow(ret, 2);
    return ret;
}

In the build script I'll build libshared.c in both unoptimized and optimized forms, then convert both builds into shared objects. Let's call this file build.sh for future reference.

rm -r libsharednormal.o libsharedoptimized.o a.out binary libsharednormal.so libsharedoptimized.so 2>/dev/null

gcc -c -Wall -Werror -fpic libshared.c -o libsharednormal.o
gcc -c -Wall -Werror -fpic -march=native -mtune=native -O3 libshared.c -o libsharedoptimized.o

echo $(md5sum libsharednormal.o)
echo $(md5sum libsharedoptimized.o)

gcc -shared -o libsharednormal.so libsharednormal.o -lm
gcc -shared -o libsharedoptimized.so libsharedoptimized.o -lm

rm -r libsharednormal.o
rm -r libsharedoptimized.o

gcc -Wall -O0 -o binary main.c -ldl # build the binary normally, no optimizations; -ldl is required on glibc older than 2.34

Next, I'll write the main program that calls into the shared objects. Let's call this file main.c for future reference.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dlfcn.h>
#include <sys/time.h>

int (*fun_had)(int); // function pointer; it is set once the shared object is loaded
struct timeval start, stop;

int main(int argc, char *argv[]) {
    if(argc < 2) return -1;
    int num = strtol(argv[1], NULL, 10); // argv[1] is now stored in num
    
    char toLoad[256] = "";
    __builtin_cpu_init();
    if(__builtin_cpu_supports("avx2")) strcat(toLoad, "./libsharedoptimized.so");
    else strcat(toLoad, "./libsharednormal.so");
    printf("\nlib to load: %s\n", toLoad);
    
    void *handler_dl = dlopen(toLoad, RTLD_NOW);
    if(!handler_dl) { 
        printf("dlopen error: %s\n", dlerror()); 
        return -1; 
    }
    fun_had = dlsym(handler_dl, "fun_had"); // the actual function is now loaded
	
    gettimeofday(&start, NULL);
    int final = 0;
    for(int i = 0; i < 100000; i++) final = fun_had(num);
    printf("%i\n", final);
    gettimeofday(&stop, NULL);
    printf("function took %lu us\n", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
    dlclose(handler_dl);
    
    printf("\n");
    printf("loading unoptimized function for testing purposes\n");
    handler_dl = dlopen("./libsharednormal.so", RTLD_NOW); // let's load the unoptimized function for testing
    if(!handler_dl) { 
        printf("dlopen error: %s\n", dlerror()); 
        return -1; 
    }
    fun_had = dlsym(handler_dl, "fun_had");
	
    gettimeofday(&start, NULL);
    int final2 = 0;
    for(int i = 0; i < 100000; i++) final2 = fun_had(num);
    printf("%i\n", final2);
    gettimeofday(&stop, NULL);
    printf("function took %lu us\n", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
    dlclose(handler_dl);
    
    return 0;
}

Most of this code is testing boilerplate so some explanation might be necessary here.

I'll go over the shared object loading first:

#include <dlfcn.h>
int (*fun_had)(int);

void *handler_dl = dlopen("./libshared.so", RTLD_NOW);
if(!handler_dl) { 
    printf("dlopen error: %s\n", dlerror()); 
    return -1; 
}
fun_had = dlsym(handler_dl, "fun_had");

This snippet declares a function pointer, opens the shared object file, and then points the function pointer at the function exported from the shared library.
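The snippet above does not check the result of dlsym. As a minimal sketch, assuming a hypothetical helper named load_symbol (it is not part of main.c), the same open-and-look-up steps with both error paths handled could look like this:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper, not part of the original main.c.
 * Opens the shared object at `path` and looks up `symbol` in it.
 * On success, returns the symbol address and stores the handle in
 * *handle_out so the caller can dlclose() it later.
 * On failure, prints the dlerror() message and returns NULL. */
static void *load_symbol(const char *path, const char *symbol, void **handle_out) {
    void *handle = dlopen(path, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen error: %s\n", dlerror());
        return NULL;
    }

    dlerror(); /* clear any stale error before calling dlsym */
    void *sym = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || sym == NULL) {
        fprintf(stderr, "dlsym error: %s\n", err ? err : "symbol resolved to NULL");
        dlclose(handle);
        return NULL;
    }

    *handle_out = handle;
    return sym;
}
```

A call would then look like fun_had = (int (*)(int))load_symbol("./libsharednormal.so", "fun_had", &handler_dl); followed by a NULL check on fun_had.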

Before the above snippet can run, I also need to pick which shared library to load:

char toLoad[256] = "";
__builtin_cpu_init();
if(__builtin_cpu_supports("avx2")) strcat(toLoad, "./libsharedoptimized.so");
else strcat(toLoad, "./libsharednormal.so");

For testing purposes I used GCC's __builtin_cpu_supports to pick between them, with AVX2 support as the deciding factor. There are many other ways to do this, including probing /proc/cpuinfo, but for simplicity's sake I'll go with this.
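For comparison, here is a rough sketch of the /proc/cpuinfo route. It assumes x86 Linux, where the file contains a line starting with "flags", and it assumes that line fits in the buffer. The helper names line_has_flag and cpuinfo_has_flag are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` appears as a whole word in `line`, else 0.
 * Whole-word matching matters: "avx" must not match inside "avx2". */
static int line_has_flag(const char *line, const char *flag) {
    size_t len = strlen(flag);
    if (len == 0) return 0;

    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends) return 1;
        p += len;
    }
    return 0;
}

/* Scans the file at `path` (normally "/proc/cpuinfo") for a "flags" line.
 * Returns 1 if `flag` is listed, 0 if it is not, -1 if the file cannot be opened. */
static int cpuinfo_has_flag(const char *path, const char *flag) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;

    char line[8192]; /* assumption: the flags line is shorter than this */
    int found = 0;
    while (!found && fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0) found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "avx2") would stand in for the __builtin_cpu_supports("avx2") check, at the cost of file parsing and a Linux-only dependency.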

Finally, I need to call the function. For testing purposes, I'll also measure the time it takes to call the function.

#include <sys/time.h>
struct timeval start, stop;

gettimeofday(&start, NULL);
int final = 0;
for(int i = 0; i < 100000; i++) final = fun_had(num);
printf("%i\n", final);
gettimeofday(&stop, NULL);
printf("function took %lu us\n", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
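Note that the printf call sits between the two gettimeofday calls, so its cost is part of the measurement. A hedged alternative, which is not what produced the numbers in this post, is to time only the loop with a monotonic clock. The names time_calls_us and stand_in below are hypothetical; stand_in merely takes the place of the dlsym-loaded fun_had.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Stand-in for the function loaded via dlsym; hypothetical, for illustration only. */
static int stand_in(int x) {
    return x * x - 1;
}

/* Calls fn(arg) `iterations` times and returns the elapsed time in microseconds.
 * Only the loop is timed; printing the result is left to the caller.
 * The last return value of fn is stored in *result_out if it is not NULL. */
static long time_calls_us(int (*fn)(int), int arg, int iterations, int *result_out) {
    struct timespec start, stop;
    volatile int sink = 0; /* volatile so the calls are not optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) sink = fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &stop);

    if (result_out) *result_out = sink;

    long long ns = (long long)(stop.tv_sec - start.tv_sec) * 1000000000LL
                 + (stop.tv_nsec - start.tv_nsec);
    return (long)(ns / 1000);
}
```

In main.c this would be called as time_calls_us(fun_had, num, 100000, &final), with the two printf calls moved after it.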

Now I'll try running it.

$ ./build.sh && ./binary 59
84f6ada061366dc244fb474c5ba50347 libsharednormal.o
83851c036b3a04d53ff016e2cca48cad libsharedoptimized.o

lib to load: ./libsharedoptimized.so
12110400
function took 182 us

loading unoptimized function for testing purposes
12110400
function took 1901 us

The optimized build is about ten times faster here (182 us versus 1901 us), but this approach is not practical at all! The runtime feature level detection code would evolve into spaghetti if we were to use it for real.

For more practical purposes I envision a library that takes care of all of this including mechanisms for building objects for every CPU feature level combination desired and picking between them at runtime.
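To give a rough idea of what the picking part of such a library might look like, here is a table-driven sketch. Everything in it is an assumption made for illustration: the struct, the function names, and the extra file names ./libshared-avx512.so and ./libshared-sse42.so, which the build script above does not produce. Because __builtin_cpu_supports only accepts a string literal, each feature gets its own small wrapper function.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* x86 with GCC-compatible builtins: ask the CPU directly. */
static int has_avx512f(void) { __builtin_cpu_init(); return __builtin_cpu_supports("avx512f") != 0; }
static int has_avx2(void)    { __builtin_cpu_init(); return __builtin_cpu_supports("avx2") != 0; }
static int has_sse42(void)   { __builtin_cpu_init(); return __builtin_cpu_supports("sse4.2") != 0; }
#else
/* Other targets: report nothing, so only the fallback entry matches. */
static int has_avx512f(void) { return 0; }
static int has_avx2(void)    { return 0; }
static int has_sse42(void)   { return 0; }
#endif

static int always(void) { return 1; }

/* One build of the shared object plus the check that decides whether it can run. */
struct variant {
    const char *path;
    int (*supported)(void);
};

/* Ordered from most to least demanding; the last entry is the fallback. */
static const struct variant variants[] = {
    { "./libshared-avx512.so",   has_avx512f }, /* hypothetical file */
    { "./libsharedoptimized.so", has_avx2    },
    { "./libshared-sse42.so",    has_sse42   }, /* hypothetical file */
    { "./libsharednormal.so",    always      },
};

/* Returns the path of the first variant this CPU supports, or NULL if none match. */
static const char *pick_variant(const struct variant *table, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (table[i].supported()) return table[i].path;
    }
    return NULL;
}
```

The result of pick_variant(variants, sizeof variants / sizeof variants[0]) would replace the toLoad buffer in main.c. Adding a new feature level then means adding one wrapper and one table row instead of another if/else branch.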

Steps Already Taken

The GNU C Library has something very similar! They call it glibc-hwcaps and it's very cool! Unfortunately, I'm of the opinion that C libraries should be POSIX only and that stuff like this should be handled at compilation time or at runtime via libraries. In fact, the GNU C Library already does something similar internally, using IFUNC dispatch and tunables to choose between implementations of its own functions. Take a look at its ifunc-impl-list.c for some examples.
You may also be interested in reading about ARM64 ELF hwcaps and POWERPC ELF hwcaps.
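The "ifunc" in ifunc-impl-list.c refers to GNU indirect functions, where the implementation of a function is chosen once at load time. GCC exposes the same mechanism to ordinary code through the target_clones attribute. The sketch below is my own illustration of that attribute, not something glibc-hwcaps does, and it assumes GCC on x86-64 Linux with glibc; on other targets the macro expands to nothing and a single plain version is built. The name fun_had_mv is made up.

```c
/* Assumption: GCC (or a compatible compiler) on x86-64 Linux with IFUNC support.
 * Elsewhere MULTIVERSION expands to nothing and only one version is compiled. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

/* Same arithmetic as fun_had in libshared.c, using integer multiplication
 * instead of pow(). With target_clones the compiler emits one clone per
 * listed target plus a resolver that picks a clone when the program loads. */
MULTIVERSION
int fun_had_mv(int in1) {
    int ret = in1 * in1 - 1;
    return ret * ret;
}
```

Callers simply call fun_had_mv(59), which gives 12110400 like the experiment above, and no dlopen or feature check appears in the calling code.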

On the compiler side of things, GCC has had x86-64-vX feature level support since October 2020. I do not use or follow other compilers so I will not be able to comment on them.

Likewise, there may be other efforts on the distribution side of things but I will not be able to comment on them as I don't use or follow them.

The future looks bright!