I can easily explain this, having worked in this space. The new languages don’t actually solve any urgent problems.
How people imagine scalable parallelism works and how it actually works don’t have a lot of overlap. The code is often boringly single-threaded because that is optimal for performance.
The single biggest resource limit in most HPC code is memory bandwidth. If you are not addressing this then you are not addressing a real problem for most applications. For better or worse, C++ is really good at optimizing for memory bandwidth. Most of the suggested alternative languages are not.
It is that simple. The new languages address irrelevant problems. It is really difficult to design a language that is more friendly to memory bandwidth than C++. And that is the resource you desperately need to optimize for in most cases.
If you think C++ is the best here, then I don't think you've actually worked in this space nor appreciated the actual problems these languages try to solve. In particular because you can't program accelerators with C++.
Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (that must be handled manually in performant GPU code, for instance) with optimized memory tiling and other strategies. Kernel fusion is another nontrivial example that improves effective bandwidth.
Adding on the diversity of hardware that one needs to target (both within and among vendors), i.e., portability not just of function but of performance, makes the need for better tooling abundantly obvious. C++ isn't even an entrant in this space.
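To make the kernel fusion point concrete, here is a toy sketch in plain Python (not real GPU code; the function names are made up for illustration). The unfused version writes an intermediate array and reads it back, while the fused version touches each input element once, which is where the effective-bandwidth gain comes from.

```python
def unfused(xs):
    """Two separate 'kernels': the intermediate list makes a round trip through memory."""
    tmp = [x * 2.0 for x in xs]        # pass 1: read xs, write tmp
    return [t + 1.0 for t in tmp]      # pass 2: read tmp, write result


def fused(xs):
    """One fused 'kernel': each element is read once and written once."""
    return [x * 2.0 + 1.0 for x in xs]


def elements_moved(n, is_fused):
    """Rough count of element reads plus writes for n inputs (illustrative model only)."""
    return 2 * n if is_fused else 4 * n
```

Under this simple model the fused loop moves half as many elements for the same arithmetic; a real compiler or DSL has to prove the fusion is legal before doing it.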
Wait what!? I have been programming CUDA since 2009 and specifically remember it being pushed to C++ as main development language for the first few years, after a brief "CUDA C extension" period.
Parent talks about new languages; as per the article, Fortran and C are doing fine. I speculate that the benefit of C++ over Rust is how it lets programmers give the compiler guarantees that go beyond the initial semantics of the language. See __restrict, __builtin_prefetch and __builtin_assume_aligned. The programming language is a space for conversations between compiler builders and hardware designers.
I believe __restrict and __builtin_prefetch/__builtin_assume are compiler extensions, not part of the C++ language as is, and different compilers implement them differently (or not at all).
Maybe some time in the future good, acceptable abstractions will be conceived for them. Perhaps just using nightly builds for HPC is not that far out, though.
I'm pretty interested in realtime computing and didn't realise C++ was considered bandwidth efficient! Coming from C, I find myself avoiding most 'new' C++ features because I can't easily figure out how they allocate without grabbing a memory profiler.
There is unless using a llvm compiler that does naive things with code motion.
Rust is typically slowest (often negligible <3%), C++ has better CUDA support, and C can be heavily optimized with inline assembly (very unforgiving to juniors.)
> C++ is really good at optimizing for memory bandwidth
In general, most modern CPU thread-safe code is still a bodge in most languages. If folks are unfortunate enough to encounter inseparable overlapping state sub-problems, then there is no magic pixie dust to escape the computational cost. On average, attempting to parallelize this type of code can end up >30% slower on identical hardware, and a GPU memory copy exchange can make it even worse.
Sometimes even compared to a large multi-core CPU, a pinned-core higher clock-speed chip will win out for those types of problems.
Thus, it is no mystery why most people revert to batching k copies of a single-core-bound, non-parallel version of a program: it reduces latency, stalls, cache thrashing, I/O saturation, and interprocess communication costs.
Exchange costs only balloon higher across networks, as however fast the cluster partition claims to be... the physics is still going to impose space-time constraints, as modern data-centers will spend >15% of energy cost just moving stuff around networks for lower efficiency code.
I like languages like Julia, as it implicitly abstracts the broadcast operator to handle which areas may be cleanly unrolled. However, much like Erlang/Elixir the multi-host parallelization is not cleanly implemented... yet...
The core problem with HPC software has always been that academics are best modeled like hermit crabs with facilities. Once a lucky individual inherits a nice new shell, the pincers come out to all smaller entities who may approach with competing interests.
Best of luck, =3
"Crabs Trade Shells in the Strangest Way | BBC Earth"
All these fancy HPC languages are all nice and dandy, but the hard reality I see on our cluster is that most of the work is done in Python, R and even Perl and awk. MPI barely reached us and people still prefer huge single machines to proper distributed computing. Yeah, bioinformatics is from another planet.
Bioinformatics is an outlier within HPC. It's less about numerical computing and more about processing string data with weird algorithms and data structures that are rarely used anywhere else.
Distributed computing never really took off in bioinformatics, because most tasks are conveniently small. For example, a human genome is small enough that you can run most tasks involving a single genome on an average cost-effective server in a reasonable time. And that was already true 10–15 years ago. And if you have a lot of data, it usually means that you have many independent tasks.
Which is nice from the perspective of a tool developer. You don't have to deal with the bureaucracy of distributed computing, as it's the user's responsibility.
C++ is popular for developing bioinformatics tools. Some core tools are written in C, but actual C developers are rare. And Rust has become popular with new projects — to the extent that I haven't really seen C++20 or newer in the field.
To add on this, what I see gaining traction are "workflow managers", tools that let people specify flow of data through various tools. These can figure out how to parallelize things on their own so users are not burdened with this task.
So from what I see, the actual programming language doesn't matter as much as how the work is organized. Anything helping people simplify this task is of immediate benefit to the science.
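As a rough illustration of what such a workflow manager does under the hood, here is a minimal sketch using Python's standard `graphlib` module. The step names (`trim_a`, `align_a`, and so on) are hypothetical, and real tools add scheduling, caching and retries on top; this only shows how a dependency graph yields batches of steps that could run in parallel.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group steps into batches; every step in a batch has all its inputs ready.

    deps maps each step name to the set of steps it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                  # mark the whole batch as finished
    return batches


# Hypothetical two-sample pipeline: trim -> align per sample, then a joint step.
pipeline = {
    "align_a": {"trim_a"},
    "align_b": {"trim_b"},
    "call_variants": {"align_a", "align_b"},
}
```

Calling `parallel_batches(pipeline)` groups the two trim steps together, then the two align steps, then the joint step, without the user ever writing parallel code.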
Perhaps one issue lacking discussion in the article is how easy it is to find devs?
I've never worked in HPC but it seems it should be relatively simple to find a C/C++ dev that can pick up OpenMP, or one that already knows it, compared to hiring people who know Chapel.
The "scaling down" factor (how easy or interesting it is to use tool X for small use) seems a disadvantage of HPC-only languages, which creates a barrier to entry and a reduction in available workforce.
I think HPC devs need an extra set of skills that are not so common, such as parallel file systems, batch schedulers, NUMA, InfiniBand, and probably some domain-specific knowledge for the apps they will develop. This knowledge is also probably a bit niche, like climate modelling, earthquake simulation, lidar data processing, and so on.
And even knowing OpenMP or MPI may not suffice if the site uses older versions or heterogeneous approaches with CUDA, FPGA, etc. Knowing the language and the shared/distributed memory libs helps, but if your project needs a new senior dev then it may be a bit hard to find one (although popularity of company/HPC, salary, and location also play a role).
You tend to only learn these things as they become a problem too. That's super super domain specific and it doesn't always translate between areas of research.
For example, when I did HPC simulation codes in magnetics, there was little point focusing on some of these areas because our codes were dominated by the long-range interaction cost, which limited compute scaling. All of our effort was tuning those algorithms to the absolute max. We tried heterogeneous CPU + GPU but had very mixed results, and at that time (2010s) the GPU memory wasn't large enough for the problems we cared about either.
I then moved to CFD in industry. The concerns there were totally different since everything is grid local. Partitioning over multi-GPU is simple since only the boundaries need to be exchanged on each iteration. The problems there were much more on the memory bandwidth and parallel file system performance side.
Basically, you have to learn to solve whatever challenges get thrown up by the specific domain problem.
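The "only the boundaries need to be exchanged" point for grid-local codes can be sketched with a toy 1D stencil in plain Python. This is an illustration under simplifying assumptions (two partitions, one halo cell per side, hypothetical function names), not how any particular CFD code does it.

```python
def jacobi_step(u):
    """One 1D averaging step; the two end points are held fixed."""
    interior = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def partitioned_step(u, split):
    """Same step on two partitions that exchange a single halo cell each.

    Assumes 1 <= split <= len(u) - 1 so both partitions are non-empty.
    """
    left, right = u[:split], u[split:]
    # Halo exchange: each side receives only its neighbour's boundary value.
    left_ext = left + [right[0]]
    right_ext = [left[-1]] + right
    new_left = jacobi_step(left_ext)[:-1]    # drop the ghost cell again
    new_right = jacobi_step(right_ext)[1:]
    return new_left + new_right
```

Both versions give the same result, but the partitioned one communicates two values per step regardless of grid size, which is why the remaining cost shifts to memory bandwidth inside each partition.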
> And even knowing OpenMP or MPI may not suffice if the site uses older versions
To be fair, you always have the option of compiling yourself, but most people I met in academia didn't have the background to do this. Spack and EasyBuild make this much much easier.
I worked in HPC adjacent fields for a while (up until 40gig ethernet was cheap enough to roll out to all the edge nodes)
There are a couple of big things that are difficult to get your head around:
1) When and where to dispatch and split jobs (i.e. what's the setup cost of spinning up n binaries on n machines vs. threading on y machines).
2) Data exchange primitives. Shared file systems have quirks, and they differ from system to system. But most of the time it's better/easier/faster to dump shit to a file system than to some fancy database/object store. Until it's not. Distributed queues are great, unless you're using them wrong. Most of the time you need to use them wrong. (Shared-memory RPC is a whole other beast that fortunately I've never had to work with directly.)
3) Dealing with odd failures. As the number of parallel jobs increases, the chance of getting at least one failure approaches 1. You need to bake in failure modes at the very start.
4) Loading/saving data is often a bottleneck. Lots of efficiency comes from being clever in what you load, and _where_ you load it (i.e. you have data affinity, which might be location based or topology based, and you don't often have control over where your stuff is placed).
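Point 3 is easy to put numbers on. A minimal sketch, assuming jobs fail independently with the same probability (real clusters have correlated failures, so treat this as a rough guide; the function name is made up):

```python
def p_any_failure(p_single, n_jobs):
    """Probability that at least one of n_jobs independent jobs fails."""
    return 1.0 - (1.0 - p_single) ** n_jobs


# With a 0.1% per-job failure rate, 1000 jobs already see a failure
# roughly 63% of the time, and 10000 jobs almost always do.
for n in (1, 100, 1000, 10000):
    print(n, round(p_any_failure(0.001, n), 4))
```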
It's interesting that none of the actor-based languages ever made it into this space. Feels like something with the design philosophy of Erlang would be pretty suitable to exploit millions of cores and a variety of interconnects...
> we have failed to broadly adopt any new compiled programming languages for HPC
The article neglects that all of C, C++, and Fortran have evolved over the last 30 years.
Also, you'll find significant advances in the HPC library ecosystem over the trailing years. Consider, for example, Trilinos (https://trilinos.github.io/index.html) or Dakota (https://dakota.sandia.gov/about-dakota/) both of which push a ton of domain-agnostic capabilities into a C++ library instead of bolting them into a bespoke language. Communities of users tend to coalesce around shared libraries not creating new languages.
As someone who worked for a while and still works in HPC, my impression from this field as compared to eg. programming in finance sector or programming for storage sector is that... HPC is so backwards and far behind, it's really amazing how it's portrayed as some sort of a champion of the field.
That's not to say that new things don't happen there, it's just that I find a lot of old stuff that was shown to be bad decades ago still being in vogue in HPC. Probably because it's a relatively small field with a lot of people there being academics and not a lot of migration to/from other fields.
You've probably never heard of `module` (either Tcl or Lmod). This is a staple of the HPC world. What this thing does is source shell variables and functions into (or try to remove them from) the shell used either interactively or by a batch job. This is a beyond atrocious way to handle your working environment. The information leaks, becomes stale, and you often end up loading the wrong thing into your environment. It's simply amazing how bad this thing is. And yet, it's just everywhere in HPC.
Another example: running anything in HPC basically means running Slurm batch jobs. There are alternatives, but those are even worse (e.g. OpenPBS). When you dig into the configuration of these tools, you realize they've been written for pre-systemd Linux and are held together by a shoestring of shell scripting. They seldom, if ever, do the right thing when it comes to logging or general integration with the environment they run in. They can be simultaneously on the bleeding edge (e.g. cgroup integration or accelerator driver integration) and completely backwards when it comes to having a sensible service definition for systemd (e.g. they try to manage their service dependencies on their own instead of relying on systemd to do that for them).
In other words, imagine a steam-punk world, but now it's in software. That's sort of what HPC feels like after a decade or so in more popular programming fields.
Also, a lot of code written for HPC is written the way it is not because the writer chose the language or the environment. The typical setup is: university IT created a cluster with whatever tools they managed to put there eons ago, and you, the code writer, have to deal with... using CentOS6 by authenticating to university's AD... in your browser... through JupyterLab interface. And there's nothing you can do about it because the IT isn't there, is incompetent to the bone and as long as you can get your work done somehow, you'd prefer that over fighting to perfect your toolchain.
Bottom line, unless a language somehow becomes indispensable in this world, no matter its advantages, it's not going to be used because of the huge inertia and general unwillingness to do anything beyond the minimum.
HPCs never loved the inefficiencies of anything virtualized (VMs or any containers really), so the shell hacks of module enabled a (limited, but workable) level of reproducibility that was sufficiently composable and usable by researchers who understood the shell. I am not going to defend this Tcl hack any further, but I can see how it was the path of least resistance when people tried to stay close to the raw metal of their large clusters while keeping some level of sanity. Slurm is a more defensible choice, but I agree that these tools are from a different era of compute. I grew to love and hate these tools, but they definitely represent an acquired taste, like a durian fruit; not like an apple.
How should it be better? Most environments offer Apptainer, which can import Docker containers. Plus a lot of these languages like Julia and Chapel are pretty self-contained and programmed against e.g. ancient libc for these very reasons.
There has been a very big adoption of ENGLISH as a programming language in the last year or so, and, painful as it sounds, AI is already generating machine code without compilers, so let's see where we are in 2030.
The Rust compiler actually has similar things, but they're not available in stable builds. I suppose there are some issues of principle that keep them out of stable. E.g.: https://doc.rust-lang.org/std/intrinsics/fn.prefetch_read_da...
Maybe some time in the future good, acceptable abstractions will be conceived for them. Perhaps just using nightly builds for HPC is not that far out, though.
Rust is typically slowest (often negligible <3%), C++ has better CUDA support, and C can be heavily optimized with inline assembly (very unforgiving to juniors.)
Also, heavily associated with coding style =3
https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
Even with HDL defined accelerators, that statement may not mean what people assume. =3
https://en.wikipedia.org/wiki/Latency_(engineering)
https://en.wikipedia.org/wiki/Clock_domain_crossing
https://en.wikipedia.org/wiki/Metastability_(electronics)
https://www.youtube.com/watch?v=G2y8Sx4B2Sk
"Crabs Trade Shells in the Strangest Way | BBC Earth"
https://www.youtube.com/watch?v=f1dnocPQXDQ
And Erlang has already run many telecom infrastructures for decades. Surprising given how fragile the multi-host implementation has proven.
Erlang/Elixir are neat languages, and right next to Julia for fun. =3
Your centos6 references made me chuckle :-)