Radix sorting is great, as this article points out. But I followed the links and didn't see any mention of the most obvious optimization that makes it even better on modern systems.
If you do MSB for the first pass, then the subsequent passes within each bucket are completely independent :) So... Choose a large radix and do one MSB pass, then parallelize the LSB processing of each bucket. This is fantastic on a GPU. You might think that you would want many thousands of kernels running simultaneously, but in actuality each SM (and the Apple equivalent) has a thread-local cache, so you really only want a couple hundred simultaneous kernels at the most. I've written CUDA and Metal versions and they are fast. And you don't even have to be very good at GPU programming to get there. It's very basic. AoCP has the pseudocode.
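For anyone who doesn't want to dig out AoCP, here is a rough single-threaded sketch of the idea (Python as pseudocode, my own toy version rather than the book's; 32-bit keys, radix 256). After the one MSB pass, each per-bucket sort is completely independent -- that's the part you'd hand to separate threads or GPU blocks:

RADIX_BITS = 8                  # radix 256: bucket by the top byte of a 32-bit key
SHIFT = 32 - RADIX_BITS
MASK = (1 << RADIX_BITS) - 1

def msb_partition(keys):
    # One MSB pass: after this, buckets never interact again.
    buckets = [[] for _ in range(1 << RADIX_BITS)]
    for k in keys:
        buckets[k >> SHIFT].append(k)
    return buckets

def lsd_radix_sort(bucket):
    # Plain LSD passes over the remaining low bytes. Each call touches only
    # its own bucket, so these are the independent, parallelizable pieces.
    for shift in range(0, SHIFT, RADIX_BITS):   # shifts 0, 8, 16
        piles = [[] for _ in range(1 << RADIX_BITS)]
        for k in bucket:
            piles[(k >> shift) & MASK].append(k)
        bucket = [k for pile in piles for k in pile]
    return bucket

def radix_sort(keys):   # keys assumed to be unsigned 32-bit ints
    return [k for b in msb_partition(keys) for k in lsd_radix_sort(b)]

print(radix_sort([3_000_000_000, 42, 7, 1 << 24]))   # [7, 42, 16777216, 3000000000]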
On my 3-year-old laptop, system memory (dual-channel DDR4-3200) delivers about 50 GB/second. You measured 50M elements per second, which at 8 bytes per element translates to 400 MB/second. If your hardware is similar, your implementation reached less than 1% of the theoretical memory bandwidth.
When your data is indeed big and the performance actually matters, consider doing something completely different.
In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
More importantly, this attitude is precisely why software sucks today. "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that. Bad engineering time is _incredibly_ expensive. This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech, which is baffling considering how incredibly broad and deep the industry is. Instead, everyone clamors to learn the new Javascript abstraction that lets them get even further away from the reality of what the magic boxes are doing.
Some apple-pickers think the point of picking apples is to prove how good you are at climbing trees. They'll never not be mad at pluckers who just pluck the nearest apple, and the people who reward them for the apples, and the world that doesn't understand the pluckers picked their apples wrong, and didn't even earn them, and they're not even real apples.
Climbing a ladder takes more time, so in order to get the apples to market faster, the apple tree owners keep the pickers focused on only picking the apples within arm's reach.
The owners also disallow the use of ladders, because that keeps the pool of candidates they can hire bigger.
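And the highest 90% of apples remain unpicked.
The part missing from this analogy is that the low-hanging fruit pickers are slowly killing the trees.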
It's one thing if you know the optimal (or at least approaching it) solution, and deliberately use something else for other reasons. At least thought went into it.
I'm not upset at this simply for purity's sake; it directly impacts me in my day-to-day job. A hundred devs individually decide that good enough is good enough (deliberately or otherwise), and suddenly my infrastructure is dealing with traffic and query patterns it shouldn't have to, and then I have to explain what network bandwidth limits are, and why it's absurd that we hit them in the first place.
In general, what I'm railing against is the continued push towards simplification and abstraction, while not requiring the understanding of what lies beneath. The hyperscalers certainly aren't trying to foster that attitude in-house – someone has to build the abstractions in the first place, after all, and keep datacenters humming. Yet they're happily shoveling it out, because it's a win-win for them: fewer people have the skills necessary to compete (or even to simply not use their services), and more people are able to use their services.
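> it directly impacts me in my day-to-day job
It directly impacts anyone that uses software every day. Most people don't seem to understand and/or care how poorly the software they use runs.
Sir, your comment is poetry. I commend you.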
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech
This doesn't match my observations. In many fields, training is limited and hides the details. We train workers to repeat specific tasks and they excel at them. But they don't have conceptual understanding. Any explanation of why things are done the way they are is surface-level. You can see it when a procedure fails for some reason. The way to deal with it is to 1) do basic checks, 2) try again, and if it still fails, 3) delegate to someone else. Nobody is going to troubleshoot or optimize tasks unless that's their main job.
It happens in construction, in the kitchens, on the assembly lines, and in the offices. It happens because it gets the job done.
You're right. My wording was flat-out wrong, and I apologize. A more accurate sentiment (IMO) would be "that this happens in tech is baffling, considering..."
Life is very different for many people, and I think we just need to build empathy for people who treat a job as just a job. If they deliver and are not unkind about it, there's nothing wrong with not going above and beyond the bare minimum, which is what they are paid for.
As I commented above, a large part of my umbrage stems from the impact these decisions have on my job. I dislike having to work harder because others didn't want to go above the bare minimum.
This isn't unique to any one company, nor my personal experience. At my last job, my team was initially human triaging practically all alerts, because we had the rare ability to read logs. I spoke to someone at Slack once who was stuck doing the same. That's an absurdly low expected level of competence.
> people are so deeply incurious. This seems to only happen in tech
It happens everywhere. If anything, techies are more curious than the average Joe. How many fellow programmers can you nerd-snipe with a comment that makes them say, “Well, that’s strange…”
Agree with the sentiment. However it's hard to stay curious, even harder to stay up-to-date.
I liked fiddling with storage for a while, got really into it, deepened my knowledge about it. A couple years later I realized everything else (networking, architectures, languages) had developed so much that most of my (non-basic) knowledge was obsolete. Picking up where I left off with all those technologies is incredibly hard and has caused fatigue.
Now I'm at a point where I have the feeling I don't know anything about anything. It's factually not true, but that's what my gut tells me. If I were younger, this would trigger a lot of anxiety. Thankfully I can handle it by now.
That's understandable. I'm into databases (both professionally and personally), so I get to touch a wide variety of things. Following a query the whole way is pretty neat. Typed query --> client --> wire protocol --> network --> server --> DB frontend --> DB planner --> DB storage engine --> OS --> Memory / Disk. `blktrace` is super interesting to watch commands being sent to the disk.
When you are deeply involved in tech both personally and professionally, you are probably very passionate about it, and it makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Tech is also one of the (if not the) most dynamic and fast evolving field a normal person will ever touch. Curiosity in tech can drain every single bit of free time and energy you have and you will hardly keep up with the progress, maybe barely scratch the surface. But people's available free time and energy wanes and curiosity is a collateral victim.
I've painfully gone through the entire cycle of this, including the bit of resurgence later on when you have a combination of free time but less energy. What I can say is that this absolutely does not happen just in tech. If anything tech is flooded with people with more curiosity than almost any other field.
> When you are deeply involved in tech both personally and professionally, you are probably very passionate about it, and it makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Good point. I commented in a sibling post to the same effect.
I've certainly felt the personal strain of time sink and procrastination in my homelab. It's currently running k3os, which has been dead for about four years now, because everything I want is still running, and I never seem to have the motivation on the weekend to yell at my laptop when I could be playing board games.
> including the bit of resurgence later on when you have a combination of free time but less energy.
I'm guessing that will happen in another decade or so, when my kids are grown.
I can somewhat echo some of the statements here and provide my own experience that is similar.
I spend a decent amount of time writing decent C++ code. My friends in other parts of the industry are writing certainly better C++ code than me because they are doing it in environments that are more constricted in various ways. In either case, I do spend my time catching up a bit and would consider myself a competent C++21 programmer in some ways.
My experience and my conversations lead me to understand there is so much left on the table with even the most basic implementations. When I implement it correctly in C++ we get close to some of the theoretical limits for the hardware for some workloads, compared to something that is literally 1% as fast running in NodeJS.
With that said, for so many situations I cannot justify the time and complexity of using C++ for many projects. At least for the stage most projects are in. In theory this optimization can happen later, but it never really does, because the horizontal (or sometimes even vertical) scaling kicks in and we're all just willing to throw a few more dollars at the problem instead of re-engineering it. Sure, some of the really big companies like Netflix find a decent reason from time to time to invest the engineering time to squeeze out those numbers, but it's becoming the exception and not the rule.
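>C++21
There's C++20 and C++23. No C++21.
Sorry, I meant to type C++11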
It’s hard to learn the ins and outs of dozens of programming languages. One doesn’t usually use just one or two languages over their entire career. I have worked professionally with at least PHP, Java, Python, JS, Go, and Ruby. That’s without taking into account the respective libraries and frameworks (and without taking into account the myriad of other components like DBs, web servers, caches, etc.)
It sounds like an excuse, I know. The truth is I just don’t have that much time.
> In the real world, your application is running in a container among hundreds or thousands of other containers
I mean, that’s an engineering decision too. In my day job we’re capturing, pre-processing, running inference on, and post-processing about 500Mpx/s worth of live image data at about 80ms/frame end-to-end at the edge. The processor SoM costs about $3000/unit and uses about 50W running flat out. The retail cost of our overall product is two orders of magnitude more than what the processor is worth but it incurs zero recurring costs for us.
Edit: and it’s got 64GB of Unified RAM that I’ve got all to myself :)
I was wondering if someone from a different sub-industry would disagree here :D
That sounds like a very interesting job, with quite different requirements and constraints from what I'm used to. One day, I'll get a job where application latency is critical, and optimizations matter deeply. Undoubtedly I'll find something else to be upset about, but at least it'll be a new complaint.
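> Undoubtedly I’ll find something else to be upset about
Vendor SDKs. You’ll be upset about that, I guarantee :)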
> "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that.
It is. It's absurdly cheap. I make sure to weigh the amount of time it would take me to make a performance improvement against the runtime cost of my functions. It's rarely worth the extra effort.
Seriously, until you get into the millions of records per second level, you're almost never benefited. You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
> Bad engineering time is _incredibly_ expensive.
Engineering time is expensive. Period. It speaks to the need to minimize it.
> This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it. Otherwise you end up with an obscure mess that you have to unwind 5 years of context to understand and fix again.
Complexity and available mental contexts are forgotten costs. If your language even has that many "ins and outs" to begin with you may want to reconsider that.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
"You" is both singular and plural, which is often the problem with this thinking.
Is it worth spending a month of engineering time to make a page load in 50ms instead of 2s? Seems like a lot of engineering time for a noticeable but somewhat minor improvement.
But now, what if you have a million users who do this operation 100x/day? Absolutely worth it!
For example, I sure wish Atlassian would spend a tiny bit of effort on making Jira faster. Even if it's only 1 second per ticket, since I'm viewing 100+ tickets per day, that adds up. And there are many hundreds of us at the company doing the same thing, so it really adds up.
Nit: 50ms vs 2000ms is a 40x speed increase, i.e. ~1.5 orders of magnitude.
I still remember the words of my database optimization lecturer, who said that in his experience optimizations below 1 OOM aren’t worth it, and most „good ones” are 3+.
> Absolutely worth it!
Long reaching assumption. Even the biggest companies have limited resources (even if vast). Would you rather improve load times by 2x (from 500ms to 250ms) or improve checkout reliability from 99% to 99.5%? And there is much more to consider on some levels (e.g. planning for thermal efficiency is fun).
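Software development is always a game of choice.
Most of the time they just move the expensive processing to the user's browser so they don't have to pay for it :)
> 50ms instead of 2s
In the past I believe Google was very adamant that page load time perception was very important to other metrics.
You're probably not going to achieve that with the kind of optimization described in this article, though.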
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
I'm not talking about increased complexity, I'm talking about extremely basic things that take zero extra time, like using the correct data structure. For example, in Python:
In [8]: import sys, math, random
   ...: from array import array
   ...: from collections import deque
   ...: a = array("i", (x for x in range(1_000_000)))
   ...: l = [x for x in range(1_000_000)]
   ...: d = deque(l)
   ...: for x in (a, l, d):
   ...:     print(f"{sys.getsizeof(x) / 2**20} MiB")
   ...:
3.902385711669922 MiB
8.057334899902344 MiB
7.868537902832031 MiB
Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
Or knowing that `random.randint` is remarkably slow compared to `random.random()`, which can matter in a hot loop:
In [10]: %timeit math.floor(random.random() * 1_000_000)
31.9 ns ± 0.138 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [11]: %timeit random.randint(0, 1_000_000)
163 ns ± 0.653 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
> All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it.
With the exception of list comprehension over large ranges slowing down from 3.11 --> now, I don't think there's been much in Python that's become dramatically worse such that you would need to refactor it later (I gather the Javascript community does this ritual every quarter or so). Anything being deprecated has years of warning.
Yes, but if you want to do things in a less obvious way, you should be aware of the downsides, such as bias in your random numbers. You also need to watch out for off-by-one errors.
I've stolen the number that shows this off well from a bug report somewhere:
from collections import Counter
from math import floor
from random import randint, random

random_counter = Counter()
for i in range(10_000_000):
    result = floor(random() * 6755399441055744) % 3
    random_counter[result] += 1
print("floor method", random_counter.most_common(3))

randint_counter = Counter()
for i in range(10_000_000):
    result = randint(0, 6755399441055743) % 3
    randint_counter[result] += 1
print("randint method", randint_counter.most_common(3))
> Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
That is obvious when you actually check the access speed of arrays and find out it is about half that of lists on small integers (under 256), and worse on non-small integers. That is literally the opposite trade off of what you want in 99.99% of cases.
Deques are even less of a consideration, they’re unrolled linked lists so random access is impossible and iteration is slower, you use a deque when you need a deque (or at least a fifo), aka when you need to routinely manipulate the head of the collection.
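That's easy to check for yourself. A rough micro-benchmark along these lines (made-up sizes; absolute numbers vary by machine and CPython version) shows the boxing cost on reads:

from array import array
import random
import timeit

l = list(range(1_000_000))
a = array("i", l)
idx = [random.randrange(1_000_000) for _ in range(1_000)]

def read_all(seq):
    for i in idx:
        seq[i]   # array.__getitem__ must box a fresh int; the list just hands back a reference

print("list :", timeit.timeit(lambda: read_all(l), number=5_000))
print("array:", timeit.timeit(lambda: read_all(a), number=5_000))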
It depends on your constraints. If you’re limited by RAM, arrays make a lot of sense for certain applications. If you need Python’s buffer protocol, again, they make a lot of sense.
As to deques, yes, they have specific uses, and being slightly smaller isn’t usually a selling point for them. My point was that I have seen many cases where an incorrect data structure was used, because a list or dict was “good enough.” And sure, they generally are, but if the language ships with other options, why wouldn’t you explore those?
Python is 100% the wrong language to worry about this in. If your hot loops are in python and you care about performance, you should be rewriting them in another language.
Agreed; I used it partially because TFA used it to demonstrate ideas, and partially because I’m very familiar with it.
But you’re correct, of course. When I need something to go faster in Python, I write it in C. If it’s more than a small section, then a full rewrite is reasonable.
Ok, but neither of these matter until you know they matter. Seriously. Like, yes, it's nice they exist and that they are available for when you want them, but I would generally advise people to use a list or random.randint, if only because I value their confidence with them over the 2x performance win, because most workloads are not simply just a single array or random number generator loop. And, to be clear, I work on performance professionally: most of my job is not making things as fast as possible, but considering the tradeoffs that go into writing that code. I understand your example as showing off an interesting performance story but in the real world most workloads are more complex than what can be solved with using a rare but drop-in API.
This will need to happen 7.6 million times to save me 1 CPU second. On AWS lambda with 1GB of memory this will cost you a whopping: $0.0000166667.
The point is, you're not even wrong, but there are vanishingly few cases where it would actually matter to the bottom line in practice. You're taking an absolutist point of view to a discipline which thoroughly rejects it.
This is what I love about the cloud. It forces you to confront what your efforts are actually worth by placing a specific value on all of these commodities. In my experience they're often worth very little given that none of us have the scale of problems where this would show actual returns.
Reaching the scale where it shows actual returns isn't all that difficult. You need it to happen 7.6 million times to save 1 CPU second, but each CPU core can execute it nearly that many times every second.
Probably you don't leave it generating only random numbers all day, but suppose you do generate a good few, so that it's 1% of your total compute budget, and you have only a modest load, using on average four CPU cores at any given time. Then saving that amount of computation will have saved you something like $15/year in compute, recurring. Which isn't actually that bad a return for ten seconds worth of choosing the right function.
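Back-of-envelope, using the timings from upthread and a made-up per-core price, so purely illustrative:

ns_saved_per_call = 163 - 32     # randint vs floor(random() * n), timings from upthread
cores_in_use = 4.0               # average cores busy, per the example above
share_spent_on_rng = 0.01        # 1% of the compute budget spent generating randoms
dollars_per_core_year = 450      # made-up on-demand price for one core-year

cores_saved = cores_in_use * share_spent_on_rng * (ns_saved_per_call / 163)
print(f"~${cores_saved * dollars_per_core_year:.0f}/year, recurring")   # ~$14/year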
There are also a lot of even fairly small entities for which four cores is peanuts and they're running a hundred or a thousand at once, which quickly turns up the price.
And even the things with internet scale aren't all that rare. Suppose you're making a contribution to the mainline Linux kernel. It will run on billions of devices, possibly for decades. Even if it doesn't run very often, that's still a lot of cycles, and some of the kernel code does run very often. Likewise code in popular web browsers, javascript on popular websites or in popular libraries, etc.
You don't have to work for Google to make a contribution to zlib and that kind of stuff has the weight of the world on it.
Sure, but the cumulative effects of pervasive mediocre-to-bad decisions do add up. And it isn't just about cloud compute cost. Your own time is stolen by the slow CI jobs that you inevitably get stuck waiting for. For me, I prioritize my own personal happiness in my work, and this mindset taken too far makes me unhappy.
> until you get into the millions of records per second level, you're almost never benefited
Yeah, but the software landscape is very diverse.
On my job (CAM/CAE) I often handle data structures with gigabytes of data. Worse, unlike in e.g. multimedia frameworks, many algorithms operating on these numbers are global, i.e. they can’t be represented as a pipeline which splits data into independent chunks and processes them sequentially.
Making performance-critical functions twice as fast might save hours of time for a single end user in a single day.
That's good food for thought. My hardware is quite old, but I agree that the numbers seem somewhat suspicious.
I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted. I'm assuming the 50 GB/s throughput is half-duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat closer to my result.
I tried using non-temporal stores in the straightforward algorithm, but to my surprise, this led to a significant decrease of performance across all input lengths.
> When your data is indeed big and the performance actually matters, consider doing something completely different.
I'm not sure what you mean by this. Scaling across machines is just ignoring the problem. What do you mean by "something completely different"?
> I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted.
Ages ago I worked with several different memory controllers, and it depends on the memory controller, cache, and MMU configuration.
Plenty of systems do require cacheline updates, then modifying 8B requires reading one or two cachelines, updating them, and writing them out eventually.
Some caches track cacheline validity with a bit per byte. This enables a CPU to store a byte to memory without fetching the cacheline first. The cache may then try to burst-read that line from the memory controller, but if it doesn't get around to it before deciding to flush that cacheline, it may issue a single-byte write to the controller. The controller can then issue a masked DRAM write to the memory, which will update only certain bytes in the DRAM column. However, this still takes about as long as sending the full cacheline; it just offloads the read-modify.
Validity per byte is also useful to implement hit under miss.
I bet on newer, bigger systems tricks like this are less useful since the memory busses are faster and wider today.
Non-temporal stores will not help you if all you do are those accesses. The point of using them is that you don't want them pulled into the caches and pushing out everything else.
> In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second.
That's almost certainly for (mostly) sequential access.
When you just want a couple bytes here and there, and access isn't pipelined and prefetch doesn't accelerate your use case, the real world bandwidth is going to be significantly less.
I don’t understand Rust but if the code is doing what’s written there, “simple multiplicative hash and perform a simple analysis on the buckets – say, compute the sum of minimums among buckets” it’s possible to improve substantially.
They don’t need to move these megabytes of elements between collections. I would split the input across a few CPU cores, compute per-bucket minimums in parallel on each core, when all completed aggregate minimums across all cores, then compute sum of the results.
Multiplicative hash and the final sum should be possible to vectorize on most computers. Updating per-bucket minimums is probably impossible to vectorize (technically the AVX-512 set has the required instructions, but I'm not sure they're efficient), but there are far fewer buckets than input numbers, which means the arrays of per-bucket minimums are likely to fit in caches.
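Something like this numpy sketch is the shape of what I mean (made-up hash constant and bucket count, and sequential where you'd actually fan out to cores -- an illustration, not a tuned implementation):

import numpy as np

N_BUCKETS = 1 << 16                         # made-up bucket count
MULT = np.uint64(0x9E3779B97F4A7C15)        # made-up multiplicative-hash constant
EMPTY = np.uint64(np.iinfo(np.uint64).max)  # sentinel: nothing hashed into this bucket yet

def chunk_bucket_minimums(chunk):
    # One streaming pass over a slice of the input. The mins array is tiny
    # (N_BUCKETS entries) and stays cache-resident; the input itself is never
    # shuffled or copied between collections.
    buckets = ((chunk * MULT) >> np.uint64(48)).astype(np.intp)
    mins = np.full(N_BUCKETS, EMPTY)
    np.minimum.at(mins, buckets, chunk)
    return mins

def sum_of_bucket_minimums(data, n_chunks=8):
    # Each chunk is independent -- this map step is what you'd hand to separate
    # cores; it's sequential here only to keep the sketch short.
    partial = [chunk_bucket_minimums(c) for c in np.array_split(data, n_chunks)]
    mins = np.minimum.reduce(partial)       # aggregate across chunks
    return mins[mins != EMPTY].sum()

data = np.random.default_rng(0).integers(0, 2**32, size=1_000_000, dtype=np.uint64)
print(sum_of_bucket_minimums(data))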
Isn't the point of the person you replied to that the article's author wasn't able to eliminate latency? Because if they had, they'd be constrained by throughput, and they aren't.
To/from what sort of devices could the GPU read or write at 1TB/sec, besides main memory?
The fastest consumer SSDs top out at several GB/sec (I guess with massive hardware RAID they could be faster, but not sure if they’d be 1TB/sec fast).
Even a network adapter that does 10 Gbit/sec is only recently becoming slightly more common for the average consumer. Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Consumer? Even in the "if-you-need-to-ask-what-it-costs-you-can't-afford-it" world of frontier HPC systems, we're only getting teasers of 0.8 Tbit/sec NICs this year.
As you say, only the GPU, maybe RAM, and the NIC will be able to churn data at these speeds. There is a reason why Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU and NIC can talk directly to each other.
> Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into? Your little home ATT fiber connection isn’t going to come close to using that available bandwidth.
Main memory is an important thing to have fast, though. The faster (and lower-wallclock-latency) it is, the less time your system spends waiting around when it needs to swap things in and out of it. It's my understanding that programs that need to be fast (like many video games) take pains to preemptively load data into RAM from disk, and (when appropriate for the program) from main RAM into VRAM. If main RAM's transfer speed were equal to or greater than VRAM's, and its access latency were a small fraction of a frame render time, (presumably) some of that preloading complexity could go away.
> I guess with massive hardware RAID they could be faster...
This section of the comment is for folks who haven't been paying attention to how fast storage has gotten: It's nowhere near 1TB per second, but...
I have four 4TB SATA-attached Crucial MX500s set up in LVM2 RAID 0. This array is a bit faster than a 10gbit link. (That is, I get 1.5GByte/s transfer rate off of the thing.) Even a single non-garbage U.2-attached (or (barf) M.2-attached) device can saturate a 10Gbit link.
I mean they can only do that if you have hundreds of threads all making coalesced writes. Typical CPU workloads look nothing like that; if you pointer chase on the GPU you are going to get absurdly bad performance.
It actually almost never does. To see that, you'd need to benchmark. It's pretty difficult to get good utilization on a GPU on either the compute or the memory bandwidth side. A lot of kernels irretrievably fuck up both. You need long, coalesced reads/writes, and judicious use of the memory hierarchy, or else everything gets very slow very quickly.
> Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster
Nobody worth their salt believes just this and nothing else.
Yes, if the data fits entirely into a given cache, that's a nice case that's easy to reason about. No matter what access pattern is applied to the data, it doesn't matter because it's in the cache.
Hopefully everyone working with caches understands that they provide a potential speedup when not everything fits into the cache, and that this depends on the pattern of access (mainly, does it exhibit "locality"). Moreover, this case is extremely important.
The article gives an example of exactly that: improving the locality of access.
If you don't know this, you don't know one of the first facts about caching.
There is something else to know: you can't tell by size alone whether a given data set will fit into a cache. The problem is that caches are not always fully associative. In a set-associative cache, a given block of data cannot be stored in just any cache line: it is assigned to a small set of possible cache lines. Then, within a set, the cache lines are dynamically allocated and tagged. A given working set which appears to be just smaller than the cache might be arranged in such a poor way in memory that it doesn't map to all of the cache's sets. And so, it actually does not fit into the cache.
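A toy example of how that bites, assuming a hypothetical 32 KiB, 8-way L1 with 64-byte lines (64 sets, picked by address bits 6-11):

LINE_BYTES = 64
N_SETS = 64     # hypothetical L1: 32 KiB / 64-byte lines / 8 ways = 64 sets
WAYS = 8

def cache_set(addr):
    # For this geometry, the set index is just address bits 6-11.
    return (addr // LINE_BYTES) % N_SETS

# Sixteen 64-byte blocks spaced 4 KiB apart: only 1 KiB of data in total,
# yet they all map to the same set, and an 8-way set can hold only 8 lines.
addrs = [i * 4096 for i in range(16)]
print({cache_set(a) for a in addrs})   # {0} -- every block collides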
>Nobody worth their salt believes just this and nothing else.... and that this depends on the pattern of access (mainly, does it exhibit "locality").... If you don't know this, you don't know one of the first facts about caching.
Not necessarily specific to this issue, but I've found that surprisingly many people out there are not "worth their salt" in areas where you'd really expect them to be.
That's true; perhaps my wording is off. I believe the devil is in the details. Sure, knowing that better access patterns result in better performance is common knowledge. But the fact that the access pattern can be optimized when the problem is literally "access RAM at these random locations" is counterintuitive, IMO.
This was a great read, thanks. OP, readers might benefit from having it explicitly mentioned that while the pseudocode is in Python, actual Python code will likely not benefit from such an approach because of how memory is fragmented in the standard Python implementation - e.g. this discussion: https://stackoverflow.com/questions/49786441/python-get-proc...
I am tempted to actually test this (i.e. see if there's a speedup in Python), but I don't have the time right now.
All modern CPUs have powerful hardware prefetchers, which are insufficiently documented.
On any modern CPU (i.e. from the last 20 years), explicit prefetching is very seldom useful.
What is important is to organize your memory accesses in such orders so that they will occur in one of the patterns that triggers the hardware prefetcher, which will then take care to provide the data on time.
The patterns recognized by the hardware prefetchers vary from CPU to CPU, but all of them include the accesses where the addresses are in arithmetic progressions, going either forwards or backwards, so usually all array operations will be accelerated automatically by the hardware prefetchers.
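You can see the effect even from Python if you let numpy do the walking. The two gathers below move exactly the same bytes; only the visit order differs (sizes made up, chosen to spill well out of L3):

import timeit
import numpy as np

n = 20_000_000                                  # ~160 MB of int64, well past any L3
a = np.arange(n, dtype=np.int64)
seq = np.arange(n)                              # arithmetic progression: prefetch-friendly
rnd = np.random.default_rng(0).permutation(n)   # same indices, shuffled

for name, idx in (("sequential", seq), ("random", rnd)):
    t = timeit.timeit(lambda: a[idx].sum(), number=3)
    print(f"{name:10s} {t:.3f} s")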
That's not the problem at hand. Python is good for pseudocode. But not if you want to talk about cache misses because in the pseudocode written with higher level languages a lot of details on how memory is accessed are opaque.
Again, what would you suggest instead? If you can't guess that `list` is supposed to represent an array of consecutive elements, I have trouble thinking of a language that'll make that clear without being exceedingly verbose for no reason.
A bit later, in the article, you'll see that memory patterns in allocating the arrays have a role. A role that was hidden initially.
> Again, what would you suggest instead?
The answer is inside you. You have only to search for it.
Or, if you really want to extract the obvious from me: any low-level language (even without implementing every detail, just calling imaginary functions whose names suggest what they do). This exercise will show you, for instance, that you'll have to immediately choose whether to append_to_dynamic_array() or add_to_linked_list().
> This exercise will show you, for instance, that you'll have to immediately choose whether to append_to_dynamic_array() or add_to_linked_list().
Linked lists, that's what you're worried about? `[...]` is not a linked list in Python, in fact I don't know any imperative language where it's something other than a dynamic array/vector. I can only assume someone who doesn't understand or intuit this is arguing in bad faith, especially when taking your attitude into account. What did I do to deserve being talked down to?
> any low level language
Like this?
std::vector<std::vector<element_t>> groups(n_groups);
for (auto&& element : elements) {
    groups[element.group].push_back(std::move(element));
}

std::sort(elements.begin(), elements.end(), [](const auto& a, const auto& b) {
    return a.group < b.group;
});

std::vector<std::vector<element_t>> groups;
for (auto group_elements : group_by(std::move(elements), [](const auto& element) {
         return element.group;
     })) {
    groups.push_back(std::move(group_elements));
}
Is it worth it? If you don't know `list` is a dynamic array in Python, how will you know `std::vector` is a dynamic array in C++? Not to mention the syntax is terrible. C would be just as bad. Using Rust would get people angry about Rust evangelism. The only winning move is not to play.
// Create n empty arrays.
DynamicArray* groups[n_groups];
for (int i = 0; i < n_groups; i++) {
    groups[i] = create_empty_array();
}

// For each element in elements[], append it to its group's array.
for (int i = 0; i < n_elements; i++) {
    append_to_dynamic_array(groups[elements[i].group], elements[i].value);
}
antirez: You already got it wrong. You've got a double indirection, with `groups[i]` pointing to a heap-allocated instance of `DynamicArray`, which supposedly stores the actual heap pointer.
It's not about the language being low-level or high-level. It's about understanding the basics of memory layout. If the pseudocode being in Python is an obstacle for you, the problem is not in Python, but in your (lack of) intuition.
I wrote the first random semantics that came to mind in the C pseudocode above. The fact that you believe it is not the right one is proof that in C I can express the exact semantics, while in Python I can't (because everything is hidden inside how the Python interpreter may or may not implement a given thing).
So, modify my C code to match what you have in mind, if you wish. But the point is that low level code will specify in a much clearer way what's going on with memory, and in a piece of code that is about memory layout / access, that's a good idea. Otherwise I consider Python a good language for pseudocode. Just: not in this case.
Have a nice day, I'll stop replying since would be useless to continue. P.S. I didn't downvote you.
It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation. But I guess that's because it allows for working on smaller pieces at a time that fit in whatever cache they need to.
> Python can't really reserve space for lists, but pretend `reserve` did that anyway.
FWIW, you can pre-fill a list with e.g. `None` values, and then replacing those values won't cause resizes or relocations (of the list's memory - the elements are still indirected). But of course you'd then need to keep track of the count of "real" elements yourself.
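Something like this, with toy sizes:

n_groups, capacity = 4, 10                  # toy sizes

# Pre-filling with None means later writes by index never resize or
# relocate the list's buffer -- the closest a plain list gets to `reserve`.
groups = [[None] * capacity for _ in range(n_groups)]
lengths = [0] * n_groups                    # how many "real" elements each group holds

for value in range(20):
    g = value % n_groups                    # stand-in for a group id
    groups[g][lengths[g]] = value
    lengths[g] += 1

print(groups[0][:lengths[0]])               # [0, 4, 8, 12, 16]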
But of course, tricks like this are going to be counterproductive in Python anyway, because of all the indirection inherent in the system. They'll also get worse if you actually do have objects (unless perhaps they're statically, constantly sized, and in a language that can avoid indirection in that case) with a "group ID" attribute, rather than integers to which you can apply some hash function. Some test results, trying to approximate the Rust code in Python but with such objects:
It's important to understand the domain. For Pythonistas, simple really is better than complex. And identifying the bottlenecks and moving them to a faster language is also important (Numpy isn't successful by accident). The naive result here is still, if I'm understanding the graphs right, dozens of times worse than what any of the Rust code achieves.
(Edit: merely "several" times worse. I forgot that I told `timeit` to run 10 iterations over the same input, so each test processes ~10M elements.)
> It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation.
The key property here is the number of groups. When sharding data to n groups, only n locations have to be stored in cache -- the tails of the groups. In my radix sort implementation, this is just 256 locations, which works well with cache.
I'd love to understand the differences between the "devices" named A, Y, M in the performance measurements, referring to "(A, Y, M indicate different devices)".
I sometimes get the feeling it would be better to just start treating RAM as secondary storage, with read/write access through a (hardware implemented) io_uring style API. Is there anything along those lines out there?
CXL RAM is something like this on the physical side. It's RAM over a PCIe connection, and PCIe is basically a very fast, serial, point to point network. However as far as I'm aware the "API" to software makes it looks just like regular, local RAM.
It resembles sequential access memory with relatively fast seeks and a cache (ok, multiple layers of caches). Reading sequential, predictable addresses gives you much more throughput than random access, and reading a value that was recently accessed (or adjacent to such) is much lower latency than something that was not. There's further wrinkles in multicore systems as well, because then accesses to memory recently written to by another core can be slower again.
It resembles a hierarchy wherein a small fraction of memory - a "cache" region - can be accessed much faster than the rest. With careful planning, the programmer can increase the odds that the necessary information for the next step of an algorithm, at any given point, is within that small region. (This is a simplification; there are actually several layers of cache between a CPU and the most general part of the RAM.)
Anaesthetists also undergo many years of training prior to taking post. Typically they've got say 10 years training before they actually do the job, much of it specialised to this job in particular (and the rest in general or intensive care medicine)
If you're lucky a Software Engineer has a three year degree in CS, and probably only a semester at best was studying "Software Engineering" and even that might focus on something you don't care about, such as formal methods.
It is entirely possible that your junior engineers have never maintained a sizeable codebase for more than a few weeks, have never co-operated on software with more than a handful of other people, and have never used most of the common software libraries you use every day, regardless of whether these are in-house (so how could they) or widely available.
For example maybe you do lots of web apps fronting SQL Server databases in C#. Your new hire has six months of C#, they half-remember a course doing SQL on an Oracle database, and all their web knowledge is in Javascript. Do they know version control? Kinda. Have they used a test framework before? Well, they did in Java but never in C#.
The "All Your Tests Are Terrible" talk begins by pointing out that probably they're strictly wrong because you don't have any fucking tests. All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
> All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
At many times in my professional "career", I've found myself wishing that whoever wrote the test I'm staring at just hadn't bothered. "Tests that say and/or look like they test one thing, but test something entirely different." [0] and "Tests that actually test nothing at all and never fail when the code they claim to test is changed out from under them." are two of the many categories of tests I wish folks would just never have wasted their time writing.
[0] To be clear, I usually find tests of this type to claim to be testing something useful, but actually be testing something that's not worth testing.
I was expecting something about NUMA performance, temporal memory access instructions, shared versus global versus register memory on GPUs, SRAM, and so on. There's an article about all these things waiting to be written. This article is instead about memory-hierarchy-aware access pattern optimization, which is important, but not the whole story.
Exemplar code in python, shows the benefit in rust. o---kaaaaay.
So the outcome is "for this compiled rust, around 1m records it gets better"
But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Maybe I'm over-thinking it. Maybe this does just become simple, working-set-locality stuff.
I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
Uh? The post very clearly says: "I’m using Python as pseudocode; pretend I used your favorite low-level language". The goal is to show what's possible when you do your best, of course I'm not going to use Python for anything beyond demonstrating ideas.
> But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Again, that was never my goal. I chose Rust because I've stumbled upon this problem while working on a Rust library. I could've chosen C++, or C, or maybe even Go -- the result would've been the same, and I checked codegen to make sure of that.
> I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway. If your main takeaway from "here's how to process large data" was "I don't need to worry about this for small data", well, I don't know what to say.
Large and small are so contextual. I'm processing 350m events/day in 24h splits, and I managed to stop worrying about locality of reference because I'm the sole occupant of the machine. When I did worry about it, I found a radix tree, awk hashes, and perl/python hashes/dicts pretty much all occupied much the same space and time, but a tuned C implementation got 2-3x faster than any of them. Somebody else pointed out that keeping most of this memory-resident would be faster still, but then you have to process 24h of things against a single in-memory instance, which means buying into IPC to get the data "into" that memory.
It interested me that you didn't show the idea in Rust. That was the only point I was making: Python as pseudocode for thinking things through in documents is fine with me.
But overall, I liked your outcome. I just think it's important to remember large and small are very contextual. Your large case looks to me to be >5m things and for an awful lot of people doing stupid things, 5m is bigger than they'll ever see. If the target was only people who routinely deal in hundreds of millions of things, then sure.
> The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway.
What matters is locality since that allows prefetch to mask latency. If you have this, then you are in a good place even if your data does not fit in the L3 cache. What you did demonstrates the benefits that locality gives from the effect on prefetch. Fitting in L3 cache helps, but not as much as prefetch does. If you do not believe me, test a random access pattern on things in L3 cache vs a sequential access pattern. The sequential access pattern will win every time, because L3 cache is relatively slow and prefetch masks that latency.
I have seen options for disabling prefetch and cache as BIOS options (although I only remember the option for disabling cache in ancient systems). If you could get one, you could do some experiments to see which will matter more.
Radix sorting is great, as this article points out. But I followed the links and didn't see any mention of the most obvious optimization that makes it even better on modern systems.
If you do MSB for the first pass, then the subsequent passes within each bucket are completely independent :) So... Choose a large radix and do one MSB pass, then parallelize the LSB processing of each bucket. This is fantastic on a GPU. You might think that you would want many thousands of kernels running simultaneously, but in actuality each SM (and the Apple equivalent) has a thread-local cache, so you really only want a couple hundred simultaneous kernels at the most. I've written CUDA and Metal versions and they are fast. And you don't even have to be very good at GPU programming to get there. It's very basic. AoCP has the pseudocode.
In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second. You have measured 50M elements per second, with 8 bytes per element translates to 400 MB / second. If your hardware is similar, your implementation did less than 1% of theoretical bottleneck.
When your data is indeed big and the performance actually matters, consider doing something completely different.
In the real world, your application is running in a container among hundreds or thousands of other containers. The system's resources are also probably being managed by a hypervisor. The CPU is shared among N tenants _and_ is overcommitted. It's not that much to ask to optimize where you can, when it isn't unreasonably difficult.
More importantly, this attitude is precisely why software sucks today. "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that. Bad engineering time is _incredibly_ expensive. This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech, which is baffling considering how incredibly broad and deep the industry is. Instead, everyone clamors to learn the new Javascript abstraction that lets them get even further away from the reality of what the magic boxes are doing.
Some apple-pickers think the point of picking apples is to prove how good you are at climbing trees. They'll never not be mad at pluckers who just pluck the nearest apple, and the people who reward them for the apples, and the world that doesn't understand the pluckers picked their apples wrong, and didn't even earn them, and they're not even real apples.
Climbing a ladder takes more time, so in order to get the apples to market faster, the apple tree owners keep the pickers focused on only picking the apples within arm's reach.
The owners also disallow the use of ladders because the pool of candidates to hire remains bigger.
And the highest 90% of apples remain unpicked.
The part missing from this analogy is that the low-hanging fruit pickers are slowly killing the trees.
It's one thing if you know the optimal (or at least approaching it) solution, and deliberately use something else for other reasons. At least thought went into it.
I'm not upset at this simply for purity's sake; it directly impacts me in my day-to-day job. A hundred devs individually decide that good enough is good enough (deliberately or otherwise), and suddenly my infrastructure is dealing with traffic and query patterns it shouldn't have to, and then I have to explain what network bandwidth limits are, and why it's absurd that we hit them in the first place.
In general, what I'm railing against is the continued push towards simplification and abstraction, while not requiring the understanding of what lies beneath. The hyperscalers certainly aren't trying to foster that attitude in-house – someone has to build the abstractions in the first place, after all, and keep datacenters humming. Yet they're happily shoveling it out, because it's a win-win for them: fewer people have the skills necessary to compete (or even to simply not use their services), and more people are able to use their services.
> it directly impacts me in my day-to-day job
It directly impacts anyone that uses software everyday. Most people don't seem to understand and/or care how poorly the software they use runs.
Sir, your comment is poetry. I commend you.
> It also frustrates me to no end that people are so deeply incurious. This seems to only happen in tech
This doesn't match my observations. In many fields, training is limited and hides the details. We train workers to repeat specific tasks and they excel at them. But they don't have conceptual understanding. Any explanation of why things are done the way they are is surface-level. You can see it when a procedure fails for some reason. They way to deal with is to 1) do basic checks 2) try again and if it still fails 3) delegate to someone else. Nobody is going to troubleshoot or optimize tasks unless that's their main job.
It happens in construction, in the kitchens, on the assembly lines, and in the offices. It happens because it gets the job done.
You're right. My wording was flat-out wrong, and I apologize. A more accurate sentiment (IMO) would be "that this happens in tech is baffling, considering..."
Life is very different for many people and I think we just need to build empathy for people who treat a job as just a job. If they deliver and are not unkind about it there's nothing wrong about not going above and beyond the bare minimum, which is what they are paid for.
As I commented above, a large part of my umbrage stems from the impact these decisions have on my job. I dislike having to work harder because others didn't want to go above the bare minimum.
This isn't unique to any one company, nor my personal experience. At my last job, my team was initially human triaging practically all alerts, because we had the rare ability to read logs. I spoke to someone at Slack once who was stuck doing the same. That's an absurdly low expected level of competence.
> people are so deeply incurious. This seems to only happen in tech
It happens everywhere. If anything, techies are more curious than the average Joe. How many fellow programmers can you nerd-snipe with a comment that makes them say, “Well, that’s strange…”
Agree with the sentiment. However it's hard to stay curious, even harder to stay up-to-date.
I liked fiddling with storage for a while, got really into it, deepened my knowledge about it. A couple years later I realized everything else (networking, architectures, languages) developed so much, mot of my (non-basic) knowledge was obsolete. Picking up where I left off with all technologies is incredibly hard and caused fatigue.
Now I'm at a point where I have the feeling I don't know nothing about anything. It's factually not true, but my gut feeling tells this. Would I be younger, this would trigger a lot of anxiety. Thankfully I can janfle this by now.
That's understandable. I'm into databases (both professionally and personally), so I get to touch a wide variety of things. Following a query the whole way is pretty neat. Typed query --> client --> wire protocol --> network --> server --> DB frontend --> DB planner --> DB storage engine --> OS --> Memory / Disk. `blktrace` is super interesting to watch commands being sent to the disk.
When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Tech is also one of the (if not the) most dynamic and fast evolving field a normal person will ever touch. Curiosity in tech can drain every single bit of free time and energy you have and you will hardly keep up with the progress, maybe barely scratch the surface. But people's available free time and energy wanes and curiosity is a collateral victim.
I've painfully gone through the entire cycle of this, including the bit of resurgence later on when you have a combination of free time but less energy. What I can say is that this absolutely does not happen just in tech. If anything tech is flooded with people with more curiosity than almost any other field.
> When you are deeply involved in tech both personally and professionally you are probably very passionate about this and makes sense that you'd only look closely at this field and think "people are so deeply incurious. This seems to only happen in tech".
Good point. I commented in a sibling post to the same effect.
I've certainly felt the personal strain of time sink and procrastination in my homelab. It's currently running k3os, which has been dead for about four years now, because everything I want is still running, and I never seem have the motivation on the weekend to yell at my laptop when I could be playing board games.
> including the bit of resurgence later on when you have a combination of free time but less energy.
I'm guessing that will happen in another decade or so, when my kids are grown.
I can somewhat echo some of the statements here and provide my own experience that is similar.
I spend a decent amount of time writing decent C++ code. My friends in other parts of the industry are writing certainly better C++ code than me because they are doing it in environments that are more constricted in various ways. In either case, I do spend my time catching up a bit and would consider myself a competent C++21 programmer in some ways.
My experience and my conversations lead me to understand there is so much left on the table with even the most basic implementations. When I implement it correctly in C++ we get close to some of the theoretical limits for the hardware for some workloads, compared to something that is literally 1% as fast running in NodeJS.
Wit that said, for so many situations I cannot justify the time and complexity to use C++ for many projects. At least for the stage most projects are in. In theory this optimization can happen later, but it never really does because the horizontal (or sometimes even vertical) scaling kicks in and we're all just willing to throw a few more dollars at the problem instead of re-engineering it. Sure, some of the really big companies like Netflix find a decent reason from time to time to invest the engineering time squeeze out those numbers, but it's becoming the exception and not the rule.
>C++21
There's C++20 and C++23. No C++21.
Sorry, I meant to type C++11
It’s hard to learn the ins and outs of dozens of programming languages. One doesn’t usually just use one or two PL over their entire career. I have worked professionally with at least PHP, Java, Python, JS, Go, and Ruby. That without taking into account the respective libraries and frameworks (and without taking into account as well the myriad of other components like dbs, web servers, caches, etc.)
It sounds like an excuse, I know. The true is I just don’t have that much time.
> In the real world, your application is running in a container among hundreds or thousands of other containers
I mean, that’s an engineering decision too. In my day job we’re capturing, pre-processing, running inference on, and post-processing about 500Mpx/s worth of live image data at about 80ms/frame end-to-end at the edge. The processor SoM costs about $3000/unit and uses about 50W running flat out. The retail cost of our overall product is two orders of magnitude more than what the processor is worth but it incurs zero recurring costs for us.
Edit: and it’s got 64GB of Unified RAM that I’ve got all to myself :)
I was wondering if someone from a different sub-industry would disagree here :D
That sounds like a very interesting job, with quite different requirements and constraints from what I'm used to. One day, I'll get a job where application latency is critical, and optimizations matter deeply. Undoubtedly I'll find something else to be upset about, but at least it'll be a new complaint.
> Undoubtedly I’ll find something else to be upset about
Vendor SDKs. You’ll be upset about that I guarantee :)
> "[CPU, Memory, Disk] is cheap, engineering time isn't." Fuck that.
It is. It's absurdly cheap. I ensure I check the amount of time it would take for me to make a performance improvement against the runtime costs of my functions. It's rarely worth the extra effort.
Seriously, until you get into the millions of records per second level, you're almost never benefited. You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
> Bad engineering time is _incredibly_ expensive.
Engineering time is expensive. Period. It speaks to the need to minimize it.
> This is an excuse to not spend time learning the ins and outs of your language, and your hardware.
All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it. Otherwise you end up with an obscure mess that you have to unwind 5 years of context to understand and fix again.
Complexity and available mental contexts are forgotten costs. If your language even has that many "ins and outs" to begin with you may want to reconsider that.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
"You" is both singular and plural, which is often the problem with this thinking.
Is it worth spending a month of engineering time to make a page load in 50ms instead of 2s? Seems like a lot of engineering time for a noticeable but somewhat minor improvement.
But now, what if you have a million users who do this operation 100x/day? Absolutely worth it!
For example, I sure wish atlassian would spend a tiny bit of effort into making jira faster. Even if it is 1 second per ticket, since I'm viewing 100+ tickets per day that adds up. And there's many hundreds of us at the company doing the same thing, it really adds up.
Nit: 50ms vs 2000ms is 40x speed increase, i.e. ~1.5 order of magnitude.
I still keep in mind the words of my database optimization lecturer, who said that in his experience optimizations below 1 OOM aren’t worth it, and most „good ones” are 3+.
> Absolutely worth it!
Long reaching assumption. Even the biggest companies have limited resources (even if vast). Would you rather improve load times by 2x (from 500ms to 250ms) or improve checkout reliability from 99% to 99.5%? And there is much more to consider on some levels (e.g. planning for thermal efficiency is fun).
Software development is always a game of choice.
Most of the time they just move the expensive processing to the user's browser so they don't have to pay for it :)
> 50ms instead of 2s
In the past I believe Google was very adamant that page load time perception was very important to other metrics.
You're probably not going to achieve that with the kind of optimization described in this article though.
> You may make your function 2x faster, at a cost of additional complexity, but you never run it enough in a year for it to pay itself back.
I'm not talking about increased complexity, I'm talking about extremely basic things that take zero extra time, like using the correct data structure. For example, in Python:
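(A rough stand-in for the comparison being made, since absolute sizes vary by build: `sys.getsizeof` reports only the container itself, which is the point here, because the array stores 4-byte C ints inline while the list stores 8-byte pointers to separately allocated int objects.)

    import sys
    from array import array
    from collections import deque

    n = 1_000_000
    as_list = list(range(n))
    as_array = array('i', range(n))   # contiguous 4-byte C ints
    as_deque = deque(range(n))        # blocks of pointers, no cheap random access

    # Container sizes only; the list and deque also reference ~1M int objects.
    print(f"list:  {sys.getsizeof(as_list) / 2**20:.1f} MiB")
    print(f"array: {sys.getsizeof(as_array) / 2**20:.1f} MiB")
    print(f"deque: {sys.getsizeof(as_deque) / 2**20:.1f} MiB")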
Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.

Or knowing that `random.randint` is remarkably slow compared to `random.random()`, which can matter in a hot loop:
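(A quick way to reproduce that kind of gap with `timeit`; absolute numbers vary by machine and Python version, and the 163 ns vs 31.9 ns quoted further down is one such measurement.)

    import timeit

    n = 1_000_000
    t_randint = timeit.timeit("randint(0, 100)", setup="from random import randint", number=n)
    t_random = timeit.timeit("random()", setup="from random import random", number=n)
    print(f"randint(0, 100): {t_randint / n * 1e9:5.1f} ns/call")
    print(f"random():        {t_random / n * 1e9:5.1f} ns/call")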
> All of which will change in a few years, which is fine, if you're also committing to keeping _all that code_ up to date right along with it.

With the exception of list comprehension over large ranges slowing down from 3.11 --> now, I don't think there's been much in Python that's become dramatically worse such that you would need to refactor it later (I gather the Javascript community does this ritual every quarter or so). Anything being deprecated has years of warning.
Yes but if you want to do things in a less obvious way you should be aware of the downsides, such as bias in your random numbers. Also making sure you watch out for off by one errors.
Stolen the number to show this off well from a bug report somewhere:
Result: https://bugs.python.org/issue9025

Have you run this in any modern version of Python? It’s been fixed for a long time.
3.10 so I redid it on 3.13.1, same results.
> Very similar structures, with very different memory requirements and access speeds. I can count on one hand with no fingers the number of times I've seen an array used.
That is obvious when you actually check the access speed of arrays and find out it is about half that of lists on small integers (under 256), and worse on non-small integers. That is literally the opposite trade off of what you want in 99.99% of cases.
Deques are even less of a consideration, they’re unrolled linked lists so random access is impossible and iteration is slower, you use a deque when you need a deque (or at least a fifo), aka when you need to routinely manipulate the head of the collection.
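(For anyone who wants to check the access-speed claim, a small sketch; results vary by machine. Reading elements back out of an `array` has to re-box each value as a Python `int` object, so iteration tends to lose to a plain `list`.)

    import timeit

    setup = (
        "from array import array\n"
        "xs_list = list(range(10_000))\n"
        "xs_array = array('q', range(10_000))"
    )
    print("list sum: ", timeit.timeit("sum(xs_list)", setup=setup, number=2_000))
    print("array sum:", timeit.timeit("sum(xs_array)", setup=setup, number=2_000))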
It depends on your constraints. If you’re limited by RAM, arrays make a lot of sense for certain applications. If you need Python’s buffer protocol, again, they make a lot of sense.
As to deques, yes, they have specific uses, and being slightly smaller isn’t usually a selling point for them. My point was that I have seen many cases where an incorrect data structure was used, because a list or dict was “good enough.” And sure, they generally are, but if the language ships with other options, why wouldn’t you explore those?
Python is 100% the wrong language to worry about this in. If your hot loops are in python and you care about performance, you should be rewriting them in another language.
Agreed; I used it partially because TFA used it to demonstrate ideas, and partially because I’m very familiar with it.
But you’re correct, of course. When I need something to go faster in Python, I write it in C. If it’s more than a small section, then a full rewrite is reasonable.
Ok, but neither of these matter until you know they matter. Seriously. Like, yes, it's nice they exist and that they are available for when you want them, but I would generally advise people to use a list or random.randint, if only because I value their confidence with them over the 2x performance win, because most workloads are not simply just a single array or random number generator loop. And, to be clear, I work on performance professionally: most of my job is not making things as fast as possible, but considering the tradeoffs that go into writing that code. I understand your example as showing off an interesting performance story but in the real world most workloads are more complex than what can be solved with using a rare but drop-in API.
> which can matter in a hot loop:
163ns - 31.9ns == 131.1ns
This will need to happen 7.6 million times to save me 1 CPU second. On AWS lambda with 1GB of memory this will cost you a whopping: $0.0000166667.
The point is, you're not even wrong, but there are vanishingly few cases where it would actually matter to the bottom line in practice. You're taking an absolutist point of view to a discipline which thoroughly rejects it.
This is what I love about the cloud. It forces you to confront what your efforts are actually worth by placing a specific value on all of these commodities. In my experience they're often worth very little given that none of us have the scale of problems where this would show actual returns.
Reaching the scale where it shows actual returns isn't all that difficult. You need it to happen 7.6 million times to save 1 CPU second, but each CPU core can execute it nearly that many times every second.
Probably you don't leave it generating only random numbers all day, but suppose you do generate a good few, so that it's 1% of your total compute budget, and you have only a modest load, using on average four CPU cores at any given time. Then saving that amount of computation will have saved you something like $15/year in compute, recurring. Which isn't actually that bad a return for ten seconds worth of choosing the right function.
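A back-of-envelope version of that estimate (the ~$0.05 per core-hour cloud price here is my own assumption; the other numbers come from this thread):

    saved_ns_per_call = 163 - 31.9      # randint -> random() saving per call
    fraction_of_budget = 0.01           # RNG is 1% of total compute
    cores = 4                           # average cores in use
    seconds_per_year = 365 * 24 * 3600

    rng_core_seconds = cores * fraction_of_budget * seconds_per_year
    saved_core_seconds = rng_core_seconds * (saved_ns_per_call / 163)
    saved_core_hours = saved_core_seconds / 3600
    print(f"~{saved_core_hours:.0f} core-hours/year, roughly ${saved_core_hours * 0.05:.0f}/year")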
There are also a lot of even fairly small entities for which four cores is peanuts and they're running a hundred or a thousand at once, which quickly turns up the price.
And even the things with internet scale aren't all that rare. Suppose you're making a contribution to the mainline Linux kernel. It will run on billions of devices, possibly for decades. Even if it doesn't run very often, that's still a lot of cycles, and some of the kernel code does run very often. Likewise code in popular web browsers, javascript on popular websites or in popular libraries, etc.
You don't have to work for Google to make a contribution to zlib and that kind of stuff has the weight of the world on it.
Sure, but the cumulative effects of pervasive mediocre to bad decisions do add up. And it isn't just about cloud compute cost. Your own time is stolen by the slow ci jobs that you inevitably get stuck waiting for. For me, I prioritize my own personal happiness in my work and this mindset taken too far makes me unhappy.
You are saying that the potential gains are less than an order of magnitude. That makes them a pretty hard sell in most instances.
> until you get into the millions of records per second level, you're almost never benefited
Yeah, but the software landscape is very diverse.
On my job (CAM/CAE) I often handle data structures with gigabytes of data. Worse, unlike e.g. multimedia frameworks many algorithms operating on these numbers are global i.e. can’t be represented as a pipeline which splits data into independent chunks and processes them sequentially.
Making performance-critical functions twice as fast might save hours of time for a single end user in a single day.
That's good food for thought. My hardware is quite old, but I agree that the numbers seem somewhat suspicious.
I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted. I'm assuming the 50 GB/s throughput is half-duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat closer to my result.
I tried using non-temporal stores in the straightforward algorithm, but to my surprise, this led to a significant decrease of performance across all input lengths.
> When your data is indeed big and the performance actually matters, consider doing something completely different.
I'm not sure what you mean by this. Scaling across machines is just ignoring the problem. What do you mean by "something completely different"?
> I believe that RAM can only be RMW'd in cache lines, so modifying just 8 bytes still requires 64 bytes to be transmitted.
Ages ago I worked with several different memory controllers, and it depends on the memory controller, cache, and MMU configuration.
Plenty of systems do require cacheline updates, then modifying 8B requires reading one or two cachelines, updating them, and writing them out eventually.
Some caches track cacheline validity with a bit per byte. This enables a CPU to write a byte out to memory without fetching the cacheline. The cache may then try to burst read that line from the memory controller, but if it doesn't get around to it before deciding to flush that cacheline, it may issue a single-byte write to the controller. The controller can then issue a masked DRAM write to the memory, which will update only certain bytes in the DRAM column. However, this still takes about as long as sending the full cacheline, but it offloads the read-modify-write.
Validity per byte is also useful to implement hit under miss.
I bet on newer, bigger systems tricks like this are less useful since the memory busses are faster and wider today.
Non-temporal stores will not help you if all you do are those accesses. The point of using them is that you don't want them pulled into the caches and pushing out everything else.
> In my 3 years old laptop, system memory (dual channel DDR4-3200) delivers about 50 GB / second.
That's almost certainly for (mostly) sequential access.
When you just want a couple bytes here and there, and access isn't pipelined and prefetch doesn't accelerate your use case, the real world bandwidth is going to be significantly less.
The input data is sequential.
I don’t understand Rust but if the code is doing what’s written there, “simple multiplicative hash and perform a simple analysis on the buckets – say, compute the sum of minimums among buckets” it’s possible to improve substantially.
They don’t need to move these megabytes of elements between collections. I would split the input across a few CPU cores, compute per-bucket minimums in parallel on each core, when all completed aggregate minimums across all cores, then compute sum of the results.
Multiplicative hash and the final sum should be possible to vectorize on most computers. Updating per-bucket minimums is probably impossible to vectorize (technically the AVX-512 set has the required instructions, but I'm not sure they are efficient), but there are far fewer buckets than input numbers, which means the arrays of per-bucket minimums are likely to fit in caches.
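A sketch of that restructuring in the article's pseudocode language (the hash constant, chunking scheme, and worker count here are placeholders, not the article's code): each worker scans its own slice and keeps only a 256-entry table of per-bucket minimums; the tables are merged at the end, and no elements are ever moved between collections.

    from concurrent.futures import ProcessPoolExecutor

    N_BUCKETS = 256
    HASH_MULT = 0x9E3779B97F4A7C15  # arbitrary odd 64-bit constant

    def bucket_of(x: int) -> int:
        # Multiplicative hash: keep the top 8 bits of a 64-bit product.
        return ((x * HASH_MULT) & (2**64 - 1)) >> 56

    def chunk_minimums(chunk: list[int]) -> list:
        mins = [None] * N_BUCKETS
        for x in chunk:
            b = bucket_of(x)
            if mins[b] is None or x < mins[b]:
                mins[b] = x
        return mins

    def sum_of_bucket_minimums(data: list[int], workers: int = 4) -> int:
        chunks = [data[i::workers] for i in range(workers)]
        with ProcessPoolExecutor(workers) as pool:
            partials = list(pool.map(chunk_minimums, chunks))
        # Aggregate the per-worker tables, then sum the global minimums.
        total = 0
        for b in range(N_BUCKETS):
            candidates = [p[b] for p in partials if p[b] is not None]
            if candidates:
                total += min(candidates)
        return total

When actually spawning processes, this would of course sit behind the usual `if __name__ == "__main__":` guard; the point is only the shape of the computation.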
cache misses are slow because of latency, not because of throughput.
Isn't the point of the person you replied to that the article author wasn't able to eliminate latency, because if they had, they'd be constrained by throughput, and they are not?
And my GPU delivers 1TB/s. Massive difference <3
I wish we could get those kind of speeds on system RAM.
To/from what sort of devices could the GPU read or write at 1TB/sec, besides main memory?
The fastest consumer SSDs top out at several GB/sec (I guess with massive hardware RAID they could be faster, but not sure if they’d be 1TB/sec fast).
Even a network adapter that does 10 Gbit/sec is only recently becoming slightly more common for the average consumer. Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Consumer? Even in the "if-you-need-to-ask-what-it-costs-you-can't-afford-it" world of frontier HPC systems, we're only getting teasers of 0.8 Tbit/sec NICs this year.
As you say, only the GPU, maybe RAM, and the NIC will be able to churn data at these speeds. There is a reason why Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU and NIC can talk directly to each other.
https://www.servethehome.com/this-is-the-next-gen-nvidia-con...
> Not sure if any consumer adapters in the 10 or 1Tbit/sec range exist at all.
Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into? Your little home ATT fiber connection isn’t going to come close to using that available bandwidth.
> Further, what exactly is a consumer going to plug a 1Tbit/sec adapter into?
Another similarly-equipped machine on the LAN, and the switch(es) between them.
> ...besides main memory?
Main memory is an important thing to have fast, though. The faster (and lower-wallclock-latency) it is, the less time your system spends waiting around when it needs to swap things in and out of it. It's my understanding that programs that need to be fast (like many video games) take pains to preemptively load data into RAM from disk, and (when appropriate for the program) from main RAM into VRAM. If main RAM's transfer speed were equal to or greater than VRAM's, and its access latency were a small fraction of a frame render time, (presumably) some of that preloading complexity could go away.
> I guess with massive hardware RAID they could be faster...
This section of the comment is for folks who haven't been paying attention to how fast storage has gotten: It's nowhere near 1TB per second, but...
I have four 4TB SATA-attached Crucial MX500s set up in LVM2 RAID 0. This array is a bit faster than a 10gbit link. (That is, I get 1.5GByte/s transfer rate off of the thing.) Even a single non-garbage U.2-attached (or (barf) M.2-attached) device can saturate a 10Gbit link.
I mean they can only do that if you have hundreds of threads all making coalesced writes. Typical CPU workloads look nothing like that; if you pointer chase on the GPU you are going to get absurdly bad performance.
It actually almost never does. To see that you'd need to benchmark. It's pretty difficult to get good utilization on a GPU on either the compute or the memory bandwidth side. A lot of kernels irretrievably fuck up both. You need long, coalesced reads/writes, and judicious use of the memory hierarchy, or else everything gets very slow very quickly.
> Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster
Nobody worth their salt believes just this and nothing else.
Yes, if the data fits entirely into a given cache, that's a nice case that's easy to reason about. No matter what access pattern is applied to the data, it doesn't matter because it's in the cache.
Hopefully everyone working with caches understands that they provide a potential speedup when not everything fits into the cache, and that this depends on the pattern of access (mainly, does it exhibit "locality"). Moreover, this case is extremely important.
The article gives an example of exactly that: improving the locality of access.
If you don't know this, you don't know one of the first facts about caching.
There is something else to know about: you can't tell by size alone whether a given data set will fit into a cache. The problem is that caches are not always fully associative. In a set-associative cache, a given block of data cannot be stored in any cache line: it is assigned to a small set of possible cache lines. Then within a set, the cache lines are dynamically allocated and tagged. A given working set which appears to be just smaller than the cache might be arranged in such a poor way in memory that it doesn't map to all of the cache's sets. And so, it actually does not fit into the cache.
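A toy illustration of that last point (the cache geometry here is a typical but assumed 32 KiB, 8-way, 64-byte-line L1): with a 4 KiB stride, every access lands in the same set, so a handful of lines already evict each other even though the total data is tiny.

    CACHE_BYTES, WAYS, LINE = 32 * 1024, 8, 64
    n_sets = CACHE_BYTES // (WAYS * LINE)      # 64 sets

    def cache_set(addr: int) -> int:
        return (addr // LINE) % n_sets

    stride = 4096
    touched = {cache_set(i * stride) for i in range(16)}
    # 16 lines, but they all map to one set with only 8 ways -> self-eviction.
    print(n_sets, touched)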
> Nobody worth their salt believes just this and nothing else.
> ... and that this depends on the pattern of access (mainly, does it exhibit "locality") ...
> If you don't know this, you don't know one of the first facts about caching.
Not necessarily specific to this issue, but I've found that surprisingly many people out there are not "worth their salt" in areas where you'd really expect them to be.
Assuming incompetence cannot be a general strategy, but there are many surprising ways to get jobs and pass exams.
That's true, perhaps my wording is off. I believe the devil is in the details. Sure, knowing that better access patterns result in better performance is common. But the fact that the access pattern can be optimized when the problem is literally "access RAM at these random locations" is counterintuitive, IMO.
When you have locality, prefetch can mask the latency of getting the next object, regardless of whether everything fits in cache.
This was a great read, thanks. OP, readers might benefit from having it explicitly mentioned that while the pseudocode is in Python, actual Python code will likely not benefit from such an approach because of how memory is fragmented in the standard Python implementation - e.g. this discussion: https://stackoverflow.com/questions/49786441/python-get-proc...
I am tempted to actually test this (i.e. see if there's a speedup in Python), but I don't have the time right now.
I should have looked at the comments before I started writing the code. I'd have replied to you otherwise :) (see https://news.ycombinator.com/item?id=42459055 .)
Haha thanks
[dead]
> The only way to prevent cache misses is to make the memory accesses more ordered.
Or prefetch. Not enough people know about prefetch.
All modern CPUs have powerful hardware prefetchers, which are insufficiently documented.
On any modern CPU (i.e. from the last 20 years), explicit prefetching is very seldom useful.
What is important is to organize your memory accesses in such orders so that they will occur in one of the patterns that triggers the hardware prefetcher, which will then take care to provide the data on time.
The patterns recognized by the hardware prefetchers vary from CPU to CPU, but all of them include the accesses where the addresses are in arithmetic progressions, going either forwards or backwards, so usually all array operations will be accelerated automatically by the hardware prefetchers.
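One crude way to see the difference from Python (interpreter overhead blunts the effect, but with a large enough array the ordered pass still tends to come out ahead):

    import random
    import timeit
    from array import array

    n = 10_000_000
    data = array('q', range(n))
    in_order = list(range(n))
    shuffled = in_order[:]
    random.shuffle(shuffled)

    # Same work, same indices; only the order of accesses to `data` differs.
    print("sequential:", timeit.timeit(lambda: sum(data[i] for i in in_order), number=1))
    print("shuffled:  ", timeit.timeit(lambda: sum(data[i] for i in shuffled), number=1))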
Using Python pseudocode in this context makes it hard to understand.
What concise and readable language would you suggest instead?
That's not the problem at hand. Python is good for pseudocode. But not if you want to talk about cache misses because in the pseudocode written with higher level languages a lot of details on how memory is accessed are opaque.
Again, what would you suggest instead? If you can't guess that `list` is supposed to represent an array of consecutive elements, I have trouble thinking of a language that'll make that clear without being exceedingly verbose for no reason.
A bit later, in the article, you'll see that memory patterns in allocating the arrays have a role. A role that was hidden initially.
> Again, what would you suggest instead?
The answer is inside you. You have only to search for it. Or, if you really want to extract the obvious from me: any low-level language (even without implementing every detail, instead calling imaginary functions whose names suggest what they are doing). This exercise will show you, for instance, that you'll have to immediately choose whether to append_to_dynamic_array() or add_to_linked_list().
> This exercise will show you for instance that you'll have to immediately choose if to append_to_dynamic_array() or add_to_linked_list().
Linked lists, that's what you're worried about? `[...]` is not a linked list in Python, in fact I don't know any imperative language where it's something other than a dynamic array/vector. I can only assume someone who doesn't understand or intuit this is arguing in bad faith, especially when taking your attitude into account. What did I do to deserve being talked down to?
> any low level language
Like this?
Is it worth it? If you don't know `list` is a dynamic array in Python, how will you know `std::vector` is a dynamic array in C++? Not to mention the syntax is terrible. C would be just as bad. Using Rust would get people angry about Rust evangelism. The only winning move is not to play.

antirez: You already got it wrong. You've got a double indirection, with `groups[i]` pointing to a heap-allocated instance of `DynamicArray`, which supposedly stores the actual heap pointer.
It's not about the language being low-level or high-level. It's about understanding the basics of memory layout. If the pseudocode being in Python is an obstacle for you, the problem is not in Python, but in your (lack of) intuition.
I wrote the first random semantics that came to mind in the C pseudocode above. The fact that you believe it is not the right one is proof that in C I can express the exact semantics, and in Python I can't (because everything is hidden inside how the Python interpreter may or may not implement a given thing).
So, modify my C code to match what you have in mind, if you wish. But the point is that low level code will specify in a much clearer way what's going on with memory, and in a piece of code that is about memory layout / access, that's a good idea. Otherwise I consider Python a good language for pseudocode. Just: not in this case.
Have a nice day, I'll stop replying since would be useless to continue. P.S. I didn't downvote you.
For an article about caching and performance C is the lingua franca
There are actually some algorithms specifically designed to optimize usage of cache resources without knowing the specific features of the cache.
"Cache Oblivious algorithms" https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation. But I guess that's because it allows for working on smaller pieces at a time that fit in whatever cache they need to.
> Python can't really reserve space for lists, but pretend `reserve` did that anyway.
FWIW, you can pre-fill a list with e.g. `None` values, and then replacing those values won't cause resizes or relocations (of the list's memory - the elements are still indirected). But of course you'd then need to keep track of the count of "real" elements yourself.
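A minimal sketch of that trick (the producer here is a stand-in): allocate the list once, overwrite the `None` slots, and track the logical length yourself.

    capacity = 1_000_000
    buf = [None] * capacity       # one allocation; no growth or relocation afterwards
    count = 0

    for value in range(250_000):  # stand-in for whatever actually produces the data
        buf[count] = value        # overwriting None never resizes the list
        count += 1

    live = buf[:count]            # the "real" elements; `count` is tracked by hand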
But of course, tricks like this are going to be counterproductive in Python anyway, because of all the indirection inherent in the system. They'll also get worse if you actually do have objects (unless perhaps they're statically, constantly sized, and in a language that can avoid indirection in that case) with a "group ID" attribute, rather than integers to which you can apply some hash function. Some test results, trying to approximate the Rust code in Python but with such objects:
https://gist.github.com/zahlman/c1d2e98eac57cbb853ce2af515fe...
And as I expected, the results on my (admittedly underpowered) machine are terrible (output is time in seconds):
It's important to understand the domain. For Pythonistas, simple really is better than complex. And identifying the bottlenecks and moving them to a faster language is also important (Numpy isn't successful by accident). The naive result here is still, if I'm understanding the graphs right, dozens of times worse than what any of the Rust code achieves.

(Edit: merely "several" times worse. I forgot that I told `timeit` to run 10 iterations over the same input, so each test processes ~10M elements.)
> It's a little surprising that this works at all, since the partitioning step in the radix sort is itself the same kind of sharding operation.
The key property here is the number of groups. When sharding data to n groups, only n locations have to be stored in cache -- the tails of the groups. In my radix sort implementation, this is just 256 locations, which works well with cache.
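Illustratively (this is a generic counting-sort partition pass in the article's pseudocode style, not the author's Rust): one key byte picks one of 256 groups, and the only state that has to stay hot is the 256 running write positions.

    def partition_by_byte(data: list[int], shift: int) -> list[int]:
        # Count how many keys fall into each of the 256 groups.
        counts = [0] * 256
        for x in data:
            counts[(x >> shift) & 0xFF] += 1

        # Prefix sums give each group's starting write position ("tail").
        tails = [0] * 256
        acc = 0
        for b in range(256):
            tails[b] = acc
            acc += counts[b]

        # Scatter: only the 256 tails need to stay cached while writing.
        out = [0] * len(data)
        for x in data:
            b = (x >> shift) & 0xFF
            out[tails[b]] = x
            tails[b] += 1
        return out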
Actual benchmarks and graphs. Thank you so much for that.
I'd love to understand the differences between the "devices" named A, Y, M in the performance measurements, referring to "(A, Y, M indicate different devices)".
Any pointers appreciated.
I sometimes get the feeling it would be better to just start treating RAM as secondary storage, with read/write access through a (hardware implemented) io_uring style API. Is there anything along those lines out there?
CXL RAM is something like this on the physical side. It's RAM over a PCIe connection, and PCIe is basically a very fast, serial, point to point network. However as far as I'm aware the "API" to software makes it looks just like regular, local RAM.
https://en.wikipedia.org/wiki/Compute_Express_Link
> The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data
The post doesn't seem to answer what the memory actually resembles. If it's not resembling a random access memory, then what is it resembling?
It resembles sequential access memory with relatively fast seeks and a cache (ok, multiple layers of caches). Reading sequential, predictable addresses gives you much more throughput than random access, and reading a value that was recently accessed (or adjacent to such) is much lower latency than something that was not. There's further wrinkles in multicore systems as well, because then accesses to memory recently written to by another core can be slower again.
It resembles a hierarchy wherein a small fraction of memory - a "cache" region - can be accessed much faster than the rest. With careful planning, the programmer can increase the odds that the necessary information for the next step of an algorithm, at any given point, is within that small region. (This is a simplification; there are actually several layers of cache between a CPU and the most general part of the RAM.)
Anaesthetists are required to undergo days of retraining per year because the field, and the safety numbers, keep moving.
To do this researchers publish findings and professionals aggregate those into useful training materials
Who / where is the useful training materials for software devs? It really cannot be just blogs.
That’s a VC investment that has ROI across the industry - tell a16z how to measure that and we are quids in
"nulls should not be in your language" is my version of "doctors should wash their hands"
"Immutability-first, with mutation-as-a-special-case" is my "accountants should not use erasers"
"make illegal states unrepresentable" is my "hospitals should sterilise their operating rooms and equipment"
As a field, we are very far from reaching broad agreement on the left side (which I consider the basics). So what would the training materials teach?
Anaesthetists also undergo many years of training prior to taking post. Typically they've got say 10 years training before they actually do the job, much of it specialised to this job in particular (and the rest in general or intensive care medicine)
If you're lucky a Software Engineer has a three year degree in CS, and probably only a semester at best was studying "Software Engineering" and even that might focus on something you don't care about, such as formal methods.
It is entirely possible that your junior engineers have never maintained a sizeable codebase for more than a few weeks, have never co-operated on software with more than a handful of other people, and have never used most of the common software libraries you use every day, regardless of whether these are in-house (so how could they) or widely available.
For example maybe you do lots of web apps fronting SQL Server databases in C#. Your new hire has six months of C#, they half-remember a course doing SQL on an Oracle database, and all their web knowledge is in Javascript. Do they know version control? Kinda. Have they used a test framework before? Well, they did in Java but never in C#.
The "All Your Tests Are Terrible" talk begins by pointing out that probably they're strictly wrong because you don't have any fucking tests. All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
> All of the rest of the talk is about the hideous tests that you're unhappy to find because you forget that somebody could just not have bothered to write any tests at all.
At many times in my professional "career", I've found myself wishing that whoever wrote the test I'm staring at just hadn't bothered. "Tests that say and/or look like they test one thing, but test something entirely different." [0] and "Tests that actually test nothing at all and never fail when the code they claim to test is changed out from under them." are two of the many categories of tests I wish folks would just never have wasted their time writing.
[0] To be clear, I usually find tests of this type to claim to be testing something useful, but actually be testing something that's not worth testing.
Hey, I resent and agree with this!
> Who / where is the useful training materials for software devs? It really cannot be just blogs.
Why can't it be blogs, books, and word-of-mouth?
If you're a programmer (as your HN profile suggests that you are), you already know how little formal training we receive.
I was expecting something about NUMA performance, temporal memory access instructions, shared versus global versus register memory on GPUs, SRAM, and so on. There's an article about all these things waiting to be written. This article is instead about memory-hierarchy-aware access pattern optimization, which is important, but not the whole story.
Tl;dr: cache is the new RAM; RAM is the new disk; disk is the new tape.
Exemplar code in python, shows the benefit in rust. o---kaaaaay.
So the outcome is "for this compiled rust, around 1m records it gets better"
But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Maybe I'm over-thinking it. Maybe this does just become simple, working-set-locality stuff.
I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
Uh? The post very clearly says: "I’m using Python as pseudocode; pretend I used your favorite low-level language". The goal is to show what's possible when you do your best, of course I'm not going to use Python for anything beyond demonstrating ideas.
> But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Again, that was never my goal. I chose Rust because I've stumbled upon this problem while working on a Rust library. I could've chosen C++, or C, or maybe even Go -- the result would've been the same, and I checked codegen to make sure of that.
> I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway. If your main takeaway from "here's how to process large data" was "I don't need to worry about this for small data", well, I don't know what to say.
Large and Small are so contextual. I'm processing 350m events/day in 24h splits, and I managed to stop worrying about locality of reference because I'm the sole occupant of the machine. When I did worry about it, I found radix tree, awk hash and perl/python hash/dict pretty much all occupied much the same space and time but a tuned C implementation got 2-3x faster than any of them. Somebody else pointed out memory resident for most of this would be faster still but you have to then work to process 24h of things against a single memory instance. Which means buying into IPC to get the data "into" that memory.
It interested me that you didn't show the idea in Rust. That was the only point I was making: Python as pseudocode to think through things in documents is fine with me.
But overall, I liked your outcome. I just think it's important to remember large and small are very contextual. Your large case looks to me to be >5m things and for an awful lot of people doing stupid things, 5m is bigger than they'll ever see. If the target was only people who routinely deal in hundreds of millions of things, then sure.
> The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway.
What matters is locality, since that allows prefetch to mask latency. If you have this, then you are in a good place even if your data does not fit in the L3 cache. What you demonstrated are the benefits that locality gives through its effect on prefetch. Fitting in L3 cache helps, but not as much as prefetch does. If you do not believe me, test a random access pattern on things in L3 cache vs a sequential access pattern. The sequential access pattern will win every time, because L3 cache is relatively slow and prefetch masks that latency.
I have seen options for disabling prefetch and cache as BIOS options (although I only remember the option for disabling cache in ancient systems). If you could get one, you could do some experiments to see which will matter more.