Ruslan Shestopalyuk:
Yesterday/the day before I was at a C++ seminar with Alexandrescu. No particular revelations (which was expected, really). A fun guy, like most Romanians.
Here are my notes from the first day, in case anyone is interested.
Day 1.
  • 1. He started by stating that the efficiency of certain container/algorithm implementations matters a lot at Facebook.
The operations that are performed very often, on "big data" and need to be efficient and scalable:
- Dictionary lookup: hash maps (unordered_map) vs maps.
- Depending on the scenario, one may want to pick one over another.
- Hash maps are generally faster, but hash key computation can be expensive.
- In case of maps it may be beneficial to scramble the keys, to avoid wasting time on comparing many strings with similar prefixes.
- For small data sets ("max 7 to 17" elements) just plain arrays with linear search may be better.
- Set operations: union and intersection - using search engine terms as an example.
- Set intersection on sorted arrays can be improved via binary search.
- Which can be improved via so-called "gallop-search" (going from start and probing in exponential increments: 2, 4, 8, 16...)
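To make the gallop-search idea concrete, here is a minimal sketch of intersecting two sorted arrays with it (all names and structure are mine, not code from the talk; assumes sorted, distinct elements):

```cpp
#include <cstddef>
#include <vector>

// Galloping (exponential) search: find the first index >= lo in the sorted
// vector v whose element is >= key. Probe at exponentially growing offsets
// (1, 2, 4, 8, ...), then binary-search inside the last doubling interval.
std::size_t gallop_lower_bound(const std::vector<int>& v, std::size_t lo, int key) {
    std::size_t step = 1, hi = lo;
    while (hi < v.size() && v[hi] < key) {
        lo = hi;
        hi += step;
        step *= 2;
    }
    if (hi > v.size()) hi = v.size();
    while (lo < hi) {                    // plain binary search in [lo, hi)
        std::size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Intersect two sorted vectors, galloping through the (presumably) bigger one.
std::vector<int> gallop_intersect(const std::vector<int>& small_v,
                                  const std::vector<int>& big_v) {
    std::vector<int> out;
    std::size_t pos = 0;
    for (int x : small_v) {
        pos = gallop_lower_bound(big_v, pos, x);
        if (pos == big_v.size()) break;
        if (big_v[pos] == x) out.push_back(x);
    }
    return out;
}
```

The win over plain binary search comes from restarting each probe near the previous match, so the cost per element is proportional to the log of the gap, not the log of the whole array.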
  • 2. There was an interlude about the benefits of generic, type parameters (template) based vs. interface (base abstract class/inheritance) based API design.
The main argument for the former is performance and more compile-time checking.
The latter is generally more flexible (as behaviours can be modified at runtime).
The rest of the seminar was about the details of template-based APIs/implementations, mostly from the perspective of a library designer (which apparently has been Alexandrescu's primary occupation all these years).
  • 3. He talked rather a lot about the new C++11 feature - so called variadic templates.
- It's a generally obscure feature, which found its way into the standard almost by accident, and many people are still not sure why they would need it at all.
- Allows a variable number of type parameters inside the ugly angle-bracket syntax ("template <class... Ts> class C {};").
- This can be abused left and right for certain presumably neat applications, which he tried to illustrate:
- better typelists (which are... well, lists of types used as template parameters)
- memoization.
- better C++ tuples implementation.
Unfortunately, there was no clear explanation of why one would need typelists or tuples in the first place.
There was one typelist success story from the audience, which went along the lines of: "There are clients, and there are numbers, and there is money, and all of them together - we'd need to write a lot of code, but we used typelists, and then BAAM - magic happens!".
The take-away is that it's probably a good thing to know about (just in case), but better avoid doing it at home.
Or do it at home, but don't do it at work. Or whatever.
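For what it's worth, here is a toy C++11 typelist built on variadic templates, with a compile-time membership test; this is my own minimal illustration, not code shown at the seminar:

```cpp
#include <cstddef>
#include <type_traits>

// A typelist is just a variadic parameter pack frozen inside a struct.
template <class... Ts>
struct Typelist {
    static constexpr std::size_t size = sizeof...(Ts);
};

// Compile-time membership test via recursive pack peeling (C++11-compatible).
template <class T, class List>
struct Contains;

template <class T>
struct Contains<T, Typelist<>> : std::false_type {};

template <class T, class Head, class... Tail>
struct Contains<T, Typelist<Head, Tail...>>
    : std::conditional<std::is_same<T, Head>::value,
                       std::true_type,
                       Contains<T, Typelist<Tail...>>>::type {};
```

Everything happens at compile time; there is no runtime object at all, which is both the appeal and the obscurity of the technique.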
  • 4. Policy-based design:
- This is one of the things Alexandrescu is famous for, and there was a bit of talk about it.
- In a sense, it has been a cornerstone of the standard C++ library design (with things like predicates and allocators as type parameters).
- Very powerful idiom, albeit sometimes hard to understand.
- "Pull the implementation down, but push the policy upwards, to the client": allow the client to decide separate aspects of the class behaviour, but provide good defaults in case they do not care.
- So classes can be parametrized by different types, each of them representing a separate behaviour aspect (policy): e.g. a memory allocation policy, a locking policy, a comparison policy, etc.
- His Loki C++ library was very policy heavy, but C++ standard designers decided that having too many of those is too complicated.
- Variadic template hacks could be used to simplify the syntax for the user (e.g. letting the user specify policy classes in any order and skip the defaults). Not worth it anyway.
- Generally, all this stuff sends compile times through the roof, but this consideration was somewhat swept under the rug in the course of the discussion.
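A bare-bones sketch of the idiom, with made-up policy names (NoLocking, NewDeleteAllocator); Loki's real policy classes are considerably richer:

```cpp
#include <cstddef>
#include <new>

// Hypothetical policies: each captures one behaviour aspect that the host
// class delegates to a template parameter.
struct NoLocking {                      // default: do nothing
    void lock() {}
    void unlock() {}
};

struct NewDeleteAllocator {             // default: plain operator new/delete
    static void* allocate(std::size_t n) { return ::operator new(n); }
    static void deallocate(void* p) { ::operator delete(p); }
};

// The host class: clients can override any aspect; defaults cover the rest.
template <class T,
          class LockingPolicy = NoLocking,
          class AllocatorPolicy = NewDeleteAllocator>
class Holder : private LockingPolicy {
public:
    explicit Holder(const T& v) {
        this->lock();
        ptr_ = static_cast<T*>(AllocatorPolicy::allocate(sizeof(T)));
        new (ptr_) T(v);                // placement-new into policy memory
        this->unlock();
    }
    ~Holder() {
        ptr_->~T();
        AllocatorPolicy::deallocate(ptr_);
    }
    Holder(const Holder&) = delete;
    Holder& operator=(const Holder&) = delete;
    const T& get() const { return *ptr_; }
private:
    T* ptr_;
};
```

A client who cares about one aspect writes, say, Holder<int, NoLocking, MyAllocator>; everyone else writes Holder<int> and gets the defaults.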
  • 5. Many Standard C++ library implementations still suck from the point of view of performance.
- Vector resizing: the growth strategy is not specified exactly by the standard, and almost all implementations have been doing it wrong.
- Growing the capacity by a factor of two has been prevalent (except for the recent Microsoft C++ library implementation). The optimal factor is around the golden section, ~1.6: after reallocation the freed memory chunks then have a higher chance of being reused, reducing the probability of heap fragmentation.
- Searching for a substring in a string: there are better algorithms than the default one being used, e.g. Rabin-Karp (a smart "floating character code sum" idea).
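A rough sketch of that idea, using the plain character-code sum as the rolling hash; production Rabin-Karp uses a polynomial rolling hash to keep collisions rare, and the function names here are mine:

```cpp
#include <cstddef>
#include <string>

// Rabin-Karp sketch: compare rolling "hashes" first, and fall back to a real
// character comparison only when the hashes match. The hash here is the sum
// of character codes, updated in O(1) as the window slides.
std::size_t rabin_karp(const std::string& text, const std::string& pat) {
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return std::string::npos;
    unsigned long ph = 0, th = 0;
    for (std::size_t i = 0; i < m; ++i) {
        ph += static_cast<unsigned char>(pat[i]);
        th += static_cast<unsigned char>(text[i]);
    }
    for (std::size_t i = 0; ; ++i) {
        if (th == ph && text.compare(i, m, pat) == 0) return i;
        if (i + m == n) return std::string::npos;
        // Slide the window: add text[i + m], drop text[i].
        th += static_cast<unsigned char>(text[i + m]);
        th -= static_cast<unsigned char>(text[i]);
    }
}
```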
  • 6. Memory allocation.
- Things like freelists and other custom allocation strategies.
- The underlying memory allocator performance still matters a lot.
- The majority of memory allocator implementations are reluctant to release memory back to the system, which is a problem on a machine running many services.
- Facebook uses jemalloc, which fixes that problem. They claim they could not find anything better so far. Worth trying.
  • 7. Efficient/scalable STL usage.
- Matters even more in multithreaded environment.
- Mind memory allocation: locks should be held for as short a period as possible, and locking inside the memory allocator should be avoided at all costs.
- Double checked locking trick can be used.
- Another trick: Edit a copy of container aside, lock the original container, swap, unlock.
- C++11 has nice threading primitives that can be used.
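The "edit a copy aside" trick might look roughly like this (a sketch assuming a single writer; with concurrent writers, updates made between the snapshot and the swap would be lost):

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

// Instead of holding the lock for the whole batch of modifications,
// build the new state in a private copy and lock only for the O(1) swap.
std::vector<int> shared_data;
std::mutex shared_mutex;

void batch_update(const std::vector<int>& additions) {
    std::vector<int> copy;
    {
        std::lock_guard<std::mutex> lk(shared_mutex);
        copy = shared_data;          // short lock: take a snapshot
    }
    // Long, lock-free part: edit the copy aside.
    copy.insert(copy.end(), additions.begin(), additions.end());
    std::sort(copy.begin(), copy.end());
    {
        std::lock_guard<std::mutex> lk(shared_mutex);
        shared_data.swap(copy);      // short lock: constant-time swap
    }
}
```

Readers block only for the duration of the snapshot copy and the swap, not for the whole edit.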
  • 8. C++ design by committee is rather unfortunate.
- For example, they removed COW (copy on write) from strings, trying to "fix" the potential threading problems.
- Alexandrescu seems to be missing his cow a lot. There are sound ways to fix these problems without killing the cow.
- Facebook does that in its Folly library (which is open sourced and available on github)
- There are even more nice (and/or obscure) features coming in C++14, so one can use/abuse them to their liking.
Day 2.
There were two main topics this day: error handling and low-level C++ optimization.
  • 1. Error checking.
- Assertions. Absolutely crucial. Used for catching unexpected internal states (programming errors), not for validating external input data, etc.
- Reliability: your service shall keep working with whatever crappy input it gets. Which appears to be in conflict with the "crash early" philosophy.
- Not many people write unit tests for their catch clauses. Their loss.
- Usually there is not much testing of the "bad" scenarios (only "happy-day testing") - it's a road to hell.
- The Thunderbird mail client as an example of the multiple problems that causes: networking errors, race conditions, modal dialogs, etc.
  • 2. C++ exceptions are OK, but there are a lot of limitations.
- Only one exception in flight - inherently serial.
- No throwing exception inside exception.
  • 3. Expected<T>
- One can try some tricks to get the best of both worlds: error codes and exceptions.
- We build it as a container holding either an actual value of T OR a std::exception_ptr (a C++11 feature, kind of a shared pointer to an exception).
- Example usage: "Expected<int> parseInt(const string&);" (instead of just returning an int, with a "special" value in case of error)
- Other languages have similar constructs: Haskell's "Maybe T", Scala's "Option[T]", C#'s "Nullable<T>", etc. But we want to embed the error information as well (which those types do not provide).
- Kind of like promise<T>/future<T> in C++11 (and other languages), but we don't need async with all its overhead.
- Performance benefits: it creates the exception and encapsulates it, but does not actually throw anything; no try/catch.
- Btw, C++11 has the new notion of move constructors taking rvalue references (the "&&" syntax); it allows for better efficiency in move scenarios.
- Exception values: watch out for slicing. Exceptions are not your regular objects; they need to be passed by reference, etc.
- ...followed by a bunch of gritty implementation details of this in C++11.
- According to Alexandrescu, "this is a very new thing, just starting to be used at Facebook".
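A stripped-down sketch of the concept (not Facebook's actual implementation: for brevity it stores the value directly instead of using a union, so T must be default-constructible):

```cpp
#include <exception>
#include <string>

// Expected<T>: holds either a value or a std::exception_ptr, and only
// throws when the caller asks for a value that isn't there.
template <class T>
class Expected {
    bool has_value_;
    T value_{};                      // simplification: no union tricks here
    std::exception_ptr error_;
public:
    Expected(T v) : has_value_(true), value_(std::move(v)) {}
    Expected(std::exception_ptr e) : has_value_(false), error_(std::move(e)) {}

    template <class E>
    static Expected fromException(E e) {
        // make_exception_ptr builds the pointer without an actual throw.
        return Expected(std::make_exception_ptr(std::move(e)));
    }
    bool valid() const { return has_value_; }
    T& get() {
        if (!has_value_) std::rethrow_exception(error_);
        return value_;
    }
};

// Hypothetical usage: parse, but report failure without throwing eagerly.
Expected<int> parseInt(const std::string& s) {
    try {
        return Expected<int>(std::stoi(s));
    } catch (...) {
        return Expected<int>(std::current_exception());
    }
}
```

The caller checks valid() and pays the cost of throwing only if it actually asks for a missing value.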
  • 4. ScopeGuard.
- There was his article about it in Dr. Dobb's, written in 2000.
- Encapsulates the action/cleanup/rollback-on-action-fail pattern.
- Allows for better composition (e.g. when one wants to nest these kinds of actions).
- Improvements in C++11: use lambdas and get rid of the nesting. Add a move constructor for ScopeGuard (and explicitly prohibit, i.e. "delete", all the other constructors).
- Can do some macro tricks to wrap the scope guard, to use auto-generated anonymous variable names etc.
- All of this is available in Facebook Folly library on github.
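A minimal C++11 ScopeGuard along those lines (illustrative, not the Folly version; Folly adds the macro sugar mentioned above):

```cpp
#include <utility>
#include <vector>

// Run a lambda on scope exit unless dismissed; movable, not copyable.
template <class F>
class ScopeGuard {
    F cleanup_;
    bool active_;
public:
    explicit ScopeGuard(F f) : cleanup_(std::move(f)), active_(true) {}
    ScopeGuard(ScopeGuard&& other)
        : cleanup_(std::move(other.cleanup_)), active_(other.active_) {
        other.dismiss();
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
    ~ScopeGuard() { if (active_) cleanup_(); }
    void dismiss() { active_ = false; }
};

template <class F>
ScopeGuard<F> makeGuard(F f) { return ScopeGuard<F>(std::move(f)); }

// Usage: roll back the first insert if the second one fails.
void transactional_push(std::vector<int>& a, std::vector<int>& b, int v) {
    a.push_back(v);
    auto guard = makeGuard([&] { a.pop_back(); });
    b.push_back(v);     // if this throws, the guard rolls `a` back
    guard.dismiss();    // both succeeded: keep the changes
}
```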
  • 5. Code optimization.
- Mind Amdahl's law (the overall speedup is limited by the share of total time the optimized part actually takes); optimize the big spenders first.
- Remember about the operations that are "negligible" but are spread all over the place.
- "Lhadma's" law (Amdahl reversed, as a pun): also mind the hotspots _outside_ the application itself.
- Hard to get reproducible results in today's architectures. Memory hierarchies, pipelining, speculative execution, cores/threads, networking etc. Also, dynamic frequency control (power saving)
- Intuition does not work. Don't use it. Your mental models are not as complex as reality anyway. Measure everything.
- E.g., fewer instructions does not mean faster code. Data is not necessarily faster than computation, and not necessarily the opposite.
- "Measuring gives you a leg up on experts who are too smart to measure".
- If we have 100 measurements in a row, which one do we take as representative?
- He claims one should take the minimum: repeat many times, but consider only the single fastest run.
- (This was kind of half-retracted when I asked him about warming up the cache on the very first run, and about scenarios that don't run the same tight loop many times in a row.)
- Because: <measured time> = <actual time of interest> + <quantization noise> + <measurement noise> + <other noise>. Noise is additive and positive (a one-sided, roughly half-Gaussian distribution).
- When measuring, we are trying to subtract all that noise from the sum; hence the minimum time.
- Even better, consider the "mode", not the minimum. The mode is the point where the results are densest (i.e. "the value that appears most often in a set of data").
- Learn statistics!!!
- Make sure the compiler does not optimize away the code in your micro-benchmark (code that won't be optimized away in the real-life scenario).
- Relative vs absolute timing.
- Choose a baseline to relate to: a commonly accepted solution (e.g. the baseline for writing to a file is fwrite; for sorting, std::sort).
- Slower than the baseline - BAD. Always choose a relevant baseline.
- Differential timing trick: run the baseline 2n times and measure t2a; then run the baseline n times plus the contender n times and measure ta+b. Since ta = t2a/2 and tb = ta+b - t2a/2, the relative speedup is ta/tb = t2a/(2*ta+b - t2a).
- Basic mistakes:
- Measuring debug builds.
- Different setup for baseline and measured: cache, files etc.
- Including stuff like printf, malloc inside the measured code block.
- Mixture - measure the sum of tasks, improve one but conclude the other one got improved.
- Optimizing rare cases (e.g. checking if the array is already sorted before sorting it). Unless those cases turn out not to be that rare (back to "exploiting patterns in data and data access").
- Techniques.
- Generalities.
- Prefer static linking, position-dependent code.
- Prefer 64-bit code, 32-bit data.
- Prefer 32 bit array indexing to pointers. More hints to compiler.
- a[i++] is better than a[++i]: the former has a better chance of being pipelined, since the element access does not have to wait for the increment.
- Regular memory access patterns are good.
- Storage optimizing.
- Use "static const" for immutables: they are loaded with the program and mapped into a read-only section.
- Use the stack for variables: hot memory, zero-cost addressing. Globals have aliasing issues.
- Integers.
- Prefer 32 bits to all other sizes (8/16 bits may be worse because of internal conversions to/from 32 bits).
- Consider using small ints in arrays indexing.
- Unsigned is better than signed: slightly faster division/multiplication.
- Statistically, "most numbers are going to be small". Optimize for that.
- Strength reduction.
- Speed hierarchy of operations (cheapest to most expensive):
- int add, sub, bitops, shift.
- fp add/sub.
- int mul, fp mul.
- fp division, remainder.
- int div, remainder.
- See if you can get away with an operation higher up the list.
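A few classic strength-reduction rewrites as an illustration (for constant operands, modern compilers do most of these automatically; the point is knowing the cost hierarchy for when they cannot):

```cpp
#include <cstdint>

// Replace expensive operations with cheaper ones from higher up the list.
uint32_t div8(uint32_t x)  { return x >> 3; }              // instead of x / 8
uint32_t mod8(uint32_t x)  { return x & 7u; }              // instead of x % 8
uint32_t mul10(uint32_t x) { return (x << 3) + (x << 1); } // x*10 as shifts+add
```

Note that the unsigned versions are the easy ones; signed division by a power of two needs extra rounding fixups, which is part of why unsigned arithmetic is listed as slightly faster.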
- Minimize array (through pointer) writes.
- Array writes disable enregistering (the compiler can't keep values in registers).
- A write is really read/write because of the cache.
- And the cache gets dirty.
- Minimize data dependencies; this improves instruction-level parallelism: pipelining, out-of-order execution, etc.
- A nice web-site to disassemble code online.
- Dynamics. Usually:
- Precomputing faster than computing.
- CPUs compute faster than they fetch from memory.
- Caches make huge difference.
- Precomputing also reduces data dependencies.
- Data storage granularity: think in multiples of 64 bytes, the cache-line size.
- Inlining is important, since it's a gateway to a lot of other optimizations.
- Might need fine-tuning via compiler flags (e.g. this was a win at Facebook with gcc 4.8: --param max-inline-insns-auto=100, --param early-inlining-insns=200).
- Constructors and destructors are the most affected by inlining.
- Beware expanding the code size due to inlining.
- Instruction cache: icache spills seldom occur in microbenchmarks, but may hit in real-life scenarios.
- An example at Facebook: a destructor got inlined, which caused icache spilling (more instructions) and +0.4% CPU time - potentially millions of $.
- They've got a NEVER_INLINE macro for those cases.
- Finely hand-tuned.
- shared_ptr performance/implementation considerations.
- Atomics _do_ incur cost.
- Statistically, a lot of objects have a low reference count. One could exploit "lazy refcount allocation" and other tricks.
- Btw, "zero is a nice number", CPU-instruction wise.
- When you design an object for performance, the first 64 bytes of the object are absolutely crucial for speed.
- Return by value: even with C++11 move semantics it may be more expensive than doing it the old way.
- URVO (unnamed return value optimization): all paths in the function return rvalues.
- NRVO (named return value optimization): all paths in the function return the same local variable.
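The two shapes can be illustrated like this (whether the copy is actually elided is up to the compiler, but these are the forms it can recognize):

```cpp
#include <vector>

// NRVO-friendly: every return statement names the same local, so the
// compiler may construct `result` directly in the caller's storage.
std::vector<int> make_squares(int n) {
    std::vector<int> result;
    result.reserve(n);
    for (int i = 0; i < n; ++i) result.push_back(i * i);
    return result;              // single named local: NRVO applies
}

// URVO/RVO-friendly: the return expression is an unnamed temporary,
// constructed in place in the caller's storage.
std::vector<int> three_zeros() {
    return std::vector<int>(3, 0);
}
```

Returning different locals from different paths, or returning std::move(local), defeats these optimizations.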
  • 6. Final sentiments (I am not commenting, use your own judgment)
- "One solution to get co-workers understand the highly optimized code is to get good co-workers".
- "Hardcore C++ optimization will keep mattering. History goes in spiral.".
By the way, for those interested, here are almost all the videos from the main conference (Alexandrescu had two separate days of a "pre-conference workshop", followed by the three days of NDC Oslo 2014 itself):
The most popular video there is Alexandrescu again, with "Three cool things about D". There were many other well-known people, and a lot of talk about functional programming.
Uncle Bob (aka Robert Martin) pulled a flash drive out of his pocket and waved it around for a whole hour, shouting: "Memory is cheap! All praise functional programming! All praise Clojure!". At the end I asked him: fine, memory is cheap, but nobody has repealed bandwidth and cache misses, what do we do about those?... He answered along the lines of "don't sweat it". Quite the populist :)
I liked Joe Armstrong (the one who invented Erlang), a funny old fellow. I asked him a silly question: "so why exactly are strings in Erlang linked lists?" He replied that "strings are a type just like any other, so what the hell" :) That was his idea of a joke.
Ruslan Abdikeev:
thanks for the notes!
I won't comment on the quirks of our comrade Alexandrescu, but these things genuinely interest me:
- I didn't know that the "double-checked locking trick" already "can be used".
- I didn't know there are string COW implementations that are simultaneously fast under high contention and cheap for small strings -- has anyone looked at folly yet?
- I didn't know that Rabin-Karp is actually efficient on small strings.
- "Another trick: edit a copy of the container aside, lock/swap/unlock" sounds like either they skipped over CAS/ABA, or like a big fucking disaster.
If anyone knows the details, I'd love to hear them. Thanks in advance!
Arseny Kapoulkine:
Was the variadic templates part Alexandrescu's take, or your own opinion?
About COW: it's better for the standard to guarantee that COW either is or isn't there, than the C++03 situation (different for every vendor).
Ruslan Abdikeev:
"Another trick" - then I don't quite get what the trick is, can you explain?
Ruslan Shestopalyuk:
Yes, toward the end the variadic templates part had more of my own editorializing (this was written in a casual style for co-workers who didn't come). Although Alexandrescu himself grumbled a lot, in the spirit of "wtf, why such special syntax for parameter pack expansion", etc.
"Got into the standard almost by accident" and "the syntax doesn't fit in one's head" - that was his own phrasing :)
About COW: apparently (again, if we believe Alexandrescu), one should expect the answer to be NO.
Starting from the end: ABA wasn't even discussed; fat locks were meant. CAS was mentioned once, but in a different context.
Rabin-Karp: no, he wasn't talking about short strings. For short strings it's always linear search, etc.
With "Another trick" the point was that, rather than holding a fat lock on the container while making several modifications at once, it may be better to make a temporary copy, modify it, lock the original, std::swap it with the temporary, and unlock.
Arseny Kapoulkine:
By the time you've made the temporary copy, it could probably have been changed five times already.
If these are ordinary containers.
Ruslan Abdikeev:
Changed is nothing - it could even have been deleted, without so much as a flinch or a prayer.
Even if they are not ordinary - since the trick is needed.
Although usually, if they are not ordinary, the trick is no longer needed :)
Looks like I've just invented a new variation of No True Scotsman.
Ruslan Abdikeev:
Ah, so this is for the case of a single writer - got it; otherwise there's no way around CAS+ABA.
I was hoping to hear about a new trick!
And what's the story with the "double-checked locking trick"?
And while I'm at it: I saw no miracles in folly, which most likely means that CoW still pisses itself at any reasonably acceptable string length.
Andrey Mironov:
Ruslan Abdikeev:
Oh yes, I'm waiting with open arms to be swept away by the crowd of overlords who can at least approximately parse
"write-release can synchronize-with a read-acquire" and translate it into C++.
But I understood what Alexandrescu had in mind, thanks!
It's just that, as written, it gives the impression you can do it bluntly, without barriers.
And I thought it was some new non-trivial language feature I hadn't heard of.
Andrey Mironov:
I think those who don't understand what synchronizes-with means won't be discussing such topics in the first place. Probably not many people need to know this in practice; I personally grasped these things fairly quickly.
Ruslan Abdikeev:
thanks a lot for the notes!
Yuriy Krasnoschek:
Thanks. Interesting.
Ruslan Abdikeev:
about 2. C++ exceptions.
I've just realized, by the way, that quite a lot has actually changed in 8 years.
And that's a good thing.
Boris Batkin:
you're digging really deep again. Almost incomprehensible.
I mean, all the words are understandable, but the meaning is fundamentally non-obvious.
I.e., if we're talking about "a lot has changed", then one should read up on std::future and what happens with exceptions there.
And I understood nothing at all about the ahistoricity part. The first two - yes, a lot has changed.
Again, I'm looking at it from my favorite vantage point:
the question of "who do you write exceptions with". And these thoughts make me shudder.
I.e., again, my question is "who are all these people".
Ruslan Abdikeev:
ahistoricity is the ability to tell what led to the problem.
Now this can at least somehow be addressed via std::exception_ptr.
Ruslan Abdikeev:
yes, that's the right question.
Alexandrescu answers it rhetorically from his own vantage point:
"One solution to get co-workers understand the highly optimized code is to get good co-workers".