<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>crazy.ark</title><description>Our greatest glory is not in never falling, but rising every time we fall.</description><link>https://blog.crazyark.xyz/</link><language>zh_CN</language><item><title>Raft on RocksDB -- Shared Log</title><link>https://blog.crazyark.xyz/zh/posts/raft_and_wal/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/raft_and_wal/</guid><pubDate>Tue, 17 Aug 2021 02:44:38 GMT</pubDate><content:encoded>&lt;p&gt;It has been three years since the last blog update. During this time, I went through studying abroad, graduation, starting work, and many other things -- experiences that only I truly understand. Now that the blog is back, I hope my writing and thinking have improved a little. At the same time, given my limited knowledge, I welcome any guidance from experts.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Last month, to learn the Rust programming language, I participated in a performance challenge called CloudBuild hosted by Alibaba Cloud&apos;s ECS team, implementing a chatroom server in Rust. Since I was not proficient with Rust, I only submitted once during the preliminary round, yet somehow made it to the finals, which was quite interesting. The finals required the service to be deployed across three machines while providing complete read and write services, with the requirement that the service should remain unaffected even if one machine goes down. Upon seeing the problem, my intuitive thought was to use a consensus protocol to make the three machines agree on the stored data. Common consensus protocols in distributed systems include Paxos and Raft. Raft is essentially a specialized form of Paxos and is simple and easy to understand. Therefore, I ultimately decided to implement the solution using Raft + RocksDB.&lt;/p&gt;
&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;Raft is a distributed consensus protocol, and RocksDB is a key-value database. Both are well-known in their respective fields and need no further introduction. During the implementation, I encountered several issues. One major problem was: the Raft protocol requires maintaining a log during operation, and RocksDB also maintains a WAL (Write-Ahead Log) to ensure durability. These two logs are highly redundant in terms of data content. This means that during data writes, an extra copy needs to be written, which is unacceptable for any performance-conscious system.&lt;/p&gt;
&lt;p&gt;In fact, many distributed database systems have already merged these two logs. For example, some modified distributed MySQL systems, as well as TiKV (though I haven&apos;t verified this). Their approaches include either extending the storage engine&apos;s WAL to support special events from the consensus protocol, embedding a storage engine WAL entry within the consensus protocol&apos;s log, or for specialized systems like distributed log systems, directly transforming the Raft log into the storage engine. While searching for solutions, I found a previous article from eBay [1] that discusses how to combine the log storage&apos;s log with the Raft log. Although it is only a workshop paper, the illustrated explanations were truly enlightening.&lt;/p&gt;
&lt;p&gt;Since I adopted RocksDB as the storage engine, I obviously did not want to modify RocksDB&apos;s WAL -- the engineering effort would be too large and not worth it. Therefore, I decided to try customizing the Raft log and disabling RocksDB&apos;s WAL. During crash recovery, the Raft log would be used to recover RocksDB. Because RocksDB is an LSM-tree based system, the SSTs (Sorted String Tables) persisted on disk are all read-only. Even in case of a crash, only the memtable in memory that hasn&apos;t been flushed is lost, which greatly simplifies the crash recovery workload.&lt;/p&gt;
&lt;h2&gt;Shared Log&lt;/h2&gt;
&lt;p&gt;The shared log needs to consider the following two issues for functionality and performance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;During crash recovery, how do we find the earliest log entry not yet written to RocksDB?&lt;/li&gt;
&lt;li&gt;How do we reduce small disk I/Os and maximize throughput?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Crash Recovery&lt;/h3&gt;
&lt;p&gt;Since RocksDB only loses the in-memory memtable, and KV operations are idempotent, the simplest approach is to replay the entire log from beginning to end. Obviously this is not a good solution: not only is recovery slow, but the redundant data produced by replaying duplicate KV operations also adds extra disk pressure, which is very unfriendly to SSDs with their expensive GC and limited write endurance. A better approach is to record the maximum log LSN covered by an SST whenever RocksDB finishes flushing one, and persist that LSN to disk as well. On restart, read the recorded LSN and replay the log from that point onward.&lt;/p&gt;
&lt;p&gt;There are two ways to implement this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use RocksDB&apos;s event callbacks to asynchronously obtain and write the LSN&lt;/li&gt;
&lt;li&gt;Each time a put(k, v) is performed, additionally put a special key, for example &lt;code&gt;_raft_log_lsn&lt;/code&gt;, with the value being the current log&apos;s LSN&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With the first approach, the asynchronous write can always be lost if the process crashes before the callback completes, so I prefer the second approach. Its only requirement is that the special key must not fall within the range of valid user keys. When the memtable is flushed to disk as an SST, this special key is written along with it, so when RocksDB restarts, simply querying this key yields the maximum LSN already persisted. This approach is clearly more elegant. Of course, it cannot guarantee that the LSN and the data in the SST are always consistent, unless the memtable can be locked against flushing during writes.&lt;/p&gt;
&lt;p&gt;One important note here is that RocksDB&apos;s default memtable flushing policy is based on size. Therefore, we can write the LSN key first (which does not change the memtable size), then perform put(k, v). This way, even if a flush is triggered, the data and LSN remain consistent. However, we need to ensure that no other policy triggers a flush that writes the LSN key while the data is lost -- this currently seems quite difficult to guarantee (without modifying RocksDB).&lt;/p&gt;
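&lt;p&gt;The special-key scheme and the replay-from-LSN recovery can be sketched against an in-memory stand-in for RocksDB. Everything here (MiniStore, the recover helper) is illustrative, not the RocksDB API; a real implementation would go through RocksDB with the WAL disabled:&lt;/p&gt;

```python
# Sketch of the special-key scheme with an in-memory stand-in for RocksDB.
# MiniStore and recover() are illustrative names, not RocksDB API.

LSN_KEY = "_raft_log_lsn"

class MiniStore:
    def __init__(self):
        self.memtable = {}   # volatile: lost on crash
        self.ssts = []       # flushed SSTs: read-only, survive a crash

    def put(self, key, value, lsn):
        # Write the LSN key first: overwriting it does not grow the memtable,
        # so a size-triggered flush can only fire on the data put itself, at
        # which point both the LSN and the data are already in the memtable.
        self.memtable[LSN_KEY] = lsn
        self.memtable[key] = value

    def flush(self):
        self.ssts.append(dict(self.memtable))
        self.memtable = {}

    def crash(self):
        self.memtable = {}   # only the unflushed memtable is lost

def recover(store, raft_log):
    """Replay every log entry newer than the highest LSN found in any SST."""
    persisted = max((sst.get(LSN_KEY, 0) for sst in store.ssts), default=0)
    for lsn, key, value in raft_log:
        if lsn > persisted:
            store.put(key, value, lsn)
    return persisted

# Two writes are flushed; a third is lost in the crash and then replayed.
raft_log = [(1, "a", "1"), (2, "b", "2"), (3, "c", "3")]
store = MiniStore()
store.put("a", "1", 1)
store.put("b", "2", 2)
store.flush()
store.put("c", "3", 3)
store.crash()
replayed_from = recover(store, raft_log)
```

&lt;p&gt;After recovery, only the entry with LSN 3 is replayed into the fresh memtable; everything up to LSN 2 is already safe in the flushed SST.&lt;/p&gt;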
&lt;h3&gt;Group Commit&lt;/h3&gt;
&lt;p&gt;All databases have a requirement: before returning to the user, data must already be on disk in some form (e.g., WAL) to ensure durability. However, when a large number of user requests are small, writing each request individually to disk causes many small I/Os. Small I/Os (less than 4KB) are slow and harmful on any type of disk product. A common optimization technique in database systems is group commit, which combines a group of requests into a single request and commits them to disk together before returning to users. This makes each I/O request as large as possible, effectively improving throughput.&lt;/p&gt;
&lt;p&gt;The shared WAL can also use group commit for writes. Note that this requires a small modification to the crash recovery scheme described above: the LSN should only be written after all K-V pairs have been put.&lt;/p&gt;
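&lt;p&gt;A group commit buffer can be sketched as follows. The batch layout (all K-V pairs first, the LSN record last) follows the modified recovery scheme above; the class and threshold are illustrative, and a real implementation would flush on a timer or byte budget as well as on count:&lt;/p&gt;

```python
# Group-commit sketch: buffer writes and flush them to "disk" as one batch.
# The LSN record goes last, after every K-V pair it covers.

LSN_KEY = "_raft_log_lsn"

class GroupCommitLog:
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.pending = []        # (lsn, key, value) not yet on disk
        self.disk_writes = []    # each element models one contiguous I/O

    def append(self, lsn, key, value):
        self.pending.append((lsn, key, value))
        if len(self.pending) >= self.batch_size:
            self.commit()

    def commit(self):
        if not self.pending:
            return
        batch = [(k, v) for _, k, v in self.pending]
        batch.append((LSN_KEY, self.pending[-1][0]))  # LSN of the last entry
        self.disk_writes.append(batch)  # one large I/O instead of many small ones
        self.pending = []

log = GroupCommitLog(batch_size=3)
for i in range(1, 8):
    log.append(i, f"k{i}", f"v{i}")
log.commit()  # flush the partial tail batch
```

&lt;p&gt;Seven small writes become three disk I/Os, each stamped with the highest LSN it contains (3, 6, and 7).&lt;/p&gt;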
&lt;h3&gt;Flow Diagram&lt;/h3&gt;
&lt;p&gt;Here is a flow diagram following the approach described above:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/raft_on_rocksdb.jpg&quot; alt=&quot;Flow Diagram&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;This is just some thoughts triggered by a small competition. In practice, the competition only involved inserts and queries, with no deletes or overwrites -- messages were append-only. RocksDB is not optimal for every scenario, but the idea of Raft on top of some storage engine should be universally applicable.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Designing an Efficient Replicated Log Store with Consensus Protocol. Jung-Sang Ahn, et al. 11th USENIX Workshop on Hot Topics in Cloud Computing 2019.&lt;/p&gt;
</content:encoded></item><item><title>Hash Table -- Hash Collision</title><link>https://blog.crazyark.xyz/zh/posts/hash_table_collision/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/hash_table_collision/</guid><pubDate>Thu, 16 Aug 2018 06:55:23 GMT</pubDate><content:encoded>&lt;h1&gt;Hash table -- Collision&lt;/h1&gt;
&lt;p&gt;In computer science, a hash table is a very important data structure that helps us perform fast insertions and lookups. Theoretically, a single lookup in the table should take O(1) time. Here I assume you already understand what hashing is. If you have never encountered this concept before, this &lt;a href=&quot;http://blog.thedigitalcatonline.com/blog/2018/04/06/introduction-to-hashing/&quot;&gt;article&lt;/a&gt; might be a good place to start.&lt;/p&gt;
&lt;p&gt;Suppose you have a perfect hashing function and infinite memory. Then each hash value in the hash table corresponds to exactly one original element. Clearly, to perform a lookup, we just need to compute the hash value and find the corresponding entry in the table. However, in reality there is no infinite memory, and it is not easy to find a &quot;good&quot; perfect hash, such as minimal perfect hashing.&lt;/p&gt;
&lt;p&gt;So in practice, we inevitably encounter hash collisions, meaning two different elements are mapped to the same hash value. When a hash table encounters a hash collision, there are several ways to resolve it, each with its own pros and cons. Let me walk through them one by one.&lt;/p&gt;
&lt;h2&gt;Hash collision&lt;/h2&gt;
&lt;p&gt;There are typically two main approaches to resolving hash collisions: separate chaining and open addressing. Of course, there are more beyond these two, such as multiple-choice hashing, cuckoo hashing, hopscotch hashing, robin hood hashing, and so on. This article will cover several of them.&lt;/p&gt;
&lt;h3&gt;Separate chaining&lt;/h3&gt;
&lt;p&gt;The core idea of separate chaining is to store all elements that hash to the same value in a linked list, as shown in the figure below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/d/d0/Hash_table_5_0_1_1_1_1_1_LL.svg&quot; alt=&quot;Separate Chaining&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each hash table cell (also called a bucket) stores a linked list head. When a hash collision occurs, the colliding element is simply appended to the linked list. For lookups, we first find the corresponding cell using the hash value, then traverse the linked list to find the desired element. The lookup time here is O(N), where N is the length of the linked list. Since the linked list is usually short, we can consider the overall query time to still be O(1). Most standard library hash table implementations use this approach because it is sufficiently general and offers decent performance.&lt;/p&gt;
&lt;p&gt;We can also replace the linked list with a BST, reducing query time to O(lgN), though insertion and space overhead are higher than with a linked list. In Java&apos;s HashMap implementation, when a bucket&apos;s linked list length exceeds 8, it automatically converts to a red-black tree to reduce lookup overhead.&lt;/p&gt;
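&lt;p&gt;The chaining scheme above can be sketched in a few lines. This is a toy for illustration (Python lists stand in for linked lists), not a production table:&lt;/p&gt;

```python
# Minimal separate-chaining table: each bucket holds a chain of (key, value).

class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):   # O(chain length) scan
            if k == key:
                return v
        return None

t = ChainedHashTable(n_buckets=2)        # tiny table to force collisions
for i in range(10):
    t.put("key%d" % i, i)
```

&lt;p&gt;Even with only two buckets every key is still found; the chains simply get longer, which is exactly the O(N) chain scan described above.&lt;/p&gt;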
&lt;p&gt;Separate chaining has two drawbacks that may be intolerable in performance-critical contexts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Linked lists have relatively high space overhead&lt;/li&gt;
&lt;li&gt;Linked lists are not CPU cache-friendly&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At the same time, it has the best generality because separate chaining does not need to know anything about the elements being stored. This is why the vast majority of standard library implementations use this approach.&lt;/p&gt;
&lt;h3&gt;Open Addressing&lt;/h3&gt;
&lt;p&gt;Open addressing stores elements directly in the table, though not necessarily in the cell corresponding to the hash value. The process of finding the cell where an element is stored is called probing. Open addressing requires a special element to mark empty cells. Well-known open addressing methods include linear/quadratic probing and double hashing.&lt;/p&gt;
&lt;h4&gt;Linear/Quadratic probing&lt;/h4&gt;
&lt;p&gt;During insertion/lookup, linear/quadratic probing finds the insertion/storage position through the following process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find the cell at position s(0) corresponding to the initial hash value h(e), check whether it is empty or has the same key. If it matches the requirement, return; otherwise continue with the following process:
&lt;ul&gt;
&lt;li&gt;Typically s(0) = h(e) % n, where n is the table size&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For linear probing, search sequentially with positions given by s(i+1) = (s(i) + 1) % n; for quadratic probing, also search sequentially, but the i-th probe position is s(i) = (s(0) + i^2) % n&lt;/li&gt;
&lt;li&gt;Repeat step 2 until found or confirmed not found&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/b/bf/Hash_table_5_0_1_1_1_1_0_SP.svg&quot; alt=&quot;Linear Probing&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The advantage of linear probing is that it is very CPU cache-friendly since related elements are clustered together. However, the downside is also due to clustering, known as primary clustering: any element added to the hash table must probe across element clusters encountered during the probing process, and ultimately enlarges the cluster. The primary clustering problem is particularly severe at high load factors. Interested readers can check the CMU course notes in reference [2], which contain proofs and experimental results.&lt;/p&gt;
&lt;p&gt;For this reason, linear probing&apos;s performance is highly correlated with the quality of the hash function. When the load factor is small (30%-70%) and the hash function quality is high, linear probing&apos;s actual running speed is quite fast.&lt;/p&gt;
&lt;p&gt;Quadratic probing effectively solves the primary clustering problem, but similarly cannot avoid clustering. This type of clustering is called secondary clustering: when two hash values are identical, their probe sequences are also identical, leading to increasingly long probe sequences, just as before.&lt;/p&gt;
&lt;p&gt;Quadratic probing has an important property: when the table size is a prime p and at least half the positions in the table are empty, quadratic probing is guaranteed to find an empty position, and no position will be probed more than once. The proof of this property can also be found in the CMU notes. Since this guarantee only holds when the load factor is less than 0.5, quadratic probing is not very space-efficient.&lt;/p&gt;
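&lt;p&gt;Steps 1-3 above, with the linear increment, can be sketched as follows. The EMPTY sentinel and table size are illustrative choices; deletion is omitted because open addressing needs tombstone markers for it (clearing a cell outright would break the probe chains of later elements):&lt;/p&gt;

```python
# Linear-probing sketch following steps 1-3 above. EMPTY marks free cells.

EMPTY = object()

class LinearProbingTable:
    def __init__(self, n=16):
        self.keys = [EMPTY] * n
        self.values = [None] * n

    def _probe(self, key):
        n = len(self.keys)
        s = hash(key) % n                # s(0) = h(e) % n
        for _ in range(n):
            yield s
            s = (s + 1) % n              # s(i+1) = (s(i) + 1) % n

    def put(self, key, value):
        for s in self._probe(key):
            if self.keys[s] is EMPTY or self.keys[s] == key:
                self.keys[s] = key       # empty cell or same key: store here
                self.values[s] = value
                return
        raise RuntimeError("table full")

    def get(self, key):
        for s in self._probe(key):
            if self.keys[s] is EMPTY:
                return None              # hit a hole: the key cannot exist
            if self.keys[s] == key:
                return self.values[s]
        return None

t = LinearProbingTable(n=8)
for i in range(5):
    t.put(i, i * i)
```

&lt;p&gt;Swapping the increment in &lt;code&gt;_probe&lt;/code&gt; for the quadratic step turns this into quadratic probing; the rest of the code is unchanged.&lt;/p&gt;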
&lt;h4&gt;Double hashing&lt;/h4&gt;
&lt;p&gt;Like linear/quadratic probing, double hashing probes step by step for empty/occupied positions, but uses two hash functions h_1 and h_2. The probe position sequence is s(i) = (h_1(e) + i * h_2(e)) % n, where h_2(e) must never be 0 (and should be coprime with n, e.g. by making n prime) so that the sequence actually advances and can reach every cell. Double hashing effectively avoids secondary clustering because, even for the same probe starting point, different elements get different probe sequences.&lt;/p&gt;
&lt;p&gt;Compared to linear/quadratic probing, double hashing can store more elements in a smaller table, but the probing computation is slower.&lt;/p&gt;
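&lt;p&gt;The probe sequence itself is a one-liner. Here the second hash is a stand-in built from a salted tuple hash (an assumption for illustration; any independent hash works), and the step is forced into 1..n-1 so it is never zero:&lt;/p&gt;

```python
# Probe sequence for double hashing: s(i) = (h_1(e) + i * h_2(e)) % n.

def double_hash_probes(key, n, h1=hash, h2=lambda k: hash((k, "salt"))):
    start = h1(key) % n
    step = 1 + h2(key) % (n - 1)   # in 1..n-1: never 0, so the probe advances
    return [(start + i * step) % n for i in range(n)]

probes = double_hash_probes("apple", 11)   # prime n: step is coprime with n
```

&lt;p&gt;Because 11 is prime, any step in 1..10 is coprime with it, so the n probes visit all n cells exactly once.&lt;/p&gt;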
&lt;h4&gt;Robin Hood hashing&lt;/h4&gt;
&lt;p&gt;Robin Hood Hashing is a variant of open addressing. What we commonly refer to as Robin Hood hashing means Robin Hood Linear Probing. In this blog post &lt;a href=&quot;https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/&quot;&gt;[14]&lt;/a&gt;, the author even recommends using hash tables based on Robin Hood hashing. So what are its advantages?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To give you an idea how Robin Hood Hashing improves things, the probe length variance for a RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it&apos;s 16.2. At a load factor of 0.99 it&apos;s 1.87 and 194 respectively.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, Robin Hood hashing can significantly reduce the variance of probe lengths. Let&apos;s see how it works:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For each element in the hash table, record the probe length at insertion time&lt;/li&gt;
&lt;li&gt;When inserting a new element, for elements encountered during probing, if their probe length is less than the current inserting element&apos;s probe length, swap the two elements (along with their probe lengths), then continue probing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In effect, everyone&apos;s probe lengths are equalized, so the expected maximum probe length drops significantly. This is also the origin of the name Robin Hood (the legendary English outlaw) hashing: &quot;steal from the rich, give to the poor.&quot; Although most elements&apos; probe lengths move toward the average, so fewer lookups succeed on the very first probe, that extra cost is negligible compared to loading a CPU cache line, and the overall improvement is still significant.&lt;/p&gt;
&lt;p&gt;The paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=73030975E67DE0BE034E8F36429A31C7?doi=10.1.1.130.6339&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;[15]&lt;/a&gt; presents experiments across various load factors and table sizes, showing that at a load factor of 0.9, the expected maximum probe length for Robin Hood hashing is only about 6 (with a table size of 200,000), and that it grows extremely slowly as the table size increases.&lt;/p&gt;
&lt;p&gt;However, even Robin Hood hashing cannot solve the primary clustering problem. Clusters that should cluster will still cluster together. If you probe for a nonexistent element starting at the beginning of a cluster, the probing process can be quite lengthy. But don&apos;t panic -- this problem is easy to solve: we record a global maximum probe length. If the current probe length exceeds the global maximum, the element definitely doesn&apos;t exist, and we can return immediately. According to the experiments above, the global maximum won&apos;t be too long either, so no more worries about failed lookups!&lt;/p&gt;
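&lt;p&gt;The insert-time swap and the global maximum probe length can be sketched together on a toy linear-probing table (names and sizes are illustrative; real implementations also handle deletion, e.g. with backward shifting):&lt;/p&gt;

```python
# Robin Hood linear probing: on insert, swap with any resident whose probe
# length (distance from its home slot) is shorter than ours. A global
# maximum probe length lets failed lookups return early.

EMPTY = object()

class RobinHoodTable:
    def __init__(self, n=16):
        self.slots = [EMPTY] * n    # each occupied slot: (key, value, probe_len)
        self.max_probe = 0          # global maximum probe length seen so far

    def put(self, key, value):
        n = len(self.slots)
        s = hash(key) % n
        entry = (key, value, 0)
        for _ in range(n):
            resident = self.slots[s]
            if resident is EMPTY or resident[0] == entry[0]:
                self.slots[s] = entry
                self.max_probe = max(self.max_probe, entry[2])
                return
            if resident[2] < entry[2]:          # resident is "richer": take its slot
                self.slots[s], entry = entry, resident
                self.max_probe = max(self.max_probe, self.slots[s][2])
            s = (s + 1) % n                     # keep probing with the displaced entry
            entry = (entry[0], entry[1], entry[2] + 1)
        raise RuntimeError("table full")

    def get(self, key):
        n = len(self.slots)
        s = hash(key) % n
        for dist in range(self.max_probe + 1):
            resident = self.slots[(s + dist) % n]
            if resident is EMPTY:
                return None
            if resident[0] == key:
                return resident[1]
        return None          # probed past max_probe: the key is definitely absent

t = RobinHoodTable(n=4)
for k in (0, 4, 8):          # all three keys hash to slot 0
    t.put(k, k * 10)
```

&lt;p&gt;With all three keys sharing a home slot, the probe lengths end up 0, 1, 2, and a lookup for a missing key from the same cluster stops after max_probe + 1 probes instead of scanning the whole cluster.&lt;/p&gt;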
&lt;p&gt;In various tests, Robin Hood hash tables demonstrate excellent space and time performance. Hopefully there will be opportunities to use it in practice.&lt;/p&gt;
&lt;h3&gt;2-choice hashing&lt;/h3&gt;
&lt;p&gt;2-choice hashing also uses two hash functions h_1 and h_2. During insertion, only the two positions h_1(e) and h_2(e) are considered, and the element goes into whichever of the two buckets currently holds fewer elements. 2-choice hashing has a remarkable result: with n elements in n buckets, the expected maximum number of elements in any bucket is Θ(log log n), versus Θ(log n / log log n) when using a single hash function.&lt;/p&gt;
&lt;p&gt;For the proof of the expected value, if you are a student at Nanjing University and have taken Professor Yin Yitong&apos;s &lt;a href=&quot;http://tcs.nju.edu.cn/wiki/index.php/%E9%AB%98%E7%BA%A7%E7%AE%97%E6%B3%95_(Fall_2017)&quot;&gt;Randomized Algorithms&lt;/a&gt; course, you should have heard of The Power of 2-Choice. Similarly, &lt;a href=&quot;http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec10.pdf&quot;&gt;CMU&apos;s lecture notes&lt;/a&gt; also contain the proof.&lt;/p&gt;
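&lt;p&gt;The insertion rule is tiny. The second hash below is a salted stand-in (an assumption for illustration; any independent hash function works):&lt;/p&gt;

```python
# 2-choice hashing sketch: hash with two functions, insert into the
# emptier of the two candidate buckets.

def insert_2choice(buckets, key):
    n = len(buckets)
    i = hash(key) % n
    j = hash((key, "second")) % n     # stand-in for an independent h_2
    target = i if len(buckets[i]) <= len(buckets[j]) else j
    buckets[target].append(key)

buckets = [[] for _ in range(8)]
for k in range(64):                   # 64 elements into 8 buckets, average load 8
    insert_2choice(buckets, k)
```

&lt;p&gt;Always picking the emptier of two candidates is what flattens the load distribution and yields the log log n maximum load.&lt;/p&gt;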
&lt;h3&gt;Cuckoo Hashing&lt;/h3&gt;
&lt;p&gt;Cuckoo Hashing is also a form of open addressing, with O(1) worst-case lookup time. It takes its name from cuckoo chicks, which push the other eggs out of the nest soon after hatching.&lt;/p&gt;
&lt;p&gt;Cuckoo hashing typically uses two arrays and hash functions, so like 2-choice hashing, each element corresponds to two positions. When an element is inserted, if either position is empty, it is placed in the empty position. If both are occupied, one of the occupants is evicted and the new element takes its place. The evicted element is then placed in its alternative position. If that position is also occupied, another eviction occurs, and the process repeats until the element is finally placed in an empty position.&lt;/p&gt;
&lt;p&gt;Obviously, the insertion process can fail when it enters an infinite loop; in practice, implementations cap the number of evictions and treat exceeding the cap as failure. In that case, we can rebuild the table in place using new hash functions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table.&lt;/p&gt;
&lt;p&gt;— Pagh &amp;amp; Rodler, &quot;Cuckoo Hashing&quot;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.4189&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;[17]&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Compared to other methods, cuckoo hashing has higher space utilization, which led researchers to design the Cuckoo Filter &lt;a href=&quot;https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf&quot;&gt;[18]&lt;/a&gt; &lt;a href=&quot;https://brilliant.org/wiki/cuckoo-filter/&quot;&gt;[10]&lt;/a&gt;, which is more space-efficient than Bloom Filters and supports dynamic deletion.&lt;/p&gt;
&lt;p&gt;Further details are omitted here. Readers can consult the materials and references on Wikipedia &lt;a href=&quot;https://en.wikipedia.org/wiki/Cuckoo_hashing&quot;&gt;[5]&lt;/a&gt;.&lt;/p&gt;
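&lt;p&gt;The eviction loop described above can be sketched as follows. The two toy hash functions are stand-ins chosen for determinism, and the eviction cap replaces proper cycle detection:&lt;/p&gt;

```python
# Cuckoo hashing sketch: two tables, each key has exactly one slot in each,
# so lookup checks at most two cells -- O(1) in the worst case.

MAX_EVICTIONS = 32

class CuckooTable:
    def __init__(self, n=8):
        self.n = n
        self.tables = [[None] * n, [None] * n]

    def _slot(self, which, key):
        # Toy hash functions for illustration; a real table uses strong hashes.
        return (key % self.n) if which == 0 else ((key // self.n) % self.n)

    def get(self, key):
        for which in (0, 1):
            e = self.tables[which][self._slot(which, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None

    def put(self, key, value):
        entry, which = (key, value), 0
        for _ in range(MAX_EVICTIONS):
            s = self._slot(which, entry[0])
            evicted = self.tables[which][s]
            self.tables[which][s] = entry       # kick out whoever lives here
            if evicted is None or evicted[0] == entry[0]:
                return True
            entry, which = evicted, 1 - which   # re-home the evicted entry
        return False    # likely a cycle: the caller should rehash with new functions

t = CuckooTable(n=8)
for k, v in [(1, "a"), (9, "b"), (17, "c")]:    # all three collide in table 0, slot 1
    t.put(k, v)
```

&lt;p&gt;Each insertion of a colliding key evicts the previous occupant of table 0, slot 1, which then lands in its alternative slot in table 1; every key stays reachable in at most two probes.&lt;/p&gt;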
&lt;h3&gt;Benchmarks&lt;/h3&gt;
&lt;p&gt;This blog post &lt;a href=&quot;https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html&quot;&gt;[11]&lt;/a&gt; analyzes the time and space performance of 8 mainstream hash table implementations across various scenarios. The analysis shows that Robin Hood hash tables (robin_map) and quadratic probing (dense_map) both deliver excellent performance. Detailed data can be found in the original blog post.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] https://courses.cs.washington.edu/courses/cse373/01sp/Lect13.pdf&lt;/p&gt;
&lt;p&gt;[2] http://www.cs.cmu.edu/afs/cs/academic/class/15210-f13/www/lectures/lecture24.pdf&lt;/p&gt;
&lt;p&gt;[3] https://en.wikipedia.org/wiki/Hash_table&lt;/p&gt;
&lt;p&gt;[4] https://github.com/efficient/libcuckoo&lt;/p&gt;
&lt;p&gt;[5] https://en.wikipedia.org/wiki/Cuckoo_hashing&lt;/p&gt;
&lt;p&gt;[6] https://en.wikipedia.org/wiki/2-choice_hashing&lt;/p&gt;
&lt;p&gt;[7] https://www.threadingbuildingblocks.org/&lt;/p&gt;
&lt;p&gt;[8] https://github.com/sparsehash/sparsehash&lt;/p&gt;
&lt;p&gt;[9] http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec10.pdf&lt;/p&gt;
&lt;p&gt;[10] https://brilliant.org/wiki/cuckoo-filter/&lt;/p&gt;
&lt;p&gt;[11] https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html&lt;/p&gt;
&lt;p&gt;[12] http://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf&lt;/p&gt;
&lt;p&gt;[13] http://tcs.nju.edu.cn/wiki/index.php/%E9%AB%98%E7%BA%A7%E7%AE%97%E6%B3%95_(Fall_2017)&lt;/p&gt;
&lt;p&gt;[14] https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/&lt;/p&gt;
&lt;p&gt;[15] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.6339&lt;/p&gt;
&lt;p&gt;[16] http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/&lt;/p&gt;
&lt;p&gt;[17] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.4189&amp;amp;rep=rep1&amp;amp;type=pdf&lt;/p&gt;
&lt;p&gt;[18] https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf&lt;/p&gt;
</content:encoded></item><item><title>哈希表 -- 哈希冲突</title><link>https://blog.crazyark.xyz/zh/posts/hash_table_collision/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/hash_table_collision/</guid><pubDate>Thu, 16 Aug 2018 06:55:23 GMT</pubDate><content:encoded>&lt;h1&gt;Hash table -- Collision&lt;/h1&gt;
&lt;p&gt;在计算机科学中，哈希表 (hash table) 是一个非常重要的数据结构，它帮助我们快速的进行插入和查询。理论上来说，在表中查询一次的耗时应该是 O(1) 的。这里假设大家对于哈希 (hash) 都已经了解，如果你以前没有接触过这个概念，这篇&lt;a href=&quot;http://blog.thedigitalcatonline.com/blog/2018/04/06/introduction-to-hashing/&quot;&gt;文章&lt;/a&gt;或许是一个很好的开始。&lt;/p&gt;
&lt;p&gt;假设你有一个完美的哈希函数 (perfect hashing) 和无穷大的内存，那么哈希表中每一个哈希值都对应了一个原始的元素。显然此时要进行查找，我们只需要算出哈希值，然后找到表中对应的项就可以了。然而现实中不存在无穷大的内存，也不是很容易去找到一个“优秀”的完美哈希，比如最小完美哈希 (minimal perfect hashing)。&lt;/p&gt;
&lt;p&gt;所以实际中，我们不可避免的会碰到哈希冲突 (hash collision) ，也就是两个不同的元素被映射到同一个哈希值上。当一个哈希表碰到哈希冲突的时候，有几种办法去解决它，它们各有优劣，且容我一一道来。&lt;/p&gt;
&lt;h2&gt;Hash collision&lt;/h2&gt;
&lt;p&gt;哈希冲突的解决办法通常是两种，一种叫拉链法 (separate chaining) ，一种叫开地址法 (open addressing)。当然不仅限于这两种，其他的比如 multiple-choice hashing, cuckoo hashing, hopscotch hashing, robin hood hashing 等等，本文挑几种介绍一下。&lt;/p&gt;
&lt;h3&gt;Separate chaining&lt;/h3&gt;
&lt;p&gt;拉链法核心思想就是将哈希冲突的元素存入都存入到链表中，如下图所示，&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/d/d0/Hash_table_5_0_1_1_1_1_1_LL.svg&quot; alt=&quot;Separate Chaining&quot; /&gt;&lt;/p&gt;
&lt;p&gt;在每个哈希表的单元格中 (hash table cell, 或者我们也称为桶 bucket)，都存放了一个链表头。当哈希冲突发生时，只需要把冲突的元素放到链表中。当需要查找时，先通过哈希值找到对应的单元格，再遍历链表找到想要的元素，此时查找时间是 O(N) 的。这里 N 是链表长度，通常链表不会很长，所以可以认为整体查询时间还是 O(1) 的。大多数标准库中的哈希表时间都是采用了这种方式，因为足够通用，性能也还不错。&lt;/p&gt;
&lt;p&gt;当然我们也可以用一棵 BST 来代替链表，此时查询时间是 O(lgN) 的，但是插入和空间开销要比链表大。 Java 的 HashMap 的实现中，当一个哈希单元格的链表长度超过 8 时，就会自动转成一颗红黑树来降低查询开销。&lt;/p&gt;
&lt;p&gt;拉链法有两个缺点，这在一些对性能要求比较高的地方可能不太能忍受:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;链表的空间开销比较大&lt;/li&gt;
&lt;li&gt;链表对 CPU cache 不友好&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;但同时它的通用性最好，因为采用拉链法不需要对存入的元素有任何了解，所以绝大部分标准库的实现都用的这种方式。&lt;/p&gt;
&lt;h3&gt;Open Addressing&lt;/h3&gt;
&lt;p&gt;开地址法是将元素直接存入到表中，但不一定是哈希值对应的那个单元格中，而找到元素对应存储的单元格的过程通常被称为探查 (probe)。开地址法需要有一个特殊的元素，来标识空的单元格。比较著名的开地址法有线性/二次探查 (linear/quadratic probing) 和双重哈希 (double hashing)。&lt;/p&gt;
&lt;h4&gt;Linear/Quadratic probing&lt;/h4&gt;
&lt;p&gt;在插入/查找的过程中，线性/二次探查法会通过以下过程来找到插入/存放的位置:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;找到初始的哈希值 h(e) 对应位置 s(0) 的单元格，查看是否是空/相同的键，如果符合要求则返回，否则继续下面的过程
&lt;ul&gt;
&lt;li&gt;通常 s(0) = h(e) % n，n 是表大小&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;如果是线性探查，则依次往后搜索，位置由 s(i +1) = (s(i) + 1) % n 推导; 如果是二次探查，也是依次往后搜索，不过位置推导为 s(i + 1) = (s(i) + i^2) % n&lt;/li&gt;
&lt;li&gt;循环第二步直到找到或者确认找不到为止&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/b/bf/Hash_table_5_0_1_1_1_1_0_SP.svg&quot; alt=&quot;Linear Probing&quot; /&gt;&lt;/p&gt;
&lt;p&gt;线性探查的优势是对于 CPU cache 非常友好，相关的元素都聚集在一起；但是缺点同样也是因为聚集，这个问题称为 primary clustering: 任何添加到哈希表的元素，必须探查并跨过探查过程中的元素聚簇 (cluster)，并且最终会增大聚簇。 Primary clustering 的问题在负载系数 (load factor) 比较大时候非常严重，感兴趣的同学可以去看下参考文献 [2] 中 CMU 的课程讲义，有一些证明和实验结果。&lt;/p&gt;
&lt;p&gt;由于这个原因，线性探查的性能和哈希函数的质量有很大的关系。当负载系数较小 (30% -70%) 并且哈希函数质量比较高时，线性探查的实际运行速度是相当快的。&lt;/p&gt;
&lt;p&gt;二次探查则很好的解决了 primary clustering 的问题，但同样无法避免聚集，这个聚集叫做 secondary clustering: 当两个哈希值相同时，它们的探查序列也是相同的，这就和刚才的一样，探查序列会越来越长。&lt;/p&gt;
&lt;p&gt;二次探查有一个重要的性质：当表大小是素数 p，并且表中还有至少一半的位置是空的，那么二次探查一定能找到空的位置，并且任何位置只会被探查一次。关于这个性质的证明同样可以参考 CMU 的教材。由于只能在负载系数小于 0.5 时才有这个保证，二次探查法的空间效率不怎么高。&lt;/p&gt;
&lt;h4&gt;Double hashing&lt;/h4&gt;
&lt;p&gt;双重哈希与线性/二次探查一样，都是一步步探查空/存放的位置，只不过使用两个哈希函数 h_1 和 h_2，探查的位置序列 s(i) = (h_1(e) +i * h_2(e)) % n。双重哈希有效地避免了 secondary clustering，因为对于每个探查起点，不同的元素对应的探查序列也是不同的。&lt;/p&gt;
&lt;p&gt;双重哈希相比于线性/二次探查，可以用更小的表空间存放更多的元素，但是探查的计算过程会较慢。&lt;/p&gt;
&lt;h4&gt;Robin Hood hashing&lt;/h4&gt;
&lt;p&gt;罗宾汉哈希 (Robin Hood Hashing) 是一个开地址法的变种，我们通常所说的罗宾汉哈希是指 Robin Hood Linear Probing。在这篇博文 &lt;a href=&quot;https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/&quot;&gt;[14]&lt;/a&gt; 中，作者甚至推荐大家使用基于罗宾汉哈希实现的哈希表。那它到底有什么优势？&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To give you an idea how Robin Hood Hashing improves things, the probe length variance for a RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it’s 16.2. At a load factor of 0.99 it’s 1.87 and 194 respectively.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;换句话说，罗宾汉哈希可以显著降低探查长度的方差。我们来看一下它是怎么做的：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;对每个哈希表中的元素，记录插入时的探查长度&lt;/li&gt;
&lt;li&gt;当插入一个新元素时，对于探查过程中的元素，如果它的探查长度小于当前插入元素的探查长度，那么交换这两个元素 (以及探测长度)，然后继续探查&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;也就是说，事实上大家的探查长度更加平均了，所以期望最长探查长度也会显著的下降。这也是罗宾汉 (英国传说中的侠盗) 哈希名字的来源，劫富济贫。虽然大部分元素的探查长度都更趋近于平均值，不是一次就能查到，但是由于这部分开销较 CPU 加载 cache line 开销可以忽略不计，所以整体上仍有显著的提高。&lt;/p&gt;
&lt;p&gt;论文 &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=73030975E67DE0BE034E8F36429A31C7?doi=10.1.1.130.6339&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;[16]&lt;/a&gt; 中针对各种负载系数和表规模的实验表明，罗宾汉哈希在负载系数为 0.9 时，最长探查长度的期望值也就是 6 左右 (表规模 20 万)，而且随着表规模的增大，增长速度极为缓慢。&lt;/p&gt;
&lt;p&gt;但是，即使是罗宾汉哈希也不能解决 Primary clustering 的问题，该聚簇的还是会聚簇在一起，如果在聚簇块开始探查一个不存在的元素，那么探查过程会相当漫长。不必惊慌，这个问题很容易解决：我们记录一个全局的最长探查长度，如果当前探查长度大于全局最长探查长度了，那肯定是找不到了，我们就可以直接返回。而根据上面的实验，全局最长也不会太长，麻麻再也不怕我查询失败了！&lt;/p&gt;
&lt;p&gt;In a wide range of benchmarks, Robin Hood hash tables perform very well in both space and time; I hope to get a chance to use one in production someday.&lt;/p&gt;
&lt;h3&gt;2-choice hashing&lt;/h3&gt;
&lt;p&gt;2-choice hashing likewise uses two hash functions h_1 and h_2. On each insert, only the two positions h_1(e) and h_2(e) are considered, and the element goes into whichever of the two buckets holds fewer elements. 2-choice hashing has a remarkable property: the expected number of elements per bucket is $\Theta(\log\log n)$.&lt;/p&gt;
&lt;p&gt;As for the proof of this bound: if you are also a Nanjing University student who survived Yitong Yin&apos;s &lt;a href=&quot;http://tcs.nju.edu.cn/wiki/index.php/%E9%AB%98%E7%BA%A7%E7%AE%97%E6%B3%95_(Fall_2017)&quot;&gt;Randomized Algorithms&lt;/a&gt; course, you will have heard of &quot;the power of two choices&quot;; a proof can also be found in the &lt;a href=&quot;http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec10.pdf&quot;&gt;CMU lecture notes&lt;/a&gt;.&lt;/p&gt;
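The "pick the less-loaded of two buckets" rule is short enough to sketch directly; the hash functions below are illustrative stand-ins:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 2-choice hashing sketch: every key hashes to two candidate buckets and is
// appended to whichever currently holds fewer elements.
struct TwoChoiceTable {
    std::vector<std::vector<int64_t>> buckets;
    explicit TwoChoiceTable(size_t n) : buckets(n) {}

    size_t h1(int64_t k) const { return (size_t)k % buckets.size(); }
    size_t h2(int64_t k) const { return (size_t)(k / 7 + 3) % buckets.size(); }

    void insert(int64_t k) {
        size_t a = h1(k), b = h2(k);
        // choose the bucket with the smaller load
        buckets[buckets[a].size() <= buckets[b].size() ? a : b].push_back(k);
    }

    bool contains(int64_t k) const {
        // a lookup only ever inspects the two candidate buckets
        for (size_t pos : {h1(k), h2(k)})
            for (int64_t e : buckets[pos])
                if (e == k) return true;
        return false;
    }
};
```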
&lt;h3&gt;Cuckoo Hashing&lt;/h3&gt;
&lt;p&gt;Cuckoo hashing is another open-addressing scheme, and its worst-case lookup time is O(1). It takes its name from cuckoos, whose newly hatched chicks kick the other, unhatched eggs out of the nest.&lt;/p&gt;
&lt;p&gt;Cuckoo hashing typically uses two arrays and two hash functions, so as in 2-choice hashing, every element has two candidate positions. On insert, if either position is free, the element goes into the empty one; if both are occupied, it kicks out one occupant and takes its place, and the evicted element is moved to its other candidate position; if that position is occupied as well, its occupant is kicked out in turn; the process repeats until some element lands in an empty slot.&lt;/p&gt;
&lt;p&gt;Clearly, this insertion procedure can fail when it runs into an endless loop of evictions. In that case we can rebuild the table in place with new hash functions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table.&lt;/p&gt;
&lt;p&gt;— Pagh &amp;amp; Rodler, &quot;Cuckoo Hashing&quot;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.4189&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;[17]&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
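The eviction loop described above can be sketched like this; the two hash functions and the kick limit are illustrative choices, and a real table would rehash with new hash functions when `insert` reports failure:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Cuckoo hashing sketch with two tables. Insert kicks occupants back and
// forth until a free slot turns up; after too many displacements it reports
// failure, which signals that a rehash is needed.
struct CuckooTable {
    std::vector<std::optional<int64_t>> t1, t2;
    explicit CuckooTable(size_t n) : t1(n), t2(n) {}

    size_t h1(int64_t k) const { return (size_t)k % t1.size(); }
    size_t h2(int64_t k) const { return (size_t)(k / (int64_t)t1.size()) % t2.size(); }

    // Worst-case O(1): only two slots can ever hold a key.
    bool contains(int64_t k) const {
        return t1[h1(k)] == k || t2[h2(k)] == k;
    }

    // Returns false when an eviction cycle is suspected (rehash needed).
    bool insert(int64_t k) {
        if (contains(k)) return true;
        bool use_first = true;
        for (size_t kicks = 0; kicks <= 2 * t1.size(); ++kicks) {
            auto& slot = use_first ? t1[h1(k)] : t2[h2(k)];
            if (!slot) { slot = k; return true; }
            std::swap(k, *slot);     // evict the occupant and re-place it
            use_first = !use_first;  // the evicted key tries its other table
        }
        return false;                // probable cycle: caller should rehash
    }
};
```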
&lt;p&gt;Cuckoo hashing achieves higher space utilization than the other schemes, so researchers have used it to design the Cuckoo Filter &lt;a href=&quot;https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf&quot;&gt;[18]&lt;/a&gt; &lt;a href=&quot;https://brilliant.org/wiki/cuckoo-filter/&quot;&gt;[10]&lt;/a&gt;, which is more space-efficient than a Bloom filter and supports dynamic deletion.&lt;/p&gt;
&lt;p&gt;We won&apos;t go into the remaining details here; see the Wikipedia article &lt;a href=&quot;https://en.wikipedia.org/wiki/Cuckoo_hashing&quot;&gt;[5]&lt;/a&gt; and the references therein.&lt;/p&gt;
&lt;h3&gt;Benchmarks&lt;/h3&gt;
&lt;p&gt;This blog post &lt;a href=&quot;https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html&quot;&gt;[11]&lt;/a&gt; benchmarks the time and space performance of eight mainstream hash table implementations across a variety of scenarios. It shows that the Robin Hood table (robin_map) and the quadratic-probing table (dense_map) both perform excellently; see the original post for detailed numbers.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] https://courses.cs.washington.edu/courses/cse373/01sp/Lect13.pdf&lt;/p&gt;
&lt;p&gt;[2] http://www.cs.cmu.edu/afs/cs/academic/class/15210-f13/www/lectures/lecture24.pdf&lt;/p&gt;
&lt;p&gt;[3] https://en.wikipedia.org/wiki/Hash_table&lt;/p&gt;
&lt;p&gt;[4] https://github.com/efficient/libcuckoo&lt;/p&gt;
&lt;p&gt;[5] https://en.wikipedia.org/wiki/Cuckoo_hashing&lt;/p&gt;
&lt;p&gt;[6] https://en.wikipedia.org/wiki/2-choice_hashing&lt;/p&gt;
&lt;p&gt;[7] https://www.threadingbuildingblocks.org/&lt;/p&gt;
&lt;p&gt;[8] https://github.com/sparsehash/sparsehash&lt;/p&gt;
&lt;p&gt;[9] http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec10.pdf&lt;/p&gt;
&lt;p&gt;[10] https://brilliant.org/wiki/cuckoo-filter/&lt;/p&gt;
&lt;p&gt;[11] https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html&lt;/p&gt;
&lt;p&gt;[12] http://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf&lt;/p&gt;
&lt;p&gt;[13] http://tcs.nju.edu.cn/wiki/index.php/%E9%AB%98%E7%BA%A7%E7%AE%97%E6%B3%95_(Fall_2017)&lt;/p&gt;
&lt;p&gt;[14] https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/&lt;/p&gt;
&lt;p&gt;[15] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.6339&lt;/p&gt;
&lt;p&gt;[16] http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/&lt;/p&gt;
&lt;p&gt;[17] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.4189&amp;amp;rep=rep1&amp;amp;type=pdf&lt;/p&gt;
&lt;p&gt;[18] https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf&lt;/p&gt;
</content:encoded></item><item><title>Reflections on the 4th Tianchi Middleware Performance Challenge</title><link>https://blog.crazyark.xyz/zh/posts/awrace2018/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/awrace2018/</guid><pubDate>Wed, 01 Aug 2018 11:59:45 GMT</pubDate><content:encoded>&lt;p&gt;It has been nearly four months since I signed up for the competition, and it has finally come to an end. I&apos;m very happy to have won third place. Over these three months I learned a lot of practical engineering knowledge, and got reacquainted with C++ along the way. The preliminary round code was too ugly to share, but the final round code is hosted on GitHub:&lt;/p&gt;
&lt;p&gt;https://github.com/arkbriar/awrace2018_messagestore&lt;/p&gt;
&lt;p&gt;Below are the thought processes and final solutions for both the preliminary and final rounds.&lt;/p&gt;
&lt;h2&gt;Preliminary Round&lt;/h2&gt;
&lt;h3&gt;Problem Analysis and Understanding&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Implement a high-performance Service Mesh Agent component with the following features: 1. Service registration and discovery, 2. Protocol conversion, 3. Load balancing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The problem required us to achieve the highest possible &lt;strong&gt;performance&lt;/strong&gt;. We first restated the scenario and general approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Consumer will receive over 500 connections: use IO multiplexing&lt;/li&gt;
&lt;li&gt;HTTP runs over TCP connections: disable Nagle&apos;s algorithm&lt;/li&gt;
&lt;li&gt;The Dubbo provider has only 200 processing threads, and exceeding 200 concurrent requests will cause fast failures: load balancing should avoid provider overload as much as possible&lt;/li&gt;
&lt;li&gt;Online network performance (pps) is poor: batch send requests/responses, use UDP for inter-Agent communication&lt;/li&gt;
&lt;li&gt;The Consumer has poor performance: offload protocol conversion and similar tasks to the Provider Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Core Approach&lt;/h3&gt;
&lt;p&gt;To reduce system overhead, we maintain only a single UDP channel between agents, and only a single TCP channel between the PA (Provider Agent) and the Provider.&lt;/p&gt;
&lt;p&gt;The overall architecture is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321002017148.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;On the CA (Consumer Agent) side, we use a single epoll event loop. The general flow of the loop is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321002990709.jpg&quot; alt=&quot;&quot; /&gt;
1. First, accept all new connections. When a connection is accepted, assign a provider to handle all subsequent requests on that connection based on provider weights.&lt;/p&gt;
&lt;p&gt;2. Read all readable sockets.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a complete request is read, &lt;strong&gt;enqueue it into the request queue&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If a complete response is read, write it back directly via blocking write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;3. After all events are processed, batch send all requests in the queue (using writev).&lt;/p&gt;
&lt;p&gt;Throughout the request sending and response receiving process, we always attach an ID to the request. This ID is unique on the consumer side, ensuring that we only need a single map on the consumer side to determine which connection a response belongs to. Here, our ID is the handler&apos;s index.&lt;/p&gt;
&lt;h3&gt;Key Code&lt;/h3&gt;
&lt;p&gt;The batch request sending code looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;epoll_event events[LOWER_LOAD_EVENT_SIZE];
while (_event_group.wait_and_handle(events, LOWER_LOAD_EVENT_SIZE, -1)) {
    // check pending write queue, and block write all
    for (auto&amp;amp; entry : _pending_data) {
        // write all requests to provider agent
        // ...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is at https://code.aliyun.com/arkbriar/dubbo-agent-cxx.&lt;/p&gt;
&lt;h2&gt;Final Round&lt;/h2&gt;
&lt;h3&gt;Problem Analysis and Understanding&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Implement the Queue Store interface put/get, supporting millions of queues and hundreds of gigabytes of data on a single machine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After some initial analysis, we gathered the following information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The number of messages is approximately 2 billion, with message sizes around 64 bytes, though some may be up to 1024 bytes&lt;/li&gt;
&lt;li&gt;The online machines have 4 cores, 8 GB RAM with SSD, approximately 10K IOPS, and sequential 4K read/write throughput around 200 MB/s&lt;/li&gt;
&lt;li&gt;For the same queue, put operations are serialized&lt;/li&gt;
&lt;li&gt;For the same queue, get operations may be concurrent&lt;/li&gt;
&lt;li&gt;There is no deletion of queues or messages, and put and get do not occur simultaneously&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The benchmark program&apos;s no-op put TPS is only around 6 million, so the harness itself bounds achievable throughput. Without compression, writing 100 GB of data takes at least 500 seconds. Therefore, the challenge requires contestants to ensure the put phase is as non-blocking as possible, and to design a data structure that allows the get phase to take advantage of caching and locality.&lt;/p&gt;
&lt;h3&gt;Core Approach&lt;/h3&gt;
&lt;p&gt;To implement high-performance interfaces, we needed to solve three problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Storage design: support sequential writes and random reads&lt;/li&gt;
&lt;li&gt;IO/SSD optimization: push IO throughput to its limits&lt;/li&gt;
&lt;li&gt;Memory planning: compact planning and usage due to limited evaluation environment memory&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Storage Design: Page-Based Storage&lt;/h4&gt;
&lt;p&gt;We first studied a similar product, Kafka:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321022399044.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321022432192.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Kafka&apos;s general storage model is shown above. We won&apos;t elaborate here; interested readers can learn more at https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026&lt;/p&gt;
&lt;p&gt;However, directly copying Kafka&apos;s approach poses several problems for our use case:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;With 1 million queues, there would be at least 2 million files, placing enormous burden on the filesystem&lt;/li&gt;
&lt;li&gt;Kafka uses full indexing, but a full index for 2 billion messages won&apos;t fit in memory. At minimum, a two-level index is needed, which adds read overhead&lt;/li&gt;
&lt;li&gt;The disk overhead of a full index for 2 billion messages is too large&lt;/li&gt;
&lt;li&gt;The primary queue use case is sequential reads, where full indexing provides no advantage (it&apos;s designed for random reads)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we adopted a common file partitioning pattern: paging, with pages as the smallest unit of operation and indexing. This is our storage model: page-based storage.&lt;/p&gt;
&lt;p&gt;This storage approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Effectively reduces the number of files&lt;/li&gt;
&lt;li&gt;Since writes are sequential, we don&apos;t need to worry about dirty pages&lt;/li&gt;
&lt;li&gt;Since pages are the smallest unit, consecutive messages belonging to the same queue stay in the same page as much as possible, ensuring sequential reads can leverage caching and locality&lt;/li&gt;
&lt;li&gt;Finally, we index at the page level, and at this point the index fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The page size is set to 4K to balance memory usage.&lt;/p&gt;
&lt;h4&gt;IO/SSD Optimization: Block Reads and Writes&lt;/h4&gt;
&lt;p&gt;For most existing storage devices, whether HDD or SSD, sequential read/write is always optimal.&lt;/p&gt;
&lt;p&gt;However, SSDs differ in one key aspect: due to their &quot;built-in concurrency&quot; characteristic, reads and writes aligned to Clustered Blocks perform just as well as sequential reads and writes.&lt;/p&gt;
&lt;p&gt;To understand built-in concurrency, we need to know some SSD internals. The following website provides more details; here we only cover part of it.&lt;/p&gt;
&lt;p&gt;http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321031573324.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;SSD reads and writes operate at the page (NAND-flash page) level. Even reading or writing a single bit involves reading or writing an entire page. Adjacent pages form a block, typically sized between 256 KB and 4 MB. There are additional SSD characteristics that cause write amplification and GC issues, which we won&apos;t cover here; interested readers can look into these independently.&lt;/p&gt;
&lt;p&gt;SSD built-in concurrency means that accesses to different channels/packages/chips/planes in the diagram can proceed concurrently. Blocks across different planes form an access unit called a Clustered Block, as shown by the yellow box in the diagram. Reading and writing Clustered Blocks fully leverages concurrent access. The read/write pattern we&apos;ll describe next is: random reads and writes aligned to Clustered Blocks perform just as well as sequential reads and writes. As a side note, Clustered Block sizes are typically 32 or 64 MB.&lt;/p&gt;
&lt;p&gt;Here is a throughput comparison between random writes and sequential writes. At 32M/64M, random write throughput matches sequential write throughput across the board:
&lt;img src=&quot;/img/15321036720838.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We also ran a small test, measuring read/write performance on a 4 GB file (all times in seconds):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321037598597.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here, &quot;concurrent pseudo sequential write&quot; means multiple threads obtain the next write offset from a shared atomic integer. The experimental results confirmed the effectiveness of this pattern.&lt;/p&gt;
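The "concurrent pseudo sequential write" pattern can be sketched as follows; an in-memory byte vector stands in for the file, real code would pwrite() each claimed block, and the 64 KB block size is only to keep the demo small:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Each writer claims the next block-aligned region from one shared atomic
// offset, so the file grows "pseudo sequentially" even though many threads
// write concurrently. The post used 64 MB clustered blocks; 64 KB here.
constexpr size_t kBlock = 1 << 16;

size_t claim_block(std::atomic<size_t>& next_offset) {
    // fetch_add hands every caller a unique, non-overlapping region
    return next_offset.fetch_add(kBlock);
}

// Stand-in for pwrite(fd, buf, kBlock, off): fill the claimed region.
void write_block(std::vector<uint8_t>& file, size_t off, uint8_t tag) {
    std::memset(file.data() + off, tag, kBlock);
}
```

Because each thread owns a disjoint region, no locking is needed on the data path; only the offset counter is shared.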
&lt;p&gt;So our final read/write pattern is block reads and writes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large block (64 MB) writes, which don&apos;t even need to be sequential&lt;/li&gt;
&lt;li&gt;Effectively improved concurrent read/write performance&lt;/li&gt;
&lt;li&gt;Sequential reads fully leverage caching and prefetching&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Memory Planning: Every Byte Counts&lt;/h4&gt;
&lt;p&gt;Building on the above two designs, we planned memory precisely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each queue has its own 4K page cache, corresponding to the page currently being written&lt;/li&gt;
&lt;li&gt;Each put thread has its own double 64 MB write buffer, where the buffer always corresponds to a range within a file&lt;/li&gt;
&lt;li&gt;Total cache memory: 4K * 1M queues + 64M * 2 per put thread ≈ 5.2 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Overall Architecture&lt;/h4&gt;
&lt;p&gt;The relationship between queues and files is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321040869218.jpg&quot; alt=&quot;&quot; /&gt;
The page and message structure is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321040994886.jpg&quot; alt=&quot;&quot; /&gt;Size is encoded using variable-length encoding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Size &amp;lt; 128: stored in one byte, with the first bit being 0&lt;/li&gt;
&lt;li&gt;Size &amp;gt;= 128 and &amp;lt; 32768: stored in two bytes; since the leading bit of such a size is still 0, we store size | 0x8000 so that the first bit of the first byte is 1, marking the two-byte form&lt;/li&gt;
&lt;/ul&gt;
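A sketch of this variable-length encoding; the big-endian byte order is an assumption on my part, and the competition code may lay the bytes out differently:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode a message size per the scheme above: sizes < 128 take one byte
// with a 0 marker bit; sizes in [128, 32768) take two bytes as size | 0x8000
// (stored big-endian so the 1 marker bit lands in the first byte read).
std::vector<uint8_t> encode_size(uint16_t size) {
    if (size < 128) return { (uint8_t)size };           // 0xxxxxxx
    uint16_t v = size | 0x8000;                         // 1xxxxxxx xxxxxxxx
    return { (uint8_t)(v >> 8), (uint8_t)(v & 0xff) };
}

// Decode a size from buf[pos], advancing pos past the encoded bytes.
uint16_t decode_size(const std::vector<uint8_t>& buf, size_t& pos) {
    uint8_t b = buf[pos++];
    if ((b & 0x80) == 0) return b;                      // one-byte form
    return (uint16_t)(((b & 0x7f) << 8) | buf[pos++]);  // two-byte form
}
```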
&lt;p&gt;The index for each page looks roughly like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321041984763.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each page stores four fields: page number, file number, number of messages in the page, and total number of messages in all preceding pages. This way, during get operations, a binary search locates the page containing the target message. The total index size is approximately 12 bytes * 25M = 300 MB, which fits entirely in memory.&lt;/p&gt;
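The binary search over per-page cumulative message counts can be sketched like this; the exact field widths and layout are assumptions for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-page index entry: page number, file number, messages in the page,
// and messages in all preceding pages of the same queue.
struct PageIndex {
    uint32_t page_no;
    uint16_t file_no;
    uint16_t msg_count;
    uint32_t msgs_before;
};

// Binary-search for the page holding the queue-local message `offset`,
// as the get path does; returns pages.size() when out of range.
size_t find_page(const std::vector<PageIndex>& pages, uint32_t offset) {
    // first page whose msgs_before exceeds offset, then step back one
    auto it = std::upper_bound(pages.begin(), pages.end(), offset,
        [](uint32_t off, const PageIndex& p) { return off < p.msgs_before; });
    if (it == pages.begin()) return pages.size();
    --it;
    if (offset >= it->msgs_before + it->msg_count) return pages.size();
    return (size_t)(it - pages.begin());
}
```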
&lt;h4&gt;Put/Get Flow&lt;/h4&gt;
&lt;p&gt;These two flows are relatively straightforward. First, the put flow:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321043391182.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check if the in-memory page has space
&lt;ul&gt;
&lt;li&gt;Yes: write directly&lt;/li&gt;
&lt;li&gt;No: flush the current page, update the index, then write the message&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a page is flushed, it is submitted to the current put thread&apos;s write buffer, and the index is updated accordingly. When the write buffer is full, it swaps with the backup buffer and continues writing while a new thread flushes to disk, minimizing blocking of the put thread.&lt;/p&gt;
&lt;p&gt;Before any get operations begin, we force all buffers to flush to disk. The get flow is also relatively straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Binary search for the first and last pages&lt;/li&gt;
&lt;li&gt;Sequentially &lt;strong&gt;blocking read&lt;/strong&gt; the corresponding file pages&lt;/li&gt;
&lt;li&gt;Skip irrelevant messages within the page, collect all relevant messages&lt;/li&gt;
&lt;li&gt;If the number of remaining messages in the last requested page is below a threshold (e.g., 15), use readahead to prefetch the next page&lt;/li&gt;
&lt;li&gt;Return the requested messages&lt;/li&gt;
&lt;/ul&gt;
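The prefetch step above can be sketched as follows; posix_fadvise(POSIX_FADV_WILLNEED) is used here as a portable stand-in for the Linux readahead() call mentioned in the text:

```cpp
#include <cassert>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Ask the kernel to start reading the next page range in the background,
// so a subsequent sequential read finds it already in the page cache.
bool prefetch(int fd, off_t page_offset, size_t page_size) {
    return posix_fadvise(fd, page_offset, (off_t)page_size,
                         POSIX_FADV_WILLNEED) == 0;
}
```

The call is purely advisory: it never blocks the reader, which is why issuing it a page early (when few messages remain in the current page) hides the IO latency of the next page.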
&lt;p&gt;During the get phase, we maintained two principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fully leverage the OS page cache&lt;/li&gt;
&lt;li&gt;Prefetch as much as possible, hoping to reduce sequential read time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Fully leveraging the OS page cache was motivated by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Implementing a concurrent LRU cache is cumbersome&lt;/li&gt;
&lt;li&gt;If we implement a sequential read buffer per queue (similar to a linked list structure), it would be constantly invalidated when multiple threads read sequentially at the same time; even with thread-local buffers, if a single thread performs sequential reads on the same queue at different offsets (e.g., two full reads), the buffer would be constantly invalidated&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we chose to use the page cache directly.&lt;/p&gt;
&lt;h3&gt;Key Code&lt;/h3&gt;
&lt;p&gt;For queue mapping, we used TBB&apos;s concurrent hash map, which turned out to be the biggest CPU bottleneck. Our put operations were entirely CPU-bound.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;template &amp;lt;class K, class V, class C = tbb::tbb_hash_compare&amp;lt;K&amp;gt;,
          class A = tbb::cache_aligned_allocator&amp;lt;std::pair&amp;lt;const K, V&amp;gt;&amp;gt;&amp;gt;
class ConcurrentHashMapProxy : public tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt; {
public:
    ConcurrentHashMapProxy(size_t n) : tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt;(n) {}
    // default destructor: the base-class destructor already runs automatically
    ~ConcurrentHashMapProxy() = default;
    typename tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt;::const_pointer fast_find(const K&amp;amp; key) const {
        return this-&amp;gt;internal_fast_find(key);
    }
};

MessageQueue* QueueStore::find_or_create_queue(const String&amp;amp; queue_name) {
    auto q_ptr = queues_.fast_find(queue_name);
    if (q_ptr) return q_ptr-&amp;gt;second;

    decltype(queues_)::const_accessor ac;
    queues_.find(ac, queue_name);
    if (ac.empty()) {
        uint32_t queue_id = next_queue_id_.next();
        auto queue_ptr = new MessageQueue(queue_id, this);
        queues_.insert(ac, std::make_pair(queue_name, queue_ptr));
        DLOG(&quot;Created a new queue, id: %u, name: %s&quot;, queue_id,
             queue_name.c_str());
    }
    return ac-&amp;gt;second;
}

MessageQueue* QueueStore::find_queue(const String&amp;amp; queue_name) const {
    auto q_ptr = queues_.fast_find(queue_name);
    return q_ptr ? q_ptr-&amp;gt;second : nullptr;
}

void QueueStore::put(const String&amp;amp; queue_name, const MemBlock&amp;amp; message) {
    if (!message.ptr) return;
    auto q_ptr = find_or_create_queue(queue_name);
    q_ptr-&amp;gt;put(message);
}

Vector&amp;lt;MemBlock&amp;gt; QueueStore::get(const String&amp;amp; queue_name, long offset, long size) {
    if (!flushed) flush_all_before_read();

    auto q_ptr = find_queue(queue_name);
    if (q_ptr) return q_ptr-&amp;gt;get(offset, size);
    // return empty list when queue is not found
    return Vector&amp;lt;MemBlock&amp;gt;();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next is our main data structure, FilePage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// File page header
struct __attribute__((__packed__)) FilePageHeader {
    uint64_t offset = NEGATIVE_OFFSET;
};

#define FILE_PAGE_HEADER_SIZE sizeof(FilePageHeader)
#define FILE_PAGE_AVAILABLE_SIZE (FILE_PAGE_SIZE - FILE_PAGE_HEADER_SIZE)

// File page
struct __attribute__((__packed__)) FilePage {
    FilePageHeader header;
    char content[FILE_PAGE_AVAILABLE_SIZE];
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The double write buffer requires synchronization during swaps, which we handle with a mutex:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// buffer is full, switch to back
{
    // spin wait until back buf is not in scheduled status
    while (back_buf_status-&amp;gt;load() == BACK_BUF_FLUSH_SCHEDULED)
        ;
    // try swap active and back buffer
    std::unique_lock&amp;lt;std::mutex&amp;gt; lock(*back_buf_mutex);
    std::swap(active_buf, back_buf);
    assert(back_buf_status-&amp;gt;load() == BACK_BUF_FREE);
    uint32_t exp = BACK_BUF_FREE;
    back_buf_status-&amp;gt;compare_exchange_strong(exp, BACK_BUF_FLUSH_SCHEDULED);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To let the OS make full use of available memory, the get phase needs to release all unused memory. However, tcmalloc does not proactively release memory, so we force it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// release free memory back to os
MallocExtension::instance()-&amp;gt;ReleaseFreeMemory();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the online writes were extremely uniform, when it was time to flush, all page caches would flush simultaneously, causing all put threads to block and wait. To mitigate this, we made some queues flush their first page early:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// first page will hold at most (queue_id / DATA_FILE_SPLITS) % 64 + 1 messages, which makes
// writes more even and spreads first flushes across 64 time points. I call it flush fast.
bool flush_fast =
    paged_message_indices_.size() == 1 &amp;amp;&amp;amp;
    paged_message_indices_.back().msg_size &amp;gt;= ((queue_id_ / DATA_FILE_SPLITS) &amp;amp; 0x3f) + 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is at https://code.aliyun.com/arkbriar/queue-race-2018-cpp.&lt;/p&gt;
&lt;h3&gt;Final Results&lt;/h3&gt;
&lt;p&gt;The fastest write time was approximately 690 seconds, final cache flushing took 20 seconds, random verification took 120 seconds, sequential verification took 140 seconds, totaling 970+ seconds.&lt;/p&gt;
&lt;p&gt;All improvements from version 1.9 onward were focused on reducing CPU overhead, without changing the read/write structure.&lt;/p&gt;
&lt;h3&gt;Engineering Value and Robustness&lt;/h3&gt;
&lt;p&gt;During the storage design process, we fully accounted for the existence of long messages. Although in the competition code, the maximum supported length was just over 4K, it would be easy to support cross-page messages to remove this limitation.&lt;/p&gt;
&lt;p&gt;During the put design process, we fully leveraged SSD characteristics, enabling even random writes to achieve maximum disk throughput. Unfortunately, we were ultimately stuck on the CPU bottleneck and never reached maximum throughput.&lt;/p&gt;
&lt;p&gt;Although the scenario had no concurrent read-write requirements, by adding read-write locks on the queue mapping or finer-grained locks on in-memory pages, and accounting for unflushed in-memory data during reads, concurrent read-write support could be added immediately.&lt;/p&gt;
&lt;h2&gt;Summary and Reflections&lt;/h2&gt;
&lt;p&gt;Understanding and applying various types of knowledge is the key to solving real-world problems. Through the exchange and competition with other contestants, we learned a great deal, and I believe everyone gained a lot from this competition.&lt;/p&gt;
</content:encoded></item><item><title>Reflections on the 4th Tianchi Middleware Performance Challenge</title><link>https://blog.crazyark.xyz/zh/posts/awrace2018/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/awrace2018/</guid><pubDate>Wed, 01 Aug 2018 11:59:45 GMT</pubDate><content:encoded>&lt;p&gt;It has been nearly four months since I signed up for the competition, and it has finally come to an end. I&apos;m very happy to have won third place. Over these three months I learned a lot of practical engineering knowledge, and got reacquainted with C++ along the way. The preliminary round code was too ugly to share, but the final round code is hosted on GitHub:&lt;/p&gt;
&lt;p&gt;https://github.com/arkbriar/awrace2018_messagestore&lt;/p&gt;
&lt;p&gt;Below are the thought processes and final solutions for both the preliminary and final rounds.&lt;/p&gt;
&lt;h2&gt;Preliminary Round&lt;/h2&gt;
&lt;h3&gt;Problem Analysis and Understanding&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Implement a high-performance Service Mesh Agent component with the following features: 1. Service registration and discovery, 2. Protocol conversion, 3. Load balancing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The problem required us to achieve the highest possible &lt;strong&gt;performance&lt;/strong&gt;. We first restated the scenario and general approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Consumer will receive over 500 connections: use IO multiplexing&lt;/li&gt;
&lt;li&gt;HTTP runs over TCP connections: disable Nagle&apos;s algorithm&lt;/li&gt;
&lt;li&gt;The Dubbo provider has only 200 processing threads, and exceeding 200 concurrent requests will cause fast failures: load balancing should avoid provider overload as much as possible&lt;/li&gt;
&lt;li&gt;Online network performance (pps) is poor: batch send requests/responses, use UDP for inter-Agent communication&lt;/li&gt;
&lt;li&gt;The Consumer has poor performance: offload protocol conversion and similar tasks to the Provider Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Core Approach&lt;/h3&gt;
&lt;p&gt;To reduce system overhead, we maintain only a single UDP channel between agents, and only a single TCP channel between the PA (Provider Agent) and the Provider.&lt;/p&gt;
&lt;p&gt;The overall architecture is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321002017148.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;On the CA (Consumer Agent) side, we use a single epoll event loop. The general flow of the loop is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321002990709.jpg&quot; alt=&quot;&quot; /&gt;
1. First, accept all new connections. When a connection is accepted, assign a provider to handle all subsequent requests on that connection based on provider weights.&lt;/p&gt;
&lt;p&gt;2. Read all readable sockets.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a complete request is read, &lt;strong&gt;enqueue it into the request queue&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If a complete response is read, write it back directly via blocking write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;3. After all events are processed, batch send all requests in the queue (using writev).&lt;/p&gt;
&lt;p&gt;Throughout the request sending and response receiving process, we always attach an ID to the request. This ID is unique on the consumer side, ensuring that we only need a single map on the consumer side to determine which connection a response belongs to. Here, our ID is the handler&apos;s index.&lt;/p&gt;
&lt;h3&gt;Key Code&lt;/h3&gt;
&lt;p&gt;The batch request sending code looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;epoll_event events[LOWER_LOAD_EVENT_SIZE];
while (_event_group.wait_and_handle(events, LOWER_LOAD_EVENT_SIZE, -1)) {
    // check pending write queue, and block write all
    for (auto&amp;amp; entry : _pending_data) {
        // write all requests to provider agent
        // ... 
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is at https://code.aliyun.com/arkbriar/dubbo-agent-cxx.&lt;/p&gt;
&lt;h2&gt;Final Round&lt;/h2&gt;
&lt;h3&gt;Problem Analysis and Understanding&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Implement the Queue Store interface put/get, supporting millions of queues and hundreds of gigabytes of data on a single machine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After some initial analysis, we gathered the following information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The number of messages is approximately 2 billion, with message sizes around 64 bytes, though some may be up to 1024 bytes&lt;/li&gt;
&lt;li&gt;The online machines have 4 cores, 8 GB RAM with SSD, approximately 10K IOPS, and sequential 4K read/write throughput around 200 MB/s&lt;/li&gt;
&lt;li&gt;For the same queue, put operations are serialized&lt;/li&gt;
&lt;li&gt;For the same queue, get operations may be concurrent&lt;/li&gt;
&lt;li&gt;There is no deletion of queues or messages, and put and get do not occur simultaneously&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The benchmark program&apos;s no-op put TPS is only around 6 million, so the harness itself bounds achievable throughput. Without compression, writing 100 GB of data takes at least 500 seconds. Therefore, the challenge requires contestants to ensure the put phase is as non-blocking as possible, and to design a data structure that allows the get phase to take advantage of caching and locality.&lt;/p&gt;
&lt;h3&gt;Core Approach&lt;/h3&gt;
&lt;p&gt;To implement high-performance interfaces, we needed to solve three problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Storage design: support sequential writes and random reads&lt;/li&gt;
&lt;li&gt;IO/SSD optimization: push IO throughput to its limits&lt;/li&gt;
&lt;li&gt;Memory planning: compact planning and usage due to limited evaluation environment memory&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Storage Design: Page-Based Storage&lt;/h4&gt;
&lt;p&gt;We first studied a similar product, Kafka:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321022399044.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321022432192.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Kafka&apos;s general storage model is shown above. We won&apos;t elaborate here; interested readers can learn more at https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026&lt;/p&gt;
&lt;p&gt;However, directly copying Kafka&apos;s approach poses several problems for our use case:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;With 1 million queues, there would be at least 2 million files, placing enormous burden on the filesystem&lt;/li&gt;
&lt;li&gt;Kafka uses full indexing, but a full index for 2 billion messages won&apos;t fit in memory. At minimum, a two-level index is needed, which adds read overhead&lt;/li&gt;
&lt;li&gt;The disk overhead of a full index for 2 billion messages is too large&lt;/li&gt;
&lt;li&gt;The primary queue use case is sequential reads, where full indexing provides no advantage (it&apos;s designed for random reads)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we adopted a common file partitioning pattern: paging, with pages as the smallest unit of operation and indexing. This is our storage model: page-based storage.&lt;/p&gt;
&lt;p&gt;This storage approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Effectively reduces the number of files&lt;/li&gt;
&lt;li&gt;Since writes are sequential, we don&apos;t need to worry about dirty pages&lt;/li&gt;
&lt;li&gt;Since pages are the smallest unit, consecutive messages belonging to the same queue stay in the same page as much as possible, ensuring sequential reads can leverage caching and locality&lt;/li&gt;
&lt;li&gt;Finally, we index at the page level, and at this point the index fits entirely in memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The page size is set to 4K to balance memory usage.&lt;/p&gt;
&lt;h4&gt;IO/SSD Optimization: Block Reads and Writes&lt;/h4&gt;
&lt;p&gt;For most existing storage devices, whether HDD or SSD, sequential read/write is always optimal.&lt;/p&gt;
&lt;p&gt;However, SSDs differ in one key aspect: due to their &quot;built-in concurrency&quot; characteristic, reads and writes aligned to Clustered Blocks perform just as well as sequential reads and writes.&lt;/p&gt;
&lt;p&gt;To understand built-in concurrency, we need to know some SSD internals. The following website provides more details; here we only cover part of it.&lt;/p&gt;
&lt;p&gt;http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321031573324.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;SSD reads and writes operate at the page (NAND-flash page) level. Even reading or writing a single bit involves reading or writing an entire page. Adjacent pages form a block, typically sized between 256 KB and 4 MB. There are additional SSD characteristics that cause write amplification and GC issues, which we won&apos;t cover here; interested readers can look into these independently.&lt;/p&gt;
&lt;p&gt;SSD built-in concurrency means that accesses to different channels/packages/chips/planes in the diagram can proceed concurrently. Blocks across different planes form an access unit called a Clustered Block, as shown by the yellow box in the diagram. Reading and writing Clustered Blocks fully leverages concurrent access. The read/write pattern we&apos;ll describe next is: random reads and writes aligned to Clustered Blocks perform just as well as sequential reads and writes. As a side note, Clustered Block sizes are typically 32 or 64 MB.&lt;/p&gt;
&lt;p&gt;Here is a throughput comparison between random writes and sequential writes. At 32M/64M, random write throughput matches sequential write throughput across the board:
&lt;img src=&quot;/img/15321036720838.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We also ran a small test, measuring read/write performance on a 4 GB file (all times in seconds):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321037598597.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here, &quot;concurrent pseudo sequential write&quot; means multiple threads obtain the next write offset from a shared atomic integer. The experimental results confirmed the effectiveness of this pattern.&lt;/p&gt;
&lt;p&gt;So our final read/write pattern is block reads and writes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large block (64 MB) writes, which don&apos;t even need to be sequential&lt;/li&gt;
&lt;li&gt;Effectively improved concurrent read/write performance&lt;/li&gt;
&lt;li&gt;Sequential reads fully leverage caching and prefetching&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Memory Planning: Every Byte Counts&lt;/h4&gt;
&lt;p&gt;Building on the above two designs, we planned memory precisely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each queue has its own 4K page cache, corresponding to the page currently being written&lt;/li&gt;
&lt;li&gt;Each put thread has its own double 64 MB write buffer, where the buffer always corresponds to a range within a file&lt;/li&gt;
&lt;li&gt;Total cache memory: 4K * 1M queues + 64M * 2 per put thread ≈ 5.2 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Overall Architecture&lt;/h4&gt;
&lt;p&gt;The relationship between queues and files is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321040869218.jpg&quot; alt=&quot;&quot; /&gt;
The page and message structure is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321040994886.jpg&quot; alt=&quot;&quot; /&gt;Size is encoded using variable-length encoding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Size &amp;lt; 128: stored in one byte, with the first bit being 0&lt;/li&gt;
&lt;li&gt;Size &amp;gt;= 128 and &amp;lt; 32768: stored in two bytes; since the leading bit of such a size is still 0, we store size | 0x8000 so that the first bit of the first byte is 1, marking the two-byte form&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The index for each page looks roughly like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321041984763.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each page stores four fields: page number, file number, number of messages in the page, and total number of messages in all preceding pages. This way, during get operations, a binary search locates the page containing the target message. The total index size is approximately 12 bytes * 25M = 300 MB, which fits entirely in memory.&lt;/p&gt;
&lt;h4&gt;Put/Get Flow&lt;/h4&gt;
&lt;p&gt;These two flows are relatively straightforward. First, the put flow:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15321043391182.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;检查内存中的页还有没有空间
&lt;ul&gt;
&lt;li&gt;有，直接写入&lt;/li&gt;
&lt;li&gt;没有，将当前页写出，更新索引，然后写入消息&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;页写出时会提交到当前 put 线程的写缓冲，并反馈修改索引。当写缓冲满的时候，与后备的缓冲交换继续写入缓冲，并起新的线程刷盘，这样尽可能不阻塞 put 线程。&lt;/p&gt;
&lt;p&gt;在所有 get 发生前，我们会强制所有缓冲落盘，接下来的 get 流程也相对简单：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;二分查找第一页和最后一页&lt;/li&gt;
&lt;li&gt;顺序&lt;strong&gt;阻塞读&lt;/strong&gt;对应的文件页&lt;/li&gt;
&lt;li&gt;跳过页内无关的消息，收集所有相关消息&lt;/li&gt;
&lt;li&gt;如果请求的最后一页剩余消息数小于某个阈值 (e.g. 15)，使用 readahead 预读下一页&lt;/li&gt;
&lt;li&gt;返回请求的消息&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;我们在 get 的过程中保持两个理念：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Lean fully on the OS page cache&lt;/li&gt;
&lt;li&gt;Prefetch as much as possible, in the hope of reducing sequential read time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We lean fully on the OS page cache mainly because:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Implementing a concurrent LRU cache is troublesome&lt;/li&gt;
&lt;li&gt;If we built a per-queue sequential read buffer, similar to a linked list, concurrent sequential reads from multiple threads would keep invalidating it; even with a thread-local buffer, a single thread sequentially reading the same queue at different offsets (e.g. reading it through twice) would also keep invalidating it&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we might as well use the page cache directly.&lt;/p&gt;
&lt;h3&gt;Key Code&lt;/h3&gt;
&lt;p&gt;For the queue map we used the concurrent hash map from tbb, which ultimately proved to be the biggest CPU bottleneck, and our put phase was entirely CPU-bound.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;template &amp;lt;class K, class V, class C = tbb::tbb_hash_compare&amp;lt;K&amp;gt;,
          class A = tbb::cache_aligned_allocator&amp;lt;std::pair&amp;lt;const K, V&amp;gt;&amp;gt;&amp;gt;
class ConcurrentHashMapProxy : public tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt; {
public:
    ConcurrentHashMapProxy(size_t n) : tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt;(n) {}
    // lock-free lookup that avoids taking a bucket lock on the hot path
    typename tbb::concurrent_hash_map&amp;lt;K, V, C, A&amp;gt;::const_pointer fast_find(const K&amp;amp; key) const {
        return this-&amp;gt;internal_fast_find(key);
    }
};

MessageQueue* QueueStore::find_or_create_queue(const String&amp;amp; queue_name) {
    // fast path: lock-free lookup
    auto q_ptr = queues_.fast_find(queue_name);
    if (q_ptr) return q_ptr-&amp;gt;second;

    // slow path: take an accessor and insert if still absent
    decltype(queues_)::const_accessor ac;
    queues_.find(ac, queue_name);
    if (ac.empty()) {
        uint32_t queue_id = next_queue_id_.next();
        auto queue_ptr = new MessageQueue(queue_id, this);
        queues_.insert(ac, std::make_pair(queue_name, queue_ptr));
        DLOG(&quot;Created a new queue, id: %d, name: %s&quot;, queue_id,
             queue_name.c_str());
    }
    return ac-&amp;gt;second;
}

MessageQueue* QueueStore::find_queue(const String&amp;amp; queue_name) const {
    auto q_ptr = queues_.fast_find(queue_name);
    return q_ptr ? q_ptr-&amp;gt;second : nullptr;
}

void QueueStore::put(const String&amp;amp; queue_name, const MemBlock&amp;amp; message) {
    if (!message.ptr) return;
    auto q_ptr = find_or_create_queue(queue_name);
    q_ptr-&amp;gt;put(message);
}

Vector&amp;lt;MemBlock&amp;gt; QueueStore::get(const String&amp;amp; queue_name, long offset, long size) {
    if (!flushed) flush_all_before_read();

    auto q_ptr = find_queue(queue_name);
    if (q_ptr) return q_ptr-&amp;gt;get(offset, size);
    // return empty list when queue is not found
    return Vector&amp;lt;MemBlock&amp;gt;();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, our main data structure, FilePage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// File page header
struct __attribute__((__packed__)) FilePageHeader {
    uint64_t offset = NEGATIVE_OFFSET;
};

#define FILE_PAGE_HEADER_SIZE sizeof(FilePageHeader)
#define FILE_PAGE_AVAILABLE_SIZE (FILE_PAGE_SIZE - FILE_PAGE_HEADER_SIZE)

// File page
struct __attribute__((__packed__)) FilePage {
    FilePageHeader header;
    char content[FILE_PAGE_AVAILABLE_SIZE];
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Swapping the double write buffers requires synchronization; we use an atomic status flag together with a mutex:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// buffer is full, switch to back
{
    // spin wait until back buf is not in scheduled status
    while (back_buf_status-&amp;gt;load() == BACK_BUF_FLUSH_SCHEDULED)
        ;
    // try swap active and back buffer
    std::unique_lock&amp;lt;std::mutex&amp;gt; lock(*back_buf_mutex);
    std::swap(active_buf, back_buf);
    assert(back_buf_status-&amp;gt;load() == BACK_BUF_FREE);
    uint32_t exp = BACK_BUF_FREE;
    back_buf_status-&amp;gt;compare_exchange_strong(exp, BACK_BUF_FLUSH_SCHEDULED);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To let the OS page cache run at full tilt, the get phase needs to release all unused memory. tcmalloc does not return memory to the OS on its own, so we force a release:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;gperftools/malloc_extension.h&amp;gt;

// release free memory back to os
MallocExtension::instance()-&amp;gt;ReleaseFreeMemory();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the online writes arrived too evenly, all page caches would come up for flushing at the same moment, blocking every put thread at once. To spread this out, we let the first page of certain queues be written out early:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// first page will hold at most (queue_id / DATA_FILE_SPLITS) % 64 + 1 messages, this make write
// more average. This leads to 64 timepoints of first flush. I call it flush fast.
bool flush_fast =
    paged_message_indices_.size() == 1 &amp;amp;&amp;amp;
    paged_message_indices_.back().msg_size &amp;gt;= ((queue_id_ / DATA_FILE_SPLITS) &amp;amp; 0x3f) + 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code lives at https://code.aliyun.com/arkbriar/queue-race-2018-cpp.&lt;/p&gt;
&lt;h3&gt;Final Results&lt;/h3&gt;
&lt;p&gt;Our fastest write phase took about 690 s, the final buffer flush 20 s, random verification 120 s, and sequential verification 140 s, for a total of 970 s+.&lt;/p&gt;
&lt;p&gt;All improvements from 1.9 onward went entirely into cutting CPU overhead; we never touched the read/write structure.&lt;/p&gt;
&lt;h3&gt;Engineering Value and Robustness&lt;/h3&gt;
&lt;p&gt;The storage design fully accounts for long messages. Although the competition code supports messages only slightly over 4 KB, the limit could easily be removed by supporting messages that span pages.&lt;/p&gt;
&lt;p&gt;The put design fully exploits the characteristics of SSDs, so that even random writes can reach the maximum disk throughput. Regrettably, we were so mired in the CPU bottleneck that we never actually reached it.&lt;/p&gt;
&lt;p&gt;Although the scenario never reads while writing, support for concurrent reads and writes could be added immediately by putting a read-write lock on the queue map, or finer-grained locks on the in-memory pages, and having reads take unflushed in-memory data into account.&lt;/p&gt;
&lt;h2&gt;Summary and Reflections&lt;/h2&gt;
&lt;p&gt;Understanding and applying knowledge is the key to solving real problems. We learned a great deal from the exchanges and competition with the other contestants, and I believe everyone came away from this contest with a lot as well.&lt;/p&gt;
</content:encoded></item><item><title>Linux IO Multiplexing -- Epoll</title><link>https://blog.crazyark.xyz/zh/posts/epoll/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/epoll/</guid><pubDate>Mon, 16 Apr 2018 11:51:21 GMT</pubDate><content:encoded>&lt;p&gt;Epoll is a set of programming interfaces unique to the Linux platform, used for monitoring IO events on multiple file descriptors. The advantage of epoll over select/poll is that it performs very well even when monitoring a large number of file descriptors. The epoll API supports two monitoring modes: edge-triggered (EPOLLET) and level-triggered (default).&lt;/p&gt;
&lt;p&gt;In edge-triggered mode, a file descriptor is only returned by epoll_wait when an event occurs on it. For example, when monitoring a socket, suppose the first epoll_wait returns that socket with 2 bytes available to read, but only 1 byte is read. The next epoll_wait will not return that file descriptor. In other words, having unread data remaining in the buffer is not considered an event.&lt;/p&gt;
&lt;p&gt;Level-triggered mode is different -- as long as the socket is still readable, it will continue to be returned.&lt;/p&gt;
&lt;p&gt;When using ET mode, non-blocking file descriptors must be used to prevent blocking reads/writes from starving tasks that handle multiple file descriptors. It is best to use the ET mode epoll_wait interface with the following pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;non-blocking&lt;/strong&gt; file descriptors&lt;/li&gt;
&lt;li&gt;Only suspend and wait when read/write returns EAGAIN; when read/write returns less data than requested, you can be sure there is no more data in the buffer, and the event can be considered complete.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Epoll API&lt;/h3&gt;
&lt;p&gt;Linux provides the following functions for creating, managing, and using epoll instances:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;int epoll_create(int size);&lt;/code&gt;, &lt;code&gt;int epoll_create1(int flags);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask);&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;epoll_create/epoll_create1&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;epoll_create&lt;/code&gt; creates an epoll instance and returns a file descriptor representing that instance. In &lt;code&gt;epoll_create1&lt;/code&gt;, the size limitation of epoll was removed. The flags parameter can be set to EPOLL_CLOEXEC, which sets close-on-exec (FD_CLOEXEC) on the new file descriptor, meaning the descriptor is automatically closed when the process performs an execve system call.&lt;/p&gt;
&lt;h4&gt;epoll_ctl&lt;/h4&gt;
&lt;p&gt;epoll_ctl is used to control the file descriptors monitored by an epoll instance. Here, epfd is the epoll file descriptor, and op specifies the operation to perform. There are three types of operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;EPOLL_CTL_ADD&lt;/li&gt;
&lt;li&gt;EPOLL_CTL_MOD&lt;/li&gt;
&lt;li&gt;EPOLL_CTL_DEL&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the names suggest: add, modify, and delete.&lt;/p&gt;
&lt;p&gt;The remaining parameters are the corresponding file descriptor fd and the set of events to monitor, stored in &lt;code&gt;struct epoll_event&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;typedef union epoll_data {
   void        *ptr;
   int          fd;
   uint32_t     u32;
   uint64_t     u64;
} epoll_data_t;

struct epoll_event {
   uint32_t     events;      /* Epoll events */
   epoll_data_t data;        /* User data variable */
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The events field in &lt;code&gt;struct epoll_event&lt;/code&gt; is a bitmask indicating the events being monitored. Here are some of the important ones:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;EPOLLIN/EPOLLOUT: the file is readable/writable&lt;/li&gt;
&lt;li&gt;EPOLLRDHUP: the peer closed the connection, or shut down the writing half of it&lt;/li&gt;
&lt;li&gt;EPOLLERR: always monitored by default; an error occurred on the file descriptor&lt;/li&gt;
&lt;li&gt;EPOLLHUP: always monitored by default; the file was hung up, which on sockets/pipes means the other end closed the connection&lt;/li&gt;
&lt;li&gt;EPOLLET: enables edge-triggered mode; level-triggered is the default&lt;/li&gt;
&lt;li&gt;EPOLLONESHOT: automatically removes the monitor after a single trigger&lt;/li&gt;
&lt;li&gt;EPOLLWAKEUP: if EPOLLONESHOT and EPOLLET are clear and the process has the CAP_BLOCK_SUSPEND capability, this flag ensures the system does not suspend or sleep while the event is pending or being processed&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;epoll_wait/epoll_pwait&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;epoll_wait&lt;/code&gt; blocks and waits for events on file descriptors. The events array must be at least as large as maxevents. &lt;code&gt;epoll_wait&lt;/code&gt; blocks until:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A file descriptor produces an event&lt;/li&gt;
&lt;li&gt;It is interrupted by a signal&lt;/li&gt;
&lt;li&gt;It times out (timeout)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And returns the number of current events.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;epoll_pwait&lt;/code&gt; additionally takes a sigmask parameter specifying signals that should not interrupt the call; otherwise it is equivalent to &lt;code&gt;epoll_wait&lt;/code&gt;.&lt;/p&gt;
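&lt;p&gt;A small sketch of that difference (the helper name is ours): the mask passed to &lt;code&gt;epoll_pwait&lt;/code&gt; replaces the signal mask for the duration of the call, so the listed signals cannot interrupt the wait:&lt;/p&gt;

```cpp
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>

// Wait on an epoll fd while keeping `sig` blocked for the duration of the
// call, so that signal cannot interrupt the wait (it is delivered afterwards)
int wait_blocking_signal(int epfd, struct epoll_event* events, int maxevents,
                         int timeout, int sig) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, sig);
    return epoll_pwait(epfd, events, maxevents, timeout, &mask);
}
```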
&lt;h3&gt;Performance Benchmark&lt;/h3&gt;
&lt;p&gt;A comparison of system call overhead for poll/select and epoll when monitoring different numbers of file descriptors, referenced from the first website in the references.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# fds monitored |  poll  |  select   | epoll
10              |   0.61 |    0.73   | 0.41
100             |   2.9  |    3.0    | 0.42
1000            |  35    |   35      | 0.53
10000           | 990    |  930      | 0.66
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/&lt;/p&gt;
&lt;p&gt;[2] Linux Programmer&apos;s Manual: man epoll/epoll_create/epoll_ctl/epoll_wait&lt;/p&gt;
&lt;h3&gt;A Simple Epoll-based Server&lt;/h3&gt;
&lt;p&gt;The following is a simple server implemented in C that supports up to 100 simultaneous connections. For each new connection, it is added to the epoll queue. When IO events arrive, the corresponding client&apos;s IO events are handled.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;errno.h&amp;gt;
#include &amp;lt;sys/types.h&amp;gt;
#include &amp;lt;sys/socket.h&amp;gt;
#include &amp;lt;arpa/inet.h&amp;gt;
#include &amp;lt;sys/epoll.h&amp;gt;

#define MAX_EVENTS 10
#define LISTEN_PORT 1234
#define BUF_LEN 512
#define MAX_CONN 100

struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd;
struct sockaddr_in server;

#define log(...) printf(__VA_ARGS__)

void response_to_conn(int conn_sock) {
    char buf[BUF_LEN + 1];

    int read_len = 0;
    while ((read_len = read(conn_sock, buf, BUF_LEN)) &amp;gt; 0) {
        buf[read_len] = &apos;\0&apos;;

        int cursor = 0;
        while (cursor &amp;lt; read_len) {
            // writing to a pipe or socket whose reading end is closed
            // will lead to a SIGPIPE
            int len = write(conn_sock, buf + cursor, read_len - cursor);
            if (len &amp;lt; 0) {
                perror(&quot;write&quot;);
                return;
            }
            cursor += len;
        }

        // there are no data so we do not have to do another read
        if (read_len &amp;lt; BUF_LEN) {
            break;
        }
    }

    // must make sure that the next read will block this non-blocking
    // socket, then we think the event is fully consumed.
    if (read_len &amp;lt; 0 &amp;amp;&amp;amp; errno == EAGAIN) {
        return;
    }
    // end of file
    if (read_len == 0) {
        return;
    }
}

/* Code to set up listening socket, &apos;listen_sock&apos; */
void listen_and_bind() {
    if ((listen_sock = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
        perror(&quot;socket&quot;);
        exit(EXIT_FAILURE);
    }
    int option = 1;
    setsockopt(listen_sock, SOL_SOCKET, SO_REUSEADDR, &amp;amp;option, sizeof(option));

    server.sin_family = AF_INET;
    server.sin_addr.s_addr = INADDR_ANY;
    server.sin_port = htons(LISTEN_PORT);
    if (bind(listen_sock, (struct sockaddr *)&amp;amp;server, sizeof(server)) == -1) {
        perror(&quot;bind&quot;);
        exit(EXIT_FAILURE);
    }

    listen(listen_sock, MAX_CONN);
}

void create_epoll() {
    epollfd = epoll_create1(0);
    if (epollfd == -1) {
        perror(&quot;epoll_create1&quot;);
        exit(EXIT_FAILURE);
    }

    ev.events = EPOLLIN;
    ev.data.fd = listen_sock;
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &amp;amp;ev) == -1) {
        perror(&quot;epoll_ctl: listen_sock&quot;);
        exit(EXIT_FAILURE);
    }
}

void set_fd_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        perror(&quot;getfl&quot;);
        return;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) &amp;lt; 0) {
        perror(&quot;setfl&quot;);
        return;
    }
}

void epoll_loop() {
    for (;;) {
        int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
        if (nfds == -1) {
            perror(&quot;epoll_wait&quot;);
            exit(EXIT_FAILURE);
        }

        log(&quot;get %d events from epoll_wait!\n&quot;, nfds);

        for (int n = 0; n &amp;lt; nfds; ++ n) {
            if (events[n].data.fd == listen_sock) {
                struct sockaddr_in local;
                socklen_t addrlen = sizeof(local);
                conn_sock = accept(listen_sock, (struct sockaddr *) &amp;amp;local, &amp;amp;addrlen);
                if (conn_sock == -1) {
                    perror(&quot;accept&quot;);
                    exit(EXIT_FAILURE);
                }

                log(&quot;accept a new connection!\n&quot;);

                // set non-blocking
                set_fd_nonblocking(conn_sock);

                ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
                ev.data.fd = conn_sock;
                if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &amp;amp;ev) == -1) {
                    perror(&quot;epoll_ctl: conn_sock&quot;);
                    exit(EXIT_FAILURE);
                }
            } else {
                if (events[n].events &amp;amp; (EPOLLRDHUP | EPOLLERR)) {
                    log(&quot;detect a closed/broken connection!\n&quot;);
                    epoll_ctl(epollfd, EPOLL_CTL_DEL, events[n].data.fd, NULL);
                    close(events[n].data.fd);
                } else response_to_conn(events[n].data.fd);
            }
        }
    }
}

int main(int argc, char **argv) {
    log(&quot;listening on port 1234!\n&quot;);
    listen_and_bind();

    log(&quot;creating epoll!\n&quot;);
    create_epoll();

    log(&quot;starting loop on epoll!\n&quot;);
    epoll_loop();

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Bloom Filter</title><link>https://blog.crazyark.xyz/zh/posts/bloom_filter/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/bloom_filter/</guid><pubDate>Mon, 16 Apr 2018 05:02:27 GMT</pubDate><content:encoded>&lt;p&gt;A Bloom filter is a space-efficient &lt;strong&gt;probabilistic&lt;/strong&gt; data structure used to test whether an element is a member of a set. A Bloom filter may produce false positives (reporting that an element exists when it actually does not), but never produces false negatives (reporting that an element does not exist when it actually does). The more elements in the set, the higher the probability of false positives.&lt;/p&gt;
&lt;h3&gt;Structure&lt;/h3&gt;
&lt;p&gt;The structure of a Bloom filter is very simple: a bit array of length m, initialized to all zeros. The Bloom filter requires k different hash functions, each of which maps an element to a number in [0, m), i.e., a position in the array, with a uniform distribution.&lt;/p&gt;
&lt;p&gt;Typically, k is a small constant much smaller than m, and m is proportional to the number of elements to be added. The exact values of k and m depend on the desired false positive rate.&lt;/p&gt;
&lt;p&gt;The following diagram shows a Bloom filter:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15238556159607.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here are the two operations on a Bloom filter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Add an element&lt;/strong&gt;: For each hash function, compute the corresponding position in the array and set it to 1.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query an element&lt;/strong&gt;: For each hash function, compute the corresponding position in the array. If all positions are 1, report that the element exists; otherwise, the element does not exist.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Clearly, if an element was previously added, a query will always report it as present. However, a query reporting presence does not necessarily mean the element was actually added.&lt;/p&gt;
&lt;p&gt;In this design, since false negatives are not allowed, element deletion is not possible. If false negatives were acceptable, element deletion could be handled using a counting array. For one-time deletion, a second Bloom filter recording deleted elements could be used, but then the false positives of this second filter would become false negatives of the first.&lt;/p&gt;
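&lt;p&gt;The two operations can be sketched in a few lines. This is a toy illustration: the k hash functions are simulated by perturbing a single std::hash value, which is not a production-quality choice:&lt;/p&gt;

```cpp
#include <vector>
#include <string>
#include <functional>
#include <cstdint>
#include <cstddef>

class BloomFilter {
public:
    BloomFilter(size_t m, size_t k) : bits_(m, false), k_(k) {}

    // set the k hashed positions to 1
    void add(const std::string& item) {
        for (size_t i = 0; i < k_; ++i) bits_[pos(item, i)] = true;
    }

    // true  -> item is *possibly* in the set (false positives allowed)
    // false -> item is *definitely* not in the set
    bool query(const std::string& item) const {
        for (size_t i = 0; i < k_; ++i)
            if (!bits_[pos(item, i)]) return false;
        return true;
    }

private:
    // toy "i-th hash": perturb one base hash with a salt per hash index
    size_t pos(const std::string& item, size_t i) const {
        return (std::hash<std::string>{}(item) ^ (i * 0x9e3779b97f4a7c15ull))
               % bits_.size();
    }
    std::vector<bool> bits_;
    size_t k_;
};
```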
&lt;h3&gt;Space and Time&lt;/h3&gt;
&lt;p&gt;Bloom filters are extremely space-efficient, far more so than other set data structures such as self-balancing binary trees, tries, and hash tables. A Bloom filter with an expected false positive rate below 1%, using optimal k, requires only 9.6 bits per element, regardless of the element itself. Furthermore, adding only about 4.8 additional bits per element can reduce the false positive rate to 0.1%. However, when the number of possible elements is very small, a Bloom filter is obviously worse than a simple bit array.&lt;/p&gt;
&lt;p&gt;For both insertion and query, the time complexity of a Bloom filter depends on the k hash functions. Since hash functions are typically efficient, we consider the time complexity to be O(k).&lt;/p&gt;
&lt;h3&gt;False Positive Rate and Optimal Parameters&lt;/h3&gt;
&lt;p&gt;Let&apos;s derive the false positive probability. When an element is inserted, the probability that a given position in the array has not been set to 1 is: $(1 - \frac{1}{m})^{k}$. After inserting n elements, the probability that this position is still 0 is $(1 - \frac{1}{m})^{kn}$.&lt;/p&gt;
&lt;p&gt;So during a query, the probability that all k hash positions are 1 is: $(1 - \left[1 - \frac{1}{m}\right]^{kn})^k \approx (1 - e^{-kn/m})^k$.&lt;/p&gt;
&lt;p&gt;This derivation is not completely rigorous because it assumes &lt;strong&gt;the events of each bit being set are mutually independent&lt;/strong&gt;. &lt;a href=&quot;https://books.google.co.jp/books?id=0bAYl6d7hvkC&amp;amp;pg=PA110&amp;amp;redir_esc=y#v=onepage&amp;amp;q&amp;amp;f=false&quot;&gt;In fact, when this assumption is removed, the same approximation can still be derived. Interested readers can refer to this source.&lt;/a&gt; We have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The false positive rate decreases as m increases, and increases as n increases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The optimal number of hash functions is: $k = \dfrac{m}{n}\ln 2$&lt;/p&gt;
&lt;p&gt;The derivation is as follows:&lt;/p&gt;
&lt;p&gt;$p = (1 - e^{-kn/m})^k$&lt;/p&gt;
&lt;p&gt;$\ln p = k\ln(1 - e^{-kn/m}) = -\dfrac{m}{n}\ln(e^{-kn/m}) \ln(1 - e^{-kn/m})$&lt;/p&gt;
&lt;p&gt;Let $g = e^{-kn/m}$, so $\ln p = -\dfrac{m}{n}\ln(g) \ln(1 - g)$. By symmetry, the expression is minimized when $g = \dfrac{1}{2}$.&lt;/p&gt;
&lt;p&gt;Therefore $g = e^{-kn/m} = \dfrac{1}{2}$, giving $k = \dfrac{m}{n}\ln 2$.&lt;/p&gt;
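&lt;p&gt;A quick numeric check of these formulas, tying them back to the 9.6 bits per element quoted earlier (the helper names are ours):&lt;/p&gt;

```cpp
#include <cmath>

// k = (m/n) ln 2 and p = (1 - e^{-kn/m})^k, from the derivation above
double optimal_k(double m, double n) { return m / n * std::log(2.0); }

double false_positive_rate(double m, double n, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```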
&lt;h3&gt;Estimating Set Size&lt;/h3&gt;
&lt;p&gt;Swamidass &amp;amp; Baldi (2007) gave a formula to estimate the approximate size of the set in a Bloom filter:&lt;/p&gt;
&lt;p&gt;$$n^* = -\dfrac{m}{k}\ln\left[1 - \dfrac{X}{m}\right],$$&lt;/p&gt;
&lt;p&gt;where X is the number of 1-bits in the bit array.&lt;/p&gt;
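&lt;p&gt;In code, the estimator is a one-liner; the check below feeds it the expected bit count for n = 50 inserted elements and recovers n:&lt;/p&gt;

```cpp
#include <cmath>

// Swamidass and Baldi estimator: n* = -(m/k) ln(1 - X/m), where X is the
// number of bits set to 1 in the m-bit array and k is the number of hashes
double estimate_cardinality(double m, double k, double X) {
    return -(m / k) * std::log(1.0 - X / m);
}
```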
&lt;h3&gt;Applications of Bloom Filters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce unnecessary disk lookups.&lt;/li&gt;
&lt;li&gt;Google Chrome uses a Bloom filter to detect dangerous URLs.&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter&lt;/p&gt;
</content:encoded></item><item><title>布隆过滤器 (Bloom Filter)</title><link>https://blog.crazyark.xyz/zh/posts/bloom_filter/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/bloom_filter/</guid><pubDate>Mon, 16 Apr 2018 05:02:27 GMT</pubDate><content:encoded>&lt;p&gt;布隆过滤器 (Bloom Filter) 是一个空间高效的&lt;strong&gt;概率&lt;/strong&gt;数据结构，它能够用来测试是否一个元素在一个集合中。布隆过滤器存在着 false positive (返回存在，其实不存在），但是不存在 false negative (返回不存在，其实存在)。集合中的元素越多，false positive 的概率就越高。&lt;/p&gt;
&lt;h3&gt;Structure&lt;/h3&gt;
&lt;p&gt;The structure of a Bloom filter is very simple: a bit array of length m, initialized to all 0s. The filter also needs k different hash functions, each of which maps an element to a number in [0, m), i.e., a position in the array, with a uniform distribution.&lt;/p&gt;
&lt;p&gt;Typically, k is a constant much smaller than m, and m is proportional to the number of elements to be added. The exact values of k and m depend on the desired false positive rate.&lt;/p&gt;
&lt;p&gt;The figure below shows a Bloom filter:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15238556159607.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A Bloom filter supports two operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adding an element: for each hash function, compute the corresponding array position and set it to 1&lt;/li&gt;
&lt;li&gt;Querying an element: for each hash function, compute the corresponding array position; if all of them are 1, report the element as present, otherwise as absent&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Clearly, an element that has been added will always be reported as present on a query, but a query reporting present does not necessarily mean the element was ever added.&lt;/p&gt;
&lt;p&gt;In this design, since false negatives are not allowed, deleting elements is impossible. Deletion can be supported by replacing the bit array with an array of counters (a counting Bloom filter). For one-off deletions, one can also create a second Bloom filter recording the deleted elements, but that filter&apos;s false positives then become false negatives of the first one.&lt;/p&gt;
&lt;h3&gt;Space and Time&lt;/h3&gt;
&lt;p&gt;Bloom filters are extremely space-efficient, far more so than other set data structures such as self-balancing binary search trees, tries, and hash tables. With the optimal k, a Bloom filter with a false positive rate within 1% needs only 9.6 bits per element, regardless of the size of the elements themselves. Moreover, adding about 4.8 bits per element lowers the false positive rate to 0.1%. However, when the number of possible elements is very small, a Bloom filter is clearly worse than a plain bit array.&lt;/p&gt;
&lt;p&gt;For both insertion and query, the running time is dominated by the k hash functions; since hash functions are generally fast, the time complexity is O(k).&lt;/p&gt;
&lt;h3&gt;False Positive Rate and Optimal Parameters&lt;/h3&gt;
&lt;p&gt;Let us derive the false positive probability. When one element is inserted, the probability that a given array position has not been set to 1 is clearly $(1 - \frac{1}{m})^{k}$; after n elements have been inserted, the probability that this position is still 0 is $(1 - \frac{1}{m})^{kn}$.&lt;/p&gt;
&lt;p&gt;So on a query, the probability that the positions of all k hash functions are 1 is $(1 - \left[1 - \frac{1}{m}\right]^{kn})^k \approx (1 - e^{-kn/m})^k$.&lt;/p&gt;
&lt;p&gt;This derivation is not entirely rigorous, because it assumes that &lt;strong&gt;the events of individual bits being set are mutually independent&lt;/strong&gt;. &lt;a href=&quot;https://books.google.co.jp/books?id=0bAYl6d7hvkC&amp;amp;pg=PA110&amp;amp;redir_esc=y#v=onepage&amp;amp;q&amp;amp;f=false&quot;&gt;In fact, the same approximation can still be derived without this assumption; interested readers can follow the link&lt;/a&gt;. We therefore have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The false positive rate decreases as m increases, and increases as n increases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The optimal number of hash functions is then: $k = \dfrac{m}{n}\ln 2$
The derivation is as follows:&lt;/p&gt;
&lt;p&gt;$p = (1 - e^{-kn/m})^k$&lt;/p&gt;
&lt;p&gt;$\ln p = k\ln(1 - e^{-kn/m}) = -\dfrac{m}{n}\ln(e^{-kn/m}) \ln(1 - e^{-kn/m})$&lt;/p&gt;
&lt;p&gt;Let $g = e^{-kn/m}$, so $\ln p = -\dfrac{m}{n}\ln(g) \ln(1 - g)$. By symmetry, $\ln(g)\ln(1 - g)$ is maximized when $g = \dfrac{1}{2}$, which minimizes $\ln p$ and hence $p$.&lt;/p&gt;
&lt;p&gt;Therefore $g = e^{-kn/m} = \dfrac{1}{2}$, giving $k = \dfrac{m}{n}\ln 2$.&lt;/p&gt;
&lt;h3&gt;Estimating Set Size&lt;/h3&gt;
&lt;p&gt;Swamidass &amp;amp; Baldi (2007) gave a formula to estimate the approximate size of the set in a Bloom filter:&lt;/p&gt;
&lt;p&gt;$$n^* = -\dfrac{m}{k}\ln\left[1 - \dfrac{X}{m}\right],$$&lt;/p&gt;
&lt;p&gt;where X is the number of 1-bits in the bit array.&lt;/p&gt;
&lt;h3&gt;Applications of Bloom Filters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce unnecessary disk lookups.&lt;/li&gt;
&lt;li&gt;Google Chrome uses a Bloom filter to detect dangerous URLs.&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter&lt;/p&gt;
</content:encoded></item><item><title>Longest Palindromic Substring -- Manacher&apos;s Algorithm</title><link>https://blog.crazyark.xyz/zh/posts/manachers_algorithm/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/manachers_algorithm/</guid><pubDate>Thu, 12 Apr 2018 17:50:51 GMT</pubDate><content:encoded>&lt;p&gt;Emm, I actually planned to write this post back in September 2017, but only now am I finally getting to it. If I don&apos;t write it soon, I&apos;ll forget everything...&lt;/p&gt;
&lt;p&gt;Palindromic strings are celebrities in the world of string algorithms, always coming with all sorts of tricky problems. Take, for example, the Longest Palindromic Substring problem. The obvious algorithm has a complexity of $O(n^2)$, while Manacher&apos;s Algorithm can give the answer in $O(n)$ time.&lt;/p&gt;
&lt;p&gt;Let&apos;s use a simple but good example to explain this algorithm: &quot;abababa&quot;, whose longest palindromic substring is itself.&lt;/p&gt;
&lt;h3&gt;The Simple Algorithm&lt;/h3&gt;
&lt;p&gt;The simple algorithm enumerates palindrome centers. For a string of length n, there are 2n + 1 palindrome centers, and computing the palindrome length for each center takes $O(n)$, giving a total of $O(n^2)$.&lt;/p&gt;
&lt;h3&gt;Manacher&apos;s Algorithm&lt;/h3&gt;
&lt;p&gt;Manacher&apos;s Algorithm leverages previously computed maximum palindrome lengths to speed up subsequent computations.&lt;/p&gt;
&lt;p&gt;First, we mark all palindrome centers as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15235560287436.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For each palindrome center, we label them from 0 to 2n, totaling 2n + 1 centers. Even-numbered ones are centers of even-length palindromes, and odd-numbered ones are centers of odd-length palindromes. We use an array LPS to record the known longest palindrome lengths.&lt;/p&gt;
&lt;p&gt;Clearly, LPS[0] is always 0, and LPS[1] is always 1. Assume we compute LPS from left to right, and the current position is the red-marked 7, with LPS[7] = 7.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;An important observation here is that 7 is the center with the farthest right boundary among all i &amp;lt;= 7, meaning it has the largest i + LPS[i]. Note that i + LPS[i] is always even.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Why must it be this one? We will explain in detail later.&lt;/p&gt;
&lt;p&gt;Now we want to compute LPS[8]. Since 8 mirrors to 6 with respect to center 7, and the maximum palindrome at 6 is 0, within the visible range [i - LPS[i], i + LPS[i]] = [0, 14], the palindrome at 6, [6 - LPS[6], 6 + LPS[6]] = [6, 6], mirrors to [8, 8] at position 8. &lt;strong&gt;It lies entirely within (0, 14) without touching the boundary.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By the symmetry of palindromes with respect to the palindrome center, LPS[8] = 0.&lt;/p&gt;
&lt;p&gt;But what if the mirrored palindrome touches or exceeds the boundary? The known length can only be min(i + LPS[i] - cur, LPS[2 * i - cur]), i.e., the minimum of the mirrored LPS value and the remaining range to the right. In this case, we are adjacent to some unknown characters, and we should &lt;strong&gt;scan the unknown characters to the right, matching them one by one with the left side&lt;/strong&gt; to accurately determine LPS[cur]. At this point, the right boundary has expanded, and we can update i to cur.&lt;/p&gt;
&lt;p&gt;The above text may be a bit convoluted, but there is a ready example: when i = 5, cur = 7, the mirror is 3, and LPS[3] = 3 = i + LPS[i] - cur, meaning it reaches the right boundary exactly. So we scan to the right, expanding the right boundary to 14, and update i = 7.&lt;/p&gt;
&lt;p&gt;Repeating the above process gives us all LPS[i] values. Return the maximum one.&lt;/p&gt;
&lt;h4&gt;Complexity&lt;/h4&gt;
&lt;p&gt;Computing LPS falls into two cases: direct computation, or computing an initial value followed by expanding the right boundary.&lt;/p&gt;
&lt;p&gt;The right boundary can expand at most to 2 * n, so it is O(n). Assignment occurs 2 * n + 1 times, which is also O(n).&lt;/p&gt;
&lt;p&gt;Therefore, the total time complexity is O(n).&lt;/p&gt;
&lt;h3&gt;Cpp&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;class Solution {
public:
    string longestPalindrome(string s) {
        if (s.empty()) return &quot;&quot;;
        int n = s.size();
        int res = 1, resi = 1;
        int N = n * 2 + 1;

        vector&amp;lt;int&amp;gt; lps(N);
        lps[1] = 1;

        int c = 1;
        for (int i = 2; i &amp;lt; N; ++i) {
            int li = 2 * c - i, r = c + lps[c];
            int to_r = r - i;

            // set initial lps[i]; when to_r &amp;lt;= 0, i lies outside the current
            // right boundary and li may be negative, so start from 0 and expand
            lps[i] = to_r &amp;gt; 0 ? min(lps[li], to_r) : 0;

            /* if (lps[li] &amp;gt;= to_r) { */

            // expand if we can
            // i + lps[i] &amp;lt; N - 1 &amp;amp;&amp;amp; i - lps[i] &amp;gt; 0 so that there are spaces for expand
            while (
                i + lps[i] &amp;lt; N - 1 &amp;amp;&amp;amp; i - lps[i] &amp;gt; 0 &amp;amp;&amp;amp;
                ((i + lps[i] + 1) % 2 == 0 || s[(i - lps[i] - 1) / 2] == s[(i + lps[i] + 1) / 2])) {
                lps[i]++;
            }

            if (i + lps[i] &amp;gt; r) {
                c = i;
            }

            /* } */

            // record result
            if (lps[i] &amp;gt; res) {
                res = lps[i];
                resi = i;
            }
        }

        return s.substr((resi - res) / 2, res);
    }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] http://articles.leetcode.com/longest-palindromic-substring-part-ii&lt;/p&gt;
&lt;p&gt;[2] https://www.felix021.com/blog/read.php?entryid=2040&lt;/p&gt;
&lt;p&gt;[3] https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-1/&lt;/p&gt;
</content:encoded></item><item><title>Longest Palindromic Substring -- Manacher&apos;s Algorithm</title><link>https://blog.crazyark.xyz/zh/posts/manachers_algorithm/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/manachers_algorithm/</guid><pubDate>Thu, 12 Apr 2018 17:50:51 GMT</pubDate><content:encoded>&lt;p&gt;Emm, I actually planned to write this post back in September 2017, but only now am I finally getting to it. If I don&apos;t write it soon, I&apos;ll forget everything...&lt;/p&gt;
&lt;p&gt;Palindromic strings are celebrities in the world of string algorithms, always coming with all sorts of tricky problems. Take, for example, the Longest Palindromic Substring problem. The obvious algorithm has a complexity of $O(n^2)$, while Manacher&apos;s Algorithm can give the answer in $O(n)$ time.&lt;/p&gt;
&lt;p&gt;Let&apos;s use a simple but good example to explain this algorithm: &quot;abababa&quot;, whose longest palindromic substring is itself.&lt;/p&gt;
&lt;h3&gt;The Simple Algorithm&lt;/h3&gt;
&lt;p&gt;The simple algorithm enumerates palindrome centers. For a string of length n, there are 2n + 1 palindrome centers, and computing the palindrome length for each center takes $O(n)$, giving a total of $O(n^2)$.&lt;/p&gt;
&lt;h3&gt;Manacher&apos;s Algorithm&lt;/h3&gt;
&lt;p&gt;Manacher&apos;s Algorithm leverages previously computed maximum palindrome lengths to speed up subsequent computations.&lt;/p&gt;
&lt;p&gt;First, we mark all palindrome centers as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15235560287436.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For each palindrome center, we label them from 0 to 2n, totaling 2n + 1 centers. Even-numbered ones are centers of even-length palindromes, and odd-numbered ones are centers of odd-length palindromes. We use an array LPS to record the known longest palindrome lengths.&lt;/p&gt;
&lt;p&gt;Clearly, LPS[0] is always 0, and LPS[1] is always 1. Assume we compute LPS from left to right, and the current position is the red-marked 7, with LPS[7] = 7.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;An important observation here is that 7 is the center with the farthest right boundary among all i &amp;lt;= 7, meaning it has the largest i + LPS[i]. Note that i + LPS[i] is always even.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Why must it be this one? We will explain in detail later.&lt;/p&gt;
&lt;p&gt;Now we want to compute LPS[8]. Since 8 mirrors to 6 with respect to center 7, and the maximum palindrome at 6 is 0, within the visible range [i - LPS[i], i + LPS[i]] = [0, 14], the palindrome at 6, [6 - LPS[6], 6 + LPS[6]] = [6, 6], mirrors to [8, 8] at position 8. &lt;strong&gt;It lies entirely within (0, 14) without touching the boundary.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By the symmetry of palindromes with respect to the palindrome center, LPS[8] = 0.&lt;/p&gt;
&lt;p&gt;But what if the mirrored palindrome touches or exceeds the boundary? The known length can only be min(i + LPS[i] - cur, LPS[2 * i - cur]), i.e., the minimum of the mirrored LPS value and the remaining range to the right. In this case, we are adjacent to some unknown characters, and we should &lt;strong&gt;scan the unknown characters to the right, matching them one by one with the left side&lt;/strong&gt; to accurately determine LPS[cur]. At this point, the right boundary has expanded, and we can update i to cur.&lt;/p&gt;
&lt;p&gt;The above text may be a bit convoluted, but there is a ready example: when i = 5, cur = 7, the mirror is 3, and LPS[3] = 3 = i + LPS[i] - cur, meaning it reaches the right boundary exactly. So we scan to the right, expanding the right boundary to 14, and update i = 7.&lt;/p&gt;
&lt;p&gt;Repeating the above process gives us all LPS[i] values. Return the maximum one.&lt;/p&gt;
&lt;h4&gt;Complexity&lt;/h4&gt;
&lt;p&gt;Computing LPS falls into two cases: direct computation, or computing an initial value followed by expanding the right boundary.&lt;/p&gt;
&lt;p&gt;The right boundary can expand at most to 2 * n, so it is O(n). Assignment occurs 2 * n + 1 times, which is also O(n).&lt;/p&gt;
&lt;p&gt;Therefore, the total time complexity is O(n).&lt;/p&gt;
&lt;h3&gt;Cpp&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;class Solution {
public:
    string longestPalindrome(string s) {
        if (s.empty()) return &quot;&quot;;
        int n = s.size();
        int res = 1, resi = 1;
        int N = n * 2 + 1;

        vector&amp;lt;int&amp;gt; lps(N);
        lps[1] = 1;

        int c = 1;
        for (int i = 2; i &amp;lt; N; ++i) {
            int li = 2 * c - i, r = c + lps[c];
            int to_r = r - i;

            // set initial lps[i]; when to_r &amp;lt;= 0, i lies outside the current
            // right boundary and li may be negative, so start from 0 and expand
            lps[i] = to_r &amp;gt; 0 ? min(lps[li], to_r) : 0;

            /* if (lps[li] &amp;gt;= to_r) { */

            // expand if we can
            // i + lps[i] &amp;lt; N - 1 &amp;amp;&amp;amp; i - lps[i] &amp;gt; 0 so that there are spaces for expand
            while (
                i + lps[i] &amp;lt; N - 1 &amp;amp;&amp;amp; i - lps[i] &amp;gt; 0 &amp;amp;&amp;amp;
                ((i + lps[i] + 1) % 2 == 0 || s[(i - lps[i] - 1) / 2] == s[(i + lps[i] + 1) / 2])) {
                lps[i]++;
            }

            if (i + lps[i] &amp;gt; r) {
                c = i;
            }

            /* } */

            // record result
            if (lps[i] &amp;gt; res) {
                res = lps[i];
                resi = i;
            }
        }

        return s.substr((resi - res) / 2, res);
    }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] http://articles.leetcode.com/longest-palindromic-substring-part-ii&lt;/p&gt;
&lt;p&gt;[2] https://www.felix021.com/blog/read.php?entryid=2040&lt;/p&gt;
&lt;p&gt;[3] https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-1/&lt;/p&gt;
</content:encoded></item><item><title>MWeb With Hugo</title><link>https://blog.crazyark.xyz/zh/posts/mweb_with_hugo/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/mweb_with_hugo/</guid><pubDate>Wed, 01 Nov 2017 10:34:30 GMT</pubDate><content:encoded>&lt;p&gt;A couple of days ago I discovered a great markdown editor called MWeb. It supports direct publishing to many blog platforms and has native support for two static site generators (hexo/jelly). However, it doesn&apos;t support Hugo. Fortunately, they all work on similar principles, so with a few tweaks it can be made to work.&lt;/p&gt;
&lt;h2&gt;Media Path&lt;/h2&gt;
&lt;p&gt;Hugo places static resource files in the static directory, which is a bit awkward for MWeb&apos;s parsing. Typically we put images in a folder like static/img, so one solution is to create a symlink for static/img:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ln -sf static/img .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then set the Media Folder Name to img and change the path to Absolute mode.&lt;/p&gt;
&lt;p&gt;This enables seamless document editing under Hugo.&lt;/p&gt;
&lt;h2&gt;MathJax&lt;/h2&gt;
&lt;p&gt;Hugo&apos;s markdown rendering module blackfriday interprets &lt;code&gt;_&lt;/code&gt; inside &lt;code&gt;$&lt;/code&gt; or $$ as &lt;code&gt;&amp;lt;em&amp;gt;&lt;/code&gt; tags, causing rendering errors. This still hasn&apos;t been fixed...&lt;/p&gt;
&lt;p&gt;Solution for MathJax-related issues:
https://gohugo.io/content-management/formats/#mathjax-with-hugo&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Now we can happily edit blog posts using MWeb!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15095349596229.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>MWeb With Hugo</title><link>https://blog.crazyark.xyz/zh/posts/mweb_with_hugo/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/mweb_with_hugo/</guid><pubDate>Wed, 01 Nov 2017 10:34:30 GMT</pubDate><content:encoded>&lt;p&gt;A couple of days ago I discovered a great markdown editor called MWeb. It supports direct publishing to many blog platforms and has native support for two static site generators (hexo/jelly). However, it doesn&apos;t support Hugo. Fortunately, they all work on similar principles, so with a few tweaks it can be made to work.&lt;/p&gt;
&lt;h2&gt;Media Path&lt;/h2&gt;
&lt;p&gt;Hugo places static resource files in the static directory, which is a bit awkward for MWeb&apos;s parsing. Typically we put images in a folder like static/img, so one solution is to create a symlink for static/img:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ln -sf static/img .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then set the Media Folder Name to img and change the path to Absolute mode.&lt;/p&gt;
&lt;p&gt;This enables seamless document editing under Hugo.&lt;/p&gt;
&lt;h2&gt;MathJax&lt;/h2&gt;
&lt;p&gt;Hugo&apos;s markdown rendering module blackfriday interprets &lt;code&gt;_&lt;/code&gt; inside &lt;code&gt;$&lt;/code&gt; or $$ as &lt;code&gt;&amp;lt;em&amp;gt;&lt;/code&gt; tags, causing rendering errors. This still hasn&apos;t been fixed...&lt;/p&gt;
&lt;p&gt;Solution for MathJax-related issues:
https://gohugo.io/content-management/formats/#mathjax-with-hugo&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Now we can happily edit blog posts using MWeb!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/15095349596229.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Game Theory -- Impartial Combinatorial Games</title><link>https://blog.crazyark.xyz/zh/posts/game_theory/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/game_theory/</guid><pubDate>Tue, 10 Oct 2017 16:40:08 GMT</pubDate><content:encoded>&lt;p&gt;Last time I said the upcoming posts would all be about graph algorithms... well, I lied.&lt;/p&gt;
&lt;p&gt;I encountered game theory while solving problems, and I really knew nothing about this area, so it was time to learn. This article mainly covers Impartial Combinatorial Games within the domain of combinatorial games. The structure and content are referenced from the game theory lecture notes by Professor Thomas S. Ferguson at UCLA, combined with my own thinking.&lt;/p&gt;
&lt;h3&gt;Take-Away Games&lt;/h3&gt;
&lt;p&gt;A combinatorial game is a two-player game where each player has complete information, there are no random moves, and the outcome is always a win or a loss. Each step of the game consists of a move. Players typically alternate making moves until a terminal state is reached. A terminal state is one from which no legal move exists.&lt;/p&gt;
&lt;p&gt;Clearly, the outcome of the game is determined from the very beginning. The result is completely determined by the set of game states, the initial state, and which player goes first.&lt;/p&gt;
&lt;p&gt;There are two forms of combinatorial games. Here we mainly discuss one form: Impartial Combinatorial Games, hereafter abbreviated as ICG. In an ICG, the moves available to both players are identical. The other form, Partizan Combinatorial Games, is where the two players have different available moves -- for example, the well-known game of chess.&lt;/p&gt;
&lt;h4&gt;A Simple Take-Away Game&lt;/h4&gt;
&lt;p&gt;Let&apos;s look at a simple game:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Suppose we have two players, labeled I and II&lt;/li&gt;
&lt;li&gt;Suppose there is a pile of chips on the table, 21 in total&lt;/li&gt;
&lt;li&gt;Each move consists of taking 1, 2, or 3 chips from the pile&lt;/li&gt;
&lt;li&gt;Starting with Player I, the two players take turns moving&lt;/li&gt;
&lt;li&gt;The player who takes the last chip wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This game is simple enough that it&apos;s easy to see Player I can always win. But if we change the number of chips to n, can we still determine who ultimately wins?&lt;/p&gt;
&lt;p&gt;Clearly, if n is between 1 and 3, Player I can always win. Suppose there are currently n chips on the table and it is Player I&apos;s turn. Suppose Player I takes m chips, leaving n - m chips on the table.&lt;/p&gt;
&lt;p&gt;Clearly, if Player II can determine that for all legal m, they can win in every n - m situation, then Player I must lose; otherwise Player I must win.&lt;/p&gt;
&lt;p&gt;Note that Player II has the same complete information as Player I. So this is really asking whether Player I, going first, can win for all n - m situations. Obviously, a bottom-up or top-down dynamic programming approach can easily answer this question.&lt;/p&gt;
&lt;h4&gt;Combinatorial Game&lt;/h4&gt;
&lt;p&gt;Now let us formally define a combinatorial game:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A game between two players&lt;/li&gt;
&lt;li&gt;The game has a state set representing all possible states during play, usually finite&lt;/li&gt;
&lt;li&gt;The rules describe the legal moves from each state to the next; if the rules are the same for both players, the game is impartial; otherwise it is partizan&lt;/li&gt;
&lt;li&gt;Players alternate moves&lt;/li&gt;
&lt;li&gt;When a player moves to a state from which the next player has no legal move, the game ends;
&lt;ul&gt;
&lt;li&gt;Under normal play rules, the last player to move wins&lt;/li&gt;
&lt;li&gt;The other convention is called the misère play rule, where the last player to move loses; this rule greatly increases the difficulty of analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;No matter how the game is played, it always ends in a finite number of steps&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note that the game assumes both players are sufficiently intelligent, and no random moves are allowed.&lt;/p&gt;
&lt;h4&gt;P-positions, N-positions&lt;/h4&gt;
&lt;p&gt;In the simple Take-Away game above, the game state is the number of chips remaining on the table. As we can see, the number of remaining chips determines which player ultimately wins. More generally, the current state of the game determines the winning outcome.&lt;/p&gt;
&lt;p&gt;For a game state, we define it as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;P-position: the &lt;strong&gt;P&lt;/strong&gt;revious player (the one who just moved) can win, i.e., the second player to move wins&lt;/li&gt;
&lt;li&gt;N-position: the &lt;strong&gt;N&lt;/strong&gt;ext player (the one about to move) can win, i.e., the first player to move wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the above game, 1, 2, 3, 5, 6... are all N-positions, and 0, 4, 8, 12, 16... are all P-positions.&lt;/p&gt;
&lt;p&gt;Clearly, by definition:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For every P-position, every legal move leads to an N-position&lt;/li&gt;
&lt;li&gt;For every N-position, there exists a legal move that leads to a P-position&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the above game, the N/P labels for each state are as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;5&lt;/th&gt;
&lt;th&gt;6&lt;/th&gt;
&lt;th&gt;7&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;th&gt;9&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;position&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Subtraction Games&lt;/h4&gt;
&lt;p&gt;Let&apos;s look at a slightly harder problem. Let S be a set of positive integers and n be a large positive integer. Suppose there is a pile of chips on the table totaling n, and two players take turns removing x chips, where x must be some number in S. The player who takes the last chip wins.&lt;/p&gt;
&lt;p&gt;Through N/P-position analysis, we can easily determine the outcome for all states.&lt;/p&gt;
&lt;h3&gt;The Game of Nim&lt;/h3&gt;
&lt;p&gt;The most famous class of Take-Away games is the Nim game, with the following rules:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Suppose there are three piles of chips with x1, x2, x3 chips respectively&lt;/li&gt;
&lt;li&gt;Two players take turns removing chips; each turn a player must take at least one chip from exactly one pile, and at most the entire pile&lt;/li&gt;
&lt;li&gt;The player who takes the last chip wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Nim is not necessarily played with three piles -- it can be any number. For convenience, we analyze the three-pile special case.&lt;/p&gt;
&lt;p&gt;We represent the state as a triple (a, b, c), where each number represents the remaining chips in the corresponding pile.&lt;/p&gt;
&lt;p&gt;Clearly, the terminal state is (0, 0, 0), which is a P-position.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For a single-pile Nim game (0, 0, x) where x &amp;gt; 0, the first player clearly wins, so it is an N-position.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For a two-pile Nim game, all states of the form (0, x, x) are P-positions, and all others are N-positions: suppose the initial state is (0, x, x), meaning the two piles are equal. The first player must take some chips, making the piles unequal. The second player can then take the same number of chips from the other pile to restore equality. Eventually the second player must win.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For a three-pile Nim game, simple analysis becomes difficult. Next we introduce an elegant analysis method that can easily analyze any Nim game.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Nim-Sum &amp;amp; Bouton&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;For a Nim game with m piles, suppose the chip counts are x1, x2, x3 ... xm. Then (x1, x2, x3 ... xm) is a P-position if and only if x1 ^ x2 ^ x3 ... ^ xm = 0, where ^ denotes the XOR operation. The XOR sum here is called the &lt;strong&gt;Nim-Sum&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The above theorem is known as &lt;strong&gt;Bouton&apos;s Theorem&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, let us note some properties of the XOR operation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Self-inverse: x ^ x = 0&lt;/li&gt;
&lt;li&gt;Associativity: a ^ (b ^ c) = (a ^ b) ^ c&lt;/li&gt;
&lt;li&gt;Commutativity: a ^ b = b ^ a&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now let us prove the theorem. To prove it, we only need to verify the conditions for P and N-positions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;All terminal positions are P-positions. Since the only terminal position is (0, 0, ... 0), this is obvious.&lt;/li&gt;
&lt;li&gt;For every N-position, there exists a move leading to a P-position. Suppose the current state is (x1, x2, x3 ..., xm) with Nim-Sum a = x1 ^ x2 ^ x3 ... ^ xm, and a is not 0. Let the highest bit of a be bit k. Then among x1, x2 ... xm, the number of values with a 1 in bit k is odd, meaning at least one such value exists. Without loss of generality, suppose it is x1. Let x1&apos; be the number formed by keeping the bits of x1 above position k, setting bit k to 0, and appending the lower bits of a. Clearly x1&apos; &amp;lt; x1, so moving from (x1, x2, x3 ..., xm) to (x1&apos;, x2, x3 ..., xm) is a legal move. Now x1&apos; ^ x2 ^ x3 ... ^ xm = a ^ x1&apos; ^ x1 = 0, so (x1&apos;, x2, x3 ..., xm) is a P-position.&lt;/li&gt;
&lt;li&gt;For every P-position, every move leads to an N-position. This can be proved by contradiction.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Q.E.D.&lt;/p&gt;
&lt;h4&gt;Misère Nim&lt;/h4&gt;
&lt;p&gt;For Nim under the misère rule, Bouton gave the following strategy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When there are still at least two piles with more than 1 chip, play the same as under normal rules&lt;/li&gt;
&lt;li&gt;Otherwise there is only one pile with more than 1 chip remaining. Take from that pile to leave either 0 or 1 chip, ensuring an odd number of piles remain, each with exactly 1 chip, thus guaranteeing a win.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Graph Games&lt;/h3&gt;
&lt;p&gt;Finally we reach this section, which introduces an important analytical tool for Impartial Combinatorial Games called the Sprague-Grundy function.&lt;/p&gt;
&lt;p&gt;First, we convert the game into a directed graph G(X, F), where X is the set of all game states and F represents the move relations between states. For a state x, F(x) represents the set of legal next states, i.e., the set of moves.&lt;/p&gt;
&lt;p&gt;If F(x) is empty for a node x, then x is a terminal node.&lt;/p&gt;
&lt;p&gt;For analytical convenience, we assume such graph games are &lt;strong&gt;progressively bounded&lt;/strong&gt;: there exists an integer n such that every path in the graph has length at most n.&lt;/p&gt;
&lt;h4&gt;Sprague-Grundy Function&lt;/h4&gt;
&lt;p&gt;Define the SG-function on graph G(X,F) as follows:&lt;/p&gt;
&lt;p&gt;$g(x) = \min\{n \ge 0 : n \ne g(y)\ \textrm{for all}\ y \in F(x)\}$&lt;/p&gt;
&lt;p&gt;That is, g(x) is the smallest non-negative integer not appearing among the SG-values of x&apos;s successors. We define an operation called the minimal excludant, abbreviated as mex, which gives the smallest non-negative integer not in a given set of non-negative integers. Then g(x) can be written as:&lt;/p&gt;
&lt;p&gt;$g(x) = \mathrm{mex}\{g(y) : y \in F(x)\}$&lt;/p&gt;
&lt;p&gt;Note that g(x) is defined recursively. For a terminal node x, g(x) = 0.&lt;/p&gt;
&lt;h4&gt;The Use of the Sprague-Grundy Function&lt;/h4&gt;
&lt;p&gt;Let g be the SG-function on graph game G(X, F). We have the following properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If x is a terminal node, then g(x) = 0&lt;/li&gt;
&lt;li&gt;If g(x) = 0, then for every successor y of x, g(y) != 0&lt;/li&gt;
&lt;li&gt;If g(x) != 0, then there exists a successor y of x with g(y) = 0&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These three conclusions follow directly from the definition of g.&lt;/p&gt;
&lt;p&gt;Now we can say that for any node with g(x) = 0, the corresponding state x is a P-position, and all other nodes are N-positions.&lt;/p&gt;
&lt;p&gt;In fact, the SG-function gives the winning strategy: from any position with a nonzero SG-value, always move to a successor whose SG-value is 0.&lt;/p&gt;
&lt;p&gt;This theory can also be extended to &lt;strong&gt;progressively finite&lt;/strong&gt; graphs: every path in the graph is finite.&lt;/p&gt;
&lt;h3&gt;Sums of Combinatorial Games&lt;/h3&gt;
&lt;p&gt;Suppose we now have several combinatorial games and we want to combine them into one large game:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A player&apos;s move becomes: choose one of the non-terminated games and make a legal move&lt;/li&gt;
&lt;li&gt;The terminal state is: all games have terminated&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The resulting combined game is called the disjunctive sum of the games.&lt;/p&gt;
&lt;p&gt;Suppose the games being combined are represented as graphs G1(X1, F1), G2(X2, F2) ..., Gn(Xn, Fn). We can combine them into a new graph G(X, F), where $X = X_1 \times X_2 \times \cdots \times X_n$ is the Cartesian product. Let x = (x1, ... xn), then
$F(x) = F(x_1, \ldots, x_n) = \bigl(F_1(x_1) \times \{x_2\} \times \cdots \times \{x_n\}\bigr) \cup \bigl(\{x_1\} \times F_2(x_2) \times \cdots \times \{x_n\}\bigr) \cup \cdots \cup \bigl(\{x_1\} \times \{x_2\} \times \cdots \times F_n(x_n)\bigr)$&lt;/p&gt;
&lt;h4&gt;The Sprague-Grundy Theorem&lt;/h4&gt;
&lt;p&gt;In the combined graph game, we can compute the SG-function g as follows: let the original games&apos; SG-functions be g1, g2, ..., gn. Then for x = (x1, x2, ..., xn), g(x) = g1(x1) ^ g2(x2) ^ ... ^ gn(xn), i.e., the Nim-sum of the SG-values of the current states of the sub-games.&lt;/p&gt;
&lt;p&gt;Let x = (x1, x2, ..., xn) be any point in the graph, and b = g1(x1) ^ g2(x2) ^ ... ^ gn(xn). In fact, we only need to prove the following two points to establish the above conclusion:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For every non-negative integer a &amp;lt; b, there exists a successor y of x such that g(y) = a&lt;/li&gt;
&lt;li&gt;No successor of x has value b&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The proof here is similar to the Nim-sum proof in the previous section, and is left as an exercise.&lt;/p&gt;
&lt;p&gt;In fact, in Nim, the SG-function for each pile is clearly g(n) = n, so the N/P-position determination in Nim is a special case of Sprague-Grundy.&lt;/p&gt;
&lt;p&gt;We can also say that &lt;strong&gt;every ICG is equivalent to some Nim game&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Green Hackenbush&lt;/h3&gt;
&lt;p&gt;Now let&apos;s introduce a combinatorial game called Hackenbush. The gist of this game is to chop edges from a rooted graph and remove any parts that become disconnected from the ground. The name &quot;Hackenbush&quot; is quite literal -- think of it as a game of hacking at bushes.&lt;/p&gt;
&lt;p&gt;The impartial version of this game is called Green Hackenbush: every edge in the graph is green, and both players can chop any edge. The partizan version is called Blue-Red Hackenbush, where edges are either blue or red -- Player I can only chop blue edges, and Player II can only chop red edges. In the most general version, edges can be any of the three colors, with green edges being choppable by either player.&lt;/p&gt;
&lt;h4&gt;Bamboo Stalks&lt;/h4&gt;
&lt;p&gt;Let&apos;s first look at a simple special case of Green Hackenbush. In this case, each rooted connected component is a straight line, like a pillar, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/bamboo_stalks.png&quot; alt=&quot;Bamboo Stalks&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each move consists of choosing an edge, chopping it, and removing all parts no longer connected to the ground. Players alternate moves, and the last player to move wins.&lt;/p&gt;
&lt;p&gt;Clearly, this game with n stalks is equivalent to a Nim game with n piles.&lt;/p&gt;
&lt;h4&gt;Green Hackenbush on Trees&lt;/h4&gt;
&lt;p&gt;As the name suggests, each bush looks like a tree:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/green_hackenbush_on_trees.png&quot; alt=&quot;Green Hackenbush on Trees&quot; /&gt;&lt;/p&gt;
&lt;p&gt;By the theorem from the previous section, this game is equivalent to some version of Bamboo Stalks. We can perform the conversion using the following principle, more generally known as the Colon Principle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When branches meet at a vertex, they can be replaced by a single stalk whose length is the &lt;strong&gt;Nim-sum of the branch lengths&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/colon_principle.png&quot; alt=&quot;Colon Principle&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The proof of the Colon Principle should be relatively straightforward and is left as an exercise.&lt;/p&gt;
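&lt;p&gt;For readers who prefer code to pictures, the Colon Principle yields a one-line recursion for the Grundy value of a rooted tree: each branch (an edge plus the subtree above it) behaves like a stalk of length g(child) + 1, and branches meeting at a vertex combine by Nim-sum. A small sketch of my own:&lt;/p&gt;

```python
def tree_grundy(children, v=0):
    """Grundy value of the rooted tree above vertex v, by the Colon
    Principle: branches at a vertex combine by Nim-sum, and an edge on
    top of a stalk of length L gives a stalk of length L + 1."""
    value = 0
    for c in children.get(v, []):
        value ^= tree_grundy(children, c) + 1
    return value
```

&lt;p&gt;A bare path 0-1-2-3 gives value 3, matching a stalk of three edges, while two single-edge branches at the root cancel to 0.&lt;/p&gt;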
&lt;h4&gt;Green Hackenbush on General Rooted Graphs&lt;/h4&gt;
&lt;p&gt;Now let&apos;s consider arbitrary graphs, which may contain cycles, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/arbitrary_green_hackenbush.png&quot; alt=&quot;Arbitrary Green Hackenbush&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Though the graph looks complex, each connected component can similarly be converted into a single pile in a Nim game. If we can convert each component into a tree, then we can use the method above to convert it into an equivalent pile.&lt;/p&gt;
&lt;p&gt;Here we introduce a conversion method called the Fusion Principle: nodes on a cycle can be merged without changing the SG-function value of the graph.&lt;/p&gt;
&lt;p&gt;As shown below, we can convert a gate-shaped cycle into a stalk of length 1:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/fusion_principle_ex1.png&quot; alt=&quot;Fusion Principle Example&quot; /&gt;&lt;/p&gt;
&lt;p&gt;First, the two nodes on the ground can actually be merged into one node. Then, applying the fusion principle, this triangle is equivalent to three independent self-loops. Each self-loop is equivalent to a Nim pile of size 1, which ultimately combines into a single pile of size 1.&lt;/p&gt;
&lt;p&gt;More generally, odd-length cycles can be converted into a single edge, and even-length cycles can be converted into a single point.&lt;/p&gt;
&lt;p&gt;The proof of the fusion principle is rather long. Interested readers can refer to Chapter 7 of &lt;em&gt;&lt;strong&gt;&quot;Winning Ways&quot;&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;A Problem&lt;/h3&gt;
&lt;p&gt;Here is a concrete algorithm problem, Bob&apos;s Game. The original problem is at https://www.hackerrank.com/contests/university-codesprint-3/challenges/bobs-game/problem.&lt;/p&gt;
&lt;p&gt;Problem summary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob invented a new game played on an n x n board. The rules are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Initially, there are some kings on the board. Multiple kings can occupy the same cell.&lt;/li&gt;
&lt;li&gt;Kings can only move &lt;strong&gt;up, left, or upper-left&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Some cells on the board are damaged, and kings cannot move onto damaged cells&lt;/li&gt;
&lt;li&gt;Kings cannot move off the board&lt;/li&gt;
&lt;li&gt;The game is played by two players who alternate turns&lt;/li&gt;
&lt;li&gt;Each turn, a player moves one king&lt;/li&gt;
&lt;li&gt;The last player to move wins&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bob plays this game with his good friend Alice. Bob goes first. Given the initial state of the board, determine the number of first moves that guarantee Bob&apos;s victory. If Bob must lose, output LOSE.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If there is only one king on the board, this game can be easily analyzed using the SG-function to determine whether Bob wins and what his winning first move is.&lt;/p&gt;
&lt;p&gt;Now there are multiple kings on the board, but the kings do not interfere with each other, so this game is the disjunctive sum of multiple single-king games. Applying the Sprague-Grundy theorem, we can derive the SG-function for this game.&lt;/p&gt;
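&lt;p&gt;That plan can be sketched directly (my own illustration, with hypothetical input conventions): compute the SG-value of every cell by taking the mex over the three king moves, XOR the values under the kings, and count the moves that restore a zero Nim-sum:&lt;/p&gt;

```python
def count_winning_first_moves(board, kings):
    """board: 2D list of booleans, True = damaged cell (hypothetical format).
    kings: list of (row, col); several kings may share a cell.
    Returns the number of winning first moves, or 'LOSE'."""
    n = len(board)
    g = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if board[r][c]:
                g[r][c] = -1                  # damaged cell: not a position
                continue
            seen = set()
            for dr, dc in ((-1, 0), (0, -1), (-1, -1)):   # up, left, upper-left
                rr, cc = r + dr, c + dc
                if rr >= 0 and cc >= 0 and g[rr][cc] != -1:
                    seen.add(g[rr][cc])
            v = 0                             # mex of the reachable SG-values
            while v in seen:
                v += 1
            g[r][c] = v
    total = 0
    for r, c in kings:
        total ^= g[r][c]                      # Sprague-Grundy theorem: Nim-sum
    if total == 0:
        return 'LOSE'
    wins = 0
    for r, c in kings:
        rest = total ^ g[r][c]                # Nim-sum of the other kings
        for dr, dc in ((-1, 0), (0, -1), (-1, -1)):
            rr, cc = r + dr, c + dc
            # A first move wins iff it leaves the overall Nim-sum at zero.
            if rr >= 0 and cc >= 0 and g[rr][cc] != -1 and g[rr][cc] == rest:
                wins += 1
    return wins
```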
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] http://www.cs.cmu.edu/afs/cs/academic/class/15859-f01/www/notes/comb.pdf&lt;/p&gt;
</content:encoded></item><item><title>Game Theory -- Impartial Combinatorial Games</title><link>https://blog.crazyark.xyz/zh/posts/game_theory/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/game_theory/</guid><pubDate>Tue, 10 Oct 2017 16:40:08 GMT</pubDate><content:encoded>&lt;p&gt;Last time I said the upcoming posts would all be about graph algorithms... Well, I&apos;m going to break that promise 🙃&lt;/p&gt;
&lt;p&gt;I ran into game theory while solving problems, a topic I knew absolutely nothing about, so it was time to learn. This article mainly introduces Impartial Combinatorial Games; its structure and content follow Professor Thomas S. Ferguson&apos;s game theory lecture notes (the CMU course notes), combined with my own thinking.&lt;/p&gt;
&lt;h3&gt;Take-Away Games&lt;/h3&gt;
&lt;p&gt;A combinatorial game is a two-player game in which each player has complete information, there are no chance moves, and the outcome is always a win or a loss. Each step of the game consists of a move; the players usually move alternately until a terminal state is reached. A terminal state is a state from which no move to another state is possible.&lt;/p&gt;
&lt;p&gt;Clearly, the outcome of the game is decided from the very start: it is completely determined by the game&apos;s state set, its initial state, and which player moves first.&lt;/p&gt;
&lt;p&gt;Combinatorial games come in two forms. Here we mainly discuss one of them, Impartial Combinatorial Games, abbreviated ICG below. In an ICG, the moves available to the two players are exactly the same. The other form, Partizan Combinatorial Games, gives the two players different moves -- Chinese chess, which we all know, for example.&lt;/p&gt;
&lt;h4&gt;A Simple Take-Away Game&lt;/h4&gt;
&lt;p&gt;Let&apos;s look at a simple little game:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Suppose we have two players, labeled I and II&lt;/li&gt;
&lt;li&gt;Suppose there is a pile of chips on the table, 21 in total&lt;/li&gt;
&lt;li&gt;Each move consists of taking 1, 2, or 3 chips from the pile&lt;/li&gt;
&lt;li&gt;Starting with Player I, the two players move alternately&lt;/li&gt;
&lt;li&gt;The player who removes the last chip wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This game is simple enough that it is easy to see Player I can always win. But if we change the number of chips to n, can we still determine who wins in the end?&lt;/p&gt;
&lt;p&gt;Clearly, if n is between 1 and 3, Player I always wins. Suppose there are now n chips on the table and it is Player I&apos;s turn. If Player I takes m chips, n - m chips remain.&lt;/p&gt;
&lt;p&gt;Clearly, if Player II can guarantee a win from n - m for every legal m, then Player I must lose; otherwise Player I can force a win.&lt;/p&gt;
&lt;p&gt;Note that Player II has complete information just like Player I, so this really asks whether a first-moving player can win at every n - m. A bottom-up or top-down dynamic program easily answers this question.&lt;/p&gt;
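&lt;p&gt;The bottom-up DP can be sketched in a few lines (a minimal illustration of my own, not from the original post):&lt;/p&gt;

```python
def first_player_wins(n, takes=(1, 2, 3)):
    """Bottom-up DP: win[k] is True iff the player to move wins with k chips."""
    win = [False] * (n + 1)
    for k in range(1, n + 1):
        # The mover wins iff some legal take leaves the opponent losing.
        win[k] = any(k >= t and not win[k - t] for t in takes)
    return win[n]
```

&lt;p&gt;For takes of 1, 2, or 3, the losing positions turn out to be exactly the multiples of 4.&lt;/p&gt;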
&lt;h4&gt;Combinatorial Game&lt;/h4&gt;
&lt;p&gt;Now let&apos;s define a combinatorial game formally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It is a game between two players&lt;/li&gt;
&lt;li&gt;The game has a set of states representing every position that can arise during play, usually a finite set&lt;/li&gt;
&lt;li&gt;The rules describe, for each state, the legal moves a player can make to a next state; if the rules are the same for both players the game is impartial, otherwise it is partizan&lt;/li&gt;
&lt;li&gt;The players move alternately&lt;/li&gt;
&lt;li&gt;The game ends when a player moves to a state from which the next player to move has no legal move;
&lt;ul&gt;
&lt;li&gt;under the normal play rule, the last player to move wins&lt;/li&gt;
&lt;li&gt;the other rule is the misère play rule, under which the last player to move loses; this rule greatly increases the difficulty of analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;No matter how the game is played, it always ends within a finite number of moves&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note that the game assumes both players play optimally, and no random moves are allowed.&lt;/p&gt;
&lt;h4&gt;P-positions, N-positions&lt;/h4&gt;
&lt;p&gt;In the simple Take-Away game above, the state of the game is the number of chips left on the table, and as we saw, that number determines which player eventually wins. More generally, the current state of the game determines the outcome for the players.&lt;/p&gt;
&lt;p&gt;For a game state, we define it as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;a P-position if, from this state, the &lt;strong&gt;P&lt;/strong&gt;revious player (the one who just moved) can win -- that is, the second player to move wins&lt;/li&gt;
&lt;li&gt;an N-position if, from this state, the &lt;strong&gt;N&lt;/strong&gt;ext player to move can win -- that is, the first player to move wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the game above, 1, 2, 3, 5, 6 ... are N-positions, and 0, 4, 8, 12, 16 ... are P-positions.&lt;/p&gt;
&lt;p&gt;Clearly, from the definitions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;From every P-position, every legal move leads to an N-position&lt;/li&gt;
&lt;li&gt;From every N-position, there exists a legal move to a P-position&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the game above, the N and P labels of the states are as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;5&lt;/th&gt;
&lt;th&gt;6&lt;/th&gt;
&lt;th&gt;7&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;th&gt;9&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;position&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;P&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Subtraction Games&lt;/h4&gt;
&lt;p&gt;Let&apos;s look at a slightly harder problem. Let S be a set of positive integers and n a fairly large positive integer. There is a pile of chips on the table, n in total; the two players alternately take x chips from it, where x must be some number in S, and the player who takes last wins.&lt;/p&gt;
&lt;p&gt;By labeling the states N or P as above, we can easily work out the outcome from every state.&lt;/p&gt;
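&lt;p&gt;As a quick sketch (my own illustration), the N/P labeling for an arbitrary subtraction set S follows directly from the two defining conditions:&lt;/p&gt;

```python
def positions(n, S):
    """Label states 0..n: 'P' if the player who just moved wins, else 'N'."""
    label = ['P'] * (n + 1)          # states with no move to a P-position stay P
    for k in range(1, n + 1):
        # k is an N-position iff some legal move lands on a P-position
        if any(k >= x and label[k - x] == 'P' for x in S):
            label[k] = 'N'
    return label
```

&lt;p&gt;For example, with S = {1, 3, 4} the labels repeat with period 7: the P-positions are exactly n ≡ 0 or 2 (mod 7).&lt;/p&gt;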
&lt;h3&gt;The Game of Nim&lt;/h3&gt;
&lt;p&gt;The most famous class of Take-Away games is the game of Nim, played as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Suppose there are three piles of chips, of sizes x1, x2, x3&lt;/li&gt;
&lt;li&gt;The two players alternately remove chips; each move must take at least one chip from a single pile, and at most the whole pile&lt;/li&gt;
&lt;li&gt;The player who takes the last chip wins&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Nim need not be played with three piles -- any number will do -- but for convenience we analyze the three-pile special case.&lt;/p&gt;
&lt;p&gt;We represent a state as a triple (a, b, c), where each number is the count of chips remaining in the corresponding pile.&lt;/p&gt;
&lt;p&gt;Clearly, the terminal state is (0, 0, 0), which is a P-pos.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For one-pile Nim (0, 0, x), x &amp;gt; 0, the first player obviously wins, so it is an N-pos.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For two-pile Nim, every state (0, x, x) is a P-pos and every other state is an N-pos: if the two piles are equal, the player to move must make them unequal, and the opponent can restore equality by removing the same number of chips from the other pile; in the end, the second player is guaranteed to win.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For three-pile Nim, a simple direct analysis is much harder. Next we introduce an elegant method that easily analyzes any Nim game.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Nim-Sum &amp;amp; Bouton&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;For a Nim game with m piles, with pile sizes x1, x2, x3 ... xm, the state (x1, x2, x3 ... xm) is a P-pos if and only if x1 ^ x2 ^ x3 ... ^ xm = 0, where ^ is the XOR operation; this XOR sum is called the &lt;strong&gt;Nim-Sum&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This result is known as &lt;strong&gt;Bouton&apos;s Theorem&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, a few properties of XOR:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Self-inverse: x ^ x = 0&lt;/li&gt;
&lt;li&gt;Associativity: a ^ (b ^ c) = (a ^ b) ^ c&lt;/li&gt;
&lt;li&gt;Commutativity: a ^ b = b ^ a&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now let&apos;s prove the theorem. It suffices to verify the defining conditions of P- and N-positions above:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Every terminal position is a P-pos: the only terminal position is (0, 0, ... 0), so this is obvious.&lt;/li&gt;
&lt;li&gt;From every N-pos, there is a move to a P-pos. Suppose the current state is (x1, x2, x3 ..., xm) with Nim-Sum a = x1 ^ x2 ^ x3 ... ^ xm, a not equal to 0. Let the highest set bit of a be bit k; then the number of xi with bit k set is odd, so at least one of them has bit k set -- without loss of generality, say x1. Let x1&apos; = x1 ^ a: bit k of x1 flips from 1 to 0 and only bits below k change, so x1&apos; &amp;lt; x1, and (x1, x2, x3 ..., xm) to (x1&apos;, x2, x3 ..., xm) is a legal move. Then x1&apos; ^ x2 ^ x3 ... ^ xm = a ^ x1&apos; ^ x1 = 0, so (x1&apos;, x2, x3 ..., xm) is a P-pos.&lt;/li&gt;
&lt;li&gt;From every P-pos, every move leads to an N-pos. A proof by contradiction works here: if a move from a zero-Nim-Sum state reached another zero-Nim-Sum state, the changed pile would satisfy xi&apos; = xi, contradicting that a move must remove at least one chip.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;QED.&lt;/p&gt;
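&lt;p&gt;The constructive step of the proof doubles as an algorithm for finding a winning move (a small sketch of my own):&lt;/p&gt;

```python
def nim_winning_move(piles):
    """Return (pile_index, new_size) for a move to a zero-Nim-sum state,
    or None if the position is already a P-position."""
    a = 0
    for x in piles:
        a ^= x
    if a == 0:
        return None                  # P-position: every move loses
    for i, x in enumerate(piles):
        if x > x ^ a:                # bit k of x is set, so x ^ a is smaller
            return i, x ^ a
```

&lt;p&gt;For instance, from piles (3, 4, 5) the Nim-Sum is 2, and reducing the first pile from 3 to 1 restores a Nim-Sum of 0.&lt;/p&gt;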
&lt;h4&gt;Misère Nim&lt;/h4&gt;
&lt;p&gt;For Nim under the misère play rule, Bouton also gave the winning strategy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;While at least two piles still have more than one chip, move exactly as in normal play&lt;/li&gt;
&lt;li&gt;Otherwise, exactly one pile has more than one chip; take from that pile so that it is left with 0 or 1 chips, whichever leaves an odd number of piles of exactly one chip on the table -- this guarantees the win.&lt;/li&gt;
&lt;/ol&gt;
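&lt;p&gt;Bouton&apos;s strategy implies a simple win test for misère Nim (my own sketch):&lt;/p&gt;

```python
def misere_first_player_wins(piles):
    """First player wins misère Nim iff some pile exceeds 1 and the Nim-sum
    is nonzero, or all piles are 0/1 and the number of 1-piles is even."""
    nim_sum = 0
    for x in piles:
        nim_sum ^= x
    if max(piles) > 1:
        return nim_sum != 0          # same test as in normal play
    return nim_sum == 0              # all piles 0/1: win iff evenly many ones
```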
&lt;h3&gt;Graph Games&lt;/h3&gt;
&lt;p&gt;At last we arrive here. This section introduces an important analysis tool for Impartial Combinatorial Games, called the Sprague-Grundy function.&lt;/p&gt;
&lt;p&gt;First we turn the game into a directed graph G(X, F), where X is the set of all game states and F describes the moves between states: for a state x, F(x) is the set of states reachable from x in one legal move.&lt;/p&gt;
&lt;p&gt;If F(x) is the empty set for a node x, then x is a terminal node.&lt;/p&gt;
&lt;p&gt;For convenience of analysis, we assume such graph games are &lt;strong&gt;progressively bounded&lt;/strong&gt;: there exists an integer n such that no path in the graph has length greater than n.&lt;/p&gt;
&lt;h4&gt;Sprague-Grundy Function&lt;/h4&gt;
&lt;p&gt;The SG-function on the graph G(X, F) is defined as follows:&lt;/p&gt;
&lt;p&gt;$g(x) = \min\{n \ge 0: n \ne g(y)\ \textrm{for}\ y \in F(x)\}$&lt;/p&gt;
&lt;p&gt;That is, g(x) is the first nonnegative integer that does not appear among the SG-values of x&apos;s successors. We define an operation called the minimal excludant, written mex, which yields the first nonnegative integer not occurring in a given set of nonnegative integers; then g(x) can be written as:&lt;/p&gt;
&lt;p&gt;$g(x) = \mathrm{mex}\{g(y): y \in F(x)\}$&lt;/p&gt;
&lt;p&gt;Note that g(x) is defined recursively, and for a terminal node x, g(x) = 0.&lt;/p&gt;
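&lt;p&gt;The recursive definition translates directly into code (a minimal sketch; the take-away game with moves {1, 2, 3} serves as the example graph):&lt;/p&gt;

```python
from functools import lru_cache

def mex(values):
    """Smallest nonnegative integer not contained in `values`."""
    s = set(values)
    n = 0
    while n in s:
        n += 1
    return n

# Example graph game: F(x) = {x - 1, x - 2, x - 3}, clipped at 0.
@lru_cache(maxsize=None)
def g(x):
    return mex(g(x - t) for t in (1, 2, 3) if x >= t)
```

&lt;p&gt;Terminal nodes have no successors, so mex of the empty set gives g(0) = 0, and the values cycle 0, 1, 2, 3 as expected.&lt;/p&gt;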
&lt;h4&gt;The Use of the Sprague-Grundy Function&lt;/h4&gt;
&lt;p&gt;Let g be the SG-function on the graph game G(X, F). We have the following properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If x is a terminal node, then g(x) = 0&lt;/li&gt;
&lt;li&gt;If g(x) = 0, then g(y) != 0 for every successor y of x&lt;/li&gt;
&lt;li&gt;If g(x) != 0, then x has a successor y with g(y) = 0&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All three statements follow immediately from the definition of g.&lt;/p&gt;
&lt;p&gt;We can now say: every node x with g(x) = 0 corresponds to a P-pos, and every other node is an N-pos.&lt;/p&gt;
&lt;p&gt;In fact, the SG-function exhibits the winning line of play: keep moving so that the SG-value of the current state alternately returns to 0.&lt;/p&gt;
&lt;p&gt;The theory also extends to &lt;strong&gt;progressively finite&lt;/strong&gt; graphs: graphs in which every path is finite.&lt;/p&gt;
&lt;h3&gt;Sums of Combinatorial Games&lt;/h3&gt;
&lt;p&gt;Suppose we have several combinatorial games and want to merge them into one big game:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A player&apos;s move becomes: choose one of the games that has not yet terminated and make a legal move in it&lt;/li&gt;
&lt;li&gt;The terminal state is: every component game has terminated&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The game merged this way is called the disjunctive sum of the games.&lt;/p&gt;
&lt;p&gt;Suppose the games being merged are represented by the graphs G1(X1, F1), G2(X2, F2), ..., Gn(Xn, Fn). We can merge them into a new graph G(X, F), where $X = X_1 \times X_2 \times \cdots \times X_n$ is the Cartesian product; letting x = (x1, ..., xn), we have
$F(x) = F(x_1, \ldots, x_n) = F_1(x_1) \times \{x_2\} \times \cdots \times \{x_n\} \cup \{x_1\} \times F_2(x_2) \times \cdots \times \{x_n\} \cup \cdots \cup \{x_1\} \times \{x_2\} \times \cdots \times F_n(x_n)$&lt;/p&gt;
&lt;h4&gt;The Sprague-Grundy Theorem&lt;/h4&gt;
&lt;p&gt;The SG-function g of the merged graph game can be obtained as follows: if the SG-functions of the original games are g1, g2, ..., gn, then for x = (x1, x2, ..., xn), g(x) = g1(x1) ^ g2(x2) ^ ... ^ gn(xn) -- the Nim-sum of the SG-values of the current states of the subgames.&lt;/p&gt;
&lt;p&gt;Let x = (x1, x2, ..., xn) be any node of the graph and b = g1(x1) ^ g2(x2) ^ ... ^ gn(xn). To obtain the result, it suffices to prove the following two points:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For every nonnegative integer a &amp;lt; b, x has a successor y with g(y) = a&lt;/li&gt;
&lt;li&gt;No successor of x has SG-value b&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The proof is similar to the Nim-sum proof in the previous section, so let&apos;s leave it as an exercise.&lt;/p&gt;
&lt;p&gt;In fact, in Nim each pile obviously has SG-function g(n) = n, so deciding N- and P-positions in Nim is a special case of the Sprague-Grundy theorem.&lt;/p&gt;
&lt;p&gt;Put another way, &lt;strong&gt;every ICG is equivalent to some Nim game&lt;/strong&gt;.&lt;/p&gt;
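&lt;p&gt;As a tiny illustration of the theorem (my own sketch), consider the disjunctive sum of several subtraction games with S = {1, 2, 3}, whose single-pile SG-value is g(n) = n % 4:&lt;/p&gt;

```python
def sum_is_p_position(piles):
    """P-position test for a sum of subtraction games with S = {1, 2, 3}."""
    total = 0
    for n in piles:
        total ^= n % 4               # Nim-sum of the component SG-values
    return total == 0
```

&lt;p&gt;The combined game behaves exactly like Nim played on the SG-values, so (1, 2, 3) is a P-position (1 ^ 2 ^ 3 = 0) even though no pile is empty.&lt;/p&gt;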
&lt;h3&gt;Green Hackenbush&lt;/h3&gt;
&lt;p&gt;Now let&apos;s introduce a combinatorial game called Hackenbush. The gist of this game is to chop edges from a rooted graph and remove any parts that become disconnected from the ground. The name &quot;Hackenbush&quot; is quite literal -- think of it as a game of hacking at bushes.&lt;/p&gt;
&lt;p&gt;The impartial version of this game is called Green Hackenbush: every edge in the graph is green, and both players can chop any edge. The partizan version is called Blue-Red Hackenbush, where edges are either blue or red -- Player I can only chop blue edges, and Player II can only chop red edges. In the most general version, edges can be any of the three colors, with green edges being choppable by either player.&lt;/p&gt;
&lt;h4&gt;Bamboo Stalks&lt;/h4&gt;
&lt;p&gt;Let&apos;s first look at a simple special case of Green Hackenbush. In this case, each rooted connected component is a straight line, like a pillar, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/bamboo_stalks.png&quot; alt=&quot;Bamboo Stalks&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each move consists of choosing an edge, chopping it, and removing all parts no longer connected to the ground. Players alternate moves, and the last player to move wins.&lt;/p&gt;
&lt;p&gt;Clearly, this game with n stalks is equivalent to a Nim game with n piles.&lt;/p&gt;
&lt;h4&gt;Green Hackenbush on Trees&lt;/h4&gt;
&lt;p&gt;As the name suggests, each bush looks like a tree:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/green_hackenbush_on_trees.png&quot; alt=&quot;Green Hackenbush on Trees&quot; /&gt;&lt;/p&gt;
&lt;p&gt;By the theorem from the previous section, this game is equivalent to some version of Bamboo Stalks. We can perform the conversion using the following principle, more generally known as the Colon Principle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When branches meet at a vertex, they can be replaced by a single stalk whose length is the &lt;strong&gt;Nim-sum of the branch lengths&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/colon_principle.png&quot; alt=&quot;Colon Principle&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The proof of the Colon Principle should be relatively straightforward and is left as an exercise.&lt;/p&gt;
&lt;h4&gt;Green Hackenbush on General Rooted Graphs&lt;/h4&gt;
&lt;p&gt;Now let&apos;s consider arbitrary graphs, which may contain cycles, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/arbitrary_green_hackenbush.png&quot; alt=&quot;Arbitrary Green Hackenbush&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Though the graph looks complex, each connected component can similarly be converted into a single pile in a Nim game. If we can convert each component into a tree, then we can use the method above to convert it into an equivalent pile.&lt;/p&gt;
&lt;p&gt;Here we introduce a conversion method called the Fusion Principle: nodes on a cycle can be merged without changing the SG-function value of the graph.&lt;/p&gt;
&lt;p&gt;As shown below, we can convert a gate-shaped cycle into a stalk of length 1:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/fusion_principle_ex1.png&quot; alt=&quot;Fusion Principle Example&quot; /&gt;&lt;/p&gt;
&lt;p&gt;First, the two nodes on the ground can actually be merged into one node. Then, applying the fusion principle, this triangle is equivalent to three independent self-loops. Each self-loop is equivalent to a Nim pile of size 1, which ultimately combines into a single pile of size 1.&lt;/p&gt;
&lt;p&gt;More generally, odd-length cycles can be converted into a single edge, and even-length cycles can be converted into a single point.&lt;/p&gt;
&lt;p&gt;The proof of the fusion principle is rather long. Interested readers can refer to Chapter 7 of &lt;em&gt;&lt;strong&gt;&quot;Winning Ways&quot;&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;A Problem&lt;/h3&gt;
&lt;p&gt;Here is a concrete algorithm problem, Bob&apos;s Game. The original problem is at https://www.hackerrank.com/contests/university-codesprint-3/challenges/bobs-game/problem.&lt;/p&gt;
&lt;p&gt;Problem summary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob invented a new game played on an n x n board. The rules are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Initially, there are some kings on the board. Multiple kings can occupy the same cell.&lt;/li&gt;
&lt;li&gt;Kings can only move &lt;strong&gt;up, left, or upper-left&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Some cells on the board are damaged, and kings cannot move onto damaged cells&lt;/li&gt;
&lt;li&gt;Kings cannot move off the board&lt;/li&gt;
&lt;li&gt;The game is played by two players who alternate turns&lt;/li&gt;
&lt;li&gt;Each turn, a player moves one king&lt;/li&gt;
&lt;li&gt;The last player to move wins&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bob plays this game with his good friend Alice. Bob goes first. Given the initial state of the board, determine the number of first moves that guarantee Bob&apos;s victory. If Bob must lose, output LOSE.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If there is only one king on the board, this game can be easily analyzed using the SG-function to determine whether Bob wins and what his winning first move is.&lt;/p&gt;
&lt;p&gt;Now there are multiple kings on the board, but the kings do not interfere with each other, so this game is the disjunctive sum of multiple single-king games. Applying the Sprague-Grundy theorem, we can derive the SG-function for this game.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] http://www.cs.cmu.edu/afs/cs/academic/class/15859-f01/www/notes/comb.pdf&lt;/p&gt;
</content:encoded></item><item><title>Maximum Matching in Bipartite Graph</title><link>https://blog.crazyark.xyz/zh/posts/maximum_matching_in_bipartite_graph/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/maximum_matching_in_bipartite_graph/</guid><pubDate>Tue, 26 Sep 2017 14:26:08 GMT</pubDate><content:encoded>&lt;p&gt;The previous few blog posts were mainly about data structures for range updates and range queries. The upcoming topic is graph theory and graph algorithms. I hope to learn and recall most graph algorithms.&lt;/p&gt;
&lt;p&gt;Today I encountered a problem that can be transformed into the existence of a perfect matching in a bipartite graph. Let&apos;s first look at the theory behind perfect matching, Hall&apos;s Marriage Theorem, and then introduce two algorithms for solving the superset of perfect matching -- the maximum matching problem.&lt;/p&gt;
&lt;h3&gt;Hall&apos;s Marriage Theorem&lt;/h3&gt;
&lt;p&gt;Let $G$ denote a bipartite graph with left and right parts $X$ and $Y$ respectively. Let $W \subset X$, and $N_G(W)$ be the set of vertices in $Y$ adjacent to $W$.&lt;/p&gt;
&lt;p&gt;Then a matching that covers all of $X$ exists if and only if&lt;/p&gt;
&lt;p&gt;$\forall W \subset X, |W| \le |N_G(W)|$, meaning every subset of $X$ has enough neighbors for matching.&lt;/p&gt;
&lt;h4&gt;Deduction 1 [1]&lt;/h4&gt;
&lt;p&gt;Given a bipartite graph $G(X + Y, E)$, $|X| = |Y|$, $G$ is connected, and every vertex in $X$ has a distinct degree, then a perfect matching must exist in $G$.&lt;/p&gt;
&lt;p&gt;Proof:&lt;/p&gt;
&lt;p&gt;First, since $G$ is connected, $\forall u \in X, deg(u) \ge 1$.
Then for any $W \subset X$, the $|W|$ vertices of $W$ have $|W|$ distinct degrees, each at least 1, so $\max\limits_{u \in W}{deg(u)} \ge |W|$; and since $N_G(W)$ contains every neighbor of the vertex attaining that maximum, $|N_G(W)| \ge \max\limits_{u \in W}{deg(u)} \ge |W|$, satisfying Hall&apos;s Marriage Theorem. QED.&lt;/p&gt;
&lt;h3&gt;Hungarian Algorithm&lt;/h3&gt;
&lt;p&gt;In fact, after extensive searching, I could not find an algorithm specifically called the Hungarian Algorithm for bipartite maximum matching. The only Hungarian Algorithm I found is for the Assignment Problem, i.e., maximum matching in weighted bipartite graphs.&lt;/p&gt;
&lt;p&gt;Since the maximum matching problem and the maximum flow problem can be easily converted to each other, I also found many algorithms with similar or even identical ideas, such as the Ford-Fulkerson Algorithm, Edmonds-Karp Algorithm, etc.&lt;/p&gt;
&lt;p&gt;As for which book introduced the name &quot;Hungarian Algorithm,&quot; I can&apos;t quite remember... It was probably some graph theory book I studied. Here I will mainly discuss its specific idea and proof.&lt;/p&gt;
&lt;h4&gt;Alternating Path &amp;amp; Augmenting Path&lt;/h4&gt;
&lt;p&gt;Suppose we have a bipartite graph $G(U + V, E)$, and there is a matching $M \subset E$. If a vertex does not appear on any edge in $M$, we say the vertex is unmatched. The definition of unmatched edges is similar.&lt;/p&gt;
&lt;p&gt;Alternating path: Starting from an unmatched vertex, a path that alternates between unmatched edges, matched edges, unmatched edges... is called an alternating path.
Augmenting path: An &lt;strong&gt;alternating path&lt;/strong&gt; that starts from an unmatched vertex in $U$ and ends at an unmatched vertex in $V$ is called an augmenting path.&lt;/p&gt;
&lt;p&gt;Clearly, by the definition of an augmenting path, there is one more unmatched edge than matched edge on the path. If we flip all unmatched edges to matched and matched edges to unmatched on this path, the resulting matching $M&apos;$ is larger than the original by 1.&lt;/p&gt;
&lt;h4&gt;Theorem of Augmenting Path&lt;/h4&gt;
&lt;p&gt;A matching $M$ is a maximum matching if and only if there is no augmenting path in graph $G$.&lt;/p&gt;
&lt;p&gt;Proof:&lt;/p&gt;
&lt;p&gt;If an augmenting path exists, $M$ is obviously not maximum. So we only need to prove that when no augmenting path exists, $M$ is maximum.&lt;/p&gt;
&lt;p&gt;Assume there exists a matching $M$ with no augmenting path, and $M$ is not a maximum matching. Let $M^*$ be a maximum matching on $G$, so $|M^*| &amp;gt; |M|$.&lt;/p&gt;
&lt;p&gt;Thus $|M^* - M| &amp;gt; |M - M^*|$.&lt;/p&gt;
&lt;p&gt;Consider all edges in the symmetric difference of $M^*$ and $M$ ($M^* \cup M - M^* \cap M$), and let $G&apos;$ be the graph formed by vertices $U + V$ and these edges.&lt;/p&gt;
&lt;p&gt;Since the edges in $G&apos;$ come from two matchings, any vertex in $G&apos;$ is adjacent to at most two edges.&lt;/p&gt;
&lt;p&gt;Therefore, any connected component of $G&apos;$ can only be a path or a cycle whose edges alternate between $M^* - M$ and $M - M^*$; in particular, every cycle contains an even number of edges, half from each matching.&lt;/p&gt;
&lt;p&gt;Since $|M^* - M| &amp;gt; |M - M^*|$ and every cycle contributes equally to both sets, there must be a path with more $M^* - M$ edges than $M - M^*$ edges; such a path starts and ends with edges of $M^* - M$, at vertices unmatched by $M$. Clearly, this path is an augmenting path with respect to $M$. Contradiction!&lt;/p&gt;
&lt;p&gt;QED.&lt;/p&gt;
&lt;h4&gt;Pseudocode&lt;/h4&gt;
&lt;p&gt;The Hungarian Algorithm has two implementations, based on DFS and BFS respectively, both with time complexity $\mathcal{O}(|V||E|)$.&lt;/p&gt;
&lt;p&gt;Below is the pseudocode for the BFS version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Algorithm MaximumBipartiteMatching(G)
    initialize set M of edges // can be the empty set
    initialize queue Q with all the free vertices in V
    while not Empty(Q) do
        w ← Front(Q)
        if w ε V then
            for every vertex u adjacent to w do // u must be in U
                if u is free then // augment
                    M ← M union (w, u)
                    v ← w
                    while v is labeled do // follow the augmenting path
                        u ← label of v
                        M ← M - (v, u)  // (v, u) was in previous M
                        v ← label of u
                        M ← M union (v, u) // add the edge to the path
                    // start over
                    remove all vertex labels
                    reinitialize Q with all the free vertices in V
                    break // exit the for loop
                else // u is matched
                    if (w, u) not in M and u is unlabeled then
                        label u with w // represents an edge in E-M
                        Enqueue(Q, u)
                        // only way for a U vertex to enter the queue

        else // w ε U and therefore is matched with v
            v  ←  w&apos;s mate // (w, v) is in M
            label v with w // represents in M
            Enqueue(Q, v) // only way for a mated v to enter Q
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compared to BFS, the DFS version of the Hungarian Algorithm is easier to implement. Its C++ code can be found in the appendix.&lt;/p&gt;
&lt;h3&gt;Hopcroft-Karp Algorithm&lt;/h3&gt;
&lt;p&gt;The Hopcroft-Karp Algorithm is an algorithm specifically designed for the bipartite maximum matching problem. Its worst-case time complexity is $\mathcal{O}(|E|\sqrt{|V|})$, and its worst-case space complexity is $\mathcal{O}(|V|)$.&lt;/p&gt;
&lt;p&gt;The Hopcroft-Karp Algorithm was discovered in 1973 by computer scientists John Hopcroft and Richard Karp.&lt;/p&gt;
&lt;p&gt;Like the Hungarian Algorithm, the Hopcroft-Karp Algorithm also repeatedly finds augmenting paths to enlarge the partial matching. The difference is that while the Hungarian Algorithm finds only one augmenting path at a time, this algorithm finds a maximal set of augmenting paths each time, so we only need $\mathcal{O}(\sqrt{|V|})$ iterations.&lt;/p&gt;
&lt;p&gt;The Hopcroft-Karp Algorithm alternates between the following two phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use BFS to find augmenting paths of the next length, traversing all augmenting paths of that length (i.e., the maximal set mentioned above).&lt;/li&gt;
&lt;li&gt;If longer augmenting paths exist, for each possible starting vertex u, use DFS to find and record augmenting paths.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In each iteration, the length of the shortest augmenting path found by BFS increases by at least 1. So after $\sqrt{|V|}$ iterations, the shortest augmenting path that can be found has length at least $\sqrt{|V|}$. Suppose the current partial matching is $M$ (edge set). The symmetric difference of $M$ and the maximum matching forms a collection of vertex-disjoint augmenting paths and alternating cycles. If all paths in this collection have length at least $\sqrt{|V|}$, then there are at most $\sqrt{|V|}$ paths, so the maximum matching size differs from $|M|$ by at most $\sqrt{|V|}$. Since each iteration increases the matching size by at least 1, there are at most $\sqrt{|V|}$ more iterations until the algorithm terminates.&lt;/p&gt;
&lt;p&gt;In each iteration, BFS traverses at most every edge in the graph, and so does DFS, so the time complexity per iteration is $\mathcal{O}({|E|})$, giving a total time complexity of $\mathcal{O}({|E|\sqrt{|V|}})$.&lt;/p&gt;
&lt;h4&gt;Pseudocode&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;/*
 G = U ∪ V ∪ {NIL}
 where U and V are partition of graph and NIL is a special null vertex
*/

function BFS ()
    for each u in U
        if Pair_U[u] == NIL
            Dist[u] = 0
            Enqueue(Q,u)
        else
            Dist[u] = ∞
    Dist[NIL] = ∞
    while Empty(Q) == false
        u = Dequeue(Q)
        if Dist[u] &amp;lt; Dist[NIL]
            for each v in Adj[u]
                if Dist[ Pair_V[v] ] == ∞
                    Dist[ Pair_V[v] ] = Dist[u] + 1
                    Enqueue(Q,Pair_V[v])
    return Dist[NIL] != ∞

function DFS (u)
    if u != NIL
        for each v in Adj[u]
            if Dist[ Pair_V[v] ] == Dist[u] + 1
                if DFS(Pair_V[v]) == true
                    Pair_V[v] = u
                    Pair_U[u] = v
                    return true
        Dist[u] = ∞
        return false
    return true

function Hopcroft-Karp
    for each u in U
        Pair_U[u] = NIL
    for each v in V
        Pair_V[v] = NIL
    matching = 0
    while BFS() == true
        for each u in U
            if Pair_U[u] == NIL
                if DFS(u) == true
                    matching = matching + 1
    return matching
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;A Problem&lt;/h3&gt;
&lt;p&gt;A problem that can be converted into finding a perfect matching. Problem summary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are n people and n shelters on a 2D plane. You need to assign n people to shelters such that each shelter accommodates exactly one person, and the time for everyone to enter a shelter is minimized, i.e., the time for the last person to enter a shelter is minimized. The time for a person to move from (X, Y) to (X1, Y1) is |X - X1| + |Y - Y1|.
1 &amp;lt;= n &amp;lt;= 100&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I couldn&apos;t think of a good direct approach to this problem. However, observing that any possible solution must equal the travel time of some person to some shelter, we can compute and sort all travel times from every person to every shelter.&lt;/p&gt;
&lt;p&gt;For a person moving to a shelter with travel time T, all moves with time less than or equal to T are feasible. We build a bipartite graph with people on the left and shelters on the right. For all feasible moves, we add a corresponding edge to the bipartite graph. If an assignment satisfying the above conditions exists, it must be a perfect matching on the graph, and the goal is to find the smallest such T.&lt;/p&gt;
&lt;p&gt;For our bipartite graph, the worst case is a complete bipartite graph. The time complexity for the Hungarian Algorithm to determine a perfect matching is $\mathcal{O}(n^3)$, and for the Hopcroft-Karp Algorithm it is $\mathcal{O}(n^2\sqrt{n})$, so the verification complexity is still relatively high.&lt;/p&gt;
&lt;p&gt;If we add edges one by one in order of travel time, we would need up to $n^2$ verifications in the worst case, which is too expensive. While the small data size here is manageable, it would be problematic if n were as large as 1000.&lt;/p&gt;
&lt;p&gt;Remember the technique we mentioned for reducing the number of verifications? Yes, binary search -- only $2\log n$ verifications needed.&lt;/p&gt;
&lt;p&gt;Here we also have a simulation cost for constructing the corresponding bipartite graph each time. The worst-case construction time per iteration is $n^2$, giving a total time complexity of $\mathcal{O}(n^3\log n)$ or $\mathcal{O}(n^2\sqrt{n}\log n)$.&lt;/p&gt;
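&lt;p&gt;A compact sketch of the whole driver (my own Python illustration, using the DFS-based Hungarian check; the function names are made up, and the C++ version is in the appendix):&lt;/p&gt;

```python
def min_max_time(people, shelters):
    """Binary-search the smallest travel time T such that the bipartite
    graph of moves costing at most T has a perfect matching."""
    n = len(people)
    cost = [[abs(px - sx) + abs(py - sy) for (sx, sy) in shelters]
            for (px, py) in people]
    times = sorted({c for row in cost for c in row})   # candidate answers

    def has_perfect_matching(limit):
        match = [-1] * n                               # shelter -> person
        def dfs(u, visited):
            for v in range(n):
                if limit >= cost[u][v] and v not in visited:
                    visited.add(v)
                    if match[v] == -1 or dfs(match[v], visited):
                        match[v] = u
                        return True
            return False
        return all(dfs(u, set()) for u in range(n))

    lo, hi = 0, len(times) - 1
    while hi > lo:
        mid = (lo + hi) // 2
        if has_perfect_matching(times[mid]):
            hi = mid                                   # feasible: try smaller T
        else:
            lo = mid + 1
    return times[lo]
```

&lt;p&gt;Binary searching over the sorted distinct travel times is what brings the number of matching checks down to a logarithmic count.&lt;/p&gt;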
&lt;p&gt;The code implementation is in the appendix.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Hall%27s_marriage_theorem&lt;/p&gt;
&lt;p&gt;[2] https://math.stackexchange.com/questions/1204270/bipartite-graph-has-perfect-matching&lt;/p&gt;
&lt;p&gt;[3] https://en.wikipedia.org/wiki/Matching_(graph_theory)&lt;/p&gt;
&lt;p&gt;[4] https://en.wikipedia.org/wiki/Hopcroft%E2%80%93Karp_algorithm&lt;/p&gt;
&lt;p&gt;[5] https://www.topcoder.com/community/data-science/data-science-tutorials/maximum-flow-augmenting-path-algorithms-comparison/&lt;/p&gt;
&lt;p&gt;[6] http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%2022%20-%20Maximum%20Matching%20in%20Bipartite%20Graph.htm&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;Air Defense Exercise&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;stack&amp;gt;
#include &amp;lt;cmath&amp;gt;
#include &amp;lt;deque&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;bitset&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;list&amp;gt;
#include &amp;lt;unordered_map&amp;gt;
#include &amp;lt;unordered_set&amp;gt;
#include &amp;lt;sstream&amp;gt;
#include &amp;lt;numeric&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;iomanip&amp;gt;
#include &amp;lt;cassert&amp;gt;

using namespace std;

using ll = long long;
using ii = pair&amp;lt;int, int&amp;gt;;
using iii = pair&amp;lt;int, ii&amp;gt;;
template &amp;lt;class T&amp;gt;
using vv = vector&amp;lt;vector&amp;lt;T&amp;gt;&amp;gt;;

#define rep(i, b) for (int i = 0; i &amp;lt; int(b); ++i)
#define reps(i, a, b) for (int i = int(a); i &amp;lt; int(b); ++i)
#define rrep(i, b) for (int i = int(b) - 1; i &amp;gt;= 0; --i)
#define rreps(i, a, b) for (int i = int(b) - 1; i &amp;gt;= a; --i)
#define repe(i, b) for (int i = 0; i &amp;lt;= int(b); ++i)
#define repse(i, a, b) for (int i = int(a); i &amp;lt;= int(b); ++i)
#define rrepe(i, b) for (int i = int(b); i &amp;gt;= 0; --i)
#define rrepse(i, a, b) for (int i = int(b); i &amp;gt;= int(a); --i)

#define all(a) a.begin(), a.end()
#define rall(a) a.rbegin(), a.rend()
#define sz(a) int(a.size())
#define mp(a, b) make_pair(a, b)

#define inf (INT_MAX / 2)
#define infl (LONG_MAX / 2)
#define infll (LLONG_MAX / 2)

#define X first
#define Y second
#define pb push_back
#define eb emplace_back

// tools for pair&amp;lt;int, int&amp;gt; &amp;amp; graph
template &amp;lt;class T, size_t M, size_t N&amp;gt;
class graph_delegate_t {
    T (&amp;amp;f)[M][N];

public:
    graph_delegate_t(T (&amp;amp;f)[M][N]) : f(f) {}
    T&amp;amp; operator[](const ii&amp;amp; s) { return f[s.first][s.second]; }
    const T&amp;amp; operator[](const ii&amp;amp; s) const { return f[s.first][s.second]; }
};
ii operator+(const ii&amp;amp; lhs, const ii&amp;amp; rhs) {
    return mp(lhs.first + rhs.first, lhs.second + rhs.second);
}

// clang-format off
template &amp;lt;class S, class T&amp;gt; ostream&amp;amp; operator&amp;lt;&amp;lt;(ostream&amp;amp; os, const pair&amp;lt;S, T&amp;gt;&amp;amp; t) { return os &amp;lt;&amp;lt; &quot;(&quot; &amp;lt;&amp;lt; t.first &amp;lt;&amp;lt; &quot;,&quot; &amp;lt;&amp;lt; t.second &amp;lt;&amp;lt; &quot;)&quot;; }
template &amp;lt;class T&amp;gt; ostream&amp;amp; operator&amp;lt;&amp;lt;(ostream&amp;amp; os, const vector&amp;lt;T&amp;gt;&amp;amp; t) { os &amp;lt;&amp;lt; &quot;{&quot;; rep(i, t.size() - 1) { os &amp;lt;&amp;lt; t[i] &amp;lt;&amp;lt; &quot;,&quot;; } if (!t.empty()) os &amp;lt;&amp;lt; t.back(); os &amp;lt;&amp;lt; &quot;}&quot;; return os; }
vector&amp;lt;string&amp;gt; __macro_split(const string&amp;amp; s) { vector&amp;lt;string&amp;gt; v; int d = 0, f = 0; string t; for (char c : s) { if (!d &amp;amp;&amp;amp; c == &apos;,&apos;) v.pb(t), t = &quot;&quot;; else t += c; if (c == &apos;\&quot;&apos; || c == &apos;\&apos;&apos;) f ^= 1; if (!f &amp;amp;&amp;amp; c == &apos;(&apos;) ++d; if (!f &amp;amp;&amp;amp; c == &apos;)&apos;) --d; } v.pb(t); return v; }
void __args_output(vector&amp;lt;string&amp;gt;::iterator, vector&amp;lt;string&amp;gt;::iterator) { cerr &amp;lt;&amp;lt; endl; }
template &amp;lt;typename T, typename... Args&amp;gt;
void __args_output(vector&amp;lt;string&amp;gt;::iterator it, vector&amp;lt;string&amp;gt;::iterator end, T a, Args... args) { cerr &amp;lt;&amp;lt; it-&amp;gt;substr((*it)[0] == &apos; &apos;, it-&amp;gt;length()) &amp;lt;&amp;lt; &quot; = &quot; &amp;lt;&amp;lt; a; if (++it != end) { cerr &amp;lt;&amp;lt; &quot;, &quot;; } __args_output(it, end, args...); }
#define out(args...) { vector&amp;lt;string&amp;gt; __args = __macro_split(#args); __args_output(__args.begin(), __args.end(), args); }
// clang-format on

const int MAX_N = 100;
int n;
ii p[MAX_N], h[MAX_N];
iii d[MAX_N * MAX_N];

vector&amp;lt;int&amp;gt; edges[MAX_N];
int match[MAX_N];
bool visited[MAX_N];

void link(int u, int v) { edges[u].push_back(v); }

bool dfs(int u) {
    for (auto v : edges[u]) {
        if (visited[v]) continue;
        visited[v] = true;
        if (match[v] == -1 || dfs(match[v])) {
            match[v] = u;
            return true;
        }
    }
    return false;
}

bool hungarian() {
    int m = 0;
    fill_n(match, n, -1);
    rep(i, n) {
        fill_n(visited, n, false);
        if (dfs(i)) ++m;
    }
    return m == n;
}

int match_u[MAX_N], match_v[MAX_N];
int dist[MAX_N + 1];  // +1: dist[NIL] is accessed and NIL == MAX_N
int NIL = MAX_N;

bool bfs() {
    queue&amp;lt;int&amp;gt; q;
    rep(u, n) {
        if (match_u[u] == NIL) {
            dist[u] = 0;
            q.push(u);
        } else {
            dist[u] = inf;
        }
    }

    dist[NIL] = inf;
    while (!q.empty()) {
        auto u = q.front();
        q.pop();
        if (dist[u] &amp;gt;= dist[NIL]) continue;  // stop at the first layer that reaches NIL
        for (auto v : edges[u]) {
            if (dist[match_v[v]] == inf) {
                dist[match_v[v]] = dist[u] + 1;
                if (match_v[v] != NIL) q.push(match_v[v]);
            }
        }
    }
    return dist[NIL] != inf;
}

bool dfs_h(int u) {
    if (u == NIL) return true;
    for (auto v : edges[u]) {
        if (dist[match_v[v]] == dist[u] + 1 &amp;amp;&amp;amp; dfs_h(match_v[v])) {
            match_u[u] = v;
            match_v[v] = u;
            return true;
        }
    }
    dist[u] = inf;
    return false;
}

bool hopcraft_karp() {
    int m = 0;
    fill_n(match_u, n, NIL);
    fill_n(match_v, n, NIL);
    while (bfs()) {
        rep(u, n) {
            if (match_u[u] == NIL &amp;amp;&amp;amp; dfs_h(u)) {
                ++m;
            }
        }
    }
    return m == n;
}

bool pfmatch() { return hopcraft_karp(); }

int main() {
    cin &amp;gt;&amp;gt; n;
    rep(i, n) { cin &amp;gt;&amp;gt; p[i].first &amp;gt;&amp;gt; p[i].second; }
    rep(i, n) { cin &amp;gt;&amp;gt; h[i].first &amp;gt;&amp;gt; h[i].second; }
    rep(i, n) {
        rep(j, n) {
            int idx = i * n + j;
            int dis = abs(p[i].X - h[j].X) + abs(p[i].Y - h[j].Y);
            d[idx] = {dis, {i, j}};
        }
    }
    sort(d, d + n * n);
    int l = 0, r = n * n - 1;
    // binary search
    // time complexity: O(n^3lgn) for hungarian,
    // O(n^2√n * lgn) for hopcroft-karp
    while (l &amp;lt; r &amp;amp;&amp;amp; d[l].first != d[r].first) {
        // replay
        rep(i, n) { edges[i].clear(); }
        int mid = (l + r) / 2;
        repe(i, mid) {
            int pi = d[i].second.first, hj = d[i].second.second;
            link(hj, pi);
        }
        if (pfmatch()) {
            r = mid;
        } else {
            l = mid + 1;
        }
    }
    cout &amp;lt;&amp;lt; d[l].first &amp;lt;&amp;lt; endl;
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Maximum Matching in Bipartite Graph</title><link>https://blog.crazyark.xyz/zh/posts/maximum_matching_in_bipartite_graph/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/maximum_matching_in_bipartite_graph/</guid><pubDate>Tue, 26 Sep 2017 14:26:08 GMT</pubDate><content:encoded>&lt;p&gt;The last few posts covered data structures for range updates and range queries. The next theme is graph theory and graph algorithms; I hope to learn, and recall, most of the common graph algorithms.&lt;/p&gt;
&lt;p&gt;Today I ran into a problem that can be reduced to deciding whether a perfect matching exists in a bipartite graph. Let&apos;s first look at the theory of perfect matchings, Hall&apos;s Marriage Theorem, and then introduce two algorithms for the maximum matching problem, a superset of perfect matching.&lt;/p&gt;
&lt;h3&gt;Hall&apos;s Marriage Theorem&lt;/h3&gt;
&lt;p&gt;Let $G$ be a bipartite graph with parts $X$ and $Y$. For $W \subset X$, let $N_G(W)$ be the set of neighbors of $W$ in $Y$.&lt;/p&gt;
&lt;p&gt;Then there exists a matching covering all of $X$ if and only if&lt;/p&gt;
&lt;p&gt;$\forall W \subset X, |W| \le |N_G(W)|$ -- that is, every subset of $X$ has enough neighbors to be matched.&lt;/p&gt;
&lt;h4&gt;Deduction 1 [1]&lt;/h4&gt;
&lt;p&gt;Suppose $G(X + Y, E)$ is a bipartite graph with $|X| = |Y|$, $G$ is connected, and the vertices in $X$ all have distinct degrees. Then $G$ must have a perfect matching.&lt;/p&gt;
&lt;p&gt;Proof:&lt;/p&gt;
&lt;p&gt;First, since $G$ is connected, $\forall u \in X, deg(u) \ge 1$.
Then for any $W \subset X$, the degrees in $W$ are $|W|$ distinct integers, each at least 1, so $\max\limits_{u \in W}{deg(u)} \ge |W|$ and hence $|N_G(W)| \ge |W|$. This satisfies Hall&apos;s Marriage Theorem, completing the proof.&lt;/p&gt;
&lt;h3&gt;Hungarian Algorithm&lt;/h3&gt;
&lt;p&gt;In fact, after searching for quite a while, I could not find an algorithm named the Hungarian Algorithm that is specifically for maximum bipartite matching. The only Hungarian Algorithm I found solves the assignment problem, i.e., maximum matching in a weighted bipartite graph.&lt;/p&gt;
&lt;p&gt;Since the maximum matching problem and the maximum flow problem are easily reducible to each other, I also found many algorithms with similar or even identical ideas, such as the Ford-Fulkerson Algorithm and the Edmonds-Karp Algorithm.&lt;/p&gt;
&lt;p&gt;I can no longer recall which book introduced the name Hungarian Algorithm ... presumably some graph theory textbook I once studied. Here I will mainly explain its concrete idea and proof.&lt;/p&gt;
&lt;h4&gt;Alternating Path &amp;amp; Augmenting Path&lt;/h4&gt;
&lt;p&gt;Suppose we have a bipartite graph $G(U + V, E)$ and a matching $M \subset E$. A vertex that does not appear on any edge of $M$ is called unmatched (free); unmatched edges are defined analogously.&lt;/p&gt;
&lt;p&gt;Alternating path: a path starting from an unmatched vertex that alternates between unmatched edges and matched edges.
Augmenting path: an &lt;strong&gt;alternating path&lt;/strong&gt; that starts at an unmatched vertex in $U$ and ends at an unmatched vertex in $V$.&lt;/p&gt;
&lt;p&gt;Clearly, by definition an augmenting path contains exactly one more unmatched edge than matched edges, and if we flip every edge on the path -- unmatched edges become matched and matched edges become unmatched -- the resulting matching $M&apos;$ is larger than the original by 1.&lt;/p&gt;
&lt;h4&gt;Theorem of Augmenting Path&lt;/h4&gt;
&lt;p&gt;A matching $M$ is maximum if and only if there is no augmenting path in $G$ with respect to $M$.&lt;/p&gt;
&lt;p&gt;Proof:&lt;/p&gt;
&lt;p&gt;If an augmenting path exists, $M$ is clearly not maximum. So it suffices to show that when no augmenting path exists, $M$ is maximum.&lt;/p&gt;
&lt;p&gt;Suppose, for contradiction, that $M$ admits no augmenting path yet is not maximum. Let $M^*$ be a maximum matching of $G$; clearly $|M^*| &amp;gt; |M|$.&lt;/p&gt;
&lt;p&gt;It follows that $|M^* - M| &amp;gt; |M - M^*|$.&lt;/p&gt;
&lt;p&gt;Consider all edges in the symmetric difference of $M^*$ and $M$ ($M^* \cup M - M^* \cap M$), and let $G&apos;$ be the graph formed by the vertices $U + V$ and these edges.&lt;/p&gt;
&lt;p&gt;Since the edges of $G&apos;$ come from two matchings, every vertex of $G&apos;$ is incident to at most two edges.&lt;/p&gt;
&lt;p&gt;Therefore every connected component of $G&apos;$ is either a path or a cycle, every cycle has an even number of edges, and along each path or cycle the edges alternate between $M^* - M$ and $M - M^*$.&lt;/p&gt;
&lt;p&gt;Since $|M^* - M| &amp;gt; |M - M^*|$ and every cycle contains equally many edges from both sets, there must be a path whose first and last edges both lie in $M^* - M$, with $M^* - M$ and $M - M^*$ edges alternating in between. This path is clearly an augmenting path with respect to $M$ -- contradiction!&lt;/p&gt;
&lt;p&gt;QED.&lt;/p&gt;
&lt;h4&gt;Pseudocode&lt;/h4&gt;
&lt;p&gt;The Hungarian algorithm has two implementations, based on DFS and BFS respectively; both run in $\mathcal{O}(|V||E|)$ time.&lt;/p&gt;
&lt;p&gt;Below is pseudocode for the BFS version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Algorithm MaximumBipartiteMatching(G)
    initialize set M of edges // can be the empty set
    initialize queue Q with all the free vertices in V
    while not Empty(Q) do
        w ← Front(Q)
        if w ∈ V then
            for every vertex u adjacent to w do // u must be in U
                if u is free then // augment
                    M ← M union (w, u)
                    v ← w
                    while v is labeled do // follow the augmenting path
                        u ← label of v
                        M ← M - (v, u)  // (v, u) was in previous M
                        v ← label of u
                        M ← M union (v, u) // add the edge to the path
                    // start over
                    remove all vertex labels
                    reinitialize Q with all the free vertices in V
                    break // exit the for loop
                else // u is matched
                    if (w, u) not in M and u is unlabeled then
                        label u with w // represents an edge in E-M
                        Enqueue(Q, u)
                        // only way for a U vertex to enter the queue
         
        else // w ∈ U and therefore is matched with v
            v  ←  w&apos;s mate // (w, v) is in M
            label v with w // represents in M
            Enqueue(Q, v) // only way for a mated v to enter Q
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compared with the BFS version, the DFS version of the Hungarian algorithm is easier to implement; its C++ code can be found in the Appendix.&lt;/p&gt;
&lt;h3&gt;Hopcroft-Karp Algorithm&lt;/h3&gt;
&lt;p&gt;The Hopcroft-Karp algorithm is dedicated to the maximum bipartite matching problem, with worst-case time complexity $\mathcal{O}(|E|\sqrt{|V|})$ and worst-case space $\mathcal{O}(|V|)$.&lt;/p&gt;
&lt;p&gt;It was discovered in 1973 by the computer scientists John Hopcroft and Richard Karp.&lt;/p&gt;
&lt;p&gt;Like the Hungarian algorithm, Hopcroft-Karp repeatedly finds augmenting paths to grow a partial matching. The difference is that where the Hungarian algorithm finds one augmenting path at a time, Hopcroft-Karp finds a maximal set of shortest augmenting paths in each iteration, so only $\mathcal{O}(\sqrt{|V|})$ iterations are needed.&lt;/p&gt;
&lt;p&gt;The algorithm alternates between the following two phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use BFS to find the next augmenting-path length, visiting all augmenting paths of that length (the maximal set mentioned above).&lt;/li&gt;
&lt;li&gt;If such augmenting paths exist, use DFS from every possible start vertex u to find and record vertex-disjoint augmenting paths.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In each iteration, the length of the shortest augmenting path found by BFS increases by at least 1, so after $\sqrt{|V|}$ iterations the shortest augmenting path that can still be found has length at least $\sqrt{|V|}$. Let $M$ be the current partial matching (as an edge set); the symmetric difference of $M$ and a maximum matching forms a collection of vertex-disjoint augmenting paths and alternating cycles. If every path in this collection has length at least $\sqrt{|V|}$, there are at most $\sqrt{|V|}$ such paths, so the maximum matching exceeds $|M|$ by at most $\sqrt{|V|}$. Since each iteration grows the matching by at least 1, at most $\sqrt{|V|}$ more iterations remain before the algorithm terminates.&lt;/p&gt;
&lt;p&gt;Within one iteration, BFS traverses each edge of the graph at most once, and so does DFS, so each iteration costs $\mathcal{O}({|E|})$ and the total time complexity is $\mathcal{O}({|E|\sqrt{|V|}})$.&lt;/p&gt;
&lt;h4&gt;Pseudocode&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;/* 
 G = U ∪ V ∪ {NIL}
 where U and V are partition of graph and NIL is a special null vertex
*/
  
function BFS ()
    for each u in U
        if Pair_U[u] == NIL
            Dist[u] = 0
            Enqueue(Q,u)
        else
            Dist[u] = ∞
    Dist[NIL] = ∞
    while Empty(Q) == false
        u = Dequeue(Q)
        if Dist[u] &amp;lt; Dist[NIL] 
            for each v in Adj[u]
                if Dist[ Pair_V[v] ] == ∞
                    Dist[ Pair_V[v] ] = Dist[u] + 1
                    Enqueue(Q,Pair_V[v])
    return Dist[NIL] != ∞

function DFS (u)
    if u != NIL
        for each v in Adj[u]
            if Dist[ Pair_V[v] ] == Dist[u] + 1
                if DFS(Pair_V[v]) == true
                    Pair_V[v] = u
                    Pair_U[u] = v
                    return true
        Dist[u] = ∞
        return false
    return true

function Hopcroft-Karp
    for each u in U
        Pair_U[u] = NIL
    for each v in V
        Pair_V[v] = NIL
    matching = 0
    while BFS() == true
        for each u in U
            if Pair_U[u] == NIL
                if DFS(u) == true
                    matching = matching + 1
    return matching
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;A Problem&lt;/h3&gt;
&lt;p&gt;Here is a problem that can be converted into a perfect matching problem. Rough statement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On a 2D plane there are n people and n air-raid shelters. Assign the n people to the shelters so that each shelter holds exactly one person and everyone gets inside as early as possible, i.e., minimize the entry time of the last person. Moving from (X, Y) to (X1, Y1) takes |X - X1| + |Y - Y1| time.
1 &amp;lt;= n &amp;lt;= 100&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I could not see a good way to attack this directly, but observe that the answer must equal the travel time of some person to some shelter, so we compute the travel times of all people to all shelters and sort them.&lt;/p&gt;
&lt;p&gt;Fix some person-to-shelter travel time T. Then every move taking at most T is feasible. Build a bipartite graph with the people on the left and the shelters on the right, and add an edge for every feasible move. If an assignment satisfying the requirement exists for this T, it must form a perfect matching in this graph, and the goal is to find the smallest such T.&lt;/p&gt;
&lt;p&gt;In the worst case our bipartite graph is complete, so deciding whether a perfect matching exists costs $\mathcal{O}(n^3)$ with the Hungarian algorithm and $\mathcal{O}(n^2\sqrt{n})$ with Hopcroft-Karp -- each decision is fairly expensive.&lt;/p&gt;
&lt;p&gt;If we add edges one by one in order of increasing travel time, in the worst case we run the decision $n^2$ times, which is too many. It works here because the data is small, but if n grew to 1000 it would be hopeless.&lt;/p&gt;
&lt;p&gt;Remember the technique we discussed before for reducing the number of decisions? Right -- binary search, which needs only $2\log n$ decisions in total.&lt;/p&gt;
&lt;p&gt;There is also a simulation cost here: each decision rebuilds the corresponding bipartite graph, at worst-case cost $n^2$, so the total time complexity is $\mathcal{O}(n^3\log n)$ or $\mathcal{O}(n^2\sqrt{n}\log n)$.&lt;/p&gt;
&lt;p&gt;The implementation is in the Appendix.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Hall%27s_marriage_theorem&lt;/p&gt;
&lt;p&gt;[2] https://math.stackexchange.com/questions/1204270/bipartite-graph-has-perfect-matching&lt;/p&gt;
&lt;p&gt;[3] https://en.wikipedia.org/wiki/Matching_(graph_theory)&lt;/p&gt;
&lt;p&gt;[4] https://en.wikipedia.org/wiki/Hopcroft%E2%80%93Karp_algorithm&lt;/p&gt;
&lt;p&gt;[5] https://www.topcoder.com/community/data-science/data-science-tutorials/maximum-flow-augmenting-path-algorithms-comparison/&lt;/p&gt;
&lt;p&gt;[6] http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%2022%20-%20Maximum%20Matching%20in%20Bipartite%20Graph.htm&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;Air Defense Exercise&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;stack&amp;gt;
#include &amp;lt;cmath&amp;gt;
#include &amp;lt;deque&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;bitset&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;list&amp;gt;
#include &amp;lt;unordered_map&amp;gt;
#include &amp;lt;unordered_set&amp;gt;
#include &amp;lt;sstream&amp;gt;
#include &amp;lt;numeric&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;iomanip&amp;gt;
#include &amp;lt;cassert&amp;gt;

using namespace std;

using ll = long long;
using ii = pair&amp;lt;int, int&amp;gt;;
using iii = pair&amp;lt;int, ii&amp;gt;;
template &amp;lt;class T&amp;gt;
using vv = vector&amp;lt;vector&amp;lt;T&amp;gt;&amp;gt;;

#define rep(i, b) for (int i = 0; i &amp;lt; int(b); ++i)
#define reps(i, a, b) for (int i = int(a); i &amp;lt; int(b); ++i)
#define rrep(i, b) for (int i = int(b) - 1; i &amp;gt;= 0; --i)
#define rreps(i, a, b) for (int i = int(b) - 1; i &amp;gt;= a; --i)
#define repe(i, b) for (int i = 0; i &amp;lt;= int(b); ++i)
#define repse(i, a, b) for (int i = int(a); i &amp;lt;= int(b); ++i)
#define rrepe(i, b) for (int i = int(b); i &amp;gt;= 0; --i)
#define rrepse(i, a, b) for (int i = int(b); i &amp;gt;= int(a); --i)

#define all(a) a.begin(), a.end()
#define rall(a) a.rbegin(), a.rend()
#define sz(a) int(a.size())
#define mp(a, b) make_pair(a, b)

#define inf (INT_MAX / 2)
#define infl (LONG_MAX / 2)
#define infll (LLONG_MAX / 2)

#define X first
#define Y second
#define pb push_back
#define eb emplace_back

// tools for pair&amp;lt;int, int&amp;gt; &amp;amp; graph
template &amp;lt;class T, size_t M, size_t N&amp;gt;
class graph_delegate_t {
    T (&amp;amp;f)[M][N];

public:
    graph_delegate_t(T (&amp;amp;f)[M][N]) : f(f) {}
    T&amp;amp; operator[](const ii&amp;amp; s) { return f[s.first][s.second]; }
    const T&amp;amp; operator[](const ii&amp;amp; s) const { return f[s.first][s.second]; }
};
ii operator+(const ii&amp;amp; lhs, const ii&amp;amp; rhs) {
    return mp(lhs.first + rhs.first, lhs.second + rhs.second);
}

// clang-format off
template &amp;lt;class S, class T&amp;gt; ostream&amp;amp; operator&amp;lt;&amp;lt;(ostream&amp;amp; os, const pair&amp;lt;S, T&amp;gt;&amp;amp; t) { return os &amp;lt;&amp;lt; &quot;(&quot; &amp;lt;&amp;lt; t.first &amp;lt;&amp;lt; &quot;,&quot; &amp;lt;&amp;lt; t.second &amp;lt;&amp;lt; &quot;)&quot;; }
template &amp;lt;class T&amp;gt; ostream&amp;amp; operator&amp;lt;&amp;lt;(ostream&amp;amp; os, const vector&amp;lt;T&amp;gt;&amp;amp; t) { os &amp;lt;&amp;lt; &quot;{&quot;; rep(i, t.size() - 1) { os &amp;lt;&amp;lt; t[i] &amp;lt;&amp;lt; &quot;,&quot;; } if (!t.empty()) os &amp;lt;&amp;lt; t.back(); os &amp;lt;&amp;lt; &quot;}&quot;; return os; }
vector&amp;lt;string&amp;gt; __macro_split(const string&amp;amp; s) { vector&amp;lt;string&amp;gt; v; int d = 0, f = 0; string t; for (char c : s) { if (!d &amp;amp;&amp;amp; c == &apos;,&apos;) v.pb(t), t = &quot;&quot;; else t += c; if (c == &apos;\&quot;&apos; || c == &apos;\&apos;&apos;) f ^= 1; if (!f &amp;amp;&amp;amp; c == &apos;(&apos;) ++d; if (!f &amp;amp;&amp;amp; c == &apos;)&apos;) --d; } v.pb(t); return v; }
void __args_output(vector&amp;lt;string&amp;gt;::iterator, vector&amp;lt;string&amp;gt;::iterator) { cerr &amp;lt;&amp;lt; endl; }
template &amp;lt;typename T, typename... Args&amp;gt;
void __args_output(vector&amp;lt;string&amp;gt;::iterator it, vector&amp;lt;string&amp;gt;::iterator end, T a, Args... args) { cerr &amp;lt;&amp;lt; it-&amp;gt;substr((*it)[0] == &apos; &apos;, it-&amp;gt;length()) &amp;lt;&amp;lt; &quot; = &quot; &amp;lt;&amp;lt; a; if (++it != end) { cerr &amp;lt;&amp;lt; &quot;, &quot;; } __args_output(it, end, args...); }
#define out(args...) { vector&amp;lt;string&amp;gt; __args = __macro_split(#args); __args_output(__args.begin(), __args.end(), args); }
// clang-format on

const int MAX_N = 100;
int n;
ii p[MAX_N], h[MAX_N];
iii d[MAX_N * MAX_N];

vector&amp;lt;int&amp;gt; edges[MAX_N];
int match[MAX_N];
bool visited[MAX_N];

void link(int u, int v) { edges[u].push_back(v); }

bool dfs(int u) {
    for (auto v : edges[u]) {
        if (visited[v]) continue;
        visited[v] = true;
        if (match[v] == -1 || dfs(match[v])) {
            match[v] = u;
            return true;
        }
    }
    return false;
}

bool hungarian() {
    int m = 0;
    fill_n(match, n, -1);
    rep(i, n) {
        fill_n(visited, n, false);
        if (dfs(i)) ++m;
    }
    return m == n;
}

int match_u[MAX_N], match_v[MAX_N];
int dist[MAX_N + 1];  // +1: dist[NIL] is accessed and NIL == MAX_N
int NIL = MAX_N;

bool bfs() {
    queue&amp;lt;int&amp;gt; q;
    rep(u, n) {
        if (match_u[u] == NIL) {
            dist[u] = 0;
            q.push(u);
        } else {
            dist[u] = inf;
        }
    }

    dist[NIL] = inf;
    while (!q.empty()) {
        auto u = q.front();
        q.pop();
        if (dist[u] &amp;gt;= dist[NIL]) continue;  // stop at the first layer that reaches NIL
        for (auto v : edges[u]) {
            if (dist[match_v[v]] == inf) {
                dist[match_v[v]] = dist[u] + 1;
                if (match_v[v] != NIL) q.push(match_v[v]);
            }
        }
    }
    return dist[NIL] != inf;
}

bool dfs_h(int u) {
    if (u == NIL) return true;
    for (auto v : edges[u]) {
        if (dist[match_v[v]] == dist[u] + 1 &amp;amp;&amp;amp; dfs_h(match_v[v])) {
            match_u[u] = v;
            match_v[v] = u;
            return true;
        }
    }
    dist[u] = inf;
    return false;
}

bool hopcraft_karp() {
    int m = 0;
    fill_n(match_u, n, NIL);
    fill_n(match_v, n, NIL);
    while (bfs()) {
        rep(u, n) {
            if (match_u[u] == NIL &amp;amp;&amp;amp; dfs_h(u)) {
                ++m;
            }
        }
    }
    return m == n;
}

bool pfmatch() { return hopcraft_karp(); }

int main() {
    cin &amp;gt;&amp;gt; n;
    rep(i, n) { cin &amp;gt;&amp;gt; p[i].first &amp;gt;&amp;gt; p[i].second; }
    rep(i, n) { cin &amp;gt;&amp;gt; h[i].first &amp;gt;&amp;gt; h[i].second; }
    rep(i, n) {
        rep(j, n) {
            int idx = i * n + j;
            int dis = abs(p[i].X - h[j].X) + abs(p[i].Y - h[j].Y);
            d[idx] = {dis, {i, j}};
        }
    }
    sort(d, d + n * n);
    int l = 0, r = n * n - 1;
    // binary search
    // time complexity: O(n^3lgn) for hungarian,
    // O(n^2√n * lgn) for hopcroft-karp
    while (l &amp;lt; r &amp;amp;&amp;amp; d[l].first != d[r].first) {
        // replay
        rep(i, n) { edges[i].clear(); }
        int mid = (l + r) / 2;
        repe(i, mid) {
            int pi = d[i].second.first, hj = d[i].second.second;
            link(hj, pi);
        }
        if (pfmatch()) {
            r = mid;
        } else {
            l = mid + 1;
        }
    }
    cout &amp;lt;&amp;lt; d[l].first &amp;lt;&amp;lt; endl;
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Sparse Table &amp; Parallel Binary Search</title><link>https://blog.crazyark.xyz/zh/posts/hourrank23/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/hourrank23/</guid><pubDate>Sat, 09 Sep 2017 17:06:11 GMT</pubDate><content:encoded>&lt;p&gt;You don&apos;t know how weak you are until you start grinding problems, and the more you grind the more you realize it -- a record of getting destroyed in HourRank23.&lt;/p&gt;
&lt;h3&gt;Sparse Table&lt;/h3&gt;
&lt;p&gt;Recall that the previous two posts on segment trees and BIT both discussed the range query problem. Let&apos;s review.&lt;/p&gt;
&lt;p&gt;Segment trees support range updates and range queries for various functions (associative, i.e., satisfying the associative law), with O(n) space complexity and O(lgn) time complexity for both updates and queries.&lt;/p&gt;
&lt;p&gt;BIT supports single-point update/range query and range update/single-point query for operations on any group, with O(n) space complexity and O(lgn) time complexity for both updates and queries (actually depending on the speed of inverse element construction). BIT&apos;s generalization to range update/range query requires additional properties, and is mainly used for sum operations on the integer domain.&lt;/p&gt;
&lt;p&gt;The Sparse Table discussed here is another data structure supporting range queries, targeting &lt;strong&gt;immutable arrays&lt;/strong&gt;, with O(nlgn) space complexity.&lt;/p&gt;
&lt;p&gt;Sparse Table also supports various functions -- any function satisfying the associative law is supported. For all such functions, construction takes O(nlgn) time and each range query takes O(lgn) time, and both the concept and implementation are very simple and easy to understand.&lt;/p&gt;
&lt;p&gt;Furthermore, &lt;strong&gt;if the function is idempotent, Sparse Table can return range query results in O(1) time&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;Core Principle&lt;/h4&gt;
&lt;p&gt;Suppose we have an array of length $N$, $\{a_0, ..., a_{N - 1}\}$, and a binary function $f$ satisfying the associative law $f(a, f(b, c)) = f(f(a, b), c)$.&lt;/p&gt;
&lt;p&gt;We use the shorthand $f(a[i..j])$ for the query of function $f$ over the range $[i, j]$.&lt;/p&gt;
&lt;p&gt;The Sparse Table generates a 2D array of size $N(\lfloor\log N\rfloor + 1)$. The $(i, j)$-th entry represents the range result $f(a[i..i + 2^j - 1])$, denoted $b_{i,j}$.&lt;/p&gt;
&lt;p&gt;Generating such a 2D array is straightforward, because $f(a[i..i + 2^j - 1]) = f(f(a[i..i+2^{j-1} - 1]), f(a[i + 2^{j-1}..i + 2^j - 1]))$, where the two sub-expressions are the $(i, j - 1)$-th and $(i + 2^{j - 1}, j - 1)$-th entries, and $f(a[i..i]) = a_i$. So we just build layer by layer as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// assuming Arr is indexed from 0
for i=0..N-1:
  Table[i][0] = Arr[i]

// assuming N &amp;lt; 2^(k+1)
for j=1..k:
  for i=0..N-2^j:
    Table[i][j] = F(Table[i][j - 1], Table[i + 2^(j - 1)][j - 1])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How do we perform queries? For a range $[i, j]$, the range length $L = j - i + 1 \le N$ always holds. If we express $L$ in binary form, $L = 2^{q_k} + 2^{q_{k - 1}} + ... + 2^{q_0}$,
then we have&lt;/p&gt;
&lt;p&gt;$j = i + 2^{q_k} + 2^{q_{k - 1}} + \cdots + 2^{q_0} - 1$, i.e., the range $[i, j]$ splits into consecutive blocks whose lengths are powers of two. Does this representation ring a bell?&lt;/p&gt;
&lt;p&gt;So using the following process we can get the exact result in O(lgN) time:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;answer = ZERO
L&apos; = L
for i=k..0:
  if L&apos; + 2^i - 1 &amp;lt;= R:
    // F is associative, so this operation is meaningful
    answer = F(answer, Table[L&apos;][i])
    L&apos; += 2^i
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Suppose our function $f$ is also idempotent, meaning $f(x, x) = x$ holds for all x in the domain. Then we immediately get&lt;/p&gt;
&lt;p&gt;$f(a[i..j]) = f(f(a[i..s]), f(a[t..j]))$ for all $i \le t$, $s \le j$, $t \le s + 1$.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This property allows us to not precisely cover the region exactly once, which is the key to achieving O(1).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Let $t$ be the largest integer satisfying $2^t \le (j - i + 1)$, i.e., $2^{t + 1} &amp;gt; (j - i + 1)$. Then clearly $i + 2^t - 1 \le j$, $j - 2^t + 1 \ge i$, and $j - 2^t + 1 \le (i + 2^t - 1) + 1$ always hold.&lt;/p&gt;
&lt;p&gt;So $f(a[i..j]) = f(f(a[i..i + 2^t - 1]), f(a[j - 2^t + 1..j]))$, where the two terms are exactly $b_{i, t}$ and $b_{j - 2^t + 1, t}$.&lt;/p&gt;
&lt;p&gt;This concludes the principle. Implementation code is in the Appendix at the end.&lt;/p&gt;
&lt;h4&gt;ST &amp;amp; LCA&lt;/h4&gt;
&lt;p&gt;Sparse Table can be used not only for various range queries but also for computing the Lowest Common Ancestor (LCA) of two nodes in a tree. Using ST, the LCA of any two nodes can be computed in O(lgH) time, with O(NlgN) space complexity, where N is the number of nodes in the tree and H is the tree height.&lt;/p&gt;
&lt;p&gt;The main idea is to store the 2^j-th ancestor of node i in an array, then search. Interested readers can refer to the article on RMQ and LCA on TopCoder. The link is in the references, and further details are omitted here.&lt;/p&gt;
&lt;h3&gt;Parallel Binary Search&lt;/h3&gt;
&lt;p&gt;Parallel Binary Search -- let&apos;s first use a problem from SPOJ to introduce this method, then look at a slightly different problem from HourRank.&lt;/p&gt;
&lt;p&gt;The following two subsections heavily reference a blog on Codeforces. The original link is at the end of the article, and interested readers can go there to learn more.&lt;/p&gt;
&lt;h4&gt;Motivation Problem&lt;/h4&gt;
&lt;p&gt;There is a problem on SPOJ: &lt;a href=&quot;http://www.spoj.com/problems/METEORS/&quot;&gt;Meteors&lt;/a&gt;. Here is a brief description:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are N member states and a circular area divided equally into M sectors, each belonging to some member state. Q meteor showers are forecast, each falling on a sector range [L, R], with each sector receiving the same quantity K of meteorites. Each member state has a meteorite collection target reqd[i]. For each member state, determine the minimum number of meteor showers needed to meet the collection target.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;1 &amp;lt;= N, M, Q &amp;lt;= 300000
1 &amp;lt;= K &amp;lt;= 1e9&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Solution&lt;/h4&gt;
&lt;p&gt;If we consider simulating each step without any data structures and then checking whether the target is met, the update cost is O(M) and the check cost is O(N + M), giving a total cost of O((2M + N)Q), which is unacceptable.&lt;/p&gt;
&lt;p&gt;Noticing the range updates and recalling the data structures discussed at the beginning of this blog, BIT is the most suitable choice for simulating updates. This reduces the update cost to O(lgM), but the check cost becomes O(MlgM). If we still simulate step by step, it will obviously be worse than before.&lt;/p&gt;
&lt;p&gt;At this point, can you think of something? Yes, binary search! Binary search doesn&apos;t care what each value in the sequence looks like. It only needs to know:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The sequence is monotonic&lt;/li&gt;
&lt;li&gt;The value at a given position in the sequence can be obtained by some method&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Then, through binary search, we can find the desired position within O(lgN) value retrievals/checks.&lt;/p&gt;
&lt;p&gt;Here, the &quot;sequence&quot; refers to a member state&apos;s cumulative collection total, which is implicitly reflected by the sequence of sector ranges. Through binary search, for each member state we only need O(lgQ) checks to find the time point at which the target is met. So the total time complexity per member state is approximately O(logQ * Q * logM), where the check time is negligible compared to the update time and is thus omitted. The final complexity is O(N * log Q * Q * logM).&lt;/p&gt;
&lt;p&gt;As we can see, this complexity is still too high -- even higher than direct simulation. So what do we do?&lt;/p&gt;
&lt;p&gt;Think about it: while the cost per simulation step decreased, the number of simulations rose to O(NlgQ). Although the number of checks also decreased, compared to simulation cost, the reduction was negligible. So the priority is to simultaneously reduce the simulation overhead.&lt;/p&gt;
&lt;p&gt;It feels a bit wasteful, doesn&apos;t it? Each time we just do a quick check but have to simulate an entire round. Even though the per-step cost is low, you can&apos;t afford to burn CPU like that.&lt;/p&gt;
&lt;p&gt;In reality, each member state&apos;s binary search experience is very similar -- finding the right portion in the fully simulated sequence and then doing a check before moving to the next round. Since we&apos;ve already done the full simulation, can we let every member state check at once before starting the next round?&lt;/p&gt;
&lt;p&gt;This is the core idea of Parallel Binary Search: &lt;strong&gt;complete one step of search for all member states through a single simulation round&lt;/strong&gt;, reducing the total number of simulations to O(lgQ) -- a huge improvement!&lt;/p&gt;
&lt;p&gt;The concrete method is to maintain two arrays of length N, recording the current L and R for each member state. For each mid value to check, maintain a linked list of member states that need to check that mid value. The rest is described by the following pseudocode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for all logQ steps:
    clear range tree and linked list check
    for all member states i:
        if L[i] != R[i]:
            mid = (L[i] + R[i]) / 2
            insert i in check[mid]
    for all queries q:
        apply(q)
        for all member states m in check[q]:
            if m has requirements fulfilled:
                R[m] = q
            else:
                L[m] = q + 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the above process, the &lt;code&gt;apply()&lt;/code&gt; function simulates the update and checks whether the member states that need checking have met their targets.&lt;/p&gt;
&lt;p&gt;This way, the total simulation cost is O(lgQ * Q * lgM), and the total check cost is O(lgQ * (MlgM + N)). We successfully reduced both the simulation and check costs simultaneously! The total time complexity is O(lgQ * (Q + M) * lgM). As for the smaller N term, it&apos;s negligible.&lt;/p&gt;
&lt;p&gt;As in the original article, here&apos;s a quote from a top competitor explaining Parallel Binary Search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;A cool way to visualize this is to think of a binary search tree. Suppose we are doing standard binary search, and we reject the right interval — this can be thought of as moving left in the tree. Similarly, if we reject the left interval, we are moving right in the tree.
So what Parallel Binary Search does is move one step down in N binary search trees simultaneously in one &quot;sweep&quot;, taking O(N  *  X) time, where X is dependent on the problem and the data structures used in it. Since the height of each tree is LogN, the complexity is O(N  *  X  *  logN).&quot; — rekt_n00b&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Hourrank23 -- Selective Additions&lt;/h4&gt;
&lt;p&gt;This is a cursed problem... sorry I couldn&apos;t help it.&lt;/p&gt;
&lt;p&gt;The problem is as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is an array of length N. We perform M range updates, each adding a positive number. However, we have K favorite numbers. Once an element in the array becomes one of our favorite numbers, further updates to it are invalidated and it retains that value forever.
Output the array sum after each update.&lt;/p&gt;
&lt;p&gt;1 &amp;lt;= N, M &amp;lt;= 1e5
1 &amp;lt;= k &amp;lt;= 5&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This problem differs from the previous one: &lt;strong&gt;the check cost is low, and the check target coincides with the original range&lt;/strong&gt;. Also, there are multiple targets.&lt;/p&gt;
&lt;p&gt;The multiple targets can be handled as follows: sort the favorite numbers. Since updates are always positive, once the earlier (smaller) favorite number is reached, the later ones no longer need to be checked.&lt;/p&gt;
&lt;p&gt;Using the PBS approach described above, the complexity of this problem is O(k(n + m)lognlogm), which is a very good complexity.&lt;/p&gt;
&lt;p&gt;However, there must be a &quot;however&quot; here. Since the check cost is very low, binary search is not strictly necessary for this problem. I reviewed many top competitors&apos; code and the editorial. Some used PBS, some simulated the time series, and some used segment trees with tricks. Let me introduce them one by one:&lt;/p&gt;
&lt;h4&gt;PBS&lt;/h4&gt;
&lt;p&gt;Nothing more to say -- offline simulation for logm rounds.&lt;/p&gt;
&lt;h4&gt;Time Series Simulation&lt;/h4&gt;
&lt;p&gt;Code: https://www.hackerrank.com/rest/contests/hourrank-23/challenges/selective-additions/hackers/nuipqiun/download_solution&lt;/p&gt;
&lt;p&gt;This is also offline processing, but this competitor&apos;s simulation method is quite unique -- they simulate the value of the current array element along the time series. This was a first for me.&lt;/p&gt;
&lt;p&gt;First, recall BIT&apos;s range update method: add delta at position l, subtract delta at position r + 1. Suppose we consider all range updates as [l, n]. Then simulating the value of the i-th array element along the time series becomes intuitive: for updates (j, a, b, delta) where a &amp;lt;= i, add delta at position j in the time series array. This way, all values after time j are increased by delta.&lt;/p&gt;
&lt;p&gt;With this in mind, the time series simulation is straightforward. Since there are additions, there must be corresponding subtractions: for all updates (j, a, b, delta) where b &amp;lt; i, cancel them out by subtracting delta at position j in the time series, because such updates should not affect the time series of array element i, and they were previously applied.&lt;/p&gt;
&lt;p&gt;So the approach becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rep(i,n){
	for(int j:ad[i]) add(j,ads[j]);
	for(int j:rm[i]) add(j,-ads[j]);
	// do your work
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, ad[i] contains the indices of updates of the form [i, r], rm[i] contains the indices of updates of the form [l, i - 1], and the add operation is the Fenwick tree&apos;s addition operation. This allows us to compute a[i]&apos;s value at any time point using sum(j).&lt;/p&gt;
&lt;p&gt;The rest is simple: for each i, do a binary search to find the target and record it. The total time complexity should be O((kn + m)logm).&lt;/p&gt;
&lt;p&gt;This is clearly faster than PBS because &lt;strong&gt;the check range coincides with the original range&lt;/strong&gt;. This observation is very important. In contrast, for Meteors, although we can also simulate the time series of a range, a range is not meaningful for the target.&lt;/p&gt;
&lt;h4&gt;Segment Tree&lt;/h4&gt;
&lt;p&gt;Unlike the two approaches above, this is an online algorithm.&lt;/p&gt;
&lt;p&gt;The time complexity is O(klgm(n + m)), but it uses segment trees, so the space complexity is O(kmlgm).&lt;/p&gt;
&lt;p&gt;The segment tree stores three values: a lazy field add, a range maximum maxi, and the corresponding array index id.&lt;/p&gt;
&lt;p&gt;There are k segment trees in total, each initialized with the array {a_i - fav_j}. In other words, if the corresponding value is positive, it&apos;s worth checking; if negative, it hasn&apos;t reached the check threshold yet.&lt;/p&gt;
&lt;p&gt;Conveniently, leveraging the segment tree&apos;s structure, we can check the root node to determine whether there are any elements in the tree that need checking. This operation is O(1). When an array element has been checked, it is assigned -inf so it will never be checked again, and the segment tree transitions to the next pending state.&lt;/p&gt;
&lt;p&gt;Since each array element is checked at most once per favorite number, the total check complexity is O(knlgm).&lt;/p&gt;
&lt;p&gt;The update complexity is O(kmlgm), so the total complexity is O(k(m + n)lgm).&lt;/p&gt;
&lt;p&gt;The segment tree trick here initially puzzled me. Why doesn&apos;t it work for Meteors but works here? I thought about it extensively but couldn&apos;t come up with a third approach to map Meteors&apos; query ranges to a conveniently modifiable structure. So it ultimately comes down to the two reasons described earlier.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;For a series of update/query problems, the community has provided a variety of powerful tools:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Data structures: segment tree, Fenwick tree, Sparse Table, the recently studied Cartesian tree, and so on&lt;/li&gt;
&lt;li&gt;Binary search, parallel binary search, CDQ divide and conquer (which I haven&apos;t looked into yet), and more&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Organizing all of this in my head should give me a bit more confidence next time I tackle problems.&lt;/p&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] https://www.hackerearth.com/practice/notes/sparse-table/&lt;/p&gt;
&lt;p&gt;[2] https://www.topcoder.com/community/data-science/data-science-tutorials/range-minimum-query-and-lowest-common-ancestor/#Sparse_Table_(ST)_algorithm&lt;/p&gt;
&lt;p&gt;[3] http://codeforces.com/blog/entry/45578&lt;/p&gt;
&lt;p&gt;[4] https://ideone.com/tTO9bD&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;Impl ST/C++&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;cstring&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;limits&amp;gt;
#include &amp;lt;type_traits&amp;gt;
#include &amp;lt;random&amp;gt;
using namespace std;

namespace st_impl {

template &amp;lt;class T, class F&amp;gt;
class SparseTable {
public:
    typedef F func_type;
    typedef unsigned size_type;
    typedef T value_type;

    SparseTable(const vector&amp;lt;T&amp;gt;&amp;amp; init) : _size(init.size()), _idx_size(flsl(_size)) {
        table.resize(_size);
        for (auto&amp;amp; row : table) {
            row.resize(_idx_size, func_type::default_value);
        }

        // initialize sparse table
        for (size_type i = 0; i &amp;lt; _size; ++i) {
            table[i][0] = init[i];
        }
        for (size_type j = 1; j &amp;lt; _idx_size; ++j) {
            for (size_type i = 0; i &amp;lt;= _size - (1 &amp;lt;&amp;lt; j); ++i) {
                table[i][j] = f(table[i][j - 1], table[i + (1 &amp;lt;&amp;lt; (j - 1))][j - 1]);
            }
        }
    }

    SparseTable(const initializer_list&amp;lt;T&amp;gt;&amp;amp; init) : SparseTable(vector&amp;lt;T&amp;gt;(init)) {}

    SparseTable(const vector&amp;lt;T&amp;gt;&amp;amp; init, F f) : SparseTable(init) { this-&amp;gt;f = f; }
    SparseTable(const initializer_list&amp;lt;T&amp;gt;&amp;amp; init, F f) : SparseTable(vector&amp;lt;T&amp;gt;(init), f) {}

    T rangeQuery(size_type l, size_type r) const {
        if (!(l &amp;lt;= r &amp;amp;&amp;amp; r &amp;lt; _size)) {
            throw std::out_of_range(&quot;Bad query!&quot;);
        }

        // if the function is idempotent, which means f(x, x) = x holds for
        // all x with definition, then we can deduce that
        // f(range(l, s), range(t, r)) == f(range(l, r)) always
        // holds for all l, s, t, r which satisfies l &amp;lt;= t &amp;amp;&amp;amp; s &amp;lt;= r &amp;amp;&amp;amp; t &amp;lt;= s + 1
        // then rangeQuery will be executed in O(1).
        // otherwise it should be finished in O(lgN).
        if (func_type::idempotent) {
            size_type idx = flsl(r - l + 1) - 1;
            return f(table[l][idx], table[r - (1 &amp;lt;&amp;lt; idx) + 1][idx]);
        } else {
            T res = func_type::default_value;
            for (size_type i = 0; i &amp;lt; _idx_size; ++i) {
                size_type idx = _idx_size - 1 - i;
                if (l + (1 &amp;lt;&amp;lt; idx) - 1 &amp;lt;= r) {
                    res = f(res, table[l][idx]);
                    l += 1 &amp;lt;&amp;lt; idx;
                }
            }
            return res;
        }
    }

private:
    func_type f;

    size_type _size;
    size_type _idx_size;
    vector&amp;lt;vector&amp;lt;T&amp;gt;&amp;gt; table;
};

}  // namespace st_impl

template &amp;lt;class T, T v = T{}&amp;gt;
struct sum_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = false;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return a + b; }
};
template &amp;lt;class T, T v&amp;gt;
constexpr const T sum_f&amp;lt;T, v&amp;gt;::default_value;

template &amp;lt;class T, T v = numeric_limits&amp;lt;T&amp;gt;::min(),
          typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_specialized&amp;gt;::type&amp;gt;
struct max_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return max(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T max_f&amp;lt;T, v, R&amp;gt;::default_value;

template &amp;lt;class T, T v = numeric_limits&amp;lt;T&amp;gt;::max(),
          typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_specialized&amp;gt;::type&amp;gt;
struct min_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return min(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T min_f&amp;lt;T, v, R&amp;gt;::default_value;

uint64_t gcd(uint64_t a, uint64_t b) {
    if (a &amp;lt; b) swap(a, b);
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}

template &amp;lt;class T, T v = T{}, typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_integer&amp;gt;::type&amp;gt;
struct gcd_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return gcd(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T gcd_f&amp;lt;T, v, R&amp;gt;::default_value;

template &amp;lt;class T, class F = max_f&amp;lt;T&amp;gt;&amp;gt;
using SparseTable = st_impl::SparseTable&amp;lt;T, F&amp;gt;;

template &amp;lt;class F&amp;gt;
void random_test(string target_func) {
    int n = 400;
    vector&amp;lt;int&amp;gt; test(n);

    // generate random numbers
    random_device r;
    default_random_engine eng(r());
    uniform_int_distribution&amp;lt;int&amp;gt; uniform_dist(0, 2000);

    for (int i = 0; i &amp;lt; n; ++i) {
        test[i] = uniform_dist(eng);
    }

    // query and verify
    F f;
    SparseTable&amp;lt;int, F&amp;gt; st_test(test, f);

    cout &amp;lt;&amp;lt; &quot;Begin random test on &quot; &amp;lt;&amp;lt; target_func &amp;lt;&amp;lt; &quot;!&quot; &amp;lt;&amp;lt; endl;
    int t = 10;
    for (int i = 0; i &amp;lt; t; ++i) {
        int l = uniform_dist(eng) % n, r = l + ((uniform_dist(eng) % (n - l)) &amp;gt;&amp;gt; (i / 2));
        auto to_verify = st_test.rangeQuery(l, r);
        auto expected = decltype(f)::default_value;

        for (int j = l; j &amp;lt;= r; ++j) {
            expected = f(expected, test[j]);
        }
        assert(to_verify == expected);
        cout &amp;lt;&amp;lt; &quot; + query range(&quot; &amp;lt;&amp;lt; l &amp;lt;&amp;lt; &quot;,&quot; &amp;lt;&amp;lt; r &amp;lt;&amp;lt; &quot;)\t= &quot; &amp;lt;&amp;lt; to_verify &amp;lt;&amp;lt; endl;
    }
    cout &amp;lt;&amp;lt; &quot;Test passed!&quot; &amp;lt;&amp;lt; endl;
}

void regular_test() {
    SparseTable&amp;lt;int&amp;gt; st_max({3, 1, 2, 5, 2, 10, 8});

    assert(st_max.rangeQuery(0, 2) == 3);
    assert(st_max.rangeQuery(3, 6) == 10);
    assert(st_max.rangeQuery(0, 6) == 10);
    assert(st_max.rangeQuery(2, 4) == 5);

    SparseTable&amp;lt;int, min_f&amp;lt;int&amp;gt;&amp;gt; st_min({3, 1, 2, 5, 2, 10, 8});

    assert(st_min.rangeQuery(0, 2) == 1);
    assert(st_min.rangeQuery(3, 6) == 2);
    assert(st_min.rangeQuery(0, 6) == 1);
    assert(st_min.rangeQuery(2, 4) == 2);

    SparseTable&amp;lt;int, sum_f&amp;lt;int&amp;gt;&amp;gt; st_sum({3, 1, 2, 5, 2, 10, 8});

    assert(st_sum.rangeQuery(0, 2) == 6);
    assert(st_sum.rangeQuery(3, 6) == 25);
    assert(st_sum.rangeQuery(0, 6) == 31);
    assert(st_sum.rangeQuery(2, 4) == 9);
}

int main() {
    regular_test();

    random_test&amp;lt;max_f&amp;lt;int&amp;gt;&amp;gt;(&quot;max&quot;);
    random_test&amp;lt;min_f&amp;lt;int&amp;gt;&amp;gt;(&quot;min&quot;);
    random_test&amp;lt;sum_f&amp;lt;int&amp;gt;&amp;gt;(&quot;sum&quot;);
    random_test&amp;lt;gcd_f&amp;lt;int&amp;gt;&amp;gt;(&quot;gcd&quot;);

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Selective Additions / PBS C++&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;iostream&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;map&amp;gt;

using namespace std;

#define defv(alias, type) using v##alias = vector&amp;lt;type&amp;gt;
#define defvv(alias, type) using vv##alias = vector&amp;lt;vector&amp;lt;type&amp;gt;&amp;gt;

using ii = pair&amp;lt;int, int&amp;gt;;
using iii = pair&amp;lt;int, ii&amp;gt;;

defv(i, int);
defvv(i, int);
defv(ii, ii);
defvv(ii, ii);

#define forall(i, a, b) for (int i = int(a); i &amp;lt; int(b); ++i)
#define all(a) a.begin(), a.end()
#define inf (INT_MAX / 2)
#define sz(a) int(a.size())
#define mp(a, b) make_pair(a, b)

const int MAX_N = 1e5 + 5;

long a[MAX_N], d[MAX_N];
int l[MAX_N], r[MAX_N];
long f[5];

int n, m, k;

long fen[MAX_N];
int lowbit(int x) { return x &amp;amp; -x; }

void fen_update(long *fen, int idx, long delta) {
    for (int i = idx; i &amp;lt; n; i += lowbit(i + 1)) {
        fen[i] += delta;
    }
}

long __fen_get(long *fen, int r) {
    long res = 0;
    while (r &amp;gt;= 0) {
        res += fen[r];
        r -= lowbit(r + 1);
    }
    return res;
}

long fen_get(long *fen, int l, int r) { return __fen_get(fen, r) - __fen_get(fen, l - 1); }

void fen_range_update(long *fen, int l, int r, long delta) {
    fen_update(fen, l, delta);
    fen_update(fen, r + 1, -delta);
}

long fen_point_get(long *fen, int i) { return __fen_get(fen, i); }

void fen_reset(long *fen) { forall(i, 0, n) fen[i] = 0; }

inline int fls(int x) {
    int r = 32;

    if (!x) return 0;
    if (!(x &amp;amp; 0xffff0000u)) {
        x &amp;lt;&amp;lt;= 16;
        r -= 16;
    }
    if (!(x &amp;amp; 0xff000000u)) {
        x &amp;lt;&amp;lt;= 8;
        r -= 8;
    }
    if (!(x &amp;amp; 0xf0000000u)) {
        x &amp;lt;&amp;lt;= 4;
        r -= 4;
    }
    if (!(x &amp;amp; 0xc0000000u)) {
        x &amp;lt;&amp;lt;= 2;
        r -= 2;
    }
    if (!(x &amp;amp; 0x80000000u)) {
        x &amp;lt;&amp;lt;= 1;
        r -= 1;
    }
    return r;
}

int t[MAX_N], lb[MAX_N], rb[MAX_N];
vector&amp;lt;int&amp;gt; to_check[MAX_N + 1];
void solve() {
    sort(f, f + k);
    fill_n(t, n, -1);
    forall(i, 0, n) forall(s, 0, k) {
        if (t[i] &amp;lt; 0 &amp;amp;&amp;amp; a[i] == f[s]) t[i] = 0;
    }

    // Parallel Binary Search
    for (int s = 0; s &amp;lt; k; ++s) {
        forall(i, 0, n) if (t[i] &amp;lt; 0) { lb[i] = 1, rb[i] = m; }
        forall(rnd, 0, fls(m)) {
            fen_reset(fen);
            forall(i, 0, m + 1) to_check[i].clear();
            forall(i, 0, n) {
                if (t[i] &amp;lt; 0) to_check[(lb[i] + rb[i]) &amp;gt;&amp;gt; 1].push_back(i);
            }
            forall(i, 0, m) {
                fen_range_update(fen, l[i], r[i], d[i]);
                for (int idx : to_check[i + 1]) {
                    auto now = fen_point_get(fen, idx);
                    if (now + a[idx] &amp;gt; f[s])
                        rb[idx] = i;
                    else if (now + a[idx] &amp;lt; f[s])
                        lb[idx] = i + 2;
                    else
                        t[idx] = i + 1;
                }
            }
        }
    }

    // reuse fen for counting invalid updates
    long sum = 0;
    fen_reset(fen);
    forall(i, 0, m + 1) to_check[i].clear();
    forall(i, 0, n) {
        if (t[i] == 0)
            fen_update(fen, i, 1);
        else if (t[i] &amp;gt; 0)
            to_check[t[i]].push_back(i);
        sum += a[i];
    }

    forall(i, 0, m) {
        sum += d[i] * (r[i] - l[i] + 1 - fen_get(fen, l[i], r[i]));
        cout &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; endl;
        for (auto idx : to_check[i + 1]) {
            fen_update(fen, idx, 1);
        }
    }
}

int main() {
    cin &amp;gt;&amp;gt; n &amp;gt;&amp;gt; m &amp;gt;&amp;gt; k;
    forall(i, 0, n) { cin &amp;gt;&amp;gt; a[i]; }
    forall(i, 0, k) { cin &amp;gt;&amp;gt; f[i]; }
    forall(i, 0, m) {
        cin &amp;gt;&amp;gt; l[i] &amp;gt;&amp;gt; r[i] &amp;gt;&amp;gt; d[i];
        l[i]--, r[i]--;
    }

    solve();

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Sparse Table &amp; Parallel Binary Search</title><link>https://blog.crazyark.xyz/zh/posts/hourrank23/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/hourrank23/</guid><pubDate>Sat, 09 Sep 2017 17:06:11 GMT</pubDate><content:encoded>&lt;p&gt;You never know how weak you are until you grind problems, and the more you grind, the more you see it 🙄 -- notes on getting crushed in HourRank23.&lt;/p&gt;
&lt;h3&gt;Sparse Table&lt;/h3&gt;
&lt;p&gt;Recall that the previous two posts, on segment trees and BITs, both dealt with range queries. Let&apos;s review.&lt;/p&gt;
&lt;p&gt;Segment trees support range updates and range queries for any associative function, with O(n) space and O(lgn) time for both updates and queries.&lt;/p&gt;
&lt;p&gt;BITs support point update/range query and range update/point query for operations over any group, with O(n) space and O(lgn) time for updates and queries (in truth this depends on how fast inverses can be constructed). Generalizing BITs to range update/range query requires stronger properties; in practice they are mainly used for sums over the integers.&lt;/p&gt;
&lt;p&gt;The Sparse Table introduced here is another data structure that supports range queries, targeting &lt;strong&gt;immutable arrays&lt;/strong&gt;, with O(nlgn) space complexity.&lt;/p&gt;
&lt;p&gt;A Sparse Table likewise supports all kinds of functions -- any associative function works. For all such functions the build time is O(nlgn), and both the idea and the code are very simple.&lt;/p&gt;
&lt;p&gt;Furthermore, if &lt;strong&gt;the function is idempotent, a Sparse Table can answer a range query in O(1)&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;Core Principle&lt;/h4&gt;
&lt;p&gt;Suppose we have an array of length $N$, ${a_0, ..., a_{N - 1}}$, and a binary function $f$ satisfying associativity: $f(a, f(b, c)) = f(f(a, b), c)$.&lt;/p&gt;
&lt;p&gt;We abbreviate the query of $f$ over the range $[i, j]$ as $f(a[i..j])$.&lt;/p&gt;
&lt;p&gt;A Sparse Table is then a two-dimensional array of size $N(\lfloor\log N\rfloor + 1)$, whose $(i, j)$ entry represents the range result $f(a[i..i + 2^j - 1])$, denoted $b_{i,j}$.&lt;/p&gt;
&lt;p&gt;Building such an array is easy, because $f(a[i..i + 2^j - 1]) = f(f(a[i..i+2^{j-1} - 1]), f(a[i + 2^{j-1}..i + 2^j - 1]))$, where the two inner terms are exactly entries $(i, j - 1)$ and $(i + 2^{j - 1}, j - 1)$, and $f(a[i..i]) = a_i$. So we can fill it in level by level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// assuming Arr is indexed from 0
for i=0..N-1: 
  Table[i][0] = Arr[i]
  
// assuming N &amp;lt; 2^(k+1)
for j=1..k: 
  for i=0..N-2^j:
    Table[i][j] = F(Table[i][j - 1], Table[i + 2^(j - 1)][j - 1])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How do we query? For any range $[i, j]$, the length $L = j - i + 1 \le N$ always holds, so we can write $L$ in binary as $L = 2^{q_k} + 2^{q_{k - 1}} + ... + 2^{q_0}$.&lt;/p&gt;
&lt;p&gt;Then $j = i + 2^{q_k} + 2^{q_{k - 1}} + ... + 2^{q_0} - 1$, i.e. the range splits into consecutive blocks of power-of-two lengths -- does this form remind you of anything?&lt;/p&gt;
&lt;p&gt;So the following procedure yields the exact result in O(lgN):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;answer = ZERO 
L’ = L
for i=k..0:
  if L’ + 2^i - 1 &amp;lt;= R:
    // F is associative, so this operation is meaningful
    answer = F(answer, Table[L’][i]) 
    L’ += 2^i
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now suppose our function $f$ is also idempotent, meaning $f(x, x) = x$ holds for every $x$ in its domain. Then we immediately obtain&lt;/p&gt;
&lt;p&gt;$f(a[i..j]) = f(f(a[i..s]), f(a[t..j]))$ for all $s, t$ with $i \le t$, $s \le j$, $t \le s + 1$.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This property frees us from having to cover the range exactly once -- overlap is allowed, which is the key to the O(1) speedup.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Let $t$ be the largest $t$ with $2^t \le (j - i + 1)$, i.e. $2^{t + 1} &amp;gt; (j - i + 1)$. Then clearly $i + 2^t - 1 \le j$, $j - 2^t + 1 \ge i$, and $j - 2^t + 1 \le (i + 2^t - 1) + 1$ always holds.&lt;/p&gt;
&lt;p&gt;So $f(a[i..j]) = f(f(a[i..i + 2^t - 1]), f(a[j - 2^t + 1..j]))$, and the two inner terms are exactly $b_{i, t}$ and $b_{j - 2^t + 1, t}$.&lt;/p&gt;
&lt;p&gt;That concludes the theory; the implementation is in the Appendix at the end.&lt;/p&gt;
&lt;h4&gt;ST &amp;amp; LCA&lt;/h4&gt;
&lt;p&gt;Sparse Tables can be used not only for all kinds of range queries but also for computing the lowest common ancestor of two tree nodes. With an ST, the LCA of any two nodes can be found in O(lgH) time with O(NlgN) space, where N is the number of nodes and H is the tree height.&lt;/p&gt;
&lt;p&gt;The main idea is to store the 2^j-th ancestor of each node i in an array and then search. For details, interested readers can refer to the topcoder article on RMQ and LCA linked in the references; I will not repeat it here.&lt;/p&gt;
&lt;h3&gt;Parallel Binary Search&lt;/h3&gt;
&lt;p&gt;Parallel Binary Search is known in Chinese as &lt;strong&gt;整体二分&lt;/strong&gt;. We first introduce the method with a SPOJ problem, and then look at a rather different problem from HourRank.&lt;/p&gt;
&lt;p&gt;The next two subsections draw heavily on a Codeforces blog post; the original link is at the end of this article for those who want to study it further.&lt;/p&gt;
&lt;h4&gt;Motivation Problem&lt;/h4&gt;
&lt;p&gt;There is a problem on SPOJ: &lt;a href=&quot;http://www.spoj.com/problems/METEORS/&quot;&gt;Meteors&lt;/a&gt;. Let me describe it briefly:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are N member states sharing a circular space divided into M equal sectors, each sector belonging to one state. Q meteor showers are forecast; each falls on a sector range [L, R], and every sector in that range receives the same amount K of meteorites. Each state i has a collection target reqd[i]. For each state, find the minimum number of showers after which it has collected its target amount.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;1 &amp;lt;= N, M, Q &amp;lt;= 300000
1 &amp;lt;= K &amp;lt;= 1e9&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Solution&lt;/h4&gt;
&lt;p&gt;If we simulate shower by shower without any data structure and then check whether the targets are met, each update costs O(M) and each check costs O(N + M); the total cost O((2M + N)Q) is unaffordable.&lt;/p&gt;
&lt;p&gt;Seeing range updates, and recalling the data structures from the beginning of this post, a BIT is the most natural choice for simulating the updates. This brings the update cost down to O(lgM), but the check cost rises to O(MlgM), so checking after every shower is now even worse than before.&lt;/p&gt;
&lt;p&gt;Does this remind you of anything? Right -- binary search! Binary search does not care what every element of the sequence looks like; it only needs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;the sequence is monotonic&lt;/li&gt;
&lt;li&gt;the value at any given position can be obtained somehow&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Binary search then finds the position we want with O(lgN) value lookups/checks.&lt;/p&gt;
&lt;p&gt;Here the sequence is a state&apos;s cumulative collection over time, implicitly determined by the sequence of sector updates. With binary search, each state needs only O(lgQ) checks to learn at which point in time its target is met. The per-state cost is therefore roughly O(lgQ * Q * lgM) -- the check time is negligible next to the simulation time and is omitted. The overall complexity is O(N * lgQ * Q * lgM).&lt;/p&gt;
&lt;p&gt;As you can see, this is still far too high -- much worse than direct simulation. What now?&lt;/p&gt;
&lt;p&gt;Think about it for a moment: each simulation got cheaper, but the number of simulations rose to O(NlgQ); the number of checks dropped too, but checks are so cheap next to simulation that they disappear from the total... So the pressing task is to cut the simulation cost at the same time.&lt;/p&gt;
&lt;p&gt;Doesn&apos;t it feel extravagant? For one little check, a whole round of simulation is replayed -- however cheap the check is, it cannot justify burning CPU like that.&lt;/p&gt;
&lt;p&gt;In fact, every state&apos;s binary search proceeds very similarly: locate the needed spot in the fully simulated sequence, check it, move to the next round. Since a full round of simulation is done anyway, why not let every state finish its check in that round before we start the next one?&lt;/p&gt;
&lt;p&gt;This is the core idea of Parallel Binary Search: &lt;strong&gt;complete the current binary-search step of all states within a single round of simulation&lt;/strong&gt;, reducing the total number of simulation rounds to O(lgQ). A huge improvement!&lt;/p&gt;
&lt;p&gt;Concretely, keep two arrays of length N recording each state&apos;s current L and R; for every mid to be checked, keep a list of the states that currently need that mid checked; the remaining procedure is described by the following pseudocode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for all logQ steps:
    clear range tree and linked list check
    for all member states i:
        if L[i] != R[i]:
            mid = (L[i] + R[i]) / 2
            insert i in check[mid]
    for all queries q:
        apply(q)
        for all member states m in check[q]:
            if m has requirements fulfilled:
                R[m] = q
            else:
                L[m] = q + 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the procedure above, apply(q) applies update q in the simulation, after which the states waiting at that query are checked against their targets.&lt;/p&gt;
&lt;p&gt;This way the total simulation cost is O(lgQ * Q * lgM) and the total check cost is O(lgQ * (MlgM + N)) -- we have managed to reduce both at once! The overall time complexity is O(lgQ * (Q + M) * lgM); the lone N term is negligible.&lt;/p&gt;
&lt;p&gt;As in the original post, here is a quote that nicely explains Parallel Binary Search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;A cool way to visualize this is to think of a binary search tree. Suppose we are doing standard binary search, and we reject the right interval — this can be thought of as moving left in the tree. Similarly, if we reject the left interval, we are moving right in the tree.
So what Parallel Binary Search does is move one step down in N binary search trees simultaneously in one &quot;sweep&quot;, taking O(N  *  X) time, where X is dependent on the problem and the data structures used in it. Since the height of each tree is LogN, the complexity is O(N  *  X  *  logN).&quot; — rekt_n00b&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Hourrank23 -- Selective Additions&lt;/h4&gt;
&lt;p&gt;This is a cursed problem... sorry I couldn&apos;t help it 🤣&lt;/p&gt;
&lt;p&gt;The problem is as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is an array of length N. We perform M range updates, each adding a positive number. However, we have K favorite numbers. Once an element in the array becomes one of our favorite numbers, further updates to it are invalidated and it retains that value forever.
Output the array sum after each update.&lt;/p&gt;
&lt;p&gt;1 &amp;lt;= N, M &amp;lt;= 1e5
1 &amp;lt;= K &amp;lt;= 5&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This problem differs from the previous one: &lt;strong&gt;the check cost is low, and the check target coincides with the original range&lt;/strong&gt;. Also, there are multiple targets.&lt;/p&gt;
&lt;p&gt;The multiple targets can be handled as follows: sort the favorite numbers. Since updates are always positive, once the earlier (smaller) favorite number is reached, the later ones no longer need to be checked.&lt;/p&gt;
&lt;p&gt;Using the PBS approach described above, the complexity of this problem is O(k(n + m)lognlogm), which is a very good complexity.&lt;/p&gt;
&lt;p&gt;However, there must be a &quot;however&quot; here. Since the check cost is very low, binary search is not strictly necessary for this problem. I reviewed many top competitors&apos; code and the editorial. Some used PBS, some simulated the time series, and some used segment trees with tricks. Let me introduce them one by one:&lt;/p&gt;
&lt;h4&gt;PBS&lt;/h4&gt;
&lt;p&gt;Nothing more to say -- offline simulation for logm rounds.&lt;/p&gt;
&lt;h4&gt;Time Series Simulation&lt;/h4&gt;
&lt;p&gt;代码: https://www.hackerrank.com/rest/contests/hourrank-23/challenges/selective-additions/hackers/nuipqiun/download_solution&lt;/p&gt;
&lt;p&gt;This is also offline processing, but this competitor&apos;s simulation method is quite unique -- they simulate the value of the current array element along the time series. This was a first for me.&lt;/p&gt;
&lt;p&gt;First, recall BIT&apos;s range update method: add delta at position l, subtract delta at position r + 1. Suppose we consider all range updates as [l, n]. Then simulating the value of the i-th array element along the time series becomes intuitive: for updates (j, a, b, delta) where a &amp;lt;= i, add delta at position j in the time series array. This way, all values after time j are increased by delta.&lt;/p&gt;
&lt;p&gt;With this in mind, the time series simulation is straightforward. Since there are additions, there must be corresponding subtractions: for all updates (j, a, b, delta) where b &amp;lt; i, cancel them out by subtracting delta at position j in the time series, because such updates should not affect the time series of array element i, and they were previously applied.&lt;/p&gt;
&lt;p&gt;So the approach becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rep(i,n){
	for(int j:ad[i]) add(j,ads[j]);
	for(int j:rm[i]) add(j,-ads[j]);
	// do your work
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, ad[i] contains the indices of updates of the form [i, r], rm[i] contains the indices of updates of the form [l, i - 1], and the add operation is the Fenwick tree&apos;s addition operation. This allows us to compute a[i]&apos;s value at any time point using sum(j).&lt;/p&gt;
&lt;p&gt;The rest is simple: for each i, do a binary search to find the target and record it. The total time complexity should be O((kn + m)logm).&lt;/p&gt;
&lt;p&gt;This is clearly faster than PBS because &lt;strong&gt;the check range coincides with the original range&lt;/strong&gt;. This observation is very important. In contrast, for Meteors, although we can also simulate the time series of a range, a range is not meaningful for the target.&lt;/p&gt;
&lt;h4&gt;Segment Tree&lt;/h4&gt;
&lt;p&gt;Unlike the two approaches above, this is an online algorithm.&lt;/p&gt;
&lt;p&gt;The time complexity is O(klgm(n + m)), but it uses segment trees, so the space complexity is O(kmlgm).&lt;/p&gt;
&lt;p&gt;The segment tree stores three values: a lazy field add, a range maximum maxi, and the corresponding array index id.&lt;/p&gt;
&lt;p&gt;There are k segment trees in total, each initialized with the array {a_i - fav_j}. In other words, if the corresponding value is positive, it&apos;s worth checking; if negative, it hasn&apos;t reached the check threshold yet.&lt;/p&gt;
&lt;p&gt;Conveniently, leveraging the segment tree&apos;s structure, we can check the root node to determine whether there are any elements in the tree that need checking. This operation is O(1). When an array element has been checked, it is assigned -inf so it will never be checked again, and the segment tree transitions to the next pending state.&lt;/p&gt;
&lt;p&gt;Since each array element is checked at most once per favorite number, the total check complexity is O(knlgm).&lt;/p&gt;
&lt;p&gt;The update complexity is O(kmlgm), so the total complexity is O(k(m + n)lgm).&lt;/p&gt;
&lt;p&gt;The segment tree trick here initially puzzled me. Why doesn&apos;t it work for Meteors but works here? I thought about it extensively but couldn&apos;t come up with a third approach to map Meteors&apos; query ranges to a conveniently modifiable structure. So it ultimately comes down to the two reasons described earlier.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;For a series of update/query problems, the community has provided a variety of powerful tools:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Data structures: segment tree, Fenwick tree, Sparse Table, the recently studied Cartesian tree, and so on&lt;/li&gt;
&lt;li&gt;Binary search, parallel binary search, CDQ divide and conquer (which I haven&apos;t looked into yet), and more&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Organizing all of this in my head should give me a bit more confidence next time I tackle problems 🍺&lt;/p&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] https://www.hackerearth.com/practice/notes/sparse-table/&lt;/p&gt;
&lt;p&gt;[2] https://www.topcoder.com/community/data-science/data-science-tutorials/range-minimum-query-and-lowest-common-ancestor/#Sparse_Table_(ST)_algorithm&lt;/p&gt;
&lt;p&gt;[3] http://codeforces.com/blog/entry/45578&lt;/p&gt;
&lt;p&gt;[4] https://ideone.com/tTO9bD&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;Impl ST/C++&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;cstring&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;limits&amp;gt;
#include &amp;lt;type_traits&amp;gt;
#include &amp;lt;random&amp;gt;
using namespace std;

namespace st_impl {

template &amp;lt;class T, class F&amp;gt;
class SparseTable {
public:
    typedef F func_type;
    typedef unsigned size_type;
    typedef T value_type;

    SparseTable(const vector&amp;lt;T&amp;gt;&amp;amp; init) : _size(init.size()), _idx_size(flsl(_size)) {
        table.resize(_size);
        for (auto&amp;amp; row : table) {
            row.resize(_idx_size, func_type::default_value);
        }

        // initialize sparse table
        for (size_type i = 0; i &amp;lt; _size; ++i) {
            table[i][0] = init[i];
        }
        for (size_type j = 1; j &amp;lt; _idx_size; ++j) {
            for (size_type i = 0; i &amp;lt;= _size - (1 &amp;lt;&amp;lt; j); ++i) {
                table[i][j] = f(table[i][j - 1], table[i + (1 &amp;lt;&amp;lt; (j - 1))][j - 1]);
            }
        }
    }

    SparseTable(const initializer_list&amp;lt;T&amp;gt;&amp;amp; init) : SparseTable(vector&amp;lt;T&amp;gt;(init)) {}

    SparseTable(const vector&amp;lt;T&amp;gt;&amp;amp; init, F f) : SparseTable(init) { this-&amp;gt;f = f; }
    SparseTable(const initializer_list&amp;lt;T&amp;gt;&amp;amp; init, F f) : SparseTable(vector&amp;lt;T&amp;gt;(init), f) {}

    T rangeQuery(size_type l, size_type r) const {
        if (!(l &amp;lt;= r &amp;amp;&amp;amp; r &amp;lt; _size)) {
            throw std::out_of_range(&quot;Bad query!&quot;);
        }

        // if the function is idempotent, which means f(x, x) = x holds for
        // all x with definition, then we can deduce that
        // f(range(l, s), range(t, r)) == f(range(l, r)) always
        // holds for all l, s, t, r which satisfies l &amp;lt;= t &amp;amp;&amp;amp; s &amp;lt;= r &amp;amp;&amp;amp; t &amp;lt;= s + 1
        // then rangeQuery will be executed in O(1).
        // otherwise it should be finished in O(lgN).
        if (func_type::idempotent) {
            size_type idx = flsl(r - l + 1) - 1;
            return f(table[l][idx], table[r - (1 &amp;lt;&amp;lt; idx) + 1][idx]);
        } else {
            T res = func_type::default_value;
            for (size_type i = 0; i &amp;lt; _idx_size; ++i) {
                size_type idx = _idx_size - 1 - i;
                if (l + (1 &amp;lt;&amp;lt; idx) - 1 &amp;lt;= r) {
                    res = f(res, table[l][idx]);
                    l += 1 &amp;lt;&amp;lt; idx;
                }
            }
            return res;
        }
    }

private:
    func_type f;

    size_type _size;
    size_type _idx_size;
    vector&amp;lt;vector&amp;lt;T&amp;gt;&amp;gt; table;
};

}  // namespace st_impl

template &amp;lt;class T, T v = T{}&amp;gt;
struct sum_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = false;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return a + b; }
};
template &amp;lt;class T, T v&amp;gt;
constexpr const T sum_f&amp;lt;T, v&amp;gt;::default_value;

template &amp;lt;class T, T v = numeric_limits&amp;lt;T&amp;gt;::min(),
          typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_specialized&amp;gt;::type&amp;gt;
struct max_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return max(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T max_f&amp;lt;T, v, R&amp;gt;::default_value;

template &amp;lt;class T, T v = numeric_limits&amp;lt;T&amp;gt;::max(),
          typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_specialized&amp;gt;::type&amp;gt;
struct min_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return min(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T min_f&amp;lt;T, v, R&amp;gt;::default_value;

uint64_t gcd(uint64_t a, uint64_t b) {
    if (a &amp;lt; b) swap(a, b);
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}

template &amp;lt;class T, T v = T{}, typename = typename enable_if&amp;lt;numeric_limits&amp;lt;T&amp;gt;::is_integer&amp;gt;::type&amp;gt;
struct gcd_f {
    static constexpr T default_value = v;
    static constexpr bool idempotent = true;
    T operator()(const T&amp;amp; a, const T&amp;amp; b) const { return gcd(a, b); }
};
template &amp;lt;class T, T v, typename R&amp;gt;
constexpr const T gcd_f&amp;lt;T, v, R&amp;gt;::default_value;

template &amp;lt;class T, class F = max_f&amp;lt;T&amp;gt;&amp;gt;
using SparseTable = st_impl::SparseTable&amp;lt;T, F&amp;gt;;

template &amp;lt;class F&amp;gt;
void random_test(string target_func) {
    int n = 400;
    vector&amp;lt;int&amp;gt; test(n);

    // generate random numbers
    random_device r;
    default_random_engine eng(r());
    uniform_int_distribution&amp;lt;int&amp;gt; uniform_dist(0, 2000);

    for (int i = 0; i &amp;lt; n; ++i) {
        test[i] = uniform_dist(eng);
    }

    // query and verify
    F f;
    SparseTable&amp;lt;int, F&amp;gt; st_test(test, f);

    cout &amp;lt;&amp;lt; &quot;Begin random test on &quot; &amp;lt;&amp;lt; target_func &amp;lt;&amp;lt; &quot;!&quot; &amp;lt;&amp;lt; endl;
    int t = 10;
    for (int i = 0; i &amp;lt; t; ++i) {
        int l = uniform_dist(eng) % n, r = l + ((uniform_dist(eng) % (n - l)) &amp;gt;&amp;gt; (i / 2));
        auto to_verify = st_test.rangeQuery(l, r);
        auto expected = decltype(f)::default_value;

        for (int j = l; j &amp;lt;= r; ++j) {
            expected = f(expected, test[j]);
        }
        assert(to_verify == expected);
        cout &amp;lt;&amp;lt; &quot; + query range(&quot; &amp;lt;&amp;lt; l &amp;lt;&amp;lt; &quot;,&quot; &amp;lt;&amp;lt; r &amp;lt;&amp;lt; &quot;)\t= &quot; &amp;lt;&amp;lt; to_verify &amp;lt;&amp;lt; endl;
    }
    cout &amp;lt;&amp;lt; &quot;Test passed!&quot; &amp;lt;&amp;lt; endl;
}

void regular_test() {
    SparseTable&amp;lt;int&amp;gt; st_max({3, 1, 2, 5, 2, 10, 8});

    assert(st_max.rangeQuery(0, 2) == 3);
    assert(st_max.rangeQuery(3, 6) == 10);
    assert(st_max.rangeQuery(0, 6) == 10);
    assert(st_max.rangeQuery(2, 4) == 5);

    SparseTable&amp;lt;int, min_f&amp;lt;int&amp;gt;&amp;gt; st_min({3, 1, 2, 5, 2, 10, 8});

    assert(st_min.rangeQuery(0, 2) == 1);
    assert(st_min.rangeQuery(3, 6) == 2);
    assert(st_min.rangeQuery(0, 6) == 1);
    assert(st_min.rangeQuery(2, 4) == 2);

    SparseTable&amp;lt;int, sum_f&amp;lt;int&amp;gt;&amp;gt; st_sum({3, 1, 2, 5, 2, 10, 8});

    assert(st_sum.rangeQuery(0, 2) == 6);
    assert(st_sum.rangeQuery(3, 6) == 25);
    assert(st_sum.rangeQuery(0, 6) == 31);
    assert(st_sum.rangeQuery(2, 4) == 9);
}

int main() {
    regular_test();

    random_test&amp;lt;max_f&amp;lt;int&amp;gt;&amp;gt;(&quot;max&quot;);
    random_test&amp;lt;min_f&amp;lt;int&amp;gt;&amp;gt;(&quot;min&quot;);
    random_test&amp;lt;sum_f&amp;lt;int&amp;gt;&amp;gt;(&quot;sum&quot;);
    random_test&amp;lt;gcd_f&amp;lt;int&amp;gt;&amp;gt;(&quot;gcd&quot;);

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Selective Additions / PBS C++&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;iostream&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;map&amp;gt;

using namespace std;

#define defv(alias, type) using v##alias = vector&amp;lt;type&amp;gt;
#define defvv(alias, type) using vv##alias = vector&amp;lt;vector&amp;lt;type&amp;gt;&amp;gt;

using ii = pair&amp;lt;int, int&amp;gt;;
using iii = pair&amp;lt;int, ii&amp;gt;;

defv(i, int);
defvv(i, int);
defv(ii, ii);
defvv(ii, ii);

#define forall(i, a, b) for (int i = int(a); i &amp;lt; int(b); ++i)
#define all(a) a.begin(), a.end()
#define inf (INT_MAX / 2)
#define sz(a) int(a.size())
#define mp(a, b) make_pair(a, b)

const int MAX_N = 1e5 + 5;

long a[MAX_N], d[MAX_N];
int l[MAX_N], r[MAX_N];
long f[5];

int n, m, k;

long fen[MAX_N];
int lowbit(int x) { return x &amp;amp; -x; }

void fen_update(long *fen, int idx, long delta) {
    for (int i = idx; i &amp;lt; n; i += lowbit(i + 1)) {
        fen[i] += delta;
    }
}

long __fen_get(long *fen, int r) {
    long res = 0;
    while (r &amp;gt;= 0) {
        res += fen[r];
        r -= lowbit(r + 1);
    }
    return res;
}

long fen_get(long *fen, int l, int r) { return __fen_get(fen, r) - __fen_get(fen, l - 1); }

void fen_range_update(long *fen, int l, int r, long delta) {
    fen_update(fen, l, delta);
    fen_update(fen, r + 1, -delta);
}

long fen_point_get(long *fen, int i) { return __fen_get(fen, i); }

void fen_reset(long *fen) { forall(i, 0, n) fen[i] = 0; }

inline int fls(int x) {
    int r = 32;

    if (!x) return 0;
    if (!(x &amp;amp; 0xffff0000u)) {
        x &amp;lt;&amp;lt;= 16;
        r -= 16;
    }
    if (!(x &amp;amp; 0xff000000u)) {
        x &amp;lt;&amp;lt;= 8;
        r -= 8;
    }
    if (!(x &amp;amp; 0xf0000000u)) {
        x &amp;lt;&amp;lt;= 4;
        r -= 4;
    }
    if (!(x &amp;amp; 0xc0000000u)) {
        x &amp;lt;&amp;lt;= 2;
        r -= 2;
    }
    if (!(x &amp;amp; 0x80000000u)) {
        x &amp;lt;&amp;lt;= 1;
        r -= 1;
    }
    return r;
}

int t[MAX_N], lb[MAX_N], rb[MAX_N];
vector&amp;lt;int&amp;gt; to_check[MAX_N + 1];
void solve() {
    sort(f, f + k);
    fill_n(t, n, -1);
    forall(i, 0, n) forall(s, 0, k) {
        if (t[i] &amp;lt; 0 &amp;amp;&amp;amp; a[i] == f[s]) t[i] = 0;
    }

    // Parallel Binary Search
    for (int s = 0; s &amp;lt; k; ++s) {
        forall(i, 0, n) if (t[i] &amp;lt; 0) { lb[i] = 1, rb[i] = m; }
        forall(rnd, 0, fls(m)) {
            fen_reset(fen);
            forall(i, 0, m + 1) to_check[i].clear();
            forall(i, 0, n) {
                if (t[i] &amp;lt; 0) to_check[(lb[i] + rb[i]) &amp;gt;&amp;gt; 1].push_back(i);
            }
            forall(i, 0, m) {
                fen_range_update(fen, l[i], r[i], d[i]);
                for (int idx : to_check[i + 1]) {
                    auto now = fen_point_get(fen, idx);
                    if (now + a[idx] &amp;gt; f[s])
                        rb[idx] = i;
                    else if (now + a[idx] &amp;lt; f[s])
                        lb[idx] = i + 2;
                    else
                        t[idx] = i + 1;
                }
            }
        }
    }

    // reuse fen for counting invalid updates
    long sum = 0;
    fen_reset(fen);
    forall(i, 0, m + 1) to_check[i].clear();
    forall(i, 0, n) {
        if (t[i] == 0)
            fen_update(fen, i, 1);
        else if (t[i] &amp;gt; 0)
            to_check[t[i]].push_back(i);
        sum += a[i];
    }

    forall(i, 0, m) {
        sum += d[i] * (r[i] - l[i] + 1 - fen_get(fen, l[i], r[i]));
        cout &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; endl;
        for (auto idx : to_check[i + 1]) {
            fen_update(fen, idx, 1);
        }
    }
}

int main() {
    cin &amp;gt;&amp;gt; n &amp;gt;&amp;gt; m &amp;gt;&amp;gt; k;
    forall(i, 0, n) { cin &amp;gt;&amp;gt; a[i]; }
    forall(i, 0, k) { cin &amp;gt;&amp;gt; f[i]; }
    forall(i, 0, m) {
        cin &amp;gt;&amp;gt; l[i] &amp;gt;&amp;gt; r[i] &amp;gt;&amp;gt; d[i];
        l[i]--, r[i]--;
    }

    solve();

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Binary Indexed Tree (Fenwick Tree)</title><link>https://blog.crazyark.xyz/zh/posts/binary_indexed_tree/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/binary_indexed_tree/</guid><pubDate>Fri, 08 Sep 2017 14:23:50 GMT</pubDate><content:encoded>&lt;p&gt;I was always confused about the tree construction of Binary Indexed Tree/Fenwick tree, understanding it only vaguely. Now I&apos;ve finally figured out the parent-child relationships of its nodes, so I&apos;m writing this down to avoid forgetting.&lt;/p&gt;
&lt;h3&gt;Binary Indexed Tree&lt;/h3&gt;
&lt;p&gt;BIT is commonly used to store prefix sums, or to store the array itself (using a prefix-sum trick). BIT uses an array to represent the tree, with O(n) space complexity. BIT supports two operations, update and query, for point updates and prefix-sum queries, both with a worst-case time complexity of O(lg n).&lt;/p&gt;
&lt;p&gt;In a prefix sum BIT, the i-th element (1-based) stores the sum of the original array over the interval [i - (i &amp;amp; -i) + 1, i], where (i &amp;amp; -i) is the value represented by the lowest set bit of i.&lt;/p&gt;
&lt;h4&gt;Parent&lt;/h4&gt;
&lt;p&gt;Given a node i, its parent is i - (i &amp;amp; -i), which clears the lowest set bit. Clearly the parent is unique, and all nodes share a common ancestor at 0.&lt;/p&gt;
&lt;h4&gt;Children&lt;/h4&gt;
&lt;p&gt;All children of node i (every z satisfying z - (z &amp;amp; -z) = i) can be obtained through the following process:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for (int j = 1; j &amp;lt; (i &amp;amp; -i); j &amp;lt;&amp;lt;= 1) {
    // z is a child of i
    int z = i + j;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Point Get&lt;/h4&gt;
&lt;p&gt;Given a node i, we can obtain the value of array element i through the following process:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;int res = bit[i];
int z = i - (i &amp;amp; -i);
int j = i - 1;
while (j != z) {
    res -= bit[j];
    j = j - (j &amp;amp; -j);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above process removes the trailing consecutive 1-bits of i - 1 one by one, effectively computing sum(i - (i &amp;amp; -i), i - 1), and then subtracting it from node i&apos;s stored sum to get the value of array element i.&lt;/p&gt;
&lt;h4&gt;Prefix Sum&lt;/h4&gt;
&lt;p&gt;By repeatedly finding the parent node and accumulating:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;long sum = 0;
for (int j = i; j &amp;gt; 0; j -= (j &amp;amp; -j)) {
    sum += bit[j];
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clearly the intervals are disjoint, and the final interval is [1, i].&lt;/p&gt;
&lt;h4&gt;Intervals Cover [i, i]&lt;/h4&gt;
&lt;p&gt;The following process finds all nodes whose intervals cover [i, i]. These are also the ancestors of node i in two&apos;s complement, as we will prove shortly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for (int j = i; j &amp;lt; n; j += (j &amp;amp; -j)) {
    // node j covers [i, i]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, clearly node i is the first node covering i. Then j = i + (i &amp;amp; -i) is the first node after i satisfying j - (j &amp;amp; -j) &amp;lt; i, which is easy to prove.&lt;/p&gt;
&lt;p&gt;Furthermore, for j = i + (i &amp;amp; -i), the inequality j - (j &amp;amp; -j) &amp;lt;= i - (i &amp;amp; -i) always holds, meaning node j&apos;s interval covers node i&apos;s interval, while nodes between i and j are all disjoint from node i.&lt;/p&gt;
&lt;p&gt;Therefore, it&apos;s easy to deduce that the nodes whose intervals cover position i are exactly those produced by the above process.&lt;/p&gt;
&lt;h4&gt;Parent in 2&apos;s Complement&lt;/h4&gt;
&lt;p&gt;Assume we&apos;re in a 32-bit environment. Remember the binary representation of -i? It&apos;s 2^32 - i. So we have i &amp;amp; -i = i &amp;amp; (2^32 - i), allowing us to convert from signed to unsigned arithmetic.&lt;/p&gt;
&lt;p&gt;Recall the operation for finding the next node covering i: i + (i &amp;amp; -i). Let j = 2^32 - i, and we can reorganize this expression:&lt;/p&gt;
&lt;p&gt;i + (i &amp;amp; -i) = 2^32 - (j - (j &amp;amp; (2^32 - j))),&lt;/p&gt;
&lt;p&gt;Mapping it back into two&apos;s complement: 2^32 - i - (i &amp;amp; -i) = j - (j &amp;amp; (2^32 - j)).&lt;/p&gt;
&lt;p&gt;This looks very familiar, doesn&apos;t it? It&apos;s the operation of clearing j&apos;s lowest set bit, which we can also view as a find-parent operation, with 0 as the common ancestor. Now we can see clearly: the operation i + (i &amp;amp; -i) is a symmetric find-parent operation in two&apos;s complement, and the original find-parent operation i - (i &amp;amp; -i) is likewise its symmetric counterpart.&lt;/p&gt;
&lt;h4&gt;Suffix BIT&lt;/h4&gt;
&lt;p&gt;Now we can introduce a suffix sum BIT:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const int MAX_N = 1e5;
int bit[MAX_N];

void update(int i, int delta) {
    while (i &amp;gt; 0) {
        bit[i] += delta;
        i -= (i &amp;amp; -i);
    }
}

// get the suffix sum of [i, n]
int get(int i, int n) {
    int res = 0;
    while (i &amp;lt;= n) {
        res += bit[i];
        i += (i &amp;amp; -i);
    }
    return res;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The operation that was previously baffling finds its explanation in two&apos;s complement. Here, node i represents the interval sum of [i, i + (i &amp;amp; -i) - 1].&lt;/p&gt;
&lt;p&gt;Of course, one could also just maintain a global sum variable and compute sum - prefix(1, i - 1) instead.&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;Impl &amp;amp; 3 Usages&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;cassert&amp;gt;
using namespace std;

class BinaryIndexedTree {
protected:
    vector&amp;lt;long&amp;gt; bit;

    static int lowbit(int x) { return x &amp;amp; -x; }

    BinaryIndexedTree(BinaryIndexedTree&amp;amp;&amp;amp; other) { bit = std::move(other.bit); }

public:
    BinaryIndexedTree(int n) { bit = vector&amp;lt;long&amp;gt;(n); }

    BinaryIndexedTree(const vector&amp;lt;long&amp;gt;&amp;amp; nums) {
        int n = nums.size();
        bit = vector&amp;lt;long&amp;gt;(n);

        for (int i = 0; i &amp;lt; n; ++i) {
            bit[i] = nums[i];
            for (int j = i - 1; j &amp;gt; i - lowbit(i + 1); --j) {
                bit[i] += nums[j];
            }
        }
    }

    void add(int i, long delta) {
        for (int j = i; j &amp;lt; int(bit.size()); j += lowbit(j + 1)) {
            bit[j] += delta;
        }
    }

    long sum(int k) {
        long res = 0;
        for (int i = k; i &amp;gt;= 0; i -= lowbit(i + 1)) {
            res += bit[i];
        }
        return res;
    }
};

class PointUpdateRangeQueryExectuor {
private:
    int n;
    BinaryIndexedTree tree;

    long prefixSum(int r) {
        if (r &amp;lt; 0) return 0;
        return tree.sum(r);
    }

public:
    PointUpdateRangeQueryExectuor(int n) : n(n), tree(n) {}
    PointUpdateRangeQueryExectuor(const vector&amp;lt;long&amp;gt;&amp;amp; nums) : n(nums.size()), tree(nums) {}

    void update(int i, long delta) {
        assert(i &amp;gt;= 0 &amp;amp;&amp;amp; i &amp;lt; n);
        tree.add(i, delta);
    }

    long rangeSum(int l, int r) {
        assert(l &amp;lt;= r &amp;amp;&amp;amp; l &amp;gt;= 0 &amp;amp;&amp;amp; r &amp;lt; n);
        return prefixSum(r) - prefixSum(l - 1);
    }
};

class RangeUpdatePointQueryExecutor {
private:
    int n;
    BinaryIndexedTree tree;

    // Tear array into pieces
    static vector&amp;lt;long&amp;gt; rangePieces(const vector&amp;lt;long&amp;gt;&amp;amp; nums) {
        int n = nums.size();
        vector&amp;lt;long&amp;gt; res(n);
        // make sure that prefix_sum(res, i) = nums[i]
        if (n != 0) res[0] = nums[0];
        for (int i = 1; i &amp;lt; n; ++i) {
            res[i] = nums[i] - nums[i - 1];
        }
        return res;
    }

    friend class RangeUpdateRangeQueryExecutor;

public:
    RangeUpdatePointQueryExecutor(int n) : n(n), tree(n) {}

    RangeUpdatePointQueryExecutor(const vector&amp;lt;long&amp;gt;&amp;amp; nums)
        : n(nums.size()), tree(rangePieces(nums)) {}

    void update(int l, int r, long delta) {
        assert(l &amp;lt;= r &amp;amp;&amp;amp; l &amp;gt;= 0 &amp;amp;&amp;amp; r &amp;lt; n);
        tree.add(l, delta);
        if (r + 1 &amp;lt; n) tree.add(r + 1, -delta);
    }

    long get(int i) {
        assert(i &amp;gt;= 0 &amp;amp;&amp;amp; i &amp;lt; n);
        return tree.sum(i);
    }
};

class RangeUpdateRangeQueryExecutor {
private:
    long n;
    BinaryIndexedTree tree;
    BinaryIndexedTree tree2;

    static vector&amp;lt;long&amp;gt; prefixPieces(const vector&amp;lt;long&amp;gt;&amp;amp; nums) {
        int n = nums.size();
        vector&amp;lt;long&amp;gt; res(n);
        // make sure that nums[i] * i - res[i] = prefix_sum(nums, i),
        // so that the following prefixSum works.
        // Then run rangePieces, so that we get res[i] = (nums[i] - nums[i - 1]) * (i - 1);
        if (n != 0) res[0] = -nums[0];  // i = 0 case: nums[-1] is treated as 0
        for (long i = 1; i &amp;lt; n; ++i) {
            res[i] = (nums[i] - nums[i - 1]) * (i - 1);
        }
        return res;
    }

    long prefixSum(long r) {
        if (r &amp;lt; 0) return 0;
        return tree.sum(r) * r - tree2.sum(r);
    }

    static constexpr auto rangePieces = RangeUpdatePointQueryExecutor::rangePieces;

public:
    RangeUpdateRangeQueryExecutor(long n) : n(n), tree(n), tree2(n) {}

    RangeUpdateRangeQueryExecutor(const vector&amp;lt;long&amp;gt;&amp;amp; nums)
        : n(nums.size()), tree(rangePieces(nums)), tree2(prefixPieces(nums)) {}

    void update(long l, long r, long delta) {
        assert(l &amp;lt;= r &amp;amp;&amp;amp; l &amp;gt;= 0 &amp;amp;&amp;amp; r &amp;lt; n);
        tree.add(l, delta);
        if (r + 1 &amp;lt; n) tree.add(r + 1, -delta);
        tree2.add(l, delta * (l - 1));
        if (r + 1 &amp;lt; n) tree2.add(r + 1, -delta * r);
    }

    long rangeSum(long l, long r) {
        assert(l &amp;lt;= r &amp;amp;&amp;amp; l &amp;gt;= 0 &amp;amp;&amp;amp; r &amp;lt; n);
        return prefixSum(r) - prefixSum(l - 1);
    }
};

int main() {
    // point update range query
    PointUpdateRangeQueryExectuor purq(5);
    purq.update(0, 2);
    purq.update(3, 3);
    purq.update(4, 5);
    cout &amp;lt;&amp;lt; purq.rangeSum(0, 1) &amp;lt;&amp;lt; endl;  // 2
    cout &amp;lt;&amp;lt; purq.rangeSum(2, 3) &amp;lt;&amp;lt; endl;  // 3
    cout &amp;lt;&amp;lt; purq.rangeSum(3, 4) &amp;lt;&amp;lt; endl;  // 8

    PointUpdateRangeQueryExectuor purq2({2, 1, 2, 3, 5});
    cout &amp;lt;&amp;lt; purq2.rangeSum(0, 1) &amp;lt;&amp;lt; endl;  // 3
    cout &amp;lt;&amp;lt; purq2.rangeSum(2, 3) &amp;lt;&amp;lt; endl;  // 5
    cout &amp;lt;&amp;lt; purq2.rangeSum(3, 4) &amp;lt;&amp;lt; endl;  // 8

    // range update point query
    RangeUpdatePointQueryExecutor rupq(5);
    rupq.update(0, 4, 2);
    rupq.update(3, 4, 3);
    cout &amp;lt;&amp;lt; rupq.get(0) &amp;lt;&amp;lt; endl;  // 2
    cout &amp;lt;&amp;lt; rupq.get(3) &amp;lt;&amp;lt; endl;  // 5

    RangeUpdatePointQueryExecutor rupq2({11, 3, 2, 6, 5});
    cout &amp;lt;&amp;lt; rupq2.get(0) &amp;lt;&amp;lt; endl;  // 11
    cout &amp;lt;&amp;lt; rupq2.get(3) &amp;lt;&amp;lt; endl;  // 6

    // range update range query
    RangeUpdateRangeQueryExecutor rurq(5);
    rurq.update(0, 4, 2);
    rurq.update(3, 4, 3);
    cout &amp;lt;&amp;lt; rurq.rangeSum(2, 4) &amp;lt;&amp;lt; endl;  // 12

    RangeUpdateRangeQueryExecutor rurq2({2, 2, 3, 6, 5});
    cout &amp;lt;&amp;lt; rurq2.rangeSum(2, 4) &amp;lt;&amp;lt; endl;  // 14

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Segment Tree</title><link>https://blog.crazyark.xyz/zh/posts/segment_tree/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/segment_tree/</guid><pubDate>Fri, 08 Sep 2017 04:58:11 GMT</pubDate><content:encoded>&lt;p&gt;This post is a translated version of the Segment Tree article on WCIPEG. Aside from removing a few sections, the structure follows the original exactly.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;segment tree&lt;/strong&gt; is a very flexible data structure that helps us efficiently perform range queries or modifications on an underlying array. As the name suggests, a segment tree can be thought of as a tree formed by intervals of the underlying array, and its fundamental idea is divide and conquer.&lt;/p&gt;
&lt;h3&gt;Motivation&lt;/h3&gt;
&lt;p&gt;One of the most common applications of segment trees is the Range Minimum Query (RMQ) problem: given an array, repeatedly query the minimum value within a given index range. For example, given the array [9, 2, 6, 3, 1, 5, 0, 7], querying the minimum from the 3rd to the 6th element yields min(6, 3, 1, 5) = 1; querying from the 1st to the 3rd element yields 2; and so on for a series of queries. Many articles discuss this problem and its various solutions. Among them, the segment tree is often the most suitable choice, especially when queries and modifications are interleaved. For simplicity, in the following sections &quot;segment tree&quot; refers to the one used for answering range minimum queries unless stated otherwise; other kinds of segment trees are discussed later in this article.&lt;/p&gt;
&lt;h4&gt;The divide-and-conquer solution&lt;/h4&gt;
&lt;p&gt;Divide and conquer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the range contains only one element, then that element is obviously the minimum.&lt;/li&gt;
&lt;li&gt;Otherwise, split the range into two roughly equal halves, find their respective minimums, and the original minimum is the smaller of the two.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, let $a_i$ denote the i-th element of the array. The minimum query can be expressed as the following recursive function:&lt;/p&gt;
&lt;p&gt;$f(x,y) = \begin{cases} a_x &amp;amp; \textrm{if}\ x = y \\ \min(f(x, \lfloor\frac{x+y}{2}\rfloor), f(\lfloor\frac{x+y}{2}\rfloor + 1, y)) &amp;amp; \textrm{otherwise}\end{cases}, \quad x \le y$&lt;/p&gt;
&lt;p&gt;Thus, for example, the first query from the previous section, $f(3,6)$, would be recursively expressed as $\min(f(3,4),f(5,6))$.&lt;/p&gt;
&lt;h3&gt;Structure&lt;/h3&gt;
&lt;p&gt;Suppose we use the function defined above to compute $f(1,N)$, where $N$ is the number of elements in the array. When $N$ is large, this recursive computation has two sub-computations: $f(1, \lfloor\frac{1+N}{2}\rfloor)$ and $f(\lfloor\frac{1+N}{2}\rfloor + 1, N)$. Each sub-computation in turn has two sub-computations, and so on, until reaching the base case. If we represent this recursive computation process as a tree, $f(1, N)$ is the root node with two children, each of which may also have two children, and so on. The base cases are the leaf nodes of this tree. Now we can describe the structure of a segment tree:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A binary tree used to represent the underlying array&lt;/li&gt;
&lt;li&gt;Each node represents a certain interval of the array and contains some function value over that interval&lt;/li&gt;
&lt;li&gt;The root node represents the entire array (i.e., the interval [1, N])&lt;/li&gt;
&lt;li&gt;Each leaf node represents a specific element in the array&lt;/li&gt;
&lt;li&gt;Each non-leaf node has two children whose intervals are disjoint and whose union equals the parent node&apos;s interval&lt;/li&gt;
&lt;li&gt;Each child&apos;s interval is approximately half the size of its parent&apos;s interval&lt;/li&gt;
&lt;li&gt;Each non-leaf node stores a value that is not only the function value of the interval it represents, but also a function of its children&apos;s stored values (i.e., corresponding to the recursive process)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, the structure of a segment tree is identical to the process of recursively computing $f(1,N)$.&lt;/p&gt;
&lt;p&gt;Therefore, for example, the root node of the array [9,2,6,3,1,5,0,7] contains the number 0 -- the minimum of the entire array. Its left child contains the minimum of [9,2,6,3], which is 2; its right child contains the minimum of [1,5,0,7], which is 0. Each element corresponds to a leaf node containing only itself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_92631507.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Operations&lt;/h3&gt;
&lt;p&gt;A segment tree has three basic operations: construction, update, and query.&lt;/p&gt;
&lt;h4&gt;Construction&lt;/h4&gt;
&lt;p&gt;To perform queries and updates, we first need to build a segment tree representing a given array. We can build it either bottom-up or top-down. Top-down construction is a recursive process: attempt to fill a node; if it is a leaf node, fill it directly with the corresponding value; otherwise, first fill its two children, then fill the node with the smaller of the children&apos;s values. Bottom-up construction is left as an exercise. The difference in running speed between these two approaches is almost negligible.&lt;/p&gt;
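&lt;p&gt;The bottom-up construction left as an exercise can be sketched as follows -- a minimal illustration of ours, not the article&apos;s reference code. It assumes N is a power of two and uses the 1-indexed array layout described in the Analysis section (root at index 1, children of node i at indices 2i and 2i+1), so the leaves occupy indices N through 2N-1; the function name build_bottom_up is our own.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;vector&amp;gt;

// Bottom-up construction of a range-minimum segment tree.
// Assumes a.size() is a power of two; A[0] is unused.
std::vector&amp;lt;int&amp;gt; build_bottom_up(const std::vector&amp;lt;int&amp;gt;&amp;amp; a) {
    int N = (int)a.size();
    std::vector&amp;lt;int&amp;gt; A(2 * N);
    for (int i = 0; i &amp;lt; N; i++)        // leaves occupy indices N .. 2N-1
        A[N + i] = a[i];
    for (int i = N - 1; i &amp;gt;= 1; i--)   // every parent is filled after its children
        A[i] = std::min(A[2 * i], A[2 * i + 1]);
    return A;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For [9,2,6,3,1,5,8,7] this produces exactly the stored layout [1,2,1,2,3,1,7,9,2,6,3,1,5,8,7] given in the Analysis section.&lt;/p&gt;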
&lt;h4&gt;Update&lt;/h4&gt;
&lt;p&gt;Updating a segment tree means updating a specific element of the underlying array. We first update the corresponding leaf node -- since a leaf node corresponds to a single array element. The parent of the updated node will also be affected, because its interval contains the modified element. The same applies to the grandparent, all the way up to the root, but &lt;strong&gt;no other nodes are affected&lt;/strong&gt;. For a top-down update, we first update the root node, which causes the relevant one of its two children to be recursively updated. The update of a child node works the same way, with the base case being the update of a leaf node. As the recursion unwinds, each non-leaf node&apos;s value is recomputed as the smaller of its two children&apos;s values. Bottom-up updating is also left as an exercise.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_92631587.png&quot; alt=&quot;0 changed to 8&quot; /&gt;&lt;/p&gt;
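&lt;p&gt;The bottom-up update left as an exercise is equally short -- again a hedged sketch of ours, assuming a power-of-two N and the 1-indexed layout from the Analysis section, where the leaves sit at indices N through 2N-1: write the new value into the leaf, then recompute each node on the path to the root from its two children.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;vector&amp;gt;

// Bottom-up point update: pos is 0-indexed into the underlying array.
// A is the 1-indexed tree array, N the (power-of-two) array size.
void update_bottom_up(std::vector&amp;lt;int&amp;gt;&amp;amp; A, int N, int pos, int val) {
    int i = N + pos;                    // index of the corresponding leaf
    A[i] = val;
    for (i /= 2; i &amp;gt;= 1; i /= 2)       // only the leaf-to-root path changes
        A[i] = std::min(A[2 * i], A[2 * i + 1]);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Changing one element touches only the nodes on its leaf-to-root path, exactly as described above.&lt;/p&gt;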
&lt;h4&gt;Query&lt;/h4&gt;
&lt;p&gt;Querying a segment tree means determining the function value over a certain interval of the underlying array -- in this case, the minimum element value within the interval. The query operation is much more complex than the update operation, so we illustrate with an example. Suppose we want to find the minimum from the 1st to the 6th element, represented as $f(1, 6)$. Each node in the segment tree contains the minimum of some interval: for example, the root contains $f(1, 8)$, its left child $f(1, 4)$, its right child $f(5, 8)$, and so on; each leaf contains $f(x,x)$. No single node is $f(1,6)$, but notice that $f(1, 6) = \min(f(1,4),f(5,6))$, and there exist two nodes containing these two values (highlighted in yellow in the figure below).&lt;/p&gt;
&lt;p&gt;Therefore, when querying a segment tree, we select a subset of all nodes such that the union of the intervals they represent is exactly the interval we want to query. To find these nodes, we start from the root and recursively visit nodes whose intervals have at least some intersection with the query interval. In our example, $f(1, 6)$, we notice that the intervals of both the left and right subtrees intersect the query interval, so both are recursively processed. The left child is a base case (highlighted in yellow) because its interval is entirely contained within the query interval. For the right child, we notice that its left child intersects the query interval, but its right child does not, so we only recurse into its left child. This is also a base case, likewise highlighted in yellow. The recursion terminates, and the final answer is the minimum among the values of all selected nodes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_query_92631587.png&quot; alt=&quot;Query (1, 6)&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;Some important performance metrics of segment trees are as follows:&lt;/p&gt;
&lt;h4&gt;Space&lt;/h4&gt;
&lt;p&gt;It is easy to prove that a node at depth d corresponds to an interval of size at most $\lceil \frac{N}{2^d} \rceil$. We can see that all nodes at depth $\lceil \lg N\rceil$ correspond to at most one element, meaning they are leaf nodes. Therefore, a segment tree is a perfectly balanced binary tree with the theoretically minimum height. Consequently, we can store the tree as an array, with nodes ordered by breadth-first traversal, and for convenience, children ordered left-to-right. For the segment tree of [9,2,6,3,1,5,8,7] above (the array after the update in the previous section), it would be stored as [1,2,1,2,3,1,7,9,2,6,3,1,5,8,7]. Assume the root node&apos;s index is 1. For a non-leaf node with index i, its two children have indices 2i and 2i+1. Note that some space at the leaf level of the tree may be unused, but this is usually acceptable. A binary tree of height $\lceil \lg N\rceil$ has at most $2^{\lfloor\lg N\rfloor + 1} - 1$ nodes, so through simple mathematical analysis, we know that a segment tree storing N elements requires an array of size no more than $4N$. A segment tree uses $\mathcal{O}(N)$ space.&lt;/p&gt;
&lt;h4&gt;Time&lt;/h4&gt;
&lt;h5&gt;Construction&lt;/h5&gt;
&lt;p&gt;For each node, construction requires only a constant number of operations. Since the number of nodes in a segment tree is $\mathcal{O}(N)$, construction takes linear time.&lt;/p&gt;
&lt;h5&gt;Update&lt;/h5&gt;
&lt;p&gt;The update operation updates all nodes on the path from the root to the affected leaf node, with a constant number of operations per node. The number of nodes is bounded by the tree height, so as shown above, the time complexity of an update is $\mathcal{O}(\lg N)$.&lt;/p&gt;
&lt;h5&gt;Query&lt;/h5&gt;
&lt;p&gt;Consider all selected nodes (the yellow nodes in the figure from the previous section). In the case of querying $f(1, N - 1)$, there are about $\lg N$ such nodes. Can there be more? The answer is no. Perhaps the simplest proof is to unroll the node-selection algorithm and observe that it terminates within $\mathcal{O}(\lg N)$ steps, which corresponds to a non-recursive formulation of the query. So a query operation takes $\mathcal{O}(\lg N)$ time. The proof for the recursive version is left as an exercise.&lt;/p&gt;
&lt;h3&gt;Implementation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;object rmq_segtree
     private function build_rec(node,begin,end,a[])
          if begin = end
               A[node] = a[begin];
          else
               let mid = floor((begin+end)/2)
               build_rec(2*node,begin,mid,a[])
               build_rec(2*node+1,mid+1,end,a[])
               A[node] = min(A[2*node],A[2*node+1])
     private function update_rec(node,begin,end,pos,val)
          if begin = end
               A[node] = val
          else
               let mid=floor((begin+end)/2)
               if pos&amp;lt;=mid
                    update_rec(2*node,begin,mid,pos,val)
               else
                    update_rec(2*node+1,mid+1,end,pos,val)
               A[node] = min(A[2*node],A[2*node+1])
     private function query_rec(node,t_begin,t_end,a_begin,a_end)
          if t_begin&amp;gt;=a_begin AND t_end&amp;lt;=a_end
               return A[node]
          else
               let mid = floor((t_begin+t_end)/2)
               let res = ∞
               if mid&amp;gt;=a_begin AND t_begin&amp;lt;=a_end
                    res = min(res,query_rec(2*node,t_begin,mid,a_begin,a_end))
               if t_end&amp;gt;=a_begin AND mid+1&amp;lt;=a_end
                    res = min(res,query_rec(2*node+1,mid+1,t_end,a_begin,a_end))
               return res
     function construct(size,a[1..size])
          let N = size
          let A be an array that can hold at least 4N elements
          build_rec(1,1,N,a)
     function update(pos,val)
          update_rec(1,1,N,pos,val)
     function query(begin,end)
          return query_rec(1,1,N,begin,end)
&lt;/code&gt;&lt;/pre&gt;
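&lt;p&gt;As a hedged cross-check, the pseudocode above can be transliterated into C++ almost line by line; the class name RmqSegtree and the use of INT_MAX as the infinity placeholder are our choices, not part of the original.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;vector&amp;gt;

// 1-indexed everywhere, mirroring the pseudocode; a[0] is a dummy slot.
class RmqSegtree {
    int N;
    std::vector&amp;lt;int&amp;gt; A;

    void build_rec(int node, int begin, int end, const std::vector&amp;lt;int&amp;gt;&amp;amp; a) {
        if (begin == end) {
            A[node] = a[begin];
        } else {
            int mid = (begin + end) / 2;
            build_rec(2 * node, begin, mid, a);
            build_rec(2 * node + 1, mid + 1, end, a);
            A[node] = std::min(A[2 * node], A[2 * node + 1]);
        }
    }
    void update_rec(int node, int begin, int end, int pos, int val) {
        if (begin == end) {
            A[node] = val;
        } else {
            int mid = (begin + end) / 2;
            if (pos &amp;lt;= mid) update_rec(2 * node, begin, mid, pos, val);
            else update_rec(2 * node + 1, mid + 1, end, pos, val);
            A[node] = std::min(A[2 * node], A[2 * node + 1]);
        }
    }
    int query_rec(int node, int t_begin, int t_end, int a_begin, int a_end) {
        if (t_begin &amp;gt;= a_begin &amp;amp;&amp;amp; t_end &amp;lt;= a_end) return A[node];
        int mid = (t_begin + t_end) / 2;
        int res = INT_MAX;                // stands in for the pseudocode's infinity
        if (mid &amp;gt;= a_begin &amp;amp;&amp;amp; t_begin &amp;lt;= a_end)
            res = std::min(res, query_rec(2 * node, t_begin, mid, a_begin, a_end));
        if (t_end &amp;gt;= a_begin &amp;amp;&amp;amp; mid + 1 &amp;lt;= a_end)
            res = std::min(res, query_rec(2 * node + 1, mid + 1, t_end, a_begin, a_end));
        return res;
    }

public:
    explicit RmqSegtree(const std::vector&amp;lt;int&amp;gt;&amp;amp; a)
        : N((int)a.size() - 1), A(4 * N) {
        build_rec(1, 1, N, a);
    }
    void update(int pos, int val) { update_rec(1, 1, N, pos, val); }
    int query(int begin, int end) { return query_rec(1, 1, N, begin, end); }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, on [9,2,6,3,1,5,0,7], query(3,6) returns min(6,3,1,5) = 1 and query(1,3) returns 2; after update(7,8), query(1,8) returns 1.&lt;/p&gt;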
&lt;h3&gt;Variations&lt;/h3&gt;
&lt;p&gt;Segment trees are not limited to range minimum queries; they can be applied to many different function queries. Here are some examples from more challenging competition problems.&lt;/p&gt;
&lt;h4&gt;Maximum&lt;/h4&gt;
&lt;p&gt;Similar to minimum: replace all min operations with max.&lt;/p&gt;
&lt;h4&gt;Sum or product&lt;/h4&gt;
&lt;p&gt;Each non-leaf node&apos;s value is the sum of its children&apos;s values, representing the array sum over its interval. For example, in the pseudocode above, all min operations are replaced with addition. However, in some summation cases, the segment tree is replaced by a Binary Indexed Tree (BIT), because BIT uses less space, runs faster, and is easier to code. Multiplication can be implemented in the same way, simply replacing addition with multiplication.&lt;/p&gt;
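&lt;p&gt;The Binary Indexed Tree mentioned above can be sketched in a few lines -- a minimal, hedged sketch of a standard Fenwick tree for prefix sums, where the class name Fenwick is ours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;cassert&amp;gt;
#include &amp;lt;vector&amp;gt;

// Minimal 1-indexed Fenwick tree (Binary Indexed Tree) for prefix sums.
class Fenwick {
    std::vector&amp;lt;long long&amp;gt; t;
public:
    explicit Fenwick(int n) : t(n + 1, 0) {}
    void add(int pos, long long delta) {             // a[pos] += delta
        for (int i = pos; i &amp;lt; (int)t.size(); i += i &amp;amp; -i) t[i] += delta;
    }
    long long prefix(int pos) const {                // sum of a[1..pos]
        long long s = 0;
        for (int i = pos; i &amp;gt; 0; i -= i &amp;amp; -i) s += t[i];
        return s;
    }
    long long range(int l, int r) const { return prefix(r) - prefix(l - 1); }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both add and prefix touch $\mathcal{O}(\lg N)$ entries and the whole structure is a single array of N+1 words -- the space and coding-size advantage the text alludes to.&lt;/p&gt;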
&lt;h4&gt;Maximum/minimum prefix suffix sum&lt;/h4&gt;
&lt;p&gt;An interval prefix consists of the first k elements of the interval (k can be 0); similarly, a suffix consists of the last k elements. The maximum prefix sum of an interval is the largest sum among all of its prefixes (the sum of an empty prefix is 0); the minimum prefix sum is defined analogously. We want to efficiently query the maximum prefix sum of a given interval. For example, the maximum prefix sum of [1,-2,3,-4] is 2, corresponding to the prefix [1,-2,3].&lt;/p&gt;
&lt;p&gt;To solve this problem with a segment tree, each node stores two function values for its corresponding interval: the maximum prefix sum and the interval sum. A non-leaf node&apos;s interval sum is the sum of its two children&apos;s interval sums. To find a non-leaf node&apos;s maximum prefix sum, we note that the last element of the prefix corresponding to the maximum prefix sum is either in the left child&apos;s interval or in the right child&apos;s interval. In the former case, we can directly obtain the maximum prefix sum from the left child. In the latter case, we add the left child&apos;s interval sum and the right child&apos;s maximum prefix sum. The larger of these two values is our maximum prefix sum. Querying is similar, but there may be more than two adjacent intervals (in which case the last element of the maximum prefix sum could be in any of them).&lt;/p&gt;
&lt;h4&gt;Maximum/minimum subvector sum&lt;/h4&gt;
&lt;p&gt;This problem queries the maximum sum among all subintervals of a given interval. It is similar to the maximum prefix sum problem from the previous section, but the first element of the subinterval does not necessarily start at the beginning of the interval. (The maximum subinterval sum of [1,-2,3,-4] is 3, corresponding to the subinterval [3].) Each node needs to store 4 pieces of information: maximum prefix sum, maximum suffix sum, maximum subinterval sum, and interval sum. The detailed design is left as an exercise.&lt;/p&gt;
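&lt;p&gt;A hedged sketch of the per-node bookkeeping for these two variants: each node keeps the interval sum, maximum prefix sum, maximum suffix sum, and maximum subinterval sum, and a combine function merges two adjacent intervals exactly as described above. The names Node, leaf, and combine are ours; empty prefixes and subintervals are allowed, matching the convention that the empty sum is 0.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cassert&amp;gt;

struct Node {
    long long sum;     // sum over the node's interval
    long long pre;     // maximum prefix sum (empty prefix allowed, so &amp;gt;= 0)
    long long suf;     // maximum suffix sum
    long long best;    // maximum subinterval sum
};

Node leaf(long long v) {
    long long p = std::max(0LL, v);
    return {v, p, p, p};
}

Node combine(const Node&amp;amp; l, const Node&amp;amp; r) {
    Node n;
    n.sum = l.sum + r.sum;
    // The best prefix either stays inside the left child or spans it entirely.
    n.pre = std::max(l.pre, l.sum + r.pre);
    n.suf = std::max(r.suf, r.sum + l.suf);
    // The best subinterval lies in one child or crosses the boundary.
    n.best = std::max({l.best, r.best, l.suf + r.pre});
    return n;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Folding [1,-2,3,-4] with combine yields a maximum prefix sum of 2 and a maximum subinterval sum of 3, matching the examples above.&lt;/p&gt;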
&lt;h4&gt;&quot;Stabbing query&quot;&lt;/h4&gt;
&lt;p&gt;For more details, please visit the original page.&lt;/p&gt;
&lt;h3&gt;Extension to two or more dimensions&lt;/h3&gt;
&lt;p&gt;Segment trees are not limited to solving one-dimensional array problems. In principle, they can be used for arrays of arbitrary dimensions, with intervals replaced by boxes. Therefore, in a two-dimensional array, we can query the minimum element within a box, i.e., the Cartesian product of two intervals.&lt;/p&gt;
&lt;p&gt;For more details, please visit the original page.&lt;/p&gt;
&lt;h3&gt;Lazy propagation&lt;/h3&gt;
&lt;p&gt;Certain types of segment trees support range updates. For example, consider a variant of the range minimum problem: we need to be able to update all elements within an interval to a specific value. This is called a range update. Lazy propagation is a technique that enables range updates to be completed in $\mathcal{O}(\lg N)$ time, just like single element updates.&lt;/p&gt;
&lt;p&gt;It works as follows: each node has an additional lazy field for temporary storage. When this field is unused, its value is set to $+\infty$. When updating an interval, we select the same set of intervals as in a query. If the interval is not a leaf node, update its lazy field to the new minimum value (if the new value is smaller). Otherwise, directly update the value on the node. When a query or update operation needs to access a node&apos;s descendants, and that node&apos;s lazy field has a value, we push the lazy field&apos;s value down to its two children: update the two children&apos;s lazy fields with the parent&apos;s lazy field value, then reset the parent&apos;s lazy field to $+\infty$. However, if we only want the node&apos;s value without accessing any of its children, and the lazy field has a value, we can directly return the lazy value as the interval minimum.&lt;/p&gt;
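&lt;p&gt;The scheme just described can be sketched as follows, reusing the conventions of the appendix code below (0-indexed, half-open intervals, children of node i at 2i+1 and 2i+2), with INT_MAX as the unused-lazy sentinel. This is one possible realization of ours, not the article&apos;s reference implementation, and here a later pending assignment simply overwrites an earlier one; the class name AssignMinTree is our own.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cassert&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;vector&amp;gt;

// Range-assign / range-min segment tree with lazy propagation.
// INT_MAX in lazy[i] marks an unused lazy field ("+infinity" in the text).
class AssignMinTree {
    int n;
    std::vector&amp;lt;int&amp;gt; mn, lazy;

    void push(int i, int tl, int tr) {
        if (lazy[i] == INT_MAX) return;
        mn[i] = lazy[i];                  // a pending assignment fixes the minimum
        if (tl != tr - 1) {               // pass it on to the children lazily
            lazy[2 * i + 1] = lazy[i];
            lazy[2 * i + 2] = lazy[i];
        }
        lazy[i] = INT_MAX;
    }
    void assign(int i, int tl, int tr, int l, int r, int val) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) return;
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) {
            lazy[i] = val;                // covered: record and apply immediately
            push(i, tl, tr);
            return;
        }
        int tm = (tl + tr) / 2;
        assign(2 * i + 1, tl, tm, l, r, val);
        assign(2 * i + 2, tm, tr, l, r, val);
        mn[i] = std::min(mn[2 * i + 1], mn[2 * i + 2]);
    }
    int get(int i, int tl, int tr, int l, int r) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) return INT_MAX;   // neutral element for min
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) return mn[i];
        int tm = (tl + tr) / 2;
        return std::min(get(2 * i + 1, tl, tm, l, r),
                        get(2 * i + 2, tm, tr, l, r));
    }

public:
    explicit AssignMinTree(int k) : n(1) {
        while (n &amp;lt; k) n *= 2;            // round the size up to a power of two
        mn.assign(2 * n, 0);
        lazy.assign(2 * n, INT_MAX);
    }
    void assign(int l, int r, int val) { assign(0, 0, n, l, r, val); }
    int get(int l, int r) { return get(0, 0, n, l, r); }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pushing on the way down keeps the invariant that a node&apos;s lazy value is always newer than any lazy value in its subtree, so overwriting the children&apos;s lazy fields in push is safe.&lt;/p&gt;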
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] https://wcipeg.com/wiki/Segment_tree&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;C++ Maximum Query, Range Update, Lazy Propagation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;class SegmentTree {
private:
    int n;
    vector&amp;lt;int&amp;gt; max_val, to_add;

    void push(int i, int tl, int tr) {
        max_val[i] += to_add[i];
        if (tl != tr - 1) {
            to_add[2 * i + 1] += to_add[i];
            to_add[2 * i + 2] += to_add[i];
        }
        to_add[i] = 0;
    }

    void add(int i, int tl, int tr, int l, int r, int delta) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) {
            return;
        }
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) {
            to_add[i] += delta;
            push(i, tl, tr);
            return;
        }
        int tm = (tl + tr) / 2;
        add(2 * i + 1, tl, tm, l, r, delta);
        add(2 * i + 2, tm, tr, l, r, delta);
        max_val[i] = max(max_val[2 * i + 1], max_val[2 * i + 2]);
    }

    int get(int i, int tl, int tr, int l, int r) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) {
            return 0;
        }
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) {
            return max_val[i];
        }
        int tm = (tl + tr) / 2;
        return max(get(2 * i + 1, tl, tm, l, r), get(2 * i + 2, tm, tr, l, r));
    }

public:
    SegmentTree(int k) {
        n = 1;
        while (n &amp;lt; k) {
            n *= 2;
        }
        max_val = vector&amp;lt;int&amp;gt;(2 * n, 0);
        to_add = vector&amp;lt;int&amp;gt;(2 * n, 0);
    }

    void add(int l, int r, int delta) { add(0, 0, n, l, r, delta); }

    int get(int l, int r) { return get(0, 0, n, l, r); }
};
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Segment Tree (线段树)</title><link>https://blog.crazyark.xyz/zh/posts/segment_tree/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/segment_tree/</guid><pubDate>Fri, 08 Sep 2017 04:58:11 GMT</pubDate><content:encoded>&lt;p&gt;This post is a translation of the WCIPEG article on Segment Tree. Apart from a few omitted sections, the structure follows the original exactly.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;segment tree&lt;/strong&gt; is a highly flexible data structure that lets us efficiently perform range queries and updates on an underlying array. As the name suggests, a segment tree can be pictured as a tree built from intervals of the underlying array, and its basic idea is divide and conquer.&lt;/p&gt;
&lt;h3&gt;Motivation&lt;/h3&gt;
&lt;p&gt;One of the most common applications of segment trees is the range minimum query problem: given an array, repeatedly query the minimum value within some given index range. For example, given the array [9, 2, 6, 3, 1, 5, 0, 7], a query for the minimum between the 3rd and 6th elements yields min(6, 3, 1, 5) = 1. A following query for the 1st to the 3rd yields 2... and so on for a whole series of queries. Many articles have explored this problem and proposed a variety of solutions. Among them, the segment tree is often the most suitable choice, especially when queries and updates are interleaved. For simplicity, the next few sections focus on the particular segment tree used to answer range minimum queries without restating this each time; other kinds of segment trees are discussed later in this post.&lt;/p&gt;
&lt;h4&gt;The divide-and-conquer solution&lt;/h4&gt;
&lt;p&gt;Divide and conquer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the range contains only one element, that element itself is obviously the minimum&lt;/li&gt;
&lt;li&gt;Otherwise, split the range into two roughly equal halves, find their respective minimums, and the original minimum is the smaller of the two.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hence, letting $a_i$ denote the i-th element of the array, the minimum query can be expressed as the following recursive function:&lt;/p&gt;
&lt;p&gt;$f(x,y) = \begin{cases} a_x &amp; \textrm{if}\ x = y\\ \min(f(x, \lfloor\frac{x+y}{2}\rfloor), f(\lfloor\frac{x+y}{2}\rfloor + 1, y)) &amp; \textrm{otherwise}\end{cases}, x \le y$&lt;/p&gt;
&lt;p&gt;Thus, for example, the first query from the previous section, $f(3,6)$, is recursively expressed as $\min(f(3,4),f(5,6))$.&lt;/p&gt;
&lt;h3&gt;Structure&lt;/h3&gt;
&lt;p&gt;Suppose we use the function defined above to compute $f(1,N)$, where $N$ is the number of elements in the array. When $N$ is large, this recursive computation has two sub-computations: $f(1, \lfloor\frac{1+N}{2}\rfloor)$ and $f(\lfloor\frac{1+N}{2}\rfloor + 1, N)$. Each sub-computation in turn has two sub-computations, and so on, until reaching the base case. If we represent this recursive computation process as a tree, $f(1, N)$ is the root node with two children, each of which may also have two children, and so on. The base cases are the leaf nodes of this tree. Now we can describe the structure of a segment tree:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A binary tree used to represent the underlying array&lt;/li&gt;
&lt;li&gt;Each node represents a certain interval of the array and contains some function value over that interval&lt;/li&gt;
&lt;li&gt;The root node represents the entire array (i.e., the interval [1, N])&lt;/li&gt;
&lt;li&gt;Each leaf node represents a specific element in the array&lt;/li&gt;
&lt;li&gt;Each non-leaf node has two children whose intervals are disjoint and whose union equals the parent node&apos;s interval&lt;/li&gt;
&lt;li&gt;Each child&apos;s interval is approximately half the size of its parent&apos;s interval&lt;/li&gt;
&lt;li&gt;Each non-leaf node stores a value that is not only the function value of the interval it represents, but also a function of its children&apos;s stored values (i.e., corresponding to the recursive process)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, the structure of a segment tree is identical to the process of recursively computing $f(1,N)$.&lt;/p&gt;
&lt;p&gt;Therefore, for example, the root node of the array [9,2,6,3,1,5,0,7] contains the number 0 -- the minimum of the entire array. Its left child contains the minimum of [9,2,6,3], which is 2; its right child contains the minimum of [1,5,0,7], which is 0. Each element corresponds to a leaf node containing only itself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_92631507.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Operations&lt;/h3&gt;
&lt;p&gt;A segment tree has three basic operations: construction, update, and query.&lt;/p&gt;
&lt;h4&gt;Construction&lt;/h4&gt;
&lt;p&gt;To perform queries and updates, we first need to build a segment tree representing a given array. We can build it either bottom-up or top-down. Top-down construction is a recursive process: attempt to fill a node; if it is a leaf node, fill it directly with the corresponding value; otherwise, first fill its two children, then fill the node with the smaller of the children&apos;s values. Bottom-up construction is left as an exercise. The difference in running speed between the two approaches is almost negligible.&lt;/p&gt;
&lt;h4&gt;Update&lt;/h4&gt;
&lt;p&gt;Updating a segment tree means updating a specific element of the underlying array. We first update the corresponding leaf node -- since a leaf node corresponds to a single array element. The parent of the updated node will also be affected, because its interval contains the modified element. The same applies to the grandparent, all the way up to the root, but &lt;strong&gt;no other nodes are affected&lt;/strong&gt;. For a top-down update, we first update the root node, which causes the relevant one of its two children to be recursively updated. The update of a child node works the same way, with the base case being the update of a leaf node. As the recursion unwinds, each non-leaf node&apos;s value is recomputed as the smaller of its two children&apos;s values. Bottom-up updating is also left as an exercise.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_92631587.png&quot; alt=&quot;0 changed to 8&quot; /&gt;&lt;/p&gt;
&lt;h4&gt;Query&lt;/h4&gt;
&lt;p&gt;Querying a segment tree means determining the function value over a certain interval of the underlying array -- in this case, the minimum element value within the interval. The query operation is much more complex than the update operation, so we illustrate with an example. Suppose we want to find the minimum from the 1st to the 6th element, represented as $f(1, 6)$. Each node in the segment tree contains the minimum of some interval: for example, the root contains $f(1, 8)$, its left child $f(1, 4)$, its right child $f(5, 8)$, and so on; each leaf contains $f(x,x)$. No single node is $f(1,6)$, but notice that $f(1, 6) = \min(f(1,4),f(5,6))$, and there exist two nodes containing these two values (highlighted in yellow in the figure below).&lt;/p&gt;
&lt;p&gt;Therefore, when querying a segment tree, we select a subset of all nodes such that the union of the intervals they represent is exactly the interval we want to query. To find these nodes, we start from the root and recursively visit nodes whose intervals have at least some intersection with the query interval. In our example, $f(1, 6)$, we notice that the intervals of both the left and right subtrees intersect the query interval, so both are recursively processed. The left child is a base case (highlighted in yellow) because its interval is entirely contained within the query interval. For the right child, we notice that its left child intersects the query interval, but its right child does not, so we only recurse into its left child. This is also a base case, likewise highlighted in yellow. The recursion terminates, and the final answer is the minimum among the values of all selected nodes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/Segtree_query_92631587.png&quot; alt=&quot;Query (1, 6)&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;Some important performance metrics of segment trees are as follows:&lt;/p&gt;
&lt;h4&gt;Space&lt;/h4&gt;
&lt;p&gt;It is easy to prove that a node at depth d corresponds to an interval of size at most $\lceil \frac{N}{2^d} \rceil$. We can see that all nodes at depth $\lceil \lg N\rceil$ correspond to at most one element, meaning they are leaf nodes. Therefore, a segment tree is a perfectly balanced binary tree with the theoretically minimum height. Consequently, we can store the tree as an array, with nodes ordered by breadth-first traversal, and for convenience, children ordered left-to-right. For the segment tree of [9,2,6,3,1,5,8,7] above (the array after the update in the previous section), it would be stored as [1,2,1,2,3,1,7,9,2,6,3,1,5,8,7]. Assume the root node&apos;s index is 1. For a non-leaf node with index i, its two children have indices 2i and 2i+1. Note that some space at the leaf level of the tree may be unused, but this is usually acceptable. A binary tree of height $\lceil \lg N\rceil$ has at most $2^{\lfloor\lg N\rfloor + 1} - 1$ nodes, so through simple mathematical analysis, we know that a segment tree storing N elements requires an array of size no more than $4N$. A segment tree uses $\mathcal{O}(N)$ space.&lt;/p&gt;
&lt;h4&gt;Time&lt;/h4&gt;
&lt;h5&gt;Construction&lt;/h5&gt;
&lt;p&gt;For each node, construction requires only a constant number of operations. Since the number of nodes in a segment tree is $\mathcal{O}(N)$, construction takes linear time.&lt;/p&gt;
&lt;h5&gt;Update&lt;/h5&gt;
&lt;p&gt;The update operation updates all nodes on the path from the root to the affected leaf node, with a constant number of operations per node. The number of nodes is bounded by the tree height, so as shown above, the time complexity of an update is $\mathcal{O}(\lg N)$.&lt;/p&gt;
&lt;h5&gt;Query&lt;/h5&gt;
&lt;p&gt;Consider all selected nodes (the yellow nodes in the figure from the previous section). In the case of querying $f(1, N - 1)$, there are about $\lg N$ such nodes. Can there be more? The answer is no. Perhaps the simplest proof is to unroll the node-selection algorithm and observe that it terminates within $\mathcal{O}(\lg N)$ steps, which corresponds to a non-recursive formulation of the query. So a query operation takes $\mathcal{O}(\lg N)$ time. The proof for the recursive version is left as an exercise.&lt;/p&gt;
&lt;h3&gt;Implementation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;object rmq_segtree
     private function build_rec(node,begin,end,a[])
          if begin = end
               A[node] = a[begin];
          else
               let mid = floor((begin+end)/2)
               build_rec(2*node,begin,mid,a[])
               build_rec(2*node+1,mid+1,end,a[])
               A[node] = min(A[2*node],A[2*node+1])
     private function update_rec(node,begin,end,pos,val)
          if begin = end
               A[node] = val
          else
               let mid=floor((begin+end)/2)
               if pos&amp;lt;=mid
                    update_rec(2*node,begin,mid,pos,val)
               else
                    update_rec(2*node+1,mid+1,end,pos,val)
               A[node] = min(A[2*node],A[2*node+1])
     private function query_rec(node,t_begin,t_end,a_begin,a_end)
          if t_begin&amp;gt;=a_begin AND t_end&amp;lt;=a_end
               return A[node]
          else
               let mid = floor((t_begin+t_end)/2)
               let res = ∞
               if mid&amp;gt;=a_begin AND t_begin&amp;lt;=a_end
                    res = min(res,query_rec(2*node,t_begin,mid,a_begin,a_end))
               if t_end&amp;gt;=a_begin AND mid+1&amp;lt;=a_end
                    res = min(res,query_rec(2*node+1,mid+1,t_end,a_begin,a_end))
               return res
     function construct(size,a[1..size])
          let N = size
          let A be an array that can hold at least 4N elements
          build_rec(1,1,N,a)
     function update(pos,val)
          update_rec(1,1,N,pos,val)
     function query(begin,end)
          return query_rec(1,1,N,begin,end)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variations&lt;/h3&gt;
&lt;p&gt;Segment trees are not limited to range minimum queries; they can be applied to many different function queries. Here are some examples from more challenging competition problems.&lt;/p&gt;
&lt;h4&gt;Maximum&lt;/h4&gt;
&lt;p&gt;Similar to minimum: replace all min operations with max.&lt;/p&gt;
&lt;h4&gt;Sum or product&lt;/h4&gt;
&lt;p&gt;Each non-leaf node&apos;s value is the sum of its children&apos;s values, representing the array sum over its interval. For example, in the pseudocode above, all min operations are replaced with addition. However, in some summation cases, the segment tree is replaced by a Binary Indexed Tree (BIT), because BIT uses less space, runs faster, and is easier to code. Multiplication can be implemented in the same way, simply replacing addition with multiplication.&lt;/p&gt;
&lt;h4&gt;Maximum/minimum prefix suffix sum&lt;/h4&gt;
&lt;p&gt;An interval prefix consists of the first k elements of the interval (k can be 0); similarly, a suffix consists of the last k elements. The maximum prefix sum of an interval is the largest sum among all of its prefixes (the sum of an empty prefix is 0); the minimum prefix sum is defined analogously. We want to efficiently query the maximum prefix sum of a given interval. For example, the maximum prefix sum of [1,-2,3,-4] is 2, corresponding to the prefix [1,-2,3].&lt;/p&gt;
&lt;p&gt;To solve this problem with a segment tree, each node stores two function values for its corresponding interval: the maximum prefix sum and the interval sum. A non-leaf node&apos;s interval sum is the sum of its two children&apos;s interval sums. To find a non-leaf node&apos;s maximum prefix sum, we note that the last element of the prefix corresponding to the maximum prefix sum is either in the left child&apos;s interval or in the right child&apos;s interval. In the former case, we can directly obtain the maximum prefix sum from the left child. In the latter case, we add the left child&apos;s interval sum and the right child&apos;s maximum prefix sum. The larger of these two values is our maximum prefix sum. Querying is similar, but there may be more than two adjacent intervals (in which case the last element of the maximum prefix sum could be in any of them).&lt;/p&gt;
&lt;h4&gt;Maximum/minimum subvector sum&lt;/h4&gt;
&lt;p&gt;This problem queries the maximum sum among all subintervals of a given interval. It is similar to the maximum prefix sum problem from the previous section, but the first element of the subinterval does not necessarily start at the beginning of the interval. (The maximum subinterval sum of [1,-2,3,-4] is 3, corresponding to the subinterval [3].) Each node needs to store 4 pieces of information: maximum prefix sum, maximum suffix sum, maximum subinterval sum, and interval sum. The detailed design is left as an exercise.&lt;/p&gt;
&lt;h4&gt;&quot;Stabbing query&quot;&lt;/h4&gt;
&lt;p&gt;For more details, please visit the original page. 🍺&lt;/p&gt;
&lt;h3&gt;Extension to two or more dimensions&lt;/h3&gt;
&lt;p&gt;Segment trees are not limited to solving one-dimensional array problems. In principle, they can be used for arrays of arbitrary dimensions, with intervals replaced by boxes. Therefore, in a two-dimensional array, we can query the minimum element within a box, i.e., the Cartesian product of two intervals.&lt;/p&gt;
&lt;p&gt;For more details, please visit the original page. 🍺&lt;/p&gt;
&lt;h3&gt;Lazy propagation&lt;/h3&gt;
&lt;p&gt;Certain types of segment trees support range updates. For example, consider a variant of the range minimum problem: we need to be able to update all elements within an interval to a specific value. This is called a range update. Lazy propagation is a technique that enables range updates to be completed in $\mathcal{O}(\lg N)$ time, just like single element updates.&lt;/p&gt;
&lt;p&gt;It works as follows: each node has an additional lazy field for temporary storage. When this field is unused, its value is set to $+\infty$. When updating an interval, we select the same set of intervals as in a query. If the interval is not a leaf node, update its lazy field to the new minimum value (if the new value is smaller). Otherwise, directly update the value on the node. When a query or update operation needs to access a node&apos;s descendants, and that node&apos;s lazy field has a value, we push the lazy field&apos;s value down to its two children: update the two children&apos;s lazy fields with the parent&apos;s lazy field value, then reset the parent&apos;s lazy field to $+\infty$. However, if we only want the node&apos;s value without accessing any of its children, and the lazy field has a value, we can directly return the lazy value as the interval minimum.&lt;/p&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] https://wcipeg.com/wiki/Segment_tree&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;C++ Maximum Query, Range Update, Lazy Propagation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;class SegmentTree {
private:
    int n;
    vector&amp;lt;int&amp;gt; max_val, to_add;

    void push(int i, int tl, int tr) {
        max_val[i] += to_add[i];
        if (tl != tr - 1) {
            to_add[2 * i + 1] += to_add[i];
            to_add[2 * i + 2] += to_add[i];
        }
        to_add[i] = 0;
    }

    void add(int i, int tl, int tr, int l, int r, int delta) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) {
            return;
        }
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) {
            to_add[i] += delta;
            push(i, tl, tr);
            return;
        }
        int tm = (tl + tr) / 2;
        add(2 * i + 1, tl, tm, l, r, delta);
        add(2 * i + 2, tm, tr, l, r, delta);
        max_val[i] = max(max_val[2 * i + 1], max_val[2 * i + 2]);
    }

    int get(int i, int tl, int tr, int l, int r) {
        push(i, tl, tr);
        if (tl &amp;gt;= r || tr &amp;lt;= l) {
            return 0;
        }
        if (l &amp;lt;= tl &amp;amp;&amp;amp; tr &amp;lt;= r) {
            return max_val[i];
        }
        int tm = (tl + tr) / 2;
        return max(get(2 * i + 1, tl, tm, l, r), get(2 * i + 2, tm, tr, l, r));
    }

public:
    SegmentTree(int k) {
        n = 1;
        while (n &amp;lt; k) {
            n *= 2;
        }
        max_val = vector&amp;lt;int&amp;gt;(2 * n, 0);
        to_add = vector&amp;lt;int&amp;gt;(2 * n, 0);
    }

    void add(int l, int r, int delta) { add(0, 0, n, l, r, delta); }

    int get(int l, int r) { return get(0, 0, n, l, r); }
};
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Mysterious Combinators (AGC019-F)</title><link>https://blog.crazyark.xyz/zh/posts/agc019_mysterious_combinators/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/agc019_mysterious_combinators/</guid><pubDate>Wed, 06 Sep 2017 13:16:06 GMT</pubDate><content:encoded>&lt;p&gt;The original problem is from AtCoder Grand Contest 019, F - Yes or No.&lt;/p&gt;
&lt;p&gt;Rephrased as a math problem, the statement is as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Suppose you have M+N questions to answer, and each answer is either Yes or No. You know that N of them are Yes and M are No, but you don&apos;t know the order. You answer the questions one by one in order, and immediately learn the correct answer after each question.
Assuming you always answer in a way that maximizes the expected number of correct answers, what is the expected number of correct answers?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;The Answer&lt;/h3&gt;
&lt;p&gt;Let $E(M, N)$ denote the expected number of correct answers when $M$ answers of one kind and $N$ of the other remain; by symmetry it does not matter which kind is which.&lt;/p&gt;
&lt;p&gt;Then&lt;/p&gt;
&lt;p&gt;$E(M, N) = E(N, M)$, and&lt;/p&gt;
&lt;p&gt;$E(M, N) = \dfrac{\sum\limits_{k = 1}^{k = N}\binom{2k}{k}\binom{M + N - 2k}{N - k}}{2\binom{M + N}{N}} + M, M \ge N$&lt;/p&gt;
&lt;h3&gt;Proof by Induction&lt;/h3&gt;
&lt;p&gt;Clearly, the strategy that maximizes the expected value is to always guess whichever answer currently has more remaining, so we have&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;$E(M, N) = \dfrac{M}{M + N}\left(E(M - 1, N) + 1\right) + \dfrac{N}{M + N}E(M, N - 1), M \ge N$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$E(M, 0) = M$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;From these two equations and the symmetry $E(M, N) = E(N, M)$, any $E(M, N)$ can be derived.&lt;/p&gt;
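&lt;p&gt;Before the induction, a quick numeric sanity check of ours (not part of the original argument): compute $E(M,N)$ directly from the recurrence and compare it against the closed form, using double-precision binomial coefficients. The helper names binom, E_rec, and closed_form are our own.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;cassert&amp;gt;
#include &amp;lt;cmath&amp;gt;

// Binomial coefficient as a double; returns 0 outside the valid range.
static double binom(int n, int k) {
    if (k &amp;lt; 0 || k &amp;gt; n) return 0.0;
    double r = 1.0;
    for (int i = 1; i &amp;lt;= k; i++) r = r * (n - k + i) / i;
    return r;
}

// E via the recurrence: E(M,0) = M, E(M,N) = E(N,M), and for M &amp;gt;= N &amp;gt;= 1
// E(M,N) = M/(M+N) * (E(M-1,N) + 1) + N/(M+N) * E(M,N-1).
static double E_rec(int M, int N) {
    if (M &amp;lt; N) return E_rec(N, M);
    if (N == 0) return M;
    return (double)M / (M + N) * (E_rec(M - 1, N) + 1.0)
         + (double)N / (M + N) * E_rec(M, N - 1);
}

// The closed form claimed above, for M &amp;gt;= N.
static double closed_form(int M, int N) {
    double s = 0.0;
    for (int k = 1; k &amp;lt;= N; k++)
        s += binom(2 * k, k) * binom(M + N - 2 * k, N - k);
    return M + s / (2.0 * binom(M + N, N));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For instance, both computations give $E(1,1) = 1.5$, and the two agree up to floating-point error for all small $M \ge N$.&lt;/p&gt;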
&lt;p&gt;Let $F(M, N) = E(M, N) - M, M \ge N$, and $F(M, N) = F(N, M)$&lt;/p&gt;
&lt;p&gt;We have&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$F(M, N) = \dfrac{M}{M + N}F(M - 1, N) + \dfrac{N}{M + N}F(M, N - 1), M \gt N$&lt;/li&gt;
&lt;li&gt;$F(N, N) = F(N, N - 1) + \dfrac{1}{2}$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let $G(M, N) = \binom{M + N}{N}F(M, N), M \ge N$, which likewise satisfies $G(M, N) = G(N, M)$&lt;/p&gt;
&lt;p&gt;The following two identities hold simultaneously:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$G(M, N) = G(M - 1, N) + G(M, N - 1), M \gt N$&lt;/li&gt;
&lt;li&gt;$G(N, N) = \dfrac{1}{2}\binom{2N}{N} + 2G(N, N-1)$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we only need to verify that $G(M, N) = \dfrac{\sum\limits_{k = 1}^{k = N}\binom{2k}{k}\binom{M + N - 2k}{N - k}}{2} = \sum\limits_{k = 1}^{k = N}\binom{2k - 1}{k}\binom{M + N - 2k}{N - k}, M\ge N$.&lt;/p&gt;
&lt;p&gt;First, clearly $G(M, 0) = 0$ holds.&lt;/p&gt;
&lt;p&gt;Setting $M = N = 1$, we get $G(1, 1) = 1 + 2G(1, 0) = 1$, and $G(M, 1) = G(M - 1, 1) + G(M, 0), M \gt 1$, so $G(M, 1) = 1, M \ge 1$ holds.&lt;/p&gt;
&lt;p&gt;Assume that for $\forall n.n \le N - 1, \forall m.m\ge n$, we have $G(m, n) = \sum\limits_{k = 1}^{k = n}\binom{2k - 1}{k}\binom{m + n - 2k}{n - k}$, then&lt;/p&gt;
&lt;p&gt;$G(N, N) = \binom{2N - 1}{N - 1} + 2\sum\limits_{k = 1}^{k = N - 1}\binom{2k - 1}{k}\binom{2N - 1 - 2k}{N - k}$
$ = \binom{2N - 1}{N - 1} + \sum\limits_{k = 1}^{k = N - 1}\binom{2k - 1}{k}\binom{2N - 2k}{N - k}$
$=\sum\limits_{k = 1}^{k = N}\binom{2k - 1}{k}\binom{2N - 2k}{N - k}$&lt;/p&gt;
&lt;p&gt;Now assume the formula also holds for all $m$ with $N \le m \le M$; then&lt;/p&gt;
&lt;p&gt;$G(M + 1, N) = G(M, N) + G(M + 1, N - 1)$
$= \sum\limits_{k = 1}^{k = N}\binom{2k - 1}{k}\binom{M + N - 2k}{N - k} + \sum\limits_{k = 1}^{k = N - 1}\binom{2k - 1}{k}\binom{M + N - 2k}{N - 1- k}$
$ = \sum\limits_{k = 1}^{k = N - 1}\binom{2k - 1}{k}\left(\binom{M + N - 2k}{N - k} + \binom{M + N- 2k}{N - 1- k}\right) + \binom{2N-1}{N}$
$ = \sum\limits_{k = 1}^{k = N - 1}\binom{2k - 1}{k}\binom{M + N + 1 - 2k}{N - k}+  \binom{2N-1}{N} = \sum\limits_{k = 1}^{k = N}\binom{2k - 1}{k}\binom{M + N + 1 - 2k}{N - k}$&lt;/p&gt;
&lt;p&gt;Therefore for $\forall n.n\le N, \forall m.m\ge n$, the formula holds.&lt;/p&gt;
&lt;p&gt;Hence for $\forall m, \forall n.m \ge n$, the formula holds.&lt;/p&gt;
&lt;p&gt;Q.E.D.&lt;/p&gt;
&lt;h3&gt;The Troublesome Part&lt;/h3&gt;
&lt;p&gt;Looking at the form of this solution, it feels like there should be a constructive proof, but I racked my brain and couldn&apos;t come up with one. More critically, without knowing the form of the solution in advance, it&apos;s impossible to carry out the induction proof above...&lt;/p&gt;
&lt;p&gt;I think I need to ask an old friend for help... or maybe email the problem setter...&lt;/p&gt;
&lt;h3&gt;Constructive Proof&lt;/h3&gt;
&lt;p&gt;It turns out I figured it out right after taking a shower... Could this be the legendary shower-solving method???&lt;/p&gt;
&lt;h4&gt;Problem Mapping&lt;/h4&gt;
&lt;p&gt;Suppose we have an M x N rectangular grid. We place the grid in the first quadrant of a coordinate system, with the bottom-left corner at (0, 0) and the top-right corner at (M, N).&lt;/p&gt;
&lt;p&gt;There are a total of $\binom{M + N}{N}$ distinct paths connecting the bottom-left and top-right corners (from bottom-left to top-right consists of M rightward steps and N upward steps in some order).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;All paths described below are constructed from the top-right corner (M,N) to the bottom-left corner (0,0), and M &amp;gt;= N is assumed throughout.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We now define a mapping from a path to an answering strategy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Being at grid point (x, y) means there are x + y questions remaining, of which x are Yes and y are No.&lt;/li&gt;
&lt;li&gt;At grid point (x, y):
&lt;ul&gt;
&lt;li&gt;If x &amp;gt;= y, then going left to (x - 1, y) means (answer Yes, correct), and going down to (x, y - 1) means (answer Yes, wrong).&lt;/li&gt;
&lt;li&gt;If x &amp;lt; y, then going left to (x - 1, y) means (answer No, wrong), and going down means (answer No, correct).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Clearly, through this mapping, &lt;strong&gt;a path uniquely determines both the question sequence and the answering strategy, and vice versa.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;How do we compute the number of correct answers in the original problem? We can see that when the path passes through grid point (x, y), whether the answer is correct or wrong is uniquely determined by the direction of the edge, i.e., determined by the edge itself:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For all x &amp;gt;= y, edge (x, y)-(x - 1, y) has weight 1, edge (x, y)-(x, y - 1) has weight 0.&lt;/li&gt;
&lt;li&gt;For all x &amp;lt; y, edge (x, y)-(x - 1, y) has weight 0, edge (x, y)-(x, y - 1) has weight 1.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If we label all edges of the grid with these weights, then the number of correct answers equals the sum of edge weights along each path.&lt;/p&gt;
&lt;h4&gt;Solving the Problem&lt;/h4&gt;
&lt;p&gt;I won&apos;t draw diagrams here; readers should sketch this on paper.&lt;/p&gt;
&lt;p&gt;Let $w(x, y, 1)$ denote the weight on edge (x,y)-(x-1,y), and $w(x, y, 0)$ denote the weight on edge (x,y)-(x,y-1). Then:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$x \ne y, w(x, y, 1) = w(y, x, 0)$&lt;/li&gt;
&lt;li&gt;$x = y, w(x, y, 1) = 1, w(x, y, 0) = 0$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Intuitively, this means that except for edges emanating from points (x, x), all edges are symmetric about $y = x$.&lt;/p&gt;
&lt;p&gt;For any path S, let S&apos; be its reflection about $y = x$ within the grid, reflecting the parts that can be reflected: concretely, S&apos; coincides with S until S first touches $y = x$, and is the mirror image of S about $y = x$ from that point on.&lt;/p&gt;
&lt;p&gt;Let $w(S)$ denote the sum of weights on all edges of S. We can show that:&lt;/p&gt;
&lt;p&gt;$w(S) + w(S&apos;) = 2M + |S \cap \{(x, y) \mid y = x\}|$&lt;/p&gt;
&lt;p&gt;The proof is as follows:&lt;/p&gt;
&lt;p&gt;Suppose S first intersects $y = x$ at (X, X). Before reaching (X, X), the path stays in the region x &amp;gt;= y, where every leftward edge has weight 1 and every downward edge has weight 0; since exactly M - X leftward steps are taken, the total weight accumulated is (M - X).&lt;/p&gt;
&lt;p&gt;From (X, X) to (0, 0), the combined weight of the two paths S + S&apos; is $2X + |S \cap \{(x, y) \mid y = x\}|$:&lt;/p&gt;
&lt;p&gt;This is because if we set $w(x,x,1)=w(x,x,0)=0$, the edge weights become completely symmetric about $y=x$. Reflecting every portion of S that lies below $y=x$ to above it therefore leaves the path weight unchanged; the reflected path never moves down from a diagonal point, so each of its X downward edges contributes 1 and every leftward edge contributes 0, giving a weight of X, and the same holds for S&apos;. Meanwhile, at each diagonal point the pair S, S&apos; collects the true weights $w(x,x,1) = 1$ and $w(x,x,0) = 0$ exactly once between them (one path turns left there, the other goes down), so the formula holds.&lt;/p&gt;
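&lt;p&gt;The pairing identity can also be checked mechanically. The following C++ snippet (my own sanity check, not part of the original solution) enumerates every path for small M and N, builds S&apos; by mirroring S after its first contact with the diagonal, and verifies the formula; note that the intersection count excludes the origin, matching the $k \ge 1$ range of the sum below.&lt;/p&gt;

```cpp
#include <cassert>
#include <vector>
using namespace std;

int wLeft(int x, int y) { return x >= y ? 1 : 0; }  // weight of edge (x,y)-(x-1,y)
int wDown(int x, int y) { return x >= y ? 0 : 1; }  // weight of edge (x,y)-(x,y-1)

// A path is a move sequence from (M, N) to (0, 0): true = left, false = down.
int weight(int M, int N, const vector<bool>& mv) {
    int x = M, y = N, w = 0;
    for (bool left : mv) {
        w += left ? wLeft(x, y) : wDown(x, y);
        if (left) --x; else --y;
    }
    return w;
}

// S': keep S until its first contact with y = x, mirror (swap moves) afterwards.
vector<bool> mirrored(int M, int N, const vector<bool>& mv) {
    vector<bool> out = mv;
    int x = M, y = N;
    bool flip = (x == y);
    for (size_t i = 0; i < mv.size(); ++i) {
        if (flip) out[i] = !mv[i];
        if (mv[i]) --x; else --y;
        if (x == y) flip = true;
    }
    return out;
}

// Diagonal points of S other than the origin.
int diagPoints(int M, int N, const vector<bool>& mv) {
    int x = M, y = N, c = (x == y && x > 0) ? 1 : 0;
    for (bool left : mv) {
        if (left) --x; else --y;
        if (x == y && x > 0) ++c;
    }
    return c;
}

// Enumerate all paths; check w(S) + w(S') == 2M + (diagonal points of S).
void check(int M, int N, vector<bool>& mv, int lefts, int downs) {
    if (lefts == 0 && downs == 0) {
        assert(weight(M, N, mv) + weight(M, N, mirrored(M, N, mv))
               == 2 * M + diagPoints(M, N, mv));
        return;
    }
    if (lefts > 0) { mv.push_back(true);  check(M, N, mv, lefts - 1, downs); mv.pop_back(); }
    if (downs > 0) { mv.push_back(false); check(M, N, mv, lefts, downs - 1); mv.pop_back(); }
}
```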
&lt;p&gt;Therefore, over all paths, the total edge weight is $W = \sum\limits_{S}w(S) = \sum\limits_{S}\dfrac{2M + |S \cap \{(x, y) \mid y = x\}|}{2}$, i.e.,&lt;/p&gt;
&lt;p&gt;$W = M\binom{M + N}{N} + \frac{\sum\limits_{S}|S \cap \{(x, y) \mid y = x\}|}{2}$. We just need to compute how many paths pass through each diagonal point (x, x) with x &amp;gt;= 1, then sum up (the origin lies on every path, but it has no outgoing edges and so contributes no weight).&lt;/p&gt;
&lt;p&gt;For a point (x,y), the number of paths passing through it is $\binom{x + y}{x}\binom{M + N - x - y}{N - y}$, which is obvious...&lt;/p&gt;
&lt;p&gt;Therefore, $W = M\binom{M + N}{N} + \frac{\sum\limits_{k = 1}^{N}\binom{2k}{k}\binom{M + N - 2k}{N - k}}{2}$.&lt;/p&gt;
&lt;p&gt;So the expected value is $\dfrac{W}{\binom{M + N}{N}} = M + \frac{\sum\limits_{k = 1}^{N}\binom{2k}{k}\binom{M + N - 2k}{N - k}}{2\binom{M + N}{N}}$. Q.E.D.&lt;/p&gt;
&lt;p&gt;During the proof, we can observe an interesting fact: in the process of answering questions, if at some point the number of remaining Yes answers equals the number of remaining No answers, and you simply give up on strategy and answer blindly from then on, the number of correct answers always comes out to M.&lt;/p&gt;
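&lt;p&gt;As a final sanity check (my own addition, not part of the derivation), the following C++ snippet compares the closed form against the defining recurrence of the greedy strategy, $E(M, N) = \frac{M}{M+N}\left(E(M-1, N)+1\right) + \frac{N}{M+N}E(M, N-1)$ for $M \ge N$, with $E(M, 0) = M$ and $E(M, N) = E(N, M)$, for all small M, N:&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

const int LIM = 16;
double C[2 * LIM + 1][2 * LIM + 1];   // binomial coefficients (Pascal's triangle)
double memo[LIM + 1][LIM + 1];
bool   done[LIM + 1][LIM + 1];

void buildC() {
    for (int i = 0; i <= 2 * LIM; ++i) {
        C[i][0] = 1;
        for (int j = 1; j <= i; ++j)
            C[i][j] = C[i - 1][j - 1] + (j <= i - 1 ? C[i - 1][j] : 0);
    }
}

// Expectation of the greedy strategy: always guess whichever answer
// has more occurrences remaining.
double E(int M, int N) {
    if (M < N) { int t = M; M = N; N = t; }   // symmetry E(M, N) = E(N, M)
    if (N == 0) return M;                     // all remaining answers are forced
    if (done[M][N]) return memo[M][N];
    double v = (double)M / (M + N) * (E(M - 1, N) + 1)
             + (double)N / (M + N) * E(M, N - 1);
    done[M][N] = true;
    return memo[M][N] = v;
}

// Closed form derived above, valid for M >= N.
double closedForm(int M, int N) {
    double s = 0;
    for (int k = 1; k <= N; ++k)
        s += C[2 * k][k] * C[M + N - 2 * k][N - k];
    return M + s / (2 * C[M + N][N]);
}
```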
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;When my brain isn&apos;t working, I choose to take a shower (just kidding).&lt;/p&gt;
&lt;p&gt;The constructive solution to this problem is quite elegant. Without diagrams, it can be tiring to explain or understand, so I recommend drawing it out yourself.&lt;/p&gt;
&lt;p&gt;It took me three days to figure this out, and I even looked at the editorial, which hinted at the grid construction approach... If I were under time pressure during a contest, I would probably just go with an O(MN) DP solution...&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;A Small Lemma&lt;/h4&gt;
&lt;p&gt;During this investigation, I also found a small result. Suppose $p$ is a prime, $a, b, c, d$ are all coprime to $p$, and there exist $x, y \lt p$ such that $a \equiv bx \ (\textrm{mod} \ p)$ and $c \equiv dy \ (\textrm{mod} \ p)$.&lt;/p&gt;
&lt;p&gt;Then $ad + bc \equiv (x + y)bd \ (\textrm{mod} \ p)$. The proof is as follows:&lt;/p&gt;
&lt;p&gt;$ad + bc \equiv xbd^pb^{p - 1} + ydb^pd^{p - 1} \ (\textrm{mod} \ p) \equiv (bd)^p(x + y) \ (\textrm{mod} \ p) \equiv bd(x + y) \ (\textrm{mod} \ p)$. Q.E.D.&lt;/p&gt;
&lt;p&gt;This is useful for computing the sum of fractions $\frac{a}{b}$ modulo $p$, i.e., $ab^{-1} \ (\textrm{mod} \ p)$.&lt;/p&gt;
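&lt;p&gt;A quick C++ illustration (my own, with arbitrary sample values): writing $x = ab^{-1}$ and $y = cd^{-1}$ modulo $p$ via Fermat inverses, the lemma says that $\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}$ is represented by the residue $x + y$, so fractions can be accumulated by simply adding their residues:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
using u64 = uint64_t;

u64 powmod(u64 a, u64 e, u64 p) {
    u64 r = 1 % p; a %= p;
    while (e) { if (e & 1) r = r * a % p; a = a * a % p; e >>= 1; }
    return r;
}
u64 inv(u64 a, u64 p) { return powmod(a, p - 2, p); } // Fermat inverse, p prime

// Residue of the fraction a/b modulo p.
u64 frac(u64 a, u64 b, u64 p) { return a % p * inv(b, p) % p; }
```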
</content:encoded></item><item><title>Longest Increasing Subsequence</title><link>https://blog.crazyark.xyz/zh/posts/longest_increasing_subsequence/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/longest_increasing_subsequence/</guid><pubDate>Fri, 01 Sep 2017 07:41:22 GMT</pubDate><content:encoded>&lt;p&gt;The Longest Increasing Subsequence algorithm -- I thought I had memorized the fastest algorithm, but my memory is too poor. Today I encountered a problem and forgot how to do it again.&lt;/p&gt;
&lt;p&gt;Three approaches: DP, Sort + LCS, DP + BS. I only remembered the first one, DP...&lt;/p&gt;
&lt;p&gt;And while we&apos;re at it, let&apos;s solve a certain problem too.&lt;/p&gt;
&lt;h3&gt;Algorithms for LIS&lt;/h3&gt;
&lt;h4&gt;DP&lt;/h4&gt;
&lt;p&gt;Obviously, the length of the longest increasing subsequence ending at the i-th element is max{dp[j] + 1}, j &amp;lt; i &amp;amp;&amp;amp; arr[j] &amp;lt; arr[i]. So the DP has O(n) space complexity and O(n^2) time complexity.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;int LIS(vector&amp;lt;int&amp;gt; nums) {
    int n = nums.size();
    int dp[n];
    int res = 0;
    for (int i = 0; i &amp;lt; n; ++ i) {
        dp[i] = 1;
        for (int j = 0; j &amp;lt; i; ++ j) {
            if (nums[j] &amp;lt; nums[i])
                dp[i] = max(dp[i], dp[j] + 1);
        }
        res = max(res, dp[i]);
    }
    return res;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Sort + LCS&lt;/h4&gt;
&lt;p&gt;An increasing subsequence of the original array must be a common subsequence of the &lt;strong&gt;sorted&lt;/strong&gt; array and the original array. So we just sort and find the longest common subsequence.&lt;/p&gt;
&lt;p&gt;Space complexity O(n^2), time complexity O(n^2).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;int LCS(vector&amp;lt;int&amp;gt; a, vector&amp;lt;int&amp;gt; b) {
    assert(a.size() == b.size());

    int n = a.size();
    int dp[n + 1][n + 1];
    for (int i = 0; i &amp;lt;= n; ++ i) {
        for (int j = 0; j &amp;lt;= n; ++ j) {
            if (i == 0 || j == 0)
                dp[i][j] = 0;
            else if (a[i - 1] == b[j - 1])
                dp[i][j] = dp[i - 1][j - 1] + 1;
            else dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp[n][n];
}

int LIS(vector&amp;lt;int&amp;gt; nums) {
    auto sorted = nums;
    sort(sorted.begin(), sorted.end());
    return LCS(nums, sorted);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;DP + BS&lt;/h4&gt;
&lt;p&gt;This is the fastest algorithm, and also the one I still haven&apos;t memorized. The key idea is to maintain an array where the l-th entry (1-based) stores the smallest possible last element of an increasing subsequence of length l. We continuously maintain this array until the end, and the effective length of the array is the LIS result.&lt;/p&gt;
&lt;p&gt;Since the l-th entry stores the smallest possible last element of an increasing subsequence of length l, this array is guaranteed to be sorted in increasing order. When the next number n needs to become the end of some increasing subsequence, we search the array for the first element that is not less than n. Suppose the position is l. Then we know we can build an increasing subsequence of length at most l ending with n (think about why -- it&apos;s straightforward). At this point, we update the array value at position l to n. A special case is when position l doesn&apos;t exist in the array, in which case we append n to the end.&lt;/p&gt;
&lt;p&gt;Binary search can be used here, giving O(n) space complexity and O(nlgn) time complexity.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;int LIS(vector&amp;lt;int&amp;gt; nums) {
    vector&amp;lt;int&amp;gt; dp;
    for (auto n : nums) {
        auto it = lower_bound(dp.begin(), dp.end(), n);
        if (it == dp.end()) dp.push_back(n);
        else *it = n;
    }
    return dp.size();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Actually, I realized... I don&apos;t even need to hand-write BS. The STL handles a lot of things. Apart from not having a Fibonacci Heap, it seems like there aren&apos;t many downsides (laughs).&lt;/p&gt;
&lt;h3&gt;AtCoder Grand Contest 019C&lt;/h3&gt;
&lt;p&gt;Problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the city of Nevermore, there are 10e8 streets and 10e8 avenues, both numbered from 0 to 10e8 - 1. All streets run straight from west to east, and all avenues run straight from south to north. The distance between neighboring streets and between neighboring avenues is exactly 100 meters.
Every street intersects every avenue. Every intersection can be described by pair (x,y), where x is avenue ID and y is street ID.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;There are N fountains in the city, situated at intersections (Xi,Yi). Unlike normal intersections, there&apos;s a circle with radius 10 meters centered at the intersection, and there are no road parts inside this circle.
The picture below shows an example of how a part of the city with roads and fountains may look like.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;/img/posts/1f931bf0c98ec6f07e612b0282cdb094.png&quot; alt=&quot;city&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Constraints:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;0&amp;lt;=x1,y1,x2,y2&amp;lt;10e8&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;1&amp;lt;=N&amp;lt;=200,000&lt;/li&gt;
&lt;li&gt;0&amp;lt;=Xi,Yi&amp;lt;10e8&lt;/li&gt;
&lt;li&gt;Xi!=Xj for i!=j&lt;/li&gt;
&lt;li&gt;Yi!=Yj for i!=j&lt;/li&gt;
&lt;li&gt;Intersections (x1,y1) and (x2,y2) are different and don&apos;t contain fountains.&lt;/li&gt;
&lt;li&gt;All input values are integers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Input:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;x1 y1 x2 y2
N
X1 Y1
X2 Y2
:
XN YN&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Print the shortest possible distance one needs to cover in order to get from intersection (x1,y1) to intersection (x2,y2), in meters. Your answer will be considered correct if its absolute or relative error doesn&apos;t exceed 10e-11.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Samples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I: 1 1 6 5
3
3 2
5 3
2 4&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;O: 891.415926535897938&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br/&gt;
&lt;blockquote&gt;
&lt;p&gt;I: 3 5 6 4
3
3 2
5 3
2 4&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;O: 400.000000000000000&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br/&gt;
&lt;blockquote&gt;
&lt;p&gt;I: 4 2 2 2
3
3 2
5 3
2 4&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;O: 211.415926535897938&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OK, given two points on a grid, the path without considering fountains is fixed, with length abs(x1 - x2) + abs(y1 - y2).&lt;/p&gt;
&lt;p&gt;By symmetry, we may assume x1 &amp;lt;= x2, y1 &amp;lt;= y2. When fountains are considered, turning at a fountain saves some distance, while going straight through a fountain costs extra distance.&lt;/p&gt;
&lt;p&gt;Suppose our strategy is to detour to maximize turns or to avoid going straight through fountains, i.e., going outside the rectangle formed by {(x1, y1), (x2, y2)}. Since each row and column has at most one fountain, going one cell further from the rectangle adds 200m of extra distance, while the maximum saving per turn is (2 * r - pi * r / 2), approximately 4.3m, and the avoidable extra cost is (pi * r - 2 * r), approximately 11.4m. So this is clearly not worthwhile.&lt;/p&gt;
&lt;p&gt;Therefore, the path must stay inside the rectangle {(x1, y1), (x2, y2)}, maximizing the number of turns.&lt;/p&gt;
&lt;p&gt;Clearly, on the path from (x1, y1) to (x2, y2), there are at most c = min(x2 - x1, y2 - y1) turns and at most c + 1 fountains.&lt;/p&gt;
&lt;p&gt;If the fountains are sorted by their x-coordinate, the maximum number of fountains we can pass through equals the length of the longest increasing subsequence of the fountains&apos; y-values. Let this length be k.&lt;/p&gt;
&lt;p&gt;There are two cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;k &amp;lt;= c: then there exists a path that turns at every fountain in this subsequence&lt;/li&gt;
&lt;li&gt;k == c + 1: then one fountain must be passed through straight, so the path makes c turns and passes through the remaining one&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So the key is computing the longest increasing subsequence. Here&apos;s the code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;bits/stdc++.h&amp;gt;

using namespace std;

pair&amp;lt;long long, long long&amp;gt; p[200000];

const double PI = 3.14159265358979323846;
const double R = 10.0;
const double quarter_circle = PI * R / 2;
const double edge = 100.0;

using PLL = pair&amp;lt;long long, long long&amp;gt;;
int dp[200000];

int main() {
    long long x1, y1, x2, y2;
    scanf(&quot;%lld %lld %lld %lld&quot;, &amp;amp;x1, &amp;amp;y1, &amp;amp;x2, &amp;amp;y2);

    if (x1 &amp;gt; x2) {
        swap(x1, x2);
        swap(y1, y2);
    }

    long long y_min = min(y1, y2), y_max = max(y1, y2);

    // drop those not used
    int n;
    scanf(&quot;%d&quot;, &amp;amp;n);
    for (int i = 0; i &amp;lt; n;) {
        scanf(&quot;%lld %lld&quot;, &amp;amp;p[i].first, &amp;amp;p[i].second);
        if (p[i].first &amp;lt; x1 || p[i].first &amp;gt; x2 || p[i].second &amp;lt; y_min || p[i].second &amp;gt; y_max) {
            n--;
        } else {
            ++i;
        }
    }

    // x1 &amp;lt;= x2
    double res = edge * (x2 - x1 + y_max - y_min);
    sort(p, p + n, [&amp;amp;](const PLL&amp;amp; a, const PLL&amp;amp; b) { return a.first &amp;lt; b.first; });
    if (y1 &amp;gt; y2) {
        for (int i = 0; i &amp;lt; n; ++i) {
            p[i].second = -p[i].second;
        }
    }

    int len = 0;
    for (int i = 0; i &amp;lt; n; ++i) {
        auto it = lower_bound(dp, dp + len, p[i].second);
        *it = p[i].second;
        if (it == dp + len) {
            len ++;
        }
    }

    int max_corner = min(x2 - x1, y_max - y_min);

    if (len &amp;lt; max_corner + 1)
        res += len * (quarter_circle - 2 * R);
    else
        res += max_corner * (quarter_circle - 2 * R) + (2 * quarter_circle - 2 * R);

    printf(&quot;%.15f\n&quot;, res);

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Skip one day of problem grinding and you&apos;ll forget the algorithms. Also, AtCoder problems really are hard...&lt;/p&gt;
</content:encoded></item><item><title>SIGFPE When Doing DivQ</title><link>https://blog.crazyark.xyz/zh/posts/cpp_sigfpe/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/cpp_sigfpe/</guid><pubDate>Sun, 27 Aug 2017 09:38:37 GMT</pubDate><content:encoded>&lt;p&gt;This was my first encounter with a SIGFPE caused by something other than division by zero. Here&apos;s a note about it.&lt;/p&gt;
&lt;h3&gt;Symptoms&lt;/h3&gt;
&lt;p&gt;When using the following function with a = 1e18, b = 1e18, m = 1e9 + 7, a SIGFPE is triggered.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;long mulmod(long a, long b, long m) {
    long res;
    asm(&quot;mulq %2; divq %3&quot; : &quot;=d&quot;(res), &quot;+a&quot;(a) : &quot;S&quot;(b), &quot;c&quot;(m));
    return res;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Cause&lt;/h3&gt;
&lt;p&gt;The root cause is that &lt;code&gt;divq S&lt;/code&gt; divides the 128-bit value %rdx:%rax by S, storing the quotient in %rax and the remainder in %rdx. In the above case, a * b / m is too large to fit in 64 bits, so %rax overflows and triggers SIGFPE.&lt;/p&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;p&gt;Reduce both operands modulo m first, and only then perform the modular multiplication.&lt;/p&gt;
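&lt;p&gt;Concretely: after reducing both operands modulo m, the 128-bit product is below $m^2$, so the quotient of the division is below $m \le 10^{18}$ and fits comfortably in 64 bits, and divq can no longer fault. A sketch of the fix (same inline assembly as above; the &lt;code&gt;__int128&lt;/code&gt; variant is an equivalent GCC/Clang alternative I added for comparison):&lt;/p&gt;

```cpp
// Fixed version: reduce first, so (a * b) / m < m fits in 64 bits.
long mulmod(long a, long b, long m) {
    a %= m; b %= m;
    long res;
    asm("mulq %2; divq %3" : "=d"(res), "+a"(a) : "S"(b), "c"(m));
    return res;
}

// Portable equivalent on GCC/Clang, without inline assembly.
long mulmod128(long a, long b, long m) {
    return (long)((unsigned __int128)(a % m) * (b % m) % m);
}
```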
</content:encoded></item><item><title>Hyperexponentiation of an Integer (Project Euler #188 -- The Hyperexponentiation of A Number)</title><link>https://blog.crazyark.xyz/zh/posts/projecteuler_188/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/projecteuler_188/</guid><pubDate>Fri, 25 Aug 2017 13:32:26 GMT</pubDate><content:encoded>&lt;p&gt;Following up on the previous blog post, let us solve the large integer factorization problem and ultimately solve Project Euler #188.&lt;/p&gt;
&lt;p&gt;Recall that the problem asks us to compute $a\uparrow\uparrow b \ (\textrm{mod} \ m)$, where $1 \le a, b, m \le 10^{18}$.&lt;/p&gt;
&lt;p&gt;In the field of integer factorization, the integers here are not particularly large. I had never studied algorithms of this kind before, so this was a good opportunity to fill that gap. Here I used the Pollard Rho algorithm; other options include Fermat&apos;s factorization method and the Quadratic Sieve.&lt;/p&gt;
&lt;h3&gt;Pollard Rho&lt;/h3&gt;
&lt;h4&gt;Algorithm Pseudocode&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;x &amp;lt;- 2; y &amp;lt;- 2; d &amp;lt;- 1
while d = 1:
    x &amp;lt;- g(x)
    y &amp;lt;- g(g(y))
    d &amp;lt;- gcd(|x - y|, n)
if d = n:
    return failure
else:
    return d
&lt;/code&gt;&lt;/pre&gt;
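&lt;p&gt;The pseudocode above maps almost line for line onto C++. A sketch (assuming an overflow-safe &lt;code&gt;mulmod&lt;/code&gt; via &lt;code&gt;__int128&lt;/code&gt;; function names are mine):&lt;/p&gt;

```cpp
#include <cstdint>
#include <numeric>

// Overflow-safe modular multiplication via 128-bit arithmetic.
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)((__uint128_t)a * b % n);
}

// The pseudorandom step g(x) = (x^2 + 1) mod n.
static uint64_t g(uint64_t x, uint64_t n) {
    return (mulmod(x, x, n) + 1) % n;
}

// One round of Pollard's rho with Floyd cycle detection.
// Returns a non-trivial factor of n, or n itself on failure
// (in which case the caller retries with another seed).
uint64_t pollard_rho(uint64_t n, uint64_t seed = 2) {
    uint64_t x = seed, y = seed, d = 1;
    while (d == 1) {
        x = g(x, n);
        y = g(g(y, n), n);
        d = std::gcd(x > y ? x - y : y - x, n);
    }
    return d;
}
```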
&lt;h4&gt;Core Idea&lt;/h4&gt;
&lt;p&gt;Suppose the integer to be factored is $n = pq$, where $p, q$ are both primes. Without loss of generality, assume $p \le q$. We use $g(x) = (x^2 + 1) \ (\textrm{mod} \ n)$ to generate a pseudorandom number sequence.&lt;/p&gt;
&lt;p&gt;Suppose the initial x is 2. Then we have a sequence ${x_1 = g(2), x_2 = g(g(2)), ... x_k = g^k(2),...}$. Since the values in the sequence are finite and each number depends only on its predecessor, the sequence must eventually cycle.&lt;/p&gt;
&lt;p&gt;Define the sequence ${y_k = x_k\ (\textrm{mod} \ p), k = 1, 2, ...}$. This sequence must also cycle, and since $p$ is much smaller than $n$, this sequence will cycle much earlier than the ${x_k}$ sequence.&lt;/p&gt;
&lt;p&gt;We only need to detect this cycle. Suppose the cycle length is r; it must satisfy $y_{r+k} - y_{k} \equiv 0 \ (\textrm{mod} \ p)$. So we use Floyd Cycle Detection and the greatest common divisor to detect the appearance of the cycle (the GCD is not 1). At this point, if the GCD is not equal to n, it must be either p or q.&lt;/p&gt;
&lt;p&gt;The only caveat is that when the cycle appears, the GCD might equal n, i.e., x equals y in the algorithm above. In this case, we randomly change the sequence seed x and proceed with the next round of factorization.&lt;/p&gt;
&lt;h4&gt;Time Complexity&lt;/h4&gt;
&lt;p&gt;The expected running time is proportional to the square root of the smallest prime factor of n, which is approximately O(3.2e4) here.&lt;/p&gt;
&lt;h3&gt;#188 Overall Approach&lt;/h3&gt;
&lt;p&gt;Define $T(a, b) = a\uparrow\uparrow b$.&lt;/p&gt;
&lt;p&gt;We have $T(a, b) = a^{T(a, b - 1)}$.&lt;/p&gt;
&lt;p&gt;By the extended Euler&apos;s theorem, if $T(a, b - 1) &amp;gt;= \phi(m)$, then $T(a,b) \equiv a^{T(a, b - 1) \ %\  \phi(m) + \phi(m)} \ (\textrm{mod} \ m)$; otherwise $T(a,b) \equiv a^{T(a, b - 1) \ %\  \phi(m)} \ (\textrm{mod} \ m)$. In this case, we need to solve $T(a, b - 1) \ (\textrm{mod}\ \phi(m))$.&lt;/p&gt;
&lt;p&gt;Similarly, $T(a, b - 1) \equiv a^{T(a, b - 2) \ %\  \phi(\phi(m))\ ?+\ \phi(\phi(m))} \ (\textrm{mod} \ \phi(m))$, requiring us to solve $T(a, b - 2) \ (\textrm{mod}\ \phi(\phi(m)))$. We can keep recursing.&lt;/p&gt;
&lt;p&gt;This continues until we reach $T(a, 1)\ (\textrm{mod} \ \phi^{b - 1}(m))$, or until $T(a, b&apos;) \ (\textrm{mod} \ 1)$ with $\phi^{b - b&apos;}(m) = 1$.&lt;/p&gt;
&lt;h4&gt;Recursion Depth&lt;/h4&gt;
&lt;p&gt;The recursion depth is at most min(128, b - 1). The proof is as follows:&lt;/p&gt;
&lt;p&gt;There are two termination conditions for the recursion: either b equals 1, or the Euler&apos;s totient function equals 1.&lt;/p&gt;
&lt;p&gt;We prove that when $m \ge 2$, $\phi(\phi(m)) \le m/2$. By the definition of $\phi(m)$, we always have $1 \le \phi(m) \le m$.&lt;/p&gt;
&lt;p&gt;We consider two cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;m is even: by definition $\phi(m) \le m / 2$, so $\phi(\phi(m)) \le \phi(m) \le m / 2$&lt;/li&gt;
&lt;li&gt;m is odd: by definition $\phi(m)$ must be even, so $\phi(\phi(m)) \le \phi(m) / 2 \le m / 2$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;QED.&lt;/p&gt;
&lt;p&gt;Therefore, if $\phi^{s}(m) = 1, m &amp;lt; 2^{64}$, then $s \le 64 \times 2 = 128$ always holds, so the recursion depth never exceeds 128.&lt;/p&gt;
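&lt;p&gt;The bound is easy to check empirically: iterating $\phi$ from any 64-bit m reaches 1 in well under 128 steps. A quick sketch (trial-division totient, so only meant for moderate m; helper names are mine):&lt;/p&gt;

```cpp
#include <cstdint>

// Euler's totient via trial division -- fine for this small check,
// though the post factors m with Pollard rho instead.
uint64_t phi(uint64_t m) {
    uint64_t result = m;
    for (uint64_t p = 2; p * p <= m; ++p) {
        if (m % p == 0) {
            while (m % p == 0) m /= p;
            result -= result / p;
        }
    }
    if (m > 1) result -= result / m;
    return result;
}

// Number of phi-iterations needed to reach 1.
int phi_chain_length(uint64_t m) {
    int s = 0;
    while (m > 1) {
        m = phi(m);
        ++s;
    }
    return s;
}
```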
&lt;h4&gt;Application of the Extended Euler&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;The extended form has a size comparison condition, so we need to estimate the magnitude of the current tetration. Since tetration grows extremely fast, the values of a and b for which the result stays within (2^63 - 1) can even be enumerated:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;a&lt;/th&gt;
&lt;th&gt;b&lt;/th&gt;
&lt;th&gt;T(a, b)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;65536&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;7625597484987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;437893890380859375&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can easily compute these numbers and compare them with the modulus, while all others are guaranteed to exceed the modulus.&lt;/p&gt;
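&lt;p&gt;Instead of a lookup table, the comparison can also be done with a saturating evaluation: compute $a\uparrow\uparrow b$ but cap every intermediate at $2^{63} - 1$, so anything that would overflow simply reports the cap. A sketch (helper names are mine; $a \ge 2$ is assumed in &lt;code&gt;sat_pow&lt;/code&gt;):&lt;/p&gt;

```cpp
#include <cstdint>

const uint64_t CAP = (1ULL << 63) - 1;  // saturate at 2^63 - 1

// a^b, saturating at CAP (assumes a >= 2).
uint64_t sat_pow(uint64_t a, uint64_t b) {
    uint64_t r = 1;
    while (b--) {
        if (r > CAP / a) return CAP;  // the next multiply would overflow
        r *= a;
    }
    return r;
}

// T(a, b) = a^^b (tetration), saturating at CAP.
uint64_t sat_tet(uint64_t a, uint64_t b) {
    if (a == 1) return 1;
    uint64_t r = 1;  // a^^0 = 1
    while (b--) {
        r = sat_pow(a, r);
        if (r == CAP) return CAP;
    }
    return r;
}
```

&lt;p&gt;Comparing &lt;code&gt;sat_tet(a, b)&lt;/code&gt; against the modulus then answers the &quot;is $T(a, b) \ge \phi(m)$&quot; question directly.&lt;/p&gt;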
&lt;h4&gt;Large Integer Factorization&lt;/h4&gt;
&lt;p&gt;Given the range of $m$, I first precompute a table of all primes up to 1e6 and factor out all such prime factors from m. Let the remaining number be $m&apos;$. If $m&apos;$ is composite, it must be the product of two primes $p, q \gt 1000000$.&lt;/p&gt;
&lt;p&gt;If $m&apos; \le 1e12$, then $m&apos;$ must be prime. Otherwise, we first perform a primality test on $m&apos;$ (Miller-Rabin), and if it is composite, we use Pollard-Rho to factor it -- once factored, the result must be two primes.&lt;/p&gt;
&lt;h4&gt;64-bit Modular Multiplication Trick&lt;/h4&gt;
&lt;p&gt;Both GCC and Clang provide the built-in type &lt;code&gt;__int128_t&lt;/code&gt;, which supports 128-bit arithmetic. If running on an Intel x86_64 architecture, you can directly use the following assembly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// requires a and b already reduced mod m; otherwise the divq
// quotient can exceed 64 bits and raise SIGFPE
long mulmod(long a, long b, long m) {
    long res;
    asm(&quot;mulq %2; divq %3&quot; : &quot;=d&quot;(res), &quot;+a&quot;(a) : &quot;S&quot;(b), &quot;c&quot;(m));
    return res;
}
&lt;/code&gt;&lt;/pre&gt;
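&lt;p&gt;Off x86_64 (or to avoid inline assembly), the &lt;code&gt;__int128_t&lt;/code&gt; type mentioned above gives a portable equivalent:&lt;/p&gt;

```cpp
#include <cstdint>

// Portable 64-bit modular multiplication using GCC/Clang 128-bit integers.
long mulmod(long a, long b, long m) {
    return (long)((__int128_t)a * b % m);
}
```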
&lt;h4&gt;Algorithm Flow&lt;/h4&gt;
&lt;p&gt;$T(a, b, m)$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If a == 1 or b == 1, return a % m&lt;/li&gt;
&lt;li&gt;If m == 1, return 0&lt;/li&gt;
&lt;li&gt;Factorize m and compute the Euler&apos;s totient function $\phi(m)$&lt;/li&gt;
&lt;li&gt;Recursively compute $e = T(a, b - 1, \phi(m))$&lt;/li&gt;
&lt;li&gt;Estimate $T(a, b - 1)$ and compare with $\phi(m)$. If $T(a, b - 1) \ge \phi(m)$, set $e = e + \phi(m)$&lt;/li&gt;
&lt;li&gt;Compute $a ^ e % m$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The entire flow ensures that all operands stay within 64 bits, with a maximum recursion depth of min(128, b - 1).&lt;/p&gt;
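&lt;p&gt;Putting steps 1-6 together, a compact sketch of the recursion (not the hosted code; it uses a trial-division totient instead of the Pollard-Rho factorization described above, so it is only meant to illustrate the control flow for moderate m):&lt;/p&gt;

```cpp
#include <cstdint>

typedef unsigned long long u64;

static u64 mulmod(u64 a, u64 b, u64 m) {
    return (u64)((__uint128_t)(a % m) * (b % m) % m);
}

static u64 powmod(u64 a, u64 e, u64 m) {
    u64 r = 1 % m;
    for (a %= m; e; e >>= 1) {
        if (e & 1) r = mulmod(r, a, m);
        a = mulmod(a, a, m);
    }
    return r;
}

// Euler's totient via trial division (the post uses Pollard rho instead).
static u64 phi(u64 m) {
    u64 r = m;
    for (u64 p = 2; p * p <= m; ++p)
        if (m % p == 0) {
            while (m % p == 0) m /= p;
            r -= r / p;
        }
    if (m > 1) r -= r / m;
    return r;
}

// a^b capped at cap, so it can be compared against cap safely (assumes a >= 2).
static u64 cap_pow(u64 a, u64 b, u64 cap) {
    u64 r = 1;
    while (b--) {
        if (r > cap / a) return cap;
        r *= a;
        if (r >= cap) return cap;
    }
    return r;
}

// Tetration a^^b capped at cap.
static u64 cap_tet(u64 a, u64 b, u64 cap) {
    if (a == 1) return 1;
    u64 r = 1;
    while (b--) {
        r = cap_pow(a, r, cap);
        if (r >= cap) return cap;
    }
    return r;
}

// Steps 1-6 from the post: T(a, b) mod m via the extended Euler theorem.
u64 T(u64 a, u64 b, u64 m) {
    if (a == 1 || b == 1) return a % m;
    if (m == 1) return 0;
    u64 ph = phi(m);
    u64 e = T(a, b - 1, ph);                   // step 4
    if (cap_tet(a, b - 1, ph) >= ph) e += ph;  // step 5
    return powmod(a, e, m);                    // step 6
}
```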
&lt;p&gt;The algorithm code is hosted at https://github.com/arkbriar/hackerrank-projecteuler/blob/master/cpp/188.cc .&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm&lt;/p&gt;
</content:encoded></item><item><title>Hyperexponentiation of an Integer (Project Euler #188 -- The Hyperexponentiation of A Number)</title><link>https://blog.crazyark.xyz/zh/posts/projecteuler_188/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/projecteuler_188/</guid><pubDate>Fri, 25 Aug 2017 13:32:26 GMT</pubDate><content:encoded>&lt;p&gt;Following up on the previous blog post, let us solve the large integer factorization problem and ultimately solve Project Euler #188.&lt;/p&gt;
&lt;p&gt;Recall that the problem asks us to compute $a\uparrow\uparrow b \ (\textrm{mod} \ m)$, where $1 \le a, b, m \le 10^{18}$.&lt;/p&gt;
&lt;p&gt;In the field of integer factorization, the integers here are not particularly large. I had never studied algorithms of this kind before, so this was a good opportunity to fill that gap. Here I used the Pollard Rho algorithm; other algorithms include Fermat&apos;s factorization method and the Quadratic Sieve.&lt;/p&gt;
&lt;h3&gt;Pollard Rho&lt;/h3&gt;
&lt;h4&gt;Algorithm Pseudocode&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;x ← 2; y ← 2; d ← 1
while d = 1:
    x ← g(x)
    y ← g(g(y))
    d ← gcd(|x - y|, n)
if d = n: 
    return failure
else:
    return d
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Core Idea&lt;/h4&gt;
&lt;p&gt;Suppose the integer to be factored is $n = pq$, where $p, q$ are both primes. Without loss of generality, assume $p \le q$. We use $g(x) = (x^2 + 1) \ (\textrm{mod} \ n)$ to generate a pseudorandom number sequence.&lt;/p&gt;
&lt;p&gt;Suppose the initial x is 2. Then we have a sequence ${x_1 = g(2), x_2 = g(g(2)), ... x_k = g^k(2),...}$. Since the values in the sequence are finite and each number depends only on its predecessor, the sequence must eventually cycle.&lt;/p&gt;
&lt;p&gt;Define the sequence ${y_k = x_k\ (\textrm{mod} \ p), k = 1, 2, ...}$. This sequence must also cycle, and since $p$ is much smaller than $n$, this sequence will cycle much earlier than the ${x_k}$ sequence.&lt;/p&gt;
&lt;p&gt;We only need to detect this cycle. Suppose the cycle length is r; it must satisfy $y_{r+k} - y_{k} \equiv 0 \ (\textrm{mod} \ p)$. So we use Floyd Cycle Detection and the greatest common divisor to detect the appearance of the cycle (the GCD is not 1). At this point, if the GCD is not equal to n, it must be either p or q.&lt;/p&gt;
&lt;p&gt;The only caveat is that when the cycle appears, the GCD might equal n, i.e., x equals y in the algorithm above. In this case, we randomly change the sequence seed x and proceed with the next round of factorization.&lt;/p&gt;
&lt;h4&gt;Time Complexity&lt;/h4&gt;
&lt;p&gt;The expected running time is proportional to the square root of the smallest prime factor of n, which is approximately O(3.2e4) here.&lt;/p&gt;
&lt;h3&gt;#188 Overall Approach&lt;/h3&gt;
&lt;p&gt;Define $T(a, b) = a\uparrow\uparrow b$.&lt;/p&gt;
&lt;p&gt;We have $T(a, b) = a^{T(a, b - 1)}$.&lt;/p&gt;
&lt;p&gt;By the extended Euler&apos;s theorem, if $T(a, b - 1) &amp;gt;= \phi(m)$, then $T(a,b) \equiv a^{T(a, b - 1) \ %\  \phi(m) + \phi(m)} \ (\textrm{mod} \ m)$; otherwise $T(a,b) \equiv a^{T(a, b - 1) \ %\  \phi(m)} \ (\textrm{mod} \ m)$. In this case, we need to solve $T(a, b - 1) \ (\textrm{mod}\ \phi(m))$.&lt;/p&gt;
&lt;p&gt;Similarly, $T(a, b - 1) \equiv a^{T(a, b - 2) \ %\  \phi(\phi(m))\ ?+\ \phi(\phi(m))} \ (\textrm{mod} \ \phi(m))$, requiring us to solve $T(a, b - 2) \ (\textrm{mod}\ \phi(\phi(m)))$. We can keep recursing.&lt;/p&gt;
&lt;p&gt;This continues until we reach $T(a, 1)\ (\textrm{mod} \ \phi^{b - 1}(m))$, or until $T(a, b&apos;) \ (\textrm{mod} \ 1)$ with $\phi^{b - b&apos;}(m) = 1$.&lt;/p&gt;
&lt;h4&gt;Recursion Depth&lt;/h4&gt;
&lt;p&gt;The recursion depth is at most min(128, b - 1). The proof is as follows:&lt;/p&gt;
&lt;p&gt;There are two termination conditions for the recursion: either b equals 1, or the Euler&apos;s totient function equals 1.&lt;/p&gt;
&lt;p&gt;We prove that when $m \ge 2$, $\phi(\phi(m)) \le m/2$. By the definition of $\phi(m)$, we always have $1 \le \phi(m) \le m$.&lt;/p&gt;
&lt;p&gt;We consider two cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;m is even: by definition $\phi(m) \le m / 2$, so $\phi(\phi(m)) \le \phi(m) \le m / 2$&lt;/li&gt;
&lt;li&gt;m is odd: by definition $\phi(m)$ must be even, so $\phi(\phi(m)) \le \phi(m) / 2 \le m / 2$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;QED.&lt;/p&gt;
&lt;p&gt;Therefore, if $\phi^{s}(m) = 1, m &amp;lt; 2^{64}$, then $s \le 64 \times 2 = 128$ always holds, so the recursion depth never exceeds 128.&lt;/p&gt;
&lt;h4&gt;Application of the Extended Euler&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;The extended form has a size comparison condition, so we need to estimate the magnitude of the current tetration. Since tetration grows extremely fast, the values of a and b for which the result stays within (2^63 - 1) can even be enumerated:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;a&lt;/th&gt;
&lt;th&gt;b&lt;/th&gt;
&lt;th&gt;T(a, b)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;65536&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;7625597484987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;437893890380859375&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can easily compute these numbers and compare them with the modulus, while all others are guaranteed to exceed the modulus.&lt;/p&gt;
&lt;h4&gt;Large Integer Factorization&lt;/h4&gt;
&lt;p&gt;Given the range of $m$, I first precompute a table of all primes up to 1e6 and factor out all such prime factors from m. Let the remaining number be $m&apos;$. If $m&apos;$ is composite, it must be the product of two primes $p, q \gt 1000000$.&lt;/p&gt;
&lt;p&gt;If $m&apos; \le 1e12$, then $m&apos;$ must be prime. Otherwise, we first perform a primality test on $m&apos;$ (Miller-Rabin), and if it is composite, we use Pollard-Rho to factor it -- once factored, the result must be two primes.&lt;/p&gt;
&lt;h4&gt;64-bit Modular Multiplication Trick&lt;/h4&gt;
&lt;p&gt;Both GCC and Clang provide the built-in type &lt;code&gt;__int128_t&lt;/code&gt;, which supports 128-bit arithmetic. If running on an Intel x86_64 architecture, you can directly use the following assembly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// requires a and b already reduced mod m; otherwise the divq
// quotient can exceed 64 bits and raise SIGFPE
long mulmod(long a, long b, long m) {
    long res;
    asm(&quot;mulq %2; divq %3&quot; : &quot;=d&quot;(res), &quot;+a&quot;(a) : &quot;S&quot;(b), &quot;c&quot;(m));
    return res;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Algorithm Flow&lt;/h4&gt;
&lt;p&gt;$T(a, b, m)$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If a == 1 or b == 1, return a % m&lt;/li&gt;
&lt;li&gt;If m == 1, return 0&lt;/li&gt;
&lt;li&gt;Factorize m and compute the Euler&apos;s totient function $\phi(m)$&lt;/li&gt;
&lt;li&gt;Recursively compute $e = T(a, b - 1, \phi(m))$&lt;/li&gt;
&lt;li&gt;Estimate $T(a, b - 1)$ and compare with $\phi(m)$. If $T(a, b - 1) \ge \phi(m)$, set $e = e + \phi(m)$&lt;/li&gt;
&lt;li&gt;Compute $a ^ e % m$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The entire flow ensures that all operands stay within 64 bits, with a maximum recursion depth of min(128, b - 1).&lt;/p&gt;
&lt;p&gt;The algorithm code is hosted at https://github.com/arkbriar/hackerrank-projecteuler/blob/master/cpp/188.cc .&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm&lt;/p&gt;
</content:encoded></item><item><title>Euler&apos;s Theorem Extension for Non-coprime Cases</title><link>https://blog.crazyark.xyz/zh/posts/euler_theorem_for_noncoprime/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/euler_theorem_for_noncoprime/</guid><pubDate>Wed, 23 Aug 2017 14:02:59 GMT</pubDate><content:encoded>&lt;p&gt;I just encountered a terrifying problem: computing iterated exponentiation (tetration) modulo some number quickly under an extremely large range, where the modulus is between 1 and 1e18.&lt;/p&gt;
&lt;p&gt;OK, the approach for this problem is actually quite clear -- use Euler&apos;s theorem to reduce the exponent. But the hardest parts are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Factorizing m into prime factors&lt;/li&gt;
&lt;li&gt;Using the extended form of Euler&apos;s theorem for non-coprime cases&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&apos;s set aside the prime factorization of m for now (actually still unsolved)... and focus on the second problem first.&lt;/p&gt;
&lt;h3&gt;Euler Theorem &amp;amp; Fermat&apos;s Little/Last Theorem&lt;/h3&gt;
&lt;p&gt;Let m be an integer greater than 1, and $(a, m) = 1$. Then&lt;/p&gt;
&lt;p&gt;$$a^{\phi(m)} \equiv 1 \ (\textrm{mod}\ m)$$&lt;/p&gt;
&lt;p&gt;Here $\phi(m)$ is Euler&apos;s totient function. If $m = p_1^{k_1}p_2^{k_2}\cdots p_s^{k_s}$, then $\phi(m) = m(1 - \frac{1}{p_1})\cdots (1 - \frac{1}{p_s})$.&lt;/p&gt;
&lt;p&gt;Here we introduce the concept of a reduced residue system. For an integer m, the reduced residue system is the set $R(m) = {r | 0 \le r \lt m, (r, m) = 1}$, i.e., the set of all residues modulo m that are coprime to m.&lt;/p&gt;
&lt;p&gt;Clearly $|R(m)| = \phi(m)$.&lt;/p&gt;
&lt;h4&gt;Proof of Euler&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;This proof should be quite well-known. I&apos;ll write it out anyway -- it&apos;s also quite simple. For integers a and m with $(a, m) = 1$, let the reduced residue system of m be $R(m) = {r_1, r_2, ..., r_{\phi(m)}}$.&lt;/p&gt;
&lt;p&gt;$(ar_1)(ar_2)\cdots(ar_{\phi(m)}) \equiv a^{\phi(m)}(r_1r_2\cdots r_{\phi(m)})\ (\textrm{mod}\ m)$&lt;/p&gt;
&lt;p&gt;Since $(a, m) = 1$, we obviously have:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$ar_i \not\equiv 0 \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;li&gt;$ar_i \not\equiv ar_j \ (\textrm{mod} \ m), i \ne j$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore $b_i = (ar_i\ \textrm{mod} \ m)$ also forms a reduced residue system of m.&lt;/p&gt;
&lt;p&gt;So $\prod_{i = 1}^{i = \phi(m)} b_i \equiv \prod_{i = 1}^{i = \phi(m)} r_i \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Thus $\prod_{i = 1}^{i = \phi(m)} r_i \equiv a^{\phi(m)}\prod_{i = 1}^{i = \phi(m)} r_i \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Then $(a^{\phi(m)} - 1)\prod_{i = 1}^{i = \phi(m)} r_i \equiv 0 \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Since $(r_i, m) = 1$, it is clear that $a^{\phi(m)} - 1 \equiv 0 \ (\textrm{mod} \ m)$.&lt;/p&gt;
&lt;p&gt;Q.E.D.&lt;/p&gt;
&lt;h4&gt;Fermat&apos;s Little Theorem&lt;/h4&gt;
&lt;p&gt;This is a special case of Euler&apos;s theorem.&lt;/p&gt;
&lt;h3&gt;Euler Theorem for Non-coprime&lt;/h3&gt;
&lt;p&gt;First, let us state the conclusion. Let a and m be positive integers. We have $a^{k\phi(m) + b} \equiv a^{\phi(m) + b} \ (\textrm{mod} \ m), k \in N^+$. Clearly we only need to consider the case $(a, m) \ne 1$.&lt;/p&gt;
&lt;p&gt;We prove the above conclusion by proving two weaker results.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If p is a prime factor of m, i.e., $(p, m) = p$, then $p^{2\phi(m)} \equiv p^{\phi(m)} \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;li&gt;$a^{2\phi(m)} \equiv a^{\phi(m)} \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let $m = p_1^{k_1}p_2^{k_2}\cdots p_s^{k_s}$.&lt;/p&gt;
&lt;h4&gt;Weakened Result 1 (p, m)&lt;/h4&gt;
&lt;p&gt;Without loss of generality, let $p = p_1$, and set $m&apos; = m \ / \ p_1^{k_1}, k = k_1$. Clearly $(p^{k}, m&apos;) = 1$.&lt;/p&gt;
&lt;p&gt;It is easy to see that $p^{2\phi(m)} - p^{\phi(m)} \equiv 0 \ (\textrm{mod} \ m&apos;)$.&lt;/p&gt;
&lt;p&gt;Also, $p^{2\phi(m)} - p^{\phi(m)} \equiv 0 \ (\textrm{mod} \ p^k)$, because $\phi(m) \ge p^{k - 1}(p - 1) \ge k$ always holds.&lt;/p&gt;
&lt;p&gt;By the Chinese Remainder Theorem, the solution to $(p^{2\phi(m)} - p^{\phi(m)}) \ (\textrm{mod} \ p^k\cdot m&apos;)$ exists and is unique, and here it is clearly 0.&lt;/p&gt;
&lt;p&gt;Therefore, $p^{2\phi(m)} \equiv p^{\phi(m)} \ (\textrm{mod} \ m)$.&lt;/p&gt;
&lt;h4&gt;Weakened Result 2 (a, m)&lt;/h4&gt;
&lt;p&gt;Assume $(a, m) \ne 1$, and that a prime p satisfies $p | (a,m)$. As before, without loss of generality let $p = p_1, k = k_1$, and define $a = pa_1, m = p^km_1$. We have:&lt;/p&gt;
&lt;p&gt;$a^{2\phi(m)} - a^{\phi(m)} = p^{2\phi(m)}a_1^{2\phi(m)} - p^{\phi(m)}a_1^{\phi(m)}$&lt;/p&gt;
&lt;p&gt;$= p^{2\phi(m)}a_1^{2\phi(m)} -  p^{\phi(m)}a_1^{2\phi(m)} + p^{\phi(m)}a_1^{2\phi(m)} - p^{\phi(m)}a_1^{\phi(m)}$&lt;/p&gt;
&lt;p&gt;Therefore $a^{2\phi(m)} - a^{\phi(m)} = (p^{2\phi(m)} - p^{\phi(m)})a_1^{2\phi(m)} + p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)})$&lt;/p&gt;
&lt;p&gt;$\equiv p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)}) \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;So we need to prove $p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)}) \equiv 0\ (\textrm{mod} \ m)$, which clearly reduces to $a_1^{2\phi(m)} -a_1^{\phi(m)} \equiv 0\ (\textrm{mod} \ m_1)$.&lt;/p&gt;
&lt;p&gt;If $(a_1, m_1) = 1$, the result follows immediately. Otherwise $a_1 &amp;lt; a$, and we can always find a prime $q | (a_1, m_1)$, then descend. Since $a_1$ must be an integer, we have $a_1 \ge 1$, providing a lower bound for the descent, and $(1, m_t) = 1$ always holds.&lt;/p&gt;
&lt;p&gt;Thus proved.&lt;/p&gt;
&lt;h4&gt;Final Conclusion&lt;/h4&gt;
&lt;p&gt;From Weakened Result 2, it is easy to see that $a^{k\phi(m)} \equiv a^{\phi(m)} \ (\textrm{mod} \ m), k \in N^+$, and $a^{k\phi(m) + b} \equiv a^{\phi(m) + b} \ (\textrm{mod} \ m), k \in N^+, b \in N, b \le m$. Q.E.D.&lt;/p&gt;
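&lt;p&gt;The conclusion is easy to spot-check by brute force for small values. A quick verification sketch (trial-division totient and helper names are mine):&lt;/p&gt;

```cpp
#include <cstdint>

typedef unsigned long long u64;

// Euler's totient via trial division.
static u64 phi(u64 m) {
    u64 r = m;
    for (u64 p = 2; p * p <= m; ++p)
        if (m % p == 0) {
            while (m % p == 0) m /= p;
            r -= r / p;
        }
    if (m > 1) r -= r / m;
    return r;
}

// Plain square-and-multiply; safe here because m stays small.
static u64 powmod(u64 a, u64 e, u64 m) {
    u64 r = 1 % m;
    for (a %= m; e; e >>= 1) {
        if (e & 1) r = r * a % m;
        a = a * a % m;
    }
    return r;
}

// Check a^{k*phi(m)+b} == a^{phi(m)+b} (mod m) for all small a, m, k, b,
// with no coprimality requirement on a and m.
bool check_extended_euler(u64 max_a, u64 max_m) {
    for (u64 a = 1; a <= max_a; ++a)
        for (u64 m = 2; m <= max_m; ++m) {
            u64 ph = phi(m);
            for (u64 k = 1; k <= 3; ++k)
                for (u64 b = 0; b <= 5; ++b)
                    if (powmod(a, k * ph + b, m) != powmod(a, ph + b, m))
                        return false;
        }
    return true;
}
```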
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Proving this took me quite a while. Clearly I&apos;ve forgotten all my number theory =_=.&lt;/p&gt;
&lt;p&gt;And there&apos;s still large integer factorization giving me a headache...&lt;/p&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] &quot;Elementary Number Theory&quot; (3rd edition), Higher Education Press&lt;/p&gt;
</content:encoded></item><item><title>Euler&apos;s Theorem Extension for Non-coprime Cases (Euler Theorem for Non-coprime)</title><link>https://blog.crazyark.xyz/zh/posts/euler_theorem_for_noncoprime/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/euler_theorem_for_noncoprime/</guid><pubDate>Wed, 23 Aug 2017 14:02:59 GMT</pubDate><content:encoded>&lt;p&gt;I just encountered a terrifying problem: computing iterated exponentiation (tetration) modulo some number quickly under an extremely large range, where the modulus is between 1 and 1e18.&lt;/p&gt;
&lt;p&gt;OK, the approach for this problem is actually quite clear -- use Euler&apos;s theorem to reduce the exponent. But the hardest parts are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Factorizing m into prime factors&lt;/li&gt;
&lt;li&gt;Using the extended form of Euler&apos;s theorem for non-coprime cases&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&apos;s set aside the prime factorization of m for now (actually still unsolved)... and focus on the second problem first.&lt;/p&gt;
&lt;h3&gt;Euler Theorem &amp;amp; Fermat&apos;s Little/Last Theorem&lt;/h3&gt;
&lt;p&gt;Let m be an integer greater than 1, and $(a, m) = 1$. Then&lt;/p&gt;
&lt;p&gt;$$a^{\phi(m)} \equiv 1 \ (\textrm{mod}\ m)$$&lt;/p&gt;
&lt;p&gt;Here $\phi(m)$ is Euler&apos;s totient function. If $m = p_1^{k_1}p_2^{k_2}\cdots p_s^{k_s}$, then $\phi(m) = m(1 - \frac{1}{p_1})\cdots (1 - \frac{1}{p_s})$.&lt;/p&gt;
&lt;p&gt;Here we introduce the concept of a reduced residue system. For an integer m, the reduced residue system is the set $R(m) = {r | 0 \le r \lt m, (r, m) = 1}$, i.e., the set of all residues modulo m that are coprime to m.&lt;/p&gt;
&lt;p&gt;Clearly $|R(m)| = \phi(m)$.&lt;/p&gt;
&lt;h4&gt;Proof of Euler&apos;s Theorem&lt;/h4&gt;
&lt;p&gt;This proof should be quite well-known. I&apos;ll write it out anyway -- it&apos;s also quite simple. For integers a and m with $(a, m) = 1$, let the reduced residue system of m be $R(m) = {r_1, r_2, ..., r_{\phi(m)}}$.&lt;/p&gt;
&lt;p&gt;$(ar_1)(ar_2)\cdots(ar_{\phi(m)}) \equiv a^{\phi(m)}(r_1r_2\cdots r_{\phi(m)})\ (\textrm{mod}\ m)$&lt;/p&gt;
&lt;p&gt;Since $(a, m) = 1$, we obviously have:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$ar_i \not\equiv 0 \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;li&gt;$ar_i \not\equiv ar_j \ (\textrm{mod} \ m), i \ne j$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore $b_i = (ar_i\ \textrm{mod} \ m)$ also forms a reduced residue system of m.&lt;/p&gt;
&lt;p&gt;So $\prod_{i = 1}^{i = \phi(m)} b_i \equiv \prod_{i = 1}^{i = \phi(m)} r_i \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Thus $\prod_{i = 1}^{i = \phi(m)} r_i \equiv a^{\phi(m)}\prod_{i = 1}^{i = \phi(m)} r_i \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Then $(a^{\phi(m)} - 1)\prod_{i = 1}^{i = \phi(m)} r_i \equiv 0 \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;Since $(r_i, m) = 1$, it is clear that $a^{\phi(m)} - 1 \equiv 0 \ (\textrm{mod} \ m)$.&lt;/p&gt;
&lt;p&gt;Q.E.D.&lt;/p&gt;
&lt;h4&gt;Fermat&apos;s Little Theorem&lt;/h4&gt;
&lt;p&gt;This is a special case of Euler&apos;s theorem.&lt;/p&gt;
&lt;h3&gt;Euler Theorem for Non-coprime&lt;/h3&gt;
&lt;p&gt;First, let us state the conclusion. Let a and m be positive integers. We have $a^{k\phi(m) + b} \equiv a^{\phi(m) + b} \ (\textrm{mod} \ m), k \in N^+$. Clearly we only need to consider the case $(a, m) \ne 1$.&lt;/p&gt;
&lt;p&gt;We prove the above conclusion by proving two weaker results.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If p is a prime factor of m, i.e., $(p, m) = p$, then $p^{2\phi(m)} \equiv p^{\phi(m)} \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;li&gt;$a^{2\phi(m)} \equiv a^{\phi(m)} \ (\textrm{mod} \ m)$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let $m = p_1^{k_1}p_2^{k_2}\cdots p_s^{k_s}$.&lt;/p&gt;
&lt;h4&gt;Weakened Result 1 (p, m)&lt;/h4&gt;
&lt;p&gt;Without loss of generality, let $p = p_1$, and set $m&apos; = m \ / \ p_1^{k_1}, k = k_1$. Clearly $(p^{k}, m&apos;) = 1$.&lt;/p&gt;
&lt;p&gt;It is easy to see that $p^{2\phi(m)} - p^{\phi(m)} \equiv 0 \ (\textrm{mod} \ m&apos;)$.&lt;/p&gt;
&lt;p&gt;Also, $p^{2\phi(m)} - p^{\phi(m)} \equiv 0 \ (\textrm{mod} \ p^k)$, because $\phi(m) \ge p^{k - 1}(p - 1) \ge k$ always holds.&lt;/p&gt;
&lt;p&gt;By the Chinese Remainder Theorem, the solution to $(p^{2\phi(m)} - p^{\phi(m)}) \ (\textrm{mod} \ p^k\cdot m&apos;)$ exists and is unique, and here it is clearly 0.&lt;/p&gt;
&lt;p&gt;Therefore, $p^{2\phi(m)} \equiv p^{\phi(m)} \ (\textrm{mod} \ m)$.&lt;/p&gt;
&lt;h4&gt;Weakened Result 2 (a, m)&lt;/h4&gt;
&lt;p&gt;Assume $(a, m) \ne 1$, and that a prime p satisfies $p | (a,m)$. As before, without loss of generality let $p = p_1, k = k_1$, and define $a = pa_1, m = p^km_1$. We have:&lt;/p&gt;
&lt;p&gt;$a^{2\phi(m)} - a^{\phi(m)} = p^{2\phi(m)}a_1^{2\phi(m)} - p^{\phi(m)}a_1^{\phi(m)}$&lt;/p&gt;
&lt;p&gt;$= p^{2\phi(m)}a_1^{2\phi(m)} -  p^{\phi(m)}a_1^{2\phi(m)} + p^{\phi(m)}a_1^{2\phi(m)} - p^{\phi(m)}a_1^{\phi(m)}$&lt;/p&gt;
&lt;p&gt;Therefore $a^{2\phi(m)} - a^{\phi(m)} = (p^{2\phi(m)} - p^{\phi(m)})a_1^{2\phi(m)} + p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)})$&lt;/p&gt;
&lt;p&gt;$\equiv p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)}) \ (\textrm{mod} \ m)$&lt;/p&gt;
&lt;p&gt;So we need to prove $p^{\phi(m)}(a_1^{2\phi(m)} -a_1^{\phi(m)}) \equiv 0\ (\textrm{mod} \ m)$, which clearly reduces to $a_1^{2\phi(m)} -a_1^{\phi(m)} \equiv 0\ (\textrm{mod} \ m_1)$.&lt;/p&gt;
&lt;p&gt;If $(a_1, m_1) = 1$, the result follows immediately. Otherwise $a_1 &amp;lt; a$, and we can always find a prime $q | (a_1, m_1)$, then descend. Since $a_1$ must be an integer, we have $a_1 \ge 1$, providing a lower bound for the descent, and $(1, m_t) = 1$ always holds.&lt;/p&gt;
&lt;p&gt;Thus proved.&lt;/p&gt;
&lt;h4&gt;Final Conclusion&lt;/h4&gt;
&lt;p&gt;From Weakened Result 2, it is easy to see that $a^{k\phi(m)} \equiv a^{\phi(m)} \ (\textrm{mod} \ m), k \in N^+$, and $a^{k\phi(m) + b} \equiv a^{\phi(m) + b} \ (\textrm{mod} \ m), k \in N^+, b \in N, b \le m$. Q.E.D.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Proving this took me quite a while. Clearly I&apos;ve forgotten all my number theory =_=.&lt;/p&gt;
&lt;p&gt;And there&apos;s still large integer factorization giving me a headache...&lt;/p&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;[1] &quot;Elementary Number Theory&quot; (3rd edition), Higher Education Press&lt;/p&gt;
</content:encoded></item><item><title>Polynomial Hash and Thue-Morse Sequence</title><link>https://blog.crazyark.xyz/zh/posts/transaction_certificates/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/transaction_certificates/</guid><pubDate>Mon, 21 Aug 2017 15:58:42 GMT</pubDate><content:encoded>&lt;p&gt;A couple days ago I was working through a &lt;a href=&quot;https://www.hackerrank.com/contests/gs-codesprint/challenges&quot;&gt;Contest&lt;/a&gt; on Hackerrank. We were given two days, and I didn&apos;t expect to get completely stuck on the last problem. Here I document my thought process and takeaways.&lt;/p&gt;
&lt;p&gt;The original problem is at https://www.hackerrank.com/contests/gs-codesprint/challenges/transaction-certificates, so I won&apos;t restate it here.&lt;/p&gt;
&lt;h3&gt;Thought Process&lt;/h3&gt;
&lt;p&gt;Essentially, the problem asks you to hash a transaction chain and produce two different transaction chains that result in the same hash (a hash collision).&lt;/p&gt;
&lt;p&gt;Some potentially useful conditions: p is a prime number, and m is a power of 2.&lt;/p&gt;
&lt;p&gt;I won&apos;t prove the existence of a solution -- the pigeonhole principle makes it straightforward.&lt;/p&gt;
&lt;h4&gt;Approach 1&lt;/h4&gt;
&lt;p&gt;By subtracting two hashes, the problem can be transformed into: $\sum\limits_{i = 0}^{n - 1}a_i \cdot p^{n - 1 - i}  \equiv 0 \ (mod\ m), a_i \in (-k, k)$, with the constraint that it cannot be the trivial solution where all $a_i$ are 0.&lt;/p&gt;
&lt;p&gt;When $k \gt p$, let $a_{n - 1} = p % m, a_{n - 2} = -1$, with all others being 0, and we can easily construct the original solution.&lt;/p&gt;
&lt;p&gt;But when $k \le p$, the only approach I could think of was brute-force search (searching the solution space of the multivariate linear congruence).&lt;/p&gt;
&lt;h4&gt;Approach 2&lt;/h4&gt;
&lt;p&gt;By observation, $a_0a_1...a_{n-1}$ forms a base-p representation of a number, and for a positive integer $x$, $tm + x, t \in N$ has the same remainder modulo m as x.&lt;/p&gt;
&lt;p&gt;For the base-p representation of $x$, we can uniquely determine a transaction chain (by adding 1 to each digit), and it will have the same hash as the transaction chain formed by $tm + x$.&lt;/p&gt;
&lt;p&gt;However, due to the constraint on transaction chain indices, this can only give a fast solution when $k\ge p$.&lt;/p&gt;
&lt;p&gt;For the case $k \lt p$, I could still only think of brute-force search (searching through $tm$), but combined with the fast construction for the $k &amp;gt; p$ case above, I managed to score 44 points.&lt;/p&gt;
&lt;p&gt;But there were two or three TLE cases, so this approach clearly wouldn&apos;t work.&lt;/p&gt;
&lt;h3&gt;Editorial&lt;/h3&gt;
&lt;p&gt;In the author&apos;s editorial, a very fast construction method is given using the Thue-Morse Sequence, primarily exploiting the property that m is a power of 2. In fact, p only needs to be odd (not necessarily prime).&lt;/p&gt;
&lt;p&gt;Define two sequences $f(N)$ and $g(N)$ as follows:&lt;/p&gt;
&lt;p&gt;$f(1) = [1]$&lt;/p&gt;
&lt;p&gt;$g(1) = [0]$&lt;/p&gt;
&lt;p&gt;$f(N) = f(N - 1) \oplus g(N - 1)$&lt;/p&gt;
&lt;p&gt;$g(N) = g(N - 1) \oplus f(N - 1)$&lt;/p&gt;
&lt;p&gt;Here $\oplus$ denotes the concatenation of two sequences.&lt;/p&gt;
&lt;p&gt;Define the certificate computation function as $H(A, m), A = [a_0, a_1, ..., a_{n - 1}]$.&lt;/p&gt;
&lt;p&gt;Clearly, N is always a power of 2. Let $N = 2^n$. Then there exists a sufficiently large $N$ such that $H(f(N), m) = H(g(N), m)$, where m must be a power of 2.&lt;/p&gt;
&lt;p&gt;The most interesting part is the proof. Through the proof, we can see that for $m \le 2^{32}$, such an $N$ is approximately $\sqrt{m}$.&lt;/p&gt;
&lt;p&gt;Here is the proof:&lt;/p&gt;
&lt;p&gt;Clearly, we know that $\overline{f(N)} = g(N)$.&lt;/p&gt;
&lt;p&gt;Let $T = H(f(N), m) - H(g(N), m)$. Clearly:&lt;/p&gt;
&lt;p&gt;$T = p^{2^n - 1} - p^{2^n - 2} - p^{2^n - 3} + p^{2^n - 4} - ... \pm 1$&lt;/p&gt;
&lt;p&gt;Then $T = (p - 1)(p^2 - 1)\cdots(p^{2^{n-1}} - 1)$.&lt;/p&gt;
&lt;p&gt;OK, one last piece of the puzzle: if $2^s|(p - 1)$, then $2^{s + 1} |(p^2 - 1)$. The proof is simple, so I&apos;ll skip it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note: the editorial overlooked the case p = 2, but this case is trivial, so it was simply ignored... and it wasn&apos;t in the test cases either.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Since p is prime and therefore odd, $2 | (p - 1)$, and it follows that $2^n | (p^{2^n} - 1)$.&lt;/p&gt;
&lt;p&gt;Therefore $2^{n(n - 1)/2} | T$, and as long as $n(n-1)/2 \ge x$ where $m = 2^x$, we have $m | T$.&lt;/p&gt;
&lt;p&gt;As you can see, the convergence of $N$ is remarkably fast.&lt;/p&gt;
&lt;p&gt;Once $f$ and $g$ are obtained, a little processing yields the answer. Since the answer is required to be a multiple of $n$ in length, the complexity is $O(n)$.&lt;/p&gt;
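&lt;p&gt;The construction is easy to verify directly: build $f(N)$ and $g(N)$ from the recurrence and compare their hashes modulo a power of 2. A sketch over 0/1 sequences (the actual problem maps digits to transactions slightly differently; function names are mine):&lt;/p&gt;

```cpp
#include <cstdint>
#include <vector>

typedef unsigned long long u64;

// Polynomial hash H(A, m) = sum a_i * p^{n-1-i} mod m, via Horner's rule.
u64 H(const std::vector<int>& a, u64 p, u64 m) {
    u64 h = 0;
    for (int x : a) h = (h * p + x) % m;
    return h;
}

// Build f(2^n) and g(2^n) by the concatenation recurrence,
// starting from f = [1] and g = [0].
void build(int n, std::vector<int>& f, std::vector<int>& g) {
    f = {1};
    g = {0};
    for (int i = 0; i < n; ++i) {
        std::vector<int> nf = f, ng = g;
        nf.insert(nf.end(), g.begin(), g.end());  // f <- f (+) g
        ng.insert(ng.end(), f.begin(), f.end());  // g <- g (+) f
        f = nf;
        g = ng;
    }
}
```

&lt;p&gt;With $p = 3$ and $m = 2^{10}$, $n = 5$ already satisfies $n(n-1)/2 \ge 10$, and the two length-32 sequences indeed hash identically modulo 1024.&lt;/p&gt;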
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;The certificate computation in this problem is essentially Polynomial Hash, which, like the Thue-Morse Sequence, has some very interesting properties worth exploring further.&lt;/p&gt;
&lt;p&gt;When I saw the condition that m is a power of 2, my first reaction was that it must be exploitable somehow. I kept trying to apply Fermat&apos;s Last Theorem but could never make it work. Since the solution space is enormous, it must be a constructive solution -- I just couldn&apos;t figure out how to leverage both &quot;p is prime&quot; and &quot;m is a power of 2.&quot; As you can see in my approaches, I consistently failed to make use of these two conditions.&lt;/p&gt;
&lt;p&gt;The editorial&apos;s solution is truly elegant -- it leans much more toward pure mathematics.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;http://codeforces.com/blog/entry/4898&lt;/li&gt;
&lt;li&gt;https://www.hackerrank.com/contests/gs-codesprint/challenges/transaction-certificates/editorial&lt;/li&gt;
&lt;li&gt;https://www.wikiwand.com/en/Thue%E2%80%93Morse_sequence&lt;/li&gt;
&lt;/ol&gt;
</content:encoded></item><item><title>Polynomial Hash and Thue-Morse Sequence</title><link>https://blog.crazyark.xyz/zh/posts/transaction_certificates/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/transaction_certificates/</guid><pubDate>Mon, 21 Aug 2017 15:58:42 GMT</pubDate><content:encoded>&lt;p&gt;A couple days ago I was working through a &lt;a href=&quot;https://www.hackerrank.com/contests/gs-codesprint/challenges&quot;&gt;Contest&lt;/a&gt; on Hackerrank. We were given two days, and I didn&apos;t expect to get completely stuck on the last problem. Here I document my thought process and takeaways.&lt;/p&gt;
&lt;p&gt;The original problem is at https://www.hackerrank.com/contests/gs-codesprint/challenges/transaction-certificates, so I won&apos;t restate it here.&lt;/p&gt;
&lt;h3&gt;Thought Process&lt;/h3&gt;
&lt;p&gt;Essentially, the problem asks you to hash a transaction chain and produce two different transaction chains that result in the same hash (a hash collision).&lt;/p&gt;
&lt;p&gt;Some potentially useful conditions: p is a prime number, and m is a power of 2.&lt;/p&gt;
&lt;p&gt;I won&apos;t prove the existence of a solution -- the pigeonhole principle makes it straightforward.&lt;/p&gt;
&lt;h4&gt;Approach 1&lt;/h4&gt;
&lt;p&gt;By subtracting two hashes, the problem can be transformed into: $\sum\limits_{i = 0}^{n - 1}a_i \cdot p^{n - 1 - i}  \equiv 0 \ (mod\ m), a_i \in (-k, k)$, with the constraint that it cannot be the trivial solution where all $a_i$ are 0.&lt;/p&gt;
&lt;p&gt;When $k \gt p$, let $a_{n - 1} = p % m, a_{n - 2} = -1$, with all others being 0, and we can easily construct the original solution.&lt;/p&gt;
&lt;p&gt;But when $k \le p$, the only approach I could think of was brute-force search (searching the solution space of the multivariate linear congruence).&lt;/p&gt;
&lt;h4&gt;Approach 2&lt;/h4&gt;
&lt;p&gt;By observation, $a_0a_1...a_{n-1}$ forms a base-p representation of a number, and for a positive integer $x$, $tm + x, t \in N$ has the same remainder modulo m as x.&lt;/p&gt;
&lt;p&gt;For the base-p representation of $x$, we can uniquely determine a transaction chain (by adding 1 to each digit), and it will have the same hash as the transaction chain formed by $tm + x$.&lt;/p&gt;
&lt;p&gt;However, due to the constraint on transaction chain indices, this can only give a fast solution when $k\ge p$.&lt;/p&gt;
&lt;p&gt;For the case $k \lt p$, I could still only think of brute-force search (searching through $tm$), but combined with the fast construction for the $k &amp;gt; p$ case above, I managed to score 44 points.&lt;/p&gt;
&lt;p&gt;But there were two or three TLE cases, so this approach clearly wouldn&apos;t work.&lt;/p&gt;
&lt;h3&gt;Editorial&lt;/h3&gt;
&lt;p&gt;In the author&apos;s editorial, a very fast construction method is given using the Thue-Morse Sequence, primarily exploiting the property that m is a power of 2. In fact, p only needs to be odd (not necessarily prime).&lt;/p&gt;
&lt;p&gt;Define two sequences $f(N)$ and $g(N)$ as follows:&lt;/p&gt;
&lt;p&gt;$f(1) = [1]$&lt;/p&gt;
&lt;p&gt;$g(1) = [0]$&lt;/p&gt;
&lt;p&gt;$f(N) = f(N - 1) \oplus g(N - 1)$&lt;/p&gt;
&lt;p&gt;$g(N) = g(N - 1) \oplus f(N - 1)$&lt;/p&gt;
&lt;p&gt;这里 $\oplus$ 是指级联两个序列。&lt;/p&gt;
&lt;p&gt;定义计算 certificate 的函数为 $H(A, m), A = [a_0, a_1, ..., a_{n - 1}]$。&lt;/p&gt;
&lt;p&gt;显然, N 一直是 2 的幂，令 $N = 2^n$，那么存在一个足够大的 $N$ ，$H(f(N), m) = H(g(N), m)$，这里 m 必须是 2 的幂。&lt;/p&gt;
&lt;p&gt;最有意思的是它的证明，我们可以通过证明知道，在 $m \le 2^{32}$ 的情况下，这样的 $N$ 大约在 $\sqrt{m}$ 左右。&lt;/p&gt;
&lt;p&gt;Here is the proof:&lt;/p&gt;
&lt;p&gt;Clearly, $\overline{f(N)} = g(N)$, where the bar denotes the bitwise complement.&lt;/p&gt;
&lt;p&gt;Let $T = H(f(N), m) - H(g(N), m)$. Clearly,&lt;/p&gt;
&lt;p&gt;$T = p^{2^n - 1} - p^{2^n - 2} - p^{2^n - 3} + p^{2^n - 4} - ... \pm 1$&lt;/p&gt;
&lt;p&gt;Then $T = (p - 1)(p^2 - 1)\cdots(p^{2^{n-1}} - 1)$.&lt;/p&gt;
&lt;p&gt;OK, one last brick: if $2^s|(p - 1)$, then $2^{s + 1}|(p^2 - 1)$. The proof is simple, so I will omit it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The editorial misses the case p = 2, but that case is trivial, so it is simply ignored... and it does not appear in the test cases either.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Since p is a prime (other than 2) and hence odd, $2 | (p - 1)$, and it follows that $2^n | (p^{2^n} - 1)$.&lt;/p&gt;
&lt;p&gt;Hence $2^{n(n - 1)/2} | T$; as long as $n(n-1)/2 \ge x$, where $m = 2^x$, we have $m | T$.&lt;/p&gt;
&lt;p&gt;As you can see, the required $N$ grows astonishingly slowly.&lt;/p&gt;
&lt;p&gt;With $f$ and $g$ in hand, a little post-processing gives the answer. Since the answer is required to have length an integer multiple of $n$, the complexity is $O(n)$.&lt;/p&gt;
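&lt;p&gt;To make the construction concrete, here is a small self-contained sketch of my own (not the editorial&apos;s code): it builds $f$ and $g$ of length $2^9 = 512$ by concatenation and checks that their polynomial hashes collide modulo $2^{32}$, with an arbitrary odd base p of my choosing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;cassert&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// Build f and g by repeated concatenation: f&apos; = f (+) g, g&apos; = g (+) f.
int main() {
    std::vector&amp;lt;int&amp;gt; f = {1}, g = {0};
    for (int step = 0; step &amp;lt; 9; ++step) {   // final length 2^9 = 512
        std::vector&amp;lt;int&amp;gt; nf = f, ng = g;
        nf.insert(nf.end(), g.begin(), g.end());
        ng.insert(ng.end(), f.begin(), f.end());
        f = nf;
        g = ng;
    }
    const uint32_t p = 1000000007u;            // any odd base works
    // polynomial hash; uint32_t arithmetic wraps, i.e. computes mod 2^32
    auto H = [p](const std::vector&amp;lt;int&amp;gt; &amp;amp;a) {
        uint32_t h = 0;
        for (int d : a) h = h * p + (uint32_t)d;
        return h;
    };
    assert(H(f) == H(g));   // two distinct sequences, one hash
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;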
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;The Certificate in this problem is computed exactly as a Polynomial Hash. Like the Thue-Morse Sequence, it has some very interesting properties, which I will explore bit by bit later.&lt;/p&gt;
&lt;p&gt;Upon seeing the condition that m is a power of 2, my first reaction was that it must be exploitable, and I kept trying to apply Fermat&apos;s Little Theorem to it, to no avail. Given the huge solution space, the problem clearly called for constructing a solution, but I just could not figure out how to use the facts that p is prime and m is a power of 2 -- as you can see, my ideas never managed to exploit either one.&lt;/p&gt;
&lt;p&gt;The editorial&apos;s solution is simply exquisite, leaning much more toward mathematics.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;http://codeforces.com/blog/entry/4898&lt;/li&gt;
&lt;li&gt;https://www.hackerrank.com/contests/gs-codesprint/challenges/transaction-certificates/editorial&lt;/li&gt;
&lt;li&gt;https://www.wikiwand.com/en/Thue%E2%80%93Morse_sequence&lt;/li&gt;
&lt;/ol&gt;
</content:encoded></item><item><title>Stoer-Wagner Algorithm -- Global Min-Cut in Undirected Weighted Graphs</title><link>https://blog.crazyark.xyz/zh/posts/stoer_wagner_al/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/stoer_wagner_al/</guid><pubDate>Sat, 05 Aug 2017 14:14:01 GMT</pubDate><content:encoded>&lt;p&gt;I recently came across a problem that required finding the global minimum cut of a graph. Unfortunately, graph theory was never my strong suit -- the only algorithm I could recall was Ford-Fulkerson for computing s-t max-flow/min-cut. I thought to myself, surely I can&apos;t just run max-flow $n^2$ times, so I eventually turned to Wikipedia for help.&lt;/p&gt;
&lt;h2&gt;Stoer-Wagner Algorithm&lt;/h2&gt;
&lt;p&gt;The Stoer-Wagner algorithm computes the global minimum cut in an &lt;strong&gt;undirected graph with non-negative edge weights&lt;/strong&gt;. It was proposed in 1995 by Mechthild Stoer and Frank Wagner.&lt;/p&gt;
&lt;p&gt;The core idea of the algorithm is to repeatedly merge vertices from an s-t minimum cut to reduce the size of the graph, until only two merged vertices remain.&lt;/p&gt;
&lt;h3&gt;Algorithm Procedure&lt;/h3&gt;
&lt;p&gt;In graph G, let C = {S, T} be the global minimum cut of G. Then for any pair of vertices (s, t) in G:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Either s and t &lt;strong&gt;belong to different sides&lt;/strong&gt; S and T&lt;/li&gt;
&lt;li&gt;Or s and t &lt;strong&gt;belong to the same side&lt;/strong&gt; S or T&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Stoer-Wagner first finds an s-t minimum cut, then merges s and t:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Remove the edge between s and t&lt;/li&gt;
&lt;li&gt;For any vertex v adjacent to s or t, the weight of the edge to the new merged vertex is w(v, s) + w(v, t) (assuming weight 0 if not adjacent)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Clearly, for any cut where s and t are in the same set, merging s and t does not affect the cut weight.&lt;/p&gt;
&lt;p&gt;The above procedure (finding a minimum cut, then merging) constitutes one phase. After performing |V| - 1 such phases, only two vertices remain in G. The global minimum cut is then the minimum among &lt;strong&gt;the s-t minimum cuts from each phase&lt;/strong&gt; and &lt;strong&gt;the cut between the final two vertices&lt;/strong&gt;. This completes the algorithm.&lt;/p&gt;
&lt;p&gt;As you may have noticed, the key to making the algorithm efficient is how to find an s-t minimum cut. This is precisely where the elegance of the Stoer-Wagner algorithm lies.&lt;/p&gt;
&lt;p&gt;Since the algorithm has no specific requirements for the endpoints of the s-t minimum cut, any minimum cut will do. The procedure is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick an arbitrary vertex $u$ from graph G and add it to set $A$&lt;/li&gt;
&lt;li&gt;Select a vertex $v$ from $V - A$ that maximizes $w(A,v)$, and add it to $A$; where $w(A,v) = \sum\limits_{u \in A}w(u,v)$&lt;/li&gt;
&lt;li&gt;Repeat &lt;strong&gt;Step 2&lt;/strong&gt; until $V = A$&lt;/li&gt;
&lt;li&gt;Let the last two vertices added to $A$ be $s$ and $t$; merge them&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the above procedure, $\left(A - \{s\}, \{t\}\right)$ is an s-t minimum cut.&lt;/p&gt;
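&lt;p&gt;As a concrete companion to the procedure above, here is a minimal illustrative sketch of the whole algorithm on an adjacency matrix (my own code, assuming non-negative weights and at least two vertices): each pass of the outer loop is one phase -- build the maximum-adjacency order, record the cut-of-the-phase $w(A_t, t)$, then merge $s$ and $t$.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;climits&amp;gt;
#include &amp;lt;vector&amp;gt;
using namespace std;

// O(|V|^3) Stoer-Wagner; returns the weight of the global minimum cut.
int global_min_cut(vector&amp;lt;vector&amp;lt;int&amp;gt;&amp;gt; w) {
    int n = w.size(), best = INT_MAX;
    vector&amp;lt;int&amp;gt; v(n);                 // surviving (merged) vertices
    for (int i = 0; i &amp;lt; n; ++i) v[i] = i;

    while (n &amp;gt; 1) {
        // maximum-adjacency order, starting from v[0]
        vector&amp;lt;bool&amp;gt; added(n, false);
        vector&amp;lt;int&amp;gt; wsum(n);          // wsum[i] = w(A, v[i])
        added[0] = true;
        for (int i = 0; i &amp;lt; n; ++i) wsum[i] = w[v[0]][v[i]];
        int s = 0, t = 0;
        for (int it = 1; it &amp;lt; n; ++it) {
            int sel = -1;
            for (int i = 0; i &amp;lt; n; ++i)
                if (!added[i] &amp;amp;&amp;amp; (sel == -1 || wsum[i] &amp;gt; wsum[sel])) sel = i;
            added[sel] = true;
            s = t;
            t = sel;
            for (int i = 0; i &amp;lt; n; ++i)
                if (!added[i]) wsum[i] += w[v[sel]][v[i]];
        }
        best = min(best, wsum[t]);     // cut-of-the-phase (A - {s}, {t})
        for (int i = 0; i &amp;lt; n; ++i) { // merge t into s
            w[v[s]][v[i]] += w[v[t]][v[i]];
            w[v[i]][v[s]] += w[v[i]][v[t]];
        }
        v.erase(v.begin() + t);
        --n;
    }
    return best;
}
&lt;/code&gt;&lt;/pre&gt;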
&lt;h3&gt;Proof of Correctness&lt;/h3&gt;
&lt;p&gt;To prove the correctness of the algorithm, we only need to show that each phase finds a valid s-t minimum cut.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Symbol Definitions&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Graph $G(V,E,W)$&lt;/li&gt;
&lt;li&gt;$s$, $t$ are the last two vertices added to set $A$&lt;/li&gt;
&lt;li&gt;Let $C = (X, \overline{X})$ be any s-t cut; without loss of generality, let $t \in X$&lt;/li&gt;
&lt;li&gt;$A_v$ is the set $A$ before vertex $v$ is added&lt;/li&gt;
&lt;li&gt;$C_v$ is the cut induced by $C$ on the subgraph of $G$ formed by vertices $A_v \cup \{v\}$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, since $A_t \cup \{t\} = V$, we have $C_t = C$, and we only &lt;strong&gt;need to prove&lt;/strong&gt; that $w(A_t, t) \le w(C) = w(C_t)$.&lt;/p&gt;
&lt;p&gt;We will prove by induction that $\forall u \in X$, $w(A_u, u) \le w(C_u)$.&lt;/p&gt;
&lt;p&gt;Let $u_0$ be the first vertex in $X$ added to $A$. By definition, we have $w(A_{u_0}, u_0) = w(C_{u_0})$.&lt;/p&gt;
&lt;p&gt;Assume that for two consecutive vertices $u, v$ added to $A$ that both belong to $X$, the induction hypothesis gives $w(A_u,u) \le w(C_u)$.&lt;/p&gt;
&lt;p&gt;First, we have $w(A_v, v) = w(A_u, v) + w(A_v - A_u, v)$.&lt;/p&gt;
&lt;p&gt;Since $u$ was selected before $v$, we know $w(A_u, u) \ge w(A_u, v)$, so:&lt;/p&gt;
&lt;p&gt;$w(A_v, v) \le w(A_u, u) + w(A_v - A_u, v) \le w(C_u) + w(A_v - A_u, v)$&lt;/p&gt;
&lt;p&gt;Since $(A_v - A_u) \cap X = \{u\}$, we have $w(C_u) + w(A_v - A_u, v) = w(C_v)$.&lt;/p&gt;
&lt;p&gt;Therefore $w(A_v, v) \le w(C_v)$, which completes the induction; in particular, $w(A_t, t) \le w(C_t)$. QED.&lt;/p&gt;
&lt;p&gt;Thus, we know that each phase of the algorithm finds a valid s-t minimum cut.&lt;/p&gt;
&lt;h3&gt;Time Complexity&lt;/h3&gt;
&lt;p&gt;Computing the maximum $w(A, u)$ each time takes $O(|V|^2)$, and the final merge takes $O(|E|)$, so each phase has time complexity $O(|E| + |V|^2)$.&lt;/p&gt;
&lt;p&gt;With $|V|-1$ phases in total, the overall time complexity is $O(|V||E| + |V|^3)$.&lt;/p&gt;
&lt;p&gt;The selection of the maximum $w(A, u)$ can be optimized using a heap, reducing it to $O(lg|V|)$ per selection, giving a total time complexity of $O(|V||E| + |V|^2lg|V|)$.&lt;/p&gt;
&lt;p&gt;As a side note, the Stoer-Wagner algorithm requires a heap that supports dynamic updates. A commonly used heap for this is the Fibonacci heap, which is not implemented in the C++ standard STL. An alternative is &lt;code&gt;std::multiset&lt;/code&gt;. The Boost library provides direct support for the Stoer-Wagner algorithm.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] https://en.wikipedia.org/wiki/Stoer%E2%80%93Wagner_algorithm&lt;/p&gt;
&lt;p&gt;[2] http://www.boost.org/doc/libs/1_64_0/libs/graph/doc/stoer_wagner_min_cut.html&lt;/p&gt;
</content:encoded></item><item><title>Sorting Algorithm that Minimizes Comparisons (Ford-Johnson Algorithm)</title><link>https://blog.crazyark.xyz/zh/posts/ford_johnson_algorithm/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/ford_johnson_algorithm/</guid><pubDate>Fri, 04 Aug 2017 06:15:35 GMT</pubDate><content:encoded>&lt;p&gt;I happened to discover AtCoder, signed up to give it a try, and ended up stuck on the practice contest...&lt;/p&gt;
&lt;p&gt;The problem itself is quite simple:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are N balls labeled with the first N uppercase letters. The balls have pairwise distinct weights.
You are allowed to ask at most Q queries. In each query, you can compare the weights of two balls (see Input/Output section for details).
Sort the balls in the ascending order of their weights.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Constraints&lt;/strong&gt;
(N,Q)=(26,1000), (26,100), or (5,7).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Partial Score&lt;/strong&gt;
There are three testsets. Each testset is worth 100 points.
In testset 1, N=26 and Q=1000.
In testset 2, N=26 and Q=100.
In testset 3, N=5 and Q=7.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;All three testsets are solved with comparison-based sorting. The (26, 1000) case can be passed with any comparison sort, though a slow one might TLE. The (26, 100) case can be passed with merge sort, whose worst case is $O(nlgn)$. The only tricky one is (5, 7): for that case, merge sort&apos;s worst case is 8 comparisons.&lt;/p&gt;
&lt;p&gt;Like a fellow internet user, I tried using STL&apos;s sort to solve it, only to find it produced even more WAs.&lt;/p&gt;
&lt;p&gt;You must be kidding!&lt;/p&gt;
&lt;h2&gt;Ford-Johnson&lt;/h2&gt;
&lt;p&gt;In desperation I turned to Google, and sure enough found an algorithm called the Ford-Johnson Algorithm. It was proposed in 1959 and is an algorithm that optimizes the number of comparisons in comparison-based sorting. In fact, for a considerable period after its proposal, it was believed to be the lower bound on the number of comparisons.&lt;/p&gt;
&lt;p&gt;Even though it was later shown that further optimization is possible, the Ford-Johnson algorithm remains extremely close to the actual lower bound.&lt;/p&gt;
&lt;p&gt;In Knuth&apos;s &lt;em&gt;The Art of Computer Programming&lt;/em&gt;, Volume 3, this algorithm is also called merge-insertion sort.&lt;/p&gt;
&lt;h3&gt;Algorithm Steps&lt;/h3&gt;
&lt;p&gt;Suppose we have n elements to sort. The algorithm consists of three steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Divide the elements into $\lfloor n/2 \rfloor$ pairs and compare each pair. If n is odd, the last element is left unpaired.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Recursively sort the larger element from each pair. Now we have a sorted sequence consisting of the larger values from each pair, called the main chain. Suppose the main chain is $a_1, a_2, ... a_{\lfloor n/2 \rfloor}$, and the remaining elements are $b_1, b_2, ..., b_{\lceil n/2\rceil}$, where $b_i \le a_i$ holds for $i = 1, 2, ..., \lfloor n/2 \rfloor$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now we insert $b_1, ..., b_{\lceil n/2 \rceil}$ into the main chain. First is $b_1$ -- we know it comes before $a_1$, so the main chain becomes ${b_1, a_1, a_2, ... a_{\lfloor n/2 \rfloor}}$. Then consider inserting $b_3$, which only needs to be compared against three elements ${b_1, a_1, a_2}$. Suppose the first three elements become ${b_1, a_1, b_3}$; then inserting $b_2$ also requires searching among only these three elements. As you can see, no matter where $b_3$ is inserted, the search range is always at most 3. The next element to insert, $b_k$, corresponds to the next &lt;a href=&quot;https://en.wikipedia.org/wiki/Jacobsthal_number&quot;&gt;Jacobsthal number&lt;/a&gt;. That is, we first insert $b_3$, then $b_5, b_{11}, b_{21} ...$. To summarize, the insertion order is $b_1, b_3, b_2, b_5, b_4, b_{11}, b_{10}, ..., b_6, b_{21}, ...$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
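&lt;p&gt;The insertion order in step 3 can be generated mechanically from the Jacobsthal recurrence $J(k) = J(k-1) + 2J(k-2)$. Here is a small sketch of my own (the helper name is made up) that emits the order $b_1, b_3, b_2, b_5, b_4, b_{11}, b_{10}, ..., b_6, b_{21}, ...$ for the first nb elements:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;

// Between consecutive Jacobsthal numbers prev and cur, the indices
// in (prev, cur] are inserted in descending order.
std::vector&amp;lt;int&amp;gt; insertion_order(int nb) {
    std::vector&amp;lt;int&amp;gt; order;
    if (nb &amp;gt;= 1) order.push_back(1);          // b_1 goes first
    int prev = 1, cur = 3;
    while ((int)order.size() &amp;lt; nb) {
        for (int i = cur; i &amp;gt; prev; --i)
            if (i &amp;lt;= nb) order.push_back(i);
        int next = cur + 2 * prev;              // Jacobsthal recurrence
        prev = cur;
        cur = next;
    }
    return order;
}
// e.g. insertion_order(5) yields {1, 3, 2, 5, 4}
&lt;/code&gt;&lt;/pre&gt;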
&lt;h3&gt;Performance&lt;/h3&gt;
&lt;p&gt;The Ford-Johnson algorithm requires special data structures to implement. Its running speed is not particularly fast -- it just performs fewer comparisons. In practice, merge sort and quick sort are still faster.&lt;/p&gt;
&lt;h3&gt;The (5, 7) Problem&lt;/h3&gt;
&lt;p&gt;Suppose the elements are A, B, C, D, E.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Compare (A, B) and (C, D). Without loss of generality, assume A &amp;gt; B and C &amp;gt; D.&lt;/li&gt;
&lt;li&gt;Compare (A, C). Without loss of generality, assume A &amp;gt; C.&lt;/li&gt;
&lt;li&gt;Insert E into (D, C, A) -- this can be done in two comparisons.&lt;/li&gt;
&lt;li&gt;Insert B into the sorted {E, C, D} (since B &amp;lt; A, we don&apos;t need to consider comparing with A) -- this can be done in two comparisons.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In step 3, E is inserted first because if we inserted B into (D, C) first, it would take at most two comparisons, but then inserting E into {D, C, B, A} would take at most three comparisons.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Bui, T. D., and Mai Thanh. &quot;Significant improvements to the Ford-Johnson algorithm for sorting.&quot; BIT Numerical Mathematics 25.1 (1985): 70-75.&lt;/p&gt;
&lt;p&gt;[2] https://codereview.stackexchange.com/questions/116367/ford-johnson-merge-insertion-sort&lt;/p&gt;
&lt;h2&gt;Appendix&lt;/h2&gt;
&lt;h3&gt;Interactive Sorter in C++&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;iostream&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;vector&amp;gt;
using namespace std;

bool less_than(char a, char b) {
    cout &amp;lt;&amp;lt; &quot;? &quot; &amp;lt;&amp;lt; a &amp;lt;&amp;lt; &quot; &quot; &amp;lt;&amp;lt; b &amp;lt;&amp;lt; endl;
    cout.flush();
    char ans;
    cin &amp;gt;&amp;gt; ans;
    if (ans == &apos;&amp;lt;&apos;) return true;
    return false;
}

void ford_johnson(string &amp;amp;s, int n) {
    // assert(n == 5)
    // ugly but works
    if (!less_than(s[0], s[1])) {
        swap(s[0], s[1]);
    }

    if (!less_than(s[2], s[3])) {
        swap(s[2], s[3]);
    }

    if (!less_than(s[1], s[3])) {
        swap(s[0], s[2]);
        swap(s[1], s[3]);
    }

    // now we have s[0] &amp;lt; s[1], s[2] &amp;lt; s[3], s[1] &amp;lt; s[3]
    vector&amp;lt;char&amp;gt; cs = {s[0], s[1], s[3]};

    // insertion will be completed in two comparisons
    auto insert_into_first_three = [&amp;amp;](char c) {
        if (less_than(c, cs[1])) {
            if (less_than(c, cs[0])) cs.insert(cs.begin(), c);
            else cs.insert(cs.begin() + 1, c);
        } else {
            if (less_than(c, cs[2])) cs.insert(cs.begin() + 2, c);
            else cs.insert(cs.begin() + 3, c);
        }
    };

    insert_into_first_three(s[4]);
    // always sorted {s[0], s[1], s[4]} &amp;lt; s[3] or s[0] &amp;lt; s[1] &amp;lt; s[3] &amp;lt; s[4]
    // so the first three elements are enough
    insert_into_first_three(s[2]);

    s = string(cs.begin(), cs.end());
}

// at most 99 comparisons for n = 26
void merge_sort(string &amp;amp;s, int n) {
    if (n == 1) return;
    else if (n == 2) {
        if (!less_than(s[0], s[1])) swap(s[0], s[1]);
        return;
    }

    auto left_half = s.substr(0, n / 2),
        right_half = s.substr(n / 2);

    merge_sort(left_half, n / 2);
    merge_sort(right_half, n - n / 2);

    // merge, at most n - 1 comparisons
    int i = 0, j = 0, k = 0;
    while (k &amp;lt; n) {
        if (i &amp;gt;= n / 2) s[k++] = right_half[j++];
        else if (j &amp;gt;= n - n / 2) s[k++] = left_half[i++];
        else if (less_than(left_half[i], right_half[j])) s[k++] = left_half[i++];
        else s[k++] = right_half[j++];
    }
}

int main() {
    int n, q;
    cin &amp;gt;&amp;gt; n &amp;gt;&amp;gt; q;

    string s;
    for (int i = 0; i &amp;lt; n; ++i) {
        s += &apos;A&apos; + i;
    }

    if (n == 5) ford_johnson(s, n);
    else merge_sort(s, n);

    cout &amp;lt;&amp;lt; &quot;! &quot; &amp;lt;&amp;lt; s &amp;lt;&amp;lt; endl;

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Sliding Window Maximum / Monotonic Queue</title><link>https://blog.crazyark.xyz/zh/posts/monotonic_queue/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/monotonic_queue/</guid><pubDate>Thu, 03 Aug 2017 07:55:02 GMT</pubDate><content:encoded>&lt;p&gt;There is a problem on Leetcode called Sliding Window Maximum. Although I didn&apos;t solve it today, the solution is very interesting, so I&apos;m recording it here.&lt;/p&gt;
&lt;p&gt;Problem description:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Given an array nums, there is a sliding window of size k which is moving from the very left of the array to the very right. You can only see the k numbers in the window. Each time the sliding window moves right by one position.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, Given nums = [1,3,-1,-3,5,3,6,7], and k = 3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Therefore, return the max sliding window as [3,3,5,5,6,7].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This problem can be solved using a priority queue, self-balancing BST, or similar methods to get an O(nlgn) solution, but there is actually an O(n) solution based on maintaining a monotonic queue during the process.&lt;/p&gt;
&lt;h3&gt;Monotonic Queue&lt;/h3&gt;
&lt;p&gt;We use a double-ended queue to implement this monotonic queue, ensuring that all numbers in the queue are monotonically non-increasing, and the maximum value in a window is always at the front of the queue.&lt;/p&gt;
&lt;p&gt;Suppose we have a queue Q satisfying the above condition. When a number a enters the window and a number b slides out of the window, the queue operations are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If the queue is not empty and the front of the queue equals b, remove it!&lt;/li&gt;
&lt;li&gt;If the queue is not empty and the back of the queue is less than a, remove it!&lt;/li&gt;
&lt;li&gt;Repeat &lt;strong&gt;step 2&lt;/strong&gt; until the queue is empty or the back of the queue is not less than a&lt;/li&gt;
&lt;li&gt;Add a to the back of the queue&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Clearly, in &lt;strong&gt;step 2&lt;/strong&gt;, the numbers removed are &lt;strong&gt;numbers that cannot possibly be the maximum&lt;/strong&gt;, so after the operations, the maximum value in the window is still in the queue, and the queue remains monotonically non-increasing.&lt;/p&gt;
&lt;p&gt;The maximum value is then the front of the queue.&lt;/p&gt;
&lt;p&gt;Since each number enters the queue once and exits once, the time cost is O(n).&lt;/p&gt;
&lt;h3&gt;Appendix&lt;/h3&gt;
&lt;h4&gt;C++ Implementation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;vector&amp;lt;int&amp;gt; maxSlidingWindow(vector&amp;lt;int&amp;gt; &amp;amp;nums, int k) {
    vector&amp;lt;int&amp;gt; res;
    // tracking index instead of value
    deque&amp;lt;int&amp;gt; dq;

    for (int i = 0; i &amp;lt; nums.size(); ++ i) {
        // dequeue (i - k)th element if exists
        if (!dq.empty() &amp;amp;&amp;amp; dq.front() == i - k) dq.pop_front();
        // remove those less than current
        while (!dq.empty() &amp;amp;&amp;amp; nums[dq.back()] &amp;lt; nums[i]) dq.pop_back();
        // enqueue current
        dq.push_back(i);

        if (i &amp;gt;= k - 1) res.push_back(nums[dq.front()]);
    }

    return res;
}
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Kth Largest Element in a Sorted Matrix (Selection in X+Y or Sorted Matrices)</title><link>https://blog.crazyark.xyz/zh/posts/selection_in_x_plus_y_and_matrices_with_sorted_rows_and_columns/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/selection_in_x_plus_y_and_matrices_with_sorted_rows_and_columns/</guid><pubDate>Wed, 02 Aug 2017 12:58:55 GMT</pubDate><content:encoded>&lt;p&gt;Today while solving LeetCode problems, I came across this one: &lt;a href=&quot;https://leetcode.com/problems/kth-smallest-element-in-a-sorted-matrix/description/&quot;&gt;Kth Smallest Element in a Sorted Matrix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First, using a Min-Heap gives an O(klgn) algorithm (where n is the number of columns). The implementation is provided at the end.&lt;/p&gt;
&lt;p&gt;However, while browsing the Discuss section, I discovered that there is actually an O(n) algorithm (where n is the number of rows/columns), from the paper &lt;a href=&quot;http://www.cse.yorku.ca/~andy/pubs/X+Y.pdf&quot;&gt;Selection in X + Y and Matrices with Sorted Rows and Columns&lt;/a&gt;. It is also applicable to another problem, &lt;a href=&quot;https://leetcode.com/problems/find-k-pairs-with-smallest-sums/&quot;&gt;Find k Pairs with Smallest Sums&lt;/a&gt;. I will only provide an overview here, because I don&apos;t think anyone could write this out during an interview...&lt;/p&gt;
&lt;h3&gt;Brief Introduction&lt;/h3&gt;
&lt;p&gt;Let A be an n * n matrix where all rows and columns are in non-decreasing order.&lt;/p&gt;
&lt;p&gt;Let $A^*$ be the submatrix of A formed by taking every other row and column (the odd-numbered ones); if n is even, also include the last row and column of A. Let $rank_{desc}(A, a)$ denote the number of elements in A greater than a, and $rank_{asc}(A, a)$ the number of elements in A less than a.&lt;/p&gt;
&lt;p&gt;Then the following inequalities hold:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;$rank_{asc}(A, a) \le 4 rank_{asc}(A^*, a)$&lt;/li&gt;
&lt;li&gt;$rank_{desc}(A, a) \le 4 rank_{desc}(A^*, a)$&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the foundation of the algorithm. Combined with a clever selection strategy, it reduces the problem to a subproblem on A*.&lt;/p&gt;
&lt;p&gt;For the algorithm details, please refer to the paper. It is very non-intuitive and requires substantial theoretical proofs, making it really difficult to describe in detail... Interested readers should consult the original paper.&lt;/p&gt;
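&lt;p&gt;These bounds are easy to sanity-check numerically. The sketch below is my own illustration (not from the paper): it builds a random matrix with sorted rows and columns as $M_{ij} = R_i + C_j$, forms $A^*$ with the same index construction used by the biselect implementations in this post, and checks both inequalities:&lt;/p&gt;

```python
import random

def sorted_matrix(n, rng):
    # Rows and columns are non-decreasing because M[i][j] = R[i] + C[j]
    # with R and C sorted.
    R = sorted(rng.randrange(100) for _ in range(n))
    C = sorted(rng.randrange(100) for _ in range(n))
    return [[R[i] + C[j] for j in range(n)] for i in range(n)]

def rank_bounds_hold(M):
    n = len(M)
    index = list(range(n))
    # A* takes every other row/column, plus the last one when n is even.
    index_ = index[::2] + index[n - 1 + n % 2:]
    A = [M[i][j] for i in range(n) for j in range(n)]
    A_star = [M[i][j] for i in index_ for j in index_]
    for a in set(A):
        if sum(v < a for v in A) > 4 * sum(v < a for v in A_star):
            return False
        if sum(v > a for v in A) > 4 * sum(v > a for v in A_star):
            return False
    return True

rng = random.Random(0)
assert all(rank_bounds_hold(sorted_matrix(n, rng)) for n in (4, 5, 6, 7, 8))
```

&lt;p&gt;Intuitively, every cell of A can be rounded to a nearby cell of $A^*$ without increasing (for the ascending rank) or decreasing (for the descending rank) its value, and each cell of $A^*$ absorbs at most four cells of A -- hence the factor 4.&lt;/p&gt;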
&lt;h3&gt;Python and C++ Implementations&lt;/h3&gt;
&lt;p&gt;Here I include the Python and C++ implementations shared by experts in the Discuss section. If anyone could write this by hand during an interview, I would be in awe.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Solution(object):
    def kthSmallest(self, matrix, k):

        # The median-of-medians selection function.
        def pick(a, k):
            if k == 1:
                return min(a)
            groups = (a[i:i+5] for i in range(0, len(a), 5))
            medians = [sorted(group)[len(group) // 2] for group in groups]
            pivot = pick(medians, len(medians) // 2 + 1)
            smaller = [x for x in a if x &amp;lt; pivot]
            if k &amp;lt;= len(smaller):
                return pick(smaller, k)
            k -= len(smaller) + a.count(pivot)
            return pivot if k &amp;lt; 1 else pick([x for x in a if x &amp;gt; pivot], k)

        # Find the k1-th and k2-th smallest entries in the submatrix.
        def biselect(index, k1, k2):

            # Provide the submatrix.
            n = len(index)
            def A(i, j):
                return matrix[index[i]][index[j]]

            # Base case.
            if n &amp;lt;= 2:
                nums = sorted(A(i, j) for i in range(n) for j in range(n))
                return nums[k1-1], nums[k2-1]

            # Solve the subproblem.
            index_ = index[::2] + index[n-1+n%2:]
            k1_ = (k1 + 2*n) // 4 + 1 if n % 2 else n + 1 + (k1 + 3) // 4
            k2_ = (k2 + 3) // 4
            a, b = biselect(index_, k1_, k2_)

            # Prepare ra_less, rb_more and L with saddleback search variants.
            ra_less = rb_more = 0
            L = []
            jb = n   # jb is the first where A(i, jb) is larger than b.
            ja = n   # ja is the first where A(i, ja) is larger than or equal to a.
            for i in range(n):
                while jb and A(i, jb - 1) &amp;gt; b:
                    jb -= 1
                while ja and A(i, ja - 1) &amp;gt;= a:
                    ja -= 1
                ra_less += ja
                rb_more += n - jb
                L.extend(A(i, j) for j in range(jb, ja))

            # Compute and return x and y.
            x = a if ra_less &amp;lt;= k1 - 1 else \
                b if k1 + rb_more - n*n &amp;lt;= 0 else \
                pick(L, k1 + rb_more - n*n)
            y = a if ra_less &amp;lt;= k2 - 1 else \
                b if k2 + rb_more - n*n &amp;lt;= 0 else \
                pick(L, k2 + rb_more - n*n)
            return x, y

        # Set up and run the search.
        n = len(matrix)
        start = max(k - n*n + n-1, 0)
        k -= n*n - (n - start)**2
        return biselect(list(range(start, min(n, start+k))), k, k)[0]
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;class Solution {
public:
	int kthSmallest(const std::vector&amp;lt;std::vector&amp;lt;int&amp;gt;&amp;gt; &amp;amp; matrix, int k)
	{
		if (k == 1) // guard for 1x1 matrix
		{
			return matrix.front().front();
		}

		size_t n = matrix.size();
		std::vector&amp;lt;size_t&amp;gt; indices(n);
		std::iota(indices.begin(), indices.end(), 0);
		std::array&amp;lt;size_t, 2&amp;gt; ks = { k - 1, k - 1 }; // use zero-based indices
		std::array&amp;lt;int, 2&amp;gt; results = biSelect(matrix, indices, ks);
		return results[0];
	}

private:
	// select the ks[0]-th and ks[1]-th smallest elements, recursively
	std::array&amp;lt;int, 2&amp;gt; biSelect(
		const std::vector&amp;lt;std::vector&amp;lt;int&amp;gt;&amp;gt; &amp;amp; matrix,
		const std::vector&amp;lt;size_t&amp;gt; &amp;amp; indices,
		const std::array&amp;lt;size_t, 2&amp;gt; &amp;amp; ks)
	// Select both ks[0]-th element and ks[1]-th element in the matrix,
	// where k0 = ks[0] and k1 = ks[1] and n = indices.size() satisfy
	// 0 &amp;lt;= k0 &amp;lt;= k1 &amp;lt; n*n  and  k1 - k0 &amp;lt;= 4n-4 = O(n)   and  n&amp;gt;=2
	{
		size_t n = indices.size();
		if (n == 2u) // base case of recursion
		{
			return biSelectNative(matrix, indices, ks);
		}

		// update indices
		std::vector&amp;lt;size_t&amp;gt; indices_;
		for (size_t idx = 0; idx &amp;lt; n; idx += 2)
		{
			indices_.push_back(indices[idx]);
		}
		if (n % 2 == 0) // ensure the last index is included
		{
			indices_.push_back(indices.back());
		}

		// update ks
		// the new interval [xs_[0], xs_[1]] should contain [xs[0], xs[1]]
		// but the length of the new interval should be as small as possible
		// therefore, ks_[0] is the largest possible index to ensure xs_[0] &amp;lt;= xs[0]
		// ks_[1] is the smallest possible index to ensure xs_[1] &amp;gt;= xs[1]
		std::array&amp;lt;size_t, 2&amp;gt; ks_ = { ks[0] / 4, 0 };
		if (n % 2 == 0) // even
		{
			ks_[1] = ks[1] / 4 + n + 1;
		}
		else // odd
		{
			ks_[1] = (ks[1] + 2 * n + 1) / 4;
		}

		// call recursively
		std::array&amp;lt;int, 2&amp;gt; xs_ = biSelect(matrix, indices_, ks_);

		// Now we partition all elements into three parts:
		// Part 1: {e : e &amp;lt; xs_[0]}.  For this part, we only record its cardinality
		// Part 2: {e : xs_[0] &amp;lt;= e &amp;lt; xs_[1]}. We store the set elementsBetween
		// Part 3: {e : e &amp;gt;= xs_[1]}. No use. Discard.
		std::array&amp;lt;int, 2&amp;gt; numbersOfElementsLessThanX = { 0, 0 };
		std::vector&amp;lt;int&amp;gt; elementsBetween; // [xs_[0], xs_[1])

		// cols[idx]: the first column in the current row whose value &amp;gt;= xs_[idx]
		std::array&amp;lt;size_t, 2&amp;gt; cols = { n, n };
		for (size_t row = 0; row &amp;lt; n; ++row)
		{
			size_t row_indice = indices[row];
			for (size_t idx : {0, 1})
			{
				while ((cols[idx] &amp;gt; 0)
					&amp;amp;&amp;amp; (matrix[row_indice][indices[cols[idx] - 1]] &amp;gt;= xs_[idx]))
				{
					--cols[idx];
				}
				numbersOfElementsLessThanX[idx] += cols[idx];
			}
			for (size_t col = cols[0]; col &amp;lt; cols[1]; ++col)
			{
				elementsBetween.push_back(matrix[row_indice][indices[col]]);
			}
		}

		std::array&amp;lt;int, 2&amp;gt; xs; // the return value
		for (size_t idx : {0, 1})
		{
			size_t k = ks[idx];
			if (k &amp;lt; numbersOfElementsLessThanX[0]) // in the Part 1
			{
				xs[idx] = xs_[0];
			}
			else if (k &amp;lt; numbersOfElementsLessThanX[1]) // in the Part 2
			{
				size_t offset = k - numbersOfElementsLessThanX[0];
				std::vector&amp;lt;int&amp;gt;::iterator nth = std::next(elementsBetween.begin(), offset);
				std::nth_element(elementsBetween.begin(), nth, elementsBetween.end());
				xs[idx] = (*nth);
			}
			else // in the Part 3
			{
				xs[idx] = xs_[1];
			}
		}
		return xs;
	}

	// select the two elements by sorting the whole submatrix, the naive way
	std::array&amp;lt;int, 2&amp;gt; biSelectNative(
		const std::vector&amp;lt;std::vector&amp;lt;int&amp;gt;&amp;gt; &amp;amp; matrix,
		const std::vector&amp;lt;size_t&amp;gt; &amp;amp; indices,
		const std::array&amp;lt;size_t, 2&amp;gt; &amp;amp; ks)
	{
		std::vector&amp;lt;int&amp;gt; allElements;
		for (size_t r : indices)
		{
			for (size_t c : indices)
			{
				allElements.push_back(matrix[r][c]);
			}
		}
		std::sort(allElements.begin(), allElements.end());
		std::array&amp;lt;int, 2&amp;gt; results;
		for (size_t idx : {0, 1})
		{
			results[idx] = allElements[ks[idx]];
		}
		return results;
	}
};
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Min-Heap O(klgn) Algorithm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;class Solution {
    struct Tuple {
        int x, y, val;
        Tuple(int x, int y, int val) : x(x), y(y), val(val) {}

        bool operator&amp;lt;(const Tuple&amp;amp; other) const { return val &amp;lt; other.val; }
        bool operator&amp;gt;(const Tuple&amp;amp; other) const { return val &amp;gt; other.val; }
    };

public:
    int kthSmallest(vector&amp;lt;vector&amp;lt;int&amp;gt;&amp;gt;&amp;amp; matrix, int k) {
        priority_queue&amp;lt;Tuple, vector&amp;lt;Tuple&amp;gt;, std::greater&amp;lt;Tuple&amp;gt;&amp;gt; min_heap;
        int n = matrix.size();
        for (int i = 0; i &amp;lt; n; ++i) min_heap.push(Tuple(0, i, matrix[0][i]));

        for (int i = 0; i &amp;lt; k - 1; ++i) {
            auto t = min_heap.top();
            min_heap.pop();
            if (t.x == n - 1) continue;
            min_heap.push(Tuple(t.x + 1, t.y, matrix[t.x + 1][t.y]));
        }

        return min_heap.top().val;
    }
};
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Dynamic Proxy in Java</title><link>https://blog.crazyark.xyz/zh/posts/dynamic_proxy/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/dynamic_proxy/</guid><pubDate>Sat, 29 Jul 2017 14:44:24 GMT</pubDate><content:encoded>&lt;p&gt;Although I learned about the Proxy pattern a year ago, I had barely tried using it, only seeing some examples in frameworks. Yesterday, while reading &quot;Large-Scale Website Systems and Java Middleware in Practice,&quot; I stumbled upon an application of the Proxy pattern in Java -- dynamic proxies. I decided to write it down and review the Proxy pattern along the way.&lt;/p&gt;
&lt;p&gt;The following content is referenced from:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&quot;Design Patterns&quot;&lt;/li&gt;
&lt;li&gt;&quot;Large-Scale Website Systems and Java Middleware in Practice&quot;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.baeldung.com/java-dynamic-proxies&quot;&gt;http://www.baeldung.com/java-dynamic-proxies&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Proxy pattern diagram:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/proxy_pattern.jpg&quot; alt=&quot;Proxy Pattern&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;The Proxy Pattern&lt;/h3&gt;
&lt;p&gt;In &quot;Design Patterns,&quot; four common situations for using the Proxy pattern are described:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Remote Proxy&lt;/strong&gt; provides a local representative for an object in a different address space.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtual Proxy&lt;/strong&gt; creates expensive objects on demand.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Protection Proxy&lt;/strong&gt; controls access to the original object. Protection proxies are used when objects should have different access rights.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smart Reference&lt;/strong&gt; implements reference counting, thread safety, lazy creation, etc. Those familiar with C++ should know &lt;code&gt;std::shared_ptr&lt;/code&gt;, &lt;code&gt;std::weak_ptr&lt;/code&gt;, &lt;code&gt;std::unique_ptr&lt;/code&gt;, and so on.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Static Proxy in Java&lt;/h3&gt;
&lt;p&gt;Here we borrow an example from &quot;Large-Scale Website Systems and Java Middleware in Practice&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public interface Calculator {
    int add(int a, int b);
}

public class CalculatorImpl implements Calculator {
    @Override
    public int add(int a, int b) {
        return a + b;
    }
}

public class CalculatorProxy implements Calculator {
    private Calculator calculator;

    public CalculatorProxy(Calculator calculator) {
        this.calculator = calculator;
    }

    @Override
    public int add(int a, int b) {
        // We can do anything before and after real add is called,
        // such as logging
        return calculator.add(a, b);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, in CalculatorProxy we call the real add operation and can do additional work before and after the call. This approach is quite straightforward, but if we have a batch of classes that need proxying with the same proxy functionality, we would need to implement an XxxProxy class for each one, which is very tedious.&lt;/p&gt;
&lt;h3&gt;Dynamic Proxy in Java&lt;/h3&gt;
&lt;p&gt;Java provides a way to dynamically generate a proxy for a class. Let&apos;s look at an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class DynamicProxyPlayground {

    public static void main(String[] args) {
        Map mapProxyInstance = (Map) Proxy
            .newProxyInstance(DynamicProxyPlayground.class.getClassLoader(), new Class[]{Map.class},
                new TimingDynamicInvocationHandler(new HashMap&amp;lt;String, String&amp;gt;()));

        // DynamicProxyPlayground$TimingDynamicInvocationHandler invoke
        // INFO: Executing put finished in 74094 ns.
        mapProxyInstance.put(&quot;hello&quot;, &quot;world&quot;);

        CharSequence csProxyInstance = (CharSequence) Proxy.newProxyInstance(
            DynamicProxyPlayground.class.getClassLoader(),
            new Class[] { CharSequence.class },
            new TimingDynamicInvocationHandler(&quot;Hello World&quot;));

        // DynamicProxyPlayground$TimingDynamicInvocationHandler invoke
        // INFO: Executing length finished in 11720 ns.
        csProxyInstance.length();
    }

    public static class TimingDynamicInvocationHandler implements InvocationHandler {
        private static Logger logger =
            Logger.getLogger(TimingDynamicInvocationHandler.class.getName());

        private Object target;

        private TimingDynamicInvocationHandler(Object target) {
            this.target = target;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            long start = System.nanoTime();
            Object result = method.invoke(target, args);
            long elapsed = System.nanoTime() - start;

            logger.info(String.format(&quot;Executing %s finished in %d ns.&quot;, method.getName(), elapsed));
            return result;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the example above, by implementing a single &lt;code&gt;InvocationHandler&lt;/code&gt;, we can use &lt;code&gt;Proxy.newProxyInstance&lt;/code&gt; to create dynamic proxies with the same functionality for a batch of classes.&lt;/p&gt;
</content:encoded></item><item><title>Boyer-Moore Majority Voting Algorithm</title><link>https://blog.crazyark.xyz/zh/posts/majority_voting_al/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/majority_voting_al/</guid><pubDate>Fri, 28 Jul 2017 13:08:02 GMT</pubDate><content:encoded>&lt;p&gt;A problem I encountered while doing LeetCode. This post provides a brief description and records my thoughts.&lt;/p&gt;
&lt;p&gt;Reference: &lt;a href=&quot;https://gregable.com/2013/10/majority-vote-algorithm-find-majority.html&quot;&gt;https://gregable.com/2013/10/majority-vote-algorithm-find-majority.html&lt;/a&gt;, an excellently written blog post.&lt;/p&gt;
&lt;p&gt;Problem description: Consider an &lt;strong&gt;unsorted&lt;/strong&gt; list of length n. You want to determine whether there is a value that occupies more than half of the list (majority), and if so, find that value.&lt;/p&gt;
&lt;p&gt;A common application of this problem is in fault-tolerant computing, where after performing multiple redundant computations, the value obtained by the majority of computations is output.&lt;/p&gt;
&lt;h2&gt;The Obvious Approach&lt;/h2&gt;
&lt;p&gt;Sort first, then traverse the list and count. The time cost is O(nlgn), which is not fast enough! In fact, we can produce the result in O(n) time.&lt;/p&gt;
&lt;h2&gt;Boyer-Moore Majority Algorithm&lt;/h2&gt;
&lt;p&gt;Paper reference: &lt;a href=&quot;http://www.cs.rug.nl/~wim/pub/whh348.pdf&quot;&gt;Boyer-Moore Majority Algorithm&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This algorithm makes two passes over the list, for O(n) time and O(1) extra space, and the idea is very simple.&lt;/p&gt;
&lt;p&gt;We need the following two values:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;candidate, initialized to any value&lt;/li&gt;
&lt;li&gt;count, initialized to 0&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First pass, with current representing the current value:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;IF count == 0, THEN candidate = current&lt;/li&gt;
&lt;li&gt;IF candidate == current THEN count += 1 ELSE count -= 1&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Second pass: count the candidate from the first pass. If it exceeds half, then the majority is candidate; otherwise, no majority exists.&lt;/p&gt;
&lt;p&gt;Here is the Python implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def boyer_moore_majority(input):
    candidate = 0
    count = 0
    for value in input:
        if count == 0:
            candidate = value
        if candidate == value:
            count += 1
        else:
            count -= 1

    count = 0
    for value in input:
        if candidate == value:
            count += 1

    if count &amp;gt; len(input) / 2:
        return candidate
    else:
        return -1 # any value represents NOT FOUND
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;A Simple Proof&lt;/h3&gt;
&lt;p&gt;We only need to consider the case where a majority exists in the original list, since the second pass will directly reject if there is none.&lt;/p&gt;
&lt;p&gt;So assume that list L contains a majority, denoted as M.&lt;/p&gt;
&lt;p&gt;As we can see, the candidate changes when count equals 0, which effectively divides the list into segments, each with its own candidate.&lt;/p&gt;
&lt;p&gt;Each completed segment has an important property: its candidate occupies exactly half of it. (In the final segment, which may end with count &gt; 0, the candidate occupies at least half.)&lt;/p&gt;
&lt;p&gt;We prove by induction that in the last segment, candidate == M.&lt;/p&gt;
&lt;p&gt;When scanning the first segment S, there are two cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;candidate == M: Since M is the majority, and count(M in S) = len(S) / 2, M is still the majority in the sublist L - S.&lt;/li&gt;
&lt;li&gt;candidate != M: Then count(M in S) &amp;lt;= len(S) / 2. Similarly, M is still the majority in L - S.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The last segment is the final sublist, so candidate == M.&lt;/p&gt;
&lt;h3&gt;Even Faster&lt;/h3&gt;
&lt;p&gt;Two-pass O(n) is already fast, but people still wanted faster, so... an O(3n/2 - 2) algorithm was born.&lt;/p&gt;
&lt;p&gt;This algorithm makes only 3n/2 - 2 comparisons, which matches the theoretical lower bound. The proof is given in &lt;a href=&quot;http://www.cs.yale.edu/publications/techreports/tr252.pdf&quot;&gt;Finding a majority among N votes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The basic idea of this algorithm is to rearrange the original list so that no two adjacent values are the same.&lt;/p&gt;
&lt;p&gt;Here, we need a &lt;strong&gt;bucket&lt;/strong&gt; to store extra values, so the space cost is O(n), and it still requires two passes.&lt;/p&gt;
&lt;p&gt;First pass: candidate remains the last value of the list, current is the current value.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;current == candidate: put current into the bucket&lt;/li&gt;
&lt;li&gt;current != candidate: put current at the end of the list, then take one from the bucket and put it at the end of the list&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Clearly, no two adjacent elements in the list can be the same.&lt;/p&gt;
&lt;p&gt;In the second pass, we need to continuously compare candidate with the last value of the list:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;If they are the same, remove &lt;strong&gt;two&lt;/strong&gt; elements from the end of the list&lt;/li&gt;
&lt;li&gt;Otherwise, remove &lt;strong&gt;one&lt;/strong&gt; element from the end of the list, and remove &lt;strong&gt;one&lt;/strong&gt; element from the bucket&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If the bucket is empty, there is no majority; otherwise, candidate is the majority.&lt;/p&gt;
&lt;p&gt;The proof is omitted. Interested readers can refer to the paper.&lt;/p&gt;
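&lt;p&gt;As a rough illustration of the interleaving idea (hedged: this is not the paper&apos;s exact formulation -- in particular, it verifies the candidate with a plain count instead of the second-pass unwinding, so it does not achieve the 3n/2 - 2 comparison bound), the first pass can be sketched in Python:&lt;/p&gt;

```python
def rearrange_and_find_majority(xs):
    # First pass: build `arr` so that no two adjacent values are equal,
    # parking surplus copies of the current tail value in `bucket`.
    arr, bucket = [], []
    for x in xs:
        if arr and x == arr[-1]:
            bucket.append(x)              # duplicate of the tail: park it
        else:
            arr.append(x)
            if bucket:
                arr.append(bucket.pop())  # re-interleave one parked copy
    if not arr:
        return None
    candidate = arr[-1]                   # the majority, if one exists
    # Illustration only: verify with a plain count (the paper instead
    # unwinds the rearranged list against the bucket in a second pass).
    if (arr + bucket).count(candidate) * 2 > len(xs):
        return candidate
    return None
```

&lt;p&gt;Note that while the bucket is non-empty it only ever holds copies of the current tail value, which is why the re-interleaved copy never creates two equal neighbors.&lt;/p&gt;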
&lt;h3&gt;Distributed Boyer-Moore&lt;/h3&gt;
&lt;p&gt;Interested readers can read &lt;a href=&quot;http://www.crm.umontreal.ca/pub/Rapports/3300-3399/3302.pdf&quot;&gt;Finding the Majority Element in Parallel&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The main algorithm is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def distributed_boyer_moore_majority(parallel_output):
    candidate = 0
    count = 0
    for candidate_i, count_i in parallel_output:
        if candidate_i == candidate:
            count += count_i
        elif count_i &amp;gt; count:
            candidate = candidate_i
            count = count_i - count
        else:
            count = count - count_i
    ...
&lt;/code&gt;&lt;/pre&gt;
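&lt;p&gt;To make the merge rule concrete, here is a self-contained sketch (hedged: the function names are mine, not from the paper). It runs the standard first pass on each shard, folds the (candidate, count) pairs with the weighted rule above, and finishes with a global counting pass:&lt;/p&gt;

```python
def first_pass(shard):
    # Standard Boyer-Moore first pass over one shard.
    candidate, count = None, 0
    for value in shard:
        if count == 0:
            candidate = value
        if candidate == value:
            count += 1
        else:
            count -= 1
    return candidate, count

def merge_shards(shards):
    # Fold the per-shard (candidate, count) summaries with the weighted rule.
    candidate, count = None, 0
    for candidate_i, count_i in (first_pass(s) for s in shards):
        if candidate_i == candidate:
            count += count_i
        elif count_i > count:
            candidate, count = candidate_i, count_i - count
        else:
            count -= count_i
    # Global verification pass over all shards.
    total = sum(len(s) for s in shards)
    occurrences = sum(s.count(candidate) for s in shards)
    return candidate if occurrences * 2 > total else None
```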
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;An interesting problem encountered while doing leetcode: &lt;a href=&quot;https://leetcode.com/problems/majority-element-ii/tabs/description&quot;&gt;https://leetcode.com/problems/majority-element-ii/tabs/description&lt;/a&gt;. Knowing this algorithm makes it very easy.&lt;/p&gt;
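&lt;p&gt;For the linked Majority Element II problem (values appearing more than n/3 times), the same idea generalizes to two candidates; a hedged sketch:&lt;/p&gt;

```python
def majority_elements_n3(nums):
    # Generalized Boyer-Moore: at most two values can each exceed n/3,
    # so track two candidates and two counters in the first pass.
    cand1, cand2, cnt1, cnt2 = None, None, 0, 0
    for x in nums:
        if x == cand1:
            cnt1 += 1
        elif x == cand2:
            cnt2 += 1
        elif cnt1 == 0:
            cand1, cnt1 = x, 1
        elif cnt2 == 0:
            cand2, cnt2 = x, 1
        else:
            cnt1 -= 1
            cnt2 -= 1
    # Second pass: verify both surviving candidates.
    return [c for c in (cand1, cand2)
            if c is not None and nums.count(c) * 3 > len(nums)]
```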
</content:encoded></item><item><title>Boyer-Moore 投票算法 (Boyer-Moore Majority Voting Algorithm)</title><link>https://blog.crazyark.xyz/zh/posts/majority_voting_al/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/majority_voting_al/</guid><pubDate>Fri, 28 Jul 2017 13:08:02 GMT</pubDate><content:encoded>&lt;p&gt;刷 leetcode 时碰到的问题，本篇仅做简要描述，以及记录思考。&lt;/p&gt;
&lt;p&gt;参考自: &lt;a href=&quot;https://gregable.com/2013/10/majority-vote-algorithm-find-majority.html&quot;&gt;https://gregable.com/2013/10/majority-vote-algorithm-find-majority.html&lt;/a&gt;，一篇写的非常好的博客&lt;/p&gt;
&lt;p&gt;问题描述：考虑你有一个长度为 n 的&lt;strong&gt;无序&lt;/strong&gt;列表，现在你想知道列表中是否有一个值占据了列表的一半以上 (majority)，如果有的话找出这个数。&lt;/p&gt;
&lt;p&gt;这个问题的一个普遍的应用场景是在容错计算 (fault-tolerant computing) 中，在进行了多次冗余的计算后，输出最后多数计算得到的值。&lt;/p&gt;
&lt;h2&gt;显而易见的做法&lt;/h2&gt;
&lt;p&gt;先排序，然后遍历列表并计数就行。耗时为 O(nlgn)，不够快！实际上我们可以在 O(n) 的时间内给出结果。&lt;/p&gt;
&lt;h2&gt;Boyer-Moore Majority Algorithm&lt;/h2&gt;
&lt;p&gt;论文依据：&lt;a href=&quot;http://www.cs.rug.nl/~wim/pub/whh348.pdf&quot;&gt;Boyer-Moore Majority Algorithm&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;该算法时间开销为 O(2n), 空间开销为 O(1)，总共遍历两遍列表，思想非常简单。&lt;/p&gt;
&lt;p&gt;我们需要以下两个值:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;candidate，初始化为任意值&lt;/li&gt;
&lt;li&gt;count，初始化为 0&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;第一遍遍历，以 current 代表当前值：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;IF count == 0, THEN candidate = current&lt;/li&gt;
&lt;li&gt;IF candidate == current THEN count += 1 ELSE count -= 1&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;第二遍遍历，对上次结果中的 candidate 计数，超过一半则存在 majority 为 candidate，否则不存在&lt;/p&gt;
&lt;p&gt;来看一下 Python 版代码实现：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def boyer_moore_majority(input):
    candidate = 0
    count = 0
    for value in input:
        if count == 0:
            candidate = value
        if candidate == value:
            count += 1
        else:
            count -= 1

    count = 0
    for value in input:
        if candidate == value:
            count += 1
    
    if count &amp;gt; len(input) / 2:
        return candidate
    else:
        return -1 # any value represents NOT FOUND
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;一个简单的证明&lt;/h3&gt;
&lt;p&gt;我们只需要考虑在原列表中有 majority 的情况，因为如果没有第二遍遍历会直接 reject。&lt;/p&gt;
&lt;p&gt;所以假设列表 L 中存在 majority，记为 M。&lt;/p&gt;
&lt;p&gt;可以看到，上面 candidate 在 count 等于 0 的时候变更，其实将列表分成了一段一段，每一段有一个 candidate。&lt;/p&gt;
&lt;p&gt;每一段有一个重要的性质，即 candidate 在这一段中恰好占据了一半。&lt;/p&gt;
&lt;p&gt;我们归纳证明：在最后一段中 candidate == M&lt;/p&gt;
&lt;p&gt;那么当扫描到第一段 S 时，有两种情况：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;candidate == M，那么根据 M 是 majority，以及根据 count(M in S) = len(S) / 2，在子列表 L - S 中 M 还是 majority&lt;/li&gt;
&lt;li&gt;candidate != M，那么 count(M in S) &amp;lt;= len(S) / 2, 同上，L - S 中 M 还是 majority&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;最后一段就是最后一个子列表，所以 candidate == M。&lt;/p&gt;
&lt;h3&gt;更快更好 😁&lt;/h3&gt;
&lt;p&gt;两遍遍历的 O(n) 已经很快，但是大家还是觉得不够快，于是... O(3n / 2 -2) 的算法诞生了。&lt;/p&gt;
&lt;p&gt;这个算法只比较 3n/2 - 2 次，已经是理论下限，下界的证明参见 &lt;a href=&quot;http://www.cs.yale.edu/publications/techreports/tr252.pdf&quot;&gt;Finding a majority among N votes&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;这个算法的基本想法是：将原列表重新排列，使得没有两个相邻的值是相同的。&lt;/p&gt;
&lt;p&gt;在这里，我们需要一个&lt;strong&gt;桶&lt;/strong&gt;来存放额外的值，所以空间消耗为 O(n)，同样是两遍遍历。&lt;/p&gt;
&lt;p&gt;第一遍遍历，candidate 保持为列表的最后一个值，current 为当前值&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;current == candidate, 把 current 放入桶中&lt;/li&gt;
&lt;li&gt;current != candidate, 把 current 放到列表的最后，然后从桶中取出一个放到列表最后&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;显然列表相邻的两个绝不可能相同&lt;/p&gt;
&lt;p&gt;第二遍遍历中，我们需要将 candidate 不断的与列表最后一个值比较：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;如果相同，从列表最后去除&lt;strong&gt;两个&lt;/strong&gt;元素&lt;/li&gt;
&lt;li&gt;否则，从列表最后去除&lt;strong&gt;一个&lt;/strong&gt;元素，并从桶中去除&lt;strong&gt;一个&lt;/strong&gt;元素&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;如果桶空了，那么没有 majority，否则 candidate 就是 majority。&lt;/p&gt;
&lt;p&gt;证明略去，有兴趣的同学可以参考论文。&lt;/p&gt;
&lt;h3&gt;分布式 Boyer-Moore&lt;/h3&gt;
&lt;p&gt;有兴趣的同学可以阅读 &lt;a href=&quot;http://www.crm.umontreal.ca/pub/Rapports/3300-3399/3302.pdf&quot;&gt;Finding the Majority Element in Parallel&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;主要算法如下：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def distributed_boyer_moore_majority(parallel_output):
    candidate = 0
    count = 0
    for candidate_i, count_i in parallel_output:
        if candidate_i == candidate:
            count += count_i
        elif count_i &amp;gt; count:
            candidate = candidate_i
            count = count_i - count
        else:
            count = count - count_i
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;总结&lt;/h2&gt;
&lt;p&gt;刷 leetcode 时遇到的很有意思的题目 &lt;a href=&quot;https://leetcode.com/problems/majority-element-ii/tabs/description&quot;&gt;https://leetcode.com/problems/majority-element-ii/tabs/description&lt;/a&gt;，知道这个算法就非常容易了。&lt;/p&gt;
</content:encoded></item><item><title>2017 Alibaba Middleware 24h Final (Just for Fun)</title><link>https://blog.crazyark.xyz/zh/posts/topkn/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/topkn/</guid><pubDate>Wed, 26 Jul 2017 08:59:26 GMT</pubDate><content:encoded>&lt;p&gt;When the Alibaba Middleware competition came around this year, I happened to be in a bad mood and had to prepare for final exams, so I didn&apos;t participate -- a real shame. Fortunately, my good friend &lt;a href=&quot;https://ericfu.me&quot;&gt;Eric Fu&lt;/a&gt; competed and won the championship! This year&apos;s theme was distributed databases. For details and the semifinal solution, please visit Eric&apos;s blog.&lt;/p&gt;
&lt;p&gt;I was quite regretful about missing it, and the problem kept itching at me. So I asked Eric for the final 24-hour geek challenge to play around with.&lt;/p&gt;
&lt;h2&gt;24h TOPKN&lt;/h2&gt;
&lt;p&gt;The problem is paginated sorting on a distributed database, corresponding to the SQL execution of &lt;code&gt;order by id limit k, n&lt;/code&gt;. The main technical challenge is the &quot;distributed&quot; strategy -- the competition uses multiple files to simulate multiple data shards.&lt;/p&gt;
&lt;p&gt;Abbreviated as top(k, n).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Given a batch of data, find n consecutive data items starting from the k-th index in ascending order (similar to the semantics of &lt;code&gt;order by asc + limit k,n&lt;/code&gt; in databases).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;top(k,3) means getting items ranked k+1, k+2, k+3. For example: top(10,3) means items ranked 11, 12, 13; top(0,3) means items ranked 1, 2, 3.
Consider several cases for the value of k, such as very small or very large values, e.g., k=1,000,000,000.
Range of k,n values: 0 &amp;lt;= k &amp;lt; 2^63 and 0 &amp;lt; n &amp;lt; 100.
All data consists of text and numbers combined. Sorting is based on the entire string&apos;s length. For strings of equal length, the one with a smaller ASCII code comes first.
Example: a ab abc def abc123&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, given (k,n) = (2,2), the answer for the above data would be: abc def&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The original problem is at &lt;a href=&quot;https://code.aliyun.com/wanshao/topkn_final&quot;&gt;https://code.aliyun.com/wanshao/topkn_final&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Required to be completed within 24 hours, however...&lt;/p&gt;
&lt;p&gt;Code is hosted on GitHub at &lt;a href=&quot;https://github.com/arkbriar/topKN&quot;&gt;https://github.com/arkbriar/topKN&lt;/a&gt;. This version is lightly modified from Eric&apos;s version...&lt;/p&gt;
&lt;h2&gt;Solution Process&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Key Conditions&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Key Constraints&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. 3 machines: 2 workers, 1 master&lt;br /&gt;2. Output on the master&lt;/td&gt;
&lt;td&gt;1. Timeout: 5min&lt;br /&gt;2. JVM heap size: 3G&lt;br /&gt;3. Off-heap memory not allowed (FileChannel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The problem involves approximately 160,000,000 data entries (strings) distributed across two machines. To perform top(k, n), brute force is clearly not feasible.&lt;/p&gt;
&lt;p&gt;Fully sorting 160,000,000 entries in JVM is impractical -- first, the memory is too small; second, there isn&apos;t enough time. Sorting 10,000,000 entries takes approximately 1 minute and 13 seconds, as referenced from &lt;a href=&quot;https://codereview.stackexchange.com/questions/67387/quicksort-implementation-seems-too-slow-with-large-data&quot;&gt;https://codereview.stackexchange.com/questions/67387/quicksort-implementation-seems-too-slow-with-large-data&lt;/a&gt;. External sorting would be even slower due to disk I/O; plus there&apos;s the network overhead of outputting results to the master.&lt;/p&gt;
&lt;p&gt;Anyone with basic database knowledge knows about indexes, which are used to speed up queries. So let&apos;s first look at MySQL&apos;s indexes.&lt;/p&gt;
&lt;h3&gt;MySQL Indexes&lt;/h3&gt;
&lt;p&gt;Indexes have a crucial impact on query speed, and understanding indexes is the starting point for database performance tuning. Suppose a database table has 1,000,000 records, the DBMS page size is 4K, and each page stores 100 records. Without an index, a query must scan the entire table. In the worst case, if none of the 10,000 data pages are in memory, all of them need to be read; if they are randomly distributed on disk, that means 10,000 I/O operations. Assuming each disk I/O takes 10ms, this totals 100 seconds (in practice, much more). However, with a B+ tree index, only $\log_{100} 1000000 = 3$ page reads are needed, with a worst-case time of 30ms -- that&apos;s the power of indexes.&lt;/p&gt;
&lt;p&gt;MySQL has two types of indexes: B+ tree and Hash indexes.&lt;/p&gt;
&lt;p&gt;Referenced from &lt;a href=&quot;http://www.cnblogs.com/hustcat/archive/2009/10/28/1591648.html&quot;&gt;http://www.cnblogs.com/hustcat/archive/2009/10/28/1591648.html&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;B/B+ Trees&lt;/h3&gt;
&lt;p&gt;We all know that RB-Trees and AVL-Trees are self-balancing binary search trees, where insertion, deletion, and query operations can all be completed in $O(\log_{2} n)$ time in the worst case.&lt;/p&gt;
&lt;p&gt;The B-tree was proposed in 1972 by Rudolf Bayer and Edward M. McCreight. It is a generalized self-balancing tree where each node can have more than two children. Unlike self-balancing binary search trees, B-trees are optimized for systems that read and write large blocks of data. B-trees reduce the number of intermediate steps when locating records, thereby speeding up access. This is why B-trees are commonly used in database and file system implementations.&lt;/p&gt;
&lt;p&gt;Assuming each B-tree node has b child nodes (b partitions), a lookup can be completed with approximately $\log_{b} N$ disk read operations.&lt;/p&gt;
&lt;p&gt;Other characteristics of B/B+ trees are well documented on Wikipedia and other resources, so I won&apos;t elaborate further here.&lt;/p&gt;
&lt;p&gt;Due to time constraints, implementing a B/B+ tree index was impractical, so we considered a degenerate approach -- bucket sort.&lt;/p&gt;
&lt;p&gt;Referenced from:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://zh.wikipedia.org/wiki/B%E6%A0%91&quot;&gt;https://zh.wikipedia.org/wiki/B%E6%A0%91&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://zh.wikipedia.org/wiki/B%2B%E6%A0%91&quot;&gt;https://zh.wikipedia.org/wiki/B%2B%E6%A0%91&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Bucket Sort&lt;/h3&gt;
&lt;p&gt;Let&apos;s first look at the data distribution:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Each string has a length in the range [1, 128] (numbers are uniformly distributed within the Long value range)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Clearly, using string length as the bucket identifier is feasible, but with 160,000,000 strings and 128 buckets, each bucket would still contain 1,250,000 entries -- the bucket granularity is still too coarse.&lt;/p&gt;
&lt;p&gt;Since we have no other distribution information, and strings of the same length are sorted lexicographically, using string length plus the first 2-3 characters as the bucket identifier suffices.&lt;/p&gt;
&lt;p&gt;In this problem, since only 5 rounds of queries are performed and write overhead is very high, we only track the count of strings in each bucket.&lt;/p&gt;
&lt;h3&gt;Basic Approach&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Workers independently scan and build counts for all buckets&lt;/li&gt;
&lt;li&gt;Master aggregates all buckets and builds a global bucket map&lt;/li&gt;
&lt;li&gt;Master uses binary search to find the bucket(s) containing (k, n)&lt;/li&gt;
&lt;li&gt;Master requests all strings within those buckets from the Workers&lt;/li&gt;
&lt;li&gt;Master sorts these strings and outputs the final result&lt;/li&gt;
&lt;/ol&gt;
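&lt;p&gt;To make steps 2-3 concrete, here is a small Python sketch (hedged: the actual contest code is Java, and the names here are mine). It assumes bucket ids sort in the global string order, aggregates the per-worker counts, and binary-searches prefix sums for the contiguous buckets covering ranks [k, k+n):&lt;/p&gt;

```python
import bisect

def locate_buckets(worker_counts, k, n):
    # worker_counts: one {bucket_id: count} dict per worker, where bucket
    # ids sort in the same order as the strings they contain.
    # Step 2: aggregate per-worker counts into a global bucket map.
    global_counts = {}
    for counts in worker_counts:
        for bucket_id, c in counts.items():
            global_counts[bucket_id] = global_counts.get(bucket_id, 0) + c
    bucket_ids = sorted(global_counts)
    # Step 3: prefix sums + binary search for the covering buckets.
    prefix, total = [], 0
    for b in bucket_ids:
        total += global_counts[b]
        prefix.append(total)                   # ranks covered so far
    first = bisect.bisect_right(prefix, k)     # bucket holding rank k
    last = bisect.bisect_left(prefix, min(k + n, total))
    return bucket_ids[first:last + 1]
```

&lt;p&gt;The master would then fetch only these buckets&apos; strings from the workers, sort them, and cut out the n answers.&lt;/p&gt;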
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;The main difficulty in implementing this problem lies in I/O and GC tuning, as well as fully utilizing the CPU.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For GC tuning, we used fixed buffers for operations, essentially avoiding GC altogether.&lt;/li&gt;
&lt;li&gt;Multi-threaded processing ensures full CPU utilization.&lt;/li&gt;
&lt;li&gt;Network overhead is negligible under the above algorithm.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, the remaining I/O optimization was a real headache, along with potential synchronization issues between reading and processing.&lt;/p&gt;
&lt;p&gt;Let&apos;s look at the hardware specifications:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;CPU: 24 cores, 2.2GHz&lt;/li&gt;
&lt;li&gt;Memory: 3G heap memory&lt;/li&gt;
&lt;li&gt;Data disk: in-memory file system&lt;/li&gt;
&lt;li&gt;Temporary/result disk: SSD&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Top 1&apos;s Final Result&lt;/h3&gt;
&lt;p&gt;From Eric, I learned that the top 1 implementation completed all 5 rounds (including JVM startup, pausing, and computation) in just over 11 seconds, which served as a benchmark for how far away from optimal I was.&lt;/p&gt;
&lt;p&gt;(I really can&apos;t figure out how first place got it down to 11 seconds -- maybe I set up the in-memory file system incorrectly?)&lt;/p&gt;
&lt;h3&gt;Unstable I/O Implementation&lt;/h3&gt;
&lt;p&gt;Single-threaded reading per file, fixed 4 threads for processing, with pre-allocated buffers to avoid GC.&lt;/p&gt;
&lt;p&gt;File operations are parallelized: with 5 files processed in parallel, 25 threads in total are in use (one reader plus four processors per file) on a 24-core machine.&lt;/p&gt;
&lt;p&gt;Due to the inability to effectively balance reading and processing, it was hard to achieve process-as-you-read. A lot of time was spent waiting for the read thread to produce a new buffer or for the processing thread to finish with a buffer.&lt;/p&gt;
&lt;p&gt;With system cache cleared, processing all strings took about 8 seconds; without clearing, about 4 seconds.&lt;/p&gt;
&lt;p&gt;This implementation could not be optimized -- the waiting time was entirely at the OS&apos;s mercy.&lt;/p&gt;
&lt;h3&gt;Stable I/O Implementation&lt;/h3&gt;
&lt;p&gt;As mentioned earlier, single-threaded reading isn&apos;t a great approach, and the above processing architecture has an unstable waiting problem.&lt;/p&gt;
&lt;p&gt;So how can we maintain multi-threaded reading while also enabling multi-threaded processing, without wasting time on waiting?&lt;/p&gt;
&lt;p&gt;The answer is &lt;strong&gt;each thread reads a chunk of data, processes it immediately, then reads the next chunk&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But how do we assign each thread its chunk of data, ensuring no duplicate reads and no missed data blocks between threads?&lt;/p&gt;
&lt;p&gt;Here we adopted Eric&apos;s semifinal implementation: FileSegment. By splitting a file into individual Segments, each thread requests the next Segment on its own and reads it using RandomAccessFile (since FileChannel was not allowed).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class FileSegment {
    private File file;

    private long offset;

    private long size;
    ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;public abstract class BufferedFileSegmentReadProcessor implements Runnable {
    private final int bufferMargin = 1024;
    private FileSegmentLoader fileSegmentLoader;
    private int bufferSize;
    private byte[] readBuffer;

    // Note: bufferSize must be no smaller than the segment size produced by
    // fileSegmentLoader, otherwise a segment will not be fully read/processed!
    public BufferedFileSegmentReadProcessor(FileSegmentLoader fileSegmentLoader, int bufferSize) {
        this.bufferSize = bufferSize;
        this.fileSegmentLoader = fileSegmentLoader;
    }

    private int readSegment(FileSegment segment, byte[] readBuffer) throws IOException {
        int limit = 0;
        try (RandomAccessFile randomAccessFile = new RandomAccessFile(segment.getFile(), &quot;r&quot;)) {
            randomAccessFile.seek(segment.getOffset());
            limit = randomAccessFile.read(readBuffer, 0, readBuffer.length);
        }
        return limit;
    }

    protected abstract void processSegment(FileSegment segment, byte[] readBuffer, int limit);

    @Override
    public void run() {
        readBuffer = new byte[bufferSize + bufferMargin];
        try {
            FileSegment segment;
            while ((segment = fileSegmentLoader.nextFileSegment()) != null) {
                int limit = readSegment(segment, readBuffer);
                processSegment(segment, readBuffer, limit);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, each thread cycles between reading and processing. Threads that finish first request the next Segment, ensuring all cores are utilized in the shortest time possible to complete one pass of data processing.&lt;/p&gt;
&lt;p&gt;This approach guarantees very stable reading and processing without needing to wait for Reader-Processor communication. In my test environment, with cache present it took 3-4 seconds to scan through, and 7-8 seconds with cache cleared.&lt;/p&gt;
&lt;p&gt;At this point, the problem was essentially solved. I recall that with this approach, Eric&apos;s final competition result was just over 14 seconds.&lt;/p&gt;
&lt;h3&gt;Top 1&apos;s Optimization&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;While building buckets, simultaneously store the offset and bucket index for the i-th string in each file into two arrays. Save the bucket information along with these two arrays to local storage, creating an index that allows fast querying of all strings within a bucket.&lt;/li&gt;
&lt;li&gt;This information totals approximately 1GB. By building such an index, each query only needs to read the index information and perform a limited number of reads to retrieve all required strings.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In comparison, the previous approach read 5GB of files from the in-memory file system, while the new approach reads about 1GB from SSD. As long as the concurrent read speed difference is not more than 5x, this optimization saves significant time!&lt;/p&gt;
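&lt;p&gt;The index idea can be sketched as follows (hedged: the real layout is a pair of on-disk arrays, while for illustration I fold offset, length, and bucket id into one in-memory list, and the function name is mine). With the per-string offsets saved, answering a query means reading only the strings of the chosen bucket:&lt;/p&gt;

```python
def read_bucket(index, data_file, target_bucket):
    # index: list of (offset, length, bucket_id) triples, one per string,
    # recorded while building the buckets. Read back only the strings
    # belonging to target_bucket, at their saved offsets.
    out = []
    with open(data_file, 'rb') as f:
        for offset, length, bucket_id in index:
            if bucket_id == target_bucket:
                f.seek(offset)
                out.append(f.read(length))
    return out
```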
&lt;h2&gt;Testing&lt;/h2&gt;
&lt;p&gt;Only tested index building.&lt;/p&gt;
&lt;h3&gt;With Cache Cleared&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;$ sysctl -w vm.drop_caches=3
vm.drop_caches = 3
$ time java ${JAVA_OPTS} -cp /exp/topkn/topkn.jar com.alibaba.middleware.topkn.TopknWorker localhost 5527 .
[2017-07-27 14:24:27.512] INFO com.alibaba.middleware.topkn.TopknWorker Connecting to master at localhost:5527
[2017-07-27 14:24:27.519] INFO com.alibaba.middleware.topkn.TopknWorker Building index ...
0.673: [GC (Allocation Failure) 0.673: [ParNew: 809135K-&amp;gt;99738K(943744K), 0.1759305 secs] 809135K-&amp;gt;542134K(3040896K), 0.1761763 secs] [Times: user=3.15 sys=0.69, real=0.17 secs]
[2017-07-27 14:24:32.330] INFO com.alibaba.middleware.topkn.TopknWorker Index built.
Heap
 par new generation   total 943744K, used 903716K [0x0000000700000000, 0x0000000740000000, 0x0000000740000000)
  eden space 838912K,  95% used [0x0000000700000000, 0x00000007311225e0, 0x0000000733340000)
  from space 104832K,  95% used [0x00000007399a0000, 0x000000073fb06b38, 0x0000000740000000)
  to   space 104832K,   0% used [0x0000000733340000, 0x0000000733340000, 0x00000007399a0000)
 concurrent mark-sweep generation total 2097152K, used 442395K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 3980K, capacity 4638K, committed 4864K, reserved 1056768K
  class space    used 434K, capacity 462K, committed 512K, reserved 1048576K

real    0m5.238s
user    1m11.980s
sys     0m26.324s
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Without Cache Cleared&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;$ time java ${JAVA_OPTS} -cp /exp/topkn/topkn.jar com.alibaba.middleware.topkn.TopknWorker localhost 5527 .
[2017-07-27 14:25:39.986] INFO com.alibaba.middleware.topkn.TopknWorker Connecting to master at localhost:5527
[2017-07-27 14:25:39.992] INFO com.alibaba.middleware.topkn.TopknWorker Building index ...
0.590: [GC (Allocation Failure) 0.590: [ParNew: 838912K-&amp;gt;99751K(943744K), 0.1851592 secs] 838912K-&amp;gt;591302K(3040896K), 0.1855308 secs] [Times: user=3.03 sys=0.88, real=0.19 secs]
[2017-07-27 14:25:44.449] INFO com.alibaba.middleware.topkn.TopknWorker Index built.
Heap
 par new generation   total 943744K, used 863664K [0x0000000700000000, 0x0000000740000000, 0x0000000740000000)
  eden space 838912K,  91% used [0x0000000700000000, 0x000000072ea02318, 0x0000000733340000)
  from space 104832K,  95% used [0x00000007399a0000, 0x000000073fb09e00, 0x0000000740000000)
  to   space 104832K,   0% used [0x0000000733340000, 0x0000000733340000, 0x00000007399a0000)
 concurrent mark-sweep generation total 2097152K, used 491550K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 3980K, capacity 4638K, committed 4864K, reserved 1056768K
  class space    used 434K, capacity 462K, committed 512K, reserved 1048576K

real    0m4.754s
user    0m57.144s
sys     0m30.568s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On my 8-core physical machine, scanning through the data in the in-memory file system takes less than 1 second... It seems to be a machine issue -- cloud servers just can&apos;t keep up.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;This year&apos;s middleware competition was really interesting. The problem set high expectations for contestants, involving multi-threaded concurrent read/write, distributed sorting, indexing, socket programming, JVM tuning, and many other aspects. Coming in cold, I went through two broken versions of code, looking everywhere and asking everyone before finally figuring out the final approach.&lt;/p&gt;
&lt;p&gt;Rather than saying I solved it, it&apos;s more accurate to say I was learning how to solve it. In the process, I discovered many interesting technical points that I plan to continue studying.&lt;/p&gt;
</content:encoded></item><item><title>2017 Alibaba Middleware 24h Final (Just for Fun 😀)</title><link>https://blog.crazyark.xyz/zh/posts/topkn/</link><guid isPermaLink="true">https://blog.crazyark.xyz/zh/posts/topkn/</guid><pubDate>Wed, 26 Jul 2017 08:59:26 GMT</pubDate><content:encoded>&lt;p&gt;今年阿里中间件比赛的时候不巧博主心情不好，外加要准备期末考试，并没有参加，非常遗憾。不过好在好基友 &lt;a href=&quot;https://ericfu.me&quot;&gt;Eric Fu&lt;/a&gt; 参加并获得了冠军！今年的主题是分布式数据库，如果想了解详情及复赛的解题思路请读者前往 Eric 的博客。&lt;/p&gt;
&lt;p&gt;博主没有参加甚是遗憾，外加看到题目手痒难耐，遂问基友讨了最后的 24h 极客赛来玩一玩。&lt;/p&gt;
&lt;h2&gt;24h TOPKN&lt;/h2&gt;
&lt;p&gt;题目是分布式数据库上的分页排序，对应的 SQL 执行为 order by id limit k，n；主要的技术挑战为&quot;分布式&quot;的策略，赛题中使用多个文件模拟多个数据分片。&lt;/p&gt;
&lt;p&gt;简称 top(k, n)。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;给定一批数据，求解按顺序从小到大，顺序排名从第 k 下标序号之后连续的 n 个数据 (类似于数据库的 order by asc + limit k,n 语义)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;top(k,3) 代表获取排名序号为 k+1,k+2,k+3 的内容,例:top(10,3) 代表序号为 11、 12、 13 的内容,top(0,3) 代表序号为 1、 2、 3 的内容
需要考虑 k 值几种情况，k 值比较小或者特别大的情况，比如 k=1,000,000,000
对应 k,n 取值范围： 0 &amp;lt;= k &amp;lt; 2^63 和 0 &amp;lt; n &amp;lt; 100
数据全部是文本和数字结合的数据，排序的时候按照整个字符串的长度来排序。例如一个有序的排序如下 (相同长度情况下 ascii 码小的排在前面)：
例： a ab abc def abc123&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;例如给定的 k,n 为 (2,2)，则上面的数据需要的答案为: abc def&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;原题目在 &lt;a href=&quot;https://code.aliyun.com/wanshao/topkn_final&quot;&gt;https://code.aliyun.com/wanshao/topkn_final&lt;/a&gt;。&lt;/p&gt;
&lt;p&gt;要求在 24 小时内完成，然而 ...&lt;/p&gt;
&lt;p&gt;Codes 托管在 github 上 &lt;a href=&quot;https://github.com/arkbriar/topKN&quot;&gt;https://github.com/arkbriar/topKN&lt;/a&gt;，这个版本从基友的版本轻度修改而来 ...&lt;/p&gt;
&lt;h2&gt;解题过程&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;主要条件&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;主要限制&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. 3 台机器: 2 台 worker，1 台 master&lt;br /&gt;2. 输出在 master 上&lt;/td&gt;
&lt;td&gt;1. Timeout: 5min&lt;br /&gt;2. JVM heap size: 3G&lt;br /&gt;3. 不允许使用堆外内存 (FileChannel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;题目中一共有大约 160000000 条数据 (字符串)，并分布在两台机器上，要进行 top(k, n)，首先 brute force 一定是不行的。&lt;/p&gt;
&lt;p&gt;如果要对 160000000 条数据全排序 JVM，第一内存太小，第二时间不够。对 10000000 条数据进行排序大约耗时为 1min13s，参考自 &lt;a href=&quot;https://codereview.stackexchange.com/questions/67387/quicksort-implementation-seems-too-slow-with-large-data&quot;&gt;https://codereview.stackexchange.com/questions/67387/quicksort-implementation-seems-too-slow-with-large-data&lt;/a&gt;；如果要进行外排序，磁盘读写导致耗时更长；以及还有最后要输出到 master 上，还有网络开销。&lt;/p&gt;
&lt;p&gt;稍微了解一些数据库的同学都知道，数据库里有个叫索引 (Index) 的东西，是用来加速查询的。那我们首先来看一下 MySQL 的索引。&lt;/p&gt;
&lt;h3&gt;MySQL 的索引&lt;/h3&gt;
&lt;p&gt;索引对查询的速度有至关重要的影响，理解索引也是进行数据库性能调优的起点。假设数据库中一个表有 1000000 条记录，DBMS 的页面大小为 4K，每页存储 100 条记录。如果没有索引，查询将对整个表进行扫描，最坏的情况下，如果所有数据页都不在内存，需要读取 10000 个页面，如果这 10000 个页面在磁盘上随机分布，需要进行 10000 次 I/O，假设磁盘每次 I/O 时间为 10ms，则总共需要 100s(实际远不止)。但是如果建立 B+ 树索引，则只需要进行 $\log_{100} 1000000 = 3$ 次页面读取，最坏情况下耗时 30ms，这就是索引带来的效果。&lt;/p&gt;
&lt;p&gt;MySQL 有两种索引的类型：B+树和 Hash 索引。&lt;/p&gt;
&lt;p&gt;参考自 &lt;a href=&quot;http://www.cnblogs.com/hustcat/archive/2009/10/28/1591648.html&quot;&gt;http://www.cnblogs.com/hustcat/archive/2009/10/28/1591648.html&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;B/B+ 树&lt;/h3&gt;
&lt;p&gt;我们都知道 RB-Tree 和 AVL-Tree 都是自平衡的二叉查找树，在这样的树上进行插入，删除和查询操作最坏情况都能在 $O(\log_{2} n)$ 的时间内完成。&lt;/p&gt;
&lt;p&gt;B 树是 1972 年由 Rudolf Bayer 和 Edward M.McCreight 提出的，它是一种泛化的自平衡树，每个节点可以有多于两个的子节点。与自平衡二叉查找树不同的是，B 树为系统大块数据的读写操作做了优化。 B 树减少定位记录时所经历的中间过程，从而加快存取速度。所以 B 树常被应用于数据库和文件系统的实现。&lt;/p&gt;
&lt;p&gt;假设 B 树每个节点下有 b 个子节点 (b 个分块)，那么查找可以在约为 $\log_{b} N$ 的磁盘读取开销下完成。&lt;/p&gt;
&lt;p&gt;关于 B/B+ 树的其他特征在维基和其他资料上有详细讲解，在此不再赘述。&lt;/p&gt;
&lt;p&gt;由于时间限定的缘故，实现 B/B+ 树的索引显得不太现实，所以我们将考虑一种退化的方式 —— 桶排序。&lt;/p&gt;
&lt;p&gt;参考自&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://zh.wikipedia.org/wiki/B%E6%A0%91&quot;&gt;https://zh.wikipedia.org/wiki/B%E6%A0%91&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://zh.wikipedia.org/wiki/B%2B%E6%A0%91&quot;&gt;https://zh.wikipedia.org/wiki/B%2B%E6%A0%91&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;桶排序&lt;/h3&gt;
&lt;p&gt;先来看一下数据的分布情况：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;每一行字符串的长度为 [1,128] （数字在 Long 值范围内均匀分布）&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;显然以字符串长来标识桶是可行的，但是我们有 160000000 条字符串，128 个桶每个桶还是有 1250000 条，这样的桶粒度还是太粗。&lt;/p&gt;
&lt;p&gt;由于我们手里没有其他分布情况，以及字符串大小在同长度下是字典序，所以以字符串长度+前 2-3 个字符作为桶的标识就可以了。&lt;/p&gt;
&lt;p&gt;在本题中，由于只进行 5 轮 query，并且写开销非常大，故仅对桶内字符串的数目进行统计。&lt;/p&gt;
&lt;h3&gt;基本思路&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Worker 分别扫描并建立所有桶的计数&lt;/li&gt;
&lt;li&gt;Master 汇总所有桶，并建立全局桶&lt;/li&gt;
&lt;li&gt;Master 通过二分查找，找到 (k, n) 所在的几个桶&lt;/li&gt;
&lt;li&gt;Master 向 Worker 请求获得这些桶内的所有字符串&lt;/li&gt;
&lt;li&gt;Master 对这些字符串进行排序，并输出最终结果&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;实现过程&lt;/h2&gt;
&lt;p&gt;本题实现的最大难点在于 I/O 和 GC 调优，还有充分调动 CPU。&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For GC tuning, we operate on fixed, preallocated buffers, which essentially eliminates GC.&lt;/li&gt;
&lt;li&gt;Multithreaded processing keeps the CPUs fully utilized.&lt;/li&gt;
&lt;li&gt;With the algorithm above, network overhead is essentially negligible.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The remaining I/O tuning, however, is a real headache, with the attendant synchronization problems between reading and processing.&lt;/p&gt;
&lt;p&gt;The hardware:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;CPU: 24 cores, 2.2GHz&lt;/li&gt;
&lt;li&gt;Memory: 3G heap memory&lt;/li&gt;
&lt;li&gt;Data disk: in-memory file system&lt;/li&gt;
&lt;li&gt;Temporary/result disk: SSD&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Top-1 Result&lt;/h3&gt;
&lt;p&gt;A friend told me that the top-1 implementation finished the 5 rounds (JVM startup, pauses, and computation) in a bit over 11s in total, which serves as a yardstick for how far I am from optimal.&lt;/p&gt;
&lt;p&gt;(I honestly cannot figure out how first place got down to 11s; perhaps I set up my in-memory file system wrong?)&lt;/p&gt;
&lt;h3&gt;An Unstable I/O Implementation&lt;/h3&gt;
&lt;p&gt;One reader thread per file and a fixed pool of 4 processing threads, with preallocated buffers to avoid GC.&lt;/p&gt;
&lt;p&gt;File operations are wrapped to run in parallel; 5 files in parallel occupy 25 cores in total.&lt;/p&gt;
&lt;p&gt;In practice, reading and processing could not be balanced: it was hard to process data the moment it was read, and much of the time was spent waiting, either for a reader to produce a new buffer or for a processor to finish one.&lt;/p&gt;
&lt;p&gt;With the system cache cleared, it took about 8s to process all the strings; without clearing, about 4s.&lt;/p&gt;
&lt;p&gt;This implementation cannot really be tuned; how long you wait is entirely up to the OS&apos;s mood.&lt;/p&gt;
&lt;h3&gt;A Stable I/O Implementation&lt;/h3&gt;
&lt;p&gt;As mentioned above, single-threaded reading is not a great idea, and the architecture described suffers from unpredictable waiting.&lt;/p&gt;
&lt;p&gt;So how can we keep reads multithreaded while also processing on multiple threads, without wasting time on waiting?&lt;/p&gt;
&lt;p&gt;The answer: &lt;strong&gt;each thread processes a chunk right after reading it, then reads the next one&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But how do we assign each thread its chunk so that no block is read twice and none is missed?&lt;/p&gt;
&lt;p&gt;Here I adopted Eric&apos;s approach from the semifinal: FileSegment. A file is split into segments; each thread requests the next segment on its own and reads it with RandomAccessFile (FileChannel was not allowed).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class FileSegment {
    private File file;

    private long offset;

    private long size;

    public FileSegment(File file, long offset, long size) {
        this.file = file;
        this.offset = offset;
        this.size = size;
    }

    // Accessors used by the reader below.
    public File getFile() { return file; }

    public long getOffset() { return offset; }

    public long getSize() { return size; }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;public abstract class BufferedFileSegmentReadProcessor implements Runnable {
    private final int bufferMargin = 1024;
    private FileSegmentLoader fileSegmentLoader;
    private int bufferSize;
    private byte[] readBuffer;

    // bufferSize must be at least as large as the segments produced by
    // fileSegmentLoader; otherwise a segment will not be fully read/processed!
    public BufferedFileSegmentReadProcessor(FileSegmentLoader fileSegmentLoader, int bufferSize) {
        this.bufferSize = bufferSize;
        this.fileSegmentLoader = fileSegmentLoader;
    }

    private int readSegment(FileSegment segment, byte[] readBuffer) throws IOException {
        int limit = 0;
        try (RandomAccessFile randomAccessFile = new RandomAccessFile(segment.getFile(), &quot;r&quot;)) {
            randomAccessFile.seek(segment.getOffset());
            limit = randomAccessFile.read(readBuffer, 0, readBuffer.length);
        }
        return limit;
    }

    protected abstract void processSegment(FileSegment segment, byte[] readBuffer, int limit);

    @Override
    public void run() {
        readBuffer = new byte[bufferSize + bufferMargin];
        try {
            FileSegment segment;
            while ((segment = fileSegmentLoader.nextFileSegment()) != null) {
                int limit = readSegment(segment, readBuffer);
                processSegment(segment, readBuffer, limit);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way every thread loops between reading and processing; a thread that finishes early simply requests the next segment, so all cores are kept busy and the data is swept in the shortest possible time.&lt;/p&gt;
&lt;p&gt;This approach reads and processes very stably, with no waiting on Reader-Processor communication. In my test environment a full scan takes roughly 3-4s with a warm cache and 7-8s with the cache cleared.&lt;/p&gt;
&lt;p&gt;At this point the problem is essentially solved; as I recall, Eric&apos;s final contest result with this scheme was a bit over 14s.&lt;/p&gt;
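&lt;p&gt;The FileSegmentLoader behind nextFileSegment() is not shown here. A minimal thread-safe sketch, under my assumptions about its interface (a (file, offset, size) FileSegment, and null once the file is exhausted), could look like this:&lt;/p&gt;

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical loader: carve the file into fixed-size segments and hand them
// out atomically, so no segment is dispensed twice and none is skipped.
public class FileSegmentLoader {
    private final File file;
    private final long totalSize;    // normally file.length(); injected here for testing
    private final long segmentSize;
    private final AtomicLong nextOffset = new AtomicLong(0);

    public FileSegmentLoader(File file, long totalSize, long segmentSize) {
        this.file = file;
        this.totalSize = totalSize;
        this.segmentSize = segmentSize;
    }

    // Returns null once the file is exhausted, matching the while loop in run().
    public FileSegment nextFileSegment() {
        long offset = nextOffset.getAndAdd(segmentSize);
        if (offset >= totalSize) return null;
        return new FileSegment(file, offset, Math.min(segmentSize, totalSize - offset));
    }
}

// Minimal stand-in for the FileSegment class shown earlier.
class FileSegment {
    private final File file;
    private final long offset;
    private final long size;

    FileSegment(File file, long offset, long size) {
        this.file = file;
        this.offset = offset;
        this.size = size;
    }

    public File getFile() { return file; }
    public long getOffset() { return offset; }
    public long getSize() { return size; }
}
```

&lt;p&gt;getAndAdd keeps the segment handout lock-free, so dispatching segments never becomes the bottleneck even with many reader threads.&lt;/p&gt;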
&lt;h3&gt;Top 1&apos;s Optimization&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;While building the buckets, also record, for the i-th string of each file, its offset and bucket number in two arrays, and persist the bucket information together with these arrays. This forms an index that can quickly locate every string in a bucket&lt;/li&gt;
&lt;li&gt;This information totals roughly 1G. With such an index, a query only reads the index plus a bounded number of targeted reads to fetch all the needed strings&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By comparison, the earlier approach reads 5G of files from the in-memory file system, while this one reads about 1G from SSD. As long as the concurrent read speed gap is under 5x, this optimization saves a good amount of time!&lt;/p&gt;
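&lt;p&gt;The parallel-array index from point 1 might look like the sketch below; the names and layout are illustrative guesses, not the actual top-1 code:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// For the i-th string we keep its byte offset and its bucket id in two parallel
// arrays. To answer a query, scan the (roughly 1G, easily resident) arrays once,
// collect the offsets of the requested buckets, then issue one bounded read per
// offset instead of rescanning 5G of raw data.
public class StringIndex {
    private final long[] offsets;   // offsets[i] = byte offset of the i-th string
    private final int[] bucketIds;  // bucketIds[i] = bucket of the i-th string

    public StringIndex(long[] offsets, int[] bucketIds) {
        this.offsets = offsets;
        this.bucketIds = bucketIds;
    }

    // All file offsets of strings falling in buckets [firstBucket, lastBucket].
    public List<Long> offsetsForBuckets(int firstBucket, int lastBucket) {
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < bucketIds.length; i++) {
            if (bucketIds[i] >= firstBucket && bucketIds[i] <= lastBucket) {
                result.add(offsets[i]);
            }
        }
        return result;
    }
}
```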
&lt;h2&gt;Testing&lt;/h2&gt;
&lt;p&gt;Only index building was tested.&lt;/p&gt;
&lt;h3&gt;Cache cleared&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;$ sysctl -w vm.drop_caches=3
vm.drop_caches = 3
$ time java ${JAVA_OPTS} -cp /exp/topkn/topkn.jar com.alibaba.middleware.topkn.TopknWorker localhost 5527 .
[2017-07-27 14:24:27.512] INFO com.alibaba.middleware.topkn.TopknWorker Connecting to master at localhost:5527
[2017-07-27 14:24:27.519] INFO com.alibaba.middleware.topkn.TopknWorker Building index ...
0.673: [GC (Allocation Failure) 0.673: [ParNew: 809135K-&amp;gt;99738K(943744K), 0.1759305 secs] 809135K-&amp;gt;542134K(3040896K), 0.1761763 secs] [Times: user=3.15 sys=0.69, real=0.17 secs]
[2017-07-27 14:24:32.330] INFO com.alibaba.middleware.topkn.TopknWorker Index built.
Heap
 par new generation   total 943744K, used 903716K [0x0000000700000000, 0x0000000740000000, 0x0000000740000000)
  eden space 838912K,  95% used [0x0000000700000000, 0x00000007311225e0, 0x0000000733340000)
  from space 104832K,  95% used [0x00000007399a0000, 0x000000073fb06b38, 0x0000000740000000)
  to   space 104832K,   0% used [0x0000000733340000, 0x0000000733340000, 0x00000007399a0000)
 concurrent mark-sweep generation total 2097152K, used 442395K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 3980K, capacity 4638K, committed 4864K, reserved 1056768K
  class space    used 434K, capacity 462K, committed 512K, reserved 1048576K

real    0m5.238s
user    1m11.980s
sys     0m26.324s
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Cache not cleared&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;$ time java ${JAVA_OPTS} -cp /exp/topkn/topkn.jar com.alibaba.middleware.topkn.TopknWorker localhost 5527 .
[2017-07-27 14:25:39.986] INFO com.alibaba.middleware.topkn.TopknWorker Connecting to master at localhost:5527
[2017-07-27 14:25:39.992] INFO com.alibaba.middleware.topkn.TopknWorker Building index ...
0.590: [GC (Allocation Failure) 0.590: [ParNew: 838912K-&amp;gt;99751K(943744K), 0.1851592 secs] 838912K-&amp;gt;591302K(3040896K), 0.1855308 secs] [Times: user=3.03 sys=0.88, real=0.19 secs]
[2017-07-27 14:25:44.449] INFO com.alibaba.middleware.topkn.TopknWorker Index built.
Heap
 par new generation   total 943744K, used 863664K [0x0000000700000000, 0x0000000740000000, 0x0000000740000000)
  eden space 838912K,  91% used [0x0000000700000000, 0x000000072ea02318, 0x0000000733340000)
  from space 104832K,  95% used [0x00000007399a0000, 0x000000073fb09e00, 0x0000000740000000)
  to   space 104832K,   0% used [0x0000000733340000, 0x0000000733340000, 0x00000007399a0000)
 concurrent mark-sweep generation total 2097152K, used 491550K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 3980K, capacity 4638K, committed 4864K, reserved 1056768K
  class space    used 434K, capacity 462K, committed 512K, reserved 1048576K

real    0m4.754s
user    0m57.144s
sys     0m30.568s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On my small 8-core physical machine, one pass over the data in the in-memory file system takes under 1s... so it does look like a machine problem; the cloud servers just were not up to it 🤣&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;This year&apos;s middleware contest was really interesting. The problem sets a fairly high bar for contestants: solving it touches multithreaded concurrent reads and writes, distributed sorting, indexing, socket programming, JVM tuning, and more. Charging straight in as I did, I promptly had two versions of my code fail, and only after much reading and asking around did I finally work out the final shape of the solution.&lt;/p&gt;
&lt;p&gt;Rather than saying I solved it, I would say I was learning how to solve it. Along the way I found many technical topics worth digging into, and I plan to keep studying them next.&lt;/p&gt;
</content:encoded></item></channel></rss>