Graph Algorithms and Data Structures

Reference book: Algorithms Illuminated (Part 2): Graph Algorithms and Data Structures

Corresponding course: Coursera: Graph Search, Shortest Paths, and Data Structures

Computing a Topological Ordering

Principles

Notes:

  • $(v,w)$ denotes an edge from vertex $v$ to vertex $w$.
  • A graph can have more than one valid topological ordering.
  • Not every graph has a topological ordering.
  • It is impossible to topologically order the vertices of a graph that contains a directed cycle.
  • Directed cycles are the only obstruction to topological orderings (a directed graph without any directed cycles is called a directed acyclic graph, or simply a DAG).
  • Every DAG has a topological ordering.
  • A sink vertex is one with no outgoing edges.

Pseudocode

The global variable curLabel keeps track of where we are in the topological ordering. Our algorithm will compute an ordering in reverse order, so curLabel counts down from the number of vertices to 1.
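
This description translates almost directly into Python. Below is a minimal sketch, assuming the graph is given as a dict mapping each vertex to the vertices it points to (a fuller version appears in the Kosaraju implementation later in these notes):

def topo_sort(graph):
    # graph: dict mapping each vertex to an iterable of its out-neighbors
    explored = set()
    ordering = {}               # vertex -> position in the topological ordering
    cur_label = len(graph)      # counts down from the number of vertices to 1

    def dfs_topo(s):
        nonlocal cur_label
        explored.add(s)
        for v in graph.get(s, []):
            if v not in explored:
                dfs_topo(v)
        ordering[s] = cur_label  # s gets the largest position not yet assigned
        cur_label -= 1

    for v in graph:
        if v not in explored:
            dfs_topo(v)
    return ordering

# example DAG: s -> v, s -> w, v -> t, w -> t
print(topo_sort({'s': ['v', 'w'], 'v': ['t'], 'w': ['t'], 't': []}))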

Computing Strongly Connected Components

Principles

A strongly connected component or SCC of a directed graph is a maximal subset $S \subseteq V$ of vertices such that there is a directed path from any vertex in $S$ to any other vertex in $S$. For example, the strongly connected components of the graph in Figure 8.16 are $\{1, 3, 5\}$, $\{11\}$, $\{2, 4, 7, 9\}$, and $\{6, 8, 10\}$. Within each component, it’s possible to get from anywhere to anywhere else (as you should check). Each component is maximal subject to this property, as there’s no way to “move to the left” from one SCC to another.

Intuitively, we want to first discover a “sink SCC,” meaning an SCC with no outgoing edges (like SCC#4 in Figure 8.16), and then work backward.

Our algorithm will use two passes of depth-first search. The first pass computes a magical ordering in which to process the vertices, and the second follows this ordering to discover the SCCs one by one. This two-pass strategy is known as Kosaraju’s algorithm.

With a directed acyclic graph $G$, the vertex positions constitute a topological ordering, and the vertex in the last position must be a sink vertex of $G$, with no outgoing edges. (Any such edges would travel backward in the ordering.) Perhaps with a general directed graph $G$, the vertex in the last position always belongs to a sink SCC? Sadly, no. If we label each SCC of $G$ with the smallest position of one of its vertices, these labels constitute a topological ordering of the meta-graph of SCCs defined in Proposition 8.9.

Proposition 8.9 (The SCC Meta-Graph Is Directed Acyclic):

Let $G = (V,E)$ be a directed graph. Define the corresponding meta-graph $H = (X, F)$ with one meta-vertex $x \in X$ per SCC of $G$ and a meta-edge $(x, y)$ in $F$ whenever there is an edge in $G$ from a vertex in the SCC corresponding to $x$ to one in the SCC corresponding to $y$. Then $H$ is a directed acyclic graph.

Theorem 8.10 (Topological Ordering of the SCCs):

Let $G$ be a directed graph, with the vertices ordered arbitrarily, and for each vertex $v \in V$ let $f(v)$ denote the position of $v$ computed by the TopoSort algorithm. Let $S_1, S_2$ denote two SCCs of $G$, and suppose $G$ has an edge $(v,w)$ with $v \in S_1$ and $w \in S_2$. Then,
$$
\min_{x \in S_1} f(x) < \min_{y \in S_2} f(y)
$$

Summarizing, after one pass of depth-first search, we can immediately identify a vertex in a source SCC. The only problem? We want to identify a vertex in a sink SCC. The fix? Reverse the graph first.

Pseudocode

Implementation details:

  1. A smarter implementation runs the TopoSort algorithm backward in the original input graph, by replacing the clause “each edge $(s, v)$ in s’s outgoing adjacency list” in the DFS-Topo subroutine with
    “each edge $(v, s)$ in s’s incoming adjacency list.”
  2. For best results, the first pass of depth-first search should export an array that contains the vertices (or pointers to them) in order of their positions, so that the second pass can process them with a simple array scan. This adds only constant overhead to the TopoSort subroutine (as you should check).

Implementation

import numpy as np


def topo_sort(g_dict, vertices_num):
    """
    param g_dict: dict, adjacency lists; vertices are numbered from 1 to n.
    """

    def dfs_topo(s):
        """
        param s: a vertex belonging to V.
        """
        nonlocal topo_label

        explore[s - 1] = True

        for v in g_dict.get(s, []):
            # edge (s, v) is outgoing if the graph is not reversed
            # edge (v, s) is incoming if the graph is reversed
            if not explore[v - 1]:
                dfs_topo(v)

        f_v[s - 1] = topo_label
        topo_label -= 1

    f_v = [0] * vertices_num          # topological labels
    explore = [False] * vertices_num  # explored flags

    topo_label = vertices_num

    # vertex numbers start from 1
    for vertex in range(1, vertices_num + 1):  # loop over all vertices (not only tail vertices)
        if not explore[vertex - 1]:            # if vertex is unexplored
            dfs_topo(vertex)

    return f_v


def get_graph_dict(tail_vertices, head_vertices):

    g_dict = {}

    for i, v in enumerate(tail_vertices):
        if v in g_dict:
            g_dict[v].add(head_vertices[i])
        else:
            g_dict[v] = {head_vertices[i]}

    return g_dict


def count_SCC(scc_list):
    count_dict = {}

    for i in scc_list:
        count_dict[i] = count_dict.get(i, 0) + 1

    return count_dict


def Kosaraju(g):
    """
    param g: ndarray, edge list of a directed graph G = (V, E)
             (one row per edge); vertices are numbered from 1 to n.
    """

    def dfs_scc(s):
        explore[s - 1] = True
        vertices_SCC[s - 1] = num_SCC

        for v_out in g_outgoing.get(s, []):
            if not explore[v_out - 1]:
                dfs_scc(v_out)

    tail_vertices = g[:, 0]
    head_vertices = g[:, 1]
    vertices_num = g.max()

    g_outgoing = get_graph_dict(tail_vertices, head_vertices)
    g_incoming = get_graph_dict(head_vertices, tail_vertices)  # reversed graph

    f_v = topo_sort(g_incoming, vertices_num)  # topological ordering of the reversed graph
    f_v = np.argsort(f_v)

    explore = [False] * vertices_num  # explored flags

    num_SCC = 0
    vertices_SCC = [0] * vertices_num

    for v in f_v:
        if not explore[v]:
            num_SCC += 1
            dfs_scc(v + 1)

    k = count_SCC(vertices_SCC)

    count = sorted(k.items(), key=lambda x: x[1], reverse=True)

    # print the sizes of the five largest SCCs
    for label, num in count[:5]:
        print(num)


def main():
    g = np.loadtxt('/home/ubuntu/Desktop/python_test/SCC.txt', dtype='int')
    Kosaraju(g)


if __name__ == "__main__":
    # ---small test cases---

    # case 1. Answer: 3, 3, 1
    # a = np.array([[1, 2],
    #               [2, 3],
    #               [3, 1],
    #               [1, 4],
    #               [4, 5],
    #               [5, 6],
    #               [6, 7],
    #               [7, 5]])

    # case 2. Answer: 3, 3, 3, 0, 0
    # a = np.array([[1, 4],
    #               [2, 8],
    #               [3, 6],
    #               [4, 7],
    #               [5, 2],
    #               [6, 9],
    #               [7, 1],
    #               [8, 5],
    #               [8, 6],
    #               [9, 7],
    #               [9, 3]])

    # Kosaraju(a)

    # reference: https://www.coursera.org/learn/algorithms-graphs-data-structures/discussions/weeks/1/threads/JiHDKJnVEeetAQoC-amJYg/replies/sZAZXrVjEeeEagonu2xPWg
    import sys, threading
    sys.setrecursionlimit(800000)
    threading.stack_size(67108864)

    thread = threading.Thread(target=main)
    thread.start()

Dijkstra’s Shortest-Path

Principles

Dijkstra’s algorithm solves the single-source shortest path problem.

You should not use Dijkstra’s algorithm in applications with negative edge lengths.

Why Not Breadth-First Search?

Remember that breadth-first search computes the minimum number of edges in a path from the starting vertex to every other vertex. This is the special case of the single-source shortest path problem in which every edge has length 1. We saw in Quiz 9.1 that, with general nonnegative edge lengths, a shortest path need not be a path with the fewest number of edges. Many applications of shortest paths, such as computing driving directions or a sequence of financial transactions, inevitably involve edges with different lengths.

Pseudocode

When all the edges have length 1, it’s equivalent to breadth-first search.
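
The book’s full pseudocode is not reproduced here, but the straightforward (heap-free) version can be sketched as follows, assuming the graph is a dict mapping each vertex to a list of (neighbor, length) pairs; the heap-based variant used for the programming assignment follows in the next section.

def dijkstra_basic(graph, source):
    # graph: dict mapping each vertex to a list of (neighbor, edge_length) pairs
    inf = float('inf')
    dist = {v: inf for v in graph}
    dist[source] = 0
    processed = {source}

    while True:
        best_w, best_d = None, inf
        # scan all edges crossing from the processed set to the rest of the graph
        for v in processed:
            for w, length in graph[v]:
                if w not in processed and dist[v] + length < best_d:
                    best_w, best_d = w, dist[v] + length
        if best_w is None:          # no crossing edge left: remaining vertices are unreachable
            return dist
        processed.add(best_w)
        dist[best_w] = best_d

# example: shortest distances from 's' in a small 4-vertex graph
g = {'s': [('v', 1), ('w', 4)], 'v': [('w', 2), ('t', 6)], 'w': [('t', 3)], 't': []}
print(dijkstra_basic(g, 's'))  # {'s': 0, 'v': 1, 'w': 3, 't': 6}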

Implementation

# 1. If the directed graph has negative edge weights, the algorithm's correctness is not
#    guaranteed (see the exercises).
# 2. In each iteration, exactly one vertex (the one reached by the smallest candidate edge)
#    is added to the already-processed set (note: only one).
# 3. Treat the processed set as one big merged vertex M; among the vertices that M points
#    to, pick the vertex N reached with the smallest total length.
# 4. Use heapq to merge N into M. Merging can produce two special cases:
#    case 1: some vertex P pointed to by N is already in M
#    case 2: some vertex P pointed to by N is also pointed to by M
#    Case 1 can be detected with a flag; case 2 requires popping the heap multiple times.
#
# 5. About Python's built-in heapq module, see:
#    https://stackoverflow.com/questions/45892736/python-heapq-how-do-i-sort-the-heap-using-nth-element-of-the-list-of-lists

import heapq


def load_graph(file_path):
    # The loaded file contains an adjacency-list representation
    # of an undirected weighted graph.

    graph_dict = {}

    with open(file_path, 'r') as f:
        for line in f:
            try:
                line = line.strip().rstrip('\n').split('\t')
                vertex_num = int(line[0])
                out_edges = list(map(lambda k: (int(k.split(',')[1]), int(k.split(',')[0])),
                                     line[1:]))  # (length, vertex)

                assert vertex_num not in graph_dict

                graph_dict[vertex_num] = out_edges

            except (ValueError, AssertionError):
                print('Check the adjacency list representation in the txt file!')
                return -1
        else:
            return graph_dict

    print('Break in loading process, check the txt file!')
    return -1


# implementation of Dijkstra's algorithm
def count_shortest_path(graph, source_vertex):
    # param graph: dict, adjacency-list representation.
    # param source_vertex: int, starting from 1; the source vertex from which to compute the
    #                      shortest-path distances to every other vertex of the graph.
    vertex_num = len(graph)
    assert source_vertex <= vertex_num, 'The start vertex is not in graph!'

    dist_source = [float('inf')] * vertex_num
    dist_source[source_vertex - 1] = 0
    vertex_merged = graph[source_vertex].copy()  # heap of (length, vertex) for edges leaving the merged vertex
    heapq.heapify(vertex_merged)

    while vertex_merged:
        length, vertex = heapq.heappop(vertex_merged)

        if dist_source[vertex - 1] == float('inf'):
            dist_source[vertex - 1] = length
        else:
            # this happens when the newly merged vertex and the merged set point to the same vertex
            assert length >= dist_source[vertex - 1]
            continue  # heapq has no delete operation, so duplicates appear and must be popped repeatedly

        # merge
        for l, v in graph[vertex]:
            # if v is already in the merged vertex, discard it (its shortest distance is already known)
            if dist_source[v - 1] != float('inf'):
                continue

            new_len = length + l

            heapq.heappush(vertex_merged, (new_len, v))

    return dist_source


if __name__ == "__main__":
    file_path = '/home/ubuntu/Desktop/python_test/dijkstraData.txt'
    graph_dict = load_graph(file_path)  # out-edge format: (length, vertex)

    assert graph_dict != -1, 'Wrong in loading txt file!'

    source_vertex = 1

    shortest_path = count_shortest_path(graph_dict, source_vertex)  # list of shortest distances

    # for i, v in enumerate(shortest_path, 1):
    #     print(i, v)

    print(shortest_path[6],
          shortest_path[36],
          shortest_path[58],
          shortest_path[81],
          shortest_path[98],
          shortest_path[114],
          shortest_path[132],
          shortest_path[164],
          shortest_path[187],
          shortest_path[196])

A note on heapq: it operates on an ordinary Python list, and the elements should be tuples; if an element is itself a list, the heap is ordered by the sublist’s first item. Entries can be constructed as (a, [b, c]) so that the mutable part of an entry can still be modified later. A list built only with heappush needs no explicit heapify before heappop, but a list assembled by hand (or modified in place) must be re-heapified before further heap operations.
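
A small example of these points (entries are (key, mutable payload) tuples; the list is heapified once because it was assembled by hand rather than built with heappush):

import heapq

entries = [(5, ['e']), (1, ['a']), (3, ['c'])]   # (key, mutable payload)
heapq.heapify(entries)                           # needed: the list was built by hand

heapq.heappush(entries, (2, ['b']))
print(heapq.heappop(entries))                    # (1, ['a'])  -- smallest key first
print(heapq.heappop(entries))                    # (2, ['b'])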

The Heap Data Structure

A heap is a data structure that keeps track of an evolving set of objects with keys and can quickly identify the object with the smallest key.

Supported Operations

Basic Operations:

  • Insert: given a heap $H$ and a new object $x$, add $x$ to $H$.
  • Extract-Min: given a heap $H$, remove and return from $H$ an object with the smallest key (or a pointer to it).

Extra Operations:

  • Find-Min: given a heap $H$, return an object with the smallest key (or a pointer to it).
  • Heapify: given objects $x_1, \ldots, x_n$, create a heap containing them.
  • Delete: given a heap $H$ and a pointer to an object $x$ in $H$, delete $x$ from $H$.

When to Use a Heap:

If your application requires fast minimum (or maximum) computations on a dynamically changing set of objects, the heap is usually the data structure of choice.

Whenever you see an algorithm or program with lots of brute-force minimum or maximum computations, a light bulb should go off in your head: This calls out for a heap!

Application

Sorting

Median Maintenance

Recall that the median of a collection of numbers is its “middle element.” In an array with odd length $2k - 1$, the median is the $k$th order statistic (that is, the $k$th-smallest element). In an array with even length $2k$, both the $k$th and $(k + 1)$th order statistics are considered median elements.

For a less obvious application of heaps, let’s consider the median maintenance problem. You are presented with a sequence of numbers, one by one; assume for simplicity that they are distinct. Each time you receive a new number, your responsibility is to reply with the median element of all the numbers you’ve seen thus far.

Using heaps, we can solve the median maintenance problem in just logarithmic time per round. The key idea is to maintain two heaps $H_1$ and $H_2$ while satisfying two invariants. The first invariant is that $H_1$ and $H_2$ are balanced, meaning they each contain the same number of elements (after an even round) or that one contains exactly one more element than the other (after an odd round). The second invariant is that $H_1$ and $H_2$ are ordered, meaning every element in $H_1$ is smaller than every element in $H_2$.

For example, if the numbers so far have been $1, 2, 3, 4, 5$, then $H_1$ stores $1$ and $2$ and $H_2$ stores $4$ and $5$; the median element $3$ is allowed to go in either one, as either the maximum element of $H_1$ or the minimum element of $H_2$. If we’ve seen $1, 2, 3, 4, 5, 6$, then the first three numbers are in $H_1$ and the second three are in $H_2$; both the maximum element of $H_1$ and the minimum element of $H_2$ are median elements. One twist: $H_2$ will be a standard heap, supporting Insert and Extract-Min, while $H_1$ will be the “max” variant, supporting Insert and Extract-Max. This way, we can extract the median element with one heap operation, whether it’s in $H_1$ or $H_2$.

We still must explain how to update $H_1$ and $H_2$ each time a new element arrives so that they remain balanced and ordered. To figure out where to insert a new element $x$ so that the heaps remain ordered, it’s enough to compute the maximum element $y$ in $H_1$ and the minimum element $z$ in $H_2$. If $x$ is less than $y$, it has to go in $H_1$; if it’s more than $z$, it has to go in $H_2$; if it’s in between, it can go in either one. Except for one case: if this insertion would leave one heap with two more elements than the other, first extract the maximum of $H_1$ (or the minimum of $H_2$) from the larger heap and insert it into the other heap; this restores the balance invariant without violating the ordering invariant.

Speeding Up Dijkstra’s Algorithm

This is already reflected in the program of the previous section.

Implementation Details

There are two ways to visualize objects in a heap, as a tree (better for pictures and exposition) or as an array (better for an implementation).

Heaps as Trees

A heap can be viewed as a rooted binary tree.

Duplicate keys are allowed. For example, here’s a valid heap containing nine objects:

Here’s another heap, with the same set of keys:

Heaps as Arrays

In our minds we visualize a heap as a tree, but in an implementation we use an array with length equal to the maximum number of objects we expect to store.

The first element of the array corresponds to the tree’s root, the next two elements to the next level of the tree (in the same order), and so on (Figure 10.5).
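
With 0-based Python lists (the book numbers positions from 1), the tree structure is pure index arithmetic; the sketches below assume this convention:

def parent(i):
    return (i - 1) // 2              # parent of the node stored at index i

def children(i):
    return 2 * i + 1, 2 * i + 2      # left and right children of index i

# the root lives at index 0; e.g. the node at index 4 has parent 1 and children 9, 10
print(parent(4), children(4))        # 1 (9, 10)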

Implementing Insert in $O(\log n)$ Time

Insert operation: given a heap $H$ and a new object $x$, add $x$ to $H$.

After $x$’s addition to $H$, $H$ should still correspond to a full binary tree (with one more node than before) that satisfies the heap property. The operation should take $O(\log n)$ time, where $n$ is the number of objects in the heap.

Because a heap is a full binary tree, it has $\approx \log_2 n$ levels, where $n$ is the number of objects in the heap. The number of swaps is at most the number of levels, and only a constant amount of work is required per swap. We conclude that the worst-case running time of the Insert operation is $O(\log n)$, as desired.
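
A minimal sketch of Insert on a 0-based Python list (heapq.heappush already does this; the version here just makes the “bubble-up” swaps explicit):

def heap_insert(h, x):
    # h: list satisfying the heap property; x: new key
    h.append(x)                      # add x as the last leaf, keeping the tree full
    i = len(h) - 1
    while i > 0:
        p = (i - 1) // 2             # parent index
        if h[p] <= h[i]:             # heap property restored
            break
        h[i], h[p] = h[p], h[i]      # bubble up: swap x with its parent
        i = p

h = [4, 8, 9, 12]
heap_insert(h, 7)
print(h)  # [4, 7, 9, 12, 8]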

Implementing Extract-Min in $O(\log n)$ Time

Extract-Min operation: given a heap $H$, remove and return from $H$ an object with the smallest key.

The challenge is to restore the full binary tree and heap properties after ripping out a heap’s root.

The number of swaps is at most the number of levels, and only a constant amount of work is required per swap. Because there are $\approx \log_2 n$ levels, we conclude that the worst-case running time of the Extract-Min operation is $O(\log n)$, where $n$ is the number of objects in the heap.
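
And a matching sketch of Extract-Min (what heapq.heappop does): move the last leaf to the root, then repeatedly swap it with its smaller child.

def heap_extract_min(h):
    # h: non-empty list satisfying the heap property
    minimum = h[0]
    last = h.pop()                   # remove the last leaf, keeping the tree full
    if h:
        h[0] = last                  # move it to the root...
        i = 0
        while True:
            l, r = 2 * i + 1, 2 * i + 2
            smallest = i
            if l < len(h) and h[l] < h[smallest]:
                smallest = l
            if r < len(h) and h[r] < h[smallest]:
                smallest = r
            if smallest == i:        # heap property restored
                break
            h[i], h[smallest] = h[smallest], h[i]   # ...and bubble it down
            i = smallest
    return minimum

h = [4, 7, 9, 12, 8]
print(heap_extract_min(h), h)  # 4 [7, 8, 9, 12]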

Programming Example

Implementation of Median Maintenance:

import heapq


class Max_heap:
    def __init__(self, num):
        self.heap = [-num]
        self.length = 1

    def __len__(self):
        return self.length

    def pop_max(self):
        self.length -= 1
        return -heapq.heappop(self.heap)

    def push(self, num):
        self.length += 1
        heapq.heappush(self.heap, -num)

    def get_list(self):
        return self.heap


class Min_heap:
    def __init__(self, num):
        self.heap = [num]
        self.length = 1

    def __len__(self):
        return self.length

    def pop_min(self):
        self.length -= 1
        return heapq.heappop(self.heap)

    def push(self, num):
        self.length += 1
        heapq.heappush(self.heap, num)

    def get_list(self):
        return self.heap


def load_data(file_path):

    data = []

    with open(file_path, 'r') as f:
        for i in f:
            try:
                data.append(int(i.strip('\n')))
            except ValueError:
                pass

    return data


if __name__ == "__main__":
    d = load_data('/home/ubuntu/Desktop/python_test/Median.txt')

    mid_sum = d[0] + min(*d[:2])  # don't forget to account for the first two numbers

    h1_heap = Max_heap(min(*d[:2]))
    h2_heap = Min_heap(max(*d[:2]))

    for i in d[2:]:

        h1_max = h1_heap.pop_max()
        h2_min = h2_heap.pop_min()

        a, b, c = sorted([i, h1_max, h2_min])

        if len(h1_heap) == len(h2_heap):
            mid_sum += b
            h2_heap.push(b)
        elif len(h1_heap) == len(h2_heap) - 1:
            mid_sum += b
            h1_heap.push(b)
        elif len(h1_heap) - 1 == len(h2_heap):
            mid_sum += a
            h2_heap.push(b)
        else:
            raise Exception('balance is broken!')

        h1_heap.push(a)
        h2_heap.push(c)

    print(mid_sum)

Search Trees

A search tree, like a heap, is a data structure for storing an evolving set of objects associated with keys (and possibly lots of other data). It maintains a total ordering over the stored objects, and can support a richer set of operations than a heap, at the expense of increased space and, for some operations, somewhat slower running times.

Sorted Arrays

A good way to think about a search tree is as a dynamic version of a sorted array—it can do everything a sorted array can do, while also accommodating fast insertions and deletions.

Supported Operations:

  • Search: for a key $k$, return a pointer to an object in the data structure with key $k$ (or report that no such object exists). (via binary search)
  • Min (Max): return a pointer to the object in the data structure with the smallest (respectively, largest) key.
  • Predecessor (Successor): given a pointer to an object in the data structure, return a pointer to the object with the next-smallest (respectively, next-largest) key. If the given object has the minimum (respectively, maximum) key, report “none.”
  • Output Sorted: output the objects in the data structure one by one in order of their keys.
  • Select: given a number $i$, between 1 and the number of objects, return a pointer to the object in the data structure with the $i$th-smallest key.
  • Rank: given a key $k$, return the number of objects in the data structure with key at most $k$.

Unsupported Operations:

Insert: given a new object $x$, add $x$ to the data structure.

Delete: for a key $k$, delete an object with key $k$ from the data structure, if one exists.

These two operations aren’t impossible to implement with a sorted array, but they’re painfully slow—inserting or deleting an element while maintaining the sorted array property requires linear time in the worst case. Is there an alternative data structure that replicates all the functionality of a sorted array, while matching the logarithmic-time performance of a heap for the Insert and Delete operations?

Search Trees

The raison d’être of a search tree is to support all the operations that a sorted array supports, plus insertions and deletions. All the operations except Output-Sorted run in $O(\log n)$ time, where $n$ is the number of objects in the search tree. The Output-Sorted operation runs in $O(n)$ time, and this is as good as it gets (since it must output $n$ objects).

An important caveat: The running times in Table 11.2 are achieved by a balanced search tree, which is a more sophisticated version of the standard binary search tree. These running times are not guaranteed by an unbalanced search tree.

When to Use a Balanced Search Tree:

If your application requires maintaining an ordered representation of a dynamically changing set of objects, the balanced search tree (or a data structure based on one) is usually the data structure of choice.

The Search Tree Property:

  • For every object $x$, objects in $x$’s left subtree have keys smaller than that of $x$.
  • For every object $x$, objects in $x$’s right subtree have keys larger than that of $x$.

For example, here’s a search tree containing objects with the keys $\{1, 2, 3, 4, 5\}$:

Binary search trees and heaps differ in several ways. Heaps can be thought of as trees, but they are implemented as arrays, with no explicit pointers between objects. A search tree explicitly stores three pointers per object, and hence uses more space (by a constant factor). Heaps don’t need explicit pointers because they always correspond to full binary trees, while binary search trees can have an arbitrary structure.

Heaps are optimized for fast minimum computations, and the heap property—that a child’s key is only bigger than its parent’s key—makes the minimum-key object easy to find (it’s the root). Search trees are optimized for—wait for it—search, and the search tree property is defined accordingly.

The height of a tree is defined as the length of a longest path from its root to a leaf.

Implementing Search in $O(height)$ Time

Search operation: for a key $k$, return a pointer to an object in the data structure with key $k$ (or report that no such object exists).

  1. Start at the root node.
  2. Repeatedly traverse left and right child pointers, as appropriate (left if $k$ is less than the current node’s key, right if $k$ is bigger).
  3. Return a pointer to an object with key $k$ (if found) or “none” (upon reaching a null pointer).
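
A minimal node definition and Search sketch (an ordinary, unbalanced binary search tree, for illustration only; this hypothetical Node class is assumed by the other tree sketches below):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_search(root, k):
    node = root
    while node is not None:
        if k == node.key:
            return node                              # found an object with key k
        node = node.left if k < node.key else node.right
    return None                                      # reached a null pointer: no such key

# tiny demo: root 3 with left child 1 and right child 5
root = Node(3); root.left = Node(1); root.right = Node(5)
print(bst_search(root, 5).key, bst_search(root, 4))  # 5 None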

Implementing Min and Max in $O(height)$ Time

Min and Max operations:

Min (Max): return a pointer to the object in the data structure with the smallest (respectively, largest) key.

  1. Start at the root node.
  2. Traverse left child pointers (right child pointers) as long as possible, until encountering a null pointer.
  3. Return a pointer to the last object visited.

Implementing Predecessor in $O(height)$ Time

The implementation of the Successor operation is analogous.

Predecessor: given a pointer to an object in the data structure, return a pointer to the object with the next-smallest key. (If the object has the minimum key, report “none.”)

  1. If $x$’s left subtree is non-empty, return the result of Max applied to this subtree.
  2. Otherwise, traverse parent pointers upward toward the root. If the traversal visits consecutive nodes $y$ and $z$ with y a right child of $z$, return a pointer to $z$.
  3. Otherwise, report “none.”

Implementing Output-Sorted in $O(n)$ Time

Output-Sorted: output the objects in the data structure one by one in order of their keys.

  1. Recursively call Output-Sorted on the root’s left subtree.
  2. Output the object at the root.
  3. Recursively call Output-Sorted on the root’s right subtree.

For a tree containing $n$ objects, the operation performs n recursive calls (one initiated at each node) and does a constant amount of work in each, for a total running time of $O(n)$.
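
A direct transcription of the three steps, assuming the Node class from the Search sketch above:

def output_sorted(node):
    # in-order traversal: left subtree, then the node itself, then the right subtree
    if node is None:
        return
    output_sorted(node.left)
    print(node.key)
    output_sorted(node.right)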

Implementing Insert in $O(height)$ Time

Make changes to the tree.

Insert: given a new object $x$, add $x$ to the data structure.

  1. Start at the root node.
  2. Repeatedly traverse left and right child pointers, as appropriate (left if $k$ is at most the current node’s key, right if it’s bigger), until a null pointer is encountered.
  3. Replace the null pointer with one to the new object. Set the new node’s parent pointer to its parent, and its child pointers to null.
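
A sketch of Insert (parent pointers omitted for brevity), again assuming the Node class from the Search sketch above:

def bst_insert(root, x):
    # insert Node x and return the (possibly new) root
    if root is None:
        return x
    node = root
    while True:
        if x.key <= node.key:            # go left when the new key is at most the current key
            if node.left is None:
                node.left = x
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = x
                return root
            node = node.right

root = None
for k in [3, 1, 5, 2]:
    root = bst_insert(root, Node(k))
print(root.key, root.left.key, root.left.right.key, root.right.key)  # 3 1 2 5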

Implementing Delete in $O(height)$ Time

Delete: for a key $k$, delete an object with key $k$ from the search tree, if one exists.

  1. Use Search to locate an object $x$ with key k. (If no such object exists, halt.)

  2. If $x$ has no children, delete $x$ by setting the appropriate child pointer of $x$’s parent to null. (If $x$ was the root, the new tree is empty.)

  3. If $x$ has one child, splice $x$ out by rewiring the appropriate child pointer of $x$’s parent to $x$’s child, and the parent pointer of $x$’s child to $x$’s parent. (If $x$ was the root, its child becomes the new root.)

  4. Otherwise, swap $x$ with the object in its left subtree that has the biggest key (Predecessor), and delete $x$ from its new position (where it has at most one child). For example:

Augmented Search Trees for Select

Select: given a number $i$, between $1$ and the number of objects, return a pointer to the object in the data structure with the $i$th-smallest key.

  1. Start at the root and let $j$ be the size of its left subtree. (If it has no left child pointer, then $j = 0$.)
  2. If $i = j + 1$, return a pointer to the root.
  3. If $i < j + 1$, recursively compute the $i$th-smallest key in the left subtree.
  4. If $i > j + 1$, recursively compute the $(i − j − 1)$th smallest key in the right subtree.

Because each node of the search tree stores the size of its subtree, each recursive call performs only a constant amount of work. Each recursive call proceeds further downward in the tree, so the total amount of work is $O(height)$.
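
A sketch of Select, assuming each node additionally stores size, the number of nodes in its subtree (an augmentation that Insert and Delete must keep up to date):

def bst_select(node, i):
    # return the node holding the i-th smallest key (i is 1-based)
    j = node.left.size if node.left is not None else 0   # size of the left subtree
    if i == j + 1:
        return node
    elif i < j + 1:
        return bst_select(node.left, i)
    else:
        return bst_select(node.right, i - j - 1)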

Balanced Search Trees

Balanced search trees guarantee $O(\log n)$ height.

Rotations

All the most common implementations of balanced search trees use rotations, a constant-time operation that performs a modest amount of local rebalancing while preserving the search tree property.

For example,

I encourage readers interested in what’s under the hood of a balanced search tree to check out a textbook treatment or explore the open-source implementations and visualization demos that are freely available online.

Hash Table / Hash Map

The raison d’être of a hash table is to facilitate super-fast searches, which are also called lookups in this context. A hash table can tell you what’s there and what’s not, and can do it really, really quickly (much faster than a heap or search tree).

Supported Operations:

  • Lookup (a.k.a. Search): for a key $k$, return a pointer to an object in the hash table with key $k$ (or report that no such object exists).
  • Insert: given a new object $x$, add $x$ to the hash table.
  • Delete: for a key $k$, delete an object with key $k$ from the hash table, if one exists.

In a hash table, all these operations typically run in constant time.

When to Use a Hash Table:

If your application requires fast lookups with a dynamically changing set of objects, the hash table is usually the data structure of choice.

Application

De-duplication

When a new object $x$ with key $k$ arrives:

  1. Use Lookup to check if the hash table already contains an object with key $k$.
  2. If not, use Insert to put $x$ in the hash table.
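
In Python, the built-in set and dict types are hash tables, so de-duplication is a short loop:

def de_duplicate(stream):
    seen = set()             # a hash table keyed by the objects themselves
    unique = []
    for x in stream:
        if x not in seen:    # Lookup
            seen.add(x)      # Insert
            unique.append(x)
    return unique

print(de_duplicate([3, 1, 3, 2, 1]))  # [3, 1, 2]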

The 2-SUM Problem

Input: An unsorted array $A$ of $n$ integers, and a target integer $t$.

Goal: Determine whether or not there are two numbers $x, y$ in $A$ satisfying $x + y = t$.

OJ: Two Sum
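
The hash-table solution: insert the numbers into a hash set and, for each $x$, look up $t - x$; this runs in linear time on average. (The programming exercise at the end of this section uses a sorted-array variant instead, since it has to test 20,001 different targets.)

def two_sum(A, t):
    seen = set()
    for x in A:
        if t - x in seen:    # constant-time Lookup on average
            return True
        seen.add(x)          # Insert
    return False

print(two_sum([3, 8, 2, 5], 10), two_sum([3, 8, 2, 5], 12))  # True False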

Birthday Paradox

Consider $n$ people with random birthdays, with each of the $366$ days of the year equally likely. (Assume all $n$ people were born in a leap year.) How large does $n$ need to be before there is at least a $50\%$ chance that two people have the same birthday?

The answer is $23$. See: birthday paradox.
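
A quick sanity check of that number:

def no_collision_probability(n, days=366):
    # probability that n people all have distinct birthdays
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

n = 1
while no_collision_probability(n) > 0.5:
    n += 1
print(n)  # 23: the smallest n with at least a 50% chance of a shared birthday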

The birthday paradox implies that, even for the gold standard, we’re likely to start seeing collisions in a hash table of size $n$ once a small constant times $\sqrt{n}$ objects have been inserted. For example, when $n = 10,000$, the insertion of $200$ objects is likely to cause at least one collision—even though at least $98\%$ of the array positions are completely unused!

Collision Resolution

With collisions an inevitable fact of life, a hash table needs some method for resolving them.

Two dominant approaches, separate chaining (or simply chaining) and open addressing.

Chaining

With chaining, the positions of the array are often called buckets, as each can contain multiple objects.

Chaining:

  1. Keep a linked list in each bucket of the hash table.
  2. To Lookup/Insert/Delete an object with key $k$, perform Lookup/Insert Delete on the linked list in the bucket $A[h(k)]$, where h denotes the hash function and A the hash table’s array.
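
A toy chained hash table (fixed number of buckets, Python’s built-in hash as the hash function; a real implementation would also resize the array as the load grows):

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]   # one list ("chain") per bucket

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)                  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for (k, v) in bucket if k != key]

t = ChainedHashTable()
t.insert('alice', 1); t.insert('bob', 2)
print(t.lookup('alice'), t.lookup('carol'))  # 1 None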

Open Addressing

Open addressing is much easier to implement and understand when the hash table must support only Insert and Lookup (and not Delete). With open addressing, each position of the array stores 0 or 1 objects, rather than a list.

Where do we put an object with key $k$ if a different object is already stored in the position $A[h(k)]$? The idea is to associate each key k with a probe sequence of positions, not just a single position.

The first number of the sequence indicates the position to consider first; the second the next position to consider when the first is already occupied; and so on. The object is stored in the first unoccupied position of its key’s probe sequence (see Figure 12.4).

Open Addressing:

  1. Insert: Given an object with key $k$, iterate through the probe sequence associated with $k$, storing the object in the first empty position found.
  2. Lookup: Given a key $k$, iterate through the probe sequence associated with $k$ until encountering the desired object (in which case, return it) or an empty position (in which case, report “none”).

There are several ways to use one or more hash functions to define a probe sequence. The simplest is linear probing. A more sophisticated method is double hashing.

linear probing

This method uses one hash function $h$, and defines the probe sequence for a key $k$ as $h(k)$, followed by $h(k)+1$, followed by $h(k)+2$, and so on (wrapping around to the beginning upon reaching the last position).

Double Hashing

A more sophisticated method is double hashing, which uses two hash functions. The first tells you the first position of the probe sequence, and the second indicates the offset for subsequent positions. For example, if $h_1(k) = 17$ and $h_2(k) = 23$, the first place to look for an object with key $k$ is position $17$; failing that, position $40$; failing that, position $63$; failing that, position $86$; and so on. For a different key $k^{\prime}$, the probe sequence could look quite different. For example, if $h_1(k^{\prime}) = 42$ and $h_2(k^{\prime}) = 27$, the probe sequence would be $42$, followed by $69$, followed by $96$, followed by $123$, and so on.
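
Both probe sequences are easy to write down, and the Insert/Lookup loops then just walk the sequence until they hit an empty slot or the sought key. Below is a sketch with a fixed-size table, no resizing, and no deletion; the helper names are illustrative, not from the book.

def linear_probe(h, k, table_size):
    # h(k), h(k) + 1, h(k) + 2, ... wrapping around at the end of the array
    start = h(k) % table_size
    for offset in range(table_size):
        yield (start + offset) % table_size

def double_hash_probe(h1, h2, k, table_size):
    # h1(k), h1(k) + h2(k), h1(k) + 2*h2(k), ... (h2(k) should be nonzero modulo the table size)
    start, step = h1(k) % table_size, h2(k) % table_size
    for i in range(table_size):
        yield (start + i * step) % table_size

def insert(table, probe_sequence, key):
    for pos in probe_sequence:
        if table[pos] is None or table[pos] == key:
            table[pos] = key                 # store in the first empty (or matching) position
            return
    raise RuntimeError('hash table is full')

def lookup(table, probe_sequence, key):
    for pos in probe_sequence:
        if table[pos] is None:
            return False                     # empty position reached: key was never inserted
        if table[pos] == key:
            return True
    return False

# example matching the text: h1(k) = 17 and h2(k) = 23 give the probe sequence 17, 40, 63, 86, ...
table = [None] * 100
insert(table, double_hash_probe(lambda k: 17, lambda k: 23, 'key', 100), 'key')
print(table[17])  # 'key'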

What Makes for a Good Hash Function?

How can we choose a hash function so that there aren’t too many collisions?

As of this writing (in 2018), hash functions that are good starting points for further exploration include FarmHash, MurmurHash3, SpookyHash and MD5. These are all non-cryptographic hash functions, and are not designed to protect against adversarial attacks like that of Crosby and Wallach (see footnote 12). Cryptographic hash functions are more complicated and slower to evaluate than their non-cryptographic counterparts, but they do protect against such attacks. A good starting point here is the hash function SHA-1 and its newer relatives like SHA-256.

Programming Exercise

The file contains 1 million integers, both positive and negative (there might be some repetitions!)

Your task is to compute the number of target values t in the interval $[-10000,10000]$ (inclusive) such that there are distinct numbers $x,y$ in the input file that satisfy $x+y=t$.

import os
import time
os.chdir(r'I:\python_files')

data_path = r'programming_prob-2sum.txt'

# --------find distinct numbers-------
numbers = set()

read_tic = time.time()

with open(data_path) as f:
    for line in f:
        n = int(line)
        numbers.add(n)

read_toc = time.time()
print('read cost: {:.2f}s'.format(read_toc - read_tic))

# print(len(numbers))  # 999752

# -------------count the number of achievable 2-sum targets t----------------

main_tic = time.time()

t_set = set()
N = sorted(numbers)

i = 0            # begin index
j = len(N) - 1   # end index

# collect the pair sums that fall in the interval [-10000, 10000]
while i != j:

    two_sum = N[i] + N[j]

    if two_sum < -10000:
        i += 1
    elif two_sum > 10000:
        j -= 1
    else:
        # slicing would waste time because it copies the list rather than giving a 'view'
        for ii in range(i, j):
            two_sum = N[ii] + N[j]

            if two_sum > 10000:
                break

            t_set.add(two_sum)

        j -= 1

main_toc = time.time()
print('main cost: {:.2f}s'.format(main_toc - main_tic))
print('total cost: {:.2f}s'.format(main_toc - read_tic))
print(len(t_set))

Output:

read cost: 1.46s
main cost: 2.46s
total cost: 3.93s
427

Bloom Filters

Bloom filters are close cousins of hash tables. They are ridiculously space-efficient but, in exchange, they occasionally make errors.

The raison d’être of a bloom filter is essentially the same as that of a hash table: super-fast insertions and lookups. Why should we bother with another data structure with the same set of operations? Because bloom filters are preferable to hash tables in applications in which space is at a premium and the occasional error is not a dealbreaker.

Like hash tables with open addressing, bloom filters are much easier to implement and understand when they support only Insert and Lookup (and no Delete).

Supported Operations:

Lookup: for a key $k$, return “yes” if $k$ has been previously inserted into the bloom filter and “no” otherwise.

Insert: add a new key $k$ to the bloom filter.

Bloom Filters Vs. Hash Tables:

  1. Pro: More space efficient.
  2. Pro: Guaranteed constant-time operations for every data set.
  3. Con: Can’t store pointers to objects.
  4. Con: Deletions are complicated, relative to a hash table with chaining.
  5. Con: Non-zero false positive probability.

When to Use a Bloom Filter:

If your application requires fast lookups with a dynamically changing set of objects, space is at a premium, and a small number of false positives can be tolerated, the bloom filter is usually the data structure of choice.

Applications

Spell checkers.

Forbidden passwords.

Internet routers.

Implementation

Insert (given key k)

Lookup (given key k)
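
A minimal sketch of both operations: an array of bits and several hash functions; Insert sets the bits at the positions given by the hash functions, and Lookup answers “yes” exactly when all of those bits are already set (hence false positives are possible, but false negatives are not). The way the hash functions are derived from Python’s built-in hash below is purely for illustration.

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.bits = [False] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # derive num_hashes pseudo-independent positions from one key (illustration only)
        return [hash((i, key)) % len(self.bits) for i in range(self.num_hashes)]

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True            # set the bit; overlaps with other keys are fine

    def lookup(self, key):
        # "yes" iff every one of key's bits is set; may be a false positive
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(num_bits=1000, num_hashes=4)
bf.insert('spell')
print(bf.lookup('spell'), bf.lookup('speling'))  # True, (almost certainly) False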


----------over----------