forked from cockroachdb/cockroach
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTODO
206 lines (148 loc) · 8.06 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
* Transactions
- Transaction ID: generated via allocation of block of transaction
IDs. Suggested transaction struct:
type IsolationType int
const (
SERIALIZABLE IsolationType = iota
SNAPSHOT IsolationType
)
type Transaction struct {
ID int64
Priority int32
Isolation IsolationType
Epoch int32 // incremented on txn retry
Timestamp HLTimestamp // 0 to use timestamp at destination
MaxTimestamp HLTimestamp // Timestamp + clock skew; set to Timestamp for historical read
}
- Generate random priority, pick candidate timestamp.
- Modified ops:
- Get: logic to support clock skew uncertainty and concurrency.
If Timestamp on transaction is 0, assign current node's wall
clock as Timestamp and MaxTimestamp. This timestamp is returned
via the ResponseHeader.
Simple, common case: if most recent timestamp for key is less
than Timestamp and committed, read value.
If most recent timestamp for key is greater than MaxTimestamp
(it can be either committed or an intent), and there are no
versions of key between Timestamp and MaxTimestamp, read value
for key at Timestamp.
If there are version(s) of the key (can include most recent
intent) with timestamp between Timestamp and MaxTimestamp,
return WriteWithinUncertaintyInterval error. The ResponseHeader
will contain the latest version's timestamp (actually, just the
latest version with timestamp <= MaxTimestamp); on transaction
retry, Timestamp will be min(key's timestamp + 1, MaxTimestamp),
and MaxTimestamp will be max(key's timestamp + 1, MaxTimestamp).
In the event an intent is encountered with timestamp <=
Timestamp, try to push the transaction to Timestamp + 1. If
already committed, resolve and retry the read. If push succeeds,
read value for key at Timestamp. If push fails, backoff and
retry the transaction.
After reading any value, update the read-timestamp-cache with
the txn's Timestamp.
- Put: additions to support intent marker.
If entry doesn't exist, or does exist and is not an intent and
has earlier timestamp than Timestamp, add new put value with
intent set to true.
If entry exists, is committed, but has later timestamp, write
put intent with existing timestamp + 1 (this new timestamp is
returned with response and becomes the new Timestamp for the txn).
If entry exists but is intent:
- If intent is owned by this txn, continue; update txn settings
for intent, including new Epoch, Priority and Timestamp.
- Otherwise, if intent owned by another txn, try to push
transaction (with "Abort"=true). If result of push is
already-committed or aborted, resolve the existing intent
and retry put. If push succeeds (meaning the other txn is
aborted), delete existing intent and write new intent. If
push fails, backoff and retry the transaction.
If read-timestamp-cache has an entry for the key with a later
timestamp, write put intent with read timestamp + 1 (the read
timestamp becomes the new Timestamp for the txn).
TODO(spencer): split this logic up into what happens before raft
log write and what happens when raft command is executed.
- Resolve: clear a key's intent status.
Either marks a put intent as committed or aborted. On abort the
put intent is deleted. The resolve method takes the Txn record
which includes the Epoch of the txn. If the Epoch doesn't match,
a committed txn will still delete the put intent.
New operations on transaction table rows:
- PushTransaction: Moves transaction timestamp forwards.
If existing txn entry isn't present or its LastHeartbeat
timestamp isn't set, use PushTxn.Timestamp as LastHeartbeat.
If current time - LastHeartbeat > MaxHeartbeatExpiry, then
the existing txn should be either pushed forward or aborted,
depending on value of Request.Abort.
If the txn is committed, return already-committed error. If txn
has been aborted, noop and return success.
Otherwise, Compare PushTxn and Txn priorities:
- If Txn.Priority < PushTxn.Priority, return retry-txn
error. Transaction will be retried with priority =
max(random, PushTxn.Priority - 1).
- If Txn.Priority > PushTxn.Priority, set/add txn entry with
new timestamp as max(existing timestamp, push timestamp + 1).
If Request.Abort is true, set/add ABORTED txn entry.
type PushTransactionRequest struct {
RequestHeader
Key Key // derivative of txn table prefix & txn ID
Abort bool // abort txn on successful push--this is done for puts
PushTxn Transaction // from encounted txn intent
Txn Transaction // txn which encountered intent
}
type PushTransactionResponse struct {
ResponseHeader
}
- AbortTransaction: called on transaction abort from client
connection.
If txn isn't yet recorded or alive but not committed, add
aborted entry. If txn is committed, returns already-committed
error.
- CommitTransaction: called on final commit from client connection.
If txn isn't yet recorded, or is alive but not committed:
- If isolation is snapshot: adds committed entry and returns
final timestamp, which is the max of any pushed timestamp
and the txn's Timestamp.
- If isolation is serializable: if there is a pushed timestamp
which is greater than the txn's Timestamp, txn must retry;
returns retry-txn error. If there is no entry or pushed
timestamp is <= txn's Timestamp, adds committed entry.
If txn aborted, returns already-aborted error. If already
committed, returns already-committed error.
- Transaction coordinator must continually heartbeat the transaction
table in order to update liveness of txn. On a PushTransaction,
the pusher will always succeed if the pushee has not heartbeat the
txn table entry within the allotted timeout.
- On transaction retry, coordinator checks the txn table to verify
txn has not been aborted before proceeding.
- On transaction conclusion, resolve all known intents.
* StoreFinder using Gossip protocol to filter
* Range split
- Split criteria: is range larger than max_range_bytes?
- Keep a sampled set of keys per range for finding split key.
- Transactionally rewrite range addressing indexes.
* Rebalance range replica. Only fully-replicated ranges may be
rebalanced.
- Keep a rebalance queue in memory. Range replicas are added to the
queue from a store during initial range scan and also during
operation as a response to certain conditions. Listed here:
- A range is split. Each replica in the split range is marked as
needing rebalancing.
- Replica not matching zone config. When zone config changes happen,
all ranges are scanned by each store and any mismatched replicas
are added to the queue.
- Rebalance away from stores finding themselves in top N space
utilized, taking care to account for fewer than N stores in
cluster. Only stores finding themselves in the top N space
utilized set may have rebalances in effect.
- Rebalance target is selected from available stores in bottom N
space utilized. Adjacent stores are exempted.
- Add rebalance target to replica set and rewrite range addressing
indexes.
- Rebalance targets are added to replica set always exactly one at a
time. Targets are marked as REBALANCING. Obsolete sources are
marked as PENDING_DELETION. Any time a range becomes fully
replicated, the range leader replica will move REBALANCING
replicas into state OK and will remove PENDING_DELETION replicas
from the RangeDescriptor. The store which owns a removed replica
is responsible for clearing the relevant portion of the key space
as well as any other housekeeping details.