Update README.md

pull/7226/head
Marco Slot 2023-09-18 13:09:22 +02:00 committed by Önder Kalacı
parent 60f99bf32c
commit c9d7f63351
1 changed file with 3 additions and 3 deletions

@@ -1687,7 +1687,7 @@ The COPY .. FROM command is used to load a CSV (or TSV, or binary copy format) f
**The CitusCopyDestReceiver inspects the value in the distribution column and finds the corresponding shard.** It opens a connection to the node(s) on which the shard is placed, starts a COPY into the shard, and forwards the tuple. For performance reasons, we use the binary copy format over the internal connections when possible (i.e. when all column types have binary send/receive functions), even if the client used CSV.
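To make the routing step concrete, here is a minimal standalone sketch of hash-based shard routing. It is not the actual Citus code; `hash_distribution_value`, `find_shard_interval`, and the shard IDs are made up. The idea is to hash the distribution column value and binary-search the sorted shard intervals for the one whose `[min,max]` hash range contains the result.

```c
#include <stdint.h>
#include <stdio.h>

#define SHARD_COUNT 32

typedef struct ShardInterval
{
    int32_t minHashValue;
    int32_t maxHashValue;
    int64_t shardId;
} ShardInterval;

/* Placeholder for the distribution column type's hash function. */
static int32_t
hash_distribution_value(int64_t value)
{
    return (int32_t) ((uint64_t) value * 2654435761u); /* Knuth multiplicative hash */
}

/* Binary-search the sorted shard intervals for the one covering hashValue. */
static ShardInterval *
find_shard_interval(ShardInterval *intervals, int count, int32_t hashValue)
{
    int low = 0;
    int high = count - 1;

    while (low <= high)
    {
        int mid = low + (high - low) / 2;

        if (hashValue < intervals[mid].minHashValue)
            high = mid - 1;
        else if (hashValue > intervals[mid].maxHashValue)
            low = mid + 1;
        else
            return &intervals[mid];
    }
    return NULL;
}

int main(void)
{
    ShardInterval intervals[SHARD_COUNT];
    uint64_t span = ((uint64_t) UINT32_MAX + 1) / SHARD_COUNT;

    /* Evenly split the int32 hash space, as hash-distributed tables do. */
    for (int i = 0; i < SHARD_COUNT; i++)
    {
        intervals[i].minHashValue = (int32_t) (INT32_MIN + (int64_t) (i * span));
        intervals[i].maxHashValue = (int32_t) (INT32_MIN + (int64_t) ((i + 1) * span) - 1);
        intervals[i].shardId = 102008 + i; /* arbitrary example shard IDs */
    }

    int32_t hashValue = hash_distribution_value(42);
    ShardInterval *shard = find_shard_interval(intervals, SHARD_COUNT, hashValue);

    if (shard != NULL)
        printf("value 42 hashes to %d -> shard %lld\n", hashValue, (long long) shard->shardId);
    return 0;
}
```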
**The COPY protocol does not require immediate confirmation when sending a tuple**, which means we can continue parsing the next tuple without waiting for the previous tuple to be fully processed. This creates nice pipelining behaviour where tuples are effectively ingested in parallel and can improve performance over regular PostgreSQL, even though parsing runs at the same speed. This effect will be more pronounced when insertions are relatively heavy-weight due to triggers or heavy indexes.
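This pipelining is visible from plain libpq. A minimal sketch (the connection string and the `items` table are assumptions for illustration): `PQputCopyData` only queues bytes on the connection, so the loop never waits on the server per tuple; the first synchronization point is `PQputCopyEnd`/`PQgetResult`.

```c
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Hypothetical connection string and table, for illustration only. */
    PGconn *conn = PQconnectdb("dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PGresult *res = PQexec(conn, "COPY items (id, name) FROM STDIN WITH (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    /* Stream rows back-to-back; no round trip between tuples. */
    for (int i = 0; i < 100000; i++)
    {
        char row[64];
        int len = snprintf(row, sizeof(row), "%d,item-%d\n", i, i);
        if (PQputCopyData(conn, row, len) != 1)
        {
            fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
            break;
        }
    }

    /* Only now do we wait for the server to confirm the whole COPY. */
    if (PQputCopyEnd(conn, NULL) == 1)
    {
        res = PQgetResult(conn);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "COPY did not complete: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}
```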
**COPY does not always use a connection per shard.** If there were already writes to multiple shards on a specific connection earlier in the transaction (e.g. consecutive inserts), then that connection must be used for the writes done by the COPY (e.g. to be able to check unique constraints). However, we can only COPY into one table at a time. In this case, the COPY logic maps multiple shards to the same connection and switches back and forth between shards through multiple COPY commands (which has overhead). If we get a tuple for a shard that is currently active, we forward it immediately over the connection. Otherwise, we add the tuple to a per-shard buffer, or switch the connection if we already sent `citus.copy_switchover_threshold` bytes to the current shard.
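A toy model of that buffer-or-switch decision, not the actual Citus code (all names are hypothetical, and which shard to switch to is a guess here: the one with the largest buffer): tuples for the active shard go straight out, tuples for other shards accumulate in per-shard buffers, and once enough bytes have been sent to the active shard, the COPY is ended and restarted for another shard.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SHARD_COUNT 4
#define SWITCHOVER_THRESHOLD 32 /* stands in for citus.copy_switchover_threshold */

typedef struct ShardState
{
    int64_t shardId;
    size_t bufferedBytes; /* tuples waiting until this shard's COPY is active */
} ShardState;

static ShardState shards[SHARD_COUNT];
static int64_t activeShardId;
static size_t bytesSentToActiveShard;

/* Stand-in for forwarding a tuple over the connection's current COPY. */
static void
send_copy_data(int64_t shardId, const char *tuple)
{
    printf("shard %lld <- %s\n", (long long) shardId, tuple);
}

/* End the current COPY, start one for the shard with the most buffered
 * data, and count its flushed buffer against the new COPY. */
static void
switch_active_shard(void)
{
    ShardState *next = &shards[0];
    for (int i = 1; i < SHARD_COUNT; i++)
    {
        if (shards[i].bufferedBytes > next->bufferedBytes)
            next = &shards[i];
    }
    printf("-- switching COPY to shard %lld\n", (long long) next->shardId);
    activeShardId = next->shardId;
    bytesSentToActiveShard = next->bufferedBytes;
    next->bufferedBytes = 0; /* buffer is flushed as the new COPY starts */
}

static void
handle_tuple(ShardState *shard, const char *tuple)
{
    if (shard->shardId == activeShardId)
    {
        send_copy_data(shard->shardId, tuple);
        bytesSentToActiveShard += strlen(tuple);
        if (bytesSentToActiveShard >= SWITCHOVER_THRESHOLD)
            switch_active_shard();
    }
    else
    {
        shard->bufferedBytes += strlen(tuple); /* buffer until we switch */
    }
}

int main(void)
{
    for (int i = 0; i < SHARD_COUNT; i++)
        shards[i].shardId = 102008 + i; /* arbitrary example shard IDs */
    activeShardId = shards[0].shardId;

    /* Tuples for the active shard go straight out; others accumulate. */
    for (int i = 0; i < 10; i++)
        handle_tuple(&shards[i % SHARD_COUNT], "42,hello");
    return 0;
}
```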
@@ -2012,7 +2012,7 @@ An open issue with CDC is that there is no good way to get a consistent snapshot
When you have a multi-node cluster, clients should connect to each node and combine the changes. It is important to note that there are no guarantees with regard to when and in what order changes will be emitted between nodes. It is especially important to understand that changes cannot be reordered (e.g. based on timestamp or transaction ID), because only the node-level order is correct. The lack of distributed snapshot isolation in Citus means that changes can be interleaved (a happens before b on node 1, b happens before a on node 2). The node-level decoder output will reflect that as it happened.
_Do not reorder changes based on timestamp or distributed transaction ID or anything that is not guaranteed to preserve node-level order. It is never correct._
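A small illustration with made-up data of why this goes wrong: commit timestamps across nodes need not agree with the node-level order, so sorting a combined stream by timestamp can flip an order that was correct.

```c
#include <stdio.h>

/* Hypothetical decoded changes; the timestamps are deliberately out of
 * line with the node-level order in which the changes were emitted. */
typedef struct Change
{
    const char *description;
    long timestamp; /* wall-clock time, unsafe for ordering */
} Change;

int main(void)
{
    /* Per-node streams, each in its only correct (node-level) order:
     * on node 1, a happened before b; on node 2, b happened before a. */
    Change node1[] = { { "a on node 1", 1002 }, { "b on node 1", 1001 } };
    Change node2[] = { { "b on node 2", 1000 }, { "a on node 2", 1003 } };

    /* Consume each stream in arrival order; merge-sorting across nodes
     * by timestamp would flip node 1's correct order. */
    for (int i = 0; i < 2; i++)
        printf("node1[%d]: %s (ts=%ld)\n", i, node1[i].description, node1[i].timestamp);
    for (int i = 0; i < 2; i++)
        printf("node2[%d]: %s (ts=%ld)\n", i, node2[i].description, node2[i].timestamp);
    return 0;
}
```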
# Global PID