This commit is preperation for introducing distributed partitioned
table support. We want to clean and refactor some code in distributed
table creation logic so that we can handle partitioned tables in more
robust way.
- master_add_node enforces that there is only one primary per group
- there's also a trigger on pg_dist_node to prevent multiple primaries
per group
- functions in metadata cache only return primary nodes
- Rename ActiveWorkerNodeList -> ActivePrimaryNodeList
- Rename WorkerGetLive{Node->Group}Count()
- Refactor WorkerGetRandomCandidateNode
- master_remove_node only complains about active shard placements if the
node being removed is a primary.
- master_remove_node only deletes all reference table placements in the
group if the node being removed is the primary.
- Rename {Node->NodeGroup}HasShardPlacements, this reflects the behavior it
already had.
- Rename DeleteAllReferenceTablePlacementsFrom{Node->NodeGroup}. This also
reflects the behavior it already had, but the new signature forces the
caller to pass in a groupId
- Rename {WorkerGetLiveGroup->ActivePrimaryNode}Count
Comes with a few changes:
- Change the signature of some functions to accept groupid
- InsertShardPlacementRow
- DeleteShardPlacementRow
- UpdateShardPlacementState
- NodeHasActiveShardPlacements returns true if the group the node is a
part of has any active shard placements
- TupleToShardPlacement now returns ShardPlacements which have NULL
nodeName and nodePort.
- Populate (nodeName, nodePort) when creating ShardPlacements
- Disallow removing a node if it contains any shard placements
- DeleteAllReferenceTablePlacementsFromNode matches based on group. This
doesn't change behavior for now (while there is only one node per
group), but means in the future callers should be careful about
calling it on a secondary node, it'll delete placements on the primary.
- Create concept of a GroupShardPlacement, which represents an actual
tuple in pg_dist_placement and is distinct from a ShardPlacement,
which has been resolved to a specific node. In the future
ShardPlacement should be renamed to NodeShardPlacement.
- Create some triggers which allow existing code to continue to insert
into and update pg_dist_shard_placement as if it still existed.
Adds support for PostgreSQL 10 by copying in the requisite ruleutils
and updating all API usages to conform with changes in PostgreSQL 10.
Most changes are fairly minor but they are numerous. One particular
obstacle was the change in \d behavior in PostgreSQL 10's psql; I had
to add SQL implementations (views, mostly) to mimic the pre-10 output.
So far citus used postgres' predicate proofing logic for shard
pruning, except for INSERT and COPY which were already optimized for
speed. That turns out to be too slow:
* Shard pruning for SELECTs is currently O(#shards), because
PruneShardList calls predicate_refuted_by() for every
shard. Obviously using an O(N) type algorithm for general pruning
isn't good.
* predicate_refuted_by() is quite expensive on its own right. That's
primarily because it's optimized for doing a single refutation
proof, rather than performing the same proof over and over.
* predicate_refuted_by() does not keep persistent state (see 2.) for
function calls, which means that a lot of syscache lookups will be
performed. That's particularly bad if the partitioning key is a
composite key, because without a persistent FunctionCallInfo
record_cmp() has to repeatedly look-up the type definition of the
composite key. That's quite expensive.
Thus replace this with custom-code that works in two phases:
1) Search restrictions for constraints that can be pruned upon
2) Use those restrictions to search for matching shards in the most
efficient manner available:
a) Binary search / Hash Lookup in case of hash partitioned tables
b) Binary search for equal clauses in case of range or append
tables without overlapping shards.
c) Binary search for inequality clauses, searching for both lower
and upper boundaries, again in case of range or append
tables without overlapping shards.
d) exhaustive search testing each ShardInterval
My measurements suggest that we are considerably, often orders of
magnitude, faster than the previous solution, even if we have to fall
back to exhaustive pruning.
With this change we add an option to add a node without replicating all reference
tables to that node. If a node is added with this option, we mark the node as
inactive and no queries will sent to that node.
We also added two new UDFs;
- master_activate_node(host, port):
- marks node as active and replicates all reference tables to that node
- master_add_inactive_node(host, port):
- only adds node to pg_dist_node
Delete operation is blocked for any table distributed by hash using master_apply_delete_command. Suggested master_modify_multiple_shards command as a hint.
This change ignores `citus.replication_model` setting and uses the
statement based replication in
- Tables distributed via the old `master_create_distributed_table` function
- Append and range partitioned tables, even if created via
`create_distributed_table` function
This seems like the easiest solution to #1191, without changing the existing
behavior and harming existing users with custom scripts.
This change also prevents RF>1 on streaming replicated tables on `master_create_worker_shards`
Prior to this change, `master_create_worker_shards` command was not checking
the replication model of the target table, thus allowing RF>1 with streaming
replicated tables. With this change, `master_create_worker_shards` errors
out on the case.
Add a call to RemoteTransactionBeginIfNecessary so that BEGIN is
actually sent to the remote connections. This means that ROLLBACK and
Ctrl-C are respected and don't leave the table in a partial state.
This change allows users to drop sequences on MX workers. Previously, Citus didn't allow dropping
sequences on MX workers because it could cause shards to be dropped if `DROP SEQUENCE ... CASCADE`
is used. We now allow that since allowing sequence creation but not dropping hurts user experience
and also may cause problems with custom Citus solutions.
- Break CheckShardPlacements into multiple functions (The most important
is MarkFailedShardPlacements), so that we can get rid of the global
CoordinatedTransactionUses2PC.
- Call MarkFailedShardPlacements in the router executor, so we mark
shards as invalid and stop using them while inside transaction blocks.
With this change DropShards function started to use new connection API. DropShards
function is used by DROP TABLE, master_drop_all_shards and master_apply_delete_command,
therefore all of these functions now support transactional operations. In DropShards
function, if we cannot reach a node, we mark shard state of related placements as
FILE_TO_DELETE and continue to drop remaining shards; however if any error occurs after
establishing the connection, we ROLLBACK whole operation.
So far router planner had encapsulated different functionality in
MultiRouterPlanCreate. Modifications always go through router, selects
sometimes. Modifications always error out if the query is unsupported,
selects return NULL. Especially the error handling is a problem for
the upcoming extension of prepared statement support.
Split MultiRouterPlanCreate into CreateRouterPlan and
CreateModifyPlan, and change them to not throw errors.
Instead errors are now reported by setting the new
MultiPlan->plannigError.
Callers of router planner functionality now have to throw errors
themselves if desired, but also can skip doing so.
This is a pre-requisite for expanding prepared statement support.
While touching all those lines, improve a number of error messages by
getting them closer to the postgres error message guidelines.
This adds a replication_model GUC which is used as the replication
model for any new distributed table that is not a reference table.
With this change, tables with replication factor 1 are no longer
implicitly MX tables.
The GUC is similarly respected during empty shard creation for e.g.
existing append-partitioned tables. If the model is set to streaming
while replication factor is greater than one, table and shard creation
routines will error until this invalid combination is corrected.
Changing this parameter requires superuser permissions.
If any placements fail it doesn't update shard statistics on those placements.
A minor enabling refactor: Make CoordinatedTransactionUses2PC public (it used to be CoordinatedTransactionUse2PC but that symbol already existed, so renamed it as well)