Soon shard pruning will be optimized not to generally work linearly
anymore. Thus we can't print the pruned shard intervals as currently
done anymore.
The current printing of shard ids also prevents us from running tests
in parallel, as otherwise shard ids aren't linearly numbered.
Pretty straightforward. Had some concerns about locking, but due to the
fact that all distributed operations use either some level of deparsing
or need to enumerate column names, they all block during any concurrent
column renames (due to the AccessExclusive lock).
In addition, I had some misgivings about permitting renames of the dis-
tribution column, but nothing bad comes from just allowing them.
Finally, I tried to trigger any sort of error using prepared statements
and could not trigger any errors not also exhibited by plain PostgreSQL
tables.
With this change, we set to default value of isactive column to true so that
upgrading users all nodes will be marked as active to not break their environment.
With this change we add an option to add a node without replicating all reference
tables to that node. If a node is added with this option, we mark the node as
inactive and no queries will sent to that node.
We also added two new UDFs;
- master_activate_node(host, port):
- marks node as active and replicates all reference tables to that node
- master_add_inactive_node(host, port):
- only adds node to pg_dist_node
Before this commit, we were erroring out for queries containing parameterized SQL functions
like 'SELECT parameterized_sql_query(value)' as we should, however we were returning wrong
results for queries like 'SELECT * FROM parameterized_sql_query(value)'. With this commit
we started to error out on such queries too.
In this PR, we aim to deduce whether each of the RTE_RELATION
is joined with at least on another RTE_RELATION on their partition keys. If each
RTE_RELATION follows the above rule, we can conclude that all RTE_RELATIONs are
joined on their partition keys.
In order to do that, we invented a new equivalence class namely:
AttributeEquivalenceClass. In very simple words, a AttributeEquivalenceClass is
identified by an unique id and consists of a list of AttributeEquivalenceMembers.
Each AttributeEquivalenceMember is designed to identify attributes uniquely within the
whole query. The necessity of this arise since varno attributes are defined within
a single level of a query. Instead, here we want to identify each RTE_RELATION uniquely
and try to find equality among each RTE_RELATION's partition key.
Whenever we find an equality clause A = B, where both A and B originates from
relation attributes (i.e., not random expressions), we create an
AttributeEquivalenceClass to record this knowledge. If we later find another
equivalence B = C, we create another AttributeEquivalenceClass. Finally, we can
apply transitity rules and generate a new AttributeEquivalenceClass which includes
A, B and C.
Note that equality among the members are identified by the varattno and rteIdentity.
Each equality among RTE_RELATION is saved using an AttributeEquivalenceClass where
each member attribute is identified by a AttributeEquivalenceMember. In the final
step, we try generate a common attribute equivalence class that holds as much as
AttributeEquivalenceMembers whose attributes are a partition keys.
This was getting pretty long and complex in the context of the main
utility hook. Moved out the checks for what should skip Citus process-
ing and what should have version checks performed.
With this change, we start to error out if loaded citus binaries does not match
the available major version or installed citus extension version. In this case
we force user to restart the server or run ALTER EXTENSION depending on the
situation
Thought this looked slightly nicer than the default behavior.
Changed preventTransaction to concurrent to be clearer that this code
path presently affects CONCURRENTLY code only.
Coordinator code marks index as invalid as a base, set it as valid in a
transactional layer atop that base, then proceeds with worker commands.
If a worker command has problems, the rollback results in an index with
isvalid = false. If everything succeeds, the user sees a valid index.
Before this commit, in certain cases router planner allowed pushing
down JOINs that are not on the partition keys.
With @anarazel's suggestion, we change the logic to use uninstantiated
parameter. Previously, the planner was traversing on the restriction
information and once it finds the parameter, it was replacing it with
the shard range. With this commit, instead of traversing the restrict
infos, the planner explicitly checks for the equivalence of the relation
partition key with the uninstantiated parameter. If finds an equivalence,
it adds the restrictions. In this way, we have more control over the
queries that are pushed down.
Some tests relied on worker errors though local commands were invalid.
Fixed those by ensuring preconditions were met to have command work
correctly. Otherwise most test changes are related to slight changes
in local/remote error ordering.
With this commit, we add the range table list of the original query to our
custom plan. Therefore, PostgreSQL can check relations in the original query
for access permissions and error out if the proper access is not granted.
Custom Scan is a node in the planned statement which helps external providers
to abstract data scan not just for foreign data wrappers but also for regular
relations so you can benefit your version of caching or hardware optimizations.
This sounds like only an abstraction on the data scan layer, but we can use it
as an abstraction for our distributed queries. The only thing we need to do is
to find distributable parts of the query, plan for them and replace them with
a Citus Custom Scan. Then, whenever PostgreSQL hits this custom scan node in
its Vulcano style execution, it will call our callback functions which run
distributed plan and provides tuples to the upper node as it scans a regular
relation. This means fewer code changes, fewer bugs and more supported features
for us!
First, in the distributed query planner phase, we create a Custom Scan which
wraps the distributed plan. For real-time and task-tracker executors, we add
this custom plan under the master query plan. For router executor, we directly
pass the custom plan because there is not any master query. Then, we simply let
the PostgreSQL executor run this plan. When it hits the custom scan node, we
call the related executor parts for distributed plan, fill the tuple store in
the custom scan and return results to PostgreSQL executor in Vulcano style,
a tuple per XXX_ExecScan() call.
* Modify planner to utilize Custom Scan node.
* Create different scan methods for different executors.
* Use native PostgreSQL Explain for master part of queries.
Previously we'd segfault in PQisnonblocking() which, contrary to other
libpq calls, doesn't handle a NULL PQconn (because there'd be no
appropriate return value for that).
cr: @jasonmp85
Delete operation is blocked for any table distributed by hash using master_apply_delete_command. Suggested master_modify_multiple_shards command as a hint.
During later work the transaction debug output will change (as it will
in postgres 10), which makes it hard to see actual changes in the
INSERT ... SELECT ... test. Reduce to DEBUG2 after changing a debug
message to that log level.
This change ignores `citus.replication_model` setting and uses the
statement based replication in
- Tables distributed via the old `master_create_distributed_table` function
- Append and range partitioned tables, even if created via
`create_distributed_table` function
This seems like the easiest solution to #1191, without changing the existing
behavior and harming existing users with custom scripts.
This change also prevents RF>1 on streaming replicated tables on `master_create_worker_shards`
Prior to this change, `master_create_worker_shards` command was not checking
the replication model of the target table, thus allowing RF>1 with streaming
replicated tables. With this change, `master_create_worker_shards` errors
out on the case.
Add a call to RemoteTransactionBeginIfNecessary so that BEGIN is
actually sent to the remote connections. This means that ROLLBACK and
Ctrl-C are respected and don't leave the table in a partial state.
This change allows users to drop sequences on MX workers. Previously, Citus didn't allow dropping
sequences on MX workers because it could cause shards to be dropped if `DROP SEQUENCE ... CASCADE`
is used. We now allow that since allowing sequence creation but not dropping hurts user experience
and also may cause problems with custom Citus solutions.
- Break CheckShardPlacements into multiple functions (The most important
is MarkFailedShardPlacements), so that we can get rid of the global
CoordinatedTransactionUses2PC.
- Call MarkFailedShardPlacements in the router executor, so we mark
shards as invalid and stop using them while inside transaction blocks.
With this change DropShards function started to use new connection API. DropShards
function is used by DROP TABLE, master_drop_all_shards and master_apply_delete_command,
therefore all of these functions now support transactional operations. In DropShards
function, if we cannot reach a node, we mark shard state of related placements as
FILE_TO_DELETE and continue to drop remaining shards; however if any error occurs after
establishing the connection, we ROLLBACK whole operation.
Before this change, when a transaction failed, we update related placements shard states
to FILE_INACTIVE during XACT_EVENT_PRE_COMMIT. However that means if another code block
changed shard state to something else (e.g. FILE_TO_DELETE) before XACT_EVENT_PRE_COMMIT
we overwrite that. To prevent that problem, in case of failure we started to change
shard state, only if its current shard state is FILE_FINALIZED.
This UDF returns a shard placement from cache given shard id and placement id. At the
moment it iterates over all shard placements of given shard by ShardPlacementList and
searches given placement id in that list, which is not a good solution performance-wise.
However, currently, this function will be used only when there is a failed transaction.
If a need arises we can optimize this function in the future.
All router, real-time, task-tracker plannable queries should now have
full prepared statement support (and even use router when possible),
unless they don't go through the custom plan interface (which
basically just affects LANGUAGE SQL (not plpgsql) functions).
This is achieved by forcing postgres' planner to always choose a
custom plan, by assigning very low costs to plans with bound
parameters (i.e. ones were the postgres planner replanned the query
upon EXECUTE with all parameter values provided), instead of the
generic one.
This requires some trickery, because for custom plans to work the
costs for a non-custom plan have to be known, which means we can't
error out when planning the generic plan. Instead we have to return a
"faux" plan, that'd trigger an error message if executed. But due to
the custom plan logic that plan will likely (unless called by an SQL
function, or because we can't support that query for some reason) not
be executed; instead the custom plan will be chosen.
So far router planner had encapsulated different functionality in
MultiRouterPlanCreate. Modifications always go through router, selects
sometimes. Modifications always error out if the query is unsupported,
selects return NULL. Especially the error handling is a problem for
the upcoming extension of prepared statement support.
Split MultiRouterPlanCreate into CreateRouterPlan and
CreateModifyPlan, and change them to not throw errors.
Instead errors are now reported by setting the new
MultiPlan->plannigError.
Callers of router planner functionality now have to throw errors
themselves if desired, but also can skip doing so.
This is a pre-requisite for expanding prepared statement support.
While touching all those lines, improve a number of error messages by
getting them closer to the postgres error message guidelines.
The name CreatePhysicalPlan() hasn't been accurate for a while, and
the split of work between multi_planner() and CreatePhysicalPlan()
doesn't seem perfect. So rename to CreateDistributedPlan() and move a
bit more logic in there.
It can be useful, e.g. in the upcoming prepared statement support, to
be able to return an error from a function that is not raised
immediately, but can later be thrown. That allows e.g. to attempt to
plan a statment using different methods and to create good error
messages in each planner, but to only error out after all planners
have been run.
To enable that create support for deferred error messages that can be
created (supporting errorcode, message, detail, hint) in one function,
and then thrown in different place.
This adds a replication_model GUC which is used as the replication
model for any new distributed table that is not a reference table.
With this change, tables with replication factor 1 are no longer
implicitly MX tables.
The GUC is similarly respected during empty shard creation for e.g.
existing append-partitioned tables. If the model is set to streaming
while replication factor is greater than one, table and shard creation
routines will error until this invalid combination is corrected.
Changing this parameter requires superuser permissions.
If any placements fail it doesn't update shard statistics on those placements.
A minor enabling refactor: Make CoordinatedTransactionUses2PC public (it used to be CoordinatedTransactionUse2PC but that symbol already existed, so renamed it as well)
We changed error message which appears when user tries to execute outer join command and
that command requires repartitioning. Old error message mentioned about 1-to-1 shard
partitioning which may not be clear to user.
This enables proper transactional behaviour for copy and relaxes some
restrictions like combining COPY with single-row modifications. It
also provides the basis for relaxing restrictions further, and for
optionally allowing connection caching.
Instead use pg_plan_query() like the normal explain does, and use that
to explain the query. That's important because it allows to remove
the duplicated planner logic from multi_explain - and that logic is
about to get more complicated.
They make fixing explain for prepared statement harder, and they don't
really fit into EXPLAIN in the first place. Additionally they're
currently not exercised in any tests.
This change adds support for serial columns to be used with MX tables.
Prior to this change, sequences of serial columns were created in all
workers (for being able to create shards) but never used. With MX, we
need to set the sequences so that sequences in each worker create
unique values. This is done by setting the MINVALUE, MAXVALUE and
START values of the sequence.
A small refactor which pulls some code out of `RecoverWorkerTransactions`
and into `remote_commands.c`. This code block currently only occurs in
`RecoverWorkerTransactions` but will be useful to other functions
shortly.
Unfortunately we couldn't call it `ExecuteRemoteCommand`, that name was
already taken.
Because of foreign keys and similar concerns there should only be a
single modifying/DDL connection for a set of colocated placements to a
node. To enforce placement_connection.c now has an additional
hash-table keeping track of the connections to a set of colocated
placements. In addition to enforcing per placement restrictions on
connections, there's now very similar restrictions for sets of
colocated placements.
This commit is intended to improve the error messages while planning
INSERT INTO .. SELECT queries. The main motivation for this change is
that we used to map multiple cases into a single message. With this change,
we added explicit error messages for many cases.
With this change, we start to delete placement of reference tables at given worker node
after master_remove_node UDF call. We remove placement metadata at master node but we do
not drop actual shard from the worker node. There are two reasons for that decision,
first, it is not critical to DROP the shards in the workers because Citus will ignore them
as long as node is removed from cluster and if we add that node back to cluster we will
DROP and recreate all reference tables. Second, if node is unreachable, it becomes
complicated to cover failure cases and have a transaction support.
Enables use views within distributed queries.
User can create and use a view on distributed tables/queries
as he/she would use with regular queries.
After this change router queries will have full support for views,
insert into select queries will support reading from views, not
writing into. Outer joins would have a limited support, and would
error out at certain cases such as when a view is in the inner side
of the outer join.
Although PostgreSQL supports writing into views under certain circumstances.
We disallowed that for distributed views.
This change fixes a small bug about quoting of workerrack column in
NodeListInsertCommand:
Previous: `"..., '%s'", workerRack`
Now: `"..., %s", quote_literal_cstr(workerRack)`
So far we've reloaded them frequently. Besides avoiding that cost -
noticeable for some workloads with large shard counts - it makes it
easier to add information to ShardPlacements that help us make
placement_connection.c colocation aware.
Doing so requires adding a mapping from shardId to the cache
entries. For that metadata_cache.c now maintains an additional
hashtable. That hashtable only references shard intervals in the
dist table cache.
Previously the function was getting too large. Thus this splits the
function into separate parts for looking up the cache entry and
building the cache contents.
CloseNodeConnections() is supposed to close connections to a given node.
However, before this commit it lacks to actually call PQFinish() on the
connections. Using CloseConnection() handles closing and all other necessary
actions.
With this change we start to error out on router planner queries where a common table
expression with data-modifying statement is present. We already do not support if
there is a data-modifying statement using result of the CTE, now we also error out
if CTE itself is data-modifying statement.
Remove the router specific transaction and shard management, and
replace it with the new placement connection API. This mostly leaves
behaviour alone, except that it is now, inside a transaction, legal to
select from a shard to which no pre-existing connection exists.
To simplify code the code handling task executions for select and
modify has been split into two - the previous coding was starting to
get confusing due to the amount of only conditionally applicable code.
Modification connections & transactions are now always established in
parallel, not just for reference tables.
Currently there are several places in citus that map placements to
connections and that manage placement health. Centralize this
knowledge. Because of the centralized knowledge about which
connection has previously been used for which shard/placement, this
also provides the basis for relaxing restrictions around combining
various forms of DDL/DML.
Connections for a placement can now be acquired using
GetPlacementConnection(). If the connection is used for DML or DDL the
FOR_DDL/DML flags should be used respectively. If an individual
remote transaction fails (but the transaction on the master succeeds)
and FOR_DDL/DML have been specified, the placement is marked as
invalid, unless that'd mark all placements for a shard as invalid.
With this change, we start to replicate all reference tables to the new node when new node
is added to the cluster with master_add_node command. We also update replication factor
of reference table's colocation group.
With this change we introduce new UDF, upgrade_to_reference_table, which can be used to
upgrade existing broadcast tables reference tables. For upgrading, we require that given
table contains only one shard.
Renamed FindShardIntervalIndex() to ShardIndex() and added binary search
capability. It used to assume that hash partition tables are always
uniformly distributed which is not true if upcoming tenant isolation
feature is applied. This commit also reduces code duplication.
We have one replication of reference table for each node. Therefore all problems with
replication factor > 1 also applies to reference table. As a solution we will not allow
foreign keys on reference tables. It is not possible to define foreign key from, to or
between reference tables.
Previously, we errored out if non-user tries to SELECT query for some metadata tables. It
seems that we already GRANT SELECT access to some metadata tables but not others. With
this change, we GRANT SELECT access to all existing Citus metadata tables.
* Add get_distribution_value_shardid UDF
With this UDF users can now map given distribution value to shard id. We mostly hide
shardids from users to prevent unnecessary complexity but some power users might need
to know about which entry/value is stored in which shard for maintanence purposes.
Signature of this UDF is as follows;
bigint get_distribution_value_shardid(table_name regclass, distribution_value anyelement)
With this commit, we implemented some basic features of reference tables.
To start with, a reference table is
* a distributed table whithout a distribution column defined on it
* the distributed table is single sharded
* and the shard is replicated to all nodes
Reference tables follows the same code-path with a single sharded
tables. Thus, broadcast JOINs are applicable to reference tables.
But, since the table is replicated to all nodes, table fetching is
not required any more.
Reference tables support the uniqueness constraints for any column.
Reference tables can be used in INSERT INTO .. SELECT queries with
the following rules:
* If a reference table is in the SELECT part of the query, it is
safe join with another reference table and/or hash partitioned
tables.
* If a reference table is in the INSERT part of the query, all
other participating tables should be reference tables.
Reference tables follow the regular co-location structure. Since
all reference tables are single sharded and replicated to all nodes,
they are always co-located with each other.
Queries involving only reference tables always follows router planner
and executor.
Reference tables can have composite typed columns and there is no need
to create/define the necessary support functions.
All modification queries, master_* UDFs, EXPLAIN, DDLs, TRUNCATE,
sequences, transactions, COPY, schema support works on reference
tables as expected. Plus, all the pre-requisites associated with
distribution columns are dismissed.
We used to disable router planner and executor
when task executor is set to task-tracker.
This change enables router planning and execution
at all times regardless of task execution mode.
We are introducing a hidden flag enable_router_execution
to enable/disable router execution. Its default value is
true. User may disable router planning by setting it to false.
Adds support for VACUUM and ANALYZE commands which target a specific
distributed table. After grabbing the appropriate locks, this imple-
mentation sends VACUUM commands to each placement (using one connec-
tion per placement). These commands are sent in parallel, so users
with large tables will benefit from sharding. Except for VERBOSE, all
VACUUM and ANALYZE options are supported, including the explicit
column list used by ANALYZE.
As with many of our utility commands, the local command also runs. In
the VACUUM/ANALYZE case, the local command is executed before any re-
mote propagation. Because error handling is managed after local proc-
essing, this can result in a VACUUM completing locally but erroring
out when distributed processing commences: a minor technicality in all
cases, as there isn't really much reason to ever roll back a VACUUM (an
impossibility in any case, as VACUUM cannot run within a transaction).
Remote propagation of targeted VACUUM/ANALYZE is controlled by the
enable_ddl_propagation setting; warnings are emitted if such a command
is attempted when DDL propagation is disabled. Unqualified VACUUM or
ANALYZE is not handled, but a warning message informs the user of this.
Implementation note: this commit adds a "BARE" value to MultiShard-
CommitProtocol. When active, no BEGIN command is ever sent to remote
nodes, useful for commands such as VACUUM/ANALYZE which must not run in
a transaction block. This value is not user-facing and is reset at
transaction end.
This change adds `start_metadata_sync_to_node` UDF which copies the metadata about nodes and MX tables
from master to the specified worker, sets its local group ID and marks its hasmetadata to true to
allow it receive future DDL changes.
One less place managing remote transactions. It also makes it fairly
easy to use 2PC for certain modifications (e.g. reference tables). Just
issue a CoordinatedTransactionUse2PC(). If every placement failure
should cause the whole transaction to abort, additionally mark the
relevant transactions as critical.
With this PR, we add foreign key support to ALTER TABLE commands. For now,
we only support foreign constraint creation via ALTER TABLE query, if it
is only subcommand in ALTER TABLE subcommand list.
We also only allow foreign key creation if replication factor is 1.
That way connections can be automatically closed after errors and such,
and the connection management infrastructure gets wider testing. It
also fixes a few issues around connection string building.