On some systems a new libpq is available than what we're compiling
against, but until now we used psql in the version we're compiling
against. That' a problem, because (quoting Jason):
With 9.6, libpq's default handling of CONTEXT changed: it is hidden
unless the level is ERROR or higher. We addressed this ourselves using
the SHOW_CONTEXT variable (by setting "always" in pg_regress_multi): in
9.5, this is ignored (and unneeded), in 9.6, it ensures old behavior is
preserved.
For 9.6 we'd already worked around the problem by specifying that
context should always be shown, but < 9.6 psql doesn't know how to do
that.
As there's no csql anymore, which strictly tied us to a specific version
of psql/csql, we can now just use the system's psql if available. We
still fall back to the psql of the installation we're compiling against,
if there's no other psql in PATH.
This commit fixes a bug when the SELECT target list includes a constant
value.
Previous behaviour of target list re-ordering:
* Iterate over the INSERT target list
* If it includes a Var, find the corresponding SELECT entry
and update its resno accordingly
* If it does not include a Var (which we only considered to be
DEFAULTs), generate a new SELECT target entry
* If the processed target entry count in SELECT target list is less
than the original SELECT target list (GROUP BY elements not included in
the SELECT target entry), add them in the SELECT target list and
update the resnos accordingly.
* However, this step was leading to add the CONST SELECT target entries
twice. The reason is that when CONST target list entries appear in the
SELECT target list, the INSERT target list doesn't include a Var. Instead,
it includes CONST as it does for DEFAULTs.
New behaviour of target list re-ordering:
* Iterate over the INSERT target list
* If it includes a Var, find the corresponding SELECT entry
and update its resno accordingly
* If it does not include a Var (which we consider to be
DEFAULTs and CONSTs on the SELECT), generate a new SELECT
target entry
* If any target entry remains on the SELECT target list which are resjunk,
(GROUP BY elements not included in the SELECT target entry), keep them
in the SELECT target list by updating the resnos.
While creating a colocated table, we don't want the source table to be dropped.
However, using a ShareLock blocks DML statements on the source table, and
using AccessShareLock is enough to prevent DROP. Therefore, we just loosened
the lock to AccessShareLock.
This change allows seeing the names of columns of `master_add_node`,
using `SELECT * FROM master_add_node(...)` by specifying output
columns in UDF definition.
Previously, we threw an error when we ran CREATE INDEX IF NOT EXISTS
with an already existing index. This change enables expected behavior by
checking if the statement has IF NOT EXISTS before throwing the error.
We also ensure that we don't execute the command on the workers, if an
index already exists on the master.
At the moment, we do not support foreign constraints if replication factor is greater
than 1. However foreign constraints can be used in cloud with high availability option.
Therefore we do not want to create an impression such that foreign constraints with
high availability is not supported at all. We call users to action with this error
message.
In ErrorIfShardPlacementsNotColocated(), while checking if shards are colocated,
error out if matching shard intervals have different number of shard placements.
Added a new UDF, mark_tables_colocated(), to colocate tables with the same
configuration (shard count, shard replication count and distribution column type).
Fixcitusdata/citus#886
The way postgres' explain hook is designed means that our hook is never
called during EXPLAIN EXECUTE. So, we special-case EXPLAIN EXECUTE by
catching it in the utility hook. We then replace the EXECUTE with the
original query and pass it back to Citus.
Previously, when a repair is requested on a shard, we also repair all co-located shards
of given shard, which may cause repairing already healthy shards. With this change, we
only repair given shard.
This forces prepared statements to be re-planned after changes of the
placement metadata. There's some locking issues remaining, but that's a
a separate task.
Also add regression tests verifying that invalidations take effect on
prepared statements.
This commit adds INSERT INTO ... SELECT feature for distributed tables.
We implement INSERT INTO ... SELECT by pushing down the SELECT to
each shard. To compute that we use the router planner, by adding
an "uninstantiated" constraint that the partition column be equal to a
certain value. standard_planner() distributes that constraint to all
the tables where it knows how to push the restriction safely. An example
is that the tables that are connected via equi joins.
The router planner then iterates over the target table's shards,
for each we replace the "uninstantiated" restriction, with one that
PruneShardList() handles. Do so by replacing the partitioning qual
parameter added in multi_planner() with the current shard's
actual boundary values. Also, add the current shard's boundary values to the
top level subquery to ensure that even if the partitioning qual is
not distributed to all the tables, we never run the queries on the shards
that don't match with the current shard boundaries. Finally, perform the
normal shard pruning to decide on whether to push the query to the
current shard or not.
We do not support certain SQLs on the subquery, which are described/commented
on ErrorIfInsertSelectQueryNotSupported().
We also added some locking on the router executor. When an INSERT/SELECT command
runs on a distributed table with replication factor >1, we need to ensure that
it sees the same result on each placement of a shard. So we added the ability
such that router executor takes exclusive locks on shards from which the SELECT
in an INSERT/SELECT reads in order to prevent concurrent changes. This is not a
very optimal solution, but it's simple and correct. The
citus.all_modifications_commutative can be used to avoid aggressive locking.
An INSERT/SELECT whose filters are known to exclude any ongoing writes can be
marked as commutative. See RequiresConsistentSnapshot() for the details.
We also moved the decison of whether the multiPlan should be executed on
the router executor or not to the planning phase. This allowed us to
integrate multi task router executor tasks to the router executor smoothly.
The necessity for this functionality comes from the fact that ruleutils.c is not supposed to be
used on "rewritten" queries (i.e. ones that have been passed through QueryRewrite()).
Query rewriting is the process in which views and such are expanded,
and, INSERT/UPDATE targetlists are reordered to match the physical order,
defaults etc. For the details of reordeing, see transformInsertRow().
We'd been relying on a single SET search_path command in an earlier
script, but a subsequent script RESET search_path, causing any further
bare functions to be created in the first schema on the search path.
However, starting with an older extension version and executing ALTER
scripts one at a time DOES avoid putting any functions in the public
namespace, so I wrote an upgrade script resilient to that, especially
because PostgreSQL 9.5 will error out if a function is already in the
schema it's being moved to.
Between restart (running the new code) and ALTER EXTENSION citus
UPGRADE there was an inconsistency where we assumed that
pg_dist_partition had the repmodel column set. Now we give it a default
value if the column doesn't exist yet.
With this change, we now push down foreign key constraints created during CREATE TABLE
statements. We also start to send foreign constraints during shard move along with
other DDL statements
create_reference_table() creates a hash distributed table with shard count
equals to 1 and replication factor equals to shard_replication_factor
configuration value.
Adds support for PostgreSQL 9.6 by copying in the requisite ruleutils
file and refactoring the out/readfuncs code to flexibly support the
old-style copy/pasted out/readfuncs (prior to 9.6) or use extensible
node APIs (in 9.6 and higher).
Most version-specific code within this change is only needed to set new
fields in the AggRef nodes we build for aggregations. Version-specific
test output files were added in certain cases, though in most they were
not necessary. Each such file begins by e.g. printing the major version
in order to clarify its purpose.
The comment atop citus_nodes.h details how to add support for new nodes
for when that becomes necessary.
This change adds the pg_dist_local_group metadata table, which indicates
the group id of the current node. It is expected that this table contains
one and only one row, which only contains the group id of the node as an
integer.
With this change, master_copy_shard_placement and master_move_shard_placement functions
start to copy/move given shard along with its co-located shards.
This commit completes having support in Citus by adding having support for
real-time and task-tracker executors. Multiple tests are added to regression
tests to cover new supported queries with having support.
This change adds the required infrastructure about metadata snapshot from MX
codebase into Citus, mainly metadata_sync.c file and master_metadata_snapshot UDF.
Two sets of tests are fixed by this change:
* multi_agg_approximate_distinct
* those in multi_task_tracker_extra_schedule
The first broke when we renamed stage to load in many files and was
never being run because the HyperLogLog extension wasn't easily
available in Debian. Now it's in our repo, so we install it and run
the test. I removed the distinct HLL target in favor of just always
running it and providing an output variant to handle when the extension
is absent. Basically, if PostgreSQL thinks HLL is available, the test
installs it and runs normally, otherwise the absent variant is used.
The second broke when I removed a test variant, erroneously believing
it to be related to an older Citus version. I've added a line in that
test to clarify why the variant is necessary (a practice we should
widely adopt).
So far placements were assigned an Oid, but that was just used to track
insertion order. It also did so incompletely, as it was not preserved
across changes of the shard state. The behaviour around oid wraparound
was also not entirely as intended.
The newly introduced, explicitly assigned, IDs are preserved across
shard-state changes.
The prime goal of this change is not to improve ordering of task
assignment policies, but to make it easier to reference shards. The
newly introduced UpdateShardPlacementState() makes use of that, and so
will the in-progress connection and transaction management changes.
Related to #786
This change adds the `pg_dist_node` table that contains the information
about the workers in the cluster, replacing the previously used
`pg_worker_list.conf` file (or the one specified with `citus.worker_list_file`).
Upon update, `pg_worker_list.conf` file is read and `pg_dist_node` table is
populated with the file's content. After that, `pg_worker_list.conf` file
is renamed to `pg_worker_list.conf.obsolete`
For adding and removing nodes, the change also includes two new UDFs:
`master_add_node` and `master_remove_node`, which require superuser
permissions.
'citus.worker_list_file' guc is kept for update purposes but not used after the
update is finished.
related to a table that might be distributed, allowing any name
that is within regular PostgreSQL length limits to be extended
with a shard ID for use in shards on workers. Handles multi-byte
character boundaries in identifiers when making prefixes for
shard-extended names. Includes tests.
Uses hash_any from PostgreSQL's access/hashfunc.c.
Removes AppendShardIdToStringInfo() as it's used only once
and arguably is best replaced there with a call to AppendShardIdToName().
Adds UDF shard_name(object_name, shard_id) to expose the shard-extended
name logic to other PL/PGSQL, UDFs and scripts.
Bumps version to 6.0-2 to allow for UDF to be created in migration script.
Fixescitusdata/citus#781 and citusdata/citus#179.
hash_create(), called by TaskHashCreate(), doesn't work correctly for a
zero sized hash table. This triggers valgrind errors, and could
potentially cause crashes even without valgring.
This currently happens for Jobs with 0 tasks. These probably should be
optimized away before reaching TaskHashCreate(), but that's a bigger
change.
count_agg_clause *adds* the cost of the aggregates to the state
variable, it doesn't reinitialize it. That is intentional, as it is used
to incrementally add costs in some places.
is now a `::regtype` using the qualified name of the column type,
not the column type OID which may differ between master/worker nodes.
Test coverage of a hash reparitition using a UDT as the join column.
Note that the UDFs `worker_hash_partition_table` and `worker_range_partition_table`
are unchanged, and rightly expect an OID for the column type; but the
planner code building the commands now allows for `::regtype` casting
to do its magic.
Fixescitusdata/citus#111.
Fixescitusdata/citus#714
On `InsertShardRow`, we previously called `CommandCounterIncrement()` before
`CitusInvalidateRelcacheByRelid(relationId);`. This might prevent to skip
invalidation of the distributed table in the next access within the same session.
This commit enables to create different worker and master temporary folders.
This change is important for citus-mx on task-tracker execution. In simple words,
on citus-mx, the worker could actually be reponsible for the master tasks as well.
Prior to this change, both master and worker logic on task-tracker executor was
accessing and using the same files for different purposes which was dangerous on
certain cases (i.e., when task_tracker_delay is low).
Before this change, count on a distributed returned NULL if all shards
were pruned away, because on the master we replace with count(..) call
with a sum(..) call to sum the counts from the shards. However, sum
returns NULL when there are no rows, whereas count is expected to return
0.
I had changed these callbacks to use the same method I chose for the
router executor (for consistency), but as that method is flawed, we now
want to ensure we directly register them from PG_init as well.
Not entirely sure why we went with the shared memory hook approach, but
it causes problems (multiple registration) during crashes. Changing to
a simple direct registration call from PG_init.
An interaction between ReraiseRemoteError and DML transaction support
causes segfaults:
* ReraiseRemoteError calls PurgeConnection, freeing a connection...
* That connection is still in the xactParticipantHash
At transaction end, the memory in the freed connection might happen to
pass the "is this connection OK?" check, causing us to try to send an
ABORT over that connection. By removing it from the transaction hash
before calling ReraiseRemoteError, we avoid this possibility.
UNIQUE or PRIMARY KEY constraints. Also, properly propagate valid
EXCLUDE constraints to worker shard tables.
If an EXCLUDE constraint includes the distribution column,
the operator must be an equality operator.
Tests in regression suite for exclusion constraints that include
the partition column, omit it, and include it but with non-equality
operator. Regression tests also verify that valid exclusion constraints
are propagated to the shard tables. And the tests work in different
timezones now.
Fixescitusdata/citus#748 and citusdata/citus#778.
pg_toast_* oids are constantly changing, and this causes regression tests to
fail time to time. With this commit, we remove all of the pg_toast_* references
from regression test outputs.
Three changes here to get to true multi-statement, multi-relation DDL
transactions (same functionality pre-5.2, with benefits of atomicity):
1. Changed the multi-shard utility hook to always run (consistency
with router executor hook, removes ad-hoc "installed" boolean)
2. Change the global connection list in multi_shard_transaction to
instead be a hash; update related functions to operate on global
hash instead of local hash/global list
3. Remove check within DDL code to prevent subsequent DDL commands;
place unset/reset guard around call to ConnectToNode to permit
connecting to additional nodes after DDL transaction has begun
In addition, code has been added to raise an error if a ROLLBACK TO
SAVEPOINT is attempted (similar to router executor), and comprehensive
tests execute all multi-DDL scenarios (full success, user ROLLBACK, any
actual errors (say, duplicate index), partial failure (duplicate index
on one node but not others), partial COMMIT (one node fails), and 2PC
partial PREPARE (one node fails)). Interleavings with other commands
(DML, \copy) are similarly all covered.
To permit use with ZomboDB (https://github.com/zombodb/zombodb), two
changes were necessary:
1. Permit use of `tableoid` system column in queries
2. Extend relation names appearing in index expressions
The first is accomplished by simply changing the deparse logic to allow
system columns in queries destined for distributed tables. The latter
was slightly more complex, given that DDL extension currently occurs on
workers. But since indexes cannot reference tables other than the one
being indexed, it is safe to look for any relation reference ending in
a '*' character and extend their penultimate segments with a shard id.
This change also adds an error to prevent users from distributing any
relations using the WITH (OIDS) feature, which is unsupported.
The call to hash_create specified HASH_CONTEXT without actually setting
one using the provided HASHCTL. The hashes returned by this function
are used locally, so simply using CurrentMemoryContext is sufficient.
In subquery pushdown, we allow outer joins if the join condition is on the
partition columns. WhereClauseList() used to return all join conditions including
outer joins. However, this has been changed with a commit related to outer join
support on regular queries. With this commit, we refactored ExtractFromExpressionWalker()
to return two lists of qualifiers. The first list is for inner join and filter
clauses and the second list is for outer join clauses. Therefore, we can also
use outer join clauses to check subquery pushdown prerequisites.
Before this change, we do not check whether given table which already contains any data
in master_create_distributed_table command. If that table contains any data, making it
it distributed, makes that data hidden to user. With this change, we now gave error to
user if the table contains data.
Recent changes to DDL and transaction logic resulted in a "regression"
from the viewpoint of users. Previously, DDL commands were allowed in
multi-command transaction blocks, though they were not processed in any
actual transactional manner. We improved the atomicity of our DDL code,
but added a restriction that DDL commands themselves must not occur in
any BEGIN/END transaction block.
To give users back the original functionality (and improved atomicity)
we now keep track of whether a multi-command transaction has modified
data (DML) or schema (DDL). Interleaving the two modification types in
a single transaction is disallowed.
This first step simply permits a single DDL command in such a block,
admittedly an incomplete solution, but one which will permit us to add
full multi-DDL command support in a subsequent commit.
"Staging table" will be the only valid use of 'stage' from now on, we
will now say "load" when talking about data ingestion. If creation of
shards is its own step, we'll just say "shard creation".
Fixes#547
This change removes all references to \stage in the regression tests
and puts \COPY instead. Doing so changed shard counts, min/max
values on some test tables (lineitem, orders, etc.).
I've been seeing warnings on OS X/clang for a while about these lines
and finally got tired of it. The main problem is that PRIu64 expects a
uint64_t but we were passing a uint64 (a PostgreSQL-defined type). In
PostgreSQL 9.5, we now have INT64_MODIFIER, so can build our own zero-
padded unsigned 64-bit int format modifier that expects a PostgreSQL-
provided uint64 type.
This simplifies the code slightly (no more ifdefs) and gets rid of the
warning that's been annoying me since April (my TODO creation time).
A recent change to the image used in Travis causes some problems for
the code we use here to ensure the local replica is first. Since this
code is essentially dead in a post-stage world anyhow, we're OK with
ripping out the tests to placate Travis.
PostgreSQL 9.5.4 stopped calling planner for materialized view create
command when NO DATA option is provided.
This causes our test to behave differently between pre-9.5.4 and 9.5.4.
When an unreferenced prepared statement parameter does not explicitly
have a type assigned, we cannot deserialize it, to send to the remote
side. That commonly happens inside plpgsql functions, where local
variables are passed in as unused prepared statement parameters.
A recent change generates a "dummy" shard placement with its identifier
set to INVALID_SHARD_ID for SELECT queries against distributed tables
with no shards. Normally, no lock is acquired for SELECT statements,
but if all_modifications_commutative is set to true, we will acquire a
shared lock, triggering an assertion failure within LockShardResource
in the above case.
The "dummy" shard placement is actually necessary to ensure such empty
queries have somewhere to execute, and INVALID_SHARD_ID seems the most
appropriate value for the dummy's shard identifier field, so the most
straightforward fix is to just avoid locking invalid shard identifiers.
Text datums can't be directly accessed via the struct equivalence trick
used to access catalogs. That's because, as an optimization, they're
sometimes aligned to 1 byte ("text"'s alignment), and sometimes to 4
bytes. That depends on it being a short
varlena (cf. VARATT_NOT_PAD_BYTE) or not.
In the case at hand here, partkey became longer than 127 characters -
the boundary for short varlenas (cf. VARATT_CAN_MAKE_SHORT()). Thus it
became 4 byte/int aligned. Which lead to the direct struct access
accessing the wrong data.
The fix is simply to never access partkey that way - to enforce that,
hide partkey ehind the usual ifdef.
Fixes: #674
Fixes#679
This change sets the default commit protocol for distributed DDL
commands to '1pc'. If the user issues a distributed DDL command with
this default setting, then once in a session, a NOTICE message is
shown about using '2pc' being extra safe.
This adds support for SERIAL/BIGSERIAL column types. Because we now can
evaluate functions on the master (during execution), adding this is a
matter of ensuring the table creation step works properly.
To accomplish this, I've added some logic to detect sequences owned by
a table (i.e. those related to its columns). Simply creating a sequence
and using it in a default value is insufficient; users who do so must
ensure the sequence is owned by the column using it.
Fortunately, this is exactly what SERIAL and BIGSERIAL do, which is the
use case we're targeting with this feature. While testing this, I found
that worker_apply_shard_ddl_command actually adds shard identifiers to
sequence names, though I found no places that use or test this path. I
removed that code so that sequence names are not mutated and will match
those used by a SERIAL default value expression.
Our use of the new-to-9.5 CREATE SEQUENCE IF NOT EXISTS syntax means we
are dropping support for 9.4 (which is being done regardless, but makes
this change simpler). I've removed 9.4 from the Travis build matrix.
Some edge cases are possible in ALTER SEQUENCE, COPY FROM (on workers),
and CREATE SEQUENCE OWNED BY. I've added errors for each so that users
understand when and why certain operations are prohibited.
We remove schema name parameter from worker_fetch_foreign_file and
worker_fetch_regular_table functions. We now send schema name
concatanated with table name.
Fixes#676
We added old versions (i.e. without schema name) of worker_apply_shard_ddl_command,
worker_fetch_foreign_file and worker_fetch_regular_table back. During function call
of one of these functions, we set schema name as public schema and call the newer
version of the functions.
This change removes AllFinalizedPlacementsAccessible function since,
we open connections to all shard placements before any command is
sent so we immediately error out if a shard placement is not accessible.
This change allows users to interrupt long running DDL commands.
Interrupt requests are handled after each DDL command being propagated
to a shard placement, which means that generally the cancel request will
be processed right after the execution of the DDL is finished in the
current placement.
We can now support richer set of queries in router planner.
This allow us to support CTEs, joins, window function, subqueries
if they are known to be executed at a single worker with a single
task (all tables are filtered down to a single shard and a single
worker contains all table shards referenced in the query).
Fixes : #501
Fixes#132
We hook into ALTER ... SET SCHEMA and warn out if user tries to change schema of a
distributed table.
We also hook into ALTER TABLE ALL IN TABLE SPACE statements and warn out if citus has
been loaded.
Allows the use of modification commands (INSERT/UPDATE/DELETE) within
transaction blocks (delimited by BEGIN and ROLLBACK/COMMIT), so long as
all modifications hit a subset of nodes involved in the first such com-
mand in the transaction. This does not circumvent the requirement that
each individual modification command must still target a single shard.
For instance, after sending BEGIN, a user might INSERT some rows to a
shard replicated on two nodes. Subsequent modifications can hit other
shards, so long as they are on one or both of these nodes.
SAVEPOINTs are supported, though if the user actually attempts to send
a ROLLBACK command that specifies a SAVEPOINT they will receive an
ERROR at the end of the topmost transaction.
Placements are only marked inactive if at least one replica succeeds
in a transaction where others fail. Non-atomic behavior is possible if
the shard targeted by the initial modification within a transaction has
a higher replication factor than another shard within the same block
and a node with the latter shard has a failure during the COMMIT phase.
Other methods of denoting transaction blocks (multi-statement commands
sent all at once and functions written in e.g. PL/pgSQL or other such
languages) are not presently supported; their treatment remains the
same as before.
Fixes#555
Before this change, we were resolving HLL function and type Oid without qualified name.
Now we find the schema name where HLL objects are stored and generate qualified names for
each objects.
Similar fix is also applied for cstore_table_size function call.
Fixes#565Fixes#626
To add schema support to citus, we need to schema-prefix all table names, object names etc.
in the queries sent to worker nodes. However; query deparsing is not available for most of
DDL commands, therefore it is not easy to generate worker query in the master node.
As a solution we are sending schema names along with shard id and query to run to worker
nodes with worker_apply_shard_ddl_command.
To not break \STAGE command we pass public schema as paramater while calling
worker_apply_shard_ddl_command from there. This will not cause problem if user uses \STAGE
in different schema because passes schema name is used only if there is no schema name is
given in the query.
Fixes#215Fixes#267Fixes#502Fixes#556Fixes#557Fixes#560Fixes#568Fixes#623Fixes#624
With this change we schema-prefix table names, operator names and composite types.
This change fixes the unused variable problem in
`ExecuteDistributedDDLCommand` function (multi_utility.c). The
parameter is meant to be used in PreventTransactionChain call.
Fixes#513
This change modifies the DDL Propagation logic so that DDL queries
are propagated via 2-Phase Commit protocol. This way, failures during
the execution of distributed DDL commands will not leave the table in
an intermediate state and the pending prepared transactions can be
commited manually.
DDL commands are not allowed inside other transaction blocks or functions.
DDL commands are performed with 2PC regardless of the value of
`citus.multi_shard_commit_protocol` parameter.
The workflow of the successful case is this:
1. Open individual connections to all shard placements and send `BEGIN`
2. Send `SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)`
to all connections, one by one, in a serial manner.
3. Send `PREPARE TRANSCATION <transaction_id>` to all connections.
4. Sedn `COMMIT` to all connections.
Failure cases:
- If a worker problem occurs before sending of all DDL commands is finished, then
all changes are rolled back.
- If a worker problem occurs after all DDL commands are sent but not after
`PREPARE TRANSACTION` commands are finished, then all changes are rolled back.
However, if a worker node is failed, then the prepared transactions in that worker
should be rolled back manually.
- If a worker problem occurs during `COMMIT PREPARED` statements are being sent,
then the prepared transactions on the failed workers should be commited manually.
- If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is
changed on workers.
- If master fails during `PREPARE TRANSACTION` commands are being sent, then the
prepared transactions on workers should be rolled back manually.
- If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being
sent, then the remaining prepared transactions on the workers should be handled manually.
This change also helps with #480, since failed DDL changes no longer mark
failed placements as inactive.
Fixes#394
This change adds LIMIT/OFFSET support for non router-plannable
distributed queries.
In cases that we can push the LIMIT down, we add the OFFSET value to
that LIMIT in the worker queries. When a query with LIMIT x OFFSET y is issued,
the query is propagated to the workers as LIMIT (x+y) OFFSET 0, and on the
master table, the original LIMIT and OFFSET values are used. With this change,
we can use OFFSET wherever we can use LIMIT.
- Enables using VOLATILE functions (like nextval()) in INSERT queries
- Enables using STABLE functions (like now()) targetLists and joinTrees
UPDATE and INSERT can now contain non-immutable functions. INSERT can contain any kind of
expression, while UPDATE can contain any STABLE function, so long as a Var is not passed
into the STABLE function, even indirectly. UPDATE TagetEntry's can now also include Vars.
There's an exception, CASE/COALESCE statements may not contain mutable functions.
Functions calls in master_modify_multiple_shards are also evaluated.
Fixes#463
OID of user-defined types may be different in master and worker nodes. This causes errors
while sending data between nodes with binary nodes. Because binary copy format adds OID
of the element if it is in an array. The code adding OID is in PostgreSQL code, therefore
we cannot change it. Instead we decided to use text format if we try to send array of
user-defined type.
It turns out some tests exercised this behavior, but removing it should
have no ill effects. Besides, both copy and INSERT disallow NULLs in a
table's partition column.
Fixes a bug where anti-joins on hash-partitioned distributed tables
would incorrectly prune shards early, result in incorrect results (test
included).
The upcoming RETURNING support would otherwise require too much
duplication. This contains most of the pieces required for RETURNING
support, except removing the planner checks and adjusting regression
test output.
The old targetlist wasn't used so far, but the upcoming RETURNING
support relies on it.
This also allows to get rid of some crufty code in
multi_executor.c:multi_ExecutorStart(), which used the worker query's
targetlist instead of the main statement's (which didn't have one up to
now).
The targetlist contains TargetEntrys containing expressions, not
expressions directly. That didn't matter so far, but with the upcoming
RETURNING support, the targetlist is inspected to build a TupleDesc.
ExecCleanTypeFromTL hits an assert when looking at something that's not
a TargetEntry.
Mark the entry as resjunk, so it's not actually used.
As postgres's generic .l -> .c Makefile rule uses ifdef - which is
evaluated early, not during rule evaluation - we have to override the
rule, in addition to the detection of FLEX in the previous commit.
Fixes: #439
The only way we re-raise an error is if the raiseError flag is true, so
might as well purge connection in that block rather than independently
checking errorLevel.
Fixes#78
With this change, it is possible to append a table in any schema to shard. The function
master_append_table_to_shard now supports schema names.
For CITUS_RTE_RELATION type fragments, reloading shardIntervals from the
database is rather expensive. So store a pointer to the full shard
interval, instead of just the shard id. There's no new memory lifetime
hazards here, because we already passed a pointer to the shardInterval's
->shardId field around.
The plan time for the query in issue #607 goes from 2889 ms to 106 ms.
with this change.
By far the most expensive part of ShardIntervalsOverlap() is computing
the function to use to determine overlap. Luckily we already have that
computed and cached.
The plan time for the query in issue #607 goes from 8764 ms to 2889 ms
with this change.
-Added 2 more schedules for task-tracker and multi-binary
instead of running multi_schedule 3 times
-set task-tracker-delay for each long running schedule
Fixes#550, fixes#545
If table name contains special characters, it needs to be escaped. However in some cases,
we escape table name before appending shardId, which causes syntax error in the queries
sent to worker nodes. With this change we now append shardId before escaping table names.
This checkin removes variant files we needed
due to differences in outputs of pg94 and pg95 runs.
However, variant file for test multi_upsert stays
since this file tests for a feature that does not
exist in pg94, and outputs are drastically different.
now copies all column references in count distinct aggreagete
to worker target list and group by. Master target list is
also updated to reflect changes in attribute order.
Fixes 569
Fixes#496
Previously we do not check whether table is foreign or not while creating empty
shards, and set storage type to 't'(Standard table) or 'c'(Columnar table). Now
if the table is foreign table(but not CStore foreign table) we set storage
type to 'f'(Foreign table). If it is CStore foreign table, we set its storage
type to 'c', i.e. columnar table have priority over foreign table.
Please note that 'c' is only used for CStore tables not for other possible
columnar stores at the moment. Possible improvement could be checking for other
columnar stores, though I am not sure if there is a way to check it for all
other columnar stores.
Since we now short-circuit on certain remote errors, we want to ensure
we preserve the old behavior of not modifying any placement states if
a non-short-circuiting error occurs on all placements.
There's not a ton of documentation about what CONTEXT lines should look
like, but this seems like the most dominant pattern. Similarly, users
should expect lowercase, non-period strings.
Fixes#271
This change sets ShardIds and JobIds for each test case. Before this change,
when a new test that somehow increments Job or Shard IDs is added, then
the tests after the new test should be updated.
ShardID and JobID sequences are set at the beginning of each file with the
following commands:
```
ALTER SEQUENCE pg_catalog.pg_dist_shardid_seq RESTART 290000;
ALTER SEQUENCE pg_catalog.pg_dist_jobid_seq RESTART 290000;
```
ShardIds and JobIds are multiples of 10000. Exceptions are:
- multi_large_shardid: shardid and jobid sequences are set to much larger values
- multi_fdw_large_shardid: same as above
- multi_join_pruning: Causes a race condition with multi_hash_pruning since
they are run in parallel.
Fixes#302
Since our previous syntax did not allow creating hash partitioned tables,
some of the previous tests manually changed partition method to hash to
be able to test it. With this change we remove unnecessary workaround and
create hash distributed tables instead. Also in some tests metadata was
created manually. With this change we also fixed this issue.
Fixes#475
With this change we prevent addition of ONLY clause to queries prepared for
worker nodes. When we add ONLY clause we may miss the inherited tables in
worker nodes created by users manually.
When executing queries with citus.task_executor = 'real-time', query
execution could, so far, spend a significant amount of time
sleeping. That's because we were
a) sleeping after several phases of query execution, even if we're not
waiting for network IO
b) sleeping for a fixed amount of time when waiting for network IO;
often a lot longer than actually required.
Just reducing the amount of time slept isn't a real solution, because
that just increases CPU usage.
Instead have the real-time executor's ManageTaskExecution return whether
a task is currently being processed, waiting for reads or writes, or
failed. When all tasks are waiting for IO use poll() to wait for IO
readyness.
That requires to slightly redefine how connection timeouts are handled:
before we counted the number of times ManageTaskExecution() was called,
and compared that with the timeout divided by the task check
interval. That, if processing of tasks took a while, could significantly
increase the time till a timeout occurred. Because it was based on the
ManageTaskExecution() being called on a constant interval, this approach
isn't feasible anymore. Instead measure the actual time since
connection establishment was started. That could in theory, if task
processing takes a very long time, lead to few passes over
PQconnectPoll().
The problem of sleeping too much also exists for the 'task-tracker'
executor, but is generally less problematic there, as processing the
individual tasks usually will take longer. That said, for e.g. the
regression tests it'd be helpful to use a similar approach.
Single table repartition subqueries now support count(distinct column)
and count(distinct (case when ...)) expressions. Repartition query
extracts column used in aggregate expression and adds them to target
list and group by list, master query stays the same (count (distinct ...))
but attribute numbers inside the aggregate expression is modified to
reflect changes in repartition query.
Now, master_create_empty_shard() will create shards according to the
value of citus.shard_placement_policy which also makes default round-robin
instead of random.
Fixes#10
This change creates a new UDF: master_modify_multiple_shards
Parameters:
modify_query: A simple DELETE or UPDATE query as a string.
The UDF is similar to the existing master_apply_delete_command UDF.
Basically, given the modify query, it prunes the shard list, re-constructs
the query for each shard and sends the query to the placements.
Depending on the value of citus.multi_shard_commit_protocol, the commit
can be done in one-phase or two-phase manner.
Limitations:
* It cannot be called inside a transaction block
* It only be called with simple operator expressions (like Single Shard Modify)
Sample Usage:
```
SELECT master_modify_multiple_shards(
'DELETE FROM customer_delete_protocol WHERE c_custkey > 500 AND c_custkey < 500');
```
Make's $(wildcard) does not sort the glob result, but returns filenames
in filesystem ordering. This makes the build result vary and hence
unreproducible on the binary level. Fix by adding $(sort).
Spotted by Debian's reproducible builds project.
This commit fixes failures happen during check-full. The change does make
clean seperation of executor types in certain places to keep the outputs
stable.
Now, we can copy to an append-partitioned distributed relation from
any worker node by providing master options such as;
COPY relation_name FROM file_path WITH (delimiter '|', master_host 'localhost', master_port 5432);
where master_port is optional and default is 5432.
Based on Andres' suggestion, I removed SetConnectionStatus, moving its
functionality directly into set_connection_status_bad, which now simply
shuts down the socket underlying a particular connection.
This keeps the functionality as-is while removing our questionable use
of internal libpq headers.
Fixes#477
This change fixes the compile time warning message in BuildMapMergeJob in
multi_physical_planner.c about mixed declarations and code. Basically, the
problematic declaration is moved up so that no expression is before it.
Allow references to columns in UPDATE statements
Queries like "UPDATE tbl SET column = column + 1" are now allowed, so long as you don't use any IMMUTABLE functions.
This change renames the distributed transaction manager parameter from
citus.copy_transaction_manager to citus.multi_shard_commit_protocol.
Distributed transaction manager has been used only by the COPY on hash
partitioned tables but it can be used by upcoming features so, we needed
to rename so that its name do not contain a reference to COPY.
The change also includes renames like transaction_manager_options to
commit_protocol_options and TRANSACTION_MANAGER_1PC to COMMIT_PROTOCOL_1PC.
With this change, declaration of MultiShardCommitProtocol (was
CopyTransactionManager) is moved from multi_copy.c to multi_transaction.c.
Currently that's just COPY FROM. There's other places where we could
check for permissions earlier (to fail less verbosely), but since
there's other pending changes in the whole DDL area, which is affected
by this, I'm just adding a note to those places.
That's important because ownership of relations implies special
privileges. Without this change, a distributed table can be accessible
by a table's owner, but a shard created by another user might not.
Some small parts of citus currently require superuser privileges; which
is obviously not desirable for production scenarios. Run these small
parts under superuser privileges (we use the extension owner) to avoid
that.
This does not yet coordinate grants between master and workers. Thus it
allows to create shards, load data, and run queries as a non-superuser,
but it is not easily possible to allow differentiated accesses to
several users.
\stage so far directly inserted into pg_dist_shard and
pg_dist_shard_placement. That makes it hard to do effective permission
checks. Thus move the inserts into two C functions.
These two new functions aren't the nicest abstraction. But as we are
planning to obsolete \stage, it doesn't seem worthwhile to refactor the
client-side code of \stage to allow the use of
master_create_empty_shard() et al.
Previously several commands, amongst them commands like
master_create_distributed_table(), were allowed for everyone. That's not
good: Even though citus currently requires superuser permissions, we
shouldn't allow non-superusers to perform actions as sensitive as making
a table distributed.
There's no checks on the worker_* functions, as these usually just punt
the action to underlying postgres functionality, which then perform the
necessary checks.
Citus' extension version now has a -$schemaversion appendix. When the
schema is changed, a new schema version has to be added; changes to the
same schema version several commits inside a single pull request are ok.
Schema migration scripts between each schema version have to be
added. To ensure upgrade scripts work correctly a new regression test
ensures that all steps work.
The extension scripts to-be-used for CREATE EXTENSION (i.e. not
extension updates) are generated by concatenating citus.sql and the
relevant migration scripts.
Otherwise the owner of relations and such will depend on the username of
the user running the regression tests. As "postgres" is the most common
username for that purpose, hardcode that in pg_regress_multi.pl.
So far we've always used libpq defaults when connecting to workers; bar
special environment variables being set that'll always be the user that
started the server. That's not desirable because it prevents using
users with fewer privileges.
Thus change the various APIs creating connections to workers to always
use usernames. That means:
1) MultiClientConnect() needs to, optionally, accept a username
2) GetOrEstablishConnection(), including the underlying cache, need to
use the current user as part of the connection cache key. That way
connections for separate users are distinct, and we always use one
with the correct authorization.
3) The task tracker needs to keep track of the username associated with
a task, so it can use it when establishing connections outside the
originating session.
This commit adds a fast shard pruning path for INSERTs on
hash-partitioned tables. The rationale behind this change is
that if there exists a sorted shard interval array, a single
index lookup on the array allows us to find the corresponding
shard interval. As mentioned above, we need a sorted
(wrt shardminvalue) shard interval array. Thus, this commit
updates shardIntervalArray to sortedShardIntervalArray in the
metadata cache. Then uses the low-level API that is defined in
multi_copy to handle the fast shard pruning.
The performance impact of this change is more apparent as more
shards exist for a distributed table. Previous implementation
was relying on linear search through the shard intervals. However,
this commit relies on constant lookup time on shard interval
array. Thus, the shard pruning becomes less dependent on the
shard count.
When we notice that pg_dist_partition is being invalidated we assume
that the citus extension is being dropped and drop state such as
extensionLoaded and the cached oids of all the metadata tables.
This frees the user from needing to reconnect after running DROP
EXTENSION, so we also no longer send a warning message.
- non-router plannable queries can be executed
by router executor if they satisfy the criteria
- router executor is removed from configuration,
now task executor can not be set to router
- removed some tests that error out for router executor
With #426, some new warning messages started to arise, because of
cross assignment of Node and Expr pointers. This change fixes the
warnings with type casts.
Fixes#379
Varchar VAR struct is wrapped in RELABELTYPE struct inside PostgreSQL code and
IsPartitionColumnRecursive function considers only VAR types so returning false
for varchar.
This change adds strip_implicit_coercions() call to the columnExpression in
IsPartitionColumnRecursive function so that we get rid of implicit coercions like
RELABELTYPE are stripped to VAR.
This change fixes the problem with joins with VARCHAR columns. Prior to
this change, when we tried to do large table joins on varchar columns, we got
an error of the form:
ERROR: cannot perform local joins that involve expressions
DETAIL: local joins can be performed between columns only.
This is because we have a check in CheckJoinBetweenColumns() which requires the
join clause to have only 'Var' nodes (i.e. columns). Postgres adds a relabel t
ype cast to cast the varchar to text; hence the type of the node is not T_Var
and the join fails.
The fix involves calling strip_implicit_coercions() to the left and right
arguments so that RELABELTYPE is stripped to VAR.
Fixes#76.
Fixes#375
Prior to this change, shard pruning couldn't be done if:
- Table is hash-distributed
- Partition column of is VARCHAR
- Query to be pruned is a subquery
There were two problems:
- A bug in left-side/right-side checks for the partition column
- We were not considering relabeled types (VARCHAR was relabeled as TEXT)
While reading this code last week, it appeared as though there was no
place we ensured that the partition clause actually used equality ops.
As such, I was worried that we might transform a clause such as id < 5
into a constraint like hash(id) = hash(5) when doing shard pruning. The
relevant code seemed to just ensure:
1. The node is an OpExpr
2. With a related hash function
3. It compares the partition column
4. Against a constant
A superficial reading implied we didn't actually make sure the original
op was equality-related, but it turns out the hash lookup function DOES
ensure that for us. So I added a comment.
Previously (if you're creating the index with the same name on different
tables) we successfully ran the command on the workers before failing it
on the master and leaving no record of the index.
Now we check whether the index exists on the master before sending
commands to the workers.
--
Also make the error better when user attampts to create an index without
a name. Previously those statements returned:
brian=# create index on c (b);
WARNING: could not receive query results from localhost:9700
DETAIL: Client error: cannot extend name for null index name
ERROR: could not execute DDL command on worker node shards
They now return
brian=# create index on c (b);
ERROR: creating index without a name on a distributed table is
currently unsupported
This macro is intended to receive a bare integer literal (no suffix).
It adds a suffix as necessary, depending upon available features. On
e.g. 32-bit platforms, the existing code failed to compile because a
suffix was added to the existing suffix. This fixes that problem.
Fixes#363
This change modifies the error message given when Citus is attempted
to be loaded other than shared_preload_libraries. Explanations have been
extended with that shared_preload_parameters parameter is in
postgresql.conf and citus should be at the beginning.
Prior to this change, performing a SELECT query without a target
list caused backend to crash.
Sample Query: SELECT FROM github_events; (without any * before FROM)
PostgreSQL:
```
--
(39599 rows)
```
Citus:
```
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
```
The problem was an unnecessary Assert on column list in
SetRangeTblExtraData(citus_nodefuncs.c)
This change removes the whitelisting check on the WHERE clauses. Note that, before
this change, citus was already allowing all types of nodes with the following
format (i.e., wrap with a boolean test):
* SELECT col FROM table WHERE (ANY EXPRESSION) is TRUE;
Thus, this change is mostly useful for allowing the expressions in the WHERE clause
directly and avoiding "unsupport clause type" errors.
Prior to this change, it was not possible to use UDFs in repartitioned
subqueries. The reason is that we were setting the search path explicitly
and omiting public schema from that path.
This change adds the public schema to the explicitly set search path.
Fixes issue #258
Prior to this change, Citus gives a deceptive NOTICE message when a query
including ANY or ALL on a non-partition column is issued on a hash
partitioned table.
Let the github_events table be hash-distributed on repo_id column. Then,
issuing this query:
SELECT count(*) FROM github_events WHERE event_id = ANY ('{1,2,3}')
Gives this message:
NOTICE: cannot use shard pruning with ANY (array expression)
HINT: Consider rewriting the expression with OR clauses.
Note that since event_id is not the partition column, shard pruning would
not be applied in any case. However, the NOTICE message would be valid
and be given if the ANY clause would have been applied on repo_id column.
Reviewer: Murat Tuncer
WORKER_LENGTH + 1 is too large. Fixing this has no impact on the string
that is ultimately copied, as it's impossible for the source string to
be any larger to begin with.
The previous form of the test, utilizing DEBUG2, included too much
output dependent on the specifc system and version. Reformulate it to
explicitly connect to workers and show the schema there, when necessary.
The only remaining difference in some of the remaining alternate
regression test files was due to an older minor version release
change. Remove those as well.
There already exist tests that locally embed knowledge about port
numbers, and there's more tests requiring that. Instead of copying
\set's to several tests, make these port number variables available to
all tests.
multi_ExecutorStart() replaces the original planned statement with the
master select statement. As that hasn't gone through the parse analysis
hooks, it'll not have a associated queryId. This prevents extensions
pg_stat_statements to show useful data associated with the query.
I came across several places we weren't as flexible or resilient as we
should have been in our build logic. They include:
* Not using `DESTDIR` in the install-header destination
* Allowing callers to specify `VPATH` or `srcdir` (which breaks)
* Using absolute path for SCRIPTS (9.5 prepends srcdir)
* Including libpq-int in a confusing way (extracted this function)
* Having server includes come first during csql build (client must)
In particular, I hit all of these attempting to build with pg_buildext
in Debian. It passes in an explicit VPATH, as well as srcdir (breaking
all recursive make invocations), and also uses DESTDIR during install.
In addition, a PGDG-enabled Debian box will have the latest libpq-dev
headers (e.g. 9.5) even when building against an older server version
(e.g. 9.4). This leads to problems when including e.g. `c.h`, which
is ambiguous. While compiling more client-side code (csql), we need to
ensure the newer libpq headers are included _first_, so I fixed that.
The default staging policy is now round-robin, though tests were still
configured to use local-first. Testing with the shipping default seems
like the best option, correctness-wise, and since local-first has some
issues with OSes where connecting from localhost doesn't always resolve
to 'localhost', just going with the default is a win-win.
Though Citus' Task struct has a shardId field, it doesn't have the same
semantics as the one previously used in pg_shard code. The analogous
field in the Citus Task is anchorShardId. I've also added an argument
check to the relevant locking function to catch future locking attempts
which pass an invalid argument.
- Flexed the check which prevented append operation cstore tables
since its storage type is not SHARD_STORAGE_TABLE.
- Used process utility function to perform copy operation in
worker_append_table_to shard() instead of directly calling
postgresql DoCopy().
- Removed the additional check in master_create_empty_shard() function.
This check was redundant and erroneous since it was called after
CheckDistributedTable() call.
- Modified WorkerTableSize() function to retrieve cstore table shard
size correctly.
After this change, shards and associated metadata are automatically
dropped when running DROP TABLE on a distributed table, which fixes#230.
It also adds schema support for master_apply_delete_command, which
fixes#73.
Dropping the shards happens in the master_drop_all_shards UDF, which is
called from the SQL_DROP trigger. Inside the trigger, the table is no
longer visible and calling master_apply_delete_command directly wouldn't
work and oid <-> name mappings are not available. The
master_drop_all_shards function therefore takes the relation id, schema
name, and table name as parameters, which can be obtained from
pg_event_trigger_dropped_objects() in the SQL_DROP trigger. If the user
calls master_drop_all_shards while the table still exists, the schema
name and table name are ignored.
Author: Marco Slot
Reviewed-By: Andres Freund
I removed two braces to have this function remain more similar to the
original PostgreSQL function and added uncrustify commands to disable
formatting of its contents.
All citusdb references in
- extension, binary names
- file headers
- all configuration name prefixes
- error/warning messages
- some functions names
- regression tests
are changed to be citus.
The postgres_fdw extension has an extern function with an identical
signature, which can cause problems when both extensions are loaded.
A simple rename can fix this for now (this is the only function with)
such a conflict.
Previously we used, for historical reasons, MessageContext.
That is problematic if a single message from the client
causes a lot of statements to be planned. E.g. for the
copy_to_distributed_table script one insert statement
is planned for each row inserted via COPY, and only freed
when COPY has finished.
This entirely removes any restriction on the type of partitioning
during DML planning and execution. Though there aren't actually any
technical limitations preventing DML commands against append- (or even
range-) partitioned tables, we had initially forbidden this, as any
future stage operation could cause shards to overlap, banning all
subsequent DML operations to partition values contained within more
than one shards. This ended up mostly restricting us, so we're now
removing that restriction.
When two data types have the same binary representation, PostgreSQL may
add an implicit coercion between them by wrapping a node in a relabel
type. This wrapper signals that the wrapped value is completely binary
compatible with the designated "final type" of the relabel node. As an
example, the varchar type is often relabeled to text, since functions
provided for use with text (comparisons, hashes, etc.) are completely
compatible with varchar as well.
The hash-partitioned codepath contains functions that verify queries
actually contain an equality constraint on the partition column, but
those functions expect such constraints to be comparison operations
between a Var and Const. The RelabelType wrapper node causes these
functions to always return false, which bypasses shard pruning.