We mark objects as distributed in Citus metadata only if we need to
propagate the command that creates them to worker nodes. For this
reason, we were not doing this for objects that are created while
pg_dist_node is empty.
One implication of doing so is that we defer the schema propagation to
the time when the user creates the first distributed table in the
schema. However, this doesn't help for schema-based sharding (#6866)
because we want to sync pg_dist_tenant_schema to the worker nodes even
for empty schemas.
* Support test dependencies for isolation tests without a schedule
* Comment out a test due to a known issue (#6901)
* Also, reduce the verbosity for some log messages and make some
tests compatible with run_test.py.
Enable router planner and a limited version of INSERT .. SELECT planner
for the queries that reference colocated null shard key tables.
* SELECT / UPDATE / DELETE / MERGE is supported as long as it's a router
query.
* INSERT .. SELECT is supported as long as it only references colocated
null shard key tables.
Note that this is not limited to distributed INSERT .. SELECT but also
covers a limited set of query types that require pull-to-coordinator,
e.g., due to a LIMIT clause, generate_series() etc.
(Ideally distributed INSERT .. SELECT could handle such queries too,
e.g., when we're only referencing tables that don't have a shard key,
but today this is not the case. See
https://github.com/citusdata/citus/pull/6773#discussion_r1140130562.)
With this PR, we allow creating distributed tables without
specifying a shard key via create_distributed_table(). Here are
the important details about those tables:
* Specifying `shard_count` is not allowed because it is assumed to be 1.
* We mostly refer to such tables as "null shard-key" tables in code /
comments.
* To avoid doing a breaking layout change in create_distributed_table(),
instead of throwing an error, it informs the user that the
`distribution_type` param is ignored unless it's explicitly set to NULL
or 'h'.
* `colocate_with` param allows colocating such null shard-key tables to
each other.
* We define this table type, i.e., NULL_SHARD_KEY_TABLE, as a subclass
of DISTRIBUTED_TABLE because we mostly want to treat them as distributed
tables in terms of SQL / DDL / operation support.
* Metadata for such tables looks like:
- distribution method => DISTRIBUTE_BY_NONE
- replication model => REPLICATION_MODEL_STREAMING
- colocation id => **!=** INVALID_COLOCATION_ID (distinguishes from
Citus local tables)
* We assign colocation groups for such tables to different nodes in a
round-robin fashion based on the modulo of "colocation id".
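
As a rough illustration of the round-robin rule above, the following Python sketch picks a node from the modulo of the colocation id. The function name and the node list are hypothetical; the real logic lives in Citus' C code and may differ in detail.

```python
def pick_node_for_colocation_group(colocation_id, worker_nodes):
    """Pick a node for a colocation group in round-robin fashion.

    Hypothetical helper: it only illustrates the modulo-based assignment.
    """
    if not worker_nodes:
        raise ValueError("no nodes available")
    return worker_nodes[colocation_id % len(worker_nodes)]


nodes = ["worker-1", "worker-2", "worker-3"]
# Consecutive colocation ids land on consecutive nodes and wrap around.
assignments = {
    cid: pick_node_for_colocation_group(cid, nodes) for cid in range(4, 8)
}
```

Because the choice depends only on the colocation id and the node count, tables in the same colocation group always end up on the same node.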
Note that this PR doesn't address DDL (except CREATE TABLE) / SQL /
operation (i.e., Citus UDFs) support for such tables but adds a
preliminary API.
There was a bug related to regex. We sometimes caught the wrong line
when the test name is also included in comments.
Example: We caught the wrong line as multi_metadata_sync is included in
the comment before the test line.
```
# ----------
# multi_metadata_sync tests the propagation of mx-related metadata changes to metadata workers
# multi_unsupported_worker_operations tests that unsupported operations error out on metadata workers
# ----------
test: multi_metadata_sync
```
Solution: Make the regex rule more restrictive.
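
A minimal sketch of the kind of stricter match that avoids the comment line, assuming schedule lines that run tests start with `test:`. The helper name and the exact pattern are illustrative and not necessarily what `run_test.py` uses.

```python
import re

SCHEDULE = """\
# ----------
# multi_metadata_sync tests the propagation of mx-related metadata changes to metadata workers
# ----------
test: multi_metadata_sync
"""


def find_test_line(schedule_text, test_name):
    """Return the schedule line that runs test_name, ignoring comments."""
    # Anchor on "test:" at the start of a line and require the name to
    # appear as a whole word.
    pattern = re.compile(
        r"^test:.*\b" + re.escape(test_name) + r"\b.*$", re.MULTILINE
    )
    match = pattern.search(schedule_text)
    return match.group(0) if match else None


# A loose pattern hits the comment first; the anchored one does not.
loose = re.search(r"^.*multi_metadata_sync.*$", SCHEDULE, re.MULTILINE).group(0)
strict = find_test_line(SCHEDULE, "multi_metadata_sync")
```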
In #6814 we started using the Python test runner for upgrade tests in
run_test.py, instead of the Perl based one. This had a problem though:
not all tests in minimal_schedule can be run with the Python runner.
This adds a separate minimal schedule for the pg_upgrade tests which
doesn't include the tests that break with the Python runner.
This PR also fixes various other issues that came up while testing
the upgrade tests.
- The query generator is used to create queries allowed by the grammar, which is documented at `query_generator/query_gen.py` (currently it contains only joins).
- This PR adds a CI test which uses the query generator to compare the results of generated queries that are executed on Citus tables and local (undistributed) tables. It fails if there is an unexpected error in the results. The error can be related to Citus, the query generator, or even Postgres.
- The tool is configured by the file `query_generator/config/config.yaml`, which limits table counts in generated queries and sets many table related parameters (e.g. row count).
- The run time of the CI task can be configured from the config file. By default, we run 250 queries with a maximum table count of 40 inside each query.
DESCRIPTION: Changes the regression test setups adding the coordinator
to metadata by default.
When creating a Citus cluster, the coordinator can be added to metadata
explicitly by running the `citus_set_coordinator_host` function. Adding
the coordinator to metadata allows creating Citus managed local tables.
Other Citus functionality is expected to be unaffected.
This change adds the coordinator to metadata by default when creating
test clusters in regression tests.
There are 3 ways to run commands in a sql file (or a schedule which is a
sequence of sql files) with Citus regression tests. Below is how this PR
adds the coordinator to metadata for each.
1. `make <schedule_name>`
Changed the sql files (sql/multi_cluster_management.sql and
sql/minimal_cluster_management.sql) which set up the test clusters such
that they call `citus_set_coordinator_host`. This ensures any following
tests will have the coordinator in metadata by default.
2. `citus_tests/run_test.py <sql_file_name>`
Changed the python code that sets up the cluster to always call
`citus_set_coordinator_host`.
For the upgrade tests, a version check is included to make sure the
`citus_set_coordinator_host` function is available for a given version.
3. `make check-arbitrary-configs`
Changed the python code that sets up the cluster to always call
`citus_set_coordinator_host`.
#6864 will be used to track the remaining work which is to change the
tests where coordinator is added/removed as a node.
Over the last few months run_test.py got more and more complex. This
refactors the code in `run_test.py` to be easier to understand. Mostly
this splits up separate pieces of logic into separate functions.
For some tests such as upgrade tests and arbitrary config tests we set
up the citus cluster using Python. This setup is slightly different from
the perl based setup script (`multi_regress.pl`). Most importantly it
uses replication factor 1 by default.
This changes our run_test.py script to be able to run a schedule using
python instead of `multi_regress.pl`, for the tests that require it.
For now arbitrary config tests are still not runnable with
`run_test.py`, but this brings us one step closer to being able to do
that.
Fixes #6804
Having as little Perl as possible in our repo seems a worthy goal. Sadly
Postgres's Perl based TAP infrastructure was the only way in which we
could run tests that were hard to do using only SQL commands. This
change adds infrastructure to run such "application style tests" using
python and converts all our existing Perl TAP tests to this new
infrastructure.
Some of the helper functions that are added in this PR are currently
unused. Most of these will be used by the CDC PR that depends on this.
Some others are there because they were needed by the PgBouncer test
framework that this is based on, and the functions seemed useful enough
to citus testing to keep.
The main features of the test suite are:
1. Application style tests using a programming language that our
developers know how to write.
2. Caching of Citus clusters in-between tests using the ["fixture"
pattern][fixture] from `pytest` to achieve speedy tests. To make this
work in practice any changes made during a test are automatically
undone. Schemas, replication slots, subscriptions, publications are
dropped at the end of each test. Any changes made by `ALTER SYSTEM`
or by manually editing `pg_hba.conf` are undone too.
3. Automatic parallel execution of tests using the `-n auto` flag that's
added by `pytest-xdist`. This improved the speed of tests greatly with
the similar test framework I created for PgBouncer. Right now it doesn't
help much yet though, since this PR only adds two tests (one of which
takes ~10 times longer than the other).
Possible future improvements are:
1. Clean up even more things at the end of each test (e.g. users that
were created). These are fairly easy to add, but I have not done so yet
since they were not needed yet for this PR or the CDC PR. So I would not
be able to test the cleanup easily.
2. Support for query block detection similar to what we can now do using
isolation tests.
[fixture]: https://docs.pytest.org/en/6.2.x/fixture.html
DESCRIPTION: This PR removes the task dependencies between shard moves
for which the shards belong to different colocation groups. This change
results in scheduling multiple tasks in the RUNNABLE state. Therefore it
is possible that the background task monitor can run them concurrently.
Previously, all the shard moves planned in a rebalance operation
depended on each other sequentially.
For instance, given the following tables and shards
colocation group 1 colocation group 2
table1 table2 table3 table4 table5
shard11 shard21 shard31 shard41 shard51
shard12 shard22 shard32 shard42 shard52
if the rebalancer planner returned the below set of moves
`{move(shard11), move(shard12), move(shard41), move(shard42)}`
background rebalancer scheduled them such that they depend on each other
sequentially.
```
{move(reftables) if there is any, none}
|
move(shard11)
|
move(shard12)
| {move(shard41)<--- move(shard12)} This is an artificial dependency
move(shard41)
|
move(shard42)
```
This results in artificial dependencies between otherwise independent
moves.
Considering that the shards in different colocation groups can be moved
concurrently, this PR changes the dependency relationship between the
moves as follows:
```
{move(reftables) if there is any, none} {move(reftables) if there is any, none}
| |
move(shard11) move(shard41)
| |
move(shard12) move(shard42)
```
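
The new dependency rule can be sketched in Python as follows. This is an illustration only: the function and variable names are hypothetical, the reference table move that heads each chain is left out for brevity, and the actual scheduling is done in Citus' C code.

```python
def plan_dependencies(moves, colocation_group_of):
    """Map each move to the move it depends on (None if it has none).

    A move depends only on the previous move in the same colocation group,
    so moves in different groups stay independent of each other.
    """
    last_move_in_group = {}
    depends_on = {}
    for move in moves:
        group = colocation_group_of[move]
        depends_on[move] = last_move_in_group.get(group)
        last_move_in_group[group] = move
    return depends_on


groups = {"shard11": 1, "shard12": 1, "shard41": 2, "shard42": 2}
deps = plan_dependencies(["shard11", "shard12", "shard41", "shard42"], groups)
```

Here `shard41` has no predecessor, so it can be in the RUNNABLE state at the same time as `shard11`.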
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Description:
Implements CDC changes using logical replication. To avoid
re-publishing events multiple times, we set up a replication origin
session, which adds "DoNotReplicateId" to every WAL entry, during the
following operations:
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder, which decodes WAL events for CDC clients, ignores
any WAL entry with a replication origin that is not zero.
It also maps the shard names to distributed table names.
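
The decoder behaviour described above can be sketched roughly as below. This is a simplified Python model, not the decoder itself (which is a C output plugin): the entry layout and field names are hypothetical, and it assumes shard relations are named `<table>_<shardid>`.

```python
import re

# Assumption: shard relations are named "<table>_<shardid>". This naive
# pattern would also strip a numeric suffix from a regular table name.
SHARD_SUFFIX = re.compile(r"_\d+$")


def decode_for_cdc(wal_entries):
    """Yield (table, payload) for the entries a CDC client should see."""
    for entry in wal_entries:
        # Entries written under a replication origin session carry a
        # non-zero origin id and are skipped.
        if entry["origin_id"] != 0:
            continue
        yield SHARD_SUFFIX.sub("", entry["relation"]), entry["payload"]


entries = [
    {"origin_id": 0, "relation": "orders_102008", "payload": "INSERT 1"},
    {"origin_id": 1, "relation": "orders_102009", "payload": "INSERT 1"},
]
decoded = list(decode_for_cdc(entries))
```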
Soon I will be making some changes related to #692 in the router
planner, and those changes require updating ~5/6 tests related to router
planning. To make those test files runnable by run_test.py multiple
times, we need to make some other tests (those that run in parallel with
them or that they depend on) ready for run_test.py too.
When run_test.py is run for an upgrade_.*_after.sql file, automatically
run the corresponding upgrade_.*_before.sql file first.
This is because all those upgrade_.*_after.sql files depend on the
objects created in upgrade_.*_before.sql files by definition.
DESCRIPTION: Correctly report shard size in citus_shards view
When looking at citus_shards, people are interested in the actual size
that all the data related to the shard takes up on disk.
`pg_total_relation_size` is the function to use for that purpose. The
previously used `pg_relation_size` does not include indexes or TOAST.
Especially the missing TOAST can have an enormous impact on the size
that is shown.
With this small change, arbitrary config tests can have multiple acceptable correct outputs.
For an arbitrary config test named `t`, you can now define `expected/t.out`, `expected/t_0.out`, `expected/t_1.out` etc., and the test will succeed if the output of `sql/t.sql` is equal to any of the `t.out` or `t_{0, 1, ...}.out` files.
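
The matching rule can be sketched as follows. This is a self-contained illustration with a hypothetical helper name; the real comparison in the test scripts may be implemented differently (for example via `diff`).

```python
import re
import tempfile
from pathlib import Path


def matches_any_expected(test_name, actual_output, expected_dir):
    """Return True if actual_output equals t.out or any t_<n>.out file."""
    pattern = re.compile(re.escape(test_name) + r"(_\d+)?\.out")
    for path in sorted(Path(expected_dir).iterdir()):
        if pattern.fullmatch(path.name) and path.read_text() == actual_output:
            return True
    return False


with tempfile.TemporaryDirectory() as expected_dir:
    Path(expected_dir, "t.out").write_text("a\nb\n")
    Path(expected_dir, "t_0.out").write_text("b\na\n")
    first = matches_any_expected("t", "a\nb\n", expected_dir)
    alternative = matches_any_expected("t", "b\na\n", expected_dir)
    mismatch = matches_any_expected("t", "c\n", expected_dir)
```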
First of all, this commit sets next_shard_id for
single_node_truncate.sql because shard ids in the test output were
changing whenever we modify a prior test file.
Then the flaky test detector started complaining about
single_node_truncate.sql. We fix that by specifying the correct
test dependency for it in run_test.py.
In #6718 I accidentally added Python type hint syntax that was only
supported on Python 3.10. Our CI uses 3.9, so this PR changes that to a
syntax that's supported on 3.9 too.
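
The message does not say which syntax was involved. One common case is the PEP 604 union (`int | None`), which is evaluated when the function is defined and fails on Python 3.9 unless annotations are deferred. The sketch below shows the 3.9-compatible spelling on a hypothetical helper.

```python
from typing import Optional


# Works on Python 3.9 and later; writing `port: int | None` instead
# needs Python 3.10+ (or `from __future__ import annotations`).
def connection_string(host: str, port: Optional[int] = None) -> str:
    """Hypothetical helper used only to show the annotation style."""
    return host if port is None else f"{host}:{port}"
```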
Some of our tests depend on previous tests. Normally all these tests
should be part of a base schedule, but that's not always the case. The
flaky test detection script should ensure that we don't introduce other
dependencies by accident in new tests. But we have many old tests that
are not worth the effort of changing. This adds a way to define such
test dependencies in `run_test.py`, so that it can make sure to run any
dependencies before the actual test.
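
One way such a dependency table could look is sketched below. The table contents, test names and helper are all hypothetical; the structure actually used in `run_test.py` may differ, and this sketch does not guard against cyclic dependencies.

```python
# Hypothetical dependency table: test name -> tests that must run first.
DEPS = {
    "test_c": ["test_b"],
    "test_b": ["test_a"],
}


def with_dependencies(test_name, deps=DEPS):
    """Return test_name preceded by its transitive dependencies, each once."""
    ordered = []

    def visit(name):
        # Visit dependencies first so they end up earlier in the list.
        for dep in deps.get(name, []):
            visit(dep)
        if name not in ordered:
            ordered.append(name)

    visit(test_name)
    return ordered
```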
This change is a precursor to attempts to add more editorconfig rules in
our codebase. It is a good idea to comply with POSIX standards and have
a newline at the end of text files. However, before this change the
arbitrary config scripts would fail once we had such a rule.
Related: #5981
* Skip some exceptional test files in the flaky workflow, like
multi_extension
* Run some tests without a schedule, like single_node_enterprise
* Use minimal schedule for the tests in split and operations schedules
DESCRIPTION: Support ALTER TABLE .. ADD PRIMARY KEY ... command
Before processing
> **ALTER TABLE ... ADD PRIMARY KEY ...**
command
1. Create a primary key name to use as the constraint name.
2. Change the **ALTER TABLE ... ADD PRIMARY KEY ...** command into the
**ALTER TABLE ... ADD CONSTRAINT \<constraint name> PRIMARY KEY ...**
form.
This is the only form in which we can specify a name for a primary key.
If we run ALTER TABLE .. ADD PRIMARY KEY, postgres would create a
constraint name internally in its own scheme. But the problem is that we
need to create constraint names for shards in our own scheme, which is
\<constraint name>_\<shardid>. Hence we need to create a name and send
it to workers so that the workers can append the shardid.
3. Run the changed command on the coordinator to make sure we are using
the same constraint name across the board.
4. Send the changed command to workers such that it is executed for the
main table as well as for the shards.
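
The steps above can be sketched in Python as plain string handling. This is only an illustration: the naming rule in `name_primary_key` is an assumption, identifiers are not quoted, and the real implementation works on parse trees in Citus' C code.

```python
def name_primary_key(table_name):
    """Hypothetical naming rule; Citus may generate names differently."""
    return f"{table_name}_pkey"


def rewrite_add_primary_key(table_name, columns):
    """Build the named ADD CONSTRAINT form of ADD PRIMARY KEY."""
    constraint = name_primary_key(table_name)
    column_list = ", ".join(columns)
    return (
        f"ALTER TABLE {table_name} ADD CONSTRAINT {constraint} "
        f"PRIMARY KEY ({column_list})"
    )


def shard_constraint_name(constraint, shard_id):
    """Shard-level name in the <constraint name>_<shardid> scheme."""
    return f"{constraint}_{shard_id}"
```

Generating the name up front is what lets the coordinator and every shard agree on it, since the shard-level name is derived from the same base.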
Fixes #6515.
This PR adds a new CI workflow named `flaky-test` to run flaky test
detection on newly introduced regression tests.
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
Our python based tests didn't always copy the normalized files after the
regress run. I had the problem where running the following command would
result in non-normalized files in the expected directory after running
our PG upgrade tests locally:
```
cp src/test/regress/{results,expected}/upgrade_list_citus_objects.out
```
This PR fixes that by always running `copy_modified` even if the tests
fail. The same was already being done for our perl based tests at the
end of the `pg_regress_multi.pl` file.
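
The "always run, even on failure" behaviour maps naturally onto `try` / `finally`. The sketch below uses stand-in functions (`run_regress` and this `copy_modified` are placeholders, not the real helpers) to show the control flow.

```python
calls = []


def run_regress():
    # Stand-in for the regress run; it fails to simulate failing tests.
    calls.append("regress")
    raise RuntimeError("tests failed")


def copy_modified():
    # Stand-in for the real step that copies normalized output files.
    calls.append("copy_modified")


def run_tests():
    try:
        run_regress()
    finally:
        # Runs whether run_regress() returned or raised.
        copy_modified()


try:
    run_tests()
except RuntimeError:
    failed = True
else:
    failed = False
```

The failure still propagates to the caller, so the test run is reported as failed, but the normalized files are copied first.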
I upgraded my OS to Ubuntu 22.04 a while back and since then some tests
order output slightly differently. I think it might be because of the
glibc upgrade that changed ordering for things like underscores and
spaces.
Changing the locale to C.UTF-8 solves this issue.
One of our arbitrary config tests would sometimes fail like this in CI:
```diff
su_nationkey,
cust_nation,
l_year;
- supp_nation | cust_nation | l_year | revenue
----------------------------------------------------------------------
- 9 | C | 2008 | 3.00
-(1 row)
-
+ERROR: cannot connect to localhost:10212 to fetch intermediate results
+CONTEXT: while executing command on localhost:10211
```
When looking at the logs it seems like we were running out of
connections:
```
2022-08-23 14:03:52.856 UTC [28122] FATAL: sorry, too many clients already
2022-08-23 14:03:52.860 UTC [21027] ERROR: cannot connect to localhost:10212 to fetch intermediate results
```
This happened with the `CitusThreeWorkersManyShards` config. This config
deliberately tries to push the limits of Citus quite far, and the
`ch_benchmarks_1` test is also run in parallel with a few other tests.
So it's not too weird that it ran out of connections. This doubles the
connection limit in the arbitrary config tests to hopefully not hit this
error again.
Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26365/workflows/7a1b5688-85cc-4bc3-ade5-9bd1d83cd0ed/jobs/747908/parallel-runs/1
In `pg_regress_multi.pl` we're running `initdb` with some options that
the `common.py` `initdb` is currently not using. All these flags seem
reasonable, so this brings `common.py` in line with
`pg_regress_multi.pl`.
In passing change the `--nosync` flag to `--no-sync`, since that's what
the PG documentation lists as the official option name (but both work).
Cluster setup time is significant in arbitrary configs. We can
parallelize this a bit more.
Runtime of the following command decreases from ~25 seconds to ~22
seconds on my machine with this change:
```
make -C src/test/regress/ check-arbitrary-base CONFIGS=CitusDefaultClusterConfig EXTRA_TESTS=prepared_statements_1
```
Currently we can only run different configs in parallel. However, when working on a feature or trying to fix a bug this is not important. In those cases you simply want to run a single test file on a single config, and you want to run that every time you make a change to the code that you think fixes the issue.
This PR allows running bash commands in parallel, so `initdb` and `pg_ctl start` are run in parallel for all nodes in the cluster instead of one waiting for the other.
Before this PR, nothing is run in parallel when you run the above command.
After this PR, cluster setup is run in parallel.
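
A minimal sketch of running per-node commands concurrently, using a thread pool around `subprocess`. The helper name is hypothetical, and the commands here are harmless stand-ins rather than real `initdb` / `pg_ctl start` invocations. The actual implementation in the test scripts may differ.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands):
    """Run each command (an argv list) concurrently; return exit codes in order."""

    def run(argv):
        return subprocess.run(argv, check=False).returncode

    # One thread per command is enough: each thread only waits on a process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run, commands))


# Stand-ins for the per-node setup commands of a three-node cluster.
node_commands = [[sys.executable, "-c", "pass"] for _ in range(3)]
exit_codes = run_in_parallel(node_commands)
```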
We already have fsync disabled for regular tests in `pg_regress_multi.pl`.
This does the same for the arbitrary config tests.
On my machine this changes the runtime of the following command from
~37 to ~25 seconds:
```bash
make -C src/test/regress/ check-arbitrary-configs CONFIGS=CitusDefaultClusterConfig
```
(cherry picked from commit 4e93afd1f78854e1aaab63690c441b0b0598a82c)
(cherry picked from commit 0295fe2f5b)
(cherry picked from commit 878510725fab9cb6870b4504e0b1f055d7bbc68d)
We had 2 class definitions for CitusCacheManyConnectionsConfig, where
one of them was a copy of CitusSmallCopyBuffersConfig.
This commit leaves the intended class definition that configures caching
many connections, and removes the one that is a copy of another class.
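
Python does not warn about such duplicates: a second `class` statement with the same name silently rebinds it, so only the later definition is in effect. The sketch below demonstrates this with a hypothetical `purpose` attribute; which of the two definitions came last in `config.py` is not stated here.

```python
class CitusCacheManyConnectionsConfig:
    # Hypothetical attribute, only to tell the two definitions apart.
    purpose = "cache many connections"


class CitusCacheManyConnectionsConfig:  # noqa: F811 (deliberate redefinition)
    purpose = "small copy buffers"


# The name now refers to the second class; the first one is unreachable.
active_purpose = CitusCacheManyConnectionsConfig.purpose
```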
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
local execution + remote execution
(task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
- [x] We do #shardcount round-trips on `SerializeNonCommutativeWrites`.
We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
coordinator-only throughput
- The performance slightly drops (~15%), unless
`citus.allow_modifications_from_workers_to_replicated_tables`
is set to false
This PR fixes 2 separate issues related to local runs of the citus upgrade tests.
d3e7c825ab fixes an issue caused by our new testing infrastructure, in which we moved/renamed some existing folders. This broke local runs of the citus upgrade tests since some paths were sensitive to such changes. This commit makes the paths more generic so that this issue is less likely to happen in the future, while also fixing the current issue.
93de6b60c3 fixes an issue caused by a new environment variable that was added for the citus upgrade tests and is defined in the CI. 0cb51f8c37/.circleci/config.yml (L294)
This environment variable wasn't set in our local runs, which caused problems. Instead of defining this environment variable for local runs, we change the citus_upgrade run command to use an existing env variable, which is now also set in the CI.
To run tests in parallel use:
```bash
make check-arbitrary-configs parallel=4
```
To run tests sequentially use:
```bash
make check-arbitrary-configs parallel=1
```
To run only some configs:
```bash
make check-arbitrary-base CONFIGS=CitusSingleNodeClusterConfig,CitusSmallSharedPoolSizeConfig
```
To run only some test files with some config:
```bash
make check-arbitrary-base CONFIGS=CitusSingleNodeClusterConfig EXTRA_TESTS=dropped_columns_1
```
To get a deterministic run, you can provide the random seed:
```bash
make check-arbitrary-configs parallel=4 seed=12312
```
The `seed` will be in the output of the run.
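
Why a seed makes a run repeatable can be shown with Python's `random` module. The helper and config names below are hypothetical, and what exactly the test runner randomizes is not spelled out here, so treat this as a sketch of the principle only.

```python
import random


def pick_configs(configs, count, seed):
    """Pick `count` configs reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(configs, count)


configs = ["ConfigA", "ConfigB", "ConfigC", "ConfigD"]
first_run = pick_configs(configs, 2, seed=12312)
second_run = pick_configs(configs, 2, seed=12312)
```

Both calls return the same selection, which is why rerunning with the seed printed by a failed run reproduces its random choices.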
In our regular regression tests, we can see all the details about either planning or execution, but this means
we need to run the same query under different configs/cluster setups again and again, which is not really maintainable.
When we don't care about the internals of how planning/execution is done but only about correctness, especially with different configs,
this infrastructure can be used.
With `check-arbitrary-configs` target, the following happens:
- A bunch of configs are loaded, which are defined in `config.py`. These configs have different settings such as different shard counts, Citus settings, Postgres settings, worker counts, or metadata.
- For each config, a separate data directory is created for tests in `tmp_citus_test` with the config's name.
- For each config, `create_schedule` is run on the coordinator to set up the necessary tables.
- For each config, `sql_schedule` is run. `sql_schedule` is run on the coordinator if it is a non-mx cluster. If it is mx, it is run on either the coordinator or a random worker.
- Test results are checked against the expected output.
When test results don't match, you can see the regression diffs in a config's datadir, such as `tmp_citus_tests/dataCitusSingleNodeClusterConfig`.
We also have a PostgresConfig which runs the whole test suite with Postgres.
By default, configs use a regular user, but we have a config to run as a superuser as well.
So the infrastructure tests:
- Postgres vs Citus
- Mx vs Non-Mx
- Superuser vs regular user
- Arbitrary Citus configs
When you want to add a new test, you can add the create statements to `create_schedule` and add the sql queries to `sql_schedule`.
If you are adding Citus UDFs that should be a NO-OP for Postgres, make sure to override the UDFs in `postgres.sql`.
You can add your new config to `config.py`. Make sure to extend either `CitusDefaultClusterConfig` or `CitusMXBaseClusterConfig`.
On the CI, upon a failure, all logfiles will be uploaded as artifacts, so you can check the artifacts tab.
All the regressions will be shown as part of the job on CI.
Locally, you can check the regression diffs in the configs' datadirs, such as `tmp_citus_tests/dataCitusSingleNodeClusterConfig`.