Phase 1 seeks to implement minimal infrastructure, so does not include:
- dynamic generation of support aggregates to handle multiple arguments
- configuration methods to direct aggregation strategy,
or mark an aggregate's serialize/deserialize as safe to operate across nodes
Aggregates can be distributed when:
- they have a single argument
- they have a combinefunc
- their transition type is not a pseudotype
Postgres doesn't require you to add all columns that are in the target list to
the GROUP BY when you group by a unique column (or columns). It even actively
removes these group by clauses when you do.
This is normally fine, but for repartition joins it is not. The reason for this
is that the temporary tables don't have these primary key columns. So when the
worker executes the query it will complain that it is missing columns in the
group by.
This PR fixes that by adding an ANY_VALUE aggregate around each variable in
the target list that does is not contained in the group by or in an aggregate.
This is done only for repartition joins.
The ANY_VALUE aggregate chooses the value from an undefined row in the
group.
This is an improvement over #2512.
This adds the boolean shouldhaveshards column to pg_dist_node. When it's false, create_distributed_table for new collocation groups will not create shards on that node. Reference tables will still be created on nodes where it is false.
In this PR the default `threshold` of `rebalance_table_shards` was set to 0: https://github.com/citusdata/shard_rebalancer/pull/73
However, the default for get_rebalance_table_shards_plan was not updated. This
can cause the confusing situation where the actual steps run by
`rebalance_table_shards` are not the same as the ones returned by
`get_rebalance_table_shards_plan`.
With this commit, we're changing the API for create_distributed_function()
such that users can provide the distribution argument and the colocation
information.
This PR simply adds the columns to pg_dist_object and
implements the necessary metadata changes to keep track of
distribution argument of the functions/procedures.
This PR aims to add the minimal set of changes required to start
distributing functions. You can use create_distributed_function(regproc)
UDF to distribute a function.
SELECT create_distributed_function('add(int,int)');
The function definition should include the param types to properly
identify the correct function that we wish to distribute
@thanodnl told me it was a bit of a problem that it's impossible to see
the history of a UDF in git. The only way to do so is by reading all the
sql migration files from new to old. Another problem is that it's also
hard to review the changed UDF during code review, because to find out
what changed you have to do the same. I thought of a IMHO better (but
not perfect) way to handle this.
We keep the definition of a UDF in sql/udfs/{name_of_udf}/latest.sql.
That file we change whenever we need to make a change to the the UDF. On
top of that you also make a snapshot of the file in
sql/udfs/{name_of_udf}/{migration-version}.sql (e.g. 9.0-1.sql) by
copying the contents. This way you can easily view what the actual
changes were by looking at the latest.sql file.
There's still the question on how to use these files then. Sadly
postgres doesn't allow inclusion of other sql files in the migration sql
file (it does in psql using \i). So instead I used the C preprocessor+
make to compile a sql/xxx.sql to a build/sql/xxx.sql file. This final
build/sql/xxx.sql file has every occurence of #include "somefile.sql" in
sql/xxx.sql replaced by the contents of somefile.sql.