Commit Graph

69 Commits (b82f886ad3f8433e45f065a4e92f00f32aba6058)

Author SHA1 Message Date
Marco Slot 7523753a73 Clear metadata OID cache prior to deadlock detection 2017-08-18 11:20:24 +02:00
Brian Cloutier 9d93fb5551 Create citus.use_secondary_nodes GUC
This GUC has two settings, 'always' and 'never'. When it's set to
'never' all behavior stays exactly as it was prior to this commit. When
it's set to 'always' only SELECT queries are allowed to run, and only
secondary nodes are used when processing those queries.

Add some helper functions:
- WorkerNodeIsSecondary(), checks the noderole of the worker node
- WorkerNodeIsReadable(), returns whether we're currently allowed to
  read from this node
- ActiveReadableNodeList(), some functions (namely, the ones on the
  SELECT path) don't require working with Primary Nodes. They should call
  this function instead of ActivePrimaryNodeList(), because the latter
  will error out in contexts where we're not allowed to write to nodes.
- ActiveReadableNodeCount(), like the above, replaces
  ActivePrimaryNodeCount().
- EnsureModificationsCanRun(), error out if we're not currently allowed
  to run queries which modify data. (Either we're in read-only mode or
  use_secondary_nodes is set)

Some parts of the code were switched over to use readable nodes instead
of primary nodes:
- Deadlock detection
- DistributedTableSize,
- the router, real-time, and task tracker executors
- ShardPlacement resolution
2017-08-10 17:37:17 +03:00
Brian Cloutier 3fc87a7a29 Metadata sync also syncs nodes in other clusters 2017-08-10 16:55:55 +03:00
Brian Cloutier 1961add6f9 Improve error message when there are no nodes for a placement 2017-08-10 12:38:51 +03:00
Brian Cloutier a3e9bef685 All users of WorkerNodeHash take an AccessShareLock
The metadata cache simulates a SELECT on pg_dist_node. Now the locks it
takes also simulate that SELECT.
2017-08-08 13:12:06 +03:00
Brian Cloutier 5618e69386 Add pg_dist_node.nodecluster 2017-08-08 11:18:31 +03:00
Brian Cloutier ec99f8f983 Add nodeRole column
- master_add_node enforces that there is only one primary per group
- there's also a trigger on pg_dist_node to prevent multiple primaries
  per group
- functions in metadata cache only return primary nodes
- Rename ActiveWorkerNodeList -> ActivePrimaryNodeList
- Rename WorkerGetLive{Node->Group}Count()
- Refactor WorkerGetRandomCandidateNode
- master_remove_node only complains about active shard placements if the
  node being removed is a primary.
- master_remove_node only deletes all reference table placements in the
  group if the node being removed is the primary.
- Rename {Node->NodeGroup}HasShardPlacements, this reflects the behavior it
  already had.
- Rename DeleteAllReferenceTablePlacementsFrom{Node->NodeGroup}. This also
  reflects the behavior it already had, but the new signature forces the
  caller to pass in a groupId
- Rename {WorkerGetLiveGroup->ActivePrimaryNode}Count
2017-07-24 11:57:46 +03:00
velioglu 6ea15fbb25 Make create_distributed_table transactional 2017-07-18 12:35:40 +03:00
Brian Cloutier 7ad95b53d2 Rename pg_dist_shard_placement -> pg_dist_placement
Comes with a few changes:

- Change the signature of some functions to accept groupid
  - InsertShardPlacementRow
  - DeleteShardPlacementRow
  - UpdateShardPlacementState

- NodeHasActiveShardPlacements returns true if the group the node is a
  part of has any active shard placements

- TupleToShardPlacement now returns ShardPlacements which have NULL
  nodeName and nodePort.

- Populate (nodeName, nodePort) when creating ShardPlacements
- Disallow removing a node if it contains any shard placements

- DeleteAllReferenceTablePlacementsFromNode matches based on group. This
  doesn't change behavior for now (while there is only one node per
  group), but means in the future callers should be careful about
  calling it on a secondary node, it'll delete placements on the primary.

- Create concept of a GroupShardPlacement, which represents an actual
  tuple in pg_dist_placement and is distinct from a ShardPlacement,
  which has been resolved to a specific node. In the future
  ShardPlacement should be renamed to NodeShardPlacement.

- Create some triggers which allow existing code to continue to insert
  into and update pg_dist_shard_placement as if it still existed.
2017-07-12 14:17:31 +02:00
Brian Cloutier 0b64bb1092 Fix typo in comment in CachedRelationLookup 2017-07-12 14:16:24 +02:00
Marco Slot 01c9b1f921 Use GetPlacementListConnection for router SELECTs 2017-07-12 11:26:22 +02:00
Andres Freund 3483bb99eb Minimal infrastructure for per-backend citus initialization. 2017-06-23 11:20:10 -07:00
Andres Freund 1691f780fd Force cache invalidation machinery to be initialized earlier.
Previously it was not guaranteed that invalidations were registered
after creating the extension, only if the extension was used
afterwards.
2017-06-23 11:20:10 -07:00
Andres Freund f645dca593 Centralized metadata_cache cache variables into one struct, to avoid missing resets.
E.g. extensionOwner was already missed.
2017-06-23 11:20:10 -07:00
Burak Yucesoy 8c1bbf1417 Register cache invalidation callback before version checks
With this commit we start to register InvalidateDistRelationCacheCallback
function as cache invalidation callback function before version checks
because during version checks we use cache to look up relation ids of some
relations like pg_dist_relation or pg_dist_partition_logical_relid_index
and we want to know about cache invalidation before accessing them.
2017-05-24 17:39:25 +03:00
Burak Yucesoy c7bfa06cb9 Fix incorrect call to CheckInstalledVersion
During version update, we indirectly calld CheckInstalledVersion via
ChackCitusVersions. This obviously fails because during version update it is
expected to have version mismatch between installed version and binary version.
Thus, we remove that ChackCitusVersions. We now only call ChackAvailableVersion.
2017-05-24 17:39:25 +03:00
Burak Yucesoy 9fb15c439c Add version checks to necessary UDFs 2017-05-22 09:53:29 +03:00
Burak Yucesoy eea8c51e1f Only error out on distributed queries when there is version mismatch
Before this commit, we were erroring out at almost all queries if there is a
version mismatch. With this commit, we started to error out only requested
operation touches distributed tables.

Normally we would need to use distributed cache to understand whether a table
is distributed or not. However, it is not safe to read our metadata tables when
there is a version mismatch, thus it is not safe to create distributed cache.
Therefore for this specific occasion, we directly read from pg_dist_partition
table. However; reading from catalog is costly and we should not use this
method in other places as much as possible.
2017-05-22 09:53:29 +03:00
Önder Kalacı e0257aecd9 Accept invalidation messages before accessing the metadata cache (#1406)
* Accept invalidation messages before accessing the metadata cache

This commit is crucial to prevent stale metadata reads from the
cache. Without this commit, some of the operations may use stale
metadata which could end up with various bugs such as crashes,
inconsistent/lost data etc.

As an example, consider that a COPY operation is blocked on shard
metadata lock. Another concurrent session updates the metadata and
invalidates the cache. However, since Citus doesn't accept invalidations,
COPY continues with the stale metadata once it acquires the lock.

With this commit, we make sure that invalidation messages are accepted
just before accessing the metadata cache and preventing any operation to
use stale metadata.

* Add isolation tests for placement changes and conccurrent operations

   - add node with reference table vs COPY/insert/update/DDL
   - repair shard vs COPY/insert/update/DDL
   - repair shard vs repair shard
2017-05-12 12:32:35 +03:00
Brian Cloutier 22e7aa9a4f Fix crash in isolation tests
- There was a crash when the table a shardid belonged to changed during
  a session. Instead of crashing (a failed assert) we now throw an error
- Update the isolation test which was crashing to no longer exercise
  that code path
- Add a regression test to check that the error is thrown
2017-04-29 04:25:26 +03:00
Andres Freund 6bd2e3ed30 Add DistTableCacheEntry->hasOverlappingShardInterval.
This determines whether it's possible to perform binary search on
sortedShardIntervalArray or not.  If e.g. two shards have overlapping
ranges, that'd be prohibitive.

That'll be useful in later commit introducing faster shard pruning.
2017-04-28 14:40:38 -07:00
Andres Freund 105483ec56 Add DistTableCacheEntry->shardValueCompareFunction.
That's useful when comparing values a hash-partitioned table is
filtered by.  The existing shardIntervalCompareFunction is about
comparing hashed values, not unhashed ones.

The added btree opclass function is so we can get a comparator
back. This should be changed much more widely, but is not necessary so
far.
2017-04-28 14:40:38 -07:00
Andres Freund 52571c00ad Build DistTableCacheEntry->shardIntervalCompareFunction even for 0 shards.
Previously we, unnecessarily, used a the first shard's type
information to to look up the comparison function.  But that
information is already available, so use it.  That's helpful because
we sometimes want to access the comparator function even if there's no
shards.
2017-04-28 14:40:38 -07:00
velioglu 2327b63291 Change native hash function with worker_hash 2017-04-19 22:16:55 +03:00
Burak Yucesoy e9095e62ec Decouple reference table replication
With this change we add an option to add a node without replicating all reference
tables to that node. If a node is added with this option, we mark the node as
inactive and no queries will sent to that node.

We also added two new UDFs;
 - master_activate_node(host, port):
    - marks node as active and replicates all reference tables to that node
 - master_add_inactive_node(host, port):
    - only adds node to pg_dist_node
2017-04-17 13:33:31 +03:00
Jason Petersen 033fda9183
Clean up remaining error messages
Added details and hints, based off of similar PostgreSQL scenarios.
2017-04-04 16:11:59 -06:00
Burak Yucesoy a09614553f Add enable_version_checks GUC and address feedback 2017-04-04 19:11:13 +03:00
Burak Yucesoy 087d8427e3
Error out if binary citus version does not match installed extension
With this change, we start to error out if loaded citus binaries does not match
the available major version or installed citus extension version. In this case
we force user to restart the server or run ALTER EXTENSION depending on the
situation
2017-04-03 17:36:13 -06:00
Marco Slot ba940a1de9 Use coordinator instead of schema node in terminology 2017-01-25 11:07:23 +01:00
Burak Yucesoy 484cb12cd0 Add LoadShardPlacement UDF
This UDF returns a shard placement from cache given shard id and placement id. At the
moment it iterates over all shard placements of given shard by ShardPlacementList and
searches given placement id in that list, which is not a good solution performance-wise.
However, currently, this function will be used only when there is a failed transaction.
If a need arises we can optimize this function in the future.
2017-01-23 21:04:57 +03:00
Andres Freund 6972186652 Add ShardPlacement fields required for colocated placement connection mapping. 2017-01-16 13:42:54 -08:00
Andres Freund b813b39241 Cache ShardPlacements in metadata cache.
So far we've reloaded them frequently. Besides avoiding that cost -
noticeable for some workloads with large shard counts - it makes it
easier to add information to ShardPlacements that help us make
placement_connection.c colocation aware.
2017-01-10 18:14:18 -08:00
Andres Freund 8cb47195ba Make LoadShardInterval() backed by the metadata cache.
Doing so requires adding a mapping from shardId to the cache
entries. For that metadata_cache.c now maintains an additional
hashtable. That hashtable only references shard intervals in the
dist table cache.
2017-01-10 17:00:19 -08:00
Andres Freund f6e8647337 Split DistTableCacheEntry() into separate functions.
Previously the function was getting too large. Thus this splits the
function into separate parts for looking up the cache entry and
building the cache contents.
2017-01-10 15:23:18 -08:00
Burak Yucesoy 9c9f479e4b Replicate reference tables when new node is added
With this change, we start to replicate all reference tables to the new node when new node
is added to the cluster with master_add_node command. We also update replication factor
of reference table's colocation group.
2017-01-05 14:30:41 +03:00
Eren Basak 71d73ec5ff Propagate DDL commands to metadata workers for MX tables 2016-12-23 15:43:32 +03:00
Onder Kalaci 9f0bd4cb36 Reference Table Support - Phase 1
With this commit, we implemented some basic features of reference tables.

To start with, a reference table is
  * a distributed table whithout a distribution column defined on it
  * the distributed table is single sharded
  * and the shard is replicated to all nodes

Reference tables follows the same code-path with a single sharded
tables. Thus, broadcast JOINs are applicable to reference tables.
But, since the table is replicated to all nodes, table fetching is
not required any more.

Reference tables support the uniqueness constraints for any column.

Reference tables can be used in INSERT INTO .. SELECT queries with
the following rules:
  * If a reference table is in the SELECT part of the query, it is
    safe join with another reference table and/or hash partitioned
    tables.
  * If a reference table is in the INSERT part of the query, all
    other participating tables should be reference tables.

Reference tables follow the regular co-location structure. Since
all reference tables are single sharded and replicated to all nodes,
they are always co-located with each other.

Queries involving only reference tables always follows router planner
and executor.

Reference tables can have composite typed columns and there is no need
to create/define the necessary support functions.

All modification queries, master_* UDFs, EXPLAIN, DDLs, TRUNCATE,
sequences, transactions, COPY, schema support works on reference
tables as expected. Plus, all the pre-requisites associated with
distribution columns are dismissed.
2016-12-20 14:09:35 +02:00
Metin Doslu 86cca54857 Add colocate_with option to create_distributed_table()
With this commit, we support three versions of colocate_with: i.default, ii.none
and iii. a specific table name.
2016-12-16 14:53:35 +02:00
Murat Tuncer b5c1ecb684 Fix failures during pg_upgrade
- fix error in CitusHasBeenLoaded()
- allow creation of pg_catalog tables during upgrade
2016-11-11 17:22:45 -08:00
Andres Freund fcd150c7c8 Invalidate relcache after pg_dist_shard_placement changes.
This forces prepared statements to be re-planned after changes of the
placement metadata. There's some locking issues remaining, but that's a
a separate task.

Also add regression tests verifying that invalidations take effect on
prepared statements.
2016-10-26 03:36:35 -07:00
Brian Cloutier 2e96f6ab27 Fix crash when upgrading to Citus 6
Between restart (running the new code) and ALTER EXTENSION citus
UPGRADE there was an inconsistency where we assumed that
pg_dist_partition had the repmodel column set. Now we give it a default
value if the column doesn't exist yet.
2016-10-24 15:18:29 +03:00
Metin Doslu 161093908e Convert colocationid to uint32 2016-10-20 10:59:31 +03:00
Metin Doslu 40bdafa8d1 Add create_distributed_table()
create_distributed_table() creates a hash distributed table with default values
of shard count and shard replication factor.
2016-10-20 10:58:25 +03:00
Andres Freund ac14b2edbc
Support PostgreSQL 9.6
Adds support for PostgreSQL 9.6 by copying in the requisite ruleutils
file and refactoring the out/readfuncs code to flexibly support the
old-style copy/pasted out/readfuncs (prior to 9.6) or use extensible
node APIs (in 9.6 and higher).

Most version-specific code within this change is only needed to set new
fields in the AggRef nodes we build for aggregations. Version-specific
test output files were added in certain cases, though in most they were
not necessary. Each such file begins by e.g. printing the major version
in order to clarify its purpose.

The comment atop citus_nodes.h details how to add support for new nodes
for when that becomes necessary.
2016-10-18 16:23:55 -06:00
Eren Basak cee7b54e7c Add worker transaction and transaction recovery infrastructure 2016-10-18 14:18:14 +03:00
Eren Basak f3ede37c9f Add hasmetadata column to pg_dist_node 2016-10-17 11:52:18 +03:00
Eren Basak c7bf2021fa Add metadata infrastructure for pg_dist_local_group table 2016-10-17 11:52:18 +03:00
Eren Basak ed3af403fd Add Metadata Snapshot Infrastructure
This change adds the required infrastructure about metadata snapshot from MX
codebase into Citus, mainly metadata_sync.c file and master_metadata_snapshot UDF.
2016-10-13 10:40:14 +03:00
Andres Freund 982ad66753 Introduce placement IDs.
So far placements were assigned an Oid, but that was just used to track
insertion order. It also did so incompletely, as it was not preserved
across changes of the shard state. The behaviour around oid wraparound
was also not entirely as intended.

The newly introduced, explicitly assigned, IDs are preserved across
shard-state changes.

The prime goal of this change is not to improve ordering of task
assignment policies, but to make it easier to reference shards.  The
newly introduced UpdateShardPlacementState() makes use of that, and so
will the in-progress connection and transaction management changes.
2016-10-07 11:59:20 -07:00
Brian Cloutier 9d6699b07c Switch from pg_worker_list.conf file to pg_dist_node metadata table.
Related to #786

This change adds the `pg_dist_node` table that contains the information
about the workers in the cluster, replacing the previously used
`pg_worker_list.conf` file (or the one specified with `citus.worker_list_file`).

Upon update, `pg_worker_list.conf` file is read and `pg_dist_node` table is
populated with the file's content. After that, `pg_worker_list.conf` file
is renamed to `pg_worker_list.conf.obsolete`

For adding and removing nodes, the change also includes two new UDFs:
`master_add_node` and `master_remove_node`, which require superuser
permissions.

'citus.worker_list_file' guc is kept for update purposes but not used after the
update is finished.
2016-10-05 13:01:35 +03:00