Profiling with gprof showed that the master node is a CPU bottleneck
and that most of the time is spent in NextCopyFrom(). To improve
ingestion performance, I moved this time-consuming function to the
worker nodes. The benchmark results show that this works: ingestion
is nearly #worker times as fast as before.
This bulkload feature works as follows:
1. We issue a bulkload copy command on any node (master or worker), such
as "copy tb1 from 'tb1.csv' with(format csv, method 'bulkload');", on
node host:port.
2. On host:port, the copy command is rewritten to "copy tb1 from
program 'bload' with(format csv, bulkload_host host, bulkload_port
port, method 'bulkload')", and the rewritten command is assigned to
each worker asynchronously. In addition, we create a zeromq server,
which reads records from the file 'tb1.csv' and delivers them to the
zeromq clients (the program 'bload' on each worker node).
3. Each worker node simply executes the copy command assigned in
step 2. The records for that command come from the zeromq client
bload, which pulls them from the zeromq server.
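The rewrite in step 2 can be sketched in Python as below. This is only an
illustration of the string transformation described above, not the C code
in the extension; the function name `rebuild_copy_command` and its
parameters are hypothetical, and the exact quoting of the host and port
values is an assumption.

```python
# Formats the commit message lists as supported; binary is not.
SUPPORTED_FORMATS = ("csv", "text")


def rebuild_copy_command(table, fmt, host, port):
    """Build the per-worker COPY command from step 2 (hypothetical helper).

    host and port identify the node where the original bulkload copy was
    issued, i.e. where the zeromq server feeding 'bload' is assumed to run.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"bulkload copy does not support format {fmt!r}")
    return (
        f"copy {table} from program 'bload' "
        f"with(format {fmt}, bulkload_host {host}, "
        f"bulkload_port {port}, method 'bulkload')"
    )
```

Each worker would then run the returned command, with 'bload' pulling
records from the server at host:port as in step 3.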
To enable this feature, zeromq must be installed. After compiling
and installing the citus extension, just add the copy option
"method 'bulkload'" to use bulkload ingestion.
For now, bulkload copy supports copy from file or program with format
csv or text, for append- and hash-distributed tables.
Note: copy from stdin supports only format csv and text; format binary
is not supported.
TODO: better support for transactions and error handling.
I came across several places we weren't as flexible or resilient as we
should have been in our build logic. They include:
* Not using `DESTDIR` in the install-header destination
* Allowing callers to specify `VPATH` or `srcdir` (which breaks)
* Using an absolute path for SCRIPTS (9.5 prepends srcdir)
* Including libpq-int in a confusing way (extracted this function)
* Having server includes come first during csql build (client
  includes must)
In particular, I hit all of these attempting to build with pg_buildext
in Debian. It passes in an explicit VPATH, as well as srcdir (breaking
all recursive make invocations), and also uses DESTDIR during install.
In addition, a PGDG-enabled Debian box will have the latest libpq-dev
headers (e.g. 9.5) even when building against an older server version
(e.g. 9.4). This leads to problems when including e.g. `c.h`, which
is ambiguous. While compiling more client-side code (csql), we need to
ensure the newer libpq headers are included _first_, so I fixed that.
We need to change to the project's top srcdir, as citus_indent expects
to be able to find styled files using git ls-files, which would
otherwise return no results in VPATH builds.
All citusdb references in
- extension, binary names
- file headers
- all configuration name prefixes
- error/warning messages
- some function names
- regression tests
are changed to citus.