Commit Graph

1 Commits (3cc56b32770f4d6e8a3d1cef4cd06f7ad8861dd8)

Author SHA1 Message Date
Yuanhao Luo 4923a85aba Bulkload copy for citus
Profiling with gprof showed that the master node is a CPU bottleneck
and that NextCopyFrom() accounts for most of the time spent there.
To improve ingestion performance, I moved this time-consuming
function to the worker nodes. Benchmarks show that this works:
ingestion is nearly #worker times as fast as before.

The bulkload feature works as follows:
1. A bulkload copy command is issued on any node (master or worker)
   at host:port, for example "copy tb1 from 'tb1.csv' with(format csv,
   method 'bulkload');".
2. On host:port, the command is rewritten to "copy tb1 from program
   'bload' with(format csv, bulkload_host host, bulkload_port port,
   method 'bulkload')", and the rewritten command is sent to each
   worker asynchronously. The same node also creates a zeromq server
   that reads records from the file 'tb1.csv' and delivers them to
   the zeromq clients (the program 'bload' on each worker node).
3. Each worker node simply executes the copy command it received in
   step 2. Its input records come from the zeromq client bload,
   which pulls them from the zeromq server.
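
The steps above can be sketched roughly as follows. This is only an
illustration in Python, not the code in this commit: the function
names rebuild_copy_command and deliver_records are hypothetical, and
a standard-library queue with threads stands in for the zeromq
server and the bload clients. Only the rewritten command text is
taken from step 2.

```python
import queue
import threading


def rebuild_copy_command(table, fmt, host, port):
    """Build the worker-side copy text described in step 2 (hypothetical helper)."""
    return (
        f"copy {table} from program 'bload' "
        f"with(format {fmt}, bulkload_host {host}, "
        f"bulkload_port {port}, method 'bulkload')"
    )


def deliver_records(records, worker_count):
    """Let worker_count consumers pull records from one shared queue.

    The queue is a stand-in for the zeromq server; each thread is a
    stand-in for a bload client on a worker node (assumption).
    """
    pending = queue.Queue()
    for record in records:
        pending.put(record)
    for _ in range(worker_count):
        pending.put(None)  # one stop marker per consumer

    received = [[] for _ in range(worker_count)]

    def pull(worker_index):
        # Each consumer pulls until it sees its stop marker.
        while True:
            record = pending.get()
            if record is None:
                break
            received[worker_index].append(record)

    threads = [
        threading.Thread(target=pull, args=(i,)) for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

Because the workers pull records instead of being handed fixed
slices, a slow worker simply takes fewer records. How the records
are actually split between workers in this commit is not shown here.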

To enable this feature, zeromq must be installed. After compiling
and installing the citus extension, add the copy option
"method 'bulkload'" to use bulkload ingestion.

For now, bulkload copy supports copy from file or program with
(format csv or text) for append and hash distributed tables.
Note: copy from stdin supports only format csv and text; format
binary is not supported.
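
The supported combinations can be written down as a small check.
This is a sketch in Python for illustration only: the function name
bulkload_supported and the three sets are hypothetical, and it
assumes that copy from stdin is limited to the same table types as
copy from file or program, which the note above does not state
explicitly.

```python
# Values taken from the support notes above.
SUPPORTED_SOURCES = {"file", "program", "stdin"}
SUPPORTED_FORMATS = {"csv", "text"}
SUPPORTED_DISTRIBUTIONS = {"append", "hash"}


def bulkload_supported(source, fmt, distribution):
    """Return True if the combination is listed as supported (hypothetical helper)."""
    return (
        source in SUPPORTED_SOURCES
        and fmt in SUPPORTED_FORMATS
        and distribution in SUPPORTED_DISTRIBUTIONS
    )
```

For example, bulkload_supported("file", "csv", "hash") is accepted,
while any combination with format binary is rejected.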

TODO: better support for transactions and error handling.
2017-02-16 20:53:58 +08:00