Viasock

Automagically Serverize Your Pipelines

Kilian Evang

PyGrunn 2017

The Unix Philosophy

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

(Doug McIlroy)

A Pipeline


$ cat data/01.txt | ./bin/tokenize -m models/tokenizer.model | ./bin/parse -m models/parser.model > out/01.parse
					

A Makefile


out/%.tok : data/%.txt bin/tokenize models/tokenizer.model
	cat $< | ./bin/tokenize -m models/tokenizer.model > $@

out/%.parse : out/%.tok bin/parse models/parser.model
	cat $< | ./bin/parse -m models/parser.model > $@
					

Runtime

Load model   Process
10s          3h

Runtime

Load model   Process
10s          0.1s
10s          0.1s
10s          0.1s
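
To make the cost of reloading concrete, here is a back-of-the-envelope calculation in Python. The 10 s load time and the 0.1 s processing time are the figures above; the count of 1000 files is a hypothetical number chosen for illustration, and the estimate ignores any overhead from sockets or client start-up.

```python
LOAD_MODEL_S = 10.0   # model load time per process start (figure from the slide)
PROCESS_S = 0.1       # processing time per small file (figure from the slide)
N_FILES = 1000        # hypothetical number of small files


def total_runtime(n_files, load_s, process_s, reload_every_time):
    """Total seconds needed to process n_files one after another."""
    loads = n_files if reload_every_time else 1
    return loads * load_s + n_files * process_s


# One fresh process per file: the model is loaded N_FILES times.
naive = total_runtime(N_FILES, LOAD_MODEL_S, PROCESS_S, reload_every_time=True)
# One long-lived server: the model is loaded once.
served = total_runtime(N_FILES, LOAD_MODEL_S, PROCESS_S, reload_every_time=False)

print(f"one process per file:  {naive:.0f} s")
print(f"one long-lived server: {served:.0f} s")
```

Under these assumptions almost all of the time in the first case is spent loading the model, not processing data.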

Our Old Makefile


out/%.tok : data/%.txt bin/tokenize models/tokenizer.model
	cat $< | ./bin/tokenize -m models/tokenizer.model > $@

out/%.parse : out/%.tok bin/parse models/parser.model
	cat $< | ./bin/parse -m models/parser.model > $@
					

Our New Makefile (First Attempt)


# Assuming viasock server wrapping tokenize is listening on tokenizer.socket
out/%.tok : data/%.txt bin/tokenize models/tokenizer.model
	cat $< | viasock client tokenizer.socket > $@

# Assuming viasock server wrapping parse is listening on parser.socket
out/%.parse : out/%.tok bin/parse models/parser.model
	cat $< | viasock client parser.socket > $@
					

viasock run: the Magic Command


$ cat input1.txt | viasock run mytool -m mymodel > output1.txt
$ cat input2.txt | viasock run myothertool > output2.txt
$ cat input3.txt | viasock run mytool -m myothermodel > output3.txt
					

Our New Makefile


out/%.tok : data/%.txt bin/tokenize models/tokenizer.model
	cat $< | viasock run ./bin/tokenize -m models/tokenizer.model > $@

out/%.parse : out/%.tok bin/parse models/parser.model
	cat $< | viasock run ./bin/parse -m models/parser.model > $@
					

How viasock run Finds the Right Server

Example command:

viasock run --server-timeout 3600 ./bin/parse -m models/parser.model

$SERVERID = hash of

  • Viasock options (e.g., --server-timeout 3600)
  • Program name (e.g., ./bin/parse)
  • Program mtime
  • Program arguments (e.g., -m models/parser.model)
  • Mtimes of program arguments that are files

Uses the server listening on ./.viasock/sockets/$SERVERID, starting it first if needed
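
The idea can be sketched in a few lines of Python. This is an illustration of the scheme described above, not Viasock's actual code: the function names, the choice of SHA-1, and the way the parts are joined are all assumptions.

```python
import hashlib
import os


def file_mtime(path):
    """Return the mtime of path as a string, or '' if path does not exist."""
    try:
        return repr(os.stat(path).st_mtime)
    except OSError:
        return ""


def server_id(viasock_options, program, program_args):
    """Hash everything that should force a fresh server when it changes."""
    parts = list(viasock_options)
    parts.append(program)
    parts.append(file_mtime(program))
    for arg in program_args:
        parts.append(arg)
        parts.append(file_mtime(arg))  # stays '' for non-file arguments like "-m"
    h = hashlib.sha1()  # hash function chosen for illustration only
    for part in parts:
        # Length-prefix each part so that ["ab", "c"] and ["a", "bc"] differ.
        data = part.encode("utf-8")
        h.update(str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


sid = server_id(
    ["--server-timeout", "3600"],
    "./bin/parse",
    ["-m", "models/parser.model"],
)
socket_path = os.path.join(".viasock", "sockets", sid)
print(socket_path)
```

Because the mtimes are part of the hash, rebuilding the program or retraining the model yields a different ID, so the next call no longer matches the old server's socket.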

Summary

Problem

  • Frequently running a program with high startup overhead on small data

Problems with Conventional Client/Server Setups

  • Poor separation of concerns
    • violates Unix philosophy
    • no reuse of client/server code
  • Devops!
    • server must run
    • server must restart when mytool or model etc. changes
  • Less flexible: run mytool with different options?

The Viasock Solution

  • Leave mytool as is
  • Viasock server process wraps it
  • viasock run provides transparent access
    • starts server if needed
    • different servers for different options
    • new server when mytool or file arguments change
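
The core of the wrapping idea, starting the tool once and reusing the same process for many inputs, can be sketched in Python as follows. This is a simplified illustration and not Viasock's implementation: it leaves out the socket layer, the class name WrappedTool is made up, the stand-in tool is a tiny Python one-liner, and it assumes one line in and one line out per record.

```python
import subprocess
import sys

# Hypothetical stand-in for mytool: answers every input line with one output
# line and flushes after each one. A real tool would load its model first.
TOOL_CMD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    print(line.rstrip('\\n').upper(), flush=True)\n",
]


class WrappedTool:
    """Start the tool once and reuse the same process for many records."""

    def __init__(self, cmd):
        self.proc = subprocess.Popen(
            cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on our side
        )

    def process(self, record):
        """Send one record and wait for the one-line answer."""
        self.proc.stdin.write(record + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Signal end of input and wait for the tool to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


tool = WrappedTool(TOOL_CMD)
results = [tool.process(r) for r in ["first record", "second record"]]
tool.close()
print(results)
```

The start-up cost is paid once in the constructor; every later call to process() only pays for the record itself.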

Viasock Limitations

  • No load balancing yet (in case mytool throughput is too low)
  • Requires mytool to process one record at a time and flush its output buffer after each record
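
A tool that meets the second requirement might look like the following Python sketch. It assumes that one record is one line of text, which is only an assumption made for this example, and process_record is a hypothetical placeholder for the real work.

```python
import io


def process_record(record):
    """Stand-in for the real work, e.g. tokenizing or parsing one record."""
    return record.upper()


def serve_records(infile, outfile):
    """Read one record per line and answer each one immediately."""
    # readline() returns '' only at end of input, so this loop handles
    # records as they arrive instead of waiting for the whole input.
    for line in iter(infile.readline, ""):
        outfile.write(process_record(line.rstrip("\n")) + "\n")
        # Flush after every record so that a wrapper waiting on this output
        # sees the result right away, not only when a buffer fills up.
        outfile.flush()


# Small in-memory demonstration.
demo_out = io.StringIO()
serve_records(io.StringIO("first record\nsecond record\n"), demo_out)
print(demo_out.getvalue(), end="")
```

To use this as a real filter, one would call serve_records(sys.stdin, sys.stdout) after importing sys. Without the flush, output written to a pipe is typically block-buffered, and a caller waiting for the answer to one record could wait indefinitely.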

ilker ender, CC-BY-NC

https://github.com/texttheater/viasock