TaskVine Log File Formats

Performance Log Format

The performance log is a sequence of records, one written at each significant change in an integer metric such as the number of tasks submitted or running. The first row always contains the names of the columns, which correspond to values that can be obtained from vine_stats. The first column is a Unix timestamp with microsecond resolution.

Here is an example of the first few rows and columns:

# timestamp workers_connected workers_init workers_idle workers_busy workers_...
1602165237833411 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 5 0 0 0 0 0 1602165237827668 ...
1602165335687547 1 0 0 1 1 1 0 0 0 0 0 0 4 1 0 0 5 0 0 0 0 0 1602165237827668 ...
1602165335689677 1 0 0 1 1 1 0 0 0 0 0 0 4 1 1 1 5 1 0 0 0 0 1602165237827668 ...
...
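Because the header row names every column, the log can be loaded into dictionaries keyed by column name. The following is a minimal sketch rather than an official tool: the function name parse_performance_log is hypothetical, and the sample is shortened to the first five columns of the example above. It assumes whitespace-separated values and a header line that starts with "#".

```python
def _number(text):
    """Convert a field to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_performance_log(lines):
    """Parse performance-log lines into a list of dicts keyed by column name."""
    columns = None
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header row names the columns; keep only the first one seen.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("data row found before the header row")
        # zip() stops at the shorter sequence, so a short row is tolerated.
        records.append({name: _number(v) for name, v in zip(columns, line.split())})
    return records


# Sample shortened to the first five columns (assumption for illustration).
sample = [
    "# timestamp workers_connected workers_init workers_idle workers_busy",
    "1602165237833411 0 0 0 0",
    "1602165335687547 1 0 0 1",
]
records = parse_performance_log(sample)
# Timestamps are in microseconds, so divide by 1e6 to get seconds.
elapsed_s = (records[-1]["timestamp"] - records[0]["timestamp"]) / 1e6
print(records[-1]["workers_connected"], elapsed_s)
```

To read a real log, pass an open file object instead of the sample list, since the function only iterates over lines.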

Transactions Log Format

The first few lines of the log document the possible log records:

# time manager_pid MANAGER manager_pid START|END time_from_origin
# time manager_pid WORKER worker_id CONNECTION host:port
# time manager_pid WORKER worker_id DISCONNECTION (UNKNOWN|IDLE_OUT|FAST_ABORT|FAILURE|STATUS_WORKER|EXPLICIT)
# time manager_pid WORKER worker_id RESOURCES {resources}
# time manager_pid WORKER worker_id CACHE_UPDATE filename size_in_mb wall_time_us start_time_us
# time manager_pid WORKER worker_id TRANSFER (INPUT|OUTPUT) filename size_in_mb wall_time_us start_time_us
# time manager_pid CATEGORY name MAX {resources_max_per_task}
# time manager_pid CATEGORY name MIN {resources_min_per_task_per_worker}
# time manager_pid CATEGORY name FIRST (FIXED|MAX|MIN_WASTE|MAX_THROUGHPUT) {resources_requested}
# time manager_pid TASK task_id WAITING category_name (FIRST_RESOURCES|MAX_RESOURCES) attempt_number {resources_requested}
# time manager_pid TASK task_id RUNNING worker_id (FIRST_RESOURCES|MAX_RESOURCES) {resources_allocated}
# time manager_pid TASK task_id WAITING_RETRIEVAL worker_id
# time manager_pid TASK task_id RETRIEVED (SUCCESS|UNKNOWN|INPUT_MISSING|OUTPUT_MISSING|STDOUT_MISSING|SIGNAL|RESOURCE_EXHAUSTION|MAX_RETRIES|MAX_END_TIME|MAX_WALL_TIME|FORSAKEN) {limits_exceeded} {resources_measured}
# time manager_pid TASK task_id DONE (SUCCESS|UNKNOWN|INPUT_MISSING|OUTPUT_MISSING|STDOUT_MISSING|SIGNAL|RESOURCE_EXHAUSTION|MAX_RETRIES|MAX_END_TIME|MAX_WALL_TIME|FORSAKEN) exit_code
# time manager_pid LIBRARY library_id (WAITING|SENT|STARTED|FAILURE) worker_id

Lowercase words indicate values, and uppercase words indicate constants. A bar (|) inside parentheses indicates a choice among the possible constants. Variables enclosed in braces {} indicate a JSON dictionary. Here is an example of the first few records of a transactions log:

1679929304405580 4107108 MANAGER 4107108 START 0
1679929315785718 4107108 TASK 1 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315789781 4107108 TASK 2 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315791349 4107108 TASK 3 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315792852 4107108 TASK 4 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315794343 4107108 TASK 5 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
...
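A record of this shape can be split into its fixed fields and its trailing JSON dictionaries. The sketch below is one possible approach, not part of TaskVine: parse_transaction is a hypothetical name, and it assumes that the first "{" on a line begins the JSON portion, which would not hold if a filename contained a brace.

```python
import json


def parse_transaction(line):
    """Split one transactions-log record into fixed fields and JSON dicts."""
    time_us, manager_pid, subject, rest = line.split(None, 3)
    # Assumption: everything from the first "{" onward is JSON dictionaries.
    brace = rest.find("{")
    head = rest if brace < 0 else rest[:brace]
    tail = "" if brace < 0 else rest[brace:]

    decoder = json.JSONDecoder()
    dicts = []
    pos = 0
    while pos < len(tail):
        if tail[pos].isspace():
            # raw_decode does not skip leading whitespace, so skip it here.
            pos += 1
            continue
        obj, pos = decoder.raw_decode(tail, pos)
        dicts.append(obj)

    return {
        "time_us": int(time_us),
        "manager_pid": int(manager_pid),
        "subject": subject,        # MANAGER, WORKER, CATEGORY, TASK or LIBRARY
        "fields": head.split(),    # remaining positional fields, as strings
        "dicts": dicts,            # zero or more decoded JSON dictionaries
    }


sample_line = (
    '1679929315785718 4107108 TASK 1 WAITING default FIRST_RESOURCES 1 '
    '{"cores":[1,"cores"]}'
)
record = parse_transaction(sample_line)
print(record["subject"], record["fields"], record["dicts"])
```

The positional fields are left as strings because their meaning depends on the record type listed in the header comments.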

With the transactions log, it is easy to track the lifetime of a task. For example, to print the lifetime of the task with id 1, we can run:

$ grep 'TASK \<1\>' my.tr.log
1599244364466668 16444 TASK 1 WAITING default FIRST_RESOURCES {"cores":[1,"cores"],"memory":[800,"MB"],"disk":[500,"MB"]}
1599244400311044 16444 TASK 1 RUNNING 10.32.79.143:48268  FIRST_RESOURCES {"cores":[4,"cores"],"memory":[4100,"MB"],...}
1599244539953798 16444 TASK 1 WAITING_RETRIEVAL 10.32.79.143:48268
1599244540075173 16444 TASK 1 RETRIEVED SUCCESS  0  {} {"cores":[1,"cores"],"wall_time":[123.137485,"s"],...}
1599244540083820 16444 TASK 1 DONE SUCCESS  0  {} {"cores":[1,"cores"],"wall_time":[123.137485,"s"],...}
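The same records can be used to measure how long a task stayed in each state, by subtracting consecutive timestamps. The following is a rough sketch under stated assumptions: task_state_durations is a hypothetical helper, and the sample lines keep only the leading fields of the output above (the JSON dictionaries are dropped because they are not needed here).

```python
def task_state_durations(lines, task_id):
    """Return (state, seconds) pairs for one task, one per state transition."""
    events = []
    for line in lines:
        parts = line.split()
        # Field layout: time manager_pid TASK task_id STATE ...
        if len(parts) >= 5 and parts[2] == "TASK" and parts[3] == str(task_id):
            events.append((int(parts[0]), parts[4]))
    events.sort()  # order by timestamp in case the input is not sorted
    # Each state lasts until the timestamp of the next record for the task.
    return [
        (state, (next_time - time) / 1e6)
        for (time, state), (next_time, _) in zip(events, events[1:])
    ]


# Leading fields copied from the grep output above; trailing fields omitted.
sample = [
    "1599244364466668 16444 TASK 1 WAITING default FIRST_RESOURCES",
    "1599244400311044 16444 TASK 1 RUNNING 10.32.79.143:48268 FIRST_RESOURCES",
    "1599244539953798 16444 TASK 1 WAITING_RETRIEVAL 10.32.79.143:48268",
    "1599244540075173 16444 TASK 1 RETRIEVED SUCCESS",
    "1599244540083820 16444 TASK 1 DONE SUCCESS",
]
durations = task_state_durations(sample, 1)
for state, seconds in durations:
    print(f"{state:18s} {seconds:.6f} s")
```

The final state (DONE here) has no duration because no later record follows it.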

The statistics available from vine_stats, which also name the columns of the performance log, are:

Field Description
Stats for the current state of workers
workers_connected Number of workers currently connected to the manager
workers_init Number of workers connected, but that have not sent their available resources report yet
workers_idle Number of workers that are not running a task
workers_busy Number of workers that are running at least one task
workers_able Number of workers on which the largest task can run
Cumulative stats for workers
workers_joined Total number of worker connections that were established to the manager
workers_removed Total number of worker connections that were released by the manager, idled-out, slow, or lost
workers_released Total number of worker connections that were asked by the manager to disconnect
workers_idled_out Total number of workers that disconnected for being idle
workers_slow Total number of worker connections terminated for being too slow
workers_blacklisted Total number of workers blacklisted by the manager (includes workers_slow)
workers_lost Total number of worker connections that were unexpectedly lost (does not include idled-out or slow)
Stats for the current state of tasks
tasks_waiting Number of tasks waiting to be dispatched
tasks_on_workers Number of tasks currently dispatched to some worker
tasks_running Number of tasks currently executing at some worker
tasks_with_results Number of tasks with retrieved results and waiting to be returned to user
Cumulative stats for tasks
tasks_submitted Total number of tasks submitted to the manager
tasks_dispatched Total number of tasks dispatched to workers
tasks_done Total number of tasks completed and returned to user (includes tasks_failed)
tasks_failed Total number of tasks completed and returned to user with result other than VINE_RESULT_SUCCESS
tasks_cancelled Total number of tasks cancelled
tasks_exhausted_attempts Total number of task executions that failed due to resource exhaustion
Manager time statistics (in microseconds)
time_when_started Absolute time at which the manager started
time_send Total time spent in sending tasks to workers (task descriptions and input files)
time_receive Total time spent in receiving results from workers (output files)
time_send_good Total time spent in sending data to workers for tasks with result VINE_RESULT_SUCCESS
time_receive_good Total time spent in receiving data from workers for tasks with result VINE_RESULT_SUCCESS
time_status_msgs Total time spent sending and receiving status messages to and from workers, including workers' standard output, new worker connections, resource updates, etc.
time_internal Total time the manager spends in internal processing
time_polling Total time blocking waiting for worker communications (i.e., manager idle waiting for a worker message)
time_application Total time spent outside vine_wait
Workers time statistics (in microseconds)
time_workers_execute Total time workers spent executing done tasks
time_workers_execute_good Total time workers spent executing done tasks with result VINE_RESULT_SUCCESS
time_workers_execute_exhaustion Total time workers spent executing tasks that exhausted resources
Transfer statistics
bytes_sent Total number of file bytes (not including protocol control msg bytes) sent out to the workers by the manager
bytes_received Total number of file bytes (not including protocol control msg bytes) received from the workers by the manager
bandwidth Average network bandwidth in MB/s observed by the manager when transferring to workers
Resources statistics
capacity_tasks The estimated number of tasks that this manager can effectively support
capacity_cores The estimated number of workers' cores that this manager can effectively support
capacity_memory The estimated number of workers' MB of RAM that this manager can effectively support
capacity_disk The estimated number of workers' MB of disk that this manager can effectively support
capacity_instantaneous The estimated number of tasks that this manager can support considering only the most recently completed task
capacity_weighted The estimated number of tasks that this manager can support placing greater weight on the most recently completed task
total_cores Total number of cores aggregated across the connected workers
total_memory Total memory in MB aggregated across the connected workers
total_disk Total disk space in MB aggregated across the connected workers
committed_cores Committed number of cores aggregated across the connected workers
committed_memory Committed memory in MB aggregated across the connected workers
committed_disk Committed disk space in MB aggregated across the connected workers
max_cores The highest number of cores observed among the connected workers
max_memory The largest memory size in MB observed among the connected workers
max_disk The largest disk space in MB observed among the connected workers
min_cores The lowest number of cores observed among the connected workers
min_memory The smallest memory size in MB observed among the connected workers
min_disk The smallest disk space in MB observed among the connected workers
manager_load In the range of [0,1]. If close to 1, the manager is at full load and spends most of its time sending and receiving tasks, and thus cannot accept connections from new workers. If close to 0, the manager is spending most of its time waiting for something to happen.
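Several of these counters are meant to be combined. For instance, tasks_done includes tasks_failed, so the number of successful tasks is their difference. The sketch below illustrates a few such ratios; derived_stats is a hypothetical helper, the input is a plain dictionary keyed by the field names above, and the numbers in the example are made up for illustration only.

```python
def derived_stats(stats):
    """Compute a few derived values from a dict of the counters listed above."""
    done = stats["tasks_done"]
    failed = stats["tasks_failed"]
    execute = stats["time_workers_execute"]
    execute_good = stats["time_workers_execute_good"]
    return {
        # tasks_done includes tasks_failed, so subtract to count successes.
        "tasks_succeeded": done - failed,
        # Guard against division by zero before any task has completed.
        "failure_ratio": failed / done if done else 0.0,
        # Share of worker execution time spent on tasks that succeeded.
        "good_execute_ratio": execute_good / execute if execute else 0.0,
    }


# Made-up values for illustration; times are in microseconds.
example = {
    "tasks_done": 5,
    "tasks_failed": 1,
    "time_workers_execute": 600_000_000,
    "time_workers_execute_good": 450_000_000,
}
summary = derived_stats(example)
print(summary)
```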