TaskVine Log File Formats
Performance Log Format
The performance log is a sequence of records, one written at
each significant change in an integer metric, such as the number of tasks
submitted or running.
The first row always contains the names of the columns, which correspond to
values that can be obtained from vine_stats. The first column is a Unix timestamp
with microsecond resolution.
Here is an example of the first few rows and columns:
# timestamp workers_connected workers_init workers_idle workers_busy workers_...
1602165237833411 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 5 0 0 0 0 0 1602165237827668 ...
1602165335687547 1 0 0 1 1 1 0 0 0 0 0 0 4 1 0 0 5 0 0 0 0 0 1602165237827668 ...
1602165335689677 1 0 0 1 1 1 0 0 0 0 0 0 4 1 1 1 5 1 0 0 0 0 1602165237827668 ...
...
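
Because the header row names every column, the log can be read into dictionaries without hard-coding column positions. The following is a minimal Python sketch, not part of TaskVine: the function name `parse_performance_log` is hypothetical, and the three-column sample is a shortened excerpt of the rows above.

```python
import io

def parse_performance_log(stream):
    """Yield one dict per record, keyed by the column names in the header row."""
    columns = None
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header row starts with '#' followed by the column names.
            columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("record found before the header row")
        values = []
        for field in line.split():
            try:
                values.append(int(field))
            except ValueError:
                # Assumption: any non-integer column holds a decimal number.
                values.append(float(field))
        yield dict(zip(columns, values))

# Shortened excerpt (first three columns only), for illustration.
sample = io.StringIO(
    "# timestamp workers_connected workers_init\n"
    "1602165237833411 0 0\n"
    "1602165335687547 1 0\n"
)
records = list(parse_performance_log(sample))

# Timestamps are in microseconds, so divide by 1e6 to get seconds.
elapsed_s = (records[-1]["timestamp"] - records[0]["timestamp"]) / 1e6
print(records[1]["workers_connected"], round(elapsed_s, 3))
```

To read a real log, pass an open file object instead of the `io.StringIO` sample.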
Transactions Log Format
The first few lines of the log document the possible log records:
# time manager_pid MANAGER manager_pid START|END time_from_origin
# time manager_pid WORKER worker_id CONNECTION host:port
# time manager_pid WORKER worker_id DISCONNECTION (UNKNOWN|IDLE_OUT|FAST_ABORT|FAILURE|STATUS_WORKER|EXPLICIT)
# time manager_pid WORKER worker_id RESOURCES {resources}
# time manager_pid WORKER worker_id CACHE_UPDATE filename size_in_mb wall_time_us start_time_us
# time manager_pid WORKER worker_id TRANSFER (INPUT|OUTPUT) filename size_in_mb wall_time_us start_time_us
# time manager_pid CATEGORY name MAX {resources_max_per_task}
# time manager_pid CATEGORY name MIN {resources_min_per_task_per_worker}
# time manager_pid CATEGORY name FIRST (FIXED|MAX|MIN_WASTE|MAX_THROUGHPUT) {resources_requested}
# time manager_pid TASK task_id WAITING category_name (FIRST_RESOURCES|MAX_RESOURCES) attempt_number {resources_requested}
# time manager_pid TASK task_id RUNNING worker_id (FIRST_RESOURCES|MAX_RESOURCES) {resources_allocated}
# time manager_pid TASK task_id WAITING_RETRIEVAL worker_id
# time manager_pid TASK task_id RETRIEVED (SUCCESS|UNKNOWN|INPUT_MISSING|OUTPUT_MISSING|STDOUT_MISSING|SIGNAL|RESOURCE_EXHAUSTION|MAX_RETRIES|MAX_END_TIME|MAX_WALL_TIME|FORSAKEN) {limits_exceeded} {resources_measured}
# time manager_pid TASK task_id DONE (SUCCESS|UNKNOWN|INPUT_MISSING|OUTPUT_MISSING|STDOUT_MISSING|SIGNAL|RESOURCE_EXHAUSTION|MAX_RETRIES|MAX_END_TIME|MAX_WALL_TIME|FORSAKEN) exit_code
# time manager_pid LIBRARY library_id (WAITING|SENT|STARTED|FAILURE) worker_id
Lowercase words indicate values, and uppercase words indicate constants. A bar (|) inside parentheses indicates a choice among the possible constants. Fields enclosed in braces {} are JSON dictionaries. Here is an example of the first few records of a transactions log:
1679929304405580 4107108 MANAGER 4107108 START 0
1679929315785718 4107108 TASK 1 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315789781 4107108 TASK 2 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315791349 4107108 TASK 3 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315792852 4107108 TASK 4 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
1679929315794343 4107108 TASK 5 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}
...
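
Every record type listed above starts with the same five fields (time, manager pid, subject, subject id, event), so a line can be split generically before looking at the event-specific fields. The sketch below is one way to do that in Python. The names `Record`, `parse_record`, and `json_fields` are hypothetical helpers, not part of TaskVine, and the code assumes complete records (a JSON dictionary abbreviated with `...` will not decode).

```python
import json
from typing import NamedTuple

class Record(NamedTuple):
    time_us: int      # Unix timestamp in microseconds
    manager_pid: int
    subject: str      # MANAGER, WORKER, CATEGORY, TASK, or LIBRARY
    subject_id: str   # manager pid, worker id, category name, task id, or library id
    event: str        # e.g. START, CONNECTION, MAX, WAITING, SENT
    details: str      # remaining event-specific fields, left unparsed

def parse_record(line):
    """Split one transactions-log line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"malformed record: {line!r}")
    details = fields[5] if len(fields) > 5 else ""
    return Record(int(fields[0]), int(fields[1]), fields[2], fields[3], fields[4], details)

def json_fields(details):
    """Decode every {...} dictionary found in the details string, in order."""
    decoder = json.JSONDecoder()
    found = []
    pos = details.find("{")
    while pos != -1:
        obj, end = decoder.raw_decode(details, pos)
        found.append(obj)
        pos = details.find("{", end)
    return found

rec = parse_record(
    '1679929315785718 4107108 TASK 1 WAITING default FIRST_RESOURCES 1 {"cores":[1,"cores"]}'
)
print(rec.subject, rec.subject_id, rec.event, json_fields(rec.details))
```

Leaving `details` as a string keeps the parser independent of the per-event layouts, which differ from one record type to the next.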
With the transactions log, it is easy to track the lifetime of a task. For example, to print the lifetime of the task with id 1, we can simply do:
$ grep 'TASK \<1\>' my.tr.log
1599244364466668 16444 TASK 1 WAITING default FIRST_RESOURCES {"cores":[1,"cores"],"memory":[800,"MB"],"disk":[500,"MB"]}
1599244400311044 16444 TASK 1 RUNNING 10.32.79.143:48268 FIRST_RESOURCES {"cores":[4,"cores"],"memory":[4100,"MB"],...}
1599244539953798 16444 TASK 1 WAITING_RETRIEVAL 10.32.79.143:48268
1599244540075173 16444 TASK 1 RETRIEVED SUCCESS 0 {} {"cores":[1,"cores"],"wall_time":[123.137485,"s"],...}
1599244540083820 16444 TASK 1 DONE SUCCESS 0 {} {"cores":[1,"cores"],"wall_time":[123.137485,"s"],...}
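
The same records can be used to measure how long the task spent in each state, by subtracting consecutive timestamps. The following Python sketch illustrates the idea. The function name `time_in_states` is hypothetical, and the sample lines are the records above shortened to their leading fields.

```python
from collections import defaultdict

def time_in_states(lines, task_id):
    """Return the seconds one task spent in each state, keyed by state name."""
    events = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split()
        # Task records look like: time manager_pid TASK task_id STATE ...
        if len(fields) >= 5 and fields[2] == "TASK" and fields[3] == str(task_id):
            events.append((int(fields[0]), fields[4]))
    events.sort(key=lambda event: event[0])

    durations = defaultdict(float)
    # A state lasts from its own timestamp until the next record for the task.
    for (start, state), (end, _) in zip(events, events[1:]):
        durations[state] += (end - start) / 1e6
    return dict(durations)

log = [
    "1599244364466668 16444 TASK 1 WAITING default FIRST_RESOURCES",
    "1599244400311044 16444 TASK 1 RUNNING 10.32.79.143:48268 FIRST_RESOURCES",
    "1599244539953798 16444 TASK 1 WAITING_RETRIEVAL 10.32.79.143:48268",
    "1599244540075173 16444 TASK 1 RETRIEVED SUCCESS",
    "1599244540083820 16444 TASK 1 DONE SUCCESS",
]
durations = time_in_states(log, 1)
for state, seconds in durations.items():
    print(f"{state:18s} {seconds:.6f} s")
```

The final state (DONE) has no following record, so it does not appear in the result. A task that is retried passes through WAITING and RUNNING more than once, and its times are summed per state.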
The statistics available are:
Field | Description |
---|---|
Stats for the current state of workers | |
workers_connected | Number of workers currently connected to the manager |
workers_init | Number of workers connected, but that have not sent their available resources report yet |
workers_idle | Number of workers that are not running a task |
workers_busy | Number of workers that are running at least one task |
workers_able | Number of workers on which the largest task can run |
Cumulative stats for workers | |
workers_joined | Total number of worker connections that were established to the manager |
workers_removed | Total number of worker connections that were released by the manager, idled-out, slow, or lost |
workers_released | Total number of worker connections that were asked by the manager to disconnect |
workers_idled_out | Total number of workers that disconnected for being idle |
workers_slow | Total number of worker connections terminated for being too slow |
workers_blacklisted | Total number of workers blacklisted by the manager (includes workers_slow) |
workers_lost | Total number of worker connections that were unexpectedly lost (does not include idled-out or slow) |
Stats for the current state of tasks | |
tasks_waiting | Number of tasks waiting to be dispatched |
tasks_on_workers | Number of tasks currently dispatched to some worker |
tasks_running | Number of tasks currently executing at some worker |
tasks_with_results | Number of tasks with retrieved results and waiting to be returned to user |
Cumulative stats for tasks | |
tasks_submitted | Total number of tasks submitted to the manager |
tasks_dispatched | Total number of tasks dispatched to workers |
tasks_done | Total number of tasks completed and returned to user (includes tasks_failed) |
tasks_failed | Total number of tasks completed and returned to user with result other than VINE_RESULT_SUCCESS |
tasks_cancelled | Total number of tasks cancelled |
tasks_exhausted_attempts | Total number of task executions that failed due to resource exhaustion |
Manager time statistics (in microseconds) | |
time_when_started | Absolute time at which the manager started |
time_send | Total time spent sending tasks to workers (task descriptions and input files) |
time_receive | Total time spent in receiving results from workers (output files) |
time_send_good | Total time spent in sending data to workers for tasks with result VINE_RESULT_SUCCESS |
time_receive_good | Total time spent in receiving data from workers for tasks with result VINE_RESULT_SUCCESS |
time_status_msgs | Total time spent sending and receiving status messages to and from workers, including workers' standard output, new worker connections, resource updates, etc. |
time_internal | Total time the manager spends in internal processing |
time_polling | Total time blocking waiting for worker communications (i.e., manager idle waiting for a worker message) |
time_application | Total time spent outside vine_wait |
Workers time statistics (in microseconds) | |
time_workers_execute | Total time workers spent executing done tasks |
time_workers_execute_good | Total time workers spent executing done tasks with result VINE_RESULT_SUCCESS |
time_workers_execute_exhaustion | Total time workers spent executing tasks that exhausted resources |
Transfer statistics | |
bytes_sent | Total number of file bytes (not including protocol control msg bytes) sent out to the workers by the manager |
bytes_received | Total number of file bytes (not including protocol control msg bytes) received from the workers by the manager |
bandwidth | Average network bandwidth in MB/s observed by the manager when transferring to workers |
Resources statistics | |
capacity_tasks | The estimated number of tasks that this manager can effectively support |
capacity_cores | The estimated number of workers' cores that this manager can effectively support |
capacity_memory | The estimated number of workers' MB of RAM that this manager can effectively support |
capacity_disk | The estimated number of workers' MB of disk that this manager can effectively support |
capacity_instantaneous | The estimated number of tasks that this manager can support considering only the most recently completed task |
capacity_weighted | The estimated number of tasks that this manager can support placing greater weight on the most recently completed task |
total_cores | Total number of cores aggregated across the connected workers |
total_memory | Total memory in MB aggregated across the connected workers |
total_disk | Total disk space in MB aggregated across the connected workers |
committed_cores | Committed number of cores aggregated across the connected workers |
committed_memory | Committed memory in MB aggregated across the connected workers |
committed_disk | Committed disk space in MB aggregated across the connected workers |
max_cores | The highest number of cores observed among the connected workers |
max_memory | The largest memory size in MB observed among the connected workers |
max_disk | The largest disk space in MB observed among the connected workers |
min_cores | The lowest number of cores observed among the connected workers |
min_memory | The smallest memory size in MB observed among the connected workers |
min_disk | The smallest disk space in MB observed among the connected workers |
manager_load | In the range of [0,1]. If close to 1, then the manager is at full load and spends most of its time sending and receiving tasks, and thus cannot accept connections from new workers. If close to 0, the manager is spending most of its time waiting for something to happen. |
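
Since the cumulative fields only grow over time, the difference between two performance-log records gives a rate over that interval. The sketch below computes tasks completed per second from `tasks_done` and the microsecond `timestamp` column. The function name `completion_rate` is hypothetical, and the two records contain made-up values for illustration.

```python
def completion_rate(earlier, later):
    """Tasks completed per second between two performance-log records."""
    elapsed_us = later["timestamp"] - earlier["timestamp"]
    if elapsed_us <= 0:
        raise ValueError("records must be in increasing time order")
    # Timestamps are in microseconds; scale to get a per-second rate.
    return (later["tasks_done"] - earlier["tasks_done"]) * 1e6 / elapsed_us

# Hypothetical records, each a dict keyed by column name.
first = {"timestamp": 1_000_000, "tasks_done": 10}
second = {"timestamp": 5_000_000, "tasks_done": 30}
rate = completion_rate(first, second)
print(f"{rate:.2f} tasks/s")
```

The same pattern applies to the other cumulative columns, such as `tasks_dispatched` or `bytes_sent`.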