= jb_execv =

== Name ==

//jb_execv// - execute program on BGAS IO node

== Synopsis ==

{{{
#include <mpi.h>
#include "jbcnl.h"

int jb_execv( const char* filename, char* const argv[]);
}}}

Include with //-I/bgsys/local/bgas/jbrt/jbcn/include//, link with //-L/bgsys/local/bgas/jbrt/jbcn/lib -ljbcn//.

== Description ==

//jb_execv()// is the JBRT analogue of a call to the following function:

{{{
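#include <stdlib.h>     /* exit() */
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* fork(), execv() */

pid_t fork_execv( const char* filename, char* const argv[]) {
  pid_t pid = fork();

  if (!pid) {
    /* Child: replace the process image; execv() returns only on failure. */
    execv( filename, argv);

    exit( -1);
  }
  else {
    /* Parent: child pid, or -1 if fork() failed. */
    return pid;
  }
}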
}}}

The main difference is that while this hypothetical function creates a process and executes code on the node the caller process lives on, //jb_execv()// is called by a process living on a BG/Q compute node, but creates a process and executes code on the IO node this compute node is connected to.

That is, //jb_execv()// triggers execution of the program pointed to by "filename". "filename" must be either a binary executable, or a script starting with a line of the form

{{{
#! interpreter [optional-arg]
}}}

For details of the latter case, see the "Interpreter scripts" section of the //execve()// manpage.

"argv" is an array of null-terminated strings representing the argument list available to the new program. However, because of constraints on the CN-ION message size, //jb_execv()// prepends "filename" to the actual argument list of the program to be executed. That is, assuming "filename" is a binary and its main routine reads

{{{
int main( int ion_argc, char** ion_argv)
}}}

then "ion_argv[0]" is a duplicate of "filename" (i.e. it compares equal to "filename" under //strcmp()//), while "ion_argv[1]" duplicates "argv[0]", "ion_argv[2]" duplicates "argv[1]", etc. In other words, users should follow the "argv[0]=filename" convention when writing code for the IO node side, but should explicitly NOT follow it when calling //jb_execv()// on the compute node side. As with //execv()//, the array of pointers must be terminated by a NULL pointer. In particular, do not pass NULL as "argv", as this will cause a segmentation fault (this is in accordance with //execv()// behaviour).

Note that, unlike for standard //execv()//, the IO node process created by //jb_execv()// does NOT inherit the environment of the calling compute node process. Instead, the environment of the new (IO node) process is empty. (Inheritance of the compute node environment would not make sense since the "child" process is not even running under the same OS as the "parent" process.)

No process attributes save for user ID, effective user ID, group ID and effective group ID are preserved in the newly created IO node process. //jb_execv()// follows the //execve()// rules for effective user ID and effective group ID.
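The argument-list convention described above can be illustrated with a small model. The sketch below is not part of the JBRT; the helper name //model_ion_argv// is hypothetical, and it only mimics the mapping from the compute node side ("filename", "argv") to what the IO node program sees ("ion_argv"). Unlike the real runtime, which hands the IO node program duplicates of the strings, this model simply reuses the caller's pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of the JBRT: models the argument list seen
 * by the IO node program. Returns a malloc'ed, NULL-terminated array whose
 * entries point at the caller's strings (the caller frees the array only),
 * or NULL if the allocation fails. */
char **model_ion_argv(const char *filename, char *const argv[], int *ion_argc)
{
    /* As with jb_execv(), argv itself must not be NULL. */
    assert(filename != NULL && argv != NULL);

    size_t n = 0;
    while (argv[n] != NULL)      /* count the caller's arguments */
        n++;

    char **ion_argv = malloc((n + 2) * sizeof *ion_argv);
    if (ion_argv == NULL)
        return NULL;

    ion_argv[0] = (char *)filename;   /* filename is prepended */
    for (size_t i = 0; i < n; i++)
        ion_argv[i + 1] = argv[i];    /* argv[i] shows up as ion_argv[i+1] */
    ion_argv[n + 1] = NULL;

    if (ion_argc != NULL)
        *ion_argc = (int)(n + 1);
    return ion_argv;
}
```

With "filename" set to "/bin/tool" and "argv" set to { "-v", "input.dat", NULL }, the model yields an "ion_argc" of 3, with "ion_argv[0]" being "/bin/tool" and "ion_argv[1]" being "-v".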

== Constraints and limitations ==
=== Limits on size of arguments ===

Due to constraints on the size of the CN-ION messages used by the JBRT, the limit on the total size of the command-line argument strings to be passed to the new IO node program is much stricter than the corresponding limit for //execve()//. The total size of "filename" and the "argv" element strings, when written as a long, blank-delimited, zero-terminated string, may not exceed the //JBRT_EXEC_CMD_SIZE// constant defined in //jbsd_messages.h//, which is currently set to 150 bytes.
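A caller may want to check the size limit before calling //jb_execv()//. The following is a minimal sketch under two assumptions: the limit is the 150 bytes quoted above (the authoritative value is //JBRT_EXEC_CMD_SIZE// in //jbsd_messages.h//), and the terminating zero byte counts towards it. The helper names are hypothetical and not part of the JBRT.

```c
#include <stddef.h>
#include <string.h>

/* Assumed value, copied from the text above; the authoritative definition
 * is JBRT_EXEC_CMD_SIZE in jbsd_messages.h. */
#define ASSUMED_JBRT_EXEC_CMD_SIZE 150

/* Hypothetical helper: size in bytes of "filename argv[0] argv[1] ..."
 * written as one blank-delimited string, including the terminating zero. */
size_t jb_cmd_line_size(const char *filename, char *const argv[])
{
    size_t total = strlen(filename) + 1;      /* filename plus terminator */
    for (size_t i = 0; argv[i] != NULL; i++)
        total += 1 + strlen(argv[i]);         /* one blank plus the argument */
    return total;
}

/* Returns nonzero if the command line fits into the assumed limit. */
int jb_cmd_line_fits(const char *filename, char *const argv[])
{
    return jb_cmd_line_size(filename, argv) <= ASSUMED_JBRT_EXEC_CMD_SIZE;
}
```

If the check fails, //jb_execv()// can be expected to fail with E2BIG, so long argument lists are better passed through a file whose name is given as the only argument.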

=== Limits on number of spawned ION processes ===

There are two limits on the number of IO node processes spawned via //jb_execv()//, one on a per-compute-node-process basis and one on a per-IO-node basis. Both limits are unlikely to be reached in practice.

On a per-compute-node-process basis, each spawned IO node process gets assigned a so-called "tag", which is used in place of the child process id for monitoring and reaping. Tags are integers from 1 to 8, thus each compute node process can have 8 IO node "children" at any given time. If no tag is available, //jb_execv()// will fail.
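The per-process tag limit behaves like a small fixed pool. The toy model below is only meant to make that behaviour concrete; it is not the JBRT's actual bookkeeping. The function names are hypothetical, the choice of handing out the lowest free tag is an assumption, and the model is deliberately single-threaded.

```c
#include <errno.h>

#define MODEL_MAX_TAGS 8   /* tags 1..8, as stated in the text */

/* Toy model, NOT the JBRT implementation: bit (tag - 1) is set while the
 * corresponding tag is in use. */
static unsigned model_tags_in_use = 0;

/* Returns a free tag in 1..8, or -1 with errno set to EAGAIN if all eight
 * tags are taken. Assumption: the lowest free tag is handed out. */
int model_acquire_tag(void)
{
    for (int tag = 1; tag <= MODEL_MAX_TAGS; tag++) {
        unsigned bit = 1u << (tag - 1);
        if (!(model_tags_in_use & bit)) {
            model_tags_in_use |= bit;
            return tag;
        }
    }
    errno = EAGAIN;
    return -1;
}

/* Models "reaping": marks the tag as free again; invalid tags are ignored. */
void model_release_tag(int tag)
{
    if (tag >= 1 && tag <= MODEL_MAX_TAGS)
        model_tags_in_use &= ~(1u << (tag - 1));
}
```

In this model, a ninth acquisition fails until one of the earlier tags has been released, which mirrors the need to reap a finished IO node process before spawning another one.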

On a per-IO-node basis, one has to take into account the limits on both the number of threads and the number of open files imposed by the Linux installation on the IO nodes. The per-IO-node limit on the number of IO node user processes is currently slightly below 256, and thus significantly below the theoretical maximum of 32768 IO node user code instances allowed by the CN-side limitations. Violations of the IO node limitations cannot be accounted for by the CN side and are thus treated analogously to a failure of the //execv()// part of the hypothetical "fork_execv" function shown above (as opposed to a failure of the //fork()// part); that is, they do not cause //jb_execv()// itself to fail, but can be detected a posteriori by monitoring the issued tag.

== Return value ==

On success, an integer between 1 and 8 is returned; this integer is to be treated like the return value of fork(), i.e. to be stored for usage as a function argument for monitoring and reaping of IO node processes; see [wiki:bgas-user:bgas-manpages:jb_execv_status jb_execv_status].

On failure, -1 is returned, and "errno" is set appropriately. Note that //jb_execv()// can "fail silently": it may return successfully even though no actual ION process has been created. In this case, the JBRT behaves as if an ION process had been successfully created and then killed by SIGHUP.

== Errors ==

E2BIG
	The total number of bytes in "filename" and the argument list ("argv") is too large.
EAGAIN
	jb_execv() ran into the 8-process limit for processes spawned by this compute node process; wait for an earlier IO node task to terminate and "reap" it (see jb_execv_status()) before retrying.
ENAMETOOLONG
	"filename" is too long.

Other errors stem from ZeroMQ errors on the ION side and indicate bugs in the JBRT. If an error other than E2BIG, EAGAIN or ENAMETOOLONG occurs, please report it to n.vandenbergen@fz-juelich.de.
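The error list above suggests three different reactions on the caller's side. The sketch below encodes that mapping; the enum and function names are hypothetical and not part of the JBRT. It would be called with the value of "errno" right after //jb_execv()// has returned -1.

```c
#include <errno.h>

/* Hypothetical classification of jb_execv() failures, NOT part of the JBRT. */
enum jb_execv_advice {
    JB_ADVICE_SHORTEN_ARGS,     /* E2BIG, ENAMETOOLONG: shrink filename/argv */
    JB_ADVICE_REAP_AND_RETRY,   /* EAGAIN: free a tag via jb_execv_status() */
    JB_ADVICE_REPORT_BUG        /* anything else: likely a JBRT bug */
};

/* Maps an errno value saved after a failed jb_execv() to a suggested action. */
enum jb_execv_advice jb_execv_advice_for(int err)
{
    switch (err) {
    case E2BIG:
    case ENAMETOOLONG:
        return JB_ADVICE_SHORTEN_ARGS;
    case EAGAIN:
        return JB_ADVICE_REAP_AND_RETRY;
    default:
        return JB_ADVICE_REPORT_BUG;
    }
}
```

Saving "errno" into a local variable immediately after the failed call avoids it being overwritten by later library calls before the classification happens.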

== Thread safety ==

//jb_execv()// may be called safely from inside a (POSIX or OpenMP) threaded region. Note that its return value should not be stored in a variable shared among threads, since it is needed later for freeing resources allocated to the remote job (see //jb_execv_status()//). Also, the space of available tags is shared among threads of the same process, i.e. the 8-tag limit is per MPI task, not per thread.