How does lookup in $PATH work under the hood?

debugcn Published at Dev

Xlee

There are far too many articles and resources on the web that teach people HOW to set the PATH environment variable so that they can use short names like java or python instead of absolute paths on the command line.

What I want to know is what happens behind the scenes when we type a command and hit Enter (similar to what happens when you type a URL into a browser).

Here is my guess:

1. read the command (parse/preprocess stdin to get the right arguments $@)
2. command lookup
3. command execution (the program starts, consumes memory, and sends stdout/stderr to the shell)
4. re-render the terminal emulator according to the relevant environment variables (e.g. $PS#, $PROMPT, etc.)

The part I most want to figure out is the command lookup. Obviously, $PATH is consumed by some background function and split on : or ; as delimiters, but then what happens? Is a hash table (key: basename of the file, value: absolute dirname of the file) used to store the binaries found under those PATH directories, or is there some other mechanism?

NOTE: I originally thought it was a hash table, as I can use [ -z hash [command] ] to check whether a command is available in the current environment, but when I run hash | grep python I get no output, while which python works as expected. (I think the mechanism might be shell-specific, but I want more insight into it.)

Michael Homer

As you suspect, the exact behaviour is shell-dependent, but a baseline level of functionality is specified by POSIX.

Command search and execution for the standard shell command language (which most shells implement a superset of) has a lot of cases, but we're only interested for the moment in the case where PATH is used. In that case:

the command shall be searched for using the PATH environment variable as described in XBD Environment Variables

and

If the search is successful:

[...]

the shell executes the utility in a separate utility environment with actions equivalent to calling the execl() function [...] with the path argument set to the pathname resulting from the search.

In the unsuccessful case, execution fails and an exit code of 127 is returned with an error message.
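The failure case can be observed from outside the shell. The following is a minimal sketch, assuming a POSIX-style sh is installed and reachable; run_in_sh is a hypothetical helper name and the command name is simply one that is unlikely to exist.

```python
import subprocess

def run_in_sh(command_line):
    """Run a command line in `sh` and return (exit status, stderr text)."""
    result = subprocess.run(
        ["sh", "-c", command_line],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr

if __name__ == "__main__":
    # A name with no slash that should not be found in any PATH directory.
    status, message = run_in_sh("no-such-command-xyzzy")
    print(status)           # the exit status reported by the shell
    print(message.strip())  # the shell's own "not found" diagnostic
```

The wording of the diagnostic differs between shells; only the exit status is specified.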

This behaviour is consistent with the execvp function, in particular. All the exec* functions accept the file name of a program to run, a sequence of arguments (which will be the argv of the program), and perhaps a set of environment variables. For the versions using PATH lookup, POSIX defines that:

The argument file is used to construct a pathname that identifies the new process image file [...] the path prefix for this file is obtained by a search of the directories passed as the environment variable PATH
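To make the shape of that call concrete, here is a rough sketch using Python's os.execvp, which takes the same two pieces (a file name and an argv list) and searches PATH when the name contains no slash. The function name spawn_with_path_lookup is hypothetical, and returning 127 on failure is a choice made here to mirror the shell behaviour described above.

```python
import os

def spawn_with_path_lookup(file, argv):
    """Fork, exec `file` via a PATH search in the child, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child process: replace this process image with the new program.
        try:
            os.execvp(file, argv)  # PATH is searched because `file` has no slash
        except OSError:
            # Nothing executable was found; mimic the shell's status.
            os._exit(127)
    # Parent process: wait for the child and decode its status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

if __name__ == "__main__":
    print(spawn_with_path_lookup("sh", ["sh", "-c", "exit 3"]))
    print(spawn_with_path_lookup("no-such-command-xyzzy", ["no-such-command-xyzzy"]))
```

Note that argv[0] is passed explicitly and is conventionally the program name; the lookup itself only uses the first argument.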

The behaviour of PATH is defined elsewhere as:

This variable shall represent the sequence of path prefixes that certain functions and utilities apply in searching for an executable file known only by a filename. The prefixes shall be separated by a <colon> ( ':' ). When a non-zero-length prefix is applied to this filename, a <slash> shall be inserted between the prefix and the filename if the prefix did not end in <slash>. A zero-length prefix is a legacy feature that indicates the current working directory. It appears as two adjacent <colon> characters ( "::" ), as an initial <colon> preceding the rest of the list, or as a trailing <colon> following the rest of the list. A strictly conforming application shall use an actual pathname (such as .) to represent the current working directory in PATH. The list shall be searched from beginning to end, applying the filename to each prefix, until an executable file with the specified name and appropriate execution permissions is found. If the pathname being sought contains a <slash>, the search through the path prefixes shall not be performed. If the pathname begins with a <slash>, the specified path is resolved (see Pathname Resolution). If PATH is unset or is set to null, the path search is implementation-defined.

That's a bit dense, so a summary:

1. If the program name has a / (slash, U+002F SOLIDUS) in it, treat it as a path in the usual fashion, and skip the rest of this process. For the shell, this case technically doesn't arise (because the shell rules will have dealt with it already).
2. The value of PATH is split into pieces at each colon, and then each component is processed from left to right. As a special (historical) case, an empty component of a non-empty variable is treated as . (the current directory).
3. For each component, the program name is appended to the end with a joining / and the existence of a file by that name is checked, and if one does exist then valid execute (+x) permissions are checked as well. If either of those checks fails, the process moves on to the next component. Otherwise, the command resolves to this path and the search is done.
4. If you run out of components, the search fails.
5. If there's nothing in PATH, or it doesn't exist, do whatever you want.
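The steps in this summary can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular shell or C library does it: split_path and find_in_path are hypothetical names, and returning None for an empty PATH is just one arbitrary choice for the implementation-defined case.

```python
import os

def split_path(path_value):
    """Split a PATH string at colons; an empty component means the current directory."""
    return [component or "." for component in path_value.split(":")]

def find_in_path(name, path_value):
    """Return the first matching executable path for `name`, or None if the search fails."""
    if "/" in name:
        # A name containing a slash is used as a path directly; no PATH search.
        return name
    if not path_value:
        # Unset or empty PATH is implementation-defined; this sketch just gives up.
        return None
    for directory in split_path(path_value):
        # os.path.join inserts a slash only if the prefix does not already end in one.
        candidate = os.path.join(directory, name)
        # The file must exist (as a regular file) and carry execute permission.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Ran out of components: the search fails.
    return None

if __name__ == "__main__":
    print(split_path("/usr/bin::/bin:"))
    print(find_in_path("sh", os.environ.get("PATH", "")))
```

Python's standard library offers shutil.which for a similar search, so the sketch above is only meant to show the moving parts.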

Real shells will have builtin commands, which are found before this lookup, and often aliases and functions as well. Those don't interact with PATH. POSIX defines some behaviour around those, and your shell may have much more.

While it's possible to rely on exec* to do most of this for you, the shell in practice may implement this lookup itself, notably for caching purposes, but the empty-cache behaviour should be similar. Shells have fairly wide latitude here and have subtly different behaviours in the corner cases.

As you found, Bash uses a hash table to remember the full paths of commands it's seen before, and that table can be accessed with the hash builtin. The first time you run a command it searches, and when a result is found it gets added to the table so there's no need to bother looking the next time you try it.
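A toy model of this remember-on-first-use behaviour might look like the following. CommandCache and its methods are hypothetical names used for illustration only; Bash's real table has more features than this. It would also account for the observation in the question: a command that has not yet been run in the current shell has no entry to show.

```python
import os

class CommandCache:
    """Toy model of a command table that is filled lazily, one lookup at a time."""

    def __init__(self, path_value):
        self.directories = [d or "." for d in path_value.split(":")]
        self.table = {}     # command name -> full path found earlier
        self.searches = 0   # how many times the directories were actually walked

    def _search(self, name):
        self.searches += 1
        for directory in self.directories:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
        return None

    def lookup(self, name):
        # Serve from the table when the command has been found before.
        if name in self.table:
            return self.table[name]
        found = self._search(name)
        if found is not None:
            self.table[name] = found  # only successful lookups are remembered
        return found

    def forget(self):
        """Empty the table, roughly what `hash -r` does in Bash."""
        self.table.clear()

if __name__ == "__main__":
    cache = CommandCache(os.environ.get("PATH", ""))
    print(cache.table)   # empty until something has been looked up
    cache.lookup("sh")
    cache.lookup("sh")
    print(cache.table, cache.searches)
```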

In zsh, on the other hand, the full PATH is generally searched when the shell starts. A lookup table is prepopulated with all discovered command names so that runtime lookups usually aren't necessary (unless a new command is added). You can notice that happening when you try to tab-complete a command that didn't exist before.
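The prepopulated-table approach can be sketched as a single scan over every PATH directory. This is a simplified model, not zsh's actual code: build_command_table is a hypothetical name, and skipping unreadable directories is an assumption made here.

```python
import os

def build_command_table(path_value):
    """Scan every PATH directory once and map each command name to its first match."""
    table = {}
    for directory in path_value.split(":"):
        directory = directory or "."  # empty component means the current directory
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # missing or unreadable directories are skipped in this sketch
        for name in names:
            full = os.path.join(directory, name)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                # Earlier directories take precedence, so keep the first entry seen.
                table.setdefault(name, full)
    return table

if __name__ == "__main__":
    table = build_command_table(os.environ.get("PATH", ""))
    print(len(table), "commands found")
    print(table.get("sh"))
```

A table built this way goes stale when a new executable appears in one of the directories, which is why a fresh scan is needed before the new command shows up.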

Very lightweight shells, like dash, tend to delegate as much behaviour as possible to the system library and don't bother to remember past command paths.


Edited at 2021-06-19
