On 2014-05-15 02:02, Gavin Lambert wrote:
...
Although that brings up another question (which I'm not really familiar enough with the "path" class internals to answer by looking at the example implementation): is "base" intended to be assumed as a directory name (which is how most filesystem "make relative" functions typically work) or as a file name (which is how URL "make relative" works)?
path makes no assumptions about what it references, or whether the pathname even exists in the filesystem.
I'm aware of that; my point was that the semantics of the "make a relative path" operation typically differ between filesystems and URLs.
Dear all, One problem with boost filesystem is the lack of theoretical foundation. Hey, don't run away. :) Note: I have not checked the formalities thoroughly, so parts of the text might be erroneous or incomplete. Note: I type slashes everywhere because it is easier. Some should be backslashes on Windows. What are paths? =============== Surprisingly this is already a controversial question. From my experience, I can classify people's attitude to three categories. "Paths are names of resources, cookies for the user code." ---------------------------------------------------------- This is a justified approach, in simple cases like: int main(int argc, char *argv[]) { ifstream in(argv[1]); ofstream out(argv[2]); out << in.rdbuf(); } But obviously there is more to paths than that! "Paths are strings. Doing simple string operations won't do anybody harm." -------------------------------------------------------------------------- This is a common approach in scripting languages, for example. There are plenty of C++ code that does the same thing. This approach assumes specific connection between syntax and semantics of paths, however. The most common path operation, namely concatenation, can be easily defined to be consistent by applying some discipline (at least on Unix and Windows). Specifically, by ending all directories with a slash, concatenating any non-file path x with a relative path y can be done simply by x + y (std::string::operator +). The problem of this approach is that it does not encompass the meaning behind the defined operations, making hard to well-define them. "Paths are sequences of instructions to locate some resources." --------------------------------------------------------------- After all, this is what they are for the OS. It breaks them to a sequence of 'path elements' (from here on simply 'elements'), and depending on each of them, does some state transition on the internal path resolution state, getting to the desired filesystem node. This definition is not limited to filesystem paths, it is general enough to work with "step left, 3 steps back, dig the treasure", or "England, London, Baker St. 4". We will get to it later. Definitions =========== Notation: * f,g,h filesystem nodes * a,b,c elements * x,y,z paths From here on, if we quantify over "all filesystem nodes f", the intention is over all conceivable nodes in all possible machine states, not only in the concrete filesytem the program is currently run. Definition: a path is a finite sequence of elements, denoted by x = a0 / ... / an. (Here '/' is only an algebraic notation.) With filesystem paths, the elements are instructions of how to traverse the graph of filesystem nodes. Kinds of elements include: * On Unix: root "/", regular "name", current ".", parent "..". * On Windows: drive "c:", root "/", regular "name", current "." and parent "..". The action of an element on a node, denoted f / a, returns a new filesystem node as if by a call to openat() on Unix. On Windows there is no direct analogy of openat, so the semantics are somewhat similar to both, NtCreateFile() and SetCurrentDirectory(). The action of a path on a node is defined by f / (a0 / ... / an) = (...(f / a0) / ... ) / an. (openat() is consistent with this definition, in the sense that you get the same result whether you pass any string representation (see below) of (a0 / ... / an) in one call, or invoke it repeatedly for each element.) Naturally, concatenation of paths x / y is simply concatenation of the said sequences. The identity f / (x / y) = (f / x) / y holds, by definition. So far I gave an abstract definition of paths *without saying how they are represented syntactically*. This point is important because we cannot map every sequence of elements to a string in the OS specific syntax. For example, there is no encoding of the sequence "a" / "c:" on Windows. However: Definition: x is equivalent to y iff for all f, if g = f / x and h = f / y are both defined, then g = h (i.e. they yield the same node). Definition: str(x) is the canonical string representation of the path x. path(s) is the path resulting from parsing the string s. I leave str() and path() be implementation defined, subject to the following requirements: * path(str(x)) is equivalent to x * if x and y are equivalent, then str(x) = str(y) In practice there need no be path() and str() functions. In memory, paths are already represented as strings. All operations, although formally defined on abstract paths, can be implemented to work directly on strings. Therefore, where it is obvious from the context, we can identify paths x with their string representations str(x). For all operations, except canonical(s) = str(path(x)), we require that if the operands are already in canonical string representation, then the result is also to be in canonical string representation. This lets the implementation to short-circuit when appropriate. Operations (examples are for Windows, because it is less trivial than Unix): * concatenation -- already defined: * "a" / "b" = "a/b" * "c:" / "d" = "c:d" * "/a/b" / "c:" = "c:" * "a" / "../b" = "a/../b" * "d:/a" / "/b" = "d:/b" * is_absolute(x) iff there exists f such that for all g, g / x = f. * is_absolute("d:/xyz") = true * is_absolute("/abc") = false * is_absolute("c:") = true or false, depending on whether we include the environment states into the filesystem node quantification. * relative(x, y) = z s.t. y / z is equivalent to x. (not always exists) * relative("a/b", "a") = "b" * relative("c:/a/b", "c:") = "/a/b" * relative("c:/a/b", "c:/") = "c:/" * relative("c:a/b", "c:/") = NaP * relative(x, "") = x Eval-equivalence ---------------- Definition: x is eval-equivalent to y iff for all f, if g = f / x and h = f / y are both defined, and none involve symbolic link resolution, then g = h. The rest of the previous section is repeated except that eval-equivalence is used. * concatenation * "a/b" / "./../c" = "a/c" * "a" / "../../b" = "../b" * relative(x, y) * relative("a/b", "a/c") = "../b" * relative("c:a/b", "c:/") = NaP Note that on Windows, AFAIK but I may be mistaken, eval-equivalence and equivalence are the same, because ".." are resolved syntactically by Win32 API before any reparse points etc. kick-in. One can think of eval-parent ".." meaning 'go back' whereas the Unix parent ".." behave as a regular subdirectory. One may still wish to distinguish between equivalence and eval-equivalence on Windows. Summary ----------- Lots of questions are left intentionally un-answered. The above text establishes a framework, from here we can proceeded in different ways. For every concrete paths implementation we define * possible path elements * the effect of these elements on the filesystem nodes * path equivalence * a parsing function path() * a formatting function str() Possible variation: * Define that x is equivalent to y as before, with an added stipulation that the current "." elements are assumed to be regular. So that "." / "." = "./." rather than ".". * Split regular elements to filenames and directories. When parsing "abc/", path() will return a single directory element, when parsing "abc" it will return a single filename element. This is similar to the way URLs work. Usecase: "~/a.txt" / "b.txt" = "~/b.txt". Where Boost.Filesystem fails? ============================= Note I wasn't following its evolution in the last year or two, so some of this section might be outdated. Path is a sequence of what? --------------------------- Constructors, value_type, append(), all suggest that path is a sequences of chars/wchars. begin()/end()/iterator, on the other hand, suggest that path is a sequence of... paths? O.o Let's leave alone the fact that using the second as a definition doesn't work (it's like saying that string is a sequence of strings, though some languages do exactly this). The bigger problem is that path::iterator::value_type != path::value_type. What I naturally want to do when slicing paths, is to construct path(first, last) where first and last are path::iterators, but I cannot. Currently, at places where this is the intended operation, the documentations says "as if applying operator /= repeatedly". Which brings us to the following point. operator /= is broken --------------------- The path "c:a/b" decomposes, on Windows, to the sequence {"c:", "a", "b"}. But applying operator /= to this sequence repeatedly returns "c:/a/b". Now, according to the documentation, parent_path() of "c:a/b" should return "c:/a". But guess what, the last time I checked, it returned the desired "c:a"! Someone got confused... absolute(p,base) illogical -------------------------- The case when p.has_root_name() && !p.has_root_directory() makes no sense. I remember that it was once left unspecified. Don't know why it got defined now. Explanation: The analogy of what today's implementation does is absolute("England, Baker St. 4", "France, Paris, Passage Landrieu 8") = "England, Paris, Baker St. 4". For those who don't like analogies, absolute("c:a/b") shall return "c:/...current c drive directory.../a/b", not "c:/a/b" as it does today! More generally, when changing a higher-rank element, it makes no sense to leave the lower-rank elements which was specified relative to that higher-rank element! Note that: * For other cases, absolute() does the same thing as abstract path concatenation defined above. * This case is irrelevant for Unix. Thus I propose to scrap absolute() altogether. Bad names --------- * parent_path() does not return the parent directory! It should be named pop_back(). * absolute() isn't needed, system_complete() should be renamed to absolute(). * canonical() -- a better name would be resolve(), cause what it ultimately does is resolving symlinks? Naming it realpath() is also an option. equivalent() could be more useful... ------------------------------------ ...if it exposed an encapsulated system specific unique file "key". Then we could make an unordered_map of files by their unique filesystem keys. directory iterators ------------------- I would expect a directory_iterator to return elements rather than concatenating them with the path passed in the constructor. Similar to what 'ls' does. recursive_directory_iterators, on the other hand, can be modeled after a 'find .' and do what it does today. Think of implementing recursive_directory_iterators with the current public interface of directory_iterator... Path arithmetic and symlinks ============================ Path arithmetic, which includes concatenation, relative, is_absolute, etc., shall work purely syntactically. Generally, operations that work on paths shall either require that the whole path exists (or they are creating it), or ignore the filesystem completely. All or nothing. It is user's responsibility to resolve() the path if he really wants to. Rationale: by the definitions section, paths are abstract entities detached from the filesystem. Asking for relative(x,y) or x/y are legitimate questions on their own, even if the paths do not exist at the specific point in time and space. Even when x and y are relative to a yet-unknown base. Making them depend on the current filesystem state is error prone. Bottom line =========== Boost.Filesystem, as it is today, does not have a clear separation of syntax, semantics and filesystem access. Overall it looks quite messy, and when I tried to use it, it did not solve the real problems I had at hand. RFC and thanks for your attention, Yakov Galka