Re: [boost] [filesystem] proposal: treat reparse files as regular files

29 Jul 2015

On 29 July 2015 at 10:06, Niall Douglas <s_sourceforge@nedprod.com> wrote:
...
On 28 Jul 2015 at 20:40, Paul Harris wrote:
...
I _disagree_ with the way dedup'd files are currently treated as a
special file (as if they were a device or a character file or a fifo or a
socket).  device/socket/fifos all need to be read in a special way, but
dedup'd files should be read as if they were a plain file.
I _disagree_ that a dedup file should be treated as if it were a
symlink.
This is because a dedup file does not point to another file (or inode) on
the file system, which is a characteristic of a symlink or a hardlink.
It is basically just a compressed file.  We don't treat NTFS-compressed
files differently from regular files, so why are we treating dedup'd files
differently?
NTFS compressed files act exactly like normal files. Reparse point
files do not and require significant additional processing to figure
out what kind they are. That's the difference.
You only need to process symlink-reparse-point-files.
Dedup reparse point files can be treated the same as a normal file.
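
To illustrate that position, here is a minimal sketch of how an entry
could be classified once its attributes and reparse tag are known. The
constant values are the ones published in the Windows SDK headers;
entry_kind and classify() are hypothetical names made up for this sketch,
not Boost.Filesystem or AFIO API, and mapping every other reparse tag to
"other" is an assumption. It says nothing about the cost of obtaining the
tag in the first place.

```cpp
#include <cstdint>

// Values as published in the Windows SDK headers (winnt.h).
constexpr std::uint32_t kFileAttributeDirectory    = 0x00000010; // FILE_ATTRIBUTE_DIRECTORY
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kReparseTagSymlink         = 0xA000000C; // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kReparseTagDedup           = 0x80000013; // IO_REPARSE_TAG_DEDUP

// Hypothetical result type for this sketch.
enum class entry_kind { regular, directory, symlink, other };

// Hypothetical helper: classify an entry from its attribute bits and its
// reparse tag. The tag is only consulted when the reparse flag is set.
inline entry_kind classify(std::uint32_t attributes, std::uint32_t reparse_tag)
{
    if (attributes & kFileAttributeReparsePoint) {
        if (reparse_tag == kReparseTagSymlink)
            return entry_kind::symlink;
        if (reparse_tag != kReparseTagDedup)
            return entry_kind::other;   // assumption: unknown tags stay "other"
        // Dedup tag: fall through and classify like an ordinary entry.
    }
    return (attributes & kFileAttributeDirectory) ? entry_kind::directory
                                                  : entry_kind::regular;
}
```

With this mapping, an entry carrying FILE_ATTRIBUTE_REPARSE_POINT and the
dedup tag is reported as regular, while a symlink tag is still reported as
a symlink.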
...
From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
metadata about a file entry, it can learn at zero cost whether an entry
is a reparse point by examining FileAttributes for the
FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
reparse point file it is without opening the file and asking.
Windows' CreateFile() API is astonishingly slow. To require calling
that, then an additional NtQueryDirectoryFile() to fetch the
FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
is the fastest way I know of to fetch the reparse point tag code -
would impose an enormous performance penalty for all file entries
marked with FILE_ATTRIBUTE_REPARSE_POINT.
I have no comment on performance.  I want things to work.
...
I appreciate you're saying the cost is worth it, but we're thinking
of all Boost users here, not just the small minority on Windows Server
2012 with dedup turned on.
You don't seem to understand that this affects ANY Windows client that talks
to a Windows 2012 dedup-enabled server.

Which, as of last month, has gone from zero to 5 different companies in
my world.  Seems that all the IT departments are upgrading after the
end-of-financial-year.

So, a Windows 7 user will be accessing dedup files.
...
...
for (directory_iterator ...)
{
   if (is_symlink(fn)) backup_link(fn);
   if (is_regular_file(fn)) backup_contents(fn);
   if (is_directory(fn)) ignore(fn);
   if (is_other(fn)) ignore(fn);
}
Currently, this pseudo code would fail to back up any automatically
dedup'd files (which are basically any file older than 3 days on some of
my sites).
It fails because a dedup'd file is currently an "other".
If you treat a dedup'd file as a symlink, only the "link" will be backed
up.
This link points to a magical place that is impossible to read other than
simply reading "fn".
So how does this simple program back up the dedup'd file contents?
I appreciate the problem with saying something is a symlink, but
trying to retrieve the target of that symlink has to error out
because it's meaningless in the case of a dedup symlink.
Please stop calling it "dedup symlink".  It is _not_ any kind of symlink.
That is the point of misunderstanding, we are not on the same page.
...
What seems to me the best route forward is you do something like
this:
if (is_symlink(fn))
{
  error_code ec;
  auto target=read_symlink(fn, ec);
  if(!ec)
    backup_link(fn);
}
Because is_regular_file() and is_directory() use status(), they
follow any symlink so you can safely fall through to those.
This is unacceptable, because I do not want to follow symlinks.
That was specified in the example.
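
The difference between following and not following symlinks can be
sketched as below. This uses C++17 std::filesystem rather than
Boost.Filesystem (an assumption made to keep the sketch self-contained);
backup_action() and backup_action_following() are hypothetical helpers,
not part of either library.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: what a backup that does NOT follow symlinks would do.
// symlink_status() reports on the entry itself, so a symlink stays a symlink.
std::string backup_action(const fs::path& p)
{
    std::error_code ec;
    const fs::file_status st = fs::symlink_status(p, ec);
    if (ec)                      return "ignore";
    if (fs::is_symlink(st))      return "link";
    if (fs::is_regular_file(st)) return "contents";
    return "ignore";
}

// Hypothetical helper: the same decision made with status(), which follows
// symlinks and therefore reports the type of the link's target.
std::string backup_action_following(const fs::path& p)
{
    std::error_code ec;
    const fs::file_status st = fs::status(p, ec);
    if (!ec && fs::is_regular_file(st)) return "contents";
    return "ignore";
}
```

For a symlink pointing at a plain file, the first helper yields "link" and
the second yields "contents", which is the behaviour the objection above is
about.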

Let's be more specific about the example directory to back up.

On Monday, it contains:
FILE_A (a plain file)
FILE_B (a symlink to FILE_A)
FILE_C (a plain copy of FILE_A)

Backup should store this:
FILE_A contents.  FILE_B link.   FILE_C contents.

On Tuesday, dedup/archival has run on the server. Directory now contains:
FILE_A (a dedup file)
FILE_B (a symlink to FILE_A)
FILE_C (a dedup file)

Backup SHOULD store this:
FILE_A contents.  FILE_B link.   FILE_C contents.

If you treat dedup files as symlinks, then the example will instead store:
FILE_A link.  FILE_B link.   FILE_C link.
(although I have no idea what "FILE_A link" will actually read)

If you follow symlinks, then backup stores the wrong thing:
FILE_A contents.  FILE_B contents (WRONG).   FILE_C contents.

If you treat dedup files as regular files, then backup stores correctly:
FILE_A contents.  FILE_B link.   FILE_C contents.
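
The three outcomes for the Tuesday directory above can be reproduced with a
small simulation. This is only a model of the argument, not real
filesystem code: kind, policy, entry, backup() and tuesday() are
hypothetical names invented for this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

enum class kind   { plain, symlink, dedup };
enum class policy { dedup_as_symlink, follow_symlinks, dedup_as_regular };

struct entry {
    std::string name;
    kind        k;
};

// Returns, for each entry name, whether the backup stores "contents" or "link".
std::map<std::string, std::string> backup(const std::vector<entry>& dir, policy p)
{
    std::map<std::string, std::string> stored;
    for (const entry& e : dir) {
        bool as_link = false;
        switch (p) {
        case policy::dedup_as_symlink:  // dedup files are reported as symlinks
            as_link = (e.k != kind::plain);
            break;
        case policy::follow_symlinks:   // every link is followed to its target
            as_link = false;
            break;
        case policy::dedup_as_regular:  // only real symlinks are stored as links
            as_link = (e.k == kind::symlink);
            break;
        }
        stored[e.name] = as_link ? "link" : "contents";
    }
    return stored;
}

// The directory as it looks on Tuesday, after dedup has run.
std::vector<entry> tuesday()
{
    return { {"FILE_A", kind::dedup}, {"FILE_B", kind::symlink}, {"FILE_C", kind::dedup} };
}
```

Only policy::dedup_as_regular gives FILE_A contents, FILE_B link, FILE_C
contents for the Tuesday directory.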

cheers,
Paul