On 29 July 2015 at 10:06, Niall Douglas wrote:
On 28 Jul 2015 at 20:40, Paul Harris wrote:
I _disagree_ with the way dedup'd files are currently treated as a special file (as if they were a device or a character file or a fifo or a socket). Devices, sockets and fifos all need to be read in a special way, but dedup'd files should be read as if they were plain files.
I _disagree_ that a dedup file should be treated as if it were a symlink. This is because a dedup file does not point to another file (or inode) on the file system, which is a characteristic of a symlink or a hardlink. It is basically just a compressed file. We don't treat NTFS-compressed files differently from regular files, so why are we treating dedup'd files differently?
NTFS compressed files act exactly like normal files. Reparse point files do not and require significant additional processing to figure out what kind they are. That's the difference.
You only need to process symlink-reparse-point-files. Dedup reparse point files can be treated the same as a normal file.
From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch metadata about a file entry, it can zero cost learn if an entry is a reparse point by examining FileAttributes for the FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of reparse point file it is without opening the file and asking.
Windows' CreateFile() API is astonishingly slow. To require calling that, then an additional NtQueryDirectoryFile() to fetch the FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which is the fastest way I know of to fetch the reparse point tag code - would impose an enormous performance penalty for all file entries marked with FILE_ATTRIBUTE_REPARSE_POINT.
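To make the two-step cost concrete, here is a minimal portable sketch of the decision being argued about. The constant values are the ones published in the Windows SDK headers, redefined here so the snippet compiles anywhere. `EntryKind`, `needs_tag_lookup()` and `classify()` are hypothetical names made up for illustration; they are not AFIO or Boost.Filesystem API. The mapping of the dedup tag to a regular file is the behaviour Paul is asking for, not something either library is confirmed to do.

```cpp
#include <cstdint>

// Values copied from the Windows SDK (winnt.h), redefined for portability.
constexpr std::uint32_t kFileAttributeDirectory    = 0x00000010; // FILE_ATTRIBUTE_DIRECTORY
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kReparseTagSymlink         = 0xA000000C; // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kReparseTagDedup           = 0x80000013; // IO_REPARSE_TAG_DEDUP

// Hypothetical classification, for illustration only.
enum class EntryKind { regular, directory, symlink, other };

// Cheap test: the attributes come back with the directory enumeration,
// so this needs no extra system call.
constexpr bool needs_tag_lookup(std::uint32_t attributes) {
    return (attributes & kFileAttributeReparsePoint) != 0;
}

// 'tag' is only meaningful when needs_tag_lookup(attributes) is true, and
// obtaining it is the expensive part (open the entry, query, close).
constexpr EntryKind classify(std::uint32_t attributes, std::uint32_t tag) {
    const bool is_dir = (attributes & kFileAttributeDirectory) != 0;
    if (!needs_tag_lookup(attributes))
        return is_dir ? EntryKind::directory : EntryKind::regular;
    switch (tag) {
    case kReparseTagSymlink:
        return EntryKind::symlink;
    case kReparseTagDedup:
        // Assumption: a dedup'd entry reads like a plain file.
        return is_dir ? EntryKind::directory : EntryKind::regular;
    default:
        return EntryKind::other; // unrecognised reparse tag
    }
}
```

The point of splitting it this way is that only entries with the reparse flag set would pay for the extra open, which is the cost Niall is worried about and the one Paul considers acceptable.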
I have no comment on performance. I want things to work.
I appreciate you're saying the cost is worth it, but we're thinking all Boost users here, not just the small minority on Windows Server 2012 with dedup turned on.
You don't seem to understand that this affects ANY Windows client that talks to a Windows 2012 dedup-enabled server. Which, as of last month, has gone from zero to 5 different companies in my world. Seems that all the IT departments are upgrading after the end-of-financial-year. So, a Windows 7 user will be accessing dedup files.
    for (directory_iterator ...)
    {
        if (is_symlink(fn))      backup_link(fn);
        if (is_regular_file(fn)) backup_contents(fn);
        if (is_directory(fn))    ignore(fn);
        if (is_other(fn))        ignore(fn);
    }
Currently, this pseudocode would fail to back up any automatically dedup'd files (which are basically any file older than 3 days on some of my sites). It fails because a dedup'd file is currently an "other".
If you treat a dedup'd file as a symlink, only the "link" will be backed up. This link points to a magical place that is impossible to read other than simply reading "fn".
So how does this simple program back up the dedup'd file contents?
I appreciate the problem with saying something is a symlink, but trying to retrieve the target of that symlink has to error out because it's meaningless in the case of a dedup symlink.
Please stop calling it "dedup symlink". It is _not_ any kind of symlink. That is the point of misunderstanding, we are not on the same page.
What seems to me the best route forward is you do something like this:
    if (is_symlink(fn))
    {
        error_code ec;
        auto target = read_symlink(fn, ec);
        if (!ec) backup_link(fn);
    }
Because is_regular_file() and is_directory() use status(), they follow any symlink so you can safely fall through to those.
This is unacceptable, because I do not want to follow symlinks. That was specified in the example.

Let's be more specific about the example directory to back up. On Monday, it contains:

    FILE_A (a plain file)
    FILE_B (a symlink to FILE_A)
    FILE_C (a plain copy of FILE_A)

Backup should store this:

    FILE_A contents.
    FILE_B link.
    FILE_C contents.

On Tuesday, dedup/archival has run on the server. Directory now contains:

    FILE_A (a dedup file)
    FILE_B (a symlink to FILE_A)
    FILE_C (a dedup file)

Backup SHOULD store this:

    FILE_A contents.
    FILE_B link.
    FILE_C contents.

IF you treat dedup=symlink, then the example will instead store:

    FILE_A link.
    FILE_B link.
    FILE_C link.

(although I have no idea what "FILE_A link" will actually read)

If you follow symlinks, then backup stores the wrong thing:

    FILE_A contents.
    FILE_B contents (WRONG).
    FILE_C contents.

If you treat dedup files as regular files, then backup stores correctly:

    FILE_A contents.
    FILE_B link.
    FILE_C contents.

cheers,
Paul