On 3/14/19 2:29 PM, Florian Lindner via Boost wrote:
On 14.03.19 at 10:11, Andrey Semashev via Boost wrote:
I haven't had experience with Lustre, but I'm guessing it may be related. Did you try calling fsync between close and rename?
No, I was assuming that close() does this. I have modified the code to
{
    namespace fs = boost::filesystem;
    auto path = getFilename();
    auto tmp = fs::path(path + "~");
    fs::create_directories(tmp.parent_path());
    boost::iostreams::stream<boost::iostreams::file_descriptor_sink> ofs(tmp);
    ofs << info;
    ::fdatasync(ofs->handle());
    ofs.close();
    fs::rename(tmp, path);
}
Reproducing the bug is hard: so far, it has only appeared on really huge runs with more than 4000 processors.
close doesn't guarantee that written data or metadata has reached the media. In other words, other processes may not observe the file creation immediately after close. fdatasync only guarantees that for data but not metadata. fsync guarantees that for both, which is why I explicitly mentioned it and not fdatasync.

For distributed filesystems, "media" typically means something other than the physical storage on the nodes. Exactly what it means depends on the filesystem.

Normally, one would expect that the OS (and the filesystem driver in the OS, in particular) would guarantee that file creation is visible at least to the same process (thread) that created the file, even if that operation has not reached the media. It is possible that Lustre doesn't maintain this guarantee, and if so, I would think this is a filesystem problem, not a problem of the user's application or Boost.Filesystem. This may be a design choice (which would be wrong, IMHO) or even a configurable option with some tradeoff, not necessarily a programming bug.