
Category Archives: Research Ideas

Of file handles and implicit offsets (Follow-up)

My post about file handles elicited some interesting feedback, so I wanted to capture it because I thought it provided some insight.

Shared libraries were not a standard part of UNIX systems in the 1980s (though they had certainly been described in prior work), so one interesting observation here is that putting code in the kernel was a way of minimizing the duplication of common runtime code. The prevalence of shared libraries today, combined with our increased conviction that kernels need to be as small as we can make them, would certainly lead to a very different divide today.

Conversations with Malcolm are often insightful, so I wanted to capture this one here because it is definitely germane to the area that I’m exploring, particularly as I try to explain some of its underlying rationale – a combination of software archaeology and a pragmatic look at how to evolve things forward.

I always thought this was just a programming convenience, because it allows a simple program to have “read next” semantics, and half of the core UNIX utilities are stream parsers that want that semantic. While you’re in the area though, I’d ask “what’s the purpose of a current directory?” which is implemented (on Windows) as a process wide value. I’m guessing it also started (on UNIX) as a programming convenience, but being process-wide has meant that it disappeared as a convenience and reemerged as a headache. (DOS arguably had a different history since it was trying to run non-directory aware applications in the presence of directories.)

My response:

If it were just a convenience we could easily bury it in a library. I’m trying to do some hybrid file systems implementation work and it creates complications that seem unnecessary. But who really looks at this old cruft anyway?

Current directory is another good one. And the directory enumeration offset a third.

And his reply:

As you mentioned though, POSIX was trying to codify existing implementations, and presumably at some point the choice between kernel and library could have been made, but once it’s been made it’s hard to change. UNIX libraries always seemed strange to me in that in many cases they’re super-simple syscall wrappers (similar to NTDLL) but in some cases (file pattern matching) all the heavy lifting is there. Remember that in the beginning shared libraries weren’t a thing, so requiring functionality to be in a library meant duplicated code while the kernel provided a natural place to share code/functionality. This probably influenced a lot of choices.

Of file handles and implicit offsets

My current research direction (which is wandering a bit, as is common with research) has forced me to look at some of the vagaries of the POSIX interface.  One of these is the intriguing decision to maintain a piece of file-descriptor-specific state, the “file pointer” (note that Windows has an exact equivalent in the CurrentByteOffset field of the file object).

One thing to note about POSIX is that it was not designed from scratch.  Rather, it captured the state of UNIX systems in the 1980s and codified it.  Thus, rather than inventing this behavior, POSIX (or officially, IEEE Std. 1003.1-1988) codified a uniform interface acceptable to a variety of parties.  Like any standards document, it is a compromise that attempts to mollify a variety of different players.

Here is a version of the Linux in-kernel file structure (from the main linux repository as of this morning):

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	enum rw_hint		f_write_hint;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
	void			*f_security;
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Note the f_pos field.  This is the file pointer, and it allows calls like read and write to work without an explicit offset value.

Here’s the equivalent structure in Windows 10:

typedef struct _FILE_OBJECT {
    CSHORT Type;
    CSHORT Size;
    PDEVICE_OBJECT DeviceObject;
    PVPB Vpb;
    PVOID FsContext;
    PVOID FsContext2;
    PSECTION_OBJECT_POINTERS SectionObjectPointer;
    PVOID PrivateCacheMap;
    NTSTATUS FinalStatus;
    struct _FILE_OBJECT *RelatedFileObject;
    BOOLEAN LockOperation;
    BOOLEAN DeletePending;
    BOOLEAN ReadAccess;
    BOOLEAN WriteAccess;
    BOOLEAN DeleteAccess;
    BOOLEAN SharedRead;
    BOOLEAN SharedWrite;
    BOOLEAN SharedDelete;
    ULONG Flags;
    LARGE_INTEGER CurrentByteOffset;
    __volatile ULONG Waiters;
    __volatile ULONG Busy;
    PVOID LastLock;
    KEVENT Lock;
    KEVENT Event;
    __volatile PIO_COMPLETION_CONTEXT CompletionContext;
    KSPIN_LOCK IrpListLock;
    LIST_ENTRY IrpList;
    __volatile PVOID FileObjectExtension;
} FILE_OBJECT;
typedef struct _FILE_OBJECT *PFILE_OBJECT;

The equivalent field in this structure (taken from wdm.h in the Windows 10 WDK) is CurrentByteOffset.  I spent some time looking through the various fields, and my observation is that this is the only piece of implicit, user-visible, shared mutable state.

This actually doesn’t work well in multi-threaded environments (very common these days) if threads use the same file descriptor (file handle in Windows), since it doesn’t make any sense to arbitrarily interleave reads.  In those environments you use a different call – pread on POSIX systems; in Windows the offset is an explicit parameter of the native system call (NtReadFile, where it is optional).
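To make the difference concrete, here is a minimal sketch (the file name is just a placeholder): read consumes and advances the implicit, shared offset, while pread takes the offset as an explicit argument and leaves the shared state untouched.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* illustrative file name */
    if (fd < 0)
        return 1;

    char buf[16];

    /* read() uses (and advances) the implicit file offset (f_pos) */
    ssize_t n1 = read(fd, buf, sizeof(buf));

    /* pread() takes an explicit offset and does not touch f_pos, so
     * concurrent threads do not race on the shared position */
    ssize_t n2 = pread(fd, buf, sizeof(buf), 4096);

    printf("read: %zd bytes, pread: %zd bytes, offset now %lld\n",
           n1, n2, (long long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}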

This led me to ask the question: why is this here?  I haven’t found a definitive source, since this predates the original POSIX specification, but my theory is that it is the only way to properly implement sharing of the file descriptor.  When UNIX added the fork call, one of its characteristics was “inheritance of file descriptors”:

          The child inherits copies of the parent's set of open file
          descriptors.  Each file descriptor in the child refers to the same
          open file description (see open(2)) as the corresponding file
          descriptor in the parent.  This means that the two file
          descriptors share open file status flags, file offset, and signal-
          driven I/O attributes (see the description of F_SETOWN and
          F_SETSIG in fcntl(2)).

(Source: http://man7.org/linux/man-pages/man2/fork.2.html)

The status flags describe how the file was opened, so they aren’t changing (they are effectively immutable).  The addition of F_SETOWN and F_SETSIG is more recent, but that is explicitly mutable state (it can be changed programmatically).
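To make the shared offset concrete, here is a minimal sketch (the file name is illustrative): after fork, a read in the child advances the offset that the parent subsequently observes, because both descriptors refer to the same open file description.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* illustrative file name */
    if (fd < 0)
        return 1;

    char buf[100];

    if (fork() == 0) {
        /* Child: reads 100 bytes, advancing the shared offset */
        read(fd, buf, sizeof(buf));
        _exit(0);
    }

    wait(NULL);

    /* Parent: the offset has moved, because parent and child share the
     * same open file description (and hence the same f_pos) */
    printf("offset in parent: %lld\n", (long long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}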

Fork is not the only way that a file descriptor (or file handle) can be shared.  For example, it can be done using UNIX domain sockets on UNIX and Linux systems.  Windows provides a system call for doing something similar as well (the documented version is ZwDuplicateObject).
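For reference, here is a minimal sketch of the sending side of descriptor passing over a UNIX domain socket, using the SCM_RIGHTS control message; the connected socket and the descriptor to send are assumed to exist already, and the receiving side (recvmsg) is omitted.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd_to_send over the connected UNIX domain socket sock.  The
 * receiver gets its own descriptor referring to the same open file
 * description – and therefore the same file offset. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = '*';                     /* must send at least one byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}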

I’ve spent time thinking about this, and it seems that the reason to maintain this shared state is to ensure that two processes sharing the same file descriptor/handle see the same position value.  This then led me to ask: why is this useful?

I have been able to construct only a single scenario in which this is useful: appending to the end of a shared file.  Interleaving reads doesn’t make much sense.  Interleaving writes inside the existing boundaries of a file makes even less sense to me.  I can construct peculiar scenarios in which applications explicitly exploit this feature, but they seem artificial.

Writing to the end of a shared log file seems like it would make sense.  But if that’s my goal, it makes more sense to just use O_APPEND mode:

              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  The modification of the file offset and the write
              operation are performed as a single atomic step.

(Source: http://man7.org/linux/man-pages/man2/open.2.html)
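So for the shared-log case, a minimal sketch (the log path is illustrative) needs no shared offset at all:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Each writer opens with O_APPEND; the kernel positions every write
     * at end-of-file atomically, so no shared file pointer is needed. */
    int fd = open("/tmp/shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    const char msg[] = "another log record\n";
    write(fd, msg, strlen(msg));
    close(fd);
    return 0;
}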

Thus, this makes me wonder: could we just eliminate this piece of shared state?  I have a reason for asking this question though I will save discussing that for another time.

Preserving the correct behavior for most applications will require fixing things up in the library – we could eliminate read as a system call and provide a library implementation that calls pread.
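A minimal sketch of what such a library shim might look like (the names are hypothetical, and the per-descriptor offset table glosses over locking, dup, and descriptor-limit handling):

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical user-space offset table, one entry per descriptor.
 * A real implementation would need locking and dup()/close() tracking. */
static off_t my_offsets[1024];

/* Library replacement for read(): emulate the implicit file pointer in
 * user space and issue only explicit-offset pread() system calls. */
ssize_t my_read(int fd, void *buf, size_t count)
{
    ssize_t n = pread(fd, buf, count, my_offsets[fd]);
    if (n > 0)
        my_offsets[fd] += n;
    return n;
}

Note that this deliberately changes the semantics – the offset would no longer be shared across fork or descriptor passing – which is exactly the behavior I want to observe breaking (or not).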

I’m considering doing that and seeing what breaks.  It is more difficult to do that in Windows than in Linux, so I’m considering starting with Linux.

For those problems that are nails, I have a sledgehammer

I’ve been involved in a project that requires I learn more about blockchain than I ever really wanted to know.

I had a basic appreciation of it from reading the Nakamoto paper many years ago.  It provides an insightful solution to a challenging problem – the problem of distributed consensus in an untrusted network.  There is an agreed mechanism (the protocol) and an incentive to participate (the fee).

Why is an untrusted network the problem?  Because this came from the world of digital cash, and the concern in that environment was that someone could “double spend” money.  The blockchain model makes that effectively impossible.  It does so in an interesting way – by combining a consensus protocol (getting multiple systems to agree on a given outcome) along with an incentive to behave (participants get paid to behave).  It’s an intriguing way of reaching common consensus.

But it has limitations.  Like any consensus protocol, it takes time to send messages back and forth and reach agreement.  Further, blockchain requires that nodes attempt to solve a cryptographically hard problem.  The node that is successful gets the reward.  The algorithm is designed so that it takes about 10 minutes for someone to reach a valid solution.  These solutions are easy to verify, so once someone has figured out a solution – which indicates real work – they present their answer and it is echoed throughout the network.  By chaining the answer for the current block into the prior block, it essentially creates a chain that is not realistically mutable (i.e., it cannot easily be changed).
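As a toy sketch of the proof-of-work idea (this is not how real blockchains serialize blocks, and it assumes linking against OpenSSL’s libcrypto): hash the previous block’s hash, the payload, and a nonce until the digest has enough leading zero bytes; anyone can verify the result with a single hash.

#include <openssl/sha.h>
#include <stdint.h>
#include <stdio.h>

/* Toy proof of work: find a nonce so that SHA-256(prev_hash || data || nonce)
 * begins with `difficulty` zero bytes.  Finding the nonce is expensive;
 * verifying a claimed nonce costs one hash. */
static uint64_t mine(const char *prev_hash_hex, const char *data, int difficulty)
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    char input[512];

    for (uint64_t nonce = 0; ; nonce++) {
        int len = snprintf(input, sizeof(input), "%s|%s|%llu",
                           prev_hash_hex, data, (unsigned long long)nonce);
        if (len < 0 || (size_t)len >= sizeof(input))
            return 0;                      /* inputs too large for this toy */

        SHA256((const unsigned char *)input, (size_t)len, digest);

        int ok = 1;
        for (int i = 0; i < difficulty; i++)
            if (digest[i] != 0) { ok = 0; break; }
        if (ok)
            return nonce;                  /* easy for anyone else to verify */
    }
}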

There are more technical details (Merkle Trees and cryptographic hashes and so forth) but the final solution is a “public ledger” that shows who owns what in the world.

Here’s the rub: this is interesting technology, but it is not the solution to every problem.  As is so often the case in computing, though, we pick “the shiny new tool” and insist it must be the right tool to choose for our problem at hand.  The challenge then for those of us who have been through this before is to actually analyze a problem and figure out a good solution.  Of course, sometimes we’re called upon to utilize a particular tool, like blockchain, regardless of whether or not it is the right tool for the job.  When that happens there’s the unenviable task of trying to explain why it is not the right solution.  Doing so is tough, as it is often not what a customer wants to hear.

Where does search functionality live?

In mulling over the depths of semantic knowledge and file systems, it occurs to me that one thing which differs between the world of Unix/Linux file systems and Windows file systems is where directory content search is done: in Unix/Linux environments it is done in the shell (or application), while in Windows it is a service of the file system.

I admit, when I first started working on Windows file systems, I thought this was an annoying decision, since it involved quite a bit of work inside the file system related to string handling and matching.  Even as I write this, I still think that it is a lot of work that really doesn’t belong in the kernel.  Having said that, this distinction is one reason why a Unix/Linux file systems developer might not think of adding semantic support to a file system as something logical – after all, the purpose of the file system is to manage storage of files and associated meta-data, not to find things.  Having experience in the Windows file systems space, I can understand why it might not be a great idea to do this in kernel mode.  After all, C is not a language well-known for its strength and safety in handling strings, and the kernel is not an environment well-known for its tolerance of C runtime errors.
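To be concrete about the Unix/Linux side of that split, here is a minimal sketch of how a shell or application does the matching itself in user space with readdir and fnmatch; on Windows, the wildcard pattern is instead handed to the file system as part of the directory enumeration request.

#include <dirent.h>
#include <fnmatch.h>
#include <stdio.h>

/* List entries of `path` matching `pattern` (e.g. "*.c").  The kernel only
 * enumerates names; the wildcard matching happens here, in user space. */
static void list_matching(const char *path, const char *pattern)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return;

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (fnmatch(pattern, entry->d_name, 0) == 0)
            printf("%s\n", entry->d_name);
    }
    closedir(dir);
}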

But I digress.  The point is this: when we begin to embed semantic knowledge inside the file system, we exploit a model in which the file system is involved in the search function and this would seem to be anathema to normal file systems behavior.   This is a good challenge: does this need to be done in the file system?  If not, perhaps there is instead an abstraction that the file system itself must be able to provide.

Each time I tackle this problem, my general sense is that the model I want is a case in which each file has a set of attributes.  Ideally, what I want is some way to quickly and efficiently find things based upon those attributes.  After all, how hard could this be?

One benefit of the current search paradigm, with which users have been trained, is that it does not provide reproducible search results.  Thus, nobody will really be surprised if they repeat a search today and get back different results than they got back yesterday.

Hence, I keep coming back to this paradigm.  It also gives me the sense that there are different characteristics of such a system – there are persistent attributes, like the timestamp, and ephemeral attributes, like semantic tags.

Plenty to think about, but this idea of where to draw the line of search is an important one.  In either case, though, I need to determine efficient ways of rapidly finding files based upon these attributes.

What a long strange trip it’s been!

This is my first post in over a month.  The past five weeks have been eventful – a period upon which I suspect I will look back in the future and remark about how so much change can be packed into such a small amount of time.

My former employer and I parted ways on November 15.  While not entirely a surprise, the manner in which it was carried out was a surprise.  The week that followed included a trip to the company HQ (in the eastern US), at my own expense, of course, and ultimately finding out that not only had I lost my positions with the company and been locked out of my own offices here, but they had also seized control of my shares in the company.

I took stock of my situation and decided that I needed to deal with wrapping things up and planning for the future.

Thus, I briefed my lawyers to the extent possible.  I decided that I would sign up for three classes in the Spring 2017 session for my MSCS program which would put me in a position to graduate at the end of Summer 2017.  This seemed like a terrific idea given that I wouldn’t be working for some period of time!

I also decided that I would push ahead with the PhD applications. On December 4, 2016 I submitted three applications: two to UBC (CS and ECE) and one to Georgia Tech (CS).  I chose ECE at UBC because my professor at Georgia Tech had strongly recommended someone in that program, though I wasn’t convinced they’d be willing to waive the usual thesis requirements (though you’d think that having published two books in the field might count for something.)

The surprise came the next day.  That afternoon I’d received a letter forwarded to me by my legal counsel.  It seemed filled with invective and really had upset me because it painted a picture of me that certainly didn’t square with my vision of myself.  To console myself, I convinced my spouse to meet up for an after-work drink.  As we sat down I looked at my phone and saw an e-mail from a UBC CS professor – in fact, the very person I’d originally identified as being a strong prospective match back in 2013 when I had previously applied.

Imagine my surprise as I read the letter and found he suggested meeting and discussing my “(interesting) application”.  My mood changed – I responded quickly and said I was available.  A few exchanges later, we’d agreed to meet for coffee nearby two days afterwards.  That initial meeting was short (45 minutes) and while it seemed positive, he’d been clear that I’d have to wait until the end of the PhD recruiting process.  He also suggested that I sign up to take the class he is teaching in January, noted there were forms that needed to be completed, and left it to me to chase that to ground.  The final suggestion was that I meet with one of his current graduate students.

I went home, found the necessary form, completed it and sent it along to him.  Two hours later I received a signed copy back from him.  He followed up with an e-mail to me and one of his students suggesting we meet soon – like the next day.  So we met the next day, and chatted.  It seemed to go reasonably well.  I followed up, and the next day (Friday) we exchanged more e-mail and the CS professor suggested another meeting the next Tuesday.

It was at that meeting that it became clear I was “part of the team” (albeit provisionally, for sure).  He told me that he’d hire me as a research assistant until fall 2017 but he was still working out the details with the staff there.  He discussed what they were doing and at the end of the meeting suggested meeting on Friday.  Later, when I looked at the follow-up e-mail invitation I noticed that it was for every Friday, not just the next Friday.

At our first Friday meeting he spent a bit of time going over logistics.  Once again he reiterated that he wouldn’t be able to commit to my acceptance into the program until the “PhD recruiting process is done,” which admittedly was a mixed signal. Today (Friday December 23) was the second standing meeting.  Between the two meetings I spent considerable time running around trying to deal with various details.  Ultimately, nothing was quite resolved because we ran smack into Christmas break, but the ball is certainly rolling.

One challenge was taking his class.  While he signed off on it, and the CS department signed off on it, once it reached the “Enrolment Services” team, they advised CS that I was not eligible to register for the class because I wasn’t an “unclassified student”.  It turns out the deadline for applying for unclassified student status was November 15.  I did find some irony in this, as this was my termination date.

One thing I recall from my own time working at Stanford many years ago is that Universities have rules but they also generally have a mechanism for overriding the rules – it’s mostly a matter of making a persuasive case and convincing someone with authority to grant an exception.  I was successful at doing so.  Further, by the time I received the exception I’d already started the “time consuming” part of the process – namely, getting them original transcripts directly from my undergraduate institution showing that I was granted the degree I claimed.  So, when everyone returns after the first of the year, I should be able to get that situation resolved quickly, register for the class and have student status.

Similarly, they’re also processing the paperwork for my appointment as a research assistant.  Ironic how I’ve come full circle almost 30 years later.  I was also amused to see that my old boss has endowed a chair at UBC.  It’s definitely a small world.

So now I can start focusing on doing research.  In the course of just over a month, I’ve gone from being employed, to terminated, to being a PhD applicant, to getting back into research.  While I’m not accepted into the program yet, my perspective is that it will happen unless I screw up.  Naturally, my goal is to demonstrate my worth.

Exciting times!