Yearly Archives: 2019


Collaboration versus Cheating

Collaboration Versus Cheating: Reducing Code Plagiarism in an Online MS Computer Science Program Tony Mason, Ada Gavrilovska, and David A. Joyner, Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp. 1004-1010, Minneapolis, MN, February 27 – March 2, 2019.

This is a rather different paper than my usual, but I thought I would write about it since I was the one who drove this particular work. I first started as a teaching assistant for the Graduate Introduction to Operating Systems in Georgia Tech's Online Master of Science in Computer Science program. One of my motivations at the time was to learn more about how online education could be realized. The program's goals were to achieve scale (which it has) and to maintain a quality level that was non-inferior to the on-campus MS program offering. I achieved many of my goals here, as I learned how to advance the pedagogical goals of the class (by improving feedback to students and helping them understand the topics) while also finding ways to ensure academic honesty.


Figure 1: Detected plagiarism rates over time

I started working with the class in Fall 2015; at that point we had no proactive plagiarism detection in place, though we were collecting similarity data. It wasn't until 2016 that I started looking at that data and began to realize we had a noticeable plagiarism rate. We tried a number of techniques to reduce it. We explained what plagiarism was, and we offered "amnesties" if people who had cheated would come forward and admit it. Neither worked. The amnesty actually led to cases where students we did not think had plagiarized admitted to things we considered fair use (e.g., using existing code to set up socket connections).

Figure 1 shows the progress over time, and clearly delineates when our intervention strategies were successful. What it does not capture is the qualitative difference. By Spring 2017, I was observing students wholesale submitting other students' projects and claiming them as their own! It was that semester that I prioritized "early intervention," in which I quickly identified students suspected of cheating and confronted them. The plagiarism rate for subsequent projects dropped precipitously.

The downside to this approach is that it took a tremendous amount of work, which makes it impractical: it does not scale. Being effective is great, but scalability was also a requirement if it was going to be useful. In the summer of 2017 we introduced one new mechanism to the class: a mandatory quiz that went over the policy and asked students whether they understood the course policies defining collaboration and cheating, why this was important, and what the potential penalties were.

The results surprised me: the plagiarism rate plummeted. I was cautiously optimistic, though, because the Summer 2017 class was the smallest we'd ever had, so I suspected it might simply be an anomaly. But this encouraged me to make the evaluation more rigorous. In the Fall of 2017 I spent time automating the process of doing code comparisons, not only against other students in the same class but also against all prior submissions. My motivation for doing this work was to validate what I thought I was seeing.
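For the curious, the sketch below shows the rough shape of that automation, though not its details: a driver that hands every submission for a project, current and prior, to MOSS in one batch. It assumes the standard MOSS submission client is installed on the PATH as moss, and the directory layout, semester list, and project names are purely illustrative rather than the course's actual tooling.

    #!/usr/bin/env python3
    """Sketch: compare current submissions against all prior semesters with MOSS.

    Assumes the standard MOSS submission client is on PATH as `moss` and that
    submissions are laid out as <semester>/<student>/<project>/*.c. All names
    here are illustrative, not the course's actual tooling.
    """
    import glob
    import subprocess
    from pathlib import Path

    CURRENT = "fall2017"
    PRIOR = ["fall2015", "spring2016", "summer2016", "fall2016",
             "spring2017", "summer2017"]

    def submission_files(semester, project):
        """All C source files for a given semester/project."""
        return sorted(glob.glob(f"{semester}/*/{project}/*.c"))

    def run_moss(project):
        files = submission_files(CURRENT, project)
        for prior in PRIOR:
            files += submission_files(prior, project)
        # -l c: language; -d: group files by directory (one per student);
        # -b: instructor-provided starter code that should be ignored.
        cmd = ["moss", "-l", "c", "-d", "-b", f"starter/{project}.c"] + files
        # The client prints a results URL on stdout; save it for later review.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        Path("reports").mkdir(exist_ok=True)
        Path(f"reports/{project}.url").write_text(result.stdout)

    if __name__ == "__main__":
        for project in ["project1", "project2", "project3"]:
            run_moss(project)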

Thus I performed a retrospective analysis of prior semesters in which the course was offered. I used that data to write a paper for Learning at Scale 2018, but that paper was rejected. I used the feedback to revise my draft and submit an updated version to ICER 2018, but that was also rejected. I once again dug deep, revised the paper, and submitted it to SIGCSE 2019, where it was accepted (and, astoundingly, all the reviews were accepts!). I revised the final paper to address the concerns of some reviewers (the weak accepts), and that is what was included at SIGCSE.

This doesn’t move my research forward per-se, but it helped me understand how incredibly important data visualization is – like Figure 1. I also learned how important it is to construct reproducible data analysis flows. The scripts I wrote helped me build a suite of tools for improving reproducibility. The research I did helped me better understand why the primary automated tool we use works (MOSS) and its limitations. When I was done I realized that I could easily continue exploring this research area, but since it isn’t core to my focus, I’ll leave it to others to push the frontiers.

Still, it was an interesting project, and it helped confirm my interest in doing research.

I have a public repository on GitHub with further information beyond what is in my paper. It includes a copy of the slides from my presentation at SIGCSE in March (which I think drew around 60 people, far more than I'd expected given that it was the last half-day of the conference!).



QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language

QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language, Sasha Ames, Maya B. Gokhale, and Carlos Maltzahn, International Journal of Parallel, Emergent and Distributed Systems, Volume 28, Number 2, pp. 159-183, 2013.

This paper came to my attention via feedback from an anonymous reviewer, who observed that our idea of constructing a graph file system had "already been done." It never ceases to amaze me that, despite how much time I have spent combing the literature, there always seem to be things I miss. In this particular case, I have to agree with the reviewer that the basic idea we proposed really had been done before, though the design space has by no means been exhausted. This paper will actually save me considerable time, because up to this point it has been a challenge even to explain why this kind of file system is useful.

Indeed, my read of this paper suggests that these authors struggled with similar objections, because they spend considerable time justifying the need for their work: the introduction is fairly long because it explains the underlying problem, and the prior work section also goes to great lengths to explain why prior work is inadequate to the job.

The authors summarize their approach as follows: "In this paper, we discuss an exploration of our approach to the problem: the use of a graph data model for representing file system user-defined metadata and a query language for retrieval. The purpose of this approach is to provide management of user-defined file metadata along with data under a single file system interface, delivering a common service across applications. Applications would be able to offload their metadata management needs to the service, alleviating the need for their own solution. This arrangement would benefit applications by reducing their code complexity, by virtue of not having their own custom metadata management components. A second benefit is improved opportunities for interoperability among separate applications."

The model we have been discussing, and trying to present, is a richer metadata model that captures not only attributes of files but also relationships across files. Like these authors, we reached the conclusion that a graph is likely a better representational model for the data. It encompasses the hierarchical model that is a fundamental part of POSIX, while at the same time providing a robust platform on which to build additional functionality. Before I start explaining that, though, I should go through this paper, because it has valid results that I can use moving forward.

In Figure 1 from the paper, the authors describe the type of graph they are using: it has vertices (files) and edges (parent/child), with labels on the edges (attributes). I am not convinced this is the right graph model, but I will save that conversation for a future blog post.

In Figure 2, the authors delve into the structure of their graph in greater detail, as they compare their model to that of Resource Description Framework (RDF) triples that are used in several graph processing systems. Here we see a better description of their format (which is actually closer to what I’ll describe for my own work): “Our data model for file system metadata is a directed graph with attributes on nodes and edges, shown in Figure 1. Nodes in the graph can represent files, and this allows the system to manage relationships among files. We call our directed edges links, connecting parent and child nodes.”
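To make that model concrete, here is a minimal sketch (in Python, with names of my own choosing; nothing here is QMDS's actual API) of a directed graph in which nodes stand for files, links connect parent and child nodes, and both nodes and links carry free-form attributes.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """A file or file-like object; attributes are free-form key/value pairs."""
        name: str
        attributes: dict = field(default_factory=dict)
        links: list = field(default_factory=list)  # outgoing (parent -> child) links

    @dataclass
    class Link:
        """A directed parent -> child edge that itself carries attributes."""
        parent: Node
        child: Node
        attributes: dict = field(default_factory=dict)

    def link(parent, child, **attrs):
        edge = Link(parent, child, dict(attrs))
        parent.links.append(edge)
        return edge

    # A directory-like node with one child, plus a cross-file relationship that
    # a plain hierarchy cannot express: the e-mail the file arrived in.
    home = Node("home")
    report = Node("report.pdf", {"project": "budget"})
    message = Node("msg-1234.eml")
    link(home, report, kind="contains")
    link(message, report, kind="attachment-of")

The point of the example is that the POSIX hierarchy becomes just one kind of edge among many, which is exactly the property I want for capturing relationships across files.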

They note that applications do not explicitly define schemas (something else I need to discuss in a future blog post), nor does their system require that classes be defined. The authors argue this provides greater flexibility, and indeed it does not force an application to be locked into a specific model.

"A heterogeneous approach to managing metadata gives all applications the same tools to manage relationships." It seems to me that this is one of the most compelling reasons why this is a systems problem and not an application problem. If we insist that applications implement this, most will not, and those that do will have no mechanism for interacting across applications. If files were truly isolated from one another, that would be fine, but in the real world files do have relationships with other objects, whether other files or external references (e.g., the "get me the e-mail from whence this file originated" example I described recently). This helps with the motivation, an area where, as I mentioned before, I have met resistance as well.

Figure 3 shows how the authors optimized their file system's metadata efficiency. This provides some interesting insight into the cases they expect to be common. I also found their emphasis on navigation useful, particularly given the recent discussion of it.

The choice of optimization model certainly seems to be an important one, given that we can't optimize for everything; if we optimize for the wrong things, we end up with something that looks much like brute-force search, which I can't imagine will perform well.

In Figure 4 the authors turn their attention to their implementation model. They use the FUSE file system interface to aid their implementation. This is interesting because one of the areas I've been exploring (sigh, yet another topic to discuss in greater detail) is how to make the FUSE interface easier to extend, precisely so that enhanced interfaces like this can be explored.

One downside to this approach is that it focuses on existing mechanisms for finding and accessing files, without providing a useful mechanism for exploiting enhanced search. Admittedly, the extended attribute interface does provide some way of achieving this. Still, it is a useful paradigm for exploring how such a file system will work with existing applications, certainly an important aspect of constructing any file system that one expects will be useful.
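To illustrate what leaning on the extended attribute interface can look like from an application's perspective, here is a toy sketch using Linux user xattrs; the tag names and the brute-force scan are mine, and a service like QMDS would answer the same question from an index rather than by walking the tree.

    import os
    from pathlib import Path

    def tag(path, key, value):
        """Attach user-defined metadata to a file (Linux user.* xattr namespace)."""
        os.setxattr(path, f"user.{key}", value.encode())

    def find_by_tag(root, key, value):
        """Brute-force 'query': walk the tree and match on the attribute.

        A metadata service would answer this from an index instead of scanning
        every file, which is precisely the gap QMDS is trying to fill.
        """
        matches = []
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                if os.getxattr(path, f"user.{key}").decode() == value:
                    matches.append(str(path))
            except OSError:  # attribute not present, or xattrs unsupported here
                continue
        return matches

    # Example usage:
    #   tag("report.pdf", "project", "budget")
    #   print(find_by_tag(".", "project", "budget"))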

Figure 5 was quite interesting to me, because it addresses one of the concerns I've seen with prior work based on relational databases (namely, that they are often slow). I suspect that these results, which show good times for QMDS relative to the systems they evaluate against, relate to the optimization model for the queries they are executing.

One thing this doesn't evaluate is how QMDS performs relative to other FUSE-based file systems. The queries they do execute on QMDS seem to be targeted search queries, not necessarily well correlated with actual usage as a file system.

There is no follow-up to this work, unfortunately, which makes it difficult to understand the general usefulness of QMDS. The upside is that it leaves considerable room for future work. It does provide a strong case for exploring this approach more thoroughly, and I have already suggested that better evaluation seems justified under the circumstances.

I’ll discuss more of these issues as I turn my attention to describing my own work in future posts.

The Ubiquitous Digital File: A Review of File Management Research

The Ubiquitous Digital File: A Review of File Management Research, Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This is a survey paper, capturing decades of research in the area of "File Management." That breadth is reflected in the paper's exhaustive bibliography, which runs roughly 7.5 of the paper's 32 pages, or almost 25% of the full paper. Since I have spent a considerable amount of time digesting much of the systems-focused research, as well as some of the Human-Computer Interaction (HCI) focused research in this area, I found the paper particularly insightful, both for categorizing the literature and for identifying useful research questions, some of which I find especially interesting.

Frameworks

One of the observations that I found interesting was the authors' point that "[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it." As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as a valid lens, and they identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They also explore various ways of evaluating file management, but my sense from reading the paper is that the field is complex and not well understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or, likely, both!)

Systems

Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system-level optimizations is to understand the usage patterns of applications. Some of the cases they observed resonated with me, for example, "from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email". I liked this because I have used the reverse process, following back from a document to the e-mail from which it originated, as a good use case for considering the design of a new file system.

They point out that their work is relevant to "computer science" (and particularly the branch in which I work): "… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have examined only files stored on servers and software development machines."

Thus, two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems has focused on specific workloads that are not likely representative of what is useful for file management (and, correspondingly, for users of file management).

One observation the authors make surprised me: "A preference for navigating to files is much more common than a preference for searching, even among users who prefer to search rather than navigate folders when retrieving their emails." What this suggests to me is that trying to shift people to a search-based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation that are more flexible than the current hierarchical model. The authors do point out that augmented search mechanisms still likely have a place. Another potential model is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.
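To make "converting navigation into search queries" slightly more concrete, here is a purely hypothetical sketch in which each path component is treated as a key=value filter over a small metadata index; the attribute names and the index itself are invented for illustration.

    # Each "directory level" in a navigation path becomes a key=value filter over
    # a metadata index, so browsing to project=thesis/type=figure is really a
    # conjunctive query rather than a walk down a fixed hierarchy.

    def navigate(index, path):
        """Interpret a path like 'project=thesis/type=figure' as a query."""
        results = index
        for component in filter(None, path.split("/")):
            key, _, value = component.partition("=")
            results = [entry for entry in results if entry.get(key) == value]
        return results

    files = [
        {"name": "fig1.png", "project": "thesis", "type": "figure"},
        {"name": "notes.txt", "project": "thesis", "type": "text"},
        {"name": "fig2.png", "project": "grant", "type": "figure"},
    ]

    print(navigate(files, "project=thesis/type=figure"))
    # -> [{'name': 'fig1.png', 'project': 'thesis', 'type': 'figure'}]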

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.


Eurosys 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, since I added three weeks of vacation to the end of it (though the first week had me spending most of my time in a hotel room working on a paper submission).

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback on both the paper I submitted and my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am taking.




I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my file systems idea(s) in more detail as I move forward with this project.


Reboot time

It's been almost a year since I posted anything substantive. It is so easy to just focus on other things, which is what I've been doing. In the past year I've continued to explore some interesting areas related to file systems. For example, I've been looking at persistent memory, which acts somewhat like storage (because it is persistent) and somewhat like DRAM (because it is byte-addressed). The findings have been interesting: surprising in some cases, close to predicted in others. We're still doing more work in this area, and I'm hoping to submit two papers later this year.

But the other big project is a file systems project. I’ve decided to use Windows as my platform of choice, both because I’m quite comfortable with Windows and because I think it’s the best choice for this project. Since it has been a goal of mine to do more writing in this area, I thought it would be great to use my blog to capture how I go about writing this file system for Windows with the (probably unrealistic) hope that I can eventually turn it into an online guide. I also plan on using a public repository for my project so that other people can see what I end up doing. I think I’ll turn that into a separate blog post, so I can talk more about that project.

Hopefully, by using this as a mechanism for describing my forward progress I can continue to post new information and content. That’s the theory at least.

If you are interested in persistent memory, two good recent papers are:

Basic Performance Measurements of the Intel Optane DC Persistent Memory Module

Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory