Monday, July 18, 2016

ReArchitecting the AutoGrader

So on Friday I followed through with my plans to get the rest of the FeatureGrader to expose errors in the students’ code to students, rather than just having it respond with “your code timed out or had an error”, and I think I was largely successful.

At least I got those few extra code changes deployed into production, and my manual tests through the edX interface showed that my test examples would display full errors for RSpec failures, migration failures, and Rails failures. Of course I’m blogging before I’ve reviewed how things fared over the weekend, but it feels like a step in the right direction. Even if the students can’t understand the errors themselves, they can copy and paste the output, and perhaps a TA then has an increased chance of helping them.

I also wrapped my spike in tests like:

  Scenario: student submits a HW4 with migration_error
    Given I set up a test that requires internet connection
    Given an XQueue that has submission "hw4_migration_error.json" in queue
    And has been setup with the config file "conf.yml"
    Then I should receive a grade of "0" for my assignment
    And results should include "SQLException: duplicate column name: director: ALTER TABLE"

to check that the errors will actually be displayed even as we mutate the code. I have a selection of similar scenarios which feel like they are crying out to be DRYed out with a scenario template. Similarly, with these tests in place I wonder if I can’t improve some of the underlying grading code; maybe we can re-throw the custom TestFailedError errors that look like they might have been intended for communicating submitted-code errors back up to the grader. I found myself spending time I could have spent on further refactoring reaching out to individual students on the forums and in GitHub instead, adding details about where the grader had been getting stuck for them, and encouraging them to re-submit since the grader had changed and they should now be able to see more details.
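For instance, those repetitive scenarios could fold into a Cucumber scenario outline along these lines (the second example row is hypothetical, included just to show the shape):

```gherkin
Scenario Outline: student submits a HW4 with an error
  Given I set up a test that requires internet connection
  Given an XQueue that has submission "<submission>" in queue
  And has been setup with the config file "conf.yml"
  Then I should receive a grade of "0" for my assignment
  And results should include "<error>"

  Examples:
    | submission               | error                                                       |
    | hw4_migration_error.json | SQLException: duplicate column name: director: ALTER TABLE |
    | hw4_rails_error.json     | NameError: uninitialized constant                           |
```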

I just sneaked a peek at the GitHub comment thread, and while there are some new issues that could distract me from completing this blog, at the very least I can see some students deriving value from the new level of grader feedback. So grader refactoring? I continue to feel negative about that task. The nested sandboxes of the feature grader … the fear is that refactoring could open new cans of worms, and it just feels like we miss a huge chance by not having students submit their solutions via pull request.

So how would a PR-based grader work? Well, reflecting on the git-immersion grader that we developed for the AV102 Managing Distributed Teams course, we could have students submit their GitHub usernames and have the grader grab details from GitHub. We can get the list of comments on a PR, so if we had CodeClimate, CI, etc. set up on a repo and had students submit their solutions as pull requests, we could pull in the relevant data using a combination of the repo name and the student’s GitHub username.

Making pull-requests would require students to fork rather than clone repos as they were originally encouraged to do. Switching back to that should not be a big deal. I was keen to remove forking since it didn’t really contribute to the experience of the assignment and was just an additional hoop to jump through. However if submission is by PR then we want students to understand forking and pulling; and of course that’s a valuable real world skill.

This means the solutions to the assignments would exist in even greater numbers in GitHub repos, but plenty exist there already, so not much change there. What we would have, though, is students submitting assignments through a process that’s worth learning, rather than an idiosyncratic one specific to edX and our custom autograders.

With a CI system like Travis or Semaphore we can run custom scripts to achieve the necessary mutation grading and so forth, although setting that up might be a little involved. The most critical step, however, is some mechanism for checking that the students are making git commits step by step. Particularly since the solutions will be available in even greater numbers, we need to ensure that students are not just copying a complete solution verbatim and submitting it in a single git commit. I am less concerned about students’ ability to master an individual problem completely independently, and more concerned with their ability to follow a git process where they write small pieces of code step by step (googling when they get stuck) and commit each to git.

So for example, in the Ruby-Intro assignment I imagine a step that checks that each individual method solution was submitted in a separate commit, and that each commit comes from the student in question. Pairing is a concern there, but perhaps we can get the students set up so that a pairing session credits one student as author and the other as committer, so that both are credited.

But basically we’d be checking that the first sum(arr) method was written and submitted in one commit, and then that max_2_sum(arr) was solved in a separate commit, and that the student in question was either the committer or the author on the assignment. In addition we would check that the commits were suitably spaced out in time, and of a recent date. The nature of the assignment changes here from being mainly focused on “can you solve this programming problem?”, to “can you solve this code versioning issue?”. And having the entire architecture based around industry standard CI might allow us to reliably change out the problems more frequently; something that feels challenging with the current grader architecture. The current grader architecture is set up to allow the publication of new assignments, but the process of doing so is understood by few. Maybe better documentation is the key there, although I think if there is a set of well tested assignments, then the temptation for many instructors and maintainers is just to use the existing tried and tested problems and focus their attention on other logistical aspects of a course.

Using existing CI systems effectively removes a large portion of the existing grader architecture, i.e. the complex sandboxing and forking of processes. That removes a critical maintenance burden, since the infrastructure is provided reliably and for free by the many available CI services (Travis, Semaphore, CodeShip, etc.). Students would also start to experience industry-standard tools that will help them pass interviews and land jobs. The most serious criticism of the idea is that students won’t be trying to solve the problems themselves; but google any aspect of our assignments and you’ll find links like this. The danger of an arms race to keep solutions secret is that we burn all our resources on that, while preventing students from learning by reviewing different solutions to the problem.

I’m sure I’m highly biased, but it feels to me that having students submit a video of themselves pairing on the problem, along with a technical check to ensure they’ve submitted the right sorts of git commits, will reap dividends in terms of students learning the process of coding. Ultimately the big win would be checking that the tests were written before the code, which could be checked by asking students to commit the failing tests, and then commit the code that makes them pass. Not ideal practice on the master branch, but acceptable for pedagogical purposes perhaps … especially if we are checking for feature branches, and then even that those sets of commits are squashed onto master to ensure it always stays green …
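One naive way to encode that tests-before-code check, assuming we have already walked the history and recorded, for each commit, whether it touched the specs and whether the suite was green at that commit (gathering that data would mean checking out and running each commit on CI; the field names are assumptions):

```ruby
# history is ordered oldest-first. Each entry is a hash recording
# whether the commit changed spec files and whether the full suite
# passed when run at that commit.
def tests_written_first?(history)
  first_red_specs = history.index { |c| c[:changed_specs] && !c[:suite_green] }
  first_green     = history.index { |c| c[:suite_green] }
  return false if first_red_specs.nil? || first_green.nil?

  # The failing-tests commit must come before the commit that goes green.
  first_red_specs < first_green
end
```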


I also reflect that it might be more efficient to use web hooks on the GitHub repos in question, rather than repeatedly querying the API (which is rate limited). We’d need our centralised autograder to store the data about all the student PRs so that we could ensure each student’s submission was checked in a timely fashion.
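GitHub signs each webhook delivery with an HMAC of the payload body (the X-Hub-Signature header, computed with a shared secret configured on the repo), so the receiving endpoint can verify deliveries before trusting them. A minimal sketch of that verification:

```ruby
require 'openssl'

# Verify a GitHub webhook delivery. payload_body is the raw request
# body; signature_header is the X-Hub-Signature header GitHub sends;
# secret is the shared webhook secret configured on the repo.
def valid_github_signature?(payload_body, signature_header, secret)
  expected = 'sha1=' + OpenSSL::HMAC.hexdigest('sha1', secret, payload_body)
  # NOTE: production code should use a constant-time comparison here
  expected == signature_header
end
```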

Monday, July 11, 2016

Pairing - it's the Logistics, Stupid!

Pair programming is one of the more controversial practices associated with Agile and Extreme Programming. It inspires a full spectrum of emotions, from love to hate. We’ve taken a new step in the current run of the “Agile Development using Ruby on Rails” MOOC (formerly known as Engineering Software as a Service) by making pair programming a requirement for getting a certificate. There are five programming assignments in the course, and we’re basing half of each assignment grade on a pair programming component. Learners submit pair programming videos and are assessed by peer review; each learner needs to review two other pair programming videos for relevance, pairing role rotation and communication.

The peer review rubric is based on the following four questions:
  1. were you programming on the relevant assignment? (1 point)
  2. did you have a pair partner? (9 points)
  3. did you rotate driver/navigator roles frequently? (approx evenly 4 times an hour) (40 points)
  4. did you communicate effectively? i.e. regular talking, discussing the task at hand (50 points)
The rubric has evolved through several rounds of testing. The points are somewhat arbitrary, and set as they are in order to have them add up to 100 and match the 100 points that are received from the correctness of the separately submitted programming assignment code. They’re not completely arbitrary of course; having a video that shows you and a partner actually working on the correct assignment is an important gateway, and we consider good rotation and communication of higher and similar importance.

There’s a conflict here of course between trying to prevent people gaming the system (submitting irrelevant or borrowed videos) and encouraging effective pair programming behaviour. The peer review process itself is not without controversy. We’ve softened the wording under rotation to indicate that roles do not need to be rotated precisely every 15 minutes, and increased the range of rotation points that can be assigned. The rotation rubric now looks like this:
Did the participants rotate driver/navigator roles frequently? Ideally at least once every 15 minutes, i.e. this means that roughly every 15 minutes the person who is typing stops and allows their partner to type for the next 15 minutes, while switching to an advisory “navigator” role. Check particularly the time-indexes submitted.
Don’t be too strict - the point is that the participants rotate roles approximately four times an hour - so for example a rotation at 13 mins, then after 17 mins then after 16 mins and then 14 mins is fine.
  • No (0 points) There was no driver navigator rotation, only a single person contributing code during the video
  • Not so much (10 points) There was only one driver/navigator rotation per hour of coding
  • A couple of times (20 points) There were a couple of driver/navigator rotations per hour of coding
  • Several times (30 points) There were 3 or even 4 rotations, but they weren’t spaced out very well over an hour’s coding
  • Yes (40 points) There were at least 4 rotations and all the rotations were roughly evenly spaced throughout the pairing, i.e. at least one every 15 minutes
We introduced more gradations than the original Yes/No in response to feedback from learners. However, the other parts of the rubric are still Yes/No. We have a more graduated version of the communication rubric that we haven’t employed yet:
Did the pair continue to communicate effectively through the pairing process? Specifically, did the driver explain what they were doing as, or around, the process of them typing. Did the navigator ask questions or make suggestions or look up relevant documentation to the task at hand?
  • No (0 points) There was no communication or no helpful task focused communication between driver and navigator.
  • Not so much (10 points) There was almost no communication between driver and navigator, but they did communicate a little. Or the majority of communication was not helpful or task-focused.
  • Some (20 points) There were some occasional periods when the pair was communicating effectively, but there were longer periods of total silence and/or most of the communication was unrelated to the task and not helping the pair move forward.
  • A Fair Amount (30 points) There was communication but it was on and off, or communication that was unhelpful, e.g. talking off topic, getting angry etc.
  • Quite a lot (40 points) The communication was pretty good, but there was the occasional period with no communication of any kind, when perhaps it might have been helpful.
  • Yes, lots and very effectively (50 points) There was regular communication between the driver and the navigator. Although there might be occasional silences it is clear from the communication that driver and navigator are focused on solving the same task together, and they are using excellent communication skills to complete that task.
We’re holding off on this further extension of the rubric for fear of too much complexity. There are also issues arising from how the edX peer review training component works, where the learner has to match the instructor-chosen grades on example videos, so a more complex rubric leads to an even trickier peer review training process.

The edX peer review system is also a little tricky for some learners, since progress through the review training component is not obvious. That said, there is great support for learners to give feedback on the reviews they receive, and a nice admin interface to allow us to override peer review grading where necessary. I just wish I could hook it all up to a Slack bot via APIs ...

The peer review of pair programming videos is an experiment to see if we can effectively and reliably grade pair programming activity. Pair programming has been shown to have great benefits for learners in terms of understanding. The process of explaining technical concepts to others is hugely beneficial in terms of consolidating learning. We’ve encouraged learners in the MOOC to use the AgileVentures remote pairing system to schedule pairing events, and use a shared Gitter chat room for additional coordination.

The most challenging parts of MOOC pair programming appear to be the scheduling and frustrations arising from pair programming logistics. Anecdotally there is a fair amount of time spent in Google hangouts waiting for a pair partner to appear, or conversely one’s pair partner has to leave early. Some people feel nervous about the process of pair programming with a different stranger each week.
Some of this is specific to a MOOC, where participation drops precipitously as the weeks go by. For the first assignments it’s easy to find a partner, but if you join the course late, or simply move on to the later assignments where fewer learners are available, finding a pair partner in your timezone can be challenging.

Of course your pair partner doesn’t have to be in the same timezone as you. In principle there are learners in other timezones with different schedules that happen to overlap with yours. We don’t require learners to pair with others in the MOOC; they can pair with their room-mates, work colleagues, friends, whoever comes to hand. The only requirement is that a video is submitted. AgileVentures events provide a relatively convenient way to generate Hangouts on Air, which make it fairly straightforward to generate a YouTube screencast of a pairing session. Even the hangout itself can be used as a recording mechanism for a local pairing session. There are, however, too many points at which one can get stuck.

Have we found the perfect solution? Certainly not. The difficulty with a more rigid pair scheduling system in a MOOC is not being able to rely on the participation of learners who are only involved on an ad-hoc basis. That said, my experience running a bootcamp is that an imposed pairing schedule is actually preferred by most learners, since it removes the cognitive overhead of negotiating with others about initiating pairing sessions.

Perhaps for the next run of the MOOC, we could have a more structured approach where we get groups of 8 together at specific times, with the assumption that at least 2 of the 8 will show up and will be able to pair … we’ll need to analyse all the pairing data in some detail first …

Friday, July 1, 2016

Delivering Features vs Testing them (and Jasmine Async)

Michael was a little late to our pairing session yesterday, and so I spent some time scratching some itches about the way in which our AgileBot pings the Gitter chat for the “Agile Development using Ruby on Rails” MOOC. I’d previously been frustrated in my attempts to add hyperlinking to the Slack messages, since the ‘request’ library seems to escape the Slack hyperlink syntax, and it wasn’t immediately clear how to address that. Any changes to the AgileBot might easily blow up since it’s a legacy app with no tests. Michael and I had been trying to rectify that this week, but we were moving slowly due to lack of familiarity with the CoffeeScript the AgileBot is written in; so I took the opportunity of some solo time to have a crack at reformatting the Gitter messages, which looked like this:

existing message format

No one had complained about them explicitly, but my intuition was that the long hangout link looked messy, and might be intimidating. Of course this is dangerous territory for Agile development. A common part of the Agile approach is to ensure that you are responding to customer needs, rather than incorporating features on a whim. However there’s a big difference between building a system for a paying customer and edging a volunteer community towards critical mass. The paying customer gives you a concrete person to satisfy. In a community you’ll get suggestions, but they may not be practical, and as the saying goes, they might just tell you they want a faster horse.

Anyhow, the AgileVentures system has lots of little rough edges, itches I want to scratch, and scratching them makes me feel good. Remembering that Gitter supports the full markdown syntax (unlike Slack) I thought there might be a quick win and with a live chat I could easily get user feedback like so:

my suggestion in the Gitter chat

which got some immediate feedback


and allowed us to evolve the formatting a little further

evolving the formatting

The change involved a tiny reformatting to a string in the AgileBot, specifically:

send_gitter_message_avoid_repeats room, "[#{req.body.title} with #{}](#{}) is starting NOW!"

I pushed it to the staging server, tested it there, saw that it worked on our agile-bot test channel, and then pushed it live. Within very short order we had the re-formatted messages coming into the channel:


I hadn’t been able to remove the time element, which is an artefact not of the AgileBot, but of the main site. I got a quick pull request in to fix that, but that will take a little longer to deploy.

Now maybe changing these Gitter links won’t make much difference to the community in the long run. At the very least they made me feel good and motivated on a Thursday. I hope the users in the chat get a positive feeling and are maybe inspired to get more involved in the community, and my biggest hope is that as we re-run “Agile Development using Ruby on Rails”, completely new users will have slightly less friction as they consider joining a pairing session.

I’m particularly glad that I spent the time solo-ing on the above, because the rest of the afternoon pairing with Michael was somewhat frustrating, in that we were repeatedly stuck on getting a basic test of the 3rd party HTTP connections for the AgileBot. We did make progress over the course of the session, but nothing that any end user would see soon. The AgileBot hits the 3rd party services Gitter and Slack to get its job done, and if we are going to do proper integration tests, these services need to be stubbed. We tried Sepia, the node equivalent of VCR from the folks at LinkedIn, which records and saves 3rd party HTTP interactions and allows them to be played back, effectively sandboxing an app. We got Sepia working; however, in playback mode it highlighted cache misses (unexpected network connections) by creating a file rather than throwing an error that could be caught by JasmineNode.

We set that aside and tried Nock, one of the many node-world equivalents of WebMock, which allows precision stubbing of network connections. Personally I prefer approaches like VCR and Sepia that allow blanket recording of interactions, in contrast to WebMock and Nock, which require you to write out individual network stubs by hand. We got Nock working, but ran up against node async issues: the JasmineNode tests were not waiting for the HTTP connections to complete, and we were tying ourselves in knots trying to get JasmineNode async to work in CoffeeScript.

We untied ourselves by dropping back to ground truth, by first getting the example Jasmine Async test working in pure JavaScript:

describe("Asynchronous specs", function() {
  var value;
  beforeEach(function(done) {
    setTimeout(function() {
      value = 0;
      done();
    }, 1);
  });

  it("should support async execution of test preparation and expectations", function(done) {
    value++;
    expect(value).toBeGreaterThan(0);
    done();
  });
});

That working, we converted it to CoffeeScript and confirmed that it also worked:

describe 'Asynchronous specs', ->
  value = undefined
  beforeEach (done) ->
    setTimeout (->
      value = 0
      done()
    ), 1
  it 'should support async execution of test preparation and expectations', (done) ->
    value++
    expect(value).toBeGreaterThan 0
    done()
then carefully inserted the elements from our AgileBot HTTP testing setup:

nock = require('nock');

slack = nock('')
                .reply(200, {
                  ok: false,
                  error: 'not_authed'
                })
avHangoutsNotifications = require('../scripts/')

describe 'AV Hangout Notifications', ->
  beforeEach ->
    routes_functions = {}
    avHangoutsNotifications({router: { post: (s,f) -> routes_functions[s] = f } })
    @routes_functions = routes_functions

  describe 'hangouts-video-notify', ->
    beforeEach (done) ->
      res = {}
      res.writeHead = -> {}
      res.end = -> {} 
      req = { body: { host_name: 'jon', host_avatar: 'jon.jpg', type: 'Scrum' } } = -> {} 
      @routes_functions['/hubot/hangouts-video-notify'](req, res)
      setTimeout (->
        done()
      ), 3000

    it 'should support async execution of test preparation and expectations', (done) ->
      done()

and this would just crash the whole thing. So at least we had identified that the problem was with our code, not with how we happened to be implementing async testing in CoffeeScript in JasmineNode. It was precisely this line @routes_functions['/hubot/hangouts-video-notify'](req,res) that was causing the crash: the one that started the network connection. Not yet having set up an interactive debugger, it was console.log statements all the way down to discover that it was the Rollbar error-reporting component under our hood that was actually breaking everything. Clearly that needed to be disabled for testing purposes, which was achieved like so:

rollbar.init(process.env.ROLLBAR_ACCESS_TOKEN, {enabled: false})

and suddenly all the tests went green. It was a frustrating process, but it highlights the problem-solving approach of breaking out the different elements of the system you are testing in order to isolate where the problem actually exists. We’ve achieved a much better understanding of the legacy app as a result of all this. I’m just glad, for motivation’s sake, that I kicked out a minor improvement direct to the users before we spent the afternoon wrestling with testing frameworks :-)