Thoughts Before Attemping Hashcode 2016

Together with friends, I participated in the Google hashcode in 2015. It was a lot of fun, and we made it to the final round at the Google Paris offices, finishing 49th of 65 teams (qualification scores were better 20th of 230 teams). To get an idea of what it was like, you can read write-ups from a fellow competitor, Antoine Amarilli, for the 2014 final round, the 2015 qualification round and the 2015 final round. Preparing this year's hashcode, I want to revisit some thoughts about the competition and some tips for mastering it.

The tasks

Based on the previous editions, I believe this year's hashcode will be again about solving an optimization problem. Last year's problems were:

  • facade painting (training)
  • data center server placement (qualification)
  • pizza slicing (warm-up final round)
  • Google Loon trajectory optimization (final round)

This year's training problem is a variation on the facade painting problem.

What all these problems have in common is that they are optimization problems:

  • facade painting: compute minimum set of painting instructions to recover the required picture
  • data center: assign servers to rows and then groups to servers so as to maximize the remaing power of the smallest group in the case of an outage
  • pizza slicing: make the biggest number of slices possible given a set of constraints
  • Loon: find the best trajectories to maximize the coverage in time of the balloons

Another observation is that the problems are non-trivial in every case: optimization of an object that is not easily manipulated (painting instructions, rows of servers, pizza slices, ballons). This leads me to what I believe is a key point in this competition: modeling.

Modeling problems

All of these problems are described in text form. To the human brain, concepts are easily picked up from their description and manipulated. The important point is how these abstract concepts are translated when putting the code to practice. My previous participations have led me to believe that the challenge is often not one of performance and score, but rather one of translating an idea of an approximate solution to code in a finite amount of time.

Thus my current vision is: code clarity is better than code performance, even in this sort of contest. It is better to have a slow solution that produces a result rather than a fast solution that doesn't produce a result due to various reasons (in my experience because the code is too complex to finish it).

Norvig's approach to modeling (points)

The ideal approach I advoce is essentially what Peter Norvig does when he codes. Peter Norvig has a very clean approach to doing things. Usually, he first defines the objects he will deal with. Good examples of this can be found in the Convex Hull notebook, where he defines the following:

  • Point: We'll define a class such that Point(3, 4) is a point where p.x is 3 and p.y is 4.
  • Set of Points: We'll use a Python set: {Point(0,0), Point(3,4), ...}
  • Polygon: We'll represent a polygon as an ordered list of vertex points.

As noted by Peter, several implementations are possible for 2d points. He uses several, depending on the problem.

Using namedtuples in Python:

In [1]:
import collections

Point = collections.namedtuple('Point', 'x, y')

This easily allows us to do the following:

In [3]:
p = Point(3, 4)
p
Out[3]:
Point(x=3, y=4)
In [4]:
p.x
Out[4]:
3

However, other implementations exist. In the travelling salesman problem, Norvig uses complex numbers, which have the advantage of allowing easy (euclidian) distance computations:

In [6]:
Point = complex

def X(point): 
    "The x coordinate of a point."
    return point.real

def Y(point): 
    "The y coordinate of a point."
    return point.imag

def distance(A, B): 
    "The distance between two points."
    return abs(A - B)
In [7]:
A = Point(3, 4)
B = Point(3, 5)
distance(A, B)
Out[7]:
1.0

Finally, in Gesture Typing, Norvig uses an even more clever variant:

In [8]:
class Point(complex):
    "A point in the (x, y) plane."
    def __repr__(self): return 'Point({}, {})'.format(self.x, self.y)
    x = property(lambda p: p.real)
    y = property(lambda p: p.imag)
    
def distance(A, B):
    "The distance between two points."
    return abs(A - B)
In [9]:
A = Point(0, 4)
A
Out[9]:
Point(0.0, 4.0)
In [10]:
A = Point(2, 1)
B = Point(5, 3)
distance(A, B)
Out[10]:
3.605551275463989

Norvig's approach to modeling algorithms

Another great aspect of the way Peter Norvig codes in his notebooks is the way he writes algorithms. In the travelling salesman, he conceives the algorithm with objects that don't exist yet:

In [11]:
def alltours_tsp(cities):
    "Generate all possible tours of the cities and choose the shortest tour."
    return shortest_tour(alltours(cities))

def shortest_tour(tours): 
    "Choose the tour with the minimum tour length."
    return min(tours, key=tour_length)

# TO DO: Data types: cities, tours, Functions: alltours, tour_length

The TO DO allows him to explicitly document the symbols he has used but not yet defined. The interesting approach here is that this is exactly what humans do: we talk about concepts that exist in our heads, even though they don't have a concrete meaning. I would say this is a sort of lazy evaluation.

Another look at Norvig's way at algorithms is found in the convex hulls notebook:

In [12]:
def convex_hull(points):
    "Find the convex hull of a set of points."
    if len(points) <= 3:
        return points
    # Find the two half-hulls and append them, but don't repeat first and last points
    upper = half_hull(sorted(points))
    lower = half_hull(reversed(sorted(points)))
    return upper + lower[1:-1]

def half_hull(sorted_points):
    "Return the half-hull from following points in sorted order."
    # Add each point C in order; remove previous point B if A->B-C is not a left turn.
    hull = []
    for C in sorted_points:
        # if A->B->C is not a left turn ...
        while len(hull) >= 2 and turn(hull[-2], hull[-1], C) != 'left':
            hull.pop() # ... then remove B from hull.
        hull.append(C)
    return hull

In the main algorithm, he uses the half_hull function before defining it. This is a great way of making problems more tractable.

But, what's the point of this approach?

The point of this approach is: by describing things in a mostly normal language, the concrete problems become tractable. In particular, I would say that:

Good data type design + high level algorithm description = understandable and writable code

For me, a good example of this is found in the gesture typing notebook. The keyboard has been defined as a dictionary, thus allowing direct access to letter coordinates, so we can simply loop over the letters of a given word to compute the total swiped distance (because we used complex numbers to represent 2d coordinates) over that word. This gives the beautiful following code:

def path_length(word, kbd=qwerty):
    "The total path length for a word on this keyboard: the sum of the segment lengths."
    return sum(distance(kbd[word[i]], kbd[word[i+1]])
               for i in range(len(word)-1))

Debugging your code and writing files to disk

Two other recurrent needs in the hashcode competitions are debugging and writing files to disk. Of these two, I have to say that debugging is the biggest shortcoming in my current toolbox. I mostly code in the Jupyter notebook and debugging is doable, but not easy with it.

Two ways of debugging are:

  • calling the magic debug fuction %debug when an exception fails launches a post-mortem inside the debugger
  • calling %debug do(args) allows one to step inside the call on that line

Writing files to disk is the standard interface for submitting files to the hashcode judge server. This is easy enough in Python, for example using the string format instruction:

In [14]:
"arg1: {}, arg2: {}".format(1, 2)
Out[14]:
'arg1: 1, arg2: 2'

Visualization

A recurring need in hashcode problems is visualization. Basic visualization includes visualizing the input dataset, visualizing algorithm outputs. In particular, reimplementing a visual representation of the text output of the algorihtm usually saves time and effort because you don't have to wait for the Judge system to grade your submission. I believe this is in fact essential to success. It's the eyes you have in real life, but here it applies to an algorithm.

Checklist

I'd like to conclude this post by providing a little checklist of things that need to be done in a typical hashcode competition:

  • define data types to be used
  • define functions to be used
  • write algorithm to apply in as plain as possible English
  • write an input file reader and visualize the input result
  • write an output file reader and visualize the output result
  • write a debugging visualization of your algorithm

Comments