Thursday, March 1, 2012

Book Review - Programming Collective Intelligence by Toby Segaran

Disclaimer: I received a free (electronic) copy of this ebook (Programming Collective Intelligence by Toby Segaran) from O'Reilly as part of the O'Reilly Blogger Review Program, which also requires me to write a review about it. That aside, I would have purchased this book this year anyway, and would have reviewed it on this blog too.

About me and why I read this book


I've been programming professionally for ~7.5 years, mainly business applications and reporting, so I already have quite some love for data. While I haven't used math much in my day jobs, I liked (and was good at) it in high school, including taking extra classes - so I have learned basic statistics. Refreshing and advancing my data analytics skills is one of my goals this year, and reading this book was part of the plan.


About the book



The book introduces lots of algorithms that can be used to gain new insight into any kind of data one might come across. The explanations are broken up into digestible chunks, and are supported by great visualizations. While understanding of the previous chunks is required for the later ones, this allowed me to read through most of the book on the train to and from work.


Each of the algorithms is illustrated with real world application examples, and the book also gives examples where applying them doesn't make sense. The exercises at the end of the chapters are applied and not purely theoretical - and coming up with exercises from the domain I work with every day was pretty easy! The book is really inspiring, which is great for an introductory book!


In addition to the well written, gradual introduction, the book has a concise algorithm reference at the end, so when one needs a quick refresher, there is no need to wade through the lengthy tutorials.


While the prose and the logic of the explanations are great, I have found the code samples hard to follow: really short, cryptic variable names; leaky abstractions; and inconsistent coding style, just to name a few. Some code samples are actually incorrect implementations of the given algorithm, and there are antipatterns like building SQL by string concatenation, without a warning comment to remind the reader that it's a bad practice.
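To show what I mean by the SQL antipattern, here is a minimal sketch of my own (it is not taken from the book; the users table and both function names are made up for illustration). It contrasts string concatenation with a parameterized query, using Python's standard sqlite3 module:

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Antipattern: the value is pasted into the SQL text, so a crafted
    # value can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # The "?" placeholder lets the driver pass the value separately
    # from the SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, malicious)  # matches every row
safe_rows = find_user_safe(conn, malicious)      # matches nothing
```

The concatenated version returns both users for the crafted input, while the parameterized one treats the whole input as a plain name.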


Nonetheless, it is great to have actual code to play with, just the initial reading and reviewing of it requires some extra effort.

The book claims that you don't need previous Python knowledge to understand the code samples, which I can't confirm (I use Python at my day job), but I wouldn't be surprised if not knowing Python could make understanding the code even more difficult (I've actually learned a few new language features from the samples!). Also, the Python language has come a long way since 2.4, which is the version used in the book - and that old version makes the code feel dated.

The book was written in 2007, but apart from the code style its content is not dated. First, the foundations of any topic tend to be timeless, and the most recent algorithm the book describes was published in 1990. Second, its Table of Contents is comparable to those of more recently written books (though I haven't read other introductory books yet).

In summary: I would recommend it as a great introductory book!

Friday, February 17, 2012

Inversion of Control for Continuous Integration

Problem Description



Our build structure is pretty stable, but the exact content of the steps varies as we discover more smoke tests that we'd like to add, or when we rearrange where these checks live.



The CI servers I've used made this a rather cumbersome process:





  • First, I have to leave my development environment to go to the build server's configuration of choice - most of the time it is a web interface, and for some it is a config file


  • I have to point and click, and if it's a shell script, I have to make my modifications without syntax highlighting (config files are no better here, since they usually take the shell command to execute as a string)


  • If it's a web interface, I have (or had) no versioning/backup/diff support for my changes (config files are better in this aspect).


  • If it's a config file, then I need to get it to the build server (we version control our config files), so that's at least one more command


  • I need to save my changes, and run a whole build to see whether my changes worked, which is a rather costly thing.


  • Most places have only one build server, so when I'm changing the step, I either edit the real job (bad idea) or make a copy of it, edit it, and then integrate it back to the real job. Of course, integrating back means: copy and paste.


  • If the build failed, I need to go back to the point and click and no syntax highlighting step to fix the failures


  • Last, but not least, with web interfaces, concurrent modifications of a build step lead to nasty surprises!






Normal development workflow





  • I have an idea what I want to do


  • I write the tests and code to make it happen


  • I run the relevant tests and repeat until it's working


  • I check for source control updates


  • I run the pre-commit test suite (for dvcs people: pre-push)


  • Once all tests pass I commit, and move on to the next problem




Quite a contrast, isn't it? And even the concurrent editing problem is solved!




Quick'n'Dirty Inversion of Control for builds



Disclaimer: the solution described below is a really basic, low tech, proof of concept implementation.



Since most build servers at the end of the day





  • invoke a shell command


  • and interpret exit codes, stdout, stderr, and/or log files




we defined the basic steps (update from version control, initialize database, run tests, run checks, generate documentation, notify) using the standard build server configuration, but the steps that are not built in (all except the version control update and the notification) are defined to invoke a shell script that resides in the project's own repository (e.g.: under bin/ci/oncommit/runchecks.sh). These shell scripts' results can be interpreted in the standard ways CI servers are familiar with - exceptions and stack traces, (unit)test output, and exit codes.
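To make the server's side of this contract concrete, here is a minimal sketch in Python (our real steps are shell scripts; the function names and the demo commands below are assumptions for illustration, not our actual setup). It runs each step as an external command, judges it by its exit code, and stops at the first failure:

```python
import subprocess
import sys


def run_step(name, command):
    """Run one build step and judge it by its exit code, as a CI server would."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "step": name,
        "ok": result.returncode == 0,
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def run_build(steps):
    """Run the steps in order and stop at the first failing one."""
    results = []
    for name, command in steps:
        outcome = run_step(name, command)
        results.append(outcome)
        if not outcome["ok"]:
            break
    return results


# Stand-ins for repository scripts such as ["bin/ci/oncommit/runchecks.sh"];
# small Python commands are used here so the sketch runs anywhere.
demo_steps = [
    ("run tests", [sys.executable, "-c", "print('2 tests passed')"]),
    ("run checks", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("generate documentation", [sys.executable, "-c", "print('docs built')"]),
]
results = run_build(demo_steps)
```

In this demo the second step exits with a non-zero code, so the documentation step never runs. The same runner works on a developer machine, which is the point of keeping the scripts in the repository.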




Benefits





  • adding an extra smoke test doesn't require me to break my flow: I can test my changes locally, and integrating them back into the main build means just committing them to the repository - the next build will pick them up


  • I can run the same checks locally if I would like to


  • if I were to support a bigger team/organization with their builds, this would make it rather easy to maintain a standard build across teams, yet allow each of them to customize their builds as they see fit


  • if I were to evaluate a new build server product, I could easily and automatically see how it would work under production load, just by:


    • creating a single parameterized build (checkout directory, source code repository)


    • defining the schedule for each build I have


    • and then replaying the past few weeks/months load - OK, I still would need to write the script that would queue the builds for the replay, but it still is more effective than to run the product only with a small pilot group and then see it crash under production load









Shortcomings, Possible Improvements



As said, the above is a basic implementation, but it has served as a successful proof of concept for us. However, our builds are simple:





  • no dependencies between the build steps, it is simply chronological


  • no inter-project dependencies, such as component build hierarchy (if the server component is built successfully, rerun the UI component's integration tests in case the server's API changed, etc.)


  • the tests are executed in a single thread and process, on a single machine - no parallelization or sharding




All of the above shortcomings could be addressed by writing a build server specific interpreter that would read our declarative build config file (map steps to scripts, define step/build dependencies/workflows), and would redefine the build's definition on the server. By creating a standard build definition format, we could just as easily move our builds between different servers as we can currently do with blogs - pity Google is not a player in the CI space, so the Data Liberation Front cannot help :).
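As a thought experiment, here is a minimal sketch of the server-independent half of such an interpreter, in Python. Everything in it is an assumption: the config layout, the step names, and every script path except runchecks.sh are hypothetical, and it only computes a valid execution order from the declared dependencies. It uses graphlib from the standard library, which needs Python 3.9 or later:

```python
from graphlib import TopologicalSorter

# Hypothetical declarative build config: each step maps to a script in the
# project's repository and lists the steps it depends on.
BUILD_CONFIG = {
    "init_db": {"script": "bin/ci/oncommit/initdb.sh", "needs": []},
    "run_tests": {"script": "bin/ci/oncommit/runtests.sh", "needs": ["init_db"]},
    "run_checks": {"script": "bin/ci/oncommit/runchecks.sh", "needs": ["run_tests"]},
    "generate_docs": {"script": "bin/ci/oncommit/gendocs.sh", "needs": ["run_tests"]},
}


def execution_order(config):
    """Return the step names in an order that respects every 'needs' entry."""
    graph = {step: set(spec["needs"]) for step, spec in config.items()}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(BUILD_CONFIG)
scripts_to_run = [BUILD_CONFIG[step]["script"] for step in order]
```

A server-specific adapter would then translate this ordering (or the dependency graph itself, for servers that can run independent steps in parallel) into that server's own job definitions.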




Questions



Does this idea make sense for you? Does such a solution already exist? Or are the required building blocks available? Let me know in the comments!

Friday, February 3, 2012

There Is More To Clean Code Than Clean Code

A post written by Uncle Bob in January (I'm behind my reading list) offended me. I absolutely agree with Uncle Bob's analysis regarding the code itself, and I also prefer the refactored version, but I have a problem with insulting the programmer(s) reluctant to appreciate the change.


We write code in programming languages, and there are different levels of proficiency in a language.


As I'm currently learning a new spoken language, I'm painfully aware of this - initially I probably sounded like a caveman. The first impression you get about me is totally different depending on the language I speak - but I am the same person!


The learning curve of a language is not smooth - the steepness between consecutive levels of proficiency is different. Going from not speaking any German to speaking A1 (tourist) level was easy, getting the basic grammar required for the low intermediate (B1) level wasn't too bad, but to get my German to the level where my English is will take more effort than the sum of all my previous investments1.


Since it is the third foreign language I'm learning, I have no difficulty accepting that the level I think I speak is higher than the level I actually speak. Because of that, whenever someone rephrases my sentences in proper German 2, I start from the assumption that their version is likely better, even if I don't understand why at first - and I make the effort to understand their reasoning 3. I do that even though I was, of course, convinced that when I spoke, I expressed my thoughts in the best possible way.


However, I don't have much at stake - no ego to hurt, no reputation to lose, and the roles are clear: I'm the beginner, and the people around me are the more experienced ones. In a software team, the roles might not be so clear - I once told colleagues almost twice my age how they should write code after only a few weeks of working there. Bad idea. Since then, I have learned not to start improving the coding style of a team by rewriting the code they have written, and showing off how much better my version is. Rather, I wait until a situation arises when they don't mind having me demonstrate some code improvements. I demo it, and explain why I do it that way. In my experience, the second approach is more effective, though it doesn't have the instant satisfaction and relief the first provides.


As the joke goes, you need only one psychologist to change a light bulb, but the light bulb has to want the change real bad.


Driving Technical Change is hard, because it requires a mental/cultural change, and that change has to come from the inside - but can be catalyzed from the outside of course4. But just forcing practices or ways of working on unwilling recipients generates resistance (e.g.: the story of the EU technocrat appointed to recalculate the 2009 Greek budget).


I would like to see more public code reviews and public refactorings (e.g.: Andrew Parker, GeePawHill), but I would like to see less public judgement passing on people at the lower proficiency levels of programming.




1 there is a great Hanselminutes episode on learning a foreign language if interested. Beware, it may contain programming!


2 German readers might disagree, since most Germans I meet speak Frankish :)


3 Which of course, is sometimes harder for natives to properly explain than for novices to ask questions pointing out the seeming irregularities of the grammar


4 And we won't always be able to foster change in all environments (note: this does not mean the others are at fault for not changing!). The same programmer can be highly productive in one team, and be the one slowing down another team. There is nothing wrong with changing jobs after realizing we are a net loss to a given team.

Thursday, January 12, 2012

Find The Test Structure That Fits Your Team

A number of recent posts by Phil Haack, Ayende Rahien, and Gil Zilberfeld dealt with the topic of test organization. Each approach has its pros and cons, but none is a silver bullet. Your (and your team's, project's) context determines which approach is right.


Without aiming to provide an exhaustive list, below are some questions that have influence on test organization:



  • Is the team in a consulting project where test documentation is required as part of delivery?

  • Is it a product team? Is the firm in its early stage or is it mature like Oracle with mature products?

  • What is the turnover rate of the team? What are the plans for its growth? The team might have all the knowledge in their head, but if it'll double in size in a year, then the communication value of tests could increase.

  • What is the maturity level of the team? How long have they been working together?

  • How closely and often do team members collaborate?

  • Is there collective code ownership?

  • How do the team and its customers communicate? Some customers can - and are willing to - read code, some need English (Turkish/Hungarian/German/etc.). Some teams have a level of (grown and deserved) trust that just saying the software works is accepted, some need a more formal acceptance and regression process.

  • Is there proper IDE support for discoverability? Do all people reporting bugs (as tests) have access to that IDE? If not, how do they find examples of how to write the bug-report test?


Feel free to add more questions in the comments!

Thursday, December 1, 2011

That Is Not Your Decision to Make

A recent Weinberg blog post (there is a part 2 too) reminded me that for some time I have meant to write about the antipattern of programmers making business decisions - so here it goes!


The linked example of a programmer purposefully implementing something other than what (s)he was asked to do - I hope - is not common, but there are other, more subtle situations where we make the same mistake of making decisions instead of our clients (which can be a customer, business analyst, product owner, etc.).


Not Specific Specifications


Project failure (or rework at best) happens when we don't deliver what the customer wanted. One reason for that can be traced back to misunderstandings of requirements. I often say that our job as developers is to put on our detective hats and help our customers discover & formalize their processes - which they know and do every day, but usually never had to articulate precisely. Despite the best efforts, specifications/acceptance tests leave space for interpretation, which we, as programmers with a special attention to detail, will discover.


And here comes the problem - another trait of us programmers is that we love solving difficult problems/puzzles. Thus we will first think of a solution that we would like to implement and we often stick with the first idea that makes sense to us as the logical one. Now we are doing exactly what the programmer in the linked article did. There is a reason we are developers and our customers are lawyers/traders/salesmen/etc. - they know their problem domain better. And their perspective (ROI, time to market, cost vs. benefit, etc.) likely differs from ours (such a great puzzle to solve!).
While it's hard to accept it as a programmer, it is a valid business decision to say we don't care about that or we are willing to take the risk - which to us would look bad, for being an imperfect system, an unhandled scenario. However, we might not actually be able to show an occurrence of the problem in the past 5 years' data... And it probably saved a week's worth of programmer work that was spent on some other feature important for the business.


The Road to Hell is Paved with Good Intentions


Has it ever occurred to you how the usability of the site could be improved, or how much better it would be if there was a shortcut, or... It certainly has happened to me (and I hope it will happen in the future). But again, even if we are right, and it would be a great improvement, it might not be needed, or not now. For instance, if you have a wizard form, where the first step is so beautiful it should be shown in the Louvre, and is super easy to use - almost reads the user's mind! - ... but it crashes as soon as you press the next button, have we really made a good decision? As the saying goes: shipping (functionality) is a feature too.


I have yet to meet a client who didn't appreciate programmers taking the initiative and making suggestions for improving the application, but I have met ones that were furious when they learned that only 10% had been delivered due to too many bells and whistles (though of course, there are counterexamples to everything - but in my experience they are the exception, not the rule).


Let them decide!


If you find yourself in any of the cases above, I suggest you stop, and consult with the person who should make that decision instead of coding away one that makes sense to you - even if eventually it turns out that the customer's decision is the same as the one you had come up with.


Caveat: I am not saying you should stall work, waiting on a decision from customers. You might not have instant access to your customer, they may be on holiday, etc. If so, talk with the most senior person you have available (tech lead, team lead, etc.) and have her/him make a decision. If you are the most senior person, make a decision, but keep it minimal (even if you see a myriad of other issues this might lead to), and make it easily undoable if needed.


To give a concrete example: recently, while implementing a certain form and the related business logic behind it, I realized that this action has different implications depending on the type of user that requests it - but it was not specified yet. The number of combinations I came up with was pretty big. But for my given feature, I just needed one of those combinations. I ended up using a single request handler class, but implemented the logic for each role as a separate mixin class, so when it comes to dealing with this issue, we don't have to spend time excavating the code relating to each role from one class. Based on my prior experiences, I wouldn't be surprised if our customer decides to simplify this feature, and says we only handle the two most common combinations, and don't make the feature available for the other users. If that turns out to be the case, it's rather a good thing that I haven't forged ahead and spent days writing a complex, composed object hierarchy to handle all possibilities.
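The shape of that code was roughly the following. This is a simplified sketch with made-up role names, method names, and return values (the real handler is framework-specific and not shown here): one handler class, with each role's rules living in its own mixin, and an explicit error for roles nobody has decided about yet.

```python
class CustomerRulesMixin:
    """Rules for requests coming from customers (hypothetical role)."""

    def handle_as_customer(self, payload):
        return {"status": "accepted", "needs_approval": True, **payload}


class ClerkRulesMixin:
    """Rules for requests coming from clerks (hypothetical role)."""

    def handle_as_clerk(self, payload):
        return {"status": "accepted", "needs_approval": False, **payload}


class FormRequestHandler(CustomerRulesMixin, ClerkRulesMixin):
    """Single entry point; the per-role logic stays in the mixins."""

    def handle(self, role, payload):
        method = getattr(self, f"handle_as_{role}", None)
        if method is None:
            # No business decision exists for this role yet, so refuse
            # instead of guessing.
            raise NotImplementedError(f"No decision yet for role {role!r}")
        return method(payload)


handler = FormRequestHandler()
clerk_result = handler.handle("clerk", {"amount": 10})
```

If the customer later drops a role, removing it means deleting one mixin from the class definition; if a new combination is specified, it arrives as one more mixin.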

Thursday, November 17, 2011

CITCON London 2011

Given what a great conference CITCON 2010 was, when registration opened for CITCON 2011, I didn't hesitate - which turned out to be a good thing, since spots filled up rather quickly. So next year, watch the mailing list, and rush to the registration form!


Friday Evening - Open Spaces Introduction/Topic Proposals


It was held at Skillsmatter, so I assumed there would be no surprises for me there (great infrastructure), but since I didn't have to take the tube this time (I picked my accommodation to be within walking distance of Glasswell Road on purpose), I quickly learned that Bing Maps has little to no knowledge of Glasswell Road and Skillsmatter. So the first day I took a nice, leisurely, albeit somewhat longer route to the location.


The registration, run by PJ's mom, went flawlessly. Great job!


PJ and Jeffery made the introductions in their usual, entertaining manner. While it wasn't new for me, I was glad to see this time they emphasized that the schedule can change throughout the day - apparently it wasn't only me who had been surprised by it last year. And of course the schedule did change.


Because Julian Simpson couldn't make it, I introduced his topic (Do you use your tests in prod?), and then two of my own - Continuous Release and Delivery in Downloadable Product Companies and Why Most People Don't have a Rollback Plan for Releases and instead "plan" to Hack Forward.


The great thing about proposing topics is that even when it doesn't meet the threshold to have a whole session dedicated to it, a lot of people now know that you are interested in this topic, and find you to share their experiences about it during the breaks.


To my surprise, we actually followed Jeffery's recommendation and didn't run side conversations during the proposals, so the process was smooth and efficient, and there was more than enough time to chat later - especially since I didn't worry much about the agenda: proposing two topics made attending those my priority, and I knew that whatever agenda we had at Friday closing time was not final.


As we learned the next day, when you propose a topic, you should be careful which words you use. In my case, rollback was a terrible word, since many interpreted it as going back in time, undoing, while what I meant was more like backout (planned and disciplined retreat).


Sessions I attended


I've added my notes on the conference wiki:



Sessions I wish I attended



  • BitbeamBot: The Angry Birds Playing, Mobile Testing Robot - Jason Huggins ran a session about automated UI testing of touch screens

  • there was a TDD session - given that often I need (want) to explain TDD to people new to the concept, it would have been a great learning opportunity seeing someone (was it run by Steve Freeman?) else's approach to introducing it. Plus likely I would have gained a new understanding of TDD...

  • Backyard beekeeping - I ended up in a random chat, but I would have been curious to see these non-technical, non-work related sessions

  • I couldn't stay long enough in the Slaughtered Lamb


The hallway chats


While open spaces are already like an unconference, nonetheless a lot of great conversations took place during the breaks (didn't get to use the law of two feet this time either), and I've met a lot of great people. I just wish I had more time to talk more with each of them. Guess I queued these conversations for future processing on twitter or via email.


In contrast to last year's conference, I spent almost the whole time offline, in the analog world, and it didn't feel the least bit wrong. I pretty much only used twitter to follow someone - instead of exchanging emails or business cards. Though I have to admit, business cards are great for jotting down small reminder notes on them.


Some travel lessons I've learned


While this is tangential to CITCON, I have learnt a few lessons on this trip. I'll list them here, hoping it's beneficial for others.



  • have your travel plans printed out (à la TripIt). It made the check-in process much smoother, and going from the airport to my hotel was perfectly relaxed, knowing exactly which tube to take. The only part I forgot to plan was from the Nuremberg Airport to the office (I only remembered that it would be useful when I landed). Thanks to Career Tools for the idea (and their great advice for attending conferences, approaching your boss about sending you to a conference, and more)!

  • Phones, data plans, roaming. While this isn't the reason that I always buy my phones from the stores and not from the carriers, being able to just buy a prepaid SIM in London made a huge difference. It's not necessarily the money, though I think I've saved there (for the four days I stayed, just the data plan would have cost me 14 €, and there are the calls I made - contrast that with the £15), but rather that making a call or using mobile net wasn't something that I had to consciously choose (is it worth the extra roaming charge?).

  • have enough slack in your trip. I arrived Friday afternoon, and returned home on Monday morning. This allowed me to dedicate Friday and Saturday to the conference (and the Slaughtered Lamb afterwards), be able to meet up with friends living in London on Sunday, and the fact that some of my friends were an hour late just simply didn't cause a problem. And financially it doesn't have to be more expensive - I believe I spent about the same amount on the three nights' stay as I did last year for two nights. Advance planning, more research for accommodation helps you with that.

  • If you need travel accessories (e.g.: AC adapter for the UK), try them before leaving home - I managed to buy adapters incompatible with my laptop's charger. Luckily the reception at my hotel managed to find one of theirs that was compatible, but it took a long five minutes for them to find one, so I wouldn't rely on this

  • In addition to that, I forgot my phone charger at home, and while Bing Maps isn't perfect, it's much better than walking in London without a GPS. Lesson learned: it's handy when you have a USB cable with you that you can use to charge from somebody's laptop - the 10 cm long USB cable fits in any pocket.

Friday, November 4, 2011

Data Migrations As Acceptance Tests

While I have previously said that on migration projects both verification and regression tests are important, does it mean that the two should be separate? Like first, let's migrate the data, and then we'll rewrite the functionality, right? Or let's do it the other way around - we'll talk with the customer, incrementally figure out their requirements, deliver the software (with a proper regression test suite) that satisfies them, and then we migrate. Both approaches have problems:



  • customers want to use the software with their current, real data - having only the data and no application to use it with is of no value to them. Neither is having only an application with no data in it

  • real data has lots of surprising scenarios that the domain expert might have forgotten to specify (see caveats though)

  • requirements are not static, and new ones will be introduced during the development process, that inevitably will cause the new application's models to change, which means that the migration has a moving target it needs to migrate to.


Doing them in parallel


If the data source is organized chronologically (order by date in the migration script), and organized in a format that resembles what the system's end users will enter into the system, then we can use the new application's outermost automatable entry point (Selenium, HTTP POST, a module's public API) to enter this data during the migration from the old system to the new.


Why


A clear disadvantage of this approach is the speed of the migration - it will likely be slower than an INSERT INTO new_db.new_table select .... FROM old_db.old_table join ... where .... statement. But in the case of non-trivial migrations the benefits will likely compensate for the slowness, because:



  • changes to the new system's code/data structure become a problem localized to the new application code - no headaches to update the migration scripts in addition to the code

  • when the client requests the demo deployment to be in sync with the old system, the code is ready (save for the part that figures out which records have changed)

  • the legacy data edge cases provide focus - no need to invent corner cases, for there will be enough in the data

  • likely there will be releasable functionality sooner than with either of the above approaches


How


First, create the acceptance tests for the migration:



  • Pick the data to be migrated

  • find the view in the original system that displays this data to the users and find a way to extract the data from there

  • define the equivalent view in the new system (it's about end to end, user visible features!)

  • write the verification script that compares the above two (be sure to list the offending records in the failure message!)

  • define the data entry point in the new system

  • write the migration script - extract from the old system, transform if needed (only to match the entry point's expectations of the format - no quality verification as in classic ETL!), then send it into the new system (using the above defined entry point)
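The verification script from the list above can be sketched as follows. This is a minimal Python illustration under my own assumptions: both views have already been extracted into lists of dicts sharing a key field, and the function names and sample rows are made up. The important part is that the failure message lists the offending records.

```python
def compare_views(old_rows, new_rows, key="id"):
    """Return one human-readable line per record that differs between the views."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    problems = []
    for record_key in sorted(old_by_key.keys() | new_by_key.keys()):
        old = old_by_key.get(record_key)
        new = new_by_key.get(record_key)
        if new is None:
            problems.append(f"{record_key}: missing in the new system")
        elif old is None:
            problems.append(f"{record_key}: unexpected in the new system")
        elif old != new:
            problems.append(f"{record_key}: old={old!r} new={new!r}")
    return problems


def assert_views_match(old_rows, new_rows, key="id"):
    """Fail with the full list of offending records, not just a count."""
    problems = compare_views(old_rows, new_rows, key)
    if problems:
        raise AssertionError(
            f"{len(problems)} record(s) differ:\n" + "\n".join(problems)
        )


# Hypothetical extracts of the same view from the old and the new system.
old_view = [{"id": 1, "total": 100}, {"id": 2, "total": 250}, {"id": 3, "total": 75}]
new_view = [{"id": 1, "total": 100}, {"id": 2, "total": 205}]
problems = compare_views(old_view, new_view)
```

Here record 2 is reported with both versions of its data, and record 3 is reported as missing in the new system.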


At this point both the new view, and the data entry points are empty. From here on, the TDD cycle becomes a nested loop



  • run the migration script. See which record it failed for

  • analyze the failing acceptance test, and find the missing features for it

  • (A)TDD the missing features

  • run the migration script to restart the cycle
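The outer loop above relies on a migration script that reports which record it failed for. A minimal sketch, assuming the records are dicts with a date field and the entry point is a plain Python callable (FakeNewSystem, create_order, and the sample data are hypothetical stand-ins for the real application):

```python
from datetime import date


def migrate(records, enter_record):
    """Feed legacy records, oldest first, through the new system's entry point.

    Stops at the first record the new system rejects, so the failure
    points at the next missing feature.
    """
    migrated = 0
    for record in sorted(records, key=lambda r: r["date"]):
        try:
            enter_record(record)
        except Exception as error:
            return {"migrated": migrated, "failed_record": record, "error": str(error)}
        migrated += 1
    return {"migrated": migrated, "failed_record": None, "error": None}


class FakeNewSystem:
    """Hypothetical stand-in for the new application's public API."""

    def __init__(self):
        self.orders = []

    def create_order(self, record):
        # Pretend the new system only supports one currency so far.
        if record["currency"] != "EUR":
            raise ValueError("unsupported currency: " + record["currency"])
        self.orders.append(record)


legacy_records = [
    {"id": 2, "date": date(2009, 5, 1), "currency": "USD"},
    {"id": 1, "date": date(2008, 1, 15), "currency": "EUR"},
    {"id": 3, "date": date(2010, 3, 9), "currency": "EUR"},
]
new_system = FakeNewSystem()
report = migrate(legacy_records, new_system.create_order)
```

In this demo the 2008 record goes in, and the run stops at the 2009 record with an "unsupported currency" error. That failure is the prompt to either (A)TDD the missing feature or ask the stakeholders whether it is needed at all.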


Caveats


While the existing data makes one focus on the real edge cases instead of the imagined ones, beware - not everything has to be (or can be) migrated. For instance, a payment system may have accepted many currencies in the past, but now accepts only one. In this case, possibly the currency exchange handling logic could be dropped in the new system (with the currency just stored in a char field for the old records); or in some other domains, maybe only the last ten years' data is needed. However, this should be a business decision, not a decision for a developer!


Source data quality is often a problem, one that will likely cause issues. If data needs to be fixed (as above, ask the stakeholders!), the fix should stay out of your application's code, and be in the Transform part of the migration script.