Friday, November 4, 2011

Data Migrations As Acceptance Tests

While I have previously said that on migration projects both verification and regression tests are important, does that mean the two should be kept separate? Like first, let's migrate the data, and then we'll rewrite the functionality, right? Or let's do it the other way around - we'll talk with the customer, incrementally figure out their requirements, deliver software (with a proper regression test suite) that satisfies them, and then we migrate. Both approaches have problems:



  • customers want to use the software with their current, real data - having only the data and no application to use it with is of no value to them. Neither is having only an application with no data in it

  • real data has lots of surprising scenarios that the domain expert might have forgotten to specify (see caveats though)

  • requirements are not static, and new ones will be introduced during the development process; these will inevitably cause the new application's models to change, which means the migration has a moving target to migrate to.


Doing them in parallel


If the data source is organized chronologically (ORDER BY date in the migration script), and in a format that resembles what the system's end users will enter into the system, then we can use the new application's outermost automatable entry point (Selenium, an HTTP POST, a module's public API) to enter this data during the migration from the old system to the new.
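As a minimal sketch of the idea, here the entry point is a module's public API (passed in as a plain callable) and the legacy source is a SQLite database - all table and field names below are made up for illustration:

```python
import sqlite3


def migrate(old_conn, enter_record):
    """Replay legacy rows, oldest first, through the new system's
    public entry point `enter_record` (a hypothetical callable)."""
    rows = old_conn.execute(
        # ORDER BY the date column: replay history in the order
        # the users originally entered it
        "SELECT customer, amount, created_at FROM orders ORDER BY created_at"
    )
    for customer, amount, created_at in rows:
        # the entry point validates the record exactly as it would
        # for a live user typing it in
        enter_record({"customer": customer, "amount": amount, "date": created_at})
```

Swapping the callable for a Selenium script or an HTTP POST changes only the plumbing, not the shape of the loop.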


Why


A clear disadvantage of this approach is the speed of the migration - it will likely be slower than an INSERT INTO new_db.new_table SELECT ... FROM old_db.old_table JOIN ... WHERE ... statement - but in the case of non-trivial migrations the slowness will likely be compensated for, because:



  • changes to the new system's code/data structure become a problem localized to the new application code - no headache of updating the migration scripts in addition to the code

  • when the client requests the demo deployment to be in sync with the old system, the code is ready (save for the part that figures out which records have changed)

  • the legacy data's edge cases provide focus - no need to invent corner cases, for there will be enough in the data

  • likely there will be releasable functionality sooner than with either of the above approaches


How


First, create the acceptance tests for the migration:



  • pick the data to be migrated

  • find the view in the original system that displays this data to the users and find a way to extract the data from there

  • define the equivalent view in the new system (it's about end to end, user visible features!)

  • write the verification script that compares the above two (be sure to list the offending records in the failure message!)

  • define the data entry point in the new system

  • write the migration script - extract from the old system, transform if needed (only to match the entry point's expected format - no quality verification as in classic ETL!), then send it into the new system (using the entry point defined above)
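The verification script from the steps above could be sketched like this - here `old_view` and `new_view` stand for the data already extracted from the two systems' user-facing views, assumed (hypothetically) to be dicts mapping record ids to the displayed fields:

```python
def verify(old_view, new_view):
    """Compare the old system's view with the new system's equivalent view.

    Fails with a message that lists every offending record, so a failure
    immediately tells you which records to investigate.
    """
    offending = []
    for record_id, old_fields in old_view.items():
        new_fields = new_view.get(record_id)
        if new_fields != old_fields:
            offending.append((record_id, old_fields, new_fields))
    assert not offending, "Mismatched records:\n" + "\n".join(
        "  %s: old=%r new=%r" % (rid, old, new) for rid, old, new in offending
    )
```

Listing the offending records in the failure message is what turns this from a red/green light into an actual to-do list.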


At this point both the new view and the data entry point are empty. From here on, the TDD cycle becomes a nested loop:



  • run the migration script and see which record it fails for

  • analyze the failing acceptance test, and find the missing features for it

  • (A)TDD the missing features

  • run the migration script to restart the cycle
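The outer loop above could be driven by something like this sketch (all names hypothetical) - the migration stops at the first record the new system cannot accept yet, and that record is what points you at the next feature to (A)TDD:

```python
def migration_cycle(legacy_records, enter_record):
    """Push legacy records through the new system's entry point until
    one is rejected. Returns (failing_record, error), or (None, None)
    when everything migrated cleanly and the cycle is done."""
    for record in legacy_records:
        try:
            enter_record(record)
        except Exception as failure:
            # this record names the next missing feature:
            # analyze it, (A)TDD the feature, then rerun the script
            return record, failure
    return None, None
```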


Caveats


While the existing data makes one focus on real edge cases instead of imagined ones, beware - not everything has to (or can) be migrated. For instance, a payment system may have accepted many currencies in the past but now accepts only one. In this case, the currency exchange handling logic could possibly be dropped in the new system (and the currency just stored in a char field for the old records); or in some other domain, maybe only the last ten years' data is needed. However, this should be a business decision, not a developer's decision!


Source data quality is often a problem, and it will likely cause issues during the migration. If data needs to be fixed (as above, ask the stakeholders!), the fix should stay out of your application's code and live in the Transform part of the migration script.
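Concretely, such stakeholder-approved fixes can live in a small transform function inside the migration script, applied to each record before it is sent to the entry point - the specific fixes below are invented for illustration:

```python
def transform(legacy_row):
    """Stakeholder-approved data-quality fixes, kept in the migration
    script's Transform step - NOT in the new application's code."""
    row = dict(legacy_row)
    # the legacy system stored names with stray padding
    row["customer"] = row["customer"].strip()
    # stakeholder-approved fix: a blank country meant the domestic market
    if not row.get("country"):
        row["country"] = "US"
    return row
```

When the old system is finally switched off, this function disappears with the migration script, leaving no legacy workarounds in the new codebase.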