FINN tech blog



Log4j2 in production – making it fly



Now that log4j2 is the predominant logging framework in use at FINN.no, why not share the good news with the world and provide a summary of how we introduced this exciting new technology into our platform.

let’s just use one logging framework

At the beginning of September 2013 it became the responsibility of one of our engineering teams to introduce a “best practice for logging” for all of FINN.

The proposal put forth was that we would standardise on one backend logging framework, while there would be no need to standardise on the abstraction layer used directly in our code.

The rationale for not standardising the logging abstractions was…
  • nearly every codebase, through a tree of dependencies, already includes all the different logging abstraction libraries, so the hard work of configuring them against the chosen logging framework is still required,
  • the different APIs to the different logging abstractions are not difficult for programmers to go between from one project to another.

While the rationale for standardising on the logging framework was…
  • makes life easier for programmers with one documented “best practice” for all,
  • makes it possible through an in-house library to configure all abstraction layers, creating less configuration for programmers,
  • makes life easier for operations knowing all jvms log the same way,
  • makes life easier for operations knowing all log files follow the same format.

log4j2 wins hands down

Log4j2 was chosen as the logging framework given…
  • it provided all the features that logback was becoming popular for,
  • between the old log4j and logback it was the only framework written in java with modern concurrency (ie not hard synchronised methods/blocks),
  • it provided a significant performance improvement (1000-10000 times faster),
  • it had a more active community (logback had been announced as the replacement for the old log4j, but log4j2 saw new momentum in its apache community).

This proposal was checked with a vote by FINN programmers
        – 73% agreed, 27% were unsure, no-one disagreed.

when nightly compression jams

Earlier on in this process we hit a bug with the old log4j where nightly compression of already rotated logfiles was locking up all requests in any (or most) jvms for up to ten seconds. This fault came down to poor java concurrency code in the original log4j (which logback cloned). It was exacerbated by us having scores of jvms for all our different microservices running on the same machines, so that when nightly compression kicked in it did so all in parallel. Possible fixes here were to
  a) stop compression of log files,
  b) make loggers async, or
  c) migrate over quickly to log4j2.

After some investigation, (c) was ruled out because there was no logstash plugin for log4j2 ready, and moving forward without the json logfiles and the logstash & kibana integration was not an option. (a) was chosen as a temporary solution.

ready, steady, go…

Later on, when we started the work on upgrading all our services from thrift-0.6.1 up to thrift-0.9.1, we took the opportunity to kill two birds with the one stone. Log4j2 was out of beta, and we had ironed out the issues around the logstash plugin.

We’d be lying if we told you it was all completely pain free;
 introducing log4j2 came with some concerns and hurdles.

    • Using a release candidate of log4j2 in production led to some concerns. So far the only consequence has been slow startup times (eg even small services paused for ~8 seconds during startup). This was due to log4j2 having to scan all classes for possible log4j plugins. This problem was fixed in 2.0-rc2. On the bright side – our use of the release candidate meant we spotted early, and provided a patch for, getting the upcoming initial release of log4j2 to support shaded jarfiles, on which we are heavily dependent.
    • Operations had expressed concerns over nightly compression, raised from the earlier problem around it: even if code no longer blocked while the compressing happened in a background thread, the amount of parallel compressions spawned could lead to IO contention, which in turn leads to CPU contention. Because of this very real concern extensive tests have been executed; so far they’ve shown that no measurable (under 1ms) impact exists upon services within the FINN platform. Furthermore this problem can easily be circumvented by adding the SizeBasedTriggeringPolicy to your appender, thereby enforcing a limit on how much parallel compression can happen at midnight.
    • The new logstash plugin (which FINN has actively contributed to on github) caused a few breakages to the format expected by our custom logstash parsers written by operations. Unfortunately this parser is based off the old log4j format, which we are trying to escape. Breakages here were: log events on separate lines, avoiding commas at the end of lines between log events, thread context in the wrong format, etc. These were tackled with pull requests on github and patch versions of our commons-service (the library used to pre-configure the correct dependency tree for log4j2 artifacts and to properly plug in all the different logging abstraction libraries).
    • Increased memory from switching from sync loggers to async loggers impacted services with very small heaps. The async logger used is based on the lmax-disruptor, which pre-allocates its ringBuffer at its maximum capacity. By default this ringBuffer is configured to queue at maximum 256k log events. This can be adjusted with the “AsyncLoggerConfig.RingBufferSize” system property (a minimal sketch follows this list).
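
For a small-heap service the fix is a one-line tweak. Below is a minimal sketch, assuming only the system property named above; the class name and the chosen value are illustrative, and in practice the property is normally passed as a -D flag on the jvm command line so it is set before log4j2 initialises.

    // illustrative sketch: shrink the pre-allocated disruptor ringBuffer for a small-heap service.
    // Must run before log4j2 initialises, ie before the first logger is created; the equivalent
    // command-line form is -DAsyncLoggerConfig.RingBufferSize=16384.
    public final class LoggingDefaults {

        private LoggingDefaults() {
        }

        public static void apply() {
            // default is 256k pre-allocated slots; a smaller power-of-two value
            // trades some burst capacity for less heap
            System.setProperty("AsyncLoggerConfig.RingBufferSize", "16384");
        }
    }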

simply beautiful

To wrap it up: the hurdles have been there, but they were trivial and easy to deal with, while the benefits of introducing log4j2, and of moving to async loggers, make it well worth it…
    • The “best practice” for log4j2 included changing all loggers to be async, and this means that the performance of the FINN platform (which consists primarily of in-memory services) is no longer tied to and affected by how disks are performing (it was crazy that it was before).
    • More and more applications are generating consistent logfiles according to our best practices.
    • More and more applications are actually plugging in the various different logging abstractions used by all their various third-party dependencies.
    • All the advantages people liked about logback.
    • An easier approach to changing loglevels at runtime through jmx.
    • Profiling applications in crisis is easier for outsiders (one less low-level behavioural difference).
    • Loggers are no longer a visible bottleneck in jvms under duress.
    • And naturally the performance increase.

Because of the significant performance gains provided by the lmax-disruptor we also use the open-sourced statsd client that takes advantage of it.

Package Management conflicts with Continuous Delivery


The idea of package management is to correctly operate and bundle together the various components in any system. The practice of package management is a consequence of the design and evolution of each component’s API.

Package management is tedious

   but necessary. It can also help to address the ‘fear of change’.

We can minimise package management by minimising API. But we can’t minimise API if we don’t have experience with where it comes from. You can’t define for yourself what the API of your code is. It is well beyond that of your public method signatures. Anything that, when changed, can break a consumer is API.

Continuous Delivery isn’t void of API

   despite fixed and minimised interfaces between runtime services, each runtime service also contains an API in how it behaves. The big difference though is that you own the release change, a la the deployment event, and if things don’t go well you can roll back. Releasing artifacts in the context of package management cannot be undone. Once you have released the artifact you must presume someone has already downloaded it and you can’t get it back. The best you can do is release a new version and hope everyone upgrades to it quickly.

Push code out from behind the shackles of package management

   take advantage of continuous delivery! Bearing in mind a healthy modular systems design comes from making sure you get the api design right – so the amount one can utilise CD is ultimately limited, unless you want to throw out modularity. In general we let components low in the stack “be safe” by focusing on api design over delivery time, and the opposite for components high in the stack.

High in the stack doesn’t refer to front-end code

   Code at the top of the stack is that free of package management and completely free for continuous deployment. Components with direct consumers no longer sit at the top of the stack. As a component’s consumers multiply, and it becomes a transitive dependency, it moves further down the stack. Typically the entropy of a component corresponds to its position in the stack. Other components forced into package management can be those where parallel versions need to be deployed.

Some simple rules to abide by…

  • don’t put configuration into libraries.
    because this creates version-churn and leads to more package management

  • don’t put services into libraries.
    same reason as above.

  • don’t confuse deploying with version releases.
    don’t release every artifact as part of a deployment pipeline.
    separate concerns of continuous delivery and package management.


  • try to use a runtime service instead of a compile-time library.
    this minimises API, in turn minimising package management,

  • try to re-use standard APIs (REST, message-buses, etc).
    the less API you own the less package management.
    but don’t cheat! data formats are APIs, and anything exposed that breaks stuff when changed is API.

Dark Launching and Feature Toggles


Make sure to distinguish between these two.

They are not the same thing,
  and it’s a lot quicker to just Dark Launch.

In addition Dark Launching promotes incremental development, continuous delivery, and modular design. Feature Toggles need not, and can possibly be counter-productive.

Dark Launching is an operation to silently and freely deploy something. Giving you time to ensure everything operates as expected in production and the freedom to switch back and forth while bugfixing. A Dark Launch’s goal is often about being completely invisible to the end-user. It also isolates the context of the deployment to the component itself.

Feature Toggling, in contrast, is often the ability to A/B test new products in production. Feature Toggling is typically accompanied with measurements and analytics, eg NetInsight/Google-Analytics. Feature Toggles may also extend to situations when the activation switch of a dark launch can only happen in a consumer codebase, or when only some percentage of executions will use the dark launched code.

Given that one constraint of any decent enterprise platform is that all components must fail gracefully, Dark Launching is the easiest solution, and a golden opportunity to ensure your new code fails gracefully. Turn the new module on and it’s dark launched and in use; if there are any problems, turn it off again. You also shouldn’t have to worry about only running some percentage of executions against the new code: let it all go to the new component, and if the load is too much the excess load should also fail gracefully and fall back to the old system.

Dark Launching is the simple approach as it requires no feature toggling framework, or custom key-value store of options. It is a DevOps goal that remains best isolated to the context of DevOps – in a sense the ‘toggling’ happens through operations and not through code. When everything is finished it is also the easier approach to clean up. Dealing with and cleaning up old code takes up a lot of our time and is a significant hindrance to our ability to continually innovate. In contrast any feature toggling framework can risk encouraging a mess of outdated, arcane, and unused properties and code paths. KISS: always try and bypass a feature toggle framework by owning the ‘toggle’ in your own component, rather than forcing it out into consumer code.
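
To make “owning the ‘toggle’ in your own component” concrete, here is a purely hypothetical sketch (the names are made up, not FINN code): the new implementation sits behind the old one, operations can flip it on and off, and it falls back gracefully if it misbehaves.

    // hypothetical sketch: the component owns its own dark-launch switch and falls back
    // gracefully to the old implementation, so consumers never see a feature-toggle framework.
    interface AdStatistics {
        long pageViews(long adId);
    }

    public class AdStatisticsService implements AdStatistics {

        private final AdStatistics oldSystem;      // the proven implementation
        private final AdStatistics darkLaunched;   // the newly deployed implementation
        private volatile boolean useDarkLaunched;  // flipped by operations, eg over jmx

        public AdStatisticsService(AdStatistics oldSystem, AdStatistics darkLaunched) {
            this.oldSystem = oldSystem;
            this.darkLaunched = darkLaunched;
        }

        public void setUseDarkLaunched(boolean useDarkLaunched) {
            this.useDarkLaunched = useDarkLaunched;
        }

        @Override
        public long pageViews(long adId) {
            if (useDarkLaunched) {
                try {
                    return darkLaunched.pageViews(adId);
                } catch (RuntimeException e) {
                    // fail gracefully: problems or excessive load in the new code fall back to the old system
                }
            }
            return oldSystem.pageViews(adId);
        }
    }

Cleaning up afterwards is then just a matter of deleting the old implementation and this wrapper: no stale toggle keys are left behind in consumer code.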

Where components have CQS it gets even better. First the command component is dark launched, whereby it runs in parallel, and data can be compared between the old and new systems (blue-green deployments). Later on the query component is dark launched. While the command components can run and be used in parallel for each and every request, the query components cannot. When the dark launch of the query component is final the old system is completely turned off.

Now the intention of this post isn’t to say we don’t need feature toggling, but to give terminology to, and distinguish between, two different practices instead of lumping everything under the term “feature toggling”. And to discourage using a feature toggling framework for everything because we fail to understand the simpler alternative.

In the context of front-end changes, it’s typical that for a new front-end feature to come into play there’s been some new backend components required. These will typically have been dark launched. Once that is done, the front-end will introduce a feature toggle rather than dark launch because it’s either introducing something new to the user or wanting to introduce something new to a limited set of users. So even here dark launching can be seen not as a “cool” alternative, but as the prerequisite practice.

Reference: “DevOps for Developers” By Michael Hüttermann

given the git

 

“There is no way to do CVS right.” – Linus



FINN is migrating from subversion to the ever trendy git.
   We’ve waited years for it to happen,
      here we’ll try to highlight why and how we are doing it.

Working together

There’s no doubt that git gives us a cleaner way of working on top of each other. Wherever you promote peer review you need a way of working with changesets from one computer to the next without having to commit to and via the trunk where everyone is affected. Creating custom branches comes with too much (real or perceived) overhead, so the approach at best falls to throwing patches around. Coming away from a pair-programming session it’s better when developers go back to their own desk with such a patch so they can work on it a bit more and finish it properly with tests, docs, and a healthy dose of clean coding. It properly entitles them as the author rather than appearing as if someone else took over and committed their work. Git’s decentralisation of repositories provides the cleaner way by replacing these patches with private repositories and its easy-to-use branches.

Individual productivity

Git improves the individual’s productivity with the benefits of stashing, squashing, resetting, and rebasing. A number of programmers had for a number of years already been on the bandwagon using git-svn against our subversion repositories. This was real proof of the benefits, given the headaches of git-svn (you can’t move files, and renaming files gives corrupted repositories).

With Git, work is encouraged to be done on feature branches and merged in to master as complete (squashed/rebased) changesets with clean and summarised commit messages.

  1. This improves efforts towards continuous deployment due to a more stable HEAD.
  2. Rolling back any individual feature regardless of its age is a far more manageable task.
  3. By squashing all those checkpoint commits we typically make we get more meaningful, contextual, and accurate commit messages.

Reading isolated and complete changesets provides clear oversight, to the point that reading code history becomes enjoyable, rather than a chore. Equally important is that such documentation that resides so close to, if not with, the code comes with a real permanence. There is no documentation more accurate over all of time than the code itself and the commit messages to it. Lastly writing and rewriting good commit messages will alleviate any culture of jira issues with vague, or completely inadequate, descriptions as teams hurry themselves through scrum methodologies where little attention is given to what is written down.

Maintaining forks

Git makes maintaining forks of upstream projects easy.

With Git:
  • fork the upstream repository,
  • branch, fix and commit,
  • create the upstream pull request,
  • while you wait for the pull request to be accepted/rejected, use your custom inhouse artifact.

With Subversion:
  • file an upstream issue,
  • checkout the code,
  • fix and store in a patch attached to the issue,
  • while you wait, use the inhouse custom artifact from the patched but uncommitted codebase.



Both processes are largely the same but it’s safer and so much easier using a forked git repository over a bunch of patch files.

Has Git-Flow any advantage?

We put some thought into how we were to organise our repositories, branches, and workflows. The best introductory article we’ve so far come across is from sandofsky and should be considered mandatory reading. Beyond this, one popular approach is organising branches using Git Flow. This seemed elegant but upon closer inspection comes with more disadvantages than benefits…

  • the majority needs to ‘checkout develop’ after cloning (there are more developers than ops),
  • master is but a sequence of “tags” and therefore develop becomes but a superfluous branch; a floating “stable” tag is a better solution than any branch that simply marks states,
  • it was popular but didn’t form any standard,
  • requires a script not available in all GUIs/IDEs, otherwise it is but a convention,
  • prevents you from getting your hands dirty with the real Git, how else are you going to learn?,
  • it goes against having focus and gravity towards continuous integration that promotes an always stable HEAD. That is, we desire fewer stabilisation and qa branches, and more individual feature and fix branches.

GitHub Flow gives a healthy critique of Git-Flow and it helped identify our style better. GitHub Flow focuses on continuous integration, on “deploy to production every day” rather than releasing, and on relying on git basics rather than adding another plugin to our development environment.

Our workflows

So we devised two basic and flexible workflows: one for applications and one for services and libraries. Applications are end-user products and stuff that has no defined API, like batch jobs. Services are the runtime services that build up our platform; each comes with a defined API and client-server separation in artifacts. Applications are deployed to environments, but because no other codebase depends on them their artifacts are never released. Services, with their APIs and client-side libraries, are released using semantic versions, and the server-side code to them is deployed to environments in the same manner as Applications. The differences between Applications and Services/Libraries warrant two different workflow styles.

Both workflow styles use master as the stable branch. Feature branches come off master. An optional “integration” (or “develop”) branch may exist between master and feature branches, for example CI build tools might automatically merge integration changes back to master, but take care not to fall into the anti-pattern of using merges to mark state.

The workflow for Applications may use an optional stable branch where deployments are made from, this is used by projects that have not perfected continuous deployment. Here bug fix branches are taken from the stable branch, and forward ported to master. For applications practising continuous deployment a more GitHub approach may be taken where deployments from finished feature branches occur and after successfully running in production such feature branches are then merged to master.

The workflow for Services is based upon each release creating a new semantic version and the git tagging of it. Continuous deployment off master is encouraged but is limited by how compatible the API and the client libraries are against the HEAD code in the master branch – code that is released and deployed must work together. Instead of the optional stable branch, optional versioned branches may exist. These are created lazily from the release tag when the need for a bug fix on any previous release arises, or when master code no longer provides compatibility to the released artifacts currently in use. The latter case highlights the change when deployments start to occur off the versioned branch instead of off master. Bug fix branches are taken from the versioned branch, and forward ported to master.

Similar to Services are Libraries. These are artifacts that have no server-side code. They are standalone code artifacts serving as compile-time dependencies to the platform. A Library is released, but never itself deployed to any environment. Libraries are void of any efforts towards continuous deployments but otherwise follow very similar workflow as Services – typically they give longer support to older versions and therefore have multiple release branches active.

How any team operates their workflow is up to them, free to experiment to see what is effective. At the end of the day as long as you understand the differences between merge and rebase then evolving from one workflow to another over time shouldn’t be a problem.

Infrastructure: Atlassian Stash

The introduction of Git was stalled for a year by our Ops team as there was no repository management software they were happy enough to support (integration with existing services was important, particularly Crowd). Initially they were waiting on either Gitolite or Gitorious. Eventually someone suggested Stash from Atlassian and after a quick trial this was to be it. We’re using a number of Atlassian products already: Jira, Fisheye, Crucible, and Confluence; so the integration factor was good, and so we’ve paid for a product that was at the time incredibly overpriced with next to nothing on its feature list.

One feature the otherwise very naked Stash does come with is Projects, which provides a basic grouping of repositories. We’ve used this grouping to organise our repositories based on our architectural layers: “applications”, “services”, “libraries”, and “operations”. The idea is not to build fortresses with projects based on teams but to best please the outsider who is looking for some yet unknown codebase and only knows what type of codebase it is. We’re hoping that Atlassian adds labels and descriptions to repositories to further help organisation. Permissions are easy: full read and write access for everyone; we’ll learn quickest when we’re all free to make mistakes, and it’s all under version control at the end of the day.

Atlassian Stash


We’re still a cathedral

Git decentralises everything, but we’re not a real bazaar: our private code is our cathedral with typical enterprise trends like scrum and kanban in play; and so we still have the need to centralise a lot.
Our list of users and roles we still want centralised: when people push to the master repository, are all commits logged against known users, or are we going to end up with multiple aliases for every developer? Or worse, junk users like “localhost”?
To tackle this we wrote a pre-push hook that authenticates usernames for all commits against Crowd. If a commit from an unknown user is encountered the push fails and the pusher needs to fix their history using this recipe before pushing again.

Releases can be made off any clone, which is obviously not something we want. Released artifacts need to be permanent and unique, and deployed to our central maven repository. Maven’s release plugin fortunately tackles this for us: when you run mvn release:prepare or mvn release:branch it automatically pushes the resulting changes upstream for you, as dictated by the scm details in the pom.xml.

Migrating repositories

Our practice with subversion was to have everything in one large subversion repository, like how Apache does it. This approach worked best for us allowing projects and the code across projects to be freely moved around. With Git it makes more sense for each project to have its own repository, as moving files along with their history between repositories is easy.

Initial attempts at conversion used svn2git, as described here, along with svndumpfilter3.

But then a plugin for Stash came along called SubGit. It rocks! Converting individual projects from such a large subversion repository one at a time is easy. Remember to moderate the .gitattributes file afterwards; we found in most use cases it could be deleted.


Hurdles

Integration with our existing tools (bamboo, fisheye, jira) was easier when everything was in one subversion repository. Now with scores of git repositories it is rather cumbersome. Every new git repository has to be added manually into every other tool. We’re hoping that Atlassian comes to the rescue and provides some sort of automatic recognition of new and renamed repositories.

Renaming repositories in Stash is very easy, and should be encouraged in an agile world, but it breaks the integration with other tools. Every repository rename means going through other tools and manually updating them. Again we hope Atlassian comes to the rescue.

We were worried about binary files, as our largest codebase had many and was already slow on subversion because of them. Subversion also stores all xml files by default as binary, and in a large spring based application with a long history this might have been a problem. We were ready to investigate solutions like git-annex. All test migrations though showed that it was not a problem: git clones of this large codebase were super fast, and considerably smaller (subversion 4.1G -> git 1.1G).


Adaptation

Towards the end of February we were lucky enough to have Tim Berglund, Brent Beer, and David Graham from GitHub come and teach us Git. The first two days were a set course with 75 participants and covered

  • Git Fundamentals (staging, moving, copying, history, branching, merging, resetting, reverting),
  • Collaboration using GitHub (Push, pull, and fetch, Pull Requests Project Sites, Gists, Post-receive hooks), and
  • Advanced Git (Filter-Branch, Bisect, Rebase-onto, External merge/diff tools, Event Hooks, Refspec, .gitattributes).

The third day with the three GitHubbers was more of an open space with under twenty participants where we discussed various specifics to FINN’s adoption to Git, from continuous deployment (which GitHub excels at) to branching workflows.

No doubt about it this was one of the best, if not very best, courses held for FINN developers, and left everyone with a resounding drive to immediately switch all codebases over to Git.

Other documentation that’s encouraged for everyone to read/watch is


Tips and tricks for beginners…

To wrap it up here’s some of the tips and tricks we’ve documented for ourselves…

Can’t push because HEAD is newer
So you pull first… Then you go ahead and push, which adds two commits into history: the original and a duplicate merge from you.
You need to learn to do git rebase in such situations, better yet to do git pull --rebase.
You can make the latter permanent behaviour with
git config --global branch.master.rebase true
git config --global branch.autosetuprebase always

Colour please!
git config --global color.ui auto

Did you really want to push commits on all your branches?
This can trap people; they often expect push to be restricted to the current branch you’re on. It can be enforced to be this way with git config --global push.default tracking

Pretty log display
Alias git lol to your preferred log format…
Simple oneline log formatting:
git config --global alias.lol "log --abbrev-commit --graph --decorate --all --pretty=oneline"

Oneline log formatting including committer’s name and relative times for each commit:
git config --global alias.lol "log --abbrev-commit --graph --all
      --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset'"

Compact log formatting with full commit messages, iso timestamps, file history over renames, and mailmap usernames:
git config --global alias.lol "log --follow --find-copies-harder --graph --abbrev=4
  --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %Cgreen%ai %n %C(bold blue)%aN%Creset %B'"

Cherry-picking referencing the original commit
When recording a cherry-pick commit, using the “-x” option appends the line “(cherry picked from commit …)” to the original commit message, to indicate which commit this change was cherry-picked from. For example git cherry-pick -x 3523dfc

Quick log to see what I’ve done in the last 24 hours?
git config --global alias.standup "log --all --since yesterday --author `git config --global user.email`"

What files has this project got but is ignoring?
git config --global alias.ignored "ls-files --others --ignored --exclude-standard"

Wipe all uncommitted changes
git config --global alias.wipe "reset --hard HEAD"

Edit and squash commits before pushing them
git config --global alias.ready "rebase -i @{u}"



StrataConf & Hadoop World 2012…

A summary of this year’s Strataconf & Hadoop World.
    A fascinating and inspiring conference with use-cases on both sides of an ethical divide – proof that the technologies coming are game-changers in both our industry and in society. Along with some intimidating use-cases I’ve never seen such recruitment efforts at any conference before, from multi-nationals to the CIA. The need for developers and data scientists in Big Data is burning – the Apache Hadoop market is expected to reach $14 billion by 2017.

Plenty of honesty towards the hype and the challenges involved too. A barcamp Big Data Controversies labelled it all as Big Noise and looked at ways through the hype. It presented balancing perspectives from an insurance company’s statistician who has dealt successfully with the problem of too much data for a decade and a hadoop techie who could provide much desired answers to previously impossible questions. Highlights from this barcamp were…

  • One should always use intelligent samples before ever committing to big data.
  • Unix tools can be used but they are not very fault tolerant.
  • You know when you’re storing too much denormalised data when you’re also getting high compression rates on it.
  • MapReduce isn’t everything as it can be replaced with indexing.
  • If you try to throw automated algorithms at problems without any human intervention you’re bound to bullshit.
  • Ops hate hadoop and this needs to change.
  • Respecting user privacy is important and requires a culture of honesty and common-sense within the company. But everyone needs to understand what’s illegal and why.

Noteworthy (10 minute) keynotes…

  • The End of the Data Warehouse. They are monuments to the old way of doing things: pretty packaging but failing to deliver the business value. But Hadoop too is still flawed… A blog post is also available.
  • Moneyball for New York City. How NYC council started combining datasets from different departments with surprising results.
  • The Composite Database, a focus on using big data for product development. To an application programmer the concept of a database is moving from one entity into a componential architecture.
  • Bringing the ‘So What’ to Big Data, a different keynote with a sell towards going to work for the CIA. Big data isn’t about data but changing and improving lives.
  • Cloud, Mobile and Big Data. Paul Kent, a witty speaker, talks about analytics in a new world. “At the end of the day, we are closer to the beginning than we are to the end of this big data revolution… One radical change hadoop and m/r brings is that now we push the work to the data, instead of pulling the data out.”

Noteworthy (30 minute) presentations…

  • The Future – Hadoop-2. Hadoop YARN makes all functional programming algorithms possible, reducing the existing map reduce framework to just one of many user-land libraries. Many existing limitations are removed. Fault-tolerance is improved (namenode). 2x performance improvement on small jobs.
  • Designing Hadoop for the Enterprise Data Center. A joint talk from Cisco and Cloudera on hardware tuning to meet Hadoop’s serious demands. 10G networks help. Dual-attached 1G networks are an alternative. More jobs in parallel will average out network bursts. Data-locality misses hurt the network; consider a data-locality hit rate above 80% good.
  • How to Win Friends and Influence People. LinkedIn presents four of their big data products
        ∘ Year in Review. Most successful email ever – 20% response rate.
        ∘ Network Updates.
        ∘ Skills and Endorsements. A combination of propensity to know someone and the propensity to have the skill.
        ∘ People You May Know. Listing is easy, but ranking required combining many datasets.

    All these products were written in PIG. Moving data around is the key problem. Kafka is used instead of scribe.
  • Designing for data-driven organisation. Many companies who think they are data-driven are in fact metrics-driven. It’s not the same thing. Metrics-driven companies often want interfaces with less data. Data-driven companies have data rich interfaces presenting holistic visualisations.
  • Visualizing Networks. The art of using the correct visualisation and layout. Be careful of our natural human trait to see visual implications from familiarity and proximity – we don’t always look at the legend. A lot of examples using the d3 javascript library.

The two training sessions I attended were Testing Hadoop, and Hadoop using Hive.
Testing Hadoop.
Presented by an old accomplice from the NetBeans Governance board, Tom Wheeler. He presented an interesting perspective on testing, calling it another form of computer security: “a computer is secure if you can depend on it and its software to behave as you expect”. Otherwise I took home a number of key technologies to fill in the gaps between and around our current unit and single-node integration tests on our CountStatistics project: Apache MRUnit for m/r units, MiniMRCluster and MiniDFSCluster for multi-jvm integration clusters, and BigTop for ecosystem testing (pig, hive, etc). We also went through various ways to benchmark hadoop jobs using TeraSort, MRBench, NNBench, TestDFSIO, GridMix3, and SWIM. Lastly we went through a demo of the free product “Cloudera Manager” – a diagnostics UI similar to Cassandra’s OpsCenter.

Hadoop using Hive.
Hive provides an SQL interface to Hadoop. It works out-of-the-box if you’re using HBase, but with Cassandra as our underlying database we haven’t gotten around to installing it yet. The tutorial went through many live queries on an AWS EC2 cluster, exploring the specifics of STRUCTs in schemas, JOINs, UDFs, and serdes. This is a crucial interface to make it easier for others, particularly BI and FINN økosystem, to explore freely through our data. Pig isn’t the easiest way in for outsiders, but everyone knows enough SQL to get started. Fingers crossed we get Hive or Impala installed some time soon…

A number of meet-ups occurred during the evenings, one hosted at AppNexus, a company serving 10 billion real-time ads per day (with a stunning office). AppNexus does all their hadoop work using python, but they also put focus on RESTful Open APIs like we do. The other meetup represented Cassandra, by DataStax, with plenty of free Cassandra beer. The latest benchmarks prove it to be the fastest distributed database around. I was hoping to see more Cassandra at strataconf – when someone mentions big data I think of Cassandra before Hadoop.

Otherwise this US election was in the news as the big data election.

   

Foraging in the landscape of Big Data

 
This is the first article describing FINN.no’s coming of age with Big Data. It’ll cover the challenges giving rise to the need for it, the challenges in accomplishing new goals with new tools, and the challenges that remain ahead.

Big Data is for many just another vague and hyped up trend getting more than its fair share of attention. The general definition, from Wikipedia, is that big data covers the scenario where existing tools fail to process the increasing amount or dimensions of data. This can mean anything from:

      α – the existing tools being poor (while large companies pour $$$ into scaling existing solutions up) or
      β – the status quo requiring more data to be used, to
      γ – a requirement for faster and more connected processing and analysis on existing data.


The latter two are also described as big data’s three V‘s: Volume, Variety, Velocity. If the theoretical definition isn’t convincing you, put it into context against some of today’s big data crunching use-cases…
    • online advertising combining content analysis with behavioural targeting,
    • biomedical’s DNA sequence tagging,
    • pharmaceutics’s metabolic network modelling,
    • health services detecting disease/virus spread via internet activity & patient records,
    • the financial industry ranging from credit scores at retail level to quant trading,
    • insurance companies crunching actuarial data,
    • US defence programs for offline (ADAMS) and online (CINDER) threat detection,
    • environmental research into climate change and atmospheric modelling, and
    • neuroscience research into mapping the human brain’s neural pathways.


On the other hand big data is definitely no silver bullet. It cannot give you answers to questions you haven’t yet formulated (pattern recognition aside), and so it doesn’t give one excuses to store overwhelming amounts of data where the potential value in it is still undefined. And it certainly won’t make analysis of existing data sets initially any easier. In this regard it’s less to do with the difficulty of achieving such tasks and more to do with the potential to solve what was previously impossible.

Often companies can choose their direction and the services they will provide, but in any competitive market where one fails to match the competitor’s offerings it can result in the fall of that company. Here Big Data earns its hype, with many a CEO concerned enough to pay attention. And it probably gives many CEOs a headache, as the possibilities it opens, albeit tempting or necessary, create significant challenges in and of themselves. The multiple vague dimensions of big data also allow the critics plenty of room to manoeuvre.

One can argue that scaling up, buying more powerful machines or more expensive software, solves the problem (α). If all you have is this problem then sure, it’s a satisfactory solution. But ask yourself: are you successfully solving today’s problem while forgetting your future?

“If we look at the path, we do not see the sky.” – Native American saying

One can also argue away the need for such vast amounts of data (β). Through various strategies – more aggressive normalisation of the data, storing data for shorter periods, or persisting data in more ways in different places – the size of each individual data set can be significantly reduced. Excessively normalising data has its benefits and is what one may do in the Lean development approach for any new feature or product. Indeed a simpler datamodel trickles through into a simpler application design, in turn leaving more content, more productive, pragmatic developers. Nothing to be scoffed at in the slightest. But in this context the Lean methodology isn’t about any one state or snapshot in time illustrating the simple and the minimalistic; rather it’s about evolution, the processes involved and their direction. Much like the KISS saying: it’s not about doing it simple stupid but about *keeping* it simple stupid. The question here is how overly normalised data evolves as its product becomes successful and something more complicated over time. Anyone who has had to deal awkwardly with numerous and superfluous tables, joins, and indexes in a legacy application, because it failed over time to continually improve its datamodel due to needs of compatibility, knows what we’re talking about. There is another problem with such legacy applications that follow a datamodel centric design: the datamodel itself becomes a public API, and the many consumers of it create this need for compatibility and the resulting inflexible datamodel. But this is not so much an underlying problem as an overlapping one: one loses oversight of the datamodel in its completeness and of whether it is represented optimally.




It is also difficult to deny the amount of data we’re drowning in today.
“90% of the data that exists today was created in the last two years.. the sheer volume of social media and mobile data streaming in daily, businesses expect to use big data to aggregate all of this data, extract information from it, and identify value to clients and consumers.. Data is no longer from our normalised datasets sitting in our traditional databases. We’re accessing broader, possibly external, sources of data like social media or learning to analyse new types of data like image, video, and audio..” – greenbookblog.org.

And one may also argue that existing business intelligence solutions (can) provide the analysis required from all the already existing datasets (γ). This ignores a future of possibilities: take for example the research going into behavioural targeting, giving glimpses into the challenges of modern marketing as events and trends spark, shift, and evolve through online social media with ever faster frequencies – just the tip of the iceberg when one thinks forward to being able to connect face and voice recognition to emotional pattern matching analysis. But it also defaults to the conservative opinion that business intelligence need be but a post-mortem analysis of events and trends, providing insights intended and required only for internal review. The notion that such large scale analysis is of sole benefit to company strategy must become the telltale sign of companies failing to see how users and the likes of social media change dramatically what is possible in product development today.

The methodologies of innovation therefore change. Real time analysis of user behaviour plays a forefront role in decisive actions on which product features and interfaces will become successful. This is the potential to cut down the risk of product development’s internal guesswork about a product’s popularity at any given point in time. In turn this cuts down time-to-market, bringing the product’s release date closer to its maximum popularity potential moment. Startup companies know that a success doesn’t come only from a cleverly designed product; there is a significant factor of luck involved in releasing the right product at the right time. Large, and very costly, marketing campaigns can only extend, or synthetically create, this potential moment by so much.

This latter point around the extents and performance of big data analysis and the differentiation it creates between business analysis versus richer, more spontaneous, product development and innovation creates for FINN an important and consequential factor to our forage into big data. Here at FINN a product owner recently said: “FINN is already developing its product largely built around the customer feedback and therefore achieving continuous innovation?”. Of course what he meant to say was “from numbers we choose to collect, we generate the statistics and reports we think are interesting, and from these statistics we freely interpret them to support the hypothesis we need for further product development…” We couldn’t be further from continuous innovation if it hitched a lift to the other side of Distortionville¹.

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts” – Arthur Conan Doyle

FINN isn’t alone here; I’d say most established companies, while trying to brand themselves as innovative, are still struggling to break out of their existing models of product development based upon internal business analysis. No one said continuous innovation was easy, and there’s a lot of opinions out on this, but along with shorter deployment cycles I reckon there are two keywords: truth and transparency. Tell your users the truth and watch how they respond. For example give them the statistics that show them their ads stop getting visits after a week, and then observe how they behave, and to which solution they flock to regain traffic to their ads. Don’t try and solve all their problems for them, rather try to enable them. You’ll probably make more money out of them, and by telling them the truth you’ve removed a vulnerability, or what some like to refer to as “a business secret”.

There’s also a potential problem with organisational silos. Large companies having invested properly in business intelligence and data warehousing will have assigned these roles of data collection, aggregation, and analysis to a separate team or group of experts: typically trained database administrators, statisticians, and traffic analysts. They are rarely the programmers; the programmers are on the front lines building the product. Such a split can parallel the sql vs nosql camps. This split against the programmers, whom you rely on to make continuous innovation a reality, can run the risk of stifling any adoption of big data. With the tools enabling big data the programmers can generate reports and numbers previously only attainable from the business intelligence and data warehousing departments, and can serve them to your users at web response times – integrating such insights and intelligence into your product. Such new capabilities don’t supersede these traditional departments; rather they need everyone accepting the new: working together to face new challenges with old wisdom. The programmers working on big data, even if tools and data become shared between these two organisational silos, cannot replace the needs of business intelligence any more than business intelligence can undertake big data’s potential. As data and data sources continue to increase year after year, the job of asking the right questions, even knowing how to formulate the questions correctly, needs all hands on deck. Expecting your programmers to do it all might well swamp them into oblivion, but it isn’t just the enormity of the new challenges involved, it’s that these challenges have an integral nature to them that programmers aren’t typically trained to tackle. Big Data can be used as an opportunity not only to introduce exciting new tools, paradigms, and potential into the company but as a way to help remove existing organisational silos.

The need for big data at FINN came from a combination of (α) and (γ). The statistics we show users for their ads had traditionally been accumulated and stored in a sybase table. These statistics included everything from page views, “tip a friend” emails sent, clicks on promoted advertisement placements, ads marked as favourite, and whatever else you can imagine.

(α) FINN is a busy site, the busiest in Norway, and we display ~50 million ad pages each day. Like a lot of web applications we had a modern scalable presentation and logic tier based upon ten tomcat servers but just one not-so-scalable monster database sitting in the data tier. The sybase procedure responsible for writing to the statistics table ended up being our biggest thorn. It used 50% of the database’s write execution time, 20% of total procedure execution time, and the overall load it created on the database accounted for 30% of ad page performance. It was a problem we had lived with long enough that Operations knew quickly to turn off the procedure if the database showed any signs of trouble. Over one period of troublesome months Operations wrote a cronjob to turn off the procedure automatically during peak traffic hours – when ads were receiving the most traffic we had to stop counting altogether, embarrassing to say the least!

(γ) On top of this, product owners in FINN had for years been requesting that we provide statistics on a per-day basis. We had tinkered with this idea in the existing table for some of the numbers that didn’t accumulate so high, eg “tip a friend” emails, but for page viewings this was completely out of the question – not even the accumulated totals were working properly.

At the time we were in the process of modularising the FINN web application. The time was right to turn statistics into something modern and modular. We wanted an asynchronous, fault-tolerant, linearly scaling, and durable solution. The new design uses the Command-Query Separation pattern by using two separate modules: one for counting and one for displaying statistics. The counting achieves asynchronicity, scalability, and durability by using Scribe. The backend persistence and statistics module achieves all goals by using Cassandra and Thrift. As an extension of the push-on-change model, the counting stores denormalised data which is later normalised into the views the statistics module requires; for this we use MapReduce jobs within a Hadoop cluster.
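
To make the command-query split a little more concrete, here is a purely hypothetical sketch of the two module boundaries (the names are illustrative only, not the real API): the command side just records denormalised counts and hands them off asynchronously, while the query side reads the per-day views the Hadoop jobs have normalised into Cassandra.

    // hypothetical sketch of the command/query split; names are illustrative only.

    /** Command side: fire-and-forget counting, handed off asynchronously (via Scribe). */
    interface EventCounter {
        void countPageView(long adId);
        void countTipAFriend(long adId);
    }

    /** Query side: reads the normalised, per-day statistics produced by the MapReduce jobs. */
    interface AdStatisticsQuery {
        /** Accumulated total page views for an ad. */
        long totalPageViews(long adId);

        /** Page views for an ad on a given day, on the form "2012-11-20". */
        long pageViewsForDay(long adId, String day);
    }

Because the command side returns nothing, it can be buffered and retried behind the scenes; the display of statistics only ever goes through the query side.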


The resulting project is kick-ass, let me say. Especially Cassandra, it is a truly amazing modern database: linear scalability, decentralised, elastic, fault-tolerant, durable; with a rich datamodel that provides often superior approaches to joins, grouping, and ordering than traditional sql. But we’ll spend more time describing the project in technical details in a later article.

A challenge we face now is broader adoption of the project and the technologies involved. Various departments, from developers to tech support, want to read the data, regardless of whether it is traditional or ‘big data’, and the habit was always to read it directly from production’s Sybase. And it’s a habit that’s important in fostering a data-driven culture within the company, without having to encourage datamodel centric designs. With our Big Data solution this hasn’t been so easy. Without this transparency to the data, developers, tech support, and product owners alike seem to be failing to initiate further involvement. To solve this, since our big data is stored in Cassandra, we’d love to see a read-only web-based gui interface based on caqel.

…to be continued…








the ultimate view — Tiles-3

A story of getting the View layer up and running quickly in Spring…

Since the original article, parts of the code have been accepted upstream and are now available as part of the Tiles-3 release, so the article has been updated — it’s all even simpler!


Based upon the Composite pattern and Convention over Configuration we’ll pump steroids into
   a web application’s view layer
      with four simple steps using Spring and Tiles-3
         to make organising large complex websites elegant with a minimum of xml editing.



 




Background

At FINN.no we were redesigning our control and view layers. The architectural team had decided on Spring-Web as a framework for the control layer due to its flexibility and for providing us a simpler migration path. For the front end we were a little unclear. In a department of ~60 developers we knew that the popular vote would lead us towards SiteMesh. And we knew why – for practical purposes sitemesh gives the front end developer more flexibility and definitely less xml editing.
But sitemesh has some serious shortcomings…

SiteMesh shortcomings:
  • from a design perspective the Decorator pattern can undermine the separation MVC intends,
  • requires buffering all possible html for a request, demanding large amounts of memory,
  • unable to flush the response before the response is complete,
  • requires more overall processing due to all the potentially included fragments,
  • does not guarantee thread safety, and
  • does not provide any structure or organisation amongst jsps, making refactorings and other tricks awkward.

One of the alternatives we looked at was Apache Tiles. It follows the Composite Pattern, but within that allows one to take advantage of the Decorator pattern using a ViewPreparer. This meant it provided by default what we considered a superior design, but if necessary could also do what SiteMesh was good at. It already had integration with Spring, and the way it worked meant that once the Spring-Web controller code was executed, Spring’s view resolver would pass the model onto Tiles, letting it do the rest. This gave us a clear MVC separation and an encapsulation ensuring thread safety within the view domain.

“Tiles has been indeed the most undervalued project in the past decade. It was the most useful part of struts, but when the focus shifted away from struts, tiles was forgotten. Since then struts has been outpaced by spring and JSF, however tiles is still the easiest and most elegant way to organize a complex web site, and it works not only with struts, but with every current MVC technology.” – Nicolas Le Bas

Yet the best Tiles was going to give us wasn’t realised until we started experimenting a little more…

Profiling and debugging view templates



Ever needed to profile the tree of JSPs rendered server-side?
  Most companies do and I’ve seen elaborate and rather convoluted ways to do so.

With Tiles-3 you can use the PublisherRenderer to profile and debug not just the tree of JSPs but the full tree of all and any view templates rendered whether they be JSP, velocity, freemarker, or mustache.

At FINN all web pages print such a tree at the bottom of the page. This helps us see what templates were involved in the rendering of that page, and which templates are slow to render.



We also embed into the html source wrapping comments like

<!-- start: frontpage_geoUserData.jsp -->
...template output...
<!-- end: frontpage_geoUserData.jsp :it took: 2ms-->


The code please

To do this, register and then attach your own listener to the PublisherRenderer. For example in your TilesContainerFactory (the class you extend to set up and configure Tiles) add to the method createTemplateAttributeRenderer something like:

    @Override
    protected Renderer createTemplateAttributeRenderer(BasicRendererFactory rendererFactory, ApplicationContext applicationContext, TilesContainer container, AttributeEvaluatorFactory attributeEvaluatorFactory) {
 
        Renderer renderer = super.createTemplateAttributeRenderer(rendererFactory, applicationContext, container, attributeEvaluatorFactory);
        PublisherRenderer publisherRenderer = new PublisherRenderer(renderer);
        publisherRenderer.addListener(new MyListener());
        return publisherRenderer;
    }

Then implement your own listener; this implementation does just the wrapping comments with profiling information…

class MyListener implements PublisherRenderer.RendererListener {

    @Override
    public void start(String template, Request request) throws IOException {
        boolean first = null == request.getContext("request").get("started");
        if (!first) {
            // a nested template: write the start comment and start timing it
            request.getPrintWriter().println("\n<!-- start: " + template + " -->");
            startStopWatch(request);
        } else {
            // the outermost template: skip the comment so nothing is written before its doctype tag
            request.getContext("request").put("started", Boolean.TRUE);
        }
    }

    @Override
    public void end(String template, Request request) throws IOException {
        Long time = stopStopWatch(request);
        if (null != time) {
            request.getPrintWriter().println("\n<!-- end: " + template
                                         + " :it took: " + time + "ms -->");
        }
    }

    private void startStopWatch(Request request) {
        // the request scope is a Map<String,Object>, so the stack needs a cast when read back
        @SuppressWarnings("unchecked")
        Deque<StopWatch> stack = (Deque<StopWatch>) request.getContext("request").get("stack");
        if (null == stack) {
            stack = new ArrayDeque<StopWatch>();
            request.getContext("request").put("stack", stack);
        }
        StopWatch watch = new StopWatch();
        stack.push(watch);
        watch.start();
    }

    private Long stopStopWatch(Request request) {
        @SuppressWarnings("unchecked")
        Deque<StopWatch> stack = (Deque<StopWatch>) request.getContext("request").get("stack");
        // null or empty for the outermost template, which was never timed
        return null != stack && 0 < stack.size() ? stack.pop().getTime() : null;
    }
}

It’s quick to see the possibilities for simple and complex profiling open up here as well as being agnostic to the language of each particular template used. Learn more about Tiles-3 here.

Putting a mustache on Tiles-3

We’re proud to see a contribution from one of our developers end up in the Tiles-3 release!

The front-end architecture of FINN.no is evolving to be a lot more advanced and a lot more work is being done by client-side scripts. In order to maintain first time rendering speeds and to prevent duplicating template-code we needed something which allowed us to reuse templates both client- and server-side. This is where mustache templates have come into play. We could’ve gone ahead and done a large template framework review, like others have done, but we instead opted to just solve the problem with the technology we already had.

Morten Lied Johansen’s contribution allows Tiles-3 to render mustache templates. Existing jsp templates can be rewritten into mustache without having to touch surrounding templates or code!

The code please

To get Tiles-3 to do this, include the tiles-request-mustache library and configure your TilesContainerFactory like:

    protected void registerAttributeRenderers(...) {
        MustacheRenderer mustacheRenderer = new MustacheRenderer();
        // escape the dot so only *.mustache templates are accepted
        mustacheRenderer.setAcceptPattern(Pattern.compile(".*\\.mustache"));
        rendererFactory.registerRenderer("mustache", mustacheRenderer);
        ...
    }
    protected Renderer createTemplateAttributeRenderer(...) {
        final ChainedDelegateRenderer chainedRenderer = new ChainedDelegateRenderer();
        chainedRenderer.addAttributeRenderer(rendererFactory.getRenderer("mustache"));
        ...
    }

then you’re free to replace existing tiles attributes like

<put-attribute name="my_template" value="/WEB-INF/my_template.jsp"/>

with stuff like

<put-attribute name="my_template" value="/my_template.mustache"/>

Good stuff FINN!

Dependency Injection with constructors?

Pic of Neo/The Matrix





The debate over whether to use constructors, setters, fields, or interfaces for dependency injection is often heated and opinionated. Should you have a preference?


The argument for Constructor Injection

We had a consultant working with us who kept reminding us to prefer constructor injection. Indeed we had a large code base predominantly using setter injection, because that is what the Spring community recommended in the past.

The arguments for constructor injection go like this:

  • Dependencies are declared public, providing clarity in the wiring of Dependency Inversion,
  • Safe construction: what must be initialised must be supplied,
  • Immutability: fields can be declared final, and
  • Clear indication of complexity through the number of constructor parameters.

And the position holds that setter injection can still be used when needed: for cyclic dependencies, for optional and re-assignable dependencies, to support multiple or complicated variations of construction, or to free up the constructor for polymorphism purposes.
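As a minimal sketch of the contrast, with hypothetical class names and no framework in sight: constructor injection gives a final field and guaranteed initialisation, while the setter variant leaves both to discipline.

    // a hypothetical dependency
    interface AdRepository {}

    // constructor injection: the dependency is mandatory, visible in the signature,
    // and the field can be declared final
    class AdService {
        private final AdRepository repository;

        AdService(AdRepository repository) {
            this.repository = repository;
        }
    }

    // setter injection: construction can succeed with the dependency missing,
    // and the field can be re-assigned later
    class AdServiceSetterStyle {
        private AdRepository repository;

        void setRepository(AdRepository repository) {
            this.repository = repository;
        }
    }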

Being a big fan of Inversion of Control, but not overly fond of Dependency Injection frameworks, something smelt wrong to me. Yet within the narrow debate of constructor versus setter injection I don't disagree that constructor injection has the advantage. Having recently been using Spring's dependency injection through annotations, and building a favouritism towards field injection, I was happy to get the chance to ponder it over, to learn and to be taught new things. What was it I was missing? Is there a bigger picture?

API vs Implementation

If there is a bigger picture it has to be around the Dependency Inversion argument, since this is known to be potentially complex. The point of using constructor injection here is that 1) through a public declaration and injection of dependencies we build an explicit graph showing the dependency inversion throughout the application, and 2) even if the application is wired magically by a framework, such injection can still be done the same way without the framework (eg when writing tests). The latter (2) is interesting in that the requirement on “dependency injection” is itself inverted: the framework providing dependency injection is removed from the architectural design and becomes solely an implementation detail. But it is the graph in (1) that becomes the important facet in the following analysis.

With this dependency graph in mind, what happens when we bring into the picture a desire to distinguish between API and implementation design?

The DI graph we are asked to clarify through constructor injection will fall into one of two categories:
  • ‘implementation-specific’, where an interface defines the public API and the DI is held private by the constructor in the implementation class, and
  • ‘API-specific’, where the class has no interface. Here everything public is fully exposed API; there is no implementation-protected visibility for injectable constructors.

By introducing the constraint of only ever using constructor-based injection, in the pursuit of a clarified dependency graph, you remove, or at least make more difficult, the ability to publicly distinguish between API and implementation design.
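A sketch of the two categories, reusing the hypothetical AdService: behind an interface the injectable constructor stays an implementation detail, while in the interface-less variant the constructor, and with it the AdRepository dependency, becomes part of the published API.

    interface AdRepository {}
    class Ad {}

    // ‘implementation-specific’: the interface is the public API,
    // the injectable constructor is hidden in a package-private implementation class
    interface AdService {
        Ad findAd(long id);
    }

    class DefaultAdService implements AdService {
        private final AdRepository repository;

        DefaultAdService(AdRepository repository) {
            this.repository = repository;
        }

        @Override
        public Ad findAd(long id) {
            return new Ad(); // lookup via repository elided
        }
    }

    // ‘API-specific’: no interface, so the injectable constructor is fully exposed,
    // leaking the AdRepository implementation detail to every caller
    final class AdApi {
        private final AdRepository repository;

        public AdApi(AdRepository repository) {
            this.repository = repository;
        }

        public Ad findAd(long id) {
            return new Ad(); // lookup via repository elided
        }
    }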

This distinction between API and implementation is important in being able to create a simple API. The previous blog post “using the constretto configuration factory” is a coincidental example of this. I think the work in Constretto has an excellent implementation design, but this particular issue raised frustrations that the API was not as simple as it could have been. Indeed, to obtain the “simplest API”, Constretto (intentionally or not) promotes the use of Spring's injection, a loose coupling that can be compared to reflection. It may be that our usage of Constretto's API, where we wanted to isolate groups of properties, was not what the author originally intended, but this only reinforces the need for designing the simplest possible API.

Therefore it is important to sometimes have all dependency injection completely hidden in the implementation. A clean elegant API must take precedence over a clean elegant implementation. And to achieve this one must first make that distinction between API and Implementation design.

Taking this further we can introduce the distinction between API and SPI. A good practice here is to stick to final classes for APIs and interfaces for SPIs. By the same argument as above, an SPI can't use constructor injection, because interfaces don't have constructors.
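In code the practice looks roughly like this, with hypothetical names: a final class for the API, an interface for the SPI, and since an interface has no constructor its dependencies have to arrive by some other style of injection.

    // API: a final class with a fixed, published surface
    final class ImageScaler {
        public byte[] scale(byte[] image, int width) {
            return image; // real scaling elided
        }
    }

    // SPI: an interface that integrators implement and plug in;
    // there is no constructor here to inject anything into
    interface ImageStore {
        void save(String key, byte[] image);
    }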

Inversion-of-Control vs Dependency-Injection

What about the difference between IoC and DI? They are overlapping concepts, and the subtlety between “the contexts” and “the dependencies” is rarely emphasised enough. (Java EE 6 has tried to address the distinction between contexts and dependencies at the implementation level with the CDI spec.) The difference between the two, nuanced as it may be, can help illustrate that the DI graph in any application deserves attention in multiple dimensions.

   

Drawing an application’s architecture as a graph, where the vertical axis represents the request stack (that which is typically categorised into the architectural layers view, control, and model/services) and the horizontal axis represents the broadness of each architectural layer, it can be demonstrated that:

  • IoC generally forms the passing and layering of contexts downwards.

  • The API-specific DI fulfils the layering of such contexts, and these contexts can be dependencies directly or helper classes holding such dependencies. Such dependencies must therefore be initially defined high up in the stack.

  • The implementation-specific DI is at most only visible inside each architectural layer, and is the DI represented horizontally on the graph. Possibly still within the definition of IoC, it can also be considered a “wiring of collaborating components”. The need for clarity in the dependency graph isn’t as critical here, so applications often tend towards Service Locators, Factories, and injectable Singletons. On the other hand, many of the existing Service Locator implementations have been poor enough to push people towards dependency injection (and were possibly an instigator for its initial implementations).

  • Constructor injection works easily horizontally, especially when instantiation of objects is under one’s control, but has potential hurdles when working vertically down through the graph. Sticking to constructor injection horizontally can also greatly help when the wiring of an application is difficult, by ensuring that dependency injection has succeeded by the time each object is constructed. Missing setter, field, or interface injection, and missing Service Locator lookups, may not report an error until actually used at runtime.

A simple illustration of the difficulty with vertical constructor injection is to look at these helper contexts and how they may layer contexts through delegation rather than repetitive instantiation, a pattern more applicable for an application with a deep, narrow graph. This is a pattern that has often relied on proxy classes.

Another illustration is when instantiating the initial context at the very top of the request/application stack requires instantiating all the implementations of dependencies used in contexts further down the stack. This is where dependency inversion explodes: the IoC becomes up-front and explicit, and the encapsulation of implementation is lost through an unnecessary leak of abstractions. A parallel problem is trying to apply checked exceptions up through the request stack: one answer is that we need different checked exceptions per architectural layer (another answer is anchored exceptions). With dependencies we would end up requiring different dependency types per architectural layer, and this could lead to dependency types from inner domains needing to be declared in the outer domains. We can instead declare resource loaders in the initial context and then let each architectural layer build its own context from scratch, with dependencies constructed from configuration. But this comes full circle, back to a design very similar to a service locator. Something similar has happened with annotations: by bringing Convention over Configuration to DI, what was once loose wiring with xml has become the magic of the convention, and begins to resemble the service locator or naming lookups too.
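To make the explosion concrete, here is a rough sketch, with hypothetical stub classes, of what wiring the top of the stack looks like when every layer below insists on constructor injection: the outermost layer ends up naming jdbc urls and repository implementations it has no business knowing about.

    // stub types standing in for real implementations
    interface AdRepository {}
    interface UserRepository {}
    class JdbcAdRepository implements AdRepository { JdbcAdRepository(String jdbcUrl) {} }
    class JdbcUserRepository implements UserRepository { JdbcUserRepository(String jdbcUrl) {} }
    class AdService { AdService(AdRepository repository) {} }
    class UserService { UserService(UserRepository repository) {} }
    class FrontPageController { FrontPageController(AdService ads, UserService users) {} }

    class TopOfTheStackWiring {
        public static void main(String[] args) {
            // the very top of the request/application stack now carries
            // implementation details from every layer beneath it
            String jdbcUrl = "jdbc:postgresql://db.finn.no/finn";
            FrontPageController controller = new FrontPageController(
                    new AdService(new JdbcAdRepository(jdbcUrl)),
                    new UserService(new JdbcUserRepository(jdbcUrl)));
        }
    }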

follow the white rabbit/The Matrix


For a legacy application this likely becomes all too much: the declaring of all dependencies required throughout all these contexts; and so relying on a little loose-coupling magic (be it reflection or Spring injection) is our way out. Indeed this seems to be one of the reasons Spring dependency injection was introduced into FINN.
And so we’ve become less worried about the type of injection used…

Broad vs Deep Applications

FINN.no is generally a broad application with a shallow contextual stack. It has the traditional view-control-model design, and the services inside the model layer typically interact directly with the data stores, and maybe with one or two peer services.

Focusing on the interfaces to the services, we see there is a huge amount of public API available to the controller layer and very little in defined contexts, except a few parameters, or maybe the whole parameter map, and the current user object. There is therefore very little inversion of control in our contexts; it is often just parameterisation. (Why we so often use interfaces to define service APIs is interesting, since we usually have no intention of client code supplying its own implementations; it is definitely not SPIs that are being published. Such interfaces are used as a poor-man's simplification of the API declaration of public methods within the final classes. Albeit these interfaces do make it easy to create stubs and mocks for tests.)

In this design the implementation details of service-layer dependencies are rarely passed down through contexts, but rather hard-baked into the application. And in a product like FINN they probably always will be hard-baked in. Hard-baked here doesn't mean they can't be changed or mocked for testing, but that they are not dynamic components, they are not contextual, and so do not belong in the architectural design of the application.

In such a broad architectural layer I can see two problems in trying to obtain a perfect DI graph:

  • cyclic dependencies: bad, but forgiven when existing between peers within a group. In this case constructor injection fails. We can define one as the lesser or auxiliary service and fall back to setter/field injection just for it, but if they are real, equal peers this could be a bullet-in-the-foot approach, and using field injection for both, with documentation, might be the better approach (see the sketch after this list).

  • central dependencies: these are the “core” dependencies used throughout the bulk of the services: the database connection, resource loaders, etc. If we enforce these to be injected via constructors then we are in turn enforcing a global store of them. Such a global store would typically be implemented as a factory or singleton, and then what is the point of injection? Worse yet, this could encourage us to start passing the Spring application context around through all our services. A service locator may better serve our purpose…
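A sketch of the cyclic-peers case, with hypothetical services and Spring annotations assumed: there is no construction order that satisfies constructor injection here, so both sides fall back to documented field injection and let the container close the loop.

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    class UserService {
        // equal peer: MessagingService also depends on UserService,
        // so the cycle is wired by field injection rather than constructors
        @Autowired
        private MessagingService messagingService;
    }

    @Service
    class MessagingService {
        @Autowired
        private UserService userService;
    }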

Hopefully by now you've guessed that we really should be more interested in modularisation of the code. Breaking up this very broad services layer into appropriate groups is an easier and more productive first step to take. And during this task we have found that discovering and visualising the DI graph is not the problem: untangling it is. Constructor injection can be used to prevent these tangles, but so can tools like maven reporting and sonar. This suggests that the DI graph is actually more easily visualised through a class's import statements than through constructor parameters.

With modularisation we can minimise contexts, isolate dependency chains, publish contextual inversion of control into APIs, declare interface-injection for SPIs, and move dependency injection into wired constructors.

   

Back to Constructor injection

So it's true that constructor injection goes beyond just DI in being able to provide some IoC. But it alone cannot satisfy Inversion of Control in any application unless you are willing to overlook API and SPI design. DI is not a subset or union of IoC: it has uses horizontally and in loose-coupling configuration. And IoC is not a subset or union of DI: to insinuate as much would mean IoC could only be implemented using Spring beans, leading to an application of only Spring beans and singletons. In the latter case IoC will often be forgotten outside the application's realm of DI.

Constructor injection is especially valid when code is to be used via both Spring injection and manual injection, and it does make test code more natural Java code. But imagine manually injecting every Spring bean in an oversized, legacy, broad-stack application using constructor injection without the Spring framework. Is this really a possibility, let's be serious? What you would likely end up with is one massive factory with initialisation code constructing all services instead of the Spring xml, and lookups against this factory on every request. What's the point? This isn't where IoC is supposed to take us.

If code is being moved towards a distributed and modular architecture you should be aware of how it clashes with the DI fan club.

If code is in development, and you are uncertain whether a dependency should be obtained through a service locator or declared public to give dependency inversion, and in the spirit of lean you think it smart not to make that decision yet, then field injection can be the practical solution.

And just maybe you are not looking to push the Dependency Inversion out into the API, and because you think of Spring's ApplicationContext (or BeanFactory) as your application's Service Locator, you use field injection as a way to automate service locator lookups.
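The two styles end up close to equivalent, as this sketch with a hypothetical AdService bean shows: an explicit lookup against the ApplicationContext, versus letting field injection perform the same lookup for you.

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.ApplicationContext;

    class AdService {} // hypothetical bean

    // explicit: the ApplicationContext treated as a service locator
    class SearchControllerWithLocator {
        private final AdService adService;

        SearchControllerWithLocator(ApplicationContext context) {
            this.adService = context.getBean(AdService.class);
        }
    }

    // implicit: field injection automates the very same lookup
    class SearchControllerWithFieldInjection {
        @Autowired
        private AdService adService;
    }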

For the majority of developers, for the majority of the time, you will be writing new code, not caring whether dependency injection trashes inversion of control, wanting lots of easy-to-write tests, and not worrying about API design, so it's ok to have a healthy preference towards constructor injection…

  Pic of Morpheus/The Matrix  


Keep questioning everything…
  …by remaining focused on what is required from the code at hand we can be pragmatic in a world full of rules and recommendations. This isn't about laziness or permitting poor code, but about being the idealist: the person who knows the middle way between the pragmatist and the ideologue. By knowing what can be dropped, when, and for how long, we can incrementally evolve complex code towards a modular design in a sensible, sustainable, and practical way.
  In turn this means the programmer gets the chance to catch their breath and remember that paramount to their work are the people: those who will develop against the design and the end-users of the product.





Credits:
A large and healthy dose of credit must go to Kaare Nilsen for being a sparring partner in the discussion that led up to this article.
