FINN tech blog


Log4j2 in production – making it fly



Now that log4j2 is the predominant logging framework in use at FINN.no, why not share the good news with the world and provide a summary of how we introduced this exciting new technology into our platform.

let’s use just one logging framework

At the beginning of September 2013 it became the responsibility of one of our engineering teams to introduce a “best practice for logging” for all of FINN.

The proposal put forth was that we could standardise on one backend logging framework, while there would be no need to standardise on the abstraction layer used directly in our code.

The rationale for not standardising the logging abstractions was…
  • nearly every codebase, through a tree of dependencies, already includes all the different logging abstraction libraries, so the hard work of configuring them against the chosen logging framework is still required,
  • the different APIs of the different logging abstractions are not difficult for programmers to switch between when moving from one project to another.

While the rationale for standardising the logging framework was…
  • makes life easier for programmers with one documented “best practice” for all,
  • makes it possible through an in-house library to configure all abstraction layers, creating less configuration for programmers,
  • makes life easier for operations knowing all jvms log the same way,
  • makes life easier for operations knowing all log files follow the same format.

Log4j2 wins hands down

Log4j2 was chosen as the logging framework given…
  • it provided all the features that logback was becoming popular for,
  • between the old log4j and logback, it was the only framework written in java with modern concurrency (i.e. not hard synchronised methods/blocks),
  • it provided a significant performance improvement (1000-10000 times faster),
  • it had a more active community (logback has been announced as the replacement for the old log4j, but log4j2 saw new momentum in its apache community).

This proposal was checked with a vote by FINN programmers
        – 73% agreed, 27% were unsure, no-one disagreed.

when nightly compression jams

Earlier on in this process we hit a bug with the old log4j where nightly compression of already rotated logfiles was locking up all requests in any (or most) jvms for up to ten seconds. This fault came down to poor java concurrency code in the original log4j (which logback cloned), exacerbated by us having scores of jvms for all our different microservices running on the same machines, so that when nightly compression kicked in it did so all in parallel. Possible fixes here were to
  a) stop compression of log files,
  b) make loggers async, or
  c) migrate over quickly to log4j2.

After some investigation, (c) was ruled out because no logstash plugin for log4j2 was ready, and moving forward without the json logfiles and the logstash & kibana integration was not an option. (a) was chosen as a temporary solution.

ready, steady, go…

Later on, when we started the work of upgrading all our services from thrift-0.6.1 up to thrift-0.9.1, we took the opportunity to kill two birds with one stone. Log4j2 was out of beta, and we had ironed out the issues around the logstash plugin.

We’d be lying if we told you it was all completely pain free:
 introducing Log4j2 came with some concerns and hurdles.

    • Using a release candidate of log4j2 in production led to some concerns. So far the only consequence was slow startup times (e.g. even small services paused for ~8 seconds during startup). This was due to log4j2 having to scan all classes for possible log4j plugins. This problem was fixed in 2.0-rc2. On the bright side – our use of the release candidate meant we spotted an issue early and provided a patch so that the upcoming initial release of log4j2 supports shaded jarfiles, which we are heavily dependent on.
    • Operations had expressed concerns over nightly compression, raised from the earlier problem: even if code no longer blocks while the compression happens in a background thread, the amount of parallel compression spawned could lead to IO contention, which in turn leads to CPU contention. Because of this very real concern extensive tests have been executed; so far they’ve shown no measurable (under 1ms) impact upon services within the FINN platform. Furthermore this problem can easily be circumvented by adding the SizeBasedTriggeringPolicy to your appender, thereby enforcing a limit on how much parallel compression can happen at midnight.
    • The new logstash plugin (which FINN has actively contributed to on github) caused a few breakages to the format expected by our custom logstash parsers written by operations. Unfortunately this parser is based on the old log4j format, which we are trying to escape from. Breakages here were: log events on separate lines, avoiding commas at the end of lines between log events, thread context in the wrong format, etc. These were tackled with pull requests on github and patch versions of our commons-service (the library used to pre-configure the correct dependency tree for log4j2 artifacts and properly plug in all the different logging abstraction libraries).
    • Increased memory from switching from sync loggers to async loggers impacted services with very small heaps. The async logger used is based on the lmax-disruptor, which pre-allocates its ringBuffer at its maximum capacity. By default this ringBuffer is configured to queue at maximum 256k log events. This can be adjusted with the “AsyncLoggerConfig.RingBufferSize” system property, as sketched below.
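
To make the memory point concrete, here is a minimal sketch of how a small-heap service might be started with a smaller ring buffer. The jar name and the value are placeholders, and the exact property name depends on how async logging is enabled in your configuration, so check the log4j2 manual for your version before relying on it:

# Cap the lmax-disruptor ring buffer pre-allocated by async logger configs;
# the default of 256k slots can be too much for a very small heap.
# (my-service.jar is a placeholder; values are typically powers of two.)
java -DAsyncLoggerConfig.RingBufferSize=16384 -jar my-service.jar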

simply beautiful

To wrap it up: the hurdles have been there, but they were trivial and easy to deal with, while the benefits of introducing log4j2, and moving to async loggers, make it well worth it…
    • The “best practice” for log4j2 included changing all loggers to be async, which means that the performance of the FINN platform (which consists primarily of in-memory services) is no longer tied to and affected by how disks are performing (it was crazy that it was before).
    • More and more applications are generating consistent logfiles according to our best practices.
    • More and more applications are actually plugging in the various different logging abstractions used by all their various third-party dependencies.
    • All the advantages people liked about logback.
    • An easier approach to changing loglevels at runtime through jmx.
    • Profiling applications in crisis is easier for outsiders (one less low-level behavioural difference).
    • Loggers are no longer a visible bottleneck in jvms under duress.
    • And naturally the performance increase.

Because of the significant performance gains provided by the lmax-disruptor we also use the open-sourced statsd client that takes advantage of it.

Conquering the kingdom of Java with a NodeJS trojan horse


Most of FINN.no is built with Java and we’ve been a Java shop for 6-7 years. With the professionalization of front-end development we saw that the tools available in the Java Web Development ecosystem were preventing us from delivering rapidly and with high quality.
The trend was that the tools we used weren’t keeping up with the pace set by people creating tools in Node‘s NPM module ecosystem. When working on pet projects we could use these tools, but in our daily job we were working with second rate tools. This was a frustration for many of us, so we set out to work out how we could utilize Node without having to rewrite all our applications.

In the belly of the beast that is Maven

We use Maven to build and test all our applications. This includes continuous deployment and quality analysis. Our reporting tools require projects to be built with Maven in order to work. To introduce Node-based tools for front-end tasks, we had to be able to run them from Maven.

The easiest way to get started is just to use the Maven exec plugin. It allows you to run commands just as you would on the command line. This approach worked, but there was one catch: it relied upon Node + NPM being installed on every developer machine as well as in our CI environment. Forcing 100+ developers to install Node and keep up with versions wasn’t really acceptable to the rest of the organization. Enter the Frontend Maven Plugin.
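
Before we get to that plugin, for reference, the plain exec-plugin invocation looks roughly like this – a sketch only, where the grunt task is just an assumption for the example and node plus the project’s npm dependencies must already be installed on the machine running the build:

# run a Node-based build step through the exec plugin (node must be on the PATH)
mvn exec:exec -Dexec.executable=node -Dexec.args="node_modules/.bin/grunt build"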

Behold! The Frontend Maven plugin

Erik Sletteberg, who started out working as a summer intern and is now working at FINN.no full time, helped solve the problem of Node having to be installed. He created the Frontend Maven Plugin, which installs Node+NPM in the project folder. The plugin also has tasks for running NPM, Grunt, Gulp and similar.
Thanks to this plugin we were able to use Node within a Maven web application module, which was awesome! It meant we could use build, test and code-analysis tools developed in Node and have them report into our existing tool chain.

Independence with Standalone Node modules

In the Java stack there is no real support for distributing front end dependencies for things like JavaScript, CSS, etc. There are plugins for things like Require-js for JavaScript, but they were all suffering from the same problem: the plugins couldn’t keep up with the pace set by the Node ecosystem when it came to iterating on these kinds of tools.
Our setup at the time was to split out any front-end resource which needed to be distributed into Maven Web Application modules and include them like any other Maven dependency in the POM. The release process for Maven artifacts is tedious. When importing dependencies into an existing Maven module you had to perform all kinds of URL rewriting to make paths correct. In short, this was not a very smooth way of sharing client side resources.

Having Node and NPM running inside Maven, it was only natural for us to look at how we could create Common JS modules of our existing JS modules which were now in Maven. The short term goal was to get rid of Maven while developing these modules. The long term goal is to have client side dependencies declared as regular Node dependencies in projects.

Maven-deploy

In order to make the transition as smooth as possible we needed Node modules which we were able to deploy to Maven repositories in addition to our private NPM registry. Gregers G. Rygg from the Front-end Core Team set out to create the maven-deploy module which solves this (if you use Gulp, you can use the gulp-maven-deploy plugin). It can either install the artifact locally under M2_HOME or deploy it to a remote repository. This way we could have Node modules behave as if they were Maven modules, which in turn meant that existing Java Web Applications didn’t have to be changed in order for us to use Node.

In closing

This was by no means the fastest way to introduce Node into an existing architecture. If you’re a small shop with few people, you should adopt a more aggressive approach. Our approach however works beautifully for those of you working in larger organizations where you need to do things in a more subtle way. Taking small steps and always making sure everyone can perform their job as normal was one of our keys to succeeding. If we’d gone all out and created additional work for many developers, our Node adventure would’ve never happened.

Espen Dalløkken from the Front-end Core Team did a short lightning talk about this very subject at Booster Conference 2014 in Bergen:

Conquering the wicked kingdom of Java with a NodeJS Trojan horse – Espen Dalløkken from Booster conference on Vimeo.

We love NPM!


What? Why is FINN.no donating to scale NPM? I thought you guys were a pure Java shop? It is true, we used to be a pure Java shop. However, over the past three years we have adopted new technologies to solve specific problems. We have used Ruby and Cucumber for some time to build a platform for continuous delivery, and it has worked out beautifully! Our front-end developers have been forced to deal with outdated and unsuitable tools for doing their job. This is largely because innovation in front-end development does not happen in the Java community. Most of the exciting tools are written in Node, and this has become a frustration and a challenge for us.

In the past year FINN has been gradually making a transition away from using only Java-based tools for front-end development and towards a NodeJS-powered tool set. We are now at a point where we are on the brink of rolling this out for our projects. Having worked with Node for a while, we have learned to appreciate the Node ecosystem that is NPM. Being part of such a vibrant ecosystem of modules makes the transition easier, and it also inspires us to become better at giving back. Therefore we try to give back to NPM when we can.

When the scale NPM campaign was launched it was obvious that this was something we wanted to be a part of. It is an investment in our own happiness in a sense, as NPM is becoming a very important part of our technology portfolio.

Nodeify all the things

So where is it that we use Node in our technology stack today? Earlier this year we moved away from JsTestDriver in favor of Karma-runner. This meant that we needed to create a trojan horse to smuggle the goodness of Node/NPM into existing Java projects without causing too many problems for developers with no knowledge of Node. A part of this scheme was the frontend-maven-plugin, which gives us control over which Node version projects use and allows developers who have never installed Node to build projects and run tests without having to learn anything about Node.
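
For the curious, this is roughly what the Maven build wraps when it runs the front-end tests – a sketch only, where the file names follow karma/npm conventions and may differ per project:

# install the project's npm dependencies (karma and friends from package.json)
npm install
# run the test suite once, using the locally installed karma
node_modules/.bin/karma start karma.conf.js --single-run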

Currently we are working towards removing the need for using Maven to build and deploy pure JavaScript projects using Node. The end result is of course to have more web applications built using Node. Today we have just two which are running in a production environment.

Low-fi Coffee Surveillance

There is a Norwegian start-up called AppearIn which enables anyone to do online video conferencing with only a web browser. They utilize WebRTC to empower users to create rooms where they can hang out, do meetings, etc. We had the pleasure of having Ingrid from Appear In over for a visit and she suggested we contribute to their blog about use cases for appear in. That’s how we became use case #7 on their blog. Naturally this should’ve been an arduino-powered thing with lots of sophisticated technology, but we decided we’d just get something simple working first and then make it better later on.

This is a really exciting platform as it empowers developers to have video/audio communication with little effort. Dealing with WebRTC, which is still a moving specification, can be quite challenging to implement properly. There are numerous corner cases and differences between implementations.

Update: Through the wonders of the internet, it turns out that we have brought the webcam thing full circle. A user on Hacker News posted a comment linking to a Wikipedia article about the first webcam. It turns out its use case was exactly the same as ours.

Package Management conflicts Continuous Delivery


The idea of package management is to correctly operate and bundle together the various components in any system. The practice of package management is a consequence of the design and evolution of each component’s API.

Package management is tedious

   but necessary. It can also help to address the ‘fear of change’.

We can minimise package management by minimising API. But we can’t minimise API if we don’t have experience with where it comes from. You can’t define for yourself what the API of your code is. It goes well beyond your public method signatures. Anything that, when changed, can break a consumer is API.

Continuous Delivery isn’t void of API

   despite fixed and minimised interfaces between runtime services, each runtime service also contains an API in how it behaves. The big difference though is that you own the release change, a la the deployment event, and if things don’t go well you can roll back. Releasing artifacts in the context of package management cannot be undone. Once you have released the artifact you must presume someone has already downloaded it and you can’t get it back. The best you can do is release a new version and hope everyone upgrades to it quickly.

Push code out from behind the shackles of package management

   take advantage of continuous delivery! Bear in mind that a healthy modular systems design comes from making sure you get the api design right – so the amount one can utilise CD is ultimately limited, unless you want to throw out modularity. In general we let components low in the stack “be safe” by focusing on api design over delivery time, and the opposite for components high in the stack.

High in the stack doesn’t refer to front-end code

   Code at the top of the stack is that which is free of package management and completely free for continuous deployment. Components with direct consumers no longer sit at the top of the stack. As a component’s consumers multiply, and it becomes a transitive dependency, it moves further down the stack. Typically a component’s entropy corresponds to its position in the stack. Other components forced into package management can be those where parallel versions need to be deployed.

Some simple rules to abide by…

  • don’t put configuration into libraries.
    because this creates version-churn and leads to more package management

  • don’t put services into libraries.
    same reason as above.

  • don’t confuse deploying with version releases.
    don’t release every artifact as part of a deployment pipeline.
    separate concerns of continuous delivery and package management.


  • try to use a runtime service instead of a compile-time library.
    this minimises API, in turn minimising package management,

  • try to re-use standard APIs (REST, message-buses, etc).
    the less API you own the less package management.
    but don’t cheat! data formats are APIs, and anything exposed that breaks stuff when changed is API.

Dark Launching and Feature Toggles


Make sure to distinguish between these two.

They are not the same thing,
  and it’s a lot quicker to just Dark Launch.

In addition, Dark Launching promotes incremental development, continuous delivery, and modular design. Feature Toggles need not, and can even be counter-productive.

Dark Launching is an operation to silently and freely deploy something, giving you time to ensure everything operates as expected in production and the freedom to switch back and forth while bugfixing. A Dark Launch’s goal is often to be completely invisible to the end-user. It also isolates the context of the deployment to the component itself.

Feature Toggling, in contrast, is often the ability to A/B test new products in production. Feature Toggling is typically accompanied with measurements and analytics, eg NetInsight/Google-Analytics. Feature Toggles may also extend to situations when the activation switch of a dark launch can only happen in a consumer codebase, or when only some percentage of executions will use the dark launched code.

Given that one constraint of any decent enterprise platform is that all components must fail gracefully, Dark Launching is the easiest solution, and a golden opportunity to ensure your new code fails gracefully. Turn the new module on and it’s dark launched and in use; if there are any problems, turn it off again. You also shouldn’t have to worry about only running some percentage of executions against the new code: let it all go to the new component, and if the load is too much the excessive load should also fail gracefully and fall back to the old system.

Dark Launching is the simple approach as it requires no feature toggling framework, or custom key-value store of options. It is a DevOps goal that remains best isolated to the context of DevOps – in a sense the ‘toggling’ happens through operations and not through code. When everything is finished it is also the easier approach to clean up. Dealing with and cleaning up old code takes up a lot of our time and is a significant hindrance to our ability to continually innovate. In contrast any feature toggling framework can risk encouraging a mess of outdated, arcane, and unused properties and code paths. KISS: always try and bypass a feature toggle framework by owning the ‘toggle’ in your own component, rather than forcing it out into consumer code.

Where components have CQS it gets even better. First the command component is dark launched, whereby it runs in parallel, and data can be tested between the old and new system (blue-green deployments). Later on the query component is dark launched. While the command components can run and be used in parallel for each and every request, the query components cannot. When the dark launch of the query component is final, the old system is completely turned off.

Now the intention of this post isn’t to say we don’t need feature toggling, but to give terminology to, and distinguish, between two different practices instead of lumping everything under the term “feature toggling”. And to discourage using a feature toggling framework for everything because we fail to understand the simpler alternative.

In the context of front-end changes, it’s typical that for a new front-end feature to come into play some new backend components have been required. These will typically have been dark launched. Once that is done, the front-end will introduce a feature toggle rather than a dark launch, because it’s either introducing something new to the user or wanting to introduce something new to a limited set of users. So even here dark launching can be seen not as a “cool” alternative, but as the prerequisite practice.

Reference: “DevOps for Developers” By Michael Hüttermann

given the git

 

“There is no way to do CVS right.” – Linus



FINN is migrating from subversion to the ever trendy git.
   We’ve waited years for it to happen,
      here we’ll try to highlight why and how we are doing it.

Working together

There’s no doubt that git gives us a cleaner way of working on top of each other. Wherever you promote peer review you need a way of working with changesets from one computer to the next without having to commit to and via the trunk where everyone is affected. Creating custom branches comes with too much (real or perceived) overhead, so the approach at best falls back to throwing patches around. Coming away from a pair-programming session it’s better when developers go back to their own desk with such a patch so they can work on it a bit more and finish it properly with tests, docs, and a healthy dose of clean coding. It properly entitles them as the author rather than appearing as if someone else took over and committed their work. Git’s decentralisation of repositories provides the cleaner way by replacing these patches with private repositories and its easy-to-use branches.

Individual productivity

Git improves the individual’s productivity with the benefits of stashing, squashing, resetting, and rebasing. A number of programmers had, for a number of years, already been on the bandwagon using git-svn against our subversion repositories. This was real proof of the benefits, given the headaches of git-svn (you can’t move files, and renaming files gives corrupted repositories).

With Git, work is encouraged to be done on feature branches and merged into master as complete (squashed/rebased) changesets with clean and summarised commit messages (a sketch of this flow follows the list below).

  1. This improves efforts towards continuous deployment due to a more stable HEAD.
  2. Rolling back any individual feature regardless of its age is a far more manageable task.
  3. By squashing all those checkpoint commits we typically make we get more meaningful, contextual, and accurate commit messages.
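
A minimal sketch of that feature-branch flow (the branch name is only an example):

# work on an isolated feature branch
git checkout -b feature/better-search master
# ...hack away, committing checkpoints as often as you like...
# squash the checkpoints and rewrite the commit message before sharing
git rebase -i master
# fast-forward master to the finished, squashed changeset
git checkout master
git merge --ff-only feature/better-search
git push origin master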

Reading isolated and complete changesets provides clear oversight, to the point that reading code history becomes enjoyable rather than a chore. Equally important is that such documentation, residing so close to, if not with, the code, comes with a real permanence. There is no documentation more accurate over time than the code itself and the commit messages to it. Lastly, writing and rewriting good commit messages will alleviate any culture of jira issues with vague, or completely inadequate, descriptions as teams hurry themselves through scrum methodologies where little attention is given to what is written down.

Maintaining forks

Git makes maintaining forks of upstream projects easy.

With Git:
  • fork the upstream repository,
  • branch, fix and commit,
  • create the upstream pull request,
  • while you wait for the pull request to be accepted/rejected, use your custom in-house artifact.

With Subversion:
  • file an upstream issue,
  • check out the code,
  • fix and store the fix in a patch attached to the issue,
  • while you wait, use the in-house custom artifact built from the patched but uncommitted codebase.



Both processes are largely the same but it’s safer and so much easier using a forked git repository over a bunch of patch files.

Has Git-Flow any advantage?

We put some thought into how to organise our repositories, branches, and workflows. The best introductory article we’ve come across so far is from sandofsky and should be considered mandatory reading. Beyond this, one popular approach is organising branches using Git Flow. This seemed elegant but upon closer inspection comes with more disadvantages than benefits…

  • the majority needs to ‘checkout develop’ after cloning (there are more developers than ops),
  • master is but a sequence of “tags” and therefore develop becomes but a superfluous branch; a floating “stable” tag is a better solution than any branch that simply marks state,
  • it was popular but didn’t form any standard,
  • requires a script not available in all GUIs/IDEs, otherwise it is but a convention,
  • prevents you from getting your hands dirty with the real Git, how else are you going to learn?,
  • it goes against having focus and gravity towards continuous integration, which promotes an always stable HEAD. That is, we desire fewer stabilisation and qa branches, and more individual feature and fix branches.

GitHub Flow gives a healthy critique of Git-Flow and it helped identify our style better. GitHub Flow focuses on continuous integration, “deploy to production every day” rather than releasing, and relying on git basics rather than adding another plugin to our development environment.

Our workflows

So we devised two basic and flexible workflows: one for applications and one for services and libraries. Applications are end-user products and stuff that has no defined API, like batch jobs. Services are the runtime services that build up our platform; each comes with a defined API and client-server separation in artifacts. Applications are deployed to environments, but because no other codebase depends on them their artifacts are never released. Services, with their APIs and client-side libraries, are released using semantic versions, and the server-side code to them is deployed to environments in the same manner as Applications. The differences between Applications and Services/Libraries warrant two different workflow styles.

Both workflow styles use master as the stable branch. Feature branches come off master. An optional “integration” (or “develop”) branch may exist between master and feature branches, for example CI build tools might automatically merge integration changes back to master, but take care not to fall into the anti-pattern of using merges to mark state.

The workflow for Applications may use an optional stable branch from which deployments are made; this is used by projects that have not perfected continuous deployment. Here bug fix branches are taken from the stable branch and forward-ported to master. For applications practising continuous deployment a more GitHub-style approach may be taken, where deployments occur from finished feature branches and, after successfully running in production, such feature branches are merged to master.

The workflow for Services is based upon each release creating a new semantic version and the git tagging of it. Continuous deployment off master is encouraged but is limited by how compatible the API and the client libraries are against the HEAD code in the master branch – code that is released and deployed must work together. Instead of the optional stable branch, optional versioned branches may exist. These are created lazily from the release tag when the need for a bug fix on any previous release arises, or when master code no longer provides compatibility to the released artifacts currently in use. The latter case marks the point when deployments start to occur off the versioned branch instead of off master. Bug fix branches are taken from the versioned branch and forward-ported to master.
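
In git terms the Services workflow looks roughly like this – the version numbers and the placeholder commit id are only illustrative:

# release: tag the semantic version on master
git tag -a v2.3.0 -m "release 2.3.0"
git push origin v2.3.0
# later, a bug fix is needed on that release: create the version branch lazily from the tag
git checkout -b 2.3 v2.3.0
# fix and release 2.3.1 from the version branch, then forward-port the fix to master
git checkout master
git cherry-pick -x <commit-id-of-the-fix>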

Similar to Services are Libraries. These are artifacts that have no server-side code. They are standalone code artifacts serving as compile-time dependencies to the platform. A Library is released, but never itself deployed to any environment. Libraries are void of any efforts towards continuous deployment but otherwise follow a very similar workflow to Services – typically they give longer support to older versions and therefore have multiple release branches active.

How any team operates their workflow is up to them, free to experiment to see what is effective. At the end of the day as long as you understand the differences between merge and rebase then evolving from one workflow to another over time shouldn’t be a problem.

Infrastructure: Atlassian Stash

The introduction of Git was stalled for a year by our Ops team, as there was no repository management software they were happy enough with to support (integration with existing services was important, particularly Crowd). Initially they were waiting on either Gitolite or Gitorious. Eventually someone suggested Stash from Atlassian and after a quick trial this was to be it. We’re using a number of Atlassian products already: Jira, Fisheye, Crucible, and Confluence; so the integration factor was good, and so we’ve paid for a product that was at the time incredibly overpriced with next to nothing on its feature list. One feature the otherwise very naked Stash comes with is Projects, which provides a basic grouping of repositories. We’ve used this grouping to organise our repositories based on our architectural layers: “applications”, “services”, “libraries”, and “operations”. The idea is not to build fortresses with projects based on teams but to best please the outsider who is looking for some yet unknown codebase and only knows what type of codebase it is. We’re hoping that Atlassian adds labels and descriptions to repositories to further help organisation. Permissions are easy: full read and write access for everyone; we’ll learn quickest when we’re all free to make mistakes, and it’s all under version control at the end of the day.

Atlassian Stash


We’re still a cathedral

Git decentralises everything, but we’re not a real bazaar: our private code is our cathedral, with typical enterprise trends like scrum and kanban in play; and so we still have the need to centralise a lot.
We still want our list of users and roles centralised: when people push to the master repository, are all commits logged against known users, or are we going to end up with multiple aliases for every developer? Or worse, junk users like “localhost”?
To tackle this we wrote a pre-push hook that authenticates usernames for all commits against Crowd. If a commit from an unknown user is encountered the push fails and the pusher needs to fix their history using this recipe before pushing again.
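
As a sketch of what such a hook can look like: the loop below is the standard pre-push stdin handling, while the Crowd lookup itself is reduced to a placeholder function, since the real hook calls Crowd’s REST API with our own credentials.

#!/bin/sh
# Sketch of a pre-push hook: reject pushes containing commits from unknown authors.
check_user_in_crowd() {
  # placeholder lookup – the real hook queries Crowd's REST API here
  grep -qx "$1" "$(git rev-parse --git-dir)/known-emails" 2>/dev/null
}
zero=0000000000000000000000000000000000000000
while read local_ref local_sha remote_ref remote_sha; do
  [ "$local_sha" = "$zero" ] && continue              # deleting a branch – nothing to check
  if [ "$remote_sha" = "$zero" ]; then
    range="$local_sha"                                # new branch: check all reachable commits
  else
    range="$remote_sha..$local_sha"                   # existing branch: only the new commits
  fi
  for email in $(git log --format='%ae' "$range" | sort -u); do
    if ! check_user_in_crowd "$email"; then
      echo "unknown committer $email – rewrite your history before pushing" >&2
      exit 1
    fi
  done
done
exit 0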

Releases can be made off any clone, which is obviously not something we want. Released artifacts need to be permanent and unique, and deployed to our central maven repository. Maven’s release plugin fortunately tackles this for us: when you run mvn release:prepare or mvn release:branch it automatically pushes the resulting changes upstream for you, as dictated by the scm details in the pom.xml.

Migrating repositories

Our practice with subversion was to have everything in one large subversion repository, like how Apache does it. This approach worked best for us allowing projects and the code across projects to be freely moved around. With Git it makes more sense for each project to have its own repository, as moving files along with their history between repositories is easy.

Initial attempts at conversion used svn2git, as described here, along with svndumpfilter3.

But then a plugin for Stash came along called SubGit. It rocks! Converting individual projects from such a large subversion repository one at a time is easy. Remember to review the .gitattributes file afterwards; we found in most use cases it could be deleted.


Hurdles

Integration with our existing tools (bamboo, fisheye, jira) was easier when everything was in one subversion repository. Now with scores of git repositories it is rather cumbersome. Every new git repository has to be added manually into every other tool. We’re hoping that Atlassian comes to the rescue and provides some sort of automatic recognition of new and renamed repositories.

Renaming repositories in Stash is very easy, and should be encouraged in an agile world, but it breaks the integration with other tools. Every repository rename means going through other tools and manually updating them. Again we hope Atlassian comes to the rescue.

We were worried about binary files, as our largest codebase had many and was already slow on subversion because of them. Subversion also stores all xml files by default as binary, and in a large spring-based application with a long history this might have been a problem. We were ready to investigate solutions like git-annex. All test migrations though showed that it was not a problem; git clones of this large codebase were super fast, and considerably smaller (subversion 4.1G -> git 1.1G).
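
If you want to sanity-check the numbers on your own migration, standard git/unix commands are enough:

# size of the repository data after conversion
du -sh .git
# object counts and pack sizes in human-readable form
git count-objects -v -H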


Adaptation

Towards the end of February we were lucky enough to have Tim Berglund, Brent Beer, and David Graham from GitHub come and teach us Git. The first two days were a set course with 75 participants and covered

  • Git Fundamentals (staging, moving, copying, history, branching, merging, resetting, reverting),
  • Collaboration using GitHub (Push, pull, and fetch, Pull Requests Project Sites, Gists, Post-receive hooks), and
  • Advanced Git (Filter-Branch, Bisect, Rebase-onto, External merge/diff tools, Event Hooks, Refspec, .gitattributes).

The third day with the three GitHubbers was more of an open space with under twenty participants where we discussed various specifics to FINN’s adoption to Git, from continuous deployment (which GitHub excels at) to branching workflows.

No doubt about it, this was one of the best, if not the very best, courses held for FINN developers, and it left everyone with a resounding drive to immediately switch all codebases over to Git.

Other documentation that’s encouraged for everyone to read/watch is


Tips and tricks for beginners…

To wrap it up here’s some of the tips and tricks we’ve documented for ourselves…

Can’t push because HEAD is newer
So you pull first… Then you go ahead and push, which adds two commits into history: the original and a duplicate merge commit from you.
You need to learn to do git rebase in such situations, better yet to do git pull --rebase.
You can make the latter permanent behaviour with
git config --global branch.master.rebase true
git config --global branch.autosetuprebase always

Colour please!
git config --global color.ui auto

Did you really want to push commits on all your branches?
This can trap people: they often expect push to be restricted to the current branch you’re on. It can be enforced to be this way with git config --global push.default tracking

Pretty log display
Alias git lol to your preferred log format…
Simple oneline log formatting:
git config --global alias.lol "log --abbrev-commit --graph --decorate --all --pretty=oneline"

Oneline log formatting including committer’s name and relative times for each commit:
git config --global alias.lol "log --abbrev-commit --graph --all
      --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset'"

Compact log formatting with full commit messages, iso timestamps, file history over renames, and mailmap usernames:
git config --global alias.lol "log --follow --find-copies-harder --graph --abbrev=4
  --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %Cgreen%ai %n %C(bold blue)%aN%Creset %B'"

Cherry-picking referencing the original commit
When recording a cherry-pick commit, using the “-x” option appends the line “(cherry picked from commit …)” to the original commit message, to indicate which commit this change was cherry-picked from. For example git cherry-pick -x 3523dfc

Quick log to see what I’ve done in the last 24 hours?
git config --global alias.standup "log --all --since yesterday --author `git config --global user.email`"

What files has this project got but is ignoring?
git config --global alias.ignored "ls-files --others --ignored --exclude-standard"

Wipe all uncommitted changes
git config --global alias.wipe "reset --hard HEAD"

Edit and squash commits before pushing them
git config --global alias.ready "rebase -i @{u}"



I wish I knew my consumers – Maven Reverse Dependency

At FINN.no, being a developer fixing bugs in a library is a breeze. Getting every user of your library to use the fix, however, is a different story. How do you know who to notify? I mean, I know my library’s dependencies, but who “out there” has a dependency on the component where I just fixed a bug? I wish. Enter maven-dependency-graph.

The idea was born on the plane back home from a Copenhagen hosted conference. Graph database. Download neo4j and start dabbling at a maven plugin. Flying time Copenhagen – Oslo was too short, all of a sudden.

From there, the idea slept for a couple of years. Until the need arose somewhere among the developers. With 100+ different applications running with common core services and libraries, everybody suddenly needed to know who depended on their code which had recently been bugfixed. So the old idea was dusted off and once more saw the light of day. We needed to upgrade the server installation and the API to neo4j – which took some time to grasp; but after some playing around, it became obvious and easy.

The idea was to have every project report its dependencies to a graph database, building the tree of dependencies on each commit. This constitutes one half of the plugin. Over time, all projects will have reported their dependencies, and from there on part two of the plugin comes into use. It will examine the reverse dependencies to the *current* maven project, and report all incoming dependencies to it in the maven log. Hey, presto! We now know who out there uses us! And even which version they are using, thanks to two different keys into the built-in lucene index engine.

The plugin is published on github @ Finn Technology’s account. Feel free!
@gardleopard and @roarjoh

Usage examples

Dependencies to current maven project:

mvn no.finntech:dependency-mapper-maven-plugin:read
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building greenpages thrift-client 3.4.5-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- dependency-mapper-maven-plugin:1.0-SNAPSHOT:read (default-cli) @ commons-thrift-client ---
[INFO] Resolving reverse dependencies
[INFO] no.finntech.travel.supplier:supplier-client:1.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.1.1
[INFO] no.finntech.cop:client:1.1-SNAPSHOT -> no.finntech:commons-thrift-client:3.1.1
[INFO] no.finntech.oppdrag-services:iad-model:2013.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.4.3
[INFO] no.finntech:minfinn:2013.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.4.3
[INFO] no.finntech:service-user:2013.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.4.3
[INFO] no.finntech:service-oppdrag:2013.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.4.3
[INFO] no.finntech:kernel:2013.2-SNAPSHOT -> no.finntech:commons-thrift-client:3.4.3

(umpteen lines skipped…)

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.957s
[INFO] Finished at: Thu Jan 31 09:50:19 CET 2013
[INFO] Final Memory: 9M/211M
[INFO] ------------------------------------------------------------------------

Usage of third party framework (using neo4j’s included admin interface):

StrataConf & Hadoop World 2012…

A summary of this year’s Strataconf & Hadoop World.
    A fascinating and inspiring conference with use-cases on both sides of an ethical divide – proof that the technologies coming are game-changers both in our industry and in society. Along with some intimidating use-cases, I’ve never seen such recruitment efforts at any conference before, from multi-nationals to the CIA. The need for developers and data scientists in Big Data is burning – the market for Apache Hadoop is expected to reach $14 billion by 2017.

Plenty of honesty towards the hype and the challenges involved too. A barcamp, Big Data Controversies, labelled it all as Big Noise and looked at ways through the hype. It presented balancing perspectives from an insurance company’s statistician who has dealt successfully with the problem of too much data for a decade and a hadoop techie who could provide much desired answers to previously impossible questions. Highlights from this barcamp were…

  • One should always use intelligent samples before ever committing to big data.
  • Unix tools can be used but they are not very fault tolerant.
  • You know when you’re storing too much denormalised data when you’re also getting high compression rates on it.
  • MapReduce isn’t everything as it can be replaced with indexing.
  • If you try to throw automated algorithms at problems without any human intervention you’re bound to bullshit.
  • Ops hate hadoop and this needs to change.
  • Respecting user privacy is important and requires a culture of honesty and common-sense within the company. But everyone needs to understand what’s illegal and why.

Noteworthy (10 minute) keynotes…

  • The End of the Data Warehouse. They are monuments to the old way of doing things: pretty packaging but failing to deliver the business value. But Hadoop too is still flawed… A blog post is also available.
  • Moneyball for New York City. How NYC council started combining datasets from different departments with surprising results.
  • The Composite Database, a focus on using big data for product development. To an application programmer the concept of a database is moving from one entity into a componential architecture.
  • Bringing the ‘So What’ to Big Data, a different keynote with a sell towards going to work for the CIA. Big data isn’t about data but changing and improving lives.
  • Cloud, Mobile and Big Data. Paul Kent, a witty speaker, talks about analytics in a new world. “At the end of the day, we are closer to the beginning than we are at end of this big data revolution… One radical change hadoop and m/r brings is now we push the work to the data, instead of pulling the data out.”

Noteworthy (30 minute) presentations…

  • The Future – Hadoop-2. Hadoop YARN makes all functional programming algorithms possible, reducing the existing map reduce framework to just one of many user-land libraries. Many existing limitations are removed. Fault-tolerance is improved (namenode). 2x performance improvement on small jobs.
  • Designing Hadoop for the Enterprise Data Center. A joint talk from Cisco and Cloudera on hardware tuning to meet Hadoop’s serious demands. 10G networks help; dual-attached 1G networks are an alternative. More jobs in parallel will average out network bursts. Data-locality misses hurt the network; consider a data-locality hit rate above 80% good.
  • How to Win Friends and Influence People. LinkedIn presents four of their big data products
        ∘ Year in Review. Most successful email ever – 20% response rate.
        ∘ Network Updates.
        ∘ Skills and Endorsements. A combination of propensity to know someone and the propensity to have the skill.
        ∘ People You May Know. Listing is easy, but ranking required combining many datasets.

    All these products were written in PIG. Moving data around is the key problem. Kafka is used instead of scribe.
  • Designing for the data-driven organisation. Many companies who think they are data-driven are in fact metrics-driven. It’s not the same thing. Metrics-driven companies often want interfaces with less data. Data-driven companies have data-rich interfaces presenting holistic visualisations.
  • Visualizing Networks. The art of using the correct visualisation and layout. Be careful of our natural human trait to see visual implications from familiarity and proximity – we don’t always look at the legend. A lot of examples using the d3 javascript library.

The two training sessions I attended were Testing Hadoop, and Hadoop using Hive.
Testing Hadoop.
Presented by an old accomplice from the NetBeans Governance board, Tom Wheeler. He presented an interesting perspective on testing, calling it another form of computer security: “a computer is secure if you can depend on it and its software to behave as you expect”. Otherwise I took home a number of key technologies to fill in the gaps between and around our current unit and single-node integration tests on our CountStatistics project: Apache MRUnit for m/r units, MiniMRCluster and MiniDFSCluster for multi-jvm integration clusters, and BigTop for ecosystem testing (pig, hive, etc). We also went through various ways to benchmark hadoop jobs using TeraSort, MRBench, NNBench, TestDFSIO, GridMix3, and SWIM. Lastly we went through a demo of the free product “Cloudera Manager” – a diagnostics UI similar to Cassandra’s OpsCenter.

Hadoop using Hive.
Hive provides an SQL interface to Hadoop. It works out-of-the-box if you’re using HBase, but with Cassandra as our underlying database we haven’t gotten around to installing it yet. The tutorial went through many live queries on an AWS EC2 cluster, exploring the specifics of STRUCTs in schemas, JOINs, UDFs, and serdes. This is a crucial interface to make it easier for others, particularly BI and FINN økosystem, to explore freely through our data. Pig isn’t the easiest way in for outsiders, but everyone knows enough SQL to get started. Fingers crossed we get Hive or Impala installed some time soon…

A number of meet-ups occurred during the evenings, one hosted at AppNexus, a company serving 10 billion real-time ads per day (with a stunning office). AppNexus does all their hadoop work using python, but they also put focus on RESTful Open APIs like we do. The other meetup represented Cassandra by DataStax, with plenty of free Cassandra beer. The latest benchmarks show it to be the fastest distributed database around. I was hoping to see more Cassandra at strataconf – when someone mentions big data I think of Cassandra before Hadoop.

Otherwise this US election was in the news as the big data election.

   

Leaving the Tower of Babel

Here at FINN we use mostly Java for our day-to-day work, but we do have some applications and modules written in other languages. We currently have code in Java, Ruby, JavaScript, Objective-C and Scala, supporting things as diverse as testing frameworks and iOS-apps.

Recently there has been some discussion with regards to what we should be using for new projects, as each team of developers starts suggesting that they should use something they think would be easier and more effective. The arguments against are based on how easy it would be to recruit new developers who know these new technologies, how well it would perform under our kinds of loads, and whether the assumption that we can be more effective is actually true.

Our strategy is firm on this point: experiments can be done, but for major applications serving the public we should be using Java for the time being. This strategy is founded on a wish to reduce the size of our technology portfolio, so that we don’t have to spend time and effort bringing developers up to speed when they switch teams. There’s also a higher chance that new hires will be familiar with Java, while other technologies would often require a period of training when they first start. We are however considering whether it’s time to take a second look at our chosen languages, and see if we should expand our portfolio.

During these discussions, we created a quick poll and sent it out on our internal discussion board. The poll asked questions like “How long have you worked as a developer?”, “Which language do you use for your day-to-day work?”, “If you were free to choose, which language would you use for your next project?” and “Which programming languages do you know?”. Our definition of “know” was very open in this poll, lowering the bar so that people who had maybe written a single example program and understood a bit of code could check the box.

Out of the nearly 100 people who work with development, 56 answered. We can assume that the people who did answer were the people who were interested in the topic, and might not be a representative selection, but the results are interesting nonetheless.

So.. what did we learn?

Experience

Being a poll made in the midst of a discussion, mostly for fun, this part of the poll wasn’t framed in a way that gives us much useful information. We can say that the average developer who answered has worked for around seven or eight years, but with what looks like a reasonably fair spread across all groups. There are almost as many with a couple of years’ experience as there are people with 10 to 15 years.

Primary language in day-to-day work

Being a predominantly Java shop, most of the respondents were using Java for their daily work. 35 out of 56 were Java developers. 11 respondents worked with JavaScript daily, while four worked with Scala, three with Ruby, two with Objective-C, and finally one database developer worked mostly in T-SQL (the SQL variant used in Sybase).

Most popular choice if allowed to chose freely

The most interesting part of the poll, with many interesting insights. This is also the most loaded part, as the results could easily short-circuit any discussion about future choice in language.

The bad news (or good, depending on your point of view), is that there was no clear answer from this section. Keep in mind that around 40 people didn’t answer this poll, and could be assumed to be content with the current status-quo.

The biggest group was Java, not surprisingly. The surprising part is that only 16 people would choose Java if they could choose freely. This is much lower than expected, but it still makes up the largest group. Of these, 13 people are already using Java today. 20 Java developers would like to use something else.

So, if Java was the largest group, at 16 people, what does that mean for the remainder of the results? Well, 10 people would have chosen Scala, while JavaScript, Ruby, Clojure and Groovy clock in at about four or five respondents wanting to use them. At the bottom of the pack we have Python and Objective-C, with three and two respondents choosing them.

Six people were interested enough to answer the poll, but chose the non-committal “I don’t care, as long as I’m making good stuff for our users” option.

The number of possible answers here was a rather large selection of popular and not-so-popular-but-quite-well-known languages, so the fact that the list only includes eight languages is a sign that it’s not a completely random selection. Still, quite a lot of discussion is needed before any one of those languages gets center stage.

A couple of curious details found in this part of the poll include the fact that the only people who would choose JavaScript are people who are already working in JavaScript every day. Groovy, Clojure, Objective-C and Scala have all managed to be chosen by a person who doesn’t actually know the language. There are only three people who would switch *to* Java, if allowed to choose.

Which languages do we “know”?

As mentioned earlier, the bar for “knowing” a language was set quite low in this poll, to get more diversity in answers. This resulted in some interesting numbers.

Being primarily a Java shop, it might not surprise anyone that a full 100% know Java. A little over three fourths know JavaScript, while Ruby and T-SQL were known to about half. These are the main languages most of us have some sort of dealings with in our daily work, so that they score high is not surprising.

Next on the list is PHP and Python, with around 40% knowing them.

We run primarily Sybase for our databases, with a small number of MySQL and PostgreSQL servers for smaller applications and modules. We are trying to standardise on PostgreSQL, but this isn’t reflected in knowledge, with 37.5% knowing how to code MySQL-procedures, 35.7% knowing how to code for Oracle, and only 25% knowing their way around a PostgreSQL codebase.

Our operations department has had a “vendetta” against Perl for several years, but there are still more people who know Perl than there are people who know Scala, with the score 18 to 15. Groovy has quite a following too, with 13 respondents knowing Groovy.

We have a small Apps team working with iOS-development, but Objective-C has a reach far outside that team, with 10 respondents knowing Objective-C.

Common Lisp is known to five respondents, while Clojure surprisingly was only known to seven respondents. As you can read elsewhere in this blog, we had a Clojure workshop at our Technology Day earlier this summer, and these results might be telling us something about the language, or our teachers, when it didn’t manage to “stick” better than this.

We also have people who know some Erlang, Smalltalk, Lua, Haskell, Scheme, Eiffel and ML/SML.

How many languages do we “know”?

On average, we know 6.8 languages each. The most knowledgeable person knows 16 languages, while the least knowledgeable knows only one. The high experience respondents, 16 years or more, have a higher average than the rest of us, with 9.4, while the rest of the “experience brackets” know somewhere between six and seven languages.

Several programmer gurus seem to think you should strive to teach yourself one new language every year, and it would seem at least one of our respondents has been able to follow this advice. For those of us with less spare time, a less ambitious strategy might be more realistic, but that we should strive to learn something new every once in a while seems to be good advice.

Now what?

As mentioned previously, this poll was not a serious attempt at gaining actionable insights. The results should be taken with a large helping of salt, and at most used as a basis for discussions. On the other hand, I know we will be discussing this topic going forward, because the fact that only 16 of 56 wanted to use Java is telling us it’s time to start the discussion. There are possibly 40 developers who would prefer Java who didn’t respond, so it’s too early to draw conclusions, but we have a place to start.

There are other questions falling out of this too:

  • There are two developers who want to work with Objective-C, and eight Java developers who want to work with Scala, so why do we have so few internal applicants when we have openings on the teams working with these languages?
  • What is it that makes developers want to work with languages they don’t even know?
  • Of the respondents, more than half the Java developers want to use something else. Why is that?

 
