Ansible for Big Data infrastructure

When we observe how Big Data infrastructure is currently managed, we notice either a fully manual process or a complex, semi-automated system.

Big Data teams raise tickets to the infrastructure team. After business approval, a DevOps engineer implements the changes, either manually or through Cloudera Manager or Ambari, and keeps track of which tickets have been applied to which environment (dev/sit/uat, etc.). Some teams also use Puppet or Chef, but we believe Ansible to be the most beneficial option, and we will explain why.

Significant pain points exist in Big Data infrastructure management

On the positive side, most organizations (especially in Europe) still run their Hadoop systems in their own datacenters. Now, with regulations in place and cloud security no longer a concern, most enterprises are planning to move their Hadoop platforms to the Cloud over the next few years.

Ansible

Ask yourself the question: how should we manage Big Data infrastructure?

  • Using a manual process based on tickets?
  • Using deployment tools such as Puppet or Chef? These require servers, agents and firewall rules to be set up, and entail a steep learning curve.
  • Using Ansible? It is simple to understand and use: we describe the desired state, and that state becomes the blueprint of your infrastructure. The entire definition is small (less than 1 MByte) and contains the exact specification of the Hadoop cluster (e.g. a Cloudera cluster); see the sketch below.
Ansible is beautiful as it can live on a small USB stick :)
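
For illustration, here is a minimal sketch of such a desired-state description. The group name, package name and service name are assumptions about a typical Cloudera node; your blueprint will differ:

    ---
    # site.yml -- a minimal desired-state sketch (group, package and
    # service names below are illustrative assumptions)
    - name: Ensure Cloudera Manager agents are installed and running
      hosts: hadoop_workers
      become: true
      tasks:
        - name: Install the Cloudera Manager agent package
          ansible.builtin.yum:
            name: cloudera-manager-agent
            state: present

        - name: Ensure the agent service is enabled and started
          ansible.builtin.service:
            name: cloudera-scm-agent
            state: started
            enabled: true

Running such a playbook twice is safe: Ansible only acts where the actual state has drifted from the described one.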

Usually the Ansible code lives in a Git repository. Development teams have access and raise change requests directly to the DevOps team (!). Most Git changes amount to only a few new lines. Ansible will first ensure the existing state is consistent, and then apply the new state described in the commit. Small changes can be tested by a Continuous Integration server against the Hadoop DEV cluster before being propagated to other environments.
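
A typical change request might look like the following; the variable names are illustrative, not a fixed schema:

    # group_vars/dev.yml -- a change request is often just a few lines like these
    hdfs_replication_factor: 3
    yarn_nodemanager_memory_mb: 16384    # e.g. raised from 8192 in this commit

    # A CI job can dry-run the change against the DEV cluster before merging:
    #   ansible-playbook site.yml -i inventories/dev --check --diff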

Changes to Cloudera Manager or Ambari can also be driven through API requests, so that no manual task needs to be performed. As far as security is concerned, sensitive information such as passwords is encrypted and shadowed in Ansible through Ansible Vault.
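
As a sketch, such an API-driven change could be expressed with Ansible's uri module. The Cloudera Manager URL, API version, cluster and service names below are assumptions, and cm_admin_password is kept encrypted in a vault file rather than in plain text:

    ---
    # cm_restart.yml -- restart a service via the Cloudera Manager REST API
    - name: Drive Cloudera Manager through its API
      hosts: localhost
      gather_facts: false
      vars_files:
        - vault.yml                # created with: ansible-vault create vault.yml
      tasks:
        - name: Restart the HDFS service (endpoint is an assumption)
          ansible.builtin.uri:
            url: "https://cm.example.com:7183/api/v19/clusters/prod/services/hdfs/commands/restart"
            method: POST
            user: admin
            password: "{{ cm_admin_password }}"   # decrypted from the vault at runtime
            force_basic_auth: true
            validate_certs: true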

This is the level at which Hadoop infrastructure should be managed, and this is how at '51zero' we are currently building and managing our Hadoop infrastructure on Landoop.