When it comes to the classic dilemma of “build vs. buy”, I think it’s fair to say that the era of completely custom software development has passed. With so many standard system and application components available as open source, the work today is more about integration than custom development. That is great news, because organizations can achieve significant technical capabilities by combining open source projects with common enterprise software. For example, many of our customers run Hadoop stacks to modernize their ETL (Extract, Transform, Load) processes, which allows them to run analytics against masses of structured and semi-structured data. At the same time, mature time-based event stores such as Splunk have been proliferating, helping organizations streamline their operations and fulfill audit and compliance requirements.

As great as Splunk is, there are always opportunities in combining its mature, feature-rich data store with data residing in cheap, distributed file systems. Bringing the two types of environments together supports several extremely useful scenarios. For example, many of our customers use this integration to fill Splunk lookups automatically with data from Hadoop. Another valuable use case is to correlate MapReduce job results with Splunk search results and visualize them in a single view or dashboard. Hadoop Connect also enables Hadoop system administrators to monitor and analyze their HDFS using the well-known Splunk search language, without the need to write dedicated MapReduce jobs.

So how would we move forward with the integration? One possible solution is the Hadoop Connect app. Hadoop Connect integrates Splunk with Hadoop Distributed File System (HDFS) clusters and enables bidirectional data exchange between the two platforms.

1. To use Hadoop Connect, install the app on a Splunk Search Head instance.

  1. As a prerequisite, the Hadoop client utilities specific to the Hadoop distribution in use must also be installed on the Search Head. Hadoop Connect supports Apache Hadoop, Cloudera CDH, and Hortonworks HDP.
  2. Furthermore, the CLI tools require the Oracle JDK, version 6 or 7, on the same machine.
  3. After installing all three components (the Splunk app, the JDK, and the Hadoop client utilities) on the system, set up the connection to the Hadoop cluster. It is recommended to test the connection to the NameNode with the CLI tools first (the “hadoop fs” command).
  4. Finally, add the cluster address, consisting of the NameNode host and IPC port, to the app configuration. Voilà, the connection is now available!
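Before reaching for the full “hadoop fs” test, it can be handy to confirm that the NameNode’s IPC port is even reachable from the Search Head. The sketch below is a minimal example, assuming a placeholder host name (`namenode.example.com`) and the common IPC port 8020; it only checks TCP reachability and does not replace the real “hadoop fs” test, which also exercises authentication and the Hadoop client configuration.

```python
import socket


def namenode_reachable(host: str, port: int = 8020, timeout: float = 5.0) -> bool:
    """Return True if the NameNode's IPC port accepts TCP connections.

    This is only a network reachability check -- a successful connection
    does not guarantee that the Hadoop client utilities are configured
    correctly, just that nothing on the network path is blocking the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder host and port -- substitute your cluster's NameNode
    # address as configured in core-site.xml (fs.defaultFS).
    print(namenode_reachable("namenode.example.com", 8020))
```

If this returns False while “hadoop fs -ls” also fails, the problem is likely a firewall or a wrong host/port rather than the Hadoop client setup.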

You have certainly come across the product “Hunk”, released by Splunk. Since its first release a couple of years ago, Hunk has shipped as a separate product. Technically, it is a Splunk Enterprise installation extended with so-called “virtual indexes”. An HDFS environment can be connected as a virtual index, making it possible to search and report on its data seamlessly with the native Splunk search language. Another very powerful Hunk feature is aging buckets from traditional Splunk indexes out to HDFS destinations, which are in turn connected as virtual indexes, keeping the aged data searchable. The option to roll data to a Hadoop distributed file system lowers the total cost of ownership by reducing the amount of storage required for Splunk data.
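To make the virtual-index idea concrete, here is a sketch of what such a configuration looks like in indexes.conf. All host names, paths, and stanza names below are placeholders for your environment, and the exact vix.* settings should be verified against the Splunk documentation for your version.

```ini
# A Hadoop "provider" describes how to reach the cluster.
[provider:my-hadoop-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/java-7-oracle
vix.env.HADOOP_HOME = /opt/hadoop
vix.fs.default.name = hdfs://namenode.example.com:8020
vix.splunk.home.hdfs = /user/splunk/workdir

# The virtual index itself points at a provider and an HDFS path.
[my_virtual_index]
vix.provider = my-hadoop-provider
vix.input.1.path = /data/weblogs
```

Once defined, the virtual index can be searched like any other index, e.g. `index=my_virtual_index`, with the work pushed down to the Hadoop cluster.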

With the newest release of Splunk, version 6.5, Hunk is now known as “Splunk Analytics for Hadoop” and is no longer shipped as a dedicated download; it has been integrated into Splunk Enterprise.

| Feature | Hadoop Connect | Splunk Analytics for Hadoop |
| --- | --- | --- |
| Export events to Hadoop | ✓ | n/a |
| Explore Hadoop directories and files | ✓ | HDFS Explorer |
| Import and index Hadoop data into Splunk | ✓ | n/a |
| Search and report on Hadoop data | n/a | ✓ |
| Roll data to HDFS / S3 | n/a | ✓ |
| Distribution | Splunk app | Baked into Splunk Enterprise |
| License | No license fee | Premium offering |

Now that SBOX has released version 2.0 of its operating system, it is much simpler to deploy Hadoop Connect, including its dependencies, with just a couple of mouse clicks. The only thing needed for the configuration is the Hadoop cluster URL, and it is good to go. Furthermore, SBOX 2.0 also supports “Splunk Analytics for Hadoop” with Splunk 6.5.

Whichever route you take, Hadoop Connect is the first step in connecting the two worlds. The fact that there are no additional license fees makes this a very attractive deployment scenario.

And for those readers who have deployed this scenario before: what has been your experience? Let us know; we always love a good story.