The Knowledge Gap of Big Data

    By: Ian Abramson on Mar 02, 2017

    I have attended many conferences focused on big data. I am always struck by a number of things at these conferences, but mainly by the wide gap between those who are implementing solutions and those who are still trying to understand the technology.

    At this point in the big data space, numerous companies have moved from investigation into production development. On the other hand, we see people and organizations who are still trying to grasp the concept of big data. People are hungry for knowledge at both ends of this big data knowledge continuum, but are we serving anyone in a way that provides real value, or are we simply hoping that things will just work out?

    As we all know, big data is evolving more quickly than most technologies that came before it. Consider that Oracle’s biggest release in its first 10 years of existence was version 6, which introduced row-level locking and hot backups; we would still have to wait nearly four years for version 7.

    If you compare this to Hadoop and the big data ecosystem, in its first 10 years it has grown in incredible ways. The functionality within Hadoop has been developed not by one vendor but by many contributors in the open source community. As a result, we see disruptive products being created that optimize and improve the platform. Consider that in the past two years we went from the introduction of Spark to a time when Spark is a required component of most big data solutions. In this ever-changing world, the gap in knowledge between those who want to do something and those who are doing something is wide.

    The conference provided information for both sides, but I truly believe that most vendors continue to focus on the introduction and adoption of big data rather than the production implementation of Hadoop solutions. The value here is learning which products in the ecosystem are worthwhile now, or should be considered as they mature.

    Two such products were presented. The first is Apache Atlas, a metadata tool that provides data governance capabilities for Hadoop. With Atlas, one can collect metadata, trace data lineage, audit access, and support data compliance. The next project was Apache Zeppelin, a tool that provides an environment for simple data analysis and presentation. Both projects are open source solutions.

    Also at the conference, attendees wanted to understand the evolution of current products like Hive. This led to interesting conversations and highlighted the existing confusion. The differences between Hive on MapReduce, Hive on Tez, and Hive on Spark illustrate how people struggle to figure out what they should use and why. In this case, it is about the underlying engine used to execute “SQL” queries in Hadoop; the differences come down to performance and architecture.
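    To make the distinction concrete: in practice the engine choice is a single Hive configuration property, and the query itself does not change. The sketch below assumes a hypothetical table name; the property and its values are standard Hive configuration.

```sql
-- Select the engine that executes Hive queries for this session.
-- Valid values: mr (MapReduce), tez, spark.
SET hive.execution.engine=tez;

-- The query is identical under any engine; only the underlying
-- execution (and therefore the performance profile) differs.
SELECT region, COUNT(*) AS order_count
FROM sales                -- hypothetical table
GROUP BY region;
```

    The same statement runs under all three engines; Tez and Spark avoid writing intermediate results to HDFS between stages, which is where most of the performance difference versus MapReduce comes from.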

    It’s my guess that fewer than a quarter of people attending these events are looking for in-depth knowledge. Most have some familiarity or are in the midst of gaining practical experience. The people trying to implement enterprise-grade methods find this ecosystem challenging. They are looking at how the ecosystem works together. They ask questions about how best to manage Avro files and discuss challenges that they are experiencing with loading these files.
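    Part of what makes Avro management hard is that every file carries a schema, and teams must evolve those schemas without breaking older readers. As a minimal sketch (the record name, namespace, and fields are hypothetical), an Avro schema is just a JSON document, which is why it can be versioned and reviewed like any other artifact:

```python
import json

# A minimal Avro schema for a hypothetical "Click" event record,
# expressed as the plain JSON that Hadoop tools exchange.
schema = {
    "type": "record",
    "name": "Click",
    "namespace": "com.example.events",  # hypothetical namespace
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        # A union with "null" plus a default makes the field optional,
        # so readers with an older schema can still process newer files
        # (Avro's schema-evolution rules).
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}

print(json.dumps(schema, indent=2))
```

    Actually serializing records to an Avro container file requires an Avro library (for example, the Apache Avro bindings); the point here is only that the schema itself is inspectable JSON, which is what tools like Atlas catalog.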

    The answers to these questions tend to be vague, and this is where the information divide widens and we begin to see new scenarios for which no solution has been found. These problems tend to be either interoperability issues or product bugs. In today’s big data projects, we often have to find our own solutions or build workarounds to make things perform as we need them to. The resources available are limited, and you must question the validity of any fix you find in the open community. Big data teams must find ways to bridge the knowledge gap and turn conceptual solutions into reality.

    Released: March 2, 2017, 7:08 am | Updated: June 2, 2017, 9:18 am
    Keywords: Department | apache | Big Data | Data Evolution | hadoop | Ian Abramson
