Getting Data into Big Data

    By: Ian Abramson on Apr 11, 2017

    The Internet of Things (IoT) is changing the way we generate data and how we use it to gain insights, but the challenge is still in how we manage this diverse data and store it in a meaningful way.

    In the early days of big data, we were forced to load data with rudimentary tools such as Flume, Kafka, Storm or even Spark. These products all helped, but they were difficult to manage, and in many cases we had to build a framework around them to make them more efficient. There was a need for faster ways to ingest data of any type, and for tools similar to traditional extract, transform and load (ETL) tools, with a good user interface and the flexibility to create and manage data-loading processes. To that end, we are seeing an evolution in the Hadoop ecosystem, specifically in the area of data ingestion.
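The ETL pattern these tools automate can be sketched in a few lines of plain Python. This is a hypothetical illustration of the extract/transform/load stages themselves, not code from Flume, Kafka, Storm, Spark or any ETL product; the field names and the IoT-style CSV feed are invented for the example.

```python
import csv
import io
import json

def extract(raw_csv):
    """Extract: parse raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: normalize field names and cast the reading to a float."""
    return [
        {"sensor_id": row["id"], "reading": float(row["value"])}
        for row in rows
    ]

def load(records, sink):
    """Load: append each record to the sink as one JSON line."""
    for record in records:
        sink.write(json.dumps(record) + "\n")

# A tiny IoT-style feed run through the three stages end to end.
raw = "id,value\nsensor-1,21.5\nsensor-2,19.0\n"
out = io.StringIO()
load(transform(extract(raw)), out)
print(out.getvalue())
```

The real products add what this sketch leaves out: scheduling, retries, parallelism and monitoring, which is exactly the framework teams kept rebuilding around the early tools.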

    Hadoop projects such as Gobblin and Hortonworks DataFlow (HDF) have emerged as leading options for data ingestion. As data loading for big data becomes more complex, we need these kinds of facilities to load data quickly into the Hadoop Distributed File System (HDFS) with little development effort. These products can be used to enhance loading, reduce time to analysis and support an Agile approach to data. Both tools are interesting and should be considered. Gobblin, originally developed by LinkedIn, is open source and not as refined as HDF. It allowed LinkedIn to load the massive amounts of data its website and tools were generating daily; LinkedIn then shared the technology, allowing all of us to use and contribute to its development.

    HDF, on the other hand, is a new product from Hortonworks that helps not only with loading data but also with data governance. It is based on the open-source project NiFi, which was originally developed at the United States National Security Agency and has since been further enhanced by Hortonworks. With it you can plan and execute loads from any data source: whether your data arrives from a file, a database or a stream, it can all be loaded through a single interface. This flexibility should feel familiar to those who cut their teeth on ETL tools. These tools address the complexity of loading a diverse set of data inputs and can enhance your organization's ability to analyze data with lower latency.
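The "single interface for any source" idea can be made concrete with a short sketch. This is not HDF's or NiFi's actual API, just a hypothetical illustration of the design: each source type, whatever its origin, exposes the same record-producing interface, so one load path serves them all.

```python
from abc import ABC, abstractmethod

class Source(ABC):
    """One interface for any input: file, database or stream."""

    @abstractmethod
    def records(self):
        """Yield records one at a time, whatever the underlying origin."""

class FileSource(Source):
    def __init__(self, lines):
        self.lines = lines  # stands in for a file's contents

    def records(self):
        yield from self.lines

class StreamSource(Source):
    def __init__(self, events):
        self.events = events  # stands in for an incoming event stream

    def records(self):
        yield from self.events

def ingest(sources):
    """Route every source through the same load path into one store."""
    store = []
    for source in sources:
        store.extend(source.records())
    return store

loaded = ingest([FileSource(["row-1", "row-2"]), StreamSource(["event-1"])])
print(loaded)  # ['row-1', 'row-2', 'event-1']
```

Adding a database source means adding one more class, not a new pipeline, which is the flexibility the paragraph above describes.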

    So, in the world of IoT, we need to be ready to load an unprecedented range of data formats, both old and new. In the end, we must ensure that, in spite of these diverse needs, we find ways to simplify and shorten the time to analysis.

    Released: April 11, 2017, 5:51 pm | Updated: June 2, 2017, 9:15 am
