Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing

Pekka Pääkkönen (Corresponding Author)

    Research output: Contribution to journal › Article › Scientific › peer-review

    6 Citations (Scopus)

    Abstract

    To gain up-to-date insight into online services, extracted data has to be processed in near real time. For example, major big data companies (Facebook, LinkedIn, Twitter) analyse streaming data to develop new services. Several technologies have been developed that could be selected for implementing stream processing functionality. The contribution of this paper is a feasibility analysis of technologies for stream-based processing of semi-structured data. In particular, the feasibility of a Big Data management system for semi-structured data (AsterixDB) is compared to Spark streaming, which has been integrated with the Cassandra NoSQL database for persistence. The study focuses on stream processing in a simulated social media use case (tweet analysis), which has been implemented in a Eucalyptus cloud computing environment on a distributed shared memory multiprocessor platform. The results indicate that AsterixDB provides significantly better performance in terms of both throughput and latency when the data feed functionality of AsterixDB is used and stream processing is implemented in Java. AsterixDB also scaled as well or better when the number of nodes on the cloud platform was increased. However, stream processing in AsterixDB was delayed by batching of data when tweets were streamed into the database with data feeds.
    Original language: English
    Number of pages: 25
    Journal: Journal of Big Data
    Volume: 3
    Issue number: 6
    DOI: 10.1186/s40537-016-0041-8
    Publication status: Published - 2016
    MoE publication type: A1 Journal article-refereed

    Fingerprint

    Electric sparks
    Processing
    Cloud computing
    Information management
    Throughput
    Feasibility analysis
    Data storage equipment
    Industry
    Big data
    Semistructured data
    Data base
    Functionality

    Keywords

    • sentiment
    • tweet
    • word count
    • AsterixDB
    • Spark
    • performance
    • Eucalyptus
    • Cassandra

    Cite this

    @article{c5606f0163b14714a30c3830a5819277,
    title = "Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing",
    keywords = "sentiment, tweet, word count, AsterixDB, Spark, performance, Eucalyptus, Cassandra",
    author = "Pekka P{\"a}{\"a}kk{\"o}nen",
    note = "Project code: 101215",
    year = "2016",
    doi = "10.1186/s40537-016-0041-8",
    language = "English",
    volume = "3",
    journal = "Journal of Big Data",
    issn = "2196-1115",
    publisher = "Springer",
    number = "6",

    }

    Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing. / Pääkkönen, Pekka (Corresponding Author).

    In: Journal of Big Data, Vol. 3, No. 6, 2016.

    TY - JOUR

    T1 - Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing

    AU - Pääkkönen, Pekka

    N1 - Project code: 101215

    PY - 2016

    Y1 - 2016

    KW - sentiment

    KW - tweet

    KW - word count

    KW - AsterixDB

    KW - Spark

    KW - performance

    KW - Eucalyptus

    KW - Cassandra

    U2 - 10.1186/s40537-016-0041-8

    DO - 10.1186/s40537-016-0041-8

    M3 - Article

    VL - 3

    JO - Journal of Big Data

    JF - Journal of Big Data

    SN - 2196-1115

    IS - 6

    ER -