Abstract
To gain up-to-date insight into online services,
extracted data must be processed in near real time. For
example, major big data companies (Facebook, LinkedIn,
Twitter) analyse streaming data to develop new
services. Several technologies have been developed that
could be used to implement stream processing
functionality. The contribution of this paper is a
feasibility analysis of technologies for stream-based
processing of semi-structured data. In particular, the
feasibility of a Big Data management system for
semi-structured data (AsterixDB) is compared with
Spark Streaming, which has been integrated with the Cassandra
NoSQL database for persistence. The study focuses on
stream processing in a simulated social media use case
(tweet analysis), which has been implemented in a
Eucalyptus cloud computing environment on a distributed
shared memory multiprocessor platform. The results
indicate that AsterixDB provides significantly
better performance in terms of both throughput and
latency when the data feed functionality of AsterixDB is
used and stream processing is implemented in
Java. AsterixDB also scaled equally well or better
when the number of nodes on the cloud platform was
increased. However, stream processing in AsterixDB was
delayed by the batching of data when tweets were streamed
into the database with data feeds.
Original language | English |
---|---|
Number of pages | 25 |
Journal | Journal of Big Data |
Volume | 3 |
Issue number | 6 |
DOIs | |
Publication status | Published - 2016 |
MoE publication type | A1 Journal article-refereed |
Keywords
- sentiment
- tweet
- word count
- AsterixDB
- Spark
- performance
- Eucalyptus
- Cassandra