Big Data at UGent

Big data is a computing paradigm that combines decentralized data storage with decentralized processing. It is one way to cope with increasing data volume, variety, and velocity (the three V's of Big Data).

Recently accepted papers:

VAN DONGEN G. & VAN DEN POEL, D. (2021), Influencing Factors in the Scalability of Distributed Stream Processing Jobs, IEEE Access (open access).

VAN DONGEN G. & VAN DEN POEL, D. (2021), A Performance Analysis of Fault Recovery in Stream Processing Frameworks, IEEE Access (open access).

STEURTEWAGEN B. & VAN DEN POEL, D. (2021), Adding interpretability to predictive maintenance by machine learning on sensor data, Computers and Chemical Engineering, 152.

VAN DONGEN G. & VAN DEN POEL, D. (2020), Evaluation of Stream Processing Frameworks, IEEE Transactions on Parallel and Distributed Systems.

LISEUNE A. et al. (2020), Predicting the milk yield curve of dairy cows in the subsequent lactation period using deep learning, Computers and Electronics in Agriculture, 180.

STEURTEWAGEN B. & VAN DEN POEL, D. (2020), Machine Learning Refinery Sensor Data to Predict Catalyst Saturation Levels, Computers and Chemical Engineering, 134.

UGent/Klarrio Big Data Stream-Processing Frameworks Analytics Benchmark

If you are dealing with big data that demands high throughput, low latency, and adaptive online algorithms, streaming analytics is the way to go. In recent years, a lot of R&D has gone into the development of so-called streaming frameworks. Note that a pure streaming context, in which incoming streams are aggregated, differs fundamentally from online analytics, which is performed on streaming data that has been enriched with static data. In addition, streaming data types vary widely from case to case, and little is known about how different frameworks react to different types of data.

Although the industry acknowledges the power of streaming analytics, practitioners struggle to decide which framework best suits their needs and solves their problem.

Therefore, the UGent Big Data Analytics Team, in close collaboration with Klarrio, is developing a streaming analytics benchmark. OSPBench is available on GitHub.

The included frameworks are Apache Spark (both Spark Streaming and Structured Streaming), Apache Flink, and Kafka Streams.

The first phase of the benchmark will focus on measuring throughput and latency for a basic streaming job comprising the following steps:

1. Ingest: Read event data from Kafka.
2. Basic transformations: Parse the data.
3. Joins: Join datastreams across topics together.
4. Aggregation: Compute aggregated metrics for each measurement point.
5. Window operations: Evolution of the metrics over specified look-back periods.
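The five steps above can be sketched in plain Python on a handful of in-memory messages. This is only an illustration of the logic, not the benchmark's implementation: the two topics, the field names (point, lane, ts, flow, speed), the join key, and the 60-second look-back are all assumptions made for the example, and the lists stand in for Kafka topics.

```python
import json
from collections import defaultdict

# Hypothetical raw messages standing in for two Kafka topics (assumed schema).
FLOW_TOPIC = [
    '{"point": "A", "lane": 1, "ts": 0, "flow": 10}',
    '{"point": "A", "lane": 2, "ts": 0, "flow": 20}',
    '{"point": "A", "lane": 1, "ts": 60, "flow": 30}',
    '{"point": "A", "lane": 2, "ts": 60, "flow": 40}',
]
SPEED_TOPIC = [
    '{"point": "A", "lane": 1, "ts": 0, "speed": 100.0}',
    '{"point": "A", "lane": 2, "ts": 0, "speed": 80.0}',
    '{"point": "A", "lane": 1, "ts": 60, "speed": 90.0}',
    '{"point": "A", "lane": 2, "ts": 60, "speed": 70.0}',
]


def parse(messages):
    """Steps 1-2: ingest raw strings and parse them into dicts."""
    return [json.loads(m) for m in messages]


def join(flows, speeds):
    """Step 3: inner join of the two streams on (point, lane, ts)."""
    speed_by_key = {(s["point"], s["lane"], s["ts"]): s["speed"] for s in speeds}
    joined = []
    for f in flows:
        key = (f["point"], f["lane"], f["ts"])
        if key in speed_by_key:
            joined.append({**f, "speed": speed_by_key[key]})
    return joined


def aggregate(joined):
    """Step 4: per measurement point and timestamp, sum flow and average speed over lanes."""
    groups = defaultdict(list)
    for row in joined:
        groups[(row["point"], row["ts"])].append(row)
    result = []
    for (point, ts), rows in sorted(groups.items()):
        result.append({
            "point": point,
            "ts": ts,
            "flow": sum(r["flow"] for r in rows),
            "speed": sum(r["speed"] for r in rows) / len(rows),
        })
    return result


def window_change(aggregated, lookback):
    """Step 5: change in flow relative to the value `lookback` seconds earlier."""
    by_key = {(a["point"], a["ts"]): a for a in aggregated}
    changes = []
    for a in aggregated:
        prev = by_key.get((a["point"], a["ts"] - lookback))
        if prev is not None:
            changes.append({
                "point": a["point"],
                "ts": a["ts"],
                "flow_change": a["flow"] - prev["flow"],
            })
    return changes


def run_pipeline(lookback=60):
    """Run all five steps on the sample data."""
    joined = join(parse(FLOW_TOPIC), parse(SPEED_TOPIC))
    aggregated = aggregate(joined)
    return aggregated, window_change(aggregated, lookback)
```

In a real streaming framework each function would become an operator on an unbounded stream, and the join and window steps would need to keep state and handle late or out-of-order events, which this batch-style sketch ignores.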

A second phase of the project will include data augmentation by enriching streaming data with static data. The third phase will extend the first and second phases by implementing analytical models for predictive and prescriptive analytics.
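Enriching streaming data with static data usually amounts to a lookup join against a reference table that is loaded once. The sketch below shows the idea in plain Python; the table contents, the field names, and the choice to drop events without a match are assumptions for illustration, not details of the benchmark.

```python
# Hypothetical static reference data, e.g. loaded once from a file or database.
STATIC_POINTS = {
    "A": {"road": "A2", "speed_limit": 100},
    "B": {"road": "A12", "speed_limit": 130},
}


def enrich(event, static_table):
    """Return a copy of the event with static attributes attached, or None if unknown."""
    static = static_table.get(event["point"])
    if static is None:
        return None
    return {**event, **static}


def enrich_stream(events, static_table):
    """Lazily enrich a stream of events, dropping those with no static match."""
    for event in events:
        enriched = enrich(event, static_table)
        if enriched is not None:
            yield enriched
```

Because `enrich_stream` is a generator, it processes events one at a time as they arrive, which mirrors how a streaming job would apply the lookup per record.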

The benchmark will use open data from traffic information of the Netherlands and will be open-sourced.

For further information or if you have any questions please do not hesitate to contact us: Dirk.VandenPoel AT UGent.be

Contributions to Big Data Open-Source Projects

We are proud that Bram Steurtewagen (UGent & Klarrio) has made an important contribution to an open-source project.

Heron (https://twitter.github.io/heron/) is a highly scalable and fast stream processing engine that is used in-house at Twitter, which probably makes it one of the more battle-tested streaming frameworks currently available. During the development of our benchmarking solution for streaming frameworks, we discovered that Apache Storm currently has some issues running on Mesos. Luckily, the Heron project is API-compatible with Apache Storm and runs on a multitude of schedulers and platforms. As we had already opted for a Marathon-based deployment of our other frameworks, we tried to launch Heron on Marathon as well, to no avail. We identified the following issue with the Heron Marathon Scheduler: the Heron framework parameters sent to Marathon were generated in a deprecated format that was being phased out (https://github.com/twitter/heron/issues/2581). We resolved this in the following pull request: https://github.com/twitter/heron/pull/2583. We can now confirm that our Heron benchmark runs smoothly under the latest version of Marathon and DC/OS.

Big Data Events

Giselle van Dongen (UGent PhD candidate & Data Scientist at Klarrio) presented her work-in-progress paper "Latency Measurement of Fine-Grained Operations in Benchmarking Distributed Stream Processing Frameworks" (co-authors: Bram Steurtewagen and Dirk Van den Poel).

We are proud to announce that Prof. Dr. Dirk Van den Poel was awarded the Francqui Chair 2017-2018 on Big Data Analytics at the University of Namur.

We are proud to announce that our Big Data team is again represented at the Apache Big Data conference on May 16-18, 2017 in Miami, FL. The talk is by Dirk Van den Poel ("Big Data Analytics Using (Py)Spark For Analyzing IPO Tweets"). Last year, we had three talks at the Apache Big Data event on May 9-12, 2016 in Vancouver, Canada: by Bram Steurtewagen ("Data Science Applied: A Utilities Sector Case Study"), by Tijl Carpels ("On the fly retraining of predictive analytical models using Spark Streaming: An equity-price direction prediction case study"), and by Dirk Van den Poel ("Spark Big Data Analytics for Business, Finance and Marketing").

Several UGent professors in Big Data offer a training program (more information is available in Dutch). After two very successful editions, we decided to intensify our efforts in 2018.

Big Data Publications

Older publications in the field of Big Data:

VERCAMER D., STEURTEWAGEN B., VAN DEN POEL D. & VERMEULEN F. (2017), Predicting Consumer Load Profiles Using Commercial and Open Data, IEEE Transactions on Power Systems, 31 (5).

VAN DEN POEL D., CHESTERMAN C., KOPPEN M. & BALLINGS M. (2016), Equity Price-Direction Prediction For Day Trading: Ensemble Classification Using Technical Analysis Indicators With Interaction Effects, IEEE WCCI Proceedings of the IJCNN Conference.

Big Data Projects

We have a strong cooperation with Klarrio, the leading Big Data IoT and Analytics Co. in the Benelux.

Since Jan. 2016, we have partnered with the insurance company Corona Direct on a large-scale IoT Usage-Based Insurance research project.

Blog posts cover some recent Big Data projects. For example, Total Refineries asked us to apply industrial analytics to an IoT (Internet of Things) case. The team compared two open-source analytics environments (R versus Python + Spark) for the task at hand; unfortunately, all other details are confidential.

Since Sept. 2013, we have taught Apache Hadoop/HBase/Hive/Spark in our two state-of-the-art master's degrees: Master of Science in Marketing Analysis and Master of Science in Business Engineering: Data Analytics.

Since Sept. 2013, we have been actively involved in several research projects that use Big Data technology for Analytics.

Past Conference Participations

Blog entries related to Big Data:

  • IEEE Big Data 2018 Congress in San Francisco, CA
  • Student Presentations in Big Data Class 2017 in Ghent, Belgium
  • NIPS 2017 in Long Beach, CA
  • SuperComputing 2017 (SC17) in Denver, CO
  • INFORMS 2017 Annual Meeting in Houston, TX
  • ACM KDD 2017 in Halifax, NS (Canada)
  • Apache Big Data North America 2017 in Miami, FL
  • INFORMS Business Analytics 2017 in Las Vegas, NV
  • Spark Summit East 2017 in Boston, MA
  • IBM Spark Technology Center Meeting Feb. 2017 in Boston, MA
  • Student Presentations in Big Data class 2016 in Ghent, Belgium
  • IEEE Big Data 2016 in Washington, DC
  • AMPLab End of Project Event in Berkeley, CA
  • INFORMS Annual Meeting 2016 in Nashville, TN
  • Spark Summit Europe 2016 in Brussels, Belgium
  • ACM KDD 2016 in San Francisco, CA
  • IEEE WCCI 2016 in Vancouver, Canada
  • Apache Big Data 2016 in Vancouver, Canada
  • INFORMS Business Analytics 2016 in Orlando, FL
  • Spark Summit East 2016 in New York City, NY
  • FOSDEM 2016 in Brussels, Belgium
  • UC Berkeley's AMPLab Winter Retreat in Lake Tahoe, CA
  • NIPS 2015 in Montreal, Canada
  • SC15 Supercomputing Conference in Austin, TX
  • INFORMS 2015 Annual Meeting in Philadelphia, PA
  • INFORMS 2014 Annual Meeting in San Francisco, CA
  • ACM KDD 2014 in New York City, NY
  • MSI 2014 Conference on Marketing in Data-Rich Environments in San Francisco, CA
  • INFORMS Big Data Conference in San Jose, CA
  • ASE 2014 Big Data Conference at Stanford University
  • VOSEKO Alumni lecture on Big Data/IoT ...
  • INFORMS 2014 Business Analytics and Big Data Conference in Boston, MA
  • Agoria Data-Driven Innovation in Brussels, Belgium
  • IEEE ICDM Conference in Dallas, TX
  • Sogeti BI Symposium in Amsterdam, The Netherlands
  • SC13 Supercomputing in Denver, CO
  • Agoria BigData Opening event in Brussels, Belgium + The Data-Driven Bank
  • INMA 2013 in Berlin, Germany
  • DMA 2013 in Chicago, IL
  • INFORMS Annual Meeting 2013 in Minneapolis, MN
  • KDD 2013 in Chicago, IL
  • OSCON 2013 Open Source Convention in Portland, OR
  • Oracle 2013 Big Data Forum in Belgium
  • Strata + Hadoop New York City 2012 Big Data Conference
  • Sogeti Belux 2012 Conference on Big Data in Brussels
  • O'Reilly's Strata 2012 Big Data Conference in Santa Clara, CA