2024 | Book

Big Data Analytics in Astronomy, Science, and Engineering

11th International Conference on Big Data Analytics, BDA 2023, Aizu, Japan, December 5–7, 2023, Proceedings

About this book

This book constitutes the proceedings of the 11th International Conference on Big Data Analytics in Astronomy, Science, and Engineering, BDA 2023, which took place in Aizu, Japan during December 5–7, 2023.

The 19 full papers included in this book were carefully reviewed and selected from 55 submissions. They were organized in topical sections as follows: Data management and visualization; data science: architectures and systems; data science and applications; and cyber systems and information security.

Table of Contents

Frontmatter

Data Management and Visualization

Frontmatter
AI-Based Assistance for Management of Oral Community Knowledge in Low-Resource and Colloquial Kannada Language
Abstract
Knowledge in rural communities is largely created, preserved, and transferred verbally, and it is limited. This information is valuable to these communities, and managing it and making it available digitally with state-of-the-art approaches enriches the awareness and collective knowledge of the people of these communities. The large amounts of data and information produced on the Internet are inaccessible to the population in these rural communities due to factors like lack of infrastructure, connectivity, and limited literacy. Knowledge internal to rural communities is also not conserved or made available in any global Big Data information systems. Artificial Intelligence (AI) technologies such as Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) provide substantial assistance when vast quantities of data, like Big Data, are available to build solutions. For low-resource languages like Kannada and for rural colloquial dialects, publicly available corpora are significantly scarcer. Building state-of-the-art AI solutions is challenging in this context, and we address this problem in this work. Knowledge management in rural communities requires a low-cost and efficient approach that social workers can use. This paper proposes an architecture for oral knowledge management for rural communities speaking colloquial Kannada. The proposed architecture has an interface for oral knowledge retrieval using text processing on transcripts generated from the smallest state-of-the-art ASR model. We propose three interfaces to search for content: an n-gram based fuzzy search to search for texts in audios, the most frequent entities search based on the Kannada Named Entity Recognition (NER) model, and question-answering with a Large Language Model (LLM) using a community knowledge vector store.
M. Aparna, Sharath Srivatsa, G. Sai Madhavan, T. B. Dinesh, Srinath Srinivasa
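The abstract above mentions an n-gram based fuzzy search over ASR transcripts. The sketch below illustrates one common way such a search can work, using Jaccard similarity over character trigrams. It is not the authors' implementation: the function names, the threshold, and the English placeholder transcripts are assumptions made for illustration only.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with one space on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_search(query, transcripts, n=3, threshold=0.3):
    """Rank transcript ids by the best word-level n-gram similarity to the query."""
    hits = []
    for doc_id, text in transcripts.items():
        best = max((ngram_similarity(query, word, n) for word in text.split()), default=0.0)
        if best >= threshold:
            hits.append((doc_id, round(best, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical transcripts (placeholders, not data from the paper).
transcripts = {
    "audio_01": "the monsoon harvest was stored in the granary",
    "audio_02": "water from the tank is shared by the village",
}
results = fuzzy_search("harvst", transcripts)  # misspelled query still matches "harvest"
```

Matching on character n-grams rather than whole words tolerates the spelling variation that ASR output and colloquial speech tend to produce, at the cost of occasional false matches for short words.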
Topic Modeling Applied to Reddit Posts
Abstract
Text data is widely used for both commercial and research purposes. While extensive sources of text data are available within Internet forums, such as Reddit, their volume is vast and, typically, only a small subset of posts is studied. To overcome the problem of data size, topic modeling can be applied to extract the main ideas from the documents. However, as will be shown, different modeling techniques may produce very different results. Specifically, this contribution provides an overview of the most popular topic models used in natural language processing, and of methods for their comparison. Moreover, a software solution for downloading, modeling, exploring, and comparing topics contained in Reddit posts is introduced. The proposed application is experimentally validated by showing that the extracted topics reflect real-world events. Finally, the obtained results are compared to those originating from a different tool used for investigating topic popularity.
Maria Kędzierska, Mikołaj Spytek, Marcelina Kurek, Jan Sawicki, Maria Ganzha, Marcin Paprzycki
Twitter Sentiment Analysis in Resource Limited Language
Abstract
Sentiment analysis is essential for understanding public opinion and user feedback in various languages. However, a language barrier often limits the use of existing models, which are primarily pre-trained on English. Therefore, previous approaches have focused on building language-specific models for non-English languages. In this work, we investigate the efficacy of low-resource language-specific models (like GreekBERT) and compare their performance with the RoBERTa model for predicting sentiments in Greek and English language tweets. More specifically, we explore whether Greek tweets translated to English and fed to the RoBERTa model perform better than Greek tweets fed directly to the GreekBERT model. We find that the RoBERTa model performs well not only for the English tweets but also for the non-English (Greek) tweets translated to English. We present a detailed summary of model performance for sentiment classification of non-English (Greek) tweets.
Riya Gupta, Sandli Agarwal, Shreya Garg, Rishabh Kaushal
Querying Healthcare Data in Knowledge-Based Systems
Abstract
In the ever-evolving healthcare landscape, integrating knowledge-based systems into data querying processes is becoming imperative. The existing challenges in querying healthcare data lie in the complexity of extracting meaningful insights from vast and heterogeneous datasets. Electronic health records (EHRs) store different forms of data, and the scalability and performance of query systems, especially given the increasing volume of EHR data, are the main challenges. To overcome these challenges, the paper proposes a system with a user-friendly graphical interface for creating Archetype Query Language (AQL) queries in openEHR systems. It consists of three components: a User Interface, which allows the user to specify query parameters, modify EHR paths, filter data, and customize query results; a Query Builder, which creates the AQL query based on input from the User Interface; and a Repository of Documents, where the compositions are stored and from which the query result is sent back to the User Interface. The system stands out with its innovative approach, systematically extracting openEHR schemas and simplifying the creation of complex AQL queries. The system’s effectiveness and user satisfaction make learning, using, and developing queries for graph-driven healthcare data knowledge easy. The system enhances the overall functionality and usability of the query builder. It offers a pathway to improved clinical decision-making and patient care outcomes.
Kanika Soni, Shelly Sachdeva, Anupama Minj
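To make the Query Builder component described above more concrete, here is a minimal sketch of how user-supplied parameters might be assembled into an AQL string. The function `build_aql` and its parameters are hypothetical and do not come from the paper. The archetype id and at-codes follow the widely used openEHR blood pressure example, and the sketch performs no validation against the AQL grammar.

```python
def build_aql(select_paths, archetype_id, alias="o", rm_type="OBSERVATION",
              where=None, limit=None):
    """Assemble an AQL query string from simple parts (no grammar validation)."""
    # Each selected path is relative to the aliased entry and gets a column name.
    select_clause = ", ".join(
        f"{alias}/{path} AS {name}" for name, path in select_paths.items()
    )
    query = (
        f"SELECT {select_clause} "
        f"FROM EHR e CONTAINS COMPOSITION c "
        f"CONTAINS {rm_type} {alias}[{archetype_id}]"
    )
    if where:
        # Each condition is a (path, operator, value) triple joined with AND.
        conditions = " AND ".join(
            f"{alias}/{path} {op} {value}" for path, op, value in where
        )
        query += f" WHERE {conditions}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Illustrative path for the systolic value in the blood pressure archetype.
SYSTOLIC = "data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude"

query = build_aql(
    select_paths={"systolic": SYSTOLIC},
    archetype_id="openEHR-EHR-OBSERVATION.blood_pressure.v1",
    where=[(SYSTOLIC, ">=", 140)],
    limit=10,
)
```

A graphical front end would populate `select_paths` and `where` from the user's choices; a production builder would also need to escape values and check paths against the extracted schema.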
IGUANER - DIfferential Gene Expression and fUnctionAl aNalyzER
Abstract
In the past fifteen years, the advent of Next-Generation Sequencing technologies, characterized by high efficiency and reduced costs, has marked a pivotal turn for research across various fields including molecular biology, genetics, and molecular medicine. Projects that would have previously required extensive timeframes and significant investments can now be completed swiftly at a fraction of the cost. A direct consequence of the proliferation of these systems is the exponential increase in data generated by RNA-Seq experiments. Much of this data originates from biological samples (cells, tissues, mucus, etc.) of organisms with either absent or incomplete genomic annotations. Compounding this issue is the fact that the surge in data has not been matched by the development of adequate software tools capable of analyzing RNA-Seq data for such organisms. Currently available tools have several limitations: a) they operate in silos, so they only support certain types of analyses, thus complicating the biological interpretation of results; b) they are often executable only via Web interfaces, overlooking the parallelism and efficiency offered by modern supercomputers; c) functional analysis tools rely on outdated functional annotations or support only a limited set of organisms with genomic annotation; d) only one comparison (between two different experimental conditions) can be tested at each run. In order to overcome these limitations, we present IGUANER - (DIfferential Gene expression and fUnctionAl aNalyzER), a software aimed at ensuring the capability for integrated and up-to-date analysis of RNA-Seq data from any organism, regardless of the level of genomic annotation.
Valentina Pinna, Jessica Di Martino, Franco Liberati, Paolo Bottoni, Tiziana Castrignanò

Data Science: Architectures and Systems

Frontmatter
Boosting Diagnostic Accuracy of Osteoporosis in Knee Radiograph Through Fine-Tuning CNN
Abstract
Osteoporosis is a serious worldwide medical problem that might be challenging to identify promptly owing to the absence of indicators. At the moment, DEXA scans, CT scans, and other techniques with expensive devices and payroll expenses are the mainstays of osteoporosis evaluation. Consequently, an improved, accurate and affordable approach is essential for osteoporosis diagnosis. With the advancement of deep learning, systems for the automated identification of illnesses are regularly presented. Leveraging datasets from chest X-rays accessible for free, the present research assesses the efficacy of several convolutional neural network (CNN) models with the best extreme parameters for osteoporosis detection. Both custom CNN designs and pre-trained CNN structures for VGG-16 have been incorporated into the assessed system. According to the research results, the VGG-16 with fine-tuning outperformed the one without fine-tuning with an 86.36% accuracy, 86.67% precision, 86.36% recall and 86.34% F1-score, which makes it a potential and reliable model for osteoporosis prediction. The automated diagnosis approach built on CNN can help practitioners promptly, correctly, and reliably identify osteoporosis. This development results in enhanced patient outcomes and increased system productivity.
Saumya Kumar, Puneet Goswami, Shivani Batra
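The abstract above reports accuracy, precision, recall, and F1-score. As a reminder of how these four figures relate, the following sketch computes them from binary confusion-matrix counts. The counts are hypothetical and are not taken from the paper, and the paper's own averaging scheme is not stated in the abstract.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```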
VLSI Implementation of Reconfigurable Canny Edge Detection Algorithm
Abstract
Real-time video and image processing are used in various industrial, medical, consumer electronics and embedded device applications. These applications typically demonstrate an increasing demand for computing power and system complexity. Edge detection is the most common and widely used technique in image or video processing applications. Several traditional Canny edge detection methods use fixed thresholding techniques to compare the pixel values. This sacrifices edge detection performance and increases computational complexity. Hence, the reconfigurable Canny edge detection algorithm is preferred to enhance the image quality with reduced complexity. It adjusts the quality of the image by manipulating the sigma and threshold parameters and detects the edges accurately by eliminating the noise. The reconfigurable Canny edge detection algorithm presents a procedure for detecting edges without multipliers. The new algorithm uses a low-complexity, non-uniform histogram gradient to compute thresholds and variable sigma values, and it uses add and shift operators instead of multipliers to reduce the area and sigma. The simulation is done in the ModelSim platform using VHDL code, which results in the output of bit sequences. Comparing the results of the reconfigurable Canny edge detection algorithm and the traditional algorithm shows improvements of around 21% and 80% for consumed power and delay, respectively.
K. K. Senthilkumar, E. Avantika, B. Gayathri, Vaithiyanathan Dhandapani
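The abstract above describes replacing multipliers with add and shift operators. The general idea can be sketched in software as follows: multiplication by a constant is decomposed into a sum of left-shifted copies of the input, one per set bit of the constant. This is only an illustration of the principle in Python, not the authors' VHDL design, and the function name is an assumption.

```python
def shift_add_multiply(x, constant):
    """Multiply non-negative integers using only shifts and additions."""
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            # This bit of the constant contributes x * 2**shift.
            result += x << shift
        constant >>= 1
        shift += 1
    return result


# Example: 13 * 10 = (13 << 3) + (13 << 1), since 10 is 0b1010.
product = shift_add_multiply(13, 10)
```

In hardware, each shift is just wiring, so a constant multiplication costs only as many adders as the constant has set bits minus one, which is the usual motivation for this substitution.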
DMC Approach for Modeling Viral Transmission over Respiratory System
Abstract
Diffusive Molecular Communication (DMC) is a widely accepted technique for modeling biological environments. Within DMC, information is conveyed from transmitting nanomachines to a receiving nanomachine by utilizing molecules that disperse through the medium. This paper uses the DMC approach for modeling viral transmission over the respiratory system. The complete respiratory tract determines the severity of the disease, and the literature therefore presents the propagation of COVID-19 in the respiratory tract. Further, the impulse response of the system, which characterizes the ACE2 (angiotensin-converting enzyme 2) concentration per unit area (f(y)), plays a major role in modeling viral transmission over the respiratory tract. The author in [8] describes the propagation of SARS-CoV-2 over the respiratory tract and its binding with the ACE2 receptor; however, we present a generalized approach which can be applied to any type of bacteria and its binding with any receptor kind. In particular, analytical expressions of binding probability \({P}_{b}\), probability of the virus evading \({P}_{e}(y)\) and probability of binding rate y \({P}_{b}\left(y\right)\) under certain impulse responses are presented in this work. Also, the effect of different physical parameters on \({P}_{b}\), \({P}_{nb}\) and \({P}_{b}(y)\) has been quantified with the help of numerical simulation. This work presents a mathematical model describing the concentration of virus and its distribution throughout the respiratory tract. The presented analysis shows perfect agreement with the theoretical background.
Raghevendra Jaiswal, Masood Asim, Urvashi Chugh, Prabhakar Agrawal, S. Pratap Singh
Machine Learning in Particle Physics
Abstract
This note surveys developments in particle physics due to advances made in the fields of statistics, machine learning, and artificial intelligence. With the aid of examples and recent work, this article attempts to give a flavor of the effect of these advances on particle physics, including brief mention of cloud computing, classic machine learning techniques, statistics applications, new ML/AI techniques, reinforcement learning, and other advances. Suggestions are made regarding the future.
Milind V. Purohit

Data Science and Applications

Frontmatter
A Robust Ensemble Machine Learning Model with Advanced Voting Techniques for Comment Classification
Abstract
In the modern era, we find ourselves immersed in an ever-expanding flow of data that is increasing exponentially. Data is generated from different domains like education, business, and e-commerce, and predominantly from social media platforms such as Twitter, YouTube, Facebook, and Instagram. Amidst this proliferation of content, user comments have emerged as a crucial element, serving as a platform for expressions of opinions, commendations, and critiques. However, within the abundance of user feedback lies a persistent issue: the presence of undesirable comments that elicit negative emotional responses and prove to be tedious and irrelevant. Effectively identifying and removing such comments poses a major challenge. This research addresses the need for a robust comment classification model. To tackle this issue, a comprehensive investigation is conducted, employing a variety of machine learning models, including Decision Trees, Random Forests (RF), Naive Bayes, K-Nearest Neighbors, Gradient Boosting, AdaBoost, Logistic Regression, and Support Vector Machines (SVM) for comment classification. Furthermore, fundamental voting techniques such as Hard-Voting, Averaging, and Soft-Voting are incorporated with machine learning models to improve the classification performance. The objective is to discern the characteristics of text comments and classify them, with the aim of achieving superior accuracy compared to prior research. In this paper, we propose a robust ensemble model, RF+AdaBoost+SVM+Soft-Voting, specifically designed for comment classification. The results obtained indicate that the proposed ensemble model achieved an accuracy of approximately 98% for comment classification on a YouTube dataset.
Ariful Islam Shiplu, Md. Mostafizer Rahman, Yutaka Watanobe
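The difference between the hard-voting and soft-voting techniques named in the abstract above can be shown with a small sketch. The per-model probabilities and class labels below are invented for illustration, and the functions are a plain-Python stand-in rather than the authors' pipeline.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Average per-class probabilities across models and return the top class."""
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total
        for label in probabilities[0]
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical class probabilities from three models for one comment.
model_probs = [
    {"spam": 0.90, "ham": 0.10},  # e.g. a random forest
    {"spam": 0.40, "ham": 0.60},  # e.g. AdaBoost
    {"spam": 0.45, "ham": 0.55},  # e.g. an SVM with probability outputs
]
hard_label = hard_vote([max(p, key=p.get) for p in model_probs])
soft_label, averaged = soft_vote(model_probs)
```

Here the two schemes disagree: two of the three models lean weakly toward "ham", so hard voting picks "ham", while soft voting picks "spam" because one model is very confident. This sensitivity to confidence is the usual argument for soft voting.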
Vector-Based Semantic Scenario Search for Vehicular Traffic
Abstract
Autonomous Vehicles (AVs) are expected to impact urban mobility by providing increased safety, reducing traffic congestion, mitigating accidents and reducing emissions. Since AVs operate with little or no human intervention, it is essential that they perceive the external world, understand different objects and their relationships in the scene, and respond appropriately. To do this effectively, AVs need to be trained on a variety of traffic situations and appropriate responses to them. The behaviour of vehicular traffic varies widely from one part of the world to another. An AV trained for traffic conditions in one part of the world may not be effective, or worse, may even be risky, in some other part of the world. There is hence a need to create datasets of vehicular traffic scenarios and design mechanisms to query, retrieve and reason about dynamic traffic scenarios. This paper discusses a method for vector-based scenario search using a natural language interface for describing traffic scenarios. We first generate textual descriptions of snapshots of traffic scenarios, captured from instrumented vehicles, using image captioning libraries. Next, we create vector embeddings of the captions and store them in a vector database to enable semantic scenario search using natural-language queries. This is ongoing work in which other modalities of scenario data are planned to be supported over an underlying image captioning and natural language search interface. Experimental results on the image captioning core are encouraging.
A. P. Bhoomika, Srinath Srinivasa, Vijaya Sarathi Indla, Saikat Mukherjee
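The caption-embedding and search steps described above can be sketched as follows. To stay self-contained, the sketch uses a toy bag-of-words vector in place of a learned sentence embedding, and an in-memory list in place of a vector database; both are assumptions, as are the class name `ScenarioIndex` and the sample captions.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ScenarioIndex:
    """In-memory stand-in for a vector database of caption embeddings."""

    def __init__(self):
        self.items = []

    def add(self, scenario_id, caption):
        self.items.append((scenario_id, embed(caption)))

    def search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [scenario_id for scenario_id, _ in ranked[:k]]


# Hypothetical captions, as an image-captioning model might produce them.
index = ScenarioIndex()
index.add("frame_001", "a motorcycle overtakes a bus on a narrow road")
index.add("frame_002", "pedestrians cross the street at a signal")
index.add("frame_003", "a truck is parked near a junction")
top = index.search("bus overtaken by a motorcycle", k=1)
```

With a learned embedding model, the same ranking logic would also match paraphrases that share no words with the caption, which a bag-of-words vector cannot do.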
Searching for Short M-Dwarf Flares by Machine Learning Method
Abstract
We propose a machine learning method to identify M-dwarf flares in astronomical observation data. A flare is a sudden increase in luminosity at a star’s surface, and is thought to be the result of magnetic reconnection. Observations of stellar flares play a crucial role in understanding stellar magnetic activity. In particular, analyzing flare time evolution (the light curve) is essential. We use data from the Tomo-e Gozen camera, mounted on the Kiso Schmidt telescope, with a cadence of approximately one second, which is shorter than the cadence of other telescopes such as NASA’s Kepler space telescope and the Transiting Exoplanet Survey Satellite. The dataset is ideal for identifying fast flares. We develop a one-dimensional convolutional neural network (CNN) to detect fast and faint flares in optical light curves. We train the model on a limited number of real flares identified by human experts, augmented with a large number of artificially generated flares, to detect sub-minute flare candidates within the light curves captured by the Tomo-e Gozen camera, and subsequently fit these candidates to make the selections. Our novel CNN model has successfully identified potential flares characterized by a rise time in the range of 4 s \(\lesssim t_\textrm{rise} \lesssim \) 88 s, and energy levels spanning \(10^{30}\) erg \(\lesssim E_\textrm{flare} \lesssim 10^{33}\) erg. Notably, these potential flares exhibit shorter durations and lower energies than those detected by human experts, who typically identify flares with a rise time of 5 s \(\lesssim t_\textrm{rise} \lesssim \) 100 s and energy of \(10^{31}\) erg \(\lesssim E_\textrm{flare} \lesssim 10^{34}\) erg.
Hanchun Jiang
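The abstract above mentions augmenting the training set with artificially generated flares. One simple way to do this is to inject a template into a quiescent light curve, sketched below. The template shape (linear rise, exponential decay), the parameter names, and all numeric values are assumptions for illustration; the abstract does not specify the flare model the author used.

```python
import math


def flare_template(n_samples, t_peak, t_rise, t_decay, amplitude):
    """Linear rise to the peak, then exponential decay (a simplified, assumed shape)."""
    flux = []
    for t in range(n_samples):
        if t < t_peak - t_rise:
            flux.append(0.0)  # before the flare starts
        elif t <= t_peak:
            flux.append(amplitude * (t - (t_peak - t_rise)) / t_rise)
        else:
            flux.append(amplitude * math.exp(-(t - t_peak) / t_decay))
    return flux


def inject_flare(light_curve, **flare_params):
    """Add a synthetic flare to an existing light curve, sample by sample."""
    template = flare_template(len(light_curve), **flare_params)
    return [base + extra for base, extra in zip(light_curve, template)]


# A flat, noise-free baseline sampled once per second (hypothetical values).
baseline = [1.0] * 120
augmented = inject_flare(baseline, t_peak=40, t_rise=5, t_decay=15, amplitude=0.3)
```

Varying the rise time, decay time, and amplitude at random, and injecting into real quiescent light curves with their own noise, would give a detector many labeled positive examples.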
Vayu Vishleshan: AQI Monitoring and Reduction Analysis
Abstract
In today’s life, humans compromise with nature while evolving into a more advanced species. One of the main effects of that advancement is air pollution. Air pollution seriously threatens human health, the environment, and the general quality of life worldwide. The Air Quality Index (AQI), an indexing approach, was developed to enable quantitative analysis of air quality. The air quality index is computed using measurements for particulate matter (PM), PM2.5, PM10, NO2, CO, CO2, NH3, and other contaminants. Agnihotra (Yagya) is a method of environmental purification mentioned in Hindu scripture (the book Yagya Vimarsh, written by Dr. Ramprakash, discusses this air pollution reduction method). This paper discusses the Vayu Vishleshan framework for monitoring air quality in the presence of Agnihotra (Yagya). The study is divided into two processes. The first is AQI monitoring, and the second is AQI reduction. AQI monitoring covers sensing particulate matter (PM), PM2.5, PM10, NO2, and CO, and analyzing the AQI data before, during, and after Agnihotra (Yagya). According to the analysis, there is an approximate 6–7% decrease in CO levels and an approximate 10–12% decrease in PM levels following Agnihotra.
Mayank Deep Khare, Shelly Sachdeva, Divyam Dubey, Rohit Singh Rajpoot, Saurav Kumar
Efficient Knowledge Graph Embeddings via Kernelized Random Projections
Abstract
Knowledge Graph Completion (KGC) aims to predict missing entities or relations in a knowledge graph (KG), but it becomes computationally expensive as the KG scales. Existing research focuses on bilinear pooling-based factorization methods (LowFER, TuckER) to solve this problem. These approaches introduce too many trainable parameters, which obstructs their deployment in many real-world scenarios. In this paper, we introduce a novel parameter-efficient framework, KGRP, which a) approximates bilinear pooling using a Kernelized Random Projection matrix and b) employs a CNN for better fusion of entities and relations to infer missing links. Our experimental results show that KGRP has 73% fewer parameters than the state-of-the-art approaches (LowFER, TuckER) for the knowledge graph completion task while retaining 88% of the performance of the best baseline. Furthermore, we provide novel insights on the interpretability of relation embeddings. We also test the effectiveness of KGRP on a large-scale recruitment knowledge graph of 0.25M entities.
Nidhi Goyal, Anmol Goel, Tanuj Garg, Niharika Sachdeva, Ponnurangam Kumaraguru

Cyber Systems and Information Security

Frontmatter
From Silos to Unity: Seamless Cross-Platform Gaming by Leveraging Blockchain Technology
Abstract
By exploring the impact of blockchain technology within cross-platform gaming, this study confronts pivotal hurdles including asset ownership, player identity, and interoperability within virtual economies. Through a blend of theoretical scrutiny and practical insights, the research illuminates blockchain’s capacity to authenticate digital asset ownership, forge unified player profiles, and streamline interactions across various gaming platforms. Further, notwithstanding these solutions, the effective integration of blockchain hinges on concerted efforts from game developers, platform operators, and additional key players. The study emphasizes the criticality of norms, scalability, and user acceptance. Thus, it advocates for a comprehensive approach to unlock blockchain’s vast capabilities in enhancing cross-platform gaming experiences.
Rashmi P. Sarode, Yutaka Watanobe, Subhash Bhalla
Analysis of Job Processing Data – Towards Large Cloud Infrastructure Operation Simulation
Abstract
Cloud computing is the most popular way of delivering on-demand computational resources. Recently, the research in this area has started to focus on carbon-aware clouds. Here, the most challenging aspects are related to defining strategies for efficient task scheduling and resource allocation. These strategies can be simulated and assessed using dedicated tools. However, to perform their accurate evaluation, the tests should reproduce close-to-real conditions of the actual cloud center. In particular, they require running simulations with various mixtures of tasks that replicate the actual cloud center operation. Therefore, the main aim of this work was to prepare tools that will allow the generation of synthetic job streams, with mixes of realistic types of computational tasks. The core of this contribution is the analysis of actual job processing data from the CloudFerro cloud center. The proposed methodology is based on data clustering and includes a comparison between multiple algorithms. Furthermore, the resulting clusters have been categorized from the point of view of cloud center operation, in order to identify prototypical tasks’ classes with respect to the resource demands. Finally, a tool that generates synthetic job streams, based on the Gaussian Mixture Model, which has been implemented, is summarized.
Zofia Wrona, Maria Ganzha, Marcin Paprzycki, Stanisław Krzyżanowski
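The abstract above summarizes a tool that generates synthetic job streams from a Gaussian Mixture Model. The sampling step of such a generator can be sketched as below: pick a mixture component according to its weight, then draw resource demands from that component's Gaussians. The component names, weights, means, and standard deviations are invented, and the sketch assumes independent features per component (a diagonal covariance), which the paper's fitted model may not.

```python
import random


def sample_job_stream(components, n_jobs, seed=0):
    """Draw synthetic jobs from a mixture of independent Gaussians."""
    rng = random.Random(seed)
    weights = [component["weight"] for component in components]
    jobs = []
    for _ in range(n_jobs):
        # Step 1: choose a job class with probability equal to its mixture weight.
        comp = rng.choices(components, weights=weights, k=1)[0]
        # Step 2: sample resource demands, clamped to physically sensible minimums.
        jobs.append({
            "class": comp["name"],
            "cpu_cores": max(0.1, rng.gauss(comp["cpu_mean"], comp["cpu_std"])),
            "duration_s": max(1.0, rng.gauss(comp["dur_mean"], comp["dur_std"])),
        })
    return jobs


# Hypothetical mixture components standing in for clusters found in real traces.
components = [
    {"name": "short_light", "weight": 0.7, "cpu_mean": 1.0, "cpu_std": 0.3,
     "dur_mean": 60.0, "dur_std": 20.0},
    {"name": "long_heavy", "weight": 0.3, "cpu_mean": 8.0, "cpu_std": 2.0,
     "dur_mean": 3600.0, "dur_std": 600.0},
]
jobs = sample_job_stream(components, n_jobs=1000, seed=42)
```

Fixing the seed makes a generated stream reproducible, which matters when the same workload has to be replayed against several scheduling strategies.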
Exploring Approaches to Detection of Anomalies in Streaming Data
Abstract
Numerous methods have been proposed to detect anomalies in data streams. In this work, a comprehensive study of performance of AutoEncoders and Predictive networks, applied to two datasets, is presented. The first dataset (from the paper mill) is labeled, whereas the second (from the server farm) is not. In this context, first, AutoEncoders and Predictive networks are applied to the labeled dataset and tuned, to improve their performance. Moreover, chronological and random training data splitting is explored. Additionally, an industry expert’s suggested performance evaluation method is proposed. Effects of its use are experimentally investigated. It is shown that the proposed approach outperforms the state-of-the-art approaches. The best of breed model and approach from the labeled paper mill dataset, is applied to the log data from a server farm. Obtained results turned out to “make sense” to the log data owners, and the developed method is going to be tried in a real-life deployment.
Damian Rakus, Maria Ganzha, Marcin Paprzycki, Artur Bicki
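The abstract above contrasts chronological and random splitting of training data. The two strategies can be sketched as follows; the function names and the 80/20 fraction are assumptions for illustration, and the records are placeholder integers standing in for time-ordered stream samples.

```python
import random


def chronological_split(records, train_fraction=0.8):
    """Train on the earliest records and test on the latest ones."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def random_split(records, train_fraction=0.8, seed=0):
    """Shuffle the records before splitting, ignoring their time order."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder stream: the integer stands in for a timestamped sample.
records = list(range(100))
train_chrono, test_chrono = chronological_split(records)
train_random, test_random = random_split(records)
```

A chronological split never lets the model see data from after the test period, which mirrors deployment on a live stream; a random split mixes past and future and can make results look better than they would be in operation.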
Blockchain-Based Framework for Healthcare 5.0
Abstract
This article presents a Blockchain-based data-sharing approach that offers reliability, integrity, and decentralization properties of Blockchain, making the system user-controllable. It contains the evolution of the healthcare industry with respect to other industries. We also discuss the impact of industry 5.0 technologies and Blockchain-enabled healthcare. The proposed system stores medical data off-chain with IPFS, and the reference of off-chain data is stored on Blockchain. Information flow is set up with the help of smart contracts deployed over Blockchain to ensure immutability and traceability. A prototype has been simulated on the Rinkeby test network using a Proof-of-Work consensus algorithm for various parameters such as upload time, retrieval time, and gas consumed for transaction execution.
Vijayant Pawar, Shelly Sachdeva, Subhash Bhalla
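The storage pattern described above, with medical data kept off-chain and only a reference recorded on-chain, can be sketched in a few lines. Both classes below are in-memory stand-ins invented for illustration: `OffChainStore` imitates content addressing with a plain SHA-256 hex digest (real IPFS content identifiers use a different encoding), and `Ledger` imitates an append-only hash-linked chain without consensus, smart contracts, or gas.

```python
import hashlib
import json


class OffChainStore:
    """Content-addressed blob store; a stand-in for IPFS."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        content_id = hashlib.sha256(data).hexdigest()
        self._blobs[content_id] = data
        return content_id

    def get(self, content_id: str) -> bytes:
        return self._blobs[content_id]


class Ledger:
    """Append-only list of hash-linked blocks; a stand-in for a blockchain."""

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, payload):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        body = {"index": len(self.blocks), "prev_hash": prev_hash, "payload": payload}
        self.blocks.append({**body, "hash": self._digest(body)})

    def verify(self):
        """Recompute every hash and link; any edit to a block breaks the chain."""
        prev_hash = "0" * 64
        for block in self.blocks:
            body = {key: block[key] for key in ("index", "prev_hash", "payload")}
            if block["prev_hash"] != prev_hash or block["hash"] != self._digest(body):
                return False
            prev_hash = block["hash"]
        return True


store, ledger = OffChainStore(), Ledger()
record = b'{"patient": "hypothetical-001", "note": "example record"}'
content_id = store.add(record)                               # data stays off-chain
ledger.append({"owner": "patient-001", "cid": content_id})   # only the reference goes on-chain
```

Keeping only the reference on-chain keeps transactions small, while the content hash lets anyone holding the reference check that the off-chain record has not been altered.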
Semantics for Resource Selection in Next Generation Internet of Things Systems
Abstract
Over the last two decades, the complexity and scale of computer systems have radically increased. This is caused mainly by the proliferation of data to be processed, and by the increasing number and heterogeneity of digital artifacts that can, and need to, be used for the deployment of software components and for the processing of data. Here, design paradigms such as cloud/fog/edge processing, and the construction of a computing continuum, are trends worth mentioning. Moreover, these changes should also be seen in the context of the rapid acceptance of the Internet of Things, which combines very large numbers of sensors and actuators. These changes also result in new challenges that need to be tackled: dynamic environments, heterogeneous data models and processing elements, distributed execution of user workflows, etc. In this context, one of the crucial issues that needs to be addressed is the selection of the right resource(s) that can/should execute a given task/workflow. One of the approaches that can be considered to address such challenges is the application of semantic data processing. In this contribution, the potential benefits and limitations of applying semantic technologies are discussed.
Katarzyna Wasielewska-Michniewska, Marcin Paprzycki, Maria Ganzha
Backmatter
Metadata
Title
Big Data Analytics in Astronomy, Science, and Engineering
Edited by
Shelly Sachdeva
Yutaka Watanobe
Copyright Year
2024
Electronic ISBN
978-3-031-58502-9
Print ISBN
978-3-031-58501-2
DOI
https://doi.org/10.1007/978-3-031-58502-9