Integrating visual analytics and machine learning approaches for analyzing multivariate proximal sensor data

Date

2021-08

Abstract

Sensors are pervasive in a broad range of industries. They provide efficient methods to measure and monitor properties of interest for numerous business and manufacturing operations. Recent developments in sensing technologies and their efficiency have produced massive amounts of multivariate proximal sensor data. While collecting multivariate proximal sensor data has become more accessible, analyzing these data still consumes a lot of time for domain-specific analysts. Therefore, this dissertation discusses five objectives, consolidated from interviews with 102 stakeholders, for a software solution in this area. These objectives are 1) a set of typical visualizations for chemical measurement data, 2) an intelligent visual recommendation component that provides personalized visualizations for chemical analysts, 3) real-life error indications for proximal sensors, 4) machine learning components for device calibration (e.g., spectrum to concentrations), and 5) machine learning components with an emphasis on predicting high-level, difficult-to-measure properties from low-level, relatively easy-to-measure proximal sensor data.

Proximal sensor-related technologies offer a means to acquire massive amounts of multivariate proximal sensor data in various forms (e.g., simple, georeferenced, or time series). However, existing software approaches are ill-suited to explore and utilize these types of data. Specifically, current analysis solutions provide unintuitive visualizations, lack support for interaction, or require skills or training before use. Therefore, this dissertation proposes a multifaceted approach that seamlessly integrates data collection, data processing, data analysis, and reporting of results to the public. It follows a multidisciplinary approach in which data visualizers worked closely with soil scientists to propose two interactive 2D and 3D visualization tools, called SOAViz and SoilScanner, for analyzing multivariate proximal sensor data collected from soil profiles. These tools received very positive feedback from soil scientists.

Georeferenced multivariate proximal sensor data are also common in many areas. For example, the Natural Resources Conservation Service of the United States Department of Agriculture (USDA NRCS) would like to manage its land over large geographical regions at various depths. Interactive three-dimensional visualizations play a crucial role in analyzing this form of data, but existing approaches for analyzing, visualizing, and reporting results for it are limited. For instance, three-dimensional visualizations of soil profiles are primarily available for small profiles scanned using X-ray CT or CT scanners, and they are static representations only. In other words, they do not equip users with interactive options to further explore and find insights from the collected data. Therefore, working with soil scientists, this dissertation also proposes an approach, called iDVS, for analyzing georeferenced multivariate proximal sensor data collected over large geographical areas. Using interpolation and volume rendering techniques, iDVS offers three-dimensional visualizations of soil profiles over large geographical areas. In addition, the interactive web-based implementation of iDVS allows analysts to perform their data exploration tasks and to find and report insights about the soil’s chemical and physical properties of interest.
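The interpolation step that precedes volume rendering can be illustrated with a small sketch. The abstract does not say which interpolation technique iDVS uses, so inverse distance weighting (IDW) is an assumption made here for illustration only; the function names, the sample layout, and the `power` parameter are all hypothetical.

```python
import math

def idw_interpolate(samples, query, power=2.0):
    """Estimate a property value at `query` from scattered samples.

    samples: list of ((x, y, depth), value) pairs (hypothetical layout).
    query:   (x, y, depth) location to estimate.
    IDW is an assumed technique, not necessarily the one used by iDVS.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for point, value in samples:
        distance = math.dist(point, query)
        if distance == 0.0:
            # The query coincides with a measured location: use it directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

def interpolate_grid(samples, xs, ys, depths, power=2.0):
    """Fill a regular 3D grid (indexed [x][y][depth]) that a volume renderer could consume."""
    return [[[idw_interpolate(samples, (x, y, z), power) for z in depths]
             for y in ys]
            for x in xs]
```

A call such as `interpolate_grid(samples, xs, ys, depths)` yields a dense grid of estimated values from sparse georeferenced measurements, which is the kind of input that volume rendering typically requires.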

Different business domains and different users have different types of data and different analysis needs for the acquired multivariate proximal sensor data. Thus, it is difficult to find an exhaustive list of standard interactive visual analytics solutions that can accommodate all users’ needs. As a result, proximal sensor end users often use unintuitive software (due to the lack of visualizations) to analyze their data and then generate static graphics (due to the lack of interactive options) to report their findings. In many cases, they need to collaborate with external parties or learn programming skills to create custom data analysis and visualization solutions for their acquired multivariate proximal sensor data. To tackle this issue, this dissertation proposes two visual recommendation approaches: a static one and a dynamic one. The former leverages visual features of the underlying data, while the latter uses reinforcement learning to learn and recommend appropriate visualizations. Specifically, the static approach uses Scagnostics measures to extract visual features from the underlying data and suggest visualizations of interest to the users. In addition to these visual features, the dynamic approach uses a contextual bandit algorithm, called LinUCB, to learn and recommend personalized visualizations to end users based on their context. These two approaches lay a foundation for further work toward a deployable visual recommendation solution for users working with multivariate proximal sensor data.
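As a rough illustration of how a LinUCB-style recommender operates, the sketch below implements the standard disjoint LinUCB rule, with one linear model per candidate visualization. It is not the dissertation's implementation: the class and function names, the arm labels, the context vectors, and the reward convention (1 when the user accepts a recommendation, 0 otherwise) are all assumptions. The inverse of each arm's matrix is maintained with a Sherman-Morrison rank-one update so that no linear algebra library is needed.

```python
import math

class LinUCBArm:
    """State for one arm (one candidate visualization) in disjoint LinUCB."""

    def __init__(self, dim):
        # A starts as the identity matrix, so its inverse is also the identity.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, vector):
        return [sum(row[j] * vector[j] for j in range(len(vector)))
                for row in self.A_inv]

    def ucb(self, x, alpha):
        """Upper confidence bound: theta.x + alpha * sqrt(x^T A^-1 x)."""
        theta = self._A_inv_times(self.b)
        A_inv_x = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        variance = sum(xi * v for xi, v in zip(x, A_inv_x))
        return mean + alpha * math.sqrt(max(variance, 0.0))

    def update(self, x, reward):
        """Apply A <- A + x x^T and b <- b + reward * x."""
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x),
        # valid here because A^-1 is symmetric.
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def recommend(arms, x, alpha=1.0):
    """Return the name of the arm with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(x, alpha))

# Hypothetical usage: two chart types and a two-feature user context.
arms = {"scatterplot": LinUCBArm(2), "line_chart": LinUCBArm(2)}
context = [1.0, 0.0]
choice = recommend(arms, context)
arms[choice].update(context, reward=1.0)  # the user accepted the suggestion
```

The `alpha` parameter controls how strongly the recommender favors visualizations it has little feedback on: larger values mean more exploration, smaller values mean more reliance on the rewards observed so far.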

Another common type of multivariate proximal sensor data is time series data. Abnormalities in such data are of high interest to scientists in various domains, because they may represent errors that occurred during data collection or, in other cases, indicate interesting situations. Therefore, this dissertation also presents an approach to analyzing two-dimensional and multi-dimensional temporal proximal sensor datasets that focuses on identifying the observations that contribute significantly to outliers in the underlying data. Specifically, we propose two visual analytics tools for this purpose, called Outliagnostics and MTSAD. These tools guide users as they interactively explore abnormalities in large bivariate or multivariate time series acquired from proximal sensors.

Instead of focusing on detecting outliers at each time point, these approaches monitor and display the discrepant temporal signatures of each data entry with respect to the overall distributions. This dissertation also illustrates and validates the use of these two tools on real-world datasets of various sizes, including but not limited to proximal sensor data, to highlight their benefits and performance.

Furthermore, recent advancements in machine learning and deep learning have created high demand for using low-level multivariate proximal sensor data to predict high-level, more difficult-to-collect features. For instance, high-performance computing centers want to use the monitored health statuses acquired by sensors to predict future power consumption or surges in computation nodes’ temperatures so that they can better allocate resources. Similarly, soil scientists want to use spectral data acquired from soil samples using proximal sensors to predict the soil’s physical and chemical properties as a rapid, cost-effective, and environmentally friendly alternative to time-consuming, expensive, and destructive laboratory procedures. Therefore, this dissertation experiments with different machine learning and deep learning methods to predict pH(H2O) and pH(KCl) from a set of Vis-NIR spectral data for a globally distributed set of soil samples. We then propose a deep learning method, called RDNet, that outperforms the other existing approaches on these prediction tasks.


Embargo status: Restricted until September 2022.

Keywords

Data Analytics, Data Visualization, Machine Learning, Deep Learning, Proximal Sensor Data Analysis, Soil Profile Analysis, Portable X-Ray Fluorescence Data, Vis-NIR Data