Student Poster Competition

Lucy Family Institute for Data and Society Fall Symposium

Wednesday, Oct 27 at McKenna Hall 

12-1 PM Student posters opened the symposium during the lunch hour

  • Live/crowd sourced voting during the event determined the winners  
  • Awards for the top three posters ($300, $200, $100) will be announced at the evening reception (6-8 pm) at Foley's in O'Neill Hall
Submitted Entries:
Title Author(s) Abstract
An Empirical Study on Model Errors & Users’ Error Repair Strategies in Natural Language Interfaces for Database Queries (NL2SQL) Zhang Z, Ning Z, Sun T, Li T Natural language to SQL (NL2SQL) models can help users access databases with natural language queries, which are “translated” into SQL queries by the models. While current state-of-the-art NL2SQL models have made significant progress, reaching ~70% accuracy on popular benchmark datasets, the difficulty users face in identifying and repairing errors made by NL2SQL models limits their wide adoption. In order to help users discover and repair errors in NL2SQL results, we first perform an empirical study on errors made by several state-of-the-art NL2SQL models on Spider, which is the largest NL2SQL benchmark dataset. The study results in the first comprehensive taxonomy of NL2SQL errors as well as a statistical analysis of these error types. We also conduct a controlled user study that compares the effectiveness and efficiency of interaction strategies for discovering and repairing NL2SQL errors. Based on findings from these studies, we discuss design implications for natural language interfaces for database queries and identify future research directions.
An Optimized NL2SQL system for enterprise data mart Dong K, Lu K, Xia X, Cieslak D, Chawla N Natural language interfaces to databases are a growing field that enables end users to interact with relational databases without technical database skills. These interfaces solve the problem of synthesizing SQL queries based on natural language input from the user. There is considerable research interest in the topic, but few systems to date are deployed on top of an active enterprise data mart. We present our NL2SQL system designed for the banking sector, which can generate a SQL query from a user’s natural language question. The system comprises the NL2SQL model we developed, as well as the data simulation and the adaptive feedback framework to continuously improve model performance. The architecture of this NL2SQL model is built on our research on WikiSQL data, which we extended to support multi-table scenarios via our unique table expand process. The data simulation and the feedback loop help the model continuously adjust to linguistic variation introduced by domain-specific knowledge.
CareNet: Toward a user-centered design for facilitating access to social services Anuyah O, Farrell M, McBride M, Carlson C, Conrado A, Metoyer R Our research is motivated by the need to design technology that can aid vulnerable populations in the community of South Bend and surrounding areas to more effectively identify and access the admirable array of social services in place. Research has shown that while community resources are vast and valuable, they are hidden and hard to find. Moreover, existing means of accessing them are not user-friendly, rely on outdated databases, are difficult to search, and are inconsistent and non-standard. These challenges leave end-users and those tasked with aiding them dependent on word of mouth and interpersonal connections, which unfortunately leaves some users behind. The Jessie Ball DuPont Foundation is supporting the team to investigate the further development of a data-driven application that connects people in need with social service providers, food assistance, education assistance, and other community resources. As part of that research process, the team conducted studies with expert social service providers to identify gaps in existing systems and discover potential design recommendations. We further developed a web-based application and validated our design with users and providers.
Approximating Probability Densities Via Normalizing Flows and Adaptive Annealing Cobian E, Hauenstein J, Liu F, Schiavazzi D Normalizing flows are invertible mappings used to transform simpler probability densities into ones that are more complex. Through optimizing parameters associated with these mappings, normalizing flows are used in statistics and machine learning for density estimation and variational inference. In the framework of variational inference, to improve the efficiency and accuracy of the approximation via normalizing flows when a target distribution is multimodal, annealing of the target distribution with a constant annealing schedule in the optimization is often employed. On the other hand, a constant annealing schedule often applies slowly-changing temperatures to the target distribution and can be inefficient computationally. In this presentation, we will introduce a more efficient adaptive annealing schedule that automatically adjusts the incremental step in the annealing schedule to the KL-divergence between two adjacent tempered distributions. We will demonstrate the computational efficiency of normalizing flows, combined with our proposed adaptive annealing scheme, in approximating multimodal distribution and obtaining Bayesian inferences of the parameter for dynamical systems.
Counterfactual Graph Learning for Link Prediction Liu G Learning to predict missing links is important for many graph-based applications. Existing methods were designed to learn the observed association between two sets of variables: (1) the observed graph structure and (2) the existence of link between a pair of nodes. However, the causal relationship between these variables was ignored and we visit the possibility of learning it by simply asking a counterfactual question: "would the link exist or not if the observed graph structure became different?" To answer this question by causal inference, we consider the information of the node pair as context, global graph structural properties as treatment, and link existence as outcome. In this work, we propose a novel link prediction method that enhances graph learning by the counterfactual inference. It creates counterfactual links from the observed ones, and our method learns representations from both of them. Experiments on a number of benchmark datasets show that our proposed method achieves the state-of-the-art performance on link prediction.
Crepe: Collector for Research Experiments on Participant Experiences Lu Y, Cox V, Chen M, Jiang M, Li T Data collected from users’ smartphones has been used for many purposes in academic research. For example, many research projects have been conducted by analyzing users’ log data in mobile applications. Also, by collecting data fields directly displayed on application interfaces, researchers are able to access data that is only visible to certain user groups. Analyzing such mobile data can help researchers gain insights into technology’s impact on individuals and our society, as well as obtain design implications for new technology. While such data collection tasks are common and similar in many aspects, there are no existing tools to support the process. In many cases, researchers end up “reinventing the wheel” and building their own data collector. In order to reduce the development work required to collect mobile data, we present Crepe, an open-source system for collecting data directly from screen displays of Android devices in research projects to investigate participant experiences. Crepe is designed and developed using a human-centered approach by involving researchers and research participants in the creation lifecycle. It utilizes Android accessibility APIs to access on-screen displayed information, contexts of interface elements, as well as a number of user activities. Crepe uses a graph-based model for representing GUI screens and querying data from GUIs, ensuring the robustness of data collection despite app GUI changes and updates. Researchers are able to customize the data they need to collect and share the collectors with participants, who will download the same app on their devices to contribute their anonymized data. By presenting this tool, we hope to lower the barrier to obtaining quality mobile data for research analysis in the future.
Sausage: Empower Gig-Workers in AI Fairness Lu Y, Cox V, Chen M, Jiang M, Li T Gig work has become increasingly widespread in the U.S. economy and is expected to play a prominent role in the future of work. AI algorithmic management is responsible for many aspects of gig work: allocating tasks, connecting customers, determining pay, monitoring worker performance, enforcing platform rules, and even terminating workers. Prior work showed various types of algorithmic biases in existing gig work environments. Gig workers also express low levels of trust in the fairness of the AI systems used by gig work platforms due to their lack of transparency. Recent worker movements and legislative changes have caused a fundamental paradigm shift in the worker-AI relationship in gig work from dispatching tasks to recommending tasks, giving workers more agency. As a result, workers also need more help in understanding these recommendations. This study involves two components: 1) collect, analyze, and measure the fairness in algorithmic recommendations of tasks to gig workers by platforms, 2) use a human-centered approach to design, develop, and study a tool to help improve gig workers’ understanding of their task recommendations’ fairness and support their task selection decision making. In the first component, we will focus on two aspects of fairness: (1) fairness among workers—Are some workers treated “better” by AI systems than others? (2) fairness between the platform and its workers—Are AI systems working in the best interest of the platform operators while hurting the interest of workers? In general, the data analysis results along with the design study implications will inform the design, development, and evaluation of AI systems to empower gig workers.
Developing a New Surrogate Model for Computational Fluid Dynamic Simulation of Aorta Using Stochastic Shape Modeling and Deep Neural Networks Du Pan, Zhu Xiaozhi, Wang Jian-xun Due to the rising demand for acquiring comprehensive hemodynamic flow information for the diagnosis of cardiovascular diseases, image-based Computational Fluid Dynamics (CFD) has been widely employed to enable the derivation of functional information that is not accessible by medical images alone (e.g., pressure distribution, shear stress contour, velocity vector field), facilitating quantitative analysis and risk assessment in clinical therapy. However, such modeling requires numerically solving mesh-based discretization of partial differential equations, which is computationally expensive, particularly for complex flow or when considering fluid-structure interaction. This has largely limited the translation of image-based CFD to clinical treatments that require timely feedback for further therapeutic assessment and treatment planning. Moreover, it has posed a significant challenge to many-query applications, including uncertainty quantification, parameter estimation, and optimization problems arising in cardiovascular modeling. To enable efficient cardiovascular hemodynamic simulations, reduced-order or surrogate models have received increased attention and been developed as an alternative to predict functional information with a significantly less computational cost. For example, Lumped Parameter or 1-D reduced-order models are widely used to rapidly predict volumetric flow rate and have been an area of intense investigation. However, those approaches only focus on global information and are incapable of providing local flow information such as spatiotemporal fields of velocity or wall shear stresses, which is more crucial to advancing cardiovascular research/healthcare. 
Deep neural network (DNN) is renowned for its capability of approximating complex nonlinear functions and fast online inference speed. As a result, trained DNN shows a great potential of serving as a surrogate model for high-dimensional CFD simulations. In this work, we propose a novel deep learning surrogate modeling framework for image-based computational fluid simulations, enabling fast predictions of hemodynamics given complex 3-D patient-specific geometries.
Disclosure Risk from Homogeneity Attack in Differentially Private Frequency Distribution Liu F, Zhao X Differential privacy (DP) provides a robust model to achieve privacy guarantees for released information. We examine the protection potency of sanitized multi-dimensional frequency distributions via DP randomization mechanisms against homogeneity attack (HA). HA allows adversaries to obtain the exact values on sensitive attributes for their targets without having to identify them from the released data. We propose measures for disclosure risk from HA and derive closed-form relationships between the privacy loss parameters in DP and the disclosure risk from HA. The availability of the closed-form relationships assists understanding the abstract concepts of DP and privacy loss parameters by putting them in the context of a concrete privacy attack and offers a perspective for choosing privacy loss parameters when employing DP mechanisms in information sanitization and release in practice. We apply the closed-form mathematical relationships in real-life datasets to demonstrate the assessment of disclosure risk due to HA on differentially private sanitized frequency distributions at various privacy loss parameters.
Dynamic Poisson Factor Analysis: A Hierarchical Bayesian Approach with Intensive Text Data Shao S, Liu F, Jacobucci R In psychological science, researchers commonly assess open-ended questions at the daily level in order to detect additional information on both inter-individual and intra-individual change over time. While text mining algorithms have been developed for longitudinal text responses, such as the dynamic topic model (DTM; Blei & Lafferty, 2006), current implementations and formulations have a number of limitations for psychological research, namely that inter- and intra-person differences cannot be distinguished. In addition, these algorithms are not appropriate for handling short text responses collected from open-ended questions. Meanwhile, factor analysis is an appealing approach because not only is it widely used in psychological research, but prior research has also extended factor analysis to non-Gaussian responses. However, in spite of the attractive features and popularity of factor analysis, it is not tailored to handle the sparsity and large p features that are common in textual data. In this talk, we will discuss the formulation of a hierarchical Bayesian approach named dynamic Poisson factor analysis (Zhang et al., 2016) with longitudinal textual data, and follow with an evaluation of its performance in a simulation informed by dynamically assessed suicide responses. The talk will conclude with a discussion of possible future steps.
Injecting Entity Types into Entity-Guided Text Generation Wenhao Y, Meng J Recent successes in deep generative modeling have led to significant advances in natural language generation (NLG). Incorporating entities into neural generation models has demonstrated great improvements by assisting to infer the summary topic and to generate coherent content. To enhance the role of entity in NLG, in this paper, we aim to model the entity type in the decoding phase to generate contextual words accurately. We develop a novel NLG model to produce a target sequence based on a given list of entities. Our model has a multi-step decoder that injects the entity types into the process of entity mention generation. Experiments on two public news datasets demonstrate type injection performs better than existing type embedding concatenation baselines.

Fake drugs are dangerous to human health, can we identify them outside a science lab? Olatunde A, Roseboom N, Cai J, Hayes K, Lieberman M Portable spectrophotometers are in commercial use for detection of substandard and falsified pharmaceuticals (SFPs) but there is little published evidence for how well these devices perform, particularly for the task of detecting substandard products. Here, we spiked pure acetaminophen (AC) with lactose (LA) and/or ascorbic acid (AA) to simulate the presence of adulterants and/or excipients. We collected 100 NIR spectra of each of the three pure compounds and 100 spectra of each of 9 binary and 4 ternary mixtures of these compounds. A range of algorithms was then applied to the combined data set (n = 1200 spectra) to distinguish between the different mixtures. An approach combining PLS, SIMCA, and SVM classification and regression not only grouped the lab-formulated samples correctly, but accurately differentiated five different commercial acetaminophen dosage forms from two major brands. We studied the robustness of the combined approach with 100 spectra generated from blinded samples to ascertain its ability in detection of substandard formulations. The analysis of the confusion matrix showed excellent prediction ability as well as the ability to detect when an API was not in the training set. Our end goal is to integrate NIR with the chemical functional group analysis performed by our already widely accepted paper analytical device; together, these technologies will be a more powerful tool for field screening of pharmaceutical and illicit drugs.
The Future of Work in a Digital and Algorithmic World: Algorithmic Management and Power & Information Asymmetry in the Gig Economy Holahan C, DeStefano R, Cariddi B, Dudrick E, Allison S Our project uses Uber as a case study for how Silicon Valley algorithms change the rules of the world in favor of the company over the user and individuals. Specifically, we discuss how Uber is rewriting the rules of work and employment in the gig economy and demonstrate the problems that we will encounter if we continue to allow technological elitists to decide the future of work. Couched in the terms of a “sharing economy,” Uber confuses disruption with innovation and pretends to be stakeholder oriented while actually operating in a shareholder economy. Citing neutral, transparent algorithms and technological exceptionalism narratives, Uber blurs the line between employment and consumption by treating drivers as consumers of their product. While Uber promises to provide freedom and entrepreneurial work to drivers, the realities of the gig economy fall far short. Uber embeds exploitation into its algorithms, pits community stakeholders against each other, and uses the power of big data to maintain power and information asymmetry from its users. Uber can be viewed as a cautionary tale of the lengths technology companies, funded by cheap venture capital, will go in the name of growth and the vast impacts these organizations can have on societal standards, including the value of work and how we do it. Drawing from pitfalls of Uber, we plan to demonstrate the role of ESG investing in promoting a sustainable future of work and to outline the first principles companies and investors should follow in the age of algorithms.
Generative AI Design and Exploration of Nucleoside Analogs Dablain D, Chawla N, Siwo G Nucleosides are fundamental building blocks of DNA and RNA in all life forms and viruses. In addition, natural nucleosides and their analogs are critical in prebiotic chemistry, innate immunity, signaling, antiviral drug discovery and artificial synthesis of DNA / RNA sequences. Combined with the fact that quantitative structure activity relationships (QSAR) have been widely performed to understand their antiviral activity, nucleoside analogs could be used to benchmark generative chemistry algorithms. Here, we undertake the first generative design of nucleoside analogs using an approach that we refer to as the Conditional Randomized Transformer (CRT). We also benchmark our model against five previously published molecular generative models. We demonstrate that AI-generated molecules include nucleoside analogs that are of significance in a wide range of areas including prebiotic chemistry, antiviral drug discovery and synthesis of oligonucleotides. Our results show that CRT explores distinct molecular spaces and chemical transformations, some of which are similar to those undertaken by nature and medicinal chemists. Finally, we demonstrate the potential application of the CRT model in the generative design of molecules conditioned on Remdesivir and Molnupiravir as well as other nucleoside analogs with in vitro activity against SARS-CoV-2.
GraSeq: Graph and Sequence Fusion Learning for Molecular Property Prediction Guo Z With the recent advancement of deep learning, molecular representation learning (automating the discovery of feature representations of molecular structure) has attracted significant attention from both chemists and machine learning researchers. Deep learning can facilitate a variety of downstream applications, including bio-property prediction, chemical reaction prediction, etc. Despite the fact that current SMILES string or molecular graph molecular representation learning algorithms (via sequence modeling and graph neural networks, respectively) have achieved promising results, there is no work to integrate the capabilities of both approaches in preserving molecular characteristics (e.g., atomic cluster, chemical bond) for further improvement. In this paper, we propose GraSeq, a joint graph and sequence representation learning model for molecular property prediction. Specifically, GraSeq makes a complementary combination of graph neural networks and recurrent neural networks for modeling two types of molecular inputs, respectively. In addition, it is trained by the multitask loss of unsupervised reconstruction and various downstream tasks, using labeled datasets of limited size. In a variety of chemical property prediction tests, we demonstrate that our GraSeq model achieves better performance than state-of-the-art approaches.

Higher-order Networks of Diabetes Comorbidities: Disease Trajectories that Matter Krieg S, Robertson D, Pradhan M, Chawla N Networks are powerful and flexible structures for modeling relationships in medical and biological systems, but in a traditional first-order network representation, an edge typically expresses a relationship between a single pair of nodes. In order to analyze complex relationships between groups of nodes, researchers rely on combined sets of these pairwise connections, which can misrepresent the true relationships in the underlying data. Higher-order networks, on the other hand, capture the higher-order dependencies that go beyond the pairwise interactions, and thus can encode more complex relationships within a familiar structure. In this study, we created and analyzed higher-order networks of disease trajectories generated from the records of 913,475 type 2 diabetes patients. We show that higher-order networks provide a more accurate representation of the underlying disease trajectories than traditional first-order networks. We also analyze differences in PageRank scores and community structure at higher orders and discuss the implications of these differences for the future study of comorbidity networks.
Humanistic Data Science Germino J, Bird A, Kilbane M, Dhaliwal R When searching for patterns within large datasets, data scientists frequently discard outlier data points that do not match within their models. However, when these data sets contain information on human beings, we risk discarding whole communities whose sole representation was that data point. These outliers represent real people whose stories should be told with our models. Our goal is to explore new ways to approach data science in a humanistic manner. We hope to use data science to tell stories for these often underrepresented communities and build models which account for and celebrate the uniqueness of the population instead of clustering people into groups which may not be truly representative of their communities. To achieve this, we have organized an interdisciplinary group of scholars with backgrounds in Computer Science, English, and American Studies. We are excited to explore new techniques which can further our research within each of our respective fields. We see humanistic data science as a field with endless potential applications and view this as a unique opportunity to set new standards for how we should approach research with human data.
Intent Detection in Specialized Domains by Deep Text Clustering Yu M, Tong L Detecting intents in natural language text is useful for many downstream applications such as user modeling, recommendation, and text generation. Classifiers can be trained to perform the task when intent schema and labeled data are given in some well-studied domains like user requests in chat-bot systems. However, the schema and labeled data are often not available in new or specialized domains, when people are curious about the different types of intents (or topics/aspects/strategies) and their examples. For example, identifying different user intents in social media messages would advance our understanding of language of self disclosure and social support for those who are suffering from negative life events; capturing different aspects that researchers used to compare two methods or techniques in published articles would facilitate automated knowledge discovery in the research field. There is great need for clustering (instead of classification) approaches that could potentially distinguish different types of intents in the text. In this work, we discuss how to acquire deep learning-based text representations (instead of a bag of words) that can separate texts reasonably (i.e., likely based on types of intents). Then we compare several clustering algorithms on data from the two specialized domains (i.e., scientific statements and social support for mental health). We develop a toolkit for easy practice, evaluation, and visualization.
Building Literacy Around Police Recruitment in South Bend Through Data Visualization and Data Storytelling Brown T, Zhang H, Qi Y
RecipeRec: Recipe Recommendation with Heterogeneous Graph Learning Tian Y Recipe recommendation systems play an important role in helping people decide what to eat. Existing recipe recommendation systems, however, are typically developed using content-based or collaborative filtering approaches, failing to leverage the higher-order collaborative signal (e.g., relational structure information) among users, recipes and food items. In this paper, we formalize the problem of recipe recommendation with graphs to utilize graph modeling and expressively encode the collaborative signal into recipe recommendation. Specifically, we first present URI-Graph, a new and large-scale user-recipe-ingredient graph to facilitate graph-based food studies and recipe recommendation research. We then propose RecipeRec, a novel heterogeneous graph learning model for recipe recommendation. The proposed model is able to capture both recipe content and the higher-order collaborative signal through a heterogeneous graph neural network with hierarchical attention and an ingredients set transformer. Additionally, we introduce a graph contrastive augmentation strategy to extract informative graph knowledge in a self-supervised manner. Finally, we design a joint objective of recommendation loss and graph contrastive learning loss to optimize the model. Our extensive experiments demonstrate that RecipeRec outperforms state-of-the-art baselines for recipe recommendation.
Machine learning and transfer learning study of gas permeability for polymer membranes Xu J, Luo T High-performance polymer membranes for gas separations have achieved remarkable success in the energy and environmental industries. Machine learning techniques hold great promise to accelerate the discovery and development of innovative polymer membrane materials, yet progress is obstructed by the lack of sufficient training data. Using an open-source polymer gas permeability database for six major industrial gases, we demonstrate the successful application of a transfer learning (TL) technique to improve the performance of neural network models (for one case, R² improved by 19% and MSE decreased by 50%) even when only limited target data are available. Further gas permeability property prediction is also conducted using the TL-improved models on over 12,000 polymer candidates, representing an advanced way to explore the unknown chemical space for high-performance gas separation polymer membrane design.
Food Information Networks (FINs): The Visual Representation of Food Information for Healthy Dietary Choices Szymanski A, Wimer B, Metoyer R Both in-store and online grocery shopping pose the challenge of making informed choices when purchasing foods that provide a well-balanced and nutritious diet. These challenges are further exacerbated for people living in rural areas, food deserts, or poverty, where physical full-service supermarkets are less available and online shopping is most needed. Currently, consumers are not well supported in making healthy nutritional decisions online and face challenges of finding products, navigating through categories, comparing prices, and purchasing nutritional foods, all with a limited and sometimes hidden set of information. The objectives of this study are twofold: first, to support the information seeking process by incorporating existing visual narrative techniques into online grocery shopping to help shoppers make nutritional food choices, and second, to develop a new visualization designed to support product comparisons that will make diet-based decisions more educational for consumers. We will utilize eye tracking techniques to understand how new visual food presentations change the information seeking and engagement of consumers compared to traditional online food product displays. We will also create synthesized tasks to determine whether our new visualizations allow participants to comprehend nutrition information more quickly and accurately. The overall aim of our work will be to provide new designs for the presentation of food information so that informed nutritional choices may be made.
Developing an App for Facilitating Communication Between HIMFG and Cancer Outpatients Gentry W, Wani K, Medina E, Soga P, Patterson C One of Hospital Infantil de México Federico Gómez’s (HIMFG) responsibilities is treating children with cancer, many of whom are sent home after treatment to recover. The hospital must stay in touch with these children to ensure that their health does not worsen, in which case they may need to be readmitted for further care. Currently, HIMFG has no formal way of communicating with cancer outpatients, and so the goal of this project is to provide them with a mobile application with which (1) doctors and professionals at HIMFG can interact with outpatients and vice versa and (2) outpatients can be given an additional health resource when a doctor is not physically available. Accompanying this app is a companion tool for HIMFG doctors and social workers to enter data for each child being treated, including both clinical data, like height and weight, and socioeconomic data, such as place of residency and financial stability. This provides HIMFG with a unified digital record of each child’s health information that the app can draw on, making it useful for doctors and patients alike. This project will help smooth communication between the hospital and its outpatients and provide outpatients with the knowledge, assistance, and guidance needed to monitor their health condition. It will also provide HIMFG with a more unified electronic method of recording patient data.
Modeling a Pandemic (Covid-19) Lee, Yu Min As COVID-19 has affected everyone in various ways throughout the past year, we’ve recognized that there is not a unified front on battling the pandemic. The media has a great influence on shaping public opinion. While it is important to be informed on major issues such as COVID-19, in some cases the population is receiving types of misinformation that can lead to potential harm to themselves and others. We hope to achieve a better understanding of the dissemination of information about COVID-19, which includes the societal challenge of how people perceive vaccination. To best leverage the data we have, we will first perform a text analysis on the Twitter data to determine the sentiments of the Tweets, specifically pro-vax or anti-vax. Because some Tweets include location data, we can then correlate these sentiments with information on political parties and geographic affiliation. Then, we will use historical data to see which media outlets different demographics are more inclined toward, and use that to assess vaccination rates. A predictive model may be developed in order to leverage different types of data collected that reflect the local trajectory of COVID-19. For example, the model might represent case rates, virus rate predictions, and vaccination rates or disparities in different local areas. We would also like to examine how people’s ideas were framed by media discourse, and how that framing has correlated with the spread of the virus and vaccination success. Data science methods used for this project may include predictive statistics such as linear regression and data visualization, and descriptive statistics may be utilized to further illustrate the outcomes of the virus as well.
PEANUT: An Intelligent Human-AI Collaborative Tool for Annotating Audio-Visual Data Zhang Z, Zheng N, Li T Audio-visual learning is an emerging important problem in machine learning that seeks to enhance the computer's multi-modal perception leveraging the correlation between the auditory and visual modalities. Despite their many useful downstream tasks in domains such as video retrieval, AR/VR, and accessibility, the performance and the wide adoption of existing audio-visual models have been impeded by the availability of high-quality datasets across different domains, as annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we design and develop PEANUT, a human-AI collaborative audio-visual annotation tool that seeks to make this process more efficient using several novel mixed-initiative partial-automation strategies. PEANUT leverages state-of-art single-modal audio
Differentially Private Metropolis-Hastings Sampling with Auxiliary Variable Su B., Liu F. Markov Chain Monte Carlo (MCMC) is a powerful tool to sample from complex distributions. Due to the inherent randomness of the MCMC sampling, some privacy guarantees can be achieved. In the Bayesian framework, there exists work on the achieved upper bound for the privacy loss by posterior sampling in the setting of differential privacy (DP). In the implementation of the algorithms, the actual privacy loss can be adjusted indirectly through the subsampling ratio in the MCMC iterations or by changing the temperature of the target distribution. We propose a new differentially private Metropolis-Hastings algorithm by introducing auxiliary variables to help achieve DP explicitly. Compared to the approaches leveraging the inherent randomness to achieve DP, our approach can deliver privacy guarantees of different levels through tuning the parameters associated with the distribution of the auxiliary variables. Our empirical studies suggest the differentially private posterior distributions generated by our proposed method are closer to the true posterior distributions at a given privacy budget in various settings compared to the existing methods for differentially private posterior sampling.
Sentence-Permuted Paragraph Generation (PermGen) Zhihan Z, Wenhao Y Generating paragraphs of diverse contents is important in many applications. Existing generation models produce similar contents from homogenized contexts due to the fixed left-to-right sentence order. Our idea is to permute the sentence order to improve the content diversity of multi-sentence paragraphs. We propose a novel framework, PermGen, whose objective is to maximize the expected log-likelihood of output paragraph distributions with respect to all possible sentence orders. PermGen uses hierarchical positional embedding and designs new procedures for training, decoding, and candidate ranking in the sentence-permuted generation. Experiments on three paragraph generation benchmarks demonstrate that PermGen generates more diverse outputs with higher quality than existing models.
Predicting Adverse Outcomes of FN in Pediatric Cancer Patients

Schnur J, García-Martínez A, Chawla N.

Fever and neutropenia (FN) can be a life-threatening complication in pediatric cancer patients undergoing chemotherapy treatment. In certain patients, this condition may progress to adverse infection outcomes such as septic shock or bacteremia, which in turn increase mortality risk. To develop optimal treatment strategies, an appropriate first step is to classify children as low- or high-risk for these adverse events, such that low-risk children may receive treatment in the home setting, while the acutely ill may receive closer attention in the hospital. In this study, we perform logistic regression as a baseline technique for predicting risk of adverse infection outcomes using evidence-based clinical predictors. We then begin to explore future prospects for interpretable and flexible risk modeling using symbolic regression, a method that learns both the structure and parameters of the model from the data, unlike traditional regression techniques that predetermine model structure. Through this analysis, we uncover important relationships between the clinical features and health outcomes of interest that will assist in process improvement for this critical population.
Comparison of Methods for Imputing Social Network Data Xu Z, Hai J, Yang Y, Zhang Z Social network analysis could aid the improvement of social environment by providing insights on how human relationships inform social behaviors, interactions, and conflicts, which can further reveal economic, political, and cultural issues. In practice, social network data often contain missing data because of the sensitive nature of information collected. As a response, network imputation methods including simple ones constructed from network structural characteristics and more complicated model-based ones have been developed, but few have thoroughly examined them. The current study aimed to evaluate seven network imputation techniques (i.e., reconstruction, preferential attachment, constrained random dot product graph, Bayesian exponential random graph models or Bayesian ERGMs, k-nearest neighbors, random forest, multiple imputation by chained equations) through simulation. A factorial design for missing data conditions was adopted with factors including missing data types (i.e., actor non-response, tie non-response, mixed), missing data mechanisms (i.e., MCAR, MAR, MNAR), and missing data proportions (i.e., 10%, 20%, 30%), which were applied to 100 replications of social networks with a varying number of nodes generated through 8 sets of different coefficients in ERGMs. Results showed that imputation methods’ effectiveness in recovering network statistics mainly differed by missing data types and mechanisms while the effectiveness in recovering ERGM model coefficients mainly differed by which network statistics the coefficients were estimated on. Overall, reconstruction, constrained random dot product graph, and multiple imputation by chained equations had the best performance across conditions. Future research could identify the network structures of interest and potential missing mechanisms before selecting appropriate imputation methods.

Recent improvements to field device for detecting illicit drugs increase user-friendliness and potential for field applications

Hayes K, Whitehead H, Sweet C, Lieberman M

Interest in presumptive field tests for illicit drug detection has increased in recent years as deaths from opioid overdoses have risen. Previously, a microfluidic paper analytical device for detecting illicit drugs (idPAD) was described that sensitively and specifically detects cocaine, crack cocaine, heroin, and methamphetamine by using a library of 12 colorimetric lane tests which detect functional groups found in illicit drugs and their cutting agents. Each drug elicits a unique color “barcode” which users match to standard barcodes for sample identification. Since its original publication, we have redesigned the idPAD to increase field-friendliness and deployability. One important modification is the introduction of a sample collection area. This collection area is outside of the colorimetric lane tests and protected from water exposure, so the sample can be stored and saved for downstream applications such as LC-MS/MS analysis. This collection area allows for drug samples to be transported more conveniently and safely than before. Average recovery from the collection area is 34% of the applied drug, and the sample is stable in storage for up to 6 weeks. Analysis of idPADs has historically relied on users to match the color barcode with its corresponding drug. To eliminate the need for human judgement and increase efficiency, we are training a neural network to read barcodes and report drug identity. Another potential application of the idPAD, detecting hormones used for birth control, abortions, and hormone replacement therapy, is being explored as a screening tool for social justice groups interested in ensuring quality of dosage forms in settings where access to such drugs is limited.

See Beneath the Skin -- Machine Learning for Optical Spectroscopy Lan Q DRS (diffuse reflectance spectroscopy) is a technique for inferring properties of biological tissues using optical illumination and detection. The light feedback being measured is a quantity called diffuse reflectance. Because different tissue components feature different optical properties, optical properties can then be translated to biological properties. For example, by analyzing the diffuse reflectance from a brain tissue, we’re able to infer the optical properties and, from them, the existence of tumor tissue. Blood oxygenation detection is another typical application of DRS. Deoxygenated hemoglobin and oxygenated hemoglobin result in different light absorption and scattering in blood and therefore affect the diffuse reflectance when illuminated. The current method of solving the problem relies on an inverse lookup table generated by Monte Carlo simulations, a computational method that uses random samples. The table contains many combinations of optical properties with the corresponding diffuse reflectance.
Support Students to Develop Effective Learning Behaviors Through the Power of Data Duan X In recent years, more and more educational settings have begun to adopt the latest advancement in predictive modeling, machine learning, artificial intelligence, and visual analytics to understand and optimize learning and the environments in which it occurs. For example, many institutions have implemented data-driven intervention systems with the hope to provide at-risk or underachieving students just-in-time and personalized support. Although these types of data-driven supporting systems are becoming increasingly available, most of them primarily focus on at-risk students. How to scale them to serve and motivate all students to adopt more effective learning behavior/strategies is an important research topic that still remains under-explored. To bridge that gap, this study presents the development process and evaluation outcomes of a data-driven supporting system that incorporates predictive modeling and visualization analytics. This system aims to inform students about their learning behaviors/strategies, motivate them to adopt the more effective behaviors/strategies demonstrated by the top students, and therefore improve their learning outcomes. The study is guided by these research questions: 1. What learning features can accurately predict learning outcomes? 2. How to convert the identified important learning features to intuitive and interactive visualization feedback for students delivered through this supporting system? 3. What sentiment, behavior, and achievement impacts does this supporting system have on students?
Testing the Pulse: Alternative Methods to Measure Student Well-Being Gordley J, O'Brien C, Heneghan J, Geiler M, Ramos J Our project aims to identify a potential ethically sound framework for universities to more effectively gauge student well-being using publicly available resources such as social media data, in addition to the typical survey methods used to identify student pain points during the semester.
The Cultural Capital of Political Incivility:
Do Jerks Join Congress or Does Joining Congress Turn People Into Jerks? 
Dudley J Increasing political incivility in the United States Congress has devastating consequences for democracy. While research finds Americans disapprove of political incivility, it also finds that Americans support politicians who appear to stand up to political opponents, even in uncivil ways. By distinguishing between valued and unvalued forms of incivility, I will resolve the paradox between research that shows that Americans disapprove of incivility while incivility in politics increases. My dissertation will investigate three potential sources of incivility: the pool of potential candidates, voter preference, and influence on new members of Congress. This poster focuses on the third source - influence on new members - by leveraging semi-supervised machine learning and automated text analysis of Congressional transcripts to determine if tenure in Congress predicts the use of incivility over time.
Variational Inference with NoFAS: Normalizing Flow with 
Adaptive Surrogate for Computationally Expensive Models
Yu W, Fang L, Daniele S Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as a neural network. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternately updates the normalizing flow parameters and the weights of a neural network surrogate model. We also propose an efficient sample weighting scheme for surrogate model training that ensures some global accuracy of the surrogate while capturing the likely regions of the parameters that yield the observed data. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at
What’s All The Yak about Yik Yak?: 
What a Digital Ethnography of Yik Yak Reveals About Power Dynamics of Race, Gender, Class, and Sexuality in Online Spaces at Notre Dame
Oppenlander H If you've heard about the recent derogatory posts towards St. Mary's students on Yik Yak and wondered, "Why does that sound familiar?", it’s probably because the app plagued campus communities years ago. Before it was taken down in 2017 and then revived earlier this year, Yik Yak, which allows users within 5 miles of each other to share posts anonymously, was criticized for the threatening, racist, and sexist posts found on the platform, especially around college campuses. Inspired by an assignment for my Black Ethnographers course, I have been conducting a digital ethnography of Yik Yak at Notre Dame. I've observed how anonymity empowers students to share racist, sexist, and homophobic thoughts without repercussions to their reputations. Yik Yak's "upvote" and "downvote" system makes these types of posts rise to the top based on Notre Dame's demographics, which must be contextualized knowing that Notre Dame originally only admitted white men. On my poster, I will include screenshots of Yik Yak posts that touch on insider/outsider groups on campus, youth and social media culture, geography, and the role of language like “smick” in online discourse. Since one of the research focuses of the Lucy Institute is "Social and Information Systems," it is essential that we bring discussions of data science, digital technology, and the humanities together to evaluate how existing hierarchies manifest themselves in online social spaces. We must use the resources we have to contemplate what online spaces we are allowing to exist on our campus.

Call for Student Posters (Enter by Oct 20 - DEADLINE IS PAST)

  • We welcome poster competition entries from undergraduate and graduate students in all disciplines.
  • Posters may be submitted by individual students or groups of co-authors.

Posters that feature project concepts, new or continuing research in data science, AI, data engineering, computing, applications, and methods that amplify societal expertise in areas of human development, peace accord, ethics, global development, health disparities, and poverty are particularly welcome. 

Important Dates:

  • Sep 28 Student Registration Opens

  • Oct 20  Student Registration Closes & Student Poster Competition Entry Closes

  • Oct 21  All Entrants' Posters must be submitted to Lucy Family Institute for printing by 3 PM 

  • Oct 22  Student Poster Competition Entries Announced & Abstracts posted

  • Oct 27  Lucy Family Institute Fall Symposium. Please arrive a little before noon to find your poster and pick up your badge.

To enter the student poster competition with a single-author poster, include on your symposium registration your poster's title, a brief abstract describing your poster theme in 250 words or fewer, and your name as you'd like it to appear on the poster competition voting system the day of the event.

All poster competition entrants are instructed to use the event's provided poster template and must submit their posters for printing by 3PM Oct 21st.

We are facilitating poster printing at no cost to students to ensure ease of participation.

Group Entries in the poster competition: For group entries, one student, acting as corresponding author, should submit on their symposium registration the poster's title, a brief abstract, and a list of all the group's author names as you'd like them to appear on the poster competition voting system the day of the event. If you are part of a group submitting a poster to the competition, please note that everyone in the group who is planning to attend the symposium (regardless of whether they are the corresponding author) should register for the symposium and indicate their meal choices.

Still have Questions about Group registration? We want to hear from you, please contact  or call 574-631-7095 for assistance entering the poster competition. 

Accommodations: We want the event to be engaging for all. We’ve anticipated certain accommodations to enable participation (diet, allergy, interpreter, lactation room) but we know special circumstances may arise. Please contact Lucy Institute if you have any questions about student registration, the poster competition, or to ensure we can accommodate your needs the day of the event. Phone 574-631-7095