Foundations: Algorithms, Methods, and Systems
Accuracy and Fairness in Face Recognition
Current state-of-the-art face matchers vary in accuracy across demographic groups. For instance, women have both a higher false non-match rate and a higher false match rate than men. Why does this occur? What can be done about it? Lines of attack on this problem include analyzing the training data and training process for the face matcher, analyzing the variation in image acquisition between groups, and analyzing the inherent variation in facial appearance between groups.
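As an illustration of the metrics involved, the sketch below computes the false match rate (FMR) and false non-match rate (FNMR) per demographic group from comparison scores. The score convention (higher = more similar), the threshold, and the group labels are all hypothetical; this is not the matcher itself.

```python
def error_rates(scores, threshold):
    """Compute (FMR, FNMR) from a list of (score, same_person) outcomes.

    FMR: fraction of impostor comparisons accepted (score >= threshold).
    FNMR: fraction of genuine comparisons rejected (score < threshold).
    """
    impostor = [s for s, same in scores if not same]
    genuine = [s for s, same in scores if same]
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr

def per_group_rates(scores_by_group, threshold):
    """Evaluate error rates separately for each demographic group."""
    return {g: error_rates(s, threshold) for g, s in scores_by_group.items()}
```

Comparing the per-group dictionaries at a fixed threshold is what reveals the demographic differentials described above.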
Centrality Scaling in Large Networks
"Betweenness" centrality lies at the core of both transport and structural vulnerability properties of complex networks. However, it is computationally costly, and measuring it for networks with millions of nodes is nearly impossible. By introducing a multiscale decomposition of shortest paths, Lucy Family Institute researchers show that the contributions to betweenness coming from geodesics no longer than L obey a characteristic scaling with L, which can be used to predict the distribution of the full centralities. The method is also illustrated on a real-world social network of 5.5 million nodes and 27 million links.
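The length-restricted quantity at the heart of this idea can be sketched with a Brandes-style accumulation truncated at geodesic length L. The pure-Python code below is an illustration for small unweighted graphs, not the Institute's multiscale method itself:

```python
from collections import deque

def betweenness_up_to_L(adj, L):
    """Betweenness counting only geodesics of length <= L.

    adj: dict node -> list of neighbours (unweighted, undirected).
    Each unordered pair is counted from both endpoints; halve if desired.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, truncated at depth L.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            if dist[v] == L:            # do not extend geodesics beyond L
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order (Brandes, 2001).
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

Sweeping L and fitting the resulting contributions is the step where the characteristic scaling described above comes in.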
Detection of Manipulated Media
Researchers at the Lucy Family Institute are studying how to detect whether an image or video has been manipulated to change its content. Does an image contain an original face, or has it been morphed? Is this really a video of a person saying those words, or is it a deepfake created to make the person look bad? Lines of attack on this problem include analyzing the content of the image, the statistical properties of its pixel values, and its metadata.
Distributed Decision Making
Researchers at the CNDS use both data-driven and model-driven approaches, with applications to the emerging smart and integrated architecture of societal infrastructure such as transportation networks, power grid, and water distribution networks. Center topics of interest include developing learning approaches suitable for control, compositional control for large scale systems, cyber-physical security and privacy, and incentive design in distributed estimation and control.
Graph Grammars
The relationship between graph theory and formal language theory allows a Hyperedge Replacement Grammar (HRG) to be extracted from any graph without loss of information. Like a context-free grammar, but for graphs, the extracted HRG contains the precise building blocks of the network as well as the instructions by which these building blocks ought to be pieced together. Because of the principled way it is constructed, the HRG can even be used to regenerate an isomorphic copy of the original graph. By marrying the fields of graph theory and formal language theory, lessons from the previous 50 years of study in formal language theory, grammars, and much of theoretical computer science can now be applied to graph mining and network science. This research takes the first steps towards reconciling these disparate fields by asking incisive questions about the extraction, inference, and analysis of network patterns in a mathematically elegant and principled way.
Higher Order Networks
Network-based representation has quickly emerged as the norm for capturing rich interactions in complex systems. For example, given the trajectories of ships, a global shipping network can be constructed by assigning port-to-port traffic as edge weights. However, the conventional first-order (Markov property) networks thus built capture only pairwise shipping traffic between ports, disregarding the fact that ship movements can depend on multiple previous steps. This loss of information when representing raw data as networks can lead to inaccurate results in downstream network analyses. CNDS researchers have developed the Higher-order Network (HON), which bridges the gap between big data and the network representation by embedding higher-order dependencies in the network.
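A fixed second-order version of this idea can be sketched in a few lines: each higher-order node encodes the current port together with the previous one, so edge weights no longer conflate traffic that arrived from different origins. (HON itself chooses the dependency order adaptively per node; this sketch fixes it at two, and the trajectories are hypothetical.)

```python
from collections import Counter

def second_order_edges(trajectories):
    """Build weighted edges of a fixed second-order network.

    Each node is a (previous_port, current_port) pair, so outgoing
    traffic from a port is kept separate by where the ship came from.
    """
    edges = Counter()
    for traj in trajectories:
        for i in range(2, len(traj)):
            src = (traj[i - 2], traj[i - 1])  # at traj[i-1], arrived from traj[i-2]
            dst = (traj[i - 1], traj[i])
            edges[(src, dst)] += 1
    return edges
```

In a first-order network both example trajectories below would merge into the same node for port B; the second-order nodes keep them apart, preserving the dependency on the previous step.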
Human-Computer Interaction
We study data analysis and general computing problems from a human-computer interaction perspective. The CNDS identifies the needs of researchers and then designs, builds, and evaluates interfaces that support research teams directly, rather than forcing researchers to adapt to the machine and existing approaches. The center is especially interested in complex data analysis problems where there are opportunities to take advantage of the human perceptual system to help people make sense of their data through visual analysis.
Learning from Multi-modal Data
Multi-modal data presents a unique set of challenges, given the differing types and cadences of its streams, as well as challenges that stem from "missingness" and noise. For example, the proliferation of personal tracking devices allows for the continuous and pervasive collection of an individual’s behavioral and physiological data. These sensed multi-modal streams are often chronologically ordered (e.g., time series of heart rate measurements), which we refer to as temporal multi-modal sensory data. Such data, drawn from multiple sensor sources, can benefit a wide spectrum of applications, such as personality inference, health and wellness assessment, and demographics prediction. CNDS researchers are developing algorithms to address these challenges across this spectrum of applications.
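One of the simplest ways to handle differing cadences and missingness is to resample every stream onto a common time grid, carrying the last observation forward. The sketch below illustrates that baseline strategy only (not the CNDS algorithms themselves), with hypothetical stream names:

```python
def align_streams(streams, grid):
    """Resample irregular (timestamp, value) streams onto a common grid.

    Uses last-observation-carried-forward; None marks 'nothing observed yet'.
    streams: dict name -> list of (timestamp, value) pairs.
    grid: sorted list of timestamps to align onto.
    """
    aligned = {}
    for name, series in streams.items():
        series = sorted(series)
        out, i, last = [], 0, None
        for t in grid:
            # consume every observation at or before grid time t
            while i < len(series) and series[i][0] <= t:
                last = series[i][1]
                i += 1
            out.append(last)
        aligned[name] = out
    return aligned
```

After alignment, every stream has the same length and index, which makes the downstream learning problems (and the remaining missingness) explicit.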
Learning from Imbalanced Data
CNDS researchers use the Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm, which is considered the de facto standard in the framework of learning from imbalanced data, owing to the simplicity of its procedure and its robustness when applied to different types of problems. SMOTE has proven successful in a variety of applications from several different domains. It has inspired numerous approaches to countering class imbalance and has contributed significantly to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of software packages, from open source to commercial.
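The core of SMOTE fits in a few lines: each synthetic minority sample is an interpolation between a real minority point and one of its k nearest minority neighbours. The sketch below is a minimal pure-Python illustration, not a substitute for the reference implementations found in packages such as imbalanced-learn:

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from minority-class points.

    minority: list of equal-length tuples of floats.
    Each synthetic point lies on the segment between a random minority
    point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the minority class's convex hull rather than duplicating existing points.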
Longitudinal Analysis and Modeling of Large-Scale Social Networks
The growth in information technology systems is generating new sources of data on human behavior that are only now beginning to be analyzed. Digital communications systems log communication events and therefore contain valuable information on usage patterns that can be used to map social networks and analyze human behaviors within them. The availability of such data on millions of individuals has the potential to transform the way people analyze and understand human behavior. The data generated by digital communication technologies have five key traits that can transform the way researchers study social networks:
- high-quality statistics, because the data come from millions of users
- purely observational, thanks to unobtrusive measurement
- complete network data
- longitudinal data spanning several years
- spatial information, because the data are geographically located
Data of such extent and longitudinal character brings with it novel challenges that can only be tackled by a well-orchestrated multidisciplinary approach involving network social science, physics methods developed for large-scale interacting particle systems, mathematical statistics and data analysis, and computer science methods of data mining, community detection, and agent-based modeling.
Natural Language Processing and Knowledge Graphs
Knowledge graphs (KGs) serve as useful resources for natural language processing applications. Previous KG completion approaches required a large number of training instances (i.e., head-tail entity pairs) for every relation, yet in reality very few entity pairs are available for most relations. Few-shot KG completion has not been well studied: existing one-shot learning work generalizes poorly to few-shot scenarios and does not fully use the available supervisory information. In this work, CNDS researchers are developing a novel few-shot relation learning model (FSRL) that aims to discover facts of new relations from few-shot references. FSRL can capture knowledge from heterogeneous graph structure, aggregate representations of few-shot references, and match candidate entity pairs against the reference set for every relation. Extensive experiments on two public datasets demonstrate that FSRL outperforms the current state of the art.
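The matching idea behind few-shot relation learning can be illustrated with a deliberately simplified sketch: represent each entity pair as a vector, average the few reference pairs into a prototype, and rank candidate pairs by cosine similarity to it. (FSRL itself learns a heterogeneous-neighbour encoder and an aggregation network; the embeddings and entity names here are hypothetical.)

```python
import math

def pair_vec(emb, head, tail):
    """Represent an entity pair by concatenating the two entity embeddings
    (a simplification of FSRL's learned pair encoder)."""
    return emb[head] + emb[tail]

def score_candidates(emb, references, candidates):
    """Average the few-shot reference pairs into a prototype, then rank
    candidate (head, tail) pairs by cosine similarity to it."""
    refs = [pair_vec(emb, h, t) for h, t in references]
    proto = [sum(col) / len(refs) for col in zip(*refs)]

    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    return sorted(candidates,
                  key=lambda ht: cos(proto, pair_vec(emb, *ht)),
                  reverse=True)
```

Even with a single reference pair, candidates that resemble it rank first, which is the essence of discovering facts of a new relation from few-shot references.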
Open-set Presentation Attack Detection for Iris Recognition
This research develops approaches to detecting spoofing attacks that still work on the first instance of a particular attack being seen. For example, textured contact lenses can be used to obscure true iris texture. It is easy to detect any particular brand of textured lenses if samples of them appear in the training data. But what happens when the test data contains a brand of lenses for which there are no samples in the training data? Lines of attack on this problem include texture analysis, feature space analysis, 2D/3D shape inference, and fusion of complementary classifiers.
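A textbook open-set baseline, included here only for illustration, is one-class modeling: fit a statistical model of bona fide (live) iris features alone and flag anything far from it, so no training samples of the unseen lens brand are required. The feature vectors below are hypothetical, and a real system would use richer texture features and a tuned decision threshold:

```python
import math

def fit_bona_fide(features):
    """Fit per-dimension mean and standard deviation on bona fide samples
    only; attack samples are never needed during training."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    std = [math.sqrt(sum((f[j] - mean[j]) ** 2 for f in features) / n) or 1.0
           for j in range(d)]
    return mean, std

def anomaly_score(model, x):
    """Distance of x from the bona fide model in standard-deviation units;
    large values suggest a presentation attack, known brand or not."""
    mean, std = model
    return math.sqrt(sum(((xi - m) / s) ** 2
                         for xi, m, s in zip(x, mean, std)))
```

Because the decision boundary is defined entirely by the live class, a never-before-seen lens brand can still score as anomalous.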
Representation Learning on Heterogeneous Networks
Representation learning in heterogeneous graphs aims to produce a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, and node classification. This task, however, is challenging not only because of the need to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also because of the need to consider the heterogeneous attributes or contents (e.g., text or images) associated with each node. Although substantial effort has been devoted to homogeneous and heterogeneous graph embedding, attributed graph embedding, and graph neural networks, few of these methods can jointly and effectively consider heterogeneous structural information together with the heterogeneous content information of each node. This research has made significant advances in developing representation learning methods for heterogeneous graphs that tackle content as well as heterogeneity of nodes and edges.
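As a toy illustration of type-aware aggregation (a single hand-rolled layer, standing in for the learned encoders used in the research), one can average neighbour content features separately per node type and concatenate them with the node's own features. The node types and feature values below are hypothetical:

```python
def hetero_aggregate(node_type, features, adj):
    """One type-aware aggregation step for a heterogeneous graph.

    node_type: dict node -> type label (e.g. 'paper', 'author').
    features:  dict node -> list of floats (content features).
    adj:       dict node -> list of neighbour nodes.
    Returns each node's own features concatenated with, for each node
    type, the mean features of its neighbours of that type (zeros if none).
    """
    types = sorted(set(node_type.values()))
    d = len(next(iter(features.values())))
    out = {}
    for v in features:
        parts = list(features[v])
        for t in types:
            nbrs = [features[u] for u in adj.get(v, []) if node_type[u] == t]
            if nbrs:
                parts += [sum(col) / len(nbrs) for col in zip(*nbrs)]
            else:
                parts += [0.0] * d  # no neighbours of this type
        out[v] = parts
    return out
```

Keeping each neighbour type in its own slot is what lets the representation preserve the heterogeneity that a type-blind average would wash out.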