Ongoing Research Projects

Dynamic Information Retrieval

Dynamic Search
For details, please visit the project page.

Oftentimes, complex information needs require more than one query in a search session to adequately satisfy the user's search task. Given a series of queries that a user enters in the same session, how do the earlier queries, returned search results, and click information interact with the user and the targeted search goal? This research investigates techniques in query formulation, query expansion, user interactions, and relevance feedback to gain an in-depth understanding of user behaviors in search sessions and to better model search activities with complex information needs.

This research participated in the TREC 2012 Session Track evaluation and won 2nd place in whole-session search (RL2-RL4). It was published in SIGIR 2013.
Dynamic Evaluation (TREC Dynamic Domain Track)
For details, please visit TREC Dynamic Domain Track Website

We propose a new track focused on domain-specific search tasks in which professional searchers explore complex content spread across a corpus. To help such users, we need retrieval algorithms that can dynamically adjust as the user makes sense of the entities and relationships mentioned in the corpus. While TREC hosts evaluations in several domains, e.g. TREC Medical and TREC Legal, we propose to create domain-agnostic evaluation protocols for studying retrieval systems that "hang in there" and evolve along with the user's own understanding.

Dynamic Recommendation
Based on a given context, including city, date, time, season, and a user's personal interest profile, a contextual suggestion system recommends places to go, to eat, and to have fun. The research emphasizes designing and implementing effective and efficient systems that search the open Web without restriction and intelligently identify and merge interesting results based on a comprehensive understanding of user profiles and contexts. Novel retrieval techniques and machine learning algorithms are designed to tackle the challenges presented in this new area.

Dynamic Information Organization
A concept hierarchy is a set of concepts organized into groups and subgroups based on the relations between those concepts. Most hierarchies, such as the Yahoo! Directory and the Library of Congress Subject Headings, are large and complex. Some situations, however, call for light-weight concept hierarchies that are user-specific and task-specific. For example, a user who wants to gather information from online search engines to plan a multi-day family trip to Disneyland would like to quickly sort through large amounts of relevant material to make decisions.

This research examines concept hierarchy construction, a mechanism to create dynamic and personalized concept hierarchies for Web search.
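One classic way to derive such light-weight hierarchies from a set of search results is the subsumption heuristic: term x is treated as a parent of term y when x appears in most documents that contain y, but not vice versa. The sketch below is only an illustration of this general heuristic under made-up data, not this project's actual construction method.

```python
def subsumption_pairs(docs, threshold=0.8):
    """Find candidate parent->child concept relations via term co-occurrence.

    Term x is taken to subsume term y when x appears in at least `threshold`
    of the documents containing y (P(x|y) >= threshold) while y appears in a
    smaller fraction of the documents containing x. Illustrative sketch only.
    """
    terms = set()
    for d in docs:
        terms |= set(d)

    def df(*ts):
        # document frequency: number of docs containing all terms in ts
        return sum(all(t in d for t in ts) for d in docs)

    pairs = []
    for x in terms:
        for y in terms:
            if x == y or df(x) == 0 or df(y) == 0:
                continue
            p_x_given_y = df(x, y) / df(y)
            p_y_given_x = df(x, y) / df(x)
            if p_x_given_y >= threshold and p_y_given_x < p_x_given_y:
                pairs.append((x, y))  # x is the broader concept
    return pairs

# Toy search results for a trip-planning query (hypothetical data):
# "disneyland" is broader than the specific topics mentioned alongside it.
docs = [
    {"disneyland", "tickets"},
    {"disneyland", "hotels"},
    {"disneyland", "tickets", "rides"},
    {"disneyland"},
]
parents = subsumption_pairs(docs)
```

A pair such as ("disneyland", "tickets") indicates that "tickets" would hang under "disneyland" in the resulting hierarchy.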

Privacy-preserving Information Retrieval

Privacy-Preserving IR Workshop
Information retrieval (IR) and information privacy/security are two fast-growing computer science disciplines with many synergies and connections, yet there have been very limited efforts to bridge them. Meanwhile, because mature privacy-preserving IR techniques are lacking, concerns about information privacy and security have become serious obstacles that prevent valuable user data from being used in IR research, such as studies on query logs, social media, tweets, sessions, and medical record retrieval. This privacy-preserving IR workshop aims to spur research that brings together the fields of IR and privacy/security, and research that mitigates privacy threats in information retrieval by constructing novel algorithms and tools that enable Web users to better understand the associated privacy risks.

We organized the first Privacy-Preserving IR workshop, PIR 2014, co-located with SIGIR 2014 in Gold Coast, Australia.
Online Information Exposure Detection

This research investigates effective natural language processing technologies to quickly identify components and/or attributes of a user's online public profile that may reduce the user's privacy. By identifying the potential risks of sharing information on social media, the system facilitates a better understanding of, and warns users about, their vulnerability on the Web.

This is a joint project with Dr. Lisa Singh (PI) and Dr. Micah Sherr (Co-PI). The research is sponsored by the National Science Foundation.
More Research Projects

Personal Ontology Learning, Carnegie Mellon University (Aug 2006-Dec 2011)

Since ancient times, people have organized information into ontologies, also known as concept hierarchies or taxonomies. Ontologies are often detailed, task-independent, user-independent, long-lifespan data models, such as MeSH and WordNet, that represent and standardize sets of concepts and the relations among them. However, some situations require light-weight, task-specific, and user-specific ontologies with a short lifespan, which we call personal ontologies. For example, in lawsuits and regulatory reforms, lawyers or government employees must quickly organize large amounts of material into task-specific concept hierarchies that will later be discarded. Sophisticated ontologies in these situations may be unnecessary or may even create information overload. This project examines personal ontology learning. It focuses on creating light-weight personal ontologies that allow users to quickly understand the range of issues raised, and to "drill down" into documents that discuss a specific topic.

This work was published in EMNLP 2012, ACL 2012 Workshop, JCDL 2012 Workshop, IEEE Intelligent Systems 2009, ACL 2009, SIGIR 2009, DG.O 2008, and CIKM 2008 Workshop in Ontology Learning.

Search Engine Training and Evaluation, Microsoft Research / Bing (2009)

The accuracy of a learned model depends on both the quality of the training labels and the number of training examples. As expected, the higher the quality of the labels and the larger the number of examples, the better the accuracy of the learned model. I proposed a new method to improve data quality and search engine accuracy (Yang et al., SIGIR'10). My work explores whether, when, and for which data points one should obtain multiple expert training labels, as well as what to do with the labels once they have been obtained. Collecting multiple overlapping labels only for the subset of training samples that has already been labeled relevant is far more effective than blindly labeling all training samples: this scheme yields higher-quality labels and improves the accuracy of several learning-to-rank models by gathering more opinions from different judges on exactly the samples that need to be noise-free. The proposed labeling scheme is currently employed by the Bing search engine.
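The selective relabeling idea can be sketched as follows. This is a minimal illustration, not the Bing implementation; the `get_extra_labels` interface, standing in for a pool of independent expert judges, is hypothetical.

```python
from collections import Counter

def selective_relabel(initial_labels, get_extra_labels, k=3):
    """Re-label only the samples initially judged relevant.

    initial_labels: dict mapping sample id -> 0 (non-relevant) or 1 (relevant).
    get_extra_labels: callable(sample_id, k) -> list of k additional labels,
        e.g. from k independent judges (hypothetical interface).
    Returns a dict of final labels after majority voting.
    """
    final = {}
    for sid, label in initial_labels.items():
        if label == 1:  # only initially-'relevant' samples get extra judgments
            votes = [label] + get_extra_labels(sid, k)
            final[sid] = Counter(votes).most_common(1)[0][0]
        else:           # non-relevant samples keep their single label
            final[sid] = label
    return final

# Example: the judge pool disagrees with the initial 'relevant' label on B,
# so the majority vote flips it; C was never re-judged.
judgments = {"A": [1, 1, 1], "B": [0, 0, 0]}
labels = selective_relabel({"A": 1, "B": 1, "C": 0},
                           lambda sid, k: judgments[sid][:k])
```

Concentrating the extra judging budget on the initially-relevant subset corrects exactly the labels that most affect ranking quality, instead of spreading the budget uniformly.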

This work has been published in SIGIR 2010.

Sentiment Detection and Opinion Detection, Carnegie Mellon University (May 2006-Aug 2006)

Due to the richness of natural language, sentiment and opinion detection is challenging. Often framed as a classification task, it determines the polarity of a given text at the document or sentence level. The task first appeared in the TREC'06 Blog Track, where the domain of interest was blog posts. At that time, no training data was available for the task, so my research focused on transfer learning for sentiment and opinion detection: the training documents are movie and product reviews, while the testing documents are blog posts. Common linguistic features and statistical language features in the training data are captured by a non-diagonal prior covariance matrix and used as shared knowledge to build informative priors for a Gaussian logistic regression model (Yang et al., TREC'06). For more information, check the project page.
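The informative-prior idea can be sketched as a MAP estimate for logistic regression with a Gaussian prior, shown here for a toy two-feature problem. The prior mean and (non-diagonal) inverse covariance standing in for knowledge transferred from review data are made-up values; this is not the original TREC'06 system.

```python
import math

def map_logreg(X, y, prior_mean, prior_cov_inv, lr=0.1, steps=2000):
    """MAP estimate for logistic regression with a Gaussian prior N(mu, Sigma).

    A non-diagonal inverse covariance couples the weights, so strength learned
    on a source domain (reviews) can inform a target domain (blog posts).
    Illustrative two-feature sketch with plain gradient descent.
    """
    w = list(prior_mean)  # start at the prior mean
    n = len(X)
    for _ in range(steps):
        # gradient of the negative log-likelihood
        grad = [0.0, 0.0]
        for xi, yi in zip(X, y):
            z = w[0] * xi[0] + w[1] * xi[1]
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(2):
                grad[j] += (p - yi) * xi[j]
        # add gradient of the prior term: Sigma^{-1} (w - mu)
        d = [w[0] - prior_mean[0], w[1] - prior_mean[1]]
        for j in range(2):
            grad[j] += sum(prior_cov_inv[j][k] * d[k] for k in range(2))
        for j in range(2):
            w[j] -= lr * grad[j] / n
    return w

# Hypothetical prior from review data; the off-diagonal terms share strength
# between correlated sentiment features.
mu = [1.0, -1.0]
cov_inv = [[2.0, -1.0], [-1.0, 2.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
y = [1, 0, 1, 1]
w = map_logreg(X, y, mu, cov_inv)
```

With little target-domain data, the prior keeps the weights close to what the source domain learned; with more data, the likelihood term dominates.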

This work participated in TREC 2006 evaluation and was published in TREC 2006.

Near-Duplicate Detection in eRulemaking, Carnegie Mellon University (2004-2007)

U.S. regulatory agencies are required to solicit and read every single public comment on their proposed rules. To reduce the human effort in the rulemaking process, near-duplicate detection was developed via a semi-supervised clustering approach, which flexibly incorporates constraints into the clustering process to achieve better clustering accuracy. For more information, check the project page.
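The constraint-based clustering idea can be sketched as a greedy near-duplicate grouping in which human-supplied must-link/cannot-link pairs override a similarity threshold. This is an illustrative simplification, not the system published in SIGIR 2006.

```python
def jaccard(a, b):
    """Word-overlap similarity between two comments."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def constrained_cluster(docs, threshold=0.6, must_link=(), cannot_link=()):
    """Greedy near-duplicate clustering with instance-level constraints.

    must_link / cannot_link are pairs of doc indices supplied by a human
    reviewer; they override the similarity threshold. A simplified sketch of
    semi-supervised (constraint-based) clustering.
    """
    clusters = []  # each cluster is a list of doc indices
    for i, doc in enumerate(docs):
        placed = False
        for cluster in clusters:
            if any((i, j) in cannot_link or (j, i) in cannot_link
                   for j in cluster):
                continue  # a cannot-link constraint forbids this cluster
            forced = any((i, j) in must_link or (j, i) in must_link
                         for j in cluster)
            if forced or all(jaccard(doc, docs[j]) >= threshold
                             for j in cluster):
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

# Hypothetical public comments: the first two are near-duplicate form letters.
comments = [
    "please do not approve this rule",
    "please do not approve this rule today",
    "the rule is great and should pass",
]
groups = constrained_cluster(comments, threshold=0.6)
```

Each resulting cluster of near-duplicate form letters can then be read once by an agency analyst instead of comment by comment.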

This work was reported in Digital News Journal (2004 Aug) and was published in SIGIR 2006, DG.O 2006, and DG.O 2005. This work also yielded a spin-off company: TellTale Information.

Multimedia Information Retrieval, Carnegie Mellon University (2004), National University of Singapore (2003-2004)

A news video collection contains thousands of hours of video, which combines text scripts, audio, images, and video sequences. To find video sequences that match a user query, the system applies text analysis, audio analysis, speech recognition, and image processing. We study a comparison of uni-modal, multi-modal, and multi-concept classifiers for feature extraction, and explore both visual-only and multi-modal video features in the search process.

This work participated in the TRECVID 2003 and TRECVID 2004 evaluations, winning 1st place (National University of Singapore) and 2nd place (Carnegie Mellon University). It was published in TRECVID 2003 and TRECVID 2004.

Question Answering, National University of Singapore (2001-2004)

My research in QA centers on the tasks and evaluations initiated by TREC. The TREC QA Track deals with open-domain factoid, list, and definitional questions. The event-based question answering approach that Professor Tat-Seng Chua and I proposed exploits general ontologies and external resources, such as WordNet glosses and synonyms and search result snippets, to gather additional world knowledge about the question-answer event in which the answer lies. The constraints imposed by this additional knowledge enable more effective passage retrieval and answer extraction (Yang & Chua, TREC'02; Yang & Chua, EACL'03; Yang et al., TREC'03; Yang et al., SIGIR'03; Yang et al., WWW'03; Yang & Chua, COLING'04; Yang & Chua, SIGIR'04). The system participated in TREC'02, '03, and '04, and consistently won 2nd place in the TREC QA competitions among systems from all over the world.

This work was published in SIGIR 2003, TREC 2002, TREC 2003, COLING 2004, EACL 2003, WWW 2003, and SIGIR 2004.

VideoQA: Question Answering on News Video, National University of Singapore (2003-2004)

Question Answering for Video (VideoQA, Yang et al., ACM Multimedia'03) extends my research in QA from text to the context of multimedia, in particular news video. News video collections usually contain thousands of hours of video, combining text scripts, audio, images, and video sequences. VideoQA answers short natural language questions with implicit constraints on the content, context, duration, and genre of the expected video segments, and returns short, precise video summaries as the answer. It takes advantage of multi-modal visual, audio, and textual features, as well as external resources, to correct speech recognition errors and locate precise answers.

This work was published in ACM MM 2003.

Online Streaming Video Broadcasting and Recording (2000-2001)

This work set up an online video station by capturing the analogue video signals broadcast by local television stations. It converts the analogue signals into digital form and allows the user to view and select among different stations. A video recording feature also highlights the system's potential as a ready-to-commercialize product. The main research effort was in synchronizing speech and video.

This work was an undergraduate final-year research project, which earned a score of 100 out of 100 in the project evaluation.