4th Workshop on the Future of Web Search: Semantic Search
Ibiza - April 17-18, 2009

Confirmed speakers





Ph.D. Presentations







(Re)Organizing web search results by means of semantic and visual tools

Gonzalo A. Aranda-Corral

We introduce our main line of research, which uses knowledge representation and reasoning (Formal Concept Analysis, visual reasoning, ...) to present web search results in a cognitively appropriate way.
We also use these techniques to reorganize web search results and, if necessary, reflect them in our personal site (e.g., Delicious).
In the presentation, we will cover these techniques, some tools, and how to use them to best advantage.





Text Analytics for Wikipedia Semantic Search

Giuseppe Attardi

I will survey some tools for text analytics, concentrating in particular on those employing the architectural pattern of Pipes and Enrichment. I will present the design and architecture of Tanl, a suite of tools for analyzing and annotating NL texts, ranging from preprocessing to full parsing. A Tanl pipeline can be assembled on the fly with a few lines of scripting, e.g. in Python, and processed through a Map/Reduce cluster in order to handle large document collections. A semantically enriched index is created with the output of such a pipeline. Tanl is being used to analyze the Italian Wikipedia. The extracted annotations are used in a self-training process to improve the same corpora on which the NL tools were built.
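
To illustrate the general Pipes and Enrichment pattern, here is a minimal Python sketch. It does not use Tanl's actual API, which the abstract does not describe: the stage names `tokenize` and `tag_capitalized` and the helper `build_pipeline` are hypothetical. Each stage consumes a stream of documents and yields the same documents with one more annotation layer added.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterator

# A document is a plain dict; each stage enriches it and passes it on.
Doc = Dict[str, Any]
Stage = Callable[[Iterator[Doc]], Iterator[Doc]]


def tokenize(docs: Iterator[Doc]) -> Iterator[Doc]:
    """Hypothetical stage: add a whitespace-split 'tokens' annotation."""
    for doc in docs:
        yield {**doc, "tokens": doc["text"].split()}


def tag_capitalized(docs: Iterator[Doc]) -> Iterator[Doc]:
    """Hypothetical stage: mark capitalized tokens as candidate names."""
    for doc in docs:
        candidates = [t for t in doc["tokens"] if t[:1].isupper()]
        yield {**doc, "candidates": candidates}


def build_pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into one lazy stage (hypothetical helper)."""
    return lambda docs: reduce(lambda stream, stage: stage(stream), stages, docs)


pipeline = build_pipeline(tokenize, tag_capitalized)
enriched = list(pipeline(iter([{"text": "Tanl analyzes the Italian Wikipedia"}])))
```

Because every stage is a generator, documents flow through one at a time, which is the property that makes such a pipeline easy to distribute over a large collection.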





Non-linear semantic mapping for cross-language document search

Rafael Banchs

A non-linear semantic mapping procedure is proposed for cross-language document retrieval. The method relies on a non-linear space reduction technique for constructing semantic embeddings of multilingual document collections. In the proposed method, an independent embedding is constructed for each language in a multilingual collection and the similarities among the resulting semantic representations are used for cross-language document retrieval. Two variants of the proposed method are implemented and compared with a state-of-the-art cross-language information retrieval technique. It is shown that, for some specific tasks, the proposed method outperforms the conventional one.





Achieving (and Understanding) Dense Linking in the LOD Cloud

Mariano Consens

The Linking Open Data (LOD) community project is extending the Web by encouraging the creation of interlinks (RDF links between data items identified using dereferenceable URIs). Abundant linked data justifies extending the capabilities of web browsers and search engines, and enables new usage scenarios, novel applications, and sophisticated mashups.

This promising direction for publishing data on the web brings forward a number of challenges. In this talk, we describe two specific challenges. First, achieving and managing dense interlinking (using the LinkedMDB project as an example). Second, understanding the data, the metadata, and the interlinks that are used both within and across datasets in the LOD cloud.





Video shot retrieval: Who kills the vampire?

Koen Deschacht

We present a system that performs automatic semantic analysis of videos and transcripts and retrieves relevant shots for "Buffy the Vampire Slayer". We have built a semantic frame classifier that detects and classifies actions in video transcripts, together with the actors, patients and other circumstances. We have performed experiments and evaluated different discriminative and generative classification models. Furthermore, we present preliminary results on aligning semantic units in the text with corresponding semantic units in video, where we focus on aligning actors in the text with detected faces in the video. Finally, we propose several probabilistic retrieval models for the retrieval of actions and/or persons involved in actions in the video.





Language-Model based Ranking in Entity-Relationship Graphs

Shady Elbassuoni

The success of knowledge-sharing communities like Wikipedia and the advances of automatic information extraction from textual and Web sources have made it possible to build large "knowledge repositories" such as DBpedia, Freebase, or YAGO. These collections can be viewed as graphs of entities and relationships (ER graphs) and can be represented as a set of subject-property-object (SPO) triples in the Semantic-Web data model RDF. Queries can be expressed in the W3C-endorsed SPARQL language or by similarly designed graph-pattern search. However, exact-match query semantics often falls short of satisfying the users' needs by returning too many or too few results. Therefore, IR-style scoring and ranking models are crucially needed. In this talk, I will present an efficient language-model-based approach to ranking the results of exact, approximate and keyword-augmented SPARQL queries over RDF databases such as ER graphs. Our method estimates a query model and a set of result-graph models and ranks results based on their Kullback-Leibler divergence with respect to the query model.
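
The ranking principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's estimator: the abstract does not say how the query and result-graph models are estimated, so the sketch assumes simple unigram distributions over terms with additive smoothing, and it assumes the divergence is taken as KL(query || result), with smaller values ranked first. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def smoothed_model(terms: Iterable[str], vocab: List[str], alpha: float = 0.1) -> Dict[str, float]:
    """Unigram distribution over vocab with additive smoothing (an assumption)."""
    counts = Counter(terms)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) = sum over t of p(t) * log(p(t) / q(t)), on a shared vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def rank_results(query_terms: List[str], results: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Order results by increasing divergence from the query model."""
    vocab = sorted(set(query_terms).union(*results.values()))
    query_model = smoothed_model(query_terms, vocab)
    scored = [
        (rid, kl_divergence(query_model, smoothed_model(terms, vocab)))
        for rid, terms in results.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


ranking = rank_results(
    ["einstein", "physicist"],
    {
        "r1": ["einstein", "physicist", "born", "ulm"],
        "r2": ["bohr", "physicist", "copenhagen"],
    },
)
```

Smoothing keeps every probability strictly positive, so the logarithm is always defined even when a result never mentions a query term.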





Semantically enhanced Information Retrieval: an ontology-based approach

Miriam Fernandez

The amount of content stored and shared on the Web and other document repositories keeps increasing steadily and fast. This growth results in well-known difficulties when it comes to finding and properly managing information in massive volumes. Striking progress has been achieved in the last decade with the development of search engine technologies, which collect, store and pre-process worldwide-scale information to return relevant resources instantly in response to users' needs. However, users sometimes still miss their targets, or need considerable effort to reach them, even if the sought information is present in the search space.

A common cause for this is that currently consolidated content description and query processing techniques for Information Retrieval (IR) are based on keywords, and therefore provide limited capabilities to grasp and exploit the conceptualizations involved in user needs and content meanings. This involves limitations such as the inability to describe relations between search terms (e.g., "hurricanes originated in Mexico" vs. "hurricanes that have affected Mexico", "books about recommender systems" vs. "systems that recommend books"), or the difficulty of properly coping with linguistic phenomena such as polysemy (e.g., "mouth" as part of the body vs. "mouth" as the point where a stream issues into a larger body of water) or synonymy (e.g., find "movies" when the user queries for "films").

Aiming to solve the limitations of keyword-based models, the idea of conceptual search, understood as searching by meanings rather than literal strings, has been the focus of a wide body of research in the IR field. More recently, it has been used as a prototypical scenario (or even envisioned as a potential "killer app") in the Semantic Web (SW) vision since its emergence in the late nineties. However, the undertakings in information search and retrieval from the semantic-based technology area have not yet taken full advantage of the technologies, background, knowledge, and experience accumulated through several decades of work in the IR tradition.

Starting from this position, this work investigates the definition of ontology-based IR models, oriented to the exploitation of domain KBs to support semantic search capabilities in large document repositories, stressing on the one hand the use of full-fledged ontologies in the semantic-based perspective, and on the other the consideration of unstructured content as the final search space. In other words, the work explores the use of semantic information to support more expressive queries and more accurate results, while the retrieval problem is formulated in a way that is proper to the IR field, thus drawing benefit from the state of the art in this area, and enabling more realistic and applicable approaches.





Extracting structured data from text with applications

Blaz Fortuna

In this presentation I will present a set of online web services for extracting structured information from plain text, in the form of an RDF-like semantic graph based on subject-verb-object triplets extracted from sentences. The techniques for extracting triplets will be explained in more detail. Furthermore, I will demonstrate three applications that utilize the extracted semantic graph: an automatic document summarizer, a question answering system and a visualization tool.
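
To make the data structure concrete, the following Python sketch builds a small graph from subject-verb-object triplets and answers a simple lookup over it. It covers only the graph side: the triplet extraction itself is not described in the abstract and is not attempted here. The example triplets and the function names `build_graph` and `who` are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)


def build_graph(triples: Iterable[Triple]) -> DefaultDict[str, List[Tuple[str, str]]]:
    """Map each subject to its outgoing (verb, object) edges."""
    graph: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, verb, obj in triples:
        graph[subject].append((verb, obj))
    return graph


def who(graph: DefaultDict[str, List[Tuple[str, str]]], verb: str, obj: str) -> List[str]:
    """Return the subjects that have an edge (verb, obj), in sorted order."""
    return sorted(s for s, edges in graph.items() if (verb, obj) in edges)


# Hypothetical triplets, as an extractor might produce from three sentences.
graph = build_graph([
    ("cat", "chases", "mouse"),
    ("mouse", "eats", "cheese"),
    ("cat", "drinks", "milk"),
])
```

A lookup such as `who(graph, "chases", "mouse")` is the simplest form of the question answering use the abstract mentions.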





Name ambiguity resolution and attribute extraction for the Web People Search task: overview of the WePS 2 evaluation campaign

Julio Gonzalo

The second WePS (Web People Search) Evaluation campaign took place between 2008 and 2009 with the participation of 19 research groups from Europe, Asia and North America. Given the output of a Web Search Engine for a (usually ambiguous) person name as query, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an extraction task, which consists of extracting salient attributes for each of the persons sharing the same name. We will summarize the lessons learnt from this evaluation exercise and their implications for Semantic Search.





Relevance feedback for Semantic Search

Harry Halpin

Relevance feedback is one method for creating a "virtuous cycle" - as put by Baeza-Yates - between semantics and search. Previous approaches to search have generally considered the Semantic Web and hypertext Web search to be entirely disparate, indexing and searching over different domains. While relevance feedback has traditionally improved information retrieval performance, it is normally used to improve rankings of a single data-set. Our novel approach is to use relevance feedback from hypertext Web search to improve the retrieval of Semantic Web data. We also inspect whether relevance feedback from Semantic Web data can improve hypertext Web search results. In both cases, an evaluation based on certain kinds of informational queries (abstract concepts, people, and places) selected from a query log and human judges shows that relevance feedback works: relevance feedback from hypertext Web search can improve the retrieval of Semantic Web data, and vice versa. We evaluate our work over a wide range of algorithms, and show it improves baseline performance on these queries for deployed systems as well, such as the Semantic Search engine FALCON-S and the commercial Web search engine Yahoo! search.





Adding semantic context to key word queries

Rianne Kaptein

A bottleneck for providing more accurate search results is the shallowness on the client side, i.e. users provide only a short keyword query of 2-3 words on average. Adding context to queries can help to interpret the query and improve search results. Using implicit or explicit feedback techniques, queries can be associated with related concepts, entities, or document types. This context provides new opportunities to rank search results. The result list can be diversified by showing results belonging to different concepts, entities or document types, or results belonging to the same concept or entity can be aggregated or clustered.
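
Both presentation options, aggregating and diversifying, can be sketched briefly in Python. This is an illustrative sketch, not the speaker's method: it assumes each ranked result already carries a single concept label, and it uses round-robin interleaving as one simple way to diversify. The function names and the example labels are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Ranked = List[Tuple[str, str]]  # (doc_id, concept), best result first


def aggregate(ranked: Ranked) -> Dict[str, List[str]]:
    """Group results by concept, keeping the original order within each group."""
    groups: Dict[str, List[str]] = {}
    for doc_id, concept in ranked:
        groups.setdefault(concept, []).append(doc_id)
    return groups


def diversify(ranked: Ranked) -> List[str]:
    """Interleave results round-robin across concepts (one simple strategy)."""
    queues: Dict[str, Deque[str]] = {c: deque(ids) for c, ids in aggregate(ranked).items()}
    output: List[str] = []
    while queues:
        for concept in list(queues):  # copy keys so exhausted queues can be dropped
            output.append(queues[concept].popleft())
            if not queues[concept]:
                del queues[concept]
    return output


ranked = [
    ("d1", "jaguar-car"),
    ("d2", "jaguar-car"),
    ("d3", "jaguar-animal"),
    ("d4", "jaguar-car"),
    ("d5", "jaguar-os"),
]
diverse = diversify(ranked)
```

With this input, the first three positions of `diverse` cover all three interpretations of the query instead of two.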





Concept Search: Enabling Semantics in Syntactic Search

Uladzimir Kharkevich

Concept Search extends syntactic search, i.e., search based on the computation of string similarity between words, with semantic search, i.e., search based on the computation of semantic relations between concepts. The main idea of Concept Search is to keep the underlying machinery of syntactic search, but to modify it so that, whenever possible, syntactic search is substituted by semantic search, thus improving the system performance. This is why we say that Concept Search is semantics-enabled syntactic search. Semantics can be enabled along different dimensions, on different levels, and to different extents, forming a space of approaches lying between purely syntactic search and fully semantic search. We call this space the semantic continuum. In the talk, we will discuss how Concept Search can be tuned to work at different points in the semantic continuum, taking advantage of semantics when and where possible.





Investigating the Semantic Gap

Peter Mika

This talk explores techniques to study the semantic gap of semantic search, i.e. the difference between the data available on the Semantic Web and what web users are searching for in general. While the data available on the Semantic Web is relatively well understood and has been widely studied based on the crawls of Semantic Web search engines, much less is known about the information needs of Web searchers in terms of structured data objects. For this reason, we will pay particular attention in this talk to the demand-side of semantic search and introduce a tool that can be used to classify web queries using semantic categories as well as to investigate the various information needs related to each semantic category.





Unique identifiers for the Web

Zoltan Miklos

Many applications, and semantic web search in particular, could benefit greatly from a Web service that provides unique identifiers. We discuss the challenges we face in realizing such a service and present the architecture of our pre-release system, together with a short demo. We also give a short overview of the related research activities we are involved in. These include a dynamic virtual-economy-based resource allocation technique to ensure high availability for geographically distributed services, and entity resolution techniques specific to our application. Our talk reports some of the ongoing work we pursue in the context of the OKKAM project.





Approximately Optimal Facet Selection

Ronny Lempel

Multifaceted search is a popular interaction paradigm for discovery and mining applications that allows users to analyze and navigate through multidimensional data. A crucial aspect of faceted search applications is selecting the list of facets to display to the user following each query. We call this the Facet Selection problem.

When refining a query by drilling down into a facet, documents that are associated with that facet are promoted in the rankings. We formulate facet selection as an optimization problem aiming to maximize the rank promotion of certain documents. As the optimization problem is NP-Hard, we propose an approximation algorithm for selecting an approximately optimal set of facets per query.

We conducted experiments over hundreds of queries and search results of a large commercial search engine, comparing two flavors of our algorithm to facet selection algorithms appearing in the literature. The results show that our algorithm significantly outperforms those baseline schemes.

Joint work with Sonya Liberman, Technion, Israel.
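
The abstract does not spell out the objective function or the approximation algorithm, so the following Python sketch should not be read as the authors' method. As a stand-in, it shows a classic greedy heuristic for a related NP-hard problem, maximum coverage: pick at most `k` facets so that as many target documents as possible are associated with a chosen facet. Greedy selection is known to achieve a (1 - 1/e) approximation for maximum coverage. The names `greedy_facets`, `facet_docs`, and `targets` are hypothetical.

```python
from typing import Dict, List, Set


def greedy_facets(facet_docs: Dict[str, Set[str]], targets: Set[str], k: int) -> List[str]:
    """Greedily choose up to k facets covering the most still-uncovered targets."""
    chosen: List[str] = []
    uncovered = set(targets)
    for _ in range(k):
        # Iterate facets in sorted order so that ties are broken deterministically.
        best = max(
            (f for f in sorted(facet_docs) if f not in chosen),
            key=lambda f: len(facet_docs[f] & uncovered),
            default=None,
        )
        if best is None or not (facet_docs[best] & uncovered):
            break  # no remaining facet adds any coverage
        chosen.append(best)
        uncovered -= facet_docs[best]
    return chosen


facets = {
    "author": {"d1", "d2", "d3"},
    "year": {"d3", "d4"},
    "topic": {"d4", "d5"},
}
selected = greedy_facets(facets, {"d1", "d2", "d3", "d4", "d5"}, k=2)
```

Here `author` is picked first because it covers three targets, and `topic` second because it covers both remaining ones.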





Semantics and tags: giving tags and labels their due

Alexandre Monnin

The last couple of years have seen an increasing development of semantic approaches to tagging. From Connotea's Entity Describer to Alexandre Passant's MOAT, it seems that the future of tagging is to associate its powers with semantic search. However, one question immediately arises: what exactly is a tag? Trite as it may seem, this question is seldom asked. We shall address it nonetheless and offer a definition of tagging that clearly purports to explain why a tag, strictly speaking, is always devoid of any semantic content while a "label" is not. This is essential if we want an account of tagging that is both promising and faithful to users' practices.





Freebase: A socially managed identity database

Jamie Taylor

Freebase is an open, writable database of the world's information. Community members can create entities, contribute facts and extend the data model to represent the things in the world they care about. The entities within Freebase are reconciled, so each entity is represented once, gathering all information about the item under one strong identifier.

This talk will describe how Freebase can be used as a switchboard for data.

Entities in Freebase may contain data from a wide number of sources, representing many facets of information, which allows users to find items of interest using queries that cut across domains of interest. In addition, entities can be annotated with identifiers from other systems allowing Freebase to direct users (and applications) to other data systems for additional information.





Data Web Search: What is to be done?

Tran Thanh

The Web as a global information space is developing from a Web of documents to a Web of data. This development opens new ways for addressing complex information needs. Search is no longer limited to matching keywords against documents, but instead complex information needs can be expressed in a structured way, with precise answers as results. In this talk, I will discuss a number of challenges involved in realizing search on the Web of data. A concrete infrastructure called SearchWebDB addressing some of these challenges is presented as a solution towards data web search. I will then elaborate on possible directions and research activities for realizing more usable, effective and precise search on the Web of data.





Using Web Information for Author Name Disambiguation

Nivio Ziviani

In digital libraries, ambiguous author names may occur due to the existence of multiple authors with the same name (polysemes) or different name variations for the same author (synonyms). We propose here a new method that uses information available on the Web to deal with both problems at the same time. Our idea consists of gathering information from input citations and submitting queries to a Web search engine, aiming at finding curricula vitae and Web pages containing publications of the ambiguous authors. From the content of documents in the answer sets returned by the Web search engine, useful information that can help in the disambiguation process is extracted. Using this information, author names are disambiguated by leveraging a hierarchical clustering method, which groups citations in the same document together in a bottom-up fashion. Experimental results show that our method yields results that outperform those of two state-of-the-art unsupervised methods and are statistically comparable with those of a supervised one, while requiring no training. We observe gains of up to 65.2% in the pairwise F1 metric when compared to our best unsupervised baseline method.
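
For readers unfamiliar with the evaluation measure, here is a short Python sketch of pairwise F1 as it is commonly defined for clustering: precision and recall are computed over pairs of citations placed in the same cluster. The abstract does not give the exact definition used, so this follows the usual one as an assumption; the function names and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple


def same_cluster_pairs(assignment: Dict[str, Hashable]) -> Set[Tuple[str, str]]:
    """All unordered citation pairs that share a cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_f1(predicted: Dict[str, Hashable], gold: Dict[str, Hashable]) -> float:
    """Harmonic mean of pairwise precision and pairwise recall."""
    predicted_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(predicted_pairs & gold_pairs)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_pairs)
    recall = correct / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy data: three citations by author A, one by author B.
gold = {"c1": "A", "c2": "A", "c3": "A", "c4": "B"}
predicted = {"c1": "x", "c2": "x", "c3": "y", "c4": "y"}
score = pairwise_f1(predicted, gold)
```

In the toy data the prediction contains two same-cluster pairs, one of which is correct, while the gold standard contains three, giving a precision of 1/2 and a recall of 1/3.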


