(Re)Organizing web search results by means of semantic and visual tools
Gonzalo A. Aranda-Corral
We'd like to introduce our main line of research, which involves using knowledge representation and reasoning (Formal Concept Analysis, visual reasoning, ...) to present web search results in a cognitively appropriate way.
We also use these techniques to reorganize web search results and, where appropriate, reflect them in personal sites such as Delicious.
In the presentation, we will cover these techniques and some supporting tools, and show how to use them to best advantage.
Text Analytics for Wikipedia Semantic Search
I will survey some tools for text analytics concentrating in particular
on those employing the architectural pattern of Pipes and Enrichment.
I will present the design and architecture of Tanl, a suite of tools for
analyzing and annotating NL texts, ranging from preprocessing to full
parsing. A Tanl pipeline can be assembled on the fly with a few lines
of scripting, e.g. in Python, and processed through a Map/Reduce cluster
in order to handle large document collections. A semantically enriched
index is created from the output of such a pipeline. Tanl is being used
to analyze the Italian Wikipedia. The extracted annotations are used in
a self-training process to improve the NL tools on the same corpora on
which they were trained.
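The pipes-and-enrichment idea behind such a pipeline can be sketched in a few lines of Python. The stage names and the dict-based document representation below are hypothetical illustrations, not the real Tanl API:

```python
# Illustrative sketch of a pipes-and-enrichment text pipeline in the
# spirit of Tanl; stage names and interfaces are invented for the example.

def tokenize(docs):
    """Enrich each raw document with a list of lowercase tokens."""
    for text in docs:
        yield {"text": text, "tokens": text.lower().split()}

def tag(docs):
    """Enrich each document with a toy part-of-speech tag per token."""
    for doc in docs:
        doc["tags"] = ["NOUN" if t.istitle() else "X" for t in doc["text"].split()]
        yield doc

def pipeline(docs, *stages):
    """Chain enrichment stages; each stage consumes the previous stream."""
    stream = iter(docs)
    for stage in stages:
        stream = stage(stream)
    return stream

out = list(pipeline(["Rome is in Italy"], tokenize, tag))
```

Because each stage is a generator over a stream of documents, stages compose freely and the same pattern maps naturally onto a Map/Reduce setting, where each mapper runs the pipeline over a shard of the collection.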
Non-linear semantic mapping for cross-language document search
A non-linear semantic mapping procedure is proposed for cross-language
document retrieval. The method relies on a non-linear space reduction
technique for constructing semantic embeddings of multilingual document
collections. In the proposed method, an independent embedding is constructed
for each language in a multilingual collection and the similarities among
the resulting semantic representations are used for cross-language document
retrieval. Two variants of the proposed method are implemented and compared
with a state-of-the-art cross-language information retrieval technique.
It is shown that, for some specific tasks, the proposed method outperforms
the conventional one.
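One simple way to realize per-language embeddings that remain comparable across languages is to represent each document by its similarity to a small set of parallel "anchor" documents. This is only a toy stand-in for the paper's actual non-linear reduction technique, with invented two-language data:

```python
# Sketch: independent per-language semantic spaces made comparable via
# parallel anchor documents. The similarity-to-anchors embedding below
# is an illustration, not the proposed method's actual reduction.
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words counts for a document."""
    return Counter(text.split())

def cos(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(doc, anchors):
    """Represent a document by its similarity to each anchor document."""
    return {i: cos(bow(doc), bow(a)) for i, a in enumerate(anchors)}

# Parallel anchor documents in two toy languages.
anchors_en = ["cat sits on mat", "dog runs fast"]
anchors_es = ["gato sienta en alfombra", "perro corre rapido"]

query_en = "cat on mat"
docs_es = ["gato en alfombra", "perro corre"]

q = embed(query_en, anchors_en)
scores = [cos(q, embed(d, anchors_es)) for d in docs_es]
best = max(range(len(docs_es)), key=lambda i: scores[i])
```

Each language gets its own embedding function, yet the resulting anchor-similarity vectors live in a shared space, so a query in one language can be scored directly against documents in another.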
Achieving (and Understanding) Dense Linking in the LOD Cloud
The Linking Open Data (LOD) community project is extending the
Web by encouraging the creation of interlinks (RDF links between data
items identified using dereferenceable URIs). Abundant linked data justifies
extending the capabilities of web browsers and search engines, and enables
new usage scenarios, novel applications, and sophisticated mashups.
This promising direction for publishing data on the web brings forward a
number of challenges. In this talk, we describe two specific challenges.
First, achieving and managing dense interlinking (using the LinkedMDB
project as an example). Second, understanding the data, the metadata, and
the interlinks that are used both within and across datasets in the LOD
cloud.
Video shot retrieval: Who kills the vampire?
We present a system that performs automatic semantic analysis of
videos and transcripts and retrieves relevant shots for "Buffy the
Vampire Slayer". We have built a semantic frame classifier that
detects and classifies actions in video transcripts, together with the
actors, patients and other circumstances. We have performed
experiments and evaluated different discriminative and generative
classification models. Furthermore, we present preliminary results on
aligning semantic units in the text with corresponding semantic units
in video, where we focus on matching actors in the text with detected
faces in the video. Finally, we propose several probabilistic retrieval
models for the retrieval of actions and/or persons involved in actions
in the video.
Language-Model based Ranking in Entity-Relationship Graphs
The success of knowledge-sharing communities like Wikipedia and the
advances of automatic information extraction from textual and Web sources
have made it possible to build large "knowledge repositories" such as
DBpedia, Freebase, or YAGO. These collections can be viewed as graphs of
entities and relationships (ER graphs) and can be represented as a set of
subject-property-object (SPO) triples in the Semantic-Web data model RDF.
Queries can be expressed in the W3C-endorsed SPARQL language or by
similarly designed graph-pattern search. However, exact-match query
semantics often falls short of satisfying the users' needs, returning too
many or too few results. Therefore, IR-style scoring and ranking models
are crucially needed. In this talk, I will present an efficient
language-model-based approach to ranking the results of exact,
approximate, and keyword-augmented SPARQL queries over RDF databases such
as ER graphs. Our method estimates a query model and a set of
result-graph models, and ranks results based on their Kullback-Leibler
divergence with respect to the query model.
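The core ranking step can be illustrated with smoothed unigram language models: each result is scored by the KL divergence of its model from the query model, lower being better. The additive smoothing and the toy data below are simplifications of the talk's actual estimation:

```python
# Toy language-model ranking by Kullback-Leibler divergence.
from collections import Counter
from math import log

def lm(text, vocab, mu=0.1):
    """Additively smoothed unigram language model over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}

def kl(p, q):
    """KL divergence D(p || q); lower means q is closer to the query model p."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)

query = "jazz musician born in Paris"
results = ["jazz musician from Paris", "football player born in Rome"]
vocab = set((query + " " + " ".join(results)).split())

pq = lm(query, vocab)
ranked = sorted(results, key=lambda r: kl(pq, lm(r, vocab)))
```

Smoothing matters here: without it, any query term absent from a result model would make the divergence infinite, so the choice of smoothing directly shapes the ranking.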
Semantically enhanced Information Retrieval: an ontology-based approach
The amount of content stored and shared on the Web and other document
repositories keeps increasing steadily and fast. This growth results in
well known difficulties and problems when it comes to finding and properly
managing information in massive volumes. Striking progress has been achieved
in the last decade with the development of search engine technologies,
which collect, store and pre-process worldwide-scale information to return
relevant resources instantly in response to users' needs. However, users
sometimes still miss their targets, or need considerable effort to reach
them, even when the sought information is present in the search space.
A common cause for this is that currently consolidated content description
and query processing techniques for Information Retrieval (IR) are based
on keywords, and therefore provide limited capabilities to grasp and exploit
the conceptualizations involved in user needs and content meanings. This
involves limitations such as the inability to describe relations between
search terms (e.g., "hurricanes originated in Mexico" vs. "hurricanes
that have affected Mexico", "books about recommender systems" vs. "systems
that recommend books"), or the weakness to properly cope with linguistic
phenomena such as polysemy (e.g., "mouth" as part of the body vs. "mouth"
as the point where a stream issues into a larger body of water) or synonymy
(e.g., find "movies" when the user queries for "films").
Aiming to solve the limitations of keyword-based models, the idea of conceptual
search, understood as searching by meanings rather than literal strings,
has been the focus of a wide body of research in the IR field. More recently,
it has been used as a prototypical scenario (or even envisioned as a potential
"killer app") in the Semantic Web (SW) vision since its emergence in the
late nineties. However, the undertakings in information search and retrieval
from the semantic-based technology area have not yet taken full advantage
of the technologies, background, knowledge, and accumulated experience
through several decades of work in the IR field tradition.
Starting from this position, this work investigates the definition of ontology-based
IR models, oriented to the exploitation of domain KBs to support semantic
search capabilities in large document repositories, stressing on the one
hand the use of full-fledged ontologies in the semantic-based perspective,
and on the other the consideration of unstructured content as the final
search space. In other words, the work explores the use of semantic information
to support more expressive queries and more accurate results, while the
retrieval problem is formulated in a way that is proper of the IR field,
thus drawing benefit from the state of the art in this area, and enabling
more realistic and applicable approaches.
Extracting structured data from text with applications
In this presentation I will introduce a set of online web services for
extracting structured information from plain text, in the form of an RDF-like
semantic graph based on subject-verb-object triplets extracted from sentences.
The techniques for extracting triplets will be explained in more detail.
Furthermore, I will demonstrate three applications that utilize the extracted
semantic graph: an automatic document summarizer, a question answering
system, and a visualization tool.
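To make the triplet idea concrete, here is a deliberately naive subject-verb-object extractor for simple declarative sentences. The regular expression and the closed verb list are invented for the example; the actual system relies on real linguistic analysis, not this toy pattern:

```python
# Naive SVO triplet extraction for simple sentences (illustration only).
import re

VERB = r"(eats|likes|owns|writes)"  # tiny closed set of verbs for the demo

def extract_triplet(sentence):
    """Match '<subject> <verb> <object>' and return an SVO tuple, or None."""
    m = re.match(r"(\w+(?: \w+)*?) " + VERB + r" (.+?)\.?$", sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

graph = [extract_triplet(s) for s in
         ["Alice owns a bookshop.", "Bob writes novels."]]
```

Each extracted tuple plays the role of an RDF-like (subject, predicate, object) edge, and accumulating them over a document yields the semantic graph that the summarizer, question answering system, and visualization tool consume.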
Name ambiguity resolution and attribute extraction for the Web
People Search task: overview of the WePS 2 evaluation campaign
The second WePS (Web People Search) Evaluation campaign took place between
2008 and 2009 with the participation of 19 research groups from Europe, Asia
and North America. Given the output of a Web Search Engine for a
(usually ambiguous) person name as query, two tasks were addressed: a
clustering task, which consists of grouping together web pages
referring to the same person, and an extraction task, which consists
of extracting salient attributes for each of the persons sharing the
same name. We will summarize the lessons learnt from this evaluation
exercise and their implications for Semantic Search.
Relevance feedback for Semantic Search
Relevance feedback is one method for creating a "virtuous cycle", as
Baeza-Yates puts it, between semantics and search. Previous approaches
to search have generally considered the Semantic Web and hypertext Web
search to be entirely disparate, indexing and searching over different
domains. While relevance feedback has traditionally improved information
retrieval performance, it is normally used to improve rankings over a
single data set. Our novel approach is to use relevance
feedback from hypertext Web search to improve the retrieval of Semantic
Web data. We also inspect whether relevance feedback from Semantic Web
data can improve hypertext Web search results. In both cases, an evaluation
based on certain kinds of informational queries (abstract concepts, people,
and places) selected from a query log, judged by human assessors, shows
that relevance feedback works: relevance feedback from hypertext Web
search can improve
the retrieval of Semantic Web data, and vice versa. We evaluate our work
over a wide range of algorithms, and show it improves baseline performance
on these queries for deployed systems as well, such as the Semantic Search
engine FALCON-S and the commercial Web search engine Yahoo! search.
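The feedback step itself can be illustrated with a classic Rocchio-style expansion: terms from documents judged relevant (e.g., via hypertext-search clicks) re-weight the query before it is run against the second collection. The weights and toy documents below are illustrative assumptions:

```python
# Rocchio-style relevance feedback sketch: expand a query with terms
# from relevant feedback documents. alpha/beta are conventional weights.
from collections import Counter

def rocchio(query, relevant, alpha=1.0, beta=0.75):
    """Re-weight a query vector using terms from relevant documents."""
    q = Counter(query.split())
    fb = Counter()
    for doc in relevant:
        fb.update(doc.split())
    n = max(len(relevant), 1)
    return {t: alpha * q[t] + beta * fb[t] / n for t in set(q) | set(fb)}

expanded = rocchio("eiffel tower",
                   ["eiffel tower paris landmark", "paris monument eiffel"])
```

In the cross-domain setting described above, the feedback documents would come from one search space (hypertext or Semantic Web data) while the expanded query is issued against the other.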
Adding semantic context to keyword queries
A bottleneck for providing more accurate search results is the
shallowness on the client side, i.e. users provide only a short keyword
query of 2-3 words on average. Adding context to queries can help to
interpret the query and improve search results. Using implicit or
explicit feedback techniques, queries can be associated with related
concepts, entities, or document types. This context provides new
opportunities to rank search results. The result list can be diversified
by showing results belonging to different concepts, entities or document
types, or results belonging to the same concept or entity can be
aggregated or clustered.
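The diversification idea can be sketched as a round-robin over the concepts associated with each result, so consecutive hits come from different concepts where possible. The concept labels below are invented for the example:

```python
# Sketch: diversify a result list by associated concept (round-robin).
from collections import defaultdict, deque

def diversify(results):
    """results: list of (doc, concept) pairs; interleave across concepts."""
    by_concept = defaultdict(deque)
    for doc, concept in results:
        by_concept[concept].append(doc)
    out = []
    while any(by_concept.values()):
        for concept in list(by_concept):
            if by_concept[concept]:
                out.append(by_concept[concept].popleft())
    return out

ordered = diversify([("d1", "jaguar-car"), ("d2", "jaguar-car"),
                     ("d3", "jaguar-animal")])
```

The same concept association supports the opposite presentation too: grouping results under their concept headings instead of interleaving them.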
Concept Search: Enabling Semantics in Syntactic Search
Concept Search extends syntactic search, i.e., search based on the computation
of string similarity between words, with semantic search, i.e., search
based on the computation of semantic relations between concepts. The main
idea of Concept Search is to keep the underlying machinery of syntactic
search, but to modify it so that, whenever possible, syntactic search
is substituted by semantic search, thus improving the system performance.
This is why we say that Concept Search is semantics-enabled syntactic
search. Semantics can be enabled along different dimensions, on different
levels, and to different extents forming a space of approaches lying between
purely syntactic search and fully semantic search. We call this space
the semantic continuum. In the talk, we will discuss how Concept Search
can be tuned to work at different points in the semantic continuum taking
advantage of semantics when and where possible.
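A minimal point on this semantic continuum is to keep exact string matching as the default and substitute a semantic match only where a known relation (here, a hand-made synonym table, a stand-in for a real lexical resource) applies:

```python
# Minimal "semantics-enabled syntactic search": exact match, upgraded to
# a synonym match when the (hypothetical) lexicon knows a relation.
SYNONYMS = {"film": {"movie", "picture"}, "movie": {"film", "picture"}}

def matches(query_term, doc_term):
    """Syntactic match first; fall back to a known semantic relation."""
    if query_term == doc_term:
        return True
    return doc_term in SYNONYMS.get(query_term, set())

def search(query, docs):
    """Return documents where every query term matches some document term."""
    return [d for d in docs
            if all(any(matches(t, w) for w in d.split()) for t in query.split())]

hits = search("film", ["a great movie", "a tall building"])
```

Moving further along the continuum would mean richer relations (hypernymy, concept identity) and deeper analysis, while the underlying retrieval machinery stays the same.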
Investigating the Semantic Gap
This talk explores techniques to study the semantic gap of semantic search,
i.e. the difference between the data available on the Semantic Web and
what web users are searching for in general. While the data available
on the Semantic Web is relatively well understood and has been widely
studied based on the crawls of Semantic Web search engines, much less
is known about the information needs of Web searchers in terms of structured
data objects. For this reason, we will pay particular attention in this
talk to the demand-side of semantic search and introduce a tool that can
be used to classify web queries using semantic categories as well as to
investigate the various information needs related to each semantic category.
Unique identifiers for the Web
Many applications, and semantic web search in particular, could benefit
greatly from a service on the Web that provides unique identifiers. We
discuss the challenges we face in realizing such a service and present the
architecture of our pre-release system, together with a short demo. We also
give a short overview of the related research activities we are involved in.
These include a dynamic virtual-economy-based resource allocation technique
to ensure high availability for geographically distributed services, and
entity resolution techniques specific to our application. Our talk reports
on some of the ongoing work we pursue in the context of the OKKAM project.
Approximately Optimal Facet Selection
Multifaceted search is a popular interaction paradigm for discovery and
mining applications that allows users to analyze and navigate through
multidimensional data. A crucial aspect of faceted search applications
is selecting the list of facets to display to the user following each
query. We call this the Facet Selection problem.
When refining a query by drilling down into a facet, documents that are
associated with that facet are promoted in the rankings. We formulate
facet selection as an optimization problem aiming to maximize the rank
promotion of certain documents. As the optimization problem is NP-Hard,
we propose an approximation algorithm for selecting an approximately optimal
set of facets per query.
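A common way to approximate such NP-hard selection problems is a greedy algorithm. The sketch below casts facet selection as covering a set of target documents and greedily picks the facet with the largest marginal gain; this coverage objective is a stand-in for the talk's actual rank-promotion formulation:

```python
# Greedy approximation sketch for facet selection, posed as maximizing
# coverage of target documents (illustrative objective, not the paper's).

def greedy_facets(facet_docs, targets, k):
    """Pick up to k facets that greedily cover the most target documents.

    facet_docs: dict mapping facet name -> set of document ids it matches.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(facet_docs,
                   key=lambda f: len((facet_docs[f] & targets) - covered))
        gain = (facet_docs[best] & targets) - covered
        if not gain:  # no facet adds new coverage; stop early
            break
        chosen.append(best)
        covered |= gain
    return chosen

facet_docs = {"genre:drama": {1, 2, 3}, "year:1990s": {3, 4}, "country:fr": {5}}
picked = greedy_facets(facet_docs, targets={1, 2, 3, 4}, k=2)
```

For monotone submodular objectives like coverage, this greedy scheme carries the classic (1 - 1/e) approximation guarantee, which is what makes it a natural candidate when exact optimization is intractable.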
We conducted experiments over hundreds of queries and search results of
a large commercial search engine, comparing two flavors of our algorithm
to facet selection algorithms appearing in the literature. The results
show that our algorithm significantly outperforms those baseline schemes.
Joint work with Sonya Liberman, Technion, Israel.
Semantics and tags: giving tags and labels their due
The last couple of years have seen an increasing development of semantic
approaches to tagging. From Connotea's Entity Describer to Alexandre Passant's
MOAT, it seems that the future of tagging is to associate its powers with
semantic search. However, one question immediately arises: what exactly
is a tag? Trite as it may seem, this question is seldom asked. We shall
address it notwithstanding and offer a definition of tagging that clearly
purports to explain why a tag, strictly speaking, is always devoid of
any semantic content while a "label" is not. This is essential if we want
an account of tagging that is both promising and faithful to users' practices.
Freebase: A socially managed identity database
Freebase is an open, writable database of the world's information.
Community members can create entities, contribute facts, and extend the
data model to represent the things in the world they care about. The
entities within Freebase are reconciled, so each entity is represented
once, gathering all information about the item under one strong
identifier.
This talk will describe how Freebase can be used as a switchboard for
identity on the Web.
Entities in Freebase may contain data from a wide number of sources,
representing many facets of information, which allows users to find
items of interest using queries that cut across domains of interest.
In addition, entities can be annotated with identifiers from other
systems allowing Freebase to direct users (and applications) to other
data systems for additional information.
Data Web Search: What is to be done?
The Web as a global information space is developing from a Web of documents
to a Web of data. This development opens new ways for addressing complex
information needs. Search is no longer limited to matching keywords against
documents, but instead complex information needs can be expressed in a
structured way, with precise answers as results. In this talk, I will
discuss a number of challenges involved in realizing search on the Web
of data. A concrete infrastructure called SearchWebDB addressing some
of these challenges is presented as a solution towards data web search.
I will then elaborate on possible directions and research activities for
realizing more usable, effective and precise search on the Web of data.
Using Web Information for Author Name Disambiguation
In digital libraries, ambiguous author names may occur due to the existence
of multiple authors with the same name (polysemes) or different name variations
for the same author (synonyms). We propose here a new method that uses
information available on the Web to deal with both problems at the same
time. Our idea consists of gathering information from input citations
and submitting queries to a Web search engine, aiming at finding curricula
vitae and Web pages containing publications of the ambiguous authors.
From the content of documents in the answer sets returned by the Web search
engine, useful information that can help in the disambiguation process
is extracted. Using this information, author names are disambiguated by
leveraging a hierarchical clustering method, which groups citations in
the same document together in a bottom-up fashion. Experimental results
show that our method yields results that outperform those of two
state-of-the-art unsupervised methods and are statistically comparable
with those of a supervised one, while requiring no training. We observe
gains of up to 65.2% in the pairwise F1 metric when compared to our best
unsupervised baseline.
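The bottom-up grouping step can be sketched as plain agglomerative clustering: repeatedly merge the two most similar citation clusters while their similarity exceeds a threshold. The real method folds in Web evidence (curricula vitae, publication pages); this simplification uses only term overlap:

```python
# Simplified bottom-up clustering of citations for one ambiguous name.
# Similarity is Jaccard overlap of citation terms (illustration only).

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(citations, threshold=0.2):
    """Agglomeratively merge citation clusters while similarity >= threshold."""
    clusters = [set(c.split()) for c in citations]
    while len(clusters) > 1:
        pairs = [(jaccard(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        sim, i, j = max(pairs)
        if sim < threshold:
            break
        clusters[i] |= clusters.pop(j)
    return clusters

groups = cluster(["neural networks deep learning",
                  "deep learning models",
                  "medieval history europe"])
```

Each surviving cluster stands for one distinct author bearing the ambiguous name; in the full method, the Web-derived evidence would feed into a richer similarity than the term overlap used here.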