Modern Information Retrieval

the concepts and technology behind search

Ricardo Baeza-Yates received a PhD in computer science from the University of Waterloo, Canada, in 1989, related to the Oxford English Dictionary project. Prior to that, he received a bachelor's degree in CS in 1983 from the University of Chile. Later, he also received an MSc in CS (1985), the professional title in electrical engineering (1985), and an MEng in EE (1986) from the same university. He was the president of the Chilean Computer Science Society (SCCC) in two different periods in the 1990s. In 2000 he started a search engine for the Chilean Web that is still active. From 2000 to 2004 he was president of CLEI, the Latin American association of CS departments, as well as the international coordinator of the Ibero-American cooperation program on science and technology (CYTED) in the areas of applied electronics and informatics. In 2002 he founded the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile, which he led until 2005. At the end of 2004 he became ICREA Research Professor at the Department of Information and Communication Technologies of the University Pompeu Fabra in Barcelona, Spain. Since 2006 he has been VP of Yahoo! Research for Europe and Latin America, leading the labs at Barcelona, Spain and Santiago, Chile. In 2008 he added the supervision of a new research lab in Haifa, Israel.

His research interests include Web retrieval and data mining, indexing and searching algorithms. He has been a member of the Board of Governors of the IEEE Computer Society and of the ACM Publications Board. He has been program chair or co-chair of major conferences including ACM SIGIR 2002, ACM CIKM 2007, ACM KDD 2009, and ACM/IEEE/WIC WI/IAT 2009. He has been general chair or co-chair of ACM SIGIR 2005 and ACM WSDM 2009, among other conference roles. He is associate editor of several journals including ACM TOIS, Information Systems and Information Processing & Management. He is co-author of several other books, including the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 250 other publications. He has received the Organization of American States award for young researchers in exact sciences (1993) and, with two Brazilian colleagues, obtained the COMPAQ prize for the best CS Brazilian research article (1997). In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences. In 2007 he was awarded the Graham Medal for innovation in computing, given by the University of Waterloo to distinguished alumni. In 2009 he received the CLEI Latin American distinction for contributions to CS in the region and was named Fellow of the ACM.

He is a member of ACM (Fellow), AMS, EATCS, IEEE (Senior), SCCC, and SIAM.

Berthier Ribeiro-Neto received a PhD degree in Computer Science from the University of California at Los Angeles, in 1995. Prior to that, he received a BSc in Math, a BSc in Electrical Engineering, and an MSc in Computer Science from the Federal University of Minas Gerais (UFMG) at Belo Horizonte, Brazil. In 1996, he became a member of the UFMG Computer Science Department, where he is currently an Associate Professor.

In 2000, Ribeiro-Neto co-founded Akwan Information Technologies, a start-up specializing in search engines for the Brazilian Web, located in Belo Horizonte. In 2001, he took unpaid leave from UFMG and became CEO of Akwan. The company thrived by selling customized search solutions to the corporate market. In 2005, Akwan was acquired by Google to become the Google Engineering Office for Latin America, where Ribeiro-Neto is currently Director of Engineering and Site Lead.

Ribeiro-Neto's main interests are IR systems, Web search, and social networks. He has been involved with a number of research projects financed through Brazilian national agencies such as the Ministry of Science and Technology (MCT) and the National Research Council (CNPq). He was Program Committee Chair of SPIRE (String Processing and Information Retrieval Symposium) in 1998 and of SBBD (Brazilian Symposium on Databases) in 1999, and was Program Committee Co-Chair of the ACM Conference on Web Search and Data Mining (ACM WSDM) in 2009. He has published over 70 papers in various conferences and journals, is an Associate Editor of ACM Transactions on Information Systems (ACM TOIS), and is a member of ACM.

Contributors' Bios

Eric Brown earned his BSc degree at the University of Vermont and MSc and PhD degrees at the University of Massachusetts, all in Computer Science. At UMass Eric was advised by Bruce Croft and was a member of the Center for Intelligent Information Retrieval. Eric joined the IBM T.J. Watson Research lab in 1995 as a Research Staff Member. While at IBM Eric has conducted research in information retrieval, document categorization, text analysis, question answering, bio-informatics, and applications of automatic speech recognition. Since 2007 Eric has been involved in the DeepQA project at IBM and the application of automatic, open domain question answering to build the Watson Question Answering system. The goal of Watson is to achieve human-level question answering performance. Eric has published numerous conference and journal papers, and holds several patents in the areas of text analysis and question answering.

Carlos Castillo is a research scientist at Yahoo! Research Barcelona. Before that, he was a postdoctoral fellow at Sapienza Università di Roma and Universitat Pompeu Fabra in Barcelona. He received his PhD in 2004 from the University of Chile, and has undertaken research on Web crawling, Web characterization, and Web ranking. Dr. Castillo is currently active in the areas of Web Usage and Link Mining. He has published several papers in refereed journals and proceedings of international conferences; organized several workshops and challenges related to Adversarial Web IR; and served on the PC of major conferences in his area (WWW, WSDM, SIGIR, CIKM, etc.).

Marcos André Gonçalves is an Assistant Professor at the Computer Science Department of the Federal University of Minas Gerais (UFMG). He holds a PhD in Computer Science (CS) from Virginia Tech (2004), an MS in CS from State University of Campinas, Brazil (Unicamp, 1997), and a BS, also in Computer Science, from the Federal University of Ceará, Brazil (UFC, 1995). He has served as a referee for various journals (TOIS, TIDE, IP&M, Information Retrieval, Information Systems, etc.) and conferences (SIGIR, CIKM, JCDL, etc.). His research interests include information retrieval, digital libraries, text classification and text mining in general, and he has published a number of papers in these areas. Marcos is an affiliated member of the Brazilian Academy of Sciences.

David Hawking is Chief Scientist at Funnelback in Canberra, Australia. Funnelback is an enterprise and Internet search company spun off to commercialise research by Dr Hawking and his team, initially at the Australian National University (ANU) and later at CSIRO. David is an Adjunct Professor at ANU and supervises a number of PhD students there. He was a Coordinator of the TREC Web Track from 1997 to 2004 and, with Nick Craswell, was responsible for the creation and distribution of text retrieval benchmark collections now in use at over 120 research organisations worldwide. He was a program chair of the ACM SIGIR conference in 2003 and 2006. He holds a PhD from ANU and an honorary doctorate from the University of Neuchâtel. In 2004 he won an Australasian award for computer science research. His research interests include distributed information retrieval, distance-based ranking, personal metasearch, document annotations, efficient retrieval algorithms, Web search, automatic quality rating of health information, meaningful IR evaluation, and, of course, enterprise search.

Marti Hearst is a professor in the School of Information at the University of California Berkeley, with an affiliate appointment in the Computer Science Division. Her primary research interests are user interfaces for search engines, information visualization, natural language processing, and empirical analysis of social media. She has just completed the first book on Search User Interfaces. She received BA, MSc, and PhD degrees in computer science from the University of California, Berkeley, and was a member of the research staff at Xerox PARC from 1994 to 1997. Prof. Hearst has served on the Advisory Council of NSF's CISE Directorate and is co-chair of the Web Board for CACM. She is a member of the Usage Panel for the American Heritage Dictionary and is on the panel of experts. She is on the editorial boards of ACM Transactions on the Web and ACM Transactions on Computer-Human Interaction and was formerly on the boards of Computational Linguistics, ACM Transactions on Information Systems, and IEEE Intelligent Systems.

Mounia Lalmas holds a Microsoft Research/RAEng Research Chair at the Department of Computing Science, University of Glasgow. Before that, she was Professor of Information Retrieval at the Department of Computer Science at Queen Mary, University of London, which she joined in 1999 as a lecturer. Prior to this, she was a Research Scientist at the University of Dortmund in 1998, and a Lecturer from 1995 to 1997 and a Research Fellow from 1997 to 1998 at the University of Glasgow, where she received her PhD in 1996. She is a Chartered IT Professional (CITP) and a Fellow of the British Computer Society (FBCS). She was also the (elected) vice chair, and before this, the Information Director of ACM SIGIR. She is an editorial board member for ACM TOIS, IR (Springer) and IP&M (Elsevier). Her research focuses on the development and evaluation of intelligent access to interactive, heterogeneous and complex information repositories, covering a wide range of domains such as HTML, XML, and MPEG-7. From 2002 until 2007, she co-led with Norbert Fuhr the Evaluation Initiative for XML Retrieval (INEX), a large-scale project with over 80 participating organizations worldwide, which was responsible for defining the nature of XML retrieval and how it should be evaluated. She has given numerous presentations and lectures on XML retrieval and evaluation, for instance at CIKM, SIGIR and ESSIR. She is now working on technologies for aggregated search and bridging the digital divide. She is also currently getting back into theoretical information retrieval, where she is looking at the use of quantum theory to model interactive information retrieval. Her conference roles include workshop co-chair at SIGIR 2004 and 2006, mentoring chair at SIGIR 2009, PR (co-)chair at CIKM 2008 and WI/IAT 2009, workshop chair at CIKM 2010, PC chair at ECIR 2006 (European Conference on Information Retrieval Research), vice co-chair for the XML and Web Data track at WWW 2009, and general co-chair of IIiX 2008 (Information Interaction in Context) and ECDL 2010 (European Conference on Digital Libraries).

Yoelle Maarek is a Senior Research Director at Yahoo! Labs in Israel, which she joined in June 2009. Before this, she was an Engineering Director at the Google Haifa Engineering Center, which she founded in March 2006 and grew to close to 40 researchers and software engineers. Her team at Google Haifa launched one of the most visible features in Web search in recent years: 'Google Suggest', a query completion feature that has been deployed since August 2008 and is available in a series of Google properties, from YouTube and iGoogle to Mobile Search, in most languages. The Haifa team launched features in other domains as well, such as Searching Ads and Interactive Annotations on YouTube. Prior to this, she had been with IBM Research since 1989. While at IBM Research, she held a series of technical and management appointments, first at the T.J. Watson Research Center in New York, USA, and then at the IBM Haifa Research Lab in Israel until February 2006, where she contributed to IBM Enterprise search offerings. Her last two positions were Distinguished Engineer and Department Group Manager in the area of search and collaboration. She graduated from the "Ecole Nationale des Ponts et Chaussees" in Paris, France, and received her DEA (graduate degree) in Computer Science from Paris VI University, both in 1985. She was a visiting PhD student at Columbia University in NY in 1986/87. She received her PhD in Computer Science from the Technion, in Haifa, Israel, in 1989. Yoelle's research interests include information retrieval, Web applications, and collaborative technologies. She has published over 50 papers and articles in these fields. She is active in the research community and has served as chair or vice-chair for several technical tracks at the WWW conference series and as senior or regular PC member at most ACM SIGIR conferences in the last 10 years. She also chaired and moderated multiple workshops and panels at both WWW and SIGIR conferences. Most recently, she served as co-chair (with Andrei Broder) of the Panels track at WWW'2008 and as co-chair (with Wolfgang Nejdl) of the Technical Program of WWW'2009, which was held in Madrid in April 2009. Since 2009, Yoelle has also been a member of the Board of Directors of the Caesarea-Rothschild Institute at Haifa University and of the Board of Governors of the Technion, Israel Institute of Technology.

Christian Middleton currently works as a software engineer. Previously, he was a PhD student in Computer Science and Digital Communications at the Universitat Pompeu Fabra, under the supervision of Ricardo Baeza-Yates. In 2004 he earned a Master's degree, as well as the title of Computer Science Engineer, at the University of Chile. His main areas of interest are Web mining and log analysis. In recent years, he has participated in projects related to Web graph visualization, usage log analysis, and search engine evaluation.

Gonzalo Navarro earned his PhD in Computer Science in 1998 from the University of Chile, where he is currently Professor at the Department of Computer Science. He is also a researcher at the Millennium Institute for Cell Dynamics and Biotechnology. His areas of interest include algorithms and data structures, text searching, compression, and metric space searching. He has headed several research projects around Text Searching and Information Retrieval, such as the Center for Web Research, RIBIDI (an Ibero-American research group on Information Retrieval), and a project funded by Yahoo! Research. Professor Navarro has been PC (co-)chair of several conferences, including SPIRE 2001 and 2005 (String Processing and Information Retrieval) and ACM SIGIR 2005 Posters. He co-created the conference SISAP (Similarity Search and Applications) in 2008. He is a member of the Steering Committees of LATIN (Latin American Theoretical Informatics) and SISAP, and of the Editorial Boards of the Information Retrieval journal and the ACM Journal of Experimental Algorithmics. He has coauthored a book on String Matching (published by Cambridge University Press), 15 book chapters, and about 80 papers in international journals and 140 in international conferences, and has edited 6 international conference proceedings.

Dulce Ponceleón holds an MSc and a PhD degree in computer science from Stanford University. She worked in the Advanced Technology Group at Apple Computer, Inc., where she worked on information retrieval, video compression and audio compression technologies for QuickTime. She was a key contributor to the first software-only videoconferencing system. She is currently at the IBM Almaden Research Center, where she manages the Content Protection Competency Center. She has worked on multimedia content analysis and indexing, video summarization, applications of speech recognition, storage systems, and content protection. She contributed to the ISO MPEG-7 standardization efforts, specifically in Multimedia Description Schemes. She is an IBM technical representative in 4C and Advanced Access Content System (AACS). The 4C Entity has developed content protection standards for recordable and prerecorded media (CPRM/CPPM). Dr. Ponceleón has been the Chair of the 4C Technical Group since 2004. AACS is a content protection standard for managing content stored on the next generation of prerecorded and recorded optical media for consumer use with PCs and CE devices. Dr. Ponceleón has been on the Scientific Advisory Board of a leading NSF multimedia school, and a program committee member of ACM Multimedia, SPIE, SIGIR, IEEE, and several multimedia workshops. She has held workshops on multimedia standards (ACM MM 2000), panels on streaming video (ACM MM 2001), and multimedia information retrieval tutorials (SIGIR 2002, SIGIR 2005 and ICASSP 2006). She holds several patents and has numerous publications in video and audio compression, multimedia information retrieval, content protection, human computer interfaces, numerical linear algebra and non-linear programming.

Edie Rasmussen is currently Professor in the School of Library, Archival and Information Studies at the University of British Columbia, Vancouver, BC, Canada, where she served as Director for six years. Prior to joining UBC she was a Professor in the School of Information Sciences, University of Pittsburgh, Pittsburgh, USA. She has also held appointments at the School of Library and Information Studies at Dalhousie University, Nova Scotia, Canada, the School of Library Science at the Institut Teknologi MARA, Kuala Lumpur, Malaysia, and visiting positions at Nanyang Technological University, Singapore, Victoria University of Wellington, New Zealand, and Oslo University College, Norway. Dr. Rasmussen has been active in the Information Retrieval and Digital Library research communities, serving as a Chair for the ACM SIGIR, ACM DL, ACM/IEEE JCDL, and ASIS&T conferences. She served as President of the American Society for Information Science & Technology, President of the Canadian Council of Information Studies/Conseil Canadien des sciences de l’information, and Co-convenor of the Council of Deans and Directors of the Association for Library and Information Science Education. Her current research interests include indexing and information retrieval in text and multimedia databases and digital libraries.

Malcolm Slaney is a principal scientist at Yahoo! Research and works on all manner of multimedia data. He received his PhD from Purdue University for his work on computed imaging. He is a coauthor, with A. C. Kak, of the IEEE book Principles of Computerized Tomographic Imaging. This book was recently republished by SIAM in their "Classics in Applied Mathematics" Series. He is co-editor, with Steven Greenberg, of the book Computational Models of Auditory Function. Before Yahoo!, Dr. Slaney worked at Bell Laboratories, Schlumberger Palo Alto Research, Apple Computer, Interval Research and IBM's Almaden Research Center. He is also a (consulting) Professor at Stanford's CCRMA, where he organizes and teaches the Hearing Seminar. His research interests include auditory modeling and perception, multimedia analysis and synthesis, compressed-domain processing, music similarity and audio search, and machine learning. For the last several years he has led the auditory group at the Telluride Neuromorphic Workshop.

Nivio Ziviani has a PhD in Computer Science from the University of Waterloo, Canada, 1982. He is a Professor Emeritus at the Department of Computer Science of the Federal University of Minas Gerais, Brazil, where he coordinates the Laboratory for Treating Information (LATIN). He is a member of the Brazilian Academy of Sciences and of the Brazilian National Order of the Scientific Merit in the Comendador class. He is a co-founder of two high tech start-up companies: Miner Technology Group, sold to Folha de São Paulo UOL group in 1999, and Akwan Information Technologies, sold to Google Inc. in 2005. He is the author of books and papers in the areas of algorithm design and information retrieval, the latter his primary area of research. He was General Co-Chair of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) in 2005 and co-founder with Ricardo Baeza-Yates of the International Conference on String Processing and Information Retrieval (SPIRE) in 1993.