Tuesday, June 28, 2011

Basic concepts in ESP

In my previous articles, I talked about "How to install FAST ESP in Windows" and "Hello World! -- Set up a test search in ESP". In this article, I'm gonna talk about some basic concepts in ESP.
-----------------------------------------------------------
Basic Concepts
Most of the following concepts come from the product overview.

Concept
Description
Document set description, indexes

Applies algorithms or business rule-based ranking to the results

Data Flow Overview

Module Overview
A general overview of all ESP modules
Basic Concepts
An overview of the concepts in ESP
Content
Data that has not yet been submitted to the FAST ESP system
Document
Content that has been processed and made searchable is called a document
Collections
Documents are grouped into different collections. Each collection can have its own way of being processed and indexed (Index Profile). Also, by setting a priority for each collection, we can specify the order in which documents are processed.
Search Profile
Define what to search and how the queries and results should be processed and displayed
Document and Document Element
Each piece of content is converted into a document. Each property of the content is converted into a document element.
Index Schema, Profile
The FAST Search Engine maps a document's elements to fields. Fields are the document elements that have been defined as searchable, and they are defined in the Index Profile. Multiple fields may be grouped into composite fields, allowing a query to be executed on several fields at the same time.
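
To make the composite field idea concrete, here is a rough Python sketch. It is not ESP code: the documents are plain dicts, and the composite field name `content` and the helper `field_matches` are made up for illustration.

```python
from typing import Dict, List

# A document is modelled as element name -> text (an assumption for this sketch).
Document = Dict[str, str]

# Hypothetical composite field: "content" searches "title" and "body" together.
COMPOSITE_FIELDS: Dict[str, List[str]] = {"content": ["title", "body"]}


def field_matches(doc: Document, field: str, term: str) -> bool:
    """Return True if `term` occurs as a word in `field`.

    If `field` names a composite field, every member field is checked.
    """
    targets = COMPOSITE_FIELDS.get(field, [field])
    term = term.lower()
    return any(term in doc.get(name, "").lower().split() for name in targets)


doc = {"title": "FAST ESP basics", "body": "Collections group documents"}
print(field_matches(doc, "title", "collections"))    # False
print(field_matches(doc, "content", "collections"))  # True
```

A query against the composite field finds the term even though it only occurs in the body.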
Enterprise Crawler
Use the Enterprise Crawler to access content on web sites
File Traverser
The file traverser scans specified file directories of file servers.
Pushing Content to Search Engine Using Content API
Use the Content API directly to push content to the search engine.
Query Side
Three ways to query: the Search API, the HTTP-based Query Interface, and the FAST Web Service Interface
Content Interface
Integration of applications via C++, Java, and .NET
Search Interface

Document Processing Interface
Inclusion of customer-defined document processors
Query/Result Processing Interface
Provides an interface for dynamic linking of custom query and result processors
Administration Interface
Supports API integration for system administration and collection configuration
Security Integration
The Security Access Module provides document-level security capabilities for integration with your content and portal infrastructure
SDKs
ESP Content SDK, Search SDK, and Application SDK provide various interfacing capabilities.
Web Service Interface
Web services are a collection of standards and protocols that allow computers to communicate across the internet using XML and the ubiquitous HTTP protocol.
Document processing is defined per collection

Document Processing Engine, Pipeline, Stage
One search engine contains multiple pipelines, but a collection can have only one pipeline. A pipeline contains multiple stages, and each stage performs one particular document processing task. A stage takes one or more document elements as input, and its output is new or modified elements that may be processed further by later stages.
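
Here is a small Python sketch of the pipeline/stage idea. It only shows the shape of the concept and is not the ESP document processor API: the stage functions, the element names and `run_pipeline` are all made up for illustration.

```python
from typing import Callable, Dict, List

# A document is modelled as element name -> value (an assumption for this sketch,
# not the ESP document object).
Document = Dict[str, str]
Stage = Callable[[Document], Document]


def lowercase_title(doc: Document) -> Document:
    """Stage: reads the 'title' element and adds a new 'title_lc' element."""
    out = dict(doc)
    out["title_lc"] = doc.get("title", "").lower()
    return out


def count_words(doc: Document) -> Document:
    """Stage: reads the 'body' element and adds a 'word_count' element."""
    out = dict(doc)
    out["word_count"] = str(len(doc.get("body", "").split()))
    return out


def run_pipeline(stages: List[Stage], doc: Document) -> Document:
    """Run each stage in order; each stage sees the previous stage's output."""
    for stage in stages:
        doc = stage(doc)
    return doc


pipeline = [lowercase_title, count_words]
result = run_pipeline(pipeline, {"title": "Hello ESP", "body": "a test search in ESP"})
print(result)
```

Each stage copies the document and adds elements, so a later stage can build on what an earlier stage produced.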
Entity Extraction
Entity extraction is detecting, extracting, and normalizing entities from documents
Extract other entities
Two ways to extract other entities:
Using Admin UI to specify additional extractor
Via a regular expression document processor
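
To give an idea of the regular expression approach, here is a Python sketch that detects, extracts and normalizes e-mail addresses. The pattern is deliberately simplified, and the `emails` element and `extract_emails` function are my own names, not part of ESP.

```python
import re
from typing import Dict, List, Union

Document = Dict[str, Union[str, List[str]]]

# Hypothetical pattern: a simplified e-mail matcher, chosen only for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(doc: Document) -> Document:
    """Detect e-mail addresses in 'body', normalize them, store them in 'emails'."""
    found = EMAIL.findall(str(doc.get("body", "")))
    # Normalization here is lowercasing plus de-duplication.
    normalized = sorted({match.lower() for match in found})
    out = dict(doc)
    out["emails"] = normalized
    return out


doc = {"body": "Contact Bob@Example.com or alice@example.org, or bob@example.com."}
print(extract_emails(doc)["emails"])
```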
Search Engine Clusters
Search Engine instances are grouped into search engine clusters. A search engine cluster is a group of Search Engine instances that share the same index schema, which is provided by an index profile.
Search Columns and Rows
Each set of indexed documents is stored in all search engine instances within a search column; adding columns scales data volume. That means the nodes in a search column share the same set of indexed documents, and the nodes in a search row together hold the whole index. When a query is sent to a cluster, it is sent to all search engine instances within one search row; adding rows scales query rate.
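
The row/column layout is easier to see in code. The toy model below assumes that each column holds one partition of the index, that the partition is copied to every row, and that a query fans out to all columns of a single row. The `Cluster` class, the hash-based column choice and the round-robin row choice are my own simplifications, not how ESP actually dispatches.

```python
import zlib
from itertools import cycle
from typing import List, Set, Tuple


class Cluster:
    """Toy model of a search engine cluster laid out as rows x columns."""

    def __init__(self, rows: int, columns: int) -> None:
        self.rows = rows
        self.columns = columns
        # nodes[r][c] is the set of document ids held by the node in row r, column c.
        self.nodes: List[List[Set[str]]] = [
            [set() for _ in range(columns)] for _ in range(rows)
        ]
        self._row_picker = cycle(range(rows))

    def index(self, doc_id: str) -> int:
        """Store the document in one column, on every row of that column."""
        column = zlib.crc32(doc_id.encode("utf-8")) % self.columns
        for row in range(self.rows):
            self.nodes[row][column].add(doc_id)
        return column

    def query(self) -> Tuple[int, Set[str]]:
        """Pick one row and visit every column in it; return the ids it can see."""
        row = next(self._row_picker)
        visible: Set[str] = set()
        for column in range(self.columns):
            visible |= self.nodes[row][column]
        return row, visible


cluster = Cluster(rows=2, columns=3)
all_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]
for doc_id in all_ids:
    cluster.index(doc_id)
print(cluster.nodes[0] == cluster.nodes[1])  # every row holds the same partitions
```

Whichever row answers a query, it sees the whole index, which is why extra rows add query capacity and extra columns add data capacity.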
Index profile
An index profile is an XML-based configuration file. It is an index schema that defines how documents are made searchable. It specifies search properties like:
Which document elements are to become searchable fields
Which document elements are to become fields that are returned as part of a result
How to calculate values that are used for sorting and ranking
The relationship between Document Processing, Indexing and Search Engine Clusters

Index Profile Structure

Scope Search
Used for
Indexing customer XML content without any knowledge of the DTD/Schema.
Indexing a more dynamic field structure using the Scope Search framework.
Relevancy, Data mining

Linguistic processing

Sorting

Rank value calculations

Query context analysis

Navigation

Contextual Insight

Ranking Concept

Quality

Freshness Boosting

WebAnalyzer
The WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy
Tools to modify rank for individual documents
Two tools for modifying rank
Search Business Center
Boost Bulk Tool
3 boost mechanisms
Absolute Query Boost: Specifies an absolute ranking position for a document for a specified query. It can also exclude a document from the results for a specified query.
Relative Query Boost: Ensures a document always appears within the first N results for a specified query.
Relative Document Boost: Ensures a document always appears within the first N results, whatever query the user submitted.
Proximity Ranking and Matching
The term proximity denotes the degree to which a query and a document match, based on the distance between the query terms within a document.
Two types of proximity:
Explicit Proximity
Implicit Proximity
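
As a rough illustration of "distance between the query terms", the Python sketch below computes the smallest gap, in word positions, between two terms in a text. It only shows the distance measure. How ESP turns proximity into a rank value is not covered here, and `min_term_distance` is a made-up helper.

```python
from typing import Optional


def min_term_distance(text: str, term_a: str, term_b: str) -> Optional[int]:
    """Smallest distance in word positions between term_a and term_b.

    Returns None when either term does not occur in the text.
    """
    term_a, term_b = term_a.lower(), term_b.lower()
    last_a: Optional[int] = None
    last_b: Optional[int] = None
    best: Optional[int] = None
    for position, token in enumerate(text.lower().split()):
        if token == term_a:
            last_a = position
        if token == term_b:
            last_b = position
        if last_a is not None and last_b is not None:
            distance = abs(last_a - last_b)
            if best is None or distance < best:
                best = distance
    return best


text = "FAST search engine for enterprise search"
print(min_term_distance(text, "fast", "search"))      # 1
print(min_term_distance(text, "enterprise", "fast"))  # 4
```

A smaller distance means the query terms sit closer together in the document.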
Field Collapsing
Two kinds of field collapsing
Field collapsing which removes collapsed documents
Field collapsing which does not remove collapsed documents (default)
Boundary Matching

Duplicate Removal
Different ways of detecting and removing duplicate documents.
Crawler Duplicate Removal (The FAST Crawler)
Dynamic (Result-Side) Duplicate Removal (may be used to detect and remove duplicates across collections, and also enable a more flexible definition of perceived duplicates)
Field Collapsing
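
Here is a small Python sketch of the result-side idea: duplicates are detected on the result list itself, so they can be caught across collections. The definition of "duplicate" used here (same body text after lowercasing and whitespace cleanup) and the function names are assumptions for illustration, not how ESP computes it.

```python
import hashlib
from typing import Dict, List

Result = Dict[str, str]


def signature(result: Result) -> str:
    """Checksum of the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(result.get("body", "").lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(results: List[Result]) -> List[Result]:
    """Keep the first (best-ranked) result for each signature, drop the rest."""
    seen = set()
    unique = []
    for result in results:
        sig = signature(result)
        if sig in seen:
            continue
        seen.add(sig)
        unique.append(result)
    return unique


results = [
    {"id": "a1", "collection": "web", "body": "FAST ESP  overview"},
    {"id": "b7", "collection": "files", "body": "fast esp overview"},
    {"id": "a2", "collection": "web", "body": "Index profile guide"},
]
print([r["id"] for r in remove_duplicates(results)])  # ['a1', 'a2']
```

Changing `signature` changes what counts as a duplicate, which is the flexibility the result-side approach gives you.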
GEO Search Overview
The Geo Search feature provides capabilities for filtering, sorting and boosting query results based on geographical location.
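
To show what filtering and sorting by location means, here is a Python sketch using the haversine great-circle distance. The result dicts, the coordinates and the function names are made up for illustration; this is not the ESP Geo Search API.

```python
import math
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def geo_filter_and_sort(results: List[Result], lat: float, lon: float,
                        radius_km: float) -> List[Result]:
    """Keep results within radius_km of (lat, lon), nearest first."""
    def distance(result: Result) -> float:
        return haversine_km(lat, lon, float(result["lat"]), float(result["lon"]))

    nearby = [r for r in results if distance(r) <= radius_km]
    return sorted(nearby, key=distance)


results = [
    {"name": "Stockholm office", "lat": 59.3293, "lon": 18.0686},
    {"name": "Bergen office", "lat": 60.3913, "lon": 5.3221},
    {"name": "Oslo office", "lat": 59.9139, "lon": 10.7522},
]
nearby = geo_filter_and_sort(results, lat=59.9139, lon=10.7522, radius_km=350)
print([r["name"] for r in nearby])
```

Boosting by location could reuse the same distance, for example by adding rank to closer results, but that part is left out of the sketch.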
Query Modifications
Query processing is configured globally. There are three ways to modify a query in FAST ESP:
As an automatic rewrite of the query before execution against the index
As a suggested rewrite, typically presented as a search tip on the result page
A combination of the two above: The query is first executed in its original form. In case of no hits, the query is automatically resubmitted using the automatic rewrite option, and the new result is presented to the user
Query Resubmission
Resubmission is set per query and controls whether ESP switches to a suggested transformation of the user's query. There are three kinds of query transformation:
Modify: Automatically modified. The modified query is executed and the result set is returned
Conditional Modify: Automatically modified only if no hits are returned by the executed query
Suggest: Never modified. But a suggested transformed query is returned together with the result set.
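
The three kinds of transformation can be sketched like this in Python. The `Mode` enum, the toy index and the toy spelling rewrite are all made up for illustration; in ESP this is driven by query parameters and the query transformers, which are not shown here.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Mode(Enum):
    MODIFY = "modify"
    CONDITIONAL_MODIFY = "conditional modify"
    SUGGEST = "suggest"


def search_with_resubmission(
    query: str,
    mode: Mode,
    rewrite: Callable[[str], str],
    execute: Callable[[str], List[int]],
) -> Tuple[List[int], Optional[str]]:
    """Return (hits, suggestion) according to the resubmission mode."""
    rewritten = rewrite(query)
    if mode is Mode.MODIFY:
        # Always run the modified query.
        return execute(rewritten), None
    hits = execute(query)
    if mode is Mode.CONDITIONAL_MODIFY:
        # Only fall back to the modified query when the original finds nothing.
        if not hits and rewritten != query:
            return execute(rewritten), None
        return hits, None
    # SUGGEST: never modify, but hand the rewrite back as a search tip.
    return hits, (rewritten if rewritten != query else None)


# Toy index and toy spelling rewrite (hypothetical data).
INDEX: Dict[str, List[int]] = {"search": [1, 2]}
SPELLING: Dict[str, str] = {"serach": "search"}


def toy_execute(query: str) -> List[int]:
    return INDEX.get(query, [])


def toy_rewrite(query: str) -> str:
    return SPELLING.get(query, query)


for mode in Mode:
    print(mode.value, search_with_resubmission("serach", mode, toy_rewrite, toy_execute))
```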
FAST Query Language (FQL)

Navigator
Navigators provide functionality for drilling down into the query results based on value distribution of one or more individual fields.
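
In code, a navigator boils down to counting how the values of a field are distributed over the result set, and then filtering on one of those values. Here is a Python sketch with made-up result data; `field_navigator` and `drill_down` are hypothetical helpers, not ESP calls.

```python
from collections import Counter
from typing import Dict, List, Tuple

Result = Dict[str, str]


def field_navigator(results: List[Result], field: str) -> List[Tuple[str, int]]:
    """Value distribution of `field` over the results, most frequent first."""
    counts = Counter(r[field] for r in results if field in r)
    return counts.most_common()


def drill_down(results: List[Result], field: str, value: str) -> List[Result]:
    """Narrow the result set to the documents with the chosen field value."""
    return [r for r in results if r.get(field) == value]


results = [
    {"title": "ESP install guide", "format": "pdf"},
    {"title": "Index profile notes", "format": "html"},
    {"title": "Product overview", "format": "pdf"},
]
print(field_navigator(results, "format"))  # [('pdf', 2), ('html', 1)]
print([r["title"] for r in drill_down(results, "format", "pdf")])
```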
Field Navigator

Deep Navigator

Shallow Navigator

Scope Navigator

Contextual Navigator

Field Navigators for Values in Scope Fields

Taxonomy

FAST Classifier

Unsupervised Clustering


Data Flow
In this section, I'm gonna talk about the data flow. The first one is about how ESP crawls data.

The second one is about how ESP handles a user search.
