Protean Career: June 2011

Tuesday, June 28, 2011

Basic concepts in ESP

In my previous articles, I talked about "How to install FAST ESP in Windows" and "Hello World! -- Set up a test search in ESP". In this article, I'm going to talk about some basic concepts in ESP.
-----------------------------------------------------------
Basic Concepts
Most of the following concepts are from the product overview.

Concept	Description
Document set description, indexes
Applies algorithms or business rule-based ranking to the results
Data Flow Overview
Module Overview	A general overview of all ESP modules
Basic Concepts	The concepts used in ESP
Content	Data that has not yet been submitted to the FAST ESP system
Document	Processed, searchable content is called a document
Collections	Documents are grouped into different collections. Each collection can have its own way of being processed and indexed (Index Profile). Also, by setting a priority for each collection, we can specify the order of document processing.
Search Profile	Defines what to search and how the queries and results should be processed and displayed
Document and Document Element	Each piece of content is converted into a document. Each property of the content is converted into a document element.
Index Schedule, Profile	The FAST Search Engine maps the document's elements to fields. Fields are document elements that are defined to be searchable. Fields are defined in the Index Profile. Multiple fields may be grouped into composite fields, allowing a query to be executed on several fields at the same time.
Enterprise Crawler	Use Enterprise Crawler to access content on Web Site(s)
File Traverser	The file traverser scans specified file directories of file servers.
Pushing Content to Search Engine Using Content API	Use the content API directly to push the content to Search Engine.
Query Side	Three ways to submit a query: Search API, HTTP-based Query Interface, FAST Web Service Interface
Content Interface	Integration of applications via C++, Java, and .NET
Search Interface
Document Processing Interface	Inclusion of customer-defined document processors
Query/Result Processing Interface	Provides an interface for dynamic linking of custom query and result processors
Administration Interface	Supports API integration for system administration and collection configuration
Security Integration	Security Access Module provides document-level security capabilities for integration with your content and portal infrastructure
SDKs	ESP Content SDK, Search SDK, and Application SDK provide various interfacing capabilities.
Web Service Interface	Web services are a collection of standards and protocols that allow computers to communicate across the internet using XML and the ubiquitous HTTP protocol
Document processing is defined per collection
Document Processing Engine, Pipeline, Stage	One search engine contains multiple pipelines, but one collection can only have one pipeline. One pipeline contains multiple stages. One stage performs a particular document processing task. It takes one or more document elements as input, and the resulting output is new or modified elements that may be processed further.
Entity Extraction	Entity extraction is detecting, extracting, and normalizing entities from documents
Extract other entities	Two ways to extract other entities: (1) using the Admin UI to specify an additional extractor; (2) via a regular expression document processor.
Search Engine Clusters	Search Engine instances are grouped into search engine clusters. A search engine cluster is a group of Search Engine instances that share the same index schema, which is provided by an index profile.
Search Columns and Rows	The index is partitioned across search columns to scale data volume: all search engine instances within a search column store the same set of indexed documents. When a query is sent to a cluster, it is sent to all search engine instances within one search row, and rows are added to scale query rate.
Index profile	An index profile is an XML-based configuration file. It’s an index schema that defines how documents are made searchable. It specifies search properties such as: (1) which document elements are to become searchable fields; (2) which document elements are to become fields that are returned as part of a result; (3) how to calculate the values that are used for sorting and ranking.
The relationship between Document Processing, Indexing and Search Engine Clusters
Index Profile Structure
Scope Search	Used for indexing customer XML content without any knowledge of the DTD/schema, and for indexing a more dynamic field structure using the Scope Search framework.
Relevancy, Data mining
Linguistic processing
Sorting
Rank value calculations
Query context analysis
Navigation
Contextual Insight
Ranking Concept
Quality
Freshness Boosting
WebAnalyzer	The WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy
Tools to modify rank for individual documents	Two tools for modifying rank: Search Business Center; Boost Bulk Tool
3 boost mechanisms	Absolute Query Boost: specifies an absolute ranking position for a document against a specified query, or excludes the document from the results of that query. Relative Query Boost: ensures a document is always displayed within the first xx (a number) results for a specified query. Relative Document Boost: ensures a document is always displayed within the first xx (a number) results, whatever query the user submits.
Proximity Ranking and Matching	The term proximity denotes the degree to which a query and a document match, based on the distance between the query terms within a document. Two types of proximity: explicit proximity and implicit proximity.
Field Collapsing	Two kinds of field collapsing: (1) field collapsing which removes collapsed documents; (2) field collapsing which does not remove collapsed documents (default).
Boundary Matching
Duplicate Removal	Different ways of detecting and removing duplicate documents: (1) Crawler Duplicate Removal (the FAST Crawler); (2) Dynamic (Result-Side) Duplicate Removal (may be used to detect and remove duplicates across collections, and also enables a more flexible definition of perceived duplicates); (3) Field Collapsing.
GEO Search Overview	The Geo Search feature provides capabilities for filtering, sorting and boosting query results based on geographical location.
Query Modifications	Query processing is configured globally. There are three ways to modify a query in FAST ESP: (1) as an automatic rewrite of the query before execution against the index; (2) as a suggested rewrite, typically presented as a search tip on the result page; (3) as a combination of the two above: the query is first executed in its original form and, in case of no hits, it is automatically resubmitted using the automatic rewrite option, and the new result is presented to the user.
Query Resubmission	Resubmission is set per query and is used to switch to a suggested transformation of the user’s query. There are three kinds of query transformation. Modify: the query is modified automatically; the modified query is executed and its result set is returned. Conditional Modify: the query is modified automatically only if the executed query returns no hits. Suggest: the query is never modified, but a suggested transformed query is returned together with the result set.
FAST Query Language (FQL)
Navigator	Navigators provide functionality for drilling down into the query results based on value distribution of one or more individual fields.
Field Navigator
Deep Navigator
Shallow Navigator
Scope Navigator
Contextual Navigator
Field Navigators for Values in Scope Fields
Taxonomy
FAST Classifier
Unsupervised Clustering
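
To make the "Search Columns and Rows" idea from the table a bit more concrete, here is a tiny Python toy model. This is not ESP code and it uses no ESP API. It only assumes my reading of the product overview: columns split the index to scale data volume, and rows are copies used to scale query rate. The class name and the hash-based choice of column are made up for the illustration.

```python
from zlib import crc32


class SearchCluster:
    """Toy model of a search cluster laid out as rows x columns (not ESP code)."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        # nodes[row][column] holds the documents indexed on that node
        self.nodes = [[{} for _ in range(columns)] for _ in range(rows)]
        self._next_row = 0

    def index(self, doc_id, text):
        # Assumption: each document lands in exactly one column (partition)...
        column = crc32(doc_id.encode("utf-8")) % self.columns
        # ...and that column is copied to every row (replica).
        for row in self.nodes:
            row[column][doc_id] = text

    def search(self, term):
        # A query is sent to every node of ONE row; rows take turns.
        row = self.nodes[self._next_row]
        self._next_row = (self._next_row + 1) % self.rows
        hits = []
        for node in row:
            hits.extend(doc_id for doc_id, text in node.items() if term in text)
        return sorted(hits)


cluster = SearchCluster(rows=2, columns=3)
cluster.index("a.pdf", "fast esp install")
cluster.index("b.pdf", "file traverser")
print(cluster.search("esp"))
```

Whichever row answers, the result is the same, because every row holds a full copy of the index spread over its columns.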

Data Flow
In this section, I'm going to talk about the data flow. The first one is about how ESP crawls data.
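
As a rough illustration of the content side (content goes through a pipeline of stages and comes out as a document made of elements), here is a minimal Python sketch. It is only a toy under my own assumptions: the stage functions and element names are hypothetical, and real ESP document processors are written against the ESP document processing interface, not like this.

```python
def lowercase_title(doc):
    # A stage takes document elements as input and returns modified elements.
    doc["title"] = doc["title"].lower()
    return doc


def count_words(doc):
    # A stage may also add a new element for later stages to use.
    doc["wordcount"] = len(doc["body"].split())
    return doc


def run_pipeline(content, stages):
    """Pass one piece of content through every stage, in order."""
    doc = dict(content)  # work on a copy; the original content stays untouched
    for stage in stages:
        doc = stage(doc)
    return doc


document = run_pipeline(
    {"title": "Hello ESP", "body": "a test search in ESP"},
    [lowercase_title, count_words],
)
print(document)
```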

The second one is about how ESP handles a user search.
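
On the query side, the three resubmission modes from the concept table (Modify, Conditional Modify, Suggest) can be sketched like this. It is a minimal sketch, not the ESP implementation: `run_query`, the mode strings, and the `search` and `rewrite` callables are all hypothetical names I made up for the example.

```python
MODES = ("modify", "conditional_modify", "suggest")


def run_query(query, search, rewrite, mode):
    """search(q) returns a list of hits; rewrite(q) returns a transformed query."""
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    rewritten = rewrite(query)
    if mode == "modify":
        # Always execute the rewritten query.
        return {"executed": rewritten, "hits": search(rewritten), "suggestion": None}
    hits = search(query)
    if mode == "conditional_modify":
        if not hits:
            # Only resubmit the rewritten query when the original found nothing.
            return {"executed": rewritten, "hits": search(rewritten), "suggestion": None}
        return {"executed": query, "hits": hits, "suggestion": None}
    # "suggest": never modify, just return the rewrite next to the result set.
    return {"executed": query, "hits": hits, "suggestion": rewritten}


# Toy index and toy spell-fix, just to exercise the function.
INDEX = {"install": ["install-guide.pdf"]}


def toy_search(q):
    return INDEX.get(q, [])


def toy_rewrite(q):
    return {"instal": "install"}.get(q, q)


result = run_query("instal", toy_search, toy_rewrite, "conditional_modify")
print(result)
```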

Tuesday, June 21, 2011

Hello World! -- Set up a test search in ESP

In the previous article, I showed you how to set up ESP on Windows. In this article, I'd like to show you how to do a test search in ESP. After this article, you will be able to set up the File Traverser connector in ESP, create a collection and search its content.
---------------------------------------------------------------------------
Setup File Traverser connector
1. Integrate the connector into ESP. Open your $FASTSEARCH/etc/NodeConf.xml file, and then add "<proc>connectorcontroller</proc>" to the <startorder> node.

2. Add the following code after </global>:
    
    <process name="connectorcontroller" description="Connector Controller">
        <start>
            <executable>connectorcontroller</executable>
            <parameters>-P $PORT</parameters>
            <port base="3150"/>
        </start>
        <outfile>connectorcontroller.scrap</outfile>
    </process>

3. Run $FASTSEARCH/bin/nctrl reloadcfg
4. Run $FASTSEARCH/bin/nctrl start connectorcontroller

5. Now the File Traverser should show up as an ESP data source. Set up a shared folder, and then put some PDF files into it.
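
If you prefer to script steps 1 and 2 instead of editing the file by hand, something like the following Python sketch could work. Please treat it as an assumption-heavy example: the SAMPLE below is a made-up, shortened stand-in for NodeConf.xml (the root tag and "someprocess" are hypothetical), so check it against your real file, and back that file up before writing anything to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for NodeConf.xml.
SAMPLE = """<nodeconf>
  <startorder>
    <proc>someprocess</proc>
  </startorder>
  <global/>
</nodeconf>"""

# The process definition from step 2.
CONNECTOR_PROCESS = """<process name="connectorcontroller" description="Connector Controller">
  <start>
    <executable>connectorcontroller</executable>
    <parameters>-P $PORT</parameters>
    <port base="3150"/>
  </start>
  <outfile>connectorcontroller.scrap</outfile>
</process>"""


def add_connector_controller(root):
    """Add the connectorcontroller entries if they are not there yet."""
    startorder = root.find("startorder")
    if startorder is None:
        raise ValueError("no <startorder> element found")
    names = [proc.text for proc in startorder.findall("proc")]
    if "connectorcontroller" not in names:
        # Step 1: register the process in the start order.
        ET.SubElement(startorder, "proc").text = "connectorcontroller"
    if root.find("process[@name='connectorcontroller']") is None:
        # Step 2: put the process definition right after </global>.
        children = list(root)
        global_el = root.find("global")
        position = children.index(global_el) + 1 if global_el is not None else len(children)
        root.insert(position, ET.fromstring(CONNECTOR_PROCESS))
    return root


root = add_connector_controller(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

Running the function twice does not add the entries twice. After changing the real file, you still need the nctrl commands from steps 3 and 4.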

Setup your new collection
1. Go to your FAST ESP login page

2. Enter your user name and password, and click the Log in button to go to your home page

3. Click ESP Admin GUI from the right side

4. In your Collection Overview tab under FAST ESP administration page, click Create Collection button

5. Enter a new collection Name and Description and then click the Next button.

6. Since we are doing a demo search, there is only one cluster we can use. Keep webcluster selected, and then click the Next button.

7. In the pipeline configuration page, click the drop down button, select the Generic (webcluster) pipeline, click the Add Selected button and then click the Next button.

8. In the data source configuration page, click the drop down button, select the File Traverser and then click the Add Selected button.

9. In the data source setup page, enter the shared folder path, keep the other default settings and then click the Submit button.

10. Finally, click the OK button to finish the setup

11. Now, a new collection has been set up and it starts to crawl the shared folder.

12. After a few seconds, click the Refresh button in the top right corner. You will see that a document has been found.

13. Click the Search View from the top.

14. In the new search view tab, you will see the built-in search UI. Enter some keywords and then click search button.

Friday, June 17, 2011

How to install FAST ESP in Windows

I'm lucky that my project will use FAST ESP as its search engine. In the next few months, I will write some articles about FAST ESP. Since I'm a beginner, there may be a lot of errors and mistakes in my articles. So if you spot something, please let me know. I would really appreciate your help. :)

FAST ESP is a product of FAST, a subsidiary of Microsoft; ESP stands for "Enterprise Search Platform". In this article, I will talk about how to install it.
---------------------------------------------------------------------------------------
Prepare to install
Three pieces of software need to be installed before installing FAST ESP manually: the .NET Framework, the Java SDK and the Microsoft Visual C++ 2005 SP1 Redistributable.

1. Install .NET Framework
The .NET Framework can be installed from here.

2. Install Java
You can download the latest Java from Java.com. After the installation completes, you need to configure the Java environment variable. You can refer to another article of mine to see how to do that.
One thing you need to pay attention to: ESP only supports the x86 version of Java, whether you are using an x86 OS or an x64 OS.

3. Install Microsoft Visual C++ 2005 SP1 Redistributable.
Installing the Microsoft Visual C++ 2005 SP1 Redistributable is just as easy as installing Java and the .NET Framework. You can download it from here and then install it.
Note: The link is for the x64 version. If you are using an x86 system, you need to download the x86 one from here.
--------------------------------------------------------------------------
Start to install ESP
All right, we are ready to install the FAST ESP on our server. Let's get into our topic quickly. :)
1. Go to your installation directory, double click the setup.exe file to kick off the installation.

2. In the InstallShield Wizard, click the Next button to go to the next step.

3. In the agreement page, select "I accept...." option, and then click the Next button.

4. In the third party software components page, select "Yes, complete the download and installation" option, and then click the Next button.

5. It then starts downloading all the third party software. This will take a few minutes.

6. After the download completes, it will ask whether you want to install them. Select "Proceed" and then click the Next button

7. In the next step, you need to provide your license file: use the file browser to locate it, and then click the Next button.

8. I'm doing a demo in this article, so we can select Single Node mode, and then click the Next button. More installation options can be found in the ESP Installation Guide.

9. Enter your service username and password, and then click Next button.

In this step, you need to make sure your service account has the "Log on as a service" right. Otherwise, you will receive the following error dialog.

a. To grant your account the right, click Start -> Administrative Tools -> Local Security Policy

b. In the Local Security Policy tool, expand Local Policies, select User Rights Assignment and then select Log on as a service from the right side.

c. Double click the Log on as a service item, add your account to the list, click OK button to close the dialog and then click the Next button again in the FAST ESP installation page.

10. Select an installation path, and then click the Next button.
Note: the path can't contain any white space.

11. Select an index storage path, and then click the Next button.

12. In the language setting page, keep all default and click the Next button.

13. Keep the default port and click the Next button.

14. Keep the Standard web profile selected, and click the Next button. The index profile can be changed later from the admin UI.

15. Type your organization name and administrator email address, and then click the Next button.

16. If you want an email to be sent to your administrator's mailbox when certain events occur, you can configure an SMTP server. For me, I just keep it empty. Click the Next button.

21. In the review page, you can verify your settings and then click Next button

22. Now it starts testing your environment. If everything is fine, it will start to install ESP on your server.

23. After the ESP installation has completed, you can start the service. Click the Next button

24. Click the Finish button to close the dialog.

25. You can use the URL provided in step 24 to visit the admin UI.

--------------------------------------------------------------------------------
Installation Troubleshooting
1. Microsoft Visual C++ 2005 Redistributable can't be found
I was unlucky and ran into this issue. The installation guide says that installing the x64 SP1 one will work. But on my server, it didn't succeed until I installed the x86 one without SP1. It can be downloaded from here.

2. ESP doesn't support daylight saving time.
The installation guide mentions that ESP doesn't support daylight saving time. That means you need to turn it off.
a. Click your date & time from the task bar, and click Change date and time settings.

b. Click the Change time zone button

c. Unselect the Automatically adjust clock for Daylight Saving Time check box, and then click the OK button

--------------------------------------------------------------------
Install Service Patch
1. To install a service patch, you need to stop the service first. Click Start -> All Programs -> FAST ESP -> FAST ESP - Stop

2. Go to your service patch folder and double-click the install.cmd file.
Note: there are a lot of *.exe files; do not use them. Otherwise, you won't be able to connect to your server anymore.

3. Select Install Patch, and then click Next button

4. Select your ESP installation path, and click Next button.

5. Since we are in the demo server, keep selecting Single node installation and then click the Next button

6. It will then check whether your ESP service is running. After that, you need to specify your patch file location, and then click the Next button.

7. Now, it's installing

8. If you are unlucky enough to run into the following issue, you need to kill these two processes manually and then reinstall the service patch.

9. After the service patch installation has completed, you will see the result dialog. Click the Next button to close the dialog.

10. Click Start -> All Programs -> FAST ESP -> FAST ESP - Start to start the service

11. Open your browser, enter the admin URL: http://localhost:16000 to go to the admin page.

12. If you run into the following issue after installing the service patch, you need to take another three steps

a. Disable your IPv6
Click your network icon from task bar, and then click Open Network and Sharing Center

Click the Change adapter settings from the left pane.

Right click your adapter, and then click Properties

Uncheck the Internet Protocol Version 6 (TCP/IPv6) item, and then click the OK button.

b. Edit your registry
Click Start -> Run ...

Enter regedit and then press the Enter key to open the registry editor.

Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\TCPIP6\Parameters, create a DWORD named "DisabledComponents", and set it to "ffffffff"

c. Edit your host file
Go to %windir%\System32\drivers\etc, open the hosts file using Notepad, and enter:
127.0.0.1 full computer name with domain name

After these three steps, refresh your collection page.
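
For step c, if you are not sure what your "full computer name with domain name" is, Python can tell you. Here is a small sketch that builds the hosts line and checks whether a hosts file already has it. The function names are my own, "esp01.example.com" is only a placeholder, and `socket.getfqdn()` returns whatever your machine reports, so please compare it with the name shown in your system properties.

```python
import socket


def hosts_entry(fqdn=None, address="127.0.0.1"):
    """Build the line to add to the hosts file."""
    name = fqdn or socket.getfqdn()
    return address + " " + name


def maps_name(hosts_text, name, address="127.0.0.1"):
    """Return True if hosts_text already maps name to address."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0] == address:
            if name.lower() in (field.lower() for field in fields[1:]):
                return True
    return False


# Placeholder name, only to show the shape of the line.
print(hosts_entry("esp01.example.com"))
```

To check your real file, read %windir%\System32\drivers\etc\hosts into a string and pass it to maps_name together with your machine's name.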