SharePoint Search Crawl is slow

In some situations, Microsoft Office SharePoint Server 2007 or Microsoft Office SharePoint Portal Server 2003 takes longer than expected to crawl content in the portal site and update the content index. Performance is slow even though no hard disk, CPU, or network bottleneck exists.

This issue may occur if a proxy server is configured in Microsoft Internet Explorer but not for the search service in SharePoint Server 2007 or in SharePoint Portal Server 2003. In that case, the account that is configured as the default content access account uses the proxy server settings that are configured in Internet Explorer.

WORKAROUND

To work around this issue, remove the proxy settings that are configured in Internet Explorer, and then do one of the following:

  • If you are running SharePoint Server 2007, restart the Office SharePoint Server Search service.
  • If you are running SharePoint Portal Server 2003, restart the Microsoft SharePointPS Search service.

To do so, follow these steps:

  1. Remove the proxy settings that are configured in Internet Explorer. To do so, follow these steps:
    1. Start Internet Explorer (if it is not already started).
    2. On the Tools menu, click Internet Options.
    3. Click the Connections tab, and then click LAN Settings.
    4. Under Proxy server, click to clear the Use a proxy server for your LAN (These settings will not apply to dial-up or VPN connections) check box, and then click OK.
    5. Click OK.
  2. Restart either the Office SharePoint Server Search service or the Microsoft SharePointPS Search service. To do so, follow these steps:
    1. Click Start, point to Administrative Tools, and then click Services.
    2. In the list of services, right-click Office SharePoint Server Search or Microsoft SharePointPS Search, and then click Restart.
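
The restart in step 2 can also be scripted. The following Python sketch is one way to do it, assuming that it runs on the SharePoint server from an elevated prompt and that the Windows net stop and net start commands accept the service display names shown in the steps above. The constant and function names are illustrative and are not part of any SharePoint tooling.

```python
import subprocess

# Display names as they appear in the Services console (taken from the steps above).
MOSS_2007_SEARCH = "Office SharePoint Server Search"
SPS_2003_SEARCH = "Microsoft SharePointPS Search"


def restart_commands(display_name):
    """Build the stop command and the start command for one Windows service."""
    return [["net", "stop", display_name], ["net", "start", display_name]]


def restart_service(display_name):
    """Stop and then start the service; raises CalledProcessError if either step fails."""
    for command in restart_commands(display_name):
        subprocess.run(command, check=True)


# Example call (commented out so that nothing is restarted by accident):
# restart_service(MOSS_2007_SEARCH)
```

Remove the proxy settings first, as in step 1, so that the restarted service no longer picks them up.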

REFERENCE: http://support.microsoft.com/kb/829216

Sites that require forms-based authentication or cookie-based authentication are not crawled in SharePoint Server 2007

When Microsoft Office SharePoint Server 2007 or Microsoft Office SharePoint Server 2007 for Search is directed to crawl content on sites that require forms-based authentication or cookie-based authentication, only the logon page of each site is crawled.

To enable the crawling of sites that require forms-based authentication or cookie-based authentication, apply the hotfix from Microsoft Knowledge Base article 934577 (linked at the end of this section), and then use the AddRule.exe command-line tool. To obtain the AddRule.exe command-line tool, visit the following Microsoft Web site:

http://www.microsoft.com/downloads/details.aspx?FamilyId=D5090BC4-5B4F-411B-8CDE-E37D33F7EFDF

Command-line use

AddRule.exe
    Run without arguments, this command displays the following help text:

    Usage: AddRule.exe <xml file>

    The structure of the input file is specified in the instructions provided with the hotfix.

AddRule.exe <xml file>
    This command adds crawl rules based on the XML file. The rules are added to the end of the current set of crawl rules. The administrator can later change the order by using the user interface.

You may receive the following error messages if the XML file is malformed.

  • If there is no <Rules> tag, you receive the following error message:

    Syntax error: [rules] element not found as the only node at the root.

  • If a required node is missing in the XML file, you receive the following error message:

    Syntax error: <missing node> element unexpected.

  • If a node in the XML file is incorrectly duplicated, you receive the following error message:

    Syntax error: <node name> element already exists for the current rule

  • If the type is not "FORM" or "COOKIE", you receive the following error message:

    Syntax error: unrecognized value for the <type> element

  • If the login_type is not "POST", you receive the following error message:

    Syntax error: unrecognized value for the <login_type> element

Note: If AddRule.exe is rerun with another input file that contains a rule whose path is identical to that of an existing rule, the command modifies the existing rule.
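
Some of these errors can be caught before AddRule.exe is run. The following Python sketch checks only what the messages above state: a Rules element at the root, a type of FORM or COOKIE, and a login_type of POST, with neither element repeated within a rule. The full input schema ships with the hotfix and is not reproduced here, so the sketch assumes that each child of the root is one rule (the element name rule in the example is hypothetical), compares the root name without regard to case, and does not check for missing required nodes. The function name and the message wording are illustrative.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the error descriptions above.
ALLOWED_VALUES = {"type": {"FORM", "COOKIE"}, "login_type": {"POST"}}


def check_rules_xml(xml_text):
    """Return a list of problems; an empty list means none of these checks fired."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag.lower() != "rules":
        return ["[rules] element not found as the only node at the root"]
    problems = []
    # Assumption: every direct child of the root describes one crawl rule.
    for number, rule in enumerate(root, start=1):
        seen = set()
        for node in rule:
            allowed = ALLOWED_VALUES.get(node.tag)
            if allowed is None:
                continue  # elements not named in the text are not checked
            if node.tag in seen:
                problems.append(f"rule {number}: <{node.tag}> element already exists for the current rule")
            seen.add(node.tag)
            if (node.text or "").strip() not in allowed:
                problems.append(f"rule {number}: unrecognized value for the <{node.tag}> element")
    return problems


# Hypothetical input; the real element layout comes from the hotfix instructions.
sample = "<Rules><rule><type>FORM</type><login_type>GET</login_type></rule></Rules>"
for problem in check_rules_xml(sample):
    print(problem)
```

A clean result here does not guarantee that AddRule.exe accepts the file; it only rules out the specific mistakes listed above.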


Crawl rules object model

The CrawlRuleAuthenticationType enumeration includes the following new values:

  • FormsRuleAccess = 4
  • CookieRuleAccess = 5

The SetCredentials method in the crawl rules object model is overloaded with two new implementations.

The forms-based authentication rule takes the following input parameters in the following order:

  • type::CrawlRuleAuthenticationType: This will be FormsRuleAccess.
  • AuthSubmissionMethod::String: This will be "POST".
  • AuthSubmissionPath::String: This is the URL to which the parameters should be posted.
  • authData::NameValueCollection: This is where the hidden name-value pairs are stored.
  • privateAuthData::NameValueCollection: This is where the encrypted name-value pairs, such as user names and passwords, are stored.
  • errorPages::StringCollection: This will store the various error pages that would indicate to the crawler to refetch a cookie or to fail the URL with an "Access Denied" error message.

The cookie-based authentication rule takes the following input parameters in the following order:

  • type::CrawlRuleAuthenticationType: This will be CookieRuleAccess.
  • cookies::StringCollection: This will store the cookies that the crawler should use.
  • errorPages::StringCollection: This will store the various error pages that would indicate to the crawler to fail the URL with an "Access Denied" error message.

Note: The name-value pairs and the cookies are encrypted by using the same mechanism that is currently available.
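
To make the two parameter orders easier to compare, the following Python sketch models them as plain tuples. It is not the SharePoint API: the real SetCredentials overloads belong to the .NET crawl rules object model, and the Python function names, argument names, and sample values here are hypothetical. Only the enumeration values, the "POST" constant, and the order of the arguments come from the text above.

```python
from enum import IntEnum


class CrawlRuleAuthenticationType(IntEnum):
    """Only the two new values named above; the existing members are omitted."""
    FormsRuleAccess = 4
    CookieRuleAccess = 5


def forms_rule_arguments(submission_path, auth_data, private_auth_data, error_pages):
    """Return the forms-based arguments in the documented order."""
    return (
        CrawlRuleAuthenticationType.FormsRuleAccess,  # type
        "POST",                                       # AuthSubmissionMethod
        submission_path,                              # AuthSubmissionPath
        dict(auth_data),                              # authData (hidden fields)
        dict(private_auth_data),                      # privateAuthData (encrypted by the product)
        list(error_pages),                            # errorPages
    )


def cookie_rule_arguments(cookies, error_pages):
    """Return the cookie-based arguments in the documented order."""
    return (
        CrawlRuleAuthenticationType.CookieRuleAccess,  # type
        list(cookies),                                 # cookies
        list(error_pages),                             # errorPages
    )


# Placeholder values for illustration only.
forms_args = forms_rule_arguments(
    "https://example.com/login",
    {"returnUrl": "/"},
    {"username": "crawler", "password": "placeholder"},
    ["https://example.com/error"],
)
cookie_args = cookie_rule_arguments(["session=placeholder"], ["https://example.com/error"])
```

The forms-based variant carries six arguments and the cookie-based variant three; both end with the error pages.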

REFERENCE: http://support.microsoft.com/kb/934577