Link Search Menu Expand Document
Table of Contents

Developer Reference

Intended Audience

This guide is intended to provide developers with detailed reference information about Swirl. Please refer to the Developer Guide for an overview of how to work with Swirl.

State Table

The following table describes in more detail all the steps in the federation process, with the associated status and other important state information.

ActionModuleStatusNotes
Search object createdviews.py SearchViewSet.list()Search.status:
NEW_SEARCH
UPDATE_SEARCH
Required:
Search.query_string
Pre-processingsearch.py search()Search.status:
PRE_PROCESSING
Checks permissions
Loads the Search object
Pre-query processingsearch.py search()Search.status:
PRE_QUERY_PROCESSING
Processes Search.query_string and updates Search.query_string_processed
Federationsearch.py search()Search.status:
FEDERATING
FEDERATING_WAIT_*
FULL_RESULTS
Creates one Connector for each SearchProvider in the Search
Connector Initconnectors/connector.py
connectors/db_connector.py
Connector.status:
INIT
READY
Loads the Search and SearchProvider
Connector Federatefederate()Connector.status:
FEDERATING
 
Connector Query Processingprocess_query()FEDERATINGProcess Search.query_string_processed and store in Connector.query_string_to_provider
Connector Construct Queryconstruct_query()FEDERATINGTake Connector.query_string_to_provider and create Connector.query_to_provider
Connector Validate Queryvalidate_query()FEDERATINGReturns “False” if Connector.query_to_provider is empty
Connector Execute Searchexecute_search ()FEDERATINGConnect to the SearchProvider
Execute the search using Search.query_to_provider
Store the response in Connector.response
Connector Normalize Responsenormalize_response()FEDERATINGTransform Connector.response into JSON list of dicts
Store it in Connector.results
Connector Process Resultsprocess_results()Connector.status:
FEDERATING
READY
Process Connector.results
Connector Save Resultssave_results()Connector.status:
READY
Returns “True”
Post-result processingsearch.py search()Search.status:
POST_RESULT_PROCESSING
FULL_RESULTS_READY
FULL_UPDATE_READY
Runs the post_result_processors
Updates Result objects

Search.Status

Normal States

StatusMeaning
NEW_SEARCHThe search object is to be executed immediately
UPDATE_SEARCHThe search object is to be updated immediately
PRE_PROCESSINGSwirl is performing pre-processing for this search
PRE_QUERY_PROCESSINGSwirl is performing pre-query processing for this search
FEDERATINGSwirl is provisioning Celery workers with Connectors and waiting for results
FEDERATING_WAIT_nSwirl has been waiting for the number of seconds indicated by n
FULL_RESULTSSwirl has received all results
NO_RESULTSSwirl received no results
PARTIAL_RESULTSSwirl has received results from some providers, but not all
POST_RESULT_PROCESSINGSwirl is performing post-result processing
PARTIAL_RESULTS_READYSwirl has processed results from responding providers
PARTIAL_UPDATE_READYSwirl has processed updated results from responding providers
FULL_RESULTS_READYSwirl has processed results for all specified providers
FULL_UPDATE_READYSwirl has processed updated results for all specified providers

Error States

StatusMeaning
ERR_DUPLICATE_RESULT_OBJECTSMore than one Result object was found; contact support for assistance.
ERR_NEED_PERMISSIONThe Django User did not have sufficient permissions to perform the requested operation. More: Permissioning Normal Users
ERR_NO_ACTIVE_SEARCHPROVIDERSSearch failed because no specified SearchProviders were active
ERR_NO_RESULTSSwirl has not received results from any source
ERR_NO_SEARCHPROVIDERSSearch failed because no SearchProviders were specified
ERR_RESULT_NOT_FOUNDA Result object that was expected to be found, was not; contact support for assistance.
ERR_RESULT_PROCESSINGAn error occurred during Result processing - check the logs/celery-worker.log for details
ERR_SUBSCRIBE_PERMISSIONSThe user who created the Search object lacks permission to enable subscribe mode

Key Module List

SearchProvider Object

A SearchProvider defines some searchable source. It includes metadata identifying the type of connector used to search the source and much more. Many properties are optional when creating a SearchProvider object. They will have the default values shown below.

Properties

PropertyDescriptionDefault Value (Example Value)
idUnique identifier for the SearchProviderAutomatic (1)
nameHuman-readable name for the source”” ("Enterprise Search PSE")
ownerThe username of the Django user that owns the objectLogged-in user ("admin")
sharedBoolean setting: if true the SearchProvider can be searched by other users, if false it is only available to the ownertrue (false)
date_createdThe time and date at which the SearchProvider was createdAutomatic (2022-02-28T17:55:04.811262Z)
date_updatedThe time and date at which the SearchProvdier was updatedAutomatic (2022-02-29T18:03:02.716456Z)
activeBoolean setting: if true the SearchProvider is used, if false it is ignored when federatingfalse (true)
defaultBoolean setting: if true the SearchProvider will be queried for searches that don’t specify a searchprovider_list; if false, the SearchProvider must be specified in the searchprovider_listfalse (true)
connectorName of the Connector to use for this source”” ("RequestsGet")
urlThe URL or other string including file path needed by the Connector for this source; not validated”” ("https://www.googleapis.com/customsearch/v1")
query_templateA string with optional variables in form {variable}; the Connector will bind the query_template with required data including the url and query_string, as well as any query_mappings or credentials, at runtime. Note this format is not yet used by the Sqlite3 Connector.”” ("{url}?q={query_string}")
post_query_templateFor the RequestsPost Connector: valid JSON and a marker for the query text which then sent as the POST body”” ("query": "{query_string}","limit": "100")
http_request_headersA string for passing custom HTTP Request Header values to the source along with query”” ("Accept": "application/vnd.github.text-match+json")
query_processorsA list of processors to use to prepare the query for this source”” ("AdaptiveQueryProcessor")
query_mappingsA string defining the mappings for the query. Depends on the Connector used.”” ("cx=your-google-search-engine-key")
result_grouping_fieldUsed with the DedupeByFieldResultProcessor result processor, this string defines the specific source field to use for duplicate suppression.”” ("resource.conversationId")
result_processorsA list of processors to use to normalize results from this source“CosineRelevancyResultProcessor” ("MappingResultProcessor","CosineRelevancyResultProcessor")
response_mappingsList of response keys and JSONPaths to transform this providers response into a JSON result set”” ("FOUND=searchInformation.totalResults, RETRIEVED=queries.request[0].count, RESULTS=items")
result_mappingsList of keys, optionally with values; see User Guide, Result Mappings for notes on special keys”” ("url=link,body=snippet,cacheId,NO_PAYLOAD")
results_per_queryThe number of results to request from this source for each query10 (20)
credentialsThe credentials to use for this source. Dependent on the source.”” ("key=your-google-json-api-key")
eval_credentialsA credential variable set in the session and then be used in the SearchProvider.”” ("session["my-connector-token"]")
tagsList of strings that organize SearchProviders into groups.”” ("News", "EnterpriseSearch")

As of version 2.5, the revised CosineRelevancyResultProcessor must be added last to the list of result_processors.

APIs

URLExplanation
/swirl/searchproviders/List SearchProvider objects, newest first; create a new one using the POST button and the form at the bottom
/swirl/searchproviders/id/Retrieve a specific SearchProvider object; destroy it using the DELETE button; edit it using the PUT button and the form at the bottom

Search Object

Search objects are JSON dictionaries that define the searches some user or system desires to have run. They have unique IDs. They may be linked to by Result objects.

The only required property is a query_string with the actual text to be searched. All other properties are optional when creating a Search object, and they will have default values shown below. As Swirl executes a search, asynchronously, it will update properties of the Search object like status and result_url in real-time.

Properties

PropertyDescriptionDefault Value (Example Value)
idUnique identifier for the SearchAutomatic (search_id=1)
ownerThe username of the Django user that owns the objectlogged-in user (admin)
date_createdThe time and date at which the Search was createdAutomatic (2022-02-28T17:55:04.811262Z)
date_updatedThe time and date at which the Search was updatedAutomatic (2022-02-28T17:55:07.811262Z)
query_stringThe query to be federated - the only required field!”” (knowledge management)
query_string_processedThe Search query, modified by any pre-query processing”” (“”)
sortThe type of search to be runrelevancy (date)
results_requestedThe number of results, overall, the user has requested10 (25)
searchprovider_listA list of the SearchProviders to search for this query; an empty list, the default, searches all sources[] ([ "Enterprise Search Engines - Google PSE" ])
subscribeIf True, Swirl will update this Search as per the Celery-Beats scheduleFalse (True)
statusThe execution status of this search (see below)NEW_SEARCH (FULL_RESULTS_READY)
pre_query_processorsA list of processors to apply to the query before federation starts”” ([ "SpellcheckQueryProcessor" ])
post_result_processorsA list of result processors to apply to the results after federation is complete”” ([ "DedupeByFieldPostResultProcessor", "CosineRelevancyPostResultProcessor" ])
result_urlLink to the initial Result object for the Search which uses the RelevancyMixerAutomatic ("http://localhost:8000/swirl/results?search_id=17&result_mixer=RelevancyMixer")
new_result_urlLink to the updated Result object for the search which uses the RelevancyNewItemsMixerAutomatic ("http://localhost:8000/swirl/results?search_id=17&result_mixer=RelevancyNewItemsMixer")
messagesMessages from SearchProviders”” (Retrieved 1 of 1 results from: Document DB Search)
result_mixerThe name of the Mixer object (see below) to use for ordering resultsRoundRobinMixer (Stack2Mixer)
retentionThe retention setting for this object; 0 = retain indefinitely; see Search Expiration Service for details0 (2 for daily deletion)
tagsParameter (string) that can be passed into a search and will be attached to the Search object that is stored in Swirl”” ({ "query_string": "knowledge management", "tags": ["max_length:50"] })

There are some special Search tags that control query processing. For example, the SW_RESULT_PROCESSOR_SKIP Search tag can be used to skip a processor for the Search it is specified for: SW_RESULT_PROCESSOR_SKIP:DedupeByFieldResultProcessor

APIs

URLExplanation
/swirl/search/List Search objects, newest first; create a new one using the form at the bottom and the POST button
/swirl/search/id/Retrieve a specific Search object; destroy it using the DELETE button; edit it using the form at the bottom and the PUT button
/swirl/search/?q=some+queryCreate a Search object with default values except for query_string which is set to the q= parameter value; return redirects to result set; also accepts providers parameter
/swirl/search/?qs=some+queryCreate Search object as above, but return results synchronously; also accepts providers and result_mixer parameters
/swirl/search/?rerun=idRe-run a Search, deleting any previously stored Results
/swirl/search/?update=idUpdate a Search with new results since the last time you ran it

Result Objects

A Result object is the normalized, re-ranked result for a single Search, from a single SearchProvider. They are created at the end of the federated search process in response to the creation of a Search object. They are the only Swirl object that has a foreign key (search.id).

Only Connectors should create Result objects.

Developers are free to operate on individual Results as needed for their application.

However, the goal of Swirl (and federated search in general) is to provide unified results from all sources. Swirl uses Mixers to make this quick and easy.

Properties

PropertyDescriptionExample Value
idUnique identifier for the Result1
ownerThe username of the Django user that owns the objectadmin
date_createdThe time and date at which the Result was created.2022-02-28T17:55:04.811262Z
date_updatedThe time and date at which the Result was updated2022-02-28T19:55:02.752962Z
search_idThe id of the associated Search; there may be many Result objects with this id18
searchproviderThe name value of the SearchProvider that provided this result list"OneDrive Files - Microsoft 365"
query_to_providerThe exact query sent to the SearchProviderhttps://www.googleapis.com/customsearch/v1?cx=google-search-engine-id&key=google-json-api-key&q=strategy
query_processorsThe names of the Processors, specified in the SearchProvider, that processed the query"AdaptiveQueryProcessor"
result_processorsThe names of the Processors, specified in the SearchProvider, that normalized the results"MappingResultProcessor","CosineRelevancyResultProcessor"
result_processor_json_feedbackTBD(See a full result object)
messagesA list of any messages (strings) from the SearchProviderRetrieved 10 of 249 results from: OneDrive Files - Microsoft 365
statusThe readiness of the result setREADY
retrievedThe number of results Swirl retrieved from this SearchProvider for this query10
foundThe total number of results reported by the SearchProvider for this query2309
timeThe time it took for the SearchProvider to create this result set, in seconds1.9
json_resultsThe normalized JSON results from this SearchProvider(See below)

json_results

FieldDescriptionExample
swirl_rankSwirl’s relevancy ranking for this result1
swirl_scoreA metric showing the relevancy of the result to the query. It is not meant to be meaningful as a number otherwise. It is only shown if &explain=True is set.1890.6471312936828
searchproviderThe human-readable name for the SearchProvider"OneDrive Files - Microsoft 365"
searchprovider_rankThe SearchProvider’s ranking of this result3
titleThe source-reported title for this result, with search term matches highlightedGerman car industry to invest in <em>electric</em> <em>vehicles</em> ...
urlThe URL for this result; may be reported by the source and/or calculated by Swirlhttp://pwc.com/etc
bodyThe source-reported result snippet(s), with search term matches highlighted<em>Technology</em> strategy encompasses a full set of Consulting capabilities ...
date_publishedThe source-reported time and date of publication of this result, when available (unknown if not). This is the field used by the DateMixerunknown
date_published_displayOptional SearchProvider field for mapping a different publish date value for display purposes.... date_published=foo.bar.date1,date_published_display=foo.bar.date2 ...
date_retrievedThe time and date at which Swirl received this result from the source2022-02-20 03:45:03.207909
authorThe source-reported author of this result, when available"CNN staff"
title_hit_highlightsTBDTBD
body_hit_highlightsTBDTBD
payloadA dictionary of all remaining keys in the SearchProvider’s response{}
explainA dictionary containing (a) the matching word stems, (b) similarity scores for each match in each field, (c) length adjustments and other metadata, (d) hits information. It is only shown if &explain=True is set.{}

Note that PAYLOAD can be totally different from SearchProvider to SearchProvider. It is up to the caller to access the PAYLOAD and extract whatever is needed, generally by making a new Processor (see below) or adding result_mappings.

APIs

URLExplanation
/swirl/results/List Result objects, newest first; create a new one using the POST button and form at bottom
/swirl/results/id/Retrieve a specific Result object; destroy it using the DELETE button; edit it using the PUT button and form at bottom
/swirl/results/?search_id=search_idRetrieve unified Results for a specific Search, ordered by the Mixer specified; also accepts result_mixer and page parameters

Connectors

Connectors are objects responsible for searching a specific type of SearchProvider, retrieving the results and normalizing them to the Swirl format. This includes calling query and result processors.

Both connector.py and db_connectory.py are base classes from which other connector classes are derived. While requests.py is a wrapper called by RequestsGet. Two utility functions, mappings.py and utils.py, can be used by other Connectors.

For Release 2.5.1, requests.py was updated to handle XML responses from source APIs and convert them to JSON for mapping in SearchProvider configurations.

The following table describes the included source Connectors:

ConnectorDescriptionInputs
BigQuerySearches Google BigQueryquery_template, credentials (project JSON token file)
ChatGPTAsks OpenAI ChatGPTcredentials
ElasticSearches Elasticsearchurl or cloud_id, query_template, index_name, credentials
Microsoft GraphUses the Microsoft Graph API to search M365 contentcredentials
OpenSearchSearches OpenSearchurl, query_template, index_name, credentials
PostgreSQLSearches PostgreSQL databaseurl (connection parameters), query_template, credentials
RequestsGetSearches any web endpoint using HTTP/GET with JSON response, including Google PSE, SOLR, Northern Light and more (see below)url, credentials
RequestsPostSearches any web endpoint using HTTP/POST with JSON response, including M365url, credentials
Sqlite3Searches SQLite3 databasesurl (database file path), query_template

Connectors are specified in, and configured by, SearchProvider objects.

BigQuery

The BigQuery connector uses the Google Cloud Python package.

The included BigQuery SearchProvider is intended for use with the Funding Data Set but can be adapted to most any configuration.

{
    "name": "Company Funding Records - BigQuery",
    "active": false,
    "default": false,
    "connector": "BigQuery",
    "query_template": "select {fields} from `{table}` where search({field1}, '{query_string}') or search({field2}, '{query_string}');",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "fields=*,sort_by_date=fundedDate,table=funding.funding,field1=company,field2=city",
    "result_processors": [
        "MappingResultProcessor",	
        "CosineRelevancyResultProcessor"
    ],
    "result_mappings": "title='{company}',body='{company} raised ${raisedamt} series {round} on {fundeddate}. The company is located in {city} {state} and has {numemps} employees.',url=permalink,date_published=fundeddate,NO_PAYLOAD",
    "credentials": "/path/to/bigquery/token.json",
    "tags": [
        "Company",
        "BigQuery"
    ]
}

More information: BigQuery Documentation

ChatGPT

For Release 2.5.1, both the ChatGPT Connector and QueryProcessor were updated to use the ChatCompletion method which supports the latest GPT models, including GPT-4, and a much greater range of interactivity. The ChatGPT SearchProvider now queries the GPT-3.5-Turbo model by default.

The ChatGPT Connector makes use of the OpenAI APIs. It will return at most one result.

ChatGPT responses usually rank highly since they are written on-the-fly in response to the user’s query, reflect the terminology associated with that query, and are titled for the query.

The included SearchProvider is pre-configured with a “Tell me about: …” prompt. The ChatGPTQueryProcessor also contains other query options for ChatGPT.

{
    "name": "ChatGPT - OpenAI",
    "active": false,
    "default": true,
    "connector": "ChatGPT",
    "url": "",
    "query_template": "",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "PROMPT='Tell me about: {query_to_provider}'",
    "result_processors": [
        "GenericResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "response_mappings": "",
    "result_mappings": "BLOCK=ai_summary",
    "results_per_query": 10,
    "credentials": "your-openai-API-key-here",
    "tags": [
        "ChatGPT",
        "Question"
    ]
}

The Question and ChatGPT SearchProvider Tags make it easy to target ChatGPT by starting a query with either of them. For example:

Question: Tell me about knowledge management software?

You must remove the default result_mappings value of BLOCK=ai_summary in the SearchProvider configuration to enable the ChatGPT or Question Tags! Otherwise, these Tags will be ignored.

ChatGPT SearchProvider Tags

The following three Tags are available for use in the ChatGPT SearchProvider to help shape the Prompt or Default Role passed to ChatGPT along with the user’s query. To utilize these options:

  1. Add a valid OpenAI API Key to Swirl’s .env file (found in the swirl-home directory), and then restart the Swirl for it to take effect.
  2. Add the ChatGPTQueryProcessor to the SearchProvider query_processors list:
     "query_processors": [
         "AdaptiveQueryProcessor",
         "ChatGPTQueryProcessor"
     ],
    

Available Tags

  • CHAT_QUERY_REWRITE_PROMPT: Override the default Prompt used to rewrite the query.
    "CHAT_QUERY_REWRITE_PROMPT:Write a more precise query of similar length to this : {query_string}"
    
  • CHAT_QUERY_REWRITE_GUIDE: Override the system role passed to ChatGPT.
    "CHAT_QUERY_REWRITE_GUIDE:You are a helpful assistant that responds like a pirate captain"
    
  • CHAT_QUERY_DO_FILTER: Turn on or off the default internal filter of ChatGPT responses.
    "CHAT_QUERY_DO_FILTER:false"
    

ChatGPT query_mapping

The following query_mapping is also available for use in the ChatGPT SearchProvider to help shape the Default Role passed to ChatGPT along with the user’s query. To utilize this query_mapping, add a valid OpenAI API Key to Swirl’s .env file (found in the swirl-home directory) and then restart the Swirl for it to take effect.

  • CHAT_QUERY_REWRITE_GUIDE: Override the system role passed to ChatGPT.
    "query_mappings": "PROMPT='Tell me about: {query_to_provider}',CHAT_QUERY_REWRITE_GUIDE='You are a helpful assistant that responds like a pirate captain'",
    

Elastic & OpenSearch

The Elastic and OpenSearch Connectors make use of each engine’s Python client.

Here is an example of a SearchProvider to connect to ElasticCloud using the cloud_id:

{
    "name": "ENRON Email - Elastic Cloud",
    "connector": "Elastic",
    "url": "your-cloud-id-here",
    "query_template": "index='{index_name}', query={'query_string': {'query': '{query_string}', 'default_field': '{default_field}'}}",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "index_name=email,default_field=content,sort_by_date=date_published.keyword,NOT=true,NOT_CHAR=-",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "result_mappings": "url=_source.url,date_published=_source.date_published,author=_source.author,title=_source.subject,body=_source.content,_source.to,NO_PAYLOAD",
    "credentials": "(\"elastic\", \"elastic-password\")",
    "tags": [
        "Enron",
        "Elastic"
    ]
}

This is the default OpenSearch SearchProvider that connects to a local instance of the engine:

{
    "name": "ENRON Email - OpenSearch",
    "active": false,
    "default": false,
    "connector": "OpenSearch",
    "url": "http://localhost:9200/",
    "query_template": "{\"query\":{\"query_string\":{\"query\":\"{query_string}\",\"default_field\":\"{default_field}\",\"default_operator\":\"and\"}}}",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "index_name=email,default_field=content,sort_by_date=date_published.keyword,NOT=true,NOT_CHAR=-",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "result_mappings": "url=_source.url,date_published=_source.date_published,author=_source.author,title=_source.subject,body=_source.content,_source.to,NO_PAYLOAD",
    "credentials": "'admin','password'",
    "tags": [
        "Enron",
        "OpenSearch"
    ]
}

Note the use of JSONPaths in the result_mappings. This is essential for Elastic and OpenSearch since they embed results in a _source field unless otherwise configured. For more details consult the User Guide, Result Mappings section.

Use the PAYLOAD Field to store extra content that doesn’t map to an existing item.

Microsoft Graph

The Microsoft Graph connector uses the Microsoft Graph API to query M365 content. The user must authenticate via OAuth2 in order to enable searching of their M365 content.

There are 5 pre-configured SearchProviders for querying the authenticated user’s M365 content:

  • Outlook Messages
  • Calendar Events
  • OneDrive Files
  • SharePoint Sites
  • Teams Chat
{
        "name": "Outlook Messages - Microsoft 365",
        "active": false,
        "default": true,
        "connector": "M365OutlookMessages",
        "url": "",
        "query_template": "{url}",
        "query_processors": [
            "AdaptiveQueryProcessor"
        ],
        "query_mappings": "NOT=true,NOT_CHAR=-",
        "result_grouping_field": "conversationId",
        "result_processors": [
            "MappingResultProcessor",
            "DedupeByFieldResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
        "response_mappings": "",
        "result_mappings": "title=resource.subject,body=summary,date_published=resource.createdDateTime,author=resource.sender.emailAddress.name,url=resource.webLink,resource.isDraft,resource.importance,resource.hasAttachments,resource.ccRecipients[*].emailAddress[*].name,resource.replyTo[*].emailAddress[*].name,NO_PAYLOAD",
        "results_per_query": 10,
        "credentials": "",
        "eval_credentials": "",
        "tags": [
            "Microsoft",
            "Email"
        ]
    }

PostgreSQL

The PostgreSQL connector uses the psycopg2 driver.

Installing the PostgreSQL Driver

To use PostgreSQL with Swirl:

  1. Install PostgreSQL
  2. Modify the system PATH so that pg_config from the PostgreSQL distribution runs from the command line
  3. Install psycopg2 using pip:
pip install psycopg2
  1. Uncomment the PostgreSQL Connector in the following modules:
# uncomment this to enable PostgreSQL
# from swirl.connectors.postgresql import PostgreSQL
    CONNECTOR_CHOICES = [
        ('RequestsGet', 'HTTP/GET returning JSON'),
        ('Elastic', 'Elasticsearch Query String'),
        # Uncomment the line below to enable PostgreSQL
        # ('PostgreSQL', 'PostgreSQL'),
        ('BigQuery', 'Google BigQuery'),
        ('Sqlite3', 'Sqlite3')
    ]
  1. Run Swirl setup:
python swirl.py setup
  1. Restart Swirl:
python swirl.py restart
  1. Add a PostgreSQL SearchProvider like the one provided with the Funding Dataset

Suggestions on how to improve this process are most welcome!

Here is an example of a basic SearchProvider using PostgreSQL:

{
    "name": "Company Funding Records - PostgreSQL",
    "default": false,
    "connector": "PostgreSQL",
    "url": "host:port:database:username:password",
    "query_template": "select {fields} from {table} where {field1} ilike '%{query_string}%' or {field2} ilike '%{query_string}%';",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "fields=*,sort_by_date=fundedDate,table=funding,field1=city,field2=company",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "result_mappings": "title='{company} series {round}',body='{city} {fundeddate}: {company} raised usd ${raisedamt}\nThe company is headquartered in {city} and employs {numemps}',date_published=fundeddate,NO_PAYLOAD",
    "tags": [
        "Company"
    ]
}

Note: Putting a fixed SQL query in the query_template is perfectly acceptable. Anything that doesn’t change in the URL can be stored here.

RequestsGet

The RequestsGet connector uses HTTP/GET to fetch URLs. It supports optional authentication. Here is a generic SearchProvider based on RequestsGet:

{
    "name": "Sample HTTP GET Endpoint",
    "connector": "RequestsGet",
    "url": "http://hostname/site/endpoint",
    "query_template": "{url}&textQuery={query_string}",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "PAGE=start=RESULT_INDEX,DATE_SORT=sort=date",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "response_mappings": "FOUND=count,RESULTS=results",
    "result_mappings": "body=content,date_published=date,author=creator",
    "credentials": "HTTPBasicAuth('your-username-here', 'your-password-here')",
    "tags": [
        "News"
    ]
}

Here is the same connector, configured for three different Google Programmable Search Engines:

[
    {
        "name": "Enterprise Search Engines - Google PSE",
        "connector": "RequestsGet",
        "url": "https://www.googleapis.com/customsearch/v1",
        "query_template": "{url}?cx={cx}&key={key}&q={query_string}",
        "query_processors": [
            "AdaptiveQueryProcessor"
        ],
        "query_mappings": "cx=0c38029ddd002c006,DATE_SORT=sort=date,PAGE=start=RESULT_INDEX,NOT_CHAR=-",
        "result_processors": [
            "MappingResultProcessor",
            "DateFinderResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
        "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items",
        "result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD",
        "credentials": "key=AIzaSyDvVeE-L6nCC9u-TTGuhggvSmzhtiTHJsA",
        "tags": [
            "News",
            "EnterpriseSearch"
        ]
    },
    {
        "name": "Strategy Consulting - Google PSE",
        "connector": "RequestsGet",
        "url": "https://www.googleapis.com/customsearch/v1",
        "query_template": "{url}?cx={cx}&key={key}&q={query_string}",
        "query_processors": [
            "AdaptiveQueryProcessor"
        ],
        "query_mappings": "cx=7d473806dcdde5bc6,DATE_SORT=sort=date,PAGE=start=RESULT_INDEX,NOT_CHAR=-",
        "result_processors": [
            "MappingResultProcessor",
            "DateFinderResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
        "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items",
        "result_mappings": "url=link,body=snippet,author=pagemap.metatags[*].['authors-name'],cacheId,pagemap.metatags[*].['articletype'],pagemap.metatags[*].['practice-name'],pagemap.metatags[*].['searchresults-tags'],pagemap.metatags[*].['site_name'],pagemap.metatags[*].['twitter:description'],NO_PAYLOAD",
        "credentials": "key=AIzaSyDvVeE-L6nCC9u-TTGuhggvSmzhtiTHJsA",
        "tags": [
            "News",
            "StrategyConsulting"
        ]
    },
    {
        "name": "Mergers & Acquisitions - Google PSE",
        "connector": "RequestsGet",
        "url": "https://www.googleapis.com/customsearch/v1",
        "query_template": "{url}?cx={cx}&key={key}&q={query_string}",
        "query_processors": [
            "AdaptiveQueryProcessor"
        ],
        "query_mappings": "cx=b384c4e79a5394479,DATE_SORT=sort=date,PAGE=start=RESULT_INDEX,NOT_CHAR=-",
        "result_processors": [
            "MappingResultProcessor",
            "DateFinderResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
        "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items",
        "result_mappings": "url=link,body=snippet,author=pagemap.metatags[*].['article:publisher'],cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['article:tag'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD",
        "credentials": "key=AIzaSyDvVeE-L6nCC9u-TTGuhggvSmzhtiTHJsA",
        "tags": [
            "News",
            "MergersAcquisitions"
        ]
    }
]

And here it is again, configured for SOLR with the tech products example collection installed:

{
    "name": "techproducts - Apache Solr",
    "active": false,
    "default": false,
    "connector": "RequestsGet",
    "url": "http://localhost:8983/solr/{collection}/select?wt=json",
    "query_template": "{url}&q={query_string}",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "collection=techproducts,PAGE=start=RESULT_ZERO_INDEX,NOT=True,NOT_CHAR=-",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "response_mappings": "FOUND=numFound,RESULTS=response.docs",
    "result_mappings": "title=name,body=features,response",
    "credentials": "",
    "tags": [
        "Products",
        "Solr"
    ]
}

To adapt RequestsGet for your JSON response, just replace the JSONPaths on the right of the FOUND, RETRIEVED, and RESULT configurations in response_mappings, following the left-to-right format of swirl_key=source-key. If the response provides a dictionary wrapper around each result, use the RESULT path to extract it.

From there, map results fields to Swirl’s schema as described in the User Guide, Result Mapping section. Use the PAYLOAD Field to store any extra content from SearchProviders that doesn’t map to an existing Swirl field.

Add additional required key/value parameters to the url - if they won’t change from SearchProvider to SearchProvider - or by adding a mapping in the query_template and a default or guide value in the query_mappings. (See {collection} in the SOLR example above.)

RequestsPost

The RequestsPost connector uses HTTP/POST to submit search forms to configured URLs. It supports optional authentication. RequestsPost is used by M365 connectors and is now available for other sources. Here is a generic SearchProvider based on RequestsPost:

{
    "name": "Sample HTTP POST Endpoint",
    "active": false,
    "default": false,
    "connector": "RequestsPost",
    "url": "https://xx.apis.it.h.edu/ats/person/v3/search",
    "query_template": "{url}?Query={query_string}",
    "query_processor": "",
    "post_query_template": {
        "fields": [
          "names.personNameKey",
          "names.firstName",
          "names.lastName"
        ],
        "conditions": {
          "names.name": "*{query_string}*"
        }
    },
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "NOT=true,NOT_CHAR=-",
    "result_processor": "",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "response_mappings": "FOUND=count,RESULTS=results",
    "result_mappings": "titles=names[0].name,url=names[0].personNameKey,body='{names[0].name} ID#: {names[*].personNameKey}'",
    "results_per_query": 10,
    "credentials": "X-Api-Key=<your-api-key>",
    "eval_credentials": "",
    "tags": [
        "People"
    ]
}

Contact Support for help getting started.

SQLite3

The SQLite3 Connector uses the SQLite3 driver built-in to Python.

Here is an example of a SearchProvider using SQLite3:

{
    "name": "Company Funding Records - SQLite3",
    "active": false,
    "default": false,
    "connector": "Sqlite3",
    "url": "db.sqlite3",
    "query_template": "select {fields} from {table} where {field1} like '%%{query_string}%%' or {field2} like '%%{query_string}%%';",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "fields=*,sort_by_date=fundeddate,table=funding,field1=city,field2=company",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "result_mappings": "title='{company} series {round}',body='{city} {fundeddate}: {company} raised usd ${raisedamt}\nThe company is headquartered in {city} and employs {numemps}',date_published=fundeddate,NO_PAYLOAD",
    "tags": [
        "Company"
    ]
}

Note: Putting a fixed SQL query in the query_template is perfectly acceptable. Anything that doesn’t change in the URL can be specified here.

Processing Pipelines

Swirl Processing Pipelines

Processors are intended to be single purpose and executed in a sequence called a “pipeline”. Pipelines are specified as JSON lists in their respective properties. There are four processing pipelines in Swirl:

  • Search.pre_query_processors
  • SearchProvider.query_processors
  • SearchProvider.result_processors
  • Search.post_result_processors

For example, the default Search.post_result_processors pipeline is:

"post_result_processors": [
            "DedupeByFieldPostResultProcessor",
            "CosineRelevancyPostResultProcessor"
        ],

This pipeline removes duplicates from the result set prior to relevancy ranking.

Query Processors

Query Processors operate queries. The exact field they operate on depends on how they are deployed.

PipelineReadsUpdates
Search.pre_query_processorsSearch.query_stringSearch.query_string_processed
SearchProvider.query_processorsSearch.query_string_processed<Connector>.query_string_to_provider

This table describes the query processors included in Swirl:

ProcessorDescriptionNotes
AdaptiveQueryProcessorRewrites queries based on the query_mappings for a given SearchProviderShould not be used as pre_query_processor
ChatGPTQueryProcessorThis query processor asks ChatGPT to rewrite queries based on a configurable prompt. For example it can rewrite queries to be fuzzier, broader, more specific, boolean, or in another language.Experimental
GenericQueryProcessorRemoves special characters from the query 
SpellcheckQueryProcessorUses TextBlob to predict and fix spelling errors in query_stringBest deployed in a SearchProvider.query_processor for sources that need it; not recommended with Google PSEs

Result Processors

Result Processors transform source results into the Swirl format defined in swirl/processors/utils.py.

The following table lists the Result Processors included with Swirl:

ProcessorDescriptionNotes
GenericResultProcessorCopies results from source format to Swirl format by exact match on nameRecommended for sources that don’t need mapping
MappingResultProcessorTransforms results from source format to Swirl format, using SearchProvider.result_mappingsDefault
LenLimitingResultProcessorChecks if the title and body responses from a source exceed a configurable length (set in swirl_server/settings.py: SWIRL_MAX_FIELD_LEN = 512), truncates anything after that value, and adds an ellipsis (“…”). If the body field has been truncated, the processor reports the entire response in a new body_full field in the Payload. The default truncation length for can be overridden for a specific SearchProvider using a new Tag value (e.g. max_length:256).Recommended for sources that consistently return lengthy title or body fields; should follow the MappingResultProcessor.
CleanTextResultProcessorRemoves non-alphanumeric characters from the source response. It should be considered for lengthy responses where URLs or other HTML or Markdown syntax appear in results.Should be installed before the LenLimitingResultProcessor when both are used.
DateFinderResultProcessorLooks for a date in any a number of formats in the body field of each result item. Should it find one, and the date_published for that item is 'unknown', it replaces date_published with the date extracted from the body, and notes this in the result.messages.This processor can detect the following date formats:
06/01/23
06/01/2023
06-01-23
06-01-2023
jun 1, 2023
june 1, 2023

Post Result Processors

PostResultProcessors operate only on processed result data and are saved to the Swirl database by each connector. They operate on Result objects.

CosineRelevancyPostResultProcessor

Swirl includes a cosine vector similarity relevancy model based on spaCy. The source code is found in: swirl/processors/relevancy.py

The relevancy model is as follows:

  • Matches on word stems from nltk

  • Scores the fields configured in SWIRL_RELEVANCY_CONFIG in swirl_server/settings.py

  • Aggregates the similarity of:

    • the entire query and the field, with the score being the highest in any single sentence (if any),

    • the entire query and a window of text around the field match

    • the 1 and 2 word combinations in the query (if long enough), and a window of text around the field match

  • Weights these fields as defined in SWIRL_RELEVANCY_CONFIG, a dictionary structure in swirl_server/settings.py

  • Further weights the window-text matches with the length of the match

  • Gives a small boost based on the original SearchProvider’s rank, equal to $1/(1+\sqrt{SearchProvider.rank})$

  • Normalizes the length of the result against the median length of all results in the set - this is reflected in the result_length_adjust in the explain structure.

  • Normalizes the query executed by this SearchProvider vs. all the other queries in the set - this is reflected in the query_length_adjust in the explain structure.

The Swirl score is just that: a score. The higher a score is, the more contextually relevant the result is. Scores aren’t comparable between queries or results.

Tip: to translate a result score to a confidence score, take the #1 result as 1.0, and then divide subsequent results by the score for that result to calculate the confidence.

Swirl reports the swirl_rank, from 1 to N, where N is the total number of results. Swirl also includes the searchprovider_rank, which is the result rank assigned by the source. This makes it easy to compare what Swirl viewed as relevant compared to what each SearchProvider did.

In the event of a relevancy tie, the date_published and search_provider rank are used as additional sorts, to break it.

In Swirl 2.5, result processing was separated into two passes. The SearchProvider.result_processors runs first, followed by the Search.post_result_processors which adjusts length and finalizes. As a result of this change, the revised CosineRelevancyPostResultProcessor must be added last in the Search.post_result_processors list. For example:

    "result_processors": [
        "MappingResultProcessor",
        "DateFinderResultProcessor",
        "CosineRelevancyResultProcessor"
    ],

Mixers

The following table details the Result Mixers included with Swirl:

MixerDescriptionNotes
RelevancyMixerOrganizes results by relevancy score (descending), then source rank (ascending)The default; depends on relevancy_processor being installed as the search.post_result_processors (also the default)
RelevancyNewItemsMixerOrganizes results as above, but hiding results that don’t have the new field as created during Search updatesThis is the default for search.new_result_url
DateMixerOrganizes results by date_published. Results with “unknown” for date_published are omittedUse when you want date sorted results
DateNewItemsMixerOrganizes results as above, but hiding results that don’t have the new field as created during Search updatesThis is the default for search.new_result_url when search.date is set to sort
RoundRobinMixerOrganizes results by taking 1 result from each responding SearchProvider, alternating; actually calls Stack1Mixer (see below)Good for searches with search.sort set to “date” or anytime you want a cross-section of results instead of just the ones with the most evidence
Stack1MixerOrganizes results by taking 1 result from each responding SearchProvider, alternatingGood for cross-sections of data
Stack2MixerOrganizes results by taking 2 from each responding SearchProvider, alternatingGood for cross-sections of data with 4-6 sources
Stack3MixerOrganizes results by taking 3 from each responding SearchProvider, alternatingGood for cross-sections of data with few sources
StackNMixerOrganizes results by taking N from each responding source, where N if not specified is the number of results requested divided by the number of SearchProviders reporting at least 1 resultGood for cross-sections of data with few providers

Date Mixer

If you want results sorted by date_published, descending, use the Date Mixer.

Swirl Results, Date Mixer

For example: http://localhost:8000/swirl/results?search_id=1&result_mixer=DateMixer

NewItems Mixers

The two NewItems mixers automatically filter results to items with the new field present. Both will report the number of results hidden because they do not have this field.

To remove the new field from all results in a search, add &mark_all_as_read=1 to the result_mixer URL property. For example:

http://localhost:8000/swirl/results/?search_id=1&result_mixer=DateNewItemsMixer&mark_all_as_read=1

The Mixer will then return 0 results. But it will return results if the Search is updated.

Mixer objects combine separate Result objects for a given Search, wrap them in metadata, and order them. For example, a Mixer might alternate from each source (Stack1Mixer), or sort results by date_published (DateMixer)

To invoke the mixer specified using the result_mixer property of the Search object, specify the search_id as shown in APIs, above:

http://localhost:8000/swirl/results/?search_id=1

If you use the Swirl defaults, a search will produce a JSON result that is relevancy ranked.

To specify a different Mixer, add &result_mixer=mixer-name to the URL.

http://localhost:8000/swirl/results/?search_id=1&result_mixer=Stack1Mixer

The following table describes the Mixer wrapper in more detail:

FieldDescription
messagesAll messages from the Search and all SearchProviders
infoA dictionary of Found and Retrieved counts from each SearchProvider
info - searchInformation about the Search, including the processed query, and links to re-run and re-score Searches
info - resultsInformation about the Results, including the number retrieved and the URL of the next (and previous) pages of results
resultsMixed Results from the specified Search

Sample Data Sets

Funding Data Set

The TechCrunch Continental USA funding data set was taken from Insurity SpatialKey. It is included with Swirl in Data/funding_db.csv This file was processed with scripts/fix_csv.py prior to loading into SQLite3.

Loading into SQLite3

  1. Activate sqlite_web Then, from the swirl-home directory:
    sqlite_web db.sqlite3
    
  2. A browser window should open automatically; if not go to http://localhost:8080/
  3. Enter “funding” in the text box in the upper right of the screen and press the Create button Sqlite loading funding dataset

  4. Click Choose File and select Data/funding_db.csv

  5. Leave “Yes” in the box below.

  6. Click Import. Sqlite loading funding dataset

  7. Load the Funding DB SQLite3 SearchProvider as described in the User Guide, SearchProvider section.

Loading into PostgreSQL

  • Create a table following the structure in the CSV using SQL or something like Postico
-- DDL generated by Postico 1.5.21
-- Table Definition ----------------------------------------------

CREATE TABLE funding (
    id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    permalink text,
    company text,
    numemps text DEFAULT '0'::numeric,
    category text,
    city text,
    state text,
    fundeddate date,
    raisedamt numeric DEFAULT '0'::numeric,
    raisedcurrency text,
    round text
);

-- Indices -------------------------------------------------------

CREATE UNIQUE INDEX funding_pkey ON funding(id int4_ops);
  • Run this SQL to ingest the CSV:
COPY funding(permalink,company,numemps,category,city,state,fundeddate,raisedamt,raisedcurrency,round) FROM '/path/to/Data/funding_db.csv' DELIMITER ',' CSV HEADER;

Loading into BigQuery

  • Create a table with a schema like so:

BigQuery Funding DB Schema in Console

Enron Email Data Set

This data set consists of 500,000+ emails from 150 employees of the former Enron Corporation. It’s available from Kaggle: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset among other places.

Loading into Elastic or OpenSearch

  • Download the enron email data set: https://www.cs.cmu.edu/~enron/

  • Unzip and untar the download and place the email.csv file in your swirl-home directory

  • Create a new Elastic/OpenSearch index named ‘email’ using the development console:

PUT /email
  • Index the emails.csv from the Swirl directory:

For Elastic:

python scripts/index_email_elastic.py emails.csv -p elastic-password

For OpenSearch:

python scripts/index_email_opensearch.py emails.csv -p admin-password

The script will report progress every 100 rows. Hit Ctrl-C to stop it if desired.

  • Verify that the messages have been indexed by Elastic/OpenSearch by querying in the development console:
GET _search
{
  "query": {
    "match_all": {}
  }
}

Results should appear in the right-hand pane:

{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6799,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : ".kibana_92668751_admin_1",
        "_id" : "config:2.4.1",
        "_score" : 1.0,
        "_source" : {
          "config" : {
            "buildNum" : 4665
          },
          "type" : "config",
          "references" : [ ],
          "migrationVersion" : {
            "config" : "7.9.0"
          },
          "updated_at" : "2023-01-18T02:36:57.349Z"
        }
      },
      {
        "_index" : "email",
        "_id" : "c13MwoUBmlIzd81ioZ7H",
        "_score" : 1.0,
        "_source" : {
          "url" : "allen-p/_sent_mail/118.",
          "date_published" : "2000-09-26 09:26:00.000000",
          "author" : "Phillip K Allen",
          "to" : "pallen70@hotmail.com",
          "subject" : "Investment Structure",
          "content" : """---------------------- Forwarded by 
          
    ...etc...