Product and Content Collectors: Deep Dive

In BlueConic, a Product or Content Collector adds items to a product or content store, which can then be used to create personalized recommendations. Because the two connections work on almost the same principles, this article covers both.

Data flow

The underlying principle of a BlueConic collector is that if content is already available on a channel, it can be used to fill the store. New product or article pages can automatically be detected when visitors are on the page.

Note: For each collector, a store is created. This makes it possible to have different stores for different brands.

[Figure: CollectorsFlow.png]

The data flow is pictured above and can be described as follows (using the Product Collector as an example):

  1. The BlueConic script is loaded on the webpage, and the Product Collector configured for the channel contains the logic for how each attribute should be scraped. In this step, the following checks are applied (when an item doesn’t meet the criteria, a message such as "Item will not be scraped due to missing URL" is logged to the console):

    • The ID, name, and URL have a value

    • The URL starts with “http://” or “https://”

    • The type matches “product” for the Product Collector or “article” for the Content Collector

    • For a Content Collector, the publicationDate has a value

    • Every attribute that is set to required resolves to a value from its configuration

  2. When a valid product page is detected, a hash is created so that changes to the content on the webpage can be detected. A view event is then sent, carrying the item ID, URL, and hash.

    • Note: The view event is also used as input for the recommendation engine. For example, to determine viral products.

  3. When the hash has changed or the item is not yet in the store, the item is added to the queue.

  4. The batch part of the collector retrieves the items from the queue. This process runs every 3 minutes.

  5. For each item in the queue, the HTML is retrieved. Based on this HTML and the configuration of the connection, the values for the defined attributes are determined.

    • Note: All of the server-side rendered HTML is available for calculating custom attribute values.

  6. The output is a product item stored in the product store. The created items appear in the connection interface under “Most recent collected items.”
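The checks in step 1 and the queueing decision in steps 2 and 3 can be sketched as follows. This is only an illustration of the logic described above, not BlueConic's implementation: the function names, the dictionary keys, the rejection messages, and the choice of SHA-256 over the sorted attribute values are all assumptions made for the example.

```python
import hashlib
from typing import Optional


def validation_error(item: dict, collector_type: str, required: tuple = ()) -> Optional[str]:
    """Return why an item would be rejected, or None when it passes every check."""
    # The ID, name, and URL must have a value.
    for field in ("id", "name", "url"):
        if not item.get(field):
            return f"missing {field}"
    # The URL must start with http:// or https://.
    if not item["url"].startswith(("http://", "https://")):
        return "URL must start with http:// or https://"
    # The type must match the collector ("product" or "article").
    if item.get("type") != collector_type:
        return f"type is not {collector_type}"
    # A Content Collector also needs a publication date.
    if collector_type == "article" and not item.get("publicationDate"):
        return "missing publicationDate"
    # Attributes marked as required must resolve to a value.
    for field in required:
        if not item.get(field):
            return f"missing required attribute {field}"
    return None


def content_hash(item: dict) -> str:
    """Hash the item so that a content change can be detected.

    Assumption: the input to the real hash is not documented; this sketch
    hashes the sorted attribute names and values.
    """
    payload = "\n".join(f"{key}={item[key]}" for key in sorted(item))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_queue(item: dict, store_hashes: dict) -> bool:
    """Queue the item when it is new to the store or its hash has changed."""
    return store_hashes.get(item["id"]) != content_hash(item)
```

Here `store_hashes` stands in for whatever the store remembers about each item; an item that is missing from it is always queued.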

Note: The client-side part of the connection has built-in 404 (page not found) detection. This mechanism looks for 404 candidates by checking the page title and description for values such as "not found", "sorry", and "404". When one is found, a HEAD request is sent to check the response code. Typically, the response confirms that the product or article has been removed, and the page is added to the queue for removal from the store.
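A rough sketch of this two-stage check is shown below. The hint list contains only the three values named in the note (the real list may be longer), and treating a 404 status as the only trigger for removal is an assumption of the example, as are the function names.

```python
import urllib.error
import urllib.request

# Only the three hints named in the note above; the real list may differ.
NOT_FOUND_HINTS = ("not found", "sorry", "404")


def is_404_candidate(title: str, description: str) -> bool:
    """Cheap first stage: look for not-found hints in the title and description."""
    text = f"{title} {description}".lower()
    return any(hint in text for hint in NOT_FOUND_HINTS)


def head_status(url: str, timeout: float = 5.0) -> int:
    """Second stage: send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return error.code


def should_remove(title: str, description: str, url: str, status_of=head_status) -> bool:
    """Assumption: only a 404 response marks the page for removal."""
    if not is_404_candidate(title, description):
        return False
    return status_of(url) == 404
```

The `status_of` parameter exists only so the decision can be exercised without network access.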

Default implementation

The metadata of a page is determined by combining the data that is available in JSON-LD, RDFa, and microdata. Product and Content Collectors in BlueConic have the metadata field options listed in the table below. Most fields are captured correctly using the “default” logic rules, which are as follows:

Attribute

Logic

ID

  1. When a class name matches /(?:^|\s)(?:page-node|uuid|postid)-([^\s]+)/, the captured group (the second element of the match result) is used as the ID.

    1. For example, the class name postid-123 results in 123

  2. Otherwise, the URL without the query string is used.
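The ID rule can be tried out with the same regular expression. In this sketch the function name and the idea of passing the class names as one space-separated string are assumptions; the fallback also drops the URL fragment, which the rule above does not mention.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Same pattern as in the rule above, written as a Python raw string.
ID_PATTERN = re.compile(r"(?:^|\s)(?:page-node|uuid|postid)-([^\s]+)")


def default_item_id(class_names: str, page_url: str) -> str:
    """Prefer an ID embedded in a class name, else the URL without its query string."""
    match = ID_PATTERN.search(class_names)
    if match:
        # group(1) is the captured part after the "postid-" style prefix.
        return match.group(1)
    parts = urlsplit(page_url)
    # Assumption: the fragment is dropped together with the query string.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Because the pattern requires the start of the string or whitespace before the prefix, a class name such as `my-postid-9` does not match and the URL fallback applies.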

Name

  1. og:title meta tag

  2. name in the metadata

  3. headline in the metadata

Description

  1. og:description meta tag

  2. description in the metadata

  3. description meta tag

Image

  1. leadimage meta tag

  2. og:image meta tag

  3. image in the metadata

An extra check is applied where the last part of the image URL has to contain .jpg, .jpeg, .gif, .png or .bmp.
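One way to read this extra check is sketched below. Interpreting "the last part" as the final path segment of the URL and ignoring letter case are assumptions of the example.

```python
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".bmp")


def looks_like_image(url: str) -> bool:
    """Check whether the last path segment of the URL contains an image extension."""
    # Assumption: "last part" means the final path segment, compared case-insensitively.
    last_part = urlsplit(url).path.rsplit("/", 1)[-1].lower()
    return any(extension in last_part for extension in IMAGE_EXTENSIONS)
```

Under this reading, an image served as `.webp` or from an extensionless URL would not pass the default check.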

URL

  1. og:url meta tag

  2. When there is link[rel=canonical] available, the href is used.

  3. The URL without querystring is used.

Publication date

  1. article:published_time meta tag

  2. datePublished meta tag

  3. datePublished in the metadata

Type

  1. og:type meta tag

  2. For the product collector: When metadata has a property Product, “product” is returned.

  3. For the content collector: When metadata has a property Article or NewsArticle, “article” is returned.

    • The type is normalized to lowercase and “newsarticle” is translated to “article”.
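The normalization step is small enough to show in full. The function name is hypothetical, and trimming surrounding whitespace is an assumption added for robustness.

```python
from typing import Optional


def normalize_type(raw: Optional[str]) -> Optional[str]:
    """Lowercase the detected type and map "newsarticle" to "article"."""
    if raw is None:
        return None
    value = raw.strip().lower()  # stripping whitespace is an assumption
    return "article" if value == "newsarticle" else value
```

The normalized value is what the collector's type check ("product" or "article") would then be compared against.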

Categories

  1. og:section meta tag

  2. article:section meta tag

  3. article:tag meta tag

  4. keywords meta tag

  5. news_keywords meta tag

  6. articleSection in the metadata

  7. product:category meta tag

Text

No default implementation

Price

  1. When there is a Product attribute in the metadata that has an Offers attribute, the price of the first offer will be used.

  2. price meta tag

  3. product:price:amount meta tag

  4. price in the metadata
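The first rule can be illustrated with JSON-LD alone. This is a simplified sketch: it ignores RDFa and microdata, assumes schema.org-style `@type` and `offers` keys, and does not handle `@type` given as a list; the function name is hypothetical.

```python
import json


def first_offer_price(json_ld: str):
    """Return the price of the first offer of a Product node, or None."""
    data = json.loads(json_ld)
    nodes = data if isinstance(data, list) else [data]
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "Product":
            continue
        offers = node.get("offers")
        # schema.org allows a single offer object or a list of offers.
        if isinstance(offers, dict):
            offers = [offers]
        if offers and isinstance(offers[0], dict):
            return offers[0].get("price")
    return None
```

When this returns None, the remaining rules in the list (the price meta tag and so on) would be tried in order.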

In stock

  1. When there is a Product attribute in the metadata that has an Offers attribute, the availability of the first offer will be used.

  2. og:availability meta tag

  3. product:availability meta tag

  4. availability meta tag

The outcome will be matched against the following words: outofstock, oos, soldout, discontinued, or out of stock.

If one of the words is in the text, in stock is set to false. Otherwise, it is set to true.
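The word matching can be sketched as a substring test. Matching case-insensitively is an assumption here, and the function name is hypothetical.

```python
OUT_OF_STOCK_WORDS = ("outofstock", "oos", "soldout", "discontinued", "out of stock")


def in_stock(availability: str) -> bool:
    """Return False when the availability text contains an out-of-stock word."""
    text = availability.lower()  # case-insensitive matching is an assumption
    return not any(word in text for word in OUT_OF_STOCK_WORDS)
```

With plain substring matching, a schema.org value such as `https://schema.org/OutOfStock` is recognized, but so is any unrelated text that happens to contain "oos".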

Removing items from a Content or Product Store

By default, the batch process of a collector also removes older items from the store. If the store is more than 95% full, up to 5% of the least-viewed items that are older than a week are removed. The queue is then read as described in the process above.
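The cleanup rule can be expressed as a small selection function. Several details are assumptions of this sketch: that the store has a known capacity, that the 5% limit is taken from that capacity, that each item carries a view count and an added date, and the field names themselves.

```python
from datetime import datetime, timedelta


def items_to_remove(items: list, capacity: int, now: datetime) -> list:
    """Pick the IDs to remove: least-viewed items older than a week, capped at 5%.

    Each item is a dict with the hypothetical keys "id", "views", and "added".
    """
    # Nothing is removed unless the store is more than 95% full.
    if len(items) * 100 <= capacity * 95:
        return []
    limit = capacity * 5 // 100  # assumption: 5% of the store capacity
    cutoff = now - timedelta(days=7)
    old_items = [item for item in items if item["added"] < cutoff]
    old_items.sort(key=lambda item: item["views"])
    return [item["id"] for item in old_items[:limit]]
```

For a store with a capacity of 100 that holds 96 items, this selects at most five items, and none at all if every item was added within the last week.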

Importing products using SFTP

An alternative way to populate a product store in BlueConic is to fill it based on a feed from a product information (PIM) system. This gives you more control over the products being added to the store and also provides extra metadata that may not be available on the web.

To do this, use the Product Connection (SFTP), which follows this flow:

  • Based on the selected files (the filename may contain a wildcard), the connection retrieves the files whose last-modified date is later than the previous run date. When the settings are changed (for example, a setting in the mapping), all files are picked up (except the .done files).

  • Every CSV file will be parsed and translated to a product based on the mapping step.

  • The product is stored in the product store, which is automatically created when the connection is set up.

  • Optional: There is a cleanup mechanism for this connection that allows you to remove items based on simple, configurable rules. This cleanup is executed before importing the newest files.

  • Optional: This connection also makes it possible to add extra data points that can be used in mapping. Therefore, if the CSV files contain extra metadata, you need to add the extra fields first.
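The file selection in the first step of this flow can be sketched as follows. The function name, the shell-style wildcard syntax, and picking up every file on the first run are assumptions; the directory listing is passed in as plain tuples so that the sketch does not depend on any particular SFTP library.

```python
from datetime import datetime
from fnmatch import fnmatchcase
from typing import Iterable, List, Optional, Tuple


def files_to_import(
    listing: Iterable[Tuple[str, datetime]],
    pattern: str,
    previous_run: Optional[datetime],
    settings_changed: bool = False,
) -> List[str]:
    """Select filenames to import from (filename, last_modified) pairs."""
    selected = []
    for name, modified in listing:
        # .done files are never picked up, and the name must match the wildcard.
        if name.endswith(".done") or not fnmatchcase(name, pattern):
            continue
        # After a settings change (or on a first run) every matching file is taken.
        if settings_changed or previous_run is None or modified > previous_run:
            selected.append(name)
    return selected
```

For example, with a previous run on January 5, a file modified on January 1 is skipped during a normal run but is picked up again once a mapping setting changes.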
