In BlueConic, a Product or Content Collector can be used to add items to a product or content store, which can then be used to create personalized recommendations. Because these two connections work in almost the same way, both are covered in this article.
Data flow
The underlying principle of a BlueConic collector is that if content is already available on a channel, it can be used to fill the store. New product or article pages can automatically be detected when visitors are on the page.
Note: For each collector, a store is created. This makes it possible to have different stores for different brands.
The data flow is pictured above and can be described as follows (with the Product Collector used as an example):
- The BlueConic script is loaded on the webpage and the Product Collector is configured for the channel. This configuration contains the logic for how each attribute should be scraped. In this step, the following checks are applied (when an item doesn’t meet the criteria, a log message such as "Item will not be scraped due to missing URL" is shown in the console):
  - The ID, name, and URL have a value
  - The URL starts with “http://” or “https://”
  - The type matches “product” for the Product Collector or “article” for the Content Collector
  - For a Content Collector, the publicationDate has a value
  - When an attribute is set to required, the configuration results in a value
- When a valid product page is detected, a hash is created to check whether the content on the webpage has changed. A view event is thrown with the item ID, URL, and hash.
  - Note: The view event is also used as input for the recommendation engine, for example to determine viral products.
- When the hash has changed or the item is not yet in the store, the item is added to the queue.
- The batch part of the collector retrieves the items from the queue. This process runs every 3 minutes. 
- For each item in the queue, the HTML is retrieved. Based on this HTML and the configuration of the connection, the values for the defined attributes are determined.
  - Note: All of the server-side rendered HTML is available for calculating custom attribute values.
- The output is a product item stored in the product store. The created items appear in the connection interface under “Most recent collected items.” 
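The checks and the hash comparison in the steps above can be sketched as follows. This is a minimal illustration in Python, not BlueConic's implementation: the function names, the dictionary layout of an item, the reason strings, and the choice of SHA-256 over a JSON dump are all assumptions made for the example.

```python
import hashlib
import json


def validation_error(item, collector_type, required=()):
    """Return a reason string if the item should be skipped, else None."""
    # The ID, name, and URL must have a value.
    for key in ("id", "name", "url"):
        if not item.get(key):
            return f"missing {key}"
    # The URL must start with http:// or https://.
    if not item["url"].startswith(("http://", "https://")):
        return "URL does not start with http:// or https://"
    # The type must match the collector ("product" or "article").
    if item.get("type") != collector_type:
        return f"type is not {collector_type}"
    # Content Collectors also need a publication date.
    if collector_type == "article" and not item.get("publicationDate"):
        return "missing publicationDate"
    # Attributes configured as required must result in a value.
    for key in required:
        if not item.get(key):
            return f"missing required attribute {key}"
    return None


def content_hash(item):
    # Assumption: the hash covers all scraped attributes; the real input is not documented here.
    payload = json.dumps(item, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_queue(item, store_hashes):
    """Queue the item when it is new or its hash differs from the stored one."""
    return store_hashes.get(item["id"]) != content_hash(item)
```

In this sketch, `store_hashes` is a hypothetical mapping from item ID to the hash recorded at the last collection, so a missing key (new item) and a differing hash (changed content) both lead to queueing.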
Note: The client-side part of the connection has built-in 404 (page not found) detection. This detection mechanism checks for 404 candidates by looking for values like "not found", "sorry", and "404" in the page title and description. When a candidate is found, a HEAD request is sent to check the response code. Typically, the response confirms that the product or article has been removed, and the page is added to the queue for removal from the store.
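The two-stage 404 detection can be illustrated with a short sketch. The real check runs client-side in the browser; the Python below only mirrors the logic. The marker list repeats the examples named above and may not be complete, and treating a 404 status code as the signal for removal is an assumption.

```python
import urllib.error
import urllib.request

# Example markers taken from the note above; the actual list may be longer.
CANDIDATE_MARKERS = ("not found", "sorry", "404")


def is_404_candidate(title, description):
    """Cheap first stage: look for typical error wording in title and description."""
    text = f"{title} {description}".lower()
    return any(marker in text for marker in CANDIDATE_MARKERS)


def head_status(url, timeout=10):
    """Second stage: send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def should_remove(title, description, url, fetch_status=head_status):
    # Only candidates trigger a request, so normal pages cost no extra traffic.
    if not is_404_candidate(title, description):
        return False
    return fetch_status(url) == 404
```

The `fetch_status` parameter is a hypothetical hook that makes the decision testable without network access. Running the text check first means a page titled "Sorry, we are closed on Sundays" is only removed if the server also confirms the 404.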
Default implementation
The metadata of a page can be determined by combining the data that is available in JSON-LD, RDFa, and microdata. Product and Content Collectors in BlueConic have the metadata field options listed in the table below. Most fields are captured correctly using the “default” logic rules, which are as follows:
| Attribute | Logic |
| ID | |
| Name | |
| Description | |
| Image | An extra check is applied where the last part of the image URL has to contain .jpg, .jpeg, .gif, .png or .bmp. |
| URL | |
| Publication date | |
| Type | |
| Categories | |
| Text | No default implementation |
| Price | |
| In stock | The outcome will be matched against the following words: outofstock, oos, soldout, discontinued, or out of stock. If one of the words is in the text, in stock is set to false. Otherwise, it is set to true. |
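The two extra rules in the table (the image extension check and the in-stock word matching) could look roughly like the sketch below. It assumes that "the last part of the image URL" means the final path segment and that matching is case-insensitive; neither detail is stated above, and the function names are made up for the example.

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".bmp")
OUT_OF_STOCK_WORDS = ("outofstock", "oos", "soldout", "discontinued", "out of stock")


def is_image_url(url):
    # Assumption: "last part" is the final segment of the URL path.
    last_part = urlparse(url).path.rsplit("/", 1)[-1].lower()
    return any(extension in last_part for extension in IMAGE_EXTENSIONS)


def in_stock(availability_text):
    # Any out-of-stock word in the text sets in stock to false; otherwise true.
    text = availability_text.lower()
    return not any(word in text for word in OUT_OF_STOCK_WORDS)
```

Because the rule is a plain substring match, a value such as `https://schema.org/OutOfStock` yields false, while any text without one of the listed words (including an empty value) yields true.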
Limitations
When using Product and Content Collectors, keep the following limitations in mind:
- The BlueConic scraper for Content Collectors can only access the page source code. To collect metadata correctly, the scraper requires access to meta tags or properly formatted JSON-LD data. 
- Any content contained within <script> or <style> tags is stripped from the page and cannot be collected.
Removing items from a Content or Product Store
By default, part of the batch process of a collector is removing older items from the store. If the store is more than 95% full, a maximum of 5% of the least viewed items that are older than a week are removed. After this, the queue is read following the process described above.
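As an illustration of this cleanup rule, the sketch below selects the items to remove. It is an interpretation, not the actual batch code: it assumes the 5% limit is calculated over the number of items currently in the store, and the `capacity` parameter and the `views` and `collected` fields are hypothetical names.

```python
from datetime import datetime, timedelta


def select_items_to_remove(items, capacity, now, fill_percent=95,
                           max_percent=5, min_age=timedelta(days=7)):
    """Return the IDs of the items a cleanup run would remove."""
    # Nothing is removed unless the store is more than 95% filled.
    if len(items) * 100 <= fill_percent * capacity:
        return []
    # Assumption: "a maximum of 5%" refers to the current item count.
    limit = len(items) * max_percent // 100
    # Only items older than a week are eligible.
    eligible = [item for item in items if now - item["collected"] > min_age]
    # Least viewed items go first.
    eligible.sort(key=lambda item: item["views"])
    return [item["id"] for item in eligible[:limit]]
```

For a store with a capacity of 100 that holds 100 items, this selects at most five items, skipping anything collected within the last week even if it has fewer views.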
Importing products using SFTP
An alternative way to populate a product store in BlueConic is to fill it based on a feed from a product information (PIM) system. This gives you more control over the products being added to the store and also provides extra metadata that may not be available on the web.
To do this, use the Product Connection (SFTP), which follows this flow:
- Based on the selected files (the filename may contain a wildcard), the connection gets the files whose last modified date is later than the previous run date. When the settings are changed (for example, a setting in the mapping), all files are picked up (except the .done files).
- Every CSV file will be parsed and translated to a product based on the mapping step. 
- The product is stored in the product store, which is automatically created when the connection is set up. 
- Optional: There is a cleanup mechanism for this connection that allows you to remove items based on simple, configurable rules. This cleanup is executed before importing the newest files.
- Optional: This connection also makes it possible to add extra data points that can be used in mapping. Therefore, if the CSV files contain extra metadata, you need to add the extra fields first. 
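The file selection and mapping steps of this flow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `files` is a hypothetical dictionary of filename to last modified timestamp (standing in for an SFTP directory listing), and `mapping` is a hypothetical dictionary from CSV column name to product attribute.

```python
import csv
import fnmatch
import io


def select_files(files, pattern, previous_run, settings_changed=False):
    """Pick the files to import for this run."""
    selected = []
    for name, modified in sorted(files.items()):
        # .done files are never picked up, and the name must match the wildcard.
        if name.endswith(".done") or not fnmatch.fnmatchcase(name, pattern):
            continue
        # After a settings change, all matching files are imported again.
        if settings_changed or modified > previous_run:
            selected.append(name)
    return selected


def parse_products(csv_text, mapping):
    """Translate each CSV row into a product using the column mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {attribute: row[column] for column, attribute in mapping.items()}
        for row in reader
    ]
```

For example, with `mapping = {"sku": "id", "title": "name"}`, a row with the columns `sku` and `title` becomes a product with an `id` and a `name`. Columns that are not in the mapping are ignored, which mirrors the point that extra fields have to be added before they can be mapped.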
