The Marshal Site Router
The Marshal Site Router is a multithreaded Windows service for harvesting web sites or individual web pages. The router can follow links to harvest entire web sites. It will not follow external links, i.e. it will not harvest pages located on a different domain from the start page.
Technically, the router binds a socket to a TCP/IP address and port, listens for requests, and returns the results as JSON.
By default, the Site Router binds to 127.0.0.1, port 8080. To change the address and/or port, add start parameters. Using the parameters in image 1 below, the router will bind to address 192.168.0.123 and port 8081. Enter the desired parameters and click the Start button.
Image 1, Starting the Site Router
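Before configuring a harvester, it can be useful to confirm that something is listening on the address and port the router was started with. The following is a minimal sketch that only opens and closes a TCP connection. It does not speak the router's request protocol, which this article does not describe, and the function name is_listening is hypothetical.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The defaults mirror the router's default address and port from this
    article. This is only a reachability check, not a router request.
    """
    try:
        # create_connection resolves the address, connects and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling is_listening() checks the default binding, and is_listening("192.168.0.123", 8081) checks the binding used in image 1.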
Using the Site Router
When harvesting using Routers, the JSON TCP Harvester is used. See the Routers article for more information.
Configuring the JSON TCP Harvester
Image 2, JSON TCP Harvester configuration
Harvester |
Name |
The name of the harvester in use by the selected Query-node. To change harvester, select the name property and click on the ellipsis button. Select the appropriate harvester in the form that is displayed. |
Router |
Connection String |
The connection string instructs the Site Router which pages to harvest, and how to harvest them. The connection string must start with the url of the web page you want to harvest. Connection string syntax: url;key1=value1;key2=value2;... Connection string sample: http://www.mydomain.com;maxdepth=2;keys=id,page,content;image=jpg |
For name |
Not used |
Host |
The TCP/IP address on which the Site Router listens for connections. |
Port |
The TCP/IP port on which the Site Router listens for connections. |
Settings |
Group by |
Not used |
Order by |
Not used |
Query Timeout |
Not used |
Table or View |
The name of the table to harvest data from. |
Where |
|
User authentication |
Password |
Not used |
User ID |
Not used |
Table 1, JSON TCP Harvester configuration
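The connection string syntax described in table 1 (url;key1=value1;key2=value2;...) can be illustrated with a small parser and builder. This is a sketch under stated assumptions, not the router's own parser: the function names are hypothetical, the url is assumed not to contain a semicolon, and key names are kept exactly as written because the article does not say whether they are case sensitive.

```python
def parse_connection_string(conn: str) -> tuple[str, dict[str, str]]:
    """Split 'url;key1=value1;key2=value2' into (url, {key: value})."""
    parts = [p.strip() for p in conn.split(";") if p.strip()]
    if not parts:
        raise ValueError("connection string is empty")
    # The first segment is always the url of the page to harvest.
    url, options = parts[0], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        options[key.strip()] = value.strip()
    return url, options


def build_connection_string(url: str, **options: object) -> str:
    """Join a url and keyword options into a connection string."""
    return ";".join([url] + [f"{key}={value}" for key, value in options.items()])
```

For the sample in table 1, parse_connection_string returns the url http://www.mydomain.com together with the options maxdepth, keys and image as strings.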
Connection string keys |
maxDepth |
When harvesting from the Page table (see below for more information), the Site Router is able to follow links. The router will not follow links to pages that are not on the same domain as the initial page. You may specify the maximum depth to which links are followed. With maxDepth=1, no links are followed. With maxDepth=2, the first page and the pages it links to are harvested. The default value is 25. |
keys |
This key is used when following links. Keys is a comma-separated list of the url parameters that are valid. If a url contains a query string, parameters that are not defined as valid are removed from the url. Example: if keys=id,page,content and the url is http://www.mydomain.com?id=32&q=test&content=x, the router will navigate to http://www.mydomain.com?id=32&content=x. That is, q=test is removed from the url because q is not a valid parameter. |
image |
This key is used when saving a screenshot of a web page as an image. The image key specifies the file format of image captures of web pages. Possible values are: bmp, gif, jpg, jpeg, png, tif and tiff. The default value is png. |
imageSizeMode |
This key is used when saving a screenshot of a web page as an image. Normally, the web browser component is able to determine the dimensions of the web page. If this is not possible, the router can attempt to parse the DOM tree to determine the dimensions. Possible values for imageSizeMode are default and traverse. |
sitewidth |
This key is used when saving a screenshot of a web page as an image. Some web pages change their layout depending on the dimensions of the web browser. In these cases you may set sitewidth to the width, in pixels, that you want the screenshots of the site's pages to have. |
captureDelay |
When pages contain dynamic content, there is no way for a web browser to know when the content has been completely loaded, or even if it ever will be. If the capture is performed immediately after the static content has been loaded, data will most likely be missing. The captureDelay is the time in milliseconds to wait after the static content has been loaded before the capture is taken. The default value is 200 ms. |
Table 2, Connection string keys
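The effect of the keys entry in table 2 can be reproduced with Python's standard url functions. The sketch below is only an illustration of the described behaviour: the function name filter_query is hypothetical, and matching parameter names case sensitively is an assumption, since the article does not specify it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query(url: str, valid_keys: list[str]) -> str:
    """Remove query parameters whose names are not listed in valid_keys."""
    parts = urlsplit(url)
    # Keep the original parameter order, dropping anything not marked valid.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in valid_keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With valid_keys set to ["id", "page", "content"], the url from the keys example has q=test removed while id and content are kept.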
Harvest
Adding tables, retrieving columns, adding relations, etc. is done in the same way as when using the ODBC Harvester.
Image 3, Marshal model for harvesting a site
Every query node (except for the root node), independent of which harvester it uses, has a parent relation property in the XML section. To add or edit a parent relation, select the property and click on the ellipsis button.
All leaves of the selected node, having Column Name specified in the Source section, are listed in the Column combo box, and all parent leaves, having Column Name specified in the Source section, are listed in the Parent column combo box.
When using the Site Router, sub-queries are related to their parents by the Url column. In the first column of the relation form, type Url, and in the second column, select the Url column. The Url of the parent node is then passed to the child nodes, allowing them to harvest additional information using that url.
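The Url relation can be pictured as running the child query once per parent row, with the parent's Url as input. The sketch below uses plain dictionaries and a caller-supplied function in place of real harvester calls; harvest_children and child_query are hypothetical names and do not correspond to any Marshal API.

```python
from typing import Callable, Iterable

Row = dict[str, str]


def harvest_children(
    parent_rows: Iterable[Row],
    child_query: Callable[[str], list[Row]],
) -> list[tuple[Row, list[Row]]]:
    """Pair every parent row with the child rows found for its Url."""
    result = []
    for parent in parent_rows:
        # The relation Url -> Url: the parent's Url value is the child's input.
        children = child_query(parent["Url"])
        result.append((parent, children))
    return result
```

A Page row could, for example, be paired with the Page-AHref rows returned for the same url.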
The Site Router tables
Routers mimic relational databases in representing data as tables with columns. See the Routers article for more information.
The Site Router implements 7 tables: Page, Page-AHref, Page-Header, Page-Image, Page-Link, Page-Meta and Page-Script.
The Page table
The Page table contains information about the web pages. Each page is represented as a table row. The Page table automatically follows links until maxDepth has been reached, or all pages have been retrieved. Only pages from the same domain will be harvested, i.e. the router will not follow external links.
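The link-following rules can be illustrated with a depth-limited traversal over an in-memory link map. This is a sketch under stated assumptions, not the router's implementation: the traversal order (breadth first) is a guess, "same domain" is taken to mean the same host name, and the links mapping stands in for downloading and parsing real pages.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url: str, links: dict[str, list[str]], max_depth: int = 25) -> list[str]:
    """Return the urls that would be harvested, in visiting order.

    links is a hypothetical mapping from a url to the urls it links to.
    """
    start_host = urlsplit(start_url).hostname
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the start page counts as depth 1
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # with max_depth=1 no links are followed
        for target in links.get(url, []):
            if target in seen:
                continue
            if urlsplit(target).hostname != start_host:
                continue  # external links are never followed
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited
```

With max_depth=2, only the start page and the same-host pages it links to are returned.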
Page |
CharacterSet |
The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response. |
Content |
The Content column contains the response content. |
ContentEncoding |
The ContentEncoding column contains the value of the Content-Encoding header returned with the response. |
ContentLength |
The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1. |
ContentType |
The ContentType column contains the value of the Content-Type header returned with the response. |
Image |
If the Image column is harvested, an image snapshot is taken of the web page, and the image binary is returned in this column. |
LastModified |
The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time. |
LoadTime |
The LoadTime column contains the time in milliseconds to execute the request. |
MD5 |
The MD5 column contains the MD5 sum of the response byte array. |
Method |
This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE. |
Pdfa |
If the Pdfa column is harvested, the web page is printed to pdf, and the document binary is returned in this column. |
ProtocolVersion |
The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource. |
ResponseUrl |
The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present. |
Server |
The Server column contains the value of the Server header returned with the response. |
StatusCode |
The StatusCode column contains a number that indicates the status of the HTTP response. |
StatusDescription |
A string that describes the status of the response. A common status message is OK. |
SuggestedName |
The SuggestedName column contains a fuzzy logic suggested name. |
Title |
The Title column contains the page title. |
TitlePath |
The titles of the pages that the harvester has passed on the way to this page, including the title of the current page, separated by a slash '/'. |
TitlePathRoot |
The titles of the pages that the harvester has passed on the way to this page, excluding the title of the current page, separated by a slash '/'. |
Url |
The Url column contains the request url. |
Table 3, the Page table columns
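Two of the derived columns in table 3 can be illustrated in a few lines. This is a sketch only: the helper names are hypothetical, the hexadecimal form of the MD5 value is an assumption (the article does not say how the sum is encoded), and the exact formatting of TitlePath, such as a leading slash, is not specified in this article.

```python
import hashlib


def md5_hex(response_bytes: bytes) -> str:
    """MD5 sum of the response byte array, as a hexadecimal string."""
    return hashlib.md5(response_bytes).hexdigest()


def title_paths(titles: list[str]) -> tuple[str, str]:
    """Return (TitlePath, TitlePathRoot) for a chain of page titles.

    titles lists the page titles from the start page to the current page.
    """
    title_path = "/".join(titles)  # includes the current page
    title_path_root = "/".join(titles[:-1])  # excludes the current page
    return title_path, title_path_root
```

For the titles Home, Products and Widget, title_paths gives Home/Products/Widget and Home/Products.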
The Page-AHref table
The Page-AHref table returns all <a href></a> elements of the specified web page.
Page-AHref |
Content |
The Content column contains the inner text of the element. |
Href |
The Href column contains the value of the href attribute of the element. |
Raw |
The Raw column contains the raw <a href=""></a>-element as it appears in the html document. |
Title |
The Title column contains the value of the title attribute of the element. |
Table 4, the Page-AHref columns
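To show what the Content, Href and Title columns of table 4 correspond to in an html document, the sketch below collects them with Python's standard HTMLParser. It is not the router's parser, the class and function names are hypothetical, and the Raw column is left out for brevity.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect Href, Title and Content for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # row being filled while inside an <a> element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self._current = {
                "Href": attributes["href"],
                "Title": attributes.get("title"),
                "Content": "",
            }

    def handle_data(self, data):
        # Text inside nested tags such as <b> is part of the inner text too.
        if self._current is not None:
            self._current["Content"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def extract_anchors(html: str) -> list[dict]:
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Anchors without an href attribute are skipped, matching the table's focus on <a href></a> elements.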
The Page-Header table
All headers for the specified web page are returned.
Page-Header |
Name |
The name part of the header name value pair. |
Value |
The value part of the header name value pair. |
Table 5, the Page-Header columns
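The name and value pairs of the Page-Header table can be pictured as one row per response header line. The sketch below splits a raw header block into such rows and also mirrors the ContentLength rule from the Page table, where -1 is used when no Content-Length header is present. Both function names are hypothetical and the code is only an illustration.

```python
def header_rows(raw_headers: str) -> list[dict[str, str]]:
    """Turn 'Name: value' lines into rows with Name and Value columns."""
    rows = []
    for line in raw_headers.splitlines():
        # Split on the first colon only, since values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            rows.append({"Name": name.strip(), "Value": value.strip()})
    return rows


def content_length(rows: list[dict[str, str]]) -> int:
    """Return the Content-Length value, or -1 if the header is not set."""
    for row in rows:
        if row["Name"].lower() == "content-length":
            return int(row["Value"])
    return -1
```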
The Page-Image table
All <img /> tags for the specified web page are returned.
Page-Image |
Alt |
The value of the alt attribute of the image tag. |
CharacterSet |
The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response. |
ContentEncoding |
The ContentEncoding column contains the value of the Content-Encoding header returned with the response. |
ContentLength |
The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1. |
ContentType |
The ContentType column contains the value of the Content-Type header returned with the response. |
LastModified |
The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time. |
LoadTime |
The time in milliseconds to execute the request. |
MD5 |
The MD5 sum of the response byte array. |
Method |
This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE. |
Original |
This column contains the image binary. |
ProtocolVersion |
The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource. |
Raw |
The raw image tag as it appears in the document. |
ResponseUrl |
The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present. |
Server |
The Server column contains the value of the Server header returned with the response. |
Src |
The value of the src attribute of the image tag. |
StatusCode |
The StatusCode column contains a number that indicates the status of the HTTP response. |
StatusDescription |
A string that describes the status of the response. A common status message is OK. |
SuggestedName |
The SuggestedName column contains a fuzzy logic suggested name. |
Title |
The value of the title attribute of the image tag. |
Table 6, the Page-Image columns
The Page-Link table
The Page-Link table is used for harvesting linked resources, such as style sheets, for the specified web page.
Page-Link |
CharacterSet |
The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response. |
ContentEncoding |
The ContentEncoding column contains the value of the Content-Encoding header returned with the response. |
ContentLength |
The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1. |
ContentType |
The ContentType column contains the value of the Content-Type header returned with the response. |
File |
The File column contains the response content. |
LastModified |
The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time. |
LoadTime |
The time in milliseconds to execute the request. |
MD5 |
The MD5 sum of the response byte array. |
Method |
This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE. |
Name |
The Name column contains a fuzzy logic suggested name. |
ProtocolVersion |
The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource. |
ResponseUrl |
The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present. |
Server |
The Server column contains the value of the Server header returned with the response. |
StatusCode |
The StatusCode column contains a number that indicates the status of the HTTP response. |
StatusDescription |
A string that describes the status of the response. A common status message is OK. |
Table 7, the Page-Link columns
The Page-Meta table
The Page-Meta table is used for harvesting meta tags for the specified web page.
Page-Meta |
Charset |
Specifies the character encoding for the HTML document. |
Content |
Gives the value associated with the http-equiv or name attribute. |
Http-equiv |
Provides an HTTP header for the information/value of the content attribute. |
Name |
Specifies a name for the metadata. |
Property |
The property in meta tags allows web pages to specify values to property fields which come from a property library. The property library (RDFa format) is specified in the head tag. |
Raw |
The raw meta tag as it appears in the document. |
Scheme |
Specifies a scheme to be used to interpret the value of the content attribute. |
Table 8, the Page-Meta columns
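The columns of table 8 map directly onto the attributes of a meta tag. As an illustration, the sketch below builds one row per meta tag with Python's standard HTMLParser. It is not the router's parser, the names MetaCollector and extract_meta are hypothetical, and using None for absent attributes is an assumption.

```python
from html.parser import HTMLParser

# Attribute name in the html document -> column name in table 8.
META_COLUMNS = {
    "charset": "Charset",
    "content": "Content",
    "http-equiv": "Http-equiv",
    "name": "Name",
    "property": "Property",
    "scheme": "Scheme",
}


class MetaCollector(HTMLParser):
    """Collect one row per <meta> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        row = {column: attributes.get(attr) for attr, column in META_COLUMNS.items()}
        # The tag text exactly as it appears in the document.
        row["Raw"] = self.get_starttag_text()
        self.rows.append(row)


def extract_meta(html: str) -> list[dict]:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```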
The Page-Script table
The Page-Script table is used for harvesting external script files for the specified web page.
Page-Script |
CharacterSet |
The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response. |
ContentEncoding |
The ContentEncoding column contains the value of the Content-Encoding header returned with the response. |
ContentLength |
The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1. |
ContentType |
The ContentType column contains the value of the Content-Type header returned with the response. |
File |
The File column contains the response content. |
LastModified |
The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time. |
LoadTime |
The time in milliseconds to execute the request. |
MD5 |
The MD5 sum of the response byte array. |
Method |
This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE. |
Name |
The Name column contains a fuzzy logic suggested name. |
ProtocolVersion |
The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource. |
ResponseUrl |
The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present. |
Server |
The Server column contains the value of the Server header returned with the response. |
StatusCode |
The StatusCode column contains a number that indicates the status of the HTTP response. |
StatusDescription |
A string that describes the status of the response. A common status message is OK. |
Table 9, the Page-Script columns