
Introduction

Welcome to the Dataflow Kit (DFK) API!

DFK’s API enables you to programmatically manage and run your web data extraction and SERP collection Tasks. You can easily retrieve the extracted data afterwards.

Rendering web pages, converting URLs to PDF and capturing web page screenshots can also be run in the Dataflow Kit cloud.

Quick links to DFK API services:

Curl, Go, Python, Node.js, and PHP code examples are available. You can view them in the dark area to the right, and you can switch the programming language of the examples with the tabs in the top right. By default, curl is selected so that you can try out the commands in your terminal.

Authentication

To authorize, use this code:

# With shell, you can just pass a valid API Key with each request
curl --request POST \
     --url https://api.dataflowkit.com/v1/{API-ENDPOINT}?api_key=YOUR_API_KEY -d \
'{
  "foo":"bar"
}'

API-ENDPOINT corresponds to the specific API endpoint you call. Make sure to replace YOUR_API_KEY with your API key.

After signing up, every user is assigned a personal API Access Key - a unique "password" used to make requests to the Dataflow Kit API.

Dataflow Kit API requests are authenticated by passing your secret API Key to the server as the api_key query parameter on every request.

It looks like the following: api_key=YOUR_API_KEY

The API Key can be found in the DFK Dashboard after registration.
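For example, here is a minimal Python sketch (using the requests library) that passes the API Key as the api_key query parameter; the /fetch endpoint and the JSON body are used purely for illustration:

import requests

API_KEY = "YOUR_API_KEY"   # taken from the DFK Dashboard

# Every request is authenticated with the api_key query parameter.
response = requests.post(
    "https://api.dataflowkit.com/v1/fetch",
    params={"api_key": API_KEY},
    json={"type": "base", "url": "https://example.com"},
)
response.raise_for_status()
print(response.text)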

Once you sign up, we grant you 1,000 free credits (equal to €5) for evaluation and testing.

Versioning

All Dataflow Kit API endpoints URLs start with https://api.dataflowkit.com/v1/.

The current API version 1 is available via the /v1 prefix.

If backward-incompatible changes need to be made to our API, we will release a new API version. The previous API version will be maintained for at least a year after the new version is released.

Tasks & Processes

Tasks and processes are central to the Dataflow Kit API.

A Task represents an instance of a web data extractor or search engine results (SERP) collector. Each run of a Task spawns a new process with a given set of parameters.

The Task endpoints are listed below.

Task endpoints Description Results
/task/create Create a new task. Returns a {JSON object} representing the task structure. Pass the task id to the run endpoint to launch it afterwards.
/task/{Task_ID}/run Run the task with the given task id. A new process spawned from the specified task is created and a {JSON object} representing the process structure is returned.
/task/{Task_ID}/info Get information about the task with the given task id. A {JSON object} containing the JSON payload and other meta information.
/task/{Task_ID}/results Retrieve the list of processes belonging to this task. Returns a [JSON array] containing the processes belonging to this task.
/task/{Task_ID}/update Update an existing task. Pass a {JSON object} task structure with updated fields. Returns a {JSON object} representing the updated task structure.
/task/{Task_ID}/delete Delete the task with the given task id. {"deleted":"ok"}

A Process is a single job spawned by a Task that performs a data extraction or conversion action.

Process endpoints Description
/Process/{Process ID}/info Returns a {JSON object} representing the process structure for the specified process id.
/Process/{Process ID}/cancel Cancels the process with the specified process id. Returns a {JSON object} representing the process structure for that process id.

The next sections list HTTP endpoints that can be used to manipulate Tasks & Processes.

Create a Task

Create a Web Data Extractor / SERP collection Task specifying a payload configuration

curl --request POST \
     --url https://api.dataflowkit.com/v1/task/create?api_key=YOUR_API_KEY \
     -d '{JSON Task Payload}'

The create task endpoint is used to create tasks with specified parameters so that they can be run multiple times afterwards. The same payload structure is used for both Web Data Extraction and Search Engine Results (SERP) collection tasks.

Send the JSON Task payload to the /task/create endpoint.

The create task endpoint returns a new Task object:

{
   "id":"1XtQA0Z15N3fqKZuzPKESUsTIW1",
   "name":"Task Name",
   "webhook":"https://your-web-site.com/webhook",
   "payload":{JSON Paylod},
   "description":"Task description...",
   "type":"extract"
}

Returned object

If successful, the task JSON object is returned; otherwise an error is returned.
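As an illustration, the following Python sketch (requests library) posts a task payload to /task/create and keeps the returned task id. The body shown here is only a placeholder and assumes the request mirrors the returned task structure:

import requests

API_KEY = "YOUR_API_KEY"

# Placeholder payload; see the "Extract data from web" section for a full
# collection scheme to put into "payload".
task_payload = {
    "name": "Task Name",
    "description": "Task description...",
    "type": "extract",
    "payload": {},
}

resp = requests.post(
    "https://api.dataflowkit.com/v1/task/create",
    params={"api_key": API_KEY},
    json=task_payload,
)
resp.raise_for_status()
task = resp.json()
print("created task:", task["id"])   # pass this id to /task/{Task_ID}/run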

Run a Task

Run the task

curl --request POST \
     --url https://api.dataflowkit.com/v1/task/{Task_ID}/run?api_key=YOUR_API_KEY

Posting a request to the /task/{Task_ID}/run endpoint starts a new process in the Dataflow Kit cloud, spawned from the previously created Task with {Task_ID}.

This method immediately returns a {JSON Process Object} generated by the current task, while the process continues in the background. Use webhooks or poll the /process/{Process_ID}/info endpoint to find out when the resulting data for this Process ID is ready to retrieve.
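A minimal Python sketch of running a previously created task might look like this (the task id is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"
TASK_ID = "1XtQA0Z15N3fqKZuzPKESUsTIW1"   # id returned by /task/create

resp = requests.post(
    f"https://api.dataflowkit.com/v1/task/{TASK_ID}/run",
    params={"api_key": API_KEY},
)
resp.raise_for_status()

process = resp.json()   # returned immediately; the work continues in the background
print("process id:", process["id"], "status:", process["status"])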

Process object

Process object

{
  "id": "1PBhj5EGo2hAvBsytLDL363A6Mq",
  "status":"finished",
  "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
  "startedAt": "1580302278",
  "finishedAt": "1580312522",
  "requestCount": 1000,
  "responseCount":1000,
  "results" : "Results File Name",
  "logFile" : "",
  "missingCredits":0,
  "cost":50,
}

Process object contains the following information:

Property Description
id A globally unique id that represents this Process.
status The status of the current process. Possible status values are described below.
taskID The Task ID which the current Process belongs to.
startedAt The time that this Process was started, in Unix time format.
finishedAt The time that this Process was completed or cancelled, in Unix time format. This field will be null if the run is either initialized or running.
requestCount The number of requests for web data / SERP extraction performed by this Process so far.
responseCount The number of successful responses for web data / SERP extraction received by this Process so far.
results The name of the results file in Dataflow Kit storage. The file format can be specified in the task payload as CSV, MS Excel, JSON, JSON Lines or XML.
logFile The link to the log file.
missingCredits The number of missing credits needed to complete a process. Partial data extracted so far will be available for download. The complete data set may be returned after replenishment of funds.
cost The number of credits that have been withdrawn for the current process.

Once a process spawned by a Task is completed, its status changes from running to one of the following statuses:

Process info

curl --request POST \
     --url https://api.dataflowkit.com/v1/processes/{Process_ID}/info?api_key=YOUR_API_KEY

The Process info endpoint returns the process object described above, containing all the details about a specific Process.

If the returned status is running, polling the process info endpoint will return updated request and response counts according to the actual progress.

Right after process completion, extra information such as startedAt, finishedAt, results and cost is returned.

If the process has been cancelled or has failed, either no results or an incomplete result set is returned.
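A minimal polling loop in Python could look like the sketch below; the poll interval and the process id are arbitrary, and webhooks are usually preferable for long-running processes:

import time
import requests

API_KEY = "YOUR_API_KEY"
PROCESS_ID = "1PBhj5EGo2hAvBsytLDL363A6Mq"

while True:
    resp = requests.post(
        f"https://api.dataflowkit.com/v1/processes/{PROCESS_ID}/info",
        params={"api_key": API_KEY},
    )
    resp.raise_for_status()
    process = resp.json()
    if process["status"] != "running":      # e.g. finished, failed or cancelled
        break
    print("progress:", process["responseCount"], "of", process["requestCount"])
    time.sleep(10)                          # arbitrary poll interval

print("final status:", process["status"], "| results file:", process.get("results"))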

Download results

Get Results download link

curl --request GET \
     --url https://api.dataflowkit.com/v1/getlink?api_key=YOUR_API_KEY \ 
     -d 'Results File Name'

Send a request containing the Results File Name from a Process to the /getlink endpoint to retrieve a download link.

As a result, the actual download link to the results file is returned:

https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.json?X-Amz-Signature=1b321eb76325140fb85a2dfb0fbc4834a7d8b998d3054d84636a77ecdd8016ef

Run the script to download the results file

curl --request GET \
     --url  "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.json?X-Amz-Signature=1b321eb76325140fb85a2dfb0fbc4834a7d8b998d3054d84636a77ecdd8016ef"

Run the script on the right to download the results using the link above.
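The same two steps in Python might look like the sketch below; it mirrors the curl calls above and assumes the /getlink endpoint returns the signed link as plain text:

import requests

API_KEY = "YOUR_API_KEY"
results_file = "Results File Name"   # the "results" field of a finished process

# Step 1: ask /getlink for a signed download URL.
link_resp = requests.get(
    "https://api.dataflowkit.com/v1/getlink",
    params={"api_key": API_KEY},
    data=results_file,
)
link_resp.raise_for_status()
download_url = link_resp.text.strip()

# Step 2: download the results file itself.
data = requests.get(download_url)
data.raise_for_status()
with open("results.json", "wb") as f:
    f.write(data.content)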


Cancel a Process

curl --request POST \
     --url https://api.dataflowkit.com/v1/processes/{Process_ID}/cancel?api_key=YOUR_API_KEY

The cancel method stops the specified currently running Process. Credits will be withdrawn for requests already processed successfully.

Task info

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks/{Task_ID}/info?api_key=YOUR_API_KEY

Gets a Task object that contains all the details about a specific Task.

Task object

{
  "id": "1PBhaN1wLaqN8BINrsDXlZANpWN",
  "name": "taskName",
  "description":"Task description...",
  "type":"extract",
  "payload": {JSON Payload},
  "webhook" : "http://mywebsite.com/webhook/"
}

Task object has the following properties:

Property Description
id A globally unique id that represents this Task.
name The Task name. This parameter is optional.
description An optional Task description.
type Currently only the "extract" type is available for all tasks.
payload A JSON structure that describes a set of rules for the Task launch. The payload depends on the task type. Each type of payload is described in the corresponding section.
webhook If provided, Dataflow Kit API will send the results to the given URL (see the receiver sketch below).
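If you use a webhook, any HTTP server reachable from the internet will do. Below is a minimal, hypothetical Python receiver built on the standard library only; it simply logs whatever is posted to it and makes no assumptions about the payload format:

from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and log the raw body sent to the webhook URL.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("webhook received:", body[:200])
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), WebhookHandler).serve_forever()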

Get Task results

Get a Task's results after completion.

curl --request POST \
     --url https://api.dataflowkit.com/v1/task/{Task_ID}/results?api_key=YOUR_API_KEY

The response consists of an array of the processes that were created by the specified task.

[
  {
    "id": "1PBhj5EGo2hAvBsytLDL363A6Mq",
    "status":"finished",
    "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
    "startedAt": "1580302278",
    "finishedAt": "1580312522",
    "requestCount": 1000,
    "responseCount":1000,
    "missingCredits":0,
    "cost":100,
    "results" : "Results File Name",
    "logFile" : ""
  },
  {
    "id":"1NotHmEj03c27QUn54dtgICziSy",
    "status":"failed",
    "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
    "startedAt": "1580302278",
    "finishedAt": "1580312522",
    "requestCount": 8,
    "responseCount":8,
    "missingCredits":0,
    "cost":100,
    "results" : "Results File Name",
    "logFile" : ""
  }
]

Send a request to the /task/{Task_ID}/results endpoint to retrieve an array of the processes that were created by the specified task.

Depending on the data extraction settings, the resulting data may be either downloaded from DFK storage or uploaded directly to Google Cloud, Dropbox or Microsoft OneDrive.
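For example, a small Python sketch that lists the processes spawned by a task and their results file names (the task id is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"
TASK_ID = "1NGYaLJsY8Xf7RwO99Ew3yyt5rz"

resp = requests.post(
    f"https://api.dataflowkit.com/v1/task/{TASK_ID}/results",
    params={"api_key": API_KEY},
)
resp.raise_for_status()

for process in resp.json():   # one entry per spawned process
    print(process["id"], process["status"], process["results"])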

Get a list of Tasks

Get a list of tasks.

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks?api_key=YOUR_API_KEY

This endpoint returns the list of all Tasks that the user has created or used. The response is a list of Tasks where each object contains basic information about a single Task.

As a response, a JSON array is returned with objects containing the user's tasks.

[
  {
   "id":"1XtQA0Z15N3fqKZuzPKESUsTIW1",
   "name":"SERP Task",
   "webhook":"https://your-web-site.com/webhook1",
   "payload":{JSON Paylod},
   "description":"SERP description...",
   "type":"extract"
  },
  {
   "id":"fg1QA0Z15N3fqKZuzPKESUsTIW1",
   "name":"Web Extraction Name",
   "webhook":"https://your-web-site.com/webhook2",
   "payload":{JSON Paylod},
   "description":"Web description...",
   "type":"extract"
  }
]

Delete a Task

Delete a Task

curl --request DELETE \
     --url https://api.dataflowkit.com/v1/task/{Task_ID}/delete?api_key=YOUR_API_KEY

Calling this endpoint deletes a specific Task along with the corresponding results data and log files.

As a response, the JSON object {"deleted":"ok"} is returned.

References

Refer to the corresponding sections for more information about specific task types:

Single Processes

A Single Process is intended for performing simple jobs like rendering/fetching HTML, capturing a screenshot or printing a web page to PDF. It is similar to a Task, but the general difference is that a Single Process can be run only once and returns its result immediately after finishing.

Examples of Single process types are listed here:

Fetch HTML

Base Fetcher

curl --request POST \
     --url https://api.dataflowkit.com/v1/fetch?api_key=YOUR_API_KEY -d \
'{
  "type":"base",
  "url":"https://anysite.com",
  "proxy": "country-any"
}'

Chrome Fetcher

curl --request POST \
     --url https://api.dataflowkit.com/v1/fetch?api_key=YOUR_API_KEY -d \
'{
  "type":"chrome",
  "url":"http://google.com",
  "proxy":"country-any",
  "waitDelay":0.5,
  "actions": [
        {
            "input": {
                "selector": "#search-box",
                "value": "Search Term"
            }
        },
        {
            "click": {
                "selector": "#button"
            }
        },
        {
            "waitVisible": {
                "selector": ":root"
            }
        },
        {
            "scroll": {
                "times": "10"
            }
        }
    ]
}'

The Fetch endpoint is used for downloading web pages. Regular pages are fetched "as is" using standard HTTP requests, while a real headless Chrome web browser is used for rendering dynamic JavaScript-driven web pages.

Base Fetcher

Base fetcher uses standard HTTP requests to download regular pages. It works faster than the Chrome fetcher.

Chrome Fetcher

Chrome fetcher is intended for rendering dynamic JavaScript-based content. It sends requests to Chrome running in headless mode.

Parameters

Parameter Description
type If set to "base", the Base fetcher is used for downloading web page content. Use "chrome" for fetching content with the headless Chrome browser.
url The URL to download.
proxy Specify a proxy, e.g. country-sk.
waitDelay Specify a custom delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load. (Chrome fetcher only)
actions Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages. (Chrome fetcher only)

Fetch Response

Fetch returns the UTF-8 encoded web page content.
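A minimal Python sketch of fetching a page and saving it to disk (the target URL is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.dataflowkit.com/v1/fetch",
    params={"api_key": API_KEY},
    json={
        "type": "base",              # use "chrome" for JavaScript-driven pages
        "url": "https://example.com",
        "proxy": "country-any",
    },
)
resp.raise_for_status()

with open("page.html", "w", encoding="utf-8") as f:
    f.write(resp.text)               # UTF-8 encoded page content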

Capture a Screenshot

Create a PNG Screenshot from URL

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/url/screenshot?api_key=YOUR_API_KEY \
    -H "Content-Type: application/json" \
    -d '{
    "url": "https://dataflowkit.com",
    "proxy": "country-au",
    "width": 1920,
    "height": 1080,
    "offsetx": 50,
    "offsety": 50,
    "scale": 1,
    "format": "jpeg",
    "quality": 90,
    "waitDelay": 0.5,
    "actions":[]
}'

The Dataflow Kit Screenshot endpoint is intended for taking screenshots of web pages.

It returns a download link for the captured PNG/JPEG screenshot (see the sketch after the parameter table below).

Parameter Default Description
url - Remote web page URL to take a screenshot of.
format png Sets the format of the output image. Values: png, jpeg.
quality 80 Sets the quality of the output image. Compression quality in the range [0..100] (jpeg only).
fullPage false Takes a screenshot of the full web page. It ignores the offsetx, offsety, width and height values.
clipSelector - Captures a screenshot of the specified HTML element. For example, pass a CSS selector like "#clipped-element" as the value.
offsetx 0 X offset in device independent pixels (dip).
offsety 0 Y offset in device independent pixels (dip).
width 800 Rectangle width in device independent pixels (dip).
height 600 Rectangle height in device independent pixels (dip).
scale 1 Page scale factor, in the range [0.1..3]. Defaults to 1.
waitDelay - Specify a custom delay (in seconds) before taking the screenshot. This may be useful if certain elements of the web site need to be rendered after the initial page load (e.g. CSS animations, JavaScript effects, etc.).
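A Python sketch of capturing and saving a full-page screenshot might look as follows; it assumes the endpoint returns the download link as plain text:

import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.dataflowkit.com/v1/convert/url/screenshot",
    params={"api_key": API_KEY},
    json={"url": "https://dataflowkit.com", "format": "png", "fullPage": True},
)
resp.raise_for_status()

# Follow the returned download link and save the image.
image = requests.get(resp.text.strip())
image.raise_for_status()
with open("screenshot.png", "wb") as f:
    f.write(image.content)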

Convert a URL to PDF

Create a Converter Task specifying a payload configuration

curl --request POST \
        --url https://api.dataflowkit.com/v1/convert/url/pdf?api_key=YOUR_API_KEY \
        -H "Content-Type: application/json" \
        -d '{
          "url": "https://dataflowkit.com",
          "proxy": "country-at",
          "paperSize": "A4",
          "landscape": false,
          "printBackground": false,
          "printHeaderFooter": true,
          "scale": 1,
          "pageRanges": "",
          "marginTop": 0.4,
          "marginLeft": 0.4,
          "marginRight": 0.4,
          "marginBottom": 0.4,
          "waitDelay": 0.5,
          "actions":[]
}'
Parameter Default Description
url - The full URL address (including HTTP/HTTPS) of the web page that you want to print to PDF.
proxy - Select the country of the proxy to pass requests through to target web sites.
landscape false Paper orientation. Set landscape to true for landscape orientation; the default is portrait.
paperSize "A4" The page size parameter accepts the most popular page formats. Possible values are: "A3", "A4", "A5", "A6", "Letter", "Legal", "Tabloid".
printBackground false Print background graphics in the PDF.
pageRanges - Specify page ranges to convert, e.g. '1-4, 6, 10-12'. Defaults to the empty value, which means convert all pages.
scale 1 By default, the PDF document content is generated according to the size and dimensions of the original web page content. Using the scale parameter you can specify a custom zoom factor of the web page rendering, from 0.1 to 5.0.
marginTop 0.4 inches Top margin of the PDF.
marginLeft 0.4 inches Left margin of the PDF.
marginRight 0.4 inches Right margin of the PDF.
marginBottom 0.4 inches Bottom margin of the PDF.
printHeaderFooter false Turn the header/footer on or off. They include the date, the name of the web page, the page URL and how many pages the document has.
waitDelay - Specify a custom delay (in seconds) before generating the PDF. This may be useful if certain elements of the web site need to be rendered after the initial page load (e.g. CSS animations, JavaScript effects, etc.).
actions - Actions simulate real-world human interaction with pages. They can be used to automate manual workflows before the PDF conversion is performed.

Extract data from web

The /extract endpoint crawls web pages and extracts data such as text, links or images following the specified rules. Dataflow Kit uses CSS selectors to find HTML elements in web pages and extract data from them. Extracted data is returned in CSV, MS Excel, JSON, JSON Lines or XML format.

Collection scheme

Here is a simple collection object:

'{
    "name":"test.dataflowkit.com",
    "request":{
        "url":"https://test.dataflowkit.com/persons/page-0",
        "type":"chrome",
        "proxy":"country-any"
    },
    "commonParent":".parent",
    "fields":[
        {
            "name":"Number",
            "selector":".badge-primary",
            "attrs":["text"],
            "type":1,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        },
        {
            "name":"Name",
            "selector":"#cards a",
            "attrs":["href","text"],
            "type":2,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        },
        {
            "name":"Picture",
            "selector":".card-img-top",
            "attrs":["src","alt"],
            "type":0,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        }
    ],
    "paginator":{
        "nextPageSelector":".page-link",
        "pageNum":2
        },
    "path":false,
    "format":"JSON"
}'

The collection scheme represents the settings for data extraction from a specified web site. It has the following properties:

Property Description Required
name Collection name. required
request Request parameters for downloading HTML pages. Refer to the Fetch HTML section for more details about request parameters. required
url url holds the starting web page address to be downloaded. required
type type specifies the fetcher type, which may be either "base" or "chrome". If omitted, the "base" fetcher is used by default. optional
commonParent commonParent specifies the common ancestor block for all fields used to extract data from a web page. optional
fields A set of fields used to extract data from a web page. A Field represents a given chunk of data to be extracted from every block on each page. Read more about field types. required
name Field name, used to aggregate results. required
selector Selector represents a CSS selector for data extraction within the given block. required
attrs A set of attributes to extract from a Field. Find more information about attributes. required
type Selector type. (0 - image, 1 - text, 2 - link) required
filters Filters are used for pre-processing of text data during extraction. optional
details Details is an optional field strictly intended for the Link extractor type. Details themselves represent an independent collection used to extract data from linked pages. Read more at "details". optional
paginator Paginator is used to scrape multiple pages. If there is no paginator in the scheme, then no pagination is performed and it is assumed that the initial URL is the only page. Read more about paginators. optional
path Path is a special field for navigation only. It is used to collect information from detail pages. No results from the current page will be returned. Defaults to false. optional
format Extracted data is returned in CSV, MS Excel, JSON, JSON Lines or XML format. required
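A hedged Python sketch of running an extraction with a collection scheme like the one above (loaded here from a local file, e.g. one exported by the point-and-click toolkit):

import json
import requests

API_KEY = "YOUR_API_KEY"

# Collection scheme as described above, stored in a local JSON file.
with open("collection.json", encoding="utf-8") as f:
    collection = json.load(f)

resp = requests.post(
    "https://api.dataflowkit.com/v1/extract",
    params={"api_key": API_KEY},
    json=collection,
)
resp.raise_for_status()
print(resp.text)   # extracted data in the format requested by "format"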

Field types and attributes

There are 3 predefined field types:

Text extracts human-readable text from the selected element and from all its child elements. HTML tags are stripped and only text is returned.

Link is used for link extraction and website navigation. It captures the href (URL) attribute and the link text. Alternatively, specify the special Path option for navigation only. When the Path option is specified, all other selectors are ignored and no results from the current page are returned.

Image extracts the src (URL) and alt attributes of an image.

Filters

Filters are used to manipulate text data when extracting.

The following filters are available:

Trim returns a copy of the Field's text/ attribute, with all leading and trailing white space removed.

Normal leaves the case and capitalization of text/ attribute exactly as is.

UPPERCASE makes all of the letters in the Field's text/ attribute uppercase.

lowercase makes all of the letters in the Field's text/ attribute lowercase.

Capitalize capitalizes the first letter of each word in the Field's text/ attribute.

Concatenate joins text array elements into a single string.

Regular Expressions

"filters":[ 
    {  
      "name":"regex",
      "param":"[\\d.]+"
    }
]

For more advanced text formatting, regular expressions can be used.

For example, currency signs can be removed from product prices.

The whole match (group 0) will be returned as a result. Some useful examples are listed below:

Input text Regex Result
price: 10.99€ [0-9]+.[0-9]+ 10.99
phone: 0 (944) 244-18-22 \w+ 09442441822
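As a plain-Python illustration of the first row above, the whole match (group 0) is what ends up in the extracted value:

import re

# The regex filter keeps only the whole match (group 0).
print(re.search(r"[\d.]+", "price: 10.99€").group(0))   # -> 10.99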

Details

Some parts are omitted for brevity

...
"fields":[
    {
        "name":"link2details",
        "selector":"h3 a",
        "details":{
            "name":"DetailsPage",
            "request":{
                "url":"http://example.com/details1/index.html",
                "type":""
            },
            "fields":[
                {
                    "name":"title",
                    "selector":"h1",
                    "attrs":[
                        "text"
                    ]
                }
            ],
            "paginator":{},
            "path":false
        },
        "attrs":[
            "href",
            "text"
        ]
    }
],
...

The Link field type might serve as a navigation link to a details page containing additional data.

By following the links from the main page, elements on the detail pages can be gathered into a separate collection.

The special Path option is used for navigation only. When the Path option is specified, no results from the current page are returned; grouped results from the details pages are returned instead.

A details page has its own fields and may contain paginators and collections for deeper-level details pages.

Paginator

Paginator is used to scrape multiple pages. It extracts the next page from a document by querying a given CSS selector.

There are three paginator types.

"Next link" paginator type is used on pages containing link pointing to a next page. The next page link is extracted from a document by querying href attribute of a given element's CSS selector.

"Infinite scroll" paginator type automatically loads additional page content while user scrolls page down.

"Load more Button" paginator type looks like "Next link" but behaves as "Infinite scroll" paginator type. It loads additional page content on its click.

Point-and-click toolkit

The easiest way to define fields for extraction is to use the Dataflow Kit visual interface.

Just click elements on the loaded page and then export the collection to a file.

Select Elements

Export collection

Extract SERPs

To crawl search engine result pages (SERPs), you can either run a single process or create a task. The SERP collection service extracts a list of organic results, news, images and more. Specify advanced configuration parameters such as country or language to customize the output SERP data.

The following search engines are supported:

Google Web
Google Images
Google News
Google Shopping
Bing
DuckDuckGo
Baidu
Yandex

Search parameters

Create a SERP Extractor Task.

curl --request POST \
        --url https://api.dataflowkit.com/v1/extract?api_key=YOUR_API_KEY \
        -H 'Content-Type: application/json' \
        -d '{
    "name": "google",
    "request": {
        "url": "https://www.google.com/search?q=dataflow+kit&lr=lang_de&gl=at",
        "proxy": "country-at",
        "type": "chrome"
    },
    "fields": [
        {
            "name": "selector1",
            "selector": ".r>a:first-of-type",
            "attrs": [
                "href",
                "text"
            ],
            "type": 2,
            "filters": [
                {
                    "name": "trim"
                }
            ]
        }
    ],
    "paginator": {
        "nextPageSelector": ".b.navend:last-child a",
        "pageNum": 3
    },
    "format": "csv"
}'
Parameter Description Notes
name Collection name. required
url url holds the link to the Search Engine to use, along with other optional parameters such as languages or country. required. See the URL GET parameters description below.

URL GET parameters

q Parameter defines the encoded search term. You can use anything that you would use in a regular search engine search. (e.g. for Google, link:dataflowkit.com, site:twitter.com Bratislava, inurl:view/view.shtml, etc.) See The Complete List of 42 Advanced Google Search Operators. The q parameter is used by Google, Bing and DuckDuckGo; text is used as the query parameter by Yandex; Baidu uses wd for this purpose.
tbm tbm is a special Google parameter used to differentiate between search types: tbm=isch - Google Images, tbm=nws - Google News, tbm=shop - Google Shopping.
lr Restricts the search to documents written in particular languages. Google uses lang_{two-letter lang code} to specify languages and | as a delimiter. (e.g. lang_sk|lang_de will only search Slovak and German pages.) See the full list of possible values for Google. For Bing, specify the setLang parameter (e.g. setLang=en). For Yandex, use the lang parameter (e.g. lang=ca).
gl Specifies the country to search from. It is a two-letter country code. (e.g. sk for Slovakia, or us for the United States.) For Google, see the Country Codes page for a list of valid values. For Bing, the cc parameter is used (e.g. cc=at).
Parameter Description Notes
proxy Select the country of the proxy to pass requests through to target web sites. NOTE: you always have to use a proxy when requesting SERPs. Use country-{two-letter country code} to locate the proxy in a specified country, or country-any for a random proxy. (e.g. country-us passes all requests through a US proxy; country-any passes proxified requests through a random country.)
fields A set of CSS selectors (patterns) used to gather data from Search Engine Result Pages. Ready-made payloads for collecting search results (SERP data) from the most popular search engines are available. These payloads are fully customizable.
pageNum Specify the number of pages to crawl. Defaults to 1.
format Select the format of the output data. Possible values are CSV, JSON(Lines), XML.

Results

Extracted data is returned in CSV, JSON, JSON(Lines) or XML format.