Introduction

Welcome to the Dataflow Kit (DFK) API!

DFK’s API enables you to programmatically manage and run your web data extraction and conversion Tasks, and to retrieve the extracted data.

Quick links to DFK API services:

Curl, Go, Python, Node.js, and PHP code examples are available. You can view them in the dark area to the right, and you can switch the programming language of the examples with the tabs in the top right. By default, curl is selected so that you can try out the commands in your terminal.

Authentication

To authorize, use this code:

# With shell, you can just pass a valid API Key with each request
curl --request POST \
     --url https://api.dataflowkit.com/v1/{API-ENDPOINT}?api_key=YOUR_API_KEY -d \
'{
  "foo":"bar"
}'

{API-ENDPOINT} stands for the API endpoint you are calling. Make sure to replace YOUR_API_KEY with your API key.

Every request to the Dataflow Kit API must be authenticated by passing your secret API Key as the api_key query parameter.

It looks like the following: api_key=YOUR_API_KEY

The API Key can be found in the DFK Dashboard.
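
The rule above can be sketched in Python as a small helper that appends api_key to an endpoint URL. Only the base URL and the api_key parameter come from this page; the helper name build_url is an illustrative assumption and not part of the DFK API.

```python
from urllib.parse import urlencode

API_BASE = "https://api.dataflowkit.com/v1"  # base URL taken from this page


def build_url(endpoint, api_key, **params):
    """Return the full endpoint URL with api_key in the query string.

    build_url is a hypothetical helper, not a DFK function.
    """
    query = urlencode({**params, "api_key": api_key})
    return f"{API_BASE}/{endpoint.strip('/')}?{query}"


print(build_url("fetch", "YOUR_API_KEY"))
```

Extra query parameters, such as name for the create endpoints, can be passed as keyword arguments.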

Versioning

All Dataflow Kit API endpoint URLs start with https://api.dataflowkit.com/v1/.

The current API version 1 is available via the /v1 prefix.

If there are backward incompatible changes that need to be made to our API, we will release a new API version. The previous API version will be maintained for at least a year after releasing the new version.

Tasks & Processes

Tasks and Processes are central to the Dataflow Kit API.

A Task represents an instance of a web data extractor or a search engine results (SERPs) extractor. A Task is a repeatable process that runs at a given time with a given set of parameters.

Examples of tasks are listed below.

| Task endpoint | Description | Result |
|---|---|---|
| /create | Creates a new task. | Task id. Pass this id to the run endpoint. |
| /run | Runs the previously created task with the given task id. | Process id. A new process is created and its ID is returned. |
| /info | Gets information about the task with the given task id. | JSON object containing the JSON payload and other meta information. |
| /results | - | Array of processes created by this task. |
| /delete | Deletes the task with the given task id. | {"deleted":"ok"} |

A Process is a single job spawned by a Task that performs a data extraction or conversion action.

| Process endpoint | Description |
|---|---|
| /info | Returns information about the process with the given process id. |
| /cancel | Cancels the process with the given process id. |

The next sections list HTTP endpoints that can be used to manipulate Tasks & Processes.

Create a Task

The Create Task endpoint creates a task with the specified parameters so that it can be run afterwards. The payload configuration depends on the task type, but the scheme is identical for all task types.

The following task types can be created and run in the Dataflow Kit cloud.

Web Data extraction Task

Create an Extractor Task specifying payload configuration

curl --request POST \
     --url 'https://api.dataflowkit.com/v1/extract/create?name=TASK_NAME&api_key=YOUR_API_KEY' \
     -d '{JSON Collection Payload}'

Send the JSON payload to the /extract/create endpoint. If successful, it returns the task id.

{"id":"1PBhaN1wLaqN8BINrsDXlZANpWN"}

Otherwise, an error is returned.

Search engine results (SERPs) extraction Task

Create a SERP Extractor Task specifying payload configuration

curl --request POST \
     --url 'https://api.dataflowkit.com/v1/serp/create?name=TASK_NAME&api_key=YOUR_API_KEY' \
     -d '{JSON SERP extraction Payload}'

This task type is specially intended for data extraction from search engine result pages.

Send the JSON payload to the /serp/create endpoint. If successful, it returns the task id.

{"id":"r5FhaN1wLaqN8BINrsDXlsANpWf"}

Otherwise, an error is returned.
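
As a rough Python counterpart to the two curl commands above, the sketch below builds and sends a create request using only the standard library. The function names are assumptions made for illustration. Like the curl examples, it sets no explicit Content-Type header, and it assumes the success response has the {"id": ...} shape shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.dataflowkit.com/v1"


def create_task_request(kind, name, api_key, payload):
    """Build (but do not send) a POST request for extract/create or serp/create."""
    if kind not in ("extract", "serp"):
        raise ValueError("kind must be 'extract' or 'serp'")
    query = urlencode({"name": name, "api_key": api_key})
    return Request(
        f"{API_BASE}/{kind}/create?{query}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
    )


def create_task(kind, name, api_key, payload):
    """Send the request and return the task id from the JSON response."""
    with urlopen(create_task_request(kind, name, api_key, payload)) as resp:
        return json.load(resp)["id"]
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before any network call is made.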

Run a Task

Run the task

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks/{Task_ID}/run?api_key=YOUR_API_KEY

The Task Run method starts a new process of the previously created Task in the Dataflow Kit cloud.

This method immediately returns a Process ID generated by the current task, while the process continues in the background. You can use webhooks or polling to find out when the result data for this Process ID is ready to be retrieved.

Run task endpoint returns Process ID

  {"id":"Process_ID"} 

Process object

Process object

{
  "id": "1PBhj5EGo2hAvBsytLDL363A6Mq",
  "status":"finished",
  "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
  "startedAt": "2019-07-10_13:46:32",
  "finishedAt": "2019-07-10_13:47:07",
  "took": "35.567787745s",
  "requestCount": 1000,
  "responseCount":1000,
  "results" : "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.csv?X-Amz-Signature=1b321eb76325140fb85a2dfb0fbc4834a7d8b998d3054d84636a77ecdd8016ef",
  "logFile" : "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.log?X-Amz-Signature=1b321eb76325140fb85a2dfb0fbc4834a7d8b998d3054d84636a77ec8016ef"
}

The Process object is created by calling the tasks/{taskID}/run endpoint described above. It contains information about the specified task run and the results returned after the process completes.

| Property | Description |
|---|---|
| id | A globally unique id that represents this Process. |
| status | Status of the current process. Possible values are running, finished, cancelled, failed. |
| taskID | Parent task ID. |
| startedAt | The time at which this Process was started, in UTC +0000. |
| finishedAt | The time at which this Process was completed or cancelled, in UTC +0000. This field is null while the run is initialized or running. |
| took | Elapsed time of the Process. |
| requestCount | The number of successful requests for web data / SERP extraction Tasks that have been performed by this Process so far. |
| responseCount | The number of successful responses for web data / SERP extraction Tasks that have been performed by this Process so far. |
| results | The link to the results file in CSV, MS Excel, JSON, JSON Lines or XML format, depending on the format parameter of the specified collection scheme. The JSON file format is used for the SERP endpoint. |
| logFile | The link to the log file. |

Process info

curl --request POST \
     --url https://api.dataflowkit.com/v1/processes/{Process_ID}/info?api_key=YOUR_API_KEY

Process Info endpoint returns a process object described above that contains all the details about a specific Process.

While the response status is running, polling the process info endpoint returns request and response counts that reflect the actual progress.

Right after process completion extra information like startedAt, finishedAt, results and logFile will be returned.

No results or incomplete result sets are returned if the process has been canceled or the process failed.
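
Polling, as described above, might look like the following Python sketch. It assumes the caller supplies a get_info function that calls /processes/{Process_ID}/info and returns the decoded Process object as a dict. The function names, the interval and the timeout are illustrative choices, not values prescribed by the API.

```python
import time


def wait_for_process(get_info, process_id, interval=5.0, timeout=600.0):
    """Poll until the status is no longer "running", then return the Process object.

    get_info is a caller-supplied function (an assumption of this sketch)
    that returns the Process object for process_id as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        info = get_info(process_id)
        if info["status"] != "running":
            return info  # finished, cancelled or failed
        if time.monotonic() >= deadline:
            raise TimeoutError(f"process {process_id} is still running")
        time.sleep(interval)
```

The returned object should still be checked for its final status, since cancelled and failed processes may carry no results or only partial results.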

Download results & log files

curl --request GET \
     --url  "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.json?X-Amz-Signature=1b321eb76325140fb85a2dfb0fbc4834a7d8b998d3054d84636a77ecdd8016ef"

Run the script on the right to download the results and log files, using the links from the process object described above.

Cancel a Process

curl --request POST \
     --url https://api.dataflowkit.com/v1/processes/{Process_ID}/cancel?api_key=YOUR_API_KEY

Cancel method stops the specific Process. Any data that was extracted so far will be available for download.

Task info

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks/{Task_ID}/info?api_key=YOUR_API_KEY

Gets a Task object that contains all the details about a specific Task.

Task object

{
  "id": "1PBhaN1wLaqN8BINrsDXlZANpWN",
  "name": "taskName",
  "payload": "{JSON Payload}",
  "webhook" : "http://mywebsite.com/webhook/"
}

The Task object has the following properties:

| Property | Description |
|---|---|
| id | A globally unique id that represents this Task. |
| name | Task name. This parameter is optional. |
| payload | JSON structure that describes a set of rules for launching the Task. The payload depends on the task type. Each type of payload is described in the corresponding section. |
| webhook | If provided, the Dataflow Kit API will send the results to the given URL. |

Get a Task's results

Get a Task's results after completion.

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks/{Task_ID}/results?api_key=YOUR_API_KEY

The response is an array of the processes that were created by the specified task.

[
  {
    "id": "1PBhj5EGo2hAvBsytLDL363A6Mq",
    "status":"finished",
    "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
    "startedAt": "2019-04-14T23:09:38",
    "finishedAt": "2019-04-14T23:10:40",
    "took": "35.567787745s",
    "requestCount": 1000,
    "responseCount":1000,
    "results" : "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.csv?X-Amz-Signature=1b321eb763636a77ecdd8016ef",
    "logFile" : "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.log?X-Amz-Signature=198d3054d84636a77ec8016ef"
  },
  {
    "id":"1NotHmEj03c27QUn54dtgICziSy",
    "status":"failed",
    "taskID":"1NGYaLJsY8Xf7RwO99Ew3yyt5rz",
    "startedAt": "2019-04-18T00:09:38",
    "finishedAt": "2019-04-18T00:10:40",
    "requestCount": 8,
    "responseCount":8,
    "took": "3.567787745s",
    "results" : "",
    "logFile" : "https://dfk-storage.ams3.digitaloceanspaces.com/results/96d16bce_2019-05-15_19%3A02.log?X-Amz-Signature=1b321eb766a77ec8016ef"
  }
]

Once a process started by a Task is completed, its status changes from running to one of the following statuses: finished, cancelled or failed.

If successful, the process returns a link to the result data in CSV, MS Excel, JSON, JSON Lines or XML format, depending on the format parameter of the specified collection scheme for data extraction tasks. The JSON file format is used for the SERP endpoint.

Depending on the data extraction settings, the result data may then be either downloaded from DFK storage or delivered directly to e-mail, Amazon S3, Google Cloud or Dropbox.
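
To pick out the downloadable files from such a response, one option is to keep only the processes that finished and have a non-empty results link. The sketch below assumes the response has already been decoded into a list of dicts; the function name and the example.com links are placeholders.

```python
def finished_result_links(processes):
    """Return the results links of all processes whose status is "finished"."""
    return [
        p["results"]
        for p in processes
        if p.get("status") == "finished" and p.get("results")
    ]


# Shortened stand-in for the response array shown above.
sample = [
    {"id": "1PBhj5EGo2hAvBsytLDL363A6Mq", "status": "finished",
     "results": "https://example.com/results.csv"},
    {"id": "1NotHmEj03c27QUn54dtgICziSy", "status": "failed", "results": ""},
]
print(finished_result_links(sample))
```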

Get a list of Tasks

Get a list of tasks.

curl --request POST \
     --url https://api.dataflowkit.com/v1/tasks?api_key=YOUR_API_KEY

This endpoint returns the list of all Tasks that the user created or used. Each object in the list contains basic information about a single Task.

As a response, a JSON array will be returned with objects containing user tasks.

[
  {"id": "Task_ID_1", "name": "EXTRACT_TASK_NAME"},
  {"id": "Task_ID_2", "name": "SERP_TASK_NAME"},
  {"id": "Task_ID_3", "name": ""}
]

Delete a Task

Delete a Task

curl --request DELETE \
     --url https://api.dataflowkit.com/v1/tasks/{Task_ID}/delete?api_key=YOUR_API_KEY

Calling this endpoint deletes the specified Task along with its result data and log files.

The response is a JSON object: {"deleted":"ok"}

References

Refer to the corresponding sections for more information:

Single Processes

A Single Process is intended for simple jobs such as fetching an HTML page, taking a screenshot or converting a file. It is similar to a Task, but the main difference is that a Single Process can be run only once and returns its result immediately after finishing.

Examples of Single process types are listed here:

Fetch HTML

The Fetch endpoint downloads web pages. Regular pages are fetched "as is" using standard HTTP requests, while a real headless web browser is used to render dynamic JavaScript-driven web pages.

Parameters

| Parameter | Description |
|---|---|
| type | If set to "base", the Base fetcher is used for downloading web page content. Use "chrome" for fetching content with the headless Chrome browser. |
| url | The URL to download. |
| proxy | Custom proxy, for example http://1.2.3.4:3128 |

Fetch Response

Fetch returns UTF-8 encoded web page content.

Base Fetcher

curl --request POST \
     --url https://api.dataflowkit.com/v1/fetch?api_key=YOUR_API_KEY -d \
'{
  "type":"base",
  "url":"http://google.com",
  "proxy": "http://0.0.0.0:55555"
}'

Base fetcher uses standard http requests to download regular pages. It works faster than Chrome fetcher.

Chrome Fetcher

curl --request POST \
     --url https://api.dataflowkit.com/v1/fetch?api_key=YOUR_API_KEY -d \
'{
  "type":"chrome",
  "url":"http://google.com"
}'

Chrome fetcher is intended for rendering dynamic JavaScript-based content. It sends requests to Chrome running in headless mode.
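
The request bodies for both fetchers differ only in the type value, so they can be produced by one small function. The Python sketch below is illustrative; fetch_payload is a hypothetical name, and generating the body with json.dumps avoids hand-written JSON mistakes such as trailing commas.

```python
import json


def fetch_payload(url, fetcher="base", proxy=None):
    """Return the JSON body for the fetch endpoint as a string."""
    if fetcher not in ("base", "chrome"):
        raise ValueError('fetcher must be "base" or "chrome"')
    payload = {"type": fetcher, "url": url}
    if proxy:
        payload["proxy"] = proxy  # included only when a proxy is given
    return json.dumps(payload)


print(fetch_payload("http://google.com", fetcher="chrome"))
```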

Take a Screenshot

Create a PNG Screenshot from URL

curl --request POST \
    --url https://api.dataflowkit.com/v1/screenshot?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://dataflowkit.com \
    --form format=png \
    --form qality=80 \
    --form fullPage=false \
    --form xoffset=0 \
    --form yoffset=0 \
    --form width=800 \
    --form height=600 \
    --form scale=1 \
    -o result.png

The Dataflow Kit Screenshot endpoint takes screenshots of web pages.

It returns a link to the captured screenshot in PNG or JPEG format.

It accepts POST requests with a multipart/form-data Content-Type.

| Parameter | Default | Description |
|---|---|---|
| remoteURL | - | Remote URL to take a screenshot from. |
| format | png | Format of the output image. Values: png, jpeg. |
| qality | 80 | Quality of the output image. Compression quality in the range [0..100] (jpeg only). |
| fullPage | false | Takes a screenshot of the full web page. When set, the xoffset, yoffset, width and height values are ignored. |
| xoffset | 0 | X offset in device independent pixels (dip). |
| yoffset | 0 | Y offset in device independent pixels (dip). |
| width | 800 | Rectangle width in device independent pixels (dip). |
| height | 600 | Rectangle height in device independent pixels (dip). |
| scale | 1 | Page scale factor. |

Convert files to PDF

Create a Converter Task specifying payload configuration

curl --request POST \
     --url https://api.dataflowkit.com/v1/convert/{from}/pdf?api_key=YOUR_API_KEY \
     -d '{JSON Conversion Payload}'

The {from} parameter takes one of the following values: url, html, markdown, office.

If successful, the endpoint returns a link to the resulting PDF file.

Otherwise, an error is returned.

Extract data from web

The /extract endpoint crawls web pages and extracts data such as text, links or images according to the specified rules. Dataflow Kit uses CSS selectors to find HTML elements in web pages and extract data from them. Extracted data is returned in CSV, MS Excel, JSON, JSON Lines or XML format.

Collection scheme

Here is a simple collection object:

'{
    "name":"test.dataflowkit.com",
    "request":{
        "url":"https://test.dataflowkit.com/persons/page-0",
        "type":"chrome"
    },
    "fields":[
        {
            "name":"Number",
            "selector":".badge-primary",
            "attrs":["text"],
            "type":1,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        },
        {
            "name":"Name",
            "selector":"#cards a",
            "attrs":["href","text"],
            "type":2,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        },
        {
            "name":"Picture",
            "selector":".card-img-top",
            "attrs":["src","alt"],
            "type":0,
            "filters":[
                {
                    "name":"trim"
                }
            ]
        }
    ],
    "paginator":".page-link",
    "path":false
}'

The collection scheme holds the settings for data extraction from the specified web site. It has the following properties:

| Property | Description | Required |
|---|---|---|
| name | Collection name. | required |
| request | Request parameters for downloading HTML pages. Refer to the Fetch HTML section for more details about request parameters. | required |
| request.url | The starting web page address to be downloaded. | required |
| request.type | The fetcher type, either "base" or "chrome". If omitted, the "base" fetcher is used by default. | optional |
| fields | A set of fields used to extract data from a web page. A Field represents a given chunk of data to be extracted from every block on each page. See Field types and attributes. | required |
| fields[].name | Field name, used to aggregate results. | required |
| fields[].selector | CSS selector for data extraction within the given block. Pass in "." to use the root block's selector. | required |
| fields[].attrs | A set of attributes to extract from a Field. See Field types and attributes. | required |
| fields[].filters | Filters pre-process text data during extraction. | optional |
| fields[].details | An optional field strictly for the Link extractor type. It guides the scraper to parse additional pages by following the links, according to the set of fields specified inside "details". | optional |
| paginator | Used to scrape multiple pages. If there is no paginator in the scheme, no pagination is performed and the initial URL is assumed to be the only page. See Paginator. | optional |
| path | A special field for navigation only. It is used to collect information from details pages. No results from the current page will be returned. Defaults to false. | optional |
| format | Extracted data is returned in CSV, MS Excel, JSON, JSON Lines or XML format. | required |
| delivery | Email, Amazon S3 bucket, FTP, Dropbox, etc. | Not implemented yet |
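
Before sending a collection scheme, it can help to check it against the required properties listed above. The Python sketch below is one possible client-side check, not something the API provides; the function name and the dotted path notation in its output are assumptions of this example.

```python
def missing_required(collection):
    """Return the required properties (per the table above) that are absent."""
    missing = [
        key for key in ("name", "request", "fields", "format")
        if key not in collection
    ]
    if "url" not in collection.get("request", {}):
        missing.append("request.url")
    for i, field in enumerate(collection.get("fields", [])):
        missing += [
            f"fields[{i}].{key}"
            for key in ("name", "selector", "attrs")
            if key not in field
        ]
    return missing


scheme = {
    "name": "test.dataflowkit.com",
    "request": {"url": "https://test.dataflowkit.com/persons/page-0"},
    "fields": [{"name": "Number", "selector": ".badge-primary"}],
}
print(missing_required(scheme))
```

An empty list means every property marked as required is present; it says nothing about whether the values themselves are valid.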

Field types and attributes

There are 3 predefined field types:

Text extracts human-readable text from the selected element and from all its child elements. HTML tags are stripped and only text is returned.

Link is used for link extraction and website navigation. Capture the href (URL) and text attributes, or specify the special Path option for navigation only. When the Path option is specified, all other selectors become disabled and no results from the current page are returned.

Image selector extracts src (URL) and alt attributes of an image.

Filters

Filters are used to manipulate text data when extracting.

The following filters are available:

Trim returns a copy of the Field's text/attribute with all leading and trailing white space removed.

Normal case leaves the case and capitalization of the text/attribute exactly as is.

UPPERCASE makes all of the letters in the Field's text/attribute uppercase.

lowercase makes all of the letters in the Field's text/attribute lowercase.

Capitalize capitalizes the first letter of each word in the Field's text/attribute.
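
The effect of these filters can be approximated in Python as shown below. This is only a sketch of the described behavior, not DFK's implementation. The function names are illustrative, and it is an assumption that Capitalize leaves the remaining letters of each word unchanged.

```python
def trim(text):
    """Remove leading and trailing white space."""
    return text.strip()


def uppercase(text):
    return text.upper()


def lowercase(text):
    return text.lower()


def capitalize(text):
    """Uppercase the first letter of each space-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))


print(capitalize(trim("  hello wide world \n")))
```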

Regular Expressions

"filters":[ 
    {  
      "name":"regex",
      "param":"[\\d.]+"
    }
]

For more advanced text formatting, regular expressions can be used.

For example, currency signs can be removed from product prices.

The whole match (group 0) will be returned as a result. Some useful examples are listed below:

| Input text | Regex | Result |
|---|---|---|
| price: 10.99€ | [0-9]+.[0-9]+ | 10.99 |
| id: H18JKDX4 | [A-Z0-9]{8} | H18JKDX4 |
| date: 2018-10-19 | [0-9]{4}-[0-9]{2}-[0-9]{2} | 2018-10-19 |
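
The rows above can be reproduced with Python's re module, used here as a stand-in for whatever regex engine DFK runs, so syntax details may differ. The function name is illustrative, and returning an empty string when nothing matches is an assumption of this sketch.

```python
import re


def regex_filter(text, pattern):
    """Return the whole first match (group 0), or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


print(regex_filter("price: 10.99€", r"[0-9]+.[0-9]+"))
print(regex_filter("id: H18JKDX4", r"[A-Z0-9]{8}"))
print(regex_filter("date: 2018-10-19", r"[0-9]{4}-[0-9]{2}-[0-9]{2}"))
```

Note that the unescaped dot in the first pattern matches any character; writing it as \. would restrict the match to a literal decimal point.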

Details

Some parts are omitted for brevity

...
"fields":[
  {
      "name":"link2details",
      "selector":"h3 a",
      "details":{
          "name":"DetailsPage",
          "request":{
              "url":"http://example.com/details1/index.html",
              "type":""
          },
          "fields":[
              {
                  "name":"title",
                  "selector":"h1",
                  "attrs":[
                      "text"
                  ]
              }
          ],
          "paginator":"",
          "path":false
      },
      "attrs":[
          "href",
          "text"
      ]
  }
],
...

The Link field type may serve as a navigation link to a details page containing additional data.

By following the links from the main page, elements on the details pages can be gathered into a separate collection.

The special Path option is used for navigation only. When the Path option is specified, no results from the current page are returned; grouped results from the details pages are returned instead.

A details page consists of its own fields and may contain paginators and collections of deeper-level details pages.

Paginator

Paginator is used to scrape multiple pages. It extracts the next page from a document by querying a given CSS selector and extracting the given HTML attribute from the resulting element.

There are three paginator types.

"Next link" paginator type is used on pages that contain a Next button or link.

"Infinite scroll" automatically loads content as the user scrolls down the page.

"Load more button" looks like "Next link" but loads content on click.

A paginator is described by the following properties:

Type represents the paginator type. The following are available: "next", "more", "infinite".

Selector represents the corresponding CSS selector for the "Next link" or "Load more button" paginator types.

Attr belongs exclusively to the "Next link" paginator and defines the HTML element attribute that holds the next page.

Point-and-click toolkit

The most comfortable way to define fields for extraction is to use the Dataflow Kit visual interface.

Just click elements on the loaded page and then export the collection to a file.

Select Elements

Export collection

Extract SERPs

SERPs endpoint /serp crawls search engine result pages (SERPs) and extracts a list of organic results, ads, news, images and more. Specify advanced configuration parameters such as country or language to customize SERP data.

Extracted data is returned in JSON format.

The following search engines are supported:

google google_news google_image
bing bing_news baidu
duckduckgo duckduckgo_news youtube
infospace webcrawler

Search parameters

Create a SERP Extractor Task.

curl --request POST \
     --url 'https://api.dataflowkit.com/v1/serp/create?name=TASK_NAME&api_key=YOUR_API_KEY' \
     -d '{
          "search_engine": "google",
          "keywords": [
            "dataflow kit",
            "extract SERP"
          ],
          "num_pages": 3,
          "region": "us"
}'

| Parameter | Description |
|---|---|
| search_engine | The search engine to use. Valid values are google, google_news, google_image, bing, bing_news, duckduckgo, duckduckgo_news, youtube, baidu, infospace, webcrawler. |
| keywords | An array of search terms. You can use anything that you would use in a regular search engine query (e.g. for Google, link:dataflowkit.com, site:twitter.com Bratislava, inurl:view/view.shtml, etc.). |
| num_pages | The number of pages to scrape for each keyword. |
| region | The region to send requests from. Available values are: 'us' (United States), 'de' (Germany), 'uk' (United Kingdom), 'fr' (France). More regions will be available soon. |
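
A SERP payload can also be assembled and checked in code before it is sent. The Python sketch below validates search_engine and region against the values listed above. The function name and the num_pages lower bound of 1 are assumptions of this example.

```python
import json

SEARCH_ENGINES = {
    "google", "google_news", "google_image", "bing", "bing_news",
    "duckduckgo", "duckduckgo_news", "youtube", "baidu", "infospace",
    "webcrawler",
}
REGIONS = {"us", "de", "uk", "fr"}


def serp_payload(search_engine, keywords, num_pages, region):
    """Return the JSON body for serp/create after basic validation."""
    if search_engine not in SEARCH_ENGINES:
        raise ValueError(f"unsupported search engine: {search_engine}")
    if region not in REGIONS:
        raise ValueError(f"unsupported region: {region}")
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    return json.dumps({
        "search_engine": search_engine,
        "keywords": list(keywords),
        "num_pages": num_pages,
        "region": region,
    })


print(serp_payload("google", ["dataflow kit", "extract SERP"], 3, "us"))
```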

Google. Search parameters

{
  "search_engine": "google",
  "keywords": ["dataflow kit", "extract SERP"],
  "num_pages": 1,
  "google_settings": {
    "google_domain": "google.com",
    "gl": "us",
    "hl": "en",
    "start": 0,
    "num": 10
  }
}

You can specify additional search parameters for the Google search engine with the google_settings key.

| Parameter | Description |
|---|---|
| google_domain | The Google domain used for extracting SERPs. It defaults to google.com. See the Google domains section for a full list of supported Google domains. |
| gl | Geolocation. Use gl=country to get country-specific search results (e.g. gl=us for the United States, gl=sk for Slovakia, gl=de for Germany, etc.). See the list of available country codes. |
| hl | Host language of the user interface. Defines the language to use for the Google search as a two-letter language code (e.g. hl=en for English, hl=es for Spanish, or hl=fr for French). See the Google languages list for all supported languages. |
| start | The result offset, used for pagination. It skips the given number of results (e.g. 0 (default) is the first page of results, 10 is the 2nd page, 20 is the 3rd page, etc.). |
| num | The maximum number of results to return per page (e.g. 10 (default) returns 10 results, and 100 returns 100 results). The maximum is 100. |
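
The relationship between start and num is simple arithmetic: page k (counting from zero) begins at offset k * num. A minimal Python sketch, with an illustrative function name and an assumed lower bound of 1 for num:

```python
def google_start_offsets(num_pages, num=10):
    """Return the start offset for each of num_pages result pages."""
    if not 1 <= num <= 100:
        raise ValueError("num must be between 1 and 100")
    return [page * num for page in range(num_pages)]


print(google_start_offsets(3))
```

With the default num of 10 this yields the offsets 0, 10 and 20 for the first three pages, matching the examples in the table.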

Results

It returns a JSON file containing results from Google, Bing, Baidu, etc., along with other meta information about the searched topics.

Convert to PDF

The Convert to PDF endpoint converts URLs, local HTML, Markdown and Office documents to PDF.

HTML and Markdown conversions are performed using Google Chrome headless browser.

Assets: You can send your header, footer, images, fonts, stylesheets and so on for converting your HTML and Markdown to PDFs.

URL

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/url/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://dataflowkit.com \
    --form marginTop=0 \
    --form marginBottom=0 \
    --form marginLeft=0 \
    --form marginRight=0

Use the Dataflow Kit endpoint /convert/url/pdf to convert a remote URL to PDF.

It accepts POST requests with a multipart/form-data Content-Type.

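| Parameter | Default | Description |
|---|---|---|
| remoteURL | - | Remote URL to be converted to PDF |
| marginTop | 0 | Top margin |
| marginBottom | 0 | Bottom margin |
| marginLeft | 0 | Left margin |
| marginRight | 0 | Right margin |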

HTML

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/html/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form files=@index.html \
    --form files=@header.html \
    --form files=@footer.html \
    --form files=@style.css \
    --form files=@img.png \
    --form files=@font.woff \
    --form paperWidth=8.27 \
    --form paperHeight=11.27 \
    --form marginTop=1.2 \
    --form marginBottom=1.2 \
    --form marginLeft=1 \
    --form marginRight=1 \
    --form landscape=true

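The DFK endpoint /convert/html/pdf is intended for HTML file conversions.

Just send a POST request with a multipart/form-data Content-Type.

| Parameter | Default | Description |
|---|---|---|
| files | - | HTML files to be converted to PDF. The main file index.html is required. All other parameters are optional. |
| paperWidth | 8.27 | Paper width, in inches. |
| paperHeight | 11.69 | Paper height, in inches. |
| marginTop | 1 | Top margin, in inches. |
| marginBottom | 1 | Bottom margin, in inches. |
| marginLeft | 1 | Left margin, in inches. |
| marginRight | 1 | Right margin, in inches. |
| landscape | false | By default, the PDF is rendered with portrait orientation. |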

Using parameters you can customize the resulting PDF file.

Paper size and margins have to be provided in inches.

By default, it will be rendered with A4 size, 1 inch margins and portrait orientation.
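
Since paper sizes are often specified in millimetres, a conversion to inches may be needed. The Python sketch below shows the arithmetic for A4 (210 x 297 mm), which gives the default values 8.27 and 11.69. The helper name and the rounding to two decimals are choices made for this example.

```python
MM_PER_INCH = 25.4


def mm_to_inches(mm):
    """Convert millimetres to inches, rounded to two decimals."""
    return round(mm / MM_PER_INCH, 2)


# A4 is 210 x 297 mm.
print(mm_to_inches(210), mm_to_inches(297))
```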

header.html file sample

<html>
<head>
  <style>
    body {
      font-size: 14px;
      margin: 100px auto;
    }
  </style>
</head>
<body>
  <h1>Header</h1>
  <p><span class="date"></span></p>
  <p>
    <span class="pageNumber"></span> of <span class="totalPages"></span>
  </p>
</body>
</html>

You may also specify a header and/or a footer for the resulting PDF by sending files named header.html and footer.html, respectively.

Each should be a complete HTML document, like the header.html sample above.

The following classes are helpful for injecting printing values:

| Class | Description |
|---|---|
| date | formatted print date |
| title | document title |
| pageNumber | current page number |
| totalPages | total pages in the document |

Assets

Adding assets to the resulting PDF

<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>New PDF</title>
  </head>
  <body>
    <img src="logo.jpg">
    <h1>Hello world!</h1>
  </body>
</html>

You may also include additional files such as images, fonts and stylesheets in the rendered PDF file.

They have to be located in the same directory as index.html.

External paths, for example for Google fonts or images, are also fine.

Markdown

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/markdown/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form files=@index.html \
    --form files=@file.md

Sample index.html file for Markdown to PDF conversion

<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>New PDF</title>
  </head>
  <body>
    {{ toHTML .DirPath "file.md" }}
  </body>
</html>

Use the Dataflow Kit endpoint /convert/markdown/pdf to convert Markdown to PDF.

It accepts POST requests with a multipart/form-data Content-Type.

Converting from Markdown to PDF works the same way as the HTML to PDF endpoint.

The only difference is that you have access to the Go template function toHTML in the index.html file. This function converts a given Markdown file to HTML.

Please refer to the HTML conversion section for details about the parameters used for Markdown to PDF conversion.

Office

The Dataflow Kit endpoint /convert/office/pdf is used for Office document to PDF conversions.

It accepts POST requests with a multipart/form-data Content-Type.

The following file formats are supported:

| Text | Spreadsheets | Presentations |
|---|---|---|
| .txt | .xls | .ppt |
| .rtf | .xlsx | .pptx |
| .fodt | .ods | .odp |
| .doc | | |
| .docx | | |
| .odt | | |

All files will be merged into a single resulting PDF.

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/office/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form files=@document1.docx \
    --form files=@document2.doc \
    --form files=@spreadsheet.xlsx \
    --form landscape=true

| Parameter | Description |
|---|---|
| files | Document files to be converted to PDF. At least one file has to be specified. All other parameters are optional. |
| landscape | By default, the PDF is rendered with portrait orientation. |

Merge

curl --request POST \
    --url https://api.dataflowkit.com/v1/merge/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form files=@pdf1.pdf \
    --form files=@pdf2.pdf \
    --form files=@pdf3.pdf

DFK Merge endpoint /merge/pdf is intended for merging several PDFs into one resulting PDF.

It accepts POST requests with a multipart/form-data Content-Type.

Just send some PDF files and DFK API will merge them and return the resulting PDF file.

Results

curl --request GET \
     --url  "https://dfk-storage.ams3.digitaloceanspaces.com/result_office2pdf_2019-08-09_14%3A39.pdf?X-Amz-Signature=b23fffd81b29f4a597eaa6c29b34501144d1687e6d08bc33141ddae9f7ff1f69"

As a result, a link to the resulting PDF file is returned.

Run the script on the right to download the results using the link.

Webhooks

curl --request POST \
    --url https://api.dataflowkit.com/v1/convert/html/pdf?api_key=YOUR_API_KEY \
    --header 'Content-Type: multipart/form-data' \
    --form files=@index.html \
    --form webhookURL='http://mywebsite.com/webhook/'

All PDF conversion endpoints accept a form field named webhookURL.

If provided, the Dataflow Kit API will send the resulting PDF file in a POST request with the application/pdf Content-Type to the given URL.

Errors

The Dataflow Kit API uses the following error codes:

| Error Code | Meaning |
|---|---|
| 400 | Bad Request -- Your Collection object or conversion scheme is invalid. |
| 401 | Unauthorized -- Your API key is wrong. |
| 403 | Forbidden -- The requested resource is restricted to administrators only. |
| 404 | Not Found -- The specified page could not be found. |
| 429 | Too Many Requests -- You're requesting DFK API too fast! Slow down! |
| 500 | Internal Server Error -- We had a problem with our server. Try again later. |
| 503 | Service Unavailable -- We're temporarily offline for maintenance. Please try again later. |
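
The 429, 500 and 503 codes all suggest trying again later, so a client may want to retry them with a growing delay. The Python sketch below shows one way to do that with exponential backoff. It assumes a caller-supplied send function that performs one request and returns a (status_code, body) pair. The function name, attempt count and delays are illustrative, not recommendations from DFK.

```python
import time

RETRYABLE = {429, 500, 503}  # codes described above as temporary


def call_with_retry(send, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a caller-supplied function (an assumption of this sketch)
    returning a (status_code, body) tuple.
    """
    for attempt in range(attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body
```

Other error codes such as 400 or 401 are returned immediately, since repeating the same request would not change the outcome.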

Addendum

Google domains

Use one of the following domains when customizing the Google SERPs Extractor:

TLD = Top Level Domain

Region TLD Domain
Worldwide (Original for the United States) .com google.com
Ascension Island .ac google.ac
Andorra .ad google.ad
United Arab Emirates .ae google.ae
Afghanistan .af google.com.af
Antigua and Barbuda .ag google.com.ag
Anguilla .ai google.com.ai
Albania .al google.al
Armenia .am google.am
Angola .ao google.co.ao
Argentina .ar google.com.ar
American Samoa .as google.as
Austria .at google.at
Australia .au google.com.au
Azerbaijan .az google.az
Bosnia and Herzegovina .ba google.ba
Bangladesh .bd google.com.bd
Belgium .be google.be
Burkina Faso .bf google.bf
Bulgaria .bg google.bg
Bahrain .bh google.com.bh
Burundi .bi google.bi
Benin .bj google.bj
Brunei .bn google.com.bn
Bolivia .bo google.com.bo
Brazil .br google.com.br
Bahamas .bs google.bs
Bhutan .bt google.bt
Botswana .bw google.co.bw
Belarus .by google.by
Belize .bz google.com.bz
Canada .ca google.ca
Cambodia .kh google.com.kh
Cocos (Keeling) Islands .cc google.cc
Democratic Republic of the Congo .cd google.cd
Central African Republic .cf google.cf
Catalonia (Catalan Countries) .cat google.cat
Republic of the Congo .cg google.cg
Switzerland .ch google.ch
Ivory Coast .ci google.ci
Cook Islands .ck google.co.ck
Chile .cl google.cl
Cameroon .cm google.cm
China .cn google.cn
China .cn g.cn
Colombia .co google.com.co
Costa Rica .cr google.co.cr
Cuba .cu google.com.cu
Cape Verde .cv google.cv
Cyprus .cy google.com.cy
Czech Republic .cz google.cz
Germany .de google.de
Djibouti .dj google.dj
Denmark .dk google.dk
Dominica .dm google.dm
Dominican Republic .do google.com.do
Algeria .dz google.dz
Ecuador .ec google.com.ec
Estonia .ee google.ee
Egypt .eg google.com.eg
Spain .es google.es
Ethiopia .et google.com.et
Finland .fi google.fi
Fiji .fj google.com.fj
Federated States of Micronesia .fm google.fm
France .fr google.fr
Gabon .ga google.ga
Georgia .ge google.ge
French Guiana .gf google.gf
Guernsey .gg google.gg
Ghana .gh google.com.gh
Gibraltar .gi google.com.gi
Greenland .gl google.gl
Gambia .gm google.gm
Guadeloupe .gp google.gp
Greece .gr google.gr
Guatemala .gt google.com.gt
Guyana .gy google.gy
Hong Kong .hk google.com.hk
Honduras .hn google.hn
Croatia .hr google.hr
Haiti .ht google.ht
Hungary .hu google.hu
Indonesia .id google.co.id
Iran .ir google.ir
Iraq .iq google.iq
Ireland .ie google.ie
Israel .il google.co.il
Isle of Man .im google.im
India .in google.co.in
British Indian Ocean Territory .io google.io
Iceland .is google.is
Italy .it google.it
Jersey .je google.je
Jamaica .jm google.com.jm
Jordan .jo google.jo
Japan .jp google.co.jp
Kenya .ke google.co.ke
Kiribati .ki google.ki
Kyrgyzstan .kg google.kg
South Korea .kr google.co.kr
Kuwait .kw google.com.kw
Kazakhstan .kz google.kz
Laos .la google.la
Lebanon .lb google.com.lb
Saint Lucia .lc google.com.lc
Liechtenstein .li google.li
Sri Lanka .lk google.lk
Lesotho .ls google.co.ls
Lithuania .lt google.lt
Luxembourg .lu google.lu
Latvia .lv google.lv
Libya .ly google.com.ly
Morocco .ma google.co.ma
Moldova .md google.md
Montenegro .me google.me
Madagascar .mg google.mg
Macedonia .mk google.mk
Mali .ml google.ml
Burma .mm google.com.mm
Mongolia .mn google.mn
Montserrat .ms google.ms
Malta .mt google.com.mt
Mauritius .mu google.mu
Maldives .mv google.mv
Malawi .mw google.mw
Mexico .mx google.com.mx
Malaysia .my google.com.my
Mozambique .mz google.co.mz
Namibia .na google.com.na
Niger .ne google.ne
Norfolk Island .nf google.com.nf
Nigeria .ng google.com.ng
Nicaragua .ni google.com.ni
Netherlands .nl google.nl
Norway .no google.no
Nepal .np google.com.np
Nauru .nr google.nr
Niue .nu google.nu
New Zealand .nz google.co.nz
Oman .om google.com.om
Panama .pa google.com.pa
Peru .pe google.com.pe
Philippines .ph google.com.ph
Pakistan .pk google.com.pk
Poland .pl google.pl
Papua New Guinea .pg google.com.pg
Pitcairn Islands .pn google.pn
Puerto Rico .pr google.com.pr
Palestine .ps google.ps
Portugal .pt google.pt
Paraguay .py google.com.py
Qatar .qa google.com.qa
Romania .ro google.ro
Serbia .rs google.rs
Russia .ru google.ru
Rwanda .rw google.rw
Saudi Arabia .sa google.com.sa
Solomon Islands .sb google.com.sb
Seychelles .sc google.sc
Sweden .se google.se
Singapore .sg google.com.sg
Saint Helena, Ascension and Tristan da Cunha .sh google.sh
Slovenia .si google.si
Slovakia .sk google.sk
Sierra Leone .sl google.com.sl
Senegal .sn google.sn
San Marino .sm google.sm
Somalia .so google.so
São Tomé and Príncipe .st google.st
El Salvador .sv google.com.sv
Chad .td google.td
Togo .tg google.tg
Thailand .th google.co.th
Tajikistan .tj google.com.tj
Tokelau .tk google.tk
Timor-Leste .tl google.tl
Turkmenistan .tm google.tm
Tonga .to google.to
Tunisia .tn google.tn
Tunisia .tn google.com.tn
Turkey .tr google.com.tr
Trinidad and Tobago .tt google.tt
Taiwan .tw google.com.tw
Tanzania .tz google.co.tz
Ukraine .ua google.com.ua
Uganda .ug google.co.ug
United Kingdom .uk google.co.uk
United States .us google.us
Uruguay .uy google.com.uy
Uzbekistan .uz google.co.uz
Saint Vincent and the Grenadines .vc google.com.vc
Venezuela .ve google.co.ve
British Virgin Islands .vg google.vg
United States Virgin Islands .vi google.co.vi
Vietnam .vn google.com.vn
Vanuatu .vu google.vu
Samoa .ws google.ws
South Africa .za google.co.za
Zambia .zm google.co.zm
Zimbabwe .zw google.co.zw

Country Codes

The following table lists the two-letter country codes that can be used as values of the gl parameter when customizing Google SERPs Extractor:

Country Country Code
Afghanistan af
Albania al
Algeria dz
American Samoa as
Andorra ad
Angola ao
Anguilla ai
Antarctica aq
Antigua and Barbuda ag
Argentina ar
Armenia am
Aruba aw
Australia au
Austria at
Azerbaijan az
Bahamas bs
Bahrain bh
Bangladesh bd
Barbados bb
Belarus by
Belgium be
Belize bz
Benin bj
Bermuda bm
Bhutan bt
Bolivia bo
Bosnia and Herzegovina ba
Botswana bw
Bouvet Island bv
Brazil br
British Indian Ocean Territory io
Brunei Darussalam bn
Bulgaria bg
Burkina Faso bf
Burundi bi
Cambodia kh
Cameroon cm
Canada ca
Cape Verde cv
Cayman Islands ky
Central African Republic cf
Chad td
Chile cl
China cn
Christmas Island cx
Cocos (Keeling) Islands cc
Colombia co
Comoros km
Congo cg
Congo, the Democratic Republic of the cd
Cook Islands ck
Costa Rica cr
Cote d'Ivoire ci
Croatia hr
Cuba cu
Cyprus cy
Czech Republic cz
Denmark dk
Djibouti dj
Dominica dm
Dominican Republic do
Ecuador ec
Egypt eg
El Salvador sv
Equatorial Guinea gq
Eritrea er
Estonia ee
Ethiopia et
Falkland Islands (Malvinas) fk
Faroe Islands fo
Fiji fj
Finland fi
France fr
French Guiana gf
French Polynesia pf
French Southern Territories tf
Gabon ga
Gambia gm
Georgia ge
Germany de
Ghana gh
Gibraltar gi
Greece gr
Greenland gl
Grenada gd
Guadeloupe gp
Guam gu
Guatemala gt
Guinea gn
Guinea-Bissau gw
Guyana gy
Haiti ht
Heard Island and McDonald Islands hm
Holy See (Vatican City State) va
Honduras hn
Hong Kong hk
Hungary hu
Iceland is
India in
Indonesia id
Iran, Islamic Republic of ir
Iraq iq
Ireland ie
Israel il
Italy it
Jamaica jm
Japan jp
Jordan jo
Kazakhstan kz
Kenya ke
Kiribati ki
Korea, Democratic People's Republic of kp
Korea, Republic of kr
Kuwait kw
Kyrgyzstan kg
Lao People's Democratic Republic la
Latvia lv
Lebanon lb
Lesotho ls
Liberia lr
Libyan Arab Jamahiriya ly
Liechtenstein li
Lithuania lt
Luxembourg lu
Macao mo
Macedonia, the Former Yugoslav Republic of mk
Madagascar mg
Malawi mw
Malaysia my
Maldives mv
Mali ml
Malta mt
Marshall Islands mh
Martinique mq
Mauritania mr
Mauritius mu
Mayotte yt
Mexico mx
Micronesia, Federated States of fm
Moldova, Republic of md
Monaco mc
Mongolia mn
Montserrat ms
Morocco ma
Mozambique mz
Myanmar mm
Namibia na
Nauru nr
Nepal np
Netherlands nl
Netherlands Antilles an
New Caledonia nc
New Zealand nz
Nicaragua ni
Niger ne
Nigeria ng
Niue nu
Norfolk Island nf
Northern Mariana Islands mp
Norway no
Oman om
Pakistan pk
Palau pw
Palestinian Territory, Occupied ps
Panama pa
Papua New Guinea pg
Paraguay py
Peru pe
Philippines ph
Pitcairn pn
Poland pl
Portugal pt
Puerto Rico pr
Qatar qa
Reunion re
Romania ro
Russian Federation ru
Rwanda rw
Saint Helena sh
Saint Kitts and Nevis kn
Saint Lucia lc
Saint Pierre and Miquelon pm
Saint Vincent and the Grenadines vc
Samoa ws
San Marino sm
Sao Tome and Principe st
Saudi Arabia sa
Senegal sn
Serbia and Montenegro cs
Seychelles sc
Sierra Leone sl
Singapore sg
Slovakia sk
Slovenia si
Solomon Islands sb
Somalia so
South Africa za
South Georgia and the South Sandwich Islands gs
Spain es
Sri Lanka lk
Sudan sd
Suriname sr
Svalbard and Jan Mayen sj
Swaziland sz
Sweden se
Switzerland ch
Syrian Arab Republic sy
Taiwan, Province of China tw
Tajikistan tj
Tanzania, United Republic of tz
Thailand th
Timor-Leste tl
Togo tg
Tokelau tk
Tonga to
Trinidad and Tobago tt
Tunisia tn
Turkey tr
Turkmenistan tm
Turks and Caicos Islands tc
Tuvalu tv
Uganda ug
Ukraine ua
United Arab Emirates ae
United Kingdom uk
United States us
United States Minor Outlying Islands um
Uruguay uy
Uzbekistan uz
Vanuatu vu
Venezuela ve
Viet Nam vn
Virgin Islands, British vg
Virgin Islands, U.S. vi
Wallis and Futuna wf
Western Sahara eh
Yemen ye
Zambia zm
Zimbabwe zw

Interface languages

The following table lists the interface languages supported by Google Search that can be used as values of the hl parameter when customizing Google SERPs Extractor:

Display Language Language code
Afrikaans af
Albanian sq
Amharic am
Arabic ar
Azerbaijani az
Basque eu
Belarusian be
Bengali bn
Bihari bh
Bosnian bs
Bulgarian bg
Catalan ca
Chinese (Simplified) zh-CN
Chinese (Traditional) zh-TW
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Esperanto eo
Estonian et
Faroese fo
Finnish fi
French fr
Frisian fy
Galician gl
Georgian ka
German de
Greek el
Gujarati gu
Hebrew iw
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Interlingua ia
Irish ga
Italian it
Japanese ja
Javanese jw
Kannada kn
Korean ko
Latin la
Latvian lv
Lithuanian lt
Macedonian mk
Malay ms
Malayalam ml
Maltese mt
Marathi mr
Nepali ne
Norwegian no
Norwegian (Nynorsk) nn
Occitan oc
Persian fa
Polish pl
Portuguese (Brazil) pt-BR
Portuguese (Portugal) pt-PT
Punjabi pa
Romanian ro
Russian ru
Scots Gaelic gd
Serbian sr
Sinhalese si
Slovak sk
Slovenian sl
Spanish es
Sundanese su
Swahili sw
Swedish sv
Tagalog tl
Tamil ta
Telugu te
Thai th
Tigrinya ti
Turkish tr
Ukrainian uk
Urdu ur
Uzbek uz
Vietnamese vi
Welsh cy
Xhosa xh
Zulu zu
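
To show how the three tables fit together, the sketch below combines a Google domain, a gl country code, and an hl language code into a plain Google search URL. How the Dataflow Kit SERPs Extractor itself accepts these values is not shown in this section, so the function name and its defaults are hypothetical. Only the gl and hl parameter names come from the text above.

```python
from urllib.parse import urlencode


def google_search_url(query: str, domain: str = "google.com",
                      gl: str = "us", hl: str = "en") -> str:
    """Build a search URL from a Google domain, country code, and language code."""
    # urlencode escapes the query text (spaces become "+").
    params = urlencode({"q": query, "gl": gl, "hl": hl})
    return f"https://www.{domain}/search?{params}"


# German domain, results localized for Germany, German interface.
print(google_search_url("web scraping", domain="google.de", gl="de", hl="de"))
# prints "https://www.google.de/search?q=web+scraping&gl=de&hl=de"
```

The domain selects the regional Google site, gl selects the country, and hl selects the interface language, so the three values can be varied independently.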