With Dataflow Kit web scraper, extracting data is as easy as clicking the data you need.
This guide is useful as a general reference for common tasks associated with building data collections.
A collection is a set of instructions outlining the actions to be performed against a specific website. These instructions are then consumed by Dataflow Kit servers to gather data from the target website.
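To make the idea concrete, a collection can be pictured as a small configuration object like the sketch below. The field names here are purely illustrative and are not Dataflow Kit's actual schema:

```json
{
  "name": "webshop-products",
  "url": "https://example.com/catalog",
  "selectors": [
    { "name": "image", "type": "image", "selector": "div.item img" },
    { "name": "title", "type": "text",  "selector": "div.item h2" },
    { "name": "price", "type": "text",  "selector": "div.item .price" }
  ],
  "paginator": { "type": "next_link", "selector": "a.next" },
  "format": "jsonl"
}
```

Whatever the exact format, the essence is the same: a start URL, a list of named selectors, and optional pagination rules.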
The scraping process is based on the pattern of data you have selected. Look at the sample screenshot taken from a web shop's results page. Let's say we want to scrape the image, the title of each listed item, and the price.
Type (or copy and paste) a website address into the address bar and click the button next to it to load the website. The web address should start with `http(s)://`.
The requested web page is now loaded in the "point and click" editor.
In the Selectors panel, you can add new selectors to a collection, modify them, and navigate the selector list.
Start selecting elements on the web page by clicking the button. A clicked element is highlighted in green.
Dataflow Kit suggests similar elements you may also want to select and marks them in yellow.
Optionally, click on a highlighted element to remove it from the selection; the removed element turns red. Conversely, click on an unhighlighted element to add it to the current selection.
Iterate selection and rejection (steps 2 and 3) to specify the patterns needed for collecting data.
A number in a circle next to a selector (e.g. 24) shows how many elements are currently selected.
Press the button to finish the selection, or click the button to start specifying patterns again. Once you have selected all the data for your first selector, repeat the steps listed above to add more selectors to the collection.
Clicking on a selector highlights its corresponding elements on the loaded web page.
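Under the hood, a selector of this kind boils down to a single pattern that matches every similar element at once. The following sketch illustrates the idea with Python's standard library and hypothetical well-formed markup; it is a stand-in for the pattern matching the editor performs, not Dataflow Kit's actual implementation:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed web-shop results page (hypothetical markup).
html = """
<div class="results">
  <div class="item"><h2>Red Mug</h2><span class="price">$8</span></div>
  <div class="item"><h2>Blue Mug</h2><span class="price">$9</span></div>
  <div class="item"><h2>Green Mug</h2><span class="price">$7</span></div>
</div>
"""

root = ET.fromstring(html)

# One pattern selects all "similar" elements in one go -- this is
# effectively what clicking a couple of sample items produces.
titles = [h2.text for h2 in root.findall(".//div[@class='item']/h2")]
prices = [s.text for s in root.findall(".//div[@class='item']/span[@class='price']")]

print(titles)  # ['Red Mug', 'Blue Mug', 'Green Mug']
print(prices)  # ['$8', '$9', '$7']
```

Note how adding a fourth `div.item` to the page would require no change to the pattern: that is why selecting two or three samples is enough to capture the whole list.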
Find more information about selector types and options in the Selectors documentation.
Websites that contain long lists of items frequently break these up into pages. Navigating through different pages on a website is an integral part of the web scraping process. Paginator is used to scrape multiple pages or process infinite scrolled pages.
Scroll up or down the web page until you can see the button or link that navigates to the next page.
Click the button and choose one of the paginator types from the drop-down list:
The ` "Next" link` paginator type is used on pages containing a link pointing to the next page. The next page link is extracted from the document by querying the `href` attribute of the given element's CSS selector.
The `Infinite scroll` paginator type automatically loads content as the user scrolls the page down.
Selector represents the corresponding CSS selector for the ` "Next" link` or ` "Load more" button` paginator types.
The scraper is now configured to go to the next page (and all remaining pages) after collecting all data from the current page.
If there is no paginator specified, then it is assumed that the initial URL is the only page to scrape.
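The ` "Next" link` paginator's behavior can be sketched as a simple loop: fetch a page, collect its data, query the configured selector, read its `href` attribute, and stop when no match is found. Below is a stdlib-only Python illustration with hypothetical pages standing in for real HTTP fetches; it mirrors the logic described above, not Dataflow Kit's actual code:

```python
import xml.etree.ElementTree as ET

# Three hypothetical result pages; the last one has no "next" link.
pages = {
    "/catalog?p=1": '<div><p class="data">A</p><a class="next" href="/catalog?p=2">Next</a></div>',
    "/catalog?p=2": '<div><p class="data">B</p><a class="next" href="/catalog?p=3">Next</a></div>',
    "/catalog?p=3": '<div><p class="data">C</p></div>',
}

def scrape_all(start_url):
    """Follow the paginator until there is no next-page link."""
    records, url = [], start_url
    while url is not None:
        root = ET.fromstring(pages[url])            # stand-in for an HTTP fetch
        records += [p.text for p in root.findall(".//p[@class='data']")]
        nxt = root.find(".//a[@class='next']")      # the paginator's selector
        url = nxt.get("href") if nxt is not None else None
    return records

print(scrape_all("/catalog?p=1"))  # ['A', 'B', 'C']
```

When no paginator is configured, the loop above simply runs once, which matches the behavior described in this section.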
A Link selector may point to a product details page; we can click on it to navigate to the details page and gather additional data there.
Click on the `Details` link. This loads the product details page, where you can collect additional data about each individual product.
You can repeat the steps described in the Select elements section for each additional piece of information you want to collect from the details page.
After you have added selectors and set up the paginator and details pages, click the button to get an idea of what data will be extracted. Once the process runs, you will see incoming data in the Data Viewer and stay informed of its progress.
Click the button to interrupt the extraction at any time.
You can choose CSV, JSON, JSON Lines, Excel, or XML as the output format.
Have a look at our HackerNoon article about the benefits of storing data in the JSON Lines format.
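The practical appeal of JSON Lines is easy to see in code: each record is a self-contained JSON object on its own line, so exports can be appended to and parsed line by line without loading the whole file. A small stdlib-only illustration:

```python
import io
import json

# Hypothetical scraped records.
records = [
    {"title": "Red Mug", "price": "$8"},
    {"title": "Blue Mug", "price": "$9"},
]

# JSON Lines: one JSON object per line, newline-delimited.
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

jsonl = buf.getvalue()

# Each line parses independently -- handy for streaming large exports.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```

A plain JSON export, by contrast, is one big array that must be parsed in full before any record can be used.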
The steps in this walkthrough may not be fully applicable to all websites from which you might desire to gather data. For help with specific data extraction tasks not found in this Getting Started guide, search the Help Center for relevant articles.