Web scraping has become an essential tool for extracting data from websites in various industries.
However, understanding the terminology associated with web scraping can sometimes be challenging.
In this blog post, we provide a comprehensive glossary of terms to help you navigate the world of web scraping with ease.
Whether you are new to data extraction or a seasoned professional, this glossary will serve as a handy reference to ensure you stay well-informed.
1. Account
An account represents an individual customer account, a business, or even a partner organization with whom we do business. It serves as the basis for managing and organizing data scraping projects.
2. Account Owner
The Account Owner is a designated point of contact from Grepsr responsible for delivery, support, and account expansion. This role is reserved for certain account types and ensures smooth communication and coordination between the customer and Grepsr.
3. Data Platform
The Data Platform is Grepsr’s proprietary, enterprise-grade system for data project management. It consists of two complementary pieces: the backend infrastructure, which handles data extraction and management, and the frontend interface, which enables users to configure and monitor their scraping projects.
4. Data Project
A project is a vehicle through which customer requirements are translated into workable data, and value is delivered. It includes data requirements such as URLs and data points to extract, as well as additional instructions required to pull data effectively.
5. Data Report
Project requirements are grouped into sets called Reports. A Report represents a use case or a granular set of data and delivery requirements. Reports can be executed at once and delivered together. Each Report is associated with a set of programmatic instructions to source data, known as a Crawler or Service.
6. Data Crawler (or Spider)
A Crawler programmatically opens and interacts with a website to parse content and extract data. It is versioned to reflect changes in the data scope over time. A successful Project therefore has at least one Report associated with a unique Crawler version.
7. Run
A Run is the execution of a Crawler. It retrieves data from the target website based on the defined instructions and configuration.
8. Dataset
A Dataset is the data output resulting from a Run. It contains the extracted data in a structured format ready for analysis and processing.
9. Page
Pages within a Dataset are similar to sheets in a spreadsheet. Each Dataset consists of at least one Page, which allows the final output to be normalized, much like tables in a relational database, keeping distinct concerns separate.
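As a minimal sketch of this idea (the field names and Page names here are hypothetical, not Grepsr's actual schema), a Dataset might be split into two Pages linked by a shared key, just as normalized tables in a relational database are:

```python
# Hypothetical Dataset with two Pages linked by "product_id" --
# analogous to two sheets in a spreadsheet or two normalized tables
dataset = {
    "products": [
        {"product_id": 1, "name": "Widget", "price": 9.99},
    ],
    "reviews": [
        {"product_id": 1, "rating": 5, "text": "Great!"},
        {"product_id": 1, "rating": 4, "text": "Works well"},
    ],
}

# Joining the Pages back together, as one would with related tables
for review in dataset["reviews"]:
    product = next(p for p in dataset["products"]
                   if p["product_id"] == review["product_id"])
    print(product["name"], review["rating"])
```

Keeping reviews on their own Page avoids duplicating product details on every review row.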
10. Columns
Columns are the extracted fields in a Dataset or a Page in a Dataset. They organize the data and provide a clear structure to the extracted information.
11. Indexed Column
Indexing a column is a crucial process in database management. It implies that the generated data output for that particular column is stored in a way that allows filtering, sorting, and searching across millions of records without any delay.
12. Rows
Each line of record in a Dataset is a Row. Rows contain the extracted data for each specific instance or entry.
13. Object
In a JSON output, a Row of records is an Object. Unlike a Row, an Object can be layered, allowing for a more complex structure of data representation.
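To illustrate the difference (with hypothetical field names), the same record can be represented as a flat Row or as a layered JSON Object that nests related data:

```python
import json

# A flat Row, as it might appear in a CSV export (hypothetical fields)
row = {"product": "Widget", "price": 9.99, "review_count": 42}

# The same record as a layered JSON Object, nesting related data
obj = {
    "product": "Widget",
    "price": 9.99,
    "reviews": {"count": 42, "latest": ["Great!", "Works well"]},
}

print(json.dumps(obj, indent=2))
```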
14. Data Quality
Quality is an umbrella term to measure the quantitative, qualitative, and overall health of a Report. It takes various factors into consideration. It includes Accuracy, Completeness, Data Distribution, Rows, and Requests.
15. Data Accuracy
Accuracy is a numeric score, expressed as a percentage, that measures whether the sourced data complies with the expected data format. Rules assigned to different Columns in a Dataset validate this compliance, so a higher Accuracy score indicates better adherence to data standards.
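A rough sketch of how such a score could be computed (the column names and regex rules below are hypothetical, not Grepsr's actual validation rules):

```python
import re

# Hypothetical validation rules: each Column maps to a format its cells must match
rules = {
    "price": re.compile(r"^\d+\.\d{2}$"),
    "sku": re.compile(r"^[A-Z]{3}-\d{4}$"),
}

rows = [
    {"price": "19.99", "sku": "ABC-1234"},
    {"price": "N/A", "sku": "XYZ-5678"},  # invalid price format
]

def accuracy(rows, rules):
    """Percentage of cells that comply with their Column's format rule."""
    checked = valid = 0
    for row in rows:
        for col, pattern in rules.items():
            checked += 1
            if pattern.fullmatch(str(row.get(col, ""))):
                valid += 1
    return 100.0 * valid / checked

print(f"Accuracy: {accuracy(rows, rules):.1f}%")  # 3 of 4 cells comply
```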
16. Data Completeness
Completeness refers to the state where the data contains all the information available to extract from the source. It is measured by the Fill Rate, which calculates the data density within the Dataset.
17. Fill Rate
The Fill Rate is a numeric score, expressed as a percentage, that measures the data density within a Dataset: the proportion of cells with data versus empty cells. A higher Fill Rate signifies a more complete Dataset.
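In essence, the Fill Rate is the share of non-empty cells. A minimal sketch, assuming empty cells appear as `None` or empty strings (the column names are hypothetical):

```python
def fill_rate(rows, columns):
    """Percentage of non-empty cells across all rows and columns."""
    total = len(rows) * len(columns)
    filled = sum(
        1 for row in rows for col in columns
        if row.get(col) not in (None, "")
    )
    return 100.0 * filled / total if total else 0.0

columns = ["name", "price", "rating"]
rows = [
    {"name": "Widget", "price": "9.99", "rating": "4.5"},
    {"name": "Gadget", "price": "", "rating": None},  # two empty cells
]

print(f"Fill Rate: {fill_rate(rows, columns):.1f}%")  # 4 of 6 cells filled
```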
18. Data Distribution
Data Distribution measures the occurrence of a certain value in a Column. It is particularly useful for Indexed Columns and acts as a proxy for data quality: a distribution that deviates from the norm may indicate issues with the sourced data.
19. Data Crawler Requests
A Request is an HTTP request made to the server to retrieve content. The Crawler makes a series of Requests to load and interact with a web page and extract the necessary data. Each Request is either served by the server or fails, indicating an error.
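A single Request can be sketched with Python's standard library (the URL is a placeholder; a real Crawler issues many such Requests per page):

```python
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

# Placeholder target URL, not a real scraping endpoint
url = "https://example.com/"

try:
    with urlopen(url, timeout=10) as response:
        body = response.read()          # content served by the server
        print(f"Served: HTTP {response.status}, {len(body)} bytes")
except (HTTPError, URLError) as exc:    # failed Request, indicating an error
    print(f"Request failed: {exc}")
```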
20. Team
A Team refers to a set of users belonging to the same Account. Teams can have different roles, such as Team Manager or Viewer. The Team Manager has administrative rights and access to all Projects in the Account, while the Viewer has limited rights and access only to specific added Projects.
In conclusion
Web scraping is a powerful technique for extracting data from websites, and understanding the associated terminology is essential. This glossary gives you a comprehensive reference to help you navigate the world of web scraping with confidence.
Whether you are a beginner or an experienced user, a clear understanding of these terms will empower you to leverage web scraping effectively in your data-driven projects.