docs: remove glossary #2130
Conversation
- remove glossary directory
- remove glossary from sidebar and 2nd navbar
- remove mentions of glossary from AGENTS.md
Preview for this PR was built for commit
└── academy/        # Educational content
    ├── tutorials/      # Step-by-step guides
    ├── webscraping/    # Web scraping courses
    └── glossary/       # Terminology and definitions
Minor: also remove the glossary from .cursor/rules/file-organization.mdc
good point, I forgot about those rules
marcel-rbro left a comment
Overall, consider adding external links to some of the places where links to the glossary were removed. Not necessary for stuff like HTTP headers and CSS, but would be helpful for mentions of tools: Postman, Insomnia, Quick JavaScript Switcher (or whatever was the name)...
If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to the [Using a scraping framework with Node.js](../../webscraping/scraping_basics_javascript/12_framework.md) lesson of the **Web scraping basics for JavaScript devs** course. To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](../apify_platform.md) category.

The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson](../../glossary/tools/apify_cli.md).
The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to this short lesson.
There's no link now
Now, let's move over to our favorite HTTP client (in this lesson we'll use [Insomnia](../../glossary/tools/insomnia.md) in order to prepare and send the request).
Now, let's move over to our favorite HTTP client (in this lesson we'll use Insomnia in order to prepare and send the request).
Consider adding link to https://insomnia.rest/ or to their docs: https://developer.konghq.com/insomnia/
## Making the choice {#making-the-choice}
When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the Quick JavaScript Switcher extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using Postman or Insomnia or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
Link to the extension?
See above comment
I think if we mention something for the first time, we should link. The changes made in sources/academy/platform/getting_started/apify_api.md now fit such an approach, but unless I'm missing something, this tutorial mentions the Chrome extension for the first time here?
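The "send a few requests programmatically" check from the quoted passage could be sketched roughly like this (the function names and the marker string are hypothetical, not part of the PR or the course):

```javascript
// Sketch: decide between a request-based scraper and an automated browser
// by checking whether the server-rendered HTML already contains the data.
// The "marker" is any substring you expect to appear in the data.
function dataIsServerRendered(html, marker) {
  // If the marker is present in the raw HTML, the data did not need
  // client-side JavaScript to appear; a request-based scraper likely suffices.
  return html.includes(marker);
}

// With a plain HTTP request (no JavaScript execution), fetch the page
// and run the check. Requires Node 18+ for the global fetch.
async function needsBrowser(url, marker) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' },
  });
  return !dataIsServerRendered(await response.text(), marker);
}
```

This mirrors the manual workflow in the passage: a hit on the marker in the raw response means the data was there without JavaScript; a miss suggests it is loaded dynamically and a browser (or a look at the XHR requests) is needed.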
- Using [proxies](../mitigation/proxies.md)?
- Making the request with the proper [headers](../../../glossary/concepts/http_headers.md) and [cookies](../../../glossary/concepts/http_cookies.md)?
- Making the request with the proper headers and cookies?
What do you think about adding an MDN docs link to each occurrence of cookies/headers? I did it a few times but not for all.
Depends on target audience. If it's a course for beginners, and we mention cookies or headers for the first time, it makes sense to me to link it. Otherwise I'd consider these terms as something the reader should understand.
remove unnecessary heading anchors; add links to docs & external tools
honzajavorek left a comment
I did find a few things, but they're more like opinions or nitpicks and I don't want to hold back delivery with those. Approving, and up to you what you do with my comments 🚀
- Using [proxies](../mitigation/proxies.md).
- Mocking [headers](../../../glossary/concepts/http_headers.md).
- Mocking headers.
Just an idea: Depending on context, instead of linking to MDN, we could just be more specific so that the reader can search for the term if they need. E.g. instead of vague headers, we could write HTTP headers, without link, and that could be a bit better, even without a link.
(I don't know if this place is the place where this would make sense, but this place is the place where I got this idea, hence I put the comment here.)
## Cookies & headers {#cookies-headers}
Certain websites might use certain location-specific/language-specific [headers](../../../glossary/concepts/http_headers.md)/[cookies](../../../glossary/concepts/http_cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
Certain websites might use certain location-specific/language-specific headers/cookies to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
Certain websites might use certain location-specific/language-specific headers/cookies to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
To geolocate a user, websites might use HTTP headers and cookies specific to location or language. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
Also, cookies are technically also HTTP headers, but whatever 😅
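The geolocation trick described in the quoted passage could be sketched like this (the helper name is hypothetical, and the header values are only illustrative -- `CloudFront-Viewer-Country` is normally set by CloudFront itself, so spoofing it only matters when talking directly to the origin):

```javascript
// Sketch: build location/language-specific headers for a request,
// pretending to browse from a given locale. Values are illustrative.
function buildGeoHeaders(languageTag, countryCode) {
  return {
    // Standard header: preferred response language(s), with quality weights.
    'Accept-Language': `${languageTag},en;q=0.8`,
    // Custom CloudFront header some origin servers read to geolocate the viewer.
    'CloudFront-Viewer-Country': countryCode,
  };
}

const headers = buildGeoHeaders('de-DE', 'DE');
// Usage (not executed here): fetch(url, { headers })
```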
### Header checking
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific [header](../../glossary/concepts/http_headers.md) sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific HTTP header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
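The "header consistency" idea from the quoted passage could be sketched as a naive server-side check: a request claiming to be Chrome should also carry the companion headers real Chrome sends. The header names below are real, but the rule set is purely illustrative, not any actual detector:

```javascript
// Sketch: naive header-consistency check. A bare User-Agent spoof
// usually omits the client-hint and fetch-metadata headers that a
// real Chrome browser sends alongside every navigation request.
function looksLikeChrome(headers) {
  const ua = headers['user-agent'] || '';
  if (!ua.includes('Chrome/')) return false;
  // Require the known combination of companion browser headers.
  return ['sec-ch-ua', 'sec-fetch-mode', 'accept-language']
    .every((name) => name in headers);
}
```

A request with only a spoofed `User-Agent` fails this check, which is exactly why anti-bot evaluation looks at the whole header combination rather than a single value.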
In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL:
In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using Insomnia or another HTTP client that supports GraphQL:
If this is the first mention of Insomnia within the course and it is a somewhat known term, I'd link. If not, I wouldn't link.
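For reference, the introspection request discussed in this hunk could be assembled like this (the endpoint in the usage comment is hypothetical; the `__schema` field itself is part of the GraphQL specification):

```javascript
// Sketch: the POST body for a minimal GraphQL introspection query,
// as it would be sent from Insomnia or any HTTP client.
const introspectionQuery = `
  query {
    __schema {
      types {
        name
        fields { name }
      }
    }
  }
`;

const body = JSON.stringify({ query: introspectionQuery });

// Usage (not executed here; endpoint is hypothetical):
// fetch('https://example.com/graphql', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body,
// });
```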
Puppeteer and Playwright don't sit around waiting for a page (or specific elements) to load though - if we tell it to do something with an element that hasn't been rendered yet, it'll start trying to do it (which will result in nasty errors). We've got to tell it to wait.
> For a thorough explanation on how dynamic rendering works, give [**Dynamic pages**](../../../glossary/concepts/dynamic_pages.md) a quick readover, and check out the examples.
> For a thorough explanation on how dynamic rendering works, give **Dynamic pages** a quick readover, and check out the examples.
> For a thorough explanation on how dynamic rendering works, give **Dynamic pages** a quick readover, and check out the examples.
I think the admonition doesn't make sense without the page and should be removed.
If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-pagewaitfornavigationoptions).
> A navigation is when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire.
> A navigation is when a new page load happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire.
I won't be like @TC-MO and I won't put here a comment that in a better version of this world, which we are surely building, these blockquotes could be turned into proper admonitions. I won't do it. But trust me, I'm tempted!
Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman](../../../glossary/tools/proxyman.md) to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up Proxyman to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
First time I hear about https://proxyman.com/ (like, in my life). If this is a first mention of the tool in the course, I'd link.
honzajavorek left a comment
I just noticed @marcel-rbro generally points out the same stuff, so I'm changing to Comment and once he approves, it should be Approved.
Note
Removes the Academy Glossary and cleans up navigation and references across the docs.
- Deletes `sources/academy/glossary/**` pages (concepts/tools) and related content
- Updates `docusaurus.config.js` and `sources/academy/sidebars.js` to remove Glossary menu items
- Removes `/academy/glossary` paths and related redirects
- Updates links to `/cli/docs/installation`, refines AGENTS.md structure/checklist

Written by Cursor Bugbot for commit 5607cb6.