Conversation

@TC-MO (Contributor) commented Dec 5, 2025

Note

Removes the Academy Glossary and cleans up navigation and references across the docs.

  • Delete all sources/academy/glossary/** pages (concepts/tools) and related content
  • Update docusaurus.config.js and sources/academy/sidebars.js to remove Glossary menu items
  • Add NGINX rewrite to retire /academy/glossary paths and related redirects (see the sketch after this list)
  • Replace internal Glossary links with plain text or external refs (e.g., MDN), and adjust copy accordingly in multiple Academy pages
  • Minor doc tweaks: update Apify CLI link to /cli/docs/installation, refine AGENTS.md structure/checklist
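
For context, a retirement rewrite of this kind might look roughly like the following sketch (an illustration only; the path pattern and redirect target are assumptions, not the PR's actual NGINX config):

```nginx
# Hypothetical sketch of a glossary retirement rewrite; the rule and
# redirect target in the PR's actual NGINX config may differ.
location ~ ^/academy/glossary(/.*)?$ {
    # Permanently redirect retired glossary URLs to the Academy landing page.
    return 301 /academy;
}
```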

Written by Cursor Bugbot for commit 5607cb6.

TC-MO added 2 commits December 3, 2025 23:32

- remove glossary directory
- remove glossary from sidebar and 2nd navbar
- remove mentions of glossary from AGENTS.md
@TC-MO TC-MO self-assigned this Dec 5, 2025
@TC-MO TC-MO added the documentation (Improvements or additions to documentation) and t-docs (Issues owned by technical writing team) labels Dec 5, 2025
@apify-service-account

Preview for this PR was built for commit 10c1f06b and is ready at https://pr-2130.preview.docs.apify.com!

@TC-MO TC-MO requested a review from marcel-rbro January 13, 2026 11:54
@apify-service-account

Preview for this PR was built for commit 64e432fc and is ready at https://pr-2130.preview.docs.apify.com!

@TC-MO TC-MO marked this pull request as ready for review January 13, 2026 11:58
@TC-MO TC-MO requested a review from honzajavorek as a code owner January 13, 2026 11:58
@apify-service-account

Preview for this PR was built for commit 5607cb69 and is ready at https://pr-2130.preview.docs.apify.com!

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

└── academy/ # Educational content
├── tutorials/ # Step-by-step guides
├── webscraping/ # Web scraping courses
└── glossary/ # Terminology and definitions
Contributor

Minor: Remove the glossary also from .cursor/rules/file-organization.mdc

Contributor Author

good point, I forgot about those rules

@TC-MO TC-MO requested a review from marcel-rbro January 15, 2026 10:25
@apify-service-account

Preview for this PR was built for commit f066c4e6 and is ready at https://pr-2130.preview.docs.apify.com!

@marcel-rbro (Contributor) left a comment

Overall, consider adding external links to some of the places where links to the glossary were removed. Not necessary for stuff like HTTP headers and CSS, but it would be helpful for mentions of tools: Postman, Insomnia, Quick JavaScript Switcher (or whatever the name was)...

If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to the [Using a scraping framework with Node.js](../../webscraping/scraping_basics_javascript/12_framework.md) lesson of the **Web scraping basics for JavaScript devs** course. To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](../apify_platform.md) category.

The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson](../../glossary/tools/apify_cli.md).
The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to this short lesson.
Contributor

There's no link now

:::

Now, let's move over to our favorite HTTP client (in this lesson we'll use [Insomnia](../../glossary/tools/insomnia.md) in order to prepare and send the request).
Now, let's move over to our favorite HTTP client (in this lesson we'll use Insomnia in order to prepare and send the request).
Contributor

Consider adding link to https://insomnia.rest/ or to their docs: https://developer.konghq.com/insomnia/

## Making the choice {#making-the-choice}

When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the Quick JavaScript Switcher extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using Postman or Insomnia or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
Contributor

Link to the extension?

Contributor Author

See above comment

Collaborator

I think if we mention something for the first time, we should link. The changes made in sources/academy/platform/getting_started/apify_api.md now fit that approach, but unless I'm missing something, this tutorial mentions the Chrome extension for the first time here?


- Using [proxies](../mitigation/proxies.md)?
- Making the request with the proper [headers](../../../glossary/concepts/http_headers.md) and [cookies](../../../glossary/concepts/http_cookies.md)?
- Making the request with the proper headers and cookies?
Contributor Author

What do you think about adding an MDN docs link to each occurrence of cookies/headers? I did it a few times, but not for all of them.

Collaborator

Depends on target audience. If it's a course for beginners, and we mention cookies or headers for the first time, it makes sense to me to link it. Otherwise I'd consider these terms as something the reader should understand.

- remove unnecessary heading anchors
- add links to docs & external tools
@apify-service-account

Preview for this PR was built for commit 63bc8076 and is ready at https://pr-2130.preview.docs.apify.com!

honzajavorek previously approved these changes Jan 16, 2026
@honzajavorek (Collaborator) left a comment

I did find a few things, but they're more like opinions or nitpicks and I don't want to hold back delivery with those. Approving, and up to you what you do with my comments 🚀



- Using [proxies](../mitigation/proxies.md).
- Mocking [headers](../../../glossary/concepts/http_headers.md).
- Mocking headers.
Collaborator

Just an idea: depending on context, instead of linking to MDN, we could just be more specific so that the reader can search for the term if they need to. E.g. instead of the vague "headers", we could write "HTTP headers", and that could be a bit better even without a link.

(I don't know if this place is the place where this would make sense, but this place is the place where I got this idea, hence I put the comment here.)

## Cookies & headers {#cookies-headers}

Certain websites might use certain location-specific/language-specific [headers](../../../glossary/concepts/http_headers.md)/[cookies](../../../glossary/concepts/http_cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
Certain websites might use certain location-specific/language-specific headers/cookies to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
Collaborator

Suggested change
Certain websites might use certain location-specific/language-specific headers/cookies to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).
To geolocate a user, websites might use HTTP headers and cookies specific to location or language. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).

Also, cookies are technically also HTTP headers, but whatever 😅
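
For context, the idea in the quoted passage could be exercised with a request like this minimal sketch (the URL is a placeholder, not an endpoint from the course):

```js
// Minimal sketch: send a location/language-specific header with a request.
// The URL is a placeholder, not an endpoint from the course.
const response = await fetch('https://example.com/products', {
  headers: {
    // Ask the server for German-language, Germany-targeted content.
    'Accept-Language': 'de-DE,de;q=0.9',
  },
});
console.log(await response.text());
```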

### Header checking

This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific [header](../../glossary/concepts/http_headers.md) sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
Collaborator

Suggested change
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific HTTP header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
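
As a rough illustration of the header checking the quoted passage describes, a toy server-side check might look like this (a sketch under simplified assumptions; real detection inspects many more headers and their consistency):

```js
// Toy illustration of header-based bot detection; real systems check
// many more headers and their mutual consistency.
function looksLikeBot(headers) {
  const userAgent = headers['user-agent'] ?? '';
  // A missing User-Agent, or a known tool signature, is an easy giveaway.
  if (!userAgent || /curl|python-requests|headless/i.test(userAgent)) {
    return true;
  }
  // A browser-like User-Agent without Accept-Language is inconsistent
  // with the header set a real browser sends along with every request.
  if (userAgent.includes('Mozilla') && !headers['accept-language']) {
    return true;
  }
  return false;
}

console.log(looksLikeBot({ 'user-agent': 'curl/8.5.0' })); // true
```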

:::

In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL:
In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using Insomnia or another HTTP client that supports GraphQL:
Collaborator

If this is the first mention of Insomnia within the course and it is a somewhat known term, I'd link. If not, I wouldn't link.
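
For readers unfamiliar with the step the quoted lesson describes, a programmatic equivalent of sending an introspection query might look like this minimal sketch (the endpoint is a placeholder, not the target site's real API URL):

```js
// Minimal sketch of sending a GraphQL introspection query in code.
// The endpoint is a placeholder, not the target website's real API URL.
const introspectionQuery = `
  query {
    __schema {
      types {
        name
      }
    }
  }
`;

const response = await fetch('https://example.com/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: introspectionQuery }),
});
console.log(await response.json());
```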

Puppeteer and Playwright don't sit around waiting for a page (or specific elements) to load though - if we tell it to do something with an element that hasn't been rendered yet, it'll start trying to do it (which will result in nasty errors). We've got to tell it to wait.

> For a thorough explanation on how dynamic rendering works, give [**Dynamic pages**](../../../glossary/concepts/dynamic_pages.md) a quick readover, and check out the examples.
> For a thorough explanation on how dynamic rendering works, give **Dynamic pages** a quick readover, and check out the examples.
Collaborator

Suggested change
> For a thorough explanation on how dynamic rendering works, give **Dynamic pages** a quick readover, and check out the examples.

I think the admonition doesn't make sense without the page and should be removed.

If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-pagewaitfornavigationoptions).

> A navigation is when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire.
> A navigation is when a new page load happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire.
Collaborator

I won't be like @TC-MO and put a comment here saying that, in a better version of this world (which we are surely building), these blockquotes could be turned into proper admonitions. I won't do it. But trust me, I'm tempted!
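
To ground the quoted passage above about `page.waitForNavigation()`, here is a minimal sketch of the pattern (the target URL, selector, and filename are made up for illustration):

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/search?q=example');

// Start waiting for the 'load' event *before* clicking, so the navigation
// triggered by the click isn't missed. Selector and filename are made up.
await Promise.all([
  page.waitForNavigation({ waitUntil: 'load' }),
  page.click('.search-result a'),
]);

console.log(await page.title()); // title of the result's page
await page.screenshot({ path: 'result.png' }); // screenshot to the filesystem
await browser.close();
```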



Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman](../../../glossary/tools/proxyman.md) to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up Proxyman to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
Collaborator

First time I hear about https://proxyman.com/ (like, in my life). If this is a first mention of the tool in the course, I'd link.

@honzajavorek (Collaborator) left a comment

I just noticed @marcel-rbro generally points out the same stuff, so I'm changing to Comment and once he approves, it should be Approved.

@honzajavorek honzajavorek dismissed their stale review January 16, 2026 10:41
