Commit

Merge pull request #10309 from ethereum/docsearch-index-update
DocSearch index update
corwintines authored May 24, 2023
2 parents 6d2bcf0 + 6b53353 commit 984d5d4
Showing 27 changed files with 70 additions and 274 deletions.
27 changes: 0 additions & 27 deletions .github/workflows/docsearch-crawl.yml

This file was deleted.

123 changes: 0 additions & 123 deletions .github/workflows/docsearchConfig.json

This file was deleted.

63 changes: 0 additions & 63 deletions .github/workflows/docsearchConfigScript.js

This file was deleted.

32 changes: 14 additions & 18 deletions docs/site-search.md
Original file line number Diff line number Diff line change
@@ -1,32 +1,28 @@
# Site search on ethereum.org

TL;DR: we use Algolia to implement a site search feature on ethereum.org.
TL;DR: we use Algolia to implement a site search feature on ethereum.org. Because ethereum.org is an open source project, Algolia has sponsored the crawling and indexing of the entire site.

## What do we use Algolia and Docsearch for?
## What do we use Algolia and DocSearch for?

Algolia allows us to index the content on ethereum.org and implement a powerful site search tool on ethereum.org. In order to create the index of our content, we use a web crawling tool called Docsearch. Docsearch takes a start_urls of ethereum.org and crawls the site to index the content based on a [docsearchConfig file](https://github.com/ethereum/ethereum-org-website/blob/dev/.github/workflows/docsearchConfig.json).
Algolia allows us to index the content on ethereum.org and implement a powerful site search tool on ethereum.org. In order to create the index of our content, we use a web crawling tool called DocSearch. DocSearch takes a starting URL of ethereum.org and crawls the site to index the content, based on a custom configuration setup held with the service.

We kick off the crawling and indexing of ethereum.org through a GitHub Action that triggers on the merge to `master` branch. [View the GitHub Action](https://github.com/ethereum/ethereum-org-website/blob/dev/.github/workflows/docsearch-crawl.yml).
Site crawling and indexing is performed by default on a weekly basis on Friday afternoons. This is performed automatically by Algolia servers, which scrape the entire production site of ethereum.org to build an index. This index is hosted by Algolia for use on the site.

## Docsearch Config
## DocSearch Config

Some important notes about the docsearch config file:
Some important notes about the DocSearch config:

### Configuration

- `index_name` is the name of the algolia index where the generated index will be uploaded to.
- `start_urls` are the urls that the crawler will start from. Some important attributes in the `start_urls` that we use are:
- `lang`: regex path to different languages that the site is translated to that need crawling. Since ethereum.org is translated to 37+ languages, we need to be able to crawl the website in each language for indexing.
- `page_rank`: the rank of pages that breaks ties when multiple query results have the same weight. This weight is derived from the selectors.
- `stop_urls` is used to strip out query parameters in the websites urls. We were running into issues where we were getting duplicate query results due to query parameters making urls unique. Stripping these out solved our deduplication problem.
- selectors are used to specify what the crawler should look for when weighting content for the index.
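
The deduplication problem described for `stop_urls` can be illustrated outside the crawler. The sketch below is not DocSearch code: `stripQuery` and `dedupeUrls` are hypothetical helpers, and the sample URLs are made up. It only shows why removing query parameters collapses otherwise "unique" URLs into one entry.

```typescript
// Hypothetical helper (not part of DocSearch): drop the query string so that
// URLs differing only in their parameters share one canonical form.
function stripQuery(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.search = "";
  return url.toString();
}

// Hypothetical helper: keep the first occurrence of each canonical URL.
function dedupeUrls(rawUrls: string[]): string[] {
  return Array.from(new Set(rawUrls.map(stripQuery)));
}

// Made-up sample input: two of the three URLs point at the same page.
const crawled = [
  "https://ethereum.org/en/developers/",
  "https://ethereum.org/en/developers/?ref=nav",
  "https://ethereum.org/en/wallets/?utm_source=example",
];

const uniqueUrls = dedupeUrls(crawled);
// uniqueUrls → ["https://ethereum.org/en/developers/", "https://ethereum.org/en/wallets/"]
```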

### Generation

We generate the docsearchConfig.json file using a [script](https://github.com/ethereum/ethereum-org-website/blob/dev/.github/workflows/docsearchConfigScript.js). This allows us to dynamically pull in the languages the websites support from the [translations.json data file](https://github.com/ethereum/ethereum-org-website/blob/dev/src/data/translations.json). Our GitHub action executes this script.
- `indexName` is the name of the Algolia index where the generated index will be uploaded to
- `startUrls` are the urls that the crawler will start from
- Translated pages are automatically faceted for search results based on the `<html lang="">` attribute of each page
- Selectors are used to specify what the crawler should look for when weighting content for the index.
- CheerioAPI can be utilized within the crawler using the `$` selector to manipulate the DOM before indexing each page
- Elements to be ignored are removed before indexing using the CheerioAPI library: `$('selector').remove()`. This includes `aside`, `nav`, `footer` and `style` elements.
- When building pages, wrapping content in one of these semantic elements (for example, `aside`) excludes everything inside it from the index. This is useful for content that is not directly related to the page content, such as callouts, banners, quiz content, or navigation elements.
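
The element-removal step described above can be sketched as follows. This is a minimal illustration, not the configuration held with the service: `removeIgnoredElements`, `Selection`, `DomQuery`, and the recording stub are hypothetical names introduced here, and the structural types cover only the `$('selector').remove()` call mentioned in the notes. In the real crawler, `$` would be the CheerioAPI instance.

```typescript
// Minimal structural types covering only the part of a Cheerio-style `$`
// used below. Assumption: the crawler passes a real CheerioAPI as `$`.
type Selection = { remove(): unknown };
type DomQuery = (selector: string) => Selection;

// Elements the notes above list as removed before indexing.
const IGNORED_SELECTORS = ["aside", "nav", "footer", "style"];

// Hypothetical helper: remove every ignored element from the page.
function removeIgnoredElements($: DomQuery): void {
  for (const selector of IGNORED_SELECTORS) {
    $(selector).remove();
  }
}

// Stub `$` that records which selectors were removed, used for a dry run
// without loading any HTML.
const removed: string[] = [];
const stub$: DomQuery = (selector) => ({
  remove: () => {
    removed.push(selector);
  },
});

removeIgnoredElements(stub$);
```

Typing `$` structurally keeps the sketch free of dependencies. With the Cheerio library installed, the value returned by its `load` function could be passed in place of the stub.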

## Resources

- [Algolia documentation](https://www.algolia.com/doc/)
- [Docsearch documentation](https://docsearch.algolia.com/docs/what-is-docsearch)
- [Docsearch scraper Docker image](https://hub.docker.com/r/algolia/docsearch-scraper)
- [DocSearch documentation](https://docsearch.algolia.com/docs/what-is-docsearch)
1 change: 1 addition & 0 deletions src/components/BannerNotification/index.tsx
@@ -16,6 +16,7 @@ const BannerNotification: React.FC<IProps> = ({
<>
{shouldShow && (
<Center
as="aside"
maxW={isLGScreen ? oldTheme.variables.maxPageWidth : "100%"}
w="100%"
py="4"
1 change: 1 addition & 0 deletions src/components/CallToContribute.tsx
@@ -53,6 +53,7 @@ const CallToContribute: React.FC<IProps> = ({ editPath }) => {

return (
<Flex
as="aside"
bg="ednBackground"
align="center"
mt={8}
1 change: 1 addition & 0 deletions src/components/Callout.tsx
@@ -30,6 +30,7 @@ const Callout: React.FC<IProps> = ({
...rest
}) => (
<Flex
as="aside"
direction="column"
bgGradient="linear-gradient(
49.21deg,
1 change: 1 addition & 0 deletions src/components/CalloutBanner.tsx
@@ -24,6 +24,7 @@ const CalloutBanner: React.FC<IProps> = ({
...restProps
}) => (
<Flex
as="aside"
direction={{ base: "column", lg: "row-reverse" }}
bg="layer2Gradient"
p={{ base: 8, sm: 12 }}
6 changes: 5 additions & 1 deletion src/components/EthExchanges/index.tsx
@@ -86,7 +86,11 @@ const EthExchanges = () => {

return (
<Flex flexDir="column" align="center" w="full">
<Heading fontSize={{ base: "2xl", md: "2rem" }} fontWeight={600} lineHeight={1.4}>
<Heading
fontSize={{ base: "2xl", md: "2rem" }}
fontWeight={600}
lineHeight={1.4}
>
<Translation id="page-get-eth-exchanges-header" />
</Heading>
<Text maxW="container.sm" mb={8} lineHeight={1.4} textAlign="center">
9 changes: 6 additions & 3 deletions src/components/Layout.tsx
@@ -15,7 +15,7 @@ import SideNavMobile from "./SideNavMobile"
import TranslationBanner from "./TranslationBanner"
import TranslationBannerLegal from "./TranslationBannerLegal"
import FeedbackWidget from "./FeedbackWidget"
import { SkipLink, SkipLinkAnchor } from "./SkipLink"
import { SkipLink } from "./SkipLink"

import { ZenModeContext } from "../contexts/ZenModeContext"

@@ -135,8 +135,11 @@ const Layout: React.FC<IProps> = ({
<Nav path={path} />
{shouldShowSideNav && <SideNavMobile path={path} />}
</ZenMode>
<SkipLinkAnchor id="main-content" />
<Flex flexDirection={{ base: "column", lg: "row" }}>
<Flex
flexDirection={{ base: "column", lg: "row" }}
id="main-content"
scrollMarginTop={20}
>
{shouldShowSideNav && (
<ZenMode>
<SideNav path={path} />
6 changes: 3 additions & 3 deletions src/components/PageHero.tsx
@@ -1,5 +1,5 @@
import React, { ReactNode } from "react"
import { Box, Flex, Heading, Wrap, WrapItem } from "@chakra-ui/react"
import { Box, Flex, Heading, Text, Wrap, WrapItem } from "@chakra-ui/react"

import { GatsbyImage, IGatsbyImageData } from "gatsby-plugin-image"

@@ -82,15 +82,15 @@ const PageHero: React.FC<IProps> = ({
>
{header}
</Heading>
<Box
<Text
fontSize={{ base: "xl", lg: "2xl" }}
lineHeight={1.4}
color="text200"
mt={4}
mb={8}
>
{subtitle}
</Box>
</Text>
{buttons && (
<Wrap spacing={2} overflow="visible">
{buttons.map((button, idx) => {
2 changes: 1 addition & 1 deletion src/components/Quiz/QuizWidget.tsx
@@ -248,7 +248,7 @@ const QuizWidget: React.FC<IProps> = ({ quizKey, maxQuestions }) => {

// Render QuizWidget component
return (
<Flex width="full" direction="column" alignItems="center">
<Flex as="aside" width="full" direction="column" alignItems="center">
<Heading
as="h2"
mb={12}