Notes on building a Selenium + Python environment with Docker #32

yamamoto-yuta · 2023-12-03T08:43:27Z

Note: this article is a transcription of a memo I wrote on 2022/12/31.

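Introduction

These are my notes from building a Selenium + Python environment with Docker. I mostly followed the article below:

Reference article: 第662回　Docker+Selenium ServerでWebブラウザ自動操作環境を作る | gihyo.jp

The code in this article is available in the following repository:

Repository: https://github.com/yamamoto-yuta/selenium-on-docker-sample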

Steps

1. Pull the Docker image

All I needed was scraping, so I pulled standalone-chrome:

$ docker pull selenium/standalone-chrome

2. The stock image does not include pip or selenium, so build an image with them installed using a Dockerfile

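FROM selenium/standalone-chrome

# apt needs root privileges, so switch to root for the package install
USER root
RUN apt-get update && apt-get upgrade -y && apt install -y python3-pip

# Switch back to the non-root user (UID 1200) before installing selenium
USER 1200
RUN pip3 install selenium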

Note that running the apt commands as the default user failed with Permission denied, so the Dockerfile temporarily switches to the root user:

=> ERROR [2/3] RUN apt-get update && apt-get upgrade -y && apt install -y python3-pip                                                            1.0s
------
 > [2/3] RUN apt-get update && apt-get upgrade -y && apt install -y python3-pip:
#0 0.900 Reading package lists...
#0 0.925 E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
------
failed to solve: executor failed running [/bin/sh -c apt-get update && apt-get upgrade -y && apt install -y python3-pip]: exit code: 100

3. Put the environment settings in `docker-compose.yml`

version: "3.9"

services:
  app:
    build: .
    image: "selenium-on-docker-sample"
    container_name: "selenium-on-docker-sample"
    volumes:
      - /dev/shm:/dev/shm
      - .:/usr/src/app
    working_dir: /usr/src/app

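The key point is the volumes setting below, which mounts the host's shared memory area /dev/shm into the container. Without it, the browser can reportedly fail to work properly because it runs out of memory: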

    volumes:
      - /dev/shm:/dev/shm

Reference: Seleniumをdockerで動かすと異様に遅い特定サイトで落ちる場合の対処法 │ wonwon eater

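Note that the method above works when the host OS is Linux. On other OSes, you can apparently specify the size directly instead:

    build:
      context: .
      shm_size: '2gb'

Be aware that shm_size under build applies to the image build. To size /dev/shm for the running container, shm_size is set at the service level (the same level as build).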

Reference: Seleniumをdockerで動かすと異様に遅い特定サイトで落ちる場合の対処法 │ wonwon eater

The official documentation says:

Start a Docker container with Firefox
docker run -d -p 4444:4444 -p 7900:7900 --shm-size="2g" selenium/standalone-firefox:4.7.2-20221219
(snip)
☝️ When executing docker run for an image that contains a browser please use the flag --shm-size=2g to use the host's shared memory.

Quoted from: SeleniumHQ/docker-selenium: Docker images for Selenium Grid
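To check that the shared memory setting actually took effect, the size of /dev/shm can be inspected from inside the container. The following is a minimal sketch using only the standard library. The function name `shm_size_mib` is made up for this example and is not part of the repository above.

```python
import os
import shutil


def shm_size_mib(path: str = "/dev/shm") -> int:
    """Return the total size of the file system that contains `path`, in MiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.total // (1024 * 1024)


if __name__ == "__main__":
    # Without any shm setting, Docker gives a container a 64 MiB /dev/shm.
    if os.path.isdir("/dev/shm"):
        print(f"/dev/shm: {shm_size_mib()} MiB")
```

Run inside the container, a value of 64 would suggest that neither the volume mount nor a shm size setting was applied.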

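4. Start the container and enter it

$ docker compose up
$ docker exec -it <CONTAINER_ID> bash

docker compose up stays in the foreground, so run docker exec from another terminal (or start the container with -d). The container_name set in docker-compose.yml can be used in place of the container ID.

5. Run the scraping script

I wrote the scraping script sample.py below. It searches Google for "webdriver" and prints the page titles of the search results to standard output.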

import time
import traceback

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


# Chrome options
options = webdriver.ChromeOptions()
# Recent Selenium 4 releases no longer accept the driver path as a positional
# argument, so pass only the options and let Selenium locate chromedriver.
driver = webdriver.Chrome(options=options)


try:
    # Wait up to 10 seconds when looking up elements
    driver.implicitly_wait(10)

    # Open http://www.google.com
    driver.get("http://www.google.com")

    # Type "webdriver" into the search box and submit
    driver.find_element(By.NAME, "q").send_keys("webdriver" + Keys.ENTER)
    time.sleep(5)

    # Collect and print the titles of the search results
    element_titles = driver.find_elements(By.TAG_NAME, "h3")
    for element_title in element_titles:
        print(element_title.text)

except Exception:
    traceback.print_exc()

finally:
    # Wait for Enter, then quit Chrome (the prompt reads "Press any key to exit...")
    input("何かキーを押すと終了します...")
    driver.quit()

Output:

seluser@aefbc6616615:/usr/src/app$ python3 sample.py 
WebDriver - Selenium
WebDriver を使用して Microsoft Edge を自動化する
WebDriver - MDN Web Docs - Mozilla
WebDriver について私が知っていること (2017 年版)




10分で理解する Selenium - Qiita
ChromeDriver - WebDriver for Chrome
7. WebDriver API — Selenium Python Bindings 2 ドキュメント
WebDriverマウスとキーボードイベント
Pythonで自動化しよう！ ー Selenium Webdriverを ...
何かキーを押すと終了します...
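The blank lines in the output are presumably h3 elements whose text is empty (for example, elements that are not displayed). A small filter can drop them before printing. This is only a sketch: `non_empty_titles` is a hypothetical helper, and it relies on nothing but the `.text` attribute that the script already reads. The demo uses stand-in objects so it can run without a browser.

```python
from types import SimpleNamespace


def non_empty_titles(elements):
    """Return the stripped text of each element, skipping empty ones."""
    return [e.text.strip() for e in elements if e.text and e.text.strip()]


if __name__ == "__main__":
    # Stand-ins for WebElement objects; only the .text attribute is used.
    fake_elements = [
        SimpleNamespace(text="WebDriver - Selenium"),
        SimpleNamespace(text=""),
        SimpleNamespace(text="  ChromeDriver - WebDriver for Chrome "),
    ]
    for title in non_empty_titles(fake_elements):
        print(title)
```

In the script, the loop would then iterate over `non_empty_titles(element_titles)` instead of printing each element's text directly.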

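Things I looked up along the way

What is `/dev/shm`?

It is one of the mount points for tmpfs, a file system that can be created in a Linux machine's memory. tmpfs looks like a RAM disk at first glance, but because it is a file system it needs no formatting (so there is no need to reserve capacity in advance, and it consumes only as much memory as is actually used).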

/dev/shm is normally mounted already. To use tmpfs elsewhere, mount it on any directory you like.

Reference: tmpfs - Linux技術者認定 LinuC | LPI-Japan
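On Linux, the file system type behind /dev/shm can be confirmed by reading /proc/mounts. The sketch below parses text in that format. `find_mount` is a name invented for this example, and the parsing assumes the usual whitespace-separated layout (device, mount point, file system type, options, ...).

```python
import os


def find_mount(mount_point, mounts_text):
    """Return (device, fs_type) for `mount_point`, or None if it is not listed."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point:
            return fields[0], fields[2]
    return None


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            print(find_mount("/dev/shm", f.read()))
```

If /dev/shm is backed by tmpfs, the second element of the returned tuple is `tmpfs`.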

The API changed between Selenium 3 and Selenium 4

The original article used the find_elements_by_* methods, but they were removed in version 4.3.0, and the current style appears to be as follows:

Selenium 3:

driver.find_elements_by_class_name("content")

Selenium 4:

# Unified into a single method that takes the locator strategy as an argument
from selenium.webdriver.common.by import By
driver.find_elements(By.CLASS_NAME, "content")

Quoted from: 【Selenium】急にAttributeError: 'WebDriver' object has no attributeが起きた - Qiita
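When porting older scripts, the renaming follows a regular pattern, so it can be tabulated. The sketch below maps a Selenium 3 method name to the Selenium 4 method plus the locator string. `convert_legacy_call` and `BY_VALUES` are hypothetical names for this example. The strings are the values of the corresponding `By` constants (for instance, `By.CLASS_NAME` is `"class name"`), written out so the snippet runs without Selenium installed.

```python
# Locator strategy strings, matching the values of the By constants in Selenium 4
BY_VALUES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def convert_legacy_call(method_name):
    """Map a Selenium 3 method name to (Selenium 4 method name, locator string)."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if method_name.startswith(prefix):
            suffix = method_name[len(prefix):]
            if suffix in BY_VALUES:
                # Drop the trailing "_by_": "find_elements_by_" -> "find_elements"
                return prefix[: -len("_by_")], BY_VALUES[suffix]
    raise ValueError(f"not a legacy find_element(s)_by_* name: {method_name}")


if __name__ == "__main__":
    print(convert_legacy_call("find_elements_by_class_name"))
```

For `find_elements_by_class_name` this yields `find_elements` with `"class name"`, which corresponds to `driver.find_elements(By.CLASS_NAME, ...)` in the Selenium 4 style.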
