humptydumpty
pekasen committed Jan 22, 2024
1 parent fd976b8 commit 44b9ea8
Showing 3 changed files with 153 additions and 50 deletions.
203 changes: 153 additions & 50 deletions notebooks/01-introduction.ipynb
Expand Up @@ -4,12 +4,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to get data from Telegram.\n",
"# Telegram.\n",
"\n",
"## Data Structure\n",
"A (hopefully) useful guide to getting and analyzing data from Telegram.\n",
"\n",
"*P. Kessling, Leibniz-Institute for Media Research | Hans-Bredow-Institute (HBI), Hamburg, Germany, 2024-01-22.*\n",
"\n",
"## Table of Contents\n",
"\n",
"- [Telegram.](#telegram)\n",
" - [Table of Contents](#table-of-contents)\n",
" - [Data Structure](#data-structure)\n",
" - [Base Objects](#base-objects)\n",
" - [Data Access](#data-access)\n",
" - [API-Access](#api-access)\n",
" - [Requirements](#requirements)\n",
" - [Scraping](#scraping)\n",
" - [`ponyexpress`](#ponyexpress)\n",
" - [Desktop App](#desktop-app)\n",
"\n",
"## Introduction\n",
"\n",
"Telegram is a messenger app that not only offers one-to-one chats but also group chats, channels, bots and more. The data structure is quite complex as it uses a few base objects for all of the different chat and group types.\n",
"\n",
"*Fig. 1: Telegram Web: The public web view of a channel.*\n",
"![Telegram Web: The public web view of a channel.](../images/screenshot-2024-01-22-21-10-47-reitschusterde.png)\n",
"\n",
"\n",
"### Base Objects\n",
"\n",
"The base objects are the following:\n",
Expand Down Expand Up @@ -45,8 +66,98 @@
"A --> C[Group]\n",
"A --> D[Channel]\n",
"A --> E[Bot]\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Access\n",
"\n",
"There are several ways to obtain data from public Telegram channels and groups.\n",
"The next sections cover each of them in depth, starting with the simplest and moving to the more complex.\n",
"\n",
"### Telegram Desktop App\n",
"\n",
"The Telegram Desktop App offers a few options to export data. The easiest way is to export a chat as a JSON file. This file contains all messages of the chat. However, it does not contain any media like images, videos, etc. The media can be exported separately.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scraping the Public Web Interface\n",
"\n",
"Since Telegram offers a publicly accessible web interface for channels, we can scrape the data from there. The web interface is available at `https://t.me/<channel_name>`.\n",
"\n",
"We developed a tool called `ponyexpress-telegram` that can scrape the data from the web interface. It is available on [GitHub](https://www.github.com/Leibniz-HBI/ponyexpress-telegram).\n",
"\n",
"\n",
"```bash\n",
"$ telegram --help\n",
"\n",
"Usage: telegram [OPTIONS] [NAMES]...\n",
"\n",
" Scrape Telegram Channels.\n",
"\n",
"Options:\n",
" --version Show the version and exit.\n",
" -m, --messages-output FILENAME\n",
" -u, --users-output FILENAME\n",
" -p, --prepare-edges\n",
" -l, --log-file PATH\n",
" -v, --verbose\n",
" --help Show this message and exit.\n",
"```\n",
"\n",
"```json\n",
"{\n",
" \"post_id\": \"reitschusterde/8920\",\n",
" \"views\": 14400,\n",
" \"datetime\": 1705571988000,\n",
" \"user\": \"reitschuster.de\",\n",
" \"from_author\": null,\n",
" \"text\": \"Ampel will „Regenbogen-Familien“ stärken – auf Kosten der Kinder?Anpassung „an soziale Wirklichkeit“.Die Bundesregierung plant weitreichende Reformen bei Adoption und Sorgerecht. Dazu sollen die Mindeststrafen bei Kinderpornografie wieder gesenkt werden. Einige der dabei verwendeten Wörter und Formulierungen müssen aufhorchen lassen. Von Kai Rebmann. https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"link\": [\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\"\n",
" ],\n",
" \"reply_to_user\": null,\n",
" \"reply_to_text\": null,\n",
" \"reply_to_link\": null,\n",
" \"image_url\": [],\n",
" \"forwarded_message_url\": null,\n",
" \"forwarded_message_user\": null,\n",
" \"video_url\": [],\n",
" \"video_duration\": null,\n",
" \"handle\": \"reitschusterde\",\n",
" \"post_number\": \"8920\"\n",
"}\n",
"```\n",
"\n",
"```json\n",
"{\n",
" \"name\": \"reitschusterde\",\n",
" \"fullname\": \"reitschuster.de\",\n",
" \"url\": \"https://t.me/reitschusterde\",\n",
" \"description\": \"Offizieller Kanal von Boris Reitschuster\",\n",
" \"subscriber_count\": 235000,\n",
" \"photos_count\": 754,\n",
" \"videos_count\": 86,\n",
" \"files_count\": 9,\n",
" \"links_count\": 7440\n",
"}\n",
"```"
]
},
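A minimal sketch of how a scraped message record like the one above could be post-processed in Python. It assumes that the `datetime` field holds milliseconds since the Unix epoch (UTC) and that repeated entries in `link` are duplicates that can be collapsed; neither is confirmed by the `ponyexpress-telegram` documentation here. The helper name `normalize_message` is hypothetical, and the record is an abbreviated copy of the example.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the message record shown above (long text field omitted).
RAW_RECORD = """
{
  "post_id": "reitschusterde/8920",
  "views": 14400,
  "datetime": 1705571988000,
  "link": [
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/",
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/"
  ],
  "handle": "reitschusterde",
  "post_number": "8920"
}
"""


def normalize_message(record: dict) -> dict:
    """Return a copy with a parsed timestamp and de-duplicated links."""
    cleaned = dict(record)
    # Assumption: `datetime` is milliseconds since the Unix epoch, in UTC.
    cleaned["datetime"] = datetime.fromtimestamp(
        record["datetime"] / 1000, tz=timezone.utc
    )
    # dict.fromkeys keeps the first occurrence of each link and preserves order.
    cleaned["link"] = list(dict.fromkeys(record.get("link") or []))
    return cleaned


message = normalize_message(json.loads(RAW_RECORD))
print(message["post_id"], message["datetime"].isoformat(), len(message["link"]))
```

Keeping the timestamp timezone-aware avoids ambiguity when records from several channels are merged later.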
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## API-Access\n",
"\n",
Expand All @@ -71,6 +182,31 @@
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## tegracli\n",
"\n",
"[tegracli](https://www.github.com/Leibniz-HBI/tegracli) is a command line interface for Telegram. It is written in Python and uses the [Telethon]() library to access the Telegram API. It is intended for research use, e.g. collecting large account-based datasets.\n",
"It also allows you to persist data from a single channel or to search for keywords in the channels your account is subscribed to.\n",
"\n",
"### Installation\n",
"\n",
"`tegracli` is available on [PyPI](https://pypi.org/project/tegracli/) and can be installed via `pip`:\n",
"\n",
"```bash\n",
"pip install tegracli\n",
"```\n",
"\n",
"Alternatively you can install it with `pipx`:\n",
"\n",
"```bash\n",
"# pip install pipx # if not already installed\n",
"pipx install tegracli\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
Expand Down Expand Up @@ -139,15 +275,6 @@
"[^1]: Running this in JupyterLab or a Jupyter notebook is not possible, since they do not allow interactive prompts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web Interface\n",
"\n",
"https://t.me/s/reitschusterde\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
Expand All @@ -174,46 +301,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"```json\n",
"{\n",
" \"post_id\": \"reitschusterde/8920\",\n",
" \"views\": 14400,\n",
" \"datetime\": 1705571988000,\n",
" \"user\": \"reitschuster.de\",\n",
" \"from_author\": null,\n",
" \"text\": \"Ampel will „Regenbogen-Familien“ stärken – auf Kosten der Kinder?Anpassung „an soziale Wirklichkeit“.Die Bundesregierung plant weitreichende Reformen bei Adoption und Sorgerecht. Dazu sollen die Mindeststrafen bei Kinderpornografie wieder gesenkt werden. Einige der dabei verwendeten Wörter und Formulierungen müssen aufhorchen lassen. Von Kai Rebmann. https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"link\": [\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\",\n",
" \"https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/\"\n",
" ],\n",
" \"reply_to_user\": null,\n",
" \"reply_to_text\": null,\n",
" \"reply_to_link\": null,\n",
" \"image_url\": [],\n",
" \"forwarded_message_url\": null,\n",
" \"forwarded_message_user\": null,\n",
" \"video_url\": [],\n",
" \"video_duration\": null,\n",
" \"handle\": \"reitschusterde\",\n",
" \"post_number\": \"8920\"\n",
"}\n",
"```\n",
"## Data Repositories\n",
"\n",
"```json\n",
"{\n",
" \"name\": \"reitschusterde\",\n",
" \"fullname\": \"reitschuster.de\",\n",
" \"url\": \"https://t.me/reitschusterde\",\n",
" \"description\": \"Offizieller Kanal von Boris Reitschuster\",\n",
" \"subscriber_count\": 235000,\n",
" \"photos_count\": 754,\n",
" \"videos_count\": 86,\n",
" \"files_count\": 9,\n",
" \"links_count\": 7440\n",
"}\n",
"```"
"- **Social Media Observatory**: [SMO](https://leibniz-hbi.de/de/projekte/social-media-observatory)\n",
"- **Data4Transparency**: [D4T](https://data4transparency.com/) offers researchers access to a collection of Telegram data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Data Set of this Workshop\n",
"\n",
"The data set of this workshop is centered on the protests in Lützerath in late 2022. Since Telegram does not offer a global search endpoint, we searched our main Telegram data set, which consists of approx. 15,000 channels and 100 million messages, for the term `Lütz*`.\n",
"\n",
"The data set is available to participants via the workshop's chat or on request.\n",
"\n"
]
}
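The wildcard search described above can be sketched in Python as follows. This is an illustration only: it assumes `Lütz*` means "any word starting with Lütz", matched case-insensitively, and it runs over a handful of made-up messages rather than the real data set. The actual search backend used for the workshop data is not specified, and `matches_term` is a hypothetical helper.

```python
import re

# Assumption: `Lütz*` is a prefix wildcard, i.e. any word that starts with "Lütz".
LUETZ_PATTERN = re.compile(r"\bLütz\w*", re.IGNORECASE)


def matches_term(text) -> bool:
    """True if the message text contains a word starting with 'Lütz'."""
    # Scraped records can carry null text (e.g. media-only posts).
    return bool(text) and LUETZ_PATTERN.search(text) is not None


# Made-up sample messages, shaped like the scraped records (only two fields kept).
sample_messages = [
    {"post_id": "example/1", "text": "Demo in Lützerath am Samstag"},
    {"post_id": "example/2", "text": "Wetterbericht für Hamburg"},
    {"post_id": "example/3", "text": None},
    {"post_id": "example/4", "text": "lützi bleibt!"},
]

hits = [m["post_id"] for m in sample_messages if matches_term(m["text"])]
print(hits)
```

Note that this pattern does not catch transliterations such as `Luetzerath`; a real query would likely need to account for them.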
],
Expand Down
