Refactoring and simplification
mcamou committed Jan 30, 2025
1 parent 9819ba6 commit 0ca9893
Showing 8 changed files with 126 additions and 252 deletions.
94 changes: 36 additions & 58 deletions README.md
@@ -95,94 +95,72 @@ func main() {

The tee-worker currently supports 3 job types:

**TODO** Add descriptions of the return values.
**TODO:** Add descriptions of the return values.

#### `web-scraper`

Scrapes a URL down to some depth.

**Arguments**

`url` (string): The URL to scrape.

`depth` (int): How deep to go (if unset or less than 0, will be set to 1).
* `url` (string): The URL to scrape.
* `depth` (int): How deep to go (if unset or less than 0, will be set to 1).
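
The defaulting rule for `depth` can be sketched as below. This is a minimal illustration, not the worker's code: `normalizeDepth` is a hypothetical helper, and it takes the sentence above literally (an unset or negative depth becomes 1, anything else is kept). Using a pointer to model "unset" is an assumption made only for this sketch.

```go
package main

import "fmt"

// normalizeDepth is a hypothetical helper (not part of tee-worker).
// It applies the documented rule literally: an unset (nil) or negative
// depth is replaced by 1; any other value is returned unchanged.
func normalizeDepth(depth *int) int {
	if depth == nil || *depth < 0 {
		return 1
	}
	return *depth
}

func main() {
	negative := -2
	three := 3
	fmt.Println(normalizeDepth(nil), normalizeDepth(&negative), normalizeDepth(&three))
}
```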

#### `twitter-scraper`

Performs different types of Twitter searches.

**Arguments**

`type` (string): Type of query (see below).

`query` (string): The query to execute. Its meaning depends on the type of query (see below).

`count` (int): How many results to return.

`next_cursor` (int): Cursor returned from the previous query, for pagination (for those job types that support it).
* `type` (string): Type of query (see below).
* `query` (string): The query to execute. Its meaning depends on the type of query (see below).
* `count` (int): How many results to return.
* `next_cursor` (int): Cursor returned from the previous query, for pagination (for those job types that support it).

**Job types**

Some job types have both `get` and `fetch` variants. The `get` variants ignore the `next_cursor` parameter and are meant for quick retrieval of the first `count` records. To retrieve more records (paginate), use the `fetch` job types, which give you access to a cursor.

**Jobs that return tweets or lists of tweets**

`searchbyquery` - Executes a query and returns the tweets that match. The `query` parameter is a query using the [Twitter API query syntax](https://developer.x.com/en/docs/x-api/v1/tweets/search/guides/standard-operators).

`getbyid` - Returns a tweet given its ID. The `query` parameter is the tweet ID.

`getreplies` - Returns a list of all the replies to a given tweet. The `query` parameter is the tweet ID.

`gettweets` / `fetchusertweets` - Returns all the tweets for a given profile. The `query` parameter is the profile to search.

`gethometweets` / `fetchhometweets` - Returns all the tweets from a profile's home timeline. The `query` parameter is the profile to search.

`getforyoutweets` / `fetchforyoutweets` - Returns all the tweets from a profile's "For You" timeline. The `query` parameter is the profile to search.

`getbookmarks` / `fetchbookmarks` - Returns all of a profile's bookmarked tweets. The `query` parameter is the profile to search.
* `searchbyquery` - Executes a query and returns the tweets that match. The `query` parameter is a query using the [Twitter API query syntax](https://developer.x.com/en/docs/x-api/v1/tweets/search/guides/standard-operators).
* `getbyid` - Returns a tweet given its ID. The `query` parameter is the tweet ID.
* `getreplies` - Returns a list of all the replies to a given tweet. The `query` parameter is the tweet ID.
* `gettweets` / `fetchusertweets` - Returns all the tweets for a given profile. The `query` parameter is the profile to search.
* `gethometweets` / `fetchhometweets` - Returns all the tweets from a profile's home timeline. The `query` parameter is the profile to search.
* `getforyoutweets` / `fetchforyoutweets` - Returns all the tweets from a profile's "For You" timeline. The `query` parameter is the profile to search.
* `getbookmarks` / `fetchbookmarks` - Returns all of a profile's bookmarked tweets. The `query` parameter is the profile to search.

**Jobs that return profiles or lists of profiles**

`getprofilebyid` / `searchbyprofile` - Returns a given user profile. The `query` parameter is the profile to search for.

`getfollowers` / `searchfollowers` - Returns a list of profiles of the followers of a given profile. The `query` parameter is the profile to search.

`getfollowing` - Returns all of the profiles a profile is following. The `query` parameter is the profile to search.

`getretweeters` - Returns a list of profiles that have retweeted a given tweet. The `query` parameter is the tweet ID.
* `getprofilebyid` / `searchbyprofile` - Returns a given user profile. The `query` parameter is the profile to search for.
* `getfollowers` / `searchfollowers` - Returns a list of profiles of the followers of a given profile. The `query` parameter is the profile to search.
* `getfollowing` - Returns all of the profiles a profile is following. The `query` parameter is the profile to search.
* `getretweeters` - Returns a list of profiles that have retweeted a given tweet. The `query` parameter is the tweet ID.

**Jobs that return other types of data**

`getmedia` / `fetchusermedia` - Returns info about all the photos and videos for a given user. The `query` parameter is the profile to search.

`gettrends` - Returns a list of all the trending topics. The `query` parameter is ignored.

`getspace` - Returns info regarding a Twitter Space given its ID. The `query` parameter is the space ID.
* `getmedia` / `fetchusermedia` - Returns info about all the photos and videos for a given user. The `query` parameter is the profile to search.
* `gettrends` - Returns a list of all the trending topics. The `query` parameter is ignored.
* `getspace` - Returns info regarding a Twitter Space given its ID. The `query` parameter is the space ID.

#### `telemetry`

This job type has no parameters and returns the current state of the worker. It returns an object with the following fields. All timestamps are in seconds since the Unix epoch (1970-01-01 00:00:00 UTC), as read from the host's clock. The counts cover the interval between `boot_time` and `current_time`. All the fields in the `stats` object are optional (a missing field means its value is 0):

`boot_time` - Timestamp when the process started up.

`last_operation_time` - Timestamp when the last operation happened.

`current_time` - Current timestamp of the host.

`stats.twitter_scrapes` - Total number of Twitter scrapes.

`stats.twitter_returned_tweets` - Number of tweets returned to clients (this does not consider other types of data such as profiles or trending topics).

`stats.twitter_returned_profiles` - Number of profiles returned to clients.

`stats.twitter_returned_other` - Number of other records returned to clients (e.g. media, spaces or trending topics).

`stats.twitter_errors` - Number of errors while scraping tweets (excluding authentication and rate-limiting).

`stats.twitter_ratelimit_errors` - Number of Twitter rate-limiting errors.
This job type has no parameters and returns the current state of the worker. It returns an object with the following fields. All timestamps are in seconds since the Unix epoch (1970-01-01 00:00:00 UTC), as read from the host's clock. The counts cover the interval between `boot_time` and `current_time`. All the fields in the `stats` object are optional (a missing field means its value is 0).

`stats.twitter_auth_errors` - Number of Twitter authentication errors.
Note that the stats are reset whenever the node is rebooted, so `boot_time` is needed to account for them properly.

`stats.web_success` - Number of successful web scrapes.
These are the fields in the response:

`stats.web_errors` - Number of web scrapes that resulted in an error.
* `boot_time` - Timestamp when the process started up.
* `last_operation_time` - Timestamp when the last operation happened.
* `current_time` - Current timestamp of the host.
* `stats.twitter_scrapes` - Total number of Twitter scrapes.
* `stats.twitter_returned_tweets` - Number of tweets returned to clients (this does not consider other types of data such as profiles or trending topics).
* `stats.twitter_returned_profiles` - Number of profiles returned to clients.
* `stats.twitter_returned_other` - Number of other records returned to clients (e.g. media, spaces or trending topics).
* `stats.twitter_errors` - Number of errors while scraping tweets (excluding authentication and rate-limiting).
* `stats.twitter_ratelimit_errors` - Number of Twitter rate-limiting errors.
* `stats.twitter_auth_errors` - Number of Twitter authentication errors.
* `stats.web_success` - Number of successful web scrapes.
* `stats.web_errors` - Number of web scrapes that resulted in an error.
10 changes: 9 additions & 1 deletion internal/jobs/stats/stats.go
@@ -7,6 +7,7 @@ import (
	"github.com/sirupsen/logrus"
)

// These are the types of statistics that we can add. The value is the JSON key that will be used for serialization.
type statType string

const (
@@ -19,10 +20,11 @@ const (
	TwitterRateErrors statType = "twitter_ratelimit_errors"
	WebErrors statType = "web_errors"
	WebSuccess statType = "web_success"
	// TODO Should we add stats for calls to each of the Twitter job types?
	// TODO: Should we add stats for calls to each of the Twitter job types?

)

// allStats is a list of all the stats that we support.
// Make sure to keep this in sync with the above!
var allStats []statType = []statType{
	TwitterScrapes,
@@ -36,23 +38,27 @@ var allStats []statType = []statType{
	WebErrors,
}

// AddStat is the struct used in the rest of the tee-worker for sending statistics
type AddStat struct {
	Type statType
	Num  uint
}

// stats is the structure we use to store the statistics
type stats struct {
	BootTimeUnix      int64             `json:"boot_time"`
	LastOperationUnix int64             `json:"last_operation_time"`
	CurrentTimeUnix   int64             `json:"current_time"`
	Stats             map[statType]uint `json:"stats"`
}

// StatsCollector is the object used to collect statistics
type StatsCollector struct {
	stats *stats
	Chan  chan AddStat
}

// StartCollector starts a goroutine that listens to a channel for AddStat messages and updates the stats accordingly.
func StartCollector() *StatsCollector {
	logrus.Info("Starting stats collector")

@@ -82,11 +88,13 @@ func StartCollector() *StatsCollector {
	return &StatsCollector{stats: &s, Chan: ch}
}

// Json returns the current statistics as a JSON byte array
func (s StatsCollector) Json() ([]byte, error) {
	s.stats.CurrentTimeUnix = time.Now().Unix()
	return json.Marshal(s.stats)
}

// AddStat is a convenience method to add a number to a statistic
func (s StatsCollector) AddStat(typ statType, num uint) {
	s.Chan <- AddStat{Type: typ, Num: num}
}
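
The body of `StartCollector` is elided in this diff, so the following is only a sketch of the general pattern its doc comment describes: one goroutine owns the counters and applies updates received over a channel. All names here (`addStat`, `collector`, `startCollector`, `add`, `stop`) are hypothetical, and the `stop` method is added just so the example can read the totals safely; nothing suggests the real collector has one.

```go
package main

import "fmt"

// addStat is a hypothetical message type: add num to the counter named key.
type addStat struct {
	key string
	num uint
}

// collector owns the counters; only its goroutine writes to counts.
type collector struct {
	counts map[string]uint
	ch     chan addStat
	done   chan struct{}
}

// startCollector launches the goroutine that applies incoming updates.
func startCollector() *collector {
	c := &collector{
		counts: map[string]uint{},
		ch:     make(chan addStat),
		done:   make(chan struct{}),
	}
	go func() {
		defer close(c.done)
		for m := range c.ch {
			c.counts[m.key] += m.num
		}
	}()
	return c
}

// add sends an update to the collector goroutine.
func (c *collector) add(key string, num uint) {
	c.ch <- addStat{key: key, num: num}
}

// stop closes the channel, waits for the goroutine to finish,
// and returns the final counters.
func (c *collector) stop() map[string]uint {
	close(c.ch)
	<-c.done
	return c.counts
}

func main() {
	c := startCollector()
	c.add("web_success", 2)
	c.add("web_success", 3)
	c.add("twitter_scrapes", 1)
	fmt.Println(c.stop())
}
```

Funnelling every update through a single goroutine means the map needs no mutex, at the cost of requiring some synchronization (here, `stop`) before the counters are read from elsewhere.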
1 change: 1 addition & 0 deletions internal/jobs/telemetry.go
@@ -8,6 +8,7 @@ import (

const TelemetryJobType = "telemetry"

// A TelemetryJob connects to a StatsCollector, and receives requests to return the current stats
type TelemetryJob struct {
	collector *stats.StatsCollector
}