Source Scrapers
Horizon fetches content from five source types: Hacker News, GitHub, RSS feeds, Reddit, and Twitter. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
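A minimal sketch of that contract (httpx stands in for the shared async HTTP client, and the ContentItem fields and constructor signatures here are assumptions; only fetch(since), ContentItem, and asyncio.gather are documented above):

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime

import httpx


@dataclass
class ContentItem:
    # Only title/url are implied by this doc; the other fields are illustrative.
    title: str
    url: str
    source: str
    published_at: datetime | None = None


class BaseScraper:
    def __init__(self, client: httpx.AsyncClient):
        self.client = client  # shared async HTTP client

    async def fetch(self, since: datetime) -> list[ContentItem]:
        raise NotImplementedError


async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # All sources run concurrently; results are flattened into one list.
    batches = await asyncio.gather(*(s.fetch(since) for s in scrapers))
    return [item for batch in batches for item in batch]
```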
Hacker News
File: src/scrapers/hackernews.py
Uses the Firebase HN API:
- GET /topstories.json — fetches top story IDs
- GET /item/{id}.json — fetches story/comment details
Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).
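For illustration, the two-endpoint flow looks roughly like this (a sketch, again assuming httpx; not the project's actual code):

```python
import asyncio

import httpx

HN_API = "https://hacker-news.firebaseio.com/v0"


async def fetch_top_stories(limit: int = 30, min_score: int = 100) -> list[dict]:
    async with httpx.AsyncClient() as client:
        # One call for the ID list, then one concurrent call per story.
        ids = (await client.get(f"{HN_API}/topstories.json")).json()[:limit]
        responses = await asyncio.gather(
            *(client.get(f"{HN_API}/item/{i}.json") for i in ids)
        )
        stories = [r.json() for r in responses]
        return [s for s in stories if s and s.get("score", 0) >= min_score]
```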
Config (sources.hackernews):
```json
{
  "enabled": true,
  "fetch_top_stories": 30,
  "min_score": 100
}
```
- fetch_top_stories — number of top story IDs to fetch
- min_score — minimum HN points to include a story
Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.
GitHub
File: src/scrapers/github.py
Uses the GitHub REST API:
- GET /users/{username}/events/public — user activity events
- GET /repos/{owner}/{repo}/releases — repository releases
Two source types are supported:
- user_events — tracks push, create, release, public, and watch events for a user
- repo_releases — tracks new releases for a specific repository
Config (sources.github, list of entries):
```json
{
  "type": "user_events",
  "username": "torvalds",
  "enabled": true
}
```

```json
{
  "type": "repo_releases",
  "owner": "golang",
  "repo": "go",
  "enabled": true
}
```
Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
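A sketch of how the token feeds into a request (assuming httpx; the function name is illustrative):

```python
import os

import httpx


async def fetch_user_events(username: str) -> list[dict]:
    # The token is optional; with it, GitHub allows 5000 req/hr instead of 60.
    headers = {"Accept": "application/vnd.github+json"}
    if token := os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"Bearer {token}"
    async with httpx.AsyncClient(base_url="https://api.github.com") as client:
        resp = await client.get(f"/users/{username}/events/public", headers=headers)
        resp.raise_for_status()
        return resp.json()
```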
RSS
File: src/scrapers/rss.py
Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.
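A sketch of that date fallback (feedparser exposes each date as a UTC struct_time under a *_parsed key; the helper name is illustrative):

```python
import calendar
from datetime import datetime, timezone

import feedparser


def entry_date(entry) -> datetime | None:
    # Try each date field in order, as described above.
    for field in ("published_parsed", "updated_parsed", "created_parsed"):
        parsed = entry.get(field)
        if parsed:
            return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)
    return None


feed = feedparser.parse("https://simonwillison.net/atom/everything/")
for entry in feed.entries:
    print(entry.get("title"), entry_date(entry))
```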
Config (sources.rss, list of entries):
```json
{
  "name": "Simon Willison",
  "url": "https://simonwillison.net/atom/everything/",
  "enabled": true,
  "category": "ai-tools"
}
```
- category — optional tag for grouping (e.g., "programming", "microblog")
Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.
Reddit
File: src/scrapers/reddit.py
Uses Reddit’s public JSON API (www.reddit.com):
- GET /r/{subreddit}/{sort}.json — subreddit posts
- GET /user/{username}/submitted.json — user submissions
- GET /r/{subreddit}/comments/{post_id}.json — post comments
Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
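A sketch of the post fetch and comment filtering (assuming httpx; the User-Agent string and function names are illustrative):

```python
import httpx

# Reddit requires a descriptive User-Agent; this value is illustrative.
HEADERS = {"User-Agent": "horizon-scraper/1.0 (personal content aggregator)"}


async def fetch_subreddit(name: str, sort: str = "hot", limit: int = 25,
                          min_score: int = 10) -> list[dict]:
    async with httpx.AsyncClient(headers=HEADERS) as client:
        resp = await client.get(
            f"https://www.reddit.com/r/{name}/{sort}.json", params={"limit": limit}
        )
        resp.raise_for_status()
        posts = [child["data"] for child in resp.json()["data"]["children"]]
        return [p for p in posts if p.get("score", 0) >= min_score]


def top_comments(comments: list[dict], limit: int = 5) -> list[str]:
    # Drop moderator-distinguished comments, rank by score, truncate bodies.
    eligible = [c for c in comments if c.get("distinguished") != "moderator"]
    eligible.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c.get("body", "")[:500] for c in eligible[:limit]]
```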
Config (sources.reddit):
```json
{
  "enabled": true,
  "fetch_comments": 5,
  "subreddits": [
    {
      "subreddit": "MachineLearning",
      "sort": "hot",
      "fetch_limit": 25,
      "min_score": 10
    }
  ],
  "users": [
    {
      "username": "spez",
      "sort": "new",
      "fetch_limit": 10
    }
  ]
}
```
- sort — hot, new, top, or rising (subreddits); hot or new (users)
- time_filter — for top/rising sorts: hour, day, week, month, year, all
- min_score — minimum post score (subreddits only)
Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit’s API guidelines.
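The retry behavior described above, as a sketch (assuming httpx, and assuming Retry-After carries a seconds value):

```python
import asyncio

import httpx


async def get_with_retry(client: httpx.AsyncClient, url: str) -> httpx.Response:
    # On 429, wait for Retry-After (assumed to be seconds) and retry exactly once.
    resp = await client.get(url)
    if resp.status_code == 429:
        await asyncio.sleep(float(resp.headers.get("Retry-After", 5)))
        resp = await client.get(url)
    resp.raise_for_status()
    return resp
```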
Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.
Twitter
File: src/scrapers/twitter.py
Uses the Apify platform to bypass Twitter’s anti-scraping measures. The actor altimis~scweet is called via the Apify REST API.
Flow:
- POST to /v2/acts/{actor_id}/runs to trigger a run
- Poll /v2/actor-runs/{run_id} until status is SUCCEEDED or a terminal failure
- GET /v2/datasets/{dataset_id}/items to retrieve results (see the sketch below)
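A sketch of that run/poll/fetch loop (assuming httpx; the terminal status set and the defaultDatasetId field follow Apify's API conventions, and the actor's input schema is not shown here):

```python
import asyncio
import os

import httpx

TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


async def run_actor(actor_id: str, run_input: dict) -> list[dict]:
    params = {"token": os.environ["APIFY_TOKEN"]}
    async with httpx.AsyncClient(base_url="https://api.apify.com", timeout=60) as client:
        # 1. Trigger the run.
        run = (await client.post(f"/v2/acts/{actor_id}/runs",
                                 params=params, json=run_input)).json()["data"]
        # 2. Poll until the run reaches a terminal status.
        while run["status"] not in TERMINAL:
            await asyncio.sleep(5)
            run = (await client.get(f"/v2/actor-runs/{run['id']}",
                                    params=params)).json()["data"]
        if run["status"] != "SUCCEEDED":
            raise RuntimeError(f"Apify run ended with status {run['status']}")
        # 3. Retrieve the dataset items.
        resp = await client.get(f"/v2/datasets/{run['defaultDatasetId']}/items",
                                params=params)
        return resp.json()
```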
Config (sources.twitter):
```json
{
  "enabled": true,
  "users": ["karpathy", "ylecun"],
  "fetch_limit": 10,
  "fetch_reply_text": false,
  "max_replies_per_tweet": 3,
  "max_tweets_to_expand": 10,
  "reply_min_likes": 5,
  "actor_id": "altimis~scweet",
  "apify_token_env": "APIFY_TOKEN"
}
```
- users — Twitter screen names to monitor, without the @ prefix
- fetch_limit — maximum tweets to fetch per run
- fetch_reply_text — when true, a second Apify run fetches reply bodies for each important tweet and appends them under --- Top Comments --- for AI analysis
- max_replies_per_tweet — maximum reply lines per tweet (sorted by engagement score; see the sketch below)
- max_tweets_to_expand — cap on reply expansion runs per pipeline cycle, to control Apify credit usage
- reply_min_likes — minimum likes required for a reply to be included
- actor_id — Apify actor ID (default: altimis~scweet)
- apify_token_env — environment variable name containing the Apify API token
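A hypothetical sketch of the reply selection; the actual engagement-score formula is not documented here, so likes + retweets is a stand-in ranking:

```python
def select_replies(replies: list[dict], min_likes: int = 5, limit: int = 3) -> list[dict]:
    # Hypothetical: likes + retweets stands in for the undocumented
    # engagement score; key names depend on the actor's output schema.
    eligible = [r for r in replies if r.get("likes", 0) >= min_likes]
    eligible.sort(key=lambda r: r.get("likes", 0) + r.get("retweets", 0), reverse=True)
    return eligible[:limit]
```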
Authentication: Set APIFY_TOKEN in your .env. Get a token at console.apify.com.
Extracted data: tweet text, URL, author, publish time, likes, retweets, replies, views, and (optionally) reply-thread text appended under --- Top Comments ---.