Using AI to Filter Web Content: Building Your Own Algorithms
In today's digital landscape, users are bombarded with content from numerous platforms, but most services offer limited control over what appears in feeds and recommendations. While many platforms use proprietary algorithms to personalize content, these algorithms often prioritize engagement over user preferences, leading to content fatigue and frustration.
This article explores how you can build your own AI-powered filtering system to take back control of your digital experience. By leveraging APIs when available or web scraping when necessary, combined with AI models like OpenAI's GPT, you can create personalized content filters tailored to your exact preferences.
Current content filtering on major platforms has several shortcomings:
- Limited Customization: Platforms often offer only basic filtering options (block/mute keywords or accounts)
- Black Box Algorithms: Recommendation systems prioritize engagement metrics over user preferences
- Lack of Nuance: Simple keyword filters miss context and can't understand complex preferences
- Inconsistent Experience: Different platforms have different filtering capabilities
- Profit-Driven Decisions: Platforms optimize for advertising revenue, not user experience
By creating your own filtering layer using AI, you can implement sophisticated content processing that understands your unique preferences and applies them consistently across platforms.
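At its core, the approach is simple: hand one piece of content, along with a plain-language description of what you want to see, to a language model and ask for a verdict. The sketch below is a minimal, hypothetical version of that building block (the keep_item helper and the gpt-4o-mini model choice are illustrative, not part of any later code); the rest of this article expands the same idea into content collection, batch classification, summarization, and topic grouping.
import os
import openai

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def keep_item(text, preferences):
    """Ask the model whether a single piece of content matches the stated preferences."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable chat model works
        messages=[
            {"role": "system",
             "content": f"You filter content for a user with these preferences: {preferences}. "
                        "Reply with exactly one word: KEEP or DISCARD."},
            {"role": "user", "content": text[:4000]},  # truncate very long items
        ],
        max_tokens=5,
        temperature=0
    )
    return "KEEP" in response.choices[0].message.content.upper()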
1. Using Official APIs
The preferred approach when available. Many platforms provide APIs that allow you to programmatically access content:
import tweepy
import os
from datetime import datetime, timedelta
# Set up authentication
client = tweepy.Client(
bearer_token=os.environ.get("TWITTER_BEARER_TOKEN"),
consumer_key=os.environ.get("TWITTER_API_KEY"),
consumer_secret=os.environ.get("TWITTER_API_SECRET"),
access_token=os.environ.get("TWITTER_ACCESS_TOKEN"),
access_token_secret=os.environ.get("TWITTER_ACCESS_SECRET")
)
# Function to fetch timeline tweets
def fetch_timeline_tweets(max_results=50):
"""Fetch tweets from the user's home timeline."""
# Using v2 API
response = client.get_home_timeline(
max_results=max_results,
tweet_fields=['created_at', 'public_metrics', 'entities', 'context_annotations']
)
if response.data:
return response.data
return []
# Function to fetch tweets from a specific list
def fetch_list_tweets(list_id, max_results=50):
"""Fetch tweets from a specific list."""
response = client.get_list_tweets(
id=list_id,
max_results=max_results,
tweet_fields=['created_at', 'public_metrics', 'entities']
)
if response.data:
return response.data
return []
# Example usage
timeline_tweets = fetch_timeline_tweets()
print(f"Fetched {len(timeline_tweets)} tweets from timeline")
2. RSS Feeds
Many news sites, blogs, and podcasts still provide RSS feeds which offer a structured way to access content:
import feedparser
import pandas as pd
from datetime import datetime
def fetch_rss_content(feed_urls):
"""Fetch content from multiple RSS feeds and organize into a DataFrame."""
all_entries = []
for url in feed_urls:
try:
feed = feedparser.parse(url)
source_name = feed.feed.get('title', url)  # fall back to the URL if the feed has no title
for entry in feed.entries:
# Extract relevant fields
published = entry.get('published_parsed') or entry.get('updated_parsed')
if published:
# Convert to datetime
published_date = datetime(*published[:6])
else:
published_date = datetime.now()
article_data = {
'title': entry.get('title', ''),
'link': entry.get('link', ''),
'description': entry.get('description', ''),
'published': published_date,
'source': source_name,
'content': entry.get('content', [{'value': ''}])[0]['value']
if 'content' in entry else entry.get('summary', '')
}
all_entries.append(article_data)
except Exception as e:
print(f"Error processing feed {url}: {e}")
# Convert to DataFrame for easier manipulation
if all_entries:
df = pd.DataFrame(all_entries)
# Sort by published date (newest first)
df = df.sort_values('published', ascending=False)
return df
return pd.DataFrame()
# Example usage
feed_urls = [
'https://news.ycombinator.com/rss',
'https://feeds.arstechnica.com/arstechnica/index',
'https://www.wired.com/feed/rss'
]
content_df = fetch_rss_content(feed_urls)
print(f"Fetched {len(content_df)} articles from RSS feeds")
3. Web Scraping (When APIs Are Unavailable)
For platforms without accessible APIs, web scraping may be necessary, though it should be used responsibly and in accordance with terms of service:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from datetime import datetime
import random
def scrape_news_site(base_url, article_selector, title_selector, content_selector, max_pages=2):
"""
Scrapes news articles from a website.
Parameters:
- base_url: The homepage or section URL to start scraping
- article_selector: CSS selector to identify article links
- title_selector: CSS selector to extract article title
- content_selector: CSS selector to extract article content
- max_pages: Maximum number of article links to follow from the listing page
Returns:
- DataFrame with article data
"""
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
article_data = []
try:
# Get the main page
response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
# Find article links
article_links = soup.select(article_selector)
article_urls = [link.get('href') for link in article_links if link.get('href')]
# Make sure URLs are absolute
article_urls = [url if url.startswith('http') else f"{base_url.rstrip('/')}/{url.lstrip('/')}"
for url in article_urls]
# Limit the number of article links to follow
article_urls = article_urls[:max_pages]
# Process each article
for url in article_urls:
try:
# Be nice to the server
sleep(random.uniform(1.0, 3.0))
article_response = requests.get(url, headers=headers)
article_soup = BeautifulSoup(article_response.text, 'html.parser')
# Extract information
title = article_soup.select_one(title_selector)
title = title.text.strip() if title else "No title"
content_element = article_soup.select_one(content_selector)
content = content_element.text.strip() if content_element else ""
# Store the data
article_data.append({
'title': title,
'url': url,
'content': content,
'source': base_url,
'scraped_at': datetime.now()
})
print(f"Scraped: {title}")
except Exception as e:
print(f"Error scraping article {url}: {e}")
continue
except Exception as e:
print(f"Error scraping main page {base_url}: {e}")
# Convert to DataFrame
return pd.DataFrame(article_data)
# Example usage (fictional selectors)
news_df = scrape_news_site(
base_url='https://example-news-site.com',
article_selector='div.article-list a.article-link',
title_selector='h1.article-title',
content_selector='div.article-content',
max_pages=5
)
print(f"Scraped {len(news_df)} articles")
Once you've collected content, you can use AI models to analyze and filter it according to your preferences. OpenAI's GPT models are particularly effective at understanding content and applying nuanced filtering criteria:
import openai
import pandas as pd
import os
from time import sleep
# Set OpenAI API key
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def classify_content(df, content_column, system_prompt, max_batch=10):
"""
Classifies content using OpenAI's GPT model.
Parameters:
- df: DataFrame containing content
- content_column: Column name containing the text to classify
- system_prompt: Instructions for the AI model
- max_batch: Maximum number of items to process in one batch
Returns:
- DataFrame with added classification columns
"""
results = []
# Process in batches to avoid rate limits
for i in range(0, len(df), max_batch):
batch = df.iloc[i:i+max_batch]
for _, row in batch.iterrows():
content = row[content_column]
try:
# Skip empty content
if not content or pd.isna(content):
results.append({
'original_index': row.name,
'keep': False,
'categories': [],
'reason': 'Empty content'
})
continue
# Truncate very long content
if len(content) > 15000:
content = content[:15000] + "..."
# Call OpenAI API
response = client.chat.completions.create(
model="gpt-4", # Or another suitable model
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": content}
],
max_tokens=150,
temperature=0.1 # Low temperature for more consistent results
)
# Parse the response
classification_text = response.choices[0].message.content.strip()
# Simple parsing - adjust based on your system prompt's expected output format
if "KEEP" in classification_text:
keep = True
elif "DISCARD" in classification_text:
keep = False
else:
keep = False # Default to discard if unclear
# Extract categorization if provided
categories = []
if "CATEGORIES:" in classification_text:
category_text = classification_text.split("CATEGORIES:")[1].strip()
categories = [c.strip() for c in category_text.split(',')]
# Extract reason if provided
reason = "No specific reason provided"
if "REASON:" in classification_text:
reason = classification_text.split("REASON:")[1].strip()
results.append({
'original_index': row.name,
'keep': keep,
'categories': categories if categories else [],
'reason': reason
})
# Sleep to avoid rate limits
sleep(0.5)
except Exception as e:
print(f"Error processing content: {e}")
results.append({
'original_index': row.name,
'keep': False,
'categories': [],
'reason': f"Error: {str(e)}"
})
# Convert results to DataFrame
results_df = pd.DataFrame(results)
# Merge back with original data
merged_df = df.copy()
# Pre-create result columns as object dtype so list values can be assigned per cell
for col in ['keep', 'categories', 'reason']:
merged_df[col] = None
for _, row in results_df.iterrows():
idx = row['original_index']
for col in ['keep', 'categories', 'reason']:
if col in row:
merged_df.at[idx, col] = row[col]
return merged_df
# Example usage
system_prompt = """
You are a content filtering assistant that helps users filter their news and social media feeds.
Analyze the provided content and determine if it should be kept or discarded based on the following criteria:
1. DISCARD any content that:
- Is primarily clickbait with misleading headlines
- Contains excessive negativity or outrage without substantive information
- Is primarily promoting a product without educational value
- Contains partisan political attacks without factual substance
- Is low-effort, shallow content without depth
2. KEEP content that:
- Provides substantive information or insights
- Teaches something valuable or interesting
- Presents balanced analysis with multiple perspectives
- Contains original research or reporting
- Sparks constructive discussion or thoughtful reflection
For each piece of content, respond with:
- KEEP or DISCARD
- CATEGORIES: [list relevant categories like technology, science, politics, etc.]
- REASON: [brief explanation for your decision]
"""
# Assuming content_df was created in a previous step
filtered_df = classify_content(content_df, 'content', system_prompt)
# Show the results
kept_content = filtered_df[filtered_df['keep'] == True]
print(f"Kept {len(kept_content)} out of {len(filtered_df)} items")
print(f"Discard rate: {(1 - len(kept_content)/len(filtered_df))*100:.1f}%")
Creating Effective AI Filter Prompts
The system prompt you provide to the AI model is crucial for effective filtering. Here are some examples for different filtering goals:
Political balance filter:
You are a political content analyzer designed to help users balance their information diet.
For each article, analyze the political perspective and determine:
1. The dominant political leaning (liberal, conservative, centrist, or non-political)
2. Whether multiple perspectives are fairly presented
3. If factual claims are supported with evidence
4. If the tone is informative vs. inflammatory
Respond with:
- CLASSIFICATION: [liberal/conservative/centrist/non-political]
- PERSPECTIVE_SCORE: [1-10 where 1=extremely one-sided, 10=multiple viewpoints fairly presented]
- EVIDENCE_SCORE: [1-10 where 1=claims without evidence, 10=well-supported claims]
- TONE_SCORE: [1-10 where 1=highly inflammatory, 10=neutral/informative]
- KEEP or DISCARD (KEEP if average of all scores > 6)
- REASON: [brief explanation]
Educational value filter:
You are an educational content evaluator designed to identify high-value learning material.
For each article or post, analyze:
1. Educational value - does it teach something substantial?
2. Accuracy - is the information accurate and current?
3. Depth - does it go beyond surface-level explanations?
4. Actionability - can the reader apply this knowledge?
Respond with:
- EDUCATIONAL_VALUE: [High/Medium/Low]
- TOPIC: [Main subject area]
- DEPTH_SCORE: [1-10]
- ACTIONABLE: [Yes/Somewhat/No]
- KEEP or DISCARD (KEEP if Educational_Value is High OR if Depth_Score > 7)
- REASON: [brief explanation of your assessment]
Digital wellness filter:
You are a digital wellness assistant helping users maintain a healthy online experience.
For each piece of content, evaluate:
1. Whether it induces anxiety, FOMO, or negative emotions
2. Whether it promotes constructive thinking or mindless scrolling
3. If it's designed to be addictive through outrage or endless engagement
4. Whether it contains actionable advice or practical value
Respond with:
- EMOTIONAL_IMPACT: [Positive/Neutral/Negative]
- ATTENTION_TYPE: [Constructive/Neutral/Distracting]
- PRACTICAL_VALUE: [High/Medium/Low]
- ADDICTIVENESS: [High/Medium/Low]
- KEEP or DISCARD (KEEP if Emotional_Impact is Positive OR Practical_Value is High)
- REASON: [brief explanation of your assessment]
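Prompts like these return structured text rather than a single KEEP or DISCARD token, so the parsing step needs to pull out the individual fields. The parse_scored_response helper below is a minimal, hypothetical sketch for the political-balance prompt above: it assumes the model followed the requested format and falls back to a score of 1 for any missing field.
import re

def parse_scored_response(text):
    """Extract the numeric scores from the political-balance prompt's output."""
    def score(field):
        match = re.search(rf"{field}:\s*\[?(\d+)", text)
        return int(match.group(1)) if match else 1  # conservative default when the field is missing

    scores = [score(f) for f in ("PERSPECTIVE_SCORE", "EVIDENCE_SCORE", "TONE_SCORE")]
    average = sum(scores) / len(scores)
    return {
        "scores": scores,
        "average": average,
        "keep": average > 6,  # the same threshold the prompt asks the model to apply
    }
Recomputing the keep decision from the scores, rather than trusting the model's own KEEP or DISCARD line, keeps the threshold in your code where you can tune it without rewriting the prompt.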
Beyond simple filtering, you can use AI to transform and enhance content:
def summarize_content(df, content_column, title_column=None, max_length=150):
"""
Summarizes content in a DataFrame using OpenAI.
Parameters:
- df: DataFrame containing content
- content_column: Column name with the text to summarize
- title_column: Optional column with article titles
- max_length: Maximum length of summary in words
Returns:
- DataFrame with added summary column
"""
summaries = []
for idx, row in df.iterrows():
content = row[content_column]
title = row[title_column] if title_column and not pd.isna(row[title_column]) else "Untitled"
try:
# Skip empty content
if not content or pd.isna(content):
summaries.append({"original_index": idx, "summary": "No content to summarize"})
continue
# Create prompt
prompt = f"Title: {title}\n\nContent: {content[:8000]}..." # Truncate long content
system_message = f"""
You are a skilled content summarizer. Create a concise summary of the following content in {max_length} words or less.
Focus on the main points and key information. Maintain a neutral tone and do not add information not present in the original.
"""
# Call OpenAI API
response = client.chat.completions.create(
model="gpt-3.5-turbo", # Smaller model for summarization is often sufficient
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": prompt}
],
max_tokens=300,
temperature=0.3
)
summary = response.choices[0].message.content.strip()
summaries.append({"original_index": idx, "summary": summary})
# Sleep to avoid rate limits
sleep(0.5)
except Exception as e:
print(f"Error summarizing content: {e}")
summaries.append({"original_index": idx, "summary": f"Error: {str(e)}"})
# Add summaries to original DataFrame
summaries_df = pd.DataFrame(summaries)
# Merge with original data
merged_df = df.copy()
for _, row in summaries_df.iterrows():
idx = row['original_index']
merged_df.at[idx, 'summary'] = row['summary']
return merged_df
# Example usage
summarized_df = summarize_content(
kept_content, # Only summarize content we're keeping
content_column='content',
title_column='title',
max_length=100
)
You can also group related items by topic so that similar stories are presented together. A simple approach is TF-IDF vectors with DBSCAN clustering, plus an AI-generated label for each cluster:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
import numpy as np
def cluster_by_topic(df, text_column, min_cluster_size=2, max_distance=0.7):
"""
Clusters content by topic similarity.
Parameters:
- df: DataFrame containing content
- text_column: Column with the text to analyze
- min_cluster_size: Minimum number of items to form a cluster
- max_distance: Maximum distance between items in a cluster
Returns:
- DataFrame with added cluster column
"""
# Skip if not enough items
if len(df) < min_cluster_size:
df['cluster'] = -1
return df
# Create TF-IDF vectors
vectorizer = TfidfVectorizer(
max_features=5000,
stop_words='english',
min_df=2,
max_df=0.9
)
# Handle missing values and convert to strings
texts = df[text_column].fillna('').astype(str).tolist()
# Skip if empty texts
if not any(texts):
df['cluster'] = -1
return df
# Create TF-IDF matrix
try:
tfidf_matrix = vectorizer.fit_transform(texts)
except Exception as e:
print(f"Error creating TF-IDF matrix: {e}")
df['cluster'] = -1
return df
# Cluster using DBSCAN
clusterer = DBSCAN(
eps=max_distance,
min_samples=min_cluster_size,
metric='cosine'
)
clusters = clusterer.fit_predict(tfidf_matrix)
# Add cluster assignments to DataFrame
result_df = df.copy()
result_df['cluster'] = clusters
# Generate cluster topics
if (clusters != -1).any(): # If we have any clusters
unique_clusters = sorted(set(clusters[clusters != -1]))
for cluster_id in unique_clusters:
# Get articles in this cluster
cluster_docs = [texts[i] for i in range(len(texts)) if clusters[i] == cluster_id]
cluster_indices = [i for i in range(len(texts)) if clusters[i] == cluster_id]
# Get top terms for this cluster
if len(cluster_docs) > 0:
# Use OpenAI to generate a topic name
combined_text = "\n---\n".join(cluster_docs[:3]) # Just use first 3 docs to save tokens
try:
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are a topic labeling assistant. Based on the content provided, generate a concise 2-5 word topic label that best represents the common theme."},
{"role": "user", "content": f"Please create a topic label for these related articles:\n\n{combined_text}"}
],
max_tokens=20,
temperature=0.3
)
topic_name = response.choices[0].message.content.strip()
# Assign topic name to all articles in this cluster
for idx in cluster_indices:
result_df.at[df.index[idx], 'topic'] = topic_name
except Exception as e:
print(f"Error generating topic name: {e}")
for idx in cluster_indices:
result_df.at[df.index[idx], 'topic'] = f"Topic {cluster_id}"
# For unclustered items (-1), leave topic as NaN
return result_df
# Example usage
clustered_df = cluster_by_topic(
summarized_df,
text_column='content',
min_cluster_size=2,
max_distance=0.7
)
Let's integrate all the components into a complete pipeline that:
- Collects content from multiple sources
- Filters and classifies content
- Summarizes kept content
- Groups by topic
- Presents the results
import schedule
import time
import pandas as pd
from datetime import datetime, timedelta
import json
import os
import ast
import openai
class ContentCurator:
"""A complete content curation system using AI filtering."""
def __init__(self, config_path="curator_config.json"):
"""Initialize with configuration."""
self.config = self.load_config(config_path)
self.openai_client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
# Set up storage directories
os.makedirs(self.config['data_dir'], exist_ok=True)
# Initialize content store
self.content_store = pd.DataFrame()
self.load_existing_content()
def load_config(self, config_path):
"""Load configuration from JSON file."""
try:
with open(config_path, 'r') as f:
config = json.load(f)
return config
except Exception as e:
print(f"Error loading config: {e}")
# Return default config
return {
"data_dir": "content_data",
"sources": [],
"filter_prompt": "Default filter prompt...",
"run_schedule": "1h",
"max_content_age_days": 7,
"summarize_content": True
}
def load_existing_content(self):
"""Load previously processed content."""
content_path = os.path.join(self.config['data_dir'], 'processed_content.csv')
if os.path.exists(content_path):
try:
self.content_store = pd.read_csv(content_path)
# Convert date strings back to datetime
if 'published' in self.content_store.columns:
self.content_store['published'] = pd.to_datetime(self.content_store['published'])
if 'processed_at' in self.content_store.columns:
self.content_store['processed_at'] = pd.to_datetime(self.content_store['processed_at'])
print(f"Loaded {len(self.content_store)} items from content store")
except Exception as e:
print(f"Error loading content store: {e}")
self.content_store = pd.DataFrame()
def save_content_store(self):
"""Save processed content to CSV."""
content_path = os.path.join(self.config['data_dir'], 'processed_content.csv')
try:
self.content_store.to_csv(content_path, index=False)
print(f"Saved {len(self.content_store)} items to content store")
except Exception as e:
print(f"Error saving content store: {e}")
def collect_content(self):
"""Collect content from all configured sources."""
all_content = []
for source in self.config['sources']:
try:
if source['type'] == 'rss':
print(f"Collecting from RSS: {source['name']}")
feed_df = fetch_rss_content([source['url']])
if not feed_df.empty:
feed_df['source_type'] = 'rss'
feed_df['source_name'] = source['name']
all_content.append(feed_df)
elif source['type'] == 'twitter':
print(f"Collecting from Twitter: {source['name']}")
if source.get('list_id'):
tweets = fetch_list_tweets(source['list_id'])
else:
tweets = fetch_timeline_tweets()
if tweets:
# Convert tweets to DataFrame
tweet_data = []
for tweet in tweets:
tweet_data.append({
'title': '', # Tweets don't have titles
'content': tweet.text,
'link': f"https://twitter.com/user/status/{tweet.id}",
'published': tweet.created_at,
'source': 'Twitter',
'source_type': 'twitter',
'source_name': source['name']
})
tweet_df = pd.DataFrame(tweet_data)
all_content.append(tweet_df)
elif source['type'] == 'scrape':
print(f"Collecting from web scrape: {source['name']}")
scraped_df = scrape_news_site(
base_url=source['url'],
article_selector=source['article_selector'],
title_selector=source['title_selector'],
content_selector=source['content_selector'],
max_pages=source.get('max_pages', 3)
)
if not scraped_df.empty:
scraped_df['source_type'] = 'scrape'
scraped_df['source_name'] = source['name']
all_content.append(scraped_df)
except Exception as e:
print(f"Error collecting from {source['name']}: {e}")
# Combine all content
if all_content:
combined_df = pd.concat(all_content, ignore_index=True)
print(f"Collected {len(combined_df)} items from all sources")
return combined_df
return pd.DataFrame()
def filter_content(self, content_df):
"""Filter content using AI."""
if content_df.empty:
return content_df
print("Filtering content...")
system_prompt = self.config.get('filter_prompt', '')
if not system_prompt:
print("No filter prompt configured, using default")
system_prompt = """
You are a content filtering assistant. Analyze the provided content and determine if it should be kept or discarded.
KEEP content that is informative, educational, or otherwise valuable.
DISCARD content that is clickbait, low-quality, or not relevant.
Respond with:
- KEEP or DISCARD
- CATEGORIES: [relevant categories]
- REASON: [brief explanation]
"""
filtered_df = classify_content(
content_df,
'content' if 'content' in content_df.columns else 'description',
system_prompt
)
kept_count = len(filtered_df[filtered_df['keep'] == True])
print(f"Kept {kept_count} out of {len(filtered_df)} items after filtering")
return filtered_df
def process_content(self):
"""Run the complete content processing pipeline."""
print(f"Starting content curation run at {datetime.now()}")
# 1. Collect new content
new_content = self.collect_content()
if new_content.empty:
print("No new content collected")
return
# 2. Filter out previously processed content by URL
if not self.content_store.empty and 'link' in self.content_store.columns and 'link' in new_content.columns:
existing_urls = set(self.content_store['link'].dropna())
new_content = new_content[~new_content['link'].isin(existing_urls)]
print(f"{len(new_content)} items remain after removing previously processed content")
if new_content.empty:
print("No new content to process after deduplication")
return
# 3. Filter content with AI
filtered_content = self.filter_content(new_content)
# 4. Process kept content
kept_content = filtered_content[filtered_content['keep'] == True].copy()
if kept_content.empty:
print("No content kept after filtering")
return
# 5. Add processing metadata
kept_content['processed_at'] = datetime.now()
# 6. Summarize content if enabled
if self.config.get('summarize_content', True):
print("Summarizing kept content...")
title_col = 'title' if 'title' in kept_content.columns else None
content_col = 'content' if 'content' in kept_content.columns else 'description'
kept_content = summarize_content(kept_content, content_col, title_col)
# 7. Cluster by topic if we have enough content
if len(kept_content) >= 2:
print("Clustering content by topic...")
content_col = 'content' if 'content' in kept_content.columns else 'description'
kept_content = cluster_by_topic(kept_content, content_col)
# 8. Add to content store
self.content_store = pd.concat([kept_content, self.content_store], ignore_index=True)
# 9. Prune old content
max_age = self.config.get('max_content_age_days', 7)
cutoff_date = datetime.now() - timedelta(days=max_age)
if 'processed_at' in self.content_store.columns:
old_count = len(self.content_store[self.content_store['processed_at'] < cutoff_date])
self.content_store = self.content_store[self.content_store['processed_at'] >= cutoff_date]
print(f"Removed {old_count} items older than {max_age} days")
# 10. Save updated content store
self.save_content_store()
# 11. Export latest results
self.export_latest_digest()
print(f"Content curation run completed at {datetime.now()}")
def export_latest_digest(self, max_items=50):
"""Export a digest of latest content."""
if self.content_store.empty:
return
latest_content = self.content_store.sort_values('processed_at', ascending=False).head(max_items)
# Group by topic if available
if 'topic' in latest_content.columns and 'cluster' in latest_content.columns:
# Get items with topics
topic_content = latest_content[~latest_content['topic'].isna()]
# Sort by cluster to group topics together
topic_content = topic_content.sort_values(['cluster', 'processed_at'], ascending=[True, False])
# Get remainder without topics
other_content = latest_content[latest_content['topic'].isna()]
# Combine, with topic content first
latest_content = pd.concat([topic_content, other_content])
# Export to HTML
html_path = os.path.join(self.config['data_dir'], 'latest_digest.html')
try:
# Generate HTML
html_content = """<html>
<head><title>Your Curated Content Digest</title></head>
<body>
<h1>Your Curated Content Digest</h1>
<p>Generated on {date}</p>
""".format(date=datetime.now().strftime("%Y-%m-%d %H:%M"))
# Check if we have topics
if 'topic' in latest_content.columns and latest_content['topic'].notna().any():
# Group by topic
topics = latest_content['topic'].fillna('Uncategorized').unique()
for topic in topics:
topic_items = latest_content[latest_content['topic'].fillna('Uncategorized') == topic]
# Skip if no items (shouldn't happen but just in case)
if len(topic_items) == 0:
continue
# Add topic section
html_content += """
<div class="topic-section">
<h2>{topic}</h2>
""".format(topic=topic)
# Add items in this topic
for _, item in topic_items.iterrows():
html_content += self._format_item_html(item)
html_content += "</div>"
else:
# No topics, just list all items
for _, item in latest_content.iterrows():
html_content += self._format_item_html(item)
html_content += """
</body>
</html>
"""
# Write to file
with open(html_path, 'w', encoding='utf-8') as f:
f.write(html_content)
print(f"Exported digest with {len(latest_content)} items to {html_path}")
except Exception as e:
print(f"Error exporting digest: {e}")
def _format_item_html(self, item):
"""Format a single item for HTML output."""
# Get fields with fallbacks
title = item.get('title', 'Untitled')
if pd.isna(title) or not title:
title = item.get('link', 'Untitled').split('/')[-1]
link = item.get('link', '#')
source = item.get('source_name', item.get('source', 'Unknown Source'))
summary = item.get('summary', item.get('description', ''))
if pd.isna(summary):
summary = ''
# Format date
date_str = ''
if 'published' in item and not pd.isna(item['published']):
try:
date_obj = pd.to_datetime(item['published'])
date_str = date_obj.strftime("%Y-%m-%d")
except:
pass
# Get categories
categories_html = ''
if 'categories' in item and isinstance(item['categories'], (list, str)) and len(item['categories']) > 0:
try:
if isinstance(item['categories'], str):
# If stored as a string (e.g. after a CSV round-trip), parse it safely
try:
categories = ast.literal_eval(item['categories'])
except (ValueError, SyntaxError):
categories = [item['categories']]
else:
categories = item['categories']
if categories:
categories_html = '<div class="categories">'
for cat in categories:
categories_html += f'<span class="category">{cat}</span> '
categories_html += '</div>'
except:
pass
# Format the HTML
item_html = """
<div class="item">
<h3><a href="{link}">{title}</a></h3>
<p class="item-meta">{source} | {date}</p>
<p class="item-summary">{summary}</p>
{categories}
</div>
""".format(
title=title,
link=link,
source=source,
date=date_str,
summary=summary,
categories=categories_html
)
return item_html
def run_scheduled(self):
"""Run the curator on a schedule."""
# Run once immediately
self.process_content()
# Set up schedule
schedule_time = self.config.get('run_schedule', '1h')
if schedule_time.endswith('h'):
# Hourly schedule
try:
hours = int(schedule_time[:-1])
schedule.every(hours).hours.do(self.process_content)
print(f"Scheduled to run every {hours} hours")
except ValueError:
schedule.every(1).hours.do(self.process_content)
print("Scheduled to run hourly (default)")
elif schedule_time.endswith('m'):
# Minutes schedule
try:
minutes = int(schedule_time[:-1])
schedule.every(minutes).minutes.do(self.process_content)
print(f"Scheduled to run every {minutes} minutes")
except ValueError:
schedule.every(30).minutes.do(self.process_content)
print("Scheduled to run every 30 minutes (default)")
else:
# Default schedule
schedule.every(1).hours.do(self.process_content)
print("Scheduled to run hourly (default)")
# Run the schedule
print("Starting scheduler...")
try:
while True:
schedule.run_pending()
time.sleep(60) # Check every minute
except KeyboardInterrupt:
print("Scheduler stopped by user")
# Example usage
if __name__ == "__main__":
# Create a configuration file
config = {
"data_dir": "content_data",
"sources": [
{
"name": "Hacker News",
"type": "rss",
"url": "https://news.ycombinator.com/rss"
},
{
"name": "Tech News",
"type": "rss",
"url": "https://feeds.arstechnica.com/arstechnica/index"
},
# Example Twitter source (requires API access)
# {
# "name": "Tech Twitter",
# "type": "twitter",
# "list_id": "1234567890"
# },
# Example web scraping source
# {
# "name": "Example News",
# "type": "scrape",
# "url": "https://example-news-site.com",
# "article_selector": "div.article-list a.article-link",
# "title_selector": "h1.article-title",
# "content_selector": "div.article-content",
# "max_pages": 3
# }
],
"filter_prompt": """
You are a content filtering assistant that helps users filter their news and social media feeds.
Analyze the provided content and determine if it should be kept or discarded based on the following criteria:
1. DISCARD any content that:
- Is primarily clickbait with misleading headlines
- Contains excessive negativity or outrage without substantive information
- Is primarily promoting a product without educational value
- Contains partisan political attacks without factual substance
- Is low-effort, shallow content without depth
2. KEEP content that:
- Provides substantive information or insights
- Teaches something valuable or interesting
- Presents balanced analysis with multiple perspectives
- Contains original research or reporting
- Sparks constructive discussion or thoughtful reflection
For each piece of content, respond with:
- KEEP or DISCARD
- CATEGORIES: [list relevant categories like technology, science, politics, etc.]
- REASON: [brief explanation for your decision]
""",
"run_schedule": "1h",
"max_content_age_days": 7,
"summarize_content": True
}
# Save configuration
os.makedirs("content_data", exist_ok=True)
with open("content_data/curator_config.json", "w") as f:
json.dump(config, f, indent=2)
# Create and run the curator
curator = ContentCurator("content_data/curator_config.json")
curator.run_scheduled()
When implementing your own AI filtering system, consider these important ethical guidelines:
- Respect Terms of Service - Always check platform terms of service before scraping or using APIs
- Manage Rate Limits - Implement proper delays to avoid overloading servers
- Avoid Echo Chambers - Design filters that don't simply reinforce existing views
- Privacy Protection - Store only what you need and handle personal data with care
- Attribution - Provide proper attribution for content sources
- Transparency - Understand and document how your filtering works
- Confirmation Bias - Be aware of and counter your own biases in prompt design
A few practical tips for building a robust, cost-effective system:
- Implement caching to reduce API calls to both content sources and AI services (a minimal sketch follows this list)
- Add robust error handling for resiliency
- Use smaller AI models when possible to reduce costs
- Track filter performance over time to improve your prompts
- Consider edge cases like different languages or media formats
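For the caching tip above, a small on-disk cache keyed by a hash of the content keeps you from paying twice to classify items that reappear across runs. This is a minimal sketch under stated assumptions: classify_fn stands in for whatever single-item AI call you wrap, and the cache path is illustrative.
import hashlib
import json
import os

CACHE_PATH = "content_data/classification_cache.json"  # illustrative location

def load_cache(path=CACHE_PATH):
    """Load previously cached classification results, if any."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    """Persist the cache to disk."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)

def classify_with_cache(text, classify_fn, cache):
    """Return a cached verdict for this text, calling classify_fn only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = classify_fn(text)  # classify_fn wraps your single-item, JSON-serializable AI call
        save_cache(cache)
    return cache[key]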
Conclusion
Building your own AI-powered content filter puts you back in control of your digital information diet. By leveraging the capabilities of modern AI models, you can create sophisticated filtering systems that understand context, nuance, and your personal preferences in ways that simple keyword filters cannot.
This approach not only helps reduce information overload but can lead to more meaningful engagement with higher-quality content across all your information sources. As AI capabilities continue to improve, these personal filtering systems will become increasingly accessible and powerful tools for navigating our complex information ecosystem.