Using AI to Filter Web Content: Building Your Own Algorithms

In today's digital landscape, users are bombarded with content from numerous platforms, but most services offer limited control over what appears in feeds and recommendations. While many platforms use proprietary algorithms to personalize content, these algorithms often prioritize engagement over user preferences, leading to content fatigue and frustration.

This article explores how you can build your own AI-powered filtering system to take back control of your digital experience. By leveraging APIs when available or web scraping when necessary, combined with AI models like OpenAI's GPT, you can create personalized content filters tailored to your exact preferences.

The Problem with Existing Filtering Systems

Current content filtering on major platforms has several shortcomings:

  • Limited Customization: Platforms often offer only basic filtering options (block/mute keywords or accounts)
  • Black Box Algorithms: Recommendation systems prioritize engagement metrics over user preferences
  • Lack of Nuance: Simple keyword filters miss context and can't understand complex preferences
  • Inconsistent Experience: Different platforms have different filtering capabilities
  • Profit-Driven Decisions: Platforms optimize for advertising revenue, not user experience

By creating your own filtering layer using AI, you can implement sophisticated content processing that understands your unique preferences and applies them consistently across platforms.

Approaches to Content Acquisition
1. Using Official APIs

The preferred approach when available. Many platforms provide APIs that allow you to programmatically access content:

Python Example: Using the Twitter/X API
import tweepy
import os
from datetime import datetime, timedelta

# Set up authentication
twitter_client = tweepy.Client(
    bearer_token=os.environ.get("TWITTER_BEARER_TOKEN"),
    consumer_key=os.environ.get("TWITTER_API_KEY"),
    consumer_secret=os.environ.get("TWITTER_API_SECRET"),
    access_token=os.environ.get("TWITTER_ACCESS_TOKEN"),
    access_token_secret=os.environ.get("TWITTER_ACCESS_SECRET")
)

# Function to fetch timeline tweets
def fetch_timeline_tweets(max_results=50):
    """Fetch tweets from the user's home timeline."""
    # Using v2 API
    response = twitter_client.get_home_timeline(
        max_results=max_results,
        tweet_fields=['created_at', 'public_metrics', 'entities', 'context_annotations']
    )
    
    if response.data:
        return response.data
    return []

# Function to fetch tweets from a specific list
def fetch_list_tweets(list_id, max_results=50):
    """Fetch tweets from a specific list."""
    response = twitter_client.get_list_tweets(
        id=list_id,
        max_results=max_results,
        tweet_fields=['created_at', 'public_metrics', 'entities']
    )
    
    if response.data:
        return response.data
    return []

# Example usage
timeline_tweets = fetch_timeline_tweets()
print(f"Fetched {len(timeline_tweets)} tweets from timeline")
2. RSS Feeds

Many news sites, blogs, and podcasts still provide RSS feeds which offer a structured way to access content:

Python Example: Processing RSS Feeds
import feedparser
import pandas as pd
from datetime import datetime

def fetch_rss_content(feed_urls):
    """Fetch content from multiple RSS feeds and organize into a DataFrame."""
    all_entries = []
    
    for url in feed_urls:
        try:
            feed = feedparser.parse(url)
            source_name = feed.feed.title
            
            for entry in feed.entries:
                # Extract relevant fields
                published = entry.get('published_parsed') or entry.get('updated_parsed')
                if published:
                    # Convert to datetime
                    published_date = datetime(*published[:6])
                else:
                    published_date = datetime.now()
                
                article_data = {
                    'title': entry.get('title', ''),
                    'link': entry.get('link', ''),
                    'description': entry.get('description', ''),
                    'published': published_date,
                    'source': source_name,
                    'content': entry.get('content', [{'value': ''}])[0]['value'] 
                              if 'content' in entry else entry.get('summary', '')
                }
                all_entries.append(article_data)
        except Exception as e:
            print(f"Error processing feed {url}: {e}")
    
    # Convert to DataFrame for easier manipulation
    if all_entries:
        df = pd.DataFrame(all_entries)
        # Sort by published date (newest first)
        df = df.sort_values('published', ascending=False)
        return df
    return pd.DataFrame()

# Example usage
feed_urls = [
    'https://news.ycombinator.com/rss',
    'https://feeds.arstechnica.com/arstechnica/index',
    'https://www.wired.com/feed/rss'
]

content_df = fetch_rss_content(feed_urls)
print(f"Fetched {len(content_df)} articles from RSS feeds")
3. Web Scraping (When APIs Are Unavailable)

For platforms without accessible APIs, web scraping may be necessary, though it should be used responsibly and in accordance with terms of service:

Python Example: Scraping News Articles
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from datetime import datetime
import random

def scrape_news_site(base_url, article_selector, title_selector, content_selector, max_pages=2):
    """
    Scrapes news articles from a website.
    
    Parameters:
    - base_url: The homepage or section URL to start scraping
    - article_selector: CSS selector to identify article links
    - title_selector: CSS selector to extract article title
    - content_selector: CSS selector to extract article content
    - max_pages: Maximum number of article pages to fetch
    
    Returns:
    - DataFrame with article data
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
    }
    
    article_data = []
    
    try:
        # Get the main page
        response = requests.get(base_url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find article links
        article_links = soup.select(article_selector)
        article_urls = [link.get('href') for link in article_links if link.get('href')]
        
        # Make sure URLs are absolute
        article_urls = [url if url.startswith('http') else f"{base_url.rstrip('/')}/{url.lstrip('/')}" 
                      for url in article_urls]
        
        # Limit the number of article pages processed
        article_urls = article_urls[:max_pages]
        
        # Process each article
        for url in article_urls:
            try:
                # Be nice to the server
                sleep(random.uniform(1.0, 3.0))
                
                article_response = requests.get(url, headers=headers)
                article_soup = BeautifulSoup(article_response.text, 'html.parser')
                
                # Extract information
                title = article_soup.select_one(title_selector)
                title = title.text.strip() if title else "No title"
                
                content_element = article_soup.select_one(content_selector)
                content = content_element.text.strip() if content_element else ""
                
                # Store the data
                article_data.append({
                    'title': title,
                    'url': url,
                    'content': content,
                    'source': base_url,
                    'scraped_at': datetime.now()
                })
                
                print(f"Scraped: {title}")
                
            except Exception as e:
                print(f"Error scraping article {url}: {e}")
                continue
    
    except Exception as e:
        print(f"Error scraping main page {base_url}: {e}")
    
    # Convert to DataFrame
    return pd.DataFrame(article_data)

# Example usage (fictional selectors)
news_df = scrape_news_site(
    base_url='https://example-news-site.com',
    article_selector='div.article-list a.article-link',
    title_selector='h1.article-title',
    content_selector='div.article-content',
    max_pages=5
)

print(f"Scraped {len(news_df)} articles")
Important: Always respect website terms of service, robots.txt files, and rate limits when scraping. Consider caching results to minimize requests, and include proper delays between requests to avoid overloading servers.
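Those safeguards are straightforward to automate. The sketch below checks a site's robots.txt before fetching and caches raw responses on disk so repeated runs don't hit the same pages twice; the cache directory and user-agent string are illustrative choices, not requirements of any particular library.

Python Example: robots.txt Check and Response Caching
import hashlib
import os
from urllib import robotparser
from urllib.parse import urlparse

import requests

CACHE_DIR = "scrape_cache"  # illustrative location; use whatever suits your setup
USER_AGENT = "PersonalContentFilter/0.1"

def is_allowed(url, user_agent=USER_AGENT):
    """Check the site's robots.txt before fetching a URL."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return False  # be conservative if robots.txt cannot be read
    return rp.can_fetch(user_agent, url)

def cached_get(url, headers=None):
    """Fetch a URL, reusing a cached copy on disk when one exists."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".html")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    response = requests.get(url, headers=headers or {"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

# Example usage: guard each scrape with both checks
url = "https://example-news-site.com/some-article"
if is_allowed(url):
    html = cached_get(url)
else:
    print(f"Skipping {url}: disallowed by robots.txt")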
Building an AI Content Filter with OpenAI

Once you've collected content, you can use AI models to analyze and filter it according to your preferences. OpenAI's GPT models are particularly effective at understanding content and applying nuanced filtering criteria:

Python Example: Content Classification with OpenAI
import openai
import pandas as pd
import os
from time import sleep

# Create the OpenAI client using the API key from the environment
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def classify_content(df, content_column, system_prompt, max_batch=10):
    """
    Classifies content using OpenAI's GPT model.
    
    Parameters:
    - df: DataFrame containing content
    - content_column: Column name containing the text to classify
    - system_prompt: Instructions for the AI model
    - max_batch: Maximum number of items to process in one batch
    
    Returns:
    - DataFrame with added classification columns
    """
    results = []
    
    # Process in batches to avoid rate limits
    for i in range(0, len(df), max_batch):
        batch = df.iloc[i:i+max_batch]
        
        for _, row in batch.iterrows():
            content = row[content_column]
            
            try:
                # Skip empty content
                if not content or pd.isna(content):
                    results.append({
                        'original_index': row.name,
                        'keep': False,
                        'categories': [],
                        'reason': 'Empty content'
                    })
                    continue
                
                # Truncate very long content
                if len(content) > 15000:
                    content = content[:15000] + "..."
                
                # Call OpenAI API
                response = client.chat.completions.create(
                    model="gpt-4",  # Or another suitable model
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": content}
                    ],
                    max_tokens=150,
                    temperature=0.1  # Low temperature for more consistent results
                )
                
                # Parse the response
                classification_text = response.choices[0].message.content.strip()
                
                # Simple parsing - adjust based on your system prompt's expected output format
                if "KEEP" in classification_text:
                    keep = True
                elif "DISCARD" in classification_text:
                    keep = False
                else:
                    keep = False  # Default to discard if unclear
                
                # Extract categorization if provided
                categories = []
                if "CATEGORIES:" in classification_text:
                    category_text = classification_text.split("CATEGORIES:")[1].strip()
                    categories = [c.strip() for c in category_text.split(',')]
                
                # Extract reason if provided
                reason = "No specific reason provided"
                if "REASON:" in classification_text:
                    reason = classification_text.split("REASON:")[1].strip()
                
                results.append({
                    'original_index': row.name,
                    'keep': keep,
                    'categories': categories if categories else [],
                    'reason': reason
                })
                
                # Sleep to avoid rate limits
                sleep(0.5)
                
            except Exception as e:
                print(f"Error processing content: {e}")
                results.append({
                    'original_index': row.name,
                    'keep': False,
                    'categories': [],
                    'reason': f"Error: {str(e)}"
                })
    
    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    
    # Merge back with original data
    merged_df = df.copy()
    for _, row in results_df.iterrows():
        idx = row['original_index']
        for col in ['keep', 'categories', 'reason']:
            if col in row:
                merged_df.at[idx, col] = row[col]
    
    return merged_df

# Example usage
system_prompt = """
You are a content filtering assistant that helps users filter their news and social media feeds.
Analyze the provided content and determine if it should be kept or discarded based on the following criteria:

1. DISCARD any content that:
   - Is primarily clickbait with misleading headlines
   - Contains excessive negativity or outrage without substantive information
   - Is primarily promoting a product without educational value
   - Contains partisan political attacks without factual substance
   - Is low-effort, shallow content without depth

2. KEEP content that:
   - Provides substantive information or insights
   - Teaches something valuable or interesting
   - Presents balanced analysis with multiple perspectives
   - Contains original research or reporting
   - Sparks constructive discussion or thoughtful reflection

For each piece of content, respond with:
- KEEP or DISCARD
- CATEGORIES: [list relevant categories like technology, science, politics, etc.]
- REASON: [brief explanation for your decision]
"""

# Assuming content_df was created in a previous step
filtered_df = classify_content(content_df, 'content', system_prompt)

# Show the results
kept_content = filtered_df[filtered_df['keep'] == True]
print(f"Kept {len(kept_content)} out of {len(filtered_df)} items")
print(f"Discard rate: {(1 - len(kept_content)/len(filtered_df))*100:.1f}%")
Creating Effective AI Filter Prompts

The system prompt you provide to the AI model is crucial for effective filtering. Here are some examples for different filtering goals:

Example 1: Political Balance Filter
You are a political content analyzer designed to help users balance their information diet.

For each article, analyze the political perspective and determine:
1. The dominant political leaning (liberal, conservative, centrist, or non-political)
2. Whether multiple perspectives are fairly presented
3. If factual claims are supported with evidence
4. If the tone is informative vs. inflammatory

Respond with:
- CLASSIFICATION: [liberal/conservative/centrist/non-political]
- PERSPECTIVE_SCORE: [1-10 where 1=extremely one-sided, 10=multiple viewpoints fairly presented]
- EVIDENCE_SCORE: [1-10 where 1=claims without evidence, 10=well-supported claims]
- TONE_SCORE: [1-10 where 1=highly inflammatory, 10=neutral/informative]
- KEEP or DISCARD (KEEP if average of all scores > 6)
- REASON: [brief explanation]
Example 2: Educational Content Prioritizer
You are an educational content evaluator designed to identify high-value learning material.

For each article or post, analyze:
1. Educational value - does it teach something substantial?
2. Accuracy - is the information accurate and current?
3. Depth - does it go beyond surface-level explanations?
4. Actionability - can the reader apply this knowledge?

Respond with:
- EDUCATIONAL_VALUE: [High/Medium/Low]
- TOPIC: [Main subject area]
- DEPTH_SCORE: [1-10]
- ACTIONABLE: [Yes/Somewhat/No]
- KEEP or DISCARD (KEEP if Educational_Value is High OR if Depth_Score > 7)
- REASON: [brief explanation of your assessment]
Example 3: Productivity & Mental Health Filter
You are a digital wellness assistant helping users maintain a healthy online experience.

For each piece of content, evaluate:
1. Whether it induces anxiety, FOMO, or negative emotions
2. Whether it promotes constructive thinking or mindless scrolling
3. If it's designed to be addictive through outrage or endless engagement
4. Whether it contains actionable advice or practical value

Respond with:
- EMOTIONAL_IMPACT: [Positive/Neutral/Negative]
- ATTENTION_TYPE: [Constructive/Neutral/Distracting]
- PRACTICAL_VALUE: [High/Medium/Low]
- ADDICTIVENESS: [High/Medium/Low]
- KEEP or DISCARD (KEEP if Emotional_Impact is Positive OR Practical_Value is High)
- REASON: [brief explanation of your assessment]
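These prompts only pay off if the structured reply is parsed reliably. As one possible approach, the sketch below extracts the scored fields from Example 1's format and recomputes the keep/discard rule locally, so the decision stays deterministic even if the model's own KEEP/DISCARD line disagrees with its numbers; the field names mirror the prompt above, and the regular expressions are just one reasonable way to pull them out.

Python Example: Parsing a Score-Based Filter Response
import re

def parse_political_balance_reply(reply_text, threshold=6):
    """Parse the scored reply format used by the political balance prompt."""
    def extract_score(field):
        # Matches lines like "- PERSPECTIVE_SCORE: 7"
        match = re.search(rf"{field}:\s*(\d+)", reply_text)
        return int(match.group(1)) if match else None

    scores = {
        field: extract_score(field)
        for field in ("PERSPECTIVE_SCORE", "EVIDENCE_SCORE", "TONE_SCORE")
    }
    classification_match = re.search(r"CLASSIFICATION:\s*(\S+)", reply_text)
    reason_match = re.search(r"REASON:\s*(.+)", reply_text)

    valid_scores = [s for s in scores.values() if s is not None]
    # Apply the "average of all scores > threshold" rule in code
    keep = bool(valid_scores) and (sum(valid_scores) / len(valid_scores)) > threshold

    return {
        "classification": classification_match.group(1) if classification_match else "unknown",
        "scores": scores,
        "keep": keep,
        "reason": reason_match.group(1).strip() if reason_match else "",
    }

# Example with a hypothetical model reply
sample_reply = """
- CLASSIFICATION: centrist
- PERSPECTIVE_SCORE: 8
- EVIDENCE_SCORE: 7
- TONE_SCORE: 9
- KEEP
- REASON: Presents several viewpoints with sourced claims.
"""
print(parse_political_balance_reply(sample_reply))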
Advanced Features: Content Transformation & Summarization

Beyond simple filtering, you can use AI to transform and enhance content:

Python Example: Content Summarization
def summarize_content(df, content_column, title_column=None, max_length=150):
    """
    Summarizes content in a DataFrame using OpenAI.
    
    Parameters:
    - df: DataFrame containing content
    - content_column: Column name with the text to summarize
    - title_column: Optional column with article titles
    - max_length: Maximum length of summary in words
    
    Returns:
    - DataFrame with added summary column
    """
    summaries = []
    
    for idx, row in df.iterrows():
        content = row[content_column]
        title = row[title_column] if title_column and not pd.isna(row[title_column]) else "Untitled"
        
        try:
            # Skip empty content
            if not content or pd.isna(content):
                summaries.append({"original_index": idx, "summary": "No content to summarize"})
                continue
            
            # Create prompt
            prompt = f"Title: {title}\n\nContent: {content[:8000]}..."  # Truncate long content
            
            system_message = f"""
            You are a skilled content summarizer. Create a concise summary of the following content in {max_length} words or less.
            Focus on the main points and key information. Maintain a neutral tone and do not add information not present in the original.
            """
            
            # Call OpenAI API
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",  # Smaller model for summarization is often sufficient
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=300,
                temperature=0.3
            )
            
            summary = response.choices[0].message.content.strip()
            
            summaries.append({"original_index": idx, "summary": summary})
            
            # Sleep to avoid rate limits
            sleep(0.5)
            
        except Exception as e:
            print(f"Error summarizing content: {e}")
            summaries.append({"original_index": idx, "summary": f"Error: {str(e)}"})
    
    # Add summaries to original DataFrame
    summaries_df = pd.DataFrame(summaries)
    
    # Merge with original data
    merged_df = df.copy()
    for _, row in summaries_df.iterrows():
        idx = row['original_index']
        merged_df.at[idx, 'summary'] = row['summary']
    
    return merged_df

# Example usage
summarized_df = summarize_content(
    kept_content,  # Only summarize content we're keeping
    content_column='content',
    title_column='title',
    max_length=100
)
Python Example: Topic Clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
import numpy as np

def cluster_by_topic(df, text_column, min_cluster_size=2, max_distance=0.7):
    """
    Clusters content by topic similarity.
    
    Parameters:
    - df: DataFrame containing content
    - text_column: Column with the text to analyze
    - min_cluster_size: Minimum number of items to form a cluster
    - max_distance: Maximum distance between items in a cluster
    
    Returns:
    - DataFrame with added cluster column
    """
    # Skip if not enough items
    if len(df) < min_cluster_size:
        df['cluster'] = -1
        return df
    
    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(
        max_features=5000,
        stop_words='english',
        min_df=2,
        max_df=0.9
    )
    
    # Handle missing values and convert to strings
    texts = df[text_column].fillna('').astype(str).tolist()
    
    # Skip if empty texts
    if not any(texts):
        df['cluster'] = -1
        return df
    
    # Create TF-IDF matrix
    try:
        tfidf_matrix = vectorizer.fit_transform(texts)
    except Exception as e:
        print(f"Error creating TF-IDF matrix: {e}")
        df['cluster'] = -1
        return df
    
    # Cluster using DBSCAN
    clusterer = DBSCAN(
        eps=max_distance,
        min_samples=min_cluster_size,
        metric='cosine'
    )
    
    clusters = clusterer.fit_predict(tfidf_matrix)
    
    # Add cluster assignments to DataFrame
    result_df = df.copy()
    result_df['cluster'] = clusters
    
    # Generate cluster topics
    if (clusters != -1).any():  # If we have any clusters
        unique_clusters = sorted(set(clusters[clusters != -1]))
        
        for cluster_id in unique_clusters:
            # Get articles in this cluster
            cluster_docs = [texts[i] for i in range(len(texts)) if clusters[i] == cluster_id]
            cluster_indices = [i for i in range(len(texts)) if clusters[i] == cluster_id]
            
            # Get top terms for this cluster
            if len(cluster_docs) > 0:
                # Use OpenAI to generate a topic name
                combined_text = "\n---\n".join(cluster_docs[:3])  # Just use first 3 docs to save tokens
                
                try:
                    response = client.chat.completions.create(
                        model="gpt-3.5-turbo",
                        messages=[
                            {"role": "system", "content": "You are a topic labeling assistant. Based on the content provided, generate a concise 2-5 word topic label that best represents the common theme."},
                            {"role": "user", "content": f"Please create a topic label for these related articles:\n\n{combined_text}"}
                        ],
                        max_tokens=20,
                        temperature=0.3
                    )
                    
                    topic_name = response.choices[0].message.content.strip()
                    
                    # Assign topic name to all articles in this cluster
                    for idx in cluster_indices:
                        result_df.at[df.index[idx], 'topic'] = topic_name
                    
                except Exception as e:
                    print(f"Error generating topic name: {e}")
                    for idx in cluster_indices:
                        result_df.at[df.index[idx], 'topic'] = f"Topic {cluster_id}"
    
    # For unclustered items (-1), leave topic as NaN
    
    return result_df

# Example usage
clustered_df = cluster_by_topic(
    summarized_df,
    text_column='content',
    min_cluster_size=2,
    max_distance=0.7
)
Building a Complete Content Curation Pipeline

Let's integrate all the components into a complete pipeline that:

  1. Collects content from multiple sources
  2. Filters and classifies content
  3. Summarizes kept content
  4. Groups by topic
  5. Presents the results
Python Example: Complete Content Curation Pipeline
import schedule
import time
import pandas as pd
from datetime import datetime, timedelta
import json
import os
import openai

# The fetch_rss_content, fetch_timeline_tweets, fetch_list_tweets, scrape_news_site,
# classify_content, summarize_content, and cluster_by_topic functions defined earlier
# are assumed to be defined in (or imported into) this module.

class ContentCurator:
    """A complete content curation system using AI filtering."""
    
    def __init__(self, config_path="curator_config.json"):
        """Initialize with configuration."""
        self.config = self.load_config(config_path)
        self.openai_client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        
        # Set up storage directories
        os.makedirs(self.config['data_dir'], exist_ok=True)
        
        # Initialize content store
        self.content_store = pd.DataFrame()
        self.load_existing_content()
    
    def load_config(self, config_path):
        """Load configuration from JSON file."""
        try:
            with open(config_path, 'r') as f:
                config = json.load(f)
            return config
        except Exception as e:
            print(f"Error loading config: {e}")
            # Return default config
            return {
                "data_dir": "content_data",
                "sources": [],
                "filter_prompt": "Default filter prompt...",
                "run_schedule": "1h",
                "max_content_age_days": 7,
                "summarize_content": True
            }
    
    def load_existing_content(self):
        """Load previously processed content."""
        content_path = os.path.join(self.config['data_dir'], 'processed_content.csv')
        if os.path.exists(content_path):
            try:
                self.content_store = pd.read_csv(content_path)
                # Convert date strings back to datetime
                if 'published' in self.content_store.columns:
                    self.content_store['published'] = pd.to_datetime(self.content_store['published'])
                if 'processed_at' in self.content_store.columns:
                    self.content_store['processed_at'] = pd.to_datetime(self.content_store['processed_at'])
                
                print(f"Loaded {len(self.content_store)} items from content store")
            except Exception as e:
                print(f"Error loading content store: {e}")
                self.content_store = pd.DataFrame()
    
    def save_content_store(self):
        """Save processed content to CSV."""
        content_path = os.path.join(self.config['data_dir'], 'processed_content.csv')
        try:
            self.content_store.to_csv(content_path, index=False)
            print(f"Saved {len(self.content_store)} items to content store")
        except Exception as e:
            print(f"Error saving content store: {e}")
    
    def collect_content(self):
        """Collect content from all configured sources."""
        all_content = []
        
        for source in self.config['sources']:
            try:
                if source['type'] == 'rss':
                    print(f"Collecting from RSS: {source['name']}")
                    feed_df = fetch_rss_content([source['url']])
                    if not feed_df.empty:
                        feed_df['source_type'] = 'rss'
                        feed_df['source_name'] = source['name']
                        all_content.append(feed_df)
                
                elif source['type'] == 'twitter':
                    print(f"Collecting from Twitter: {source['name']}")
                    if source.get('list_id'):
                        tweets = fetch_list_tweets(source['list_id'])
                    else:
                        tweets = fetch_timeline_tweets()
                    
                    if tweets:
                        # Convert tweets to DataFrame
                        tweet_data = []
                        for tweet in tweets:
                            tweet_data.append({
                                'title': '',  # Tweets don't have titles
                                'content': tweet.text,
                                'link': f"https://twitter.com/user/status/{tweet.id}",
                                'published': tweet.created_at,
                                'source': 'Twitter',
                                'source_type': 'twitter',
                                'source_name': source['name']
                            })
                        tweet_df = pd.DataFrame(tweet_data)
                        all_content.append(tweet_df)
                
                elif source['type'] == 'scrape':
                    print(f"Collecting from web scrape: {source['name']}")
                    scraped_df = scrape_news_site(
                        base_url=source['url'],
                        article_selector=source['article_selector'],
                        title_selector=source['title_selector'],
                        content_selector=source['content_selector'],
                        max_pages=source.get('max_pages', 3)
                    )
                    if not scraped_df.empty:
                        scraped_df['source_type'] = 'scrape'
                        scraped_df['source_name'] = source['name']
                        all_content.append(scraped_df)
            
            except Exception as e:
                print(f"Error collecting from {source['name']}: {e}")
        
        # Combine all content
        if all_content:
            combined_df = pd.concat(all_content, ignore_index=True)
            print(f"Collected {len(combined_df)} items from all sources")
            return combined_df
        
        return pd.DataFrame()
    
    def filter_content(self, content_df):
        """Filter content using AI."""
        if content_df.empty:
            return content_df
        
        print("Filtering content...")
        system_prompt = self.config.get('filter_prompt', '')
        if not system_prompt:
            print("No filter prompt configured, using default")
            system_prompt = """
            You are a content filtering assistant. Analyze the provided content and determine if it should be kept or discarded.
            KEEP content that is informative, educational, or otherwise valuable.
            DISCARD content that is clickbait, low-quality, or not relevant.
            
            Respond with:
            - KEEP or DISCARD
            - CATEGORIES: [relevant categories]
            - REASON: [brief explanation]
            """
        
        filtered_df = classify_content(
            content_df, 
            'content' if 'content' in content_df.columns else 'description',
            system_prompt
        )
        
        kept_count = len(filtered_df[filtered_df['keep'] == True])
        print(f"Kept {kept_count} out of {len(filtered_df)} items after filtering")
        return filtered_df
    
    def process_content(self):
        """Run the complete content processing pipeline."""
        print(f"Starting content curation run at {datetime.now()}")
        
        # 1. Collect new content
        new_content = self.collect_content()
        if new_content.empty:
            print("No new content collected")
            return
        
        # 2. Filter out previously processed content by URL
        if not self.content_store.empty and 'link' in self.content_store.columns and 'link' in new_content.columns:
            existing_urls = set(self.content_store['link'].dropna())
            new_content = new_content[~new_content['link'].isin(existing_urls)]
            print(f"{len(new_content)} items remain after removing previously processed content")
        
        if new_content.empty:
            print("No new content to process after deduplication")
            return
        
        # 3. Filter content with AI
        filtered_content = self.filter_content(new_content)
        
        # 4. Process kept content
        kept_content = filtered_content[filtered_content['keep'] == True].copy()
        if kept_content.empty:
            print("No content kept after filtering")
            return
        
        # 5. Add processing metadata
        kept_content['processed_at'] = datetime.now()
        
        # 6. Summarize content if enabled
        if self.config.get('summarize_content', True):
            print("Summarizing kept content...")
            title_col = 'title' if 'title' in kept_content.columns else None
            content_col = 'content' if 'content' in kept_content.columns else 'description'
            kept_content = summarize_content(kept_content, content_col, title_col)
        
        # 7. Cluster by topic if we have enough content
        if len(kept_content) >= 2:
            print("Clustering content by topic...")
            content_col = 'content' if 'content' in kept_content.columns else 'description'
            kept_content = cluster_by_topic(kept_content, content_col)
        
        # 8. Add to content store
        self.content_store = pd.concat([kept_content, self.content_store], ignore_index=True)
        
        # 9. Prune old content
        max_age = self.config.get('max_content_age_days', 7)
        cutoff_date = datetime.now() - timedelta(days=max_age)
        
        if 'processed_at' in self.content_store.columns:
            old_count = len(self.content_store[self.content_store['processed_at'] < cutoff_date])
            self.content_store = self.content_store[self.content_store['processed_at'] >= cutoff_date]
            print(f"Removed {old_count} items older than {max_age} days")
        
        # 10. Save updated content store
        self.save_content_store()
        
        # 11. Export latest results
        self.export_latest_digest()
        
        print(f"Content curation run completed at {datetime.now()}")
    
    def export_latest_digest(self, max_items=50):
        """Export a digest of latest content."""
        if self.content_store.empty:
            return
        
        latest_content = self.content_store.sort_values('processed_at', ascending=False).head(max_items)
        
        # Group by topic if available
        if 'topic' in latest_content.columns and 'cluster' in latest_content.columns:
            # Get items with topics
            topic_content = latest_content[~latest_content['topic'].isna()]
            # Sort by cluster to group topics together
            topic_content = topic_content.sort_values(['cluster', 'processed_at'], ascending=[True, False])
            
            # Get remainder without topics
            other_content = latest_content[latest_content['topic'].isna()]
            
            # Combine, with topic content first
            latest_content = pd.concat([topic_content, other_content])
        
        # Export to HTML
        html_path = os.path.join(self.config['data_dir'], 'latest_digest.html')
        
        try:
            # Generate HTML
            html_content = """
            
            
            
                Your Curated Content Digest
                
                
                
            
            
                

Your Curated Content Digest

Generated on {date}

""".format(date=datetime.now().strftime("%Y-%m-%d %H:%M")) # Check if we have topics if 'topic' in latest_content.columns and latest_content['topic'].notna().any(): # Group by topic topics = latest_content['topic'].fillna('Uncategorized').unique() for topic in topics: topic_items = latest_content[latest_content['topic'].fillna('Uncategorized') == topic] # Skip if no items (shouldn't happen but just in case) if len(topic_items) == 0: continue # Add topic section html_content += """

{topic}

""".format(topic=topic) # Add items in this topic for _, item in topic_items.iterrows(): html_content += self._format_item_html(item) html_content += "
" else: # No topics, just list all items for _, item in latest_content.iterrows(): html_content += self._format_item_html(item) html_content += """ """ # Write to file with open(html_path, 'w', encoding='utf-8') as f: f.write(html_content) print(f"Exported digest with {len(latest_content)} items to {html_path}") except Exception as e: print(f"Error exporting digest: {e}") def _format_item_html(self, item): """Format a single item for HTML output.""" # Get fields with fallbacks title = item.get('title', 'Untitled') if pd.isna(title) or not title: title = item.get('link', 'Untitled').split('/')[-1] link = item.get('link', '#') source = item.get('source_name', item.get('source', 'Unknown Source')) summary = item.get('summary', item.get('description', '')) if pd.isna(summary): summary = '' # Format date date_str = '' if 'published' in item and not pd.isna(item['published']): try: date_obj = pd.to_datetime(item['published']) date_str = date_obj.strftime("%Y-%m-%d") except: pass # Get categories categories_html = '' if 'categories' in item and not pd.isna(item['categories']) and item['categories']: try: if isinstance(item['categories'], str): # If stored as string, try to parse try: categories = eval(item['categories']) # Careful with this! except: categories = [item['categories']] else: categories = item['categories'] if categories: categories_html = '
' for cat in categories: categories_html += f'{cat}' categories_html += '
' except: pass # Format the HTML item_html = """
{source} • {date}
{summary}
{categories}
""".format( title=title, link=link, source=source, date=date_str, summary=summary, categories=categories_html ) return item_html def run_scheduled(self): """Run the curator on a schedule.""" # Run once immediately self.process_content() # Set up schedule schedule_time = self.config.get('run_schedule', '1h') if schedule_time.endswith('h'): # Hourly schedule try: hours = int(schedule_time[:-1]) schedule.every(hours).hours.do(self.process_content) print(f"Scheduled to run every {hours} hours") except: schedule.every(1).hours.do(self.process_content) print("Scheduled to run hourly (default)") elif schedule_time.endswith('m'): # Minutes schedule try: minutes = int(schedule_time[:-1]) schedule.every(minutes).minutes.do(self.process_content) print(f"Scheduled to run every {minutes} minutes") except: schedule.every(30).minutes.do(self.process_content) print("Scheduled to run every 30 minutes (default)") else: # Default schedule schedule.every(1).hours.do(self.process_content) print("Scheduled to run hourly (default)") # Run the schedule print("Starting scheduler...") try: while True: schedule.run_pending() time.sleep(60) # Check every minute except KeyboardInterrupt: print("Scheduler stopped by user") # Example usage if __name__ == "__main__": # Create a configuration file config = { "data_dir": "content_data", "sources": [ { "name": "Hacker News", "type": "rss", "url": "https://news.ycombinator.com/rss" }, { "name": "Tech News", "type": "rss", "url": "https://feeds.arstechnica.com/arstechnica/index" }, # Example Twitter source (requires API access) # { # "name": "Tech Twitter", # "type": "twitter", # "list_id": "1234567890" # }, # Example web scraping source # { # "name": "Example News", # "type": "scrape", # "url": "https://example-news-site.com", # "article_selector": "div.article-list a.article-link", # "title_selector": "h1.article-title", # "content_selector": "div.article-content", # "max_pages": 3 # } ], "filter_prompt": """ You are a content filtering assistant that helps users filter their news and social media feeds. Analyze the provided content and determine if it should be kept or discarded based on the following criteria: 1. DISCARD any content that: - Is primarily clickbait with misleading headlines - Contains excessive negativity or outrage without substantive information - Is primarily promoting a product without educational value - Contains partisan political attacks without factual substance - Is low-effort, shallow content without depth 2. KEEP content that: - Provides substantive information or insights - Teaches something valuable or interesting - Presents balanced analysis with multiple perspectives - Contains original research or reporting - Sparks constructive discussion or thoughtful reflection For each piece of content, respond with: - KEEP or DISCARD - CATEGORIES: [list relevant categories like technology, science, politics, etc.] - REASON: [brief explanation for your decision] """, "run_schedule": "1h", "max_content_age_days": 7, "summarize_content": True } # Save configuration os.makedirs("content_data", exist_ok=True) with open("content_data/curator_config.json", "w") as f: json.dump(config, f, indent=2) # Create and run the curator curator = ContentCurator("content_data/curator_config.json") curator.run_scheduled()
Ethical Considerations and Best Practices

When implementing your own AI filtering system, consider these important ethical guidelines:

  1. Respect Terms of Service - Always check platform terms of service before scraping or using APIs
  2. Manage Rate Limits - Implement proper delays to avoid overloading servers
  3. Avoid Echo Chambers - Design filters that don't simply reinforce existing views
  4. Privacy Protection - Store only what you need and handle personal data with care
  5. Attribution - Provide proper attribution for content sources
  6. Transparency - Understand and document how your filtering works
  7. Confirmation Bias - Be aware of and counter your own biases in prompt design
Technical Best Practices:
  • Implement caching to reduce API calls to both content sources and AI services (see the caching sketch after this list)
  • Add robust error handling for resiliency
  • Use smaller AI models when possible to reduce costs
  • Track filter performance over time to improve your prompts
  • Consider edge cases like different languages or media formats
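One way to act on the caching point above is to memoize classification results keyed by a hash of the prompt and content, so re-running the pipeline never pays twice for content it has already judged; the saved file also doubles as a record for reviewing how your filter performs over time. This is a minimal sketch under assumed names: the cache path is arbitrary, and classify_fn stands in for whatever function actually calls the model (for example, a thin wrapper around the classify_content logic shown earlier).

Python Example: Caching AI Classification Results
import hashlib
import json
import os

AI_CACHE_PATH = "content_data/ai_cache.json"  # illustrative path

def load_ai_cache():
    """Load previously cached classifications from disk."""
    if os.path.exists(AI_CACHE_PATH):
        with open(AI_CACHE_PATH, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_ai_cache(cache):
    """Persist the cache; it also serves as a log of past filter decisions."""
    os.makedirs(os.path.dirname(AI_CACHE_PATH), exist_ok=True)
    with open(AI_CACHE_PATH, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)

def classify_with_cache(content, system_prompt, classify_fn):
    """Return a cached result when the same prompt/content pair was seen before."""
    cache = load_ai_cache()
    key = hashlib.sha256((system_prompt + "\n" + content).encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]  # no API call needed
    result = classify_fn(content, system_prompt)  # hypothetical single-item classifier
    cache[key] = result
    save_ai_cache(cache)
    return result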

Conclusion

Building your own AI-powered content filter puts you back in control of your digital information diet. By leveraging the capabilities of modern AI models, you can create sophisticated filtering systems that understand context, nuance, and your personal preferences in ways that simple keyword filters cannot.

This approach not only helps reduce information overload but can lead to more meaningful engagement with higher-quality content across all your information sources. As AI capabilities continue to improve, these personal filtering systems will become increasingly accessible and powerful tools for navigating our complex information ecosystem.
