Agentic Eyes Researchers

Central Research Questions

Do different LLM architectures generate meaningfully distinct research questions when analyzing identical discourse data?
Can these models effectively connect empirical observations to established theoretical frameworks in communication studies?
Which architectural or training approaches consistently produce the most semantically distinctive outputs?
How stable are model outputs across repeated trials with identical inputs?

Research Contributions

Benchmarking computational creativity in academic contexts
Understanding cross-model variation in theoretical reasoning
Assessing reliability of AI-assisted research design

Practical Applications

Identifying appropriate models for exploratory research phases
Developing ensemble approaches using multiple models
Establishing quality benchmarks for AI research assistance

Data Source

Subreddit: r/conspiracy_commons

Rationale: This community provides naturally occurring discourse containing complex narratives, contested truth claims, and rich intertextual references. The content challenges models to engage with ambiguous, controversial material that resists simple framing.

Collection Protocol

Daily automated retrieval via Reddit API (PRAW)
Top posts by community engagement
Up to 25 posts per collection window
Up to 10 comments per post (thread context preserved)
Metadata: scores, timestamps, author identifiers

Data Structure

Post titles and self-text
Comment threads with parent-child relationships
Engagement metrics (upvotes, comment counts)
Temporal information
Pseudonymized user references

Ethical Considerations

Public data: All content publicly posted in open forum
Minimal identifiers: Only public usernames (already visible to all Reddit users)
Academic use: Research-only, not commercial application
Platform compliance: Adheres to Reddit API Terms of Service
No intervention: Observational only, no manipulation or engagement

Model Selection

DeepSeek V3.1

Chinese development, strong reasoning capabilities

Qwen3 Max

Alibaba, multilingual optimization

Grok 4

X.AI, real-time data training

Claude Sonnet 4.5

Anthropic, constitutional AI approach

Gemini 2.5 Pro

Google, multimodal architecture

GPT-5

OpenAI, current generation flagship

GLM 4.6

Z.AI, bilingual Chinese-English

Kimi K2

Moonshot AI, extended context windows

Selection Criteria

Architectural diversity: Different training approaches and paradigms
Geographic representation: North American, Chinese, and European providers
Current availability: Accessible via OpenRouter unified API
Task capability: Sufficient context window and reasoning capacity
Recent versions: Current or recently updated models only

Controlled Conditions

All models receive identical inputs and parameters:

Same Reddit data corpus
Identical researcher persona prompt
Temperature 0.8 (allows creativity while maintaining coherence)
Standardized output format requirements
No model-specific prompt engineering

Uniqueness Metric

Questions are scored based on semantic distance from all other questions generated across all models in a given run.

Calculation Method

Embedding: Each question is encoded as a 384-dimensional vector using the sentence-transformers/all-MiniLM-L6-v2 model (semantic representation)
Pairwise comparison: Cosine similarity computed between all question pairs (1.0 = identical meaning, 0.0 = completely orthogonal)
Uniqueness derivation: Uniqueness = 1 - (mean similarity to all other questions)
Model scoring: Average uniqueness across all questions from a single model

What this measures

Semantic differentiation from peer outputs
Novel framing or research direction
Non-redundancy within model outputs
Divergence from consensus approaches

What this does NOT measure

Research quality or feasibility
Methodological rigor
Theoretical appropriateness
Practical value or significance

Interpretation

Higher scores indicate greater semantic distance from the cluster of questions generated by other models. This reflects distinctiveness, not necessarily superiority. A model producing highly similar questions to peers would score lower, even if those questions are methodologically sound.

Daily Workflow

Scheduled execution: 9:00 AM UTC daily

1

Data collection

Retrieve latest posts and comments from r/conspiracy_commons

2

Data preprocessing

Structure content into standardized format with metadata

3

Parallel model invocation

Identical prompts sent to all 8 models simultaneously via OpenRouter

4

Response parsing

Extract research questions, theoretical frameworks, and hypotheses

5

Uniqueness computation

Generate embeddings, calculate pairwise similarities, derive uniqueness scores

6

Ranking and archival

Sort models by score, timestamp results, update leaderboard

7

Publication

Generate this static site with updated results

Technical Stack

Python 3.12+ with async/await
OpenRouter API for model access
sentence-transformers for embeddings
Server infrastructure in Singapore

Data Retention

Daily results archived with timestamps
Reddit data retained for reproducibility
Model responses logged for analysis
Longitudinal tracking enabled

Performance Across Simulation Runs

Uniqueness Score Rankings

Comparative Analysis

Score Distribution

Theoretical Framework Usage

Sample Research Questions

Research Design

Study Overview