Tracking how model outputs vary when processing identical data in different runs
[Dashboard panels not reproduced here: Uniqueness Score Rankings (models ranked by average semantic distance from other submissions; higher = more distinctive); Comparative Analysis (score distribution, theoretical framework usage); Sample Research Questions (top 15 questions by uniqueness score).]
Research Design
Study Overview
This ongoing experiment compares how different large language models approach academic research question generation
when analyzing identical social media discourse. The study examines whether architectural and training differences
produce meaningfully distinct research directions.
Central Research Questions
Do different LLM architectures generate meaningfully distinct research questions when analyzing identical discourse data?
Can these models effectively connect empirical observations to established theoretical frameworks in communication studies?
Which architectural or training approaches consistently produce the most semantically distinctive outputs?
How stable are model outputs across repeated trials with identical inputs?
Research Contributions
Benchmarking computational creativity in academic contexts
Understanding cross-model variation in theoretical reasoning
Assessing reliability of AI-assisted research design
Practical Applications
Identifying appropriate models for exploratory research phases
Developing ensemble approaches using multiple models
Establishing quality benchmarks for AI research assistance
Data Source
Subreddit: r/conspiracy_commons
Rationale: This community provides naturally occurring discourse
containing complex narratives, contested truth claims, and rich intertextual references. The content challenges
models to engage with ambiguous, controversial material that resists simple framing.
Collection Protocol
Daily automated retrieval via Reddit API (PRAW)
Top posts by community engagement
Up to 25 posts per collection window
Up to 10 comments per post (thread context preserved)
Metadata: scores, timestamps, author identifiers
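A minimal sketch of this retrieval step, assuming PRAW is configured with valid Reddit API credentials; the credential placeholders and the daily time filter are illustrative, not the project's actual configuration:

```python
# Sketch of the daily collection step using PRAW.
# Credentials are placeholders; the "day" window for top posts is assumed.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="research-question-benchmark/0.1",
)

def collect_posts(limit_posts: int = 25, limit_comments: int = 10) -> list[dict]:
    records = []
    for post in reddit.subreddit("conspiracy_commons").top(
        time_filter="day", limit=limit_posts
    ):
        post.comments.replace_more(limit=0)  # resolve "load more comments" stubs
        records.append({
            "title": post.title,
            "selftext": post.selftext,
            "score": post.score,
            "created_utc": post.created_utc,
            "comments": [
                {"body": c.body, "score": c.score, "parent_id": c.parent_id}
                for c in post.comments.list()[:limit_comments]
            ],
        })
    return records
```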
Data Structure
Post titles and self-text
Comment threads with parent-child relationships
Engagement metrics (upvotes, comment counts)
Temporal information
Pseudonymized user references
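As an illustration of this layout, one post record might look like the sketch below; the field names and the salted-hash pseudonymization scheme are assumptions, not the project's actual schema:

```python
# Illustrative record layout; field names and hashing scheme are assumptions.
import hashlib

def pseudonymize(username: str, salt: str = "project-salt") -> str:
    # Replace a public username with a stable, non-reversible reference.
    return hashlib.sha256((salt + username).encode()).hexdigest()[:12]

record = {
    "title": "Example post title",
    "selftext": "Example post body",
    "author": pseudonymize("example_user"),
    "score": 412,                       # engagement metric: upvotes
    "num_comments": 57,                 # engagement metric: comment count
    "created_utc": 1735689600,          # temporal information
    "comments": [
        {
            "id": "c1",
            "parent_id": None,          # None = top-level; otherwise parent comment id
            "author": pseudonymize("another_user"),
            "body": "Example reply",
            "created_utc": 1735693200,
        },
    ],
}
```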
Ethical Considerations
Public data: All content publicly posted in open forum
Minimal identifiers: Only public usernames (already visible to all Reddit users)
Academic use: Research-only, not commercial application
Platform compliance: Adheres to Reddit API Terms of Service
No intervention: Observational only, no manipulation or engagement
Model Selection
DeepSeek V3.1
Chinese development, strong reasoning capabilities
Qwen3 Max
Alibaba, multilingual optimization
Grok 4
xAI, real-time data training
Claude Sonnet 4.5
Anthropic, constitutional AI approach
Gemini 2.5 Pro
Google, multimodal architecture
GPT-5
OpenAI, current generation flagship
GLM 4.6
Z.AI, bilingual Chinese-English
Kimi K2
Moonshot AI, extended context windows
Selection Criteria
Architectural diversity: Different training approaches and paradigms
Geographic representation: North American and Chinese providers
Current availability: Accessible via OpenRouter unified API
Task capability: Sufficient context window and reasoning capacity
Recent versions: Current or recently updated models only
Controlled Conditions
All models receive identical inputs and parameters:
Same Reddit data corpus
Identical researcher persona prompt
Temperature 0.8 (allows creativity while maintaining coherence)
Standardized output format requirements
No model-specific prompt engineering
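A minimal sketch of one invocation under these conditions, via OpenRouter's OpenAI-compatible endpoint; the model slug, environment-variable name, and helper name are illustrative:

```python
# One controlled invocation via OpenRouter's OpenAI-compatible API.
# Model slug, prompt contents, and env-var name are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def generate_questions(model_slug: str, persona_prompt: str, corpus: str) -> str:
    response = client.chat.completions.create(
        model=model_slug,          # e.g. "deepseek/deepseek-chat" (slug assumed)
        temperature=0.8,           # fixed across all models
        messages=[
            {"role": "system", "content": persona_prompt},  # identical researcher persona
            {"role": "user", "content": corpus},            # same Reddit corpus for every model
        ],
    )
    return response.choices[0].message.content
```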
Uniqueness Metric
Questions are scored based on semantic distance from all other questions
generated across all models in a given run.
Calculation Method
Embedding: Each question is encoded as a 384-dimensional vector using the
sentence-transformers/all-MiniLM-L6-v2 model (semantic representation)
Pairwise comparison: Cosine similarity computed between all question pairs
(1.0 = identical meaning, 0.0 = orthogonal, i.e., semantically unrelated)
Uniqueness derivation: Uniqueness = 1 - (mean similarity to all other questions)
Model scoring: Average uniqueness across all questions from a single model
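A minimal sketch of this calculation, assuming the sentence-transformers and scikit-learn packages; the function name is illustrative:

```python
# Sketch of the uniqueness calculation described above.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def uniqueness_scores(questions: list[str]) -> np.ndarray:
    # Encode each question as a 384-dimensional semantic vector.
    embeddings = encoder.encode(questions)
    # Cosine similarity between every pair of questions (n x n matrix).
    sim = cosine_similarity(embeddings)
    n = len(questions)
    # Mean similarity of each question to all *other* questions
    # (subtract the diagonal self-similarity before averaging).
    mean_other = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    # Uniqueness = 1 - mean similarity to all other questions.
    return 1.0 - mean_other

# A model's score is then the mean of uniqueness_scores over its own questions.
```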
What this measures
Semantic differentiation from peer outputs
Novel framing or research direction
Non-redundancy within model outputs
Divergence from consensus approaches
What this does NOT measure
Research quality or feasibility
Methodological rigor
Theoretical appropriateness
Practical value or significance
Interpretation
Higher scores indicate greater semantic distance from the cluster of questions generated by other models.
This reflects distinctiveness, not necessarily superiority: a model whose
questions closely resemble its peers' scores lower, even if those questions are methodologically sound.
Daily Workflow
Scheduled execution: 9:00 AM UTC daily
1. Data collection: retrieve the latest posts and comments from r/conspiracy_commons
2. Data preprocessing: structure content into a standardized format with metadata
3. Parallel model invocation: send identical prompts to all 8 models simultaneously via OpenRouter (see the concurrency sketch after this list)
4. Response parsing: extract research questions, theoretical frameworks, and hypotheses
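A sketch of the step-3 fan-out, reusing the hypothetical generate_questions helper from the Controlled Conditions section; the model slugs are assumptions:

```python
# Fan out identical prompts to all models concurrently (step 3).
# Slugs are illustrative; generate_questions is the helper sketched earlier.
from concurrent.futures import ThreadPoolExecutor

MODEL_SLUGS = [
    "deepseek/deepseek-chat",        # DeepSeek V3.1 (slug assumed)
    "anthropic/claude-sonnet-4.5",   # Claude Sonnet 4.5 (slug assumed)
    # ... remaining six models
]

def invoke_all(persona_prompt: str, corpus: str) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=len(MODEL_SLUGS)) as pool:
        futures = {
            slug: pool.submit(generate_questions, slug, persona_prompt, corpus)
            for slug in MODEL_SLUGS
        }
        return {slug: f.result() for slug, f in futures.items()}
```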
Data use: Public Reddit content used exclusively for academic research.
No personally identifiable information collected beyond publicly visible usernames.
Output interpretation: AI-generated content represents computational outputs.
Rankings measure semantic distinctiveness, not research quality or validity.
Data Security
Transmission security: All data transfers use industry-standard encryption protocols.
Access control: Server access restricted to authorized researchers.
Retention policy: Results archived for longitudinal analysis of model performance evolution.
Reddit data retained for reproducibility verification.