OnlineJourno

The Agentic Newsroom Toolkit — Browse

Core Content

1.1 Entity Extraction & NER

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **spaCy** | github.com/explosion/spacy | 30K | Python | Industrial-grade NLP, NER with pretrained models (60+ languages) | Core module; plug into any pipeline |
| **Spark NLP** | github.com/JohnSnowLabs/spark-nlp | 4K | Python/Scala | Distributed NER at scale (14,500+ pretrained models) | Enterprise-grade, works with Spark |
| **Flair** | github.com/flairNLP/flair | 13K | Python | State-of-the-art NER with BERT embeddings | Can fine-tune for domain-specific entities |
| **DeepPavlov NER** | github.com/deeppavlov/ner | 2.5K | Python | Multilingual NER (Russian, English, etc.) | Modular, pre-trained CNN models |
| **Stanza** | github.com/stanfordnlp/stanza | 7K | Python | Stanford NLP pipeline (tokenization, POS, NER, dependency parsing) | Core NLP foundation for downstream tasks |
| **NLTK** | github.com/nltk/nltk | 13K | Python | Classic NLP toolkit, wrapper for Stanford NER | Educational, good for baseline extraction |
| **entity-fishing** | github.com/kermitt2/entity-fishing | 500+ | Java | Lightweight entity linker to Wikidata | Direct Wikidata disambiguation |

Recipe: `spaCy (extraction) → entity-fishing (disambiguation) → Wikidata linking`

1.2 Fact-Checking & Verification

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Loki (OpenFactVerification)** | github.com/Libr-AI/OpenFactVerification | 1K | Python | 5-step fact-check pipeline: claim decomposition → check-worthiness → query generation → evidence retrieval → verification | End-to-end; integrates LLMs + traditional NLP |
| **OpenFactCheck** | github.com/yuxiaw/OpenFactCheck | 500+ | Python | Unified framework for LLM factuality evaluation + fact-checker leaderboard | Modular; integrate multiple fact-checkers |
| **Veracity** | github.com/ (in development) | | Python | Open-source claim-focused fact-checking with web retrieval agents | Local-first; transparent reasoning |
| **Google Fact Check API** | developers.google.com/fact-check/tools/json-ld | | REST API | Query fact-checks from Google’s database (ClaimBuster, Snopes, PolitiFact integration) | Third-party integration layer |

Recipe: `Loki (pipeline orchestration) + entity-fishing (entity context) + Google Fact Check API (external validation)`
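
The external-validation step of this recipe can be sketched with the standard library alone. This is a minimal sketch, not Google's reference client: the endpoint, the `query`/`languageCode`/`pageSize`/`key` parameters, and the `claims` / `claimReview` response fields are written from memory of the Fact Check Tools API and should be confirmed against the current documentation. `SAMPLE_RESPONSE` is invented illustration data, and the code only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumption: claim-search endpoint of the Fact Check Tools API; verify before use.
FACT_CHECK_ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"


def build_claim_search_url(claim_text, api_key, language="en", page_size=5):
    """Return the GET URL for a claim search (the request itself is not sent)."""
    params = {
        "query": claim_text,
        "languageCode": language,
        "pageSize": page_size,
        "key": api_key,
    }
    return f"{FACT_CHECK_ENDPOINT}?{urlencode(params)}"


def summarize_reviews(response):
    """Flatten a claim-search response into (publisher, rating, url) tuples."""
    rows = []
    for claim in response.get("claims", []):
        for review in claim.get("claimReview", []):
            publisher = review.get("publisher", {}).get("name", "unknown")
            rows.append((publisher, review.get("textualRating", ""), review.get("url", "")))
    return rows


# Hypothetical response used only to illustrate the assumed shape.
SAMPLE_RESPONSE = {
    "claims": [
        {
            "text": "The earth is flat",
            "claimReview": [
                {
                    "publisher": {"name": "Example Fact Checker"},
                    "url": "https://example.org/fact-check/1",
                    "textualRating": "False",
                }
            ],
        }
    ]
}
```

In a pipeline, the claims produced by the decomposition step would each be passed to `build_claim_search_url`, and the flattened reviews attached to the claim as external context for an editor.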

1.3 Headline Generation & SEO Optimization

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **seomachine** | github.com/TheCraigHewitt/seomachine | 200+ | Claude Code | Claude Code skill for SEO-optimized blog content: keyword research, article writing, internal linking, performance review | AI-native; composes Claude agents |
| **BLEURT** | github.com/google-research/bleurt | 1K+ | Python | Text generation evaluation model (Google); scores headline quality | Ranking layer for headline variants |
| **TextRank** | github.com/summanlp/textrank | 1K+ | Python | Graph-based NLP; extracts keywords and summarizes (basis for headline ideation) | Pre-processing for headline candidates |
| **PyABSA** | github.com/yangheng95/PyABSA | 800+ | Python | Aspect-based sentiment analysis; understand entity sentiment for positioning | Context enrichment for headlines |
| **GPT-2 / GPT-J** | github.com/openai/gpt-2; github.com/kingoflolz/mesh-transformer-jax | 10K+/3K | Python | Open-source language models for headline generation | Fine-tunable; local execution |

Recipe: `TextRank (keyword extraction) → GPT-J (headline generation) → BLEURT (quality scoring) → SERP intent matching`
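
To make the first step of this recipe concrete, here is a small, dependency-free sketch of TextRank-style keyword extraction: build a word co-occurrence graph, then run a PageRank-like iteration over it. It is a simplified illustration, not the `summanlp/textrank` implementation; the stopword list, window size, damping factor, and iteration count are assumptions chosen for brevity.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real pipelines use a fuller one).
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was", "has", "have"}


def textrank_keywords(text, window=3, damping=0.85, iterations=30, top_n=5):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]

    # Undirected graph: words are linked when they appear within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every node in the graph has at least one neighbor, so the division is safe.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

The returned keywords would then seed the prompt for the headline generator, with the scoring model ranking the resulting variants.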

1.4 Structural Editing & Clarity

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Readability Metrics** | | | Python | Flesch-Kincaid, SMOG, Gunning Fog indices via textstat library | Text quality scoring |
| **EditTools** | | | Python | Hemingway Editor equivalent (open-source alternatives): detects passive voice, adverbs, complex sentences | Real-time feedback |
| **Grammarly NLP** | Proprietary | | API | Grammar & style checking; available as API | Third-party enhancement |

Recipe: `spaCy (sentence parsing) → Readability metrics (clarity scoring) → BERT (coherence detection)`
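
The clarity-scoring step can be illustrated without any dependencies. The sketch below computes Flesch Reading Ease and the Flesch-Kincaid grade level from their published formulas. The syllable counter is a rough vowel-group heuristic (an assumption for illustration), so scores will differ somewhat from what `textstat` reports.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return Flesch Reading Ease and Flesch-Kincaid grade for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

Higher reading-ease scores mean easier text; the grade level approximates the US school grade needed to follow it. A newsroom might flag drafts whose grade exceeds a house threshold rather than block them outright.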

Knowledge Infrastructure

2.1 Entity Recognition & Taxonomic Labeling

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **GENRE** | github.com/facebookresearch/GENRE | 600+ | Python | Multilingual entity linker to Wikidata (BART-based); ~8% improvement over baselines | Plug-and-play entity disambiguation |
| **BLINK** | github.com/facebookresearch/BLINK | 1K | Python | Fast entity linking (fine-tuned BERT); maps to Wikipedia → Wikidata | Speed-optimized variant |
| **entity-fishing (spaCy wrapper)** | github.com/kermitt2/entity-fishing | | Python | spaCy integration for entity disambiguation | Seamless NLP pipeline integration |
| **DBpedia Spotlight** | github.com/dbpedia-spotlight/dbpedia-spotlight | 800+ | Java/REST | Entity linking to DBpedia (older but stable) | REST API accessibility |

Recipe: `spaCy (NER) → GENRE (disambiguation) → OpenTapioca (Wikidata linking) → sameAs enrichment`

2.2 Schema Markup & Structured Data Generation

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **JSON-LD Schema Generators** | iloveschema.com, jsonld.com, incrementors.com | | Web UIs | Free generators for Article, NewsArticle, BlogPosting schemas | Manual/semi-automated |
| **python-jsonschema** | github.com/Julian/jsonschema | 4K | Python | JSON Schema validation; ensures schema compliance before publishing | Validation layer |
| **PyLD** | github.com/digitalbazaar/pyld | 500+ | Python | JSON-LD processor; flattening, expansion, compaction | JSON-LD manipulation |
| **Rich Results Test (Google; replaces the Structured Data Testing Tool)** | search.google.com/test/rich-results | | Web UI | Validates schema markup before publishing; provides rich result preview | Pre-publish validation |

Recipe: `Article metadata → schema-org-python (construction) → PyLD (normalization) → Rich Results Test (validation)`
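
The construction step of this recipe amounts to building a dictionary and serializing it. The sketch below uses only the standard library and a handful of common schema.org `NewsArticle` properties; the function names and the choice of fields are assumptions, and which properties a given rich result needs should be checked against Google's structured-data guidelines.

```python
import json
from datetime import datetime, timezone


def build_news_article_jsonld(headline, url, author_name, published, publisher_name, image_url=None):
    """Assemble a minimal schema.org NewsArticle object as a plain dict."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "mainEntityOfPage": {"@type": "WebPage", "@id": url},
        "datePublished": published.isoformat(),
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
    }
    if image_url:
        data["image"] = [image_url]
    return data


def to_script_tag(data):
    """Serialize the dict into a JSON-LD <script> block for the page <head>."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so article text cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


example = build_news_article_jsonld(
    headline="Council passes budget",
    url="https://news.example.com/council-budget",
    author_name="A. Reporter",
    published=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    publisher_name="Example News",
)
```

The resulting dict can be handed to a JSON-LD processor for normalization, and the rendered tag pasted into a validator before publishing.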

2.3 Knowledge Graph & Wikidata Integration

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **FrOG (Framework of Open GraphRAG)** | github.com/Framework-of-Open-GraphRAG/FROG | 200+ | Python | GraphRAG system; entity linking + SPARQL query generation + answer generation | End-to-end knowledge graph RAG |
| **RDFLib** | github.com/RDFLib/rdflib | 2K | Python | RDF/SPARQL query builder; serialize to Turtle, N3, JSON-LD | RDF manipulation foundation |
| **pywikibot** | github.com/wikimedia/pywikibot | 1K | Python | Python library for Wikidata + Wikipedia bot programming | Wikidata write/sync operations |
| **MediaWiki API** | www.mediawiki.org/wiki/API | | REST | Query Wikidata directly for entity resolution | Direct Wikidata source |
| **SKOS (Simple Knowledge Organization System)** | github.com/RDFLib/rdflib | | Python | Thesaurus/taxonomy representation in RDF | Taxonomy formalization |
| **Wikibase (Wikimedia platform)** | github.com/wikimedia/mediawiki-extensions-Wikibase | 500+ | PHP | Deploy your own Wikidata-like instance | Self-hosted knowledge graph infrastructure |
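
As an illustration of querying Wikidata for entity resolution, the sketch below builds a SPARQL query that looks up items by exact label and encodes it into a request URL for the public Wikidata Query Service. The helper names are hypothetical, the code does not send the request, and a real client should set a descriptive User-Agent and respect the service's rate limits.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def label_lookup_query(label, language="en", limit=5):
    """Build a SPARQL query that finds items whose label matches exactly."""
    # Escape backslashes and double quotes so the label stays inside the literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemDescription WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@{language} .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "' + language + '" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query):
    """Encode the query into a GET URL that asks for JSON results."""
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
```

Exact-label matching is deliberately naive; it returns every item sharing the label, so the descriptions are fetched to help a disambiguation step or an editor choose among candidates.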

2.4 Archive Ingest & Content Versioning

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Git + GitLFS** | github.com/git-lfs/git-lfs | 10K | Go | Version control for large media assets (images, videos) | Version control + content deduplication |
| **MediaWiki** | github.com/wikimedia/mediawiki | 500+ | PHP | Foundation for archive systems; used by Wikipedia, archival projects | Archive query/retrieval infrastructure |
| **Hydra (Samvera)** | github.com/samvera/hyrax | 300+ | Ruby | Digital repository software; handles ingestion, discovery, preservation | Institutional repository framework |
| **Fedora Repository** | github.com/fcrepo/fcrepo | 300+ | Java | Flexible extensible digital object repository | Academic digital asset management |
| **DSpace** | github.com/DSpace/DSpace | 400+ | Java | Open-source institutional repository software | Established archive platform |

Recipe: `OCR pipeline (Tesseract) → Entity extraction → Metadata generation (schema) → DSpace/Hydra ingest → Full-text search indexing`

Distribution & Amplification

3.1 Social Media Optimization & Multi-Channel Distribution

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Twython / Tweepy** | github.com/tweepy/tweepy | 10K | Python | Twitter API client; automate social posting and engagement tracking | Social distribution backbone |
| **python-telegram-bot** | github.com/python-telegram-bot/python-telegram-bot | 25K | Python | Telegram bot framework; reach readers on messaging platforms | Alternative channel distribution |
| **mastodon.py** | github.com/halcy/Mastodon.py | 600+ | Python | Mastodon API client; federated social network posting | Open-source social integration |
| **WordPress.com Publish Tools** | developer.wordpress.com/docs/ | | REST API | Syndicate to WordPress multisite network | WordPress ecosystem integration |
| **Matrix Client Library** | github.com/matrix-org/matrix-python-sdk | 400+ | Python | Decentralized messaging; content distribution to Matrix rooms | Decentralized channel support |

Recipe: a multi-channel agent that generates platform-specific copy for each channel listed above.
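
One way to picture the platform-specific part of this recipe is a formatter that fits the same story into each channel's length budget. The sketch below is an assumption-heavy illustration: the limits are commonly cited defaults (Mastodon instances can configure their own, and Twitter/X counts links at a fixed length, which is ignored here), and `Story`, `compose_post`, and `truncate` are hypothetical names. Posting itself would go through the client libraries in the table.

```python
from dataclasses import dataclass

# Assumed default per-post character limits; confirm for your accounts and instances.
CHANNEL_LIMITS = {"twitter": 280, "mastodon": 500, "telegram": 4096}


@dataclass
class Story:
    headline: str
    summary: str
    url: str
    tags: tuple = ()


def truncate(text, limit):
    """Cut text to `limit` characters, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 3].rstrip() + "..."


def compose_post(story, channel):
    """Build channel-specific copy that always keeps the link and hashtags."""
    limit = CHANNEL_LIMITS[channel]
    hashtags = " ".join(f"#{tag}" for tag in story.tags)
    suffix = f"\n{story.url}" + (f"\n{hashtags}" if hashtags else "")

    # Short channels get the headline only; longer ones also get the summary.
    body = story.headline if channel == "twitter" else f"{story.headline}\n\n{story.summary}"

    room = limit - len(suffix)
    if room < 4:
        raise ValueError(f"link and hashtags leave no room for text on {channel}")
    return truncate(body, room) + suffix
```

In an agentic setup, an LLM would draft the body for each channel and a deterministic step like `compose_post` would enforce the hard limits before anything is published.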

3.2 Audience Segmentation & Analytics

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Plausible Analytics** | github.com/plausible/analytics | 9K | Elixir | Privacy-first web analytics (self-hosted alternative to Google Analytics) | Privacy-compliant analytics |
| **Matomo** | github.com/matomo-org/matomo | 18K | PHP | Open-source web analytics platform; audience segmentation, behavioral tracking | Full-featured analytics suite |
| **Segment** | Proprietary | | REST API | Customer data platform; unify analytics from multiple sources | Data warehouse connector |
| **RudderStack** | github.com/rudderlabs/rudder-server | 5K | Go | Open-source CDP alternative; route analytics to multiple destinations | CDP infrastructure |
| **PostHog (Mixpanel / Amplitude alternative)** | github.com/PostHog/posthog | 12K | Python/JavaScript | Open-source product analytics; custom event tracking with Python | Event-driven analytics |

Recipe: `Reader behavior tracking → Audience segmentation (clustering) → Engagement prediction → Churn detection → Retention campaigns`

Technical SEO

4.1 SEO Crawling & Audit

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Crawl4AI** | github.com/unclecode/crawl4ai | 12K+ | Python | #1 trending LLM-friendly web crawler; Markdown extraction, link analysis, Core Web Vitals | AI-native crawling foundation |
| **LibreCrawl** | github.com/PhialsBasement/LibreCrawl | 1K | Python/Flask | Free desktop SEO crawler alternative to Screaming Frog; multi-tenant, plugin architecture | Enterprise-grade; fully customizable |
| **Greenflare** | github.com/beb7/gflare-tk | 800+ | Python | Lightweight cross-platform SEO crawler; on-page analysis, robots.txt parsing, status code reporting | Lightweight, scalable to 4M+ URLs |
| **SearXNG (metasearch)** | github.com/searxng/searxng | 8K | Python | Privacy-respecting metasearch engine; crawl SERP results | Search intelligence layer |

Recipe: `LibreCrawl (site audit) → Crawl4AI (content extraction) → spaCy (on-page entity extraction) → Schema validation`
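
The kind of on-page checks a site audit performs can be sketched with the standard library's HTML parser. This is an illustrative stand-in, not how LibreCrawl or Crawl4AI work internally; the `PageAudit` class and the specific rules (one `<h1>`, a meta description, alt text on images) are assumptions about what a newsroom might check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageAudit(HTMLParser):
    """Collect the title, meta description, headings, links, and images of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta_description = None
        self.h1_count = 0
        self.links = []
        self.images_missing_alt = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content")
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and not attrs.get("alt"):
            self.images_missing_alt += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit(html, base_url):
    """Parse one page and return its links plus a list of on-page issues."""
    parser = PageAudit(base_url)
    parser.feed(html)
    parser.close()

    host = urlparse(base_url).netloc
    issues = []
    if not parser.title.strip():
        issues.append("missing <title>")
    if not parser.meta_description:
        issues.append("missing meta description")
    if parser.h1_count != 1:
        issues.append(f"expected one <h1>, found {parser.h1_count}")
    if parser.images_missing_alt:
        issues.append(f"{parser.images_missing_alt} image(s) missing alt text")

    return {
        "title": parser.title.strip(),
        "internal_links": [u for u in parser.links if urlparse(u).netloc == host],
        "external_links": [u for u in parser.links if urlparse(u).netloc != host],
        "issues": issues,
    }
```

The internal links feed the crawl frontier, while the extracted text would go on to the entity-extraction and schema-validation steps.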

4.2 Keyword Research & Gap Analysis

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **OpenSEO** | github.com/every-app/open-seo | 1K | TypeScript | Free SEO suite alternative to Semrush/Ahrefs; keyword research, position tracking, backlink analysis (uses DataForSEO API) | Workflow-focused; modular |
| **Keyword Extraction Tools** | (Python textacy, sklearn TfidfVectorizer) | | Python | Extract keywords via TF-IDF, YAKE, or statistical models | Lightweight extraction |
| **Google Search Console API** | developers.google.com/webmaster-tools/search-console-api | | REST | Fetch real keyword rankings and click data from GSC | First-party keyword data |
| **SEOTool** | | | Python | Competitor keyword gap analysis | Comparative research |

Recipe: `Google Search Console API (query data) → Keyword clustering → Gap analysis → Content roadmap generation`
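
The gap-analysis step reduces to set arithmetic once the query data is in hand. The sketch below assumes the Search Console rows have already been flattened into dicts with `query` and `position` keys (the raw API response is shaped differently) and that a competitor keyword list comes from some other source; `keyword_gap` and its `max_position` cutoff are hypothetical.

```python
def normalize(query):
    """Lowercase and collapse whitespace so near-identical queries match."""
    return " ".join(query.lower().split())


def keyword_gap(own_rows, competitor_keywords, max_position=10.0):
    """Split competitor keywords into ones we miss entirely and ones we rank weakly for."""
    # Keep the best (lowest) average position seen for each of our queries.
    best_position = {}
    for row in own_rows:
        query = normalize(row["query"])
        best_position[query] = min(row["position"], best_position.get(query, float("inf")))

    competitor = {normalize(keyword) for keyword in competitor_keywords}
    ours = set(best_position)

    return {
        "missing": sorted(competitor - ours),
        "weak": sorted(q for q in competitor & ours if best_position[q] > max_position),
    }
```

The `missing` list suggests new coverage, while `weak` points at existing stories worth updating; both can be clustered by topic before they become a content roadmap.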

4.3 Schema Validation & Core Web Vitals

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Google Rich Results Test** | search.google.com/test/rich-results | | Web UI | Validate schema markup; preview rich result appearance | Pre-publish validation |
| **Schema.org Validator** | validator.schema.org | | Web UI | Validate structured data compliance | Standards verification |
| **Lighthouse API (Google)** | github.com/GoogleChrome/lighthouse | 27K | Node.js | Automated website auditing; performance, accessibility, best practices, SEO, PWA scores | Scriptable performance audit |
| **WebPageTest** | github.com/WPO-Foundation/webpagetest | 7K | PHP | Open-source performance testing (self-hosted version) | Performance baseline establishment |
| **Core Web Vitals Monitoring (CrUX API)** | | | REST API | Google’s real-user monitoring data | Production performance tracking |

Recipe: `Lighthouse API (automated audit) → Schema.org validator (markup check) → CrUX API (production metrics tracking)`
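
For the production-tracking step, field metrics are usually judged at the 75th percentile against Google's published Core Web Vitals thresholds. The sketch below encodes those thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25) and rates a page. The metric key names are assumptions and do not match the CrUX API's own field names, so a mapping step would be needed after fetching real data.

```python
# (good upper bound, poor lower bound) per metric, applied to p75 values.
CWV_THRESHOLDS = {
    "lcp_ms": (2500, 4000),  # Largest Contentful Paint, milliseconds
    "inp_ms": (200, 500),    # Interaction to Next Paint, milliseconds
    "cls": (0.1, 0.25),      # Cumulative Layout Shift, unitless
}


def rate_metric(name, p75_value):
    """Classify one metric as good, needs improvement, or poor."""
    good_max, poor_min = CWV_THRESHOLDS[name]
    if p75_value <= good_max:
        return "good"
    if p75_value <= poor_min:
        return "needs improvement"
    return "poor"


def rate_page(p75_metrics):
    """Rate every metric; the page passes only if all of them are good."""
    ratings = {name: rate_metric(name, value) for name, value in p75_metrics.items()}
    return {"ratings": ratings, "passes": all(r == "good" for r in ratings.values())}
```

Running this on a schedule against each section front gives an early warning when a template change or ad script pushes a metric out of the good range.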

Editorial Workflow

5.1 Newsroom CMS & Editorial Management

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Superdesk** | github.com/superdesk/superdesk | 1K | Python/Angular | Purpose-built open-source newsroom CMS; editorial workflow, content planning, multi-channel distribution. Trusted by NTB, Canadian Press | Core newsroom backbone |
| **Ghost** | github.com/TryGhost/Ghost | 46K | Node.js/Handlebars | Headless CMS for publishers; clean writing environment, built-in membership/subscriptions, performance-optimized | Membership + SEO-friendly |
| **Decap CMS** | github.com/decaporg/decap-cms | 17K | React | Headless CMS for Git-based workflows; editorial workflow mode, content versioning via Git/GitHub | Git-native; low infrastructure overhead |
| **Drupal** | github.com/drupal/drupal | 4K | PHP | Flexible, extensible CMS; used for large media sites, deep customization capability | Enterprise-grade extensibility |
| **WordPress + Gutenberg** | github.com/WordPress/WordPress | 19K | PHP | Familiar CMS with block editor; massive plugin ecosystem for publishing | Lowest barrier to entry |

5.2 Live Blogging & Breaking News

5.3 Collaboration & Version Control

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Git** | github.com/git/git | 50K | C | Distributed version control; track all editorial changes, revert corrupted content, blame view for attribution | Version control foundation |
| **Fidus Writer** | github.com/fiduswriter/fiduswriter | 500+ | Python/Vue.js | Open collaborative writing platform with academic focus; versioning, commenting, export to multiple formats | Collaborative drafting |

Business Intelligence

6.1 Reader Analytics & Audience Behavior

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Matomo** | github.com/matomo-org/matomo | 18K | PHP | Full-featured analytics; audience segmentation, behavioral funnels, heatmaps, session recording | Analytics foundation |
| **Plausible** | github.com/plausible/analytics | 9K | Elixir | Privacy-first analytics (GDPR-compliant, no tracking ID); lightweight alternative to GA | Privacy-native |
| **Open Web Analytics** | github.com/Open-Web-Analytics/Open-Web-Analytics | 1K | PHP | Open-source web analytics similar to Google Analytics | GA alternative |
| **Segment / RudderStack** | github.com/rudderlabs/rudder-server | 5K | Go | Customer data platform; unify reader behavior from multiple sources | CDP backbone |

6.2 Paywall & Subscription Analytics

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Ghost Membership** | Built into Ghost CMS | | Node.js | Native membership + subscription management in Ghost CMS | Integrated membership layer |
| **Memberful (via WordPress)** | Proprietary | | | Membership plugin for WordPress (Patreon-owned but REST API available) | WordPress ecosystem |
| **Supabase** | github.com/supabase/supabase | 60K | TypeScript | Open-source Firebase alternative; real-time database, auth, edge functions for custom subscription logic | Custom subscription infrastructure |
| **Stripe API** | github.com/stripe/stripe-python | 300+ | Python | Payment processing; webhooks for subscription events, revenue tracking | Payment infrastructure |

Recipe: `Ghost Membership (subscription mgmt) → Stripe API (payment processing) → Matomo (funnel tracking) → Revenue attribution`
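
Subscription events reach the analytics layer through webhooks, and those should be authenticated before they count toward revenue. The sketch below re-implements Stripe's documented signing scheme (HMAC-SHA256 over `timestamp.payload`, compared against the `v1` value of the `Stripe-Signature` header) with the standard library, purely to show the mechanism. In production the safer choice is `stripe.Webhook.construct_event` from stripe-python; the function name and the 300-second tolerance here are assumptions.

```python
import hashlib
import hmac
import time


def verify_stripe_signature(payload, header, secret, tolerance=300, now=None):
    """Return True if `header` carries a valid, recent signature for `payload` (bytes)."""
    # Header format: "t=<unix timestamp>,v1=<hex signature>[,v1=<another>...]".
    pairs = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in pairs if k.strip() == "t"), None)
    candidates = [v for k, v in pairs if k.strip() == "v1"]
    if timestamp is None or not candidates:
        return False

    try:
        sent_at = int(timestamp)
    except ValueError:
        return False

    # Reject old events to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance:
        return False

    signed_payload = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()

    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

Only events that pass this check (for example subscription creations, cancellations, and paid invoices) would be forwarded to the funnel tracker for revenue attribution.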

6.3 Competitive Intelligence & Content Benchmarking

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Crawl4AI** | github.com/unclecode/crawl4ai | 12K | Python | Monitor competitor sites; extract headlines, publish timestamps, SERP tracking | Competitive crawling |
| **NewsGuard API / Fact-Check Aggregators** | | | REST | Integrate third-party fact-check data for context | Fact-check federation |
| **CCPA Compliance Tools** | | | | Privacy-respecting audience tracking for competitor analysis | Privacy-compliant intelligence |

Specialized Journalism Tools

7.1 Investigative Reporting & OSINT

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Bellingcat toolkit** | github.com/bellingcat/ | Various | Various | Collection of OSINT tools (image verification, timeline building, etc.) | Investigation support |
| **Fact-Checking Verification Handbook** | github.com/The-Osint-Toolbox/Fact-Checking-Verification | 500+ | Markdown | Curated collection of fact-checking and verification resources | Reference guide |

7.2 Misinformation & Disinformation Detection

| Tool | Repository | Stars | Lang | Use Case | Composability |
| --- | --- | --- | --- | --- | --- |
| **Mist (Misinformation Identification)** | | | Various | Academic benchmarks and models for misinformation identification | Research reference |

Projects & Agents

Applied newsroom AI projects and agents (distinct from the libraries above). The JournalismAI GitHub org is the live, searchable feed of fellowship cohort builds — start there.

| Project | Repository | What it is |
| --- | --- | --- |
| JournalismAI (org) | github.com/JournalismAI | The searchable home of JournalismAI Fellowship cohort projects; browse dozens of newsroom AI builds. |
| GPT Newspaper | github.com/rotemweiss57/gpt-newspaper | Autonomous multi-agent newspaper (Search · Curator · Writer · Critique · Designer · Editor agents). |
| ICIJ Datashare | github.com/ICIJ/datashare | Document-mining platform with ML detection; search millions of records locally. |
| Aleph | github.com/alephdata/aleph | OCCRP's cross-border investigative data platform: entities, leaks, follow-the-money. |
| Census Reporter | github.com/censusreporter/censusreporter | Turns census data into story-ready facts for reporters. |
| Dify | github.com/langgenius/dify | Production platform for agentic workflows; build newsroom assistants/pipelines. |
| Langflow | github.com/langflow-ai/langflow | Low-code drag-drop builder for AI agents + RAG; prototype newsroom agents fast. |