
Build vs Buy: LLM Adoption for Web Scraping in Finance

Tavis Lochhead, Co-Founder of Kadoa
28 October 2024

Over the past year, we've spoken with 100+ data leaders at top investment firms (hedge funds, asset managers, private equity, and investment banks) about their web scraping operations and how they’re navigating LLM adoption.

Here is what we’ve learned and our thoughts on the build vs. buy decision.

Why Use LLMs for Web Scraping?

AI’s long-standing promise to solve the major problems in web scraping is now being realized by the current generation of LLMs.

| Problems solved | Business Outcome |
| --- | --- |
| Automated web scraping code generation and maintenance | Cut scraper build time from days to minutes; limit data loss with self-healing scrapers; reduce number of engineers working full-time on scraper maintenance |
| Agentic web navigation | Source granular data from thousands of company websites; scale extraction from data hidden behind complex browser interactions |
| Unstructured data extraction (text blocks, PDFs, images, etc.) | Unlock analysis of 10M+ unstructured documents; 95%+ accuracy in PDF data extraction |
| Advanced data cleaning, mapping, and transformation | 80%+ reduction in manual data cleaning time; standardized outputs across hundreds of sources |

We know this first-hand based on what we’ve shipped to enterprise customers.
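
To make the "self-healing scrapers" outcome above concrete, here is a minimal sketch of the general idea, not a description of any vendor's implementation. It assumes extraction is driven by a regular expression and that a model can propose a replacement pattern when the page layout changes. The `fake_llm_regenerate` function, the HTML snippets, and the patterns are all hypothetical stand-ins; a real system would call an actual model and validate its output.

```python
import re
from typing import Callable, Optional, Tuple


def extract_first(html: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `html`, or None if no match."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


def self_healing_extract(
    html: str, pattern: str, regenerate: Callable[[str], str]
) -> Tuple[Optional[str], str]:
    """Try the current pattern; if it fails, ask `regenerate` for a new one and retry once.

    Returns the extracted value (or None) and the pattern to use next time.
    """
    value = extract_first(html, pattern)
    if value is not None:
        return value, pattern
    new_pattern = regenerate(html)
    return extract_first(html, new_pattern), new_pattern


def fake_llm_regenerate(html: str) -> str:
    """Hypothetical stand-in for an LLM call that proposes a new extraction pattern."""
    return r'data-price="([\d.]+)"'


# Hypothetical page before and after a layout change.
old_layout = '<span class="price">101.25</span>'
new_layout = '<div data-price="101.25"></div>'
pattern = r'<span class="price">([\d.]+)</span>'

# The original pattern still works on the old layout.
value, pattern = self_healing_extract(old_layout, pattern, fake_llm_regenerate)

# The layout changes, the old pattern fails, and the scraper swaps in a new one.
healed_value, healed_pattern = self_healing_extract(new_layout, pattern, fake_llm_regenerate)
```

In this sketch the scraper only pays for a model call when extraction breaks, which is why this pattern can reduce both data loss and manual maintenance.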

So, how are investment firms exploring these new capabilities?

Current LLM Implementations

Every top investment firm we spoke with has an in-house web scraping team and also purchases web-scraped data. Many are experimenting with LLMs, either for web scraping or elsewhere. Finance is hungry for an edge and ready to invest in new technology to get one; LLMs are no exception.

Examples of how firms are trialing LLMs (excluding Kadoa):

| Company | Implementation | Business Outcome |
| --- | --- | --- |
| Bank | High-volume, zero-context, high-accuracy PDF extraction | 95%+ accuracy or they lose money |
| Asset Manager | In-house GPTs (on-prem, trained on internal and external data) | Real-time access to company and market intelligence |
| Prop Firm | Extract data from unstructured reports and filings | Unlock deeper insights from public documents |
| Hedge Fund | In-house LLM-powered web scraping tool | Reduce # of engineers exclusively working on web scraping |

Examples of how firms are using Kadoa:

| Company | Implementation | Business Outcome |
| --- | --- | --- |
| Hedge Fund | Automate building and maintaining traditional web scrapers | Focus web scraping engineers on complex/critical scraping projects |
| Asset Manager | Empower analysts to build web data feeds independently | Enable analysts to bypass data teams to source custom web data, cutting data acquisition from days to minutes |
| Market Maker | Empower analysts to monitor strategic web pages in real time | Enable analysts to act immediately on market-moving updates |
| Trading Firm | Automate browser interactions and extract from unstructured reports (e.g., government, commodity) | Deeper, broader insight into public documents |
| Hedge Fund | Aggregate hundreds of web sources into unified data structures | Save on expensive data provider costs and customize results |
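
As a rough illustration of what "aggregating web sources into unified data structures" can involve, the sketch below maps records from two sources with different field names onto one shared schema. Everything here is hypothetical: the `StoreCount` schema, the source names, the field maps, and the sample records are invented for the example and do not describe any firm's actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreCount:
    """Hypothetical unified schema shared by all sources."""
    company: str
    as_of: str  # ISO date string
    stores: int


# Hypothetical per-source maps: source field name -> unified field name.
FIELD_MAPS = {
    "source_a": {"name": "company", "date": "as_of", "store_count": "stores"},
    "source_b": {"issuer": "company", "asof": "as_of", "locations": "stores"},
}


def normalize(source: str, record: dict) -> StoreCount:
    """Rename a source record's fields to the unified schema and coerce types."""
    mapping = FIELD_MAPS[source]
    unified = {target: record[field] for field, target in mapping.items()}
    # Sources may report counts as text with thousands separators, e.g. "1,204".
    unified["stores"] = int(str(unified["stores"]).replace(",", ""))
    return StoreCount(**unified)


rows = [
    normalize("source_a", {"name": "Acme", "date": "2024-10-01", "store_count": "1,204"}),
    normalize("source_b", {"issuer": "Globex", "asof": "2024-10-01", "locations": 87}),
]
```

With hundreds of sources, the field maps become the part that needs the most upkeep, which is where the mapping and transformation capabilities described earlier would apply.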

Build vs. Buy

Investment firms are obsessed with building in-house: it protects their proprietary methods, keeps them compliant with their privacy policies, and prevents their insights from being commoditized. But because LLM innovation is moving so quickly, they need to think strategically about what to build and what to buy.

Building in-house makes sense for firms with:

  • Ready access to AI talent
  • Highly custom requirements that a vendor cannot meet
  • A long-term conviction that building in-house will give them an edge

Buying from vendors is appealing when firms want to:

  • Rapidly adopt the latest technology
  • Address more generalized needs
  • Avoid reinventing the wheel without clear long-term benefits

Our Recommendation

Large investment firms have the resources to build anything they want. At the same time, LLMs are advancing so fast that firms building everything in-house risk falling behind. A hybrid approach looks the most advantageous at this point:

  • Find vendors that save time by removing bottlenecks in your web scraping operations, for example by:
    • Providing tools for analysts
    • Automating manual operations
    • Improving data quality
  • Work closely with emerging vendors, guiding their roadmap to fit your specific needs
  • Leverage LLMs for highly custom projects
  • Gradually build in-house expertise

Whatever you choose to do first, it’s best to start now and stay on top of this technological wave.

Looking to dive deeper? Let's discuss your firm's web scraping strategy and LLM opportunities. Contact us here.