Startup Talents

Research Crawling Engineer

Startup Talents  •  $150k/yr  •  London, GB (Onsite)  •  26 days ago

Job Description


The employer is a decentralized, Solana-based web-scraping network that allows users to monetize their unused internet bandwidth. By installing a browser extension, users securely share bandwidth to help AI companies crawl the web for public data, receiving Points (convertible to crypto tokens) as compensation.
They also operate a massive distributed crawler, giving them unique access to high-quality public web data at global scale.
They are hiring a Research Crawling Engineer (fully remote: USA/EU, 6-hour overlap with EST).
You will join a company at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.
As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.
This Role Involves:
- Operating at the boundary of scale and reliability
- Adapting to constantly changing web environments
- Balancing throughput, coverage, and data quality
- Owning end-to-end data acquisition pipelines

MISSIONS

  • Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)

  • Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites

  • Develop pipelines for cleaning, deduplication, filtering, and normalisation

  • Construct and maintain datasets for research and model training

  • Monitor crawl performance, coverage, and data quality; iterate quickly

  • Collaborate with research teams to align data collection with modeling needs

  • Optimize infrastructure for cost, latency, and reliability

Example projects you could work on:
- Build a distributed crawler for a continuously updated, high-quality web project
- Design a system to classify and filter billions of pages for pretraining
- Extract structured data from dynamic, JS-heavy sites at scale
- Improve deduplication and quality scoring across multimodal datasets


Requirements


  • Strong programming experience in one or more of: Go, Rust, Python, Java, or C++

  • Experience working for reputable companies

  • Experience building and maintaining large-scale web crawlers or large-scale data pipelines

  • Experience designing high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)

  • Experience handling anti-bot systems, rate limits, and dynamic/JS-heavy sites

  • Experience constructing and maintaining datasets for research and model training

  • Solid understanding of HTTP, networking, and browser behavior

  • Familiarity with distributed systems and parallel processing

  • Experience working with large datasets (TB–PB scale preferred)

  • Ability to debug unstable or adversarial environments

Preferred / Bonus:

  • Experience with NLP pipelines or dataset curation for ML

  • Familiarity with LLM pretraining data or retrieval systems

  • Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)

  • Knowledge of proxy systems, IP rotation, and large-scale request orchestration

  • Background in data quality evaluation or benchmarking

  • Experience running workloads on cloud or bare-metal infrastructure

Main Evaluation Criteria:

  • Ability to design systems that scale without degrading quality

  • Practical problem-solving under real-world constraints

  • Speed of iteration and ownership

  • Measurable improvements in data coverage, quality, or efficiency


Benefits


  • Contract: Permanent role (fully remote: USA, or a 6-hour overlap with EST).

  • Salary: $150k to $225k, based on experience and demonstrated ability to operate at scale, plus an equity package / tokens

Recruitment process:

  • Recruiter / HR Call

  • Technical Interview

  • CEO Interview

  • Final Interview

About Startup Talents

Startup Talents is a startup-centric recruitment boutique.

We help scale-up startups recruit efficiently, provide bespoke solutions, and drive readiness for the next critical milestone (investment, exit, or acquisition).

Our tailored, strategic direct-approach sourcing, combined with a thorough discovery of your unique environment, drives our obsession with delivering recruitment efficiency, retention, and ultimately confidence and visibility for every startup's growth, whether nationally or internationally.

Contact us: hello@startuptalents.tech

Industry: HR & Recruiting
Company Size: Unknown
Headquarters: Paris, FR
Year Founded: 2020