




Job Summary: We are seeking a Senior Data Scraping Analysis Specialist with Python experience to build intelligent crawling pipelines and perform large-scale data extraction within AWS ecosystems, connecting external data sources with internal systems and AI agents.

Key Highlights:

1. Building intelligent crawling pipelines on AWS
2. Mastery of classical and AI-driven scraping techniques
3. Collaboration with Data Science, AI, and Backend teams

Senior Data Engineer (Data Scraping)

We seek a Senior Data Scraping Analysis Specialist with strong Python expertise who wishes to advance their career by building high-performance intelligent crawling and large-scale data extraction pipelines deployed in AWS ecosystems.

CONTEXT AND RESPONSIBILITIES

The selected candidate will join the Functional Team with the critical mission of connecting external information sources to internal analytics systems and new cloud-based AI agents. The role involves designing and maintaining advanced scraping and crawling pipelines capable of operating at scale in AWS environments, ensuring resilience, traceability, observability, and compliance with security standards.

Proficiency in classical scraping techniques (Playwright, Selenium, BeautifulSoup) is essential, alongside emerging AI-driven solutions such as Firecrawl, Crawl4AI, or LLM agents capable of automating navigation and content extraction from dynamic and highly protected websites. The specialist must also process and transform large volumes of data within cloud-native architectures, integrating results into the organization’s analytical systems.

PROJECT AND TEAM

This project aims to fully automate external data acquisition and make it available in AWS to feed analytical platforms and Generative AI models. This includes developing intelligent crawlers, anti-bot strategies, proxy rotation, and structuring unstructured data into formats optimized for subsequent consumption.
The selected candidate will work closely with Data Scientists, AI Engineers, and Backend teams under the supervision of the Product Manager and in alignment with architectural guidelines defined for AWS environments. The ecosystem integrates services such as Lambda, ECS, S3, Step Functions, and distributed databases; thus, the ability to design cloud-native pipelines will be key to success in this role.

EXPERIENCE AND KNOWLEDGE

We seek a candidate with at least 4 years of experience in advanced scraping and data analysis, and deep specialization in Python applied to large-scale crawling and web automation. Experience building distributed scrapers on AWS and recent exposure to AI-driven scraping technologies will be especially valued.

**Required experience includes:**

* Core Scraping & Crawling:
    - Playwright, Selenium, BeautifulSoup, Requests / aiohttp
    - Firecrawl, Crawl4AI, Browserless, or LLM agents for intelligent crawling
    - Anti-bot strategies, proxy rotation, and browser fingerprinting
* Data Engineering Processing:
    - Python (Pandas, Polars, PySpark)
    - ETL/ELT pipelines, normalization and cleaning of large-scale data
    - Advanced parsing (HTML, JSON, XML, structured and unstructured documents)
* AWS Infrastructure (mandatory):
    - S3, Lambda, ECS/ECR, Step Functions
    - CloudWatch (crawler monitoring), IAM (permission segmentation)
    - SQS/SNS (orchestration and communication)
    - AWS Glue or EMR (desirable)
* Databases:
    - PostgreSQL, MySQL, MongoDB, or DynamoDB
    - Data integration and storage model design for high-volume scenarios

Additionally, the following experience or knowledge will be positively considered:

* Orchestration: Airflow, Prefect, or Dagster
* Serverless infrastructure and containers optimized for crawling
* Data integration with LLMs, RAG pipelines, or intelligent agents
* Data visualization or exploratory data analysis
* Design of highly concurrent distributed pipelines

HIRING AND LOCATION

This position is based in Madrid and governed by a full-time contract with long-term stability. Given the project’s criticality and the need for close collaboration with business and technical teams, the role requires on-site presence at the offices (operating under a hybrid model, typically 3 days on-site and 2 days remote).

Playwright, Selenium, BeautifulSoup, Firecrawl, Crawl4AI


