Senior Data Engineer (Data Scraping), Madrid
Negotiable Salary
Indeed
Full-time
Onsite
No experience limit
No degree limit
Puerta del Sol, 4, 2ºC, Centro, 28013 Madrid, Spain
Description

Senior Data Engineer (Data Scraping)

We are seeking a Senior Data Scraping Analysis Specialist with solid Python experience who wants to advance their professional career by building intelligent crawling pipelines and large-scale data extraction systems deployed on high-performance AWS ecosystems.

CONTEXT AND RESPONSIBILITIES

The selected candidate will join the Functional Team with the critical mission of connecting external information sources to internal analytics systems and new cloud-based AI agents. This role involves designing and maintaining advanced scraping and crawling pipelines capable of operating at scale within AWS environments, ensuring resilience, traceability, observability, and compliance with security standards. Mastery of classical scraping techniques (Playwright, Selenium, BeautifulSoup) is essential, alongside emerging AI-driven solutions such as Firecrawl, Crawl4AI, or LLM agents capable of automating navigation and content extraction from dynamic and highly protected websites. The specialist must also process and transform large volumes of data within cloud-native architectures and integrate the results into the organization's analytical systems.

PROJECT AND TEAM

The project aims to fully automate the acquisition of external data and make it available in AWS to feed analytical platforms and Generative AI models. This includes developing intelligent crawlers, anti-bot strategies, proxy rotation, and structuring unstructured data into formats optimized for subsequent consumption. The selected candidate will work closely with Data Scientists, AI Engineers, and Backend teams under the supervision of the Product Manager and in alignment with the architectural guidelines defined for AWS environments. The ecosystem integrates services such as Lambda, ECS, S3, Step Functions, and distributed databases, so the ability to design cloud-native pipelines will be key to success in this role.
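As a rough illustration of the extraction step such pipelines perform (this sketch is not part of the posting), the following uses only the Python standard library's `html.parser`; in practice the role's named tools (Playwright, BeautifulSoup) would replace this simplified parser:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags.

    A deliberately simplified stand-in for the BeautifulSoup/Playwright
    extraction step a production crawler would use.
    """

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the anchor currently open
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str):
    """Return all (href, anchor text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Example: parse a static snippet rather than performing a live fetch.
sample = '<p>See <a href="/jobs/1">Data Engineer</a> and <a href="/jobs/2">AI Engineer</a>.</p>'
print(extract_links(sample))
# → [('/jobs/1', 'Data Engineer'), ('/jobs/2', 'AI Engineer')]
```

A real pipeline would wrap this step with fetching (Requests/aiohttp or a headless browser), retry logic, and storage into S3, as the posting describes.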
EXPERIENCE AND KNOWLEDGE

We seek a candidate with at least four years of experience in advanced scraping and data analysis, and deep specialization in Python applied to massive crawling and web automation. Particular value will be placed on experience building distributed scrapers on AWS and recent exposure to AI-powered scraping technologies.

Required experience includes:

* Core Scraping & Crawling: Playwright, Selenium, BeautifulSoup, Requests / aiohttp
* Firecrawl, Crawl4AI, Browserless, or LLM agents for intelligent crawling
* Anti-bot strategies, proxy rotation, and browser fingerprinting
* Data Engineering & Processing: Python (Pandas, Polars, PySpark)
* ETL/ELT pipelines, normalization and cleaning of large-scale data
* Advanced parsing (HTML, JSON, XML, structured and unstructured documents)
* AWS Infrastructure (mandatory): S3, Lambda, ECS/ECR, Step Functions
* CloudWatch (crawler monitoring), IAM (permission segmentation)
* SQS/SNS (orchestration and communication)
* AWS Glue or EMR (desirable)
* Databases: PostgreSQL, MySQL, MongoDB, or DynamoDB
* Data integration and storage model design for high-volume scenarios

Additionally, the following experience or knowledge will be considered advantageous:

* Orchestration: Airflow, Prefect, or Dagster
* Serverless infrastructure and containerized environments optimized for crawling
* Data integration with LLMs, RAG pipelines, or intelligent agents
* Data visualization or exploratory data analysis
* Design of highly concurrent, distributed pipelines

CONTRACT AND LOCATION

This position is based in Madrid and governed by a full-time employment contract with long-term stability in mind. Given the criticality of the project and the need for close collaboration with business and technical teams, the role requires physical presence at the office, operating under a hybrid model of typically three days on-site and two days remote.

Skills: Playwright, Selenium, BeautifulSoup, Firecrawl, Crawl4AI
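To make the proxy-rotation requirement listed above concrete (an illustrative sketch only, with hypothetical proxy endpoints, not code from the posting), a minimal round-robin rotator with failure-based eviction could look like:

```python
import itertools


class ProxyRotator:
    """Round-robin proxy rotation with simple failure-based eviction.

    A minimal sketch of the anti-bot tooling the role describes; real
    deployments add health checks, geo-distributed pools, and browser
    fingerprint rotation on top of this.
    """

    def __init__(self, proxies, max_failures=3):
        self._pool = list(proxies)
        self._cycle = itertools.cycle(self._pool)
        self._failures = {p: 0 for p in self._pool}
        self._max_failures = max_failures

    def next_proxy(self):
        # Walk the cycle, skipping proxies over the failure budget.
        for _ in range(len(self._pool)):
            proxy = next(self._cycle)
            if self._failures[proxy] < self._max_failures:
                return proxy
        raise RuntimeError("all proxies exhausted")

    def report_failure(self, proxy):
        """Record a failed request so the proxy is eventually evicted."""
        self._failures[proxy] += 1


# Hypothetical proxy endpoints, for illustration only.
rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"], max_failures=1)
first = rotator.next_proxy()   # http://p1:8080
rotator.report_failure(first)  # p1 hits its budget and is evicted
print(rotator.next_proxy())    # → http://p2:8080
print(rotator.next_proxy())    # → http://p2:8080 (p1 stays evicted)
```

Each crawl worker would ask the rotator for a proxy per request and report failures, letting the pool self-heal around blocked endpoints.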

Source: Indeed
David Muñoz
Indeed · HR
