Design and develop streaming ingestion pipelines using Apache Spark Structured Streaming and Databricks Auto Loader to consume files from cloud storage or messages from Kafka, RabbitMQ, or Confluent Cloud and ingest them into Delta Lake, with support for schema evolution and exactly-once semantics.
Implement CDC and deduplication logic by capturing change events from source databases using Debezium, the built-in CDC features of SQL Server or Oracle, or other connectors, and apply watermarking and drop-duplicates strategies based on primary keys and event timestamps.
Scale ingestion through configuration by building a config-driven framework, for example with Airflow, Databricks Jobs, or Delta Live Tables, that iterates over metadata tables to deploy and update ingestion pipelines for hundreds of tables and sources without code duplication.
Implement monitoring, observability, and security by capturing streaming query metrics and publishing them to monitoring platforms such as Prometheus and Grafana, setting up dashboards for lag, files processed, and processing duration, and enforcing role-based access control, encryption, and data masking.
Participate in DevOps processes by using CI/CD pipelines, such as Jenkins or GitHub Actions, to automate job deployment, managing infrastructure with Terraform or similar tools, and following best practices for version control and code reviews.
This role requires 5–8 years of experience designing and building data pipelines using Apache Spark, Databricks, or equivalent big data frameworks, along with hands-on expertise in streaming and messaging systems such as Apache Kafka, Confluent Cloud, RabbitMQ, or Azure Event Hubs, including creating producers, consumers, and topics and integrating them into downstream processing.
Candidates should possess a deep understanding of relational databases and CDC, with proficiency in SQL Server, Oracle, or other RDBMSs and experience capturing change events using Debezium or native CDC tools; proficiency in programming languages such as Python, Scala, or Java; solid knowledge of SQL for data manipulation and transformation; cloud platform expertise, specifically with Azure or AWS services for data storage, compute, and orchestration; and knowledge of data lakehouse architectures, Delta Lake, partitioning strategies, and performance optimization. Additionally, familiarity with Git, CI/CD pipelines, and infrastructure-as-code is essential.

Virtusa is a global product and platform engineering services company that makes experiences better with technology. We help organizations grow faster, more profitably, and more sustainably by reimagining enterprises through domain-driven solutions. We combine strategy, design, and engineering, backed by unmatched expertise at the intersection of industry, business, and technology, to generate real-world business impact for clients.
Headquartered in Massachusetts with global delivery centers, Virtusa provides a broad range of services, solutions, and assets, including strategy and design, AI advisory and services, digital engineering, data and analytics, digital assurance, cloud and security, CX transformation, and managed services, across industries such as financial services, healthcare, communications, media, entertainment, travel, manufacturing, and technology.